Paper status: completed

Vision Foundation Models in Remote Sensing: A Survey

Published: 08/07/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper surveys vision foundation models in remote sensing, categorizing them by architectures, pre-training datasets, and methodologies. It highlights significant advancements and emerging trends while discussing challenges like data quality and computational resources, finding that self-supervised pre-training methods such as contrastive learning and masked autoencoders remarkably enhance model performance and robustness.

Abstract

Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing research has been significantly enhanced by the advent of foundation models-large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain. We categorize these models based on their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by those foundation models. Additionally, we discuss technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, remarkably enhance the performance and robustness of foundation models. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Vision Foundation Models in Remote Sensing: A Survey

1.2. Authors

  • Siqi Lu (Student Member, IEEE) - Department of Electrical and Computer Engineering, Vanderbilt University
  • Junlin Guo - Department of Electrical and Computer Engineering, Vanderbilt University
  • James R Zimmer-Dauphinee - Department of Anthropology, Vanderbilt University
  • Jordan M Nieusma - Vanderbilt University Spatial Analysis Research Laboratory
  • Xiao Wang - Oak Ridge National Laboratory, Data Science Institute, Vanderbilt University
  • Parker VanValkenburgh - Department of Anthropology, Brown University
  • Steven A Wernke - Department of Anthropology, Vanderbilt University, Spatial Analysis Research Laboratory, Vanderbilt Institute for Spatial Research
  • Yuankai Huo (Assistant Professor) - Department of Electrical and Computer Engineering, and Data Science Institute, Vanderbilt University

1.3. Journal/Conference

This paper is published as a preprint on arXiv. arXiv is a well-respected open-access archive for preprints of scientific papers in various disciplines, including computer science. While it is not a peer-reviewed journal or conference, publishing on arXiv allows for rapid dissemination of research and feedback from the scientific community before formal peer review and publication.

1.4. Publication Year

2024 (Published at UTC: 2024-08-06T22:39:34.000Z)

1.5. Abstract

Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing research has been significantly enhanced by the advent of foundation models—large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain. We categorize these models based on their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by those foundation models. Additionally, we discuss technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, remarkably enhance the performance and robustness of foundation models. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.

The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of remote sensing (RS), which involves acquiring information about objects or areas from a distance, traditionally relied heavily on manual interpretation and task-specific models. These conventional methods often required extensive labeled datasets and significant computational resources, making them labor-intensive and limited in scalability.

The core problem addressed by the rise of Artificial Intelligence (AI) and Deep Learning (DL) is the inefficiency and resource intensity of traditional remote sensing data processing. The paper highlights that AI technologies have profoundly transformed remote sensing, leading to a revolution in data collection, processing, and analysis.

The advent of foundation models (FMs)—large-scale, pre-trained AI models capable of performing a wide array of tasks—has significantly enhanced remote sensing research. These models offer unprecedented accuracy and efficiency, opening new avenues for applications across diverse domains.

This paper is motivated by the rapid surge in the development of modern foundation models in remote sensing, particularly between June 2021 and June 2024. This timeframe saw the emergence of vision transformers and advanced self-supervised learning (SSL) techniques. The paper aims to provide a comprehensive and up-to-date survey to synthesize these recent advancements, addressing the need for a structured overview of this evolving landscape for researchers and practitioners.

2.2. Main Contributions / Findings

The paper makes several primary contributions by providing a comprehensive survey of vision foundation models in the remote sensing domain:

  • Exhaustive Review of Current Models: It offers a detailed review of vision foundation models proposed in remote sensing, specifically focusing on models released between June 2021 and June 2024. This covers their background, methodologies, and specific applications.

  • Structured Categorization and Analysis: The models are categorized and analyzed based on their application in image analysis (e.g., image-level, region-level, pixel-level) and practical applications (e.g., environmental monitoring, agriculture, archaeology, urban planning, disaster management). For each model, the paper discusses its architecture, pre-training datasets, pre-training methods, and performance.

  • Discussion of Challenges and Future Directions: The survey identifies and discusses technical challenges, unresolved aspects, emerging trends, and future research directions for foundation models in remote sensing, emphasizing the need for high-quality data, computational resources, and improved model generalization.

  • Key Finding on Pre-training Methods: A significant finding is that pre-training methods, especially self-supervised learning (SSL) techniques like contrastive learning and masked autoencoders, remarkably enhance the performance and robustness of foundation models.

    These contributions aim to serve as a valuable resource, offering a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Remote Sensing (RS)

Remote sensing is the science of acquiring information about the Earth's surface without making physical contact. This is typically done using sensors mounted on satellites or airborne platforms. These technologies collect data over vast geographical areas, playing a vital role in diverse fields.

  • Data Acquisition: Modern remote sensing employs a variety of sensors:
    • Optical sensors: Capture visible and near-infrared light, providing detailed images of land cover and vegetation health.
    • Thermal sensors: Detect heat emitted or reflected from the Earth's surface, useful for monitoring volcanic activity, forest fires, and climate change.
    • Radar sensors: Can penetrate clouds and vegetation, providing crucial information in all-weather conditions for applications like soil moisture estimation and urban infrastructure mapping.
  • Applications: Remote sensing is used in:
    • Environmental monitoring: Tracking deforestation, air and water quality, climate change impacts.
    • Agriculture: Crop health monitoring, yield estimation, resource management.
    • Urban planning: Monitoring urban sprawl, infrastructure, land-use planning.
    • Disaster management: Assessing damage from natural disasters, aiding relief operations.
  • GIS Integration: Remote sensing data is often integrated with Geographic Information Systems (GIS), which provide a framework for capturing, storing, analyzing, and visualizing spatial and geographic data. This synergy creates detailed and dynamic maps for various applications.

3.1.2. Artificial Intelligence (AI) and Deep Learning (DL)

  • Artificial Intelligence (AI): A broad field of computer science that aims to create machines capable of performing tasks that typically require human intelligence. This includes learning, problem-solving, perception, and decision-making.
  • Deep Learning (DL): A subfield of machine learning (which is a subfield of AI) that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. DL models have achieved state-of-the-art performance in tasks like image recognition, natural language processing, and speech recognition.

3.1.3. Foundation Models (FMs)

Foundation models are a new paradigm in AI referring to large-scale, pre-trained AI models that serve as a robust starting point for a wide range of downstream tasks across various domains. They are trained on vast datasets, allowing them to capture complex patterns and features that can then be fine-tuned for specific applications with minimal additional training. Their major strength lies in their ability to generalize well to new, unseen tasks and data.

3.1.4. Self-Supervised Learning (SSL)

Self-supervised learning is a powerful machine learning paradigm where models learn representations from unlabeled data by creating and solving pretext tasks. Instead of relying on human-annotated labels, SSL generates "supervision" signals from the data itself. This is particularly valuable in remote sensing where vast amounts of unlabeled imagery are available, but manual labeling is expensive and time-consuming. SSL helps models learn generalizable representations that can be transferred to downstream tasks.

3.1.5. Convolutional Neural Networks (CNNs)

Convolutional Neural Networks are a fundamental architecture in deep learning, specifically designed to process data with a grid-like topology, such as images. They excel at extracting hierarchical spatial features through specialized layers called convolutional layers.

  • Convolutional Layers: These layers apply filters (small matrices of weights) across the input data, performing a mathematical operation called convolution. Each filter detects specific patterns (e.g., edges, textures, shapes) at different locations in the image. By stacking multiple convolutional layers, CNNs can learn increasingly complex and abstract features.
  • ResNet (Residual Neural Network): A specific type of CNN that addresses the vanishing gradient problem and degradation problem in very deep networks. ResNet introduces residual connections (also known as skip connections) that allow gradients to bypass one or more layers, ensuring that information can flow directly through the network. This enables the training of much deeper networks without a drop in performance. The core idea of a residual block in ResNet can be described by the following equation (a minimal code sketch follows this list): $ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{ W_i \}) + \mathbf{x} $ Where:
    • $\mathbf{y}$ is the output of the residual block.
    • $\mathcal{F}(\mathbf{x}, \{ W_i \})$ represents the residual mapping (typically a stack of two or three convolutional layers) that the network learns. It computes the change, or residual, that needs to be added to the input.
    • $\mathbf{x}$ is the input to the residual block.
    • $\{ W_i \}$ are the weights of the convolutional layers within the residual mapping $\mathcal{F}$. The $+\,\mathbf{x}$ term indicates the skip connection, where the input $\mathbf{x}$ is added directly to the output of the residual mapping. This allows the network to easily learn identity functions (i.e., $\mathcal{F}(\mathbf{x}) = 0$), which helps in training deeper architectures. ResNet models come in various depths (e.g., ResNet-50, ResNet-101) based on the number of layers.
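
To make the residual formulation concrete (as referenced above), here is a minimal PyTorch sketch of a basic residual block; it is an illustrative simplification (a single channel count and two 3x3 convolutions) rather than the exact bottleneck design of ResNet-50/101 or of any surveyed model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block implementing y = F(x, {W_i}) + x."""

    def __init__(self, channels: int):
        super().__init__()
        # The residual mapping F: two 3x3 convolutions with batch norm.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: add the input x to the learned residual F(x).
        return self.relu(self.f(x) + x)

# Example: a 64x64 feature map with 32 channels passes through with its shape unchanged.
block = ResidualBlock(channels=32)
out = block(torch.randn(1, 32, 64, 64))  # shape (1, 32, 64, 64)
```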

3.1.6. Transformers and Vision Transformers (ViTs)

  • Transformers: Originally developed for natural language processing (NLP), Transformers are a neural network architecture that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence. They are highly effective at modeling long-range dependencies within data, meaning they can understand relationships between elements that are far apart in a sequence.
  • Vision Transformers (ViTs): An adaptation of the Transformer architecture for computer vision (CV) tasks. Instead of processing images directly as a grid of pixels, ViTs divide an image into fixed-size patches. Each patch is then treated as a token (similar to a word in NLP) and linearly embedded. These patch embeddings are fed into a Transformer encoder, which uses self-attention to learn relationships between different image patches. The self-attention mechanism is central to Transformers and is computed as follows (a minimal code sketch follows this list): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $ Where:
    • $Q$ represents the query matrix.
    • $K$ represents the key matrix.
    • $V$ represents the value matrix.
    • $Q$, $K$, and $V$ are derived from the same input (or different inputs in cross-attention) by multiplying it with different learned weight matrices.
    • $K^{T}$ is the transpose of the key matrix.
    • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors. This scaling prevents the softmax function from having extremely small gradients.
    • softmax is an activation function that normalizes the scores, turning them into probabilities. This mechanism allows each token (image patch) to attend to all other tokens in the sequence, dynamically calculating their relevance to each other and capturing both local and global patterns within the image.
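
As referenced above, here is a minimal PyTorch sketch of the scaled dot-product attention formula, applied to a batch of patch tokens; real ViTs additionally use learned Q/K/V projections, multiple attention heads, and positional embeddings.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    # Similarity score between every query (patch) and every key (patch).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v                    # weighted sum of the value vectors

# Example: 196 patch tokens (a 14x14 grid) with 64-dimensional embeddings.
q = k = v = torch.randn(1, 196, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (1, 196, 64)
```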

3.2. Previous Works

The paper contextualizes its contributions by summarizing several influential review papers in AI for remote sensing:

  • Zhang et al. (2016) - "Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art" [121]: This foundational review introduced deep learning techniques to remote sensing, primarily focusing on Convolutional Neural Networks (CNNs) for tasks like image classification and object detection. It highlighted early AI integration's promise and challenges, setting the stage for future advancements.
  • Zhu et al. (2017) - "Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources" [129]: This review explored diverse AI applications, including hyperspectral analysis and synthetic aperture radar (SAR) interpretation. It provided extensive resources and captured the rapid adoption of deep learning in addressing complex RS challenges.
  • Wang et al. (2022) - "Self-Supervised Learning in Remote Sensing" [103]: This review focused on self-supervised learning (SSL) methods, emphasizing their ability to utilize large volumes of unlabeled data, reducing the reliance on costly labeled datasets while maintaining high performance. It identified key SSL challenges and future directions.
  • Zhang et al. (2022) - "Artificial Intelligence for Remote Sensing Data Analysis: A Review of Challenges and Opportunities" [120]: This comprehensive overview synthesized findings from over 270 studies, focusing on AI algorithms for remote sensing data analysis. It highlighted ongoing challenges such as explainability, security, and integration with other computational techniques.
  • Aleissaee et al. (2023) - "Transformers in Remote Sensing" [3]: This survey explored the impact of transformer-based models across various RS tasks, comparing them with CNNs. It identified strengths, limitations, and unresolved challenges for transformers in RS.
  • Li et al. (2024) - "Vision-Language Models in Remote Sensing" [60]: This review examined the growing significance of vision-language models (VLMs), which combine visual and textual data. It highlighted their potential in applications like image captioning and visual question answering, emphasizing a shift toward richer semantic understanding.
  • Zhu et al. (2024) - "On the Foundations of Earth and Climate Foundation Models" [130]: This recent work provided a comprehensive review of existing foundation models, proposing features like geolocation embedding and multisensory capability for future Earth and climate models.

3.3. Technological Evolution

The technological evolution in remote sensing has progressed through several distinct phases:

  1. Early Analog Techniques (Mid-20th Century): Initially, remote sensing primarily involved analog photographic techniques via aerial and satellite platforms. These methods provided limited spectral and spatial resolution.

  2. Early Earth Observation Satellites (1960s onwards): The launch of programs like Landsat (commenced in 1967) marked a significant advancement, enabling consistent and wide-ranging data collection for environmental monitoring. This era saw the rise of manual interpretation and traditional image processing techniques.

  3. Task-Specific Models and Machine Learning (Pre-2010s): Projects began to rely on task-specific models that required extensive labeled datasets. Early machine learning algorithms were applied but often faced limitations in handling the complexity and scale of remote sensing data.

  4. Deep Learning Era (Post-2010s): The advent of AI and deep learning (particularly CNNs like ResNet) revolutionized image recognition and classification. These models could learn hierarchical features, improving performance but still often requiring substantial labeled data. Early representation learning models like Tile2Vec (2018) laid groundwork but were limited in scale and generalization.

  5. Self-Supervised Learning and Transformers (Post-2020s): The development of self-supervised learning (SSL) techniques, which enable models to learn from unlabeled data, and Transformer architectures (especially Vision Transformers), which excel at modeling long-range dependencies, marked a new era. These innovations allowed for the creation of larger, more powerful, and generalizable models.

  6. Foundation Models (Current Era, Post-June 2021): The combination of SSL, Transformer architectures, and massive datasets led to the emergence of foundation models. These large-scale, pre-trained models can perform a wide array of tasks with unprecedented accuracy and efficiency, often requiring minimal fine-tuning for new remote sensing applications.

    This paper's work fits squarely into the current Foundation Models era, focusing on advancements from June 2021 to June 2024, a period of significant growth for modern foundation models leveraging vision transformers and advanced self-supervised learning.

3.4. Differentiation Analysis

Compared to the main methods and reviews in related work, this paper offers several core differentiations and innovations:

  • Focus on Recent Developments (June 2021 - June 2024): Unlike previous reviews that might cover earlier deep learning advancements or foundational SSL and Transformer concepts, this survey specifically targets the most recent wave of foundation models. This timeframe marks a critical period of rapid development and maturation for FMs in remote sensing.

  • Comprehensive Integration of SSL and Transformer-based Architectures: While earlier reviews might have focused on SSL or Transformers individually, this paper explores their combined potential within foundation models. It systematically examines how these advanced techniques are integrated to address remote sensing tasks like semantic segmentation, multi-spectral analysis, and change detection. For example, it highlights SatMAE's effective use of SSL for transformer pre-training for improved segmentation in multi-spectral imagery, and Scale-MAE's application of scale-aware masked autoencoders for handling varied spatial resolutions.

  • Emphasis on Practical Applications and Addressing Persistent Challenges: The survey goes beyond theoretical advancements to emphasize the practical applications of recent foundation models. It illustrates how these new models address persistent challenges such as domain adaptation and computational efficiency. Examples include DINO-MC's integration of global-local view alignment for SSL to detect changes in high-resolution satellite imagery, and ORBIT's real-world applications in environmental monitoring and disaster response. The discussion also covers efficient self-attention mechanisms in models like Scale-MAE to reduce computation costs and enhanced geolocation embeddings in SatMAE for better geospatial feature extraction.

  • Structured Categorization by Perception Levels: The paper introduces a structured categorization based on perception levels (image-level, region-level, pixel-level), which helps clarify how different foundation models are tested for general image-based challenges or specialized applications. This organization enhances utility for researchers in identifying suitable models for specific needs.

    In essence, this survey provides a more current, integrated, and application-focused perspective on vision foundation models in remote sensing, distinguishing itself from prior works by its specific temporal scope and its detailed analysis of the interplay between advanced pre-training techniques and transformer architectures for real-world geospatial analysis.

4. Methodology

The development of foundation models (FMs) for remote sensing hinges on robust pre-training methods that enable models to learn transferable and generalized representations from large-scale datasets. This section delves into the core methodologies, including pre-training strategies and image analysis techniques, employed by these FMs.

4.1. Principles

The core idea behind foundation models is to leverage extensive datasets and advanced architectures during a pre-training phase to capture complex patterns and features. This allows the models to learn domain-agnostic features that can then be fine-tuned for various downstream tasks with minimal additional training. In remote sensing, where data diversity and complexity are high (e.g., multispectral, multi-temporal imagery), this transfer learning capability is particularly valuable. The models utilize techniques such as self-supervised learning (SSL) and Transformers to enhance performance and efficiency across tasks like image classification, object detection, and change detection.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Pretraining Methods

Pretraining is a critical step that allows FMs to learn effective representations. Two main categories are explored: self-supervised learning and supervised pretraining.

4.2.1.1. Self-Supervised Learning (SSL)

Self-supervised learning is a cornerstone for pre-training FMs, as it allows models to learn powerful representations from vast amounts of unlabeled data, which is abundant in remote sensing. The general pipeline of SSL involves acquiring diverse datasets, applying pretext tasks, and then using the learned representations for knowledge transfer to downstream tasks.

The following figure (Figure 3 from the original paper) illustrates the general pipeline of self-supervised learning.

The figure is a schematic of the foundation model training process in remote sensing, covering diverse datasets, pre-trained models, knowledge transfer, and downstream tasks. The highlighted pretext tasks include MAE reconstruction and contrastive tasks, with the end goal of fine-tuning the target model for tasks such as object detection and image segmentation.

1. Predictive Coding

Predictive coding is an SSL method that adopts a generative approach. The model learns representations by predicting missing or occluded parts of an input based on its visible portions. This is crucial for remote sensing imagery, which often contains diverse textures, complex scenes, and varying resolutions, and may have gaps due to sensor limitations or occlusions (e.g., clouds). A minimal code sketch of this masking-and-reconstruction idea appears after the list below.

  • Mechanism: The model is trained to reconstruct the original input from a corrupted version (e.g., masked input). By learning to "fill in the blanks," it captures spatial and contextual relationships.
  • Implementations: Popular frameworks include:
    • Autoencoder-based architectures: Models that encode input into a lower-dimensional representation and then decode it back to the original input.
    • Masked Image Modeling (MIM): Techniques like MAE (Masked Autoencoders) [34] are prime examples. In MAE, portions of the input image are randomly masked, and the model is trained to reconstruct the original pixel values of the masked patches. This forces the model to learn rich, context-aware representations.
    • Autoregressive models: Models that predict future elements in a sequence based on past elements, adaptable for spatial prediction in images.
  • Relevance to Remote Sensing: Effective for gap filling in satellite imagery and learning fine-grained details critical for high-resolution imagery.
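
As referenced above, the following is a minimal sketch of MAE-style masked image modeling, assuming patches have already been flattened into tokens; the encoder/decoder are stand-ins, and the 75% masking ratio follows the common MAE default rather than any specific surveyed model.

```python
import torch
import torch.nn.functional as F

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Split patch tokens into visible and masked sets, MAE-style.

    patches: (batch, num_patches, dim) tensor of flattened image patches.
    Returns the visible patches (encoder input) plus kept/masked indices.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    order = torch.rand(b, n).argsort(dim=1)         # random permutation per sample
    keep_idx, mask_idx = order[:, :n_keep], order[:, n_keep:]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx, mask_idx

# Pretext objective: reconstruct the hidden patches and score only those patches.
patches = torch.randn(2, 196, 768)                  # e.g. a 14x14 grid of 16x16x3 patches
visible, keep_idx, mask_idx = random_masking(patches)
reconstruction = torch.randn_like(patches)          # stand-in for the encoder/decoder output
dim = patches.size(-1)
masked_pred = torch.gather(reconstruction, 1, mask_idx.unsqueeze(-1).expand(-1, -1, dim))
masked_true = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, dim))
loss = F.mse_loss(masked_pred, masked_true)         # MSE on the masked patches only
```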

2. Contrastive Learning

Contrastive learning is another powerful SSL technique that focuses on learning discriminative and invariant features by distinguishing between similar and dissimilar samples. A minimal loss sketch appears after the list below.

  • Mechanism: The core idea is to:
    • Bring representations of similar (positive) samples closer together in the embedding space.
    • Push representations of dissimilar (negative) samples farther apart. This is typically achieved by applying various data augmentations (e.g., random cropping, rotations, spectral band dropping) to an original image to create positive pairs (different augmented views of the same image). Other images in a batch (or from a memory bank) serve as negative samples.
  • Frameworks: Examples include SimCLR [13], MoCo [35], DINO [9], and BYOL [29].
  • Relevance to Remote Sensing: Helps models capture spectral signatures across varying conditions (e.g., multispectral/hyperspectral imagery), improving performance in tasks like crop classification or land cover mapping. It is also useful when labeled datasets are highly imbalanced, as it allows learning from underrepresented classes without explicit labels.
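
As referenced above, here is a minimal sketch of an NT-Xent-style contrastive loss of the kind used by SimCLR-like frameworks; the temperature value, embedding dimension, and the way the two augmented views are produced are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """Contrastive (NT-Xent) loss over two augmented views of the same batch.

    z1, z2: (batch, dim) embeddings of two augmented views of the same images.
    Positive pairs are (z1[i], z2[i]); every other sample acts as a negative.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                  # (2B, dim)
    sim = z @ z.t() / temperature                   # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))               # a sample is not its own positive
    batch = z1.size(0)
    # The positive for row i is its other augmented view, offset by `batch`.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Example: embeddings of two spectral/spatial augmentations of 8 image tiles.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```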

3. Other SSL Methods

The paper notes that other innovative SSL methods, such as teacher-student self-distillation frameworks, also show potential. For instance, CMID [70] combines contrastive learning and masked image modeling within a teacher-student framework to capture both global and local features, enhancing its effectiveness for diverse remote sensing tasks.

4.2.1.2. Supervised Pretraining

Supervised pretraining is a traditional deep learning approach where models are trained using labeled datasets to minimize prediction errors for specific tasks (e.g., image classification).

  • Mechanism: Models learn direct mappings between input features and target labels, developing detailed, task-specific representations.
  • Examples: Models like ResNet [36] and VGGNet [81] trained on large-scale datasets such as ImageNet [18] are prime examples. These models learn robust feature hierarchies that are highly transferable to related tasks like semantic segmentation or object detection (see the fine-tuning sketch after this list).
  • Limitations:
    • Dependency on Labeled Data: Requires large-scale, high-quality labeled datasets, which are expensive and time-consuming to create, especially for multispectral or hyperspectral data in remote sensing.
    • Domain Specificity: Labeled data in remote sensing is often domain-specific, limiting the generalizability of models trained on one dataset to other applications or regions. These limitations highlight why SSL has gained prominence, leveraging abundant unlabeled data to learn general-purpose representations.
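
The following is a minimal sketch of this supervised-pretraining-then-fine-tuning pattern (referenced in the examples above), assuming torchvision's ImageNet-pretrained ResNet-50 and a hypothetical 10-class land-cover task; the surveyed models use their own pretrained backbones and fine-tuning protocols.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone (supervised pretraining).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the ImageNet classification head with one for a downstream remote
# sensing task, e.g. a hypothetical 10-class land-cover classification problem.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Freeze early layers and fine-tune only the last stage and the new head.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.AdamW(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
logits = backbone(torch.randn(2, 3, 224, 224))  # (2, 10) class scores
```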

4.2.2. Image Analysis at Different Levels

Foundation models in remote sensing enable image analysis at three primary levels, each addressing different spatial, contextual, and application-specific needs.

4.2.2.1. Image-Level Analysis

  • Focus: Categorizing entire images or large image segments into predefined classes.
  • Tasks: Scene classification, land use mapping, land cover classification, resource management.
  • Outputs: Broad, high-level insights into geographic regions, supporting large-scale environmental management and policy planning.

4.2.2.2. Region-Level Analysis

  • Focus: Identifying and localizing specific objects within an image.
  • Tasks: Object detection (e.g., buildings, vehicles, ships, infrastructure).
  • Outputs: Individual entities and their spatial locations, critical for urban planning, disaster response, and security.

4.2.2.3. Pixel-Level Analysis

  • Focus: Assigning a label to every pixel within an image, offering the most granular perception.
  • Tasks: Semantic segmentation (classifying each pixel into categories like vegetation, water, buildings) and change detection (identifying temporal differences between images).
  • Outputs: Highly detailed maps, indispensable for precision agriculture, deforestation tracking, and disaster management.

4.2.3. Backbone Architectures

The underlying architecture of foundation models determines their ability to process and understand remote sensing imagery.

4.2.3.1. Convolutional Neural Networks (CNNs)

CNNs are foundational for extracting hierarchical spatial features.

  • Mechanism: Convolutional layers apply filters to input data, detecting patterns at different levels of abstraction.
  • ResNet (Residual Neural Network): A common CNN backbone. It addresses the degradation problem in deep networks by introducing residual connections. The residual block in ResNet is described by the equation: $ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{ W_i \}) + \mathbf{x} $ Where:
    • $\mathbf{y}$ is the output of the residual block.
    • $\mathcal{F}(\mathbf{x}, \{ W_i \})$ represents the residual mapping to be learned, typically consisting of convolutional layers.
    • $\mathbf{x}$ is the input to the residual block.
    • $\{ W_i \}$ are the weights of the layers within the residual mapping. This skip connection ($+\,\mathbf{x}$) allows gradients to flow more easily through very deep networks, enabling the training of models capable of capturing intricate details in satellite images. ResNet variants (e.g., ResNet-50, ResNet-101) are widely used for tasks like image classification, object detection, and change detection in remote sensing.

4.2.3.2. Transformers and Vision Transformers (ViTs)

Transformers, adapted as Vision Transformers (ViT) for computer vision, model long-range dependencies effectively.

The following figure (Figure 4 from the original paper) illustrates the Vision Transformer architecture.

Fig. 4: The Vision Transformer architecture. The schematic shows an RGB image being split into patches that are linearly projected (flattened patch embeddings), processed by a Transformer encoder, and passed to a segmentation decoder to produce the final segmentation result.

  • Mechanism: ViTs treat images as sequences of patches, capturing both global and local patterns (a minimal patch-embedding sketch follows this list). This is achieved through the self-attention mechanism: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $ Where:
    • $Q$ (query), $K$ (key), and $V$ (value) are input matrices derived from the patch embeddings.
    • $d_k$ is the dimension of the key vectors. This mechanism allows the model to weigh the importance of different image patches to each other, making ViTs particularly effective for semantic segmentation and change detection tasks, where understanding relationships across large spatial extents of high-resolution satellite imagery is crucial.
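
As referenced above, a minimal sketch of ViT patch embedding shows how an image is turned into a sequence of patch tokens before self-attention is applied; the band count (13, Sentinel-2-like), patch size, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each to a token."""

    def __init__(self, in_bands: int = 3, patch_size: int = 16, dim: int = 768):
        super().__init__()
        # A strided convolution is equivalent to cutting non-overlapping patches
        # and applying one shared linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_bands, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(images)                  # (B, dim, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

# Example: a 224x224 image with 13 spectral bands becomes 196 patch tokens.
embed = PatchEmbedding(in_bands=13)
tokens = embed(torch.randn(1, 13, 224, 224))        # shape (1, 196, 768)
```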

5. Experimental Setup

5.1. Datasets

Datasets are fundamental for training and evaluating remote sensing models. The paper highlights the diversity and importance of large-scale, multimodal datasets for foundation models. The following figure (Figure 2 from the original paper) showcases some examples of data types used in foundation models and their downstream tasks.

The figure is a schematic showing the relationships between different types of remote sensing data (panchromatic, true color, synthetic aperture radar, hyperspectral, and multispectral) and downstream tasks (segmentation, object detection, classification, and change detection), illustrating the variety of remote sensing image types and their application areas for foundation models.

The paper discusses a range of datasets, varying in size, resolution, sensor types, and geographic coverage:

  • Size: Datasets range from hundreds of thousands (e.g., RSD46-WHU [62], [116] with 117,000 images) to over a million samples (e.g., MillionAID [63], [64] with over 1 million images, SSL4EO-L [83] with over 5 million images). Larger datasets generally improve model generalization.

  • Resolution: Resolutions vary from high (sub-meter, for detailed spatial analysis) to moderate (10-60 meters, for broader pattern recognition, e.g., SEN12MS [80], SSL4EO-S12 [107]).

  • Sensor Types: Datasets leverage RGB, multispectral, hyperspectral, and synthetic aperture radar (SAR) data. For example, SEN12MS [80] integrates both SAR and multispectral imagery. This diversity is crucial for robust model development, as each sensor type captures unique surface characteristics.

    The paper provides an appendix with detailed descriptions of commonly used pre-train datasets. The following are the details of these datasets:

The following are the results from the Appendix of the original paper:

| Month, Year | Dataset | Title | Patch Size | Size | Resolution (m) | Sensor | Categories | Geographic Coverage | Image Type | Application |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2017 | RSD46-WHU [62], [116] | — | 256 x 256 | 117,000 | 0.5 - 2 | Google Earth, Tianditu | 46 | Global | RGB | Scene Classification |
| Apr, 2018 | fMoW [15] | Functional Map of the World | — | 1,047,691 | — | Digital Globe | 63 | 207 of 247 countries | Multispectral | Scene Classification, Object Detection |
| May, 2019 | DOTA [114] | DOTA: A Large-scale Dataset for Object Detection in Aerial Images | 800 x 800 to 20,000 x 20,000 | 11,268 | Various | Google Earth, GF-2 satellite, and aerial images | 18 | Global | RGB | Object Detection |
| Jun, 2019 | SEN12MS [80] | SEN12MS: A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion | 256 x 256 | 541,986 | 10 | Sentinel-1, Sentinel-2, MODIS Land Cover | — | Globally distributed | SAR/Multispectral | Land Cover Classification, Change Detection |
| Jun, 2019 | BigEarthNet [85] | BigEarthNet: A Large-Scale Benchmark Archive For Remote Sensing Image Understanding | 20 x 20 to 120 x 120 | 590,326 | Various | Sentinel-2 | 43 | Europe | Multispectral | Scene Classification, Object Detection |
| Jun, 2019 | SeCo [66] | Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data | 264 x 264 | ~1M | 10 - 60 | Sentinel-2 | — | Global | Multispectral | Seasonal Change Detection, Land Cover Classification over Seasons |
| Mar, 2021 | MillionAID [63], [64] | Million-AID | 110 - 31,672 | 1,000,848 | Various | Google Earth | 51 | Global | RGB | Scene Classification |
| Jul, 2021 | Levir-KR [56] | Geographical Knowledge-driven Representation Learning for Remote Sensing Images | — | 1,431,950 | Various | Gaofen-1, Gaofen-2, Gaofen-6 | — | Global | Multispectral | Change Detection, Scene Classification |
| Apr, 2022 | TOV-RS-Balanced [90] | The Original Vision Model for Optical Remote Sensing Image Understanding via Self-supervised Learning | 600 x 600 | 500,000 | 1 - 20 | Google Earth | 31 | Global | RGB | Scene Classification, Object Detection, Semantic Segmentation |
| Jul, 2022 | SeasoNet [53] | SeasoNet: A Seasonal Scene Classification, Segmentation and Retrieval Dataset for Satellite Imagery over Germany | up to 120 x 120 | 1,759,830 | 10 - 60 | Sentinel-2 | — | Germany | Multispectral | Scene Classification, Scene Segmentation |
| Nov, 2022 | SSL4EO-S12 [107] | SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation | 264 x 264 | 3,012,948 | 10 - 60 | Sentinel-1, Sentinel-2 | — | Global | SAR/Multispectral | Self-Supervised Learning |
| Oct, 2023 | SAMRS [98] | SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model | 600 x 600 to 1024 x 1024 | 105,090 | Various | HRSC2016, DOTA-V2.0, DIOR, FAIR1M-2.0 | — | Global | High-resolution | Semantic Segmentation, Instance Segmentation, Object Detection |
| Jun, 2023 | CACo [67] | Change-Aware Sampling and Contrastive Learning for Satellite Images | Variable | — | 10 | Sentinel-2 | — | Urban and Rural Areas | Multispectral | Change Detection, Self-Supervised Learning |
| Oct, 2023 | SatlasPretrain [7] | SatlasPretrain: A Large-scale Dataset for Remote Sensing Image Understanding | 512 x 512 | 856,000 | 1 (Sentinel-2), 0.5 - 2 (NAIP) | Sentinel-1, Sentinel-2, Landsat, NAIP | 137 | Global | Multispectral, High-resolution | Land Cover Classification, Segmentation, Change Detection |
| Oct, 2023 | SSL4EO-L [83] | SSL4EO-L: Datasets and Foundation Models for Landsat Imagery | 264 x 264 | 5,000,000 | 30 | Landsat 4-5 TM, Landsat 7 ETM+, Landsat 8-9 OLI/TIRS | — | Global | Multispectral | Cloud Detection, Land Cover Classification, Semantic Segmentation |
| Jul, 2024 | MMEarth [72] | MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning | 128 x 128 | 1,200,000 | 10 | Sentinel-2, Sentinel-1, Aster DEM | 46 | Global | Multispectral, SAR, Climate | Land Cover Classification, Semantic Segmentation |

These datasets are chosen to represent the broad spectrum of remote sensing data, enabling the development of robust models capable of addressing diverse challenges in understanding and interpreting Earth's surface. They are effective for validating methods' performance due to their scale, diversity in geographic coverage, seasonal variations, environmental conditions, and sensor types.

5.2. Evaluation Metrics

The paper discusses several evaluation metrics used to assess the performance of foundation models across various remote sensing tasks.

5.2.1. Mean Average Precision (mAP)

  • Conceptual Definition: Mean Average Precision (mAP) is a commonly used metric for evaluating the performance of object detection and instance segmentation models. It provides a single numeric value that summarizes the precision-recall curve for multiple object classes. A higher mAP indicates better performance, meaning the model is more accurate at both identifying objects (precision) and finding all relevant objects (recall) across different confidence thresholds.
  • Mathematical Formula: The Average Precision (AP) for a single class is the area under its precision-recall curve. mAP is then the average of APs across all object classes: $ \mathrm{mAP} = \frac{1}{N_{classes}} \sum_{i=1}^{N_{classes}} \mathrm{AP}_i $
  • Symbol Explanation:
    • $N_{classes}$: The total number of object classes.
    • $\mathrm{AP}_i$: The Average Precision for class $i$. The AP itself is often calculated using an interpolation method, such as the 11-point interpolation or all-point interpolation. It is essentially the weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold. (A minimal computation sketch follows.)
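
As referenced above, the following is a minimal sketch of all-point-interpolated AP averaged over classes; it assumes detections have already been matched to ground truth (the is_tp flags), whereas benchmarks such as DOTA or DIOR additionally apply IoU-matching and per-class handling rules.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for a single class.

    scores: confidence of each detection; is_tp: 1 if it matched a ground truth;
    num_gt: number of ground-truth objects of this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / num_gt
    # Make the precision envelope monotonically decreasing, then integrate over recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

def mean_average_precision(per_class):
    """per_class: list of (scores, is_tp, num_gt) tuples, one entry per object class."""
    return float(np.mean([average_precision(*c) for c in per_class]))

# Example: two classes with a handful of scored detections each.
print(mean_average_precision([
    ([0.9, 0.8, 0.3], [1, 0, 1], 2),
    ([0.7, 0.6], [1, 1], 2),
]))
```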

5.2.2. F1 Score

  • Conceptual Definition: The F1 Score is a measure of a model's accuracy, particularly useful for classification tasks, especially when dealing with imbalanced datasets. It is the harmonic mean of precision and recall, providing a single score that balances both metrics. An F1 Score of 1 indicates perfect precision and recall, while 0 indicates the worst performance.
  • Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
  • Symbol Explanation:
    • $\mathrm{TP}$: True Positives (correctly predicted positive instances).
    • $\mathrm{FP}$: False Positives (incorrectly predicted positive instances, i.e., actual negatives predicted as positives).
    • $\mathrm{FN}$: False Negatives (incorrectly predicted negative instances, i.e., actual positives predicted as negatives).
    • $\mathrm{Precision}$: The proportion of true positive predictions among all positive predictions made by the model.
    • $\mathrm{Recall}$: The proportion of true positive predictions among all actual positive instances. (A minimal computation sketch follows.)
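
A minimal sketch computing precision, recall, and the F1 Score directly from raw TP/FP/FN counts:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 80 correctly detected change pixels, 20 false alarms, 10 misses.
print(f1_score(tp=80, fp=20, fn=10))  # ~0.842
```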

5.2.3. Mean Intersection over Union (mIoU)

  • Conceptual Definition: Mean Intersection over Union (mIoU) is a standard metric for evaluating semantic segmentation tasks. It quantifies the overlap between the predicted segmentation mask and the ground truth mask for each class, then averages this value over all classes. A higher mIoU indicates better segmentation quality, meaning the model's predicted boundaries and regions are closer to the actual objects.
  • Mathematical Formula: For a single class, IoU is defined as: $ \mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} $ mIoU is the average IoU across all classes: $ \mathrm{mIoU} = \frac{1}{N_{classes}} \sum_{i=1}^{N_{classes}} \mathrm{IoU}_i $
  • Symbol Explanation:
    • $\mathrm{TP}$: True Positives (pixels correctly classified as belonging to a certain class).
    • $\mathrm{FP}$: False Positives (pixels incorrectly classified as belonging to a certain class).
    • $\mathrm{FN}$: False Negatives (pixels of a certain class incorrectly classified as something else).
    • $\mathrm{IoU}_i$: The Intersection over Union for class $i$.
    • $N_{classes}$: The total number of classes. (A minimal computation sketch follows.)
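
As referenced above, a minimal sketch of per-class IoU and mIoU computed from flat arrays of predicted and ground-truth pixel labels; benchmark implementations may additionally ignore specific label values (e.g., void pixels) or treat absent classes differently.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """mIoU over all classes, given flat arrays of predicted and true pixel labels."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        union = tp + fp + fn
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return float(np.mean(ious))

# Example: a tiny two-class segmentation map flattened to 1-D.
pred = np.array([0, 0, 1, 1, 1, 0])
true = np.array([0, 1, 1, 1, 0, 0])
print(mean_iou(pred, true, num_classes=2))  # averages the per-class IoU values
```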

5.2.4. Overall Accuracy (OA)

  • Conceptual Definition: Overall Accuracy (OA) is a straightforward metric used in classification and segmentation tasks, especially when evaluating pixel-wise classification. It represents the proportion of correctly classified samples (or pixels) out of the total number of samples (or pixels). While easy to understand, OA can be misleading in imbalanced datasets as it might be high even if the model performs poorly on minority classes.
  • Mathematical Formula: $ \mathrm{OA} = \frac{\text{Number of Correctly Classified Samples}}{\text{Total Number of Samples}} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} $
  • Symbol Explanation:
    • $\mathrm{TP}$: True Positives (correctly predicted positive instances).
    • $\mathrm{TN}$: True Negatives (correctly predicted negative instances).
    • $\mathrm{FP}$: False Positives (actual negatives predicted as positives).
    • $\mathrm{FN}$: False Negatives (actual positives predicted as negatives). (A minimal computation sketch follows.)
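
As referenced above, a minimal sketch of overall accuracy as the fraction of labels that match:

```python
import numpy as np

def overall_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of pixels (or samples) whose predicted label equals the ground truth."""
    return float(np.mean(pred == target))

print(overall_accuracy(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # 0.75
```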

5.3. Baselines

The paper primarily compares various foundation models against each other, highlighting their relative strengths and weaknesses. However, in some tables, it also includes performance numbers for non-FM (Foundation Model) or shallow CNN models as baselines, which are sourced from the original dataset papers. These baselines represent conventional or earlier state-of-the-art approaches against which the advancements of foundation models can be gauged.

Examples of such baselines mentioned or implied in the performance tables include:

  • CNNs* for change detection on OSCD dataset [10].

  • R-SegNet* [127] for pixel-level segmentation on ISPRS Potsdam dataset.

  • YOLOv2-D* [21] for object detection on DOTA dataset.

  • Faster R-CNN* [55] for object detection on DIOR dataset.

  • AOPG* [14] (Anchor-free Oriented Proposal Generator) for object detection on DIOR-R dataset.

  • STANet* [12] (Spatial-Temporal Attention Network) for change detection on LEVIR-CD dataset.

    These baselines are representative of prior leading methods in their respective tasks, allowing for a clear comparison of how foundation models improve upon existing techniques, particularly in areas like accuracy, generalization, and efficiency.

6. Results & Analysis

The paper synthesizes findings on the performance of foundation models in remote sensing, categorizing them by image analysis levels and discussing the influence of pre-training methods. All performance numbers are directly sourced from the original studies cited in the paper.

6.1. Core Results Analysis

6.1.1. Image-Level Performance (BigEarthNet Dataset)

The paper evaluates image-level classification performance on the BigEarthNet dataset [85], primarily using mAP (Mean Average Precision) and occasionally F1 Score.

The following are the results from Table IV of the original paper:

| Dataset | Model | Performance (%) | Metrics |
| --- | --- | --- | --- |
| BigEarthNet [85] | SeCo [66] | 87.81 | mAP |
| | CMC-RSSR [84] | 82.90 | mAP |
| | DINO-MM [105] | 87.10 | mAP |
| | CACo [67] | 74.98 | mAP |
| | GFM [69] | 86.30 | mAP |
| | DINO-MC [111] | 88.75 | mAP |
| | CROMA [28] | 86.46 | mAP |
| | DeCUR [102] | 89.70 | mAP |
| | CtxMIM [123] | 86.88 | mAP |
| | FG-MAE [108] | 78.00 | mAP |
| | USat [44] | 85.82 | mAP |
| | FoMo-Bench [8] | 69.33 | F1 Score |
| | SwiMDiff [91] | 81.10 | mAP |
| | SpectralGPT [40] | 88.22 | mAP |
| | SatMAE++ [73] | 85.11 | mAP |
| | msGFM [33] | 92.90 | mAP |
| | SkySense [32] | 92.09 | mAP |
| | MMEarth [72] | 78.6 | mAP |

  • Top Performers: msGFM [33] achieves the highest mAP of 92.90%, closely followed by SkySense [32] with 92.09%. These models demonstrate excellent efficiency in classification tasks on BigEarthNet.
  • Strong Performance: Other models like DeCUR [102] (89.70% mAP), DINO-MC [111] (88.75% mAP), SpectralGPT [40] (88.22% mAP), SeCo [66] (87.81% mAP), and DINO-MM [105] (87.10% mAP) also show strong performance, indicating robust classification capabilities.
  • Room for Improvement: CACo [67] (74.98% mAP) and FoMo-Bench [8] (69.33% F1 Score) show competitiveness but suggest potential for further optimization in this domain.
  • Key takeaway: The high mAP scores of msGFM and SkySense highlight the effectiveness of advanced pretraining techniques in capturing complex spatial and spectral features, crucial for remote sensing scene classification. SkySense, for instance, achieved an average improvement of 2.76% over recent models by employing multi-granularity contrastive learning on a diverse dataset. Similarly, HyperSIGMA [95] (not in this table but mentioned in the text) demonstrates high accuracy in hyperspectral classification by optimizing spectral-spatial feature extraction using a sparse sampling attention mechanism. These results underscore the importance of tailored pre-training strategies for achieving state-of-the-art classification accuracy.

6.1.2. Pixel-Level Performance (ISPRS Potsdam, OSCD, LEVIR-CD Datasets)

6.1.2.1. ISPRS Potsdam Dataset (Semantic Segmentation)

The paper compares semantic segmentation performance on the ISPRS Potsdam dataset [43]. Metrics include mIoU (Mean Intersection over Union), OA (Overall Accuracy), and mF1 Score.

The following are the results from Table V of the original paper:

| Dataset | Model | Performance (%) | Metrics |
| --- | --- | --- | --- |
| ISPRS Potsdam | GeoKR [56] | 70.48 | mIoU |
| | RSP [96] | 65.30 | mIoU |
| | RingMo [87] | 91.74 | OA |
| | RVSA [100] | 91.22 | OA |
| | TOV [89] | 60.34 | mIoU |
| | CMID [70] | 87.04 | mIoU |
| | RingMo-lite [109] | 90.96 | OA |
| | Cross-Scale MAE [88] | 76.17 | mIoU |
| | SMLFR [22] | 91.82 | OA |
| | SkySense [32] | 93.99 | mF1 |
| | UPetu [24] | 83.17 | mIoU |
| | BFM [11] | 92.58 | OA |
| | R-SegNet* [127] | 91.37 | OA |

  • Top Performers (mF1/mIoU): SkySense [32] achieves the highest mF1 Score of 93.99%, indicating superior overall segmentation performance. CMID [70] leads in mIoU with 87.04%, demonstrating strong capability in accurately segmenting different regions. UPetu [24] also shows competitive mIoU at 83.17%.
  • Top Performers (OA): BFM [11] records the highest OA of 92.58%, closely followed by SMLFR [22] (91.82%), RingMo [87] (91.74%), and R-SegNet* [127] (91.37%, a non-FM baseline).
  • Other Competitors: Cross-Scale MAE [88] (76.17% mIoU) and GeoKR [56] (70.48% mIoU) show robust but improvable segmentation performance. TOV [89] has the lowest mIoU at 60.34%.
  • Key takeaway: The varying metrics highlight that different models excel in specific aspects of segmentation. SkySense and CMID are strong choices for precise region delineation, while BFM and SMLFR offer high overall pixel accuracy.

6.1.2.2. OSCD and LEVIR-CD Datasets (Change Detection)

The performance of foundation models on change detection tasks is evaluated using F1 Score on the OSCD [10] and LEVIR-CD [12] datasets.

The following are the results from Table VII of the original paper:

| Dataset | Model | F1 Score |
| --- | --- | --- |
| OSCD [10] | SeCo [66] | 46.94 |
| | MATTER [2] | 49.48 |
| | CACo [67] | 52.11 |
| | GFM [69] | 59.82 |
| | SWiMDiff [91] | 49.60 |
| | SpectralGPT [40] | 54.29 |
| | SkySense [32] | 60.06 |
| | DINO-MC [111] | 52.71 |
| | HyperSIGMA [95] | 59.28 |
| | MTP [99] | 53.36 |
| | CNNs* [10] | 89.66 (OA) |
| LEVIR-CD [12] | RSP [96] | 90.93 |
| | RingMo [87] | 91.86 |
| | RingMo-lite [109] | 91.56 |
| | SwiMDiff [91] | 80.90 |
| | SkySense [32] | 92.58 |
| | UPetu [24] | 88.50 |
| | STANet* [12] | 85.4 |

  • OSCD Dataset: SkySense [32] achieves the highest F1 Score of 60.06%, demonstrating superior change detection ability. GFM [69] (59.82% F1 Score) and HyperSIGMA [95] (59.28% F1 Score) also perform strongly. SeCo [66] shows the lowest F1 Score (46.94%), indicating potential for improvement. Note that a generic CNNs* baseline [10] on OSCD (likely with different evaluation conditions or a simpler task) achieved a very high OA of 89.66%, suggesting that F1 Score for complex change detection might be more challenging.
  • LEVIR-CD Dataset: Performance is generally higher on this dataset. SkySense [32] achieves the highest F1 Score of 92.58%. RingMo [87] (91.86% F1 Score), RingMo-lite [109] (91.56% F1 Score), and RSP [96] (90.93% F1 Score) also exhibit robust performance. SwiMDiff [91] records a lower F1 Score (80.90%) compared to its peers, but is still effective. STANet* [12] (a non-FM baseline) also shows a strong F1 Score of 85.4%.
  • Key takeaway: SkySense consistently performs well in change detection. The higher F1 Scores on LEVIR-CD compared to OSCD suggest differences in dataset characteristics or task complexity. Foundation models demonstrate significant advancements in change detection, crucial for environmental monitoring and disaster management.

6.1.3. Region-Level Performance (DOTA, DIOR, DIOR-R Datasets)

Object detection performance is evaluated using mAP (Mean Average Precision) and AP50 (Average Precision at 50% IoU) on DOTA [20], [21], [113], DIOR [55], and DIOR-R [14] datasets.

The following are the results from Table VI of the original paper:

| Dataset | Model | Performance (%) | Metrics |
| --- | --- | --- | --- |
| DOTA | RSP [96] | 77.72 | mAP |
| | RVSA [100] | 81.24 | mAP |
| | TOV [89] | 26.10 | mAP50 |
| | CMID [70] | 72.12 | mAP |
| | GeRSP [42] | 67.40 | mAP |
| | SMLFR [22] | 79.33 | mAP |
| | BFM [11] | 58.69 | mAP |
| | YOLOv2-D* [21] | 60.51 | AP |
| DIOR | RingMo [87] | 75.80 | mAP |
| | CSPT [124] | 69.80 | mAP |
| | RingMo-lite [109] | 73.40 | mAP |
| | GeRSP [42] | 72.20 | mAP |
| | MTP [99] | 78.00 | AP50 |
| | Faster R-CNN* [55] | 74.05 | mAP |
| DIOR-R | RVSA [100] | 71.05 | mAP |
| | SMLFR [22] | 72.33 | mAP |
| | SkySense [32] | 78.73 | mAP |
| | MTP [99] | 74.54 | mAP |
| | BFM [11] | 73.62 | mAP |
| | AOPG* [14] | 64.41 | mAP |

  • DOTA Dataset: RVSA [100] achieves the highest mAP of 81.24%, followed by SMLFR [22] (79.33% mAP) and RSP [96] (77.72% mAP). CMID [70] (72.12% mAP) and GeRSP [42] (67.40% mAP) show moderate performance. YOLOv2-D* [21] (a non-FM baseline) has an AP of 60.51%.
  • DIOR Dataset: MTP [99] achieves the highest AP50 of 78.00%, indicating strong performance for relaxed IoU thresholds. RingMo [87] (75.80% mAP) and Faster R-CNN* [55] (74.05% mAP, a non-FM baseline) also perform well.
  • DIOR-R Dataset: SkySense [32] is the top performer with an mAP of 78.73%, showcasing superior object detection capabilities. MTP [99] (74.54% mAP) and BFM [11] (73.62% mAP) also demonstrate strong performance. AOPG* [14] (a non-FM baseline) has an mAP of 64.41%.
  • Key takeaway: RVSA, SMLFR, MTP, and SkySense consistently perform well across different object detection datasets. Foundation models generally outperform or are highly competitive with traditional object detection methods like YOLOv2-D* and Faster R-CNN*, demonstrating their effectiveness in region-level analysis.

6.2. Influence of Pre-training Methods

The paper highlights that pre-training methods significantly impact the performance of foundation models.

  • Superiority of SSL: Models pre-trained with Self-Supervised Learning (SSL) techniques, particularly Contrastive Learning (CL) and Masked Autoencoders (MAE), consistently outperform those trained with traditional supervised learning.
    • Contrastive Learning Examples: SkySense [32], using a multi-granularity contrastive learning approach, shows an improvement of approximately 3.6% in scene classification and object detection. SeCo [66], based on seasonal contrastive learning, improves land-cover classification metrics by up to 7% over ImageNet-pre-trained models.
    • Masked Autoencoder Examples: For multi-temporal and multispectral data, SatMAE [16] and Scale-MAE [78] leverage masked autoencoding. SatMAE shows up to a 14% performance gain in land cover classification [16], and Scale-MAE offers a 1.7% mIoU improvement for segmentation across varied resolutions [78].
  • Generative vs. Contrastive for Time-Series: Recent studies suggest that generative methods like MAE have distinct advantages over contrastive methods for time-series data, especially with limited labeled data [61]. MAE-based models, by reconstructing data from masked segments, can capture complex underlying temporal and spectral dependencies more effectively, leading to stronger representations under sparse labeling conditions.
  • Practical Trade-offs: The paper also discusses practical trade-offs among high-performing foundation models:
    • SatMAE [16]: Excels in capturing complex spatiotemporal patterns using transformer architecture and temporal/multi-spectral embeddings, but at the cost of significant computational requirements.
    • RingMo [87]: Offers a more lightweight vision transformer architecture, balancing performance with computational demands, making it suitable for rapid-inference tasks (e.g., disaster response monitoring).
    • A2-MAE [122]: Introduces an anchor-aware masking strategy to optimize spatial-temporal-spectral representations and integrate multi-source data. Its complex encoding enhances adaptability but increases computational load, fitting applications requiring high accuracy over efficiency.
    • ORBIT [101]: With 113 billion parameters, it is exceptionally scalable for Earth system predictability tasks, achieving high-throughput performance. However, its substantial resource requirements limit deployment to specialized high-performance computing environments.

6.3. Practical Implications

The advancements in foundation models have profound implications for real-world remote sensing applications:

  • Environmental Monitoring: Models like GFM [69] achieve high pixel-level accuracy in semantic segmentation for deforestation monitoring (up to 4.5% improvement over baselines), enhancing precision in mapping forest cover changes. HyperSIGMA [95] provides a 6.2% accuracy boost in hyperspectral vegetation monitoring, critical for assessing forest health and biodiversity. These models aid in conservation and policy-making by tracking deforestation, desertification, and pollution levels.

  • Agriculture and Forestry: Foundation models deliver valuable insights into crop health, yield predictions, and land use management. For example, RSP [96] enhances precision agriculture through multi-spectral data, while EarthPT [82] and GeCo [57] optimize practices and resource allocation. They detect early signs of crop stress, diseases, and pests, and support sustainable forestry by mapping forest cover and estimating biomass.

  • Archaeology: Models like GeoKR [56] and RingMo [87] revolutionize the discovery and analysis of archaeological sites by processing high-resolution satellite imagery and multi-spectral data. They enhance the detection of features, enable large-scale surveys, and monitor changes over time, improving efficiency and accuracy in archaeological investigations.

  • Urban Planning and Development: CMID [70] and SkySense [32] are pivotal for urban expansion monitoring, infrastructure development, and land use changes. UPetu [24] excels in infrastructure mapping (over 5% higher accuracy than single-modality models) by integrating multi-modal data (optical and radar), enabling more informed land-use decisions. RingMo [87] enhances object detection accuracy by 3.7% for dense urban features. These models facilitate sustainable urban growth and development planning.

  • Disaster Management: Models like OFA-Net [118], DOFA [117], and Prithvi [46] are instrumental in flood mapping and fire detection. They provide critical real-time data for rapid response and recovery efforts, helping identify affected areas quickly and prioritize resource allocation. ORBIT [101] demonstrates exceptional scalability for Earth system predictability tasks with up to 85% scaling efficiency, supporting long-term environmental monitoring and climate change prediction.

    The adaptability, scalability, and efficiency of foundation models unlock a new level of precision and accessibility, enabling practitioners to tackle complex, evolving challenges across domains that traditional task-specific models struggled to address at scale.

7. Conclusion & Reflections

7.1. Conclusion Summary

This comprehensive survey has meticulously reviewed the recent advancements in foundation models for remote sensing, specifically focusing on developments between June 2021 and June 2024. The paper successfully categorized these models based on their pre-training methods (e.g., self-supervised learning via contrastive learning and masked autoencoders), image analysis techniques (e.g., image-level, region-level, pixel-level), and practical applications (e.g., environmental monitoring, digital archaeology, agriculture, urban planning, disaster management).

The analysis highlighted the significant performance improvements brought by advanced techniques such as self-supervised learning, Vision Transformers (ViTs), and Residual Neural Networks (ResNets). These foundation models have set new benchmarks across various image perception levels and real-world applications, demonstrating their transformative potential in remote sensing. A key finding emphasized the remarkable enhancement in performance and robustness of foundation models through self-supervised learning approaches.

7.2. Limitations & Future Work

7.2.1. Limitations

The authors acknowledge several limitations of their survey:

  • Scope and Coverage: The review is limited to foundation models released between June 2021 and June 2024. This temporal constraint means that very recent advancements or innovations without sufficient evaluation metrics at the time of writing may be omitted.
  • Evolving Field: AI and remote sensing are rapidly evolving fields. The dynamic nature necessitates continuous reviews and updates to maintain relevance and comprehensiveness, as new techniques and models are constantly emerging.
  • Limited Explicit Testing: While foundation models possess robust architectures and general-purpose training paradigms, current literature often shows them empirically tested on a specific set of downstream applications. This limited testing should not be misinterpreted as a constraint on their broader applicability; rather, it indicates the focus of existing research efforts. These models are expected to generalize effectively to a wider variety of remote sensing tasks beyond those explicitly tested.

7.2.2. Future Work

The paper suggests several crucial directions for future research:

  • Efficient Model Development:
    • Computational Reduction: Explore techniques like model distillation (transferring knowledge from a larger model to a smaller one), pruning (removing unnecessary connections or neurons), and quantization (reducing the precision of model weights) to decrease computational requirements without sacrificing performance.
    • Scalable Architectures: Develop scalable architectures that can efficiently handle ultra-high-resolution images.
    • Parameter-Efficient Fine-Tuning: Incorporate methods like LoRA (Low-Rank Adaptation) [41] for efficient fine-tuning of large models with minimal computational overhead, making them suitable for resource-constrained environments or frequent retraining (a minimal sketch follows this list).
  • Multi-Modal Data Integration: Enhance methods for integrating and processing diverse multi-modal data (e.g., combining optical and radar imagery) to provide more comprehensive insights. Research into advanced SSL techniques capable of leveraging multi-modal data is necessary, with frameworks like OFA-Net [118] serving as promising directions (a simple late-fusion sketch appears after this list).
  • Interdisciplinary Collaboration: Promote collaboration among remote sensing experts, AI researchers, and domain specialists (e.g., environmental scientists, archaeologists) to address complex challenges and drive innovation in practical applications.
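
As a concrete illustration of the parameter-efficient fine-tuning direction above, the sketch below wraps a frozen linear layer with a trainable low-rank update in the spirit of LoRA [41]. The wrapper class, rank, and scaling factor are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

# Only the low-rank factors are trainable, so gradients and optimizer state stay small.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```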

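The multi-modal integration direction can likewise be sketched as a simple late-fusion baseline: separate encoders embed optical and SAR patches, and their features are concatenated before a shared head. This is a hypothetical toy example, not the OFA-Net design; the band counts, encoder layers, and class count are assumptions.

```python
import torch
import torch.nn as nn

class DualModalityFusion(nn.Module):
    """Late-fusion baseline: encode optical and SAR patches separately, then concatenate."""

    def __init__(self, feat_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # Tiny CNN encoders for illustration; real systems would use pre-trained ViT/ResNet backbones.
        self.optical_enc = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4 optical bands (e.g., RGB + NIR)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.sar_enc = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),   # 2 SAR polarizations (e.g., VV, VH)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, optical: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.optical_enc(optical), self.sar_enc(sar)], dim=1)
        return self.head(fused)

# Example: a batch of 64x64 optical (4-band) and SAR (2-band) patches.
model = DualModalityFusion()
logits = model(torch.randn(2, 4, 64, 64), torch.randn(2, 2, 64, 64))
```
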
7.3. Personal Insights & Critique

This survey offers a highly valuable and timely overview of vision foundation models in remote sensing. The authors' emphasis on the recent advancements (2021-2024) and the detailed categorization by pre-training methods and image analysis levels provides a clear roadmap for understanding the current landscape.

Personal Insights:

  • Power of SSL: The consistent success of self-supervised learning in enabling models to learn from unlabeled data is particularly inspiring. This capability is critical for remote sensing, where vast amounts of raw data exist but manual annotation is a major bottleneck. The ability of MAE-based methods to capture intricate temporal and spectral dependencies from masked data, especially for time-series, suggests a profound shift in how we approach geospatial data analysis.
  • Generalizability and Transferability: Foundation models are inherently designed for generalizability. Their methods and conclusions can undoubtedly be transferred to other scientific domains dealing with large image or time-series datasets, such as medical imaging, climate modeling, or even materials science. The ability to pre-train on general geospatial data and fine-tune for specific, novel tasks significantly accelerates research and application development.
  • Bridging Academia and Application: The clear articulation of practical implications across environmental monitoring, agriculture, archaeology, urban planning, and disaster management effectively bridges the gap between theoretical AI advancements and their real-world impact. This pragmatic perspective is essential for driving adoption and investment in this field.

Critique and Areas for Improvement:

  • Computational and Environmental Costs: While acknowledged as a challenge, the true scale of computational resources and the environmental impact of training and maintaining these massive foundation models warrant even deeper critical discussion. Future surveys could delve into carbon footprints and strategies for sustainable AI in remote sensing. The trade-off between model size/accuracy and computational efficiency is a recurring theme, and LoRA-like techniques are crucial steps, but the fundamental challenge remains.

  • Data Bias and Representativeness: Although the paper discusses the need for high-quality and diverse data, a more detailed critique of potential biases within existing pre-training datasets (e.g., geographic bias, sensor bias, socio-economic bias in urban areas) and their implications for model fairness and generalization to underrepresented regions would be valuable.

  • Explainability and Trust: As foundation models become larger and more complex, their black-box nature poses challenges for explainability, especially in critical applications like disaster management or policy-making. Future research needs to focus on making these models more interpretable and trustworthy.

  • Benchmarking Standardization: The variety of datasets, metrics, and experimental setups across different papers makes direct performance comparisons challenging. While the authors synthesize the results, a call for more standardized benchmarking protocols and unified evaluation frameworks for remote sensing foundation models would be beneficial for the community.

  • Dynamic Nature of the Field: The limitation regarding the rapidly evolving nature of the field is inherent to any survey. However, periodic updates or a "living review" approach might be a future consideration to keep pace with the breakthroughs.

    Overall, this paper serves as an excellent reference point for anyone looking to understand the cutting edge of vision foundation models in remote sensing. Its comprehensive nature and forward-looking discussions make it a valuable resource for guiding future research and development.
