
Segment Anything

Published: 04/06/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Segment Anything introduces SAM, a promptable segmentation model, and SA-1B, the largest segmentation dataset to date with over 1B masks. SAM excels at zero-shot transfer, often rivaling fully supervised methods, and is a step toward foundation models for computer vision.

Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Segment Anything
  • Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. The authors are affiliated with Meta AI Research (FAIR). The paper designates specific roles like 'project lead', 'joint first author', 'equal contribution', and 'directional lead', highlighting a large-scale, collaborative research effort.
  • Journal/Conference: The paper was submitted to arXiv, a popular open-access repository for preprints in scientific fields. This version (v1) indicates it is the first public release, not yet having undergone formal peer review for a conference or journal.
  • Publication Year: 2023
  • Abstract: The paper introduces the Segment Anything (SA) project, which consists of three main contributions: a new task for image segmentation called "promptable segmentation," a new model called the Segment Anything Model (SAM), and the largest-ever segmentation dataset, SA-1B, containing over 1 billion masks on 11 million images. The model is designed to be "promptable," allowing it to perform segmentation in a zero-shot manner on new types of images and tasks it has not seen during training. The authors find SAM's zero-shot performance to be highly impressive, often matching or even outperforming previous models that were fully supervised on those specific tasks. The model and dataset are being publicly released to encourage further research into foundation models for computer vision.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why): In Natural Language Processing (NLP), large "foundation models" like GPT-3 have demonstrated a remarkable ability to perform a wide variety of tasks with little to no task-specific training, often by simply providing them with a "prompt." Computer vision has seen similar successes with models like CLIP, which align images and text, but a true general-purpose foundation model for the fundamental task of image segmentation (identifying which pixels in an image belong to which object) was missing. Prior segmentation models were highly specialized; a model trained to segment cats could not segment cars without being retrained. The core challenge was three-fold:

    1. What task could teach a model a general understanding of "what an object is"?
    2. What model architecture could perform this task efficiently and flexibly?
    3. Where could one obtain the massive and diverse dataset needed to train such a model?
  • Main Contributions / Findings (What): The "Segment Anything" project provides a comprehensive solution to these three challenges, marking a paradigm shift in image segmentation.

    1. A New Task: Promptable Segmentation. The paper defines a new task where the goal is to return a valid segmentation mask for any given prompt. A prompt can be a point, a box, a textual description, or even another mask, specifying what to segment.
    2. A New Model: The Segment Anything Model (SAM). SAM is an efficient and powerful model specifically designed for promptable segmentation. It uses a heavy image encoder to process the image once and a lightweight decoder to generate masks in real-time based on prompts, making it suitable for interactive applications. Critically, it can output multiple valid masks for ambiguous prompts.
    3. A New Dataset: SA-1B. To train SAM, the authors built a "data engine" that used the model itself to help accelerate data collection. This process culminated in SA-1B, the largest segmentation dataset to date by a massive margin, containing 1.1 billion high-quality masks on 11 million licensed, privacy-respecting images.
    4. Impressive Zero-Shot Performance. The key finding is that SAM, trained on SA-1B, can generalize to a vast range of segmentation tasks and new image types (e.g., underwater, medical) without any additional training. Its zero-shot performance is often competitive with or superior to fully-supervised models, as validated by both automatic metrics and human studies.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Image Segmentation: The process of partitioning a digital image into multiple segments or sets of pixels, often to identify objects and boundaries. Unlike object detection, which draws a box around an object, segmentation aims to classify every single pixel in the image (e.g., these pixels are 'cat', these are 'grass').
    • Foundation Models: Extremely large machine learning models trained on vast quantities of broad data. Their key characteristic is the ability to be adapted to a wide range of downstream tasks, often with minimal fine-tuning. GPT-3 is a prime example in NLP.
    • Zero-Shot Learning: The ability of a model to perform a task it was not explicitly trained to do. For SAM, this means segmenting objects in datasets and image distributions it has never seen before.
    • Prompting: The technique of guiding a foundation model's output by providing it with a specific input, or "prompt." In SAM, a prompt can be a click on an object, a box drawn around it, or text describing it.
    • Vision Transformer (ViT): An architecture that applies the Transformer model, originally developed for text, to image analysis. It works by breaking an image into a sequence of smaller patches and processing them, much like a sentence is a sequence of words.
    • Masked Autoencoder (MAE): A self-supervised learning technique used to pre-train ViTs. It involves randomly hiding (masking) parts of an input image and training the model to reconstruct the missing parts. This forces the model to learn meaningful visual representations.
  • Previous Works: The paper situates itself in the broad field of segmentation but distinguishes its goal. Previous work focused on specialized tasks:

    • Interactive Segmentation: A human user iteratively refines a mask by providing input (e.g., clicks). While SAM can be used this way, its goal is more general: to be a composable component in larger systems, not just a tool for human interaction.
    • Semantic, Instance, and Panoptic Segmentation: These tasks involve segmenting and classifying objects based on a fixed set of predefined categories (e.g., 'car', 'person', 'road').
    • Multi-task Systems: Some models were trained to perform several of these fixed tasks simultaneously. The key limitation of all these prior approaches is that they are trained and evaluated on the same, fixed set of tasks. They cannot easily generalize to a new, unseen segmentation task without retraining.
  • Differentiation: SAM's approach is fundamentally different. It is not trained to solve a specific task like "find all the cars." Instead, it is trained on the general promptable segmentation task. This endows it with a general understanding of what constitutes an object or a distinct region. This general ability allows SAM to be used as a component in a larger system to solve new tasks through "prompt engineering." For example, by combining an existing object detector with SAM, one can build an instance segmentation system on the fly. The massive scale of the SA-1B dataset is another critical differentiator, enabling this unprecedented level of generalization.

4. Methodology (Core Technology & Implementation)

The Segment Anything project is built on three interconnected pillars: the task, the model, and the data engine.

Figure (schematic): the promptable segmentation task. The input is an image together with a segmentation prompt, and the model returns a valid segmentation mask; the examples use a cat and a cup.

  • 1. Task: Promptable Segmentation. The core idea is to create a model that can respond to any prompt indicating what to segment in an image.

    • Goal: Given an image and a prompt, output a valid segmentation mask.

    • Prompts: Prompts can be sparse (points, boxes, text) or dense (a rough mask).

    • "Valid Mask" Requirement: This is a crucial concept. If a prompt is ambiguous (e.g., a point on a shirt could refer to the shirt or the person wearing it), the model should not fail or average the possibilities. Instead, it must output a reasonable mask for at least one of the potential objects. This ambiguity-awareness is central to the model's design.

      Figure 3: Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle).

  • 2. Model: Segment Anything Model (SAM). SAM is designed to be efficient, flexible, and ambiguity-aware, following a three-part architecture.

    Figure (schematic): the Segment Anything Model architecture. An image encoder produces an image embedding; a prompt encoder handles mask, point, box, and text inputs; and a mask decoder outputs multiple segmentation masks with corresponding quality scores.

    • Image Encoder: A heavyweight Vision Transformer (ViT-H), pre-trained with the Masked Autoencoder (MAE) method. It takes a high-resolution image as input and generates a single, powerful image embedding. This is the most computationally intensive part, but it only needs to be run once per image. Its cost is thus "amortized" across all subsequent prompts for that image.
    • Prompt Encoder: A lightweight encoder that converts various prompts into embedding vectors.
      • Sparse Prompts: Points and boxes are represented by positional encodings combined with learned embeddings for each type (e.g., 'top-left corner', 'foreground point').
      • Text Prompts: Free-form text is encoded using the text encoder from CLIP.
      • Dense Prompts: Masks are embedded using convolutional layers and then added element-wise to the image embedding.
    • Mask Decoder: A very fast and lightweight decoder that takes the image embedding and prompt embeddings to predict segmentation masks. It uses two Transformer decoder blocks to process the information and then upsamples the result to predict a mask at the original image resolution. The entire decoding process takes about 50 milliseconds, enabling real-time interaction.

    Key Design Choices:

    • Resolving Ambiguity: To handle ambiguous prompts, the decoder is designed to predict three masks simultaneously. During training, the loss is computed only on the mask that best matches the ground truth, encouraging the model to explore different valid interpretations. The model also predicts an IoU score for each mask to rank them.
    • Efficiency: The separation of the heavy image encoder from the fast prompt encoder and mask decoder is the key to achieving real-time performance for interactive use cases; a minimal usage sketch of this encode-once, prompt-many-times pattern follows this list.
    • Training Loss: The model is trained using a combination of focal loss and dice loss, a standard practice in modern segmentation models.
  • 3. Data Engine. Since no sufficiently large segmentation dataset existed, the authors built a "data engine" to create one in a virtuous cycle: the model assists in data collection, and the collected data is used to improve the model.

    Figure (schematic): the model-in-the-loop data engine and the resulting SA-1B dataset, with over 1 billion masks on 11 million licensed, privacy-protected images.

    • Stage 1: Assisted-Manual: Professional annotators used an early version of SAM as an interactive segmentation tool. They would click on objects, and the model would generate a mask, which they could refine. This is a model-in-the-loop annotation process. The model was retrained 6 times during this stage, and as it improved, annotation time per mask dropped from 34 to 14 seconds.
    • Stage 2: Semi-Automatic: To increase mask diversity, a more capable SAM was used to automatically segment "confident" objects. Annotators were then presented with these pre-filled images and asked to label any remaining, unannotated (and often more challenging) objects.
    • Stage 3: Fully Automatic: In the final stage, the most powerful version of SAM was used to generate all masks automatically. The model was prompted with a regular 32×32 grid of points on each of the 11 million images. For each point, the ambiguity-aware model produced multiple masks. A careful filtering process, based on predicted IoU scores and mask stability, was used to select high-quality masks and remove duplicates. This stage produced the vast majority of the 1.1 billion masks in the final SA-1B dataset (a hedged sketch of this step, using the released library's automatic mask generator, also follows this list).
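
To make the encode-once, prompt-many-times design concrete, here is a minimal usage sketch based on the publicly released segment-anything library; the class and argument names follow that library but may differ across versions, and the checkpoint path and image are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint with the ViT-H image encoder; the path is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Run the heavy image encoder once; its cost is amortized over all prompts.
image = np.zeros((768, 1024, 3), dtype=np.uint8)  # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)

# Prompt with a single foreground point; request the three ambiguity-aware masks.
point = np.array([[512, 384]])   # (x, y) pixel coordinates
label = np.array([1])            # 1 = foreground point, 0 = background point
masks, iou_scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,       # return 3 candidate masks for an ambiguous prompt
)

# Rank candidates by the model's own predicted IoU and keep the best one.
best = masks[np.argmax(iou_scores)]  # boolean array of shape (H, W)
```

Because `set_image` caches the image embedding, each additional point or box prompt on the same image only runs the lightweight prompt encoder and mask decoder, which is what makes the roughly 50 ms interactive loop described above possible.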
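
The fully automatic stage of the data engine is mirrored by the library's automatic mask generator, which prompts SAM with a regular point grid and then filters candidates by predicted IoU, stability, and non-maximum suppression. The sketch below uses that generator with illustrative default thresholds; these are not claimed to be the exact settings used to build SA-1B.

```python
from segment_anything import SamAutomaticMaskGenerator

# Reuses the `sam` model loaded in the previous sketch.
generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # a 32x32 grid of point prompts
    pred_iou_thresh=0.88,          # keep masks the model itself rates as high quality
    stability_score_thresh=0.95,   # keep masks stable under threshold perturbation
    box_nms_thresh=0.7,            # drop near-duplicate masks via NMS on their boxes
)

records = generator.generate(image)  # one dict per kept mask
for r in records[:3]:
    print(r["area"], r["predicted_iou"], r["stability_score"])
```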

5. Experimental Setup

  • Datasets:

    • Training Dataset: SA-1B, the novel dataset created by the authors. It contains 1.1B masks on 11M high-resolution, licensed, and privacy-protected images. The authors verify that the automatically generated masks are of very high quality, with 94% having over 90% IoU with professionally corrected versions.
    • Evaluation Datasets: A diverse suite of 23 existing segmentation datasets was used for zero-shot evaluation. This includes well-known datasets like COCO and LVIS, as well as datasets with very different image distributions like ego-centric (VISOR) and underwater images, which were not present in the training data. For the Responsible AI analysis, the MIAP dataset was used.
  • Evaluation Metrics:

    • Intersection over Union (IoU): A standard segmentation metric that measures the overlap between a predicted mask $A$ and a ground truth mask $B$. It ranges from 0 (no overlap) to 1 (perfect overlap); a minimal computation sketch follows this section's list.
      • Conceptual Definition: It calculates the ratio of the area of intersection to the area of union of the two masks.
      • Mathematical Formula: $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
      • Symbol Explanation:
        • $A$: The set of pixels in the predicted mask.
        • $B$: The set of pixels in the ground truth mask.
        • $|\cdot|$: The area of a set, i.e., its number of pixels.
    • Mean IoU (mIoU): The IoU averaged over all test examples in a dataset.
    • Average Precision (AP): Used for instance segmentation, this metric summarizes the shape of the precision-recall curve. A higher AP indicates a model is good at both finding objects (recall) and ensuring its predictions are correct (precision).
    • Average Recall (AR): Used for object proposal generation, this metric measures the maximum recall a model can achieve for a given number of proposals, averaged across different IoU thresholds.
    • BSDS500 Edge Detection Metrics:
      • ODS (Optimal Dataset Scale): The F-score (harmonic mean of precision and recall) achieved when a single, global threshold is applied to all edge probability maps across the entire dataset.
      • OIS (Optimal Image Scale): The average F-score achieved when the best possible threshold is chosen independently for each image.
      • R50 (Recall at 50% Precision): The recall level when the model's precision is at 50%.
  • Baselines: The paper compares SAM against strong, often fully-supervised, specialist models for each task:

    • Interactive Segmentation: RITM, a state-of-the-art interactive segmentation model.
    • Edge Detection: HED and EDETR, deep learning-based methods trained on the BSDS500 dataset.
    • Instance Segmentation / Object Proposals: ViTDet-H, a powerful, fully-supervised detector based on a Vision Transformer.
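
As a concrete reference for the IoU and mIoU definitions above, the following NumPy sketch computes both for boolean masks of equal shape; it is an illustration, not the paper's evaluation code.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks: |A ∩ B| / |A ∪ B|, counted in pixels."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 1.0  # both empty: perfect match

def mean_iou(preds, gts) -> float:
    """mIoU: IoU averaged over all (prediction, ground truth) pairs in a dataset."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```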

6. Results & Analysis

SAM's performance was evaluated in a zero-shot transfer setting across a variety of tasks, meaning SAM was not fine-tuned on any of the evaluation datasets.

  • Responsible AI (RAI) Analysis. The authors investigated potential biases in the SA-1B dataset and the SAM model.

    • Geographic and Income Representation: The following is a manual transcription of Table 1. SA-1B shows better representation for Europe, Asia, and middle-income countries compared to COCO and Open Images. However, all datasets underrepresent Africa and low-income countries.

      | region / income group | SA-1B # countries | SA-1B # imgs | SA-1B # masks | SA-1B % imgs | Open Images % imgs | COCO % imgs |
      | --- | --- | --- | --- | --- | --- | --- |
      | Africa | 54 | 300k | 28M | 2.8% | 3.0% | 1.7% |
      | Asia & Oceania | 70 | 3.9M | 423M | 36.2% | 11.4% | 14.3% |
      | Europe | 47 | 5.4M | 540M | 49.8% | 34.2% | 36.2% |
      | Latin America & Carib. | 42 | 380k | 36M | 3.5% | 5.0% | 3.1% |
      | North America | 4 | 830k | 80M | 7.7% | 48.3% | 42.8% |
      | high income countries | 81 | 5.8M | 598M | 54.0% | 89.1% | |
      | middle income countries | 108 | 4.9M | 499M | 45.0% | 10.5% | 12.0% |
      | low income countries | 28 | 100k | 9.4M | 0.9% | 0.4% | 0.5% |
    • Fairness in Segmenting People: The following is a manual transcription of Table 2. The analysis measured SAM's performance across perceived gender, age, and skin tone. The overlapping 95% confidence intervals suggest that there are no significant performance discrepancies across these groups, indicating fair performance.

      | attribute | group | mIoU at 1 point | mIoU at 3 points |
      | --- | --- | --- | --- |
      | perceived gender presentation | feminine | 54.4 ±1.7 | 90.4 ±0.6 |
      | perceived gender presentation | masculine | 55.7 ±1.7 | 90.1 ±0.6 |
      | perceived age group | older | 62.9 ±6.7 | 92.6 ±1.3 |
      | perceived age group | middle | 54.5 ±1.3 | 90.2 ±0.5 |
      | perceived age group | young | 54.2 ±2.2 | 91.2 ±0.7 |
      | perceived skin tone | 1 | 52.9 ±2.2 | 91.0 ±0.9 |
      | perceived skin tone | 2 | 51.5 ±1.4 | 91.1 ±0.5 |
      | perceived skin tone | 3 | 52.2 ±1.9 | 91.4 ±0.7 |
      | perceived skin tone | 4 | 51.5 ±2.7 | 91.7 ±1.0 |
      | perceived skin tone | 5 | 52.4 ±4.2 | 92.5 ±1.4 |
      | perceived skin tone | 6 | 56.7 ±6.3 | 91.2 ±2.4 |
  • 7.1. Zero-Shot Single Point Valid Mask Evaluation. This experiment tests SAM's core capability: generating a valid mask from a single, often ambiguous, point prompt.

    Figure 9 (four panels): (a) per-dataset mIoU difference between SAM and RITM at a single center point across 23 datasets; (b) human annotator mask-quality ratings for SAM and other methods per dataset; (c, d) mIoU as the number of prompt points grows, for SAM and the baselines, using center and random points respectively.

    • Automatic Metrics: SAM outperforms the specialist baseline RITM on 16 of the 23 diverse datasets (Fig. 9a). The "oracle" performance (where the best of SAM's three predicted masks is chosen) is superior on all datasets, confirming that SAM's ability to handle ambiguity is effective but can be penalized by standard metrics that expect a single "correct" answer.
    • Human Study: This is a crucial result. Human raters consistently judged the quality of SAM's masks to be significantly higher than RITM's, even on datasets where SAM had a lower mIoU (Fig. 9b). This suggests that SAM produces more plausible and accurate masks in practice, while baselines may be over-optimized for specific dataset biases.
  • 7.2. Zero-Shot Edge Detection. SAM was prompted with a grid of points to generate masks, and the mask boundaries were interpreted as edges (a simplified sketch of this conversion appears after this section).

    Figure 10: Zero-shot edge prediction on BSDS500 (left: input image; middle: ground truth edge map; right: edges predicted by SAM). SAM was not trained to predict edge maps nor did it have access to BSDS images or annotations during training.

    • Results: As shown in the transcribed Table 3 below, SAM's zero-shot performance is remarkably strong, significantly outperforming classic methods like Canny and approaching the performance of deep learning models like HED that were fully trained on this task. Qualitatively, SAM produces clean and sensible edge maps. Table 3: Zero-shot transfer to edge detection on BSDS500. (Manual Transcription)

      | method | year | ODS | OIS | AP | R50 |
      | --- | --- | --- | --- | --- | --- |
      | HED [108] | 2015 | .788 | .808 | .840 | .923 |
      | EDETR [79] | 2022 | .840 | .858 | .896 | .930 |
      | zero-shot transfer methods: | | | | | |
      | Sobel filter | 1968 | .539 | - | - | - |
      | Canny [13] | 1986 | .600 | .640 | .580 | - |
      | Felz-Hutt [35] | 2004 | .610 | .640 | .560 | - |
      | SAM | 2023 | .768 | .786 | .794 | .928 |
  • 7.3. Zero-Shot Object Proposals. SAM's automatic mask generation procedure was used to generate object proposals.

    • Results: The transcribed Table 4 shows that SAM is highly competitive with a strong, fully-supervised ViTDet detector. Notably, SAM outperforms ViTDet on medium and large objects, as well as on rare object categories, demonstrating its strong general knowledge of "objectness." Its ambiguity-aware model (SAM) significantly outperforms a single-output version (SAM single out.). Table 4: Object proposal generation on LVIS v1. (Manual Transcription)

      | method | all | small | med. | large | freq. | com. | rare |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | ViTDet-H [62] | 63.0 | 51.7 | 80.8 | 87.0 | 63.1 | 63.3 | 58.3 |
      | zero-shot transfer methods: | | | | | | | |
      | SAM single out. | 54.9 | 42.8 | 76.7 | 74.4 | 54.7 | 59.8 | 62.0 |
      | SAM | 59.3 | 45.5 | 81.6 | 86.9 | 59.1 | 63.9 | 65.8 |
  • 7.4. Zero-Shot Instance Segmentation. SAM was composed with an object detector: the detector provided bounding box prompts, and SAM segmented the object within each box (a minimal sketch of this composition appears after this section).

    • Results: According to the standard mask AP metric (Table 5, transcribed below), SAM lags behind the fully-supervised ViTDet. Table 5: Instance segmentation results. (Manual Transcription)

      | method | COCO [66] AP | COCO APs | COCO APm | COCO APl | LVIS v1 [44] AP | LVIS APs | LVIS APm | LVIS APl |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      | ViTDet-H [62] | 51.0 | 32.0 | 54.3 | 68.9 | 46.6 | 35.0 | 58.0 | 66.3 |
      | zero-shot transfer methods (segmentation module only): | | | | | | | | |
      | SAM | 46.5 | 30.8 | 51.0 | 61.7 | 44.7 | 32.5 | 57.6 | 65.5 |

    • However, in a human study (Figure 11), raters found SAM's masks to be of higher quality than ViTDet's. The authors hypothesize that ViTDet learns to replicate the specific biases and imperfections of the COCO/LVIS ground truth annotations, whereas SAM, being a zero-shot model, produces more geometrically correct and visually pleasing masks.

      Figure 11: Mask quality rating distribution from our human study for ViTDet and SAM, both applied to LVIS ground truth boxes. We also report LVIS and COCO ground truth quality. The legend shows rating means and 95% confidence intervals.

  • 7.5. Zero-Shot Text-to-Mask. This proof-of-concept experiment shows SAM can segment objects from free-form text prompts. This was achieved by a clever training strategy: during training, SAM was prompted with CLIP image embeddings of each mask's image region; at inference, it was prompted with CLIP text embeddings, relying on the fact that CLIP aligns its image and text embedding spaces.

    • Results: Qualitative results in Figure 12 show SAM successfully segmenting objects from simple text ("a wheel") and more nuanced phrases ("beaver tooth grille").

      Figure 12: Zero-shot text-to-mask. SAM can work with simple and nuanced text prompts ("a wheel", "a wiper", "beaver tooth grille"). When SAM fails to make a correct prediction, an additional point prompt can help.

  • 7.6. Ablations

    Figure 5: Image-size normalized mask center distributions for SA-1B, LVIS v1, COCO, ADE20K, and Open Images.

    • Data Source: Training on only the automatically generated masks from Stage 3 yields performance nearly identical to training on all data from all three stages. This greatly simplifies the training pipeline.
    • Data Volume: Performance with ~10% of the SA-1B dataset (1M images, ~100M masks) is already very strong and comparable to using the full dataset.
    • Model Scale: Performance improves significantly from ViT-B to ViT-L, but the gain from ViT-L to ViT-H is marginal, suggesting diminishing returns from simply scaling the image encoder further.
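
The edge-detection protocol of Section 7.2 (masks from a grid of point prompts, with mask boundaries read out as edges) can be approximated as follows. This is a simplified sketch that reuses the automatic mask generator from the Section 4 sketch and a one-pixel morphological boundary; the paper's actual post-processing differs.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def masks_to_edge_map(masks):
    """Union of per-mask boundaries as a crude stand-in for an edge map.

    `masks` is a list of boolean (H, W) arrays, e.g. the "segmentation" field
    of each record returned by the automatic mask generator.
    """
    edges = np.zeros(masks[0].shape, dtype=bool)
    for m in masks:
        # Boundary pixels are those that change under a one-pixel dilation vs. erosion.
        edges |= binary_dilation(m) ^ binary_erosion(m)
    return edges
```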
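
Likewise, the detector-plus-SAM composition of Section 7.4 reduces to prompting the predictor with each detected box. The sketch below assumes boxes are given as XYXY pixel coordinates from an arbitrary off-the-shelf detector and reuses the SamPredictor from the Section 4 sketch.

```python
import numpy as np

def boxes_to_masks(predictor, image, boxes_xyxy):
    """Zero-shot instance segmentation: one SAM mask per detector box."""
    predictor.set_image(image)                # run the image encoder once
    instance_masks = []
    for box in np.asarray(boxes_xyxy):        # boxes from any object detector
        masks, scores, _ = predictor.predict(
            box=box,                          # length-4 XYXY box prompt
            multimask_output=False,           # a box is usually unambiguous; take one mask
        )
        instance_masks.append(masks[0])       # boolean (H, W) mask
    return instance_masks
```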

7. Conclusion & Reflections

  • Conclusion Summary: The "Segment Anything" paper introduces a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that together represent a major leap forward in computer vision. By training a promptable model on a massive and diverse dataset created via a novel "data engine," the authors have created the first true foundation model for image segmentation. SAM demonstrates powerful zero-shot generalization capabilities, enabling it to perform a wide array of segmentation tasks on unseen image distributions with impressive quality, often rivaling or exceeding specialist, fully-supervised models.

  • Limitations & Future Work:

    • The authors acknowledge that while SAM is a general model for segmentation, it is still a "limited scope" foundation model, as it only addresses one (albeit important) area of computer vision.
    • The text-to-mask capability is demonstrated as a proof-of-concept and is less developed than the geometric prompting capabilities.
    • The model's zero-shot performance is strong, but there is still a gap compared to fully-supervised methods on some standard benchmarks (like instance segmentation AP), even if human studies prefer SAM's outputs.
    • While SAM's real-time decoder is fast, the heavyweight image encoder requires significant computation, which could be a limitation for resource-constrained applications that need to process novel images on the fly.
  • Personal Insights & Critique:

    • Paradigm Shift: This work marks a turning point for computer vision, moving away from the paradigm of training one specialist model for every task. SAM acts as a powerful, composable tool that can be integrated into larger systems, much like CLIP has been. This is arguably a "GPT-3 moment" for image segmentation.
    • The Data Engine is Key: The methodology for creating the SA-1B dataset is as significant as the model itself. It provides a blueprint for how to overcome the data bottleneck in other domains by using a model-in-the-loop system to scale data creation.
    • Redefining "Performance": The discrepancy between automatic metrics (like AP) and human evaluations highlights a critical issue in the field. By learning to be more general, SAM avoids overfitting to the specific annotation styles and artifacts of benchmarks like COCO. This suggests that as models become more general, we may need to rely more on human-centric evaluations or develop new metrics that better capture true visual quality.
    • Future Impact: SAM is poised to become a fundamental building block for a new generation of vision applications. Its ability to generate a mask for "anything" with a simple prompt opens up countless possibilities in creative tools, scientific analysis, robotics, and augmented reality. The public release of the model and dataset will undoubtedly accelerate research in this direction.
