
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Published: 06/17/2020

TL;DR Summary

SwAV enables unsupervised visual feature learning by contrasting cluster assignments across augmented views, avoiding costly pairwise comparisons. Using multi-crop augmentation and efficient online training without memory banks or momentum networks, it reaches 75.3% top-1 accuracy on ImageNet with a ResNet-50 and surpasses supervised pretraining on transfer tasks.

Abstract

Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.

In-depth Reading

1. Bibliographic Information

1.1. Title

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

The title clearly states the paper's core research area: unsupervised visual feature learning. It specifies the novel technique used, which involves a combination of clustering and contrastive principles. The phrase "contrasting cluster assignments" suggests that instead of comparing image features directly (as in standard contrastive learning), the method operates on a higher level of abstraction—the cluster memberships of the images.

1.2. Authors

The authors are Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin.

Their affiliations are Inria (the French National Institute for Research in Digital Science and Technology) and Facebook AI Research (FAIR). The authors are prominent researchers in computer vision and machine learning. FAIR, in particular, has been a leading institution in the field of self-supervised learning, producing seminal works like MoCo, DETR, and DeepCluster. This strong institutional and personal background lends significant credibility to the research.

1.3. Journal/Conference

The paper was published in Advances in Neural Information Processing Systems (NeurIPS) 2020.

NeurIPS is a premier, top-tier international conference on machine learning and computational neuroscience. Its acceptance criteria are extremely rigorous. Publication at NeurIPS signifies that the work is considered to be of high quality, originality, and significant impact by the research community.

1.4. Publication Year

The original preprint was submitted in June 2020.

1.5. Abstract

The abstract introduces an unsupervised learning algorithm named SwAV (Swapping Assignments between multiple Views). The authors position SwAV as an advancement over existing contrastive learning methods, which are often computationally intensive due to the need for many pairwise feature comparisons. SwAV is an online algorithm that avoids these direct comparisons. Instead, it simultaneously clusters the data and enforces consistency between the cluster assignments of different augmented "views" of the same image. This is achieved through a "swapped prediction" mechanism: the cluster assignment (or "code") of one view is predicted from the feature representation of another view. The authors highlight SwAV's efficiency, noting it does not require a large memory bank or a momentum network, making it scalable and memory-efficient. A new data augmentation strategy called multi-crop is also introduced, which uses a mix of high- and low-resolution image crops to improve performance without a significant increase in computational cost. The paper's key result is achieving 75.3% top-1 accuracy on ImageNet with a ResNet-50, and outperforming supervised pretraining on all evaluated transfer learning tasks.


2. Executive Summary

2.1. Background & Motivation

The central problem addressed by this paper is the high cost and inefficiency of supervised learning, which relies on vast amounts of manually labeled data for pretraining deep neural networks. Self-Supervised Learning (SSL) has emerged as a powerful alternative, aiming to learn rich visual representations from unlabeled data by devising "pretext tasks."

At the time of this paper's publication, contrastive learning was the state-of-the-art approach in SSL. Methods like MoCo and SimCLR learned representations by pulling different augmented views of the same image (positive pairs) together in the feature space, while pushing views from different images (negative pairs) apart. However, these methods faced significant challenges:

  1. Computational Complexity: They required a massive number of negative samples to work well. This was achieved either through very large batch sizes (SimCLR), which demanded substantial hardware (many GPUs), or through a large memory bank/queue and a momentum encoder (MoCo), which added complexity and memory overhead. The core operation involved explicit pairwise comparisons, which is computationally expensive.

  2. Scalability of Clustering: An alternative SSL paradigm, clustering-based methods like DeepCluster, treated cluster assignments as pseudo-labels. These methods were efficient as they avoided pairwise comparisons. However, they were typically offline, requiring a full pass over the entire dataset to perform clustering before the training step could begin. This made them impractical for extremely large, web-scale datasets where one might only perform a single pass.

    The paper's innovative entry point is to bridge the gap between contrastive and clustering methods. The authors propose to create an online clustering framework that retains the efficiency of clustering (avoiding pairwise comparisons) but works online like contrastive methods, making it scalable and efficient.

2.2. Main Contributions / Findings

The paper makes three primary contributions that significantly advanced the field of self-supervised learning:

  1. A Novel Online Clustering Method (SwAV): The paper introduces SwAV, a method that contrasts cluster assignments instead of features. It uses a "swapped prediction" mechanism where the model must predict the cluster assignment (code) of one image view from the features of another view. This is an online process that does not require pairwise comparisons, a large memory bank, or a momentum encoder, making it more efficient and scalable than prior contrastive methods.

  2. A New Data Augmentation Strategy (Multi-crop): The authors propose multi-crop, a simple yet highly effective augmentation strategy. Instead of using two standard-resolution views of an image, multi-crop uses two high-resolution views and multiple additional low-resolution views. This increases the number of "positive" pairs the model sees without a substantial increase in computational or memory load, leading to a consistent performance boost of 2-4% across various SSL methods.

  3. State-of-the-Art Performance and Surpassing Supervised Pretraining: By combining SwAV and multi-crop, the paper achieved a new state of the art on the ImageNet linear evaluation benchmark (75.3% top-1 with ResNet-50). More importantly, it was the first SSL method to demonstrate that its learned features, when transferred, could outperform features from a standard ImageNet supervised pretraining on a wide range of downstream tasks (including image classification and object detection), even when the features were kept frozen. This was a landmark achievement, proving that SSL could be a superior alternative to supervised pretraining.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Self-Supervised Learning (SSL)

SSL is a subfield of machine learning where a model learns from data that has not been manually labeled. The core idea is to create a "pretext task" using the data itself, forcing the model to learn meaningful representations to solve it. For example, a model might be tasked with predicting a missing patch in an image (inpainting), predicting the relative position of two patches (jigsaw puzzle), or colorizing a grayscale image. The learned representations can then be transferred to "downstream tasks" (e.g., image classification) by training a simple linear classifier on top of the frozen features or by fine-tuning the entire network.

3.1.2. Contrastive Learning

Contrastive learning is a dominant paradigm in SSL. It trains a model to distinguish between similar and dissimilar things. In the context of visual learning, the model is given an "anchor" (an augmented view of an image) and must identify a "positive" sample (another augmented view of the same image) from a set of "negative" samples (views of different images).

The goal is to learn an embedding function that maps positive pairs close together in the feature space and pushes negative pairs far apart. A common loss function used for this is the InfoNCE (Noise Contrastive Estimation) loss:

$ \mathcal{L}_{i} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{N} \mathbf{1}_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)} $

  • $z_i$ and $z_j$ are the feature vectors of a positive pair (e.g., two views of the same image).
  • $\text{sim}(u, v)$ is a similarity function, typically the cosine similarity $u^\top v / (\|u\|\,\|v\|)$.
  • The sum in the denominator is over all other $N-1$ samples in the batch, which act as negatives.
  • $\tau$ is a temperature hyperparameter that controls the sharpness of the distribution. A lower temperature makes the task harder by amplifying distances, forcing the model to learn more discriminative features.
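
To make this concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss for a batch of paired views. It is a simplified variant (SimCLR, for instance, also draws negatives from the same view set), and the function name, shapes, and default temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, tau=0.1):
    """InfoNCE over a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (B, D) feature batches; row i of each is a view of image i.
    For anchor z1[i], z2[i] is the positive and the other rows of z2
    act as the negatives.
    """
    z1 = F.normalize(z1, dim=1)              # cosine similarity via dot products
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau               # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)  # positives lie on the diagonal
```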

3.1.3. Clustering-based SSL

Another approach to SSL involves using clustering algorithms. These methods typically operate in two alternating steps:

  1. Cluster Features: Run a clustering algorithm (like k-means) on the feature representations of the entire dataset. This groups images with similar features together.

  2. Train Network: Use the resulting cluster assignments as "pseudo-labels" to train the neural network in a standard supervised classification setup.

    The model learns to predict which cluster an image belongs to. By iterating these two steps, both the feature representations and the quality of the clusters improve. The main drawback is that the clustering step is offline, requiring access to the entire dataset, which is not scalable.
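
To highlight the offline bottleneck, the sketch below shows one such alternation in the DeepCluster style. It is a toy sketch, not the paper's implementation: the whole dataset is assumed to fit in a single tensor, and `model`, `classifier`, and `optimizer` are placeholder modules:

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def offline_clustering_epoch(model, classifier, images, optimizer, k=3000):
    """One offline iteration: cluster all features, then fit pseudo-labels."""
    # Step 1 (offline): a full pass over *every* image to collect features.
    with torch.no_grad():
        feats = model(images)                             # (N, D)
    pseudo = torch.as_tensor(                             # (N,) pseudo-labels
        KMeans(n_clusters=k).fit_predict(feats.cpu().numpy())).long()
    # Step 2: ordinary supervised training against the pseudo-labels.
    for idx in torch.randperm(len(images)).split(256):    # mini-batches
        loss = F.cross_entropy(classifier(model(images[idx])), pseudo[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```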

3.2. Previous Works

3.2.1. Instance Discrimination and Contrastive Methods

  • Instance-level Classification (Dosovitskiy et al., 2016; Wu et al., 2018, NPID): These pioneering works proposed treating each image in the dataset as its own class. This becomes intractable for large datasets like ImageNet. Wu et al. [58] in NPID (Non-Parametric Instance Discrimination) addressed this by using a memory bank to store a feature representation for each instance, and used noise-contrastive estimation to compare the current image's features to a sample of features from the memory bank.

  • Momentum Contrast (MoCo) (He et al., 2019): MoCo improved upon the memory bank idea by introducing a momentum encoder. Instead of storing static features, the memory bank (or queue) stores features from a slowly-progressing, momentum-updated version of the encoder. This ensures that the negative samples in the queue are more consistent with the current encoder's features, leading to more stable training.

  • SimCLR (Chen et al., 2020): SimCLR (A Simple Framework for Contrastive Learning) showed that the memory bank and momentum encoder could be eliminated entirely if the batch size is made very large (e.g., 4096 or 8192). In this setup, the negative samples for a given image are simply all the other images within the same batch. SimCLR also emphasized the importance of a non-linear projection head and strong data augmentation.

3.2.2. Clustering-based Methods

  • DeepCluster (Caron et al., 2018): This was a highly influential offline clustering method. It iteratively groups features using k-means and then uses the cluster assignments as pseudo-labels to train the network. A key issue was the instability caused by re-initializing the classification layer at each epoch, as cluster indices are arbitrary.

  • SeLa (Asano et al., 2020): SeLa (Self-Labelling) provided a more principled formulation for clustering-based SSL by framing the pseudo-label assignment as an optimal transport problem. It enforces an "equipartition" constraint, ensuring that clusters are of equal size, which prevents the trivial solution where all images are assigned to a single cluster. However, like DeepCluster, it was an offline method.

3.3. Technological Evolution

The field of SSL for vision evolved rapidly:

  1. Early Pretext Tasks: Handcrafted tasks like predicting rotations, solving jigsaw puzzles, or colorization. These were often heuristic.

  2. Instance Discrimination: A more general principle where each image is its own class. This led to contrastive learning.

  3. Contrastive Learning Refinements: The evolution from NPID's memory bank to MoCo's momentum encoder and queue, and finally SimCLR's large-batch approach, marked a progression towards more effective and stable training of contrastive models.

  4. Offline Clustering: Methods like DeepCluster and SeLa showed that clustering was a powerful source of supervision but were limited by their offline nature.

    SwAV is positioned at the confluence of these two streams. It takes the core idea from clustering (assigning features to a set of prototypes) but formulates it in a contrastive-like "swapped prediction" framework that can be solved online within a mini-batch, thus combining the scalability of contrastive methods with the computational efficiency of clustering.

3.4. Differentiation Analysis

  • SwAV vs. Contrastive Methods (SimCLR, MoCo):

    • Comparison Target: SimCLR/MoCo perform direct feature-to-feature comparisons. An anchor feature is contrasted against many other negative features. SwAV compares a feature to a small, fixed set of trainable prototypes ($K$ vectors) and then contrasts the resulting cluster assignments. This is more efficient as $K$ (e.g., 3000) is typically much smaller than the number of negatives required by contrastive methods (e.g., 65,536 in MoCo).
    • Mechanism: SimCLR requires huge batches. MoCo requires a momentum encoder and a large queue. SwAV requires neither, making it more memory-efficient and simpler to implement in a distributed setting.
  • SwAV vs. Clustering Methods (DeepCluster, SeLa):

    • Online vs. Offline: DeepCluster and SeLa are offline. They require a full pass over the dataset to cluster features and generate pseudo-labels for the next training epoch. SwAV is fully online. Cluster assignments ("codes") are computed on-the-fly for each mini-batch, making it scalable to arbitrarily large datasets where only a single pass is feasible.

    • Hard vs. Soft Assignments: DeepCluster uses hard (one-hot) assignments from k-means. SwAV uses soft assignments derived from the Sinkhorn-Knopp algorithm, which the paper finds to be more effective for online training.

      The following figure from the paper visually summarizes the key difference between contrastive instance learning and SwAV.

      Figure 1: Contrastive instance learning (left) vs. SwAV (right). In contrastive instance learning, features from different transformations of the same image are compared to each other directly. In SwAV, features are instead assigned to cluster prototypes to produce codes, and a swapped prediction mechanism enforces consistency of these codes across views.


4. Methodology

4.1. Principles

The core idea of SwAV is to learn representations by enforcing consistency between the cluster assignments of different views of the same image. Rather than directly comparing the raw feature vectors, SwAV introduces an intermediate step: mapping features to a set of trainable "prototypes." This transforms the problem from "are these two feature vectors similar?" to "do these two views assign to the same clusters in a similar way?".

This is operationalized through a "swapped" prediction problem: the model must predict the cluster assignment (or "code") of one view using the feature representation of a different view from the same image. If two views capture the same underlying semantic information, then the features from one should be sufficient to infer the cluster assignment of the other.

4.2. Core Methodology In-depth (Layer by Layer)

Let's break down the entire process step-by-step, integrating the mathematical formulas.

4.2.1. Feature and Code Generation

For each image in a batch, SwAV generates multiple "views" through data augmentation. For the basic case, let's consider two views.

  1. Augmentation and Feature Extraction: An image $x_n$ is transformed into two augmented views, $x_{nt}$ and $x_{ns}$. Each view is passed through a neural network (e.g., ResNet) $f_\theta$ followed by a projection head to produce a feature vector. This vector is then L2-normalized to lie on the unit sphere. Let's call these final feature vectors $\mathbf{z}_t$ and $\mathbf{z}_s$.

  2. Code Computation: The features $\mathbf{z}_t$ and $\mathbf{z}_s$ are then assigned to a set of $K$ trainable prototype vectors $\{\mathbf{c}_1, \dots, \mathbf{c}_K\}$. These prototypes can be thought of as the "cluster centers." The resulting assignments are called "codes," denoted $\mathbf{q}_t$ and $\mathbf{q}_s$. These codes are soft probability distributions over the $K$ prototypes. The exact method for computing them is a key part of SwAV and is explained in section 4.2.3.
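
The following shape-level sketch traces this two-step pipeline in PyTorch. The encoder is a linear stand-in for the ResNet plus projection head, the views are random tensors, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, K = 32, 128, 3000                     # batch size, feature dim, num. prototypes
f_theta = nn.Sequential(nn.Flatten(),       # stand-in for ResNet + projection head
                        nn.Linear(3 * 224 * 224, D))
prototypes = nn.Linear(D, K, bias=False)    # trainable prototype matrix C (SwAV also
                                            # keeps the prototypes L2-normalized)

x_t = torch.randn(B, 3, 224, 224)           # stand-ins for two augmented views
x_s = torch.randn(B, 3, 224, 224)
z_t = F.normalize(f_theta(x_t), dim=1)      # (B, D) L2-normalized features z_t
z_s = F.normalize(f_theta(x_s), dim=1)      # (B, D) L2-normalized features z_s
scores_t = prototypes(z_t)                  # (B, K) dot products z_t^T c_k
```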

4.2.2. The Swapped Prediction Loss

The central objective of SwAV is defined by the "swapped" prediction loss. Given the features $\mathbf{z}_t, \mathbf{z}_s$ and their corresponding codes $\mathbf{q}_t, \mathbf{q}_s$, the loss for this pair of views is:

$ L(\mathbf{z}_t, \mathbf{z}_s) = \ell(\mathbf{z}_t, \mathbf{q}_s) + \ell(\mathbf{z}_s, \mathbf{q}_t) $

This loss has two symmetric terms.

  • $\ell(\mathbf{z}_t, \mathbf{q}_s)$: This term measures how well we can predict the code of the second view ($\mathbf{q}_s$) using the features of the first view ($\mathbf{z}_t$).

  • $\ell(\mathbf{z}_s, \mathbf{q}_t)$: Symmetrically, this term measures how well we can predict the code of the first view ($\mathbf{q}_t$) using the features of the second view ($\mathbf{z}_s$).

    The function $\ell(\mathbf{z}, \mathbf{q})$ is the cross-entropy loss between the code $\mathbf{q}$ (which acts as the "target") and a probability distribution derived from the feature $\mathbf{z}$. Specifically:

$ \ell(\mathbf{z}_t, \mathbf{q}_s) = - \sum_k \mathbf{q}_s^{(k)} \log \mathbf{p}_t^{(k)} $

where $\mathbf{q}_s^{(k)}$ is the $k$-th element of the code vector $\mathbf{q}_s$ (the probability of assigning to prototype $k$), and $\mathbf{p}_t^{(k)}$ is the probability of feature $\mathbf{z}_t$ belonging to prototype $k$. This probability is calculated using a softmax over the dot products between the feature and all prototypes:

$ \mathbf{p}_t^{(k)} = \frac{\exp\left(\frac{1}{\tau} \mathbf{z}_t^\top \mathbf{c}_k\right)}{\sum_{k'} \exp\left(\frac{1}{\tau} \mathbf{z}_t^\top \mathbf{c}_{k'}\right)} $

  • $\mathbf{z}_t^\top \mathbf{c}_k$ is the dot product (similarity) between the feature $\mathbf{z}_t$ and the $k$-th prototype vector $\mathbf{c}_k$. Since features and prototypes are L2-normalized, this is equivalent to cosine similarity.

  • $\tau$ is a temperature parameter. It controls the sharpness of the probability distribution. A smaller $\tau$ leads to a sharper, more confident distribution.

    The network parameters $\theta$ (of $f_\theta$) and the prototype vectors $\mathbf{C} = [\mathbf{c}_1, \dots, \mathbf{c}_K]$ are both learned jointly by minimizing this loss function via backpropagation.
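
A minimal PyTorch sketch of this swapped loss for two views follows; it assumes the codes have already been computed (see section 4.2.3) and detached, and the names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    """Swapped prediction loss L(z_t, z_s) for two views (sketch).

    z_t, z_s: (B, D) L2-normalized view features.
    q_t, q_s: (B, K) soft codes from Sinkhorn-Knopp, used as fixed
              targets (detached, so no gradient flows through them).
    C:        (D, K) matrix of trainable prototypes.
    """
    log_p_t = F.log_softmax(z_t @ C / tau, dim=1)   # log p_t over the K prototypes
    log_p_s = F.log_softmax(z_s @ C / tau, dim=1)
    # Cross-entropy with swapped targets: predict q_s from z_t and q_t from z_s.
    loss_ts = -(q_s * log_p_t).sum(dim=1).mean()    # l(z_t, q_s)
    loss_st = -(q_t * log_p_s).sum(dim=1).mean()    # l(z_s, q_t)
    return loss_ts + loss_st
```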

4.2.3. Online Code Computation with Sinkhorn-Knopp

A crucial part of SwAV is how the codes $\mathbf{q}$ are computed online for each mini-batch. These codes are not fixed targets; they are computed on-the-fly and used in the loss calculation for that same batch. This is what makes SwAV an online method.

For a batch of $B$ feature vectors $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_B]$, the goal is to find an assignment matrix $\mathbf{Q} = [\mathbf{q}_1, \dots, \mathbf{q}_B]$ that maps these features to the $K$ prototypes $\mathbf{C}$. The paper frames this as an optimization problem inspired by optimal transport:

$ \max_{\mathbf{Q} \in \mathcal{Q}} \text{Tr}(\mathbf{Q}^\top \mathbf{C}^\top \mathbf{Z}) + \varepsilon H(\mathbf{Q}) $

  • $\text{Tr}(\mathbf{Q}^\top \mathbf{C}^\top \mathbf{Z})$: This term maximizes the similarity between features and their assigned prototypes. It is equivalent to $\sum_{i,j} \mathbf{Q}_{ij} (\mathbf{C}^\top \mathbf{Z})_{ij}$.

  • $H(\mathbf{Q}) = -\sum_{ij} \mathbf{Q}_{ij} \log \mathbf{Q}_{ij}$: This is the entropy of the assignment matrix $\mathbf{Q}$. The parameter $\varepsilon$ controls the strength of this entropy regularization. A higher $\varepsilon$ leads to a "softer," more uniform assignment.

  • $\mathcal{Q}$: This is the set of valid assignment matrices. The paper adapts the constraint from SeLa to work on mini-batches. It is the transportation polytope:

    $ \mathcal{Q} = \left\{ \mathbf{Q} \in \mathbb{R}_+^{K \times B} \;\middle|\; \mathbf{Q}\mathbf{1}_B = \frac{1}{K}\mathbf{1}_K,\ \mathbf{Q}^\top\mathbf{1}_K = \frac{1}{B}\mathbf{1}_B \right\} $

  • $\mathbf{Q} \in \mathbb{R}_+^{K \times B}$: The assignment matrix must have non-negative entries.

  • $\mathbf{Q}\mathbf{1}_B = \frac{1}{K}\mathbf{1}_K$: This constraint means that, summed across all $B$ samples in the batch, each of the $K$ prototypes is used an equal number of times on average. This is the equipartition constraint that prevents the trivial solution of all features mapping to a single prototype.

  • $\mathbf{Q}^\top\mathbf{1}_K = \frac{1}{B}\mathbf{1}_B$: This constraint ensures that the columns of $\mathbf{Q}$ (the code for each sample) sum to $1/B$, which can be normalized to sum to 1 to represent a probability distribution.

    The solution to this optimization problem, $\mathbf{Q}^*$, can be found efficiently using the iterative Sinkhorn-Knopp algorithm. The solution has a specific diagonal scaling form:

$ \mathbf{Q}^* = \text{Diag}(\mathbf{u}) \exp\left(\frac{\mathbf{C}^\top \mathbf{Z}}{\varepsilon}\right) \text{Diag}(\mathbf{v}) $

  • $\mathbf{u} \in \mathbb{R}^K$ and $\mathbf{v} \in \mathbb{R}^B$ are renormalization vectors that are computed iteratively by alternating between row and column normalizations. The paper finds that just 3 iterations are sufficient.

    This algorithm yields a soft code $\mathbf{Q}^*$. The authors found that using this soft code directly works better than rounding it to a hard, one-hot assignment, as rounding is too aggressive for online mini-batch training. The computed codes $\mathbf{Q}^*$ are then "stopped" from propagating gradients (e.g., under torch.no_grad() in PyTorch) and used as targets in the swapped prediction loss.
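
The following is a minimal, non-distributed sketch of this procedure, following the iteration described above; the defaults ($\varepsilon = 0.05$, 3 iterations) are illustrative:

```python
import torch

@torch.no_grad()  # codes serve as targets only; no gradient flows through them
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp normalization producing soft codes Q* (sketch).

    scores: (K, B) matrix of prototype-feature dot products C^T Z.
    Returns a (K, B) matrix whose columns, after the final rescaling,
    are probability distributions over the K prototypes.
    """
    Q = torch.exp(scores / eps)            # core of Diag(u) exp(C^T Z / eps) Diag(v)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)    # rows: equal prototype usage...
        Q /= K                             # ...total mass 1/K per prototype
        Q /= Q.sum(dim=0, keepdim=True)    # columns: mass 1/B per sample
        Q /= B
    return Q * B                           # rescale so each column sums to 1
```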

4.2.4. Multi-Crop Augmentation

The paper introduces a simple but powerful data augmentation strategy to increase the number of positive pairs without a large computational cost.

Instead of creating just two high-resolution views (e.g., $224 \times 224$), multi-crop creates:

  • Two standard-resolution views.

  • $V$ additional low-resolution views (e.g., $96 \times 96$), which are smaller random crops of the original image.

    The following image from the paper appendix illustrates this strategy.

    Figure 5: Multi-crop: the image $x_n$ is transformed into $V+2$ views: two global views and $V$ small-resolution zoomed views. Each view is passed through the network $f_\theta$ to produce a feature representation $z$ that enters the loss.

The loss function is generalized to handle these multiple views. A key detail is that codes are only computed from the two standard-resolution views. These high-quality codes then serve as targets for all other views (both standard and low-resolution). The generalized loss is:

$ L(\mathbf{z}_{t_1}, \mathbf{z}_{t_2}, \dots, \mathbf{z}_{t_{V+2}}) = \sum_{i \in \{1, 2\}} \sum_{v=1}^{V+2} \mathbf{1}_{v \neq i}\, \ell(\mathbf{z}_{t_v}, \mathbf{q}_{t_i}) $

Let's break this down:

  • The outer sum iterates over the two standard-resolution views ($i \in \{1, 2\}$), whose codes $\mathbf{q}_{t_1}$ and $\mathbf{q}_{t_2}$ are used as targets.

  • The inner sum iterates over all $V+2$ views.

  • For each standard-view code $\mathbf{q}_{t_i}$, we predict it using the features $\mathbf{z}_{t_v}$ from all other views ($v \neq i$).

    This strategy creates many more positive pairs for comparison. For example, with 2 standard and 4 small views ($V=4$), each image would yield $(2+4) \times (2+4-1) - 4 \times (4-1) = 30 - 12 = 18$ pairs if we were to compute codes for all views. By only using codes from the 2 standard views, we have $2 \times (2+4-1) = 10$ prediction terms. This setup significantly boosts performance by forcing the model to learn representations that are robust across different scales and resolutions, while the low resolution of the small crops keeps the additional computational cost manageable.
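
A sketch of this generalized loss, reusing the `sinkhorn` function from the previous snippet; entries 0 and 1 of `z_views` are assumed to be the two standard-resolution views:

```python
import torch
import torch.nn.functional as F

def multicrop_loss(z_views, C, tau=0.1):
    """Generalized swapped loss over V+2 views (sketch).

    z_views: list of (B, D) L2-normalized features, one entry per view;
             the first two entries are the standard-resolution views.
    C:       (D, K) trainable prototype matrix.
    """
    total, n_terms = 0.0, 0
    for i in (0, 1):                                   # codes from global views only
        q_i = sinkhorn(C.t() @ z_views[i].t()).t()     # (B, K) targets, gradient-free
        for v, z_v in enumerate(z_views):
            if v == i:                                 # the indicator 1_{v != i}
                continue
            log_p_v = F.log_softmax(z_v @ C / tau, dim=1)
            total = total - (q_i * log_p_v).sum(dim=1).mean()
            n_terms += 1
    return total / n_terms
```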


5. Experimental Setup

5.1. Datasets

The authors used several standard and large-scale datasets to validate SwAV's performance in different settings:

  • ImageNet (ILSVRC-2012): The standard dataset for both pretraining and evaluation in SSL. It contains ~1.28 million training images and 50,000 validation images across 1000 classes. This is the primary benchmark for measuring representation quality.

  • Instagram-1B: A massive, uncurated dataset of 1 billion random public images from Instagram. This dataset was used to test SwAV's scalability and its ability to learn from noisy, real-world data without any domain-specific filtering.

  • Places205: A scene-centric dataset with ~2.4 million training images across 205 scene categories. Used for evaluating transfer learning performance on a different domain (scenes vs. objects).

  • PASCAL VOC07+12: A popular benchmark for object detection. The model is pretrained on ImageNet, then fine-tuned on the VOC07 and VOC12 trainval sets, and evaluated on the VOC07 test set.

  • iNaturalist-2018 (iNat18): A large-scale fine-grained classification dataset with over 437,513 images across 8,142 species. This tests the model's ability to transfer to tasks requiring subtle distinctions.

  • COCO (Common Objects in Context): A large-scale object detection, segmentation, and captioning dataset. It is a more challenging benchmark than VOC and is standard for evaluating modern object detectors.

    These datasets were chosen to provide a comprehensive evaluation: pretraining performance (ImageNet), scalability (Instagram-1B), and generalization to various downstream tasks (classification, fine-grained classification, and object detection) on Places205, iNat18, VOC, and COCO.

5.2. Evaluation Metrics

The paper uses standard metrics for each task.

5.2.1. Top-k Accuracy

  • Conceptual Definition: Used for image classification. A prediction is considered correct if the true class is among the top-kk classes with the highest predicted probabilities. Top-1 accuracy means the single highest-probability prediction must be correct. Top-5 accuracy means the true class must be within the top five predictions. It measures the model's ability to correctly classify an image.
  • Mathematical Formula: $ \text{Top-k Accuracy} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}(\text{true\_label}_i \in \text{top\_k\_predictions}_i) $
  • Symbol Explanation:
    • $N$: Total number of samples in the evaluation set.
    • $\text{true\_label}_i$: The ground-truth label for the $i$-th sample.
    • $\text{top\_k\_predictions}_i$: The set of the $k$ labels with the highest predicted scores for the $i$-th sample.
    • $\mathbf{1}(\cdot)$: An indicator function that is 1 if the condition is true, and 0 otherwise.
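
A small sketch of this metric in PyTorch (names illustrative):

```python
import torch

def topk_accuracy(logits, targets, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    logits: (N, C) class scores; targets: (N,) ground-truth class indices.
    """
    topk = logits.topk(k, dim=1).indices               # (N, k) predicted classes
    hit = (topk == targets.unsqueeze(1)).any(dim=1)    # per-sample indicator
    return hit.float().mean().item()
```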

5.2.2. mean Average Precision (mAP)

  • Conceptual Definition: Used for object detection (on VOC). mAP provides a single-figure measure of detection quality across all classes and recall levels. It is the mean of the Average Precision (AP) scores calculated for each object class. AP itself summarizes the shape of the precision-recall curve, rewarding models that maintain high precision across different recall thresholds.
  • Mathematical Formula: First, we need Precision and Recall: $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $ $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
  • Symbol Explanation:
    • TP (True Positives): Correctly detected objects (Intersection over Union (IoU) with ground truth > threshold).
    • FP (False Positives): Incorrectly detected objects (IoU < threshold or a duplicate detection).
    • FN (False Negatives): Ground-truth objects that were not detected.
  • Average Precision (AP) for a single class is the area under the precision-recall curve: $ \text{AP} = \sum_{k=1}^n (R_k - R_{k-1}) P_k $ where $P_k$ and $R_k$ are the precision and recall at the $k$-th threshold.
  • mAP is the average of AP over all object classes: $ \text{mAP} = \frac{1}{C} \sum_{i=1}^C \text{AP}_i $ where $C$ is the number of classes. For PASCAL VOC, this is typically calculated at a single IoU threshold of 0.5 (denoted $\text{AP}_{50}$).
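
A minimal sketch of these two formulas in plain Python, assuming per-class precision/recall lists have already been computed at each threshold:

```python
def average_precision(precisions, recalls):
    """AP = sum_k (R_k - R_{k-1}) * P_k, the step-wise area under the
    precision-recall curve (sketch; recalls assumed non-decreasing)."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(per_class_aps):
    """mAP: the mean of the per-class AP values."""
    return sum(per_class_aps) / len(per_class_aps)
```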

5.2.3. COCO AP

  • Conceptual Definition: The primary metric for the COCO dataset, which is more stringent than VOC's mAP. It is the mAP averaged over multiple IoU thresholds (from 0.5 to 0.95 in steps of 0.05). This rewards detectors that are accurate at various levels of localization strictness.
  • Mathematical Formula: It follows the same principle as mAP, but is averaged across 10 different IoU thresholds.
  • Symbol Explanation: The paper reports AP (the primary COCO metric), $\text{AP}_{50}$ (AP at IoU = 0.5), $\text{AP}_{75}$ (AP at IoU = 0.75), and $\text{AP}_S$, $\text{AP}_M$, $\text{AP}_L$ for small, medium, and large objects, respectively.

5.3. Baselines

The paper compares SwAV against a strong set of baselines representing the state-of-the-art in supervised and self-supervised learning at the time:

  • Supervised: A ResNet-50 trained on ImageNet with labels. This is the canonical benchmark that SSL methods aim to match or surpass.

  • Handcrafted Pretext Tasks: Methods like Colorization and Jigsaw.

  • Contrastive Methods:

    • NPID / NPID++: Early instance-discrimination methods.
    • MoCo / MoCov2: State-of-the-art momentum-based contrastive learning.
    • SimCLR: State-of-the-art large-batch contrastive learning.
    • PIRL, PCL: Other concurrent contrastive methods.
  • Clustering-based Methods:

    • DeepCluster-v2: An improved, reimplemented version of DeepCluster by the authors to ensure a fair comparison.
    • SeLa / SeLa-v2: The optimal-transport-based clustering method.
  • Hybrid Methods: BigBiGAN (GAN-based), CPC v2 (predictive coding).

    These baselines provide a comprehensive context, allowing for a direct comparison against the best competing paradigms in SSL.


6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Linear Classification on ImageNet

This is the standard protocol to evaluate the quality of learned features. A linear classifier is trained on top of the frozen features from the pretrained model.

The following figure from the paper shows the main results on a standard ResNet-50 and wider variants.

Figure 2: Linear classification on ImageNet. Top-1 accuracy for linear models trained on frozen features from different self-supervised methods. (left) Performance with a standard ResNet-50. (right) Performance with wider ResNet-50 variants (2x, 4x, and 5x width).

Analysis:

  • ResNet-50 (Left Table): SwAV achieves 75.3% top-1 accuracy. This is a significant improvement over the previous best SSL methods, MoCov2 (71.1%) and SimCLR (70.0%). The gap of +4.2% over MoCov2 (and +5.3% over SimCLR) is substantial. SwAV comes very close to the supervised baseline of 76.5%, reducing the gap to just 1.2%.
  • Wider Models (Right Chart): As the model capacity increases (ResNet-50 width multiplied by 2x, 4x, 5x), SwAV's performance scales gracefully, similar to the supervised models. Crucially, the gap between SwAV and supervised learning shrinks even further, reaching just 0.6% for the largest model. This shows that the benefits of SwAV are not limited to a specific architecture.

6.1.2. Transfer Learning to Downstream Tasks

This is arguably the most impactful result of the paper. It evaluates how well the pretrained features generalize to new tasks and datasets.

The following are the results from Table 2 of the original paper:

                Linear Classification            Object Detection
Method          Places205   VOC07   iNat18      VOC07+12 (Faster R-CNN R50-C4)   COCO (Mask R-CNN R50-FPN)   COCO (DETR)
Supervised      53.2        87.5    46.7        81.3                             39.7                        40.8
SwAV            56.7        88.9    48.6        82.6                             41.6                        42.1

Analysis:

  • Linear Classification: On all three classification datasets (Places205, VOC07, iNat18), SwAV outperforms the supervised ImageNet pretrained model. For instance, on Places205, SwAV achieves 56.7% accuracy compared to the supervised model's 53.2%. This was a groundbreaking result, as no previous SSL method had consistently surpassed the supervised baseline on these standard transfer tasks with frozen features.

  • Object Detection: Similarly, when fine-tuned for object detection on VOC07+12 and COCO using three different detector frameworks (Faster R-CNN, Mask R-CNN, DETR), SwAV backbones consistently outperform supervised backbones. For instance, with Mask R-CNN on COCO, SwAV achieves 41.6 AP, compared to 39.7 AP for the supervised model.

    This is the central evidence for the paper's claim that SSL can serve as a superior pretraining strategy compared to traditional supervised pretraining.

6.2. Data Presentation (Tables)

6.2.1. Semi-Supervised Learning

This experiment evaluates performance when fine-tuning the model on a small fraction of labeled ImageNet data.

The following are the results from Table 1 of the original paper:

                                       1% labels          10% labels
Method                                 Top-1    Top-5     Top-1    Top-5
Supervised                             25.4     48.4      56.4     80.4
Methods using label propagation:
  UDA [60]                             -        -         68.8*    88.5*
  FixMatch [51]                        -        -         71.5*    89.1*
Methods using self-supervision only:
  PIRL [44]                            30.7     57.2      60.4     83.8
  PCL [37]                             -        75.6      -        86.2
  SimCLR [10]                          48.3     75.5      65.6     87.8
  SwAV                                 53.9     78.5      70.2     89.9

Note: * indicates use of RandAugment.

Analysis: SwAV provides a much better starting point for fine-tuning than training from scratch (supervised) or other SSL methods. With only 1% of labels, SwAV (53.9%) significantly outperforms SimCLR (48.3%) and the supervised baseline (25.4%). With 10% of labels, it remains competitive with state-of-the-art semi-supervised methods like FixMatch, even though SwAV was not designed specifically for this setting.

6.2.2. Small Batch Training

This ablation tests SwAV's efficiency in a more constrained hardware setting.

The following are the results from Table 3 of the original paper:

Method    Mom. Encoder   Stored Features   multi-crop     epochs   batch   Top-1
SimCLR    no             0                 2x224          200      256     61.9
MoCov2    yes            65,536            2x224          200      256     67.5
MoCov2    yes            65,536            2x224          800      256     71.1
SwAV      no             3,840             2x160+4x96     200      256     72.0
SwAV      no             3,840             2x224+6x96     200      256     72.7
SwAV      no             3,840             2x224+6x96     400      256     74.3

Analysis: With a small batch size of 256, SwAV excels. It outperforms MoCov2 while using a feature queue that is ~17x smaller (3,840 vs 65,536) and without needing a momentum encoder. Furthermore, SwAV learns much faster: it reaches 72.0% in 200 epochs, while MoCov2 needs 800 epochs to reach 71.1%. This demonstrates SwAV's superior efficiency and suitability for training with limited resources.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Impact of Multi-Crop and Comparison to Clustering

The paper investigates the source of SwAV's performance gains, separating the effects of the clustering-based loss and the multi-crop strategy.

The following are the results from the table in Figure 3 (left) of the original paper:

                                Top-1
Method                2x224   2x160+4x96   Gain
Supervised            76.5    76.0         -0.5
Contrastive-instance approaches:
  SimCLR              68.2    70.6         +2.4
Clustering-based approaches:
  SeLa-v2             67.2    71.8         +4.6
  DeepCluster-v2      70.2    74.3         +4.1
  SwAV                70.1    74.1         +4.0

Analysis:

  • Clustering vs. Instance-Contrastive: Without multi-crop (column "2x224"), the improved offline clustering method DeepCluster-v2 (70.2%) and online SwAV (70.1%) both outperform the strong instance-contrastive baseline SimCLR (68.2%). This suggests that clustering-based objectives are highly effective.
  • Impact of Multi-crop: The multi-crop strategy provides a significant and consistent performance boost to all SSL methods tested. SimCLR improves by +2.4%, while the clustering-based methods see an even larger gain of around +4.0%. This confirms that multi-crop is a general and powerful augmentation technique. Interestingly, it slightly hurts the supervised model, suggesting it is particularly beneficial for learning invariances in an unsupervised manner.
  • SwAV vs. DeepCluster-v2: SwAV performs on par with the strong offline baseline DeepCluster-v2. However, SwAV has the crucial advantage of being an online method, making it practical for massive datasets where offline clustering is infeasible.

6.3.2. Pretraining on Uncurated Data

This experiment on the 1B Instagram dataset demonstrates SwAV's robustness and scalability.

The following are results from the table in Figure 4 (left) of the original paper:

Method    Frozen   Finetuned
Random    15.0     76.5
MoCo*     -        77.3*
SimCLR    60.4     77.2
SwAV      66.5     77.8

Note: * indicates pretraining on a curated subset of Instagram.

Analysis:

  • When pretrained on the same large, uncurated dataset, SwAV again significantly outperforms SimCLR in the linear evaluation setting (66.5% vs. 60.4%).

  • When the full network is fine-tuned on ImageNet, the SwAV-pretrained model achieves 77.8% top-1 accuracy, surpassing training from scratch (76.5%) and also outperforming the SimCLR pretrained model (77.2%). This shows that SwAV is an excellent pretraining method that scales well to massive, noisy datasets.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents SwAV, a novel and highly effective self-supervised learning method that combines the strengths of clustering and contrastive learning. By proposing an online clustering loss based on a "swapped prediction" problem, SwAV avoids the computationally expensive pairwise comparisons of traditional contrastive methods and the scalability limitations of offline clustering methods. It is efficient, works well with both large and small batches, and does not require complex components like a memory bank or momentum encoder.

Additionally, the paper introduces the multi-crop data augmentation strategy, a simple and general technique that significantly boosts performance by increasing the number of views per image at a low computational cost.

The combination of these contributions resulted in state-of-the-art performance on ImageNet and, most significantly, demonstrated for the first time that a self-supervised pretrained model could consistently outperform a standard supervised pretrained model on a wide array of downstream transfer learning tasks. This marked a pivotal moment for the field, establishing self-supervised learning as a viable and even superior alternative to supervised pretraining.

7.2. Limitations & Future Work

The authors themselves point to a promising direction for future research:

  • Architecture Exploration: Most neural architectures (like ResNet) have been designed and optimized for supervised learning. The authors suggest that self-supervised methods like SwAV could be used to guide the search for new architectures that are better suited for learning without labels. This opens up possibilities for neural architecture search (NAS) in an unsupervised setting.
  • Combination with Other Techniques: The paper notes that SwAV's design is orthogonal to mechanisms like the momentum encoder and large queue from MoCo. Future work could explore combining these techniques to potentially achieve even better performance, although it would trade some of SwAV's simplicity for it.

7.3. Personal Insights & Critique

  • Elegance and Insight: The core idea of SwAV—contrasting cluster assignments instead of features—is an elegant simplification of the contrastive learning problem. It recasts the problem from an N-to-N comparison to an N-to-K comparison (where K is the number of prototypes), which is inherently more efficient. The "swapped prediction" task is an intuitive and powerful way to enforce representation consistency.

  • Practical Impact: SwAV's ability to outperform supervised pretraining on transfer tasks was a major catalyst for the adoption of SSL in practice. It provided strong evidence that learning from the structure of the data itself can lead to more generalizable and robust representations than simply learning to map images to a fixed set of 1000 labels. The multi-crop strategy is also a highly practical contribution that has been widely adopted in subsequent SSL research.

  • Potential Issues and Unverified Assumptions:

    • Hyperparameter Sensitivity: While the ablation on the number of prototypes ($K$) showed robustness, the method still introduces other hyperparameters, such as the temperature $\tau$ and the Sinkhorn regularization parameter $\varepsilon$. Their interplay and optimal setting might require careful tuning.

    • The Role of Prototypes: The paper shows that learned prototypes are better than fixed random ones, but the performance with random prototypes is still surprisingly strong. This suggests that the prototypes may be acting more as a set of "anchors" for contrasting views rather than forming semantically meaningful clusters in the traditional sense. The true nature of what the prototypes learn could be a subject for further investigation.

    • Theoretical Grounding: While inspired by optimal transport, the theoretical justification for why the online, mini-batch Sinkhorn-Knopp procedure combined with the swapped loss leads to such powerful representations is not fully fleshed out. The empirical success is clear, but a deeper theoretical analysis would be beneficial.

      Overall, SwAV is a landmark paper that masterfully synthesized ideas from different SSL paradigms to create a more scalable, efficient, and powerful method. Its success redefined the ceiling for self-supervised learning and solidified its position as a cornerstone of modern computer vision.
