
Learning Transferable Visual Models From Natural Language Supervision

Published: 02/27/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study presents a method for learning transferable visual models from natural language supervision, showing that pretraining on 400 million image-text pairs enables zero-shot transfer across various tasks, rivaling fully supervised models in performance.

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The paper is titled "Learning Transferable Visual Models From Natural Language Supervision," which also summarizes its central topic.

1.2. Authors

The authors are Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. All authors are affiliated with OpenAI.

OpenAI is a prominent artificial intelligence research laboratory based in San Francisco, California. It is widely recognized for its contributions to deep learning, particularly in the fields of natural language processing (NLP) with models like GPT (Generative Pre-trained Transformer) and computer vision. OpenAI has a strong reputation for pushing the boundaries of AI research, often publishing influential papers and releasing open-source models that significantly impact the broader AI community.

1.3. Journal/Conference

This paper was released on 2021-02-26 at 19:04:58 (UTC). While the provided information doesn't specify a formal conference or journal name, the date and "arXiv preprint" status (as indicated by the original source link) indicate it was released as a preprint on arXiv, a popular open-access archive for scientific papers. Papers on arXiv are not peer-reviewed before publication but are widely read and cited within the research community, often preceding formal publication in top-tier conferences (e.g., NeurIPS, ICML, CVPR, ICLR) or journals. Given OpenAI's track record, this paper received significant attention upon its preprint release.

1.4. Publication Year

The paper was released in 2021, first appearing as an arXiv preprint in February 2021.

1.5. Abstract
1.5. Abstract

State-of-the-art computer vision systems traditionally rely on training with a fixed set of predefined object categories, which limits their generality and requires extensive re-labeling for new visual concepts. This paper proposes a promising alternative: learning directly from raw text about images, a much broader source of supervision. The authors demonstrate that a simple pre-training task—predicting which caption corresponds to which image—is an efficient and scalable method to learn state-of-the-art image representations. They achieve this by training a model, called CLIP (Contrastive Language-Image Pre-training), from scratch on a massive dataset of 400 million (image, text) pairs collected from the internet.

After pre-training, CLIP uses natural language to reference learned visual concepts or describe new ones, enabling zero-shot transfer to downstream tasks without additional labeled data. The authors benchmark CLIP's performance across over 30 diverse computer vision datasets, covering tasks like optical character recognition (OCR), action recognition in videos, geo-localization, and fine-grained object classification. The model demonstrates non-trivial transfer performance on most tasks, often rivaling fully supervised baselines. For instance, CLIP matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without utilizing its 1.28 million training examples. The authors also release their code and pre-trained model weights.

The paper is available as an arXiv preprint at https://arxiv.org/abs/2103.00020, making it publicly accessible and widely shared within the scientific community.

https://arxiv.org/pdf/2103.00020v1.pdf

2. Executive Summary

2.1. Background & Motivation

The paper addresses a fundamental limitation in conventional computer vision (CV) systems: their reliance on fixed sets of predetermined object categories. Traditionally, these systems are trained on meticulously hand-labeled datasets (like ImageNet, which contains 1000 object classes), requiring significant human effort to define and annotate classes. This approach creates several critical problems:

  • Limited Generality and Usability: If a user wants to detect a visual concept not included in the original 1000 categories, the model typically needs to be fine-tuned (further trained) on a new, labeled dataset specific to that concept. This process is costly, time-consuming, and limits the system's flexibility in real-world applications where new concepts constantly emerge.

  • Scalability Bottleneck: Creating large-scale, high-quality, crowd-labeled datasets for every conceivable visual concept is practically impossible. The traditional paradigm struggles to scale to the vast and ever-growing diversity of visual information.

  • Brittleness to Distribution Shift: Models trained on specific distributions (e.g., ImageNet photos) often perform poorly when encountering natural variations in data (e.g., different styles of images, backgrounds, or contexts), a phenomenon known as distribution shift.

    The paper draws inspiration from the revolution in Natural Language Processing (NLP), where pre-training methods learning directly from raw, uncurated web text (e.g., autoregressive or masked language modeling objectives) have led to highly transferable and task-agnostic models like GPT-3. These NLP models can perform a wide array of downstream tasks with little to no dataset-specific training (i.e., zero-shot or few-shot learning), demonstrating that web-scale collections of text contain a richer source of supervision than traditional crowd-labeled datasets.

The paper's entry point and innovative idea are to apply this successful NLP paradigm to computer vision. Specifically, it asks: "Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?" The authors propose to leverage the implicit supervision contained in natural language accompanying images on the internet to train visual models that are more general, flexible, and robust, without the need for explicit gold labels for every visual concept.

2.2. Main Contributions / Findings

The paper makes several primary contributions and reports key findings that significantly advance the field of computer vision:

  • Introduction of CLIP (Contrastive Language-Image Pre-training): The paper proposes a novel and efficient pre-training method that uses a simple contrastive learning objective. Instead of trying to predict exact captions, CLIP learns to predict which text description matches which image within a batch. This approach creates a powerful multimodal embedding space where images and their corresponding text descriptions are brought closer together.
  • Creation of a Large-Scale WebImageText (WIT) Dataset: To enable web-scale pre-training, the authors constructed a new dataset of 400 million (image, text) pairs by collecting publicly available data from the internet, a scale orders of magnitude larger than previous datasets used for similar tasks. This dataset is crucial for the success of CLIP, providing a broad source of natural language supervision.
  • Demonstration of Strong Zero-Shot Transfer Capabilities: After pre-training, CLIP can perform zero-shot classification on a wide variety of downstream computer vision tasks without any additional training or fine-tuning on task-specific labeled data.
    • SOTA-Matching Performance: CLIP demonstrates remarkable performance, often being competitive with fully supervised baselines. Notably, it matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without using any of ImageNet's 1.28 million training examples.
    • Broad Task Coverage: CLIP's task-learning capabilities are extensively benchmarked across over 30 diverse datasets, spanning tasks such as OCR (optical character recognition), action recognition in videos, geo-localization, and various types of fine-grained object classification. This broad evaluation highlights its generality.
  • Improved Robustness to Natural Distribution Shift: The paper finds that zero-shot CLIP models are significantly more robust to natural distribution shifts (i.e., performance on new test sets with differing image characteristics) compared to supervised ImageNet models of equivalent accuracy. This suggests that zero-shot evaluation better reflects a model's true capabilities and is less susceptible to spurious correlations learned from specific training distributions.
  • Discovery of Scaling Laws: The authors observe that CLIP's transfer performance is a smoothly predictable function of compute, similar to observations in large language models like GPT. This log-log linear scaling trend suggests a clear path for future performance improvements by simply increasing compute and model capacity.
  • Analysis of Societal Impacts: The paper includes a dedicated section discussing the broader impacts of such a powerful and flexible model, including social biases (e.g., in race, gender, age classification, and crime-related associations) and its potential use in surveillance. It emphasizes the importance of class design and thresholding in mitigating biases.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the CLIP paper, a beginner should be familiar with several core concepts from machine learning, particularly deep learning, computer vision, and natural language processing.

  • Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data with multiple levels of abstraction.

    • Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers, processing information through weights and biases.
    • Weights and Biases: Parameters within a neural network that are adjusted during training to learn patterns in the data. Weights determine the strength of connections between neurons, and biases provide an offset to the activation of a neuron.
    • Activation Functions: Mathematical functions that introduce non-linearity into the network, allowing it to learn complex patterns. Examples include ReLU (Rectified Linear Unit), sigmoid, and softmax.
    • Loss Function (Objective Function): A function that quantifies the difference between the predicted output of a model and the true target values. The goal of training is to minimize this loss.
    • Optimizer: An algorithm (e.g., Stochastic Gradient Descent (SGD), Adam) used to adjust the model's weights and biases to minimize the loss function.
    • Backpropagation: An algorithm used to efficiently compute the gradients of the loss function with respect to the network's weights, enabling their iterative adjustment.
    • Pre-training: An initial training phase where a model is trained on a large, general dataset with a broad objective (e.g., predicting the next word, distinguishing image-text pairs). The learned representations (features) are then transferred to downstream tasks.
    • Fine-tuning: A subsequent training phase where a pre-trained model's weights are slightly adjusted using a smaller, task-specific dataset and objective to adapt it to a particular downstream task.
    • Representation Learning: The process of automatically discovering good representations of raw data for specific tasks. A good representation captures the essential information from the data in a useful format, often making downstream tasks easier.
  • Computer Vision (CV): A field of artificial intelligence that enables computers to "see" and interpret visual data from the world.

    • Image Classification: The task of assigning a label or category to an entire input image (e.g., "cat," "dog," "car").
    • Object Detection: Identifying and localizing objects within an image by drawing bounding boxes around them and assigning a class label to each.
    • Semantic Segmentation: Assigning a class label to every pixel in an image, effectively partitioning the image into meaningful regions (e.g., "road," "sky," "person").
    • Convolutional Neural Networks (CNNs): A class of deep neural networks specifically designed for processing grid-like data such as images. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
      • Convolutional Layer: A core building block of CNNs that applies a set of learnable filters (kernels) to the input image, producing feature maps that highlight specific patterns (edges, textures, etc.).
      • Pooling Layer: Reduces the spatial dimensions of the feature maps, helping to make the learned features more robust to minor variations in input and reducing computational cost.
      • ResNet (Residual Network): A type of CNN architecture (He et al., 2016a) that addresses the problem of vanishing gradients in very deep networks by introducing skip connections or residual connections. These connections allow gradients to flow directly through layers, enabling the training of much deeper models.
    • Vision Transformer (ViT): A recent architecture (Dosovitskiy et al., 2020) that applies the Transformer architecture (originally for NLP) directly to images. It works by splitting an image into fixed-size patches, linearly embedding them, adding positional embeddings, and feeding the resulting sequence of vectors to a standard Transformer encoder.
  • Natural Language Processing (NLP): A field of AI concerned with enabling computers to understand, interpret, and generate human language.

    • Language Models: Statistical models that learn the probability distribution of sequences of words in a language. They can predict the next word in a sentence or fill in masked words.
    • Embeddings (Word, Sentence, Text): Numerical vector representations of words, sentences, or entire text snippets. Words with similar meanings have similar embeddings. Byte Pair Encoding (BPE) is a subword tokenization algorithm used to handle rare words and out-of-vocabulary words by breaking them into common subword units.
    • Transformer: A neural network architecture (Vaswani et al., 2017) that relies heavily on the self-attention mechanism to process sequences of data. It has revolutionized NLP due to its ability to capture long-range dependencies in text.
      • Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence when processing each element. For each token in a sequence, it computes query (Q), key (K), and value (V) vectors. The attention score for a token is calculated by taking the dot product of its query with all keys, scaling, and applying a softmax function. This score is then multiplied by the values. The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ is the matrix of queries, $K$ is the matrix of keys, $V$ is the matrix of values, and $d_k$ is the dimension of the keys, used for scaling.
      • Multi-Head Attention: Extends self-attention by running the attention mechanism multiple times in parallel with different learned linear projections of Q, K, V. This allows the model to focus on different parts of the input from various perspectives.
    • GPT (Generative Pre-trained Transformer): A family of autoregressive language models (Radford et al., 2018, 2019; Brown et al., 2020) developed by OpenAI. They are pre-trained on vast amounts of text data to predict the next token in a sequence and have shown impressive zero-shot and few-shot learning capabilities across many NLP tasks.
    • BERT (Bidirectional Encoder Representations from Transformers): A masked language model (Devlin et al., 2018) that learns bidirectional contextual embeddings by predicting masked words in a sentence and also predicting whether two sentences follow each other.
  • Supervision Paradigms:

    • Supervised Learning: Training models on datasets where each input example is explicitly paired with a correct output label (e.g., image and its object category).
    • Weakly Supervised Learning: Training with noisy, imprecise, or incomplete labels, often obtained automatically or at a low cost (e.g., image paired with hashtags, rather than precise object bounding boxes).
    • Self-Supervised Learning: A form of unsupervised learning where the data itself provides the supervision. Models learn by solving a pretext task (e.g., predicting missing parts of an input, rotating an image) that doesn't require human-labeled data, thereby learning useful representations.
    • Unsupervised Learning: Learning patterns from unlabeled data, often to discover hidden structures or generate new data (e.g., clustering, dimensionality reduction).
  • Multimodal Learning: Combining information from multiple modalities (e.g., vision and language) to achieve a more comprehensive understanding of data.

    • Multimodal Embedding Space: A shared vector space where representations from different modalities (e.g., image embeddings and text embeddings) can be compared and related. In such a space, semantically similar items, regardless of modality, are embedded close to each other.
    • Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It measures the cosine of the angle between them. A cosine similarity of 1 means identical direction (most similar), 0 means orthogonal (no similarity), and -1 means opposite direction (most dissimilar). The formula for cosine similarity between two vectors $A$ and $B$ is: $ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $ where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, and $\|A\|$ and $\|B\|$ are their magnitudes.
    • Contrastive Learning: A self-supervised learning approach where models learn by contrasting positive pairs (similar samples) with negative pairs (dissimilar samples). The goal is to bring representations of positive pairs closer in the embedding space while pushing negative pairs apart.
      • InfoNCE Loss (Noise-Contrastive Estimation): A commonly used contrastive loss function (Oord et al., 2018) that aims to maximize the mutual information between different views of the same data point. It typically involves comparing a positive pair (e.g., an image and its caption) against a set of negative samples. Given a query embedding $q$ and a set of key embeddings $\{k_0, k_1, \dots, k_N\}$, where $k_0$ is the positive key and $k_1, \dots, k_N$ are negative keys, the InfoNCE loss is: $ L_q = -\log \frac{\exp(\text{sim}(q, k_0) / \tau)}{\sum_{i=0}^{N} \exp(\text{sim}(q, k_i) / \tau)} $ where $\text{sim}(q, k)$ is a similarity function (e.g., cosine similarity) and $\tau$ is a temperature parameter that scales the logits before the softmax. (A small NumPy sketch of cosine similarity and this loss appears after this list.)
      • Multi-class N-pair Loss: An extension of metric learning (Sohn, 2016) that uses $N-1$ negative samples for each anchor in a batch of $N$ samples. This loss is similar in spirit to InfoNCE and is the direct inspiration for CLIP's objective.
  • Zero-Shot Learning (ZSL): The ability of a model to recognize objects or perform tasks that it has never encountered during training, typically by leveraging side information (e.g., textual descriptions of classes).

  • Few-Shot Learning (FSL): The ability of a model to learn new tasks or recognize new classes from a very small number of examples (e.g., 1-shot, 5-shot).
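
To ground the cosine similarity and InfoNCE formulas above, here is a small, hedged NumPy sketch. The function names and toy shapes are illustrative choices for this analysis, not taken from the paper or any particular library.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) between two vectors: dot product divided by the product of magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce_loss(query, keys, temperature=0.07):
    """InfoNCE-style loss: keys[0] is the positive key, keys[1:] are negatives.

    Maximizes similarity(query, keys[0]) relative to all keys, via a softmax
    over similarities divided by the temperature.
    """
    sims = np.array([cosine_similarity(query, k) for k in keys]) / temperature
    sims -= sims.max()                                   # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum())
    return -log_softmax[0]                               # negative log-probability of the positive key

# toy usage: one positive key close to the query, several random negatives
rng = np.random.default_rng(0)
q = rng.normal(size=8)
keys = [q + 0.1 * rng.normal(size=8)] + [rng.normal(size=8) for _ in range(4)]
print(info_nce_loss(q, keys))
```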

3.2. Previous Works

The paper contextualizes its work by tracing relevant prior research across NLP, computer vision, and multimodal learning.

  • NLP Pre-training Revolution:

    • The paper explicitly credits the revolution in NLP, citing foundational work like Word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and more recent contextualized word embeddings and language models.
    • ELMo (Peters et al., 2018), ULMFiT (Howard & Ruder, 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and T5 (Raffel et al., 2019) are highlighted as systems that scaled task-agnostic objectives (like autoregressive and masked language modeling) across orders of magnitude in compute, model capacity, and data.
    • Text-to-text interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) enabled these architectures to zero-shot transfer without specialized output heads.
    • GPT-3 (Brown et al., 2020) is noted as a flagship system competitive with bespoke models with little to no dataset-specific training data. This demonstrated that web-scale text supervision can surpass high-quality crowd-labeled datasets in NLP.
  • Early Multimodal Vision-Language Learning:

    • The idea of learning visual models from text is not new, dating back over 20 years.
    • Mori et al. (1999) explored improving content-based image retrieval by predicting nouns and adjectives from paired text.
    • Quattoni et al. (2007) demonstrated more data-efficient image representations by learning classifiers for words in image captions.
    • Srivastava & Salakhutdinov (2012) used multimodal Deep Boltzmann Machines on low-level image and text features.
    • Joulin et al. (2016) modernized this by training CNNs to predict bag-of-words from YFCC100M metadata, showing transfer performance similar to ImageNet pre-training. This work is a direct conceptual precursor to CLIP's bag-of-words baseline.
    • Li et al. (2017) extended this to phrase n-grams and demonstrated zero-shot transfer by scoring target classes based on learned visual n-grams. CLIP directly compares against Li et al. (2017) on ImageNet, highlighting its significant performance leap from 11.5% to 76.2%.
  • Recent Text-Supervised Visual Representation Learning:

    • More recent works adopting modern architectures and objectives:
      • VirTex (Desai & Johnson, 2020): Used transformer-based language modeling to learn image representations from textual annotations.
      • ICMLM (Bulent Sariyildiz et al., 2020): Employed masked language modeling for image representation learning.
      • ConVIRT (Zhang et al., 2020): Adapted contrastive objectives for (text, image) representation learning, especially in medical imaging. CLIP is described as a "simplified version of ConVIRT trained from scratch."
  • Weakly Supervised Pre-training for Computer Vision:

    • While text-based supervision was rare due to low benchmark performance, narrowly scoped weak supervision has shown gains.
    • Mahajan et al. (2018) used ImageNet-related hashtags on 3.5 billion Instagram images, improving ImageNet accuracy by over 5%.
    • Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) demonstrated large gains by pre-training on noisily labeled JFT-300M dataset, predicting classes. These approaches are noted for carefully limiting supervision to 1000 and 18291 classes, respectively, unlike natural language's broader scope.
  • Contrastive Representation Learning:

    • Tian et al. (2019) found contrastive objectives learn better representations than equivalent predictive objectives for images.
    • Chen et al. (2020a) showed generative models require significantly more compute than contrastive models for similar performance. These findings directly influenced CLIP's choice of a contrastive objective for efficiency.
    • Sohn (2016) introduced multi-class N-pair loss, and Oord et al. (2018) popularized InfoNCE loss, both key inspirations for CLIP's batch construction technique and objective.

3.3. Technological Evolution

The evolution of visual models from natural language supervision can be broadly understood as a shift from limited, task-specific, hand-labeled datasets to vast, general, web-scale, implicitly supervised data.

  1. Early Image Retrieval (Pre-Deep Learning): Initial efforts (e.g., Mori et al., 1999) focused on content-based image retrieval by linking images to associated text, often using simpler textual features (nouns, adjectives) and statistical models. These were proofs-of-concept for multimodal understanding.
  2. Bridging Vision and Language with Machine Learning (2000s-early 2010s): Researchers explored learning visual representations by predicting words in captions (Quattoni et al., 2007) or using multimodal graphical models like Deep Boltzmann Machines (Srivastava & Salakhutdinov, 2012). These methods laid theoretical groundwork but were limited by computational power and data scale.
  3. Deep Learning Era - Initial Multimodal Applications (2010s): With the rise of deep learning and CNNs, the field saw modernization. Joulin et al. (2016) showed that CNNs trained on bag-of-words from web metadata could learn useful representations, competing with ImageNet pre-training. Li et al. (2017) further demonstrated zero-shot transfer using n-grams, but performance was still too low for practical application.
  4. Specialized Weak Supervision (Mid-Late 2010s): While general text-to-vision remained challenging, highly-focused weak supervision (e.g., Instagram hashtags by Mahajan et al., 2018) proved effective for improving ImageNet performance. These methods, however, still relied on predetermined class taxonomies and softmax classifiers, limiting flexibility.
  5. NLP-Inspired Architectures and Objectives (Late 2010s-Early 2020s): The success of Transformers and pre-training in NLP inspired analogous efforts in vision. VirTex, ICMLM, and ConVIRT started integrating Transformer-based language models and contrastive objectives for learning image representations from text, but often on relatively smaller datasets.
  6. CLIP's Breakthrough (2021): CLIP represents a culmination of these trends. It scales the contrastive learning objective and Transformer architectures to an unprecedented web-scale dataset (400M image-text pairs). This massive scale, combined with an efficient contrastive objective, enables CLIP to achieve state-of-the-art zero-shot transfer performance, bridging the gap between flexible natural language supervision and practical computer vision applications. It replicates the NLP paradigm's success in achieving task-agnostic and transferable models.

3.4. Differentiation Analysis

Compared to the main methods in related work, CLIP introduces several core differences and innovations:

  • Scale of Natural Language Supervision:

    • Prior Work (e.g., VirTex, ICMLM, ConVIRT): While these works explored similar ideas, they typically trained on much smaller datasets (e.g., MS-COCO, Visual Genome, YFCC100M, ranging from hundreds of thousands to tens of millions of images). Li et al. (2017) used YFCC100M but with limited filtering.
    • CLIP's Innovation: CLIP is trained on an unprecedented 400 million (image, text) pairs in its custom WIT (WebImageText) dataset. This scale, leveraging the vastness of the internet, is a direct differentiator and a key enabler of its performance, similar to how large datasets drove progress in NLP.
  • Efficiency of Pre-training Objective:

    • Generative/Predictive Baselines (e.g., VirTex, early CLIP attempts): Many prior approaches, including CLIP's own initial attempts, aimed to predict the exact words of an image caption. The paper demonstrates this is a computationally inefficient task (transformer-based language model learns 3x slower than bag-of-words baseline, Figure 2).
    • CLIP's Innovation: CLIP adopts a contrastive learning objective. Instead of predicting specific words, it learns to predict which text snippet as a whole is paired with which image. This proxy task of identifying correct (image, text) pairs from incorrect ones (e.g., InfoNCE loss or multi-class N-pair loss) yields a 4x efficiency improvement over the bag-of-words prediction baseline for zero-shot ImageNet classification (Figure 2). This focus on efficiency was crucial for scaling.
  • Zero-Shot Transfer Performance and Generality:

    • Prior Zero-Shot Approaches (e.g., Li et al., 2017 - Visual N-Grams): These methods showed the possibility of zero-shot transfer but achieved very low accuracy (e.g., 11.5% on ImageNet).
    • Weakly Supervised (e.g., Instagram-pretrained ResNeXt, JFT-300M models): These models achieved high performance but were task-specific (e.g., predicting 1000 or 18291 ImageNet-related classes) and lacked a mechanism for dynamic outputs or true open-set recognition. They still required static softmax classifiers.
    • CLIP's Innovation: CLIP achieves state-of-the-art zero-shot transfer performance, matching or even exceeding strong fully supervised baselines on many datasets (e.g., 76.2% on ImageNet zero-shot, matching ResNet-50). Its use of natural language prompts allows for dynamic classifier creation for any described visual concept, offering unprecedented flexibility and generality for open-set recognition (Figure 1).
  • Robustness to Distribution Shift:

    • ImageNet-trained Models: Previous studies (Taori et al., 2020) showed that ImageNet-trained models suffer significant performance drops on natural distribution shifts, suggesting they exploit spurious correlations within the ImageNet distribution.
    • CLIP's Innovation: CLIP demonstrates significantly higher effective robustness on natural distribution shift datasets. Zero-shot CLIP models reduce the gap between in-distribution and out-of-distribution accuracy by up to 75% (Figure 13). This is attributed to not being trained on specific task distributions and leveraging a very diverse pre-training dataset.
  • Scaling Laws for Transfer Performance:

    • Prior Vision Models: While large-scale pre-training existed, clear scaling laws for zero-shot transfer performance in vision, similar to those observed in NLP (Kaplan et al., 2020), were not as extensively documented.

    • CLIP's Innovation: CLIP explicitly demonstrates a smooth log-log linear scaling trend for zero-shot error rate as a function of model compute across various model sizes (Figure 9). This predictability guides future research towards larger, more capable models.

      In essence, CLIP differentiates itself by successfully bringing the web-scale, task-agnostic pre-training paradigm from NLP to computer vision, enabled by a massive, custom dataset and an efficient contrastive learning objective, resulting in highly flexible, performant, and robust zero-shot visual models.

4. Methodology

4.1. Principles

The core idea behind CLIP is to learn a highly transferable visual model by leveraging the abundant and diverse supervision contained in natural language paired with images on the internet. Instead of relying on traditional, fixed-category human-labeled datasets, CLIP aims to learn visual representations that are implicitly grounded in human language.

The theoretical basis and intuition are as follows:

  1. Natural Language as a Rich Supervision Signal: Natural language is incredibly expressive, capable of describing a vast, open-ended set of visual concepts (objects, actions, attributes, scenes, emotions, etc.). By learning from text associated with images, a model can acquire a much broader understanding of the visual world than from a limited set of pre-defined labels. This mirrors the success of large language models trained on web-scale text.
  2. Shared Embedding Space for Multimodal Understanding: The goal is to learn a joint multimodal embedding space where images and their corresponding text descriptions are mapped to nearby points. This means that an image of a "dog" and the text "a photo of a dog" will have similar vector representations, while an image of a "cat" and the text "a photo of a dog" will have dissimilar representations.
  3. Contrastive Learning for Efficiency and Effectiveness: Instead of trying to generate the exact caption of each image word by word (which is computationally intensive and difficult due to the variability of language), CLIP simplifies the task. It frames learning as a contrastive prediction problem: given a batch of image-text pairs, can the model identify which image matches which text from all possible pairings within that batch? This proxy task is much more efficient to train and has been shown to be highly effective at learning powerful representations in self-supervised learning.
  4. Zero-Shot Transfer through Language Prompting: Once trained, this shared embedding space allows for zero-shot transfer. To classify an image into a set of categories (e.g., "cat," "dog," "bird"), the model can:
    • Compute the embedding of the input image.
    • Compute the embeddings of the text descriptions for each category (e.g., "a photo of a cat," "a photo of a dog," "a photo of a bird").
    • The category whose text embedding is most similar (e.g., highest cosine similarity) to the image embedding is predicted as the class. This effectively allows natural language to "program" the visual classifier on the fly, without needing any labeled training examples for the new categories.

4.2. Core Methodology In-depth (Layer by Layer)

The CLIP methodology involves several key components: Natural Language Supervision, Dataset Creation, Efficient Pre-Training Method Selection, Model Architectures, Model Scaling, and Training Details.

4.2.1. Natural Language Supervision

CLIP's fundamental premise is to learn visual representations from natural language supervision. This means using descriptive text associated with images as the training signal, rather than discrete, human-annotated class labels. The authors highlight its strengths:

  • Scalability: It's easier to scale natural language supervision because it doesn't require annotations in a machine learning compatible format (like 1-of-N majority vote gold labels). Instead, it can passively learn from the vast amount of text on the internet.
  • Flexibility and Zero-Shot Transfer: It doesn't "just" learn a representation but also connects that representation to language, enabling flexible zero-shot transfer by simply using language to describe new visual concepts.

4.2.2. Creating a Sufficiently Large Dataset (WIT)

Existing datasets like MS-COCO (approx. 100,000 images) and Visual Genome (approx. 100,000 images) are too small for web-scale pre-training. YFCC100M (100 million images) has sparse and inconsistent metadata. After filtering for natural language titles/descriptions, YFCC100M shrinks to 15 million images.

To address this, the authors constructed a new dataset called WIT (WebImageText).

  • Scale: It comprises 400 million (image, text) pairs collected from publicly available internet sources.
  • Diversity: To cover a broad set of visual concepts, the collection process involved searching for (image, text) pairs where the text included one of 500,000 queries.
  • Balancing: The results were approximately class-balanced by including up to 20,000 (image, text) pairs per query (a rough sketch of this capping step follows this list).
  • Size comparison: The total word count of WIT is similar to the WebText dataset used to train GPT-2.
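
As a rough illustration of the per-query balancing described above (not the paper's actual data pipeline), the hedged sketch below caps the number of retained pairs per search query at 20,000; the function and variable names are assumptions made for this example.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # cap stated in the paper's dataset description

def balance_by_query(candidate_pairs):
    """Keep at most MAX_PAIRS_PER_QUERY (image_url, text) pairs per search query.

    candidate_pairs: iterable of (query, image_url, text) tuples, e.g. produced by
    searching for (image, text) pairs whose text contains one of the ~500,000 queries.
    """
    kept, counts = [], defaultdict(int)
    for query, image_url, text in candidate_pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((image_url, text))
    return kept
```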

4.2.3. Selecting an Efficient Pre-Training Method

Given the immense computational requirements of training large computer vision models, training efficiency was paramount.

  • Initial Approach (Generative/Predictive): The authors first tried an approach similar to VirTex, jointly training an image CNN and text transformer from scratch to predict the caption of an image.

    • Observation: This method proved inefficient. As shown in Figure 2, a 63 million parameter transformer language model (already using twice the compute of a ResNet-50 image encoder) learned to recognize ImageNet classes three times slower than a simpler bag-of-words (BoW) encoding baseline. This highlighted the difficulty of predicting exact words due to the wide variety of text associated with images.

      Figure 2. CLIP is much more efficient at zero-shot transfer than the image-caption baseline. Although highly expressive, transformer-based language models are relatively weak at zero-shot ImageNet classification: the transformer learns 3x slower than a baseline that predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP improves efficiency by a further 4x. (The figure plots zero-shot ImageNet accuracy against the number of training images processed for the three approaches.)

  • Contrastive Objective: Inspired by findings that contrastive objectives can learn better and more computationally efficient representations (Tian et al., 2019; Chen et al., 2020a), the authors shifted to a contrastive learning approach.

    • Goal: Predict which text as a whole is paired with which image, rather than the exact words.
    • Observation: Swapping the predictive objective for a contrastive objective in the bag-of-words encoding baseline resulted in a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet (Figure 2).

4.2.4. CLIP Training Objective

CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred, given a batch of $N$ real pairs.

The core of the CLIP implementation can be visualized with the following pseudocode (Figure 3 from the original paper):

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits.T, labels, axis=0) # equivalent to the paper's cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Let's break down the process step-by-step:

  1. Input:

    • I[n, h, w, c]: A minibatch of $n$ aligned images. Each image has dimensions height (h), width (w), and channels (c).
    • T[n, l]: A minibatch of $n$ aligned texts (captions), each of token length l.
    • image_encoder: The neural network responsible for processing images. It can be a ResNet or a Vision Transformer.
    • text_encoder: The neural network responsible for processing text. It can be a CBOW (Continuous Bag-of-Words) model or a Text Transformer.
  2. Feature Extraction:

    • I_f = image_encoder(I): The image_encoder processes the batch of images $I$ to extract visual feature representations. $I_f$ has shape $[n, d_i]$, where $d_i$ is the dimension of the image features.
    • T_f = text_encoder(T): The text_encoder processes the batch of texts $T$ to extract textual feature representations. $T_f$ has shape $[n, d_t]$, where $d_t$ is the dimension of the text features.
  3. Joint Multimodal Embedding:

    • $W_i[d_i, d_e]$: A learned linear projection matrix that maps image features to a shared multimodal embedding space of dimension $d_e$.
    • $W_t[d_t, d_e]$: A learned linear projection matrix that maps text features to the same multimodal embedding space.
    • I_e = l2_normalize(np.dot(I_f, W_i), axis=1): The image features $I_f$ are linearly projected into the multimodal embedding space using $W_i$ and then L2-normalized along the feature dimension (axis=1). $I_e$ has shape $[n, d_e]$.
    • T_e = l2_normalize(np.dot(T_f, W_t), axis=1): Similarly, the text features $T_f$ are linearly projected using $W_t$ and then L2-normalized. $T_e$ also has shape $[n, d_e]$.
    • L2-Normalization: Ensures that all embeddings lie on a unit hypersphere, making dot products equivalent to cosine similarity. This is crucial for the contrastive objective.
  4. Scaled Pairwise Cosine Similarities (Logits):

    • $t$: A learned temperature parameter, optimized directly during training as a log-parameterized multiplicative scalar. It controls the range of the logits in the softmax, preventing training instability by scaling the similarities.
    • logits = np.dot(I_e, T_e.T) * np.exp(t): The dot product of $I_e$ and the transpose of $T_e$ computes the pairwise cosine similarities between all $n$ image embeddings and all $n$ text embeddings in the batch, producing an [n, n] matrix whose element [i, j] is the similarity between image $i$ and text $j$. This matrix is then scaled by $e^t$.
      • The diagonal elements of this matrix (logits[k, k]) correspond to the similarities of the $n$ correct (image, text) pairs.
      • The off-diagonal elements (logits[i, j] where $i \neq j$) correspond to the similarities of the $n^2 - n$ incorrect (image, text) pairings (negative samples).
  5. Symmetric Loss Function:

    • labels = np.arange(n): An array $[0, 1, ..., n-1]$ is created. This serves as the target indices for the cross-entropy loss, indicating that image[i] should match text[i].

    • loss_i = cross_entropy_loss(logits, labels, axis=0): This calculates the cross-entropy loss from the perspective of the images. For each image $i$, the model tries to correctly identify its corresponding text $i$ among all $n$ texts in the batch: each image's row of similarity scores is compared against a target that places all probability on the matching text at the diagonal index.

    • loss_t = cross_entropy_loss(logits.T, labels, axis=0): This calculates the cross-entropy loss from the perspective of the texts. For each text $j$, the model tries to correctly identify its corresponding image $j$ among all $n$ images in the batch. Transposing logits swaps the roles of images and texts; the original paper's pseudocode expresses the same loss as cross_entropy_loss(logits, labels, axis=1).

    • loss = (loss_i + loss_t)/2: The final loss is the symmetric average of the image-side and text-side losses. This ensures that both modalities learn to project into the shared space effectively.

      This batch construction technique and objective were first introduced as multi-class N-pair loss (Sohn, 2016) and popularized as InfoNCE loss (Oord et al., 2018) for contrastive representation learning. It was recently adapted for (text, image) learning in medical imaging by Zhang et al. (2020), which CLIP builds upon.
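
To make the pseudocode above concrete, here is a minimal, runnable NumPy sketch of the symmetric contrastive loss, assuming random features in place of real encoder outputs; the helper names (l2_normalize, cross_entropy, clip_contrastive_loss) are illustrative and not the paper's released implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # scale each row to unit L2 norm so dot products become cosine similarities
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_entropy(logits, labels):
    # mean cross-entropy with integer labels; softmax taken over the last axis
    logits = logits - logits.max(axis=-1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_contrastive_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n (image, text) feature pairs.

    I_f: [n, d_i] image features, T_f: [n, d_t] text features,
    W_i, W_t: learned projections into the shared d_e-dimensional space,
    t: learned log-temperature (logits are scaled by exp(t)).
    """
    I_e = l2_normalize(I_f @ W_i)             # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)             # [n, d_e]
    logits = I_e @ T_e.T * np.exp(t)          # [n, n] scaled cosine similarities
    labels = np.arange(len(I_e))              # correct pair for row/column i is index i
    loss_i = cross_entropy(logits, labels)    # images -> which text?
    loss_t = cross_entropy(logits.T, labels)  # texts  -> which image?
    return (loss_i + loss_t) / 2

# toy usage with random features standing in for encoder outputs
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 32, 24, 16
loss = clip_contrastive_loss(rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t)),
                             rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e)),
                             t=np.log(1 / 0.07))
print(float(loss))
```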

Simplifications Compared to ConVIRT (Zhang et al., 2020):

  • Training from scratch: CLIP models are trained without ImageNet weights for the image encoder or pre-trained weights for the text encoder.
  • Linear Projection: Only a linear projection maps from each encoder's representation to the multimodal embedding space, unlike ConVIRT which uses a non-linear projection.
  • No Text Transformation: The text transformation function $t_u$ that samples a single sentence uniformly from the text is removed, as many (image, text) pairs in WIT are single sentences.
  • Simple Image Augmentation: A random square crop from resized images is the only data augmentation used during training.
  • Learned Temperature: The temperature parameter $\tau$ is directly optimized during training as a log-parameterized multiplicative scalar, avoiding manual hyperparameter tuning.

4.2.5. Choosing and Scaling a Model

CLIP uses two families of architectures for the image encoder and a Transformer for the text encoder.

  • Image Encoder Architectures:

    • ResNet-50 (He et al., 2016a) variants:
      • Base architecture: ResNet-50, chosen for its widespread adoption.
      • Modifications: the ResNet-D improvements (He et al., 2019); antialiased rect-2 blur pooling (Zhang, 2019); and an attention pooling mechanism that replaces the global average pooling layer. The attention pooling is a single layer of transformer-style multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image.
    • Vision Transformer (ViT) (Dosovitskiy et al., 2020) variants:
      • Base architecture: closely follows the original ViT implementation.
      • Minor modifications: an additional layer normalization is applied to the combined patch and position embeddings before the transformer blocks, and a slightly different initialization scheme is used.
  • Text Encoder Architecture:

    • Base architecture: a Transformer (Vaswani et al., 2017) with the architectural modifications described in Radford et al. (2019) (similar to GPT-2).
    • Size: a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.
    • Tokenization: operates on a lower-cased Byte Pair Encoding (BPE) representation of the text with a 49,152-token vocabulary (Sennrich et al., 2015).
    • Sequence length: capped at 76 tokens for computational efficiency.
    • Feature representation: the text sequence is bracketed with [SOS] (start of sentence) and [EOS] (end of sentence) tokens, and the activations of the highest transformer layer at the [EOS] token are used as the feature representation of the text. This representation is layer normalized and then linearly projected into the multimodal embedding space.
    • Masked self-attention: used in the text encoder to preserve the ability to initialize with a pre-trained language model or to add language modeling as an auxiliary objective (left for future work).
  • Model Scaling Strategy:

    • ResNet image encoders: adapts the approach of Tan & Le (2019) (EfficientNet) by equally allocating additional compute to increasing the width, depth, and resolution of the model, which is found to outperform scaling only one dimension.
    • Text encoder: only the width is scaled, proportionally to the calculated increase in ResNet width. The depth is not scaled, as CLIP's performance was less sensitive to the text encoder's capacity.

4.2.6. Training

A series of CLIP models were trained:

  • ResNet models: 5 models: ResNet-50, ResNet-101, and three EfficientNet-style scaled models (denoted RN50x4, RN50x16, and RN50x64) using approximately 4x, 16x, and 64x the compute of a ResNet-50.
  • Vision Transformer (ViT) models: 3 models: ViT-B/32, ViT-B/16, and ViT-L/14.
  • ViT-L/14@336px: the ViT-L/14 model was additionally pre-trained at a higher 336-pixel resolution for one extra epoch to boost performance, similar to FixRes (Touvron et al., 2019). This model performs best and is used as "CLIP" by default in the paper.

Common Training Hyperparameters (Table 18):

| Hyperparameter | Value |
| --- | --- |
| Batch size | 32768 |
| Vocabulary size | 49408 |
| Training epochs | 32 |
| Maximum temperature | 100.0 |
| Weight decay | 0.2 |
| Warm-up iterations | 2000 |
| Adam β1 | 0.9 |
| Adam β2 | 0.999 (ResNet), 0.98 (ViT) |
| Adam ε | 10^-8 (ResNet), 10^-6 (ViT) |

Specific Hyperparameters for CLIP-ResNet (Table 19):

| Model | Learning rate | Embedding dimension | Input resolution | ResNet blocks | ResNet width | Text layers | Text width | Text heads |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RN50 | 5 × 10^-4 | 1024 | 224 | (3, 4, 6, 3) | 2048 | 12 | 512 | 8 |
| RN101 | 5 × 10^-4 | 512 | 224 | (3, 4, 23, 3) | 2048 | 12 | 512 | 8 |
| RN50x4 | 5 × 10^-4 | 640 | 288 | (4, 6, 10, 6) | 2560 | 12 | 640 | 10 |
| RN50x16 | 4 × 10^-4 | 768 | 384 | (6, 8, 18, 8) | 3072 | 12 | 768 | 12 |
| RN50x64 | 3.6 × 10^-4 | 1024 | 448 | (3, 15, 36, 10) | 4096 | 12 | 1024 | 16 |

Specific Hyperparameters for CLIP-ViT (Table 20):

| Model | Learning rate | Embedding dimension | Input resolution | ViT layers | ViT width | ViT heads | Text layers | Text width | Text heads |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B/32 | 5 × 10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-B/16 | 5 × 10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-L/14 | 4 × 10^-4 | 768 | 224 | 24 | 1024 | 16 | 12 | 768 | 12 |
| ViT-L/14-336px | 2 × 10^-5 | 768 | 336 | 24 | 1024 | 16 | 12 | 768 | 12 |

Other Training Details:

  • Optimizer: Adam (Kingma & Ba, 2014).
  • Regularization: decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases.
  • Learning rate schedule: cosine decay (Loshchilov & Hutter, 2016).
  • Hyperparameter tuning: initial hyperparameters were set via grid searches, random search, and manual tuning on the baseline ResNet-50 trained for 1 epoch, then heuristically adapted for larger models due to computational constraints.
  • Temperature parameter initialization: $\tau$ was initialized to the equivalent of 0.07 from Wu et al. (2018) and clipped to prevent scaling the logits by more than 100, which avoids training instability.
  • Memory optimization: mixed-precision training (Micikevicius et al., 2017), gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The embedding similarity calculation was sharded across GPUs.
  • Compute: the largest ResNet model (RN50x64) took 18 days to train on 592 V100 GPUs; the largest Vision Transformer (ViT-L/14) took 12 days on 256 V100 GPUs.

4.2.7. Using CLIP for Zero-Shot Transfer

After pre-training, CLIP uses its learned multimodal embedding space for zero-shot classification:

  1. Class Name Processing: For a given downstream dataset, the names of all classes are used as potential text pairings.
  2. Embedding Generation:
    • The input image is passed through the image encoder to obtain its feature embedding.
    • Each class name (e.g., "cat," "dog") is passed through the text encoder to obtain its feature embedding.
  3. Similarity Calculation: The cosine similarity between the image embedding and each of the class text embeddings is calculated.
  4. Probability Distribution: These similarities are scaled by the learned temperature parameter $\tau$ and normalized into a probability distribution via a softmax function.
  5. Prediction: The class with the highest probability (i.e., the most probable (image, text) pair) is predicted as the image's label.

    Interpretation: The paper interprets this as the image encoder acting as the computer vision backbone and the text encoder functioning as a hypernetwork (Ha et al., 2016) that generates the weights of a linear classifier based on the text descriptions of visual concepts. Each step of CLIP pre-training is seen as optimizing a proxy to a computer vision dataset with 1 example per class and 32,768 total classes defined by natural language.
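
To make the zero-shot procedure concrete, here is a minimal NumPy sketch, assuming a generic text_encoder callable and a learned text_projection matrix as stand-ins for CLIP's text tower and projection; the function, parameter, and template names are illustrative, not the released API.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_names, text_encoder, text_projection, t,
                       template="a photo of a {}."):
    """Score an L2-normalized image embedding against natural-language class prompts.

    text_encoder / text_projection are stand-ins for CLIP's text tower and its
    learned projection into the shared embedding space; t is the learned
    log-temperature. Returns a probability for each class name.
    """
    prompts = [template.format(name) for name in class_names]
    T_f = np.stack([text_encoder(p) for p in prompts])      # [k, d_t] text features
    T_e = T_f @ text_projection                              # [k, d_e] shared-space embeddings
    T_e /= np.linalg.norm(T_e, axis=-1, keepdims=True)       # unit-normalize
    sims = T_e @ image_embedding * np.exp(t)                 # scaled cosine similarities
    sims -= sims.max()                                       # stable softmax
    probs = np.exp(sims) / np.exp(sims).sum()
    return dict(zip(class_names, probs))

# toy usage with a dummy encoder standing in for CLIP's text transformer
rng = np.random.default_rng(0)
d_t, d_e = 24, 16
dummy_encoder = lambda text: rng.normal(size=d_t)
img = rng.normal(size=d_e); img /= np.linalg.norm(img)
print(zero_shot_classify(img, ["cat", "dog", "bird"], dummy_encoder,
                         rng.normal(size=(d_t, d_e)), t=np.log(1 / 0.07)))
```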

Prompt Engineering and Ensembling: To improve zero-shot performance, especially when class names are ambiguous or lack context:

  • Prompt Engineering: Context templates are used, such as "A photo of a {label}." This bridges the distribution gap between single-word labels and the full sentences typically seen during pre-training. Specific prompts are customized per task (e.g., "A photo of a {label}, a type of pet." for Oxford-IIIT Pets).

  • Ensembling: Multiple zero-shot classifiers are created using different context prompts (e.g., "A photo of a big {label}", "A photo of a small {label}"). The ensemble is constructed over the embedding space (by averaging text embeddings), allowing for efficient caching and prediction. On ImageNet, ensembling 80 different prompts improved accuracy by an additional 3.5%. (A small sketch of this averaging step follows Figure 4 below.)

    The following figure (Figure 4 from the original paper) illustrates the improvements from prompt engineering and ensembling.

    Figure 4. Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4 times more compute with the baseline zero-shot method, but is "free" when amortized over many predictions.
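
As a rough illustration of ensembling over the embedding space, the hedged sketch below averages L2-normalized text embeddings across several prompt templates to build one classifier weight vector per class; the encoder, projection, and template names are assumptions for this example, not the paper's exact 80-prompt set.

```python
import numpy as np

def ensembled_class_embeddings(class_names, templates, text_encoder, text_projection):
    """Build one classifier weight vector per class by averaging prompt embeddings.

    text_encoder / text_projection are stand-ins for CLIP's text tower and its
    learned projection; templates is a list like "a photo of a {}.".
    """
    weights = []
    for name in class_names:
        T_f = np.stack([text_encoder(t.format(name)) for t in templates])  # [p, d_t]
        T_e = T_f @ text_projection                                         # [p, d_e]
        T_e /= np.linalg.norm(T_e, axis=-1, keepdims=True)                  # unit-normalize each prompt
        mean = T_e.mean(axis=0)                                             # average over prompts
        weights.append(mean / np.linalg.norm(mean))                         # re-normalize the ensemble
    return np.stack(weights)                                                # [k, d_e], cached once per dataset

# toy usage with a dummy encoder standing in for the text transformer
rng = np.random.default_rng(0)
templates = ["a photo of a {}.", "a photo of a big {}.", "a photo of a small {}."]
dummy_encoder = lambda text: rng.normal(size=24)
W = ensembled_class_embeddings(["cat", "dog"], templates, dummy_encoder,
                               rng.normal(size=(24, 16)))
print(W.shape)  # (2, 16)
```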

5. Experimental Setup

5.1. Datasets

The experiments in this paper use a broad suite of datasets to evaluate CLIP's zero-shot transfer and representation learning capabilities. The core evaluation suite comprises 27 datasets, including the 12 well-studied datasets from Kornblith et al. (2019) and 15 additional datasets to assess performance on a wider variety of distributions and tasks.

The following are the datasets used (Table 9 from the original paper), with details on their characteristics and purpose:

| Dataset | Classes | Train size | Test size | Evaluation metric |
|---|---|---|---|---|
| Food-101 | 102 | 75,750 | 25,250 | accuracy |
| CIFAR-10 | 10 | 50,000 | 10,000 | accuracy |
| CIFAR-100 | 100 | 50,000 | 10,000 | accuracy |
| Birdsnap | 500 | 42,283 | 2,149 | accuracy |
| SUN397 | 397 | 19,850 | 19,850 | accuracy |
| Stanford Cars | 196 | 8,144 | 8,041 | accuracy |
| FGVC Aircraft | 100 | 6,667 | 3,333 | mean per class |
| Pascal VOC 2007 Classification | 20 | 5,011 | 4,952 | 11-point mAP |
| Describable Textures | 47 | 3,760 | 1,880 | accuracy |
| Oxford-IIIT Pets | 37 | 3,680 | 3,669 | mean per class |
| Caltech-101 | 102 | 3,060 | 6,085 | mean per class |
| Oxford Flowers 102 | 102 | 2,040 | 6,149 | mean per class |
| MNIST | 10 | 60,000 | 10,000 | accuracy |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |
| KITTI | 4 | 6,770 | 711 | accuracy |
| Country211 | 211 | 43,200 | 21,100 | accuracy |
| PatchCamelyon | 2 | 294,912 | 32,768 | accuracy |
| UCF101 | 101 | 9,537 | 1,794 | accuracy |
| Kinetics700 | 700 | 494,801 | 31,669 | mean(top1, top5) |
| CLEVR Counts | 8 | 2,000 | 500 | accuracy |
| Hateful Memes | 2 | 8,500 | 500 | ROC AUC |
| Rendered SST2 | 2 | 7,792 | 1,821 | accuracy |
| ImageNet | 1000 | 1,281,167 | 50,000 | accuracy |


Specific Dataset Notes:

  • Video Datasets (UCF101, Kinetics700): For these datasets, the middle frame of each video clip is used as the input image, effectively converting them into image classification tasks for this evaluation.

  • STL-10 and UCF101: These datasets have multiple predefined train/validation/test splits, and the paper reports the average over all splits.

  • Country211: A custom dataset created by the authors to assess geolocation capability. It filtered YFCC100m for 211 countries with at least 300 GPS-tagged photos, sampling 200 for training and 100 for testing per country.

  • Rendered SST2: A custom dataset created to measure optical character recognition (OCR) capability. Sentences from the Stanford Sentiment Treebank (SST-2) dataset are rendered into 448x448 pixel images (black text on white background). The following figure (Figure 19 from the original paper) shows two example images from the Rendered SST2 dataset.


    Figure 19. Two example images from the Rendered SST2 dataset
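
As an illustration of the Rendered SST2 setup described above, the following Pillow sketch renders a sentence as black text on a white 448x448 canvas. The font, wrapping width, and margins here are illustrative choices rather than the authors' exact rendering parameters.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_sentence(sentence, size=448, margin=16, wrap_width=40):
    """Render a sentence as black text on a white square image (illustrative layout)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                 # placeholder font choice
    wrapped = textwrap.fill(sentence, width=wrap_width)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return img

# Example: render_sentence("a gripping, well-acted drama").save("sst2_example.png")
```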

These datasets were chosen to provide a broad and diverse evaluation suite, encompassing various tasks (general object recognition, fine-grained classification, scene recognition, OCR, action recognition, geolocation) and different image characteristics (natural images, satellite images, medical images, rendered text, video frames). This diversity is crucial for validating the generality and transferability of CLIP's natural language supervision approach.

5.2. Evaluation Metrics

Every evaluation metric mentioned in the paper is explained below; a short code sketch of the most common metrics follows the list:

  1. Accuracy:

    • Conceptual Definition: Accuracy is a common metric for classification tasks, representing the proportion of correctly predicted instances out of the total instances evaluated. It measures how often the model's prediction matches the true label.
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's predicted class matches the actual true class.
      • Total Number of Predictions: The total number of instances for which the model made a prediction.
  2. Mean Per Class Accuracy:

    • Conceptual Definition: This metric calculates the accuracy for each individual class and then averages these per-class accuracies. It is particularly useful when dealing with imbalanced datasets (where some classes have many more examples than others) because it gives equal weight to each class, preventing a model from achieving high overall accuracy by simply performing well on a majority class.
    • Mathematical Formula: $ \text{Mean Per Class Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \text{Accuracy}_i $
    • Symbol Explanation:
      • $C$: The total number of unique classes in the dataset.
      • $\text{Accuracy}_i$: The accuracy calculated specifically for class $i$, i.e., the number of correctly predicted instances of class $i$ divided by the total number of actual instances of class $i$.
  3. 11-point Mean Average Precision (11-point mAP):

    • Conceptual Definition: 11-point mAP is a metric commonly used in object detection and image retrieval tasks, particularly in benchmarks like Pascal VOC. It is a generalization of Average Precision (AP). AP measures the area under the Precision-Recall curve. For 11-point mAP, precision is sampled at 11 equally spaced recall levels (0, 0.1, ..., 1.0). The precision at each recall level $r$ is taken as the maximum precision over any recall $r' \ge r$. The mean of these 11 precision values is the AP for a single class. mAP is then the mean of the AP values across all classes.
    • Mathematical Formula: $ \text{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1.0\}} \text{P}_{\text{interp}}(r) $, where $ \text{P}_{\text{interp}}(r) = \max_{r' \ge r} \text{P}(r') $, and $ \text{mAP} = \frac{1}{C} \sum_{i=1}^{C} \text{AP}_i $
    • Symbol Explanation:
      • $\text{AP}$: Average Precision for a single class.
      • $r$: Recall level (from 0 to 1.0 in 11 steps).
      • $\text{P}_{\text{interp}}(r)$: Interpolated precision at recall $r$, calculated as the maximum precision observed for any recall value greater than or equal to $r$.
      • $\text{P}(r')$: Precision at recall $r'$.
      • $\text{mAP}$: Mean Average Precision.
      • $C$: The total number of classes.
      • $\text{AP}_i$: Average Precision for class $i$.
  4. ROC AUC (Receiver Operating Characteristic Area Under the Curve):

    • Conceptual Definition: ROC AUC is a performance metric for binary classification problems, particularly useful when there is a class imbalance or when the costs of false positives and false negatives are different. The ROC curve plots the True Positive Rate (TPR) (Sensitivity) against the False Positive Rate (FPR) (1 - Specificity) at various threshold settings. The AUC (Area Under the Curve) represents the degree or measure of separability between classes; a higher AUC means the model is better at distinguishing between positive and negative classes. An AUC of 1.0 represents a perfect classifier, while 0.5 represents a random classifier.
    • Mathematical Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $, $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $, $ \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\text{FPR} $
    • Symbol Explanation:
      • TP: True Positives (correctly predicted positive instances).
      • FN: False Negatives (actual positive instances incorrectly predicted as negative).
      • FP: False Positives (actual negative instances incorrectly predicted as positive).
      • TN: True Negatives (correctly predicted negative instances).
      • TPR: True Positive Rate, also known as Sensitivity or Recall.
      • FPR: False Positive Rate.
      • AUC: Area Under the ROC Curve. The integral represents the area under the curve formed by plotting TPR against FPR.
  5. mean(top1, top5):

    • Conceptual Definition: This metric is specific to the Kinetics-700 dataset, which is a large-scale video action recognition dataset. It represents the average of the Top-1 accuracy and Top-5 accuracy.
      • Top-1 accuracy: The standard accuracy where the model's single most confident prediction must be correct.
      • Top-5 accuracy: The prediction is considered correct if the true label is among the model's top 5 most confident predictions.
    • Mathematical Formula: $ \text{mean(top1, top5)} = \frac{\text{Top-1 Accuracy} + \text{Top-5 Accuracy}}{2} $
    • Symbol Explanation:
      • Top-1 Accuracy: Proportion of times the true label is the model's highest-scoring prediction.
      • Top-5 Accuracy: Proportion of times the true label is among the model's five highest-scoring predictions.
  6. R@K (Recall@K):

    • Conceptual Definition: Recall@K is a metric used in information retrieval or ranking tasks (like image-text retrieval). It measures the proportion of queries for which the correct item (e.g., the true image for a text query, or the true text for an image query) is found within the top $K$ retrieved results. A higher R@K indicates better retrieval performance.
    • Mathematical Formula: $ \text{R@K} = \frac{\text{Number of queries where true item is in top K retrieved}}{\text{Total number of queries}} $
    • Symbol Explanation:
      • $K$: The number of top retrieved items to consider.
      • true item: The ground-truth item that corresponds to the query.
      • top K retrieved: The set of $K$ items ranked highest by the model for a given query.
  7. mWAP (Mean Weighted Average Precision) and mWSAP (Mean Weighted Segment Average Precision):

    • Conceptual Definition: These metrics are specific to action recognition datasets like RareAct, which involves detecting unusual actions. While the paper mentions them, it does not provide detailed definitions or formulas within its text. In the context of action recognition, Average Precision (AP) is commonly used for detection tasks, and Weighted Average Precision suggests a weighting scheme might be applied, potentially based on action rarity or duration. Segment Average Precision implies evaluation over temporal segments in videos. Without further context from the original source introducing RareAct, precise formulas cannot be provided here, but they generally aim to measure the quality of action detection or recognition in video sequences, potentially considering temporal localization or importance.
  8. Geolocation Performance (within X km):

    • Conceptual Definition: This metric is used for geo-localization tasks, where the goal is to predict the geographical coordinates (latitude and longitude) of an image. Performance is measured as the percentage of images for which the predicted location falls within a specified radius (e.g., 1km, 25km, 200km) of the true location.
    • Mathematical Formula: Let $(\text{lat}_{\text{true}}, \text{lon}_{\text{true}})$ be the true coordinates, $(\text{lat}_{\text{pred}}, \text{lon}_{\text{pred}})$ be the predicted coordinates, and $\text{dist}(\text{true}, \text{pred})$ be the Haversine distance between the two points. $ \text{Accuracy within X km} = \frac{\text{Number of images where } \text{dist}(\text{true}, \text{pred}) \le X \text{ km}}{\text{Total number of images}} \times 100\% $
    • Symbol Explanation:
      • $\text{lat}_{\text{true}}, \text{lon}_{\text{true}}$: True latitude and longitude of the image.
      • $\text{lat}_{\text{pred}}, \text{lon}_{\text{pred}}$: Predicted latitude and longitude of the image.
      • $\text{dist}(\text{true}, \text{pred})$: The geographical distance (e.g., Haversine distance) between the true and predicted coordinates.
      • $X$ km: The specified radius in kilometers.
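
To make the classification metrics above concrete, here is a minimal NumPy sketch (hypothetical array names; not the authors' evaluation code) covering accuracy, mean per-class accuracy, 11-point interpolated AP for a single class, and the mean(top1, top5) score used for Kinetics-700.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies; each class contributes equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def eleven_point_ap(scores, labels):
    """11-point interpolated average precision for one class (Pascal VOC style).

    scores: confidence for the class; labels: 1 if the class is present, else 0.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                      # rank by decreasing confidence
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0  # interpolated precision at recall r
    return ap / 11.0

def topk_accuracy(logits, y_true, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))

def kinetics_score(logits, y_true):
    """mean(top1, top5) as reported for Kinetics-700."""
    return 0.5 * (topk_accuracy(logits, y_true, 1) + topk_accuracy(logits, y_true, 5))
```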
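
Similarly, a small sketch of Recall@K and the within-X-km geolocation accuracy based on the Haversine distance; the ranking and coordinate arrays are hypothetical inputs.

```python
import numpy as np

def recall_at_k(ranked_indices, true_indices, k):
    """Fraction of queries whose ground-truth item appears in the top-k retrieved items.

    ranked_indices: (n_queries, n_items) item indices sorted by decreasing score.
    true_indices:   (n_queries,) index of the correct item for each query.
    """
    hits = [true in row[:k] for row, true in zip(ranked_indices, true_indices)]
    return float(np.mean(hits))

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def accuracy_within_km(pred_coords, true_coords, threshold_km):
    """Percentage of images whose predicted location is within threshold_km of the truth."""
    d = haversine_km(pred_coords[:, 0], pred_coords[:, 1], true_coords[:, 0], true_coords[:, 1])
    return float(np.mean(d <= threshold_km) * 100.0)
```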

5.3. Baselines

The paper compares CLIP's performance against a comprehensive set of existing models, covering various pre-training strategies and architectures, for both zero-shot transfer and linear-probe representation learning evaluations.

Baselines for Zero-Shot Transfer:

  • Visual N-Grams (Li et al., 2017): This is the primary zero-shot baseline mentioned in the paper for ImageNet, aYahoo, and SUN datasets. It learned a dictionary of visual n-grams from web data and used text n-gram representations of class names for prediction. This serves as the direct contextual reference for generically pre-trained zero-shot models.

Baselines for Linear-Probe Representation Learning (and some for zero-shot comparison context):

The evaluation suite includes 66 different models across 27 datasets to ensure a broad comparison. Key families of baselines are:

  1. LM RN50 (Autoregressive Language Model with ResNet-50):

    • A multimodal model using a ResNet-50 image encoder and an autoregressive loss to predict text captions. This acts as a direct comparison to CLIP's contrastive loss and demonstrates the efficiency gains of contrastive learning.
  2. EfficientNet (Tan & Le, 2019):

    • A family of convolutional neural networks known for systematic model scaling (width, depth, and resolution).
    • Includes B0-B8 models from the original paper.
    • Also includes Noisy Student variants (B0-B7, L2-475, L2-800) (Xie et al., 2020), which use self-training with noisy labels to achieve state-of-the-art performance on ImageNet. These are strong supervised baselines.
  3. Instagram-pretrained ResNeXt (Mahajan et al., 2018):

    • ResNeXt-101 models (32x8d, 32x16d, 32x32d, 32x48d) pre-trained on 3.5 billion Instagram images using hashtag prediction (a form of weak supervision). These models demonstrated the power of large-scale weakly supervised pre-training.
    • Includes FixRes variants (Touvron et al., 2019) using higher input resolutions.
  4. Big Transfer (BiT) (Kolesnikov et al., 2019):

    • BiT-S and BiT-M models (ResNet architectures) pre-trained on ImageNet-1k and ImageNet-21k (a larger ImageNet variant). These models are known for their strong transfer learning performance due to large-scale pre-training. BiT-L models (trained on JFT-300M) are mentioned as superior but not publicly available.
  5. Vision Transformer (ViT) (Dosovitskiy et al., 2020):

    • ViT-B/32, ViT-B/16, ViT-L/16, and ViT-H/14 models pre-trained on the ImageNet-21k dataset. These are crucial baselines to compare CLIP's Vision Transformer variants against, particularly in terms of compute efficiency. The best-performing ViT models (trained on JFT-300M) are also not publicly available.
  6. Self-Supervised Learning Methods:

    • SimCLRv2 (Chen et al., 2020c): A self-supervised learning framework that uses contrastive learning to learn visual representations without human labels.
    • BYOL (Bootstrap Your Own Latent) (Grill et al., 2020): Another self-supervised learning method that avoids negative pairs, using two interacting neural networks to learn representations.
    • Momentum Contrast (MoCo) (He et al., 2020; Chen et al., 2020d): A self-supervised learning framework that uses a momentum encoder and a large queue of negative samples for contrastive learning.
  7. VirTex (Desai & Johnson, 2020):

    • A model that learns visual representations from textual annotations using transformer-based language modeling. It has a similar model design to CLIP's autoregressive baseline but is trained on a much smaller dataset (MSCOCO).
  8. Standard ResNet Checkpoints (He et al., 2016b):

    • Original ResNet-50, ResNet-101, and ResNet-152 models trained on ImageNet-1k. These serve as fundamental supervised baselines representing widely adopted architectures.

      These baselines are representative because they cover a spectrum of modern computer vision pre-training techniques: from traditional supervised ImageNet training, to large-scale weakly supervised methods, and recent self-supervised learning paradigms, as well as the emerging Vision Transformer architectures. This comprehensive comparison allows the authors to position CLIP's natural language supervision approach against the current state-of-the-art across different axes like supervision type, model architecture, and scale.

5.4. Linear-Probe Evaluation Setup

For linear-probe evaluation, the following standardized procedure is used:

  1. Feature Extraction: Image features are taken from the penultimate layer of each model, ignoring any provided classification layer. For CLIP-ViT models, features are used before the linear projection to the embedding space (corresponding to $I_f$ in the pseudocode).
  2. Classifier Training: A logistic regression classifier is trained on these extracted features. scikit-learn's L-BFGS implementation is used with a maximum of 1,000 iterations.
  3. Hyperparameter Tuning: The L2 regularization strength $\lambda$ is determined using a hyperparameter sweep on the validation sets. The sweep covers a range from $10^{-6}$ to $10^6$ with 96 logarithmically spaced steps, and a parametric binary search strategy is employed to find the optimal $\lambda$ efficiently.

  4. Data Splits: For datasets with predefined validation and test splits, the validation set is used for the hyperparameter search. If no validation split or test labels are provided, the training dataset is split to create a validation set for tuning. For the final result, the validation split is combined back with the training split, and performance is reported on the held-out test split.

This linear-probe approach is chosen for its simplicity, its minimal hyperparameter tuning requirements, and its ability to highlight how well the pre-trained representations themselves capture useful information, rather than allowing fine-tuning to adapt the representations to each specific downstream task.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate CLIP's capabilities in zero-shot transfer and representation learning, often outperforming strong baselines and showing predictable scaling behavior.

6.1.1. Initial Comparison to Visual N-Grams

The following are the results from Table 1 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <th></th> <th>aYahoo</th> <th>ImageNet</th> <th>SUN</th> </tr> </thead> <tbody> <tr> <td>Visual N-Grams</td> <td>72.4</td> <td>11.5</td> <td>23.0</td> </tr> <tr> <td>CLIP</td> <td>98.4</td> <td>76.2</td> <td>58.5</td> </tr> </tbody> </table></div>

This table compares CLIP's zero-shot accuracy against Visual N-Grams (Li et al., 2017) on three datasets. CLIP shows large improvements:

  • On ImageNet, CLIP raises accuracy from 11.5% to 76.2%, matching the performance of the original ResNet-50 (a fully supervised model) without using any of ImageNet's 1.28 million training examples. CLIP also achieves 95% Top-5 accuracy on ImageNet, matching Inception-V4.

  • On aYahoo, CLIP achieves 98.4%, a 95% reduction in errors compared to Visual N-Grams.

  • On SUN, CLIP more than doubles the accuracy, from 23.0% to 58.5%.

These results establish CLIP as a significant step towards practical and flexible zero-shot computer vision classifiers. While many factors (a larger dataset, more compute, the Transformer architecture) contribute to CLIP's advantage over Visual N-Grams, a controlled ablation confirms that a CLIP ResNet-50 trained on the YFCC100M dataset (the same data as Visual N-Grams) can match their reported ImageNet performance within one GPU day, even when trained from scratch.

6.1.2. Zero-Shot CLIP Performance Analysis

  • Competitiveness with Fully Supervised Baselines: The following figure (Figure 5 from the original paper) shows this comparison across 27 datasets.

![Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.](/files/papers/6956a9a55411c3e2652eae93/images/4.jpg)

Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline.
Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet. Zero-shot CLIP outperforms a `fully supervised, regularized logistic regression classifier` fitted on `ResNet-50 features` on 16 out of 27 datasets, including `ImageNet`. * **Strong Performance:** On `fine-grained classification` tasks like `Stanford Cars` and `Food101`, zero-shot CLIP outperforms the baseline by over $20\%$. On `STL10`, it achieves $99.3\%$, a new state-of-the-art zero-shot result. * **Action Recognition:** For `action recognition in videos` (`Kinetics700`, `UCF101`), CLIP significantly outperforms `ResNet-50` features by $14.5\%$ and $7.7\%$ respectively. This is attributed to `natural language` providing broader supervision for `verbs` compared to `ImageNet`'s `noun-centric` supervision. * **Weak Performance:** CLIP struggles with `specialized, complex, or abstract tasks` such as `satellite image classification` (EuroSAT, RESISC45), `lymph node tumor detection` (PatchCamelyon), `object counting` (CLEVRCounts), and `self-driving related tasks` (GTSRB, KITTI Distance). This highlights areas for improvement. * **Comparison to Few-Shot Linear Probes:** The following figure (Figure 6 from the original paper) compares zero-shot CLIP to few-shot logistic regression on features of many image models. ![Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.](/files/papers/6956a9a55411c3e2652eae93/images/5.jpg) *该图像是图表,展示了不同模型在有标签训练样本数量与平均得分之间的关系。Zero-Shot CLIP 模型的表现与4-shot线性分类器相当,接近16-shot线性分类器的最佳结果。图中标注了 BiT-M 和 SimCLRv2 模型的表现,灰色线条代表其他评估模型。* Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis. Surprisingly, `zero-shot CLIP matches the average performance of a 4-shot linear classifier` trained on the *same feature space*. This suggests that the ability to "communicate" visual concepts directly via `natural language` in zero-shot is highly effective, potentially overcoming the ambiguity of `context-less example-based learning` in few-shot settings. CLIP also `roughly matches the best-performing 16-shot classifier` (a `BiT-M ResNet-152x2` trained on `ImageNet-21K`) in the evaluation suite. * **Data Efficiency of Zero-Shot Transfer:** The following figure (Figure 7 from the original paper) shows the estimated number of labeled examples per class required for a linear classifier on the same CLIP feature space to match zero-shot CLIP's performance. ![Figure 7. The data efficiency of zero-shot transfer varies widely. 
Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.](/files/papers/6956a9a55411c3e2652eae93/images/6.jpg) *该图像是一个条形图,展示了在 CLIP 特征空间中,线性分类器匹配零-shot 分类器性能所需的每个类别的标记示例数量。数据呈现不同数据集的标签样本需求,从 FER2013 的 184 个样本到 Flowers102 的 0.9 个样本,均值为 20.8, медиана 为 5.4。* Figure 7. The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class. The `effective data efficiency` of zero-shot transfer varies widely from less than 1 labeled example per class to 184. Half of the datasets require less than 5 examples per class, with a median of 5.4. On `ImageNet`, zero-shot CLIP matches the performance of a 16-shot linear classifier on the same feature space. * **Correlation with Linear Probe Performance:** The following figure (Figure 8 from the original paper) compares CLIP's zero-shot performance with fully supervised linear classifiers across datasets. ![Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance ( ${ \\le } 3$ point difference).](/files/papers/6956a9a55411c3e2652eae93/images/7.jpg) *该图像是一个散点图,展示了零-shot CLIP性能与线性探测CLIP性能之间的关系。图中显示,零-shot性能与线性探测性能存在强关联性,相关系数 $r = 0.82$。大部分数据点接近45度线,但零-shot性能普遍低于线性探测性能,最高差值在10至25点之间。* Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance ( ${ \le } 3$ point difference). There is a strong positive correlation ($r=0.82$, $p < 10^{-6}$) between zero-shot and fully supervised performance, indicating consistency in CLIP's `representation learning` and `task learning`. However, zero-shot performance generally `underperforms fully supervised classifiers by 10% to 25%`, suggesting significant room for improvement. * **Scaling Laws for Zero-Shot Performance:** The following figure (Figure 9 from the original paper) plots the average error rate of 5 ResNet CLIP models across 39 evaluations on 36 different datasets as a function of model compute. ![Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. 
Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.](/files/papers/6956a9a55411c3e2652eae93/images/8.jpg) *该图像是图表,展示了 CLIP 模型在不同计算量下的零-shot 性能。横轴为模型的 GFLOPs,纵轴为错误率(%)。数据点显示,随着计算量的增加,错误率呈现平滑的下降趋势,表明模型性能的提升。* Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend. CLIP exhibits a `log-log linear scaling trend` for `average zero-shot error` across a `44x increase in model compute`, similar to observations in `neural language models`. This indicates predictable performance gains with larger models and more computational resources. ### 6.1.3. Representation Learning The following figure (Figure 10 from the original paper) shows the relationship between linear probe average scores and forward-pass GFLOPs/image for various models. ![该图像是图表,展示了不同模型在Kornblith等的12和27个数据集上的线性探测平均得分与前向传递GFLOPs/图像之间的关系。不同的模型用不同的符号标识,结果表明CLIP模型在效率与准确性之间的平衡表现突出。](/files/papers/6956a9a55411c3e2652eae93/images/9.jpg) *该图像是图表,展示了不同模型在Kornblith等的12和27个数据集上的线性探测平均得分与前向传递GFLOPs/图像之间的关系。不同的模型用不同的符号标识,结果表明CLIP模型在效率与准确性之间的平衡表现突出。* Figure 9. The image is a chart that shows the relationship between linear probe average scores and forward-pass GFLOPs/image for various models across Kornblith et al.'s 12 and 27 datasets. Different models are represented by distinct symbols, and the results indicate that CLIP performs well in balancing efficiency and accuracy. The following are the results from Figure 10 of the original paper. The top graph shows the average score on the 12-dataset Kornblith et al. (2019) suite versus GFLOPs per image. The bottom graph shows the average score on the broader 27-dataset suite versus GFLOPs per image. ![该图像是图表,展示了通过线性探针方法在Kornblith等人的12个数据集和26个数据集上的平均转移得分与ImageNet得分的关系。图中通过散点图反映不同模型(如CLIP-ViT、EfficientNet等)在转移学习任务中的表现。纵轴表示转移得分(%),横轴表示ImageNet得分(%),两条虚线显示了理想的线性关系。不同颜色和形状的标记代表不同的模型。](/files/papers/6956a9a55411c3e2652eae93/images/11.jpg) *该图像是图表,展示了通过线性探针方法在Kornblith等人的12个数据集和26个数据集上的平均转移得分与ImageNet得分的关系。图中通过散点图反映不同模型(如CLIP-ViT、EfficientNet等)在转移学习任务中的表现。纵轴表示转移得分(%),横轴表示ImageNet得分(%),两条虚线显示了理想的线性关系。不同颜色和形状的标记代表不同的模型。* Figure 11. The image is a chart that illustrates the performance comparison of the Zero-Shot CLIP model across multiple datasets. The chart displays accuracy for each dataset along with improvement scores over other methods, highlighting the effectiveness of the model in various tasks. * **Performance on Kornblith et al. (2019) 12-Dataset Suite:** Small CLIP models (RN50, RN101) outperform other ImageNet-1K trained ResNets but underperform ImageNet-21K trained ResNets (`BiT-M`) and `EfficientNet` models with similar compute. However, `CLIP scales very well`, and the largest `ResNet-50x64` slightly outperforms the `Noisy Student EfficientNet-L2` in both overall score and compute efficiency. `CLIP Vision Transformers` are about `3x more compute efficient` than `CLIP ResNets`. The best model, `ViT-L/14@336px`, outperforms the best existing model by an average of $2.6\%$. * **Performance on Broader 27-Dataset Suite:** On the expanded suite (including `OCR`, `geo-localization`, `facial emotion recognition`, `action recognition`), CLIP's benefits become clearer. 
* All CLIP models, regardless of scale, outperform all evaluated systems in terms of `compute efficiency`. * The best model's average score improvement over previous systems increases from $2.6\%$ to $5\%$. * `Self-supervised systems` (`SimCLRv2`) also perform better on this broader suite, suggesting the value of increased `task diversity` in evaluation. * **Per-Dataset Differences:** The following figure (Figure 11 from the original paper) visualizes per-dataset differences in performance between the best CLIP model (`ViT-L/14@336px`) and the `Noisy Student EfficientNet-L2`. ![Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.](/files/papers/6956a9a55411c3e2652eae93/images/10.jpg) *该图像是图表,展示了在多种数据集上,CLIP模型与Noisy Student EfficientNet-L2的线性回归表现的差异(Δ Score %)。大多数数据集中,CLIP的表现优于EfficientNet-L2,尤其是在SST2、Country211和HatefulMemes等数据集上,表现上升显著。* Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets. CLIP outperforms the `Noisy Student EfficientNet-L2` on 21 out of 27 datasets. * **Significant Gains:** CLIP performs best on tasks requiring `OCR` (SST2, HatefulMemes), `geo-localization` and `scene recognition` (Country211, SUN397), and `activity recognition` (Kinetics700, UCF101). It also excels in `fine-grained car` and `traffic sign recognition` (Stanford Cars, GTSRB). This suggests `ImageNet's narrow supervision` (e.g., a single label for all traffic signs) might hurt performance on fine-grained tasks. * **Underperformance:** CLIP still underperforms on ImageNet (the EfficientNet's training dataset) and low-resolution datasets (CIFAR10, CIFAR100). It also does slightly worse on `PatchCamelyon` (lymph node tumor detection) and `CLEVRCounts` (object counting), where both approaches have low overall performance. ### 6.1.4. Robustness to Natural Distribution Shift The following figure (Figure 13 from the original paper) compares the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. ![该图像是一个图表,展示了使用 Zero-Shot CLIP 模型在多个数据集上的性能比较。图表中显示了各数据集的准确性以及与其他方法的改进得分,强调了该模型在不同任务中的有效性。](/files/papers/6956a9a55411c3e2652eae93/images/12.jpg) *该图像是一个图表,展示了使用 Zero-Shot CLIP 模型在多个数据集上的性能比较。图表中显示了各数据集的准确性以及与其他方法的改进得分,强调了该模型在不同任务中的有效性。* Figure 13. The image is a chart that illustrates the performance comparison of the Zero-Shot CLIP model across multiple datasets. The chart displays accuracy for each dataset along with improvement scores over other methods, highlighting the effectiveness of the model in various tasks. * **Zero-Shot CLIP's Enhanced Robustness:** `Zero-shot CLIP models` significantly improve `effective robustness` by reducing the gap between ImageNet accuracy and accuracy under `natural distribution shifts` by up to `75%`. This is a crucial finding, suggesting that models not trained on specific task distributions are less susceptible to `spurious correlations`. * **Supervised Adaptation Harms Robustness:** The following figure (Figure 14 from the original paper) visualizes how performance changes from the zero-shot classifier to a `supervised linear classifier` adapted to the ImageNet distribution. ![Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by $9 . 
2 \\%$ , it slightly reduces average robustness. with ImageNet categories.](/files/papers/6956a9a55411c3e2652eae93/images/13.jpg) *该图像是一个图表,展示了在多个自然分布偏移数据集上,采用不同方法适应 ImageNet 分类器准确度的变化。图中显示,当针对 ImageNet 进行监督适应时,准确度提高了 $9.2\%$,但在平均鲁棒性方面略有下降。* Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by $9 . 2 \\%$ , it slightly reduces average robustness. with ImageNet categories. While adapting CLIP to the `ImageNet distribution` (via `L2 regularized logistic regression`) increases its ImageNet accuracy by $9.2\%$ (to $85.4\%$), `average accuracy under distribution shift slightly decreases`. This is a surprising result: a large gain in in-distribution accuracy does not translate to improved out-of-distribution robustness, implying that these gains are largely from exploiting `distribution-specific patterns`. * **Role of Class Naming:** Using `custom zero-shot classifiers` for each dataset based on its `specific class names` (instead of pooling `ImageNet superclasses`) improves `average effective robustness` by $5\%$. * **Few-Shot Robustness:** The following figure (Figure 15 from the original paper) visualizes the performance of 0-shot, 1-shot, ..., 128-shot, and fully supervised logistic regression classifiers on the best CLIP model's features. ![Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.](/files/papers/6956a9a55411c3e2652eae93/images/14.jpg) *该图像是一个图表,展示了不同训练方式下CLIP模型在7个自然分布迁移数据集上的表现。横轴为在ImageNet子抽样类上的平均准确率,纵轴为在自然分布迁移数据集上的平均准确率。图中包含不同训练样本数量的标记,分别为1-shot至128-shot,并且显示出零-shot和few-shot CLIP相较于传统模型的效果差异。* Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust. `Few-shot models` also show higher `effective robustness` than existing models, but this benefit `fades` as `in-distribution performance` increases with more training data. `Zero-shot CLIP` is `notably more robust` than a few-shot model with equivalent ImageNet performance. The conclusion is that `high effective robustness` results from `minimizing distribution-specific training data`. ### 6.1.5. Comparison to Human Performance The following are the results from Table 2 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <th></th> <th>Accuracy</th> <th>Majority Vote on Full Dataset</th> <th>Accuracy on Guesses</th> <th>Majority Vote Accuracy on Guesses</th> </tr> </thead> <tbody> <tr> <td>Zero-shot human</td> <td>53.7</td> <td>57.0</td> <td>69.7</td> <td>63.9</td> </tr> <tr> <td>Zero-shot CLIP</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> </tr> <tr> <td>One-shot human</td> <td>75.7</td> <td>80.3</td> <td>78.5</td> <td>81.2</td> </tr> <tr> <td>Two-shot human</td> <td>75.7</td> <td>85.0</td> <td>79.2</td> <td>86.1</td> </tr> </tbody> </table></div> On the `Oxford IIT Pets` dataset, `zero-shot CLIP` (93.5% accuracy) vastly outperforms `zero-shot humans` (53.7%). 
Humans, however, show a `large leap in performance` from zero-shot to one-shot (53.7% to 75.7%), with minimal additional gain from two-shot. This suggests humans "know what they don't know" and quickly update priors from a single example. This highlights a `significant gap between machine and human sample efficiency` in few-shot learning, where humans effectively integrate prior knowledge. The following figure (Figure 16 from the original paper) plots human accuracy vs CLIP's zero-shot accuracy. ![Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label.](/files/papers/6956a9a55411c3e2652eae93/images/15.jpg) *该图像是图表,展示了不同犬种在 CLIP 模型和人类零-shot 及 one-shot 识别中的准确率。横轴为犬种,纵轴为准确率百分比,三条线分别代表 Zero-Shot CLIP、One-Shot Human 和 Zero-Shot Human 的表现,反映出 CLIP 在难度排序上的一致性。* Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label. The hardest problems for CLIP (e.g., specific dog breeds that are very similar) tend to also be hard for humans, suggesting consistency in difficulty stemming from dataset noise or genuinely challenging visual distinctions. ### 6.1.6. Data Overlap Analysis The following figure (Figure 17 from the original paper) summarizes the data overlap analysis. ![该图像是一个示意图,展示了在不同数据重叠百分比下,清洁数据与重叠数据的准确性差异。左侧图表显示了包括CIFAR-100和SUN397在内的多个数据集的准确性变化,右侧图表则呈现了不同数据集总体准确性变化的情况。](/files/papers/6956a9a55411c3e2652eae93/images/16.jpg) *该图像是一个示意图,展示了在不同数据重叠百分比下,清洁数据与重叠数据的准确性差异。左侧图表显示了包括CIFAR-100和SUN397在内的多个数据集的准确性变化,右侧图表则呈现了不同数据集总体准确性变化的情况。* Figure 16. The image is a diagram showing the difference in accuracy between clean data and overlapping data at various percentages of data overlap. The left chart displays accuracy changes for several datasets, including CIFAR-100 and SUN397, while the right chart presents the overall accuracy change for different datasets. * **Procedure:** A custom `near-duplicate detector` was used (trained with a `contrastive loss` on heavily augmented images) to identify overlapping examples between the `WIT pre-training dataset` and `downstream evaluation datasets`. * **Overlap Rate:** Out of 35 datasets, 9 had `no detected overlap`. The `median overlap` was `2.2%`, and the `average overlap` was `3.2%`. * **Impact on Performance:** Due to the small overlap, `overall accuracy` was rarely shifted by more than $0.1\%$, with only 7 datasets above this threshold. Only 2 were `statistically significant` after `Bonferroni correction`. The `max detected improvement` was $0.6\%$ on `Birdsnap` (with $12.1\%$ overlap). `Country211` had the largest overlap ($21.5\%$) but only a $0.2\%$ accuracy increase, potentially because the training text wasn't directly related to the geo-localization task. * **Conclusion:** The minimal impact of overlap aligns with previous `large-scale pre-training` studies (Mahajan et al., 2018; Kolesnikov et al., 2019), suggesting that `data contamination` is not a major factor inflating CLIP's reported performance. ### 6.1.7. 
Dataset Ablation on YFCC100M The following are the results from Table 12 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <th rowspan="2">Dataset</th> <th colspan="3">Linear Classifier</th> <th colspan="3">Zero Shot</th> </tr> <tr> <th>YFCC</th> <th>WIT</th> <th>∆</th> <th>YFCC</th> <th>WIT</th> <th>∆</th> </tr> </thead> <tbody> <tr> <td>Birdsnap</td> <td>47.4</td> <td>35.3</td> <td>+12.1</td> <td>19.9</td> <td>4.5</td> <td>+15.4</td> </tr> <tr> <td>Country211</td> <td>23.1</td> <td>17.3</td> <td>+5.8</td> <td>5.2</td> <td>5.3</td> <td>+0.1</td> </tr> <tr> <td>Flowers102</td> <td>94.4</td> <td>89.8</td> <td>+4.6</td> <td>48.6</td> <td>21.7</td> <td>+26.9</td> </tr> <tr> <td>GTSRB</td> <td>66.8</td> <td>72.5</td> <td>-5.7</td> <td>6.9</td> <td>7.0</td> <td>−0.1</td> </tr> <tr> <td>UCF101</td> <td>69.2</td> <td>74.9</td> <td>-5.7</td> <td>22.9</td> <td>32.0</td> <td>-9.1</td> </tr> <tr> <td>Stanford Cars</td> <td>31.4</td> <td>50.3</td> <td>−18.9</td> <td>3.8</td> <td>10.9</td> <td>-7.1</td> </tr> <tr> <td>ImageNet</td> <td>62.0</td> <td>60.8</td> <td>+1.2</td> <td>31.3</td> <td>27.6</td> <td>+3.7</td> </tr> <tr> <td>Dataset Average</td> <td>65.5</td> <td>66.6</td> <td>−1.1</td> <td>29.6</td> <td>30.0</td> <td>−0.4</td> </tr> <tr> <td>Dataset "Wins"</td> <td>10</td> <td>15</td> <td>-5</td> <td>19</td> <td>18</td> <td>+1</td> </tr> </tbody> </table></div> An ablation study compared a `ResNet-50` model trained on a `filtered subset of YFCC100M` with the same model trained on an `equally sized subset of WIT`. * **Overall:** `YFCC` and `WIT` show `similar average performance` for both `zero-shot` and `linear probe` settings. * **Specific Datasets:** Performance on `fine-grained classification datasets` can vary widely (e.g., `Birdsnap`, `Flowers102` better with YFCC; `Stanford Cars`, `UCF101` better with WIT). This likely reflects the `relative density of relevant data` for specific concepts within each pre-training dataset. * **Main Advantage of WIT:** The primary advantage of `WIT` over `YFCC100M` is its `much larger total size`, which enables `better generalization`. ### 6.1.8. 
Selected Task and Dataset Results (Appendix E) * **Image and Text Retrieval:** The following are the results from Table 13 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="3" colspan="2"></td> <th colspan="6">Text Retrieval</th> <th colspan="6">Image Retrieval</th> </tr> <tr> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> </tr> <tr> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> </tr> </thead> <tbody> <tr> <td rowspan="4" colspan="2">SOTA Fined-tuned</td> <td>Unicoder-VLa</td> <td>86.2</td> <td>96.3</td> <td>99.0</td> <td>62.3</td> <td>87.1</td> <td>92.8</td> <td>71.5</td> <td>90.9</td> <td>94.9</td> <td>46.7</td> <td>76.0</td> <td>85.3</td> </tr> <tr> <td>Uniterb</td> <td>87.3</td> <td>98.0</td> <td>99.2</td> <td>65.7</td> <td>88.6</td> <td>93.8</td> <td>75.6</td> <td>94.1</td> <td>96.8</td> <td>52.9</td> <td>79.9</td> <td>88.0</td> </tr> <tr> <td>VILLAc</td> <td>87.9</td> <td>97.5</td> <td>98.8</td> <td>-</td> <td>-</td> <td>-</td> <td>76.3</td> <td>94.2</td> <td>96.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Oscard</td> <td>-</td> <td>-</td> <td>-</td> <td>73.5</td> <td>92.2</td> <td>96.0</td> <td>-</td> <td>-</td> <td>-</td> <td>57.5</td> <td>82.8</td> <td>89.8</td> </tr> <tr> <td rowspan="5" colspan="2">SOTA Zero-Shot</td> <td>ERNIE-ViLe</td> <td>88.7</td> <td>98.0</td> <td>99.2</td> <td>-</td> <td>-</td> <td>-</td> <td>76.7</td> <td>93.6</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Visual N-Gramsf</td> <td>15.4</td> <td>35.7</td> <td>45.1</td> <td>8.7</td> <td>23.1</td> <td>33.3</td> <td>8.8</td> <td>21.2</td> <td>29.9</td> <td>5.0</td> <td>14.5</td> <td>21.9</td> </tr> <tr> <td>ImageBERTg</td> <td>-</td> <td>-</td> <td>-</td> <td>44.0</td> <td>71.2</td> <td>80.4</td> <td>-</td> <td>-</td> <td>-</td> <td>32.3</td> <td>59.0</td> <td>70.2</td> </tr> <tr> <td>Unicoder-VLa</td> <td>64.3</td> <td>86.8</td> <td>92.3</td> <td>-</td> <td>-</td> <td>-</td> <td>48.4</td> <td>76.0</td> <td>85.2</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Uniterb</td> <td>83.6</td> <td>95.7</td> <td>97.7</td> <td>-</td> <td>-</td> <td>-</td> <td>68.7</td> <td>89.2</td> <td>93.9</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td colspan="2">CLIP</td> <td>88.0</td> <td>98.7</td> <td>99.4</td> <td>58.4</td> <td>81.5</td> <td>88.1</td> <td>68.7</td> <td>90.6</td> <td>95.2</td> <td>37.8</td> <td>62.4</td> <td>72.2</td> </tr> </tbody> </table></div> The following are the results from Table 13 of the original paper: Text Retrieval (Flickr30k, MSCOCO) and Image Retrieval (Flickr30k, MSCOCO). CLIP, pre-trained for `image-text retrieval`, performs well on this `sanity check`. `Zero-shot CLIP` matches or outperforms all prior `zero-shot results` on `Flickr30k` and `MSCOCO`. On `Flickr30k text retrieval`, it's competitive with the `overall SOTA`. On `image retrieval`, it's competitive with a `fine-tuned Unicoder-VL` but not the `overall SOTA`. `Prompting` with "a photo of" boosts performance. 
* **Optical Character Recognition (OCR):** The following are the results from Table 14 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <td></td> <th>MNIST</th> <th>SVHN</th> <th>IIIT5K 1k</th> <th>Hateful Memes</th> <th>SST-2</th> </tr> </thead> <tbody> <tr> <td rowspan="3" colspan="2">SOTA JOINT</td> <td>a</td> <td>99.8</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>b</td> <td>-</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>c</td> <td>-</td> <td>-</td> <td>98.9</td> <td>78.0</td> <td>97.5</td> </tr> <tr> <td rowspan="2" colspan="2">SOTA SUPERVISED</td> <td>Raw Pixels</td> <td>92.5</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>ES Best</td> <td>-</td> <td>-</td> <td>89.6</td> <td>58.6</td> <td>59.0</td> </tr> <tr> <td colspan="2">Linear CLIP</td> <td>99.2</td> <td>-</td> <td>-</td> <td>77.3</td> <td>80.5</td> </tr> <tr> <td colspan="2">Zero-Shot CLIP</td> <td>88.4</td> <td>51.0</td> <td>90.0</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div> The following are the results from Table 14 of the original paper: OCR performance on 5 datasets. CLIP learns `primitive OCR capabilities`. * **Strong on Digital Text:** Strongest on `Hateful Memes` and `SST-2` (digitally rendered words), where linear CLIP reaches $80.5\%$ on `SST-2` (on par with a `GloVe CBOW baseline`) and $77.3\%$ on `Hateful Memes` (0.7 points behind SOTA). * **Weak on Natural/Handwritten:** Weaker on `IIIT5K` (natural images of words) and particularly poor on `SVHN` (street view numbers, $51\%$ accuracy) and `MNIST` (handwritten digits, $88\%$ accuracy), where even simple `logistic regression on raw pixels` outperforms it. This suggests issues with `repeated characters`, `low resolution`, `blurry images`, and truly `out-of-distribution` data. * **Action Recognition in Videos:** The following are the results from Table 15 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="2"></td> <td rowspan="2"></td> <th>UCF101</th> <th>K700</th> <th colspan="2">RareAct</th> </tr> <tr> <th>Top-1</th> <th>AVG</th> <th>mWAP</th> <th>mWSAP</th> </tr> </thead> <tbody> <tr> <td rowspan="4">SOTA FINE-TUNED</td> <td>R(2+1)D-BERTa</td> <td>98.7</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2b</td> <td>-</td> <td>84.8</td> <td>-</td> <td>-</td> </tr> <tr> <td>HT100M S3Dd</td> <td>91.3</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Baseline I3De</td> <td>-</td> <td>70.2</td> <td>-</td> <td>-</td> </tr> <tr> <td rowspan="4">SOTA LINEAR</td> <td>MMV FACf</td> <td>91.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2c</td> <td>89.4</td> <td>68.2</td> <td>-</td> <td>-</td> </tr> <tr> <td>CLIP</td> <td>92.0</td> <td>73.0</td> <td>-</td> <td>-</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> <tr> <td rowspan="2">SOTA ZERO-SHOT</td> <td>HT100M S3Dd</td> <td>-</td> <td>-</td> <td>30.5</td> <td>34.8</td> </tr> <tr> <td>CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> </tbody> </table></div> The following are the results from Table 15 of the original paper: Action recognition performance on 3 video datasets. CLIP performs strongly on `action recognition`, a task involving `verbs`, suggesting benefits from `natural language's broader supervision`. * **Linear Evaluation:** CLIP matches the best prior result on `UCF101` and outperforms all other models in the evaluation suite. 
On `Kinetics-700`, CLIP outperforms the fine-tuned `I3D baseline`. (Note: Linear evaluations use single central frames, underestimating full video performance.) * **Zero-Shot Evaluation:** On `Kinetics-700`, `zero-shot CLIP` (averaging predictions across all frames) is within $1\%$ of the `fully supervised I3D baseline`. On `RareAct` (unusual actions), CLIP improves over the prior SOTA by 10 points. * **Geolocalization:** The following are the results from Table 17 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <th>1km</th> <th>25km</th> <th>200km</th> <th>750km</th> <th>2500km</th> </tr> </thead> <tbody> <tr> <td>ISNsa</td> <td>16.9</td> <td>43.0</td> <td>51.9</td> <td>66.7</td> <td>80.2</td> </tr> <tr> <td>CPlaNetb</td> <td>16.5</td> <td>37.1</td> <td>46.4</td> <td>62.0</td> <td>78.5</td> </tr> <tr> <td>CLIP</td> <td>13.9</td> <td>32.9</td> <td>43.0</td> <td>62.0</td> <td>79.3</td> </tr> <tr> <td>Deep-Ret+c</td> <td>14.4</td> <td>33.3</td> <td>47.7</td> <td>61.6</td> <td>73.4</td> </tr> <tr> <td>PlaNetd</td> <td>8.4</td> <td>24.5</td> <td>37.6</td> <td>53.6</td> <td>71.3</td> </tr> </tbody> </table></div> The following are the results from Table 17 of the original paper: Geolocalization performance on the IM2GPS test set. CLIP shows an ability to recognize places. Using `nearest-neighbor regression` in CLIP's embedding space on the `IM2GPS` test set, CLIP performs similarly to several task-specific models despite querying only 1 million images (less than prior work). It is `not competitive with the current state-of-the-art` in this regression-based setting. ### 6.2. Data Presentation (Tables) The following are the results from Table 10 of the original paper: Linear probe performance of various pre-trained models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="1" colspan="2"></td> <td rowspan="1" colspan="25"></td> </tr> <tr> <td rowspan="1" colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td colspan="2">LM RN50</td> <td>65.2</td> <td>90.0</td> <td>64.9</td> <td>19.6</td> <td>64.0</td> <td>37.0</td> <td>0.1</td> <td>56.8</td> <td>65.2</td> <td>76.8</td> <td>82.1</td> <td>93.4</td> <td>70.0</td> <td>97.7</td> <td>78.0</td> <td>65.7</td> <td>76.6</td> <td>43.7</td> <td>25.3</td> <td>52.5</td> <td>62.6</td> <td>40.7</td> <td>78.2</td> <td>53.8</td> <td>59.6</td> <td>56.9</td> </tr> <tr> <td rowspan="5" colspan="2">CLIP</td> <td>RN50</td> <td>88.9</td> <td>91.1</td> <td>73.5</td> <td>58.6</td> <td>75.1</td> <td>91.3</td> <td>90.5</td> <td>73.0</td> <td>65.7</td> <td>77.0</td> <td>85.9</td> <td>97.8</td> <td>64.2</td> <td>98.3</td> <td>82.4</td> <td>70.2</td> <td>25.3</td> <td>82.4</td> <td>57.3</td> <td>68.0</td> <td>76.6</td> <td>53.8</td> <td>71.1</td> <td>80.0</td> <td>81.5</td> </tr> <tr> <td>RN101</td> <td>93.3</td> <td>92.2</td> <td>74.9</td> <td>72.8</td> <td>79.2</td> <td>88.7</td> <td>62.7</td> <td>89.0</td> <td>79.1</td> <td>94.8</td> <td>94.1</td> <td>98.3</td> <td>68.7</td> <td>98.6</td> <td>89.7</td> <td>85.5</td> <td>30.3</td> <td>83.0</td> <td>78.6</td> <td>79.1</td> <td>91.4</td> <td>69.2</td> <td>40.7</td> <td>83.7</td> <td>89.5</td> </tr> <tr> <td>RN50x4</td> <td>94.9</td> <td>94.1</td> <td>78.6</td> <td>77.2</td> <td>81.1</td> <td>90.5</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>98.9</td> <td>71.3</td> <td>99.1</td> <td>91.4</td> <td>89.0</td> <td>34.8</td> <td>83.5</td> <td>82.0</td> <td>92.7</td> <td>95.1</td> <td>60.3</td> <td>46.4</td> <td>85.6</td> <td>92.0</td> </tr> <tr> <td>RN50x16</td> <td>95.9</td> <td>95.0</td> <td>80.7</td> <td>78.2</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>RN50x64</td> <td>96.2</td> <td>95.9</td> <td>81.6</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>92.8</td> <td>96.2</td> <td>83.1</td> <td>67.8</td> <td>78.4</td> <td>86.7</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>96.9</td> <td>69.2</td> <td>98.3</td> <td>85.3</td> <td>66.2</td> <td>27.8</td> <td>83.9</td> <td>76.5</td> <td>90.0</td> <td>93.0</td> <td>61.7</td> <td>52.1</td> <td>66.7</td> <td>70.8</td> </tr> <tr> <td>ViT-B/16</td> <td>94.7</td> <td>97.1</td> <td>86.6</td> <td>67.8</td> <td>80.2</td> 
<td>89.6</td> <td>70.3</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.0</td> <td>97.1</td> <td>72.2</td> <td>99.2</td> <td>86.6</td> <td>67.8</td> <td>33.3</td> <td>83.5</td> <td>79.7</td> <td>93.5</td> <td>97.1</td> <td>70.3</td> <td>57.1</td> <td>75.5</td> <td>80.2</td> </tr> <tr> <td>ViT-L/14</td> <td>95.9</td> <td>97.9</td> <td>87.4</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>98.9</td> <td>72.9</td> <td>99.2</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>95.4</td> <td>98.9</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>96.5</td> <td>98.1</td> <td>89.0</td> <td>78.5</td> <td>82.5</td> <td>91.5</td> <td>72.2</td> <td>90.0</td> <td>83.0</td> <td>95.2</td> <td>96.0</td> <td>99.2</td> <td>73.0</td> <td>99.7</td> <td>94.1</td> <td>92.8</td> <td>46.4</td> <td>85.6</td> <td>85.4</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="9" colspan="2">EfficientNet</td> <td>B0</td> <td>71.2</td> <td>93.0</td> <td>73.3</td> <td>60.6</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>85.6</td> <td>75.6</td> <td>93.8</td> <td>93.1</td> <td>94.5</td> <td>98.1</td> <td>55.2</td> <td>98.2</td> <td>97.0</td> <td>84.3</td> <td>74.0</td> <td>71.6</td> <td>14.0</td> <td>83.1</td> <td>76.2</td> <td>51.7</td> <td>47.3</td> <td>55.7</td> <td>55.0</td> </tr> <tr> <td>B1</td> <td>74.2</td> <td>93.4</td> <td>73.6</td> <td>63.2</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>86.2</td> <td>77.0</td> <td>94.6</td> <td>94.4</td> <td>95.1</td> <td>98.0</td> <td>56.1</td> <td>98.2</td> <td>96.9</td> <td>84.3</td> <td>73.1</td> <td>67.1</td> <td>14.5</td> <td>83.9</td> <td>79.9</td> <td>54.3</td> <td>54.9</td> <td>81.1</td> <td>77.4</td> </tr> <tr> <td>B2</td> <td>77.4</td> <td>94.0</td> <td>78.0</td> <td>66.5</td> <td>64.4</td> <td>66.0</td> <td>59.3</td> <td>85.8</td> <td>73.1</td> <td>94.1</td> <td>93.7</td> <td>93.3</td> <td>98.5</td> <td>57.1</td> <td>98.2</td> <td>97.3</td> <td>85.0</td> <td>75.8</td> <td>76.1</td> <td>13.4</td> <td>83.3</td> <td>78.1</td> <td>50.9</td> <td>45.1</td> <td>53.8</td> <td>54.8</td> </tr> <tr> <td>B3</td> <td>79.7</td> <td>94.1</td> <td>78.7</td> <td>70.1</td> <td>65.4</td> <td>66.4</td> <td>60.4</td> <td>86.5</td> <td>73.4</td> <td>94.7</td> <td>93.5</td> <td>93.2</td> <td>98.8</td> <td>57.9</td> <td>98.2</td> <td>96.8</td> <td>85.0</td> <td>78.3</td> <td>72.3</td> <td>13.9</td> <td>83.1</td> <td>79.1</td> <td>52.5</td> <td>46.5</td> <td>54.4</td> <td>55.4</td> </tr> <tr> <td>B4</td> <td>81.5</td> <td>93.6</td> <td>77.9</td> <td>72.4</td> <td>67.1</td> <td>72.7</td> <td>68.9</td> <td>86.7</td> <td>73.9</td> <td>95.0</td> <td>94.7</td> <td>94.5</td> <td>98.4</td> <td>58.5</td> <td>98.2</td> <td>96.8</td> <td>86.0</td> <td>78.5</td> <td>69.6</td> <td>14.9</td> <td>84.7</td> <td>80.9</td> <td>54.5</td> <td>46.6</td> <td>53.3</td> <td>56.3</td> </tr> <tr> <td>B5</td> <td>84.5</td> <td>94.8</td> <td>80.0</td> <td>73.5</td> <td>65.8</td> <td>71.1</td> <td>68.2</td> <td>87.6</td> <td>73.9</td> <td>95.0</td> <td>94.1</td> <td>93.7</td> <td>98.4</td> <td>60.2</td> <td>98.2</td> <td>96.8</td> <td>85.4</td> <td>78.1</td> <td>72.7</td> <td>15.3</td> <td>84.2</td> <td>80.0</td> <td>54.1</td> <td>51.1</td> <td>53.3</td> <td>57.0</td> </tr> <tr> <td>B6</td> <td>86.9</td> <td>96.0</td> <td>82.0</td> <td>74.7</td> <td>69.0</td> <td>77.1</td> <td>72.3</td> <td>87.2</td> 
<td>76.8</td> <td>95.2</td> <td>94.7</td> <td>96.9</td> <td>98.6</td> <td>61.4</td> <td>99.1</td> <td>96.3</td> <td>86.8</td> <td>80.8</td> <td>75.8</td> <td>16.4</td> <td>85.2</td> <td>81.9</td> <td>57.7</td> <td>51.9</td> <td>54.8</td> <td>58.8</td> </tr> <tr> <td>B7</td> <td>87.3</td> <td>97.0</td> <td>83.9</td> <td>75.8</td> <td>71.4</td> <td>67.6</td> <td>65.6</td> <td>87.3</td> <td>78.5</td> <td>95.2</td> <td>96.4</td> <td>97.2</td> <td>98.6</td> <td>61.9</td> <td>99.5</td> <td>96.6</td> <td>86.1</td> <td>70.7</td> <td>72.4</td> <td>17.6</td> <td>84.2</td> <td>85.5</td> <td>61.0</td> <td>49.6</td> <td>54.6</td> <td>55.7</td> </tr> <tr> <td>B8</td> <td>88.4</td> <td>96.0</td> <td>82.0</td> <td>76.9</td> <td>72.6</td> <td>72.2</td> <td>71.2</td> <td>88.1</td> <td>80.5</td> <td>95.5</td> <td>95.5</td> <td>96.6</td> <td>98.5</td> <td>62.7</td> <td>99.4</td> <td>96.2</td> <td>88.5</td> <td>73.4</td> <td>73.0</td> <td>18.5</td> <td>83.8</td> <td>86.6</td> <td>63.2</td> <td>50.5</td> <td>57.2</td> <td>56.7</td> </tr> <tr> <td rowspan="2" colspan="2">NS EfficientNet</td> <td>L2-475</td> <td>91.6</td> <td>99.0</td> <td>91.0</td> <td>74.8</td> <td>76.4</td> <td>75.1</td> <td>66.8</td> <td>89.5</td> <td>81.9</td> <td>95.6</td> <td>96.5</td> <td>97.7</td> <td>98.9</td> <td>67.5</td> <td>99.3</td> <td>97.0</td> <td>89.5</td> <td>73.4</td> <td>68.9</td> <td>22.2</td> <td>86.3</td> <td>89.4</td> <td>68.2</td> <td>58.3</td> <td>58.6</td> <td>55.2</td> </tr> <tr> <td>L2-800</td> <td>92.0</td> <td>98.7</td> <td>89.0</td> <td>78.5</td> <td>75.7</td> <td>68.4</td> <td>68.4</td> <td>89.4</td> <td>82.5</td> <td>95.6</td> <td>94.7</td> <td>97.9</td> <td>98.5</td> <td>68.4</td> <td>99.3</td> <td>97.2</td> <td>89.9</td> <td>77.7</td> <td>66.9</td> <td>23.7</td> <td>86.8</td> <td>88.9</td> <td>58.4</td> <td>56.9</td> <td>88.4</td> <td>78.5</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt</td> <td>32x8d</td> <td>84.8</td> <td>95.9</td> <td>80.9</td> <td>63.8</td> <td>69.0</td> <td>74.2</td> <td>56.0</td> <td>88.0</td> <td>75.4</td> <td>95.4</td> <td>93.9</td> <td>91.7</td> <td>97.4</td> <td>60.7</td> <td>99.1</td> <td>95.7</td> <td>82.1</td> <td>72.3</td> <td>69.2</td> <td>16.7</td> <td>82.3</td> <td>80.1</td> <td>56.8</td> <td>42.2</td> <td>53.3</td> <td>55.2</td> </tr> <tr> <td>32x16d</td> <td>85.7</td> <td>96.5</td> <td>80.9</td> <td>64.8</td> <td>70.5</td> <td>77.5</td> <td>56.7</td> <td>87.9</td> <td>76.2</td> <td>95.6</td> <td>94.9</td> <td>92.5</td> <td>97.4</td> <td>61.3</td> <td>99.3</td> <td>95.5</td> <td>82.8</td> <td>73.8</td> <td>66.1</td> <td>17.5</td> <td>83.4</td> <td>81.1</td> <td>58.2</td> <td>41.3</td> <td>54.2</td> <td>56.1</td> </tr> <tr> <td>32x32d</td> <td>86.7</td> <td>96.8</td> <td>82.7</td> <td>67.1</td> <td>71.5</td> <td>77.5</td> <td>55.4</td> <td>88.3</td> <td>78.5</td> <td>95.8</td> <td>95.3</td> <td>94.4</td> <td>97.9</td> <td>62.4</td> <td>99.3</td> <td>95.7</td> <td>85.4</td> <td>71.2</td> <td>66.8</td> <td>18.0</td> <td>83.7</td> <td>82.1</td> <td>58.8</td> <td>39.7</td> <td>55.3</td> <td>56.7</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt (FixRes)</td> <td>32x48d</td> <td>86.9</td> <td>96.8</td> <td>83.4</td> <td>65.9</td> <td>72.2</td> <td>76.6</td> <td>53.2</td> <td>88.0</td> <td>77.2</td> <td>95.5</td> <td>95.8</td> <td>93.6</td> <td>98.1</td> <td>63.7</td> <td>99.4</td> <td>95.3</td> <td>85.4</td> <td>73.0</td> <td>67.2</td> <td>18.5</td> <td>82.7</td> <td>82.8</td> <td>59.2</td> <td>41.3</td> <td>55.5</td> <td>56.7</td> 
</tr> <tr> <td>FixRes-v1</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.4</td> <td>72.9</td> <td>80.5</td> <td>57.6</td> <td>88.0</td> <td>77.9</td> <td>95.8</td> <td>96.1</td> <td>94.5</td> <td>97.9</td> <td>62.2</td> <td>99.4</td> <td>96.2</td> <td>86.6</td> <td>76.1</td> <td>64.8</td> <td>19.3</td> <td>82.5</td> <td>83.4</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td>FixRes-v2</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.3</td> <td>72.9</td> <td>80.7</td> <td>57.5</td> <td>88.0</td> <td>77.9</td> <td>95.0</td> <td>96.0</td> <td>94.5</td> <td>98.0</td> <td>62.1</td> <td>99.4</td> <td>96.5</td> <td>86.1</td> <td>76.3</td> <td>64.8</td> <td>19.5</td> <td>82.3</td> <td>83.2</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td rowspan="2" colspan="2">BiT-S</td> <td>R50x1</td> <td>72.5</td> <td>91.7</td> <td>74.8</td> <td>57.7</td> <td>61.1</td> <td>53.5</td> <td>83.7</td> <td>72.4</td> <td>92.3</td> <td>91.2</td> <td>98.4</td> <td>56.5</td> <td>96.4</td> <td>97.4</td> <td>85.0</td> <td>70.0</td> <td>66.0</td> <td>12.5</td> <td>83.0</td> <td>72.3</td> <td>47.5</td> <td>48.3</td> <td>54.1</td> <td>55.3</td> </tr> <tr> <td>R50x3</td> <td>75.1</td> <td>93.7</td> <td>79.0</td> <td>61.1</td> <td>63.7</td> <td>55.2</td> <td>84.8</td> <td>74.6</td> <td>92.5</td> <td>91.6</td> <td>92.8</td> <td>98.8</td> <td>58.7</td> <td>97.0</td> <td>97.8</td> <td>86.4</td> <td>73.1</td> <td>73.8</td> <td>14.0</td> <td>84.2</td> <td>76.4</td> <td>50.0</td> <td>49.2</td> <td>54.7</td> <td>54.2</td> </tr> <tr> <td rowspan="4" colspan="2">BiT-M</td> <td>R101x1</td> <td>73.5</td> <td>92.8</td> <td>77.4</td> <td>58.4</td> <td>61.3</td> <td>54.0</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.8</td> <td>90.6</td> <td>98.3</td> <td>56.5</td> <td>96.8</td> <td>97.3</td> <td>84.6</td> <td>69.4</td> <td>68.9</td> <td>12.9</td> <td>82.0</td> <td>73.5</td> <td>73.5</td> <td>48.6</td> <td>45.4</td> <td>52.6</td> </tr> <tr> <td>R101x3</td> <td>74.7</td> <td>93.9</td> <td>79.8</td> <td>57.8</td> <td>62.9</td> <td>54.7</td> <td>84.7</td> <td>75.5</td> <td>92.3</td> <td>91.2</td> <td>92.6</td> <td>98.8</td> <td>59.7</td> <td>97.3</td> <td>98.0</td> <td>85.5</td> <td>71.8</td> <td>60.2</td> <td>14.1</td> <td>83.1</td> <td>75.9</td> <td>75.9</td> <td>50.4</td> <td>49.7</td> <td>54.1</td> </tr> <tr> <td>R152x2</td> <td>74.9</td> <td>94.3</td> <td>79.7</td> <td>58.7</td> <td>62.7</td> <td>55.9</td> <td>85.3</td> <td>74.9</td> <td>93.0</td> <td>92.0</td> <td>91.7</td> <td>98.6</td> <td>58.3</td> <td>97.3</td> <td>97.8</td> <td>86.2</td> <td>71.8</td> <td>71.6</td> <td>13.9</td> <td>84.1</td> <td>76.2</td> <td>76.2</td> <td>49.9</td> <td>48.2</td> <td>53.8</td> </tr> <tr> <td>R152x4</td> <td>74.7</td> <td>94.2</td> <td>79.2</td> <td>57.8</td> <td>62.9</td> <td>51.2</td> <td>85.4</td> <td>75.4</td> <td>93.1</td> <td>91.2</td> <td>91.4</td> <td>98.9</td> <td>61.4</td> <td>97.2</td> <td>98.0</td> <td>85.5</td> <td>72.8</td> <td>67.9</td> <td>14.9</td> <td>83.1</td> <td>76.0</td> <td>50.3</td> <td>42.9</td> <td>53.6</td> <td>56.0</td> </tr> <tr> <td rowspan="7" colspan="2">ViT (ImageNet-21k)</td> <td>B/32</td> <td>86.7</td> <td>96.9</td> <td>86.4</td> <td>74.0</td> <td>74.2</td> <td>54.7</td> <td>86.7</td> <td>86.3</td> <td>73.1</td> <td>90.4</td> <td>94.5</td> <td>97.8</td> <td>59.0</td> <td>99.0</td> <td>96.3</td> <td>83.0</td> <td>68.1</td> <td>65.1</td> <td>15.7</td> <td>82.6</td> <td>79.1</td> <td>51.7</td> <td>38.9</td> <td>57.1</td> <td>54.6</td> </tr> <tr> 
<td>B/16</td> <td>89.2</td> <td>97.4</td> <td>87.4</td> <td>76.5</td> <td>74.9</td> <td>62.5</td> <td>86.1</td> <td>75.4</td> <td>91.9</td> <td>94.7</td> <td>98.9</td> <td>62.0</td> <td>99.3</td> <td>97.6</td> <td>85.7</td> <td>70.4</td> <td>58.8</td> <td>17.7</td> <td>85.7</td> <td>84.1</td> <td>58.0</td> <td>38.4</td> <td>58.4</td> <td>52.8</td> </tr> <tr> <td>L/14</td> <td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>H/14</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> <tr> <td rowspan="2" colspan="2">SimCLRv2</td> <td>R50x1</td> <td>76.4</td> <td>93.2</td> <td>77.9</td> <td>48.6</td> <td>64.1</td> <td>56.3</td> <td>84.4</td> <td>77.0</td> <td>88.3</td> <td>91.8</td> <td>92.9</td> <td>97.6</td> <td>59.5</td> <td>96.7</td> <td>97.9</td> <td>85.8</td> <td>71.1</td> <td>69.1</td> <td>15.8</td> <td>84.8</td> <td>78.4</td> <td>51.0</td> <td>56.2</td> <td>53.9</td> <td>53.8</td> </tr> <tr> <td>R50x3</td> <td>82.2</td> <td>96.4</td> <td>83.4</td> <td>57.5</td> <td>68.2</td> <td>82.2</td> <td>86.9</td> <td>74.6</td> <td>60.6</td> <td>87.7</td> <td>78.5</td> <td>93.2</td> <td>95.3</td> <td>99.4</td> <td>98.6</td> <td>64.1</td> <td>99.3</td> <td>98.0</td> <td>88.1</td> <td>69.9</td> <td>59.6</td> <td>19.6</td> <td>83.4</td> <td>83.0</td> <td>57.8</td> <td>51.3</td> </tr> <tr> <td rowspan="2" colspan="2">BYOL</td> <td>R50x1</td> <td>77.0</td> <td>88.3</td> <td>93.7</td> <td>94.3</td> <td>98.2</td> <td>58.8</td> <td>96.1</td> <td>96.4</td> <td>97.6</td> <td>88.4</td> <td>71.1</td> <td>71.4</td> <td>14.1</td> <td>84.8</td> <td>8.3</td> <td>45.3</td> <td>56.1</td> <td>53.8</td> <td>52.7</td> <td>73.2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> </tr> <tr> <td>R200x2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> <td>59.5</td> <td>96.5</td> <td>96.8</td> <td>97.9</td> <td>88.8</td> <td>74.4</td> <td>70.3</td> <td>16.4</td> <td>84.0</td> <td>16.4</td> <td>84.0</td> <td>77.7</td> <td>47.7</td> <td>56.9</td> <td>53.9</td> <td>53.8</td> <td>69.1</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> </tr> <tr> <td rowspan="2" colspan="2">MoCo</td> <td>v1</td> <td>77.2</td> <td>93.4</td> <td>76.3</td> <td>39.6</td> <td>60.2</td> <td>48.3</td> <td>82.6</td> <td>75.1</td> <td>84.4</td> <td>89.9</td> <td>90.7</td> <td>98.4</td> <td>58.3</td> <td>95.7</td> <td>97.2</td> <td>85.4</td> <td>75.7</td> <td>71.1</td> <td>12.6</td> <td>85.7</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> <td>47.8</td> </tr> <tr> <td>v2</td> <td>82.9</td> <td>96.6</td> <td>82.9</td> <td>60.2</td> <td>66.0</td> <td>54.3</td> <td>85.6</td> <td>76.6</td> <td>91.8</td> <td>94.6</td> <td>97.4</td> <td>99.2</td> <td>62.6</td> <td>97.1</td> <td>98.0</td> <td>86.5</td> <td>76.2</td> <td>72.2</td> <td>14.2</td> <td>86.0</td> <td>81.2</td> <td>53.4</td> <td>53.8</td> <td>56.9</td> <td>53.9</td> </tr> <tr> <td 
colspan="2">VirTex</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> <td>60.6</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> </tr> <tr> <td rowspan="3" colspan="2">ResNet</td> <td>50</td> <td>72.2</td> <td>91.8</td> <td>74.6</td> <td>58.2</td> <td>60.9</td> <td>53.3</td> <td>83.5</td> <td>71.9</td> <td>92.1</td> <td>91.1</td> <td>98.3</td> <td>56.3</td> <td>96.3</td> <td>97.3</td> <td>84.7</td> <td>69.8</td> <td>65.8</td> <td>12.3</td> <td>82.8</td> <td>72.1</td> <td>47.2</td> <td>48.0</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>101</td> <td>74.0</td> <td>92.8</td> <td>77.2</td> <td>59.8</td> <td>62.8</td> <td>54.5</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.6</td> <td>90.4</td> <td>98.6</td> <td>57.8</td> <td>96.8</td> <td>97.8</td> <td>85.0</td> <td>70.7</td> <td>68.5</td> <td>13.0</td> <td>82.0</td> <td>73.0</td> <td>48.5</td> <td>48.9</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>152</td> <td>74.6</td> <td>93.5</td> <td>79.0</td> <td>60.8</td> <td>63.6</td> <td>55.4</td> <td>84.9</td> <td>74.4</td> <td>92.9</td> <td>91.7</td> <td>90.8</td> <td>98.7</td> <td>58.2</td> <td>97.0</td> <td>97.9</td> <td>85.4</td> <td>71.3</td> <td>69.6</td> <td>13.6</td> <td>82.5</td> <td>73.9</td> <td>49.3</td> <td>49.7</td> <td>54.0</td> <td>55.1</td> </tr> </tbody> </table></div> The following figure (Figure 20 from the original paper) plots the linear probe performance for each of the 27 datasets, using the data from Table 10. ![Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10.](/files/papers/6956a9a55411c3e2652eae93/images/19.jpg) *该图像是图表,展示了27个数据集的线性探针性能。各条曲线代表不同的模型,横轴为GFLOPs/image,纵轴为准确率。图中的星标和颜色区分了不同的方法,如CLIP和ResNet等。* Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10. The following are the results from Table 11 of the original paper: Zero-shot performance of CLIP models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2" colspan="2"></td> <th colspan="25">Zero-Shot Performance</th> </tr> <tr> <td colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td rowspan="5" colspan="2">CLIP-ResNet</td> <td>RN50</td> <td>75.6</td> <td>81.1</td> <td>41.6</td> <td>32.6</td> <td>59.6</td> <td>55.8</td> <td>19.3</td> <td>82.1</td> <td>41.7</td> <td>85.4</td> <td>82.1</td> <td>65.9</td> <td>66.6</td> <td>42.2</td> <td>94.3</td> <td>41.1</td> <td>54.2</td> <td>35.2</td> <td>42.2</td> <td>16.1</td> <td>57.6</td> <td>63.6</td> <td>43.5</td> <td>20.3</td> <td>59.7</td> <td>56.9</td> </tr> <tr> <td>RN101</td> <td>81.0</td> <td>83.9</td> <td>49.0</td> <td>37.2</td> <td>59.9</td> <td>62.3</td> <td>19.5</td> <td>82.4</td> <td>43.9</td> <td>86.2</td> <td>85.1</td> <td>65.7</td> <td>59.3</td> <td>45.6</td> <td>96.7</td> <td>33.1</td> <td>58.5</td> <td>38.3</td> <td>33.3</td> <td>16.9</td> <td>55.2</td> <td>62.2</td> <td>46.7</td> <td>28.1</td> <td>61.1</td> <td>64.2</td> </tr> <tr> <td>RN50x4</td> <td>86.8</td> <td>79.2</td> <td>48.9</td> <td>41.6</td> <td>62.7</td> <td>67.9</td> <td>24.6</td> <td>83.0</td> <td>49.3</td> <td>88.1</td> <td>86.0</td> <td>68.0</td> <td>75.2</td> <td>51.1</td> <td>96.4</td> <td>35.0</td> <td>59.2</td> <td>35.7</td> <td>26.0</td> <td>20.2</td> <td>57.5</td> <td>65.5</td> <td>49.0</td> <td>17.0</td> <td>58.3</td> <td>66.6</td> </tr> <tr> <td>RN50x16</td> <td>90.5</td> <td>82.2</td> <td>54.2</td> <td>45.9</td> <td>65.0</td> <td>72.3</td> <td>30.3</td> <td>82.9</td> <td>52.8</td> <td>89.7</td> <td>87.6</td> <td>71.9</td> <td>80.0</td> <td>56.0</td> <td>97.8</td> <td>40.3</td> <td>64.4</td> <td>39.6</td> <td>33.9</td> <td>24.0</td> <td>62.5</td> <td>68.7</td> <td>53.4</td> <td>17.6</td> <td>91.8</td> <td>86.8</td> </tr> <tr> <td>RN50x64</td> <td>91.8</td> <td>86.8</td> <td>61.3</td> <td>48.9</td> <td>66.9</td> <td>76.0</td> <td>35.6</td> <td>83.8</td> <td>53.4</td> <td>93.4</td> <td>90.6</td> <td>77.3</td> <td>90.8</td> <td>61.0</td> <td>98.3</td> <td>59.4</td> <td>69.7</td> <td>47.9</td> <td>33.2</td> <td>29.6</td> <td>65.0</td> <td>74.1</td> <td>56.8</td> <td>27.5</td> <td>62.1</td> <td>70.7</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>83.1</td> <td>44.5</td> <td>87.0</td> <td>87.9</td> <td>66.7</td> <td>51.9</td> <td>47.3</td> <td>97.2</td> <td>49.4</td> <td>60.3</td> <td>32.2</td> <td>39.4</td> <td>17.8</td> <td>58.4</td> <td>64.5</td> <td>47.8</td> <td>24.8</td> <td>57.6</td> <td>59.6</td> <td>63.2</td> <td>84.4</td> <td>91.3</td> <td>65.1</td> <td>37.8</td> <td>63.2</td> <td>59.4</td> </tr> <tr> <td>ViT-B/16</td> <td>89.2</td> <td>91.6</td> <td>68.7</td> <td>39.1</td> <td>65.2</td> <td>65.6</td> <td>27.1</td> <td>83.9</td> <td>46.0</td> <td>88.9</td> <td>89.3</td> <td>70.4</td> <td>56.0</td> <td>52.7</td> <td>98.2</td> <td>54.1</td> <td>65.5</td> <td>43.3</td> <td>44.0</td> <td>23.3</td> <td>48.1</td> <td>69.8</td> <td>52.4</td> <td>23.4</td> <td>61.7</td> <td>59.8</td> </tr> <tr> <td>ViT-L/14</td> 
<td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div> The following figure (Figure 22 from the original paper) shows CLIP's zero-shot performance compared to linear-probe ResNet performance. ![Figure 22. CLIP's zero-shot performance compared to linear-probe ResNet performance](/files/papers/6956a9a55411c3e2652eae93/images/21.jpg) *该图像是图表,展示了CLIP模型在27个数据集上的零-shot性能与线性探测ResNet的性能对比。图中可见不同数据集的准确率变化情况,CLIP模型在多个任务中显示出了优越的性能。* Figure 22. CLIP's zero-shot performance compared to linear-probe ResNet performance The following are the results from Table 16 of the original paper: Robustness performance on natural distribution shift datasets. <div class="table-wrapper"><table> <thead> <tr> <th rowspan="2"></th> <th rowspan="2">IN Top-1</th> <th rowspan="2">IN-V2 Top-1</th> <th rowspan="2">IN-A Top-1</th> <th rowspan="2">IN-R Top-1</th> <th rowspan="2">ObjectNet Top-1</th> <th rowspan="2">IN-Sketch Top-1</th> <th colspan="2">IN-Vid</th> <th colspan="2">YTBB</th> </tr> <tr> <th>PM0</th> <th>PM10</th> <th>PM0</th> <th>PM10</th> </tr> </thead> <tbody> <tr> <td>NS EfficientNet-L2a</td> <td>88.3</td> <td>80.2</td> <td>84.9</td> <td>74.7</td> <td>68.5</td> <td>47.6</td> <td>88.0</td> <td>82.1</td> <td>67.7</td> <td>63.5</td> </tr> <tr> <td>FixResNeXt101-32x48d V2b</td> <td>86.4</td> <td>78.0</td> <td>68.4</td> <td>80.0</td> <td>57.8</td> <td>59.1</td> <td>85.8</td> <td>72.2</td> <td>68.9</td> <td>57.7</td> </tr> <tr> <td>Linear Probe CLIP</td> <td>85.4</td> <td>75.9</td> <td>75.3</td> <td>84.2</td> <td>66.2</td> <td>57.4</td> <td>89.1</td> <td>77.2</td> <td>68.7</td> <td>63.1</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>76.2</td> <td>70.1</td> <td>77.2</td> <td>88.9</td> <td>72.3</td> <td>60.2</td> <td>95.3</td> <td>89.2</td> <td>95.2</td> <td>88.5</td> </tr> </tbody> </table></div> ### 6.3. Ablation Studies / Parameter Analysis **Dataset Ablation on YFCC100M:** (Already covered in Section 6.1.7). This study confirms that while the specific data blend (YFCC vs. WIT) can cause performance variations on fine-grained tasks (due to different distributions of concepts), the overall approach is robust across different reasonably filtered large image-text collections. The key advantage of `WIT` is its sheer scale. **Prompt Engineering and Ensembling:** (Already covered in Section 4.2.7 and Figure 4). This is a crucial "parameter analysis" showing that careful crafting of `natural language prompts` and combining multiple prompts (`ensembling`) significantly improves zero-shot performance (almost 5 points on ImageNet). This effectively leverages the `expressiveness of language` to guide the model. **Model Scaling:** (Already covered in Section 6.1.2 and Figure 9). 
The observation of `log-log linear scaling` of zero-shot error with model compute across both `ResNet` and `Vision Transformer` backbones is itself a powerful parameter analysis: increasing model size and compute yields predictable improvements in performance, underscoring the scalability of the CLIP approach. The specific hyperparameters for model architecture and size (Tables 19 and 20 in Methodology) are systematically varied to explore this scaling.

**Temperature Parameter ($\tau$):** The temperature $\tau$ is optimized directly during training as a log-parameterized multiplicative scalar rather than tuned as a hyperparameter. This design choice automates a critical parameter that scales the logits in the contrastive loss and therefore has a significant effect on training stability and performance.
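
To make the log-parameterized temperature concrete, here is a minimal PyTorch-style sketch of the symmetric contrastive loss, closely following the pseudocode in the paper. The names (`logit_scale`, `clip_loss`) are illustrative rather than the released implementation, and the paper additionally clips the scale so that logits are never multiplied by more than 100.

```python
import torch
import torch.nn.functional as F

# Learnable log-temperature, initialized to the equivalent of tau = 0.07;
# exponentiating keeps the multiplicative scale positive during training.
logit_scale = torch.nn.Parameter(torch.ones([]) * torch.log(torch.tensor(1 / 0.07)))

def clip_loss(image_features, text_features, logit_scale):
    # L2-normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities, scaled by the learned temperature.
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matching (image, text) pairs lie on the diagonal of the logit matrix.
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```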

**Resolution Scaling:** The ViT-L/14 model is additionally pre-trained at a higher 336-pixel input resolution for one extra epoch to boost performance (denoted ViT-L/14@336px). This FixRes-style step demonstrates that increasing input resolution can further enhance performance for large Vision Transformers, even with minimal additional training.
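
As a concrete companion to the prompt engineering and ensembling analysis above, the sketch below builds a zero-shot classifier by averaging the text embeddings of several prompt templates per class and then classifying images against those averaged embeddings. A CLIP-like `model` exposing `encode_image`/`encode_text` and a `tokenize` function are assumed (as in the released repository); the templates and helper names are examples, not the paper's full 80-template ImageNet set.

```python
import torch
import torch.nn.functional as F

# Example prompt templates; the paper ensembles many more per dataset.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]

def build_zero_shot_classifier(model, tokenize, class_names, device="cpu"):
    weights = []
    with torch.no_grad():
        for name in class_names:
            # Encode every prompt template filled with this class name.
            tokens = tokenize([t.format(name) for t in TEMPLATES]).to(device)
            emb = F.normalize(model.encode_text(tokens), dim=-1)
            # Average over templates, then re-normalize (prompt ensembling).
            weights.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(weights, dim=1)  # shape: (embed_dim, num_classes)

def zero_shot_predict(model, images, classifier):
    with torch.no_grad():
        feats = F.normalize(model.encode_image(images), dim=-1)
        return (feats @ classifier).argmax(dim=-1)  # predicted class indices
```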

## 7. Conclusion & Reflections

### 7.1. Conclusion Summary

This paper investigates whether the task-agnostic, web-scale pre-training that has transformed Natural Language Processing (NLP) transfers to computer vision. The authors demonstrate that Contrastive Language-Image Pre-training (CLIP), a simple yet scalable approach that predicts which caption goes with which image on a massive dataset of 400 million (image, text) pairs (WIT), enables visual models to learn a wide variety of tasks during pre-training. This task learning can then be leveraged through natural language prompting to achieve zero-shot transfer to numerous existing computer vision datasets. At sufficient scale, CLIP is competitive with task-specific supervised models and exhibits significantly higher robustness to natural distribution shift. The work also highlights the emergence of predictable scaling laws in multimodal learning and underscores the significant social implications of such flexible and powerful models.

### 7.2. Limitations & Future Work

The authors are transparent about several limitations of CLIP and suggest future research directions:

  • Performance Gap to SOTA: While zero-shot CLIP is competitive with a simple ResNet-50 linear-classifier baseline, it still falls well short of the overall state of the art on many datasets, which is typically set by task-specific, fully supervised models. The authors estimate that roughly a 1000x increase in compute would be required for zero-shot CLIP to reach overall state-of-the-art performance, which is currently infeasible, so future work needs to focus on improving the computational and data efficiency of the approach.
  • Weak Performance on Specific Tasks: CLIP's zero-shot performance remains weak on several kinds of tasks:
    • Fine-grained classification: Differentiating car models, flower species, aircraft variants.
    • Abstract/systematic tasks: Counting objects in an image.
    • Novel tasks: Tasks unlikely to be represented in the pre-training dataset, such as classifying the distance to the nearest car in a photo.
  • Brittle Generalization to Truly Out-of-Distribution (OOD) Data: Despite strong robustness to natural distribution shifts, CLIP can still fail catastrophically on truly out-of-distribution data. The example of MNIST (handwritten digits), where CLIP's OCR performance is poor, illustrates that training on a large, varied dataset does not guarantee robustness to all unseen domains. This suggests CLIP circumvents the brittle generalization problem rather than addressing its underlying cause.
  • Limited Output Flexibility: CLIP is currently limited to choosing from predefined concepts in a given zero-shot classifier. It cannot generate novel outputs like an image captioning model. Future work could explore joint training of contrastive and generative objectives or search over natural language explanations at inference time to combine efficiency with flexibility.
  • Data Inefficiency: CLIP does not address the poor data efficiency of deep learning; it compensates for it with web-scale supervision. Training on 400 million pairs for 32 epochs means seeing 12.8 billion images, which would take 405 years to view at a rate of one image per second. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction for improving data efficiency.
  • Methodological Limitations:
    • Validation Set Usage: Repeated querying of full validation sets during development, while standard, is unrealistic for true zero-shot scenarios.
    • Evaluation Dataset Selection: The main 27-dataset evaluation suite, while diverse, was haphazardly assembled and co-adapted with CLIP's development. A new benchmark explicitly designed for broad zero-shot transfer is needed.
  • Social Biases: CLIP, trained on unfiltered internet text-image pairs, learns many social biases, similar to image caption models (Bhargava & Forsyth, 2019). The Broader Impacts section details denigration harms and gendered associations in classification. Future work is needed for broader, more contextual, and robust bias testing and mitigation strategies.
  • Zero-shot vs. Few-shot Discrepancy: CLIP's current few-shot setup (fitting a linear classifier on top of its features, as sketched after this list) shows a counter-intuitive drop in performance relative to zero-shot when only one or two examples per class are available, unlike humans, who gain substantially from zero-shot to one-shot. Developing methods that combine CLIP's strong zero-shot performance with efficient few-shot learning is an important direction.
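
For context, the linear-probe protocol behind this few-shot comparison is straightforward: freeze the CLIP image encoder, extract features, and fit a logistic regression classifier on top. Below is a minimal sketch assuming a CLIP-like `model` with an `encode_image` method and standard PyTorch data loaders; the regularization strength shown is illustrative, whereas the paper tunes it per dataset.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Linear-probe / few-shot evaluation sketch: a logistic regression classifier
# trained on frozen CLIP image features. C=1.0 is an illustrative
# regularization strength, not the paper's tuned per-dataset value.

def extract_features(model, loader, device="cpu"):
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))
            feats.append(f.float().cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe_accuracy(model, train_loader, test_loader):
    x_train, y_train = extract_features(model, train_loader)
    x_test, y_test = extract_features(model, test_loader)
    clf = LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs")
    clf.fit(x_train, y_train)
    return clf.score(x_test, y_test)  # top-1 accuracy on the held-out set
```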

### 7.3. Personal Insights & Critique

CLIP represents a significant paradigm shift in computer vision, mirroring the pre-training revolution seen in NLP. The core insight that natural language can serve as a vast, weakly supervised signal for learning transferable visual representations is incredibly powerful. The most inspiring aspect is the demonstration of zero-shot transfer at a competitive level, effectively allowing users to "program" a visual classifier with natural language. This democratizes access to powerful computer vision capabilities, as it removes the immense barrier of collecting and labeling task-specific datasets.

The methods and conclusions of CLIP can definitely be transferred to other domains. The idea of multimodal contrastive learning from web-scale paired data is applicable wherever such data exists (e.g., video-text pairs, audio-text pairs, 3D model-text pairs). This opens avenues for more general perceptual AI systems that understand the world not just visually, but also through sound, motion, and other sensory inputs, all grounded in language.

However, the paper also raises crucial ethical considerations, particularly regarding social biases and surveillance. The bias probes revealed that CLIP learns and can perpetuate harmful stereotypes (e.g., disproportionately classifying "Black" images into "non-human" categories, or "male" images into "crime-related" categories). The finding that class design (e.g., adding a "child" category) can drastically alter bias manifestation is a critical insight, highlighting that AI developers wield immense power in shaping model behavior and impact. This calls for greater awareness and responsibility in defining classes and setting decision thresholds.

A potential issue or unverified assumption is that simply increasing compute and data scale will continue to solve all generality and robustness problems. While scaling laws show predictable improvements, the MNIST example and performance on abstract tasks (like counting) suggest there are fundamental limitations to this brute-force approach. Truly out-of-distribution generalization and higher-level systematic reasoning might require architectural innovations beyond just scaling Transformers and ResNets. The observation that supervised fine-tuning reduces effective robustness is also a critical point for future research: does task-specific fine-tuning inherently lead models to rely on spurious correlations regardless of pre-training, or can more robust fine-tuning strategies be developed?

My personal critique is that while the paper acknowledges limitations and societal impacts, the emphasis is still heavily on performance metrics. The community needs to evolve evaluation metrics to explicitly include measures of fairness, robustness to unseen biases, and interpretability during development, not just as an afterthought. The "omni-use" nature of CLIP, making it easy to create "bespoke, niche surveillance use cases," demands a more proactive regulatory and ethical framework from researchers and policymakers. The paper's call for community exploration to characterize capabilities and biases is vital, but the responsibility to lead on ethical development often falls on the creators of such powerful foundational models.
