
Learning Transferable Visual Models From Natural Language Supervision

Published: 02/27/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study presents a method for learning transferable visual models from natural language supervision, showing that pretraining on 400 million image-text pairs enables zero-shot transfer across various tasks, rivaling fully supervised models in performance.

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The paper is titled "Learning Transferable Visual Models From Natural Language Supervision," which also summarizes its central topic.

1.2. Authors

The authors are Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. All authors are affiliated with OpenAI.

OpenAI is a prominent artificial intelligence research laboratory based in San Francisco, California. It is widely recognized for its contributions to deep learning, particularly in the fields of natural language processing (NLP) with models like GPT (Generative Pre-trained Transformer) and computer vision. OpenAI has a strong reputation for pushing the boundaries of AI research, often publishing influential papers and releasing open-source models that significantly impact the broader AI community.

1.3. Journal/Conference

This paper was released on 2021-02-26 at 19:04:58 (UTC). While the provided information doesn't specify a formal conference or journal name, the date and "arXiv preprint" status (as indicated by the original source link) indicate it was released as a preprint on arXiv, a popular open-access archive for scientific papers. Papers on arXiv are not peer-reviewed before publication but are widely read and cited within the research community, often preceding formal publication in top-tier conferences (e.g., NeurIPS, ICML, CVPR, ICLR) or journals. Given OpenAI's track record, this paper received significant attention upon its preprint release.

1.4. Publication Year

The paper was released in 2021, first appearing as an arXiv preprint in February 2021.

1.5. Abstract
1.5. Abstract

State-of-the-art computer vision systems traditionally rely on training with a fixed set of predefined object categories, which limits their generality and requires extensive re-labeling for new visual concepts. This paper proposes a promising alternative: learning directly from raw text about images, a much broader source of supervision. The authors demonstrate that a simple pre-training task—predicting which caption corresponds to which image—is an efficient and scalable method to learn state-of-the-art image representations. They achieve this by training a model, called CLIP (Contrastive Language-Image Pre-training), from scratch on a massive dataset of 400 million (image, text) pairs collected from the internet.

After pre-training, CLIP uses natural language to reference learned visual concepts or describe new ones, enabling zero-shot transfer to downstream tasks without additional labeled data. The authors benchmark CLIP's performance across over 30 diverse computer vision datasets, covering tasks like optical character recognition (OCR), action recognition in videos, geo-localization, and fine-grained object classification. The model demonstrates non-trivial transfer performance on most tasks, often rivaling fully supervised baselines. For instance, CLIP matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without utilizing its 1.28 million training examples. The authors also release their code and pre-trained model weights.

The paper is available as an arXiv preprint at https://arxiv.org/abs/2103.00020, making it publicly accessible and widely shared within the scientific community.

https://arxiv.org/pdf/2103.00020v1.pdf

2. Executive Summary

2.1. Background & Motivation

The paper addresses a fundamental limitation in conventional computer vision (CV) systems: their reliance on fixed sets of predetermined object categories. Traditionally, these systems are trained on meticulously hand-labeled datasets (like ImageNet, which contains 1000 object classes), requiring significant human effort to define and annotate classes. This approach creates several critical problems:

  • Limited Generality and Usability: If a user wants to detect a visual concept not included in the original 1000 categories, the model typically needs to be fine-tuned (further trained) on a new, labeled dataset specific to that concept. This process is costly, time-consuming, and limits the system's flexibility in real-world applications where new concepts constantly emerge.

  • Scalability Bottleneck: Creating large-scale, high-quality, crowd-labeled datasets for every conceivable visual concept is practically impossible. The traditional paradigm struggles to scale to the vast and ever-growing diversity of visual information.

  • Brittleness to Distribution Shift: Models trained on specific distributions (e.g., ImageNet photos) often perform poorly when encountering natural variations in data (e.g., different styles of images, backgrounds, or contexts), a phenomenon known as distribution shift.

    The paper draws inspiration from the revolution in Natural Language Processing (NLP), where pre-training methods learning directly from raw, uncurated web text (e.g., autoregressive or masked language modeling objectives) have led to highly transferable and task-agnostic models like GPT-3. These NLP models can perform a wide array of downstream tasks with little to no dataset-specific training (i.e., zero-shot or few-shot learning), demonstrating that web-scale collections of text contain a richer source of supervision than traditional crowd-labeled datasets.

The paper's entry point and innovative idea are to apply this successful NLP paradigm to computer vision. Specifically, it asks: "Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?" The authors propose to leverage the implicit supervision contained in natural language accompanying images on the internet to train visual models that are more general, flexible, and robust, without the need for explicit gold labels for every visual concept.

2.2. Main Contributions / Findings

The paper makes several primary contributions and reports key findings that significantly advance the field of computer vision:

  • Introduction of CLIP (Contrastive Language-Image Pre-training): The paper proposes a novel and efficient pre-training method that uses a simple contrastive learning objective. Instead of trying to predict exact captions, CLIP learns to predict which text description matches which image within a batch. This approach creates a powerful multimodal embedding space where images and their corresponding text descriptions are brought closer together.
  • Creation of a Large-Scale WebImageText (WIT) Dataset: To enable web-scale pre-training, the authors constructed a new dataset of 400 million (image, text) pairs by collecting publicly available data from the internet, a scale orders of magnitude larger than previous datasets used for similar tasks. This dataset is crucial for the success of CLIP, providing a broad source of natural language supervision.
  • Demonstration of Strong Zero-Shot Transfer Capabilities: After pre-training, CLIP can perform zero-shot classification on a wide variety of downstream computer vision tasks without any additional training or fine-tuning on task-specific labeled data.
    • SOTA-Matching Performance: CLIP demonstrates remarkable performance, often being competitive with fully supervised baselines. Notably, it matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without using any of ImageNet's 1.28 million training examples.
    • Broad Task Coverage: CLIP's task-learning capabilities are extensively benchmarked across over 30 diverse datasets, spanning tasks such as OCR (optical character recognition), action recognition in videos, geo-localization, and various types of fine-grained object classification. This broad evaluation highlights its generality.
  • Improved Robustness to Natural Distribution Shift: The paper finds that zero-shot CLIP models are significantly more robust to natural distribution shifts (i.e., performance on new test sets with differing image characteristics) compared to supervised ImageNet models of equivalent accuracy. This suggests that zero-shot evaluation better reflects a model's true capabilities and is less susceptible to spurious correlations learned from specific training distributions.
  • Discovery of Scaling Laws: The authors observe that CLIP's transfer performance is a smoothly predictable function of compute, similar to observations in large language models like GPT. This log-log linear scaling trend suggests a clear path for future performance improvements by simply increasing compute and model capacity.
  • Analysis of Societal Impacts: The paper includes a dedicated section discussing the broader impacts of such a powerful and flexible model, including social biases (e.g., in race, gender, age classification, and crime-related associations) and its potential use in surveillance. It emphasizes the importance of class design and thresholding in mitigating biases.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the CLIP paper, a beginner should be familiar with several core concepts from machine learning, particularly deep learning, computer vision, and natural language processing.

  • Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data with multiple levels of abstraction.

    • Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers, processing information through weights and biases.
    • Weights and Biases: Parameters within a neural network that are adjusted during training to learn patterns in the data. Weights determine the strength of connections between neurons, and biases provide an offset to the activation of a neuron.
    • Activation Functions: Mathematical functions that introduce non-linearity into the network, allowing it to learn complex patterns. Examples include ReLU (Rectified Linear Unit), sigmoid, and softmax.
    • Loss Function (Objective Function): A function that quantifies the difference between the predicted output of a model and the true target values. The goal of training is to minimize this loss.
    • Optimizer: An algorithm (e.g., Stochastic Gradient Descent (SGD), Adam) used to adjust the model's weights and biases to minimize the loss function.
    • Backpropagation: An algorithm used to efficiently compute the gradients of the loss function with respect to the network's weights, enabling their iterative adjustment.
    • Pre-training: An initial training phase where a model is trained on a large, general dataset with a broad objective (e.g., predicting the next word, distinguishing image-text pairs). The learned representations (features) are then transferred to downstream tasks.
    • Fine-tuning: A subsequent training phase where a pre-trained model's weights are slightly adjusted using a smaller, task-specific dataset and objective to adapt it to a particular downstream task.
    • Representation Learning: The process of automatically discovering good representations of raw data for specific tasks. A good representation captures the essential information from the data in a useful format, often making downstream tasks easier.
  • Computer Vision (CV): A field of artificial intelligence that enables computers to "see" and interpret visual data from the world.

    • Image Classification: The task of assigning a label or category to an entire input image (e.g., "cat," "dog," "car").
    • Object Detection: Identifying and localizing objects within an image by drawing bounding boxes around them and assigning a class label to each.
    • Semantic Segmentation: Assigning a class label to every pixel in an image, effectively partitioning the image into meaningful regions (e.g., "road," "sky," "person").
    • Convolutional Neural Networks (CNNs): A class of deep neural networks specifically designed for processing grid-like data such as images. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
      • Convolutional Layer: A core building block of CNNs that applies a set of learnable filters (kernels) to the input image, producing feature maps that highlight specific patterns (edges, textures, etc.).
      • Pooling Layer: Reduces the spatial dimensions of the feature maps, helping to make the learned features more robust to minor variations in input and reducing computational cost.
      • ResNet (Residual Network): A type of CNN architecture (He et al., 2016a) that addresses the problem of vanishing gradients in very deep networks by introducing skip connections or residual connections. These connections allow gradients to flow directly through layers, enabling the training of much deeper models.
    • Vision Transformer (ViT): A recent architecture (Dosovitskiy et al., 2020) that applies the Transformer architecture (originally for NLP) directly to images. It works by splitting an image into fixed-size patches, linearly embedding them, adding positional embeddings, and feeding the resulting sequence of vectors to a standard Transformer encoder.
  • Natural Language Processing (NLP): A field of AI concerned with enabling computers to understand, interpret, and generate human language.

    • Language Models: Statistical models that learn the probability distribution of sequences of words in a language. They can predict the next word in a sentence or fill in masked words.
    • Embeddings (Word, Sentence, Text): Numerical vector representations of words, sentences, or entire text snippets. Words with similar meanings have similar embeddings. Byte Pair Encoding (BPE) is a subword tokenization algorithm used to handle rare words and out-of-vocabulary words by breaking them into common subword units.
    • Transformer: A neural network architecture (Vaswani et al., 2017) that relies heavily on the self-attention mechanism to process sequences of data. It has revolutionized NLP due to its ability to capture long-range dependencies in text.
      • Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence when processing each element. For each token in a sequence, it computes query (Q), key (K), and value (V) vectors. The attention score for a token is calculated by taking the dot product of its query with all keys, scaling, and applying a softmax function. This score is then multiplied by the values. The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ is the matrix of queries, $K$ is the matrix of keys, $V$ is the matrix of values, and $d_k$ is the dimension of the keys, used for scaling.
      • Multi-Head Attention: Extends self-attention by running the attention mechanism multiple times in parallel with different learned linear projections of Q, K, V. This allows the model to focus on different parts of the input from various perspectives.
    • GPT (Generative Pre-trained Transformer): A family of autoregressive language models (Radford et al., 2018, 2019; Brown et al., 2020) developed by OpenAI. They are pre-trained on vast amounts of text data to predict the next token in a sequence and have shown impressive zero-shot and few-shot learning capabilities across many NLP tasks.
    • BERT (Bidirectional Encoder Representations from Transformers): A masked language model (Devlin et al., 2018) that learns bidirectional contextual embeddings by predicting masked words in a sentence and also predicting whether two sentences follow each other.
  • Supervision Paradigms:

    • Supervised Learning: Training models on datasets where each input example is explicitly paired with a correct output label (e.g., image and its object category).
    • Weakly Supervised Learning: Training with noisy, imprecise, or incomplete labels, often obtained automatically or at a low cost (e.g., image paired with hashtags, rather than precise object bounding boxes).
    • Self-Supervised Learning: A form of unsupervised learning where the data itself provides the supervision. Models learn by solving a pretext task (e.g., predicting missing parts of an input, rotating an image) that doesn't require human-labeled data, thereby learning useful representations.
    • Unsupervised Learning: Learning patterns from unlabeled data, often to discover hidden structures or generate new data (e.g., clustering, dimensionality reduction).
  • Multimodal Learning: Combining information from multiple modalities (e.g., vision and language) to achieve a more comprehensive understanding of data.

    • Multimodal Embedding Space: A shared vector space where representations from different modalities (e.g., image embeddings and text embeddings) can be compared and related. In such a space, semantically similar items, regardless of modality, are embedded close to each other.
    • Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It measures the cosine of the angle between them. A cosine similarity of 1 means identical direction (most similar), 0 means orthogonal (no similarity), and -1 means opposite direction (most dissimilar). The formula for cosine similarity between two vectors $A$ and $B$ is: $ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $ where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, and $\|A\|$ and $\|B\|$ are their magnitudes.
    • Contrastive Learning: A self-supervised learning approach where models learn by contrasting positive pairs (similar samples) with negative pairs (dissimilar samples). The goal is to bring representations of positive pairs closer in the embedding space while pushing negative pairs apart.
      • InfoNCE Loss (Noise-Contrastive Estimation): A commonly used contrastive loss function (Oord et al., 2018) that aims to maximize the mutual information between different views of the same data point. It typically involves comparing a positive pair (e.g., an image and its caption) against a set of negative samples. Given a query embedding $q$ and a set of key embeddings $\{k_0, k_1, \dots, k_N\}$, where $k_0$ is the positive key and $k_1, \dots, k_N$ are negative keys, the InfoNCE loss is: $ L_q = -\log \frac{\exp(\text{sim}(q, k_0) / \tau)}{\sum_{i=0}^{N} \exp(\text{sim}(q, k_i) / \tau)} $ where $\text{sim}(q, k)$ is a similarity function (e.g., cosine similarity) and $\tau$ is a temperature parameter that scales the logits before the softmax. (A small NumPy sketch of cosine similarity and this loss appears after this list.)
      • Multi-class N-pair Loss: An extension of metric learning (Sohn, 2016) that uses $N-1$ negative samples for each anchor in a batch of $N$ samples. This loss is similar in spirit to InfoNCE and is the direct inspiration for CLIP's objective.
  • Zero-Shot Learning (ZSL): The ability of a model to recognize objects or perform tasks that it has never encountered during training, typically by leveraging side information (e.g., textual descriptions of classes).

  • Few-Shot Learning (FSL): The ability of a model to learn new tasks or recognize new classes from a very small number of examples (e.g., 1-shot, 5-shot).
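
To ground the cosine similarity and InfoNCE formulas above, here is a small, hedged NumPy sketch. The function names and toy shapes are illustrative choices for this analysis, not taken from the paper or any particular library.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) between two vectors: dot product divided by the product of magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce_loss(query, keys, temperature=0.07):
    """InfoNCE-style loss: keys[0] is the positive key, keys[1:] are negatives.

    Maximizes similarity(query, keys[0]) relative to all keys, via a softmax
    over similarities divided by the temperature.
    """
    sims = np.array([cosine_similarity(query, k) for k in keys]) / temperature
    sims -= sims.max()                                   # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum())
    return -log_softmax[0]                               # negative log-probability of the positive key

# toy usage: one positive key close to the query, several random negatives
rng = np.random.default_rng(0)
q = rng.normal(size=8)
keys = [q + 0.1 * rng.normal(size=8)] + [rng.normal(size=8) for _ in range(4)]
print(info_nce_loss(q, keys))
```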

3.2. Previous Works

The paper contextualizes its work by tracing relevant prior research across NLP, computer vision, and multimodal learning.

  • NLP Pre-training Revolution:

    • The paper explicitly credits the revolution in NLP, citing foundational work like Word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and more recent contextualized word embeddings and language models.
    • ELMo (Peters et al., 2018), ULMFiT (Howard & Ruder, 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and T5 (Raffel et al., 2019) are highlighted as systems that scaled task-agnostic objectives (like autoregressive and masked language modeling) across orders of magnitude in compute, model capacity, and data.
    • Text-to-text interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) enabled these architectures to zero-shot transfer without specialized output heads.
    • GPT-3 (Brown et al., 2020) is noted as a flagship system competitive with bespoke models with little to no dataset-specific training data. This demonstrated that web-scale text supervision can surpass high-quality crowd-labeled datasets in NLP.
  • Early Multimodal Vision-Language Learning:

    • The idea of learning visual models from text is not new, dating back over 20 years.
    • Mori et al. (1999) explored improving content-based image retrieval by predicting nouns and adjectives from paired text.
    • Quattoni et al. (2007) demonstrated more data-efficient image representations by learning classifiers for words in image captions.
    • Srivastava & Salakhutdinov (2012) used multimodal Deep Boltzmann Machines on low-level image and text features.
    • Joulin et al. (2016) modernized this by training CNNs to predict bag-of-words from YFCC100M metadata, showing transfer performance similar to ImageNet pre-training. This work is a direct conceptual precursor to CLIP's bag-of-words baseline.
    • Li et al. (2017) extended this to phrase n-grams and demonstrated zero-shot transfer by scoring target classes based on learned visual n-grams. CLIP directly compares against Li et al. (2017) on ImageNet, highlighting its significant performance leap from 11.5% to 76.2%.
  • Recent Text-Supervised Visual Representation Learning:

    • More recent works adopting modern architectures and objectives:
      • VirTex (Desai & Johnson, 2020): Used transformer-based language modeling to learn image representations from textual annotations.
      • ICMLM (Bulent Sariyildiz et al., 2020): Employed masked language modeling for image representation learning.
      • ConVIRT (Zhang et al., 2020): Adapted contrastive objectives for (text, image) representation learning, especially in medical imaging. CLIP is described as a "simplified version of ConVIRT trained from scratch."
  • Weakly Supervised Pre-training for Computer Vision:

    • While text-based supervision was rare due to low benchmark performance, narrowly scoped weak supervision has shown gains.
    • Mahajan et al. (2018) used ImageNet-related hashtags on 3.5 billion Instagram images, improving ImageNet accuracy by over 5%.
    • Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) demonstrated large gains by pre-training on noisily labeled JFT-300M dataset, predicting classes. These approaches are noted for carefully limiting supervision to 1000 and 18291 classes, respectively, unlike natural language's broader scope.
  • Contrastive Representation Learning:

    • Tian et al. (2019) found contrastive objectives learn better representations than equivalent predictive objectives for images.
    • Chen et al. (2020a) showed generative models require significantly more compute than contrastive models for similar performance. These findings directly influenced CLIP's choice of a contrastive objective for efficiency.
    • Sohn (2016) introduced multi-class N-pair loss, and Oord et al. (2018) popularized InfoNCE loss, both key inspirations for CLIP's batch construction technique and objective.

3.3. Technological Evolution

The evolution of visual models from natural language supervision can be broadly understood as a shift from limited, task-specific, hand-labeled datasets to vast, general, web-scale, implicitly supervised data.

  1. Early Image Retrieval (Pre-Deep Learning): Initial efforts (e.g., Mori et al., 1999) focused on content-based image retrieval by linking images to associated text, often using simpler textual features (nouns, adjectives) and statistical models. These were proofs-of-concept for multimodal understanding.
  2. Bridging Vision and Language with Machine Learning (2000s-early 2010s): Researchers explored learning visual representations by predicting words in captions (Quattoni et al., 2007) or using multimodal graphical models like Deep Boltzmann Machines (Srivastava & Salakhutdinov, 2012). These methods laid theoretical groundwork but were limited by computational power and data scale.
  3. Deep Learning Era - Initial Multimodal Applications (2010s): With the rise of deep learning and CNNs, the field saw modernization. Joulin et al. (2016) showed that CNNs trained on bag-of-words from web metadata could learn useful representations, competing with ImageNet pre-training. Li et al. (2017) further demonstrated zero-shot transfer using n-grams, but performance was still too low for practical application.
  4. Specialized Weak Supervision (Mid-Late 2010s): While general text-to-vision remained challenging, highly-focused weak supervision (e.g., Instagram hashtags by Mahajan et al., 2018) proved effective for improving ImageNet performance. These methods, however, still relied on predetermined class taxonomies and softmax classifiers, limiting flexibility.
  5. NLP-Inspired Architectures and Objectives (Late 2010s-Early 2020s): The success of Transformers and pre-training in NLP inspired analogous efforts in vision. VirTex, ICMLM, and ConVIRT started integrating Transformer-based language models and contrastive objectives for learning image representations from text, but often on relatively smaller datasets.
  6. CLIP's Breakthrough (2021): CLIP represents a culmination of these trends. It scales the contrastive learning objective and Transformer architectures to an unprecedented web-scale dataset (400M image-text pairs). This massive scale, combined with an efficient contrastive objective, enables CLIP to achieve state-of-the-art zero-shot transfer performance, bridging the gap between flexible natural language supervision and practical computer vision applications. It replicates the NLP paradigm's success in achieving task-agnostic and transferable models.

3.4. Differentiation Analysis

Compared to the main methods in related work, CLIP introduces several core differences and innovations:

  • Scale of Natural Language Supervision:

    • Prior Work (e.g., VirTex, ICMLM, ConVIRT): While these works explored similar ideas, they typically trained on much smaller datasets (e.g., MS-COCO, Visual Genome, YFCC100M, ranging from hundreds of thousands to tens of millions of images). Li et al. (2017) used YFCC100M but with limited filtering.
    • CLIP's Innovation: CLIP is trained on an unprecedented 400 million (image, text) pairs in its custom WIT (WebImageText) dataset. This scale, leveraging the vastness of the internet, is a direct differentiator and a key enabler of its performance, similar to how large datasets drove progress in NLP.
  • Efficiency of Pre-training Objective:

    • Generative/Predictive Baselines (e.g., VirTex, early CLIP attempts): Many prior approaches, including CLIP's own initial attempts, aimed to predict the exact words of an image caption. The paper demonstrates this is a computationally inefficient task (transformer-based language model learns 3x slower than bag-of-words baseline, Figure 2).
    • CLIP's Innovation: CLIP adopts a contrastive learning objective. Instead of predicting specific words, it learns to predict which text snippet as a whole is paired with which image. This proxy task of identifying correct (image, text) pairs from incorrect ones (e.g., InfoNCE loss or multi-class N-pair loss) yields a 4x efficiency improvement over the bag-of-words prediction baseline for zero-shot ImageNet classification (Figure 2). This focus on efficiency was crucial for scaling.
  • Zero-Shot Transfer Performance and Generality:

    • Prior Zero-Shot Approaches (e.g., Li et al., 2017 - Visual N-Grams): These methods showed the possibility of zero-shot transfer but achieved very low accuracy (e.g., 11.5% on ImageNet).
    • Weakly Supervised (e.g., Instagram-pretrained ResNeXt, JFT-300M models): These models achieved high performance but were task-specific (e.g., predicting 1000 or 18291 ImageNet-related classes) and lacked a mechanism for dynamic outputs or true open-set recognition. They still required static softmax classifiers.
    • CLIP's Innovation: CLIP achieves state-of-the-art zero-shot transfer performance, matching or even exceeding strong fully supervised baselines on many datasets (e.g., 76.2% on ImageNet zero-shot, matching ResNet-50). Its use of natural language prompts allows for dynamic classifier creation for any described visual concept, offering unprecedented flexibility and generality for open-set recognition (Figure 1).
  • Robustness to Distribution Shift:

    • ImageNet-trained Models: Previous studies (Taori et al., 2020) showed that ImageNet-trained models suffer significant performance drops on natural distribution shifts, suggesting they exploit spurious correlations within the ImageNet distribution.
    • CLIP's Innovation: CLIP demonstrates significantly higher effective robustness on natural distribution shift datasets. Zero-shot CLIP models reduce the gap between in-distribution and out-of-distribution accuracy by up to 75% (Figure 13). This is attributed to not being trained on specific task distributions and leveraging a very diverse pre-training dataset.
  • Scaling Laws for Transfer Performance:

    • Prior Vision Models: While large-scale pre-training existed, clear scaling laws for zero-shot transfer performance in vision, similar to those observed in NLP (Kaplan et al., 2020), were not as extensively documented.

    • CLIP's Innovation: CLIP explicitly demonstrates a smooth log-log linear scaling trend for zero-shot error rate as a function of model compute across various model sizes (Figure 9). This predictability guides future research towards larger, more capable models.

      In essence, CLIP differentiates itself by successfully bringing the web-scale, task-agnostic pre-training paradigm from NLP to computer vision, enabled by a massive, custom dataset and an efficient contrastive learning objective, resulting in highly flexible, performant, and robust zero-shot visual models.

4. Methodology

4.1. Principles

The core idea behind CLIP is to learn a highly transferable visual model by leveraging the abundant and diverse supervision contained in natural language paired with images on the internet. Instead of relying on traditional, fixed-category human-labeled datasets, CLIP aims to learn visual representations that are implicitly grounded in human language.

The theoretical basis and intuition are as follows:

  1. Natural Language as a Rich Supervision Signal: Natural language is incredibly expressive, capable of describing a vast, open-ended set of visual concepts (objects, actions, attributes, scenes, emotions, etc.). By learning from text associated with images, a model can acquire a much broader understanding of the visual world than from a limited set of pre-defined labels. This mirrors the success of large language models trained on web-scale text.
  2. Shared Embedding Space for Multimodal Understanding: The goal is to learn a joint multimodal embedding space where images and their corresponding text descriptions are mapped to nearby points. This means that an image of a "dog" and the text "a photo of a dog" will have similar vector representations, while an image of a "cat" and the text "a photo of a dog" will have dissimilar representations.
  3. Contrastive Learning for Efficiency and Effectiveness: Instead of trying to generate the exact caption of each image word by word (which is computationally intensive and difficult due to the variability of language), CLIP simplifies the task. It frames learning as a contrastive prediction problem: given a batch of image-text pairs, can the model identify which image matches which text from all possible pairings within that batch? This proxy task is much more efficient to train and has been shown to be highly effective at learning powerful representations in self-supervised learning.
  4. Zero-Shot Transfer through Language Prompting: Once trained, this shared embedding space allows for zero-shot transfer. To classify an image into a set of categories (e.g., "cat," "dog," "bird"), the model can:
    • Compute the embedding of the input image.
    • Compute the embeddings of the text descriptions for each category (e.g., "a photo of a cat," "a photo of a dog," "a photo of a bird").
    • The category whose text embedding is most similar (e.g., highest cosine similarity) to the image embedding is predicted as the class. This effectively allows natural language to "program" the visual classifier on the fly, without needing any labeled training examples for the new categories.

4.2. Core Methodology In-depth (Layer by Layer)

The CLIP methodology involves several key components: Natural Language Supervision, Dataset Creation, Efficient Pre-Training Method Selection, Model Architectures, Model Scaling, and Training Details.

4.2.1. Natural Language Supervision

CLIP's fundamental premise is to learn visual representations from natural language supervision. This means using descriptive text associated with images as the training signal, rather than discrete, human-annotated class labels. The authors highlight its strengths:

  • Scalability: It's easier to scale natural language supervision because it doesn't require annotations in a machine learning compatible format (like 1-of-N majority vote gold labels). Instead, it can passively learn from the vast amount of text on the internet.
  • Flexibility and Zero-Shot Transfer: It doesn't "just" learn a representation but also connects that representation to language, enabling flexible zero-shot transfer by simply using language to describe new visual concepts.

4.2.2. Creating a Sufficiently Large Dataset (WIT)

Existing datasets like MS-COCO (approx. 100,000 images) and Visual Genome (approx. 100,000 images) are too small for web-scale pre-training. YFCC100M (100 million images) has sparse and inconsistent metadata. After filtering for natural language titles/descriptions, YFCC100M shrinks to 15 million images.

To address this, the authors constructed a new dataset called WIT (WebImageText).

  • Scale: It comprises 400 million (image, text) pairs collected from publicly available internet sources.
  • Diversity: To cover a broad set of visual concepts, the collection process involved searching for (image, text) pairs where the text included one of 500,000 queries.
  • Balancing: The results were approximately class-balanced by including up to 20,000 (image, text) pairs per query (a rough sketch of this capping step follows this list).
  • Size comparison: The total word count of WIT is similar to the WebText dataset used to train GPT-2.
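
As a rough illustration of the per-query balancing described above (not the paper's actual data pipeline), the hedged sketch below caps the number of retained pairs per search query at 20,000; the function and variable names are assumptions made for this example.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # cap stated in the paper's dataset description

def balance_by_query(candidate_pairs):
    """Keep at most MAX_PAIRS_PER_QUERY (image_url, text) pairs per search query.

    candidate_pairs: iterable of (query, image_url, text) tuples, e.g. produced by
    searching for (image, text) pairs whose text contains one of the ~500,000 queries.
    """
    kept, counts = [], defaultdict(int)
    for query, image_url, text in candidate_pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((image_url, text))
    return kept
```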

4.2.3. Selecting an Efficient Pre-Training Method

Given the immense computational requirements of training large computer vision models, training efficiency was paramount.

  • Initial Approach (Generative/Predictive): The authors first tried an approach similar to VirTex, jointly training an image CNN and text transformer from scratch to predict the caption of an image.

    • Observation: This method proved inefficient. As shown in Figure 2, a 63 million parameter transformer language model (already using twice the compute of a ResNet-50 image encoder) learned to recognize ImageNet classes three times slower than a simpler bag-of-words (BoW) encoding baseline. This highlighted the difficulty of predicting exact words due to the wide variety of text associated with images.

      Figure 2. CLIP is much more efficient at zero-shot transfer than the image-caption baseline. Although highly expressive, transformer-based language models are relatively weak at zero-shot ImageNet classification: the transformer learns 3x slower than a baseline that predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP improves efficiency by a further 4x. (The figure plots zero-shot ImageNet accuracy against the number of training images processed for the three approaches.)

  • Contrastive Objective: Inspired by findings that contrastive objectives can learn better and more computationally efficient representations (Tian et al., 2019; Chen et al., 2020a), the authors shifted to a contrastive learning approach.

    • Goal: Predict which text as a whole is paired with which image, rather than the exact words.
    • Observation: Swapping the predictive objective for a contrastive objective in the bag-of-words encoding baseline resulted in a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet (Figure 2).

4.2.4. CLIP Training Objective

CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred, given a batch of $N$ real pairs.

The core of the CLIP implementation can be visualized with the following pseudocode (Figure 3 from the original paper):

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits.T, labels, axis=0) # equivalent to the paper's cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Let's break down the process step-by-step:

  1. Input:

    • I[n, h, w, c]: A minibatch of $n$ aligned images. Each image has dimensions height (h), width (w), and channels (c).
    • T[n, l]: A minibatch of $n$ aligned texts (captions), each of token length l.
    • image_encoder: The neural network responsible for processing images. It can be a ResNet or a Vision Transformer.
    • text_encoder: The neural network responsible for processing text. It can be a CBOW (Continuous Bag-of-Words) model or a Text Transformer.
  2. Feature Extraction:

    • I_f = image_encoder(I): The image_encoder processes the batch of images $I$ to extract visual feature representations. $I_f$ has shape $[n, d_i]$, where $d_i$ is the dimension of the image features.
    • T_f = text_encoder(T): The text_encoder processes the batch of texts $T$ to extract textual feature representations. $T_f$ has shape $[n, d_t]$, where $d_t$ is the dimension of the text features.
  3. Joint Multimodal Embedding:

    • $W_i[d_i, d_e]$: A learned linear projection matrix that maps image features to a shared multimodal embedding space of dimension $d_e$.
    • $W_t[d_t, d_e]$: A learned linear projection matrix that maps text features to the same multimodal embedding space.
    • I_e = l2_normalize(np.dot(I_f, W_i), axis=1): The image features $I_f$ are linearly projected into the multimodal embedding space using $W_i$ and then L2-normalized along the feature dimension (axis=1). $I_e$ has shape $[n, d_e]$.
    • T_e = l2_normalize(np.dot(T_f, W_t), axis=1): Similarly, the text features $T_f$ are linearly projected using $W_t$ and then L2-normalized. $T_e$ also has shape $[n, d_e]$.
    • L2-Normalization: Ensures that all embeddings lie on a unit hypersphere, making dot products equivalent to cosine similarity. This is crucial for the contrastive objective.
  4. Scaled Pairwise Cosine Similarities (Logits):

    • $t$: A learned temperature parameter, optimized directly during training as a log-parameterized multiplicative scalar. It controls the range of the logits in the softmax, preventing training instability by scaling the similarities.
    • logits = np.dot(I_e, T_e.T) * np.exp(t): The dot product of $I_e$ and the transpose of $T_e$ computes the pairwise cosine similarities between all $n$ image embeddings and all $n$ text embeddings in the batch, producing an [n, n] matrix whose element [i, j] is the similarity between image $i$ and text $j$. This matrix is then scaled by $e^t$.
      • The diagonal elements of this matrix (logits[k, k]) correspond to the similarities of the $n$ correct (image, text) pairs.
      • The off-diagonal elements (logits[i, j] where $i \neq j$) correspond to the similarities of the $n^2 - n$ incorrect (image, text) pairings (negative samples).
  5. Symmetric Loss Function:

    • labels = np.arange(n): An array $[0, 1, ..., n-1]$ is created. This serves as the target indices for the cross-entropy loss, indicating that image[i] should match text[i].

    • loss_i = cross_entropy_loss(logits, labels, axis=0): This calculates the cross-entropy loss from the perspective of the images. For each image $i$, the model tries to correctly identify its corresponding text $i$ among all $n$ texts in the batch: each image's row of similarity scores is compared against a target that places all probability on the matching text at the diagonal index.

    • loss_t = cross_entropy_loss(logits.T, labels, axis=0): This calculates the cross-entropy loss from the perspective of the texts. For each text $j$, the model tries to correctly identify its corresponding image $j$ among all $n$ images in the batch. Transposing logits swaps the roles of images and texts; the original paper's pseudocode expresses the same loss as cross_entropy_loss(logits, labels, axis=1).

    • loss = (loss_i + loss_t)/2: The final loss is the symmetric average of the image-side and text-side losses. This ensures that both modalities learn to project into the shared space effectively.

      This batch construction technique and objective were first introduced as multi-class N-pair loss (Sohn, 2016) and popularized as InfoNCE loss (Oord et al., 2018) for contrastive representation learning. It was recently adapted for (text, image) learning in medical imaging by Zhang et al. (2020), which CLIP builds upon.
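
To make the pseudocode above concrete, here is a minimal, runnable NumPy sketch of the symmetric contrastive loss, assuming random features in place of real encoder outputs; the helper names (l2_normalize, cross_entropy, clip_contrastive_loss) are illustrative and not the paper's released implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # scale each row to unit L2 norm so dot products become cosine similarities
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_entropy(logits, labels):
    # mean cross-entropy with integer labels; softmax taken over the last axis
    logits = logits - logits.max(axis=-1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_contrastive_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n (image, text) feature pairs.

    I_f: [n, d_i] image features, T_f: [n, d_t] text features,
    W_i, W_t: learned projections into the shared d_e-dimensional space,
    t: learned log-temperature (logits are scaled by exp(t)).
    """
    I_e = l2_normalize(I_f @ W_i)             # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)             # [n, d_e]
    logits = I_e @ T_e.T * np.exp(t)          # [n, n] scaled cosine similarities
    labels = np.arange(len(I_e))              # correct pair for row/column i is index i
    loss_i = cross_entropy(logits, labels)    # images -> which text?
    loss_t = cross_entropy(logits.T, labels)  # texts  -> which image?
    return (loss_i + loss_t) / 2

# toy usage with random features standing in for encoder outputs
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 32, 24, 16
loss = clip_contrastive_loss(rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t)),
                             rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e)),
                             t=np.log(1 / 0.07))
print(float(loss))
```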

Simplifications Compared to ConVIRT (Zhang et al., 2020):

  • Training from scratch: CLIP models are trained without ImageNet weights for the image encoder or pre-trained weights for the text encoder.
  • Linear Projection: Only a linear projection maps from each encoder's representation to the multimodal embedding space, unlike ConVIRT which uses a non-linear projection.
  • No Text Transformation: The text transformation function $t_u$ that samples a single sentence uniformly from the text is removed, as many (image, text) pairs in WIT are single sentences.
  • Simple Image Augmentation: A random square crop from resized images is the only data augmentation used during training.
  • Learned Temperature: The temperature parameter $\tau$ is directly optimized during training as a log-parameterized multiplicative scalar, avoiding manual hyperparameter tuning.

4.2.5. Choosing and Scaling a Model

CLIP uses two families of architectures for the image encoder and a Transformer for the text encoder.

  • Image Encoder Architectures:

    • ResNet-50 (He et al., 2016a) variants:
      • Base architecture: ResNet-50, chosen for its widespread adoption.
      • Modifications: the ResNet-D improvements (He et al., 2019); antialiased rect-2 blur pooling (Zhang, 2019); and an attention pooling mechanism that replaces the global average pooling layer. The attention pooling is a single layer of transformer-style multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image.
    • Vision Transformer (ViT) (Dosovitskiy et al., 2020) variants:
      • Base architecture: closely follows the original ViT implementation.
      • Minor modifications: an additional layer normalization is applied to the combined patch and position embeddings before the transformer blocks, and a slightly different initialization scheme is used.
  • Text Encoder Architecture:

    • Base architecture: a Transformer (Vaswani et al., 2017) with the architectural modifications described in Radford et al. (2019) (similar to GPT-2).
    • Size: a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.
    • Tokenization: operates on a lower-cased Byte Pair Encoding (BPE) representation of the text with a 49,152-token vocabulary (Sennrich et al., 2015).
    • Sequence length: capped at 76 tokens for computational efficiency.
    • Feature representation: the text sequence is bracketed with [SOS] (start of sentence) and [EOS] (end of sentence) tokens, and the activations of the highest transformer layer at the [EOS] token are used as the feature representation of the text. This representation is layer normalized and then linearly projected into the multimodal embedding space.
    • Masked self-attention: used in the text encoder to preserve the ability to initialize with a pre-trained language model or to add language modeling as an auxiliary objective (left for future work).
  • Model Scaling Strategy:

    • ResNet image encoders: adapts the approach of Tan & Le (2019) (EfficientNet) by equally allocating additional compute to increasing the width, depth, and resolution of the model, which is found to outperform scaling only one dimension.
    • Text encoder: only the width is scaled, proportionally to the calculated increase in ResNet width. The depth is not scaled, as CLIP's performance was less sensitive to the text encoder's capacity.

4.2.6. Training

A series of CLIP models were trained:

  • ResNet models: 5 models: ResNet-50, ResNet-101, and three EfficientNet-style scaled models (denoted RN50x4, RN50x16, and RN50x64) using approximately 4x, 16x, and 64x the compute of a ResNet-50.
  • Vision Transformer (ViT) models: 3 models: ViT-B/32, ViT-B/16, and ViT-L/14.
  • ViT-L/14@336px: the ViT-L/14 model was additionally pre-trained at a higher 336-pixel resolution for one extra epoch to boost performance, similar to FixRes (Touvron et al., 2019). This model performs best and is used as "CLIP" by default in the paper.

Common Training Hyperparameters (Table 18):

| Hyperparameter | Value |
| --- | --- |
| Batch size | 32768 |
| Vocabulary size | 49408 |
| Training epochs | 32 |
| Maximum temperature | 100.0 |
| Weight decay | 0.2 |
| Warm-up iterations | 2000 |
| Adam β1 | 0.9 |
| Adam β2 | 0.999 (ResNet), 0.98 (ViT) |
| Adam ε | 10^-8 (ResNet), 10^-6 (ViT) |

Specific Hyperparameters for CLIP-ResNet (Table 19):

| Model | Learning rate | Embedding dimension | Input resolution | ResNet blocks | ResNet width | Text layers | Text width | Text heads |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RN50 | 5 × 10^-4 | 1024 | 224 | (3, 4, 6, 3) | 2048 | 12 | 512 | 8 |
| RN101 | 5 × 10^-4 | 512 | 224 | (3, 4, 23, 3) | 2048 | 12 | 512 | 8 |
| RN50x4 | 5 × 10^-4 | 640 | 288 | (4, 6, 10, 6) | 2560 | 12 | 640 | 10 |
| RN50x16 | 4 × 10^-4 | 768 | 384 | (6, 8, 18, 8) | 3072 | 12 | 768 | 12 |
| RN50x64 | 3.6 × 10^-4 | 1024 | 448 | (3, 15, 36, 10) | 4096 | 12 | 1024 | 16 |

Specific Hyperparameters for CLIP-ViT (Table 20):

| Model | Learning rate | Embedding dimension | Input resolution | ViT layers | ViT width | ViT heads | Text layers | Text width | Text heads |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B/32 | 5 × 10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-B/16 | 5 × 10^-4 | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-L/14 | 4 × 10^-4 | 768 | 224 | 24 | 1024 | 16 | 12 | 768 | 12 |
| ViT-L/14-336px | 2 × 10^-5 | 768 | 336 | 24 | 1024 | 16 | 12 | 768 | 12 |

Other Training Details:

  • Optimizer: Adam (Kingma & Ba, 2014).
  • Regularization: decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases.
  • Learning rate schedule: cosine decay (Loshchilov & Hutter, 2016).
  • Hyperparameter tuning: initial hyperparameters were set via grid searches, random search, and manual tuning on the baseline ResNet-50 trained for 1 epoch, then heuristically adapted for larger models due to computational constraints.
  • Temperature parameter initialization: $\tau$ was initialized to the equivalent of 0.07 from Wu et al. (2018) and clipped to prevent scaling the logits by more than 100, which avoids training instability.
  • Memory optimization: mixed-precision training (Micikevicius et al., 2017), gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The embedding similarity calculation was sharded across GPUs.
  • Compute: the largest ResNet model (RN50x64) took 18 days to train on 592 V100 GPUs; the largest Vision Transformer (ViT-L/14) took 12 days on 256 V100 GPUs.

4.2.7. Using CLIP for Zero-Shot Transfer

After pre-training, CLIP uses its learned multimodal embedding space for zero-shot classification:

  1. Class Name Processing: For a given downstream dataset, the names of all classes are used as potential text pairings.
  2. Embedding Generation:
    • The input image is passed through the image encoder to obtain its feature embedding.
    • Each class name (e.g., "cat," "dog") is passed through the text encoder to obtain its feature embedding.
  3. Similarity Calculation: The cosine similarity between the image embedding and each of the class text embeddings is calculated.
  4. Probability Distribution: These similarities are scaled by the learned temperature parameter $\tau$ and normalized into a probability distribution via a softmax function.
  5. Prediction: The class with the highest probability (i.e., the most probable (image, text) pair) is predicted as the image's label.

    Interpretation: The paper interprets this as the image encoder acting as the computer vision backbone and the text encoder functioning as a hypernetwork (Ha et al., 2016) that generates the weights of a linear classifier based on the text descriptions of visual concepts. Each step of CLIP pre-training is seen as optimizing a proxy to a computer vision dataset with 1 example per class and 32,768 total classes defined by natural language.
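
To make the zero-shot procedure concrete, here is a minimal NumPy sketch, assuming a generic text_encoder callable and a learned text_projection matrix as stand-ins for CLIP's text tower and projection; the function, parameter, and template names are illustrative, not the released API.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_names, text_encoder, text_projection, t,
                       template="a photo of a {}."):
    """Score an L2-normalized image embedding against natural-language class prompts.

    text_encoder / text_projection are stand-ins for CLIP's text tower and its
    learned projection into the shared embedding space; t is the learned
    log-temperature. Returns a probability for each class name.
    """
    prompts = [template.format(name) for name in class_names]
    T_f = np.stack([text_encoder(p) for p in prompts])      # [k, d_t] text features
    T_e = T_f @ text_projection                              # [k, d_e] shared-space embeddings
    T_e /= np.linalg.norm(T_e, axis=-1, keepdims=True)       # unit-normalize
    sims = T_e @ image_embedding * np.exp(t)                 # scaled cosine similarities
    sims -= sims.max()                                       # stable softmax
    probs = np.exp(sims) / np.exp(sims).sum()
    return dict(zip(class_names, probs))

# toy usage with a dummy encoder standing in for CLIP's text transformer
rng = np.random.default_rng(0)
d_t, d_e = 24, 16
dummy_encoder = lambda text: rng.normal(size=d_t)
img = rng.normal(size=d_e); img /= np.linalg.norm(img)
print(zero_shot_classify(img, ["cat", "dog", "bird"], dummy_encoder,
                         rng.normal(size=(d_t, d_e)), t=np.log(1 / 0.07)))
```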

Prompt Engineering and Ensembling: To improve zero-shot performance, especially when class names are ambiguous or lack context:

  • Prompt Engineering: Context templates are used, such as "A photo of a {label}." This bridges the distribution gap between single-word labels and the full sentences typically seen during pre-training. Specific prompts are customized per task (e.g., "A photo of a {label}, a type of pet." for Oxford-IIIT Pets).

  • Ensembling: Multiple zero-shot classifiers are created using different context prompts (e.g., "A photo of a big {label}", "A photo of a small {label}"). The ensemble is constructed over the embedding space (by averaging text embeddings), allowing for efficient caching and prediction. On ImageNet, ensembling 80 different prompts improved accuracy by an additional 3.5%. (A small sketch of this averaging step follows Figure 4 below.)

    The following figure (Figure 4 from the original paper) illustrates the improvements from prompt engineering and ensembling.

    Figure 4. Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4 times more compute with the baseline zero-shot method, but is "free" when amortized over many predictions.
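
As a rough illustration of ensembling over the embedding space, the hedged sketch below averages L2-normalized text embeddings across several prompt templates to build one classifier weight vector per class; the encoder, projection, and template names are assumptions for this example, not the paper's exact 80-prompt set.

```python
import numpy as np

def ensembled_class_embeddings(class_names, templates, text_encoder, text_projection):
    """Build one classifier weight vector per class by averaging prompt embeddings.

    text_encoder / text_projection are stand-ins for CLIP's text tower and its
    learned projection; templates is a list like "a photo of a {}.".
    """
    weights = []
    for name in class_names:
        T_f = np.stack([text_encoder(t.format(name)) for t in templates])  # [p, d_t]
        T_e = T_f @ text_projection                                         # [p, d_e]
        T_e /= np.linalg.norm(T_e, axis=-1, keepdims=True)                  # unit-normalize each prompt
        mean = T_e.mean(axis=0)                                             # average over prompts
        weights.append(mean / np.linalg.norm(mean))                         # re-normalize the ensemble
    return np.stack(weights)                                                # [k, d_e], cached once per dataset

# toy usage with a dummy encoder standing in for the text transformer
rng = np.random.default_rng(0)
templates = ["a photo of a {}.", "a photo of a big {}.", "a photo of a small {}."]
dummy_encoder = lambda text: rng.normal(size=24)
W = ensembled_class_embeddings(["cat", "dog"], templates, dummy_encoder,
                               rng.normal(size=(24, 16)))
print(W.shape)  # (2, 16)
```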

5. Experimental Setup

5.1. Datasets

The experiments in this paper use a broad suite of datasets to evaluate CLIP's zero-shot transfer and representation learning capabilities. The core evaluation suite comprises 27 datasets, including the 12 well-studied datasets from Kornblith et al. (2019) and 15 additional datasets to assess performance on a wider variety of distributions and tasks.

The following are the datasets used (Table 9 from the original paper), with details on their characteristics and purpose:

| Dataset | Classes | Train size | Test size | Evaluation metric |
|---|---|---|---|---|
| Food-101 | 102 | 75,750 | 25,250 | accuracy |
| CIFAR-10 | 10 | 50,000 | 10,000 | accuracy |
| CIFAR-100 | 100 | 50,000 | 10,000 | accuracy |
| Birdsnap | 500 | 42,283 | 2,149 | accuracy |
| SUN397 | 397 | 19,850 | 19,850 | accuracy |
| Stanford Cars | 196 | 8,144 | 8,041 | accuracy |
| FGVC Aircraft | 100 | 6,667 | 3,333 | mean per class |
| Pascal VOC 2007 Classification | 20 | 5,011 | 4,952 | 11-point mAP |
| Describable Textures | 47 | 3,760 | 1,880 | accuracy |
| Oxford-IIIT Pets | 37 | 3,680 | 3,669 | mean per class |
| Caltech-101 | 102 | 3,060 | 6,085 | mean per class |
| Oxford Flowers 102 | 102 | 2,040 | 6,149 | mean per class |
| MNIST | 10 | 60,000 | 10,000 | accuracy |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |
| KITTI | 4 | 6,770 | 711 | accuracy |
| Country211 | 211 | 43,200 | 21,100 | accuracy |
| PatchCamelyon | 2 | 294,912 | 32,768 | accuracy |
| UCF101 | 101 | 9,537 | 1,794 | accuracy |
| Kinetics700 | 700 | 494,801 | 31,669 | mean(top1, top5) |
| CLEVR Counts | 8 | 2,000 | 500 | accuracy |
| Hateful Memes | 2 | 8,500 | 500 | ROC AUC |
| Rendered SST2 | 2 | 7,792 | 1,821 | accuracy |
| ImageNet | 1000 | 1,281,167 | 50,000 | accuracy |


Specific Dataset Notes:

  • Video Datasets (UCF101, Kinetics700): For these datasets, the middle frame of each video clip is used as the input image, effectively converting them into image classification tasks for this evaluation.

  • STL-10 and UCF101: These datasets have multiple predefined train/validation/test splits, and the paper reports the average over all splits.

  • Country211: A custom dataset created by the authors to assess geolocation capability. It filtered YFCC100m for 211 countries with at least 300 GPS-tagged photos, sampling 200 for training and 100 for testing per country.

  • Rendered SST2: A custom dataset created to measure optical character recognition (OCR) capability. Sentences from the Stanford Sentiment Treebank (SST-2) dataset are rendered into 448x448 pixel images (black text on white background). The following figure (Figure 19 from the original paper) shows two example images from the Rendered SST2 dataset.


    Figure 19. Two example images from the Rendered SST2 dataset
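
As an illustration of the Rendered SST2 setup described above, the following Pillow sketch renders a sentence as black text on a white 448x448 canvas. The font, wrapping width, and margins here are illustrative choices rather than the authors' exact rendering parameters.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_sentence(sentence, size=448, margin=16, wrap_width=40):
    """Render a sentence as black text on a white square image (illustrative layout)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                 # placeholder font choice
    wrapped = textwrap.fill(sentence, width=wrap_width)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return img

# Example: render_sentence("a gripping, well-acted drama").save("sst2_example.png")
```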

These datasets were chosen to provide a broad and diverse evaluation suite, encompassing various tasks (general object recognition, fine-grained classification, scene recognition, OCR, action recognition, geolocation) and different image characteristics (natural images, satellite images, medical images, rendered text, video frames). This diversity is crucial for validating the generality and transferability of CLIP's natural language supervision approach.

5.2. Evaluation Metrics

Every evaluation metric mentioned in the paper is explained below; a short code sketch of the most common metrics follows the list:

  1. Accuracy:

    • Conceptual Definition: Accuracy is a common metric for classification tasks, representing the proportion of correctly predicted instances out of the total instances evaluated. It measures how often the model's prediction matches the true label.
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's predicted class matches the actual true class.
      • Total Number of Predictions: The total number of instances for which the model made a prediction.
  2. Mean Per Class Accuracy:

    • Conceptual Definition: This metric calculates the accuracy for each individual class and then averages these per-class accuracies. It is particularly useful when dealing with imbalanced datasets (where some classes have many more examples than others) because it gives equal weight to each class, preventing a model from achieving high overall accuracy by simply performing well on a majority class.
    • Mathematical Formula: $ \text{Mean Per Class Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \text{Accuracy}_i $
    • Symbol Explanation:
      • $C$: The total number of unique classes in the dataset.
      • $\text{Accuracy}_i$: The accuracy calculated specifically for class $i$, i.e., the number of correctly predicted instances of class $i$ divided by the total number of actual instances of class $i$.
  3. 11-point Mean Average Precision (11-point mAP):

    • Conceptual Definition: 11-point mAP is a metric commonly used in object detection and image retrieval tasks, particularly in benchmarks like Pascal VOC. It is a generalization of Average Precision (AP). AP measures the area under the Precision-Recall curve. For 11-point mAP, precision is sampled at 11 equally spaced recall levels (0, 0.1, ..., 1.0). The precision at each recall level $r$ is taken as the maximum precision over any recall $r' \ge r$. The mean of these 11 precision values is the AP for a single class. mAP is then the mean of the AP values across all classes.
    • Mathematical Formula: $ \text{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1.0\}} \text{P}_{\text{interp}}(r) $, where $ \text{P}_{\text{interp}}(r) = \max_{r' \ge r} \text{P}(r') $, and $ \text{mAP} = \frac{1}{C} \sum_{i=1}^{C} \text{AP}_i $
    • Symbol Explanation:
      • $\text{AP}$: Average Precision for a single class.
      • $r$: Recall level (from 0 to 1.0 in 11 steps).
      • $\text{P}_{\text{interp}}(r)$: Interpolated precision at recall $r$, calculated as the maximum precision observed for any recall value greater than or equal to $r$.
      • $\text{P}(r')$: Precision at recall $r'$.
      • $\text{mAP}$: Mean Average Precision.
      • $C$: The total number of classes.
      • $\text{AP}_i$: Average Precision for class $i$.
  4. ROC AUC (Receiver Operating Characteristic Area Under the Curve):

    • Conceptual Definition: ROC AUC is a performance metric for binary classification problems, particularly useful when there is a class imbalance or when the costs of false positives and false negatives are different. The ROC curve plots the True Positive Rate (TPR) (Sensitivity) against the False Positive Rate (FPR) (1 - Specificity) at various threshold settings. The AUC (Area Under the Curve) represents the degree or measure of separability between classes; a higher AUC means the model is better at distinguishing between positive and negative classes. An AUC of 1.0 represents a perfect classifier, while 0.5 represents a random classifier.
    • Mathematical Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $, $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $, $ \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\text{FPR} $
    • Symbol Explanation:
      • TP: True Positives (correctly predicted positive instances).
      • FN: False Negatives (actual positive instances incorrectly predicted as negative).
      • FP: False Positives (actual negative instances incorrectly predicted as positive).
      • TN: True Negatives (correctly predicted negative instances).
      • TPR: True Positive Rate, also known as Sensitivity or Recall.
      • FPR: False Positive Rate.
      • AUC: Area Under the ROC Curve. The integral represents the area under the curve formed by plotting TPR against FPR.
  5. mean(top1, top5):

    • Conceptual Definition: This metric is specific to the Kinetics-700 dataset, which is a large-scale video action recognition dataset. It represents the average of the Top-1 accuracy and Top-5 accuracy.
      • Top-1 accuracy: The standard accuracy where the model's single most confident prediction must be correct.
      • Top-5 accuracy: The prediction is considered correct if the true label is among the model's top 5 most confident predictions.
    • Mathematical Formula: $ \text{mean(top1, top5)} = \frac{\text{Top-1 Accuracy} + \text{Top-5 Accuracy}}{2} $
    • Symbol Explanation:
      • Top-1 Accuracy: Proportion of times the true label is the model's highest-scoring prediction.
      • Top-5 Accuracy: Proportion of times the true label is among the model's five highest-scoring predictions.
  6. R@K (Recall@K):

    • Conceptual Definition: Recall@K is a metric used in information retrieval or ranking tasks (like image-text retrieval). It measures the proportion of queries for which the correct item (e.g., the true image for a text query, or the true text for an image query) is found within the top $K$ retrieved results. A higher R@K indicates better retrieval performance.
    • Mathematical Formula: $ \text{R@K} = \frac{\text{Number of queries where true item is in top K retrieved}}{\text{Total number of queries}} $
    • Symbol Explanation:
      • $K$: The number of top retrieved items to consider.
      • true item: The ground-truth item that corresponds to the query.
      • top K retrieved: The set of $K$ items ranked highest by the model for a given query.
  7. mWAP (Mean Weighted Average Precision) and mWSAP (Mean Weighted Segment Average Precision):

    • Conceptual Definition: These metrics are specific to action recognition datasets like RareAct, which involves detecting unusual actions. While the paper mentions them, it does not provide detailed definitions or formulas within its text. In the context of action recognition, Average Precision (AP) is commonly used for detection tasks, and Weighted Average Precision suggests a weighting scheme might be applied, potentially based on action rarity or duration. Segment Average Precision implies evaluation over temporal segments in videos. Without further context from the original source introducing RareAct, precise formulas cannot be provided here, but they generally aim to measure the quality of action detection or recognition in video sequences, potentially considering temporal localization or importance.
  8. Geolocation Performance (within X km):

    • Conceptual Definition: This metric is used for geo-localization tasks, where the goal is to predict the geographical coordinates (latitude and longitude) of an image. Performance is measured as the percentage of images for which the predicted location falls within a specified radius (e.g., 1km, 25km, 200km) of the true location.
    • Mathematical Formula: Let $(\text{lat}_{\text{true}}, \text{lon}_{\text{true}})$ be the true coordinates, $(\text{lat}_{\text{pred}}, \text{lon}_{\text{pred}})$ be the predicted coordinates, and $\text{dist}(\text{true}, \text{pred})$ be the Haversine distance between the two points. $ \text{Accuracy within X km} = \frac{\text{Number of images where } \text{dist}(\text{true}, \text{pred}) \le X \text{ km}}{\text{Total number of images}} \times 100\% $
    • Symbol Explanation:
      • $\text{lat}_{\text{true}}, \text{lon}_{\text{true}}$: True latitude and longitude of the image.
      • $\text{lat}_{\text{pred}}, \text{lon}_{\text{pred}}$: Predicted latitude and longitude of the image.
      • $\text{dist}(\text{true}, \text{pred})$: The geographical distance (e.g., Haversine distance) between the true and predicted coordinates.
      • $X$ km: The specified radius in kilometers.
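
To make the classification metrics above concrete, here is a minimal NumPy sketch (hypothetical array names; not the authors' evaluation code) covering accuracy, mean per-class accuracy, 11-point interpolated AP for a single class, and the mean(top1, top5) score used for Kinetics-700.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies; each class contributes equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

def eleven_point_ap(scores, labels):
    """11-point interpolated average precision for one class (Pascal VOC style).

    scores: confidence for the class; labels: 1 if the class is present, else 0.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                      # rank by decreasing confidence
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0  # interpolated precision at recall r
    return ap / 11.0

def topk_accuracy(logits, y_true, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))

def kinetics_score(logits, y_true):
    """mean(top1, top5) as reported for Kinetics-700."""
    return 0.5 * (topk_accuracy(logits, y_true, 1) + topk_accuracy(logits, y_true, 5))
```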
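
Similarly, a small sketch of Recall@K and the within-X-km geolocation accuracy based on the Haversine distance; the ranking and coordinate arrays are hypothetical inputs.

```python
import numpy as np

def recall_at_k(ranked_indices, true_indices, k):
    """Fraction of queries whose ground-truth item appears in the top-k retrieved items.

    ranked_indices: (n_queries, n_items) item indices sorted by decreasing score.
    true_indices:   (n_queries,) index of the correct item for each query.
    """
    hits = [true in row[:k] for row, true in zip(ranked_indices, true_indices)]
    return float(np.mean(hits))

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def accuracy_within_km(pred_coords, true_coords, threshold_km):
    """Percentage of images whose predicted location is within threshold_km of the truth."""
    d = haversine_km(pred_coords[:, 0], pred_coords[:, 1], true_coords[:, 0], true_coords[:, 1])
    return float(np.mean(d <= threshold_km) * 100.0)
```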

5.3. Baselines

The paper compares CLIP's performance against a comprehensive set of existing models, covering various pre-training strategies and architectures, for both zero-shot transfer and linear-probe representation learning evaluations.

Baselines for Zero-Shot Transfer:

  • Visual N-Grams (Li et al., 2017): This is the primary zero-shot baseline mentioned in the paper for ImageNet, aYahoo, and SUN datasets. It learned a dictionary of visual n-grams from web data and used text n-gram representations of class names for prediction. This serves as the direct contextual reference for generically pre-trained zero-shot models.

Baselines for Linear-Probe Representation Learning (and some for zero-shot comparison context):

The evaluation suite includes 66 different models across 27 datasets to ensure a broad comparison. Key families of baselines are:

  1. LM RN50 (Autoregressive Language Model with ResNet-50):

    • A multimodal model using a ResNet-50 image encoder and an autoregressive loss to predict text captions. This acts as a direct comparison to CLIP's contrastive loss and demonstrates the efficiency gains of contrastive learning.
  2. EfficientNet (Tan & Le, 2019):

    • A family of convolutional neural networks known for systematic model scaling (width, depth, and resolution).
    • Includes B0-B8 models from the original paper.
    • Also includes Noisy Student variants (B0-B7, L2-475, L2-800) (Xie et al., 2020), which use self-training with noisy labels to achieve state-of-the-art performance on ImageNet. These are strong supervised baselines.
  3. Instagram-pretrained ResNeXt (Mahajan et al., 2018):

    • ResNeXt-101 models (32x8d, 32x16d, 32x32d, 32x48d) pre-trained on 3.5 billion Instagram images using hashtag prediction (a form of weak supervision). These models demonstrated the power of large-scale weakly supervised pre-training.
    • Includes FixRes variants (Touvron et al., 2019) using higher input resolutions.
  4. Big Transfer (BiT) (Kolesnikov et al., 2019):

    • BiT-S and BiT-M models (ResNet architectures) pre-trained on ImageNet-1k and ImageNet-21k (a larger ImageNet variant). These models are known for their strong transfer learning performance due to large-scale pre-training. BiT-L models (trained on JFT-300M) are mentioned as superior but not publicly available.
  5. Vision Transformer (ViT) (Dosovitskiy et al., 2020):

    • ViT-B/32, ViT-B/16, ViT-L/16, and ViT-H/14 models pre-trained on the ImageNet-21k dataset. These are crucial baselines to compare CLIP's Vision Transformer variants against, particularly in terms of compute efficiency. The best-performing ViT models (trained on JFT-300M) are also not publicly available.
  6. Self-Supervised Learning Methods:

    • SimCLRv2 (Chen et al., 2020c): A self-supervised learning framework that uses contrastive learning to learn visual representations without human labels.
    • BYOL (Bootstrap Your Own Latent) (Grill et al., 2020): Another self-supervised learning method that avoids negative pairs, using two interacting neural networks to learn representations.
    • Momentum Contrast (MoCo) (He et al., 2020; Chen et al., 2020d): A self-supervised learning framework that uses a momentum encoder and a large queue of negative samples for contrastive learning.
  7. VirTex (Desai & Johnson, 2020):

    • A model that learns visual representations from textual annotations using transformer-based language modeling. It has a similar model design to CLIP's autoregressive baseline but is trained on a much smaller dataset (MSCOCO).
  8. Standard ResNet Checkpoints (He et al., 2016b):

    • Original ResNet-50, ResNet-101, and ResNet-152 models trained on ImageNet-1k. These serve as fundamental supervised baselines representing widely adopted architectures.

      These baselines are representative because they cover a spectrum of modern computer vision pre-training techniques: from traditional supervised ImageNet training, to large-scale weakly supervised methods, and recent self-supervised learning paradigms, as well as the emerging Vision Transformer architectures. This comprehensive comparison allows the authors to position CLIP's natural language supervision approach against the current state-of-the-art across different axes like supervision type, model architecture, and scale.

5.4. Linear-Probe Evaluation Setup

For linear-probe evaluation, the following standardized procedure is used:

  1. Feature Extraction: Image features are taken from the penultimate layer of each model, ignoring any provided classification layer. For CLIP-ViT models, features are used before the linear projection to the embedding space (corresponding to $I_f$ in the pseudocode).
  2. Classifier Training: A logistic regression classifier is trained on these extracted features. scikit-learn's L-BFGS implementation is used with a maximum of 1,000 iterations.
  3. Hyperparameter Tuning: The L2 regularization strength $\lambda$ is determined using a hyperparameter sweep on the validation sets. The sweep covers a range from $10^{-6}$ to $10^6$ with 96 logarithmically spaced steps, and a parametric binary search strategy is employed to find the optimal $\lambda$ efficiently.

  4. Data Splits: For datasets with predefined validation and test splits, the validation set is used for the hyperparameter search. If no validation split or test labels are provided, the training dataset is split to create a validation set for tuning. For the final result, the validation split is combined back with the training split, and performance is reported on the held-out test split.

This linear-probe approach is chosen for its simplicity, its minimal hyperparameter tuning requirements, and its ability to highlight how well the pre-trained representations themselves capture useful information, rather than allowing fine-tuning to adapt the representations to each specific downstream task.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate CLIP's capabilities in zero-shot transfer and representation learning, often outperforming strong baselines and showing predictable scaling behavior.

6.1.1. Initial Comparison to Visual N-Grams

The following are the results from Table 1 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <th></th> <th>aYahoo</th> <th>ImageNet</th> <th>SUN</th> </tr> </thead> <tbody> <tr> <td>Visual N-Grams</td> <td>72.4</td> <td>11.5</td> <td>23.0</td> </tr> <tr> <td>CLIP</td> <td>98.4</td> <td>76.2</td> <td>58.5</td> </tr> </tbody> </table></div>

This table compares CLIP's zero-shot accuracy against Visual N-Grams (Li et al., 2017) on three datasets. CLIP shows large improvements:

  • On ImageNet, CLIP raises accuracy from 11.5% to 76.2%, matching the performance of the original ResNet-50 (a fully supervised model) without using any of ImageNet's 1.28 million training examples. CLIP also achieves 95% Top-5 accuracy on ImageNet, matching Inception-V4.

  • On aYahoo, CLIP achieves 98.4%, a 95% reduction in errors compared to Visual N-Grams.

  • On SUN, CLIP more than doubles the accuracy, from 23.0% to 58.5%.

These results establish CLIP as a significant step towards practical and flexible zero-shot computer vision classifiers. While many factors (a larger dataset, more compute, the Transformer architecture) contribute to CLIP's advantage over Visual N-Grams, a controlled ablation confirms that a CLIP ResNet-50 trained on the YFCC100M dataset (the same data as Visual N-Grams) can match their reported ImageNet performance within one GPU day, even when trained from scratch.

6.1.2. Zero-Shot CLIP Performance Analysis

  • Competitiveness with Fully Supervised Baselines: The following figure (Figure 5 from the original paper) shows this comparison across 27 datasets.

![Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.](/files/papers/6956a9a55411c3e2652eae93/images/4.jpg)

Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline.
Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet. Zero-shot CLIP outperforms a `fully supervised, regularized logistic regression classifier` fitted on `ResNet-50 features` on 16 out of 27 datasets, including `ImageNet`. * **Strong Performance:** On `fine-grained classification` tasks like `Stanford Cars` and `Food101`, zero-shot CLIP outperforms the baseline by over $20\%$. On `STL10`, it achieves $99.3\%$, a new state-of-the-art zero-shot result. * **Action Recognition:** For `action recognition in videos` (`Kinetics700`, `UCF101`), CLIP significantly outperforms `ResNet-50` features by $14.5\%$ and $7.7\%$ respectively. This is attributed to `natural language` providing broader supervision for `verbs` compared to `ImageNet`'s `noun-centric` supervision. * **Weak Performance:** CLIP struggles with `specialized, complex, or abstract tasks` such as `satellite image classification` (EuroSAT, RESISC45), `lymph node tumor detection` (PatchCamelyon), `object counting` (CLEVRCounts), and `self-driving related tasks` (GTSRB, KITTI Distance). This highlights areas for improvement. * **Comparison to Few-Shot Linear Probes:** The following figure (Figure 6 from the original paper) compares zero-shot CLIP to few-shot logistic regression on features of many image models. ![Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.](/files/papers/6956a9a55411c3e2652eae93/images/5.jpg) *该图像是图表,展示了不同模型在有标签训练样本数量与平均得分之间的关系。Zero-Shot CLIP 模型的表现与4-shot线性分类器相当,接近16-shot线性分类器的最佳结果。图中标注了 BiT-M 和 SimCLRv2 模型的表现,灰色线条代表其他评估模型。* Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis. Surprisingly, `zero-shot CLIP matches the average performance of a 4-shot linear classifier` trained on the *same feature space*. This suggests that the ability to "communicate" visual concepts directly via `natural language` in zero-shot is highly effective, potentially overcoming the ambiguity of `context-less example-based learning` in few-shot settings. CLIP also `roughly matches the best-performing 16-shot classifier` (a `BiT-M ResNet-152x2` trained on `ImageNet-21K`) in the evaluation suite. * **Data Efficiency of Zero-Shot Transfer:** The following figure (Figure 7 from the original paper) shows the estimated number of labeled examples per class required for a linear classifier on the same CLIP feature space to match zero-shot CLIP's performance. ![Figure 7. The data efficiency of zero-shot transfer varies widely. 
Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.](/files/papers/6956a9a55411c3e2652eae93/images/6.jpg) *该图像是一个条形图,展示了在 CLIP 特征空间中,线性分类器匹配零-shot 分类器性能所需的每个类别的标记示例数量。数据呈现不同数据集的标签样本需求,从 FER2013 的 184 个样本到 Flowers102 的 0.9 个样本,均值为 20.8, медиана 为 5.4。* Figure 7. The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class. The `effective data efficiency` of zero-shot transfer varies widely from less than 1 labeled example per class to 184. Half of the datasets require less than 5 examples per class, with a median of 5.4. On `ImageNet`, zero-shot CLIP matches the performance of a 16-shot linear classifier on the same feature space. * **Correlation with Linear Probe Performance:** The following figure (Figure 8 from the original paper) compares CLIP's zero-shot performance with fully supervised linear classifiers across datasets. ![Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance ( ${ \\le } 3$ point difference).](/files/papers/6956a9a55411c3e2652eae93/images/7.jpg) *该图像是一个散点图,展示了零-shot CLIP性能与线性探测CLIP性能之间的关系。图中显示,零-shot性能与线性探测性能存在强关联性,相关系数 $r = 0.82$。大部分数据点接近45度线,但零-shot性能普遍低于线性探测性能,最高差值在10至25点之间。* Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance ( ${ \le } 3$ point difference). There is a strong positive correlation ($r=0.82$, $p < 10^{-6}$) between zero-shot and fully supervised performance, indicating consistency in CLIP's `representation learning` and `task learning`. However, zero-shot performance generally `underperforms fully supervised classifiers by 10% to 25%`, suggesting significant room for improvement. * **Scaling Laws for Zero-Shot Performance:** The following figure (Figure 9 from the original paper) plots the average error rate of 5 ResNet CLIP models across 39 evaluations on 36 different datasets as a function of model compute. ![Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. 
Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.](/files/papers/6956a9a55411c3e2652eae93/images/8.jpg) *该图像是图表,展示了 CLIP 模型在不同计算量下的零-shot 性能。横轴为模型的 GFLOPs,纵轴为错误率(%)。数据点显示,随着计算量的增加,错误率呈现平滑的下降趋势,表明模型性能的提升。* Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend. CLIP exhibits a `log-log linear scaling trend` for `average zero-shot error` across a `44x increase in model compute`, similar to observations in `neural language models`. This indicates predictable performance gains with larger models and more computational resources. ### 6.1.3. Representation Learning The following figure (Figure 10 from the original paper) shows the relationship between linear probe average scores and forward-pass GFLOPs/image for various models. ![该图像是图表,展示了不同模型在Kornblith等的12和27个数据集上的线性探测平均得分与前向传递GFLOPs/图像之间的关系。不同的模型用不同的符号标识,结果表明CLIP模型在效率与准确性之间的平衡表现突出。](/files/papers/6956a9a55411c3e2652eae93/images/9.jpg) *该图像是图表,展示了不同模型在Kornblith等的12和27个数据集上的线性探测平均得分与前向传递GFLOPs/图像之间的关系。不同的模型用不同的符号标识,结果表明CLIP模型在效率与准确性之间的平衡表现突出。* Figure 9. The image is a chart that shows the relationship between linear probe average scores and forward-pass GFLOPs/image for various models across Kornblith et al.'s 12 and 27 datasets. Different models are represented by distinct symbols, and the results indicate that CLIP performs well in balancing efficiency and accuracy. The following are the results from Figure 10 of the original paper. The top graph shows the average score on the 12-dataset Kornblith et al. (2019) suite versus GFLOPs per image. The bottom graph shows the average score on the broader 27-dataset suite versus GFLOPs per image. ![该图像是图表,展示了通过线性探针方法在Kornblith等人的12个数据集和26个数据集上的平均转移得分与ImageNet得分的关系。图中通过散点图反映不同模型(如CLIP-ViT、EfficientNet等)在转移学习任务中的表现。纵轴表示转移得分(%),横轴表示ImageNet得分(%),两条虚线显示了理想的线性关系。不同颜色和形状的标记代表不同的模型。](/files/papers/6956a9a55411c3e2652eae93/images/11.jpg) *该图像是图表,展示了通过线性探针方法在Kornblith等人的12个数据集和26个数据集上的平均转移得分与ImageNet得分的关系。图中通过散点图反映不同模型(如CLIP-ViT、EfficientNet等)在转移学习任务中的表现。纵轴表示转移得分(%),横轴表示ImageNet得分(%),两条虚线显示了理想的线性关系。不同颜色和形状的标记代表不同的模型。* Figure 11. The image is a chart that illustrates the performance comparison of the Zero-Shot CLIP model across multiple datasets. The chart displays accuracy for each dataset along with improvement scores over other methods, highlighting the effectiveness of the model in various tasks. * **Performance on Kornblith et al. (2019) 12-Dataset Suite:** Small CLIP models (RN50, RN101) outperform other ImageNet-1K trained ResNets but underperform ImageNet-21K trained ResNets (`BiT-M`) and `EfficientNet` models with similar compute. However, `CLIP scales very well`, and the largest `ResNet-50x64` slightly outperforms the `Noisy Student EfficientNet-L2` in both overall score and compute efficiency. `CLIP Vision Transformers` are about `3x more compute efficient` than `CLIP ResNets`. The best model, `ViT-L/14@336px`, outperforms the best existing model by an average of $2.6\%$. * **Performance on Broader 27-Dataset Suite:** On the expanded suite (including `OCR`, `geo-localization`, `facial emotion recognition`, `action recognition`), CLIP's benefits become clearer. 
* All CLIP models, regardless of scale, outperform all evaluated systems in terms of `compute efficiency`. * The best model's average score improvement over previous systems increases from $2.6\%$ to $5\%$. * `Self-supervised systems` (`SimCLRv2`) also perform better on this broader suite, suggesting the value of increased `task diversity` in evaluation. * **Per-Dataset Differences:** The following figure (Figure 11 from the original paper) visualizes per-dataset differences in performance between the best CLIP model (`ViT-L/14@336px`) and the `Noisy Student EfficientNet-L2`. ![Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.](/files/papers/6956a9a55411c3e2652eae93/images/10.jpg) *该图像是图表,展示了在多种数据集上,CLIP模型与Noisy Student EfficientNet-L2的线性回归表现的差异(Δ Score %)。大多数数据集中,CLIP的表现优于EfficientNet-L2,尤其是在SST2、Country211和HatefulMemes等数据集上,表现上升显著。* Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets. CLIP outperforms the `Noisy Student EfficientNet-L2` on 21 out of 27 datasets. * **Significant Gains:** CLIP performs best on tasks requiring `OCR` (SST2, HatefulMemes), `geo-localization` and `scene recognition` (Country211, SUN397), and `activity recognition` (Kinetics700, UCF101). It also excels in `fine-grained car` and `traffic sign recognition` (Stanford Cars, GTSRB). This suggests `ImageNet's narrow supervision` (e.g., a single label for all traffic signs) might hurt performance on fine-grained tasks. * **Underperformance:** CLIP still underperforms on ImageNet (the EfficientNet's training dataset) and low-resolution datasets (CIFAR10, CIFAR100). It also does slightly worse on `PatchCamelyon` (lymph node tumor detection) and `CLEVRCounts` (object counting), where both approaches have low overall performance. ### 6.1.4. Robustness to Natural Distribution Shift The following figure (Figure 13 from the original paper) compares the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. ![该图像是一个图表,展示了使用 Zero-Shot CLIP 模型在多个数据集上的性能比较。图表中显示了各数据集的准确性以及与其他方法的改进得分,强调了该模型在不同任务中的有效性。](/files/papers/6956a9a55411c3e2652eae93/images/12.jpg) *该图像是一个图表,展示了使用 Zero-Shot CLIP 模型在多个数据集上的性能比较。图表中显示了各数据集的准确性以及与其他方法的改进得分,强调了该模型在不同任务中的有效性。* Figure 13. The image is a chart that illustrates the performance comparison of the Zero-Shot CLIP model across multiple datasets. The chart displays accuracy for each dataset along with improvement scores over other methods, highlighting the effectiveness of the model in various tasks. * **Zero-Shot CLIP's Enhanced Robustness:** `Zero-shot CLIP models` significantly improve `effective robustness` by reducing the gap between ImageNet accuracy and accuracy under `natural distribution shifts` by up to `75%`. This is a crucial finding, suggesting that models not trained on specific task distributions are less susceptible to `spurious correlations`. * **Supervised Adaptation Harms Robustness:** The following figure (Figure 14 from the original paper) visualizes how performance changes from the zero-shot classifier to a `supervised linear classifier` adapted to the ImageNet distribution. ![Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by $9 . 
2 \\%$ , it slightly reduces average robustness. with ImageNet categories.](/files/papers/6956a9a55411c3e2652eae93/images/13.jpg) *该图像是一个图表,展示了在多个自然分布偏移数据集上,采用不同方法适应 ImageNet 分类器准确度的变化。图中显示,当针对 ImageNet 进行监督适应时,准确度提高了 $9.2\%$,但在平均鲁棒性方面略有下降。* Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by $9 . 2 \\%$ , it slightly reduces average robustness. with ImageNet categories. While adapting CLIP to the `ImageNet distribution` (via `L2 regularized logistic regression`) increases its ImageNet accuracy by $9.2\%$ (to $85.4\%$), `average accuracy under distribution shift slightly decreases`. This is a surprising result: a large gain in in-distribution accuracy does not translate to improved out-of-distribution robustness, implying that these gains are largely from exploiting `distribution-specific patterns`. * **Role of Class Naming:** Using `custom zero-shot classifiers` for each dataset based on its `specific class names` (instead of pooling `ImageNet superclasses`) improves `average effective robustness` by $5\%$. * **Few-Shot Robustness:** The following figure (Figure 15 from the original paper) visualizes the performance of 0-shot, 1-shot, ..., 128-shot, and fully supervised logistic regression classifiers on the best CLIP model's features. ![Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.](/files/papers/6956a9a55411c3e2652eae93/images/14.jpg) *该图像是一个图表,展示了不同训练方式下CLIP模型在7个自然分布迁移数据集上的表现。横轴为在ImageNet子抽样类上的平均准确率,纵轴为在自然分布迁移数据集上的平均准确率。图中包含不同训练样本数量的标记,分别为1-shot至128-shot,并且显示出零-shot和few-shot CLIP相较于传统模型的效果差异。* Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust. `Few-shot models` also show higher `effective robustness` than existing models, but this benefit `fades` as `in-distribution performance` increases with more training data. `Zero-shot CLIP` is `notably more robust` than a few-shot model with equivalent ImageNet performance. The conclusion is that `high effective robustness` results from `minimizing distribution-specific training data`. ### 6.1.5. Comparison to Human Performance The following are the results from Table 2 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <th></th> <th>Accuracy</th> <th>Majority Vote on Full Dataset</th> <th>Accuracy on Guesses</th> <th>Majority Vote Accuracy on Guesses</th> </tr> </thead> <tbody> <tr> <td>Zero-shot human</td> <td>53.7</td> <td>57.0</td> <td>69.7</td> <td>63.9</td> </tr> <tr> <td>Zero-shot CLIP</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> </tr> <tr> <td>One-shot human</td> <td>75.7</td> <td>80.3</td> <td>78.5</td> <td>81.2</td> </tr> <tr> <td>Two-shot human</td> <td>75.7</td> <td>85.0</td> <td>79.2</td> <td>86.1</td> </tr> </tbody> </table></div> On the `Oxford IIT Pets` dataset, `zero-shot CLIP` (93.5% accuracy) vastly outperforms `zero-shot humans` (53.7%). 
Humans, however, show a `large leap in performance` from zero-shot to one-shot (53.7% to 75.7%), with minimal additional gain from two-shot. This suggests humans "know what they don't know" and quickly update priors from a single example. This highlights a `significant gap between machine and human sample efficiency` in few-shot learning, where humans effectively integrate prior knowledge. The following figure (Figure 16 from the original paper) plots human accuracy vs CLIP's zero-shot accuracy. ![Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label.](/files/papers/6956a9a55411c3e2652eae93/images/15.jpg) *该图像是图表,展示了不同犬种在 CLIP 模型和人类零-shot 及 one-shot 识别中的准确率。横轴为犬种,纵轴为准确率百分比,三条线分别代表 Zero-Shot CLIP、One-Shot Human 和 Zero-Shot Human 的表现,反映出 CLIP 在难度排序上的一致性。* Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label. The hardest problems for CLIP (e.g., specific dog breeds that are very similar) tend to also be hard for humans, suggesting consistency in difficulty stemming from dataset noise or genuinely challenging visual distinctions. ### 6.1.6. Data Overlap Analysis The following figure (Figure 17 from the original paper) summarizes the data overlap analysis. ![该图像是一个示意图,展示了在不同数据重叠百分比下,清洁数据与重叠数据的准确性差异。左侧图表显示了包括CIFAR-100和SUN397在内的多个数据集的准确性变化,右侧图表则呈现了不同数据集总体准确性变化的情况。](/files/papers/6956a9a55411c3e2652eae93/images/16.jpg) *该图像是一个示意图,展示了在不同数据重叠百分比下,清洁数据与重叠数据的准确性差异。左侧图表显示了包括CIFAR-100和SUN397在内的多个数据集的准确性变化,右侧图表则呈现了不同数据集总体准确性变化的情况。* Figure 16. The image is a diagram showing the difference in accuracy between clean data and overlapping data at various percentages of data overlap. The left chart displays accuracy changes for several datasets, including CIFAR-100 and SUN397, while the right chart presents the overall accuracy change for different datasets. * **Procedure:** A custom `near-duplicate detector` was used (trained with a `contrastive loss` on heavily augmented images) to identify overlapping examples between the `WIT pre-training dataset` and `downstream evaluation datasets`. * **Overlap Rate:** Out of 35 datasets, 9 had `no detected overlap`. The `median overlap` was `2.2%`, and the `average overlap` was `3.2%`. * **Impact on Performance:** Due to the small overlap, `overall accuracy` was rarely shifted by more than $0.1\%$, with only 7 datasets above this threshold. Only 2 were `statistically significant` after `Bonferroni correction`. The `max detected improvement` was $0.6\%$ on `Birdsnap` (with $12.1\%$ overlap). `Country211` had the largest overlap ($21.5\%$) but only a $0.2\%$ accuracy increase, potentially because the training text wasn't directly related to the geo-localization task. * **Conclusion:** The minimal impact of overlap aligns with previous `large-scale pre-training` studies (Mahajan et al., 2018; Kolesnikov et al., 2019), suggesting that `data contamination` is not a major factor inflating CLIP's reported performance. ### 6.1.7. 
Dataset Ablation on YFCC100M The following are the results from Table 12 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <th rowspan="2">Dataset</th> <th colspan="3">Linear Classifier</th> <th colspan="3">Zero Shot</th> </tr> <tr> <th>YFCC</th> <th>WIT</th> <th>∆</th> <th>YFCC</th> <th>WIT</th> <th>∆</th> </tr> </thead> <tbody> <tr> <td>Birdsnap</td> <td>47.4</td> <td>35.3</td> <td>+12.1</td> <td>19.9</td> <td>4.5</td> <td>+15.4</td> </tr> <tr> <td>Country211</td> <td>23.1</td> <td>17.3</td> <td>+5.8</td> <td>5.2</td> <td>5.3</td> <td>+0.1</td> </tr> <tr> <td>Flowers102</td> <td>94.4</td> <td>89.8</td> <td>+4.6</td> <td>48.6</td> <td>21.7</td> <td>+26.9</td> </tr> <tr> <td>GTSRB</td> <td>66.8</td> <td>72.5</td> <td>-5.7</td> <td>6.9</td> <td>7.0</td> <td>−0.1</td> </tr> <tr> <td>UCF101</td> <td>69.2</td> <td>74.9</td> <td>-5.7</td> <td>22.9</td> <td>32.0</td> <td>-9.1</td> </tr> <tr> <td>Stanford Cars</td> <td>31.4</td> <td>50.3</td> <td>−18.9</td> <td>3.8</td> <td>10.9</td> <td>-7.1</td> </tr> <tr> <td>ImageNet</td> <td>62.0</td> <td>60.8</td> <td>+1.2</td> <td>31.3</td> <td>27.6</td> <td>+3.7</td> </tr> <tr> <td>Dataset Average</td> <td>65.5</td> <td>66.6</td> <td>−1.1</td> <td>29.6</td> <td>30.0</td> <td>−0.4</td> </tr> <tr> <td>Dataset "Wins"</td> <td>10</td> <td>15</td> <td>-5</td> <td>19</td> <td>18</td> <td>+1</td> </tr> </tbody> </table></div> An ablation study compared a `ResNet-50` model trained on a `filtered subset of YFCC100M` with the same model trained on an `equally sized subset of WIT`. * **Overall:** `YFCC` and `WIT` show `similar average performance` for both `zero-shot` and `linear probe` settings. * **Specific Datasets:** Performance on `fine-grained classification datasets` can vary widely (e.g., `Birdsnap`, `Flowers102` better with YFCC; `Stanford Cars`, `UCF101` better with WIT). This likely reflects the `relative density of relevant data` for specific concepts within each pre-training dataset. * **Main Advantage of WIT:** The primary advantage of `WIT` over `YFCC100M` is its `much larger total size`, which enables `better generalization`. ### 6.1.8. 
Selected Task and Dataset Results (Appendix E) * **Image and Text Retrieval:** The following are the results from Table 13 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="3" colspan="2"></td> <th colspan="6">Text Retrieval</th> <th colspan="6">Image Retrieval</th> </tr> <tr> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> </tr> <tr> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> </tr> </thead> <tbody> <tr> <td rowspan="4" colspan="2">SOTA Fined-tuned</td> <td>Unicoder-VLa</td> <td>86.2</td> <td>96.3</td> <td>99.0</td> <td>62.3</td> <td>87.1</td> <td>92.8</td> <td>71.5</td> <td>90.9</td> <td>94.9</td> <td>46.7</td> <td>76.0</td> <td>85.3</td> </tr> <tr> <td>Uniterb</td> <td>87.3</td> <td>98.0</td> <td>99.2</td> <td>65.7</td> <td>88.6</td> <td>93.8</td> <td>75.6</td> <td>94.1</td> <td>96.8</td> <td>52.9</td> <td>79.9</td> <td>88.0</td> </tr> <tr> <td>VILLAc</td> <td>87.9</td> <td>97.5</td> <td>98.8</td> <td>-</td> <td>-</td> <td>-</td> <td>76.3</td> <td>94.2</td> <td>96.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Oscard</td> <td>-</td> <td>-</td> <td>-</td> <td>73.5</td> <td>92.2</td> <td>96.0</td> <td>-</td> <td>-</td> <td>-</td> <td>57.5</td> <td>82.8</td> <td>89.8</td> </tr> <tr> <td rowspan="5" colspan="2">SOTA Zero-Shot</td> <td>ERNIE-ViLe</td> <td>88.7</td> <td>98.0</td> <td>99.2</td> <td>-</td> <td>-</td> <td>-</td> <td>76.7</td> <td>93.6</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Visual N-Gramsf</td> <td>15.4</td> <td>35.7</td> <td>45.1</td> <td>8.7</td> <td>23.1</td> <td>33.3</td> <td>8.8</td> <td>21.2</td> <td>29.9</td> <td>5.0</td> <td>14.5</td> <td>21.9</td> </tr> <tr> <td>ImageBERTg</td> <td>-</td> <td>-</td> <td>-</td> <td>44.0</td> <td>71.2</td> <td>80.4</td> <td>-</td> <td>-</td> <td>-</td> <td>32.3</td> <td>59.0</td> <td>70.2</td> </tr> <tr> <td>Unicoder-VLa</td> <td>64.3</td> <td>86.8</td> <td>92.3</td> <td>-</td> <td>-</td> <td>-</td> <td>48.4</td> <td>76.0</td> <td>85.2</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Uniterb</td> <td>83.6</td> <td>95.7</td> <td>97.7</td> <td>-</td> <td>-</td> <td>-</td> <td>68.7</td> <td>89.2</td> <td>93.9</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td colspan="2">CLIP</td> <td>88.0</td> <td>98.7</td> <td>99.4</td> <td>58.4</td> <td>81.5</td> <td>88.1</td> <td>68.7</td> <td>90.6</td> <td>95.2</td> <td>37.8</td> <td>62.4</td> <td>72.2</td> </tr> </tbody> </table></div> The following are the results from Table 13 of the original paper: Text Retrieval (Flickr30k, MSCOCO) and Image Retrieval (Flickr30k, MSCOCO). CLIP, pre-trained for `image-text retrieval`, performs well on this `sanity check`. `Zero-shot CLIP` matches or outperforms all prior `zero-shot results` on `Flickr30k` and `MSCOCO`. On `Flickr30k text retrieval`, it's competitive with the `overall SOTA`. On `image retrieval`, it's competitive with a `fine-tuned Unicoder-VL` but not the `overall SOTA`. `Prompting` with "a photo of" boosts performance. 
* **Optical Character Recognition (OCR):** The following are the results from Table 14 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <td></td> <th>MNIST</th> <th>SVHN</th> <th>IIIT5K 1k</th> <th>Hateful Memes</th> <th>SST-2</th> </tr> </thead> <tbody> <tr> <td rowspan="3" colspan="2">SOTA JOINT</td> <td>a</td> <td>99.8</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>b</td> <td>-</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>c</td> <td>-</td> <td>-</td> <td>98.9</td> <td>78.0</td> <td>97.5</td> </tr> <tr> <td rowspan="2" colspan="2">SOTA SUPERVISED</td> <td>Raw Pixels</td> <td>92.5</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>ES Best</td> <td>-</td> <td>-</td> <td>89.6</td> <td>58.6</td> <td>59.0</td> </tr> <tr> <td colspan="2">Linear CLIP</td> <td>99.2</td> <td>-</td> <td>-</td> <td>77.3</td> <td>80.5</td> </tr> <tr> <td colspan="2">Zero-Shot CLIP</td> <td>88.4</td> <td>51.0</td> <td>90.0</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div> The following are the results from Table 14 of the original paper: OCR performance on 5 datasets. CLIP learns `primitive OCR capabilities`. * **Strong on Digital Text:** Strongest on `Hateful Memes` and `SST-2` (digitally rendered words), where linear CLIP reaches $80.5\%$ on `SST-2` (on par with a `GloVe CBOW baseline`) and $77.3\%$ on `Hateful Memes` (0.7 points behind SOTA). * **Weak on Natural/Handwritten:** Weaker on `IIIT5K` (natural images of words) and particularly poor on `SVHN` (street view numbers, $51\%$ accuracy) and `MNIST` (handwritten digits, $88\%$ accuracy), where even simple `logistic regression on raw pixels` outperforms it. This suggests issues with `repeated characters`, `low resolution`, `blurry images`, and truly `out-of-distribution` data. * **Action Recognition in Videos:** The following are the results from Table 15 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="2"></td> <td rowspan="2"></td> <th>UCF101</th> <th>K700</th> <th colspan="2">RareAct</th> </tr> <tr> <th>Top-1</th> <th>AVG</th> <th>mWAP</th> <th>mWSAP</th> </tr> </thead> <tbody> <tr> <td rowspan="4">SOTA FINE-TUNED</td> <td>R(2+1)D-BERTa</td> <td>98.7</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2b</td> <td>-</td> <td>84.8</td> <td>-</td> <td>-</td> </tr> <tr> <td>HT100M S3Dd</td> <td>91.3</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Baseline I3De</td> <td>-</td> <td>70.2</td> <td>-</td> <td>-</td> </tr> <tr> <td rowspan="4">SOTA LINEAR</td> <td>MMV FACf</td> <td>91.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2c</td> <td>89.4</td> <td>68.2</td> <td>-</td> <td>-</td> </tr> <tr> <td>CLIP</td> <td>92.0</td> <td>73.0</td> <td>-</td> <td>-</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> <tr> <td rowspan="2">SOTA ZERO-SHOT</td> <td>HT100M S3Dd</td> <td>-</td> <td>-</td> <td>30.5</td> <td>34.8</td> </tr> <tr> <td>CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> </tbody> </table></div> The following are the results from Table 15 of the original paper: Action recognition performance on 3 video datasets. CLIP performs strongly on `action recognition`, a task involving `verbs`, suggesting benefits from `natural language's broader supervision`. * **Linear Evaluation:** CLIP matches the best prior result on `UCF101` and outperforms all other models in the evaluation suite. 
On `Kinetics-700`, CLIP outperforms the fine-tuned `I3D baseline`. (Note: Linear evaluations use single central frames, underestimating full video performance.) * **Zero-Shot Evaluation:** On `Kinetics-700`, `zero-shot CLIP` (averaging predictions across all frames) is within $1\%$ of the `fully supervised I3D baseline`. On `RareAct` (unusual actions), CLIP improves over the prior SOTA by 10 points. * **Geolocalization:** The following are the results from Table 17 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <th>1km</th> <th>25km</th> <th>200km</th> <th>750km</th> <th>2500km</th> </tr> </thead> <tbody> <tr> <td>ISNsa</td> <td>16.9</td> <td>43.0</td> <td>51.9</td> <td>66.7</td> <td>80.2</td> </tr> <tr> <td>CPlaNetb</td> <td>16.5</td> <td>37.1</td> <td>46.4</td> <td>62.0</td> <td>78.5</td> </tr> <tr> <td>CLIP</td> <td>13.9</td> <td>32.9</td> <td>43.0</td> <td>62.0</td> <td>79.3</td> </tr> <tr> <td>Deep-Ret+c</td> <td>14.4</td> <td>33.3</td> <td>47.7</td> <td>61.6</td> <td>73.4</td> </tr> <tr> <td>PlaNetd</td> <td>8.4</td> <td>24.5</td> <td>37.6</td> <td>53.6</td> <td>71.3</td> </tr> </tbody> </table></div> The following are the results from Table 17 of the original paper: Geolocalization performance on the IM2GPS test set. CLIP shows an ability to recognize places. Using `nearest-neighbor regression` in CLIP's embedding space on the `IM2GPS` test set, CLIP performs similarly to several task-specific models despite querying only 1 million images (less than prior work). It is `not competitive with the current state-of-the-art` in this regression-based setting. ### 6.2. Data Presentation (Tables) The following are the results from Table 10 of the original paper: Linear probe performance of various pre-trained models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="1" colspan="2"></td> <td rowspan="1" colspan="25"></td> </tr> <tr> <td rowspan="1" colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td colspan="2">LM RN50</td> <td>65.2</td> <td>90.0</td> <td>64.9</td> <td>19.6</td> <td>64.0</td> <td>37.0</td> <td>0.1</td> <td>56.8</td> <td>65.2</td> <td>76.8</td> <td>82.1</td> <td>93.4</td> <td>70.0</td> <td>97.7</td> <td>78.0</td> <td>65.7</td> <td>76.6</td> <td>43.7</td> <td>25.3</td> <td>52.5</td> <td>62.6</td> <td>40.7</td> <td>78.2</td> <td>53.8</td> <td>59.6</td> <td>56.9</td> </tr> <tr> <td rowspan="5" colspan="2">CLIP</td> <td>RN50</td> <td>88.9</td> <td>91.1</td> <td>73.5</td> <td>58.6</td> <td>75.1</td> <td>91.3</td> <td>90.5</td> <td>73.0</td> <td>65.7</td> <td>77.0</td> <td>85.9</td> <td>97.8</td> <td>64.2</td> <td>98.3</td> <td>82.4</td> <td>70.2</td> <td>25.3</td> <td>82.4</td> <td>57.3</td> <td>68.0</td> <td>76.6</td> <td>53.8</td> <td>71.1</td> <td>80.0</td> <td>81.5</td> </tr> <tr> <td>RN101</td> <td>93.3</td> <td>92.2</td> <td>74.9</td> <td>72.8</td> <td>79.2</td> <td>88.7</td> <td>62.7</td> <td>89.0</td> <td>79.1</td> <td>94.8</td> <td>94.1</td> <td>98.3</td> <td>68.7</td> <td>98.6</td> <td>89.7</td> <td>85.5</td> <td>30.3</td> <td>83.0</td> <td>78.6</td> <td>79.1</td> <td>91.4</td> <td>69.2</td> <td>40.7</td> <td>83.7</td> <td>89.5</td> </tr> <tr> <td>RN50x4</td> <td>94.9</td> <td>94.1</td> <td>78.6</td> <td>77.2</td> <td>81.1</td> <td>90.5</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>98.9</td> <td>71.3</td> <td>99.1</td> <td>91.4</td> <td>89.0</td> <td>34.8</td> <td>83.5</td> <td>82.0</td> <td>92.7</td> <td>95.1</td> <td>60.3</td> <td>46.4</td> <td>85.6</td> <td>92.0</td> </tr> <tr> <td>RN50x16</td> <td>95.9</td> <td>95.0</td> <td>80.7</td> <td>78.2</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>RN50x64</td> <td>96.2</td> <td>95.9</td> <td>81.6</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>92.8</td> <td>96.2</td> <td>83.1</td> <td>67.8</td> <td>78.4</td> <td>86.7</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>96.9</td> <td>69.2</td> <td>98.3</td> <td>85.3</td> <td>66.2</td> <td>27.8</td> <td>83.9</td> <td>76.5</td> <td>90.0</td> <td>93.0</td> <td>61.7</td> <td>52.1</td> <td>66.7</td> <td>70.8</td> </tr> <tr> <td>ViT-B/16</td> <td>94.7</td> <td>97.1</td> <td>86.6</td> <td>67.8</td> <td>80.2</td> 
<td>89.6</td> <td>70.3</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.0</td> <td>97.1</td> <td>72.2</td> <td>99.2</td> <td>86.6</td> <td>67.8</td> <td>33.3</td> <td>83.5</td> <td>79.7</td> <td>93.5</td> <td>97.1</td> <td>70.3</td> <td>57.1</td> <td>75.5</td> <td>80.2</td> </tr> <tr> <td>ViT-L/14</td> <td>95.9</td> <td>97.9</td> <td>87.4</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>98.9</td> <td>72.9</td> <td>99.2</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>95.4</td> <td>98.9</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>96.5</td> <td>98.1</td> <td>89.0</td> <td>78.5</td> <td>82.5</td> <td>91.5</td> <td>72.2</td> <td>90.0</td> <td>83.0</td> <td>95.2</td> <td>96.0</td> <td>99.2</td> <td>73.0</td> <td>99.7</td> <td>94.1</td> <td>92.8</td> <td>46.4</td> <td>85.6</td> <td>85.4</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="9" colspan="2">EfficientNet</td> <td>B0</td> <td>71.2</td> <td>93.0</td> <td>73.3</td> <td>60.6</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>85.6</td> <td>75.6</td> <td>93.8</td> <td>93.1</td> <td>94.5</td> <td>98.1</td> <td>55.2</td> <td>98.2</td> <td>97.0</td> <td>84.3</td> <td>74.0</td> <td>71.6</td> <td>14.0</td> <td>83.1</td> <td>76.2</td> <td>51.7</td> <td>47.3</td> <td>55.7</td> <td>55.0</td> </tr> <tr> <td>B1</td> <td>74.2</td> <td>93.4</td> <td>73.6</td> <td>63.2</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>86.2</td> <td>77.0</td> <td>94.6</td> <td>94.4</td> <td>95.1</td> <td>98.0</td> <td>56.1</td> <td>98.2</td> <td>96.9</td> <td>84.3</td> <td>73.1</td> <td>67.1</td> <td>14.5</td> <td>83.9</td> <td>79.9</td> <td>54.3</td> <td>54.9</td> <td>81.1</td> <td>77.4</td> </tr> <tr> <td>B2</td> <td>77.4</td> <td>94.0</td> <td>78.0</td> <td>66.5</td> <td>64.4</td> <td>66.0</td> <td>59.3</td> <td>85.8</td> <td>73.1</td> <td>94.1</td> <td>93.7</td> <td>93.3</td> <td>98.5</td> <td>57.1</td> <td>98.2</td> <td>97.3</td> <td>85.0</td> <td>75.8</td> <td>76.1</td> <td>13.4</td> <td>83.3</td> <td>78.1</td> <td>50.9</td> <td>45.1</td> <td>53.8</td> <td>54.8</td> </tr> <tr> <td>B3</td> <td>79.7</td> <td>94.1</td> <td>78.7</td> <td>70.1</td> <td>65.4</td> <td>66.4</td> <td>60.4</td> <td>86.5</td> <td>73.4</td> <td>94.7</td> <td>93.5</td> <td>93.2</td> <td>98.8</td> <td>57.9</td> <td>98.2</td> <td>96.8</td> <td>85.0</td> <td>78.3</td> <td>72.3</td> <td>13.9</td> <td>83.1</td> <td>79.1</td> <td>52.5</td> <td>46.5</td> <td>54.4</td> <td>55.4</td> </tr> <tr> <td>B4</td> <td>81.5</td> <td>93.6</td> <td>77.9</td> <td>72.4</td> <td>67.1</td> <td>72.7</td> <td>68.9</td> <td>86.7</td> <td>73.9</td> <td>95.0</td> <td>94.7</td> <td>94.5</td> <td>98.4</td> <td>58.5</td> <td>98.2</td> <td>96.8</td> <td>86.0</td> <td>78.5</td> <td>69.6</td> <td>14.9</td> <td>84.7</td> <td>80.9</td> <td>54.5</td> <td>46.6</td> <td>53.3</td> <td>56.3</td> </tr> <tr> <td>B5</td> <td>84.5</td> <td>94.8</td> <td>80.0</td> <td>73.5</td> <td>65.8</td> <td>71.1</td> <td>68.2</td> <td>87.6</td> <td>73.9</td> <td>95.0</td> <td>94.1</td> <td>93.7</td> <td>98.4</td> <td>60.2</td> <td>98.2</td> <td>96.8</td> <td>85.4</td> <td>78.1</td> <td>72.7</td> <td>15.3</td> <td>84.2</td> <td>80.0</td> <td>54.1</td> <td>51.1</td> <td>53.3</td> <td>57.0</td> </tr> <tr> <td>B6</td> <td>86.9</td> <td>96.0</td> <td>82.0</td> <td>74.7</td> <td>69.0</td> <td>77.1</td> <td>72.3</td> <td>87.2</td> 
<td>76.8</td> <td>95.2</td> <td>94.7</td> <td>96.9</td> <td>98.6</td> <td>61.4</td> <td>99.1</td> <td>96.3</td> <td>86.8</td> <td>80.8</td> <td>75.8</td> <td>16.4</td> <td>85.2</td> <td>81.9</td> <td>57.7</td> <td>51.9</td> <td>54.8</td> <td>58.8</td> </tr> <tr> <td>B7</td> <td>87.3</td> <td>97.0</td> <td>83.9</td> <td>75.8</td> <td>71.4</td> <td>67.6</td> <td>65.6</td> <td>87.3</td> <td>78.5</td> <td>95.2</td> <td>96.4</td> <td>97.2</td> <td>98.6</td> <td>61.9</td> <td>99.5</td> <td>96.6</td> <td>86.1</td> <td>70.7</td> <td>72.4</td> <td>17.6</td> <td>84.2</td> <td>85.5</td> <td>61.0</td> <td>49.6</td> <td>54.6</td> <td>55.7</td> </tr> <tr> <td>B8</td> <td>88.4</td> <td>96.0</td> <td>82.0</td> <td>76.9</td> <td>72.6</td> <td>72.2</td> <td>71.2</td> <td>88.1</td> <td>80.5</td> <td>95.5</td> <td>95.5</td> <td>96.6</td> <td>98.5</td> <td>62.7</td> <td>99.4</td> <td>96.2</td> <td>88.5</td> <td>73.4</td> <td>73.0</td> <td>18.5</td> <td>83.8</td> <td>86.6</td> <td>63.2</td> <td>50.5</td> <td>57.2</td> <td>56.7</td> </tr> <tr> <td rowspan="2" colspan="2">NS EfficientNet</td> <td>L2-475</td> <td>91.6</td> <td>99.0</td> <td>91.0</td> <td>74.8</td> <td>76.4</td> <td>75.1</td> <td>66.8</td> <td>89.5</td> <td>81.9</td> <td>95.6</td> <td>96.5</td> <td>97.7</td> <td>98.9</td> <td>67.5</td> <td>99.3</td> <td>97.0</td> <td>89.5</td> <td>73.4</td> <td>68.9</td> <td>22.2</td> <td>86.3</td> <td>89.4</td> <td>68.2</td> <td>58.3</td> <td>58.6</td> <td>55.2</td> </tr> <tr> <td>L2-800</td> <td>92.0</td> <td>98.7</td> <td>89.0</td> <td>78.5</td> <td>75.7</td> <td>68.4</td> <td>68.4</td> <td>89.4</td> <td>82.5</td> <td>95.6</td> <td>94.7</td> <td>97.9</td> <td>98.5</td> <td>68.4</td> <td>99.3</td> <td>97.2</td> <td>89.9</td> <td>77.7</td> <td>66.9</td> <td>23.7</td> <td>86.8</td> <td>88.9</td> <td>58.4</td> <td>56.9</td> <td>88.4</td> <td>78.5</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt</td> <td>32x8d</td> <td>84.8</td> <td>95.9</td> <td>80.9</td> <td>63.8</td> <td>69.0</td> <td>74.2</td> <td>56.0</td> <td>88.0</td> <td>75.4</td> <td>95.4</td> <td>93.9</td> <td>91.7</td> <td>97.4</td> <td>60.7</td> <td>99.1</td> <td>95.7</td> <td>82.1</td> <td>72.3</td> <td>69.2</td> <td>16.7</td> <td>82.3</td> <td>80.1</td> <td>56.8</td> <td>42.2</td> <td>53.3</td> <td>55.2</td> </tr> <tr> <td>32x16d</td> <td>85.7</td> <td>96.5</td> <td>80.9</td> <td>64.8</td> <td>70.5</td> <td>77.5</td> <td>56.7</td> <td>87.9</td> <td>76.2</td> <td>95.6</td> <td>94.9</td> <td>92.5</td> <td>97.4</td> <td>61.3</td> <td>99.3</td> <td>95.5</td> <td>82.8</td> <td>73.8</td> <td>66.1</td> <td>17.5</td> <td>83.4</td> <td>81.1</td> <td>58.2</td> <td>41.3</td> <td>54.2</td> <td>56.1</td> </tr> <tr> <td>32x32d</td> <td>86.7</td> <td>96.8</td> <td>82.7</td> <td>67.1</td> <td>71.5</td> <td>77.5</td> <td>55.4</td> <td>88.3</td> <td>78.5</td> <td>95.8</td> <td>95.3</td> <td>94.4</td> <td>97.9</td> <td>62.4</td> <td>99.3</td> <td>95.7</td> <td>85.4</td> <td>71.2</td> <td>66.8</td> <td>18.0</td> <td>83.7</td> <td>82.1</td> <td>58.8</td> <td>39.7</td> <td>55.3</td> <td>56.7</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt (FixRes)</td> <td>32x48d</td> <td>86.9</td> <td>96.8</td> <td>83.4</td> <td>65.9</td> <td>72.2</td> <td>76.6</td> <td>53.2</td> <td>88.0</td> <td>77.2</td> <td>95.5</td> <td>95.8</td> <td>93.6</td> <td>98.1</td> <td>63.7</td> <td>99.4</td> <td>95.3</td> <td>85.4</td> <td>73.0</td> <td>67.2</td> <td>18.5</td> <td>82.7</td> <td>82.8</td> <td>59.2</td> <td>41.3</td> <td>55.5</td> <td>56.7</td> 
</tr> <tr> <td>FixRes-v1</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.4</td> <td>72.9</td> <td>80.5</td> <td>57.6</td> <td>88.0</td> <td>77.9</td> <td>95.8</td> <td>96.1</td> <td>94.5</td> <td>97.9</td> <td>62.2</td> <td>99.4</td> <td>96.2</td> <td>86.6</td> <td>76.1</td> <td>64.8</td> <td>19.3</td> <td>82.5</td> <td>83.4</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td>FixRes-v2</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.3</td> <td>72.9</td> <td>80.7</td> <td>57.5</td> <td>88.0</td> <td>77.9</td> <td>95.0</td> <td>96.0</td> <td>94.5</td> <td>98.0</td> <td>62.1</td> <td>99.4</td> <td>96.5</td> <td>86.1</td> <td>76.3</td> <td>64.8</td> <td>19.5</td> <td>82.3</td> <td>83.2</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td rowspan="2" colspan="2">BiT-S</td> <td>R50x1</td> <td>72.5</td> <td>91.7</td> <td>74.8</td> <td>57.7</td> <td>61.1</td> <td>53.5</td> <td>83.7</td> <td>72.4</td> <td>92.3</td> <td>91.2</td> <td>98.4</td> <td>56.5</td> <td>96.4</td> <td>97.4</td> <td>85.0</td> <td>70.0</td> <td>66.0</td> <td>12.5</td> <td>83.0</td> <td>72.3</td> <td>47.5</td> <td>48.3</td> <td>54.1</td> <td>55.3</td> </tr> <tr> <td>R50x3</td> <td>75.1</td> <td>93.7</td> <td>79.0</td> <td>61.1</td> <td>63.7</td> <td>55.2</td> <td>84.8</td> <td>74.6</td> <td>92.5</td> <td>91.6</td> <td>92.8</td> <td>98.8</td> <td>58.7</td> <td>97.0</td> <td>97.8</td> <td>86.4</td> <td>73.1</td> <td>73.8</td> <td>14.0</td> <td>84.2</td> <td>76.4</td> <td>50.0</td> <td>49.2</td> <td>54.7</td> <td>54.2</td> </tr> <tr> <td rowspan="4" colspan="2">BiT-M</td> <td>R101x1</td> <td>73.5</td> <td>92.8</td> <td>77.4</td> <td>58.4</td> <td>61.3</td> <td>54.0</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.8</td> <td>90.6</td> <td>98.3</td> <td>56.5</td> <td>96.8</td> <td>97.3</td> <td>84.6</td> <td>69.4</td> <td>68.9</td> <td>12.9</td> <td>82.0</td> <td>73.5</td> <td>73.5</td> <td>48.6</td> <td>45.4</td> <td>52.6</td> </tr> <tr> <td>R101x3</td> <td>74.7</td> <td>93.9</td> <td>79.8</td> <td>57.8</td> <td>62.9</td> <td>54.7</td> <td>84.7</td> <td>75.5</td> <td>92.3</td> <td>91.2</td> <td>92.6</td> <td>98.8</td> <td>59.7</td> <td>97.3</td> <td>98.0</td> <td>85.5</td> <td>71.8</td> <td>60.2</td> <td>14.1</td> <td>83.1</td> <td>75.9</td> <td>75.9</td> <td>50.4</td> <td>49.7</td> <td>54.1</td> </tr> <tr> <td>R152x2</td> <td>74.9</td> <td>94.3</td> <td>79.7</td> <td>58.7</td> <td>62.7</td> <td>55.9</td> <td>85.3</td> <td>74.9</td> <td>93.0</td> <td>92.0</td> <td>91.7</td> <td>98.6</td> <td>58.3</td> <td>97.3</td> <td>97.8</td> <td>86.2</td> <td>71.8</td> <td>71.6</td> <td>13.9</td> <td>84.1</td> <td>76.2</td> <td>76.2</td> <td>49.9</td> <td>48.2</td> <td>53.8</td> </tr> <tr> <td>R152x4</td> <td>74.7</td> <td>94.2</td> <td>79.2</td> <td>57.8</td> <td>62.9</td> <td>51.2</td> <td>85.4</td> <td>75.4</td> <td>93.1</td> <td>91.2</td> <td>91.4</td> <td>98.9</td> <td>61.4</td> <td>97.2</td> <td>98.0</td> <td>85.5</td> <td>72.8</td> <td>67.9</td> <td>14.9</td> <td>83.1</td> <td>76.0</td> <td>50.3</td> <td>42.9</td> <td>53.6</td> <td>56.0</td> </tr> <tr> <td rowspan="7" colspan="2">ViT (ImageNet-21k)</td> <td>B/32</td> <td>86.7</td> <td>96.9</td> <td>86.4</td> <td>74.0</td> <td>74.2</td> <td>54.7</td> <td>86.7</td> <td>86.3</td> <td>73.1</td> <td>90.4</td> <td>94.5</td> <td>97.8</td> <td>59.0</td> <td>99.0</td> <td>96.3</td> <td>83.0</td> <td>68.1</td> <td>65.1</td> <td>15.7</td> <td>82.6</td> <td>79.1</td> <td>51.7</td> <td>38.9</td> <td>57.1</td> <td>54.6</td> </tr> <tr> 
<td>B/16</td> <td>89.2</td> <td>97.4</td> <td>87.4</td> <td>76.5</td> <td>74.9</td> <td>62.5</td> <td>86.1</td> <td>75.4</td> <td>91.9</td> <td>94.7</td> <td>98.9</td> <td>62.0</td> <td>99.3</td> <td>97.6</td> <td>85.7</td> <td>70.4</td> <td>58.8</td> <td>17.7</td> <td>85.7</td> <td>84.1</td> <td>58.0</td> <td>38.4</td> <td>58.4</td> <td>52.8</td> </tr> <tr> <td>L/14</td> <td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>H/14</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> <tr> <td rowspan="2" colspan="2">SimCLRv2</td> <td>R50x1</td> <td>76.4</td> <td>93.2</td> <td>77.9</td> <td>48.6</td> <td>64.1</td> <td>56.3</td> <td>84.4</td> <td>77.0</td> <td>88.3</td> <td>91.8</td> <td>92.9</td> <td>97.6</td> <td>59.5</td> <td>96.7</td> <td>97.9</td> <td>85.8</td> <td>71.1</td> <td>69.1</td> <td>15.8</td> <td>84.8</td> <td>78.4</td> <td>51.0</td> <td>56.2</td> <td>53.9</td> <td>53.8</td> </tr> <tr> <td>R50x3</td> <td>82.2</td> <td>96.4</td> <td>83.4</td> <td>57.5</td> <td>68.2</td> <td>82.2</td> <td>86.9</td> <td>74.6</td> <td>60.6</td> <td>87.7</td> <td>78.5</td> <td>93.2</td> <td>95.3</td> <td>99.4</td> <td>98.6</td> <td>64.1</td> <td>99.3</td> <td>98.0</td> <td>88.1</td> <td>69.9</td> <td>59.6</td> <td>19.6</td> <td>83.4</td> <td>83.0</td> <td>57.8</td> <td>51.3</td> </tr> <tr> <td rowspan="2" colspan="2">BYOL</td> <td>R50x1</td> <td>77.0</td> <td>88.3</td> <td>93.7</td> <td>94.3</td> <td>98.2</td> <td>58.8</td> <td>96.1</td> <td>96.4</td> <td>97.6</td> <td>88.4</td> <td>71.1</td> <td>71.4</td> <td>14.1</td> <td>84.8</td> <td>8.3</td> <td>45.3</td> <td>56.1</td> <td>53.8</td> <td>52.7</td> <td>73.2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> </tr> <tr> <td>R200x2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> <td>59.5</td> <td>96.5</td> <td>96.8</td> <td>97.9</td> <td>88.8</td> <td>74.4</td> <td>70.3</td> <td>16.4</td> <td>84.0</td> <td>16.4</td> <td>84.0</td> <td>77.7</td> <td>47.7</td> <td>56.9</td> <td>53.9</td> <td>53.8</td> <td>69.1</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> </tr> <tr> <td rowspan="2" colspan="2">MoCo</td> <td>v1</td> <td>77.2</td> <td>93.4</td> <td>76.3</td> <td>39.6</td> <td>60.2</td> <td>48.3</td> <td>82.6</td> <td>75.1</td> <td>84.4</td> <td>89.9</td> <td>90.7</td> <td>98.4</td> <td>58.3</td> <td>95.7</td> <td>97.2</td> <td>85.4</td> <td>75.7</td> <td>71.1</td> <td>12.6</td> <td>85.7</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> <td>47.8</td> </tr> <tr> <td>v2</td> <td>82.9</td> <td>96.6</td> <td>82.9</td> <td>60.2</td> <td>66.0</td> <td>54.3</td> <td>85.6</td> <td>76.6</td> <td>91.8</td> <td>94.6</td> <td>97.4</td> <td>99.2</td> <td>62.6</td> <td>97.1</td> <td>98.0</td> <td>86.5</td> <td>76.2</td> <td>72.2</td> <td>14.2</td> <td>86.0</td> <td>81.2</td> <td>53.4</td> <td>53.8</td> <td>56.9</td> <td>53.9</td> </tr> <tr> <td 
colspan="2">VirTex</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> <td>60.6</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> </tr> <tr> <td rowspan="3" colspan="2">ResNet</td> <td>50</td> <td>72.2</td> <td>91.8</td> <td>74.6</td> <td>58.2</td> <td>60.9</td> <td>53.3</td> <td>83.5</td> <td>71.9</td> <td>92.1</td> <td>91.1</td> <td>98.3</td> <td>56.3</td> <td>96.3</td> <td>97.3</td> <td>84.7</td> <td>69.8</td> <td>65.8</td> <td>12.3</td> <td>82.8</td> <td>72.1</td> <td>47.2</td> <td>48.0</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>101</td> <td>74.0</td> <td>92.8</td> <td>77.2</td> <td>59.8</td> <td>62.8</td> <td>54.5</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.6</td> <td>90.4</td> <td>98.6</td> <td>57.8</td> <td>96.8</td> <td>97.8</td> <td>85.0</td> <td>70.7</td> <td>68.5</td> <td>13.0</td> <td>82.0</td> <td>73.0</td> <td>48.5</td> <td>48.9</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>152</td> <td>74.6</td> <td>93.5</td> <td>79.0</td> <td>60.8</td> <td>63.6</td> <td>55.4</td> <td>84.9</td> <td>74.4</td> <td>92.9</td> <td>91.7</td> <td>90.8</td> <td>98.7</td> <td>58.2</td> <td>97.0</td> <td>97.9</td> <td>85.4</td> <td>71.3</td> <td>69.6</td> <td>13.6</td> <td>82.5</td> <td>73.9</td> <td>49.3</td> <td>49.7</td> <td>54.0</td> <td>55.1</td> </tr> </tbody> </table></div> The following figure (Figure 20 from the original paper) plots the linear probe performance for each of the 27 datasets, using the data from Table 10. ![Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10.](/files/papers/6956a9a55411c3e2652eae93/images/19.jpg) *该图像是图表,展示了27个数据集的线性探针性能。各条曲线代表不同的模型,横轴为GFLOPs/image,纵轴为准确率。图中的星标和颜色区分了不同的方法,如CLIP和ResNet等。* Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10. The following are the results from Table 11 of the original paper: Zero-shot performance of CLIP models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2" colspan="2"></td> <th colspan="25">Zero-Shot Performance</th> </tr> <tr> <td colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td rowspan="5" colspan="2">CLIP-ResNet</td> <td>RN50</td> <td>75.6</td> <td>81.1</td> <td>41.6</td> <td>32.6</td> <td>59.6</td> <td>55.8</td> <td>19.3</td> <td>82.1</td> <td>41.7</td> <td>85.4</td> <td>82.1</td> <td>65.9</td> <td>66.6</td> <td>42.2</td> <td>94.3</td> <td>41.1</td> <td>54.2</td> <td>35.2</td> <td>42.2</td> <td>16.1</td> <td>57.6</td> <td>63.6</td> <td>43.5</td> <td>20.3</td> <td>59.7</td> <td>56.9</td> </tr> <tr> <td>RN101</td> <td>81.0</td> <td>83.9</td> <td>49.0</td> <td>37.2</td> <td>59.9</td> <td>62.3</td> <td>19.5</td> <td>82.4</td> <td>43.9</td> <td>86.2</td> <td>85.1</td> <td>65.7</td> <td>59.3</td> <td>45.6</td> <td>96.7</td> <td>33.1</td> <td>58.5</td> <td>38.3</td> <td>33.3</td> <td>16.9</td> <td>55.2</td> <td>62.2</td> <td>46.7</td> <td>28.1</td> <td>61.1</td> <td>64.2</td> </tr> <tr> <td>RN50x4</td> <td>86.8</td> <td>79.2</td> <td>48.9</td> <td>41.6</td> <td>62.7</td> <td>67.9</td> <td>24.6</td> <td>83.0</td> <td>49.3</td> <td>88.1</td> <td>86.0</td> <td>68.0</td> <td>75.2</td> <td>51.1</td> <td>96.4</td> <td>35.0</td> <td>59.2</td> <td>35.7</td> <td>26.0</td> <td>20.2</td> <td>57.5</td> <td>65.5</td> <td>49.0</td> <td>17.0</td> <td>58.3</td> <td>66.6</td> </tr> <tr> <td>RN50x16</td> <td>90.5</td> <td>82.2</td> <td>54.2</td> <td>45.9</td> <td>65.0</td> <td>72.3</td> <td>30.3</td> <td>82.9</td> <td>52.8</td> <td>89.7</td> <td>87.6</td> <td>71.9</td> <td>80.0</td> <td>56.0</td> <td>97.8</td> <td>40.3</td> <td>64.4</td> <td>39.6</td> <td>33.9</td> <td>24.0</td> <td>62.5</td> <td>68.7</td> <td>53.4</td> <td>17.6</td> <td>91.8</td> <td>86.8</td> </tr> <tr> <td>RN50x64</td> <td>91.8</td> <td>86.8</td> <td>61.3</td> <td>48.9</td> <td>66.9</td> <td>76.0</td> <td>35.6</td> <td>83.8</td> <td>53.4</td> <td>93.4</td> <td>90.6</td> <td>77.3</td> <td>90.8</td> <td>61.0</td> <td>98.3</td> <td>59.4</td> <td>69.7</td> <td>47.9</td> <td>33.2</td> <td>29.6</td> <td>65.0</td> <td>74.1</td> <td>56.8</td> <td>27.5</td> <td>62.1</td> <td>70.7</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>83.1</td> <td>44.5</td> <td>87.0</td> <td>87.9</td> <td>66.7</td> <td>51.9</td> <td>47.3</td> <td>97.2</td> <td>49.4</td> <td>60.3</td> <td>32.2</td> <td>39.4</td> <td>17.8</td> <td>58.4</td> <td>64.5</td> <td>47.8</td> <td>24.8</td> <td>57.6</td> <td>59.6</td> <td>63.2</td> <td>84.4</td> <td>91.3</td> <td>65.1</td> <td>37.8</td> <td>63.2</td> <td>59.4</td> </tr> <tr> <td>ViT-B/16</td> <td>89.2</td> <td>91.6</td> <td>68.7</td> <td>39.1</td> <td>65.2</td> <td>65.6</td> <td>27.1</td> <td>83.9</td> <td>46.0</td> <td>88.9</td> <td>89.3</td> <td>70.4</td> <td>56.0</td> <td>52.7</td> <td>98.2</td> <td>54.1</td> <td>65.5</td> <td>43.3</td> <td>44.0</td> <td>23.3</td> <td>48.1</td> <td>69.8</td> <td>52.4</td> <td>23.4</td> <td>61.7</td> <td>59.8</td> </tr> <tr> <td>ViT-L/14</td> 
<td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div> The following figure (Figure 22 from the original paper) shows CLIP's zero-shot performance compared to linear-probe ResNet performance. ![Figure 22. CLIP's zero-shot performance compared to linear-probe ResNet performance](/files/papers/6956a9a55411c3e2652eae93/images/21.jpg) *该图像是图表,展示了CLIP模型在27个数据集上的零-shot性能与线性探测ResNet的性能对比。图中可见不同数据集的准确率变化情况,CLIP模型在多个任务中显示出了优越的性能。* Figure 22. CLIP's zero-shot performance compared to linear-probe ResNet performance The following are the results from Table 16 of the original paper: Robustness performance on natural distribution shift datasets. <div class="table-wrapper"><table> <thead> <tr> <th rowspan="2"></th> <th rowspan="2">IN Top-1</th> <th rowspan="2">IN-V2 Top-1</th> <th rowspan="2">IN-A Top-1</th> <th rowspan="2">IN-R Top-1</th> <th rowspan="2">ObjectNet Top-1</th> <th rowspan="2">IN-Sketch Top-1</th> <th colspan="2">IN-Vid</th> <th colspan="2">YTBB</th> </tr> <tr> <th>PM0</th> <th>PM10</th> <th>PM0</th> <th>PM10</th> </tr> </thead> <tbody> <tr> <td>NS EfficientNet-L2a</td> <td>88.3</td> <td>80.2</td> <td>84.9</td> <td>74.7</td> <td>68.5</td> <td>47.6</td> <td>88.0</td> <td>82.1</td> <td>67.7</td> <td>63.5</td> </tr> <tr> <td>FixResNeXt101-32x48d V2b</td> <td>86.4</td> <td>78.0</td> <td>68.4</td> <td>80.0</td> <td>57.8</td> <td>59.1</td> <td>85.8</td> <td>72.2</td> <td>68.9</td> <td>57.7</td> </tr> <tr> <td>Linear Probe CLIP</td> <td>85.4</td> <td>75.9</td> <td>75.3</td> <td>84.2</td> <td>66.2</td> <td>57.4</td> <td>89.1</td> <td>77.2</td> <td>68.7</td> <td>63.1</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>76.2</td> <td>70.1</td> <td>77.2</td> <td>88.9</td> <td>72.3</td> <td>60.2</td> <td>95.3</td> <td>89.2</td> <td>95.2</td> <td>88.5</td> </tr> </tbody> </table></div> ### 6.3. Ablation Studies / Parameter Analysis **Dataset Ablation on YFCC100M:** (Already covered in Section 6.1.7). This study confirms that while the specific data blend (YFCC vs. WIT) can cause performance variations on fine-grained tasks (due to different distributions of concepts), the overall approach is robust across different reasonably filtered large image-text collections. The key advantage of `WIT` is its sheer scale. **Prompt Engineering and Ensembling:** (Already covered in Section 4.2.7 and Figure 4). This is a crucial "parameter analysis" showing that careful crafting of `natural language prompts` and combining multiple prompts (`ensembling`) significantly improves zero-shot performance (almost 5 points on ImageNet). This effectively leverages the `expressiveness of language` to guide the model. **Model Scaling:** (Already covered in Section 6.1.2 and Figure 9). 
The observation of `log-log linear scaling` of zero-shot error with model compute across both `ResNet` and `Vision Transformer` backbones is itself a powerful parameter analysis: increasing model size and compute yields predictable improvements in performance, underscoring the scalability of the CLIP approach. The specific hyperparameters for model architecture and size (Tables 19 and 20 in Methodology) are systematically varied to explore this scaling.

**Temperature Parameter ($\tau$):** The temperature $\tau$ is optimized directly during training as a log-parameterized multiplicative scalar rather than tuned as a hyperparameter. This design choice automates a critical parameter that scales the logits in the contrastive loss and therefore has a significant effect on training stability and performance.
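
To make the log-parameterized temperature concrete, here is a minimal PyTorch-style sketch of the symmetric contrastive loss, closely following the pseudocode in the paper. The names (`logit_scale`, `clip_loss`) are illustrative rather than the released implementation, and the paper additionally clips the scale so that logits are never multiplied by more than 100.

```python
import torch
import torch.nn.functional as F

# Learnable log-temperature, initialized to the equivalent of tau = 0.07;
# exponentiating keeps the multiplicative scale positive during training.
logit_scale = torch.nn.Parameter(torch.ones([]) * torch.log(torch.tensor(1 / 0.07)))

def clip_loss(image_features, text_features, logit_scale):
    # L2-normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities, scaled by the learned temperature.
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matching (image, text) pairs lie on the diagonal of the logit matrix.
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```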

**Resolution Scaling:** The ViT-L/14 model is additionally pre-trained at a higher 336-pixel input resolution for one extra epoch to boost performance (denoted ViT-L/14@336px). This FixRes-style step demonstrates that increasing input resolution can further enhance performance for large Vision Transformers, even with minimal additional training.
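
As a concrete companion to the prompt engineering and ensembling analysis above, the sketch below builds a zero-shot classifier by averaging the text embeddings of several prompt templates per class and then classifying images against those averaged embeddings. A CLIP-like `model` exposing `encode_image`/`encode_text` and a `tokenize` function are assumed (as in the released repository); the templates and helper names are examples, not the paper's full 80-template ImageNet set.

```python
import torch
import torch.nn.functional as F

# Example prompt templates; the paper ensembles many more per dataset.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]

def build_zero_shot_classifier(model, tokenize, class_names, device="cpu"):
    weights = []
    with torch.no_grad():
        for name in class_names:
            # Encode every prompt template filled with this class name.
            tokens = tokenize([t.format(name) for t in TEMPLATES]).to(device)
            emb = F.normalize(model.encode_text(tokens), dim=-1)
            # Average over templates, then re-normalize (prompt ensembling).
            weights.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(weights, dim=1)  # shape: (embed_dim, num_classes)

def zero_shot_predict(model, images, classifier):
    with torch.no_grad():
        feats = F.normalize(model.encode_image(images), dim=-1)
        return (feats @ classifier).argmax(dim=-1)  # predicted class indices
```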

## 7. Conclusion & Reflections

### 7.1. Conclusion Summary

This paper investigates whether the task-agnostic, web-scale pre-training that has transformed Natural Language Processing (NLP) transfers to computer vision. The authors demonstrate that Contrastive Language-Image Pre-training (CLIP), a simple yet scalable approach that predicts which caption goes with which image on a massive dataset of 400 million (image, text) pairs (WIT), enables visual models to learn a wide variety of tasks during pre-training. This task learning can then be leveraged through natural language prompting to achieve zero-shot transfer to numerous existing computer vision datasets. At sufficient scale, CLIP is competitive with task-specific supervised models and exhibits significantly higher robustness to natural distribution shift. The work also highlights the emergence of predictable scaling laws in multimodal learning and underscores the significant social implications of such flexible and powerful models.

### 7.2. Limitations & Future Work

The authors are transparent about several limitations of CLIP and suggest future research directions:

  • Performance Gap to SOTA: While zero-shot CLIP is competitive with a simple ResNet-50 linear-classifier baseline, it still falls well short of the overall state of the art on many datasets, which is typically set by task-specific, fully supervised models. The authors estimate that roughly a 1000x increase in compute would be required for zero-shot CLIP to reach overall state-of-the-art performance, which is currently infeasible, so future work needs to focus on improving the computational and data efficiency of the approach.
  • Weak Performance on Specific Tasks: CLIP's zero-shot performance remains weak on several kinds of tasks:
    • Fine-grained classification: Differentiating car models, flower species, aircraft variants.
    • Abstract/systematic tasks: Counting objects in an image.
    • Novel tasks: Tasks unlikely to be represented in the pre-training dataset, such as classifying the distance to the nearest car in a photo.
  • Brittle Generalization to Truly Out-of-Distribution (OOD) Data: Despite strong robustness to natural distribution shifts, CLIP can still fail catastrophically on truly out-of-distribution data. The example of MNIST (handwritten digits), where CLIP's OCR performance is poor, illustrates that training on a large, varied dataset does not guarantee robustness to all unseen domains. This suggests CLIP circumvents the brittle generalization problem rather than addressing its underlying cause.
  • Limited Output Flexibility: CLIP is currently limited to choosing from predefined concepts in a given zero-shot classifier. It cannot generate novel outputs like an image captioning model. Future work could explore joint training of contrastive and generative objectives or search over natural language explanations at inference time to combine efficiency with flexibility.
  • Data Inefficiency: CLIP does not address the poor data efficiency of deep learning; it compensates for it with web-scale supervision. Training on 400 million pairs for 32 epochs means seeing 12.8 billion images, which would take 405 years to view at a rate of one image per second. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction for improving data efficiency.
  • Methodological Limitations:
    • Validation Set Usage: Repeated querying of full validation sets during development, while standard, is unrealistic for true zero-shot scenarios.
    • Evaluation Dataset Selection: The main 27-dataset evaluation suite, while diverse, was haphazardly assembled and co-adapted with CLIP's development. A new benchmark explicitly designed for broad zero-shot transfer is needed.
  • Social Biases: CLIP, trained on unfiltered internet text-image pairs, learns many social biases, similar to image caption models (Bhargava & Forsyth, 2019). The Broader Impacts section details denigration harms and gendered associations in classification. Future work is needed for broader, more contextual, and robust bias testing and mitigation strategies.
  • Zero-shot vs. Few-shot Discrepancy: CLIP's current few-shot setup (fitting a linear classifier on top of its features, as sketched after this list) shows a counter-intuitive drop in performance relative to zero-shot when only one or two examples per class are available, unlike humans, who gain substantially from zero-shot to one-shot. Developing methods that combine CLIP's strong zero-shot performance with efficient few-shot learning is an important direction.
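
For context, the linear-probe protocol behind this few-shot comparison is straightforward: freeze the CLIP image encoder, extract features, and fit a logistic regression classifier on top. Below is a minimal sketch assuming a CLIP-like `model` with an `encode_image` method and standard PyTorch data loaders; the regularization strength shown is illustrative, whereas the paper tunes it per dataset.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Linear-probe / few-shot evaluation sketch: a logistic regression classifier
# trained on frozen CLIP image features. C=1.0 is an illustrative
# regularization strength, not the paper's tuned per-dataset value.

def extract_features(model, loader, device="cpu"):
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))
            feats.append(f.float().cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe_accuracy(model, train_loader, test_loader):
    x_train, y_train = extract_features(model, train_loader)
    x_test, y_test = extract_features(model, test_loader)
    clf = LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs")
    clf.fit(x_train, y_train)
    return clf.score(x_test, y_test)  # top-1 accuracy on the held-out set
```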

### 7.3. Personal Insights & Critique

CLIP represents a significant paradigm shift in computer vision, mirroring the pre-training revolution seen in NLP. The core insight that natural language can serve as a vast, weakly supervised signal for learning transferable visual representations is incredibly powerful. The most inspiring aspect is the demonstration of zero-shot transfer at a competitive level, effectively allowing users to "program" a visual classifier with natural language. This democratizes access to powerful computer vision capabilities, as it removes the immense barrier of collecting and labeling task-specific datasets.

The methods and conclusions of CLIP can definitely be transferred to other domains. The idea of multimodal contrastive learning from web-scale paired data is applicable wherever such data exists (e.g., video-text pairs, audio-text pairs, 3D model-text pairs). This opens avenues for more general perceptual AI systems that understand the world not just visually, but also through sound, motion, and other sensory inputs, all grounded in language.

However, the paper also raises crucial ethical considerations, particularly regarding social biases and surveillance. The bias probes revealed that CLIP learns and can perpetuate harmful stereotypes (e.g., disproportionately classifying "Black" images into "non-human" categories, or "male" images into "crime-related" categories). The finding that class design (e.g., adding a "child" category) can drastically alter bias manifestation is a critical insight, highlighting that AI developers wield immense power in shaping model behavior and impact. This calls for greater awareness and responsibility in defining classes and setting decision thresholds.

A potential issue or unverified assumption is that simply increasing compute and data scale will continue to solve all generality and robustness problems. While scaling laws show predictable improvements, the MNIST example and performance on abstract tasks (like counting) suggest there are fundamental limitations to this brute-force approach. Truly out-of-distribution generalization and higher-level systematic reasoning might require architectural innovations beyond just scaling Transformers and ResNets. The observation that supervised fine-tuning reduces effective robustness is also a critical point for future research: does task-specific fine-tuning inherently lead models to rely on spurious correlations regardless of pre-training, or can more robust fine-tuning strategies be developed?

My personal critique is that while the paper acknowledges limitations and societal impacts, the emphasis is still heavily on performance metrics. The community needs to evolve evaluation metrics to explicitly include measures of fairness, robustness to unseen biases, and interpretability during development, not just as an afterthought. The "omni-use" nature of CLIP, making it easy to create "bespoke, niche surveillance use cases," demands a more proactive regulatory and ethical framework from researchers and policymakers. The paper's call for community exploration to characterize capabilities and biases is vital, but the responsibility to lead on ethical development often falls on the creators of such powerful foundational models.
