Learning Transferable Visual Models From Natural Language Supervision
TL;DR Summary
This study presents a method for learning transferable visual models from natural language supervision, showing that pretraining on 400 million image-text pairs enables zero-shot transfer across various tasks, rivaling fully supervised models in performance.
Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Learning Transferable Visual Models From Natural Language Supervision."
1.2. Authors
The authors are Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. All authors are affiliated with OpenAI.
OpenAI is a prominent artificial intelligence research laboratory based in San Francisco, California. It is widely recognized for its contributions to deep learning, particularly in the fields of natural language processing (NLP) with models like GPT (Generative Pre-trained Transformer) and computer vision. OpenAI has a strong reputation for pushing the boundaries of AI research, often publishing influential papers and releasing open-source models that significantly impact the broader AI community.
1.3. Journal/Conference
This paper was first released on 2021-02-26 (UTC). The provided information doesn't specify a formal conference or journal name; the date and the original source link indicate it was released as a preprint on arXiv, a popular open-access archive for scientific papers. Papers on arXiv are not peer-reviewed before publication but are widely read and cited within the research community, often preceding formal publication in top-tier conferences (e.g., NeurIPS, ICML, CVPR, ICLR) or journals. Given OpenAI's track record, this paper likely received significant attention upon its preprint release.
1.4. Publication Year
2021 (the preprint was first released on February 26, 2021).
1.5. Abstract
State-of-the-art computer vision systems traditionally rely on training with a fixed set of predefined object categories, which limits their generality and requires extensive re-labeling for new visual concepts. This paper proposes a promising alternative: learning directly from raw text about images, a much broader source of supervision. The authors demonstrate that a simple pre-training task—predicting which caption corresponds to which image—is an efficient and scalable method to learn state-of-the-art image representations. They achieve this by training a model, called CLIP (Contrastive Language-Image Pre-training), from scratch on a massive dataset of 400 million (image, text) pairs collected from the internet.
After pre-training, CLIP uses natural language to reference learned visual concepts or describe new ones, enabling zero-shot transfer to downstream tasks without additional labeled data. The authors benchmark CLIP's performance across over 30 diverse computer vision datasets, covering tasks like optical character recognition (OCR), action recognition in videos, geo-localization, and fine-grained object classification. The model demonstrates non-trivial transfer performance on most tasks, often rivaling fully supervised baselines. For instance, CLIP matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without utilizing its 1.28 million training examples. The authors also release their code and pre-trained model weights.
1.6. Original Source Link
https://arxiv.org/abs/2103.00020 This paper is an arXiv preprint, indicating it is publicly accessible and widely shared within the scientific community.
1.7. PDF Link
https://arxiv.org/pdf/2103.00020v1.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses a fundamental limitation in conventional computer vision (CV) systems: their reliance on fixed sets of predetermined object categories. Traditionally, these systems are trained on meticulously hand-labeled datasets (like ImageNet, which contains 1000 object classes), requiring significant human effort to define and annotate classes. This approach creates several critical problems:
- Limited Generality and Usability: If a user wants to detect a visual concept not included in the original 1000 categories, the model typically needs to be fine-tuned (further trained) on a new, labeled dataset specific to that concept. This process is costly, time-consuming, and limits the system's flexibility in real-world applications where new concepts constantly emerge.
- Scalability Bottleneck: Creating large-scale, high-quality, crowd-labeled datasets for every conceivable visual concept is practically impossible. The traditional paradigm struggles to scale to the vast and ever-growing diversity of visual information.
- Brittleness to Distribution Shift: Models trained on specific distributions (e.g., ImageNet photos) often perform poorly when encountering natural variations in data (e.g., different styles of images, backgrounds, or contexts), a phenomenon known as distribution shift.

The paper draws inspiration from the revolution in Natural Language Processing (NLP), where pre-training methods that learn directly from raw, uncurated web text (e.g., with autoregressive or masked language modeling objectives) have led to highly transferable and task-agnostic models like GPT-3. These NLP models can perform a wide array of downstream tasks with little to no dataset-specific training (i.e., zero-shot or few-shot learning), demonstrating that web-scale collections of text contain a richer source of supervision than traditional crowd-labeled datasets.
The paper's entry point and innovative idea are to apply this successful NLP paradigm to computer vision. Specifically, it asks: "Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?" The authors propose to leverage the implicit supervision contained in natural language accompanying images on the internet to train visual models that are more general, flexible, and robust, without the need for explicit gold labels for every visual concept.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reports key findings that significantly advance the field of computer vision:
- Introduction of CLIP (Contrastive Language-Image Pre-training): The paper proposes a novel and efficient pre-training method that uses a simple contrastive learning objective. Instead of trying to predict exact captions, CLIP learns to predict which text description matches which image within a batch. This approach creates a powerful multimodal embedding space where images and their corresponding text descriptions are brought closer together.
- Creation of a Large-Scale WebImageText (WIT) Dataset: To enable web-scale pre-training, the authors constructed a new dataset of 400 million (image, text) pairs by collecting publicly available data from the internet, a scale orders of magnitude larger than previous datasets used for similar tasks. This dataset is crucial for the success of CLIP, providing a broad source of natural language supervision.
- Demonstration of Strong Zero-Shot Transfer Capabilities: After pre-training, CLIP can perform zero-shot classification on a wide variety of downstream computer vision tasks without any additional training or fine-tuning on task-specific labeled data.
  - SOTA-Matching Performance: CLIP is often competitive with fully supervised baselines. Notably, it matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without using any of ImageNet's 1.28 million training examples.
  - Broad Task Coverage: CLIP's task-learning capabilities are benchmarked across over 30 diverse datasets, spanning tasks such as OCR (optical character recognition), action recognition in videos, geo-localization, and various types of fine-grained object classification. This broad evaluation highlights its generality.
- Improved Robustness to Natural Distribution Shift: The paper finds that zero-shot CLIP models are significantly more robust to natural distribution shifts (i.e., performance on new test sets with differing image characteristics) than supervised ImageNet models of equivalent accuracy. This suggests that zero-shot evaluation better reflects a model's true capabilities and is less susceptible to spurious correlations learned from specific training distributions.
- Discovery of Scaling Laws: The authors observe that CLIP's transfer performance is a smoothly predictable function of compute, similar to observations in large language models like GPT. This log-log linear scaling trend suggests a clear path for future performance improvements by simply increasing compute and model capacity.
- Analysis of Societal Impacts: The paper includes a dedicated section discussing the broader impacts of such a powerful and flexible model, including social biases (e.g., in race, gender, and age classification, and crime-related associations) and its potential use in surveillance. It emphasizes the importance of class design and thresholding in mitigating biases.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the CLIP paper, a beginner should be familiar with several core concepts from machine learning, particularly deep learning, computer vision, and natural language processing.
- Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data with multiple levels of abstraction.
  - Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers, processing information through weights and biases.
  - Weights and Biases: Parameters within a neural network that are adjusted during training to learn patterns in the data. Weights determine the strength of connections between neurons, and biases provide an offset to the activation of a neuron.
  - Activation Functions: Mathematical functions that introduce non-linearity into the network, allowing it to learn complex patterns. Examples include ReLU (Rectified Linear Unit), sigmoid, and softmax.
  - Loss Function (Objective Function): A function that quantifies the difference between the predicted output of a model and the true target values. The goal of training is to minimize this loss.
  - Optimizer: An algorithm (e.g., Stochastic Gradient Descent (SGD), Adam) used to adjust the model's weights and biases to minimize the loss function.
  - Backpropagation: An algorithm used to efficiently compute the gradients of the loss function with respect to the network's weights, enabling their iterative adjustment.
  - Pre-training: An initial training phase where a model is trained on a large, general dataset with a broad objective (e.g., predicting the next word, distinguishing image-text pairs). The learned representations (features) are then transferred to downstream tasks.
  - Fine-tuning: A subsequent training phase where a pre-trained model's weights are slightly adjusted using a smaller, task-specific dataset and objective to adapt it to a particular downstream task.
  - Representation Learning: The process of automatically discovering good representations of raw data for specific tasks. A good representation captures the essential information from the data in a useful format, often making downstream tasks easier.
- Computer Vision (CV): A field of artificial intelligence that enables computers to "see" and interpret visual data from the world.
  - Image Classification: The task of assigning a label or category to an entire input image (e.g., "cat," "dog," "car").
  - Object Detection: Identifying and localizing objects within an image by drawing bounding boxes around them and assigning a class label to each.
  - Semantic Segmentation: Assigning a class label to every pixel in an image, effectively partitioning the image into meaningful regions (e.g., "road," "sky," "person").
  - Convolutional Neural Networks (CNNs): A class of deep neural networks specifically designed for processing grid-like data such as images. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
    - Convolutional Layer: A core building block of CNNs that applies a set of learnable filters (kernels) to the input image, producing feature maps that highlight specific patterns (edges, textures, etc.).
    - Pooling Layer: Reduces the spatial dimensions of the feature maps, helping to make the learned features more robust to minor variations in input and reducing computational cost.
    - ResNet (Residual Network): A type of CNN architecture (He et al., 2016a) that addresses the problem of vanishing gradients in very deep networks by introducing skip connections or residual connections. These connections allow gradients to flow directly through layers, enabling the training of much deeper models.
  - Vision Transformer (ViT): A recent architecture (Dosovitskiy et al., 2020) that applies the Transformer architecture (originally for NLP) directly to images. It works by splitting an image into fixed-size patches, linearly embedding them, adding positional embeddings, and feeding the resulting sequence of vectors to a standard Transformer encoder (a toy sketch of this patch embedding follows below).
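To make the patch-embedding step concrete, here is a rough NumPy sketch of splitting an image into patches, flattening and linearly embedding them, and adding positional embeddings. The function name, toy sizes, and random weights are illustrative assumptions, not CLIP's or ViT's actual configuration.

```python
import numpy as np

def patchify_and_embed(image, patch_size, W_embed, pos_embed):
    """Split an image into non-overlapping patches, flatten and linearly embed them."""
    h, w, c = image.shape
    patches = (image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch_size * patch_size * c))  # [num_patches, p*p*c]
    return patches @ W_embed + pos_embed                        # [num_patches, d_model]

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
patch_size, d_model = 32, 64
num_patches = (224 // patch_size) ** 2                          # 7 * 7 = 49 patches
W_embed = rng.normal(size=(patch_size * patch_size * 3, d_model))
pos_embed = rng.normal(size=(num_patches, d_model))
print(patchify_and_embed(image, patch_size, W_embed, pos_embed).shape)  # (49, 64)
```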
- Natural Language Processing (NLP): A field of AI concerned with enabling computers to understand, interpret, and generate human language.
  - Language Models: Statistical models that learn the probability distribution of sequences of words in a language. They can predict the next word in a sentence or fill in masked words.
  - Embeddings (Word, Sentence, Text): Numerical vector representations of words, sentences, or entire text snippets. Words with similar meanings have similar embeddings. Byte Pair Encoding (BPE) is a subword tokenization algorithm used to handle rare and out-of-vocabulary words by breaking them into common subword units.
  - Transformer: A neural network architecture (Vaswani et al., 2017) that relies heavily on the self-attention mechanism to process sequences of data. It has revolutionized NLP due to its ability to capture long-range dependencies in text.
    - Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence when processing each element. For each token in a sequence, it computes query (Q), key (K), and value (V) vectors. The attention score for a token is calculated by taking the dot product of its query with all keys, scaling, and applying a softmax function; the resulting weights are then applied to the values (a small sketch follows this list). The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ is the matrix of queries, $K$ is the matrix of keys, $V$ is the matrix of values, and $d_k$ is the dimension of the keys, used for scaling.
    - Multi-Head Attention: Extends self-attention by running the attention mechanism multiple times in parallel with different learned linear projections of Q, K, V. This allows the model to focus on different parts of the input from various perspectives.
  - GPT (Generative Pre-trained Transformer): A family of autoregressive language models (Radford et al., 2018, 2019; Brown et al., 2020) developed by OpenAI. They are pre-trained on vast amounts of text data to predict the next token in a sequence and have shown impressive zero-shot and few-shot learning capabilities across many NLP tasks.
  - BERT (Bidirectional Encoder Representations from Transformers): A masked language model (Devlin et al., 2018) that learns bidirectional contextual embeddings by predicting masked words in a sentence and also predicting whether two sentences follow each other.
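The following is a small NumPy sketch of the scaled dot-product attention formula above, for a single head; the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # [n_q, n_k] query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                  # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))                             # 4 queries of dimension d_k = 8
K = rng.normal(size=(6, 8))                             # 6 keys
V = rng.normal(size=(6, 8))                             # 6 values
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 8)
```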
- Supervision Paradigms:
  - Supervised Learning: Training models on datasets where each input example is explicitly paired with a correct output label (e.g., image and its object category).
  - Weakly Supervised Learning: Training with noisy, imprecise, or incomplete labels, often obtained automatically or at a low cost (e.g., image paired with hashtags, rather than precise object bounding boxes).
  - Self-Supervised Learning: A form of unsupervised learning where the data itself provides the supervision. Models learn by solving a pretext task (e.g., predicting missing parts of an input, rotating an image) that doesn't require human-labeled data, thereby learning useful representations.
  - Unsupervised Learning: Learning patterns from unlabeled data, often to discover hidden structures or generate new data (e.g., clustering, dimensionality reduction).
- Multimodal Learning: Combining information from multiple modalities (e.g., vision and language) to achieve a more comprehensive understanding of data.
  - Multimodal Embedding Space: A shared vector space where representations from different modalities (e.g., image embeddings and text embeddings) can be compared and related. In such a space, semantically similar items, regardless of modality, are embedded close to each other.
  - Cosine Similarity: A measure of similarity between two non-zero vectors in an inner product space. It measures the cosine of the angle between them. A cosine similarity of 1 means identical direction (most similar), 0 means orthogonal (no similarity), and -1 means opposite direction (most dissimilar). The formula for cosine similarity between two vectors $A$ and $B$ is: $ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $ where $A_i$ and $B_i$ are the components of vectors $A$ and $B$, and $\|A\|$ and $\|B\|$ are their magnitudes.
  - Contrastive Learning: A self-supervised learning approach where models learn by contrasting positive pairs (similar samples) with negative pairs (dissimilar samples). The goal is to bring representations of positive pairs closer in the embedding space while pushing negative pairs apart.
    - InfoNCE Loss (Noise-Contrastive Estimation): A commonly used contrastive loss function (Oord et al., 2018) that aims to maximize the mutual information between different views of the same data point. It typically involves comparing a positive pair (e.g., an image and its caption) against a set of negative samples. Given a query embedding $q$ and a set of key embeddings $\{k_0, k_1, \ldots, k_N\}$ where $k_0$ is the positive key and $k_1, \ldots, k_N$ are negative keys, the InfoNCE loss is: $ L_q = -\log \frac{\exp(\mathrm{sim}(q, k_0) / \tau)}{\sum_{i=0}^{N} \exp(\mathrm{sim}(q, k_i) / \tau)} $ where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity) and $\tau$ is a temperature parameter that scales the logits before the softmax (a small sketch follows this list).
    - Multi-class N-pair Loss: An extension of metric learning (Sohn, 2016) that uses $N-1$ negative samples for each anchor in a batch of $N$ samples. This loss is similar in spirit to InfoNCE and is the direct inspiration for CLIP's objective.
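As a minimal illustration of the cosine similarity and InfoNCE loss defined above, here is a NumPy sketch for a single query with one positive and several negative keys; the function names, temperature value, and random data are assumptions for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(q, keys, tau=0.07):
    """keys[0] is the positive key; keys[1:] are negatives."""
    sims = np.array([cosine_similarity(q, k) for k in keys]) / tau
    sims -= sims.max()                                   # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=16)                                  # query embedding
keys = rng.normal(size=(9, 16))                          # 1 positive + 8 negative keys
print(info_nce(q, keys))
```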
- Zero-Shot Learning (ZSL): The ability of a model to recognize objects or perform tasks that it has never encountered during training, typically by leveraging side information (e.g., textual descriptions of classes).
- Few-Shot Learning (FSL): The ability of a model to learn new tasks or recognize new classes from a very small number of examples (e.g., 1-shot, 5-shot).
3.2. Previous Works
The paper contextualizes its work by tracing relevant prior research across NLP, computer vision, and multimodal learning.
- NLP Pre-training Revolution:
  - The paper explicitly credits the revolution in NLP, citing foundational work like Word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and more recent contextualized word embeddings and language models. ELMo (Peters et al., 2018), ULMFiT (Howard & Ruder, 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and T5 (Raffel et al., 2019) are highlighted as systems that scaled task-agnostic objectives (like autoregressive and masked language modeling) across orders of magnitude in compute, model capacity, and data. A text-to-text interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) enabled these architectures to zero-shot transfer without specialized output heads. GPT-3 (Brown et al., 2020) is noted as a flagship system competitive with bespoke models while using little to no dataset-specific training data. This demonstrated that web-scale text supervision can surpass high-quality crowd-labeled datasets in NLP.
- Early Multimodal Vision-Language Learning:
  - The idea of learning visual models from text is not new, dating back over 20 years.
  - Mori et al. (1999) explored improving content-based image retrieval by predicting nouns and adjectives from paired text.
  - Quattoni et al. (2007) demonstrated more data-efficient image representations by learning classifiers for words in image captions.
  - Srivastava & Salakhutdinov (2012) used multimodal Deep Boltzmann Machines on low-level image and text features.
  - Joulin et al. (2016) modernized this by training CNNs to predict bag-of-words from YFCC100M metadata, showing transfer performance similar to ImageNet pre-training. This work is a direct conceptual precursor to CLIP's bag-of-words baseline.
  - Li et al. (2017) extended this to phrase n-grams and demonstrated zero-shot transfer by scoring target classes based on learned visual n-grams. CLIP directly compares against Li et al. (2017) on ImageNet, highlighting its significant performance leap (from 11.5% to 76.2% top-1 accuracy).
- Recent Text-Supervised Visual Representation Learning:
  - More recent works adopting modern architectures and objectives:
    - VirTex (Desai & Johnson, 2020): Used transformer-based language modeling to learn image representations from textual annotations.
    - ICMLM (Bulent Sariyildiz et al., 2020): Employed masked language modeling for image representation learning.
    - ConVIRT (Zhang et al., 2020): Adapted contrastive objectives for (text, image) representation learning, especially in medical imaging. CLIP is described as a "simplified version of ConVIRT trained from scratch."
- Weakly Supervised Pre-training for Computer Vision:
  - While text-based supervision was rare due to low benchmark performance, narrowly scoped weak supervision has shown gains.
  - Mahajan et al. (2018) used ImageNet-related hashtags on 3.5 billion Instagram images, improving ImageNet accuracy by over 5%.
  - Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) demonstrated large gains by pre-training on the noisily labeled JFT-300M dataset, predicting its large predefined label set. These approaches are noted for carefully limiting supervision to 1000 and 18291 classes, respectively, unlike natural language's broader scope.
- Contrastive Representation Learning:
  - Tian et al. (2019) found contrastive objectives learn better representations than equivalent predictive objectives for images.
  - Chen et al. (2020a) showed generative models require significantly more compute than contrastive models for similar performance. These findings directly influenced CLIP's choice of a contrastive objective for efficiency.
  - Sohn (2016) introduced the multi-class N-pair loss, and Oord et al. (2018) popularized the InfoNCE loss, both key inspirations for CLIP's batch construction technique and objective.
3.3. Technological Evolution
The evolution of visual models from natural language supervision can be broadly understood as a shift from limited, task-specific, hand-labeled datasets to vast, general, web-scale, implicitly supervised data.
- Early Image Retrieval (Pre-Deep Learning): Initial efforts (e.g., Mori et al., 1999) focused on content-based image retrieval by linking images to associated text, often using simpler textual features (nouns, adjectives) and statistical models. These were proofs-of-concept for multimodal understanding.
- Bridging Vision and Language with Machine Learning (2000s-early 2010s): Researchers explored learning visual representations by predicting words in captions (Quattoni et al., 2007) or using multimodal graphical models like Deep Boltzmann Machines (Srivastava & Salakhutdinov, 2012). These methods laid theoretical groundwork but were limited by computational power and data scale.
- Deep Learning Era - Initial Multimodal Applications (2010s): With the rise of deep learning and CNNs, the field saw modernization. Joulin et al. (2016) showed that CNNs trained on bag-of-words from web metadata could learn useful representations, competing with ImageNet pre-training. Li et al. (2017) further demonstrated zero-shot transfer using n-grams, but performance was still too low for practical application.
- Specialized Weak Supervision (Mid-Late 2010s): While general text-to-vision supervision remained challenging, highly focused weak supervision (e.g., Instagram hashtags by Mahajan et al., 2018) proved effective for improving ImageNet performance. These methods, however, still relied on predetermined class taxonomies and softmax classifiers, limiting flexibility.
- NLP-Inspired Architectures and Objectives (Late 2010s-Early 2020s): The success of Transformers and pre-training in NLP inspired analogous efforts in vision. VirTex, ICMLM, and ConVIRT started integrating Transformer-based language models and contrastive objectives for learning image representations from text, but often on relatively smaller datasets.
- CLIP's Breakthrough (2021): CLIP represents a culmination of these trends. It scales the contrastive learning objective and Transformer architectures to an unprecedented web-scale dataset (400M image-text pairs). This massive scale, combined with an efficient contrastive objective, enables CLIP to achieve state-of-the-art zero-shot transfer performance, bridging the gap between flexible natural language supervision and practical computer vision applications. It replicates the NLP paradigm's success in achieving task-agnostic and transferable models.
3.4. Differentiation Analysis
Compared to the main methods in related work, CLIP introduces several core differences and innovations:
- Scale of Natural Language Supervision:
  - Prior Work (e.g., VirTex, ICMLM, ConVIRT): While these works explored similar ideas, they typically trained on much smaller datasets (e.g., MS-COCO, Visual Genome, YFCC100M, ranging from hundreds of thousands to tens of millions of images), and work that used YFCC100M applied only limited filtering.
  - CLIP's Innovation: CLIP is trained on an unprecedented 400 million (image, text) pairs in its custom WIT (WebImageText) dataset. This scale, leveraging the vastness of the internet, is a direct differentiator and a key enabler of its performance, similar to how large datasets drove progress in NLP.
- Efficiency of Pre-training Objective:
  - Generative/Predictive Baselines (e.g., VirTex, early CLIP attempts): Many prior approaches, including CLIP's own initial attempts, aimed to predict the exact words of an image caption. The paper demonstrates this is a computationally inefficient task (a transformer-based language model learns 3x slower than a bag-of-words baseline, Figure 2).
  - CLIP's Innovation: CLIP adopts a contrastive learning objective. Instead of predicting specific words, it learns to predict which text snippet as a whole is paired with which image. This proxy task of identifying correct (image, text) pairs among incorrect ones (e.g., via an InfoNCE or multi-class N-pair loss) yields a 4x efficiency improvement over the bag-of-words prediction baseline for zero-shot ImageNet classification (Figure 2). This focus on efficiency was crucial for scaling.
- Zero-Shot Transfer Performance and Generality:
  - Prior Zero-Shot Approaches (e.g., Li et al., 2017 - Visual N-Grams): These methods showed the possibility of zero-shot transfer but achieved very low accuracy (e.g., 11.5% on ImageNet).
  - Weakly Supervised Models (e.g., Instagram-pretrained ResNeXt, JFT-300M models): These models achieved high performance but were task-specific (e.g., predicting 1000 or 18291 ImageNet-related classes) and lacked a mechanism for dynamic outputs or true open-set recognition. They still required static softmax classifiers.
  - CLIP's Innovation: CLIP achieves state-of-the-art zero-shot transfer performance, matching or even exceeding strong fully supervised baselines on many datasets (e.g., 76.2% on ImageNet zero-shot, matching ResNet-50). Its use of natural language prompts allows dynamic classifier creation for any described visual concept, offering unprecedented flexibility and generality for open-set recognition (Figure 1).
- Robustness to Distribution Shift:
  - ImageNet-trained Models: Previous studies (Taori et al., 2020) showed that ImageNet-trained models suffer significant performance drops on natural distribution shifts, suggesting they exploit spurious correlations within the ImageNet distribution.
  - CLIP's Innovation: CLIP demonstrates significantly higher effective robustness on natural distribution shift datasets. Zero-shot CLIP models reduce the gap between in-distribution and out-of-distribution accuracy by up to 75% (Figure 13). This is attributed to not being trained on a specific task distribution and to leveraging a very diverse pre-training dataset.
- Scaling Laws for Transfer Performance:
  - Prior Vision Models: While large-scale pre-training existed, clear scaling laws for zero-shot transfer performance in vision, similar to those observed in NLP (Kaplan et al., 2020), were not as extensively documented.
  - CLIP's Innovation: CLIP explicitly demonstrates a smooth log-log linear scaling trend for zero-shot error rate as a function of model compute across various model sizes (Figure 9). This predictability guides future research towards larger, more capable models.

In essence, CLIP differentiates itself by successfully bringing the web-scale, task-agnostic pre-training paradigm from NLP to computer vision, enabled by a massive custom dataset and an efficient contrastive learning objective, resulting in highly flexible, performant, and robust zero-shot visual models.
4. Methodology
4.1. Principles
The core idea behind CLIP is to learn a highly transferable visual model by leveraging the abundant and diverse supervision contained in natural language paired with images on the internet. Instead of relying on traditional, fixed-category human-labeled datasets, CLIP aims to learn visual representations that are implicitly grounded in human language.
The theoretical basis and intuition are as follows:
- Natural Language as a Rich Supervision Signal: Natural language is incredibly expressive, capable of describing a vast, open-ended set of visual concepts (objects, actions, attributes, scenes, emotions, etc.). By learning from text associated with images, a model can acquire a much broader understanding of the visual world than from a limited set of pre-defined labels. This mirrors the success of large language models trained on web-scale text.
- Shared Embedding Space for Multimodal Understanding: The goal is to learn a joint multimodal embedding space where images and their corresponding text descriptions are mapped to nearby points. This means that an image of a "dog" and the text "a photo of a dog" will have similar vector representations, while an image of a "cat" and the text "a photo of a dog" will have dissimilar representations.
- Contrastive Learning for Efficiency and Effectiveness: Instead of trying to generate exact captions (which is computationally intensive and difficult due to the variability of language), CLIP simplifies the task. It frames it as a contrastive prediction problem: given a batch of image-text pairs, can the model identify which image matches which text from all possible pairings within that batch? This proxy task is much more efficient to train and has been shown to be highly effective at learning powerful representations in self-supervised learning.
- Zero-Shot Transfer through Language Prompting: Once trained, this shared embedding space allows for zero-shot transfer. To classify an image into a set of categories (e.g., "cat," "dog," "bird"), the model can:
  - Compute the embedding of the input image.
  - Compute the embeddings of the text descriptions for each category (e.g., "a photo of a cat," "a photo of a dog," "a photo of a bird").
  - Predict the category whose text embedding is most similar (e.g., highest cosine similarity) to the image embedding. This effectively allows natural language to "program" the visual classifier on the fly, without needing any labeled training examples for the new categories.
4.2. Core Methodology In-depth (Layer by Layer)
The CLIP methodology involves several key components: Natural Language Supervision, Dataset Creation, Efficient Pre-Training Method Selection, Model Architectures, Model Scaling, and Training Details.
4.2.1. Natural Language Supervision
CLIP's fundamental premise is to learn visual representations from natural language supervision. This means using descriptive text associated with images as the training signal, rather than discrete, human-annotated class labels. The authors highlight its strengths:
- Scalability: It's easier to scale natural language supervision because it doesn't require annotations in a machine-learning-compatible format (like 1-of-N majority-vote gold labels). Instead, it can passively learn from the vast amount of text on the internet.
- Flexibility and Zero-Shot Transfer: It doesn't "just" learn a representation but also connects that representation to language, enabling flexible zero-shot transfer by simply using language to describe new visual concepts.
4.2.2. Creating a Sufficiently Large Dataset (WIT)
Existing datasets like MS-COCO (approx. 100,000 images) and Visual Genome (approx. 100,000 images) are too small for web-scale pre-training. YFCC100M (100 million images) has sparse and inconsistent metadata. After filtering for natural language titles/descriptions, YFCC100M shrinks to 15 million images.
To address this, the authors constructed a new dataset called WIT (WebImageText).
- Scale: It comprises 400 million (image, text) pairs collected from publicly available internet sources.
- Diversity: To cover a broad set of visual concepts, the collection process searched for (image, text) pairs whose text included one of 500,000 queries.
- Balancing: The results were approximately class-balanced by including up to 20,000 (image, text) pairs per query (a toy illustration of this balancing rule follows this list).
- Size comparison: The total word count of WIT is similar to the WebText dataset used to train GPT-2.
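The balancing rule can be illustrated with a small sketch; this is an assumption about how such a cap could be applied, not the authors' actual data-collection pipeline.

```python
from collections import defaultdict

def balance_pairs(pairs, max_per_query=20_000):
    """Keep at most max_per_query (image, text) pairs per search query."""
    kept, counts = [], defaultdict(int)
    for query, image_url, text in pairs:
        if counts[query] < max_per_query:
            counts[query] += 1
            kept.append((image_url, text))
    return kept

sample = [("dog", f"img_{i}.jpg", "a photo of a dog") for i in range(5)]
print(len(balance_pairs(sample, max_per_query=3)))  # 3
```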
4.2.3. Selecting an Efficient Pre-Training Method
Given the immense computational requirements of training large computer vision models, training efficiency was paramount.
- Initial Approach (Generative/Predictive): The authors first tried an approach similar to VirTex, jointly training an image CNN and a text transformer from scratch to predict the caption of an image.
  - Observation: This method proved inefficient. As shown in Figure 2, a 63 million parameter transformer language model (already using twice the compute of a ResNet-50 image encoder) learned to recognize ImageNet classes three times slower than a simpler bag-of-words (BoW) encoding baseline. This highlighted the difficulty of predicting exact words due to the wide variety of text associated with images.

The figure plots zero-shot ImageNet classification accuracy against the number of training images processed for three approaches: "Bag of Words Contrastive (CLIP)", "Bag of Words Prediction", and "Transformer Language Model". The contrastive CLIP curve is roughly 4x more data-efficient than the bag-of-words prediction baseline, which in turn is about 3x more efficient than the transformer language model.

Figure 2. CLIP is much more efficient at zero-shot transfer than our image caption baseline. Although highly expressive, we found that transformer-based language models are relatively weak at zero-shot ImageNet classification. Here, we see that it learns 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP further improves efficiency another 4x.

- Contrastive Objective: Inspired by findings that contrastive objectives can learn better and more computationally efficient representations (Tian et al., 2019; Chen et al., 2020a), the authors shifted to a contrastive learning approach.
  - Goal: Predict which text as a whole is paired with which image, rather than the exact words.
  - Observation: Swapping the predictive objective for a contrastive objective in the bag-of-words encoding baseline resulted in a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet (Figure 2).
4.2.4. CLIP Training Objective
CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred, given a batch of N real (image, text) pairs.
The core of the CLIP implementation can be visualized with the following pseudocode (Figure 3 from the original paper):
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits.T, labels, axis=0) # equivalent to the paper's cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
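To make the pseudocode above concrete, here is a minimal, runnable NumPy version of the symmetric contrastive loss. The encoders are replaced by random feature matrices, and the helper names and toy dimensions are assumptions rather than the released implementation.

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)                  # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(I_f, T_f, W_i, W_t, t):
    I_e = l2_normalize(I_f @ W_i)                            # [n, d_e] image embeddings
    T_e = l2_normalize(T_f @ W_t)                            # [n, d_e] text embeddings
    logits = (I_e @ T_e.T) * np.exp(t)                       # [n, n] scaled cosine similarities
    idx = np.arange(logits.shape[0])                         # correct pairs lie on the diagonal
    loss_i = -log_softmax(logits, axis=1)[idx, idx].mean()   # image -> text direction
    loss_t = -log_softmax(logits, axis=0)[idx, idx].mean()   # text -> image direction
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 32, 24, 16
I_f, T_f = rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t))
W_i, W_t = rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e))
print(clip_loss(I_f, T_f, W_i, W_t, t=np.log(1 / 0.07)))     # temperature initialized at 0.07
```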
Let's break down the process step-by-step:

1. Input:
   - I[n, h, w, c]: A minibatch of n aligned images, each with height h, width w, and c channels.
   - T[n, l]: A minibatch of n aligned texts (captions).
   - image_encoder: The neural network responsible for processing images. It can be a ResNet or a Vision Transformer.
   - text_encoder: The neural network responsible for processing text. It can be a CBOW (Continuous Bag-of-Words) model or a Text Transformer.

2. Feature Extraction:
   - I_f = image_encoder(I): The image_encoder processes the batch of images to extract visual feature representations. I_f has shape [n, d_i], where d_i is the dimension of the image features.
   - T_f = text_encoder(T): The text_encoder processes the batch of texts to extract textual feature representations. T_f has shape [n, d_t], where d_t is the dimension of the text features.

3. Joint Multimodal Embedding:
   - W_i[d_i, d_e]: A learned linear projection matrix that maps image features to a shared multimodal embedding space of dimension d_e.
   - W_t[d_t, d_e]: A learned linear projection matrix that maps text features to the same multimodal embedding space.
   - I_e = l2_normalize(np.dot(I_f, W_i), axis=1): The image features are linearly projected into the multimodal embedding space using W_i and then L2-normalized along the feature dimension (axis=1). I_e has shape [n, d_e].
   - T_e = l2_normalize(np.dot(T_f, W_t), axis=1): Similarly, text features are linearly projected using W_t and then L2-normalized. T_e also has shape [n, d_e].
   - L2-Normalization: Ensures that all embeddings lie on a unit hypersphere, making dot products equivalent to cosine similarity. This is crucial for the contrastive objective.

4. Scaled Pairwise Cosine Similarities (Logits):
   - t: A learned temperature parameter, optimized directly during training as a log-parameterized multiplicative scalar. It controls the range of the logits in the softmax, preventing training instability.
   - logits = np.dot(I_e, T_e.T) * np.exp(t): The dot product of I_e and the transpose of T_e computes the pairwise cosine similarities between all image embeddings and all text embeddings in the batch, producing an [n, n] matrix. The element at [i, j] represents the similarity between image i and text j. This matrix is then scaled by exp(t).
   - The diagonal elements of this matrix (logits[k, k]) correspond to the similarities of the correct (image, text) pairs.
   - The off-diagonal elements (logits[i, j] where i ≠ j) correspond to the similarities of the incorrect (image, text) pairings (negative samples).

5. Symmetric Loss Function:
   - labels = np.arange(n): An array [0, 1, ..., n-1] is created. This serves as the target indices for the cross-entropy loss, indicating that image[i] should match text[i].
   - loss_i = cross_entropy_loss(logits, labels, axis=0): The cross-entropy loss from the perspective of the images. For each image i, the model tries to correctly identify its corresponding text among all n texts in the batch.
   - loss_t = cross_entropy_loss(logits.T, labels, axis=0): The cross-entropy loss from the perspective of the texts. For each text j, the model tries to correctly identify its corresponding image among all n images. (The paper's pseudocode expresses this as cross_entropy_loss(logits, labels, axis=1) on the untransposed logits, which is equivalent.)
   - loss = (loss_i + loss_t) / 2: The final loss is the symmetric average of the image-side and text-side losses. This ensures that both modalities learn to project into the shared space effectively.

This batch construction technique and objective were first introduced as the multi-class N-pair loss (Sohn, 2016) and popularized as the InfoNCE loss (Oord et al., 2018) for contrastive representation learning. They were recently adapted for (text, image) learning in medical imaging by Zhang et al. (2020), which CLIP builds upon.

Simplifications Compared to ConVIRT (Zhang et al., 2020):
- Training from scratch: CLIP models are trained without ImageNet weights for the image encoder or pre-trained weights for the text encoder.
- Linear Projection: Only a linear projection maps from each encoder's representation to the multimodal embedding space, unlike ConVIRT, which uses a non-linear projection.
- No Text Transformation: The text transformation function that samples a single sentence uniformly from the text is removed, as many (image, text) pairs in WIT are single sentences.
- Simple Image Augmentation: A random square crop from resized images is the only data augmentation used during training.
- Learned Temperature: The temperature parameter τ is directly optimized as a log-parameterized multiplicative scalar, avoiding manual hyperparameter tuning.

4.2.5. Choosing and Scaling a Model

CLIP utilizes two main architectures for the image encoder and a Transformer for the text encoder.

- Image Encoder Architectures:
  1. ResNet-50 (He et al., 2016a) variants:
     - Base Architecture: ResNet-50, chosen for its widespread adoption.
     - Modifications: the ResNetD improvements (He et al., 2019); antialiased rect-2 blur pooling (Zhang, 2019); and an attention pooling mechanism that replaces the global average pooling layer. This mechanism is a single layer of transformer-style multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image.
  2. Vision Transformer (ViT) (Dosovitskiy et al., 2020) variants:
     - Base Architecture: Closely follows the original ViT implementation.
     - Minor Modification: An additional layer normalization is applied to the combined patch and position embeddings before the transformer blocks, and a slightly different initialization scheme is used.
- Text Encoder Architecture:
  - Base Architecture: A Transformer (Vaswani et al., 2017) with the architectural modifications described in Radford et al. (2019) (similar to GPT-2).
  - Size: A 63M-parameter, 12-layer, 512-wide model with 8 attention heads.
  - Tokenization: Operates on a lower-cased Byte Pair Encoding (BPE) representation of the text with a 49,152 vocabulary size (Sennrich et al., 2015).
  - Sequence Length: Max sequence length capped at 76 for computational efficiency.
  - Feature Representation: The text sequence is bracketed with [SOS] (start of sentence) and [EOS] (end of sentence) tokens. The activations of the highest layer of the transformer at the [EOS] token are used as the feature representation of the text. This representation is then layer normalized and linearly projected into the multimodal embedding space.
  - Masked Self-Attention: Used in the text encoder to preserve the ability to initialize with a pre-trained language model or to add language modeling as an auxiliary objective (though this is left for future work).
- Model Scaling Strategy:
  - ResNet Image Encoders: Adapts the approach of Tan & Le (2019) (EfficientNet) by equally allocating additional compute to increasing the width, depth, and resolution of the model, which is found to outperform scaling only one dimension.
  - Text Encoder: Only the width of the model is scaled, proportionally to the calculated increase in width of the ResNet. The depth of the text encoder is not scaled, as CLIP's performance was less sensitive to its capacity.

4.2.6. Training

A series of CLIP models were trained:
- ResNet Models: 5 models: ResNet-50, ResNet-101, and three EfficientNet-style scaled models (denoted RN50x4, RN50x16, and RN50x64) using approximately 4x, 16x, and 64x the compute of a ResNet-50.
- Vision Transformer (ViT) Models: 3 models: ViT-B/32, ViT-B/16, and ViT-L/14.
- ViT-L/14@336px: The ViT-L/14 model was additionally pre-trained at a higher 336 pixel resolution for one additional epoch to boost performance, similar to FixRes (Touvron et al., 2019). This model is the best performing and is used as "CLIP" by default in the paper.
**Common Training Hyperparameters (Table 18):**

| Hyperparameter | Value |
|---|---|
| Batch size | 32768 |
| Vocabulary size | 49408 |
| Training epochs | 32 |
| Maximum temperature | 100.0 |
| Weight decay | 0.2 |
| Warm-up iterations | 2000 |
| Adam β1 | 0.9 |
| Adam β2 | 0.999 (ResNet), 0.98 (ViT) |
| Adam ε | 10⁻⁸ (ResNet), 10⁻⁶ (ViT) |

**Specific Hyperparameters for CLIP-ResNet (Table 19):**

| Model | Learning rate | Embedding dimension | Input resolution | ResNet blocks | ResNet width | Text layers | Text width | Text heads |
|---|---|---|---|---|---|---|---|---|
| RN50 | 5 × 10⁻⁴ | 1024 | 224 | (3, 4, 6, 3) | 2048 | 12 | 512 | 8 |
| RN101 | 5 × 10⁻⁴ | 512 | 224 | (3, 4, 23, 3) | 2048 | 12 | 512 | 8 |
| RN50x4 | 5 × 10⁻⁴ | 640 | 288 | (4, 6, 10, 6) | 2560 | 12 | 640 | 10 |
| RN50x16 | 4 × 10⁻⁴ | 768 | 384 | (6, 8, 18, 8) | 3072 | 12 | 768 | 12 |
| RN50x64 | 3.6 × 10⁻⁴ | 1024 | 448 | (3, 15, 36, 10) | 4096 | 12 | 1024 | 16 |

**Specific Hyperparameters for CLIP-ViT (Table 20):**

| Model | Learning rate | Embedding dimension | Input resolution | ViT layers | ViT width | ViT heads | Text layers | Text width | Text heads |
|---|---|---|---|---|---|---|---|---|---|
| ViT-B/32 | 5 × 10⁻⁴ | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-B/16 | 5 × 10⁻⁴ | 512 | 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| ViT-L/14 | 4 × 10⁻⁴ | 768 | 224 | 24 | 1024 | 16 | 12 | 768 | 12 |
| ViT-L/14-336px | 2 × 10⁻⁵ | 768 | 336 | 24 | 1024 | 16 | 12 | 768 | 12 |

**Other Training Details:**
- Optimizer: Adam optimizer (Kingma & Ba, 2014).
- Regularization: Decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights except gains or biases.
- Learning Rate Schedule: Cosine decay schedule (Loshchilov & Hutter, 2016).
- Hyperparameter Tuning: Initial hyperparameters were set via grid searches, random search, and manual tuning on the baseline ResNet-50 trained for 1 epoch, then heuristically adapted for larger models due to computational constraints.
- Temperature Parameter Initialization: τ was initialized to the equivalent of 0.07 from Wu et al. (2018), and the logit scale was clipped to prevent scaling the logits by more than 100, to avoid training instability.
- Memory Optimization: Mixed-precision training (Micikevicius et al., 2017), gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The embedding similarity calculation was sharded across GPUs.
- Compute: The largest ResNet model (RN50x64) took 18 days on 592 V100 GPUs. The largest Vision Transformer (ViT-L/14) took 12 days on 256 V100 GPUs.

4.2.7. Using CLIP for Zero-Shot Transfer

After pre-training, CLIP uses its learned multimodal embedding space for zero-shot classification (a minimal sketch follows the list below):
1. Class Name Processing: For a given downstream dataset, the names of all classes are used as the set of potential text pairings.
2. Embedding Generation: The input image is passed through the image encoder to obtain its feature embedding, and each class name (e.g., "cat," "dog") is passed through the text encoder to obtain its feature embedding.
3. Similarity Calculation: The cosine similarity between the image embedding and each of the class text embeddings is calculated.
4. Probability Distribution: These similarities are scaled by the learned temperature parameter τ and then normalized into a probability distribution via a softmax function.
5. Prediction: The class with the highest probability (i.e., the most probable (image, text) pair) is predicted as the image's label.

Interpretation: The paper interprets this as the image encoder acting as the computer vision backbone and the text encoder functioning as a hypernetwork (Ha et al., 2016) that generates the weights of a linear classifier based on the text descriptions of visual concepts. Each step of CLIP pre-training is seen as optimizing a proxy to a computer vision dataset with 1 example per class and 32,768 total classes defined by natural language.
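A minimal sketch of this zero-shot procedure is shown below; encode_text stands in for CLIP's text encoder (here just a random placeholder), and the prompt template and temperature value are illustrative assumptions rather than the released model's API.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

def zero_shot_classify(image_emb, class_names, encode_text, temperature=100.0):
    text_embs = np.stack([encode_text(f"A photo of a {c}.") for c in class_names])
    logits = temperature * (text_embs @ image_emb)         # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                    # softmax over candidate classes
    return class_names[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
encode_text = lambda s: l2_normalize(rng.normal(size=64))   # placeholder text encoder
image_emb = l2_normalize(rng.normal(size=64))               # placeholder image embedding
label, probs = zero_shot_classify(image_emb, ["cat", "dog", "bird"], encode_text)
print(label, probs.round(3))
```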
Prompt Engineering and Ensembling: To improve zero-shot performance, especially when class names are ambiguous or lack context:
- Prompt Engineering: Context templates are used, such as "A photo of a {label}." This bridges the distribution gap between single-word labels and the full sentences typically seen during pre-training. Specific prompts are customized per task (e.g., "A photo of a {label}, a type of pet." for Oxford-IIIT Pets).
- Ensembling: Multiple zero-shot classifiers are created using different context prompts (e.g., "A photo of a big {label}", "A photo of a small {label}"). The ensemble is constructed over the embedding space (by averaging text embeddings), allowing for efficient caching and prediction; a sketch follows the figure caption below. On ImageNet, ensembling 80 different prompts improved accuracy by an additional 3.5%.

The following figure (Figure 4 from the original paper) illustrates the improvements from prompt engineering and ensembling: average zero-shot score across 36 datasets, with the prompt-engineered and ensembled classifier outperforming the contextless class-name baseline by nearly 5 points.
Figure 4. Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4 times more compute with the baseline zero-shot method but is "free" when amortized over many predictions.
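Below is a sketch of ensembling in embedding space as described above: text embeddings from several prompt templates are averaged (and re-normalized) into one classifier weight vector per class. The encode_text placeholder and template strings are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

def build_ensembled_classifier(class_names, templates, encode_text):
    weights = []
    for c in class_names:
        embs = np.stack([encode_text(t.format(label=c)) for t in templates])
        weights.append(l2_normalize(embs.mean(axis=0)))    # average prompts, then re-normalize
    return np.stack(weights)                               # [num_classes, d]

templates = ["A photo of a {label}.", "A photo of a big {label}.", "A photo of a small {label}."]
rng = np.random.default_rng(0)
encode_text = lambda s: l2_normalize(rng.normal(size=64))  # placeholder text encoder
W = build_ensembled_classifier(["cat", "dog"], templates, encode_text)
print(W.shape)  # (2, 64)
```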
5. Experimental Setup
5.1. Datasets
The experiments in this paper use a broad suite of datasets to evaluate CLIP's zero-shot transfer and representation learning capabilities. The core evaluation suite comprises 27 datasets, including the 12 well-studied datasets from Kornblith et al. (2019) and 15 additional datasets to assess performance on a wider variety of distributions and tasks.
The following datasets, reproduced from Table 9 of the original paper, are used, with details on their characteristics and purpose:
| Dataset | Classes | Train size | Test size | Evaluation metric |
|---|---|---|---|---|
| Food-101 | 102 | 75,750 | 25,250 | accuracy |
| CIFAR-10 | 10 | 50,000 | 10,000 | accuracy |
| CIFAR-100 | 100 | 50,000 | 10,000 | accuracy |
| Birdsnap | 500 | 42,283 | 2,149 | accuracy |
| SUN397 | 397 | 19,850 | 19,850 | accuracy |
| Stanford Cars | 196 | 8,144 | 8,041 | accuracy |
| FGVC Aircraft | 100 | 6,667 | 3,333 | mean per class |
| Pascal VOC 2007 Classification | 20 | 5,011 | 4,952 | 11-point mAP |
| Describable Textures | 47 | 3,760 | 1,880 | accuracy |
| Oxford-IIIT Pets | 37 | 3,680 | 3,669 | mean per class |
| Caltech-101 | 102 | 3,060 | 6,085 | mean-per-class |
| Oxford Flowers 102 | 102 | 2,040 | 6,149 | mean per class |
| MNIST | 10 | 60,000 | 10,000 | accuracy |
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1000 | 8000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |
| KITTI | 4 | 6,770 | 711 | accuracy |
| Country211 | 211 | 43,200 | 21,100 | accuracy |
| PatchCamelyon | 2 | 294,912 | 32,768 | accuracy |
| UCF101 | 101 | 9,537 | 1,794 | accuracy |
| Kinetics700 | 700 | 494,801 | 31,669 | mean(top1, top5) |
| CLEVR Counts | 8 | 2,000 | 500 | accuracy |
| Hateful Memes | 2 | 8,500 | 500 | ROC AUC |
| Rendered SST2 | 2 | 7,792 | 1,821 | accuracy |
| ImageNet | 1000 | 1,281,167 | 50,000 | accuracy |
Specific Dataset Notes:
- **Video Datasets (UCF101, Kinetics700):** For these datasets, the `middle frame of each video clip` is used as the input image, effectively converting them into image classification tasks for this evaluation.
- **STL-10 and UCF101:** These datasets have multiple predefined train/validation/test splits, and the paper reports the average over all splits.
- **Country211:** A custom dataset created by the authors to assess `geolocation capability`. It filters `YFCC100M` for 211 countries with at least 300 GPS-tagged photos, sampling 200 for training and 100 for testing per country.
- **Rendered SST2:** A custom dataset created to measure `optical character recognition (OCR)` capability. Sentences from the `Stanford Sentiment Treebank (SST-2)` dataset are rendered into `448x448` pixel images (black text on a white background).

The following figure (Figure 19 from the original paper) shows two example images from the Rendered SST2 dataset.
*The image shows two rendered movie-review sentences: the first notes that Montias infuses his meticulous narrative with flexible energy and describes the characters he surrounds himself with, while the second says the filmmakers seem unsure of where the story should go and lack the skills to get it there.*
Figure 19. Two example images from the Rendered SST2 dataset
These datasets were chosen to provide a broad and diverse evaluation suite, encompassing various tasks (general object recognition, fine-grained classification, scene recognition, OCR, action recognition, geolocation) and different image characteristics (natural images, satellite images, medical images, rendered text, video frames). This diversity is crucial for validating the generality and transferability of CLIP's natural language supervision approach.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below:
- **Accuracy:**
  - Conceptual Definition: Accuracy is a common metric for classification tasks, representing the proportion of correctly predicted instances out of the total instances evaluated. It measures how often the model's prediction matches the true label.
  - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - `Number of Correct Predictions`: The count of instances where the model's predicted class matches the actual true class.
    - `Total Number of Predictions`: The total number of instances for which the model made a prediction.
- **Mean Per Class Accuracy:**
  - Conceptual Definition: This metric calculates the accuracy for each individual class and then averages these per-class accuracies. It is particularly useful when dealing with `imbalanced datasets` (where some classes have many more examples than others) because it gives equal weight to each class, preventing a model from achieving high overall accuracy by simply performing well on a majority class.
  - Mathematical Formula: $ \text{Mean Per Class Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \text{Accuracy}_i $
  - Symbol Explanation:
    - $C$: The total number of unique classes in the dataset.
    - $\text{Accuracy}_i$: The accuracy calculated specifically for class $i$, i.e., the number of correctly predicted instances of class $i$ divided by the total number of actual instances of class $i$.
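To make the difference from plain accuracy concrete, here is a minimal numpy sketch on toy labels (not from the paper):

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, num_classes):
    """Average of per-class accuracies: each class contributes equally."""
    accs = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():                                  # skip classes absent from y_true
            accs.append((y_pred[mask] == c).mean())
    return float(np.mean(accs))

# Imbalanced toy example: class 0 has 8 samples, class 1 has 2.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])      # always predicts class 0

overall  = (y_true == y_pred).mean()                    # 0.8 (flattered by the majority class)
balanced = mean_per_class_accuracy(y_true, y_pred, 2)   # (1.0 + 0.0) / 2 = 0.5
print(overall, balanced)
```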
- **11-point Mean Average Precision (11-point mAP):**
  - Conceptual Definition: 11-point mAP is a metric commonly used in `object detection` and `image retrieval` tasks, particularly in benchmarks like `Pascal VOC`. It is a generalization of `Average Precision (AP)`. AP measures the area under the `Precision-Recall curve`. For 11-point mAP, precision is sampled at 11 equally spaced recall levels (0, 0.1, ..., 1.0). The precision at each recall level $r$ is taken as the maximum precision over any recall $\tilde{r} \ge r$. The mean of these 11 precision values is the AP for a single class. mAP is then the mean of AP values across all classes.
  - Mathematical Formula: $ \text{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1.0\}} \text{P}_{\text{interp}}(r) $, where $ \text{P}_{\text{interp}}(r) = \max_{\tilde{r} \ge r} \text{P}(\tilde{r}) $, and $ \text{mAP} = \frac{1}{C} \sum_{i=1}^{C} \text{AP}_i $
  - Symbol Explanation:
    - $\text{AP}$: Average Precision for a single class.
    - $r$: Recall level (from 0 to 1.0 in 11 steps).
    - $\text{P}_{\text{interp}}(r)$: Interpolated precision at recall $r$, calculated as the maximum precision observed for any recall value greater than or equal to $r$.
    - $\text{P}(\tilde{r})$: Precision at recall $\tilde{r}$.
    - $\text{mAP}$: Mean Average Precision.
    - $C$: The total number of classes.
    - $\text{AP}_i$: Average Precision for class $i$.
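A minimal sketch of 11-point interpolated AP for a single class, following the formula above (toy scores and labels, not from the paper); mAP would then be the mean of such per-class APs:

```python
import numpy as np

def eleven_point_ap(scores, labels):
    """11-point interpolated AP for one class from per-example scores and 0/1 labels."""
    order = np.argsort(-scores)                  # rank predictions by descending confidence
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)

    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):          # recall levels 0, 0.1, ..., 1.0
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1,   0,   1,   1,   0,   0])  # ground truth for this class
print(round(eleven_point_ap(scores, labels), 3))
```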
- **ROC AUC (Receiver Operating Characteristic Area Under the Curve):**
  - Conceptual Definition: ROC AUC is a performance metric for `binary classification` problems, particularly useful when there is a class imbalance or when the costs of `false positives` and `false negatives` differ. The `ROC curve` plots the `True Positive Rate (TPR)` (Sensitivity) against the `False Positive Rate (FPR)` (1 - Specificity) at various `threshold settings`. The `AUC` (Area Under the Curve) represents the degree of separability between classes; a higher AUC means the model is better at distinguishing positive from negative instances. An AUC of 1.0 represents a perfect classifier, while 0.5 represents a random classifier.
  - Mathematical Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $, $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $, $ \text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\text{FPR} $
  - Symbol Explanation:
    - `TP`: True Positives (correctly predicted positive instances).
    - `FN`: False Negatives (actual positive instances incorrectly predicted as negative).
    - `FP`: False Positives (actual negative instances incorrectly predicted as positive).
    - `TN`: True Negatives (correctly predicted negative instances).
    - `TPR`: True Positive Rate, also known as `Sensitivity` or `Recall`.
    - `FPR`: False Positive Rate.
    - `AUC`: Area Under the ROC Curve; the integral represents the area under the curve formed by plotting TPR against FPR.
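In practice this metric is usually computed with an off-the-shelf implementation; a tiny sketch using `scikit-learn` on toy scores (not from the paper):

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and predicted scores (e.g., probability of the positive class).
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# An AUC of 1.0 means the positives are always ranked above the negatives; 0.5 is random.
print(roc_auc_score(y_true, y_score))
```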
- **mean(top1, top5):**
  - Conceptual Definition: This metric is specific to the `Kinetics-700` dataset, a large-scale video action recognition dataset. It represents the average of the `Top-1 accuracy` and the `Top-5 accuracy`. Top-1 accuracy is the standard accuracy where the model's single most confident prediction must be correct; Top-5 accuracy counts a prediction as correct if the true label is among the model's top 5 most confident predictions.
  - Mathematical Formula: $ \text{mean(top1, top5)} = \frac{\text{Top-1 Accuracy} + \text{Top-5 Accuracy}}{2} $
  - Symbol Explanation:
    - `Top-1 Accuracy`: Proportion of times the true label is the model's highest-scoring prediction.
    - `Top-5 Accuracy`: Proportion of times the true label is among the model's five highest-scoring predictions.
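A minimal sketch of this metric given a matrix of per-class scores (toy numbers, not the paper's evaluation code):

```python
import numpy as np

def topk_accuracy(scores, y_true, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([y in row for y, row in zip(y_true, topk)]))

def mean_top1_top5(scores, y_true):
    return 0.5 * (topk_accuracy(scores, y_true, 1) + topk_accuracy(scores, y_true, 5))

scores = np.random.default_rng(0).normal(size=(8, 700))  # 8 clips, 700 action classes
y_true = np.arange(8)                                     # toy ground-truth labels
print(mean_top1_top5(scores, y_true))
```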
- **R@K (Recall@K):**
  - Conceptual Definition: `Recall@K` is a metric used in `information retrieval` or `ranking tasks` (like image-text retrieval). It measures the proportion of queries for which the correct item (e.g., the true image for a text query, or the true text for an image query) is found within the top $K$ retrieved results. A higher R@K indicates better retrieval performance.
  - Mathematical Formula: $ \text{R@K} = \frac{\text{Number of queries where the true item is in the top K retrieved}}{\text{Total number of queries}} $
  - Symbol Explanation:
    - $K$: The number of top retrieved items to consider.
    - `true item`: The ground-truth item that corresponds to the query.
    - `top K retrieved`: The set of $K$ items ranked highest by the model for a given query.
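A minimal sketch of R@K, assuming a similarity matrix in which entry (i, j) scores query i against candidate j and the matching candidate shares the query's index (a toy setup, not the paper's evaluation code):

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j]: score of query i against candidate j; ground truth is j == i."""
    ranks = np.argsort(-similarity, axis=1)[:, :k]        # top-k candidates per query
    hits = [i in row for i, row in enumerate(ranks)]
    return float(np.mean(hits))

sim = np.random.default_rng(0).normal(size=(100, 100))    # 100 queries vs 100 candidates
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(sim, k):.2f}")
```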
- **mWAP (Mean Weighted Average Precision) and mWSAP (Mean Weighted Segment Average Precision):**
  - Conceptual Definition: These metrics are specific to `action recognition` datasets like `RareAct`, which involves detecting `unusual actions`. While the paper reports them, it does not provide detailed definitions or formulas within its text. In the context of action recognition, `Average Precision (AP)` is commonly used for detection tasks; `Weighted Average Precision` suggests a weighting scheme may be applied (potentially based on action rarity or duration), and `Segment Average Precision` implies evaluation over temporal segments in videos. Without further context from the original source introducing `RareAct`, precise formulas cannot be provided here, but they generally measure the quality of action detection or recognition in video sequences, potentially considering temporal localization or importance.
- **Geolocation Performance (within X km):**
  - Conceptual Definition: This metric is used for `geo-localization` tasks, where the goal is to predict the geographical coordinates (latitude and longitude) of an image. Performance is measured as the percentage of images for which the predicted location falls within a specified radius (e.g., 1 km, 25 km, 200 km) of the true location.
  - Mathematical Formula: Let $(\phi_{\text{true}}, \lambda_{\text{true}})$ be the true coordinates, $(\phi_{\text{pred}}, \lambda_{\text{pred}})$ the predicted coordinates, and $d(\text{true}, \text{pred})$ the Haversine distance between the two points. $ \text{Accuracy within X km} = \frac{\text{Number of images where } d(\text{true}, \text{pred}) \le \text{X km}}{\text{Total number of images}} \times 100\% $
  - Symbol Explanation:
    - $(\phi_{\text{true}}, \lambda_{\text{true}})$: True latitude and longitude of the image.
    - $(\phi_{\text{pred}}, \lambda_{\text{pred}})$: Predicted latitude and longitude of the image.
    - $d(\text{true}, \text{pred})$: The geographical distance (e.g., Haversine distance) between the true and predicted coordinates.
    - $\text{X}$: The specified radius in kilometers.
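A minimal sketch of the "within X km" accuracy using the Haversine distance (toy coordinates, not the paper's data; the distance function is the standard great-circle formula, assumed here for illustration):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def accuracy_within_km(true_coords, pred_coords, radius_km):
    d = np.array([haversine_km(t[0], t[1], p[0], p[1])
                  for t, p in zip(true_coords, pred_coords)])
    return float((d <= radius_km).mean()) * 100.0          # percentage of images

true_coords = [(48.8566, 2.3522), (40.7128, -74.0060)]     # Paris, New York
pred_coords = [(48.80, 2.40), (34.05, -118.24)]            # near Paris, Los Angeles
print(accuracy_within_km(true_coords, pred_coords, radius_km=25))  # 50.0
```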
5.3. Baselines
The paper compares CLIP's performance against a comprehensive set of existing models, covering various pre-training strategies and architectures, for both zero-shot transfer and linear-probe representation learning evaluations.
Baselines for Zero-Shot Transfer:
- **Visual N-Grams (Li et al., 2017):** This is the primary zero-shot baseline mentioned in the paper for the `ImageNet`, `aYahoo`, and `SUN` datasets. It learned a dictionary of visual n-grams from web data and used text n-gram representations of class names for prediction. This serves as the direct contextual reference for `generically pre-trained zero-shot models`.
Baselines for Linear-Probe Representation Learning (and some for zero-shot comparison context):
The evaluation suite includes 66 different models across 27 datasets to ensure a broad comparison. Key families of baselines are:
- **LM RN50 (Autoregressive Language Model with ResNet-50):**
  - A multimodal model using a `ResNet-50` image encoder and an `autoregressive loss` to predict text captions. It acts as a direct comparison to CLIP's `contrastive loss` and demonstrates the `efficiency gains` of contrastive learning.
- **EfficientNet (Tan & Le, 2019):**
  - A family of convolutional neural networks known for systematic `model scaling` (width, depth, and resolution).
  - Includes the `B0-B8` models from the original paper.
  - Also includes the `Noisy Student` variants (Xie et al., 2020), which use `self-training` with pseudo-labels and noise injection to achieve state-of-the-art performance on ImageNet. These are strong supervised baselines.
- **Instagram-pretrained ResNeXt (Mahajan et al., 2018):**
  - `ResNeXt-101` models (`32x8d`, `32x16d`, `32x32d`, `32x48d`) pre-trained on `3.5 billion Instagram images` using `hashtag prediction` (a form of `weak supervision`). These models demonstrated the power of `large-scale weakly supervised pre-training`.
  - Includes `FixRes` variants (Touvron et al., 2019) that use higher input resolutions.
- **Big Transfer (BiT) (Kolesnikov et al., 2019):**
  - `BiT-S` and `BiT-M` models (ResNet architectures) pre-trained on `ImageNet-1k` and `ImageNet-21k` (a larger ImageNet variant), respectively. These models are known for their strong `transfer learning` performance due to large-scale pre-training. `BiT-L` models (trained on JFT-300M) are mentioned as superior but are not publicly available.
- **Vision Transformer (ViT) (Dosovitskiy et al., 2020):**
  - `ViT-B/32`, `ViT-B/16`, `ViT-L/16`, and `ViT-H/14` models pre-trained on the `ImageNet-21k` dataset. These are crucial baselines for comparing CLIP's `Vision Transformer` variants, particularly in terms of `compute efficiency`. The best-performing ViT models (trained on JFT-300M) are also not publicly available.
- **Self-Supervised Learning Methods:**
  - `SimCLRv2` (Chen et al., 2020c): A `self-supervised learning` framework that uses `contrastive learning` to learn `visual representations` without human labels.
  - `BYOL (Bootstrap Your Own Latent)` (Grill et al., 2020): Another `self-supervised learning` method that avoids negative pairs, using two interacting neural networks to learn representations.
  - `Momentum Contrast (MoCo)` (He et al., 2020; Chen et al., 2020d): A `self-supervised learning` framework that uses a `momentum encoder` and a `large queue of negative samples` for `contrastive learning`.
- **VirTex (Desai & Johnson, 2020):**
  - A model that learns visual representations from textual annotations using `transformer-based language modeling`. It has a similar model design to CLIP's `autoregressive baseline` but is trained on a much smaller dataset (MSCOCO).
- **Standard ResNet Checkpoints (He et al., 2016b):**
  - Original `ResNet-50`, `ResNet-101`, and `ResNet-152` models trained on `ImageNet-1k`. These serve as fundamental supervised baselines representing widely adopted architectures.

These baselines are representative because they cover a spectrum of modern `computer vision pre-training` techniques: traditional `supervised ImageNet training`, `large-scale weakly supervised methods`, recent `self-supervised learning` paradigms, and the emerging `Vision Transformer` architectures. This comprehensive comparison allows the authors to position CLIP's `natural language supervision` approach against the current state-of-the-art across different axes such as `supervision type`, `model architecture`, and `scale`.
5.4. Linear-Probe Evaluation Setup
For linear-probe evaluation, the following standardized procedure is used:
1. **Feature Extraction:** Image features are taken from the `penultimate layer` of each model, ignoring any provided classification layer. For `CLIP-ViT` models, features are used before the linear projection to the embedding space (corresponding to `I_f` in the pseudocode).
2. **Classifier Training:** A `logistic regression classifier` is trained on these extracted features. `scikit-learn`'s `L-BFGS` implementation is used with a maximum of `1,000 iterations`.
3. **Hyperparameter Tuning:** The
`L2 regularization strength` $\lambda$ is determined using a `hyperparameter sweep` on the validation sets. The sweep covers a range from $10^{-6}$ to $10^6$ with 96 logarithmically spaced steps. A `parametric binary search` strategy is employed to efficiently find the optimal $\lambda$.
4. **Data Splits:** For datasets with predefined `validation and test splits`, the validation set is used for `hyperparameter search`. If no validation split or test labels are provided, the training dataset is split to create a validation set for tuning. For the final result, the validation split is combined back with the training split, and performance is reported on the unused test split.

This linear-probe approach is chosen for its simplicity, minimal hyperparameter tuning requirements, and its ability to highlight how well the `pre-trained representations` themselves capture useful information, rather than allowing `fine-tuning` to adapt representations to specific downstream tasks.

# 6. Results & Analysis

## 6.1. Core Results Analysis

The experimental results demonstrate CLIP's remarkable capabilities in `zero-shot transfer` and `representation learning`, often outperforming strong baselines and showing predictable scaling behaviors.

### 6.1.1. Initial Comparison to Visual N-Grams

The following are the results from Table 1 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <th></th> <th>aYahoo</th> <th>ImageNet</th> <th>SUN</th> </tr> </thead> <tbody> <tr> <td>Visual N-Grams</td> <td>72.4</td> <td>11.5</td> <td>23.0</td> </tr> <tr> <td>CLIP</td> <td>98.4</td> <td>76.2</td> <td>58.5</td> </tr> </tbody> </table></div>

This table compares CLIP's zero-shot accuracy against `Visual N-Grams` (Li et al., 2017) on three datasets. CLIP shows massive improvements:
* On `ImageNet`, CLIP boosts accuracy from a mere $11.5\%$ to $76.2\%$. This is a monumental leap, matching the performance of the original `ResNet-50` (a fully supervised model) *without using any of ImageNet's 1.28 million training examples*. CLIP also achieves a $95\%$ `Top-5 accuracy` on ImageNet, matching `Inception-V4`.
* On `aYahoo`, CLIP achieves $98.4\%$, representing a $95\%$ reduction in errors compared to `Visual N-Grams`.
* On `SUN`, CLIP more than doubles the accuracy from $23.0\%$ to $58.5\%$.

These results establish CLIP as a significant advancement towards practical and flexible `zero-shot computer vision classifiers`. While noting that many factors (larger dataset, more compute, `Transformer` architecture) contribute to CLIP's superior performance over `Visual N-Grams`, a controlled ablation confirms that a `CLIP ResNet-50` trained on the `YFCC100M` dataset (the same as `Visual N-Grams`) can match their reported ImageNet performance within one GPU day, even when trained from scratch.

### 6.1.2. Zero-Shot CLIP Performance Analysis

* **Competitiveness with Fully Supervised Baselines:** The following figure (Figure 5 from the original paper) shows this comparison across 27 datasets.

*The image is a bar chart comparing zero-shot CLIP against a linear classifier fitted on ResNet-50 features: zero-shot CLIP is better on 16 of the 27 evaluated datasets, with per-dataset score differences (Δ Score) ranging from +28.9 to -37.1.*

Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.
Zero-shot CLIP outperforms a `fully supervised, regularized logistic regression classifier` fitted on `ResNet-50 features` on 16 out of 27 datasets, including `ImageNet`.
* **Strong Performance:** On `fine-grained classification` tasks like `Stanford Cars` and `Food101`, zero-shot CLIP outperforms the baseline by over $20\%$. On `STL10`, it achieves $99.3\%$, a new state-of-the-art zero-shot result.
* **Action Recognition:** For `action recognition in videos` (`Kinetics700`, `UCF101`), CLIP significantly outperforms `ResNet-50` features by $14.5\%$ and $7.7\%$ respectively. This is attributed to `natural language` providing broader supervision for `verbs` compared to `ImageNet`'s `noun-centric` supervision.
* **Weak Performance:** CLIP struggles with `specialized, complex, or abstract tasks` such as `satellite image classification` (EuroSAT, RESISC45), `lymph node tumor detection` (PatchCamelyon), `object counting` (CLEVRCounts), and `self-driving related tasks` (GTSRB, KITTI Distance). This highlights areas for improvement.
* **Comparison to Few-Shot Linear Probes:** The following figure (Figure 6 from the original paper) compares zero-shot CLIP to few-shot logistic regression on the features of many image models.

*The image is a chart plotting the number of labeled training examples per class against average score for different models: zero-shot CLIP matches a 4-shot linear classifier and approaches the best 16-shot linear classifier; BiT-M and SimCLRv2 are highlighted, and gray lines represent the other evaluated models.*

Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.

Surprisingly, `zero-shot CLIP matches the average performance of a 4-shot linear classifier` trained on the *same feature space*. This suggests that the ability to "communicate" visual concepts directly via `natural language` in zero-shot is highly effective, potentially overcoming the ambiguity of `context-less example-based learning` in few-shot settings. CLIP also `roughly matches the best-performing 16-shot classifier` (a `BiT-M ResNet-152x2` trained on `ImageNet-21K`) in the evaluation suite.
* **Data Efficiency of Zero-Shot Transfer:** The following figure (Figure 7 from the original paper) shows the estimated number of labeled examples per class required for a linear classifier on the same CLIP feature space to match zero-shot CLIP's performance.

*The image is a bar chart showing, per dataset, how many labeled examples per class a linear classifier on the CLIP feature space needs to match the zero-shot classifier, ranging from 184 examples for FER2013 down to 0.9 for Flowers102, with a mean of 20.8 and a median of 5.4.*

Figure 7. The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely, from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.

The `effective data efficiency` of zero-shot transfer varies widely, from less than 1 labeled example per class to 184. Half of the datasets require fewer than 5 examples per class, with a median of 5.4.
On `ImageNet`, zero-shot CLIP matches the performance of a 16-shot linear classifier on the same feature space.
* **Correlation with Linear Probe Performance:** The following figure (Figure 8 from the original paper) compares CLIP's zero-shot performance with fully supervised linear classifiers across datasets.

*The image is a scatter plot relating zero-shot CLIP performance to linear-probe CLIP performance. The two are strongly correlated ($r = 0.82$); most points lie near the 45-degree line, but zero-shot performance is generally 10 to 25 points lower than linear-probe performance.*

Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance (${\le}3$ point difference).

There is a strong positive correlation ($r=0.82$, $p < 10^{-6}$) between zero-shot and fully supervised performance, indicating consistency in CLIP's `representation learning` and `task learning`. However, zero-shot performance generally `underperforms fully supervised classifiers by 10% to 25%`, suggesting significant room for improvement.
* **Scaling Laws for Zero-Shot Performance:** The following figure (Figure 9 from the original paper) plots the average error rate of 5 ResNet CLIP models across 39 evaluations on 36 different datasets as a function of model compute.

*The image is a chart of zero-shot CLIP performance at different compute budgets: the horizontal axis is model GFLOPs, the vertical axis is error rate (%), and error declines smoothly as compute increases.*

Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.

CLIP exhibits a `log-log linear scaling trend` for `average zero-shot error` across a `44x increase in model compute`, similar to observations in `neural language models`. This indicates predictable performance gains with larger models and more computational resources.

### 6.1.3. Representation Learning

The following figure (Figure 10 from the original paper) shows the relationship between linear probe average scores and forward-pass GFLOPs/image for various models.

*The image is a chart of linear-probe average scores versus forward-pass GFLOPs/image on Kornblith et al.'s 12-dataset suite and the broader 27-dataset suite; different models are marked with distinct symbols, and CLIP stands out in balancing efficiency and accuracy.*

The following are the results from Figure 10 of the original paper. The top graph shows the average score on the 12-dataset Kornblith et al. (2019) suite versus GFLOPs per image. The bottom graph shows the average score on the broader 27-dataset suite versus GFLOPs per image.

*The image is a scatter plot relating average linear-probe transfer score (%) to ImageNet score (%) for models such as CLIP-ViT and EfficientNet on Kornblith et al.'s 12-dataset suite and the broader suite; dashed lines indicate an ideal linear relationship, and different colors and marker shapes denote different models.*
* **Performance on Kornblith et al. (2019) 12-Dataset Suite:** Small CLIP models (RN50, RN101) outperform other ImageNet-1K trained ResNets but underperform ImageNet-21K trained ResNets (`BiT-M`) and `EfficientNet` models with similar compute. However, `CLIP scales very well`, and the largest `ResNet-50x64` slightly outperforms the `Noisy Student EfficientNet-L2` in both overall score and compute efficiency. `CLIP Vision Transformers` are about `3x more compute efficient` than `CLIP ResNets`. The best model, `ViT-L/14@336px`, outperforms the best existing model by an average of $2.6\%$.
* **Performance on Broader 27-Dataset Suite:** On the expanded suite (including `OCR`, `geo-localization`, `facial emotion recognition`, `action recognition`), CLIP's benefits become clearer.
  * All CLIP models, regardless of scale, outperform all evaluated systems in terms of `compute efficiency`.
  * The best model's average score improvement over previous systems increases from $2.6\%$ to $5\%$.
  * `Self-supervised systems` (`SimCLRv2`) also perform better on this broader suite, suggesting the value of increased `task diversity` in evaluation.
* **Per-Dataset Differences:** The following figure (Figure 11 from the original paper) visualizes per-dataset differences in performance between the best CLIP model (`ViT-L/14@336px`) and the `Noisy Student EfficientNet-L2`.

*The image is a chart of per-dataset differences (Δ Score %) between linear probes on CLIP features and on Noisy Student EfficientNet-L2 features: CLIP is better on most datasets, with especially large gains on SST2, Country211, and HatefulMemes.*

Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.

CLIP outperforms the `Noisy Student EfficientNet-L2` on 21 out of 27 datasets.
* **Significant Gains:** CLIP performs best on tasks requiring `OCR` (SST2, HatefulMemes), `geo-localization` and `scene recognition` (Country211, SUN397), and `activity recognition` (Kinetics700, UCF101). It also excels in `fine-grained car` and `traffic sign recognition` (Stanford Cars, GTSRB). This suggests `ImageNet's narrow supervision` (e.g., a single label for all traffic signs) might hurt performance on fine-grained tasks.
* **Underperformance:** CLIP still underperforms on ImageNet (the EfficientNet's training dataset) and on low-resolution datasets (CIFAR10, CIFAR100). It also does slightly worse on `PatchCamelyon` (lymph node tumor detection) and `CLEVRCounts` (object counting), where both approaches have low overall performance.

### 6.1.4. Robustness to Natural Distribution Shift

The following figure (Figure 13 from the original paper) compares the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts.

*The image is a chart comparing zero-shot CLIP with existing ImageNet models across multiple distribution-shift datasets, showing accuracy on each dataset along with the improvement over other methods.*
* **Zero-Shot CLIP's Enhanced Robustness:** `Zero-shot CLIP models` significantly improve `effective robustness`, reducing the gap between ImageNet accuracy and accuracy under `natural distribution shifts` by up to `75%`. This is a crucial finding, suggesting that models not trained on a specific task distribution are less susceptible to `spurious correlations`.
* **Supervised Adaptation Harms Robustness:** The following figure (Figure 14 from the original paper) visualizes how performance changes from the zero-shot classifier to a `supervised linear classifier` adapted to the ImageNet distribution.

*The image is a chart showing how adapting the classifier to ImageNet changes accuracy across several natural-distribution-shift datasets: supervised adaptation to ImageNet raises ImageNet accuracy by $9.2\%$ but slightly lowers average robustness.*

Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by $9.2\%$, it slightly reduces average robustness.

While adapting CLIP to the `ImageNet distribution` (via `L2 regularized logistic regression`) increases its ImageNet accuracy by $9.2\%$ (to $85.4\%$), `average accuracy under distribution shift slightly decreases`. This is a surprising result: a large gain in in-distribution accuracy does not translate to improved out-of-distribution robustness, implying that these gains largely come from exploiting `distribution-specific patterns`.
* **Role of Class Naming:** Using `custom zero-shot classifiers` for each dataset based on its `specific class names` (instead of pooling `ImageNet superclasses`) improves `average effective robustness` by $5\%$.
* **Few-Shot Robustness:** The following figure (Figure 15 from the original paper) visualizes the performance of 0-shot, 1-shot, ..., 128-shot, and fully supervised logistic regression classifiers on the best CLIP model's features.

*The image is a chart of CLIP's performance on 7 natural-distribution-shift datasets under different amounts of adaptation data: the horizontal axis is average accuracy on subsampled ImageNet classes, the vertical axis is average accuracy on the shift datasets, and markers trace 1-shot through 128-shot training, contrasting zero-shot and few-shot CLIP with existing models.*

Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaptation increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.

`Few-shot models` also show higher `effective robustness` than existing models, but this benefit `fades` as `in-distribution performance` increases with more training data. `Zero-shot CLIP` is `notably more robust` than a few-shot model with equivalent ImageNet performance. The conclusion is that `high effective robustness` results from `minimizing distribution-specific training data`.

### 6.1.5. Comparison to Human Performance

The following are the results from Table 2 of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <th></th> <th>Accuracy</th> <th>Majority Vote on Full Dataset</th> <th>Accuracy on Guesses</th> <th>Majority Vote Accuracy on Guesses</th> </tr> </thead> <tbody> <tr> <td>Zero-shot human</td> <td>53.7</td> <td>57.0</td> <td>69.7</td> <td>63.9</td> </tr> <tr> <td>Zero-shot CLIP</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> <td>93.5</td> </tr> <tr> <td>One-shot human</td> <td>75.7</td> <td>80.3</td> <td>78.5</td> <td>81.2</td> </tr> <tr> <td>Two-shot human</td> <td>75.7</td> <td>85.0</td> <td>79.2</td> <td>86.1</td> </tr> </tbody> </table></div>

On the `Oxford-IIIT Pets` dataset, `zero-shot CLIP` (93.5% accuracy) vastly outperforms `zero-shot humans` (53.7%).
Humans, however, show a `large leap in performance` from zero-shot to one-shot (53.7% to 75.7%), with minimal additional gain from two-shot. This suggests humans "know what they don't know" and quickly update priors from a single example. It highlights a `significant gap between machine and human sample efficiency` in few-shot learning, where humans effectively integrate prior knowledge.

The following figure (Figure 16 from the original paper) plots human accuracy vs CLIP's zero-shot accuracy.

*The image is a chart of per-breed accuracy for CLIP and for human zero-shot and one-shot recognition: the horizontal axis lists breeds, the vertical axis is accuracy (%), and the three lines (Zero-Shot CLIP, One-Shot Human, Zero-Shot Human) show that CLIP and humans rank difficulty consistently.*

Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label.

The hardest problems for CLIP (e.g., specific dog breeds that are very similar) tend to also be hard for humans, suggesting consistency in difficulty stemming from dataset noise or genuinely challenging visual distinctions.

### 6.1.6. Data Overlap Analysis

The following figure (Figure 17 from the original paper) summarizes the data overlap analysis.

*The image is a diagram of the difference in accuracy between clean and overlapping data at various percentages of data overlap: the left panel shows accuracy changes for several datasets, including CIFAR-100 and SUN397, while the right panel shows the overall accuracy change per dataset.*

* **Procedure:** A custom `near-duplicate detector` was used (trained with a `contrastive loss` on heavily augmented images) to identify overlapping examples between the `WIT pre-training dataset` and the `downstream evaluation datasets`.
* **Overlap Rate:** Out of 35 datasets, 9 had `no detected overlap`. The `median overlap` was `2.2%`, and the `average overlap` was `3.2%`.
* **Impact on Performance:** Due to the small overlap, `overall accuracy` was rarely shifted by more than $0.1\%$, with only 7 datasets above this threshold. Only 2 were `statistically significant` after `Bonferroni correction`. The `max detected improvement` was $0.6\%$ on `Birdsnap` (with $12.1\%$ overlap). `Country211` had the largest overlap ($21.5\%$) but only a $0.2\%$ accuracy increase, potentially because the training text wasn't directly related to the geo-localization task.
* **Conclusion:** The minimal impact of overlap aligns with previous `large-scale pre-training` studies (Mahajan et al., 2018; Kolesnikov et al., 2019), suggesting that `data contamination` is not a major factor inflating CLIP's reported performance.

### 6.1.7.
Dataset Ablation on YFCC100M The following are the results from Table 12 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <th rowspan="2">Dataset</th> <th colspan="3">Linear Classifier</th> <th colspan="3">Zero Shot</th> </tr> <tr> <th>YFCC</th> <th>WIT</th> <th>∆</th> <th>YFCC</th> <th>WIT</th> <th>∆</th> </tr> </thead> <tbody> <tr> <td>Birdsnap</td> <td>47.4</td> <td>35.3</td> <td>+12.1</td> <td>19.9</td> <td>4.5</td> <td>+15.4</td> </tr> <tr> <td>Country211</td> <td>23.1</td> <td>17.3</td> <td>+5.8</td> <td>5.2</td> <td>5.3</td> <td>+0.1</td> </tr> <tr> <td>Flowers102</td> <td>94.4</td> <td>89.8</td> <td>+4.6</td> <td>48.6</td> <td>21.7</td> <td>+26.9</td> </tr> <tr> <td>GTSRB</td> <td>66.8</td> <td>72.5</td> <td>-5.7</td> <td>6.9</td> <td>7.0</td> <td>−0.1</td> </tr> <tr> <td>UCF101</td> <td>69.2</td> <td>74.9</td> <td>-5.7</td> <td>22.9</td> <td>32.0</td> <td>-9.1</td> </tr> <tr> <td>Stanford Cars</td> <td>31.4</td> <td>50.3</td> <td>−18.9</td> <td>3.8</td> <td>10.9</td> <td>-7.1</td> </tr> <tr> <td>ImageNet</td> <td>62.0</td> <td>60.8</td> <td>+1.2</td> <td>31.3</td> <td>27.6</td> <td>+3.7</td> </tr> <tr> <td>Dataset Average</td> <td>65.5</td> <td>66.6</td> <td>−1.1</td> <td>29.6</td> <td>30.0</td> <td>−0.4</td> </tr> <tr> <td>Dataset "Wins"</td> <td>10</td> <td>15</td> <td>-5</td> <td>19</td> <td>18</td> <td>+1</td> </tr> </tbody> </table></div> An ablation study compared a `ResNet-50` model trained on a `filtered subset of YFCC100M` with the same model trained on an `equally sized subset of WIT`. * **Overall:** `YFCC` and `WIT` show `similar average performance` for both `zero-shot` and `linear probe` settings. * **Specific Datasets:** Performance on `fine-grained classification datasets` can vary widely (e.g., `Birdsnap`, `Flowers102` better with YFCC; `Stanford Cars`, `UCF101` better with WIT). This likely reflects the `relative density of relevant data` for specific concepts within each pre-training dataset. * **Main Advantage of WIT:** The primary advantage of `WIT` over `YFCC100M` is its `much larger total size`, which enables `better generalization`. ### 6.1.8. 
Selected Task and Dataset Results (Appendix E) * **Image and Text Retrieval:** The following are the results from Table 13 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="3" colspan="2"></td> <th colspan="6">Text Retrieval</th> <th colspan="6">Image Retrieval</th> </tr> <tr> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> <th colspan="3">Flickr30k</th> <th colspan="3">MSCOCO</th> </tr> <tr> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> <th>R@1</th> <th>R@5</th> <th>R@10</th> </tr> </thead> <tbody> <tr> <td rowspan="4" colspan="2">SOTA Fined-tuned</td> <td>Unicoder-VLa</td> <td>86.2</td> <td>96.3</td> <td>99.0</td> <td>62.3</td> <td>87.1</td> <td>92.8</td> <td>71.5</td> <td>90.9</td> <td>94.9</td> <td>46.7</td> <td>76.0</td> <td>85.3</td> </tr> <tr> <td>Uniterb</td> <td>87.3</td> <td>98.0</td> <td>99.2</td> <td>65.7</td> <td>88.6</td> <td>93.8</td> <td>75.6</td> <td>94.1</td> <td>96.8</td> <td>52.9</td> <td>79.9</td> <td>88.0</td> </tr> <tr> <td>VILLAc</td> <td>87.9</td> <td>97.5</td> <td>98.8</td> <td>-</td> <td>-</td> <td>-</td> <td>76.3</td> <td>94.2</td> <td>96.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Oscard</td> <td>-</td> <td>-</td> <td>-</td> <td>73.5</td> <td>92.2</td> <td>96.0</td> <td>-</td> <td>-</td> <td>-</td> <td>57.5</td> <td>82.8</td> <td>89.8</td> </tr> <tr> <td rowspan="5" colspan="2">SOTA Zero-Shot</td> <td>ERNIE-ViLe</td> <td>88.7</td> <td>98.0</td> <td>99.2</td> <td>-</td> <td>-</td> <td>-</td> <td>76.7</td> <td>93.6</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Visual N-Gramsf</td> <td>15.4</td> <td>35.7</td> <td>45.1</td> <td>8.7</td> <td>23.1</td> <td>33.3</td> <td>8.8</td> <td>21.2</td> <td>29.9</td> <td>5.0</td> <td>14.5</td> <td>21.9</td> </tr> <tr> <td>ImageBERTg</td> <td>-</td> <td>-</td> <td>-</td> <td>44.0</td> <td>71.2</td> <td>80.4</td> <td>-</td> <td>-</td> <td>-</td> <td>32.3</td> <td>59.0</td> <td>70.2</td> </tr> <tr> <td>Unicoder-VLa</td> <td>64.3</td> <td>86.8</td> <td>92.3</td> <td>-</td> <td>-</td> <td>-</td> <td>48.4</td> <td>76.0</td> <td>85.2</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Uniterb</td> <td>83.6</td> <td>95.7</td> <td>97.7</td> <td>-</td> <td>-</td> <td>-</td> <td>68.7</td> <td>89.2</td> <td>93.9</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td colspan="2">CLIP</td> <td>88.0</td> <td>98.7</td> <td>99.4</td> <td>58.4</td> <td>81.5</td> <td>88.1</td> <td>68.7</td> <td>90.6</td> <td>95.2</td> <td>37.8</td> <td>62.4</td> <td>72.2</td> </tr> </tbody> </table></div> The following are the results from Table 13 of the original paper: Text Retrieval (Flickr30k, MSCOCO) and Image Retrieval (Flickr30k, MSCOCO). CLIP, pre-trained for `image-text retrieval`, performs well on this `sanity check`. `Zero-shot CLIP` matches or outperforms all prior `zero-shot results` on `Flickr30k` and `MSCOCO`. On `Flickr30k text retrieval`, it's competitive with the `overall SOTA`. On `image retrieval`, it's competitive with a `fine-tuned Unicoder-VL` but not the `overall SOTA`. `Prompting` with "a photo of" boosts performance. 
* **Optical Character Recognition (OCR):** The following are the results from Table 14 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <td></td> <th>MNIST</th> <th>SVHN</th> <th>IIIT5K 1k</th> <th>Hateful Memes</th> <th>SST-2</th> </tr> </thead> <tbody> <tr> <td rowspan="3" colspan="2">SOTA JOINT</td> <td>a</td> <td>99.8</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>b</td> <td>-</td> <td>96.4</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>c</td> <td>-</td> <td>-</td> <td>98.9</td> <td>78.0</td> <td>97.5</td> </tr> <tr> <td rowspan="2" colspan="2">SOTA SUPERVISED</td> <td>Raw Pixels</td> <td>92.5</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>ES Best</td> <td>-</td> <td>-</td> <td>89.6</td> <td>58.6</td> <td>59.0</td> </tr> <tr> <td colspan="2">Linear CLIP</td> <td>99.2</td> <td>-</td> <td>-</td> <td>77.3</td> <td>80.5</td> </tr> <tr> <td colspan="2">Zero-Shot CLIP</td> <td>88.4</td> <td>51.0</td> <td>90.0</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div> The following are the results from Table 14 of the original paper: OCR performance on 5 datasets. CLIP learns `primitive OCR capabilities`. * **Strong on Digital Text:** Strongest on `Hateful Memes` and `SST-2` (digitally rendered words), where linear CLIP reaches $80.5\%$ on `SST-2` (on par with a `GloVe CBOW baseline`) and $77.3\%$ on `Hateful Memes` (0.7 points behind SOTA). * **Weak on Natural/Handwritten:** Weaker on `IIIT5K` (natural images of words) and particularly poor on `SVHN` (street view numbers, $51\%$ accuracy) and `MNIST` (handwritten digits, $88\%$ accuracy), where even simple `logistic regression on raw pixels` outperforms it. This suggests issues with `repeated characters`, `low resolution`, `blurry images`, and truly `out-of-distribution` data. * **Action Recognition in Videos:** The following are the results from Table 15 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td rowspan="2"></td> <td rowspan="2"></td> <th>UCF101</th> <th>K700</th> <th colspan="2">RareAct</th> </tr> <tr> <th>Top-1</th> <th>AVG</th> <th>mWAP</th> <th>mWSAP</th> </tr> </thead> <tbody> <tr> <td rowspan="4">SOTA FINE-TUNED</td> <td>R(2+1)D-BERTa</td> <td>98.7</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2b</td> <td>-</td> <td>84.8</td> <td>-</td> <td>-</td> </tr> <tr> <td>HT100M S3Dd</td> <td>91.3</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Baseline I3De</td> <td>-</td> <td>70.2</td> <td>-</td> <td>-</td> </tr> <tr> <td rowspan="4">SOTA LINEAR</td> <td>MMV FACf</td> <td>91.8</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>NS ENet-L2c</td> <td>89.4</td> <td>68.2</td> <td>-</td> <td>-</td> </tr> <tr> <td>CLIP</td> <td>92.0</td> <td>73.0</td> <td>-</td> <td>-</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> <tr> <td rowspan="2">SOTA ZERO-SHOT</td> <td>HT100M S3Dd</td> <td>-</td> <td>-</td> <td>30.5</td> <td>34.8</td> </tr> <tr> <td>CLIP</td> <td>80.3</td> <td>69.6</td> <td>40.7</td> <td>44.8</td> </tr> </tbody> </table></div> The following are the results from Table 15 of the original paper: Action recognition performance on 3 video datasets. CLIP performs strongly on `action recognition`, a task involving `verbs`, suggesting benefits from `natural language's broader supervision`. * **Linear Evaluation:** CLIP matches the best prior result on `UCF101` and outperforms all other models in the evaluation suite. 
On `Kinetics-700`, CLIP outperforms the fine-tuned `I3D baseline`. (Note: Linear evaluations use single central frames, underestimating full video performance.) * **Zero-Shot Evaluation:** On `Kinetics-700`, `zero-shot CLIP` (averaging predictions across all frames) is within $1\%$ of the `fully supervised I3D baseline`. On `RareAct` (unusual actions), CLIP improves over the prior SOTA by 10 points. * **Geolocalization:** The following are the results from Table 17 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td></td> <th>1km</th> <th>25km</th> <th>200km</th> <th>750km</th> <th>2500km</th> </tr> </thead> <tbody> <tr> <td>ISNsa</td> <td>16.9</td> <td>43.0</td> <td>51.9</td> <td>66.7</td> <td>80.2</td> </tr> <tr> <td>CPlaNetb</td> <td>16.5</td> <td>37.1</td> <td>46.4</td> <td>62.0</td> <td>78.5</td> </tr> <tr> <td>CLIP</td> <td>13.9</td> <td>32.9</td> <td>43.0</td> <td>62.0</td> <td>79.3</td> </tr> <tr> <td>Deep-Ret+c</td> <td>14.4</td> <td>33.3</td> <td>47.7</td> <td>61.6</td> <td>73.4</td> </tr> <tr> <td>PlaNetd</td> <td>8.4</td> <td>24.5</td> <td>37.6</td> <td>53.6</td> <td>71.3</td> </tr> </tbody> </table></div> The following are the results from Table 17 of the original paper: Geolocalization performance on the IM2GPS test set. CLIP shows an ability to recognize places. Using `nearest-neighbor regression` in CLIP's embedding space on the `IM2GPS` test set, CLIP performs similarly to several task-specific models despite querying only 1 million images (less than prior work). It is `not competitive with the current state-of-the-art` in this regression-based setting. ### 6.2. Data Presentation (Tables) The following are the results from Table 10 of the original paper: Linear probe performance of various pre-trained models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="1" colspan="2"></td> <td rowspan="1" colspan="25"></td> </tr> <tr> <td rowspan="1" colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td colspan="2">LM RN50</td> <td>65.2</td> <td>90.0</td> <td>64.9</td> <td>19.6</td> <td>64.0</td> <td>37.0</td> <td>0.1</td> <td>56.8</td> <td>65.2</td> <td>76.8</td> <td>82.1</td> <td>93.4</td> <td>70.0</td> <td>97.7</td> <td>78.0</td> <td>65.7</td> <td>76.6</td> <td>43.7</td> <td>25.3</td> <td>52.5</td> <td>62.6</td> <td>40.7</td> <td>78.2</td> <td>53.8</td> <td>59.6</td> <td>56.9</td> </tr> <tr> <td rowspan="5" colspan="2">CLIP</td> <td>RN50</td> <td>88.9</td> <td>91.1</td> <td>73.5</td> <td>58.6</td> <td>75.1</td> <td>91.3</td> <td>90.5</td> <td>73.0</td> <td>65.7</td> <td>77.0</td> <td>85.9</td> <td>97.8</td> <td>64.2</td> <td>98.3</td> <td>82.4</td> <td>70.2</td> <td>25.3</td> <td>82.4</td> <td>57.3</td> <td>68.0</td> <td>76.6</td> <td>53.8</td> <td>71.1</td> <td>80.0</td> <td>81.5</td> </tr> <tr> <td>RN101</td> <td>93.3</td> <td>92.2</td> <td>74.9</td> <td>72.8</td> <td>79.2</td> <td>88.7</td> <td>62.7</td> <td>89.0</td> <td>79.1</td> <td>94.8</td> <td>94.1</td> <td>98.3</td> <td>68.7</td> <td>98.6</td> <td>89.7</td> <td>85.5</td> <td>30.3</td> <td>83.0</td> <td>78.6</td> <td>79.1</td> <td>91.4</td> <td>69.2</td> <td>40.7</td> <td>83.7</td> <td>89.5</td> </tr> <tr> <td>RN50x4</td> <td>94.9</td> <td>94.1</td> <td>78.6</td> <td>77.2</td> <td>81.1</td> <td>90.5</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>98.9</td> <td>71.3</td> <td>99.1</td> <td>91.4</td> <td>89.0</td> <td>34.8</td> <td>83.5</td> <td>82.0</td> <td>92.7</td> <td>95.1</td> <td>60.3</td> <td>46.4</td> <td>85.6</td> <td>92.0</td> </tr> <tr> <td>RN50x16</td> <td>95.9</td> <td>95.0</td> <td>80.7</td> <td>78.2</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>RN50x64</td> <td>96.2</td> <td>95.9</td> <td>81.6</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>99.2</td> <td>72.9</td> <td>99.7</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>92.8</td> <td>96.2</td> <td>83.1</td> <td>67.8</td> <td>78.4</td> <td>86.7</td> <td>69.4</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.5</td> <td>96.9</td> <td>69.2</td> <td>98.3</td> <td>85.3</td> <td>66.2</td> <td>27.8</td> <td>83.9</td> <td>76.5</td> <td>90.0</td> <td>93.0</td> <td>61.7</td> <td>52.1</td> <td>66.7</td> <td>70.8</td> </tr> <tr> <td>ViT-B/16</td> <td>94.7</td> <td>97.1</td> <td>86.6</td> <td>67.8</td> <td>80.2</td> 
<td>89.6</td> <td>70.3</td> <td>89.6</td> <td>82.1</td> <td>95.1</td> <td>96.0</td> <td>97.1</td> <td>72.2</td> <td>99.2</td> <td>86.6</td> <td>67.8</td> <td>33.3</td> <td>83.5</td> <td>79.7</td> <td>93.5</td> <td>97.1</td> <td>70.3</td> <td>57.1</td> <td>75.5</td> <td>80.2</td> </tr> <tr> <td>ViT-L/14</td> <td>95.9</td> <td>97.9</td> <td>87.4</td> <td>79.9</td> <td>82.2</td> <td>91.5</td> <td>71.6</td> <td>89.9</td> <td>83.0</td> <td>95.1</td> <td>96.0</td> <td>98.9</td> <td>72.9</td> <td>99.2</td> <td>92.5</td> <td>90.2</td> <td>42.9</td> <td>85.8</td> <td>83.9</td> <td>95.4</td> <td>98.9</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>96.5</td> <td>98.1</td> <td>89.0</td> <td>78.5</td> <td>82.5</td> <td>91.5</td> <td>72.2</td> <td>90.0</td> <td>83.0</td> <td>95.2</td> <td>96.0</td> <td>99.2</td> <td>73.0</td> <td>99.7</td> <td>94.1</td> <td>92.8</td> <td>46.4</td> <td>85.6</td> <td>85.4</td> <td>94.9</td> <td>98.1</td> <td>69.2</td> <td>46.4</td> <td>85.6</td> <td>92.4</td> </tr> <tr> <td rowspan="9" colspan="2">EfficientNet</td> <td>B0</td> <td>71.2</td> <td>93.0</td> <td>73.3</td> <td>60.6</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>85.6</td> <td>75.6</td> <td>93.8</td> <td>93.1</td> <td>94.5</td> <td>98.1</td> <td>55.2</td> <td>98.2</td> <td>97.0</td> <td>84.3</td> <td>74.0</td> <td>71.6</td> <td>14.0</td> <td>83.1</td> <td>76.2</td> <td>51.7</td> <td>47.3</td> <td>55.7</td> <td>55.0</td> </tr> <tr> <td>B1</td> <td>74.2</td> <td>93.4</td> <td>73.6</td> <td>63.2</td> <td>64.0</td> <td>57.0</td> <td>53.7</td> <td>86.2</td> <td>77.0</td> <td>94.6</td> <td>94.4</td> <td>95.1</td> <td>98.0</td> <td>56.1</td> <td>98.2</td> <td>96.9</td> <td>84.3</td> <td>73.1</td> <td>67.1</td> <td>14.5</td> <td>83.9</td> <td>79.9</td> <td>54.3</td> <td>54.9</td> <td>81.1</td> <td>77.4</td> </tr> <tr> <td>B2</td> <td>77.4</td> <td>94.0</td> <td>78.0</td> <td>66.5</td> <td>64.4</td> <td>66.0</td> <td>59.3</td> <td>85.8</td> <td>73.1</td> <td>94.1</td> <td>93.7</td> <td>93.3</td> <td>98.5</td> <td>57.1</td> <td>98.2</td> <td>97.3</td> <td>85.0</td> <td>75.8</td> <td>76.1</td> <td>13.4</td> <td>83.3</td> <td>78.1</td> <td>50.9</td> <td>45.1</td> <td>53.8</td> <td>54.8</td> </tr> <tr> <td>B3</td> <td>79.7</td> <td>94.1</td> <td>78.7</td> <td>70.1</td> <td>65.4</td> <td>66.4</td> <td>60.4</td> <td>86.5</td> <td>73.4</td> <td>94.7</td> <td>93.5</td> <td>93.2</td> <td>98.8</td> <td>57.9</td> <td>98.2</td> <td>96.8</td> <td>85.0</td> <td>78.3</td> <td>72.3</td> <td>13.9</td> <td>83.1</td> <td>79.1</td> <td>52.5</td> <td>46.5</td> <td>54.4</td> <td>55.4</td> </tr> <tr> <td>B4</td> <td>81.5</td> <td>93.6</td> <td>77.9</td> <td>72.4</td> <td>67.1</td> <td>72.7</td> <td>68.9</td> <td>86.7</td> <td>73.9</td> <td>95.0</td> <td>94.7</td> <td>94.5</td> <td>98.4</td> <td>58.5</td> <td>98.2</td> <td>96.8</td> <td>86.0</td> <td>78.5</td> <td>69.6</td> <td>14.9</td> <td>84.7</td> <td>80.9</td> <td>54.5</td> <td>46.6</td> <td>53.3</td> <td>56.3</td> </tr> <tr> <td>B5</td> <td>84.5</td> <td>94.8</td> <td>80.0</td> <td>73.5</td> <td>65.8</td> <td>71.1</td> <td>68.2</td> <td>87.6</td> <td>73.9</td> <td>95.0</td> <td>94.1</td> <td>93.7</td> <td>98.4</td> <td>60.2</td> <td>98.2</td> <td>96.8</td> <td>85.4</td> <td>78.1</td> <td>72.7</td> <td>15.3</td> <td>84.2</td> <td>80.0</td> <td>54.1</td> <td>51.1</td> <td>53.3</td> <td>57.0</td> </tr> <tr> <td>B6</td> <td>86.9</td> <td>96.0</td> <td>82.0</td> <td>74.7</td> <td>69.0</td> <td>77.1</td> <td>72.3</td> <td>87.2</td> 
<td>76.8</td> <td>95.2</td> <td>94.7</td> <td>96.9</td> <td>98.6</td> <td>61.4</td> <td>99.1</td> <td>96.3</td> <td>86.8</td> <td>80.8</td> <td>75.8</td> <td>16.4</td> <td>85.2</td> <td>81.9</td> <td>57.7</td> <td>51.9</td> <td>54.8</td> <td>58.8</td> </tr> <tr> <td>B7</td> <td>87.3</td> <td>97.0</td> <td>83.9</td> <td>75.8</td> <td>71.4</td> <td>67.6</td> <td>65.6</td> <td>87.3</td> <td>78.5</td> <td>95.2</td> <td>96.4</td> <td>97.2</td> <td>98.6</td> <td>61.9</td> <td>99.5</td> <td>96.6</td> <td>86.1</td> <td>70.7</td> <td>72.4</td> <td>17.6</td> <td>84.2</td> <td>85.5</td> <td>61.0</td> <td>49.6</td> <td>54.6</td> <td>55.7</td> </tr> <tr> <td>B8</td> <td>88.4</td> <td>96.0</td> <td>82.0</td> <td>76.9</td> <td>72.6</td> <td>72.2</td> <td>71.2</td> <td>88.1</td> <td>80.5</td> <td>95.5</td> <td>95.5</td> <td>96.6</td> <td>98.5</td> <td>62.7</td> <td>99.4</td> <td>96.2</td> <td>88.5</td> <td>73.4</td> <td>73.0</td> <td>18.5</td> <td>83.8</td> <td>86.6</td> <td>63.2</td> <td>50.5</td> <td>57.2</td> <td>56.7</td> </tr> <tr> <td rowspan="2" colspan="2">NS EfficientNet</td> <td>L2-475</td> <td>91.6</td> <td>99.0</td> <td>91.0</td> <td>74.8</td> <td>76.4</td> <td>75.1</td> <td>66.8</td> <td>89.5</td> <td>81.9</td> <td>95.6</td> <td>96.5</td> <td>97.7</td> <td>98.9</td> <td>67.5</td> <td>99.3</td> <td>97.0</td> <td>89.5</td> <td>73.4</td> <td>68.9</td> <td>22.2</td> <td>86.3</td> <td>89.4</td> <td>68.2</td> <td>58.3</td> <td>58.6</td> <td>55.2</td> </tr> <tr> <td>L2-800</td> <td>92.0</td> <td>98.7</td> <td>89.0</td> <td>78.5</td> <td>75.7</td> <td>68.4</td> <td>68.4</td> <td>89.4</td> <td>82.5</td> <td>95.6</td> <td>94.7</td> <td>97.9</td> <td>98.5</td> <td>68.4</td> <td>99.3</td> <td>97.2</td> <td>89.9</td> <td>77.7</td> <td>66.9</td> <td>23.7</td> <td>86.8</td> <td>88.9</td> <td>58.4</td> <td>56.9</td> <td>88.4</td> <td>78.5</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt</td> <td>32x8d</td> <td>84.8</td> <td>95.9</td> <td>80.9</td> <td>63.8</td> <td>69.0</td> <td>74.2</td> <td>56.0</td> <td>88.0</td> <td>75.4</td> <td>95.4</td> <td>93.9</td> <td>91.7</td> <td>97.4</td> <td>60.7</td> <td>99.1</td> <td>95.7</td> <td>82.1</td> <td>72.3</td> <td>69.2</td> <td>16.7</td> <td>82.3</td> <td>80.1</td> <td>56.8</td> <td>42.2</td> <td>53.3</td> <td>55.2</td> </tr> <tr> <td>32x16d</td> <td>85.7</td> <td>96.5</td> <td>80.9</td> <td>64.8</td> <td>70.5</td> <td>77.5</td> <td>56.7</td> <td>87.9</td> <td>76.2</td> <td>95.6</td> <td>94.9</td> <td>92.5</td> <td>97.4</td> <td>61.3</td> <td>99.3</td> <td>95.5</td> <td>82.8</td> <td>73.8</td> <td>66.1</td> <td>17.5</td> <td>83.4</td> <td>81.1</td> <td>58.2</td> <td>41.3</td> <td>54.2</td> <td>56.1</td> </tr> <tr> <td>32x32d</td> <td>86.7</td> <td>96.8</td> <td>82.7</td> <td>67.1</td> <td>71.5</td> <td>77.5</td> <td>55.4</td> <td>88.3</td> <td>78.5</td> <td>95.8</td> <td>95.3</td> <td>94.4</td> <td>97.9</td> <td>62.4</td> <td>99.3</td> <td>95.7</td> <td>85.4</td> <td>71.2</td> <td>66.8</td> <td>18.0</td> <td>83.7</td> <td>82.1</td> <td>58.8</td> <td>39.7</td> <td>55.3</td> <td>56.7</td> </tr> <tr> <td rowspan="3" colspan="2">Instagram-pretrained ResNeXt (FixRes)</td> <td>32x48d</td> <td>86.9</td> <td>96.8</td> <td>83.4</td> <td>65.9</td> <td>72.2</td> <td>76.6</td> <td>53.2</td> <td>88.0</td> <td>77.2</td> <td>95.5</td> <td>95.8</td> <td>93.6</td> <td>98.1</td> <td>63.7</td> <td>99.4</td> <td>95.3</td> <td>85.4</td> <td>73.0</td> <td>67.2</td> <td>18.5</td> <td>82.7</td> <td>82.8</td> <td>59.2</td> <td>41.3</td> <td>55.5</td> <td>56.7</td> 
</tr> <tr> <td>FixRes-v1</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.4</td> <td>72.9</td> <td>80.5</td> <td>57.6</td> <td>88.0</td> <td>77.9</td> <td>95.8</td> <td>96.1</td> <td>94.5</td> <td>97.9</td> <td>62.2</td> <td>99.4</td> <td>96.2</td> <td>86.6</td> <td>76.1</td> <td>64.8</td> <td>19.3</td> <td>82.5</td> <td>83.4</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td>FixRes-v2</td> <td>88.5</td> <td>95.7</td> <td>81.1</td> <td>67.3</td> <td>72.9</td> <td>80.7</td> <td>57.5</td> <td>88.0</td> <td>77.9</td> <td>95.0</td> <td>96.0</td> <td>94.5</td> <td>98.0</td> <td>62.1</td> <td>99.4</td> <td>96.5</td> <td>86.1</td> <td>76.3</td> <td>64.8</td> <td>19.5</td> <td>82.3</td> <td>83.2</td> <td>59.8</td> <td>43.5</td> <td>56.6</td> <td>59.0</td> </tr> <tr> <td rowspan="2" colspan="2">BiT-S</td> <td>R50x1</td> <td>72.5</td> <td>91.7</td> <td>74.8</td> <td>57.7</td> <td>61.1</td> <td>53.5</td> <td>83.7</td> <td>72.4</td> <td>92.3</td> <td>91.2</td> <td>98.4</td> <td>56.5</td> <td>96.4</td> <td>97.4</td> <td>85.0</td> <td>70.0</td> <td>66.0</td> <td>12.5</td> <td>83.0</td> <td>72.3</td> <td>47.5</td> <td>48.3</td> <td>54.1</td> <td>55.3</td> </tr> <tr> <td>R50x3</td> <td>75.1</td> <td>93.7</td> <td>79.0</td> <td>61.1</td> <td>63.7</td> <td>55.2</td> <td>84.8</td> <td>74.6</td> <td>92.5</td> <td>91.6</td> <td>92.8</td> <td>98.8</td> <td>58.7</td> <td>97.0</td> <td>97.8</td> <td>86.4</td> <td>73.1</td> <td>73.8</td> <td>14.0</td> <td>84.2</td> <td>76.4</td> <td>50.0</td> <td>49.2</td> <td>54.7</td> <td>54.2</td> </tr> <tr> <td rowspan="4" colspan="2">BiT-M</td> <td>R101x1</td> <td>73.5</td> <td>92.8</td> <td>77.4</td> <td>58.4</td> <td>61.3</td> <td>54.0</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.8</td> <td>90.6</td> <td>98.3</td> <td>56.5</td> <td>96.8</td> <td>97.3</td> <td>84.6</td> <td>69.4</td> <td>68.9</td> <td>12.9</td> <td>82.0</td> <td>73.5</td> <td>73.5</td> <td>48.6</td> <td>45.4</td> <td>52.6</td> </tr> <tr> <td>R101x3</td> <td>74.7</td> <td>93.9</td> <td>79.8</td> <td>57.8</td> <td>62.9</td> <td>54.7</td> <td>84.7</td> <td>75.5</td> <td>92.3</td> <td>91.2</td> <td>92.6</td> <td>98.8</td> <td>59.7</td> <td>97.3</td> <td>98.0</td> <td>85.5</td> <td>71.8</td> <td>60.2</td> <td>14.1</td> <td>83.1</td> <td>75.9</td> <td>75.9</td> <td>50.4</td> <td>49.7</td> <td>54.1</td> </tr> <tr> <td>R152x2</td> <td>74.9</td> <td>94.3</td> <td>79.7</td> <td>58.7</td> <td>62.7</td> <td>55.9</td> <td>85.3</td> <td>74.9</td> <td>93.0</td> <td>92.0</td> <td>91.7</td> <td>98.6</td> <td>58.3</td> <td>97.3</td> <td>97.8</td> <td>86.2</td> <td>71.8</td> <td>71.6</td> <td>13.9</td> <td>84.1</td> <td>76.2</td> <td>76.2</td> <td>49.9</td> <td>48.2</td> <td>53.8</td> </tr> <tr> <td>R152x4</td> <td>74.7</td> <td>94.2</td> <td>79.2</td> <td>57.8</td> <td>62.9</td> <td>51.2</td> <td>85.4</td> <td>75.4</td> <td>93.1</td> <td>91.2</td> <td>91.4</td> <td>98.9</td> <td>61.4</td> <td>97.2</td> <td>98.0</td> <td>85.5</td> <td>72.8</td> <td>67.9</td> <td>14.9</td> <td>83.1</td> <td>76.0</td> <td>50.3</td> <td>42.9</td> <td>53.6</td> <td>56.0</td> </tr> <tr> <td rowspan="7" colspan="2">ViT (ImageNet-21k)</td> <td>B/32</td> <td>86.7</td> <td>96.9</td> <td>86.4</td> <td>74.0</td> <td>74.2</td> <td>54.7</td> <td>86.7</td> <td>86.3</td> <td>73.1</td> <td>90.4</td> <td>94.5</td> <td>97.8</td> <td>59.0</td> <td>99.0</td> <td>96.3</td> <td>83.0</td> <td>68.1</td> <td>65.1</td> <td>15.7</td> <td>82.6</td> <td>79.1</td> <td>51.7</td> <td>38.9</td> <td>57.1</td> <td>54.6</td> </tr> <tr> 
<td>B/16</td> <td>89.2</td> <td>97.4</td> <td>87.4</td> <td>76.5</td> <td>74.9</td> <td>62.5</td> <td>86.1</td> <td>75.4</td> <td>91.9</td> <td>94.7</td> <td>98.9</td> <td>62.0</td> <td>99.3</td> <td>97.6</td> <td>85.7</td> <td>70.4</td> <td>58.8</td> <td>17.7</td> <td>85.7</td> <td>84.1</td> <td>58.0</td> <td>38.4</td> <td>58.4</td> <td>52.8</td> </tr> <tr> <td>L/14</td> <td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>H/14</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> <tr> <td rowspan="2" colspan="2">SimCLRv2</td> <td>R50x1</td> <td>76.4</td> <td>93.2</td> <td>77.9</td> <td>48.6</td> <td>64.1</td> <td>56.3</td> <td>84.4</td> <td>77.0</td> <td>88.3</td> <td>91.8</td> <td>92.9</td> <td>97.6</td> <td>59.5</td> <td>96.7</td> <td>97.9</td> <td>85.8</td> <td>71.1</td> <td>69.1</td> <td>15.8</td> <td>84.8</td> <td>78.4</td> <td>51.0</td> <td>56.2</td> <td>53.9</td> <td>53.8</td> </tr> <tr> <td>R50x3</td> <td>82.2</td> <td>96.4</td> <td>83.4</td> <td>57.5</td> <td>68.2</td> <td>82.2</td> <td>86.9</td> <td>74.6</td> <td>60.6</td> <td>87.7</td> <td>78.5</td> <td>93.2</td> <td>95.3</td> <td>99.4</td> <td>98.6</td> <td>64.1</td> <td>99.3</td> <td>98.0</td> <td>88.1</td> <td>69.9</td> <td>59.6</td> <td>19.6</td> <td>83.4</td> <td>83.0</td> <td>57.8</td> <td>51.3</td> </tr> <tr> <td rowspan="2" colspan="2">BYOL</td> <td>R50x1</td> <td>77.0</td> <td>88.3</td> <td>93.7</td> <td>94.3</td> <td>98.2</td> <td>58.8</td> <td>96.1</td> <td>96.4</td> <td>97.6</td> <td>88.4</td> <td>71.1</td> <td>71.4</td> <td>14.1</td> <td>84.8</td> <td>8.3</td> <td>45.3</td> <td>56.1</td> <td>53.8</td> <td>52.7</td> <td>73.2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> </tr> <tr> <td>R200x2</td> <td>77.4</td> <td>91.9</td> <td>95.5</td> <td>93.9</td> <td>98.6</td> <td>59.5</td> <td>96.5</td> <td>96.8</td> <td>97.9</td> <td>88.8</td> <td>74.4</td> <td>70.3</td> <td>16.4</td> <td>84.0</td> <td>16.4</td> <td>84.0</td> <td>77.7</td> <td>47.7</td> <td>56.9</td> <td>53.9</td> <td>53.8</td> <td>69.1</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> </tr> <tr> <td rowspan="2" colspan="2">MoCo</td> <td>v1</td> <td>77.2</td> <td>93.4</td> <td>76.3</td> <td>39.6</td> <td>60.2</td> <td>48.3</td> <td>82.6</td> <td>75.1</td> <td>84.4</td> <td>89.9</td> <td>90.7</td> <td>98.4</td> <td>58.3</td> <td>95.7</td> <td>97.2</td> <td>85.4</td> <td>75.7</td> <td>71.1</td> <td>12.6</td> <td>85.7</td> <td>75.4</td> <td>13.2</td> <td>85.6</td> <td>72.7</td> <td>47.8</td> </tr> <tr> <td>v2</td> <td>82.9</td> <td>96.6</td> <td>82.9</td> <td>60.2</td> <td>66.0</td> <td>54.3</td> <td>85.6</td> <td>76.6</td> <td>91.8</td> <td>94.6</td> <td>97.4</td> <td>99.2</td> <td>62.6</td> <td>97.1</td> <td>98.0</td> <td>86.5</td> <td>76.2</td> <td>72.2</td> <td>14.2</td> <td>86.0</td> <td>81.2</td> <td>53.4</td> <td>53.8</td> <td>56.9</td> <td>53.9</td> </tr> <tr> <td 
colspan="2">VirTex</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> <td>60.6</td> <td>57.9</td> <td>83.9</td> <td>57.5</td> <td>17.0</td> <td>49.8</td> <td>22.4</td> <td>34.5</td> <td>83.8</td> <td>58.2</td> <td>53.6</td> <td>70.6</td> <td>74.7</td> </tr> <tr> <td rowspan="3" colspan="2">ResNet</td> <td>50</td> <td>72.2</td> <td>91.8</td> <td>74.6</td> <td>58.2</td> <td>60.9</td> <td>53.3</td> <td>83.5</td> <td>71.9</td> <td>92.1</td> <td>91.1</td> <td>98.3</td> <td>56.3</td> <td>96.3</td> <td>97.3</td> <td>84.7</td> <td>69.8</td> <td>65.8</td> <td>12.3</td> <td>82.8</td> <td>72.1</td> <td>47.2</td> <td>48.0</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>101</td> <td>74.0</td> <td>92.8</td> <td>77.2</td> <td>59.8</td> <td>62.8</td> <td>54.5</td> <td>84.4</td> <td>73.5</td> <td>92.5</td> <td>91.6</td> <td>90.4</td> <td>98.6</td> <td>57.8</td> <td>96.8</td> <td>97.8</td> <td>85.0</td> <td>70.7</td> <td>68.5</td> <td>13.0</td> <td>82.0</td> <td>73.0</td> <td>48.5</td> <td>48.9</td> <td>53.9</td> <td>55.1</td> </tr> <tr> <td>152</td> <td>74.6</td> <td>93.5</td> <td>79.0</td> <td>60.8</td> <td>63.6</td> <td>55.4</td> <td>84.9</td> <td>74.4</td> <td>92.9</td> <td>91.7</td> <td>90.8</td> <td>98.7</td> <td>58.2</td> <td>97.0</td> <td>97.9</td> <td>85.4</td> <td>71.3</td> <td>69.6</td> <td>13.6</td> <td>82.5</td> <td>73.9</td> <td>49.3</td> <td>49.7</td> <td>54.0</td> <td>55.1</td> </tr> </tbody> </table></div> The following figure (Figure 20 from the original paper) plots the linear probe performance for each of the 27 datasets, using the data from Table 10.  *该图像是图表,展示了27个数据集的线性探针性能。各条曲线代表不同的模型,横轴为GFLOPs/image,纵轴为准确率。图中的星标和颜色区分了不同的方法,如CLIP和ResNet等。* Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10. The following are the results from Table 11 of the original paper: Zero-shot performance of CLIP models over 27 datasets. 
<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2" colspan="2"></td> <th colspan="25">Zero-Shot Performance</th> </tr> <tr> <td colspan="2"></td> <th>Food-101</th> <th>CIFAR-10</th> <th>CIFAR-100</th> <th>Birdsnap</th> <th>SUN397</th> <th>Stanford Cars</th> <th>FGVC Aircraft</th> <th>Describable Textures</th> <th>Oxford-IIIT Pets</th> <th>Caltech-101</th> <th>Oxford Flowers 102</th> <th>MNIST</th> <th>FER2013</th> <th>STL-10</th> <th>EuroSAT</th> <th>RESISC45</th> <th>GTSRB</th> <th>KITTI</th> <th>Country211</th> <th>PatchCamelyon</th> <th>UCF101</th> <th>Kinetics700</th> <th>CLEVR Counts</th> <th>Hateful Memes</th> <th>Rendered SST2</th> <th>ImageNet</th> </tr> </thead> <tbody> <tr> <td rowspan="5" colspan="2">CLIP-ResNet</td> <td>RN50</td> <td>75.6</td> <td>81.1</td> <td>41.6</td> <td>32.6</td> <td>59.6</td> <td>55.8</td> <td>19.3</td> <td>82.1</td> <td>41.7</td> <td>85.4</td> <td>82.1</td> <td>65.9</td> <td>66.6</td> <td>42.2</td> <td>94.3</td> <td>41.1</td> <td>54.2</td> <td>35.2</td> <td>42.2</td> <td>16.1</td> <td>57.6</td> <td>63.6</td> <td>43.5</td> <td>20.3</td> <td>59.7</td> <td>56.9</td> </tr> <tr> <td>RN101</td> <td>81.0</td> <td>83.9</td> <td>49.0</td> <td>37.2</td> <td>59.9</td> <td>62.3</td> <td>19.5</td> <td>82.4</td> <td>43.9</td> <td>86.2</td> <td>85.1</td> <td>65.7</td> <td>59.3</td> <td>45.6</td> <td>96.7</td> <td>33.1</td> <td>58.5</td> <td>38.3</td> <td>33.3</td> <td>16.9</td> <td>55.2</td> <td>62.2</td> <td>46.7</td> <td>28.1</td> <td>61.1</td> <td>64.2</td> </tr> <tr> <td>RN50x4</td> <td>86.8</td> <td>79.2</td> <td>48.9</td> <td>41.6</td> <td>62.7</td> <td>67.9</td> <td>24.6</td> <td>83.0</td> <td>49.3</td> <td>88.1</td> <td>86.0</td> <td>68.0</td> <td>75.2</td> <td>51.1</td> <td>96.4</td> <td>35.0</td> <td>59.2</td> <td>35.7</td> <td>26.0</td> <td>20.2</td> <td>57.5</td> <td>65.5</td> <td>49.0</td> <td>17.0</td> <td>58.3</td> <td>66.6</td> </tr> <tr> <td>RN50x16</td> <td>90.5</td> <td>82.2</td> <td>54.2</td> <td>45.9</td> <td>65.0</td> <td>72.3</td> <td>30.3</td> <td>82.9</td> <td>52.8</td> <td>89.7</td> <td>87.6</td> <td>71.9</td> <td>80.0</td> <td>56.0</td> <td>97.8</td> <td>40.3</td> <td>64.4</td> <td>39.6</td> <td>33.9</td> <td>24.0</td> <td>62.5</td> <td>68.7</td> <td>53.4</td> <td>17.6</td> <td>91.8</td> <td>86.8</td> </tr> <tr> <td>RN50x64</td> <td>91.8</td> <td>86.8</td> <td>61.3</td> <td>48.9</td> <td>66.9</td> <td>76.0</td> <td>35.6</td> <td>83.8</td> <td>53.4</td> <td>93.4</td> <td>90.6</td> <td>77.3</td> <td>90.8</td> <td>61.0</td> <td>98.3</td> <td>59.4</td> <td>69.7</td> <td>47.9</td> <td>33.2</td> <td>29.6</td> <td>65.0</td> <td>74.1</td> <td>56.8</td> <td>27.5</td> <td>62.1</td> <td>70.7</td> </tr> <tr> <td rowspan="4" colspan="2">CLIP-ViT</td> <td>ViT-B/32</td> <td>83.1</td> <td>44.5</td> <td>87.0</td> <td>87.9</td> <td>66.7</td> <td>51.9</td> <td>47.3</td> <td>97.2</td> <td>49.4</td> <td>60.3</td> <td>32.2</td> <td>39.4</td> <td>17.8</td> <td>58.4</td> <td>64.5</td> <td>47.8</td> <td>24.8</td> <td>57.6</td> <td>59.6</td> <td>63.2</td> <td>84.4</td> <td>91.3</td> <td>65.1</td> <td>37.8</td> <td>63.2</td> <td>59.4</td> </tr> <tr> <td>ViT-B/16</td> <td>89.2</td> <td>91.6</td> <td>68.7</td> <td>39.1</td> <td>65.2</td> <td>65.6</td> <td>27.1</td> <td>83.9</td> <td>46.0</td> <td>88.9</td> <td>89.3</td> <td>70.4</td> <td>56.0</td> <td>52.7</td> <td>98.2</td> <td>54.1</td> <td>65.5</td> <td>43.3</td> <td>44.0</td> <td>23.3</td> <td>48.1</td> <td>69.8</td> <td>52.4</td> <td>23.4</td> <td>61.7</td> <td>59.8</td> </tr> <tr> <td>ViT-L/14</td> 
<td>92.9</td> <td>96.2</td> <td>77.9</td> <td>48.3</td> <td>67.7</td> <td>77.3</td> <td>36.1</td> <td>84.1</td> <td>55.3</td> <td>93.5</td> <td>92.6</td> <td>78.7</td> <td>87.2</td> <td>57.5</td> <td>99.3</td> <td>59.9</td> <td>71.6</td> <td>50.3</td> <td>23.1</td> <td>32.7</td> <td>58.8</td> <td>76.2</td> <td>60.3</td> <td>24.3</td> <td>63.3</td> <td>64.0</td> </tr> <tr> <td>ViT-L/14-336px</td> <td>93.8</td> <td>95.7</td> <td>77.5</td> <td>49.5</td> <td>68.4</td> <td>78.8</td> <td>37.2</td> <td>84.3</td> <td>55.7</td> <td>93.5</td> <td>92.8</td> <td>78.3</td> <td>88.3</td> <td>57.7</td> <td>99.4</td> <td>59.6</td> <td>71.7</td> <td>52.3</td> <td>21.9</td> <td>34.9</td> <td>63.0</td> <td>76.9</td> <td>61.3</td> <td>24.8</td> <td>63.3</td> <td>67.9</td> </tr> </tbody> </table></div>

The following figure (Figure 22 from the original paper) shows CLIP's zero-shot performance compared to linear-probe ResNet performance.

Figure 22. CLIP's zero-shot performance compared to linear-probe ResNet performance across the 27 datasets.

The following are the results from Table 16 of the original paper: Robustness performance on natural distribution shift datasets.

<div class="table-wrapper"><table> <thead> <tr> <th rowspan="2"></th> <th rowspan="2">IN Top-1</th> <th rowspan="2">IN-V2 Top-1</th> <th rowspan="2">IN-A Top-1</th> <th rowspan="2">IN-R Top-1</th> <th rowspan="2">ObjectNet Top-1</th> <th rowspan="2">IN-Sketch Top-1</th> <th colspan="2">IN-Vid</th> <th colspan="2">YTBB</th> </tr> <tr> <th>PM0</th> <th>PM10</th> <th>PM0</th> <th>PM10</th> </tr> </thead> <tbody> <tr> <td>NS EfficientNet-L2a</td> <td>88.3</td> <td>80.2</td> <td>84.9</td> <td>74.7</td> <td>68.5</td> <td>47.6</td> <td>88.0</td> <td>82.1</td> <td>67.7</td> <td>63.5</td> </tr> <tr> <td>FixResNeXt101-32x48d V2b</td> <td>86.4</td> <td>78.0</td> <td>68.4</td> <td>80.0</td> <td>57.8</td> <td>59.1</td> <td>85.8</td> <td>72.2</td> <td>68.9</td> <td>57.7</td> </tr> <tr> <td>Linear Probe CLIP</td> <td>85.4</td> <td>75.9</td> <td>75.3</td> <td>84.2</td> <td>66.2</td> <td>57.4</td> <td>89.1</td> <td>77.2</td> <td>68.7</td> <td>63.1</td> </tr> <tr> <td>Zero-Shot CLIP</td> <td>76.2</td> <td>70.1</td> <td>77.2</td> <td>88.9</td> <td>72.3</td> <td>60.2</td> <td>95.3</td> <td>89.2</td> <td>95.2</td> <td>88.5</td> </tr> </tbody> </table></div>

### 6.3. Ablation Studies / Parameter Analysis

**Dataset Ablation on YFCC100M:** (Already covered in Section 6.1.7). This study confirms that while the specific data blend (YFCC vs. WIT) can cause performance variations on fine-grained tasks (due to different distributions of concepts), the overall approach is robust across different reasonably filtered large image-text collections. The key advantage of `WIT` is its sheer scale.

**Prompt Engineering and Ensembling:** (Already covered in Section 4.2.7 and Figure 4). This is a crucial "parameter analysis" showing that careful crafting of `natural language prompts` and combining multiple prompts (`ensembling`) significantly improves zero-shot performance (almost 5 points on ImageNet). This effectively leverages the `expressiveness of language` to guide the model; a minimal sketch of the procedure appears after the Resolution Scaling paragraph below.

**Model Scaling:** (Already covered in Section 6.1.2 and Figure 9). The observation of `log-log linear scaling` of `zero-shot error` with `model compute` across `ResNet` and `Vision Transformer` models serves as a powerful `parameter analysis`.
It indicates that increasing model size and computational resources leads to predictable improvements in performance, highlighting the scalability of the CLIP approach. The specific hyperparameters for model architecture and size (Tables 19 and 20 in the Methodology) are systematically varied to explore this scaling.

**Temperature Parameter ($\tau$):** The paper states that the `temperature parameter` $\tau$ is directly optimized during training as a `log-parameterized multiplicative scalar` to avoid tuning it as a hyperparameter. This is an intelligent design choice that automates a critical parameter affecting the `logit scaling` in the `contrastive loss`, which can significantly impact `training stability` and `performance`.
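To make this parameterization concrete, here is a minimal PyTorch sketch of a symmetric contrastive loss with a learnable, log-parameterized logit scale. The 0.07 initialization and the clamp at 100 follow details reported in the paper, but the module name `ContrastiveHead` and the overall structure are illustrative, not the paper's exact training code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """Sketch of a symmetric image-text contrastive loss with a learnable temperature."""

    def __init__(self, init_tau: float = 0.07):
        super().__init__()
        # Learn log(1/tau) directly so the optimizer tunes the temperature itself.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_tau)))

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # L2-normalize embeddings so the dot product is a cosine similarity.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Exponentiate the learned scalar to recover 1/tau; clamp to keep logits bounded.
        scale = self.logit_scale.exp().clamp(max=100.0)
        logits_per_image = scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        # Matched (image, text) pairs lie on the diagonal of the similarity matrix.
        labels = torch.arange(image_features.size(0), device=image_features.device)
        loss_i = F.cross_entropy(logits_per_image, labels)
        loss_t = F.cross_entropy(logits_per_text, labels)
        return (loss_i + loss_t) / 2
```

Because the temperature enters the loss only through `exp(logit_scale)`, gradient descent adjusts it jointly with the encoders, removing one sensitive hyperparameter from the search space.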
**Resolution Scaling:** The ViT-L/14 model is pre-trained at a higher 336-pixel resolution for one additional epoch to boost performance (ViT-L/14@336px). This FixRes-style approach demonstrates that increasing input resolution can further enhance performance for large Vision Transformers, even with minimal additional training.
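As referenced in the Prompt Engineering and Ensembling paragraph above, the sketch below shows one way to build a prompt-ensembled zero-shot classifier: embed several prompt templates per class, average the normalized text embeddings, and classify images by cosine similarity. The `text_encoder`, `image_encoder`, and `tokenizer` callables are hypothetical stand-ins, and the four templates are placeholders for the much larger prompt ensemble used in the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt templates; the paper's released ensemble uses many more.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the small {}.",
    "a photo of the large {}.",
]

@torch.no_grad()
def build_zero_shot_classifier(class_names, text_encoder, tokenizer):
    """Average normalized text embeddings over prompts, giving one weight vector per class."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = text_encoder(tokenizer(prompts))          # (num_templates, dim)
        emb = F.normalize(emb, dim=-1).mean(dim=0)      # ensemble in embedding space
        weights.append(F.normalize(emb, dim=-1))        # re-normalize the averaged vector
    return torch.stack(weights, dim=1)                  # (dim, num_classes)

@torch.no_grad()
def predict(images, image_encoder, classifier_weights):
    feats = F.normalize(image_encoder(images), dim=-1)  # (batch, dim)
    logits = feats @ classifier_weights                 # cosine similarity per class
    return logits.argmax(dim=-1)
```

Averaging in embedding space rather than over per-prompt probabilities collapses the ensemble into a single weight vector per class, so inference costs the same as using one prompt.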
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper investigates whether the success of task-agnostic, web-scale pre-training in Natural Language Processing (NLP) transfers to computer vision. The authors demonstrate that Contrastive Language-Image Pre-training (CLIP), a simple yet scalable approach that predicts which caption goes with which image on a massive 400-million (image, text) pair dataset (WIT), enables visual models to learn a wide variety of tasks during pre-training. This task learning can then be leveraged through natural language prompting to achieve zero-shot transfer to numerous existing computer vision datasets. At sufficient scale, CLIP's performance is competitive with task-specific supervised models, and it exhibits significantly higher robustness to natural distribution shifts. The work highlights the emergence of predictable scaling behavior in multimodal learning and underscores the significant social implications of such flexible and powerful models.
7.2. Limitations & Future Work
The authors are transparent about several limitations of CLIP and suggest future research directions:
- **Performance Gap to SOTA:** While zero-shot CLIP is competitive with a simple ResNet-50 linear-classifier baseline, it still underperforms the overall state of the art on many datasets, particularly fully supervised models. The authors estimate that roughly a 1000x increase in compute would be needed for zero-shot CLIP to reach overall state-of-the-art performance, which is currently infeasible; future work needs to focus on improving computational and data efficiency.
- **Weak Performance on Specific Tasks:** CLIP's zero-shot performance remains weak on several kinds of tasks: fine-grained classification (differentiating car models, flower species, or aircraft variants), abstract or systematic tasks (counting objects in an image), and novel tasks that are unlikely to appear in the pre-training dataset (e.g., classifying the distance to the nearest car).
- **Brittle Generalization to Truly Out-of-Distribution (OOD) Data:** Despite strong robustness to natural distribution shifts, CLIP can still fail catastrophically on truly out-of-distribution data. The example of MNIST handwritten digits, where CLIP's OCR performance is poor, illustrates that training on a large, varied dataset does not guarantee robustness to all unseen domains. This suggests CLIP circumvents the brittle-generalization problem rather than addressing its underlying cause.
- **Limited Output Flexibility:** CLIP is currently limited to choosing among the predefined concepts of a given zero-shot classifier; it cannot generate novel outputs the way an image captioning model can. Future work could explore joint training of contrastive and generative objectives, or search over natural-language explanations at inference time, to combine efficiency with flexibility.
- **Data Inefficiency:** CLIP does not address the poor data efficiency of deep learning; instead it compensates with web-scale supervision. Training on 400 million images for 32 epochs means processing 12.8 billion images, which would take roughly 405 years to review manually at about one image per second (see the short calculation after this list). Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) is a promising direction for improving data efficiency.
- **Methodological Limitations:**
  - *Validation Set Usage:* Repeated querying of full validation sets during development, while standard practice, is unrealistic for true zero-shot scenarios.
  - *Evaluation Dataset Selection:* The main 27-dataset evaluation suite, while diverse, was haphazardly assembled and co-adapted with CLIP's development; a new benchmark explicitly designed for broad zero-shot transfer is needed.
- **Social Biases:** Trained on unfiltered internet image-text pairs, CLIP learns many social biases, similar to image captioning models (Bhargava & Forsyth, 2019). The Broader Impacts section details denigration harms and gendered associations in classification. Future work is needed on broader, more contextual, and more robust bias testing and mitigation strategies.
- **Zero-shot vs. Few-shot Discrepancy:** CLIP's current few-shot setup (linear classifiers on top of its features) shows a counter-intuitive drop in performance compared to zero-shot, unlike humans, who show large gains from zero-shot to one-shot. Developing methods that combine strong zero-shot performance with efficient few-shot learning is an important direction.
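As a quick check on the data-inefficiency figure quoted above (assuming, as the bullet does, a review rate of about one image per second):

$$
4\times10^{8}\ \text{images} \times 32\ \text{epochs} = 1.28\times10^{10}\ \text{images},\qquad
\frac{1.28\times10^{10}\ \text{s}}{\approx 3.15\times10^{7}\ \text{s/year}} \approx 405\text{--}406\ \text{years}.
$$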
7.3. Personal Insights & Critique
CLIP represents a significant paradigm shift in computer vision, mirroring the pre-training revolution seen in NLP. The core insight that natural language can serve as a vast, weakly supervised signal for learning transferable visual representations is incredibly powerful. The most inspiring aspect is the demonstration of zero-shot transfer at a competitive level, effectively allowing users to "program" a visual classifier with natural language. This democratizes access to powerful computer vision capabilities, as it removes the immense barrier of collecting and labeling task-specific datasets.
The methods and conclusions of CLIP can definitely be transferred to other domains. The idea of multimodal contrastive learning from web-scale paired data is applicable wherever such data exists (e.g., video-text pairs, audio-text pairs, 3D model-text pairs). This opens avenues for more general perceptual AI systems that understand the world not just visually, but also through sound, motion, and other sensory inputs, all grounded in language.
However, the paper also raises crucial ethical considerations, particularly regarding social biases and surveillance. The bias probes revealed that CLIP learns and can perpetuate harmful stereotypes (e.g., disproportionately classifying "Black" images into "non-human" categories, or "male" images into "crime-related" categories). The finding that class design (e.g., adding a "child" category) can drastically alter bias manifestation is a critical insight, highlighting that AI developers wield immense power in shaping model behavior and impact. This calls for greater awareness and responsibility in defining classes and setting decision thresholds.
A potential issue or unverified assumption is that simply increasing compute and data scale will continue to solve all generality and robustness problems. While scaling laws show predictable improvements, the MNIST example and performance on abstract tasks (like counting) suggest there are fundamental limitations to this brute-force approach. Truly out-of-distribution generalization and higher-level systematic reasoning might require architectural innovations beyond just scaling Transformers and ResNets. The observation that supervised fine-tuning reduces effective robustness is also a critical point for future research: does task-specific fine-tuning inherently lead models to rely on spurious correlations regardless of pre-training, or can more robust fine-tuning strategies be developed?
My personal critique is that while the paper acknowledges limitations and societal impacts, the emphasis is still heavily on performance metrics. The community needs to evolve evaluation metrics to explicitly include measures of fairness, robustness to unseen biases, and interpretability during development, not just as an afterthought. The "omni-use" nature of CLIP, making it easy to create "bespoke, niche surveillance use cases," demands a more proactive regulatory and ethical framework from researchers and policymakers. The paper's call for community exploration to characterize capabilities and biases is vital, but the responsibility to lead on ethical development often falls on the creators of such powerful foundational models.