
BioCLIP: A Vision Foundation Model for the Tree of Life

Published: 12/01/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces BioCLIP, a vision foundation model for the tree of life, leveraging the largest and most diverse biology image dataset, TreeOfLife-10M. BioCLIP significantly outperforms existing models in fine-grained biology classification, showcasing its strengths in diverse zero-shot and few-shot settings and its ability to generalize to taxa unseen during training.

Abstract

Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. https://imageomics.github.io/bioclip has models, data and code.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "BioCLIP: A Vision Foundation Model for the Tree of Life". This title clearly indicates the paper's central topic: the development of a vision foundation model specifically tailored for biological organisms, aiming to cover the vast diversity encapsulated by the "tree of life".

1.2. Authors

The authors of the paper are:

  • Samuel Stevens

  • Jiaman Wu

  • Matthew J Thompson

  • Elizabeth G Campolongo

  • Chan Hee Song

  • David Edward Carlyn

  • Li Dong

  • Wasila M Dahdul

  • Charles Stewart

  • Tanya Berger-Wolf

  • Wei-Lun Chao

  • Yu Su

    Their affiliations are primarily with The Ohio State University, with contributions also from Microsoft Research, the University of California, Irvine, and Rensselaer Polytechnic Institute. This multi-institution team, which includes an industrial research lab, reflects a collaboration that bridges academic and industry expertise in machine learning and biology.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for scientific papers. While arXiv itself is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research in fields like computer science, physics, mathematics, and biology. Papers often appear on arXiv before formal peer review and publication in a conference or journal. The reputation and influence of arXiv are significant in allowing researchers to share their work promptly and gather feedback from the broader scientific community.

1.4. Publication Year

The paper was published on 2023-11-30.

1.5. Abstract

The abstract introduces the increasing abundance of natural world images and the explosion of computational methods, especially computer vision, to extract biological information. It highlights a critical gap: most existing methods are bespoke, task-specific, and lack adaptability. The paper addresses this need by presenting TREEOFLIFE-10M, the largest and most diverse machine learning (ML)-ready dataset of biology images. Using this dataset, they develop BioCLIP, a foundation model for the tree of life. BioCLIP leverages the unique properties of biology, such as the variety of organisms (plants, animals, fungi) and rich structured biological knowledge. The model is rigorously benchmarked on diverse fine-grained biology classification tasks, where it consistently and substantially outperforms existing baselines by 16% to 17% absolute. Intrinsic evaluation further shows that BioCLIP learns a hierarchical representation conforming to the tree of life, explaining its strong generalizability.

The original source link is https://arxiv.org/abs/2311.18803. The PDF link is https://arxiv.org/pdf/2311.18803v3.pdf. This paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of a general-purpose, adaptable vision model for organismal biology. Images of the natural world, from various sources like drones, camera traps, and citizen science platforms, are becoming increasingly abundant. This has led to a proliferation of computational tools, especially computer vision methods, to extract biologically relevant information for science and conservation.

However, the current landscape of computer vision applications in biology is fragmented. Most existing approaches are "bespoke," meaning they are custom-designed for a specific task, a particular set of taxa (e.g., a specific species or genus), and a limited dataset. This makes them difficult to adapt, extend, or generalize to new biological questions, contexts, or datasets, significantly limiting their broader utility for biological research and conservation efforts. Biologists often face a laborious process of manually labeling data and training models for each new task, requiring substantial machine learning expertise.

This problem is important because biological research spans a vast "tree of life," encompassing millions of diverse species with complex relationships. An adaptable vision model could dramatically accelerate scientific discovery, aid in biodiversity monitoring, and enhance conservation strategies by lowering the barrier for biologists to apply AI. The specific challenges or gaps in prior research include:

  1. Generalization: Existing models struggle to generalize across the entire tree of life or to taxa not present in their training data. It's infeasible to collect training data for millions of known taxa.

  2. Fine-grained Representation: Biology often requires distinguishing between visually similar organisms (e.g., closely related species, mimics). General-domain vision models, even those trained on massive datasets, often lack the necessary fine-grained representations for such distinctions.

  3. Low-data Regime: Data collection and labeling in biology are expensive. Models need to perform strongly in zero-shot or few-shot learning scenarios where data for new or rare species is scarce.

  4. Suitable Datasets and Pre-training Strategies: Existing biological image datasets often lack the scale, diversity, or fine-grained taxonomic labels required for training foundation models. Furthermore, mainstream pre-training algorithms typically don't leverage the rich hierarchical structure of biological taxonomy.

    The paper's entry point and innovative idea is to develop a vision foundation model that overcomes these challenges by:

  1. Curating a large-scale, diverse, and ML-ready dataset: TREEOFLIFE-10M, which explicitly incorporates taxonomic labels and hierarchies.

  2. Developing a novel pre-training strategy: Adapting CLIP's multimodal contrastive learning objective to explicitly leverage the hierarchical "tree of life" taxonomy, thereby enabling better generalization and fine-grained representation learning.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field:

  1. TREEOFLIFE-10M Dataset: They curate and release TREEOFLIFE-10M, which is presented as the largest and most diverse ML-ready dataset of biology images to date. It contains over 10 million images covering 454 thousand taxa, significantly expanding upon previous datasets like iNat21 in terms of scale and diversity by integrating data from iNat21, BIOSCAN-1M, and newly curated images from the Encyclopedia of Life (EOL). Crucially, every image is associated with its full taxonomic hierarchy.

  2. BioCLIP Foundation Model: They develop BioCLIP, a vision foundation model specifically designed for the tree of life. BioCLIP is built upon the CLIP architecture but leverages the unique properties of biology data, particularly the abundance and variety of images of plants, animals, and fungi, and the availability of rich structured biological knowledge (taxonomy).

  3. Novel Pre-training Strategy: Instead of standard supervised classification (which treats labels as isolated symbols), BioCLIP employs a novel strategy. It combines CLIP-style multimodal contrastive learning with biological taxonomy by "flattening" the taxonomic hierarchy (from Kingdom to species) into a single string called a "taxonomic name." This allows the model to learn to match images with their corresponding taxonomic names, thereby embedding the hierarchical relationships. They also propose and demonstrate the effectiveness of a mixed text type training strategy to enhance flexibility at inference time.

  4. Comprehensive Benchmarking: They rigorously benchmark BioCLIP on 10 diverse fine-grained biology classification tasks, including a newly curated RARE SPECIES dataset designed to test generalization to unseen taxa.

  5. Superior Performance and Generalizability: BioCLIP consistently and substantially outperforms existing baselines, including CLIP and OpenCLIP, by an average absolute improvement of 17% in zero-shot settings and 16% in few-shot settings.

  6. Intrinsic Evaluation Revealing Hierarchical Representation: Intrinsic evaluation (e.g., t-SNE visualization) reveals that BioCLIP learns a more fine-grained, hierarchical representation of images that conforms to the tree of life, explaining its strong generalizability to novel taxa and fine-grained distinctions.

    These findings solve the problems of limited generalization, insufficient fine-grained representation, and poor performance in low-data regimes for biological image analysis. By providing both a large-scale dataset and a specialized foundation model, BioCLIP significantly lowers the barrier for applying AI to biology.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the paper "BioCLIP: A Vision Foundation Model for the Tree of Life," a reader should be familiar with several key concepts from machine learning, especially computer vision and natural language processing, as well as basic biological classification principles.

  • Foundation Models:

    • Conceptual Definition: A foundation model is a large machine learning model trained on a vast amount of broad data (unlabeled and/or labeled) at scale. Once pre-trained, these models can be adapted (fine-tuned) to a wide range of downstream tasks with little or no additional training data (e.g., zero-shot or few-shot learning). They typically exhibit emergent properties and strong generalization capabilities due to their scale and diverse training data. Examples include GPT-3 for language and CLIP for vision-language tasks.
    • Importance: They enable rapid development of AI applications by reducing the need for task-specific model development and large labeled datasets for every new problem.
  • CLIP (Contrastive Language-Image Pre-training):

    • Conceptual Definition: CLIP is a multimodal foundation model developed by OpenAI that learns visual concepts from natural language supervision. It is trained on a massive dataset of image-text pairs (e.g., images found online with their captions). CLIP consists of two separate encoders: a vision encoder (e.g., a Vision Transformer) and a text encoder (e.g., a Transformer-based language model).
    • How it Works (Contrastive Learning Objective): During training, CLIP learns to associate images with their corresponding text descriptions (positive pairs) and distinguish them from non-corresponding image-text pairs (negative pairs). The objective is to maximize the cosine similarity (a measure of similarity between two non-zero vectors) between the embeddings of positive image-text pairs and minimize the similarity between negative pairs. After training, both encoders map their respective inputs into a shared, high-dimensional embedding space where similar image and text concepts are located close to each other.
    • Zero-shot Capability: This shared embedding space allows CLIP to perform zero-shot classification. For example, to classify an image into one of several categories, CLIP computes the similarity between the image's embedding and the embeddings of text descriptions of each category (e.g., "a photo of a cat," "a photo of a dog"). The category with the highest similarity is chosen as the prediction, without ever having seen labeled examples of those categories during training. (A minimal code sketch of this procedure appears after this list.)
  • Vision Transformer (ViT):

    • Conceptual Definition: ViT is an image encoder architecture that applies the Transformer architecture (originally developed for natural language processing) directly to sequences of image patches. Instead of using convolutional layers, ViT divides an image into fixed-size patches, linearly embeds each patch, adds positional embeddings, and then feeds the resulting sequence into a standard Transformer encoder.
    • Role in CLIP: ViT is commonly used as the vision encoder in CLIP models, allowing them to process visual information effectively and leverage the self-attention mechanism for global contextual understanding within images.
  • Taxonomy / Tree of Life:

    • Conceptual Definition: In biology, taxonomy is the scientific classification of organisms. The Tree of Life represents the evolutionary relationships among all living organisms, organized hierarchically. Organisms are grouped into a nested hierarchy of ranks, typically including: Kingdom, Phylum, Class, Order, Family, Genus, and Species. For example, a human would be: Animalia (Kingdom), Chordata (Phylum), Mammalia (Class), Primates (Order), Hominidae (Family), Homo (Genus), sapiens (Species).
    • Importance: This hierarchy provides a structured way to understand biodiversity and evolutionary relationships, which is a key signal BioCLIP aims to leverage.
  • Zero-shot Learning:

    • Conceptual Definition: Zero-shot learning (ZSL) refers to the ability of a model to recognize or classify instances of classes that were not seen during training. This is typically achieved by leveraging auxiliary information (like text descriptions or attributes) that links seen and unseen classes. In CLIP, the text encoder provides this auxiliary information.
  • Few-shot Learning:

    • Conceptual Definition: Few-shot learning (FSL) is a machine learning paradigm where a model can learn to classify new classes given only a very small number of labeled examples (e.g., 1, 5, or 10 examples per class). This is crucial in domains like biology where extensive labeled data is often unavailable.
  • Fine-grained Classification:

    • Conceptual Definition: Fine-grained classification involves distinguishing between subordinate categories that are visually very similar (e.g., different species of birds, different car models). This contrasts with coarse-grained classification (e.g., distinguishing between a cat and a dog). It requires models to learn subtle visual cues.
  • ML-ready Dataset:

    • Conceptual Definition: An ML-ready dataset is a dataset that has been processed, cleaned, and formatted in a way that makes it immediately usable for machine learning model training and evaluation. This includes tasks like data collection, annotation, normalization, handling missing values, and structuring labels consistently.
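
To make the zero-shot procedure described above concrete, the following minimal sketch scores one image against a handful of candidate labels by cosine similarity in the shared embedding space. It assumes the open_clip_torch package and a hypothetical input file bird.jpg; the OpenAI ViT-B/16 checkpoint is used purely as a stand-in (BioCLIP's released weights, available from the project page, can be loaded in the same way).

    import torch
    import open_clip  # assumes the open_clip_torch package is installed
    from PIL import Image

    # Load a CLIP ViT-B/16 checkpoint; BioCLIP's released weights can be substituted here.
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-16")
    model.eval()

    # Candidate labels, wrapped in the "a photo of ..." template used by CLIP-style models.
    labels = ["black-billed magpie", "blue jay", "American crow"]
    text_tokens = tokenizer([f"a photo of {name}" for name in labels])

    image = preprocess(Image.open("bird.jpg")).unsqueeze(0)  # hypothetical input image

    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text_tokens)
        # Cosine similarity = dot product of L2-normalized embeddings.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

    print(labels[probs.argmax().item()])  # zero-shot prediction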

3.2. Previous Works

The paper contextualizes BioCLIP by referencing several prior works across foundation models, domain-specific AI, and hierarchical computer vision.

  • General Foundation Models:

    • CLIP [69]: OpenAI's Contrastive Language-Image Pre-training model is a direct precursor and inspiration. BioCLIP builds upon CLIP's architecture and contrastive learning objective. CLIP was trained on noisy, web-scale image-text datasets (100M+ pairs) and showed state-of-the-art zero-shot capabilities.
    • GPT-3 [14]: OpenAI's large language model, a prominent example of a foundation model demonstrating strong zero-shot and few-shot learning for language tasks.
    • ALIGN [45] and BASIC [65]: Further scaled multimodal pre-training to billions of image-text pairs, improving vision representation quality.
    • ResNet [33] and Swin Transformer [48]: General domain vision models that typically use a supervised classification objective on class indices. The paper contrasts BioCLIP's approach with these, highlighting how BioCLIP leverages the rich label structure rather than treating labels as discrete symbols.
    • DINO [15]: A self-supervised vision transformer model, used as a baseline for few-shot classification.
  • Biology-specific Datasets and AI:

    • iNat21 [86]: The previous largest ML-ready biology image dataset, with 2.7M images covering 10K species. TREEOFLIFE-10M integrates iNat21 and significantly expands on its diversity.
    • BIOSCAN-1M [28]: A recent dataset of 1M lab images of insects. TREEOFLIFE-10M also incorporates this, adding diverse image distributions (lab vs. in situ).
    • Computer vision applications in biology: The paper cites numerous works on using digital images and computer vision for evolutionary biology [13, 51], ecology and biodiversity [5, 77, 83], species classification [32], individual identification, trait detection [23, 39], abundance estimation [3, 40, 58, 82], and biodiversity monitoring [83]. These works highlight the existing demand for robust biological vision models.
  • Domain-specific CLIPs:

    • The paper notes a trend of domain-specific CLIPs outperforming general models [18, 30]. Examples include Ikezogwo et al. [41] and Lu et al. [50] who gathered 1M+ image-text pairs for computational pathology. BioCLIP follows this trend but at a larger scale (10M+ images) and with a strong emphasis on taxonomic diversity in biology.
  • Hierarchy in Computer Vision:

    • ImageNet [70]: Its classes are organized according to the hierarchical WordNet [55] structure, making hierarchy a long-standing topic in CV.
    • Bilal et al. [10]: Studied ImageNet model predictions and found that confusion patterns follow hierarchical class structures, and incorporating hierarchy into AlexNet improved performance.
    • Bertinetto et al. [9]: Measured mistake severity in image classifiers and proposed alternative objectives incorporating hierarchy to reduce severe errors.
    • Zhang et al. [96]: Proposed a contrastive objective where hierarchical distance between labels corresponded to desired embedding distance, outperforming cross-entropy on ImageNet and iNat17.
    • BioCLIP's novel contribution in this area is applying a repurposed CLIP objective to leverage a comprehensive biological taxonomy with 454K unique classes, a much larger scale than prior work.

3.3. Technological Evolution

The technological evolution leading to BioCLIP can be summarized as a progression from:

  1. Task-specific, bespoke computer vision models for biology: Early applications required significant manual effort for data labeling and model training for each narrow biological problem. These models were often limited in scope and generalizability.

  2. General-domain vision models (e.g., ResNet, Swin Transformer): These models achieved impressive performance on broad computer vision tasks, but still struggled with the fine-grained distinctions and vast diversity inherent in biology, especially when trained with standard classification objectives that ignore hierarchical relationships.

  3. Multimodal foundation models (e.g., CLIP): The advent of CLIP demonstrated the power of learning visual representations from natural language supervision, enabling strong zero-shot and few-shot capabilities. However, general CLIP models were not optimized for the specific challenges of biological taxonomy (e.g., Latin names, complex hierarchies, extreme fine-grainedness).

  4. Domain-specific adaptation of foundation models: Researchers began adapting CLIP and similar architectures to specific domains (e.g., pathology, fashion) to achieve better performance by leveraging domain-specific data and knowledge.

    BioCLIP fits into this timeline by being a cutting-edge example of the fourth stage. It is a domain-specific adaptation of a foundation model (CLIP) that explicitly addresses the unique challenges and leverages the unique properties of the biological domain. It moves beyond general-purpose models by incorporating the vast, hierarchical Tree of Life taxonomy into its pre-training strategy and building the largest, most diverse ML-ready dataset for this domain.

3.4. Differentiation Analysis

Compared to the main methods in related work, BioCLIP introduces several core differences and innovations:

  • Leveraging Biological Hierarchy with Contrastive Learning:

    • Difference from Standard Supervised Models (ResNet, Swin Transformer): Traditional supervised models treat each class label as a discrete, unrelated entity. BioCLIP, in contrast, explicitly leverages the rich hierarchical structure of biological taxonomy (Kingdom, Phylum, Class, etc.).
    • Innovation: Instead of using hierarchical classification (which sums losses across levels, as explored in their ablations), BioCLIP creatively repurposes CLIP's multimodal contrastive learning. It "flattens" the entire taxonomic path into a single taxonomic name string and uses CLIP to learn to match images with these hierarchical text representations. This is a novel way to embed hierarchical relationships into the CLIP embedding space, which they show significantly outperforms hierarchical classification with cross-entropy.
  • Scale and Diversity of Biological Dataset:

    • Difference from iNat21, BIOSCAN-1M: While these are valuable datasets, they are significantly smaller and less diverse in terms of the number of unique taxa covered.
    • Innovation: TREEOFLIFE-10M is a massive leap in scale (10M+ images) and taxonomic diversity (454K+ taxa), meticulously aggregated from multiple sources including iNat21, BIOSCAN-1M, and especially the Encyclopedia of Life (EOL). This unparalleled scale and diversity are crucial for training a truly generalized foundation model for biology.
  • Enhanced Generalization to Unseen Taxa and Fine-grained Distinctions:

    • Difference from General CLIP/OpenCLIP: While general CLIP models offer good zero-shot capabilities, they are not optimized for the specific, often Latin-based, and highly fine-grained labels of biological organisms. They perform better with common names, which can be ambiguous.
    • Innovation: By training with taxonomic names and a mixed text type strategy, BioCLIP learns representations that are inherently more fine-grained and generalize better to unseen species (as demonstrated on the RARE SPECIES dataset) and various label formats. Its intrinsic evaluation reveals a clear hierarchical clustering in its embedding space, a feature lacking in general CLIP models when applied to biological data.
  • Flexibility in Text Types at Inference:

    • Innovation: The mixed text type training strategy allows BioCLIP to retain the generalization benefits of taxonomic names while being robust and flexible when only common names or scientific names are available at inference time. This addresses a practical need for biologists who might use different naming conventions.

      In essence, BioCLIP differentiates itself by specifically engineering the CLIP framework and its training data to address the unique structural (hierarchy), scale (diversity of taxa), and practical (fine-grained, low-data, varied naming conventions) challenges of organismal biology.

4. Methodology

4.1. Principles

The core idea behind BioCLIP is to develop a general-purpose vision foundation model that can effectively understand and classify images across the entire Tree of Life. This is achieved by leveraging two key principles:

  1. Multimodal Contrastive Learning: Building upon the success of CLIP, BioCLIP uses a contrastive learning objective to learn joint representations of images and text. This allows the model to learn semantic correspondences between visual features of organisms and their textual descriptions, enabling powerful zero-shot and few-shot classification capabilities.

  2. Explicitly Leveraging Biological Taxonomy: Unlike general-domain CLIP models or standard supervised classifiers that treat labels as independent symbols, BioCLIP recognizes and explicitly integrates the rich hierarchical structure of biological taxonomy. The intuition is that if the model learns the relationships within the Tree of Life (e.g., that different species belong to the same genus, which belongs to the same family, etc.), it will generalize better to unseen taxa. For example, even if it hasn't seen images of a particular species, it might have learned robust representations for its genus or family, which can then be used to infer information about the new species.

    These principles combine to allow BioCLIP to learn fine-grained, hierarchically structured visual representations that are broadly applicable across diverse biological tasks and can generalize to novel species.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. TREEOFLIFE-10M: The Large-Scale, Diverse ML-Ready Biology Image Dataset

BioCLIP's foundation begins with TREEOFLIFE-10M, a meticulously curated dataset designed to overcome the limitations of previous biological image datasets in terms of scale, diversity, and label richness.

  • Problem with Existing Datasets:
    • iNat21 [86]: While large for biology (2.7M images, 10K species), its species diversity is limited compared to the millions of known species. For instance, there are over 2M described species, with 10K+ bird species and 10K+ reptile species alone. This limits its potential for a foundation model for the entire Tree of Life.
  • Data Sources and Integration:
    • Encyclopedia of Life (EOL): The authors recognized EOL (eol.org) as a crucial source for high-quality, diverse biology images. They downloaded 6.6 million images from EOL, which significantly expanded the dataset to cover an additional 440K taxa. This addresses the diversity gap.
    • BIOSCAN-1M [28]: This dataset (1.1M images) primarily contains lab images of insects, covering 494 different families. Its inclusion is important for two reasons:
      1. It provides extremely fine-grained visual representations for insects, a highly diverse subtree of the Tree of Life.
      2. It diversifies the image distribution by including lab images, which differ significantly from the in-situ (natural environment) images typical of iNat21.
    • iNat21 (training split): The training split of iNat21 was integrated to further bolster the dataset.
  • Metadata & Aggregation Challenges:
    • Integrating these diverse sources is a non-trivial task due to the inherent noise and inconsistency in taxonomic hierarchies across different biological databases [4, 31, 36, 52, 63].

    • The authors canonicalized (standardized) the labels and unified taxonomic hierarchies from EOL, the Integrated Taxonomic Information System (ITIS) [43], and iNaturalist.

    • Special consideration was given to homonyms (genus-species labels shared among higher-order taxa) to ensure correct linkage.

    • For any taxa that couldn't be resolved by these primary sources, the Global Names Resolver (GNR) API was used.

    • Result: This rigorous process allowed for 84% full taxa labeling for images in TREEOFLIFE-10M, with about 10% labeled down to the family rank (due to BIOSCAN-1M not always having genus-species info).

    • Dataset Statistics (Table 1): The following are the results from Table 1 of the original paper:

      Dataset Description Images Unique Classes
      iNat21 Citizen scientist labeled image dataset from iNaturalist for fine-grained classification. 2.7M 10,000
      BIOSCAN-1M Expert labeled image dataset of insects for classification. 1.1M 7,831
      EOL A new dataset with citizen scientist images sourced from Encyclopedia of Life and taxonomic labels standardized by us. 6.6M 448,910
      TREEOFLIFE-10M Largest-to-date ML-ready dataset of biology images with taxonomic labels. 10.4M 454,103
      • TREEOFLIFE-10M totals over 10.4 million images across more than 454,103 unique taxonomic names, representing a significant increase in scale and diversity over previous datasets.
  • Release: The dataset (and a smaller RARE SPECIES test set) are released on Hugging Face with DOIs, including CSVs with metadata and links to primary sources, along with GitHub scripts for generation.

4.2.2. BioCLIP: A Vision Foundation Model for the Tree of Life

BioCLIP is built on a CLIP architecture, but its training strategy is specifically designed to leverage the taxonomic hierarchy present in TREEOFLIFE-10M.

  • Initialization: BioCLIP is initialized from OpenAI's public CLIP weights, specifically using a ViT-B/16 (Vision Transformer Base, patch size 16x16) as the vision encoder and a 77-token causal autoregressive transformer as the text encoder. This warm start leverages pre-learned general visual and language representations.

  • Why CLIP's Objective is Crucial for Taxonomy:

    • Challenge with Standard Supervised Classification: A common strategy for labeled datasets like TREEOFLIFE-10M is to use a supervised classification objective, where the model learns to map an image to a taxonomic index. However, this treats each taxon (e.g., a species) as an independent symbol. It fails to account for the rich, interconnected hierarchical structure of the Tree of Life. Consequently, such a model would struggle to generalize to unseen taxa or support zero-shot classification of new species.
    • Repurposing CLIP's Multimodal Contrastive Learning: The authors propose that CLIP's contrastive learning objective can be repurposed to explicitly leverage this hierarchical structure.
      • "Flattening" the Taxonomy: For each species, the complete taxonomic hierarchy (from Kingdom down to the most specific taxon rank available, e.g., species) is concatenated into a single string. This string is called the taxonomic name. For example, for a black-billed magpie, the taxonomic name might be "Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia."
      • Contrastive Learning with Taxonomic Names: The CLIP contrastive learning objective is then used to train the model to match images with their corresponding taxonomic names. This means the vision encoder learns to produce embeddings for images, and the text encoder learns to produce embeddings for these taxonomic name strings. The training optimizes for high similarity between correct image-text pairs and low similarity for incorrect pairs.
      • Generalization Mechanism: This approach intuitively aids generalization. If the model hasn't seen a particular species, it has likely encountered its higher taxonomic ranks (e.g., genus, family) in other taxonomic names. The autoregressive text encoder naturally embeds these hierarchical relationships by conditioning later taxonomic rank representations on earlier (higher) ranks. This provides a strong prior for few-shot or zero-shot learning of new taxa.
      • Technical Contribution: The authors emphasize that repurposing CLIP's objective for learning hierarchical representations conforming to a taxonomy is a novel and non-trivial technical contribution.
  • Text Types for Flexibility:

    • CLIP's text encoder accepts free-form text, which is a powerful feature for biological labels, as they come in various formats. The paper considers several text types for both training and inference. The following are the results from Table 3 of the original paper:

      Text Type Example
      Common black-billed magpie
      Scientific Pica hudsonia
      Taxonomic Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia
      Scientific + Common Pica hudsonia with common name black-billed magpie
      Taxonomic + Common Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia with common name black-billed magpie
      • Common name: (e.g., "black-billed magpie") More widespread, but can be ambiguous (one species, multiple common names; one common name, multiple species).
      • Scientific name: (e.g., "Pica hudsonia") Standardized binomial nomenclature (genus + species).
      • Taxonomic name: (e.g., "Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia") The "flattened" full hierarchical path.
      • Scientific + Common name & Taxonomic + Common name: Combinations to provide more context.
    • Mixed Text Type Training Strategy: To improve flexibility at inference time (as users might only have one type of label), the authors propose a mixed text type training strategy. At each training step, for a given image, a text label is randomly sampled from all its available text types (e.g., Common, Scientific, Taxonomic). This strategy helps BioCLIP retain the generalization benefits of taxonomic names while being adaptable to different naming conventions during inference. The final text input to CLIP is formatted using a standard template, e.g., "a photo of Pica hudsonia."
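
As an illustration of the "flattened" taxonomic name and the mixed text type sampling described above, the following sketch builds the candidate text strings for one species and samples one per training step. The helper functions and the dictionary format are hypothetical conveniences, not the authors' released code.

    import random

    RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

    def taxonomic_name(taxon):
        # Flatten the hierarchy (Kingdom down to species) into a single string.
        return " ".join(taxon[r] for r in RANKS if taxon.get(r))

    def text_candidates(taxon, common=None):
        # All text types available for this image (cf. Table 3).
        sci = f"{taxon['genus']} {taxon['species']}"
        tax = taxonomic_name(taxon)
        texts = [sci, tax]
        if common:
            texts += [common,
                      f"{sci} with common name {common}",
                      f"{tax} with common name {common}"]
        return texts

    magpie = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves",
              "order": "Passeriformes", "family": "Corvidae",
              "genus": "Pica", "species": "hudsonia"}

    # Mixed text type training: at each step, randomly pick one of the available
    # text types for the image, then apply the standard template.
    sampled = random.choice(text_candidates(magpie, common="black-billed magpie"))
    print(f"a photo of {sampled}")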

4.2.3. Training Details

  • Initialization: BioCLIP begins with weights from OpenAI's CLIP model (specifically the ViT-B/16 architecture). This is a form of continual pre-training.

  • Dataset: TREEOFLIFE-10M.

  • Epochs: 100 epochs.

  • Learning Rate Schedule: Cosine learning rate schedule [49] with 1,000 warm-up steps (a minimal sketch appears after the hyperparameter tables below).

  • Hardware: 8 NVIDIA A100-80GB GPUs across 2 nodes.

  • Global Batch Size: 32,768.

  • Baseline Training: A baseline model trained only on iNat21 (same architecture) and ablation models on 1M examples from TREEOFLIFE-10M used a smaller global batch size of 16,384 on 4 NVIDIA A100 GPUs on 1 node.

  • Hyperparameters (Table D1 & D2): The following are the results from Table D1 of the original paper:

    Hyperparameter Value
    Architecture ViT-B/16
    Max learning rate 1 × 10⁻⁴
    Warm-up steps 1,000
    Weight Decay 0.2
    Input Res. 224 × 224

    The following are the results from Table D2 of the original paper:

    Dataset Text Type Batch Size Epochs
    TreeOfLife-10M Mixture 32K 100
    iNat21 Only Mixture 16K 65
    TreeOfLife-1M Tax+Com 16K (not shown)
    TreeOfLife-1M Common 16K 86
    TreeOfLife-1M Scientific 16K 87
    TreeOfLife-1M Taxonomy 16K 87
    TreeOfLife-1M Sci+Com 16K 87

    Note: In the original Table D2, some batch size and epoch values were lost in extraction. The TreeOfLife-1M ablation runs share the 16K global batch size (consistent with the training details above), while the epoch count for the Tax+Com run is not recoverable here.
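
The warm-up plus cosine decay schedule referenced above can be sketched as follows, using the values from Table D1. The optimizer choice and the total step count are illustrative assumptions (the paper reports epochs, not steps), and this is not the authors' training code.

    import math
    import torch

    def warmup_cosine(step, warmup_steps=1_000, total_steps=100_000, max_lr=1e-4):
        # Linear warm-up to max_lr, then cosine decay toward zero.
        if step < warmup_steps:
            return max_lr * step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

    model = torch.nn.Linear(512, 512)  # stand-in for the CLIP image/text encoders
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.0, weight_decay=0.2)
    # LambdaLR multiplies the base lr (set to 1.0) by the value returned above,
    # so the schedule directly yields the absolute learning rate at each step.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

    for step in range(3):  # training loop elided
        optimizer.step()
        scheduler.step()
        print(step, scheduler.get_last_lr())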

4.2.4. Hierarchical Multitask Objective (for comparison)

To justify the choice of the CLIP objective, the authors compare it against a hierarchical classification approach.

  • Objective: To predict labels for each taxonomic rank (Kingdom, Phylum, Class, etc.) down to species.
  • Loss Function: Cross-entropy for each level of the taxonomy, with these losses summed up.
  • Pseudocode (Listing 1):
    import torch.nn.functional as F

    def forward(vit, heads, images, h_labels):
        # vit: vision transformer backbone.
        # heads: linear layers, one for each taxonomic rank.
        # images: batch of input images.
        # h_labels: hierarchical labels; each image has one label per rank (7 in total).
        img_feats = vit(images)                         # shared image features
        h_logits = [head(img_feats) for head in heads]  # logits per taxonomic rank
        losses = [F.cross_entropy(logits, labels)       # one cross-entropy loss per rank
                  for logits, labels in zip(h_logits, h_labels)]
        return sum(losses)                              # summed multitask objective
    
    • Explanation:
      • The function forward takes the vision transformer (vit), a list of heads (linear layers, one for each taxonomic rank), a batch of images, and h_labels (hierarchical labels for each image, e.g., 7 labels for Kingdom to Species).
      • img_feats = vit(images): The vision transformer (vit) processes the input images to extract dense visual features (img_feats).
      • h_logits = [head(img_feats) for head in heads]: For each taxonomic rank (e.g., Kingdom, Phylum), there is a dedicated linear layer (head). Each head takes the img_feats and produces logits (h_logits), which are raw prediction scores for the classes at that specific taxonomic rank.
      • losses = [F.cross_entropy(logits, labels) for logits, labels in zip(h_logits, h_labels)]: For each taxonomic rank, a cross-entropy loss is calculated between the predicted logits for that rank and the true labels (h_labels) for that rank.
      • return sum(losses): The final loss for the entire model is the sum of the cross-entropy losses from all taxonomic ranks. This encourages the ViT to learn image features that are useful for classifying images at multiple hierarchical levels simultaneously.

5. Experimental Setup

5.1. Datasets

The experiments utilize a large-scale training dataset (TREEOFLIFE-10M) and a suite of 10 diverse evaluation datasets.

  • Training Dataset:

    • TREEOFLIFE-10M: This is the primary training dataset developed in the paper.
      • Source: Aggregation of iNat21 (training split), BIOSCAN-1M, and newly curated images from the Encyclopedia of Life (EOL).
      • Scale: Over 10 million images.
      • Characteristics: Covers 454,103 unique taxa across plants, animals, and fungi. Includes diverse image distributions (in-situ, lab images, citizen science photos). Each image is labeled with its full taxonomic hierarchy.
      • Purpose: Designed to train a foundation model with broad coverage and fine-grained understanding of biological diversity.
  • Evaluation Datasets: The paper evaluates on 10 tasks, covering various kingdoms and image types. The following are the results from Table 2 of the original paper:

    Name Description Examples Classes Labels
    Birds 525 Scraped dataset of bird images from web search. [68] 89,885 525 Taxonomic
    Plankton Expert-labeled in situ images of plankton [35]. 4,080 102 Mixed
    Insects Expert and volunteer-labeled in-the-wild citizen science images of insects [74]. 4,680 117 Scientific
    Insects 2 Mixed common and scientific name classification for insect pests [91]. 4,080 102 Mixed
    PlantNet Citizen science species-labeled plant images, some drawings [27]. 1,000 25 Scientific
    Fungi Expert-labeled images of Danish fungi [66]. 1,000 25 Scientific
    PlantVillage Museum-style leaf specimens labeled with common names [25]. 1,520 38 Common
    Medicinal Leaf Species classification of leaves from mature, healthy medicinal plants [71]. 1,040 26 Scientific
    PlantDoc 17 diseases for 13 plant species [76]. 1,080 27 Common
    Rare Species Subset of species in the IUCN Red List categories: Near Threatened through Extinct in the Wild (iucnredlist.org). 12,000 400 Taxonomic
    • Rationale for Dataset Choice: These datasets were chosen because they:
      • Cover the Tree of Life: Include organisms from animals, plants, fungi, and protists (from Plankton).
      • Diverse Image Distributions: Feature photographs, microscope images, drawings, and museum specimens, ensuring the model's robustness across different visual contexts.
      • Fine-grained Nature: All are fine-grained classification tasks, directly testing BioCLIP's core capability.
      • Variety of Label Types: Labels range from full taxonomic names to scientific or common names, testing the flexibility of BioCLIP's mixed text type training.
      • RARE SPECIES: This newly curated dataset is particularly crucial. It contains species from the IUCN Red List (e.g., Near Threatened to Extinct in the Wild). Critically, these species were removed from TREEOFLIFE-10M training data. This makes RARE SPECIES an ideal benchmark for evaluating BioCLIP's out-of-distribution generalization to unseen taxa and its potential for conservation applications. It consists of 400 species, each with at least 30 images.

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is Accuracy, specifically Top-1 Accuracy.

  • Top-1 Accuracy:

    • Conceptual Definition: Top-1 accuracy measures the percentage of predictions where the model's highest-confidence prediction (the class with the highest probability or similarity score) matches the true label of the input. It's a straightforward and widely used metric in classification tasks, indicating how often the model gets the classification perfectly right.
    • Mathematical Formula: Let $N$ be the total number of test samples, $y_i$ the true label for sample $i$, and $\hat{y}_i$ the model's highest-confidence prediction for sample $i$. The Top-1 Accuracy is calculated as: $ \text{Top-1 Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i) $
    • Symbol Explanation:
      • $N$: Total number of samples in the test set.
      • $y_i$: The ground-truth (true) label for the $i$-th sample.
      • $\hat{y}_i$: The label predicted by the model with the highest confidence for the $i$-th sample.
      • $\mathbb{I}(\cdot)$: The indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise.
      • $\sum_{i=1}^{N}$: Summation over all $N$ samples.
  • Evaluation Modes:

    • Zero-shot Learning: Follows the standard CLIP procedure. The image embedding is compared against the text embeddings of candidate class labels (e.g., "a photo of [class name]"). The class with the highest similarity is chosen.
    • Few-shot Learning: Uses SimpleShot [90], a nearest-centroid classifier (a minimal code sketch appears at the end of this subsection).
      1. For kk-shot learning, kk examples are randomly sampled for each class.
      2. Image embeddings are obtained from the visual encoder of the pre-trained model for these kk examples.
      3. The centroid for each class is computed by averaging the kk image embeddings.
      4. All remaining examples in the dataset are used for testing.
      5. Mean subtraction and L2-normalization are applied to both centroids and test feature vectors.
      6. The prediction for a test vector is the class whose centroid is nearest in the embedding space (typically using cosine similarity or Euclidean distance).
      7. Each few-shot experiment is repeated 5 times with different random seeds, and the mean accuracy is reported. Standard deviations are provided in the Appendix.
    • Generalized Zero-Shot Learning (GZSL) (Appendix H): This setting requires the model to classify images from unseen classes within a set that also includes seen classes. BioCLIP is tested on RARE SPECIES images using a mixed set of 800 labels (400 seen, 400 unseen).
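
The SimpleShot-style nearest-centroid procedure described above can be sketched as follows, operating on precomputed image embeddings. The arrays are random stand-ins for BioCLIP features, and the centering statistic (the support-set mean) is an illustrative choice rather than a detail confirmed by the paper.

    import numpy as np

    def simpleshot_predict(support_emb, support_labels, query_emb):
        # support_emb: (n_classes * k, d) embeddings of the k-shot examples.
        # support_labels: (n_classes * k,) integer class ids.
        # query_emb: (n_query, d) embeddings of the test images.
        classes = np.unique(support_labels)
        # Class centroids: mean of the k support embeddings per class.
        centroids = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])

        # Mean subtraction followed by L2 normalization, applied to centroids and queries.
        mu = support_emb.mean(axis=0)
        def normalize(x):
            x = x - mu
            return x / np.linalg.norm(x, axis=-1, keepdims=True)
        centroids, queries = normalize(centroids), normalize(query_emb)

        # Nearest centroid (cosine similarity; equivalent to Euclidean after L2 norm).
        return classes[(queries @ centroids.T).argmax(axis=1)]

    # Toy usage: 3 classes, k = 5 shots, 512-dimensional features.
    rng = np.random.default_rng(0)
    support = rng.normal(size=(15, 512))
    labels = np.repeat(np.arange(3), 5)
    queries = rng.normal(size=(10, 512))
    print(simpleshot_predict(support, labels, queries))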

5.3. Baselines

The paper compares BioCLIP against a range of representative baseline models to demonstrate its effectiveness.

  • CLIP (OpenAI): The original CLIP model [69] with a ViT-B/16 vision transformer. This serves as a direct comparison point, as BioCLIP is a fine-tuned version of CLIP. By default, CLIP is evaluated with common names for class labels, as this is what it's generally most effective with due to its web-scale training data.

  • OpenCLIP: An open-source reproduction of CLIP trained on the LAION-400M dataset [73]. It represents another strong general-purpose CLIP-like model. Similar to OpenAI CLIP, it's primarily evaluated with common names.

  • iNat21 Only: A CLIP model (same ViT-B/16 architecture) that is continually pre-trained only on the iNat21 dataset (a subset of TREEOFLIFE-10M). This baseline helps to isolate the impact of TREEOFLIFE-10M's expanded data diversity and scale compared to a leading existing biological dataset.

  • Supervised-IN21K: A model pre-trained on ImageNet-21K [21] [78]. This represents a strong supervised pre-training baseline from the general computer vision domain. It's used for few-shot classification (as it doesn't support zero-shot in the same way CLIP does).

  • DINO: A self-supervised vision transformer model [15]. This baseline represents a different pre-training paradigm (self-supervised) from general computer vision, also used for few-shot classification.

  • Random Guessing: A simple baseline indicating the performance expected from random chance, calculated as $1/\text{number of classes}$.

    These baselines were chosen to represent:

  1. General-purpose CLIP models: To show the benefit of domain-specific adaptation.
  2. Domain-specific CLIP (on iNat21): To highlight the impact of the TREEOFLIFE-10M dataset's scale and diversity.
  3. Strong general vision models (Supervised-IN21K, DINO): To demonstrate BioCLIP's superiority even against advanced models from outside the CLIP paradigm.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that BioCLIP consistently and substantially outperforms existing baselines across diverse fine-grained biology classification tasks, particularly in zero-shot and few-shot settings.

  • Overall Superiority: BioCLIP achieves substantial gains, improving mean accuracy over the CLIP baseline by 20.5% absolute in the zero-shot setting, 16.7% in one-shot, and 17.3% in five-shot (Table 4).
  • Generalization to Unseen Taxa: Its strong performance on the RARE SPECIES dataset (a set of species unseen during training) highlights its exceptional out-of-distribution generalization capabilities, which is critical for real-world biological applications like conservation.
  • Performance Across Domains: BioCLIP performs well on all evaluated tasks, covering animals (Birds 525, Plankton, Insects, Insects 2), plants (PlantNet, PlantVillage, Medicinal Leaf, PlantDoc), and fungi (Fungi), with diverse image types (photos, microscope images, drawings, specimens). This validates its claim as a foundation model for the Tree of Life.
  • Beyond Species Classification: BioCLIP shows strong capabilities in tasks like plant disease diagnosis (PlantVillage, PlantDoc), which involve classifying both species and conditions, demonstrating its utility beyond simple species identification.

6.2. Data Presentation (Tables)

The following are the results from Table 4 of the original paper:

Model | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Fungi | PlantVillage | Medicinal Leaf | PlantDoc | Rare Species | Mean (Δ)
(Birds 525 through Insects 2 are the animal tasks and PlantNet through PlantDoc the plant and fungi tasks; Δ is the change in mean accuracy relative to the CLIP row within each setting.)
Random Guessing 0.2 1.2 1.0 1.0 4.0 4.0 2.6 4.0 3.7 0.3 2.2
Zero-Shot Classification
CLIP 49.9 3.2 9.1 9.8 58.5 10.2 5.4 15.9 26.1 31.8 21.9
OpenCLIP 54.7 2.2 6.5 9.6 50.2 5.7 8.0 12.4 25.8 28.4 20.4 −1.5
BioCLIP 72.1 6.1 34.8 20.4 91.4 40.7 24.4 38.6 39.4 56.1 42.4 +20.5
- iNat21 Only 56.1 2.6 30.7 11.5 88.2 43.0 18.4 25.6 31.7 21.3 32.9 +9.8
One-Shot Classification
CLIP 43.7 25.1 21.6 13.7 42.1 17.2 49.7 70.1 24.8 28.5 33.6
OpenCLIP 53.7 32.3 23.2 14.3 45.1 18.4 53.6 71.2 26.8 29.2 36.7 +3.1
Supervised-IN21K 60.2 22.9 14.7 14.4 46.7 16.9 62.3 58.6 27.7 28.0 35.2 +1.6
DINO 40.5 37.0 23.5 16.4 30.7 20.0 60.0 79.2 23.7 31.0 36.2 +2.6
BioCLIP 71.8 30.6 57.4 20.4 64.5 40.3 58.8 84.3 30.7 44.9 50.3 +16.7
- iNat21 Only 74.8 29.6 53.9 19.7 67.4 35.5 55.2 75.1 27.8 36.9 47.5 +13.9
Five-Shot Classification
CLIP 73.5 41.2 39.9 24.6 65.2 27.9 71.8 89.7 35.2 46.0 51.5
OpenCLIP 81.9 52.5 42.6 25.0 68.0 30.6 77.8 91.3 42.0 47.4 55.9 +4.4
Supervised-IN21K 83.9 39.2 32.0 25.4 70.9 30.9 82.4 82.3 44.7 47.3 53.9 +2.4
DINO 70.8 56.9 46.3 28.6 50.3 34.1 82.1 94.9 40.3 50.1 55.4 +3.9
BioCLIP 90.0 49.3 77.8 33.6 85.6 62.3 80.9 95.9 47.5 65.7 68.8 +17.3
- iNat21 Only 90.1 48.2 73.7 32.1 84.7 55.6 77.2 93.5 41.0 55.6 65.1 +13.6

The following are the results from Table E3 of the original paper:

Model | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Rare Species
One-Shot Classification
CLIP 43.7 ± 0.26 25.1 ± 0.71 21.6 ± 1.05 13.7 ± 1.09 42.1 ± 3.40 28.5 ± 0.65
OpenCLIP 53.7 ± 0.52 32.3 ± 0.63 23.2 ± 1.58 14.3 ± 0.67 45.1 ± 3.40 29.2 ± 0.64
Supervised-IN21K 60.2 ± 1.02 22.9 ± 0.84 14.7 ± 1.38 14.4 ± 0.90 46.7 ± 6.30 28.0 ± 0.77
DINO 40.5 ± 0.96 37.0 ± 1.39 23.5 ± 1.49 16.4 ± 0.78 30.7 ± 3.79 31.0 ± 0.89
BioCLIP 71.8 ± 0.47 30.6 ± 0.77 57.4 ± 2.4 20.4 ± 1.28 64.5 ± 2.15 44.9 ± 0.73
- iNat21 Only 74.8 ± 0.89 29.6 ± 0.82 53.9 ± 0.97 19.7 ± 0.80 67.4 ± 4.54 36.9 ± 1.02
Five-Shot Classification
CLIP 73.5 ± 0.37 41.2 ± 1.01 39.9 ± 0.86 24.6 ± 0.90 65.2 ± 1.25 46.0 ± 0.33
OpenCLIP 81.9 ± 0.25 52.5 ± 0.83 42.6 ± 0.82 25.0 ± 0.83 68.0 ± 0.86 47.4 ± 0.34
Supervised-IN21K 83.9 ± 0.15 39.2 ± 1.66 32.0 ± 1.90 25.4 ± 2.13 70.9 ± 2.45 47.3 ± 0.41
DINO 70.9 ± 0.34 56.9 ± 1.61 46.3 ± 1.37 28.6 ± 1.59 50.3 ± 3.20 50.1 ± 0.47
BioCLIP 90.0 ± 0.12 49.3 ± 1.14 77.8 ± 0.81 33.6 ± 0.74 85.6 ± 1.79 65.7 ± 0.43
- iNat21 Only 90.1 ± 0.08 48.2 ± 1.24 73.7 ± 0.65 32.1 ± 1.97 84.7 ± 1.24 55.6 ± 0.16

The following are the results from Table E4 of the original paper:

Model | Fungi | PlantVillage | Medicinal Leaf | PlantDoc | Rare Species
One-Shot Classification
CLIP 17.2 ± 0.78 49.7 ± 2.53 70.1 ± 2.83 24.8 ± 1.61 28.5 ± 0.65
OpenCLIP 18.4 ± 1.26 53.6 ± 0.79 71.2 ± 3.58 26.8 ± 1.45 29.2 ± 0.64
Supervised-IN21K 16.9 ± 2.32 62.3 ± 2.28 58.6 ± 4.45 27.7 ± 2.86 28.0 ± 0.77
DINO 20.0 ± 1.53 60.0 ± 2.15 79.2 ± 2.74 23.7 ± 2.48 31.0 ± 0.89
BioCLIP 40.3 ± 3.00 58.8 ± 2.83 84.3 ± 1.90 30.7 ± 1.75 44.9 ± 0.73
- iNat21 Only 35.5 ± 2.93 55.2 ± 1.58 75.1 ± 1.16 27.8 ± 1.31 36.9 ± 1.02
Five-Shot Classification
CLIP 27.9 ± 2.54 71.8 ± 1.46 89.7 ± 1.45 35.2 ± 1.59 46.0 ± 0.33
OpenCLIP 30.6 ± 1.26 77.8 ± 1.28 91.3 ± 0.85 42.0 ± 1.32 47.4 ± 0.34
Supervised-IN21K 30.9 ± 2.64 82.4 ± 1.53 82.3 ± 3.81 44.7 ± 2.26 47.3 ± 0.41
DINO 34.1 ± 2.87 82.1 ± 1.31 94.9 ± 1.30 40.3 ± 2.32 50.1 ± 0.47
BioCLIP 62.3 ± 1.82 80.9 ± 1.04 95.9 ± 1.07 47.5 ± 1.35 65.7 ± 0.43
- iNat21 Only 55.6 ± 2.61 77.2 ± 0.68 93.5 ± 1.13 41.0 ± 1.75 55.6 ± 0.16

6.2.1. Zero-Shot Classification

  • BioCLIP achieves a mean accuracy of 42.4%, substantially outperforming CLIP (21.9%) and OpenCLIP (20.4%). This is a massive improvement, demonstrating the effectiveness of pre-training on TREEOFLIFE-10M and using taxonomic information.
  • The RARE SPECIES column is particularly telling: BioCLIP scores 56.1%, while CLIP gets 31.8% and OpenCLIP 28.4%. This validates BioCLIP's ability to generalize to unseen taxa, which is crucial for biological applications.
  • The - iNat21 Only model (32.9%) also performs better than generic CLIPs but is still significantly behind the full BioCLIP, indicating that the expanded diversity of TREEOFLIFE-10M beyond iNat21 is vital.

6.2.2. One-Shot Classification

  • BioCLIP achieves a mean accuracy of 50.3%, again significantly higher than CLIP (33.6%) and OpenCLIP (36.7%).
  • On RARE SPECIES, BioCLIP scores 44.9%, maintaining a strong lead.
  • Supervised-IN21K (35.2%) and DINO (36.2%) are competitive with CLIP and OpenCLIP, but are still far outpaced by BioCLIP, confirming the benefit of BioCLIP's specialized pre-training for fine-grained biological tasks even in the few-shot regime.
  • The mean one-shot accuracy of BioCLIP is 9.1% higher than its zero-shot accuracy, indicating that even a single labeled example significantly improves performance, contrary to some findings for general CLIP models.

6.2.3. Five-Shot Classification

  • The trend continues with BioCLIP leading with a mean accuracy of 68.8%, compared to CLIP (51.5%) and OpenCLIP (55.9%).
  • The gains are consistent across most datasets, with BioCLIP showing particularly strong performance on Insects (77.8%), Fungi (62.3%), and RARE SPECIES (65.7%).
  • The performance gap between BioCLIP and Supervised-IN21K / DINO remains substantial, reinforcing BioCLIP's advantage in biological domain generalization.

6.3. Ablation Studies / Parameter Analysis

6.3.1. How Do Text Types Affect Generalization?

This ablation study investigates the impact of different text types used during training on zero-shot generalization, specifically on the RARE SPECIES dataset (where all species are unseen during training). The study uses a 10% subset of TREEOFLIFE-10M (referred to as ToL-1M) for computational reasons.

The following are the results from Table 5 of the original paper:

Dataset Train↓Test→ Com Sci Tax Sci+Com Tax+Com
ToL-1M Com 24.9 9.5 10.8 22.3 21.0
Sci 11.0 22.3 4.5 21.5 8.0
Tax 11.8 10.1 26.6 16.0 24.8
Sci+Com 24.5 12.9 12.6 28.0 24.9
Tax+Com Mixture 20.5 8.0 19.7 24.0 30.4
iNat21-2.7M Mixture 26.1 24.9 26.7 29.5 30.9
ToL-10M Mixture 31.6 30.1 34.1 37.0 38.0
  • Key Observations:
    • Importance of Taxonomic names: Training with Taxonomic names (row "Tax") yields strong performance when tested with Taxonomic names (26.6%), and Taxonomic + Common (row "Tax+Com Mixture") achieves the best overall performance (30.4%) when tested with Taxonomic + Common names on ToL-1M. This highlights that incorporating the full taxonomic structure is crucial for generalization.
    • Mixed Text Type Training is Superior: The "Tax+Com Mixture" row shows that training with a mixture of text types (randomly sampling from available text types for an image) provides the most robust performance across different text types at test time. For instance, Tax+Com Mixture (30.4%) outperforms any single text type trained model (e.g., Tax tested with Tax at 26.6%). This strategy retains the generalization benefits of taxonomic names while offering flexibility.
    • Performance Degradation with Mismatched Text Types: When a model is trained on a single text type (e.g., Common or Scientific) and tested with a different one, performance degrades substantially. For example, Sci trained model scores 22.3% when tested with Sci, but only 4.5% when tested with Tax.
    • Data Scale Matters: Comparing the Tax+Com test column across training sets (ToL-1M at 30.4%, iNat21-2.7M Mixture at 30.9%, and ToL-10M Mixture at 38.0%) shows that the full TREEOFLIFE-10M dataset significantly boosts performance even when the mixed text type strategy is held fixed. This confirms the importance of the added data diversity and scale of TREEOFLIFE-10M.

6.3.2. Is the CLIP Objective Necessary?

This ablation compares the CLIP objective against traditional cross-entropy based methods. Models (ViT-B/16) are trained on TREEOFLIFE-1M (10% subset) and evaluated in one-shot and five-shot settings (zero-shot is not possible for non-CLIP models).

The following are the results from Table 6 of the original paper:

Objective Mean 1-Shot Mean 5-shot
Cross-entropy 16.5 26.2
Hier. cross-entropy 19.3 30.5
CLIP 44.7 63.8
  • Key Finding: The CLIP objective (44.7% one-shot, 63.8% five-shot) massively outperforms both standard Cross-entropy (16.5% one-shot, 26.2% five-shot) and Hierarchical cross-entropy (19.3% one-shot, 30.5% five-shot).
  • Justification: This strong result provides empirical justification for repurposing the CLIP objective for BioCLIP. While Hierarchical cross-entropy does show a modest improvement over simple Cross-entropy by leveraging some hierarchical information, it cannot compete with the representation learning power of the CLIP contrastive objective in low-data regimes (few-shot). This confirms that the contrastive approach is far more effective for learning generalized, fine-grained representations suitable for the Tree of Life.
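
For contrast with Listing 1, a CLIP-style symmetric contrastive objective over a batch of matched image and taxonomic-name embeddings can be sketched as below. This is a generic InfoNCE formulation with a fixed temperature, not the authors' exact implementation (CLIP learns its temperature during training).

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # img_emb, txt_emb: (batch, d) embeddings of matched image/text pairs.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.T / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        # Matched pairs lie on the diagonal; all other pairs in the batch are negatives.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.T, targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy usage with random embeddings standing in for the two encoders' outputs.
    print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())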

6.3.3. Can BioCLIP Classify More Than Species?

This section examines BioCLIP's ability to transfer to tasks beyond pure species classification, specifically plant diagnosis (classifying species and disease) using the PlantVillage and PlantDoc datasets.

  • Results (from Table 4):
    • PlantVillage (Zero-shot): BioCLIP 24.4%, CLIP 5.4%, OpenCLIP 8.0%.
    • PlantDoc (Zero-shot): BioCLIP 39.4%, CLIP 26.1%, OpenCLIP 25.8%.
    • PlantVillage (One-shot): BioCLIP 58.8%, CLIP 49.7%, OpenCLIP 53.6%.
    • PlantDoc (One-shot): BioCLIP 30.7%, CLIP 24.8%, OpenCLIP 26.8%.
    • BioCLIP consistently outperforms baselines in both zero-shot and few-shot settings on these tasks.
  • Implication: This indicates that BioCLIP learns useful visual representations that are transferable to complex, multi-faceted biological questions like disease diagnosis, even though its primary training objective is a contrastive species classification. It demonstrates that the learned representations are rich enough to capture subtle visual cues for conditions and traits. The fact that BioCLIP's mean one-shot accuracy is 9.1% higher than its zero-shot accuracy implies that it learns useful visual features even from a single labeled example, which is beneficial given the high cost of biological data labeling.

6.3.4. Does BioCLIP Learn the Hierarchy?

This intrinsic evaluation aims to visualize BioCLIP's learned image representations and determine if they conform to the Tree of Life hierarchy.

  • Method:

    • t-SNE [85] is used to reduce the dimensionality of image embeddings from iNat21's validation set (unseen during training) for visualization.
    • Points in the t-SNE plot are colored by their taxonomic labels.
    • t-SNE is run independently on subsets of examples belonging to specific taxonomic ranks, to visualize clustering at each hierarchical level (a minimal plotting sketch follows this list).
  • Observations (Figure 3, described here at a high level based on the paper's text):

    • At higher taxonomic ranks (e.g., Kingdom, Phylum), both CLIP and BioCLIP show good separation between groups. However, BioCLIP's representations exhibit a more fine-grained and richer clustering structure.
    • At lower taxonomic ranks (e.g., Class, Order, Family), BioCLIP produces evidently more separable features, with distinct clusters for different groups. In contrast, CLIP's features appear cluttered and lack clear, discernible hierarchical structure.
  • Conclusion: This visual evidence supports the hypothesis that BioCLIP has successfully learned a fine-grained hierarchical representation that aligns with the Tree of Life. This intrinsic property explains BioCLIP's superior generalization and fine-grained classification capabilities observed in the extrinsic evaluations.
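
As referenced in the Method bullet above, the rank-wise t-SNE visualization could be reproduced roughly as follows. This is a minimal sketch with hypothetical variable names; the paper's exact plotting setup is not specified here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_rank_tsne(embeddings, rank_labels, title):
    """Project image embeddings to 2-D with t-SNE and colour each point by its
    label at a single taxonomic rank (e.g. phylum, class, order)."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for label in np.unique(rank_labels):
        mask = rank_labels == label
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=label)
    plt.title(title)
    plt.legend(markerscale=3, fontsize=6)
    plt.show()

# Run once per rank on the relevant subset of validation embeddings, e.g.:
# plot_rank_tsne(bioclip_embeddings, order_labels, "BioCLIP embeddings coloured by order")
```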

6.4. Generalized Zero-Shot Learning (GZSL)

From Appendix H, BioCLIP was also evaluated in a challenging GZSL setting, where a model must classify images from unseen classes within a set of both seen and unseen labels.

  • Setup: 400 seen species from TREEOFLIFE-10M were combined with the 400 unseen RARE SPECIES labels, creating an 800-label classification task over the RARE SPECIES images (see the sketch after this list).
  • Results: BioCLIP achieved 26.0% top-1 accuracy, outperforming CLIP (23.0%) and OpenCLIP (18.2%).
  • Implication: This further confirms BioCLIP's strong generalization abilities, even when faced with the added complexity of distinguishing unseen classes amidst a larger set of seen classes.
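
A minimal sketch of this GZSL evaluation, assuming precomputed image and label-text embeddings from a frozen CLIP-style model (the helper and its arguments are hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def gzsl_zero_shot(image_emb, label_text_emb, label_names):
    """Zero-shot prediction over a combined label set (seen + unseen species).
    image_emb:      (N, d)   embeddings of the RARE SPECIES test images
    label_text_emb: (800, d) text embeddings for 400 seen + 400 unseen labels
    label_names:    list of 800 label strings, aligned with label_text_emb rows
    """
    image_emb = F.normalize(image_emb, dim=-1)
    label_text_emb = F.normalize(label_text_emb, dim=-1)
    sims = image_emb @ label_text_emb.t()                 # cosine similarities
    return [label_names[i] for i in sims.argmax(dim=-1).tolist()]
```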

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces TREEOFLIFE-10M, a large-scale and diverse dataset of biological images, and BioCLIP, a vision foundation model designed for the Tree of Life. Across extensive evaluation, BioCLIP delivers strong fine-grained biological classification, improving over existing baselines, including general-purpose CLIP models, by 16% to 17% absolute accuracy in both zero-shot and few-shot settings. The core innovation is a training strategy that repurposes CLIP's multimodal contrastive objective to exploit the hierarchical structure of biological taxonomy, using "flattened" taxonomic names as the text input. Coupled with the scale and diversity of TREEOFLIFE-10M, this enables BioCLIP to learn more fine-grained and generalizable representations. Intrinsic evaluations confirm that BioCLIP's learned representations conform to the Tree of Life hierarchy, which helps explain its strong generalization to unseen taxa and to complex biological questions such as disease diagnosis.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Classification-focused BioCLIP: The current BioCLIP is trained specifically for (taxonomic) classification, so its representations are tuned to that objective rather than to broader trait- or description-level tasks.
    • Future Work: Scaling up the data even further, potentially incorporating over 100 million research-grade images from iNaturalist. Collecting richer textual descriptions of species' appearances to enable BioCLIP to extract fine-grained trait-level representations (e.g., specific morphological features, behaviors, or ecological roles), moving beyond just classification.
  • Taxonomic Labeling Issues: The authors found that hemihomonyms (identical scientific names that refer to different organisms under different nomenclature codes, e.g., a genus name shared by a plant and an animal) were occasionally mislabeled at higher taxonomic ranks, affecting a small fraction (0.1-0.2%) of the data.
    • Future Work: Developing a more robust solution for taxonomic labeling that accounts for such misclassifications and incorporates ongoing re-naming efforts in biology (e.g., for bird species). They intend to release a patch addressing this.

7.3. Personal Insights & Critique

  • Innovation of Leveraging Explicit Hierarchy: The most inspiring aspect of this paper is its elegant solution to incorporating explicit biological hierarchy into a CLIP-like model. Repurposing the CLIP objective by "flattening" the taxonomy into a single string for contrastive learning is a clever and highly effective strategy. It bypasses the complexity of multi-task hierarchical losses while still embedding the crucial structural information, which is a major conceptual leap for domain-specific foundation models. This approach could potentially be adapted to other domains with rich, inherent hierarchical data structures (e.g., medical ontologies, material science classifications).
  • The Power of Data Curation: The paper powerfully demonstrates that for domain-specific foundation models, the quality, diversity, and structured richness of the training data are as, if not more, important than sheer scale. TREEOFLIFE-10M is not just large; its careful aggregation, canonicalization of taxonomic labels, and inclusion of diverse image sources (in-situ, lab, citizen science) are critical to BioCLIP's success. This highlights the immense value of interdisciplinary collaboration between ML researchers and domain experts (biologists in this case) for effective data engineering.
  • Impact on Biological Research and Conservation: BioCLIP has immediate and significant potential. Its zero-shot and few-shot capabilities are game-changers for identifying rare or newly discovered species, monitoring biodiversity with limited labeled data, and democratizing AI tools for biologists who may lack extensive ML expertise. The ability to generalize to unseen taxa is particularly valuable for conservation efforts, where data on endangered or newly threatened species is often scarce.
  • Potential Issues and Areas for Improvement:
    • Static Taxonomy: The current approach relies on a relatively static taxonomic hierarchy. Biology, however, is dynamic; classifications are constantly revised as new genetic and morphological data emerge. While the authors mention addressing re-naming, a more adaptive system that can gracefully incorporate taxonomic updates (e.g., using knowledge graph embeddings or dynamic graph neural networks) could make BioCLIP even more robust long-term.
    • Computational Cost: Training BioCLIP on 10.4M images with a large ViT and Transformer requires substantial computational resources (8 NVIDIA A100-80GB GPUs). This raises questions about accessibility for smaller research groups or NGOs in the biological domain. Future work could explore more efficient training strategies or smaller model architectures that retain performance.
    • "Black Box" Concerns: Like many deep learning models, BioCLIP is a "black box." While t-SNE visualizations provide insights into its learned hierarchy, further research into explainable AI (XAI) for BioCLIP could reveal why it makes certain fine-grained distinctions, helping biologists trust and utilize the model more effectively (e.g., by identifying crucial visual features for a species).
    • Data Bias: Despite efforts to diversify, any dataset, especially one aggregated from various sources, can inherit biases (e.g., geographic bias in citizen science data, overrepresentation of charismatic megafauna). While not explicitly discussed, mitigating these biases in future dataset iterations is important for equitable representation across the Tree of Life.
    • Ethical Considerations (as acknowledged by authors): The authors address potential ethical concerns, particularly regarding misuse for endangered species (e.g., poaching). Their mitigation strategy (no specific geographic info, no conservation status in training) is a good start, but ongoing vigilance and research into responsible AI in conservation are crucial. The concern about over-reliance on model predictions is valid; BioCLIP should be seen as an assistant, not a replacement, for expert biological judgment.
