BioCLIP: A Vision Foundation Model for the Tree of Life
TL;DR Summary
The paper introduces BioCLIP, a vision foundation model for the tree of life, trained on TreeOfLife-10M, the largest and most diverse ML-ready biology image dataset. BioCLIP substantially outperforms existing models on fine-grained biology classification and generalizes across diverse taxa, including species unseen during training.
Abstract
Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. https://imageomics.github.io/bioclip has models, data and code.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "BioCLIP: A Vision Foundation Model for the Tree of Life". This title clearly indicates the paper's central topic: the development of a vision foundation model specifically tailored for biological organisms, aiming to cover the vast diversity encapsulated by the "tree of life".
1.2. Authors
The authors of the paper are:
- Samuel Stevens
- Jiaman Wu
- Matthew J Thompson
- Elizabeth G Campolongo
- Chan Hee Song
- David Edward Carlyn
- Li Dong
- Wasila M Dahdul
- Charles Stewart
- Tanya Berger-Wolf
- Wei-Lun Chao
- Yu Su
Their affiliations are primarily with The Ohio State University, with contributions also from Microsoft Research, University of California, Irvine, and Rensselaer Polytechnic Institute. The presence of researchers from multiple institutions, including a major tech company, suggests a collaborative effort bridging academic research and potentially industry-level resources in machine learning and biology.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server for scientific papers. While arXiv itself is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research in fields like computer science, physics, mathematics, and biology. Papers often appear on arXiv before formal peer review and publication in a conference or journal. The reputation and influence of arXiv are significant in allowing researchers to share their work promptly and gather feedback from the broader scientific community.
1.4. Publication Year
The paper was published on 2023-11-30.
1.5. Abstract
The abstract introduces the increasing abundance of natural world images and the explosion of computational methods, especially computer vision, to extract biological information. It highlights a critical gap: most existing methods are bespoke, task-specific, and lack adaptability. The paper addresses this need by presenting TREEOFLIFE-10M, the largest and most diverse machine learning (ML)-ready dataset of biology images. Using this dataset, they develop BioCLIP, a foundation model for the tree of life. BioCLIP leverages the unique properties of biology, such as the variety of organisms (plants, animals, fungi) and rich structured biological knowledge. The model is rigorously benchmarked on diverse fine-grained biology classification tasks, where it consistently and substantially outperforms existing baselines by 16% to 17% absolute. Intrinsic evaluation further shows that BioCLIP learns a hierarchical representation conforming to the tree of life, explaining its strong generalizability.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2311.18803. The PDF link is https://arxiv.org/pdf/2311.18803v3.pdf. This paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of a general-purpose, adaptable vision model for organismal biology. Images of the natural world, from various sources like drones, camera traps, and citizen science platforms, are becoming increasingly abundant. This has led to a proliferation of computational tools, especially computer vision methods, to extract biologically relevant information for science and conservation.
However, the current landscape of computer vision applications in biology is fragmented. Most existing approaches are "bespoke," meaning they are custom-designed for a specific task, a particular set of taxa (e.g., a specific species or genus), and a limited dataset. This makes them difficult to adapt, extend, or generalize to new biological questions, contexts, or datasets, significantly limiting their broader utility for biological research and conservation efforts. Biologists often face a laborious process of manually labeling data and training models for each new task, requiring substantial machine learning expertise.
This problem is important because biological research spans a vast "tree of life," encompassing millions of diverse species with complex relationships. An adaptable vision model could dramatically accelerate scientific discovery, aid in biodiversity monitoring, and enhance conservation strategies by lowering the barrier for biologists to apply AI. The specific challenges or gaps in prior research include:
- Generalization: Existing models struggle to generalize across the entire tree of life or to taxa not present in their training data, and it is infeasible to collect training data for millions of known taxa.
- Fine-grained Representation: Biology often requires distinguishing between visually similar organisms (e.g., closely related species, mimics). General-domain vision models, even those trained on massive datasets, often lack the fine-grained representations needed for such distinctions.
- Low-data Regime: Data collection and labeling in biology are expensive. Models need to perform well in zero-shot or few-shot settings where data for new or rare species is scarce.
- Suitable Datasets and Pre-training Strategies: Existing biological image datasets often lack the scale, diversity, or fine-grained taxonomic labels required for training foundation models, and mainstream pre-training algorithms typically do not leverage the rich hierarchical structure of biological taxonomy.
The paper's entry point and innovative idea is to develop a vision foundation model that overcomes these challenges by:
- Curating a large-scale, diverse, and ML-ready dataset: TREEOFLIFE-10M, which explicitly incorporates taxonomic labels and hierarchies.
- Developing a novel pre-training strategy: adapting CLIP's multimodal contrastive learning objective to explicitly leverage the hierarchical "tree of life" taxonomy, thereby enabling better generalization and fine-grained representation learning.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field:
- TREEOFLIFE-10M Dataset: They curate and release TREEOFLIFE-10M, presented as the largest and most diverse ML-ready dataset of biology images to date. It contains over 10 million images covering 454 thousand taxa, significantly expanding upon previous datasets such as iNat21 in scale and diversity by integrating data from iNat21, BIOSCAN-1M, and newly curated images from the Encyclopedia of Life (EOL). Crucially, every image is associated with its full taxonomic hierarchy.
- BioCLIP Foundation Model: They develop BioCLIP, a vision foundation model specifically designed for the tree of life. BioCLIP is built upon the CLIP architecture but leverages the unique properties of biology data, particularly the abundance and variety of images of plants, animals, and fungi, and the availability of rich structured biological knowledge (taxonomy).
- Novel Pre-training Strategy: Instead of standard supervised classification (which treats labels as isolated symbols), BioCLIP combines CLIP-style multimodal contrastive learning with biological taxonomy by "flattening" the taxonomic hierarchy (from kingdom to species) into a single string called a taxonomic name. The model learns to match images with their corresponding taxonomic names, thereby embedding the hierarchical relationships. The authors also propose a mixed text type training strategy to enhance flexibility at inference time.
- Comprehensive Benchmarking: They rigorously benchmark BioCLIP on 10 diverse fine-grained biology classification tasks, including a newly curated RARE SPECIES dataset designed to test generalization to unseen taxa.
- Superior Performance and Generalizability: BioCLIP consistently and substantially outperforms existing baselines, including CLIP and OpenCLIP, by an average absolute improvement of 17% in zero-shot settings and 16% in few-shot settings.
- Intrinsic Evaluation Revealing Hierarchical Representation: Intrinsic evaluation (e.g., t-SNE visualization) reveals that BioCLIP learns a more fine-grained, hierarchical representation of images that conforms to the tree of life, explaining its strong generalizability to novel taxa and fine-grained distinctions.

These findings address the problems of limited generalization, insufficient fine-grained representation, and poor performance in low-data regimes for biological image analysis. By providing both a large-scale dataset and a specialized foundation model, BioCLIP significantly lowers the barrier for applying AI to biology.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the paper "BioCLIP: A Vision Foundation Model for the Tree of Life," a reader should be familiar with several key concepts from machine learning, especially computer vision and natural language processing, as well as basic biological classification principles.
- Foundation Models:
  - Conceptual Definition: A foundation model is a large machine learning model trained on a vast amount of broad data (unlabeled and/or labeled) at scale. Once pre-trained, these models can be adapted (fine-tuned) to a wide range of downstream tasks with little or no additional training data (e.g., zero-shot or few-shot learning). They typically exhibit emergent properties and strong generalization capabilities due to their scale and diverse training data. Examples include GPT-3 for language and CLIP for vision-language tasks.
  - Importance: They enable rapid development of AI applications by reducing the need for task-specific model development and large labeled datasets for every new problem.
- CLIP (Contrastive Language-Image Pre-training):
  - Conceptual Definition: CLIP is a multimodal foundation model developed by OpenAI that learns visual concepts from natural language supervision. It is trained on a massive dataset of image-text pairs (e.g., images found online with their captions). CLIP consists of two separate encoders: a vision encoder (e.g., a Vision Transformer) and a text encoder (e.g., a Transformer-based language model).
  - How it Works (Contrastive Learning Objective): During training, CLIP learns to associate images with their corresponding text descriptions (positive pairs) and distinguish them from non-corresponding image-text pairs (negative pairs). The objective is to maximize the cosine similarity (a measure of similarity between two non-zero vectors) between the embeddings of positive image-text pairs and minimize the similarity between negative pairs. After training, both encoders map their respective inputs into a shared, high-dimensional embedding space where similar image and text concepts are located close to each other.
  - Zero-shot Capability: This shared embedding space allows CLIP to perform zero-shot classification. For example, to classify an image into one of several categories, CLIP computes the similarity between the image's embedding and the embeddings of text descriptions of each category (e.g., "a photo of a cat," "a photo of a dog"). The category with the highest similarity is chosen as the prediction, without ever having seen labeled examples of those categories during training.
- Vision Transformer (ViT):
  - Conceptual Definition: ViT is an image encoder architecture that applies the Transformer architecture (originally developed for natural language processing) directly to sequences of image patches. Instead of using convolutional layers, ViT divides an image into fixed-size patches, linearly embeds each patch, adds positional embeddings, and then feeds the resulting sequence into a standard Transformer encoder.
  - Role in CLIP: ViT is commonly used as the vision encoder in CLIP models, allowing them to process visual information effectively and leverage the self-attention mechanism for global contextual understanding within images.
- Taxonomy / Tree of Life:
  - Conceptual Definition: In biology, taxonomy is the scientific classification of organisms. The Tree of Life represents the evolutionary relationships among all living organisms, organized hierarchically. Organisms are grouped into a nested hierarchy of ranks, typically including: Kingdom, Phylum, Class, Order, Family, Genus, and Species. For example, a human would be: Animalia (Kingdom), Chordata (Phylum), Mammalia (Class), Primates (Order), Hominidae (Family), Homo (Genus), sapiens (Species).
  - Importance: This hierarchy provides a structured way to understand biodiversity and evolutionary relationships, which is a key signal BioCLIP aims to leverage.
- Zero-shot Learning:
  - Conceptual Definition: Zero-shot learning (ZSL) refers to the ability of a model to recognize or classify instances of classes that were not seen during training. This is typically achieved by leveraging auxiliary information (like text descriptions or attributes) that links seen and unseen classes. In CLIP, the text encoder provides this auxiliary information.
- Few-shot Learning:
  - Conceptual Definition: Few-shot learning (FSL) is a machine learning paradigm where a model can learn to classify new classes given only a very small number of labeled examples (e.g., 1, 5, or 10 examples per class). This is crucial in domains like biology where extensive labeled data is often unavailable.
- Fine-grained Classification:
  - Conceptual Definition: Fine-grained classification involves distinguishing between subordinate categories that are visually very similar (e.g., different species of birds, different car models). This contrasts with coarse-grained classification (e.g., distinguishing between a cat and a dog). It requires models to learn subtle visual cues.
- ML-ready Dataset:
  - Conceptual Definition: An ML-ready dataset is a dataset that has been processed, cleaned, and formatted in a way that makes it immediately usable for machine learning model training and evaluation. This includes tasks like data collection, annotation, normalization, handling missing values, and structuring labels consistently.
3.2. Previous Works
The paper contextualizes BioCLIP by referencing several prior works across foundation models, domain-specific AI, and hierarchical computer vision.
- General Foundation Models:
  - CLIP [69]: OpenAI's Contrastive Language-Image Pre-training model is a direct precursor and inspiration. BioCLIP builds upon CLIP's architecture and contrastive learning objective. CLIP was trained on noisy, web-scale image-text datasets (100M+ pairs) and showed state-of-the-art zero-shot capabilities.
  - GPT-3 [14]: OpenAI's large language model, a prominent example of a foundation model demonstrating strong zero-shot and few-shot learning for language tasks.
  - ALIGN [45] and BASIC [65]: Further scaled multimodal pre-training to billions of image-text pairs, improving vision representation quality.
  - ResNet [33] and Swin Transformer [48]: General-domain vision models that typically use a supervised classification objective on class indices. The paper contrasts BioCLIP's approach with these, highlighting how BioCLIP leverages the rich label structure rather than treating labels as discrete symbols.
  - DINO [15]: A self-supervised vision transformer, used as a baseline for few-shot classification.
- Biology-specific Datasets and AI:
  - iNat21 [86]: The previous largest ML-ready biology image dataset, with 2.7M images covering 10K species. TREEOFLIFE-10M integrates iNat21 and significantly expands on its diversity.
  - BIOSCAN-1M [28]: A recent dataset of 1M lab images of insects. TREEOFLIFE-10M also incorporates this, adding diverse image distributions (lab vs. in situ).
  - Computer vision applications in biology: The paper cites numerous works on using digital images and computer vision for evolutionary biology [13, 51], ecology and biodiversity [5, 77, 83], species classification [32], individual identification, trait detection [23, 39], abundance estimation [3, 40, 58, 82], and biodiversity monitoring [83]. These works highlight the existing demand for robust biological vision models.
- Domain-specific CLIPs:
  - The paper notes a trend of domain-specific CLIPs outperforming general models [18, 30]. Examples include Ikezogwo et al. [41] and Lu et al. [50], who gathered 1M+ image-text pairs for computational pathology. BioCLIP follows this trend but at a larger scale (10M+ images) and with a strong emphasis on taxonomic diversity in biology.
- Hierarchy in Computer Vision:
  - ImageNet [70]: Its classes are organized according to the hierarchical WordNet [55] structure, making hierarchy a long-standing topic in CV.
  - Bilal et al. [10]: Studied ImageNet model predictions, found that confusion patterns follow hierarchical class structures, and showed that incorporating hierarchy into AlexNet improved performance.
  - Bertinetto et al. [9]: Measured mistake severity in image classifiers and proposed alternative objectives incorporating hierarchy to reduce severe errors.
  - Zhang et al. [96]: Proposed a contrastive objective where hierarchical distance between labels corresponded to desired embedding distance, outperforming cross-entropy on ImageNet and iNat17.
  - BioCLIP's novel contribution in this area is applying a repurposed CLIP objective to leverage a comprehensive biological taxonomy with 454K unique classes, a much larger scale than prior work.
3.3. Technological Evolution
The technological evolution leading to BioCLIP can be summarized as a progression from:
- Task-specific, bespoke computer vision models for biology: Early applications required significant manual effort for data labeling and model training for each narrow biological problem. These models were often limited in scope and generalizability.
- General-domain vision models (e.g., ResNet, Swin Transformer): These models achieved impressive performance on broad computer vision tasks, but still struggled with the fine-grained distinctions and vast diversity inherent in biology, especially when trained with standard classification objectives that ignore hierarchical relationships.
- Multimodal foundation models (e.g., CLIP): The advent of CLIP demonstrated the power of learning visual representations from natural language supervision, enabling strong zero-shot and few-shot capabilities. However, general CLIP models were not optimized for the specific challenges of biological taxonomy (e.g., Latin names, complex hierarchies, extreme fine-grainedness).
- Domain-specific adaptation of foundation models: Researchers began adapting CLIP and similar architectures to specific domains (e.g., pathology, fashion) to achieve better performance by leveraging domain-specific data and knowledge.

BioCLIP fits into this timeline as a cutting-edge example of the fourth stage. It is a domain-specific adaptation of a foundation model (CLIP) that explicitly addresses the unique challenges and leverages the unique properties of the biological domain. It moves beyond general-purpose models by incorporating the vast, hierarchical Tree of Life taxonomy into its pre-training strategy and building the largest, most diverse ML-ready dataset for this domain.
3.4. Differentiation Analysis
Compared to the main methods in related work, BioCLIP introduces several core differences and innovations:
- Leveraging Biological Hierarchy with Contrastive Learning:
  - Difference from Standard Supervised Models (ResNet, Swin Transformer): Traditional supervised models treat each class label as a discrete, unrelated entity. BioCLIP, in contrast, explicitly leverages the rich hierarchical structure of biological taxonomy (Kingdom, Phylum, Class, etc.).
  - Innovation: Instead of using hierarchical classification (which sums losses across levels, as explored in their ablations), BioCLIP creatively repurposes CLIP's multimodal contrastive learning. It "flattens" the entire taxonomic path into a single taxonomic name string and uses CLIP to learn to match images with these hierarchical text representations. This is a novel way to embed hierarchical relationships into the CLIP embedding space, which they show significantly outperforms hierarchical classification with cross-entropy.
- Scale and Diversity of Biological Dataset:
  - Difference from iNat21 and BIOSCAN-1M: While these are valuable datasets, they are significantly smaller and less diverse in terms of the number of unique taxa covered.
  - Innovation: TREEOFLIFE-10M is a massive leap in scale (10M+ images) and taxonomic diversity (454K+ taxa), meticulously aggregated from multiple sources including iNat21, BIOSCAN-1M, and especially the Encyclopedia of Life (EOL). This unparalleled scale and diversity are crucial for training a truly generalized foundation model for biology.
- Enhanced Generalization to Unseen Taxa and Fine-grained Distinctions:
  - Difference from General CLIP / OpenCLIP: While general CLIP models offer good zero-shot capabilities, they are not optimized for the specific, often Latin-based, and highly fine-grained labels of biological organisms. They perform better with common names, which can be ambiguous.
  - Innovation: By training with taxonomic names and a mixed text type strategy, BioCLIP learns representations that are inherently more fine-grained and generalize better to unseen species (as demonstrated on the RARE SPECIES dataset) and to various label formats. Its intrinsic evaluation reveals a clear hierarchical clustering in its embedding space, a feature lacking in general CLIP models when applied to biological data.
- Flexibility in Text Types at Inference:
  - Innovation: The mixed text type training strategy allows BioCLIP to retain the generalization benefits of taxonomic names while being robust and flexible when only common names or scientific names are available at inference time. This addresses a practical need for biologists who might use different naming conventions.

In essence, BioCLIP differentiates itself by specifically engineering the CLIP framework and its training data to address the unique structural (hierarchy), scale (diversity of taxa), and practical (fine-grained, low-data, varied naming conventions) challenges of organismal biology.
4. Methodology
4.1. Principles
The core idea behind BioCLIP is to develop a general-purpose vision foundation model that can effectively understand and classify images across the entire Tree of Life. This is achieved by leveraging two key principles:
- Multimodal Contrastive Learning: Building upon the success of CLIP, BioCLIP uses a contrastive learning objective to learn joint representations of images and text. This allows the model to learn semantic correspondences between visual features of organisms and their textual descriptions, enabling powerful zero-shot and few-shot classification capabilities.
- Explicitly Leveraging Biological Taxonomy: Unlike general-domain CLIP models or standard supervised classifiers that treat labels as independent symbols, BioCLIP recognizes and explicitly integrates the rich hierarchical structure of biological taxonomy. The intuition is that if the model learns the relationships within the Tree of Life (e.g., that different species belong to the same genus, which belongs to the same family, etc.), it will generalize better to unseen taxa. For example, even if it hasn't seen images of a particular species, it might have learned robust representations for its genus or family, which can then be used to infer information about the new species.

These principles combine to allow BioCLIP to learn fine-grained, hierarchically structured visual representations that are broadly applicable across diverse biological tasks and can generalize to novel species. A minimal sketch of the CLIP-style contrastive objective is shown below.
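To make the contrastive principle concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of matched image and text (e.g., taxonomic-name) embeddings. This is a generic illustration of the objective, not the authors' released training code; the tensor shapes and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs.

    img_emb, txt_emb: (batch, dim) tensors; row i of each is a matching pair,
    e.g. an organism photo and its flattened taxonomic name.
    """
    # Normalize so dot products equal cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = img_emb @ txt_emb.T / temperature

    # The correct text for image i is text i (and vice versa).
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.T, targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2
```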
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. TREEOFLIFE-10M: The Large-Scale, Diverse ML-Ready Biology Image Dataset
BioCLIP's foundation begins with TREEOFLIFE-10M, a meticulously curated dataset designed to overcome the limitations of previous biological image datasets in terms of scale, diversity, and label richness.
- Problem with Existing Datasets:
  - iNat21 [86]: While large for biology (2.7M images, 10K species), its species diversity is limited compared to the millions of known species. For instance, there are over 2M described species, with 10K+ bird species and 10K+ reptile species alone. This limits its potential for a foundation model for the entire Tree of Life.
- Data Sources and Integration:
  - Encyclopedia of Life (EOL): The authors recognized EOL (eol.org) as a crucial source for high-quality, diverse biology images. They downloaded 6.6 million images from EOL, which expanded the dataset to cover an additional 440K taxa and addresses the diversity gap.
  - BIOSCAN-1M [28]: This dataset (1.1M images) primarily contains lab images of insects, covering 494 different families. Its inclusion is important for two reasons: it provides extremely fine-grained visual representations for insects, a highly diverse subtree of the Tree of Life, and it diversifies the image distribution by including lab images, which differ significantly from the in-situ (natural environment) images typical of iNat21.
  - iNat21 (training split): The training split of iNat21 was integrated to further bolster the dataset.
- Metadata & Aggregation Challenges:
  - Integrating these diverse sources is a non-trivial task due to the inherent noise and inconsistency in taxonomic hierarchies across different biological databases [4, 31, 36, 52, 63].
  - The authors canonicalized (standardized) the labels and unified taxonomic hierarchies from EOL, the Integrated Taxonomic Information System (ITIS) [43], and iNaturalist.
  - Special consideration was given to homonyms (genus-species labels shared among higher-order taxa) to ensure correct linkage.
  - For any taxa that could not be resolved by these primary sources, the Global Names Resolver (GNR) API was used.
  - Result: This rigorous process allowed 84% of images in TREEOFLIFE-10M to be labeled with their full taxonomy, with about 10% labeled only down to the family rank (because BIOSCAN-1M does not always provide genus-species information).
- Dataset Statistics (Table 1): The following are the results from Table 1 of the original paper:

| Dataset | Description | Images | Unique Classes |
| iNat21 | Citizen scientist labeled image dataset from iNaturalist for fine-grained classification. | 2.7M | 10,000 |
| BIOSCAN-1M | Expert labeled image dataset of insects for classification. | 1.1M | 7,831 |
| EOL | A new dataset with citizen scientist images sourced from Encyclopedia of Life and taxonomic labels standardized by the authors. | 6.6M | 448,910 |
| TREEOFLIFE-10M | Largest-to-date ML-ready dataset of biology images with taxonomic labels. | 10.4M | 454,103 |

TREEOFLIFE-10M totals over 10.4 million images spanning 454,103 unique taxonomic names, a significant increase in scale and diversity over previous datasets.
- Release: The dataset (and the smaller RARE SPECIES test set) are released on Hugging Face with DOIs, including CSVs with metadata and links to primary sources, along with GitHub scripts for regenerating the dataset. A brief loading sketch follows.
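For readers who want to inspect the released data, below is a minimal sketch using the Hugging Face `datasets` library. The repository id `imageomics/rare-species` is an assumption based on the project's Hugging Face organization; check https://imageomics.github.io/bioclip for the authoritative links.

```python
# Minimal sketch, assuming the repository id below is correct;
# see https://imageomics.github.io/bioclip for the authoritative links.
from datasets import load_dataset

rare_species = load_dataset("imageomics/rare-species", split="train")  # hypothetical id

example = rare_species[0]
print(example.keys())  # expected: an image plus taxonomic metadata fields
```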
4.2.2. BioCLIP: A Vision Foundation Model for the Tree of Life
BioCLIP is built on a CLIP architecture, but its training strategy is specifically designed to leverage the taxonomic hierarchy present in TREEOFLIFE-10M.
- Initialization: BioCLIP is initialized from OpenAI's public CLIP weights, specifically using a ViT-B/16 (Vision Transformer Base, patch size 16×16) as the vision encoder and a 77-token causal autoregressive transformer as the text encoder. This warm start leverages pre-learned general visual and language representations.
Why
CLIP's Objective is Crucial for Taxonomy:- Challenge with Standard Supervised Classification: A common strategy for labeled datasets like
TREEOFLIFE-10Mis to use a supervised classification objective, where the model learns to map an image to a taxonomic index. However, this treats each taxon (e.g., a species) as an independent symbol. It fails to account for the rich, interconnected hierarchical structure of theTree of Life. Consequently, such a model would struggle to generalize tounseen taxaor supportzero-shot classificationof new species. - Repurposing
CLIP's Multimodal Contrastive Learning: The authors propose thatCLIP's contrastive learning objective can be repurposed to explicitly leverage this hierarchical structure.- "Flattening" the Taxonomy: For each species, the complete taxonomic hierarchy (from
Kingdomdown to the most specifictaxon rankavailable, e.g.,species) is concatenated into a single string. This string is called thetaxonomic name. For example, for a black-billed magpie, thetaxonomic namemight be "Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia." - Contrastive Learning with Taxonomic Names: The
CLIPcontrastive learning objective is then used to train the model to match images with their correspondingtaxonomic names. This means thevision encoderlearns to produce embeddings for images, and thetext encoderlearns to produce embeddings for thesetaxonomic namestrings. The training optimizes for high similarity between correct image-text pairs and low similarity for incorrect pairs. - Generalization Mechanism: This approach intuitively aids generalization. If the model hasn't seen a particular species, it has likely encountered its higher taxonomic ranks (e.g.,
genus,family) in othertaxonomic names. Theautoregressive text encodernaturally embeds these hierarchical relationships by conditioning later taxonomic rank representations on earlier (higher) ranks. This provides a strong prior forfew-shotorzero-shot learningof new taxa. - Technical Contribution: The authors emphasize that repurposing
CLIP's objective for learning hierarchical representations conforming to a taxonomy is a novel and non-trivial technical contribution.
- "Flattening" the Taxonomy: For each species, the complete taxonomic hierarchy (from
- Challenge with Standard Supervised Classification: A common strategy for labeled datasets like
- Text Types for Flexibility:
  - CLIP's text encoder accepts free-form text, which is a powerful feature for biological labels, as they come in various formats. The paper considers several text types (Table 3) for training and inference. The following are the results from Table 3 of the original paper:

| Text Type | Example |
| Common | black-billed magpie |
| Scientific | Pica hudsonia |
| Taxonomic | Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia |
| Scientific + Common | Pica hudsonia with common name black-billed magpie |
| Taxonomic + Common | Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia with common name black-billed magpie |

  - Common name (e.g., "black-billed magpie"): more widespread, but can be ambiguous (one species may have multiple common names, and one common name may cover multiple species).
  - Scientific name (e.g., "Pica hudsonia"): standardized binomial nomenclature (genus + species).
  - Taxonomic name (e.g., "Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia"): the "flattened" full hierarchical path.
  - Scientific + Common name and Taxonomic + Common name: combinations that provide more context.
- Mixed Text Type Training Strategy: To improve flexibility at inference time (as users might only have one type of label), the authors propose a mixed text type training strategy. At each training step, for a given image, a text label is randomly sampled from all its available text types (e.g., Common, Scientific, Taxonomic). This strategy helps BioCLIP retain the generalization benefits of taxonomic names while being adaptable to different naming conventions during inference. The final text input to CLIP is formatted using a standard template, e.g., "a photo of Pica hudsonia." A minimal sketch of this labeling scheme is shown below.
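The following sketch illustrates how the "flattened" taxonomic name and the mixed text type sampling described above could be implemented. The rank ordering and prompt template follow the paper's examples, while the helper function names and the dictionary representation are illustrative assumptions.

```python
import random

TAXONOMIC_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def taxonomic_name(taxon):
    """Flatten the hierarchy into one string, e.g.
    'Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia'."""
    return " ".join(taxon[rank] for rank in TAXONOMIC_RANKS if rank in taxon)

def sample_text_label(taxon, common_name=None):
    """Mixed text type strategy: randomly pick one available text type per step."""
    candidates = [
        taxon["genus"] + " " + taxon["species"],  # scientific name
        taxonomic_name(taxon),                    # taxonomic name
    ]
    if common_name:
        candidates.append(common_name)            # common name
        candidates.append(taxonomic_name(taxon) + " with common name " + common_name)
    text = random.choice(candidates)
    return f"a photo of {text}"                   # standard CLIP-style template

magpie = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves",
          "order": "Passeriformes", "family": "Corvidae",
          "genus": "Pica", "species": "hudsonia"}
print(sample_text_label(magpie, "black-billed magpie"))
```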
4.2.3. Training Details
- Initialization: BioCLIP begins with weights from OpenAI's CLIP model (specifically the ViT-B/16 architecture); this is a form of continual pre-training.
- Dataset: TREEOFLIFE-10M.
- Epochs: 100.
- Learning Rate Schedule: Cosine learning rate schedule [49].
- Hardware: 8 NVIDIA A100-80GB GPUs across 2 nodes.
- Global Batch Size: 32,768.
- Baseline Training: A baseline model trained only on iNat21 (same architecture) and the ablation models trained on 1M examples from TREEOFLIFE-10M used a smaller global batch size of 16,384 on 4 NVIDIA A100 GPUs on 1 node.
- Hyperparameters (Tables D1 & D2): The following are the results from Table D1 of the original paper:

| Hyperparameter | Value |
| Architecture | ViT-B/16 |
| Max learning rate | 1 × 10⁻⁴ |
| Warm-up steps | 1,000 |
| Weight decay | 0.2 |
| Input resolution | 224 × 224 |

The following are the results from Table D2 of the original paper:

| Dataset | Text Type | Batch Size | Epoch |
| TreeOfLife-10M | Mixture | 32K | 100 |
| iNat21 Only | Mixture | 16K | 65 |
| TreeOfLife-1M | Tax+Com | | |
| | Common | 16K | 86 |
| | Scientific | | 87 |
| | Taxonomy | | 87 |
| | Sci+Com | | 87 |

Note: Table D2 has some missing values for the TreeOfLife-1M batch sizes and epochs (Scientific/Taxonomy/Sci+Com), implying they share the same 16K batch size and similar epoch counts to Common (86/87).
4.2.4. Hierarchical Multitask Objective (for comparison)
To justify the choice of the CLIP objective, the authors compare it against a hierarchical classification approach.
- Objective: To predict labels for each taxonomic rank (Kingdom, Phylum, Class, etc.) down to species.
- Loss Function: Cross-entropy at each level of the taxonomy, with the per-level losses summed.
- Pseudocode (Listing 1):
```python
import torch.nn.functional as F

def forward(vit, heads, images, h_labels):
    # vit: vision transformer (image encoder).
    # heads: linear layers, one per taxonomic rank (e.g., 7 ranks, Kingdom to Species).
    # images: batch of input images.
    # h_labels: hierarchical labels; one label tensor per rank for the batch.
    img_feats = vit(images)
    h_logits = [head(img_feats) for head in heads]
    losses = [F.cross_entropy(logits, labels)
              for logits, labels in zip(h_logits, h_labels)]
    return sum(losses)
```
- The function
forwardtakes thevision transformer(vit), a list ofheads(linear layers, one for each taxonomic rank), a batch ofimages, andh_labels(hierarchical labels for each image, e.g., 7 labels for Kingdom to Species). img_feats = vit(images): Thevision transformer(vit) processes the inputimagesto extract dense visual features (img_feats).h_logits = [head(img_feats) for head in heads]: For eachtaxonomic rank(e.g., Kingdom, Phylum), there is a dedicatedlinear layer(head). Eachheadtakes theimg_featsand produceslogits(h_logits), which are raw prediction scores for the classes at that specific taxonomic rank.losses = [F.cross_entropy(logits, label) for logits, labels in zip(h_logits, h_labels)]: For eachtaxonomic rank, across-entropy lossis calculated between the predictedlogitsfor that rank and the truelabels(h_labels) for that rank.return sum(losses): The final loss for the entire model is the sum of thecross-entropy lossesfrom all taxonomic ranks. This encourages theViTto learn image features that are useful for classifying images at multiple hierarchical levels simultaneously.
- The function
- Explanation:
5. Experimental Setup
5.1. Datasets
The experiments utilize a large-scale training dataset (TREEOFLIFE-10M) and a suite of 10 diverse evaluation datasets.
- Training Dataset: TREEOFLIFE-10M, the primary training dataset developed in the paper.
  - Source: Aggregation of iNat21 (training split), BIOSCAN-1M, and newly curated images from the Encyclopedia of Life (EOL).
  - Scale: Over 10 million images.
  - Characteristics: Covers 454,103 unique taxa across plants, animals, and fungi. Includes diverse image distributions (in-situ photos, lab images, citizen science photos). Each image is labeled with its full taxonomic hierarchy.
  - Purpose: Designed to train a foundation model with broad coverage and fine-grained understanding of biological diversity.
- Evaluation Datasets: The paper evaluates on 10 tasks, covering various kingdoms and image types. The following are the results from Table 2 of the original paper:
| Name | Description | Examples | Classes | Labels |
| Birds 525 | Scraped dataset of bird images from web search [68]. | 89,885 | 525 | Taxonomic |
| Plankton | Expert-labeled in situ images of plankton [35]. | 4,080 | 102 | Mixed |
| Insects | Expert and volunteer-labeled in-the-wild citizen science images of insects [74]. | 4,680 | 117 | Scientific |
| Insects 2 | Mixed common and scientific name classification for insect pests [91]. | 4,080 | 102 | Mixed |
| PlantNet | Citizen science species-labeled plant images, some drawings [27]. | 1,000 | 25 | Scientific |
| Fungi | Expert-labeled images of Danish fungi [66]. | 1,000 | 25 | Scientific |
| PlantVillage | Museum-style leaf specimens labeled with common names [25]. | 1,520 | 38 | Common |
| Medicinal Leaf | Species classification of leaves from mature, healthy medicinal plants [71]. | 1,040 | 26 | Scientific |
| PlantDoc | 17 diseases for 13 plant species [76]. | 1,080 | 27 | Common |
| Rare Species | Subset of species in the IUCN Red List categories Near Threatened through Extinct in the Wild (iucnredlist.org). | 12,000 | 400 | Taxonomic |
- Cover the
Tree of Life: Include organisms from animals, plants, fungi, and protists (fromPlankton). - Diverse Image Distributions: Feature photographs, microscope images, drawings, and museum specimens, ensuring the model's robustness across different visual contexts.
- Fine-grained Nature: All are fine-grained classification tasks, directly testing
BioCLIP's core capability. - Variety of Label Types: Labels range from full
taxonomic namestoscientificorcommon names, testing the flexibility ofBioCLIP'smixed text type training. RARE SPECIES: This newly curated dataset is particularly crucial. It contains species from theIUCN Red List(e.g., Near Threatened to Extinct in the Wild). Critically, these species were removed fromTREEOFLIFE-10Mtraining data. This makesRARE SPECIESan ideal benchmark for evaluatingBioCLIP'sout-of-distribution generalizationtounseen taxaand its potential for conservation applications. It consists of 400 species, each with at least 30 images.
- Cover the
- Rationale for Dataset Choice: These datasets were chosen because they:
5.2. Evaluation Metrics
The primary evaluation metric used in the paper is Accuracy, specifically Top-1 Accuracy.
- Top-1 Accuracy:
  - Conceptual Definition: Top-1 accuracy measures the percentage of predictions where the model's highest-confidence prediction (the class with the highest probability or similarity score) matches the true label of the input. It is a straightforward and widely used metric in classification tasks, indicating how often the model gets the classification exactly right.
  - Mathematical Formula: Let $N$ be the total number of test samples, $y_i$ the true label for sample $i$, and $\hat{y}_i$ the predicted label with the highest confidence for sample $i$. The Top-1 Accuracy is calculated as: $ \text{Top-1 Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i) $
  - Symbol Explanation:
    - $N$: total number of samples in the test set.
    - $y_i$: the ground-truth (true) label for the $i$-th sample.
    - $\hat{y}_i$: the label predicted by the model with the highest confidence for the $i$-th sample.
    - $\mathbb{I}(\cdot)$: the indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise.
    - $\sum_{i=1}^{N}$: summation over all samples.
- Evaluation Modes:
  - Zero-shot Learning: Follows the standard CLIP procedure. The image embedding is compared against the text embeddings of candidate class labels (e.g., "a photo of [class name]"), and the class with the highest similarity is chosen.
  - Few-shot Learning: Uses SimpleShot [90], a nearest-centroid classifier.
    - For $k$-shot learning, $k$ examples are randomly sampled for each class.
    - Image embeddings are obtained from the visual encoder of the pre-trained model for these examples.
    - The centroid for each class is computed by averaging its image embeddings.
    - All remaining examples in the dataset are used for testing.
    - Mean subtraction and L2-normalization are applied to both centroids and test feature vectors.
    - The prediction for a test vector is the class whose centroid is nearest in the embedding space (typically using cosine similarity or Euclidean distance).
    - Each few-shot experiment is repeated 5 times with different random seeds, and the mean accuracy is reported; standard deviations are provided in the Appendix.
  - Generalized Zero-Shot Learning (GZSL) (Appendix H): This setting requires the model to classify images from unseen classes within a set that also includes seen classes. BioCLIP is tested on RARE SPECIES images using a mixed set of 800 labels (400 seen, 400 unseen).

A minimal sketch of the zero-shot and few-shot protocols follows.
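Below is a minimal sketch of the two extrinsic evaluation protocols, assuming a pretrained vision/text encoder pair that maps inputs into the shared embedding space. The function names, prompt template, and the use of cosine similarity for the nearest-centroid step are illustrative rather than taken from the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_encoder, text_encoder, tokenize, images, class_names):
    """Standard CLIP-style zero-shot classification over candidate labels."""
    prompts = [f"a photo of {c}" for c in class_names]
    txt = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (num_classes, d)
    img = F.normalize(image_encoder(images), dim=-1)             # (batch, d)
    return (img @ txt.T).argmax(dim=-1)                          # best-matching label index

@torch.no_grad()
def simpleshot_predict(support_feats, support_labels, query_feats, num_classes):
    """SimpleShot-style nearest-centroid few-shot classification.

    support_feats: (n_support, d) embeddings of the k labeled examples per class.
    support_labels: (n_support,) integer class labels.
    query_feats:   (n_query, d) embeddings of the remaining test images.
    """
    # Mean subtraction (estimated from the support set), then L2 normalization.
    mean = support_feats.mean(dim=0, keepdim=True)
    support = F.normalize(support_feats - mean, dim=-1)
    query = F.normalize(query_feats - mean, dim=-1)

    # One centroid per class = average of its normalized support embeddings.
    centroids = torch.stack([support[support_labels == c].mean(dim=0)
                             for c in range(num_classes)])
    centroids = F.normalize(centroids, dim=-1)

    # Assign each query image to the nearest centroid (cosine similarity).
    return (query @ centroids.T).argmax(dim=-1)
```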
5.3. Baselines
The paper compares BioCLIP against a range of representative baseline models to demonstrate its effectiveness.
- CLIP (OpenAI): The original CLIP model [69] with a ViT-B/16 vision transformer. This serves as a direct comparison point, as BioCLIP is a fine-tuned version of CLIP. By default, CLIP is evaluated with common names for class labels, as this is what it is generally most effective with due to its web-scale training data.
- OpenCLIP: An open-source reproduction of CLIP trained on the LAION-400M dataset [73]. It represents another strong general-purpose CLIP-like model. Similar to OpenAI CLIP, it is primarily evaluated with common names.
- iNat21 Only: A CLIP model (same ViT-B/16 architecture) continually pre-trained only on the iNat21 dataset (a subset of TREEOFLIFE-10M). This baseline isolates the impact of TREEOFLIFE-10M's expanded data diversity and scale compared to a leading existing biological dataset.
- Supervised-IN21K: A model pre-trained on ImageNet-21K [21, 78]. This represents a strong supervised pre-training baseline from the general computer vision domain. It is used for few-shot classification (as it does not support zero-shot classification the way CLIP does).
- DINO: A self-supervised vision transformer model [15]. This baseline represents a different pre-training paradigm (self-supervised) from general computer vision, also used for few-shot classification.
- Random Guessing: A simple baseline indicating the performance expected from random chance, calculated as $1 / (\text{number of classes})$.

These baselines were chosen to represent:
- General-purpose CLIP models: to show the benefit of domain-specific adaptation.
- Domain-specific CLIP (on iNat21): to highlight the impact of the TREEOFLIFE-10M dataset's scale and diversity.
- Strong general vision models (Supervised-IN21K, DINO): to demonstrate BioCLIP's superiority even against advanced models from outside the CLIP paradigm.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that BioCLIP consistently and substantially outperforms existing baselines across diverse fine-grained biology classification tasks, particularly in zero-shot and few-shot settings.
- Overall Superiority: BioCLIP achieves significant performance gains, with an average absolute improvement of 17.5% in zero-shot accuracy, 16.7% in one-shot accuracy, and 17.3% in five-shot accuracy over the best baseline.
- Generalization to Unseen Taxa: Its strong performance on the RARE SPECIES dataset (a set of species unseen during training) highlights its exceptional out-of-distribution generalization, which is critical for real-world biological applications like conservation.
- Performance Across Domains: BioCLIP performs well on all evaluated tasks, covering animals (Birds 525, Plankton, Insects, Insects 2), plants (PlantNet, PlantVillage, Medicinal Leaf, PlantDoc), and fungi (Fungi), with diverse image types (photos, microscope images, drawings, specimens). This validates its claim as a foundation model for the Tree of Life.
- Beyond Species Classification: BioCLIP shows strong capabilities in tasks like plant disease diagnosis (PlantVillage, PlantDoc), which involve classifying both species and conditions, demonstrating its utility beyond simple species identification.
6.2. Data Presentation (Tables)
The following are the results from Table 4 of the original paper:
| Model | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Fungi | PlantVillage | Medicinal Leaf | PlantDoc | Rare Species | Mean | Δ |
| Random Guessing | 0.2 | 1.2 | 1.0 | 1.0 | 4.0 | 4.0 | 2.6 | 4.0 | 3.7 | 0.3 | 2.2 | |
| Zero-Shot Classification | ||||||||||||
| CLIP | 49.9 | 3.2 | 9.1 | 9.8 | 58.5 | 10.2 | 5.4 | 15.9 | 26.1 | 31.8 | 21.9 | − |
| OpenCLIP | 54.7 | 2.2 | 6.5 | 9.6 | 50.2 | 5.7 | 8.0 | 12.4 | 25.8 | 28.4 | 20.4 | −1.5 |
| BioCLIP | 72.1 | 6.1 | 34.8 | 20.4 | 91.4 | 40.7 | 24.4 | 38.6 | 39.4 | 56.1 | 42.4 | +20.5 |
| - iNat21 Only | 56.1 | 2.6 | 30.7 | 11.5 | 88.2 | 43.0 | 18.4 | 25.6 | 31.7 | 21.3 | 32.9 | +9.8 |
| One-Shot Classification | ||||||||||||
| CLIP | 43.7 | 25.1 | 21.6 | 13.7 | 42.1 | 17.2 | 49.7 | 70.1 | 24.8 | 28.5 | 33.6 | − |
| OpenCLIP | 53.7 | 32.3 | 23.2 | 14.3 | 45.1 | 18.4 | 53.6 | 71.2 | 26.8 | 29.2 | 36.7 | +3.1 |
| Supervised-IN21K | 60.2 | 22.9 | 14.7 | 14.4 | 46.7 | 16.9 | 62.3 | 58.6 | 27.7 | 28.0 | 35.2 | +1.6 |
| DINO | 40.5 | 37.0 | 23.5 | 16.4 | 30.7 | 20.0 | 60.0 | 79.2 | 23.7 | 31.0 | 36.2 | +2.6 |
| BioCLIP | 71.8 | 30.6 | 57.4 | 20.4 | 64.5 | 40.3 | 58.8 | 84.3 | 30.7 | 44.9 | 50.3 | +16.7 |
| - iNat21 Only | 74.8 | 29.6 | 53.9 | 19.7 | 67.4 | 35.5 | 55.2 | 75.1 | 27.8 | 36.9 | 47.5 | +13.9 |
| Five-Shot Classification | ||||||||||||
| CLIP | 73.5 | 41.2 | 39.9 | 24.6 | 65.2 | 27.9 | 71.8 | 89.7 | 35.2 | 46.0 | 51.5 | − |
| OpenCLIP | 81.9 | 52.5 | 42.6 | 25.0 | 68.0 | 30.6 | 77.8 | 91.3 | 42.0 | 47.4 | 55.9 | +4.4 |
| Supervised-IN21K | 83.9 | 39.2 | 32.0 | 25.4 | 70.9 | 30.9 | 82.4 | 82.3 | 44.7 | 47.3 | 53.9 | +2.4 |
| DINO | 70.8 | 56.9 | 46.3 | 28.6 | 50.3 | 34.1 | 82.1 | 94.9 | 40.3 | 50.1 | 55.4 | +3.9 |
| BioCLIP | 90.0 | 49.3 | 77.8 | 33.6 | 85.6 | 62.3 | 80.9 | 95.9 | 47.5 | 65.7 | 68.8 | +17.3 |
| - iNat21 Only | 90.1 | 48.2 | 73.7 | 32.1 | 84.7 | 55.6 | 77.2 | 93.5 | 41.0 | 55.6 | 65.1 | +13.6 |
The following are the results from Table E3 of the original paper:
| Model | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Rare Species |
| One-Shot Classification | ||||||
| CLIP | 43.7 ± 0.26 | 25.1 ± 0.71 | 21.6 ± 1.05 | 13.7 ± 1.09 | 42.1 ± 3.40 | 28.5 ± 0.65 |
| OpenCLIP | 53.7 ± 0.52 | 32.3 ± 0.63 | 23.2 ± 1.58 | 14.3 ± 0.67 | 45.1 ± 3.40 | 29.2 ± 0.64 |
| Supervised-IN21K | 60.2 ± 1.02 | 22.9 ± 0.84 | 14.7 ± 1.38 | 14.4 ± 0.90 | 46.7 ± 6.30 | 28.0 ± 0.77 |
| DINO | 40.5 ± 0.96 | 37.0 ± 1.39 | 23.5 ± 1.49 | 16.4 ± 0.78 | 30.7 ± 3.79 | 31.0 ± 0.89 |
| BioCLIP | 71.8 ± 0.47 | 30.6 ± 0.77 | 57.4 ± 2.4 | 20.4 ± 1.28 | 64.5 ± 2.15 | 44.9 ± 0.73 |
| - iNat21 Only | 74.8 ± 0.89 | 29.6 ± 0.82 | 53.9 ± 0.97 | 19.7 ± 0.80 | 67.4 ± 4.54 | 36.9 ± 1.02 |
| Five-Shot Classification | ||||||
| CLIP | 73.5 ± 0.37 | 41.2 ± 1.01 | 39.9 ± 0.86 | 24.6 ± 0.90 | 65.2 ± 1.25 | 46.0 ± 0.33 |
| OpenCLIP | 81.9 ± 0.25 | 52.5 ± 0.83 | 42.6 ± 0.82 | 25.0 ± 0.83 | 68.0 ± 0.86 | 47.4 ± 0.34 |
| Supervised-IN21K | 83.9 ± 0.15 | 39.2 ± 1.66 | 32.0 ± 1.90 | 25.4 ± 2.13 | 70.9 ± 2.45 | 47.3 ± 0.41 |
| DINO | 70.9 ± 0.34 | 56.9 ± 1.61 | 46.3 ± 1.37 | 28.6 ± 1.59 | 50.3 ± 3.20 | 50.1 ± 0.47 |
| BioCLIP | 90.0 ± 0.12 | 49.3 ± 1.14 | 77.8 ± 0.81 | 33.6 ± 0.74 | 85.6 ± 1.79 | 65.7 ± 0.43 |
| - iNat21 Only | 90.1 ± 0.08 | 48.2 ± 1.24 | 73.7 ± 0.65 | 32.1 ± 1.97 | 84.7 ± 1.24 | 55.6 ± 0.16 |
The following are the results from Table E4 of the original paper:
| Model | Fungi | PlantVillage | Medicinal Leaf | PlantDoc | Rare Species |
| One-Shot Classification | |||||
| CLIP | 17.2 ± 0.78 | 49.7 ± 2.53 | 70.1 ± 2.83 | 24.8 ± 1.61 | 28.5 ± 0.65 |
| OpenCLIP | 18.4 ± 1.26 | 53.6 ± 0.79 | 71.2 ± 3.58 | 26.8 ± 1.45 | 29.2 ± 0.64 |
| Supervised-IN21K | 16.9 ± 2.32 | 62.3 ± 2.28 | 58.6 ± 4.45 | 27.7 ± 2.86 | 28.0 ± 0.77 |
| DINO | 20.0 ± 1.53 | 60.0 ± 2.15 | 79.2 ± 2.74 | 23.7 ± 2.48 | 31.0 ± 0.89 |
| BioCLIP | 40.3 ± 3.00 | 58.8 ± 2.83 | 84.3 ± 1.90 | 30.7 ± 1.75 | 44.9 ± 0.73 |
| - iNat21 Only | 35.5 ± 2.93 | 55.2 ± 1.58 | 75.1 ± 1.16 | 27.8 ± 1.31 | 36.9 ± 1.02 |
| Five-Shot Classification | |||||
| CLIP | 27.9 ± 2.54 | 71.8 ± 1.46 | 89.7 ± 1.45 | 35.2 ± 1.59 | 46.0 ± 0.33 |
| OpenCLIP | 30.6 ± 1.26 | 77.8 ± 1.28 | 91.3 ± 0.85 | 42.0 ± 1.32 | 47.4 ± 0.34 |
| Supervised-IN21K | 30.9 ± 2.64 | 82.4 ± 1.53 | 82.3 ± 3.81 | 44.7 ± 2.26 | 47.3 ± 0.41 |
| DINO | 34.1 ± 2.87 | 82.1 ± 1.31 | 94.9 ± 1.30 | 40.3 ± 2.32 | 50.1 ± 0.47 |
| BioCLIP | 62.3 ± 1.82 | 80.9 ± 1.04 | 95.9 ± 1.07 | 47.5 ± 1.35 | 65.7 ± 0.43 |
| - iNat21 Only | 55.6 ± 2.61 | 77.2 ± 0.68 | 93.5 ± 1.13 | 41.0 ± 1.75 | 55.6 ± 0.16 |
6.2.1. Zero-Shot Classification
- BioCLIP achieves a mean accuracy of 42.4%, substantially outperforming CLIP (21.9%) and OpenCLIP (20.4%). This is a massive improvement, demonstrating the effectiveness of pre-training on TREEOFLIFE-10M and using taxonomic information.
- The RARE SPECIES column is particularly telling: BioCLIP scores 56.1%, while CLIP gets 31.8% and OpenCLIP 28.4%. This validates BioCLIP's ability to generalize to unseen taxa, which is crucial for biological applications.
- The iNat21 Only model (32.9%) also performs better than the generic CLIPs but is still significantly behind the full BioCLIP, indicating that the expanded diversity of TREEOFLIFE-10M beyond iNat21 is vital.
6.2.2. One-Shot Classification
- BioCLIP achieves a mean accuracy of 50.3%, again significantly higher than CLIP (33.6%) and OpenCLIP (36.7%).
- On RARE SPECIES, BioCLIP scores 44.9%, maintaining a strong lead.
- Supervised-IN21K (35.2%) and DINO (36.2%) are competitive with CLIP and OpenCLIP, but are still far outpaced by BioCLIP, confirming the benefit of BioCLIP's specialized pre-training for fine-grained biological tasks even in the few-shot regime.
- The mean one-shot accuracy of BioCLIP is 9.1% higher than its zero-shot accuracy, indicating that even a single labeled example significantly improves performance, contrary to some findings for general CLIP models.
6.2.3. Five-Shot Classification
- The trend continues, with BioCLIP leading at a mean accuracy of 68.8%, compared to CLIP (51.5%) and OpenCLIP (55.9%).
- The gains are consistent across most datasets, with BioCLIP showing particularly strong performance on Insects (77.8%), Fungi (62.3%), and RARE SPECIES (65.7%).
- The performance gap between BioCLIP and Supervised-IN21K / DINO remains substantial, reinforcing BioCLIP's advantage in biological domain generalization.
6.3. Ablation Studies / Parameter Analysis
6.3.1. How Do Text Types Affect Generalization?
This ablation study investigates the impact of different text types used during training on zero-shot generalization, specifically on the RARE SPECIES dataset (where all species are unseen during training). The study uses a 10% subset of TREEOFLIFE-10M (referred to as ToL-1M) for computational reasons.
The following are the results from Table 5 of the original paper:
| Dataset | Train↓Test→ | Com | Sci | Tax | Sci+Com | Tax+Com |
| ToL-1M | Com | 24.9 | 9.5 | 10.8 | 22.3 | 21.0 |
| Sci | 11.0 | 22.3 | 4.5 | 21.5 | 8.0 | |
| Tax | 11.8 | 10.1 | 26.6 | 16.0 | 24.8 | |
| Sci+Com | 24.5 | 12.9 | 12.6 | 28.0 | 24.9 | |
| Tax+Com Mixture | 20.5 | 8.0 | 19.7 | 24.0 | 30.4 | |
| iNat21-2.7M Mixture | 26.1 | 24.9 | 26.7 | 29.5 | 30.9 | |
| ToL-10M Mixture | 31.6 | 30.1 | 34.1 | 37.0 | 38.0 | |
- Key Observations:
  - Importance of Taxonomic names: Training with Taxonomic names (row "Tax") yields strong performance when tested with Taxonomic names (26.6%), and the Tax+Com mixture achieves the best overall performance (30.4%) when tested with Taxonomic + Common names on ToL-1M. This highlights that incorporating the full taxonomic structure is crucial for generalization.
  - Mixed Text Type Training is Superior: The "Tax+Com Mixture" row shows that training with a mixture of text types (randomly sampling from the available text types for each image) provides the most robust performance across different text types at test time. For instance, the Tax+Com Mixture (30.4%) outperforms any single-text-type model (e.g., Tax tested with Tax at 26.6%). This strategy retains the generalization benefits of taxonomic names while offering flexibility.
  - Performance Degradation with Mismatched Text Types: When a model is trained on a single text type (e.g., Common or Scientific) and tested with a different one, performance degrades substantially. For example, the Sci-trained model scores 22.3% when tested with Sci, but only 4.5% when tested with Tax.
  - Data Scale Matters: Comparing "ToL-1M Mixture" (30.4%) with "iNat21-2.7M Mixture" (30.9%) and "ToL-10M Mixture" (38.0%, the RARE SPECIES result from Table 4, restated here for consistency with Table 5), the full TREEOFLIFE-10M dataset clearly boosts performance, even when both iNat21 and ToL-1M use the mixed text type strategy. This confirms the importance of the added data diversity and scale from TREEOFLIFE-10M.
6.3.2. Is the CLIP Objective Necessary?
This ablation compares the CLIP objective against traditional cross-entropy based methods. Models (ViT-B/16) are trained on TREEOFLIFE-1M (10% subset) and evaluated in one-shot and five-shot settings (zero-shot is not possible for non-CLIP models).
The following are the results from Table 6 of the original paper:
| Objective | Mean 1-Shot | Mean 5-shot |
| Cross-entropy | 16.5 | 26.2 |
| Hier. cross-entropy | 19.3 | 30.5 |
| CLIP | 44.7 | 63.8 |
- Key Finding: The CLIP objective (44.7% one-shot, 63.8% five-shot) massively outperforms both standard cross-entropy (16.5% one-shot, 26.2% five-shot) and hierarchical cross-entropy (19.3% one-shot, 30.5% five-shot).
- Justification: This strong result provides empirical justification for repurposing the CLIP objective for BioCLIP. While hierarchical cross-entropy shows a modest improvement over plain cross-entropy by leveraging some hierarchical information, it cannot compete with the representation learning power of the CLIP contrastive objective in low-data (few-shot) regimes. This confirms that the contrastive approach is far more effective for learning generalized, fine-grained representations suitable for the Tree of Life.
6.3.3. Can BioCLIP Classify More Than Species?
This section examines BioCLIP's ability to transfer to tasks beyond pure species classification, specifically plant diagnosis (classifying species and disease) using the PlantVillage and PlantDoc datasets.
- Results (from Table 4):
  - PlantVillage (zero-shot): BioCLIP 24.4%, CLIP 5.4%, OpenCLIP 8.0%.
  - PlantDoc (zero-shot): BioCLIP 39.4%, CLIP 26.1%, OpenCLIP 25.8%.
  - PlantVillage (one-shot): BioCLIP 58.8%, CLIP 49.7%, OpenCLIP 53.6%.
  - PlantDoc (one-shot): BioCLIP 30.7%, CLIP 24.8%, OpenCLIP 26.8%.
  - BioCLIP consistently outperforms the baselines in both zero-shot and few-shot settings on these tasks.
- Implication: This indicates that BioCLIP learns visual representations that transfer to complex, multi-faceted biological questions like disease diagnosis, even though its primary training objective is contrastive species classification; the learned representations are rich enough to capture subtle visual cues for conditions and traits. The fact that BioCLIP's mean one-shot accuracy is 9.1% higher than its zero-shot accuracy further implies that it extracts useful visual features even from a single labeled example, which is beneficial given the high cost of biological data labeling.
6.3.4. Does BioCLIP Learn the Hierarchy?
This intrinsic evaluation aims to visualize BioCLIP's learned image representations and determine if they conform to the Tree of Life hierarchy.
- Method:
  - t-SNE [85] is used to reduce the dimensionality of image embeddings from iNat21's validation set (unseen during training) for visualization; a minimal sketch of this procedure appears after this list.
  - Points in the t-SNE plot are colored by their taxonomic labels.
  - t-SNE is run independently on subsets of examples belonging to specific taxonomic ranks to visualize clustering at different hierarchical levels.
- Observations (Figure 3, high-level description based on the text):
  - At higher taxonomic ranks (e.g., Kingdom, Phylum), both CLIP and BioCLIP show good separation between groups, but BioCLIP's representations exhibit a more fine-grained and richer clustering structure.
  - At lower taxonomic ranks (e.g., Class, Order, Family), BioCLIP produces evidently more separable features, with distinct clusters for different groups, whereas CLIP's features appear cluttered and lack a clear, discernible hierarchical structure.
- Conclusion: This visual evidence supports the hypothesis that BioCLIP has successfully learned a fine-grained hierarchical representation that aligns with the Tree of Life. This intrinsic property explains BioCLIP's superior generalization and fine-grained classification capabilities observed in the extrinsic evaluations.
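A minimal sketch of the visualization procedure described above, assuming image embeddings and taxonomic labels at a given rank have already been computed; the random placeholder data and t-SNE hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: rows of `embeddings` are image features for validation images,
# and `labels` holds the taxonomic label at the rank being visualized
# (e.g., phylum for all animals, or order for all birds).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512)).astype(np.float32)
labels = rng.integers(0, 5, size=500)

# Run t-SNE independently on this subset of examples, as described above.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)

# Color points by their taxonomic label to inspect clustering at this rank.
plt.figure(figsize=(5, 5))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of image embeddings, colored by taxonomic label")
plt.savefig("tsne_rank.png", dpi=150)
```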
6.4. Generalized Zero-Shot Learning (GZSL)
As reported in Appendix H, BioCLIP was also evaluated in a challenging GZSL setting, in which a model must classify images from unseen classes against a label set containing both seen and unseen labels.
- Setup: 400 seen species from TREEOFLIFE-10M were combined with the 400 unseen RARE SPECIES labels, creating an 800-label classification task over the RARE SPECIES images (a minimal sketch of this evaluation follows this list).
- Results: BioCLIP achieved 26.0% top-1 accuracy, outperforming CLIP (23.0%) and OpenCLIP (18.2%).
- Implication: This further confirms BioCLIP's strong generalization abilities, even when faced with the added complexity of distinguishing unseen classes amidst a larger set of seen classes.
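The sketch below assembles the 800-label GZSL evaluation from precomputed text embeddings of the seen and unseen label sets; the function and variable names are hypothetical, but the logic (classify RARE SPECIES images against the union of the two label sets and score top-1 accuracy) follows the setup described above.

```python
import torch

def gzsl_top1_accuracy(image_feats, true_idx, seen_text_feats, unseen_text_feats):
    """Generalized zero-shot evaluation: images of unseen (RARE SPECIES) classes
    are classified against the union of seen and unseen label embeddings.

    image_feats:       (N, D) L2-normalized image embeddings of unseen-class images
    true_idx:          (N,) index of each image's class within the unseen label set
    seen_text_feats:   (S, D) L2-normalized text embeddings of the 400 seen labels
    unseen_text_feats: (U, D) L2-normalized text embeddings of the 400 unseen labels
    """
    all_text = torch.cat([seen_text_feats, unseen_text_feats], dim=0)  # (S+U, D)
    logits = image_feats @ all_text.t()                                # (N, S+U)
    pred = logits.argmax(dim=-1)
    # Unseen labels occupy indices S .. S+U-1 in the combined label set.
    correct = pred == (true_idx + seen_text_feats.size(0))
    return correct.float().mean().item()

# Toy usage with random, normalized features:
f = lambda *s: torch.nn.functional.normalize(torch.randn(*s), dim=-1)
print(gzsl_top1_accuracy(f(100, 512), torch.randint(0, 400, (100,)), f(400, 512), f(400, 512)))
```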
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces TREEOFLIFE-10M, a large-scale and diverse dataset of biological images, and BioCLIP, a vision foundation model designed for the Tree of Life. Through extensive and rigorous evaluation, BioCLIP demonstrates exceptional performance on fine-grained biological classification tasks, achieving substantial improvements (16% to 17% absolute accuracy) over existing baselines, including general-purpose CLIP models, in both zero-shot and few-shot settings. The core innovation lies in BioCLIP's training strategy, which repurposes CLIP's multimodal contrastive learning objective to explicitly leverage the hierarchical structure of biological taxonomy by using "flattened" taxonomic names. This approach, coupled with the scale and diversity of TREEOFLIFE-10M, enables BioCLIP to learn more fine-grained and generalizable representations. Intrinsic evaluations confirm that BioCLIP's learned representations conform to the Tree of Life hierarchy, providing the underlying mechanism for its strong generalization to unseen taxa and to complex biological questions such as disease diagnosis.
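As a concrete illustration of the "flattened" taxonomic-name strategy summarized above, the sketch below joins the ranks from kingdom down to the species epithet into a single caption string; the prompt template and the example species are illustrative, not copied from the paper.

```python
# Seven-rank taxonomy for an example species (snowy owl), kingdom down to species epithet.
taxonomy = {
    "kingdom": "Animalia",
    "phylum":  "Chordata",
    "class":   "Aves",
    "order":   "Strigiformes",
    "family":  "Strigidae",
    "genus":   "Bubo",
    "species": "scandiacus",
}

# Flatten the hierarchy into one text string used as the caption side of the
# contrastive objective; the "a photo of" template here is illustrative only.
ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
flattened = " ".join(taxonomy[rank] for rank in ranks)
caption = f"a photo of {flattened}."
print(caption)  # a photo of Animalia Chordata Aves Strigiformes Strigidae Bubo scandiacus.
```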
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Classification-focused BioCLIP: While effective for classification, the current BioCLIP is fundamentally trained for this specific objective.
  - Future Work: Scaling up the data even further, potentially incorporating over 100 million research-grade images from iNaturalist, and collecting richer textual descriptions of species' appearances to enable BioCLIP to extract fine-grained trait-level representations (e.g., specific morphological features, behaviors, or ecological roles), moving beyond classification alone.
- Taxonomic Labeling Issues: The authors discovered that hemihomonyms were occasionally mislabeled at higher taxonomic levels, affecting a small percentage (0.1-0.2%) of their data.
  - Future Work: Developing a more robust solution for taxonomic labeling that accounts for such misclassifications and incorporates ongoing re-naming efforts in biology (e.g., for bird species). The authors intend to release a patch addressing this.
7.3. Personal Insights & Critique
- Innovation of Leveraging Explicit Hierarchy: The most inspiring aspect of this paper is its elegant solution for incorporating explicit biological hierarchy into a CLIP-like model. Repurposing the CLIP objective by "flattening" the taxonomy into a single string for contrastive learning is a clever and highly effective strategy. It bypasses the complexity of multi-task hierarchical losses while still embedding the crucial structural information, a major conceptual step for domain-specific foundation models. This approach could potentially be adapted to other domains with rich, inherent hierarchical structure (e.g., medical ontologies, materials science classifications).
- The Power of Data Curation: The paper powerfully demonstrates that for domain-specific foundation models, the quality, diversity, and structured richness of the training data are as important as, if not more important than, sheer scale. TREEOFLIFE-10M is not just large; its careful aggregation, canonicalization of taxonomic labels, and inclusion of diverse image sources (in-situ, lab, citizen science) are critical to BioCLIP's success. This highlights the immense value of interdisciplinary collaboration between ML researchers and domain experts (here, biologists) for effective data engineering.
- Impact on Biological Research and Conservation: BioCLIP has immediate and significant potential. Its zero-shot and few-shot capabilities are game-changers for identifying rare or newly discovered species, monitoring biodiversity with limited labeled data, and democratizing AI tools for biologists who may lack extensive ML expertise. The ability to generalize to unseen taxa is particularly valuable for conservation efforts, where data on endangered or newly threatened species is often scarce.
- Potential Issues and Areas for Improvement:
  - Static Taxonomy: The current approach relies on a relatively static taxonomic hierarchy. Biology, however, is dynamic; classifications are constantly revised as new genetic and morphological data emerge. While the authors mention addressing re-naming, a more adaptive system that can gracefully incorporate taxonomic updates (e.g., using knowledge graph embeddings or dynamic graph neural networks) could make BioCLIP even more robust in the long term.
  - Computational Cost: Training BioCLIP on 10.4M images with a large ViT and Transformer requires substantial computational resources (8 NVIDIA A100-80GB GPUs). This raises questions about accessibility for smaller research groups or NGOs in the biological domain. Future work could explore more efficient training strategies or smaller model architectures that retain performance.
  - "Black Box" Concerns: Like many deep learning models, BioCLIP is a "black box." While t-SNE visualizations provide insight into its learned hierarchy, further research into explainable AI (XAI) for BioCLIP could reveal why it makes particular fine-grained distinctions, helping biologists trust and use the model more effectively (e.g., by identifying the visual features that are decisive for a species).
  - Data Bias: Despite efforts to diversify, any dataset, especially one aggregated from various sources, can inherit biases (e.g., geographic bias in citizen science data, overrepresentation of charismatic megafauna). While not explicitly discussed, mitigating these biases in future dataset iterations is important for equitable representation across the Tree of Life.
  - Ethical Considerations (as acknowledged by the authors): The authors address potential ethical concerns, particularly regarding misuse against endangered species (e.g., poaching). Their mitigation strategy (no specific geographic information, no conservation status in training) is a good start, but ongoing vigilance and research into responsible AI for conservation are crucial. The concern about over-reliance on model predictions is valid; BioCLIP should be seen as an assistant to, not a replacement for, expert biological judgment.