Deciphering the biosynthetic potential of microbial genomes using a BGC language processing neural network model
TL;DR Summary
This study presents BGC-Prophet, a transformer-based model for predicting and classifying biosynthetic gene clusters in microbial genomes. It improves both efficiency and accuracy, analyzing over 85,000 genomes to reveal BGC distribution patterns and environmental influences, aiding research in natural product discovery and synthetic biology.
Abstract
Biosynthetic gene clusters (BGCs), key in synthesizing microbial secondary metabolites, are mostly hidden in microbial genomes and metagenomes. To unearth this vast potential, we present BGC-Prophet, a transformer-based language model for BGC prediction and classification. Leveraging the transformer encoder, BGC-Prophet captures location-dependent relationships between genes. As one of the pioneering ultrahigh-throughput tools, BGC-Prophet significantly surpasses existing methods in efficiency and fidelity, enabling comprehensive pan-phylogenetic and whole-metagenome BGC screening. Through the analysis of 85,203 genomes and 9,428 metagenomes, BGC-Prophet has profiled an extensive array of sub-million BGCs. It highlights notable enrichment in phyla like Actinomycetota and the widespread distribution of polyketide, NRP, and RiPP BGCs across diverse lineages. It reveals enrichment patterns of BGCs following important geological events, suggesting environmental influences on BGC evolution. BGC-Prophet’s capabilities in detection of BGCs and evolutionary patterns offer contributions to deeper understanding of microbial secondary metabolites and application in synthetic biology.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Deciphering the biosynthetic potential of microbial genomes using a BGC language processing neural network model
1.2. Authors
Qilong Lai, Shuai Yao, Yuguo Zha, Haohong Zhang, Haobo Zhang, Ying Ye, Yonghui Zhang, Hong Bai, Kang Ning.
The first three authors (Qilong Lai, Shuai Yao, Yuguo Zha) are explicitly stated to be Joint First Authors.
The corresponding authors are Hong Bai (baihong@hust.edu.cn) and Kang Ning (ningkang@hust.edu.cn).
1.3. Journal/Conference
The provided information does not explicitly state the journal or conference where this paper was published. However, the reference format (e.g., "Nucleic Acids Res") and the content suggest a high-impact journal in bioinformatics, genomics, or computational biology.
1.4. Publication Year
2025
1.5. Abstract
Biosynthetic gene clusters (BGCs) are crucial for synthesizing microbial secondary metabolites, but their vast potential remains largely unexploited within microbial genomes and metagenomes. To address this, the authors introduce BGC-Prophet, a novel transformer-based language model designed for predicting and classifying BGCs. By leveraging the transformer encoder, BGC-Prophet effectively captures location-dependent relationships between genes. This model stands out as an ultrahigh-throughput (UHT) tool, demonstrating superior efficiency and fidelity compared to existing methods, which enables extensive pan-phylogenetic and whole-metagenome BGC screening. Through the analysis of 85,203 genomes and 9,428 metagenomes, BGC-Prophet has profiled sub-million BGCs. Key findings include the enrichment of BGCs in phyla like Actinomycetota and the widespread distribution of polyketide, nonribosomal peptide (NRP), and ribosomally synthesized and post-translationally modified peptide (RiPP) BGCs across diverse lineages. The model also reveals enrichment patterns of BGCs linked to important geological events, suggesting environmental influences on BGC evolution. BGC-Prophet's capabilities in BGC detection and evolutionary pattern analysis offer significant contributions to a deeper understanding of microbial secondary metabolites and hold promise for applications in synthetic biology.
1.6. Original Source Link
/files/papers/6912d8143ac94a268629e4a0/paper.pdf This appears to be a link to a PDF file within a local or specific file system, not a public URL. The paper is officially published.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the efficient and accurate discovery and classification of Biosynthetic Gene Clusters (BGCs) within the ever-growing volume of microbial genomic and metagenomic data. BGCs are critical for producing secondary metabolites (also known as natural products), which are compounds synthesized by organisms that are not directly involved in normal growth, development, or reproduction but often play roles in ecological interactions, defense, or communication. These natural products have immense value, particularly in medicine, serving as sources for antimicrobials and cancer chemotherapeutics.
The problem is important because a vast majority of these BGCs are "hidden" or uncharacterized within microbial genomes and metagenomes. Existing computational methods for BGC identification face several challenges:
- Rule-based methods (e.g., antiSMASH) are effective for known BGC categories but struggle with novel BGCs and have scalability issues when analyzing large datasets.
- Machine learning methods (e.g., ClusterFinder, NeuRiPP, DeepRiPP) can identify novel BGCs but often suffer from trade-offs in efficiency and accuracy, sometimes exhibiting higher false positive rates (FPR) or false negatives for known categories.
- Deep learning approaches (e.g., DeepBGC, BiGCARP) have improved detection but commonly rely on profile-Hidden Markov Models (pHMMs) for Pfam domain calling, which is computationally intensive and relies on expert-driven manual annotation. Moreover, many use Bidirectional Long Short-Term Memory (BiLSTM) networks, which may not effectively capture long-range, location-dependent relationships between genes in complex BGCs.
- Computational efficiency: analyzing the exponentially growing genomic data (thousands to millions of genomes and metagenomes) requires ultrahigh-throughput (UHT) tools that are significantly faster than current methods.

The paper's entry point, or innovative idea, is to reframe BGC prediction as a language processing task. By drawing an analogy between a BGC and a sentence (where genes are tokens), the authors propose using a transformer-based language model to leverage the transformer's ability to capture long-range dependencies and process data in parallel, aiming to overcome the limitations of previous methods in accuracy, novelty detection, and speed.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introduction of BGC-Prophet: a novel transformer-based language model for ultrahigh-throughput prediction and classification of BGCs. The model uses genes as tokens and ESM-2 8M (Evolutionary Scale Modeling) protein language model embeddings to represent genes, effectively capturing location-dependent relationships between genes with a transformer encoder.
- Superior performance and efficiency: BGC-Prophet demonstrates comparable or superior accuracy to existing tools (like DeepBGC) in BGC detection and significantly outperforms them in BGC product classification. Crucially, it achieves ultrahigh throughput, being several orders of magnitude faster (e.g., processing a genome in 1 minute compared to DeepBGC's 4 hours), enabling large-scale genomic and metagenomic screening.
- Comprehensive BGC profiling: BGC-Prophet was applied to 85,203 genomes and 9,428 metagenomes, profiling an extensive array of sub-million BGCs. This large-scale analysis provides a comprehensive picture of BGC distribution and diversity across bacterial and archaeal lineages.
- Discovery of novel BGCs and enrichment patterns: the model identified a significantly greater number of potential BGCs than antiSMASH, particularly in categories like terpene, polyketide, and RiPP, and also predicted more BGCs in the "other" category, suggesting its ability to discover potentially novel BGC categories.
- Insights into BGC evolutionary patterns: the study revealed notable enrichment of BGCs in specific phyla (e.g., Actinomycetota) and observed a surge in BGC abundance and diversity (especially polyketides) following important geological events like the Great Oxidation and the Cambrian Explosion. This suggests a link between environmental changes and the evolution of BGCs and specialized metabolites.

Key conclusions/findings enabled by these contributions include:
- The ability to efficiently screen vast genomic and metagenomic datasets for BGCs, which was a major bottleneck for previous methods.
- Improved detection of novel BGCs that do not fit predefined rules or categories.
- A deeper understanding of the distribution, abundance, and evolutionary drivers of BGCs across microbial life.
- A powerful tool for synthetic biology and natural product discovery, highlighting new targets and evolutionary contexts.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Biosynthetic Gene Clusters (BGCs): groups of genes that are physically located close to each other on a chromosome (spatially adjacent colocalization) and work together to produce one or more specific secondary metabolites (natural products). They encode the entire biosynthetic pathway, including biosynthetic genes (which catalyze the core steps), tailoring enzymes (which modify the core structure), transport-related genes (for moving precursors or products), and regulatory genes (which control the pathway's expression). An example BGC is the cluster that synthesizes penicillin.
- Secondary Metabolites (Natural Products): organic compounds produced by bacteria, fungi, or plants that are not directly involved in the normal growth, development, or reproduction of the organism. Instead, they often play roles in defense, communication, or environmental adaptation. Examples include antibiotics (like penicillin), antifungals, immunosuppressants, and anticancer agents. They have diverse chemical structures, such as nonribosomal peptides (NRPs), polyketides, saccharides, terpenes, and alkaloids.
- Genomes and Metagenomes:
  - Genome: the complete set of genetic instructions in an organism. In this context, it refers to the DNA sequence of a single microbial species.
  - Metagenome: the collection of genomes and genes from the members of an entire microbial community (e.g., from a soil sample or the human gut). Metagenome-assembled genomes (MAGs) are reconstructed partial genomes derived from metagenomic sequencing data.
- Machine Learning (ML): a field of artificial intelligence that uses statistical techniques to enable computer systems to "learn" from data without being explicitly programmed. It involves algorithms that build a model from example data to make predictions or decisions.
- Deep Learning: a subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data at multiple levels of abstraction. Deep learning models have shown remarkable success in tasks like image recognition, natural language processing, and speech recognition.
- Neural Networks: computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized into layers. Each connection has a weight, and neurons apply an activation function to the weighted sum of their inputs.
- Transformers: a deep learning architecture introduced in 2017 that has revolutionized Natural Language Processing (NLP). Unlike recurrent neural networks (RNNs) such as LSTMs, transformers rely solely on attention mechanisms (specifically self-attention) to process sequences, allowing for parallel computation and better capture of long-range dependencies. They consist of an encoder and a decoder stack, though some applications (like BGC-Prophet) use only the encoder.
- Attention Mechanism / Self-Attention: a component within transformers that allows the model to weigh the importance of different parts of the input sequence when processing a specific element. Self-attention enables each element in a sequence to "attend" to all other elements in the same sequence, identifying relevant relationships regardless of their distance. This helps in capturing location-dependent relationships (long-range dependencies) between elements.
- Language Models: in NLP, a language model is a statistical or neural network model that learns the probability distribution of sequences of words (or other tokens), predicting the next word in a sequence or capturing the context of words. BGC-Prophet applies this concept to genes, treating a sequence of genes as a "sentence."
- Protein Embeddings (ESM, Evolutionary Scale Modeling): a technique to represent protein sequences as dense numerical vectors (embeddings). ESM-2 8M is a specific pre-trained protein language model that generates such embeddings. These vectors capture evolutionary signals and functional properties of proteins based on their sequences, allowing models to leverage similarities among genes (proteins) even when they do not share high sequence identity.
- Profile Hidden Markov Models (pHMMs): statistical models used to represent sequences that share a common pattern or motif, such as protein domains. They are widely used in bioinformatics to identify homologous sequences or protein families (Pfam domains). pHMMs can be computationally intensive because they involve aligning sequences against the profiles.
- Pfam Domains: a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models. These domains are conserved protein regions that are often associated with specific functions. Many BGC prediction tools use Pfam domains as features.
- Bidirectional Long Short-Term Memory (BiLSTM): a type of recurrent neural network (RNN) that processes sequences in both forward and backward directions, allowing it to capture context from both past and future elements. While powerful, LSTMs can struggle with very long sequences and are less parallelizable than transformers.
- Evaluation Metrics (for classification tasks):
  - Accuracy: the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances; it measures the overall correctness of the model.
  - Precision: the proportion of true positive predictions among all positive predictions made by the model; it measures how many of the identified BGCs are actually BGCs. High precision means few false positives.
  - Recall (Sensitivity): the proportion of true positive predictions among all actual positive instances; it measures how many of the actual BGCs were correctly identified. High recall means few false negatives.
  - F1-score: the harmonic mean of precision and recall. It provides a single score that balances both, which is useful when class distribution is uneven.
  - True Positive Rate (TPR): identical to recall.
  - False Positive Rate (FPR): the proportion of false positive predictions among all actual negative instances; it measures how many non-BGCs were incorrectly identified as BGCs.
  - Area Under the Receiver Operating Characteristic (AUROC) Curve: the ROC curve plots TPR against FPR at various thresholds. The AUROC score is the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance; a higher AUROC indicates better discrimination.
- Dimensionality Reduction Techniques:
  - t-distributed Stochastic Neighbor Embedding (t-SNE): a non-linear dimensionality reduction technique for visualizing high-dimensional data, typically in two or three dimensions, placing similar objects near each other and dissimilar objects far apart.
  - Uniform Manifold Approximation and Projection (UMAP): another non-linear dimensionality reduction technique, often faster than t-SNE and better at preserving global data structure.
- Statistical Tests:
  - t-test: a parametric statistical test used to determine whether there is a significant difference between the means of two groups.
  - Pearson Correlation Coefficient: a measure of the linear correlation between two variables, ranging from +1 (total positive linear correlation) through 0 (no linear correlation) to −1 (total negative linear correlation).
- Ultrahigh-throughput (UHT): methods or systems capable of processing an extremely large number of samples or data points in a short time, essential for large-scale genomic and metagenomic analyses.
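The self-attention concept above can be made concrete in a few lines. The sketch below is a toy single-head version with a deliberate simplification that is mine, not the paper's: the query/key/value projections are left as identities (a trained transformer learns these weight matrices). It shows how every token mixes information from all positions at once, regardless of distance:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Minimal single-head self-attention: each token attends to every
    other token, so dependencies are captured regardless of distance.
    Query/key/value projections are identity here for clarity."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarity, (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                               # context-mixed representations

# Five toy "gene" tokens with 8-d embeddings (random values, not real ESM output).
rng = np.random.default_rng(1)
tokens = rng.normal(size=(5, 8))
out = self_attention(tokens)
print(out.shape)  # (5, 8)
```

Because every pair of positions is compared directly, a gene at position 0 can influence the representation of a gene at position 127 in one step, which is the property BGC-Prophet relies on for long BGCs.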
3.2. Previous Works
The paper discusses several prior approaches for BGC identification, broadly categorizing them into rule-based and machine learning/deep learning methods:
- Rule-based Methods:
  - antiSMASH [15, 16]: stands for "antibiotics & Secondary Metabolite Analysis Shell." A widely used comprehensive pipeline that identifies BGCs by employing a set of curated pHMMs to detect biosynthesis-related gene families and then uses heuristics (rules) to delineate BGC regions. It is successful for known BGC categories (e.g., alkaloids, NRPs, polyketides, RiPPs, saccharides, terpenes).
  - PRISM [20]: predicts natural product chemical structures from microbial genomes.
  - Limitations of rule-based methods: less proficient at identifying novel categories of BGCs that do not fit predefined rules.
- Machine Learning Approaches:
  - ClusterFinder [23]: an early machine learning method for BGC identification.
  - NeuRiPP [24] and DeepRiPP [25]: use machine learning specifically to identify RiPP BGCs.
  - Limitations of ML approaches: often involve a trade-off between efficiency and accuracy, can have higher false positive rates (FPR) than rule-based approaches, and may suffer from false negatives for known BGC categories.
- Deep Learning Approaches:
  - DeepBGC [26]: a pioneering deep learning and NLP strategy for BGC prediction. It uses Pfam domains as tokens and employs a BiLSTM recurrent neural network.
  - e-DeepBGC [27], Deep-BGCPred [28], BiGCARP [29], GECCO [30], SanntiS [31]: other deep learning methods developed for BGC annotation. Many also rely on Pfam domains and BiLSTM or similar RNN architectures.
  - Common drawbacks of deep learning approaches (as highlighted by the authors):
    - Limited training data: supervised machine learning approaches suffer from a small number of training samples.
    - Loss of long-range memory: BiLSTM usually loses long memories and cannot effectively capture distant location-dependent relationships between biosynthetic genes.
    - Pfam reliance: Pfam heavily relies on manual determination by experts to define the scope of each domain.
    - Computational intensity of pHMMs: using pHMMs to identify conserved Pfam domains is computationally intensive.
3.3. Technological Evolution
The field of BGC discovery has evolved significantly:
- Early wet-lab methods: Traditional natural product discovery involved culturing microbes and chemically extracting and characterizing compounds. This was slow and often missed the vast majority of non-expressed BGCs.
- Genomic era and rule-based tools: with the advent of high-throughput sequencing, genomic data became abundant. Tools like antiSMASH emerged, leveraging known patterns (e.g., characteristic protein domains, gene arrangements) to predict BGCs directly from DNA sequences. This was a significant leap, enabling genome mining.
- Rise of machine learning: as the complexity of BGCs and the limitations of rule-based systems for novel BGCs became apparent, machine learning methods were introduced. These could learn patterns from data, offering more flexibility.
- Deep learning revolution: deep learning, particularly RNNs (like BiLSTM) and later transformers, brought powerful pattern recognition capabilities. Models like DeepBGC started treating BGC prediction as an NLP problem, using protein domains as "words."
- Current work (BGC-Prophet): this paper represents a further evolution by:
  - Moving from Pfam domains to entire genes as tokens, which are more "natural" and avoid the computational cost of pHMMs.
  - Adopting the more advanced transformer architecture over RNNs (like BiLSTM) to better capture long-range dependencies and enable ultrahigh-throughput parallel processing.
  - Leveraging a state-of-the-art protein language model (ESM-2 8M) for gene embeddings, providing richer, sequence-specific representations.

This paper positions itself at the forefront of this evolution, addressing the efficiency and fidelity challenges of large-scale, pan-phylogenetic, and metagenomic BGC screening.
3.4. Differentiation Analysis
Compared to the main methods in related work, BGC-Prophet introduces several core differences and innovations:
- Tokenization Strategy:
  - Previous (e.g., DeepBGC): uses Pfam domains as tokens. While this balances information retention against computational complexity, Pfam relies on expert manual annotation, and pHMMs are computationally intensive. Multiple Pfam domains within one gene also require separate handling.
  - BGC-Prophet: uses entire genes as tokens. This is presented as more "natural" and flexible, avoiding the complexities and computational cost of Pfam domain calling and pHMM alignments, and aims to capture global gene information more effectively.
- Gene Embedding:
  - Previous: often rely on features derived from Pfam annotations or simpler sequence features.
  - BGC-Prophet: leverages the ESM-2 8M (Evolutionary Scale Modeling) protein language model to generate sequence-specific vector representations (embeddings) for each gene. These embeddings encapsulate evolutionary signals and functional properties directly from protein sequences, providing richer input features that are less dependent on predefined domains. This also decouples the acquisition of vector representations from the training of the language model.
- Model Architecture:
  - Previous (e.g., DeepBGC): primarily uses Bidirectional Long Short-Term Memory (BiLSTM) recurrent neural networks. BiLSTMs can struggle to capture very long-range dependencies in sequences and are less amenable to parallelization.
  - BGC-Prophet: employs a transformer encoder. Transformers, through their self-attention mechanism, are inherently designed to capture location-dependent relationships between elements regardless of their distance and are highly parallelizable, leading to significant speed improvements.
- Computational Efficiency (Ultrahigh Throughput):
  - Previous: many existing deep learning tools, while more accurate than rule-based methods, are computationally intensive (e.g., DeepBGC needs about 4 hours per genome).
  - BGC-Prophet: achieves ultrahigh-throughput processing (about 1 minute per genome). This dramatic speedup is attributed to the more efficient ESM method for gene vector generation (avoiding time-consuming Pfam alignments) and the parallel processing capabilities of the transformer architecture, making comprehensive pan-phylogenetic and whole-metagenome screening feasible at an unprecedented scale.
- Detection of Novel BGCs:
  - Previous (rule-based): limited by predefined rules, struggling with novelty.
  - Previous (deep learning): showed potential but often constrained by Pfam-based features.
  - BGC-Prophet: its gene-level tokenization, ESM embeddings, and transformer architecture are designed to learn more generalized patterns of gene location dependencies, making it more adept at extrapolating and identifying potentially novel BGCs that do not fit established categories, as demonstrated by its higher prediction counts for unannotated BGCs and the "other" category compared to antiSMASH.
4. Methodology
4.1. Principles
The core idea behind BGC-Prophet is to treat the problem of Biosynthetic Gene Cluster (BGC) prediction and classification as a language processing task, drawing an analogy between a sequence of genes and a sentence. In this analogy:
- A BGC (or a region of genes to be analyzed) is considered a "sentence."
- Individual genes within that sequence are considered "tokens" (or "words").

The theoretical basis and intuition behind this approach are rooted in the success of transformer-based language models in Natural Language Processing (NLP). These models are exceptionally good at:

- Capturing location-dependent relationships (long-range dependencies): just as the meaning of a word in a sentence depends on its surrounding words, the function of a gene within a BGC depends on its neighboring genes, even those far apart. Transformers excel at identifying these relationships, which recurrent neural networks (RNNs) sometimes struggle with over long sequences.
- Learning contextual representations: transformers can learn rich, contextualized representations for each token (gene) based on its position and interactions with all other tokens in the sequence.
- Parallel processing: the self-attention mechanism allows for parallel computation, making the model highly efficient, especially for processing large volumes of genomic data.

By translating the problem into an NLP paradigm, BGC-Prophet aims to leverage these powerful capabilities to accurately identify BGC boundaries, classify product types, and even extrapolate to predict novel BGCs by learning the "grammar" and "semantics" of gene arrangement within biosynthetic pathways. The use of the ESM-2 8M protein language model for gene embeddings further strengthens this approach by providing high-quality, sequence-specific input representations that encode evolutionary and functional information for each "gene token."
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Datasets Used in This Study
The study curated several datasets for different purposes:
- MIBiG v3.1 (Minimum Information about a Biosynthetic Gene cluster): contains 2502 experimentally validated BGCs with metadata. Used for constructing training and testing sets and for BGC category labeling.
- 6KG (5886 genomes from GTDB RS214): a phylogenetically diverse set of microbial genomes. Used for constructing training and testing sets, particularly for generating the non-BGC gene library.
- NG (Nine Genomes): nine bacterial genomes previously used in the ClusterFinder and DeepBGC studies, containing 291 BGCs. Used specifically for validating and comparing the performance of various methods; none of these BGCs were used for training.
- AG (982 genomes from the Aspergillus genus): genomes from the fungus Aspergillus, a genus known for high biosynthetic potential. Used to compare BGC-Prophet and antiSMASH predictions.
- 85KG (85,203 available species/genomes in GTDB RS214): a large-scale dataset of bacterial and archaeal genomes. Used for comprehensive pan-phylogenetic BGC mining and for constructing a profile of BGCs.
- MG (metagenomes from 47 metagenomic studies): contains 9428 metagenomic samples with 1,792,406,629 contigs; after filtering contigs by nucleotide length, 6,238,438 contigs were retained. Used for whole-metagenome BGC screening.
4.2.2. Positive and Negative Sample Generation
To train the BGC-Prophet language model, a curated training dataset of positive and negative samples was generated.
4.2.2.1. Non-BGC Gene Library Construction
First, for each of the 5886 microbial genomes in the 6KG dataset, antiSMASH (v6) was used to identify and remove regions predicted to be BGCs. The remaining pruned genomes (without BGC-like regions) formed the non-BGC gene library. This library served as a source for padding positive samples and generating negative samples.
4.2.2.2. Positive Sample Generation
- Source: derived from the 2502 experimentally validated BGCs in the MIBiG dataset.
- Process: each MIBiG BGC was subjected to two-sided padding with genes randomly selected from the non-BGC gene library.
- Length normalization: padding continued until the gene-sequence length of the sample reached 128. This length was chosen because the longest BGC in MIBiG consists of 115 genes, and 128 accommodates the typical number of non-BGC genes between BGCs (Supplementary Fig. S1).
- Replication: the generation procedure was repeated five times for each MIBiG BGC, yielding 12,510 positive samples in total.
- Example: a positive sample is a sequence of 128 gene tokens whose central portion corresponds to an actual BGC, flanked by non-BGC genes.
4.2.2.3. Negative Sample Generation
- Challenge: negative samples (non-BGCs) should have some similarity to BGCs in gene content but lack the characteristic semantic information (i.e., the specific order and combination of genes) preserved in BGCs.
- Process:
  - A random region was selected from the non-BGC gene library.
  - From this region, a subregion containing 128 contiguous genes was randomly chosen to form a single negative sample.
- Quantity: a total of 20,000 negative samples were generated.
- Example: a negative sample is a contiguous sequence of 128 genes that, as a whole, do not constitute a BGC, although individual genes might also occur in BGCs.
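The two sampling procedures above can be sketched as plain Python. This is an illustrative reconstruction, not the authors' code: the function names, the toy gene identifiers, and the per-gene 0/1 labels (used later for the sequence-labeling detection task) are my assumptions; only the 128-token length, two-sided padding, and random-window negative sampling come from the paper.

```python
import random

SAMPLE_LEN = 128  # fixed sample length used by BGC-Prophet

def make_positive(bgc_genes, non_bgc_pool, rng):
    """Two-sided padding: flank a MIBiG BGC with random non-BGC genes
    until the sample reaches 128 gene tokens."""
    left = rng.randint(0, SAMPLE_LEN - len(bgc_genes))   # random split of padding
    right = SAMPLE_LEN - len(bgc_genes) - left
    pad = lambda n: [rng.choice(non_bgc_pool) for _ in range(n)]
    genes = pad(left) + list(bgc_genes) + pad(right)
    labels = [0] * left + [1] * len(bgc_genes) + [0] * right  # per-gene BGC flags
    return genes, labels

def make_negative(non_bgc_region, rng):
    """A random window of 128 contiguous genes from a non-BGC region."""
    start = rng.randint(0, len(non_bgc_region) - SAMPLE_LEN)
    return non_bgc_region[start:start + SAMPLE_LEN], [0] * SAMPLE_LEN

rng = random.Random(0)
bgc = [f"bgc_gene_{i}" for i in range(12)]       # toy 12-gene BGC
pool = [f"bg_gene_{i}" for i in range(500)]      # toy non-BGC gene library
genes, labels = make_positive(bgc, pool, rng)
neg_genes, neg_labels = make_negative(pool, rng)
print(len(genes), sum(labels), len(neg_genes))   # 128 12 128
```

Note that the negative window keeps real gene order from the genome, so negatives share gene content with positives while lacking BGC "grammar," exactly the contrast the training set is built to teach.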
4.2.3. Labeling the Samples
- BGC categories: according to the MIBiG database, there are seven predefined categories of BGCs: alkaloids, NRPs, polyketides, RiPPs, saccharides, terpenes, and others.
- Multilabel classification: a crucial aspect is that each BGC may belong to more than one category, so predicting BGC categories is framed as a multilabel seven-category problem. For instance, a positive sample derived from MIBiG accession BGC0000356 was labeled with both the "alkaloid" and "NRP" categories.
- Negative sample labeling: all negative samples were explicitly assigned none of the seven BGC categories.
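In a multilabel setup, each sample's categories are typically encoded as a binary vector with one slot per category. The sketch below illustrates this; the category ordering and function name are mine, not taken from the paper:

```python
# Seven MIBiG categories; the ordering here is an illustrative choice.
CATEGORIES = ["alkaloid", "NRP", "polyketide", "RiPP", "saccharide", "terpene", "other"]

def to_multihot(labels):
    """Encode a BGC's (possibly multiple) categories as a 7-d binary vector,
    the usual target format for a multilabel classifier."""
    return [1 if c in labels else 0 for c in CATEGORIES]

# Example from the paper: BGC0000356 is both an alkaloid and an NRP.
print(to_multihot({"alkaloid", "NRP"}))  # [1, 1, 0, 0, 0, 0, 0]
# A negative (non-BGC) sample carries the all-zero vector.
print(to_multihot(set()))                # [0, 0, 0, 0, 0, 0, 0]
```

Unlike single-label classification, the seven outputs are independent, so a model scores each category separately rather than picking one winner.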
4.2.4. Token of the BGC-Prophet Model
- Core concept: in Natural Language Processing (NLP), the smallest semantic unit is called a "token" (e.g., a word). BGC-Prophet applies this concept by using individual genes as tokens; a sequence of genes, representing a BGC or non-BGC region, is thus analogous to a "sentence."
- Contrast with previous methods: ClusterFinder and DeepBGC used Pfam domains as tokens. While Pfam domains balance genetic information loss against computational complexity, they have limitations:
  - Pfam relies on manual determination by experts to define domain scopes.
  - Identifying Pfam domains using pHMMs is computationally intensive.
  - A single gene can contain multiple Pfam domains, which requires separate handling and might lose global gene information.
- Rationale for genes as tokens: BGC-Prophet opts for genes as tokens because they are more "natural" and do not require additional, potentially complex or computationally expensive operations such as Pfam domain calling. This choice simplifies the input representation while preserving semantic integrity at the gene level.
4.2.5. Vector Representation of Tokens
- Input Requirement: For language models like transformers, each token (gene) must be converted into a numerical word embedding vector to serve as input.
- Tool Used: The ESM-2 8M model (Evolutionary Scale Modeling, version 2, with 8 million parameters). ESM is a state-of-the-art general-purpose protein language model capable of predicting protein structure, function, and other properties directly from individual sequences.
- Embedding Generation: For every gene in both positive and negative samples, the ESM-2 8M model generated a vector representation (an embedding) with a dimensionality of 320 (320D).
- Mechanism: The ESM-2 8M model outputs embeddings for proteins. The mean of the model's last-layer output was selected as the final word embedding for each gene sequence.
- Advantages:
  - Sequence-specific: These embeddings capture information unique to each gene's sequence, including evolutionary signals and functional properties.
  - Breaks limitations: It removes the dependency on training samples for vector representation acquisition, potentially allowing better prediction of unknown BGCs.
  - Computational efficiency: Directly leveraging a pre-trained protein language model for embeddings avoids time-consuming Pfam alignments and is highly optimized for GPU acceleration.
  - Higher-level information: By taking the mean of the last-layer output, the word vectors tend to represent higher-level information about the protein.
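The mean-pooling step can be illustrated with a minimal sketch. A random matrix stands in for the real per-residue ESM-2 8M last-layer output, and `gene_embedding` is a hypothetical helper name, not the authors' code:

```python
import numpy as np

EMBED_DIM = 320  # output width of the ESM-2 8M model's last layer

def gene_embedding(per_residue_states: np.ndarray) -> np.ndarray:
    """Collapse a gene's per-residue last-layer states (shape L x 320)
    into a single 320-D gene-level word embedding by averaging."""
    assert per_residue_states.ndim == 2 and per_residue_states.shape[1] == EMBED_DIM
    return per_residue_states.mean(axis=0)

# Random matrix standing in for the ESM-2 output of a 120-residue protein.
states = np.random.default_rng(0).normal(size=(120, EMBED_DIM))
vec = gene_embedding(states)
print(vec.shape)  # (320,)
```

Averaging over residues discards per-position detail but yields a fixed-size vector per gene, which is what the downstream transformer encoder consumes.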
4.2.6. Model Architecture and Configuration
BGC-Prophet employs a transformer encoder structure, which is a neural network architecture known for using a multi-head self-attention mechanism to speed up training and capture long-range dependencies effectively.
- Implementation: Implemented using
PyTorch v2.0.0. - Core Component:
Transformer encoder[33]. - Input Dimension: Set to 320, matching the dimensionality of the embeddings generated by the
ESM-2 8M model. - Normalization:
Pre-layer normalizationis applied to accelerate the convergence of the model [38]. - Positional Encoding:
Classical sine-cosine position codingis used. This type of encoding injects information about the relative or absolute position of the tokens in the sequence, which is crucial fortransformersas they process all tokens in parallel without inherent sequential order. It does not require additional training. - Transformer Encoder Configuration:
Number of layers: Two encoder layers.Number of attention heads: Five attention heads per layer.Dropout rate: , used for regularization to prevent overfitting.
- Training Parameters:
Optimizer:AdamW[41].Learning rate: 1e-2.Batch size: 64.
- Early Stopping: The model uses an early stopping strategy. Training stops if the
loss valueon thevalidation setdoes not improve after 20 consecutive epochs. The model state (weights) from the epoch with thelowest validation lossis selected as the final model.
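The reported configuration can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the released BGC-Prophet code; the dropout value is not stated in the text, so PyTorch's default is kept:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Classical sine-cosine position coding; needs no training."""
    def __init__(self, d_model: int, max_len: int = 128):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

# Configuration reported in the paper: 320-D inputs, two encoder layers,
# five attention heads (64 dimensions per head), pre-layer normalization.
d_model, n_heads, n_layers = 320, 5, 2
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads,
    batch_first=True, norm_first=True)  # norm_first=True -> pre-layer norm
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
pos_enc = SinusoidalPositionalEncoding(d_model)

genes = torch.randn(4, 128, d_model)   # batch of 4 "sentences" of 128 gene tokens
hidden = encoder(pos_enc(genes))       # (4, 128, 320)
```

Note that `norm_first=True` is how PyTorch exposes pre-layer normalization, and 320 / 5 heads gives the 64 dimensions per head mentioned in the ablation study.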
4.2.7. BGC Gene Detection and Product Classification
BGC-Prophet is designed to perform two distinct downstream tasks:
4.2.7.1. Task 1: BGC Gene Detection (Predicting BGC Loci)
This task involves predicting, for a given gene sequence, whether each individual gene is part of a BGC. This is treated as a sequence labeling problem.
- Input: A sequence of gene embeddings (320D vectors) for genes in a given genomic region.
- Problem Formulation: The task is modeled statistically using a linear-chain conditional random field (linear-CRF) [39]. A CRF is a probabilistic graphical model well suited to sequence labeling because it considers the context of neighboring labels (genes) when making a prediction.
- Downstream Neural Network: A fully connected layer is applied after the transformer encoder.
  - Timesteps: 128 timesteps, corresponding to the maximum sequence length.
  - Weight Sharing: The weights of this fully connected layer are shared across timesteps (unlike DeepBGC).
  - Dimension Reduction: The hidden state vector from the transformer encoder (initially 320D) is progressively reduced: from 320 to 128, then from 128 to 32, and finally to 1. This 1D output represents the probability that a given gene is part of a BGC.
- Activation Functions:
  - Gaussian Error Linear Unit (GELU) [40]: Used as the activation function for each intermediate fully connected layer.
  - Sigmoid: Applied to the final 1D output, squashing the value between 0 and 1 to give the confidence score that the gene belongs to a BGC.
- Loss Function: Binary cross entropy, commonly used for binary classification tasks (gene is / is not part of a BGC).
- Optimizer: AdamW [41] is used to minimize the loss function.
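A minimal sketch of this detection head in PyTorch (illustrative only; the linear-CRF formulation is omitted and only the per-gene probability path with the 320→128→32→1 reduction is shown):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Per-timestep head with shared weights: 320 -> 128 -> 32 -> 1,
    GELU between layers, sigmoid on the final output."""
    def __init__(self, d_model: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 128), nn.GELU(),
            nn.Linear(128, 32), nn.GELU(),
            nn.Linear(32, 1),
        )

    def forward(self, hidden):  # hidden: (batch, 128, 320) from the encoder
        # nn.Linear acts on the last dim, so weights are shared across timesteps.
        return torch.sigmoid(self.net(hidden)).squeeze(-1)  # (batch, 128)

head = DetectionHead()
probs = head(torch.randn(2, 128, 320))            # per-gene BGC probabilities
labels = torch.randint(0, 2, (2, 128)).float()    # toy 0/1 gene labels
loss = nn.BCELoss()(probs, labels)                # binary cross entropy
```

Because the same `nn.Linear` weights are applied at every timestep, the head's parameter count is independent of sequence length, which is the weight-sharing property noted above.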
4.2.7.2. Task 2: BGC Product Classification (Predicting BGC Category)
This task involves predicting the specific category (or categories) of a given BGC region.
- BGC Categories: Seven categories from MIBiG: alkaloids, NRPs, polyketides, RiPPs, saccharides, terpenes, and others.
- Encoding: These categories are encoded using one-hot encoding. An all-zero vector represents the non-BGC category.
- Multilabel Classification: Since a BGC can belong to multiple categories, this is a multilabel seven-category problem.
- Process:
  - Extract Hidden State Variables: The sequence of hidden state vectors $b_1, b_2, \ldots, b_n$ is extracted from the transformer encoder, where each $b_i \in \mathbb{R}^k$ ($k$ is the hidden state dimension, 320 in this case).
  - Calculate Average Hidden State: The average hidden state for the entire sequence is calculated as $ \bar{b} = \frac{1}{n} \sum_{i=1}^n b_i $, where $n$ is the number of genes in the sequence.
  - Masking: Key padding masks prevent non-BGC genes (those added during padding or naturally present in negative samples) from influencing the classification, ensuring that only relevant gene context is considered.
  - Downstream Layer: The average hidden state $\bar{b}$ is passed through a simple fully connected layer.
  - Output: This layer outputs a 7D vector (one dimension per BGC category).
  - Activation Function: A sigmoid function is applied to each element of the 7D vector, yielding a confidence score between 0 and 1 for each label. This allows multiple labels to be activated simultaneously.
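The masked averaging and classification steps can be sketched as follows. This is an illustration under the stated 320-D / 7-category setup, not the authors' implementation; in the real model the padding mask would also be passed to the encoder as a key padding mask:

```python
import torch

def masked_mean(hidden, pad_mask):
    """Average hidden states over real (non-padded) gene positions.
    hidden: (batch, seq, d); pad_mask: (batch, seq), True where padded."""
    keep = (~pad_mask).unsqueeze(-1).float()      # 1.0 for real genes, 0.0 for padding
    return (hidden * keep).sum(1) / keep.sum(1).clamp(min=1)

d_model, n_categories = 320, 7
classifier = torch.nn.Linear(d_model, n_categories)

hidden = torch.randn(2, 128, d_model)             # encoder hidden states
pad_mask = torch.zeros(2, 128, dtype=torch.bool)
pad_mask[:, 100:] = True                          # last 28 positions are padding

b_bar = masked_mean(hidden, pad_mask)             # (2, 320) average hidden state
scores = torch.sigmoid(classifier(b_bar))         # (2, 7) per-category confidences
```

Using a sigmoid per output dimension (rather than a softmax across the seven categories) is what makes the multilabel setup possible: several categories can exceed the decision threshold at once.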
4.2.8. Hyperparameter Tuning and Performance Evaluation
The model's design choices and performance were rigorously evaluated:
- Comparison with Simpler Models (BiLSTM): To justify the Transformer architecture, its performance was compared against bidirectional LSTM networks.
  - Single-layer BiLSTM: 1.7M parameters; AUROC of 0.88 and F1 score of 0.65.
  - Two-layer BiLSTM: 4.2M parameters; AUROC of 0.87 and F1 score of 0.61.
  - Observation: Both LSTM models showed severe overfitting (accuracy > 95% on training/validation sets) and suboptimal generalization.
  - Conclusion: BGC-Prophet (Transformer-based, 2.5M parameters) achieved superior performance, balancing parameter efficiency and generalization, highlighting the strength of the Transformer architecture in capturing sequence-specific information while mitigating overfitting.
- Ablation Study (Impact of Key Hyperparameters):
  - Number of Encoder Layers: Increasing the number of layers beyond two slightly reduced test set performance, indicating that additional depth did not significantly improve feature representation.
  - Number of Attention Heads: Variations showed minimal impact. Five attention heads were chosen as the optimal balance between computational complexity and performance.
  - Embedding Size: Larger dimensions improved the model's capacity to represent sequence-specific information but increased computational demands. An embedding size of 320 dimensions (corresponding to 64 dimensions per attention head) was identified as the optimal trade-off between accuracy and efficiency.
- Final Configuration: Based on these findings, the BGC-Prophet model was finalized with two encoder layers, five attention heads, and an embedding size of 320.
5. Experimental Setup
5.1. Datasets
The study utilized several datasets, each serving a specific role in training, validation, and large-scale application of BGC-Prophet.
- MIBiG v3.1 (Minimum Information about a BGC):
  - Source: A robust community-standard database for BGC annotation and metadata on BGCs and their molecular products.
  - Scale/Characteristics: Contains 2502 experimentally validated BGCs.
  - Domain: Microbial BGCs.
  - Purpose: Primarily used to define the ground truth for positive BGC samples and to label BGC categories during model training.
  - Data Sample: A BGC in MIBiG is represented by a cluster of functionally related genes, such as the genes for penicillin biosynthesis, along with meta-information about its product type (e.g., NRP).
- 6KG (5886 genomes from the GTDB RS214 database):
  - Source: The Genome Taxonomy Database (GTDB) Release 214.
  - Scale/Characteristics: 5886 phylogenetically diverse species/genomes spanning the bacterial evolutionary tree.
  - Domain: Microbial genomes.
  - Purpose: Used to construct the non-BGC gene library for negative sample generation and padding.
- NG (Nine Genomes):
  - Source: A representative set of nine bacterial genomes previously examined in the ClusterFinder and DeepBGC studies.
  - Scale/Characteristics: A total of 291 BGCs. Critically, none of these BGCs were used for training BGC-Prophet.
  - Domain: Bacterial genomes.
  - Purpose: Used as an independent validation set to evaluate and compare the performance of BGC-Prophet against existing methods.
- AG (982 genomes from the genus Aspergillus):
  - Source: NCBI genome database.
  - Scale/Characteristics: 982 genomes from Aspergillus species.
  - Domain: Fungal genomes (specifically Aspergillus, known for high biosynthetic potential).
  - Purpose: Used to compare BGC-Prophet's predictions against antiSMASH and assess its ability to identify previously unannotated or novel BGCs in a specific, biomedically relevant genus.
- 85KG (85,203 available species/genomes in GTDB RS214):
  - Source: GTDB RS214.
  - Scale/Characteristics: 85,203 unique species/genomes (one genome per species).
  - Domain: The majority of bacterial and archaeal lineages.
  - Purpose: Used for large-scale pan-phylogenetic genome mining of BGCs, constructing a comprehensive profile of BGCs across microbial diversity, and studying evolutionary patterns.
- MG (Metagenomes from 47 metagenomic studies):
  - Source: 47 human microbial environment metagenomic studies [36].
  - Scale/Characteristics: 9428 metagenomic samples initially containing 1,792,406,629 contigs, filtered by nucleotide sequence length to 6,238,438 contigs. These were binned into 160,814 bins (MAGs), and taxonomic annotation was performed using GTDB-Tk.
  - Domain: Human microbiome.
  - Purpose: Used for whole-metagenome screening of BGCs to understand their distribution and diversity in environmental samples.

The chosen datasets are effective for validating the method's performance because they cover a wide range of genomic scales (individual genomes to massive metagenomes), phylogenetic diversity (bacteria, archaea, fungi), and real-world applicability (experimentally validated BGCs, well-studied genera, environmental samples). The NG dataset, specifically, allows direct comparison with previous benchmarks.
5.2. Evaluation Metrics
The evaluation of BGC-Prophet's performance was based on standard metrics commonly used in classification tasks.
First, the four fundamental parameters of a confusion matrix are defined:
- TP (True Positive): Instances that are actually positive (e.g., a gene is part of a BGC) and were correctly predicted as positive by the model.
- FN (False Negative): Instances that are actually positive but were incorrectly predicted as negative by the model (missed BGCs).
- TN (True Negative): Instances that are actually negative (e.g., a gene is not part of a BGC) and were correctly predicted as negative by the model.
- FP (False Positive): Instances that are actually negative but were incorrectly predicted as positive by the model (non-BGCs mistakenly identified as BGCs).
Using these, the following metrics are calculated:
- Accuracy
  - Conceptual Definition: The proportion of all predictions that were correct (both true positives and true negatives). It provides a general measure of how well the model performs across all classes.
  - Mathematical Formula: $ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% $
  - Symbol Explanation: TP: number of true positives; TN: number of true negatives; FP: number of false positives; FN: number of false negatives.
- Precision
  - Conceptual Definition: The proportion of positive predictions that were actually correct. It measures the quality of positive predictions, indicating how many of the identified BGCs are truly BGCs. High precision is important when the cost of false positives is high.
  - Mathematical Formula: $ Precision = \frac{TP}{TP + FP} \times 100\% $
  - Symbol Explanation: TP: number of true positives; FP: number of false positives.
- Recall (also known as Sensitivity or True Positive Rate, TPR)
  - Conceptual Definition: The proportion of actual positive instances that were correctly identified. It measures the model's ability to find all positive cases. High recall is important when the cost of false negatives is high (e.g., missing a BGC).
  - Mathematical Formula: $ Recall = \frac{TP}{TP + FN} \times 100\% $
  - Symbol Explanation: TP: number of true positives; FN: number of false negatives.
- F1-score
  - Conceptual Definition: The harmonic mean of Precision and Recall. It provides a single score that balances both metrics, which is particularly useful when class distribution is uneven or when both false positives and false negatives matter.
  - Mathematical Formula: $ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \times 100\% $
  - Symbol Explanation: Precision: value of precision; Recall: value of recall.
- True Positive Rate (TPR)
  - Conceptual Definition: Identical to Recall; the proportion of actual positive cases that are correctly identified.
  - Mathematical Formula: $ TPR = \frac{TP}{TP + FN} $
  - Symbol Explanation: TP: number of true positives; FN: number of false negatives.
- False Positive Rate (FPR)
  - Conceptual Definition: The proportion of actual negative cases that are incorrectly classified as positive, i.e., the proportion of non-BGCs mistakenly identified as BGCs.
  - Mathematical Formula: $ FPR = \frac{FP}{FP + TN} $
  - Symbol Explanation: FP: number of false positives; TN: number of true negatives.
- Area Under the Receiver Operating Characteristic (AUROC) Curve
  - Conceptual Definition: The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings for a binary classifier. The AUROC is the area under this curve and measures the classifier's ability to distinguish between classes. A higher AUROC (closer to 1) indicates better performance across thresholds; an AUROC of 0.5 indicates a classifier no better than random chance.
  - Mathematical Formula: The AUROC is the integral of the ROC curve, $ AUROC = \int_0^1 TPR \, d(FPR) $, obtained by plotting TPR against FPR as the classification threshold is varied; for a discrete set of predictions it is approximated numerically (e.g., by the trapezoidal rule).
  - Symbol Explanation: TPR: True Positive Rate (as defined above); FPR: False Positive Rate (as defined above).
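These definitions can be checked with a small self-contained calculation. The labels below are toy values, not data from the paper:

```python
def confusion_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the derived metrics defined
    above from binary labels (1 = BGC gene, 0 = non-BGC gene)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Toy example: tp=2, fn=1, tn=2, fp=1.
m = confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(m)  # accuracy, precision, recall, and F1 all equal 2/3 here
```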
5.3. Baselines
The paper compared BGC-Prophet against several established methods:
- DeepBGC [26]: A deep learning and NLP strategy for BGC identification in bacterial genomes, employing a BiLSTM recurrent neural network and Pfam domains. It was used for direct performance comparison in both BGC gene detection and product classification tasks.
- antiSMASH [16]: The antibiotics & Secondary Metabolite Analysis Shell, a widely used rule-based tool for BGC mining that employs pHMMs to identify gene families and heuristics to designate BGCs. It serves as a benchmark for established, rule-based methods and is used for large-scale comparison on the Aspergillus dataset and, in the 6KG dataset, for non-BGC gene library construction.
- GECCO [30]: Another deep learning approach for de novo identification of BGCs; included in the comparative results on the NG dataset.
- BiGCARP [29]: A deep self-supervised learning method for BGC detection and product classification; included in the comparative results on the NG dataset.

These baselines are representative because they cover both traditional rule-based (antiSMASH) and advanced deep learning (DeepBGC, GECCO, BiGCARP) approaches, providing a comprehensive context for evaluating BGC-Prophet's performance and innovations.
6. Results & Analysis
6.1. Evaluation of Sequence-Specific Representations of Genes
The effectiveness of using ESM-2 8M to generate sequence-specific representations (embeddings) of genes was evaluated. This was done by taking these gene embeddings, averaging them to create a single representative BGC vector for each BGC in the MIBiG dataset, and then visualizing these high-dimensional vectors using t-SNE dimensionality reduction.
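The visualization procedure can be sketched as follows, with random vectors standing in for the averaged 320-D BGC embeddings (illustrative only; real inputs would come from mean-pooled ESM gene vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for averaged 320-D BGC vectors (one row per BGC).
bgc_vectors = rng.normal(size=(50, 320))

# Project to 2D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(bgc_vectors)
print(coords.shape)  # (50, 2)
```

The resulting 2D coordinates are what get scatter-plotted and colored by BGC category in figures like the one below.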
The following figure (Figure 2 from the original paper) shows the t-SNE dimensionality reduction of BGC categories and the distribution difference between BGCs and non-BGCs:

Figure 2: Evaluation of sequence-specific representations of genes.
- A) Distributions of BGC categories in 2D t-SNE space: The t-SNE plot revealed that different categories of BGCs demonstrated distinct patterns. The seven categories (alkaloids, NRPs, polyketides, RiPPs, saccharides, terpenes, and others) clustered into three main regions:
  - Terpenes predominantly clustered on the bottom right.
  - Saccharides and RiPPs primarily clustered on the top right.
  - Polyketides primarily clustered on the bottom left and bottom right.
  - Other categories showed a more widespread distribution.
  - This distinct clustering, confirmed by t-test to show clear separation between categories (e.g., polyketides and terpenes), suggests that the ESM gene embeddings capture underlying chemical and functional differences between BGC types.
  - The partial overlap between clusters indicates that while ESM embeddings are powerful, simply averaging gene vectors (which loses order information) is insufficient to fully distinguish BGCs; the transformer encoder is still needed to model location-dependent relationships.
- B) Distributions of BGCs and non-BGCs in 2D t-SNE space: A comparison of BGCs (positive samples) and non-BGCs (negative samples) in the training set showed a different distribution pattern.
  - While some areas were exclusively occupied by BGCs (e.g., bottom right), there was substantial overlap between BGCs and non-BGCs.
  - Despite the overlap, their distributions were significantly different on both axes (t-test).
  - Interpretation: This indicates that individual gene embedding features alone are not enough to definitively determine whether a gene belongs to a BGC. The contextual information and location-dependent relationships learned by the transformer encoder are crucial for accurate BGC detection, beyond the inherent properties of the genes themselves.

These findings support the use of ESM embeddings, demonstrating their ability to generate sequence-specific representations that encode meaningful biological information and serve as effective input features for the BGC-Prophet model.
-
6.2. Accurate and UHT BGC Prediction
The performance of BGC-Prophet was assessed for both BGC gene detection (locating BGCs) and BGC product classification (categorizing BGC types), along with its computational efficiency.
The following figure (Figure 3 from the original paper) shows the performance metrics, computational efficiency, and attention map of BGC-Prophet:

Figure 3: Performance of BGC-Prophet.
- A) BGC Gene Detection (AUROC, Precision, Recall, F1 on NG dataset):
  - BGC-Prophet achieved an AUROC for BGC gene detection on the NG dataset comparable to DeepBGC's.
  - At a default threshold of 0.5, BGC-Prophet showed higher precision than DeepBGC.
  - BGC-Prophet's F1-score was higher than DeepBGC's (the main text reports an improvement of around 50%), implying a better balance between precision and recall at this threshold.
- B) Receiver Operating Characteristic (ROC) curves for BGC gene detection: This plot visually confirms the comparable AUROC values, showing that both BGC-Prophet and DeepBGC perform well in discriminating BGC genes from non-BGC genes across various thresholds.
- C) BGC Product Classification (AUROC, Precision, Recall for 7 categories):
  - BGC-Prophet achieved a significantly higher AUROC than DeepBGC for differentiating among the seven BGC categories, and demonstrated superior precision and recall across all BGC categories.
  - Implication: BGC-Prophet is substantially better at classifying the type of BGC product.
  - Limitations noted: Performance varied across categories. BGC-Prophet struggled with RiPP predictions, likely due to the small number of RiPP BGCs in MIBiG (around 300) and the high diversity within RiPP classes.
- Comparison with other tools (Supplementary Fig. S5 and Table S4): The comparative results on the NG dataset (a standard benchmark) show that BGC-Prophet outperforms antiSMASH, GECCO, and BiGCARP, achieving the highest F1 score across all genomes (average F1 = 0.57 for BGC-Prophet). This is attributed to its effective integration of contextual genomic features and deep learning.
The following are the results from Supplementary Table S4 of the original paper:
| Tool | AntiSMASH | GECCO | BiGCARP | DeepBGC | BGC-Prophet |
|---|---|---|---|---|---|
| Accuracy | 0.94 | 0.89 | 0.93 | 0.91 | 0.94 |
| Precision | 0.70 | 0.43 | 0.56 | 0.30 | 0.60 |
| Recall | 0.34 | 0.52 | 0.55 | 0.93 | 0.55 |
| F1-score | 0.46 | 0.47 | 0.55 | 0.46 | 0.57 |
- D) Efficiency Evaluation (Processing time per genome):
  - BGC-Prophet is highlighted as one of the pioneering ultrahigh-throughput (UHT) methods.
  - When processing 10 randomly selected genomes, DeepBGC required an average of 4 hours per genome, whereas BGC-Prophet processed each genome in just 1 minute.
  - This difference scales up significantly: for 100 genomes, the time difference was two orders of magnitude.
  - Implication: BGC-Prophet can process hundreds of thousands of genomes within tens of hours, making large-scale pan-phylogenetic and whole-metagenome screening practical.
- E) Attention Map (Query-Key Heatmap for a BGC): This heatmap visualizes the self-attention mechanism within the transformer encoder, showing the attention scores (weights) between different genes in a BGC. A higher score indicates a stronger learned relationship or greater importance of one gene to another.
- F) Genomic Association Diagram (Attention to Gene 76, KUTG_02125):
  - This diagram zooms into the attention paid to a specific gene, gene 76 (KUTG_02125), which encodes an NRP synthetase.
  - This gene received the highest attention scores from other biosynthetic genes in the cluster.
  - Interpretation: This high attention suggests that gene 76 is conserved and plays a central role within this BGC, likely as a key component of the biosynthetic pathway. Such attention maps demonstrate that the language model successfully captures location-dependent relationships among biosynthetic genes, providing insights into BGC structure and function.
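Such a query-key map can be extracted from any attention layer. The sketch below uses a stand-alone, untrained `nn.MultiheadAttention` matching the 320-D, five-head setup as a proxy for one encoder layer; it is not the trained BGC-Prophet model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-alone attention layer matching the paper's head setup (320-D, 5 heads).
attn = nn.MultiheadAttention(embed_dim=320, num_heads=5, batch_first=True)

genes = torch.randn(1, 10, 320)          # 10 gene embeddings from one cluster
with torch.no_grad():
    out, weights = attn(genes, genes, genes, need_weights=True)

# weights: (1, 10, 10) query-by-key matrix, averaged over the five heads;
# plotting it as a heatmap gives a figure like panel E. Column j holds
# the attention every gene pays to gene j, which is how panel F reads
# the map for its highlighted gene.
attention_to_gene_7 = weights[0, :, 7]   # shape (10,)
```

Each row of the weight matrix is a softmax over keys, so the rows sum to 1; a column with consistently high values marks a gene that the rest of the cluster attends to.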
6.3. Comprehensive Profiling of BGCs in 982 Genomes from Aspergillus
BGC-Prophet and antiSMASH were used to predict BGCs in the AG dataset (982 Aspergillus genomes), a genus with high biosynthetic potential.
The following figure (Figure 4 from the original paper) shows the comparison of BGC-Prophet and antiSMASH predictions in Aspergillus genomes:

Figure 4: BGC-Prophet predicts a profoundly higher number of BGCs across various categories than antiSMASH.
- Overall Prediction Count: BGC-Prophet predicted almost three times as many BGCs as antiSMASH (167,375 vs 59,037; Figure 4A), suggesting an ability to predict more previously unannotated BGCs. Given that MIBiG contains only 2502 experimentally validated BGCs, a substantial portion of these newly predicted BGCs are likely novel.
- Category-Specific Comparisons (Figure 4B, C):
  - Terpenes: BGC-Prophet predicted significantly more terpene BGCs (52,004 vs 7,748), with only 7,260 common predictions (intersection), indicating a large number of terpene BGCs missed by antiSMASH.
  - NRPs (Nonribosomal Peptides): Predictions were nearly identical between the two tools (27,603 for BGC-Prophet vs 27,100 for antiSMASH), with a high intersection of 26,278, suggesting strong agreement for this well-characterized category.
  - Polyketides: BGC-Prophet predicted a greater number of polyketide BGCs (35,606 vs 18,225), with 16,607 common predictions.
  - RiPPs (Ribosomally synthesized and post-translationally modified peptides): The predictions showed complementarity. BGC-Prophet found 27,155 RiPP BGCs while antiSMASH found 8,082, with a small intersection of 1,401, highlighting that the two tools capture different sets of RiPPs and that BGC-Prophet adds substantial coverage.
  - Alkaloids and Saccharides: BGC-Prophet predicted additional BGCs in these categories compared to antiSMASH.
  - Other: A significant number of BGCs predicted by BGC-Prophet fell into the "other" category, further supporting its capacity to detect novel BGCs that do not fit standard classifications.
- Linear Correlation (Supplementary Fig. S7): The prediction results of the two tools showed a clear linear correlation, indicating that BGC-Prophet's predictions are not biased toward specific Aspergillus species but consistently extend detection across the genus.
- Validation of Alkaloids and Saccharides (Supplementary Fig. S8): BiG-SCAPE [47] was used to compare BGC-Prophet's predictions against MIBiG. The predicted BGCs were significantly more similar to same-category BGCs in MIBiG than to different-category BGCs (t-test), supporting the reliability of its predictions for these classes.
6.4. Comprehensive Profiling of BGCs on 85,203 Microbial Genomes from the Majority of Bacterial and Archaeal Lineages
BGC-Prophet was applied to the 85KG dataset (85,203 bacterial and archaeal genomes from GTDB), yielding significant insights into BGC abundance and diversity.
The following figure (Figure 5 from the original paper) shows the comprehensive profile of BGCs across microbial genomes:

Figure 5: Comprehensive profile of BGCs in microbial genomes.
- Overall BGC Count: Of 85,203 genomes, 41,599 (approximately 48.8%) were found to contain BGCs, leading to the identification of a total of 119,305 BGCs.
- Proportions of BGC Categories (Figure 5A):
  - The three most widely distributed BGC categories (present in the highest percentage of species) were polyketide, NRP, and RiPP.
  - The three most abundant categories (highest total count of BGCs) were NRP, polyketide, and RiPP.
  - Alkaloid BGCs had the narrowest distribution across species.
  - Comparison to MIBiG: In the MIBiG database, the most abundant categories are polyketide, NRP, and RiPP. BGC-Prophet identified a significantly greater number of BGCs in the "other" category (from 324 to 32,233), demonstrating its ability to mine potentially novel BGC categories (Supplementary Table S5).
- Host Distribution (Figure 5A, C and Supplementary Table S6):
  - Phylum Level:
    - Actinomycetota exhibited the highest predicted number of BGCs (39,252 in total).
    - Pseudomonadota showed the widest genomic coverage, with 12,637 genomes containing at least one BGC, totaling 29,675 BGCs.
  - Order Level (Figure 5C): The 27 orders with the highest average number of predicted BGCs were distributed across 15 phyla, including Actinobacteria and Acidobacteriota, which are known for high biosynthetic potential.
- Archaea vs. Bacteria:
  - BGC Count: 1762 BGCs from 1079 archaeal genomes, versus 117,543 BGCs from 40,520 bacterial genomes.
  - Average BGCs per genome: 1.63 for archaea versus 2.90 for bacteria.
  - Statistical Significance: A t-test showed a significantly lower abundance of BGCs in archaeal genomes than in bacterial genomes.
  - Predominant Categories: Saccharides and RiPPs in archaea; NRPs and polyketides in bacteria.
  - Interpretation: This difference might reflect the more ancient nature of archaea and their distinct metabolic strategies (e.g., sulfur reduction, denitrification) in often more extreme, simpler environments, compared with bacteria, which thrive in more complex, competitive environments that favor a higher frequency of BGCs for resource competition and adaptation.
6.5. Comprehensive Profiling of BGCs in 9428 Metagenomic Samples
BGC-Prophet's UHT capability enabled whole-metagenome screening on the MG dataset (9428 human metagenomic samples).
The following figure (Figure 6 from the original paper) shows the classification and quantity of BGCs extracted from metagenome datasets:

Figure 6: Classification and quantity of BGCs extracted by BGC-Prophet from 47 metagenome datasets.
- Sample Coverage: Of the 9428 metagenomic samples analyzed, 8255 were predicted to contain at least one BGC.
- BGC Count and Species Distribution: A total of 248,229 BGCs were predicted, distributed among 2922 species (from 160,814 bins, of which 132,809 were successfully assigned a species).
- Enrichment in Actinomycetota: Consistent with the GTDB dataset findings, BGCs identified from the metagenome dataset were significantly enriched in species belonging to Actinomycetota compared with other species.
  - Average BGCs per genome in Actinomycetota: 8.30.
  - Average BGCs per genome in other species: 4.24.
  - Statistical Significance: t-test.
  - Implication: Actinomycetota appears to be a particularly rich source of BGCs in human-associated microbial communities.
6.6. The Profound Enrichment of BGC in Microbes After Important Geological Events
The study investigated the relationship between BGC distribution and geological events over billions of years, suggesting environmental influences on BGC evolution.
- TimeTree Analysis: TimeTree [50] was used to identify two key time points associated with rapid lineage growth: the Great Oxidation Event and the Cambrian Explosion.
- Great Oxidation Event (~2.5-2.3 billion years ago):
  - Pre-event group: Three microbial genera (Mesoaciditoga, Vampirovibrio, Synechococcus) comprising 56 of the 41,599 genomes with BGCs.
  - Post-event group: 2215 genera comprising 41,543 genomes.
  - Overall BGCs: A significant increase in the average number of BGCs per genome, from 2.5 (pre) to 4.5 (post) (t-test; Supplementary Fig. S10).
  - Polyketides: Abundance significantly increased (average from 1.09 to 2.81 per genome; t-test), suggesting a potential role for polyketides in microbial adaptation to increased oxygen levels, or that oxygen facilitated their production.
  - RiPPs: No significant change (1.29 to 1.25; t-test).
  - NRPs: Increased (1.0 to 3.16), but not statistically significant owing to limited pre-event data (t-test).
- Cambrian Explosion (~542-520 million years ago):
  - Pre-event group: 1529 genera comprising 9212 of the 41,599 genomes.
  - Post-event group: 589 genera comprising 32,387 genomes.
  - Overall BGCs: A significant increase in the average number of BGCs per genome, doubling from 2.95 (pre) to 6.07 (post) (t-test; Supplementary Fig. S10).
  - Category-specific increases: All BGC categories showed an increase; polyketides rose significantly from 1.77 to 3.52, and NRPs from 2.14 to 3.77 (t-test).
  - Context: This period saw rapid diversification of multicellular organisms, heightened ocean oxygenation, and an exponential increase in microbial diversity. The emergence of multicellular hosts and diverse microenvironments likely created strong selective pressures, driving the evolution of BGCs to produce specialized metabolites for competition and adaptation.
- Interpretation: These findings highlight a correlation between major environmental shifts and surges in BGC abundance and diversity, suggesting that microbes adapt to changing conditions by evolving their biosynthetic capabilities. The direct causal link between these events and BGC evolution, as well as the functional roles of these BGCs, requires further investigation.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces BGC-Prophet, a novel transformer-based language model designed for the accurate and ultrahigh-throughput prediction and classification of Biosynthetic Gene Clusters (BGCs) from microbial genomes and metagenomes. By treating BGC analysis as a language processing task, using individual genes as tokens, and leveraging ESM-2 8M protein embeddings, BGC-Prophet effectively captures location-dependent relationships between genes within BGCs.
The model demonstrates performance comparable to or superior to existing deep learning methods like DeepBGC in BGC detection and classification, while achieving ultrahigh-throughput capabilities (orders of magnitude faster than competitors). This efficiency enabled BGC-Prophet to perform pan-phylogenetic screening of 85,203 genomes and whole-metagenome screening of 9,428 metagenomes, leading to the profiling of nearly a million BGCs.
Key findings include the identification of numerous previously unannotated BGCs (especially in Aspergillus), the comprehensive mapping of BGC distribution and abundance across diverse microbial lineages (e.g., high enrichment in Actinomycetota, prevalence of polyketide, NRP, and RiPP BGCs), and the discovery of correlations between BGC enrichment patterns and major geological events (Great Oxidation and Cambrian Explosion), suggesting a strong influence of environmental changes on BGC evolution. BGC-Prophet provides significant contributions to understanding microbial secondary metabolism and holds promise for applications in synthetic biology.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Inability to Determine Specific Small Molecules: BGC-Prophet currently cannot predict the exact small molecules produced by the identified BGCs.
  - Future Work: Integrate BGC-Prophet with computational chemistry tools to predict BGCs associated with known small molecules, followed by screening and validation of these predictions.
- Training Data Size and Overfitting: Despite high accuracy, the model's performance could be constrained by the relatively small size of the MIBiG dataset (2502 validated BGCs), leading to concerns about overfitting.
  - Future Work: Construct more diverse and comprehensive BGC databases to enhance model training and validation, and incorporate debiasing techniques to improve robustness, especially given potential long-tail distributions in the dataset.
- False Positives for Novel BGCs: While capable of identifying potentially novel BGCs, there remains a risk of false positives, particularly for BGCs lacking well-defined biosynthetic signatures.
  - Future Work: Experimental validation is necessary to confirm the true nature of these predictions.
- Variations in Category Performance: The model's performance varies across different BGC categories (e.g., it struggles with RiPP predictions).
  - Future Work: Further refinement through data augmentation and targeted model training for less well-represented classes is needed.
- Dynamic BGC Evolution: The current study identifies correlations with geological events but needs deeper investigation into the dynamic gain or loss of BGCs over time.
  - Future Work: Examine the gain or loss of BGCs on a dynamic scale and delve into the functional roles and ecological significance of these BGCs in bacterial evolution.
- Broader Applicability: While focused on BGCs, the language model framework has broader potential.
  - Future Work: Extend the approach to discover other functional genes, such as antibiotic resistance genes and anti-CRISPR proteins, potentially through fine-tuning for specific downstream tasks (e.g., predicting expression conditions or chemical structures for BGCs, or type and mechanism for resistance genes).
7.3. Personal Insights & Critique
This paper presents a significant advancement in BGC discovery by elegantly reframing the problem as a language processing task. The core innovation lies in the combination of genes as tokens, ESM-2 8M embeddings, and the transformer encoder architecture. This approach cleverly capitalizes on the strengths of modern NLP models to overcome the limitations of previous bioinformatics tools, particularly in terms of efficiency and the ability to detect novel BGCs.
Strengths and Inspirations:
- Paradigm Shift: The analogy of genes as words and BGCs as sentences is intuitive and powerful. It opens up a vast toolkit from NLP for complex genomic analysis. This modularity, where gene embeddings can be pre-computed and then fed into a transformer, is highly efficient.
- Ultrahigh-Throughput: The speed gain (minutes vs. hours per genome) is a game-changer for large-scale studies, enabling the analysis of entire databases like GTDB and comprehensive metagenomes, which was previously impractical. This capability is critical for accelerating natural product discovery and synthetic biology efforts.
- Discovery Potential: The ability to identify significantly more BGCs and predict potentially novel categories beyond antiSMASH's rule-based limitations is crucial for expanding our understanding of microbial chemistry.
- Evolutionary Insights: Connecting BGC abundance and diversity to major geological events offers a fascinating glimpse into the co-evolution of life and its chemistry, inspiring new research avenues into environmental drivers of metabolite production. This kind of interdisciplinary finding strengthens the biological impact of the computational method.
- Interpretability (Attention Maps): The use of attention maps to highlight key genes within a BGC provides a layer of interpretability often lacking in complex deep learning models. This can guide experimentalists in identifying critical components of biosynthetic pathways.
Potential Issues/Critique and Areas for Improvement:
- "Black Box" of ESM Embeddings: While ESM embeddings are powerful, their internal representation is still somewhat of a "black box." A deeper analysis of which evolutionary or functional signals captured by ESM from the gene sequences are most relevant for BGC prediction could further enhance understanding and model design.
- Validation of Novel BGCs: The paper highlights the prediction of many "novel" BGCs. While statistical validation (e.g., BiG-SCAPE comparison) is provided for some categories, the ultimate proof lies in experimental characterization of the metabolites produced. The challenge of connecting BGCs to their specific small molecules remains, and BGC-Prophet currently doesn't address this directly.
- Data Imbalance: The acknowledged issue with RiPP predictions due to data imbalance in MIBiG points to a common deep learning problem. While data augmentation is suggested, exploring few-shot learning or meta-learning techniques might be beneficial for underrepresented BGC classes.
- Threshold Sensitivity: The precision/recall trade-off at a default threshold (e.g., 0.5) for gene detection, where BGC-Prophet has much higher precision but slightly lower recall than DeepBGC, indicates that the choice of threshold can significantly impact results. A more detailed analysis of optimal threshold selection for different use cases (e.g., high-precision vs. high-recall discovery) would be valuable.
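The threshold trade-off just described can be illustrated with a simple sweep; the per-gene labels, probabilities, and thresholds below are invented for illustration, not taken from the paper:

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall when genes with prob >= threshold are called BGC genes."""
    pred = [p >= threshold for p in probs]
    tp = sum(p and t for p, t in zip(pred, y_true))          # true positives
    fp = sum(p and not t for p, t in zip(pred, y_true))      # false positives
    fn = sum((not p) and t for p, t in zip(pred, y_true))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical gene-level labels (1 = BGC gene) and model probabilities
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
probs  = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.7, 0.52]
for thr in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, probs, thr)
    print(f"threshold={thr}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is why the appropriate operating point differs between high-precision confirmation and high-recall discovery workflows.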
Computational Resources for Training: While
BGC-Prophetis fast for inference, the training of transformer models andESMitself requires substantial computational resources. This might be a barrier for smaller labs wishing to fine-tune or adapt the model.Transferability and Applications: The methods and conclusions of this paper are highly transferable. The
language modeling frameworkcan be broadly applied to any biological sequence data wherelocation-dependent relationshipsandcontextual understandingare important. This includes:
- Other functional gene prediction: As suggested, identifying antibiotic resistance genes, virulence factors, CRISPR-Cas systems, or metabolic pathways.
- Protein engineering: Understanding protein domain interactions or predicting protein function from sequence context.
- Synthetic Biology: The ability to accurately delineate BGCs is fundamental for refactoring and engineering new biosynthetic pathways in host organisms.
- Drug Discovery: High-throughput screening capabilities can rapidly identify new potential sources of natural products, accelerating the drug discovery pipeline.
- Ecological Studies: Deeper insights into microbial chemical ecology and how microbes adapt to their environments through specialized metabolism.
In conclusion, BGC-Prophet is a robust and highly efficient tool that pushes the boundaries of BGC discovery. Its innovative use of language modeling and ESM embeddings offers a powerful new lens through which to explore the vast biosynthetic potential of the microbial world, significantly contributing to both fundamental biological understanding and applied biotechnological endeavors.