Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects
TL;DR Summary
This review summarizes AI-driven high-throughput screening of food-derived bioactive peptides, highlighting advances in deep learning models for identifying functional peptides and emphasizing future directions including multi-scale feature frameworks and screening methodologies.
Abstract
Trends in Food Science & Technology 156 (2025) 104845. Available online 13 December 2024. © 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects. Jingru Chang (e), Haitao Wang (a,b,c,d), Wentao Su (a,b,c,d), Xiaoyang He (e,**), Mingqian Tan (a,b,c,d,*).

Affiliations: (a) State Key Laboratory of Marine Food Processing and Safety Control, Dalian Polytechnic University, Dalian, 116034, Liaoning, China; (b) Academy of Food Interdisciplinary Science, School of Food Science and Technology, Dalian Polytechnic University; (c) National Engineering Research Center of Seafood, Dalian Polytechnic University; (d) Dalian Key Laboratory for Precision Nutrition, Dalian Polytechnic University; (e) School of Information Science and Engineering, Dalian Polytechnic University.

Handling Editor: Dr. S. Charlebois. Keywords: Artificial intelligence; Food-derived bioactive peptides; Mac
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects
1.2. Authors
Jingru Chang, Haitao Wang, Wentao Su, Xiaoyang He, and Mingqian Tan. Affiliations are primarily with Dalian Polytechnic University, Dalian, China, including the National Engineering Research Center of Seafood.
1.3. Journal/Conference
The paper is published in Trends in Food Science & Technology. This is a highly reputable review journal in the field of food science, known for publishing high-impact work on food processing, safety, and nutritional aspects. Its influence is significant among food scientists and researchers.
1.4. Publication Year
2024
1.5. Abstract
This paper reviews the application of artificial intelligence (AI) in the screening of food-derived bioactive peptides (FBPs). It highlights that traditional experimental methods for FBP identification are often laborious, time-consuming, and costly, while conventional computational approaches like virtual screening and molecular dynamics simulations have inherent limitations. AI technology offers a high-throughput solution for screening and analyzing FBP activity mechanisms, promising to advance FBP development and application. The review outlines the general AI screening process, covering data foundation, molecular feature representation, machine learning (ML) and deep learning (DL) model construction and training, and evaluation/validation. It summarizes recent AI advancements in screening FBPs with various bioactivities (anti-inflammatory, antibacterial, antioxidant, flavor-enhancing, hypotensive), noting that research on anti-obesity and anti-fatigue peptides is still nascent. Key findings indicate that DL shows superior predictive advantages over traditional ML. However, challenges remain across different bioactivities. Future directions include developing data augmentation strategies within food-specific large models, creating a universal deep learning framework based on multi-scale chemical space features to predict peptide-target dynamic interactions, and establishing a high-throughput screening framework that also enhances AI methods for multi-functional properties like anti-obesity and anti-fatigue effects.
1.6. Original Source Link
/files/papers/690b66a9079665a523ed1dbe/paper.pdf

The publication status is officially published.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the inefficient and resource-intensive nature of identifying food-derived bioactive peptides (FBPs) through traditional experimental methods. FBPs are protein fragments from food sources that possess specific physiological regulatory effects, playing a vital role in nutrition and health. They are characterized by low molecular weight, low toxicity, ease of absorption, high biological activity, and strong targeting capabilities. Their potential benefits, such as nutrient supplementation, immune enhancement, and overall health contribution, make them a fascinating area of research. However, the sheer combinatorial complexity of peptides (the number of possible sequences grows as 20^n for a peptide of n residues, before even considering higher-level structure) makes exhaustive experimental screening impractical. While conventional computational methods like virtual screening and molecular dynamics (MD) simulations offer some streamlining, they are limited by the inherent flexibility of peptides, conformational changes during peptide-target binding, high computational demands, and limited information on peptide-protein complexes.
The paper's entry point is the growing capability of Artificial Intelligence (AI) to overcome these limitations. AI, particularly machine learning (ML) and deep learning (DL), offers a promising pathway for high-throughput screening and analysis of activity mechanisms for FBPs. It aims to extract key features from known bioactive peptide datasets using advanced algorithms, allowing for rapid prediction of activity in large-scale, unknown FBPs, reducing costs, and minimizing human error.
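The combinatorial argument above can be made concrete: there are 20^n distinct linear sequences of length n over the 20 standard amino acids. A quick plain-Python check:

```python
# Size of the linear peptide sequence space for the 20 standard amino acids.
AA_COUNT = 20

for length in (2, 5, 10, 20):
    n_sequences = AA_COUNT ** length
    print(f"length {length:2d}: {n_sequences:,} possible sequences")

# Even at the short end of the typical FBP range (2-20 residues),
# exhaustively synthesizing and assaying every candidate is infeasible.
```

Already at length 10 the space exceeds 10^13 sequences, which is why high-throughput computational prescreening is attractive.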
2.2. Main Contributions / Findings
The paper provides a comprehensive review of the current state of AI-driven screening for FBPs, highlighting recent advancements and outlining future prospects. Its primary contributions are:
- Outline of AI Screening Process: It systematically describes the general process of AI screening for FBPs, covering the data foundation, molecular feature representation, ML/DL model construction and training, and evaluation and validation. This structured overview serves as a guide for researchers.
- Summary of Recent Advances: It summarizes recent research progress in AI screening of FBPs across various bioactivities, including anti-inflammatory, antimicrobial, antioxidant, flavor-enhancing, and hypotensive properties, consolidating diverse research efforts and identifying areas of strength.
- Identification of Gaps: It critically discusses current key issues and challenges in the field, such as limited data size, ambiguity in negative data, class imbalance, reliance on sequence-based representations, and insufficient research on certain bioactivities (e.g., anti-obesity and anti-fatigue peptides).
- Future Research Directions: It proposes concrete future directions and trends, including data augmentation strategies within food-specific large models, a universal deep learning framework based on multi-scale chemical space features, prediction of peptide-target dynamic interactions, and a high-throughput screening framework for multifunctional properties.

Key findings include:

- Deep learning models generally demonstrate clear predictive advantages over traditional machine learning techniques for FBP screening.
- Significant advances have been made in identifying FBPs with properties like anti-inflammatory, antibacterial, antioxidant, flavor-enhancing, and hypotensive effects.
- Research on anti-obesity and anti-fatigue peptides using AI is still in its nascent stages, presenting a significant area for future exploration.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several concepts from computer science, biology, and chemistry is essential.
- Food-Derived Bioactive Peptides (FBPs): Specific protein fragments, typically 2 to 20 amino acids long, released from food proteins (e.g., milk, soy, fish) through enzymatic hydrolysis or fermentation. They exert beneficial physiological effects beyond basic nutrition, such as anti-inflammatory, antioxidant, antimicrobial, or antihypertensive activities. Their efficacy often depends on their amino acid sequence, length, and three-dimensional structure.
- Artificial Intelligence (AI): A broad field of computer science dedicated to creating systems that can perform tasks normally requiring human intelligence, including learning, reasoning, problem-solving, perception, and language understanding. The paper focuses on AI for high-throughput screening and analysis.
- Machine Learning (ML): A subset of AI that enables systems to learn from data without being explicitly programmed. ML algorithms build a model from sample data, known as "training data," to make predictions or decisions.
  - Supervised Learning: The primary ML paradigm used in this paper. A model is trained on a labeled dataset, where each input example is paired with an output label, and learns a mapping from inputs to outputs so it can predict labels for new, unseen data. Examples include classification (predicting a categorical label, e.g., active/inactive peptide) and regression (predicting a continuous value, e.g., binding affinity).
  - Unsupervised Learning: Finds patterns or structures in unlabeled data. Not a primary focus of this paper for FBP screening.
  - Reinforcement Learning: An agent learns to make decisions by performing actions in an environment to maximize a reward. Not a primary focus of this paper for FBP screening.
- Deep Learning (DL): A subfield of ML that uses artificial neural networks (ANNs) with multiple layers (hence "deep") to learn complex patterns from data. DL models can automatically learn hierarchical feature representations from raw input, often eliminating the need for manual feature engineering.
  - Neural Networks: Computational models inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers; each connection carries a weight, and neurons apply activation functions.
  - Convolutional Neural Networks (CNNs): Primarily used for image processing but also applicable to sequential data; convolutional layers detect local patterns and pooling layers reduce dimensionality.
  - Recurrent Neural Networks (RNNs): Designed to process sequential data (like amino acid sequences) via an internal memory of previous inputs. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures mitigate vanishing/exploding gradients and capture long-term dependencies.
  - Graph Neural Networks (GNNs): Operate on graph-structured data (e.g., molecular structures where atoms are nodes and bonds are edges), learning node representations by aggregating information from neighboring nodes.
  - Transformers: Rely heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence; they excel at capturing long-range dependencies and allow parallel processing, unlike RNNs.
  - Diffusion Models: A class of generative models that learn to generate new samples by reversing a gradual noising process; they have shown impressive results in image and biomolecular structure generation.
- Molecular Feature Representation: The process of converting raw molecular data (e.g., amino acid sequences, chemical structures) into a numerical format (vectors or matrices) that ML/DL models can process. This step is critical because representation quality directly impacts model performance.
  - Sequence-intrinsic methods: Focus on the amino acid sequence itself (e.g., amino acid composition, dipeptide composition, pseudo amino acid composition (PseAAC)).
  - Physicochemical methods: Incorporate properties such as charge, hydrophobicity, and molecular weight.
  - Structural methods: Analyze three-dimensional conformation (e.g., secondary structure, solvent accessibility).
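As a minimal illustration of a sequence-intrinsic representation, amino acid composition (AAC) maps any peptide onto a fixed 20-dimensional vector of residue frequencies. A plain-Python sketch (the tripeptide used is just an example):

```python
# Amino acid composition (AAC): peptide sequence -> 20-dim frequency vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence: str) -> list[float]:
    """Return the relative frequency of each standard amino acid."""
    sequence = sequence.upper()
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]

# Example tripeptide (illustrative input, not a claim about the paper's data).
vector = aac("IPP")
print(len(vector))                       # 20 features, regardless of length
print(vector[AMINO_ACIDS.index("P")])    # P occurs in 2 of 3 positions
```

Because the output length is fixed at 20, peptides of any length become comparable inputs for the ML models described later.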
3.2. Previous Works
The paper references several foundational AI achievements and applications, which serve as benchmarks for AI's potential:
- AlphaGo (Ma et al., 2024; Silver et al., 2016): DeepMind's AI program that defeated human Go champions, showcasing AI's capability in complex strategic reasoning, primarily through deep reinforcement learning.
- ChatGPT (Schulman et al., 2022): OpenAI's large language model (LLM), demonstrating advanced natural language understanding and generation, powered by transformer architectures.
- SORA (Liu et al., 2024): OpenAI's text-to-video generative AI, illustrating progress in AI's ability to create complex, dynamic content.
- AlphaFold 3 (Brooks et al., 2024): DeepMind's breakthrough in protein structure prediction, capable of predicting the structures of protein-ligand interactions and biomolecular complexes with high accuracy by leveraging diffusion models and deep learning. This is particularly relevant because FBP activity often depends on peptide-protein interactions.

In the context of FBPs and related bioactive compounds:

- Reviews on ML applications in food bioactive compounds (Doherty et al., 2021; Kussmann, 2022; Zhang, Zhang, Freddolino, & Zhang, 2024) and bioinformatics tools for active peptides (Du, Comer, & Li, 2023; Rivero-Pino, Millán-Linares, & Montserrat-de-la-Paz, 2023) have been published. These works establish the broader context for using computational methods in this domain.

The paper points out a notable gap: while general AI applications in food science and ML for bioactive compounds have been reviewed, a focused, comprehensive synthesis on AI for FBP screening specifically was missing, which this paper aims to fill.
3.3. Technological Evolution
The evolution of technology in FBP screening has moved from:
- Traditional Experimental Methods: Labor-intensive, time-consuming, and costly because numerous peptides must be synthesized and tested. Reliable, but low-throughput.
- Conventional Computational Approaches:
  - Virtual screening: Rapidly screens large molecular libraries for potential activity, but often lacks precision and struggles with conformational changes.
  - Molecular dynamics simulations: Provide atomic-level insight into molecular interactions and dynamics, but are computationally intensive and limited by simulation time scales and the availability of peptide-protein complex information.
- Artificial Intelligence (AI): The current frontier, offering high-throughput, cost-effective, and less error-prone screening.
  - Non-deep-learning ML: Early applications used models such as Support Vector Machines (SVMs) and Random Forests (RFs) with meticulously crafted, domain-knowledge-based features. Effective for high-dimensional data, but limited by the quality of feature engineering.
  - Deep Learning (DL): The latest generation, capable of automatically learning complex molecular features from raw data, yielding higher predictive accuracy and generalization. This includes CNNs, RNNs, GNNs, Transformers, and Diffusion Models, each suited to different data types and complexities.

This paper sits at the leading edge of this timeline, emphasizing the transition toward DL for FBP screening while highlighting the remaining challenges.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach (as a review) lie in its comprehensive and forward-looking synthesis of AI specifically for FBP screening, rather than proposing a new method itself.
- Focus: While previous reviews covered AI in broader food industry applications or ML for bioactive compounds generally, this paper specifically targets the screening of food-derived bioactive peptides. This narrow focus allows a deeper dive into the unique challenges and opportunities of the domain.
- Structured Process: It provides a systematic, multi-step AI-driven screening process (data, representation, model, evaluation), giving researchers a clear framework.
- Detailed Algorithmic Comparison: It offers a structured comparison of the various ML and DL algorithms used in FBP screening, outlining their advantages and limitations in this specific context.
- Identification of Gaps in Specific Bioactivities: It specifically points out under-researched areas such as anti-obesity and anti-fatigue peptides, guiding future research.
- Forward-looking Perspective: Beyond summarizing existing work, it critically analyzes current bottlenecks (data scarcity, representation, interpretability) and proposes concrete future directions (food-specific large models, multi-scale features, dynamic interactions, high-throughput frameworks). This differentiates it from reviews that merely catalog existing applications.
4. Methodology
The paper describes the general process of AI-driven screening for FBPs as a supervised learning problem. This process typically involves four key interconnected steps: data foundation, molecular feature representation, model construction and training, and evaluation and validation.
4.1. Principles
The core idea behind AI-driven screening of FBPs is to leverage computational intelligence to overcome the limitations of traditional experimental and computational methods in identifying and characterizing bioactive peptides. The theoretical basis is rooted in machine learning principles, specifically supervised learning, where models learn from known bioactive peptide datasets to predict the properties of unknown peptides.
The intuition is that if an AI model can learn the complex relationships between a peptide's molecular structure (or sequence) and its biological activity, it can then rapidly and cost-effectively predict the activity of new, untried peptides. This approach aims to:
- Extract features: Identify patterns and key features in known bioactive peptide datasets that correlate with specific bioactivities.
- Model building: Construct ML or DL models capable of capturing these complex relationships.
- Prediction: Apply the trained models to screen large libraries of potential FBPs and predict their bioactivity.
- Efficiency: Reduce the labor-intensive, time-consuming, and costly nature of traditional experimental screening.
- Accuracy: Improve upon the prediction accuracy of conventional computational methods by learning from diverse data and modeling intricate interactions.

The process typically uses classification models for binary predictions (e.g., active/inactive) based on labeled positive and negative samples, or regression models to predict molecular binding affinity or IC50 values.
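To make the classification setting concrete, the sketch below fits a logistic-regression classifier by batch gradient descent on a toy labeled set; the two features (hydrophobic fraction, net charge) and all labels are invented purely for illustration and are not from the paper:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit w, b for P(active) = sigmoid(w . x + b) via batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi                       # gradient of the log-loss
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gwj / len(X) for wj, gwj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(w, b, x):
    """Binary decision at the 0.5 probability threshold (1 = 'active')."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# Toy data: [hydrophobic fraction, net charge]; label 1 = "active" (invented).
X = [[0.8, 1.0], [0.7, 0.5], [0.9, 1.5], [0.2, -1.0], [0.1, 0.0], [0.3, -0.5]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
print([predict(w, b, xi) for xi in X])  # recovers labels on this separable set
```

A regression variant would simply replace the sigmoid/log-loss with a linear output and squared error to predict a continuous value such as an IC50.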
4.2. Core Methodology In-depth (Layer by Layer)
The overall AI-driven screening process for FBPs is illustrated in Figure 2 of the original paper.
This image is Figure 2 of the paper: a schematic of the AI-driven screening workflow for food-derived bioactive peptides (FBPs), covering the data foundation, molecular feature representation, machine learning model construction and training, and model evaluation steps.
The AI-driven screening process for FBPs involves several interconnected stages: data foundation, molecular feature representation, construction of machine learning models, and model evaluation.
4.2.1. Data Foundation
Data is the cornerstone of any machine learning approach. For FBP screening, high-quality, diverse, and sufficient data are critical for model performance. The paper highlights that protein-peptide interactions account for about 40% of protein-ligand interactions, making target-based peptide research central.
- Training Datasets: For supervised learning, datasets include:
  - Peptides: sequences, structures, and associated activities.
  - Target proteins: the biomolecules with which peptides interact.
  - Protein-peptide complexes: structural information about how peptides bind to proteins.
  - Active peptides: peptides with known bioactivities.
- Key Databases: The paper summarizes several important databases crucial for the data foundation (Table 1), categorized by type:
  - Protein Structural Databases:
    - RCSB PDB: the world's largest biological macromolecule structure database (X-ray, NMR, electron microscopy).
    - UniProt: comprehensive database of protein-related information (Swiss-Prot for reviewed sequences, TrEMBL for unreviewed).
    - Pfam: specializes in protein families and structural domains.
    - AlphaFoldDB: provides over 200 million AI-predicted protein structures.
  - Peptide Structural Databases:
    - NORINE: database of nonribosomal peptides.
    - FoldamerDB: public database of peptidic foldamers.
    - ConjuPepDB: database of drug-peptide conjugates.
    - StraPep: collects active peptides with known structures.
    - DBAASP: information on antimicrobial peptides (AMPs).
  - Protein-Peptide Complex Structural Databases:
    - PepBDB: extensive information on biological peptide-mediated protein interactions.
    - PepX: comprehensive dataset of protein-peptide complexes from the PDB.
    - STRING: information on protein-protein interactions.
    - BioLiP2: updated structural database of biologically relevant ligand-protein interactions.
  - Bioactive Peptide Databases:
    - Food DB: molecules found in food.
    - Coconut: natural product database.
    - BioPep DB: searchable database of FBPs.
    - BIOPEP-UWM: searchable database of bioactive peptides, especially food-derived ones.
    - DFBP: FBP database with food sources of protein.
    - Feptide DB: collection of open-access bioactive peptide repositories.
    - SpirPep: combination of published bioactive peptide databases.
    - CAMPR3: comprehensive information on antimicrobial peptides.
    - DBAASP v3: information on antimicrobial peptides.
    - NeuroPep 2.0: neuropeptide database.
    - MAMPs-Pred: provides antimicrobial and non-antimicrobial peptides.
    - IF-AIP: provides anti-inflammatory and non-anti-inflammatory peptides.

The following are the results from Table 1 of the original paper:
| Category | Name | Description | Website |
| --- | --- | --- | --- |
| Protein Structural Databases | RCSB PDB | Currently the world's largest biological macromolecule structure database; as of June 15, 2024, 194,259 protein 3D structures recorded via X-ray crystallography, NMR spectroscopy, and electron microscopy. | https://www.rcsb.org/ |
| | UniProt | Currently the most comprehensive database of protein-related information: Swiss-Prot with 571,609 manually reviewed sequences, TrEMBL with 244,910,918 unreviewed sequences, plus PIR protein sequences. | https://www.uniprot.org/ |
| | Pfam | Provides complete classification information of protein families and structural domains; covers 21,979 protein families. | http://pfam.xfam.org/ |
| | AlphaFoldDB | Protein structure prediction database built on advanced AI; provides over 200 million predicted structures. | https://alphafold.ebi.ac.uk/ |
| Peptide Structural Databases | NORINE | Database of nonribosomal peptides with analytical tools; houses over 1000 peptides. | https://ngdc.cncb.ac.cn/databasecommons/database/id/1476 |
| | FoldamerDB | Public database of peptidic foldamers. | http://foldamerdb.ttk.mt |
| | ConjuPepDB | Public database of drug-peptide conjugates; 645 entries. | https://conjupepdb.ttk.hu/ |
| | StraPep | Collects all active peptides of known structure; 3791 bioactive peptide structures belonging to 1312 unique sequences. | http://isyslab.info/StraPep/ |
| | DBAASP | Information on antimicrobial peptides (AMPs); 21,426 peptides. | https://www.dbaasp.org/home |
| Protein-Peptide Complex Structural Databases | PepBDB | Extensive information on biological peptide-mediated protein interactions; currently 13,299 structures. | http://huanglab.phys.hust.edu.cn/pepbdb/ |
| | PepX | All protein-peptide complexes in the Protein Data Bank with peptide lengths up to 35 residues; 505 distinct interface clusters derived from 1431 complexes. | https://ngdc.cncb.ac.cn/databasecommons/database/id/1240 |
| | STRING | The most comprehensive information on protein-protein interactions; as of August 2, 2024, 332,075,812 interactions at highest confidence (score ≥ 0.900). | https://cn.string-db.org/ |
| | BioLiP2 | Updated structural database of biologically relevant ligand-protein interactions; as of June 15, 2024, 37,492 entries for peptide ligands. | https://zhanggroup.org/BioLiP2/index.cgi |
| Bioactive Peptide Databases | Food DB | 70,926 molecules found in food. | https://foodb.ca/ |
| | Coconut | Natural product database; over 400,000 molecules. | https://coconut.naturalproducts.net/ |
| | BioPep DB | Searchable database of FBPs; 4807 bioactive peptides. | http://bis.zju.edu.cn/biopepdbr/index.php |
| | BIOPEP-UWM | Searchable database of bioactive peptides, especially food-derived dietary constituents; 5047 peptides. | https://biochemia.uwm.edu.pl/biopep-uwm/ |
| | DFBP | FBP database; currently 6818 bioactive peptides and 21,249 food sources of protein. | http://www.cqudfbp.net/ |
| | Feptide DB | Collection of 12 open-access bioactive peptide repositories plus peptides extracted from research publications, used to predict food-derived bioactive peptides. | http://www4g.biotec.or.th/FeptideDB/ |
| | SpirPep | Combination of 13 published bioactive peptide databases; 28,334 unique sequences for comparison with putative peptides. | http://spirpepapp.sbi.kmutt.ac.th/BioactivePeptideDB.html |
| | CAMPR3 | Comprehensive information on antimicrobial peptides; 10,247 sequences obtained through the analysis of 1386 sequences from experimental studies. | http://www.camp3.bicnirrh.res.in/ |
| | DBAASP v3 | Over 15,700 entries on antimicrobial peptides: more than 14,500 monomers and nearly 400 homo- and hetero-oligomers. | http://dbaasp.org/ |
| | NeuroPep 2.0 | 11,417 unique neuropeptide entries. | https://isyslab.info/NeuroPepV2/ |
| | MAMPs-Pred | 6989 antimicrobial and non-antimicrobial peptides. | https://github.com/JianyuanLin/SupplementaryData |
| | IF-AIP | 5265 anti-inflammatory and non-anti-inflammatory peptides. | https://github.com/Mir-Saima/IF-AIP |
- Data Processing: After data collection, meticulous steps are required:
  - Data Cleaning: removing errors, inconsistencies, or duplicates.
  - Annotation: adding labels (e.g., active/inactive, specific activity type) to data points.
  - Normalization: scaling numerical features to a standard range to prevent certain features from dominating the learning process.
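The cleaning and normalization steps can be sketched in a few lines of plain Python; the records below (sequences and labels) are invented solely to demonstrate the pass:

```python
# Illustrative data-cleaning and normalization pass over a toy peptide table.
records = [
    {"seq": "IPP",    "active": 1},
    {"seq": "ipp",    "active": 1},   # duplicate differing only in case
    {"seq": "LVVYPW", "active": 1},
    {"seq": "GGG",    "active": 0},
    {"seq": "AA!X",   "active": 0},   # contains non-standard characters
]

VALID = set("ACDEFGHIKLMNPQRSTVWY")

# Cleaning: canonicalize case, drop invalid sequences, drop duplicates.
seen, clean = set(), []
for r in records:
    seq = r["seq"].upper()
    if set(seq) <= VALID and seq not in seen:
        seen.add(seq)
        clean.append({"seq": seq, "active": r["active"]})

# Normalization: min-max scale a numeric feature (here, peptide length) to [0, 1].
lengths = [len(r["seq"]) for r in clean]
lo, hi = min(lengths), max(lengths)
for r, n in zip(clean, lengths):
    r["len_scaled"] = (n - lo) / (hi - lo) if hi > lo else 0.0

print([r["seq"] for r in clean])  # ['IPP', 'LVVYPW', 'GGG']
```

Real pipelines would add activity-label reconciliation across source databases, but the shape of the pass is the same.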
4.2.2. Molecular Feature Representation
This step transforms raw molecular data into a numerical format suitable for ML models. It involves feature selection and feature extraction/encoding.
- Feature Selection: Commonly used features include:
  - Amino acid sequence: the primary structure of the peptide.
  - Structural information: secondary structure (e.g., alpha-helix, beta-sheet) and tertiary structure (3D conformation).
  - Physicochemical properties: hydrophobicity, charge, molecular weight, isoelectric point, etc.
- Feature Extraction and Encoding: Converting amino acid sequences or structures into numerical vectors or matrices.
  - Sequence-based Representation:
    - Amino acid composition (AAC): proportion of each amino acid.
    - Dipeptide composition: proportion of each pair of adjacent amino acids.
    - Pseudo amino acid composition (PseAAC): incorporates physicochemical properties and sequence arrangement.
    - One-hot encoding: each amino acid converted to a unique binary vector.
    - Composition-transition-distribution descriptors (CTDD): describe the composition, transition, and distribution of physicochemical properties.
    - Position-specific scoring matrix (PSSM): represents evolutionary information.
    - Limitation: features must be learned from scratch for each new dataset, and large training sets are needed.
  - Graph-based Representation:
    - Nodes represent amino acid atoms and edges represent covalent bonds; often used with Graph Neural Networks (GNNs).
    - Limitation: most 2D ligand methods overlook 3D molecular structure and peptide-target interactions.
  - Image-based Representation:
    - Molecular images (e.g., protein sequences converted to images) used as input for DL models (e.g., CNNs); captures more detailed molecular structure information.
    - Limitation: scarcity of high-quality labeled image datasets; high computational demands.
The following are the results from Table 2 of the original paper:
Table 2. Molecular representation methods and tools.

| Category | Method / Tool | Description |
| --- | --- | --- |
| Based on intrinsic sequence properties | Amino acid composition (AAC) | Transforms the protein sequence into a 20-dimensional vector quantifying the relative abundance of each amino acid. |
| | Dipeptide composition | Calculates the proportion of each dipeptide (two linked amino acids) within the sequence, capturing both amino acid distribution and local arrangement. |
| | Normalized Moreau-Broto autocorrelation descriptors | Characterize correlations along protein or peptide chains by analyzing specific structural features or physicochemical properties. |
| | Moran autocorrelation | Uses Moran's index to describe the spatial autocorrelation of amino acid properties or features within a protein sequence. |
| | Sequence-order-coupling | A metric for assessing the coupling between the amino acid sequence order of a protein and its three-dimensional structure. |
| | | Captures both local and global information by considering interactions between each amino acid and its surrounding residues, quantifying relative positions and physicochemical differences between amino acids. |
| | Position-specific scoring matrix (PSSM) | A global encoding strategy that converts a protein sequence into a 1000 × 20 binary matrix, providing information on the evolution of the protein sequence. |
| | One-hot encoding based on sequence | Converts each amino acid into a fixed-length binary vector, offering intuitive, simple, and scalable features. |
| Based on physicochemical properties | Total amino acid properties | Evaluates the similarity between protein sequences by calculating the proportion of identical amino acids at the same positions in two or more sequences. |
| | Composition-transition-distribution descriptors | Convert protein sequences into numerical feature vectors based on the physicochemical properties of amino acids, describing their composition, transition, and distribution. |
| | Amphiphilic pseudo amino acid composition | Based on AAC, additionally incorporates the physicochemical properties and arrangement information of amino acids. |
| | Pseudo amino acid composition | Considers not only the amino acid sequence order but also physicochemical properties (e.g., hydrophilicity, hydrophobicity, molecular weight) together with the composition of the 20 amino acids. |
| Based on structural properties | Topological structure at the atomic level | A mathematical descriptor based on molecular structure: atomic composition, chemical bond types, and the attributes of their connections. |
| | Secondary structure and solvent accessibility | The sequence is converted into two new sequences using secondary structure and solvent accessibility, represented by a 3D and a 2D vector; each amino acid yields a binary matrix. |
| Representation-related tools | Scratch Protein Predictor | Web tool forecasting tertiary structure and structural characteristics: secondary structure, hydrophobicity, disordered regions, structural domains, and residue interactions. https://scratch.proteomics.ics.uci.edu/ |
| | POSSUM | Website providing property information based on position-specific scoring matrices; contains 21 distinct PSSM descriptors. https://possum.erc.monash.edu/ |
4.2.3. Construction and Training of Machine Learning Models
This stage involves selecting and configuring ML algorithms and then training them using the prepared data.
4.2.3.1. Machine Learning Algorithms
ML techniques are categorized into non-deep learning and deep learning methods.
- Non-deep Learning Models: These rely on predefined feature sets.
  - Support Vector Machine (SVM): A powerful supervised learning algorithm used for classification and regression. It works by finding an optimal hyperplane that best separates data points of different classes in a high-dimensional space, maximizing the margin between classes.
    - Strengths: Well-suited for high-dimensional data, robust to overfitting, strong model interpretability.
    - Limitations: Performance is highly dependent on feature representation quality; memory-intensive; primarily designed for binary classification (requires adaptation for multiclass problems).
  - Random Forest (RF): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
    - Strengths: Handles high-dimensional data, high accuracy, parallelizable, robust to overfitting.
    - Limitations: Sensitive to feature selection (requires domain expertise); relatively poor interpretability compared to single decision trees; sensitive to imbalanced datasets.
- Deep Learning Models: These can automatically learn complex molecular features directly from raw data.
  - Convolutional Neural Networks (CNNs):
    - Structure (Figure 3a): Composed of convolutional layers (for feature extraction using filters), pooling layers (for dimensionality reduction), and fully connected layers (for classification/regression).
    - Mechanism: Filters slide over the input data (e.g., a peptide sequence represented as an image or 1D sequence), detecting local patterns.
    - Strengths: Strong generalization in image processing, effective for local feature extraction.
    - Limitations: Limited capability in learning global features of molecular interactions, potentially constraining effectiveness for complex interactions.

Figure 3 illustrates the architectures of five artificial intelligence models: (a) a convolutional neural network (CNN) with convolutional, pooling, and fully connected layers; (b) a recurrent neural network (RNN) showing the flow of information through time steps; (c) a graph neural network (GNN) with nodes and edges; (d) a Transformer model focusing on its encoder architecture; and (e) a diffusion model illustrating the forward and reverse diffusion processes.

  - Recurrent Neural Networks (RNNs):
    - Structure (Figure 3b): Networks with feedback loops that allow information to persist, making them suitable for sequential data. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are advanced RNN variants.
    - Mechanism: Process sequences one element at a time, using an internal state (memory) to capture dependencies across time steps.
    - Strengths: Excels at handling dependencies in sequential data; better generalization for temporal/contextual information.
    - Limitations: Prone to vanishing and exploding gradients, making convergence difficult for very long sequences. LSTM and GRU mitigate this but have complex architectures and high training costs, and are prone to overfitting with small, undiverse datasets.
- Graph Neural Networks (GNNs):
    - Structure (Figure 3c): Designed for graph-structured data, where nodes represent entities (e.g., atoms, amino acids) and edges represent relationships (e.g., covalent bonds, non-covalent interactions).
    - Mechanism: Learn node representations by iteratively aggregating information from their neighbors.
    - Strengths: Directly process graph-structured data; capture complex multi-scale atomic relationships.
    - Limitations: High computational complexity for large graphs; some variants can only effectively capture local molecular structures.
- Transformers:
    - Structure (Figure 3d): Primarily based on self-attention mechanisms (e.g., multi-head attention), which allow the model to weigh the importance of different parts of the input sequence; also include feed-forward layers.
    - Mechanism: Process sequences in parallel rather than sequentially, using attention to capture long-range dependencies between any positions. Positional encoding is required to retain order information.
    - Strengths: Excellent parallel computation capabilities; capture dependencies irrespective of distance.
    - Limitations: High complexity and training costs; the lack of a clear input-output mapping makes interpretability challenging for FBP prediction.
- Diffusion Models:
    - Structure (Figure 3e): A class of generative models that learn to reverse a stochastic process (diffusion) that gradually transforms data into noise.
    - Mechanism: Consists of a forward diffusion process (adding noise) and a reverse denoising process (learning to reconstruct data from noise). AlphaFold 3 leverages these for biomolecular structure prediction.
    - Strengths: Capable of high-quality data generation; used in AlphaFold 3 for protein-ligand interaction and complex structure prediction.
    - Limitations: Resource-intensive due to complex neural network architecture; struggles with discrete data; prone to hallucinations (generating plausible but incorrect outputs).
- Bottlenecks in Deep Learning Modeling:
  - High model complexity.
  - Long training and learning times.
  - Significant computational resource consumption.
  - Limited model interpretability.
  - Dependency on labeled data (scale, quality, diversity); existing databases often lack sufficient sample sizes for training complex DL networks.
  - Challenges in handling negative data (often classified as unknown or randomly sampled); bias towards higher-volume classes in imbalanced datasets.
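One common mitigation for the imbalanced-dataset bottleneck noted above is random oversampling of the minority class; a minimal sketch (the function name and toy data are illustrative, not from the paper):

```python
import random

def random_oversample(samples, labels, seed=0):
    """Balance a binary dataset by duplicating minority-class samples."""
    rng = random.Random(seed)
    pos = [s for s, l in zip(samples, labels) if l == 1]
    neg = [s for s, l in zip(samples, labels) if l == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    min_label = 1 if minority is pos else 0
    # Duplicate randomly chosen minority samples until the classes match in size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = [(s, 1 - min_label) for s in majority] + \
               [(s, min_label) for s in minority + extra]
    rng.shuffle(balanced)
    seqs, labs = zip(*balanced)
    return list(seqs), list(labs)
```

Oversampling only duplicates information already present, so it complements, rather than replaces, the under-sampling and cost-sensitive learning strategies mentioned later in the review.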
4.2.3.2. Model Architecture Selection
Choosing the appropriate model architecture is crucial. The paper notes that deeper networks generally show better generalization for a similar number of parameters. Approaches include:
- Exploring different model architectures based on data types and molecular representations.
- Adjusting network depth, connectivity, neuron quantity, and neuron types.
- Designing multiple candidate machine learning models and using heuristic evaluation functions to quickly estimate performance and select the optimal architecture.

Current AI models often rely on static experimental data, neglecting the dynamic interactions of peptides and targets in natural environments. Simulating these dynamic processes (e.g., protein folding, ligand-receptor interactions at the atomic level) requires extensive computational resources, leading to a trade-off between efficiency and accuracy.
4.2.3.3. Training of Machine Learning Models
This involves two key sub-processes:
- Parameter Optimization: Iteratively updating the model's internal parameters (weights and biases) to minimize a loss function.
  - Optimization algorithms: Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), Adaptive Gradient (AdaGrad).
  - Loss functions: Mean Squared Error (MSE) for regression; cross-entropy (logarithmic) loss for classification.
- Hyperparameter Tuning: Adjusting external parameters that control the learning process, such as:
  - Activation function: Non-linear functions applied to neuron outputs.
  - Learning rate: Controls the step size during parameter updates.
  - Optimizer: The algorithm used for parameter optimization.
  - Epochs: The number of complete passes through the training dataset.
- Data Splitting: The dataset is divided into:
  - Training set: Used to train the model.
  - Validation set: Used to monitor model performance during training, adjust hyperparameters, and prevent overfitting.
  - Test set: Used for a final, unbiased evaluation of the model's performance on unseen data.
- Computational Resources: Deep learning models require substantial resources. Graphics Processing Units (GPUs) are highly efficient for large-scale parallel computing, significantly accelerating training for complex models and extensive datasets.
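The workflow above, splitting the data, iteratively updating parameters with a gradient-based optimizer to minimize a loss, then checking validation and test error, can be sketched on a toy regression problem. This uses plain full-batch gradient descent as a simplification of SGD; all data and hyperparameter values are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 1 plus small noise (illustrative only).
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

# Data splitting: 70% train / 15% validation / 15% test.
idx = rng.permutation(len(X))
tr, va, te = idx[:140], idx[140:170], idx[170:]

# Parameters (learned) and hyperparameters (chosen by hand here).
w, b = 0.0, 0.0
learning_rate, epochs = 0.1, 200

for _ in range(epochs):
    err = w * X[tr, 0] + b - y[tr]                    # training-set prediction error
    w -= learning_rate * 2 * np.mean(err * X[tr, 0])  # gradient of MSE loss w.r.t. w
    b -= learning_rate * 2 * np.mean(err)             # gradient of MSE loss w.r.t. b

val_mse = np.mean((w * X[va, 0] + b - y[va]) ** 2)   # monitored during tuning
test_mse = np.mean((w * X[te, 0] + b - y[te]) ** 2)  # final unbiased estimate
```

The learned `w` and `b` converge close to the true values 2 and 1; in practice the validation MSE would guide the choice of learning rate and number of epochs before the test set is touched once.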
4.2.4. Evaluation and Validation
After training, a comprehensive evaluation ensures the model performs as expected on unseen data.
- Evaluation Metrics:
  - For Classification Tasks:
    - Accuracy: Proportion of correctly predicted instances.
    - Precision: Proportion of true positive predictions among all positive predictions.
    - Recall (Sensitivity): Proportion of true positive predictions among all actual positive instances.
    - F1 score: Harmonic mean of precision and recall.
    - Receiver Operating Characteristic (ROC) curve: Plots the true positive rate against the false positive rate.
    - Area Under the ROC Curve (AUC): Measures the overall performance of a classifier across all possible classification thresholds.
  - For Regression Tasks: Quantify the deviation between predicted and actual values.
    - Mean Squared Error (MSE): Average of the squared differences between predicted and actual values.
    - Root Mean Squared Error (RMSE): Square root of the MSE.
    - Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values.
- Validation Methods: Ensure effectiveness and generalization on unseen data.
  - Cross-validation (k-fold): Divides the data into k folds; trains on k-1 folds and validates on the remaining fold, repeating k times.
  - Leave-one-out cross-validation: A special case of k-fold where k equals the number of samples.
  - Hold-out validation: A simple split into training and test sets.
  - Bootstrapping: A resampling technique.
- Comprehensive Validation: Beyond AI model evaluation, validation involves:
  - Preliminary screening: The AI model identifies high-activity candidates.
  - Advanced computational screening: Molecular docking, ligand-based virtual screening, and molecular dynamics simulations refine the selection.
  - Rigorous in vitro and in vivo experiments: Essential to confirm theoretical predictions and practical applicability.
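The k-fold cross-validation scheme mentioned above can be sketched as a plain index generator (the function name is illustrative):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Every sample appears in exactly one validation fold; the remaining
    k-1 folds form the training set for that round.
    """
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size
```

Setting `k = n_samples` recovers leave-one-out cross-validation as a special case, exactly as described above.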
5. Experimental Setup
This paper is a review and does not present its own experimental setup. Instead, it synthesizes the experimental approaches, datasets, evaluation metrics, and comparative analyses from the various studies it reviews on AI-driven FBP screening. The following points summarize the general practices observed across the reviewed literature.
5.1. Datasets
The datasets used in the reviewed studies vary significantly in source, scale, and characteristics, reflecting the specific bioactivity being investigated.
- Sources: FBPs are derived from diverse food proteins (e.g., milk, soy, fish, macroalgae, chia seeds, lactic acid bacteria, walnut protein). Data are collected from dedicated peptide databases (detailed in Table 1), literature reviews, and in-house experimental results.
- Characteristics:
  - Peptide sequences: The most common form of data, often represented in FASTA format.
  - Structural information: Sometimes includes secondary structures, physicochemical properties (e.g., molecular weight, hydrophobicity, charge), amino acid composition, dipeptide composition, and pseudo amino acid composition (PseAAC).
  - Labels: Binary classification (e.g., active/inactive, anti-inflammatory/non-anti-inflammatory) or regression targets (e.g., IC50 values for antihypertensive peptides).
- Scale: The data size in reviewed studies ranges from a few hundred samples (e.g., 203 for taste peptides, 600 for bitterants, 499 for umami peptides) to several thousand (e.g., 4194-5265 for anti-inflammatory, 6989-42,213 for antimicrobial, 1338-2120 for antioxidant, 1020-3429 for anti-hypertensive).
- Challenges: The paper frequently highlights small data size, ambiguity in the "inactive" classification for negative data, and class imbalance as significant challenges across FBP screening tasks. Many studies generate negative samples randomly, which can introduce noise.
- Example Data Sample: A data sample typically consists of a peptide sequence (e.g., "ALAVAL"), sometimes with associated physicochemical properties (e.g., molecular weight, hydrophobicity values), and a label indicating its bioactivity (e.g., 1 for active, 0 for inactive, or an IC50 value).
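A minimal sketch of such a labeled data sample and of reading sequences from FASTA format (the class name, field names, and parser are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class PeptideRecord:
    sequence: str  # e.g. "ALAVAL"
    label: int     # 1 = active, 0 = inactive (a regression task would store an IC50 instead)

def parse_fasta(text):
    """Minimal FASTA parser returning {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):          # header line starts a new record
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:                        # sequence lines may wrap across lines
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

For example, `parse_fasta(">p1\nALA\nVAL\n>p2\nGGG")` joins the wrapped lines into `{"p1": "ALAVAL", "p2": "GGG"}`.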
5.2. Evaluation Metrics
For the classification tasks predominant in FBP screening, a range of standard metrics are employed to assess model performance. For regression tasks, metrics quantify prediction accuracy.
- Accuracy
  - Conceptual Definition: Accuracy measures the proportion of total predictions that were correct. It is a straightforward indicator of overall correctness.
  - Mathematical Formula: $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
  - Symbol Explanation:
    - TP: True Positives (correctly predicted positive instances).
    - TN: True Negatives (correctly predicted negative instances).
    - FP: False Positives (incorrectly predicted positive instances, type I error).
    - FN: False Negatives (incorrectly predicted negative instances, type II error).
- Precision
  - Conceptual Definition: Precision measures the proportion of positive identifications that were actually correct. It focuses on the quality of positive predictions, answering: "Of all items the model labeled as positive, how many are actually positive?"
  - Mathematical Formula: $ \text{Precision} = \frac{TP}{TP + FP} $
  - Symbol Explanation:
    - TP: True Positives.
    - FP: False Positives.
- Recall (Sensitivity)
  - Conceptual Definition: Recall measures the proportion of actual positives that were identified correctly. It focuses on the completeness of positive predictions, answering: "Of all actual positive items, how many did the model correctly identify?"
  - Mathematical Formula: $ \text{Recall} = \frac{TP}{TP + FN} $
  - Symbol Explanation:
    - TP: True Positives.
    - FN: False Negatives.
- F1 Score
  - Conceptual Definition: The F1 score is the harmonic mean of precision and recall. It provides a single score that balances both metrics, which is particularly useful when the class distribution is uneven.
  - Mathematical Formula: $ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
  - Symbol Explanation:
    - Precision: The precision score.
    - Recall: The recall score.
- Receiver Operating Characteristic (ROC) Curve
  - Conceptual Definition: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
  - Mathematical Formulas (for points on the curve):
    - $ \text{TPR} = \text{Recall} = \frac{TP}{TP + FN} $
    - $ \text{FPR} = \frac{FP}{FP + TN} $
  - Symbol Explanation:
    - TP, TN, FP, FN: As defined above.
- Area Under the ROC Curve (AUC)
  - Conceptual Definition: The AUC represents the degree of separability between classes, measured as the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). A higher AUC indicates better performance in distinguishing between positive and negative classes.
  - Mathematical Formula: There is no simple closed-form formula; the AUC is typically calculated numerically as the integral of the ROC curve. Conceptually, it is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance.
  - Symbol Explanation: N/A (it is an area computed from the ROC curve).
- Mean Squared Error (MSE)
  - Conceptual Definition: MSE is a common regression metric that measures the average of the squared differences between the predicted and actual values. It penalizes larger errors more heavily.
  - Mathematical Formula: $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
  - Symbol Explanation:
    - $n$: The number of data points.
    - $Y_i$: The actual (observed) value for the $i$-th data point.
    - $\hat{Y}_i$: The predicted value for the $i$-th data point.
- Root Mean Squared Error (RMSE)
  - Conceptual Definition: RMSE is the square root of the MSE. It is a frequently used measure of the differences between predicted and observed values, and because it has the same units as the target variable, it is more interpretable than MSE.
  - Mathematical Formula: $ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} $
  - Symbol Explanation:
    - $n$: The number of data points.
    - $Y_i$: The actual (observed) value for the $i$-th data point.
    - $\hat{Y}_i$: The predicted value for the $i$-th data point.
- Mean Absolute Error (MAE)
  - Conceptual Definition: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between prediction and actual observation, with all individual differences weighted equally.
  - Mathematical Formula: $ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i| $
  - Symbol Explanation:
    - $n$: The number of data points.
    - $Y_i$: The actual (observed) value for the $i$-th data point.
    - $\hat{Y}_i$: The predicted value for the $i$-th data point.
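The classification and regression metrics defined above can be computed directly from the confusion counts and prediction errors; a minimal sketch with illustrative function names:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and MAE from paired actual/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    rmse = mse ** 0.5
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, rmse, mae
```

For example, with TP = 8, TN = 5, FP = 2, FN = 2, precision and recall are both 0.8, so the F1 score is also 0.8 while accuracy is 13/17.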
5.3. Baselines
As a review, the paper compares the performance of various AI models against each other rather than defining a single "baseline" in the traditional sense for its own method. The "baselines" are effectively the different machine learning algorithms and feature representation methods employed across the literature.
- Traditional Machine Learning: Support Vector Machines (SVMs), Random Forests (RFs), Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors (KNN), and Gradient Boosting. These serve as comparative methods against which deep learning approaches are often evaluated.
- Deep Learning Architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs, including LSTM and GRU), Graph Neural Networks (GNNs), and Transformers (e.g., BERT, ProtBERT). Studies often compare different DL architectures or hybrid DL models.
- Feature Representations: Comparisons are also made between different molecular feature representation techniques (e.g., sequence-based, physicochemical, structural, image-based, graph-based), highlighting how the choice of representation impacts model performance.
- Ensemble Methods: Some studies employ ensemble models (e.g., voting classifiers combining multiple ML models) as a form of "baseline" for improved robustness.

The paper implicitly shows that deep learning approaches, by automatically learning complex features, tend to hold predictive advantages over traditional machine learning methods, which rely more on meticulously handcrafted features.
6. Results & Analysis
6.1. Core Results Analysis
The paper provides a comprehensive overview of the application of AI methods for screening FBPs with various bioactivities. The general trend observed across these applications is a shift towards deep learning models, which often demonstrate superior predictive capabilities compared to traditional machine learning methods, especially when dealing with complex data and relationships. However, significant challenges, particularly related to data quality and quantity, persist across all bioactivity types.
- Anti-inflammatory Peptides (AIPs):
  - Most AI-based screening for AIPs has focused on nonspecific anti-inflammatory peptides, using random forest classifiers with sequence and structural features.
  - The paper notes a lack of research on specific anti-inflammatory peptides targeting different inflammatory targets.
  - A critical observation is the reliance on manually selected amino acid sequence features and traditional ML algorithms, suggesting untapped potential for deep learning to automatically extract features and improve accuracy.
  - A significant gap is the lack of experimental validation for most proposed AI prediction methods for AIPs.
- Antimicrobial Peptides (AMPs):
  - AI has been used to screen AMPs from diverse food sources (e.g., shrimp, seaweed, chia seeds, lactic acid bacteria) using random forests, artificial neural networks, and graph convolutional neural networks.
  - Challenges include overlooking molecular characteristics associated with the low sequence homology of AMPs, and cumulative prediction error in stacked models.
  - Many ML screening methods for AMPs also lack experimental validation beyond the AI predictions.
- Antioxidant Peptides:
  - Studies have developed binary classifiers using various traditional ML algorithms (e.g., logistic regression, SVM, KNN) and CNNs, with peptide sequences and pseudo amino acid composition as features.
  - Training sample sizes are generally limited (around 2,000), necessitating data augmentation and transfer learning.
  - Research often focuses on peptide sequences, neglecting deeper structural and physicochemical properties that significantly influence activity (e.g., specific amino acids).
  - Interpretability of ML models is highlighted as important for understanding interaction mechanisms.
- Taste Peptides (Umami, Bitter):
  - AI-based screening primarily targets umami and bitter peptides, using deep learning (e.g., multi-layer perceptron, RNN, CNN) and gradient boosting random forest.
  - A major issue is insufficient sample size (often in the hundreds), which makes deep learning models prone to overfitting.
  - Data augmentation (e.g., generative adversarial networks) and regularization techniques are suggested to address data scarcity and overfitting.
  - Research on other taste types (sour, sweet, salty) is limited.
- Antihypertensive Peptides:
  - AI methods, including regression decision trees and various deep learning models (BERT, ProtBERT, LSTM, RNN), have been applied to predict ACE-inhibitory activity or IC50 values.
  - A common problem is the random generation of negative samples, which introduces noise and reduces predictive accuracy.
  - Representing peptide molecular sequence features with single numerical values may oversimplify their complexity.
  - Small dataset sizes and a lack of biological experimental validation are recurring issues.
- Other Bioactivities (Hypoglycemic, Anticancer, Neuroactive, Muscle Synthesis, Multifunctional):
  - AI has shown promise in identifying peptides for diverse applications, demonstrating its broad potential.
  - However, anti-obesity and anti-fatigue peptides are significantly under-researched, indicating a nascent stage for AI-driven screening in these areas.

Overall, deep learning exhibits clear advantages, but the field is bottlenecked by data limitations (small size, imbalance, poor negative-sample quality), suboptimal feature representations (over-reliance on sequences, lack of multi-scale integration), and insufficient experimental validation.
6.2. Data Presentation (Tables)
The following are the results from Table 3 of the original paper:
| Bioactivity | Type of problem | Data size | Molecular representations | Machine learning models | Validation experiments | Website | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anti-inflammatory | Binary classification | 4194 | Peptide sequence and structural information | Random forests | / | http://kurata14.bio.kyutech.ac.jp/PreAIP/ | Khatun et al. (2019) |
| | | 4620 | 3 peptide sequence features | Random forests | / | / | Zhang et al. (2021) |
| | | 2748 | 3 feature encodings | Random forests | / | / | Zhao et al. (2021) |
| | | 5265 | 8 sequence features | Integrated model | / | https://github.com/Mir-Saima/IF-AIP | Gaffar et al. (2024) |
| Antimicrobial | Multi-label classification | 6989 | 8 types of physicochemical properties and AAC | Random forests | / | / | / |
| | Binary classification | 42,213 | Protein sequences | Ensemble of artificial neural networks and random forests | / | / | Caprani et al. (2021) |
| | | 1067 | Peptide sequences | Integrated model | Vitro experiments | https://cbbio.online/AxPEP/ | León Madrazo and Segura Campos (2022) |
| | | 3244 | Initial graph obtained from peptide sequences | Graph convolutional networks | / | http://www.dong-group.cn/database/dlabamp/Prediction/amplab/result/ | Sun et al. (2022) |
| Antioxidant | Binary classification | 1338 | Peptide sequences with PseAAC encoding | Logistic regression, linear discriminant analysis, support vector machine, and k-nearest neighbors | / | https://doi.org/10.1016/j.foodcont.2021.108439 | Shen et al. (2022) |
| | | 1404 | Peptide sequences with one-hot encoding | Convolutional neural networks | Vitro experiments | http://services.bioinformatics.dtu.dk/service.php?AnOxPePred-1.0 | Olsen et al. (2020) |
| | | 2120 | Peptide sequences | Long short-term memory | / | http://www.cqudfbp.net/AnOxPP/index.jsp | Qin et al. (2023) |
| | | 564 | Peptide sequence | Ensemble model with support vector machine, random forests, k-nearest neighbors, and logistic regression | Vitro experiments | / | García et al. (2022) |
| Taste | Binary classification of umami | 499 | 6 feature representations | Merged model of multi-layer perceptron and recurrent neural networks | / | https://umami-mrnn.herokuapp.com/ | Qi et al. (2023) |
| | | 203 | 8 molecular descriptors | Gradient boosting and random forests | Sensory experiments | https://pypi.org/project/Auto-Taste-ML | Cui et al. (2023) |
| | Binary classification of bitterants | 600 | Molecular weight, surface hydrophobicity, and relative hydrophobicity | Support vector machine, linear regression, adaptive boosting, and k-nearest neighbors | Sensory experiments | https://doi.org/10.1016/j.foodres.2022.110974 | Yolandani et al. |
| | Binary classification of bitterants and sweeteners | 2233 bitter, 2366 sweet | MLP: molecular descriptors and fingerprints; CNN: 2D image | Convolutional neural networks, multi-layer perceptron | / | http://hazralab.iitr.ac.in/ahpp/index.php | / |
| Anti-hypertensive | Regression-based binary classification | 1587 | PseAAC for peptide structural and sequence features | Regression decision tree | Molecular docking and vitro experiments | / | / |
| | Binary classification | 2277 | Protein sequences with PseAAC encoding | Four deep learning models including BERT, ProtBERT, long short-term memory, and recurrent neural networks | Molecular docking and vitro experiments | / | Zhang, Dai, Zhao et al. (2023) |
| | Regression prediction of the IC50 value | 3429 | Protein sequences | Long short-term memory networks | Vitro experiments | / | Liao et al. (2023) |
| | Binary classification | 1020 | ESM-2-based peptide embeddings | Logistic regression, random forests, support vector machine, k-nearest neighbors, and multi-layer perceptron | / | https://github.com/dzjxzyd/LM4ACE_webserver | / |
| Other bioactivities | Multiple classifications | 2544 | 22 sequence features | 8 notable machine learning models | / | https://balalab-skku.org/ADP-Fuse | / |
6.3. Ablation Studies / Parameter Analysis
The review paper does not present its own ablation studies or parameter analyses, as it is a synthesis of existing literature. However, it implicitly discusses aspects related to these concepts by highlighting:
- Impact of Feature Selection: The paper notes that non-deep learning models like random forests are "sensitive to feature selection" and often require "careful design of ... features, demanding significant domain expertise" (Imai et al., 2021). The choice of molecular feature representation thus acts akin to an ablation study or parameter analysis on the input features.
- Model Complexity vs. Data Size: The paper frequently points out the issue of overfitting in deep learning models when sample sizes are limited, indicating that network depth and complexity (model parameters) are crucial hyperparameters that must be tuned relative to the available data.
- Limitations of Local vs. Global Features: When discussing CNNs, the paper notes their focus on local features and limited capacity to learn global features of molecular interactions, constraining generalization. This highlights how architectural choices (e.g., convolutional filters) affect a model's ability to capture comprehensive information.
- Role of Specific Amino Acids: For antioxidant peptides, the paper suggests that the "proportion and position of specific amino acids (e.g., sulfur-containing amino acids, aromatic amino acids) ... should be thoroughly considered during data annotation and feature extraction," implying that more granular feature engineering and attention to specific physicochemical properties are open areas for feature-importance analysis.
- Negative Data Generation Strategy: The paper repeatedly criticizes the random generation of negative samples in antihypertensive peptide screening as introducing noise and reducing predictive accuracy. The strategy for negative data generation is thus a critical "parameter" that heavily influences model performance, much like an ablation study on the composition of the training data.

While the paper does not present explicit ablation study tables or hyperparameter tuning curves, its critical discussion of current challenges and limitations serves to illuminate how different choices of data, features, and model architectures affect the success of AI-driven FBP screening.
7. Conclusion & Reflections
7.1. Conclusion Summary
This review effectively highlights Artificial Intelligence (AI) as a transformative technology for the high-throughput screening and analysis of Food-Derived Bioactive Peptides (FBPs). It systematically deconstructs the AI-driven screening process, from data foundation and molecular feature representation to model construction, training, evaluation, and validation. The paper concludes that while Deep Learning (DL) models generally offer superior predictive advantages over traditional Machine Learning (ML) techniques, the field still faces significant challenges. Notable progress has been made in identifying FBPs with anti-inflammatory, antimicrobial, antioxidant, flavor-enhancing, and hypotensive properties, but research on anti-obesity and anti-fatigue peptides is nascent. The review serves as a valuable resource, consolidating current advancements and charting a clear course for future research to accelerate the discovery and application of FBPs.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations in the current AI-driven FBP screening landscape and propose concrete future directions:
- Data Limitations:
  - Small data size: Need for new data augmentation methods (e.g., generative models, cross-distillation within large models).
  - Ambiguity in "inactive" classification for negative data: Need to build dedicated inactive peptide databases through synthetic data generation and transfer learning.
  - Class imbalance: Requires techniques like oversampling, under-sampling, and cost-sensitive learning.
- Molecular Representation: Current methods are often limited to peptide sequence representations, restricting the learning of complex atomic interactions. Future work should focus on multimodal molecular representations and multi-scale data features across chemical spaces.
- Algorithm Models: Uniformity and traditional classification algorithms are common. Future trends involve integrating multiple deep learning models and developing general AI model frameworks (e.g., deep generative models, diffusion generative models).
  - Improved interpretability and robustness of ML models are crucial, especially with the integration of AI with biology and chemistry knowledge.
- Training Efficiency: Current reliance on traditional methods like cross-validation and early stopping leads to low efficiency. Future methods include knowledge distillation, fine-tuning of pretrained models, and generative training.
- Dynamic Interactions: Current AI models predict static atomic interactions. Future research must explore methods to predict dynamic interactions between peptides and targets in solution for a better understanding of activity mechanisms and targeted delivery.
- Oversimplification and Scope: Problems are often oversimplified into binary classifications. More research is needed on multifunctional, anti-obesity, and anti-fatigue peptides.
- Lack of Biological Validation: Many machine learning methods lack biological experimental validation. A high-throughput screening framework combining AI with virtual screening, molecular dynamics simulations, in vitro testing, and in vivo experiments is needed.
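The class-imbalance remedies named above (oversampling and cost-sensitive learning) can be sketched in a few lines of dependency-free Python. This is a generic illustration, not the review's implementation; the balanced-weight heuristic shown is the common `n_samples / (n_classes * n_c)` rule, and all names are assumptions.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Cost-sensitive learning: weight each class inversely to its
    frequency, so scarce 'active' peptides count more in the loss.
    Uses the common heuristic w_c = n_samples / (n_classes * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

def random_oversample(samples, labels, rng):
    """Naive oversampling: duplicate minority-class samples at random
    until every class matches the majority-class count."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_s += group + extra
        out_y += [y] * target
    return out_s, out_y

labels = [1] * 10 + [0] * 90   # e.g., 10 active vs 90 inactive peptides
print(balanced_class_weights(labels))  # minority class weighted far higher
```

Cost-sensitive weights leave the data untouched and adjust the loss, whereas oversampling rebalances the data itself; under-sampling (discarding majority examples) is the third option the review lists and follows the same pattern.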
7.3. Personal Insights & Critique
This paper provides an excellent, structured overview of a rapidly evolving field. Its rigorous breakdown of the AI-driven screening process and detailed discussion of various ML/DL architectures in the context of FBP discovery is highly valuable for both beginners and experienced researchers. The emphasis on data-related challenges (scarcity, imbalance, negative samples) is particularly insightful, as data quality often forms the bottleneck in AI applications.
One key inspiration drawn is the potential for AI to accelerate discovery in areas where traditional methods are prohibitively slow or expensive. The idea of food-specific large models and universal deep learning frameworks for multi-scale chemical space features is ambitious but compelling, hinting at a future where AI can not only predict but also design novel FBPs. The call for predicting dynamic peptide-target interactions is also critical, moving beyond static snapshots to capture the complex biological reality.
However, a potential area for further emphasis, even in a review, could be on the ethical implications and regulatory pathways for AI-discovered FBPs. While AI promises greener and safer options, the journey from in silico prediction to market-approved functional food ingredients is complex, involving rigorous safety assessments and regulatory hurdles. How AI can aid or complicate these processes could be a fascinating future research direction.
Additionally, while deep learning is praised for its advantages, the interpretability issue remains a significant challenge, especially in biological and health-related fields where understanding the "why" behind a prediction is crucial for trust and further scientific exploration. The paper mentions interpretability as a future research direction, which is important. Perhaps future reviews could delve deeper into emerging explainable AI (XAI) techniques applied specifically to peptide-protein interactions or FBP activity prediction.
Overall, the paper is a timely and comprehensive guide, effectively balancing a summary of current achievements with a critical look at the road ahead, making it a pivotal contribution to the field of AI in food science.