
Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects

Published: 12/13/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This review summarizes AI-driven high-throughput screening of food-derived bioactive peptides, highlighting advances in deep learning models for identifying functional peptides and emphasizing future directions including multi-scale feature frameworks and screening methodologies.

Abstract

Trends in Food Science & Technology 156 (2025) 104845. Available online 13 December 2024. 0924-2244/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects

Jingru Chang (e), Haitao Wang (a,b,c,d), Wentao Su (a,b,c,d), Xiaoyang He (e,**), Mingqian Tan (a,b,c,d,*)

a State Key Laboratory of Marine Food Processing and Safety Control, Dalian Polytechnic University, Dalian, 116034, Liaoning, China
b Academy of Food Interdisciplinary Science, School of Food Science and Technology, Dalian Polytechnic University, Dalian, 116034, Liaoning, China
c National Engineering Research Center of Seafood, Dalian Polytechnic University, Dalian, 116034, Liaoning, China
d Dalian Key Laboratory for Precision Nutrition, Dalian Polytechnic University, Dalian, 116034, Liaoning, China
e School of Information Science and Engineering, Dalian Polytechnic University, Dalian, 116034, Liaoning, China

Article Info: Handling Editor: Dr. S Charlebois. Keywords: Artificial intelligence; Food-derived bioactive peptides; Mac

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Artificial intelligence in food bioactive peptides screening: Recent advances and future prospects

1.2. Authors

Jingru Chang (e), Haitao Wang (a,b,c,d), Wentao Su (a,b,c,d), Xiaoyang He (e), and Mingqian Tan (a,b,c,d). Affiliations are primarily with Dalian Polytechnic University, Dalian, China, including the National Engineering Research Center of Seafood.

1.3. Journal/Conference

The paper is published in Trends in Food Science & Technology (Elsevier). This is a highly reputable journal in the field of food science, known for publishing authoritative reviews on emerging trends in food processing, safety, and nutrition. Its influence is significant among food scientists and researchers.

1.4. Publication Year

2024

1.5. Abstract

This paper reviews the application of artificial intelligence (AI) in the screening of food-derived bioactive peptides (FBPs). It highlights that traditional experimental methods for FBP identification are often laborious, time-consuming, and costly, while conventional computational approaches like virtual screening and molecular dynamics simulations have inherent limitations. AI technology offers a high-throughput solution for screening and analyzing FBP activity mechanisms, promising to advance FBP development and application. The review outlines the general AI screening process, covering data foundation, molecular feature representation, machine learning (ML) and deep learning (DL) model construction and training, and evaluation/validation. It summarizes recent AI advancements in screening FBPs with various bioactivities (anti-inflammatory, antibacterial, antioxidant, flavor-enhancing, hypotensive), noting that research on anti-obesity and anti-fatigue peptides is still nascent. Key findings indicate that DL shows superior predictive advantages over traditional ML. However, challenges remain across different bioactivities. Future directions include developing data augmentation strategies within food-specific large models, creating a universal deep learning framework based on multi-scale chemical space features to predict peptide-target dynamic interactions, and establishing a high-throughput screening framework that also enhances AI methods for multi-functional properties like anti-obesity and anti-fatigue effects.

The publication status is officially published.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the inefficient and resource-intensive nature of identifying food-derived bioactive peptides (FBPs) through traditional experimental methods. FBPs are protein fragments from food sources that possess specific physiological regulatory effects and play a vital role in nutrition and health. They are characterized by low molecular weight, low toxicity, ease of absorption, high biological activity, and strong targeting capabilities. Their potential benefits, such as nutrient supplementation, immune enhancement, and overall health contribution, make them a fascinating area of research. However, the sheer combinatorial complexity of peptides (20^n possible sequences for a peptide of n amino acids, before even considering higher-level structures) makes exhaustive experimental screening impractical. While conventional computational methods like virtual screening and molecular dynamics (MD) simulations offer some streamlining, they are limited by the inherent flexibility of peptides, conformational changes during peptide-target binding, high computational demands, and limited information on peptide-protein complexes.
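The combinatorial argument above can be made concrete in a few lines of Python (the lengths shown are chosen arbitrarily for illustration):

```python
# Sketch: the linear peptide sequence space grows as 20**n, which is why
# exhaustive experimental screening quickly becomes infeasible.

def peptide_space_size(n: int) -> int:
    """Number of distinct linear peptides of length n built from the 20
    standard amino acids (primary sequence only, ignoring higher-order
    structure)."""
    return 20 ** n

for n in (2, 5, 10, 20):
    print(f"length {n:2d}: {peptide_space_size(n):,} candidate sequences")
```

Even at length 10 the space already exceeds 10^13 sequences, far beyond what wet-lab synthesis and assays can cover.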

The paper's entry point is the growing capability of Artificial Intelligence (AI) to overcome these limitations. AI, particularly machine learning (ML) and deep learning (DL), offers a promising pathway for high-throughput screening and analysis of activity mechanisms for FBPs. It aims to extract key features from known bioactive peptide datasets using advanced algorithms, allowing for rapid prediction of activity in large-scale, unknown FBPs, reducing costs, and minimizing human error.

2.2. Main Contributions / Findings

The paper provides a comprehensive review of the current state of AI-driven screening for FBPs, highlighting recent advancements and outlining future prospects. Its primary contributions are:

  1. Outline of AI Screening Process: It systematically describes the general process of AI screening for FBPs, covering data foundation, molecular feature representation, ML/DL model construction and training, and evaluation and validation. This structured overview serves as a guide for researchers.

  2. Summary of Recent Advances: It summarizes recent research progress in AI screening of FBPs across various bioactivities, including anti-inflammatory, antimicrobial, antioxidant, flavor-enhancing, and hypotensive properties. This consolidates diverse research efforts and identifies areas of strength.

  3. Identification of Gaps: It critically discusses current key issues and challenges in the field, such as limited data size, ambiguity in negative data, class imbalance, reliance on sequence-based representations, and insufficient research on certain bioactivities (e.g., anti-obesity and anti-fatigue peptides).

  4. Future Research Directions: It proposes concrete future research directions and trends, including the development of data augmentation strategies within food-specific large models, creation of a universal deep learning framework for multi-scale chemical space features, prediction of peptide-target dynamic interactions, and establishment of a high-throughput screening framework for multifunctional properties.

    Key findings include:

  • Deep learning models generally demonstrate clear predictive advantages over traditional machine learning techniques for FBP screening.
  • Significant advancements have been made in identifying FBPs with properties like anti-inflammatory, antibacterial, antioxidant, flavor-enhancing, and hypotensive effects.
  • Research on anti-obesity and anti-fatigue peptides using AI is still in its nascent stages, presenting a significant area for future exploration.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several concepts from computer science, biology, and chemistry is essential.

  • Food-Derived Bioactive Peptides (FBPs): These are specific protein fragments, typically 2 to 20 amino acids long, released from food proteins (e.g., milk, soy, fish) through enzymatic hydrolysis or fermentation. They exert beneficial physiological effects in the body beyond basic nutrition, such as anti-inflammatory, antioxidant, antimicrobial, or antihypertensive activities. Their efficacy is often dependent on their amino acid sequence, length, and three-dimensional structure.

  • Artificial Intelligence (AI): A broad field of computer science dedicated to creating systems that can perform tasks normally requiring human intelligence. These tasks include learning, reasoning, problem-solving, perception, and language understanding. The paper focuses on AI for high-throughput screening and analysis.

  • Machine Learning (ML): A subset of AI that enables systems to learn from data without being explicitly programmed. ML algorithms build a model based on sample data, known as "training data," to make predictions or decisions without being explicitly programmed to perform the task.

    • Supervised Learning: The primary ML paradigm used in this paper. It involves training a model on a labeled dataset, where each input example is paired with an output label. The goal is for the model to learn a mapping from inputs to outputs, allowing it to predict labels for new, unseen data. Examples include classification (predicting a categorical label, e.g., active/inactive peptide) and regression (predicting a continuous value, e.g., binding affinity).
    • Unsupervised Learning: Focuses on finding patterns or structures in unlabeled data. Not a primary focus of this paper for FBP screening.
    • Reinforcement Learning: Involves an agent learning to make decisions by performing actions in an environment to maximize a reward. Not a primary focus of this paper for FBP screening.
  • Deep Learning (DL): A subfield of ML that uses artificial neural networks (ANNs) with multiple layers (hence "deep") to learn complex patterns from data. DL models can automatically learn hierarchical feature representations from raw input, eliminating the need for manual feature engineering in many cases.

    • Neural Networks: Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight, and neurons have activation functions.
    • Convolutional Neural Networks (CNNs): Primarily used for image processing but also applicable to sequential data. They use convolutional layers to detect local patterns and pooling layers to reduce dimensionality.
    • Recurrent Neural Networks (RNNs): Designed to process sequential data (like amino acid sequences). They have internal memory that allows them to maintain information about previous inputs in the sequence. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are advanced RNN architectures designed to overcome vanishing/exploding gradient problems and capture long-term dependencies.
    • Graph Neural Networks (GNNs): Designed to operate on graph-structured data (e.g., molecular structures where atoms are nodes and bonds are edges). They learn node representations by aggregating information from neighboring nodes.
    • Transformers: A DL model that relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence. They excel at capturing long-range dependencies and allow for parallel processing, unlike RNNs.
    • Diffusion Models: A class of generative models that learn to generate new data samples by reversing a gradual diffusion process that adds noise to data. They have shown impressive results in image and biomolecular structure generation.
  • Molecular Feature Representation: The process of converting raw molecular data (e.g., amino acid sequences, chemical structures) into a numerical format (vectors or matrices) that ML/DL models can understand and process. This is a critical step as the quality of representation directly impacts model performance.

    • Sequence-intrinsic methods: Focus on the amino acid sequence itself (e.g., amino acid composition, dipeptide composition, pseudo amino acid composition (PseAAC)).
    • Physicochemical methods: Incorporate properties like charge, hydrophobicity, molecular weight.
    • Structural properties methods: Analyze three-dimensional conformation (e.g., secondary structure, solvent accessibility).
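As a concrete illustration of the sequence-intrinsic methods above, a minimal amino acid composition (AAC) encoder can be sketched in a few lines of Python (the example peptide is invented for illustration and is not from the paper):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Amino acid composition: fraction of each of the 20 residues,
    yielding a fixed 20-dimensional vector regardless of peptide length."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

vec = aac("GLSDGEWQLV")  # a made-up 10-residue peptide
print(len(vec), round(sum(vec), 6))  # 20 1.0
```

Because the output dimension is fixed at 20, peptides of any length map to comparable vectors, which is what makes AAC a convenient baseline input for ML models.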

3.2. Previous Works

The paper references several foundational AI achievements and applications, which serve as benchmarks for AI's potential:

  • AlphaGo (Ma et al., 2024; Silver et al., 2016): DeepMind's AI program that defeated human Go champions, showcasing AI's capability in complex strategic reasoning, primarily using deep reinforcement learning.

  • ChatGPT (Schulman et al., 2022): OpenAI's large language model (LLM) demonstrating advanced natural language understanding and generation, powered by transformer architectures.

  • SORA (Liu et al., 2024): OpenAI's text-to-video generative AI, illustrating progress in AI's ability to create complex, dynamic content.

  • AlphaFold 3 (Brooks et al., 2024): DeepMind's breakthrough in protein structure prediction, capable of predicting the structures of protein-ligand interactions and biomolecular complexes with high accuracy, leveraging diffusion models and deep learning. This is particularly relevant as FBP activity often depends on peptide-protein interactions.

    In the context of FBPs and related bioactive compounds:

  • Reviews on ML applications in food bioactive compounds (Doherty et al., 2021; Kussmann, 2022; Zhang, Zhang, Freddolino, & Zhang, 2024) and bioinformatics tools for active peptides (Du, Comer, & Li, 2023; Rivero-Pino, Millán-Linares, & Montserrat-de-la-Paz, 2023) have been published. These works establish the broader context for using computational methods in this domain.

    The paper points out a notable gap in comprehensive reviews specifically addressing the application of AI for screening FBPs in recent times. This indicates that while general AI applications in food science and ML in bioactive compounds have been reviewed, a focused synthesis on AI for FBP screening specifically was missing, which this paper aims to fill.

3.3. Technological Evolution

The evolution of technology in FBP screening has moved from:

  1. Traditional Experimental Methods: Labor-intensive, time-consuming, and costly due to the need for synthesizing and testing numerous peptides. While reliable, they limit throughput.
  2. Conventional Computational Approaches:
    • Virtual screening: Uses computational methods to rapidly screen large libraries of molecules for potential activity, but often lacks precision and struggles with conformational changes.
    • Molecular dynamics simulations: Provide insights into molecular interactions and dynamics at an atomic level but are computationally intensive and limited by simulation time scales and the availability of peptide-protein complex information.
  3. Artificial Intelligence (AI): The current frontier, offering high-throughput, cost-effective, and less error-prone screening.
    • Non-deep learning ML: Early applications used models like Support Vector Machines (SVMs) and Random Forests (RFs) with meticulously crafted features based on domain knowledge. These were effective for high-dimensional data but limited by the quality of feature engineering.

    • Deep Learning (DL): The latest generation, capable of automatically learning complex molecular features from raw data, leading to higher predictive accuracy and generalization. This includes CNNs, RNNs, GNNs, Transformers, and Diffusion Models, each suited for different data types and complexities.

      This paper's work fits within the leading edge of this technological timeline, emphasizing the transition towards and potential of DL for FBP screening, while also highlighting the remaining challenges.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach (as a review) lie in its comprehensive and forward-looking synthesis of AI specifically for FBP screening, rather than proposing a new method itself.

  • Focus: While previous reviews covered AI in broader food industry applications or ML for bioactive compounds generally, this paper specifically targets the screening of food-derived bioactive peptides. This narrow focus allows for a deeper dive into the unique challenges and opportunities in this specific domain.
  • Structured Process: It provides a systematic, multi-step AI-driven screening process (data, representation, model, evaluation), which is a clear framework for researchers.
  • Detailed Algorithmic Comparison: It provides a structured comparison of various ML and DL algorithms used in FBP screening, outlining their advantages and limitations in this specific context.
  • Identification of Gaps in Specific Bioactivities: It specifically points out the under-researched areas like anti-obesity and anti-fatigue peptides, guiding future research.
  • Forward-looking Perspective: The paper goes beyond summarizing existing work by critically analyzing current bottlenecks (data scarcity, representation, interpretability) and proposing concrete future directions (food-specific large models, multi-scale features, dynamic interactions, high-throughput frameworks). This differentiates it from reviews that merely catalog existing applications.

4. Methodology

The paper describes the general process of AI-driven screening for FBPs as a supervised learning problem. This process typically involves four key interconnected steps: data foundation, molecular feature representation, model construction and training, and evaluation and validation.

4.1. Principles

The core idea behind AI-driven screening of FBPs is to leverage computational intelligence to overcome the limitations of traditional experimental and computational methods in identifying and characterizing bioactive peptides. The theoretical basis is rooted in machine learning principles, specifically supervised learning, where models learn from known bioactive peptide datasets to predict the properties of unknown peptides.

The intuition is that if an AI model can learn the complex relationships between a peptide's molecular structure (or sequence) and its biological activity, it can then rapidly and cost-effectively predict the activity of new, untried peptides. This approach aims to:

  1. Extract features: Identify patterns and key features from known bioactive peptide datasets that correlate with specific bioactivities.

  2. Model building: Construct ML or DL models capable of capturing these complex relationships.

  3. Prediction: Apply the trained models to screen large libraries of potential FBPs and predict their bioactivity.

  4. Efficiency: Reduce the labor-intensive, time-consuming, and costly nature of traditional experimental screening.

  5. Accuracy: Improve upon the prediction accuracy of conventional computational methods by learning from diverse data and modeling intricate interactions.

    The process typically involves using classification models for binary predictions (e.g., active/inactive) based on labeled positive and negative samples, or regression models to predict molecular binding affinity or IC50 values.
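To make the supervised setup concrete, the following toy sketch trains a nearest-centroid classifier on invented two-dimensional peptide features (say, normalized charge and hydrophobicity). It stands in for the far richer ML/DL models the review surveys; none of the features, labels, or thresholds here come from the paper:

```python
# Toy supervised classification: learn one centroid per class from labeled
# examples, then label new points by the nearest centroid.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(features, labels):
    """Group examples by label and compute one centroid per class."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    return {y: centroid(pts) for y, pts in by_class.items()}

def predict(model, x):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda y: dist(model[y], x))

X = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]   # invented feature vectors
y = ["active", "active", "inactive", "inactive"]        # binary activity labels
model = train(X, y)
print(predict(model, (0.85, 0.75)))  # active
```

The same train/predict split applies whether the model is this toy classifier or a deep network; only the feature representation and the function family change.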

4.2. Core Methodology In-depth (Layer by Layer)

The overall AI-driven screening process for FBPs is illustrated in Figure 2 of the original paper.

Fig. 2. AI-driven screening process for FBPs including the steps of data foundation, molecular feature representation, construction of machine learning models, and model evaluation.

The AI-driven screening process for FBPs involves several interconnected stages: data foundation, molecular feature representation, construction of machine learning models, and model evaluation.

4.2.1. Data Foundation

Data is the cornerstone of any machine learning approach. For FBP screening, high-quality, diverse, and sufficient data are critical for model performance. The paper highlights that protein-peptide interactions account for about 40% of protein-ligand interactions, making target-based peptide research central.

  • Training Datasets: For supervised learning, datasets include:

    • Peptides: Sequences, structures, and associated activities.
    • Target proteins: The biomolecules with which peptides interact.
    • Protein-peptide complexes: Structural information about how peptides bind to proteins.
    • Active peptides: Peptides with known bioactivities.
  • Key Databases: The paper summarizes several important databases crucial for data foundation (Table 1), categorized by type:

    • Protein Structural Databases:
      • RCSB PDB: World's largest biological macromolecule structure database (X-ray, NMR, electron microscopy).
      • UniProt: Comprehensive database of protein-related information (Swiss-Prot for reviewed sequences, TrEMBL for unreviewed).
      • Pfam: Specializes in protein families and structural domains.
      • AlphaFoldDB: Provides over 200 million protein structure predictions using AI.
    • Peptide Structural Databases:
      • NORINE: Database of nonribosomal peptides.
      • FoldamerDB: Public database of peptidic foldamers.
      • ConjuPepDB: Database of drug-peptide conjugates.
      • StraPep: Collects active peptides with known structures.
      • DBAASP: Information on antimicrobial peptides (AMPs).
    • Protein-Peptide Complex Structural Databases:
      • PepBDB: Extensive information on biological peptide-mediated protein interaction.
      • PepX: Comprehensive dataset of protein-peptide complexes from PDB.
      • STRING: Provides information on protein-protein interactions.
      • BioLiP2: Updated structural database for biologically relevant ligand-protein interactions.
    • Bioactive Peptide Databases:
      • Food DB: Includes molecules found in food.

      • Coconut: Natural product database.

      • BioPep DB: Searchable database of FBPs.

      • BIOPEP-UWM: Searchable database of bioactive peptides, especially food-derived.

      • DFBP: FBPs database with food sources of protein.

      • Feptide DB: Collection of open-access bioactive peptide repositories.

      • SpirPep: Combination of published bioactive peptide databases.

      • CAMPR3: Comprehensive information on antimicrobial peptides.

      • DBAASP v3: Information on antimicrobial peptides.

      • NeuroPep B 2.0: Neuropeptide database.

      • MAMPs-Pred: Provides antimicrobial and non-antimicrobial peptides.

      • IF-AIP: Provides anti-inflammatory and non-anti-inflammatory peptides.

        The following are the results from Table 1 of the original paper:

        Name Description Website
        Protein Structural Databases
        RCSB PDB Currently the world's largest biological macromolecule structure database. As of June 15, 2024, a total of 194,259 protein 3D structures have been recorded using X-ray crystallography, NMR spectroscopy, and electron microscopy. https://www.rcsb.org/
        UniProt Currently the most comprehensive database of protein-related information. Contains Swiss-Prot with 571,609 manually reviewed protein sequences, TrEMBL with 244,910,918 unreviewed protein sequences and PIR with protein sequences. https://www.uniprot.org/
        Pfam A database specializes in providing complete classification information of protein families and structural domains, covers 21,979 protein families. http://pfam.xfam.org/
        AlphaFoldDB A protein structure prediction database built based on advanced AI technology. Provides over 200 million protein structure predictions. https://alphafold.ebi.ac.uk/
        Peptide Structural Databases
        NORINE The platform features a database of nonribosomal peptides equipped with analytical tools and houses over 1000 peptides. https://ngdc.cncb.ac.cn/databasecommons/database/id/1476
        FoldamerDB A public database of peptidic foldamers. http://foldamerdb.ttk.mt
        ConjuPepDB A public database of drug-peptide conjugates, containing 645 drug-peptide conjugates. https://conjupepdb.ttk.hu/
        StraPep A database dedicated to collecting all active peptides of known structure, containing 3791 bioactive peptide structures belonging to 1312 unique bioactive peptide sequences. http://isyslab.info/StraPep/
        DBAASP A database dedicated to information on antimicrobial peptides (AMPs), containing 21,426 peptides. https://www.dbaasp.org/home
        Protein-Peptide Complex Structural Databases
        PepBDB A database presents extensive information about biological peptide-mediated protein interaction. The current number of structures is 13,299. http://huanglab.phys.hust.edu.cn/pepbdb/
        PepX An extensive and comprehensive dataset includes all protein-peptide complexes available in the Protein Data Bank, with peptide lengths of up to 35 residues. This dataset encompasses 505 distinct protein-peptide interface clusters derived from 1431 complexes. https://ngdc.cncb.ac.cn/databasecommons/database/id/1240
        STRING A database provides the most comprehensive information on protein-protein interactions. As of August 2, 2024, it includes 332,075,812 interactions at highest confidence (score ≥0.900). https://cn.string-db.org/
        BioLiP2 An updated structural database focusing on biologically relevant ligand-protein interactions. As of June 15, 2024, it contains 37,492 entries for peptide ligands. https://zhanggroup.org/BioLiP2/index.cgi
        Bioactive Peptide Databases
        Food DB A database includes 70,926 molecules in the food. https://foodb.ca/
        Coconut A natural product database currently available with over 400,000 molecules. https://coconut.naturalproducts.net/
        BioPep DB A searchable database of FBPs that contains 4807 bioactive peptides. http://bis.zju.edu.cn/biopepdbr/index.php
        BIOPEP-UWM A searchable database of bioactive peptides, especially on these derived from foods and being constituents of diets. It contains 5047 bioactive peptides. https://biochemia.uwm.edu.pl/biopep-uwm/
        DFBP FBPs database currently contains 6818 bioactive peptides, 21,249 food sources of protein. http://www.cqudfbp.net/
        Feptide DB A collection of 12 open-access bioactive peptide repositories and peptides extracted from research publications to predict food-derived bioactive peptides. http://www4g.biotec.or.th/FeptideDB/
        SpirPep Combination of 13 published bioactive peptide databases, containing 28,334 unique bioactive peptide sequences for compare with putative peptide. http://spirpepapp.sbi.kmutt.ac.th/BioactivePeptideDB.html
        CAMPR3 A database provides comprehensive information on antimicrobial peptides, including 10,247 antimicrobial peptide sequences obtained through the analysis of 1386 sequences derived from experimental studies. http://www.camp3.bicnirrh.res.in/
        DBAASP v3 A database dedicates to information on antimicrobial peptides containing over 15,700 entries, which include more than 14,500 monomers and nearly 400 homo- and hetero-oligomers. http://dbaasp.org/
        NeuroPep B 2.0 A neuropeptide database holds 11,417 unique neuropeptide entries. https://isyslab.info/NeuroPepV2/
        MAMPs-Pred A database provides 6989 peptides consisting of antimicrobial and non-antimicrobial. https://github.com/JianyuanLin/SupplementaryData
        IF-AIP A database provides 5265 peptides with anti-inflammatory and non-anti-inflammatory. https://github.com/Mir-Saima/IF-AIP
  • Data Processing: After data collection, meticulous steps are required:

    • Data Cleaning: Removing errors, inconsistencies, or duplicates.
    • Annotation: Adding labels (e.g., active/inactive, specific activity type) to data points.
    • Normalization: Scaling numerical features to a standard range to prevent certain features from dominating the learning process.
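A minimal sketch of these processing steps on an invented toy table (the sequences and activity values are illustrative only, not data from the paper):

```python
# Toy peptide table: (sequence, active-label, invented activity value).
raw = [
    ("AWKLFK", 1, 120.0),
    ("AWKLFK", 1, 120.0),   # exact duplicate -> removed during cleaning
    ("GPPGPA", 0, 900.0),
    ("VLPVPQ", 1, 45.0),
]

# Cleaning: drop duplicate sequences while preserving order.
seen, cleaned = set(), []
for row in raw:
    if row[0] not in seen:
        seen.add(row[0])
        cleaned.append(row)

# Normalization: min-max scale the numeric column to [0, 1] so it cannot
# dominate features measured on other scales.
vals = [v for _, _, v in cleaned]
lo, hi = min(vals), max(vals)
normalized = [(s, label, (v - lo) / (hi - lo)) for s, label, v in cleaned]
print(normalized)
```

Annotation here is simply the second tuple element (the active/inactive label); in practice labels come from the experimental literature and the databases in Table 1.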

4.2.2. Molecular Feature Representation

This step transforms raw molecular data into a numerical format suitable for ML models. It involves feature selection and feature extraction/encoding.

  • Feature Selection: Commonly used features include:
    • Amino acid sequence: The primary structure of the peptide.
    • Structural information: Secondary (e.g., alpha-helix, beta-sheet) and tertiary structures (3D conformation).
    • Physicochemical properties: Hydrophobicity, charge, molecular weight, isoelectric point, etc.
  • Feature Extraction and Encoding: Converting amino acid sequences or structures into numerical vectors or matrix forms.
    • Sequence-based Representation:
      • Amino acid composition (AAC): Proportion of each amino acid.
      • Dipeptide composition: Proportion of each pair of adjacent amino acids.
      • Pseudo amino acid composition (PseAAC): Incorporates physicochemical properties and sequence arrangement.
      • One-hot encoding: Each amino acid converted to a unique binary vector.
      • Composition-transition-distribution descriptors (CTDD): Describes composition, transformation, and distribution of physicochemical properties.
      • Position-specific scoring matrix (PSSM): Represents evolutionary information.
      • Limitation: Requires learning features from scratch for new datasets, needs large training data.
    • Graph-based Representation:
      • Nodes represent amino acid atoms, edges represent covalent bonds.
      • Often used with Graph Neural Networks (GNNs).
      • Limitation: Most 2D ligand methods overlook 3D molecular structure and peptide-target interactions.
    • Image-based Representation:
      • Molecular images (e.g., protein sequences converted to images) used as input for DL models (e.g., CNNs).

      • Captures more detailed molecular structure information.

      • Limitation: Scarcity of high-quality, labeled image datasets; high computational demands.
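Two of the sequence-based encodings listed above, one-hot encoding and dipeptide composition, can be sketched directly in Python (the example peptide is invented for illustration):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """n x 20 binary matrix: row i marks the identity of residue i."""
    mat = []
    for aa in sequence:
        row = [0] * 20
        row[INDEX[aa]] = 1
        mat.append(row)
    return mat

def dipeptide_composition(sequence):
    """400-dim vector of adjacent-pair frequencies (order-sensitive)."""
    pairs = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    counts = {p: 0 for p in pairs}
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total = max(len(sequence) - 1, 1)
    return [counts[p] / total for p in pairs]

m = one_hot("AWKL")
d = dipeptide_composition("AWKL")
print(len(m), len(m[0]), len(d), round(sum(d), 6))  # 4 20 400 1.0
```

Note the trade-off the review describes: one-hot output grows with peptide length (n × 20), while dipeptide composition is fixed at 400 dimensions but discards positional detail beyond adjacent pairs.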

        The following are the results from Table 2 of the original paper:

        2. Molecular representation methods and tools.

        Based on intrinsic sequence properties

          • Amino acid composition (AAC): The protein sequence is transformed into a 20-dimensional vector that quantifies the relative abundance of each amino acid within the protein.
          • Dipeptide composition: Calculates the proportion of dipeptides, formed by the linkage of two specific amino acids, within the entire protein sequence. It captures not only the distribution of amino acids but also information about their local arrangement.
          • Normalized Moreau-Broto autocorrelation descriptors: Characterize the correlation between two proteins or peptide chains by analyzing specific structural features or physicochemical properties.
          • Moran autocorrelation: Uses Moran's index to describe the spatial autocorrelation of amino acid properties or features within a protein sequence.
          • Sequence-order-coupling: A metric for assessing the coupling between the amino acid sequence order in a protein and its three-dimensional structure. It captures both local and global information by considering interactions between each amino acid and its surrounding residues, quantifying features related to the relative positions and physicochemical differences between amino acids in a protein sequence.
          • Position-specific scoring matrix (PSSM): A global encoding strategy that converts a protein sequence into a 1000 × 20 binary matrix, providing information on the evolution of the protein sequence.
          • One-hot encoding based on sequence: Converts each amino acid in a sequence into a fixed-length binary vector, offering intuitive, simple, and scalable features.

        Based on physicochemical properties

          • Total amino acid properties: Evaluates the similarity between protein sequences by calculating the proportion of identical amino acids at the same positions in two or more sequences.
          • Composition-transition-distribution descriptors: Convert protein sequences into numerical feature vectors based on the physicochemical properties of amino acids, describing the composition, transition, and distribution characteristics of the amino acids.
          • Amphiphilic pseudo amino acid composition: Based on AAC, this approach incorporates the physicochemical properties and arrangement information of amino acids.
          • Pseudo amino acid composition: Considers not only the amino acid sequence order but also physicochemical properties of amino acids, such as hydrophilicity, hydrophobicity, and molecular weight, together with the compositional information of the 20 amino acids, to construct protein information.

        Based on structural properties

          • Topological structure at the atomic level: A mathematical descriptor based on molecular structure, consisting primarily of the atomic composition, the types of chemical bonds, and the attributes of their connections.
          • Secondary structure and solvent accessibility: The amino acid sequence is converted into two new sequences using secondary structure and solvent accessibility; the new sequences are represented by a 3D vector and a 2D vector respectively, so that each amino acid is finally represented by a binary matrix.

        Representation-related tools

          • Scratch Protein Predictor: A web-based tool that forecasts the tertiary structure and structural characteristics of proteins. It predicts secondary structure and hydrophobicity, and also provides extensive information on disordered regions, structural domains, and individual residue interactions. https://scratch.proteomics.ics.uci.edu/
          • POSSUM: A website that provides property information based on the position-specific scoring matrix, containing 21 distinct PSSM descriptors. https://possum.erc.monash.edu/
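To make the sequence-based representations above concrete, here is a minimal pure-Python sketch computing the 20-dimensional AAC vector and the 400-dimensional dipeptide composition for a hypothetical peptide (the sequence "ALAVAL" is illustrative, not from any dataset):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq):
    """Amino acid composition: relative abundance of each residue (20-dim vector)."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each of the 400 possible dipeptides within the sequence."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]

peptide = "ALAVAL"  # hypothetical example peptide
v = aac(peptide)               # 20 values summing to 1 (e.g., A -> 3/6 = 0.5)
d = dipeptide_composition(peptide)  # 400 values summing to 1
assert len(v) == 20 and len(d) == 400
```

Both vectors are fixed-length regardless of peptide length, which is what makes them convenient inputs for the non-deep-learning models discussed below.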

4.2.3. Construction and Training of Machine Learning Models

This stage involves selecting and configuring ML algorithms and then training them using the prepared data.

4.2.3.1. Machine Learning Algorithms

ML techniques are categorized into non-deep learning and deep learning methods.

  • Non-deep Learning Models: These rely on predefined feature sets.

    • Support Vector Machine (SVM): A powerful supervised learning algorithm used for classification and regression. It works by finding an optimal hyperplane that best separates data points of different classes in a high-dimensional space, maximizing the margin between classes.
      • Strengths: Well-suited for high-dimensional data, robust to overfitting, strong model interpretability.
      • Limitations: Performance highly dependent on feature representation quality, memory-intensive, primarily designed for binary classification (requires adaptation for multiclass).
    • Random Forest (RF): An ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (for classification) or mean prediction (for regression) of the individual trees.
      • Strengths: Handles high-dimensional data, high accuracy, parallelizable, robust to overfitting.
      • Limitations: Sensitive to feature selection (requires domain expertise), relatively poor interpretability compared to single decision trees, sensitive to imbalanced datasets.
  • Deep Learning Models: These can automatically learn complex molecular features directly from raw data.

    • Convolutional Neural Networks (CNNs):
      • Structure (Figure 3a): Composed of convolutional layers (for feature extraction using filters), pooling layers (for dimensionality reduction), and fully connected layers (for classification/regression).

      • Mechanism: Filters slide over the input data (e.g., peptide sequence represented as an image or 1D sequence), detecting local patterns.

      • Strengths: Strong generalization in image processing, effective for local feature extraction.

      • Limitations: Limited capability in learning global features of molecular interactions, potentially constrained effectiveness for complex interactions.


    The image illustrates the architectures of five different artificial intelligence models: (a) a convolutional neural network (CNN) with convolutional, pooling, and fully connected layers; (b) a recurrent neural network (RNN) showing the flow of information through time steps; (c) a graph neural network (GNN) with nodes and edges; (d) a Transformer model focusing on its encoder architecture; and (e) a diffusion model illustrating the forward and reverse diffusion processes.

    • Recurrent Neural Networks (RNNs):
      • Structure (Figure 3b): Networks with feedback loops that allow information to persist, making them suitable for sequential data. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are advanced RNN variants.
      • Mechanism: Process sequences one element at a time, using an internal state (memory) to capture dependencies across time steps.
      • Strengths: Excels at handling dependencies in sequential data, better generalization for temporal/contextual information.
      • Limitations: Prone to vanishing and exploding gradients, making convergence difficult for very long sequences. LSTM and GRU mitigate this but have complex architectures and high training costs, prone to overfitting with small, undiverse datasets.
    • Graph Neural Networks (GNNs):
      • Structure (Figure 3c): Designed for graph-structured data where nodes represent entities (e.g., atoms, amino acids) and edges represent relationships (e.g., covalent bonds, non-covalent interactions).
      • Mechanism: Learn node representations by iteratively aggregating information from their neighbors.
      • Strengths: Directly process graph-structured data, captures complex multi-scale atomic relationships.
      • Limitations: High computational complexity for large graphs, can only capture local molecular structures effectively in some variants.
    • Transformers:
      • Structure (Figure 3d): Primarily based on self-attention mechanisms (e.g., multi-head attention), which allow the model to weigh the importance of different parts of the input sequence. Also include feed-forward layers.
      • Mechanism: Process sequences in parallel, using attention to capture long-range dependencies between any positions, rather than sequentially. Requires positional encoding to retain order information.
      • Strengths: Excellent parallel computation capabilities, captures dependencies irrespective of distance.
      • Limitations: High complexity and training costs, lack of clear input-output mapping makes interpretability challenging for FBP prediction.
    • Diffusion Models:
      • Structure (Figure 3e): A class of generative models that learn to reverse a stochastic process (diffusion) that gradually transforms data into noise.
      • Mechanism: Consists of a forward diffusion process (adding noise) and a reverse denoising process (learning to reconstruct data from noise). AlphaFold 3 leverages these for biomolecular structure prediction.
      • Strengths: Capable of high-quality data generation, used in AlphaFold 3 for protein-ligand interaction and complex structure prediction.
      • Limitations: Resource-intensive due to complex neural network architecture, struggles with discrete data, prone to hallucinations (generating plausible but incorrect outputs).
  • Bottlenecks in Deep Learning Modeling:

    • High model complexity.
    • Long training and learning times.
    • Significant computational resource consumption.
    • Limited model interpretability.
    • Dependency on labeled data (scale, quality, diversity).
    • Insufficient sample sizes in existing databases for training complex DL networks.
    • Challenges in handling negative data (often classified as unknown or randomly sampled).
    • Bias towards higher-volume data samples in imbalanced datasets.
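As a minimal illustration of the convolutional mechanism described above (a filter sliding over a one-hot encoded peptide to detect local patterns), the following NumPy sketch applies a hypothetical length-2 filter that responds to the dipeptide A-L; the filter weights and sequence are illustrative, not from any trained model:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a peptide as a (length, 20) binary matrix."""
    m = np.zeros((len(seq), 20))
    for i, a in enumerate(seq):
        m[i, AMINO_ACIDS.index(a)] = 1.0
    return m

def conv1d(x, filt):
    """Slide a (k, 20) filter over the sequence; one activation per window."""
    k = filt.shape[0]
    return np.array([np.sum(x[i:i + k] * filt) for i in range(x.shape[0] - k + 1)])

seq = one_hot("ALAVAL")           # hypothetical peptide
filt = np.zeros((2, 20))          # filter tuned to the local pattern "A then L"
filt[0, AMINO_ACIDS.index("A")] = 1.0
filt[1, AMINO_ACIDS.index("L")] = 1.0
acts = conv1d(seq, filt)          # peaks of 2.0 at the two "AL" windows
```

A real CNN learns many such filters jointly, followed by pooling and fully connected layers, but each filter still only "sees" a local window, which is the source of the local-versus-global limitation noted above.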

4.2.3.2. Model Architecture Selection

Choosing the appropriate model architecture is crucial. The paper notes that deeper networks generally show better generalization for a similar number of parameters. Approaches include:

  • Exploring different model architectures based on data types and molecular representations.

  • Adjusting network depth, connectivity, neuron quantity, and types.

  • Designing multiple candidate machine learning models and using heuristic evaluation functions to quickly estimate performance and select the optimal architecture.

    Current AI models often rely on static experimental data, neglecting the dynamic interactions of peptides and targets in natural environments. Simulating these dynamic processes (e.g., protein folding, ligand-receptor interactions at atomic level) requires extensive computational resources, leading to a trade-off between efficiency and accuracy.

4.2.3.3. Training of Machine Learning Models

This involves two key sub-processes:

  • Parameter Optimization: Iteratively updating the model's internal parameters (weights and biases) to minimize a loss function.
    • Optimization algorithms: Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), Adaptive Gradient (AdaGrad).
    • Loss functions: Mean Squared Error (MSE) for regression, cross-entropy loss or logarithmic loss for classification.
  • Hyperparameter Tuning: Adjusting external parameters that control the learning process, such as:
    • Activation function: Non-linear functions applied to neuron outputs.
    • Learning rate: Controls the step size during parameter updates.
    • Optimizer: The algorithm used for parameter optimization.
    • Epochs: Number of complete passes through the training dataset.
  • Data Splitting: The dataset is divided into:
    • Training set: Used to train the model.
    • Validation set: Used to monitor model performance during training, adjust hyperparameters, and prevent overfitting.
    • Test set: Used for final, unbiased evaluation of the model's performance on unseen data.
  • Computational Resources: Deep learning models require substantial resources. Graphics Processing Units (GPUs) are highly efficient for large-scale parallel computing tasks, significantly accelerating the training process for complex models and extensive datasets.
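The training pipeline above (data splitting, a cross-entropy loss, gradient-based parameter optimization, and hyperparameters such as learning rate and epochs) can be sketched with NumPy on synthetic stand-in data. All names and numbers here are illustrative; full-batch gradient descent is used for brevity where true SGD would sample mini-batches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for peptide feature vectors (e.g., AAC) with binary labels.
X = rng.normal(size=(300, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(float)

# Data splitting: train / validation / test (60 / 20 / 20).
X_tr, X_val, X_te = X[:180], X[180:240], X[240:]
y_tr, y_val, y_te = y[:180], y[180:240], y[240:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hyperparameters controlling the learning process.
lr, epochs = 0.1, 200
w = np.zeros(20)                              # model parameters (weights)
for _ in range(epochs):                       # one epoch = one pass over the data
    p = sigmoid(X_tr @ w)
    grad = X_tr.T @ (p - y_tr) / len(y_tr)    # gradient of the cross-entropy loss
    w -= lr * grad                            # gradient-descent parameter update

val_acc = np.mean((sigmoid(X_val @ w) > 0.5) == y_val)   # monitor during tuning
test_acc = np.mean((sigmoid(X_te @ w) > 0.5) == y_te)    # final unbiased estimate
```

The validation accuracy would guide hyperparameter tuning (learning rate, epochs, optimizer choice), while the test set is touched only once at the end.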

4.2.4. Evaluation and Validation

After training, a comprehensive evaluation ensures the model performs as expected on unseen data.

  • Evaluation Metrics:
    • For Classification Tasks:
      • Accuracy: Proportion of correctly predicted instances.
      • Precision: Proportion of true positive predictions among all positive predictions.
      • Recall (Sensitivity): Proportion of true positive predictions among all actual positive instances.
      • F1 score: Harmonic mean of precision and recall.
      • Receiver Operating Characteristic (ROC) curve: Plots true positive rate against false positive rate.
      • Area Under the ROC Curve (AUC): Measures the overall performance of a classifier across all possible classification thresholds.
    • For Regression Tasks: Quantifies deviation between predicted and actual values.
      • Mean Squared Error (MSE): Average of the squared differences between predicted and actual values.
      • Root Mean Squared Error (RMSE): Square root of MSE.
      • Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values.
  • Validation Methods: Ensure effectiveness and generalization on unseen data.
    • Cross-validation (k-fold): Divides data into k folds; trains on k−1 folds and validates on the remaining fold, repeating k times.
    • Leave-one-out cross-validation: A special case of k-fold where k equals the number of samples.
    • Hold-out validation: Simple split into training and test sets.
    • Bootstrapping: Resampling technique.
  • Comprehensive Validation: Beyond AI model evaluation, validation involves:
    • Preliminary screening: AI model identifies high-activity candidates.
    • Advanced computational screening: Molecular docking, ligand-based virtual screening, molecular dynamics simulations to refine selection.
    • Rigorous in vitro and in vivo experiments: Essential to confirm theoretical predictions and practical applicability.
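The k-fold procedure above can be sketched in pure Python; the "model" here is a deliberately trivial majority-class predictor, used only to keep the example self-contained:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        val_set = set(val)
        train = [j for j in idx if j not in val_set]
        yield train, val

def cross_validate(data, labels, fit, score, k=5):
    """Train on k-1 folds, score on the held-out fold, average over the k runs."""
    scores = []
    for tr, va in k_fold_indices(len(data), k):
        model = fit([data[i] for i in tr], [labels[i] for i in tr])
        scores.append(score(model, [data[i] for i in va], [labels[i] for i in va]))
    return sum(scores) / len(scores)

# Toy demonstration with a majority-class "model" on hypothetical labels.
data = list(range(10))
labels = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
fit = lambda X, y: max(set(y), key=y.count)                  # majority label
score = lambda m, X, y: sum(int(m == t) for t in y) / len(y) # fold accuracy
mean_acc = cross_validate(data, labels, fit, score, k=5)     # 0.7 here
```

Leave-one-out cross-validation is simply `k=len(data)` with one sample per validation fold.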

5. Experimental Setup

This paper is a review and does not present its own experimental setup. Instead, it synthesizes the experimental approaches, datasets, evaluation metrics, and comparative analyses from the various studies it reviews on AI-driven FBP screening. The following points summarize the general practices observed across the reviewed literature.

5.1. Datasets

The datasets used in the reviewed studies vary significantly in source, scale, and characteristics, reflecting the specific bioactivity being investigated.

  • Sources: FBPs are derived from diverse food proteins (e.g., milk, soy, fish, macroalgae, chia seeds, lactic acid bacteria, walnut protein). Data is collected from dedicated peptide databases (as detailed in Table 1), literature reviews, and in-house experimental results.
  • Characteristics:
    • Peptide sequences: The most common form of data, often represented in FASTA format.
    • Structural information: Sometimes includes secondary structures, physicochemical properties (e.g., molecular weight, hydrophobicity, charge), amino acid composition, dipeptide composition, pseudo amino acid composition (PseAAC).
    • Labels: Binary classification (e.g., active/inactive, anti-inflammatory/non-anti-inflammatory) or regression targets (e.g., IC50 values for antihypertensive peptides).
  • Scale: The data size in reviewed studies ranges from a few hundred (e.g., 203 for taste peptides, 600 for bitterants, 499 for umami peptides) to several thousands (e.g., 4194-5265 for anti-inflammatory, 6989-42213 for antimicrobial, 1338-2120 for antioxidant, 1020-3429 for anti-hypertensive).
  • Challenges: The paper frequently highlights small data size, ambiguity in "inactive" classification for negative data, and class imbalance as significant challenges across various FBP screening tasks. Many studies generate negative samples randomly, which can introduce noise.
  • Example Data Sample: A data sample typically consists of a peptide sequence (e.g., "ALAVAL"), sometimes with associated physicochemical properties (e.g., molecular weight, hydrophobicity values), and a label indicating its bioactivity (e.g., 1 for active, 0 for inactive, or an IC50 value).
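A minimal sketch of how one such labeled sample might be structured in code, with all values hypothetical, is:

```python
# A hypothetical labeled sample: sequence, optional physicochemical
# properties, and either a binary label or a regression target (e.g., IC50).
sample = {
    "sequence": "ALAVAL",               # peptide in one-letter code
    "properties": {
        "length": 6,
        "hydrophobic_fraction": 1.0,    # A, L, and V are all hydrophobic residues
    },
    "label": 1,                         # 1 = active, 0 = inactive
}

def to_fasta(name, seq):
    """Serialize one record in the FASTA format used by peptide databases."""
    return f">{name}\n{seq}\n"

record = to_fasta("peptide_001", sample["sequence"])  # ">peptide_001\nALAVAL\n"
```

In practice the "inactive" label is often the weak point: many reviewed studies assign it to randomly sampled sequences rather than experimentally confirmed negatives.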

5.2. Evaluation Metrics

For the classification tasks predominant in FBP screening, a range of standard metrics are employed to assess model performance. For regression tasks, metrics quantify prediction accuracy.

  • Accuracy

    • Conceptual Definition: Accuracy measures the proportion of total predictions that were correct. It is a straightforward indicator of overall correctness.
    • Mathematical Formula: $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
    • Symbol Explanation:
      • TP: True Positives (correctly predicted positive instances).
      • TN: True Negatives (correctly predicted negative instances).
      • FP: False Positives (incorrectly predicted positive instances, type I error).
      • FN: False Negatives (incorrectly predicted negative instances, type II error).
  • Precision

    • Conceptual Definition: Precision measures the proportion of positive identifications that were actually correct. It focuses on the quality of positive predictions, addressing the question: "Of all items the model labeled as positive, how many are actually positive?"
    • Mathematical Formula: $ \text{Precision} = \frac{TP}{TP + FP} $
    • Symbol Explanation:
      • TP: True Positives.
      • FP: False Positives.
  • Recall (Sensitivity)

    • Conceptual Definition: Recall measures the proportion of actual positives that were identified correctly. It focuses on the completeness of positive predictions, addressing the question: "Of all actual positive items, how many did the model correctly identify?"
    • Mathematical Formula: $ \text{Recall} = \frac{TP}{TP + FN} $
    • Symbol Explanation:
      • TP: True Positives.
      • FN: False Negatives.
  • F1 Score

    • Conceptual Definition: The F1 score is the harmonic mean of Precision and Recall. It provides a single score that balances both metrics, being particularly useful when there is an uneven class distribution.
    • Mathematical Formula: $ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
    • Symbol Explanation:
      • Precision: The precision score.
      • Recall: The recall score.
  • Receiver Operating Characteristic (ROC) Curve

    • Conceptual Definition: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
    • Mathematical Formulas (for points on the curve):
      • True Positive Rate (TPR) = Recall = $\frac{TP}{TP + FN}$
      • False Positive Rate (FPR) = $\frac{FP}{FP + TN}$
    • Symbol Explanation:
      • TP, TN, FP, FN: As defined above.
  • Area Under the ROC Curve (AUC)

    • Conceptual Definition: The AUC represents the degree or measure of separability between classes. It measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1). A higher AUC indicates better model performance in distinguishing between positive and negative classes.
    • Mathematical Formula: No simple closed-form formula. It's typically calculated numerically as the integral of the ROC curve. Conceptually, it represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
    • Symbol Explanation: N/A (it's a calculated area from the ROC curve).
  • Mean Squared Error (MSE)

    • Conceptual Definition: MSE is a common regression metric that measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the actual value. It penalizes larger errors more heavily.
    • Mathematical Formula: $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
    • Symbol Explanation:
      • $n$: The number of data points.
      • $Y_i$: The actual (observed) value for the $i$-th data point.
      • $\hat{Y}_i$: The predicted value for the $i$-th data point.
  • Root Mean Squared Error (RMSE)

    • Conceptual Definition: RMSE is the square root of MSE. It is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It has the same units as the target variable, making it more interpretable than MSE.
    • Mathematical Formula: $ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} $
    • Symbol Explanation:
      • $n$: The number of data points.
      • $Y_i$: The actual (observed) value for the $i$-th data point.
      • $\hat{Y}_i$: The predicted value for the $i$-th data point.
  • Mean Absolute Error (MAE)

    • Conceptual Definition: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between prediction and actual observation, and all individual differences are weighted equally.
    • Mathematical Formula: $ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i| $
    • Symbol Explanation:
      • $n$: The number of data points.
      • $Y_i$: The actual (observed) value for the $i$-th data point.
      • $\hat{Y}_i$: The predicted value for the $i$-th data point.
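A short worked example ties the definitions above together; the confusion-matrix counts and regression values are hypothetical:

```python
import math

# Hypothetical confusion-matrix counts for a binary FBP classifier.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # (40+45)/100 = 0.85
precision = TP / (TP + FP)                         # 40/45
recall    = TP / (TP + FN)                         # 40/50 = 0.8
f1        = 2 * precision * recall / (precision + recall)

# Hypothetical regression targets and predictions (e.g., IC50 values).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.0, 5.0]
errors = [t - p for t, p in zip(y_true, y_pred)]

mse  = sum(e * e for e in errors) / len(errors)    # 0.375
rmse = math.sqrt(mse)                              # same units as the target
mae  = sum(abs(e) for e in errors) / len(errors)   # 0.5
```

Note how MSE penalizes the single error of 1.0 more heavily than MAE does, which is why RMSE and MAE can rank models differently on the same predictions.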

5.3. Baselines

As a review, the paper compares the performance of various AI models against each other rather than defining a single "baseline" in the traditional sense for its own method. The "baselines" are effectively the different machine learning algorithms and feature representation methods employed across the literature.

  • Traditional Machine Learning: Support Vector Machines (SVMs), Random Forests (RFs), Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors (KNN), Gradient Boosting. These serve as comparative methods against which deep learning approaches are often evaluated.

  • Deep Learning Architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) (including LSTM, GRU), Graph Neural Networks (GNNs), and Transformers (e.g., BERT, ProtBERT). Studies often compare different DL architectures or hybrid DL models.

  • Feature Representations: Comparisons are also made between different molecular feature representation techniques (e.g., sequence-based, physicochemical, structural, image-based, graph-based), highlighting how the choice of representation impacts model performance.

  • Ensemble Methods: Some studies employ ensemble models (e.g., voting classifiers combining multiple ML models) as a form of "baseline" for improved robustness.

    The paper implicitly evaluates how deep learning approaches, by automatically learning complex features, tend to show predictive advantages over traditional machine learning methods which rely more on meticulously handcrafted features.

6. Results & Analysis

6.1. Core Results Analysis

The paper provides a comprehensive overview of the application of AI methods for screening FBPs with various bioactivities. The general trend observed across these applications is a shift towards deep learning models, which often demonstrate superior predictive capabilities compared to traditional machine learning methods, especially when dealing with complex data and relationships. However, significant challenges, particularly related to data quality and quantity, persist across all bioactivity types.

  • Anti-inflammatory Peptides (AIPs):

    • Most AI-based screening for AIPs has focused on nonspecific anti-inflammatory peptides, using random forest classifiers with sequence and structural features.
    • The paper notes a lack of research on specific anti-inflammatory peptides targeting different inflammatory targets.
    • A critical observation is the reliance on manually selected amino acid sequence features and traditional ML algorithms, suggesting untapped potential for deep learning to automatically extract features and improve accuracy.
    • A significant gap identified is the lack of experimental validation for most proposed AI prediction methods for AIPs.
  • Antimicrobial Peptides (AMPs):

    • AI has been used to screen AMPs from diverse food sources (e.g., shrimp, seaweed, chia seeds, lactic acid bacteria) using random forests, artificial neural networks, and graph convolutional neural networks.
    • Challenges include overlooking molecular characteristics associated with low sequence homology of AMPs and issues with cumulative prediction error in stacked models.
    • Many ML screening methods for AMPs also lack experimental validation beyond AI.
  • Antioxidant Peptides:

    • Studies have developed binary classifiers using various traditional ML algorithms (e.g., logistic regression, SVM, KNN) and CNNs with peptide sequences and pseudo amino acid composition as features.
    • The sample sizes for training data are generally limited (around 2,000), necessitating data augmentation and transfer learning.
    • Research often focuses on peptide sequences, neglecting deeper structural and physicochemical properties that significantly influence activity (e.g., specific amino acids).
    • Interpretability of ML models is highlighted as important for understanding interaction mechanisms.
  • Taste Peptides (Umami, Bitter):

    • AI-based screening primarily targets umami and bitter peptides, using deep learning (e.g., multi-layer perceptron, RNN, CNN) and gradient boosting random forest.
    • A major issue is the insufficient sample size (often in the hundreds), which makes deep learning models prone to overfitting.
    • Data augmentation (e.g., generative adversarial networks) and regularization techniques are suggested to address data scarcity and overfitting.
    • Research on other taste types (sour, sweet, salty) is limited.
  • Antihypertensive Peptides:

    • AI methods, including regression decision trees and various deep learning models (BERT, ProtBERT, LSTM, RNN), have been applied to predict ACE-inhibitory activity or IC50 values.
    • A common problem is the random generation of negative samples, which introduces noise and reduces predictive accuracy.
    • The representation of peptide molecular sequence features with single numerical values may oversimplify complexity.
    • Small dataset sizes and lack of biological experimental validation are recurring issues.
  • Other Bioactivities (Hypoglycemic, Anticancer, Neuroactive, Muscle Synthesis, Multifunctional):

    • AI has shown promise in identifying peptides for diverse applications, demonstrating its broad potential.

    • However, anti-obesity and anti-fatigue peptides are significantly under-researched, indicating a nascent stage for AI-driven screening in these areas.

      Overall, deep learning exhibits clear advantages, but the field is bottlenecked by data limitations (small size, imbalance, poor negative sample quality), suboptimal feature representations (over-reliance on sequences, lack of multi-scale integration), and insufficient experimental validation.

6.2. Data Presentation (Tables)

The following are the results from Table 3 of the original paper:

| Bioactivity | Type of problem | Data size | Molecular representations | Machine learning models | Validation experiments | Website | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anti-inflammatory | Binary classification | 4194 | Peptide sequence and structural information | Random forests | / | http://kurata14.bio.kyutech.ac.jp/PreAIP/ | Khatun et al. (2019) |
| Anti-inflammatory | Binary classification | 4620 | 3 peptide sequence features | Random forests | / | / | Zhang et al. (2021) |
| Anti-inflammatory | Binary classification | 2748 | 3 feature encodings | Random forests | / | / | Zhao et al. (2021) |
| Anti-inflammatory | Binary classification | 5265 | 8 sequence features | Integrated model | / | https://github.com/Mir-Saima/IF-AIP | Gaffar et al. (2024) |
| Antimicrobial | Binary classification | 6989 | 8 types of physicochemical properties and AAC | Random forests | / | / | / |
| Antimicrobial | Multi-label classification | 42,213 | Protein sequences | Ensemble of artificial neural networks and random forests | / | / | Caprani et al. (2021) |
| Antimicrobial | Binary classification | 1067 | Peptide sequences | Integrated model | In vitro experiments | https://cbbio.online/AxPEP/ | León Madrazo and Segura Campos (2022) |
| Antimicrobial | Binary classification | 3244 | Initial graph obtained from peptide sequences | Graph convolutional networks | / | http://www.dong-group.cn/database/dlabamp/Prediction/amplab/result/ | Sun et al. (2022) |
| Antioxidant | Binary classification | 1338 | Peptide sequences with PseAAC encoding | Logistic regression, linear discriminant analysis, support vector machine, and k-nearest neighbors | / | https://doi.org/10.1016/j.foodcont.2021.108439 | Shen et al. (2022) |
| Antioxidant | Binary classification | 1404 | Peptide sequences with one-hot encoding | Convolutional neural networks | In vitro experiments | http://services.bioinformatics.dtu.dk/service.php?AnOxPePred-1.0 | Olsen et al. (2020) |
| Antioxidant | Binary classification | 2120 | Peptide sequences | Long short-term memory | / | http://www.cqudfbp.net/AnOxPP/index.jsp | Qin et al. (2023) |
| Antioxidant | Binary classification | 564 | Peptide sequences | Ensemble model with support vector machine, random forests, k-nearest neighbors, and logistic regression | In vitro experiments | / | García et al. (2022) |
| Taste | Binary classification of umami | 499 | 6 feature representations | Merged model of multi-layer perceptron and recurrent neural networks | / | https://umami-mrnn.herokuapp.com/ | Qi et al. (2023) |
| Taste | Binary classification of umami | 203 | 8 molecular descriptors | Gradient boosting and random forests | Sensory experiments | https://pypi.org/project/Auto-Taste ML | Cui et al. (2023) |
| Taste | Binary classification of bitterants | 600 | Molecular weight, surface hydrophobicity, and relative hydrophobicity | Support vector machine, linear regression, adaptive boosting, and k-nearest neighbors | Sensory experiments | https://doi.org/10.1016/j.foodres.2022.110974 | Yolandani et al. |
| Taste | Binary classification of bitterants and sweeteners | 2233 (bitter), 2366 (sweet) | MLP: molecular descriptors and fingerprint; CNN: the 2D image | Convolutional neural networks and multi-layer perceptron | / | / | / |
| Anti-hypertensive | Regression-based binary classification | 1587 | PseAAC for peptide sequence features | Regression decision tree | Molecular docking and in vitro experiments | http://hazralab.iitr.ac.in/ahpp/index.php | Zhang, Dai, Zhao et al. (2023) |
| Anti-hypertensive | Binary classification | 2277 | Protein sequences with PseAAC encoding | Four deep learning models including BERT, ProtBERT, long short-term memory, and recurrent neural networks | Molecular docking and in vitro experiments | / | Zhang, Dai, Zhao et al. (2023) |
| Anti-hypertensive | Regression prediction of the IC50 value | 3429 | Protein sequences | Long short-term memory networks | In vitro experiments | / | Liao et al. (2023) |
| Anti-hypertensive | Binary classification | 1020 | ESM-2-based peptide embeddings | Logistic regression, support vector machine, random forests, k-nearest neighbors, and multi-layer perceptron | / | https://github.com/dzjxzyd/LM4ACE_webserver | / |
| Other bioactivities | Multiple classifications | 2544 | 22 sequence features | 8 notable machine learning models | / | https://balalab-skku.org/ADP-Fuse | / |

6.3. Ablation Studies / Parameter Analysis

The review paper does not present its own ablation studies or parameter analyses, as it is a synthesis of existing literature. However, it implicitly discusses aspects related to these concepts by highlighting:

  • Impact of Feature Selection: The paper notes that non-deep learning models like random forests are "sensitive to feature selection" and often require "careful design of ... features, demanding significant domain expertise" (Imai et al., 2021). This implies that the choice of molecular feature representation acts akin to an ablation study or parameter analysis on the input features.

  • Model Complexity vs. Data Size: The paper frequently points out the issue of overfitting in deep learning models when sample sizes are limited. This indicates that network depth and complexity (model parameters) are crucial hyperparameters that need to be tuned relative to the available data.

  • Limitations of Local vs. Global Features: When discussing CNNs, the paper notes their focus on local features and limited capacity to learn global features of molecular interactions, constraining generalization. This highlights an implicit analysis of how different architectural choices (convolutional filters) impact the model's ability to capture comprehensive information.

  • Role of Specific Amino Acids: For antioxidant peptides, the paper suggests that the "proportion and position of specific amino acids (e.g., sulfur-containing amino acids, aromatic amino acids) ... should be thoroughly considered during data annotation and feature extraction." This implies that more granular feature engineering or attention to specific physicochemical properties is an area for further parameter analysis or ablation on feature importance.

  • Negative Data Generation Strategy: The paper repeatedly criticizes the random generation of negative samples in antihypertensive peptide screening as introducing noise and affecting predictive accuracy. This suggests that the strategy for negative data generation is a critical "parameter" that heavily influences model performance and is an area for future improvement, much like an ablation study on the composition of the training data.

    While the paper doesn't present explicit ablation study tables or hyperparameter tuning curves, its critical discussion of current challenges and limitations in reviewed studies serves to illuminate how different choices in data, features, and model architectures impact the success of AI-driven FBP screening.
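The amino-acid-level feature engineering suggested above (proportion and position of sulfur-containing and aromatic residues) can be made concrete with a short sketch. The feature names and the position normalisation below are illustrative assumptions, not taken from any model the review covers.

```python
# Illustrative composition/position features for antioxidant-peptide screening.
# Residue groups follow the review's examples: sulfur-containing (C, M) and
# aromatic (F, W, Y) amino acids. Feature names are hypothetical.

SULFUR = set("CM")
AROMATIC = set("FWY")

def residue_features(peptide: str) -> dict:
    """Return simple proportion and mean-position features for one peptide."""
    n = len(peptide)
    norm = max(n - 1, 1)  # guard against single-residue peptides
    sulfur_pos = [i for i, aa in enumerate(peptide) if aa in SULFUR]
    aromatic_pos = [i for i, aa in enumerate(peptide) if aa in AROMATIC]
    return {
        "length": n,
        "sulfur_fraction": len(sulfur_pos) / n,
        "aromatic_fraction": len(aromatic_pos) / n,
        # mean normalised position (0 = N-terminus, 1 = C-terminus); -1 if absent
        "sulfur_mean_pos": sum(sulfur_pos) / (len(sulfur_pos) * norm) if sulfur_pos else -1.0,
        "aromatic_mean_pos": sum(aromatic_pos) / (len(aromatic_pos) * norm) if aromatic_pos else -1.0,
    }
```

Features of this kind would feed the non-deep-learning models discussed above; which properties actually matter for a given bioactivity is exactly the ablation question the review raises.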

7. Conclusion & Reflections

7.1. Conclusion Summary

This review effectively highlights Artificial Intelligence (AI) as a transformative technology for the high-throughput screening and analysis of Food-Derived Bioactive Peptides (FBPs). It systematically deconstructs the AI-driven screening process, from data foundation and molecular feature representation to model construction, training, evaluation, and validation. The paper concludes that while Deep Learning (DL) models generally offer superior predictive advantages over traditional Machine Learning (ML) techniques, the field still faces significant challenges. Notable progress has been made in identifying FBPs with anti-inflammatory, antimicrobial, antioxidant, flavor-enhancing, and hypotensive properties, but research on anti-obesity and anti-fatigue peptides is nascent. The review serves as a valuable resource, consolidating current advancements and charting a clear course for future research to accelerate the discovery and application of FBPs.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations in the current AI-driven FBP screening landscape and propose concrete future directions:

  • Data Limitations:
    • Small data size: Need for new data augmentation methods (e.g., generative models, cross-distillation within large models).
    • Ambiguity in "inactive" classification for negative data: Need to build dedicated inactive peptide databases through synthetic data generation and transfer learning.
    • Class imbalance: Requires techniques like oversampling, under-sampling, and cost-sensitive learning.
  • Molecular Representation: Current methods are often limited to peptide sequence representations, restricting the learning of complex atomic interactions. Future work should focus on multimodal molecular representations and multi-scale data features across chemical spaces.
  • Algorithm Models:
    • Model uniformity and reliance on traditional classification algorithms are common. Future trends involve integrating multiple deep learning models and developing general AI model frameworks (e.g., deep generative models, diffusion generative models).
    • Improved interpretability and robustness of ML models are crucial, especially with the integration of AI with biology and chemistry knowledge.
  • Training Efficiency: Current reliance on traditional methods like cross-validation and early stopping leads to low efficiency. Future methods include knowledge distillation, fine-tuning of pretrained models, and generative training.
  • Dynamic Interactions: Current AI models predict static atomic interactions. Future research must explore methods to predict dynamic interactions between peptides and targets in solution for a better understanding of activity mechanisms and targeted delivery.
  • Oversimplification and Scope: Problems are often oversimplified into binary classifications. More research is needed on multifunctional, anti-obesity, and anti-fatigue peptides.
  • Lack of Biological Validation: Many machine learning methods lack biological experimental validation. A high-throughput screening framework combining AI with virtual screening, molecular dynamics simulations, in vitro testing, and in vivo experiments is needed.
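Two of the class-imbalance remedies listed above, oversampling and cost-sensitive learning, can be sketched in plain Python. The balancing strategies shown are generic illustrations (naive duplication and inverse-frequency weights), not methods from any specific reviewed study.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Cost-sensitive learning: weight each class inversely to its frequency,
    so misclassifying the rare (e.g. bioactive) class costs more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

def random_oversample(samples, labels, seed=0):
    """Naive oversampling: duplicate minority-class examples at random
    until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels
```

In practice these weights would be passed to a classifier's loss function, and oversampling would be applied only to the training split to avoid leaking duplicates into evaluation, a caveat consistent with the data-quality concerns the review raises.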

7.3. Personal Insights & Critique

This paper provides an excellent, structured overview of a rapidly evolving field. Its rigorous breakdown of the AI-driven screening process and detailed discussion of various ML/DL architectures in the context of FBP discovery is highly valuable for both beginners and experienced researchers. The emphasis on data-related challenges (scarcity, imbalance, negative samples) is particularly insightful, as data quality often forms the bottleneck in AI applications.

One key inspiration drawn is the potential for AI to accelerate discovery in areas where traditional methods are prohibitively slow or expensive. The idea of food-specific large models and universal deep learning frameworks for multi-scale chemical space features is ambitious but compelling, hinting at a future where AI can not only predict but also design novel FBPs. The call for predicting dynamic peptide-target interactions is also critical, moving beyond static snapshots to capture the complex biological reality.

However, a potential area for further emphasis, even in a review, could be on the ethical implications and regulatory pathways for AI-discovered FBPs. While AI promises greener and safer options, the journey from in silico prediction to market-approved functional food ingredients is complex, involving rigorous safety assessments and regulatory hurdles. How AI can aid or complicate these processes could be a fascinating future research direction.

Additionally, while deep learning is praised for its advantages, the interpretability issue remains a significant challenge, especially in biological and health-related fields where understanding the "why" behind a prediction is crucial for trust and further scientific exploration. The paper mentions interpretability as a future research direction, which is important. Perhaps future reviews could delve deeper into emerging explainable AI (XAI) techniques applied specifically to peptide-protein interactions or FBP activity prediction.

Overall, the paper is a timely and comprehensive guide, effectively balancing a summary of current achievements with a critical look at the road ahead, making it a pivotal contribution to the field of AI in food science.
