
Bitter-RF: A random forest machine model for recognizing bitter peptides

Published: 01/26/2023

TL;DR Summary

Bitter-RF, a random forest model integrating 10 peptide sequence features, achieves high accuracy (AUROC=0.98) in bitter peptide recognition, pioneering RF use in this domain and enhancing protein classification methods.

Abstract

TYPE: Original Research. PUBLISHED: 26 January 2023. DOI: 10.3389/fmed.2023.1052923 (open access). EDITED BY: C. George Priya Doss, VIT University, India. REVIEWED BY: HaiHui Huang, Shaoguan University, China; Dragos Horvath, UMR 7140 Chimie de la Matière Complexe, France; Zhibin Lv, Sichuan University, China. CORRESPONDENCE: Hui Ding (hding@uestc.edu.cn), Yang Zhang (yangzhang@cdutcm.edu.cn), Ke-Jun Deng (dengkj@uestc.edu.cn). † These authors have contributed equally to this work. SPECIALTY SECTION: This article was submitted to Precision Medicine, a section of the journal Frontiers in Medicine. RECEIVED: 24 September 2022; ACCEPTED: 05 January 2023; PUBLISHED: 26 January 2023. CITATION: Zhang Y-F, Wang Y-H, Gu Z-F, Pan X-R, Li J, Ding H, Zhang Y and Deng K-J (2023) Bitter-RF: A random forest machine model for recognizing bitter peptides. Front. Med. 10:1052923. doi: 10.3389/fmed.2023.1052923. COPYRIGHT © 2023 Zhang, Wang, Gu, Pan, Li, Ding, Zhang and Deng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Bitter-RF: A random forest machine model for recognizing bitter peptides

1.2. Authors

Yu-Fei Zhang, Yu-Hao Wang, Zhi-Feng Gu, Xian-Run Pan, Jian Li, Hui Ding, Yang Zhang, and Ke-Jun Deng. The authors are primarily affiliated with the School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China; Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China; and School of Basic Medical Sciences, Chengdu University, Chengdu, China. The corresponding authors are Hui Ding, Yang Zhang, and Ke-Jun Deng. Their research backgrounds appear to be in bioinformatics, computational biology, and possibly traditional Chinese medicine, focusing on machine learning applications in biological sequence analysis.

1.3. Journal/Conference

Published in Frontiers in Medicine, Volume 10, Article 1052923. Frontiers in Medicine is an open-access peer-reviewed journal publishing across various fields of medical research. It is generally considered a reputable journal within the "Frontiers" publishing family, contributing to the dissemination of medical and biomedical research.

1.4. Publication Year

2023

1.5. Abstract

This paper introduces Bitter-RF, a Random Forest (RF)-based machine learning model designed for recognizing bitter peptides using their sequence information. Bitter peptides are short peptides with significant potential medical applications that remain largely unexplored. To facilitate their practical utilization, an accurate classification method is crucial. Bitter-RF integrates 10 distinct features extracted from peptide sequences, aiming for a more comprehensive representation of peptide information. The model demonstrates superior performance compared to existing state-of-the-art models on an independent validation set, achieving an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.98. This research not only enhances the accuracy of bitter peptide classification but also expands the application of the RF method in protein classification tasks, a domain where it had not been previously used for bitter peptide prediction. The authors hope Bitter-RF will serve as a valuable tool for researchers in bitter peptide studies.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the accurate and efficient identification of bitter peptides. Bitter peptides are short amino acid sequences that elicit a bitter taste. While often associated with spoilage or toxins, some bitter peptides possess significant potential medical applications, such as regulating blood glucose (e.g., peptides from Momordica charantia). However, their full therapeutic value remains largely untapped.

The problem is important because traditional experimental methods for identifying bitter peptides are complex, time-consuming, expensive, and often inaccurate. These biological methods typically involve laborious steps like gel separation, multiple rounds of liquid chromatography, purification, and identification using specialized instruments like Fourier transform infrared spectroscopy (FTIR), which are not universally accessible. Human sensory evaluations, while sometimes used, are subjective and can lead to inconsistent results. There is a clear need for a more efficient and accurate classification method to unlock the practical value of bitter peptides.

The paper's entry point, or innovative idea, is to develop a machine learning (ML) model, specifically using the Random Forest (RF) algorithm, that leverages comprehensive sequence information (features) to predict bitter peptides. Previous computational methods existed, including quantitative structure-bitterness relationship (QSBR) models and earlier generations of sequence-based models, but they either focused on structural properties rather than the sequence directly, or suffered from issues such as information redundancy, overfitting, or suboptimal feature representation. This study seeks to improve on these by integrating a wider array of sequence-derived features and by applying an RF model, noted for its robustness and adaptability to high-dimensional data, to this classification task for the first time.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Development of Bitter-RF Model: The authors developed a novel Random Forest (RF)-based machine learning model named Bitter-RF for the accurate recognition of bitter peptides. This is highlighted as the first application of the RF method to build a predictive model specifically for bitter peptides.
  • Comprehensive Feature Integration: Bitter-RF integrates a more comprehensive and extensive set of 10 different sequence-derived features, covering various aspects of peptide composition, physicochemical properties, and sequence order. This multi-perspective feature set (initially 1,337 dimensions, reduced to 1,206 after removing zero columns) provides richer information for classification.
  • Superior Performance: The model demonstrates significantly improved prediction accuracy, especially on an independent validation set. It achieved an AUROC (Area Under the Receiver Operating Characteristic curve) of 0.98 on the independent test set, which is comparable to or better than the latest generation of existing models. Key metrics like Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and Matthew's Correlation Coefficient (MCC) also showed strong performance (Sn=0.94, Sp=0.94, Acc=0.94, MCC=0.88 on independent set).
  • Validation of Feature Fusion and RF Method: The study systematically showed that fusing multiple features leads to better predictive performance compared to using single features. Furthermore, the Random Forest algorithm outperformed other traditional machine learning methods (SVM, LightGBM, Decision Trees, Logistic Regression) on the fused feature set for this specific task.
  • Enrichment of Protein Classification Applications: The research enriches the practical application of the RF method in protein classification, providing a robust model for bitter peptide identification that can guide further research and potential medical applications.
  • Open-Source Tool: A free and easy-to-use Python package for Bitter-RF has been made available on GitHub, providing a practical tool for scholars in bitter peptide research.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with basic concepts in molecular biology, machine learning, and statistical evaluation.

  • Peptides and Amino Acids:

    • Amino Acids: The basic building blocks of proteins and peptides. There are 20 common types, each with unique side chains that confer different physicochemical properties (e.g., hydrophobicity, charge, size).
    • Peptides: Short chains of amino acids linked by peptide bonds. Bitter peptides are a specific class of peptides that elicit a bitter taste perception, often due to their hydrophobic amino acid content or sequence arrangement.
    • Peptide Sequence Information: The linear order of amino acids in a peptide chain. This sequence dictates the peptide's properties and potential function.
  • Machine Learning (ML): A field of artificial intelligence that enables systems to learn from data without being explicitly programmed.

    • Classification Task: A type of supervised learning where an algorithm learns to assign input data into predefined categories (e.g., "bitter peptide" or "non-bitter peptide").
    • Features: Measurable properties or attributes of the data that the ML model uses to learn and make predictions. In this paper, features are derived from peptide sequences.
    • Model Training: The process where an ML algorithm learns patterns from a training dataset to build a predictive model.
    • Model Validation/Testing: Evaluating the performance of a trained model on unseen data (validation set or independent set) to assess its generalization ability.
    • Supervised Learning: A type of machine learning where the algorithm learns from labeled data (i.e., data where the correct output/category is already known).
  • Specific Machine Learning Algorithms:

    • Random Forest (RF): An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It reduces overfitting and improves accuracy. Each tree in the forest is built using a random subset of the training data and a random subset of features.
    • Support Vector Machine (SVM): A supervised learning model used for classification and regression tasks. It works by finding an optimal hyperplane that best separates data points of different classes in a high-dimensional space.
    • Light Gradient Boosting Machine (LightGBM): A gradient boosting framework that uses tree-based learning algorithms. It is designed to be highly efficient and scalable, particularly for large datasets, by using techniques like Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
    • Decision Tree (DT): A non-parametric supervised learning method used for classification and regression. It partitions the data into subsets based on feature values, creating a tree-like model of decisions and their possible consequences.
    • Logistic Regression (LR): A statistical model used for binary classification. It models the probability of a binary outcome (e.g., bitter or non-bitter) using a logistic function to estimate probabilities, which are then mapped to two discrete classes.
  • Cross-validation (e.g., 10-fold cross-validation): A technique to assess how the results of a statistical analysis will generalize to an independent dataset. In k-fold cross-validation, the dataset is divided into $k$ equally sized folds. The model is trained on $k-1$ folds and tested on the remaining fold. This process is repeated $k$ times, with each fold used exactly once as the test set. 10-fold cross-validation means $k = 10$.
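
For illustration, here is a minimal sketch of 10-fold cross-validation with scikit-learn; the data and estimator are placeholders, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: 512 samples with 1,206 features, binary labels (bitter / non-bitter)
rng = np.random.default_rng(0)
X = rng.random((512, 1206))
y = np.repeat([1, 0], 256)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])  # train on 9 folds
    # test on the single held-out fold
    fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean 10-fold accuracy: {np.mean(fold_acc):.3f}")
```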

3.2. Previous Works

The paper frames its work in the context of previous efforts to predict bitter peptides, broadly categorizing them into experimental and computational methods, and further detailing four generations of sequence-based computational models.

3.2.1. Experimental Methods

  • Process: Involve extracting bitter peptides from raw materials, gel separation, multiple rounds of liquid chromatography, purification, and then identification using techniques like Fourier Transform Infrared Spectroscopy (FTIR).
  • Limitations: Complex, time-consuming, instrument-dependent (not universal), and potentially inaccurate due to human sensory evaluation involvement.

3.2.2. Computational Methods (QSBR Models)

  • Quantitative Structure-Activity Relationship (QSAR/QSBR): Models that attempt to find a correlation between the structural properties of molecules and their biological activity (e.g., bitterness).
  • Techniques used: Multiple linear regression, Support Vector Machine (SVM), Artificial Neural Network (ANN).
  • Example cited: A model based on 229 experimental bitterness values, extracting 1292 descriptors using Dragon 5.4 software, reducing them to 244, and then selecting six best-scoring descriptors (SPAN, Mean Square Distance (MSD), E3s, G3p, Hats8U, and 3D-MoRSE) using GAPLS (Genetic Algorithm Partial Least Squares) for QSAR model construction. These descriptors represent molecular dimensions, atom counts, electrical topological states, WHIM indices, spatial autocorrelation, and molecular size/mass/volume.

3.2.3. Sequence-based Computational Models (Four Generations)

The paper highlights an evolution of sequence-based models, providing context for its own Bitter-RF model.

  • First-generation model (iBitter-SCM):

    • Method: Used dipeptide propensity scores to predict bitter peptides. Dipeptide propensity refers to the likelihood of specific pairs of amino acids appearing together in bitter peptides versus non-bitter peptides.
    • Limitation: Extracted only a few characteristics, potentially limiting information capture.
    • Reference: Charoenkwan et al., 2020 (22).
  • Second-generation model (BERT4Bitter):

    • Method: Utilized deep learning research methods, specifically Bidirectional Encoder Representations from Transformers (BERT). BERT is a powerful neural network model pre-trained on large text corpora, adapted here to learn contextual representations of peptide sequences.
    • Potential Problems: The authors suggest potential issues with information redundancy and overfitting, common challenges in deep learning models, especially with limited data.
    • Reference: Charoenkwan et al., 2021 (23).
  • Third-generation model (iBitter-Fuse):

    • Method: Integrated five peptide features to characterize bitter peptides and built a prediction model, likely using SVM as mentioned in its reference. The five features are not explicitly listed in the current paper's "Previous Works" section, but the reference (Charoenkwan et al., 2021 (24)) indicates it combines "multi-view features."
    • Limitation: The representativeness of the features might need further optimization.
    • Reference: Charoenkwan et al., 2021 (24).
  • Fourth-generation model (iBitter-DRLF):

    • Method: Extracted features through deep learning pre-training and then built a prediction model based on Light Gradient Boosting Machine (LGBM). This combines the representation learning power of deep learning with the efficiency and performance of LGBM for classification.
    • Reference: Zhang et al., 2022 (26).

3.3. Technological Evolution

The evolution in bitter peptide identification has moved from:

  1. Laborious, expensive, and subjective experimental methods (e.g., FTIR, human sensory evaluation).

  2. To computational methods based on quantitative structure-activity relationships (QSBR), which correlate molecular structure with bitterness using various ML algorithms (SVM, ANN).

  3. Then, to sequence-based computational models, which are more practical as they only require the amino acid sequence. This started with simpler models using dipeptide propensity scores, progressed to sophisticated deep learning approaches (BERT4Bitter), then to feature fusion with traditional ML (iBitter-Fuse), and finally to hybrid approaches combining deep learning for feature extraction with gradient boosting machines for prediction (iBitter-DRLF).

    This paper's Bitter-RF fits into this timeline by building upon the concept of feature fusion (like iBitter-Fuse) but expanding the number and types of features significantly (10 features) and employing a Random Forest algorithm, which is shown to be highly effective for this problem, contrasting with previous models that might have used SVM or LightGBM for the final classification step.

3.4. Differentiation Analysis

Compared to the main methods in related work, Bitter-RF differentiates itself through several core innovations:

  • Expanded Feature Set: While previous feature fusion models (e.g., iBitter-Fuse) used a limited number of features (e.g., five), Bitter-RF integrates a more comprehensive set of 10 different sequence-derived features. This broader scope aims to capture more diverse and extensive information about bitter peptides, including amino acid composition, pseudo-amino acid composition, dipeptide composition, and sequence-order-coupling numbers, which are crucial for physicochemical properties and sequential relationships.
  • Novel Application of Random Forest (RF): The paper explicitly states that the Random Forest method "has not been used to build a prediction model for bitter peptides." Bitter-RF introduces RF as a robust and effective classifier for this specific task, demonstrating its superior performance compared to other traditional ML methods (SVM, LightGBM, DT, LR) and competitive performance against state-of-the-art deep learning-based models.
  • Improved Accuracy on Independent Set: Bitter-RF achieves a high AUROC of 0.98 on the independent validation set, outperforming several previous generations of models and matching the performance of iBitter-DRLF, which often relies on computationally intensive deep learning pre-training.
  • Computational Efficiency: By utilizing a traditional machine learning method like Random Forest, Bitter-RF offers a strong prediction performance while consuming fewer computing resources compared to complex deep learning models, making it more accessible and practical for general research use.

4. Methodology

4.1. Principles

The core idea behind Bitter-RF is to accurately classify bitter peptides by leveraging a rich set of information derived from their amino acid sequences using a robust machine learning algorithm. The theoretical basis is that the bitter taste property of peptides is encoded within their sequence composition and arrangement, reflecting underlying physicochemical properties (e.g., hydrophobicity, hydrophilicity) and sequential patterns. By extracting diverse features that capture these characteristics and combining them, a machine learning model can learn to distinguish bitter peptides from non-bitter ones. The Random Forest algorithm is chosen for its ability to handle high-dimensional data, reduce overfitting, and maintain strong predictive power.

4.2. Core Methodology In-depth (Layer by Layer)

The construction of the Bitter-RF model involves several key steps: dataset preparation, comprehensive feature extraction, feature fusion, model training using Random Forest, and performance evaluation.

4.2.1. Dataset Source

The foundation of Bitter-RF is a high-quality benchmark dataset. The study utilizes the same dataset as previous generations of bitter peptide prediction models (references 22-24) to ensure a fair comparison. This dataset, accessible from http://pmlab.pythonanywhere.com/BERT4Bitter, was originally compiled by manually collecting experimentally validated bitter peptides from various scientific literature.

The dataset characteristics are as follows:

  • Total Records: 640

  • Bitter Peptides: 320 (experimentally validated)

  • Non-Bitter Peptides: 320 (randomly generated from BIOPEP database)

    To objectively evaluate the model's performance, the dataset was rigorously split into:

  • Training Set: Used to train the machine learning model. It contains 512 records (80% of total), specifically 256 bitter peptides and 256 non-bitter peptides.

  • Independent Set: Used to validate the model's generalization ability on unseen data. It contains 128 records (20% of total), specifically 64 bitter peptides and 64 non-bitter peptides.

4.2.2. Feature Extraction

Feature extraction is a critical step in machine learning models based on biological sequence data, as it aims to encode sequences in a way that reveals as much relevant information as possible. The authors used iLearnPlus (reference 37), a platform for sequence analysis, to extract 10 types of features from the bitter peptide sequences.

4.2.2.1. Amino Acid Composition (AAC)

The AAC encoding calculates the fractional frequencies of each of the 20 standard amino acids within a peptide sequence. This feature provides a basic compositional overview of the peptide.

The equation for AAC is: $ f(t) = \frac{N(t)}{N}, \quad t \in \{A, C, \dots, Y\} $ Where:

  • $f(t)$ represents the frequency of amino acid type $t$.
  • $N(t)$ denotes the number of occurrences of amino acid type $t$ in the peptide sequence.
  • $N$ is the total length (number of amino acids) of the peptide sequence.
  • $t \in \{A, C, \dots, Y\}$ indicates that $t$ can be any of the 20 standard amino acids.
  • Dimension: 20 (one for each amino acid type).
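
As an illustration, a minimal Python sketch of the AAC calculation (an assumption-level re-implementation, not the iLearnPlus code used by the authors):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aac(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition f(t) = N(t) / N."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Example: a short, hypothetical peptide sequence
print(aac("GPFPIIV"))
```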

4.2.2.2. Traditional Pseudo-Amino Acid Composition (TPAAC)

TPAAC, also known as type 1 pseudo-amino acid composition, extends AAC by incorporating sequence-order information and physicochemical properties. It considers three specific amino acid properties: hydrophobicity, hydrophilicity, and side-chain mass.

First, the original values for hydrophobicity ($H_1^o(i)$), hydrophilicity ($H_2^o(i)$), and side chain mass ($M^o(i)$) for each of the 20 amino acids $i$ are normalized using a standard normal distribution transformation: $ H_1(i) = \frac{H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_1^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o(i)\right]^2}{20}}} $ $ H_2(i) = \frac{H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[H_2^o(i) - \frac{1}{20}\sum_{i=1}^{20} H_2^o(i)\right]^2}{20}}} $ $ M(i) = \frac{M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)}{\sqrt{\frac{\sum_{i=1}^{20}\left[M^o(i) - \frac{1}{20}\sum_{i=1}^{20} M^o(i)\right]^2}{20}}} $ Where:

  • $H_1(i)$, $H_2(i)$, and $M(i)$ are the normalized hydrophobicity, hydrophilicity, and side chain mass values for amino acid $i$.

  • $H_1^o(i)$, $H_2^o(i)$, and $M^o(i)$ are the original values for amino acid $i$.

  • The terms $\frac{1}{20}\sum_{i=1}^{20} H_1^o(i)$ (and the analogous terms for $H_2^o$ and $M^o$) represent the mean of the respective property over all 20 amino acids.

  • The denominator represents the standard deviation of the respective property over all 20 amino acids.

    Next, a correlation function $\Theta(R_i, R_j)$ between two amino acids $R_i$ and $R_j$ (at positions $i$ and $j$ in the sequence) is defined: $ \Theta(R_i, R_j) = \frac{1}{3}\left\{\left[H_1(R_i) - H_1(R_j)\right]^2 + \left[H_2(R_i) - H_2(R_j)\right]^2 + \left[M(R_i) - M(R_j)\right]^2\right\} $ This function measures the squared difference in hydrophobicity, hydrophilicity, and mass between the two amino acids, averaged over the three properties. The correlation function can also be defined for a single amino acid property or a set of properties: $ \Theta(R_i, R_j) = \left[H_1(R_i) - H_1(R_j)\right]^2 $ $ \Theta(R_i, R_j) = \frac{1}{n}\sum_{k=1}^{n}\left[H_k(R_i) - H_k(R_j)\right]^2 $ Where:

  • $H(R_i)$ is the standardized amino acid property of amino acid $R_i$.

  • $H_k(R_i)$ is the $k$-th attribute in the amino acid attribute set for amino acid $R_i$.

    Sequence order-correlated factors ($\Theta_1, \Theta_2, \dots, \Theta_\lambda$) are then computed: $ \Theta_1 = \frac{1}{N-1}\sum_{i=1}^{N-1}\Theta(R_i, R_{i+1}) $ $ \Theta_2 = \frac{1}{N-2}\sum_{i=1}^{N-2}\Theta(R_i, R_{i+2}) $ $ \dots $ $ \Theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda}\Theta(R_i, R_{i+\lambda}) $ Where:

  • $\lambda$ is a correlation parameter, indicating the maximum sequence separation (lag) considered. It must be less than $N$ (the peptide length). The paper states $\lambda = 1$ for this study.

  • $\Theta_j$ represents the $j$-th sequence order correlation factor, calculated by averaging the correlation function $\Theta(R_i, R_{i+j})$ over all residue pairs separated by $j$ positions (i.e., with $j-1$ intervening residues).

    Finally, the TPAAC descriptor for a protein sequence is defined by combining the amino acid frequencies with these sequence order-correlated factors: $ X_c = \frac{f_c}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (1 \le c \le 20) $ $ X_c = \frac{\omega\,\theta_{c-20}}{\sum_{r=1}^{20} f_r + \omega\sum_{j=1}^{\lambda}\theta_j}, \quad (21 \le c \le 20+\lambda) $ Where:

  • $f_c$ is the frequency of the $c$-th amino acid.

  • $\omega$ is a weighting factor, set to 0.05 in this study.

  • $\sum_{r=1}^{20} f_r$ is the sum of the frequencies of all 20 amino acids (typically 1).

  • $\sum_{j=1}^{\lambda}\theta_j$ is the sum of the $\lambda$ sequence-order correlation factors.

  • For $1 \le c \le 20$, $X_c$ represents the modified frequency of the $c$-th amino acid.

  • For $21 \le c \le 20+\lambda$, $X_c$ represents the sequence-order correlation factors themselves, scaled by $\omega$.

  • Dimension: $20 + \lambda$. With $\lambda = 1$, the dimension is 21.
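
A compact sketch of the TPAAC computation under the definitions above; the property tables here are placeholder values (the published hydrophobicity, hydrophilicity, and side-chain mass tables are not reproduced), so the output is illustrative only:

```python
import math

AA = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder property tables (NOT the published values); illustrative numbers only.
H1_RAW = {aa: i * 0.1 for i, aa in enumerate(AA)}
H2_RAW = {aa: (i % 5) * 0.2 for i, aa in enumerate(AA)}
M_RAW = {aa: 57.0 + i for i, aa in enumerate(AA)}

def normalize(prop: dict) -> dict:
    """Zero-mean, unit-variance normalization over the 20 amino acids (as in the H1/H2/M equations)."""
    mean = sum(prop.values()) / 20
    std = math.sqrt(sum((v - mean) ** 2 for v in prop.values()) / 20)
    return {aa: (v - mean) / std for aa, v in prop.items()}

H1, H2, M = normalize(H1_RAW), normalize(H2_RAW), normalize(M_RAW)

def theta(a: str, b: str) -> float:
    """Correlation function Theta(Ri, Rj): mean squared difference over the three properties."""
    return ((H1[a] - H1[b]) ** 2 + (H2[a] - H2[b]) ** 2 + (M[a] - M[b]) ** 2) / 3

def tpaac(seq: str, lam: int = 1, w: float = 0.05) -> list[float]:
    """(20 + lambda)-dimensional type-1 pseudo amino acid composition."""
    n = len(seq)
    freqs = [seq.count(aa) / n for aa in AA]
    thetas = [sum(theta(seq[i], seq[i + j]) for i in range(n - j)) / (n - j)
              for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

print(len(tpaac("GPFPIIV")))  # 21 when lambda = 1
```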

4.2.2.3. Amphiphilic Pseudo-Amino Acid Composition (APAAC)

APAAC is another type of PseAAC that focuses on the distribution patterns of hydrophobicity and hydrophilicity along the peptide chain. It comprises $20 + 2\lambda$ discrete numbers.

It starts by using the normalized hydrophobicity $H_1(i)$ and hydrophilicity $H_2(i)$ values (from Equations 2 and 3 in TPAAC) to define hydrophobicity and hydrophilicity correlation functions: $ H_{i,j}^1 = H_1(i)\,H_1(j) $ $ H_{i,j}^2 = H_2(i)\,H_2(j) $ Where:

  • $H_{i,j}^1$ and $H_{i,j}^2$ are the correlation values for hydrophobicity and hydrophilicity, respectively, between amino acids at positions $i$ and $j$.

    Next, sequence order factors are formulated as: $ \tau_1 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^1 $ $ \tau_2 = \frac{1}{N-1}\sum_{i=1}^{N-1} H_{i,i+1}^2 $ $ \tau_3 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^1 $ $ \tau_4 = \frac{1}{N-2}\sum_{i=1}^{N-2} H_{i,i+2}^2 $ These continue up to $2\lambda$ factors: $ \tau_{2\alpha-1} = \frac{1}{N-\alpha}\sum_{i=1}^{N-\alpha} H_{i,i+\alpha}^1 $ $ \tau_{2\alpha} = \frac{1}{N-\alpha}\sum_{i=1}^{N-\alpha} H_{i,i+\alpha}^2 $ Where:

  • $\tau_j$ are the sequence order factors.

  • $\alpha$ ranges from 1 to $\lambda$.

  • $N$ is the peptide length.

  • The paper states $\lambda = 1$ for this study, meaning $\alpha$ only takes the value 1, resulting in $\tau_1$ and $\tau_2$.

    Finally, the APAAC descriptor is defined as: $ P_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (1 \le c \le 20) $ $ P_u = \frac{w\,\tau_{u-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{2\lambda}\tau_j}, \quad (21 \le u \le 20 + 2\lambda) $ Where:

  • $f_c$ is the frequency of the $c$-th amino acid.

  • $w$ is the weighting factor, set to 0.5 in this study.

  • $\lambda$ is the correlation parameter, set to 1 in this study.

  • For $1 \le c \le 20$, $P_c$ represents the modified frequency of the $c$-th amino acid.

  • For $21 \le u \le 20 + 2\lambda$, $P_u$ represents the sequence-order correlation factors themselves, scaled by $w$.

  • Dimension: $20 + 2\lambda$. With $\lambda = 1$, the dimension is 22.

4.2.2.4. Adaptive Skip Dinucleotide Composition (ASDC)

ASDC is a modified dipeptide composition that considers the relationships between non-adjacent residues, accounting for intervening peptides.

The feature vector for ASDC is defined as: $ \mathrm{ASDC} = (f_{\nu 1}, f_{\nu 2}, \dots, f_{\nu 400}) $ $ f_{\nu i} = \frac{\sum_{g=1}^{L-1} O_i^g}{\sum_{i=1}^{400}\sum_{g=1}^{L-1} O_i^g} $ Where:

  • $f_{\nu i}$ represents the occurrence frequency of the $i$-th possible dipeptide, considering all possible skips.
  • $O_i^g$ is the number of occurrences of the $i$-th dipeptide with $g-1$ intervening amino acids (i.e., at a distance of $g$).
  • $L$ is the length of the peptide sequence.
  • The sum $\sum_{g=1}^{L-1} O_i^g$ counts all occurrences of the $i$-th dipeptide type, irrespective of the skip distance.
  • The denominator normalizes these counts by the total number of all possible dipeptides at all possible skip distances.
  • Dimension: 400 (since there are $20 \times 20 = 400$ possible dipeptides).
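
A minimal sketch of ASDC following the definition above (an illustrative re-implementation, not the iLearnPlus code):

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AA, repeat=2)]  # 400 ordered pairs

def asdc(seq: str) -> list[float]:
    """Adaptive skip dipeptide composition: count every ordered residue pair (i, j)
    with j > i, i.e. all skip distances g = 1 .. L-1, then normalize by the total."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    L = len(seq)
    for i in range(L - 1):
        for j in range(i + 1, L):
            counts[seq[i] + seq[j]] += 1
    total = sum(counts.values())  # = L * (L - 1) / 2 pairs in total
    return [counts[dp] / total for dp in DIPEPTIDES]

print(sum(asdc("GPFPIIV")))  # the 400 frequencies sum to 1.0
```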

4.2.2.5. Di-peptide Composition (DPC)

DPC describes the frequencies of all 400 possible dipeptide combinations (e.g., AA, AC, ..., YY) in a peptide sequence. It captures information about adjacent amino acid pairs.

The calculation method for DPC is: $ D(r,s) = \frac{N_{rs}}{N-1}, \quad r, s \in \{A, C, D, \dots, Y\} $ Where:

  • $D(r,s)$ is the frequency of the dipeptide formed by amino acid type $r$ followed by amino acid type $s$.
  • $N_{rs}$ is the number of times the dipeptide combination $r$-$s$ appears in the peptide sequence.
  • $N$ is the total length of the peptide sequence.
  • $N-1$ is the total number of adjacent dipeptides in a sequence of length $N$.
  • Dimension: 400.

4.2.2.6. Dipeptide Deviation from Expected Mean (DDE)

DDE is a feature that quantifies how much the observed frequency of a dipeptide deviates from its theoretically expected frequency. It uses three parameters: the observed dipeptide composition ($D_c$), the theoretical mean ($T_m$), and the theoretical variance ($T_\nu$).

  • $D_c$ is the observed dipeptide composition, calculated in the same way as DPC above.

    The theoretical mean and variance are calculated based on codon usage frequencies: $ T_m(r,s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N} $ $ T_\nu(r,s) = \frac{T_m(r,s)\left(1 - T_m(r,s)\right)}{N-1} $ Where:

  • $T_m(r,s)$ is the theoretical mean frequency for the dipeptide $r$-$s$.

  • $T_\nu(r,s)$ is the theoretical variance for the dipeptide $r$-$s$.

  • $C_r$ is the number of codons that encode amino acid type $r$.

  • $C_s$ is the number of codons that encode amino acid type $s$.

  • $C_N$ is the total number of possible codons (excluding stop codons).

  • $N-1$ is the total number of possible dipeptide positions.

    Finally, DDE for a dipeptide $r$-$s$ is calculated as: $ DDE(r,s) = \frac{D_c(r,s) - T_m(r,s)}{T_\nu(r,s)} $ Where:

  • $D_c(r,s)$ is the observed frequency of the dipeptide $r$-$s$.

  • Dimension: 400.
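
A sketch of DDE following the formulas as written above; the codon counts below are the standard genetic-code values (61 sense codons), which is an assumption about the counts used in practice:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
# Number of sense codons per amino acid in the standard genetic code (61 sense codons total).
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
          "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2}
CN = sum(CODONS.values())  # 61

def dde(seq: str) -> list[float]:
    """Dipeptide deviation from expected mean, following the Dc/Tm/Tv definitions above."""
    n_pairs = len(seq) - 1
    out = []
    for r, s in product(AA, repeat=2):
        dc = sum(seq[i:i + 2] == r + s for i in range(n_pairs)) / n_pairs  # observed DPC
        tm = (CODONS[r] / CN) * (CODONS[s] / CN)                           # theoretical mean
        tv = tm * (1 - tm) / n_pairs                                       # theoretical variance
        out.append((dc - tm) / tv)
    return out  # 400-dimensional

print(len(dde("GPFPIIV")))
```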

4.2.2.7. Grouped Amino Acid Composition (GAAC)

GAAC reduces the dimensionality of AAC by grouping the 20 amino acids into 5 categories based on their shared physicochemical properties. It then calculates the frequencies of these groups.

The five groups are:

  • Aliphatic group ($g_1$): G, A, V, L, M, I

  • Aromatic group ($g_2$): F, Y, W

  • Positively charged group ($g_3$): K, R, H

  • Negatively charged group ($g_4$): D, E

  • Uncharged group ($g_5$): S, T, C, P, N, Q

    The frequency calculation is: $ f(g) = \frac{N(g)}{N}, \quad g \in \{g_1, g_2, g_3, g_4, g_5\} $ Where:

  • $f(g)$ is the frequency of amino acids belonging to group $g$.

  • $N(g)$ is the number of amino acids in the sequence that belong to group $g$.

  • $N$ is the total length of the peptide sequence.

  • Dimension: 5.
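
A minimal sketch of GAAC using the five groups listed above:

```python
GROUPS = {
    "g1": set("GAVLMI"),   # aliphatic
    "g2": set("FYW"),      # aromatic
    "g3": set("KRH"),      # positively charged
    "g4": set("DE"),       # negatively charged
    "g5": set("STCPNQ"),   # uncharged
}

def gaac(seq: str) -> list[float]:
    """Grouped amino acid composition: frequency of each of the 5 groups, f(g) = N(g) / N."""
    n = len(seq)
    return [sum(aa in members for aa in seq) / n for members in GROUPS.values()]

print(gaac("GPFPIIV"))  # 5-dimensional vector summing to 1.0
```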

4.2.2.8. Grouped Dipeptide Composition (GDPC)

GDPC is a variant of DPC that uses the same 5 amino acid groups defined in GAAC. It calculates the frequencies of dipeptides formed by these groups.

The feature consists of 25 descriptors (5 groups $\times$ 5 groups), calculated as: $ f(r,s) = \frac{N_{rs}}{N-1}, \quad r, s \in \{g_1, g_2, g_3, g_4, g_5\} $ Where:

  • $f(r,s)$ is the frequency of a dipeptide whose first amino acid belongs to group $r$ and whose second belongs to group $s$.
  • $N_{rs}$ is the number of occurrences of dipeptides whose residues belong to groups $r$ and $s$, respectively.
  • $N$ is the total length of the peptide sequence.
  • Dimension: 25.

4.2.2.9. Sequence-order-coupling number (SOCNumber)

SOCNumber captures sequence-order information by calculating the sum of squared distances between amino acids at specific separations (lags).

The $d$-th rank sequence-order-coupling number ($\tau_d$) is calculated as: $ \tau_d = \sum_{i=1}^{N-d}\left(d_{i,i+d}\right)^2, \quad d = 1, 2, \dots, nlag $ Where:

  • $\tau_d$ is the $d$-th rank sequence-order-coupling number.
  • $d$ is the lag, representing the distance between two amino acids.
  • $N$ is the length of the peptide sequence.
  • $d_{i,i+d}$ describes the "distance" between the amino acid at position $i$ and the amino acid at position $i+d$. This distance is derived from a pre-defined amino acid distance matrix; the paper mentions using the Schneider-Wrede (physicochemical) and Grantham (chemical) distance matrices for this purpose.
  • $nlag$ denotes the maximum value of the lag, with a default value of 30.
  • Dimension: $nlag$. With $nlag = 30$, the dimension would be 30. However, Table 1 states a dimension of 2, which might imply a specific configuration in iLearnPlus or that only two specific lags (e.g., $d = 1, 2$) were used for this feature. The discrepancy between the formula (implying $nlag$ dimensions) and the table (showing 2 dimensions) is noted; following the table, the dimension is taken as 2.
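
A sketch of the sequence-order-coupling numbers; the distance matrix below is a uniform placeholder rather than the Schneider-Wrede or Grantham matrix, and $nlag$ is set to 2 to match the dimension reported in Table 1:

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def soc_numbers(seq: str, dist: dict, nlag: int = 2) -> list[float]:
    """Sequence-order-coupling numbers tau_d = sum_i d(i, i+d)^2 for d = 1..nlag.
    `dist` maps an amino-acid pair (a, b) to a distance taken from a matrix such as
    Schneider-Wrede or Grantham (placeholder values are used below)."""
    taus = []
    for d in range(1, nlag + 1):
        taus.append(sum(dist[(seq[i], seq[i + d])] ** 2 for i in range(len(seq) - d)))
    return taus

# Placeholder distance matrix: every pair gets distance 1.0 (illustrative only).
toy_dist = {(a, b): 1.0 for a in AA for b in AA}
print(soc_numbers("GPFPIIV", toy_dist, nlag=2))  # 2 values, matching the dimension in Table 1
```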

4.2.2.10. Quasi-sequence-order (QsOrder)

QsOrder is another feature that combines amino acid composition with sequence-order information, utilizing the SOCNumber concept. It produces a set of descriptors reflecting both composition and sequence patterns.

For each amino acid (20 types), the QsOrder is defined as: $ X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad r = 1, 2, 3, \dots, 20 $ Where:

  • $X_r$ is the QsOrder descriptor for the $r$-th amino acid type.

  • $f_r$ represents the normalized occurrence frequency of the $r$-th amino acid type.

  • $w$ is the weighting factor, defined as 0.1.

  • $nlag$ denotes the maximum value of the lag (default: 30).

  • $\tau_d$ is the $d$-th rank sequence-order-coupling number, as defined in SOCNumber.

    For the other $nlag$ quasi-sequence-order descriptors (30 in total, implying $nlag = 30$), QsOrder is defined as: $ X_d = \frac{w\,\tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{nlag}\tau_d}, \quad d = 21, 22, \dots, 20 + nlag $ Where:

  • $X_d$ represents the additional QsOrder descriptors, which are based primarily on the sequence-order-coupling numbers.

  • Dimension: $20 + nlag$. With $nlag = 30$, the dimension would be 50, whereas Table 1 states a dimension of 42. This discrepancy is noted, but the analysis follows the dimension stated in the table for consistency with the results section.

4.2.3. Feature Fusion Processing

After extracting the 10 types of features, they are concatenated to form a single, high-dimensional feature vector for each peptide.

  • Initial Fusion: The concatenation of all 10 features results in a feature vector with 1,337 dimensions.
  • De-zeroing (Feature Reduction): A practical step where any feature column (dimension) that contains only zero values across all samples is removed. Such columns provide no discriminative information and can be safely eliminated. After this de-zero operation, the total number of features used for model learning is reduced to 1,206.
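
A minimal sketch of the fusion and de-zero steps with NumPy (illustrative; the authors' exact preprocessing code is not reproduced here):

```python
import numpy as np

def fuse_and_dezero(feature_blocks: list) -> np.ndarray:
    """Concatenate per-peptide feature blocks column-wise and drop all-zero columns."""
    fused = np.hstack(feature_blocks)      # e.g. 640 x 1,337 when all 10 blocks are stacked
    keep = ~np.all(fused == 0, axis=0)     # keep columns with at least one non-zero value
    return fused[:, keep]                  # e.g. 640 x 1,206 after the de-zero operation

# Toy example with three small blocks for 4 peptides
blocks = [np.ones((4, 3)), np.zeros((4, 2)), np.arange(8).reshape(4, 2)]
print(fuse_and_dezero(blocks).shape)  # (4, 5): the two all-zero columns are removed
```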

4.2.4. Random Forest (RF)

The Random Forest algorithm is employed as the primary machine learning classifier for Bitter-RF.

  • Ensemble Method: RF is an ensemble learning method that builds multiple decision trees during training.

  • Randomness: Each tree is constructed using a random subset of the training data (bootstrap aggregating or bagging) and, at each split in a tree, a random subset of features is considered. This inherent randomness helps to reduce correlation among trees and prevents overfitting.

  • Prediction: For classification, the final prediction is made by taking a majority vote from the predictions of all individual trees in the forest.

  • Advantages: RF is known for its high accuracy, robustness to noise, and strong adaptability to high-dimensional data, which makes it suitable for the comprehensive feature set developed in this study. The paper explicitly mentions that RF can "reduce the possibility of overfitting, improve the ability to resist noise, and has strong adaptability to high-dimensional data."

    The schematic framework of Bitter-RF for bitter peptide prediction is visually represented in Figure 1. It outlines the process from data collection to feature extraction, feature fusion, model training with RF, and ultimately prediction.

The following figure (Figure 1 from the original paper) illustrates the construction workflow of the Bitter-RF model:

FIGURE 1: Construction workflow of the Bitter-RF model, comprising four steps: data collection, feature fusion, machine learning method selection, and model evaluation, illustrating the overall study design.
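
A minimal end-to-end sketch of the Bitter-RF training setup with scikit-learn; the feature matrix, split seed, and RF hyperparameters below are placeholders rather than the authors' exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder fused feature matrix (640 peptides x 1,206 features) and labels.
rng = np.random.default_rng(0)
X = rng.random((640, 1206))
y = np.repeat([1, 0], 320)  # 320 bitter (1), 320 non-bitter (0)

# 8:2 stratified split into training and independent sets, as described in Section 4.2.1.
X_tr, X_ind, y_tr, y_ind = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, random_state=42)  # hyperparameters are illustrative
print("10-fold CV AUROC:", cross_val_score(clf, X_tr, y_tr, cv=10, scoring="roc_auc").mean())

clf.fit(X_tr, y_tr)
print("independent-set AUROC:", roc_auc_score(y_ind, clf.predict_proba(X_ind)[:, 1]))
```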

5. Experimental Setup

5.1. Datasets

The study utilized a publicly available benchmark dataset, the same one used by previous bitter peptide prediction models (iBitter-SCM, BERT4Bitter, iBitter-Fuse, iBitter-DRLF), to ensure comparability and reliability of results.

  • Source: The dataset was originally collected by manually curating experimentally validated bitter peptides from various scientific literature and can be accessed from http://pmlab.pythonanywhere.com/BERT4Bitter. Non-bitter peptides were randomly generated from the BIOPEP database.
  • Characteristics:
    • Total Samples: 640 records.
    • Classes: 320 experimentally validated bitter peptides (positive samples) and 320 non-bitter peptides (negative samples).
  • Data Split: The dataset was divided using an 8:2 ratio to create training and independent validation sets:
    • Training Set: 512 records (256 bitter peptides, 256 non-bitter peptides). This set is used for model learning.
    • Independent Set: 128 records (64 bitter peptides, 64 non-bitter peptides). This set is crucial for evaluating the model's generalization performance on entirely unseen data.
  • Rationale for Choice: Using the same dataset as previous models allows for a direct and fair comparison of the Bitter-RF model's performance against existing state-of-the-art methods.

5.2. Evaluation Metrics

To assess the training effect and predictive ability of the model, the authors used several standard classification metrics. Bitter peptides were defined as positive samples, and non-bitter peptides as negative samples.

For context, the fundamental counts for these metrics are:

  • TP (True Positives): Number of bitter peptides correctly predicted as bitter.

  • FN (False Negatives): Number of bitter peptides incorrectly predicted as non-bitter.

  • TN (True Negatives): Number of non-bitter peptides correctly predicted as non-bitter.

  • FP (False Positives): Number of non-bitter peptides incorrectly predicted as bitter.

    Here are the evaluation metrics used, with their conceptual definitions, mathematical formulas, and symbol explanations:

5.2.1. Sensitivity (Sn)

  • Conceptual Definition: Sensitivity, also known as Recall or True Positive Rate, measures the proportion of actual positive cases (bitter peptides) that are correctly identified by the model. A high sensitivity indicates that the model is good at catching positive instances.
  • Mathematical Formula: $ S n = \frac { T P } { ( T P + F N ) } $
  • Symbol Explanation:
    • Sn: Sensitivity
    • TP: True Positives
    • FN: False Negatives

5.2.2. Specificity (Sp)

  • Conceptual Definition: Specificity, also known as True Negative Rate, measures the proportion of actual negative cases (non-bitter peptides) that are correctly identified by the model. A high specificity indicates that the model is good at correctly identifying negative instances and avoiding false alarms.
  • Mathematical Formula: $ S p = { \frac { T N } { ( T N + F P ) } } $
  • Symbol Explanation:
    • Sp: Specificity
    • TN: True Negatives
    • FP: False Positives

5.2.3. Accuracy (ACC)

  • Conceptual Definition: Accuracy measures the proportion of total predictions that were correct. It is a straightforward metric but can be misleading in imbalanced datasets.
  • Mathematical Formula: $ A C C = { \frac { ( T P + T N ) } { ( T P + T N + F P + F N ) } } $
  • Symbol Explanation:
    • ACC: Accuracy
    • TP: True Positives
    • TN: True Negatives
    • FP: False Positives
    • FN: False Negatives

5.2.4. Matthew's Correlation Coefficient (MCC)

  • Conceptual Definition: MCC is a comprehensive and robust metric for binary classification that is considered a balanced measure even if the classes are of very different sizes. It takes into account true and false positives and negatives and returns a value between -1 and +1. A value of +1 indicates a perfect prediction, 0 indicates a random prediction, and -1 indicates a completely opposite prediction.
  • Mathematical Formula: $ M C C = { \frac { \left( T N \times \ T P - F N \times \ F P \right) } { \sqrt { \left( T P + F P \right) \left( T P + F N \right) \left( T N + F P \right) \left( T N + F N \right) } } } $
  • Symbol Explanation:
    • MCC: Matthew's Correlation Coefficient
    • TP: True Positives
    • TN: True Negatives
    • FP: False Positives
    • FN: False Negatives

5.2.5. Area Under the Receiver Operating Characteristic curve (AUROC)

  • Conceptual Definition: AUROC is a performance metric for binary classifiers. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings. The AUROC value represents the area under this curve. A higher AUROC (closer to 1) indicates a better ability of the model to distinguish between classes. An AUROC of 0.5 suggests no discrimination (like random guessing), while an AUROC of 1.0 indicates perfect discrimination. The paper states that AUROC can be used as a standard for evaluating the quality of the binary classification model.
  • Mathematical Formula: The AUROC itself does not have a single simple formula based on TP, TN, FP, FN directly, as it is derived from integrating the ROC curve. However, the ROC curve is generated by varying the classification threshold and calculating the TPR and FPR at each threshold.
    • $TPR = Sensitivity = \frac{TP}{TP + FN}$
    • $FPR = 1 - Specificity = \frac{FP}{TN + FP}$
    • The AUROC is then the integral of the TPR with respect to the FPR over the interval [0, 1].
  • Symbol Explanation:
    • AUROC: Area Under the Receiver Operating Characteristic curve.
    • TPR: True Positive Rate (Sensitivity).
    • FPR: False Positive Rate (1 - Specificity).
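
A minimal sketch computing all five evaluation metrics from predictions with scikit-learn (illustrative; the counts follow the definitions above and the example labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def report(y_true, y_pred, y_score):
    """Compute Sn, Sp, ACC, MCC and AUROC from labels, hard predictions and scores."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                      # sensitivity / recall
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    mcc = matthews_corrcoef(y_true, y_pred)  # Matthew's correlation coefficient
    auroc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "AUROC": auroc}

# Toy example with hypothetical predictions
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6])
print(report(y_true, y_pred, y_score))
```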

5.3. Baselines

The paper compared Bitter-RF against two categories of baseline models:

  1. Other Traditional Machine Learning Methods: To validate the choice of Random Forest, the fused features were also tested with:

    • Support Vector Machine (SVM)
    • Light Gradient Boosting Machine (LightGBM)
    • Decision Trees (DT)
    • Logistic Regression (LR) These methods are representative of commonly used and powerful classification algorithms in bioinformatics.
  2. Existing State-of-the-Art Bitter Peptide Prediction Models: To demonstrate the superior performance of Bitter-RF in the specific domain of bitter peptide prediction, it was compared with four previously published sequence-based models, which the authors refer to as "generations":

    • iBitter-SCM: The first-generation model, based on dipeptide propensity scores (Charoenkwan et al., 2020).
    • BERT4Bitter: The second-generation model, utilizing deep learning (BERT) (Charoenkwan et al., 2021).
    • iBitter-Fuse: The third-generation model, combining five peptide features with SVM (Charoenkwan et al., 2021).
    • iBitter-DRLF: The fourth-generation model, which uses deep learning for feature extraction and LightGBM for prediction (Zhang et al., 2022). These models represent the progression of computational approaches for bitter peptide identification and serve as direct benchmarks for Bitter-RF.

6. Results & Analysis

6.1. Single-feature-based results

The initial phase of evaluation involved training Random Forest models using each of the 10 extracted features individually. This step helps to understand the discriminative power of each feature type. The performance was assessed using both 10-fold cross-validation on the training set and an independent validation set.

The following are the results from Table 1 of the original paper:

| Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | TPAAC | 21 | 0.90 | 0.83 | 0.78 | 0.80 | 0.61 |
| | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | ASDC | 400 | 0.88 | 0.89 | 0.68 | 0.79 | 0.59 |
| | DPC | 400 | 0.86 | 0.87 | 0.64 | 0.76 | 0.53 |
| | DDE | 400 | 0.83 | 0.84 | 0.73 | 0.78 | 0.57 |
| | GAAC | 5 | 0.75 | 0.72 | 0.66 | 0.69 | 0.39 |
| | GDPC | 25 | 0.78 | 0.75 | 0.71 | 0.73 | 0.46 |
| | SOCNumber | 2 | 0.70 | 0.66 | 0.62 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | TPAAC | 21 | 0.94 | 0.83 | 0.86 | 0.84 | 0.69 |
| | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | ASDC | 400 | 0.92 | 0.89 | 0.75 | 0.82 | 0.65 |
| | CKSAAGP | 100 | 0.87 | 0.77 | 0.81 | 0.79 | 0.58 |
| | DPC | 400 | 0.89 | 0.88 | 0.70 | 0.79 | 0.59 |
| | DDE | 400 | 0.90 | 0.89 | 0.84 | 0.87 | 0.74 |
| | GAAC | 5 | 0.76 | 0.83 | 0.64 | 0.73 | 0.48 |
| | GDPC | 25 | 0.80 | 0.73 | 0.72 | 0.73 | 0.45 |
| | SOCNumber | 2 | 0.73 | 0.59 | 0.69 | 0.64 | 0.28 |
| | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |

Analysis of Single-Feature Results:

  • Best Performers: On the 10-fold cross-validation, AAC shows the highest AUROC (0.91), ACC (0.85), and MCC (0.69). On the independent set, APAAC performs best with an AUROC of 0.97, followed closely by AAC (0.96) and QSOrder (0.95). These features, particularly AAC and APAAC, seem to capture highly discriminative information.

  • Worst Performers: SOCNumber consistently performs the worst in both cross-validation (AUROC 0.70) and independent validation (AUROC 0.73). This is attributed to its low dimensionality (2 features), indicating it might not provide sufficient information on its own. GAAC and GDPC also show relatively lower performance.

  • Dimension vs. Performance: Interestingly, features with higher dimensions like ASDC, DPC, and DDE (all 400 dimensions) do not necessarily outperform lower-dimensional features like AAC (20 dimensions) or APAAC (22 dimensions). This suggests that the quality and relevance of the information encoded by the feature are more important than merely the number of dimensions.

  • Importance of Physicochemical Properties: The introduction mentions that hydrophobic amino acids and their positions are crucial for bitter taste. Features like TPAAC and APAAC, which explicitly incorporate hydrophobicity and hydrophilicity, show good performance, supporting this premise.

  • Independent Set Insights: The independent set validation generally shows higher AUROC values for several features (e.g., AAC from 0.91 to 0.96, APAAC from 0.89 to 0.97), suggesting that the models are generalizing well, and some features might be even more robust on unseen data.

    This analysis suggests that while some single features are strong predictors, there is potential for improvement by combining their complementary information. The paper notes that "some single features with poor performance have rich information that AAC does not have and can improve prediction performance," motivating the next step of feature fusion.

6.2. Fusion Feature Processing

Prior to training the final model, the 10 individual features were combined into a single feature vector. This process also involved a de-zero operation to remove redundant or non-informative dimensions.

The following are the results from Table 2 of the original paper:

| Feature | Dimension | Dimension after de-zero operation |
| --- | --- | --- |
| AAC | 20 | 20 |
| TPAAC | 21 | 21 |
| APAAC | 22 | 22 |
| ASDC | 400 | 366 |
| DPC | 400 | 303 |
| DDE | 400 | 400 |
| GAAC | 5 | 5 |
| GDPC | 25 | 25 |
| SOCNumber | 2 | 2 |
| QSOrder | 42 | 42 |
| Total of features | 1,337 | 1,206 |

Analysis of Feature Fusion:

  • Total Initial Dimensions: Summing the individual dimensions of the 10 features gives 20 + 21 + 22 + 400 + 400 + 400 + 5 + 25 + 2 + 42 = 1,337.
  • De-zeroing Effect: The de-zero operation, which removes columns containing only zero values, reduced the total feature dimension from 1,337 to 1,206. This means 131 features (1337 - 1206) were found to be all zeros and were removed. This reduction is beneficial as it removes non-informative features, potentially reducing noise and computational load without losing discriminative power.
  • Specific Reductions: ASDC saw a reduction from 400 to 366, and DPC from 400 to 303. This indicates that some specific dipeptide combinations or skip dipeptide combinations were absent in the dataset, leading to zero counts for those features. Other features maintained their original dimensions, implying all their components had non-zero values across the dataset.

6.3. Fusion-feature-based results

The fused feature set (1,206 dimensions) was then used to train an RF model, and its performance was compared against the three best-performing single features (AAC, APAAC, QSOrder) identified in the single-feature analysis.

The following are the results from Table 3 of the original paper:

| ML method | Cross-validation | Feature | Dimension | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 10-fold cross-validation | AAC | 20 | 0.91 | 0.85 | 0.84 | 0.85 | 0.69 |
| | | APAAC | 22 | 0.89 | 0.83 | 0.81 | 0.82 | 0.64 |
| | | QSOrder | 42 | 0.89 | 0.82 | 0.82 | 0.82 | 0.64 |
| | | Fusion | 1,206 | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| | Independent set validation | AAC | 20 | 0.96 | 0.91 | 0.89 | 0.90 | 0.80 |
| | | APAAC | 22 | 0.97 | 0.89 | 0.91 | 0.90 | 0.80 |
| | | QSOrder | 42 | 0.95 | 0.92 | 0.84 | 0.88 | 0.77 |
| | | Fusion | 1,206 | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

Analysis of Fusion-Feature Results:

  • 10-fold Cross-Validation:
    • The Fusion feature set achieved an AUROC of 0.93, Sn of 0.86, Sp of 0.84, Acc of 0.85, and MCC of 0.70.
    • This is an improvement over AAC (AUROC 0.91, MCC 0.69), APAAC (AUROC 0.89, MCC 0.64), and QSOrder (AUROC 0.89, MCC 0.64). The MCC for the fusion set (0.70) is notably higher, indicating a better overall balance in prediction performance.
  • Independent Set Validation:
    • The Fusion feature set demonstrated excellent performance with an AUROC of 0.98, Sn of 0.94, Sp of 0.94, Acc of 0.94, and MCC of 0.88.

    • This is a significant improvement over all single features: AAC (AUROC 0.96, MCC 0.80), APAAC (AUROC 0.97, MCC 0.80), and QSOrder (AUROC 0.95, MCC 0.77). The AUROC of 0.98 for the Fusion set is the highest achieved, and the MCC of 0.88 is substantially better than any single feature, signifying a highly robust and accurate model.

      The results strongly indicate that fusing diverse features provides a more comprehensive representation of bitter peptides, leading to enhanced predictive capabilities. The statement "the prediction performance of fusion features was improved or remained unchanged compared with single feature prediction" is validated, with clear improvements observed, especially on the independent set.

The following figure (Figure 2 from the original paper) shows the performance of the Bitter-RF model in bitter peptide classification. Panel A presents ROC curves from 10-fold cross-validation and an independent validation set, panel B compares ROC curves of different feature sets, and panels C and D compare average and specific performance metrics (AUC, Sn, Sp, Acc, Mcc) for different feature combinations.

FIGURE 2

Figure 2 Interpretation:

  • Panel A (ROC curves for Fusion feature): Shows high AUROC values for both 10-fold cross-validation (0.93) and independent validation (0.98), visually confirming the model's strong discriminatory power. The curve for the independent set is closer to the top-left corner, indicating better performance.
  • Panel B (Comparison of ROC curves for different features on independent set): Visually confirms that the Fusion feature set has the highest AUROC (0.98), surpassing APAAC (0.97), AAC (0.96), and QSOrder (0.95). This plot clearly supports the benefit of feature fusion.
  • Panel C (Performance comparison on 10-fold cross-validation): A bar chart comparing AUROC, Sn, Sp, Acc, and Mcc for Fusion, AAC, APAAC, and QSOrder. The Fusion bar is consistently higher or comparable, particularly for AUROC and Mcc.
  • Panel D (Performance comparison on independent set): Similar to Panel C, but for the independent set. Here, the Fusion feature shows a clear advantage across all metrics, especially AUROC and Mcc, further solidifying its superior performance.

6.4. Comparison with other machine learning methods on fusion features

To further validate the choice of Random Forest, the Fusion feature set was used to train and test other traditional machine learning algorithms: SVM, LightGBM, Decision Trees (DT), and Logistic Regression (LR).

The following are the results from Table 4 of the original paper:

| Cross-validation | Feature | ML method | AUROC | Sn | Sp | Acc | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10-fold cross-validation | Fusion | SVM | 0.67 | 0.51 | 0.80 | 0.66 | 0.34 |
| | Fusion | LightGBM | 0.92 | 0.85 | 0.85 | 0.85 | 0.70 |
| | Fusion | DT | 0.80 | 0.83 | 0.77 | 0.80 | 0.60 |
| | Fusion | LR | 0.82 | 0.74 | 0.77 | 0.76 | 0.52 |
| | Fusion | RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | Fusion | SVM | 0.74 | 0.61 | 0.78 | 0.70 | 0.40 |
| | Fusion | LightGBM | 0.97 | 0.92 | 0.91 | 0.91 | 0.83 |
| | Fusion | DT | 0.94 | 0.94 | 0.84 | 0.89 | 0.78 |
| | Fusion | LR | 0.89 | 0.80 | 0.84 | 0.82 | 0.64 |
| | Fusion | RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

The following figure (Figure 3 from the original paper) shows the performance comparison of different machine learning models across various metrics (AUC, Sn, Sp, Acc, Mcc) in two parts, A and B. The Random Forest (RF) model stands out with superior performance on most metrics compared to other models.

FIGURE 3

Analysis of ML Method Comparison:

  • 10-fold Cross-Validation:
    • RF (AUROC 0.93, MCC 0.70) and LightGBM (AUROC 0.92, MCC 0.70) showed the best performance, with RF slightly edging out LightGBM in AUROC.
    • SVM performed significantly worse (AUROC 0.67, MCC 0.34), suggesting it might not be well-suited for this dataset or feature space, or its hyperparameters were not optimally tuned.
    • DT and LR showed intermediate performance.
  • Independent Set Validation:
    • RF demonstrated superior performance with an AUROC of 0.98 and MCC of 0.88.

    • LightGBM was a close second (AUROC 0.97, MCC 0.83).

    • DT also performed well (AUROC 0.94, MCC 0.78), indicating some ability to generalize, but its Sp (0.84) was clearly lower than RF's (0.94).

    • SVM again showed the lowest performance (AUROC 0.74, MCC 0.40).

      The results confirm that the RF method, when applied to the fused features, is either superior or equal to other tested machine learning methods across various indicators, especially on the critical independent validation set. This validates the choice of Random Forest as the classifier for Bitter-RF.

6.5. Comparison with existing models

To ultimately demonstrate the effectiveness of Bitter-RF, its performance was compared with four existing state-of-the-art bitter peptide prediction models. The performance indicators for these models were obtained from relevant literature.

The following are the results from Table 5 of the original paper:

| Cross-validation | Classifier | AUROC | Sn | Sp | Acc | Mcc |
|---|---|---|---|---|---|---|
| 10-fold cross-validation | iBitter-SCM | 0.90 | 0.91 | 0.83 | 0.87 | 0.75 |
| 10-fold cross-validation | BERT4Bitter | 0.92 | 0.87 | 0.85 | 0.86 | 0.73 |
| 10-fold cross-validation | iBitter-Fuse | 0.94 | 0.92 | 0.92 | 0.92 | 0.84 |
| 10-fold cross-validation | iBitter-DRLF | 0.95 | 0.89 | 0.89 | 0.89 | 0.78 |
| 10-fold cross-validation | Bitter-RF | 0.93 | 0.86 | 0.84 | 0.85 | 0.70 |
| Independent set validation | iBitter-SCM | 0.90 | 0.84 | 0.84 | 0.84 | 0.69 |
| Independent set validation | BERT4Bitter | 0.96 | 0.94 | 0.91 | 0.92 | 0.84 |
| Independent set validation | iBitter-Fuse | 0.93 | 0.94 | 0.92 | 0.93 | 0.86 |
| Independent set validation | iBitter-DRLF | 0.98 | 0.92 | 0.98 | 0.94 | 0.89 |
| Independent set validation | Bitter-RF | 0.98 | 0.94 | 0.94 | 0.94 | 0.88 |

The following figure (Figure 4 from the original paper) shows radar charts of the overall and detailed performance of Bitter-RF and other models across various metrics (AUC, Sn, Sp, Acc, Mcc), clearly comparing the strengths and weaknesses of each model.

Analysis of Comparison with Existing Models:

  • 10-fold Cross-Validation:
    • Bitter-RF (AUROC 0.93, MCC 0.70) performed similarly to BERT4Bitter (AUROC 0.92, MCC 0.73) but slightly lower than iBitter-Fuse (AUROC 0.94, MCC 0.84) and iBitter-DRLF (AUROC 0.95, MCC 0.78). This indicates that on internal cross-validation, some other models, especially those using more complex feature engineering or deep learning, might have a slight edge.
  • Independent Set Validation: This is the most crucial comparison for generalization ability.
    • Bitter-RF achieved an outstanding AUROC of 0.98, matching the best performance of iBitter-DRLF.
    • Bitter-RF also achieved high Sn (0.94), Sp (0.94), Acc (0.94), and MCC (0.88).
    • Compared to iBitter-SCM (AUROC 0.90, MCC 0.69) and iBitter-Fuse (AUROC 0.93, MCC 0.86), Bitter-RF shows clear improvements in AUROC and competitive MCC.
    • Against BERT4Bitter (AUROC 0.96, MCC 0.84), Bitter-RF shows higher AUROC and comparable MCC.
    • While iBitter-DRLF has a slightly higher MCC (0.89) and Sp (0.98), Bitter-RF matches its AUROC and has a higher Sn (0.94 vs 0.92). This suggests Bitter-RF is very competitive.

Key Takeaways:

  • Bitter-RF demonstrates strong generalization ability on unseen data, often outperforming or matching the latest deep learning-based models in terms of AUROC.
  • The Sn, Sp, Acc, and MCC of Bitter-RF generally match or exceed those of the three earlier predictors (iBitter-SCM, BERT4Bitter, and iBitter-Fuse), especially on the independent set.
  • A significant advantage of Bitter-RF is that it achieves this high performance using a traditional machine learning method (Random Forest), which typically consumes fewer computational resources compared to deep learning approaches (like BERT4Bitter or deep learning pre-training in iBitter-DRLF). This makes Bitter-RF more accessible and practical.
  • The authors acknowledge that iBitter-DRLF has a slightly better MCC, but Bitter-RF is very competitive while using a less computationally intensive method.

6.6. Ablation Studies / Parameter Analysis

The paper does not present an ablation study in the traditional sense (removing components of the Bitter-RF model to measure their individual impact). However, the comparison of the Fusion feature set with single features (Table 3 and Figure 2) serves as an indirect form of ablation, demonstrating the synergistic effect of combining different feature types: the fused set of all 10 features outperforms any single feature alone. The de-zero operation on features (Table 2) is likewise a form of preprocessing that improved efficiency, and potentially robustness, by removing non-informative dimensions. The paper mentions optimizing the parameters of the feature encodings, but no detailed tuning results or sensitivity analyses for the Random Forest hyperparameters (e.g., number of trees, maximum depth) are provided.
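As a rough illustration of the de-zero idea (dropping feature dimensions that are zero for every training peptide), here is a minimal sketch; the authors only summarize this preprocessing, so the details below are assumptions:

```python
import numpy as np

def drop_all_zero_columns(X):
    """Remove feature dimensions that are zero across all samples.

    Such columns carry no discriminative information and only add noise
    and computational cost to the classifier.
    """
    keep = ~np.all(X == 0, axis=0)   # True for columns with at least one non-zero value
    return X[:, keep], keep          # return the mask so the same columns can be dropped later
```

The returned mask would also have to be applied to the features of any new peptide at prediction time so that the dimensions stay aligned with the training matrix.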

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully developed Bitter-RF, a novel Random Forest-based machine learning model for accurately identifying bitter peptides from their sequence information. By integrating a comprehensive set of 10 diverse sequence-derived features, Bitter-RF achieved superior predictive performance, particularly on an independent validation set, where it recorded an impressive AUROC of 0.98. The research demonstrated that the fusion of multiple features significantly enhances prediction accuracy compared to individual features, and that the Random Forest algorithm is highly effective for this protein classification task, outperforming other traditional machine learning methods. Bitter-RF stands as a competitive model against existing state-of-the-art predictors, offering comparable or better accuracy with potentially lower computational demands. This work not only provides a valuable tool for bitter peptide research but also extends the application of Random Forest in bioinformatics.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation and propose future work:

  • Limitation: The intrinsic robustness of the bitter/non-bitter classification model may be affected by the inherent bias of the training/test set data. This suggests that while the model performs well on the current dataset, its performance on other, potentially more diverse or differently distributed, bitter peptide datasets is yet to be fully explored. The paper explicitly states, "it cannot be excluded that the model may be affected by the inherent bias of training/test set data."
  • Future Work: The authors plan to apply feature selection techniques (e.g., the methods cited as references 83-86) to identify the most informative features within the current comprehensive set. This optimization aims to further improve performance by focusing on the most discriminative features and potentially reducing dimensionality.
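One concrete way such a selection step could look, using importance-based selection in scikit-learn (the techniques actually cited in references 83-86 may differ), is sketched below:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X_fusion, y are the fused features and bitter/non-bitter labels (placeholders).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=0),
    threshold="median",   # keep the more informative half of the fused dimensions
)
# X_selected = selector.fit_transform(X_fusion, y)
# support_mask = selector.get_support()   # which fused dimensions were retained
```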

7.3. Personal Insights & Critique

This paper presents a solid and practical approach to a significant problem in peptide research. The comprehensive feature engineering is a major strength, as is the systematic comparison of single vs. fused features and various ML algorithms. The choice of Random Forest is well-justified by its performance and computational efficiency, especially when compared to complex deep learning models for practical applications.

Strengths:

  • Thorough Feature Engineering: The use of 10 diverse features, covering various aspects of peptide sequence and physicochemical properties, is a key differentiator and likely contributes significantly to the model's high performance.
  • Rigorous Evaluation: The use of both 10-fold cross-validation and an independent validation set, along with a comprehensive set of metrics (AUROC, Sn, Sp, Acc, MCC), provides a robust assessment of the model's performance and generalization ability.
  • Comparative Analysis: The systematic comparison against multiple traditional ML methods and four generations of existing bitter peptide predictors firmly establishes Bitter-RF's competitive edge.
  • Practicality: The development of a Python package (GitHub link provided) makes the model directly usable by other researchers, fostering scientific progress. The emphasis on lower computational resources compared to deep learning models is also a practical advantage.

Potential Issues/Areas for Improvement:

  • Parameter Tuning Details: Although Random Forest is used, the paper does not report the specific hyperparameters (e.g., number of trees, maximum depth, max_features) or whether a hyperparameter optimization strategy was employed. Such details would enhance reproducibility and build confidence that the configuration is near-optimal (a hedged tuning sketch is given after this list).
  • Feature Importance Analysis: Given the fusion of 10 different features, an analysis of feature importance (e.g., from the Random Forest model) could provide deeper biological insights into which sequence characteristics are most predictive of bitterness. This could also guide future biological experiments.
  • Dataset Bias: The authors acknowledge potential dataset bias. Expanding the dataset with more diverse bitter and non-bitter peptides from various sources and ensuring balanced representation of peptide lengths and compositions could further enhance the model's generalizability.
  • Dimensions of SOCNumber and QSOrder: There is a slight discrepancy between the theoretical dimensions of SOCNumber (nlag, default 30) and QSOrder (20 + nlag, default 50) and the dimensions reported in Table 1 (2 and 42, respectively). Clarifying this, or stating the specific iLearnPlus configuration used, would be helpful for beginners.
  • Lack of Ablation Study: While the fusion vs. single feature comparison is helpful, a more detailed ablation study (e.g., removing specific feature types from the fusion set) could quantitatively demonstrate the contribution of each feature category to the overall performance.
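Regarding the hyperparameter point above, here is a minimal sketch of how such a search could be documented, assuming scikit-learn's GridSearchCV; the grid values are illustrative, not the authors' settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
# search.fit(X_fusion, y)
# print(search.best_params_, search.best_score_)
# search.best_estimator_.feature_importances_  # would also support the feature-importance analysis suggested above
```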

Transferability and Future Value: The methods, particularly the comprehensive feature engineering and the demonstration of Random Forest's effectiveness, could be transferred to other peptide classification tasks (e.g., predicting antimicrobial peptides, anticancer peptides, or allergenic peptides). The emphasis on capturing diverse sequence information and physicochemical properties is a generalizable principle in bioinformatics. Bitter-RF provides a solid foundation, and its open-source availability ensures its immediate utility and potential for further development by the research community. The identified avenues for future work, especially feature selection, promise further refinements to the model.
