
Robust deep learning–based protein sequence design using ProteinMPNN

Published: 09/15/2022
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces ProteinMPNN, a deep learning–based protein sequence design method that achieves 52.4% native sequence recovery, outperforming Rosetta's 32.9%. It can couple amino acid identities within and between chains, and it rescued previously failed protein designs, showcasing broad utility validated by X-ray crystallography, cryo-EM, and functional studies.

Abstract

Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning–based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo–electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is the development and validation of a robust deep learning-based method for protein sequence design, named ProteinMPNN, which utilizes a message-passing neural network architecture.

1.2. Authors

The authors are J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, and D. Baker. Their affiliations are primarily academic, with most authors at the University of Washington, a leading institution in protein design research; David Baker is a highly influential figure in the field.

1.3. Journal/Conference

The paper was published in Science, a highly prestigious and influential peer-reviewed academic journal covering a wide range of scientific disciplines. Publishing in Science signifies significant impact and rigor within the scientific community.

1.4. Publication Year

2022

1.5. Abstract

Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning–based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo–electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is protein sequence design, which involves finding an amino acid sequence that will fold into a given desired protein backbone structure. This problem is crucial for creating novel proteins with specific functions, from therapeutics to biosensors and materials.

Despite the revolutionary advancements in protein structure prediction (determining the 3D structure from a sequence) using deep learning (e.g., AlphaFold), the field of de novo protein design (generating a sequence for a desired structure) has largely relied on traditional, physically based approaches like Rosetta. These methods frame design as an energy optimization problem, searching for the lowest-energy combination of amino acid identities and conformations. However, these methods are computationally intensive, often require expert customization, and have lower success rates in experimental validation.

The specific challenges and gaps in prior research include:

  • Computational Cost: Physically based methods like Rosetta are slow due to explicit consideration of side chain rotameric states.

  • Limited Scope: Existing deep learning sequence design methods often focused only on monomeric proteins and did not apply to the full range of complex design challenges (e.g., oligomers, interfaces, symmetric designs).

  • Lack of Experimental Validation: Many deep learning design methods lacked extensive experimental characterization (e.g., crystallography, cryo-EM), which is the ultimate test of a design's success.

  • Lower Sequence Recovery: Prior deep learning methods generally showed lower native sequence recovery compared to physically based methods.

    The paper's entry point or innovative idea is to leverage the power of deep learning, specifically message-passing neural networks (MPNNs), to directly learn the mapping from protein backbone geometry to amino acid sequence. By training on a vast dataset of known protein structures, the model implicitly learns the physical and chemical principles governing protein stability and folding, bypassing the need for explicit energy calculations. This deep learning approach promises to be faster, more robust, and more broadly applicable than existing methods.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Introduction of ProteinMPNN: A novel deep learning-based protein sequence design method built upon a message-passing neural network architecture.

  • Superior In Silico Performance: ProteinMPNN achieves a significantly higher native sequence recovery (52.4%) on protein backbones compared to Rosetta (32.9%), demonstrating its ability to accurately predict natural sequences given their structures.

  • Broad Applicability: The method is designed to handle a wide range of protein design challenges, including monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins, as well as complex scenarios like internal repeats and coupled positions for symmetric designs.

  • Robustness through Noise Training: Training ProteinMPNN with Gaussian noise added to backbone coordinates improves its performance on less-than-perfect backbones (like those from AlphaFold models) and increases the likelihood of designed sequences folding to the target structure.

  • Enhanced Structure-to-Sequence Mapping: ProteinMPNN designs result in sequences that AlphaFold predicts to fold to the target structures more confidently and accurately than original native sequences or Rosetta-designed sequences.

  • Extensive Experimental Validation: The paper provides comprehensive experimental validation using x-ray crystallography, cryo-electron microscopy, and functional studies. This is a critical strength, as it demonstrates the practical utility and high accuracy of ProteinMPNN in real-world settings.

  • Rescue of Previously Failed Designs: ProteinMPNN successfully rescues numerous de novo designs (monomers, oligomers, nanoparticles, functional binders) that had previously failed when designed using Rosetta or AlphaFold, showcasing its superior performance and robustness.

  • Computational Efficiency: ProteinMPNN generates sequences in a fraction of the time required by physically based methods (e.g., 1.2 seconds vs. 258.8 seconds for 100 residues).

    These findings solve the problems of computational expense, limited scope, and low experimental success rates associated with previous protein design methods, making protein design more accessible and efficient for a wider range of applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following foundational concepts:

  • Proteins: Large, complex molecules made up of smaller units called amino acids, linked together in long chains. The sequence of amino acids (primary structure) determines how the chain folds into a specific three-dimensional (3D) shape (secondary, tertiary, and quaternary structures), which in turn determines its function.
  • Protein Structure Prediction vs. Protein Sequence Design:
    • Protein Structure Prediction: Given an amino acid sequence, predict its 3D folded structure. This field has seen revolutionary advancements with deep learning (e.g., AlphaFold).
    • Protein Sequence Design: Given a desired 3D backbone structure, predict an amino acid sequence that will fold into that structure. This is the inverse problem to structure prediction and the focus of this paper.
  • Protein Backbone: The repeating part of a polypeptide chain, consisting of the nitrogen (N), alpha-carbon (Cα), and carbonyl carbon (C) atoms of each amino acid, linked by peptide bonds. The side chains, which are unique to each amino acid, extend from the Cα atoms. The backbone defines the overall shape and scaffold of the protein.
  • Amino Acids: The 20 common building blocks of proteins. Each has a central carbon atom (Cα), an amino group ($-\mathrm{NH}_2$), a carboxyl group ($-\mathrm{COOH}$), a hydrogen atom, and a unique side chain (R-group). The diversity of side chains (e.g., polar, nonpolar, charged) drives protein folding and function.
  • Rosetta: A widely used software suite for computational protein design and analysis. It's a "physically based" approach, meaning it uses simplified physical models (force fields) to estimate the energy of protein conformations and sequences. It designs sequences by searching for the lowest-energy configuration of amino acids on a given backbone. This involves modeling side chain rotameric states.
    • Rotameric States: The common, low-energy conformations that amino acid side chains adopt. Rosetta considers these discrete states to reduce the search space.
  • Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. It has excelled in tasks like image recognition, natural language processing, and, more recently, biology.
  • Neural Networks: Computational models inspired by the brain. They consist of interconnected nodes (neurons) organized into layers. Input data passes through these layers, undergoing transformations (e.g., matrix multiplications, activation functions), to produce an output.
    • Encoder-Decoder Architecture: A common neural network design where an encoder processes input data into a rich internal representation (a "latent code" or "features"), and a decoder then uses this representation to generate the desired output.
    • Hidden Dimensions/Layers: Hidden layers are the layers between the input and output layers. Hidden dimensions refer to the number of nodes in these hidden layers, impacting the model's capacity to learn complex features.
  • Message-Passing Neural Networks (MPNNs): A type of graph neural network (GNN) particularly suited for data represented as graphs (nodes and edges), such as protein structures where amino acids are nodes and their interactions are edges.
    • Nodes: Represent entities (e.g., amino acids in a protein).
    • Edges: Represent relationships or interactions between nodes (e.g., spatial proximity between amino acids).
    • Message Passing: In an MPNN, information ("messages") is iteratively exchanged between connected nodes along the edges of the graph. Each node updates its own "state" or "feature vector" by aggregating messages from its neighbors. This allows the network to learn context-dependent features for each node. (A minimal sketch of one such step appears after this list.)
      • Node Updates: A function that updates a node's feature vector based on aggregated messages.
      • Edge Updates: A function that updates an edge's feature vector based on the features of the nodes it connects.
  • Autoregressive Model: A model that predicts a sequence of outputs one element at a time, where each prediction is conditioned on the previously predicted elements. For protein sequence design, this means predicting one amino acid at a time, using the context of already designed amino acids.
    • N-to-C Terminus Decoding: A standard autoregressive order in proteins, predicting amino acids sequentially from the N-terminus (beginning) to the C-terminus (end).
    • Order-Agnostic Decoding: A more flexible approach where the order of predicting elements is not fixed but can be random or chosen strategically during inference, allowing for greater contextual awareness.
  • Protein Data Bank (PDB): A public database that archives experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies. It serves as a crucial training dataset for computational biology methods.
  • CATH Database: A hierarchical classification system for protein domain structures, organizing proteins based on Class, Architecture, Topology, and Homologous superfamily. Used for splitting datasets to avoid data leakage during training/testing.
  • AlphaFold (and RoseTTAFold): Deep learning models that have achieved unprecedented accuracy in protein structure prediction. They predict 3D structures given an amino acid sequence, often leveraging multiple sequence alignments (MSAs) for co-evolutionary information. The paper uses AlphaFold to validate the quality of designed sequences by checking if they fold to the target structure.
  • Experimental Structure Determination:
    • X-ray Crystallography: A technique that uses the diffraction pattern of X-rays passing through a protein crystal to determine its atomic-level 3D structure.
    • Cryo-electron Microscopy (cryo-EM): A technique that images biological molecules (often large complexes) flash-frozen at very low temperatures, providing 3D structural information at near-atomic resolution.
  • Experimental Characterization:
    • Soluble Yield (in E. coli): A measure of how much functional, non-aggregated protein can be produced per unit volume of bacterial culture (e.g., milligrams per liter). Higher yield indicates better expression and solubility.
    • Size Exclusion Chromatography (SEC): A chromatographic method that separates proteins based on their size (hydrodynamic volume). Used to determine if a protein is monomeric, oligomeric, or aggregated.
    • SEC-MALS (SEC-Multi-Angle Light Scattering): Combines SEC with light scattering detection to accurately determine the absolute molecular weight of proteins in solution, confirming their oligomeric state.
    • Circular Dichroism (CD): A spectroscopic technique used to determine the secondary structure (e.g., alpha-helices, beta-sheets) and stability of proteins. Changes in CD spectra with temperature can indicate thermostability.
    • Biolayer Interferometry (BLI): A label-free technique used to measure biomolecular interactions (e.g., protein-protein binding affinity) by detecting changes in optical interference patterns.
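
To make the message-passing idea concrete, here is a minimal NumPy sketch of one node-update round over a residue graph. The shapes, the tanh nonlinearity, and the mean aggregation are illustrative assumptions, not the update equations of any particular published model.

```python
import numpy as np

def message_passing_step(node_feats, edge_feats, neighbors, W_msg, W_node):
    """One illustrative message-passing round over a residue graph.

    node_feats: (N, D) per-residue feature vectors.
    edge_feats: (N, K, D) features for each residue's K neighbor edges.
    neighbors:  (N, K) indices of each residue's K nearest neighbors.
    W_msg, W_node: toy weight matrices standing in for learned MLPs.
    """
    # Gather each residue's neighbor features: (N, K, D).
    nbr_feats = node_feats[neighbors]
    # A "message" combines the sender's features with the edge features.
    messages = np.tanh((nbr_feats + edge_feats) @ W_msg)
    # Aggregate messages over neighbors (mean here), then update each node.
    aggregated = messages.mean(axis=1)
    return np.tanh((node_feats + aggregated) @ W_node)

rng = np.random.default_rng(0)
N, K, D = 5, 3, 8                  # 5 residues, 3 neighbors, 8-dim features
updated = message_passing_step(
    rng.normal(size=(N, D)), rng.normal(size=(N, K, D)),
    rng.integers(0, N, size=(N, K)),
    rng.normal(size=(D, D)), rng.normal(size=(D, D)))
print(updated.shape)               # (5, 8)
```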

3.2. Previous Works

The paper builds upon and distinguishes itself from several categories of previous work:

  1. Physically Based Protein Design (e.g., Rosetta):

    • Principle: These methods treat sequence design as an energy optimization problem. Given a backbone, they search for the amino acid sequence and corresponding side chain conformations that yield the lowest calculated free energy.
    • Mechanism: They typically use computationally intensive sampling algorithms (e.g., Monte Carlo simulations with rotamer libraries) and force fields to evaluate the energy of different sequences.
    • Limitations: High computational cost, often require expert parameter tuning (e.g., hydrophobic packing on the surface), and while successful, can have lower experimental success rates compared to ProteinMPNN in complex scenarios. The paper explicitly states Rosetta's sequence recovery at 32.9%.
  2. Early Deep Learning Approaches for Protein Sequence Design:

    • Ingraham et al., 2019 (Ref. 1): This is the foundational work on which ProteinMPNN is based. It introduced a message-passing neural network (MPNN) to predict protein sequences from backbone features in an autoregressive manner.
      • Core Idea: An MPNN encoder processes backbone features (Cα-Cα distances, dihedral angles, frame orientations) to generate node and edge features. A decoder then predicts amino acids sequentially.
      • Limitations (as highlighted by the current paper): Primarily focused on monomer design, achieved lower native sequence recoveries (baseline model in this paper starts at 41.2%), and lacked extensive experimental validation across diverse design challenges.
    • Other contemporary deep learning methods (e.g., Zhang et al., 2020; Jing et al., 2021; Strokach et al., 2020): These also explored various neural network architectures for sequence design, but often faced similar limitations in terms of scope, recovery, and experimental validation. The paper notes that with the exception of one study on TIM barrel design (Anand et al., 2022), these methods hadn't been extensively validated with crystallography and cryo-EM.
  3. Protein Structure Prediction Models (AlphaFold, RoseTTAFold):

    • Principle: These models predict 3D structure from sequence, a complementary but distinct problem from sequence design.
    • Relevance to this paper: While not design methods themselves, AlphaFold and RoseTTAFold are crucial for validating designed sequences. A good designed sequence should be predicted by AlphaFold to fold accurately to the target backbone, even without multiple sequence alignment information.

3.3. Technological Evolution

The field of protein engineering has evolved from:

  1. Rational Design (early days): Manual, intuition-driven design based on known protein principles, often limited to small modifications.
  2. Combinatorial/Library-based Design: Generating large libraries of protein variants and screening them experimentally, which is high-throughput but not always targeted.
  3. Computational Design (Rosetta era): Using physics-based force fields and sampling algorithms to computationally search for optimal sequences. This significantly expanded design capabilities but remained computationally intensive and sometimes required extensive manual intervention.
  4. Deep Learning for Structure Prediction (AlphaFold era): The advent of AlphaFold in 2020-2021 marked a paradigm shift in structure prediction, demonstrating the power of deep learning to capture complex protein folding rules.
  5. Deep Learning for Sequence Design (Current era, ProteinMPNN): This paper represents a critical step in extending the deep learning revolution to the inverse problem of sequence design. ProteinMPNN leverages the success of deep learning in structure prediction and graph-based representations to develop a highly efficient and accurate sequence design method that directly learns from vast protein structural data. It aims to make de novo protein design more automated, robust, and accessible.

3.4. Differentiation Analysis

Compared to previous works, especially Rosetta and earlier deep learning design methods, ProteinMPNN offers several core differences and innovations:

  • Learning-based vs. Physics-based: ProteinMPNN is purely data-driven, learning directly from the PDB, whereas Rosetta relies on explicit energy functions and rotamer libraries. This allows ProteinMPNN to implicitly capture complex interactions without explicit physical modeling.

  • Comprehensive Backbone Features: ProteinMPNN uses a richer set of input features, including interatomic distances between N, Cα, C, O, and a virtual Cβ, which are shown to provide a better inductive bias for capturing residue interactions than just Cα-Cα distances or dihedral angles used in earlier MPNNs.

  • Enhanced Graph Processing: The inclusion of edge updates in the encoder neural network, in addition to node updates, allows for a more sophisticated propagation and integration of information across the protein graph.

  • Order-Agnostic Decoding: Unlike fixed N-to-C terminal decoding, ProteinMPNN employs an order-agnostic autoregressive model. This allows for flexible design, such as fixing certain regions (e.g., a binding motif) while designing others, and provides better contextual awareness during sequence generation.

  • Multichain and Symmetry-Aware Design: ProteinMPNN is explicitly designed for multichain protein assemblies and can enforce symmetry by coupling amino acid identities at equivalent positions. This is a significant advancement over earlier deep learning methods that primarily focused on monomers.

  • Robustness to Backbone Perturbations: Training ProteinMPNN with Gaussian noise on backbone coordinates makes the generated sequences more robust to slight structural inaccuracies, which is crucial when designing for computationally predicted or experimentally imperfect backbones. This is a critical practical innovation.

  • Superior Experimental Validation and Rescue Capability: The paper's most compelling differentiation is the extensive experimental validation and its ability to "rescue" numerous failed designs from Rosetta or AlphaFold. This demonstrates its real-world utility and higher success rate in generating soluble, stable, and functional proteins.

  • Computational Efficiency: ProteinMPNN is orders of magnitude faster than Rosetta, making high-throughput design feasible.

    In essence, ProteinMPNN represents a significant leap forward by combining advanced deep learning architectures with practical considerations for robust, versatile, and experimentally validated protein sequence design.

4. Methodology

4.1. Principles

The core principle of ProteinMPNN is to leverage a deep learning model, specifically a Message-Passing Neural Network (MPNN), to directly learn the probabilistic mapping from a protein backbone structure to its corresponding amino acid sequence. Unlike physically based methods (e.g., Rosetta) that attempt to optimize an explicit energy function, ProteinMPNN learns to predict the most probable amino acid at each position given the local and global structural context, based on patterns observed in thousands of known protein structures from the Protein Data Bank (PDB). This allows it to bypass the computationally intensive side chain packing and energy minimization steps, leading to faster and more robust designs. The intuition is that if the model can accurately recover native sequences, it has learned the underlying structural determinants of amino acid preferences, which can then be applied to design novel sequences for desired backbones.

4.2. Core Methodology In-depth (Layer by Layer)

ProteinMPNN extends a previously described message-passing neural network, incorporating several key improvements to enhance performance, flexibility, and applicability to a broad range of protein design challenges. The overall architecture is schematically outlined in Fig. 1A.

4.2.1. Backbone Encoder

The Encoder component of ProteinMPNN is responsible for processing the input protein backbone geometry into a set of informative graph node and edge features.

  • Input Features:

    • Initial Baseline: The starting point was an MPNN that used distances between Cα atoms, relative Cα-Cα-Cα frame orientations and rotations, and backbone dihedral angles as input features. These features describe the geometric arrangement of the protein backbone.
    • Improvement (Experiment 1): A significant improvement was achieved by including a richer set of interatomic distances. This expanded input includes distances between:
      • N (nitrogen atom of the backbone amide group)
      • Cα (alpha-carbon atom, central to each amino acid)
      • C (carbonyl carbon atom of the backbone)
      • O (oxygen atom of the backbone carbonyl group)
      • Virtual Cβ: a conceptual point placed on the basis of the other backbone atoms. For all amino acids except glycine, Cβ is the first carbon atom of the side chain; glycine has no Cβ, so a virtual one is used to keep the structural representation consistent.
    • Rationale: The authors found that these additional interatomic distances provide a better "inductive bias" for capturing subtle interactions between residues than dihedral angles or N-Cα-C frame orientations alone. Inductive bias refers to the assumptions a learning algorithm makes in order to generalize from training data to new, unseen examples; here, the distance features help the model better infer the rules of amino acid interactions.
  • Message-Passing Mechanism:

    • The encoder uses a message-passing neural network to process these input features. In a graph representation of a protein, each amino acid is a node, and relationships (like spatial proximity) between amino acids are edges.

    • Node and Edge Updates (Experiment 2): The original MPNN used node updates where each amino acid's feature vector (node feature) was updated by aggregating information ("messages") from its neighbors. ProteinMPNN further improved this by adding edge updates. This means that in addition to updating the features of each amino acid, the features describing the interactions between amino acids (edge features) are also dynamically updated during the message-passing process. This allows for a more nuanced representation of pairwise interactions.

    • Graph Connectivity: The network considers a local neighborhood around each Cα atom. The authors tested 16, 24, 32, 48, and 64 nearest-Cα-neighbor graphs and found that performance saturated at 32 to 48 neighbors. This indicates that the optimality of an amino acid at a particular position is largely determined by its immediate protein environment. (A sketch of the virtual-Cβ and neighbor-graph construction appears at the end of this subsection.)

    • Layer Structure: The model consists of three encoder layers and three decoder layers, with 128 hidden dimensions. These layers process the input features through successive transformations.

      The output of the Encoder is a refined set of graph node and edge features that encapsulate the structural context of each amino acid position.
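
To illustrate the featurization described above, the sketch below places a virtual Cβ from the backbone N, Cα, and C atoms and builds a nearest-Cα-neighbor graph. The three Cβ constants follow the ideal-geometry construction common in trRosetta-style code; treating them as ProteinMPNN's exact values is an assumption, as is the brute-force distance matrix used for neighbor search.

```python
import numpy as np

def virtual_cbeta(n, ca, c):
    """Place a virtual Cβ from backbone N, Cα, C coordinates ((L, 3) arrays).

    Constants are from the ideal-geometry construction used in
    trRosetta-style code (an assumption here, not stated in the paper text).
    """
    b = ca - n
    c_vec = c - ca
    a = np.cross(b, c_vec)
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca

def knn_graph(ca, k=48):
    """Indices of the k nearest Cα neighbors of each residue (self excluded)."""
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)  # (L, L)
    np.fill_diagonal(d, np.inf)                                   # drop self
    return np.argsort(d, axis=1)[:, :k]                           # (L, k)

rng = np.random.default_rng(1)
L = 60
n, ca, c = (rng.normal(scale=10.0, size=(L, 3)) for _ in range(3))
print(virtual_cbeta(n, ca, c).shape, knn_graph(ca, k=48).shape)  # (60, 3) (60, 48)
```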

4.2.2. Sequence Decoder

The Decoder component takes the encoded structural features from the encoder, along with a partial sequence, and iteratively generates amino acids for the remaining positions.

  • Autoregressive Decoding (Experiment 4):

    • Fixed vs. Random Decoding: Earlier deep learning methods often used a fixed N-to-C-terminal decoding order, meaning amino acids were predicted strictly from the beginning to the end of the protein chain. ProteinMPNN replaces this with an order-agnostic autoregressive model, in which the decoding order is randomly sampled from the set of all possible permutations during training.
    • Benefits of Order-Agnostic Decoding (Fig. 1B):
      • Flexible Inference: During inference (when designing a new protein), this allows for an arbitrary decoding order. For example, if a specific region of the protein sequence is known or fixed (e.g., a target-binding motif in protein binder design), that region can be "decoded" first (treated as context), and the model can then design the surrounding regions using this fixed context. This is crucial for practical design scenarios where parts of the protein might be pre-determined.
      • Improved Context: Random decoding trains the model to utilize context from any already-decoded residues, regardless of their position relative to the currently predicted residue. A fixed left-to-right decoding cannot use sequence context from positions that follow the current one.
  • Multichain and Symmetry-Aware Design:

    • Equivariance to Chain Order: For multichain proteins (assemblies of multiple polypeptide chains), the model needs to be equivariant to the order of the protein chains. This means that if the order of chains is shuffled in the input, the output (the designed sequence for each chain) should remain consistent. To achieve this, ProteinMPNN uses:
      • Per chain relative positional encoding capped at ±32 residues: This encoding provides information about the relative position of residues within a single chain, up to a certain distance.
      • Binary feature that indicates whether the interacting pair of residues are from the same or different chains: This feature helps the model distinguish between intra-chain and inter-chain interactions, which are critical for designing protein assemblies.
    • Coupling Positions for Symmetry (Fig. 1C):
      • For designing symmetric proteins (e.g., homodimers, homotrimers, repeat proteins), certain positions across different chains or within repeating units are functionally equivalent and should have the same amino acid identity. ProteinMPNN implements this by "tying" these positions together.
      • Mechanism: For tied positions (e.g., position A1 in chain A and position B1 in chain B of a homodimer), the model first predicts unnormalized probabilities for each individual position (P(A1) and P(B1)). These unnormalized probabilities are then averaged to give a single combined probability distribution, and an amino acid is sampled from it, ensuring that the same amino acid is chosen for all tied positions (a minimal sketch follows at the end of this subsection).
      • Unnormalized Probabilities: These are raw outputs from the neural network (often called logits) before they are converted into actual probabilities (e.g., using a softmax function). Averaging these raw scores allows for a more direct combination of preferences.
      • Normalized Probability Distribution: After averaging, a softmax function is typically applied to convert the averaged raw scores into a probability distribution where all values are between 0 and 1 and sum to 1.
      • Multistate Design: This concept can be extended for multistate design (designing a single sequence to fold into two or more desired states) by taking a linear combination of predicted unnormalized probabilities for each state, potentially with positive and negative coefficients to upweight or downweight specific backbone states.
  • Training Data for Multichain Model: The final ProteinMPNN model was trained on a large dataset of protein assemblies from the PDB (as of 2 August 2021), specifically those determined by x-ray crystallography or cryo-electron microscopy to better than 3.5-Å resolution and with fewer than 10,000 residues. This comprehensive dataset ensures the model learns principles relevant to complex protein architectures.
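
A minimal sketch of the two decoding ideas above: sampling a decoding order in which fixed positions come first, and sampling a single amino acid for a group of tied positions by averaging their unnormalized logits before the softmax. The 20-way logits here are random stand-ins for real decoder outputs.

```python
import numpy as np

def decoding_order(L, fixed, rng):
    """Random decoding order in which fixed positions are decoded first.

    Decoding the fixed positions first means their (known) amino acids
    serve as context for every position designed afterward.
    """
    fixed = list(fixed)
    rest = [i for i in range(L) if i not in set(fixed)]
    rng.shuffle(fixed)
    rng.shuffle(rest)
    return fixed + rest

def sample_tied(logit_groups, rng):
    """Sample one shared amino acid for a group of tied positions.

    logit_groups: (n_tied, 20) unnormalized logits, one row per tied position.
    Averaging the logits and applying a softmax yields a single distribution.
    """
    avg = logit_groups.mean(axis=0)
    p = np.exp(avg - avg.max())
    p /= p.sum()
    return int(rng.choice(20, p=p))

rng = np.random.default_rng(2)
print(decoding_order(L=10, fixed=[3, 4, 5], rng=rng))  # motif 3-5 decoded first
# Homotrimer example: one position tied across three chains gets one amino acid.
print(sample_tied(rng.normal(size=(3, 20)), rng))
```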

4.2.3. Training with Backbone Noise

A critical innovation for improving the real-world applicability of ProteinMPNN is training with Gaussian noise added to the backbone coordinates.

  • Mechanism: During training, small amounts of Gaussian noise (SD = 0.02 Å) are added to the atomic coordinates of the input protein backbones (see the sketch after this list).
  • Rationale:
    • Robustness: Real-world protein backbones (e.g., computationally predicted structures, experimentally determined structures with slight inaccuracies) are rarely perfectly ideal. Training with noise makes the model less sensitive to minor perturbations in the input geometry, making it more robust.
    • Generalization: While noise reduces sequence recovery on perfectly rigid PDB structures (as it blurs "perfect" details), it significantly improves performance when designing for AlphaFold models or other computationally generated backbones. This is because it encourages the model to focus on overall topological features and general patterns (e.g., polar-nonpolar sequence patterns) rather than overfitting to hyper-specific local structural details.
    • Improved AlphaFold Predictability: Sequences designed by models trained with noise are found to be more robustly predicted by AlphaFold to fold to the target structure, increasing the success rate of designs passing prediction-based filters.
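
A minimal sketch of the noising step, assuming independent isotropic Gaussian noise freshly sampled for every backbone atom at each training iteration (the standard deviation comes from the paper; the per-iteration resampling is an assumption):

```python
import numpy as np

def noise_backbone(coords, sd=0.02, rng=None):
    """Add isotropic Gaussian noise (in Å) to backbone atom coordinates.

    coords: (L, 4, 3) array of N, Cα, C, O positions per residue.
    """
    if rng is None:
        rng = np.random.default_rng()
    return coords + rng.normal(scale=sd, size=coords.shape)

coords = np.zeros((100, 4, 3))             # placeholder 100-residue backbone
noisy = noise_backbone(coords, sd=0.02)    # applied fresh at each training step
print((noisy - coords).std())              # ≈ 0.02 Å
```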

4.2.4. Inference and Sequence Sampling

During inference (when generating sequences for a new target backbone), ProteinMPNN offers flexibility:

  • Temperature Parameter: The diversity of generated sequences can be increased by performing inference at higher temperatures. In probabilistic models, temperature controls the "peakiness" of the probability distribution; higher temperatures lead to flatter distributions and thus more diverse (less "certain") samples.

  • Sequence Quality Metric: The averaged log probability of the sequence given the structure (a measure derived from ProteinMPNN) correlated strongly with native sequence recovery over a range of temperatures. This metric can be used to rank and filter generated sequences for experimental characterization, selecting those the model considers most probable or "high quality."
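
The two inference-time controls can be illustrated together: temperature scaling divides the logits by T before the softmax (higher T flattens each position's distribution and increases diversity), and the ranking score is the mean log probability the model assigns to the sampled sequence. A sketch using random stand-in logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_sequence(logits, temperature=1.0, rng=None):
    """Sample a sequence and return its average log probability.

    logits: (L, 20) per-position unnormalized scores (random stand-ins here).
    """
    if rng is None:
        rng = np.random.default_rng()
    seq, logps = [], []
    for pos_logits in logits:
        p = softmax(pos_logits / temperature)
        aa = int(rng.choice(20, p=p))
        seq.append(aa)
        logps.append(np.log(p[aa]))
    return seq, float(np.mean(logps))

logits = np.random.default_rng(3).normal(size=(50, 20))
for T in (0.1, 0.3, 1.0):          # higher T -> more diverse, lower avg log prob
    _, score = sample_sequence(logits, temperature=T)
    print(T, round(score, 3))
```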

    The following figure illustrates the overall architecture of ProteinMPNN:


Fig. 1. ProteinMPNN architecture. (A) Distances between N, Cα, C, O, and virtual Cβ are encoded and processed using a message-passing neural network (Encoder) to obtain graph node and edge features. The encoded features, together with a partial sequence, are used to generate amino acids iteratively in a random decoding order. (B) A fixed left-to-right decoding cannot use sequence context (green) for preceding positions (yellow), whereas a model trained with random decoding orders can be used with an arbitrary decoding order during inference; the decoding order can be chosen such that the fixed context is decoded first. (C) Residue positions within and between chains can be tied together, enabling symmetric, repeat-protein, and multistate design. In this example, a homotrimer is designed with coupling of positions in different chains. Predicted unnormalized probabilities for tied positions are averaged to get a single probability distribution from which amino acids are sampled.

5. Experimental Setup

5.1. Datasets

The experiments utilized a variety of protein datasets for training, in silico testing, and experimental validation:

  • Training Set for Single-Chain Model:
    • Source: 19,700 high-resolution single-chain structures from the Protein Data Bank (PDB).
    • Characteristics: These were split into train, validation, and test sets (80/10/10) based on the CATH protein classification database to prevent data leakage (where very similar structures might appear in both training and test sets).
    • Purpose: Used for the initial development and evaluation of model improvements on native sequence recovery for single-chain proteins.
  • Training Set for Multichain/Symmetry-Aware Model:
    • Source: Protein assemblies in the PDB (as of 2 August 2021).
    • Characteristics: Structures determined by x-ray crystallography or cryo-electron microscopy (cryo-EM) to better than 3.5-Å resolution and with fewer than 10,000 residues.
    • Purpose: To train the final ProteinMPNN model, enabling it to handle complex protein assemblies and design challenges involving multiple chains and symmetry.
  • Test Set for In Silico Native Sequence Recovery:
    • Source: 402 monomer backbones (for comparison with Rosetta), 690 monomers, 732 homomers (with fewer than 2000 residues), and 98 heteromers. These are likely subsets of the PDB test set or derived from similar high-resolution structures.
    • Purpose: To benchmark ProteinMPNN's ability to recover native sequences on diverse protein types and interfaces.
  • AlphaFold Protein Backbone Models from UniRef50:
    • Source: 5000 randomly chosen AlphaFold protein backbone models from UniRef50 sequences, with an average predicted lDDT > 80.0 (indicating high confidence in their predicted structures).
    • Purpose: To evaluate ProteinMPNN's robustness to slight inaccuracies in input backbones, especially after training with backbone noise.
  • De Novo Designed Ligand Binding Pocket-Containing Scaffolds (Rosetta):
    • Source: A set of scaffolds previously designed using Rosetta.
    • Purpose: To test ProteinMPNN's ability to redesign sequences for designed backbones, improving their AlphaFold predictability.
  • AlphaFold Hallucinated Monomers and Homo-oligomers:
    • Source: Protein backbones and sequences generated by AlphaFold's hallucination process (optimizing sequences for well-defined structures starting from random sequences).
    • Purpose: To test ProteinMPNN's ability to "rescue" these designs, as the original AlphaFold-generated sequences often failed experimentally.
  • Previously Suboptimal Rosetta Designs:
    • Source:

      • Rosetta designs of repeat protein structures.
      • Rosetta designs of C5 and C6 cyclic oligomers.
      • Rosetta designs of two-component tetrahedral nanoparticle designs (e.g., T33-27).
      • Rosetta designs of proteins that scaffold polyproline II helix motifs recognized by SH3 domains.
    • Purpose: To demonstrate ProteinMPNN's ability to "rescue" diverse failed designs by keeping the original backbones but generating new sequences.

      The paper does not provide concrete examples of data samples (e.g., actual PDB IDs of specific proteins used in the test sets, or sequence snippets), but rather describes the types of data used. The datasets chosen are representative of the challenges in protein design, covering a wide range of protein sizes, complexities, and design goals (monomers, oligomers, interfaces, functional sites), making them effective for validating the method's broad applicability and performance.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

5.2.1. Sequence Recovery

  • Conceptual Definition: Sequence recovery quantifies how well a design method can reproduce the original (native) amino acid sequence of a protein given its 3D backbone structure. It measures the percentage of amino acid positions where the designed amino acid matches the native amino acid. A higher percentage indicates that the design method accurately captures the sequence-structure relationship inherent in native proteins.
  • Mathematical Formula: $ \text{Sequence Recovery} = \frac{\text{Number of correctly predicted amino acids}}{\text{Total number of amino acids}} \times 100\% $
  • Symbol Explanation:
    • Number of correctly predicted amino acids: The count of positions where the amino acid predicted by the design method is identical to the amino acid in the native protein sequence.
    • Total number of amino acids: The total length of the protein sequence being designed.
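
This metric translates directly into code; a minimal sketch using one-letter amino acid strings:

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Percent of positions where the designed amino acid matches the native one."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)

print(sequence_recovery("ACDEFGHIKL", "ACDEFGHIKV"))  # 90.0
```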

5.2.2. Perplexity (Exponentiated Categorical Cross-Entropy Loss)

  • Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In the context of protein sequence design, it indicates how "surprised" the model is by the actual native amino acid sequence. A lower perplexity value means the model assigns higher probability to the correct amino acids, implying a better fit between the predicted sequence distribution and the true sequence. It's the exponentiated average per-residue categorical cross-entropy loss.
  • Mathematical Formula: $ \text{Perplexity} = e^{\left( -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{i,j} \log(\hat{y}_{i,j}) \right)} $
  • Symbol Explanation:
    • $e$: Euler's number (the base of the natural logarithm).
    • $N$: The total number of amino acid positions in the protein sequence.
    • $K$: The number of possible amino acid types (typically 20).
    • $y_{i,j}$: A binary indicator variable (0 or 1). It is 1 if the $i$-th amino acid in the true (native) sequence is of type $j$, and 0 otherwise.
    • $\hat{y}_{i,j}$: The predicted probability that the $i$-th amino acid is of type $j$, as output by the model.
    • $\log$: The natural logarithm.
    • The term $-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{i,j} \log(\hat{y}_{i,j})$ represents the average per-residue categorical cross-entropy loss.
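
A minimal sketch of the perplexity computation, assuming the model's per-position probabilities are available as an (N, 20) array; a uniform predictor over 20 amino acids gives a perplexity of exactly 20, which is a useful sanity check:

```python
import numpy as np

def perplexity(probs, native_idx):
    """Exponentiated mean per-residue cross-entropy.

    probs:      (N, 20) predicted probabilities per position.
    native_idx: (N,) integer index of the native amino acid at each position.
    """
    logp = np.log(probs[np.arange(len(native_idx)), native_idx])
    return float(np.exp(-logp.mean()))

N = 100
probs = np.full((N, 20), 1 / 20)          # uniform predictor
native = np.zeros(N, dtype=int)
print(perplexity(probs, native))           # 20.0
```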

5.2.3. lDDT (local-Distance Difference Test)

  • Conceptual Definition: lDDT is a score used to assess the local accuracy of a predicted protein structure compared to a reference (native) structure. It measures the fraction of all atoms in the predicted model that are in a "correct" environment compared to the reference structure. An atom is considered to be in a correct environment if its distances to a set of neighboring atoms in the reference structure are preserved (within a certain tolerance) in the predicted structure. The score ranges from 0 to 1, with 1 indicating a perfect match. lDDT-Cα specifically refers to lDDT calculated only for the Cα atoms.
  • Mathematical Formula: For a given atom $i$ in the predicted structure and a distance tolerance threshold $\tau$, the per-atom score is $ \text{lDDT}_i(\tau) = \frac{1}{|N_i(R)|} \sum_{j \in N_i(R)} \mathbb{I}\left(|d_{i,j}(P) - d_{i,j}(R)| < \tau\right) $ and the overall score averages over all atoms and thresholds: $ \text{lDDT} = \frac{1}{|A| \cdot |\mathcal{T}|} \sum_{i \in A} \sum_{\tau \in \mathcal{T}} \text{lDDT}_i(\tau) $
  • Symbol Explanation:
    • $P$: The predicted protein structure.
    • $R$: The reference (native) protein structure.
    • $A$: The set of all atoms (or Cα atoms for lDDT-Cα) in the protein.
    • $i$: An individual atom in the protein.
    • $N_i(R)$: The set of atoms neighboring atom $i$ in the reference structure $R$ (within a defined cutoff distance, typically 15-20 Å).
    • $|N_i(R)|$: The number of neighbors of atom $i$ in $R$.
    • $d_{i,j}(P)$: The distance between atoms $i$ and $j$ in the predicted structure $P$.
    • $d_{i,j}(R)$: The distance between atoms $i$ and $j$ in the reference structure $R$.
    • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition inside is true and 0 otherwise.
    • $\tau$: A distance tolerance threshold. Common thresholds are {0.5 Å, 1 Å, 2 Å, 4 Å}.
    • $\mathcal{T}$: The set of distance tolerance thresholds.
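
A simplified sketch of lDDT over Cα atoms using the standard thresholds; real implementations add details omitted here (e.g., excluding sequence-local pairs), so this is illustrative only:

```python
import numpy as np

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified, superposition-free lDDT over Cα atoms.

    pred, ref: (L, 3) Cα coordinates of the model and the reference.
    For every residue pair closer than `cutoff` in the reference, check
    whether its distance is conserved within each tolerance threshold.
    """
    dp = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)  # model distances
    dr = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)    # reference distances
    mask = (dr < cutoff) & ~np.eye(len(ref), dtype=bool)         # neighbor pairs
    diff = np.abs(dp - dr)[mask]
    # Average the conserved fraction over the tolerance thresholds.
    return float(np.mean([(diff < t).mean() for t in thresholds]))

rng = np.random.default_rng(4)
ref = rng.normal(scale=10.0, size=(80, 3))
pred = ref + rng.normal(scale=0.2, size=ref.shape)  # slightly perturbed model
print(round(lddt_ca(pred, ref), 3))                 # high (near 1) for a close model
```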

5.2.4. TM-score (Template Modeling Score)

  • Conceptual Definition: TM-score is a metric used to measure the topological similarity between two protein structures. It's designed to be sensitive to the global fold similarity, with values ranging from 0 to 1. A TM-score of 1 indicates a perfect match, and scores above 0.5 generally indicate that two proteins share the same fold. It is less sensitive to local structural deviations than RMSD and is commonly used for fold classification.
  • Mathematical Formula: $ \text{TM-score} = \max \left[ \frac{1}{L_{\text{target}}} \sum_{i=1}^{L_{\text{common}}} \frac{1}{1 + \left( \frac{d_i}{d_0(L_{\text{target}})} \right)^2} \right] $
  • Symbol Explanation:
    • $L_{\text{target}}$: The length of the target (reference) protein.
    • $L_{\text{common}}$: The number of residues in the longest common substructure after optimal superposition.
    • $d_i$: The distance between the $i$-th pair of equivalent Cα atoms after optimal superposition.
    • $d_0(L_{\text{target}})$: A length-dependent scale factor, typically $1.24 \cdot \sqrt[3]{L_{\text{target}} - 15} - 1.8$, which normalizes the score for protein size. The max indicates optimization over different possible superpositions.
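
A sketch evaluating the TM-score sum for one given superposition (the full score additionally maximizes over superpositions, which is omitted here):

```python
import numpy as np

def tm_score(d, L_target):
    """TM-score contribution for one superposition.

    d: (L_common,) distances between matched Cα pairs after superposition.
    """
    d0 = 1.24 * np.cbrt(L_target - 15) - 1.8   # length-dependent scale factor
    return float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / L_target)

# If every matched Cα pair is 1 Å apart in a 100-residue protein:
print(round(tm_score(np.full(100, 1.0), L_target=100), 3))  # ≈ 0.93
```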

5.2.5. RMSD (Root Mean Square Deviation)

  • Conceptual Definition: RMSD is a widely used measure to quantify the average distance between the atoms of two superimposed protein structures. It calculates the square root of the mean of the squares of the distances between corresponding atoms after the best possible superposition (rotation and translation) to minimize this value. A lower RMSD indicates higher structural similarity.
  • Mathematical Formula: $ \text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} ||\mathbf{v}_i - \mathbf{w}_i||^2} $
  • Symbol Explanation:
    • $N$: The total number of corresponding atoms (e.g., Cα atoms) being compared.
    • $\mathbf{v}_i$: The 3D coordinate vector of the $i$-th atom in the first structure (e.g., the design model).
    • $\mathbf{w}_i$: The 3D coordinate vector of the $i$-th atom in the second structure (e.g., the crystal structure) after optimal superposition.
    • $||\cdot||^2$: The squared Euclidean distance between the two vectors.
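
RMSD after optimal superposition is typically computed with the Kabsch algorithm; a minimal NumPy sketch:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimal superposition (Kabsch algorithm).

    P, Q: (N, 3) matched atom coordinates (e.g., Cα of design and crystal).
    """
    P = P - P.mean(axis=0)                 # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)      # SVD of the covariance matrix
    sign = np.sign(np.linalg.det(U @ Vt))  # avoid improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt # optimal rotation applied to P's rows
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

rng = np.random.default_rng(6)
Q = rng.normal(scale=10.0, size=(130, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
P = Q @ Rz + 5.0                           # rotated and translated copy of Q
print(round(kabsch_rmsd(P, Q), 6))         # ≈ 0: the structures are identical
```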

5.2.6. Soluble Yield

  • Conceptual Definition: This is an experimental metric measuring the quantity of protein that is correctly folded and soluble (i.e., not aggregated or stuck in inclusion bodies) after expression in a host organism (e.g., E. coli). It's typically quantified as milligrams of purified, soluble protein per liter of bacterial culture. Higher soluble yield indicates a more stable and well-behaved protein in vivo.
  • Formula & Symbols: Not a mathematical formula but an experimental measurement.
    • mg per liter of culture equivalent: The unit of measurement for soluble protein yield.

5.2.7. Thermostability

  • Conceptual Definition: Thermostability refers to a protein's ability to maintain its folded structure and function at elevated temperatures. It is often assessed by monitoring changes in secondary structure (e.g., using circular dichroism, CD) as temperature increases. A protein that maintains its CD profile at high temperatures (e.g., 95°C) is considered highly thermostable.
  • Formula & Symbols: Not a direct formula; typically inferred from experimental data like CD spectra (e.g., Mean Residue Ellipticity, MRE).
    • MRE: Mean Residue Ellipticity, a unit often used in CD spectroscopy.

5.2.8. Oligomeric State

  • Conceptual Definition: The oligomeric state describes whether a protein exists as a single polypeptide chain (monomer), or as a complex of multiple chains (dimer, trimer, tetramer, etc.), and how many chains are in the complex. This is crucial for protein function and assembly.
  • Formula & Symbols: Not a direct formula; determined experimentally.
    • Size Exclusion Chromatography (SEC): A technique that separates molecules by size. A protein's retention volume in SEC can indicate its apparent molecular weight and thus its oligomeric state.
    • SEC-MALS (SEC-Multi-Angle Light Scattering): A more precise method that combines SEC with light scattering to determine the absolute molecular weight of proteins in solution, confirming their true oligomeric state.

5.2.9. Binding Affinity

  • Conceptual Definition: Binding affinity quantifies the strength of the interaction between two molecules (e.g., a designed protein and its target). A higher binding affinity means the molecules bind more tightly. This is critical for designing functional proteins like target-binding proteins.
  • Formula & Symbols: Not given as a formula in the paper; binding strength is typically reported as an equilibrium dissociation constant ($K_D$) or as association/dissociation rates.
    • Biolayer Interferometry (BLI): An optical technique used to measure binding kinetics and affinity. It monitors interference patterns caused by molecular interactions on a sensor surface.

5.3. Baselines

The paper compared ProteinMPNN against the following baseline models:

  • Rosetta Fixed Backbone Combinatorial Sequence Design:
    • Description: This represents the state-of-the-art in physically based protein design. It involves using Rosetta's PackRotamersMover (an algorithm for optimizing side chain conformations) with default options and the beta_nov16 score function (a specific energy function used by Rosetta).
    • Why it's representative: Rosetta has been the dominant computational protein design platform for decades and provides a strong benchmark for physically based methods.
  • AlphaFold-generated sequences:
    • Description: For the "hallucinated" monomers and homo-oligomers, the original sequences were generated by AlphaFold's hallucination process (optimizing a random sequence to predict a well-defined structure).
    • Why it's representative: This tests ProteinMPNN's ability to improve upon de novo sequences generated by a powerful deep learning structure prediction model, highlighting that structure prediction capability doesn't automatically translate to successful sequence design.
  • Original Native Sequences:
    • Description: For the comparison of AlphaFold prediction accuracy, ProteinMPNN sequences generated for native backbones were compared against the original native sequences themselves.
    • Why it's representative: This tests the hypothesis that ProteinMPNN can generate sequences that more strongly encode a specific structure than even naturally evolved sequences, which are often under selection for function rather than maximal stability.

6. Results & Analysis

6.1. Core Results Analysis

The results presented in the paper convincingly demonstrate ProteinMPNN's superior performance and broad utility across various protein design challenges, both in silico and experimentally.

6.1.1. Superior In Silico Native Sequence Recovery

ProteinMPNN significantly outperforms Rosetta in recovering native amino acid sequences given their backbones.

  • Overall Performance: For a test set of 402 monomer backbones, ProteinMPNN achieved an overall native sequence recovery of 52.4%, which is substantially higher than Rosetta's 32.9%.
  • Computational Efficiency: Crucially, ProteinMPNN achieves this higher performance in a fraction of the compute time: 1.2 seconds per 100 residues compared to 258.8 seconds for Rosetta on a single CPU.
  • Performance Across Burial Levels (Fig. 2A): ProteinMPNN showed improvements across the full range of residue burial, from deeply buried protein core residues (left side of x-axis, high Cβ density) to exposed surface residues (right side of x-axis, low Cβ density). This indicates that ProteinMPNN better captures amino acid preferences across diverse microenvironments within a protein.
  • Broad Applicability to Assemblies (Fig. 2B): ProteinMPNN exhibits high sequence recovery not only for monomers (median 52%) but also for homo-oligomers (median 55%) and heteromers (median 51%). For interface residues (Cβ-Cβ < 8 Å), the recovery was 53% for homomers and 51% for heteromers. This confirms its robustness for designing protein-protein interfaces.
  • Correlation with Burial (Fig. S1B): Sequence recovery consistently correlated with residue burial, ranging from 90-95% in the deep core to 35% on the surface. This is expected, as buried residues have more local geometric context and are more constrained, making their amino acid identity easier to predict.

6.1.2. Impact of Training with Backbone Noise (Table 1, Fig. 2C)

The paper highlights a critical finding: optimizing for native sequence recovery on perfect crystal structures is not always optimal for real-world design.

  • Robustness to Imperfect Backbones: Training ProteinMPNN models with Gaussian noise added to backbone coordinates (e.g., SD = 0.02 Å) significantly improved sequence recovery on confident AlphaFold structure models (which have slight inaccuracies compared to native structures) while simultaneously decreasing sequence recovery on unperturbed PDB structures.
  • Rationale: Crystallographic refinement might impart some "memory" of amino acid identity in the backbone coordinates, which noise-free models can capture but is less relevant for de novo design. Training with noise forces the model to learn more robust, generalizable features, making it resilient to small coordinate displacements common in predicted or de novo designed backbones.
  • Improved AlphaFold Prediction Success (Fig. 2C): Sequences generated by ProteinMPNN models trained with noise resulted in AlphaFold predictions that were more accurate and confident. For instance, a model trained with 0.3-Å noise generated 2-3 times more sequences with AlphaFold predictions within lDDT-Cα of 95.0 and 90.0 of the true structures, compared to un-noised or slightly noised models. This is highly valuable because it increases the frequency of designs passing prediction-based filters, which is a common step in the protein design pipeline.

6.1.3. Enhanced Sequence-to-Structure Mapping (Fig. 2E, 2F)

ProteinMPNN designs create a stronger link between sequence and structure.

  • Superior to Native Sequences (Fig. 2E): ProteinMPNN sequences generated for native backbones were predicted by AlphaFold to fold to these structures much more confidently and accurately than the original native sequences themselves (when AlphaFold was run with only single-sequence information, without multiple sequence alignments). This suggests ProteinMPNN can generate sequences that are "better encoders" of a specific fold than even evolutionarily optimized natural sequences, which might prioritize function over maximal stability.
  • Rescue of Designed Scaffolds (Fig. 2F): For a set of de novo designed ligand-binding scaffolds (originally from Rosetta), only 2.7% of the original sequences were confidently predicted to fold to the target structure by AlphaFold. After ProteinMPNN redesign, a remarkable 54.1% were confidently predicted to fold. This drastically increases the utility of such scaffolds for downstream applications like small-molecule binding and enzyme design.

6.1.4. Sequence Diversity and Quality Ranking (Fig. 2D, Fig. S3A)

  • Increased Diversity: The diversity of generated sequences can be considerably increased by performing inference at higher temperatures, with only a small decrease in average sequence recovery. This is important for experimentalists who need multiple sequence variants to test.
  • Quality Ranking: A measure derived from ProteinMPNN, the averaged log probability of the sequence given the structure, correlated strongly with native sequence recovery. This provides a fast, in silico method to rank designed sequences and prioritize them for experimental characterization.

6.1.5. Extensive Experimental Validation and Rescue of Failed Designs

The experimental results are arguably the most compelling aspect, showing ProteinMPNN's practical success where other methods failed.

  • Rescue of AlphaFold Hallucinations (Fig. 3A-D):
    • Problem: AlphaFold's hallucinated sequences, while leading to well-defined predicted structures, were largely insoluble when expressed in E. coli (median soluble yield of 9 mg/L).
    • Solution: ProteinMPNN redesigned sequences for a subset of these backbones. The success rate was vastly higher: 73 out of 96 designs expressed solubly (median soluble yield of 247 mg/L), and 50 had the correct oligomeric state.
    • Structural Confirmation: One ProteinMPNN monomer design with a complex fold (TM-score = 0.56 against PDB) was solved by x-ray crystallography (PDB ID 8CYK). The crystal structure was nearly identical to the design target (2.35-Å RMSD over 130 residues), especially in the core, with side chains fitting perfectly into electron density. It also showed high thermostability (Fig. 3B).
  • Rescue of Repeat Proteins (Fig. 3E-F):
    • ProteinMPNN successfully redesigned previously suboptimal Rosetta designs of repeat protein structures by tying residues at equivalent positions, resulting in improved folding and solubility (Fig. 3F).
  • Cyclic Oligomers with Internal Repeats (Fig. 3G-J):
    • By tying positions both within and between subunits, ProteinMPNN enforced both cyclic and internal repeat symmetries.
    • Improved Success: For C5 and C6 cyclic oligomers, ProteinMPNN designs achieved 16 out of 18 soluble designs and 5 with the correct oligomeric state, significantly better than Rosetta's 4 out of 10 soluble and 0 correct oligomeric state.
    • Structural Confirmation: Negative-stain EM images of one design closely matched the design model (Fig. 3J).
  • Rescue of Tetrahedral Nanoparticles (Fig. 3K):
    • Problem: Previously described Rosetta designs for two-component tetrahedral nanoparticles required extensive manual intervention.
    • Solution: ProteinMPNN designed 76 sequences for 27 such backbones automatically. 13 designs formed assemblies with the expected molecular weight (~1 MDa), including several that had failed with Rosetta.
    • Structural Confirmation: The crystal structure of one ProteinMPNN nanoparticle design was very close to the design model (1.2-Å Cα RMSD over two subunits).
  • Design of Protein Function (SH3 Binder) (Fig. 4):
    • Problem: Rosetta designs for proteins scaffolding SH3-binding motifs (PPPRPPK) failed to fold into structures that bind the Grb2 SH3 domain.
    • Solution: ProteinMPNN generated sequences for the same backbones, keeping the SH3-binding motif fixed. These designs showed strong binding to the Grb2 SH3 domain by biolayer interferometry, with a much higher signal than the free peptide (Fig. 4B). Point mutations eliminated binding, confirming specificity.
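The position-tying referenced in the cyclic-oligomer item above can be sketched concretely. This is a hedged illustration, assuming tying is implemented by averaging the model's per-position logits over each tied group and drawing a single residue for the whole group; the released code may differ in detail.

```python
import numpy as np

def decode_tied_positions(logits, tied_groups, temperature=0.1, rng=None):
    """
    logits: [L, 20] per-position amino acid logits from the decoder.
    tied_groups: lists of position indices forced to share one residue, e.g.
    equivalent positions across the five subunits of a C5 homo-oligomer.
    Untied positions would simply be passed in as singleton groups.
    """
    rng = rng or np.random.default_rng()
    sequence = [None] * len(logits)
    for group in tied_groups:
        avg = logits[list(group)].mean(axis=0) / temperature  # pool predictions over the group
        probs = np.exp(avg - avg.max())
        probs /= probs.sum()
        aa = int(rng.choice(20, p=probs))
        for i in group:  # one draw, written to every tied position
            sequence[i] = aa
    return sequence

# Usage sketch: a C5 oligomer whose subunits each have n residues would tie
# positions i, i+n, i+2n, i+3n, i+4n for every i in range(n).
```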

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Model | Modification | Parameters (millions) | PDB test accuracy (%) | PDB test perplexity | AlphaFold model accuracy (%) |
|---|---|---|---|---|---|
| Baseline model | None | 1.381 | 41.2 / 40.1 | 6.51 / 6.77 | 41.4 / 41.4 |
| Experiment 1 | Add N, Cα, C, Cβ, O distances | 1.430 | 49.0 / 46.1 | 5.03 / 5.54 | 45.7 / 47.4 |
| Experiment 2 | Update encoder edges | 1.629 | 43.1 / 42.0 | 6.12 / 6.37 | 43.3 / 43.0 |
| Experiment 3 | Combine 1 and 2 | 1.678 | 50.5 / 47.3 | 4.82 / 5.36 | 46.3 / 47.9 |
| Experiment 4 | Experiment 3 with random decoding | 1.678 | 50.8 / 47.9 | 4.74 / 5.25 | 46.9 / 48.5 |

Each pair of values corresponds to models trained with backbone noise of 0.00 Å / 0.02 Å.
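
For reference, the perplexity column is the standard exponentiated mean negative log-likelihood of the native residues under the model, so lower values mean the model concentrates more probability on the correct amino acids:

$$\text{perplexity} = \exp\!\left(-\frac{1}{L}\sum_{i=1}^{L}\log p\left(a_i \mid X, a_{<i}\right)\right)$$

where $L$ is the sequence length, $X$ the backbone structure, and $a_{<i}$ the residues already decoded.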

6.3. Ablation Studies / Parameter Analysis

The paper implicitly conducts an ablation study through the "Experiments" listed in Table 1, demonstrating the contribution of various architectural improvements to ProteinMPNN's performance.

  • Impact of Additional Input Features (Experiment 1):
    • Adding distances between the N, Cα, C, Cβ, and O atoms as input features (Experiment 1) significantly increased PDB test accuracy from 41.2% (baseline) to 49.0% (without noise).
    • It also improved AlphaFold model accuracy from 41.4% to 45.7% (without noise) and 47.4% (with 0.02 Å noise). This highlights the importance of providing richer geometric context to the model (see the distance-feature sketch after this list).
  • Impact of Encoder Edge Updates (Experiment 2):
    • Implementing edge updates in the encoder (Experiment 2) alone provided a modest improvement in PDB test accuracy to 43.1% (without noise). While less impactful than additional features, it still contributed to the model's overall capability.
  • Combined Improvements (Experiment 3):
    • Combining both the additional input features and edge updates (Experiment 3) further boosted PDB test accuracy to 50.5% (without noise) and AlphaFold model accuracy to 46.3%/47.9% (without/with noise), showing a synergistic effect.
  • Random Decoding Order (Experiment 4):
    • Implementing a random decoding order (Experiment 4) on top of the previous improvements gave a further small increase in PDB test accuracy to 50.8% (without noise) and in AlphaFold model accuracy to 46.9%/48.5% (without/with noise), confirming the value of order-agnostic decoding for contextual learning and flexibility.
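To make Experiment 1 concrete (as referenced in that item), here is a minimal sketch of one plausible way to build such features: the 5 × 5 = 25 inter-atomic distances between the N, Cα, C, Cβ, and O atoms of a residue pair, lifted into a smooth representation with Gaussian radial basis functions. The bin parameters are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def rbf_encode(distances, d_min=2.0, d_max=22.0, n_bins=16):
    """Encode distances (any shape, in angstroms) with Gaussian radial basis functions."""
    centers = np.linspace(d_min, d_max, n_bins)  # assumed bin placement
    sigma = (d_max - d_min) / n_bins
    return np.exp(-(((distances[..., None] - centers) / sigma) ** 2))

def edge_distance_features(res_i, res_j):
    """
    res_i, res_j: [5, 3] coordinates of the N, Ca, C, Cb, O atoms of two residues.
    Returns the 25 inter-atomic distances, RBF-encoded and flattened.
    """
    d = np.linalg.norm(res_i[:, None, :] - res_j[None, :, :], axis=-1)  # [5, 5]
    return rbf_encode(d).reshape(-1)  # length 25 * n_bins
```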

Effect of Training Noise Level (Fig. 2C): This analysis demonstrates the trade-off and benefit of training with noise:

  • As the training noise level increases, PDB test accuracy (black line) decreases. This is because adding noise blurs the precise details of crystal structures, making it harder for the model to recover the exact native sequence.
  • However, AlphaFold prediction success rates (blue lines) generally improve with moderate noise levels.
    • For a stringent lDDT-Cα > 95 cutoff (circles), a small amount of noise (around 0.1 to 0.2 Å) is optimal.
    • For a looser lDDT-Cα > 90 cutoff (squares), models trained with more noise (up to 0.3 Å) perform better, generating two to three times more successful AlphaFold predictions than models trained without noise.
    • This shows that noise training makes ProteinMPNN sequences more robustly interpretable by AlphaFold, which is beneficial for filtering designs.
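The noise-training recipe discussed above amounts to a one-line data augmentation. A minimal sketch, assuming independent Gaussian perturbation of every backbone atom coordinate at the standard deviations explored in Fig. 2C:

```python
import numpy as np

def add_backbone_noise(coords, std=0.2, rng=None):
    """
    Training-time augmentation: perturb backbone coordinates ([L, atoms, 3], in
    angstroms) with i.i.d. Gaussian noise so the model cannot memorize fine
    crystallographic detail and instead learns sequence preferences that are
    robust to imperfect backbones.
    """
    rng = rng or np.random.default_rng()
    return coords + rng.normal(scale=std, size=coords.shape)
```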

Sequence Recovery vs. Diversity (Fig. 2D):

  • The temperature parameter during inference allows a trade-off between sequence recovery and diversity. As temperature increases, sequence recovery slightly decreases, but the diversity of generated sequences significantly increases. This is a valuable feature for experimental design, allowing researchers to explore a wider sequence space.

The following figures illustrate the experimental characterization results:

Fig. 3. Structural characterization of ProteinMPNN designs. (A) Comparison of soluble protein expression over a set of AlphaFold hallucinated monomers and homo-oligomers (blue) and the same set of backbones with sequences designed using ProteinMPNN (orange) (N = 129). The total soluble protein yield after expression in E. coli, obtained from the integrated area under size-exclusion traces of nickel-NTA-purified proteins, increases considerably after ProteinMPNN rescue of the barely soluble original sequences (median yields per liter of culture equivalent are 9 and 247 mg, respectively). Boxes represent the quartiles of the soluble yield distribution and whiskers show the rest of it. (B to D) In-depth characterization of a monomer hallucination and the corresponding ProteinMPNN rescue from the set in (A). As for almost all of the designs in (A), the sequence and structural similarity of the design model to known proteins is very low (E-value = 2.8 against UniRef100 using HHblits; TM-score = 0.56 against the PDB). As shown in (B), the ProteinMPNN-rescued design has high thermostability, with a virtually unchanged circular dichroism profile at 95°C compared with 25°C (MRE, mean residue ellipticity). Shown in (C) is a SEC profile of the failed original design overlaid with that of the ProteinMPNN rescue; (D) compares the design model with the crystal structure. Panels (E) and (G) show intra- and interchain structural details, and (F) and (H) show SEC data for further designs.

The following figure illustrates the design of protein function with ProteinMPNN:

Fig. 4. Design of protein function with ProteinMPNN. (A) Design scheme. The first panel shows the structure (PDB ID 2W0Z) of a fragment of the Gab2 peptide bound to the human Grb2 C-terminal SH3 domain (the core SH3-binding motif PPPRPPK is in green; the target is rendered as a surface and colored blue). In the second panel, helical bundle scaffolds were docked to the exposed face of the peptide using RIFDOCK (20), and Rosetta remodel was used to build loops connecting the peptide to the scaffolds. Rosetta sequence design with layer-design task operations was used to optimize the sequence of the fusion (cyan) for stability, rigidity of the peptide-helical bundle interface, and binding affinity for the Grb2 SH3 domain. The third panel shows the ProteinMPNN redesign (orange) of the designed binder sequence; hydrogen bonds involving asparagine side chains between the peptide and base scaffold are shown in green and in the inset. In the fourth panel, mutation of the two asparagines to aspartates disrupts the scaffolding of the target peptide. (B) Experimental characterization of binding using biolayer interferometry. Biotinylated C-terminal SH3 domain from human Grb2 was loaded onto streptavidin (SA) biosensors, which were then immersed in solutions containing varying concentrations of the SH3-binding peptide AIAPPPRPPKPSQ (first panel; A, alanine; I, isoleucine; S, serine; Q, glutamine) or of the designs (second to fourth panels) and then transferred to buffer lacking added protein for dissociation measurements. The ProteinMPNN design (third panel) has a much greater binding signal than the original Rosetta design (second panel); this is greatly reduced by the asparagine-to-aspartate mutations (fourth panel).

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces ProteinMPNN, a robust deep learning-based method for protein sequence design that significantly outperforms traditional physically based approaches like Rosetta and rescues previously failed designs from both Rosetta and AlphaFold. Key findings include:

  • Superior In Silico Performance: ProteinMPNN achieves a native sequence recovery of 52.4% on native backbones, far exceeding Rosetta's 32.9%, with vastly improved computational efficiency.

  • Broad Applicability: The method's flexible architecture, including order-agnostic decoding and the ability to couple positions, enables its application to a wide range of design problems, from monomers to complex cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.

  • Experimental Validation: Extensive experimental characterization using x-ray crystallography, cryo-EM, and functional studies confirms ProteinMPNN's high accuracy and utility in generating soluble, stable, and functional proteins that closely match desired target structures.

  • Robustness: Training with backbone noise improves the model's ability to design sequences for less-than-perfect backbones and enhances the confidence of AlphaFold structure predictions for designed sequences.

  • Democratization of Design: ProteinMPNN requires no expert customization, streamlining the design process and making protein engineering more accessible.

    In essence, ProteinMPNN represents a significant advancement, translating the power of deep learning from protein structure prediction to robust and experimentally successful protein sequence design.

7.2. Limitations & Future Work

The authors highlight several implicit limitations and suggest future directions:

  • Physical Transparency vs. Empirical Success: The paper notes that deep learning methods like ProteinMPNN "lack the physical transparency of methods like Rosetta." While they achieve superior empirical results by learning from the PDB, the exact "rules" or "principles" they prioritize are not as directly interpretable as explicit energy functions in physically based models. This isn't necessarily a limitation of the method's effectiveness but rather a characteristic of deep learning models in general.
  • Correlation of In Silico Metrics with Folding: The authors emphasize that in silico metrics (like native sequence recovery) are sensitive to crystallographic resolution and may not always correlate directly with proper folding in vitro. This underscores the necessity of experimental validation, which ProteinMPNN excels at, but it also implies that further work could be done to refine in silico metrics that better predict experimental success.
  • Optimizing for Expression and Stability of Native Proteins: The observation that ProteinMPNN-generated sequences for native backbones are predicted to fold more confidently by AlphaFold suggests a promising future direction: using ProteinMPNN to improve the expression and stability of recombinantly expressed native proteins, while keeping functionally critical residues fixed.
  • Further Applications: The success in rescuing designs for nanoparticles and target-binding proteins points to broad utility in areas like vaccine design, small-molecule binding, and enzymatic function. Continued exploration and application in these specific domains would be a natural extension.

7.3. Personal Insights & Critique

ProteinMPNN is a groundbreaking paper that represents a significant step forward in de novo protein design. My personal insights and critique include:

  • Paradigm Shift: The most profound aspect is the shift from physics-based energy minimization to data-driven probabilistic modeling for sequence design. This mirrors the trajectory of structure prediction and effectively sidesteps the computational intractability of explicit energy landscapes. The implicit learning from the vast PDB dataset allows ProteinMPNN to capture complex, context-dependent rules that are difficult to hand-code into force fields.
  • Practical Robustness: The inclusion of training with Gaussian noise is a brilliant practical innovation. It acknowledges the imperfection of real-world backbones (whether from computation or experiment) and designs for robustness rather than theoretical perfection. This directly translates to higher experimental success rates, which is the ultimate arbiter in protein engineering. This strategy could be broadly applied to other machine learning tasks where training data might have subtle, systematic biases or where inference will be performed on noisy inputs.
  • Democratization of Design: The speed, automation, and lack of expert customization are transformative. Previously, designing complex assemblies or functional proteins often required specialized knowledge of Rosetta protocols and iterative manual intervention. ProteinMPNN drastically lowers the barrier to entry, potentially enabling more labs to engage in sophisticated protein design.
  • Validation is Key: The paper's strength lies in its rigorous experimental validation across a diverse set of challenging design problems. This level of experimental proof is crucial for deep learning methods in biology, as in silico metrics alone can be misleading. The ability to "rescue" previously failed designs is a powerful testament to its superiority.
  • Transferability: The core methodological principles—graph neural networks for structural context, order-agnostic decoding for flexible design, and coupled positions for symmetry—are highly generalizable. These could be adapted for designing other biomolecules (e.g., RNA, DNA origami structures) or even small molecule scaffolds where structural constraints are paramount.
  • Potential Areas for Improvement/Future Exploration:
    • Interpretability: While ProteinMPNN works, understanding why it chooses certain amino acids for specific structural contexts could further advance our fundamental understanding of protein folding. Applying interpretability techniques to analyze the learned features could be valuable.

    • Beyond Fixed Backbones: The current method assumes a fixed backbone. While highly effective, future work could explore integrating backbone flexibility or even co-designing sequence and backbone, though this is a significantly more complex problem.

    • Integration with Functionality Prediction: While ProteinMPNN successfully designs sequences for a desired structure, directly integrating functional constraints into the design process (beyond just binding affinity for a known motif) could be a powerful next step.

    • Sampling Strategy: While higher temperature increases diversity, more sophisticated sampling strategies (e.g., reinforcement learning guided by in silico fitness functions) could potentially generate sequences with both high diversity and higher likelihood of specific desired properties.

      Overall, ProteinMPNN is a landmark paper that establishes deep learning as a powerful and practical tool for protein sequence design, paving the way for accelerated discovery and engineering of novel proteins.
