
InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

Published: 11/13/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a method using sparse autoencoders to extract interpretable features from protein language models, revealing up to 2,548 features correlated with 143 biological concepts, and demonstrating applications in database annotation and targeted protein generation.

Abstract

Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design, yet their internal mechanisms for predicting structure and function remain poorly understood. Here we present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from the PLM ESM-2, we identify up to 2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts such as binding sites, structural motifs, and functional domains. In contrast, examining individual neurons in ESM-2 reveals up to 46 neurons per layer with clear conceptual alignment across 15 known concepts, suggesting that PLMs represent most concepts in superposition. Beyond capturing known annotations, we show that ESM-2 learns coherent concepts that do not map onto existing annotations and propose a pipeline using language models to automatically interpret novel latent features learned by the SAEs. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases and enable targeted steering of protein sequence generation. Our results demonstrate that PLMs encode rich, interpretable representations of protein biology and we propose a systematic framework to extract and analyze these latent features. In the process, we recover both known biology and potentially new protein motifs. As community resources, we introduce InterPLM (this http URL), an interactive visualization platform for exploring and analyzing learned PLM features, and release code for training and analysis at this http URL.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

1.2. Authors

1.3. Journal/Conference

Published as a preprint on arXiv. Venue Reputation: arXiv is a highly influential open-access repository for preprints in various scientific fields, including computer science, mathematics, physics, and biology. Papers published on arXiv are not peer-reviewed prior to posting but often undergo peer review for subsequent publication in journals or conferences. Its reputation lies in its role for rapid dissemination of research findings and fostering early feedback within the scientific community.

1.4. Publication Year

2024

1.5. Abstract

Protein language models (PLMs) have achieved significant success in protein modeling and design, but their internal mechanisms for predicting structure and function are not well understood. This paper introduces a systematic approach, InterPLM, to extract and analyze interpretable features from PLMs using sparse autoencoders (SAEs). By training SAEs on embeddings from ESM-2, the authors identified up to 2,548 human-interpretable latent features per layer that strongly correlate with up to 143 known biological concepts (e.g., binding sites, structural motifs, functional domains). In contrast, examining individual neurons in ESM-2 revealed fewer interpretable concepts (up to 46 neurons per layer across 15 concepts), suggesting that PLMs represent most concepts in superposition. Beyond known annotations, ESM-2 learns coherent concepts that do not map onto existing annotations. The paper proposes a pipeline using large language models (LLMs) to automatically interpret these novel latent features. Practical applications include filling missing annotations in protein databases and targeted steering of protein sequence generation. The results demonstrate that PLMs encode rich, interpretable representations of protein biology, and InterPLM provides a systematic framework to extract and analyze these latent features, recovering both known biology and potentially new protein motifs. InterPLM is also an interactive visualization platform, and the code for training and analysis is openly released.

Official Source: https://arxiv.org/abs/2412.12101
PDF Link: https://arxiv.org/pdf/2412.12101.pdf
Publication Status: This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The remarkable success of Protein Language Models (PLMs) in tasks like protein structure prediction and design has opened new avenues in computational biology. However, these powerful models often operate as "black boxes," meaning their internal reasoning and the specific biological concepts they learn remain largely opaque. Understanding these internal mechanisms is crucial for several reasons:

  1. Model Development: Insights into PLM representations can guide the design of more effective and robust PLMs.

  2. Biological Discovery: Uncovering the biological principles encoded within PLMs can lead to new biological hypotheses and discoveries.

  3. Trust and Reliability: A better understanding of how PLMs work can increase trust in their predictions for critical applications like drug design.

    A significant challenge in PLM interpretability is the phenomenon of superposition, where a single neuron in a neural network might represent multiple unrelated concepts, or a single concept might be distributed across many neurons. This polysemanticity makes direct interpretation of individual neurons difficult and limits the ability to extract clear, human-interpretable biological features.

The paper's entry point is to address this black box problem and the superposition challenge by adapting sparse autoencoders (SAEs), a technique previously successful in Large Language Model (LLM) interpretability, to the domain of PLMs.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of PLM interpretability and biological discovery:

  • Systematic Framework for Interpretable Feature Extraction: Introduction of InterPLM, a systematic methodology to extract sparse, human-interpretable latent features from PLMs (specifically ESM-2) using sparse autoencoders (SAEs).

  • Enhanced Interpretability Compared to Neurons: SAEs successfully disentangle superposition in PLMs, identifying up to 2,548 interpretable latent features per layer, correlating with up to 143 known biological concepts (e.g., binding sites, structural motifs). In contrast, individual ESM-2 neurons showed far fewer clear conceptual alignments (up to 46 neurons per layer across 15 concepts), highlighting the SAE approach's superiority in revealing interpretable representations.

  • Discovery of Novel Biological Concepts: ESM-2 learns coherent biological concepts that do not directly map onto existing Swiss-Prot annotations. The SAE framework helps identify these potentially novel protein motifs.

  • LLM-Powered Interpretation Pipeline: Development of an automated pipeline using Large Language Models (LLMs) (Claude-3.5 Sonnet) to generate meaningful descriptions for novel SAE latent features that lack existing biological annotations, enabling interpretation beyond current databases.

  • Practical Applications:

    • Filling Missing Annotations: Demonstration of how SAE latent features can be used to identify missing or incomplete annotations in protein databases.
    • Targeted Sequence Generation: Showing that interpretable features can be used to steer protein sequence generation in a targeted manner, opening avenues for controllable protein design.
  • Community Resources: Release of InterPLM.ai (an interactive visualization platform) and open-sourcing of the code for training and analysis, fostering community engagement and further research.

    These findings collectively demonstrate that PLMs encode a rich, interpretable understanding of protein biology, and InterPLM provides a powerful framework to unlock these representations for both scientific discovery and model improvement.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a foundational understanding of several key concepts is essential:

  • Protein Language Models (PLMs):

    • Concept: PLMs are deep learning models, often based on the transformer architecture, that are trained on vast datasets of protein sequences (like UniRef or AlphaFold DB). They treat amino acid sequences as a "language," where individual amino acids are "tokens."
    • Mechanism: Through self-supervised learning objectives (e.g., masked language modeling), PLMs learn contextual representations (embeddings) for each amino acid in a sequence. These embeddings capture complex evolutionary, structural, and functional information about proteins.
    • Masked Language Modeling (MLM): A common training objective where a certain percentage of amino acids in a sequence are masked (replaced with a special token or a random amino acid), and the model is trained to predict the original masked amino acids based on their context. This forces the model to learn deep contextual dependencies.
    • Transformer Architecture: A neural network architecture introduced by Vaswani et al. (2017) that relies heavily on attention mechanisms to process sequential data. It consists of multiple identical layers, each typically containing multi-head self-attention and feed-forward neural networks.
    • Attention Mechanism: A core component of transformers that allows the model to weigh the importance of different parts of the input sequence when processing a specific token. For a given token, attention calculates a score for every other token in the sequence, indicating how relevant it is. The output is a weighted sum of the values from all other tokens. The common formula for Scaled Dot-Product Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, representing different linear transformations of the input.
      • $d_k$ is the dimension of the key vectors, used for scaling to prevent very large dot products that push the softmax function into regions with extremely small gradients.
      • $QK^T$ computes the dot-product similarity between queries and keys.
      • softmax normalizes these scores into a probability distribution.
      • The result is a weighted sum of the Value vectors, where weights are determined by the attention scores. (A short reference implementation of this formula appears after this concept list.)
    • ESM-2: A specific state-of-the-art PLM developed by Meta AI, based on the transformer architecture. It is known for learning powerful protein representations that are useful for various downstream tasks, including structure prediction (e.g., ESMFold).
  • Sparse Autoencoders (SAEs):

    • Concept: An autoencoder is a type of neural network trained to reconstruct its input. It consists of an encoder that maps the input to a latent (hidden) representation and a decoder that reconstructs the input from this latent representation.
    • Sparsity: In SAEs, an additional constraint (often an L1 regularization term on the latent activations) is added during training to encourage most latent units to be inactive (zero or near-zero) for any given input. This forces the autoencoder to learn a more disentangled representation, where each latent feature ideally corresponds to a specific, independent concept.
    • Polysemanticity: The phenomenon where individual neurons in a neural network activate for multiple, often unrelated, concepts. This makes direct interpretation of neurons difficult. SAEs are designed to address polysemanticity by finding a sparse basis where each latent feature is monosemantic (represents a single concept).
    • Dictionary Learning: The process by which SAEs learn a set of dictionary vectors (the decoder weights) that represent the fundamental "concepts" or "features" present in the data. The activations (latent features) then indicate the presence and strength of these concepts for a given input.
  • Swiss-Prot:

    • Concept: A high-quality, manually annotated, and reviewed protein sequence database, part of UniProtKB.
    • Annotations: It contains a wealth of information about proteins, including functional descriptions, post-translational modifications, domains, structural features, and disease associations, all supported by experimental evidence or computational analysis. This makes it an invaluable resource for evaluating the biological relevance of features learned by PLMs.
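
To ground the attention formula given under the Attention Mechanism entry above, here is a minimal NumPy sketch of scaled dot-product attention. It is a generic reference implementation with illustrative toy shapes, not ESM-2's actual code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy example: 5 tokens with 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (5, 8)
```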

3.2. Previous Works

The paper builds upon a lineage of work in PLM interpretability and mechanistic interpretability using SAEs.

  • PLM Interpretability:

    • Early work on PLM interpretability often focused on analyzing attention patterns within transformer layers to understand how models weigh different amino acids or regions. For example, some studies analyzed attention heads for patterns related to contact maps or secondary structure.
    • Other approaches involved probing individual neurons for associations with known biological properties (e.g., specific amino acid types, structural motifs). However, these studies often encountered polysemanticity, where a single neuron might respond to multiple, seemingly unrelated biological concepts, making interpretation challenging. The paper explicitly contrasts its findings with this neuron-level analysis, showing that SAEs reveal significantly more interpretable concepts.
    • Work by Rives et al. (2021) on ESM-1b, followed by Lin et al. (2023) on ESM-2, showed that PLMs learn deep evolutionary and structural information, paving the way for further mechanistic understanding.
  • SAE Applications in Mechanistic Interpretability:

    • SAEs have gained traction in mechanistic interpretability for Large Language Models (LLMs) and vision models. Works by teams like Anthropic and Redwood Research have demonstrated that SAEs can decompose complex neuron activations into more interpretable features, helping to uncover the underlying "circuits" or "concepts" learned by these models.
    • These SAEs have been used to identify interpretable latent features corresponding to specific linguistic phenomena (e.g., negation, specific facts) or visual concepts. This paper adapts this proven methodology to the protein domain.

3.3. Technological Evolution

The field has evolved from:

  1. Sequence-based bioinformatics: Traditional methods relying on sequence alignments, motifs, and statistical models.

  2. Early machine learning in biology: Applying simpler models to protein data.

  3. Deep learning on protein sequences: Emergence of PLMs (like ESM and AlphaFold models) that leverage massive datasets and transformer architectures to learn powerful, general-purpose protein representations. This shift allowed PLMs to capture co-evolutionary patterns and complex physicochemical principles implicitly.

  4. Interpretability of Deep Learning Models: As PLMs became more powerful, the need to understand their internal workings grew. Initial interpretability efforts (attention maps, neuron probing) provided some insights but were often limited by the models' polysemanticity.

  5. Mechanistic Interpretability with SAEs: This paper represents a crucial step in bringing mechanistic interpretability techniques, particularly SAEs successful in other domains, to PLMs. This allows for a more granular and monosemantic understanding of the biological concepts encoded within these models.

    This paper's work fits within the technological timeline as a leading-edge effort to bridge the gap between the predictive power of PLMs and their biological interpretability, moving beyond black-box predictions to uncover the explicit biological knowledge learned.

3.4. Differentiation Analysis

Compared to prior approaches, the core differences and innovations of InterPLM are:

  • Addressing Superposition Directly: While previous PLM interpretability often struggled with polysemanticity at the neuron level, InterPLM directly tackles this by using SAEs to decompose neuron activations into sparse, monosemantic latent features. This is a significant improvement, enabling the identification of thousands of clear biological concepts where neuron analysis found only dozens.

  • Quantitative Biological Validation at Scale: The paper establishes a rigorous quantitative evaluation framework using Swiss-Prot annotations. This allows for objective assessment of the biological interpretability of the SAE features, demonstrating their strong correlation with a wide range of known biological concepts.

  • Automated Interpretation of Novel Concepts: Introduction of an LLM-powered pipeline for automatically interpreting SAE features that do not map to existing Swiss-Prot annotations. This is crucial for discovering new biological insights that PLMs might have learned but are not yet cataloged in databases.

  • Practical Downstream Applications: Beyond mere interpretation, the paper demonstrates tangible applications: identifying missing protein annotations and enabling targeted protein sequence generation. This shows the utility of interpretable features for both data curation and protein engineering.

  • Open-Source and Interactive Platform: The release of the InterPLM.ai platform and code differentiates this work by making the methodology and results accessible to the broader research community, fostering collaborative discovery.

    In essence, InterPLM provides a comprehensive and systematic framework that moves beyond qualitative observation or limited neuron analysis to quantitatively extract, interpret, and apply a vast array of interpretable biological features from PLMs.

4. Methodology

4.1. Principles

The core principle of the InterPLM methodology is to leverage sparse autoencoders (SAEs) to transform the dense, often polysemantic (representing multiple concepts simultaneously) representations within Protein Language Models (PLMs) into sparse, monosemantic (representing a single, clear concept) latent features. This approach aims to address the superposition problem prevalent in neural networks, where individual neurons might encode a mix of different concepts, making direct interpretation difficult. By training SAEs on the embeddings from PLM layers, the method seeks to learn a "dictionary" of fundamental biological concepts. Each latent feature in this dictionary is designed to activate for a specific biological pattern or property, thereby enabling human-interpretable analysis. The methodology then integrates quantitative evaluation against known biological databases (Swiss-Prot) and an automated Large Language Model (LLM)-based interpretation pipeline to provide comprehensive understanding and practical applications of these extracted features.

4.2. Core Methodology In-depth (Layer by Layer)

The InterPLM methodology involves several key stages: Sparse Autoencoder Training, Swiss-Prot Concept Evaluation Pipeline, LLM Feature Annotation Pipeline, Feature Analysis and Visualization, and Steering Experiments.

4.2.1. Sparse Autoencoder Training

This stage focuses on training SAEs for each layer of the Protein Language Model (PLM) to extract interpretable features.

4.2.1.1. Dataset Preparation

The SAEs are trained on embeddings extracted from a pre-trained PLM, specifically ESM-2-8M-UR50D.

  • Source: Proteins are selected from UniRef50, a clustered protein sequence database, to ensure a diverse training dataset.
  • Embedding Extraction: Hidden representations are extracted from ESM-2 after each of its six transformer block layers (layers 1 through 6). These embeddings represent the contextualized numerical representation of each amino acid token (a minimal extraction sketch follows this list).
  • Token Exclusion: The special <cls> (classifier) and <eos> (end-of-sequence) tokens are excluded from the embeddings used for SAE training, focusing only on the amino acid representations.
  • Sampling: Proteins are randomly sampled during training to ensure a diverse and representative input for the SAEs.
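
As a concrete illustration of this extraction step, the sketch below pulls per-residue embeddings from the 6-layer ESM-2-8M model using the fair-esm package. The layer index, example sequence, and variable names are illustrative assumptions; the paper's exact data pipeline is not described at this level of detail.

```python
import torch
import esm  # pip install fair-esm

# Load the 6-layer, 8M-parameter ESM-2 model used in the paper
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Illustrative protein sequence; in practice these come from UniRef50
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[4])   # hidden states after layer 4
reps = out["representations"][4]           # shape: (1, seq_len + 2, 320)

# Drop the <cls> and <eos> positions so only amino acid tokens remain
aa_embeddings = reps[:, 1:-1, :]
```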

4.2.1.2. Architecture and Training Parameters

The SAE architecture is designed to expand the PLM's embedding space into a much larger dictionary space, promoting sparsity.

  • Expansion Factor: Each of ESM-2's six layers has 320 neurons (representing the embedding dimension). The SAEs are designed to expand this into a feature dictionary size of 10,240 latent features. This 32x expansion (10240 / 320) is a common heuristic in SAE training to allow for sparse decomposition.

  • Training Protocol:

    • For each ESM-2 layer, 20 SAEs are trained independently.

    • Each SAE is trained for 500,000 steps.

    • A batch size of 2,048 is used during training.

    • Learning Rates: Learning rates are sampled in 10x increments, ranging from 1e-4 to 1e-8.

    • L1 Penalty: An L1 regularization penalty is applied, with values ranging from 0.07 to 0.1. The L1 penalty is crucial for encouraging sparsity in the latent activations, pushing many feature activations towards zero. This penalty forces the model to represent inputs using only a few active features, leading to more disentangled and interpretable representations.

    • Warm-up Phase: L1 penalty values are decreased during an initial warm-up phase, with the learning rate reaching its maximum within the first 5% of training steps.

      The reconstruction of an input activation vector $\mathbf{x}$ by an SAE can be expressed as: $ \mathbf{x} \approx \mathbf{b} + \sum_{i=1}^{d_{\mathrm{dict}}} f_i(\mathbf{x}) \, \mathbf{d}_i $ Where:

  • $\mathbf{x}$: The original embedding vector from the PLM layer (input to the SAE).

  • $\mathbf{b}$: A bias term that is added to the reconstruction.

  • $d_{\mathrm{dict}}$: The size of the learned dictionary (number of latent features, e.g., 10,240).

  • $f_i(\mathbf{x})$: The activation value of the $i$-th latent feature for the input $\mathbf{x}$. This is the output of the encoder.

  • $\mathbf{d}_i$: The dictionary vector (or decoder weight vector) for the $i$-th latent feature. These vectors are rows in the decoder matrix and represent the learned basis elements.

    This equation shows that the SAE reconstructs the input embedding as a linear combination of its dictionary vectors, weighted by their activation values. The sparsity constraint ensures that for any given input $\mathbf{x}$, only a few $f_i(\mathbf{x})$ values will be non-zero, making the contribution of each $\mathbf{d}_i$ explicit.

The activation function for the latent features $f_i(\mathbf{x})$ is typically a Rectified Linear Unit (ReLU), ensuring non-negativity and contributing to sparsity. The relationship is depicted in the following figure:

Figure 8: Overview of SAE decomposition and training. (a) Decomposition of an embedding vector into a weighted sum of dictionary elements. (b) Architecture of the SAE.

Specifically, the latent feature activation $f_i(\mathbf{x})$ is calculated as: $ f_i(\mathbf{x}) = \mathrm{ReLU}(W_e(\mathbf{x} - \mathbf{b}_d) + \mathbf{b}_e)_i $ Where:

  • $\mathbf{x}$: The input embedding vector.
  • $W_e$: The encoder weight matrix.
  • $\mathbf{b}_d$: The decoder bias term, subtracted from the input before encoding.
  • $\mathbf{b}_e$: The encoder bias term.
  • $\mathrm{ReLU}$: The Rectified Linear Unit activation function, defined as $\mathrm{ReLU}(z) = \max(0, z)$. This enforces non-negativity and promotes sparsity.
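
The PyTorch sketch below shows one way to implement the encoder/decoder pair described above, with the 320 → 10,240 expansion used in the paper and an MSE + L1 training objective. Initialization, bias handling, and training-loop details are simplified assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: f(x) = ReLU(W_e (x - b_d) + b_e),  x_hat = W_d f(x) + b_d."""

    def __init__(self, d_model: int = 320, d_dict: int = 10_240):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # W_e and b_e
        self.decoder = nn.Linear(d_dict, d_model)   # W_d (rows are the d_i) and b_d

    def forward(self, x):
        f = torch.relu(self.encoder(x - self.decoder.bias))  # sparse latent activations
        x_hat = self.decoder(f)                               # reconstruction
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(2048, 320)                                    # a batch of ESM-2 embeddings
x_hat, f = sae(x)
loss = ((x - x_hat) ** 2).mean() + 0.08 * f.abs().sum(dim=-1).mean()  # MSE + L1 penalty
```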

4.2.1.3. Feature Normalization

To standardize feature comparisons and ensure consistency across different SAEs and layers, activation values are normalized.

  • Method: Activation values are scaled (e.g., using min-max scaling or z-score normalization) across a subset of 5,000 proteins randomly sampled from Swiss-Prot. This ensures that activation values are comparable regardless of the specific SAE or protein context.
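
The exact scaling procedure is not spelled out here; one simple option consistent with the description, sketched below, is to rescale each feature by its maximum activation over the 5,000 reference proteins so that all activations fall in [0, 1].

```python
import numpy as np

def normalize_features(acts: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each SAE feature to [0, 1] by its maximum over a reference set.

    acts: array of shape (n_positions, d_dict) of activations collected from
          ~5,000 randomly sampled Swiss-Prot proteins.
    """
    max_per_feature = acts.max(axis=0, keepdims=True)
    return acts / (max_per_feature + eps)
```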

4.2.1.4. SAE Metrics (Appendix B.1.1)

The performance of SAEs is evaluated using several metrics beyond standard reconstruction loss, focusing on sparsity and interpretability.

  • L1 Norm of Activations ($L_1(f(x))$): Measures the sum of absolute values of feature activations, used as a regularization term to encourage sparsity. $ L_1(f(x)) = \sum_{i=1}^{d_{\mathrm{dict}}} | f_i(x) | $ Where:
    • $f_i(x)$: The activation value of the $i$-th latent feature for input $x$.
    • $d_{\mathrm{dict}}$: The total number of latent features in the dictionary.
  • Mean Squared Error ($\mathrm{MSE}(x, x')$): Quantifies the difference between the original input embedding $x$ and its reconstruction $x'$, indicating the quality of the SAE's reconstruction. $ \mathrm{MSE}(x, x') = \frac{1}{d} \sum_{i=1}^{d} (x_i - x'_i)^2 $ Where:
    • $x_i$: The $i$-th component of the original embedding vector.
    • $x'_i$: The $i$-th component of the reconstructed embedding vector.
    • $d$: The dimension of the embedding vector (320 for ESM-2-8M).
  • L0 Norm of Activations ($L_0(f(x))$): Counts the number of non-zero latent feature activations for a given input, directly measuring sparsity. $ L_0(f(x)) = \sum_{i=1}^{d_{\mathrm{dict}}} \mathbf{1}(f_i(x) > 0) $ Where:
    • $\mathbf{1}(\cdot)$: The indicator function, which equals 1 if the condition is true and 0 otherwise.
    • $f_i(x)$: The activation value of the $i$-th latent feature for input $x$.
    • $d_{\mathrm{dict}}$: The total number of latent features.
  • % Loss Recovered: This metric assesses how much of the original PLM's predictive performance (measured by cross-entropy) can be retained when its internal embedding layer is replaced by the SAE's reconstruction of that embedding. It quantifies how well the SAE preserves the information necessary for the PLM's downstream task. $ \%\ \mathrm{Loss\ Recovered} = \left( 1 - \frac{CE_{\mathrm{Reconstruction}} - CE_{\mathrm{Original}}}{CE_{\mathrm{Zero}} - CE_{\mathrm{Original}}} \right) \times 100 $ Where:
    • $CE_{\mathrm{Reconstruction}}$: The cross-entropy of the PLM when the specified embedding layer is replaced by the SAE's reconstruction of the embedding.
    • $CE_{\mathrm{Original}}$: The cross-entropy of the PLM using its original embedding layer.
    • $CE_{\mathrm{Zero}}$: The cross-entropy of the PLM when the specified embedding layer is replaced with all zeros (a baseline representing complete information loss).
    • The numerator $(CE_{\mathrm{Reconstruction}} - CE_{\mathrm{Original}})$ measures the increase in loss due to reconstruction, while the denominator $(CE_{\mathrm{Zero}} - CE_{\mathrm{Original}})$ represents the total potential loss if the embedding provided no information. A higher % Loss Recovered indicates that the SAE reconstruction effectively preserves the information content of the original embedding.
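
A compact sketch of these metrics is shown below. The tensor shapes are illustrative, and the three cross-entropy values for % Loss Recovered are assumed to be computed separately by re-running ESM-2 with the original, reconstructed, and zeroed embeddings.

```python
import torch

def sae_sparsity_metrics(x, x_hat, f):
    """Average L0, L1 sparsity and reconstruction MSE over a batch of SAE outputs."""
    l0 = (f > 0).float().sum(dim=-1).mean()   # mean number of active features
    l1 = f.abs().sum(dim=-1).mean()           # mean L1 norm of activations
    mse = ((x - x_hat) ** 2).mean()           # mean reconstruction error
    return l0.item(), l1.item(), mse.item()

def pct_loss_recovered(ce_reconstruction, ce_original, ce_zero):
    """Fraction of masked-LM performance retained when a layer is replaced by its
    SAE reconstruction (ce_zero = loss with that layer zeroed out)."""
    return (1 - (ce_reconstruction - ce_original) / (ce_zero - ce_original)) * 100
```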

4.2.2. Swiss-Prot Concept Evaluation Pipeline

This pipeline quantitatively evaluates the biological interpretability of SAE features by associating them with known biological concepts from Swiss-Prot.

4.2.2.1. Dataset Construction

  • Source: A random sample of 50,000 proteins is drawn from the reviewed subset of UniProtKB (Swiss-Prot).
  • Filtering: Proteins with lengths under 1,024 amino acids are selected for efficiency. This dataset is partitioned into validation and held-out sets.
  • Concept Coverage: The Swiss-Prot annotations cover a wide range of biological concepts, including structural features (active sites, binding sites, disulfide bonds), modifications (carbohydrate, lipid), targeting (signal peptide, transit peptide), and functional domains (protein domains, motifs). A comprehensive list of categories and specific concepts used is provided in Appendix C.2.

4.2.2.2. Feature-Concept Association Analysis

For each SAE feature and each Swiss-Prot concept, the association is quantified using modified precision, recall, and F1 score metrics. The goal is to identify how well a feature's activation pattern aligns with the presence of a known biological concept.

  • Activation Thresholds: Feature activations are binarized at three thresholds: 0.5, 0.6, and 0.8 to determine "active" positions.
  • Metrics:
    • Precision: Measures the proportion of feature-activated positions that are true positives for a given concept. $ \mathrm{precision} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}} $ Where:
      • TruePositives: Number of positions where the feature is active AND the concept is present.
      • FalsePositives: Number of positions where the feature is active BUT the concept is NOT present.
    • Recall: Measures the proportion of positions where a concept is present that are correctly identified by the feature. This metric is adapted for protein domains. $ \mathrm{recall} = \frac{\mathrm{DomainsWithTruePositive}}{\mathrm{TotalDomains}} $ Where:
      • DomainsWithTruePositive: Number of protein domains (or regions) where the concept is present AND at least one position within that domain is activated by the feature.
      • TotalDomains: Total number of protein domains (or regions) where the concept is present.
    • F1 Score: The harmonic mean of precision and recall, providing a balanced measure of a feature's association with a concept. $ \mathrm{F1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $
  • Threshold Selection: For each feature-concept pair, the activation threshold (0.5, 0.6, or 0.8) that yields the highest F1 score is selected for final evaluation.
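
The sketch below illustrates the per-position precision, domain-level recall, and best-threshold F1 described above for a single feature-concept pair. The input names (acts, concept_mask, concept_domains) are illustrative, not the paper's code.

```python
import numpy as np

def feature_concept_f1(acts, concept_mask, concept_domains, thresholds=(0.5, 0.6, 0.8)):
    """Best F1 across activation thresholds for one SAE feature and one concept.

    acts:            (n_positions,) normalized feature activations
    concept_mask:    (n_positions,) boolean, True where the concept is annotated
    concept_domains: list of (start, end) index spans of annotated domains/regions
    """
    best_f1 = 0.0
    for t in thresholds:
        active = acts > t
        tp = int(np.sum(active & concept_mask))
        fp = int(np.sum(active & ~concept_mask))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        hits = sum(active[start:end].any() for start, end in concept_domains)
        recall = hits / len(concept_domains) if concept_domains else 0.0
        if precision + recall:
            best_f1 = max(best_f1, 2 * precision * recall / (precision + recall))
    return best_f1
```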

4.2.2.3. Model Selection and Evaluation

  • Initial Evaluation: Initial evaluations are conducted on 20% of the 135 concepts that have more than 10 domains or 1,500 amino acids.
  • SAE Selection: For each ESM-2 layer, the SAE (from the 20 trained SAEs) with the highest average F1 score across the validation set is chosen for subsequent analyses and inclusion in the InterPLM dashboard.
  • Test Set Evaluation: For the selected SAEs, feature-concept pairs with an F1 score > 0.5 in the validation set are identified, and their F1 scores are then calculated on an independent test set. The number of pairs retaining an F1 score > 0.5 in the test set is reported.

4.2.2.4. Baselines

To demonstrate the effectiveness of SAEs, comparisons are made against two baselines:

  • Individual Neurons of ESM-2: Directly evaluating the interpretability of ESM-2's individual neurons using the same Swiss-Prot association pipeline. This serves as a direct comparison to show SAEs disentangling polysemanticity.
  • SAEs Trained on Randomized ESM-2 Weights: Training SAEs on embeddings from an ESM-2 model with shuffled weights. This control ensures that interpretability is due to the learned representations of the PLM, not just inherent properties of the SAE or the protein data distribution. Randomized models are expected to show little to no association with biological concepts.

4.2.3. LLM Feature Annotation Pipeline

This pipeline automatically generates descriptions for SAE features, especially those without existing Swiss-Prot annotations.

4.2.3.1. Example Selection

  • Feature Subset: Analysis is performed on a random selection of 1,200 (10%) SAE features.
  • Representative Proteins: For each selected feature, proteins that maximally activate that feature are chosen. Specifically, 10 Swiss-Prot proteins are selected based on their activation levels.
  • Activation Bins: Activation levels are quantified into 10 bins (0-0.1, 0.1-0.2, ..., 0.9-1.0). For each feature, 1 protein is selected per bin. If a feature has fewer than 10 maximally activating proteins in the highest bin (0.9-1.0), additional examples are sampled from lower bins to reach a total of 10 proteins, ensuring a diverse range of activations are shown to the LLM.

4.2.3.2. Description Generation and Validation

  • LLM Used: Claude-3.5 Sonnet (a Large Language Model from Anthropic).
  • Prompting: The LLM is provided with:
    • Protein metadata from Swiss-Prot (e.g., protein name, gene names, organism, functional annotations).
    • Quantitative activation values for each amino acid in the selected proteins.
    • Amino acid identities at these activated positions.
    • The LLM is instructed to generate a description of the feature's activation patterns and a concise summary, focusing on biological properties associated with high activation and overlapping functional/structural annotations.
  • Validation: To validate the quality of the LLM-generated descriptions, the LLM is then prompted to predict the maximum activation value for new, unseen proteins based solely on its own generated description and the protein's metadata. The LLM's predicted activations are compared to the true measured activations using Pearson correlation. A high Pearson correlation indicates that the LLM's description accurately captures the essence of the feature's activation.
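
A minimal sketch of this validation loop is given below. The LLM call is abstracted behind a placeholder ask_llm function (hypothetical); only the comparison of predicted versus measured maximum activations via Pearson correlation mirrors the pipeline described above.

```python
from scipy.stats import pearsonr

def validate_feature_description(description, heldout_proteins, measured_max_acts, ask_llm):
    """Score an LLM-generated feature description by how well it predicts activation.

    ask_llm(description, protein_metadata) is a placeholder for a Claude-3.5 Sonnet
    call that returns a predicted maximum activation in [0, 1] for one protein.
    """
    predicted = [ask_llm(description, protein) for protein in heldout_proteins]
    r, _ = pearsonr(predicted, measured_max_acts)
    return r   # the paper reports a median r of 0.72 across sampled features
```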

4.2.4. Feature Analysis and Visualization

This stage involves techniques for exploring and clustering the extracted SAE features.

4.2.4.1. UMAP Embedding and Clustering

  • Dimensionality Reduction: Uniform Manifold Approximation and Projection (UMAP) is applied to the normalized SAE decoder weights (the dictionary vectors $\mathbf{d}_i$). UMAP is a non-linear dimensionality reduction technique that preserves both local and global structure of the high-dimensional data, allowing for visualization of feature relationships in 2D or 3D.
    • Parameters: metric='euclidean', n_neighbors=15, min_dist=0.1.
  • Clustering: Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is used to identify natural clusters among the UMAP-embedded features. HDBSCAN is robust to noise and can find clusters of varying densities and shapes.
    • Parameters: min_cluster_size=5, min_samples=3.
  • Visualization: The results are visualized in the InterPLM interface, where features are plotted in the UMAP space and colored according to their clusters or Swiss-Prot associations.
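
Assuming the standard umap-learn and hdbscan packages, the clustering step could look like the sketch below; the parameters follow the paper, while the file path and normalization line are illustrative.

```python
import numpy as np
import umap
import hdbscan

# decoder_weights: (d_dict, d_model) dictionary vectors d_i (illustrative path)
decoder_weights = np.load("decoder_weights.npy")
decoder_weights /= np.linalg.norm(decoder_weights, axis=1, keepdims=True)

# 2D embedding of the dictionary vectors
embedding = umap.UMAP(metric="euclidean", n_neighbors=15,
                      min_dist=0.1).fit_transform(decoder_weights)

# Density-based clusters over the UMAP embedding; label -1 marks noise points
labels = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3).fit_predict(embedding)
```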

4.2.4.2. Sequential and Structural Feature Analysis

To understand if SAE features represent local sequence motifs or 3D structural elements, a detailed analysis is performed.

  • Procedure:
    1. High-Activation Regions: For proteins with available AlphaFold structures, regions with high feature activation (> 0.6) are identified.
    2. Clustering Metrics: For each protein's highest-activation residue, two metrics are calculated:
      • Sequential Clustering: Mean feature activation within ±2 positions in the sequence around the highest-activation residue.
      • Structural Clustering: Mean feature activation of residues within 6 Å (angstroms) in 3D space of the highest-activation residue.
    3. Null Distributions: Null distributions for these clustering metrics are generated by averaging 5 random permutations of activation values per protein. This helps determine if observed clustering is statistically significant compared to random chance.
    4. Significance Assessment: Paired t-tests and Cohen's d effect sizes are used to assess the significance of sequential and structural clustering across 100 proteins per feature (or features with at least 25 examples meeting a high activation threshold).
    5. Visualization: Features are colored based on the ratio of structural to sequential effect sizes, indicating whether they primarily capture 3D structural proximity or linear sequence patterns. Only features with Bonferroni-corrected structural p-values < 0.05 are considered significant.
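
The sketch below computes the sequential and structural clustering scores, plus a permutation null, for a single protein, assuming a precomputed pairwise Cα distance matrix from its AlphaFold structure. The aggregation across proteins (paired t-tests, Cohen's d, Bonferroni correction) is omitted for brevity.

```python
import numpy as np

def clustering_scores(acts, dist_matrix, n_perm=5, seed=0):
    """Sequential vs. structural activation clustering around the top residue.

    acts:        (L,) normalized feature activations for one protein
    dist_matrix: (L, L) pairwise C-alpha distances in Angstroms from the structure
    """
    rng = np.random.default_rng(seed)
    center = int(np.argmax(acts))
    seq_idx = [i for i in range(center - 2, center + 3)
               if 0 <= i < len(acts) and i != center]
    struct_idx = np.where((dist_matrix[center] < 6.0) &
                          (np.arange(len(acts)) != center))[0]

    def scores(a):
        seq = a[seq_idx].mean() if seq_idx else 0.0                 # +/-2 in sequence
        struct = a[struct_idx].mean() if len(struct_idx) else 0.0   # within 6 A in 3D
        return seq, struct

    seq_score, struct_score = scores(acts)
    null = [scores(rng.permutation(acts)) for _ in range(n_perm)]
    null_seq, null_struct = np.mean(null, axis=0)
    return seq_score, struct_score, null_seq, null_struct
```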

4.2.5. Steering Experiments

This stage demonstrates a practical application: using interpretable SAE features to guide protein sequence generation.

  • Approach: Following a method described in prior work [20], ESM embeddings are decomposed into SAE reconstruction predictions and error terms.
  • Sequence Steering Steps:
    1. Extract embeddings at the specified PLM layer.
    2. Calculate SAE reconstructions and error terms for these embeddings.
    3. Modify the reconstruction by amplifying or suppressing the activation of a desired feature.
    4. Combine the modified reconstructions with the error terms.
    5. Allow normal PLM processing to continue with these modified embeddings.
    6. Extract probabilities for masked tokens (e.g., probability of a specific amino acid at a masked position) using tools like NNsight.
  • Example: The paper demonstrates this by steering specific glycine-related features to influence glycine predictions in periodic patterns (like GxGX repeats in collagen). This shows how fine-grained control over PLM outputs can be achieved by manipulating interpretable latent features.
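
A conceptual sketch of the intervention is shown below, reusing the SparseAutoencoder sketch from Section 4.2.1.2. In the paper the modified embeddings are patched back into ESM-2 with NNsight; here that plumbing is omitted and only the decompose-modify-recombine logic is illustrated.

```python
import torch

def steer_embeddings(sae, hidden_states, feature_idx, steer_value):
    """Return (modified SAE reconstruction + error term) for a chosen layer.

    hidden_states: (batch, seq_len, d_model) embeddings at the chosen ESM-2 layer
    feature_idx:   index of the SAE latent feature to amplify or suppress
    steer_value:   activation to clamp the feature to (e.g., a multiple of its max)
    """
    with torch.no_grad():
        x_hat, f = sae(hidden_states)        # SAE reconstruction and latent activations
        error = hidden_states - x_hat        # what the SAE fails to capture
        f = f.clone()
        f[..., feature_idx] = steer_value    # clamp the chosen feature
        x_hat_steered = sae.decoder(f)       # decode the modified latents
    return x_hat_steered + error             # pass on for normal PLM processing
```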

5. Experimental Setup

5.1. Datasets

The study utilizes two primary datasets: UniRef50 for SAE training and Swiss-Prot for feature evaluation and LLM interpretation.

  • UniRef50:

    • Source: UniRef50 (UniProt Reference Clusters at 50% sequence identity) is a comprehensive, non-redundant database of protein sequences.
    • Purpose: Used as the training dataset for the SAEs. Its purpose is to provide a broad and diverse set of protein sequences to train SAEs to learn general protein biology, similar to how ESM-2 itself was trained.
    • Characteristics: Proteins are clustered such that sequences within a cluster share at least 50% sequence identity, reducing redundancy while retaining diversity. The paper doesn't specify the exact number of proteins or their typical lengths from UniRef50 used for SAE training beyond "proteins from UniRef50."
    • Data Sample: A protein sequence from UniRef50 might look like: MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG (an example of a Globin sequence).
  • Swiss-Prot (Reviewed UniProtKB):

    • Source: The reviewed section of the UniProt Knowledgebase (UniProtKB), known for its high-quality, manually curated, and extensively annotated protein entries.
    • Purpose: Primarily used for quantitative evaluation of SAE features against known biological concepts, for selecting optimal SAE models, and for providing metadata to the LLM during feature annotation.
    • Characteristics:
      • Size: A random sample of 50,000 proteins was drawn for evaluation.
      • Length: Proteins with lengths under 1,024 amino acids were selected. The criteria for validation and held-out sets ensured a reasonable number of annotations (e.g., concepts with >10 domains or >1,500 amino acids).
      • Annotations: Rich annotations cover various aspects of protein biology, categorized into:
        • Basic Identification Fields (e.g., accession, protein_name, sequence, organism_name, length).
        • Structural Features (e.g., active sites, binding sites, disulfide bonds, helical regions, beta strands, coiled coil regions, transmembrane regions).
        • Modifications and Chemical Features (e.g., carbohydrate modifications, lipid modifications, modified residues, cofactor information).
        • Targeting and Localization (e.g., signal peptide, transit peptide).
        • Functional Domains and Regions (e.g., protein domains, short motifs, zinc finger regions).
        • Functional Annotations (e.g., catalytic activity, enzyme commission number, function, protein families, Gene Ontology function).
      • Data Sample (Annotations): For a protein like Q46638 (Glycogen synthase from Escherichia coli), Swiss-Prot might have an annotation for a binding site at positions 210-217 for ATP, or a domain called Glycosyltransferase spanning 15-400. The actual sequence might be like: MKIISTL...GGLSQ....
  • Rationale for Dataset Choice:

    • UniRef50 provides the necessary scale and diversity for training deep learning models like SAEs effectively.
    • Swiss-Prot is the gold standard for high-quality, verified biological annotations. Its detailed and diverse annotations make it an ideal choice for rigorously evaluating whether SAE features correspond to known biological concepts. It effectively validates the method's ability to uncover biologically meaningful representations.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the performance of SAE features in capturing biological concepts and the quality of LLM-generated descriptions.

5.2.1. Precision

Conceptual Definition: Precision (also known as Positive Predictive Value) measures the accuracy of the feature's positive predictions. It answers the question: "Of all the positions (or domains) that the feature activated for, how many actually correspond to the target biological concept?" A high precision indicates that when the feature activates, it is very likely to be signaling the presence of the concept.

Mathematical Formula: $ \mathrm{precision} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}} $ Symbol Explanation:

  • TruePositives: The number of positions (or domains) where the SAE feature activated above a certain threshold, and the target biological concept was actually present according to Swiss-Prot annotations.
  • FalsePositives: The number of positions (or domains) where the SAE feature activated above a certain threshold, but the target biological concept was not present.

5.2.2. Recall

Conceptual Definition: Recall (also known as Sensitivity or True Positive Rate) measures the ability of the feature to find all relevant instances of a biological concept. It answers the question: "Of all the positions (or domains) where the target biological concept was actually present, how many did the feature successfully activate for?" A high recall indicates that the feature is good at detecting most occurrences of the concept. This metric is specifically adapted for protein domains in this paper.

Mathematical Formula: $ \mathrm{recall} = \frac{\mathrm{DomainsWithTruePositive}}{\mathrm{TotalDomains}} $ Symbol Explanation:

  • DomainsWithTruePositive: The number of protein domains (or regions) where the target biological concept was present, AND at least one position within that domain (or region) was activated by the SAE feature above the threshold.
  • TotalDomains: The total number of protein domains (or regions) where the target biological concept was present according to Swiss-Prot annotations.

5.2.3. F1 Score

Conceptual Definition: The F1 score is the harmonic mean of precision and recall. It is a useful metric when you need to balance precision and recall, especially in cases of imbalanced class distribution (e.g., when biological concepts are rare). A high F1 score indicates that the feature has both high precision (few false positives) and high recall (few false negatives).

Mathematical Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $ Symbol Explanation:

  • precision: The precision value calculated as described above.
  • recall: The recall value calculated as described above.

5.2.4. Pearson Correlation

Conceptual Definition: The Pearson correlation coefficient ($r$) measures the linear relationship between two sets of data. In this paper, it's used to validate the LLM-generated descriptions. It quantifies how well the LLM's predicted activation values for unseen proteins align with the actual measured activation values of the SAE feature. A coefficient of 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.

Mathematical Formula: The Pearson correlation coefficient for two variables $X$ and $Y$ is: $ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} $ Symbol Explanation:

  • $n$: The number of data points (in this context, the number of proteins for which activation values are compared).
  • $X_i$: The $i$-th LLM-predicted maximum activation value for a feature.
  • $\bar{X}$: The mean of all LLM-predicted maximum activation values.
  • $Y_i$: The $i$-th actual measured maximum activation value for the SAE feature.
  • $\bar{Y}$: The mean of all actual measured maximum activation values.

5.3. Baselines

The effectiveness of the InterPLM approach (using SAE features) is benchmarked against two crucial baselines to highlight its advantages:

  • Individual Neurons in ESM-2:

    • Description: This baseline involves directly analyzing the activations of the individual neurons within the ESM-2 transformer layers using the same Swiss-Prot Concept Evaluation Pipeline.
    • Purpose: This comparison directly addresses the superposition problem. By showing that SAE features correlate with significantly more biological concepts than individual neurons, the authors demonstrate that SAEs successfully disentangle the mixed representations found in neurons, leading to clearer, monosemantic concepts.
    • Representativeness: This is a standard and intuitive baseline for interpretability studies, as neurons are the fundamental computational units of a neural network.
  • SAEs Trained on Randomized ESM-2 Weights:

    • Description: This control experiment involves training SAEs on embeddings extracted from an ESM-2 model where the weights have been randomized (shuffled).
    • Purpose: This baseline ensures that the observed interpretability of SAE features is genuinely due to the learned biological representations within the PLM and not merely an artifact of the SAE training process itself or generic statistical patterns in protein sequences. A randomized model should not encode meaningful biological information, so SAEs trained on its embeddings should not yield interpretable biological features.
    • Representativeness: This is a robust control for attributing interpretability to the PLM's learned knowledge rather than inherent biases in the SAE method or the data distribution.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents compelling evidence that sparse autoencoders (SAEs) can extract rich, interpretable biological features from Protein Language Models (PLMs), far surpassing the interpretability of individual neurons.

1. SAEs Find Interpretable Concepts in Protein Language Models (Section 3.1 & Figure 3, Figure 9): The core finding is that SAEs successfully extract human-interpretable latent features from ESM-2.

  • Quantitative Advantage: SAEs identify up to 2,548 interpretable latent features per layer that correlate strongly with up to 143 known Swiss-Prot biological concepts (e.g., binding sites, structural motifs, functional domains).

  • Contrast with Neurons: In stark contrast, individual neurons in ESM-2 revealed clear conceptual alignment for only up to 46 neurons per layer across 15 concepts. This dramatic difference (e.g., 2,548 features vs. 46 neurons) strongly supports the hypothesis that PLMs represent most concepts in superposition, and SAEs are effective at disentangling these concepts.

  • Control Experiment Validation: The effectiveness of SAE features in capturing biological concepts disappears when SAEs are trained on randomized ESM-2 model weights (Figure 3, green bars). This confirms that the interpretability is derived from the meaningful biological information learned by the PLM, not just an artifact of the SAE method.

  • Robustness Across Layers: The superiority of SAE features over neurons persists across all ESM-2 layers (Figure 9b). Early layers tend to capture more basic properties (like amino acid types in randomized models), while deeper layers capture more complex biological concepts.

    The following are results from [Figure 3b] of the original paper:

    [Bar chart: the number of human-interpretable features extracted by SAEs at each ESM-2 layer; the count increases markedly with depth, peaking at roughly 2,500 in layer 6.] Figure 3b: Number of features with concept F1 scores > 0.5.

This bar chart clearly shows that SAE features (pink bars) consistently identify a significantly higher number of Swiss-Prot concepts with an F1 score > 0.5 across all ESM-2 layers compared to ESM neurons (blue bars) and SAEs trained on shuffled weights (green bars). The count rises from a few hundred in layer 1 to over 2,500 in layer 6 for SAE features.

2. Features Form Clusters Based on Shared Functional and Structural Roles (Section 3.4 & Figure 4, Figure 6, Figure 7): SAE features not only capture specific concepts but also exhibit meaningful clustering patterns.

  • UMAP Visualization: UMAP embeddings of SAE decoder weights reveal clusters of features that share functional and structural roles (Figure 4a). For example, a "kinase cluster" was identified where features specialized in different sub-regions of the kinase domain (e.g., catalytic loop, beta sheet).
  • Specificity vs. Generality: Some features are highly specific (e.g., f/1503 as a TBDR (TonB-dependent receptor) detector with F1 = 0.998), while others are more general (beta barrel features). This demonstrates the SAE's ability to learn a spectrum of conceptual granularity.
  • Sequential vs. Structural Activations: Analysis of activation patterns in 3D protein structures shows that SAE features can represent both sequential motifs and 3D structural elements (Figure 2a). Features that activate on spatially proximal residues, even if far apart in sequence, indicate learning of 3D structural concepts.

3. Large Language Models Can Generate Meaningful Feature Descriptions (Section 3.5 & Figure 5): The paper demonstrates a novel pipeline for automatically interpreting SAE features using LLMs, extending interpretability beyond predefined annotations.

  • Automated Interpretation: Claude-3.5 Sonnet successfully generated feature descriptions by analyzing protein metadata and feature activation patterns. These LLM-generated descriptions were highly predictive of feature activation on new proteins (median Pearson r correlation = 0.72).

  • Discovering Unannotated Concepts: This LLM-powered pipeline is particularly valuable for features that do not map to existing Swiss-Prot annotations (which labeled less than 20% of features across layers). For instance, f/8386 (a hexapeptide beta-helix feature) was accurately described by the LLM despite lacking Swiss-Prot annotations. This highlights PLMs' ability to learn coherent concepts not yet documented in databases.

    The following are results from [Figure 5] of the original paper:

    [Schematic: workflow for generating LLM feature descriptions and evaluating predicted activations, with three example features (f/8386, f/10091, f/7404) whose measured and predicted activations are compared via Pearson correlation.] Figure 5: Language models can generate automatic feature descriptions for SAE features. (a) Workflow for generating and validating descriptions with Claude-3.5 Sonnet (new). (b) Measured versus predicted maximum activations for the generated feature descriptions, with Claude's description summary for each feature annotated next to the maximally activating protein structures.

This figure illustrates the workflow for LLM-based interpretation (a) and shows examples of how LLM-generated descriptions (e.g., for f/8386, f/10091, f/7404) correlate well with the actual feature activation in proteins (b). For instance, the LLM correctly describes f/8386 as activating on a hexapeptide beta-helix, which aligns with the structural motif observed in the maximally activating protein.

4. Feature Activations Identify Missing and Novel Protein Annotations (Section 3.6 & Figure 6): A practical application of interpretable SAE features is to fill gaps in existing biological databases.

  • Nudix Box Motif: f/939 consistently activates on specific amino acids within a conserved Nudix box motif, even in proteins (e.g., B2GFH1) where this motif is not yet annotated in Swiss-Prot but is confirmed by other external databases.

  • Peptidase Domain: Similarly, f/3147 identifies peptidase domains across multiple amino acids, revealing missing annotations for this domain in certain proteins.

  • UDP-GlcNAc and Mg2+ Binding Sites: f/9046 suggests previously unannotated binding sites for UDP-N-acetyl-α-D-glucosamine (UDP-GlcNAc) and Mg2+ within bacterial glycosyltransferases.

  • These examples demonstrate that PLMs learn beyond curated annotations and that SAE features can serve as powerful tools for biological discovery and database curation.

    The following are results from [Figure 6] of the original paper:

    Figure 6: Feature activation patterns can be used to identify missing and new protein annotations. (a) f/939 identifies a missing motif annotation for the Nudix box: it activates on single amino acids at conserved positions, and the left protein (B2GFH1), which has no Nudix motif annotation in Swiss-Prot, shows the same activation pattern, implying the motif is present. (b) f/3147 identifies a missing domain annotation for a peptidase: it activates across the domain, and the left protein, which has no peptidase domain annotation in Swiss-Prot, shows the same pattern. (c) f/9046 suggests missing binding site annotations for UDP-N-acetyl-α-D-glucosamine (UDP-GlcNAc) and Mg2+ within bacterial glycosyltransferases; in both structures darker pink indicates higher activation, and the right protein's implied binding site annotations are labeled in green.

This figure visually presents three examples of how feature activations highlight missing annotations. For f/939, a Nudix box motif is activated in protein B2GFH1, which lacks the Nudix annotation in Swiss-Prot. For f/3147, an unannotated peptidase domain is activated. For f/9046, implied binding sites for UDP-GlcNAc and Mg2+ are activated in bacterial glycosyltransferases that are not explicitly labeled in Swiss-Prot.

5. Protein Sequence Generation Can be Steered by Activating Interpretable Features (Section 3.7 & Figure 7, Figure 10, Figure 11): The interpretable SAE features enable targeted control over PLM-based sequence generation.

  • Targeted Steering: By selectively amplifying or suppressing specific SAE features during sequence generation, the authors demonstrate that PLMs can be steered to favor or disfavor particular amino acids at masked positions.

  • Glycine Repeats Example: The paper illustrates this with glycine-related features that activate on periodic glycine repeats (e.g., GxGX) common in collagen-like regions. Steering these features influences the probability of generating glycine at masked positions within these repeats. For instance, steering a periodic glycine feature can increase the probability of glycine at subsequent positions, demonstrating a propagating effect (Figure 7a, Figure 10).

  • This application opens up exciting possibilities for controllable protein design, allowing researchers to guide PLMs towards generating sequences with desired biological properties.

    The following are results from [Figure 10] of the original paper:

    The figure is a schematic: the upper portion shows glycine probability distributions with no steering and at increasing steer levels, and the lower portion shows the corresponding protein structure along the amino acid sequence. Steering feature activation on a single unmasked amino acid influences protein generation at nearby glycines in periodic GxGX repeats within collagen-like regions: only one G is steered, and the impact is measured on the predicted probability of glycine at subsequent masked positions. The steering is applied to the unmasked G, not to the masked tokens or any other position, and the steer amount corresponds to a multiple of the feature's maximum observed activation value. Three features that activate on periodic glycine repeats within collagen-like domains were tested; for the periodic feature, the effect propagates through subsequent periodic positions with increasing steering intensity, suggesting the feature captures a causally meaningful pattern, while the authors note that steering and validating more complex biological patterns lacking clear sequence-based checks remains difficult.

This figure demonstrates the effect of steering specific glycine features on the probability of glycine at subsequent masked positions. The top panel (a) shows that steering a "periodic glycine feature" (f/4616) at a specific glycine in a GxGX pattern significantly increases the probability of glycine at subsequent masked positions within the periodic repeat. The bottom panel (b) shows that steering a "non-periodic glycine feature" (f/6581) has a less specific or even negative impact on other glycine positions. This highlights the fine-grained control offered by interpretable features.
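As an illustration of how such steering could be implemented, the sketch below adds a scaled copy of one SAE dictionary (decoder) direction to a single residue's hidden state via a PyTorch forward hook before the PLM scores the masked positions. The layer handle, decoder column, and HuggingFace-style `model(**inputs).logits` call are assumptions for illustration; the paper's exact steering procedure may differ.

```python
import torch

def steer_with_feature(model, layer_module, decoder_col, position, steer_value, inputs):
    """Inject `steer_value` units of one SAE feature's dictionary direction into
    the hidden state of a single (unmasked) residue, then score masked positions.
    Illustrative sketch; assumes `layer_module` is the transformer layer whose
    output embeddings the SAE was trained on and `decoder_col` is that feature's
    (d_model,) decoder column (often unit-normalized during SAE training).
    """
    def hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out  # (batch, seq, d_model)
        hidden = hidden.clone()
        hidden[:, position, :] += steer_value * decoder_col
        return (hidden,) + out[1:] if isinstance(out, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(**inputs).logits  # masked-token predictions downstream
    finally:
        handle.remove()
    return logits
```

Sweeping `steer_value` over multiples of the feature's maximum observed activation would produce the kind of dose-response curves shown in the figure.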

6.2. Data Presentation (Tables)

The following are the results from [Table 1] of the original paper:

| Layer | Learning Rate | L1 Penalty | % Loss Recovered |
| --- | --- | --- | --- |
| L1 | 1.0e-7 | 0.1 | 99.60921 |
| L2 | 1.0e-7 | 0.08 | 99.31827 |
| L3 | 1.0e-7 | 0.1 | 99.02078 |
| L4 | 1.0e-7 | 0.1 | 98.39785 |
| L5 | 1.0e-7 | 0.1 | 99.32478 |
| L6 | 1.0e-7 | 0.09 | 100 |

Table 1: Layer-wise learning parameters and SAE performance metrics

This table shows the hyperparameters (Learning Rate, L1 penalty) chosen for the best-performing SAE at each ESM-2 layer, along with their reconstruction quality measured by % Loss Recovered. The high % Loss Recovered values (all above 98%) indicate that the SAEs are highly effective at reconstructing the original ESM-2 embeddings with minimal information loss, suggesting that the latent features preserve the critical information content of the PLM.
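For reference, "% loss recovered" is commonly defined in the SAE literature relative to two baselines: running the PLM with the layer's embeddings replaced by the SAE reconstruction versus replaced by zeros. A minimal sketch of that (assumed) definition:

```python
def percent_loss_recovered(loss_clean, loss_sae, loss_zero):
    """% loss recovered, as commonly defined for SAEs (assumed here):
    loss_clean : PLM masked-LM loss with the original layer embeddings
    loss_sae   : loss with embeddings replaced by their SAE reconstructions
    loss_zero  : loss with the layer's embeddings zero-ablated
    Returns 100.0 when the reconstruction preserves the PLM's loss exactly.
    """
    return 100.0 * (loss_zero - loss_sae) / (loss_zero - loss_clean)
```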

The following are the results from [Table C.2.1] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| accession | Accession Number | Unique identifier for the protein entry in UniProt | N | Y |
| id | UniProt ID | Short mnemonic name for the protein | N | Y |
| protein_name | Protein Name | Full recommended name of the protein | N | Y |
| gene_names | Gene Names | Names of the genes encoding the protein | N | Y |
| sequence | Protein Sequence | Complete amino acid sequence of the protein | N | Y |
| organism_name | Organism | Scientific name of the organism the protein is from | N | Y |
| length | Sequence Length | Total number of amino acids in the protein | N | Y |

Table C.2.1: Swiss-Prot Metadata Categories - Basic Identification Fields

This table lists basic identification fields from Swiss-Prot and indicates whether each was used for quantitative evaluation (Quant.) or provided to the LLM for feature description generation (LLM). These fields were supplied only to the LLM, since they are not position-specific concepts and therefore cannot be scored with per-residue F1.

The following are the results from [Table C.2.2] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| ft_act_site | Active Sites | Specific amino acids directly involved in the protein's chemical reaction | Y | Y |
| ft_binding | Binding Sites | Regions where the protein interacts with other molecules | Y | Y |
| ft_disulfid | Disulfide Bonds | Covalent bonds between sulfur atoms that stabilize protein structure | Y | Y |
| ft_helix | Helical Regions | Areas where protein forms alpha-helical structures | Y | Y |
| ft_turn | Turns | Regions where protein chain changes direction | Y | Y |
| ft_strand | Beta Strands | Regions forming sheet-like structural elements | Y | Y |
| ft_coiled | Coiled Coil Regions | Areas where multiple helices intertwine | Y | Y |
| ft_non_std | Non-standard Residues | Non-standard amino acids in the protein | N | |
| ft_transmem | Transmembrane Regions | Regions that span cellular membranes | N | Y |
| ft_intramem | Intramembrane Regions | Regions located within membranes | N | Y |

Table C.2.2: Swiss-Prot Metadata Categories - Structural Features

This table lists various structural features from Swiss-Prot. Most of these (active sites, binding sites, disulfide bonds, helical regions, turns, beta strands, coiled coil regions) were used for quantitative evaluation (Quant.: Y) and provided to the LLM (LLM: Y). Some, like transmembrane and intramembrane regions, were used by the LLM but not directly in the quantitative F1 score calculation (Quant.: N).
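To use these position-specific fields for the quantitative comparison, each annotation must first be converted into a per-residue label vector. A minimal sketch is shown below; it assumes the ft_* columns follow the UniProt TSV convention of "TYPE start..end; ..." entries (single positions as "TYPE pos;"), and the regex is illustrative rather than a full parser.

```python
import re
import numpy as np

def residue_mask(ft_field, seq_len):
    """Convert a Swiss-Prot ft_* annotation string (e.g., the ft_binding column)
    into a per-residue boolean mask for comparison against SAE feature
    activations. Positions in UniProt annotations are 1-based and inclusive.
    """
    mask = np.zeros(seq_len, dtype=bool)
    if not isinstance(ft_field, str):
        return mask  # missing annotation
    # Match entries like "BINDING 23..25;" or "ACT_SITE 57;"
    for start, end in re.findall(r"[A-Z_]+ (\d+)(?:\.\.(\d+))?", ft_field):
        s = int(start)
        e = int(end) if end else s
        mask[s - 1 : e] = True
    return mask
```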

The following are the results from [Table C.2.3] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| ft_carbohyd | Carbohydrate Modifications | Locations where sugar groups are attached to the protein | Y | Y |
| ft_lipid | Lipid Modifications | Sites where lipid molecules are attached to the protein | Y | Y |
| ft_mod_res | Modified Residues | Amino acids that undergo post-translational modifications | Y | Y |
| cc_cofactor | Cofactor Information | Non-protein molecules required for protein function | N | Y |

Table C.2.3: Swiss-Prot Metadata Categories - Modifications and Chemical Features

This table details modifications and chemical features. Carbohydrate, lipid, and modified residues were used for both quantitative evaluation and LLM input. Cofactor Information was used by the LLM but not for direct quantitative comparison.

The following are the results from [Table C.2.4] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| ft_signal | Signal Peptide | Sequence that directs protein trafficking in the cell | Y | Y |
| ft_transit | Transit Peptide | Sequence guiding proteins to specific cellular compartments | Y | Y |

Table C.2.4: Swiss-Prot Metadata Categories - Targeting and Localization

This table lists targeting and localization signals. Both signal peptide and transit peptide were used for quantitative evaluation and LLM input, as they are sequence-based concepts.

The following are the results from [Table C.2.5] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| ft_compbias | Compositionally Biased Regions | Sequences with unusual amino acid distributions | Y | Y |
| ft_domain | Protein Domains | Distinct functional or structural protein units | Y | Y |
| ft_motif | Short Motifs | Small functionally important amino acid patterns | Y | Y |
| ft_region | Regions of Interest | Areas with specific biological significance | Y | Y |
| ft_zn_fing | Zinc Finger Regions | DNA-binding structural motifs containing zinc | Y | Y |
| ft_dna_bind | DNA Binding Regions | Regions that interact with DNA | N | Y |
| ft_repeat | Repeated Regions | Repeated sequence motifs within the protein | N | Y |
| cc_domain | Domain Commentary | General information about functional protein units | N | Y |

Table C.2.5: Swiss-Prot Metadata Categories - Functional Domains and Regions

This table covers functional domains and regions. Most were used for both quantitative evaluation and LLM input. DNA Binding Regions, Repeated Regions, and Domain Commentary were used by the LLM but not directly in the quantitative F1 score calculation.
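The fields marked Quant.: Y across these tables are scored by comparing thresholded per-residue feature activations against the corresponding per-residue concept masks with an F1 score. Below is a minimal sketch of that comparison, assuming activations and labels are already aligned per residue; the relative threshold is illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

def concept_f1(feature_acts, concept_mask, rel_threshold=0.5):
    """F1 between a binarized SAE feature and a per-residue concept mask.

    feature_acts : (n_residues,) activations of one SAE feature, concatenated
                   across the evaluation proteins.
    concept_mask : (n_residues,) boolean labels for one Swiss-Prot concept.
    rel_threshold: fraction of the feature's max activation used to binarize.
    """
    acts = np.asarray(feature_acts, dtype=float)
    preds = acts > rel_threshold * (acts.max() or 1.0)
    return f1_score(np.asarray(concept_mask, dtype=bool), preds)
```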

The following are the results from [Table C.2.6] of the original paper:

| Field Name | Full Name | Description | Quant. | LLM |
| --- | --- | --- | --- | --- |
| cc_catalytic_activity | Catalytic Activity | Description of the chemical reaction(s) performed by the protein | N | Y |
| ec | Enzyme Commission Number | Enzyme Commission number for categorizing enzyme-catalyzed reactions | N | Y |
| cc_activity_regulation | Activity Regulation | Information about how the protein's activity is controlled | N | Y |
| cc_function | Function | General description of the protein's biological role | N | Y |
| protein_families | Protein Families | Classification of the protein into functional/evolutionary groups | N | Y |
| go_f | Gene Ontology Function | Gene Ontology terms describing molecular functions | N | Y |

Table C.2.6: Swiss-Prot Metadata Categories - Functional Annotations

This table presents functional annotations. All listed categories (catalytic activity, EC number, activity regulation, function, protein families, Gene Ontology function) were used by the LLM for feature description generation but not directly for quantitative evaluation as position-specific concepts.
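To illustrate how the LLM: Y fields might feed the automatic interpretation pipeline, the hypothetical helper below bundles selected Swiss-Prot metadata for a feature's top-activating proteins into a prompt asking for a feature summary. The field list, data layout, and prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical subset of the LLM-facing Swiss-Prot fields from Tables C.2.1-C.2.6
LLM_FIELDS = ["protein_name", "organism_name", "cc_function", "protein_families", "go_f"]

def build_interpretation_prompt(top_proteins):
    """top_proteins: list of dicts, each holding Swiss-Prot metadata plus the
    residue positions where the SAE feature activates most strongly."""
    lines = ["Summarize what this protein language model feature detects,",
             "given the proteins and positions where it activates most strongly:", ""]
    for i, p in enumerate(top_proteins, 1):
        meta = "; ".join(f"{k}={p[k]}" for k in LLM_FIELDS if p.get(k))
        lines.append(f"{i}. positions {p['activated_positions']} | {meta}")
    lines.append("")
    lines.append("Describe, in one sentence, the shared biological concept.")
    return "\n".join(lines)
```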

The following are the results from [Table D.2] of the original paper:

| Feature | Pearson r | Feature Summary |
| --- | --- | --- |
| 4360 | 0.75 | The feature activates on interchain disulfide bonds and surrounding hydrophobic residues in serine proteases, particularly those involved in venom and blood coagulation pathways. |
| 9390 | 0.98 | The feature activates on the conserved Nudix box motif of Nudix hydrolase enzymes, particularly detecting the metal ion binding residues that are essential for their nucleotide pyrophosphatase activity. |
| 3147 | 0.70 | The feature activates on conserved leucine and cysteine residues that occur in leucine-rich repeat domains and metal-binding structural motifs, particularly those involved in protein-protein interactions and signaling. |
| 4616 | 0.76 | The feature activates on conserved glycine residues in structured regions, with highest sensitivity to the characteristic glycine-containing repeats of collagens and GTP-binding motifs. |
| 8704 | 0.75 | The feature activates on conserved catalytic motifs in protein kinase active sites, particularly detecting the proton acceptor residues and surrounding amino acids involved in phosphotransfer reactions. |
| 9047 | 0.80 | The feature activates on conserved glycine/alanine/proline residues within the nucleotide-sugar binding domains of glycosyltransferases, particularly at positions known to interact with the sugar-nucleotide donor substrate. |
| 10091 | 0.83 | The feature activates on conserved hydrophobic residues (particularly V/I/L) within the catalytic regions of N-acetyltransferase domains, likely detecting a key structural or functional motif involved in substrate binding or catalysis. |
| 1503 | 0.73 | The feature activates on extracellular substrate binding loops of TonB-dependent outer membrane transporters, particularly those involved in nutrient uptake. |
| 2469 | 0.85 | The feature activates on conserved structural and sequence elements in bacterial outer membrane beta-barrel proteins, particularly around substrate binding and ion coordination sites in porins and TonB-dependent receptors. |

Table D.2: Example feature summaries with the Pearson correlation coefficients obtained when the corresponding, more verbose descriptions were used to predict maximum activation levels.

This table provides examples of LLM-generated feature summaries along with their Pearson correlation coefficients. The high Pearson r values (ranging from 0.70 to 0.98) indicate that the LLM's descriptions are remarkably accurate at predicting how the SAE features will activate on new proteins. This validates the LLM-based interpretation pipeline and its utility in providing meaningful, human-readable explanations for complex latent features. The summaries cover a diverse range of biological concepts, from disulfide bonds and Nudix box motifs to glycosyltransferase binding domains and beta-barrel proteins.
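The scoring behind Table D.2 pairs each description with held-out proteins: the LLM predicts an activation level for each protein from the description alone, and that prediction is correlated with the feature's true maximum activation. A minimal sketch of the correlation step (variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def score_description(predicted_levels, true_max_activations):
    """Pearson correlation between LLM-predicted activation levels and the
    feature's true maximum activation per held-out protein (both 1-D arrays,
    aligned by protein)."""
    r, _ = pearsonr(np.asarray(predicted_levels, dtype=float),
                    np.asarray(true_max_activations, dtype=float))
    return r
```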

6.3. Ablation Studies / Parameter Analysis

The paper implicitly conducts an ablation study by comparing SAE features derived from the trained ESM-2 model against SAE features derived from an ESM-2 model with randomized weights.

  • Comparison with Randomized Weights: As shown in Figure 3b and discussed in Section 3.3, SAEs trained on embeddings from a randomized ESM-2 model (green bars) extract significantly fewer biologically interpretable features than those from the properly trained ESM-2 (pink bars). While the randomized models still extract features associated with individual amino acid types (Appendix Figure 9b), they largely fail to correspond to complex biological concepts like binding sites or domains (a sketch of such a control appears after this list).
  • Implication: This comparison serves as a crucial control, demonstrating that the interpretability observed in InterPLM is not an artifact of the SAE method itself or general protein sequence statistics, but rather a direct consequence of the meaningful biological knowledge encoded within the pre-trained ESM-2 PLM. It verifies that the SAEs are indeed recovering features learned by the PLM, rather than just finding patterns in raw sequences.
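A minimal sketch of the randomized-weights control referenced above, assuming the PLM is a standard PyTorch module; the paper's exact randomization procedure may differ.

```python
import copy
import torch

def randomized_copy(model, seed=0):
    """Return a copy of the PLM with re-initialized (random) weights. SAEs trained
    on embeddings from this control should recover far fewer biological concepts
    than SAEs trained on the pre-trained model. Illustrative sketch only.
    """
    torch.manual_seed(seed)
    control = copy.deepcopy(model)
    for module in control.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()  # re-draw weights from each layer's default init
    return control
```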

7. Conclusion & Reflections

7.1. Conclusion Summary

The InterPLM framework represents a significant advancement in Protein Language Model (PLM) interpretability. By employing sparse autoencoders (SAEs), the authors successfully disentangled the polysemantic representations within ESM-2, revealing a rich lexicon of up to 2,548 human-interpretable latent features per layer. These features strongly correlate with a wide array of known biological concepts (binding sites, structural motifs, functional domains) and significantly outperform individual neurons in conceptual clarity and quantity (2,548 features vs. 46 neurons per layer). Beyond known annotations, InterPLM facilitated the discovery of novel, coherent biological concepts learned by ESM-2, which were then automatically interpreted using a Large Language Model (LLM) pipeline. Practical applications include identifying missing protein annotations in databases and enabling targeted steering of protein sequence generation, showcasing the utility of these interpretable features for both scientific discovery and protein engineering. InterPLM provides a systematic, quantitative, and extensible framework for unlocking the biological knowledge embedded in PLMs, complemented by open-source tools and an interactive visualization platform.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Scaling to Structure Prediction Models: A crucial next step is to apply this SAE approach to structure-prediction models like ESMFold or AlphaFold. Understanding how latent features evolve with increasing model size and structural capabilities could enable more granular control over generated protein conformations through targeted steering.
  • Robust Steering and Validation: While sequence steering was demonstrated, applying it to more complex biological patterns and ensuring robust, validated outcomes remains challenging. More sophisticated methods and clearer success metrics are needed.
  • Interpretation of Masked Token Embeddings and CLS Token: The current analysis focuses on embeddings of unmasked amino acids. Further investigation is needed to understand the information encoded in masked token embeddings and the CLS token, which is often used as a protein-level representation.
  • Combining Features: The paper's framework focuses on individual latent features. Future work could explore how these individual features combine into higher-level biological circuits or computational mechanisms within the PLM, which would require more advanced activation analysis and external validation.
  • SAE Training Methods: Continued advancements in SAE training methods could further improve feature interpretability.
  • Additional Annotations: Incorporating more diverse biological annotations and structural data could enhance the evaluation and interpretation pipeline.

7.3. Personal Insights & Critique

This paper offers a highly impactful contribution to the growing field of mechanistic interpretability for biological AI.

Innovations and Strengths:

  • Solving Polysemanticity: The most significant innovation is the successful application of SAEs to effectively de-superpose PLM representations. This moves beyond the limitations of neuron-level analysis and provides a much finer-grained and more accurate view of what PLMs learn. The quantitative evidence (thousands of features vs. dozens of neurons) is highly compelling.
  • Automated Discovery of Novel Biology: The LLM-powered interpretation pipeline is a powerful and elegant solution for interpreting features that don't map to existing databases. This not only enhances interpretability but also transforms the PLM into a potential biological discovery engine that can suggest new motifs or functions. This is a critical step towards AI-driven hypothesis generation.
  • Bridging Interpretability and Application: The demonstration of missing annotation filling and sequence steering directly links interpretability to practical, high-value applications. This shows that understanding PLM internals is not just an academic exercise but can directly accelerate biological research and protein engineering.
  • Community Resources: The release of InterPLM.ai and the code is commendable. It lowers the barrier for other researchers to explore and build upon this work, which is crucial for advancing the field.

Potential Issues and Areas for Improvement:

  • Complexity of Interpretation for Novices: While SAEs make features more interpretable than neurons, understanding what a dictionary vector or an activation pattern truly signifies still requires significant domain expertise. The LLM pipeline helps, but the full depth of a feature might still be challenging for a beginner. The LLM itself is a black box, so interpreting LLM-generated descriptions also requires care.
  • Computational Cost: Training SAEs (20 per layer) and running the LLM interpretation pipeline can be computationally intensive, especially for larger PLMs and deeper layers. Scaling this to AlphaFold models, as suggested for future work, will be a substantial challenge.
  • Validation of Novel Concepts: While LLMs can describe novel features, experimentally validating these new biological motifs or missing annotations is the ultimate bottleneck. The paper provides examples but highlights the need for further experimental efforts. This is a general challenge for AI-driven discovery, where AI generates hypotheses faster than experiments can confirm them.
  • Feature Redundancy/Overlap: Despite the sparsity constraint, it's possible that some SAE features might still capture highly similar or overlapping biological concepts, especially given the hierarchical nature of protein biology. A more granular analysis of feature relationships within clusters could reveal this.
  • Causality in Steering: While steering experiments demonstrate control, the precise causal mechanism (i.e., why amplifying a feature leads to a specific sequence change) might still be implicit. Further work could delve into the causal circuits within the PLM that link a steered feature to the final output.

Transferability and Future Value: The methods developed in InterPLM are highly transferable. The SAE approach could be applied to other biologically relevant deep learning models (e.g., those for RNA, DNA, or even cellular imaging) to extract interpretable latent features. The framework for quantitative biological validation and LLM-powered interpretation is also broadly applicable across scientific AI domains where models learn complex, unannotated patterns. This paper lays critical groundwork for mechanistic interpretability to become a standard tool in AI-driven biological discovery, moving us closer to truly understanding, rather than just using, our powerful AI models.
