
Knowledge Circuits in Pretrained Transformers

Published: 05/28/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper explores knowledge encoding in large language models, introducing the concept of 'Knowledge Circuits' to reveal critical subgraphs in their computation graphs. Experiments with GPT-2 and TinyLLAMA show how information heads, relation heads, and MLPs collaboratively encode knowledge, and the circuits are used to analyze knowledge editing, hallucinations, and in-context learning.

Abstract

The remarkable capabilities of modern large language models are rooted in their vast repositories of knowledge encoded within their parameters, enabling them to perceive the world and engage in reasoning. The inner workings of how these models store knowledge have long been a subject of intense interest and investigation among researchers. To date, most studies have concentrated on isolated components within these models, such as the Multilayer Perceptrons and attention head. In this paper, we delve into the computation graph of the language model to uncover the knowledge circuits that are instrumental in articulating specific knowledge. The experiments, conducted with GPT2 and TinyLLAMA, have allowed us to observe how certain information heads, relation heads, and Multilayer Perceptrons collaboratively encode knowledge within the model. Moreover, we evaluate the impact of current knowledge editing techniques on these knowledge circuits, providing deeper insights into the functioning and constraints of these editing methodologies. Finally, we utilize knowledge circuits to analyze and interpret language model behaviors such as hallucinations and in-context learning. We believe the knowledge circuits hold potential for advancing our understanding of Transformers and guiding the improved design of knowledge editing. Code and data are available in https://github.com/zjunlp/KnowledgeCircuits.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Knowledge Circuits in Pretrained Transformers

1.2. Authors

Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, Huajun Chen. Affiliations: Zhejiang University (Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Huajun Chen), National University of Singapore, NUS-NCS Joint Lab, Singapore (Shumin Deng).

1.3. Journal/Conference

Published on arXiv, a preprint server, under the identifier abs/2405.17969v4. arXiv is a reputable platform for sharing preprints in physics, mathematics, computer science, and other fields, allowing for rapid dissemination of research before formal peer review and publication in journals or conferences.

1.4. Publication Year

2024

1.5. Abstract

This paper investigates how modern large language models (LLMs) store and utilize knowledge, moving beyond studies focusing on isolated components like Multilayer Perceptrons (MLPs) and attention heads. The authors introduce the concept of "Knowledge Circuits" – critical subgraphs within the language model's computation graph that articulate specific knowledge. Through experiments on GPT-2 and TinyLLAMA, they observe how information heads, relation heads, and MLPs collaboratively encode knowledge. Furthermore, the paper evaluates the impact of existing knowledge editing techniques on these circuits, offering insights into their functionality and limitations. Finally, knowledge circuits are employed to analyze LLM behaviors such as hallucinations and in-context learning. The authors believe that understanding knowledge circuits can advance the comprehension of Transformers and guide improved knowledge editing designs.

The paper is available as a preprint on arXiv: https://arxiv.org/abs/2405.17969v4. The PDF link is https://arxiv.org/pdf/2405.17969v4.pdf.

2. Executive Summary

2.1. Background & Motivation

Modern large language models (LLMs) possess remarkable capabilities, largely attributed to the vast knowledge encoded within their parameters. However, the precise mechanisms by which LLMs store and utilize this knowledge remain enigmatic, leading to issues like hallucinations, biases, and unsafe behaviors. Previous research has primarily focused on isolated components, such as Multilayer Perceptrons (MLPs) being identified as 'key-value memories' or 'knowledge neurons', and attention heads. While these studies have advanced understanding, they often treat knowledge blocks as isolated and have shown limitations in generalizing editing techniques or fully explaining complex LLM behaviors. There's a gap in understanding how different components cooperate to articulate knowledge.

The core problem the paper aims to solve is to move beyond isolated component analysis and explore the collaborative information flow within the Transformer's computation graph to understand how knowledge is stored and expressed. This is important because a deeper understanding of these internal mechanisms can lead to more robust, reliable, and interpretable LLMs, and better-designed knowledge editing techniques.

The paper's innovative idea is to introduce "Knowledge Circuits"—human-interpretable subgraphs within the LLM's computation graph that are specifically responsible for articulating particular pieces of knowledge. Unlike previous work that focused on identifying where knowledge is stored, this paper investigates how it flows and is processed by the cooperation of various components (attention heads, MLPs, embeddings) to answer questions or make predictions.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Introduction of Knowledge Circuits: Proposes and constructs knowledge circuits as critical subgraphs in the Transformer's computation graph to analyze the cooperative mechanisms of different components (information heads, relation heads, MLPs) in encoding and expressing knowledge. This offers a holistic view beyond isolated components.
  • Unveiling Implicit Neural Knowledge Representations: Demonstrates that these discovered knowledge circuits can independently recall related knowledge with a significant portion of the full model's performance, indicating their effectiveness as neural representations. It shows that knowledge tends to aggregate in earlier to middle layers and is enhanced in later layers. It identifies special components like mover heads and relation heads crucial for information transfer and relational context capture.
  • Elucidating Internal Mechanisms for Knowledge Editing: Evaluates the impact of knowledge editing methods (ROME, FT-M) on knowledge circuits. It reveals that ROME tends to incorporate edited information at the edited layer, which is then transported by mover heads. FT-M, conversely, directly integrates edited knowledge, dominating subsequent predictions but risking overfitting and influencing unrelated knowledge. This provides a mechanistic explanation for editing successes and failures.
  • Facilitating Interpretation of LLM Behaviors: Utilizes knowledge circuits to interpret complex LLM behaviors:
    • Hallucinations: Observes that hallucinations occur when the model fails to correctly transfer knowledge to the final token, often due to an ineffective mover head or selection of incorrect information in earlier layers.
    • In-Context Learning (ICL): Shows that ICL involves the emergence of new attention heads in the knowledge circuit that focus on demonstration context, acting as induction heads to leverage examples for correct responses.
  • Advancing Understanding and Guiding Design: The authors believe that understanding knowledge circuits holds potential for advancing the mechanistic interpretability of Transformers and guiding the improved design of knowledge editing techniques to enhance factuality and alleviate hallucinations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Transformer Architecture: The core neural network architecture used in modern LLMs. It consists of an encoder and a decoder (or decoder-only for generative LLMs). Key components include:

    • Token Embeddings: Numerical representations of input words or subwords.
    • Positional Encodings: Added to embeddings to provide information about the token's position in the sequence.
    • Multi-Head Attention: Allows the model to weigh the importance of different parts of the input sequence when processing each token. It computes Query ($Q$), Key ($K$), and Value ($V$) vectors for each token. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$, $K$, $V$ are matrices of query, key, and value vectors respectively, and $d_k$ is the dimension of the key vectors, used for scaling. The softmax function normalizes the attention scores. Multi-head attention performs this operation multiple times in parallel with different linear projections.
    • Feed-Forward Networks (FFNs) / Multilayer Perceptrons (MLPs): Position-wise fully connected networks applied to each position independently. They typically consist of two linear transformations with a ReLU activation in between. Previous work has identified MLPs as potential 'key-value memories' for factual knowledge.
    • Residual Connections: Add the input of a sub-layer to its output, helping to mitigate the vanishing gradient problem and allowing deeper networks.
    • Layer Normalization: Normalizes the activations across the features for each sample independently, improving training stability and performance.
    • Residual Stream: The cumulative output of all previous attention heads, MLPs, and input embeddings, through which information flows in a Transformer. It's considered a valuable tool for mechanistic interpretability.
  • Mechanistic Interpretability: A field of research focused on understanding the internal mechanisms of neural networks. It aims to reverse-engineer trained models to figure out how they perform specific tasks, often by identifying and understanding circuits.

  • Circuit Theory: Within mechanistic interpretability, a circuit is defined as a human-interpretable subgraph of a neural network's computation graph that is responsible for a specific behavior or functionality. The nodes in the graph represent components like attention heads, MLPs, and embeddings, and edges represent their interactions.

  • Ablation Studies: A technique used in interpretability to understand the importance of specific components or connections by systematically removing them (e.g., setting their output to zero or a mean value) and observing the impact on the model's performance. Zero ablation (setting to zero) and mean ablation (setting to mean activation) are common types.

  • Logit Lens: An interpretability technique that involves taking the intermediate activations from various layers of a Transformer and mapping them directly to the vocabulary space using the unembedding matrix. This allows researchers to see what tokens are "predicted" at each intermediate layer, providing insight into the information flow.
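To make the logit lens concrete, here is a minimal sketch using the TransformerLens library (which the paper also uses for analysis). The prompt and the choice of layer are illustrative, and the exact top tokens will depend on the model; this is not the paper's own code.

```python
# Minimal logit-lens sketch: decode an intermediate residual stream into vocabulary space.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
tokens = model.to_tokens("The official language of France is")

# Run the model once and cache all intermediate activations.
_, cache = model.run_with_cache(tokens)

layer = 17                                      # intermediate layer, chosen for illustration
resid = cache["resid_post", layer][0, -1]       # residual stream at the final token position
logits = model.ln_final(resid) @ model.W_U      # final LayerNorm, then the unembedding matrix

top = torch.topk(logits, k=5).indices
print([model.tokenizer.decode([int(t)]) for t in top])   # tokens "predicted" at this layer
```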

3.2. Previous Works

The paper builds upon a rich history of research into Transformer interpretability, particularly focusing on how knowledge is stored and processed:

  • Knowledge in MLPs:

    • Geva et al. [12, 14, 15] proposed that Transformer Feed-Forward Layers (MLPs) act as key-value memories, storing factual knowledge. Knowledge neurons (KNs) within MLPs are hypothesized to encode specific facts.
    • Meng et al. [18] leveraged this understanding to propose ROME (Rank-One Model Editing), a method to modify specific factual associations by directly editing MLP weights.
    • Niu et al. [23] and Zhang et al. [24] challenged the knowledge neuron thesis as an oversimplification, suggesting that knowledge might be distributed more broadly or that different types of knowledge might reside in the same areas.
  • Knowledge in Attention Heads:

    • Li et al. [58] and Wu et al. [81] highlighted the role of attention heads in retrieving information. Wu et al. identified retrieval heads for long-context factuality.
    • Jin et al. [82] distinguished between memory heads (internal memory) and context heads (external context).
    • Todd et al. [51, 83] introduced function vectors in attention heads, responsible for in-context learning.
  • Knowledge in Hybrid Components / Information Flow:

    • Geva et al. [13] described factual recall as a three-step process: subject enrichment (in MLPs), relation propagation (to the END token), and attribute extraction (by attention heads in later layers).
    • Lv et al. [31] suggested that task-specific attention heads move topic entities to the final position in the residual stream, while MLPs perform relation functions.
  • Circuit Discovery:

    • Olah et al. [25, 30] pioneered the concept of circuits as human-interpretable subgraphs.
    • Conmy et al. [32] developed Automated Circuit Discovery (ACDC) as a method to automatically identify critical circuits.
    • Wang et al. [26] and Merullo et al. [27] identified circuits for specific tasks like Indirect Object Identification (IOI) and Color Object (CO) tasks, noting component reuse across similar tasks.
    • Dutta et al. [84] constructed circuits for chain-of-thought reasoning, observing distinct roles for early-layer attention heads (moving information along ontological relations) and later-layer components (writing answers).

3.3. Technological Evolution

The field of Transformer interpretability has evolved from initial black-box analyses to component-level investigations (e.g., individual neurons, attention heads, or MLPs), and now towards understanding the interactions and flows between these components. Early works focused on identifying "where" knowledge is, while more recent work, including this paper, emphasizes "how" knowledge is processed and articulated through collaborative circuits. This paper's contribution lies in extending circuit theory from simple task-specific circuits (like IOI) to more complex knowledge recall tasks, integrating both MLP and attention components across layers, and using these circuits to explain higher-level LLM behaviors (editing, hallucination, ICL). The move towards sparse auto-encoders [36, 37] also represents an evolution in circuit discovery methods, aiming for more interpretable feature representations.

3.4. Differentiation Analysis

Compared to previous work, this paper differentiates itself in several key ways:

  • Holistic View of Knowledge Storage: Instead of treating MLPs or attention heads as isolated knowledge repositories (as in many knowledge neuron studies), this paper explicitly focuses on knowledge circuits that involve the cooperation of information heads, relation heads, and MLPs across multiple layers. This provides a more integrated understanding of how knowledge is processed and articulated.
  • Focus on Factual Recall Tasks: While circuit theory has been applied to tasks like Indirect Object Identification or Color Object identification, this paper extends it to factual recall tasks, which require the model to retrieve and utilize stored knowledge rather than simply copying tokens from the context. This necessitates a different kind of circuit discovery that taps into the model's parametric knowledge.
  • Application to LLM Behaviors: The paper moves beyond just identifying circuits to utilize them for interpreting complex LLM behaviors. Specifically, it uses knowledge circuits to analyze the mechanisms behind knowledge editing, hallucinations, and in-context learning, offering mechanistic explanations for why these phenomena occur or why editing methods succeed/fail.
  • Empirical Validation across Models: Experiments are conducted on both GPT-2 (medium and large) and TinyLLAMA, allowing for observations across different model sizes and architectures (e.g., TinyLLAMA's Grouped Query Attention Mechanism), providing broader validation than studies focused on a single model.
  • Detailed Analysis of Information Flow: The paper provides a detailed, layer-by-layer analysis of how target entities' ranks and probabilities change in the residual stream, identifying phenomena like early decoding and the specific roles of mover heads and relation heads in this process.

4. Methodology

4.1. Principles

The core idea behind discovering knowledge circuits is to identify a minimal but critical subgraph of the Transformer's computation graph that is essential for articulating a specific piece of knowledge. This is based on the principle of causal mediation: if removing a component or connection significantly degrades the model's ability to perform a task, that component/connection is considered critical to the circuit. Unlike previous work that focused on isolated components, this paper views the Transformer as a graph where various nodes (input embedding, attention heads, MLPs, output) interact via edges, and aims to find the collaborative pathways. The theoretical basis is that specific knowledge recall in LLMs is not a monolithic operation but rather emerges from a structured interplay of multiple internal components.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Transformer as a Computation Graph

The Transformer decoder architecture is conceptualized as a directed acyclic graph (DAG) $\mathcal{G}$.

  • Nodes ($N$): Represent individual computational components:

    • Input embedding ($\mathcal{I}$)
    • Attention heads ($\mathcal{A}_{l,j}$, where $l$ is the layer and $j$ is the head index)
    • MLPs ($\mathcal{M}_l$, for layer $l$)
    • Output node ($\mathcal{O}$)
    • So, $N = \{\mathcal{I}, \mathcal{A}_{l,j}, \mathcal{M}_l, \mathcal{O}\}$.
  • Edges ($E$): Represent the data flow and dependencies between these nodes, such as residual connections, attention mechanisms, and projections.

    The residual stream ($\mathcal{R}_l$) at layer $l$ is defined by the following equations, showing how the input embedding and the outputs of attention heads and MLPs from previous layers cumulatively contribute (a small verification sketch is given at the end of this subsection):

    $ \mathcal{R}_l = \mathcal{R}_{l-1} + \sum_j \mathcal{A}_{l,j} + \mathcal{M}_l, \quad \mathcal{R}_0 = \mathcal{I} $
    $ \mathrm{Input}_l^A = \mathcal{I} + \sum_{l'<l} \left( \mathcal{M}_{l'} + \sum_{j'} \mathcal{A}_{l',j'} \right) $
    $ \mathrm{Input}_l^M = \mathcal{I} + \sum_{l'<l} \mathcal{M}_{l'} + \sum_{l' \leq l} \sum_{j'} \mathcal{A}_{l',j'} $

    Where:

  • $\mathcal{R}_l$: The residual stream at layer $l$.

  • $\mathcal{R}_0 = \mathcal{I}$: The initial residual stream is the input embedding.

  • $\mathcal{A}_{l,j}$: The output of the $j$-th attention head in layer $l$.

  • $\mathcal{M}_l$: The output of the MLP in layer $l$.

  • $\mathrm{Input}_l^A$: The input to the attention heads in layer $l$. It is the sum of the input embedding $\mathcal{I}$ and all MLP outputs ($\mathcal{M}_{l'}$) and attention head outputs ($\mathcal{A}_{l',j'}$) from layers $l' < l$.

  • $\mathrm{Input}_l^M$: The input to the MLP in layer $l$. It is the sum of the input embedding $\mathcal{I}$, all MLP outputs ($\mathcal{M}_{l'}$) from layers $l' < l$, and all attention head outputs ($\mathcal{A}_{l',j'}$) from layers $l' \leq l$, since the MLP in a layer follows that layer's attention.

    A circuit $\mathcal{C}$ is then a subgraph of $\mathcal{G}$, defined as $\mathcal{C} = \langle N_{\mathcal{C}}, E_{\mathcal{C}} \rangle$, containing a selection of nodes $N_{\mathcal{C}}$ and edges $E_{\mathcal{C}}$ critical for specific behaviors.
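The sketch below illustrates (but does not reproduce the paper's code) this additive decomposition by reconstructing the residual stream at one layer from the embedding, per-head outputs, and MLP outputs, using TransformerLens hook names; the prompt and layer are illustrative.

```python
# Sketch: check that R_l equals the embedding plus accumulated head and MLP contributions.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
model.set_use_attn_result(True)                          # expose per-head outputs (hook_result)

tokens = model.to_tokens("The official language of France is")
_, cache = model.run_with_cache(tokens)

l = 5                                                    # layer chosen for illustration
recon = cache["hook_embed"] + cache["hook_pos_embed"]    # I: token plus positional embedding
for layer in range(l + 1):
    heads = cache["result", layer].sum(dim=2)            # sum over heads j of A_{layer, j}
    recon = recon + heads + model.b_O[layer]             # per-head writes plus GPT-2's attention output bias
    recon = recon + cache["mlp_out", layer]              # M_layer

# Should agree with the cached residual stream up to numerical tolerance.
print(torch.allclose(recon, cache["resid_post", l], atol=1e-3))
```

The small bias term added per layer is an implementation detail of GPT-2's attention output projection; the paper's equations fold it into the head contributions.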

4.2.2. Knowledge Circuits Construction

The goal is to find the circuit critical for predicting a target entity oo given a subject-relation pair (s, r), typically presented as a natural language prompt (e.g., "The official language of France is ___").

  1. Causal Mediation and Ablation: The method leverages causal mediation by systematically ablating (removing) edges in the computation graph $\mathcal{G}$ and measuring the impact on the model's performance.
  2. Edge Ablation: For each edge $e_i = (n_x, n_y)$ connecting node $n_x$ to node $n_y$, its contribution is removed by replacing it with a corrupted value. The paper uses zero ablation, meaning the output of $n_x$ as input to $n_y$ is set to zero.
  3. Performance Metric: The impact of ablating an edge $e_i$ on the model's ability to predict the target entity $o$ is measured using the MatchNLL loss, which is essentially the negative log-likelihood of the target token (a higher loss means worse performance). The change in this loss after ablation, $S(e_i)$, is calculated as: $ S(e_i) = \log(\mathcal{G}(o \mid (s, r))) - \log(\mathcal{G}/e_i(o \mid (s, r))) $ Where:
    • $\mathcal{G}(o \mid (s, r))$: The probability (or log-probability) of the target entity $o$ given the subject-relation pair (s, r) under the original full model $\mathcal{G}$.
    • $\mathcal{G}/e_i(o \mid (s, r))$: The probability (or log-probability) of the target entity $o$ given (s, r) under the model with edge $e_i$ ablated.
    • $S(e_i)$: The decrease in log-likelihood (equivalently, the increase in negative log-likelihood) when $e_i$ is ablated. A positive $S(e_i)$ means ablating the edge hurt performance, indicating the edge was critical.
  4. Thresholding: If the score $S(e_i)$ is less than a predefined threshold $\tau$, the edge is considered non-critical and removed from the temporary circuit $\mathcal{C}_{temp}$. This process is typically performed on a topologically sorted graph, traversing edges in causal order.
  5. Final Circuit: By iteratively removing non-critical edges, a knowledge circuit $\mathcal{C}_k = \langle N_k, E_k \rangle$ is derived, containing only the nodes $N_k$ and edges $E_k$ essential for predicting the target entity $o$ for a specific knowledge triplet $k$. A minimal code sketch of this scoring loop follows the list.
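The following is a minimal sketch of the scoring loop referenced above, simplified from ACDC's edge-level ablation to node-level (attention-head) zero ablation for brevity; it assumes the TransformerLens API, and the prompt, target, and threshold are illustrative rather than taken from the paper.

```python
# Sketch: score attention heads by zero-ablating them and measuring the drop in log p(target).
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-medium")
tokens = model.to_tokens("The official language of France is")
target_id = model.to_single_token(" French")

def target_logprob(logits):
    # Log-probability of the target token at the final position.
    return torch.log_softmax(logits[0, -1], dim=-1)[target_id].item()

baseline = target_logprob(model(tokens))
tau = 0.01                                               # illustrative threshold

critical = []
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def zero_head(value, hook, h=head):
            value[:, :, h, :] = 0.0                      # zero this head's output (z) before W_O
            return value
        hook_name = utils.get_act_name("z", layer)       # "blocks.{layer}.attn.hook_z"
        ablated = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, zero_head)])
        score = baseline - target_logprob(ablated)       # S = log p_full - log p_ablated
        if score >= tau:
            critical.append((layer, head, round(score, 3)))

print(critical)                                          # heads whose removal most hurts the prediction
```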

4.2.3. Knowledge Circuits Information Analysis

Once a knowledge circuit is identified, the behavior and contribution of each node $n_i$ within it are analyzed:

  1. Mapping to Embedding Space: The output of each node $n_i$ is first subjected to layer normalization (LN) and then mapped into the vocabulary's embedding space by multiplying it with the language model's unembedding matrix ($\mathbf{W}_U$): $ \mathbf{W}_U \mathrm{LN}(n_i) $ This allows researchers to inspect which tokens each component is "writing" or promoting into the residual stream (a sketch follows after this list).
  2. Identifying Special Components: Through this analysis, specific types of attention heads are identified based on their behaviors:
    • Mover Head: Focuses on the last token of the context and attends to the subject token, effectively transferring information from the subject to the final token position.
    • Relation Head: Attends to the relation token in the context and produces tokens related to the relation, guiding subsequent computations.
    • Mix Head: Focuses on both relation and subject tokens, often working similarly to a mover head.
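As a sketch of how such heads can be characterized (assuming TransformerLens attention-pattern and per-head-output hooks; the prompt is illustrative and the exact printed tokens may differ), one can check which position a head attends to from the last token and which vocabulary tokens its output promotes:

```python
# Sketch: characterize a head by its attention pattern and its vocabulary-space output.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
model.set_use_attn_result(True)                  # expose per-head outputs

tokens = model.to_tokens("The official language of France is")
str_tokens = model.to_str_tokens(tokens)
_, cache = model.run_with_cache(tokens)

layer, head = 14, 7                              # L14H7, reported in the paper as a mover head
pattern = cache["pattern", layer][0, head]       # [query_pos, key_pos] attention probabilities
last = pattern.shape[0] - 1
print("attends mostly to:", str_tokens[int(pattern[last].argmax())])

# Project this head's write at the last position into vocabulary space (W_U LN(n_i)).
head_out = cache["result", layer][0, last, head]
vocab_logits = model.ln_final(head_out) @ model.W_U
print("promotes:", model.to_str_tokens(torch.topk(vocab_logits, 5).indices))
```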

4.2.4. Analyzing Knowledge Editing and LLM Behaviors

  • Knowledge Editing: The paper investigates how knowledge editing methods (e.g., ROME, FT-M) alter the knowledge circuits. It observes how edited information is incorporated (e.g., at specific layers, by specific heads) and how this changes the information flow compared to the original circuit.

  • Hallucinations: By analyzing knowledge circuits when the model hallucinates, the paper seeks to identify breakdowns in the information flow, such as ineffective mover heads or selection of incorrect information.

  • In-Context Learning (ICL): For ICL, knowledge circuits are constructed with and without demonstrations. The appearance of new attention heads (e.g., induction heads [50]) focusing on demonstration contexts is observed to explain how ICL works mechanistically.

    The process of circuit discovery involves several steps of running the model and intervening in its computational graph:

  1. Forward Pass: Run a forward pass to compute the baseline performance (log-likelihood of the target token).

  2. Iterative Ablation: For each potential edge in the graph (or a subset of interest):

    • Ablate the edge (e.g., set its output to zero).
    • Perform another forward pass with the ablated graph.
    • Calculate the performance change $S(e_i)$.
    • If $S(e_i) < \tau$, consider the edge non-critical and effectively remove it from the circuit.
  3. Circuit Definition: The remaining edges and their connected nodes form the knowledge circuit for that specific piece of knowledge.

  4. Analysis: Analyze the identified circuit components by mapping their outputs to the vocabulary space to understand their role and contribution.

    This methodology provides a rigorous framework for decomposing the complex internal operations of LLMs into interpretable subgraphs responsible for specific knowledge articulation.

5. Experimental Setup

5.1. Datasets

The study utilizes the dataset provided by Hernandez et al. [42]. This dataset comprises various types of knowledge, which are categorized into:

  • Factual Knowledge: Relates to concrete facts (e.g., "person mother," "landmark in country," "company hq").

  • Commonsense Knowledge: General understanding of the world (e.g., "object superclass," "fruit inside color," "work location").

  • Linguistic Knowledge: Pertains to language rules and relationships (e.g., "adjective antonym," "word first letter," "verb past tense").

  • Bias Knowledge: Captures societal biases often present in training data (e.g., "occupation age," "name religion," "occupation gender").

    The original dataset from Hernandez et al. [42] was used for few-shot settings. However, in this paper, the focus is on zero-shot knowledge storage, meaning the model's inherent knowledge without in-context examples. Therefore, the data is sampled based on whether the model correctly answers the given prompt in a zero-shot setting using the Hit@10 metric. The test set is sampled in a 1:1 ratio with the validation set to ensure balanced evaluation.

The detailed statistics of the dataset are presented in the following table (Table 3 from the original paper): The following are the results from Table 3 of the original paper:

Category Relation # # Correct in Hit@10 [GPT2-Medium, GPT2-large, TinyLLaMA]
factual person mother 994 83 144 361
person father 991 359 385 474
person sport position 952 400 489 596
landmark on continent 947 835 543 694
person native language 919 310 220 260
landmark in country 836 600 489 709
person occupation 821 57 76 149
company hq 674 308 312 470
product by company 522 422 432 460
person plays instrument 513 510 505 498
star constellation name 362 266 148 297
plays pro sport 318 317 316 315
company CEO 298 20 52 125
superhero person 100 28 35 50
superhero archnemesis 96 6 6 23
person university 91 14 37 35
pokemon evolution 44 11 13 16
country currency 30 25 25 30
food from country city in country 30 23 20 25 29
country capital city 27 24 23 27
country language 24 24 24 24
country largest city 24 24 24 24
person lead singer of band 21 7 16
president birth year 11 12 21
president election year 19 19 17 18 - -
commonsense object superclass 76 62 64 72
word sentiment 60 14 34
task done by tool 52 44 9 45 45
substance phase of matter 50 12 16 48
work location 38 28 24 27
fruit inside color 36 35 36
task person type 36 32 28 27
fruit outside color 30 16 20 26 21
linguistic word first letter 241 236 235 241
word last letter 241 135 73 114
adjective antonym 100 80 81 84
adjective superlative 80 24 19 63
verb past tense 76 1 15 76
adjective comparative 68 7 15 63
bias occupation age 45 18 20 18
univ degree gender 38 14 35 38
name birthplace 31 29 30 31
name religion 31 24 31 31
characteristic gender 30 26 30 30
name gender 19 19 19 19
occupation gender 19 19 19 19

5.2. Evaluation Metrics

  • MatchNLL (Negative Log-Likelihood): This metric is used during the circuit discovery phase to quantify the impact of ablating an edge. It measures how well the model predicts the target entity.

    • Conceptual Definition: In the context of next-token prediction, Negative Log-Likelihood (NLL) measures the discrepancy between the model's predicted probability distribution over the vocabulary and the actual target token. A lower NLL means the model assigns a higher probability to the correct token, indicating better prediction. When an edge is ablated, the change in NLL (or rather, the log-likelihood as formulated in the paper) indicates how critical that edge is for the correct prediction.
    • Mathematical Formula (from paper): $ S(e_i) = \log(\mathcal{G}(o \mid (s, r))) - \log(\mathcal{G}/e_i(o \mid (s, r))) $
    • Symbol Explanation:
      • $S(e_i)$: The score quantifying the impact of ablating edge $e_i$. A higher positive score means ablating $e_i$ significantly decreased the log-probability of the target, so $e_i$ is critical.
      • $\log(\mathcal{G}(o \mid (s, r)))$: The natural logarithm of the probability assigned by the original full model $\mathcal{G}$ to the target entity $o$ given the subject $s$ and relation $r$.
      • $\log(\mathcal{G}/e_i(o \mid (s, r)))$: The natural logarithm of the probability assigned to the target entity $o$ given $s$ and $r$ by the model with edge $e_i$ ablated.
  • Hit@10: This metric is used to evaluate the completeness of a discovered knowledge circuit and the overall performance of the models in recalling knowledge.

    • Conceptual Definition: Hit@k measures how often the true target token appears within the top kk predicted tokens from the model's vocabulary. Hit@10 specifically checks if the target token is among the top 10 most probable tokens predicted. It's a common metric for assessing the accuracy of knowledge retrieval or generation, especially in open-ended prediction tasks where exact matches might be less important than finding the correct answer within a reasonable set of candidates.
    • Mathematical Formula (as written in the paper): $ \mathrm{Hit@10} = \frac{1}{|V|} \sum_{i=1}^{|V|} \mathrm{I}\left(\mathrm{rank}_o \leq 10\right) $ Two apparent typos are worth noting: the paper labels this formula Hit@16 even though the condition is $\mathrm{rank}_o \leq 10$, and the denominator $|V|$ is described as the vocabulary size, whereas the standard Hit@k definition averages over the $N$ test examples, i.e., $ \frac{1}{N} \sum_{j=1}^{N} \mathrm{I}(\mathrm{rank}_{o_j} \leq 10) $. Following the paper's text, we refer to the metric as Hit@10 and interpret it as the fraction of examples whose target rank is at most 10. A small computation sketch under this interpretation follows the symbol list below.
    • Symbol Explanation:
      • $\mathrm{Hit@10}$: The fraction of cases where the target entity's rank is within the top 10 (the paper's formula is labeled Hit@16, but the condition and the rest of the paper use 10).
      • $|V|$: Described in the paper as the vocabulary size; under the standard Hit@k definition this denominator is instead the number of test examples.
      • $\mathrm{I}(\cdot)$: The indicator function, which evaluates to 1 if the condition inside the parentheses is true, and 0 otherwise.
      • $\mathrm{rank}_o$: The rank of the target entity $o$ in the model's list of predicted tokens (sorted by probability; the highest-probability token has rank 1).
      • 10: The threshold for the rank; the target token must be among the top 10 predictions to be counted as a "hit".
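Under the standard interpretation noted above (averaging over test examples rather than the vocabulary), a small sketch of the Hit@10 computation might look as follows; the function and the random inputs in the usage example are illustrative.

```python
# Sketch: Hit@k over per-example final-position logits and target token ids.
import torch

def hit_at_k(example_logits, target_ids, k=10):
    """example_logits: list of [vocab_size] tensors; target_ids: list of correct token ids."""
    hits = 0
    for logits, target in zip(example_logits, target_ids):
        # Rank 1 = highest-probability token; count a hit if the target is within the top k.
        rank = int((logits > logits[target]).sum().item()) + 1
        hits += int(rank <= k)
    return hits / len(target_ids)

# Illustrative usage with random logits (real usage would take the model's final-position logits).
fake_logits = [torch.randn(50257) for _ in range(4)]
fake_targets = [int(torch.randint(0, 50257, (1,))) for _ in range(4)]
print(hit_at_k(fake_logits, fake_targets, k=10))
```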

5.3. Baselines

The paper's primary contribution is a novel interpretability framework, not a new predictive model, so it doesn't compare against traditional predictive baselines in the same way. Instead, it implicitly compares against:

  • The Full Original Model ($\mathcal{G}$): The performance of the discovered knowledge circuits ($\mathcal{C}$) is compared against the full model to assess the circuit's completeness (how well it reproduces the full model's behavior).
  • Random Circuits: To validate the significance of the discovered circuits, their performance is compared against random circuits of the same size. These random circuits are generated by randomly removing edges while ensuring connectivity, serving as a null hypothesis to show that the identified circuits are not merely effective due to their size.

5.4. Implementations

  • Models Used: GPT-2 medium, GPT-2 large, and TinyLLAMA [29]. TinyLLAMA is specifically chosen to validate effectiveness across different architectures, particularly noting its use of Grouped Query Attention Mechanism [67].
  • Tools:
    • Automated Circuit Discovery (ACDC) [32] toolkit: Used to construct the circuits.
    • TransformerLens [41]: Used for further analysis of the results.
  • Ablation Method: Zero ablation is employed to remove specific computational nodes.
  • Hyperparameter: The threshold $\tau$ for MatchNLL is a critical hyperparameter. Values tested include $\{0.02, 0.01, 0.005\}$ to determine appropriate circuit size.
  • Computational Resources: NVIDIA-A800 (40GB) GPUs. Computing a circuit for a knowledge type in GPT-2 medium took approximately 1-2 days.
  • TinyLLAMA Specifics: For TinyLLAMA, which uses Grouped Query Attention, the key and value pairs are interleaved and repeated to analyze specific attention head behaviors.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Knowledge Circuit Evaluation and Completeness

The study demonstrates that the discovered knowledge circuits are effective in retaining significant portions of the original model's performance with a substantially reduced subgraph.

The following are the results from Table 1 of the original paper:

Type Knowledge #Edge Dval [Original(G), Circuit(C)] Dtest [Original(G), Random, Circuit(C)]
Linguistic Adj Antonym 573 0.80 1.00 ↑ 0.00 0.00 0.40 ↑
word first letter 432 1.00 0.88 0.36 0.00 0.16
word last letter 230 1.00 0.72 0.76 0.00 0.76
Commonsense object superclass 102 1.00 0.68 0.64 0.00 0.52
fruit inside color 433 1.00 0.20 0.93 0.00 0.13
work location 422 1.00 0.70 0.10 0.00 0.10
Factual Capital City 451 1.00 1.00 0.00 0.00 0.00
Landmark country 278 1.00 0.60 0.16 0.00 0.36 ↑
Country Language 329 1.00 1.00 0.16 0.00 0.75 ↑
Person Native Language 92 1.00 0.76 0.50 0.00 0.76↑
Bias name religion 423 1.00 0.50 0.42 0.00 0.42
occupation age 413 1.00 1.00 1.00 0.00 1.00
occupation gender 226 1.00 0.66 1.00 0.00 0.66
name birthplace 276 1.00 0.57 0.07 0.00 0.57 ↑
Avg 0.98 0.73 0.44 0.00 0.47↑
  • Circuit Size and Performance: The discovered circuits, despite using less than 10% of the original model's subgraph (indicated by the #Edge column, which is the number of edges in the circuit), can maintain over 70% of the original model's performance on the validation set ($D_{val}$) (the average Circuit(C) score on $D_{val}$ is 0.73). This demonstrates the completeness and efficacy of these circuits in capturing relevant knowledge.
  • Comparison to Random Circuits: Random circuits of the same size perform significantly worse (the average Random score on $D_{test}$ is 0.00), highlighting that the identified circuits are not merely artifacts of size but truly capture the knowledge-articulating pathways.
  • Performance Improvement: Intriguingly, some tasks, like "Landmark country" and "Country Language," show improved performance on the test set (Circuit(C) > Original(G)) when using only the circuit. This suggests that the full model might contain noise from other components that hinders performance on these specific tasks, and the isolated circuit effectively filters this noise.

6.1.2. Knowledge Aggregation Across Layers

Analysis of activated circuit component distributions (Figure 2) shows that attention and MLPs are more active in the lower layers, where the model extracts general information. The following figure (Figure 2 from the original paper) illustrates the activated circuit component distributions:

Figure 2: The activated circuit component distributions across layers in GPT2-Medium. (The figure is a set of bar charts showing, for linguistic, commonsense, bias, and factual knowledge respectively, how frequently circuit components are activated at each layer.)

Further, by computing the average rank of the target token across layers (Figure 7), the paper observes early decoding—the target entity becomes present in the residual stream by middle to later layers, and subsequent layers primarily serve to increase its probability. The following figure (Figure 7 from the original paper) shows the knowledge encoding performance of different models across various layers:

Figure 7: Knowledge encoding across layers for GPT2-Medium (top), GPT2-Large (middle), and TinyLLAMA (bottom). The y-axis shows the average rank of the target entity and the x-axis the layer index; the target is generally captured around the middle-to-late layers.
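A sketch of how such per-layer ranks can be computed with the logit lens (assuming TransformerLens; the prompt and target are illustrative, not the paper's exact setup):

```python
# Sketch: track the rank of the target token in the residual stream across layers.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
tokens = model.to_tokens("The official language of France is")
target_id = model.to_single_token(" French")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]             # residual stream at the last position
    logits = model.ln_final(resid) @ model.W_U            # logit-lens projection
    rank = int((logits > logits[target_id]).sum().item()) + 1
    print(f"layer {layer:2d}: target rank {rank}")
```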

6.1.3. Special Components in Knowledge Circuits

The circuits reveal specialized roles for different attention heads:

  • Mover Heads: (e.g., L14H7 in the "France is French" example) transfer information from the subject position (e.g., "France") to the last token position. The paper reinterprets these not just as argument parsers but as extract heads that extract related information from the subject's position.

  • Relation Heads: (e.g., L14H13) focus on the relation token (e.g., "language") and generate relation-related tokens (e.g., "language").

  • Mixture Heads: Attend to both relation and subject tokens.

    A running example for "The official language of France is French" (Figure 1, Figure 3) demonstrates that after MLP17, the target "French" emerges as the top token, and its probability further increases. This MLP integrates information from relation heads (e.g., L14H13, focusing on "language") and mover heads (e.g., L14H7, extracting information related to "France"). The model accumulates knowledge from early-to-middle layers, with MLPs combining information and prioritizing the target token, and later attention heads (e.g., L18H14, L20H6) enhancing the final prediction.

The following figure (Figure 1 from the original paper) illustrates the knowledge circuit for "The official language of France is French":

Figure 1: Knowledge circuit obtained from "The official language of France is French" in GPT2-Medium. Left: a simplified circuit; the whole circuit is in Figure 9 in the Appendix. We use `-- { }` to skip some complex connections between nodes. Here, L15H0 means the first attention head in the 15th layer and MLP12 means the MLP in the 13th layer. Right: the behavior of several special heads. The matrix on the left is the attention pattern of each attention head, and the heatmap on the right shows the output logits of the head mapped to the vocabulary space.

The following figure (Figure 3 from the original paper) shows the rank and probability of the target entity oo for the fact "The official language of France is French":

Figure 3: The rank and probability of the target entity $o$ at both the last subject token and the last token position when unembedding the intermediate layer's output for the fact "The official language of France is French". (Ranks are plotted on a log scale across layers; solid lines show the rank at the final and subject positions, and dashed lines the corresponding probabilities.)

6.1.4. Knowledge Editing Impact

The study examines how ROME [18] and FT-M [24] affect knowledge circuits.

  • ROME: Tends to incorporate edited information primarily at the edited layer. Subsequent mover heads then transport this new information to the residual stream of the last token. For example, for the fact "Platform Controller Hub was created by Intel", if L15H3 was a mover head that previously moved "Controller", after ROME editing, its output shifts to "Intel".

  • FT-M: Shows a more direct integration, where the edited token's logits are boosted significantly (more than 10 for "Intel" in MLP-0) immediately at the edited component, dominating subsequent predictions. This suggests FT-M writes knowledge directly into specific components.

  • Side Effects: Both methods, especially FT-M, risk overfitting or influencing unrelated knowledge. For instance, after editing to make "Platform Controller Hub was created by Intel", querying "Windows server is created by?" also yields "Intel," demonstrating a generalization issue. This supports the idea that edits add signals into circuits rather than fundamentally altering storage.

    The following figure (Figure 4 from the original paper) illustrates the different behaviors when editing the language model:

    Figure 4: Different behaviors when we edit the language model. In the original model, the mover head L15H3 moves the original token "Controller" and other information, while for ROME, the mover head selects the newly edited information "Intel", meaning ROME successfully added "Intel" to the model. For the FT layer-0 editing, the method directly writes the edited knowledge into the edited component. However, both editing methods also affect the unrelated input "Windows server is created by?".

The following figure (Figure 11 from the original paper) shows the knowledge circuit obtained from the edited model:

Figure 11: The knowledge circuit obtained from the edited model for the case "Platform Controller Hub was created by" with the target entity "Intel". When the model is edited at different layers (layers 0, 6, 12, and 23), the fine-tuned setting allows the edited MLP to directly provide the edited information.

6.1.5. Interpretation of LLM Behaviors

  • Factual Hallucinations: When hallucinations occur (Figure 5, left), the model fails to correctly transfer knowledge to the final token in earlier layers. This is often due to the knowledge circuit lacking an effective mover head or the mover head selecting incorrect information. For example, for "The official currency of Malaysia is called", both "Ringgit" (correct) and "Malaysian" (incorrect) are accumulated early, but a mover head at layer 16 extracts the erroneous "Malaysian," leading to hallucination.

  • In-Context Learning (ICL): When demonstrations are provided for ICL (Figure 5, right), new attention heads emerge in the knowledge circuit. These new heads primarily focus on the demonstration's context, acting as induction heads [50] that look for previous instances of the current token and the token that followed it. Ablation studies on these newly appeared heads show a significant drop in prediction probability, confirming their importance in ICL.

    The following figure (Figure 5 from the original paper) illustrates fact hallucination and in-context learning cases:

    Figure 5: Left: fact hallucination case "The official currency of Malaysia is called"; we observe that, at layer 15, the Mover Head selects incorrect information. Right: in-context learning case; we notice that some new heads focusing on the demonstration appear in the knowledge circuit.

The following are the results from Table 2 of the original paper:

Knowledge Origin Model Ablating extra head Ablating random head
Linguistic adj_comparative 62.24 32.55 58.18
Commonsense word_sentiment 89.02 55.50 88.61
substance_phase 78.74 52.85 71.24
Bias occupation_gender 86.97 59.54 86.54
Factual person_occupation 35.17 23.27 31.60

The table shows that ablating extra heads (new attention heads appearing in the ICL circuit) causes a much larger drop in performance compared to ablating random heads, underscoring their critical role in in-context learning.

6.1.6. Multi-hop Factual Knowledge Editing

The paper briefly touches on multi-hop knowledge editing, noting that current methods struggle. They observe that the mover head in the original multi-hop circuit extracts the second-hop answer. After editing the first hop, the mover head extracts the new (edited) information, dominantly saturating and influencing the circuit, leading to incorrect subsequent reasoning. An intriguing finding is that even in the original model, if the first-hop context is removed, the model can still directly provide the multi-hop answer, suggesting reliance on relational and subject-related information rather than strict grammatical adherence.

The following figure (Figure 10 from the original paper) illustrates a specific case in Multi-hop reasoning:

Figure 10: A specific case in multi-hop reasoning. When we removed the context of the first-hop question, we found the model still directly gave the answer. The phenomenon appears in both GPT2 and TinyLLAMA. (The figure contrasts factual, multi-hop, and corrupted knowledge prompts and their answers.)

6.2. Data Presentation (Tables)

Table 1: Hit@10 of the original model and the standalone circuit performance in GPT2-Medium. (Already transcribed in Section 6.1.1)

Table 2: Performance change via ablating the newly appeared attention heads in the ICL circuit and random heads. (Already transcribed in Section 6.1.5)

Table 3: Information about the dataset. (Already transcribed in Section 5.1)

Table 4: Special component behaviour in circuits as task-specific head. The following are the results from Table 4 of the original paper:

Model Type Fact Critical Component in Circuit
GPT2-Medium Linguistic Antonym L17H2, L18H1, L13H12, L13H8
Factual city country L21H12, L16H2
Commonsense work location L19H15, L14H4, L13H3
Bias name country L16H6, L21H12
GPT2-Large Linguistic Antonym L25H5, L24H16, L19H13, L18H8
Factual company hq L30H6, L25H13
Commonsense work location L18H13, L28H18, L30H5
Bias name country L21H19, L29H2
TinyLLAMA Linguistic Verb past tense L17H0, MLP20
Factual Landmark country L15H11, L17H19, MLP18
Commonsense Fruit Inside Color L18H25, MLP18
Bias name country L15H11, MLP17

This table lists specific attention heads and MLPs (e.g., L17H2, MLP20) identified as critical components within the knowledge circuits for various types of knowledge and relations across different models (GPT2-Medium, GPT2-Large, TinyLLAMA). This provides concrete examples of the components that constitute these circuits.

Table 5: Hit for different hops. The following are the results from Table 5 of the original paper:

First-hop Second-hop Integrate
node 83.33 70.27 71.42
edge 63.20 45.27 49.42

This table shows the overlap (or Hit) of nodes and edges when comparing knowledge circuits for first-hop, second-hop, and integrated multi-hop knowledge. High overlap for nodes (e.g., 83.33% for first-hop) indicates component reuse in multi-hop reasoning.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Threshold $\tau$ for Circuit Discovery

The choice of threshold $\tau$ (e.g., $\{0.02, 0.01, 0.005\}$) is crucial. A high $\tau$ leads to an incomplete circuit, while a low $\tau$ results in too many unnecessary nodes. The paper implies that an appropriate $\tau$ was chosen to yield circuits that are small yet retain high performance.

6.3.2. Ablation of New Heads in ICL

As shown in Table 2 (transcribed in Section 6.1.5), ablating the newly appeared attention heads in the ICL circuit significantly drops performance (e.g., adj_comparative from 62.24 to 32.55). In contrast, ablating random heads has a much smaller impact (e.g., adj_comparative from 62.24 to 58.18). This confirms the causal role of these specific heads in facilitating in-context learning.

6.3.3. Ablation of Mover and Relation Heads

Figure 6 (Appendix B.2) shows that ablating mover heads increases the probability of other non-subject-related tokens (e.g., "Italian", "English", "Spanish" instead of the target "French"), while ablating relation heads leads to meaningless words (e.g., "a", "that"). This directly illustrates their specific functions in transferring subject information and guiding relation-based output. The following figure (Figure 6 from the original paper) shows the output of the model after ablating mover and relation heads:

Figure 6: The output of the model under three conditions: the original output, the output after ablating the mover head, and the output after ablating the relation head, shown as heatmaps of token logits. Ablating the mover head increases the probability of "Italian", "English", and "Spanish", which are not subject-related, while ablating the relation head increases meaningless words such as "a" and "that".

6.4. Component Reuse

The paper observes component reuse across related tasks. For example, L21H12 is critical for city_in_country, name_birth_place, and country_language relations in GPT2-Medium. Similarly, L7H14 appears in circuits for both official_language and official_currency. This suggests that certain components act as topic heads rather than being strictly task-specific, potentially handling general semantic categories.

6.5. Model Differences

  • Knowledge Aggregation: In GPT-2 models, the rank of the target entity drops gradually across middle-to-later layers, indicating gradual knowledge aggregation. In contrast, TinyLLAMA shows a sharper decline in rank around specific layers, suggesting a more concentrated knowledge processing. This difference is hypothesized to relate to model knowledge capacity [44].
  • Special Components Concentration: TinyLLAMA's special components (e.g., mover heads, relation heads) are more concentrated in later layers compared to GPT-2.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces and validates the concept of knowledge circuits as interpretable subgraphs within Transformer language models. By moving beyond isolated component analysis, the authors demonstrate how attention heads, MLPs, and embeddings collaboratively encode and articulate specific knowledge. Key findings include the high completeness of these circuits in reproducing model behavior, the identification of specialized mover and relation heads for information flow, and a mechanistic explanation for knowledge editing effects (ROME adds signals at edited layers, FT-M directly writes knowledge). Furthermore, knowledge circuits provide concrete insights into hallucinations (failure of mover heads) and in-context learning (emergence of new induction heads). The research highlights that knowledge is aggregated from early to middle layers and refined in later layers.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Granularity of Circuits: The current approach operates at a relatively coarse granularity. Neurons within an MLP, for instance, might require finer-grained analysis to fully capture their behavior.

  • Efficiency of Discovery: The causal mediation method used for circuit discovery is time-intensive. More efficient methodologies like mask training [63, 64] or sparse auto-encoders [65, 36] could be explored.

  • Opacity of Activation: While the paper identifies which components work together, the precise reasons why they are activated remain somewhat opaque.

  • Logit Lens Discrepancies: The logit lens method may face discrepancies between middle layers and the output unembedding matrix, hindering comprehensive analysis of early-layer component behavior. More robust techniques like Attention Lens [86] could be explored, despite being resource-intensive.

  • Mover Head Activation Mechanisms: The mechanisms by which mover heads are activated and the conditions under which they operate (especially given their reuse across different knowledge types) require further exploration to understand monosemantic vs. polysemantic behaviors [90].

  • Multi-hop Reasoning in Aligned Models: The intriguing phenomenon of models directly answering multi-hop questions even after removing first-hop context, and how this is alleviated by aligned models, requires further investigation.

    Future work directions suggested include:

  • Advancing knowledge circuit discovery methodologies for greater efficiency and finer granularity.

  • Further exploring the activation mechanisms of specialized heads, particularly in cases of component reuse.

  • Developing more robust techniques for analyzing intermediate layer outputs, potentially leveraging Attention Lens or similar methods.

  • Investigating the observed differences in knowledge aggregation and component concentration between models like GPT-2 and TinyLLAMA.

  • Using knowledge circuits to guide the improved design of knowledge editing and to enhance factuality and alleviate hallucinations in LLMs.

  • Contributing to the development of trustworthy AI by ensuring safety and privacy of information through circuit breakers [66].

7.3. Personal Insights & Critique

This paper offers a significant step forward in mechanistic interpretability by providing a holistic framework for understanding knowledge representation and utilization in LLMs. The shift from isolated component analysis to knowledge circuits is crucial, as real-world cognitive processes are rarely isolated. The finding that knowledge aggregates in early-to-middle layers and is refined later, combined with the identification of specific head types, paints a more detailed picture of the internal "reasoning" process.

The mechanistic explanation for hallucinations and in-context learning is particularly valuable. Understanding that hallucinations can stem from a mover head misdirection provides concrete avenues for mitigation. Similarly, observing the emergence of induction heads during ICL reinforces the idea that LLMs are not just pattern matchers but develop specialized modules for specific learning paradigms.

A potential area for deeper exploration might be the interplay between the identified knowledge circuits and the model's emergent world models. If these circuits are indeed fundamental, how do they contribute to or conflict with higher-level semantic understanding? Also, while the paper uses zero ablation, investigating the impact of mean ablation could provide complementary insights, especially regarding default behaviors when a component is "silenced." The observed performance improvement in circuits over the full model for some tasks is fascinating and warrants further study into how "noise" manifests in LLMs and how circuits might inherently filter it.

The work's focus on bridging the gap between interpretability and practical application (like knowledge editing) is commendable. By providing a "roadmap" of knowledge flow, it can guide more targeted and effective editing strategies, moving beyond trial-and-error. The observed component reuse phenomenon, where certain heads act as topic heads, suggests an underlying organizational principle within LLMs that could be exploited for more efficient and robust knowledge management.
