Attention is not not Explanation
TL;DR Summary
This paper challenges the claim that attention is not explanation, proposing four rigorous tests which show that attention in RNNs can provide meaningful interpretability, depending on how explanation is defined and on the experimental design.
Abstract
Attention mechanisms play a central role in NLP systems, especially within recurrent neural network (RNN) models. Recently, there has been increasing interest in whether or not the intermediate representations offered by these modules may be used to explain the reasoning for a model's prediction, and consequently reach insights regarding the model's decision-making process. A recent paper claims that `Attention is not Explanation' (Jain and Wallace, 2019). We challenge many of the assumptions underlying this work, arguing that such a claim depends on one's definition of explanation, and that testing it needs to take into account all elements of the model, using a rigorous experimental design. We propose four alternative tests to determine when/whether attention can be used as explanation: a simple uniform-weights baseline; a variance calibration based on multiple random seed runs; a diagnostic framework using frozen weights from pretrained models; and an end-to-end adversarial attention training protocol. Each allows for meaningful interpretation of attention mechanisms in RNN models. We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
In-depth Reading
1. Bibliographic Information
1.1. Title
Attention is not not Explanation
The title is a direct and clever response to a contemporary paper, "Attention is not Explanation" (Jain and Wallace, 2019). The use of a double negative ("not not") signals a nuanced counter-argument. It suggests that while attention may not be a perfect or exclusive form of explanation, it is also not meaningless or useless for explainability. The title effectively frames the paper as a rebuttal that aims to restore some of the community's confidence in attention mechanisms as a tool for interpreting model behavior, while still acknowledging the complexities of the issue.
1.2. Authors
- Sarah Wiegreffe: At the time of publication, a student at the School of Interactive Computing, Georgia Institute of Technology. Her research focuses on explainability in NLP, particularly in areas like summarization and generating free-text explanations.
- Yuval Pinter: At the time of publication, a PhD student at the School of Interactive Computing, Georgia Institute of Technology. His research interests include computational morphology, phonology, and low-resource NLP, with a focus on understanding the internal representations of neural networks.
Their affiliations with a leading computer science school and their respective research interests in explainability and model internals position them well to critically analyze and propose new methods for evaluating attention mechanisms.
1.3. Journal/Conference
The paper was published in the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). ACL is the premier international conference for Natural Language Processing (NLP). Its acceptance criteria are highly rigorous, and papers published at ACL are considered to be of high quality and significant impact within the field. The venue's prestige underscores the importance and timeliness of the debate surrounding attention and explainability.
1.4. Publication Year
2019
1.5. Abstract
The abstract introduces the central role of attention mechanisms in modern NLP, particularly in Recurrent Neural Network (RNN) models. It directly addresses the claim made by Jain and Wallace (2019) that "Attention is not Explanation." The authors of this paper challenge the assumptions underlying that claim, arguing that the validity of attention as an explanation depends heavily on how "explanation" is defined and requires rigorous experimental design that considers the entire model. They propose four new alternative tests to assess when attention can serve as an explanation: a uniform-weights baseline, a variance calibration using multiple random seeds, a diagnostic framework using frozen attention weights, and a model-consistent adversarial training protocol. Their findings show that even when adversarial attention distributions can be found, they are less meaningful than the original distributions, suggesting that prior work did not successfully disprove the utility of attention for explainability.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/1908.04626
- Publication Status: The paper was officially published at ACL 2019 and is also available as an arXiv preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper addresses is the growing controversy over whether attention mechanisms in neural networks can be reliably used as explanations for their predictions. Attention scores, which assign a weight to each input token, are intuitively appealing as they seem to show what parts of the input the model "focused on." This has led many researchers to use them for model debugging, interpretation, and building trust.
However, a highly influential paper by Jain and Wallace (2019), "Attention is not Explanation," challenged this view. They argued that if alternative, very different attention distributions could be found that still lead to the same model prediction, then the original attention distribution cannot be a "faithful" explanation of the model's reasoning. They demonstrated empirically that such alternative distributions were easy to find.
This paper by Wiegreffe and Pinter serves as a direct rebuttal to Jain and Wallace. The authors' motivation is that the methodology used by Jain and Wallace was flawed because it detached the attention scores from the model that generated them, manipulating them as independent variables for each data instance. This gives the adversarial search an unrealistic amount of freedom. The innovative idea of this paper is to argue that any test of explainability must be model-consistent—that is, an alternative explanation (an adversarial attention distribution) must be generated by a coherently trained model, not just constructed ad-hoc.
2.2. Main Contributions / Findings
The paper makes several key contributions to the debate on attention and explainability:
- A Methodological Critique: It provides a rigorous critique of the experimental design of Jain and Wallace (2019), arguing that their counterfactual experiments do not prove their thesis because they ignore the fact that attention weights are computed by an integral part of the model.
- Four Novel Diagnostic Tests: The paper introduces a suite of four practical and more rigorous experiments for researchers to evaluate the meaningfulness of attention in their own models:
- Uniform Attention Baseline: A sanity check to see if learned attention provides any benefit over a simple uniform distribution.
- Random Seed Variance: A method to quantify the natural variance in attention distributions to provide a baseline for what "different" really means.
- Diagnostic MLP Framework: A novel post-hoc test where frozen attention weights from a complex model are used to guide a simpler, non-contextual model, testing the transferability and inherent importance encoded in the weights.
- Model-Consistent Adversarial Training: A new end-to-end training protocol to find adversarial attention distributions that are generated by a fully parameterized model, providing a much fairer test of "faithful" explainability.
- Key Empirical Findings: The authors apply these tests and find that:
  - On some tasks, learned attention offers little improvement over uniform attention, meaning it is not useful for explanation in those cases.
  - The model-consistent adversarial training protocol can find alternative attention distributions, but they are far less extreme than those found by Jain and Wallace's method.
  - Most importantly, these adversarially generated attention weights perform very poorly on the diagnostic MLP test, indicating they are not as meaningful or useful as the original learned attention weights.

The overarching conclusion is that attention is not not explanation. While not a perfect or exclusive explanation, learned attention distributions often capture meaningful information about token importance that cannot be easily dismissed or replaced by arbitrary adversarial distributions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
Recurrent Neural Networks (RNNs) are a class of neural networks designed to work with sequential data, such as text, time series, or speech. Unlike standard feedforward networks, RNNs have loops, allowing information to persist. For a given input sequence $x_1, \dots, x_T$, an RNN computes a sequence of hidden states $h_1, \dots, h_T$. At each timestep $t$, the hidden state $h_t$ is calculated as a function of the current input $x_t$ and the previous hidden state $h_{t-1}$. This "memory" of past inputs allows the network to capture contextual information.
Long Short-Term Memory (LSTM) networks are a special, more advanced type of RNN, introduced by Hochreiter and Schmidhuber (1997). They were designed to solve the vanishing gradient problem, which makes it difficult for standard RNNs to learn long-range dependencies. LSTMs achieve this with a more complex internal structure, including a cell state ($c_t$) and three gates (input, forget, and output gates). These gates are small neural networks that learn to regulate the flow of information:
- Forget Gate: Decides what information from the previous cell state $c_{t-1}$ should be discarded.
- Input Gate: Decides what new information from the current input $x_t$ should be stored in the cell state $c_t$.
- Output Gate: Decides what information from the cell state $c_t$ should be used to compute the current hidden state $h_t$.
This gating mechanism allows LSTMs to selectively remember or forget information over long sequences, making them very effective for many NLP tasks. In this paper, a bidirectional LSTM is used, which processes the input sequence both forwards and backwards and concatenates the hidden states, giving the model context from both past and future tokens at every position.
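As a concrete reference point, here is a minimal sketch (assuming PyTorch; class name and dimensions are illustrative, not taken from the authors' code) of a bidirectional LSTM encoder of the kind described above, which produces the per-token hidden states that the attention layer later weights:

```python
# Bidirectional LSTM encoder sketch: forward and backward hidden states are concatenated per token.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> hidden_states: (batch, seq_len, 2 * hidden_dim)
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return hidden_states
```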
3.1.2. Attention Mechanisms
The attention mechanism, first proposed by Bahdanau et al. (2014) in the context of machine translation, was designed to help models focus on the most relevant parts of a long input sequence when making a prediction.
In a model without attention (e.g., a standard sequence-to-sequence model), the final hidden state of the encoder RNN is used as the sole context vector to generate the output. This creates an information bottleneck, especially for long sequences. The attention mechanism overcomes this by creating a unique context vector for each step of the output generation.
The process for additive attention (as used in this paper) can be broken down as follows:
- Alignment Scores: For each input hidden state $h_j$, a score $e_{ij}$ is computed that measures how well it "aligns" with the current decoder state $s_{i-1}$ (or, in a classification context, a trainable context vector). This is typically done using a small feedforward neural network: $ e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j) $, where $W_a$ and $U_a$ are learned weight matrices and $v_a$ is a learned vector.
- Attention Weights (Distribution): The alignment scores are normalized into a probability distribution using the `softmax` function. These are the attention weights: $ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $. Each weight $\alpha_{ij}$ represents the importance of the $j$-th input token for producing the $i$-th output, and the weights sum to 1.
- Context Vector: A context vector $c_i$ is computed as a weighted sum of all the input hidden states, using the attention weights: $ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $. This context vector effectively summarizes the input sequence with a focus on the most relevant parts for the current prediction step. It is then used along with the decoder state to make the final prediction.
It is this attention distribution that researchers have proposed as a form of "explanation."
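To make the three steps above concrete, here is a minimal sketch (assuming PyTorch; module and variable names are illustrative, not the authors' code) of an additive attention layer in the classification setting, where the weights depend only on the encoder hidden states rather than a decoder state:

```python
# Additive attention sketch: scores -> softmax weights -> weighted sum of hidden states.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim: int, attn_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)   # plays the role of U_a
        self.v = nn.Linear(attn_dim, 1, bias=False)   # plays the role of v_a

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim); mask: (batch, seq_len), 1 for real tokens, 0 for padding
        scores = self.v(torch.tanh(self.proj(hidden_states))).squeeze(-1)   # alignment scores e_j
        scores = scores.masked_fill(mask == 0, float("-inf"))               # ignore padding positions
        alpha = torch.softmax(scores, dim=-1)                               # attention weights, sum to 1
        context = torch.einsum("bt,bth->bh", alpha, hidden_states)          # c = sum_j alpha_j h_j
        return context, alpha
```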
3.1.3. Explainability, Interpretability, Transparency, and Faithfulness
The paper wades into a complex philosophical debate within AI. These terms are often used interchangeably but have distinct meanings:
- Transparency: Refers to understanding the mechanics of a model as a whole. Lipton (2016) distinguishes different levels, such as understanding the algorithm itself (`simulatability`) or understanding its individual components (`decomposability`). Attention mechanisms could be seen as contributing to `decomposability` by providing an understandable weighting component.
- Interpretability: Rudin (2018) argues this is the goal of creating models that are inherently understandable to humans (e.g., linear regression, decision trees). This is distinct from trying to explain an opaque "black box" model after the fact.
- Explainability (Post-hoc): Refers to the process of generating an explanation for a specific prediction made by a black box model. This explanation is a simplified approximation of the model's true reasoning. Rudin and Riedl (2019) argue these are often just "plausible stories" that may not be faithful to the model's actual decision process.
- Faithfulness: This is a crucial concept in the debate. A faithful explanation accurately reflects the model's internal reasoning process. The core claim of Jain and Wallace is that if many different attention distributions can produce the same output, then any single one cannot be a faithful explanation.
3.2. Previous Works
The central point of reference for this paper is:
- Jain and Wallace (2019), "Attention is not Explanation": This work systematically questioned the use of attention as a faithful explanation. Their methodology had two main prongs:
- Correlation Analysis: They found that attention weights often have low correlation with other feature-importance metrics, like gradient-based methods or leave-one-out (LOO) analysis. This suggests a lack of consistency.
- Counterfactual Adversarial Search: This is the part Wiegreffe and Pinter most directly critique. For a trained model and a given test instance, Jain and Wallace would:
  a. Record the original prediction and attention distribution ($\hat{y}$, $\alpha$).
  b. Find a new attention distribution, $\alpha'$, that was maximally different from $\alpha$ (measured by Jensen-Shannon Divergence), subject to the constraint that the new prediction (obtained by using $\alpha'$ to weigh the hidden states) was nearly identical to $\hat{y}$.
  c. Crucially, this search was performed by directly manipulating the attention scores, completely detached from the attention mechanism parameters that originally computed them.
  Since they could easily find such adversarial distributions, they concluded the original attention was not exclusive and therefore not a faithful explanation.
Other related works mentioned provide context for the use of attention as explanation:
- Lei et al. (2016): Trained a model to jointly predict an outcome and extract a "rationale" (a subset of input text), showing that extractive explanations are a valued goal.
- Mullenbach et al. (2018): Successfully used attention mechanisms to provide explainable predictions for medical codes from clinical text, demonstrating a practical application where attention is treated as explanation.
3.3. Technological Evolution
The use of attention in NLP followed a clear arc:
- Introduction and Adoption (c. 2014-2017): Attention was introduced as a powerful mechanism to improve performance on tasks like machine translation, and it quickly became a standard component in many state-of-the-art NLP models.
- Attention as Explanation (c. 2015-2018): Researchers intuitively began to visualize and analyze attention weights as a way to understand and explain their models' behavior, leading to many papers that presented attention heatmaps as evidence of their models "learning the right thing."
- Critical Scrutiny (c. 2019-present): A wave of research, prominently featuring Jain and Wallace (2019) and Serrano and Smith (2019), began to rigorously question this assumption. They provided evidence that attention might not be as reliable an indicator of importance as previously thought.
- Counter-Rebuttal and Nuance (This Paper): "Attention is not not Explanation" represents the next phase of this debate. It does not blindly defend attention but instead critiques the critics' methodologies and proposes more rigorous ways to evaluate when and how attention can be useful for explanation, adding necessary nuance to the conversation.
3.4. Differentiation Analysis
The core difference between this paper's approach and that of Jain and Wallace (2019) lies in the concept of model consistency.
- Jain and Wallace (Inconsistent): Their adversarial search is performed on a per-instance basis, directly manipulating the vector of attention scores. The resulting adversarial attention distribution does not correspond to any set of model parameters. It is an artifact created for one specific input, detached from the model that is supposed to be explained.
- Wiegreffe and Pinter (Consistent): Their adversarial search trains an entirely new model. This adversarial model has its own set of parameters for its attention mechanism. The goal is to learn a single set of parameters that, for all instances in the training set, produces attention distributions that are different from the base model's while yielding similar predictions. This is a much harder constraint and a fairer test, as it asks whether a plausible alternative model exists, not just a plausible alternative weight vector for one instance.
This shift from an unconstrained, per-instance manipulation to a constrained, model-parameter-based search is the paper's primary methodological innovation.
4. Methodology
The authors propose a series of four experiments designed to more rigorously test the meaningfulness of attention distributions. These are presented as tools for NLP researchers to better understand their own models.
4.1. Principles
The guiding principle behind the authors' methodology is that any evaluation of attention's role as an explanation must be grounded in the context of the entire model and its training process. They argue against treating attention scores as standalone artifacts and instead propose tests that respect the integrity of the model architecture and learning dynamics. Their approach moves from simple sanity checks to a sophisticated, model-consistent adversarial framework.
The architecture used for these experiments is a standard bidirectional LSTM with an additive attention layer for classification, as depicted in the paper's Figure 1.
The components manipulated in the experiments are shown in the following diagram:
The figure is a schematic from the paper showing the structure of a classification LSTM model with attention, including the word embeddings, the LSTM layer, the attention parameters and scores, and the final prediction score, illustrating how these components are connected.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Test 1: Uniform as the Adversary (§3.2)
This first test is a simple but powerful baseline to determine if the learned attention mechanism is contributing anything meaningful to the model's performance.
- Integrated Explanation: The idea is to compare a standard model with learned attention to one where the attention distribution is fixed.
  - First, a `base model` is trained normally. This model consists of a Bi-LSTM that processes the input text, followed by an attention layer that learns to compute weights over the LSTM's hidden states. The final prediction is based on the attended context vector.
  - Next, a `uniform model` is created. This model has the exact same architecture, but the attention layer is modified. Instead of learning weights, it is forced to assign an equal weight to every input token. For an input of length $n$, each token receives an attention weight of $1/n$.
  - The performance (e.g., F1 score) of the `base model` is then compared to the `uniform model`.
- Hypothesis: If the `base model` does not significantly outperform the `uniform model`, it implies that the complex, learned attention distribution is no better than a simple average. In such cases, the authors argue that the attention weights are not useful for explanation because they are not necessary for the task in the first place.
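In code, the uniform baseline amounts to swapping the learned attention layer for one that spreads the weight evenly. A minimal sketch follows (assuming the same PyTorch conventions and mask convention as the attention sketch above; names are illustrative):

```python
# Uniform-attention baseline: identical architecture, but attention is fixed to 1/n per real token.
import torch
import torch.nn as nn

class UniformAttention(nn.Module):
    """Drop-in replacement for a learned attention layer: every non-padding token gets weight 1/n."""
    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim); mask: (batch, seq_len)
        lengths = mask.float().sum(dim=1, keepdim=True).clamp(min=1.0)   # n per instance
        alpha = mask.float() / lengths                                   # 1/n on real tokens, 0 on padding
        context = torch.einsum("bt,bth->bh", alpha, hidden_states)
        return context, alpha
```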
4.2.2. Test 2: Variance within a Model (§3.3)
This test addresses the question: how much do attention distributions naturally vary? Before one can claim an "adversarial" distribution is significantly different, one needs a baseline for expected stochastic variation.
- Integrated Explanation: The procedure is as follows:
  - The same model architecture (Bi-LSTM with attention) is trained multiple times (e.g., 8 times) from scratch. The only difference between each training run is the random seed used for weight initialization.
  - One of these models is designated as the `base model`.
  - For each instance in the test set, the attention distribution from the `base model` is compared to the attention distributions from the other 7 models trained with different seeds.
  - The difference between distributions is measured using Jensen-Shannon Divergence (JSD).
- Purpose: This experiment produces a distribution of JSD values representing the "normal" variance one can expect from the training process itself. If an adversarial method produces attention distributions with JSD values that fall within this range, its findings are less impressive, as the "difference" could be attributed to random chance.
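A minimal sketch of this comparison for a single test instance (assuming NumPy/SciPy; `base_attn` and `other_attns` are hypothetical arrays of attention weights, not the authors' variables). Note that SciPy's `jensenshannon` returns the JS distance, i.e. the square root of the divergence, so it is squared:

```python
# Seed-variance baseline sketch: JSD between the base model's attention and other-seed models' attention.
import numpy as np
from scipy.spatial.distance import jensenshannon

def seed_variance_jsd(base_attn: np.ndarray, other_attns: np.ndarray) -> np.ndarray:
    """base_attn: (seq_len,) attention weights from the base model.
    other_attns: (num_seeds, seq_len) attention weights from models trained with other random seeds.
    Returns one JSD value per seed model."""
    return np.array([jensenshannon(base_attn, a, base=2) ** 2 for a in other_attns])

# Collecting these values over the whole test set yields the "normal variance" distribution
# that any adversarial JSD should be compared against.
```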
4.2.3. Test 3: Diagnosing Attention Distributions by Guiding Simpler Models (§3.4)
This is a novel diagnostic tool designed to test whether attention weights have captured some intrinsic, model-agnostic measure of token importance.
The setup for this diagnostic is illustrated in the following diagram:
The figure is a schematic of the model structure described in Section 3.4 of the paper: starting from word embeddings at the bottom, passing through an affine transformation and an imposed-weights layer, and ending with a prediction score.
- Integrated Explanation: The core idea is to replace the complex, contextual LSTM with a simple, non-contextual model and see if the pre-trained attention weights can help it perform the task.
  - A simple diagnostic model is created. This model is a token-level Multi-Layer Perceptron (MLP). It processes each token's embedding independently through an affine hidden layer with a `tanh` activation. It has no access to surrounding tokens (i.e., it is not a recurrent or convolutional model).
  - The output scores from the MLP for each token are then weighted and aggregated to make a final prediction. The key is that these weights are not learned by the MLP but are imposed from an external source.
  - The following sets of pre-computed weights (guides) are tested:
    - `UNIFORM`: A baseline where every token is weighted equally.
    - `Trained MLP`: A variant where the MLP is allowed to learn its own attention parameters, serving as a measure of how well a simple non-contextual model can perform on its own.
    - `Base LSTM`: The attention weights learned by the original, fully-trained `base` Bi-LSTM model are extracted and used as the frozen guides for the MLP.
    - `ADVERSARY`: The attention weights generated by the adversarial model from Test 4 are used as guides.
  - The MLP's other parameters (the affine layer) are trained, but the guiding attention weights are kept fixed for each instance. The final F1 scores for each guided setup are then compared.
- Hypothesis: If the `Base LSTM`-guided MLP outperforms the `UNIFORM` and `Trained MLP` versions, it provides strong evidence that the attention weights from the LSTM have captured meaningful information about token importance that is transferable and useful even in a non-contextual setting. Conversely, if the `ADVERSARY`-guided MLP performs poorly, it suggests that the adversarial weights are not truly equivalent explanations and lack this transferable meaning.
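A minimal sketch of the guided diagnostic model (assuming PyTorch; class name and layer sizes are illustrative). The per-token MLP scores are aggregated with externally supplied, frozen guide weights, which are passed in as inputs rather than learned:

```python
# Diagnostic MLP sketch: per-token scores aggregated with frozen, externally imposed guide weights.
import torch
import torch.nn as nn

class GuidedTokenMLP(nn.Module):
    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.hidden = nn.Linear(embed_dim, hidden_dim)   # affine layer, trained
        self.out = nn.Linear(hidden_dim, 1)              # per-token score, trained

    def forward(self, embeddings: torch.Tensor, guide_weights: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim); guide_weights: (batch, seq_len), fixed per instance
        token_scores = self.out(torch.tanh(self.hidden(embeddings))).squeeze(-1)
        logits = (guide_weights.detach() * token_scores).sum(dim=1)   # weighted aggregation, no gradient to guides
        return torch.sigmoid(logits)                                  # binary prediction
```

Swapping in `UNIFORM`, `Base LSTM`, or `ADVERSARY` guides then only changes the `guide_weights` tensor, not the trainable parameters.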
4.2.4. Test 4: Training a Model-Consistent Adversary (§4)
This is the paper's primary methodological contribution, proposing a more rigorous way to search for adversarial attention distributions.
- Integrated Explanation: Instead of manipulating attention scores for a single instance, this method trains an entire adversarial model, $\mathcal{M}_a$, that learns to produce divergent attention distributions consistently across the whole dataset.
  - First, a `base model`, $\mathcal{M}_b$, is trained and its parameters are frozen. Its predictions and attention distributions for the training set are pre-computed.
  - A new `adversarial model`, $\mathcal{M}_a$, with the same architecture is initialized.
  - This model is trained using a custom loss function designed to satisfy two competing objectives for each instance in a minibatch:
    - Mimic Predictions: The predictions of the adversarial model, $\hat{y}_a$, should be as close as possible to the base model's predictions, $\hat{y}_b$. This is measured by the Total Variation Distance (TVD).
    - Diverge Attention: The attention distribution of the adversarial model, $\alpha_a$, should be as far as possible from the base model's attention distribution, $\alpha_b$. This is measured by the Kullback-Leibler (KL) Divergence.
  - These two objectives are combined into a single loss function, $\mathcal{L}$, which is minimized during training:
$
\mathcal{L}(\mathcal{M}_a, \mathcal{M}_b)^{(i)} = \mathrm{TVD}\big(\hat{y}_a^{(i)}, \hat{y}_b^{(i)}\big) - \lambda \, \mathrm{KL}\big(\alpha_a^{(i)} \parallel \alpha_b^{(i)}\big)
$
- $\hat{y}_a^{(i)}, \hat{y}_b^{(i)}$: The prediction probability distributions for instance $i$ from the adversarial and base models, respectively.
- $\alpha_a^{(i)}, \alpha_b^{(i)}$: The attention weight distributions for instance $i$ from the adversarial and base models, respectively.
- $\mathrm{TVD}$: Total Variation Distance, which measures the difference between two probability distributions. Minimizing this term forces the prediction scores to be similar.
- $\mathrm{KL}$: Kullback-Leibler Divergence, which measures how one probability distribution diverges from a second. Because it is subtracted in the loss function, minimizing the overall loss actually means maximizing the KL divergence, pushing the attention distributions apart.
- $\lambda$: A hyperparameter that controls the trade-off. A small $\lambda$ prioritizes matching predictions, while a large $\lambda$ prioritizes finding distant attention distributions, even at the cost of less accurate prediction matching. By varying $\lambda$, the authors can trace a curve showing the trade-off between prediction similarity (low TVD) and attention divergence (high JSD, a metric related to KL).
This setup ensures that any adversarial attention distribution found is not an ad-hoc artifact but is generated by a coherent, parameterized model that has learned a consistent policy across all training data.
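A minimal sketch of this per-instance loss (assuming PyTorch; the inputs are assumed to be valid probability vectors, with the base model's outputs pre-computed and frozen, and a small epsilon guarding the logarithm). This illustrates the formula above rather than reproducing the authors' implementation:

```python
# Model-consistent adversarial loss sketch: L = TVD(y_a, y_b) - lambda * KL(alpha_a || alpha_b).
import torch

def adversarial_loss(y_a: torch.Tensor, y_b: torch.Tensor,
                     alpha_a: torch.Tensor, alpha_b: torch.Tensor,
                     lam: float, eps: float = 1e-12) -> torch.Tensor:
    """y_a, y_b: (batch, num_classes) prediction distributions (y_b, alpha_b come from the frozen base model).
    alpha_a, alpha_b: (batch, seq_len) attention distributions."""
    tvd = 0.5 * (y_a - y_b).abs().sum(dim=-1)                                     # keep predictions close
    kl = (alpha_a * ((alpha_a + eps).log() - (alpha_b + eps).log())).sum(dim=-1)  # push attention apart
    return (tvd - lam * kl).mean()                                                # minimized during training
```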
5. Experimental Setup
5.1. Datasets
The authors used six datasets for classification tasks, focusing on the binary classification subset from Jain and Wallace (2019) to ensure a fair comparison. All datasets are in English.
The following are the results from Table 1 of the original paper:
| Dataset | Avg. Length (tokens) | Train Size (neg/pos) | Test Size (neg/pos) |
|---|---|---|---|
| Diabetes | 1858 | 6381/1353 | 1295/319 |
| Anemia | 2188 | 1847/3251 | 460/802 |
| IMDb | 179 | 12500/12500 | 2184/2172 |
| SST | 19 | 3034/3321 | 863/862 |
| AgNews | 36 | 30000/30000 | 1900/1900 |
| 20News | 115 | 716/710 | 151/183 |
- Description of Datasets:
  - MIMIC-III (Diabetes, Anemia): Medical datasets where the task is to predict a diagnosis (e.g., diabetes) from lengthy ICU discharge summaries. These feature very long documents.
  - IMDb: A sentiment analysis dataset of movie reviews.
  - Stanford Sentiment Treebank (SST): A sentiment analysis dataset of single sentences.
  - AG News & 20 Newsgroups: Topic classification datasets. For AG News, the task is to distinguish "world" from "business" news. For 20 Newsgroups, it is "hockey" vs. "baseball."

These datasets were chosen to replicate and directly challenge the findings of Jain and Wallace (2019). The variation in average document length (from 19 tokens in SST to over 2000 in Anemia) provides a diverse set of conditions for testing the methods.
5.2. Evaluation Metrics
The paper uses several metrics to compare model performance, prediction outputs, and attention distributions.
5.2.1. F1 Score
- Conceptual Definition: The F1 score is a measure of a test's accuracy. It is the harmonic mean of precision and recall. It is commonly used in binary classification when there is an uneven class distribution. A high F1 score indicates that the model has both low false positives (high precision) and low false negatives (high recall). The paper reports the F1 score on the positive class.
- Mathematical Formula: $ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
- Symbol Explanation:
  - Precision = $\frac{TP}{TP + FP}$ (Of all the instances the model predicted as positive, how many were actually positive?)
  - Recall = $\frac{TP}{TP + FN}$ (Of all the actual positive instances, how many did the model correctly identify?)
5.2.2. Total Variation Distance (TVD)
- Conceptual Definition: TVD measures the largest possible difference between the probabilities that two probability distributions assign to the same event. In this paper, it is used to measure the difference between the prediction outputs (which are probability distributions over classes) of two models, e.g., the `base` model and an `adversarial` model. A lower TVD means the predictions are more similar.
- Mathematical Formula: $ \mathrm{TVD}\big(\hat{y}_1, \hat{y}_2\big) = \frac{1}{2} \sum_{i=1}^{|\mathcal{V}|} \left| \hat{y}_{1i} - \hat{y}_{2i} \right| $
- Symbol Explanation:
  - $\hat{y}_1, \hat{y}_2$: Two probability distributions over the set of classes $\mathcal{V}$.
  - $|\mathcal{V}|$: The number of classes (e.g., 2 for binary classification).
  - $\hat{y}_{1i}, \hat{y}_{2i}$: The probability assigned to the $i$-th class by each distribution.
5.2.3. Jensen-Shannon Divergence (JSD)
- Conceptual Definition: JSD is a method for measuring the similarity between two probability distributions. It is a symmetrized and smoothed version of the Kullback-Leibler (KL) divergence. Unlike KL divergence, JSD is symmetric ($\operatorname{JSD}(\alpha_1, \alpha_2) = \operatorname{JSD}(\alpha_2, \alpha_1)$) and always has a finite value. It is used in this paper to compare two attention distributions, $\alpha_1$ and $\alpha_2$. A higher JSD means the attention distributions are more different.
- Mathematical Formula: $ \operatorname{JSD}(\alpha_1, \alpha_2) = \frac{1}{2}\operatorname{KL}[\alpha_1 \parallel \bar{\alpha}] + \frac{1}{2}\operatorname{KL}[\alpha_2 \parallel \bar{\alpha}] $
- Symbol Explanation:
  - $\alpha_1, \alpha_2$: The two attention distributions being compared.
  - $\bar{\alpha} = \frac{1}{2}(\alpha_1 + \alpha_2)$: The average of the two distributions.
  - $\operatorname{KL}[\alpha_i \parallel \bar{\alpha}]$: The Kullback-Leibler divergence between $\alpha_i$ and the average distribution $\bar{\alpha}$.
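A minimal NumPy sketch of these comparison metrics as defined above (illustrative helper names; assumes the inputs are already valid probability vectors):

```python
# NumPy sketch of the distribution-comparison metrics used in the paper's analysis.
import numpy as np

def tvd(p: np.ndarray, q: np.ndarray) -> float:
    """Total Variation Distance between two probability vectors."""
    return 0.5 * float(np.abs(p - q).sum())

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence KL(p || q), base 2, with a small epsilon for numerical safety."""
    return float(np.sum(p * (np.log2(p + eps) - np.log2(q + eps))))

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon Divergence: symmetric average of the KLs to the midpoint distribution."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```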
5.3. Baselines
The primary baselines for comparison are:
- `Base Model`: A standard Bi-LSTM model with a learned additive attention mechanism, trained for each task. Its performance and attention distributions serve as the reference point.
- Jain and Wallace (2019) Results: The paper constantly compares its findings to those of Jain and Wallace, both by reproducing their results and by showing how its own more constrained methodology leads to different conclusions.
- `Uniform Model`: A model with the same architecture as the `base model` but with its attention mechanism fixed to a uniform distribution. This is used in the first experiment (§3.2) to test if learned attention is even necessary.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents its results through the lens of the four proposed experiments, methodically building its case against the strong claim that "Attention is not Explanation."
6.1.1. Experiment 1: Is Attention Necessary? (Uniform Baseline)
The authors first conduct a sanity check to see if learned attention is actually useful on the selected datasets.
The following are the results from Table 2 of the original paper:
| Dataset | Attention (Base): Reported | Attention (Base): Reproduced | Uniform |
|---|---|---|---|
| Diabetes | 0.79 | 0.775 | 0.706 |
| Anemia | 0.92 | 0.938 | 0.899 |
| IMDb | 0.88 | 0.902 | 0.879 |
| SST | 0.81 | 0.831 | 0.822 |
| AgNews | 0.96 | 0.964 | 0.960 |
| 20News | 0.94 | 0.942 | 0.934 |
Analysis:
- The Reported and Reproduced columns show that the authors successfully reproduced the results of the base attention model from Jain and Wallace.
- For the medical datasets (Diabetes, Anemia), there is a noticeable drop in F1 score when using uniform attention, suggesting that learned attention is indeed helping the model focus on relevant parts of the long clinical notes.
- However, for SST, AgNews, and 20News, the performance of the `Uniform` model is almost identical to the `Attention (Base)` model.
- Conclusion: The authors rightly conclude that for datasets like AgNews and 20News, attention is not providing a significant benefit. Therefore, debating whether attention is an explanation on these tasks is moot: "attention is not explanation if you don't need it." Consequently, they drop these two datasets from further analysis but keep the borderline SST.
6.1.2. Experiment 2: How Much Variance is Normal?
The paper investigates the natural variance in attention distributions caused by different random initializations. The violin plots in Figure 3 visualize this.
The following figure (Figure 3 from the original paper) shows the densities of JS divergences for models trained with different random seeds compared to the adversarial setup of Jain and Wallace.

Analysis:
- Left Plots: Random Seed Variance. These plots show the JSD between a base model and other models trained with different random seeds.
  - For SST, the violin plot is heavily concentrated on the left (low JSD), indicating that different training runs converge to very similar attention distributions. The model is stable.
  - For Diabetes, the plot is much wider, especially for the negative class (blue), showing that there is already significant natural variance in the attention distributions for this task.
- Right Plots: Jain & Wallace Adversary. These plots show the JSD from the unconstrained adversarial search. The distributions are pushed far to the right (high JSD).
- Key Insight: By comparing the left and right plots, we can contextualize the adversarial results. For SST, finding an adversary that is "different" is more meaningful because the baseline variance is low. For Diabetes, the high JSD of the adversary is less surprising, given that the model already exhibits high variance naturally. This highlights the importance of establishing a baseline for variance.
6.1.3. Experiment 3: Can Attention Weights Guide a Simpler Model?
This experiment tests the intrinsic value of learned attention weights by using them to guide a simple MLP.
The following are the results from Table 3 of the original paper:
| Guide weights | Diab. | Anemia | SST | IMDb |
|---|---|---|---|---|
| UNIFORM | 0.404 | 0.873 | 0.812 | 0.863 |
| Trained MLP | 0.699 | 0.920 | 0.817 | 0.888 |
| Base LSTM | 0.753 | 0.931 | 0.824 | 0.905 |
| ADVERSARY (4) | 0.503 | 0.932 | 0.592 | 0.700 |
Analysis:
- `Base LSTM` is consistently the best. Across all four datasets, using the frozen attention weights from the original `Base LSTM` model as guides allows the simple MLP to achieve the highest F1 score. It performs better than both the `UNIFORM` baseline and, more importantly, the `Trained MLP` where the simple model learns its own weights. This is strong evidence that the contextual attention weights from the LSTM capture a meaningful, transferable signal about token importance.
- `ADVERSARY` weights are poor guides. The adversarially-generated attention weights (from Experiment 4) perform very poorly when used to guide the MLP (except on Anemia, where performance is high across the board). For SST and IMDb, the performance plummets.
- Conclusion: This is a crucial result. It shows that while an adversarial distribution might produce a similar final prediction within the original complex model, it does not possess the same underlying explanatory quality. It fails this diagnostic test, suggesting it is a "brittle" solution, whereas the `Base LSTM` attention is more robust and meaningful.
6.1.4. Experiment 4: What Does a Consistent Adversary Look Like?
This experiment analyzes the trade-off between prediction similarity (TVD) and attention divergence (JSD) using the paper's model-consistent adversarial training.
The following figure (Figure 5 from the original paper) plots this trade-off.

Analysis:
- Trade-off Curves (Dotted Lines): The curves show that as the hyperparameter $\lambda$ is varied, the adversarial models can achieve higher JSD (more different attention) at the cost of higher TVD (less similar predictions). The convex shape for most datasets suggests that it is relatively easy to get a moderate amount of attention divergence without significantly harming prediction similarity, lending some support to the "attention is manipulable" argument.
- SST is an Outlier: The curve for SST is unusual. It shows that attention can be changed dramatically with almost no impact on prediction TVD. This confirms SST as a task where attention is less constrained.
- Comparison to Jain & Wallace (`+` symbol): The single point `+` represents the result from Jain and Wallace's unconstrained method. For every dataset, it lies far below the curve of the model-consistent adversaries. This means the J&W method finds adversarial distributions that are both more different (higher JSD) and produce more similar predictions (lower TVD) than what is possible for a coherently trained model. This visually demonstrates the authors' central critique: the unconstrained, per-instance search of J&W has an unrealistic degree of freedom and exaggerates the ease with which faithful-looking adversaries can be found.

The paper provides a concrete example of the different attention maps produced. The following figure (Figure 2 from the original paper) shows attention maps for an IMDb instance.

Analysis:
- The `Base model` focuses its attention on intuitive sentiment words like "beautiful", "wonderful", and "perfectly".
- The authors' `Consistent Adversary` model manages to shift attention away from these primary words but still distributes it across other plausible words like "performances", "pathos", and "humor". The explanation has changed, but it is still somewhat distributed.
- Jain & Wallace's `Inconsistent Adversary` simply shifts almost the entire attention mass to a single, arbitrary, and non-informative word ("it"), a solution that would never be learned by a consistent model. This starkly illustrates the difference between a plausible alternative explanation and an artificial one.
6.2. Data Presentation (Tables)
The full results for the adversarial setup are presented in Appendix B, Table 5, and have been analyzed in the context of Figure 5 above. These tables systematically show how varying the hyperparameter $\lambda$ affects the F1 score, TVD, and JSD for each dataset, providing a comprehensive view of the adversarial trade-off.
6.3. Ablation Studies / Parameter Analysis
The entire paper can be viewed as a series of carefully designed ablation and parameter analysis studies.
- The comparison between the `Attention (Base)` model and the `Uniform` model is an ablation study on the necessity of the learned attention mechanism.
- The main parameter analysis is the sweeping of the hyperparameter $\lambda$ in the adversarial training protocol (Section 4). The results, plotted in Figure 5 and detailed in Table 5, directly analyze how this key parameter affects the model's ability to find divergent yet accurate adversarial distributions. This analysis is central to the paper's argument about the trade-offs involved in creating alternative explanations.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper concludes that the claim "Attention is not Explanation" is too strong and is based on a flawed experimental premise. The authors do not claim that "Attention is Explanation" in all cases, but rather that its utility for explainability is nuanced and task-dependent. Their "Attention is not not Explanation" stance is a call for a more careful and rigorous approach.
The main contributions are:
- A successful critique of the unconstrained, per-instance counterfactual method used by prior work.
- A suite of four practical diagnostic tests that researchers can use to evaluate the quality of attention mechanisms in their own models.
- Empirical evidence showing that while model-consistent adversarial attention distributions can be found, they are less extreme than previously suggested and, critically, they fail a simple diagnostic test of meaningfulness, suggesting they are not truly equivalent explanations.
Ultimately, the paper argues that trained attention mechanisms in RNNs often do learn a meaningful and transferable signal about the relationship between inputs and outputs, which cannot be easily "hacked" or dismissed.
7.2. Limitations & Future Work
The authors acknowledge several directions for future work:
- Broader Task Application: The experiments were limited to classification tasks with LSTM models. Future work could apply these diagnostic methods to other popular architectures (like Transformers) and other tasks such as sequence modeling, question answering (QA), and natural language inference (NLI).
- Multilingual Extension: The study was conducted exclusively on English datasets. The dynamics of attention could differ in languages with different morphological or syntactic properties.
- Human Evaluation: The proposed methods are "functionally-grounded" evaluations (i.e., using proxy tasks). The authors suggest adding human evaluation to see how the quantitative measures of attention "goodness" align with human judgments of what constitutes a plausible explanation.
- Theoretical Analysis: They express hope that their work will motivate the development of analytical methods to predict the usefulness of attention as an explanation based on properties of the dataset and model, without needing to run extensive experiments.
7.3. Personal Insights & Critique
This paper is an exemplary piece of scientific discourse. It responds to a strong claim not with an equally strong opposing claim, but with a meticulous deconstruction of the original argument's methodology and the introduction of a more rigorous, better-principled framework for investigation.
Key Insights:
- The Importance of Model Consistency: The paper's most powerful contribution is its insistence on model-consistent counterfactuals. The idea that an "explanation" must be something a model could plausibly generate is a profound and important principle for the field of XAI. It guards against finding clever but meaningless artifacts.
- Explainability is Not a Monolith: The paper does an excellent job of navigating the philosophical definitions of explainability, interpretability, and faithfulness. It correctly points out that the value of an explanation depends on the goal—a plausible story might be sufficient for building user trust, even if it's not a perfectly faithful account of the model's internals.
- Practical Tools for Researchers: The four proposed tests are not just theoretical constructs; they are practical tools that any researcher using attention can (and should) consider applying to their own work to validate their claims about interpretability. The diagnostic MLP framework, in particular, is a clever and effective way to test for transferable knowledge.
Potential Issues and Critique:
- Focus on LSTMs: While appropriate for a direct rebuttal in 2019, the field was already rapidly shifting to Transformer-based models. The dynamics of self-attention, multi-head attention, and layer-to-layer attention propagation in Transformers are far more complex than in the Bi-LSTM architecture studied here. While the principles likely transfer, the direct applicability of the methods would need to be re-validated and possibly adapted for Transformers.
- The Definition of "Meaningful": The diagnostic MLP test is a strong proxy for "meaningfulness," but it is still a proxy. It equates transferability to a non-contextual model with inherent importance. While compelling, this link is not absolute. An attention pattern could be meaningful only in the context of the recurrent dynamics of the original LSTM, and its failure to transfer would not necessarily render it meaningless in its original context.
- Scalability of Adversarial Training: The model-consistent adversarial training is computationally more expensive than the per-instance search of Jain and Wallace, as it requires training a full model for each point on the trade-off curve. This could be a barrier to its widespread adoption.
Overall, "Attention is not not Explanation" is a landmark paper in the XAI-for-NLP debate. It successfully tempered the community's swing toward complete skepticism of attention and provided a much-needed dose of methodological rigor, shifting the conversation from a binary "yes/no" to a more productive "when and how."