
Memformer: A Memory-Augmented Transformer for Sequence Modeling

Published: October 14, 2020
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Memformer is an augmented Transformer that addresses efficiency issues in long sequence modeling by using external dynamic memory. It achieves linear time complexity and constant memory space complexity while reducing memory usage by 8.1x and improving speed by 3.2x during infere

Abstract

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling, that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement. Experimental results show that Memformer has achieved comparable performance compared to the baselines by using 8.1x less memory space and 3.2x faster on inference. Analysis of the attention pattern shows that our external memory slots can encode and retain important information through timesteps.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The paper is titled "Memformer: A Memory-Augmented Transformer for Sequence Modeling". Its central topic is an efficient neural network model named Memformer, a Transformer augmented with an external dynamic memory system for sequence modeling tasks.

1.2. Authors

The authors of the paper are:

  • Qingyang Wu (University of California, Davis)

  • Zhenzhong Lan (Westlake University)

  • Kun Qian (University of California, Davis)

  • Jing Gu (University of California, Davis)

  • Alborz Geramifard (Facebook)

  • Zhou Yu (University of California, Davis)

    Their research backgrounds appear to be in artificial intelligence, deep learning, and natural language processing, given their affiliations with universities and a major tech company like Facebook.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for scientific papers. While arXiv is a highly influential platform for disseminating research quickly, it is not a peer-reviewed journal or conference in itself. Papers submitted to arXiv often undergo peer review and are subsequently published in a formal venue. The specific formal publication venue for this paper is not explicitly stated in the provided abstract, but arXiv serves as a widely recognized archive for preprints in fields like computer science, mathematics, and physics.

1.4. Publication Year

The paper was published on October 14, 2020.

1.5. Abstract

The paper addresses the efficiency issues of Transformers in sequence modeling, particularly their need to store all history token-level representations, leading to high memory and computational costs. It introduces Memformer, an efficient neural network that employs an external dynamic memory to encode and retrieve past information. Memformer achieves linear time complexity and constant memory space complexity for long sequences. The authors also propose a novel optimization technique called memory replay back-propagation (MRBP), designed to facilitate long-range back-propagation through time (BPTT) with significantly reduced memory requirements. Experimental results demonstrate that Memformer achieves comparable performance to baseline models while using 8.1x less memory space and being 3.2x faster during inference. Furthermore, analysis of the attention patterns reveals that the external memory slots can effectively encode and retain important information across timesteps.

2. Executive Summary

2.1. Background & Motivation

The paper tackles a critical limitation of the highly successful Transformer models in sequence modeling: their inefficiency when processing very long sequences.

  • Core Problem: Transformers compute self-attention over all tokens in a sequence, leading to a quadratic time complexity ($\mathcal{O}(N^2)$) with respect to the sequence length $N$. More importantly, for generative tasks or processing long contexts, these models need to store all previous token-level representations as "memory," resulting in a linear memory space complexity ($\mathcal{O}(N)$) that quickly becomes prohibitive for truly long sequences. This prevents Transformers from effectively modeling context beyond a fixed-length window due to memory and computational bottlenecks.
  • Importance: Efficiently handling long sequences is crucial for many applications, including long-document understanding, conversational AI, and advanced generative models. Prior attempts like Transformer-XL and Compressive Transformer improve context handling but still rely on caching raw hidden states, which requires large memory sizes and has a theoretical maximum temporal range, meaning information from the very distant past is eventually discarded.
  • Innovative Idea: The paper's innovative idea is to re-introduce the concept of a fixed-size, external dynamic memory, reminiscent of older memory networks like Neural Turing Machines (NTMs), but integrate it with the modern Transformer architecture. This external memory is designed to actively encode and retrieve high-level compressed information from the past, rather than just storing raw token representations. This allows for a theoretically infinite temporal range of memorization and ensures constant memory consumption during inference, regardless of sequence length. Furthermore, to address the training challenges of such memory-augmented recurrent models, a new optimization scheme, Memory Replay Back-Propagation (MRBP), is proposed.

2.2. Main Contributions / Findings

The Memformer paper makes several significant contributions:

  • Novel Model Architecture (Memformer): The paper proposes Memformer, an efficient memory-augmented Transformer that utilizes a fixed-size external dynamic memory. This memory interacts with the Transformer through dedicated memory reading and memory writing modules. This design enables the model to process long sequences with linear time complexity and constant memory space complexity during inference, fundamentally addressing the efficiency bottlenecks of standard Transformers.
  • Dynamic Memory Management: Memformer introduces a sophisticated dynamic memory management system. This includes:
    • Memory Reading via cross-attention to retrieve relevant past information.
    • Memory Writing with slot attention to update memory information, focusing on preserving relevant data and updating with new context.
    • A Biased Memory Normalization (BMN) mechanism for forgetting, which helps filter out trivial information and maintain gradient stability by normalizing memory slots and incorporating a learnable bias vector.
  • Efficient Training Algorithm (Memory Replay Back-Propagation - MRBP): To enable training Memformer with its large memory representations over long unrolls, the paper introduces MRBP. This new optimization scheme is a memory-efficient variant of gradient checkpointing, specifically designed to significantly reduce the memory cost of back-propagation through time (BPTT) by replaying memory at each timestep, traversing only the critical path during the forward pass.
  • Strong Experimental Validation: Memformer demonstrates comparable performance to Transformer-XL on autoregressive image generation (MNIST) and language modeling (WikiText-103) benchmarks, while achieving substantial efficiency gains: 8.1x less memory space and 3.2x faster inference.
  • Interpretability of Memory: Through attention pattern analysis, the paper provides insights into how the external memory slots function, showing that they can effectively encode and retain important information over extended periods, categorizing memory slots into types based on their update behavior (e.g., retaining old info, aggregating new info, rapidly changing info).

3. Prerequisite Knowledge & Related Work

This section lays the groundwork for understanding Memformer by explaining fundamental concepts and summarizing prior research in efficient Transformers and memory networks.

3.1. Foundational Concepts

3.1.1. Neural Networks (NNs)

A neural network is a computational model inspired by the structure and function of biological neural networks. It consists of layers of interconnected "neurons" (nodes) that process information. Each connection has a weight, and each neuron has an activation function. NNs learn by adjusting these weights based on input data to minimize a loss function, enabling them to recognize patterns, classify data, and make predictions.

3.1.2. Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a class of neural networks designed for processing sequential data, where the output from the previous step is fed as input to the current step. This "recurrent" connection allows RNNs to maintain an internal state or "memory" that captures information about past elements in the sequence.

  • Memory Bottleneck: While RNNs can process sequences, their internal hidden state, which serves as a compressed memory, often struggles to preserve information over very long sequences due to a phenomenon called the "vanishing/exploding gradient problem." This makes it hard for standard RNNs to learn long-term dependencies.
  • Long Short-Term Memory (LSTM): LSTMs (Hochreiter and Schmidhuber, 1997) are a special type of RNN designed to overcome the vanishing gradient problem. They introduce "gates" (input, output, and forget gates) that regulate the flow of information into and out of a special cell state, allowing them to selectively remember or forget information over long periods. The forget gate is particularly relevant to Memformer as it directly inspires its own forgetting mechanism.
  • Gated Recurrent Unit (GRU): GRUs (Chung et al., 2014) are a simpler variant of LSTMs that combine the input and forget gates into a single "update gate" and merge the cell state and hidden state. They offer comparable performance to LSTMs with fewer parameters.

3.1.3. Attention Mechanism

The attention mechanism (Bahdanau et al., 2015) allows a neural network to focus on specific parts of its input sequence when making a prediction. Instead of processing the entire input uniformly, attention dynamically assigns different weights to different parts of the input, indicating their relevance.

  • Self-Attention: Self-attention (Vaswani et al., 2017) is a specific type of attention mechanism where the attention is computed on the same input sequence. It allows each token in a sequence to attend to all other tokens in the same sequence to compute a weighted representation.
    • Query (Q), Key (K), Value (V): In self-attention, for each token, three vectors are derived: a Query vector (what am I looking for?), a Key vector (what do I have?), and a Value vector (what information do I carry?).
    • Attention Calculation: The attention score between a query and a key indicates how relevant the key is to the query. These scores are typically computed as a dot product, scaled, and then passed through a softmax function to get attention weights. These weights are then used to create a weighted sum of the Value vectors.
    • The standard self-attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ is the matrix of queries (shape: sequence_length × $d_k$).
      • $K$ is the matrix of keys (shape: sequence_length × $d_k$).
      • $V$ is the matrix of values (shape: sequence_length × $d_v$).
      • $d_k$ is the dimension of the key vectors (used for scaling to prevent large dot products from pushing softmax into regions with tiny gradients).
      • $QK^T$ is the dot-product similarity between queries and keys.
      • $\mathrm{softmax}$ normalizes the scores into probability distributions.
      • The result is a weighted sum of values, representing the context-aware representation for each token.
  • Multi-Head Attention (MHAttn): Multi-Head Attention extends self-attention by running the attention mechanism multiple times in parallel with different learned linear projections (heads). The outputs from these multiple attention heads are then concatenated and linearly transformed to produce the final result. This allows the model to jointly attend to information from different representation subspaces at different positions, enriching its understanding.
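
To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the tensor shapes and the `nn.MultiheadAttention` reference are illustrative, not code from the paper:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (batch, n_q, d_k), K: (batch, n_k, d_k), V: (batch, n_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, n_q, n_k)
    weights = F.softmax(scores, dim=-1)                # attention distribution over keys
    return weights @ V                                 # weighted sum of values

# Multi-head attention runs this in parallel over several heads with separate learned
# projections, e.g. torch.nn.MultiheadAttention(embed_dim=512, num_heads=8).
```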

3.1.4. Transformer

The Transformer (Vaswani et al., 2017) is a neural network architecture that relies entirely on attention mechanisms (specifically self-attention) to process sequences, discarding traditional recurrent or convolutional layers.

  • Encoder-Decoder Architecture: A Transformer typically consists of an encoder and a decoder.
    • Encoder: Processes the input sequence, producing a sequence of context-aware representations. It's composed of multiple identical layers, each containing a multi-head self-attention mechanism and a feed-forward network.
    • Decoder: Generates the output sequence, one token at a time, based on the encoder's output and previously generated tokens. It also has multiple identical layers, each containing a masked multi-head self-attention (to prevent attending to future tokens), a multi-head cross-attention (to attend to the encoder's output), and a feed-forward network.
  • Positional Encoding: Since Transformers lack recurrence or convolution, they have no inherent sense of token order. Positional encodings (fixed or learned vectors added to input embeddings) are used to inject information about the relative or absolute position of tokens in the sequence.
  • Efficiency Issues: While powerful, Transformers have two main efficiency challenges for very long sequences:
    1. Quadratic Computation: The self-attention mechanism computes attention scores for all pairs of tokens, leading to $\mathcal{O}(N^2)$ computational complexity, where $N$ is the sequence length.
    2. Memory Consumption: The matrix of pairwise attention scores computed during self-attention requires $\mathcal{O}(N^2)$ memory during training with back-propagation. For autoregressive decoding, storing past hidden states (keys and values) also scales linearly with the context length.

3.1.5. Back-Propagation Through Time (BPTT)

Back-propagation through time (BPTT) is the standard algorithm used to train recurrent neural networks (RNNs) and other models with recurrent connections. It works by unrolling the recurrent neural network into a feed-forward network over a certain number of timesteps. The gradients are then computed using standard back-propagation through this unrolled network, effectively propagating gradients backward through time.

  • Memory Cost: A major issue with BPTT, especially for models that maintain large internal states or external memories over many timesteps, is its significant memory consumption. It requires storing all intermediate activations for every timestep in the unrolled computation graph during the forward pass so that they can be used during the backward pass to compute gradients. For long sequences and complex models, this memory cost can become prohibitive.

3.1.6. Gradient Checkpointing

Gradient checkpointing (Chen et al., 2016), also known as "activation checkpointing," is an optimization technique designed to reduce the memory footprint of back-propagation in deep neural networks. Instead of storing all intermediate activations during the forward pass, gradient checkpointing strategically saves only a subset of activations. During the backward pass, the unsaved activations are recomputed on demand from the nearest saved checkpoint. This trades off increased computation (due to recomputation) for significantly reduced memory usage, making it possible to train much larger models or models with longer sequences.
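
As a minimal sketch using PyTorch's standard `torch.utils.checkpoint` utility (a general technique, not code specific to this paper):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be kept for the backward pass.
layers = [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()) for _ in range(24)]
model = torch.nn.Sequential(*layers)
x = torch.randn(8, 512, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations are stored,
# and the activations inside each segment are recomputed during the backward pass.
y = checkpoint_sequential(model, 4, x)
y.sum().backward()
```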

3.2. Previous Works

The paper contextualizes Memformer by discussing several lines of research aimed at addressing the Transformer's limitations, particularly its quadratic cost and context length issues.

3.2.1. Sparse Attention

This direction modifies the self-attention mechanism to compute attention over only a subset of tokens, rather than all pairs.

  • Sparse Transformer (Child et al., 2019): Uses a block sparse attention pattern to reduce complexity to $\mathcal{O}(N\sqrt{N})$. It aims to cover all past tokens for sequence generation.
  • Longformer (Beltagy et al., 2020) and Big Bird (Zaheer et al., 2020): Further explored sparser patterns, achieving $\mathcal{O}(N)$ complexity. They typically introduce global tokens to summarize sequence information and limit attention to local windows and these global tokens.
  • Limitations for Autoregressive Decoding: The paper highlights that while linear sparse attention is sound for bidirectional encoders, it faces challenges in autoregressive decoders. In autoregressive generation, a token can only see past tokens, and global tokens cannot "leak" information to future tokens. This means linear sparse attention might not guarantee a token sees all relevant past tokens, making Sparse Transformer's $\mathcal{O}(N\sqrt{N})$ approach (which theoretically covers past tokens) more suitable for generation among these methods.

3.2.2. Linear Attention

This approach modifies the softmax operation in self-attention to achieve linear complexity.

  • Linformer (Wang et al., 2020): Reduces complexity to $\mathcal{O}(N)$ by projecting the entire sequence of keys and values to a constant size. The paper notes it hasn't been widely applied to autoregressive decoding.
  • Performer (Choromanski et al., 2020) and Linear Transformer (Katharopoulos et al., 2020): Replace softmax with a linear dot-product of kernel feature maps.
  • Limitations for Autoregressive Decoding: For Linear Transformer in an autoregressive setting, it requires cumulative summation to aggregate historical information. This can lead to numerical instability (overflow) and gradient issues after many steps due to large accumulating values, especially for very long and variable-length sequences.

3.2.3. Recurrence and Memory in Transformers

This direction explicitly incorporates recurrence and memory into Transformer architectures.

  • Transformer-XL (Dai et al., 2019): Re-introduces segment-level recurrence and relative positional encoding. It caches the hidden states of previous segments and reuses them as "memory" for subsequent segment processing. This allows the model to attend to context beyond the current segment length.
  • Compressive Transformer (Rae et al., 2020): Extends Transformer-XL by further compressing these cached hidden states into fewer vectors using a compression network, aiming for an even longer context.
  • Limitations: Both Transformer-XL and Compressive Transformer use past hidden states as memory. While effective, this constitutes a "raw" memory that still has a theoretical maximum temporal range, meaning information from the distant past is eventually forgotten as new hidden states push older ones out of the fixed-size cache. They also require large memory sizes in practice for good performance.

3.2.4. Dynamic Memorization

This refers to memory networks that actively control external memory resources.

  • Neural Turing Machine (NTM) (Graves et al., 2014) and Differential Neural Computer (DNC) (Graves et al., 2016): Early models that employed an external, addressable memory matrix. They feature sophisticated mechanisms for reading from and writing to this memory, allowing them to store and retrieve information over very long durations without a theoretical upper bound.
  • Limitations: The paper points out that their complex memory interaction mechanisms made them slow and often unstable to train, hindering their widespread adoption for downstream tasks.

3.3. Technological Evolution

The evolution of sequence modeling has moved from simple RNNs with limited long-term memory to more sophisticated LSTMs/GRUs, then to attention-based Transformers that revolutionized sequence processing but introduced quadratic scaling issues. Subsequent research branched into making Transformers more efficient (sparse/linear attention) or augmenting them with explicit memory (Transformer-XL, Compressive Transformer). Memformer sits at the intersection of Transformers and dynamic memory networks. It seeks to combine the strengths of Transformers (powerful attention) with the long-term memory capabilities of NTM/DNC-like systems, while addressing their respective weaknesses (quadratic cost for Transformers, complexity/instability for NTM/DNC). It evolves the memory concept from raw hidden state caching (Transformer-XL) to more abstract, dynamic memory slots.

3.4. Differentiation Analysis

Memformer differentiates itself from related works primarily in its approach to memory and efficiency:

  • Vs. Standard Transformers: Memformer directly addresses the $\mathcal{O}(N^2)$ computation and memory complexity of standard Transformers by operating at a segment level with a fixed-size external memory. This results in linear computation and constant memory during inference, unlike the vanilla Transformer.
  • Vs. Sparse/Linear Attention: While sparse and linear attention methods also aim for linear complexity, Memformer offers a different paradigm. Sparse/linear attention modifies the internal self-attention mechanism itself. Memformer, instead, augments a standard Transformer with an external memory. This allows it to achieve a theoretically infinite temporal range of memorization, which sparse/linear attention often struggles with in autoregressive settings (e.g., Linear Transformer's numerical instability or Longformer's global token limitations for decoders).
  • Vs. Transformer-XL/Compressive Transformer: This is the most direct comparison.
    • Memory Representation: Transformer-XL and Compressive Transformer cache raw hidden states. Memformer uses $k$ fixed-size external dynamic memory slots that store high-level, compressed representations. This is a crucial difference; Memformer's memory is designed to be more abstract and persistent.
    • Temporal Range: Because Transformer-XL and Compressive Transformer rely on a fixed-size cache of hidden states, they have a theoretical maximum temporal range of context. Older information is eventually pushed out. Memformer, with its dynamic memory slots, claims a theoretically infinite temporal range of memorization, as information can be actively encoded and retained within the slots across timesteps without being directly tied to a fixed-length window of raw states.
    • Memory Efficiency: Memformer is significantly more memory-efficient. Transformer-XL stores hidden states for all layers, leading to an $\mathcal{O}(K \times L)$ memory cost (where $K$ is the memory size and $L$ is the number of layers). Memformer stores only $K$ vectors, costing $\mathcal{O}(K)$.
  • Vs. NTM/DNC: Memformer shares the concept of external dynamic memory with NTM and DNC but aims for greater efficiency and stability. It uses a simpler cross-attention mechanism for reading and a specialized slot attention with biased memory normalization for writing, avoiding the more complex and often unstable memory addressing schemes of NTM/DNC.
  • Novel Optimization Scheme: Memformer introduces Memory Replay Back-Propagation (MRBP) to specifically address the memory cost of BPTT for its large memory representations during training, offering a more efficient alternative to standard gradient checkpointing.

4. Methodology

This section details the technical solution proposed in Memformer, breaking down its architecture, memory interaction mechanisms, and training algorithm.

4.1. Principles

The core idea behind Memformer is to enable efficient processing of long sequences by decoupling the Transformer's attention mechanism from the full historical context. Instead of attending over all past tokens, Memformer processes sequences in segments and leverages a fixed-size external dynamic memory to summarize and retain crucial information from the entire past. This memory is dynamically updated and queried using attention mechanisms, allowing the model to have a theoretically infinite temporal range of context while maintaining linear time complexity and constant memory space complexity during inference. A novel training scheme, Memory Replay Back-Propagation (MRBP), is introduced to manage the memory demands of back-propagation through time in this recurrent setup.

4.2. Core Methodology In-depth

4.2.1. Segment-level Sequence Modeling

Memformer processes long sequences by dividing them into segments. This is a common strategy in models designed for long contexts.

First, a standard language model learns the joint probability of a sequence $x = \{x_1, x_2, \ldots, x_N\}$ by factoring it into a product of conditional probabilities: $ P(x) = \prod_t P(x_t | x_{<t}) $, where $x_{<t}$ represents all tokens preceding $x_t$.

For very long sequences, interacting with an external memory for every single token is inefficient. Memformer addresses this by processing the sequence at a segment level. The entire sequence is split into $T$ segments, where each segment $s_t$ contains $L$ tokens: $s_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,L}\}$.

The model then operates with an encoder-decoder architecture, similar to a standard Transformer, but with a crucial memory interaction:

  • Encoder's Role: The Transformer encoder is responsible for processing the current segment $s_t$. Critically, it also interacts with the memory. It takes the current segment $s_t$ and the memory from the previous timestep, $M_{t-1}$, as input. Its primary role is to encode the information from $s_t$ and update the memory to produce $M_t$, while also retrieving past information from $M_{t-1}$: $ M_t = \mathrm{Encoder}(s_t, M_{t-1}) $
  • Decoder's Role: The Transformer decoder then uses the updated memory (or the memory from the previous timestep, $M_{t-1}$, as indicated in the paper's formula for generation) and the current partial segment to predict the probabilities of the tokens in the segment. Specifically, to predict the current segment $s_t$ given past segments $s_{<t}$, the decoder predicts each token $x_{t,n}$ conditioned on its preceding tokens within the segment, $x_{t,<n}$, and the memory representing all past segments, $M_{t-1}$: $ P(s_t | s_{<t}) = \prod_{n=1}^{L} P(x_{t,n} | x_{t,<n}, M_{t-1}) $
  • Overall Sequence Probability: By autoregressively generating segment by segment, the model can compute the joint probability of the entire long sequence $x$: $ P(x) = \prod_{t=1}^{T} P_{\mathrm{Model}}(s_t | s_{<t}) $, where $P_{\mathrm{Model}}(s_t | s_{<t})$ is the probability of segment $s_t$ conditioned on all previous segments, which is implicitly handled through the memory $M_{t-1}$.

This segment-level processing allows the model to autoregressively generate all token segments in a sequence, effectively modeling an entire long sequence.
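
The sketch below illustrates this segment-level loop; the `encoder`, `decoder`, and tensor shapes are hypothetical placeholders rather than the paper's implementation:

```python
import torch

def segment_level_log_likelihood(tokens, encoder, decoder, memory, segment_len):
    """tokens: (seq_len,) LongTensor; memory: (k, d_model) tensor of memory slots.
    Assumes decoder(segment, memory) returns per-token log-probabilities over the vocabulary
    and encoder(segment, memory) returns the updated memory M_t."""
    total_log_prob = 0.0
    for start in range(0, tokens.size(0), segment_len):
        segment = tokens[start:start + segment_len]
        # Predict the current segment conditioned on the *previous* memory M_{t-1}.
        log_probs = decoder(segment, memory)                        # (segment_len, vocab)
        total_log_prob += log_probs.gather(-1, segment.unsqueeze(-1)).sum()
        # Encode the segment and the previous memory to produce the updated memory M_t.
        memory = encoder(segment, memory)
    return total_log_prob, memory
```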

The overall architecture is depicted in Figure 1, which shows the Transformer encoder interacting with memory via a Memory Reader and Memory Writer, and the Transformer decoder using self-attention and cross-attention with the encoder's output (which implicitly carries memory information) to generate the next tokens.

The following figure (Figure 1 from the original paper) shows the system architecture:

Figure 1: Memformer overall architecture for the encoder (left) and decoder (right). The Transformer encoder is responsible for interacting with the memory. Sequence modeling is achieved by predicting the next segment conditioned on the current segment and memory.

4.2.2. External Dynamic Memory Slots

The core of Memformer's long-term memory capability lies in its External Dynamic Memory (EDM).

  • Structure: The EDM is a simple yet powerful data structure that stores high-level, compressed representations of past inputs. It consists of a constant number $k$ of vectors, referred to as memory slots. At any given timestep $t$, the memory is represented as $M_t = [m_t^0, m_t^1, \ldots, m_t^k]$, where each $m_t^i$ is a vector (a memory slot).
  • "Dynamic" Nature: The term "dynamic" emphasizes that the model actively and interactively encodes and retrieves information from this memory in a recurrent manner. This contrasts with "static" memory designs where information is simply stored and doesn't change during inference.
  • Constant Memory Consumption: A key advantage of this design is that for each sample in a batch, separate memory representations are maintained. During inference, the memory consumption for the EDM remains constant, $\mathcal{O}(k)$, regardless of how long the input sequence becomes. This addresses the linear memory scaling issue of previous Transformer-based models for long contexts. Each memory slot $m_t^i$ is designed to work individually, potentially holding different types of information.
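
A minimal sketch of such a per-sample, fixed-size memory is shown below; the slot count `k`, model width `d_model`, and class name are illustrative assumptions, with the initial state following the normalized bias vectors described later in Section 4.2.4.3:

```python
import torch
import torch.nn as nn

class ExternalDynamicMemory(nn.Module):
    """k fixed-size memory slots, maintained separately for each sample in a batch."""
    def __init__(self, k=64, d_model=512):
        super().__init__()
        # Learnable per-slot bias vectors; their normalized form doubles as the initial state m_0.
        self.v_bias = nn.Parameter(torch.randn(k, d_model))

    def initial_state(self, batch_size):
        m0 = self.v_bias / self.v_bias.norm(dim=-1, keepdim=True)
        return m0.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, k, d_model), constant in sequence length
```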

4.2.3. Memory Reading

For each input segment, the model needs to retrieve relevant historical information from the EDM. This is achieved using a cross-attention mechanism.

The memory reading process can be broken down as follows:

  1. Projection: The input segment $x$ (providing the queries) and the memory slots $M_t$ (providing the keys and values) are linearly transformed into query ($Q_x$), key ($K_M$), and value ($V_M$) vectors: $ Q_x = x W_Q $, $ K_M = M_t W_K $, $ V_M = M_t W_V $ Where:
    • $x$ represents the input segment (e.g., current token embeddings).
    • $W_Q$, $W_K$, $W_V$ are learnable weight matrices for projecting inputs into queries, keys, and values, respectively.
    • $M_t$ is the current state of the external dynamic memory.
  2. Attention Calculation: The input sequence's queries ($Q_x$) attend over all the memory slots' key-value pairs ($K_M$, $V_M$) using multi-head attention (MHAttn). This computes attention scores $A_{x,M}$ indicating how relevant each memory slot is to each token in the input segment: $ A_{x,M} = \mathrm{MHAttn}(Q_x, K_M) $ Where:
    • $\mathrm{MHAttn}$ is the multi-head attention function, which performs scaled dot-product attention in parallel over multiple heads. The internal computation for each head involves $\mathrm{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right)$.
  3. Weighted Sum of Values: The attention scores are then normalized (by the softmax inside MHAttn) and used to compute a weighted sum of the memory slots' values ($V_M$). This yields $H_x$, the retrieved historical information contextualized for the current input segment: $ H_x = \mathrm{Softmax}(A_{x,M}) V_M $ Note: this formula is a conceptual representation; in practice, $\mathrm{MHAttn}$ internally handles the softmax and weighted sum across multiple heads.

This process, illustrated in Figure 2, allows the model to learn complex associations and retrieve the most relevant past information for the current segment. Memory reading is performed multiple times, as every encoder layer incorporates a memory reading module, increasing the chances of successful information retrieval.
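
A minimal cross-attention sketch of this reading step is given below; it uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the paper's memory reading module, and the shapes are assumptions:

```python
import torch
import torch.nn as nn

class MemoryReader(nn.Module):
    """Cross-attention: segment tokens are the queries, memory slots are the keys/values."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, memory):
        # x: (batch, L, d_model) segment representations; memory: (batch, k, d_model) slots
        H_x, _ = self.attn(query=x, key=memory, value=memory)
        return H_x  # retrieved history information, aligned with each input token
```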

The following figure (Figure 2 from the original paper) shows the memory reading mechanism:

Figure 2: Memory Reading. The input sequence $x$ attends over all the memory slots to retrieve the history information.

4.2.4. Memory Writing

Memory writing is the process of updating the external dynamic memory with information from the current segment. Unlike memory reading, memory writing occurs only at the last layer of the encoder to ensure that high-level, contextually rich representations are stored. The encoder often appends classification tokens to the input sequence to facilitate better extraction of sequence representations for memory updates.

4.2.4.1. Update via Memory Slot Attention

The memory is updated using a specialized slot attention module. This module determines how each individual memory slot should preserve its old information or integrate new information from the current segment. The process is as follows:

  1. Projection for Memory Slots: Each memory slot $m^i$ from the previous timestep ($M_t$) is projected into its own query and key vectors: $ Q_{m^i} = m^i W_Q $, $ K_{m^i} = m^i W_K $ Where $W_Q$ and $W_K$ are learnable projection matrices, and $m^i$ is the $i$-th memory slot.
  2. Projection for Input Tokens: The current input segment's token representations $x$ are projected into key and value vectors: $ K_x = x W_K $, $ V_x = x W_V $
  3. Slot Attention: Each memory slot's query ($Q_{m^i}$) attends over a concatenated set of keys: its own key ($K_{m^i}$) and the keys of the input segment tokens ($K_x$). This specific attention pattern is crucial: each memory slot can only attend to itself and the token representations. This ensures that memory slots do not directly interfere with each other during the writing process, promoting independent information storage. $ A_{m^i}' = \mathrm{MHAttn}(Q_{m^i}, [K_{m^i}; K_x]) $ Where:
    • $[K_{m^i}; K_x]$ denotes the concatenation of the memory slot's own key and the input segment's keys.
    • $A_{m^i}'$ are the raw attention logits (before softmax).
  4. Temperature Scaling: Before applying softmax to obtain the final attention weights, the raw attention logits $A_{m^i}'$ are divided by a temperature parameter $\tau$: $ A_{m^i} = \frac{\exp(A_i' / \tau)}{\sum_j \exp(A_j' / \tau)} $ Where:
    • $\tau$ is the temperature parameter. A temperature $\tau < 1$ (e.g., 0.25 as used in the experiments) sharpens the attention distribution, making it more focused on fewer slots or token outputs; a higher temperature makes the distribution softer.
  5. Memory Update: Finally, the next timestep's memory slot ${m_{t+1}^i}'$ is formed by taking a weighted sum (using the sharpened attention scores $A_{m^i}$) of its old value ($m_t^i$) and the values from the input segment tokens ($V_x$): $ {m_{t+1}^i}' = \mathrm{Softmax}(A_{x,M}) [m_t^i; V_x] $ Note: the paper's formula reuses $A_{x,M}$, which was defined for memory reading, and places $m_t^i$ directly in the value concatenation. Given the preceding steps, it is more likely meant as $\mathrm{Softmax}(A_{m^i}) [m_t^i; V_x]$, i.e. applying the slot attention weights $A_{m^i}$ to update $m_t^i$ with the relevant values (or a variant where the first element of the concatenated values is $m_t^i W_V$, to stay consistent with $V_x$). The structure in Figure 3 confirms that $[m_t^i; V_x]$ serves as the values and $A_{m^i}$ as the attention weights.

This attention mechanism allows each memory slot to dynamically decide whether to preserve its existing information or update itself with new information from the current segment.
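
The single-head sketch below illustrates this writing step under stated assumptions: the projection matrices `W_q`, `W_k`, `W_v` and the temperature `tau` are placeholders, and the paper itself uses multi-head attention:

```python
import torch
import torch.nn.functional as F

def memory_write(memory, enc_out, W_q, W_k, W_v, tau=0.25):
    """memory: (batch, k, d), enc_out: (batch, L, d). Each slot attends only to itself
    and the encoder outputs, then takes a temperature-sharpened weighted sum of values."""
    B, k, d = memory.shape
    q = memory @ W_q                                          # one query per slot: (B, k, d)
    k_self = (memory @ W_k).unsqueeze(2)                      # each slot's own key: (B, k, 1, d)
    k_x = (enc_out @ W_k).unsqueeze(1).expand(B, k, -1, -1)   # shared token keys:  (B, k, L, d)
    keys = torch.cat([k_self, k_x], dim=2)                    # slot i sees [itself; tokens]
    logits = (keys @ q.unsqueeze(-1)).squeeze(-1) / d ** 0.5  # (B, k, 1+L)
    weights = F.softmax(logits / tau, dim=-1)                 # tau < 1 sharpens the distribution
    v_self = (memory @ W_v).unsqueeze(2)                      # (B, k, 1, d)
    v_x = (enc_out @ W_v).unsqueeze(1).expand(B, k, -1, -1)   # (B, k, L, d)
    values = torch.cat([v_self, v_x], dim=2)                  # (B, k, 1+L, d)
    return (weights.unsqueeze(2) @ values).squeeze(2)         # updated slots: (B, k, d)
```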

The following figure (Figure 3 from the original paper) shows the memory writing mechanism:

Figure 3: Memory Writing. Each memory slot attends over itself and the input sequence representations to produce the next timestep's memory slot.

4.2.4.2. Implementation of Memory Writer

The specific implementation of the memory writer uses a sparse attention pattern. Each slot in the memory can only attend over itself and the encoder outputs. This design choice is critical for two reasons:

  1. Independence: It prevents direct interference between different memory slots, allowing them to store information independently.
  2. Information Preservation: If a slot only attends to itself during the writing process, its information remains unchanged in the next timestep. This mechanism helps to preserve information within each slot for longer durations.

4.2.4.3. Forgetting Mechanism (Biased Memory Normalization - BMN)

Forgetting is a vital cognitive process that allows a system to discard trivial or temporary information and make room for more important or newer information. Inspired by the forget gate in LSTMs, Memformer introduces Biased Memory Normalization (BMN).

The BMN mechanism operates as follows:

  1. Add Learnable Bias: After the memory slot ${m_{t+1}^i}'$ has been updated via slot attention, a learnable bias vector $v_{\mathrm{bias}}^i$ is added to it. Each memory slot $m^i$ has its own corresponding learnable bias vector: $ m_{t+1}^i \leftarrow {m_{t+1}^i}' + v_{\mathrm{bias}}^i $
  2. Normalization: The updated memory slot is then normalized (L2 normalization) to prevent its weights from growing without bound and to ensure gradient stability over many timesteps. This projects all memory slots onto a sphere: $ m_{t+1}^i \leftarrow \frac{m_{t+1}^i}{\|m_{t+1}^i\|} $ where $\|\cdot\|$ denotes the L2 norm.
  3. Initial State: The initial state of each memory slot at $t = 0$ is also derived from its bias vector after normalization, ensuring consistency: $ m_0^i \leftarrow \frac{v_{\mathrm{bias}}^i}{\|v_{\mathrm{bias}}^i\|} $
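
A minimal sketch of steps 1 and 2 (tensor shapes are assumptions; the initial state of step 3 reuses the normalized bias vectors, as in the memory module sketch above):

```python
import torch

def biased_memory_normalization(m_next, v_bias):
    """m_next: (batch, k, d) slots after slot attention; v_bias: (k, d) learnable per-slot biases."""
    m = m_next + v_bias                      # step 1: push each slot toward its terminal state
    return m / m.norm(dim=-1, keepdim=True)  # step 2: project back onto the unit sphere
```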

Function of $v_{\mathrm{bias}}$: The learnable bias vector $v_{\mathrm{bias}}$ controls both the speed and direction of forgetting for each memory slot.

  • When $v_{\mathrm{bias}}$ is added, it pushes the memory slot along the sphere, causing it to "forget" part of its previous information.

  • If a memory slot is not updated with new input information for many timesteps, it gradually moves towards a "terminal state" defined by its $v_{\mathrm{bias}}$ (after normalization). This terminal state also serves as the initial state, and it is learnable, allowing the model to adapt its default memory.

  • The magnitude of $v_{\mathrm{bias}}$ and the cosine distance between the current memory slot and $v_{\mathrm{bias}}$ influence the forgetting speed. For example, a memory slot $m_b$ that is nearly opposite to the terminal state (i.e., its $v_{\mathrm{bias}}$) would be hard to forget, while a slot $m_a$ closer to the terminal state would be easier to forget, as illustrated in Figure 4.

    The following figure (Figure 4 from the original paper) shows the forgetting mechanism:

    Figure 4: Illustration of forgetting. Memory slot $m_a$ is easy to forget, while $m_b$ is hard to forget.

4.2.5. Memory Replay Back-Propagation (MRBP)

Training Memformer involves back-propagation through time (BPTT) due to its recurrent interaction with external memory. However, traditional BPTT unrolls the entire computational graph during the forward pass and stores all intermediate activations, leading to impractically high memory consumption for models with large memory representations like Memformer.

Memory Replay Back-Propagation (MRBP) is proposed as an efficient optimization scheme, a variant of gradient checkpointing (Chen et al., 2016), to address this memory challenge. Instead of storing all activations, MRBP strategically recomputes parts of the computational graph during the backward pass by replaying the memory states at each timestep.

The algorithm, described in Algorithm 1, works as follows:

Algorithm 1: Memory Replay Back-Propagation

Input:

  • rollout = [x_t, x_{t+1}, ..., x_T]: a list containing the previous input segments.
  • Ω = [M_t, M_{t+1}, ..., M_T]: memory states from the previous rollouts (if pre-computed).

  1. Initialize a list for back-propagation: replayBuffer = [M_t]

Forward Pass (no gradients):

  2. For t = t, t+1, ..., T-1 do:
  3.   M_{t+1}, _ = Model(x_t, M_t)   (compute the next memory state, discarding the other outputs, which do not need to be stored)
  4.   replayBuffer.append(M_{t+1})   (store only the memory states in the buffer)
  5. End loop

*Explanation of Forward Pass:* In this phase, the model performs a forward pass through the entire `rollout` (sequence of segments). For each segment $x_t$ and the current memory $M_t$, it computes the next memory state $M_{t+1}$ using the Memformer model. Crucially, this pass is done *without storing gradients* for intermediate activations (indicated by `_` for the other outputs); only the memory states $M$ are stored in `replayBuffer`. This significantly reduces the memory footprint during the forward pass by not building the full computational graph for `BPTT`.

Backward Pass (with gradients):

  6. ∇M_{t+1} = 0   (initialize the gradient for the final memory state)
  7. For t = T, T-1, ..., t+1, t do   (iterate backwards through the rollout):
  8.   M_{t+1}, O_t = Model(x_t, M_t)   (recompute the forward pass locally for the current timestep to obtain all intermediate activations and the outputs O_t)
  9.   loss = f_loss(O_t)   (compute the loss for the current timestep's output O_t)
  10.  loss.backward()   (perform back-propagation for the current timestep's loss)
  11.  M_{t+1}.backward(∇M_{t+1})   (back-propagate the accumulated gradients from future memory states through the recomputed M_{t+1})
  12.  ∇M_{t+1} = ∇M_t   (propagate the gradient to the previous memory state)
  13. End loop

*Explanation of Backward Pass:* This phase performs the actual gradient computation. It iterates backward through the timesteps.
  • For each timestep $t$, it *recomputes* $\mathrm{Model}(x_t, M_t)$. This recomputation ensures that all intermediate activations needed for gradient calculation at this specific timestep are available. This is where `MRBP` saves memory: instead of storing all activations for all timesteps, it recomputes them on the fly for each timestep during the backward pass.
  • It then computes the `loss` for the current timestep's output $O_t$ and performs `loss.backward()`, which calculates gradients for the parameters of `Model` based on $O_t$.
  • The crucial step $M_{t+1}.\mathrm{backward}(\nabla M_{t+1})$ allows gradients from future timesteps (represented by $\nabla M_{t+1}$) to be propagated backward through the recomputed memory update path to $M_t$. This ensures long-range `back-propagation through time` for the memory network.
  • $\nabla M_{t+1} = \nabla M_t$ conceptually updates the gradient for the previous memory state, so it can be passed to the next iteration of the backward loop.

Update and Pop Oldest Memories:

  14. memories = replayBuffer
  15. memories.pop()   (remove the oldest memory from the buffer once it is no longer needed for back-propagation)

`MRBP` effectively only traverses the critical path (memory states) during the forward pass and stores only those memory states. During the backward pass, it locally recomputes the necessary parts of the computational graph at each timestep, combining local gradients with the propagated memory gradients. This significantly reduces the memory footprint compared to full `BPTT` while still allowing gradients to flow over long sequences.
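
For concreteness, here is a PyTorch-style sketch of the same two phases; it assumes a `model(x, M)` callable that returns `(next_memory, outputs)` and a scalar loss function `f_loss`, and it omits the optimizer step that would follow each rollout:

```python
import torch

def mrbp_rollout(model, f_loss, rollout, M_start):
    """Memory Replay Back-Propagation over one rollout of segments (sketch).
    Parameter gradients accumulate across timesteps and are applied by an optimizer afterwards."""
    # Forward pass without gradients: store only the chain of memory states.
    replay_buffer = [M_start]
    with torch.no_grad():
        for x in rollout[:-1]:
            M_next, _ = model(x, replay_buffer[-1])
            replay_buffer.append(M_next)

    # Backward pass: recompute each timestep locally and chain the memory gradients.
    grad_M_next = None
    for x, M_prev in zip(reversed(rollout), reversed(replay_buffer)):
        M_prev = M_prev.detach().requires_grad_(True)
        M_next, outputs = model(x, M_prev)            # local recomputation at this timestep
        loss = f_loss(outputs)
        loss.backward(retain_graph=True)              # gradients from this timestep's loss
        if grad_M_next is not None:
            M_next.backward(grad_M_next)              # inject the gradient flowing in from the future
        grad_M_next = M_prev.grad                     # pass the memory gradient one step further back
    return grad_M_next                                # gradient w.r.t. the rollout's initial memory
```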

5. Experimental Setup

The paper evaluates Memformer on two main tasks: autoregressive image generation and language modeling. It also conducts detailed ablation studies and efficiency comparisons.

5.1. Datasets

5.1.1. Autoregressive Image Generation

  • Dataset: MNIST (LeCun and Cortes, 2010)
    • Description: A classic dataset of grayscale images of handwritten digits (0-9).
    • Characteristics: Each image is $28 \times 28$ pixels. For sequence modeling, each image is reshaped into a sequence of 784 tokens (one token per pixel). The 8-bit grayscale pixel values are treated as a vocabulary of size 256 (0-255).
    • Example Data Sample: An MNIST image is a small grid of pixel values. For instance, a single digit '7' image would be represented as a 784-element sequence, where each element is an integer between 0 and 255 representing the intensity of a pixel.
    • Rationale: This dataset is chosen to demonstrate Memformer's ability to model long sequences in a non-textual domain and to evaluate its efficiency and performance in an autoregressive generation context, where each pixel is predicted based on all previous pixels.

5.1.2. Language Modeling

  • Dataset: WikiText-103 (Merity et al., 2017)
    • Description: A large-scale language modeling benchmark derived from Wikipedia articles.
    • Characteristics: It contains approximately 28,000 articles with an average length of 3,600 tokens per article. This dataset is specifically designed for evaluating models on long-range dependencies, making it suitable for Memformer's goals.
    • Example Data Sample: A sample from WikiText-103 would be a passage from a Wikipedia article, such as: "The first major battle of the war took place in July 1776, when a large British force under General William Howe landed on Staten Island, New York. Howe's forces quickly defeated the American army at the Battle of Long Island and captured New York City."
    • Rationale: WikiText-103 is a standard benchmark for assessing language models' ability to capture long-term context and generate coherent text. Its long average article length directly tests Memformer's efficiency and theoretical infinite context capability.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

5.2.1. Perplexity (PPL)

  • Conceptual Definition: Perplexity is a common intrinsic evaluation metric for language models. It quantifies how well a probability distribution or a probability model predicts a sample. In simpler terms, it measures how "surprised" the model is by a given text sequence. A lower perplexity score indicates that the model assigns a higher probability to the actual sequence, meaning it predicts the sequence more accurately and is a better language model.
  • Mathematical Formula: Perplexity is defined as the exponential of the average negative log-likelihood (or cross-entropy loss) per token. For a sequence of $N$ tokens $W = (w_1, w_2, \ldots, w_N)$: $ PPL(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \ldots, w_{i-1}) \right) $ Alternatively, if $H(P, Q)$ is the cross-entropy loss between the true distribution $P$ and the model's predicted distribution $Q$: $ PPL = e^{H(P, Q)} $
  • Symbol Explanation:
    • $W$: The text sequence being evaluated.
    • $N$: The total number of tokens in the sequence $W$.
    • $w_i$: The $i$-th token in the sequence.
    • $P(w_i | w_1, \ldots, w_{i-1})$: The probability assigned by the language model to the $i$-th token, given all preceding tokens in the sequence.
    • $\log$: The natural logarithm.
    • $\exp(\cdot)$: The exponential function ($e$, Euler's number, raised to the power of the argument).
    • $H(P, Q)$: The cross-entropy between the true token distribution $P$ and the model's predicted token distribution $Q$.
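
A small worked example, assuming per-token logits from any language model:

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    """logits: (N, vocab_size) model predictions for N tokens; targets: (N,) gold token ids.
    Perplexity is the exponential of the mean per-token cross-entropy (negative log-likelihood)."""
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll)

# Sanity check: a model assigning uniform probability 1/V to every token scores perplexity V;
# lower values mean the model is less "surprised" by the data.
```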

5.2.2. FLOPs (Floating-point operations)

  • Conceptual Definition: FLOPs stands for floating-point operations, the total number of floating-point arithmetic operations (additions, multiplications, divisions, etc.) performed by a model; the related FLOPS, with a capital S, measures operations per second. FLOPs is used as a measure of the computational cost or complexity of a model, especially during inference or training. A higher number of FLOPs indicates that more computational resources are required.
  • Mathematical Formula: There isn't a single universal formula for FLOPs as it depends on the specific operations within a neural network. It's usually calculated by summing the number of floating-point operations for each layer and operation (e.g., matrix multiplications, convolutions, activations).
    • For a matrix multiplication $A_{m \times k} \times B_{k \times n} \rightarrow C_{m \times n}$: approximately $2 \times m \times k \times n$ FLOPs (each element of $C$ requires $k$ multiplications and $k-1$ additions, so roughly $2k$ FLOPs per element).
  • Symbol Explanation:
    • m, k, n: Dimensions of matrices involved in operations.
    • The total FLOPs is a sum over all operations.
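
A tiny helper makes this estimate concrete; the example dimensions are illustrative only:

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for an (m x k) @ (k x n) matrix multiply:
    each of the m*n output elements needs k multiplications and k-1 additions."""
    return m * n * (2 * k - 1)  # commonly rounded to 2*m*k*n

# Example: a 512x512 projection applied to 1,024 tokens
# matmul_flops(1024, 512, 512) -> 536,346,624, about 0.54 billion FLOPs
```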

5.2.3. GPU Memory (MB)

  • Conceptual Definition: GPU Memory refers to the amount of memory (in Megabytes, MB) utilized on the Graphics Processing Unit (GPU) by the model during its operation (training or inference). This metric is crucial for evaluating the memory efficiency of models, especially when dealing with large models, long sequences, or limited hardware resources. Lower GPU Memory usage means the model can run on less powerful GPUs or handle larger batch sizes/sequence lengths.
  • Mathematical Formula: This is a direct measurement from hardware monitoring tools, not a mathematical formula calculated from model parameters.
  • Symbol Explanation: No specific symbols, it's a direct measurement in MB.

5.2.4. Speed (relative)

  • Conceptual Definition: Speed refers to the time taken by a model to perform a specific task, often measured in "seconds per sample" or "tokens per second." "Relative speed" compares the speed of the proposed model to a baseline, often expressed as a multiplier (e.g., "3.2x faster"). Higher speed (or larger speed multiplier) indicates better inference efficiency.
  • Mathematical Formula: $ \text{Speed} = \frac{\text{Number of Samples/Tokens Processed}}{\text{Total Time Taken}} $ $ \text{Relative Speed} = \frac{\text{Speed}_{\text{Memformer}}}{\text{Speed}_{\text{Baseline}}} $
  • Symbol Explanation:
    • $\text{Number of Samples/Tokens Processed}$: The quantity of data processed.
    • $\text{Total Time Taken}$: The duration of the processing.
    • $\text{Speed}_{\text{Memformer}}$: Memformer's processing speed.
    • $\text{Speed}_{\text{Baseline}}$: The baseline model's processing speed.

5.3. Baselines

The paper compares Memformer against several representative models to validate its performance and efficiency.

5.3.1. General Efficiency Comparison

  • Vanilla Transformer (Vaswani et al., 2017): The standard Transformer model, used primarily for comparison of computational and memory cost scaling due to its $\mathcal{O}(N^2)$ complexity.
  • Transformer-XL (Dai et al., 2019): A strong baseline for long-range sequence modeling, which uses segment-level recurrence and caches past hidden states. This is a direct competitor for Memformer's goal of handling long contexts efficiently.
  • Compressive Transformer (Rae et al., 2020): An extension of Transformer-XL that further compresses cached hidden states, aiming for even longer context retention.

5.3.2. Autoregressive Image Generation Baselines

  • LSTM: A standard Recurrent Neural Network architecture with gating mechanisms, representing traditional sequence models. (4 layers, 512 hidden size).
  • Transformer Decoder: A standard Transformer decoder-only model that can take the entire 784-token sequence as input, representing a strong Transformer baseline without explicit external memory beyond its attention window. (8 layers, 128 hidden size).
  • Transformer-XL: Used with various memory sizes (56, 224, 784) to evaluate its performance and efficiency trade-offs. (8 layers, 128 hidden size).

5.3.3. Language Modeling Baselines

  • Transformer-XL base: The base version of Transformer-XL with 16 layers, 512 hidden size, 2048 feedforward size, 64 head size, and 8 heads. Tested with various memory sizes (128, 256, 512, 1024, 1600).

  • Compressive Transformer: Re-implemented by the authors with the same size as Transformer-XL base to ensure fair comparison (memory length 512, compressive memory length 512, compression ratio 4).

    These baselines are representative because they cover different approaches to sequence modeling (recurrent, attention-only, memory-augmented attention) and different strategies for handling long contexts, allowing for a comprehensive evaluation of Memformer's novelty and effectiveness.

6. Results & Analysis

This section delves into the experimental findings, analyzing Memformer's performance, efficiency, and the contributions of its various components compared to the baselines.

6.1. Computation and Memory Cost

The paper first establishes Memformer's fundamental efficiency advantages by comparing its FLOPs and GPU memory consumption against Vanilla Transformer and Transformer-XL.

  • Computational Complexity:

    • Vanilla Transformer: Has an $\mathcal{O}(N^2)$ computational cost, where $N$ is the sequence length. This rapidly becomes prohibitive for long sequences.
    • Transformer-XL and Memformer: Both utilize memory to store historical information, allowing them to process sequences in fixed-size segments. This design theoretically reduces their computational complexity to $\mathcal{O}(N)$ (linear in the overall sequence length), as they only process a fixed-size segment at each step.
  • Memory Space Complexity:

    • Transformer-XL: Stores past hidden states for all layers as memory. If $L$ is the number of layers and $K$ is the memory size (number of hidden states), its memory cost is $\mathcal{O}(K \times L)$.

    • Memformer: Stores only $K$ vectors as its external dynamic memory, resulting in a memory cost of $\mathcal{O}(K)$. This is a significant improvement, especially for deep models.

      The graphical comparison in Figure 5 visually confirms these theoretical advantages:

  • FLOPs vs. Sequence Length (Figure 5, left):

    • Vanilla Transformer shows the steepest, quadratic growth in FLOPs as sequence length increases.
    • Both Transformer-XL and Memformer exhibit a much slower, linear growth in FLOPs with increasing sequence length, validating their $\mathcal{O}(N)$ theoretical complexity.
    • Memformer is shown to be more efficient than Transformer-XL in terms of FLOPs, particularly as sequence length grows, indicating its memory interaction is less computationally intensive.
  • GPU Memory Consumption vs. Memory Size (Figure 5, right):

    • This graph compares actual GPU memory consumption when the memory size (number of cached hidden states for Transformer-XL, number of memory slots for Memformer) increases from 64 to 2,048, with a fixed batch size of 16.

    • Transformer-XL's memory consumption grows rapidly with increasing memory size, reflecting its $\mathcal{O}(K \times L)$ cost.

    • Memformer demonstrates much greater efficiency, with its memory consumption growing at a significantly slower rate.

    • For large memory sizes, Memformer uses 8.1x less memory space compared to Transformer-XL, which is a substantial reduction.

      The following figure (Figure 5 from the original paper) shows the comparison of FLOPs and GPU memory consumption:

      Figure 5: Comparison of the number of FLOPs and GPU memory consumption for Vanilla Transformer, Transformer-XL, and Memformer.

6.1.1. MRBP Efficiency Test

In Appendix A, the paper provides a specific comparison of Memory Replay Back-Propagation (MRBP) against standard Back-Propagation Through Time (BPTT) and Gradient Checkpointing (GC) algorithms. This test confirms the efficiency of MRBP for training Memformer.

The following are the results from Table 3 of the original paper:

Method    GPU Memory (MB)    Speed (relative)
BPTT      16,177             x1.00
GC        9,885              x0.48
MRBP      7,229              x0.90
  • Memory Reduction:
    • BPTT uses the most memory (16,177 MB) as it stores all intermediate activations.
    • Gradient Checkpointing (GC) significantly reduces memory to 9,885 MB.
    • MRBP achieves the lowest GPU Memory consumption at 7,229 MB, demonstrating its superior memory efficiency compared to both BPTT and standard GC.
  • Speed Trade-off:
    • BPTT is the fastest (x1.00) because it avoids recomputation.
    • GC is much slower (x0.48) due to its recomputation overhead.
    • MRBP shows only a slight speed degradation (x0.90) compared to BPTT, making it nearly as fast while being significantly more memory-efficient than GC. This confirms MRBP as a highly effective optimization for training Memformer, enabling long-range BPTT with greatly reduced memory costs and acceptable computational overhead.

6.2. Autoregressive Image Generation

Memformer was evaluated on the MNIST dataset for autoregressive image generation, where images are treated as long token sequences.
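
Concretely, each 28x28 image is read out in raster-scan order and modeled as a 784-token sequence. The snippet below shows one plausible flattening with binarized pixels; the exact pixel tokenization used in the paper may differ.

```python
import numpy as np

def image_to_token_sequence(image, threshold=128):
    """Flatten a 28x28 grayscale image into 784 tokens in raster-scan order.
    Pixels are binarized into a 2-token vocabulary here purely for illustration."""
    assert image.shape == (28, 28)
    return (image.flatten() >= threshold).astype(int).tolist()

# Usage with a random stand-in image.
fake_image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
tokens = image_to_token_sequence(fake_image)
print(len(tokens))  # 784 tokens, predicted autoregressively one pixel at a time
```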

The following are the results from Table 1 of the original paper:

| Model | #FLOPs (B) | Perplexity ↓ |
| --- | --- | --- |
| LSTM | 52.5 | 1.698 |
| Transformer Decoder | 41.3 | 1.569 |
| Transformer-XL (memory=56) | 5.6 | 1.650 |
| Transformer-XL (memory=224) | 15.6 | 1.618 |
| Transformer-XL (memory=784) | 49.1 | 1.611 |
| Memformer (4 encoder + 8 decoder) | 5.0 | 1.555 |
| Memformer ablation (2 encoder + 6 decoder), memory=64 | 3.9 | 1.594 |
| Memformer ablation (2 encoder + 6 decoder), memory=32 | 3.9 | 1.600 |
| Memformer ablation (2 encoder + 6 decoder), memory=16 | 3.9 | 1.604 |
| Memformer ablation (2 encoder + 6 decoder), memory=1 | 3.9 | 1.627 |
| Memformer ablation (4 encoder + 4 decoder) | 3.6 | 1.628 |
| Memformer ablation, w/o memory | 1.8 | 1.745 |
| Memformer ablation, temperature=1.0 | 3.9 | 1.612 |
| Memformer ablation, w/o forgetting | 3.9 | 1.630 |
| Memformer ablation, w/o multi-head | 3.9 | 1.626 |
  • Overall Performance: Memformer (4 encoder + 8 decoder layers) achieved the best perplexity of 1.555. This not only outperformed LSTM (1.698) and all Transformer-XL variants but also surpassed the Transformer Decoder (1.569) that processed the entire 784-token input. This suggests that Memformer's external memory effectively compresses and utilizes long-range context, leading to better predictions.
  • Efficiency: The flagship Memformer model achieved this superior performance while using significantly fewer FLOPs (5.0 Billion) compared to the best Transformer-XL baseline (memory=784, 49.1 Billion FLOPs). Memformer used roughly 10% of the FLOPs of the best Transformer-XL model.
  • Impact of Encoder/Decoder Layers: The ablation study showed that reducing Memformer's decoder layers (e.g., from 4 encoder + 8 decoder to 4 encoder + 4 decoder) led to a drop in performance (1.628 perplexity). This indicates that the number of decoder layers is crucial for the model's predictive capacity, and it counters the concern that the extra encoder layers alone were responsible for the gains, underscoring the balanced contribution of the encoder and decoder.

6.2.1. Ablation Studies on Memformer Components (Image Generation)

The ablation studies further shed light on the importance of Memformer's components:

  • Memory Size:
    • As the memory size decreased from 64 to 1, perplexity gradually increased (1.594 for 64, 1.627 for 1). This confirms that a larger memory capacity allows the model to store more information, leading to better performance.
    • When the memory was completely removed (w/o memory), perplexity significantly worsened to 1.745, highlighting the critical role of the external memory in Memformer's design.
  • Memory Writer Temperature ($\tau$): Setting temperature=1.0 (which makes the attention distribution softer) resulted in higher perplexity (1.612) compared to the default temperature=0.25 (1.594 for the 2 encoder + 6 decoder ablation with memory=64), indicating that sharpening the attention distribution during memory writing helps the writer focus on key information (a toy sketch of this write step appears after this list).
  • Forgetting Mechanism: Removing the forgetting mechanism (w/o forgetting) increased perplexity to 1.630 (compared to 1.594), demonstrating its contribution to filtering out trivial information and improving learning.
  • Multi-Head Attention: Removing multi-head attention (w/o multi-head) led to a perplexity of 1.626, suggesting that the diverse attention heads contribute to better information processing within the memory system.
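
For intuition only, the block below sketches a temperature-scaled slot-attention write with a per-slot forgetting bias in PyTorch. It is a simplified toy, not the paper's exact formulation (see the method section of this analysis for the actual equations); `ToyMemoryWriter`, the single attention head, and all shapes are illustrative assumptions.

```python
# A simplified, single-head sketch of a temperature-scaled memory write with a
# learnable per-slot forgetting bias (illustrative; not the paper's exact equations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryWriter(nn.Module):
    def __init__(self, d_model: int, num_slots: int, temperature: float = 0.25):
        super().__init__()
        self.temperature = temperature
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # A slot with a large-magnitude bias keeps (or discards) its own content
        # more aggressively, echoing the behaviours discussed in Section 6.4.
        self.forget_bias = nn.Parameter(torch.zeros(num_slots))

    def forward(self, memory, segment_states):
        # memory: (batch, num_slots, d);  segment_states: (batch, seg_len, d)
        num_slots = memory.size(1)
        q = self.q_proj(memory)
        keys = torch.cat([self.k_proj(memory), self.k_proj(segment_states)], dim=1)
        values = torch.cat([self.v_proj(memory), self.v_proj(segment_states)], dim=1)
        scores = torch.einsum("bkd,bnd->bkn", q, keys) / q.size(-1) ** 0.5
        # Bias each slot's score for attending to its own previous value.
        self_bias = F.pad(torch.diag(self.forget_bias),
                          (0, scores.size(-1) - num_slots))
        # Dividing by a temperature < 1 sharpens the write distribution.
        attn = F.softmax((scores + self_bias) / self.temperature, dim=-1)
        return torch.einsum("bkn,bnd->bkd", attn, values)

# Usage with hypothetical sizes: 8 slots, a 16-token segment, width 64.
writer = ToyMemoryWriter(d_model=64, num_slots=8)
new_memory = writer(torch.randn(2, 8, 64), torch.randn(2, 16, 64))
print(new_memory.shape)  # torch.Size([2, 8, 64])
```

In this toy, setting the temperature to 1.0 or zeroing `forget_bias` mirrors, in spirit, the "temperature=1.0" and "w/o forgetting" ablation rows above.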

6.3. Language Modeling

Memformer was also evaluated on WikiText-103, a benchmark for long-range language modeling.

The following are the results from Table 2 of the original paper:

| Model | #FLOPs (B) | PPL ↓ |
| --- | --- | --- |
| Transformer-XL base (memory=1600) | 250 | 23.95 |
| Transformer-XL base (memory=1024) | 168 | 23.67 |
| Transformer-XL base (memory=512) | 94 | 23.94 |
| Transformer-XL base (memory=256) | 58 | 25.39 |
| Transformer-XL base (memory=128) | 39 | 25.60 |
| Transformer-XL base (memory=32) | 26 | 27.22 |
| Compressive Transformer (memory=512, compress=512) | 172 | 23.23 |
| Memformer (4 encoder + 16 decoder) | 54 | 22.74 |
| Memformer ablation (4 encoder + 12 decoder) | 48 | 23.91 |
| Memformer ablation (memory=512) | 35 | 23.30 |
| Memformer ablation (w/o memory) | 31 | 25.57 |
  • Overall Performance & Efficiency:
    • The best Memformer configuration (4 encoders + 16 decoders, 1,024 memory size) achieved a perplexity of 22.74, which is the lowest among all tested models.
    • This superior performance came with a significantly lower computational cost: 54 Billion FLOPs.
    • In comparison, the best Transformer-XL (memory=1024) achieved 23.67 perplexity with 168 Billion FLOPs, while the Transformer-XL with memory=1600 required even more FLOPs (250 Billion) for a slightly worse perplexity (23.95).
    • This demonstrates that Memformer is not only more performant but also much more efficient. It is 3.2 times faster than the comparable Transformer-XL baselines in terms of inference.
  • Transformer-XL Memory Scaling: As Transformer-XL's memory size increased, its perplexity generally dropped, but FLOPs grew rapidly. The perplexity improvement ceased after memory=1024 or 1600, suggesting that for WikiText-103's average article length, larger memory sizes might introduce noise or reach a saturation point.
  • Compressive Transformer: The re-implemented Compressive Transformer achieved 23.23 perplexity with 172 Billion FLOPs. While its perplexity is competitive with Memformer, its FLOPs are substantially higher, reinforcing Memformer's efficiency claim.

6.3.1. Ablation Studies on Memformer Components (Language Modeling)

  • Impact of Decoder Layers: Reducing Memformer's decoder layers (from 4 encoder + 16 decoder to 4 encoder + 12 decoder) increased perplexity to 23.91. While still competitive with Transformer-XL, this shows the decoder's importance and the trade-off between model size and performance.
  • Memory Size: Reducing Memformer's memory size to 512 increased perplexity to 23.30 (from 22.74), confirming that larger memory capacity is beneficial, though the marginal gains might decrease at higher sizes.
  • Removal of Memory: Completely removing the memory module (w/o memory) resulted in a perplexity of 25.57. This is significantly worse than the full Memformer and comparable to Transformer-XL with a small memory size of 128 (25.60), once again underscoring the indispensable role of the external memory system.

6.4. Memory Writer Analysis

The paper provides an insightful analysis of how the memory writer updates the memory slots, categorizing their behavior and visualizing their attention patterns. This helps understand how Memformer retains information.

The following figure (Figure 6 from the original paper) shows the visualization of three types of memory slots:

Figure 6: Visualization of three types of memory slots ($m^{250}$, $m^{300}$, and $m^{355}$), shown as a heatmap of each slot's attention strength over memory slots and input tokens.

  • Three Types of Memory Slots:
    1. Long-Term Retention Slots (e.g., $m^{300}$): These slots primarily focus their attention on themselves (60-80% of memory slots exhibit this behavior during a document's processing). This means they are not significantly updated by the current timestep's input. Such slots are crucial for carrying information from the distant past, effectively retaining context over long periods.

    2. Aggregating Slots (e.g., $m^{250}$): These slots place partial attention on themselves and distribute the rest of their attention over the input tokens. They are in a transition state, aggregating new information from the current segment while still retaining some of their existing context; they evolve from the first type as they incorporate new, relevant details.

    3. Rapid Update Slots (e.g., $m^{355}$): These slots completely attend to the input tokens, indicating a significant update with new information. At the beginning of processing a document, almost all slots behave this way as they are initialized. Later, only a small percentage (5-10%) falls into this category. It was found that the forgetting vector's bias ($v_{\mathrm{bias}}$) for such slots often had a larger magnitude (e.g., 3.20) compared to more stable slots (e.g., 1.15), suggesting that information in these slots is meant to change rapidly.

      The following figure (Figure 7 from the original paper) shows the visualization of the memory writer's attention for slot $m^{355}$:

      Figure 7: Visualization of the memory writer's attention for slot $m^{355}$ over an example input, with key tokens such as "volunteer" and "quitting" highlighted as receiving high attention.

Figure 7, visualizing the attention of slot $m^{355}$ on an example input sequence, shows that this specific slot learned to form a compressed representation of the sentence by attending to named entities and verbs. This behavior aligns with human cognition, where key information (who, what, when, where, actions) is often prioritized for memorization. This analysis provides strong evidence that Memformer's external memory slots are not just storing raw data but actively encoding and retaining meaningful, high-level information.
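
As a rough illustration of this taxonomy, the snippet below buckets slots by how much write attention they keep on themselves; the 0.9/0.1 thresholds and the random attention matrix are invented for illustration and do not come from the paper.

```python
# Bucket memory slots by self-attention mass (illustrative thresholds only).
import numpy as np

def classify_slots(attn, num_slots, high=0.9, low=0.1):
    """attn: (num_slots, num_slots + seg_len) row-normalized write-attention weights.
    Column i (for i < num_slots) is slot i's attention to its own previous value."""
    labels = []
    for i in range(num_slots):
        self_mass = attn[i, i]
        if self_mass >= high:
            labels.append("long-term retention")
        elif self_mass <= low:
            labels.append("rapid update")
        else:
            labels.append("aggregating")
    return labels

# Usage with a random, row-normalized attention matrix (4 slots, 6 segment tokens).
rng = np.random.default_rng(0)
attn = rng.random((4, 10))
attn /= attn.sum(axis=1, keepdims=True)
print(classify_slots(attn, num_slots=4))
```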

6.5. Effects of Time Horizon and Memory Size (Appendix C)

The paper investigates the impact of two critical hyperparameters: the time horizon for back-propagation and the memory size. These experiments were conducted on a smaller Memformer model for efficiency.

The following figure (Figure 8a and Figure 8b from the original paper) shows the effects of changing time horizon and memory size:

Figure 8: Effect of the back-propagation time horizon (Figure 8a, left) and of the memory size (Figure 8b, right) on perplexity.

  • Effect of Time Horizon (Figure 8a):
    • The time horizon for back-propagation dictates how many previous timesteps gradients are allowed to flow through.
    • When the time horizon is set to 1, back-propagation cannot propagate gradients through memory to previous timesteps. This results in the worst performance, as the memory writer cannot learn to retain long-term information effectively.
    • As the time horizon increases, the model achieves better perplexity scores, indicating that longer gradient paths are crucial for training the memory system.
    • However, the marginal improvement in perplexity diminishes when the time horizon is increased to 32, suggesting a point of diminishing returns.
  • Effect of Memory Size (Figure 8b):
    • The memory size determines the number of memory slots available.

    • Increasing the memory size from 1 to 8 yields a significant improvement in perplexity, showing that more capacity allows for better information storage and retrieval.

    • Further increasing the memory size beyond 8 results in smaller improvements, implying that for the specific (smaller) model used in this ablation, the benefit of additional memory slots plateaus. This might be due to the model's capacity to utilize the extra slots or the nature of the task.

      These analyses confirm the importance of both a sufficiently long back-propagation time horizon for effective training of the recurrent memory and an adequate memory size to store enough relevant historical information; a minimal sketch of how such a horizon is enforced during training follows.
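
The mechanics behind the time-horizon knob are those of standard truncated back-propagation through time: the memory is detached every `horizon` timesteps, so gradients cannot flow further back. Below is a minimal PyTorch sketch with a toy one-layer stand-in for a Memformer timestep; all names, sizes, and the toy objective are hypothetical.

```python
# Truncated BPTT with a configurable time horizon (toy stand-in model).
import torch
import torch.nn as nn

model = nn.Linear(64, 64)                      # stand-in for one Memformer timestep
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
horizon = 4                                    # back-propagation time horizon

memory = torch.zeros(8, 64)
segments = [torch.randn(8, 64) for _ in range(16)]

running_loss = 0.0
for t, segment in enumerate(segments):
    memory = torch.tanh(model(segment) + memory)
    running_loss = running_loss + memory.pow(2).mean()   # toy per-step loss
    if (t + 1) % horizon == 0:
        optimizer.zero_grad()
        running_loss.backward()      # gradients reach back at most `horizon` steps
        optimizer.step()
        memory = memory.detach()     # cut the graph at the horizon boundary
        running_loss = 0.0
```

With `horizon = 1`, the detach happens after every timestep, which mirrors the worst-performing setting in Figure 8a: the memory writer never receives gradient signal from future timesteps.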

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Memformer, a novel memory-augmented Transformer model designed for efficient sequence modeling of long sequences. Memformer leverages an external dynamic memory system that, unlike previous Transformer-XL variants, stores high-level compressed representations rather than raw hidden states. This design enables Memformer to achieve linear time complexity and constant memory space complexity during inference, addressing a critical bottleneck of traditional Transformers. Furthermore, the paper proposes Memory Replay Back-Propagation (MRBP), an innovative training scheme that significantly reduces memory requirements for back-propagation through time in such recurrent architectures. Experimental results on autoregressive image generation and language modeling demonstrate that Memformer delivers comparable or superior performance to strong baselines like Transformer-XL, while being substantially more efficient (8.1x less memory, 3.2x faster inference). Analysis of the memory writer's attention patterns provides evidence that Memformer's memory slots can dynamically encode and retain important, high-level information from the distant past.

7.2. Limitations & Future Work

The authors suggest that the enhanced memory capacity of Memformer can spark interesting works that rely on recurrence and autoregressive modeling, which will benefit tasks such as dialog and interactive systems. This implicitly points to areas where models needing long-term, dynamic memory are crucial, and where current solutions might fall short due to efficiency or context limitations.

While the paper doesn't explicitly list limitations of Memformer itself, it implicitly highlights the limitations of prior work that Memformer aims to overcome:

  • Complexity and Instability of NTM/DNC: The paper motivates Memformer partly by noting that Neural Turing Machines and Differential Neural Computers, despite having external dynamic memory, were often too complex, slow, and unstable for widespread adoption. Memformer presents a simpler, more stable approach to dynamic memory.

  • Fixed Context Window/Memory for Transformer-XL/Compressive Transformer: These models, while improving Transformer's context, still have a theoretical maximum temporal range and require substantial memory for raw hidden state caching. Memformer aims for a truly infinite temporal range with constant memory.

    Therefore, Memformer's future work lies in applying its efficient, long-range memory capabilities to challenging recurrent and autoregressive tasks where context retention is paramount.

7.3. Personal Insights & Critique

Memformer represents a clever synthesis of ideas from different eras of neural network research: the powerful attention mechanism of Transformers combined with the external dynamic memory concept of older memory networks (like NTMs), all while addressing the practical training and inference efficiency concerns.

Inspirations and Transferability:

  • Modular Design: The modularity of memory reading and memory writing modules, separate from the core Transformer layers, offers a clear framework. This modularity could inspire similar memory augmentation for other neural network architectures beyond Transformers, or for tasks where specific types of memory (e.g., episodic, semantic) are beneficial.
  • Efficient Long-Context Processing: The approach of segment-level processing combined with a compressed, dynamic memory is highly transferable. This paradigm is crucial for tasks like:
    • Long-document summarization/QA: Where understanding context across thousands of tokens is vital.
    • Conversational AI/Dialog Systems: To maintain coherent and contextually relevant conversations over many turns.
    • Code generation/understanding: Where variable definitions, function calls, and logical flow span long sequences.
  • MRBP for other Recurrent Models: The Memory Replay Back-Propagation (MRBP) technique is not specific to Memformer. Any recurrent neural network that maintains a large, differentiable internal or external state over many timesteps and struggles with BPTT memory could potentially benefit from MRBP. This is a valuable contribution for the broader recurrent neural network community.
  • Interpretable Memory: The memory writer analysis (Figures 6 & 7) is particularly insightful. Being able to categorize memory slots by their update behavior and visualize what information they attend to provides a level of interpretability often lacking in complex neural networks. This could lead to further research into designing more specialized or interpretable memory architectures.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Capacity of Fixed-Size Memory: While Memformer achieves a "theoretically infinite temporal range," the practical effectiveness hinges on the ability of the fixed $k$ memory slots to effectively compress and retain all relevant information from an arbitrarily long past. If the past context is extremely complex and diverse, $k$ might still need to be very large, or the compression might lead to loss of crucial fine-grained details. The optimal $k$ is likely task-dependent.

  • Hyperparameter Sensitivity: The memory writer temperature ($\tau$), the number of memory slots ($k$), and the characteristics of the learnable bias vector ($v_{\mathrm{bias}}$) in the forgetting mechanism are all critical hyperparameters. Their optimal tuning might be complex and task-specific, potentially impacting the model's stability and performance.

  • Complexity of Memory Interaction: While simpler than NTM/DNC, the memory interaction (multiple cross-attention layers for reading, slot attention for writing, and the BMN mechanism) still adds layers of complexity compared to a plain Transformer. This added complexity, even if efficient, could introduce training challenges or subtle failure modes.

  • Long-Range Dependency Benchmarking: While WikiText-103 is a good benchmark, datasets with even longer dependencies or more complex reasoning tasks (e.g., requiring multi-hop reasoning over very long documents) could further push the limits of Memformer's constant memory and infinite temporal range claims.

  • Memory Slot Overlap/Redundancy: The paper states memory slots should not interfere with each other during writing. However, with slot attention, it's possible for multiple slots to learn similar representations or attend to the same aspects of the input, leading to some redundancy or under-utilization of the total memory capacity. Further mechanisms to encourage diverse memory slot specialization could be explored.

  • Generalization of MRBP: While effective for Memformer, the general applicability and performance of MRBP for other highly complex recurrent neural networks (especially those with different memory structures) would need further investigation.

    Overall, Memformer offers a compelling and experimentally validated solution to the Transformer's long-context efficiency problem, opening new avenues for memory-augmented neural networks. Its practical efficiency gains make it a promising candidate for real-world applications requiring robust long-term memory.
