
NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs

Published: 2022-06-10

TL;DR Summary

NAGphormer is proposed to address the computational complexity of existing graph Transformers for node classification in large graphs. By introducing the Hop2Token module, it aggregates multi-hop neighborhood features into per-node token sequences, enhancing node representation effectiveness and enabling scalable mini-batch training on large graphs.

Abstract

The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, causing it hard to scale to large graphs due to the quadratic complexity on the number of nodes for the self-attention computation. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. Moreover, we mathematically show that as compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer could learn more informative node representations from the multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs. Code is available at https://github.com/JHL-HUST/NAGphormer.


1. Bibliographic Information

1.1. Title

NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs

1.2. Authors

The authors are Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. Their affiliations are:

  • School of Artificial Intelligence, Huazhong University of Science and Technology
  • School of Computer Science and Technology, Huazhong University of Science and Technology
  • Hopcroft Center on Computing Science, Huazhong University of Science and Technology

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on 2022-06-10 (UTC). arXiv is a highly influential platform for the rapid dissemination of research in fields such as computer science, mathematics, and physics. While it hosts preprints, many papers later undergo peer review and appear in top-tier conferences or journals; its appearance on arXiv marks this work as a recent contribution at the time of posting.

1.4. Publication Year

2022

1.5. Abstract

Existing graph Transformer models struggle with large graphs due to their quadratic computational complexity with respect to the number of nodes, as they treat all nodes as independent tokens in a single long sequence. To overcome this, the paper proposes NAGphormer (Neighborhood Aggregation Graph Transformer), a novel architecture that redefines tokens for graph Transformers. NAGphormer introduces a Hop2Token module that aggregates neighborhood features from different hops for each node, creating a sequence of token vectors for each node. This approach allows for mini-batch training, making it scalable to large graphs. The authors mathematically demonstrate that NAGphormer can learn more informative node representations from multi-hop neighborhoods compared to advanced decoupled Graph Neural Networks (GNNs). Extensive experiments on both small and large benchmark datasets show that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the scalability issue of existing graph Transformer models when applied to large graphs for tasks like node classification.

Graph Neural Networks (GNNs) have become powerful tools for graph mining tasks by aggregating neighborhood information. However, traditional GNNs suffer from inherent limitations such as over-smoothing (where node representations become indistinguishable in deep networks) and over-squashing (information loss when compressing paths into fixed-dimensional vectors) with increasing model depth.

Transformers, initially successful in natural language processing and computer vision, have shown great promise in generalizing to graph-structured data. However, graphs pose unique challenges due to their complex structural topology and attribute features, which cannot be directly encoded into Transformers as simple tokens. Existing graph Transformers typically treat all nodes in a graph as independent tokens and form a single, very long sequence for the Transformer model. This design leads to a quadratic computational complexity with respect to the number of nodes for the self-attention mechanism, making them computationally prohibitive and memory-intensive for large graphs. Furthermore, traditional scalability strategies for GNNs (like node sampling or approximation propagation) are not directly applicable to graph Transformers, which rely on global attention.

The paper's entry point is to rethink how tokens are defined and processed in graph Transformers to overcome this scalability bottleneck while preserving the powerful representational capabilities of Transformers.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Novel Hop2Token Module: The introduction of Hop2Token, a new neighborhood aggregation method that transforms features from different-hop neighborhoods into a sequence of token vectors for each individual node. This module effectively encodes the structural information of each node's multi-hop neighborhood into a sequence.

  • NAGphormer Architecture: The proposal of NAGphormer, a new graph Transformer model for node classification. By treating each node as an independent sequence of Hop2Token-generated tokens, NAGphormer enables mini-batch training, making it scalable to large graphs, a significant advancement over prior graph Transformers.

  • Attention-based Readout Function: Development of an attention-based readout function that adaptively learns the importance of different-hop neighborhoods, further enhancing the model's ability to learn expressive node representations.

  • Theoretical Analysis: Mathematical proof that NAGphormer can learn more expressive node representations from multi-hop neighborhoods compared to advanced decoupled Graph Convolutional Networks (GCNs), by showing that decoupled GCNs implicitly use a fixed, incomplete attention mechanism.

  • Empirical Validation: Extensive experiments on a variety of benchmark datasets, including both small and large-scale graphs, demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs in terms of classification accuracy and efficiency on large graphs.

    These findings solve the critical scalability problem for graph Transformers, allowing them to leverage their powerful attention mechanisms on real-world large-scale graphs, while also improving representation learning by explicitly modeling multi-hop neighborhood information.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand NAGphormer, a reader should be familiar with the following foundational concepts:

3.1.1. Graphs and Graph Representation

A graph is a data structure consisting of nodes (or vertices) and edges (or links) that connect them.

  • Nodes ($V$): Entities in the graph (e.g., people, documents, proteins). Each node $v \in V$ can have an associated feature vector $\mathbf{x}_v$ representing its attributes.
  • Edges ($E$): Relationships between nodes. Edges can be directed or undirected, weighted or unweighted.
  • Adjacency Matrix ($\mathbf{A}$): A square matrix where $\mathbf{A}_{ij} = 1$ if there is an edge between node $i$ and node $j$, and 0 otherwise.
  • Degree Matrix ($\mathbf{D}$): A diagonal matrix where $\mathbf{D}_{ii}$ is the degree of node $i$ (the number of edges connected to it).
  • Normalized Adjacency Matrix ($\hat{\mathbf{A}}$): A common normalization for graph neural networks, often calculated as $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops ($\mathbf{I}$ being the identity matrix) and $\tilde{\mathbf{D}}$ is its corresponding degree matrix. This normalization helps prevent numerical instabilities and scales feature propagation (a short code sketch of this normalization follows the list).
  • Node Classification: A common task in graph mining where the goal is to predict the category or label of an unlabeled node based on its features and its connections within the graph, given a set of labeled nodes.
  • $k$-hop Neighborhood: For a given node $v$, its $k$-hop neighborhood $\mathcal{N}^k(v)$ includes all nodes that can be reached from $v$ by traversing at most $k$ edges. The 0-hop neighborhood $\mathcal{N}^0(v)$ is typically just the node $v$ itself.
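The following is a minimal sketch, assuming a SciPy sparse adjacency matrix, of how the normalized adjacency with self-loops can be computed; the function and variable names are illustrative and not taken from the paper's code.

```python
# Minimal sketch: symmetrically normalized adjacency with self-loops,
# A_hat = D~^{-1/2} (A + I) D~^{-1/2}.
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(adj: sp.spmatrix) -> sp.csr_matrix:
    """adj: sparse adjacency matrix of an undirected graph."""
    adj_tilde = adj + sp.eye(adj.shape[0])           # A~ = A + I (self-loops)
    deg = np.asarray(adj_tilde.sum(axis=1)).ravel()  # D~_ii = row sums of A~
    d_inv_sqrt = 1.0 / np.sqrt(deg)                  # self-loops guarantee deg >= 1
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    return (D_inv_sqrt @ adj_tilde @ D_inv_sqrt).tocsr()

# Example: a 3-node path graph 0-1-2
A = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float))
A_hat = normalized_adjacency(A)
print(A_hat.toarray())
```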

3.1.2. Graph Neural Networks (GNNs)

GNNs are a class of neural networks designed to operate directly on graph-structured data. They learn node representations (also called embeddings) by iteratively aggregating information from a node's local neighborhood.

  • Message Passing Mechanism: The core idea behind most GNNs. In each layer, a node's representation is updated by aggregating messages (transformed features) from its neighbors and combining them with its own previous representation. This process allows information to flow across the graph.
  • Over-smoothing: A common problem in deep GNNs where, after many layers of message passing, the representations of all nodes in a connected component of a graph become increasingly similar, eventually becoming indistinguishable. This limits the model's ability to learn distinct representations for different nodes.
  • Over-squashing: Another issue in GNNs, where information from distant nodes might be lost or overly compressed when passed through long paths in the graph, especially in bottleneck structures.

3.1.3. Transformers

Transformers are deep learning architectures that rely heavily on the self-attention mechanism.

  • Self-Attention Mechanism: A mechanism that allows a model to weigh the importance of different parts of an input sequence when processing a specific element. For an input sequence of tokens, self-attention computes an output representation for each token by summing up linearly transformed versions of all tokens in the sequence, weighted by attention scores.
    • Query ($\mathbf{Q}$), Key ($\mathbf{K}$), Value ($\mathbf{V}$): These are generated by linearly transforming the input representations: $\mathbf{Q} = \mathbf{H} \mathbf{W}^Q$, $\mathbf{K} = \mathbf{H} \mathbf{W}^K$, $\mathbf{V} = \mathbf{H} \mathbf{W}^V$, where $\mathbf{H} \in \mathbb{R}^{n \times d}$ is the input matrix ($n$ tokens of dimension $d$) and $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V$ are learnable weight matrices.
    • Attention Score Calculation: The attention scores are computed as a scaled dot product between queries and keys: $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_K}}\right)\mathbf{V}$, where $d_K$ is the dimension of the key vectors, used for scaling so that large dot-product values do not push the softmax into regions with very small gradients. The softmax function normalizes these scores into a probability distribution (a minimal code sketch of this computation follows the list).
  • Multi-Head Self-Attention (MSA): An extension of self-attention where the attention mechanism is run multiple times in parallel with different linear projections. The outputs from these "heads" are then concatenated and linearly transformed to produce the final output. This allows the model to attend to different parts of the input from various "representation subspaces."
  • Feed-Forward Network (FFN): A simple neural network (typically two linear transformations with a non-linear activation in between) applied independently to each position in the sequence after the self-attention layer.
  • Positional Encoding: Since Transformers process sequences without inherent order, positional encodings are added to the input embeddings to provide information about the relative or absolute position of each token in the sequence.
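To make the mechanism concrete, here is a minimal single-head sketch of scaled dot-product self-attention in PyTorch; names are illustrative and this is not code from the paper.

```python
# Minimal single-head scaled dot-product self-attention over an input H of
# shape (n_tokens, d_model); parameter names are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(H, W_q, W_k, W_v):
    Q, K, V = H @ W_q, H @ W_k, H @ W_v            # linear projections
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (n_tokens, n_tokens)
    attn = F.softmax(scores, dim=-1)               # row-wise attention weights
    return attn @ V                                # weighted sum of values

H = torch.randn(11, 64)                 # e.g., K+1 = 11 hop tokens of one node
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(H, W_q, W_k, W_v)  # shape (11, 64)
```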

3.2. Previous Works

3.2.1. Graph Neural Networks (GNNs)

The paper primarily references GNNs as a baseline and highlights their limitations, especially the coupled design in traditional GCNs.

  • Graph Convolutional Network (GCN) (Kipf & Welling, 2017): A seminal GNN model that applies a first-order approximation of spectral graph convolutions. A GCN layer combines feature transformation and neighborhood aggregation: $\mathbf{H}^{(l+1)} = \sigma(\hat{\mathbf{A}} \mathbf{H}^{(l)} \mathbf{W}^{(l)})$, where $\mathbf{H}^{(l)}$ is the node feature matrix at layer $l$, $\hat{\mathbf{A}}$ is the normalized adjacency matrix with self-loops, $\mathbf{W}^{(l)}$ is a learnable weight matrix, and $\sigma(\cdot)$ is a non-linear activation function.
    • Background: This formula says that to obtain the next layer's representation for a node, you aggregate the features of its neighbors (including itself, since $\hat{\mathbf{A}}$ contains self-loops) and then apply a linear transformation and an activation function.
  • Graph Attention Network (GAT) (Veličković et al., 2018): Introduces an attention mechanism into the message passing process, allowing nodes to selectively attend to their neighbors based on their features rather than using fixed weights (as in GCN).
  • Decoupled GCNs (e.g., APPNP, GPRGNN, GCNII): To mitigate over-smoothing and over-squashing, these models separate the feature transformation and neighborhood aggregation steps.
    • Motivation: Traditional GCNs couple feature transformation (applying $\mathbf{W}^{(l)}$) and propagation (multiplying by $\hat{\mathbf{A}}$). When stacked, non-linearities and repeated propagation can lead to over-smoothing. Decoupled GCNs apply feature transformation upfront, then perform multiple propagation steps without intermediate non-linearities, and finally combine the propagated features.
    • General Form (Chien et al., 2021): $\mathbf{Z} = \sum_{k=0}^K \beta_k \mathbf{H}^{(k)}$, with $\mathbf{H}^{(k)} = \hat{\mathbf{A}} \mathbf{H}^{(k-1)}$ and $\mathbf{H}^{(0)} = f_{\boldsymbol{\theta}}(\mathbf{X})$. Here, $f_{\boldsymbol{\theta}}(\mathbf{X})$ is an initial feature transformation (e.g., a multi-layer perceptron, MLP) applied to the raw features $\mathbf{X}$. Then $\mathbf{H}^{(k)}$ represents features propagated for $k$ steps, and the final node representations $\mathbf{Z}$ are a weighted sum of these multi-hop propagated features, with $\beta_k$ being learnable aggregation coefficients. This allows capturing deeper structural information without immediate over-smoothing (a minimal sketch of this propagation scheme follows the list).
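As referenced above, the following is a minimal PyTorch-style sketch of the decoupled propagate-then-mix scheme; the class and parameter names are illustrative assumptions, not the APPNP/GPRGNN implementations.

```python
# Minimal sketch of decoupled-GCN style propagation: transform features once,
# propagate K times, then mix hops with learnable coefficients beta_k.
import torch
import torch.nn as nn

class DecoupledGCN(nn.Module):
    def __init__(self, in_dim, hid_dim, K):
        super().__init__()
        self.f_theta = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, hid_dim))
        self.beta = nn.Parameter(torch.ones(K + 1) / (K + 1))  # hop coefficients
        self.K = K

    def forward(self, X, A_hat):        # A_hat: (n, n) normalized adjacency (dense tensor here)
        H = self.f_theta(X)             # H^(0) = f_theta(X)
        Z = self.beta[0] * H
        for k in range(1, self.K + 1):
            H = A_hat @ H               # H^(k) = A_hat H^(k-1)
            Z = Z + self.beta[k] * H    # Z = sum_k beta_k H^(k)
        return Z
```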

3.2.2. Graph Transformers

Existing graph Transformers integrate graph structural information into the Transformer architecture using three main strategies:

  1. Structural Encoding: Replacing or augmenting positional encoding with graph-specific structural information.
    • Examples: Using Laplacian eigenvectors (Dwivedi & Bresson, 2020; Kreuzer et al., 2021) or degree-related features (Ying et al., 2021) as positional encodings.
  2. GNNs as Auxiliary Modules: Using GNNs to extract local structural information, which is then fed into a Transformer to learn long-range dependencies.
    • Examples: Jain et al. (2021) use GNNs to provide fixed local structural information. Chen et al. (2022) use GNNs to learn $k$-subtree and $k$-subgraph information. Rampásek et al. (2022) propose a hybrid layer combining a GNN and a self-attention layer.
  3. Graph Bias in Attention Matrix: Modifying the self-attention scores to incorporate graph structural biases.
    • Examples: Ying et al. (2021) use shortest-path distance to bias attention. Zhao et al. (2021) incorporate proximity-enhanced attention. Dwivedi et al. (2020) and Hussain et al. (2022) inject edge features into the self-attention module. Wu et al. (2022) use topology structural information as a relational bias.

      The paper notes that most of these methods, except for GraphGPS and Nodeformer (which have linear complexity), adopt a fully-connected attention mechanism on all node pairs, leading to quadratic spatial complexity with respect to the number of nodes. This makes them unsuitable for large graphs. Gophormer (Zhao et al., 2021) attempts scalability by sampling ego-graphs, but the sampling is still time-consuming and may limit neighborhood information.

3.3. Technological Evolution

Graph representation learning has evolved from early matrix factorization methods to deep learning approaches. GNNs revolutionized the field by enabling end-to-end learning on graph structures, mimicking spectral graph theory and message passing. However, their limitations (over-smoothing, over-squashing, scalability) pushed research towards more robust and scalable architectures. Concurrently, the success of Transformers in sequential data led to efforts to adapt them to graphs. Initial graph Transformers often adopted a "global attention" view, treating all nodes as tokens, which inherited the Transformer's quadratic complexity. The challenge then became how to retain Transformer's power while making it scalable for graphs. This paper fits into this evolution by proposing a tokenization scheme specific to individual nodes' neighborhoods, effectively transforming the global graph attention problem into a local sequence attention problem for each node, enabling scalability and better capture of multi-hop information.

3.4. Differentiation Analysis

The core differences and innovations of NAGphormer compared to prior work, particularly the main methods in related work, are:

  1. Tokenization Strategy:

    • Prior Graph Transformers: Most existing graph Transformers treat each node as an independent token and form one long sequence of all nodes in the graph. This leads to global self-attention and quadratic complexity ($O(N^2)$, where $N$ is the total number of nodes).
    • NAGphormer: Treats each node as a sequence itself. This sequence is composed of tokens generated from the node's multi-hop neighborhoods by the Hop2Token module. This redefinition allows for local self-attention within each node's neighborhood sequence, enabling mini-batch training.
  2. Scalability:

    • Prior Graph Transformers: Generally struggle with large graphs due to quadratic complexity, often leading to out-of-memory errors, as observed in experiments with Graphormer, SAN, and GraphGPS.
    • NAGphormer: The tokenized approach and mini-batch training inherently solve the scalability issue, making it applicable to large graphs with millions of nodes and edges. Its complexity becomes $O(n (K+1)^2 d)$, where $n$ is the number of nodes in a batch, $K$ is the number of hops, and $d$ is the feature dimension. The $(K+1)^2$ term is typically much smaller than $N^2$.
  3. Information Aggregation:

    • Traditional GNNs (e.g., GCN, GAT): Aggregate information primarily from immediate neighbors in a coupled fashion, potentially suffering from over-smoothing and over-squashing when going deeper. Decoupled GCNs improve this but use fixed aggregation weights (as shown in Fact 1).
    • NAGphormer: Explicitly constructs tokens for different hops via Hop2Token. It then uses the Transformer's self-attention mechanism to learn semantic correlations between these different-hop neighborhood tokens for a given node. The attention-based readout further adaptively weighs these hop-wise representations. This allows for a richer and more adaptive capture of multi-hop information.
  4. Theoretical Foundation:

    • Decoupled GCNs: The paper shows that decoupled GCNs can be viewed as applying a self-attention mechanism with a fixed attention matrix, limiting their expressiveness in combining multi-hop features.

    • NAGphormer: Utilizes the full flexibility of the Transformer's self-attention to dynamically learn correlations between different hops, offering theoretically more informative representations.

      In essence, NAGphormer innovates by changing the fundamental unit of attention from "node" to "node's multi-hop neighborhood sequence," thus making graph Transformers both powerful and scalable.

4. Methodology

4.1. Principles

The core idea behind NAGphormer is to enable Transformer models to process large-scale graph data effectively by redefining how graph information is tokenized. Instead of treating each node in the entire graph as an independent token in a single long sequence (which leads to quadratic complexity for self-attention), NAGphormer treats each node as a sequence of tokens, where each token represents aggregated information from a specific hop of its neighborhood. This novel tokenization scheme, achieved through the Hop2Token module, allows for mini-batch training and significantly reduces the computational burden, making Transformer architectures viable for large graphs. The model then uses a standard Transformer encoder to learn relationships within these hop-wise tokens and an attention-based readout function to adaptively combine them for the final node representation.

4.2. Core Methodology In-depth (Layer by Layer)

The NAGphormer model framework can be broken down into several key components: Structural Encoding, Hop2Token module, a linear projection, a Transformer Encoder backbone, and an Attention-based Readout Function.

The following figure (Figure 1 from the original paper) shows the model framework of NAGphormer:

Figure 1: Model framework of NAGphormer. NAGphormer first uses a novel neighborhood aggregation module, Hop2Token, to construct a sequence for each node based on the tokens of different hops of neighbors. Then, NAGphormer learns the node representations using a Transformer backbone, and an attention-based readout function is developed to aggregate neighborhood information of different hops adaptively. An MLP-based module is used in the end for label prediction.

4.2.1. Problem Formulation

The paper defines the problem as node classification on an unweighted and undirected attributed graph $G = (V, E)$.

  • $V = \{v_1, v_2, \ldots, v_n\}$ is the set of $n$ nodes.
  • Each node $v \in V$ has a feature vector $\mathbf{x}_v$, forming a feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the node features.
  • $\mathbf{A} \in \mathbb{R}^{n \times n}$ is the adjacency matrix, and $\mathbf{D}$ is the diagonal degree matrix.
  • The normalized adjacency matrix is $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (adjacency matrix with self-loops) and $\tilde{\mathbf{D}}$ is its corresponding degree matrix.
  • The task is to predict the labels $\mathbf{Y}_{V_u}$ of an unlabeled node set $V_u$, given the labels $\mathbf{Y}_{V_l}$ of a labeled node set $V_l$.

4.2.2. Structural Encoding

Before generating tokens, NAGphormer enhances the initial node features by incorporating structural information of the graph. This is achieved by concatenating the original attribute features with Laplacian eigenvectors.

  • The Laplacian matrix of a graph (or its normalized variants) is a matrix representation that captures the graph's structure. Its eigenvectors encode fundamental properties about the graph's connectivity and topology.
  • The paper selects the eigenvectors corresponding to the $s$ smallest non-trivial eigenvalues of the Laplacian matrix to construct a structure matrix $\mathbf{U} \in \mathbb{R}^{n \times s}$. These eigenvectors provide a low-dimensional structural embedding for each node.
  • This structure matrix $\mathbf{U}$ is then concatenated with the original feature matrix $\mathbf{X}$ to form a fused feature matrix $\mathbf{X}'$: $\mathbf{X}' = \mathbf{X} \parallel \mathbf{U}$, where $\parallel$ denotes the concatenation operator. The dimension of the effective feature vector for each node becomes $d' = d + s$. This $\mathbf{X}' \in \mathbb{R}^{n \times d'}$ is then used as the input to the Hop2Token module, ensuring that both attribute and structural information are considered from the outset (a short code sketch of this step follows the list).
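As referenced above, a minimal sketch of the structural encoding step, assuming SciPy's `eigsh` for the smallest Laplacian eigenpairs; names are illustrative and not taken from the official code.

```python
# Minimal sketch: structural encoding from the s smallest non-trivial
# eigenvectors of the normalized graph Laplacian, concatenated with features.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def structural_encoding(adj: sp.spmatrix, X: np.ndarray, s: int) -> np.ndarray:
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = sp.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    # Take s+1 smallest eigenpairs and drop the trivial (near-zero) eigenvector.
    eigvals, eigvecs = eigsh(L, k=s + 1, which="SM")
    U = eigvecs[:, 1:s + 1]
    return np.concatenate([X, U], axis=1)           # X' = X || U
```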

4.2.3. Hop2Token Module

The Hop2Token module is a crucial component that aggregates neighborhood features from multiple hops and transforms each hop into a representation, which serves as a token.

  • Definition of $k$-hop Neighborhood: For a node $v$, its $k$-hop neighborhood $\mathcal{N}^k(v)$ is defined as the set of all nodes $u$ such that the shortest-path distance $d(v, u)$ between $v$ and $u$ is at most $k$. The 0-hop neighborhood $\mathcal{N}^0(v)$ is just the node $v$ itself.
  • Neighborhood Embedding: For each node $v$, Hop2Token generates a neighborhood embedding $\mathbf{x}_v^k$ for its $k$-hop neighborhood by applying an aggregation operator $\phi$: $\mathbf{x}_v^k = \phi(\mathcal{N}^k(v))$.
  • Sequence Construction: By computing these neighborhood embeddings for a fixed number of hops $K$ (a hyperparameter), Hop2Token constructs a sequence $\mathcal{S}_v$ for each node $v$: $\mathcal{S}_v = (\mathbf{x}_v^0, \mathbf{x}_v^1, \ldots, \mathbf{x}_v^K)$, where each $\mathbf{x}_v^k$ is a $d'$-dimensional vector. This sequence represents the node's multi-hop neighborhood information.
  • Implementation of $\phi$ and $\mathcal{S}_v$: In practice, Hop2Token uses a simplified propagation process, similar to decoupled GCNs, to obtain the $k$-hop neighborhood features. The $k$-hop neighborhood matrix $\mathbf{X}_k$ (where each row corresponds to a node's $k$-hop aggregated features) is computed by repeatedly multiplying the initial feature matrix $\mathbf{X}'$ with the normalized adjacency matrix $\hat{\mathbf{A}}$: $\mathbf{X}_k = \hat{\mathbf{A}}^k \mathbf{X}'$, with $\mathbf{X}_0 = \mathbf{X}'$ (the structurally encoded features). The sequence of neighborhood matrices for all nodes in the graph is then denoted $\mathcal{S} = (\mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_K)$, where each $\mathbf{X}_k \in \mathbb{R}^{n \times d'}$ contains the $k$-hop features of all $n$ nodes.

The detailed implementation is shown in Algorithm 1, reproduced from the original paper:

Algorithm 1: The Hop2Token Algorithm
  • Input: Normalized adjacency matrix $\hat{\mathbf{A}}$; feature matrix $\mathbf{X}$; propagation step $K$
  • Output: Sequences of all nodes $\mathbf{X}_G$
  1. for $k = 0$ to $K$ do
  2.   for $i = 0$ to $n$ do
  3.     $\mathbf{X}_G[i, k] = \mathbf{X}[i]$
  4.   end for
  5.   $\mathbf{X} = \hat{\mathbf{A}} \mathbf{X}$
  6. end for
  7. return sequences of all nodes $\mathbf{X}_G$

Explanation of Algorithm 1:

  • Input: Normalized adjacency matrix $\hat{\mathbf{A}}$ (with self-loops), initial feature matrix $\mathbf{X}$ (which in practice is the structurally encoded $\mathbf{X}'$, as stated in Section 3.3), and the maximum propagation step $K$.
  • Output: Sequences of all nodes, $\mathbf{X}_G$. Conceptually, $\mathbf{X}_G$ is a tensor of shape $\mathbb{R}^{n \times (K+1) \times d'}$, where $n$ is the number of nodes, $K+1$ is the number of tokens per node (0-hop to $K$-th hop), and $d'$ is the feature dimension.
  • Outer loop (for $k = 0$ to $K$): iterates through each hop level from 0 to $K$.
  • Inner loop (for $i = 0$ to $n$): populates $\mathbf{X}_G$ (in practice this is a single matrix copy rather than an explicit per-node loop).
  • $\mathbf{X}_G[i, k] = \mathbf{X}[i]$: at each step $k$, the current $\mathbf{X}$ (which has been propagated $k$ times) is copied into the $k$-th token slot of each node $i$.
  • $\mathbf{X} = \hat{\mathbf{A}} \mathbf{X}$: after storing the features for the current hop $k$, the feature matrix is multiplied by $\hat{\mathbf{A}}$, aggregating information from immediate neighbors and propagating features one more hop. So after the iteration for $k=0$, $\mathbf{X}$ becomes $\hat{\mathbf{A}}\mathbf{X}'$ (i.e., $\mathbf{X}_1$); after the iteration for $k=1$, it becomes $\hat{\mathbf{A}}(\hat{\mathbf{A}}\mathbf{X}') = \mathbf{X}_2$, and so on.
  • Advantages: This module is non-parametric and can be run offline before training. Its output $\mathbf{X}_G$ supports mini-batch training because each node's sequence $\mathcal{S}_v$ is self-contained, allowing the Transformer to process a batch of node sequences independently. This is key to scaling to large graphs (a Python sketch of this precomputation follows the list).
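A minimal NumPy/SciPy sketch of this offline precomputation (illustrative, not the authors' implementation):

```python
# Minimal sketch of the Hop2Token precomputation: stack the K+1 propagated
# feature matrices into a tensor X_G of shape (n, K+1, d'), computed offline.
import numpy as np
import scipy.sparse as sp

def hop2token(A_hat: sp.spmatrix, X: np.ndarray, K: int) -> np.ndarray:
    """A_hat: normalized adjacency with self-loops; X: (n, d') fused features."""
    tokens = [X]                        # k = 0: the node's own features
    H = X
    for _ in range(K):
        H = A_hat @ H                   # X_k = A_hat^k X'
        tokens.append(H)
    return np.stack(tokens, axis=1)     # (n, K+1, d'), one token sequence per node

# Usage: X_G = hop2token(A_hat, X_fused, K=7); batches of rows of X_G can then
# be fed to the Transformer independently, enabling mini-batch training.
```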

4.2.4. Linear Projection

The Hop2Token module outputs a sequence of $K+1$ token vectors for each node $v$, $\mathcal{S}_v = (\mathbf{x}_v^0, \mathbf{x}_v^1, \ldots, \mathbf{x}_v^K)$, where each $\mathbf{x}_v^k \in \mathbb{R}^{d'}$. Before being fed into the Transformer encoder, these are mapped to the Transformer's hidden dimension $d_m$ using a learnable linear projection matrix $\mathbf{E}$: $\mathbf{Z}_v^{(0)} = \left[ \mathbf{x}_v^0 \mathbf{E} ; \mathbf{x}_v^1 \mathbf{E} ; \ldots ; \mathbf{x}_v^K \mathbf{E} \right]$. Here, $\mathbf{E} \in \mathbb{R}^{d' \times d_m}$ is the learnable projection matrix, and $\mathbf{Z}_v^{(0)} \in \mathbb{R}^{(K+1) \times d_m}$ is the initial input sequence of node $v$ to the Transformer encoder. Each row of $\mathbf{Z}_v^{(0)}$ corresponds to a token representing a specific hop's aggregated features, now projected into the Transformer's model dimension.

4.2.5. Transformer Encoder

The projected sequences $\mathbf{Z}_v^{(0)}$ are then fed into a standard Transformer encoder consisting of $L$ identical layers. Each layer $\ell$ (for $\ell = 1, \ldots, L$) comprises a Multi-Head Self-Attention (MSA) module and a Position-wise Feed-Forward Network (FFN), with Layer Normalization (LN) applied before each block and residual connections (summing the input of each block with its output).

  • Multi-Head Self-Attention (MSA): The MSA module takes the input sequence $\mathbf{Z}_v^{(\ell-1)}$ and captures semantic correlations between the different hop tokens within that sequence. This means the 0-hop token (the node itself) can attend to the 1-hop, 2-hop, ..., $K$-hop tokens, and vice versa: $\mathbf{Z}_v^{\prime(\ell)} = \operatorname{MSA}\left(\operatorname{LN}\left(\mathbf{Z}_v^{(\ell-1)}\right)\right) + \mathbf{Z}_v^{(\ell-1)}$. Here, $\operatorname{LN}(\cdot)$ performs layer normalization, and the output of MSA is added to its input (residual connection).
  • Feed-Forward Network (FFN): The FFN further transforms the output of the MSA block. It consists of two linear layers with a GELU non-linearity in between: $\mathbf{Z}_v^{(\ell)} = \operatorname{FFN}\left(\operatorname{LN}\left(\mathbf{Z}_v^{\prime(\ell)}\right)\right) + \mathbf{Z}_v^{\prime(\ell)}$. This process is repeated for $L$ layers. The output of the final Transformer layer for node $v$ is $\mathbf{Z}_v^{(L)} \in \mathbb{R}^{(K+1) \times d_m}$, which contains refined embeddings for all $K+1$ hop tokens. A pre-LN encoder layer of this form is sketched after the list.
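A minimal pre-LN encoder-layer sketch in PyTorch, matching the MSA/FFN structure above; this is an assumption about implementation details, not the official code.

```python
# Minimal pre-LN Transformer encoder layer operating on a batch of hop-token
# sequences of shape (batch, K+1, d_m).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_m, n_heads=8, ffn_mult=2, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_m)
        self.msa = nn.MultiheadAttention(d_m, n_heads, dropout=dropout,
                                         batch_first=True)
        self.ln2 = nn.LayerNorm(d_m)
        self.ffn = nn.Sequential(nn.Linear(d_m, ffn_mult * d_m), nn.GELU(),
                                 nn.Linear(ffn_mult * d_m, d_m))

    def forward(self, Z):                                  # Z: (batch, K+1, d_m)
        h = self.ln1(Z)
        Z = Z + self.msa(h, h, h, need_weights=False)[0]   # MSA + residual
        Z = Z + self.ffn(self.ln2(Z))                      # FFN + residual
        return Z

# Usage: layer = EncoderLayer(d_m=512); out = layer(torch.randn(2000, 11, 512))
```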

4.2.6. Attention-based Readout Function

After the Transformer encoder, the output $\mathbf{Z}_v^{(L)}$ (denoted simply as $\mathbf{Z}$ in this section) contains $K+1$ embeddings for node $v$, where $\mathbf{Z}_0$ is the embedding of the node itself (0-hop) and $\mathbf{Z}_k$ is the embedding of the $k$-hop neighborhood. To obtain a single final node representation for classification, a readout function is needed. Common choices such as summation or mean treat all hops equally. NAGphormer instead proposes an attention-based readout function that adaptively weighs the importance of the different hop embeddings (a code sketch follows the list).

  • Attention Coefficient Calculation: The function computes normalized attention coefficients $\alpha_k$ for each $k$-hop neighborhood (for $k = 1$ to $K$) by measuring its correlation with the 0-hop representation (the node itself): $\alpha_k = \frac{\exp\left((\mathbf{Z}_0 \parallel \mathbf{Z}_k) \mathbf{W}_a^\top\right)}{\sum_{i=1}^K \exp\left((\mathbf{Z}_0 \parallel \mathbf{Z}_i) \mathbf{W}_a^\top\right)}$ where:
    • $\mathbf{Z}_0 \in \mathbb{R}^{1 \times d_m}$ is the embedding of the node itself (0-hop).
    • $\mathbf{Z}_k \in \mathbb{R}^{1 \times d_m}$ is the embedding of the $k$-th hop neighborhood.
    • $\parallel$ denotes concatenation, so $(\mathbf{Z}_0 \parallel \mathbf{Z}_k) \in \mathbb{R}^{1 \times 2d_m}$.
    • $\mathbf{W}_a \in \mathbb{R}^{1 \times 2d_m}$ is a learnable projection matrix (a weight vector).
    • The $\exp(\cdot)$ and the summation in the denominator ensure that the coefficients $\alpha_k$ form a probability distribution (they sum to 1).
  • Final Node Representation: The final aggregated representation $\mathbf{Z}_{out}$ of node $v$ is then computed as a weighted sum of the $k$-hop representations, added to the node's own 0-hop representation: $\mathbf{Z}_{out} = \mathbf{Z}_0 + \sum_{k=1}^K \alpha_k \mathbf{Z}_k$. This yields a single vector $\mathbf{Z}_{out} \in \mathbb{R}^{1 \times d_m}$ representing node $v$, which is then fed into a simple MLP (Multi-Layer Perceptron) for label prediction. The MLP transforms the final node embedding into class probabilities.
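A minimal PyTorch sketch of this attention-based readout (an illustrative assumption, not the authors' code):

```python
# Minimal attention-based readout: attend over the K hop embeddings Z_1..Z_K
# via their correlation with the node's own embedding Z_0, then add the
# weighted sum back to Z_0.
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    def __init__(self, d_m):
        super().__init__()
        self.W_a = nn.Linear(2 * d_m, 1, bias=False)   # learnable projection W_a

    def forward(self, Z):                  # Z: (batch, K+1, d_m) encoder output
        Z0, Zk = Z[:, :1, :], Z[:, 1:, :]  # (batch, 1, d_m), (batch, K, d_m)
        pairs = torch.cat([Z0.expand_as(Zk), Zk], dim=-1)  # (Z_0 || Z_k)
        alpha = torch.softmax(self.W_a(pairs), dim=1)      # (batch, K, 1)
        return Z0.squeeze(1) + (alpha * Zk).sum(dim=1)     # Z_out: (batch, d_m)

# The resulting Z_out is then passed to an MLP classifier for label prediction.
```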

4.3. Theoretical Analysis of NAGphormer

The paper provides a theoretical analysis comparing NAGphormer with decoupled GCNs, arguing that NAGphormer can learn more informative representations due to its self-attention mechanism and Hop2Token.

Fact 1: From the perspective of the output node representations of Hop2Token, the paper asserts that a decoupled GCN can be regarded as applying a self-attention mechanism with a fixed attention matrix $\mathbf{S} \in \mathbb{R}^{(K+1) \times (K+1)}$, where $\mathbf{S}_{K, k} = \beta_k$ for $k \in \{0, \ldots, K\}$.

Proof of Fact 1 (from Appendix C):

  • Decoupled GCN Output: For an arbitrary node $i$, each element $\mathbf{Z}_{i, m}$ (where $m \in \{1, \ldots, d\}$ indexes a feature dimension) of its output representation $\mathbf{Z}_i \in \mathbb{R}^{1 \times d}$ learned by a decoupled GCN (Equation 2) is $\mathbf{Z}_{i, m} = \sum_{k=0}^K \beta_k \mathbf{H}_{i, m}^{(k)}$, where $\mathbf{H}_{i, m}^{(k)}$ is the $m$-th dimension of the $k$-hop feature vector of node $i$.
  • Hop2Token Output (Matrix Form): The output $\mathbf{X}_i \in \mathbb{R}^{(K+1) \times d}$ of Hop2Token for node $i$ can be written as a matrix whose rows correspond to hop features and whose columns correspond to feature dimensions: $\mathbf{X}_i = \begin{bmatrix} \mathbf{H}_{i,1}^{(0)} & \mathbf{H}_{i,2}^{(0)} & \cdots & \mathbf{H}_{i,d}^{(0)} \\ \mathbf{H}_{i,1}^{(1)} & \mathbf{H}_{i,2}^{(1)} & \cdots & \mathbf{H}_{i,d}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{H}_{i,1}^{(K)} & \mathbf{H}_{i,2}^{(K)} & \cdots & \mathbf{H}_{i,d}^{(K)} \end{bmatrix}$
  • Fixed Attention Matrix $\mathbf{S}$: Consider a special $(K+1) \times (K+1)$ attention matrix $\mathbf{S} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_0 & \beta_1 & \cdots & \beta_K \end{bmatrix}$, which is zero everywhere except its last row, containing the aggregation coefficients $\beta_k$ of the decoupled GCN.
  • Self-Attention Output with $\mathbf{S}$: Applying this fixed attention matrix $\mathbf{S}$ to $\mathbf{X}_i$ (as in self-attention, but with a fixed $\mathbf{S}$ instead of a dynamically computed attention matrix) yields the output matrix $\mathbf{T} = \mathbf{S} \mathbf{X}_i = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_1 & \gamma_2 & \cdots & \gamma_d \end{bmatrix} \in \mathbb{R}^{(K+1) \times d}$, where each element of the last row is $\gamma_m = \sum_{k=0}^K \beta_k \mathbf{H}_{i,m}^{(k)}$ for $m \in \{1, \ldots, d\}$.
  • Readout Function: If a simple summation readout is applied to the rows of $\mathbf{T}$ to obtain the final representation $\mathbf{T}^{final} \in \mathbb{R}^{1 \times d}$ of node $i$, then each element is $\mathbf{T}_m^{final} = \sum_{k=0}^K \mathbf{T}_{k,m} = 0 + \cdots + 0 + \gamma_m = \sum_{k=0}^K \beta_k \mathbf{H}_{i,m}^{(k)} = \mathbf{Z}_{i,m}$. This demonstrates that the output of a decoupled GCN is equivalent to performing a fixed self-attention-like operation with the specific attention matrix $\mathbf{S}$ and then summing the results.

Implication: This fact highlights that decoupled GCNs use a fixed set of aggregation coefficients ($\beta_k$) applied uniformly across all nodes and features, which limits their ability to capture node-specific or feature-specific importance of different hops. In contrast, NAGphormer uses a dynamic self-attention mechanism within its Transformer encoder to learn relationships between hop tokens, and then a dynamic attention-based readout to adaptively weigh the importance of each hop for each individual node. This allows NAGphormer to learn more expressive node representations by capturing fine-grained semantic correlations and adaptive hop importance.
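To make Fact 1 concrete, the following small numerical check (not from the paper) verifies that a fixed attention matrix whose last row holds the $\beta_k$, followed by a row-sum readout, reproduces the decoupled-GCN combination $\sum_k \beta_k \mathbf{H}^{(k)}$.

```python
# Illustrative numerical check of Fact 1.
import numpy as np

K, d = 3, 5
rng = np.random.default_rng(0)
X_i = rng.normal(size=(K + 1, d))          # hop features H^(0)..H^(K) of node i
beta = rng.random(K + 1)                    # decoupled-GCN aggregation weights

S = np.zeros((K + 1, K + 1))
S[-1] = beta                                # fixed attention matrix of Fact 1

decoupled = beta @ X_i                      # sum_k beta_k H^(k)
via_fixed_attention = (S @ X_i).sum(axis=0) # row-sum readout of T = S X_i
assert np.allclose(decoupled, via_fixed_attention)
```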

4.4. Complexity Analysis of NAGphormer (from Appendix B)

4.4.1. Time Complexity

The time complexity of NAGphormer is primarily determined by the self-attention module within the Transformer encoder. For each node, whose sequence has $K+1$ tokens with hidden dimension $d_m$, the self-attention computation costs $O((K+1)^2 d_m)$; for a batch of $B$ nodes, the batch-wise time complexity is $O(B (K+1)^2 d_m)$. The paper states the overall time complexity as $O(n (K+1)^2 d)$, where $n$ is the total number of nodes in the graph (i.e., training over all nodes by summing the mini-batch operations), $K$ is the number of hops, and $d$ is the feature dimension (implicitly $d_m$). For comparison, traditional graph Transformers have a time complexity of $O(N^2 d)$, where $N$ is the total number of nodes in the graph. By effectively reducing the attention sequence length from $N$ to $K+1$ per node, NAGphormer achieves a significant speed-up whenever $K+1 \ll N$.
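As an illustrative back-of-the-envelope comparison (using the Reddit statistics from Table 4 and the practical choice $K = 10$ reported in Section 6.2.3), the per-layer attention cost drops from roughly

$N^2 \approx (2.3 \times 10^5)^2 \approx 5.4 \times 10^{10}$ token-pair interactions for full-graph attention to $N (K+1)^2 = 232{,}965 \times 11^2 \approx 2.8 \times 10^{7}$

when each node attends only over its own $K+1 = 11$ hop tokens, a reduction of more than three orders of magnitude.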

4.4.2. Space Complexity

The space complexity involves both model parameters and intermediate activations.

  • Model Parameters: The Transformer layers contribute $O(d_m^2 L)$ parameters, where $L$ is the number of Transformer layers and $d_m$ is the hidden dimension.
  • Intermediate Activations: During mini-batch training, the main memory consumption comes from storing the attention matrices and hidden node representations for a batch of $b$ nodes: $O(b (K+1)^2)$ for the attention matrices (a $(K+1) \times (K+1)$ matrix per node) and $O(b (K+1) d_m)$ for the hidden representations. The total space complexity is $O(b (K+1)^2 + b (K+1) d + d^2 L)$, where $b$ is the batch size, $K$ is the number of hops, $d$ is the feature dimension (again, implicitly $d_m$), and $L$ is the number of layers. This mini-batch memory consumption is far lower than that of full-batch graph Transformers, allowing large graphs to fit into GPU memory.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on nine widely used datasets of varying scales.

5.1.1. Small-Scale Datasets

  • Pubmed: A citation network where nodes are scientific papers and edges are citations.

  • CoraFull: Another citation network.

  • Computer: A co-purchase network where nodes are products and edges indicate frequently bought-together items.

  • Photo: Another co-purchase network.

  • CS (Computer Science): A co-authorship network where nodes are authors and edges indicate co-authorship.

  • Physics: Another co-authorship network.

    For these small-scale datasets, a standard 60%/20%/20% train/validation/test random split was used.

5.1.2. Large-Scale Datasets

  • AMiner-CS: A large citation network.

  • Reddit: A social network where nodes represent posts and edges indicate common user commenting.

  • Amazon2M: A very large co-purchase network.

    For these large-scale datasets, the splits followed the settings from Feng et al. (2022).

The following are the statistics from Table 4 of the original paper:

| Dataset | # Nodes | # Edges | # Features | # Classes |
| --- | --- | --- | --- | --- |
| Pubmed | 19,717 | 44,324 | 500 | 3 |
| CoraFull | 19,793 | 126,842 | 8,710 | 70 |
| Computer | 13,752 | 491,722 | 767 | 10 |
| Photo | 7,650 | 238,163 | 745 | 8 |
| CS | 18,333 | 163,788 | 6,805 | 15 |
| Physics | 34,493 | 495,924 | 8,415 | 5 |
| AMiner-CS | 593,486 | 6,217,004 | 100 | 18 |
| Reddit | 232,965 | 11,606,919 | 602 | 41 |
| Amazon2M | 2,449,029 | 61,859,140 | 100 | 47 |

These datasets were chosen to cover a range of scales, from tens of thousands of nodes (small-scale) to millions of nodes and tens of millions of edges (large-scale), and diverse domains (citation, co-purchase, co-authorship, social networks). This diversity ensures a comprehensive evaluation of the model's performance and scalability across different graph characteristics.

5.2. Evaluation Metrics

The primary evaluation metric used in the paper for node classification is accuracy.

5.2.1. Accuracy

  • Conceptual Definition: Accuracy is a common metric used to evaluate classification models. It measures the proportion of correctly classified instances (nodes in this case) out of the total number of instances. It gives a general sense of how well the model is performing across all classes.
  • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Symbol Explanation:
    • Number of Correct Predictions: The count of nodes for which the model's predicted class matches the true class.

    • Total Number of Predictions: The total count of nodes for which a prediction was made (typically the size of the test set).

      The results are reported as mean accuracy ± stdev (%) over multiple trials (10 trials for small-scale datasets), indicating both the average performance and its variability.

5.3. Baselines

NAGphormer was compared against twelve advanced baselines, categorized into three groups:

5.3.1. Full-Batch GNNs

These models require the entire graph in memory during training, making them unsuitable for very large graphs.

  • GCN (Kipf & Welling, 2017): Graph Convolutional Network.
  • GAT (Veličković et al., 2018): Graph Attention Network.
  • APPNP (Klicpera et al., 2019): Approximate Personalized PageRank, a decoupled GCN that uses Personalized PageRank for propagation.
  • GPRGNN (Chien et al., 2021): Adaptive Universal Generalized PageRank Graph Neural Network, another advanced decoupled GCN.

5.3.2. Scalable GNNs

These models are designed to handle large graphs, typically using sampling or approximation techniques.

  • GraphSAINT (Zeng et al., 2020): Graph Sampling Based Inductive Learning Method.
  • PPRGo (Bojchevski et al., 2020): Predict then Propagate, a scalable GNN using approximate PageRank.
  • GRAND+ (Feng et al., 2022): Scalable Graph Random Neural Networks.

5.3.3. Graph Transformers

Existing Transformer-based models adapted for graphs.

  • GT (Dwivedi & Bresson, 2020): A generalization of Transformer networks to graphs, often using Laplacian eigenvectors.

  • SAN (Kreuzer et al., 2021): Spectral Attention Network, leveraging the full spectrum of the Laplacian matrix.

  • Graphormer (Ying et al., 2021): Incorporates graph information bias into the attention matrix, e.g., shortest-path distance.

  • GraphGPS (Rampásek et al., 2022): A general, powerful, scalable graph transformer.

  • Gophormer (Zhao et al., 2021): Ego-graph Transformer for node classification, which samples ego-graphs to reduce computation.

    These baselines are representative as they cover both traditional GNNs, state-of-the-art scalable GNNs, and a range of existing graph Transformer approaches, allowing for a comprehensive comparison of NAGphormer's performance and scalability.

5.4. Implementation Details (from Appendix D.2)

  • Hyperparameter Tuning: Hyperparameters for baselines were set according to recommended settings in official implementations. For NAGphormer, the following ranges were searched:
    • Number of Transformer layers $L$: $\{1, 2, \ldots, 5\}$
    • Hidden dimension $d_m$: $\{128, 256, 512\}$
    • Propagation steps $K$: $\{2, 3, \ldots, 20\}$
  • Optimization: The AdamW (Loshchilov & Hutter, 2019) optimizer was used.
    • Learning rate: $\{10^{-3}, 5 \times 10^{-3}, 10^{-4}\}$
    • Weight decay: $\{10^{-4}, 5 \times 10^{-4}, 10^{-5}\}$
  • Regularization: The dropout rate was searched in $\{0.1, 0.3, 0.5\}$ (the full search space is collected in the sketch after this list).
  • Batch Size: Set to 2000.
  • Early Stopping: Training was stopped within 50 epochs if validation performance did not improve.
  • Hardware: Experiments were conducted on a Linux server with 1 Intel i9-9900k CPU, 1 RTX 2080TI GPU, and 64GB RAM.
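For reference, the search space above can be summarized as a small configuration dictionary; this is a hedged sketch, and the key names are illustrative rather than taken from the official implementation.

```python
# Hedged summary of the NAGphormer hyperparameter search space described above;
# key names are illustrative and do not come from the official code.
search_space = {
    "num_layers_L": [1, 2, 3, 4, 5],
    "hidden_dim_dm": [128, 256, 512],
    "propagation_steps_K": list(range(2, 21)),
    "learning_rate": [1e-3, 5e-3, 1e-4],
    "weight_decay": [1e-4, 5e-4, 1e-5],
    "dropout": [0.1, 0.3, 0.5],
    "batch_size": 2000,
    "optimizer": "AdamW",
    "early_stopping_patience_epochs": 50,
}
```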

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Comparison on Small-Scale Datasets

The following are the results from Table 1 of the original paper, comparing NAGphormer against various baselines on small-scale datasets:

| Method | Pubmed | CoraFull | Computer | Photo | CS | Physics |
| --- | --- | --- | --- | --- | --- | --- |
| GCN | 86.54 ± 0.12 | 61.76 ± 0.14 | 89.65 ± 0.52 | 92.70 ± 0.20 | 92.92 ± 0.12 | 96.18 ± 0.07 |
| GAT | 86.32 ± 0.16 | 64.47 ± 0.18 | 90.78 ± 0.13 | 93.87 ± 0.11 | 93.61 ± 0.14 | 96.17 ± 0.08 |
| APPNP | 88.43 ± 0.15 | 65.16 ± 0.28 | 90.18 ± 0.17 | 94.32 ± 0.14 | 94.49 ± 0.07 | 96.54 ± 0.07 |
| GPRGNN | 89.34 ± 0.25 | 67.12 ± 0.31 | 89.32 ± 0.29 | 94.49 ± 0.14 | 95.13 ± 0.09 | 96.85 ± 0.08 |
| GraphSAINT | 88.96 ± 0.16 | 67.85 ± 0.21 | 90.22 ± 0.15 | 91.72 ± 0.13 | 94.41 ± 0.09 | 96.43 ± 0.05 |
| PPRGo | 87.38 ± 0.11 | 63.54 ± 0.25 | 88.69 ± 0.21 | 93.61 ± 0.12 | 92.52 ± 0.15 | 95.51 ± 0.08 |
| GRAND+ | 88.64 ± 0.09 | 71.37 ± 0.11 | 88.74 ± 0.11 | 94.75 ± 0.12 | 93.92 ± 0.08 | 96.47 ± 0.04 |
| GT | 88.79 ± 0.12 | 61.05 ± 0.38 | 91.18 ± 0.17 | 94.74 ± 0.13 | 94.64 ± 0.13 | 97.05 ± 0.05 |
| Graphormer | OOM | OOM | OOM | 92.74 ± 0.14 | OOM | OOM |
| SAN | 88.22 ± 0.15 | 59.01 ± 0.34 | 89.83 ± 0.16 | 94.86 ± 0.10 | 94.51 ± 0.15 | OOM |
| GraphGPS | 88.94 ± 0.16 | 55.76 ± 0.23 | OOM | 95.06 ± 0.13 | 93.93 ± 0.12 | OOM |
| NAGphormer | **89.70 ± 0.19** | **71.51 ± 0.13** | **91.22 ± 0.14** | **95.49 ± 0.11** | **95.75 ± 0.09** | **97.34 ± 0.03** |

Note: "OOM" indicates an out-of-memory error; bold marks the best result on each dataset.

Analysis:

  • NAGphormer consistently achieves the best performance across all six small-scale datasets (Pubmed, CoraFull, Computer, Photo, CS, Physics), often by a notable margin. For example, on CoraFull, NAGphormer (71.51%) outperforms the next best method, GRAND+ (71.37%), as well as all other GNNs and graph Transformers.
  • Superiority over GNNs: NAGphormer's improved performance over GNN-based methods (like APPNP and GPRGNN, which are decoupled GCNs) supports the claim that its Hop2Token module and Transformer backbone better capture semantic relevance of different hop neighbors, which is overlooked by many GNNs.
  • Superiority over Graph Transformers: NAGphormer also surpasses existing graph Transformer-based methods like GT and SAN, even though they also use Laplacian eigenvectors for structural encoding. This indicates the effectiveness of NAGphormer's novel tokenization strategy and attention mechanisms.
  • Scalability Demonstration (Even on Small Graphs): Crucially, several existing graph Transformers (Graphormer, SAN, GraphGPS) encounter out-of-memory (OOM) errors even on some of these relatively small graphs. This strongly validates the necessity and success of NAGphormer's design in enabling scalability for Transformers on graph data.

6.1.2. Comparison on Large-Scale Datasets

The following are the results from Table 2 of the original paper, comparing NAGphormer against scalable GNNs on large-scale datasets:

| Method | AMiner-CS | Reddit | Amazon2M |
| --- | --- | --- | --- |
| PPRGo | 49.07 ± 0.19 | 90.38 ± 0.11 | 66.12 ± 0.59 |
| GraphSAINT | 51.86 ± 0.21 | 92.35 ± 0.08 | 75.21 ± 0.15 |
| GRAND+ | 54.67 ± 0.25 | 92.81 ± 0.03 | 75.49 ± 0.11 |
| NAGphormer | 56.21 ± 0.42 | 93.58 ± 0.05 | 77.43 ± 0.24 |

Analysis:

  • Only scalable GNNs are compared here because existing graph Transformers (as shown in the small-scale results) cannot handle such large-scale datasets due to their high computational cost.
  • NAGphormer again consistently outperforms all scalable GNN baselines (PPRGo, GraphSAINT, GRAND+) on all three large-scale datasets (AMiner-CS, Reddit, Amazon2M).
  • This strong performance on large graphs, together with the ability to train on them at all, is a key validation of NAGphormer's design for scalability and effective local information preservation. For instance, on Amazon2M it achieves 77.43% accuracy, clearly higher than GRAND+'s 75.49%.

6.1.3. Further Comparison with Gophormer (from Appendix E)

The following are the results from Table 5 of the original paper, comparing NAGphormer with Gophormer (a graph Transformer using ego-graph sampling) on specific datasets:

| Method | Cora | Citeseer | Pubmed | Blogcatalog | DBLP | Flickr |
| --- | --- | --- | --- | --- | --- | --- |
| Gophormer | 87.85 | 80.23 | 89.40 | 96.03 | 85.20 | 91.51 |
| NAGphormer | 88.15 | 80.12 | 89.70 | 96.73 | 85.95 | 91.80 |

Analysis:

  • NAGphormer outperforms Gophormer on most datasets (Cora, Pubmed, Blogcatalog, DBLP, Flickr), with a slight dip on Citeseer. This comparison is particularly relevant as Gophormer also aims for scalability through sampling. The results suggest that NAGphormer's structured Hop2Token approach captures more comprehensive neighborhood information than ego-graph sampling.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Structural Encoding

The following are the results from Table 3 of the original paper, showing the impact of structural encoding:

| Setting | Pubmed | CoraFull | CS | Computer | Photo | Physics | AMiner-CS | Reddit | Amazon2M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| W/O-SE | 89.06 | 70.42 | 95.52 | 90.44 | 95.02 | 97.10 | 55.64 | 93.47 | 76.98 |
| With-SE | 89.70 | 71.51 | 95.75 | 91.22 | 95.49 | 97.34 | 56.21 | 93.58 | 77.43 |
| Gain | +0.64 | +1.09 | +0.23 | +0.78 | +0.47 | +0.24 | +0.57 | +0.11 | +0.45 |

Analysis:

  • The results show a consistent gain in accuracy across all datasets when structural encoding (concatenating Laplacian eigenvectors) is included. This confirms that incorporating explicit structural information is beneficial for node classification tasks.
  • The magnitude of the gain varies across datasets (e.g., +1.09% on CoraFull vs. +0.11% on Reddit), suggesting that the importance of structural information encoded by Laplacian eigenvectors is sensitive to the specific topology of the graph. Some graphs might have structural patterns that are more crucial for classification than others, or that are better captured by these eigenvectors.

6.2.2. Attention-based Readout Function

The paper compares its proposed attention-based readout function (ATT.) against two alternatives: SIN. (using only the 0-hop node representation) and SUM. (summing all hop representations equally).

The following figure (Figure 2 from the original paper) shows the performance of NAGphormer via different readout functions on small-scale datasets:

Figure 2: The performance of NAGphormer via different readout functions.

Analysis (Small-scale datasets):

  • The ATT. function consistently outperforms both SIN. and SUM. across all small-scale datasets.

  • SIN. (using only the self-representation) performs the worst, indicating that integrating multi-hop neighborhood information is crucial.

  • SUM. performs better than SIN. but worse than ATT., suggesting that simply summing (equal weighting) is less effective than adaptively weighting different hops.

  • This validates that adaptively learning the importance of different-hop neighborhoods through the attention mechanism in the readout function is beneficial for learning more expressive node representations and improving classification performance.

    The following figure (Figure 4 from the original paper) shows the performance of different readout functions on large-scale datasets:

    Figure 4: The performance of different readout functions on large-scale datasets.

    Analysis (Large-scale datasets, from Appendix F):

  • Similar trends are observed on large-scale datasets. The attention-based readout function (ATT.) consistently outperforms SIN. and SUM. on AMiner-CS, Reddit, and Amazon2M. This reinforces the conclusion that adaptive weighting of multi-hop information is critical for performance, regardless of graph scale.

6.2.3. Parameter Study: Number of Propagation Steps ($K$)

The following figure (Figure 3a from the original paper) shows the performance of NAGphormer on different parameters:

Figure 3: Performance of NAGphormer on different parameters.

Analysis (Figure 3a):

  • The performance of NAGphormer varies with the number of propagation steps $K$.
  • Optimal $K$ values differ across datasets (e.g., $K=16$ for AMiner-CS, $K=10$ for Reddit and Amazon2M in practice), indicating that different networks have distinct effective neighborhood structures and information propagation distances relevant to the task.
  • Importantly, even for large $K$ (up to 20), model performance does not decline significantly; for Reddit, the change is less than 0.1%. This suggests that the self-attention mechanism and attention-based readout effectively mitigate over-smoothing and over-squashing, since they can selectively attend to and combine relevant hop information rather than simply averaging everything. This is a significant advantage over many GNNs, which degrade with increasing $K$.

6.2.4. Parameter Study: Number of Transformer Layers ($L$)

Analysis (Figure 3b):

  • Generally, a small number of Transformer layers $L$ (e.g., $L=1$ or $L=3$) achieves high accuracy, while a larger $L$ tends to degrade performance.
  • This observation is attributed to the increased likelihood of over-fitting with more Transformer layers, given the short, fixed sequence length ($K+1$) of each node's internal attention.
  • In practice, $L=3$ was chosen for AMiner-CS and $L=1$ for the other datasets, striking a balance between model capacity and performance.

6.3. Efficiency Experiments on Large-Scale Graphs (from Appendix G)

The following are the results from Table 6 of the original paper, showing the training cost (GPU memory and running time) on large-scale graphs:

| Method | AMiner-CS Memory (MB) | AMiner-CS Time (s) | Reddit Memory (MB) | Reddit Time (s) | Amazon2M Memory (MB) | Amazon2M Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| GraphSAINT | 1,641 | 23.67 | 2,565 | 43.15 | 5,317 | 334.08 |
| PPRGo | 1,075 | 14.21 | 1,093 | 35.73 | 1,097 | 152.62 |
| GRAND+ | 1,091 | 21.41 | 1,213 | 197.97 | 1,123 | 207.85 |
| NAGphormer | 1,827 | 19.87 | 1,925 | 20.72 | 2,035 | 58.66 |

Analysis:

  • NAGphormer demonstrates high efficiency in terms of running time on large graphs. For instance, on Amazon2M (the largest dataset), NAGphormer trains in 58.66 seconds, which is nearly 3 times faster than the second fastest model, PPRGo (152.62 seconds).
  • This efficiency is primarily because NAGphormer's training time complexity, $O(n (K+1)^2 d)$, is independent of the total number of edges in the graph. Most other scalable GNNs perform propagation operations during training whose cost depends on the number of edges, making them slower on dense graphs.
  • Regarding GPU memory, NAGphormer's consumption is affordable and determined by the batch size, thanks to its mini-batch training capability. While PPRGo and GRAND+ show lower memory usage on some datasets, NAGphormer's memory footprint remains within practical limits for typical GPU resources (e.g., 2,035 MB on Amazon2M on an RTX 2080 Ti).

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces NAGphormer, a novel and powerful Graph Transformer designed for node classification in large graphs. The key innovation lies in its tokenized design via the Hop2Token module, which transforms the traditional graph representation for Transformers. Instead of treating all nodes in the graph as a single long sequence, NAGphormer treats each node as a sequence of tokens derived from its multi-hop neighborhoods. This enables mini-batch training, resolving the long-standing scalability issue of graph Transformers that previously suffered from quadratic computational complexity. Furthermore, NAGphormer incorporates structural encoding and an attention-based readout function that adaptively learns the importance of different-hop neighborhoods, leading to more expressive node representations. Theoretical analysis validates its superior expressive power compared to decoupled GCNs. Extensive experiments on diverse small and large-scale datasets consistently demonstrate NAGphormer's state-of-the-art performance against existing graph Transformers and mainstream GNNs, along with its superior efficiency on large graphs.

7.2. Limitations & Future Work

The paper explicitly mentions that its current work focuses on node classification.

  • Future Work: The authors aim to generalize NAGphormer to other graph mining tasks, specifically mentioning graph classification. This indicates that adapting the Hop2Token and Transformer architecture for tasks that require a single representation for the entire graph (rather than individual node representations) would be a natural next step. This would likely involve an additional global readout mechanism over all node sequences or a more complex hierarchical aggregation.

7.3. Personal Insights & Critique

NAGphormer presents a highly insightful and practical solution to a critical bottleneck in graph machine learning: extending the power of Transformers to large graphs. The Hop2Token module is particularly clever, effectively transforming a global graph problem into a set of local sequence processing problems, which is amenable to efficient mini-batch training. This idea of defining tokens based on neighborhood context rather than individual nodes could be a fundamental shift in how Transformers interact with graph data.

Inspirations and Applications:

  • Beyond Node Classification: The core Hop2Token idea could indeed be foundational for other tasks. For graph classification, one could envision a global attention mechanism over the Z_out representations of all nodes in a graph, or a hierarchical Hop2Token that first forms node-level sequences, then aggregates these into subgraph-level sequences, and so on.
  • Other Domains: The concept of tokenizing local context for global model processing is transferable. For instance, in molecular graph analysis, Hop2Token could capture intricate local chemical environments around each atom, allowing a Transformer to model complex molecular properties without being bottlenecked by the molecule size. In knowledge graphs, it could enable efficient processing of large graphs by focusing on local relational structures for entity representation.
  • Mitigating Over-smoothing/Over-squashing: The paper implicitly demonstrates a novel way to address over-smoothing and over-squashing, not by architectural tweaks to GNNs, but by adopting a Transformer's adaptive attention over multi-hop features. The fact that NAGphormer's performance does not degrade significantly with large $K$ (number of hops) is a strong indicator of its robustness against these issues.

Potential Issues or Areas for Improvement:

  • Computational Cost of Hop2Token Pre-computation: While Hop2Token is non-parametric and runs offline, computing $\hat{\mathbf{A}}^k \mathbf{X}'$ for large $K$ on very large sparse graphs can still be significant, especially if $\mathbf{X}'$ is high-dimensional. For highly dynamic graphs, re-computing it could become a bottleneck; exploring approximation techniques for Hop2Token itself in such scenarios could be beneficial.
  • Sensitivity to $K$ and $L$: While the paper explores these parameters, finding optimal $K$ and $L$ still requires tuning. The observations that the optimal $K$ varies by dataset and that larger $L$ can lead to overfitting suggest these parameters are not universally robust. Developing adaptive mechanisms for determining $K$ or dynamically adjusting $L$ could improve usability.
  • Interpretability: Transformers are often less interpretable than simpler models. While the attention-based readout provides some insights into hop importance, understanding why certain hop features are prioritized or how the Transformer weights inter-hop correlations could be further explored for greater model transparency.
  • Generalization to Other Graph Types: The current work focuses on unweighted and undirected graphs. Extending NAGphormer to directed, weighted, or heterogeneous graphs would require modifications to Hop2Token (e.g., directional propagation, weighted aggregation) and potentially the structural encoding.
  • Memory Footprint of $\mathbf{X}_G$: For very large $N$, large $K$, and high $d'$, storing the entire $\mathbf{X}_G \in \mathbb{R}^{N \times (K+1) \times d'}$ can still be substantial, even though it is pre-computed. While it is not required on the GPU during training, this pre-computed tensor must reside somewhere for batching, which may become a challenge for graphs that truly push memory limits even when training itself is batched.

No similar papers found yet.