NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs
TL;DR Summary
NAGphormer is proposed to address the computational complexity that prevents existing graph Transformers from scaling to node classification on large graphs. By introducing the Hop2Token module, it aggregates multi-hop neighborhood features into a per-node token sequence, improving the expressiveness of node representations and enabling scalable mini-batch training.
Abstract
The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, causing it hard to scale to large graphs due to the quadratic complexity on the number of nodes for the self-attention computation. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. Moreover, we mathematically show that as compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer could learn more informative node representations from the multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs. Code is available at https://github.com/JHL-HUST/NAGphormer.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs
1.2. Authors
The authors are Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. Their affiliations are:
- School of Artificial Intelligence, Huazhong University of Science and Technology
- School of Computer Science and Technology, Huazhong University of Science and Technology
- Hopcroft Center on Computing Science, Huazhong University of Science and Technology
1.3. Journal/Conference
The paper was posted to arXiv, a preprint server, on 2022-06-10 (UTC). arXiv is a highly influential platform for the rapid dissemination of research in fields such as computer science, mathematics, and physics. While it hosts preprints, many of these papers later undergo peer review and appear in top-tier conferences or journals; its presence on arXiv marks this work as a recent contribution to the field.
1.4. Publication Year
2022
1.5. Abstract
Existing graph Transformer models struggle with large graphs due to their quadratic computational complexity with respect to the number of nodes, as they treat all nodes as independent tokens in a single long sequence. To overcome this, the paper proposes NAGphormer (Neighborhood Aggregation Graph Transformer), a novel architecture that redefines tokens for graph Transformers. NAGphormer introduces a Hop2Token module that aggregates neighborhood features from different hops for each node, creating a sequence of token vectors for each node. This approach allows for mini-batch training, making it scalable to large graphs. The authors mathematically demonstrate that NAGphormer can learn more informative node representations from multi-hop neighborhoods compared to advanced decoupled Graph Neural Networks (GNNs). Extensive experiments on both small and large benchmark datasets show that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2206.04910
- PDF Link: https://arxiv.org/pdf/2206.04910v4.pdf
- Publication Status: This paper is a preprint available on
arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the scalability issue of existing graph Transformer models when applied to large graphs for tasks like node classification.
Graph Neural Networks (GNNs) have become powerful tools for graph mining tasks by aggregating neighborhood information. However, traditional GNNs suffer from inherent limitations such as over-smoothing (where node representations become indistinguishable in deep networks) and over-squashing (information loss when compressing paths into fixed-dimensional vectors) with increasing model depth.
Transformers, initially successful in natural language processing and computer vision, have shown great promise in generalizing to graph-structured data. However, graphs pose unique challenges due to their complex structural topology and attribute features, which cannot be directly encoded into Transformers as simple tokens. Existing graph Transformers typically treat all nodes in a graph as independent tokens and form a single, very long sequence for the Transformer model. This design leads to a quadratic computational complexity with respect to the number of nodes for the self-attention mechanism, making them computationally prohibitive and memory-intensive for large graphs. Furthermore, traditional scalability strategies for GNNs (like node sampling or approximation propagation) are not directly applicable to graph Transformers, which rely on global attention.
The paper's entry point is to rethink how tokens are defined and processed in graph Transformers to overcome this scalability bottleneck while preserving the powerful representational capabilities of Transformers.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Novel Hop2Token Module: The introduction of Hop2Token, a new neighborhood aggregation method that transforms the features of different-hop neighborhoods into a sequence of token vectors for each individual node, effectively encoding the structural information of each node's multi-hop neighborhood as a sequence.
- NAGphormer Architecture: The proposal of NAGphormer, a new graph Transformer model for node classification. By treating each node as an independent sequence of Hop2Token-generated tokens, NAGphormer enables mini-batch training and thus scales to large graphs, a significant advancement over prior graph Transformers.
- Attention-based Readout Function: The development of an attention-based readout function that adaptively learns the importance of different-hop neighborhoods, further enhancing the model's ability to learn expressive node representations.
- Theoretical Analysis: A mathematical proof that NAGphormer can learn more expressive node representations from multi-hop neighborhoods than advanced decoupled Graph Convolutional Networks (GCNs), by showing that decoupled GCNs implicitly use a fixed, incomplete attention mechanism.
- Empirical Validation: Extensive experiments on a variety of benchmark datasets, from small to large scale, demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs in classification accuracy while remaining efficient on large graphs.

These findings address the critical scalability problem of graph Transformers, allowing their powerful attention mechanisms to be applied to real-world large-scale graphs, while also improving representation learning by explicitly modeling multi-hop neighborhood information.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand NAGphormer, a reader should be familiar with the following foundational concepts:
3.1.1. Graphs and Graph Representation
A graph is a data structure consisting of nodes (or vertices) and edges (or links) that connect them.
- Nodes ($\mathcal{V}$): Entities in the graph (e.g., people, documents, proteins). Each node $v$ can have an associated feature vector $\mathbf{x}_v \in \mathbb{R}^d$, representing its attributes.
- Edges ($\mathcal{E}$): Relationships between nodes. Edges can be directed or undirected, weighted or unweighted.
- Adjacency Matrix ($\mathbf{A} \in \mathbb{R}^{n \times n}$): A square matrix where $\mathbf{A}_{ij} = 1$ if there is an edge between node $i$ and node $j$, and $\mathbf{A}_{ij} = 0$ otherwise.
- Degree Matrix ($\mathbf{D}$): A diagonal matrix where $\mathbf{D}_{ii}$ is the degree of node $i$ (the number of edges connected to it).
- Normalized Adjacency Matrix ($\hat{\mathbf{A}}$): A common normalization for graph neural networks, often calculated as $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops ($\mathbf{I}$ is the identity matrix) and $\tilde{\mathbf{D}}$ is its corresponding degree matrix. This normalization helps prevent numerical instabilities and scales feature propagation (a small code sketch follows this list).
- Node Classification: A common task in graph mining where the goal is to predict the category or label of an unlabeled node based on its features and its connections within the graph, given a set of labeled nodes.
- $K$-hop Neighborhood: For a given node $v$, its $K$-hop neighborhood includes all nodes that can be reached from $v$ by traversing at most $K$ edges. The 0-hop neighborhood is the node itself.
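To make the normalization concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; function and variable names are ours) that builds $\hat{\mathbf{A}}$ for a toy graph:

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}. Assumes no isolated nodes."""
    A_tilde = A + np.eye(A.shape[0])       # add self-loops
    deg = A_tilde.sum(axis=1)              # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Toy 3-node path graph: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(normalized_adjacency(A).round(3))
```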
3.1.2. Graph Neural Networks (GNNs)
GNNs are a class of neural networks designed to operate directly on graph-structured data. They learn node representations (also called embeddings) by iteratively aggregating information from a node's local neighborhood.
- Message Passing Mechanism: The core idea behind most GNNs. In each layer, a node's representation is updated by aggregating messages (transformed features) from its neighbors and combining them with its own previous representation. This process allows information to flow across the graph.
- Over-smoothing: A common problem in deep GNNs where, after many layers of message passing, the representations of all nodes in a connected component of a graph become increasingly similar, eventually becoming indistinguishable. This limits the model's ability to learn distinct representations for different nodes.
- Over-squashing: Another issue in GNNs, where information from distant nodes might be lost or overly compressed when passed through long paths in the graph, especially in bottleneck structures.
3.1.3. Transformers
Transformers are deep learning architectures that rely heavily on the self-attention mechanism.
- Self-Attention Mechanism: A mechanism that allows a model to weigh the importance of different parts of an input sequence when processing a specific element. For an input sequence of $n$ tokens, self-attention computes an output representation for each token by summing linearly transformed versions of all tokens in the sequence, weighted by attention scores.
  - Query ($\mathbf{Q}$), Key ($\mathbf{K}$), Value ($\mathbf{V}$): These are generated by linearly transforming the input representations:
$
\mathbf{Q} = \mathbf{H} \mathbf{W}^Q, \quad \mathbf{K} = \mathbf{H} \mathbf{W}^K, \quad \mathbf{V} = \mathbf{H} \mathbf{W}^V
$
where $\mathbf{H} \in \mathbb{R}^{n \times d}$ is the input matrix ($n$ tokens, $d$ dimensions), and $\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$ are learnable weight matrices.
- Attention Score Calculation: The attention scores are typically computed as a scaled dot product between queries and keys (see the sketch after this list):
$
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_K}}\right)\mathbf{V}
$
where $d_K$ is the dimension of the key vectors, used for scaling to prevent large dot products from pushing the softmax into regions with very small gradients. The softmax function normalizes these scores into a probability distribution.
- Multi-Head Self-Attention (MSA): An extension of self-attention where the attention mechanism is run multiple times in parallel with different linear projections. The outputs from these "heads" are then concatenated and linearly transformed to produce the final output. This allows the model to attend to different parts of the input from various "representation subspaces."
- Feed-Forward Network (FFN): A simple neural network (typically two linear transformations with a non-linear activation in between) applied independently to each position in the sequence after the self-attention layer.
- Positional Encoding: Since Transformers process sequences without an inherent notion of order, positional encodings are added to the input embeddings to provide information about the relative or absolute position of each token in the sequence.
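To ground the formulas above, the following minimal NumPy sketch (our own illustration, not the paper's code) implements single-head scaled dot-product self-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token matrix H (n x d)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n x n) pairwise token scores
    return softmax(scores, axis=-1) @ V       # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d, dk = 5, 8, 4                            # 5 tokens, feature dim 8, head dim 4
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
print(self_attention(H, Wq, Wk, Wv).shape)    # -> (5, 4)
```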
3.2. Previous Works
3.2.1. Graph Neural Networks (GNNs)
The paper primarily references GNNs as a baseline and highlights their limitations, especially the coupled design in traditional GCNs.
- Graph Convolutional Network (GCN) (Kipf & Welling, 2017): A seminal GNN model that applies a first-order approximation of spectral graph convolutions. A GCN layer combines feature transformation and neighborhood aggregation:
$
\mathbf{H}^{(l+1)} = \sigma(\hat{\mathbf{A}} \mathbf{H}^{(l)} \mathbf{W}^{(l)})
$
where $\mathbf{H}^{(l)}$ is the node feature matrix at layer $l$, $\hat{\mathbf{A}}$ is the normalized adjacency matrix with self-loops, $\mathbf{W}^{(l)}$ is a learnable weight matrix, and $\sigma$ is a non-linear activation function.
  - Background: This formula says that to obtain the next layer's representation of a node, you aggregate the features of its neighbors (including itself, due to the self-loops in $\hat{\mathbf{A}}$) and then apply a linear transformation and an activation function.
- Graph Attention Network (GAT) (Veličković et al., 2018): Introduces an attention mechanism into the message passing process, allowing nodes to selectively attend to their neighbors based on their features, rather than using fixed weights (as in GCN).
- Decoupled GCNs (e.g., APPNP, GPRGNN, GCNII): To mitigate over-smoothing and over-squashing, these models separate the feature transformation and neighborhood aggregation steps.
  - Motivation: Traditional GCNs couple feature transformation (applying $\mathbf{W}^{(l)}$) and propagation (multiplying by $\hat{\mathbf{A}}$). When stacked, the non-linearities and repeated propagation can lead to over-smoothing. Decoupled GCNs apply the feature transformation upfront, then perform multiple propagation steps without intermediate non-linearities, and finally combine the propagated features.
  - General Form (Chien et al., 2021):
$
\mathbf{Z} = \sum_{k=0}^K \beta_k \mathbf{H}^{(k)}, \qquad \mathbf{H}^{(k)} = \hat{\mathbf{A}} \mathbf{H}^{(k-1)}, \qquad \mathbf{H}^{(0)} = f_{\boldsymbol{\theta}}(\mathbf{X})
$
Here, $f_{\boldsymbol{\theta}}$ is an initial feature transformation (e.g., a multi-layer perceptron, MLP) applied to the raw features $\mathbf{X}$. Then $\mathbf{H}^{(k)}$ represents the features propagated for $k$ steps, and the final node representations $\mathbf{Z}$ are a weighted sum of these multi-hop propagated features, with $\beta_k$ being learnable aggregation coefficients. This allows capturing deeper structural information without immediate over-smoothing; a minimal sketch of this propagate-then-combine scheme follows this list.
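The sketch below (our own NumPy illustration under simplifying assumptions; the names `decoupled_gcn` and `f_theta` are hypothetical, and real models learn $\beta_k$ and $f_{\boldsymbol{\theta}}$) shows the decoupled-GCN form above:

```python
import numpy as np

def decoupled_gcn(A_hat, X, betas, f_theta):
    """Decoupled GCN: transform features once, then propagate K times
    and combine the hops with fixed coefficients beta_k."""
    H = f_theta(X)                      # H^(0) = f_theta(X)
    Z = betas[0] * H
    for k in range(1, len(betas)):
        H = A_hat @ H                   # H^(k) = A_hat H^(k-1)
        Z = Z + betas[k] * H            # weighted sum over hops
    return Z

# Toy example: identity "MLP" and uniform hop weights on a 4-node graph
A_hat = np.full((4, 4), 0.25)           # stand-in normalized adjacency
X = np.eye(4)
Z = decoupled_gcn(A_hat, X, betas=[0.25, 0.25, 0.25, 0.25], f_theta=lambda x: x)
print(Z.shape)                          # -> (4, 4)
```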
3.2.2. Graph Transformers
Existing graph Transformers integrate graph structural information into the Transformer architecture using three main strategies:
- Structural Encoding: Replacing or augmenting positional encoding with graph-specific structural information.
  - Examples: Using Laplacian eigenvectors (Dwivedi & Bresson, 2020; Kreuzer et al., 2021) or degree-related features (Ying et al., 2021) as positional encodings.
- GNNs as Auxiliary Modules: Using GNNs to extract local structural information, which is then fed into a Transformer to learn long-range dependencies.
  - Examples: Jain et al. (2021) use GNNs to provide fixed local structural information. Chen et al. (2022) use GNNs to learn $k$-subtree and $k$-subgraph information. Rampášek et al. (2022) propose a hybrid layer combining a GNN and a self-attention layer.
- Graph Bias in the Attention Matrix: Modifying the self-attention scores to incorporate graph structural biases.
  - Examples: Ying et al. (2021) use shortest-path distance to bias attention. Zhao et al. (2021) incorporate proximity-enhanced attention. Dwivedi et al. (2020) and Hussain et al. (2022) inject edge features into the self-attention module. Wu et al. (2022) use topology structural information as a relational bias.

The paper notes that most of these methods, except for GraphGPS and Nodeformer (which have linear complexity), adopt a fully-connected attention mechanism over all node pairs, leading to quadratic spatial complexity with respect to the number of nodes. This makes them unsuitable for large graphs. Gophormer (Zhao et al., 2021) attempts scalability by sampling ego-graphs, but the sampling is still time-consuming and may limit the neighborhood information available to each node.
3.3. Technological Evolution
Graph representation learning has evolved from early matrix factorization methods to deep learning approaches. GNNs revolutionized the field by enabling end-to-end learning on graph structures, mimicking spectral graph theory and message passing. However, their limitations (over-smoothing, over-squashing, scalability) pushed research towards more robust and scalable architectures. Concurrently, the success of Transformers in sequential data led to efforts to adapt them to graphs. Initial graph Transformers often adopted a "global attention" view, treating all nodes as tokens, which inherited the Transformer's quadratic complexity. The challenge then became how to retain Transformer's power while making it scalable for graphs. This paper fits into this evolution by proposing a tokenization scheme specific to individual nodes' neighborhoods, effectively transforming the global graph attention problem into a local sequence attention problem for each node, enabling scalability and better capture of multi-hop information.
3.4. Differentiation Analysis
The core differences and innovations of NAGphormer compared to prior work, particularly the main methods in the related work, are:

- Tokenization Strategy:
  - Prior Graph Transformers: Most existing graph Transformers treat each node as an independent token and form one long sequence of all nodes in the graph. This leads to global self-attention and quadratic complexity ($O(n^2)$, where $n$ is the total number of nodes).
  - NAGphormer: Treats each node as a sequence itself, composed of tokens generated from the node's multi-hop neighborhoods by the Hop2Token module. This redefinition allows for local self-attention within each node's neighborhood sequence, enabling mini-batch training.
- Scalability:
  - Prior Graph Transformers: Generally struggle with large graphs due to quadratic complexity, often leading to out-of-memory errors, as observed in experiments with Graphormer, SAN, and GraphGPS.
  - NAGphormer: The tokenized approach and mini-batch training inherently solve the scalability issue, making it applicable to large graphs with millions of nodes and edges. Its per-batch complexity becomes $O(b (K+1)^2 d)$, where $b$ is the number of nodes in a batch, $K$ is the number of hops, and $d$ is the feature dimension. The sequence length $K+1$ is typically much smaller than $n$.
- Information Aggregation:
  - Traditional GNNs (e.g., GCN, GAT): Aggregate information primarily from immediate neighbors in a coupled fashion, potentially suffering from over-smoothing and over-squashing when going deeper. Decoupled GCNs improve this but use fixed aggregation weights (as shown in Fact 1).
  - NAGphormer: Explicitly constructs tokens for different hops via Hop2Token, then uses the Transformer's self-attention mechanism to learn semantic correlations between these different-hop neighborhood tokens for a given node. The attention-based readout further adaptively weighs these hop-wise representations, allowing a richer and more adaptive capture of multi-hop information.
- Theoretical Foundation:
  - Decoupled GCNs: The paper shows that decoupled GCNs can be viewed as applying a self-attention mechanism with a fixed attention matrix, limiting their expressiveness in combining multi-hop features.
  - NAGphormer: Utilizes the full flexibility of the Transformer's self-attention to dynamically learn correlations between different hops, offering theoretically more informative representations.

In essence, NAGphormer innovates by changing the fundamental unit of attention from "node" to "node's multi-hop neighborhood sequence," thus making graph Transformers both powerful and scalable.
4. Methodology
4.1. Principles
The core idea behind NAGphormer is to enable Transformer models to process large-scale graph data effectively by redefining how graph information is tokenized. Instead of treating each node in the entire graph as an independent token in a single long sequence (which leads to quadratic complexity for self-attention), NAGphormer treats each node as a sequence of tokens, where each token represents aggregated information from a specific hop of its neighborhood. This novel tokenization scheme, achieved through the Hop2Token module, allows for mini-batch training and significantly reduces the computational burden, making Transformer architectures viable for large graphs. The model then uses a standard Transformer encoder to learn relationships within these hop-wise tokens and an attention-based readout function to adaptively combine them for the final node representation.
4.2. Core Methodology In-depth (Layer by Layer)
The NAGphormer model framework can be broken down into several key components: Structural Encoding, Hop2Token module, a linear projection, a Transformer Encoder backbone, and an Attention-based Readout Function.
The following figure (Figure 1 from the original paper) shows the model framework of NAGphormer:
The figure is a schematic of the NAGphormer framework. It shows how the Hop2Token module constructs a token sequence for each node, how the Transformer backbone learns node representations, how the attention-based readout layer adaptively aggregates the neighborhood information of different hops, and how an MLP module finally performs label prediction.
4.2.1. Problem Formulation
The paper defines the problem as node classification on an unweighted and undirected attributed graph $G = (\mathcal{V}, \mathcal{E})$.
- $\mathcal{V}$ is the set of $n$ nodes.
- Each node $v$ has a feature vector $\mathbf{x}_v \in \mathbb{R}^d$, forming a feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the node features.
- $\mathbf{A} \in \mathbb{R}^{n \times n}$ is the adjacency matrix, and $\mathbf{D}$ is the diagonal degree matrix.
- The normalized adjacency matrix is $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (the adjacency matrix with self-loops) and $\tilde{\mathbf{D}}$ is its corresponding degree matrix.
- The task is to predict the labels of an unlabeled node set, given the labels of a labeled node set.
4.2.2. Structural Encoding
Before generating tokens, NAGphormer enhances the initial node features by incorporating structural information of the graph. This is achieved by concatenating the original attribute features with Laplacian eigenvectors.
- The Laplacian matrix of a graph (or one of its normalized variants) is a matrix representation that captures the graph's structure; its eigenvectors encode fundamental properties of the graph's connectivity and topology.
- The paper selects the eigenvectors corresponding to the $s$ smallest non-trivial eigenvalues of the Laplacian matrix to construct a structure matrix $\mathbf{U} \in \mathbb{R}^{n \times s}$. These eigenvectors provide a low-dimensional structural embedding for each node.
- This structure matrix is then concatenated with the original feature matrix $\mathbf{X}$ to form a fused feature matrix $\mathbf{X}'$:
$
\mathbf{X}' = \mathbf{X} \parallel \mathbf{U}
$
where $\parallel$ denotes the concatenation operator. The dimension of the effective feature vector of each node becomes $d' = d + s$. $\mathbf{X}'$ is then used as the input to the Hop2Token module. This step ensures that both attribute and structural information are considered from the outset.
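As a rough illustration of this step (our own sketch, not the paper's implementation; the paper's exact Laplacian variant and eigenvector post-processing may differ), the structural encoding could be computed as follows:

```python
import numpy as np

def laplacian_positional_encoding(A: np.ndarray, s: int) -> np.ndarray:
    """Eigenvectors of the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2},
    keeping the s smallest non-trivial eigenvalues. Assumes no isolated nodes."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvec[:, 1:s + 1]                   # skip the trivial first eigenvector

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 6))   # raw node features
U = laplacian_positional_encoding(A, s=2)
X_prime = np.concatenate([X, U], axis=1)            # X' = X || U
print(X_prime.shape)                                 # -> (4, 8), i.e. d' = d + s
```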
4.2.3. Hop2Token Module
The Hop2Token module is a crucial component that aggregates neighborhood features from multiple hops and transforms each hop into a representation, which serves as a token.
- Definition of $k$-hop Neighborhood: For a node $v$, its $k$-hop neighborhood $\mathcal{N}^k(v)$ is defined as the set of all nodes $u$ whose shortest path distance $d(v, u)$ to $v$ is less than or equal to $k$. The 0-hop neighborhood is the node itself.
- Neighborhood Embedding: For each node $v$, Hop2Token generates a neighborhood embedding $\mathbf{x}_v^k$ for its $k$-hop neighborhood by applying an aggregation operator $\phi$:
$
\mathbf{x}_v^k = \phi(\mathcal{N}^k(v))
$
- Sequence Construction: By calculating these neighborhood embeddings for a fixed number of hops $K$ (a hyperparameter), Hop2Token constructs a sequence for each node $v$:
$
\mathcal{S}_v = (\mathbf{x}_v^0, \mathbf{x}_v^1, \ldots, \mathbf{x}_v^K)
$
where each $\mathbf{x}_v^k$ is a $d'$-dimensional vector. This sequence represents the node's multi-hop neighborhood information.
- Implementation of $\phi$ and $\mathcal{N}^k(v)$: In practice, Hop2Token uses a simplified propagation process, similar to decoupled GCNs, to obtain the $k$-hop neighborhood features. The $k$-hop neighborhood matrix $\mathbf{X}_k$ (where each row holds a node's $k$-hop aggregated features) is computed by repeatedly multiplying the initial feature matrix with the normalized adjacency matrix $\hat{\mathbf{A}}$:
$
\mathbf{X}_k = \hat{\mathbf{A}}^k \mathbf{X}'
$
Here, $\mathbf{X}_0 = \mathbf{X}'$ (the initial structurally encoded features). The sequence of neighborhood matrices for all nodes in the graph is then denoted $\mathbf{X}_G = (\mathbf{X}_0, \mathbf{X}_1, \ldots, \mathbf{X}_K)$, where each $\mathbf{X}_k$ contains the $k$-hop features of all nodes.
The detailed implementation is shown in Algorithm 1:
The following is the Hop2Token Algorithm from the original paper:
Algorithm 1: The Hop2Token Algorithm
Input: normalized adjacency matrix $\hat{\mathbf{A}}$; feature matrix $\mathbf{X}$; propagation step $K$.
Output: sequences of all nodes $\mathbf{X}_G$.
1. **for** $k = 0$ to $K$ **do**
2. &nbsp;&nbsp;&nbsp;&nbsp;**for** $i = 0$ to $n$ **do**
3. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\mathbf{X}_G[i, k] = \mathbf{X}[i]$
4. &nbsp;&nbsp;&nbsp;&nbsp;**end for**
5. &nbsp;&nbsp;&nbsp;&nbsp;$\mathbf{X} = \hat{\mathbf{A}} \mathbf{X}$
6. **end for**
7. **return** the sequences of all nodes $\mathbf{X}_G$

Explanation of Algorithm 1:
- Input: the normalized adjacency matrix $\hat{\mathbf{A}}$ (which includes self-loops), the initial feature matrix $\mathbf{X}$ (in practice the structurally encoded $\mathbf{X}'$, as stated in Section 3.3 of the paper), and the maximum propagation step $K$.
- Output: the sequences of all nodes, $\mathbf{X}_G$. Conceptually this is a tensor of shape $n \times (K+1) \times d'$, where $n$ is the number of nodes, $K+1$ is the number of tokens per node (0-hop to $K$-th hop), and $d'$ is the feature dimension.
- Outer loop over $k$: iterates through each hop level from 0 to $K$.
- Inner loop over $i$: populates $\mathbf{X}_G$ for every node (implemented in practice as a single matrix operation).
- $\mathbf{X}_G[i, k] = \mathbf{X}[i]$: at step $k$, the current feature matrix $\mathbf{X}$, which has already been propagated $k$ times, is copied into the $k$-th token slot of each node $i$.
- $\mathbf{X} = \hat{\mathbf{A}} \mathbf{X}$: after storing the features of the current hop $k$, the feature matrix is multiplied by $\hat{\mathbf{A}}$, which aggregates information from immediate neighbors and propagates the features one more hop. After the first iteration ($k=0$), $\mathbf{X}$ becomes $\hat{\mathbf{A}} \mathbf{X}'$; after the second iteration ($k=1$), it becomes $\hat{\mathbf{A}}^2 \mathbf{X}'$, and so on.
- Advantages: this module is non-parametric and can be run offline before training. Its output $\mathbf{X}_G$ supports mini-batch training because each node's sequence is self-contained, allowing the Transformer to process a batch of node sequences independently. This is key to scaling to large graphs. A small sketch of this precomputation follows.
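The following NumPy sketch (our own, mirroring Algorithm 1 rather than reproducing the official code) precomputes the hop-token tensor $\mathbf{X}_G$:

```python
import numpy as np

def hop2token(A_hat: np.ndarray, X: np.ndarray, K: int) -> np.ndarray:
    """Offline Hop2Token precomputation in the spirit of Algorithm 1:
    returns a tensor X_G of shape (n, K+1, d'), where X_G[:, k] = A_hat^k X."""
    n, d = X.shape
    X_G = np.empty((n, K + 1, d))
    H = X.copy()                      # 0-hop features are the (encoded) inputs
    for k in range(K + 1):
        X_G[:, k] = H                 # store the k-hop token for every node
        H = A_hat @ H                 # propagate one more hop
    return X_G

# Toy usage with a 4-node graph and K = 3 hops
A_hat = np.full((4, 4), 0.25)
X_prime = np.random.default_rng(0).normal(size=(4, 8))
tokens = hop2token(A_hat, X_prime, K=3)
print(tokens.shape)                   # -> (4, 4, 8): one (K+1)-token sequence per node
```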
4.2.4. Linear Projection
The Hop2Token module outputs a sequence of token vectors for each node $v$, $\mathcal{S}_v = (\mathbf{x}_v^0, \mathbf{x}_v^1, \ldots, \mathbf{x}_v^K)$, where each $\mathbf{x}_v^k \in \mathbb{R}^{d'}$. Before feeding these into the Transformer encoder, they are mapped to the Transformer's hidden dimension $d_m$ using a learnable linear projection matrix $\mathbf{E}$:
$
\mathbf{Z}_v^{(0)} = \left[ \mathbf{x}_v^0 \mathbf{E} ; \mathbf{x}_v^1 \mathbf{E} ; \ldots ; \mathbf{x}_v^K \mathbf{E} \right]
$
Here, $\mathbf{E} \in \mathbb{R}^{d' \times d_m}$ is the learnable projection matrix, and $\mathbf{Z}_v^{(0)} \in \mathbb{R}^{(K+1) \times d_m}$ is the initial input sequence of node $v$ to the Transformer encoder. Each row of $\mathbf{Z}_v^{(0)}$ corresponds to a token representing a specific hop's aggregated features, now projected into the Transformer's model dimension.
4.2.5. Transformer Encoder
The projected sequences are then fed into a standard Transformer encoder. The Transformer encoder consists of $L$ identical layers. Each layer $\ell$ (for $\ell = 1, \ldots, L$) comprises a Multi-Head Self-Attention (MSA) module and a Position-wise Feed-Forward Network (FFN), with Layer Normalization (LN) applied before each block and residual connections (summing each block's input with its output).
- Multi-Head Self-Attention (MSA): The MSA module takes the input sequence $\mathbf{Z}_v^{(\ell-1)}$ and captures the semantic correlations between the different hop tokens within that sequence. This means the 0-hop token (the node itself) can attend to the 1-hop, 2-hop, ..., $K$-hop tokens, and vice versa.
$
\mathbf{Z}_v^{\prime(\ell)} = \operatorname{MSA}\left(\operatorname{LN}\left(\mathbf{Z}_v^{(\ell-1)}\right)\right) + \mathbf{Z}_v^{(\ell-1)}
$
Here, $\operatorname{LN}(\cdot)$ performs layer normalization, and the output of the MSA is added to its input (residual connection).
- Feed-Forward Network (FFN): The FFN further transforms the output of the MSA block. It consists of two linear layers with a GELU non-linearity in between.
$
\mathbf{Z}_v^{(\ell)} = \operatorname{FFN}\left(\operatorname{LN}\left(\mathbf{Z}_v^{\prime(\ell)}\right)\right) + \mathbf{Z}_v^{\prime(\ell)}
$
This process is repeated for $L$ layers. The output of the final Transformer layer for node $v$ is $\mathbf{Z}_v^{(L)}$, which contains refined embeddings for all hop tokens.
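A minimal PyTorch sketch of one such pre-LN encoder layer operating on batches of hop-token sequences (our illustration; the class name `HopEncoderLayer` and the hyperparameters are ours, not the official implementation):

```python
import torch
import torch.nn as nn

class HopEncoderLayer(nn.Module):
    """Pre-LN Transformer encoder layer that attends over the (K+1) hop tokens
    of each node independently (sketch, not the paper's official code)."""
    def __init__(self, dm: int, heads: int = 4, ffn_mult: int = 2):
        super().__init__()
        self.ln1 = nn.LayerNorm(dm)
        self.attn = nn.MultiheadAttention(dm, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dm)
        self.ffn = nn.Sequential(nn.Linear(dm, ffn_mult * dm), nn.GELU(),
                                 nn.Linear(ffn_mult * dm, dm))

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (batch, K+1, dm)
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        z = z + self.ffn(self.ln2(z))                      # FFN + residual
        return z

# A batch of 32 node sequences, each with K+1 = 4 hop tokens of width 64
layer = HopEncoderLayer(dm=64)
print(layer(torch.randn(32, 4, 64)).shape)   # -> torch.Size([32, 4, 64])
```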
4.2.6. Attention-based Readout Function
After the Transformer encoder, the output $\mathbf{Z}_v^{(L)}$ (denoted simply as $\mathbf{Z} = [\mathbf{Z}_0; \mathbf{Z}_1; \ldots; \mathbf{Z}_K]$ in this section) contains $K+1$ embeddings for node $v$, where $\mathbf{Z}_0$ is the embedding of the node itself (0-hop) and $\mathbf{Z}_k$ is the embedding of its $k$-hop neighborhood. To obtain a single final node representation for classification, a readout function is needed. Common choices such as summation or mean treat all hops equally. NAGphormer instead proposes an attention-based readout function that adaptively weighs the importance of the different hop embeddings.
- Attention Coefficient Calculation: The function computes normalized attention coefficients $\alpha_k$ for each $k$-hop neighborhood (from $k = 1$ to $K$) by measuring its correlation with the 0-hop representation (the node itself):
$
\alpha_k = \frac{\exp\left((\mathbf{Z}_0 \parallel \mathbf{Z}_k) \mathbf{W}_a^\top\right)}{\sum_{i=1}^K \exp\left((\mathbf{Z}_0 \parallel \mathbf{Z}_i) \mathbf{W}_a^\top\right)}
$
where:
  - $\mathbf{Z}_0$ is the embedding of the node itself (0-hop).
  - $\mathbf{Z}_k$ is the embedding of the $k$-th hop neighborhood.
  - $\parallel$ denotes concatenation, so $(\mathbf{Z}_0 \parallel \mathbf{Z}_k)$ is a vector of twice the hidden dimension.
  - $\mathbf{W}_a$ is a learnable projection (a weight vector).
  - The exponential and the summation in the denominator ensure that the coefficients form a probability distribution (they sum to 1).
- Final Node Representation: The final output representation of node $v$ is computed as a weighted sum of the $K$ hop representations, added to the node's own 0-hop representation:
$
\mathbf{Z}_{out} = \mathbf{Z}_0 + \sum_{k=1}^K \alpha_k \mathbf{Z}_k
$
This yields a single vector representing node $v$, which is then fed into a simple MLP (Multi-Layer Perceptron) for label prediction. The MLP transforms the final node embedding into class probabilities.
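A compact PyTorch sketch of this readout (our own illustration; `AttentionReadout` is a hypothetical name and the exact parameterization of $\mathbf{W}_a$ in the official code may differ):

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Weight hop tokens 1..K by their correlation with the 0-hop token,
    then add the weighted sum to the 0-hop token (sketch)."""
    def __init__(self, dm: int):
        super().__init__()
        self.w_a = nn.Linear(2 * dm, 1, bias=False)        # learnable W_a

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (batch, K+1, dm)
        z0, z_hops = z[:, :1, :], z[:, 1:, :]               # 0-hop vs. 1..K-hop tokens
        pair = torch.cat([z0.expand_as(z_hops), z_hops], dim=-1)  # [Z_0 || Z_k]
        alpha = torch.softmax(self.w_a(pair), dim=1)        # normalized over hops
        return z0.squeeze(1) + (alpha * z_hops).sum(dim=1)  # Z_out = Z_0 + sum_k a_k Z_k

readout = AttentionReadout(dm=64)
z_out = readout(torch.randn(32, 4, 64))   # 32 nodes, K = 3
print(z_out.shape)                         # -> torch.Size([32, 64])
```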
4.3. Theoretical Analysis of NAGphormer
The paper provides a theoretical analysis comparing NAGphormer with decoupled GCNs, arguing that NAGphormer can learn more informative representations due to its self-attention mechanism and Hop2Token.
Fact 1: From the perspective of the output node representations of Hop2Token, the paper asserts that a decoupled GCN can be regarded as applying a self-attention mechanism with a fixed attention matrix $\mathbf{S}$, whose last row holds the aggregation coefficients $\beta_0, \beta_1, \ldots, \beta_K$ and whose remaining entries are zero.
Proof of Fact 1 (from Appendix C):
- Decoupled GCN Output: For an arbitrary node $i$, each element $\mathbf{Z}_{i,m}$ (where $m$ indexes a feature dimension) of its output representation learned by a decoupled GCN (Equation 2) is:
$
\mathbf{Z}_{i, m} = \sum_{k=0}^K \beta_k \mathbf{H}_{i, m}^{(k)}
$
where $\mathbf{H}_{i, m}^{(k)}$ is the $m$-th dimension of the $k$-hop feature vector of node $i$.
- Hop2Token Output (Matrix Form): The output of Hop2Token for node $i$ can be written as a matrix whose rows correspond to hops and whose columns correspond to feature dimensions:
$
\mathbf{X}_i = \left[ \begin{array}{cccc} \mathbf{H}_{i,1}^{(0)} & \mathbf{H}_{i,2}^{(0)} & \cdots & \mathbf{H}_{i,d}^{(0)} \\ \mathbf{H}_{i,1}^{(1)} & \mathbf{H}_{i,2}^{(1)} & \cdots & \mathbf{H}_{i,d}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{H}_{i,1}^{(K)} & \mathbf{H}_{i,2}^{(K)} & \cdots & \mathbf{H}_{i,d}^{(K)} \end{array} \right]
$
- Fixed Attention Matrix $\mathbf{S}$: Consider the special attention matrix
$
\mathbf{S} = \left[ \begin{array}{cccc} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_0 & \beta_1 & \cdots & \beta_K \end{array} \right]
$
This matrix is zero everywhere except for its last row, which contains the aggregation coefficients of the decoupled GCN.
- Self-Attention Output with $\mathbf{S}$: Applying this fixed matrix to $\mathbf{X}_i$ (as self-attention would, but with a fixed $\mathbf{S}$ instead of a dynamically computed attention matrix) gives
$
\mathbf{T} = \mathbf{S} \mathbf{X}_i = \left[ \begin{array}{cccc} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_1 & \gamma_2 & \cdots & \gamma_d \end{array} \right]
$
where each element of the last row is
$
\gamma_m = \sum_{k=0}^K \beta_k \mathbf{H}_{i,m}^{(k)} \quad (m \in \{1, \ldots, d\})
$
- Readout Function: If a simple summation readout over the rows of $\mathbf{T}$ is used to obtain the final representation of node $i$, then each element is
$
\mathbf{T}_m^{final} = \sum_{k=0}^K \mathbf{T}_{k,m} = (0 + 0 + \cdots + \gamma_m) = \sum_{k=0}^K \beta_k \mathbf{H}_{i,m}^{(k)} = \mathbf{Z}_{i,m}
$
This demonstrates that the output of a decoupled GCN is equivalent to performing a fixed self-attention-like operation with a specific attention matrix and then summing the results.
Implication: This fact highlights that decoupled GCNs use a fixed set of aggregation coefficients ($\beta_k$) that are applied uniformly across all nodes and features, which limits their ability to capture node-specific or feature-specific importance of different hops. In contrast, NAGphormer uses a dynamic self-attention mechanism within its Transformer encoder to learn relationships between hop tokens, and then a dynamic attention-based readout function to adaptively weigh the importance of each hop for each individual node. This allows NAGphormer to learn more expressive node representations by capturing fine-grained semantic correlations and adaptive hop importance.
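A quick numerical check of Fact 1 (our own sketch) confirms that the fixed-attention view plus a summation readout reproduces the decoupled-GCN output for a single node:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 5
betas = rng.random(K + 1)
X_i = rng.normal(size=(K + 1, d))        # rows = hop features H_i^(0..K) of one node

# Decoupled GCN output for this node: weighted sum of hop features
z_gcn = (betas[:, None] * X_i).sum(axis=0)

# Fixed "attention" matrix S: all zeros except the last row holding the betas
S = np.zeros((K + 1, K + 1))
S[-1] = betas
T = S @ X_i                               # fixed-attention output
z_fixed = T.sum(axis=0)                   # summation readout over rows

print(np.allclose(z_gcn, z_fixed))        # -> True
```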
4.4. Complexity Analysis of NAGphormer (from Appendix B)
4.4.1. Time Complexity
The time complexity of NAGphormer is primarily determined by the self-attention module within the Transformer encoder. Each node is a sequence of $K+1$ tokens with hidden dimension $d$, so the self-attention computation for one node costs $O((K+1)^2 d)$; for a batch of $b$ nodes, the batch-wise cost is $O(b (K+1)^2 d)$. Summed over all nodes in the graph, the paper states the overall time complexity as:
$
O(n (K+1)^2 d)
$
where $n$ is the total number of nodes in the graph, $K$ is the number of hops, and $d$ is the feature dimension (implicitly $d'$ or the hidden dimension, depending on the stage). For comparison, traditional graph Transformers have a time complexity of $O(n^2 d)$, since they attend over all $n$ node tokens at once. By effectively reducing the attention "sequence length" from $n$ to $K+1$ for each node, NAGphormer achieves a significant speed-up whenever $K+1 \ll n$.
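In summary (our own juxtaposition of the two expressions above):
$
\underbrace{O(n^2 d)}_{\text{one global sequence of } n \text{ node tokens}} \quad \longrightarrow \quad \underbrace{O\big(n (K+1)^2 d\big)}_{\text{per-node sequences of } K+1 \text{ hop tokens}}, \qquad K+1 \ll n.
$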
4.4.2. Space Complexity
The space complexity involves both the model parameters and the intermediate activations.
- Model Parameters: The Transformer layers contribute $O(d^2 L)$ parameters, where $L$ is the number of Transformer layers and $d$ is the hidden dimension.
- Intermediate Activations: During mini-batch training, the main memory consumption comes from storing the attention matrices and hidden node representations of a batch of nodes: $O(b (K+1)^2)$ for the attention matrices (one $(K+1) \times (K+1)$ matrix per node) and $O(b (K+1) d)$ for the hidden node representations. The total space complexity is therefore:
$
O(b (K+1)^2 + b (K+1) d + d^2 L)
$
where $b$ is the batch size, $K$ is the number of hops, $d$ is the hidden dimension, and $L$ is the number of layers. This mini-batch memory consumption is significantly lower than that of full-batch graph Transformers, allowing NAGphormer to fit into GPU memory even for large graphs.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on nine widely used datasets of varying scales.
5.1.1. Small-Scale Datasets
-
Pubmed: A citation network where nodes are scientific papers and edges are citations.
-
CoraFull: Another citation network.
-
Computer: A co-purchase network where nodes are products and edges indicate frequently bought-together items.
-
Photo: Another co-purchase network.
-
CS (Computer Science): A co-authorship network where nodes are authors and edges indicate co-authorship.
-
Physics: Another co-authorship network.
For these small-scale datasets, a standard train/validation/test random split was used.
5.1.2. Large-Scale Datasets
-
AMiner-CS: A large citation network.
-
Reddit: A social network where nodes represent posts and edges indicate common user commenting.
-
Amazon2M: A very large co-purchase network.
For these large-scale datasets, the splits followed the settings from Feng et al. (2022).
The following are the statistics from Table 4 of the original paper:
| Dataset | # Nodes | # Edges | # Features | # Classes |
|---|---|---|---|---|
| Pubmed | 19,717 | 44,324 | 500 | 3 |
| CoraFull | 19,793 | 126,842 | 8,710 | 70 |
| Computer | 13,752 | 491,722 | 767 | 10 |
| Photo | 7,650 | 238,163 | 745 | 8 |
| CS | 18,333 | 163,788 | 6,805 | 15 |
| Physics | 34,493 | 495,924 | 8,415 | 5 |
| AMiner-CS | 593,486 | 6,217,004 | 100 | 18 |
| Reddit | 232,965 | 11,606,919 | 602 | 41 |
| Amazon2M | 2,449,029 | 61,859,140 | 100 | 47 |
These datasets were chosen to cover a range of scales, from tens of thousands of nodes (small-scale) to millions of nodes and tens of millions of edges (large-scale), and diverse domains (citation, co-purchase, co-authorship, social networks). This diversity ensures a comprehensive evaluation of the model's performance and scalability across different graph characteristics.
5.2. Evaluation Metrics
The primary evaluation metric used in the paper for node classification is accuracy.
5.2.1. Accuracy
- Conceptual Definition: Accuracy is a common metric used to evaluate classification models. It measures the proportion of correctly classified instances (nodes in this case) out of the total number of instances. It gives a general sense of how well the model is performing across all classes.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
  - Number of Correct Predictions: The count of nodes for which the model's predicted class matches the true class.
  - Total Number of Predictions: The total count of nodes for which a prediction was made (typically the size of the test set).

The results are reported as mean accuracy ± standard deviation (%) over multiple trials (10 trials for the small-scale datasets), indicating both the average performance and its variability.
5.3. Baselines
NAGphormer was compared against twelve advanced baselines, categorized into three groups:
5.3.1. Full-Batch GNNs
These models require the entire graph in memory during training, making them unsuitable for very large graphs.
- GCN (Kipf & Welling, 2017): Graph Convolutional Network.
- GAT (Veličković et al., 2018): Graph Attention Network.
- APPNP (Klicpera et al., 2019): Approximate Personalized Propagation of Neural Predictions, a decoupled GCN that uses Personalized PageRank for propagation.
- GPRGNN (Chien et al., 2021): Adaptive Universal Generalized PageRank Graph Neural Network, another advanced decoupled GCN.
5.3.2. Scalable GNNs
These models are designed to handle large graphs, typically using sampling or approximation techniques.
- GraphSAINT (Zeng et al., 2020): Graph Sampling Based Inductive Learning Method.
- PPRGo (Bojchevski et al., 2020): A scalable GNN that propagates predictions using an approximate personalized PageRank.
- GRAND+ (Feng et al., 2022): Scalable Graph Random Neural Networks.
5.3.3. Graph Transformers
Existing Transformer-based models adapted for graphs.
- GT (Dwivedi & Bresson, 2020): A generalization of Transformer networks to graphs, often using Laplacian eigenvectors.
- SAN (Kreuzer et al., 2021): Spectral Attention Network, which leverages the full spectrum of the Laplacian matrix.
- Graphormer (Ying et al., 2021): Incorporates graph-information biases, such as shortest-path distance, into the attention matrix.
- GraphGPS (Rampášek et al., 2022): A general, powerful, and scalable graph Transformer.
- Gophormer (Zhao et al., 2021): An ego-graph Transformer for node classification that samples ego-graphs to reduce computation.

These baselines are representative, covering traditional GNNs, state-of-the-art scalable GNNs, and a range of existing graph Transformer approaches, allowing for a comprehensive comparison of NAGphormer's performance and scalability.
5.4. Implementation Details (from Appendix D.2)
- Hyperparameter Tuning: Hyperparameters for the baselines were set according to the recommended settings of their official implementations. For NAGphormer, the following hyperparameters were searched:
  - Number of Transformer layers $L$
  - Hidden dimension $d$
  - Propagation steps $K$
- Optimization: The AdamW optimizer (Loshchilov & Hutter, 2019) was used; the learning rate and weight decay were also tuned.
- Regularization: The dropout rate was searched over a small set of values.
- Batch Size: Set to 2000.
- Early Stopping: Training was stopped within 50 epochs if validation performance did not improve.
- Hardware: Experiments were conducted on a Linux server with 1 Intel i9-9900k CPU, 1 RTX 2080TI GPU, and 64GB RAM.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Comparison on Small-Scale Datasets
The following are the results from Table 1 of the original paper, comparing NAGphormer against various baselines on small-scale datasets:
| Method | Pubmed | CoraFull | Computer | Photo | CS | Physics |
|---|---|---|---|---|---|---|
| GCN | 86.54 ± 0.12 | 61.76 ± 0.14 | 89.65 ± 0.52 | 92.70 ± 0.20 | 92.92 ± 0.12 | 96.18 ± 0.07 |
| GAT | 86.32 ± 0.16 | 64.47 ± 0.18 | 90.78 ± 0.13 | 93.87 ± 0.11 | 93.61 ± 0.14 | 96.17 ± 0.08 |
| APPNP | 88.43 ± 0.15 | 65.16 ± 0.28 | 90.18 ± 0.17 | 94.32 ± 0.14 | 94.49 ± 0.07 | 96.54 ± 0.07 |
| GPRGNN | 89.34 ± 0.25 | 67.12 ± 0.31 | 89.32 ± 0.29 | 94.49 ± 0.14 | 95.13 ± 0.09 | 96.85 ± 0.08 |
| GraphSAINT | 88.96 ± 0.16 | 67.85 ± 0.21 | 90.22 ± 0.15 | 91.72 ± 0.13 | 94.41 ± 0.09 | 96.43 ± 0.05 |
| PPRGo | 87.38 ± 0.11 | 63.54 ± 0.25 | 88.69 ± 0.21 | 93.61 ± 0.12 | 92.52 ± 0.15 | 95.51 ± 0.08 |
| GRAND+ | 88.64 ± 0.09 | 71.37 ± 0.11 | 88.74 ± 0.11 | 94.75 ± 0.12 | 93.92 ± 0.08 | 96.47 ± 0.04 |
| GT | 88.79 ± 0.12 | 61.05 ± 0.38 | 91.18 ± 0.17 | 94.74 ± 0.13 | 94.64 ± 0.13 | 97.05 ± 0.05 |
| Graphormer | OOM | OOM | OOM | 92.74 ± 0.14 | OOM | OOM |
| SAN | 88.22 ± 0.15 | 59.01 ± 0.34 | 89.83 ± 0.16 | 94.86 ± 0.10 | 94.51 ± 0.15 | OOM |
| GraphGPS | 88.94 ± 0.16 | 55.76 ± 0.23 | OOM | 95.06 ± 0.13 | 93.93 ± 0.12 | OOM |
| NAGphormer | **89.70 ± 0.19** | **71.51 ± 0.13** | **91.22 ± 0.14** | **95.49 ± 0.11** | **95.75 ± 0.09** | **97.34 ± 0.03** |

Note: "OOM" indicates an out-of-memory error; the best result per dataset is shown in bold.
Analysis:
- NAGphormer consistently achieves the best performance across all six small-scale datasets (Pubmed, CoraFull, Computer, Photo, CS, Physics). For example, on CoraFull, NAGphormer (71.51%) edges out the strongest baseline, GRAND+ (71.37%), and outperforms all other GNNs and graph Transformers by larger margins.
- Superiority over GNNs: NAGphormer's improved performance over GNN-based methods (including the decoupled GCNs APPNP and GPRGNN) supports the claim that its Hop2Token module and Transformer backbone better capture the semantic relevance of different-hop neighbors, which many GNNs overlook.
- Superiority over Graph Transformers: NAGphormer also surpasses existing graph Transformer-based methods such as GT and SAN, even though they likewise use Laplacian eigenvectors for structural encoding. This indicates the effectiveness of NAGphormer's novel tokenization strategy and attention mechanisms.
- Scalability Demonstration (Even on Small Graphs): Crucially, several existing graph Transformers (Graphormer, SAN, GraphGPS) run out of memory (OOM) even on some of these relatively small graphs. This strongly validates the necessity and success of NAGphormer's design in making Transformers scalable on graph data.
6.1.2. Comparison on Large-Scale Datasets
The following are the results from Table 2 of the original paper, comparing NAGphormer against scalable GNNs on large-scale datasets:
| Method | AMiner-CS | Reddit | Amazon2M |
|---|---|---|---|
| PPRGo | 49.07 ± 0.19 | 90.38 ± 0.11 | 66.12 ± 0.59 |
| GraphSAINT | 51.86 ± 0.21 | 92.35 ± 0.08 | 75.21 ± 0.15 |
| GRAND+ | 54.67 ± 0.25 | 92.81 ± 0.03 | 75.49 ± 0.11 |
| NAGphormer | 56.21 ± 0.42 | 93.58 ± 0.05 | 77.43 ± 0.24 |
Analysis:
- Only scalable GNNs are compared here, because existing graph Transformers (as shown in the small-scale results) cannot handle datasets of this size due to their high computational cost.
- NAGphormer again consistently outperforms all scalable GNN baselines (PPRGo, GraphSAINT, GRAND+) on all three large-scale datasets (AMiner-CS, Reddit, Amazon2M).
- This strong performance on large graphs, together with the fact that the model can be trained on them at all, is a key validation of NAGphormer's design for scalability and effective preservation of local information. For instance, on Amazon2M it achieves 77.43% accuracy, clearly higher than GRAND+'s 75.49%.
6.1.3. Further Comparison with Gophormer (from Appendix E)
The following are the results from Table 5 of the original paper, comparing NAGphormer with Gophormer (a graph Transformer using ego-graph sampling) on specific datasets:
| Method | Cora | Citeseer | Pubmed | Blogcatalog | DBLP | Flickr |
|---|---|---|---|---|---|---|
| Gophormer | 87.85 | 80.23 | 89.40 | 96.03 | 85.20 | 91.51 |
| NAGphormer | 88.15 | 80.12 | 89.70 | 96.73 | 85.95 | 91.80 |
Analysis:
- NAGphormer outperforms Gophormer on most datasets (Cora, Pubmed, Blogcatalog, DBLP, Flickr), with a slight dip on Citeseer. This comparison is particularly relevant because Gophormer also targets scalability, via ego-graph sampling. The results suggest that NAGphormer's structured Hop2Token approach captures more comprehensive neighborhood information than ego-graph sampling.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Structural Encoding
The following are the results from Table 3 of the original paper, showing the impact of structural encoding:
| Setting | Pubmed | CoraFull | CS | Computer | Photo | Physics | AMiner-CS | Reddit | Amazon2M |
|---|---|---|---|---|---|---|---|---|---|
| W/O-SE | 89.06 | 70.42 | 95.52 | 90.44 | 95.02 | 97.10 | 55.64 | 93.47 | 76.98 |
| With-SE | 89.70 | 71.51 | 95.75 | 91.22 | 95.49 | 97.34 | 56.21 | 93.58 | 77.43 |
| Gain | +0.64 | +1.09 | +0.23 | +0.78 | +0.47 | +0.24 | +0.57 | +0.11 | +0.45 |
Analysis:
- The results show a consistent gain in accuracy across all datasets when structural encoding (concatenating Laplacian eigenvectors) is included. This confirms that incorporating explicit structural information is beneficial for node classification.
- The magnitude of the gain varies across datasets (e.g., +1.09% on CoraFull vs. +0.11% on Reddit), suggesting that the importance of the structural information encoded by the Laplacian eigenvectors is sensitive to the specific topology of the graph: some graphs have structural patterns that are more crucial for classification, or that are better captured by these eigenvectors, than others.
6.2.2. Attention-based Readout Function
The paper compares its proposed attention-based readout function (ATT.) against two alternatives: SIN. (using only the 0-hop node representation) and SUM. (summing all hop representations equally).
The following figure (Figure 2 from the original paper) shows the performance of NAGphormer via different readout functions on small-scale datasets:

Analysis (Small-scale datasets):
- The ATT. function consistently outperforms both SIN. and SUM. across all small-scale datasets.
- SIN. (using only the self-representation) performs the worst, indicating that integrating multi-hop neighborhood information is crucial.
- SUM. performs better than SIN. but worse than ATT., suggesting that simply summing (equal weighting) is less effective than adaptively weighting different hops.
- This validates that adaptively learning the importance of different-hop neighborhoods through the attention mechanism in the readout function is beneficial for learning more expressive node representations and improving classification performance.

The following figure (Figure 4 from the original paper) shows the performance of different readout functions on large-scale datasets:

Analysis (Large-scale datasets, from Appendix F):
- Similar trends are observed on the large-scale datasets: the attention-based readout function (ATT.) consistently outperforms SIN. and SUM. on AMiner-CS, Reddit, and Amazon2M. This reinforces the conclusion that adaptive weighting of multi-hop information is critical for performance, regardless of graph scale.
6.2.3. Parameter Study: Number of Propagation Steps (K)
The following figure (Figure 3a from the original paper) shows the performance of NAGphormer on different parameters:

Analysis (Figure 3a):
- The performance of NAGphormer varies with the number of propagation steps $K$.
- The optimal $K$ differs across datasets, indicating that different networks have distinct effective neighborhood structures and information-propagation distances relevant to the task.
- Importantly, even for large $K$ (up to 20), the performance does not decline significantly; on Reddit, the change is less than 0.1%. This suggests that the self-attention mechanism and the attention-based readout function effectively mitigate over-smoothing and over-squashing, since they can selectively attend to and combine the relevant hop information rather than simply averaging everything. This is a significant advantage over many GNNs, whose performance degrades as the receptive field grows.
6.2.4. Parameter Study: Number of Transformer Layers (L)
Analysis (Figure 3b):
- Generally, a small number of Transformer layers $L$ already achieves high accuracy, while a larger $L$ tends to degrade performance.
- This observation is attributed to the increased likelihood of over-fitting with more Transformer layers, given the fixed, short sequence length ($K+1$) over which each node's internal attention operates.
- In practice, the paper uses different (small) values of $L$ for AMiner-CS and for the other datasets, striking a balance between model capacity and performance.
6.3. Efficiency Experiments on Large-Scale Graphs (from Appendix G)
The following are the results from Table 6 of the original paper, showing the training cost (GPU memory and running time) on large-scale graphs:
| Method | AMiner-CS Memory (MB) | AMiner-CS Time (s) | Reddit Memory (MB) | Reddit Time (s) | Amazon2M Memory (MB) | Amazon2M Time (s) |
|---|---|---|---|---|---|---|
| GraphSAINT | 1,641 | 23.67 | 2,565 | 43.15 | 5,317 | 334.08 |
| PPRGo | 1,075 | 14.21 | 1,093 | 35.73 | 1,097 | 152.62 |
| GRAND+ | 1,091 | 21.41 | 1,213 | 197.97 | 1,123 | 207.85 |
| NAGphormer | 1,827 | 19.87 | 1,925 | 20.72 | 2,035 | 58.66 |
Analysis:
- NAGphormer demonstrates high running-time efficiency on large graphs. For instance, on Amazon2M (the largest dataset), NAGphormer trains in 58.66 seconds, nearly 3 times faster than the second fastest model, PPRGo (152.62 seconds).
- This efficiency is primarily because NAGphormer's training time does not depend on the total number of edges in the graph. Most other scalable GNNs perform propagation operations during training whose cost grows with the number of edges, making them slower on dense graphs.
- Regarding GPU memory cost, NAGphormer's consumption is affordable and determined by the batch size, thanks to its mini-batch training. While PPRGo and GRAND+ show lower memory usage on some datasets, NAGphormer's memory usage remains within practical limits for typical GPU resources (e.g., 2,035 MB on Amazon2M on an RTX 2080TI).
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces NAGphormer, a novel and powerful Graph Transformer designed for node classification in large graphs. The key innovation lies in its tokenized design via the Hop2Token module, which transforms the traditional graph representation for Transformers. Instead of treating all nodes in the graph as a single long sequence, NAGphormer treats each node as a sequence of tokens derived from its multi-hop neighborhoods. This enables mini-batch training, resolving the long-standing scalability issue of graph Transformers that previously suffered from quadratic computational complexity. Furthermore, NAGphormer incorporates structural encoding and an attention-based readout function that adaptively learns the importance of different-hop neighborhoods, leading to more expressive node representations. Theoretical analysis validates its superior expressive power compared to decoupled GCNs. Extensive experiments on diverse small and large-scale datasets consistently demonstrate NAGphormer's state-of-the-art performance against existing graph Transformers and mainstream GNNs, along with its superior efficiency on large graphs.
7.2. Limitations & Future Work
The paper explicitly mentions that its current work focuses on node classification.
- Future Work: The authors aim to generalize NAGphormer to other graph mining tasks, specifically mentioning graph classification. Adapting Hop2Token and the Transformer architecture to tasks that require a single representation for the entire graph (rather than individual node representations) would be a natural next step, likely involving an additional global readout over all node sequences or a more complex hierarchical aggregation.
7.3. Personal Insights & Critique
NAGphormer presents a highly insightful and practical solution to a critical bottleneck in graph machine learning: extending the power of Transformers to large graphs. The Hop2Token module is particularly clever, effectively transforming a global graph problem into a set of local sequence processing problems, which is amenable to efficient mini-batch training. This idea of defining tokens based on neighborhood context rather than individual nodes could be a fundamental shift in how Transformers interact with graph data.
Inspirations and Applications:
- Beyond Node Classification: The core Hop2Token idea could be foundational for other tasks. For graph classification, one could envision a global attention mechanism over the $\mathbf{Z}_{out}$ representations of all nodes in a graph, or a hierarchical Hop2Token that first forms node-level sequences, then aggregates these into subgraph-level sequences, and so on.
- Other Domains: The concept of tokenizing local context for global model processing is transferable. In molecular graph analysis, for instance, Hop2Token could capture intricate local chemical environments around each atom, allowing a Transformer to model complex molecular properties without being bottlenecked by molecule size. In knowledge graphs, it could enable efficient processing of large graphs by focusing on local relational structure for entity representation.
- Mitigating Over-smoothing/Over-squashing: The paper implicitly demonstrates a way to address over-smoothing and over-squashing not through architectural tweaks to GNNs, but by applying a Transformer's adaptive attention over multi-hop features. The fact that NAGphormer's performance does not degrade significantly with large $K$ (number of hops) is a strong indicator of its robustness against these issues.
Potential Issues or Areas for Improvement:
- Computational Cost of Hop2Token Pre-computation: While Hop2Token is non-parametric and offline, computing $\hat{\mathbf{A}}^k \mathbf{X}'$ for large $K$ on very large graphs can still be significant, especially when $\mathbf{X}'$ is high-dimensional. For extremely dynamic graphs, re-computing it might become a bottleneck; exploring approximation techniques for Hop2Token itself could be beneficial in such scenarios.
- Sensitivity to $K$ and $L$: While the paper studies these parameters, finding good values of $K$ and $L$ still requires tuning. The observation that the optimal $K$ varies by dataset and that a larger $L$ can lead to over-fitting suggests these parameters are not universally robust; adaptive mechanisms for choosing or dynamically adjusting them could improve usability.
- Interpretability: Transformers are often less interpretable than simpler models. While the attention-based readout provides some insight into hop importance, understanding why certain hop features are prioritized, or how the Transformer weights inter-hop correlations, could be explored further for greater model transparency.
- Generalization to Other Graph Types: The current work focuses on unweighted and undirected graphs. Extending NAGphormer to directed, weighted, or heterogeneous graphs would require modifications to Hop2Token (e.g., directional propagation, weighted aggregation) and potentially to the structural encoding.
- Memory Footprint of $\mathbf{X}_G$: For very large $n$ and large $K$ and $d'$, storing the entire pre-computed $\mathbf{X}_G$ can still be substantial. Although it is not needed on the GPU all at once, this pre-computed tensor must reside somewhere for batching, which may become a challenge for graphs that truly push memory limits even though training itself is batched.