
GraphGPT: Graph Instruction Tuning for Large Language Models

Published: 10/19/2023

TL;DR Summary

GraphGPT integrates large language models with graph structure knowledge through graph instruction tuning, enabling superior understanding and generalization of complex graphs, outperforming traditional GNNs in both supervised and zero-shot graph learning tasks.

Abstract

Graph Neural Networks (GNNs) have evolved to understand graph structures through recursive exchanges and aggregations among nodes. To enhance robustness, self-supervised learning (SSL) has become a vital tool for data augmentation. Traditional methods often depend on fine-tuning with task-specific labels, limiting their effectiveness when labeled data is scarce. Our research tackles this by advancing graph model generalization in zero-shot learning environments. Inspired by the success of large language models (LLMs), we aim to create a graph-oriented LLM capable of exceptional generalization across various datasets and tasks without relying on downstream graph data. We introduce the GraphGPT framework, which integrates LLMs with graph structural knowledge through graph instruction tuning. This framework includes a text-graph grounding component to link textual and graph structures and a dual-stage instruction tuning approach with a lightweight graph-text alignment projector. These innovations allow LLMs to comprehend complex graph structures and enhance adaptability across diverse datasets and tasks. Our framework demonstrates superior generalization in both supervised and zero-shot graph learning tasks, surpassing existing benchmarks. The open-sourced model implementation of our GraphGPT is available at https://github.com/HKUDS/GraphGPT.

In-depth Reading


1. Bibliographic Information

1.1. Title

GraphGPT: Graph Instruction Tuning for Large Language Models

1.2. Authors

  • Jiabin Tang (University of Hong Kong)
  • Yuhao Yang (University of Hong Kong)
  • Wei Wei (University of Hong Kong)
  • Lei Shi (Baidu Inc.)
  • Lixin Su (Baidu Inc.)
  • Suqi Cheng (Baidu Inc.)
  • Dawei Yin (Baidu Inc.)
  • Chao Huang (University of Hong Kong)

1.3. Journal/Conference

The paper is published as a preprint on arXiv, a repository for electronic preprints of scientific papers. It is not explicitly stated to be published in a specific journal or conference in the provided text. arXiv is a highly influential platform in fields like computer science, mathematics, and physics, where researchers often share their work before, or in parallel with, formal peer review and publication.

1.4. Publication Year

2023

1.5. Abstract

This research addresses the limitations of traditional Graph Neural Networks (GNNs), which often rely on fine-tuning with task-specific labels, thereby limiting their effectiveness in scenarios with scarce labeled data. The paper aims to improve graph model generalization in zero-shot learning environments, drawing inspiration from the success of large language models (LLMs). The authors introduce the GraphGPT framework, which integrates LLMs with graph structural knowledge through a novel graph instruction tuning paradigm. Key components include a text-graph grounding mechanism to align textual and graph structures and a dual-stage instruction tuning approach featuring a lightweight graph-text alignment projector. These innovations enable LLMs to understand complex graph structures and enhance adaptability across diverse datasets and tasks. The framework demonstrates superior generalization in both supervised and zero-shot graph learning tasks, outperforming existing benchmarks. The model's implementation is open-sourced.

https://arxiv.org/abs/2310.13023 (Publication status: Preprint)

https://arxiv.org/pdf/2310.13023v3.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limited generalization ability of traditional Graph Neural Networks (GNNs), particularly in zero-shot learning environments where labeled data for new tasks or datasets is scarce or unavailable.

GNNs have become powerful for analyzing graph-structured data by capturing intricate relationships and dependencies through message passing and aggregation mechanisms. To enhance their robustness and generalization, self-supervised learning (SSL) has emerged as a promising technique, allowing GNNs to pre-train on unlabeled data. However, even SSL-enhanced GNNs often require fine-tuning with task-specific labels for downstream graph learning tasks. This reliance on labeled data from downstream tasks restricts their applicability in practical scenarios, such as cold-start recommendation systems or traffic prediction in new cities, where obtaining high-quality labels is challenging.

Inspired by the remarkable success of large language models (LLMs) in natural language processing (NLP), which have demonstrated exceptional generalization capabilities across diverse tasks and datasets, the authors identify an opportunity to develop a graph-oriented LLM. The innovative idea is to leverage the reasoning and generalization power of LLMs and adapt them to understand and process graph structures, thereby overcoming the limitations of traditional GNNs in zero-shot settings.

However, integrating LLMs with graph learning presents non-trivial challenges:

  • C1: Alignment: Achieving proper alignment between the structural information of a graph and the language space.

  • C2: Comprehension: Effectively guiding LLMs to comprehend the structural information of graphs.

  • C3: Reasoning: Endowing LLMs with step-by-step reasoning abilities for complex graph tasks.

    The paper highlights that directly prompting LLMs with purely text-based descriptions of graph structures (e.g., node text, text-based structural prompts) is insufficient, leading to incorrect predictions and increased token size, making it computationally expensive and limited by LLM token limits (as illustrated in Figure 1). This emphasizes the need for more efficient and scalable approaches to incorporate graph structural information into LLMs.

2.2. Main Contributions / Findings

The paper introduces the GraphGPT framework, which makes the following primary contributions:

  • Novel Framework for Graph-Oriented LLMs: Proposes GraphGPT, a framework that integrates LLMs with graph structural knowledge through a carefully designed graph instruction tuning paradigm, aiming to align LLMs with the graph domain.
  • Text-Graph Grounding Paradigm: Introduces a text-graph grounding component as the initial step to align the encoding of graph structures with the natural language space. This enables effective alignment of graph structural information within language models by incorporating textual information in a contrastive manner.
  • Dual-Stage Instruction Tuning with Lightweight Projector: Develops a dual-stage graph instruction tuning paradigm:
    1. Self-supervised instruction tuning: Uses self-supervised signals from unlabeled graph structures (via a graph matching task) as instructions to guide the LLM, enhancing its comprehension of graph structural knowledge.
    2. Task-specific instruction tuning: Fine-tunes the LLM with task-specific graph instructions to improve its adaptability and reasoning behavior for diverse downstream tasks (e.g., node classification, link prediction). Crucially, a lightweight graph-text alignment projector is employed, optimizing only its parameters while keeping the LLM and graph encoder fixed, ensuring efficiency.
  • Chain-of-Thought (CoT) Distillation: Incorporates Chain-of-Thought (CoT) distillation to equip GraphGPT with step-by-step reasoning abilities. This addresses challenges posed by distribution shift and varying node class numbers across datasets, by leveraging knowledge from powerful closed-source LLMs like GPT-3.5.
  • Superior Generalization and Performance: GraphGPT demonstrates superior generalization capabilities in both supervised and zero-shot graph learning tasks, significantly outperforming existing state-of-the-art benchmarks. It achieves a remarkable 2-10 times increase in accuracy in zero-shot scenarios compared to strong baselines.
  • Efficiency and Scalability: The framework is shown to be efficient in terms of training time, tuned parameters, and GPU occupancy due to its parameter-efficient tuning strategy. It also offers competitive inference efficiency while maintaining high accuracy.
  • Open-Sourced Implementation: The model implementation of GraphGPT is open-sourced, facilitating further research and application.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand GraphGPT, a foundational grasp of Graph Neural Networks (GNNs), Self-Supervised Learning (SSL), and Large Language Models (LLMs) is essential.

  • Graph-structured Data: Information represented as entities (called nodes or vertices) and relationships (called edges) between them. A graph is formally denoted as $\mathcal{G}(\mathcal{V}, \mathcal{E}, \mathbf{A}, \mathbf{X})$.

    • $\mathcal{V}$: The set of nodes, with $|\mathcal{V}| = N$ being the total number of nodes.
    • $\mathcal{E}$: The set of edges, representing relationships between nodes.
    • $\mathbf{A} \in \mathbb{R}^{N \times N}$: The adjacency matrix, encoding the graph's topology. $A_{i,j} = 1$ if an edge exists between node $i$ and node $j$, and 0 otherwise.
    • $\mathbf{X} \in \mathbb{R}^{N \times F}$: The feature matrix, containing attribute or feature information for each node, where $F$ is the feature dimensionality.
  • Graph Neural Networks (GNNs): A class of neural networks designed to operate directly on graph-structured data. Unlike traditional neural networks that process grid-like data (e.g., images, text sequences), GNNs can capture the intricate relationships and dependencies within graphs. They achieve this through iterative message passing and aggregation operations. In essence, each node gathers information (messages) from its direct neighbors, aggregates these messages, and then updates its own representation or embedding. This process is typically repeated for several layers, allowing nodes to incorporate information from increasingly distant neighbors.

    • Message Passing: The process where information (features) is exchanged between connected nodes.
    • Aggregation: The process where a node combines the messages received from its neighbors into a single, fixed-size representation. Common aggregation functions include sum, mean, or max.
    • Node Representation (Embedding): A vector that captures the structural and feature information of a node within the graph. GNNs aim to learn effective node representations.
    • Example (Generalized GNN Layer): The feature vector of node $v$ at layer $l$ is $h_v^{(l)}$. The Propagate function gathers messages from the neighbors $N(v)$, and Aggregate combines the result with the node's previous representation: $m_v^{(l)} = \mathrm{Propagate}^{(l)}(\{ h_u^{(l-1)} : u \in N(v) \})$, $h_v^{(l)} = \mathrm{Aggregate}^{(l)}(h_v^{(l-1)}, m_v^{(l)})$, where $h_u^{(l-1)}$ is the representation of neighbor $u$ from the previous layer and $N(v)$ is the set of neighbors of node $v$. (A minimal code sketch of such a layer appears after this concept list.)
  • Self-Supervised Learning (SSL): A paradigm in machine learning where a model learns representations from unlabeled data by generating its own supervisory signals. Instead of relying on human-annotated labels, SSL defines pretext tasks that leverage the inherent structure or properties of the data itself. For graphs, this might involve predicting masked node features, reconstructing corrupted graph structures, or distinguishing between different views of the same graph. The learned representations can then be fine-tuned for downstream tasks with limited labeled data.

    • Contrastive SSL: Learns representations by contrasting positive pairs (different views of the same data) with negative pairs (different data points). The goal is to make representations of positive pairs similar and negative pairs dissimilar.
    • Generative SSL: Focuses on generating or reconstructing parts of the input data. Examples include masked autoencoders, where parts of the input are masked, and the model tries to reconstruct them.
  • Large Language Models (LLMs): Deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data. They are characterized by their large number of parameters (billions or even trillions) and their ability to perform a wide range of NLP tasks, including text generation, translation, summarization, and question answering, often with remarkable zero-shot or few-shot generalization capabilities.

    • Zero-Shot Learning: The ability of a model to perform a task it has not been explicitly trained on, using only instructions or examples.
    • Few-Shot Learning: The ability to perform a task with only a few examples.
    • Instruction Tuning: A training technique where LLMs are fine-tuned on a collection of diverse tasks formatted as instructions. This teaches the model to follow natural language commands and improves its ability to generalize to new, unseen tasks described by similar instructions.
    • Chain-of-Thought (CoT) Prompting: A technique that encourages LLMs to generate a series of intermediate reasoning steps before arriving at a final answer. This improves the model's ability to solve complex reasoning problems, especially in multi-step tasks, by making its thought process explicit.
  • Prompt Tuning: A parameter-efficient fine-tuning method for large pre-trained models. Instead of fine-tuning all model parameters, prompt tuning only trains a small set of "soft prompts" (learnable continuous vectors) that are prepended to the input. The pre-trained model's original parameters remain frozen. This makes adaptation to new tasks much more efficient in terms of computation and memory.
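To make the message-passing abstraction above concrete, here is a minimal, illustrative PyTorch sketch of a single mean-aggregation GNN layer; the layer shape and the mean-aggregation choice are assumptions for demonstration, not the paper's encoder:

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Toy message-passing layer: mean-aggregate neighbor features, then update."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.update = nn.Linear(2 * in_dim, out_dim)  # combines h_v and the aggregated message

    def forward(self, h, adj):
        # h:   (N, in_dim) node representations h^{(l-1)}
        # adj: (N, N) binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid division by zero
        messages = adj @ h / deg                          # Propagate: mean over neighbors N(v)
        return torch.relu(self.update(torch.cat([h, messages], dim=-1)))  # Aggregate/update

# usage: 5 nodes, 8-dim features, random toy graph
h = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
layer = SimpleGNNLayer(8, 16)
print(layer(h, adj).shape)  # torch.Size([5, 16])
```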

3.2. Previous Works

The paper contextualizes its contributions by referencing several lines of previous research:

  • Graph Neural Networks (GNNs):

    • Early GNNs: Graph Convolutional Networks (GCNs) [17, 22] adapted convolutional operations to graphs, and Graph Attention Networks (GATs) [39, 43] introduced attention mechanisms.
    • More advanced architectures: Graph Transformer Networks (GTNs) [14, 60] incorporated self-attention and positional encoding.
    • Specific GNN models mentioned as baselines or related work include GraphSAGE [7], RevGNN [21], NodeFormer [51], and DIFFormer [50]. These represent different approaches to message passing, aggregation, or handling graph complexity.
  • Self-Supervised Learning (SSL) and Pre-training on Graphs:

    • Contrastive SSL: DGI [40] (Deep Graph Infomax), GraphCL [59], GCA [67]. Newer developments include automated contrastive augmentation (e.g., JOAO [58], AdaGCL [15]), and community-aware contrastive learning (gCooL [20]).
    • Generative SSL: GraphMAE [10, 11] (for feature masking), S2GAE [35], AutoCF [53] (for reconstructing masked edges).
    • These methods aim to learn robust graph representations without extensive labels but often still require fine-tuning for downstream tasks.
  • Prompt-Tuning for Graph Neural Networks:

    • GPPT [33]: A transfer learning paradigm where GNNs are pre-trained on masked edge prediction and then prompted for node classification.
    • GraphPrompt [26]: Integrates pre-training and downstream tasks into a unified task template.
    • Sun et al. [34]: Presents a unified prompt format, reformulates tasks to the graph level, and uses meta-learning for multi-task performance.
    • Crucial difference: These methods, despite advances, still rely on supervision labels from downstream tasks for fine-tuning. GraphGPT aims for a more challenging zero-shot graph learning scenario without needing these labels.
  • Large Language Models (LLMs):

    • Key LLMs: ChatGPT [29], Claude [1], Llama [36, 37], ChatGLM [62], Baichuan [54].
    • Enhancement techniques: In-context learning [28] (providing demonstrations in the prompt) and Chain-of-Thought (CoT) [47, 57] (generating intermediate reasoning steps).
    • Alignment with human feedback: RLAIF [19] and Self-Instruct [45] are methods for aligning LLMs to specific tasks and human preferences.
    • Multimodal LLMs: Works like MiniGPT-4 [66] and Visual Instruction Tuning [23] have successfully aligned LLMs with visual information. This paper addresses the gap in aligning LLMs with graph structures.
    • Prior LLM-Graph integration attempts: Some studies [2, 5] have attempted to incorporate graph information into LLMs using natural language prompts. However, they face challenges in handling complex graph structures and achieving deep understanding due to the limitations of text-only descriptions, as highlighted by Figure 1 in the paper.

3.3. Technological Evolution

The field has evolved from:

  1. Traditional GNNs: Focusing on specialized architectures for capturing graph topology and node features, but often requiring extensive labeled data for supervised learning.

  2. Self-Supervised GNNs: Enhancing GNNs with pre-training on unlabeled graph data to learn more robust and generalizable representations, still typically needing fine-tuning.

  3. Large Language Models (LLMs): Demonstrating unprecedented generalization capabilities in NLP tasks through massive pre-training and sophisticated tuning techniques like instruction tuning and CoT prompting.

  4. Multimodal LLMs: Extending LLMs to other modalities like vision, proving the adaptability of the LLM paradigm.

  5. Initial LLM-Graph Integration Attempts: Exploring ways to feed graph information to LLMs using natural language, but encountering scalability and structural comprehension limitations.

    This paper's work fits into the current timeline by bridging the gap between LLMs and graph learning. It attempts to port the successful instruction tuning and generalization paradigms of LLMs to the graph domain, specifically targeting zero-shot learning where previous GNN-based methods fall short.

3.4. Differentiation Analysis

Compared to the main methods in related work, GraphGPT presents several core differences and innovations:

  • Beyond Fine-tuning Dependency: Unlike most SSL-enhanced GNNs and even recent prompt-tuning GNNs (e.g., GPPT, GraphPrompt), GraphGPT explicitly targets zero-shot graph learning. Previous methods still require fine-tuning with task-specific labels from downstream tasks. GraphGPT aims to eliminate this dependency, offering a more flexible paradigm for real-world scenarios with scarce labels.

  • Direct LLM Integration with Structural Knowledge: Instead of using GNNs as standalone models or simply feeding text-based descriptions of graphs to LLMs, GraphGPT deeply integrates an LLM with a graph encoder. It specifically addresses C1 and C2 (alignment and comprehension) by developing a text-graph grounding component and a dual-stage instruction tuning approach. This allows the LLM to directly "understand" and reason about graph structures, not just their textual proxies.

  • Efficient Alignment (Lightweight Projector): A key innovation is the use of a lightweight graph-text alignment projector. While other approaches might involve complex neural architectures for cross-modal alignment or extensive fine-tuning of the LLM itself, GraphGPT freezes the LLM and graph encoder, optimizing only a small projector. This makes the training process highly efficient and scalable, avoiding Out-Of-Memory (OOM) issues common when fine-tuning large models.

  • Enhanced Reasoning with CoT Distillation: To tackle C3 (step-by-step reasoning) and address distribution shift, GraphGPT incorporates Chain-of-Thought (CoT) distillation. This technique enables the model to acquire sophisticated reasoning capabilities from powerful closed-source LLMs without increasing its own parameter count, a significant advantage over models that might lack explicit reasoning mechanisms or struggle with complex, multi-step graph tasks.

  • Token Efficiency for Graph Structures: GraphGPT represents graph structures using graph tokens as input to the LLM. This is shown to be significantly more token-efficient than purely text-based descriptions of subgraphs, which can lead to excessively long prompts, higher computational costs, and token limit issues for LLMs. This direct representation allows for more scalable handling of large graph structures.

    In essence, GraphGPT moves beyond simply applying LLMs to graph-related text or prompting GNNs with limited textual instructions. It creates a truly "graph-oriented LLM" that can deeply comprehend graph structures and reason about them in a generalized, zero-shot manner, leveraging the strengths of both GNNs and LLMs while mitigating their individual limitations.

4. Methodology

The GraphGPT framework is designed to integrate Large Language Models (LLMs) with graph structural knowledge, enabling them to generalize across diverse datasets and tasks, especially in zero-shot learning scenarios. This is achieved through a multi-faceted approach involving structural information encoding with text-graph grounding, dual-stage graph instruction tuning, and Chain-of-Thought (CoT) distillation.

4.1. Principles

The core idea of GraphGPT is to align the expressive power and generalization capabilities of Large Language Models (LLMs) with the intricate structural knowledge inherent in graphs. The theoretical basis is rooted in the success of instruction tuning in NLP, where LLMs learn to follow diverse instructions, and contrastive learning, which helps align representations from different modalities. The intuition is that if an LLM can be taught to "see" and "reason" about graph structures in a language-interpretable way, it can leverage its vast pre-trained knowledge to understand and solve graph-related tasks without extensive task-specific fine-tuning. This involves translating graph structures into a format compatible with LLMs, guiding the LLM to interpret these structures, and enhancing its reasoning process for complex graph problems.

4.2. Core Methodology In-depth (Layer by Layer)

The GraphGPT framework consists of three main components: (1) Structural Information Encoding with Text-Graph Grounding, (2) Dual-Stage Graph Instruction Tuning, and (3) Chain-of-Thought (CoT) Distillation. The overall architecture is illustrated in Figure 2.

4.2.1. Structural Information Encoding with Text-Graph Grounding

This initial component aims to bridge the gap between graph structural information and the natural language space understood by LLMs. It involves encoding graph structures and textual features, then aligning them.

Graph Encoder: GraphGPT uses a flexible graph encoder (denoted as $f_{\mathbf{G}}$) that can be built upon various Graph Neural Network (GNN) architectures, such as a Graph Transformer or a Graph Convolutional Network (GCN). This encoder is responsible for generating structure-level graph representations from the input graph. The process involves message passing and aggregation among nodes.

For a GCN-like architecture, the node representations at layer $l$ are updated as follows: $\mathbf{H}^{(l)} = \sigma\left( \tilde{\mathbf{A}} \mathbf{H}^{(l-1)} \mathbf{W} \right)$

  • $\mathbf{H}^{(l)} \in \mathbb{R}^{N \times F'}$: The matrix of graph representations (node embeddings) at the $l$-th layer, where $N$ is the number of nodes and $F'$ is the output feature dimension.

  • $\sigma(\cdot)$: A non-linear activation function (e.g., ReLU), which introduces non-linearity into the model.

  • $\tilde{\mathbf{A}} \in \mathbb{R}^{N \times N}$: The self-loop adjacency matrix, obtained by adding the identity matrix $\mathbf{I}$ to the original adjacency matrix $\mathbf{A}$ (i.e., $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$). This ensures that a node aggregates its own features from the previous layer in addition to those of its neighbors.

  • $\mathbf{H}^{(l-1)} \in \mathbb{R}^{N \times F}$: The matrix of graph representations from the previous layer, with $\mathbf{H}^{(0)} = \mathbf{X}$, the initial node feature matrix.

  • $\mathbf{W} \in \mathbb{R}^{F \times F'}$: A learnable parameter matrix that transforms the features.

    After multiple layers, the graph encoder produces a final graph representation matrix $\mathbf{H}$. A minimal sketch of this propagation rule follows.
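As an illustration of the propagation rule above, the following is a minimal PyTorch sketch of one simplified GCN-style layer; degree normalization is omitted to match the simplified equation shown, and the shapes are toy assumptions rather than the paper's released encoder:

```python
import torch
import torch.nn as nn

def gcn_layer(H, A, W):
    """One simplified GCN-style propagation step: H' = sigma(Ã H W), with Ã = A + I."""
    A_tilde = A + torch.eye(A.size(0))          # add self-loops
    return torch.relu(A_tilde @ H @ W)          # propagate + linear transform + non-linearity

# hypothetical shapes: N=4 nodes, F=8 input dims, F'=16 output dims
N, F_in, F_out = 4, 8, 16
H = torch.randn(N, F_in)                        # H^{(l-1)} (or X for the first layer)
A = (torch.rand(N, N) > 0.6).float()
W = nn.Parameter(torch.randn(F_in, F_out) * 0.1)
print(gcn_layer(H, A, W).shape)                 # torch.Size([4, 16])
```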

Text Encoder: For the raw textual content $\mathbf{C} = \{ c_i \mid 1 \leq i \leq N \}$ associated with each node (e.g., paper abstracts or titles), a text encoder (denoted as $f_{\mathrm{T}}$) is used. This can be a Transformer or BERT model. This encoder generates encoded text representations for each node, forming a matrix $\mathbf{T}$.

Text-Structure Alignment (Contrastive Approach): To align the graph structure information with LLMs, a contrastive approach is used, building on previous works in multimodal alignment [30, 49]. First, the encoded graph representations $\mathbf{H}$ and text representations $\mathbf{T}$ are normalized row-wise to $\hat{\mathbf{H}}$ and $\hat{\mathbf{T}}$, respectively: $\mathbf{H} = f_{\mathbf{G}}(\mathbf{X}), \; \mathbf{T} = f_{\mathrm{T}}(\mathbf{C}), \; \hat{\mathbf{H}} = \mathrm{norm}(\mathbf{H}), \; \hat{\mathbf{T}} = \mathrm{norm}(\mathbf{T})$

  • $\mathbf{X}$: Initial node features for the graph encoder.

  • $\mathbf{C}$: Raw textual contents for the text encoder.

  • $\mathrm{norm}(\cdot)$: A normalization function (e.g., L2 normalization) applied row-wise to feature vectors.

    Next, three alignment scores ($\Gamma_1, \Gamma_2, \Gamma_3$) are calculated using dot products and scaled by an exponential temperature parameter $\tau$: $\Gamma_1 = (\hat{\mathbf{H}} \hat{\mathbf{T}}^{\top}) \cdot \exp(\tau)$, $\Gamma_2 = (\hat{\mathbf{H}} \hat{\mathbf{T}}'^{\top}) \cdot \exp(\tau)$, $\Gamma_3 = (\hat{\mathbf{T}} \hat{\mathbf{T}}'^{\top}) \cdot \exp(\tau)$, where:

  • $\hat{\mathbf{T}}' = \{ \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \hat{\mathbf{T}}_j, \; 1 \leq i \leq N \}$: The averaged textual embedding of the neighbors of each node $i$, where $\mathcal{N}_i$ is the set of neighbors of node $i$. This captures local textual context.

  • $\tau$: A temperature parameter that scales the similarity scores. A higher $\tau$ makes the distribution sharper, emphasizing stronger matches.

  • $\Gamma_1$: Measures the alignment between node graph embeddings and their own textual embeddings.

  • $\Gamma_2$: Measures the alignment between node graph embeddings and the averaged textual embeddings of their neighbors.

  • $\Gamma_3$: Measures the alignment between node textual embeddings and the averaged textual embeddings of their neighbors.

    These scores are used in a contrastive loss function based on cross-entropy (CE) to encourage alignment: $\mathcal{L} = \sum_{i=1}^{3} \frac{1}{2} \lambda_i \left( \mathrm{CE}(\Gamma_i, \mathbf{y}) + \mathrm{CE}(\Gamma_i^{\top}, \mathbf{y}) \right)$

  • $\mathcal{L}$: The total contrastive alignment loss.

  • $\lambda_i$: Weighting coefficients for each of the three alignment components.

  • $\mathrm{CE}(\cdot, \cdot)$: The cross-entropy loss function. It measures the difference between two probability distributions; here it is used to maximize agreement between positive pairs (correctly aligned graph and text representations) and minimize agreement with negative pairs.

  • $\mathbf{y} = (0, 1, \cdots, n-1)^{\top}$: The target label vector for the contrastive objective. For each node $i$, the goal is to select its own text (or neighbor text) as the positive match, with all other nodes serving as negative samples; the targets therefore form an identity matching.

    The workflow of this text-structure alignment is conceptualized in Figure 3; a minimal sketch of the contrastive objective follows the figure.

Figure 3: Workflow of text-structure alignment. (The figure shows the graph structure processed by a GNN and the textual attributes processed by a Transformer, with the two representations then fused and aligned.)
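A minimal, hypothetical PyTorch sketch of the three-way contrastive objective described above is shown below; the tensor shapes, the neighbor-averaging step, and the default values of tau and the lambda weights are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def text_graph_alignment_loss(H, T, adj, tau=0.07, lambdas=(1.0, 1.0, 1.0)):
    """Sketch of the three-way contrastive alignment between graph and text embeddings."""
    H_hat = F.normalize(H, dim=-1)                       # row-normalized graph embeddings
    T_hat = F.normalize(T, dim=-1)                       # row-normalized text embeddings
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    T_nbr = F.normalize(adj @ T_hat / deg, dim=-1)       # neighbor-averaged text embeddings T̂'

    scale = torch.exp(torch.tensor(tau))
    gammas = [H_hat @ T_hat.T * scale,                   # Γ1: node structure vs. own text
              H_hat @ T_nbr.T * scale,                   # Γ2: node structure vs. neighbor text
              T_hat @ T_nbr.T * scale]                   # Γ3: own text vs. neighbor text

    y = torch.arange(H.size(0))                          # identity matching targets
    loss = sum(0.5 * lam * (F.cross_entropy(g, y) + F.cross_entropy(g.T, y))
               for lam, g in zip(lambdas, gammas))
    return loss

# usage with random embeddings: N=6 nodes, d=32
N, d = 6, 32
loss = text_graph_alignment_loss(torch.randn(N, d), torch.randn(N, d),
                                 (torch.rand(N, N) > 0.5).float())
print(loss.item())
```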

4.2.2. Dual-Stage Graph Instruction Tuning

This paradigm adapts the LLM's language capacity to graph learning tasks through two stages, building on the concept of instruction tuning. The overall architecture is depicted in Figure 2.

Figure 2: The overall architecture of the proposed GraphGPT with its graph instruction tuning paradigm. (The figure depicts structural information encoding, the text-graph structure encoder, the large language model, the alignment projector, and the two-stage instruction tuning workflow.)

4.2.2.1. Self-Supervised Instruction Tuning (Stage 1)

This first stage aims to enhance the LLM's reasoning abilities by incorporating graph domain-specific structural knowledge and understanding contextual information within the graph's structure.

Instruction Design for Graph Matching: Self-supervised signals are derived from unlabeled graph structures. The core instruction task is a structure-aware graph matching task.

  • Graph Information: For each node, a subgraph structure is generated by treating the node as the central node and performing $h$-hop random neighbor sampling (a minimal sampling sketch follows this list). This subgraph's structural information is represented by graph tokens.
  • Human Question: The LLM receives a human question that includes an indicator token <graph> and a shuffled list of node text information (e.g., paper titles for a citation graph).
  • GraphGPT Response (Objective): The LLM's task is to reorder the shuffled list of node text information to align each graph token with its corresponding node text information. This forces the LLM to learn to associate graph structure with textual descriptions. An example of how these instructions are presented is implicitly shown in the "Graph Matching" section of Figure 4.
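The h-hop random neighbor sampling referenced above can be sketched as follows; sample_subgraph, its arguments, and the adjacency-list format are hypothetical illustrations rather than the paper's sampling code:

```python
import random

def sample_subgraph(adj_list, center, hops=2, fanout=4, seed=0):
    """Hypothetical h-hop random neighbor sampling around a central node.
    adj_list: dict mapping node id -> list of neighbor ids."""
    random.seed(seed)
    visited, frontier = {center}, [center]
    edges = []
    for _ in range(hops):
        next_frontier = []
        for u in frontier:
            nbrs = adj_list.get(u, [])
            for v in random.sample(nbrs, min(fanout, len(nbrs))):  # sample up to `fanout` neighbors
                edges.append((u, v))
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return sorted(visited), edges   # node set and sampled edge list of the subgraph

# toy usage
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
print(sample_subgraph(adj, center=0, hops=2, fanout=2))
```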

Tuning Strategy with Lightweight Alignment Projector: To make the tuning efficient, a Lightweight Alignment Projector (denoted as $f_{\mathbf{P}}$) is introduced.

  • Frozen Parameters: During training, the parameters of both the LLM and the graph encoder are fixed (frozen).

  • Optimized Parameters: Only the parameters of the projector $f_{\mathbf{P}}$ are optimized. The projector can be as simple as a single linear layer.

  • Function: The projector $f_{\mathbf{P}}$ learns to map the encoded graph representation (from the graph encoder) to graph tokens that are compatible with the LLM's input space.

  • Process: The indicator token <graph> in the original language token sequence is replaced by the projected, aligned graph tokens. This creates a modified token sequence for the LLM, typically of the form <graph_begin>, <graph_token>_1, ..., <graph_token>_n, <graph_end>, where $n$ is the number of nodes in the subgraph associated with the prompt.

  • Unsupervised Nature: Because the graph matching task is self-supervised (does not require human labels), it allows leveraging a vast amount of unlabeled graph data from different domains, which significantly enhances the generalizability of the learned projector.

    Mathematically, with projected graph tokens $\mathbf{X}_{\mathcal{G}} = f_{\mathbf{P}}(\mathbf{H})$ and instruction token embeddings $\mathbf{X}_{\mathcal{I}} = \mathrm{Tokenizer}(\text{instruction})$, the probability of generating the target output $\mathbf{X}_O$ of length $L$ is computed as: $p(\mathbf{X}_O \mid \mathbf{X}_{\mathcal{G}}, \mathbf{X}_{\mathcal{I}}) = \prod_{i=1}^{L} p_{\theta}(x_i \mid \mathbf{X}_{\mathcal{G}}, \mathbf{X}_{\mathcal{I},<i}, \mathbf{X}_{O,<i})$

  • $p(\cdot)$: The probability of a sequence.

  • $\mathbf{X}_O$: The target output sequence (e.g., the reordered list of paper titles in the graph matching task).

  • $\mathbf{X}_{\mathcal{G}}$: The projected graph tokens representing the graph structure, i.e., $\mathbf{X}_{\mathcal{G}} = f_{\mathbf{P}}(\mathbf{H})$, where $\mathbf{H}$ is the encoded graph representation from the graph encoder and $f_{\mathbf{P}}$ is the lightweight alignment projector.

  • $\mathbf{X}_{\mathcal{I}}$: The token embeddings of the instruction provided to the LLM (e.g., the human question).

  • $L$: The length of the output sequence.

  • $x_i$: The $i$-th token in the output sequence.

  • $\mathbf{X}_{\mathcal{I},<i}$: The instruction tokens before position $i$.

  • $\mathbf{X}_{O,<i}$: The output tokens generated before position $i$.

  • $\theta$: The learnable parameters of GraphGPT, here primarily the projector $f_{\mathbf{P}}$, since the LLM and graph encoder are frozen. A minimal sketch of the projector and token splicing follows.
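The sketch below illustrates the lightweight projector and the splicing of projected graph tokens into the LLM's input embeddings; the module and function names are hypothetical, the single linear layer follows the description above, and 4096 is Vicuna-7B's hidden size:

```python
import torch
import torch.nn as nn

class GraphTextProjector(nn.Module):
    """Hypothetical lightweight alignment projector: maps frozen graph-encoder
    outputs into the (frozen) LLM's token-embedding space."""
    def __init__(self, graph_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(graph_dim, llm_dim)   # the only trainable module

    def forward(self, node_embeddings):
        return self.proj(node_embeddings)           # (n_nodes, llm_dim) "graph tokens"

def splice_graph_tokens(text_embeds, graph_tokens, graph_slot):
    """Replace the single <graph> placeholder embedding at index `graph_slot`
    with the n projected graph-token embeddings."""
    return torch.cat([text_embeds[:graph_slot], graph_tokens, text_embeds[graph_slot + 1:]], dim=0)

# toy usage: 12 instruction tokens with a <graph> placeholder at position 5,
# 4 subgraph nodes, graph_dim=64, llm_dim=4096
projector = GraphTextProjector(64, 4096)
graph_tokens = projector(torch.randn(4, 64))
text_embeds = torch.randn(12, 4096)
inputs_embeds = splice_graph_tokens(text_embeds, graph_tokens, graph_slot=5)
print(inputs_embeds.shape)   # torch.Size([15, 4096])
```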

4.2.2.2. Task-Specific Instruction Tuning (Stage 2)

This second stage fine-tunes the LLM to customize its reasoning behavior for specific downstream graph learning tasks (e.g., node classification, link prediction).

Instruction Design:

  • Consistent Template: A consistent instruction template (Graph Information, Human Question, GraphGPT Response) is used.

  • Graph Information: Similar to Stage 1, for each node, a subgraph is generated using the same neighbor sampling approach, with the node acting as the central node.

  • Node Classification Example: For node classification, the human question instruction includes the <graph> indicator token and specific text information about the central node. The LLM's goal is to predict the central node's category based on both graph structure and accompanying text. Figure 4 provides instruction examples for Graph Matching, Node Classification, and Link Prediction.

Figure 4: Instruction examples for Graph Matching, Node Classification, and Link Prediction tasks. The Graph Matching example shows the <graph> token, the central node ID, the edge index, and a request to reorder a shuffled list of paper titles. The Node Classification example shows the <graph> token, the central node ID, the edge index, and the question "Which arXiv cs sub-category does this paper belong to?". The Link Prediction example shows <graph> tokens for two central nodes, their edge indices, and the question "are these two central nodes connected? Give me an answer of 'yes' or 'no'.", together with an example GraphGPT response.
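For concreteness, a hypothetical node-classification instruction sample might look like the following; the field names and exact phrasing are illustrative and not necessarily the schema used in the released GraphGPT repository:

```python
# Hypothetical node-classification instruction sample (illustrative schema only).
instruction_sample = {
    "graph": {
        "node_idx": 1234,                                      # central node id
        "edge_index": [[1234, 1234, 987], [987, 4567, 4567]],  # sampled subgraph edges
        "node_list": [1234, 987, 4567],
    },
    "conversations": [
        {"from": "human",
         "value": "Given a citation graph: <graph> where the first node is the target paper, "
                  "with title and abstract: '...'. Which arXiv cs sub-category does this paper "
                  "belong to? Give the most likely category."},
        {"from": "gpt",
         "value": "cs.LG"},
    ],
}
```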

Tuning Strategy:

  • Initial Parameters: The projector parameters learned during Stage 1 (self-supervised instruction tuning) are used as the initial state for Stage 2.

  • Frozen Parameters: Again, the parameters of the LLM and the graph encoder remain fixed (frozen).

  • Optimized Parameters: Only the projector parameters are optimized further. This ensures that the LLM aligns specifically with the requirements of downstream tasks without forgetting its general structural understanding from Stage 1.

    After these two stages, GraphGPT is equipped to comprehend graph structures and perform downstream tasks.

4.2.3. Chain-of-Thought (CoT) Distillation

To enhance GraphGPT's step-by-step reasoning abilities and improve performance in the face of distribution shift (e.g., varying numbers of node classes), the Chain-of-Thought (CoT) technique is incorporated through distillation.

Challenge: Directly implementing CoT in smaller models can be challenging due to its dependence on the scale of model parameters [32].

Solution: Distillation: The paper adopts a distillation approach inspired by previous research [32].

  • Knowledge Source: High-quality CoT instructions are generated using a powerful, closed-source Large Language Model like ChatGPT (specifically GPT-3.5, with over 200 billion parameters). This model acts as the "teacher."
  • Goal: To transfer the CoT reasoning capabilities from the teacher LLM to GraphGPT (the "student" model) without increasing GraphGPT's own parameter count.

CoT Distillation Paradigm:

  1. Tailored CoT Prompts: For node-specific tasks (e.g., node classification in a citation graph), specific CoT prompts are designed.

  2. Input to Teacher LLM: The abstract, paper title, and a task description are provided as input to GPT-3.5.

  3. CoT Trigger: The phrase "Please think about the categorization in a step-by-step manner." is included in the prompt to trigger step-by-step reasoning in GPT-3.5.

  4. Teacher Output: GPT-3.5 generates an output that includes both the prediction for node classes and detailed explanations for each prediction, making the reasoning transparent.

  5. Integration: This generated CoT instruction data is then integrated with the previously designed instructions (from Stage 2's task-specific instruction tuning).

  6. Instruction Tuning with CoT Data: GraphGPT is then fine-tuned using this integrated instruction dataset, following the same instruction tuning paradigm, to learn the step-by-step reasoning patterns.

    This process enables GraphGPT to mimic the reasoning process of a much larger model, leading to improved performance, especially on complex graph tasks, without the need for a massive increase in its own trainable parameters.
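A hedged sketch of this CoT data-generation step is shown below; query_teacher_llm is a placeholder for whatever teacher-LLM client is used (e.g., GPT-3.5) and is not a real API call, and the prompt wording beyond the stated trigger phrase is illustrative:

```python
# Hypothetical sketch of generating CoT instruction data from a teacher LLM.
COT_TRIGGER = "Please think about the categorization in a step-by-step manner."

def build_cot_prompt(title, abstract, task_description):
    """Assemble a node-specific CoT prompt from the paper text and task description."""
    return (f"{task_description}\n"
            f"Title: {title}\n"
            f"Abstract: {abstract}\n"
            f"{COT_TRIGGER}")

def distill_cot_instructions(papers, task_description, query_teacher_llm):
    """For each paper, ask the teacher model for a prediction plus step-by-step
    explanation, and package the result as an instruction-tuning sample."""
    samples = []
    for paper in papers:
        prompt = build_cot_prompt(paper["title"], paper["abstract"], task_description)
        cot_response = query_teacher_llm(prompt)      # prediction + reasoning steps
        samples.append({"instruction": prompt, "response": cot_response})
    return samples
```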

5. Experimental Setup

5.1. Datasets

The experiments evaluate GraphGPT on three public graph datasets, primarily focusing on node classification tasks.

  • OGB-arxiv [12]:

    • Source: From the Open Graph Benchmark (OGB), representing a directed citation network of computer science arXiv papers indexed by MAG [41].
    • Scale & Characteristics: A large-scale graph. Each paper (node) is associated with an abstract and title (textual features) and citations (edges).
    • Domain: Computer Science academic papers.
    • Labels: Manually labeled with a research category from 40 subject areas.
    • Example Data (Hypothetical based on context): A node might represent a paper with Abstract: "We propose a novel method for graph neural networks using attention mechanisms..." and Title: "Attention-GNN for Graph Learning". Its label could be cs.LG (Machine Learning).
  • PubMed [8]:

    • Source: Derived from the PubMed database.
    • Scale & Characteristics: Consists of 19,717 scientific publications. It includes a citation network with 44,338 links.
    • Domain: Biomedical/Medical (specifically diabetes research).
    • Labels: Categorized into 3 classes: Experimental induced diabetes, Type 1 diabetes, and Type 2 diabetes.
    • Example Data (Hypothetical): A node representing a paper with Abstract: "Investigating the genetic markers for early diagnosis of Type 2 diabetes in adult populations..." and Title: "Genetic Predisposition to Type 2 Diabetes". Its label would be Type 2 diabetes.
  • Cora [49]:

    • Source: A classic citation network dataset.

    • Scale & Characteristics: Comprises 25,120 research papers connected through citations. The version used is an expanded version with 70 classes, which is larger than some previous versions [17].

    • Domain: Computer Science academic papers.

    • Labels: Each paper is categorized into one of 70 classes.

    • Example Data (Hypothetical): Similar to OGB-arxiv, a node representing a paper with textual content, and a label such as Neural Networks or Databases.

      Choice of Datasets: These datasets are widely used benchmarks in graph machine learning for node classification tasks. They represent different scales, domains (CS vs. Medical), and numbers of classes (3, 40, 70), which collectively allow for a comprehensive evaluation of the model's performance and generalization ability in both supervised and zero-shot settings. They are effective for validating the method's performance across varying complexities and domains.

5.2. Evaluation Metrics

For evaluating the model's performance, three commonly used metrics are employed: Accuracy, Macro F1 (for node classification), and AUC (for link prediction). A minimal code sketch computing all three appears at the end of this subsection.

5.2.1. Accuracy

  • Conceptual Definition: Accuracy measures the proportion of correctly classified instances out of the total number of instances. It is a straightforward measure of overall correctness, commonly used in classification tasks.
  • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Symbol Explanation:
    • Number of Correct Predictions: The count of instances where the model's predicted label matches the true label.
    • Total Number of Predictions: The total count of all instances for which the model made a prediction.

5.2.2. Macro F1-Score

  • Conceptual Definition: The F1-score is the harmonic mean of precision and recall. It is a good metric for imbalanced datasets as it considers both false positives and false negatives. Macro F1-score calculates the F1-score independently for each class and then takes the average, treating all classes equally. This is useful when all classes are considered equally important.
  • Mathematical Formula: $ \text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $ Where: $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $ $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $ For the Macro F1-score, compute the F1-score for each class $C_k$, then average: $ \text{Macro F1} = \frac{1}{|C|} \sum_{k=1}^{|C|} \text{F1-score}_k $
  • Symbol Explanation:
    • True Positives (TP): Instances correctly predicted as positive.
    • False Positives (FP): Instances incorrectly predicted as positive (actually negative).
    • False Negatives (FN): Instances incorrectly predicted as negative (actually positive).
    • Precision: The proportion of positive identifications that were actually correct.
    • Recall: The proportion of actual positives that were identified correctly.
    • $|C|$: The total number of classes.
    • $\text{F1-score}_k$: The F1-score calculated for class $k$.

5.2.3. Area Under the Receiver Operating Characteristic Curve (AUC)

  • Conceptual Definition: AUC is a performance metric used primarily for binary classification problems, particularly in link prediction tasks where the goal is to classify whether a link exists between two nodes. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. AUC represents the area under this curve, with a value of 1.0 indicating a perfect classifier and 0.5 indicating a random classifier. It measures the classifier's ability to distinguish between classes.
  • Mathematical Formula: $ \text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(x)) dx $ Where: $ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} (\text{Recall}) $ $ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} $
  • Symbol Explanation:
    • TPR (True Positive Rate): Also known as Recall or Sensitivity.
    • FPR (False Positive Rate): The proportion of actual negatives that were identified incorrectly as positive.
    • $\text{FPR}^{-1}(x)$: The inverse function of the FPR.
    • True Negatives (TN): Instances correctly predicted as negative.
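As a minimal illustration (toy data, not the paper's evaluation code), these three metrics can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Node classification: toy predictions over 3 classes
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 0])
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1

# Link prediction: toy binary labels and predicted link scores
link_labels = np.array([1, 0, 1, 1, 0, 0])
link_scores = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1])
print("AUC:", roc_auc_score(link_labels, link_scores))
```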

5.3. Baselines

The paper compares GraphGPT against a comprehensive set of state-of-the-art methods across five categories, plus open-sourced LLMs:

  1. MLP (Multilayer Perceptron): A basic neural network that processes node features without considering graph structure. It serves as a fundamental non-graph baseline.
  2. Representative Graph Neural Encoders: Traditional GNN models that learn node representations by propagating and aggregating information from neighbors.
    • GraphSAGE [7]: An inductive GNN framework that generates embeddings by sampling and aggregating features from a node's local neighborhood.
    • GCN [17]: A foundational GNN model that applies spectral graph convolution.
    • GAT [39]: Introduces attention mechanisms to assign different weights to neighboring nodes during aggregation.
    • RevGNN [21]: Designed to build very deep GNNs (1000+ layers) by using reversible connections.
  3. Self-supervised Approach for Graph Learning: Models that use self-supervision to learn robust graph representations, typically pre-trained on unlabeled data.
    • DGI [40]: Deep Graph Infomax, a contrastive SSL method that maximizes mutual information between local and global representations.
  4. Knowledge Distillation-Enhanced GNNs: Methods that leverage knowledge distillation techniques to improve GNNs, often by transferring knowledge from a more complex or ensemble model.
    • GKD [55]: Geometric Knowledge Distillation, focuses on topology compression for GNNs.
    • GLNN [63]: Graph-less Neural Networks, which use distillation to teach MLPs GNN-like behaviors.
  5. Strong Graph Transformer Networks: Recently proposed GNN architectures that incorporate transformer-like mechanisms for capturing global dependencies.
    • NodeFormer [51]: A scalable Graph Structure Learning Transformer for node classification.
    • DIFFormer [50]: Scalable (Graph) Transformers induced by energy-constrained diffusion.
  6. Open-Sourced LLMs: Large Language Models used as baselines to understand text-attributed graph data, relying solely on text information.
    • Baichuan-7B [54]

    • Vicuna-7B-v1.1

    • Vicuna-7B-v1.5

      These baselines cover a wide spectrum of graph learning approaches, from simple MLPs to sophisticated GNNs and LLMs, allowing for a comprehensive comparison of GraphGPT's performance across different capabilities (structural modeling, self-supervision, knowledge transfer, and language understanding).

5.4. Implementation Details

  • Frameworks: Primarily PyTorch and Transformers libraries.
  • Base LLMs: Vicuna-7B-v1.1 and Vicuna-7B-v1.5 are chosen as the underlying Large Language Models for GraphGPT.
  • Node Feature Encoding: Raw text information (e.g., abstracts, titles) from nodes is encoded into a unified vector space using a pre-trained BERT model [3]. This ensures consistent input features across different datasets.
  • Data Splits:
    • Cora and PubMed: Split into training, validation, and testing sets with a 3:1:1 ratio, following common practices [8, 49].
    • OGB-arxiv: Adheres to the public split setting [12] with a 6:2:3 ratio for training, validation, and testing, respectively.
  • Hyperparameters (consolidated in the configuration sketch after this list):
    • Batch size: 2 per GPU.
    • Learning rate: $2 \times 10^{-3}$.
    • Warmup ratio: $3 \times 10^{-2}$ (the fraction of total training steps during which the learning rate linearly increases from zero to the initial learning rate).
    • Maximum input length for LLM: 2048 tokens.
  • Training Stages:
    • Self-Supervised Instruction Tuning (Stage 1): Training runs for 3 epochs.
    • Task-Specific Instruction Tuning (Stage 2): Training runs for 2 epochs. The alignment projector parameters fine-tuned in Stage 1 serve as the initial parameters for Stage 2.
  • Parameter Freezing: During both instruction tuning stages, the parameters of the LLM and the graph encoder are frozen (not updated). Only the lightweight graph-text alignment projector's parameters are optimized.
  • Instruction Data Mixtures (Stage 2): Various combinations of instruction data are explored, including standard instructions ("-std"), CoT instructions ("-cot"), and mixed proportions ("-mix"), as well as link prediction instructions.
  • GNN-based Model Evaluation: For GNN-based models (baselines), a transfer-trained classifier (e.g., a linear layer) is used when testing on new datasets to handle variations in the number of classes.
  • Code Availability: The authors state that their model implementation is open-sourced, and for most baselines, their publicly available code was used.
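The stated hyperparameters can be consolidated into a single configuration sketch as below; the key names are hypothetical, and the released training scripts may organize these values differently:

```python
# Illustrative consolidation of the stated training settings (hypothetical key names).
training_config = {
    "base_llm": "vicuna-7b-v1.5",
    "per_gpu_batch_size": 2,
    "learning_rate": 2e-3,
    "warmup_ratio": 0.03,          # fraction of steps for linear LR warmup
    "max_input_length": 2048,      # maximum LLM input length for instructions
    "stage1_epochs": 3,            # self-supervised (graph matching) instruction tuning
    "stage2_epochs": 2,            # task-specific instruction tuning
    "trainable_modules": ["graph_text_projector"],   # LLM and graph encoder stay frozen
}
```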

6. Results & Analysis

6.1. Core Results Analysis

The paper evaluates GraphGPT's performance on node classification in both supervised and zero-shot settings.

Supervised Task Setting: Models are trained and evaluated on the same dataset (e.g., train on Arxiv-Arxiv, test on Arxiv test set). Zero-Shot Task Setting: Models are trained on one or more source datasets and tested on entirely new datasets without any further training or fine-tuning on the target dataset (e.g., train on Arxiv-PubMed, test on PubMed).

The following are the results from Table 1 of the original paper:

(Each column pair "A-B" denotes training on dataset A and testing on dataset B.)

| Model | Arxiv-Arxiv Acc | Arxiv-Arxiv Macro-F1 | Arxiv-PubMed Acc | Arxiv-PubMed Macro-F1 | Arxiv-Cora Acc | Arxiv-Cora Macro-F1 | (Arxiv+PubMed)-Cora Acc | (Arxiv+PubMed)-Cora Macro-F1 | (Arxiv+PubMed)-Arxiv Acc | (Arxiv+PubMed)-Arxiv Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| MLP | 0.5179 | 0.2536 | 0.3940 | 0.1885 | 0.0258 | 0.0037 | 0.0220 | 0.0006 | 0.2127 | 0.0145 |
| GraphSAGE | 0.5480 | 0.3290 | 0.3950 | 0.1939 | 0.0328 | 0.0132 | 0.0132 | 0.0029 | 0.1281 | 0.0129 |
| GCN | 0.5267 | 0.3202 | 0.3940 | 0.1884 | 0.0214 | 0.0088 | 0.0187 | 0.0032 | 0.0122 | 0.0008 |
| GAT | 0.5332 | 0.3118 | 0.3940 | 0.1884 | 0.0167 | 0.0110 | 0.0161 | 0.0057 | 0.1707 | 0.0285 |
| RevGNN | 0.5474 | 0.3240 | 0.4440 | 0.3046 | 0.0272 | 0.0101 | 0.0217 | 0.0016 | 0.1309 | 0.0126 |
| DGI | 0.5059 | 0.2787 | 0.3991 | 0.1905 | 0.0205 | 0.0011 | 0.0205 | 0.0011 | 0.5059 | 0.2787 |
| GKD | 0.5570 | 0.1595 | 0.3645 | 0.2561 | 0.0470 | 0.0093 | 0.0406 | 0.0037 | 0.2089 | 0.0179 |
| GLNN | 0.6088 | 0.3757 | 0.4298 | 0.3182 | 0.0267 | 0.0115 | 0.0182 | 0.0092 | 0.3373 | 0.1115 |
| NodeFormer | 0.5922 | 0.3328 | 0.2064 | 0.1678 | 0.0152 | 0.0065 | 0.0144 | 0.0053 | 0.2713 | 0.0855 |
| DIFFormer | 0.5986 | 0.3355 | 0.2959 | 0.2503 | 0.0161 | 0.0094 | 0.0100 | 0.0007 | 0.1637 | 0.0234 |
| baichuan-7B | 0.0946 | 0.0363 | 0.4642 | 0.3876 | 0.0405 | 0.0469 | 0.0405 | 0.0469 | 0.0946 | 0.0363 |
| vicuna-7B-v1.1 | 0.2657 | 0.1375 | 0.5251 | 0.4831 | 0.1090 | 0.0970 | 0.1090 | 0.0970 | 0.2657 | 0.1375 |
| vicuna-7B-v1.5 | 0.4962 | 0.1853 | 0.6351 | 0.5231 | 0.1489 | 0.1213 | 0.1489 | 0.1213 | 0.4962 | 0.1853 |
| GraphGPT-7B-v1.1-cot | 0.4913 | 0.1728 | 0.6103 | 0.5982 | 0.1145 | 0.1016 | 0.1250 | 0.0962 | 0.4853 | 0.2102 |
| GraphGPT-7B-v1.5-stage2 | 0.7511 | 0.5600 | 0.6484 | 0.5634 | 0.0813 | 0.0713 | 0.0934 | 0.0978 | 0.6278 | 0.2538 |
| GraphGPT-7B-v1.5-std | 0.6258 | 0.2622 | 0.7011 | 0.6491 | 0.1256 | 0.0819 | 0.1501 | 0.0936 | 0.6390 | 0.2652 |
| GraphGPT-7B-v1.5-cot | 0.5759 | 0.2276 | 0.5213 | 0.4816 | 0.1813 | 0.1272 | 0.1647 | 0.1326 | 0.6476 | 0.2854 |
| p-val | 2.26e-9 | 1.56e-10 | 2.22e-7 | 1.55e-9 | 1.04e-9 | 9.96e-6 | 7.62e-8 | 1.97e-7 | 1.5e-13 | 4.63e-6 |

The results in Table 1 demonstrate several key observations:

Obs.1: Overall Superiority of GraphGPT (RQ1)

  • Supervised Tasks: GraphGPT variants (e.g., GraphGPT-7B-v1.5-stage2 with 0.7511 Accuracy on Arxiv-Arxiv, and GraphGPT-7B-v1.5-std with 0.7011 Accuracy on Arxiv-PubMed) generally outperform all other state-of-the-art baselines. For instance, on Arxiv-Arxiv, GraphGPT-7B-v1.5-stage2 achieves 0.7511 Accuracy, significantly higher than the best GNN baseline (GLNN at 0.6088) or LLM baseline (vicuna-7B-v1.5 at 0.4962).
  • Zero-Shot Tasks: This is where GraphGPT shows its most significant advantage. When trained on one dataset and tested on another without further training (e.g., Arxiv-Cora, (Arxiv+PubMed)-Cora), traditional GNNs exhibit a drastic performance drop (e.g., GraphSAGE from 0.5480 on Arxiv-Arxiv to 0.0328 on Arxiv-Cora). In contrast, GraphGPT maintains much stronger performance and often achieves a 2-10 times increase in accuracy compared to these baselines. For example, on Arxiv-Cora, GraphGPT-7B-v1.5-cot achieves 0.1813 Accuracy, while the best GNN baseline (GKD) only reaches 0.0470.
  • Comparison with LLM Baselines: Open-sourced LLMs like Baichuan-7B and Vicuna-7B show more stable performance across datasets than GNNs in zero-shot settings because they rely only on text. However, they are generally inferior to GraphGPT as they lack structural awareness. GraphGPT effectively integrates graph structure, providing a more comprehensive solution.
  • Reasons for Superiority: The authors attribute this to:
    1. The dual-stage graph instruction tuning which aligns structural information from the graph encoder with natural language tokens, enabling the LLM to understand graph characteristics.
    2. The mutual enhancement between the graph encoder and LLM, filling the LLM's gap in structural understanding and empowering it to reason about graph structures.

Obs.2: Benefits with Structure-aware Graph Matching (RQ3, partially)

  • The comparison between GraphGPT-7B-v1.5-std (which implies both stages of tuning) and GraphGPT-7B-v1.5-stage2 (only the second stage) highlights the importance of the first stage.
  • GraphGPT-7B-v1.5-std (e.g., 0.7011 Accuracy on Arxiv-PubMed, 0.1501 on (Arxiv+PubMed)-Cora) generally outperforms GraphGPT-7B-v1.5-stage2 (e.g., 0.6484 on Arxiv-PubMed, 0.0934 on (Arxiv+PubMed)-Cora) in zero-shot scenarios.
  • The presence of the self-supervised graph matching task (Stage 1) is crucial for zero-shot transferability. It aligns graph tokens (rich structural information) with language tokens, fostering a deeper understanding of inherent graph structures. Without Stage 1, the model (like GraphGPT-7B-v1.5-stage2) is more prone to overfitting on specific datasets, limiting its generalization to unseen data.

Obs.3: Benefits with CoT Distillation (RQ3, partially)

  • Comparing GraphGPT-7B-v1.5-std and GraphGPT-7B-v1.5-cot:
    • For simpler tasks (e.g., PubMed with 3 classes), GraphGPT-7B-v1.5-std performs exceptionally well (0.7011 Accuracy for Arxiv-PubMed).
    • For complex tasks (e.g., Cora with 70 classes), GraphGPT-7B-v1.5-cot (0.1813 Accuracy for Arxiv-Cora) significantly outperforms GraphGPT-7B-v1.5-std (0.1256 Accuracy). Similarly, on (Arxiv+PubMed)-Cora, GraphGPT-7B-v1.5-cot (0.1647) is better than GraphGPT-7B-v1.5-std (0.1501).
  • This indicates that CoT distillation (leveraging reasoning from GPT-3.5) substantially benefits more complex graph learning tasks by integrating powerful reasoning capabilities, enabling better performance on datasets with higher class variability.

6.2. Generalization Ability Investigation (RQ2)

This section explores the model's ability to handle multiple tasks and transfer knowledge without catastrophic forgetting by incorporating more instruction data.

More Data Boost Model Transfer Ability:

  • Observation: The "(Arxiv+PubMed)-Cora" column in Table 1 shows that combining Arxiv and PubMed datasets for training leads to improved zero-shot transfer performance on Cora. For example, GraphGPT-7B-v1.5-cot achieves 0.1647 Accuracy on (Arxiv+PubMed)-Cora, compared to 0.1813 on Arxiv-Cora, indicating that while there's a slight drop for this specific variant, the overall ability to transfer is maintained or improved for other variants and overall model robustness. The text specifically highlights that incorporating PubMed alongside Arxiv shows significant improvement in transfer performance on Cora for GraphGPT.
  • Contrast with GNNs: For GNN-based models, training on a combination of datasets ((Arxiv+PubMed)-Cora) often results in a deterioration of transfer performance compared to training on a single source (e.g., Arxiv-Cora). This suggests they struggle to generalize across different data distributions when combined.

More Data Yet No Forgetting:

  • Observation: The "(Arxiv+PubMed)-Arxiv" column in Table 1 demonstrates that training on a combined dataset (Arxiv and PubMed) does not degrade GraphGPT's performance on the original Arxiv dataset. In fact, GraphGPT-7B-v1.5-cot reaches 0.6476 Accuracy on Arxiv when trained on (Arxiv+PubMed), better than its 0.5759 Accuracy when trained solely on Arxiv.
  • Catastrophic Forgetting: Traditional GNNs, after iterative training on multiple datasets, often experience catastrophic forgetting, where their competence on previously learned tasks or datasets is compromised. GraphGPT effectively mitigates this issue through its unified graph instruction tuning paradigm, allowing it to retain and even enhance its performance by leveraging generalized graph structure patterns from combined data.
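
Operationally, the combined-training setting discussed above amounts to concatenating and shuffling the per-dataset instruction files before tuning. A minimal sketch is given below; the file names and JSON layout are assumptions.

```python
# Minimal sketch of building a combined (Arxiv + PubMed) instruction set.
# File names and the JSON layout are assumptions for illustration.
import json
import random

def load_instructions(path: str) -> list[dict]:
    with open(path) as f:
        return json.load(f)

combined = (load_instructions("arxiv_instructions.json")
            + load_instructions("pubmed_instructions.json"))
random.seed(0)
random.shuffle(combined)  # interleave sources so each batch mixes both datasets

with open("arxiv_pubmed_instructions.json", "w") as f:
    json.dump(combined, f)
```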

Generalization for Multitasking Graph Learner: The following are the results from Table 2 of the original paper:

| Model | Acc (Supervised, Arxiv) | Macro-F1 (Supervised, Arxiv) | Acc (Zero-Shot, Cora) | Macro-F1 (Zero-Shot, Cora) |
| --- | --- | --- | --- | --- |
| MLP | 0.5179 | 0.2536 | 0.0220 | 0.0006 |
| GraphSAGE | 0.5480 | 0.3290 | 0.0132 | 0.0029 |
| GCN | 0.5267 | 0.3202 | 0.0187 | 0.0032 |
| GAT | 0.5332 | 0.3118 | 0.0161 | 0.0057 |
| RevGNN | 0.5474 | 0.3240 | 0.0217 | 0.0016 |
| DGI | 0.5059 | 0.2787 | 0.0205 | 0.0011 |
| GKD | 0.5570 | 0.1595 | 0.0406 | 0.0037 |
| GLNN | 0.6088 | 0.3757 | 0.0182 | 0.0092 |
| NodeFormer | 0.5922 | 0.3328 | 0.0144 | 0.0053 |
| DIFFormer | 0.5986 | 0.3355 | 0.0100 | 0.0007 |
| baichuan-7B | 0.0946 | 0.0363 | 0.0405 | 0.0469 |
| vicuna-7B-v1.1 | 0.2657 | 0.1375 | 0.1090 | 0.0970 |
| vicuna-7B-v1.5 | 0.4962 | 0.1853 | 0.1489 | 0.1213 |
| Arxiv-std + PubMed-std | 0.6390 | 0.2652 | 0.1501 | 0.0936 |
| Arxiv-cot + PubMed-cot | 0.6476 | 0.2854 | 0.1647 | 0.1326 |
| Arxiv-mix + PubMed-mix | 0.6139 | 0.2772 | 0.1544 | 0.1048 |
| Arxiv-std + PubMed-std + Link | 0.5931 | 0.2238 | 0.1847 | 0.1579 |
| Arxiv-mix + PubMed-mix + Link | 0.6874 | 0.3761 | 0.1836 | 0.1494 |

(The bottom five rows are GraphGPT variants trained with the indicated instruction mixtures.)

The following are the results from Table 3 of the original paper:

| Model | AUC (PubMed) | AP (PubMed) |
| --- | --- | --- |
| MLP | 0.5583 | 0.5833 |
| GAT | 0.5606 | 0.6373 |
| GraphSAGE | 0.5041 | 0.5813 |
| RevGNN | 0.4538 | 0.5083 |
| Node2Vec | 0.6535 | 0.6885 |
| w/o Link | 0.5010 | 0.5005 |
| only Link | 0.6704 | 0.6087 |
| Arxiv-std + PubMed-std + Link | 0.8246 | 0.8026 |
| Arxiv-mix + PubMed-mix + Link | 0.6451 | 0.5886 |

  • Instruction Mixture Benefits: Tables 2 and 3 show the impact of mixing different instruction data types (standard, CoT, mixed, and link prediction) on both supervised and zero-shot performance.
    • For supervised learning on Arxiv, Arxiv-mix + PubMed-mix + Link achieved the highest Accuracy (0.6874) and Macro-F1 (0.3761), indicating that a diverse instruction set can significantly boost performance.
    • For zero-shot on Cora, Arxiv-std + PubMed-std + Link achieved the highest Accuracy (0.1847) and Macro-F1 (0.1579). This demonstrates that adding link prediction instructions, even when primarily focused on node classification, enhances the model's ability to generalize to unseen datasets and tasks.
  • Multitasking Capability: The inclusion of task-specific instructions for link prediction (+ Link) notably enhances performance not only for link prediction itself but also for node classification (as seen in Table 2). Conversely, after incorporating node classification tasks, the performance of link prediction (Table 3) also exceeds that of the best existing baselines (e.g., Arxiv-std + PubMed-std + Link gets 0.8246 AUC vs. Node2Vec 0.6535).
  • Conclusion: This indicates that GraphGPT can effectively handle various graph learning tasks simultaneously and successfully transfer knowledge between them, leading to improved generalization across different types of tasks and datasets.
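
To illustrate how node-classification and link-prediction instructions can live in one tuning corpus, the two hypothetical templates below share the same record structure; the exact wording used in the paper may differ.

```python
# Hypothetical task templates for mixing node classification and link
# prediction instructions in a single tuning corpus (wording is illustrative).
node_cls_example = {
    "graph": "<graph>",  # placeholder for the projected subgraph tokens
    "instruction": (
        "Given the citation subgraph around the target paper and its title "
        "and abstract, which arXiv CS sub-category does the paper belong to?"
    ),
    "response": "cs.LG",
}

link_pred_example = {
    "graph": "<graph>",
    "instruction": (
        "Given the subgraphs around paper A and paper B, is there likely to "
        "be a citation link between them? Answer yes or no."
    ),
    "response": "yes",
}
```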

6.3. Module Ablation Study (RQ3)

The paper conducts an ablation study to understand the individual contributions of different sub-modules of GraphGPT. The results are reported in Table 4 of the paper ("Module ablation study under both supervised and zero-shot settings to analyze the individual contributions"), whose variants are discussed below.

The following are the results from Table 4 of the original paper:

| Variant | Acc (Arxiv-Arxiv) | Mac-F1 (Arxiv-Arxiv) | Acc (Arxiv-PubMed) | Mac-F1 (Arxiv-PubMed) | Acc (Arxiv-Cora) | Mac-F1 (Arxiv-Cora) |
| --- | --- | --- | --- | --- | --- | --- |
| w/ GS | 0.4962 | 0.1853 | 0.6351 | 0.5231 | 0.1489 | 0.1213 |
| w/o LR | 0.5807 | 0.2462 | 0.2523 | 0.1925 | 0.0050 | 0.0016 |
| ours | 0.6258 | 0.2622 | 0.7011 | 0.6491 | 0.1813 | 0.1272 |

Note: The variant labeled w/ GS matches the vicuna-7B-v1.5 row of Table 1 and, per the paper's description ("In this variant, we directly adopt the base LLM (specifically, Vicuna-7B-v1.5) to perform node classification on three datasets, without incorporating graph structural information"), represents the base LLM without graph structural knowledge. The label "w/ GS" therefore appears to be a typo for "w/o GS"; it is treated below as the structure-free baseline.

  • Effect of Graph Instruction Tuning (w/ GS vs. ours):

    • The variant labeled w/ GS in the table (interpreted as the base Vicuna-7B-v1.5 LLM without GraphGPT-style structural integration) shows significantly lower performance compared to ours (the full GraphGPT-7B-v1.5-std model).
    • For example, on Arxiv-Arxiv, w/ GS achieves 0.4962 Accuracy, while ours (0.6258) performs substantially better. The difference is even more pronounced in zero-shot settings: on Arxiv-Cora, w/ GS gets 0.1489 Accuracy, while ours reaches 0.1813.
    • This confirms that GraphGPT's graph instruction tuning paradigm effectively enables the LLM to understand and leverage graph structural information. The improvement is achieved without altering the LLM's original parameters, relying instead on the lightweight alignment projector for efficient graph-text alignment.
  • Effect of LLM-enhanced Semantic Reasoning (w/o LR vs. ours):

    • The w/o LR variant (without LLM Reasoning) represents using only the default graph encoders (i.e., GNNs without the LLM component) for prediction.
    • w/o LR shows extremely poor performance, especially in zero-shot tasks (e.g., Arxiv-Cora Accuracy of 0.0050, compared to ours at 0.1813). Even in supervised setting on Arxiv-Arxiv, ours (0.6258) significantly outperforms w/o LR (0.5807).
    • This indicates that the LLM's reasoning ability is crucial. The rich semantic information and reasoning capabilities injected by the LLM component in GraphGPT provide a substantial performance gain, particularly for generalizing to unseen data where structural information alone (from GNNs) is insufficient.
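
The Accuracy and Macro-F1 numbers quoted throughout these tables can be computed from predicted and gold labels with standard scikit-learn metrics, as in the short sketch below (the label lists are placeholders).

```python
# Computing the two evaluation metrics used in the ablation tables
# (placeholder label lists; in practice these come from model outputs).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["cs.LG", "cs.AI", "cs.LG", "cs.CV"]
y_pred = ["cs.LG", "cs.LG", "cs.LG", "cs.CV"]

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids warnings for classes the model never predicts.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"Acc = {acc:.4f}, Macro-F1 = {macro_f1:.4f}")
```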

6.4. Model Efficiency Study (RQ4)

This section investigates the computational efficiency of GraphGPT during both training and inference.

The following are the results from Table 5 of the original paper:

| Variant | Training Time (hh:mm:ss) | Tuned Parameters | GPU Memory (MiB per GPU) |
| --- | --- | --- | --- |
| Stage-1-tune | OOM | 6,607,884,288 | OOM |
| Stage-1-freeze | 22:53:33 | 131,612,672 | 39,517.75 |
| improvement | - | ↓ 50.21× | - |
| Stage-2-tune | OOM | 6,607,884,288 | OOM |
| Stage-2-freeze | 03:44:35 | 131,612,672 | 38,961.75 |
| improvement | - | ↓ 50.21× | - |

  • Training Efficiency with Graph Instruction Tuning:

    • The table compares two scenarios: "-tune" (tuning all LLM parameters) and "-freeze" (freezing LLM/GNN parameters and only tuning the lightweight projector) in a 4-card 40G Nvidia A100 environment.
    • When attempting to tune all LLM parameters (Stage-1-tune, Stage-2-tune), the system consistently runs Out-Of-Memory (OOM), even with a batch size of 1. This highlights the immense computational cost of fine-tuning large LLMs.
    • In contrast, using GraphGPT's proposed tuning strategy (Stage-1-freeze, Stage-2-freeze), where only the lightweight projector is trained, the process is stable and feasible, even with a batch size of 2.
    • This strategy drastically reduces the number of tuned parameters by more than 50 times (from ~6.6 billion to ~131 million), demonstrating significant parameter efficiency.
    • GPU occupancy remains manageable (~39 GB per GPU), and training time is reasonable (about 23 hours for Stage 1 and 3.7 hours for Stage 2). This confirms the scalability and efficiency of GraphGPT's training approach; a minimal sketch of this freeze-only setup is shown after the inference discussion below.
  • Model Inference Efficiency: The following figure (Figure 5 from the original paper) shows the inference efficiency study of our GraphGPT.

    Figure 5: Inference efficiency study of our GraphGPT. (The chart compares inference time in seconds, left axis, and accuracy, right axis, for different models on the Arxiv and Cora datasets; GraphGPT performs well on both inference time and accuracy.)

    • Figure 5 compares the inference speed (seconds per response) and accuracy of GraphGPT against other LLMs (baichuan-7B, vicuna-7B-v1.1, vicuna-7B-v1.5) on Arxiv and Cora CoT instruction datasets.
    • baichuan-7B shows very fast inference but often yields low accuracy, indicating quick but potentially irrelevant or incorrect answers.
    • vicuna-7B-v1.1 and vicuna-7B-v1.5 require longer inference times, suggesting more complex reasoning steps for better answers.
    • GraphGPT (GraphGPT-v1.5-cot) demonstrates a strong balance: it achieves high accuracy (comparable to or better than vicuna-7B-v1.5 for Arxiv-CoT and significantly better for Cora-CoT) with a relatively brief reasoning process (inference time between vicuna-7B-v1.1 and vicuna-7B-v1.5), enhancing overall inference efficiency. This indicates that GraphGPT can produce accurate predictions through a concise yet effective reasoning path.
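
The "-freeze" rows in Table 5 correspond to keeping the LLM (and graph encoder) frozen and training only the small projector. The sketch below shows that setup and how to count trainable parameters; the model name and the toy single-layer projector are assumptions, so its parameter count is far smaller than the roughly 131M tuned parameters reported for GraphGPT's own projector design.

```python
# Sketch of the "-freeze" setup: freeze the LLM, train only a small projector.
# Model name and projector dimensions are illustrative assumptions.
import torch.nn as nn
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
for p in llm.parameters():
    p.requires_grad = False          # LLM weights stay fixed

projector = nn.Linear(128, llm.config.hidden_size)  # the only tuned module

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print("LLM trainable params:      ", count_trainable(llm))        # 0
print("Projector trainable params:", count_trainable(projector))  # ~0.5M for this toy projector
```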

6.5. Model Case Study (RQ5)

This section provides a qualitative analysis comparing GraphGPT with ChatGPT on downstream graph learning tasks, focusing on node classification.

The following are the results from Table 6 of the original paper:

Ground-Truth Category: cs.LG, Machine Learning

Note: The remaining content of Table 6 is not legible in the extracted text; only the ground-truth label above is recoverable. The case study is therefore summarized below from the paper's descriptive text and the partial ChatGPT response shown in the original table.

Case Study Overview:

  • Task: Node classification (predicting the arXiv CS sub-category for a paper).
  • Dataset: Arxiv data.
  • Instructions Tested:
    1. Node content only (for ChatGPT).
    2. Node content with text-based graph structure (for ChatGPT).
    3. GraphGPT's designed graph instruction (for GraphGPT).
  • Specific Example: A paper whose research domains span machine learning and hardware architecture, indicating cross-disciplinary characteristics. The Ground-Truth Category is cs.LG, Machine Learning.

Observations:

  • ChatGPT Limitations: Despite its massive parameter count (over 200 billion), ChatGPT struggles to make accurate predictions when relying solely on node text information or node content with text-based graph structure. This is particularly evident for cross-disciplinary papers, where textual ambiguity might lead to incorrect classifications. The image snippet, which appears to be a ChatGPT response, lists multiple cs sub-categories (LG, AI, NE, SC, AR) as "most likely to be relevant," indicating uncertainty and a lack of precise classification, even though it provides explanations for each.

  • GraphGPT's Accuracy and Explanations: In contrast, GraphGPT consistently provides accurate predictions and reasonable explanations. This is attributed to GraphGPT's ability to incorporate a subgraph structure (e.g., 103 nodes in the example), allowing it to extract rich structural information from neighboring nodes' citation relationships. This structural context resolves ambiguities that purely text-based LLMs might encounter.

  • Token Efficiency: GraphGPT's approach of using graph tokens to represent graph structures is significantly more efficient than natural language descriptions. For a 103-node subgraph, GraphGPT requires only 750 tokens as input to the LLM. The equivalent text-based method (describing the graph in natural language) would require 4649 tokens. This substantial reduction in token consumption translates directly to lower training and inference resource requirements, making GraphGPT more scalable for large graphs.

    Conclusion: The case study highlights GraphGPT's ability to overcome the limitations of text-only LLMs by integrating structural knowledge efficiently, leading to more accurate and contextually informed predictions, especially in complex and cross-disciplinary scenarios.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work successfully introduces GraphGPT, an effective and scalable graph large language model designed to enhance the generalization capabilities of graph models. The framework innovatively injects graph domain-specific structural knowledge into an LLM through a dual-stage graph instruction tuning paradigm. A key component, the lightweight graph-text alignment projector, efficiently enables LLMs to comprehend and interpret the structural elements of graphs. Extensive evaluations across various supervised and zero-shot graph learning scenarios consistently demonstrate the method's superior effectiveness. Furthermore, GraphGPT exhibits robust generalization abilities, adeptly handling diverse downstream datasets and tasks without succumbing to catastrophic forgetting, a common issue in traditional iterative model training. The integration of Chain-of-Thought (CoT) distillation further bolsters its step-by-step reasoning, particularly beneficial for complex tasks and scenarios involving distribution shifts.

7.2. Limitations & Future Work

The authors primarily point to one potential avenue for future investigation:

  • Pruning Techniques for LLM Compression: The paper suggests exploring pruning techniques to compress redundant or less important parameters within the LLM component. The goal would be to further reduce the overall model size while maintaining its strong performance. This implies that while the current lightweight projector strategy makes tuning efficient, the underlying LLM itself is still large, and further optimization of its footprint could be beneficial for deployment on resource-constrained devices or for even greater efficiency.

7.3. Personal Insights & Critique

The GraphGPT paper presents a highly compelling and timely approach to integrating the power of Large Language Models with Graph Neural Networks. The core idea of bridging the structural understanding of GNNs with the semantic reasoning of LLMs is very intuitive and addresses a critical limitation of traditional graph learning models: their poor generalization in zero-shot scenarios and heavy reliance on labeled data.

Innovations and Strengths:

  • Elegant Integration: The dual-stage instruction tuning with a lightweight projector is a particularly elegant solution. It avoids the prohibitive computational cost of fine-tuning an entire LLM, making the approach practical and scalable. This parameter-efficient tuning is a crucial enabler for this line of research.
  • Zero-Shot Prowess: The empirical results showcasing 2-10x improvement in zero-shot accuracy over strong baselines are highly impressive and validate the core hypothesis that LLM-like instruction tuning can significantly boost generalization for graph tasks. This opens up possibilities for real-world applications where labels are scarce.
  • Mitigation of Catastrophic Forgetting: The ability of GraphGPT to incorporate more data and tasks without catastrophic forgetting is a significant advantage over many traditional GNNs. This indicates a more robust and adaptable learning paradigm.
  • Chain-of-Thought Distillation: Leveraging the reasoning capabilities of powerful closed-source LLMs through CoT distillation is a clever way to enhance the model's intelligence without adding parameters. This technique allows smaller models to "learn" to reason more effectively.
  • Token Efficiency: The demonstration of graph tokens being significantly more efficient than text-based graph descriptions is a practical win, addressing a major bottleneck for using LLMs with large graph structures.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Reliance on External LLM for CoT: While CoT distillation is efficient, it still relies on a powerful, often closed-source, external LLM (like GPT-3.5) as a "teacher." This introduces a dependency and potential cost for generating the initial CoT data. Future work could explore how to bootstrap CoT generation more autonomously or with smaller open-source models.
  • Complexity of Instruction Design: Crafting effective instructions and graph token representations can be intricate. The paper shows examples, but the generalizability of this instruction engineering across highly diverse graph types (e.g., knowledge graphs, biological networks with very different node/edge semantics) might need further investigation.
  • Graph Encoder Backbone: The graph encoder is flexible, but its choice and pre-training method can still heavily influence the quality of graph representations. The paper states it can leverage "diverse graph pre-training paradigms," but the impact of different choices (e.g., simple GCN vs. advanced Graph Transformer, or different SSL pre-training objectives) on the overall GraphGPT performance and generalization is not fully explored.
  • Interpretability of LLM Reasoning on Graphs: While CoT provides step-by-step reasoning in natural language, understanding how the LLM combines the graph tokens and its language understanding to arrive at a specific conclusion about a graph structure remains somewhat opaque. Further work into making the LLM's graph-aware reasoning more transparent could be valuable.
  • Scalability to Extremely Large Graphs: While token efficiency is improved for subgraphs, processing entire extremely large graphs (billions of nodes/edges) with LLMs might still pose challenges due to memory limitations for constructing subgraphs and the computational cost of repeated graph encoder runs. Techniques like graph sampling or hierarchical graph representation might be needed.

Transferability and Applications: The methods and conclusions of GraphGPT have high transferability. This framework could be applied to various domains beyond citation networks, such as:

  • Drug Discovery: Predicting drug-target interactions or properties of molecular graphs in a zero-shot manner.

  • Social Network Analysis: Classifying users, predicting relationships, or detecting communities in new, unseen social networks.

  • Recommender Systems: Providing cold-start recommendations by understanding user-item interaction graphs without prior user behavior data.

  • Knowledge Graphs: Answering complex queries over large knowledge graphs where explicit links are missing or for new entities.

    Overall, GraphGPT represents a significant step towards creating truly generalizable and intelligent graph learning systems, effectively merging the best of GNNs and LLMs. It lays a strong foundation for future research in graph foundation models.
