GraphBench: Next-generation graph learning benchmarking
TL;DR Summary
GraphBench is a comprehensive benchmarking suite for graph learning that addresses today's fragmented evaluation practices. It covers node-level, edge-level, graph-level, and generative tasks, and provides standardized evaluation protocols, dataset splits, and hyperparameter tuning, with reference benchmarks using message-passing neural networks and graph transformers.
Abstract
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
GraphBench: Next-generation graph learning benchmarking
1.2. Authors
The paper lists a large collaborative author group from various institutions, indicating a broad and diverse set of expertise contributing to the project. Key authors include Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, and Christopher Morris. Their affiliations span universities (RWTH Aachen University, University of Oxford, Mila Quebec AI Institute, Technion - Israel Institute of Technology, University of Stuttgart) and research labs (ETAS Research, Meta, AITHYRA, Google Research, Microsoft Research), suggesting strong academic and industrial backing.
1.3. Journal/Conference
The paper is listed with an "Original Source Link" to https://arxiv.org/abs/2512.04475 and a "PDF Link" to https://arxiv.org/pdf/2512.04475v1.pdf. This indicates it is a preprint published on arXiv, a popular open-access repository for research papers, particularly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While arXiv is not a peer-reviewed journal or conference, it is a widely used platform for quickly disseminating research findings and soliciting feedback before formal publication. Its influence in the AI/ML community is substantial for early access to cutting-edge research.
1.4. Publication Year
The paper was posted to arXiv on 2025-12-04, making it a very recent preprint that reflects the current state of graph learning research.
1.5. Abstract
The paper addresses the fragmented and inconsistent benchmarking practices in machine learning on graphs, which currently rely on narrow datasets and varied evaluation protocols, hindering reproducibility and progress. To overcome these issues, the authors introduce GraphBench, a comprehensive benchmarking suite. GraphBench covers diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. It provides standardized evaluation protocols with consistent dataset splits and performance metrics designed to assess out-of-distribution (OOD) generalization. Furthermore, it offers a unified hyperparameter tuning framework. The authors also benchmark GraphBench using common models like message-passing neural networks (MPNNs) and graph transformer models, establishing principled baselines and reference performance.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2512.04475
- PDF Link: https://arxiv.org/pdf/2512.04475v1.pdf

The paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem GraphBench aims to solve is the fragmentation and inconsistency in benchmarking practices within the field of machine learning on graphs. This fragmentation manifests in several ways:
- Narrow, Task-Specific Datasets: Most existing benchmarks focus on limited domains (e.g., molecular graphs, citation networks), failing to represent the diversity and complexity of real-world applications.
- Inconsistent Evaluation Protocols: Different benchmarks use varied data splits, metrics, and hyperparameter tuning approaches, making it difficult to compare models meaningfully across studies.
- Lack of Reproducibility: Inconsistent protocols and data handling hinder the reproducibility of research findings.
- Limited Out-of-Distribution (OOD) Generalization Assessment: Current benchmarks often emphasize in-distribution accuracy, neglecting how models perform on unseen or structurally different data, which is crucial for real-world deployment.
2.1.2. Importance of the Problem
This problem is critical because machine learning on graphs, particularly with Graph Neural Networks (GNNs) and Graph Transformers (GTs), has achieved impressive progress in various high-impact domains like drug design, molecular property prediction, recommender systems, chip design, and combinatorial optimization. However, without a standardized and comprehensive benchmarking framework, several negative consequences arise:
- Hampered Progress: Researchers struggle to objectively compare new models against existing ones, slowing down innovation.
- Misguided Development: Models might be optimized for narrow, non-representative datasets, leading to solutions that perform poorly in real-world scenarios.
- Inefficient Resource Allocation: Duplicative efforts in data curation and protocol definition waste valuable research resources.
- Lack of Real-World Impact: Benchmarks that do not reflect real-world complexity fail to drive research towards impactful applications.
2.1.3. Paper's Entry Point / Innovative Idea
The paper's innovative idea is to introduce GraphBench, a "next-generation benchmarking suite" that directly addresses these shortcomings. GraphBench is designed to be comprehensive, standardized, and reflective of real-world challenges. Its entry point is to unify diverse graph learning tasks and domains under a single framework, providing consistent evaluation protocols, specialized metrics, and a focus on OOD generalization, thereby facilitating more reproducible, robust, and impactful research.
2.2. Main Contributions / Findings
2.2.1. Primary Contributions
The primary contributions of GraphBench are:
- Comprehensive Benchmarking Suite: It introduces a wide array of datasets spanning diverse domains (social networks, chip design, combinatorial optimization, SAT solving, algorithmic reasoning, weather forecasting) and supporting all major graph learning tasks (node-level, edge-level, and graph-level prediction, as well as generative settings).
- Standardized Evaluation Protocols: GraphBench provides consistent dataset splits, uniform input features and labels, and domain-specific evaluation metrics, which are crucial for fair and reproducible comparisons. It explicitly includes tests for out-of-distribution (OOD) generalization by splitting data based on time or problem size.
- Unified Hyperparameter Tuning Framework: It integrates automated hyperparameter optimization using SMAC3 with multi-fidelity scheduling, enabling efficient model tuning and enhancing reproducibility.
- Principled Baselines: The suite includes benchmark results for modern Message-Passing Neural Networks (MPNNs) and Graph Transformers (GTs), providing strong reference points and insights into current models' capabilities and limitations across diverse tasks.
- Open-Source and User-Friendly Software: GraphBench is released as an open-source Python package (`graphbench-lib`) with streamlined data loaders (`graphbench.Loader`), evaluation tools (`graphbench.Evaluator`), and optimization utilities (`graphbench.Optimizer`), making it easy for researchers to adopt and extend.
2.2.2. Key Conclusions / Findings
The key conclusions and findings from the paper, based on their initial benchmarking efforts, are:
- Persistent Challenges: Current graph learning methods, including MPNNs and GTs, still face significant challenges in handling:
  - Temporal distribution shifts (e.g., in social networks).
  - Scaling to large graphs (e.g., in SAT solving, where MPNNs struggled with memory on medium and large instances).
  - Capturing complex domain-specific structures (e.g., functional equivalence in chip design, combinatorial constraints).
- Graph Structure is Informative: In social network tasks, models leveraging graph structure (MPNNs) consistently outperform connectivity-agnostic models (MLPs, DeepSets), highlighting the importance of relational information.
- OOD Generalization Limitations: While some algorithmic reasoning tasks show robustness to size generalization (e.g., MST, Bridges), others (e.g., Topological Sorting, Max Matching) show decreasing performance with increasing graph size, indicating that current baselines are not universally size-invariant.
- Traditional ML Still Strong in Some Areas: For SAT solver performance prediction, traditional tabular machine learning models (Random Forest, XGBoost) using hand-crafted features (SATzilla) significantly outperform MPNNs, suggesting that extracting relevant features remains crucial and that graph models have not yet fully leveraged the permutation invariance of SAT instances.
- Unsolved Problems and Future Potential: Many tasks within GraphBench are far from solved, as indicated by relatively low R² and Spearman correlation values in social networks and by the inability of existing graph generative models to meet strict constraints (such as functional equivalence in chip design). This leaves ample room for meaningful gains and points towards the need for more advanced, task-specific graph learning models, potentially paving the way for graph foundation models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following fundamental concepts in machine learning and graph theory:
3.1.1. Graphs and Graph Theory Basics
- Graph: A graph is a mathematical structure used to model pairwise relations between objects. It consists of a set of nodes (or vertices) and a set of edges (or links) that connect pairs of nodes.
- Node (Vertex): Represents an entity or object in the graph (e.g., a person in a social network, a component in an electronic circuit, a city in a road network).
- Edge (Link): Represents a relationship or connection between two nodes (e.g., friendship between people, electrical connection between components, road between cities). Edges can be:
  - Directed: The relationship goes only one way (e.g., following someone on social media).
  - Undirected: The relationship is mutual (e.g., friendship).
  - Weighted: Edges can have a numerical value indicating the strength, cost, or distance of the relationship.
- Node Features/Attributes: Information associated with each node (e.g., user profile data, component type).
- Edge Features/Attributes: Information associated with each edge (e.g., type of interaction, resistance of a wire).
- Graph-Structured Data: Data where entities and their relationships are naturally represented as a graph.
- Graph Isomorphism: Two graphs are isomorphic if they have the same structure, even if their nodes are labeled differently. This concept is crucial for GNNs that aim to be permutation-invariant (i.e., their output does not change if the node ordering changes).
- Directed Acyclic Graph (DAG): A directed graph with no directed cycles. Essential in contexts like circuit design or task dependencies.
3.1.2. Machine Learning on Graphs
- Graph Neural Networks (GNNs): A class of neural networks designed to operate directly on graph-structured data. They learn representations (embeddings) of nodes, edges, or entire graphs by iteratively aggregating information from a node's local neighborhood.
- Message Passing Neural Networks (MPNNs): A common framework for GNNs where information (messages) is passed along edges, transformed at nodes, and aggregated from neighbors. This process is typically repeated for several layers, allowing nodes to gather information from increasingly distant parts of the graph.
  - Mechanism: In each message-passing step, a node receives messages from its neighbors, aggregates these messages (e.g., sum, mean, max), and then updates its own representation using its previous state and the aggregated message. This process enables GNNs to capture local graph structure (a minimal code sketch follows this list).
- Graph Transformers (GTs): Adapt the highly successful Transformer architecture (originally developed for sequential data like text) to graph-structured data. Unlike MPNNs, which typically aggregate locally, GTs often use self-attention mechanisms that allow each node to attend to all other nodes in the graph (or a subset), potentially capturing long-range dependencies more effectively.
  - Self-Attention: A mechanism where each element in a sequence (or node in a graph) computes a weighted sum of all other elements, with the weights determined by a learned attention score. This allows the model to dynamically focus on relevant parts of the input.
- Node-level Prediction Tasks: Predicting a property for each node in the graph (e.g., classifying users in a social network, predicting engagement for each user).
- Edge-level Prediction Tasks (or Link Prediction): Predicting a property for an edge or the existence of a link between two nodes (e.g., predicting future interactions, predicting electrical connections).
- Graph-level Prediction Tasks: Predicting a property for the entire graph (e.g., classifying a molecule, predicting the performance of an electronic circuit).
- Generative Settings/Tasks: Generating new graphs that conform to certain properties or distributions (e.g., generating new molecules, generating electronic circuits).
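To make the message-passing mechanism above concrete, here is a minimal, illustrative PyTorch sketch (not the paper's code): the function and module names are our own, and sum aggregation is just one possible choice.

```python
import torch

def message_passing_step(h, edge_index, message_mlp, update_mlp):
    # h: [num_nodes, d] node states; edge_index: [2, num_edges] (source, target) pairs
    src, dst = edge_index
    messages = message_mlp(h[src])              # compute a message per edge from the sender
    aggregated = torch.zeros_like(h)
    aggregated.index_add_(0, dst, messages)     # sum-aggregate messages per receiving node
    return update_mlp(torch.cat([h, aggregated], dim=-1))  # update from old state + messages

d = 32
h = torch.randn(10, d)                                   # 10 nodes, d-dimensional features
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])        # three directed edges
message_mlp = torch.nn.Linear(d, d)
update_mlp = torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.ReLU())
h = message_passing_step(h, edge_index, message_mlp, update_mlp)
```

Stacking several such steps lets each node see an increasingly large neighborhood, which is exactly the locality property discussed above.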
3.1.3. Machine Learning General Concepts
- Hyperparameter Tuning/Optimization (HPO): The process of finding the best set of hyperparameters (settings that control the learning process, like learning rate, number of layers, or dropout rate) for a machine learning model to achieve optimal performance on a given task.
- Out-of-Distribution (OOD) Generalization: A model's ability to perform well on data that comes from a different distribution than the training data. This is crucial for real-world robustness, as deployment data often differs from historical training data (e.g., generalizing to larger graphs, future time steps, or different problem sizes).
- Baselines: Simple or well-established models used as a reference point to compare the performance of a new or more complex model.
- Metrics: Quantitative measures used to evaluate the performance of a model (e.g., accuracy, Mean Absolute Error (MAE), R², F1-score).
- Reproducibility: The ability of an independent researcher to replicate the results of a study using the same methods, data, and code.
3.1.4. Software Libraries
- PyTorch Geometric (PyG): A popular Python library built on PyTorch for deep learning on graphs. It provides efficient data structures for graphs, implementations of various GNN layers, and utilities for common graph learning tasks.
3.2. Previous Works
The paper extensively discusses shortcomings of existing graph learning benchmarks, which serves as a foundation for understanding GraphBench's necessity.
3.2.1. Limitations of Existing Benchmarks
- TUDatasets (Morris et al., 2020):
  - Summary: Introduced many graph-level tasks.
  - Limitations: Most datasets are small, molecular, and lack unified metrics, hindering comparability across tasks.
- Open Graph Benchmark (OGB) (Hu et al., 2020a) and extensions (Hu et al., 2021):
  - Summary: Provides large-scale datasets.
  - Limitations: Focuses heavily on molecules, requires substantial computational resources, and lacks clear broader relevance beyond specific chemical domains.
- Dwivedi et al. (2020) and the Long-Range Graph Benchmark (LRGB) (Dwivedi et al., 2022a):
  - Summary: Emphasize certain prediction regimes.
  - Limitations: Narrow prediction regimes, small-scale data, restricted model sizes, and limited evaluation of out-of-distribution (OOD) generalization.
- Benchmarks for Graph Generation:
  - Limitations: Remain underdeveloped, often relying on 2D molecular datasets with questionable practical value and on synthetic data with limited applicability, and fail to address challenges like generating large or structurally constrained graphs.
- GRAPHLAND (Bazhenov et al., 2025):
  - Summary: Sought to expand node-level tasks with industrial data.
  - Limitations: Scalability and coverage gaps remain.
- Algorithmic Reasoning Benchmarks like CLRS (Velickovic et al., 2022):
  - Summary: Cover a variety of algorithmic tasks, often providing "hints" to guide the solution.
  - Limitations (from GraphBench's perspective): Do not include regression tasks, restrict learning tasks to specific algorithms, and focus on hints, which GraphBench aims to move beyond for broader OOD generalization.
3.2.2. Foundational Models and Algorithms in Graph ML
The paper implicitly builds upon several core GNN and GT architectures that serve as baselines. While the paper's main text does not detail their internal formulas, understanding their general mechanisms is key.
- Message-Passing Neural Networks (MPNNs):
  - General Principle (Gilmer et al., 2017; Scarselli et al., 2009): MPNNs define a general framework for GNNs by alternating between a message function and an update function.
  - Message Computation: For each node $v$, messages are computed from its neighbors $u \in N(v)$:
    $ \boldsymbol{m}_v^{(t+1)} = \sum_{u \in N(v)} M_t\big(\boldsymbol{h}_v^{(t)}, \boldsymbol{h}_u^{(t)}, \boldsymbol{e}_{uv}\big) $
  - Node Update: Node embeddings are updated based on the aggregated messages:
    $ \boldsymbol{h}_v^{(t+1)} = U_t\big(\boldsymbol{h}_v^{(t)}, \boldsymbol{m}_v^{(t+1)}\big) $
  - Graph Isomorphism Network (GIN) (Xu et al., 2019): A powerful MPNN variant designed to be as discriminative as the Weisfeiler-Leman (WL) test for graph isomorphism. The GIN update rule for node $v$ in layer $t+1$ is
    $ \boldsymbol{h}_v^{(t+1)} = \mathsf{MLP}\Big( (1 + \epsilon)\, \boldsymbol{h}_v^{(t)} + \sum_{u \in N(v)} \boldsymbol{h}_u^{(t)} \Big), $
    where MLP is a multi-layer perceptron and $\epsilon$ is a learnable parameter. This underlines GraphBench's choice of GIN as a strong MPNN baseline.
- Graph Transformers (GTs) (Müller et al., 2024a):
  - General Principle: Adapt the self-attention mechanism from standard Transformers to graphs. Unlike MPNNs, which are inherently local, GTs can model global dependencies directly.
  - Attention Mechanism (Vaswani et al., 2017): The core of Transformers is scaled dot-product attention. Given query ($Q$), key ($K$), and value ($V$) matrices derived from the input embeddings, the attention mechanism calculates
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, $
    where $d_k$ is the dimension of the key vectors, used for scaling to keep the softmax well-behaved. Graph Transformers adapt this by deriving $Q$, $K$, $V$ from node features and potentially incorporating graph structure through attention biases or positional encodings.
- DeepSets (Zaheer et al., 2017): A simple but powerful neural architecture for permutation-invariant functions (functions whose output does not depend on the order of the input elements). A typical DeepSets architecture takes a set of input features $\{x_1, \dots, x_N\}$ and processes them as
  $ \rho\left(\sum_{i=1}^N \phi(x_i)\right), $
  where $\phi$ and $\rho$ are neural networks (typically MLPs). This is used as a baseline to evaluate the utility of graph connectivity information, as DeepSets explicitly ignores edge structure (a minimal sketch follows this section).
- Multilayer Perceptron (MLP): A fundamental feedforward neural network consisting of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (except for input nodes) is a neuron that uses a nonlinear activation function. Used as a baseline for comparison, as MLPs disregard both graph structure and permutation invariance.
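As a concrete instance of the DeepSets formula above, here is a minimal, self-contained PyTorch sketch (illustrative only; the class and dimension names are our own). The final assertion demonstrates why such a model is useful as a connectivity-agnostic baseline: its output is unchanged under any node permutation.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    # rho(sum_i phi(x_i)): the sum makes the output independent of input order
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x):                       # x: [num_nodes, in_dim] for one graph
        return self.rho(self.phi(x).sum(dim=0))

model = DeepSets(in_dim=16, hidden_dim=64, out_dim=1)
x = torch.randn(10, 16)
perm = torch.randperm(10)
assert torch.allclose(model(x), model(x[perm]), atol=1e-5)  # permutation invariance
```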
3.3. Technological Evolution
The field of machine learning on graphs has evolved significantly:
- Early Graph Methods (Pre-2010s): Focused on graph kernels and feature engineering. Graph kernels computed similarity between graphs by counting substructures but struggled with scalability. Feature engineering involved manually extracting graph-theoretic features (e.g., node degrees, clustering coefficients) for use in traditional ML models.
- Rise of Deep Learning (Early-Mid 2010s): With the success of deep learning in computer vision and natural language processing, researchers sought to adapt neural networks to graph data. This led to early attempts at spectral GNNs (using graph Laplacians) and spatial GNNs (directly operating on the graph structure).
- Standardization with Message Passing (Mid-Late 2010s): The Message Passing Neural Network (MPNN) framework (Gilmer et al., 2017) provided a unifying view, abstracting many GNN variants into a common paradigm of message computation and aggregation. This spurred rapid development of models like Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs).
- Scaling and Out-of-Distribution (OOD) Challenges (Late 2010s-Early 2020s): As GNNs became more powerful, challenges like scalability to large graphs, over-smoothing, and particularly OOD generalization (how models perform on graphs larger or structurally different from training data) came to the forefront. Benchmarks like OGB emerged to address large-scale data but often remained domain-specific.
- Graph Transformers and Foundation Models (Early 2020s-Present): The success of Transformers led to their adaptation for graphs, with Graph Transformers (GTs) aiming to capture long-range dependencies more effectively. The concept of graph foundation models (large, pre-trained models that can be fine-tuned for various downstream tasks) is an emerging area, requiring robust and diverse benchmarks for development and evaluation. GraphBench positions itself at this frontier, aiming to provide the infrastructure needed to address OOD generalization, diversify beyond narrow domains, and standardize evaluation, thereby facilitating the development of next-generation graph learning models, including graph foundation models.
3.4. Differentiation Analysis
Compared to the main methods and benchmarks in related work, GraphBench offers several core differences and innovations:
- Domain and Task Diversity:
  - Differentiation: Unlike TUDatasets (mostly small molecular graphs) or OGB (heavily molecular), GraphBench explicitly spans diverse domains such as social networks, chip design, combinatorial optimization, SAT solving, algorithmic reasoning, and earth systems (weather forecasting). It also supports all key task types: node-level, edge-level, graph-level, and generative settings.
  - Innovation: This broad coverage aims to prevent domain-specific overfitting and drive GNN research towards more generalizable solutions for a wider range of real-world problems.
- Focus on Real-World Impact and Complexity:
  - Differentiation: Many prior benchmarks use small-scale or simplified graphs (e.g., citation networks, basic molecular graphs) that have limited applicability to industrial or complex real-world challenges.
  - Innovation: GraphBench emphasizes datasets that are representative of modern, high-impact applications, such as large-scale chip design, real-world social network interactions with temporal dynamics, and complex combinatorial optimization problems. It moves beyond "toy" problems to address "meaningful" ones.
- Standardized Protocols and OOD Generalization:
  - Differentiation: Previous benchmarks often have inconsistent evaluation protocols and metrics, and lack explicit mechanisms for assessing OOD generalization.
  - Innovation: GraphBench provides standardized splits (e.g., temporal splits for social networks, cross-size splits for electronic circuits) and domain-specific metrics that go beyond simple accuracy to truly assess model performance. Crucially, it explicitly evaluates OOD generalization (e.g., on different graph sizes or future time periods), which is vital for developing robust and deployable models.
- Unified Framework and Tooling:
  - Differentiation: Researchers often have to manage different data loaders, evaluation scripts, and tuning procedures for each benchmark.
  - Innovation: GraphBench offers a unified Python package (`graphbench-lib`) with a single Loader class, an Optimizer for hyperparameter tuning, and an Evaluator for standardized metric calculation. This significantly reduces overhead for researchers, promotes reproducibility, and allows for rapid prototyping and benchmarking.
- Principled Baselines and Identified Challenges:
  - Differentiation: While other benchmarks provide baselines, GraphBench's extensive evaluation across its diverse suite highlights specific persistent challenges such as temporal distribution shifts, scalability to very large graphs, and the difficulty of conditional graph generation under strict constraints (like functional equivalence).
  - Innovation: By systematically exposing these challenges, GraphBench guides future research directions more effectively, especially towards graph foundation models that can overcome these hurdles.

In essence, GraphBench differentiates itself by offering a holistic, rigorously standardized, and practically relevant benchmarking ecosystem that pushes graph machine learning beyond localized improvements towards broadly applicable and robust solutions.
4. Methodology
The GraphBench paper introduces a comprehensive benchmarking suite, focusing on the design principles, framework, and standardized evaluation protocols rather than a single novel algorithm. The core methodology revolves around providing a unified environment for diverse graph learning tasks.
4.1. Principles
GraphBench is designed around five core principles to address the shortcomings of existing benchmarks:
- Diverse Tasks and Domains:
  - Core Idea: To ensure GraphBench is broadly applicable and representative of real-world scenarios, it supports node-level, edge-level, and graph-level prediction, as well as generative tasks.
  - Theoretical Basis/Intuition: Graph problems exist across various domains (e.g., social sciences, hardware design, combinatorial optimization, earth systems). A truly comprehensive benchmark must cover these diverse applications to foster generalizable models, rather than models specialized for a narrow domain like molecular graphs.
- Real-World Impact:
  - Core Idea: Focus on datasets and tasks that are highly relevant to current real-world challenges and industrial needs.
  - Theoretical Basis/Intuition: Many existing benchmarks feature small-scale graphs or domains with limited practical utility. By including tasks like large-scale chip design, weather forecasting, and combinatorial optimization, GraphBench aims to align research efforts with impactful applications, moving beyond purely academic or simplified scenarios.
- OOD Generalization:
  - Core Idea: Explicitly assess a model's ability to generalize beyond its training distribution.
  - Theoretical Basis/Intuition: Models often perform well on in-distribution data but fail when encountering new, structurally different, or temporally shifted (out-of-distribution) data. GraphBench incorporates datasets with specific splits (e.g., temporal splits for social networks, cross-size splits for electronic circuits) to rigorously test this crucial aspect, which is essential for deployment in dynamic real-world environments.
- Meaningful Evaluation:
  - Core Idea: Provide standardized evaluation splits, domain-specific metrics, and state-of-the-art hyperparameter tuning scripts.
  - Theoretical Basis/Intuition: Reproducibility and fair comparison require consistent evaluation. This includes fixed training/validation/test splits, meaningful metrics tailored to the problem (not just generic accuracy), and robust hyperparameter optimization to ensure models are evaluated at their best. The inclusion of OOD generalization tests further enhances the meaningfulness of evaluation.
- Task-Relevant Evaluation Metrics:
  - Core Idea: Use metrics that accurately capture the practical utility and real-world performance of models for each specific task.
  - Theoretical Basis/Intuition: Standard metrics like Mean Squared Error (MSE) or accuracy may not fully reflect a model's practical value. For instance, in social networks, Spearman correlation might be more relevant for ranking influence than MAE. GraphBench provides tailored metrics to ensure evaluations truly reflect task objectives.
4.2. Core Methodology In-depth (Layer by Layer)
The GraphBench methodology provides a unified software framework and a general encoder-processor-decoder architecture for evaluating graph learning models.
4.2.1. Unified Software Framework (graphbench-lib)
The core of GraphBench is its open-source Python package graphbench-lib, designed to streamline the entire benchmarking process.
4.2.1.1. Dataset Loading (graphbench.Loader)
- Purpose: To provide a consistent and easy way to load all datasets within the GraphBench suite, regardless of their domain or task type.
- Procedure:
  - The `graphbench.Loader` class acts as a wrapper for dataset loading.
  - It handles necessary precomputations and comes with predefined splits (train, validation, test).
  - Datasets are fully compatible with PyTorch Geometric's `InMemoryDataset` interface, ensuring seamless integration with common graph deep learning pipelines.
  - Customization options are available for each dataset.
- Example Integration (from Listing 1):

```python
import graphbench

# 1. Initialize the Loader
# dataset_name: name of the task or a list of tasks (e.g., 'BlueSky-quotes')
# pre_filter: PyTorch Geometric filter applied before loading
# pre_transform: PyTorch Geometric-like transform applied during data loading
# transform: PyTorch Geometric-like transform applied at computation time (e.g., batching)
Loader = graphbench.Loader(dataset_name, pre_filter, pre_transform, transform)

# 2. Load a GraphBench dataset and get splits
dataset = Loader.load()
```

  - `dataset_name`: A string identifying the specific dataset or task (e.g., "BlueSky-quotes", "AIG").
  - `pre_filter`: A function or object from PyTorch Geometric that filters dataset instances before they are loaded into memory. This is useful for reducing the memory footprint by only loading relevant data.
  - `pre_transform`: A function or object from PyTorch Geometric that applies transformations to the data once, when the dataset is loaded. Examples include adding positional encodings or normalizing features.
  - `transform`: A function or object from PyTorch Geometric that applies transformations each time a sample is accessed during training or inference. This is useful for data augmentation or on-the-fly feature generation.
4.2.1.2. Hyperparameter Optimization (graphbench.Optimizer)
- Purpose: To automate the process of finding optimal hyperparameters for a given model and task, improving performance and reproducibility.
- Procedure:
  - Uses SMAC3 (Lindauer et al., 2022) as its backend, a powerful Bayesian optimization framework.
  - Employs multi-fidelity scheduling: many hyperparameter configurations are evaluated on lower budgets (e.g., fewer training epochs or smaller subsets of data), and poorly performing configurations are pruned early. This saves significant computational resources, especially for graph learning models, which are expensive to train.
  - Utilizes a surrogate model: SMAC3 builds a model of the objective function (e.g., validation loss) based on previous evaluations. This surrogate model then proposes new, promising configurations more efficiently than random or grid search.
- Example Integration (from Listing 1):

```python
# 1. Initialize the Optimizer
# optimization_args: dictionary of arguments for the SMAC3 optimizer (e.g., budget, search space)
# training_method: a callable that trains and evaluates a model for a given set of hyperparameters
Optimizer = graphbench.Optimizer(optimization_args, training_method)

# 2. Optimize your model's hyperparameters
opt_model = Optimizer.optimize()
```

  - `optimization_args`: A dictionary containing parameters for the hyperparameter optimization process, such as the total budget of evaluations, the fidelity levels (e.g., the range of training steps), and the hyperparameter search space.
  - `training_method`: A function or method that takes a set of hyperparameters as input, trains a model with them, and returns its performance (e.g., validation loss). The Optimizer calls this method repeatedly with different hyperparameter combinations.
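The exact callable signature that `graphbench.Optimizer` expects is not spelled out here, so the following is only an assumed shape of a `training_method` compatible with multi-fidelity scheduling; the model, data, and helper names are placeholders of our own, and a real pipeline would train a GNN on a GraphBench dataset instead.

```python
import torch

def build_model(hidden_dim, num_layers):        # stand-in model; a real pipeline
    layers = []                                 # would construct a GNN here
    for _ in range(num_layers):
        layers += [torch.nn.Linear(hidden_dim, hidden_dim), torch.nn.ReLU()]
    return torch.nn.Sequential(*layers)

def training_method(config: dict, budget: float) -> float:
    """Assumed shape: called with a candidate configuration and a fidelity
    ('budget'); returns a validation score so poor configs can be pruned early."""
    model = build_model(config["hidden_dim"], config["num_layers"])
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    x = torch.randn(128, config["hidden_dim"])  # dummy data for illustration
    for _ in range(int(budget)):                # budget = number of epochs
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
    return float(loss)                          # lower is better
```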
4.2.1.3. Evaluation (graphbench.Evaluator)
- Purpose: To standardize the calculation and reporting of performance metrics for models, ensuring consistency across benchmarks.
- Procedure:
  - Loads the corresponding domain-specific metrics for each selected dataset.
  - Provides evaluation scripts that implement these metrics (e.g., RMSE, ROC-AUC, Closed Gap).
- Example Integration (from Listing 1):

```python
# 1. Initialize the Evaluator
Evaluator = graphbench.Evaluator(dataset_name)

# 2. Use the GraphBench evaluator with true labels and model predictions
# y_true: ground-truth labels from the dataset
# y_pred: predictions generated by the trained model
results = Evaluator.evaluate(y_true, y_pred)
```

  - `y_true`: The actual, observed target values (ground truth) for the evaluation set.
  - `y_pred`: The values predicted by the trained graph learning model.
  - `Evaluator.evaluate()` computes and returns the relevant metrics (e.g., MAE, R², and Spearman correlation for social networks) based on these inputs.

The overall workflow shown in Listing 1 is:
```python
import graphbench

# Define your PyTorch model
model = ...  # your torch model

# Specify the dataset name
dataset_name = ...  # name of the task or list of tasks

# Define PyTorch Geometric pre-filter, pre-transform, and transform functions
pre_filter = ...     # PyTorch Geometric filter
pre_transform = ...  # PyTorch Geometric-like transform applied during loading
transform = ...      # PyTorch Geometric-like transform applied at computation time

# Set up the components of GraphBench
Evaluator = graphbench.Evaluator(dataset_name)
Optimizer = graphbench.Optimizer(optimization_args, training_method)
Loader = graphbench.Loader(dataset_name, pre_filter, pre_transform, transform)

# Load a GraphBench dataset and get splits
dataset = Loader.load()

# Optimize your model (training loops typically live inside training_method)
opt_model = Optimizer.optimize()  # opt_model is the best model found by HPO

# Perform inference with the optimized model
# y_pred = opt_model.predict(dataset.test_data)
# y_true = dataset.test_data.y  # assuming target labels are stored in .y

# Use the GraphBench evaluator with targets y_true and predictions y_pred
# results = Evaluator.evaluate(y_true, y_pred)
```
This pseudocode highlights how researchers can seamlessly integrate their PyTorch models into the GraphBench framework, leveraging its standardized data loading, hyperparameter optimization, and evaluation capabilities.
4.2.2. General Encoder-Processor-Decoder Architecture
GraphBench proposes a general encoder-processor-decoder pipeline for evaluating models across tasks. This modular design allows different GNN or GT architectures to be "plugged in" as the processor.
4.2.2.1. Encoder
- Purpose: To convert raw node and edge features into dense, common-dimension embeddings suitable for the processor.
- Procedure:
  - For each task, a task-specific encoder is used to derive encodings for node-level and edge-level features in a common dimension $d$.
  - These embeddings are then passed to the processor.
  - Handling Missing Features: If node or edge features are missing for a given task, a learnable vector of dimension $d$ is used as a placeholder (a minimal encoder sketch follows this subsection).
  - Tokenization for GTs: For Graph Transformer (GT) baselines that expect tokenized input:
    - For node-level tasks, each graph node is treated as a single token.
    - For edge-level tasks, a specific transformation is applied to convert edges into tokens (as detailed in the Algorithmic Reasoning tasks).
    - A [CLS] token (similar to BERT for sequence tasks) is added for graph-level representations in GTs, following Bechler-Speicher et al. (2025). This token's final embedding can then be used to represent the entire graph.
  - Positional Encodings (PEs): Absolute positional encodings such as Random Walk Structural Embeddings (RWSE) (Dwivedi et al., 2022b) and Learnable Positional Encodings (LPE) (Müller & Morris, 2024) are added to the node embeddings before the first Graph Transformer layer. These PEs provide information about a node's position or structural role in the graph, which is crucial for GNNs and GTs.
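A minimal sketch of such an encoder in plain PyTorch (the class name and the zero-initialized placeholder are our own choices, not the paper's):

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Illustrative task-specific encoder: projects raw node features to a
    common dimension d, or substitutes a learnable placeholder vector when
    a task provides no node features."""
    def __init__(self, in_dim, d):
        super().__init__()
        self.proj = nn.Linear(in_dim, d) if in_dim > 0 else None
        self.placeholder = nn.Parameter(torch.zeros(d))  # learnable fallback

    def forward(self, x, num_nodes):
        if self.proj is None or x is None:               # featureless task
            return self.placeholder.expand(num_nodes, -1)
        return self.proj(x)

enc = NodeEncoder(in_dim=0, d=8)      # a task without node features
h = enc(None, num_nodes=5)            # -> [5, 8] placeholder embeddings
```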
4.2.2.2. Processor
- Purpose: To perform the core message-passing or attention operations that learn contextualized node and graph representations. This is where the actual graph learning model resides.
- Architecture: The processor is typically composed of multiple layers (e.g., GINE layers or Graph Transformer layers) stacked sequentially.
- Update Rule (General Form): For a graph $G$ with node representations $\boldsymbol{X}^{(t)}$ at layer $t$, the update rule is
  $ \boldsymbol{X}' = \phi(\mathsf{LayerNorm}(\boldsymbol{X}^{(t)}), G) $
  $ \boldsymbol{X}^{(t+1)} = \boldsymbol{X}' + \mathsf{FNN}(\mathsf{LayerNorm}(\boldsymbol{X}')) $
  - Where:
    - $\phi$: The selected baseline model (e.g., a GINE layer or a Graph Transformer layer). It takes the layer-normalized input node representations and the graph structure $G$ to produce new representations.
    - $\mathsf{LayerNorm}$: Layer normalization is applied to the node representations before passing them to $\phi$ and $\mathsf{FNN}$. This helps stabilize training by normalizing inputs across features for each sample.
    - $\mathsf{FNN}$: A two-layer feed-forward neural network (also known as an MLP) applied to the layer-normalized output of $\phi$.
    - Residual connection: The output of the FNN is added back to the node representations. This helps mitigate vanishing gradients and allows for deeper models.
- Feed-Forward Network (FNN) Details:
  - The FNN layer is defined as
    $ \mathsf{FNN}(x) := \sigma(\mathsf{LayerNorm}(x)W_1)W_2 $
    - $\sigma$: The GeLU (Gaussian Error Linear Unit) nonlinearity (Hendrycks & Gimpel, 2016). GeLU is a common activation function that smoothly and non-linearly transforms the input.
    - $W_1, W_2$: Weight matrices for the two linear layers within the FNN.
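Putting the update rule and the FNN together, a pre-LayerNorm residual block could look like the following illustrative PyTorch sketch (the class name and the hidden-dimension expansion factor are our assumptions):

```python
import torch.nn as nn

class ProcessorBlock(nn.Module):
    # X' = phi(LayerNorm(X), G);  X_out = X' + FNN(LayerNorm(X'))
    def __init__(self, phi, d):
        super().__init__()
        self.phi = phi                          # e.g., a GINE layer or a GT layer
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.fnn = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    def forward(self, x, graph):
        x = self.phi(self.norm1(x), graph)      # message-passing / attention sub-layer
        return x + self.fnn(self.norm2(x))      # residual feed-forward sub-layer
```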
- Graph Transformer (GT) Specifics:
  - Attention Mechanism: The Graph Transformer layer computes multi-head scaled dot-product attention.
  - Attention Formula: Given query ($Q$), key ($K$), and value ($V$) matrices of shape $n \times d$ (where $n$ is the number of tokens/nodes and $d$ is the embedding dimension), and an attention bias $B$:
    $ \mathsf{Attention}(Q, K, V, B) := \mathsf{softmax}\Big( d^{-\frac{1}{2}} \cdot Q K^T + B \Big)V $
    - $d^{-\frac{1}{2}}$: Scaling factor based on the embedding dimension, preventing dot products from growing too large and pushing the softmax outputs to extremes.
    - $QK^T$: The dot product between queries and keys, forming the raw attention scores.
    - $B$: An attention-bias matrix that injects graph-structure information into the attention mechanism. This can include information about connectivity, shortest paths, or other structural relationships.
    - $\mathsf{softmax}$: Normalizes the attention scores so they sum to 1, effectively producing attention weights.
    - $V$: The attention weights are applied to the value vectors to produce the final contextualized representations.
  - GT Layer Output: The output of a GT layer from the paper is
    $ \boldsymbol{X}^{(t+1)} := \mathsf{MLP}\Big( \mathsf{Attention}(\boldsymbol{X}^{(t)}\boldsymbol{W}_Q, \boldsymbol{X}^{(t)}\boldsymbol{W}_K, \boldsymbol{X}^{(t)}\boldsymbol{W}_V, \boldsymbol{B}) \Big) $
    - $\boldsymbol{X}^{(t)}$: Node representations at step $t$.
    - $\boldsymbol{W}_Q, \boldsymbol{W}_K, \boldsymbol{W}_V$: Learnable weight matrices that project the node representations into queries, keys, and values.
    - $\mathsf{MLP}$: A two-layer MLP applied after the attention mechanism.
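A minimal single-head version of the biased attention formula above, as an illustrative PyTorch sketch (multi-head logic and learned projections inside a module are omitted for brevity):

```python
import math
import torch

def biased_attention(x, w_q, w_k, w_v, bias):
    # softmax(d^{-1/2} * Q K^T + B) V, with B injecting graph structure
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)) + bias
    return torch.softmax(scores, dim=-1) @ v

n, d = 6, 16
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
bias = torch.zeros(n, n)        # e.g., shortest-path or connectivity biases
out = biased_attention(x, w_q, w_k, w_v, bias)   # [n, d] contextualized nodes
```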
- GIN Architecture Specifics:
  - GINE Layer (GIN generalized with edge features): The GINE layer updates node representations at iteration $t+1$ as follows:
    $ \boldsymbol{h}_v^{(t+1)} = \mathsf{FNN} \Big( (1 + \epsilon)\, \boldsymbol{h}_v^{(t)} + \sum_{u \in N(v)} \mathsf{ReLU}\big(\boldsymbol{h}_u^{(t)} + \boldsymbol{e}_{uv}\big) \Big) $
    - $\mathsf{FNN}$: A neural network, typically an MLP.
    - $\epsilon$: A learnable parameter (or fixed to 0).
    - $\boldsymbol{h}_v^{(t)}$: Representation of node $v$ at iteration $t$.
    - $N(v)$: Set of neighbors of node $v$.
    - $\boldsymbol{h}_u^{(t)}$: Representation of neighbor node $u$.
    - $\boldsymbol{e}_{uv}$: Feature vector of the edge connecting $u$ and $v$.
    - $\mathsf{ReLU}$: Rectified Linear Unit activation function.
    - The sum aggregates information from the neighbors, while the $(1 + \epsilon)$ term lets the model retain information from the node's previous state, making this a powerful aggregation scheme.
  - GINE Processor Integration: As with the GTs, GINE layers replace $\phi$. Layer normalization is applied before, and residual connections are used after, the GINE layer. Mean pooling is typically used for graph-level tasks at the end of the processor step.
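PyTorch Geometric ships a `GINEConv` layer implementing this update; here is a minimal usage sketch (the dimensions are chosen arbitrarily for illustration):

```python
import torch
from torch import nn
from torch_geometric.nn import GINEConv  # implements the GINE update above

d = 64
mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
conv = GINEConv(mlp, train_eps=True, edge_dim=d)  # learnable epsilon, d-dim edge features

x = torch.randn(5, d)                             # 5 nodes
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
edge_attr = torch.randn(4, d)                     # one feature vector per edge
out = conv(x, edge_index, edge_attr)              # updated node representations
```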
- Social-Network-Specific MPNN (GraphConv with mean aggregation):
  - Due to the high number of nodes and the varying degrees in the BlueSky datasets, a different MPNN variant was chosen: GraphConv with mean aggregation.
  - Update Rule: Node representations are updated as
    $ \boldsymbol{h}_v^{(t+1)} = \sigma \Big( W_1^{(t)} \boldsymbol{h}_v^{(t)} + \frac{1}{\deg_{\mathrm{in}}(v)} \sum_{u: u \to v} W_2^{(t)} \boldsymbol{h}_u^{(t)} \Big) $
    - $\sigma$: ReLU activation function.
    - Dropout is applied before the ReLU.
    - No normalization layers are interleaved.
    - $W_1^{(t)}, W_2^{(t)}$: Weight matrices for the linear transformations.
    - $\deg_{\mathrm{in}}(v)$: In-degree of node $v$.
    - The sum performs mean aggregation over the incoming neighbors.
4.2.2.3. Decoder
- Purpose: To transform the learned node or graph representations from the processor into task-specific predictions (e.g., class probabilities, regression values).
- Architecture: A two-layer Multilayer Perceptron (MLP).
- Procedure:
  - Consists of two linear layers ($W_1$ and $W_2$).
  - Applies a GeLU nonlinearity.
  - Optionally, a bias term can be added.
- Formula:
  $ W_2(\mathsf{LayerNorm}(\mathsf{GELU}(W_1 x))) $
  - $x$: Input representation (e.g., a node embedding for node-level tasks, the [CLS]-token embedding or pooled graph embedding for graph-level tasks).
  - $W_1$: Maps the input representation to the hidden dimension.
  - $W_2$: Maps to the output dimension corresponding to the task-specific target (e.g., number of classes for classification, 1 for regression).
- Task-Specific Adaptation:
  - For node-level or edge-level classification, the output dimension is the number of classes, followed by a softmax activation.
  - For node-level or edge-level regression, the output dimension is 1.
  - For graph-level tasks, the pooled graph representation is fed to the decoder.
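An illustrative PyTorch version of this decoder (the hidden dimension equals the input dimension here, which is our assumption):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    # W2(LayerNorm(GELU(W1 x))); out_dim = #classes, or 1 for regression
    def __init__(self, d, out_dim):
        super().__init__()
        self.lin1 = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)
        self.lin2 = nn.Linear(d, out_dim)

    def forward(self, x):        # x: node embedding, [CLS] token, or pooled graph
        return self.lin2(self.norm(torch.nn.functional.gelu(self.lin1(x))))
```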
4.2.3. Pretraining and Evaluation Protocol
- Pretraining Support: GraphBench supports the evaluation of pretrained models, which is crucial for graph foundation models.
- Evaluation Protocol:
  - All baseline models are evaluated across three random seeds per task.
  - Mean performance and standard deviations are reported to ensure robustness.
  - For out-of-distribution generalization tasks, pretrained models from these baseline evaluations are used.
  - In addition to performance metrics, GraphBench reports operational metrics such as memory usage, training wall-clock time, and inference latency to provide insights into real-world applicability.

This detailed methodology ensures that GraphBench provides a robust, flexible, and reproducible framework for advancing graph learning research.
5. Experimental Setup
The experimental setup within GraphBench is characterized by a diverse set of datasets, carefully chosen evaluation metrics, and comparison against representative baselines, all within a standardized framework.
5.1. Datasets
GraphBench includes datasets from diverse domains: Social Networks, Chip Design, Electronic Circuits, SAT Solving, Combinatorial Optimization, Algorithmic Reasoning, and Earth Systems (Weather Forecasting).
5.1.1. Social Networks: BlueSky Engagements
- Source: Based on the publicly available BlueSky dataset curated by Failla & Rossetti (2024).
- Scale: Large-scale, with single graphs containing hundreds of thousands of nodes and millions of edges. The following are the results from Table 2 of the original paper:

| Category | Name | #Graphs | Avg. #Nodes | Avg. #Edges | Avg. Deg. |
| --- | --- | --- | --- | --- | --- |
| Social networks | BlueSky-quotes | 1 | 289 136 / 286 425 / 311 272 | 3 075 967 / 3 815 996 / 4 525 872 | 10.64 / 13.32 / 14.54 |
| | BlueSky-replies | 1 | 470 816 / 464 867 / 569 424 | 9 287 565 / 10 769 386 / 12 318 196 | 19.73 / 23.17 / 21.63 |
| | BlueSky-reposts | 1 | 498 012 / 508 818 / 580 112 | 12 049 951 / 14 658 114 / 17 261 865 | 24.20 / 28.81 / 29.76 |

- Characteristics: Represents human interactions and societal processes. Nodes are users; edges are directed interactions (quotes, replies, reposts). Features are language-model embeddings of user posts. Targets are median future engagements (likes, replies, reposts).
- Domain: Social sciences.
- Why Chosen: Addresses limitations of older social network benchmarks (static snapshots, categorical targets, simplified topologies). Provides real-world-like dynamic and directed relational structures, crucial for temporal forecasting and OOD generalization.
5.1.2. Chip Design: AIG Generation
- Source: Synthetically generated truth tables (1.2 million) compiled into optimized And-Inverter Graphs (AIGs) using the ABC tool.
- Scale: 1.2 million graphs. The following are the results from Table 4 of the original paper:

| #Graphs | #Inputs | #Outputs | Mean #Nodes | Median #Nodes | Max. #Nodes |
| --- | --- | --- | --- | --- | --- |
| 1.2M | 6-8 | 1-2 | 125.9 | 104.0 | 335.0 |

- Characteristics: Boolean circuits represented as Directed Acyclic Graphs (DAGs), where nodes are AND gates and edges may indicate signal inversion. The task is conditional graph generation: generate an AIG that implements a given Boolean function (truth table) while minimizing the gate count.
- Domain: Hardware design.
- Why Chosen: Represents a highly complex and impactful engineering challenge. It is a novel generative task with strict logical-correctness constraints and an optimization objective, providing a rigorous test for generative graph methods.
5.1.3. Electronic Circuits: Voltage Conversion Ratio Prediction
- Source: Datasets generated from random valid topologies with 5, 7, and 10 components, simulated with NGSPICE.
- Scale: 73k (5-component), 14k (7-component), and 6k (10-component) instances.
- Characteristics: Analog circuits as graphs. Nodes are components (capacitors, inductors, switches, terminals); edges are electrical connections. The task is graph-level regression to predict the voltage conversion ratio and power conversion efficiency.
- Domain: Hardware design.
- Why Chosen: A critical problem in EDA. Highlights robustness to structural sensitivity and the ability to cope with extreme class imbalance. Explicitly tests OOD generalization across different circuit sizes.
5.1.4. SAT Solving: Algorithm Selection and Performance Prediction
- Source: The largest collection of SAT instances to date, combining SAT Competition instances, the AClib benchmark, and instances generated by FuzzSAT, Cryptography, and Community Attachment generators.
- Scale: 107,866 instances in total, split into SMALL, MEDIUM, and LARGE datasets based on size. The following are the results from Table 7 of the original paper:

| Source | #Instances | #Variables (min / max / avg / median) | #Clauses (min / max / avg / median) |
| --- | --- | --- | --- |
| SAT Competition | 31 024 | 3 / 25 870 369 / 69 190.77 / 13 168.50 | 7 / 871 935 536 / 1 022 025.50 / 118 526.00 |
| Community Attachment | 29 994 | 922 / 2 956 / 1 931.05 / 1 928.00 | 3 566 / 13 413 / 8 088.71 / 8 041.00 |
| AClib | 33 955 | 6 / 248 738 / 3 319.65 / 1 270.00 | 24 / 103 670 100 / 136 471.03 / 10 000.00 |
| FuzzSAT | 8 020 | 2 / 944 685 / 10 217.75 / 1 820.00 | 1 / 4 094 516 / 45 638.30 / 8 074.50 |
| Cryptography | 4 873 | 6 / 12 415 / 4 264.89 / 3 225.00 | 24 / 185 006 / 55 309.63 / 45 316.00 |
| Total | 107 866 | 2 / 25 870 369 / 22 434.71 / 1 879.00 | 1 / 871 935 536 / 345 051.72 / 10 177.50 |

- Characteristics: SAT instances represented as graphs (Variable-Clause Graph VCG, Variable Graph VG, Clause Graph LCG). Tasks are performance prediction (regression of solver runtime) and algorithm selection (classification of the best solver). SATzilla features can be integrated as node attributes.
- Domain: Reasoning and optimization.
- Why Chosen: SAT is a central NP-complete problem with real-world applications. Provides a practical algorithm-selection benchmark with structural permutation invariance, allowing exploration of GNNs in this area.
5.1.5. Combinatorial Optimization (CO): Predicting Optimal Solution Objectives
- Source: 50,000 instances for each problem type (Maximum Independent Set MIS, Max-Cut, Graph Coloring), generated from three random graph models: RB graphs, Erdős-Rényi (ER) graphs, and Barabási-Albert (BA) graphs. Both small-scale and large-scale variants are provided. The following are the results from Table 11 of the original paper:

| Generator | Size | Nodes | Nodes/Clique | Cliques | Tightness | Edge Prob. | Attached Edges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RB | small | [200, 300] | [5, 12] | [20, 25] | [0.3, 1] | - | - |
| RB | large | [800, 1200] | [20, 25] | [40, 55] | [0.3, 1] | - | - |
| ER | small | [200, 300] | - | - | - | 0.15 | - |
| ER | large | [700, 800] | - | - | - | 0.15 | - |
| BA | small | [200, 300] | - | - | - | - | 2 |
| BA | large | [700, 800] | - | - | - | - | 2 |

- Characteristics: Graphs as defined by the problem. No node features are included, only the adjacency matrix. Tasks are supervised learning (regression to predict the optimal objective value) and unsupervised learning (predicting scores for variables from which a feasible solution is constructed).
- Domain: Reasoning and optimization.
- Why Chosen: CO problems are computationally hard but often graph-structured. They provide clear quantitative objectives, ideal for supervised training and OOD generalization studies.
5.1.6. Algorithmic Reasoning: Learning to Simulate Algorithms
- Source: 7 tasks with synthetic graphs generated from Erdős-Rényi (ER), Stochastic Block Model (SBM), Power-Law Cluster (PC), Newman-Watts-Strogatz (NWS), Barabási-Albert (BA), and Dual Barabási-Albert (DBA) graph generators.
- Scale: One million training graphs and 10,000 validation/test graphs per task. The node count is fixed to 16 for training and 128 for validation/test to assess OOD (size) generalization.
- Characteristics: Tasks include Topological Sorting (node regression), Bridges (edge classification), Minimum Spanning Tree (MST) (edge classification), Max Flow (graph regression), Max Clique (node classification), Steiner Trees (edge classification), and Max Matching (node classification). Three difficulty levels (EASY, MEDIUM, HARD) vary the graph-generator parameters.
- Domain: Reasoning and optimization.
- Why Chosen: Dedicated to exploring the intersection of algorithms and neural networks. Provides large-scale datasets for common graph algorithms, expanding prior benchmarks like CLRS and rigorously testing size generalization.
5.1.7. Earth Systems: Weather Forecasting
- Source: Reanalysis data from the ERA5 dataset, preprocessed via the WeatherBench2 pipeline.
- Scale: 64x32 equiangular grid resolution. A single graph with 4,610 nodes and 59,667 edges.
- Characteristics: A 3D grid graph (grid cells) and an icosahedral mesh graph with learned mappings between them. Node features include 62 physical and derived meteorological variables (e.g., temperature, humidity, wind speed) at various pressure levels and at the surface. The task is regression to predict the residual change in the atmospheric state over a 12-hour horizon.
- Domain: Earth systems.
- Why Chosen: Weather forecasting is critical and represents a multi-scale, non-Euclidean dependency problem. Tests graph-based models on integrating heterogeneous features and generalizing across time and space.
5.2. Evaluation Metrics
GraphBench employs domain-specific metrics tailored to each task, extending beyond classical metrics like MSE and accuracy to capture real-world utility.
5.2.1. Social Networks (BlueSky Engagements)
- Metrics: Mean Absolute Error (MAE), Coefficient of Determination ($R^2$), and Spearman Correlation ($\sigma$).
- Mean Absolute Error (MAE):
  - Conceptual Definition: Measures the average magnitude of the errors in a set of predictions, without considering their direction. It indicates how close predictions are to actual values on average.
  - Mathematical Formula:
    $ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $
  - Symbol Explanation:
    - $N$: The total number of observations/predictions.
    - $y_i$: The ground-truth value for the $i$-th observation.
    - $\hat{y}_i$: The predicted value for the $i$-th observation.
    - $|\cdot|$: The absolute value.
- Coefficient of Determination ($R^2$):
  - Conceptual Definition: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the model explains the variability of the data; a higher $R^2$ suggests a better fit.
  - Mathematical Formula:
    $ R^2_{\kappa, \tau_{1,2}} := 1 - \frac{\sum_{u \in U} \big(y^{\kappa,(u)}_{\tau_{1,2}} - \hat{y}^{\kappa,(u)}\big)^2}{\sum_{u \in U} \big(y^{\kappa,(u)}_{\tau_{1,2}} - \bar{y}^{\kappa}_{\tau_{1,2}}\big)^2} $
  - Symbol Explanation:
    - $U$: The set of evaluation nodes (users).
    - $y^{\kappa,(u)}_{\tau_{1,2}}$: The ground-truth engagement value of kind $\kappa$ for user $u$ in the prediction interval $\tau_{1,2}$.
    - $\hat{y}^{\kappa,(u)}$: The model's predicted engagement value of kind $\kappa$ for user $u$.
    - $\bar{y}^{\kappa}_{\tau_{1,2}}$: The mean ground-truth engagement value of kind $\kappa$ over all users in $U$.
- Spearman Correlation ($\sigma$):
  - Conceptual Definition: A non-parametric measure of the strength and direction of the monotonic relationship between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function (i.e., if one variable increases, the other also tends to increase, but not necessarily at a constant rate). In this context, it measures how well predictions preserve the relative ordering of users by future engagements.
  - Mathematical Formula:
    $ \sigma_{\kappa, \tau_{1,2}} := 1 - \frac{6 \cdot \sum_{u \in U} \big( \mathbf{R}[y^{\kappa,(u)}_{\tau_{1,2}}] - \mathbf{R}[\hat{y}^{\kappa,(u)}] \big)^2}{n_U (n_U^2 - 1)} $
  - Symbol Explanation:
    - $U$: The set of evaluation nodes (users).
    - $n_U$: The number of users in $U$.
    - $\mathbf{R}[y^{\kappa,(u)}_{\tau_{1,2}}]$: The rank of the ground-truth engagement value of kind $\kappa$ for user $u$.
    - $\mathbf{R}[\hat{y}^{\kappa,(u)}]$: The rank of the model's predicted engagement value of kind $\kappa$ for user $u$.
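A compact sketch of these three metrics using NumPy and SciPy (array names are our own; GraphBench's Evaluator computes them internally):

```python
import numpy as np
from scipy.stats import spearmanr

def engagement_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = np.abs(y_true - y_pred).mean()
    r2 = 1.0 - ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    rho = spearmanr(y_true, y_pred).correlation   # rank-order agreement
    return {"MAE": mae, "R2": r2, "Spearman": rho}
```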
5.2.2. Chip Design (AIG Generation)
- Metrics: Ad-hoc Score (defined in the paper).
- Ad-hoc Score (for AIG Generation):
  - Conceptual Definition: Measures the quality of generated AIGs by considering both logical correctness (functional equivalence to the target Boolean function) and structural efficiency (minimizing the number of internal nodes, relative to a baseline).
  - Mathematical Formula:
    $ \mathrm{score}(\tilde{C}) := \frac{100}{N} \sum_{i=1}^{N} \frac{\#C_i^{\mathrm{ref}}}{\#\tilde{C}_i} \cdot \mathbb{1}[\tilde{C}_i \equiv f_i] $
  - Symbol Explanation:
    - $\tilde{C}$: The set of AIGs generated by the model.
    - $N$: The total number of Boolean functions (instances).
    - $\#\tilde{C}_i$: The number of internal nodes of the $i$-th generated circuit.
    - $\#C_i^{\mathrm{ref}}$: The number of internal nodes of the corresponding baseline AIG generated by ABC (a standard tool).
    - $\mathbb{1}[\tilde{C}_i \equiv f_i]$: An indicator function that is 1 if the generated circuit is logically equivalent to the target Boolean function $f_i$, and 0 otherwise.
5.2.3. Electronic Circuits (Voltage Conversion Ratio Prediction)
- Metrics: Relative Squared Error (RSE).
- Relative Squared Error (RSE):
  - Conceptual Definition: A normalized version of the Mean Squared Error that compares the model's performance to that of a simple baseline predictor (the mean of the target values). It indicates how much better the model is at predicting than simply guessing the average. A lower RSE is better.
  - Mathematical Formula:
    $ \mathrm{RSE}(y, \hat{y}) := \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} $
  - Symbol Explanation:
    - $N$: The total number of observations.
    - $y_i$: The ground-truth value for the $i$-th observation from high-fidelity simulation.
    - $\hat{y}_i$: The model's predicted value for the $i$-th observation.
    - $\bar{y}$: The mean of all ground-truth values $y_i$.
5.2.4. SAT Solving (Algorithm Selection and Performance Prediction)
- Metrics for Performance Prediction: Root Mean Squared Error (RMSE) on log10-scaled computation times.
- Metrics for Algorithm Selection: Closed Gap (CG).
- Root Mean Squared Error (RMSE): (For time prediction)
  - Conceptual Definition: The square root of the average of the squared differences between predicted values and actual values. It measures the average magnitude of the errors, with larger errors having a disproportionately larger effect.
  - Mathematical Formula: For performance prediction, the model predicts the base-10 logarithm of the PAR10 score. Let $Y_i$ be the true log-scaled time for a solver-instance pair $i$, and $\hat{Y}_i$ be the predicted log-scaled time. $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2} $
  - Symbol Explanation:
    - $N$: The total number of solver-instance pairs.
    - $Y_i$: The true $\log_{10}$-scaled computation time for the $i$-th pair.
    - $\hat{Y}_i$: The predicted $\log_{10}$-scaled computation time for the $i$-th pair.
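A small sketch of the RMSE computation on log10-scaled times (illustrative only; GraphBench's evaluation code may organize this differently):

```python
import numpy as np

def rmse_log10(times_true: np.ndarray, times_pred: np.ndarray) -> float:
    """RMSE between base-10 log-scaled computation times, as used for
    SAT performance prediction."""
    y_true = np.log10(times_true)
    y_pred = np.log10(times_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Predicting 10x too slow on every instance gives an RMSE of exactly 1.
print(rmse_log10(np.array([1.0, 100.0]), np.array([10.0, 1000.0])))  # 1.0
```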
- Closed Gap (CG):
  - Conceptual Definition: Quantifies how close an algorithm selector (the model) comes to the performance of the Virtual Best Solver (VBS) relative to the Single Best Solver (SBS). A CG of 1 means the selector achieved VBS performance, 0 means SBS performance, and negative values mean worse than SBS.
  - Mathematical Formula: $ \mathrm{CG}(\varPi) = \frac{\mathrm{PAR}_{10}(\mathrm{SBS}, \varPi) - \mathrm{PAR}_{10}(s, \varPi)}{\mathrm{PAR}_{10}(\mathrm{SBS}, \varPi) - \mathrm{PAR}_{10}(\mathrm{VBS}, \varPi)} $
  - Symbol Explanation:
    - $\varPi$: A set of SAT instances.
    - $\mathrm{PAR}_{10}(\cdot, \varPi)$: The Penalized Average Runtime (PAR10) score for a solver/selector on instance set $\varPi$. PAR10 is the mean computation time, where unsolved instances are penalized by ten times the cutoff time.
    - $\mathrm{SBS}$: The Single Best Solver (the individual solver that performs best overall across $\varPi$).
    - $\mathrm{VBS}$: The Virtual Best Solver (an oracle that always picks the best solver for each individual instance in $\varPi$).
    - $s$: The algorithm selector being evaluated.
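The following sketch computes PAR10 and the Closed Gap from a runtime matrix. It is a minimal illustration, assuming timeouts are recorded as runtimes at or above the cutoff, and is not GraphBench's actual evaluation code:

```python
import numpy as np

def par10(runtimes: np.ndarray, cutoff: float) -> float:
    """PAR10: mean runtime, with timeouts counted as 10x the cutoff."""
    penalized = np.where(runtimes >= cutoff, 10.0 * cutoff, runtimes)
    return float(penalized.mean())

def closed_gap(runtime_matrix: np.ndarray, selected: np.ndarray, cutoff: float) -> float:
    """Closed Gap of a selector.

    runtime_matrix: shape (n_instances, n_solvers), per-instance runtimes.
    selected: index of the solver the selector picks for each instance.
    """
    n, n_solvers = runtime_matrix.shape
    # SBS: the single solver with the best overall PAR10.
    sbs = min(par10(runtime_matrix[:, j], cutoff) for j in range(n_solvers))
    # VBS: oracle picking the per-instance best solver.
    vbs = par10(runtime_matrix.min(axis=1), cutoff)
    sel = par10(runtime_matrix[np.arange(n), selected], cutoff)
    return (sbs - sel) / (sbs - vbs)

# A selector that always picks the per-instance best solver closes the full gap:
R = np.array([[1.0, 50.0], [50.0, 1.0]])
print(closed_gap(R, R.argmin(axis=1), cutoff=100.0))  # 1.0
```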
5.2.5. Combinatorial Optimization (CO)
- Metrics for Supervised Learning: Mean Absolute Error (MAE).
- Metrics for Unsupervised Learning: The objective value of the decoded solution (e.g., MIS size, Max-Cut size, number of colors used). For MIS and Max-Cut, higher values are better; for Graph Coloring, lower values are better.
- Mean Absolute Error (MAE): (See definition above for Social Networks). Here, it measures the error between predicted and true optimal objective values.
5.2.6. Algorithmic Reasoning
- Metrics for Regression Tasks: Mean Absolute Error (MAE).
- Metrics for Binary Classification Tasks: F1 Score.
- Mean Absolute Error (MAE): (See definition above for Social Networks). Here, it measures the error between model predictions and ground-truth solutions for tasks like Topological Sorting or Max Flow.
- F1 Score:
  - Conceptual Definition: The harmonic mean of precision and recall. It is a measure of a model's accuracy on a binary classification problem that considers both false positives and false negatives. It is particularly useful when the classes are imbalanced.
  - Mathematical Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $ where $\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$ and $\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$.
  - Symbol Explanation:
    - $\mathrm{TP}$: True Positives (correctly predicted positive instances).
    - $\mathrm{FP}$: False Positives (incorrectly predicted positive instances).
    - $\mathrm{FN}$: False Negatives (incorrectly predicted negative instances).
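A one-function sketch of F1 from the three counts above:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
print(f1_score(8, 2, 4))  # ~0.727
```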
5.2.7. Earth Systems (Weather Forecasting)
- Metrics: Mean Squared Error (MSE) for each variable and cumulatively across all variables.
- Mean Squared Error (MSE): (Unweighted versions are reported for evaluation, distinct from the weighted training loss).
  - Conceptual Definition: Measures the average of the squares of the errors. It quantifies the average squared difference between the estimated values and the actual values. It is always non-negative, and values closer to zero are better.
  - Mathematical Formula (for a single variable $j$): $ \mathrm{MSE}_j (x^d, \hat{x}^d) = \frac{1}{|D||G||L_j|} \sum_{d \in D} \sum_{i \in G} \sum_{\ell \in L_j} (\hat{x}_{i,j,\ell}^d - x_{i,j,\ell}^d)^2 $
  - Mathematical Formula (for all variables): $ \mathrm{MSE}_{\mathrm{all}} (x^d, \hat{x}^d) = \frac{1}{|D||G| \sum_{j \in J} |L_j|} \sum_{d \in D} \sum_{i \in G} \sum_{j \in J} \sum_{\ell \in L_j} (\hat{x}_{i,j,\ell}^d - x_{i,j,\ell}^d)^2 $
  - Symbol Explanation:
    - $D$: The set of forecast date-times.
    - $G$: The grid cells (locations).
    - $J$: The set of meteorological variables.
    - $L_j$: The set of pressure levels for variable $j$.
    - $x_{i,j,\ell}^d$: The ground-truth value for grid cell $i$, variable $j$, pressure level $\ell$, at date-time $d$.
    - $\hat{x}_{i,j,\ell}^d$: The model's predicted value for grid cell $i$, variable $j$, pressure level $\ell$, at date-time $d$.
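A minimal NumPy sketch of both MSE variants, assuming for simplicity that all variables share the same set of pressure levels (the formulas above allow variable-specific level sets $L_j$):

```python
import numpy as np

def mse_per_variable(x_true: np.ndarray, x_pred: np.ndarray) -> np.ndarray:
    """Unweighted per-variable MSE for arrays of shape
    (dates D, grid cells G, variables J, pressure levels L),
    averaging over dates, grid cells, and levels."""
    return ((x_pred - x_true) ** 2).mean(axis=(0, 1, 3))

def mse_all(x_true: np.ndarray, x_pred: np.ndarray) -> float:
    """Cumulative MSE over all variables and levels."""
    return float(((x_pred - x_true) ** 2).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32, 6, 3))                  # D=4, G=32, J=6, L=3
x_hat = x + rng.normal(scale=0.1, size=x.shape)     # noisy "forecast"
print(mse_per_variable(x, x_hat).shape)             # (6,)
print(mse_all(x, x_hat))                            # ~0.01
```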
5.3. Baselines
The paper benchmarks GraphBench using common message-passing neural networks (MPNNs) and graph transformer (GT) models. These models are evaluated within the encoder-processor-decoder framework; a minimal sketch of this framework follows the baseline list below.
- Graph Isomorphism Network (GIN): A powerful MPNN that serves as a strong GNN baseline due to its high discriminative power (matching the 1-WL test). It is used across various tasks including Combinatorial Optimization, Algorithmic Reasoning, and SAT Solving. For Social Networks, a GraphConv variant with mean aggregation is used instead, due to dataset characteristics.
- Graph Transformer (GT): A Graph Transformer model is used as a baseline to explore the efficacy of self-attention mechanisms on graph data. The implementation follows Bechler-Speicher et al. (2025), using biased attention and incorporating RWSE and LPE as positional encodings. It is used across Algorithmic Reasoning, Combinatorial Optimization, Electronic Circuits, and Weather Forecasting. GTs were excluded from SAT Solving due to the high computational cost of positional embeddings and out-of-memory errors on larger graphs.
- Multilayer Perceptron (MLP): A simple feed-forward neural network that processes node features independently, ignoring all graph structure. It serves as a connectivity-agnostic baseline to demonstrate the value added by graph-aware architectures. Used in Social Networks and Combinatorial Optimization.
- DeepSets: An architecture designed for permutation-invariant tasks, where the output does not depend on the order of input elements. It aggregates information globally without explicit message passing. Used as another connectivity-agnostic baseline (though it uses global features) in Social Networks and Combinatorial Optimization.
- Traditional SAT Solver Baselines (for SAT Solving):
  - Algorithm Selection: Pairwise Regression (PR), Pairwise Classification (PC), and Multi-Class Classification (MC), all using Random Forest on hand-crafted SATzilla features. These are robust, established methods in the algorithm selection field.
  - Performance Prediction: Random Forest (RF) and XGBoost models, also using SATzilla features, which are state-of-the-art for empirical performance modeling (EPM) in SAT.
- ABC Variants (for Chip Design): STRaSH, ReSyN, Compress2, and Resyn2rs. These are classical logic synthesis tools/scripts of varying complexity, providing a strong combinatorial optimization baseline for AIG generation. No learning-based generative baselines were included for AIG due to their inability to produce functionally equivalent circuits.
- Persistence Model (for Weather Forecasting): A simple baseline that assumes future conditions will be the same as the current conditions. This is a common and robust baseline in weather forecasting that shows the minimal skill required for a useful model.
- GraphCast (Lam et al., 2023) (for Weather Forecasting): A state-of-the-art MPNN-based system for global weather forecasting. This represents a strong, specialized graph learning model in the domain.

The consistent use of three random seeds for evaluation and the reporting of mean performance with standard deviations ensure statistical robustness of the baseline results.
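As a rough illustration of the encoder-processor-decoder framework referenced above, here is a minimal PyTorch Geometric sketch of a GIN-based graph-level baseline. Layer counts, normalization, and readout in GraphBench's actual implementations may differ:

```python
import torch
from torch import nn
from torch_geometric.nn import GINConv, global_add_pool

class GINBaseline(nn.Module):
    """Encoder-processor-decoder sketch with a GIN processor."""

    def __init__(self, in_dim: int, hidden: int, out_dim: int, layers: int = 3):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)            # encoder: lift features
        self.processor = nn.ModuleList([                    # processor: message passing
            GINConv(nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden)))
            for _ in range(layers)
        ])
        self.decoder = nn.Linear(hidden, out_dim)           # decoder: task head

    def forward(self, x, edge_index, batch):
        h = self.encoder(x)
        for conv in self.processor:
            h = torch.relu(conv(h, edge_index))
        return self.decoder(global_add_pool(h, batch))      # graph-level readout
```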
6. Results & Analysis
The results presented in GraphBench not only showcase baseline performances across diverse tasks but also highlight persistent challenges and areas where current graph learning models fall short.
6.1. Core Results Analysis
6.1.1. Social Networks (BlueSky Engagements)
- Key Finding: Graph-aware models (GNNs) consistently outperform connectivity-agnostic baselines (MLPs, DeepSets), demonstrating the value of relational structure for predicting user engagements. However, the absolute performance metrics (R², Spearman σ) remain relatively low, indicating significant room for improvement.
- Advantages/Disadvantages: GNNs capture local, directed relational structures, which are informative. DeepSets performed worse than MLPs in some cases, suggesting that naive global feature aggregation without structural context can be detrimental. The tasks are challenging, especially in predicting subtle future engagement metrics.

The following are the results from Table 3 of the original paper:
| Dataset | Model | MAE | R² (likes) | R² (replies) | R² (reposts) | Spearman σ (likes) | Spearman σ (replies) | Spearman σ (reposts) |
| likes | DeepSets | 0.810±0.005 | 0.140±0.002 | 0.102±0.002 | 0.138±0.002 | 0.307±0.004 | 0.178±0.002 | 0.334±0.003 |
| | GNN | 0.768±0.002 | 0.165±0.002 | 0.134±0.002 | 0.175±0.002 | 0.330±0.003 | 0.192±0.002 | 0.337±0.022 |
| | MLP | 0.784±0.001 | 0.145±0.001 | 0.107±0.001 | 0.141±0.001 | 0.308±0.003 | 0.176±0.002 | 0.335±0.002 |
| replies | DeepSets | 0.789±0.033 | 0.086±0.033 | 0.061±0.024 | 0.104±0.014 | 0.253±0.003 | 0.130±0.001 | 0.240±0.006 |
| | GNN | 0.694±0.002 | 0.158±0.003 | 0.119±0.002 | 0.159±0.002 | 0.258±0.009 | 0.132±0.006 | 0.247±0.011 |
| | MLP | 0.725±0.004 | 0.131±0.003 | 0.087±0.003 | 0.122±0.003 | 0.249±0.004 | 0.127±0.005 | 0.243±0.003 |
| reposts | DeepSets | 0.918±0.013 | 0.049±0.005 | 0.031±0.012 | 0.050±0.009 | 0.229±0.004 | 0.129±0.007 | 0.205±0.009 |
| | GNN | 0.832±0.009 | 0.131±0.017 | 0.111±0.022 | 0.128±0.008 | 0.284±0.044 | 0.143±0.032 | 0.260±0.051 |
| | MLP | 0.874±0.003 | 0.087±0.002 | 0.051±0.003 | 0.066±0.002 | 0.234±0.008 | 0.123±0.006 | 0.206±0.010 |
6.1.2. Chip Design (AIG Generation)
- Key Finding: Traditional ABC tools achieve higher scores (structural efficiency and logical correctness) with increasing complexity (e.g., Resyn2rs > STRaSH). A notable result is the absence of competitive learning-based baselines, as existing generative models struggled with functional equivalence constraints.
- Advantages/Disadvantages: The ABC variants excel at this constrained generative task. The lack of ML baselines highlights a significant gap and research opportunity for conditional DAG generative models that can handle strict constraints.

The following are the results from Table 5 of the original paper (columns give the number of (inputs, outputs)):
| Method | (6,1) | (6,2) | (7,1) | (7,2) | (8,1) | (8,2) | Overall |
| STRaSH | 73.90 | 74.23 | 76.55 | 75.71 | 77.63 | 76.52 | 75.76 |
| ReSyN | 86.79 | 88.16 | 89.28 | 89.05 | 90.45 | 89.78 | 88.92 |
| Compress2 | 92.53 | 93.57 | 93.18 | 92.79 | 93.28 | 92.49 | 92.97 |
| Resyn2rs | 93.30 | 95.24 | 95.15 | 95.64 | 96.24 | 96.11 | 95.28 |
6.1.3. Electronic Circuits (Voltage Conversion Ratio Prediction)
- Key Finding: Graph Transformers (GTs) consistently achieve higher prediction accuracy (lower RSE) than MPNN baselines (GIN, GAT, GCN) across different circuit sizes (5, 7, 10 components) and prediction targets. All models' RSE increases with circuit complexity, primarily due to reduced training data for larger circuits and combinatorial growth.
- Advantages/Disadvantages: GTs' ability to model long-range dependencies likely contributes to their superior performance. The challenge of scaling training data for larger, more complex circuits is evident, reflecting a real-world bottleneck.

The following are the results from Table 6 of the original paper:
| Task | Method | 5 comp | 7 comp | 10 comp |
| Efficiency | GT | 0.07±0.03 | 0.18±0.06 | 0.29±0.03 |
| | GIN | 0.09±0.06 | 0.21±0.04 | 0.44±0.13 |
| | GAT | 0.11±0.05 | 0.22±0.05 | 0.38±0.05 |
| | GCN | 0.13±0.07 | 0.30±0.04 | 0.53±0.06 |
| Voltage | GT | 0.12±0.02 | 0.31±0.05 | 0.39±0.05 |
| | GIN | 0.14±0.05 | 0.35±0.03 | 0.54±0.12 |
| | GAT | 0.17±0.13 | 0.34±0.12 | 0.48±0.02 |
| | GCN | 0.18±0.07 | 0.46±0.08 | 0.61±0.07 |
6.1.4. SAT Solving (Algorithm Selection and Performance Prediction)
- Key Finding: For performance prediction, traditional tabular ML models (Random Forest, XGBoost) using hand-crafted SATzilla features significantly outperform MPNNs across all SAT solver types and graph representations. For algorithm selection, Pairwise Regression (PR) performs best, again with SATzilla features. MPNNs struggle with large instances due to computational limitations.
- Advantages/Disadvantages: The SATzilla features are highly effective, demonstrating that good feature engineering can still trump raw graph learning in some domains. MPNNs faced out-of-memory errors on larger instances and struggled to match SATzilla-based baselines, indicating scalability and expressivity challenges for GNNs in this domain. The choice of graph representation also matters, with VG generally outperforming VCG and LCG for MPNNs.

The following are the results from Table 8 of the original paper:
| Solver | Method | VG | VCG | LCG | SATzilla |
| Kissat | GIN | 1.36±0.15 | 1.15±0.03 | 1.49±0.02 | − |
| | RF | − | − | − | 0.61±0.00 |
| | XGB | − | − | − | 0.63±0.00 |
| BreakIDKissat | GIN | 1.29±0.03 | 1.33±0.15 | 1.52±0.01 | − |
| | RF | − | − | − | 0.65±0.00 |
| | XGB | − | − | − | 0.68±0.01 |
| KissatMaBDC | GIN | 1.27±0.00 | 1.43±0.24 | 1.60±0.01 | − |
| | RF | − | − | − | 0.70±0.00 |
| | XGB | − | − | − | 0.74±0.00 |
| AMSAT | GIN | 1.43±0.01 | 1.45±0.09 | 1.54±0.00 | − |
| | RF | − | − | − | 0.68±0.00 |
| | XGB | − | − | − | 0.72±0.01 |
The following are the results from Table 9a of the original paper:
| Solver | Method | SATzilla |
| Kissat | RF | 0.59±0.00 |
| | XGB | 0.62±0.00 |
| BreakIDKissat | RF | 0.64±0.00 |
| | XGB | 0.67±0.00 |
| KissatMaBDC | RF | 0.68±0.00 |
| | XGB | 0.72±0.00 |
| AMSAT | RF | 0.64±0.00 |
| | XGB | 0.67±0.00 |
The following are the results from Table 9b of the original paper:
| Solver | Method | SATzilla |
| Kissat | RF | 0.58±0.00 |
| | XGB | 0.63±0.00 |
| BreakIDKissat | RF | 0.62±0.00 |
| | XGB | 0.67±0.00 |
| KissatMaBDC | RF | 0.66±0.00 |
| | XGB | 0.70±0.00 |
| AMSAT | RF | 0.62±0.00 |
| | XGB | 0.66±0.00 |
The following are the results from Table 10 of the original paper:
| Instances | Method | VG | VCG | LCG |
| Small | GIN | 0.05±0.00 | 0.00±0.00 | −0.02±0.02 |
| | PC | 0.03±0.02 | | |
| | PR | 0.48±0.03 | | |
| | MC | −0.32±0.02 | | |
| Medium | GIN | | | |
| | PC | 0.13±0.02 | | |
| | PR | 0.44±0.02 | | |
| | MC | −0.13±0.02 | | |
| Large | GIN | | | |
| | PC | 0.33±0.01 | | |
| | PR | 0.54±0.02 | | |
| | MC | 0.23±0.02 | | |

(PC, PR, and MC operate on SATzilla features rather than a graph representation, so they have a single score per instance set.)
6.1.5. Combinatorial Optimization (CO)
- Key Finding: For supervised CO tasks (MIS), GIN generally performs best among baselines due to its MPNN inductive bias. DeepSet often outperforms MLP, indicating the benefit of global information. GTs sometimes perform poorly, possibly due to training difficulties. For unsupervised CO, GIN and GT generally outperform MLP and DeepSet (graph-based models are better) across all three CO problems (MIS, Max-Cut, Graph Coloring), with GIN showing overall stronger performance. However, for sparse BA graphs in Graph Coloring, DeepSet surprisingly excels.
- Advantages/Disadvantages: MPNNs are well-suited for graph-structured CO problems. The results confirm that current basic GNN baselines do not yet compete with specialized CO-specific architectures or exact solvers in solution quality, but offer speed advantages.

The following are the results from Table 12 of the original paper:
| Dataset | Method | Small | Large |
| RB graph | GIN | 0.491±0.099 | 2.125±0.484 |
| | GT | 4.112±2.353 | 0.915±0.235 |
| | MLP | 1.583±0.052 | 1.437±0.520 |
| | DeepSet | 0.918±0.186 | 1.427±0.224 |
| ER graph | GIN | 0.234±0.191 | 0.352±0.265 |
| | GT | 6.486±8.101 | 9.641±15.335 |
| | MLP | 0.751±0.767 | 0.914±0.307 |
| | DeepSet | 0.756±0.711 | 2.244±0.364 |
| BA graph | GIN | 0.292±0.041 | 0.111±0.016 |
| | GT | 3.481±3.446 | 1.829±1.383 |
| | MLP | 2.825±0.949 | 3.383±0.504 |
| | DeepSet | 2.304±0.262 | 3.362±1.491 |
The following are the results from Table 13 of the original paper:
| Problem | Dataset | Method | Small | Large |
| MIS (MIS size ↑) | RB | GIN | 17.294±0.328 | 13.999±0.321 |
| | | GT | 16.542±0.477 | 13.406±0.140 |
| | | MLP | 16.105±0.097 | 13.040±0.214 |
| | | DeepSet | 16.021±0.032 | 13.183±0.035 |
| | | Solver | 20.803±1.817 | 42.547±4.449 |
| | ER | GIN | 25.418±0.407 | 26.276±0.408 |
| | | GT | 22.984±0.473 | 24.980±0.292 |
| | | MLP | 23.183±0.016 | 24.259±0.449 |
| | | DeepSet | 23.050±0.061 | 24.220±0.056 |
| | | Solver | 33.604±1.428 | 45.637±0.631 |
| | BA | GIN | 100.16±3.674 | 135.00±0.720 |
| | | GT | 99.579±6.448 | 114.26±0.601 |
| | | MLP | 95.108±2.042 | 114.49±0.758 |
| | | DeepSet | 95.076±0.173 | 114.89±0.016 |
| | | Solver | 142.86±16.54 | 433.77±19.17 |
| Maximum Cut (maximum cut size ↑) | ER | GIN | 2106.7±14.62 | 24748.±87.76 |
| | | GT | 1925.7±32.75 | 21524.±184.0 |
| | | MLP | 1727.7±165.1 | 20357.±249.6 |
| | | DeepSet | 2920.1±97.23 | 3575.9±730.0 |
| | | Solver | — | 33914.±7861. |
| | RB | GIN | 2327.9±24.78 | 20878.±107.9 |
| | | GT | 2172.7±91.75 | 16534.±278.0 |
| | | MLP | 1866.7±67.64 | 7335.4±57.49 |
| | | DeepSet | 2835.5±607.6 | — |
| | | Solver | — | 23884.±1809. |
| | BA | GIN | 397.00±0.605 | 1044.1±0.649 |
| | | GT | 363.76±0.639 | 986.93±3.128 |
| | | MLP | 308.73±0.224 | 929.20±4.060 |
| | | DeepSet | 1.0669±0.800 | 154.31±151.5 |
| | | Solver | 460.91±50.13 | 1260.4±48.81 |
| Graph Coloring (number of colors used ↓) | RB | GIN | 25.166±0.288 | 55.513±0.526 |
| | | GT | 25.146±0.253 | 55.562±0.648 |
| | | MLP | 24.733±0.667 | 55.558±0.557 |
| | | DeepSet | 26.723±0.189 | 71.051±0.604 |
| | | Solver | 19.970±3.465 | 41.480±6.634 |
| | ER | GIN | 16.182±0.202 | 34.587±0.545 |
| | | GT | 16.188±0.201 | 34.385±0.413 |
| | | MLP | 17.110±0.144 | 34.658±0.394 |
| | | DeepSet | 18.266±0.018 | 55.345±0.551 |
| | | Solver | 10.235±0.836 | 22.933±0.772 |
| | BA | GIN | 5.1318±0.114 | 6.2028±0.283 |
| | | GT | 5.0939±0.070 | 6.1167±0.086 |
| | | MLP | 5.9900±0.127 | 9.4215±0.186 |
| | | DeepSet | 3.2780±0.122 | 3.1981±0.239 |
| | | Solver | 3.0000±0.000 | 3.0000±0.000 |
6.1.6. Algorithmic Reasoning
- Key Finding: Performance varies significantly by task and difficulty. GTs show improved performance on MST, Max Clique, and Max Matching, while GIN performs better on Max Flow, Bridges, and Steiner Trees. F1 scores generally decrease with increasing difficulty. Size generalization shows mixed results: MST and Steiner Trees sometimes improve with larger graphs (possibly due to changes in average degree), while Topological Sorting, Max Matching, and Max Clique decrease.
- Advantages/Disadvantages: The task-dependent nature of performance suggests no single GNN architecture is universally optimal. Steiner Trees and Max Clique are particularly challenging. The size generalization results highlight that current baselines are not size-invariant across all tasks, pointing to a need for more robust OOD generalization mechanisms.

The following are the results from Table 16 of the original paper:
| Task | Difficulty | Model | F1 |
| MST | EASY | GIN | 0.6906±0.1655 |
| | | GT | 0.8566±0.0068 |
| | MEDIUM | GIN | 0.7288±0.0894 |
| | | GT | 0.8504±0.0148 |
| | HARD | GIN | 0.6107±0.2015 |
| | | GT | 0.8421±0.0115 |
| Bridges | EASY | GIN | 0.9831±0.0184 |
| | | GT | 0.9269±0.0103 |
| | MEDIUM | GIN | 0.9622±0.0077 |
| | | GT | 0.8762±0.0258 |
| | HARD | GIN | 0.968±0.0178 |
| | | GT | 0.8897±0.0304 |
| Steiner Trees | EASY | GIN | 0.6691±0.0288 |
| | | GT | 0.6691±0.0112 |
| | MEDIUM | GIN | 0.5628±0.1100 |
| | | GT | 0.5672±0.0790 |
| | HARD | GIN | 0.5516±0.2368 |
| | | GT | 0.5212±0.0219 |
| Max Clique | EASY | GIN | 0.4584±0.0101 |
| | | GT | 0.5407±0.0088 |
| | MEDIUM | GIN | 0.3996±0.0852 |
| | | GT | 0.4859±0.0017 |
| | HARD | GIN | 0.4102±0.0820 |
| | | GT | 0.4868±0.0013 |
| Max Matching | EASY | GIN | 0.7527±0.0051 |
| | | GT | 0.7402±0.0172 |
| | MEDIUM | GIN | 0.6399±0.0231 |
| | | GT | 0.6915±0.009 |
| | HARD | GIN | 0.6595±0.0164 |
| | | GT | 0.6743±0.0038 |
The following are the results from Table 17 of the original paper:
| Task | Difficulty | Model | MAE |
| Topological Sorting | EASY | GIN | 0.1001±0.0089 |
| | | GT | 0.116±0.0058 |
| | MEDIUM | GIN | 0.1537±0.0031 |
| | | GT | 0.1305±0.0094 |
| | HARD | GIN | 0.1301±0.0046 |
| | | GT | 0.1532±0.0214 |
| Max Flow | EASY | GIN | 3.4387±0.0631 |
| | | GT | 4.2737±0.0646 |
| | MEDIUM | GIN | 9.5960±0.1707 |
| | | GT | 6.3786±0.4262 |
| | HARD | GIN | 9.5061±0.1265 |
| | | GT | 6.4833±0.0869 |
The following are the results from Table 18 of the original paper:
| Dataset (Score) | Model | 128 | 192 | 256 | 384 | 512 |
| Topological Sorting (MAE) | GT | 0.1346 | 0.1730 | 0.1827 | 0.2018 | 0.2071 |
| GIN | 0.1492 | 0.181 | 0.1981 | 0.2015 | 0.2179 | |
| MST (F1) | GT | 0.8720 | 0.8773 | 0.8891 | 0.8874 | 0.8817 |
| GIN | 0.6103 | 0.8132 | 0.8605 | 0.8765 | 0.8823 | |
| Bridges (F1) | GT | 0.8799 | 0.8842 | 0.8959 | 0.9049 | 0.9118 |
| GIN | 0.9579 | 0.9213 | 0.9190 | 0.9203 | 0.9212 | |
| Steiner Trees (F1) | GT | 0.5160 | 0.5322 | 0.5221 | 0.5762 | 0.5578 |
| GIN | 0.5499 | 0.5739 | 0.6338 | 0.6651 | 0.6502 | |
| Max Clique (F1) | GT | 0.4877 | 0.4849 | 0.4890 | 0.4915 | 0.4931 |
| GIN | 0.3496 | 0.3112 | 0.3148 | 0.2926 | 0.2673 | |
| Max Matching (F1) | GT | 0.7010 | 0.6552 | 0.6206 | 0.5770 | OOT |
| GIN | 0.6271 | 0.6326 | 0.6372 | 0.6382 | OOT |
6.1.7. Earth Systems (Weather Forecasting)
- Key Finding: The provided GNN baselines establish a transparent, reproducible lower bound for medium-range weather forecasting. While they do not reach the performance of specialized systems like GraphCast, or even Persistence for some variables, this is expected given their simpler architecture and limited training.
- Advantages/Disadvantages: The baselines are straightforward and reproducible, serving their purpose as reference points. However, they lack domain-specific refinements and extensive hyperparameter optimization, highlighting the complexity of weather forecasting and the need for more advanced GNN designs.

The following are the results from Table 19 of the original paper:
| Pressure Level | Variable | GT | Persistence | GraphCast |
| All | All variables | 179.629k | — | — |
| Surface | 2-m temperature (2T) | 7.572 | 7.123 | 0.068 |
| | 10-m u wind component (10U) | 5.072 | 2.166 | 0.012 |
| | 10-m v wind component (10V) | 6.102 | 3.266 | 0.013 |
| | Mean sea level pressure (MSL) | 102.551k | 60.056k | 240.832 |
| | Total precipitation (TP) | 0.009 | 714.517n | 52.377n |
| 500 | Temperature (T) | 2.751 | 1.120 | 0.007 |
| | U component of wind (U) | 15.026 | 6.658 | 0.048 |
| | V component of wind (V) | 27.003 | 12.988 | 0.053 |
| | Geopotential (Z) | 86.250k | 48.637k | 155.057 |
| | Specific humidity (Q) | 0.010 | 66.211n | 1.090n |
| | Vertical wind speed (W) | 0.032 | 6.242 | 0.050 |
| 700 | Temperature (T) | 2.406 | 1.051 | 0.006 |
| | U component of wind (U) | 8.716 | 3.951 | 0.025 |
| | V component of wind (V) | 14.156 | 6.754 | 0.027 |
| | Geopotential (Z) | 51.669k | 32.773k | 141.296 |
| | Specific humidity (Q) | 0.009 | 250.556n | 3.304n |
| | Vertical wind speed (W) | 0.024 | 3.352 | 0.027 |
| 850 | Temperature (T) | 2.832 | 1.351 | 0.009 |
| | U component of wind (U) | 9.151 | 4.015 | 0.022 |
| | V component of wind (V) | 12.289 | 6.565 | 0.023 |
| | Geopotential (Z) | 51.542k | 32.453k | 139.716 |
| | Specific humidity (Q) | 0.010 | 354.058n | 4.881n |
| | Vertical wind speed (W) | 0.016 | 3.517 | 0.024 |
| All | Temperature (T) | 2.805 | — | — |
| | U component of wind (U) | 13.855 | — | — |
| | V component of wind (V) | 21.501 | — | — |
| | Geopotential (Z) | 77.020k | — | — |
| | Specific humidity (Q) | 0.010 | — | — |
| | Vertical wind speed (W) | 0.018 | — | — |
6.2. Ablation Studies / Parameter Analysis
The paper mentions an example of automated hyperparameter optimization (HPO) using SMAC3 to demonstrate its effectiveness.
- Experiment: Tuning the GIN model on the SMALL SAT dataset with the VG graph representation.
- Optimization Budget: 150 evaluations, varying fidelity over training gradient steps between 1,000 and 10,000.
- Configuration Space: The following are the results from Table 20 of the original paper:

| Hyperparameter | Range |
| Learning rate | [1e-6, 0.1] |
| Weight decay | [1e-8, 0.1] |
| Warmup iterations | [1,000, 20,000] |
| Dropout | [0.0, 0.5] |
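For readers unfamiliar with SMAC3, the following is a minimal sketch of how such a search space and tuning loop can be set up with SMAC3's Python API (2.x style). The `train_gin` objective is a placeholder, the log-scaling of the rates is our assumption, and the exact facade arguments may differ between SMAC3 versions:

```python
from ConfigSpace import Configuration, ConfigurationSpace, Float, Integer

# Search space mirroring Table 20 (log-scaling the rates is an assumption).
cs = ConfigurationSpace()
cs.add_hyperparameters([
    Float("learning_rate", (1e-6, 0.1), log=True),
    Float("weight_decay", (1e-8, 0.1), log=True),
    Integer("warmup_iterations", (1000, 20000)),
    Float("dropout", (0.0, 0.5)),
])

def train_gin(config: Configuration, seed: int = 0) -> float:
    """Placeholder objective: train GIN with `config`, return validation RMSE."""
    raise NotImplementedError  # plug in the actual training loop here

from smac import HyperparameterOptimizationFacade, Scenario

scenario = Scenario(cs, n_trials=150)  # 150 evaluations, as in the paper's budget
incumbent = HyperparameterOptimizationFacade(scenario, train_gin).optimize()
```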
Results: The following are the results from Table 21 of the original paper:
| Solver | Method | VG |
| Kissat | GIN | 1.36±0.15 |
| | GIN (tuned) | 1.26±0.03 |
- Analysis: The automated HPO achieved a 7.3% reduction in RMSE for Kissat solver performance prediction compared to the manually tuned GIN model. This empirically demonstrates that GraphBench's HPO framework can effectively improve model performance, validating its inclusion as a core component for robust benchmarking.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces GraphBench, a meticulously designed and comprehensive benchmarking suite that addresses critical limitations in current graph machine learning (GML) evaluation practices. By integrating diverse domains (social sciences, hardware design, reasoning & optimization, earth systems) and task types (node-, edge-, graph-level, generative), GraphBench establishes standardized evaluation protocols, including consistent data splits, domain-specific metrics, and crucial out-of-distribution (OOD) generalization tests. The inclusion of a unified hyperparameter tuning framework and principled baselines (using MPNNs and Graph Transformers) provides a robust foundation for future GML research. The initial benchmarking results highlight persistent challenges such as handling temporal distribution shifts, scalability to very large graphs, and conditional graph generation with strict constraints, emphasizing that many GML problems are far from solved.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose clear directions for future work:
- Computational Resources: The evaluation of Graph Transformers on some SAT solving tasks was limited due to high computational costs for positional embeddings and out-of-memory errors on large graphs (even with high-end GPUs). This indicates that scaling current GTs to very large graphs remains a practical challenge.
- Domain-Specific Refinements: For weather forecasting, the simple baselines did not incorporate domain-specific refinements (e.g., positional edge features) known to improve forecast skill. This suggests that while GraphBench provides a general framework, incorporating domain expertise is still critical for state-of-the-art performance.
- Benchmarking Generative Models: Existing graph generative models struggled to meet strict constraints like functional equivalence in chip design, indicating a significant gap in current generative GML capabilities.
- Limited HPO Scope: Due to limited computational resources, HPO was not run on all datasets and domains in this initial release, implying that the reported baseline performances might be further improvable.
- Long-Term Relevance: While the LARGE and MEDIUM SAT datasets are currently inaccessible to many GNNs due to hardware constraints, they are included to ensure the benchmark remains relevant as hardware and GNN architectures evolve.
Future Work:
- Continuous Expansion: GraphBench is envisioned as a "living benchmark," continuously expanding with new domains, tasks, and evaluation paradigms.
- Multi-Modal Graph Foundation Models: A key future direction is to extend GraphBench to support the training and evaluation of next-generation multi-modal graph foundation models, which are expected to handle diverse data types and tasks.
7.3. Personal Insights & Critique
7.3.1. Inspirations and Applications
This paper is highly inspiring as it addresses a critical meta-problem in graph machine learning: the lack of standardized, comprehensive, and real-world-relevant benchmarks.
- Reproducibility and Fair Comparison: GraphBench directly tackles the reproducibility crisis in ML by standardizing datasets, splits, and metrics. This is invaluable for researchers to truly compare methods rather than just report numbers.
- Driving Real-World Impact: By focusing on domains like chip design, weather forecasting, and combinatorial optimization, GraphBench redirects GML research towards problems with tangible societal and industrial impact, rather than solely academic benchmarks. This could accelerate the adoption of GML in critical applications.
- Understanding OOD Generalization: The explicit inclusion of OOD generalization tests is a major strength. Many real-world systems operate in dynamic environments, and models must generalize beyond their training data. This will push GML research towards building more robust and reliable systems.
- Foundation Models: The vision for graph foundation models is exciting. GraphBench provides the diverse data and evaluation rigor needed to train and assess such large-scale, general-purpose models, potentially revolutionizing how GML is applied.
- Challenging Assumptions: The results, particularly in SAT solving where traditional ML with hand-crafted features outperforms GNNs, serve as a crucial reminder that GNNs are not a panacea. This encourages researchers to critically assess where GNNs truly add value versus where simpler, well-engineered solutions are still superior, or how to combine them effectively.
7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement
- Maintenance and Scalability of the Benchmark Itself: As a "living benchmark," GraphBench will require significant ongoing effort to maintain, update, and expand datasets, and to ensure compatibility with evolving GML libraries and hardware. The sheer diversity of tasks and domains could make this a massive undertaking.
- Computational Cost for Baselines: The out-of-memory issues and high computational costs reported for Graph Transformers on larger SAT instances highlight that even running baselines on such a comprehensive benchmark can be prohibitive. This could limit participation or make state-of-the-art evaluation difficult for many research groups without access to vast computing resources.
- Feature Engineering vs. End-to-End Learning: The SAT results pose a fundamental question about the balance between feature engineering and end-to-end learning. While GNNs aim to learn features automatically, the superiority of SATzilla suggests domain expertise in feature design is still crucial. Future GNN research on SAT might need to explicitly incorporate such features or design architectures that can learn them better.
- Representational Choices: The impact of different graph representations (e.g., VG vs. VCG vs. LCG in SAT) on GNN performance is evident. Further work is needed to systematically understand which graph representations are optimal for which tasks and how GNNs can learn to leverage or even generate better representations.
- Complexity of Metrics: While domain-specific metrics are valuable, they can also be complex (e.g., the Ad-hoc Score for AIGs, the Closed Gap for SAT). Ensuring that researchers across different domains fully understand and correctly use these metrics will be important.
- "Simpler" Baselines: The statement that current ML-based CO methods do not compete with exact solvers or hand-crafted heuristics in solution quality, but are often faster, is an important caveat. This points to GNNs potentially serving as efficient surrogates or heuristics, but not necessarily replacements for optimality in CO.
- Ethical Considerations: For social network data, even if anonymized, the "Right to Erasure" and continuous updates are mentioned. Ensuring robust mechanisms for privacy and data governance in dynamic real-world datasets is crucial for the long-term viability and ethical use of such benchmarks.

Overall, GraphBench is a highly ambitious and necessary contribution that lays foundational groundwork for the next era of graph machine learning. Its success will depend on community adoption, continuous maintenance, and a sustained effort to address the deep challenges it has so effectively exposed.