GraphBench: Next-generation graph learning benchmarking


TL;DR Summary

GraphBench is a comprehensive benchmarking suite for graph learning that addresses fragmented evaluation practices. It spans diverse domains and tasks and provides standardized evaluation protocols, dataset splits, and hyperparameter tuning, with baseline results for message-passing neural networks and graph transformers.

Abstract

Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

GraphBench: Next-generation graph learning benchmarking

1.2. Authors

The paper lists a large collaborative author group from various institutions, indicating a broad and diverse set of contributing expertise. Key authors include Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, and Christopher Morris. Their affiliations span universities (RWTH Aachen University, University of Oxford, Mila Quebec AI Institute, Technion - Israel Institute of Technology, University of Stuttgart) and research labs (ETAS Research, Meta, AITHYRA, Google Research, Microsoft Research), indicating strong academic and industrial backing.

1.3. Journal/Conference

The paper is listed with an "Original Source Link" to https://arxiv.org/abs/2512.04475 and a "PDF Link" to https://arxiv.org/pdf/2512.04475v1.pdf. This indicates it is a preprint published on arXiv, a popular open-access repository for research papers, particularly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While arXiv is not a peer-reviewed journal or conference, it is a widely used platform for quickly disseminating research findings and soliciting feedback before formal publication. Its influence in the AI/ML community is substantial for early access to cutting-edge research.

1.4. Publication Year

The publication date is 2025-12-04, corresponding to the first version of the preprint posted on arXiv in December 2025.

1.5. Abstract

The paper addresses the fragmented and inconsistent benchmarking practices in machine learning on graphs, which currently rely on narrow datasets and varied evaluation protocols, hindering reproducibility and progress. To overcome these issues, the authors introduce GraphBench, a comprehensive benchmarking suite. GraphBench covers diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. It provides standardized evaluation protocols with consistent dataset splits and performance metrics designed to assess out-of-distribution (OOD) generalization. Furthermore, it offers a unified hyperparameter tuning framework. The authors also benchmark GraphBench using common models like message-passing neural networks (MPNNs) and graph transformer models, establishing principled baselines and reference performance.

  • Original Source Link: https://arxiv.org/abs/2512.04475
  • PDF Link: https://arxiv.org/pdf/2512.04475v1.pdf

The paper is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The core problem GraphBench aims to solve is the fragmentation and inconsistency in benchmarking practices within the field of machine learning on graphs. This fragmentation manifests in several ways:

  1. Narrow, Task-Specific Datasets: Most existing benchmarks focus on limited domains (e.g., molecular graphs, citation networks), failing to represent the diversity and complexity of real-world applications.
  2. Inconsistent Evaluation Protocols: Different benchmarks use varied data splits, metrics, and hyperparameter tuning approaches, making it difficult to compare models meaningfully across studies.
  3. Lack of Reproducibility: Inconsistent protocols and data handling hinder the reproducibility of research findings.
  4. Limited Out-of-Distribution (OOD) Generalization Assessment: Current benchmarks often emphasize in-distribution accuracy, neglecting how models perform on unseen or structurally different data, which is crucial for real-world deployment.

2.1.2. Importance of the Problem

This problem is critical because machine learning on graphs, particularly with Graph Neural Networks (GNNs) and Graph Transformers (GTs), has achieved impressive progress in various high-impact domains like drug design, molecular property prediction, recommender systems, chip design, and combinatorial optimization. However, without a standardized and comprehensive benchmarking framework, several negative consequences arise:

  1. Hampered Progress: Researchers struggle to objectively compare new models against existing ones, slowing down innovation.
  2. Misguided Development: Models might be optimized for narrow, non-representative datasets, leading to solutions that perform poorly in real-world scenarios.
  3. Inefficient Resource Allocation: Duplicative efforts in data curation and protocol definition waste valuable research resources.
  4. Lack of Real-World Impact: Benchmarks that do not reflect real-world complexity fail to drive research towards impactful applications.

2.1.3. Paper's Entry Point / Innovative Idea

The paper's innovative idea is to introduce GraphBench, a "next-generation benchmarking suite" that directly addresses these shortcomings. GraphBench is designed to be comprehensive, standardized, and reflective of real-world challenges. Its entry point is to unify diverse graph learning tasks and domains under a single framework, providing consistent evaluation protocols, specialized metrics, and a focus on OOD generalization, thereby facilitating more reproducible, robust, and impactful research.

2.2. Main Contributions / Findings

2.2.1. Primary Contributions

The primary contributions of GraphBench are:

  1. Comprehensive Benchmarking Suite: It introduces a wide array of datasets spanning diverse domains (social networks, chip design, combinatorial optimization, SAT solving, algorithmic reasoning, weather forecasting) and supporting all major graph learning tasks (node-level, edge-level, graph-level prediction, and generative settings).
  2. Standardized Evaluation Protocols: GraphBench provides consistent dataset splits, uniform input features and labels, and domain-specific evaluation metrics, which are crucial for fair and reproducible comparisons. It explicitly includes tests for out-of-distribution (OOD) generalization by splitting data based on time or problem size.
  3. Unified Hyperparameter Tuning Framework: It integrates automated hyperparameter optimization using SMAC3 with multi-fidelity scheduling, enabling efficient model tuning and enhancing reproducibility.
  4. Principled Baselines: The suite includes benchmark results for modern Message-Passing Neural Networks (MPNNs) and Graph Transformers (GTs), providing strong reference points and insights into current models' capabilities and limitations across diverse tasks.
  5. Open-Source and User-Friendly Software: GraphBench is released as an open-source Python package (graphbench-lib) with streamlined data loaders (graphbench.Loader), evaluation tools (graphbench.Evaluator), and optimization utilities (graphbench.Optimizer), making it easy for researchers to adopt and extend.

2.2.2. Key Conclusions / Findings

The key conclusions and findings from the paper, based on their initial benchmarking efforts, are:

  1. Persistent Challenges: Current graph learning methods, including MPNNs and GTs, still face significant challenges in handling:
    • Temporal distribution shifts (e.g., in social networks).
    • Scaling to large graphs (e.g., in SAT solving, where MPNNs struggled with memory on medium/large instances).
    • Capturing complex domain-specific structures (e.g., functional equivalence in chip design, combinatorial constraints).
  2. Graph Structure is Informative: In social network tasks, models leveraging graph structure (MPNNs) consistently outperform connectivity-agnostic models (MLPs, DeepSets), highlighting the importance of relational information.
  3. OOD Generalization Limitations: While some algorithmic reasoning tasks show robustness to size generalization (e.g., MST, Bridges), others (e.g., Topological Sorting, Max Matching) demonstrate a decrease in performance with increasing graph size, indicating that current baselines are not universally size-invariant.
  4. Traditional ML Still Strong in Some Areas: For SAT solver performance prediction, traditional tabular machine learning models (Random Forest, XGBoost) using hand-crafted features (SATzilla) significantly outperform MPNNs, suggesting that extracting relevant features remains crucial and that graph models have not yet fully leveraged the permutation invariance of SAT instances.
  5. Unsolved Problems and Future Potential: Many tasks within GraphBench are far from solved, indicated by relatively low $R^2$ and Spearman correlation values in social networks, and the inability of existing graph generative models to meet strict constraints (like functional equivalence in chip design). This leaves ample room for meaningful gains and points towards the need for more advanced, task-specific graph learning models, potentially paving the way for graph foundation models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following fundamental concepts in machine learning and graph theory:

3.1.1. Graphs and Graph Theory Basics

  • Graph: A graph is a mathematical structure used to model pairwise relations between objects. It consists of a set of nodes (or vertices) and a set of edges (or links) that connect pairs of nodes.
  • Node (Vertex): Represents an entity or object in the graph (e.g., a person in a social network, a component in an electronic circuit, a city in a road network).
  • Edge (Link): Represents a relationship or connection between two nodes (e.g., friendship between people, electrical connection between components, road between cities). Edges can be:
    • Directed: The relationship goes only one way (e.g., following someone on social media).
    • Undirected: The relationship is mutual (e.g., friendship).
    • Weighted: Edges can have a numerical value indicating the strength, cost, or distance of the relationship.
  • Node Features/Attributes: Information associated with each node (e.g., user profile data, component type).
  • Edge Features/Attributes: Information associated with each edge (e.g., type of interaction, resistance of a wire).
  • Graph-Structured Data: Data where entities and their relationships are naturally represented as a graph.
  • Graph Isomorphism: Two graphs are isomorphic if they have the same structure, even if their nodes are labeled differently. This concept is crucial for GNNs that aim to be permutation-invariant (i.e., their output doesn't change if the node ordering changes).
  • Directed Acyclic Graph (DAG): A directed graph with no directed cycles. Essential in contexts like circuit design or task dependencies.

3.1.2. Machine Learning on Graphs

  • Graph Neural Networks (GNNs): A class of neural networks designed to operate directly on graph-structured data. They learn representations (embeddings) of nodes, edges, or entire graphs by iteratively aggregating information from a node's local neighborhood.
    • Message Passing Neural Networks (MPNNs): A common framework for GNNs where information (messages) is passed along edges, transformed at nodes, and aggregated from neighbors. This process is typically repeated for several layers, allowing nodes to gather information from increasingly distant parts of the graph.
      • Mechanism: In each message-passing step, a node receives messages from its neighbors, aggregates these messages (e.g., sum, mean, max), and then updates its own representation using its previous state and the aggregated message. This process enables GNNs to capture local graph structure (see the code sketch after this list).
  • Graph Transformers (GTs): Adapt the highly successful Transformer architecture (originally for sequential data like text) to graph-structured data. Unlike MPNNs that typically aggregate locally, GTs often use self-attention mechanisms to allow each node to attend to all other nodes in the graph (or a subset), potentially capturing long-range dependencies more effectively.
    • Self-Attention: A mechanism where each element in a sequence (or node in a graph) computes a weighted sum of all other elements, with the weights determined by a learned attention score. This allows the model to dynamically focus on relevant parts of the input.
  • Node-level Prediction Tasks: Predicting a property for each node in the graph (e.g., classifying users in a social network, predicting engagement for each user).
  • Edge-level Prediction Tasks (or Link Prediction): Predicting a property for an edge or the existence of a link between two nodes (e.g., predicting future interactions, predicting electrical connections).
  • Graph-level Prediction Tasks: Predicting a property for the entire graph (e.g., classifying a molecule, predicting the performance of an electronic circuit).
  • Generative Settings/Tasks: Generating new graphs that conform to certain properties or distributions (e.g., generating new molecules, generating electronic circuits).
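
To make the message-passing mechanism above concrete, here is a minimal, self-contained sketch (not code from the paper): one round of message passing implemented in PyTorch with a dense adjacency matrix, where message_fn plays the role of the message function and update_fn the update function.

import torch
import torch.nn as nn

class SimpleMessagePassingLayer(nn.Module):
    """One round of message passing: aggregate neighbor messages, then update."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.message_fn = nn.Linear(in_dim, out_dim)            # transforms neighbor states into messages
        self.update_fn = nn.Linear(in_dim + out_dim, out_dim)   # combines a node's state with the aggregate

    def forward(self, x, adj):
        # x: [num_nodes, in_dim] node features; adj: [num_nodes, num_nodes] 0/1 adjacency
        messages = self.message_fn(x)
        aggregated = adj @ messages                              # sum of messages from neighbors
        return torch.relu(self.update_fn(torch.cat([x, aggregated], dim=-1)))

# Toy usage: 4 nodes on a path graph.
x = torch.randn(4, 8)
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
layer = SimpleMessagePassingLayer(8, 16)
out = layer(x, adj)   # [4, 16] updated node embeddings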

3.1.3. Machine Learning General Concepts

  • Hyperparameter Tuning/Optimization (HPO): The process of finding the best set of hyperparameters (settings that control the learning process, like learning rate, number of layers, dropout rate) for a machine learning model to achieve optimal performance on a given task.
  • Out-of-Distribution (OOD) Generalization: A model's ability to perform well on data that comes from a different distribution than the training data. This is crucial for real-world robustness, as deployment data often differs from historical training data (e.g., generalizing to larger graphs, future time steps, or different problem sizes).
  • Baselines: Simple or well-established models used as a reference point to compare the performance of a new or more complex model.
  • Metrics: Quantitative measures used to evaluate the performance of a model (e.g., accuracy, Mean Absolute Error (MAE), $R^2$, F1-score).
  • Reproducibility: The ability of an independent researcher to replicate the results of a study using the same methods, data, and code.

3.1.4. Software Libraries

  • PyTorch Geometric (PyG): A popular Python library built on PyTorch for deep learning on graphs. It provides efficient data structures for graphs, implementations of various GNN layers, and utilities for common graph learning tasks.
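
As a small illustration of the kind of pipeline PyG enables (a generic sketch, not GraphBench-specific code), a graph is stored as a Data object holding node features and an edge_index, and can be passed through a built-in layer such as GCNConv:

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# A 3-node graph with two undirected edges, stored as directed pairs in edge_index.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 4)                                # 3 nodes, 4 features each
graph = Data(x=x, edge_index=edge_index)

conv = GCNConv(in_channels=4, out_channels=8)
node_embeddings = conv(graph.x, graph.edge_index)    # [3, 8] node embeddings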

3.2. Previous Works

The paper extensively discusses shortcomings of existing graph learning benchmarks, which serves as a foundation for understanding GraphBench's necessity.

3.2.1. Limitations of Existing Benchmarks

  • TUDatasets (Morris et al., 2020):
    • Summary: Introduced many graph-level tasks.
    • Limitations: Most datasets are small, molecular, and lack unified metrics, hindering comparability across tasks.
  • Open Graph Benchmark (OGB) (Hu et al., 2020a) and extensions (Hu et al., 2021):
    • Summary: Provides large-scale datasets.
    • Limitations: Focus heavily on molecules, require substantial computational resources, and lack clear broader relevance beyond specific chemical domains.
  • Dwivedi et al. (2020) and Long-range Graph Benchmark (LRGB) (Dwivedi et al., 2022a):
    • Summary: Emphasize certain prediction regimes.
    • Limitations: Narrow prediction regimes, small-scale data, restricted model sizes, and limited evaluation of out-of-distribution (OOD) generalization.
  • Benchmarks for Graph Generation:
    • Limitations: Remain underdeveloped, often relying on 2D molecular datasets with questionable practical value, synthetic data with limited applicability, and fail to address challenges like generating large or structurally constrained graphs.
  • GRAPHLAND (Bazhenov et al., 2025):
    • Summary: Sought to expand node-level tasks with industrial data.
    • Limitations: Scalability and coverage gaps remain.
  • Algorithmic Reasoning Benchmarks like CLRS (Velickovic et al., 2022):
    • Summary: Covers a variety of algorithmic tasks, often providing "hints" to guide the solution.
    • Limitations (from GraphBench's perspective): Does not include regression tasks, restricts learning tasks to specific algorithms, and focuses on hints which GraphBench aims to move beyond for broader OOD generalization.

3.2.2. Foundational Models and Algorithms in Graph ML

The paper implicitly builds upon several core GNN and GT architectures that serve as baselines. While not detailed in the paper's main text regarding their internal formulas, understanding their general mechanism is key.

  • Message-Passing Neural Networks (MPNNs):

    • General Principle (Gilmer et al., 2017; Scarselli et al., 2009): MPNNs define a general framework for GNNs by alternating between a message function MtM_t and an update function UtU_t.
    • Message Computation: For each node vv, messages are computed from its neighbors uN(v)u \in N(v): mv(t+1)=uN(v)Mt(hv(t),hu(t),evu)m_v^{(t+1)} = \sum_{u \in N(v)} M_t(h_v^{(t)}, h_u^{(t)}, e_{vu}).
    • Node Update: Node embeddings are updated based on aggregated messages: hv(t+1)=Ut(hv(t),mv(t+1))h_v^{(t+1)} = U_t(h_v^{(t)}, m_v^{(t+1)}).
    • Graph Isomorphism Network (GIN) (Xu et al., 2019): A powerful MPNN variant designed to be as discriminative as the Weisfeiler-Leman (WL) test for graph isomorphism. The GIN update rule for node vv in layer kk: hv(k)=MLP(k)((1+ϵ(k))hv(k1)+uN(v)hu(k1))h_v^{(k)} = \mathrm{MLP}^{(k)} \left( (1 + \epsilon^{(k)}) h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)} \right) where MLP is a multi-layer perceptron and ϵ\epsilon is a learnable parameter. This emphasizes GraphBench's choice of GIN as a strong MPNN baseline.
  • Graph Transformers (GTs) (Müller et al., 2024a):

    • General Principle: Adapt the self-attention mechanism from standard Transformers to graphs. Unlike MPNNs which are inherently local, GTs can model global dependencies directly.
    • Attention Mechanism (Vaswani et al., 2017 - though not cited, this is the foundational work): The core of Transformers is the scaled dot-product attention. Given query ($Q$), key ($K$), and value ($V$) matrices derived from input embeddings, the attention mechanism calculates: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $d_k$ is the dimension of the key vectors, used for scaling to prevent vanishing gradients. Graph Transformers adapt this by deriving Q, K, V from node features and potentially incorporating graph structure through attention biases or positional encodings.
  • DeepSets (Zaheer et al., 2017): A simple but powerful neural architecture for permutation-invariant functions (functions whose output does not depend on the order of input elements). A typical DeepSet architecture takes a set of input features $\{x_1, \dots, x_N\}$ and processes them as: $ \rho\left(\sum_{i=1}^N \phi(x_i)\right) $ where $\phi$ and $\rho$ are neural networks (typically MLPs). This is used as a baseline to evaluate the utility of graph connectivity information, as DeepSets explicitly ignore edge structure (see the sketch after this list).

  • Multilayer Perceptron (MLP): A fundamental feedforward neural network consisting of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node (except for input nodes) is a neuron that uses a nonlinear activation function. Used as a baseline for comparison, as MLPs disregard both graph structure and permutation invariance.
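
The following is a minimal PyTorch sketch of the DeepSets form $\rho(\sum_i \phi(x_i))$ referenced above (an illustrative implementation, not the paper's baseline code); the final assertion checks permutation invariance on a toy input.

import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Permutation-invariant set model: rho(sum_i phi(x_i))."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        # x: [set_size, in_dim]; summing over the set makes the output order-invariant
        return self.rho(self.phi(x).sum(dim=0))

x = torch.randn(5, 8)                                   # a set of 5 elements
model = DeepSets(8, 32, 1)
assert torch.allclose(model(x), model(x[torch.randperm(5)]), atol=1e-5)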

3.3. Technological Evolution

The field of machine learning on graphs has evolved significantly:

  1. Early Graph Methods (Pre-2010s): Focused on graph kernels and feature engineering. Graph kernels computed similarity between graphs by counting substructures, but struggled with scalability. Feature engineering involved manually extracting graph-theoretic features (e.g., node degrees, clustering coefficients) for use in traditional ML models.

  2. Rise of Deep Learning (Early-Mid 2010s): With the success of deep learning in computer vision and natural language processing, researchers sought to adapt neural networks to graph data. This led to early attempts at spectral GNNs (using graph Laplacians) and spatial GNNs (directly operating on the graph structure).

  3. Standardization with Message Passing (Mid-Late 2010s): The Message Passing Neural Network (MPNN) framework (Gilmer et al., 2017) provided a unifying view, abstracting many GNN variants into a common paradigm of message computation and aggregation. This spurred rapid development of models like Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs).

  4. Scaling and Out-of-Distribution (OOD) Challenges (Late 2010s-Early 2020s): As GNNs became more powerful, challenges like scalability to large graphs, over-smoothing, and particularly OOD generalization (how models perform on graphs larger or structurally different from training data) came to the forefront. Benchmarks like OGB emerged to address large-scale data, but often remained domain-specific.

  5. Graph Transformers and Foundation Models (Early 2020s-Present): The success of Transformers led to their adaptation for graphs, with Graph Transformers (GTs) aiming to capture long-range dependencies more effectively. The concept of graph foundation models – large, pre-trained models that can be fine-tuned for various downstream tasks – is an emerging area, requiring robust and diverse benchmarks for development and evaluation.

    GraphBench positions itself at the current frontier, aiming to provide the necessary infrastructure to address OOD generalization, diversify beyond narrow domains, and standardize evaluation to facilitate the development of these next-generation graph learning models, including graph foundation models.

3.4. Differentiation Analysis

Compared to the main methods and benchmarks in related work, GraphBench offers several core differences and innovations:

  1. Domain and Task Diversity:

    • Differentiation: Unlike TUDatasets (mostly small molecular graphs) or OGB (heavily molecular), GraphBench explicitly spans diverse domains like social networks, chip design, combinatorial optimization, SAT solving, algorithmic reasoning, and earth systems (weather forecasting). It also supports all key task types: node-level, edge-level, graph-level, and generative settings.
    • Innovation: This broad coverage aims to prevent domain-specific overfitting and drive GNN research towards more generalizable solutions for a wider range of real-world problems.
  2. Focus on Real-World Impact and Complexity:

    • Differentiation: Many prior benchmarks use small-scale or simplified graphs (e.g., citation networks, basic molecular graphs) that have limited applicability to industrial or complex real-world challenges.
    • Innovation: GraphBench emphasizes datasets that are representative of modern, high-impact applications, such as large-scale chip design, real-world social network interactions with temporal dynamics, and complex combinatorial optimization problems. It moves beyond "toy" problems to address "meaningful" ones.
  3. Standardized Protocols and OOD Generalization:

    • Differentiation: Previous benchmarks often have inconsistent evaluation protocols, metrics, and lack explicit mechanisms for OOD generalization.
    • Innovation: GraphBench provides standardized splits (e.g., temporal splits for social networks, cross-size splits for electronic circuits) and domain-specific metrics that go beyond simple accuracy to truly assess model performance. Crucially, it explicitly evaluates OOD generalization (e.g., on different graph sizes, future time periods), which is vital for developing robust and deployable models.
  4. Unified Framework and Tooling:

    • Differentiation: Researchers often have to manage different data loaders, evaluation scripts, and tuning procedures for each benchmark.
    • Innovation: GraphBench offers a unified Python package (graphbench-lib) with a single Loader class, Optimizer for hyperparameter tuning, and Evaluator for standardized metric calculation. This significantly reduces the overhead for researchers, promotes reproducibility, and allows for rapid prototyping and benchmarking.
  5. Principled Baselines and Identified Challenges:

    • Differentiation: While other benchmarks provide baselines, GraphBench's extensive evaluation across its diverse suite highlights specific persistent challenges like temporal distribution shifts, scalability to very large graphs, and the difficulty of conditional graph generation under strict constraints (like functional equivalence).

    • Innovation: By systematically exposing these challenges, GraphBench guides future research directions more effectively, especially towards graph foundation models that can overcome these hurdles.

      In essence, GraphBench differentiates itself by offering a holistic, rigorously standardized, and practically relevant benchmarking ecosystem that pushes the boundaries of graph machine learning beyond localized improvements towards broadly applicable and robust solutions.

4. Methodology

The GraphBench paper introduces a comprehensive benchmarking suite, focusing on the design principles, framework, and standardized evaluation protocols rather than a single novel algorithm. The core methodology revolves around providing a unified environment for diverse graph learning tasks.

4.1. Principles

GraphBench is designed around five core principles to address the shortcomings of existing benchmarks:

  1. Diverse Tasks and Domains:

    • Core Idea: To ensure GraphBench is broadly applicable and representative of real-world scenarios, it supports node-level, edge-level, graph-level prediction, and generative tasks.
    • Theoretical Basis/Intuition: Graph problems exist across various domains (e.g., social sciences, hardware design, combinatorial optimization, earth systems). A truly comprehensive benchmark must cover these diverse applications to foster generalizable models, rather than models specialized for a narrow domain like molecular graphs.
  2. Real-world Impact:

    • Core Idea: Focus on datasets and tasks that are highly relevant to current real-world challenges and industrial needs.
    • Theoretical Basis/Intuition: Many existing benchmarks feature small-scale graphs or domains with limited practical utility. By including tasks like large-scale chip design, weather forecasting, and combinatorial optimization, GraphBench aims to align research efforts with impactful applications, moving beyond purely academic or simplified scenarios.
  3. OOD Generalization:

    • Core Idea: Explicitly assess a model's ability to generalize beyond its training distribution.
    • Theoretical Basis/Intuition: Models often perform well on in-distribution data but fail when encountering new, structurally different, or temporally shifted data (out-of-distribution). GraphBench incorporates datasets with specific splits (e.g., temporal splits for social networks, cross-size splits for electronic circuits) to rigorously test this crucial aspect, which is essential for deployment in dynamic real-world environments.
  4. Meaningful Evaluation:

    • Core Idea: Provide standardized evaluation splits, domain-specific metrics, and state-of-the-art hyperparameter tuning scripts.
    • Theoretical Basis/Intuition: Reproducibility and fair comparison require consistent evaluation. This includes fixed training/validation/test splits, meaningful metrics tailored to the problem (not just generic accuracy), and robust hyperparameter optimization to ensure models are evaluated at their best performance. The inclusion of OOD generalization tests further enhances the meaningfulness of evaluation.
  5. Task-relevant Evaluation Metrics:

    • Core Idea: Use metrics that accurately capture the practical utility and real-world performance of models for each specific task.
    • Theoretical Basis/Intuition: Standard metrics like Mean Squared Error (MSE) or accuracy may not fully reflect a model's practical value. For instance, in social networks, Spearman correlation might be more relevant for ranking influence than MAE. GraphBench provides tailored metrics to ensure evaluations truly reflect task objectives.

4.2. Core Methodology In-depth (Layer by Layer)

The GraphBench methodology provides a unified software framework and a general encoder-processor-decoder architecture for evaluating graph learning models.

4.2.1. Unified Software Framework (graphbench-lib)

The core of GraphBench is its open-source Python package graphbench-lib, designed to streamline the entire benchmarking process.

4.2.1.1. Dataset Loading (graphbench.Loader)

  • Purpose: To provide a consistent and easy way to load all datasets within the GraphBench suite, regardless of their domain or task type.
  • Procedure:
    • The graphbench.Loader class acts as a wrapper for dataset loading.
    • It handles necessary precomputations and comes with predefined splits (train, validation, test).
    • Datasets are fully compatible with PyTorch Geometric's InMemoryDataset interface, ensuring seamless integration with common graph deep learning pipelines.
    • Customization options are available for each dataset.
  • Example Integration (from Listing 1):
    import graphbench
    
    # 1. Initialize Loader
    # dataset_name: Name of the task or a list of tasks (e.g., 'BlueSky-quotes')
    # pre_filter: PyTorch Geometric filter matrix for filtering data before loading
    # pre_transform: PyTorch Geometric-like transform applied during data loading
    # transform: PyTorch Geometric-like transform applied at computation time (e.g., batching)
    Loader = graphbench.Loader(dataset_name, pre_filter, pre_transform, transform)
    
    # 2. Load a GraphBench dataset and get splits
    dataset = Loader.load()
    
    • dataset_name: A string identifying the specific dataset or task (e.g., "BlueSky-quotes", "AIG").
    • pre_filter: A function or object from PyTorch Geometric that filters the dataset instances before they are loaded into memory. This is useful for reducing memory footprint by only loading relevant data.
    • pre_transform: A function or object from PyTorch Geometric that applies transformations to the data once when the dataset is loaded. Examples include adding positional encodings or normalizing features.
    • transform: A function or object from PyTorch Geometric that applies transformations to data each time a sample is accessed during training or inference. This is useful for data augmentation or on-the-fly feature generation.
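
A concrete call might look as follows; this is a hypothetical sketch that mirrors the positional arguments shown in Listing 1, using the 'BlueSky-quotes' task name mentioned above and an off-the-shelf PyTorch Geometric transform (AddRandomWalkPE) as an illustrative pre_transform rather than anything GraphBench prescribes:

import graphbench
import torch_geometric.transforms as T

loader = graphbench.Loader(
    "BlueSky-quotes",                   # dataset_name: the task to load
    None,                               # pre_filter: keep all instances
    T.AddRandomWalkPE(walk_length=16),  # pre_transform: applied once while loading
    None,                               # transform: applied each time a sample is accessed
)
dataset = loader.load()                 # dataset with predefined train/validation/test splits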

4.2.1.2. Hyperparameter Optimization (graphbench.Optimizer)

  • Purpose: To automate the process of finding optimal hyperparameters for a given model and task, improving performance and reproducibility.
  • Procedure:
    • Uses SMAC3 (Lindauer et al., 2022) as its backend, which is a powerful Bayesian optimization framework.
    • Employs multi-fidelity scheduling: This technique evaluates many hyperparameter configurations on lower budgets (e.g., fewer training epochs or smaller subsets of data) and prunes (eliminates) poorly performing configurations early. This saves significant computational resources, especially for graph learning models which are expensive to train.
    • Utilizes a surrogate model: SMAC3 builds a model of the objective function (e.g., validation loss) based on previous evaluations. This surrogate model then proposes new, promising configurations more efficiently than random or grid search.
  • Example Integration (from Listing 1):
    # 1. Initialize Optimizer
    # optimization_args: Dictionary of arguments for the SMAC3 optimizer (e.g., budget, search space)
    # training_method: A callable function that trains and evaluates a model for a given set of hyperparameters
    Optimizer = graphbench.Optimizer(optimization_args, training_method)
    
    # 2. Optimize your model's hyperparameters
    opt_model = Optimizer.optimize()
    
    • optimization_args: A dictionary containing parameters for the hyperparameter optimization process, such as the total budget of evaluations, the fidelity levels (e.g., range of training steps), and the hyperparameter search space.
    • training_method: A function or method that takes a set of hyperparameters as input, trains a model with them, and returns its performance (e.g., validation loss). The Optimizer calls this method repeatedly with different hyperparameter combinations.
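
The sketch below illustrates what such a training_method could look like; the exact signature GraphBench forwards to SMAC3 is an assumption here (SMAC3 target functions conventionally receive a configuration, a seed, and optionally a budget for multi-fidelity runs, and return a cost to minimize). The toy model and data are placeholders for the user's own pipeline.

import torch
import torch.nn as nn

def training_method(config, seed: int = 0, budget: float = 10.0) -> float:
    """Hypothetical target function: train a toy model for `budget` epochs and
    return the validation loss (lower is better)."""
    torch.manual_seed(seed)
    model = nn.Linear(8, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    x, y = torch.randn(64, 8), torch.randn(64, 1)    # stand-in data
    loss = torch.tensor(0.0)
    for _ in range(int(budget)):                      # budget interpreted as number of epochs
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return float(loss)                                # cost that SMAC3 minimizes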

4.2.1.3. Evaluation (graphbench.Evaluator)

  • Purpose: To standardize the calculation and reporting of performance metrics for models, ensuring consistency across benchmarks.
  • Procedure:
    • Loads the corresponding domain-specific metrics for each selected dataset.
    • Provides evaluation scripts that implement these metrics (e.g., RMSE, ROC-AUC, Closed Gap).
  • Example Integration (from Listing 1):
    # 1. Initialize Evaluator
    Evaluator = graphbench.Evaluator(dataset_name)
    
    # 2. Use GraphBench evaluator with true labels and model predictions
    # y_true: Ground-truth labels from the dataset
    # y_pred: Predictions generated by the trained model
    results = Evaluator.evaluate(y_true, y_pred)
    
    • y_true: The actual, observed target values (ground truth) for the evaluation set.

    • y_pred: The values predicted by the trained graph learning model.

    • Evaluator.evaluate() computes and returns the relevant metrics (e.g., MAE, $R^2$, Spearman correlation for social networks) based on these inputs.

      The overall workflow shown in Listing 1 is:

import graphbench

model = ...              # your PyTorch model
dataset_name = ...       # name of the task, or a list of tasks
pre_filter = ...         # PyTorch Geometric filter applied before loading
pre_transform = ...      # PyTorch Geometric-like transform applied during loading
transform = ...          # PyTorch Geometric-like transform applied at computation time
optimization_args = ...  # SMAC3 budget, fidelities, and hyperparameter search space
training_method = ...    # callable that trains/evaluates the model for a given configuration

# Setting up the components of graphbench
Evaluator = graphbench.Evaluator(dataset_name)
Optimizer = graphbench.Optimizer(optimization_args, training_method)
Loader = graphbench.Loader(dataset_name, pre_filter, pre_transform, transform)

# Load a GraphBench dataset and get its predefined splits
dataset = Loader.load()

# Optimize the model's hyperparameters (training loops run inside training_method);
# opt_model is the best model found by the HPO run
opt_model = Optimizer.optimize()

# Perform inference with the optimized model and evaluate,
# assuming test targets are stored in .y and the model exposes a predict method:
# y_pred = opt_model.predict(dataset.test_data)
# y_true = dataset.test_data.y
# results = Evaluator.evaluate(y_true, y_pred)

This pseudocode highlights how researchers can seamlessly integrate their PyTorch models into the GraphBench framework, leveraging its standardized data loading, hyperparameter optimization, and evaluation capabilities.

4.2.2. General Encoder-Processor-Decoder Architecture

GraphBench proposes a general encoder-processor-decoder pipeline for evaluating models across tasks. This modular design allows different GNN or GT architectures to be "plugged in" as the processor.

4.2.2.1. Encoder

  • Purpose: To convert raw node and edge features into dense, common-dimension embeddings suitable for the processor.
  • Procedure:
    • For each task, a task-specific encoder is used to derive encodings for node-level and edge-level features into a common dimension $d$.
    • These embeddings are then passed to the processor.
    • Handling Missing Features: If node or edge features are missing for a given task, a learnable vector of dimension $d$ is used as a placeholder.
    • Tokenization for GTs: For Graph Transformer (GT) baselines that expect tokenized input:
      • For node-level tasks, each graph node is treated as a single token.
      • For edge-level tasks, a specific transformation is applied to convert edges into tokens (as detailed in Algorithmic Reasoning tasks).
      • A [CLS] token (similar to BERT for sequence tasks) is added for graph-level representations for GTs, following Bechler-Speicher et al. (2025). This token's final embedding can then be used to represent the entire graph.
    • Positional Encodings (PEs): Absolute Positional Encodings (PEs) like Random Walk Structural Embeddings (RWSE) (Dwivedi et al., 2022b) and Learnable Positional Encodings (LPE) (Müller & Morris, 2024) are added to the node embeddings before the first Graph Transformer layer. These PEs provide information about a node's position or structural role in the graph, which is crucial for GNNs and GTs.
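
A minimal sketch of these encoder ideas follows (illustrative only, not the paper's implementation): raw node features are projected to a common dimension d, a learnable placeholder embedding stands in when features are missing, and a [CLS] token can be prepended for graph-level transformer readout.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Project raw node features to dimension d; use a learnable placeholder when
    features are absent; optionally prepend a [CLS] token for graph-level readout."""
    def __init__(self, node_feat_dim, d):
        super().__init__()
        self.node_proj = nn.Linear(node_feat_dim, d)
        self.missing_node = nn.Parameter(torch.zeros(d))    # placeholder for missing features
        self.cls_token = nn.Parameter(torch.zeros(1, d))    # graph-level token

    def forward(self, x=None, num_nodes=None, add_cls=False):
        if x is not None:
            h = self.node_proj(x)
        else:
            h = self.missing_node.expand(num_nodes, -1)      # no node features available
        if add_cls:
            h = torch.cat([self.cls_token, h], dim=0)        # token 0 represents the whole graph
        return h

encoder = Encoder(node_feat_dim=16, d=64)
tokens = encoder(torch.randn(10, 16), add_cls=True)          # [11, 64]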

4.2.2.2. Processor

  • Purpose: To perform the core message passing or attention operations to learn contextualized node and graph representations. This is where the actual graph learning model resides.
  • Architecture: The processor is typically composed of multiple layers (e.g., GINE layers or Graph Transformer layers) stacked sequentially.
  • Update Rule (General Form): For a graph $G$ with node representations $\boldsymbol{X}$ at layer $t$, the update rule is (a code sketch of this residual block follows this list): $ \boldsymbol{X} = \phi(\mathsf{LayerNorm}(\boldsymbol{X}), G) $, followed by $ \boldsymbol{X} = \boldsymbol{X} + \mathsf{FNN}(\mathsf{LayerNorm}(\boldsymbol{X})) $
    • Where:
      • $\phi$: The selected baseline model (e.g., GINE layer, Graph Transformer layer). It takes the layer-normalized input node representations and the graph structure $G$ to produce new representations.
      • $\mathsf{LayerNorm}(\boldsymbol{X})$: Layer Normalization is applied to the node representations before passing them to $\phi$ and $\mathsf{FNN}$. This helps stabilize training by normalizing inputs across features for each sample.
      • $\mathsf{FNN}(\cdot)$: A two-layer feed-forward neural network (an MLP) applied to the layer-normalized output of $\phi$.
      • $\boldsymbol{X} = \boldsymbol{X} + \dots$: A residual connection is used, adding the output of the FNN back to the node representations. This helps mitigate vanishing gradients and allows for deeper models.
  • Feed-Forward Network (FNN) Details:
    • The FNN layer is defined as: $ \mathsf{FNN}(x) := \sigma(\mathsf{LayerNorm}(x)W_1)W_2 $
      • $\sigma$: The GeLU (Gaussian Error Linear Unit) nonlinearity (Hendrycks & Gimpel, 2016). GeLU is a common activation function that smoothly and non-linearly transforms the input.
      • $W_1, W_2$: Weight matrices for the two linear layers within the FNN.
  • Graph Transformer (GT) Specifics:
    • Attention Mechanism: The Graph Transformer layer ϕ\phi computes multi-head scaled-dot-product attention.
    • Attention Formula: Given query ($Q$), key ($K$), and value ($V$) matrices of shape $\mathbb{R}^{L \times d}$ (where $L$ is the number of tokens/nodes and $d$ is the embedding dimension), and an attention bias $\boldsymbol{B} \in \mathbb{R}^{L \times L}$: $ \mathsf{Attention}(Q, K, V, B) := \mathsf{softmax}\Big( d^{-\frac{1}{2}} \cdot Q K^T + B \Big)V $
      • $d^{-\frac{1}{2}}$: Scaling factor based on the dimension of the key vectors, preventing dot products from growing too large and pushing softmax outputs to extremes.
      • $QK^T$: The dot product between queries and keys, forming raw attention scores.
      • $\boldsymbol{B}$: An attention bias matrix that injects graph structure information into the attention mechanism. This can include information about connectivity, shortest paths, or other structural relationships.
      • $\mathsf{softmax}(\cdot)$: Normalizes the attention scores so they sum to 1, effectively producing attention weights.
      • $\cdot\, V$: The attention weights are applied to the value vectors to produce the final contextualized representations.
    • GT Layer Output: The output of a GT layer from the paper is: $ \boldsymbol{X}^{(t+1)} := \mathsf{MLP}\Big( \mathsf{Attention}(\boldsymbol{X}^{(t)}\boldsymbol{W}_Q, \boldsymbol{X}^{(t)}\boldsymbol{W}_K, \boldsymbol{X}^{(t)}\boldsymbol{W}_V, \boldsymbol{B}) \Big) $
      • $\boldsymbol{X}^{(t)}$: Node representations at step $t$.
      • $\boldsymbol{W}_Q, \boldsymbol{W}_K, \boldsymbol{W}_V \in \mathbb{R}^{d \times d}$: Learnable weight matrices that project the node representations into queries, keys, and values.
      • $\mathsf{MLP}$: A two-layer MLP applied after the attention mechanism.
  • GIN Architecture Specifics:
    • GINE Layer (Generalized GIN with Edge Features): The GINE layer updates node representations $\boldsymbol{h}_v^{(t)}$ at iteration $t$ as follows: $ \boldsymbol{h}_v^{(t+1)} = \mathsf{FNN}\Big( (1 + \epsilon)\, \boldsymbol{h}_v^{(t)} + \sum_{u \in N(v)} \mathsf{ReLU}(\boldsymbol{h}_u^{(t)} + \boldsymbol{e}_{uv}) \Big) $
      • $\mathsf{FNN}$: A neural network, typically an MLP.
      • $\epsilon$: A learnable parameter (or fixed to 0).
      • $\boldsymbol{h}_v^{(t)}$: Representation of node $v$ at iteration $t$.
      • $N(v)$: Set of neighbors of node $v$.
      • $\boldsymbol{h}_u^{(t)}$: Representation of neighbor node $u$.
      • $\boldsymbol{e}_{uv}$: Feature vector of the edge connecting $u$ and $v$.
      • $\mathsf{ReLU}(\cdot)$: Rectified Linear Unit activation function.
      • The sum aggregates information from neighbors. The $(1 + \epsilon)\boldsymbol{h}_v^{(t)}$ term allows the model to retain information from the node's previous state, making it a powerful aggregation scheme.
    • GINE Processor Integration: Similar to GTs, GINE layers replace $\phi$. Layer Normalization is applied before and residual connections are used after the GINE layer. Mean pooling is typically used for graph-level tasks at the end of the processor step.
  • Social Networks Specific MPNN (GraphConv with mean aggregation):
    • Due to the high number of nodes and varying degrees in BlueSky datasets, a different MPNN variant was chosen: GraphConv with mean aggregation.
    • Update Rule: Node representations $\boldsymbol{h}_{\nu}^{(t)}$ are updated as: $ \boldsymbol{h}_{\nu}^{(t+1)} = \sigma\Big( W_1^{(t)} \boldsymbol{h}_{\nu}^{(t)} + \frac{1}{\deg_{\mathrm{in}}(\nu)} \sum_{u: u \to \nu} W_2^{(t)} \boldsymbol{h}_u^{(t)} \Big) $
      • $\sigma$: ReLU activation function.
      • Dropout is applied before ReLU.
      • No normalization layers are interleaved.
      • $W_1^{(t)}, W_2^{(t)}$: Weight matrices for the linear transformations.
      • $\deg_{\mathrm{in}}(\nu)$: In-degree of node $\nu$.
      • The sum performs mean aggregation over incoming neighbors.
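
The sketch below illustrates the pre-norm residual block described by the general update rule above (X ← φ(LayerNorm(X), G), then X ← X + FNN(LayerNorm(X))); it is an illustrative re-implementation, not the paper's code, and the stand-in φ here is a simple neighbor mean over a dense adjacency matrix where a GINE or graph-transformer layer would normally be plugged in.

import torch
import torch.nn as nn

class ProcessorBlock(nn.Module):
    """Pre-norm residual block: X <- phi(LayerNorm(X), G); X <- X + FNN(LayerNorm(X))."""
    def __init__(self, phi, d):
        super().__init__()
        self.phi = phi                                   # pluggable graph layer
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.fnn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x, graph):
        x = self.phi(self.norm1(x), graph)               # message passing / attention step
        return x + self.fnn(self.norm2(x))               # residual feed-forward update

# Stand-in phi: mean over incoming neighbors given a dense adjacency matrix.
phi = lambda x, adj: adj @ x / adj.sum(dim=-1, keepdim=True).clamp(min=1)
block = ProcessorBlock(phi, d=32)
x, adj = torch.randn(6, 32), torch.ones(6, 6)
out = block(x, adj)                                       # [6, 32] updated node representations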

4.2.2.3. Decoder

  • Purpose: To transform the learned node or graph representations from the processor into task-specific predictions (e.g., class probabilities, regression values).
  • Architecture: A two-layer Multilayer Perceptron (MLP).
  • Procedure:
    • Consists of two linear layers ($W_1 \in \mathbb{R}^{d \times d}$ and $W_2 \in \mathbb{R}^{d \times o}$).
    • Applies GeLU nonlinearity.
    • Optionally, a bias term can be added.
  • Formula: $ W_2(\mathsf{LayerNorm}(\mathsf{GELU}(W_1 x))) $
    • $x$: Input representation (e.g., node embedding for node-level tasks, [CLS] token embedding or pooled graph embedding for graph-level tasks).
    • $d$: Dimension of the input representation and first hidden layer.
    • $o$: Output dimension corresponding to the task-specific target (e.g., number of classes for classification, 1 for regression).
  • Task-Specific Adaptation:
    • For node-level or edge-level classification, the output $o$ would be the number of classes, followed by a softmax activation.
    • For node-level or edge-level regression, $o$ would be 1.
    • For graph-level tasks, the pooled graph representation would be fed to the decoder.
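
A minimal sketch of this two-layer decoder, $W_2(\mathsf{LayerNorm}(\mathsf{GELU}(W_1 x)))$, is shown below (illustrative only, not the paper's implementation):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two-layer MLP decoder: W2(LayerNorm(GELU(W1 x)))."""
    def __init__(self, d, out_dim, bias=True):
        super().__init__()
        self.lin1 = nn.Linear(d, d, bias=bias)
        self.norm = nn.LayerNorm(d)
        self.lin2 = nn.Linear(d, out_dim, bias=bias)

    def forward(self, x):
        return self.lin2(self.norm(nn.functional.gelu(self.lin1(x))))

decoder = Decoder(d=64, out_dim=1)              # e.g. a single regression target
pred = decoder(torch.randn(10, 64))             # [10, 1] predictions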

4.2.3. Pretraining and Evaluation Protocol

  • Pretraining Support: GraphBench supports evaluation of pretrained models, which is crucial for graph foundation models.
  • Evaluation Protocol:
    • All baseline models are evaluated across three random seeds per task.

    • Mean performance and standard deviations are reported to ensure robustness.

    • For out-of-distribution generalization tasks, pretrained models from these baseline evaluations are used.

    • In addition to performance metrics, GraphBench reports operational metrics like memory usage, training wall-clock time, and inference latency to provide insights into real-world applicability.

      This detailed methodology ensures that GraphBench provides a robust, flexible, and reproducible framework for advancing graph learning research.

5. Experimental Setup

The experimental setup within GraphBench is characterized by a diverse set of datasets, carefully chosen evaluation metrics, and comparison against representative baselines, all within a standardized framework.

5.1. Datasets

GraphBench includes datasets from a diverse set of domains and task areas: Social Networks, Chip Design, Electronic Circuits, SAT Solving, Combinatorial Optimization, Algorithmic Reasoning, and Earth Systems (Weather Forecasting).

5.1.1. Social Networks: BlueSky Engagements

  • Source: Based on the publicly available BlueSky dataset curated by Failla & Rossetti (2024).

  • Scale: Large-scale, with single graphs containing hundreds of thousands of nodes and millions of edges. The following are the results from Table 2 of the original paper:

    | Category | Name | #Graphs | Avg. #Nodes | Avg. #Edges | Avg. Deg. |
    |---|---|---|---|---|---|
    | Social networks | BlueSky quotes | 1 | 289 136 / 286 425 / 311 272 | 3 075 967 / 3 815 996 / 4 525 872 | 10.64 / 13.32 / 14.54 |
    | Social networks | BlueSky replies | 1 | 470 816 / 464 867 / 569 424 | 9 287 565 / 10 769 386 / 12 318 196 | 19.73 / 23.17 / 21.63 |
    | Social networks | BlueSky reposts | 1 | 498 012 / 508 818 / 580 112 | 12 049 951 / 14 658 114 / 17 261 865 | 24.20 / 28.81 / 29.76 |
  • Characteristics: Represents human interactions and societal processes. Nodes are users, edges are directed interactions (quotes, replies, reposts). Features are language model embeddings of user posts. Targets are median future engagements (likes, replies, reposts).

  • Domain: Social sciences.

  • Why Chosen: Addresses limitations of older social network benchmarks (static, categorical targets, simplified topologies). Provides real-world-like dynamic and directed relational structures, crucial for temporal forecasting and OOD generalization.

5.1.2. Chip Design: AIG Generation

  • Source: Synthetically generated truth tables (1.2 million) compiled into optimized And-Inverter Graphs (AIGs) using the ABC synthesis tool.

  • Scale: 1.2 million graphs. The following are the results from Table 4 of the original paper:

    | #Graphs | #Inputs | #Outputs | Mean #Nodes | Median #Nodes | Max. #Nodes |
    |---|---|---|---|---|---|
    | 1.2M | 6-8 | 1-2 | 125.9 | 104.0 | 335.0 |
  • Characteristics: Boolean circuits represented as Directed Acyclic Graphs (DAGs) where nodes are AND gates and edges may indicate signal inversion. Task is conditional graph generation: generate an AIG that implements a given Boolean function (truth table) while minimizing gate count.

  • Domain: Hardware design.

  • Why Chosen: Represents a highly complex and impactful engineering challenge. It's a novel generative task with strict logical correctness constraints and an optimization objective, providing a rigorous test for generative graph methods.

5.1.3. Electronic Circuits: Voltage Conversion Ratio Prediction

  • Source: Datasets generated from random valid topologies with 5, 7, and 10 components, simulated with NGSPICE.
  • Scale: 73k (5-comp), 14k (7-comp), 6k (10-comp) instances.
  • Characteristics: Analog circuits as graphs. Nodes are components (capacitors, inductors, switches, terminals), edges are electrical connections. Task is graph-level regression to predict voltage conversion ratio and power conversion efficiency.
  • Domain: Hardware design.
  • Why Chosen: Critical problem in EDA. Highlights robustness to structural sensitivity and ability to cope with extreme class imbalance. Explicitly tests OOD generalization across different circuit sizes.

5.1.4. SAT Solving: Algorithm Selection and Performance Prediction

  • Source: Largest collection of SAT instances to date, combining SAT Competition instances, AClib benchmark, and instances generated by FuzzSAT, Cryptography, and Community Attachment.

  • Scale: 107,866 instances in total, split into SMALL, MEDIUM, and LARGE datasets based on size. The following are the results from Table 7 of the original paper:

    | Source | #Instances | #Variables (min / max / avg / median) | #Clauses (min / max / avg / median) |
    |---|---|---|---|
    | SAT Competition | 31 024 | 3 / 25 870 369 / 69 190.77 / 13 168.50 | 7 / 871 935 536 / 1 022 025.50 / 118 526.00 |
    | Community Attachment | 29 994 | 922 / 2 956 / 1 931.05 / 1 928.00 | 3 566 / 13 413 / 8 088.71 / 8 041.00 |
    | AClib | 33 955 | 6 / 248 738 / 3 319.65 / 1 270.00 | 24 / 103 670 100 / 136 471.03 / 10 000.00 |
    | FuzzSAT | 8 020 | 2 / 944 685 / 10 217.75 / 1 820.00 | 1 / 4 094 516 / 45 638.30 / 8 074.50 |
    | Cryptography | 4 873 | 6 / 12 415 / 4 264.89 / 3 225.00 | 24 / 185 006 / 55 309.63 / 45 316.00 |
    | Total | 107 866 | 2 / 25 870 369 / 22 434.71 / 1 879.00 | 1 / 871 935 536 / 345 051.72 / 10 177.50 |
  • Characteristics: SAT instances represented as graphs (Variable-Clause Graph VCG, Variable Graph VG, Clause Graph LCG). Tasks are performance prediction (regression of solver runtime) and algorithm selection (classification of best solver). SATzilla features can be integrated as node attributes.

  • Domain: Reasoning and optimization.

  • Why Chosen: SAT is a central NP-complete problem with real-world applications. Provides a practical algorithm selection benchmark with structural permutation invariance, allowing exploration of GNNs in this area.

5.1.5. Combinatorial Optimization (CO): Predicting Optimal Solution Objectives

  • Source: 50,000 instances for each problem type (Maximum Independent Set MIS, Max-Cut, Graph Coloring) generated from three random graph models: RB graphs, Erdős-Rényi (ER) graphs, and Barabási-Albert (BA) graphs. Both small-scale and large-scale variants are provided. The following are the results from Table 11 of the original paper:

    | Generator | Size | Nodes | Nodes/Clique | Cliques | Tightness | Edge Prob. | Attached Edges |
    |---|---|---|---|---|---|---|---|
    | RB | small | [200, 300] | [5, 12] | [20, 25] | [0.3, 1] | - | - |
    | RB | large | [800, 1200] | [20, 25] | [40, 55] | [0.3, 1] | - | - |
    | ER | small | [200, 300] | - | - | - | 0.15 | - |
    | ER | large | [700, 800] | - | - | - | 0.15 | - |
    | BA | small | [200, 300] | - | - | - | - | 2 |
    | BA | large | [700, 800] | - | - | - | - | 2 |
  • Characteristics: Graphs as defined by the problem. No node features are included, only the adjacency matrix. Tasks are supervised learning (regression to predict optimal objective value) and unsupervised learning (predicting scores for variables to construct a feasible solution).

  • Domain: Reasoning and optimization.

  • Why Chosen: CO problems are computationally hard but often graph-structured. They provide clear quantitative objectives, ideal for supervised training and OOD generalization studies.

5.1.6. Algorithmic Reasoning: Learning to Simulate Algorithms

  • Source: 7 tasks with synthetic graphs generated from Erdős-Rényi (ER), Stochastic Block Model (SBM), Power-Law Cluster (PC), Newman-Watts-Strogatz (NWS), Barabási-Albert (BA), and Dual Barabási-Albert (DBA) graph generators.
  • Scale: One million training graphs, 10,000 validation/test graphs per task. Node count fixed to 16 for training, 128 for validation/test for OOD generalization.
  • Characteristics: Tasks include Topological Sorting (node regression), Bridges (edge classification), Minimum Spanning Tree (MST) (edge classification), Max Flow (graph regression), Max Clique (node classification), Steiner Trees (edge classification), and Max Matching (node classification). Three difficulty levels (EASY, MEDIUM, HARD) vary graph generator parameters.
  • Domain: Reasoning and optimization.
  • Why Chosen: Dedicated to exploring the intersection of algorithms and neural networks. Provides large-scale datasets for common graph algorithms, expanding prior benchmarks like CLRS, and rigorously testing size generalization.

5.1.7. Earth Systems: Weather Forecasting

  • Source: Reanalysis data from ERA5 dataset, preprocessed via WeatherBench2 pipeline.
  • Scale: 64x32 equiangular grid resolution. Single graph with 4610 nodes and 59667 edges.
  • Characteristics: 3D grid graph (grid cells) and an icosahedral mesh graph with learned mappings between them. Node features include 62 physical and derived meteorological variables (e.g., temperature, humidity, wind speed) at various pressure levels and surface. Task is regression to predict residual change in atmospheric state over a 12-hour horizon.
  • Domain: Earth systems.
  • Why Chosen: Weather forecasting is critical and represents a multi-scale, non-Euclidean dependency problem. Tests graph-based models for integrating heterogeneous features and generalizing across time and space.

5.2. Evaluation Metrics

GraphBench employs domain-specific metrics tailored to each task, extending beyond classical metrics like MSE and accuracy to capture real-world utility.

5.2.1. Social Networks (BlueSky Engagements)

  • Metrics: Mean Absolute Error (MAE), Coefficient of Determination ($R^2$), and Spearman Correlation ($\sigma$).
  • Mean Absolute Error (MAE):
    1. Conceptual Definition: Measures the average magnitude of the errors in a set of predictions, without considering their direction. It indicates how close predictions are to actual values on average.
    2. Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $
    3. Symbol Explanation:
      • $N$: The total number of observations/predictions.
      • $y_i$: The ground-truth value for the $i$-th observation.
      • $\hat{y}_i$: The predicted value for the $i$-th observation.
      • $|\cdot|$: The absolute value.
  • Coefficient of Determination ($R^2$):
    1. Conceptual Definition: Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the model explains the variability of the data; a higher $R^2$ suggests a better fit.
    2. Mathematical Formula: $ R^2_{\kappa, \tau_{1,2}} := 1 - \frac{\sum_{\boldsymbol{u} \in \boldsymbol{U}} \big(y^{\kappa, (\boldsymbol{u})}_{\tau_{1,2}} - \hat{y}^{\kappa, (\boldsymbol{u})}\big)^2}{\sum_{\boldsymbol{u} \in \boldsymbol{U}} \big(y^{\kappa, (\boldsymbol{u})}_{\tau_{1,2}} - \bar{y}^{\kappa}_{\tau_{1,2}}\big)^2} $
    3. Symbol Explanation:
      • $\boldsymbol{U}$: The set of evaluation nodes (users).
      • $y^{\kappa, (\boldsymbol{u})}_{\tau_{1,2}}$: The ground-truth engagement value of kind $\kappa$ for user $\boldsymbol{u}$ in the prediction interval $\tau_{1,2}$.
      • $\hat{y}^{\kappa, (\boldsymbol{u})}$: The model's predicted engagement value of kind $\kappa$ for user $\boldsymbol{u}$.
      • $\bar{y}^{\kappa}_{\tau_{1,2}}$: The mean ground-truth engagement value of kind $\kappa$ over all users in $\boldsymbol{U}$.
  • Spearman Correlation ($\sigma$):
    1. Conceptual Definition: A non-parametric measure of the strength and direction of the monotonic relationship between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function (i.e., if one variable increases, the other also tends to increase, but not necessarily at a constant rate). In this context, it measures how well predictions preserve the relative ordering of users by future engagements.
    2. Mathematical Formula: $ \sigma_{\kappa, \tau_{1,2}} := 1 - \frac{6 \cdot \sum_{u \in U} \big( \mathbf{R}[y^{\kappa, (u)}_{\tau_{1,2}}] - \mathbf{R}[\widehat{y}^{\kappa, (u)}] \big)^2}{n_U (n_U^2 - 1)} $
    3. Symbol Explanation:
      • $U$: The set of evaluation nodes (users).
      • $n_U$: The number of users in $U$.
      • $\mathbf{R}[y^{\kappa, (u)}_{\tau_{1,2}}]$: The rank of the ground-truth engagement value of kind $\kappa$ for user $u$.
      • $\mathbf{R}[\widehat{y}^{\kappa, (u)}]$: The rank of the model's predicted engagement value of kind $\kappa$ for user $u$.
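For concreteness, the following is a minimal sketch (not the official GraphBench evaluation code) of how these three engagement metrics could be computed for a single engagement kind with NumPy/SciPy; the array names are illustrative.

```python
# Illustrative sketch: MAE, R^2, and Spearman correlation for one engagement kind.
import numpy as np
from scipy.stats import spearmanr

def engagement_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """y_true, y_pred: per-user ground-truth and predicted engagement values."""
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = float(1.0 - ss_res / ss_tot)
    rho, _ = spearmanr(y_true, y_pred)               # rank correlation over users
    return {"MAE": mae, "R2": r2, "Spearman": float(rho)}

# Toy usage (values are purely illustrative):
print(engagement_metrics(np.array([3.0, 0.0, 7.0, 1.0]),
                         np.array([2.5, 0.5, 6.0, 1.5])))
```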

5.2.2. Chip Design (AIG Generation)

  • Metrics: Ad-hoc Score (defined in the paper).
  • Ad-hoc Score (for AIG Generation):
    1. Conceptual Definition: Measures the quality of generated AIGs by considering both logical correctness (functional equivalence to the target Boolean function) and structural efficiency (minimizing the number of internal nodes, relative to a baseline).
    2. Mathematical Formula: $ \mathrm{score}(\tilde{C}) := \frac{100}{N} \sum_{i=1}^{N} \frac{\#C_i^{\mathrm{ref}}}{\#\tilde{C}_i} \cdot \mathbb{1}_{\tilde{C}_i \equiv f_i} $
    3. Symbol Explanation:
      • $\tilde{C}$: The set of generated AIGs by the model.
      • $N$: The total number of Boolean functions (instances).
      • $\#\tilde{C}_i$: The number of internal nodes of the $i$-th generated circuit.
      • $\#C_i^{\mathrm{ref}}$: The number of internal nodes of the corresponding baseline AIG generated by ABC (a standard logic synthesis tool).
      • $\mathbb{1}_{\tilde{C}_i \equiv f_i}$: An indicator function that is 1 if the generated circuit $\tilde{C}_i$ is logically equivalent to the target Boolean function $f_i$, and 0 otherwise.
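A minimal sketch of this score, assuming the per-instance node counts and an equivalence flag (e.g., from an external SAT-based equivalence check, not shown here) are already available; all names are hypothetical:

```python
# Illustrative sketch of the AIG generation score; inputs are hypothetical.
def aig_score(gen_sizes, ref_sizes, equivalent):
    """gen_sizes: internal node counts of the generated circuits;
    ref_sizes: node counts of the ABC reference circuits;
    equivalent: booleans, True if the generated circuit realizes the target function."""
    n = len(gen_sizes)
    total = 0.0
    for gen, ref, eq in zip(gen_sizes, ref_sizes, equivalent):
        if eq:                   # indicator term: only correct circuits contribute
            total += ref / gen   # circuits smaller than the reference score above 1
    return 100.0 * total / n     # matching ABC's sizes on all instances yields 100
```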

5.2.3. Electronic Circuits (Voltage Conversion Ratio Prediction)

  • Metrics: Relative Squared Error (RSE).
  • Relative Squared Error (RSE):
    1. Conceptual Definition: A normalized version of Mean Squared Error that compares the model's performance to that of a simple baseline predictor (the mean of the target values). It indicates how much better the model is at predicting than simply guessing the average. A lower RSE is better.
    2. Mathematical Formula: $ \mathrm{RSE}(y, \hat{y}) := \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} $
    3. Symbol Explanation:
      • $N$: The total number of observations.
      • $y_i$: The ground-truth value for the $i$-th observation from high-fidelity simulation.
      • $\hat{y}_i$: The model's predicted value for the $i$-th observation.
      • $\bar{y}$: The mean of all ground-truth values $\{y_i\}_{i=1}^N$.
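As a small sketch (illustrative, not the reference implementation), RSE can be computed directly from its definition:

```python
# Illustrative sketch: Relative Squared Error for voltage-conversion prediction.
import numpy as np

def rse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    num = np.sum((y_true - y_pred) ** 2)          # model's squared error
    den = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean predictor
    return float(num / den)                       # values below 1 beat the mean baseline
```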

5.2.4. SAT Solving (Algorithm Selection and Performance Prediction)

  • Metrics for Performance Prediction: Root Mean Squared Error (RMSE) on log10-scaled computation times.
  • Metrics for Algorithm Selection: Closed Gap (CG).
  • Root Mean Squared Error (RMSE): (For $\log_{10}$ time prediction)
    1. Conceptual Definition: The square root of the average of the squared differences between predicted values and actual values. It measures the average magnitude of the errors, with larger errors having a disproportionately larger effect.
    2. Mathematical Formula: For performance prediction, the model predicts the base-10 logarithm of the $\mathrm{PAR}_{10}$ score. Let $Y_i = \log_{10}(\mathrm{PAR}_{10}(s_k, \pi_i))$ be the true log-scaled time for solver $s_k$ on instance $\pi_i$, and $\hat{Y}_i$ be the predicted log-scaled time. Then $ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2} $
    3. Symbol Explanation:
      • $N$: The total number of solver-instance pairs.
      • $Y_i$: The true $\log_{10}$-scaled computation time for the $i$-th pair.
      • $\hat{Y}_i$: The predicted $\log_{10}$-scaled computation time for the $i$-th pair.
  • Closed Gap (CG):
    1. Conceptual Definition: Quantifies how close an algorithm selector (the model) comes to the performance of the Virtual Best Solver (VBS) relative to the Single Best Solver (SBS). A CG of 1 means the selector achieved VBS performance, 0 means SBS performance, and negative means worse than SBS.
    2. Mathematical Formula: $ \mathbf{CG}(\varPi) = \frac{\mathrm{PAR}_{10}(\mathbf{SBS}, \varPi) - \mathrm{PAR}_{10}(s, \varPi)}{\mathrm{PAR}_{10}(\mathbf{SBS}, \varPi) - \mathrm{PAR}_{10}(\mathbf{VBS}, \varPi)} $
    3. Symbol Explanation:
      • $\varPi$: A set of SAT instances.
      • $\mathrm{PAR}_{10}(X, \varPi)$: The Penalized Average Runtime (PAR10) score for solver/selector $X$ on instance set $\varPi$. PAR10 is the mean computation time, where unsolved instances are penalized by $10 \times \mathrm{cutoff\_time}$.
      • $\mathbf{SBS}$: The Single Best Solver (the individual solver that performs best overall across $\varPi$).
      • $\mathbf{VBS}$: The Virtual Best Solver (an oracle that always picks the best solver for each individual instance in $\varPi$).
      • $s$: The algorithm selector being evaluated.
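A minimal sketch of these SAT metrics, assuming raw runtimes and solved flags are available; function and variable names are illustrative, not the paper's evaluation code:

```python
# Illustrative sketch: PAR10, RMSE on log10 runtimes, and the closed gap.
import numpy as np

def par10(runtimes: np.ndarray, solved: np.ndarray, cutoff: float) -> float:
    """Mean runtime where unsolved instances are penalized by 10x the cutoff."""
    penalized = np.where(solved, runtimes, 10.0 * cutoff)
    return float(penalized.mean())

def rmse_log10(true_times: np.ndarray, pred_times: np.ndarray) -> float:
    """RMSE between log10-scaled true and predicted times."""
    return float(np.sqrt(np.mean((np.log10(true_times) - np.log10(pred_times)) ** 2)))

def closed_gap(par10_selector: float, par10_sbs: float, par10_vbs: float) -> float:
    """1.0 = virtual best solver performance, 0.0 = single best solver."""
    return (par10_sbs - par10_selector) / (par10_sbs - par10_vbs)
```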

5.2.5. Combinatorial Optimization (CO)

  • Metrics for Supervised Learning: Mean Absolute Error (MAE).
  • Metrics for Unsupervised Learning: The objective value of the decoded solution (e.g., MIS size, Max-Cut size, number of colors used). For MIS and Max-Cut, higher values are better; for Graph Coloring, lower values are better.
  • Mean Absolute Error (MAE): (See definition above for Social Networks). Here, it measures the error between predicted and true optimal objective values.

5.2.6. Algorithmic Reasoning

  • Metrics for Regression Tasks: Mean Absolute Error (MAE).
  • Metrics for Binary Classification Tasks: F1 Score.
  • Mean Absolute Error (MAE): (See definition above for Social Networks). Here, it measures the error between model predictions and ground-truth solutions for tasks like Topological Sorting or Max Flow.
  • F1 Score:
    1. Conceptual Definition: The harmonic mean of precision and recall. It is a measure of a model's accuracy on a binary classification problem that considers both false positives and false negatives. It is particularly useful when the classes are imbalanced.
    2. Mathematical Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $ where $ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ and $ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $.
    3. Symbol Explanation:
      • TP\mathrm{TP}: True Positives (correctly predicted positive instances).
      • FP\mathrm{FP}: False Positives (incorrectly predicted positive instances).
      • FN\mathrm{FN}: False Negatives (incorrectly predicted negative instances).
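A short sketch (illustrative only) of the F1 computation from binary prediction and label vectors:

```python
# Illustrative sketch: F1 score from binary labels and predictions.
import numpy as np

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```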

5.2.7. Earth Systems (Weather Forecasting)

  • Metrics: Mean Squared Error (MSE) for each variable and cumulatively across all variables.
  • Mean Squared Error (MSE): (Unweighted versions are reported for evaluation, distinct from the weighted training loss).
    1. Conceptual Definition: Measures the average of the squares of the errors. It quantifies the average squared difference between the estimated values and the actual value. It is always non-negative, and values closer to zero are better.
    2. Mathematical Formula (for a single variable $j$): $ \mathrm{MSE}_j (x^d, \hat{x}^d) = \frac{1}{|D|\,|G|\,|L_j|} \sum_{d \in D} \sum_{i \in G} \sum_{\ell \in L_j} (\hat{x}_{i,j,\ell}^d - x_{i,j,\ell}^d)^2 $
    3. Mathematical Formula (for all variables): $ \mathrm{MSE}_{\mathrm{all}} (x^d, \hat{x}^d) = \frac{1}{|D|\,|G| \sum_{j \in J}|L_j|} \sum_{d \in D} \sum_{i \in G} \sum_{j \in J} \sum_{\ell \in L_j} (\hat{x}_{i,j,\ell}^d - x_{i,j,\ell}^d)^2 $
    4. Symbol Explanation:
      • $D$: The set of forecast date-times.
      • $G$: The grid cells (locations).
      • $J$: The set of meteorological variables.
      • $L_j$: The set of pressure levels for variable $j$.
      • $x_{i,j,\ell}^d$: The ground-truth value for grid cell $i$, variable $j$, pressure level $\ell$, at date-time $d$.
      • $\hat{x}_{i,j,\ell}^d$: The model's predicted value for grid cell $i$, variable $j$, pressure level $\ell$, at date-time $d$.
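A minimal sketch of these unweighted MSE variants, assuming per-variable prediction and target arrays of shape [dates, grid cells, levels]; shapes and names are illustrative:

```python
# Illustrative sketch: per-variable and all-variable weather-forecast MSE.
import numpy as np

def mse_variable(x_pred: np.ndarray, x_true: np.ndarray) -> float:
    """x_pred, x_true: shape (|D|, |G|, |L_j|) for one variable j."""
    return float(np.mean((x_pred - x_true) ** 2))

def mse_all(preds, trues) -> float:
    """preds, trues: lists over variables j, each of shape (|D|, |G|, |L_j|)."""
    num = sum(np.sum((p - t) ** 2) for p, t in zip(preds, trues))
    den = sum(p.size for p in preds)   # equals |D| * |G| * sum_j |L_j|
    return float(num / den)
```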

5.3. Baselines

The paper benchmarks GraphBench using common message-passing neural networks (MPNNs) and graph transformer (GT) models. These models are evaluated within the encoder-processor-decoder framework.

  1. Graph Isomorphism Network (GIN): A powerful MPNN that serves as a strong GNN baseline due to its high discriminative power (matching the 1-WL test). It is used across various tasks including Combinatorial Optimization, Algorithmic Reasoning, and SAT Solving. For Social Networks, a GraphConv variant with mean aggregation is used instead, due to dataset characteristics.

  2. Graph Transformer (GT): A Graph Transformer model is used as a baseline to explore the efficacy of self-attention mechanisms on graph data. The implementation follows Bechler-Speicher et al. (2025), using biased attention and incorporating random-walk structural encodings (RWSE) and Laplacian positional encodings (LPE). It is used across Algorithmic Reasoning, Combinatorial Optimization, Electronic Circuits, and Weather Forecasting. GTs were excluded from SAT Solving due to the high computational cost of positional embeddings and out-of-memory errors on larger graphs.

  3. Multilayer Perceptron (MLP): A simple feed-forward neural network that processes node features independently, ignoring all graph structure. It serves as a connectivity-agnostic baseline to demonstrate the value added by graph-aware architectures. Used in Social Networks and Combinatorial Optimization.

  4. DeepSets: An architecture designed for permutation-invariant tasks, where the output does not depend on the order of input elements. It aggregates information globally without explicit message passing. Used as another connectivity-agnostic baseline (though it uses global features) in Social Networks and Combinatorial Optimization.

  5. Traditional SAT Solver Baselines (for SAT Solving):

    • Algorithm Selection: Pairwise Regression (PR), Pairwise Classification (PC), and Multi-Class Classification (MC), all using Random Forest on hand-crafted SATzilla features. These are robust, established methods in the algorithm selection field.
    • Performance Prediction: Random Forest (RF) and XGBoost models, also using SATzilla features, which are state-of-the-art for empirical performance modeling (EPM) in SAT.
  6. ABC Variants (for Chip Design): STRaSH, ReSyN, Compress2, and Resyn2rs. These are classical logic synthesis tools/scripts of varying complexity, providing a strong combinatorial optimization baseline for AIG generation. No learning-based generative baselines were included for AIG due to their inability to produce functionally equivalent circuits.

  7. Persistence Model (for Weather Forecasting): A simple baseline that assumes future conditions will be the same as the current conditions. This is a common and robust baseline in weather forecasting to show the minimal skill required for a useful model.

  8. GraphCast (Lam et al., 2023) (for Weather Forecasting): A state-of-the-art MPNN-based system for global weather forecasting. This represents a strong, specialized graph learning model in the domain.

    The consistent use of three random seeds for evaluation and the reporting of mean performance with standard deviations ensure statistical robustness of the baseline results.
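For illustration, the following is a minimal sketch of how a GIN baseline can be wired in the encoder-processor-decoder pattern described above, using PyTorch Geometric; the layer sizes, depth, and mean-pooling readout are illustrative choices, not the paper's exact configuration.

```python
# Illustrative sketch (not the GraphBench reference code): an
# encoder-processor-decoder GIN for graph-level prediction.
import torch
from torch import nn
from torch_geometric.nn import GINConv, global_mean_pool

class GINBaseline(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int, layers: int = 4):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)               # encode raw node features
        self.processor = nn.ModuleList([
            GINConv(nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden)))  # message-passing steps
            for _ in range(layers)
        ])
        self.decoder = nn.Linear(hidden, out_dim)               # task-specific readout

    def forward(self, x, edge_index, batch):
        h = self.encoder(x)
        for conv in self.processor:
            h = torch.relu(conv(h, edge_index))
        return self.decoder(global_mean_pool(h, batch))         # graph-level output
```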

6. Results & Analysis

The results presented in GraphBench not only showcase baseline performances across diverse tasks but also highlight persistent challenges and areas where current graph learning models fall short.

6.1. Core Results Analysis

6.1.1. Social Networks (BlueSky Engagements)

  • Key Finding: Graph-aware models (GNNs) consistently outperform connectivity-agnostic baselines (MLPs, DeepSets), demonstrating the value of relational structure for predicting user engagements. However, the absolute performance metrics ($R^2$, $\sigma$) remain relatively low, indicating significant room for improvement.

  • Advantages/Disadvantages: GNNs capture local, directed relational structures, which are informative. DeepSets performed worse than MLPs in some cases, suggesting that naive global feature aggregation without structural context can be detrimental. The tasks are challenging, especially in predicting subtle future engagement metrics.

    The following are the results from Table 3 of the original paper:

    Dataset Model MAE Spearman corr. (σ)
    likes replies reposts likes replies reposts
    quotes DeepSets 0.810± 0.005 0.140± 0.002 0.102± 0.002 0.138± 0.002 0.307± 0.004 0.178± 0.002 0.334± 0.003
    GNN 0.768± 0.002 0.165± 0.002 0.134± 0.002 0.175± 0.002 0.330± 0.003 0.192± 0.002 0.337± 0.022
    MLP 0.784±0.001 0.145± 0.001 0.107± 0.001 0.141± 0.001 0.308± 0.003 0.176± 0.002 0.335± 0.002
    replies DeepSets 0.789± 0.033 0.086± 0.033 0.061 ± 0.024 0.104± 0.014 0.253± 0.003 0.130± 0.001 0.240± 0.006
    GNN 0.694± 0.002 0.158± 0.003 0.119± 0.002 0.159± 0.002 0.258± 0.009 0.132± 0.006 0.247± 0.011
    MLP 0.725± 0.004 0.131± 0.003 0.087± 0.003 0.122± 0.003 0.249± 0.004 0.127± 0.005 0.243± 0.003
    reposts DeepSets 0.918± 0.013 0.049± 0.005 0.031 ± 0.012 0.050± 0.009 0.229± 0.004 0.129± 0.007 0.205 ± 0.009
    GNN 0.832± 0.009 0.131± 0.017 0.111± 0.022 0.128± 0.008 0.284± 0.044 0.143± 0.032 0.260± 0.051
    MLP 0.874± 0.003 0.087± 0.002 0.051 ± 0.003 0.066± 0.002 0.234± 0.008 0.123± 0.006 0.206± 0.010

6.1.2. Chip Design (AIG Generation)

  • Key Finding: Traditional ABC tools achieve higher scores (structural efficiency and logical correctness) with increasing complexity (e.g., Resyn2rs > STRaSH). A notable result is the absence of competitive learning-based baselines, as existing generative models struggled with functional equivalence constraints.

  • Advantages/Disadvantages: The ABC variants excel at this constrained generative task. The lack of ML baselines highlights a significant gap and research opportunity for conditional DAG generative models that can handle strict constraints.

    The following are the results from Table 5 of the original paper:

    Number of (inputs, outputs)
    Method (6,1) (6,2) (7,1) (7,2) (8,1) (8,2) Overall
    STRaSH 73.90 74.23 76.55 75.71 77.63 76.52 75.76
    ReSyN 86.79 88.16 89.28 89.05 90.45 89.78 88.92
    Compress2 92.53 93.57 93.18 92.79 93.28 92.49 92.97
    Resyn2rs 93.30 95.24 95.15 95.64 96.24 96.11 95.28

6.1.3. Electronic Circuits (Voltage Conversion Ratio Prediction)

  • Key Finding: Graph Transformers (GTs) consistently achieve higher prediction accuracy (lower RSE) than MPNN baselines (GIN, GAT, GCN) across different circuit sizes (5, 7, 10 components) and prediction targets. All models' RSE increases with circuit complexity, primarily due to reduced training data for larger circuits and combinatorial growth.

  • Advantages/Disadvantages: GTs' ability to model long-range dependencies likely contributes to their superior performance. The challenge of scaling training data for larger, more complex circuits is evident, reflecting a real-world bottleneck.

    The following are the results from Table 6 of the original paper:

    Task Method Size
    5 comp 7 comp 10 comp
    Efficiency GT 0.07 ± 0.03 0.18± 0.06 0.29± 0.03
    GIN 0.09± 0.06 0.21 ± 0.04 0.44± 0.13
    GAT 0.11 ± 0.05 0.22± 0.05 0.38± 0.05
    GCN 0.13± 0.07 0.30± 0.04 0.53± 0.06
    Voltage GT 0.12± 0.02 0.31± 0.05 0.39± 0.05
    GIN 0.14± 0.05 0.35± 0.03 0.54± 0.12
    GAT 0.17± 0.13 0.34± 0.12 0.48± 0.02
    GCN 0.18± 0.07 0.46± 0.08 0.61± 0.07

6.1.4. SAT Solving (Algorithm Selection and Performance Prediction)

  • Key Finding: For performance prediction, traditional tabular ML models (Random Forest, XGBoost) using hand-crafted SATzilla features significantly outperform MPNNs across all SAT solver types and graph representations. For algorithm selection, Pairwise Regression (PR) performs best, again with SATzilla features. MPNNs struggle with large instances due to computational limitations.

  • Advantages/Disadvantages: The SATzilla features are highly effective, demonstrating that good feature engineering can still trump raw graph learning in some domains. MPNNs faced out-of-memory errors on larger instances and struggled to match SATzilla-based baselines, indicating scalability and expressivity challenges for GNNs in this domain. The choice of graph representation also matters, with VG generally outperforming VCG and LCG for MPNNs.

    The following are the results from Table 8 of the original paper:

    Solver Method VG VCG LCG SATZILLA
    Kissat GIN 1.36± 0.15 1.15± 0.03 1.49± 0.02
    RF 0.61 ± 0.00
    XGB 0.63± 0.00
    BreakIDKissat GIN 1.29± 0.03 1.33± 0.15 1.52± 0.01
    RF 0.65± 0.00
    XGB 0.68± 0.01
    KissatMABDC GIN 1.27± 0.00 1.43± 0.24 1.60± 0.01
    RF 0.70± 0.00
    XGB 0.74± 0.00
    AMSAT GIN 1.43± 0.01 1.45± 0.09 1.54± 0.00
    RF 0.68± 0.00
    XGB 0.72± 0.01

The following are the results from Table 9a of the original paper:

Solver Method SATZILLA
Kissat RF 0.59± 0.00
XGB 0.62± 0.00
BreakIDKissat RF 0.64± 0.00
XGB 0.67 ± 0.00
KissatMaBDC RF 0.68± 0.00
XGB 0.72± 0.00
AMSAT RF 0.64± 0.00
XGB 0.67± 0.00

The following are the results from Table 9b of the original paper:

Solver Method SATZILLA
Kissat RF 0.58± 0.00
XGB 0.63± 0.00
BreakIDKissat RF 0.62± 0.00
XGB 0.67± 0.00
KissatMaBDC RF 0.66± 0.00
XGB 0.70± 0.00
AMSAT RF 0.62± 0.00
XGB 0.66± 0.00

The following are the results from Table 10 of the original paper:

Instances Method VG VCG LCG
SMALL GIN 0.05± 0.00 0.00± 0.00 -0.02± 0.02
PC 0.03± 0.02
PR 0.48± 0.03
MC -0.32± 0.02
MEDIUM GIN
PC 0.13± 0.02
PR 0.44± 0.02
MC -0.13± 0.02
LARGE GIN
PC 0.33± 0.01
PR 0.54± 0.02
MC 0.23± 0.02

6.1.5. Combinatorial Optimization (CO)

  • Key Finding: For supervised CO tasks (MIS), GIN generally performs best among baselines due to its MPNN inductive bias. DeepSet often outperforms MLP, indicating the benefit of global information. GTs sometimes perform poorly, possibly due to training difficulties. For unsupervised CO, GIN and GT generally outperform MLP and DeepSet (Graph-based models are better) across all three CO problems (MIS, Max-Cut, Graph Coloring), with GIN showing overall stronger performance. However, for sparse BA graphs in Graph Coloring, DeepSet surprisingly excels.

  • Advantages/Disadvantages: MPNNs are well-suited for graph-structured CO problems. The results confirm that current basic GNN baselines do not yet compete with specialized CO-specific architectures or exact solvers in solution quality, but offer speed advantages.

    The following are the results from Table 12 of the original paper:

    Dataset Method Size
    Small Large
    RB graph GIN 0.491± 0.099 2.125± 0.484
    GT 4.112± 2.353 0.915± 0.235
    MLP 1.583± 0.052 1.437± 0.520
    DeepSet 0.918± 0.186 1.427± 0.224
    ER graph GIN 0.234± 0.191 0.352± 0.265
    GT 6.486± 8.101 9.641± 15.335
    MLP 0.751± 0.767 0.914± 0.307
    DeepSet 0.756± 0.711 2.244± 0.364
    BA graph GIN 0.292± 0.041 0.111± 0.016
    GT 3.481 ± 3.446 1.829± 1.383
    MLP 2.825± 0.949 3.383± 0.504
    DeepSet 2.304± 0.262 3.362± 1.491

The following are the results from Table 13 of the original paper:

Size
Problem Dataset Method Small Large
MIS (MIS size ↑) RB GIN 17.294± 0.328 13.999± 0.321
GT MLP 16.542± 0.477 16.105± 0.097 13.406± 0.140 13.040± 0.214
DeepSet 16.021± 0.032 13.183± 0.035
Solver 20.803± 1.817 42.547± 4.449
ER GIN 25.418± 0.407 26.276± 0.408
GT 22.984± 0.473 23.183± 0.016 24.980± 0.292
MLP 24.259± 0.449
DeepSet Solver 23.050± 0.061 33.604± 1.428 24.220± 0.056
45.637± 0.631
GIN 100.16± 3.674 135.00± 0.720
GT 99.579± 6.448 114.26± 0.601
MLP DeepSet 95.108± 2.042 95.076± 0.173 114.49± 0.758
Solver 142.86± 16.54 114.89± 0.016 433.77± 19.17
Maximum Cut (maximum cut size ↑) ER GIN 2106.7± 14.62
GT 1925.7± 32.75 24748.± 87.76
MLP 1727.7± 165.1 21524.± 184.0 20357.± 249.6
DeepSet 2920.1 ± 97.23 3575.9± 730.0
Solver 33914.± 7861.
GIN 2327.9± 24.78 2172.7± 91.75 20878.± 107.9
GT MLP 1866.7± 67.64 16534.± 278.0 7335.4± 57.49
DeepSet Solver 2835.5± 607.6 23884.± 1809.
GIN 397.00± 0.605 1044.1± 0.649
BA GT 363.76± 0.639 986.93± 3.128
MLP 308.73± 0.224 929.20± 4.060
DeepSet 1.0669± 0.800 154.31± 151.5
Solver 460.91± 50.13 1260.4± 48.81
Graph Coloring (number of colors used ↓) RB GIN GT 25.166± 0.288 25.146± 0.253 55.513± 0.526 55.562± 0.648
MLP 24.733± 0.667
DeepSet 55.558± 0.557
Solver 26.723± 0.189 71.051± 0.604
19.970± 3.465 41.480± 6.634
ER GIN 16.182± 0.202 34.587± 0.545
GT 16.188± 0.201 34.385± 0.413
MLP 17.110± 0.144 34.658± 0.394
DeepSet Solver 18.266± 0.018 10.235± 0.836 55.345± 0.551
BA 22.933± 0.772
GIN 5.1318± 0.114 6.2028± 0.283
GT 5.0939± 0.070 6.1167± 0.086
MLP 5.9900± 0.127 9.4215± 0.186
DeepSet 3.2780± 0.122 3.1981± 0.239
Solver 3.0000± 0.000 3.0000± 0.000

6.1.6. Algorithmic Reasoning

  • Key Finding: Performance varies significantly by task and difficulty. GTs show improved performance on MST, Max Clique, and Max Matching, while GIN performs better on Max Flow, Bridges, and Steiner Trees. F1 scores generally decrease with increasing difficulty. Size generalization shows mixed results: MST and Steiner Trees sometimes improve with larger graphs (possibly due to changes in average degree), while Topological Sorting, Max Matching, and Max Clique decrease.

  • Advantages/Disadvantages: The task-dependent nature of performance suggests no single GNN architecture is universally optimal. Steiner Trees and Max Clique are particularly challenging. The size generalization results highlight that current baselines are not size-invariant across all tasks, pointing to a need for more robust OOD generalization mechanisms.

    The following are the results from Table 16 of the original paper:

    Task Difficulty Model F1
    MST EASY GIN 0.6906± 0.1655
    GT 0.8566± 0.0068
    GIN 0.7288± 0.0894
    MEDIUM GT 0.8504± 0.0148
    GIN 0.6107± 0.2015
    HARD GT 0.8421 ± 0.0115
    EASY GIN 0.9831± 0.0184
    GT 0.9269± 0.0103
    GIN 0.9622± 0.0077
    MEDIUM GT 0.8762± 0.0258
    HARD GIN 0.968± 0.0178
    GT 0.8897± 0.0304
    GIN 0.6691± 0.0288
    EASY GT 0.6691± 0.0112
    0.5628± 0.1100
    MEDIUM GIN GT 0.5672± 0.0790
    GIN 0.5516± 0.2368
    HARD GT 0.5212± 0.0219
    EASY GIN 0.4584± 0.0101
    Max Clique GT 0.5407± 0.0088
    MEDIUM GIN 0.3996± 0.0852
    GT 0.4859± 0.0017
    GIN 0.4102± 0.0820
    HARD GT 0.4868± 0.0013
    Max Matching EASY GIN 0.7527 ± 0.0051
    GT 0.7402± 0.0172
    MEDIUM GIN 0.6399± 0.0231
    GT 0.6915± 0.009
    GIN 0.6595± 0.0164
    HARD GT 0.6743± 0.0038

The following are the results from Table 17 of the original paper:

Task Difficulty Model MAE
Topological Sorting EASY GIN GT 0.1001 ± 0.0089 0.116± 0.0058
MEDIUM GIN GT 0.1537 ± 0.0031 0.1305± 0.0094
HARD GIN GT 0.1301 ± 0.0046 0.1532± 0.0214
Flow EASY GIN GT 3.4387± 0.0631 4.2737± 0.0646
MEDIUM GIN GT 9.5960± 0.1707 6.3786± 0.4262
HARD GIN GT 9.5061 ± 0.1265 6.4833± 0.0869

The following are the results from Table 18 of the original paper:

Dataset (Score) Model 128 192 256 384 512
Topological Sorting (MAE) GT 0.1346 0.1730 0.1827 0.2018 0.2071
GIN 0.1492 0.181 0.1981 0.2015 0.2179
MST (F1) GT 0.8720 0.8773 0.8891 0.8874 0.8817
GIN 0.6103 0.8132 0.8605 0.8765 0.8823
Bridges (F1) GT 0.8799 0.8842 0.8959 0.9049 0.9118
GIN 0.9579 0.9213 0.9190 0.9203 0.9212
Steiner Trees (F1) GT 0.5160 0.5322 0.5221 0.5762 0.5578
GIN 0.5499 0.5739 0.6338 0.6651 0.6502
Max Clique (F1) GT 0.4877 0.4849 0.4890 0.4915 0.4931
GIN 0.3496 0.3112 0.3148 0.2926 0.2673
Max Matching (F1) GT 0.7010 0.6552 0.6206 0.5770 OOT
GIN 0.6271 0.6326 0.6372 0.6382 OOT

6.1.7. Earth Systems (Weather Forecasting)

  • Key Finding: The provided GNN baselines establish a transparent, reproducible lower bound for medium-range weather forecasting. While they do not reach the performance of specialized systems like GraphCast or even Persistence for some variables, this is expected given their simpler architecture and limited training.

  • Advantages/Disadvantages: The baselines are straightforward and reproducible, serving their purpose as reference points. However, they lack domain-specific refinements and extensive hyperparameter optimization, highlighting the complexity of weather forecasting and the need for more advanced GNN designs.

    The following are the results from Table 19 of the original paper:

    Pressure Level Variable GT Persistence GraphCast
    All All variables 179.629k
    Surface 2-m temperature (2T) 7.572 7.123 0.068
    10-m u wind component (10U) 5.072 2.166 0.012
    10-m v wind component (10V) 6.102 3.266 0.013
    Mean sea level pressure (MSL) 102.551k 60.056k 240.832
    Total precipitation (TP) 0.009 714.517n 52.377n
    500 Temperature (T) 2.751 1.120 0.007
    U component of wind (U) 15.026 6.658 0.048
    V component of wind (V) 27.003 12.988 0.053
    Geopotential (Z) 86.250k 48.637k 155.057
    Specific humidity (Q) 0.010 66.211n 1.090n
    Vertical wind speed (W) 0.032 6.242 0.050
    700 Temperature (T) 2.406 1.051 0.006
    U component of wind (U) 8.716 3.951 0.025
    V component of wind (V) 14.156 6.754 0.027
    Geopotential (Z) 51.669k 32.773k 141.296
    Specific humidity (Q) 0.009 250.556n 3.304n
    Vertical wind speed (W) 0.024 3.352 0.027
    850 Temperature (T) 2.832 1.351 0.009
    U component of wind (U) 9.151 4.015 0.022
    V component of wind (V) 12.289 6.565 0.023
    Geopotential (Z) 51.542k 32.453k 139.716
    Specific humidity (Q) 0.010 354.058n 4.881n
    Vertical wind speed (W) 0.016 3.517 0.024
    All Temperature (T) 2.805
    U component of wind (U) 13.855
    V component of wind (V) 21.501
    Geopotential (Z) 77.020k -
    Specific humidity (Q) 0.010
    Vertical wind speed (W) 0.018

6.2. Ablation Studies / Parameter Analysis

The paper mentions an example of automated hyperparameter optimization (HPO) using SMAC3 to demonstrate its effectiveness.

  • Experiment: Tuning the GIN model on the SMALL SAT dataset with the VG graph representation.

  • Optimization Budget: 150 evaluations, varying fidelity over training gradient steps between 1,000 and 10,000.

  • Configuration Space: The following are the results from Table 20 of the original paper:

    Hyperparameter Range
    Learning rate [1e-6, 0.1]
    Weight decay [1e-8, 0.1]
    Warmup iterations [1000, 20000]
    Dropout [0.0, 0.5]
  • Results: The following are the results from Table 21 of the original paper:

    Solver Method VG
    Kissat GIN 1.36± 0.15
    GIN (tuned) 1.26± 0.03
  • Analysis: The automated HPO achieved a 7.3% reduction in RMSE for Kissat solver performance prediction compared to the manually tuned GIN model. This empirically demonstrates that GraphBench's HPO framework can effectively improve model performance, validating its inclusion as a core component for robust benchmarking.
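As a rough sketch of what such a tuning run could look like with the SMAC3 2.x API, the snippet below covers the Table 20 search space; the training function, hyperparameter names, and budget handling are placeholders, not the paper's actual HPO script.

```python
# Illustrative sketch: multi-fidelity HPO over the Table 20 search space with SMAC3.
from ConfigSpace import ConfigurationSpace, Float, Integer
from smac import MultiFidelityFacade, Scenario

def train_gin(config, seed: int = 0, budget: float = 1000) -> float:
    # Placeholder: train the GIN model for `budget` gradient steps with the
    # hyperparameters in `config` and return the validation RMSE (lower is better).
    return 0.0

cs = ConfigurationSpace()
cs.add_hyperparameters([
    Float("learning_rate", (1e-6, 0.1), log=True),
    Float("weight_decay", (1e-8, 0.1), log=True),
    Integer("warmup_iterations", (1000, 20000)),
    Float("dropout", (0.0, 0.5)),
])

# 150 evaluations, with fidelity varied over 1,000 to 10,000 gradient steps.
scenario = Scenario(cs, n_trials=150, min_budget=1000, max_budget=10000)
incumbent = MultiFidelityFacade(scenario, train_gin).optimize()
```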

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces GraphBench, a meticulously designed and comprehensive benchmarking suite that addresses critical limitations in current graph machine learning (GML) evaluation practices. By integrating diverse domains (social sciences, hardware design, reasoning & optimization, earth systems) and task types (node-, edge-, graph-level, generative), GraphBench establishes standardized evaluation protocols, including consistent data splits, domain-specific metrics, and crucial out-of-distribution (OOD) generalization tests. The inclusion of a unified hyperparameter tuning framework and principled baselines (using MPNNs and Graph Transformers) provides a robust foundation for future GML research. The initial benchmarking results highlight persistent challenges such as handling temporal distribution shifts, scalability to very large graphs, and conditional graph generation with strict constraints, emphasizing that many GML problems are far from solved.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose clear directions for future work:

  1. Computational Resources: The evaluation of Graph Transformers on some SAT solving tasks was limited due to high computational costs for positional embeddings and out-of-memory errors on large graphs (even with high-end GPUs). This indicates that scaling current GTs to very large graphs remains a practical challenge.
  2. Domain-Specific Refinements: For weather forecasting, the simple baselines did not incorporate domain-specific refinements (e.g., positional edge features) known to improve forecast skill. This suggests that while GraphBench provides a general framework, incorporating domain expertise is still critical for state-of-the-art performance.
  3. Benchmarking Generative Models: Existing graph generative models struggled to meet strict constraints like functional equivalence in chip design, indicating a significant gap in current generative GML capabilities.
  4. Limited HPO Scope: Due to limited computational resources, HPO was not run on all datasets and domains in this initial release, implying that the reported baseline performances might be further improvable.
  5. Long-Term Relevance: While the LARGE and MEDIUM SAT datasets are currently inaccessible to many GNNs due to hardware constraints, they are included to ensure the benchmark remains relevant as hardware and GNN architectures evolve.

Future Work:

  • Continuous Expansion: GraphBench is envisioned as a "living benchmark," continuously expanding with new domains, tasks, and evaluation paradigms.
  • Multi-Modal Graph Foundation Models: A key future direction is to extend GraphBench to support the training and evaluation of next-generation multi-modal graph foundation models, which are expected to handle diverse data types and tasks.

7.3. Personal Insights & Critique

7.3.1. Inspirations and Applications

This paper is highly inspiring as it addresses a critical meta-problem in graph machine learning: the lack of standardized, comprehensive, and real-world-relevant benchmarks.

  1. Reproducibility and Fair Comparison: GraphBench directly tackles the reproducibility crisis in ML by standardizing datasets, splits, and metrics. This is invaluable for researchers to truly compare methods rather than just report numbers.
  2. Driving Real-World Impact: By focusing on domains like chip design, weather forecasting, and combinatorial optimization, GraphBench redirects GML research towards problems with tangible societal and industrial impact, rather than solely academic benchmarks. This could accelerate the adoption of GML in critical applications.
  3. Understanding OOD Generalization: The explicit inclusion of OOD generalization tests is a major strength. Many real-world systems operate in dynamic environments, and models must generalize beyond their training data. This will push GML research towards building more robust and reliable systems.
  4. Foundation Models: The vision for graph foundation models is exciting. GraphBench provides the diverse data and evaluation rigor needed to train and assess such large-scale, general-purpose models, potentially revolutionizing how GML is applied.
  5. Challenging Assumptions: The results, particularly in SAT solving where traditional ML with hand-crafted features outperforms GNNs, serve as a crucial reminder that GNNs are not a panacea. This encourages researchers to critically assess where GNNs truly add value versus where simpler, well-engineered solutions are still superior, or how to combine them effectively.

7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement

  1. Maintenance and Scalability of the Benchmark Itself: As a "living benchmark," GraphBench will require significant ongoing effort to maintain, update, and expand datasets, and ensure compatibility with evolving GML libraries and hardware. The sheer diversity of tasks and domains could make this a massive undertaking.

  2. Computational Cost for Baselines: The out-of-memory issues and high computational costs reported for Graph Transformers on larger SAT instances highlight that even running baselines on such a comprehensive benchmark can be prohibitive. This could limit participation or make state-of-the-art evaluation difficult for many research groups without access to vast computing resources.

  3. Feature Engineering vs. End-to-End Learning: The SAT results pose a fundamental question about the balance between feature engineering and end-to-end learning. While GNNs aim to learn features automatically, the superiority of SATzilla suggests domain expertise in feature design is still crucial. Future GNN research on SAT might need to explicitly incorporate such features or design architectures that can learn them better.

  4. Representational Choices: The impact of different graph representations (e.g., VG vs. VCG vs. LCG in SAT) on GNN performance is evident. Further work is needed to systematically understand which graph representations are optimal for which tasks and how GNNs can learn to leverage or even generate better representations.

  5. Complexity of Metrics: While domain-specific metrics are valuable, they can also be complex (e.g., Ad-hoc Score for AIGs, Closed Gap for SAT). Ensuring that researchers across different domains fully understand and correctly use these metrics will be important.

  6. "Simpler" Baselines: The statement that current ML-based CO methods do not compete with exact solvers or hand-crafted heuristics in solution quality, but are often faster, is an important caveat. This points to GNNs potentially serving as efficient surrogates or heuristics, but not necessarily replacements for optimality in CO.

  7. Ethical Considerations: For social network data, even if anonymized, the "Right to Erasure" and continuous updates are mentioned. Ensuring robust mechanisms for privacy and data governance in dynamic real-world datasets is crucial for the long-term viability and ethical use of such benchmarks.

    Overall, GraphBench is a highly ambitious and necessary contribution that lays foundational groundwork for the next era of graph machine learning. Its success will depend on community adoption, continuous maintenance, and a sustained effort to address the deep challenges it has so effectively exposed.
