CrossST: An Efficient Pre-Training Framework for Cross-District Pattern Generalization in Urban Spatio-Temporal Forecasting
TL;DR Summary
CrossST proposes an efficient pre-training framework to generalize urban spatio-temporal patterns across districts while mitigating computational costs. It pre-trains on diverse datasets, using frequency/time analysis and graph attention to build a pattern bank. A disentanglement strategy during fine-tuning then separates universal patterns from district-specific ones, yielding efficient, state-of-the-art performance on new districts, including few-shot and missing-data settings.
Abstract
CrossST: An Efficient Pre-Training Framework for Cross-District Pattern Generalization in Urban Spatio-Temporal Forecasting. Aoyu Liu and Yaying Zhang, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China. {liuaoyu, yaying.zhang}@tongji.edu.cn
Abstract — Urban spatio-temporal forecasting is critical for modern urban governance, especially in traffic management, resource planning, and emergency response. Despite advancements in pre-trained models for natural language processing, challenges persist in urban spatio-temporal forecasting. Existing methods struggle to identify and generalize universal cross-district spatio-temporal patterns, while computational limitations hinder the extraction of complex patterns from large-scale data. In this study, we propose CrossST, an efficient pre-training framework designed to capture universal spatio-temporal patterns across large-scale, cross-district scenarios. Specifically, CrossST performs pre-training on various large-scale spatio-temporal datasets to learn and store diverse valuable patterns in its pattern bank.
In-depth Reading
1. Bibliographic Information
- Title: CrossST: An Efficient Pre-Training Framework for Cross-District Pattern Generalization in Urban Spatio-Temporal Forecasting
- Authors: Aoyu Liu and Yaying Zhang. Their affiliation is the Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China.
- Journal/Conference: The paper does not explicitly state the conference or journal name in the provided text, but the formatting and content suggest a top-tier conference in data mining (like KDD, ICDE), artificial intelligence (like AAAI, IJCAI), or machine learning (like ICML, NeurIPS). The inclusion of an Index Terms section is common in IEEE publications.
- Publication Year: The paper references works from 2024 and even 2025 (likely accepted papers for future conferences), suggesting it was written in or around 2024.
- Abstract: The paper introduces CrossST, a pre-training framework for urban spatio-temporal forecasting. The goal is to learn universal patterns (like traffic periodicity) across different urban districts and generalize them to new ones. Existing methods struggle with this generalization and are often too computationally expensive for large-scale data. CrossST addresses this by pre-training on diverse datasets to populate a "pattern bank." It uses frequency/time domain analysis for temporal patterns and graph attention for spatial patterns. A key innovation is a "spatio-temporal disentanglement" strategy during fine-tuning, which separates universal patterns from the diverse pre-trained ones to improve generalization. The framework is designed to be computationally efficient, using optimization strategies to reduce costs. Experiments show CrossST outperforms state-of-the-art models in various scenarios (including few-shot and missing data) while being more efficient.
- Original Source Link: The paper is available at /files/papers/68e6174f8137bcc94217f294/paper.pdf. The code and datasets are at https://github.com/Aoyu-Liu/CrossST.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Spatio-temporal forecasting models (e.g., for traffic prediction) often perform well within the district they are trained on but fail to generalize to new, unseen districts. This is because each urban district has unique spatial structures and data distributions, making it hard to transfer knowledge.
- Importance & Gaps: As cities become smarter, accurate forecasting across entire urban clusters is crucial for traffic management and resource planning. However, two main challenges hinder progress:
- Poor Generalization: Existing models struggle to identify and extract universal patterns (e.g., morning/evening rush hours, congestion propagation) that are common across different districts.
- Computational Inefficiency: Learning from large-scale, long-term data is necessary to capture complex patterns, but state-of-the-art models are often too computationally intensive, making this impractical.
- Innovation: CrossST introduces a novel pre-training and fine-tuning paradigm specifically designed for cross-district generalization. Its key idea is to first learn a wide variety of spatio-temporal patterns from many districts and then, during adaptation to a new district, intelligently disentangle the universal patterns from the district-specific ones. It also incorporates several optimizations to make this process efficient on large datasets.
- Main Contributions / Findings (What):
- Cross-District Pattern Disentanglement: The paper proposes a novel fine-tuning strategy that uses knowledge distillation to separate universal spatio-temporal patterns from a diverse, pre-trained "pattern bank." This significantly improves the model's ability to generalize to new urban forecasting tasks.
- Efficient Computational Framework: CrossST integrates several optimization techniques. It uses temporal information aggregation to reduce the sequence length and a linear-complexity graph attention mechanism to handle large spatial graphs efficiently. This makes pre-training on large-scale, long-term data feasible.
- Comprehensive Evaluation: The framework is rigorously tested by pre-training on five large urban datasets and fine-tuning on four different ones. It demonstrates superior performance over existing state-of-the-art models in standard forecasting, as well as in challenging scenarios with limited training samples (few-shot) or missing data, all while maintaining low computational overhead.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Spatio-Temporal Forecasting: This is the task of predicting future values at multiple locations over time, considering both temporal dependencies (how values at one location evolve) and spatial dependencies (how values at different locations influence each other). A classic example is predicting traffic speed at all sensors in a city's road network.
- Spatio-Temporal Graph Neural Networks (STGNNs): These are a class of deep learning models that have become the standard for spatio-temporal forecasting. They represent spatial locations (e.g., traffic sensors) as nodes in a graph and use Graph Neural Networks (GNNs) to model spatial relationships, while using other components like Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) to model temporal patterns.
- Pre-training and Fine-tuning: This is a popular paradigm in deep learning, especially in Natural Language Processing (NLP). A large, general model is first "pre-trained" on a massive amount of data to learn fundamental patterns (e.g., grammar and semantics for language). Then, this pre-trained model is "fine-tuned" on a smaller, task-specific dataset, allowing it to adapt its general knowledge to the specific task quickly and effectively.
- Knowledge Distillation: A technique where a smaller "student" model learns to mimic the behavior of a larger, pre-trained "teacher" model. This is often used to transfer knowledge and compress models.
CrossST uses this principle to transfer universal patterns from its pre-trained components to its task-specific components.
- Previous Works & Technological Evolution:
- Early STGNNs (STGCN, DCRNN): These models established the foundation by combining graph convolutions for space and sequence models for time. However, they often relied on fixed, predefined graphs (e.g., based on physical road connections) and struggled with dynamic spatial relationships.
- Dynamic Graph Models (GWNet, AGCRN, DGCRN): To address the limitation of fixed graphs, newer models introduced adaptive or dynamically generated graphs. GWNet learns a static adaptive adjacency matrix, while others like DGCRN generate graphs that change over time based on the input data.
- Transformer-based Models (PDFormer): Inspired by the success of Transformers in NLP, researchers began incorporating attention mechanisms into STGNNs to better capture long-range dependencies in both space and time.
- Spatio-Temporal Pre-training (STEP, STD-MAE): These methods applied the pre-training concept to spatio-temporal data. However, they typically pre-train and fine-tune on data from the same district, aiming to leverage long-term historical data for better short-term forecasting. They do not address the cross-district generalization problem.
- Cross-District Pre-training (FlashST, OpenCity): These are the most direct predecessors. They aim to pre-train on multiple districts to learn generalizable knowledge. FlashST uses prompt-tuning, while OpenCity aims to be a foundational model for spatio-temporal graphs.
- Differentiation: CrossST distinguishes itself from prior work in two key ways:
- Explicit Pattern Disentanglement: Unlike other cross-district models that implicitly hope for generalization, CrossST introduces a specific spatio-temporal pattern disentanglement strategy during fine-tuning. It actively separates universal patterns from the noise of diverse pre-trained patterns using a dual-loss (KL Divergence and InfoNCE) distillation approach.
- Focus on Efficiency and Scalability: CrossST is built from the ground up to be efficient. It uses temporal aggregation to shorten sequences and linear attention to handle large graphs, allowing it to process much larger datasets than models like OpenCity, which struggles with scalability. This efficiency is crucial for learning from the vast amounts of data needed to find universal patterns.
4. Methodology (Core Technology & Implementation)
The CrossST framework operates in two stages: pre-training and fine-tuning. Its architecture is composed of several foundational modules that are optimized for efficiency.
Figure 2 (method overview): the pre-training stage (blue dashed box), where Patch Embedding feeds a Temporal Module and a Spatial Module that extract spatio-temporal features and store pre-trained patterns; and the fine-tuning stage (red dashed box), where a spatio-temporal disentanglement strategy uses Temporal Distillation and Spatial Distillation modules (computing KL divergence and InfoNCE losses, respectively) to distill personalized patterns, with the fused features passed to a predictor. Blue snowflakes mark pre-trained (frozen) parameters; red flames mark fine-tuned (trainable) parameters.
As shown in Figure 2, the framework uses a dual-path architecture during fine-tuning. The top path consists of pre-trained, frozen modules (marked with snowflakes ❄️) that act as a knowledge repository. The bottom path consists of task-specific, trainable modules (marked with flames 🔥) that adapt to the new district. The core innovation, the Spatio-Temporal Pattern Disentanglement Strategy, connects these two paths.
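To make the dual-path setup concrete, here is a minimal PyTorch-style sketch of freezing the pre-trained path while training the personalized path; all module and attribute names are illustrative placeholders, not identifiers from the released code.

```python
import torch
import torch.nn as nn

class DualPathFineTuner(nn.Module):
    """Hypothetical wrapper: a frozen pre-trained path plus a trainable personalized path."""

    def __init__(self, pre_temporal, pre_spatial, per_temporal, per_spatial, predictor):
        super().__init__()
        # Frozen knowledge repository (the snowflake path).
        self.pre_temporal, self.pre_spatial = pre_temporal, pre_spatial
        for p in list(self.pre_temporal.parameters()) + list(self.pre_spatial.parameters()):
            p.requires_grad = False
        # Task-specific trainable modules (the flame path).
        self.per_temporal, self.per_spatial = per_temporal, per_spatial
        self.predictor = predictor

    def forward(self, x):
        with torch.no_grad():  # the pre-trained path only supplies distillation targets
            t_pre = self.pre_temporal(x)
            s_pre = self.pre_spatial(t_pre)
        t_per = self.per_temporal(x)          # personalized path adapts to the new district
        s_per = self.per_spatial(t_per)
        y_hat = self.predictor(torch.cat([s_per, s_pre], dim=-1))
        return y_hat, (t_per, t_pre), (s_per, s_pre)
```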
A. Foundational Modules
These modules form the building blocks for both the pre-trained and personalized parts of the network.
1) Patch Embedding
To handle long input sequences (e.g., 96 time steps) efficiently, CrossST first groups consecutive time steps into "patches."
Figure (patch embedding): the temporal signal is split into small patches and passed through a linear mapping; the result is a content embedding plus a corresponding temporal embedding used for subsequent spatio-temporal representation.
- Process: An input sequence of shape T × N × C (time steps × nodes × features) is divided along the time axis into patches of size P, reshaping the input to (T/P) × N × (P·C).
- Embedding: Each patch is then passed through a linear layer to project it into a higher-dimensional embedding space of size D.
- Positional Information: A learnable temporal embedding is added to retain information about the position of each patch in the original sequence, so the module's output is the linear patch projection plus the temporal embedding. This reduces the sequence length from T to T/P, significantly lowering the computational cost for subsequent modules (see the sketch below).
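A minimal sketch of this step in PyTorch, assuming an input of shape (batch, T, N, C), patch size P, and hidden size D; the class and tensor names are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Group P consecutive time steps into one patch, embed it, and add a positional term."""

    def __init__(self, in_dim, patch_size, hidden_dim, num_patches):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * in_dim, hidden_dim)
        # Learnable temporal (positional) embedding over the T/P patch positions.
        self.pos_emb = nn.Parameter(torch.zeros(num_patches, 1, hidden_dim))

    def forward(self, x):                                    # x: (B, T, N, C)
        B, T, N, C = x.shape
        P = self.patch_size
        x = x.reshape(B, T // P, P, N, C)                    # split time into T/P patches of length P
        x = x.permute(0, 1, 3, 2, 4).reshape(B, T // P, N, P * C)
        return self.proj(x) + self.pos_emb                   # (B, T/P, N, D)

# Example: 96 time steps with patch size 12 -> 8 patches per node.
emb = PatchEmbedding(in_dim=1, patch_size=12, hidden_dim=64, num_patches=8)
out = emb(torch.randn(2, 96, 50, 1))                         # -> torch.Size([2, 8, 50, 64])
```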
2) Temporal Module
This module is designed to capture complex temporal patterns like periodicity and trends from the patched embeddings. It cleverly combines frequency-domain and time-domain analysis.
Figure (temporal module): the patched input is converted to the frequency domain via a Fast Fourier Transform (FFT) and matched against the temporal pattern bank; an inverse FFT (IFFT) restores the time-domain signal, which is then fed to a temporal convolutional network (TCN) that reduces the representation, yielding the refined temporal features.
- Frequency Domain Analysis:
- The module first applies a Fast Fourier Transform (FFT) to convert the temporal sequence for each node into the frequency domain. This makes periodic patterns explicit.
- A trainable temporal pattern bank (a set of complex-valued weights) is multiplied element-wise with the frequency representation. This bank learns to amplify important frequencies (representing universal periodic patterns) and suppress noise.
- An Inverse Fast Fourier Transform (IFFT) converts the filtered frequency representation back to the time domain.
- Time Domain Analysis & Aggregation:
- A Temporal Convolutional Network (TCN) is then applied to the recovered time-domain representation to capture local dependencies between adjacent patches.
- Crucially, the TCN is structured to progressively aggregate temporal information, reducing the sequence length from T/P down to 1. The final output has a shape of N × D, effectively summarizing the entire time series for each node into a single hidden representation. This is a key efficiency strategy (a simplified sketch follows).
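The sketch below illustrates the idea of frequency-domain filtering with a learnable complex-valued pattern bank followed by aggregation over the patch axis; the averaging at the end is only a stand-in for the paper's TCN that progressively shrinks the sequence to length 1, and all names are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyFilterAggregate(nn.Module):
    """FFT -> learnable complex filter (pattern bank) -> IFFT -> aggregate over patches."""

    def __init__(self, num_patches, hidden_dim):
        super().__init__()
        n_freq = num_patches // 2 + 1  # rFFT length along the patch axis
        # Complex-valued "temporal pattern bank": one weight per frequency and channel.
        self.pattern_bank = nn.Parameter(torch.randn(n_freq, hidden_dim, dtype=torch.cfloat) * 0.02)

    def forward(self, h):                                    # h: (B, T/P, N, D)
        t_p = h.shape[1]
        freq = torch.fft.rfft(h, dim=1)                      # to the frequency domain
        freq = freq * self.pattern_bank[None, :, None, :]    # amplify/suppress frequencies
        h = torch.fft.irfft(freq, n=t_p, dim=1)              # back to the time domain
        # Stand-in for the TCN: collapse the patch axis to a single representation per node.
        return h.mean(dim=1)                                 # (B, N, D)
```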
3) Spatial Module
This module takes the temporally-summarized representation and models the dynamic spatial relationships between nodes.
Figure (spatial module): linear transforms produce the query and value; a pattern-aware graph attention over the graph pattern bank computes spatial pattern weights, reducing the complexity from O(N²) to O(N). A feed-forward path of linear layer, GLU activation, and linear layer, fused through element-wise multiplication, dynamically captures and refines spatial features.
- Pattern-Aware Graph Attention: The core of this module is a highly efficient linear attention mechanism.
- Challenge: Standard attention mechanisms in Transformers have a computational complexity of O(N²), where N is the number of nodes. This is prohibitive for large graphs (e.g., thousands of sensors).
- Solution: CrossST uses a linear attention mechanism based on a random feature mapping φ(·). By reordering the matrix multiplications, it reduces the complexity to O(N).
- Formula: The attention calculation is approximated as Attn(Q, K, V) ≈ φ(Q)(φ(K)ᵀV), where Q, K, V are the query, key, and value matrices (a sketch follows after this list).
- Spatial Pattern Bank: Similar to the temporal module, this module uses a trainable spatial pattern bank.
- The pattern bank acts as the Key (K) in the attention mechanism. This allows the model to learn a set of universal spatial patterns (e.g., interactions between downtown and suburban areas).
- The query and value are used in gating mechanisms to modulate the information flow, helping the model select the most relevant spatial patterns.
- Gated Feed-Forward Network: After the attention layer, a Feed-Forward Network with a Gated Linear Unit (GLU) is used to apply a non-linear transformation, further refining the spatial representations.
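Below is a hedged sketch of how such a pattern-aware linear attention might look, under our assumption (not stated verbatim in the paper) that both the keys and the attended values come from a learnable pattern bank while node features supply the queries and an element-wise gate; φ is a simple positive feature map (elu + 1), a common stand-in for random feature mappings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def phi(x):
    """Positive feature map standing in for the random feature mapping."""
    return F.elu(x) + 1.0

class PatternAwareLinearAttention(nn.Module):
    """Linear attention: queries come from nodes, keys/values from a pattern bank of size M."""

    def __init__(self, hidden_dim, num_patterns):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.gate_proj = nn.Linear(hidden_dim, hidden_dim)
        self.key_bank = nn.Parameter(torch.randn(num_patterns, hidden_dim) * 0.02)    # pattern keys
        self.value_bank = nn.Parameter(torch.randn(num_patterns, hidden_dim) * 0.02)  # pattern values

    def forward(self, h):                                     # h: (B, N, D)
        q = phi(self.q_proj(h))                               # (B, N, D)
        k = phi(self.key_bank)                                # (M, D)
        # Reordered multiplication: build (K^T V) once, so cost is linear in N.
        kv = torch.einsum('md,me->de', k, self.value_bank)    # (D, D)
        norm = 1.0 / (q @ k.sum(dim=0) + 1e-6)                # (B, N) normalization
        out = torch.einsum('bnd,de->bne', q, kv) * norm.unsqueeze(-1)
        # Element-wise gating with a projection of the node features.
        return out * torch.sigmoid(self.gate_proj(h))         # (B, N, D)
```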
B. Spatio-Temporal Pre-training
- Process: In this stage, only the universal modules (❄️) are used. The model is trained on a large collection of datasets from multiple districts in a standard supervised forecasting manner.
- Goal: The primary goal is to train the temporal and spatial modules to capture a wide variety of patterns and store them in the temporal and spatial pattern banks.
- Loss Function: The pre-training loss is the Mean Absolute Error (MAE) between the model's prediction and the ground truth.
C. Spatio-Temporal Fine-tuning
This is where CrossST's main novelty lies. When adapting to a downstream task in a new district, the model uses both the frozen pre-trained modules (❄️) and a new, smaller set of trainable personalized modules (🔥).
- Spatio-Temporal Pattern Disentanglement Strategy: This strategy uses knowledge distillation to transfer only the relevant, universal patterns from the pre-trained modules to the personalized ones.
- Temporal Distillation: The goal is to make the temporal patterns captured by the personalized temporal module similar to the universal patterns from the pre-trained temporal module. Kullback-Leibler (KL) Divergence is used to measure the difference between their output distributions. This encourages the personalized module to learn universal periodicities and trends.
- Spatial Distillation: Spatial patterns can be more district-specific. The goal here is to align the personalized spatial representation with relevant universal patterns from the pre-trained spatial module while distinguishing them from irrelevant ones. InfoNCE loss, a contrastive learning loss, is used. It pulls similar spatial patterns (positive pairs) closer together and pushes dissimilar ones (negative pairs) apart.
- Final Prediction: The outputs from the personalized temporal and spatial modules and filtered versions of the pre-trained modules' outputs are combined and fed into a final prediction layer.
- Total Loss: The fine-tuning process is optimized using a combined loss function, L_total = L_pred + λ1·L_KL + λ2·L_InfoNCE, where λ1 and λ2 are hyperparameters that balance the loss components (a sketch of these losses is given below).
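A hedged sketch of these losses follows; the softmax-over-features view of the KL term, the per-node positive pairing in InfoNCE, and the λ values are illustrative choices, since the paper's exact formulations are not reproduced here.

```python
import torch
import torch.nn.functional as F

def temporal_distillation(h_per, h_pre, tau=1.0):
    """KL divergence pulling the personalized temporal representation toward the frozen one."""
    student = F.log_softmax(h_per / tau, dim=-1)
    teacher = F.softmax(h_pre.detach() / tau, dim=-1)         # teacher path is frozen
    return F.kl_div(student, teacher, reduction="batchmean")

def spatial_distillation(h_per, h_pre, tau=0.1):
    """InfoNCE: each node's personalized embedding matches its own pre-trained embedding
    (positive pair) and is pushed away from all other nodes' embeddings (negatives)."""
    z1 = F.normalize(h_per.flatten(0, -2), dim=-1)            # (B*N, D)
    z2 = F.normalize(h_pre.detach().flatten(0, -2), dim=-1)   # (B*N, D)
    logits = z1 @ z2.t() / tau                                # pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)       # diagonal entries are positives
    return F.cross_entropy(logits, labels)

def fine_tuning_loss(y_hat, y, t_per, t_pre, s_per, s_pre, lam1=0.1, lam2=0.1):
    """L_total = L_pred (MAE) + lam1 * L_KL + lam2 * L_InfoNCE (lambda values illustrative)."""
    l_pred = (y_hat - y).abs().mean()
    return l_pred + lam1 * temporal_distillation(t_per, t_pre) + lam2 * spatial_distillation(s_per, s_pre)
```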
5. Experimental Setup
- Datasets:
- The study uses the LargeST dataset, which contains traffic data from the California Department of Transportation (CalTrans). This dataset is significantly larger than those used in many previous studies, making it ideal for testing scalability and cross-district generalization.
- Pre-training Data: Data from five districts (CA-D4, D7, D8, D11, D12) collected in December 2018. See Table I in the paper for details.
- Fine-tuning (Downstream Task) Data: Data from four different districts (CA-D3, D5, D6, D10) collected from January to February 2019. This separation in time and space ensures a rigorous test of generalization. See Table II and Figure 7 for details.
- Evaluation Metrics:
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values. It is easy to interpret.
- Root Mean Squared Error (RMSE): The square root of the average of squared differences. It penalizes large errors more heavily than MAE.
- Mean Absolute Percentage Error (MAPE): The average of absolute percentage errors. It expresses error as a percentage of the actual value, making it scale-independent.
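Written out, using the standard definitions with $y_i$ the ground truth and $\hat{y}_i$ the prediction over $n$ evaluated points:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert,\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}},\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i-\hat{y}_i}{y_i}\right\rvert
```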
- Baselines: CrossST is compared against a comprehensive set of 12 state-of-the-art models from three categories:
- Spatio-Temporal Forecasting Methods: STGCN, AGCRN, GWNet, MTGNN, PDFormer, STWave.
- Multivariate Time Series Forecasting Methods: Crossformer, PatchTST, iTransformer, TimeMixer. These models are strong at long-term forecasting but do not explicitly model spatial dependencies.
- Cross-District Spatio-Temporal Pre-Training Methods: FlashST, OpenCity. These are the most direct competitors.
6. Results & Analysis
The paper evaluates CrossST in three scenarios: forecasting on the original full dataset, on a small subset (few-shot, 10% of training data), and on a dataset with missing values (30% of training data randomly removed). For the detailed results, please refer to Table III and Table IV in the original paper.
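For concreteness, here is one plausible way such scenario splits could be constructed; the paper does not publish this exact procedure, so the function name and the value-masking choice are our assumptions, while the 10% and 30% fractions follow the text.

```python
import numpy as np

def make_scenarios(train_x, train_y, few_shot_ratio=0.10, missing_ratio=0.30, seed=0):
    """Build the three training scenarios: original, few-shot, and missing-data."""
    rng = np.random.default_rng(seed)

    # Few-shot: keep only 10% of the training samples.
    n = len(train_x)
    keep = rng.choice(n, size=int(n * few_shot_ratio), replace=False)
    few_x, few_y = train_x[keep], train_y[keep]

    # Missing data: randomly mask 30% of the observed input entries (mask kept for reference).
    observed = rng.random(train_x.shape) >= missing_ratio
    miss_x = np.where(observed, train_x, 0.0)

    return {
        "original": (train_x, train_y),
        "few_shot": (few_x, few_y),
        "missing": (miss_x, train_y, observed),
    }
```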
- Core Results:
- CrossST consistently outperforms all 12 baseline models across all four downstream datasets and in all three scenarios (original, few-shot, and missing data).
- The performance improvement is particularly significant in the few-shot and missing data scenarios. For example, CrossST achieves average MAE improvements of 7.48% (few-shot) and 12.29% (missing data) over the next best model. This demonstrates that the universal patterns learned during pre-training provide robust knowledge that is especially valuable when downstream data is scarce or incomplete.
- Even with full data, CrossST outperforms specialized STGNNs like GWNet, indicating that the cross-district patterns it learns are beneficial even when sufficient in-district data is available.
- Ablation Study:
Figure: bar charts comparing CrossST with its ablated variants (w/o Pre-Train, w/o STD, w/o TM, w/o SM) on MAE, RMSE, and MAPE across CA-D3, CA-D5, CA-D6, and CA-D10; CrossST is best on every metric.
This study confirms the importance of each key component of CrossST. The figure shows that removing any component leads to a significant drop in performance (higher MAE/RMSE/MAPE).
- w/o Pre-Train: Performance degrades the most, proving that pre-training on diverse districts is the cornerstone of the model's success.
- w/o STD: Removing the Spatio-temporal Disentanglement strategy also causes a major performance drop, highlighting its crucial role in transferring knowledge effectively.
- w/o TM and w/o SM: Removing the temporal or spatial modules confirms that both are essential for capturing the complex dependencies in the data.
- Scaling Law:
Figure: a grid of small line charts showing, for CA-D3, CA-D5, CA-D6, and CA-D10, how the number of pre-trained parameters, the pre-training data volume, and the number of fine-tuned parameters affect MAE (blue stars) and RMSE (grey triangles); errors generally fall as parameters and data grow, with the exact values and trends varying slightly by dataset.
This analysis explores how performance changes with model size and pre-training data volume.
- Model Size: Performance improves as the number of parameters increases from Tiny (0.96M) to Base (3.74M), but the gains diminish with the Plus model (5.56M). This suggests that the Base size is a good trade-off between performance and complexity for this task.
- Data Volume: Using more pre-training data generally helps, but the improvement flattens after using 50% of the data. This indicates that two weeks of data is sufficient to learn the primary periodic patterns.
- Case Study (Interpretability):
Figure: multi-panel visualization of CA-D5 nodes. (a) overall node distribution in the two-dimensional t-SNE space; (b) t-SNE-based clustering showing four colored clusters; (c)(e)(g) traffic flow over time for representative nodes in clusters 1–3; (d)(f)(h) the geographic locations of those nodes, colored by cluster, highlighting regional grouping.
This study visualizes the learned spatial pattern bank for the CA-D5 district.
- The t-SNE plot (b) shows that the learned spatial patterns form distinct clusters.
- When nodes from these clusters are mapped back to their geographical locations (d, f, h), they are found to be geographically close.
- Furthermore, the traffic flow patterns of nodes within the same cluster are highly similar (c, e, g), while patterns between clusters are different.
- This provides strong evidence that the pattern bank is learning meaningful, interpretable real-world patterns, such as identifying different functional zones or corridors within the city.
- Efficiency and Scalability Study:
Figure: three panels. (a) per-sample memory usage versus number of nodes for different models, with vanilla attention running out of memory at 50,000 nodes; (b) pre-training time and memory versus number of nodes, with point size indicating memory footprint; (c) a radar chart comparing fine-tuning-stage costs and accuracy (missing-data MAE, few-shot MAE, original MAE, memory usage, training and inference time), where CrossST performs well on most axes.
This analysis demonstrates CrossST's computational advantages.
- Ablation on Efficiency (a): Removing the linear attention (Vanilla Att) or temporal aggregation (w/o TM) leads to drastically higher memory usage, with the standard attention model running out of memory on large graphs.
- Pre-training Cost (b): Compared to other cross-district models, CrossST is significantly more efficient. It uses 33.87% less memory and trains faster than FlashST on a 2000-node graph. OpenCity fails to scale to larger graphs due to high resource requirements.
- Fine-tuning Cost (c): The radar chart shows that CrossST achieves the best balance of performance (low MAE) and efficiency (low memory usage, fast training/inference) compared to all top-performing baselines.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents CrossST, an efficient and effective pre-training framework for cross-district spatio-temporal forecasting. Its core strengths are the novel spatio-temporal pattern disentanglement strategy, which enables robust knowledge transfer, and its highly optimized architecture, which allows it to learn from large-scale data efficiently. By learning and generalizing universal patterns, CrossST sets a new state-of-the-art, especially in data-limited scenarios.
- Limitations & Future Work: The authors state their intention for future work is to extend the CrossST framework to support generalization across multimodal spatio-temporal tasks (e.g., incorporating weather, events, or social media data). This would move towards creating a more universal foundational model for urban computing.
- Personal Insights & Critique:
- Strengths: The paper's main strength is its dual focus on both generalization performance and computational efficiency. The disentanglement strategy is an elegant and intuitive solution to the problem of negative transfer (where irrelevant patterns from pre-training hurt downstream performance). The thorough experimental validation, including scalability and case studies, makes the claims very convincing.
- Potential Limitations:
- The success of CrossST relies on the assumption that truly universal patterns exist and are sufficiently represented in the pre-training datasets. If the pre-training districts are fundamentally different from the fine-tuning district (e.g., a grid-like American city vs. a historic European city), the generalization might be less effective.
- The disentanglement strategy relies on two specific loss functions (KL and InfoNCE). The choice of these losses and their weighting hyperparameters might require careful tuning for different types of spatio-temporal data.
- Future Impact: CrossST represents a significant step towards building true "foundation models" for spatio-temporal data, much like what has been achieved in NLP. Its emphasis on efficiency makes it practical for real-world deployment by city governments or logistics companies. The concept of a learned, interpretable "pattern bank" could also open up new avenues for understanding and analyzing urban dynamics beyond just forecasting.