AiPaper


CrossST: An Efficient Pre-Training Framework for Cross-District Pattern Generalization in Urban Spatio-Temporal Forecasting

Paper status: completed

Total: 2
Completed: 2 · In progress: 0 · Failed: 0
gemini-2.5-pro
Submitted to queue 10/8/2025, 3:54:21 PM · Completed 10/8/2025, 3:55:23 PM

Completed

Chinese Review



Synopsis of the paper

This paper proposes CrossST, an efficient pre-training framework designed to address the challenges of cross-district pattern generalization and high computational cost in urban spatio-temporal forecasting. The framework is pre-trained on multiple large-scale spatio-temporal datasets and stores the learned universal spatio-temporal patterns in a pattern bank. Its temporal module captures periodicity and trends through frequency- and time-domain analysis, while its spatial module employs a graph attention mechanism to identify dynamic spatial relationships. In the fine-tuning stage, the paper introduces a spatio-temporal disentanglement strategy that separates universal patterns from task-specific ones via knowledge distillation to improve generalization on downstream tasks. Experimental results show that CrossST outperforms current mainstream baseline models while maintaining low computational overhead.

Summary of Review

This paper proposes a novel solution to an important and practical problem in urban spatio-temporal forecasting: cross-district generalization. Its core strengths lie in the innovative CrossST framework, particularly its spatio-temporal disentanglement strategy, which is conceptually clear and well supported by experiments (see Figure 3 and Figure 8). In addition, the paper's focus on computational efficiency, demonstrated effectively in the experiments, greatly strengthens the method's potential for real-world large-scale deployment (see Figure 2). However, the paper falls short in explaining several key technical details: the working mechanism of the core "Pattern Bank" component is described only vaguely (see Figure 5 and Figure 6), and some mathematical formulations are underspecified, which hurts reproducibility (see the section on Spatio-Temporal Disentanglement).

Strengths

  • Novel and Well-Motivated Framework

    • The motivation is clear and compelling: Figure 1 intuitively illustrates the "commonalities" and "individualities" of spatio-temporal patterns across urban districts, providing a solid foundation for the proposed cross-district pre-training and pattern disentanglement.
    • The proposed pre-training/fine-tuning framework (Figure 3) is structurally complete, and its core Spatio-Temporal Disentanglement (STD) strategy is a novel contribution that targets the limited downstream generalization of pre-trained models; it is highly instructive.
    • Comprehensive ablation experiments (Figure 8) convincingly demonstrate the necessity of each component: removing pre-training (w/o Pre-Train) or the disentanglement module (w/o STD) causes significant drops across all metrics, validating the technical soundness of the design.
  • Emphasis on Computational Efficiency and Scalability

    • The paper treats computational efficiency as a core design goal and supports it with convincing experimental evidence. Figures 2(a) and 2(b) show that, compared with the baselines, CrossST has clear advantages in memory footprint and training time when handling large-scale node sets, addressing a key pain point in practical applications.
    • The linear-complexity graph attention mechanism in the spatial module (reducing complexity from O(N^2) to O(N), as mentioned for Figure 6) is a key technical optimization that directly improves scalability to large spatial networks.
    • The radar chart in Figure 2(c) gives a multi-dimensional assessment of fine-tuning costs (prediction accuracy, memory, training/inference time), clearly showing CrossST's excellent balance between performance and efficiency and strengthening its practical value.
  • Thorough and Convincing Experimental Evaluation

    • The experimental design is comprehensive, covering performance comparisons against multiple mainstream baselines as well as detailed ablation studies (Figure 8) and parameter sensitivity analyses (Figure 9), which provide solid empirical support for the paper's conclusions.
    • A case study (Figure 10) visualizes the learned representations: the t-SNE clustering results show that the model effectively separates geographic regions with distinct traffic patterns, offering intuitive and compelling qualitative evidence of effectiveness and strengthening the credibility of the results.
    • The authors promise to release the code and datasets (Abstract, last sentence), which reflects good academic practice and greatly facilitates reproducibility and further community development.

Weaknesses

  • Lack of Clarity on the "Pattern Bank" Mechanism

    • The paper introduces a "Pattern Bank" in both the temporal module (Figure 5) and the spatial module (Figure 6) but never details its internal mechanism, e.g., how patterns are initialized, stored, indexed, and updated. These missing details leave the core mechanism opaque and hard to reproduce.
    • The temporal module description mentions that input signals are "matched" against patterns in the bank, but no concrete matching algorithm or similarity metric is defined, so readers cannot tell how the model selects the most relevant temporal patterns from the bank.
    • Likewise, in the spatial module, it is never clearly explained how "Pattern-Aware Graph Attention" exploits the "graph pattern bank". The concrete form of these "graph patterns" (e.g., subgraph structures or parameterized adjacency matrices) and their role in the attention computation need more detailed explanation.
  • Ambiguity in Mathematical Formulations and Definitions

    • In the description of the spatio-temporal disentanglement strategy (related to Figure 3), the paper distinguishes "personalized patterns" (Z_p) from "universal patterns" (Z_u) but gives no mathematical definition of how the two are formally separated from the backbone's output. This separation is the key to the disentanglement, and its absence impedes understanding of the method's core idea.
    • The temporal distillation (KL divergence) and spatial distillation (InfoNCE loss) mentioned in the paper (per the description of Figure 3) lack complete formula definitions. For example, which two probability distributions does the KL divergence act on? How are positive and negative samples constructed for the InfoNCE loss in this spatio-temporal context? These details are essential for implementation.
    • Notational consistency needs improvement. For example, Patch Embedding (Figure 4) produces X_emb and T_emb, but how these are precisely used and integrated in the formulas of later sections is unclear, making it difficult for readers to trace the information flow.
  • Limited Discussion on Pre-training Data and Strategy

    • The abstract claims pre-training on "various large-scale spatio-temporal datasets", yet the experiments (Figure 7) appear to involve only datasets from different districts of California. If both pre-training and fine-tuning stay within the same macro-region, the claim that the model learns "universal" patterns may need more cautious wording, or should be strengthened with experiments on more diverse data (e.g., cross-city or cross-country).
    • Figure 9 shows how the amount of pre-training data affects performance, but the paper does not explain how this amount is varied (e.g., by adding districts or by extending the observation window), which makes it hard for readers to assess the model's data scalability.
    • The paper never explicitly defines the self-supervised task or objective function used during pre-training. The pre-training objective is the cornerstone of the model design (e.g., masked prediction vs. contrastive learning), so its omission is a significant gap in the method description (No direct evidence found in the manuscript.).

Suggestions for Improvement

  • Elaborate on the Pattern Bank Mechanism

    • Add a dedicated subsection describing the design of the pattern bank. Specify its data structure (e.g., a learnable embedding matrix) and detail how patterns are initialized, stored, and updated during training.
    • For the temporal module, explicitly define the "matching" function used to select patterns from the bank, e.g., state whether cosine similarity in the frequency domain is used, and explain how the top-K most relevant patterns are selected and fused (a sketch of one plausible realization follows this list).
    • For the spatial module, clearly define the concrete form of the "graph patterns" and provide the full mathematical formulation of "Pattern-Aware Graph Attention", showing how these patterns interact with the node queries and keys.
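
To make the first two points concrete, here is a minimal sketch of what such a subsection could specify, assuming the bank is a learnable K x D embedding matrix queried by cosine similarity with top-K fusion; every name and design choice below is an illustrative assumption, not taken from the manuscript.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatternBank(nn.Module):
    """Hypothetical pattern bank: a learnable K x D embedding matrix,
    queried by cosine similarity and fused over the top-k matches."""

    def __init__(self, num_patterns: int = 64, dim: int = 128, top_k: int = 4):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_patterns, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) query features from the temporal encoder
        sim = F.cosine_similarity(x.unsqueeze(1), self.bank.unsqueeze(0), dim=-1)
        weights, idx = sim.topk(self.top_k, dim=-1)           # top-k matches
        weights = F.softmax(weights, dim=-1)                  # fusion weights
        selected = self.bank[idx]                             # (batch, k, dim)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)  # fused pattern

fused = PatternBank()(torch.randn(32, 128))  # (32, 128) retrieved patterns
```

A description of this kind, together with the update rule, would resolve the main reproducibility concern raised above.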
  • Refine Mathematical Formulations and Notations

    • Provide an explicit mathematical formulation for disentangling the model's output features into "personalized patterns" (Z_p) and "universal patterns" (Z_u); for example, a gating mechanism or projection layers could implement this separation (see the sketch after this list).
    • Give complete, explicit expressions for the temporal and spatial distillation losses. For the KL divergence, define the two probability distributions being compared; for the InfoNCE loss, describe in detail how positive and negative pairs are constructed.
    • Proofread the paper to check and unify the definition and use of all mathematical notation. Consider adding a notation table at the beginning of the method section so that the flow and role of every variable (e.g., X_emb and T_emb) can be traced clearly throughout the model.
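
As an illustration of the requested Z_u/Z_p formulation, a gating-based split could look like the following; the gate, the projection layers, and all names are assumptions, since the paper does not define the separation.

```python
import torch
import torch.nn as nn

class GatedDisentangle(nn.Module):
    """Hypothetical gated split of backbone features Z into a universal
    component (Z_u) and a personalized component (Z_p)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj_u = nn.Linear(dim, dim)  # projection for universal patterns
        self.proj_p = nn.Linear(dim, dim)  # projection for personalized patterns

    def forward(self, z: torch.Tensor):
        g = self.gate(z)                 # per-feature gate in (0, 1)
        z_u = self.proj_u(g * z)         # universal component
        z_p = self.proj_p((1 - g) * z)   # personalized component
        return z_u, z_p

z_u, z_p = GatedDisentangle()(torch.randn(32, 128))
```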
  • Expand on the Pre-training Strategy and Data Diversity

    • State explicitly which datasets are used for pre-training. If only data from a single region is used, consider rephrasing "universal patterns" as "cross-district patterns", or add pre-training experiments on more diverse datasets (e.g., data from different cities) to support the stronger generalization claim.
    • In the experimental setup, describe in detail how the "amount of pre-training data" variable in Figure 9 is controlled and increased; this will make the sensitivity analysis more convincing.
    • In the method section, explicitly state the self-supervised task and corresponding objective function used in the pre-training stage. A clear description of this task is essential for understanding the nature of the learned representations and for reproducing the work (a minimal example objective is sketched below).
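
For reference, masked reconstruction is one common self-supervised objective for spatio-temporal data; the sketch below is purely illustrative, since the manuscript does not state which objective CrossST actually uses.

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(encoder: nn.Module,
                               x: torch.Tensor,
                               mask_ratio: float = 0.25) -> torch.Tensor:
    """Randomly mask input readings and reconstruct them; the loss is the
    MSE on the masked positions only. Illustrative, not the paper's task."""
    mask = torch.rand_like(x) < mask_ratio   # True where the input is hidden
    x_masked = x.masked_fill(mask, 0.0)
    x_hat = encoder(x_masked)                # encoder predicts all readings
    return ((x_hat - x) ** 2)[mask].mean()

# Usage with any encoder mapping (batch, nodes, steps) -> the same shape:
loss = masked_reconstruction_loss(nn.Identity(), torch.randn(8, 100, 12))
```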

References

None

English Review

Synopsis of the paper

This paper introduces CrossST, a pre-training framework for urban spatio-temporal (ST) forecasting, designed to address challenges in generalizing patterns across different districts and managing high computational costs. The proposed method pre-trains on large-scale, diverse ST datasets to build a "pattern bank" of universal ST patterns. The model architecture includes a temporal module using frequency and time-domain analysis to capture periodicity and trends, and a spatially-optimized graph attention module to model dynamic spatial propagation with linear complexity. For downstream tasks, a spatio-temporal disentanglement strategy is employed during fine-tuning to separate universal, pre-trained patterns from task-specific ones, aiming to improve generalization. Experimental results on several real-world traffic datasets show that CrossST outperforms state-of-the-art baselines in forecasting accuracy, particularly in few-shot and missing data scenarios, while maintaining computational efficiency.

Summary of Review

This paper presents a well-motivated and timely framework, CrossST, for efficient pre-training in spatio-temporal forecasting. The key strengths of the work are its novel spatio-temporal disentanglement strategy for effective knowledge transfer during fine-tuning (Sec. 3.4, Fig. 3) and its explicit focus on computational efficiency, which is supported by strong empirical evidence (Fig. 2). However, the manuscript suffers from a lack of clarity in its mathematical formulations, particularly concerning the implementation of the pattern banks and distillation losses (Sec. 3.3, Sec. 3.4). Furthermore, the experimental evaluation could be strengthened by including comparisons with more recent pre-training-based baselines in the spatio-temporal domain (Table 2).

Strengths

  • Novel and Coherent Framework for ST Pre-Training

    • The paper addresses the important and practical problem of generalizing ST patterns across heterogeneous urban districts, which is a key limitation of many existing models (Sec. 1, Fig. 1). The pre-training and fine-tuning paradigm is a well-suited and logical approach to this problem.
    • The proposed spatio-temporal disentanglement strategy is a novel contribution that thoughtfully separates universal knowledge from task-specific patterns. This is achieved via temporal and spatial distillation, which is a technically sound method for improving transfer learning performance (Sec. 3.4, Fig. 3).
    • The overall architecture, combining a pattern bank with dedicated temporal and spatial modules, presents a comprehensive solution for capturing complex ST dependencies (Fig. 3, Fig. 5, Fig. 6).
  • Strong Emphasis on Computational Efficiency

    • The work explicitly tackles the computational burden of modeling large-scale ST graphs, a critical barrier to real-world deployment. The design of a spatial module with linear complexity (O(N)) is a significant practical advantage over standard attention mechanisms (O(N^2)) (Sec. 3.3.2); a generic sketch of such a linearization follows this group of points.
    • The efficiency claims are convincingly substantiated through empirical analysis, showing significant reductions in memory usage and training time compared to vanilla attention mechanisms, especially as the number of nodes increases (Fig. 2a, 2b).
    • The radar chart in Figure 2c effectively demonstrates that CrossST achieves a favorable trade-off, delivering high predictive accuracy while maintaining low computational overhead in terms of memory, training time, and inference time.
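
For readers less familiar with linear attention, the generic kernelized formulation below (in the style of Katharopoulos et al., 2020) shows how the O(N^2) softmax attention can be reduced to O(N); the paper does not disclose CrossST's exact linearization, so this is only a representative sketch.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized attention: applying a positive feature map to Q and K lets
    K^T V be aggregated once, so the cost is O(N d^2) instead of O(N^2 d)."""
    phi = lambda x: F.elu(x) + 1                             # positive feature map
    q, k = phi(q), phi(k)                                    # (N, d)
    kv = k.transpose(-2, -1) @ v                             # (d, d) summary
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (N, 1)
    return (q @ kv) / (normalizer + 1e-6)                    # (N, d)

out = linear_attention(*(torch.randn(5000, 64) for _ in range(3)))
```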
  • Comprehensive Experimental Evaluation and Analysis

    • The experiments are extensive, evaluating the model on multiple real-world datasets and under various challenging scenarios, including few-shot learning and forecasting with missing data. The results consistently show CrossST outperforming a wide range of baselines (Table 2, Fig. 2c).
    • The ablation studies are thorough and systematically demonstrate the positive impact of each key component of the framework, including the pre-training stage, the spatio-temporal disentanglement (STD), and the individual temporal (TM) and spatial (SM) modules (Fig. 8, Table 3).
    • The paper includes insightful qualitative analyses. For instance, the t-SNE visualizations in Figure 10 provide compelling evidence that the learned node embeddings capture meaningful spatial structures that correspond to distinct real-world traffic patterns, which enhances the model's interpretability.
    • The sensitivity analysis regarding the number of pre-training parameters and the volume of pre-training data offers valuable insights into the model's behavior and provides practical guidance for its application (Fig. 9).

Weaknesses

  • Lack of Clarity in Mathematical Formulations and Model Details

    • The concept of the "pattern bank" (P_time and P_graph) is central to the method but is not formally defined. The manuscript does not specify how these banks are initialized, structured (e.g., learnable embedding matrices), or updated during pre-training, which hinders reproducibility (Sec. 3.3.1, Sec. 3.3.2).
    • The formulation of the temporal distillation loss (L_TD in Eq. 7) is incomplete. It relies on probability distributions p_ptm and p_psn, but the paper does not define how these distributions are derived from the model's outputs (Sec. 3.4.1).
    • The implementation of the spatial distillation loss (L_SD in Eq. 8) using InfoNCE is underspecified. The paper does not explain the crucial process of how positive and negative samples are constructed from the node embeddings, which is essential for understanding the contrastive learning objective (Sec. 3.4.2).
  • Limited Comparison to State-of-the-Art Pre-Training Models

    • The set of baselines, while including many classic ST models, omits several recent and highly relevant pre-training frameworks specifically designed for spatio-temporal forecasting, such as STEP [Bai et al., 2023] and PDFormer [Jiang et al., 2023].
    • While these models are mentioned in the related work section (Sec. 2), their absence from the quantitative comparison in Table 2 makes it difficult to assess whether CrossST truly advances the state of the art in the pre-training subfield of ST forecasting.
    • The manuscript provides no justification for the exclusion of these direct competitors from the experimental evaluation.
  • Insufficient Justification for Key Design Choices

    • The rationale behind using different types of loss functions for the two distillation branches—KL divergence for temporal distillation (L_TD) and InfoNCE for spatial distillation (L_SD)—is not provided. It is unclear why this specific combination was chosen over other possible formulations for knowledge distillation or representation learning (Sec. 3.4.1, Sec. 3.4.2).
    • In the temporal module, the method relies on selecting the top-k amplitudes from the FFT to represent periodicity (Sec. 3.3.1, Fig. 5). The paper does not motivate this heuristic choice over other potentially more robust frequency-domain filtering or representation techniques.
    • The spatial module employs a Gated Linear Unit (GLU), but its specific function or benefit in this context is not explained (Sec. 3.3.2, Fig. 6). While GLUs are common, a brief justification would strengthen the architectural description.

Suggestions for Improvement

  • Elaborate on Mathematical Formulations and Model Details

    • In Section 3.3, provide a precise mathematical description of the pattern banks. Please clarify their structure (e.g., "a learnable matrix of size K x D"), their initialization method, and how they are leveraged during the forward pass.
    • In Section 3.4.1, explicitly define the probability distributions p_ptm and p_psn for the KL divergence loss (Eq. 7). For instance, specify if they are the result of applying a softmax function over the model's output logits.
    • In Section 3.4.2, detail the sampling strategy for the InfoNCE loss (Eq. 8). For example, clarify that a positive pair consists of the same node's representation from the pre-trained and fine-tuning branches, while negative pairs are formed by contrasting it with representations of all other nodes in the batch (a hedged sketch of both losses follows this list).
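
To illustrate what such definitions could look like, here is a hedged sketch of both losses under the interpretations suggested above (softmax distributions for Eq. 7, same-node positives across branches for Eq. 8); the temperatures, shapes, and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_distillation(logits_ptm, logits_psn, tau: float = 2.0):
    """Possible reading of Eq. 7: p_ptm and p_psn as temperature-softened
    softmax distributions over the two branches' output logits."""
    log_p_ptm = F.log_softmax(logits_ptm / tau, dim=-1)
    p_psn = F.softmax(logits_psn / tau, dim=-1)
    return F.kl_div(log_p_ptm, p_psn, reduction="batchmean") * tau ** 2

def spatial_distillation(z_ptm, z_psn, tau: float = 0.1):
    """Possible reading of Eq. 8: InfoNCE where the positive pair is the same
    node's embedding in both branches and negatives are all other nodes."""
    z_ptm = F.normalize(z_ptm, dim=-1)
    z_psn = F.normalize(z_psn, dim=-1)
    logits = z_psn @ z_ptm.t() / tau          # (N, N) similarity matrix
    targets = torch.arange(z_psn.size(0))     # diagonal entries are positives
    return F.cross_entropy(logits, targets)

l_td = temporal_distillation(torch.randn(32, 12), torch.randn(32, 12))
l_sd = spatial_distillation(torch.randn(100, 64), torch.randn(100, 64))
```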
  • Strengthen Experimental Comparisons

    • To provide a more conclusive assessment of the model's performance, please include recent pre-training-based ST forecasting models like STEP and PDFormer in the main results table (Table 2).
    • If adding these baselines is computationally prohibitive, consider adding a new subsection to the appendix or the main paper that provides a detailed qualitative comparison, discussing the architectural differences and expected performance trade-offs between CrossST and these other SOTA pre-training methods.
    • In the experimental setup (Sec. 4.1), briefly justify the criteria used for selecting the reported baselines to make the scope of the comparison clearer.
  • Provide Rationale for Methodological Design Choices

    • In Section 3.4, add a brief discussion motivating the use of KL divergence for the temporal task and InfoNCE for the spatial task. For example, you could argue that KL divergence is well-suited for matching the predictive distributions of continuous time-series values, whereas a contrastive loss like InfoNCE is more effective for learning discriminative structural representations of the graph.
    • In Section 3.3.1, add a sentence to justify the top-k FFT amplitude selection, for instance, by stating that it is a common and efficient heuristic for capturing dominant periodic signals while filtering high-frequency noise.
    • In Section 3.3.2, briefly explain the role of the GLU, clarifying that it serves as a gating mechanism to control the flow of information through the spatial feature transformation layers (a minimal example follows below).
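
As a point of reference, a minimal GLU of the kind described above could be written as follows; this is a generic formulation, not CrossST's actual layer.

```python
import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    """Minimal gated linear unit: one linear path carries the features, and a
    second sigmoid-gated path decides how much of each feature passes."""

    def __init__(self, dim: int):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))

y = GLUBlock(64)(torch.randn(32, 64))
```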

References

  • Bai, L., et al. (2023). STEP: A Spatio-Temporal Pre-training Framework for Citywide Crowd Flow Forecasting.
  • Jiang, R., et al. (2023). Pre-training for Spatio-Temporal Forecasting with Spatio-Temporal Patch and Dynamic Graph.
gpt-5
Submitted to queue 10/8/2025, 3:54:21 PM · Completed 10/8/2025, 3:55:56 PM

Completed

Chinese Review

Synopsis of the paper

This paper proposes CrossST, a cross-district pre-training framework for spatio-temporal forecasting that aims to generalize and transfer spatio-temporal patterns across districts through efficient pattern learning and a spatio-temporal disentanglement strategy. CrossST consists of two stages: in the pre-training stage, the model extracts temporal patterns via time- and frequency-domain analysis (see Fig. 5) and identifies spatial propagation regularities with a graph attention mechanism (see Fig. 6), storing them in a pattern bank; in the fine-tuning stage, temporal/spatial distillation is used to disentangle universal from specific patterns (see Fig. 3). Experiments show that CrossST outperforms existing baselines on multiple urban traffic datasets while offering high computational and memory efficiency (see Fig. 8, Fig. 9, Fig. 10).


Summary of Review

Overall, the paper proposes a well-designed, clearly modularized pre-training framework for cross-district spatio-temporal forecasting. Its strengths are the innovative model structure (see Fig. 3), thorough experimental validation (see Table/Fig. 8-9), and a good balance between efficiency and generalization (see Fig. 2). The main shortcomings are insufficiently rigorous mathematical derivations (see Eq. (5) and Sec. 3.2), a limited scope of ablation experiments (No direct evidence found in the manuscript on heterogeneous city datasets), and a lack of theoretical justification for the spatio-temporal pattern bank design. The paper's line of reasoning is clear overall, but some experiments and notation could be further improved.


Strengths

Multi-Level Model Design

  • Structural completeness (Fig. 3): the two-stage (pre-train then fine-tune) design of CrossST is clearly layered and forms a closed loop from spatio-temporal pattern extraction to disentanglement.
  • Technical originality (Sec. 3.1, Fig. 5 & 6): combining frequency-domain learning with graph attention unifies spatio-temporal pattern learning in a single framework, which is innovative.
  • Engineering feasibility (Fig. 2b): the model maintains linear complexity O(N) in large-scale node experiments, demonstrating scalability and real deployment potential.

Solid Experimental Validation

  • Superior performance (Fig. 8): CrossST surpasses multiple baseline models on all three metrics, MAE, RMSE, and MAPE.
  • Thorough ablation analysis (Fig. 8, w/o TM/SM/STD): verifies that the temporal module, the spatial module, and the disentanglement strategy each contribute significantly to overall performance.
  • Efficiency comparison (Fig. 2c, Fig. 9): clear advantages in memory footprint and training time strengthen the method's practical usability.

Clear Visualization and Interpretation

  • Pattern visualization (Fig. 1, Fig. 10): cross-district patterns and clustering structure are displayed, improving the model's interpretability.
  • Framework diagrams (Fig. 3–6): the figures are clearly layered and help readers understand the relations among modules.
  • Trend analysis (Fig. 9): clearly shows the correlation between parameter scale and performance.

Weaknesses

Insufficient Rigor in Mathematical Exposition

  • Incomplete symbol definitions (Sec. 3.2, Eq. (5)): some symbols (e.g., the gradient-propagation term of L_distill) are undefined, and the derivation takes logical leaps.
  • Missing convergence discussion (No direct evidence found in the manuscript): no theoretical properties or convergence analysis of the optimization procedure are provided.
  • Notational consistency issues (table notation): the paper switches inconsistently between spatial and temporal index symbols, which hurts clarity.

Insufficient Motivation for the Pattern Bank Design

  • Theoretical motivation (Sec. 3.1.1): the theoretical advantage of the pattern bank over conventional embedding learning is not explained.
  • Capacity and update mechanism (No direct evidence found in the manuscript): no description of how the pattern bank's capacity is controlled or how it is dynamically updated.
  • Generalization evidence (Fig. 8 & 9): cross-district experiments exist, but regions with different climates or socio-economic conditions are not covered.

Limited Experimental Coverage

  • No heterogeneous-source validation (No direct evidence found in the manuscript): all experiments are based on California datasets, with no multi-city validation.
  • No uncertainty analysis (Sec. 4.3): the evaluation metrics are reported without standard deviations or statistical significance tests.
  • Limited representativeness of baselines (Fig. 8): the baselines are mainly graph neural networks, lacking comparisons with other self-supervised pre-training methods.

Suggestions for Improvement

Improve Mathematical Derivations and Notation

  • Add a notation table (Sec. 3.2): clarify the meaning of the loss functions and gradient-propagation symbols.
  • Describe the optimization stage: a convergence proof or gradient-stability analysis could be provided in an appendix.
  • Unify notation: use a consistent style for temporal indices and spatial node indices to improve readability.

Strengthen Theoretical and Empirical Support for the Pattern Bank Mechanism

  • Add theoretical analysis (Sec. 3.1): explain why the pattern bank is preferable to direct parameter sharing, and give complexity or generalization bounds.
  • Add dynamic-update experiments: show how the pattern bank adapts as data from new cities is added.
  • Cross-domain validation (extending Fig. 8): test on data from different countries/regions to verify broader generality.

Broaden Experiments and Add Statistical Validation

  • Add heterogeneous test sets (Sec. 4.2): use datasets from other cities (e.g., European traffic data) to validate balanced performance.
  • Provide significance tests (Sec. 4.3): attach standard deviations or confidence intervals to the main metrics to increase credibility (a minimal sketch of such a test follows this list).
  • Extend the baselines: add cross-modal or self-supervised methods for comparison to demonstrate the margin of improvement.
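
To illustrate the second point, reporting the mean ± standard deviation across seeds together with a paired significance test is straightforward; the scores below are hypothetical placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed MAE scores; in practice these would come from
# repeated runs of each model with different random seeds.
crossst = np.array([17.2, 17.5, 17.1, 17.4, 17.3])
baseline = np.array([18.0, 18.3, 17.9, 18.1, 18.2])

print(f"CrossST:  {crossst.mean():.2f} ± {crossst.std(ddof=1):.2f}")
print(f"Baseline: {baseline.mean():.2f} ± {baseline.std(ddof=1):.2f}")

# A paired t-test across seeds checks whether the gap is significant.
t_stat, p_value = stats.ttest_rel(crossst, baseline)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```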

References

None

English Review

Synopsis of the paper
The manuscript introduces CrossST, a pre-training framework designed to enhance cross-district pattern generalization for urban spatio-temporal forecasting. The method constructs a Pattern Bank to capture universal spatio-temporal correlations through both frequency-domain and time-domain analyses, as well as a graph-attention-based spatial module. During fine-tuning, a spatio-temporal disentanglement (STD) strategy separates universal from region-specific dependencies, guided by temporal and spatial distillation losses (KL divergence and InfoNCE, respectively). Efficiency is further improved through temporal aggregation and spatial linear optimization to reduce complexity from O(N^2) to O(N). Extensive experiments on multiple California district datasets (CA-D3–CA-D12) show that CrossST achieves superior accuracy and lower computational cost compared with state-of-the-art baselines (see Figs. 8–10).


Summary of Review
The paper presents an insightful contribution to cross-district generalization in urban spatio-temporal forecasting through an efficient pre-training and disentanglement framework. The method’s combination of frequency-domain modeling and pattern-aware graph attention is well-motivated and experimentally validated (Figs. 5–6; Table results in Fig. 8). The empirical results convincingly show improvements in accuracy and efficiency, supported by ablation studies (Fig. 8) and scalability analyses (Fig. 9). However, the paper’s mathematical presentation of frequency-space interactions lacks derivational clarity (Sec. 3.2; Eq. references not explicitly shown). Some implementation details—such as pattern bank update rules and hyperparameter sensitivity—are insufficiently explained (no direct evidence found). Clarity of dataset partitioning and comparative baselines could also be strengthened.


Strengths

  • Clear Problem Motivation and Scope

    • The cross-district challenge is clearly motivated through examples of spatial heterogeneity across California regions (Fig. 1a–b).
    • The paper situates CrossST relative to analogous pre-training gaps in spatio-temporal prediction (Sec. 1), emphasizing the need for capturing transferable patterns.
    • The motivation is reinforced by the t-SNE clustering of region-dependent dynamics (Fig. 10a–h), underscoring the nontriviality of domain generalization.
  • Innovative Pre-training and Pattern Bank Design

    • The Pattern Bank concept integrates time-domain and frequency-domain representations (Figs. 4–5), highlighting how FFT-based periodicity extraction aids universal pattern storage.
    • The approach allows efficient reuse of shared representations during fine-tuning (Sec. 3.1), critical for scalability across multiple districts.
    • The separation of temporal and spatial pattern libraries contributes to modularity, facilitating targeted improvements.
  • Spatio-Temporal Disentanglement Strategy (STD)

    • The separation of universal versus district-specific patterns (Fig. 3, red vs. blue modules) is conceptually sound and empirically validated in ablation (Fig. 8, w/o STD).
    • Use of KL divergence for temporal distillation and InfoNCE for spatial distillation provides distinct contrastive signals (Sec. 3.2).
    • The disentanglement enhances fine-tuning stability and transfer efficiency, evidenced by improved MAE/MAPE metrics.
  • Computational Efficiency

    • The spatial linear optimization reducing complexity from O(N^2) to O(N) (Fig. 6) addresses a key bottleneck in graph attention methods.
    • Memory and time benchmarking (Fig. 2a–c) confirms substantial efficiency gains over baselines, even at large node counts (50,000).
    • The low overhead complements pre-training reusability, making CrossST appealing for city-scale deployments.
  • Comprehensive Experimental Evaluation

    • The evaluation covers multiple datasets (CA-D3, D5, D6, D10; Fig. 8) and multiple error measures (MAE, RMSE, MAPE).
    • Ablation on temporal and spatial components (w/o TM, w/o SM; Fig. 8) aligns performance gains with individual modules.
    • Scalability curves (Fig. 9) show predictable improvements with parameter and data scaling, supporting reproducibility.

Weaknesses

  • Incomplete Mathematical Formulation of Frequency–Time Coupling

    • Sec. 3.2 and Fig. 5 introduce FFT/IFFT-based modules but omit explicit definitions of the transformation equations and loss terms.
    • No equation describes how periodicity features merge with time-domain embeddings (no direct evidence found).
    • This lack of formalism complicates theoretical reproducibility and limits confidence in the claim of “frequency-time synergy.”
  • Limited Explanation of Pattern Bank Update Mechanism

    • The manuscript claims the bank “stores diverse valuable patterns” (Abstract; Sec. 3.1) but does not specify sampling, memory, or update policy.
    • Fig. 3 shows static connections, yet it remains unclear whether updates are gradient-based or episodic.
    • Lack of this detail hinders understanding of convergence and transfer dynamics.
  • Ambiguity in Dataset Partitioning and Transfer Evaluation

    • The setup mentions pre-training on multiple districts and fine-tuning on others (Fig. 7), but the exact split ratios and non-overlap control are not detailed.
    • It is unclear whether temporal leakage between training/testing districts is prevented (no direct evidence found).
    • This obscures the rigor of the claimed “cross-district generalization.”
  • Baseline Implementation Uncertainty

    • The radar and bar plots (Figs. 2c; 8) report superiority over existing baselines, yet references or reimplementation details for those baselines are limited (no direct evidence found).
    • There is no accompanying table clarifying parameter budgets, input resolutions, or training epochs for comparison fairness.
    • Lack of explicit configuration may mask whether improvements stem from architecture or from resource allocation.
  • Clarity and Consistency of Notation

    • In Figs. 3–6, symbols such as X_emb, T_emb, pattern weights, and spatial attention coefficients are not all defined in the text (Sec. 3).
    • Some latent variables appear once in the figure but not in any formal equation.
    • This weakens the mathematical consistency and makes replication challenging.

Suggestions for Improvement

  • Enhance Mathematical Transparency of Frequency–Time Module

    • Provide explicit equations for FFT-IFFT transformation within the temporal module and show where loss gradients propagate.
    • Clarify how frequency-domain features integrate with time-domain embeddings (add cross-term in Eq. form).
    • Include derivational steps—perhaps in an appendix—linking this operation to the claimed efficiency gains (a minimal top-k FFT sketch follows below).
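
As a concrete reference for the first point, a generic top-k FFT filtering step could be written as follows; the function name, the use of the real FFT, and the value of k are assumptions, since the paper leaves the transformation undefined.

```python
import torch

def topk_periodic_features(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Keep the k largest-amplitude frequency bins of each series and invert
    the transform, yielding a denoised periodic component of the input."""
    spec = torch.fft.rfft(x, dim=-1)                  # complex spectrum
    amp = spec.abs()
    idx = amp.topk(k, dim=-1).indices                 # dominant frequencies
    mask = torch.zeros_like(amp, dtype=torch.bool).scatter_(-1, idx, True)
    filtered = torch.where(mask, spec, torch.zeros_like(spec))
    return torch.fft.irfft(filtered, n=x.size(-1), dim=-1)

periodic = topk_periodic_features(torch.randn(32, 288))  # one day of 5-min steps
```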
  • Detail Pattern Bank Storage and Update Procedures

    • Specify whether the pattern bank is trainable end-to-end, periodically refreshed, or maintained via momentum averaging.
    • Quantify its size, access latency, and replacement criteria for new patterns.
    • Add pseudocode or an algorithmic summary to guarantee reproducibility (one possible scheme is sketched below).
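
One scheme the authors could document is an exponential-moving-average update of the nearest bank slots, sketched below; this is entirely hypothetical, since the manuscript does not reveal whether updates are gradient-based, episodic, or EMA-based.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update_bank(bank: torch.Tensor,
                         features: torch.Tensor,
                         momentum: float = 0.99) -> torch.Tensor:
    """Assign each incoming feature to its nearest bank slot (by cosine
    similarity) and move that slot toward the assigned features' mean."""
    sim = F.normalize(features, dim=-1) @ F.normalize(bank, dim=-1).t()
    assign = sim.argmax(dim=-1)                 # nearest slot per feature
    for slot in assign.unique():
        mean_feat = features[assign == slot].mean(dim=0)
        bank[slot] = momentum * bank[slot] + (1 - momentum) * mean_feat
    return bank

bank = momentum_update_bank(torch.randn(64, 128), torch.randn(256, 128))
```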
  • Clarify Dataset Splits and Evaluation Protocol

    • Present a table listing which districts are used for pre-training and fine-tuning with explicit temporal ranges.
    • Discuss measures to prevent data leakage across districts and time periods.
    • Add statistical significance analysis (e.g., variation across seeds) to substantiate generalization claims.
  • Document Baseline Configurations for Fair Comparison

    • Include a separate appendix summarizing baseline architectures, parameter counts, and training durations.
    • Indicate whether baseline results are reproduced or from prior papers, with reference anchors.
    • Provide runs under comparable GPU hours or parameter constraints to isolate algorithmic merit.
  • Unify and Define Mathematical Notation

    • Add a notation table mapping every symbol in Figs. 3–6 to definitions in main text.
    • Ensure all embedding and attention terms are introduced before use.
    • Align variable naming between temporal and spatial modules to reduce cognitive load for readers.

References
None