- Title: Spacetime-GR: A Spacetime-Aware Generative Model for Large Scale Online POI Recommendation
- Authors: Haitao Lin, Zhen Yang, Jiawei Xue, Ziji Zhang, Luzhu Wang, Yikun Gu, Yao Xu, and Xin Li.
- Affiliations: All authors are affiliated with AMAP, Alibaba Group, Beijing, China. This indicates the research is driven by industrial application needs from a major map and navigation service provider.
- Journal/Conference: The paper mentions "In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX)". This, along with the arXiv link, suggests it is a preprint submitted for peer review, likely to a top-tier conference in data mining (e.g., KDD, WSDM) or recommender systems (e.g., RecSys).
- Publication Year: The reference format template in the paper lists "2018", but the content and cited works (Llama 2 from 2023, DPO from 2023) clearly indicate a post-2023 publication date. The arXiv identifier 2508.16126 follows the YYMM numbering scheme, which points to a submission in August 2025. This is a very recent work.
- Abstract: The paper introduces Spacetime-GR, the first generative model designed for large-scale, online Point-of-Interest (POI) recommendation that is aware of spatiotemporal context. To handle the massive number of POIs, it proposes a geographic-aware hierarchical indexing strategy. A novel spatiotemporal encoding module is introduced to make the model sensitive to time and location variations. The model is further enhanced with multimodal POI embeddings. A multi-stage training framework (pre-training, post-training adaptation) allows Spacetime-GR to produce various outputs (embeddings, ranking scores, direct recommendations) for different downstream tasks. The authors demonstrate its superior performance on public and large-scale industrial datasets and report its successful deployment in a system serving hundreds of millions of users.
- Original Source Link:
  - Official Source: https://arxiv.org/abs/2508.16126
  - PDF Link: https://arxiv.org/pdf/2508.16126v1.pdf
- Publication Status: This is a preprint available on arXiv, not yet formally published in a peer-reviewed venue at the time of this analysis.
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The core of the paper is the Spacetime-GR model and its multi-stage training framework.
3.1 Task Definition
The task is defined as spacetime-aware online POI recommendation.
- Input:
  - A user's historical action sequence $S = \{s_1, s_2, \ldots, s_m\}$. Each action $s_i$ is a tuple containing:
    - Timestamp ($t_i$)
    - User's geographic location ($g_i^u$)
    - POI index ($p_i$)
    - POI's geographic location ($g_i^p$)
    - POI category ($c_i$)
    - Action type ($a_i$, e.g., 'click')
  - User profile information ($u^p$)
  - Current request context: timestamp ($t_{m+1}$) and user location ($g_{m+1}^u$)
- Output: Predict the next POI the user will interact with, $p_{m+1}$.
The paper provides an example of the data format for a single action:
Manually transcribed from the paper's Table 1.

| Key | Explanation | Example Value |
|---|---|---|
| time $t$ | a 13-digit timestamp | 1709805845148 |
| user geo info $g^u$ | the longitude and latitude of user location | x: 118.2252, y: 24.6001 |
| POI $p$ | the index of POI | 123 |
| POI geo info $g^p$ | the longitude and latitude of POI location | x: 118.3468, y: 24.1159 |
| POI category $c$ | the type of POI | food, French food |
| action type $a$ | the type of action | click |
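As an illustration only, the action tuple from Table 1 could be held in a small record type. The class name `Action` and its field names are hypothetical, and reading the 13-digit timestamp as Unix milliseconds and the category as a coarse-to-fine path are assumptions of this sketch, not statements from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Action:
    """One user action s_i, mirroring the fields of Table 1 (names are hypothetical)."""
    time_ms: int                    # t: 13-digit timestamp, assumed to be Unix milliseconds
    user_geo: Tuple[float, float]   # g^u: (longitude, latitude) of the user
    poi_index: int                  # p: index of the POI
    poi_geo: Tuple[float, float]    # g^p: (longitude, latitude) of the POI
    poi_category: Tuple[str, ...]   # c: category labels, assumed coarse to fine
    action_type: str                # a: e.g. "click"


# The example row from Table 1.
example = Action(
    time_ms=1709805845148,
    user_geo=(118.2252, 24.6001),
    poi_index=123,
    poi_geo=(118.3468, 24.1159),
    poi_category=("food", "French food"),
    action_type="click",
)
```

A user's history $S$ would then simply be a list of such records ordered by time.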
3.2 Spacetime-GR Framework
The framework is built around a decoder-only transformer (based on Llama 2) and consists of three stages as shown in Figure 2.
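Figure 2: Schematic of the overall structure of the Spacetime-GR model, covering its three training stages (pre-training, SFT, and alignment). The figure details the geographic-aware hierarchical POI indexing, the spatiotemporal encoding module, and the integration of multimodal POI embeddings with generative ranking.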
3.2.1 Pre-training Stage
The goal of this stage is to learn fundamental patterns in user behavior from a massive dataset of user action sequences.
- Data Cleansing: To improve training quality, a two-level data filtering strategy is applied:
  - Action Level: Actions are classified as either functional (e.g., navigating home, searching for a hospital) or interest-based (e.g., clicking on restaurants, entertainment venues). Only interest-based actions (where $I_{t_i} = 1$) are used as prediction targets, as they better reflect a user's latent interests.
  - Sequence Level: A richness metric is defined to filter out monotonous sequences (e.g., a user only commuting between home and work):

    $$R = \frac{\text{number of different POIs}}{\text{number of actions}}$$

    Sequences with low richness ($R < 0.3$) are discarded.
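The sequence-level filter can be sketched in a few lines, taking richness as the number of distinct POIs divided by the number of actions and discarding sequences with $R < 0.3$. The function names and the toy sequences are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def richness(poi_ids: Sequence[int]) -> float:
    """R = (# distinct POIs) / (# actions); defined as 0.0 for an empty sequence."""
    if not poi_ids:
        return 0.0
    return len(set(poi_ids)) / len(poi_ids)


def filter_sequences(sequences: List[List[int]], threshold: float = 0.3) -> List[List[int]]:
    """Keep sequences whose richness is at least `threshold`."""
    return [seq for seq in sequences if richness(seq) >= threshold]


commuter = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]  # two POIs over ten actions: R = 0.2
explorer = [1, 5, 9, 5, 7]                 # four POIs over five actions: R = 0.8
kept = filter_sequences([commuter, explorer])
```

Here the commuter-style sequence falls below the threshold and is dropped, while the more varied one is kept.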
- Model Structure & Input Encoding:
  - Geographic-aware Hierarchical POI Indexing: To handle the 100M+ POI vocabulary, each POI $p_i$ is represented by two tokens:
    - $\text{block}_i$: An ID representing the 5km x 5km geographic grid cell where the POI is located.
    - $\text{inner}_i$: A local index of the POI within that block.

    This reduces the vocabulary size from ~100M to ~400K, making the softmax output layer computationally feasible.
  - Spatiotemporal Encoding Module: Each user action $s_i$ is converted into a sequence of four tokens: $(u_i, \text{block}_i, \text{inner}_i, a_i)$.
    - $u_i$: A token representing the user's spatiotemporal context (time $t_i$ and location $g_i^u$).
    - $a_i$: The action type.
  - Feature Embedding: The embeddings for these tokens are enriched with side information.
    - The embedding for $u_i$ is a weighted sum of time and user geo-location embeddings.
    - The embedding for $\text{inner}_i$ is a weighted sum of its base ID embedding, POI category embedding ($c_i$), and POI geo-location embedding ($g_i^p$).

    The embedding calculations are:

    $$
    \begin{aligned}
    E(u_i) &= w_1 \cdot \text{Emb}_t(t_i) + w_2 \cdot \text{Emb}_g(g_i^u) \\
    E(\text{block}_i) &= \text{Emb}_p(\text{block}_i) \\
    E(\text{inner}_i) &= w_3 \cdot \text{Emb}_p(\text{inner}_i) + w_4 \cdot \text{Emb}_c(c_i) + w_5 \cdot \text{Emb}_g(g_i^p) \\
    E(a_i) &= \text{Emb}_a(a_i)
    \end{aligned}
    $$

    Where $\text{Emb}_x$ are embedding layers for different feature types (time, geo, POI, category, action) and $w_j$ are learnable weights.
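A minimal sketch of the two-token POI index and the four-token action layout follows. The grid construction (a flat 111 km per degree of latitude, with cell width corrected per latitude band), the sequential ID assignment, and all function names are assumptions made for illustration; the summary does not describe how the paper actually builds its 5km x 5km blocks.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

BLOCK_KM = 5.0          # grid cell size stated in the summary
KM_PER_DEG_LAT = 111.0  # rough constant; an assumption of this sketch


def block_cell(lon: float, lat: float) -> Tuple[int, int]:
    """Map a coordinate to a (column, row) cell of an approximate 5 km grid."""
    row = math.floor(lat * KM_PER_DEG_LAT / BLOCK_KM)
    # Use the latitude at the centre of the row so that every cell in a row
    # shares the same east-west width.
    band_lat = (row + 0.5) * BLOCK_KM / KM_PER_DEG_LAT
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(band_lat))
    col = math.floor(lon * km_per_deg_lon / BLOCK_KM)
    return col, row


def build_index(pois: Dict[int, Tuple[float, float]]) -> Dict[int, Tuple[int, int]]:
    """Assign every POI a (block_id, inner_id) pair."""
    block_ids: Dict[Tuple[int, int], int] = {}
    next_inner: Dict[int, int] = defaultdict(int)
    index: Dict[int, Tuple[int, int]] = {}
    for poi, (lon, lat) in sorted(pois.items()):
        block = block_ids.setdefault(block_cell(lon, lat), len(block_ids))
        index[poi] = (block, next_inner[block])
        next_inner[block] += 1
    return index


def action_tokens(context: str, poi: int, action: str,
                  index: Dict[int, Tuple[int, int]]) -> List[str]:
    """Expand one action into the four tokens (u_i, block_i, inner_i, a_i)."""
    block, inner = index[poi]
    return [context, f"block_{block}", f"inner_{inner}", action]


# Two nearby POIs and one far away (coordinates are illustrative).
pois = {123: (118.3468, 24.1159), 124: (118.3470, 24.1160), 999: (116.4000, 39.9000)}
index = build_index(pois)
tokens = action_tokens("<ctx>", 124, "click", index)
```

In this sketch the output vocabulary grows with the number of blocks plus the largest number of POIs inside any single block, rather than with the total number of POIs, which is the effect the hierarchical index is meant to achieve. The `"<ctx>"` string is only a placeholder for the context token $u_i$, whose embedding is built from time and location features.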
- Loss Function: The model is trained using a standard cross-entropy loss to predict the next token. Crucially, the loss is only computed for the block and inner tokens of interest-based actions ($I_{t_i} = 1$).

  $$
  \mathcal{L}_{\text{pretrain}} = -\sum_{i=1}^{n-1} I_{t_{i+1}} \cdot \Big( \log P(\text{block}_{i+1} \mid u^p, s_1, \ldots, s_i, u_{i+1}) + \log P(\text{inner}_{i+1} \mid u^p, s_1, \ldots, s_i, u_{i+1}, \text{block}_{i+1}) \Big)
  $$
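The masking in this loss can be shown with plain numbers. The sketch below assumes the model's probabilities for the correct block token and the correct inner token are already available for each target action; `pretrain_loss` and its argument names are hypothetical.

```python
import math
from typing import Sequence


def pretrain_loss(p_block: Sequence[float],
                  p_inner: Sequence[float],
                  is_interest: Sequence[int]) -> float:
    """Sum of -I * (log P(block) + log P(inner | block)) over the target actions."""
    total = 0.0
    for pb, pi, flag in zip(p_block, p_inner, is_interest):
        if flag:  # functional actions (flag == 0) contribute nothing
            total -= math.log(pb) + math.log(pi)
    return total


# Three target actions; the second one is functional and is therefore masked out.
loss = pretrain_loss(
    p_block=[0.5, 0.9, 0.25],
    p_inner=[0.5, 0.9, 0.5],
    is_interest=[1, 0, 1],
)
```

Only the first and third actions are counted, so the loss equals $-\log(0.5 \cdot 0.5) - \log(0.25 \cdot 0.5) = \log 32$.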
- Curriculum Learning Strategy: Training proceeds from simple to complex data.
  - First, train on single-pattern sequences (e.g., only local actions, only travel actions).
  - Then, train on multi-pattern sequences that contain transitions between different states (e.g., local to travel).
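One simple way to picture the curriculum is as an ordering of the training data. The sketch below assumes every action already carries a state label such as "local" or "travel"; how the paper assigns these labels is not described in this summary, so the labels and function names are hypothetical.

```python
from typing import List


def is_single_pattern(states: List[str]) -> bool:
    """A sequence is single-pattern if all of its actions share one state."""
    return len(set(states)) <= 1


def curriculum_order(sequences: List[List[str]]) -> List[List[str]]:
    """Single-pattern sequences first, multi-pattern ones afterwards.

    sorted() is stable, so the original order is kept inside each phase.
    """
    return sorted(sequences, key=lambda s: 0 if is_single_pattern(s) else 1)


sequences = [
    ["local", "travel", "travel"],  # multi-pattern: contains a local-to-travel transition
    ["local", "local"],             # single-pattern
    ["travel", "travel"],           # single-pattern
]
ordered = curriculum_order(sequences)
```

In practice the two phases would more likely be separate training runs over separate datasets; the ordering here only makes the simple-to-complex idea concrete.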
3.2.2 Supervised Finetuning (SFT) Stage
After pre-training, the model is fine-tuned on a downstream recommendation task dataset, which contains user sequences, candidate POIs, and click labels (positive/negative). Two SFT strategies are proposed:
- Embedding-Based Ranking SFT:
  - Goal: To produce high-quality user and POI embeddings for use in a downstream ranking model.
  - Architecture: A dual-tower structure. The Spacetime-GR model acts as an encoder for both the user side (history + context) and the POI side.
  - Process:
    - User embedding $E_u$ is generated by encoding the user profile, action sequence, and request context.
    - POI embedding $E_p$ is generated by encoding the POI's hierarchical index and side information.
  - Loss Function: InfoNCE loss is used to pull the user embedding closer to embeddings of clicked POIs (positives) and push it away from embeddings of unclicked POIs (negatives).

    $$
    \mathcal{L}_{\text{emb-sft}} = -\sum_i \log \frac{\sum_j \exp\left(\cos(E_{u_i}, E_{p_{i,j,+}})/\tau\right)}{\sum_j \exp\left(\cos(E_{u_i}, E_{p_{i,j,+}})/\tau\right) + \sum_k \exp\left(\cos(E_{u_i}, E_{p_{i,k,-}})/\tau\right)}
    $$

    Here, $\tau$ is a temperature hyperparameter.
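For a single request, the InfoNCE term can be computed directly from the embeddings. This is a pure-Python sketch with toy two-dimensional vectors; the function names and the default temperature are assumptions, since the summary does not report the value of $\tau$.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(user: Sequence[float],
             positives: Sequence[Sequence[float]],
             negatives: Sequence[Sequence[float]],
             tau: float = 0.1) -> float:
    """-log( sum_pos / (sum_pos + sum_neg) ) for one user embedding."""
    pos = sum(math.exp(cosine(user, p) / tau) for p in positives)
    neg = sum(math.exp(cosine(user, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


user = [1.0, 0.0]
aligned = info_nce(user, positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]], tau=1.0)
swapped = info_nce(user, positives=[[0.0, 1.0]], negatives=[[1.0, 0.0]], tau=1.0)
```

The loss is smaller when the clicked POI's embedding points in the same direction as the user embedding (`aligned`) than when the roles are reversed (`swapped`), which is the behaviour the contrastive objective rewards.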
- Generative Ranking SFT:
  - Goal: To directly output a ranking score for each candidate POI.
  - Architecture: A cross-encoder structure where user and POI information interact at all layers.
  - Process: The user sequence and all candidate POIs are concatenated into a single input sequence. A modified attention mask ensures that the representation for each POI is computed based only on the user information and its own features, not other candidate POIs.
  - Loss Function: The hidden state corresponding to each POI's inner token is passed through a classification head to predict a click probability. A standard binary cross-entropy loss is used for training.

    $$
    \mathcal{L}_{\text{generative-sft}} = -\sum_i \big[ y_i \cdot \log P_i + (1 - y_i) \cdot \log(1 - P_i) \big]
    $$
- Multimodal POI Embeddings: In the SFT stage, the POI representation is enriched by adding pre-computed embeddings derived from a multimodal LLM that processes the POI's text (name, address, reviews) and images.
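The modified attention mask used in generative ranking SFT can be sketched as a boolean matrix. The layout assumed here (user tokens first, then each candidate as a run of two tokens for block and inner, with causal attention inside the user part) is an illustrative guess at the arrangement, and `candidate_mask` is a hypothetical name.

```python
from typing import List


def candidate_mask(n_user: int, n_cand: int, cand_len: int = 2) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    total = n_user + n_cand * cand_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # never attend to later positions
            if k < n_user:
                # User history and request context are visible to everything after them.
                mask[q][k] = True
            else:
                # Here both q and k are candidate tokens: allow only the same candidate.
                mask[q][k] = (q - n_user) // cand_len == (k - n_user) // cand_len
    return mask


# 3 user tokens (positions 0-2), candidate A (3-4), candidate B (5-6).
mask = candidate_mask(n_user=3, n_cand=2)
```

With this mask the score read from each candidate's inner token depends on the user tokens and that candidate alone, so reordering or removing other candidates would not change it.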
3.2.3 Alignment Stage
This stage aims to refine the model's ability to directly generate a ranked list of POIs.
- DPO Training: Direct Preference Optimization is used to align the model with user preferences.
  - Data: Clicked POIs are treated as "chosen" (preferred) responses, and exposed but unclicked POIs are "rejected" (dispreferred) responses.
  - Loss Function: The DPO loss encourages the model to assign a higher probability to chosen POIs than to rejected POIs, compared to a frozen reference model (the initial pre-trained model).

    $$
    \mathcal{L}_{\text{DPO}} = -\sum_i \sum_{j,k} \log \sigma\left( \beta \log \frac{\text{Align}(p_{i,j,+})}{\text{Ref}(p_{i,j,+})} - \beta \log \frac{\text{Align}(p_{i,k,-})}{\text{Ref}(p_{i,k,-})} \right)
    $$

    Where $\text{Align}$ is the model being trained, $\text{Ref}$ is the frozen reference model, $p_+$ and $p_-$ are positive and negative POIs, and $\beta$ is a scaling parameter.
- Spatiotemporal Sensitivity: This framework can be used to instill specific behaviors, e.g., by creating preference pairs where the "chosen" POI is a restaurant during meal times and the "rejected" POI is not, for the same context.
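The DPO term for one (chosen, rejected) pair reduces to a few arithmetic steps once the log-probabilities from the trained and reference models are known. The sketch below works on log-probabilities directly; the function names and the value of $\beta$ are assumptions, as the summary does not state the setting used.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(logp_align_pos: float, logp_ref_pos: float,
                  logp_align_neg: float, logp_ref_neg: float,
                  beta: float = 0.1) -> float:
    """-log sigma( beta * [log(Align/Ref)(p+) - log(Align/Ref)(p-)] ) for one pair."""
    margin = beta * ((logp_align_pos - logp_ref_pos) - (logp_align_neg - logp_ref_neg))
    return -log_sigmoid(margin)


# Before any update the trained model equals the reference, so the margin is zero.
at_start = dpo_pair_loss(-2.0, -2.0, -3.0, -3.0)
# After the model shifts probability toward the clicked POI, the loss decreases.
improved = dpo_pair_loss(-1.5, -2.0, -3.5, -3.0)
```

At initialisation every pair contributes $\log 2$; the loss falls only when the trained model raises the chosen POI relative to the rejected one by more than the reference model does.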
5. Experimental Setup
- Datasets:
  - Industrial Dataset: An internal dataset from Amap with hundreds of millions of users and POIs.

    Manually transcribed from the paper's Table 3.

    | Stage | Data Type | Samples | Length | Candidate POI Num |
    |---|---|---|---|---|
    | Pre-training | Train | 578M | 146.3 | - |
    | | Validation | 19,794 | 142.6 | - |
    | | Test | 19,837 | 143.2 | - |
    | SFT & Alignment | Train | 31M | 301.0 | 11.3 |
    | | Validation | 611K | 334.5 | 10.5 |
    | | Test | 553K | 346.1 | 10.6 |
  - Public Datasets: Three widely used offline check-in datasets to test generalizability: Foursquare-NYC, Foursquare-TKY, and Gowalla-CA. These are much smaller (thousands of users/POIs).
- Evaluation Metrics:
  - AUC (Area Under the ROC Curve):
    - Conceptual Definition: Measures the ability of a model to distinguish between positive and negative classes. An AUC of 1.0 means a perfect classifier, while 0.5 indicates a random guess. It is widely used for evaluating binary classification and ranking quality.
    - Mathematical Formula:

      $$
      \text{AUC} = \frac{\sum_{i \in \text{positive class}} \sum_{j \in \text{negative class}} \mathbb{1}\left(\text{score}(i) > \text{score}(j)\right)}{|\text{positive class}| \cdot |\text{negative class}|}
      $$
    - Symbol Explanation: $\text{score}(i)$ is the model's predicted score for item $i$. $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the condition is true and 0 otherwise. The formula calculates the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample.
  - CTR (Click-Through Rate) / CVR (Conversion Rate):
    - Conceptual Definition: Business metrics used in online systems. CTR is the ratio of clicks to impressions. CVR is the ratio of conversions (e.g., making a purchase, navigating to a POI) to clicks. They measure user engagement and the business impact of the recommendations.
  - Hit Rate (hr@k):
    - Conceptual Definition: Measures whether the true next item is present in the top-k recommended items. It evaluates the recall of the model.
    - Mathematical Formula:

      $$
      \text{hr@}k = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}\left(p_{\text{target}} \in \text{TopK}_u\right)
      $$
    - Symbol Explanation: $|U|$ is the total number of test cases (users). $p_{\text{target}}$ is the ground-truth next POI. $\text{TopK}_u$ is the list of top $k$ POIs recommended for user $u$. $\mathbb{1}(\cdot)$ is the indicator function.
  - LLM & Human Evaluation: For the generative task, results are compared based on win/even/lose rates from GPT-4o, Qwen-Plus, and human evaluators.
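Both offline metrics are easy to compute by hand on a toy example. The sketch below follows the pairwise AUC definition given in this section, where ties count as zero because the comparison is strict, and the hit-rate definition; the function names and sample values are illustrative.

```python
from typing import List, Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores strictly higher."""
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))


def hit_rate_at_k(targets: Sequence[int], ranked_lists: Sequence[List[int]], k: int) -> float:
    """Share of test cases whose ground-truth POI appears in the top-k list."""
    hits = sum(1 for target, ranked in zip(targets, ranked_lists) if target in ranked[:k])
    return hits / len(targets)


# Pairs (0.9, 0.5), (0.9, 0.1), and (0.4, 0.1) are ordered correctly; (0.4, 0.5) is not.
toy_auc = auc(pos_scores=[0.9, 0.4], neg_scores=[0.5, 0.1])

# The first user's target (3) is in the top 2; the second user's target (7) is not.
toy_hr = hit_rate_at_k(targets=[3, 7], ranked_lists=[[1, 3, 5], [2, 4, 6]], k=2)
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for illustration; large evaluations would normally use a rank-based computation instead.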
- Baselines:
  - Industrial: The main baseline is the existing online ranking model in production at Amap.
  - Public: Several sequential recommendation models are used as baselines, including LSTM, STGCN, PLSPL, STAN, GETNext, and STHGCN.
6. Results & Analysis
Core Results on Industrial Dataset
Results on Public Datasets
- On the smaller, offline check-in datasets, a simplified Spacetime-GR achieves performance comparable to or better than state-of-the-art baselines like STHGCN. It outperforms STHGCN on NYC (0.2920 vs. 0.2734) and trails it on TKY (0.2610 vs. 0.2950) and CA (0.1659 vs. 0.1730), while beating every other baseline on all three datasets. This is despite STHGCN using information from other users' sequences while Spacetime-GR only uses the current user's sequence, which supports the model's generalizability.
Manually transcribed from the paper's Table 6.

| Method | NYC | TKY | CA |
|---|---|---|---|
| LSTM [15] | 0.1305 | 0.1335 | 0.0665 |
| STGCN [55] | 0.1799 | 0.1716 | 0.0961 |
| PLSPL [45] | 0.1917 | 0.1889 | 0.1072 |
| STAN [30] | 0.2231 | 0.1963 | 0.1104 |
| GETNext [49] | 0.2435 | 0.2254 | 0.1357 |
| STHGCN [46] | 0.2734 | 0.2950 | 0.1730 |
| Spacetime-GR | 0.2920 | 0.2610 | 0.1659 |
Ablation Studies
- Pre-training Stage: The ablation study confirms the importance of each proposed component.
  - Removing spatiotemporal information causes the largest hr@1 drop (from 0.1525 to 0.1007) and a large hr@100 drop (from 0.4721 to 0.3671), making it the most critical component for top-1 accuracy.
  - The geographic-aware hierarchical index is significantly better than a traditional hashing-based index: without it, hr@1 falls to 0.1328 and hr@100 to 0.3480, the largest hr@100 drop among the ablations.
  - Curriculum learning provides a modest but consistent improvement (hr@1 from 0.1463 to 0.1525, hr@100 from 0.4624 to 0.4721).
Manually transcribed from the paper's Table 7.

| Methods | hr@1 | hr@100 |
|---|---|---|
| GPT-based | 0.0688 | 0.2195 |
| Spacetime-GR w/o spatiotemporal info | 0.1007 | 0.3671 |
| Spacetime-GR w/o hierarchical POI index | 0.1328 | 0.3480 |
| Spacetime-GR w/o curriculum learning | 0.1463 | 0.4624 |
| Spacetime-GR | 0.1525 | 0.4721 |
Note: The table in the paper seems to have two entries for "w/o hierarchical POI index". I have transcribed them as they appear.
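To see which ablation hurts most on each metric, the Table 7 values can be compared against the full model. This is a small arithmetic check on the transcribed numbers only; the variable and function names are illustrative.

```python
from typing import Dict

# Values transcribed from Table 7.
full = {"hr@1": 0.1525, "hr@100": 0.4721}
ablations = {
    "w/o spatiotemporal info": {"hr@1": 0.1007, "hr@100": 0.3671},
    "w/o hierarchical POI index": {"hr@1": 0.1328, "hr@100": 0.3480},
    "w/o curriculum learning": {"hr@1": 0.1463, "hr@100": 0.4624},
}


def drops(metric: str) -> Dict[str, float]:
    """Absolute drop of each ablated variant relative to the full model."""
    return {name: round(full[metric] - vals[metric], 4) for name, vals in ablations.items()}


# Name of the variant with the largest drop, per metric.
largest = {metric: max(drops(metric), key=drops(metric).get) for metric in full}
```

On these numbers, removing spatiotemporal information gives the largest hr@1 drop (0.0518), while removing the hierarchical POI index gives the largest hr@100 drop (0.1241).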
- SFT Stage:
  - Fine-tuning from the pre-trained model is clearly superior to training from scratch (AUC 0.7080 vs. 0.6621 for embedding-based ranking, 0.7371 vs. 0.6648 for generative ranking), highlighting the value of pre-training.
  - Generative ranking SFT outperforms embedding-based SFT (AUC 0.7371 vs. 0.7080), consistent with the view that deeper interaction modeling is more powerful, though more computationally expensive.
  - Adding multimodal embeddings raises the AUC of embedding-based ranking SFT from 0.7080 to 0.7214, demonstrating the value of richer POI content.
Manually transcribed from the paper's Table 8.

| Methods | AUC |
|---|---|
| embedding-based ranking SFT from scratch | 0.6621 |
| embedding-based ranking SFT | 0.7080 |
| embedding-based ranking SFT + multimodal | 0.7214 |
| generative ranking SFT from scratch | 0.6648 |
| generative ranking SFT | 0.7371 |
Note: The paper is missing the result for "generative ranking SFT + multimodal".
- Alignment Stage:
  - DPO alignment improves the ranking ability over the pre-trained model, especially for hr@10 (0.4520 vs. 0.4295), showing it successfully refines the model's ability to rank relevant items higher.
Manually transcribed from the paper's Table 9.

| Methods | hr@1 | hr@10 |
|---|---|---|
| pre-trained | 0.1960 | 0.4295 |
| DPO | 0.2006 | 0.4520 |
| S-DPO | 0.2008 | 0.4512 |
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Spacetime-GR, a novel generative model tailored for the unique challenges of large-scale online POI recommendation. By developing a geographically-aware hierarchical POI index, a spatiotemporal encoding module, and a versatile three-stage training framework (pre-training, SFT, DPO alignment), the authors create a system that is both powerful and practical. The model is competitive with state-of-the-art baselines on public benchmarks and delivers significant improvements in CTR and CVR in a live industrial setting, which the authors present as the first successful deployment of a generative model in such a system.
- Limitations & Future Work: The provided text cuts off before detailing the authors' own discussion of limitations. However, based on the content, potential limitations could include:
  - Computational Cost: While more efficient than a flat vocabulary, the model is still large and requires significant resources for pre-training (96 H20 GPUs for 7 days).
  - Data Dependency: The model's performance heavily relies on the massive, proprietary dataset from Amap. Its effectiveness might vary on smaller or differently distributed datasets.
  - Complexity: The full three-stage pipeline is complex to implement and maintain in a production environment.
- Personal Insights & Critique:
  - Strengths:
    - The paper is an excellent example of industry-driven research that tackles real-world problems at scale. It bridges the gap between academic theory (generative models, DPO) and industrial practice.
    - The proposed solutions are well-motivated and elegant. The geographic hierarchical index is a clever and domain-specific solution to the vocabulary problem. The explicit encoding of spatiotemporal context as a token is a simple yet powerful idea.
    - The modular framework (pre-train, SFT, align) is a major strength, providing a flexible pathway to deploy a single powerful foundation model in various roles within a complex recommendation ecosystem.
  - Critique & Open Questions:
    - The comparison with STHGCN on public datasets is interesting, but STHGCN's use of graph information (linking other users' trajectories) is a fundamentally different approach. A more direct comparison would be against other single-user sequence models on those datasets.
    - The paper could have explored the trade-offs of the hierarchical index in more detail, such as the impact of grid size (5km x 5km) on performance.
    - While the model is "spacetime-aware," the analysis could go deeper into how the model uses this information. For example, visualizations of attention weights could show whether the model focuses on relevant temporal or spatial cues when making a prediction. The provided text was cut short before this analysis was presented.