Spacetime-GR: A Spacetime-Aware Generative Model for Large Scale Online POI Recommendation
TL;DR Summary
This paper introduces Spacetime-GR, a spatiotemporally aware generative model for large-scale POI recommendation, employing geographic hierarchical indexing, a novel spatiotemporal encoding module, and multimodal POI embeddings; experiments demonstrate its superiority and success
Abstract
Building upon the strong sequence modeling capability, Generative Recommendation (GR) has gradually assumed a dominant position in the application of recommendation tasks (e.g., video and product recommendation). However, the application of Generative Recommendation in Point-of-Interest (POI) recommendation, where user preferences are significantly affected by spatiotemporal variations, remains a challenging open problem. In this paper, we propose Spacetime-GR, the first spacetime-aware generative model for large-scale online POI recommendation. It extends the strong sequence modeling ability of generative models by incorporating flexible spatiotemporal information encoding. Specifically, we first introduce a geographic-aware hierarchical POI indexing strategy to address the challenge of large vocabulary modeling. Subsequently, a novel spatiotemporal encoding module is introduced to seamlessly incorporate spatiotemporal context into user action sequences, thereby enhancing the model's sensitivity to spatiotemporal variations. Furthermore, we incorporate multimodal POI embeddings to enrich the semantic understanding of each POI. Finally, to facilitate practical deployment, we develop a set of post-training adaptation strategies after sufficient pre-training on action sequences. These strategies enable Spacetime-GR to generate outputs in multiple formats (i.e., embeddings, ranking scores and POI candidates) and support a wide range of downstream application scenarios (i.e., ranking and end-to-end recommendation). We evaluate the proposed model on both public benchmark datasets and large-scale industrial datasets, demonstrating its superior performance over existing methods in terms of POI recommendation accuracy and ranking quality. Furthermore, the model is the first generative model deployed in online POI recommendation services that scale to hundreds of millions of POIs and users.
English Analysis
1. Bibliographic Information
- Title: Spacetime-GR: A Spacetime-Aware Generative Model for Large Scale Online POI Recommendation
- Authors: Haitao Lin, Zhen Yang, Jiawei Xue, Ziji Zhang, Luzhu Wang, Yikun Gu, Yao Xu, and Xin Li.
- Affiliations: All authors are affiliated with AMAP, Alibaba Group, Beijing, China. This indicates the research is driven by industrial application needs from a major map and navigation service provider.
- Journal/Conference: The paper mentions "In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX)". This, along with the arXiv link, suggests it is a preprint submitted for peer review, likely to a top-tier conference in data mining (e.g., KDD, WSDM) or recommender systems (e.g., RecSys).
- Publication Year: The reference format template in the paper lists "2018", but the content and cited works (Llama 2 from 2023, DPO from 2023) clearly indicate a post-2023 publication date. The arXiv identifier 2508.16126 corresponds to an August 2025 submission, so this is a very recent work.
- Abstract: The paper introduces Spacetime-GR, the first generative model designed for large-scale, online Point-of-Interest (POI) recommendation that is aware of spatiotemporal context. To handle the massive number of POIs, it proposes a geographic-aware hierarchical indexing strategy. A novel spatiotemporal encoding module is introduced to make the model sensitive to time and location variations. The model is further enhanced with multimodal POI embeddings. A multi-stage training framework (pre-training, post-training adaptation) allows Spacetime-GR to produce various outputs (embeddings, ranking scores, direct recommendations) for different downstream tasks. The authors demonstrate its superior performance on public and large-scale industrial datasets and report its successful deployment in a system serving hundreds of millions of users.
- Original Source Link:
  - Official Source: https://arxiv.org/abs/2508.16126
  - PDF Link: https://arxiv.org/pdf/2508.16126v1.pdf
  - Publication Status: This is a preprint available on arXiv, not yet formally published in a peer-reviewed venue at the time of this analysis.
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Traditional recommendation systems struggle to effectively incorporate the dynamic influence of time and space, which is critical for Point-of-Interest (POI) recommendations. While powerful generative models (like LLMs) excel at sequence modeling for tasks like video or product recommendations, their application to POI recommendation faces unique and significant challenges.
  - Importance & Gaps:
    - Large Vocabulary: Real-world map applications involve hundreds of millions of POIs. Treating each POI as a unique token creates an unmanageably large vocabulary, making training computationally prohibitive.
    - Spatiotemporal Sensitivity: A user's interest in a POI is highly contextual. For example, a user looks for restaurants at noon but cafes in the afternoon; their interests also differ when they are at home versus traveling (see Figure 1). Existing models do not adequately capture this sensitivity.
    - Data Sparsity: Many POIs appear infrequently in user histories (long-tail distribution), making it difficult for models to learn meaningful representations for them.
  - Fresh Angle: The paper proposes Spacetime-GR, the first model to adapt the generative paradigm specifically for the complexities of large-scale, online, spatiotemporally-aware POI recommendation. It tackles the core challenges head-on with a novel indexing strategy, explicit spatiotemporal encoding, and a flexible multi-stage training framework designed for practical deployment.
- Main Contributions / Findings (What):
  - A Novel Generative Model (Spacetime-GR): A spacetime-aware generative model built on a decoder-only transformer architecture. It includes three key technical innovations:
    - A geographic-aware hierarchical POI indexing strategy to manage the massive POI vocabulary.
    - A spatiotemporal encoding module that integrates the user's time and location context directly into the input sequence.
    - The use of multimodal POI embeddings (from text and images) to enrich the model's understanding of POIs.
  - A Practical Training and Deployment Framework: The paper introduces a three-stage process:
    - Pre-training: Learns general user behavior patterns from massive, cleaned action sequences.
    - Supervised Fine-Tuning (SFT): Adapts the model for specific downstream tasks, enabling it to output either high-quality embeddings or direct ranking scores.
    - Alignment (DPO): Further refines the model to directly generate ranked POI lists that align with user preferences (clicks vs. non-clicks).
  - First-of-its-Kind Industrial Deployment: The authors claim this is the first generative model successfully deployed in a large-scale industrial online POI recommendation system, serving hundreds of millions of users and POIs. This demonstrates the method's practicality and effectiveness beyond academic benchmarks.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Point-of-Interest (POI) Recommendation: A specialized recommender system task that suggests physical locations (restaurants, parks, shops, etc.) to users. Unlike recommending digital items, POI recommendations are heavily influenced by the user's current location, time of day, and historical movement patterns.
  - Sequential Recommendation: Models user preferences by treating their historical interactions (e.g., clicks, purchases, visits) as a sequence. The goal is to predict the next item in the sequence. Models like Recurrent Neural Networks (RNNs) and Transformers are commonly used.
  - Generative Recommendation (GR): A paradigm where a model directly generates a list of recommended items, often framed as a sequence generation task. This contrasts with discriminative recommendation, which typically scores a pre-selected set of candidate items and ranks them. Generative models, often based on decoder-only transformer architectures like those used in LLMs, learn the probability distribution of the next item given the user's history.
  - Large Language Models (LLMs): Massive neural networks (e.g., GPT, Llama) trained on vast amounts of text. They possess strong sequence modeling and reasoning capabilities, which recent research has tried to leverage for recommendation.
  - Direct Preference Optimization (DPO): A technique for fine-tuning language models to align with human (or implicit) preferences. Instead of complex reinforcement learning, DPO uses a simple loss function on pairs of preferred and dispreferred responses, making alignment more stable and efficient.
- Previous Works:
  - Discriminative Sequential Recommendation: Early methods used networks like GRU4Rec and Caser. More recent methods like SASRec adapted the Transformer architecture for this task. Other works (HLLM, LEARN) focus on pre-training powerful encoders to produce user/item representations (embeddings) that are then fed into a separate ranking model. These follow a multi-stage pipeline (recall, pre-ranking, ranking).
  - Generative Sequential Recommendation: Models like SASRec can be seen as generative, as they predict the next item autoregressively. More recent works like TIGER and OneRec address the large vocabulary problem by representing items with multiple semantic IDs (similar to tokenization in text). Others fine-tune LLMs directly for recommendation.
  - POI Recommendation: Most prior work (STGCN, PLSPL, STAN, STHGCN) has been evaluated on small, offline check-in datasets. They have explored various architectures like RNNs, GNNs, and Transformers to model spatiotemporal dependencies. However, these methods are not designed to handle the scale (hundreds of millions of users/POIs) and online nature of industrial systems.
- Differentiation: Spacetime-GR distinguishes itself from previous work in several crucial ways:
  - Scale and Application: It is explicitly designed for large-scale industrial online POI recommendation, whereas most academic POI research uses small, offline datasets.
  - Generative + Spatiotemporal: It is the first to combine the power of the generative paradigm with dedicated mechanisms for handling spatiotemporal context in POI recommendation.
  - Vocabulary Management: Instead of using generic semantic IDs (TIGER) or single-ID hashing, it introduces a geographically-aware hierarchical index (block, inner) that is tailored to the spatial nature of POIs.
  - Flexible Deployment Framework: The three-stage training process (pre-train, SFT, align) allows a single core model to be adapted for multiple downstream applications (feature engineering for ranking, end-to-end recommendation), which is highly valuable in a production environment.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Spacetime-GR model and its multi-stage training framework.
3.1 Task Definition
The task is defined as spacetime-aware online POI recommendation.
- Input:
  - A user's historical action sequence. Each action is a tuple containing:
    - Timestamp ($t$)
    - User's geographic location ($g_u$)
    - POI index ($p$)
    - POI's geographic location ($g_p$)
    - POI category ($c$)
    - Action type ($a$, e.g., 'click')
  - User profile information (up)
  - Current request context: the timestamp and user location at request time
- Output: Predict the next POI the user will interact with.
The paper provides an example of the data format for a single action:
Manually transcribed from the paper's Table 1.
Key | Explanation | Example Value |
time t | a 13-digit timestamp | 1709805845148 |
user geo info gu | the longitude and latitude of user location | x: 118.2252, y: 24.6001 |
POI p | the index of POI | 123 |
POI geo info gp | the longitude and latitude of POI location | x: 118.3468, y: 24.1159 |
POI category c | the type of POI | food, French food |
action type a | the type of action | click |
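As a rough illustration of this record layout, one logged action could be held in a small typed structure. The class name `Action` and its field names are hypothetical and chosen only to mirror the table's keys; treating the category as a coarse-to-fine path is also an assumption. The values are the table's example values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class Action:
    """One user action; field names are illustrative, not the paper's schema."""
    time_ms: int                    # 13-digit timestamp in milliseconds
    user_geo: Tuple[float, float]   # (longitude, latitude) of the user
    poi: int                        # POI index
    poi_geo: Tuple[float, float]    # (longitude, latitude) of the POI
    category: Tuple[str, ...]       # POI category (assumed coarse-to-fine path)
    action_type: str                # e.g. "click"

    def when(self) -> datetime:
        """Convert the millisecond timestamp to a UTC datetime."""
        return datetime.fromtimestamp(self.time_ms / 1000, tz=timezone.utc)


example = Action(
    time_ms=1709805845148,
    user_geo=(118.2252, 24.6001),
    poi=123,
    poi_geo=(118.3468, 24.1159),
    category=("food", "French food"),
    action_type="click",
)
```

A user's history is then simply a time-ordered list of such records.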
3.2 Spacetime-GR Framework
The framework is built around a decoder-only transformer (based on Llama 2) and consists of three stages as shown in Figure 2.
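Figure 2 (schematic from the paper): the overall structure of the Spacetime-GR model, comprising three training stages: the pre-training stage, the SFT stage, and the alignment stage. The figure details how the geographic-aware hierarchical POI indexing, the spatiotemporal encoding module, and the multimodal POI embeddings are integrated with generative ranking.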
3.2.1 Pre-training Stage
The goal of this stage is to learn fundamental patterns in user behavior from a massive dataset of user action sequences.
- Data Cleansing: To improve training quality, a two-level data filtering strategy is applied:
  - Action Level: Actions are classified as either functional (e.g., navigating home, searching for a hospital) or interest-based (e.g., clicking on restaurants, entertainment venues). Only interest-based actions are used as prediction targets, as they better reflect a user's latent interests.
  - Sequence Level: A richness metric is defined to filter out monotonous sequences (e.g., a user only commuting between home and work). Sequences with low richness are discarded.
- Model Structure & Input Encoding:
  - Geographic-aware Hierarchical POI Indexing: To handle the 100M+ POI vocabulary, each POI is represented by two tokens:
    - block: An ID representing the 5km x 5km geographic grid cell where the POI is located.
    - inner: A local index of the POI within that block.
    This reduces the vocabulary size from ~100M to ~400K, making the softmax output layer computationally feasible.
  - Spatiotemporal Encoding Module: Each user action is converted into a sequence of four tokens: a spatiotemporal token, the POI's block and inner tokens, and an action token.
    - Spatiotemporal token: Represents the user's spatiotemporal context (time $t$ and location $g_u$).
    - Action token: The action type $a$.
  - Feature Embedding:
    - The embeddings for these tokens are enriched with side information.
    - The embedding for the spatiotemporal token is a weighted sum of time and user geo-location embeddings.
    - The embedding for a POI token is a weighted sum of its base ID embedding, POI category embedding ($c$), and POI geo-location embedding ($g_p$). Each feature type (time, geo, POI, category, action) has its own embedding layer, and the weights of the sums are learnable.
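The two-token indexing idea can be sketched as follows. This is only an illustration under stated assumptions, not the paper's procedure: the grid is built with a simple equirectangular approximation (about 111.32 km per degree of latitude), the inner index is assigned by ascending POI id, and the helper names `block_of` and `build_index` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

KM_PER_DEG_LAT = 111.32   # approximate value; an assumption of this sketch
BLOCK_KM = 5.0            # block edge length taken from the text

Block = Tuple[int, int]


def block_of(lon: float, lat: float) -> Block:
    """Return the (column, row) of the roughly 5 km x 5 km cell containing a point."""
    row = math.floor(lat * KM_PER_DEG_LAT / BLOCK_KM)
    # A degree of longitude shrinks with latitude. Use the row's lower-edge
    # latitude so every point in the same row shares one column width.
    ref_lat = row * BLOCK_KM / KM_PER_DEG_LAT
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    col = math.floor(lon * km_per_deg_lon / BLOCK_KM)
    return col, row


def build_index(pois: Dict[int, Tuple[float, float]]) -> Dict[int, Tuple[Block, int]]:
    """Map each POI id to (block, inner), where inner counts POIs within the block."""
    per_block_count: Dict[Block, int] = defaultdict(int)
    index: Dict[int, Tuple[Block, int]] = {}
    for poi_id in sorted(pois):              # deterministic inner numbering
        lon, lat = pois[poi_id]
        block = block_of(lon, lat)
        index[poi_id] = (block, per_block_count[block])
        per_block_count[block] += 1
    return index


index = build_index({
    1: (118.3468, 24.1159),
    2: (118.3470, 24.1160),   # a few metres from POI 1
    3: (116.4000, 39.9000),   # a different city
})
```

In a scheme like this, the token vocabulary grows with the number of blocks plus the largest per-block POI count rather than with the total number of POIs, which is the effect the text describes.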
- Loss Function: The model is trained using a standard cross-entropy loss to predict the next token. Crucially, the loss is only computed for the block and inner tokens of interest-based actions.
- Curriculum Learning Strategy: Training proceeds from simple to complex data.
  - First, train on single-pattern sequences (e.g., only local actions, only travel actions).
  - Then, train on multi-pattern sequences that contain transitions between different states (e.g., local to travel).
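The selective loss described in this stage amounts to a next-token cross-entropy averaged only over chosen positions. A minimal pure-Python sketch, assuming the caller supplies a boolean `loss_mask` that is true exactly where the target is a block or inner token of an interest-based action (the function and argument names are hypothetical):

```python
import math
from typing import Sequence


def masked_next_token_loss(
    logits: Sequence[Sequence[float]],   # one row of vocabulary logits per position
    targets: Sequence[int],              # next-token id at each position
    loss_mask: Sequence[bool],           # True where the target should be scored
) -> float:
    """Mean cross-entropy over the positions selected by loss_mask."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue                      # functional actions and other tokens are skipped
        m = max(row)                      # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]      # equals -log softmax(row)[target]
        count += 1
    return total / count if count else 0.0


loss = masked_next_token_loss(
    logits=[[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]],
    targets=[0, 1, 0],
    loss_mask=[True, False, True],
)
```

Positions that are masked out still feed the model as context; they just contribute no gradient signal as targets.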
3.2.2 Supervised Finetuning (SFT) Stage
After pre-training, the model is fine-tuned on a downstream recommendation task dataset, which contains user sequences, candidate POIs, and click labels (positive/negative). Two SFT strategies are proposed:
- Embedding-Based Ranking SFT:
  - Goal: To produce high-quality user and POI embeddings for use in a downstream ranking model.
  - Architecture: A dual-tower structure. The Spacetime-GR model acts as an encoder for both the user side (history + context) and the POI side.
  - Process:
    - User embedding is generated by encoding the user profile, action sequence, and request context.
    - POI embedding is generated by encoding the POI's hierarchical index and side information.
  - Loss Function: InfoNCE loss is used to pull the user embedding closer to embeddings of clicked POIs (positives) and push it away from embeddings of unclicked POIs (negatives). The loss includes a temperature hyperparameter.
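For intuition, the InfoNCE objective for one user and one clicked POI can be sketched as below. This uses the standard InfoNCE form with dot-product similarity; the similarity function, the default temperature value, and the function names are assumptions of this sketch rather than details confirmed by the paper.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings of equal length."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    user: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,             # illustrative default, not from the paper
) -> float:
    """-log of the softmax probability assigned to the positive POI."""
    scores = [dot(user, positive)] + [dot(user, n) for n in negatives]
    scaled = [s / temperature for s in scores]
    m = max(scaled)                        # stabilise the log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
swapped = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], temperature=1.0)
```

A smaller temperature sharpens the softmax, so hard negatives that sit close to the user embedding are penalised more strongly.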
- Generative Ranking SFT:
  - Goal: To directly output a ranking score for each candidate POI.
  - Architecture: A cross-encoder structure where user and POI information interact at all layers.
  - Process: The user sequence and all candidate POIs are concatenated into a single input sequence. A modified attention mask ensures that the representation for each POI is computed based only on the user information and its own features, not other candidate POIs.
  - Loss Function: The hidden state corresponding to each POI's inner token is passed through a classification head to predict a click probability. A standard binary cross-entropy loss is used for training.
- Multimodal POI Embeddings: In the SFT stage, the POI representation is enriched by adding pre-computed embeddings derived from a multimodal LLM that processes the POI's text (name, address, reviews) and images.
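The modified attention mask used in generative ranking SFT can be illustrated with a small boolean matrix. The sketch assumes a layout of user tokens followed by candidates of equal token length, and keeps the usual causal constraint of a decoder-only model; both the layout and the function name `candidate_isolated_mask` are assumptions for illustration.

```python
from typing import List


def candidate_isolated_mask(
    n_user: int, n_candidates: int, tokens_per_candidate: int
) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    total = n_user + n_candidates * tokens_per_candidate

    def group(pos: int) -> int:
        """-1 for user tokens, otherwise the index of the owning candidate."""
        return -1 if pos < n_user else (pos - n_user) // tokens_per_candidate

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):                       # causal: current and earlier positions only
            if group(j) == -1 or group(j) == group(i):
                mask[i][j] = True                    # user context, or the same candidate
    return mask


# 3 user tokens, then 2 candidates of 2 tokens each (e.g. block and inner).
mask = candidate_isolated_mask(n_user=3, n_candidates=2, tokens_per_candidate=2)
```

Because no candidate can see another, each candidate's score depends only on the user context, so the scores do not change with the order in which candidates are packed into the sequence (provided positions are handled consistently).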
3.2.3 Alignment Stage
This stage aims to refine the model's ability to directly generate a ranked list of POIs.
- DPO Training: Direct Preference Optimization is used to align the model with user preferences.
  - Data: Clicked POIs are treated as "chosen" (preferred) responses, and exposed but unclicked POIs are "rejected" (dispreferred) responses.
  - Loss Function: The DPO loss encourages the model to assign a higher probability to chosen POIs than to rejected POIs, relative to a frozen reference model (the initial pre-trained model). In the loss, Align is the model being trained, Ref is the frozen reference model, each training pair consists of a positive and a negative POI, and a scaling parameter weights the log-probability ratios.
- Spatiotemporal Sensitivity: This framework can be used to instill specific behaviors, e.g., by creating preference pairs where the "chosen" POI is a restaurant during meal times and the "rejected" POI is not, for the same context.
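The preference objective can be sketched for a single chosen/rejected pair. This follows the standard DPO formulation, which matches the quantities named above; the argument names and the default value of the scaling parameter `beta` are assumptions of this sketch, and the log-probabilities are taken as given inputs rather than computed from a model.

```python
import math


def dpo_loss(
    logp_align_chosen: float,     # log-prob of the clicked POI under the trained model
    logp_align_rejected: float,   # log-prob of the unclicked POI under the trained model
    logp_ref_chosen: float,       # same quantities under the frozen reference model
    logp_ref_rejected: float,
    beta: float = 0.1,            # scaling parameter; value is illustrative
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (logp_align_chosen - logp_ref_chosen)
        - (logp_align_rejected - logp_ref_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


at_reference = dpo_loss(-2.0, -3.0, -2.0, -3.0)   # trained model equals the reference
improved = dpo_loss(-1.0, -4.0, -2.0, -3.0)       # chosen up, rejected down
```

When the trained model matches the reference, the margin is zero and the loss is log 2; the loss falls as the model shifts probability from the rejected POI toward the chosen one relative to the reference.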
5. Experimental Setup
- Datasets:
  - Industrial Dataset: An internal dataset from Amap with hundreds of millions of users and POIs. Manually transcribed from the paper's Table 3.
Stage | Data Type | Samples | Length | Candidate POI Num |
Pre-training | Train | 578M | 146.3 | - |
Pre-training | Validation | 19,794 | 142.6 | - |
Pre-training | Test | 19,837 | 143.2 | - |
SFT & Alignment | Train | 31M | 301.0 | 11.3 |
SFT & Alignment | Validation | 611K | 334.5 | 10.5 |
SFT & Alignment | Test | 553K | 346.1 | 10.6 |
  - Public Datasets: Three widely-used offline check-in datasets to test generalizability: Foursquare-NYC, Foursquare-TKY, and Gowalla-CA. These are much smaller (thousands of users/POIs).
- Evaluation Metrics:
  - AUC (Area Under the ROC Curve):
    - Conceptual Definition: Measures the ability of a model to distinguish between positive and negative classes. An AUC of 1.0 means a perfect classifier, while 0.5 indicates a random guess. It is widely used for evaluating binary classification and ranking quality.
    - Mathematical Formula: $\mathrm{AUC} = \frac{1}{|D^{+}|\,|D^{-}|} \sum_{i \in D^{+}} \sum_{j \in D^{-}} \mathbb{I}\big(\mathrm{score}(i) > \mathrm{score}(j)\big)$
    - Symbol Explanation: $D^{+}$ and $D^{-}$ are the sets of positive and negative samples. score(i) is the model's predicted score for item $i$. $\mathbb{I}(\cdot)$ is the indicator function, which is 1 if the condition is true and 0 otherwise. The formula calculates the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample.
  - CTR (Click-Through Rate) / CVR (Conversion Rate):
    - Conceptual Definition: Business metrics used in online systems. CTR is the ratio of clicks to impressions. CVR is the ratio of conversions (e.g., making a purchase, navigating to a POI) to clicks. They measure user engagement and the business impact of the recommendations.
  - Hit Rate (hr@k):
    - Conceptual Definition: Measures whether the true next item is present in the top-$k$ recommended items. It evaluates the recall of the model.
    - Mathematical Formula: $\mathrm{hr@}k = \frac{1}{N} \sum_{u=1}^{N} \mathbb{I}\big(p_u \in R_u^{k}\big)$
    - Symbol Explanation: $N$ is the total number of test cases (users). $p_u$ is the ground-truth next POI. $R_u^{k}$ is the list of top $k$ POIs recommended for user $u$. $\mathbb{I}(\cdot)$ is the indicator function.
  - LLM & Human Evaluation: For the generative task, results are compared based on win/even/lose rates from GPT-4o, Qwen-Plus, and human evaluators.
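Both offline metrics are easy to compute directly from their definitions. The sketch below is a plain, quadratic-time illustration with hypothetical function names; counting tied scores as half a win is a common convention added here, not something stated in the text.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5            # tie convention (an assumption of this sketch)
    return wins / (len(pos_scores) * len(neg_scores))


def hit_rate_at_k(
    ranked_lists: Sequence[Sequence[int]], targets: Sequence[int], k: int
) -> float:
    """Share of test cases whose ground-truth POI appears in the top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


example_auc = auc(pos_scores=[0.9, 0.4], neg_scores=[0.5, 0.1])
example_hr = hit_rate_at_k(ranked_lists=[[1, 2, 3], [4, 5, 6]], targets=[2, 6], k=2)
```

In the first call, three of the four positive/negative pairs are ordered correctly; in the second, only the first user's target falls inside the top 2.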
- Baselines:
  - Industrial: The main baseline is the existing online ranking model in production at Amap.
  - Public: Several sequential recommendation models are used as baselines, including LSTM, STGCN, PLSPL, STAN, GETNext, and STHGCN.
6. Results & Analysis
Core Results on Industrial Dataset
- SFT for Ranking:
  - The features generated by Spacetime-GR significantly improve the performance of the existing online ranking model.
  - Embedding-based SFT improves AUC by 1.86 pp.
  - Generative ranking SFT improves AUC by 2.29 pp, showing the benefit of deeper user-item interaction.
  - Combining both strategies yields the best result, with a 3.42 pp AUC improvement.
  - In a live A/B test, the enhanced model achieved a 6% CTR improvement and a 4.2% CVR improvement, demonstrating significant business value.
Manually transcribed from the paper's Table 4.
Methods | AUC |
online ranking model | 0.7043 |
+ embedding-based ranking SFT | 0.7229 |
+ generative ranking SFT | 0.7272 |
+ embedding-based ranking & generative ranking SFT | 0.7385 |
- Alignment for End-to-End Recommendation:
  - The DPO-aligned Spacetime-GR was compared against the online system.
  - LLMs and human evaluators both found Spacetime-GR's recommendations to be superior. For instance, at the system level, GPT-4o/Qwen-Plus judged Spacetime-GR as the winner 67% of the time, versus 31% for the online model.
Manually transcribed from the paper's Table 5.
Spacetime-GR vs online model | Win | Even | Lose |
system level | 67.0% | 2.0% | 31.0% |
POI level | 69.9% | 10.7% | 19.4% |
human | 55.2% | 14.3% | 30.5% |
Results on Public Datasets
- On the smaller, offline check-in datasets, a simplified Spacetime-GR achieves performance comparable to or better than state-of-the-art baselines like STHGCN. It outperforms STHGCN on NYC (0.2920 vs. 0.2734) and trails it on TKY (0.2610 vs. 0.2950) and CA (0.1659 vs. 0.1730), while still beating every other baseline, despite STHGCN using information from other users' sequences while Spacetime-GR only uses the current user's sequence. This suggests the model generalizes beyond the industrial setting.
Manually transcribed from the paper's Table 6.
Method | NYC | TKY | CA |
LSTM [15] | 0.1305 | 0.1335 | 0.0665 |
STGCN [55] | 0.1799 | 0.1716 | 0.0961 |
PLSPL [45] | 0.1917 | 0.1889 | 0.1072 |
STAN [30] | 0.2231 | 0.1963 | 0.1104 |
GETNext [49] | 0.2435 | 0.2254 | 0.1357 |
STHGCN [46] | 0.2734 | 0.2950 | 0.1730 |
Spacetime-GR | 0.2920 | 0.2610 | 0.1659 |
Ablation Studies
- Pre-training Stage: The ablation study confirms the importance of each proposed component.
  - Removing spatiotemporal information causes the largest drop in hr@1 (from 0.1525 to 0.1007) and a large drop in hr@100 (from 0.4721 to 0.3671), making it one of the most critical components.
  - The geographic-aware hierarchical index is significantly better than a traditional hashing-based index; removing it causes the largest drop in hr@100 (from 0.4721 to 0.3480).
  - Curriculum learning provides a modest but consistent improvement.
Manually transcribed from the paper's Table 7.
Methods | hr@1 | hr@100 |
GPT-based | 0.0688 | 0.2195 |
Spacetime-GR w/o spatiotemporal info | 0.1007 | 0.3671 |
Spacetime-GR w/o hierarchical POI index | 0.1328 | 0.3480 |
Spacetime-GR w/o curriculum learning | 0.1463 | 0.4624 |
Spacetime-GR | 0.1525 | 0.4721 |
Note: The table in the paper seems to have two entries for "w/o hierarchical POI index". I have transcribed them as they appear.
- SFT Stage:
  - Fine-tuning from the pre-trained model is vastly superior to training from scratch (0.7080 vs. 0.6621 AUC for embedding-based SFT, 0.7371 vs. 0.6648 for generative ranking SFT), highlighting the value of pre-training.
  - Generative ranking SFT outperforms embedding-based SFT (0.7371 vs. 0.7080), confirming that deeper interaction modeling is more powerful, though more computationally expensive.
  - Adding multimodal embeddings raises the AUC of embedding-based SFT from 0.7080 to 0.7214, demonstrating the value of richer POI content.
Manually transcribed from the paper's Table 8.
Methods | AUC |
embedding-based ranking SFT from scratch | 0.6621 |
embedding-based ranking SFT | 0.7080 |
embedding-based ranking SFT + multimodal | 0.7214 |
generative ranking SFT from scratch | 0.6648 |
generative ranking SFT | 0.7371 |
Note: The paper is missing the result for "generative ranking SFT + multimodal".
- Alignment Stage:
  - DPO alignment improves the ranking ability over the pre-trained model, especially for hr@10 (0.4520 vs. 0.4295), showing it successfully refines the model's ability to rank relevant items higher.
Manually transcribed from the paper's Table 9.
Methods | hr@1 | hr@10 |
pre-trained | 0.1960 | 0.4295 |
DPO | 0.2006 | 0.4520 |
S-DPO | 0.2008 | 0.4512 |
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Spacetime-GR, a novel generative model tailored for the unique challenges of large-scale online POI recommendation. By developing a geographically-aware hierarchical POI index, a spatiotemporal encoding module, and a versatile three-stage training framework (pre-training, SFT, DPO alignment), the authors create a system that is both powerful and practical. The model is competitive with or better than the strongest baseline on public benchmarks and delivers significant improvements in CTR and CVR in a live industrial setting, which the authors describe as the first successful deployment of a generative model in such a system.
- Limitations & Future Work: The provided text cuts off before detailing the authors' own discussion of limitations. However, based on the content, potential limitations could include:
  - Computational Cost: While more efficient than a flat vocabulary, the model is still large and requires significant resources for pre-training (96 H20 GPUs for 7 days).
  - Data Dependency: The model's performance heavily relies on the massive, proprietary dataset from Amap. Its effectiveness might vary on smaller or differently distributed datasets.
  - Complexity: The full three-stage pipeline is complex to implement and maintain in a production environment.
- Personal Insights & Critique:
  - Strengths:
    - The paper is an excellent example of industry-driven research that tackles real-world problems at scale. It bridges the gap between academic theory (generative models, DPO) and industrial practice.
    - The proposed solutions are well-motivated and elegant. The geographic hierarchical index is a clever and domain-specific solution to the vocabulary problem. The explicit encoding of spatiotemporal context as a token is a simple yet powerful idea.
    - The modular framework (pre-train, SFT, align) is a major strength, providing a flexible pathway to deploy a single powerful foundation model in various roles within a complex recommendation ecosystem.
  - Critique & Open Questions:
    - The comparison with STHGCN on public datasets is interesting, but STHGCN's use of graph information (linking other users' trajectories) is a fundamentally different approach. A more direct comparison would be against other single-user sequence models on those datasets.
    - The paper could have explored the trade-offs of the hierarchical index in more detail, such as the impact of grid size (5km x 5km) on performance.
    - While the model is "spacetime-aware," the analysis could go deeper into how the model uses this information. For example, visualizations of attention weights could show whether the model focuses on relevant temporal or spatial cues when making a prediction. The provided text was cut short before this analysis was presented.