- Title: Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
- Authors: Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, and Clark Mingxuan Ju.
- Affiliations: The authors are from Michigan State University and Snap Inc., indicating a collaboration between academia and industry. This background suggests the research is grounded in real-world recommender system challenges while maintaining academic rigor.
- Journal/Conference: The paper is presented as a preprint on arXiv. Preprints are common in fast-moving fields like machine learning, allowing for rapid dissemination of results. The citations to top-tier conferences like SIGIR, KDD, and NeurIPS suggest the work is positioned for a high-impact venue.
- Publication Year: 2025. The arXiv identifier (2509.25522) indicates a September 2025 submission, and the paper cites other 2025 work.
- Abstract: The paper investigates the scaling properties of two Generative Recommendation (GR) paradigms. It finds that the popular SID-based GR approach, which uses quantized semantic IDs to represent items, suffers from a performance bottleneck. As model components (encoder, tokenizer, recommender) are scaled up, performance quickly saturates. The authors identify the limited information capacity of Semantic IDs (SIDs) as the fundamental cause. In contrast, they show that the LLM-as-RS paradigm, where a Large Language Model (LLM) directly processes item text, exhibits superior scaling properties. This model's performance improves consistently with size, achieving up to 20% better results than the saturated SID-based model. The paper also challenges the idea that LLMs are poor at collaborative filtering, demonstrating that this ability improves with scale. The authors conclude that LLM-as-RS is a more promising direction for building foundation models for recommendation.
- Original Source Link: https://arxiv.org/abs/2509.25522
4. Methodology (Core Technology & Implementation)
The paper's methodology is an analytical framework for studying the scaling behavior of generative recommender systems. It dissects two primary paradigms: SID-based GR and LLM-as-RS.
Principles: Decomposing Recommendation Error
The core principle is to decompose the total error of a GR model into two parts, based on the two essential types of information a recommender must learn:
- Semantic Information (SI): Understanding the content of items.
- Collaborative Filtering (CF): Understanding user behavior patterns from interaction histories.
Following scaling laws for multimodal models, the paper proposes a general equation for recommendation performance:
$$\text{Recall@}k = R_0 - \frac{A}{N_{SI}^{\,a}} - \frac{B}{N_{CF}^{\,b}}$$
- Recall@k: The primary performance metric, representing the model's accuracy.
- R0: The maximum achievable recall, or the irreducible error ceiling.
- NSI and NCF: The number of model parameters allocated to learning Semantic Information and Collaborative Filtering, respectively.
- A, B, a, b: Positive constants determined empirically by fitting the formula to experimental data. The exponents a and b represent the scaling rates for SI and CF learning.
This formula posits that performance improves (i.e., the negative error terms get smaller) as more parameters are dedicated to learning SI and CF.
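To make this form concrete, here is a minimal Python sketch of the general scaling equation; the constants below are illustrative placeholders, not values fitted in the paper.

```python
import numpy as np

def predicted_recall(n_si, n_cf, r0, A, a, B, b):
    """General scaling form: Recall@k = R0 - A / N_SI^a - B / N_CF^b."""
    return r0 - A / np.power(n_si, a) - B / np.power(n_cf, b)

# Illustrative (made-up) constants: recall rises toward the ceiling r0
# as more parameters are allocated to SI and CF learning.
for n in [1e6, 1e7, 1e8, 1e9]:
    r = predicted_recall(n_si=n, n_cf=n, r0=0.08, A=0.5, a=0.3, B=0.5, b=0.3)
    print(f"N_SI = N_CF = {n:.0e} -> predicted Recall@5 = {r:.4f}")
```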
Analysis of SID-based GR
This paradigm, represented by models like TIGER, has a pipeline with three main components. The authors adapt the scaling formula to this architecture.
Figure 1 (schematic): the two generative recommendation paradigms. SID-based GR encodes item text with an LLM, generates Semantic IDs, and recommends over them, while LLM-as-RS processes the text descriptions directly with an LLM and outputs item titles; the figure also compares the main differences between the two paradigms.
- Components:
- LLM Encoder (NLLM): A frozen, pre-trained model (e.g., Flan-T5) that converts item text into semantic embeddings. It processes only SI.
- Quantization Tokenizer (NQT): A trainable model (e.g., RQ-VAE) that converts the semantic embeddings into discrete SIDs. It also processes only SI (a simplified quantization sketch appears at the end of this subsection).
- RS Module (NRS): A sequential model (e.g., a Transformer) trained to predict the next item's SID from a history of SIDs. It learns from user sequences, so it primarily captures CF but also implicitly learns SI from SID co-occurrence.
- Scaling Formula for SID-based GR:
$$\text{Recall@}k = R_0 - \frac{A}{\left(N_{RS} + \gamma_1 N_{LLM} + \gamma_2 N_{QT}\right)^{a}} - \frac{B}{N_{RS}^{\,b}}$$
- NCF is modeled as just NRS, since the RS module is the only component that sees user interaction sequences.
- NSI is modeled as a weighted sum of all components, as all three contribute to processing semantic content. γ1 and γ2 are "effective parameter coefficients" between 0 and 1, representing how much the frozen LLM and tokenizer contribute to performance. If scaling the LLM encoder helps, γ1 should be greater than 0.
- Steps & Procedures for Analysis:
- Scale the RS Module: Fix NLLM and NQT and progressively increase the size of the RS transformer (NRS) to see if performance saturates.
- Scale the LLM Encoder: Fix NRS and NQT and use different sizes of Flan-T5 (from 77M to 11B parameters) to see if a more powerful encoder improves performance (i.e., if γ1>0).
- Scale the Quantization Tokenizer: Fix NLLM and NRS and vary the tokenizer's capacity by (a) increasing the number of codebooks and (b) increasing the size of each codebook.
- Ablation Study (Diagnosing the Bottleneck): To test the hypothesis that SIDs are the bottleneck, the authors inject richer information directly into the RS module, bypassing the SID representation. They add:
- CF Embeddings: Item embeddings from a pre-trained SASRec model (a pure CF model).
- LLM Embeddings: The original, dense embeddings from the LLM encoder, before quantization.
They then observe the performance change. A large gain from LLM embeddings would confirm they contain information lost during quantization.
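To make the role of the quantization tokenizer described above concrete, here is a minimal, illustrative residual-quantization sketch in NumPy. It is not the paper's RQ-VAE (which learns its codebooks jointly with an autoencoder); the embedding dimension, codebook size, and number of levels are placeholders. It only shows the mechanism by which a dense embedding becomes a short tuple of discrete codes, and where information is lost: whatever residual remains after the last codebook is discarded.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Quantize a dense item embedding into one code index per codebook.

    embedding : (d,) dense embedding from a frozen LLM encoder
    codebooks : list of (K, d) arrays; each level quantizes the residual
                left over from the previous level.
    Returns (sid, reconstruction): the discrete Semantic ID and the lossy
    reconstruction implied by it.
    """
    residual = embedding.copy()
    sid, recon = [], np.zeros_like(embedding)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to every code
        idx = int(np.argmin(dists))                    # pick the nearest code
        sid.append(idx)
        recon += cb[idx]
        residual -= cb[idx]                            # pass the leftover on
    return tuple(sid), recon                           # final residual is simply lost

# Toy setup, e.g. 3 codebooks of size 256 (the configuration discussed above).
rng = np.random.default_rng(0)
d, K, levels = 16, 256, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
item_emb = rng.normal(size=d)
sid, recon = residual_quantize(item_emb, codebooks)
print("SID:", sid, " reconstruction error:", np.linalg.norm(item_emb - recon))
```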
Analysis of LLM-as-RS
This paradigm uses an LLM directly for recommendation, avoiding SIDs entirely.
- Components:
- Frozen LLM (NLLM): The base large language model (e.g., Qwen3).
- LoRA Weights (NLoRA): A small number of trainable parameters added via Low-Rank Adaptation (LoRA) for efficient fine-tuning.
- Scaling Formula for LLM-as-RS:
$$\text{Recall@}k = R_0 - \frac{A}{\left(N_{LoRA} + \gamma N_{LLM}\right)^{a}} - \frac{B}{\left(N_{LoRA} + \beta N_{LLM}\right)^{b}}$$
- Here, both the trainable LoRA weights and the frozen LLM weights are assumed to contribute to learning both SI and CF.
- γ and β are the effective coefficients for the frozen LLM's contribution to SI and CF learning, respectively. If β>0, it means larger base LLMs are better at learning collaborative patterns, even without being fully fine-tuned.
- Steps & Procedures for Analysis:
- General Scaling: Scale the LLM from 0.6B to 14B parameters (with NLoRA scaling proportionally) and plot the performance to observe the overall trend and compare it to SID-based GR.
- Fitting the Scaling Law: Conduct experiments with various combinations of LLM sizes and LoRA ranks to generate a grid of 25 performance data points. Use Huber loss to fit Equation (4) to this data and estimate the values of γ and β. Positive values would confirm that the frozen LLM part contributes to scaling (a minimal fitting sketch follows this list).
- Testing for CF Scaling (Proving β>0): Inject external CF embeddings (from SASRec) into the LLM-as-RS model. The key idea is: if larger LLMs are already better at learning CF (i.e., β>0), then the performance boost from these external CF embeddings should decrease as the LLM gets bigger. The authors test this by observing the change in recall (ΔRecall@k) across different LLM sizes.
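To illustrate the fitting step above, here is a minimal sketch that minimizes a Huber loss over (LLM size, LoRA size, Recall@5) observations with respect to the constants of the LLM-as-RS scaling formula. The data points, initial values, and optimizer settings are placeholders; the paper's exact fitting setup is not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder observations: (N_LLM, N_LoRA, Recall@5) triples.
# In the paper these come from a grid of LLM sizes and LoRA ranks.
data = np.array([
    (0.6e9, 5e6, 0.045), (1.7e9, 5e6, 0.050), (4e9, 1e7, 0.056),
    (8e9, 2e7, 0.060), (14e9, 4e7, 0.064),
])

def predicted(params, n_llm, n_lora):
    r0, A, a, B, b, gamma, beta = params
    return (r0
            - A / (n_lora + gamma * n_llm) ** a
            - B / (n_lora + beta * n_llm) ** b)

def huber_objective(params, delta=1e-3):
    n_llm, n_lora, recall = data[:, 0], data[:, 1], data[:, 2]
    resid = recall - predicted(params, n_llm, n_lora)
    quad = np.minimum(np.abs(resid), delta)      # quadratic part of Huber loss
    lin = np.abs(resid) - quad                   # linear part beyond delta
    return np.sum(0.5 * quad ** 2 + delta * lin)

init = [0.08, 0.5, 0.3, 0.5, 0.3, 0.01, 0.01]    # rough starting guesses
bounds = [(0, 1), (0, None), (0, 1), (0, None), (0, 1), (0, 1), (0, 1)]
fit = minimize(huber_objective, init, bounds=bounds, method="L-BFGS-B")
r0, A, a, B, b, gamma, beta = fit.x
print(f"gamma = {gamma:.4f}, beta = {beta:.4f}")  # values > 0 mean the frozen LLM helps
```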
5. Experimental Setup
- Datasets:
- Source: The experiments use three subsets of the Amazon Review datasets: Beauty, Sports and Outdoors, and Toys and Games.
- Characteristics: These are public benchmark datasets widely used in sequential recommendation research. They contain user interaction sequences (lists of items purchased by users) and rich item metadata (including text descriptions/titles).
- Justification: Using multiple datasets from different domains helps ensure the findings are generalizable and not specific to one product category. The data is fixed to focus purely on model scaling, not data scaling.
- Evaluation Metrics:
The paper uses standard top-k recommendation metrics to evaluate performance; a reference sketch of both metrics appears after the Baselines list below.
- Recall@k:
- Conceptual Definition: Measures the proportion of actual next items (in the test set) that are found within the top-k recommendations made by the model. It answers the question: "Out of all the items the user actually interacted with, what fraction did we successfully recommend?" It is a measure of coverage or sensitivity.
- Mathematical Formula: For a single user, let Itrue be the set of ground-truth next items and Irec be the list of top-k recommended items.
$$\text{Recall@}k = \frac{|I_{true} \cap I_{rec}|}{|I_{true}|}$$
- Symbol Explanation:
- Itrue: The set of one or more items the user interacted with next in the test data.
- Irec: The set of the top k items generated by the recommender.
- ∣⋅∣: Denotes the number of elements in a set.
The final metric is averaged over all users in the test set.
- Normalized Discounted Cumulative Gain (NDCG@k):
- Conceptual Definition: A metric that evaluates the quality of a ranked list of recommendations. Unlike Recall, it rewards models for placing more relevant items higher up in the recommendation list. It is a measure of ranking quality.
- Mathematical Formula:
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \quad \text{where} \quad \text{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$
- Symbol Explanation:
- reli: The relevance of the item at rank i. In recommendation, this is typically 1 if the item is a ground-truth next item and 0 otherwise.
- log2(i+1): A penalty term that "discounts" the relevance of items at lower ranks.
- DCG@k (Discounted Cumulative Gain): The total relevance score, discounted by rank.
- IDCG@k (Ideal DCG): The DCG score of a perfect ranking, where all ground-truth items are placed at the top.
- The final metric is averaged over all users in the test set.
- Baselines:
The study does not focus on beating a wide array of baseline models. Instead, the comparisons are:
- Intra-Paradigm: Models of different sizes within the same paradigm (e.g., a 77M SID-based GR vs. an 11B SID-based GR).
- Inter-Paradigm: The best-performing SID-based GR model vs. LLM-as-RS models of various sizes.
- Ablation Baselines: In ablation studies, the baseline is the model itself without the injected information (e.g., SID-based GR vs. SID-based GR + LLM embeddings). The SASRec model is used not as a direct competitor but as a source of pure CF embeddings for these ablations.
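As a reference for the Recall@k and NDCG@k definitions above, here is a minimal sketch that computes both metrics for a single user from a ranked recommendation list; the item names are illustrative.

```python
import numpy as np

def recall_at_k(recommended, ground_truth, k):
    """Fraction of ground-truth items that appear in the top-k list."""
    top_k = set(recommended[:k])
    return len(top_k & set(ground_truth)) / len(ground_truth)

def ndcg_at_k(recommended, ground_truth, k):
    """Binary-relevance NDCG: rewards placing ground-truth items near the top."""
    relevant = set(ground_truth)
    dcg = sum(1.0 / np.log2(i + 2)               # item at index i sits at rank i+1
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one user whose true next item is "item_42".
recs = ["item_7", "item_42", "item_3", "item_9", "item_1"]
print(recall_at_k(recs, ["item_42"], k=5))   # 1.0 (the item is in the top 5)
print(ndcg_at_k(recs, ["item_42"], k=5))     # ~0.63 (hit at rank 2: 1/log2(3))
```

The per-user values would then be averaged over all test users, as described above.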
6. Results & Analysis
The paper's results systematically uncover the scaling limitations of SID-based GR and highlight the superiority of LLM-as-RS.
Core Results: The Failure of SID-based GR to Scale
- Observation 1: Scaling the RS Module Saturates Quickly (Figure 2)
- The experiments show that as the size of the sequential recommender (NRS) increases, performance (Recall@5) improves initially but then flattens out completely around 13M parameters.
- Implication: Simply making the recommender part bigger yields no further gains beyond a relatively small size. There is a bottleneck elsewhere in the system.
Figure: three subplots showing how Recall@5 changes with recommender (RS) model size (log scale) on Beauty, Sports, and Toys; performance improves quickly as the model grows but saturates at roughly 10^7 parameters.
- Observation 2: Scaling the LLM Encoder Brings No Benefit (Figure 3)
- The results are striking: using a massive 11B parameter Flan-T5 encoder provides virtually no performance improvement over a tiny 77M parameter one. The performance curves are almost flat across all datasets and metrics.
- Implication: The knowledge from more powerful LLMs is not being successfully transferred to the recommender. In the context of the scaling formula, this means the effective parameter coefficient γ1≈0.
Figure 19: a schematic of injecting collaborative filtering (CF) embeddings into the LLM-as-RS model via concatenation; SASRec produces a CF embedding, which is processed by an MLP adapter and concatenated with the LLM tokenizer's output for next-step prediction.
- Observation 3: Scaling the Quantization Tokenizer Also Fails (Figures 4 & 5)
- Increasing the capacity of the tokenizer by adding more codebooks (Figure 4) or enlarging codebook size (Figure 5) shows a similar pattern. Performance peaks at a specific configuration (3 codebooks of size 256) and then either stagnates or degrades.
- Implication: Making SIDs longer or drawing them from a larger vocabulary does not help. It may even hurt by making the prediction task for the RS module harder. This suggests γ2≈0 as well.
Figure: three subplots showing the performance of LLM encoders of different sizes on Beauty, Sports, and Toys, reporting Recall@5, Recall@10, NDCG@5, and NDCG@10; the x-axis is the LLM encoder's parameter count (log scale) and the y-axis is the metric value.
Figure: the effect of LLM encoder size on Recall@5, Recall@10, NDCG@5, and NDCG@10 across Beauty, Sports, and Toys, with encoder size on a log scale; performance remains largely flat as the encoder scales.
Ablations: Pinpointing the SID Bottleneck
- Observation 4: Injecting Raw LLM Embeddings Breaks the Bottleneck (Figure 6)
- When injecting external CF embeddings from SASRec, there is little to no performance gain. This suggests the SID-based GR model is already effective at capturing CF signals.
- However, when injecting the original, un-quantized LLM embeddings directly into the RS module, there is a substantial performance improvement.
- Crucially, this gain is larger for a bigger RS module (21M vs. 13M), indicating that the performance saturation seen earlier was indeed caused by a lack of rich semantic information.
- Implication: The quantization step from dense embeddings to discrete SIDs is the fundamental bottleneck. It discards critical semantic information that the recommender could otherwise use.
Figure: performance of a product-quantization variant with an LLM encoder (Sentence-T5) across domains as the encoder scales; the x-axis is encoder size (log scale) and the y-axis shows various Recall and NDCG metrics.
Core Results: The Success of LLM-as-RS Scaling
- LLM-as-RS Outperforms SID-based GR at Scale (Figure 7)
- The performance of LLM-as-RS models consistently improves as the base LLM scales from 0.6B to 14B parameters, with no sign of saturation.
- At smaller scales, SID-based GR can be competitive, but as the LLM-as-RS model grows, it quickly surpasses the best achievable performance of the SID-based paradigm (the red dashed line).
- Implication: The LLM-as-RS architecture does not suffer from the SID information bottleneck and thus exhibits much healthier scaling behavior.
Figure: a schematic of injecting collaborative filtering embeddings (CF Emb) into the LLM-as-RS model via addition (ADDING), showing how the user's shopping history flows into the LLM input and how the components are arranged.
- Observation 5: The Entire LLM Contributes to Scaling (Figure 8)
- By fitting their proposed scaling equation to experimental data, the authors find high R² values, confirming the model is a good fit.
- Most importantly, the estimated coefficients γ (for SI) and β (for CF) are both found to be greater than zero.
- Implication: This is a key finding. It means that the frozen, pre-trained weights of the LLM (NLLM) actively contribute to improving performance on both semantic and collaborative tasks as the model scales. The LLM isn't just a feature extractor; its latent knowledge is increasingly leveraged at larger sizes.
Figure: three subplots showing the scaling behavior when external collaborative filtering (CF) embeddings are added to the token embeddings; the x-axis is CF model size, the y-axis is the gain in Recall@5 (ΔRecall@5), and line colors denote different LLM backbone sizes. For a fixed CF model size, the gain from external CF embeddings shrinks as the LLM backbone grows.
- Observation 6: LLMs Exhibit Scaling for CF Signals (Figure 9)
- This experiment provides direct evidence for β>0. When external CF embeddings are injected, the performance gain (ΔRecall@5) is largest for the smallest LLM. As the LLM backbone scales up, the benefit from the same external CF information diminishes.
- Implication: This elegantly demonstrates that larger LLMs are inherently better at learning CF patterns from the raw interaction sequence. They rely less on external crutches for CF information as they scale, challenging the prevailing view that LLMs are weak at CF.
Figure: efficiency comparison of the two generative recommendation paradigms across model sizes; the left panel plots training time (A100 GPU hours) against Recall@5 and the right panel plots inference time (milliseconds) against Recall@5.
7. Conclusion & Reflections
- Conclusion Summary:
The paper provides a rigorous and insightful analysis of model scaling in generative recommendation. It concludes that the popular SID-based GR paradigm, despite its conceptual appeal, has a fundamental architectural flaw: the information bottleneck created by SIDs prevents it from scaling effectively. Performance saturates early, and bigger, more powerful encoders do not help. In contrast, the LLM-as-RS paradigm, which processes text directly, demonstrates superior scaling properties. Its ability to learn both semantic information and collaborative filtering signals improves consistently with model size. This positions LLM-as-RS as a more promising and scalable path toward building powerful, general-purpose foundation models for recommendation.
- Limitations & Future Work:
- The authors acknowledge a key limitation: efficiency. As discussed in their appendix (Section K), LLM-as-RS models are currently much slower and more computationally expensive for both training and inference than the more lightweight SID-based models.
- This creates a practical trade-off: SID-based GR may be preferable for applications with tight latency or budget constraints, while LLM-as-RS is the choice when performance is paramount and resources are ample.
- Future work should focus on bridging this gap, exploring ways to make LLM-as-RS more efficient or to design new SID generation techniques that overcome the information bottleneck.
- Personal Insights & Critique:
- Significance: This paper is an excellent example of how foundational research can provide more value than simply chasing state-of-the-art numbers. By applying the principles of scaling laws from NLP to the recommender systems domain, the authors provide a clear, evidence-based framework for evaluating and comparing different architectures. This work offers crucial guidance for the entire field on where to invest research efforts.
- Critique & Open Questions:
- The conclusion that SIDs are an intrinsic bottleneck might be too strong. The study focuses on one type of quantization (RQ-VAE). It is possible that more advanced or multi-modal quantization techniques could be developed to preserve more information. The paper proves a limitation of the current SID approach, but perhaps not of all possible SID approaches.
- The analysis relies on item titles as the source of semantic information. The conclusions might differ for other modalities (e.g., images) or richer text (e.g., full descriptions, user reviews).
- The LLM-as-RS model's ability to learn CF is impressive, but it is likely still not as data-efficient as specialized CF models for a given dataset size. The scaling advantage may stem from the massive, general-domain knowledge pre-trained into the LLM, which helps it structure and interpret the sequential data more effectively. The paper's findings are in a fixed-data regime; how these paradigms scale with data size is another important open question.