Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
TL;DR Summary
This study reveals scaling bottlenecks in semantic ID-based generative recommendation due to limited encoding capacity. Directly using large language models as recommenders outperforms it by up to 20%, challenging the assumption that LLMs cannot capture collaborative filtering signals and suggesting a promising path toward foundation models for generative recommendation.
Abstract
Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
English Analysis
1. Bibliographic Information
- Title: Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
- Authors: Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, and Clark Mingxuan Ju.
- Affiliations: The authors are from Michigan State University and Snap Inc., indicating a collaboration between academia and industry. This background suggests the research is grounded in real-world recommender system challenges while maintaining academic rigor.
- Journal/Conference: The paper is presented as a preprint on arXiv. Preprints are common in fast-moving fields like machine learning, allowing for rapid dissemination of results. The citations to top-tier conferences like SIGIR, KDD, and NeurIPS suggest the work is positioned for a high-impact venue.
- Publication Year: The arXiv identifier and the 2025 citations indicate a 2025 preprint. For this analysis, we treat it as a contemporary work from 2024/2025.
- Abstract: The paper investigates the scaling properties of two Generative Recommendation (GR) paradigms. It finds that the popular SID-based GR approach, which uses quantized Semantic IDs to represent items, suffers from a performance bottleneck: as its components (encoder, tokenizer, recommender) are scaled up, performance quickly saturates. The authors identify the limited information capacity of Semantic IDs (SIDs) as the fundamental cause. In contrast, they show that the LLM-as-RS paradigm, where a Large Language Model (LLM) directly processes item text, exhibits superior scaling properties: its performance improves consistently with size, achieving up to 20% better results than the saturated SID-based model. The paper also challenges the idea that LLMs are poor at collaborative filtering, demonstrating that this ability improves with scale. The authors conclude that LLM-as-RS is a more promising direction for building foundation models for recommendation.
- Original Source Link: https://arxiv.org/abs/2509.25522
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Traditional recommender systems (RS) often use a multi-stage retrieval-and-ranking pipeline. A new paradigm, Generative Recommendation (GR), aims to unify this into a single generative stage. A popular GR method, SID-based GR, converts items into Semantic IDs (SIDs) and then uses a sequential model to predict the next item's SID. While generative models in other fields (like NLP and vision) benefit from "scaling laws" (bigger models mean better, predictable performance), it is unclear whether SID-based GR enjoys the same benefit.
- Importance & Gaps: The field is moving towards building large "foundation models" for recommendation. Understanding how different architectures scale is critical for guiding this research and investing computational resources effectively. Prior work focused on improving GR performance by adding new features or signals, but little attention was paid to whether these models can be improved simply by increasing their parameter count. This paper fills that gap by conducting the first systematic model-scaling analysis of GR paradigms.
- Innovation: Instead of proposing a new state-of-the-art model, the paper's innovation is its rigorous, comparative analysis of the scaling behaviors of two distinct GR paradigms. It introduces a formal scaling-law equation for GR and uses it to diagnose a fundamental architectural bottleneck in SID-based GR.
- Main Contributions / Findings (What):
- SID-based GR has a Scaling Bottleneck: The paper empirically demonstrates that SID-based GR models show early performance saturation. Scaling up the sequential recommender, the LLM encoder, or the quantization tokenizer provides diminishing or no returns.
- SIDs are the Information Bottleneck: Through ablation studies, the authors pinpoint the cause: the SIDs themselves are not rich enough to carry the full semantic information from powerful LLM encoders to the recommender model. This information loss is the core constraint.
- LLM-as-RS Scales Superiorly: The alternative LLM-as-RS paradigm, which avoids SIDs by processing raw text, does not suffer from this saturation. Its performance consistently improves with model size, eventually surpassing the peak performance of SID-based GR by up to 20%.
- LLMs Can Learn Collaborative Filtering: The paper challenges the common belief that LLMs struggle with collaborative filtering (CF) signals. It shows that as LLMs scale up, their ability to model user-item interaction patterns (CF) improves, reducing the need for external CF signals.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Recommender Systems (RS): Systems that predict a user's interest in an item and suggest relevant items. They are ubiquitous in e-commerce, streaming, and social media.
- Collaborative Filtering (CF): A classical recommendation technique based on the idea that users who agreed in the past will agree in the future. It relies on patterns in the user-item interaction history (e.g., clicks, purchases) to make recommendations, without needing to understand the items themselves.
- Semantic Information (SI): Refers to the content or meaning of an item, such as its text description, image, or category. Modern RS often use SI to handle new items (cold-start problem) and capture nuanced similarities.
- Generative Recommendation (GR): A new paradigm that frames recommendation as a generation task, similar to how a language model generates text. Instead of retrieving and ranking candidates, a GR model directly generates a representation of the recommended item.
- Semantic IDs (SIDs): A popular technique in GR. Items' semantic features (e.g., text descriptions) are fed into a powerful encoder (like an LLM). The resulting high-dimensional embedding is then "quantized" or compressed into a short sequence of discrete codes, which are the SIDs. These SIDs act as a new vocabulary for the item.
- Quantization Tokenizer: A model used to create SIDs, such as VQ-VAE or RQ-VAE (Residual-Quantized Variational Autoencoder). It learns a set of "codebooks" (collections of vectors) and represents an item's embedding as a combination of codes from these books (see the sketch after this list).
- Scaling Laws: An empirical finding, most famous in LLMs, that model performance improves as a predictable power-law function of model size, dataset size, and compute. These laws allow researchers to forecast the benefits of training larger models.
- LLM-as-RS: An alternative GR paradigm where an LLM is used directly as the recommender. The user's interaction history (e.g., a list of titles of previously watched videos) is formatted as a text prompt, and the LLM is trained to generate the title of the next recommended item.
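To make the SID construction concrete, here is a minimal residual-quantization sketch in Python/NumPy. It is only an illustration with random codebooks and an assumed embedding size; a trained RQ-VAE learns its codebooks jointly with an encoder and decoder.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Map a dense item embedding to a tuple of discrete codes (a Semantic ID).

    embedding: (d,) vector from a frozen text encoder.
    codebooks: (L, K, d) array -- L codebooks, each holding K code vectors.
    Returns the L chosen code indices and the final residual.
    """
    residual = embedding.copy()
    sid = []
    for level in range(codebooks.shape[0]):
        # Pick the code vector closest to the current residual.
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        sid.append(idx)
        # Subtract the chosen code so the next level quantizes what is left over.
        residual = residual - codebooks[level][idx]
    return sid, residual

# Toy usage: 3 codebooks of size 256 (the configuration the paper finds works best),
# quantizing a random 768-dimensional embedding.
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(3, 256, 768))
embedding = rng.normal(size=768)
sid, _ = residual_quantize(embedding, codebooks)
print(sid)  # e.g. [17, 203, 88] -- the item's Semantic ID
```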
- Previous Works & Differentiation:
- The paper situates itself within the context of two-stage recommenders (retrieval-and-ranking) and the emerging single-stage GR paradigm.
- It acknowledges that many modern GR systems, like TIGER, use the SID-based approach to integrate knowledge from foundation models.
- It also references work on LLM-as-RS, noting that this approach is often thought to be weak at capturing CF signals, a belief this paper challenges.
- Differentiation: While previous works aimed to improve GR models by adding more data or complex features (e.g., external CF signals), this paper's unique contribution is its focus on model scaling. It asks a more fundamental question: "Are these architectures built to improve with more parameters?" By analyzing GR through the lens of scaling laws, it uncovers an intrinsic limitation in the SID-based design that other papers missed.
4. Methodology (Core Technology & Implementation)
The paper's methodology is an analytical framework for studying the scaling behavior of generative recommender systems. It dissects two primary paradigms: SID-based GR and LLM-as-RS.
Principles: Decomposing Recommendation Error
The core principle is to decompose the total error of a GR model into two parts, based on the two essential types of information a recommender must learn:
- Semantic Information (SI): Understanding the content of items.
- Collaborative Filtering (CF): Understanding user behavior patterns from interaction histories.
Following scaling laws for multimodal models, the paper proposes a general equation for recommendation performance:
$$\text{Recall} = \text{Recall}_{\max} - A \cdot N_{\text{SI}}^{-a} - B \cdot N_{\text{CF}}^{-b}$$

- $\text{Recall}$: The primary performance metric, representing the model's accuracy.
- $\text{Recall}_{\max}$: The maximum achievable recall, i.e., the irreducible error ceiling.
- $N_{\text{SI}}$ and $N_{\text{CF}}$: The number of model parameters allocated to learning Semantic Information and Collaborative Filtering, respectively.
- $A$, $B$, $a$, $b$: Positive constants determined empirically by fitting the formula to experimental data. The exponents $a$ and $b$ represent the scaling rates for SI and CF learning.

This formula posits that performance improves (i.e., the negative error terms get smaller) as more parameters are dedicated to learning SI and CF.
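To make the decomposition tangible, here is a small Python sketch that evaluates this scaling form for hypothetical parameter allocations. The constants are illustrative placeholders, not values fitted by the paper.

```python
def predicted_recall(n_si, n_cf, recall_max=0.08, A=0.5, B=0.3, a=0.4, b=0.35):
    """Evaluate the assumed scaling form:
    Recall = Recall_max - A * N_SI**(-a) - B * N_CF**(-b).
    All constants here are made-up placeholders, not fitted values."""
    return recall_max - A * n_si ** (-a) - B * n_cf ** (-b)

# Doubling the parameters devoted to semantic information shrinks only the
# SI error term; the CF error term is untouched.
print(predicted_recall(n_si=10e6, n_cf=10e6))
print(predicted_recall(n_si=20e6, n_cf=10e6))
```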
Analysis of SID-based GR
This paradigm, represented by models like TIGER, has a pipeline with three main components. The authors adapt the scaling formula to this architecture.
Figure 1: An overview of the two generative recommendation paradigms. SID-based GR encodes item text with an LLM, quantizes the embeddings into Semantic IDs, and recommends over SIDs; LLM-as-RS feeds the text descriptions directly to an LLM, which outputs item titles. The figure also contrasts the main differences between the two paradigms.
- Components:
- LLM Encoder ($N_{\text{enc}}$ parameters): A frozen, pre-trained model (e.g., Flan-T5) that converts item text into semantic embeddings. It only processes SI.
- Quantization Tokenizer ($N_{\text{tok}}$ parameters): A trainable model (e.g., RQ-VAE) that converts the semantic embeddings into discrete SIDs. It also only processes SI.
- RS Module ($N_{\text{RS}}$ parameters): A sequential model (e.g., a Transformer) trained to predict the next SID sequence from a history of SIDs. It learns from user sequences, so it primarily captures CF but also implicitly learns SI from SID co-occurrence.
- Scaling Formula for SID-based GR:
- $N_{\text{CF}}$ is modeled as just $N_{\text{RS}}$, since the RS module is the only component that sees user interaction sequences.
- $N_{\text{SI}}$ is modeled as a weighted sum of all components, $N_{\text{SI}} = N_{\text{RS}} + c_{\text{enc}} N_{\text{enc}} + c_{\text{tok}} N_{\text{tok}}$, since all three contribute to processing semantic content. Here $c_{\text{enc}}$ and $c_{\text{tok}}$ are "effective parameter coefficients" between 0 and 1, representing how much the frozen LLM encoder and the tokenizer actually contribute to performance. If scaling the LLM encoder helps, $c_{\text{enc}}$ should be greater than 0.
- Steps & Procedures for Analysis:
- Scale the RS Module: Fix the LLM encoder and the tokenizer and progressively increase the size of the RS transformer ($N_{\text{RS}}$) to see if performance saturates.
- Scale the LLM Encoder: Fix the tokenizer and the RS module and use different sizes of Flan-T5 (from 77M to 11B parameters) to see if a more powerful encoder improves performance (i.e., whether $c_{\text{enc}} > 0$).
- Scale the Quantization Tokenizer: Fix the LLM encoder and the RS module and vary the tokenizer's capacity by (a) increasing the number of codebooks and (b) increasing the size of each codebook.
- Ablation Study (Diagnosing the Bottleneck): To test the hypothesis that SIDs are the bottleneck, the authors inject richer information directly into the RS module, bypassing the SID representation (a sketch of such an injection follows this list). They add (i) CF Embeddings: item embeddings from a pre-trained SASRec model (a pure CF model), and (ii) LLM Embeddings: the original, dense embeddings from the LLM encoder, before quantization. They then observe the performance change. A large gain from LLM embeddings would confirm that they contain information lost during quantization.
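As a rough illustration of the injection ablation (not the authors' exact implementation), the sketch below passes an item's dense embedding through a small adapter and concatenates it with the item's SID-token embeddings before they enter the sequential recommender. The module name, adapter shape, and dimensions are assumptions made for this example.

```python
import torch
import torch.nn as nn

class InjectedItemEmbedding(nn.Module):
    """Combine SID-token embeddings with an injected dense embedding
    (e.g., a frozen LLM or SASRec item embedding) via a small MLP adapter."""

    def __init__(self, num_codes, sid_dim, dense_dim):
        super().__init__()
        self.sid_embedding = nn.Embedding(num_codes, sid_dim)
        self.adapter = nn.Sequential(
            nn.Linear(dense_dim, sid_dim), nn.ReLU(), nn.Linear(sid_dim, sid_dim)
        )

    def forward(self, sid_tokens, dense_emb):
        # sid_tokens: (batch, seq_len, codes_per_item) discrete SID codes
        # dense_emb:  (batch, seq_len, dense_dim) frozen embeddings to inject
        sid_emb = self.sid_embedding(sid_tokens).mean(dim=2)   # (batch, seq_len, sid_dim)
        injected = self.adapter(dense_emb)                     # (batch, seq_len, sid_dim)
        # Concatenate along the feature dimension; the downstream Transformer then
        # sees both the SID view and the injected dense view of each item.
        return torch.cat([sid_emb, injected], dim=-1)

# Toy usage with assumed sizes: 256 codes per codebook, 3 codes per item,
# 64-d SID embeddings, 768-d frozen LLM embeddings.
layer = InjectedItemEmbedding(num_codes=256, sid_dim=64, dense_dim=768)
sids = torch.randint(0, 256, (2, 10, 3))
dense = torch.randn(2, 10, 768)
print(layer(sids, dense).shape)  # torch.Size([2, 10, 128])
```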
Analysis of LLM-as-RS
This paradigm uses an LLM directly for recommendation, avoiding SIDs entirely.
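For intuition, the interaction history might be serialized into a prompt roughly like the sketch below; the exact template and instruction wording are assumptions, not the paper's prompt.

```python
history = ["Maybelline Great Lash Mascara", "CeraVe Moisturizing Cream",
           "Neutrogena Hydro Boost Water Gel"]

prompt = (
    "A user purchased the following products in order:\n"
    + "\n".join(f"{i + 1}. {title}" for i, title in enumerate(history))
    + "\nPredict the title of the next product the user will purchase:"
)
print(prompt)
# The LLM (with LoRA fine-tuning) is trained to generate the next item's title.
```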
- Components:
- Frozen LLM ($N_{\text{LLM}}$ parameters): The base large language model (e.g., Qwen3).
- LoRA Weights ($N_{\text{LoRA}}$ parameters): A small number of trainable parameters added via Low-Rank Adaptation (LoRA) for efficient fine-tuning.
- Scaling Formula for LLM-as-RS:
- Here, both the trainable LoRA weights and the frozen LLM weights are assumed to contribute to learning both SI and CF, i.e., $N_{\text{SI}} = N_{\text{LoRA}} + c_{\text{SI}} N_{\text{LLM}}$ and $N_{\text{CF}} = N_{\text{LoRA}} + c_{\text{CF}} N_{\text{LLM}}$.
- $c_{\text{SI}}$ and $c_{\text{CF}}$ are the effective coefficients for the frozen LLM's contribution to SI and CF learning, respectively. If $c_{\text{CF}} > 0$, larger base LLMs are better at learning collaborative patterns, even without being fully fine-tuned.
- Steps & Procedures for Analysis:
- General Scaling: Scale the LLM from 0.6B to 14B parameters (with the LoRA rank scaled proportionally) and plot the performance to observe the overall trend and compare it to SID-based GR.
- Fitting the Scaling Law: Conduct experiments with various combinations of LLM sizes and LoRA ranks to generate a grid of 25 performance data points. Use a Huber loss to fit the scaling formula (Equation (4) in the paper) to this data and estimate the values of $c_{\text{SI}}$ and $c_{\text{CF}}$. Positive values would confirm that the frozen LLM contributes to scaling (a fitting sketch follows this list).
- Testing for CF Scaling (Proving $c_{\text{CF}} > 0$): Inject external CF embeddings (from SASRec) into the LLM-as-RS model. The key idea: if larger LLMs are already better at learning CF (i.e., $c_{\text{CF}} > 0$), then the performance boost from these external CF embeddings should shrink as the LLM gets bigger. The authors test this by observing the change in recall ($\Delta$Recall@5) across different LLM sizes.
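A minimal sketch of this curve-fitting step is shown below, using SciPy's robust least squares with a Huber loss on synthetic data; the parameterization and the data are illustrative assumptions, not the paper's fitting code.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic grid of (LoRA params, LLM params, observed Recall@5) triples.
rng = np.random.default_rng(0)
n_lora = np.repeat([1e6, 2e6, 4e6, 8e6, 16e6], 5)
n_llm = np.tile([6e8, 1.7e9, 4e9, 8e9, 14e9], 5)
recall = (0.08
          - 0.5 * (n_lora + 0.01 * n_llm) ** -0.3
          - 0.3 * (n_lora + 0.005 * n_llm) ** -0.25
          + rng.normal(0, 1e-4, 25))

def residuals(theta):
    # theta = [recall_max, A, a, B, b, c_si, c_cf]; all are free parameters.
    r_max, A, a, B, b, c_si, c_cf = theta
    pred = (r_max
            - A * (n_lora + c_si * n_llm) ** -a
            - B * (n_lora + c_cf * n_llm) ** -b)
    return pred - recall

fit = least_squares(residuals, x0=[0.1, 1.0, 0.3, 1.0, 0.3, 0.1, 0.1],
                    loss="huber", f_scale=1e-3,
                    bounds=([0, 0, 0, 0, 0, 0, 0], [1, 10, 1, 10, 1, 1, 1]))
print(dict(zip(["recall_max", "A", "a", "B", "b", "c_si", "c_cf"],
               np.round(fit.x, 4))))
```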
5. Experimental Setup
- Datasets:
- Source: The experiments use three subsets of the Amazon Review datasets: Beauty, Sports and Outdoors, and Toys and Games.
- Characteristics: These are public benchmark datasets widely used in sequential recommendation research. They contain user interaction sequences (lists of items purchased by users) and rich item metadata (including text descriptions/titles).
- Justification: Using multiple datasets from different domains helps ensure the findings are generalizable and not specific to one product category. The data is fixed to focus purely on model scaling, not data scaling.
- Evaluation Metrics: The paper uses standard top-k recommendation metrics to evaluate performance (a small implementation sketch of both metrics follows the definitions below).
- Recall@k:
- Conceptual Definition: Measures the proportion of actual next items (in the test set) that are found within the top-$k$ recommendations made by the model. It answers the question: "Out of all the items the user actually interacted with, what fraction did we successfully recommend?" It is a measure of coverage or sensitivity.
- Mathematical Formula: For a single user, let $G$ be the set of ground-truth next items and $R_k$ be the list of top-$k$ recommended items. Then
$$\text{Recall@}k = \frac{|G \cap R_k|}{|G|}$$
- Symbol Explanation:
- $G$: The set of one or more items the user interacted with next in the test data.
- $R_k$: The set of the top $k$ items generated by the recommender.
- $|\cdot|$: Denotes the number of elements in a set. The final metric is averaged over all users in the test set.
- Normalized Discounted Cumulative Gain (NDCG)@k:
- Conceptual Definition: A metric that evaluates the quality of a ranked list of recommendations. Unlike Recall, it rewards models for placing more relevant items higher up in the recommendation list. It is a measure of ranking quality.
- Mathematical Formula:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
- Symbol Explanation:
- $rel_i$: The relevance of the item at rank $i$. In recommendation, this is typically 1 if the item is a ground-truth next item and 0 otherwise.
- $\log_2(i+1)$: A penalty term that "discounts" the relevance of items at lower ranks.
- $\text{DCG@}k$ (Discounted Cumulative Gain): The total relevance score, discounted by rank.
- $\text{IDCG@}k$ (Ideal DCG): The DCG score of a perfect ranking, where all ground-truth items are placed at the top.
- The final metric is averaged over all users in the test set.
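The following short Python sketch computes Recall@k and NDCG@k for a single user under the binary-relevance convention described above; it is an illustrative implementation, not code from the paper.

```python
import math

def recall_at_k(ground_truth, ranked_items, k):
    """Fraction of ground-truth items that appear in the top-k recommendations."""
    top_k = set(ranked_items[:k])
    return len(set(ground_truth) & top_k) / len(ground_truth)

def ndcg_at_k(ground_truth, ranked_items, k):
    """NDCG@k with binary relevance (1 if the ranked item is a ground-truth item)."""
    relevant = set(ground_truth)
    dcg = sum(1.0 / math.log2(i + 2)   # rank is 0-based, so i+2 gives log2(rank+1)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the true next item is "B", ranked second.
print(recall_at_k(["B"], ["A", "B", "C"], k=3))  # 1.0
print(ndcg_at_k(["B"], ["A", "B", "C"], k=3))    # 1/log2(3) ≈ 0.63
```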
- Baselines: The study does not focus on beating a wide array of baseline models. Instead, the comparisons are:
- Intra-Paradigm: Models of different sizes within the same paradigm (e.g., a 77M SID-based GR vs. an 11B SID-based GR).
- Inter-Paradigm: The best-performing SID-based GR model vs. LLM-as-RS models of various sizes.
- Ablation Baselines: In ablation studies, the baseline is the model itself without the injected information (e.g., SID-based GR vs. SID-based GR + LLM embeddings). The SASRec model is used not as a direct competitor but as a source of pure CF embeddings for these ablations.
6. Results & Analysis
The paper's results systematically uncover the scaling limitations of SID-based GR and highlight the superiority of LLM-as-RS.
Core Results: The Failure of SID-based GR to Scale
- Observation 1: Scaling the RS Module Saturates Quickly (Figure 2)
- The experiments show that as the size of the sequential recommender ($N_{\text{RS}}$) increases, performance (Recall@5) improves initially but then flattens out completely around 13M parameters.
- Implication: Simply making the recommender part bigger yields no further gains beyond a relatively small size. There is a bottleneck elsewhere in the system.

Figure 2: Recall@5 on Beauty, Sports, and Toys as the RS module size grows (log scale). Performance improves quickly with model size but saturates at a relatively small parameter count.
- Observation 2: Scaling the LLM Encoder Brings No Benefit (Figure 3)
- The results are striking: using a massive 11B-parameter Flan-T5 encoder provides virtually no performance improvement over a tiny 77M-parameter one. The performance curves are almost flat across all datasets and metrics.
- Implication: The knowledge from more powerful LLMs is not being successfully transferred to the recommender. In the context of the scaling formula, this means the effective parameter coefficient $c_{\text{enc}} \approx 0$.

Figure 19 (appendix): Injecting collaborative filtering (CF) embeddings into the LLM-as-RS model via concatenation: SASRec produces a CF embedding, which is processed by an MLP adapter and concatenated with the LLM tokenizer outputs for next-item prediction.
- Observation 3: Scaling the Quantization Tokenizer Also Fails (Figures 4 & 5)
- Increasing the capacity of the tokenizer by adding more codebooks (Figure 4) or enlarging the codebook size (Figure 5) shows a similar pattern. Performance peaks at a specific configuration (3 codebooks of size 256) and then either stagnates or degrades.
- Implication: Making SIDs longer or drawing them from a larger vocabulary does not help. It may even hurt by making the prediction task for the RS module harder. This suggests $c_{\text{tok}} \approx 0$ as well.

Figures: Performance (Recall@5/10, NDCG@5/10) on Beauty, Sports, and Toys as the LLM encoder size varies (log-scale x-axis); the curves remain largely flat as the encoder scales.
Ablations: Pinpointing the SID Bottleneck
- Observation 4: Injecting Raw LLM Embeddings Breaks the Bottleneck (Figure 6)
- When injecting external CF embeddings from SASRec, there is little to no performance gain. This suggests the SID-based GR model is already effective at capturing CF signals.
- However, when injecting the original, un-quantized LLM embeddings directly into the RS module, there is a substantial performance improvement.
- Crucially, this gain is larger for a bigger RS module (21M vs. 13M parameters), indicating that the performance saturation seen earlier was indeed caused by a lack of rich semantic information.
- Implication: The quantization step from dense embeddings to discrete SIDs is the fundamental bottleneck. It discards critical semantic information that the recommender could otherwise use.

Figure: Performance of an LLM encoder (Sentence-T5) with product quantization across domains as the encoder size varies (log-scale x-axis; Recall and NDCG metrics on the y-axis).
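As a rough back-of-the-envelope illustration of why quantization can discard information (this arithmetic is ours, not the paper's): an SID built from 3 codebooks of size 256 carries at most 24 bits per item, while the dense encoder embedding it replaces contains hundreds of floating-point dimensions. The comparison is crude, since dense embeddings are highly redundant, but it conveys why a richer encoder cannot pass more information through a fixed, short discrete code.

```python
import math

num_codebooks, codebook_size = 3, 256   # the SID configuration the paper finds best
sid_bits = num_codebooks * math.log2(codebook_size)
dense_bits = 768 * 32                   # e.g., an assumed 768-dim float32 embedding

print(f"SID capacity:    {sid_bits:.0f} bits per item")   # 24 bits
print(f"Dense embedding: {dense_bits} bits per item")     # 24576 bits
```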
Core Results: The Success of LLM-as-RS Scaling
- LLM-as-RS Outperforms SID-based GR at Scale (Figure 7)
- The performance of LLM-as-RS models consistently improves as the base LLM scales from 0.6B to 14B parameters, with no sign of saturation.
- At smaller scales, SID-based GR can be competitive, but as the LLM-as-RS model grows, it quickly surpasses the best achievable performance of the SID-based paradigm (the red dashed line).
- Implication: The LLM-as-RS architecture does not suffer from the SID information bottleneck and thus exhibits much healthier scaling behavior.

Figure: Injecting collaborative filtering embeddings (CF Emb) into LLM-as-RS via addition, showing how the user's purchase history is fed into the LLM.
- Observation 5: The Entire LLM Contributes to Scaling (Figure 8)
- By fitting their proposed scaling equation to the experimental data, the authors obtain a close fit (high goodness-of-fit values), confirming that the scaling form describes the data well.
- Most importantly, the estimated coefficients $c_{\text{SI}}$ (for SI) and $c_{\text{CF}}$ (for CF) are both found to be greater than zero.
- Implication: This is a key finding. It means that the frozen, pre-trained weights of the LLM ($N_{\text{LLM}}$) actively contribute to improving performance on both semantic and collaborative tasks as the model scales. The LLM isn't just a feature extractor; its latent knowledge is increasingly leveraged at larger sizes.

Figure: Scaling behavior when external CF embeddings are added to the token embeddings. The x-axis is the CF model size, the y-axis is the gain in Recall@5 ($\Delta$Recall@5), and curve colors indicate the LLM backbone size. With the CF model size fixed, the gain from external CF embeddings shrinks as the LLM backbone grows.
- Observation 6: LLMs Exhibit Scaling for CF Signals (Figure 9)
- This experiment provides direct evidence that $c_{\text{CF}} > 0$. When external CF embeddings are injected, the performance gain ($\Delta$Recall@5) is largest for the smallest LLM. As the LLM backbone scales up, the benefit from the same external CF information diminishes.
- Implication: This elegantly demonstrates that larger LLMs are inherently better at learning CF patterns from the raw interaction sequence. They rely less on external crutches for CF information as they scale, challenging the prevailing view that LLMs are weak at CF.

Figure: Efficiency comparison of the two GR paradigms across model scales. The left panel plots training time (in A100 GPU-hours) against Recall@5; the right panel plots inference time (in milliseconds) against Recall@5.
7. Conclusion & Reflections
- Conclusion Summary: The paper provides a rigorous and insightful analysis of model scaling in generative recommendation. It concludes that the popular SID-based GR paradigm, despite its conceptual appeal, has a fundamental architectural flaw, the information bottleneck created by SIDs, which prevents it from scaling effectively. Performance saturates early, and bigger, more powerful encoders do not help. In contrast, the LLM-as-RS paradigm, which processes text directly, demonstrates superior scaling properties: its ability to learn both semantic information and collaborative filtering signals improves consistently with model size. This positions LLM-as-RS as a more promising and scalable path toward building powerful, general-purpose foundation models for recommendation.
- Limitations & Future Work:
- The authors acknowledge a key limitation: efficiency. As discussed in their appendix (Section K), LLM-as-RS models are currently much slower and more computationally expensive for both training and inference than the more lightweight SID-based models.
- This creates a practical trade-off: SID-based GR may be preferable for applications with tight latency or budget constraints, while LLM-as-RS is the choice when performance is paramount and resources are ample.
- Future work should focus on bridging this gap, exploring ways to make LLM-as-RS more efficient or to design new SID generation techniques that overcome the information bottleneck.
- Personal Insights & Critique:
- Significance: This paper is an excellent example of how foundational research can provide more value than simply chasing state-of-the-art numbers. By applying the principles of scaling laws from NLP to the recommender systems domain, the authors provide a clear, evidence-based framework for evaluating and comparing different architectures. This work offers crucial guidance for the entire field on where to invest research efforts.
- Critique & Open Questions:
- The conclusion that SIDs are an intrinsic bottleneck might be too strong. The study focuses on one type of quantization (RQ-VAE). It is possible that more advanced or multi-modal quantization techniques could be developed to preserve more information. The paper proves a limitation of the current SID approach, but perhaps not of all possible SID approaches.
- The analysis relies on item titles as the source of semantic information. The conclusions might differ for other modalities (e.g., images) or richer text (e.g., full descriptions, user reviews).
- The LLM-as-RS model's ability to learn CF is impressive, but it is likely still not as data-efficient as specialized CF models for a given dataset size. The scaling advantage may stem from the massive, general-domain knowledge pre-trained into the LLM, which helps it structure and interpret the sequential data more effectively. The paper's findings are in a fixed-data regime; how these paradigms scale with data size is another important open question.
Similar papers
Recommended via semantic vector search.
IDGenRec: LLM-RecSys Alignment with Textual ID Learning
IDGenRec generates unique, semantically rich textual IDs for items, aligning LLMs with recommendation tasks. By jointly training a textual ID generator and LLM recommender, it surpasses existing sequential recommenders and enables strong zero-shot performance.
Generating Long Semantic IDs in Parallel for Recommendation
The RPG framework generates long, unordered semantic IDs in parallel using multi-token prediction and graph-guided decoding, improving representation capacity and inference efficiency, achieving a 12.6% average NDCG@10 gain over generative baselines.