Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation
TL;DR Summary
Align$^3$GR effectively transforms LLMs into recommendation systems via a unified multi-level alignment approach, introducing dual tokenization, enhanced behavior modeling, and progressive preference optimization. It significantly outperforms state-of-the-art baselines on standard metrics.
Abstract
Large Language Models (LLMs) demonstrate significant advantages in leveraging structured world knowledge and multi-step reasoning capabilities. However, fundamental challenges arise when transforming LLMs into real-world recommender systems due to semantic and behavioral misalignment. To bridge this gap, we propose Align$^3$GR, a novel framework that unifies token-level, behavior modeling-level, and preference-level alignment. Our approach introduces: (1) dual tokenization fusing user-item semantic and collaborative signals; (2) enhanced behavior modeling with bidirectional semantic alignment; and (3) a progressive DPO strategy combining self-play (SP-DPO) and real-world feedback (RF-DPO) for dynamic preference adaptation. Experiments show Align$^3$GR outperforms the SOTA baseline by +17.8% in Recall@10 and +20.2% in NDCG@10 on the public dataset, with significant gains in online A/B tests and full-scale deployment on an industrial large-scale recommendation platform.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation. The central topic of the paper is proposing a novel framework to integrate Large Language Models (LLMs) into real-world recommender systems by addressing semantic and behavioral misalignment through a unified multi-level alignment approach.
1.2. Authors
- Wencai Ye
- Mingjie Sun
- Shuhang Chen
- Wenjin Wu
- Peng Jiang
All authors are affiliated with Kuaishou Technology, China. Their research backgrounds appear to be in the areas of recommender systems, large language models, and their applications in industrial settings.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. The publication date is 2025-11-14. While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating early research findings in various scientific fields, including computer science and artificial intelligence. Papers on arXiv often precede formal publication in conferences like ACM SIGIR, KDD, NeurIPS, or ICML, which are top-tier venues in recommendation systems and machine learning.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces Align$^3$GR, a novel framework designed to transform Large Language Models (LLMs) into effective real-world recommender systems by addressing inherent semantic and behavioral misalignment. Align$^3$GR achieves this through a unified multi-level alignment strategy encompassing token-level, behavior modeling-level, and preference-level alignment. Key innovations include: (1) Dual tokenization that fuses user-item semantic and collaborative signals; (2) Enhanced behavior modeling using bidirectional semantic alignment; and (3) a Progressive DPO strategy that combines self-play (SP-DPO) and real-world feedback (RF-DPO) for dynamic preference adaptation. Experimental results on a public dataset demonstrate that Align$^3$GR significantly outperforms the state-of-the-art (SOTA) baseline by +17.8% in Recall@10 and +20.2% in NDCG@10. Furthermore, the framework shows substantial gains in online A/B tests and has been successfully deployed on an industrial large-scale recommendation platform.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2511.11255v1. This indicates it is a preprint version 1 on arXiv.
1.7. PDF Link
The PDF link is: https://arxiv.org/pdf/2511.11255v1.pdf.
2. Executive Summary
2.1. Background & Motivation
The core problem Align$^3$GR aims to solve is the fundamental challenge of effectively transforming Large Language Models (LLMs) into real-world recommender systems (RS). While LLMs possess impressive capabilities in leveraging structured world knowledge and performing multi-step reasoning, their direct application to recommendation tasks faces significant hurdles due to semantic and behavioral misalignment.
This problem is crucial because recommender systems are vital infrastructure for modern digital platforms like e-commerce, video streaming, and social media. The advancements in LLMs offer immense potential to revolutionize RS by moving beyond traditional discriminative approaches (which predict scores for existing items) to more generative paradigms (which can directly output recommended items in an end-to-end manner). However, LLMs are primarily trained for language modeling (concerned with semantic information and next-token prediction (NTP)), whereas recommender systems traditionally focus on modeling implicit user preferences based on interaction behavior information. This inherent mismatch creates a significant gap between LLMs' capabilities and the requirements of personalized recommendation.
Prior research has attempted to bridge this gap, often focusing on individual aspects like tokenization (transforming user/item info into tokens), Supervised Fine-Tuning (SFT) (adapting LLMs to recommendation tasks), or preference-based Reinforcement Learning (RL) (aligning outputs with user interests). However, existing methods frequently treat user and item information independently during tokenization, neglecting their crucial collaborative and semantic dependencies. Furthermore, preference-based RL techniques like Direct Preference Optimization (DPO) often rely on offline data without robust progressive learning mechanisms, making them less effective in adapting to the dynamic and complex nature of real-world user preferences and business objectives. These limitations result in suboptimal recommendation performance and hinder true LLM-to-recommendation alignment.
The paper's entry point is to propose a unified multi-level alignment framework that systematically addresses these gaps by integrating alignment across token-level, behavior modeling-level, and preference-level stages, ensuring a more comprehensive and adaptive integration of LLMs into RS.
2.2. Main Contributions / Findings
The paper makes several primary contributions to bridge the gap between LLMs and recommender systems:
- Unified Multi-Level Alignment Framework (Align$^3$GR): The paper proposes Align$^3$GR, a novel framework that jointly optimizes alignment at three critical levels: token-level, behavior modeling-level, and preference-level. This holistic approach aims to provide a more robust and comprehensive solution for transforming LLMs into effective generative recommenders, overcoming the limitations of single-level or fragmented alignment strategies.
- Dual SCID Tokenization and Enhanced Behavior Modeling:
  - Dual SCID Tokenization: It introduces a dual tokenization scheme that fuses both semantic and collaborative signals for users and items into Semantic-Collaborative IDs (SCIDs). This joint optimization at the input level ensures that LLMs receive rich, aligned representations that capture both textual meaning and interaction patterns, which is crucial for personalized recommendation.
  - Enhanced Behavior Modeling: It designs an enhanced multi-task Supervised Fine-Tuning (SFT) approach. This involves incorporating user SCID tokens directly into task prompts and introducing bidirectional semantic alignment tasks between user SCIDs and their semantic profiles. This explicitly grounds the abstract SCID tokens in real-world meaning, strengthening the model's understanding of user-item relationships.
- Progressive DPO Strategy for Dynamic Preference Adaptation: The paper develops a progressive Direct Preference Optimization (DPO) strategy that combines self-play DPO (SP-DPO) for generating diverse training data and real-world feedback DPO (RF-DPO) for dynamically adapting to actual user interests and business objectives. Inspired by curriculum learning, this approach moves from easy to hard preference pairs, enabling smoother convergence and more stable training, thereby addressing the challenges of sparse and dynamic user preferences.
- Comprehensive Experimental Validation: The paper provides extensive experimental validation on both public benchmark datasets (e.g., Instruments, Beauty, Yelp) and an industrial large-scale recommendation platform.
  - Offline Performance: Align$^3$GR consistently outperforms state-of-the-art baselines, demonstrating significant improvements (e.g., +17.8% in Recall@10 and +20.2% in NDCG@10 on the Instruments dataset compared to EAGER-LLM).
  - Online Performance: Online A/B tests on an industrial platform show Align$^3$GR achieving a statistically significant +1.432% revenue improvement and better Recall@100 compared to industrial baselines and TIGER.

These findings collectively demonstrate that Align$^3$GR effectively bridges the gap between LLMs and recommender systems by enabling high-quality personalization and robust adaptation in large-scale, dynamic recommendation environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Align$^3$GR, it is helpful to be familiar with several core concepts from Large Language Models (LLMs) and Recommender Systems (RS).
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on massive amounts of text data. Their primary task is often next-token prediction (NTP), where they learn to predict the most probable next word or token in a sequence given the preceding context. This enables them to generate coherent and contextually relevant text, understand various linguistic nuances, and perform multi-step reasoning. Examples include GPT-3, Llama, and T5.
- Recommender Systems (RS): These are information filtering systems that predict a user's preference for an item. They are ubiquitous in e-commerce (e.g., Amazon, Taobao), media streaming (e.g., Netflix, YouTube), and social media (e.g., TikTok, Instagram).
  - Discriminative Recommenders: The traditional paradigm, where models predict a score or probability that a user will interact with a given item. This usually involves ranking pre-existing items.
  - Generative Recommenders: A newer paradigm where the model directly generates item identifiers or descriptions as recommendations, rather than just ranking them. This allows for more dynamic and context-aware outputs.
- Tokenization: In LLMs, text and other forms of data must be converted into numerical representations called tokens. Tokenization is the process of breaking down raw input (e.g., words, subwords, characters, or even structured data like item IDs) into these discrete units that the model can process. Effective tokenization is crucial for LLMs to understand and generate sequences.
- Residual Quantization Variational Autoencoder (RQ-VAE): RQ-VAE is a type of Vector Quantization (VQ-VAE) model that learns to compress continuous embeddings (dense numerical representations) into discrete tokens or codes. It uses multiple "layers" of codebooks (sets of discrete vectors) to progressively refine the quantization, allowing for a more compact and expressive representation. In recommendation, RQ-VAE can transform rich item or user embeddings into discrete IDs that can be directly used by LLMs as tokens.
- Supervised Fine-Tuning (SFT): After an LLM is pre-trained on a vast amount of general text data, SFT is a process where the model is further trained on a smaller, task-specific dataset with explicit labels. For recommendation, SFT adapts a general-purpose LLM to understand recommendation-specific data structures, user behavior patterns, and generate appropriate recommendations.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It involves trial and error, where the agent receives feedback (rewards or penalties) for its actions.
- Reinforcement Learning from Human Feedback (RLHF): A technique used to align LLMs with human preferences or values. It typically involves: (1) an LLM generates responses; (2) human annotators rank or rate these responses; (3) a reward model is trained on these human preferences; (4) the LLM is then fine-tuned using RL to maximize the reward predicted by the reward model. RLHF is effective but can be unstable and computationally expensive.
- Direct Preference Optimization (DPO): An alternative to RLHF that simplifies the alignment process. Instead of training a separate reward model and then using RL, DPO directly optimizes the LLM's policy (its ability to generate responses) based on human preference data. It rephrases the RL objective as a simple classification problem, making it more stable and computationally efficient than RLHF.
- Collaborative Filtering (CF): A widely used technique in recommender systems that makes recommendations based on the preferences of similar users or the characteristics of similar items. It leverages patterns of user-item interactions (e.g., purchases, clicks, ratings) to identify collaborative signals, meaning that users who liked certain items in the past are likely to like other items that were also liked by those same users.
- Evaluation Metrics for Recommender Systems:
  - Recall@K: Measures the proportion of relevant items (e.g., items a user actually interacted with) that are successfully retrieved within the top $K$ recommendations.
    - Conceptual Definition: Recall@K indicates how many of the items a user would have liked are present in the top K recommendations. A higher Recall@K means the system is good at not missing relevant items.
    - Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\text{Relevant Items} \cap \text{Recommended Items@K}|}{|\text{Relevant Items}|} $
    - Symbol Explanation:
      - $|\text{Relevant Items} \cap \text{Recommended Items@K}|$: The number of relevant items that are also present in the top $K$ recommended items.
      - $|\text{Relevant Items}|$: The total number of relevant items for a given user.
  - NDCG@K (Normalized Discounted Cumulative Gain at K): A ranking-aware metric that considers the position of relevant items. Highly relevant items appearing earlier in the recommendation list contribute more to the score.
    - Conceptual Definition: NDCG@K measures the usefulness of a recommended list, where relevant items appearing higher in the list are valued more. It is normalized to be between 0 and 1, making it comparable across different queries or users.
    - Mathematical Formula:
      $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
      where
      $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
      and $\mathrm{IDCG@K}$ is the ideal DCG@K, calculated by ranking all relevant items by their relevance score and applying the DCG formula.
    - Symbol Explanation:
      - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list.
      - $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$.
      - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at position $K$, which is the maximum possible DCG if all relevant items were perfectly ranked.
      - $\log_2(j+1)$: A logarithmic discount factor, meaning items at higher ranks (smaller $j$) contribute more.
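To make these two metrics concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes Recall@K and NDCG@K for a single user with binary relevance; the function names and toy data are purely illustrative.

```python
import math

def recall_at_k(relevant: set, recommended: list, k: int) -> float:
    """Fraction of the user's relevant items that appear in the top-K list."""
    hits = len(relevant & set(recommended[:k]))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(relevant: set, recommended: list, k: int) -> float:
    """NDCG@K with binary relevance (rel_j = 1 if the item is relevant, else 0)."""
    dcg = sum(1.0 / math.log2(rank + 2)            # rank is 0-based, so j + 1 = rank + 2
              for rank, item in enumerate(recommended[:k])
              if item in relevant)
    ideal_hits = min(len(relevant), k)             # perfect ranking puts all hits first
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the user interacted with items {7, 42}; the model recommends 5 items.
relevant_items = {7, 42}
top_k_list = [3, 7, 19, 42, 8]
print(recall_at_k(relevant_items, top_k_list, k=5))  # 1.0  (both relevant items retrieved)
print(ndcg_at_k(relevant_items, top_k_list, k=5))    # ~0.65 (hits at ranks 2 and 4)
```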
3.2. Previous Works
The paper references several prior studies that form the foundation or represent state-of-the-art in generative recommendation and LLM preference alignment.
- Generative Recommendation:
- DSI (Differentiable Search Index) (Tay et al. 2022; Chen et al. 2023) and GENRE (Si et al. 2023): These approaches transform retrieval tasks into autoregressive sequence generation, where the model directly generates user context tokens or item identifiers. They highlight the shift from traditional retrieval to generative models.
- RQ-VAE (Lee et al. 2022) and other indexing/tokenization methods (e.g., hierarchical k-means, PQ): These techniques are crucial for converting continuous content embeddings into discrete tokens that LLMs can process.
- TIGER (Rajput et al. 2023), LC-Rec (Zheng et al. 2024), LETTER (Wang et al. 2024a), EAGER-LLM (Hong et al. 2025): These are prominent generative recommendation models that leverage codebook-based quantized identifiers or learnable tokenizers.
  - TIGER: Applies codebook-based quantized identifiers for items.
  - LC-Rec: Enhances codebook tokenization with auxiliary alignment tasks during SFT.
  - LETTER: Proposes a learnable tokenizer specifically for generative recommendation.
  - EAGER-LLM: Further models user-item collaborative signals for token-level alignment, representing a strong baseline that Align$^3$GR aims to surpass.
- Preference Alignment of LLMs:
  - RLHF (Reinforcement Learning from Human Feedback) (Bai et al. 2022): The pioneering method for aligning LLMs with human preferences, involving a reward model and RL fine-tuning. Align$^3$GR acknowledges its impact but notes its instability and high computational cost.
  - DPO (Direct Preference Optimization) (Rafailov et al. 2023): DPO directly optimizes the LLM's policy on preference data, addressing some limitations of RLHF. It has inspired many variants.
  - IPO (Yang, Tan, and Li 2025), cDPO (Furuta et al. 2024), rDPO (Qian et al. 2025), and Softmax-DPO (Chen et al. 2024b): These are DPO variants developed to handle issues like noise robustness, unbiased learning, or multiple rejected responses. Align$^3$GR specifically builds upon Softmax-DPO.
- Curriculum Learning (Liao et al. 2024) and Self-play (Wu et al. 2024; Gao et al. 2025): These are recent advancements that improve preference alignment by organizing training from easy to hard tasks (curriculum learning) or by having the model generate its own training data (self-play), both of which Align$^3$GR integrates into its progressive DPO strategy.
3.3. Technological Evolution
The evolution of recommender systems has seen a progression from early collaborative filtering and matrix factorization methods to deep learning-based approaches, then to Transformer-based sequential recommenders. The advent of LLMs marked a new paradigm, initially used to augment traditional RS with better content understanding or reasoning capabilities. More recently, the focus has shifted to making LLMs standalone generative recommenders that directly output item recommendations.
However, simply plugging LLMs into RS presents a fundamental gap: LLMs are semantic-focused, while RS are behavior-focused. This paper's work (Align$^3$GR) fits within this technological timeline by proposing a comprehensive framework to bridge this specific gap. It moves beyond merely using LLMs for text generation or simple embedding, instead focusing on deep alignment across multiple levels: ensuring that the tokens themselves carry both semantic and collaborative information, that the LLM's behavior modeling is explicitly aware of user-item relationships, and that user preferences are continually and dynamically optimized through RL-like mechanisms. This represents a significant step towards fully realizing the potential of LLMs as robust generative recommenders in real-world scenarios.
3.4. Differentiation Analysis
Compared to the main methods in related work, Align$^3$GR introduces several core differentiations and innovations:
- Unified Multi-Level Alignment: Unlike many previous works that focus on improving one specific stage in isolation (e.g., tokenization, SFT, or DPO), Align$^3$GR proposes a unified framework that explicitly integrates and jointly optimizes token-level, behavior modeling-level, and preference-level alignment. This holistic approach ensures that LLMs are effectively transformed into recommenders across the entire pipeline, addressing misalignment at every critical juncture.
- Dual Semantic-Collaborative ID (SCID) Tokenization:
  - Innovation: Previous tokenization methods either primarily encode items (e.g., TIGER, LC-Rec) or incorporate user representations without truly co-optimizing them with item embeddings for collaborative signals. Align$^3$GR introduces Dual SCID Tokenization, which jointly encodes both users and items, fusing their respective semantic features (e.g., profiles, descriptions) and collaborative features (e.g., behavioral patterns) into hierarchical discrete SCIDs within a unified framework.
  - Differentiation: This explicitly tackles the limitation of isolated modeling (Zhang et al. 2025) by ensuring mutual influences and collaborative signals are preserved from the earliest input stage, a significant advancement over methods like P5-SemID (semantic only) or P5-CID (collaborative via clustering, but less integrated). Even EAGER-LLM, which models user-item collaborative signals for token-level alignment, is surpassed by Align$^3$GR's more comprehensive approach.
- Enhanced Behavior Modeling with Bidirectional Semantic Alignment:
  - Innovation: While LC-Rec also uses multi-task SFT, Align$^3$GR enhances it by directly injecting user SCID tokens into LLM prompts for richer contextual alignment. More importantly, it introduces bidirectional alignment tasks (predicting a user's SCID from text and reconstructing text from the SCID).
  - Differentiation: This explicit index-language alignment directly grounds the abstract SCID tokens in their real-world semantic meanings, providing stronger supervision and enabling the LLM to build a more robust correspondence between structured behavioral signals and natural language semantics, which is less explored in prior SFT-based recommendation works.
- Progressive DPO Strategy (combining SP-DPO and RF-DPO):
  - Innovation: Existing DPO-based methods for RS (e.g., the DPO variants discussed in related work) often rely on static offline preference data and lack progressive learning mechanisms. Align$^3$GR addresses the challenge of dynamic and sparse user preferences by proposing a progressive DPO strategy inspired by curriculum learning: it starts with self-play DPO (SP-DPO) to generate diverse training data and mitigate sparsity and exploration bottlenecks, then progressively incorporates real-world feedback DPO (RF-DPO) for accurate adaptation to real user interests and business objectives.
  - Differentiation: This easy-to-hard progressive learning, coupled with the synergistic use of synthetic (SP-DPO) and real (RF-DPO) feedback, allows for continuous improvement and better generalization to real-world scenarios, a key advantage over static offline optimization approaches. It also provides an adaptive mechanism that previous DPO variants, primarily focused on NLP or static preference data, often miss in the context of dynamic RS.

In summary, Align$^3$GR differentiates itself by offering a deeply integrated, multi-level solution that considers both semantic and collaborative aspects from tokenization through behavior modeling to dynamic preference adaptation, providing a more complete and effective framework for LLM-based generative recommendation.
4. Methodology
4.1. Principles
The core idea behind Align$^3$GR is to systematically bridge the inherent semantic and behavioral misalignment between Large Language Models (LLMs) and recommender systems (RS) by establishing a unified multi-level alignment framework. This framework operates on three tightly integrated stages: token-level alignment, behavior modeling-level alignment, and preference-level alignment. The theoretical basis is that by aligning the LLM at these fundamental levels, from how information is represented as tokens to how user behaviors are modeled and how preferences are learned, the LLM can truly understand and generate personalized recommendations, moving beyond its general language capabilities. The intuition is that for an LLM to act as a recommender, it needs to "think" like a recommender from the ground up, incorporating user-item collaborative signals and preference dynamics into its core operations, rather than just treating items as arbitrary text.
4.2. Core Methodology In-depth (Layer by Layer)
The framework is structured into three consecutive and interconnected stages, as depicted in Figure 2.
The following figure (Figure 2 from the original paper) illustrates the architecture of the framework:

The figure is a schematic of the Align$^3$GR framework, showing its multi-level alignment mechanism: dual encoding of users and items, token-level alignment, and the progression of preference-level alignment combining self-play (SP-DPO) with real-world feedback (RF-DPO).
Figure 2: The architecture of Align$^3$GR, a unified multi-level alignment framework for generative recommendation. Token-level alignment is built on dual SC encoders and RQ-VAEs, and preference-level alignment is accomplished via progressive SP-DPO and RF-DPO.
4.2.1. Token-level Alignment: Dual SCID Tokenization
Problem Statement: Existing tokenization methods for generative recommendation often focus primarily on encoding items and tend to overlook user structure modeling. Even when user representations are incorporated, they are rarely co-optimized with item embeddings, leading to suboptimal user-item alignment and representations that lack critical collaborative signals. This independent modeling fails to capture the mutual influences crucial for comprehensive preference learning.
Solution: Dual SCID Tokenization. Align$^3$GR addresses this by introducing a Dual SCID Tokenization scheme. This approach aims to jointly encode both users and items, leveraging their respective semantic and collaborative signals within a unified, co-optimized framework. The goal is to learn mutually aligned, expressive representations for both.
Process:
- Feature Extraction: For both users and items, the method first extracts two types of features:
  - Semantic Features: These capture textual information (e.g., user profiles, item descriptions, titles). They are processed by a frozen semantic encoder, typically initialized with a pre-trained Language Model (LM) like T5 (Ni et al. 2021).
  - Collaborative Features: These capture behavioral patterns (e.g., past interactions, purchase history). They are processed by a frozen collaborative encoder, such as DIN (Deep Interest Network) (Zhou et al. 2018).
- Hybrid Semantic-Collaborative (SC) Encoding: The resulting semantic and collaborative embeddings for users and items are concatenated. These concatenated embeddings are then fed into a hybrid SC Encoder (e.g., a Multi-Layer Perceptron, MLP), which integrates both information types to produce unified SC embeddings (denoted as $\mathbf{u}_u$ for user $u$ and $\mathbf{v}_i$ for item $i$).
- SCID Quantization: Finally, these unified SC embeddings ($\mathbf{u}_u$, $\mathbf{v}_i$) are quantized into discrete Semantic-Collaborative IDs (SCIDs) using a Residual Quantization Variational Autoencoder (RQ-VAE) (Lee et al. 2022). This process compresses the continuous embeddings into a compact, discrete token space.

Training Objective: The training of this Dual SCID Tokenization module involves two main components:
- User-to-Item (U2I) Behavior Loss: This loss enhances alignment between user and item SC embeddings by optimizing for observed interactions.
  $ \mathcal{L}_{\mathrm{U2I}} = - \frac{1}{|\mathcal{B}|} \sum_{(u, i^{+}) \in \mathcal{B}} \left[ \log \frac{\exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}})}{\exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}}) + \sum_{j \in \mathcal{N}_{u}} \exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{j})} \right] $
  - Symbol Explanation:
    - $\mathcal{L}_{\mathrm{U2I}}$: The User-to-Item behavior loss.
    - $|\mathcal{B}|$: The batch size, i.e., the number of user-item interaction pairs in the current training batch $\mathcal{B}$.
    - $(u, i^{+}) \in \mathcal{B}$: A positive user-item interaction pair within the batch $\mathcal{B}$, meaning user $u$ interacted with item $i^{+}$.
    - $\mathbf{u}_{u}$: The unified SC embedding for user $u$.
    - $\mathbf{v}_{i^{+}}$: The unified SC embedding for the positive item $i^{+}$.
    - $\mathcal{N}_{u}$: The set of negative samples for user $u$, which are randomly sampled items from within the batch that user $u$ did not interact with.
    - $\mathbf{v}_{j}$: The unified SC embedding for a negative item $j$ from $\mathcal{N}_{u}$.
    - $\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}}$: The dot-product similarity between user $u$'s embedding and positive item $i^{+}$'s embedding. A higher value indicates stronger alignment.
    - The term inside the $\log$ is a softmax probability, representing the probability of selecting the positive item given user $u$, relative to the positive item and its negative samples. This is a common form of sampled softmax or negative-sampling loss.
- Overall Joint Loss: This loss combines the U2I behavior loss with the quantization losses from the RQ-VAE components for both users and items.
  $ \mathcal{L} = \alpha \cdot \mathcal{L}_{\mathrm{U2I}} + \gamma \cdot \left( \mathcal{L}_{\mathrm{User\ RQ}} + \mathcal{L}_{\mathrm{Item\ RQ}} \right) $
  - Symbol Explanation:
    - $\mathcal{L}$: The total joint training loss for the Dual SCID Tokenization module.
    - $\alpha$: A trade-off hyperparameter weighting the U2I behavior loss.
    - $\mathcal{L}_{\mathrm{U2I}}$: The User-to-Item behavior loss as defined above.
    - $\gamma$: A trade-off hyperparameter weighting the RQ-VAE quantization losses.
    - $\mathcal{L}_{\mathrm{User\ RQ}}$: The reconstruction and quantization loss specific to the user RQ-VAE (e.g., reconstruction loss + commitment loss from VQ-VAE).
    - $\mathcal{L}_{\mathrm{Item\ RQ}}$: The reconstruction and quantization loss specific to the item RQ-VAE.

Training Strategy: In practice, the training proceeds in two phases:
- Initially, $\alpha$ is set to 1 and $\gamma$ to 0. This phase focuses solely on optimizing $\mathcal{L}_{\mathrm{U2I}}$ to stabilize behavior alignment and thoroughly train the SC Encoder; it is monitored using metrics like AUC.
- Once behavior alignment is stable, the hyperparameters are switched so that the quantization losses from the RQ-VAEs are prioritized, ensuring effective compression into SCIDs.

During inference, the user and item SCID generation modules are deployed separately, each producing their respective SCIDs for downstream tasks in the LLM. This design ensures that the LLM receives compact, collaboratively-aware tokens, preserving crucial collaborative relationships throughout the recommendation pipeline.
4.2.2. Behavior Modeling-level Alignment: Multi-task SFT
After generating quantized SCIDs for users and items, the next stage performs Supervised Fine-Tuning (SFT) of the LLM to enhance its generative and semantic alignment capabilities within this new SCID token space.
The following figure (Figure 3 from the original paper) illustrates the behavior modeling-level alignment:

The figure illustrates the multi-level alignment concept in the Align$^3$GR framework. It is divided into three parts: sequential item prediction, explicit index-language alignment, and implicit recommendation-oriented alignment. Each part includes different prediction and alignment strategies, graphically showing how user-item relationships are modeled.
Figure 3: Behavior Modeling-level Alignment.
Foundation: The framework builds upon LC-Rec (Zheng et al. 2024), which defines a multi-task SFT framework encompassing:
-
Sequential Item Prediction: Predicting the next item a user will interact with based on their historical sequence. -
Asymmetric Item Prediction: Predicting an item that is related to a target item but not necessarily in a direct sequence (e.g., complementary items). -
Item Prediction Based on User Intention: Predicting items based on explicit or inferred user intentions (e.g., query-based recommendations). -
Personalized Preference Inference: Tasks designed to infer deeper user preferences.These tasks aim to equip the
LLM with the ability to capture sequential dependencies, understand implicit user preferences, and align user behavior with items in a diverse and adaptive manner.
Enhancements in Align$^3$GR: LC-Rec is limited in capturing comprehensive user-item collaborative and semantic relationships. To address this, Align$^3$GR introduces two key enhancements during SFT:
- User SCID Injection: The user's SCID token is injected into all task prompts. This enriches the input representation for the LLM, allowing it to leverage more comprehensive user features and contextual alignment. This is shown in Figure 3, where the User SCID is included in the input sequence.
- Bidirectional Semantic Alignment Tasks (B.2): This is a crucial addition that explicitly aligns the abstract SCID tokens with their real-world semantic meanings. Two tasks are introduced:
  - Text to SCID: The model is trained to predict a user's SCID token given their profile text (e.g., textual descriptions or attributes). This ensures that the SCID encapsulates the user's semantic information.
  - SCID to Text: Conversely, the model is trained to reconstruct the user profile text given their SCID token. This ensures that the SCID can be mapped back to meaningful semantic descriptions, strengthening the grounding.

By incorporating user SCID tokens directly and explicitly aligning structured SCID information with semantic information through bidirectional tasks, Align$^3$GR provides a stronger foundation for downstream preference optimization compared to prior SFT designs.
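To illustrate how the bidirectional alignment tasks can be turned into SFT training pairs, here is a minimal sketch under our own assumptions; the prompt wording and the `<u_*>` token format are hypothetical, since the paper does not publish its exact templates.

```python
def build_bidirectional_sft_pairs(user_scid: str, user_profile_text: str):
    """Create the two B.2 alignment examples for one user:
    Text -> SCID and SCID -> Text. Both are plain (prompt, target) pairs
    that can be mixed into the multi-task SFT data."""
    text_to_scid = {
        "prompt": f"User profile: {user_profile_text}\nPredict this user's SCID:",
        "target": user_scid,
    }
    scid_to_text = {
        "prompt": f"User SCID: {user_scid}\nDescribe this user's profile:",
        "target": user_profile_text,
    }
    return [text_to_scid, scid_to_text]

# Hypothetical example: a 3-level user SCID rendered as special tokens.
pairs = build_bidirectional_sft_pairs(
    user_scid="<u_12><u_207><u_45>",
    user_profile_text="Enjoys acoustic guitars and beginner-friendly recording gear.",
)
for p in pairs:
    print(p["prompt"], "=>", p["target"])
```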
4.2.3. Preference-level Alignment: Progressive DPO with Self-Play and Real-world Feedback
Even with effective SCID tokenization and multi-task SFT, the model's recommendation capabilities might still be preliminary. Simple preference optimization after SFT is often insufficient for continual improvement or robust business alignment due to limited coverage of annotated preference data and the dynamic complexity of real recommendation scenarios.
Solution: Progressive DPO with Self-Play (SP-DPO) and Real-world Feedback (RF-DPO). Align$^3$GR proposes a progressive DPO strategy (inspired by curriculum learning, moving from "easy to hard") that leverages both self-play and real-world feedback. This approach ensures continuous adaptation and robust alignment.
Foundation: Softmax-DPO: The progressive DPO is based on Softmax-DPO (Chen et al. 2024b), which can handle training samples containing multiple rejected responses. The SFT model initialized from the previous stage serves as the starting point.
Training Objective (Softmax-DPO): The training objective for each stage $i$ is formally defined as:
$ \mathcal{L}(\pi_{\theta}^{i}, \pi_{\mathrm{ref}}^{i}) = - \mathbb{E}_{(x, y_{w}^{i}, Y_{l}^{i}) \sim \mathcal{D}^{i}} \left[ \log \sigma \left( - \log \sum_{y_{l}^{i} \in Y_{l}^{i}} \exp \left( \beta \log \frac{\pi_{\theta}^{i}(y_{l}^{i} \mid x)}{\pi_{\mathrm{ref}}^{i}(y_{l}^{i} \mid x)} - \beta \log \frac{\pi_{\theta}^{i}(y_{w}^{i} \mid x)}{\pi_{\mathrm{ref}}^{i}(y_{w}^{i} \mid x)} \right) \right) \right] $
- Symbol Explanation:
  - $\mathcal{L}(\pi_{\theta}^{i}, \pi_{\mathrm{ref}}^{i})$: The DPO loss function at stage $i$.
  - $\pi_{\theta}^{i}$: The current policy (the LLM being fine-tuned) at stage $i$.
  - $\pi_{\mathrm{ref}}^{i}$: The reference policy at stage $i$, typically a frozen version of the SFT model or the policy from the previous DPO stage; it acts as a regularization anchor.
  - $\mathbb{E}$: Expectation over the training data distribution.
  - $(x, y_{w}^{i}, Y_{l}^{i}) \sim \mathcal{D}^{i}$: A training sample from the progressive training set $\mathcal{D}^{i}$.
    - $x$: The prompt (e.g., user history, current context).
    - $y_{w}^{i}$: The chosen (preferred/winning) response (e.g., the SCID of the item the user liked).
    - $Y_{l}^{i}$: The set of rejected (less preferred/losing) responses (e.g., SCIDs of items the user disliked or did not interact with).
  - $\sigma$: The sigmoid function.
  - $\beta$: A hyperparameter that controls the strength of the KL-divergence penalty between the current policy and the reference policy, effectively managing how much the model can deviate from its initial behavior.
  - $\beta \log \frac{\pi_{\theta}^{i}(y \mid x)}{\pi_{\mathrm{ref}}^{i}(y \mid x)}$: The log-ratio of the probability of generating response $y$ under the current policy versus the reference policy. The inner term takes the difference of this log-ratio between each rejected response $y_{l}^{i}$ and the chosen response $y_{w}^{i}$, so minimizing the loss pushes the policy to prefer $y_{w}^{i}$ over every rejected response in $Y_{l}^{i}$.

The fine-tuned model at each stage $i$ serves as the reference policy for the next stage $i+1$, enabling preference distinctions to be learned progressively.
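The objective above can be written directly as a loss over summed log-probabilities. The sketch below is our own reading of Softmax-DPO (one chosen response contrasted against a set of rejected responses), with the policy and reference log-probabilities assumed to be precomputed per response.

```python
import torch
import torch.nn.functional as F

def softmax_dpo_loss(policy_chosen_logp: torch.Tensor,    # [B]    log pi_theta(y_w | x)
                     ref_chosen_logp: torch.Tensor,       # [B]    log pi_ref(y_w | x)
                     policy_rejected_logp: torch.Tensor,  # [B, R] log pi_theta(y_l | x)
                     ref_rejected_logp: torch.Tensor,     # [B, R] log pi_ref(y_l | x)
                     beta: float = 0.1) -> torch.Tensor:
    """Softmax-DPO: each chosen response is contrasted against R rejected responses."""
    chosen_ratio = beta * (policy_chosen_logp - ref_chosen_logp)          # [B]
    rejected_ratio = beta * (policy_rejected_logp - ref_rejected_logp)    # [B, R]
    # -log sigma( -log sum_l exp(rejected_ratio_l - chosen_ratio) ), averaged over the batch.
    inner = torch.logsumexp(rejected_ratio - chosen_ratio.unsqueeze(1), dim=1)  # [B]
    return -F.logsigmoid(-inner).mean()

# Toy usage: batch of 2 prompts, 20 rejected responses each (matching the paper's 1-vs-20 setup).
B, R = 2, 20
loss = softmax_dpo_loss(torch.randn(B), torch.randn(B), torch.randn(B, R), torch.randn(B, R))
```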
-
Components of Progressive DPO:
- Progressive Self-Play DPO (SP-DPO):
  - Purpose: To enhance the model's generative capability and mitigate data sparsity by generating diverse and informative training data.
  - Mechanism: SP-DPO involves the model interacting with itself to create preference pairs.
  - Progressive Stages: Learning is divided into three stages based on the hierarchical nature of SCIDs and a prefix-ngram match metric (Zheng et al. 2025):
    - Easy Stage: Chosen and rejected SCID responses are completely different, with no shared prefix-ngram; these are easy for the model to distinguish.
    - Medium Stage: The prefix-ngram overlap between chosen and rejected SCID responses progressively increases, making discrimination harder.
    - Hard Stage: Even higher prefix-ngram overlap, further increasing difficulty, though responses remain non-identical.
  - These three-stage preference data, combined with real user behavior sequences, progressively form the training data for DPO. The prefix-ngram match metric can also be extended to a SCID vector-similarity metric for softer sample construction.
- Progressive Real-world Feedback DPO (RF-DPO):
  - Purpose: To align the model with actual user interests and business objectives by incorporating authentic user feedback.
  - Mechanism: The LLM recommends its own generated results to users, and their feedback is collected.
  - Feedback Categorization: Feedback is categorized into three levels: disliked, neutral, and liked.
  - Progressive Stages: Similar to SP-DPO, RF-DPO also follows a progressive strategy:
    - Easy Stage: Uses strongly disliked items as negatives and liked items as positives.
    - Hard Stage: Uses neutral items as harder negatives, while liked items remain positives. This systematically strengthens preference learning by introducing more nuanced negative signals.
  - Industrial Settings: In industrial recommendation, feedback levels are defined by user behavior:
    - Disliked: Explicit negative feedback (e.g., a "dislike" button, an explicit negative comment).
    - Neutral: Implicit negative feedback (e.g., an impression without a click, short dwell time).
    - Liked: Positive feedback (e.g., a "like" button, purchase, long engagement).
  - Public Datasets: For public datasets (e.g., Amazon reviews), an LLM-based sentiment model (e.g., ecomgpt (Li et al. 2023)) scores reviews, mapping scores to levels: disliked (1), neutral (2-3), and liked (4-5).

Key Advantages: This progressive DPO framework offers several benefits:
- Continual Enhancement: The model continually improves its ability to discern and generalize user preferences.
- Overcoming the "Preference Ceiling": It moves beyond the limitations of static data.
- Smoother Learning: The curriculum learning approach (easy-to-hard) provides smoother interpolation between task distributions, leading to more efficient learning and stable training.
- Synergistic Learning: Combines the exploration benefits of self-play with the grounding provided by real-world feedback.
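As an illustration of how the two progressive data sources might be staged, the sketch below (our own construction, with hypothetical overlap thresholds) buckets self-play preference pairs by the prefix-ngram overlap of their SCIDs and maps raw scores to the disliked/neutral/liked levels used by RF-DPO, following the 1 / 2-3 / 4-5 review-score scheme described above.

```python
def prefix_ngram_overlap(chosen_scid: list, rejected_scid: list) -> int:
    """Length of the shared SCID prefix between a chosen and a rejected response."""
    n = 0
    for a, b in zip(chosen_scid, rejected_scid):
        if a != b:
            break
        n += 1
    return n

def sp_dpo_stage(chosen_scid: list, rejected_scid: list) -> str:
    """Bucket a self-play preference pair into the easy/medium/hard curriculum.
    The thresholds are illustrative, not values reported in the paper."""
    overlap = prefix_ngram_overlap(chosen_scid, rejected_scid)
    if overlap == 0:
        return "easy"      # completely different SCIDs
    elif overlap == 1:
        return "medium"    # partial prefix overlap
    return "hard"          # high overlap but still non-identical

def feedback_level(review_score: int) -> str:
    """Map a 1-5 review score to the RF-DPO feedback levels."""
    if review_score <= 1:
        return "disliked"
    if review_score <= 3:
        return "neutral"
    return "liked"

print(sp_dpo_stage(["<i_3>", "<i_18>", "<i_7>"], ["<i_3>", "<i_18>", "<i_9>"]))  # hard
print(feedback_level(3))  # neutral
```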
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three real-world sequential recommendation datasets from diverse domains and one industrial dataset.
-
Public Datasets:
Instruments: A subset of the Amazon review corpus, focusing on user interactions with musical equipment.Beauty: Also from the Amazon review datasets, containing extensive user behaviors related to beauty products.Yelp: Comprising user-business interactions from the Yelp challenge dataset.- Preprocessing: For fair comparison across models, data is preprocessed following standard protocols (Zheng et al. 2024; Rajput et al. 2023; Wang et al. 2024a). This typically includes filtering users and items with fewer than five interactions and applying the
leave-one-out strategy for splitting data into training, validation, and test sets. Each user's history length is restricted to a maximum of 20 items for all sequential models.
-
Industrial Dataset:
-
Source: An internal industrial large-scale advertising recommendation platform.
-
Purpose: Used for online A/B tests to validate Align$^3$GR's practical efficacy and business impact at scale.
-
Characteristics: Represents real-world, large-scale user traffic (approximately 40+ million users).
These datasets were chosen because they are standard benchmarks in
sequential recommendation and generative recommendation, allowing for comparison with state-of-the-art methods. The industrial dataset provides crucial real-world validation beyond academic benchmarks.
-
5.2. Evaluation Metrics
The performance of the models is evaluated using standard top- metrics, which are common in recommender systems to assess the quality of recommendations.
-
Recall@K:
  - Conceptual Definition: Measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the completeness of the recommendations, indicating how well the system identifies items that a user will interact with, regardless of their position in the ranked list (as long as they are within the top $K$).
  - Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\text{Relevant Items} \cap \text{Recommended Items@K}|}{|\text{Relevant Items}|} $
  - Symbol Explanation:
    - $K$: The number of top recommendations considered.
    - $\text{Relevant Items}$: The set of all items that are relevant to the user (e.g., items the user actually interacted with in the test set).
    - $\text{Recommended Items@K}$: The set of the top $K$ items recommended by the model.
    - $|\cdot|$: Denotes the cardinality (number of elements) of a set.
    - $\cap$: Set intersection.
-
NDCG@K (Normalized Discounted Cumulative Gain at K):
  - Conceptual Definition: NDCG@K is a ranking-aware metric that evaluates the utility of a ranked list of recommendations. It assigns higher scores to highly relevant items that appear at higher positions (earlier) in the list. The score is normalized to be between 0 and 1, allowing for comparisons across different recommendation lists.
  - Mathematical Formula:
    $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    where
    $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
    and $\mathrm{IDCG@K}$ is the Ideal Discounted Cumulative Gain, which is the DCG@K for the perfectly sorted list of relevant items.
  - Symbol Explanation:
    - $K$: The number of top recommendations considered.
    - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list. For implicit feedback scenarios (like in this paper), $\mathrm{rel}_j$ is often binary (1 if relevant, 0 if not).
    - $\log_2(j+1)$: A logarithmic discount factor that reduces the contribution of items at lower ranks (higher $j$).
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$, which sums the relevance scores discounted by their position.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at $K$, calculated by arranging all relevant items in the test set in decreasing order of their relevance scores and then applying the DCG formula. This serves as the normalization factor.

For the experiments, $K$ is set to 5 and 10 (Recall@5, Recall@10, NDCG@5, NDCG@10). For online A/B tests, Recall@100 and Revenue (Improve.) are used.
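As a quick worked example with the binary relevance used in this paper: suppose $K = 5$ and the user's single relevant item appears at position 3 of the recommended list. Then $ \mathrm{DCG@5} = \frac{2^{1}-1}{\log_2(3+1)} = \frac{1}{2} $, $ \mathrm{IDCG@5} = \frac{2^{1}-1}{\log_2(1+1)} = 1 $, so $ \mathrm{NDCG@5} = 0.5 $, while $ \mathrm{Recall@5} = 1/1 = 1 $ because the only relevant item is retrieved within the top 5.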
5.3. Baselines
The paper compares against a comprehensive set of strong baselines, categorized into traditional, sequential, and generative/LLM-based recommender systems.
-
Traditional Recommendation Methods:
- MF (Matrix Factorization) (Mehta and Rana 2017): A classic collaborative filtering technique that decomposes the user-item interaction matrix into lower-dimensional user and item latent factor matrices.
- LightGCN (He et al. 2020): A simplified Graph Convolutional Network (GCN) for recommendation, which only includes the most essential component (neighborhood aggregation) for collaborative filtering.
-
Sequential Recommendation Methods:
- Caser (Convolutional Sequence Embedding Recommendation) (Tang and Wang 2018): Uses convolutional filters to capture local sequential patterns in user interaction history.
- HGN (Hierarchical Gating Networks) (Ma, Kang, and Liu 2019): Captures long- and short-term user preferences using hierarchical gating mechanisms.
- BERT4Rec (Sequential Recommendation with Bidirectional Encoder Representations from Transformer) (Sun et al. 2019): Adapts the BERT model for sequential recommendation by predicting masked items in a user's interaction sequence.
- SASRec (Self-Attentive Sequential Recommendation) (Kang and McAuley 2018): Uses a self-attention mechanism to model long-range dependencies in user behavior sequences.
-
Generative and LLM-based Recommendation Methods:
-
BIGRec(Bi-step Grounding Paradigm for Large Language Models in Recommendation Systems) (Bao et al. 2025): AnLLM-based generative recommenderthat uses item titles as textual identifiers. -
P5-SemID(Wang et al. 2024a): A variant that leverages item metadata to createsemantic identifiersforgenerative recommendation. -
P5-CID(Wang et al. 2024a): Incorporatescollaborative signalsvia clustering to createcollaborative identifiersforLLM-based models. -
TIGER(Transformer-based Item Generative Recommender) (Rajput et al. 2023): Appliescodebook-based quantized identifiersfor items, framing recommendation as a sequence generation task. -
LETTER(Learnable Item Tokenization for Generative Recommendation) (Wang et al. 2024a): Proposes alearnable tokenizerto generate discrete item tokens for generative recommendation models. -
LETTER-TIGER: Implies an integration or specific configuration ofLETTERandTIGER. -
LC-Rec(Large Language Models by Integrating Collaborative Semantics for Recommendation) (Zheng et al. 2024): Enhancescodebook tokenizationwithauxiliary alignment tasksduringSFTto integratecollaborative semantics. This is a direct predecessor and a strong generative baseline. -
LETTER-LC-Rec: Implies an integration or specific configuration ofLETTERandLC-Rec. -
EAGER-LLM (Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration) (Hong et al. 2025): Models user-item collaborative signals for token-level alignment, representing the state-of-the-art baseline that Align$^3$GR aims to significantly outperform, especially in the token-level alignment aspect. All baselines are either implemented or adapted using available open-source code to ensure fair comparison.
-
5.4. Implementation Details
- Backbone LLM:
Llama2-7B(Touvron et al. 2023) is used as the foundationalLLM. - Parameter-Efficient Fine-Tuning (PEFT):
LoRA(Low-Rank Adaptation) (Hu et al. 2022) is employed for efficient fine-tuning of theLlama2-7Bmodel, reducing computational costs. - Item Tokenization: A 3-level
RQ-VAEis used for item tokenization. Eachcodebookwithin theRQ-VAEcontains 256embeddingsof dimension 32. - Vocabulary Expansion: The generated
SCIDrepresentations for bothusersanditemsare incorporated into theLLM's vocabulary to preventout-of-vocabulary (OOV)issues and ensure seamless integration. - Training Parameters:
- Training Steps: 20,000 steps.
- Optimizer:
AdamW. - Batch Size: 1024.
- Learning Rate: Selected from {1e-3, 5e-4, 1e-4} based on validation set performance.
- Hardware: All experiments are conducted on 4 NVIDIA RTX A800 GPUs.
- Hyperparameter Tuning: Hyperparameters, including $\alpha$ and $\gamma$ for the loss functions, are tuned on the validation set.
- Softmax-DPO Configuration: For each Softmax-DPO sample, 1 chosen response ($y_w$) and 20 rejected responses ($Y_l$) are used.
- Evaluation: For generative methods utilizing beam search, the beam width is consistently set to 20, following EAGER-LLM. Results are averaged over five runs with different random seeds to ensure robustness.
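The vocabulary-expansion step mentioned above can be sketched with the Hugging Face API. This is a minimal illustration under our own assumptions: the SCID token strings, the 3-level/256-code layout, and the use of the Llama-2 checkpoint id are placeholders for whatever the authors actually register.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical SCID special tokens: 3 codebook levels x 256 codes, for users and items.
scid_tokens = [f"<u_{l}_{c}>" for l in range(3) for c in range(256)] + \
              [f"<i_{l}_{c}>" for l in range(3) for c in range(256)]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register the SCIDs as additional special tokens so they are never split into
# subwords, then grow the embedding matrix to cover the new vocabulary entries.
tokenizer.add_special_tokens({"additional_special_tokens": scid_tokens})
model.resize_token_embeddings(len(tokenizer))
```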
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Offline Performance
The overall offline performance of Align$^3$GR and various baselines on three public benchmark datasets (Instruments, Beauty, Yelp) is presented in Table 1.
The following are the results from Table 1 of the original paper:
| Model | Instruments | Beauty | Yelp | |||||||||
| R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 | |
| Traditional Recommendation Methods | ||||||||||||
| MF | 0.0479 | 0.0735 | 0.0330 | 0.0412 | 0.0294 | 0.0474 | 0.0145 | 0.0191 | 0.0220 | 0.0381 | 0.0138 | 0.0190 |
| LightGCN | 0.0794 | 0.1000 | 0.0662 | 0.0728 | 0.0305 | 0.0511 | 0.0194 | 0.0260 | 0.0248 | 0.0407 | 0.0156 | 0.0207 |
| Sequential Recommendation Methods | ||||||||||||
| Caser | 0.0543 | 0.0710 | 0.0355 | 0.0409 | 0.0205 | 0.0347 | 0.0131 | 0.0176 | 0.0150 | 0.0326 | 0.0099 | 0.0134 |
| HGN | 0.0813 | 0.1048 | 0.0668 | 0.0774 | 0.0325 | 0.0512 | 0.0206 | 0.0266 | 0.0186 | 0.0320 | 0.0115 | 0.0159 |
| Bert4Rec | 0.0671 | 0.0822 | 0.0560 | 0.0608 | 0.0203 | 0.0347 | 0.0124 | 0.0170 | 0.0186 | 0.0291 | 0.0115 | 0.0159 |
| SASRec | 0.0751 | 0.0947 | 0.0627 | 0.0690 | 0.0380 | 0.0588 | 0.0246 | 0.0313 | 0.0183 | 0.0296 | 0.0116 | 0.0152 |
| BigRec | 0.0713 | 0.0576 | 0.0470 | 0.0491 | 0.0243 | 0.0299 | 0.011 | 0.0198 | 0.0154 | 0.0169 | 0.0137 | 0.0142 |
| Generative and LLM-based Recommendation Methods | ||||||||||||
| P5-SemID | 0.0775 | 0.0964 | 0.0669 | 0.0730 | 0.0393 | 0.0584 | 0.0273 | 0.0335 | 0.0202 | 0.0324 | 0.0131 | 0.0170 |
| P5-CID | 0.0809 | 0.0987 | 0.0695 | 0.0751 | 0.0404 | 0.0597 | 0.0284 | 0.0347 | 0.0219 | 0.0347 | 0.0140 | 0.0181 |
| TIGER | 0.0870 | 0.1058 | 0.0737 | 0.0797 | 0.0395 | 0.0610 | 0.0262 | 0.0331 | 0.0253 | 0.0407 | 0.0184 | 0.0231 |
| LETTER-TIGER | 0.0909 | 0.1122 | 0.0763 | 0.0831 | 0.0431 | 0.0672 | 0.0286 | 0.0364 | 0.0277 | 0.0426 | 0.0158 | 0.0199 |
| LC-Rec | 0.0824 | 0.1006 | 0.0712 | 0.0772 | 0.0443 | 0.0642 | 0.0311 | 0.0374 | 0.0230 | 0.0359 | 0.0164 | 0.0213 |
| LETTER-LC-Rec | 0.0913 | 0.1115 | 0.0789 | 0.0854 | 0.0505 | 0.0703 | 0.035 | 0.0418 | 0.0255 | 0.0393 | 0.0168 | 0.0211 |
| EAGER-LLM | 0.0991 | 0.1224 | 0.0851 | 0.0926 | 0.0548 | 0.0830 | 0.0369 | 0.0459 | 0.0373 | 0.0569 | 0.0251 | 0.0315 |
| Align$^3$GR | 0.1103 | 0.1442 | 0.0970 | 0.1113 | 0.0627 | 0.0994 | 0.0434 | 0.0529 | 0.0425 | 0.0679 | 0.0299 | 0.0403 |
| Improvement | +11.3% | +17.8% | +11.3% | +20.2% | +14.4% | +19.8% | +17.6% | +15.3% | +13.9% | +19.3% | +19.1% | +27.9% |
Analysis:
- Superior Performance: Align$^3$GR consistently achieves the best performance across all three public datasets (Instruments, Beauty, Yelp) and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10).
- Significant Improvements over SOTA Baselines: Compared to the strongest generative baseline, EAGER-LLM, Align$^3$GR demonstrates substantial improvements. For instance, on the Instruments dataset, it surpasses EAGER-LLM by +17.8% in Recall@10 (0.1442 vs. 0.1224) and +20.2% in NDCG@10 (0.1113 vs. 0.0926). Similar significant gains are observed across Beauty (e.g., +19.8% in Recall@10, +15.3% in NDCG@10) and Yelp (e.g., +19.3% in Recall@10, +27.9% in NDCG@10). These improvements are stated to be statistically significant.
- Effectiveness of Multi-Level Alignment: The results underscore the advantage of Align$^3$GR's multi-level alignment strategy. By jointly optimizing token-level, behavior modeling-level, and preference-level aspects, the framework effectively captures complex user preferences and collaborative relationships that simpler or single-level alignment approaches miss.
- Generative Models vs. Traditional/Sequential: Generally, the Generative and LLM-based Recommendation Methods outperform the Traditional and Sequential Recommendation Methods, highlighting the potential of LLMs in this domain. Align$^3$GR further extends this advantage.
6.1.2. Incremental Alignment Performance
Figure 4 visualizes the performance gains (specifically Recall@10) as Align$^3$GR progressively adds its alignment components.
The following figure (Figure 4 from the original paper) shows the Recall@10 (%) under incremental alignment configurations:

The figure is a chart showing Recall@10 (%) under incremental alignment configurations, comparing "Single + SEQ" against the stepwise addition of each alignment level and the resulting trend in Recall for each method.
Figure 4: Recall@10 under incremental alignment configurations; "Single + SEQ" denotes using item-side semantic IDs as tokens for the sequence task, while "+" indicates the cumulative addition of each module.
Analysis:
- Clear Upward Trajectory: The figure clearly shows an upward trend in Recall@10 as each alignment module is added, confirming that each component contributes positively to the overall performance.
- Token-Level Alignment Impact: The jump when replacing item-side semantic IDs with dual learning-based SCIDs (implied by the first significant boost after the Single + SEQ baseline) demonstrates the profound impact of token-level alignment. This highlights the value of modeling collaborative semantics at the fundamental token level for both users and items.
- Preference-Level Alignment's Substantial Improvement: The preference-level alignment stage (likely representing the final stages of the cumulative additions) yields the most substantial improvement. This strongly validates the effectiveness of the progressive DPO strategy, particularly the combination of self-play and real-world feedback, in refining the LLM's recommendations.
- Superiority to EAGER-LLM: Throughout all stages of incremental addition, Align$^3$GR consistently outperforms the SOTA baseline (EAGER-LLM), reinforcing the benefits of its comprehensive multi-level alignment approach in bridging the LLM-RS gap.
6.1.3. Online A/B Test Results
To validate Align$^3$GR's practical efficacy, an online A/B test was conducted on an industrial advertising recommendation platform.
The following are the results from Table 2 of the original paper:
| | Baseline | TIGER | Align$^3$GR |
| Recall@100 | 0.218 | 0.229 | 0.242 |
| Revenue (Improve.) | - | 0.555%↑ | 1.432% ↑ |
Analysis:
- Online Performance Gains: Align$^3$GR significantly outperforms both the industrial two-tower retrieval baseline and the TIGER model in online retrieval performance as measured by Recall@100:
  - Baseline: 0.218
  - TIGER: 0.229 (5% improvement over the baseline)
  - Align$^3$GR: 0.242 (5.7% improvement over TIGER, 10.9% over the baseline)
- Significant Business Impact: Crucially, Align$^3$GR achieves a statistically significant +1.432% revenue improvement in all advertising scenarios under full-scale deployment. This demonstrates that the offline advantages translate directly into measurable business value in a real-world, large-scale production environment. This is a strong validation of the framework's robustness and practical utility.
6.2. Ablation Study
Ablation studies were conducted on the Instruments dataset to understand the contribution of each alignment level.
6.2.1. Effect of Dual SCID Tokenization
Table 3 presents the ablation results for Dual SCID Tokenization.
The following are the results from Table 3 of the original paper:
| Tokenization | CF | U-I Alignment | Recall@10 | NDCG@10 |
| Item | × | X | 0.1322 | 0.0978 |
| Item | ✓ | X | 0.1346 | 0.0991 |
| Dual | X | X | 0.1390 | 0.1032 |
| Dual | X | ✓ | 0.1426 | 0.1083 |
| Dual | ✓ | X | 0.1428 | 0.1091 |
| Dual | ✓ | ✓ | 0.1442 | 0.1113 |
Analysis:
- Necessity of Dual Tokenization: Switching from "Item" (single-sided, item-only) to "Dual" tokenization (jointly modeling user and item), with CF and U-I alignment both disabled, leads to a significant boost (Recall@10 from 0.1322 to 0.1390; NDCG@10 from 0.0978 to 0.1032). This confirms that jointly modeling user and item token representations is critical.
CF): IncorporatingCollaborative Features (CF)consistently improves performance.- For "Item" tokenization:
Recall@10increases from 0.1322 to 0.1346. - For "Dual" tokenization: Comparing "Dual, X, X" (0.1390) with "Dual, ✓, X" (0.1428) shows a gain. And "Dual, X, ✓" (0.1426) vs "Dual, ✓, ✓" (0.1442) shows another gain. This highlights the importance of integrating
collaborative signalsintorepresentation learning.
- For "Item" tokenization:
- Efficacy of User-Item Alignment (
U-I Alignment): EnablingU-I alignmentvia theU2I behavior lossalso leads to consistent gains, especially when combined with dual tokenization and CF. Comparing "Dual, X, X" (0.1390) with "Dual, X, ✓" (0.1426) shows improvement. The best performance is achieved when all components ("Dual", "✓ CF", "✓ U-I Alignment") are active (Recall@100.1442,NDCG@100.1113). - Complementary Components: The results demonstrate that all three components (dual
tokenization,CF, andU-I Alignment) are complementary and essential for achieving optimal recommendation performance within thetoken-level alignmentstage.
6.2.2. Effect of Behavior Modeling-level Alignment Tasks
Table 4 presents the ablation results for multi-task SFT on the Instruments dataset.
The following are the results from Table 4 of the original paper:
| Methods | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
| SEQ | 0.1042 | 0.1329 | 0.0867 | 0.0982 |
| + C1 − C3 | 0.1046 | 0.1344 | 0.0881 | 0.0988 |
| + B1 | 0.1054 | 0.1399 | 0.0908 | 0.1045 |
| + User SCID | 0.1091 | 0.1417 | 0.0937 | 0.1051 |
| + B2 | 0.1103 | 0.1442 | 0.0959 | 0.1113 |
Analysis:
- SEQ as Backbone: The SEQ task (Sequential Item Prediction) serves as the baseline, using item SCIDs from the user's history.
- Incremental Gains:
  - + C1 − C3: Adding the remaining LCRec tasks (not clearly defined in the table, but likely Asymmetric Item Prediction and Item Prediction Based on User Intention) brings a marginal improvement.
  - + B1: Adding Personalized Preference Inference (task B1 from LCRec) yields a more noticeable gain (Recall@10 from 0.1344 to 0.1399).
  - + User SCID: Directly incorporating the user SCID into the prompts provides further gains (Recall@10 from 0.1399 to 0.1417). This indicates that rich, structured user SCID representations help the LLM better understand user-item interaction semantics.
  - + B2: The most significant boost across all metrics comes from adding B2, the bidirectional alignment tasks: Recall@10 jumps from 0.1417 to 0.1442 and NDCG@10 from 0.1051 to 0.1113. This highlights the critical role of explicitly aligning structured SCIDs with semantic information, allowing the LLM to build stronger correspondences between language and recommendation signals.
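To illustrate what a multi-task SFT corpus with bidirectional alignment might look like, here is a minimal sketch that assembles three example types: sequential prediction conditioned on the user SCID, SCID-to-description, and description-to-SCID. The prompt templates, function name `build_sft_examples`, and field names are hypothetical; the paper's exact prompts are not reproduced here.

```python
def build_sft_examples(user_scid: str, history_scids: list[str],
                       target_scid: str, target_title: str) -> list[dict]:
    """Assemble (instruction, response) pairs for a few of the SFT tasks.
    Templates are illustrative assumptions, not the paper's actual prompts."""
    history = " ".join(history_scids)
    return [
        # SEQ: sequential item prediction conditioned on the user SCID.
        {"instruction": f"User {user_scid} recently interacted with: {history}. "
                        f"Which item will they interact with next?",
         "response": target_scid},
        # B2 (forward direction): ground an item SCID in natural-language semantics.
        {"instruction": f"Describe the item referred to by {target_scid}.",
         "response": target_title},
        # B2 (backward direction): map a natural-language description back to its SCID.
        {"instruction": f"Which item ID corresponds to the item titled '{target_title}'?",
         "response": target_scid},
    ]

examples = build_sft_examples(
    user_scid="<u_0_17><u_1_4><u_2_203>",
    history_scids=["<i_0_5><i_1_99><i_2_41>", "<i_0_8><i_1_23><i_2_7>"],
    target_scid="<i_0_12><i_1_56><i_2_190>",
    target_title="Ernie Ball Regular Slinky Electric Guitar Strings",
)
for ex in examples:
    print(ex["instruction"], "->", ex["response"])
```

The two B2-style templates are what "bidirectional" refers to in the ablation: the LLM must both describe an SCID and recover an SCID from a description, which forces the ID space and the language space to stay consistent.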
6.2.3. Effect of Preference-Level Alignment Tasks
Table 5 presents the ablation study on SP-DPO and RF-DPO strategies using the Instruments dataset. The baseline is a well-trained SFT model followed by Softmax-DPO.
The following are the results from Table 5 of the original paper:
| DPO Variant | Self-Play | Progressive | Recall@10 | NDCG@10 |
|---|---|---|---|---|
| Softmax-DPO | × | × | 0.1295 | 0.0972 |
| SP-DPO | ✓ | × | 0.1356 | 0.1033 |
| SP-DPO | ✓ | ✓ | 0.1396 | 0.1042 |
| RF-DPO | - | × | 0.1414 | 0.1049 |
| RF-DPO | - | ✓ | 0.1442 | 0.1113 |
Analysis:
- Softmax-DPO Baseline: Starting from Softmax-DPO (without self-play or progressive learning) yields a Recall@10 of 0.1295 and an NDCG@10 of 0.0972.
- Impact of Self-Play (SP-DPO):
  - Adding self-play to DPO (SP-DPO, no progressive learning) significantly boosts performance (Recall@10 from 0.1295 to 0.1356, NDCG@10 from 0.0972 to 0.1033). This confirms the value of self-play in generating diverse training data and mitigating sparsity.
  - Further applying progressive learning within SP-DPO (moving from easy to hard stages) yields additional gains (Recall@10 to 0.1396, NDCG@10 to 0.1042).
- Impact of Real-world Feedback (RF-DPO):
  - RF-DPO (which, per the methodology, builds on the SP-DPO stage for initial generative ability) already outperforms SP-DPO with progressive learning even without progressive learning of its own (Recall@10 0.1414, NDCG@10 0.1049). This highlights the crucial role of real user feedback.
  - The best overall performance is achieved by combining RF-DPO with progressive learning (Recall@10 0.1442, NDCG@10 0.1113). This is the full preference-level alignment setup.
- Complementary Benefits: The results demonstrate the complementary benefits of progressive optimization (easy-to-hard learning stages) and the integration of both self-play (SP-DPO) and real-world feedback (RF-DPO) in aligning LLMs with user preference signals for superior recommendation performance.
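To anchor the preference-level discussion, the sketch below shows the standard pairwise DPO objective (not the paper's exact Softmax-DPO variant) together with a simple easy-to-hard curriculum that orders preference pairs by an assumed per-pair difficulty score. The functions `dpo_loss` and `progressive_schedule` and the `difficulty` field are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard pairwise DPO: push the policy's log-ratio for the chosen sequence
    above that of the rejected one, measured against a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def progressive_schedule(pairs: list[dict], num_stages: int = 3) -> list[list[dict]]:
    """Easy-to-hard curriculum: sort preference pairs by an assumed 'difficulty'
    score (e.g., prefix overlap between chosen and rejected SCIDs) and split them
    into stages that are trained in order."""
    ordered = sorted(pairs, key=lambda p: p["difficulty"])
    stage_size = max(1, len(ordered) // num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

# Toy usage: per-sequence log-probabilities would come from the policy and reference LLMs.
policy_chosen = torch.tensor([-4.2, -3.9])
policy_rejected = torch.tensor([-4.8, -4.1])
ref_chosen = torch.tensor([-4.5, -4.0])
ref_rejected = torch.tensor([-4.6, -4.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The curriculum split is where the "progressive" rows of Table 5 differ from the plain variants: the same loss is used throughout, but harder pairs are only introduced in later stages.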
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Align$^3$GR, a novel and comprehensive framework designed to integrate Large Language Models (LLMs) into personalized recommender systems. Align$^3$GR achieves this by systematically addressing semantic and behavioral misalignment through a unified multi-level alignment approach. The key contributions include:
- Token-level alignment: A dual SCID tokenization scheme that jointly encodes user and item semantic and collaborative signals into a compact, expressive token space.
- Behavior modeling-level alignment: An enhanced multi-task Supervised Fine-Tuning (SFT) stage that injects user SCIDs into LLM prompts and employs bidirectional semantic alignment tasks to ground SCID representations in real-world semantic meanings.
- Preference-level alignment: A progressive Direct Preference Optimization (DPO) strategy that combines self-play (SP-DPO) for data diversity and real-world feedback (RF-DPO) for dynamic preference adaptation, guided by curriculum learning principles (easy-to-hard stages).

Extensive experiments on public benchmark datasets (Instruments, Beauty, Yelp) consistently show that Align$^3$GR significantly outperforms state-of-the-art baselines. Crucially, online A/B tests and full-scale deployment on an industrial recommendation platform further validate its practical efficacy, demonstrating substantial improvements in Recall@100 and a statistically significant +1.432% revenue improvement. These results collectively highlight the critical importance of a hierarchical and unified alignment strategy for successfully transforming LLMs into robust and adaptive generative recommenders.
7.2. Limitations & Future Work
While the paper does not contain a dedicated "Limitations" section, some can be inferred from the problem statement and the nature of the proposed solution:
- Complexity of Multi-Stage Training: The framework involves multiple distinct stages of alignment (token, SFT, DPO), each with its own training objectives, hyperparameters, and potentially separate models (encoders, RQ-VAE). This multi-stage process could be computationally intensive and complex to manage and tune, especially in resource-constrained environments.
- Hyperparameter Sensitivity: The success of Align$^3$GR relies on careful tuning of various hyperparameters (e.g., beam width, the number of RQ-VAE levels). Finding the optimal combination across different datasets and scenarios might be challenging.
- Scalability of RQ-VAE and the SCID Space: While RQ-VAE compresses embeddings, generating SCIDs and integrating them into the LLM's vocabulary still adds complexity. For extremely large item catalogs or highly dynamic user profiles, managing the SCID space and ensuring efficient lookup and generation could be a challenge.
- Defining "Easy" vs. "Hard" for Progressive Learning: The prefix-ngram match metric for SP-DPO and the sentiment/behavior mapping for RF-DPO are heuristic definitions of "easy" and "hard" preference pairs (one possible reading of the prefix-based scoring is sketched at the end of this subsection). While effective, they might not perfectly capture the true learning difficulty or the nuances of user preferences; more adaptive or learned curriculum strategies could be explored.
- Generalizability to Diverse Domains: Although tested on several public datasets and an industrial platform, the specific semantic and collaborative encoders or LLM backbones might need adaptation for vastly different domains with unique data characteristics (e.g., cold-start scenarios, highly sparse data).
- Latency in Real-time Systems: The generative nature of LLMs can introduce higher inference latency than traditional retrieval models. While the online A/B test shows success, managing latency for very high-throughput, low-latency recommendation scenarios might require further optimization.

The paper implicitly suggests future work through its contributions, such as further refining tokenization to capture more subtle signals, developing more adaptive behavior modeling tasks, and exploring more sophisticated progressive learning mechanisms for DPO. The emphasis on industrial deployment also points to ongoing work on robustness, efficiency, and scalability for real-world applications.
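As referenced in the "Easy vs. Hard" limitation above, one plausible reading of the prefix-ngram match heuristic is sketched below: the more leading SCID tokens a negative shares with the positive, the harder the pair is to discriminate, so high-overlap pairs would be scheduled later in the curriculum. The scoring function `prefix_ngram_match` is an assumption; the paper's exact definition may differ.

```python
def prefix_ngram_match(pos_scid: list[str], neg_scid: list[str]) -> float:
    """Fraction of leading SCID tokens shared by a positive and a negative item.
    Higher overlap -> the pair is harder to discriminate (assumed reading)."""
    match = 0
    for p, n in zip(pos_scid, neg_scid):
        if p != n:
            break
        match += 1
    return match / max(len(pos_scid), 1)

# Example: two of three leading codes match -> a comparatively "hard" pair.
pos = ["<i_0_12>", "<i_1_56>", "<i_2_190>"]
neg = ["<i_0_12>", "<i_1_56>", "<i_2_7>"]
print(prefix_ngram_match(pos, neg))   # ~0.67
```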
7.3. Personal Insights & Critique
Align$^3$GR presents a highly compelling and well-structured approach to a critical problem in LLM-based recommendation: moving beyond augmentation to true end-to-end generative capabilities.
Strengths:
- Holistic Framework: The most significant strength is the unified multi-level alignment framework. It logically addresses the LLM-RS gap from the lowest level of data representation (tokens) up through behavior modeling and preference optimization. This integrated perspective is crucial, as misalignment at any single stage can undermine the entire system.
- Dual SCID Innovation: The Dual SCID Tokenization is particularly insightful. By explicitly integrating both semantic and collaborative signals into compact user and item IDs, the framework ensures that the LLM receives the rich, personalized context it needs from the very beginning. This goes beyond simple semantic embeddings and tackles the collaborative filtering aspect that is essential for recommender systems.
- Progressive DPO for Dynamic Adaptation: The progressive DPO strategy, combining self-play and real-world feedback with curriculum learning, is an elegant solution to the challenges of sparse and dynamic user preferences. This adaptive learning mechanism is key for LLMs to stay relevant and performant in ever-changing real-world recommendation environments: self-play helps with exploration and data generation, while real-world feedback grounds the model in actual business objectives.
- Strong Empirical Validation (Offline and Online): The comprehensive experimental results, particularly the significant gains in online A/B tests and the revenue improvement on an industrial platform, provide strong evidence of the framework's practical value and scalability. This real-world validation is often missing in academic papers and greatly boosts confidence in the proposed methods.
- Beginner-Friendly Explanation: The paper is relatively well written and provides good intuition for its components, making it accessible to researchers new to the field, assuming basic LLM and RS knowledge.
Potential Issues / Areas for Improvement (Critique):
- Formula Typo in the DPO Loss: As noted in the methodology section, a term in the Softmax-DPO loss formula (Equation 3) appears to be a typographical error. For a rigorous academic paper, such an error in a core mathematical formulation can cause confusion and should be corrected. The formula is reproduced here exactly as given, but it remains a point of concern for interpretation; a correct Softmax-DPO formulation typically involves a direct reward difference or log-ratio term.
- Detailed Breakdown of LCRec Tasks: Although Align$^3$GR builds upon LCRec, the paper does not fully detail the C1-C3 and B1 tasks referenced in the ablation study (Table 4). For a beginner-friendly understanding, a brief explanation of these tasks would have been beneficial, even if they come from prior work.
- Computational Cost: While LoRA is used for PEFT, the multi-stage training pipeline (multiple encoders, RQ-VAEs, LLM SFT, DPO) for a 7B LLM still implies significant computational resources. A more explicit discussion of total training time and inference latency (beyond just Recall@100) would be valuable, especially for industrial deployment.
- Defining "Relevance" for RF-DPO: Mapping sentiment scores from ecomgpt (1-5 to disliked/neutral/liked) for the public datasets is practical but potentially sensitive to the sentiment model's accuracy (a hypothetical version of this mapping is sketched after this list). A deeper analysis of how robust the mapping is, and of its impact on DPO learning, would be interesting.
- Generalizability of Prefix-Ngram Match: The prefix-ngram match metric for scoring SP-DPO samples over SCIDs is an interesting heuristic. However, the optimal ngram length, and whether the metric captures the true "difficulty" of SCID discrimination, may vary; further analysis or more adaptive difficulty curricula could be explored.
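For the RF-DPO critique above, the referenced sentiment mapping might look like the following sketch: scores in [1, 5] are bucketed into disliked / neutral / liked, and liked vs. disliked items for the same user are paired as chosen/rejected preferences. The thresholds, the pairing rule, and the names `bucket_sentiment` and `build_rf_dpo_pairs` are assumptions, not the paper's specification.

```python
from itertools import product

def bucket_sentiment(score: int) -> str:
    """Map a 1-5 sentiment score to a coarse preference label (assumed thresholds)."""
    if score <= 2:
        return "disliked"
    if score == 3:
        return "neutral"
    return "liked"

def build_rf_dpo_pairs(user_scid: str, rated_items: list[tuple[str, int]]) -> list[dict]:
    """Pair every liked item against every disliked item for the same user,
    discarding neutral items, to form chosen/rejected preference examples."""
    liked = [scid for scid, s in rated_items if bucket_sentiment(s) == "liked"]
    disliked = [scid for scid, s in rated_items if bucket_sentiment(s) == "disliked"]
    return [{"user": user_scid, "chosen": c, "rejected": r}
            for c, r in product(liked, disliked)]

pairs = build_rf_dpo_pairs(
    "<u_0_17><u_1_4><u_2_203>",
    [("<i_0_12><i_1_56><i_2_190>", 5), ("<i_0_8><i_1_23><i_2_7>", 2),
     ("<i_0_5><i_1_99><i_2_41>", 3)],
)
print(pairs)
```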
Transferability and Future Value:
The methods and conclusions of Align$^3$GR are highly transferable. The concept of multi-level alignment is not specific to recommendation and could be adapted for other domains where LLMs need to integrate with structured data and specialized objectives (e.g., knowledge graph reasoning, code generation, scientific discovery). The dual tokenization idea could be applied to any domain where semantic and collaborative/structural signals are jointly important. The progressive DPO strategy is a valuable contribution to RLHF/DPO research itself, offering a more stable and adaptive way to align LLMs with complex, dynamic preferences, which is relevant for conversational AI, content moderation, and personalized education systems.
In conclusion, Align$^3$GR is a significant step forward in the field of LLM-based generative recommendation, providing a robust and empirically validated framework that tackles the misalignment challenge head-on. Its holistic and progressive design offers valuable insights for future research and for the practical deployment of LLMs in complex, real-world applications.