Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation
TL;DR Summary
Align$^3$GR effectively transforms LLMs into recommendation systems via a unified multi-level alignment approach, introducing dual tokenization, enhanced behavior modeling, and progressive preference optimization. It significantly outperforms state-of-the-art baselines on standard metrics.
Abstract
Large Language Models (LLMs) demonstrate significant advantages in leveraging structured world knowledge and multi-step reasoning capabilities. However, fundamental challenges arise when transforming LLMs into real-world recommender systems due to semantic and behavioral misalignment. To bridge this gap, we propose Align$^3$GR, a novel framework that unifies token-level, behavior modeling-level, and preference-level alignment. Our approach introduces: (1) dual tokenization fusing user-item semantic and collaborative signals; (2) enhanced behavior modeling with bidirectional semantic alignment; and (3) a progressive DPO strategy combining self-play (SP-DPO) and real-world feedback (RF-DPO) for dynamic preference adaptation. Experiments show Align$^3$GR outperforms the SOTA baseline by +17.8% in Recall@10 and +20.2% in NDCG@10 on the public dataset, with significant gains in online A/B tests and full-scale deployment on an industrial large-scale recommendation platform.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation. The central topic of the paper is proposing a novel framework to integrate Large Language Models (LLMs) into real-world recommender systems by addressing semantic and behavioral misalignment through a unified multi-level alignment approach.
1.2. Authors
- Wencai Ye
- Mingjie Sun
- Shuhang Chen
- Wenjin Wu
- Peng Jiang
All authors are affiliated with Kuaishou Technology, China. Their research backgrounds appear to be in the areas of recommender systems, large language models, and their applications in industrial settings.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. The publication date is 2025-11-14. While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating early research findings in various scientific fields, including computer science and artificial intelligence. Papers on arXiv often precede formal publication in conferences like ACM SIGIR, KDD, NeurIPS, or ICML, which are top-tier venues in recommendation systems and machine learning.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces Align$^3$GR, a novel framework designed to transform Large Language Models (LLMs) into effective real-world recommender systems by addressing inherent semantic and behavioral misalignment. Align$^3$GR achieves this through a unified multi-level alignment strategy encompassing token-level, behavior modeling-level, and preference-level alignment. Key innovations include: (1) Dual tokenization that fuses user-item semantic and collaborative signals; (2) Enhanced behavior modeling using bidirectional semantic alignment; and (3) a Progressive DPO strategy that combines self-play (SP-DPO) and real-world feedback (RF-DPO) for dynamic preference adaptation. Experimental results on a public dataset demonstrate that Align$^3$GR significantly outperforms the state-of-the-art (SOTA) baseline by +17.8% in Recall@10 and +20.2% in NDCG@10. Furthermore, the framework shows substantial gains in online A/B tests and has been successfully deployed on an industrial large-scale recommendation platform.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2511.11255v1. This indicates it is a preprint version 1 on arXiv.
1.7. PDF Link
The PDF link is: https://arxiv.org/pdf/2511.11255v1.pdf.
2. Executive Summary
2.1. Background & Motivation
The core problem Align$^3$GR aims to solve is the fundamental challenge of effectively transforming Large Language Models (LLMs) into real-world recommender systems (RS). While LLMs possess impressive capabilities in leveraging structured world knowledge and performing multi-step reasoning, their direct application to recommendation tasks faces significant hurdles due to semantic and behavioral misalignment.
This problem is crucial because recommender systems are vital infrastructure for modern digital platforms like e-commerce, video streaming, and social media. The advancements in LLMs offer immense potential to revolutionize RS by moving beyond traditional discriminative approaches (which predict scores for existing items) to more generative paradigms (which can directly output recommended items in an end-to-end manner). However, LLMs are primarily trained for language modeling (concerned with semantic information and next-token prediction (NTP)), whereas recommender systems traditionally focus on modeling implicit user preferences based on interaction behavior information. This inherent mismatch creates a significant gap between LLMs' capabilities and the requirements of personalized recommendation.
Prior research has attempted to bridge this gap, often focusing on individual aspects like tokenization (transforming user/item info into tokens), Supervised Fine-Tuning (SFT) (adapting LLMs to recommendation tasks), or preference-based Reinforcement Learning (RL) (aligning outputs with user interests). However, existing methods frequently treat user and item information independently during tokenization, neglecting their crucial collaborative and semantic dependencies. Furthermore, preference-based RL techniques like Direct Preference Optimization (DPO) often rely on offline data without robust progressive learning mechanisms, making them less effective in adapting to the dynamic and complex nature of real-world user preferences and business objectives. These limitations result in suboptimal recommendation performance and hinder true LLM-to-recommendation alignment.
The paper's entry point is to propose a unified multi-level alignment framework that systematically addresses these gaps by integrating alignment across token-level, behavior modeling-level, and preference-level stages, ensuring a more comprehensive and adaptive integration of LLMs into RS.
2.2. Main Contributions / Findings
The paper makes several primary contributions to bridge the gap between LLMs and recommender systems:
- Unified Multi-Level Alignment Framework (Align$^3$GR): The paper proposes Align$^3$GR, a novel framework that jointly optimizes alignment at three critical levels: token-level, behavior modeling-level, and preference-level. This holistic approach aims to provide a more robust and comprehensive solution for transforming LLMs into effective generative recommenders, overcoming the limitations of single-level or fragmented alignment strategies.
- Dual SCID Tokenization and Enhanced Behavior Modeling:
  - Dual SCID Tokenization: It introduces a dual tokenization scheme that fuses both semantic and collaborative signals for users and items into Semantic-Collaborative IDs (SCIDs). This joint optimization at the input level ensures that LLMs receive rich, aligned representations that capture both textual meaning and interaction patterns, which is crucial for personalized recommendation.
  - Enhanced Behavior Modeling: It designs an enhanced multi-task Supervised Fine-Tuning (SFT) approach. This involves incorporating user SCID tokens directly into task prompts and introducing bidirectional semantic alignment tasks between user SCIDs and their semantic profiles. This explicitly grounds the abstract SCID tokens in real-world meaning, strengthening the model's understanding of user-item relationships.
- Progressive DPO Strategy for Dynamic Preference Adaptation: The paper develops a progressive Direct Preference Optimization (DPO) strategy that combines self-play DPO (SP-DPO) for generating diverse training data and real-world feedback DPO (RF-DPO) for dynamically adapting to actual user interests and business objectives. Inspired by curriculum learning, this approach moves from easy to hard preference pairs, enabling smoother convergence and more stable training, thereby addressing the challenges of sparse and dynamic user preferences.
- Comprehensive Experimental Validation: The paper provides extensive experimental validation on both public benchmark datasets (e.g., Instruments, Beauty, Yelp) and an industrial large-scale recommendation platform.
  - Offline Performance: Align$^3$GR consistently outperforms state-of-the-art baselines, demonstrating significant improvements (e.g., +17.8% in Recall@10 and +20.2% in NDCG@10 on the Instruments dataset compared to EAGER-LLM).
  - Online Performance: Online A/B tests on an industrial platform show Align$^3$GR achieving a statistically significant +1.432% revenue improvement and better Recall@100 compared to industrial baselines and TIGER.

These findings collectively demonstrate that Align$^3$GR effectively bridges the gap between LLMs and recommender systems by enabling high-quality personalization and robust adaptation in large-scale, dynamic recommendation environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Align$^3$GR, it is helpful to be familiar with several core concepts from Large Language Models (LLMs) and Recommender Systems (RS).
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on massive amounts of text data. Their primary task is often next-token prediction (NTP), where they learn to predict the most probable next word or token in a sequence given the preceding context. This enables them to generate coherent and contextually relevant text, understand various linguistic nuances, and perform multi-step reasoning. Examples include GPT-3, Llama, and T5.
- Recommender Systems (RS): These are information filtering systems that predict a user's preference for an item. They are ubiquitous in e-commerce (e.g., Amazon, Taobao), media streaming (e.g., Netflix, YouTube), and social media (e.g., TikTok, Instagram).
  - Discriminative Recommenders: The traditional paradigm, where models predict a score or probability that a user will interact with a given item. This usually involves ranking pre-existing items.
  - Generative Recommenders: A newer paradigm where the model directly generates item identifiers or descriptions as recommendations, rather than just ranking them. This allows for more dynamic and context-aware outputs.
- Tokenization: In LLMs, text and other forms of data must be converted into numerical representations called tokens. Tokenization is the process of breaking down raw input (e.g., words, subwords, characters, or even structured data like item IDs) into these discrete units that the model can process. Effective tokenization is crucial for LLMs to understand and generate sequences.
- Residual Quantization Variational Autoencoder (RQ-VAE): RQ-VAE is a type of Vector Quantization (VQ-VAE) model that learns to compress continuous embeddings (dense numerical representations) into discrete tokens or codes. It uses multiple "layers" of codebooks (sets of discrete vectors) to progressively refine the quantization, allowing for a more compact and expressive representation. In recommendation, RQ-VAE can transform rich item or user embeddings into discrete IDs that can be directly used by LLMs as tokens.
- Supervised Fine-Tuning (SFT): After an LLM is pre-trained on a vast amount of general text data, SFT is a process where the model is further trained on a smaller, task-specific dataset with explicit labels. For recommendation, SFT adapts a general-purpose LLM to understand recommendation-specific data structures, user behavior patterns, and generate appropriate recommendations.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It involves trial and error, where the agent receives feedback (rewards or penalties) for its actions.
- Reinforcement Learning from Human Feedback (RLHF): A technique used to align LLMs with human preferences or values. It typically involves: (1) an LLM generates responses; (2) human annotators rank or rate these responses; (3) a reward model is trained on these human preferences; (4) the LLM is then fine-tuned using RL to maximize the reward predicted by the reward model. RLHF is effective but can be unstable and computationally expensive.
- Direct Preference Optimization (DPO): An alternative to RLHF that simplifies the alignment process. Instead of training a separate reward model and then using RL, DPO directly optimizes the LLM's policy (its ability to generate responses) based on human preference data. It rephrases the RL objective as a simple classification problem, making it more stable and computationally efficient than RLHF.
- Collaborative Filtering (CF): A widely used technique in recommender systems that makes recommendations based on the preferences of similar users or the characteristics of similar items. It leverages patterns of user-item interactions (e.g., purchases, clicks, ratings) to identify collaborative signals, meaning that users who liked certain items in the past are likely to like other items that were also liked by those same users.
- Evaluation Metrics for Recommender Systems:
  - Recall@K: Measures the proportion of relevant items (e.g., items a user actually interacted with) that are successfully retrieved within the top $K$ recommendations.
    - Conceptual Definition: Recall@K indicates how many of the items a user would have liked are present in the top K recommendations. A higher Recall@K means the system is good at not missing relevant items.
    - Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\text{Relevant Items} \cap \text{Recommended Items@K}|}{|\text{Relevant Items}|} $
    - Symbol Explanation:
      - $|\text{Relevant Items} \cap \text{Recommended Items@K}|$: The number of relevant items that are also present in the top $K$ recommended items.
      - $|\text{Relevant Items}|$: The total number of relevant items for a given user.
  - NDCG@K (Normalized Discounted Cumulative Gain at K): A ranking-aware metric that considers the position of relevant items. Highly relevant items appearing earlier in the recommendation list contribute more to the score.
    - Conceptual Definition: NDCG@K measures the usefulness of a recommended list, where relevant items appearing higher in the list are valued more. It is normalized to be between 0 and 1, making it comparable across different queries or users.
    - Mathematical Formula:
      $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
      where
      $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
      and $\mathrm{IDCG@K}$ is the ideal DCG@K, calculated by ranking all relevant items by their relevance score and applying the DCG formula.
    - Symbol Explanation:
      - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list.
      - $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$.
      - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at position $K$, which is the maximum possible DCG if all relevant items were perfectly ranked.
      - $\log_2(j+1)$: A logarithmic discount factor, meaning items at higher ranks (smaller $j$) contribute more.
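To make these two metrics concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that computes Recall@K and NDCG@K for a single user with binary relevance; the function names and toy data are purely illustrative.

```python
import math

def recall_at_k(relevant: set, recommended: list, k: int) -> float:
    """Fraction of the user's relevant items that appear in the top-K list."""
    hits = len(relevant & set(recommended[:k]))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(relevant: set, recommended: list, k: int) -> float:
    """NDCG@K with binary relevance (rel_j = 1 if the item is relevant, else 0)."""
    dcg = sum(1.0 / math.log2(rank + 2)            # rank is 0-based, so j + 1 = rank + 2
              for rank, item in enumerate(recommended[:k])
              if item in relevant)
    ideal_hits = min(len(relevant), k)             # perfect ranking puts all hits first
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the user interacted with items {7, 42}; the model recommends 5 items.
relevant_items = {7, 42}
top_k_list = [3, 7, 19, 42, 8]
print(recall_at_k(relevant_items, top_k_list, k=5))  # 1.0  (both relevant items retrieved)
print(ndcg_at_k(relevant_items, top_k_list, k=5))    # ~0.65 (hits at ranks 2 and 4)
```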
3.2. Previous Works
The paper references several prior studies that form the foundation or represent state-of-the-art in generative recommendation and LLM preference alignment.
- Generative Recommendation:
- DSI (Differentiable Search Index) (Tay et al. 2022; Chen et al. 2023) and GENRE (Si et al. 2023): These approaches transform retrieval tasks into autoregressive sequence generation, where the model directly generates user context tokens or item identifiers. They highlight the shift from traditional retrieval to generative models.
- RQ-VAE (Lee et al. 2022) and other indexing/tokenization methods (e.g., hierarchical k-means, PQ): These techniques are crucial for converting continuous content embeddings into discrete tokens that LLMs can process.
- TIGER (Rajput et al. 2023), LC-Rec (Zheng et al. 2024), LETTER (Wang et al. 2024a), EAGER-LLM (Hong et al. 2025): These are prominent generative recommendation models that leverage codebook-based quantized identifiers or learnable tokenizers.
  - TIGER: Applies codebook-based quantized identifiers for items.
  - LC-Rec: Enhances codebook tokenization with auxiliary alignment tasks during SFT.
  - LETTER: Proposes a learnable tokenizer specifically for generative recommendation.
  - EAGER-LLM: Further models user-item collaborative signals for token-level alignment, representing a strong baseline that Align$^3$GR aims to surpass.
- Preference Alignment of LLMs:
  - RLHF (Reinforcement Learning from Human Feedback) (Bai et al. 2022): The pioneering method for aligning LLMs with human preferences, involving a reward model and RL fine-tuning. Align$^3$GR acknowledges its impact but notes its instability and high computational cost.
  - DPO (Direct Preference Optimization) (Rafailov et al. 2023): DPO directly optimizes the LLM's policy on preference data, addressing some limitations of RLHF. It has inspired many variants.
  - IPO (Yang, Tan, and Li 2025), cDPO (Furuta et al. 2024), rDPO (Qian et al. 2025), and Softmax-DPO (Chen et al. 2024b): These are DPO variants developed to handle issues like noise robustness, unbiased learning, or multiple rejected responses. Align$^3$GR specifically builds upon Softmax-DPO.
- Curriculum Learning (Liao et al. 2024) and Self-play (Wu et al. 2024; Gao et al. 2025): These are recent advancements that improve preference alignment by organizing training from easy to hard tasks (curriculum learning) or by having the model generate its own training data (self-play), both of which Align$^3$GR integrates into its progressive DPO strategy.
3.3. Technological Evolution
The evolution of recommender systems has seen a progression from early collaborative filtering and matrix factorization methods to deep learning-based approaches, then to Transformer-based sequential recommenders. The advent of LLMs marked a new paradigm, initially used to augment traditional RS with better content understanding or reasoning capabilities. More recently, the focus has shifted to making LLMs standalone generative recommenders that directly output item recommendations.
However, simply plugging LLMs into RS presents a fundamental gap: LLMs are semantic-focused, while RS are behavior-focused. This paper's work (Align$^3$GR) fits within this technological timeline by proposing a comprehensive framework to bridge this specific gap. It moves beyond merely using LLMs for text generation or simple embedding, instead focusing on deep alignment across multiple levels: ensuring that the tokens themselves carry both semantic and collaborative information, that the LLM's behavior modeling is explicitly aware of user-item relationships, and that user preferences are continually and dynamically optimized through RL-like mechanisms. This represents a significant step towards fully realizing the potential of LLMs as robust generative recommenders in real-world scenarios.
3.4. Differentiation Analysis
Compared to the main methods in related work, Align$^3$GR introduces several core differentiations and innovations:
- Unified Multi-Level Alignment: Unlike many previous works that focus on improving one specific stage in isolation (e.g., tokenization, SFT, or DPO), Align$^3$GR proposes a unified framework that explicitly integrates and jointly optimizes token-level, behavior modeling-level, and preference-level alignment. This holistic approach ensures that LLMs are effectively transformed into recommenders across the entire pipeline, addressing misalignment at every critical juncture.
- Dual Semantic-Collaborative ID (SCID) Tokenization:
  - Innovation: Previous tokenization methods either primarily encode items (e.g., TIGER, LC-Rec) or incorporate user representations without truly co-optimizing them with item embeddings for collaborative signals. Align$^3$GR introduces Dual SCID Tokenization, which jointly encodes both users and items, fusing their respective semantic features (e.g., profiles, descriptions) and collaborative features (e.g., behavioral patterns) into hierarchical discrete SCIDs within a unified framework.
  - Differentiation: This explicitly tackles the limitation of isolated modeling (Zhang et al. 2025) by ensuring mutual influences and collaborative signals are preserved from the earliest input stage, a significant advancement over methods like P5-SemID (semantic only) or P5-CID (collaborative via clustering, but less integrated). Even EAGER-LLM, which models user-item collaborative signals for token-level alignment, is surpassed by Align$^3$GR's more comprehensive approach.
- Enhanced Behavior Modeling with Bidirectional Semantic Alignment:
  - Innovation: While LC-Rec also uses multi-task SFT, Align$^3$GR enhances it by directly injecting user SCID tokens into LLM prompts for richer contextual alignment. More importantly, it introduces bidirectional alignment tasks (predicting a user's SCID from text and reconstructing text from the SCID).
  - Differentiation: This explicit index-language alignment directly grounds the abstract SCID tokens in their real-world semantic meanings, providing stronger supervision and enabling the LLM to build a more robust correspondence between structured behavioral signals and natural language semantics, which is less explored in prior SFT-based recommendation works.
- Progressive DPO Strategy (combining SP-DPO and RF-DPO):
  - Innovation: Existing DPO-based methods for RS (e.g., the DPO variants discussed in related work) often rely on static offline preference data and lack progressive learning mechanisms. Align$^3$GR addresses the challenge of dynamic and sparse user preferences by proposing a progressive DPO strategy inspired by curriculum learning: it starts with self-play DPO (SP-DPO) to generate diverse training data and mitigate sparsity and exploration bottlenecks, then progressively incorporates real-world feedback DPO (RF-DPO) for accurate adaptation to real user interests and business objectives.
  - Differentiation: This easy-to-hard progressive learning, coupled with the synergistic use of synthetic (SP-DPO) and real (RF-DPO) feedback, allows for continuous improvement and better generalization to real-world scenarios, a key advantage over static offline optimization approaches. It also provides an adaptive mechanism that previous DPO variants, primarily focused on NLP or static preference data, often miss in the context of dynamic RS.

In summary, Align$^3$GR differentiates itself by offering a deeply integrated, multi-level solution that considers both semantic and collaborative aspects from tokenization through behavior modeling to dynamic preference adaptation, providing a more complete and effective framework for LLM-based generative recommendation.
4. Methodology
4.1. Principles
The core idea behind Align$^3$GR is to systematically bridge the inherent semantic and behavioral misalignment between Large Language Models (LLMs) and recommender systems (RS) by establishing a unified multi-level alignment framework. This framework operates on three tightly integrated stages: token-level alignment, behavior modeling-level alignment, and preference-level alignment. The theoretical basis is that by aligning the LLM at these fundamental levels, from how information is represented as tokens to how user behaviors are modeled and how preferences are learned, the LLM can truly understand and generate personalized recommendations, moving beyond its general language capabilities. The intuition is that for an LLM to act as a recommender, it needs to "think" like a recommender from the ground up, incorporating user-item collaborative signals and preference dynamics into its core operations, rather than just treating items as arbitrary text.
4.2. Core Methodology In-depth (Layer by Layer)
The framework is structured into three consecutive and interconnected stages, as depicted in Figure 2.
The following figure (Figure 2 from the original paper) illustrates the architecture of the framework:

The figure is a schematic of the Align$^3$GR framework, showing its multi-level alignment mechanism: dual encoding of users and items, token-level alignment, and the progression of preference-level alignment combining self-play (SP-DPO) with real-world feedback (RF-DPO).
Figure 2: The architecture of Align$^3$GR, a unified multi-level alignment framework for generative recommendation. Token-level alignment is built on dual SC encoders and RQ-VAEs, and preference-level alignment is accomplished via progressive SP-DPO and RF-DPO.
4.2.1. Token-level Alignment: Dual SCID Tokenization
Problem Statement: Existing tokenization methods for generative recommendation often focus primarily on encoding items and tend to overlook user structure modeling. Even when user representations are incorporated, they are rarely co-optimized with item embeddings, leading to suboptimal user-item alignment and representations that lack critical collaborative signals. This independent modeling fails to capture the mutual influences crucial for comprehensive preference learning.
Solution: Dual SCID Tokenization. Align$^3$GR addresses this by introducing a Dual SCID Tokenization scheme. This approach aims to jointly encode both users and items, leveraging their respective semantic and collaborative signals within a unified, co-optimized framework. The goal is to learn mutually aligned, expressive representations for both.
Process:
- Feature Extraction: For both users and items, the method first extracts two types of features:
  - Semantic Features: These capture textual information (e.g., user profiles, item descriptions, titles). They are processed by a frozen semantic encoder, typically initialized with a pre-trained Language Model (LM) like T5 (Ni et al. 2021).
  - Collaborative Features: These capture behavioral patterns (e.g., past interactions, purchase history). They are processed by a frozen collaborative encoder, such as DIN (Deep Interest Network) (Zhou et al. 2018).
- Hybrid Semantic-Collaborative (SC) Encoding: The resulting semantic and collaborative embeddings for users and items are concatenated. These concatenated embeddings are then fed into a hybrid SC Encoder (e.g., a Multi-Layer Perceptron, MLP), which integrates both information types to produce unified SC embeddings (denoted as $\mathbf{u}_u$ for user $u$ and $\mathbf{v}_i$ for item $i$).
- SCID Quantization: Finally, these unified SC embeddings ($\mathbf{u}_u$, $\mathbf{v}_i$) are quantized into discrete Semantic-Collaborative IDs (SCIDs) using a Residual Quantization Variational Autoencoder (RQ-VAE) (Lee et al. 2022). This process compresses the continuous embeddings into a compact, discrete token space.

Training Objective: The training of this Dual SCID Tokenization module involves two main components:
- User-to-Item (U2I) Behavior Loss: This loss enhances alignment between user and item SC embeddings by optimizing for observed interactions.
  $ \mathcal{L}_{\mathrm{U2I}} = - \frac{1}{|\mathcal{B}|} \sum_{(u, i^{+}) \in \mathcal{B}} \left[ \log \frac{\exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}})}{\exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}}) + \sum_{j \in \mathcal{N}_{u}} \exp(\mathbf{u}_{u}^{\top} \mathbf{v}_{j})} \right] $
  - Symbol Explanation:
    - $\mathcal{L}_{\mathrm{U2I}}$: The User-to-Item behavior loss.
    - $|\mathcal{B}|$: The batch size, i.e., the number of user-item interaction pairs in the current training batch $\mathcal{B}$.
    - $(u, i^{+}) \in \mathcal{B}$: A positive user-item interaction pair within the batch $\mathcal{B}$, meaning user $u$ interacted with item $i^{+}$.
    - $\mathbf{u}_{u}$: The unified SC embedding for user $u$.
    - $\mathbf{v}_{i^{+}}$: The unified SC embedding for the positive item $i^{+}$.
    - $\mathcal{N}_{u}$: The set of negative samples for user $u$, which are randomly sampled items from within the batch that user $u$ did not interact with.
    - $\mathbf{v}_{j}$: The unified SC embedding for a negative item $j$ from $\mathcal{N}_{u}$.
    - $\mathbf{u}_{u}^{\top} \mathbf{v}_{i^{+}}$: The dot-product similarity between user $u$'s embedding and positive item $i^{+}$'s embedding. A higher value indicates stronger alignment.
    - The term inside the $\log$ is a softmax probability, representing the probability of selecting the positive item given user $u$, relative to the positive item and its negative samples. This is a common form of sampled softmax or negative-sampling loss.
- Overall Joint Loss: This loss combines the U2I behavior loss with the quantization losses from the RQ-VAE components for both users and items.
  $ \mathcal{L} = \alpha \cdot \mathcal{L}_{\mathrm{U2I}} + \gamma \cdot \left( \mathcal{L}_{\mathrm{User\ RQ}} + \mathcal{L}_{\mathrm{Item\ RQ}} \right) $
  - Symbol Explanation:
    - $\mathcal{L}$: The total joint training loss for the Dual SCID Tokenization module.
    - $\alpha$: A trade-off hyperparameter weighting the U2I behavior loss.
    - $\mathcal{L}_{\mathrm{U2I}}$: The User-to-Item behavior loss as defined above.
    - $\gamma$: A trade-off hyperparameter weighting the RQ-VAE quantization losses.
    - $\mathcal{L}_{\mathrm{User\ RQ}}$: The reconstruction and quantization loss specific to the user RQ-VAE (e.g., reconstruction loss + commitment loss from VQ-VAE).
    - $\mathcal{L}_{\mathrm{Item\ RQ}}$: The reconstruction and quantization loss specific to the item RQ-VAE.

Training Strategy: In practice, the training proceeds in two phases:
- Initially, $\alpha$ is set to 1 and $\gamma$ to 0. This phase focuses solely on optimizing $\mathcal{L}_{\mathrm{U2I}}$ to stabilize behavior alignment and thoroughly train the SC Encoder; it is monitored using metrics like AUC.
- Once behavior alignment is stable, the hyperparameters are switched so that the quantization losses from the RQ-VAEs are prioritized, ensuring effective compression into SCIDs.

During inference, the user and item SCID generation modules are deployed separately, each producing their respective SCIDs for downstream tasks in the LLM. This design ensures that the LLM receives compact, collaboratively-aware tokens, preserving crucial collaborative relationships throughout the recommendation pipeline.
4.2.2. Behavior Modeling-level Alignment: Multi-task SFT
After generating quantized SCIDs for users and items, the next stage performs Supervised Fine-Tuning (SFT) of the LLM to enhance its generative and semantic alignment capabilities within this new SCID token space.
The following figure (Figure 3 from the original paper) illustrates the behavior modeling-level alignment:

The figure illustrates the multi-level alignment concept in the Align$^3$GR framework. It is divided into three parts: sequential item prediction, explicit index-language alignment, and implicit recommendation-oriented alignment. Each part includes different prediction and alignment strategies, graphically showing how user-item relationships are modeled.
Figure 3: Behavior Modeling-level Alignment.
Foundation: The framework builds upon LC-Rec (Zheng et al. 2024), which defines a multi-task SFT framework encompassing:
-
Sequential Item Prediction: Predicting the next item a user will interact with based on their historical sequence. -
Asymmetric Item Prediction: Predicting an item that is related to a target item but not necessarily in a direct sequence (e.g., complementary items). -
Item Prediction Based on User Intention: Predicting items based on explicit or inferred user intentions (e.g., query-based recommendations). -
Personalized Preference Inference: Tasks designed to infer deeper user preferences.These tasks aim to equip the
LLM with the ability to capture sequential dependencies, understand implicit user preferences, and align user behavior with items in a diverse and adaptive manner.
Enhancements in Align$^3$GR: LC-Rec is limited in capturing comprehensive user-item collaborative and semantic relationships. To address this, Align$^3$GR introduces two key enhancements during SFT:
- User SCID Injection: The user's SCID token is injected into all task prompts. This enriches the input representation for the LLM, allowing it to leverage more comprehensive user features and contextual alignment. This is shown in Figure 3, where the User SCID is included in the input sequence.
- Bidirectional Semantic Alignment Tasks (B.2): This is a crucial addition that explicitly aligns the abstract SCID tokens with their real-world semantic meanings. Two tasks are introduced:
  - Text to SCID: The model is trained to predict a user's SCID token given their profile text (e.g., textual descriptions or attributes). This ensures that the SCID encapsulates the user's semantic information.
  - SCID to Text: Conversely, the model is trained to reconstruct the user profile text given their SCID token. This ensures that the SCID can be mapped back to meaningful semantic descriptions, strengthening the grounding.

By incorporating user SCID tokens directly and explicitly aligning structured SCID information with semantic information through bidirectional tasks, Align$^3$GR provides a stronger foundation for downstream preference optimization compared to prior SFT designs.
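To illustrate how the bidirectional alignment tasks can be turned into SFT training pairs, here is a minimal sketch under our own assumptions; the prompt wording and the `<u_*>` token format are hypothetical, since the paper does not publish its exact templates.

```python
def build_bidirectional_sft_pairs(user_scid: str, user_profile_text: str):
    """Create the two B.2 alignment examples for one user:
    Text -> SCID and SCID -> Text. Both are plain (prompt, target) pairs
    that can be mixed into the multi-task SFT data."""
    text_to_scid = {
        "prompt": f"User profile: {user_profile_text}\nPredict this user's SCID:",
        "target": user_scid,
    }
    scid_to_text = {
        "prompt": f"User SCID: {user_scid}\nDescribe this user's profile:",
        "target": user_profile_text,
    }
    return [text_to_scid, scid_to_text]

# Hypothetical example: a 3-level user SCID rendered as special tokens.
pairs = build_bidirectional_sft_pairs(
    user_scid="<u_12><u_207><u_45>",
    user_profile_text="Enjoys acoustic guitars and beginner-friendly recording gear.",
)
for p in pairs:
    print(p["prompt"], "=>", p["target"])
```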
4.2.3. Preference-level Alignment: Progressive DPO with Self-Play and Real-world Feedback
Even with effective SCID tokenization and multi-task SFT, the model's recommendation capabilities might still be preliminary. Simple preference optimization after SFT is often insufficient for continual improvement or robust business alignment due to limited coverage of annotated preference data and the dynamic complexity of real recommendation scenarios.
Solution: Progressive DPO with Self-Play (SP-DPO) and Real-world Feedback (RF-DPO). Align$^3$GR proposes a progressive DPO strategy (inspired by curriculum learning, moving from "easy to hard") that leverages both self-play and real-world feedback. This approach ensures continuous adaptation and robust alignment.
Foundation: Softmax-DPO: The progressive DPO is based on Softmax-DPO (Chen et al. 2024b), which can handle training samples containing multiple rejected responses. The SFT model initialized from the previous stage serves as the starting point.
Training Objective (Softmax-DPO): The training objective for each stage $i$ is formally defined as:
$ \mathcal{L}(\pi_{\theta}^{i}, \pi_{\mathrm{ref}}^{i}) = - \mathbb{E}_{(x, y_{w}^{i}, Y_{l}^{i}) \sim \mathcal{D}^{i}} \left[ \log \sigma \left( - \log \sum_{y_{l}^{i} \in Y_{l}^{i}} \exp \left( \beta \log \frac{\pi_{\theta}^{i}(y_{l}^{i} \mid x)}{\pi_{\mathrm{ref}}^{i}(y_{l}^{i} \mid x)} - \beta \log \frac{\pi_{\theta}^{i}(y_{w}^{i} \mid x)}{\pi_{\mathrm{ref}}^{i}(y_{w}^{i} \mid x)} \right) \right) \right] $
- Symbol Explanation:
  - $\mathcal{L}(\pi_{\theta}^{i}, \pi_{\mathrm{ref}}^{i})$: The DPO loss function at stage $i$.
  - $\pi_{\theta}^{i}$: The current policy (the LLM being fine-tuned) at stage $i$.
  - $\pi_{\mathrm{ref}}^{i}$: The reference policy at stage $i$, typically a frozen version of the SFT model or the policy from the previous DPO stage; it acts as a regularization anchor.
  - $\mathbb{E}$: Expectation over the training data distribution.
  - $(x, y_{w}^{i}, Y_{l}^{i}) \sim \mathcal{D}^{i}$: A training sample from the progressive training set $\mathcal{D}^{i}$.
    - $x$: The prompt (e.g., user history, current context).
    - $y_{w}^{i}$: The chosen (preferred/winning) response (e.g., the SCID of the item the user liked).
    - $Y_{l}^{i}$: The set of rejected (less preferred/losing) responses (e.g., SCIDs of items the user disliked or did not interact with).
  - $\sigma$: The sigmoid function.
  - $\beta$: A hyperparameter that controls the strength of the KL-divergence penalty between the current policy and the reference policy, effectively managing how much the model can deviate from its initial behavior.
  - $\beta \log \frac{\pi_{\theta}^{i}(y \mid x)}{\pi_{\mathrm{ref}}^{i}(y \mid x)}$: The log-ratio of the probability of generating response $y$ under the current policy versus the reference policy. The inner term takes the difference of this log-ratio between each rejected response $y_{l}^{i}$ and the chosen response $y_{w}^{i}$, so minimizing the loss pushes the policy to prefer $y_{w}^{i}$ over every rejected response in $Y_{l}^{i}$.

The fine-tuned model at each stage $i$ serves as the reference policy for the next stage $i+1$, enabling preference distinctions to be learned progressively.
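The objective above can be written directly as a loss over summed log-probabilities. The sketch below is our own reading of Softmax-DPO (one chosen response contrasted against a set of rejected responses), with the policy and reference log-probabilities assumed to be precomputed per response.

```python
import torch
import torch.nn.functional as F

def softmax_dpo_loss(policy_chosen_logp: torch.Tensor,    # [B]    log pi_theta(y_w | x)
                     ref_chosen_logp: torch.Tensor,       # [B]    log pi_ref(y_w | x)
                     policy_rejected_logp: torch.Tensor,  # [B, R] log pi_theta(y_l | x)
                     ref_rejected_logp: torch.Tensor,     # [B, R] log pi_ref(y_l | x)
                     beta: float = 0.1) -> torch.Tensor:
    """Softmax-DPO: each chosen response is contrasted against R rejected responses."""
    chosen_ratio = beta * (policy_chosen_logp - ref_chosen_logp)          # [B]
    rejected_ratio = beta * (policy_rejected_logp - ref_rejected_logp)    # [B, R]
    # -log sigma( -log sum_l exp(rejected_ratio_l - chosen_ratio) ), averaged over the batch.
    inner = torch.logsumexp(rejected_ratio - chosen_ratio.unsqueeze(1), dim=1)  # [B]
    return -F.logsigmoid(-inner).mean()

# Toy usage: batch of 2 prompts, 20 rejected responses each (matching the paper's 1-vs-20 setup).
B, R = 2, 20
loss = softmax_dpo_loss(torch.randn(B), torch.randn(B), torch.randn(B, R), torch.randn(B, R))
```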
-
Components of Progressive DPO:
- Progressive Self-Play DPO (SP-DPO):
  - Purpose: To enhance the model's generative capability and mitigate data sparsity by generating diverse and informative training data.
  - Mechanism: SP-DPO involves the model interacting with itself to create preference pairs.
  - Progressive Stages: Learning is divided into three stages based on the hierarchical nature of SCIDs and a prefix-ngram match metric (Zheng et al. 2025):
    - Easy Stage: Chosen and rejected SCID responses are completely different, with no shared prefix-ngram; these are easy for the model to distinguish.
    - Medium Stage: The prefix-ngram overlap between chosen and rejected SCID responses progressively increases, making discrimination harder.
    - Hard Stage: Even higher prefix-ngram overlap, further increasing difficulty, though responses remain non-identical.
  - These three-stage preference data, combined with real user behavior sequences, progressively form the training data for DPO. The prefix-ngram match metric can also be extended to a SCID vector-similarity metric for softer sample construction.
- Progressive Real-world Feedback DPO (RF-DPO):
  - Purpose: To align the model with actual user interests and business objectives by incorporating authentic user feedback.
  - Mechanism: The LLM recommends its own generated results to users, and their feedback is collected.
  - Feedback Categorization: Feedback is categorized into three levels: disliked, neutral, and liked.
  - Progressive Stages: Similar to SP-DPO, RF-DPO also follows a progressive strategy:
    - Easy Stage: Uses strongly disliked items as negatives and liked items as positives.
    - Hard Stage: Uses neutral items as harder negatives, while liked items remain positives. This systematically strengthens preference learning by introducing more nuanced negative signals.
  - Industrial Settings: In industrial recommendation, feedback levels are defined by user behavior:
    - Disliked: Explicit negative feedback (e.g., a "dislike" button, an explicit negative comment).
    - Neutral: Implicit negative feedback (e.g., an impression without a click, short dwell time).
    - Liked: Positive feedback (e.g., a "like" button, purchase, long engagement).
  - Public Datasets: For public datasets (e.g., Amazon reviews), an LLM-based sentiment model (e.g., ecomgpt (Li et al. 2023)) scores reviews, mapping scores to levels: disliked (1), neutral (2-3), and liked (4-5).

Key Advantages: This progressive DPO framework offers several benefits:
- Continual Enhancement: The model continually improves its ability to discern and generalize user preferences.
- Overcoming the "Preference Ceiling": It moves beyond the limitations of static data.
- Smoother Learning: The curriculum learning approach (easy-to-hard) provides smoother interpolation between task distributions, leading to more efficient learning and stable training.
- Synergistic Learning: Combines the exploration benefits of self-play with the grounding provided by real-world feedback.
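As an illustration of how the two progressive data sources might be staged, the sketch below (our own construction, with hypothetical overlap thresholds) buckets self-play preference pairs by the prefix-ngram overlap of their SCIDs and maps raw scores to the disliked/neutral/liked levels used by RF-DPO, following the 1 / 2-3 / 4-5 review-score scheme described above.

```python
def prefix_ngram_overlap(chosen_scid: list, rejected_scid: list) -> int:
    """Length of the shared SCID prefix between a chosen and a rejected response."""
    n = 0
    for a, b in zip(chosen_scid, rejected_scid):
        if a != b:
            break
        n += 1
    return n

def sp_dpo_stage(chosen_scid: list, rejected_scid: list) -> str:
    """Bucket a self-play preference pair into the easy/medium/hard curriculum.
    The thresholds are illustrative, not values reported in the paper."""
    overlap = prefix_ngram_overlap(chosen_scid, rejected_scid)
    if overlap == 0:
        return "easy"      # completely different SCIDs
    elif overlap == 1:
        return "medium"    # partial prefix overlap
    return "hard"          # high overlap but still non-identical

def feedback_level(review_score: int) -> str:
    """Map a 1-5 review score to the RF-DPO feedback levels."""
    if review_score <= 1:
        return "disliked"
    if review_score <= 3:
        return "neutral"
    return "liked"

print(sp_dpo_stage(["<i_3>", "<i_18>", "<i_7>"], ["<i_3>", "<i_18>", "<i_9>"]))  # hard
print(feedback_level(3))  # neutral
```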
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three real-world sequential recommendation datasets from diverse domains and one industrial dataset.
-
Public Datasets:
Instruments: A subset of the Amazon review corpus, focusing on user interactions with musical equipment.Beauty: Also from the Amazon review datasets, containing extensive user behaviors related to beauty products.Yelp: Comprising user-business interactions from the Yelp challenge dataset.- Preprocessing: For fair comparison across models, data is preprocessed following standard protocols (Zheng et al. 2024; Rajput et al. 2023; Wang et al. 2024a). This typically includes filtering users and items with fewer than five interactions and applying the
leave-one-out strategy for splitting data into training, validation, and test sets. Each user's history length is restricted to a maximum of 20 items for all sequential models.
-
Industrial Dataset:
-
Source: An internal industrial large-scale advertising recommendation platform.
-
Purpose: Used for online A/B tests to validate Align$^3$GR's practical efficacy and business impact at scale.
-
Characteristics: Represents real-world, large-scale user traffic (approximately 40+ million users).
These datasets were chosen because they are standard benchmarks in
sequential recommendation and generative recommendation, allowing for comparison with state-of-the-art methods. The industrial dataset provides crucial real-world validation beyond academic benchmarks.
-
5.2. Evaluation Metrics
The performance of the models is evaluated using standard top- metrics, which are common in recommender systems to assess the quality of recommendations.
-
Recall@K:
  - Conceptual Definition: Measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the completeness of the recommendations, indicating how well the system identifies items that a user will interact with, regardless of their position in the ranked list (as long as they are within the top $K$).
  - Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\text{Relevant Items} \cap \text{Recommended Items@K}|}{|\text{Relevant Items}|} $
  - Symbol Explanation:
    - $K$: The number of top recommendations considered.
    - $\text{Relevant Items}$: The set of all items that are relevant to the user (e.g., items the user actually interacted with in the test set).
    - $\text{Recommended Items@K}$: The set of the top $K$ items recommended by the model.
    - $|\cdot|$: Denotes the cardinality (number of elements) of a set.
    - $\cap$: Set intersection.
-
NDCG@K (Normalized Discounted Cumulative Gain at K):
  - Conceptual Definition: NDCG@K is a ranking-aware metric that evaluates the utility of a ranked list of recommendations. It assigns higher scores to highly relevant items that appear at higher positions (earlier) in the list. The score is normalized to be between 0 and 1, allowing for comparisons across different recommendation lists.
  - Mathematical Formula:
    $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    where
    $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
    and $\mathrm{IDCG@K}$ is the Ideal Discounted Cumulative Gain, which is the DCG@K for the perfectly sorted list of relevant items.
  - Symbol Explanation:
    - $K$: The number of top recommendations considered.
    - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list. For implicit feedback scenarios (like in this paper), $\mathrm{rel}_j$ is often binary (1 if relevant, 0 if not).
    - $\log_2(j+1)$: A logarithmic discount factor that reduces the contribution of items at lower ranks (higher $j$).
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$, which sums the relevance scores discounted by their position.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at $K$, calculated by arranging all relevant items in the test set in decreasing order of their relevance scores and then applying the DCG formula. This serves as the normalization factor.

For the experiments, $K$ is set to 5 and 10 (Recall@5, Recall@10, NDCG@5, NDCG@10). For online A/B tests, Recall@100 and Revenue (Improve.) are used.
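As a quick worked example with the binary relevance used in this paper: suppose $K = 5$ and the user's single relevant item appears at position 3 of the recommended list. Then $ \mathrm{DCG@5} = \frac{2^{1}-1}{\log_2(3+1)} = \frac{1}{2} $, $ \mathrm{IDCG@5} = \frac{2^{1}-1}{\log_2(1+1)} = 1 $, so $ \mathrm{NDCG@5} = 0.5 $, while $ \mathrm{Recall@5} = 1/1 = 1 $ because the only relevant item is retrieved within the top 5.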
5.3. Baselines
The paper compares against a comprehensive set of strong baselines, categorized into traditional, sequential, and generative/LLM-based recommender systems.
-
Traditional Recommendation Methods:
- MF (Matrix Factorization) (Mehta and Rana 2017): A classic collaborative filtering technique that decomposes the user-item interaction matrix into lower-dimensional user and item latent factor matrices.
- LightGCN (He et al. 2020): A simplified Graph Convolutional Network (GCN) for recommendation, which only includes the most essential component (neighborhood aggregation) for collaborative filtering.
-
Sequential Recommendation Methods:
- Caser (Convolutional Sequence Embedding Recommendation) (Tang and Wang 2018): Uses convolutional filters to capture local sequential patterns in user interaction history.
- HGN (Hierarchical Gating Networks) (Ma, Kang, and Liu 2019): Captures long- and short-term user preferences using hierarchical gating mechanisms.
- BERT4Rec (Sequential Recommendation with Bidirectional Encoder Representations from Transformer) (Sun et al. 2019): Adapts the BERT model for sequential recommendation by predicting masked items in a user's interaction sequence.
- SASRec (Self-Attentive Sequential Recommendation) (Kang and McAuley 2018): Uses a self-attention mechanism to model long-range dependencies in user behavior sequences.
-
Generative and LLM-based Recommendation Methods:
-
BIGRec(Bi-step Grounding Paradigm for Large Language Models in Recommendation Systems) (Bao et al. 2025): AnLLM-based generative recommenderthat uses item titles as textual identifiers. -
P5-SemID(Wang et al. 2024a): A variant that leverages item metadata to createsemantic identifiersforgenerative recommendation. -
P5-CID(Wang et al. 2024a): Incorporatescollaborative signalsvia clustering to createcollaborative identifiersforLLM-based models. -
TIGER(Transformer-based Item Generative Recommender) (Rajput et al. 2023): Appliescodebook-based quantized identifiersfor items, framing recommendation as a sequence generation task. -
LETTER(Learnable Item Tokenization for Generative Recommendation) (Wang et al. 2024a): Proposes alearnable tokenizerto generate discrete item tokens for generative recommendation models. -
LETTER-TIGER: Implies an integration or specific configuration ofLETTERandTIGER. -
LC-Rec(Large Language Models by Integrating Collaborative Semantics for Recommendation) (Zheng et al. 2024): Enhancescodebook tokenizationwithauxiliary alignment tasksduringSFTto integratecollaborative semantics. This is a direct predecessor and a strong generative baseline. -
LETTER-LC-Rec: Implies an integration or specific configuration ofLETTERandLC-Rec. -
EAGER-LLM (Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration) (Hong et al. 2025): Models user-item collaborative signals for token-level alignment, representing the state-of-the-art baseline that Align$^3$GR aims to significantly outperform, especially in the token-level alignment aspect. All baselines are either implemented or adapted using available open-source code to ensure fair comparison.
-
5.4. Implementation Details
- Backbone LLM:
Llama2-7B(Touvron et al. 2023) is used as the foundationalLLM. - Parameter-Efficient Fine-Tuning (PEFT):
LoRA(Low-Rank Adaptation) (Hu et al. 2022) is employed for efficient fine-tuning of theLlama2-7Bmodel, reducing computational costs. - Item Tokenization: A 3-level
RQ-VAEis used for item tokenization. Eachcodebookwithin theRQ-VAEcontains 256embeddingsof dimension 32. - Vocabulary Expansion: The generated
SCIDrepresentations for bothusersanditemsare incorporated into theLLM's vocabulary to preventout-of-vocabulary (OOV)issues and ensure seamless integration. - Training Parameters:
- Training Steps: 20,000 steps.
- Optimizer:
AdamW. - Batch Size: 1024.
- Learning Rate: Selected from {1e-3, 5e-4, 1e-4} based on validation set performance.
- Hardware: All experiments are conducted on 4 NVIDIA RTX A800 GPUs.
- Hyperparameter Tuning: Hyperparameters, including $\alpha$ and $\gamma$ for the loss functions, are tuned on the validation set.
- Softmax-DPO Configuration: For each Softmax-DPO sample, 1 chosen response ($y_w$) and 20 rejected responses ($Y_l$) are used.
- Evaluation: For generative methods utilizing beam search, the beam width is consistently set to 20, following EAGER-LLM. Results are averaged over five runs with different random seeds to ensure robustness.
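The vocabulary-expansion step mentioned above can be sketched with the Hugging Face API. This is a minimal illustration under our own assumptions: the SCID token strings, the 3-level/256-code layout, and the use of the Llama-2 checkpoint id are placeholders for whatever the authors actually register.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical SCID special tokens: 3 codebook levels x 256 codes, for users and items.
scid_tokens = [f"<u_{l}_{c}>" for l in range(3) for c in range(256)] + \
              [f"<i_{l}_{c}>" for l in range(3) for c in range(256)]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register the SCIDs as additional special tokens so they are never split into
# subwords, then grow the embedding matrix to cover the new vocabulary entries.
tokenizer.add_special_tokens({"additional_special_tokens": scid_tokens})
model.resize_token_embeddings(len(tokenizer))
```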
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Offline Performance
The overall offline performance of Align$^3$GR and various baselines on three public benchmark datasets (Instruments, Beauty, Yelp) is presented in Table 1.
The following are the results from Table 1 of the original paper:
| Model | Instruments | Beauty | Yelp | |||||||||
| R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 | |
| Traditional Recommendation Methods | ||||||||||||
| MF | 0.0479 | 0.0735 | 0.0330 | 0.0412 | 0.0294 | 0.0474 | 0.0145 | 0.0191 | 0.0220 | 0.0381 | 0.0138 | 0.0190 |
| LightGCN | 0.0794 | 0.1000 | 0.0662 | 0.0728 | 0.0305 | 0.0511 | 0.0194 | 0.0260 | 0.0248 | 0.0407 | 0.0156 | 0.0207 |
| Sequential Recommendation Methods | ||||||||||||
| Caser | 0.0543 | 0.0710 | 0.0355 | 0.0409 | 0.0205 | 0.0347 | 0.0131 | 0.0176 | 0.0150 | 0.0326 | 0.0099 | 0.0134 |
| HGN | 0.0813 | 0.1048 | 0.0668 | 0.0774 | 0.0325 | 0.0512 | 0.0206 | 0.0266 | 0.0186 | 0.0320 | 0.0115 | 0.0159 |
| Bert4Rec | 0.0671 | 0.0822 | 0.0560 | 0.0608 | 0.0203 | 0.0347 | 0.0124 | 0.0170 | 0.0186 | 0.0291 | 0.0115 | 0.0159 |
| SASRec | 0.0751 | 0.0947 | 0.0627 | 0.0690 | 0.0380 | 0.0588 | 0.0246 | 0.0313 | 0.0183 | 0.0296 | 0.0116 | 0.0152 |
| BigRec | 0.0713 | 0.0576 | 0.0470 | 0.0491 | 0.0243 | 0.0299 | 0.011 | 0.0198 | 0.0154 | 0.0169 | 0.0137 | 0.0142 |
| Generative and LLM-based Recommendation Methods | ||||||||||||
| P5-SemID | 0.0775 | 0.0964 | 0.0669 | 0.0730 | 0.0393 | 0.0584 | 0.0273 | 0.0335 | 0.0202 | 0.0324 | 0.0131 | 0.0170 |
| P5-CID | 0.0809 | 0.0987 | 0.0695 | 0.0751 | 0.0404 | 0.0597 | 0.0284 | 0.0347 | 0.0219 | 0.0347 | 0.0140 | 0.0181 |
| TIGER | 0.0870 | 0.1058 | 0.0737 | 0.0797 | 0.0395 | 0.0610 | 0.0262 | 0.0331 | 0.0253 | 0.0407 | 0.0184 | 0.0231 |
| LETTER-TIGER | 0.0909 | 0.1122 | 0.0763 | 0.0831 | 0.0431 | 0.0672 | 0.0286 | 0.0364 | 0.0277 | 0.0426 | 0.0158 | 0.0199 |
| LC-Rec | 0.0824 | 0.1006 | 0.0712 | 0.0772 | 0.0443 | 0.0642 | 0.0311 | 0.0374 | 0.0230 | 0.0359 | 0.0164 | 0.0213 |
| LETTER-LC-Rec | 0.0913 | 0.1115 | 0.0789 | 0.0854 | 0.0505 | 0.0703 | 0.035 | 0.0418 | 0.0255 | 0.0393 | 0.0168 | 0.0211 |
| EAGER-LLM | 0.0991 | 0.1224 | 0.0851 | 0.0926 | 0.0548 | 0.0830 | 0.0369 | 0.0459 | 0.0373 | 0.0569 | 0.0251 | 0.0315 |
| Align$^3$GR | 0.1103 | 0.1442 | 0.0970 | 0.1113 | 0.0627 | 0.0994 | 0.0434 | 0.0529 | 0.0425 | 0.0679 | 0.0299 | 0.0403 |
| Improvement | +11.3% | +17.8% | +11.3% | +20.2% | +14.4% | +19.8% | +17.6% | +15.3% | +13.9% | +19.3% | +19.1% | +27.9% |
Analysis:
- Superior Performance: Align$^3$GR consistently achieves the best performance across all three public datasets (Instruments, Beauty, Yelp) and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10).
- Significant Improvements over SOTA Baselines: Compared to the strongest generative baseline, EAGER-LLM, Align$^3$GR demonstrates substantial improvements. For instance, on the Instruments dataset, it surpasses EAGER-LLM by +17.8% in Recall@10 (0.1442 vs. 0.1224) and +20.2% in NDCG@10 (0.1113 vs. 0.0926). Similar significant gains are observed across Beauty (e.g., +19.8% in Recall@10, +15.3% in NDCG@10) and Yelp (e.g., +19.3% in Recall@10, +27.9% in NDCG@10). These improvements are stated to be statistically significant.
- Effectiveness of Multi-Level Alignment: The results underscore the advantage of Align$^3$GR's multi-level alignment strategy. By jointly optimizing token-level, behavior modeling-level, and preference-level aspects, the framework effectively captures complex user preferences and collaborative relationships that simpler or single-level alignment approaches miss.
- Generative Models vs. Traditional/Sequential: Generally, the Generative and LLM-based Recommendation Methods outperform the Traditional and Sequential Recommendation Methods, highlighting the potential of LLMs in this domain. Align$^3$GR further extends this advantage.
6.1.2. Incremental Alignment Performance
Figure 4 visualizes the performance gains (specifically Recall@10) as Align$^3$GR progressively adds its alignment components.
The following figure (Figure 4 from the original paper) shows the Recall@10 (%) under incremental alignment configurations:

The figure is a chart showing Recall@10 (%) under incremental alignment configurations, comparing "Single + SEQ" against the stepwise addition of each alignment level and the resulting trend in Recall for each method.
Figure 4: Recall@10 under incremental alignment configurations; "Single + SEQ" denotes using item-side semantic IDs as tokens for the sequence task, while "+" indicates the cumulative addition of each module.
Analysis:
- Clear Upward Trajectory: The figure clearly shows an upward trend in Recall@10 as each alignment module is added, confirming that each component contributes positively to the overall performance.
- Token-Level Alignment Impact: The jump when replacing item-side semantic IDs with dual learning-based SCIDs (implied by the first significant boost after the Single + SEQ baseline) demonstrates the profound impact of token-level alignment. This highlights the value of modeling collaborative semantics at the fundamental token level for both users and items.
- Preference-Level Alignment's Substantial Improvement: The preference-level alignment stage (likely representing the final stages of the cumulative additions) yields the most substantial improvement. This strongly validates the effectiveness of the progressive DPO strategy, particularly the combination of self-play and real-world feedback, in refining the LLM's recommendations.
- Superiority to EAGER-LLM: Throughout all stages of incremental addition, Align$^3$GR consistently outperforms the SOTA baseline (EAGER-LLM), reinforcing the benefits of its comprehensive multi-level alignment approach in bridging the LLM-RS gap.
6.1.3. Online A/B Test Results
To validate Align$^3$GR's practical efficacy, an online A/B test was conducted on an industrial advertising recommendation platform.
The following are the results from Table 2 of the original paper:
| | Baseline | TIGER | Align$^3$GR |
| Recall@100 | 0.218 | 0.229 | 0.242 |
| Revenue (Improve.) | - | 0.555%↑ | 1.432% ↑ |
Analysis:
- Online Performance Gains: Align$^3$GR significantly outperforms both the industrial two-tower retrieval baseline and the TIGER model in online retrieval performance as measured by Recall@100:
  - Baseline: 0.218
  - TIGER: 0.229 (5% improvement over the baseline)
  - Align$^3$GR: 0.242 (5.7% improvement over TIGER, 10.9% over the baseline)
- Significant Business Impact: Crucially, Align$^3$GR achieves a statistically significant +1.432% revenue improvement in all advertising scenarios under full-scale deployment. This demonstrates that the offline advantages translate directly into measurable business value in a real-world, large-scale production environment. This is a strong validation of the framework's robustness and practical utility.
6.2. Ablation Study
Ablation studies were conducted on the Instruments dataset to understand the contribution of each alignment level.
6.2.1. Effect of Dual SCID Tokenization
Table 3 presents the ablation results for Dual SCID Tokenization.
The following are the results from Table 3 of the original paper:
| Tokenization | CF | U-I Alignment | Recall@10 | NDCG@10 |
| Item | × | X | 0.1322 | 0.0978 |
| Item | ✓ | X | 0.1346 | 0.0991 |
| Dual | X | X | 0.1390 | 0.1032 |
| Dual | X | ✓ | 0.1426 | 0.1083 |
| Dual | ✓ | X | 0.1428 | 0.1091 |
| Dual | ✓ | ✓ | 0.1442 | 0.1113 |
Analysis:
- Necessity of Dual Tokenization: Switching from "Item" (single-sided, item-only) to "Dual" tokenization (jointly modeling user and item), with CF and U-I alignment both disabled, leads to a significant boost (Recall@10 from 0.1322 to 0.1390; NDCG@10 from 0.0978 to 0.1032). This confirms that jointly modeling user and item token representations is critical.
CF): IncorporatingCollaborative Features (CF)consistently improves performance.- For "Item" tokenization:
Recall@10increases from 0.1322 to 0.1346. - For "Dual" tokenization: Comparing "Dual, X, X" (0.1390) with "Dual, ✓, X" (0.1428) shows a gain. And "Dual, X, ✓" (0.1426) vs "Dual, ✓, ✓" (0.1442) shows another gain. This highlights the importance of integrating
collaborative signalsintorepresentation learning.
- For "Item" tokenization:
- Efficacy of User-Item Alignment (
U-I Alignment): EnablingU-I alignmentvia theU2I behavior lossalso leads to consistent gains, especially when combined with dual tokenization and CF. Comparing "Dual, X, X" (0.1390) with "Dual, X, ✓" (0.1426) shows improvement. The best performance is achieved when all components ("Dual", "✓ CF", "✓ U-I Alignment") are active (Recall@100.1442,NDCG@100.1113). - Complementary Components: The results demonstrate that all three components (dual
tokenization,CF, andU-I Alignment) are complementary and essential for achieving optimal recommendation performance within thetoken-level alignmentstage.
6.2.2. Effect of Behavior Modeling-level Alignment Tasks
Table 4 presents the ablation results for multi-task SFT on the Instruments dataset.
The following are the results from Table 4 of the original paper:
| Methods | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
| SEQ | 0.1042 | 0.1329 | 0.0867 | 0.0982 |
| + C1 − C3 | 0.1046 | 0.1344 | 0.0881 | 0.0988 |
| + B1 | 0.1054 | 0.1399 | 0.0908 | 0.1045 |
| + User SCID | 0.1091 | 0.1417 | 0.0937 | 0.1051 |
| + B2 | 0.1103 | 0.1442 | 0.0959 | 0.1113 |
Analysis:
- SEQ as Backbone: The SEQ task (Sequential Item Prediction) serves as the baseline, using item SCIDs from the user's history.
- Incremental Gains:
  - + C1 − C3: Adding the remaining LCRec tasks (not clearly defined in the table, but likely Asymmetric Item Prediction and Item Prediction Based on User Intention) brings a marginal improvement.
  - + B1: Adding Personalized Preference Inference (task B1 from LCRec) yields a more noticeable gain (Recall@10 from 0.1344 to 0.1399).
  - + User SCID: Directly incorporating the user SCID into the prompts provides further gains (Recall@10 from 0.1399 to 0.1417). This indicates that rich, structured user SCID representations help the LLM better understand user-item interaction semantics.
  - + B2: The most significant boost across all metrics comes from adding B2, the bidirectional alignment tasks: Recall@10 jumps from 0.1417 to 0.1442 and NDCG@10 from 0.1051 to 0.1113. This highlights the critical role of explicitly aligning structured SCIDs with semantic information, allowing the LLM to build stronger correspondences between language and recommendation signals.
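To illustrate what a multi-task SFT corpus with bidirectional alignment might look like, here is a minimal sketch that assembles three example types: sequential prediction conditioned on the user SCID, SCID-to-description, and description-to-SCID. The prompt templates, function name `build_sft_examples`, and field names are hypothetical; the paper's exact prompts are not reproduced here.

```python
def build_sft_examples(user_scid: str, history_scids: list[str],
                       target_scid: str, target_title: str) -> list[dict]:
    """Assemble (instruction, response) pairs for a few of the SFT tasks.
    Templates are illustrative assumptions, not the paper's actual prompts."""
    history = " ".join(history_scids)
    return [
        # SEQ: sequential item prediction conditioned on the user SCID.
        {"instruction": f"User {user_scid} recently interacted with: {history}. "
                        f"Which item will they interact with next?",
         "response": target_scid},
        # B2 (forward direction): ground an item SCID in natural-language semantics.
        {"instruction": f"Describe the item referred to by {target_scid}.",
         "response": target_title},
        # B2 (backward direction): map a natural-language description back to its SCID.
        {"instruction": f"Which item ID corresponds to the item titled '{target_title}'?",
         "response": target_scid},
    ]

examples = build_sft_examples(
    user_scid="<u_0_17><u_1_4><u_2_203>",
    history_scids=["<i_0_5><i_1_99><i_2_41>", "<i_0_8><i_1_23><i_2_7>"],
    target_scid="<i_0_12><i_1_56><i_2_190>",
    target_title="Ernie Ball Regular Slinky Electric Guitar Strings",
)
for ex in examples:
    print(ex["instruction"], "->", ex["response"])
```

The two B2-style templates are what "bidirectional" refers to in the ablation: the LLM must both describe an SCID and recover an SCID from a description, which forces the ID space and the language space to stay consistent.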
6.2.3. Effect of Preference-Level Alignment Tasks
Table 5 presents the ablation study on SP-DPO and RF-DPO strategies using the Instruments dataset. The baseline is a well-trained SFT model followed by Softmax-DPO.
The following are the results from Table 5 of the original paper:
| DPO Variant | Self-Play | Progressive | Recall@10 | NDCG@10 |
|---|---|---|---|---|
| Softmax-DPO | × | × | 0.1295 | 0.0972 |
| SP-DPO | ✓ | × | 0.1356 | 0.1033 |
| SP-DPO | ✓ | ✓ | 0.1396 | 0.1042 |
| RF-DPO | - | × | 0.1414 | 0.1049 |
| RF-DPO | - | ✓ | 0.1442 | 0.1113 |
Analysis:
- Softmax-DPO Baseline: Starting from Softmax-DPO (without self-play or progressive learning) yields a Recall@10 of 0.1295 and an NDCG@10 of 0.0972.
- Impact of Self-Play (SP-DPO):
  - Adding self-play to DPO (SP-DPO, no progressive learning) significantly boosts performance (Recall@10 from 0.1295 to 0.1356, NDCG@10 from 0.0972 to 0.1033). This confirms the value of self-play in generating diverse training data and mitigating sparsity.
  - Further applying progressive learning within SP-DPO (moving from easy to hard stages) yields additional gains (Recall@10 to 0.1396, NDCG@10 to 0.1042).
- Impact of Real-world Feedback (RF-DPO):
  - RF-DPO (which, per the methodology, builds on the SP-DPO stage for initial generative ability) already outperforms SP-DPO with progressive learning even without progressive learning of its own (Recall@10 0.1414, NDCG@10 0.1049). This highlights the crucial role of real user feedback.
  - The best overall performance is achieved by combining RF-DPO with progressive learning (Recall@10 0.1442, NDCG@10 0.1113). This is the full preference-level alignment setup.
- Complementary Benefits: The results demonstrate the complementary benefits of progressive optimization (easy-to-hard learning stages) and the integration of both self-play (SP-DPO) and real-world feedback (RF-DPO) in aligning LLMs with user preference signals for superior recommendation performance.
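To anchor the preference-level discussion, the sketch below shows the standard pairwise DPO objective (not the paper's exact Softmax-DPO variant) together with a simple easy-to-hard curriculum that orders preference pairs by an assumed per-pair difficulty score. The functions `dpo_loss` and `progressive_schedule` and the `difficulty` field are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard pairwise DPO: push the policy's log-ratio for the chosen sequence
    above that of the rejected one, measured against a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def progressive_schedule(pairs: list[dict], num_stages: int = 3) -> list[list[dict]]:
    """Easy-to-hard curriculum: sort preference pairs by an assumed 'difficulty'
    score (e.g., prefix overlap between chosen and rejected SCIDs) and split them
    into stages that are trained in order."""
    ordered = sorted(pairs, key=lambda p: p["difficulty"])
    stage_size = max(1, len(ordered) // num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]

# Toy usage: per-sequence log-probabilities would come from the policy and reference LLMs.
policy_chosen = torch.tensor([-4.2, -3.9])
policy_rejected = torch.tensor([-4.8, -4.1])
ref_chosen = torch.tensor([-4.5, -4.0])
ref_rejected = torch.tensor([-4.6, -4.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The curriculum split is where the "progressive" rows of Table 5 differ from the plain variants: the same loss is used throughout, but harder pairs are only introduced in later stages.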
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Align$^3$GR, a novel and comprehensive framework designed to integrate Large Language Models (LLMs) into personalized recommender systems. Align$^3$GR achieves this by systematically addressing semantic and behavioral misalignment through a unified multi-level alignment approach. The key contributions include:
- Token-level alignment: A dual SCID tokenization scheme that jointly encodes user and item semantic and collaborative signals into a compact, expressive token space.
- Behavior modeling-level alignment: An enhanced multi-task Supervised Fine-Tuning (SFT) stage that injects user SCIDs into LLM prompts and employs bidirectional semantic alignment tasks to ground SCID representations in real-world semantic meanings.
- Preference-level alignment: A progressive Direct Preference Optimization (DPO) strategy that combines self-play (SP-DPO) for data diversity and real-world feedback (RF-DPO) for dynamic preference adaptation, guided by curriculum learning principles (easy-to-hard stages).

Extensive experiments on public benchmark datasets (Instruments, Beauty, Yelp) consistently show that Align$^3$GR significantly outperforms state-of-the-art baselines. Crucially, online A/B tests and full-scale deployment on an industrial recommendation platform further validate its practical efficacy, demonstrating substantial improvements in Recall@100 and a statistically significant +1.432% revenue improvement. These results collectively highlight the critical importance of a hierarchical and unified alignment strategy for successfully transforming LLMs into robust and adaptive generative recommenders.
7.2. Limitations & Future Work
While the paper does not contain a dedicated "Limitations" section, some can be inferred from the problem statement and the nature of the proposed solution:
- Complexity of Multi-Stage Training: The framework involves multiple distinct stages of alignment (token, SFT, DPO), each with its own training objectives, hyperparameters, and potentially separate models (encoders, RQ-VAE). This multi-stage process could be computationally intensive and complex to manage and tune, especially in resource-constrained environments.
- Hyperparameter Sensitivity: The success of Align$^3$GR relies on careful tuning of various hyperparameters (e.g., beam width, the number of RQ-VAE levels). Finding the optimal combination across different datasets and scenarios might be challenging.
- Scalability of RQ-VAE and the SCID Space: While RQ-VAE compresses embeddings, generating SCIDs and integrating them into the LLM's vocabulary still adds complexity. For extremely large item catalogs or highly dynamic user profiles, managing the SCID space and ensuring efficient lookup and generation could be a challenge.
- Defining "Easy" vs. "Hard" for Progressive Learning: The prefix-ngram match metric for SP-DPO and the sentiment/behavior mapping for RF-DPO are heuristic definitions of "easy" and "hard" preference pairs (one possible reading of the prefix-based scoring is sketched at the end of this subsection). While effective, they might not perfectly capture the true learning difficulty or the nuances of user preferences; more adaptive or learned curriculum strategies could be explored.
- Generalizability to Diverse Domains: Although tested on several public datasets and an industrial platform, the specific semantic and collaborative encoders or LLM backbones might need adaptation for vastly different domains with unique data characteristics (e.g., cold-start scenarios, highly sparse data).
- Latency in Real-time Systems: The generative nature of LLMs can introduce higher inference latency than traditional retrieval models. While the online A/B test shows success, managing latency for very high-throughput, low-latency recommendation scenarios might require further optimization.

The paper implicitly suggests future work through its contributions, such as further refining tokenization to capture more subtle signals, developing more adaptive behavior modeling tasks, and exploring more sophisticated progressive learning mechanisms for DPO. The emphasis on industrial deployment also points to ongoing work on robustness, efficiency, and scalability for real-world applications.
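As referenced in the "Easy vs. Hard" limitation above, one plausible reading of the prefix-ngram match heuristic is sketched below: the more leading SCID tokens a negative shares with the positive, the harder the pair is to discriminate, so high-overlap pairs would be scheduled later in the curriculum. The scoring function `prefix_ngram_match` is an assumption; the paper's exact definition may differ.

```python
def prefix_ngram_match(pos_scid: list[str], neg_scid: list[str]) -> float:
    """Fraction of leading SCID tokens shared by a positive and a negative item.
    Higher overlap -> the pair is harder to discriminate (assumed reading)."""
    match = 0
    for p, n in zip(pos_scid, neg_scid):
        if p != n:
            break
        match += 1
    return match / max(len(pos_scid), 1)

# Example: two of three leading codes match -> a comparatively "hard" pair.
pos = ["<i_0_12>", "<i_1_56>", "<i_2_190>"]
neg = ["<i_0_12>", "<i_1_56>", "<i_2_7>"]
print(prefix_ngram_match(pos, neg))   # ~0.67
```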
7.3. Personal Insights & Critique
Align$^3$GR presents a highly compelling and well-structured approach to a critical problem in LLM-based recommendation: moving beyond augmentation to true end-to-end generative capabilities.
Strengths:
- Holistic Framework: The most significant strength is the unified multi-level alignment framework. It logically addresses the LLM-RS gap from the lowest level of data representation (tokens) up through behavior modeling and preference optimization. This integrated perspective is crucial, as misalignment at any single stage can undermine the entire system.
- Dual SCID Innovation: The Dual SCID Tokenization is particularly insightful. By explicitly integrating both semantic and collaborative signals into compact user and item IDs, the framework ensures that the LLM receives the rich, personalized context it needs from the very beginning. This goes beyond simple semantic embeddings and tackles the collaborative filtering aspect that is essential for recommender systems.
- Progressive DPO for Dynamic Adaptation: The progressive DPO strategy, combining self-play and real-world feedback with curriculum learning, is an elegant solution to the challenges of sparse and dynamic user preferences. This adaptive learning mechanism is key for LLMs to stay relevant and performant in ever-changing real-world recommendation environments: self-play helps with exploration and data generation, while real-world feedback grounds the model in actual business objectives.
- Strong Empirical Validation (Offline and Online): The comprehensive experimental results, particularly the significant gains in online A/B tests and the revenue improvement on an industrial platform, provide strong evidence of the framework's practical value and scalability. This real-world validation is often missing in academic papers and greatly boosts confidence in the proposed methods.
- Beginner-Friendly Explanation: The paper is relatively well written and provides good intuition for its components, making it accessible to researchers new to the field, assuming basic LLM and RS knowledge.
Potential Issues / Areas for Improvement (Critique):
- Formula Typo in the DPO Loss: As noted in the methodology section, a term in the Softmax-DPO loss formula (Equation 3) appears to be a typographical error. For a rigorous academic paper, such an error in a core mathematical formulation can cause confusion and should be corrected. The formula is reproduced here exactly as given, but it remains a point of concern for interpretation; a correct Softmax-DPO formulation typically involves a direct reward difference or log-ratio term.
- Detailed Breakdown of LCRec Tasks: Although Align$^3$GR builds upon LCRec, the paper does not fully detail the C1-C3 and B1 tasks referenced in the ablation study (Table 4). For a beginner-friendly understanding, a brief explanation of these tasks would have been beneficial, even if they come from prior work.
- Computational Cost: While LoRA is used for PEFT, the multi-stage training pipeline (multiple encoders, RQ-VAEs, LLM SFT, DPO) for a 7B LLM still implies significant computational resources. A more explicit discussion of total training time and inference latency (beyond just Recall@100) would be valuable, especially for industrial deployment.
- Defining "Relevance" for RF-DPO: Mapping sentiment scores from ecomgpt (1-5 to disliked/neutral/liked) for the public datasets is practical but potentially sensitive to the sentiment model's accuracy (a hypothetical version of this mapping is sketched after this list). A deeper analysis of how robust the mapping is, and of its impact on DPO learning, would be interesting.
- Generalizability of Prefix-Ngram Match: The prefix-ngram match metric for scoring SP-DPO samples over SCIDs is an interesting heuristic. However, the optimal ngram length, and whether the metric captures the true "difficulty" of SCID discrimination, may vary; further analysis or more adaptive difficulty curricula could be explored.
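For the RF-DPO critique above, the referenced sentiment mapping might look like the following sketch: scores in [1, 5] are bucketed into disliked / neutral / liked, and liked vs. disliked items for the same user are paired as chosen/rejected preferences. The thresholds, the pairing rule, and the names `bucket_sentiment` and `build_rf_dpo_pairs` are assumptions, not the paper's specification.

```python
from itertools import product

def bucket_sentiment(score: int) -> str:
    """Map a 1-5 sentiment score to a coarse preference label (assumed thresholds)."""
    if score <= 2:
        return "disliked"
    if score == 3:
        return "neutral"
    return "liked"

def build_rf_dpo_pairs(user_scid: str, rated_items: list[tuple[str, int]]) -> list[dict]:
    """Pair every liked item against every disliked item for the same user,
    discarding neutral items, to form chosen/rejected preference examples."""
    liked = [scid for scid, s in rated_items if bucket_sentiment(s) == "liked"]
    disliked = [scid for scid, s in rated_items if bucket_sentiment(s) == "disliked"]
    return [{"user": user_scid, "chosen": c, "rejected": r}
            for c, r in product(liked, disliked)]

pairs = build_rf_dpo_pairs(
    "<u_0_17><u_1_4><u_2_203>",
    [("<i_0_12><i_1_56><i_2_190>", 5), ("<i_0_8><i_1_23><i_2_7>", 2),
     ("<i_0_5><i_1_99><i_2_41>", 3)],
)
print(pairs)
```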
Transferability and Future Value:
The methods and conclusions of Align$^3$GR are highly transferable. The concept of multi-level alignment is not specific to recommendation and could be adapted for other domains where LLMs need to integrate with structured data and specialized objectives (e.g., knowledge graph reasoning, code generation, scientific discovery). The dual tokenization idea could be applied to any domain where semantic and collaborative/structural signals are jointly important. The progressive DPO strategy is a valuable contribution to RLHF/DPO research itself, offering a more stable and adaptive way to align LLMs with complex, dynamic preferences, which is relevant for conversational AI, content moderation, and personalized education systems.
In conclusion, Align$^3$GR is a significant step forward in the field of LLM-based generative recommendation, providing a robust and empirically validated framework that tackles the misalignment challenge head-on. Its holistic and progressive design offers valuable insights for future research and for the practical deployment of LLMs in complex, real-world applications.