LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
TL;DR Summary
LLaDA-Rec is a discrete diffusion framework for generative recommendation that addresses the unidirectional constraints and error accumulation of autoregressive decoding. By integrating bidirectional attention and an adaptive generation order, it models inter-item and intra-item dependencies effectively and surpasses existing ID-based and generative recommenders on three real-world datasets.
Abstract
Generative recommendation represents each item as a semantic ID, i.e., a sequence of discrete tokens, and generates the next item through autoregressive decoding. While effective, existing autoregressive models face two intrinsic limitations: (1) unidirectional constraints, where causal attention restricts each token to attend only to its predecessors, hindering global semantic modeling; and (2) error accumulation, where the fixed left-to-right generation order causes prediction errors in early tokens to propagate to the predictions of subsequent tokens. To address these issues, we propose LLaDA-Rec, a discrete diffusion framework that reformulates recommendation as parallel semantic ID generation. By combining bidirectional attention with an adaptive generation order, the approach models inter-item and intra-item dependencies more effectively and alleviates error accumulation. Specifically, our approach comprises three key designs: (1) a parallel tokenization scheme that produces semantic IDs for bidirectional modeling, addressing the mismatch between residual quantization and bidirectional architectures; (2) two masking mechanisms at the user-history and next-item levels to capture both inter-item sequential dependencies and intra-item semantic relationships; and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, resolving the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets show that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
1.2. Authors
- Teng Shi (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Chenglei Shen (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Weijie Yu (School of Information Technology and Management, University of International Business and Economics, Beijing, China)
- Shen Nie (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Chongxuan Li (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Xiao Zhang (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Ming He (AI Lab at Lenovo Research, Beijing, China)
- Yan Han (AI Lab at Lenovo Research, Beijing, China)
- Jun Xu (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
1.3. Journal/Conference
The paper is listed as "In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA". This indicates that it is intended for publication in an ACM conference, although the specific conference name is a placeholder in the provided text. ACM conferences are highly reputable venues in computer science, particularly in areas like information retrieval, data mining, and artificial intelligence.
1.4. Publication Year
2025 (Based on the Published at (UTC): 2025-11-09T07:12:15.000Z metadata).
1.5. Abstract
Generative recommendation systems represent items as semantic IDs (sequences of discrete tokens) and use autoregressive models to generate the next item. However, these models suffer from two main issues: unidirectional constraints (causal attention limits token interaction to predecessors, hindering global semantic modeling) and error accumulation (errors in early tokens propagate). To overcome these, the paper introduces LLaDA-Rec, a discrete diffusion framework that reframes recommendation as parallel semantic ID generation. By combining bidirectional attention with an adaptive generation order, LLaDA-Rec aims to model inter-item and intra-item dependencies more effectively and reduce error propagation. The framework includes three key designs: (1) a parallel tokenization scheme using Multi-Head VQ-VAE to make semantic IDs suitable for bidirectional architectures, (2) two masking mechanisms (at user-history and next-item levels) to capture both sequential dependencies and intra-item semantics, and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, which resolves the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets demonstrate that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
1.6. Original Source Link
https://arxiv.org/abs/2511.06254v1 (This is a preprint on arXiv, not an officially published version in a peer-reviewed journal/conference yet). PDF Link: https://arxiv.org/pdf/2511.06254v1.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses crucial limitations of existing generative recommendation models, which have gained prominence by applying large language models (LLMs) to item recommendation.
- Core Problem: Current generative recommendation approaches, which represent items as semantic IDs (sequences of discrete tokens) and use autoregressive (AR) decoding to predict the next item, face two intrinsic limitations:
  - Unidirectional Constraints: AR models typically employ causal attention (also known as unidirectional attention), meaning each token can only attend to (process information from) its preceding tokens. This restriction hinders the model's ability to capture global relationships among all tokens that collectively define an item, leading to less semantically coherent and expressive generated items.
  - Error Accumulation: During the inference (generation) phase, AR models generate tokens one by one, conditioning each new token on the previously sampled ones. Unlike the training phase, where teacher forcing (providing the ground-truth token at each step) prevents early errors, at inference any prediction error in an early token cannot be corrected and propagates to subsequent tokens, amplifying its negative effect throughout the generated semantic ID.
- Importance: Overcoming these limitations is crucial for enhancing the accuracy, coherence, and overall quality of recommendations generated by LLM-based systems, making them more effective and reliable for users.
- Innovative Idea: The paper proposes LLaDA-Rec, a novel framework that leverages discrete diffusion models to reformulate recommendation as parallel semantic ID generation. This approach addresses the unidirectional constraints through bidirectional attention and mitigates error accumulation via an adaptive generation order.
2.2. Main Contributions / Findings
The primary contributions of LLaDA-Rec are:
- Addressing Autoregressive Limitations: It identifies and tackles the unidirectional constraints and error accumulation issues prevalent in existing autoregressive generative recommendation models, which limit their performance.
- Novel Discrete Diffusion Framework: It proposes LLaDA-Rec, a generative recommendation model based on discrete diffusion. The framework introduces parallel semantic IDs and develops discrete diffusion training and inference methods tailored for recommendation.
- Key Design Elements: LLaDA-Rec incorporates three essential designs:
  - Parallel Tokenization Scheme: It introduces Multi-Head VQ-VAE to produce semantic IDs that are inherently suitable for bidirectional modeling, resolving the architectural mismatch between residual quantization (RQ) and bidirectional Transformers.
  - Dual Masking Mechanisms: It employs two distinct masking mechanisms during training: user-history level masking to capture inter-item sequential dependencies and next-item level masking to model intra-item semantic relationships, enabling the model to effectively understand and generate item semantic IDs.
  - Adapted Beam Search Strategy: It devises an adapted beam search strategy for adaptive-order discrete diffusion decoding, which overcomes the incompatibility of standard beam search with the dynamic, non-left-to-right generation process of diffusion models.
- State-of-the-Art Performance: Extensive experiments on three real-world datasets demonstrate that LLaDA-Rec consistently outperforms both traditional item-ID-based methods and state-of-the-art semantic-ID-based generative recommenders, establishing discrete diffusion as a promising new paradigm for generative recommendation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Generative Recommendation
Generative recommendation is a paradigm shift from traditional discriminative recommendation. Instead of predicting a score for a user-item pair or classifying items, it formulates the task as generating the characteristics of the next item a user might interact with. This often involves representing items as semantic IDs, which are sequences of discrete tokens (similar to words in a sentence), and then using a generative model to "write" the semantic ID of the target item.
3.1.2. Semantic IDs
Semantic IDs (SIDs) are a core concept in generative recommendation. Unlike traditional item IDs (which are just unique numbers with no inherent meaning, e.g., Item #123), SIDs represent an item as a sequence of discrete tokens. These tokens are learned to encode the item's semantic information, such as its attributes, categories, or descriptive text. For example, an item like "red apple" might be tokenized into [fruit, red, round]. This allows the generative model to compose new items by generating new sequences of tokens, enabling recommendation of novel or out-of-vocabulary items.
3.1.3. Autoregressive (AR) Models
Autoregressive models are a class of generative models that predict a sequence of data elements one step at a time, where each step's prediction is conditioned on all previously predicted elements. In language modeling, this means predicting the next word based on all preceding words.
- Left-to-Right Generation: The most common AR models generate sequences in a fixed left-to-right order.
- Causal Attention: This unidirectional constraint is enforced by causal attention mechanisms in Transformers, where each position can attend only to earlier positions. This prevents tokens from seeing future information, which is necessary for sequential generation.
- Teacher Forcing: During training, AR models often use teacher forcing: at each step, the ground-truth previous token is fed as input rather than the model's own (potentially erroneous) prediction. This stabilizes training and speeds up convergence. During inference, however, teacher forcing cannot be used, which leads to error accumulation.
3.1.4. Discrete Diffusion Models
Discrete diffusion models are a type of generative model that learn to reverse a noise process applied to discrete data (like tokens).
- Forward Noise Process: During training, the original discrete sequence is progressively corrupted with masking noise: tokens are replaced by a special [MASK] token with a progressively increasing masking ratio, until the sequence is fully masked.
- Reverse Denoising Process: The model is trained to predict the original (unmasked) tokens from the partially masked sequence. During inference, this learned reverse process generates a clean sequence from a fully masked one: the model iteratively predicts masked tokens, keeps high-confidence predictions, and re-masks low-confidence ones for later refinement.
- Bidirectional Transformer: Unlike AR models, discrete diffusion models typically use a bidirectional Transformer (encoder) as their core component. This allows the model to leverage context from both preceding and succeeding tokens when predicting a masked token, providing a more holistic understanding of the sequence.
- Adaptive Generation Order: During inference, discrete diffusion models do not follow a fixed (e.g., left-to-right) generation order. Instead, they predict all masked tokens in parallel at each step and keep the tokens with the highest prediction confidence; the remaining low-confidence tokens are re-masked and re-predicted in subsequent steps. This adaptive generation order lets the model settle "easier" tokens first, which helps mitigate error accumulation.
3.1.5. Transformer Architecture
The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that has revolutionized sequence modeling. It relies heavily on self-attention mechanisms.
- Attention Mechanism: The core idea is to let the model weigh the importance of different parts of the input sequence when processing a specific part. It computes a context vector as a weighted sum of value vectors, where the weights are derived from the similarity between a query vector and key vectors:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  - $d_k$ is the dimension of the key vectors, used for scaling so that large dot products do not push the softmax into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes the scores into probabilities.
- Self-Attention: Here Q, K, and V are all derived from the same input sequence, so each token can attend to every other token in that sequence.
- Causal (Unidirectional) Attention: In decoder blocks of AR Transformers, a mask is applied to the attention scores so that a token at position $i$ can only attend to positions $j \le i$. This prevents "looking into the future."
- Bidirectional Attention: In encoder blocks of Transformers (like BERT), there is no such masking. Each token can attend to all other tokens in the sequence (both preceding and succeeding), providing full contextual understanding. A minimal code contrast of the two masking regimes follows below.
3.1.6. Vector Quantization (VQ-VAE)
Vector Quantization Variational Autoencoder (VQ-VAE) is a generative model that learns to map continuous input vectors into discrete codebook entries.
- Encoder: Maps a continuous input (e.g., an item embedding) to a continuous latent vector.
- Codebook: A finite set of learnable discrete code vectors (also called embeddings or prototypes).
- Quantization: For each latent vector produced by the encoder, the closest code vector in the codebook is found (e.g., by Euclidean distance). The index of this closest code vector becomes the discrete semantic ID token.
- Decoder: Reconstructs the original input from the selected code vector(s).
- Loss Functions: VQ-VAE typically uses a reconstruction loss (to ensure the decoded output matches the original input) and a vector quantization loss (to pull the encoder outputs toward codebook entries and to update the codebook entries themselves).
3.1.7. Residual Quantization (RQ-VAE)
Residual Quantization (RQ) is a hierarchical approach to vector quantization. Instead of quantizing a vector directly, it quantizes the residual error from a previous quantization step.
- Hierarchical Nature: The input vector is first quantized by a codebook, producing a residual error. This error is then quantized by a second codebook, generating a second residual, and so on. Each step adds another discrete token.
- Dependency: This creates an intrinsic hierarchical dependency in which earlier tokens (from earlier quantization stages) capture coarse-grained information and later tokens refine it. The structure naturally aligns with left-to-right autoregressive generation, where early tokens are fixed before later ones are generated. A small sketch of this level-by-level quantization is given below.
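A minimal sketch of the residual quantization idea, assuming one codebook tensor per level; it illustrates why earlier tokens carry the coarse information while later tokens only refine what is left over.

```python
import torch

def residual_quantize(z, codebooks):
    """z: (d,) latent vector; codebooks: list of (K, d) tensors, one per quantization level."""
    tokens, recon = [], torch.zeros_like(z)
    for codebook in codebooks:                 # level 1 quantizes z, level 2 its residual, and so on
        residual = z - recon                   # what earlier levels have not yet explained
        idx = ((residual - codebook) ** 2).sum(dim=-1).argmin()
        tokens.append(int(idx))
        recon = recon + codebook[idx]          # coarse-to-fine reconstruction
    return tokens, recon

codebooks = [torch.randn(256, 64) for _ in range(4)]   # 4 levels, 256 codes each
tokens, recon = residual_quantize(torch.randn(64), codebooks)
```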
3.1.8. Beam Search
Beam search is a heuristic search algorithm used in sequence generation (e.g., machine translation, language modeling, generative recommendation) to find the most probable sequence.
- Mechanism: Instead of greedily choosing only the single most probable next token at each step (which can lead to suboptimal sequences), beam search maintains a fixed number $B$ (the beam size) of the most probable partial sequences (or "beams"). At each step, it extends every partial sequence with all possible next tokens, evaluates their probabilities, and then prunes the set back to the top $B$ most probable sequences.
- Purpose: It aims to find a sequence with higher overall probability than greedy search without exploring the entire exponentially large search space (as breadth-first search would).
- Fixed Order: Traditionally, beam search is designed for fixed left-to-right decoding, where the sequence grows by one token at each step and tokens are added sequentially. A minimal sketch of this fixed-order procedure follows below.
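A minimal sketch of standard left-to-right beam search over a generic `next_token_logprobs(prefix)` scoring function (an assumed interface, not the paper's); it shows the expand-then-prune loop that the adapted diffusion variant later has to rework.

```python
import math

def beam_search(next_token_logprobs, vocab_size, seq_len, beam_size):
    """next_token_logprobs(prefix) -> list of log-probabilities over the vocabulary."""
    beams = [([], 0.0)]                                   # (token prefix, cumulative log-prob)
    for _ in range(seq_len):                              # fixed left-to-right order
        candidates = [
            (prefix + [token], score + next_token_logprobs(prefix)[token])
            for prefix, score in beams
            for token in range(vocab_size)
        ]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams                                          # top-B complete sequences

# Toy usage with a uniform "model" over a 5-token vocabulary.
uniform = lambda prefix: [math.log(1 / 5)] * 5
print(beam_search(uniform, vocab_size=5, seq_len=3, beam_size=2))
```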
3.2. Previous Works
The paper contextualizes LLaDA-Rec by discussing two main streams of related work: Generative Recommendation and Discrete Diffusion Models.
3.2.1. Generative Recommendation
Inspired by Large Language Models (LLMs) like GPT ([1] Achiam et al., 2023; [21] Liu et al., 2024), generative recommendation ([4, 14, 15, 22, 29, 32, 36, 46, 48, 49]) represents items as semantic IDs and uses generative models to predict the next item. It generally involves two stages:
- Item Tokenization: Assigning semantic IDs (SIDs) to items.
  - Clustering-based: SEATER [33] and EAGER [38] cluster item embeddings.
  - Vector Quantization-based:
    - Residual Quantization (RQ) methods: TIGER [29], LETTER [36], and LC-Rec [48] (using RQ-VAE [45]), and OneRec [4, 49] (using RQ-KMeans). These methods inherently create a hierarchical dependency in which earlier tokens are more dominant.
    - Product Quantization [6] methods: RPG [14].
- Autoregressive Generation: Most existing methods (e.g., TIGER, LETTER, LC-Rec) use an autoregressive paradigm to generate SIDs sequentially. RPG [14] is an exception, employing multi-token prediction [7] to generate unordered semantic IDs in parallel in a single step, without iterative refinement.
- Enhanced Generative Recommendation: Some studies ([3, 35]) enhance generative recommendation through latent reasoning [9].
3.2.2. Discrete Diffusion Models
Discrete diffusion models are a newer class of generative models built on bidirectional Transformer backbones. They learn a denoising process to reconstruct data from masked inputs.
- LLaDA [26]: The first diffusion-based language model to achieve performance comparable to autoregressive models. It established the basic framework of a bidirectional Transformer trained with forward token-masking noise and reverse denoising reconstruction.
- LLaDA-V [43]: An extension that adapts the LLaDA framework to visual understanding.
- MMaDA [40]: Further generalizes the LLaDA framework to multimodal understanding and generation.
- LLaDA 1.5 [51]: An improved version that integrates DPO-based post-training for additional performance gains.
- Continuous Diffusion in Recommendation: The paper also briefly mentions continuous diffusion models, which operate in continuous latent spaces and have been applied to image generation [30, 41] and sequential recommendation [18, 19, 37, 42] (e.g., DreamRec [42], DiffuRec [19], DimeRec [18]). These typically generate latent representations that require a separate retrieval stage, unlike LLaDA-Rec, which directly generates discrete semantic IDs.
3.3. Technological Evolution
The field of recommender systems has evolved through several stages:
- Traditional ID-based Methods: Early methods (e.g., matrix factorization, collaborative filtering) represented items and users with discrete IDs and learned embeddings for them. Sequential recommenders (e.g., GRU4Rec, SASRec, BERT4Rec) extended this by modeling user interaction sequences over these IDs. These are discriminative models that predict scores or rankings.
- Generative Recommendation with Semantic IDs: Inspired by the success of LLMs, this paradigm shifted to generative models. Items are no longer plain IDs but are tokenized into semantic IDs (sequences of discrete tokens), and the task becomes generating the semantic ID of the next item. This allows more explicit semantic understanding and the generation of new, unseen items. RQ-VAE-based methods (e.g., TIGER, LETTER, LC-Rec) and product quantization methods (e.g., RPG) are prominent here, typically coupled with autoregressive Transformers.
- Diffusion Models for Recommendation (LLaDA-Rec's contribution): The latest evolution, exemplified by LLaDA-Rec, introduces discrete diffusion models to generative recommendation. It moves beyond autoregressive generation by leveraging bidirectional attention and adaptive generation orders, directly generating semantic IDs through a parallel, iterative denoising process and thereby addressing the inherent limitations of autoregressive approaches. Concurrently, continuous diffusion models have also been adapted for recommendation, usually generating item embeddings for retrieval.
3.4. Differentiation Analysis
LLaDA-Rec differentiates itself from existing methods primarily through its adoption of discrete diffusion and specialized designs for generative recommendation:
- Autoregressive Generative Recommendation (e.g., TIGER, LETTER, LC-Rec):
  - Core Difference: LLaDA-Rec uses bidirectional attention and an adaptive generation order via discrete diffusion, whereas AR models use unidirectional (causal) attention and a fixed left-to-right generation order.
  - Innovation: This allows LLaDA-Rec to capture global inter-item and intra-item dependencies more effectively and to mitigate error accumulation by re-masking and re-predicting low-confidence tokens. AR models are prone to error propagation due to their fixed, sequential nature.
  - Tokenization: AR models often use hierarchical quantization such as RQ-VAE, where earlier tokens are more important. LLaDA-Rec uses parallel tokenization (Multi-Head VQ-VAE), where all tokens are equally important, better suiting bidirectional attention.
- RPG [14] (Parallel Semantic ID Generation):
  - Core Difference: While RPG also generates semantic IDs in parallel (multi-token prediction [7]), it does so in a single step without iterative refinement. LLaDA-Rec uses an iterative denoising process with re-masking and re-prediction.
  - Innovation: LLaDA-Rec's iterative nature allows an adaptive generation order and dynamic refinement of predictions, which is missing in RPG's single-step parallel generation.
- Continuous Diffusion Models for Recommendation (e.g., DiffuRec, DreamRec):
  - Core Difference: Continuous diffusion models operate in continuous latent spaces and typically generate item embeddings, which then require a separate retrieval stage (e.g., similarity search) to find actual items. LLaDA-Rec is a discrete diffusion model that directly generates the discrete semantic IDs of items.
  - Innovation: By directly outputting SIDs, LLaDA-Rec unifies generation and retrieval into a single optimization process, simplifying the inference pipeline and avoiding the potential mismatch between generated embeddings and retrieved items.
- Traditional Item ID-based Methods (e.g., SASRec, BERT4Rec):
  - Core Difference: These models predict the next item ID directly or learn item embeddings for discriminative ranking, whereas LLaDA-Rec generates semantic IDs.
  - Innovation: The semantic ID approach allows a richer representation of items, potential generalization to new items (whose semantic IDs can be composed), and leverages the power of generative models.

In summary, the key distinguishing feature of LLaDA-Rec is its novel application of discrete diffusion to generative recommendation, specifically designed to overcome the unidirectional constraints and error accumulation of autoregressive methods, and to integrate generation and retrieval more tightly than continuous diffusion methods.
4. Methodology
The LLaDA-Rec framework addresses the limitations of autoregressive generative recommendation by formulating the task as parallel semantic ID generation using a discrete diffusion approach. This involves three main modules: Parallel Tokenization, Discrete Diffusion Training, and Discrete Diffusion Inference.
4.1. Parallel Tokenization via Multi-Head VQ-VAE
4.1.1. Motivation for Parallel Tokenization
Existing generative recommendation models often use hierarchical quantization methods like Residual Quantization (RQ-VAE) or RQ-KMeans. In these hierarchical schemes, tokens are generated sequentially, with earlier tokens (e.g., the first token) holding more influence as subsequent tokens are conditionally dependent on them. This aligns well with autoregressive models that generate tokens in a fixed left-to-right order.
However, LLaDA-Rec utilizes a bidirectional Transformer for its discrete diffusion model. In a bidirectional Transformer, all tokens interact mutually and are equally important in the representation and generation process, irrespective of their position. The hierarchical dependencies of RQ-VAE are mismatched with this bidirectional nature. To better align with the bidirectional Transformer and treat all semantic ID tokens on an equal footing, LLaDA-Rec proposes a Multi-Head VQ-VAE architecture for parallel tokenization, eliminating hierarchical dependencies.
4.1.2. Multi-Head VQ-VAE Architecture
The Multi-Head VQ-VAE works as follows:
- Item Semantic Representation: Each item $i$ is first represented by a continuous semantic vector $\mathbf{v}_i$, obtained by encoding the item's textual information (e.g., title, description) with a pre-trained embedding model such as BERT [5] or Sentence-T5 [24].
- Encoder Projection: The semantic vector is projected into a latent space through an Encoder (implemented as a multi-layer perceptron (MLP)):
$ \mathbf{z}_i = \operatorname{Encoder}(\mathbf{v}_i) $
where $\mathbf{z}_i$ is the latent representation.
- Sub-vector Partitioning: The latent vector $\mathbf{z}_i$ is partitioned into $M$ equal-sized sub-vectors, each corresponding to a separate quantization "head":
$ \mathbf{z}_i = [ \mathbf{z}_{i,1} ; \mathbf{z}_{i,2} ; \ldots ; \mathbf{z}_{i,M} ] $
where $\mathbf{z}_{i,m} \in \mathbb{R}^{d/M}$ for $m = 1, \ldots, M$.
- Independent Quantization: $M$ distinct codebooks are maintained, one per sub-vector. The $m$-th codebook is $C_m = \{ \mathbf{e}_{m,k} \}_{k=1}^{K}$, where $K$ is the codebook size (number of code vectors) and the $\mathbf{e}_{m,k}$ are learnable code embeddings. Each sub-vector is quantized independently by finding the closest code vector in its codebook $C_m$:
$ c_{i,m} = \arg\min_{k} \Vert \mathbf{z}_{i,m} - \mathbf{e}_{m,k} \Vert_2^2, \quad \mathbf{e}_{m,k} \in C_m $
The chosen index $c_{i,m}$ from the $m$-th codebook becomes the $m$-th token of the item's semantic ID.
- Semantic ID Formation: After quantizing all sub-vectors, the semantic ID of item $i$ is the sequence of these discrete tokens:
$ s_i = [ c_{i,1}, c_{i,2}, \ldots, c_{i,M} ] $
with corresponding code embeddings $\mathbf{e}_{c_{i,1}}, \ldots, \mathbf{e}_{c_{i,M}}$.
- Quantized Representation and Decoder: The selected code embeddings are concatenated to form the quantized representation $\hat{\mathbf{z}}_i$:
$ \hat{\mathbf{z}}_i = [ \mathbf{e}_{c_{i,1}} ; \mathbf{e}_{c_{i,2}} ; \ldots ; \mathbf{e}_{c_{i,M}} ] $
This concatenated vector is then passed through a Decoder (also an MLP) to reconstruct the original semantic vector:
$ \hat{\mathbf{v}}_i = \operatorname{Decoder}(\hat{\mathbf{z}}_i) $
A compact code sketch of this pipeline is given after the list.
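The following is a minimal PyTorch sketch of the parallel tokenization pipeline described above (encode, split into $M$ sub-vectors, quantize each against its own codebook, decode); module names, dimensions, and the plain-MLP encoder/decoder are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiHeadVQ(nn.Module):
    """Minimal sketch of parallel tokenization: M independent codebooks, one per sub-vector."""
    def __init__(self, input_dim=768, latent_dim=128, num_heads=4, codebook_size=256):
        super().__init__()
        self.num_heads = num_heads
        self.sub_dim = latent_dim // num_heads                      # d / M
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, input_dim))
        # One learnable codebook of shape (K, d/M) per head.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, self.sub_dim)) for _ in range(num_heads)])

    def forward(self, v):                                           # v: (batch, input_dim)
        z = self.encoder(v)
        subs = z.view(-1, self.num_heads, self.sub_dim)             # split into M sub-vectors
        tokens, quantized = [], []
        for m in range(self.num_heads):
            dists = torch.cdist(subs[:, m], self.codebooks[m])      # (batch, K) distances
            idx = dists.argmin(dim=-1)                              # m-th token of the semantic ID
            tokens.append(idx)
            quantized.append(self.codebooks[m][idx])
        z_hat = torch.cat(quantized, dim=-1)                        # concatenated code embeddings
        return torch.stack(tokens, dim=-1), z_hat, self.decoder(z_hat), z

sid, z_hat, v_hat, z = MultiHeadVQ()(torch.randn(8, 768))           # sid: (8, 4) semantic IDs
```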
4.1.3. VQ-VAE Loss Function
The overall training objective for the Multi-Head VQ-VAE combines a reconstruction loss and a vector quantization loss:
$
\begin{aligned}
\mathcal{L}_{\mathrm{Recon}} &= \Vert \mathbf{v}_i - \hat{\mathbf{v}}_i \Vert_2^2, \\
\mathcal{L}_{\mathrm{VQ}} &= \sum_{m=1}^{M} \Big( \Vert \mathrm{sg}[\mathbf{z}_{i,m}] - \mathbf{e}_{c_{i,m}} \Vert_2^2 + \alpha \Vert \mathbf{z}_{i,m} - \mathrm{sg}[\mathbf{e}_{c_{i,m}}] \Vert_2^2 \Big), \\
\mathcal{L}_{\mathrm{VQ\text{-}VAE}} &= \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{VQ}}.
\end{aligned}
$
Where:
- $\mathcal{L}_{\mathrm{Recon}}$: Reconstruction loss, the squared Euclidean distance between the original item semantic vector $\mathbf{v}_i$ and its reconstruction $\hat{\mathbf{v}}_i$. It ensures that the discrete semantic ID tokens can effectively represent the item's original semantics.
- $\mathcal{L}_{\mathrm{VQ}}$: Vector quantization loss, consisting of two parts for each sub-vector $m$:
  - $\Vert \mathrm{sg}[\mathbf{z}_{i,m}] - \mathbf{e}_{c_{i,m}} \Vert_2^2$: minimizes the distance between the encoder's output sub-vector (with stop-gradient applied) and its chosen code embedding; this term updates the codebook embeddings.
  - $\alpha \Vert \mathbf{z}_{i,m} - \mathrm{sg}[\mathbf{e}_{c_{i,m}}] \Vert_2^2$: the commitment term, which minimizes the distance between the encoder's output sub-vector and the chosen code embedding (with stop-gradient applied to the code embedding); it encourages the encoder to produce latent vectors close to the codebook entries and helps prevent codebook collapse (where only a few code vectors are ever used).
- $\alpha$: A hyperparameter balancing the contribution of the commitment term.
- $\mathrm{sg}[\cdot]$: The stop-gradient operation, which prevents gradients from flowing through the specified variable, effectively treating it as a constant during backpropagation along that path. The short code sketch below illustrates these loss terms.
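A minimal sketch of the three loss terms above, using `.detach()` as the stop-gradient operator and assuming the `z` / `z_hat` tensors produced by a multi-head quantizer like the earlier sketch; the straight-through trick needed to propagate reconstruction gradients through the quantizer is omitted for brevity.

```python
import torch.nn.functional as F

def vqvae_loss(v, v_hat, z, z_hat, alpha=0.25):
    """v / v_hat: original and reconstructed item vectors; z / z_hat: encoder output and
    selected code embeddings (same shape). .detach() plays the role of sg[.]."""
    recon = F.mse_loss(v_hat, v)              # L_Recon: keep semantics reconstructable
    codebook = F.mse_loss(z_hat, z.detach())  # pulls code embeddings toward encoder outputs
    commit = F.mse_loss(z, z_hat.detach())    # keeps encoder outputs committed to the codebook
    return recon + codebook + alpha * commit
```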
4.2. Discrete Diffusion Training
The discrete diffusion model in LLaDA-Rec is trained to capture both inter-item sequential dependencies (relationships between items in a user's history) and intra-item semantic relationships (relationships among tokens within a single item). This is achieved through two distinct masking mechanisms: User-History level masking and Next-Item level masking.
4.2.1. Problem Formulation and Probabilistic Comparison
The overarching goal is to predict the next item $s_n$ for a user given the interaction history. Each item is represented by its semantic ID $s_i = [c_{i,1}, \ldots, c_{i,M}]$, so the user history becomes a token sequence $S_{\mathcal{H}}$. The task is to maximize the conditional probability:
$
\theta^* = \arg\max_{\theta} \mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}})
$
where $\theta$ denotes the model parameters.
- Autoregressive Modeling (Eq. 3): Existing generative recommendation methods predominantly generate tokens sequentially from left to right. The probability of generating the entire semantic ID is a product of conditional probabilities:
$
\mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) = \prod_{m=1}^{M} \mathrm{P}_{\theta}(c_{n,m} \mid c_{n,<m}, S_{\mathcal{H}})
$
where $c_{n,<m}$ denotes all tokens of the current item preceding $c_{n,m}$. This formulation requires exactly $M$ generation steps.
- Discrete Diffusion Modeling (Eq. 4): In contrast, discrete diffusion generates tokens iteratively over $T$ steps, starting from a fully masked sequence of [MASK] tokens. At each step $t$, the Mask Predictor (a bidirectional Transformer encoder) predicts all masked positions in parallel, retains the highest-confidence predictions, and re-masks the rest for subsequent steps:
$
\mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) = \prod_{t=1}^{T} \prod_{m=1}^{M}
\begin{cases}
\mathrm{P}_{\theta}\big( c_{n,m} \mid s_n^t, S_{\mathcal{H}} \big), & \text{if } c_{n,m}^t = [\mathsf{MASK}], \\
1, & \text{otherwise}.
\end{cases}
$
Where:
  - $T$: The total number of generation steps.
  - $s_n^t$: The input sequence for the next item at step $t$, containing some already-generated tokens and some [MASK] tokens; $s_n^1$ is all [MASK] tokens.
  - $c_{n,m}^t$: The $m$-th token of $s_n^t$.
  - The Mask Predictor (a Transformer encoder with bidirectional attention) predicts the original token $c_{n,m}$ given the current partially masked sequence $s_n^t$ and the user history $S_{\mathcal{H}}$.
  - The product accumulates probabilities only for tokens that are [MASK] at step $t$; tokens that were already predicted are fixed (probability 1). This process offers parallel generation, an adaptive generation order, and explicit control over the number of generation steps.
4.2.2. Discrete Diffusion Process Overview
The discrete diffusion model operates in two stages:
- Forward Process: Tokens in an input sequence are progressively masked. A masking ratio $r \in [0, 1]$ determines the probability of each token being masked; at $r = 1$, all tokens are [MASK].
- Reverse Denoising Process: The model learns to reconstruct the original sequence from a partially masked one. During inference, it starts from a fully masked sequence and iteratively fills in tokens as $r$ decreases from 1 to 0.

Based on this, LLaDA-Rec designs two diffusion masking training strategies:
4.2.3. User-History Level Masking
This strategy applies the discrete diffusion masking process to the token sequence of the user history, $S_{\mathcal{H}}$. Its objective is to train the MASK predictor to capture global dependencies among all tokens within the user's interaction history.
- Mechanism: At each diffusion step (characterized by a masking ratio $r$), each token in $S_{\mathcal{H}}$ is independently masked with probability $r$ or remains visible with probability $1 - r$. The resulting partially masked history, denoted $S_{\mathcal{H}}^r$, is then fed to the MASK predictor.
- Training Loss: The model is trained to reconstruct the masked tokens in $S_{\mathcal{H}}^r$. The loss is defined as:
$
\mathcal{L}_{\mathrm{His\text{-}Mask}} = - \mathbb{E}_{r, S_{\mathcal{H}}, S_{\mathcal{H}}^r} \left[ \frac{1}{r} \sum_{i=1}^{M \times (n-1)} \mathbb{1}\left[ S_{\mathcal{H},i}^r = [\mathsf{MASK}] \right] \log \mathrm{P}_{\theta}\big( S_{\mathcal{H},i} \mid S_{\mathcal{H}}^r \big) \right]
$
Where:
  - $\mathbb{E}_{r, S_{\mathcal{H}}, S_{\mathcal{H}}^r}$: Expectation over the masking ratio $r$, the ground-truth user history $S_{\mathcal{H}}$, and its masked version $S_{\mathcal{H}}^r$.
  - $M \times (n-1)$: Total number of tokens in the user history (assuming $n-1$ items, each with $M$ tokens).
  - $\mathbb{1}[\cdot]$: An indicator function that is 1 if the $i$-th history token is masked, and 0 otherwise.
  - $\log \mathrm{P}_{\theta}(S_{\mathcal{H},i} \mid S_{\mathcal{H}}^r)$: The log-probability the model assigns to the ground-truth token given the partially masked history.
  - The $\frac{1}{r}$ term acts as a weighting factor, emphasizing steps with lower masking ratios (i.e., less noise) to help the model learn more precise reconstructions. The code sketch below illustrates this masking and weighting.
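A minimal sketch of this objective, assuming a `mask_predictor` that maps a token sequence to per-position logits and a reserved `MASK_ID`; the masking ratio is drawn per sample and the cross-entropy on masked positions is re-weighted by $1/r$ as in the loss above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed reserved [MASK] token id

def history_mask_loss(mask_predictor, history_tokens):
    """history_tokens: (batch, L) ground-truth semantic-ID tokens of the user history."""
    r = torch.rand(history_tokens.size(0), 1)                          # one masking ratio per sample
    is_masked = torch.rand_like(history_tokens, dtype=torch.float) < r
    corrupted = history_tokens.masked_fill(is_masked, MASK_ID)         # forward (noising) process
    logits = mask_predictor(corrupted)                                 # (batch, L, vocab), bidirectional
    nll = F.cross_entropy(logits.transpose(1, 2), history_tokens, reduction="none")
    return (nll * is_masked.float() / r).sum(dim=-1).mean()            # masked positions only, 1/r-weighted
```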
4.2.4. Next-Item Level Masking
This strategy focuses on the target item. The discrete diffusion masking process is applied to the semantic ID of the next item, $s_n$, while the user history $S_{\mathcal{H}}$ is kept fully visible (unmasked).
- Mechanism: At each diffusion step (masking ratio $r$), each of the $M$ tokens in the next item $s_n$ is independently masked with probability $r$ or remains visible with probability $1 - r$. The resulting partially masked sequence $s_n^r$ is concatenated with the fully visible historical tokens $S_{\mathcal{H}}$, and the combined sequence is fed into the MASK predictor.
- Training Objective: The model aims to reconstruct the masked tokens of the next item:
$
\mathcal{L}_{\mathrm{Item\text{-}Mask}} = - \mathbb{E}_{r, s_n, s_n^r} \left[ \frac{1}{r} \sum_{i=1}^{M} \mathbb{1}\left[ c_{n,i}^r = [\mathsf{MASK}] \right] \log \mathrm{P}_{\theta}\big( c_{n,i} \mid s_n^r, S_{\mathcal{H}} \big) \right]
$
Where:
  - $\mathbb{E}_{r, s_n, s_n^r}$: Expectation over the masking ratio $r$, the ground-truth next item $s_n$, and its masked version $s_n^r$.
  - $M$: The number of tokens in the next item's semantic ID.
  - $\mathbb{1}[\cdot]$: An indicator function that is 1 if the $i$-th token of the next item is masked, and 0 otherwise.
  - $\log \mathrm{P}_{\theta}(c_{n,i} \mid s_n^r, S_{\mathcal{H}})$: The log-probability the model assigns to the ground-truth token given the partially masked next item and the full user history.
- Theoretical Justification: The loss in Eq. (12) can be shown to be an upper bound on the negative log-likelihood of the conditional model distribution in Eq. (2) ([27, 31]):
$
- \mathbb{E}\left[ \log \mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) \right] \le \mathcal{L}_{\mathrm{Item\text{-}Mask}}.
$
Minimizing $\mathcal{L}_{\mathrm{Item\text{-}Mask}}$ therefore maximizes a lower bound on the desired conditional probability for generative recommendation.
4.2.5. Joint Training
To holistically train the MASK predictor to capture both inter-item and intra-item dependencies, the two loss functions are combined:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{Item\text{-}Mask}} + \lambda_{\mathrm{His\text{-}Mask}} \mathcal{L}_{\mathrm{His\text{-}Mask}} + \lambda_{\mathrm{Reg}} \Vert \theta \Vert_2^2
$
Where:
- $\mathcal{L}_{\mathrm{Item\text{-}Mask}}$: Guides the model to predict the next item conditioned on the history and on intra-item semantic relationships.
- $\mathcal{L}_{\mathrm{His\text{-}Mask}}$: Helps the model better understand the relationships among tokens across the entire user history.
- $\lambda_{\mathrm{His\text{-}Mask}}$: A weighting coefficient that balances the contributions of the two masking losses.
- $\lambda_{\mathrm{Reg}}$: Controls the strength of the L2 regularization term $\Vert \theta \Vert_2^2$, which helps prevent overfitting by penalizing large parameter values. A small combined sketch of this objective follows below.
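A minimal sketch of how the joint objective might be assembled, reusing `history_mask_loss` from the earlier sketch; `item_mask_loss` and the weight values are hypothetical placeholders, not the paper's settings.

```python
def training_step(mask_predictor, history_tokens, next_item_tokens,
                  lambda_his=0.5, lambda_reg=1e-4):
    """Joint objective: next-item masking loss + weighted history masking loss + L2 penalty.
    `item_mask_loss` is a hypothetical analogue of `history_mask_loss` that masks only the
    next item's M tokens while keeping the history fully visible."""
    l_item = item_mask_loss(mask_predictor, history_tokens, next_item_tokens)
    l_his = history_mask_loss(mask_predictor, history_tokens)
    l_reg = sum((p ** 2).sum() for p in mask_predictor.parameters())
    return l_item + lambda_his * l_his + lambda_reg * l_reg
```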
4.3. Discrete Diffusion Inference
After training, the goal is to generate the top-k recommended items. This presents challenges for discrete diffusion: traditional diffusion models often rely on probabilistic sampling for top-1 outputs, and conventional beam search is designed for fixed left-to-right decoding, not adaptive-order diffusion. LLaDA-Rec adapts beam search for this setting.
4.3.1. Initialization
The generation process is divided into $T$ discrete steps.
- $\mathcal{PG}_t$: The set of positions that have already been generated (filled) at step $t$; initially $\mathcal{PG}_1 = \emptyset$.
- $s_n^t$: The token sequence for the next item at step $t$. At $t = 1$ it is initialized as $s_n^1 = [[\mathsf{MASK}], \ldots, [\mathsf{MASK}]]$, containing $M$ [MASK] tokens.
- At each step $t$, the MASK predictor takes the current partially generated sequence $s_n^t$ (for the next item) and the user history $S_{\mathcal{H}}$ as input and outputs a probability distribution over the vocabulary for each masked position:
$
\mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}}) \in [0, 1], \quad m \in \{1, \ldots, M\} \setminus \mathcal{PG}_t, \quad w \in \{1, \ldots, |\mathcal{W}|\}
$
Where:
  - $m$: Indexes the positions that are currently masked (not yet generated).
  - $w$: Indexes candidate tokens in the vocabulary $\mathcal{W}$.
  - $\mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}})$: The probability of token $w$ at position $m$ at step $t$.
4.3.2. Generation Position Selection
Unlike autoregressive generation, discrete diffusion predicts all [MASK] positions in parallel. To generate tokens iteratively, LLaDA-Rec first determines which positions to fill at step $t$. Since $M$ tokens must be generated over $T$ steps, at each step the model selects the top $\frac{M}{T}$ unfilled positions with the highest maximum token probabilities (i.e., the highest confidence in their best prediction):
$
\begin{aligned}
\mathcal{M}_t &= \underset{m \in \{1, \ldots, M\} \setminus \mathcal{PG}_t}{\operatorname{top-}\frac{M}{T}} \left( \max_{w \in \{1, \ldots, |\mathcal{W}|\}} \mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}}) \right), \\
\mathcal{PG}_{t+1} &= \mathcal{PG}_t \cup \mathcal{M}_t.
\end{aligned}
$
Where:
- $\mathcal{M}_t$: The set of positions selected at step $t$ based on the top $\frac{M}{T}$ highest confidence scores.
- $\operatorname{top-}\frac{M}{T}(\cdot)$: Selects the $\frac{M}{T}$ elements with the highest scores.
- $\max_{w} \mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}})$: The maximum probability of any token at masked position $m$, i.e., the model's confidence for that position.
- $\mathcal{PG}_{t+1}$: The set of already-generated positions, updated by adding the newly selected positions in $\mathcal{M}_t$. This selection rule is illustrated in the sketch below.
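A minimal sketch of this position-selection rule: unfilled positions are ranked by the maximum probability the mask predictor assigns to any token at that position, and the $M/T$ most confident ones are chosen.

```python
import torch

def select_positions(probs, generated, num_to_fill):
    """probs: (M, vocab) per-position distributions from the mask predictor;
    generated: boolean (M,) marking positions already filled."""
    confidence = probs.max(dim=-1).values                 # best-token probability per position
    confidence = confidence.masked_fill(generated, -1.0)  # never re-select filled positions
    return confidence.topk(num_to_fill).indices           # the M/T most confident positions
```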
4.3.3. Beam Search for Discrete Diffusion
Once the positions in $\mathcal{M}_t$ are selected, beam search is applied sequentially to these positions.
- Beam Set: Let $\mathcal{B}_t$ be the set of current candidate sequences (beams) at step $t$.
- Expansion and Pruning: For each position $m_i$ in the set of selected positions $\mathcal{M}_t$:
  - The current beam set (initially $\mathcal{B}_{t,0} = \mathcal{B}_t$) is expanded: for each beam in $\mathcal{B}_{t,i-1}$, the top $B$ candidate tokens for position $m_i$ (based on $\mathrm{P}_{\theta}^{t,m_i}(w \mid s_n^t, S_{\mathcal{H}})$) are considered.
  - The expanded set of beams is then pruned back to the top $B$ beams according to their model scores (joint probabilities).
$
\begin{array}{rl}
& \mathcal{B}_{t,0} \gets \mathcal{B}_t, \quad
\mathcal{B}_{t,i} \gets \mathcal{B}_{t,i-1} \cup \operatorname{top-}B \big( \mathrm{P}_{\theta}^{t,m_i}(w \mid s_n^t, S_{\mathcal{H}}) \big), \\
& \mathcal{B}_{t,i} \gets \underset{b \in \mathcal{B}_{t,i}}{\operatorname{top-}B} \big( \mathrm{P}_{\theta}^{t}(b \mid s_n^t, S_{\mathcal{H}}) \big), \quad
\mathcal{B}_{t+1} \gets \mathcal{B}_{t, |\mathcal{M}_t|}.
\end{array}
$
Where:
  - $B$: The beam size.
  - $\operatorname{top-}B(\cdot)$: Selects the $B$ elements with the highest scores.
  - $\mathrm{P}_{\theta}^{t}(b \mid s_n^t, S_{\mathcal{H}})$: The joint probability of beam $b$ at step $t$, typically the product of the probabilities of all tokens generated in that beam so far.
- Sequence Update: After beam search across all selected positions in $\mathcal{M}_t$, the tokens at these positions in $s_n^t$ are replaced with the newly generated tokens from the winning beams to form the updated sequence $s_n^{t+1}$.
4.3.4. Iterative Generation
This process continues for $T$ steps:
- At each iteration $t$, the MASK predictor re-evaluates all currently masked positions in the context of the partially generated sequence $s_n^t$.
- The set of positions to generate, $\mathcal{M}_t$, is determined using Eq. (16).
- Beam search is performed on these selected positions to fill them.
- All unselected (still masked) positions are re-masked for the next iteration.

This loop repeats until all $M$ positions of the next item's semantic ID have been filled. Finally, the candidate semantic ID sequences (from the beam search at the last step) are ranked by their overall probabilities, and the top-k sequences are converted back to items and returned as recommendations. This iterative refinement allows LLaDA-Rec to dynamically adjust predictions and generate high-quality top-k outputs. A simplified end-to-end sketch of this decoding loop is given below.
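Putting the pieces together, the following is a simplified sketch of the adaptive-order decoding loop with a beam, assuming a `mask_predictor(tokens, history)` that returns per-position probability distributions and reusing the `select_positions` helper from the previous sketch; among other simplifications, positions are chosen from the top beam only.

```python
import math
import torch

MASK_ID = 0   # assumed reserved [MASK] token id

def diffusion_beam_decode(mask_predictor, history, M=4, T=4, beam_size=5):
    """Each beam is (tokens, cumulative log-prob); M/T positions are filled per step."""
    beams = [(torch.full((M,), MASK_ID), 0.0)]
    generated = torch.zeros(M, dtype=torch.bool)
    for _ in range(T):
        # Pick the positions to fill this step from the top beam's confidences (simplification).
        probs = mask_predictor(beams[0][0], history)                 # (M, vocab)
        positions = select_positions(probs, generated, M // T)
        for pos in positions.tolist():                               # beam search over chosen positions
            candidates = []
            for tokens, score in beams:
                pos_probs = mask_predictor(tokens, history)[pos]     # (vocab,)
                top_p, top_w = pos_probs.topk(beam_size)
                for p, w in zip(top_p.tolist(), top_w.tolist()):
                    new_tokens = tokens.clone()
                    new_tokens[pos] = w
                    candidates.append((new_tokens, score + math.log(p + 1e-12)))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
            generated[pos] = True
    return beams   # candidate semantic IDs, ranked by accumulated log-probability
```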
4.4. Discussion
4.4.1. Continuous vs. Discrete Diffusion in Recommendation
- Continuous Diffusion Models: Operate in continuous spaces, generating latent representations (e.g., item embeddings) through a denoising process. These models are often used for sequential recommendation (e.g., DiffuRec, DreamRec). Crucially, after generating a latent representation, a separate retrieval stage (e.g., similarity search over a large item-embedding database) is required to map it back to actual items. This separation can introduce an optimization mismatch between the generation and retrieval stages.
- Discrete Diffusion Models (LLaDA-Rec): Operate directly on discrete tokens. LLaDA-Rec directly generates the semantic IDs (sequences of discrete tokens) of items, so the model's output is immediately an item identifier. This unifies generation and retrieval in a single optimization process, eliminating the separate retrieval stage, simplifying the inference pipeline, and often improving recommendation performance.
4.4.2. Advantages over Autoregressive Models
LLaDA-Rec offers several advantages over autoregressive (AR) generative recommendation methods (e.g., TIGER, LETTER, LC-Rec):
- Unidirectional vs. Bidirectional Attention: AR models use causal attention, restricting tokens to attend only to predecessors. LLaDA-Rec uses bidirectional attention, allowing tokens to attend to both preceding and succeeding positions and capturing richer global contextual semantics within the semantic ID and the user history.
- Fixed vs. Adaptive Generation Order: AR models generate tokens in a fixed left-to-right order, making them susceptible to error accumulation (an early mistake propagates). LLaDA-Rec uses an adaptive, confidence-driven generation order, settling high-certainty (easier) tokens first and iteratively re-masking and re-predicting low-confidence ones, which reduces the impact of early errors and alleviates error accumulation.
- Single-Step vs. Iterative Refinement (vs. RPG): While RPG [14] also performs parallel generation via multi-token prediction [7], it is a single-step prediction. LLaDA-Rec's discrete diffusion is an iterative process, allowing dynamic refinement of predictions through re-masking and re-prediction over multiple steps.
- Controllable Generation Steps: Discrete diffusion naturally supports a controllable number of generation steps ($T$), allowing a trade-off between generation speed and quality.

The key differences are summarized in the following table (Table 1 from the original paper):
| Methods | Attention Mechanism | Generation Order | Controllable Generation Step |
|---|---|---|---|
| TIGER [29] | Causal | Left2Right | ✗ |
| LETTER [36] | Causal | Left2Right | ✗ |
| LC-Rec [48] | Causal | Left2Right | ✗ |
| RPG [14] | Causal | Parallel | ✗ |
| LLaDA-Rec | Bidirectional | Adaptive | ✓ |
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three datasets from the widely used Amazon 2023 Review dataset [13]. These datasets are categories of products:
- Industrial Scientific (Scientific): industrial and scientific products.
- Musical Instruments (Instrument): musical instruments.
- Video Games (Game): video games.
Characteristics and Preprocessing:
- Each user's historical reviews are treated as interaction records.
- Interactions are ordered chronologically, with the earliest review first.
- Leave-one-out protocol [16, 29]: The last item in each user's sequence is reserved for testing, the second-to-last item is used for validation, and the remaining preceding items form the training history.

The following are the results from Table 2 of the original paper:

| Dataset | #Users | #Items | #Interactions | Sparsity | Avg. len |
|---|---|---|---|---|---|
| Scientific | 50,985 | 25,848 | 412,947 | 99.969% | 8.10 |
| Instrument | 57,439 | 24,587 | 511,836 | 99.964% | 8.91 |
| Game | 94,762 | 25,612 | 814,586 | 99.966% | 8.60 |

- #Users: Number of unique users.
- #Items: Number of unique items.
- #Interactions: Total number of user-item interactions.
- Sparsity: The proportion of unobserved user-item pairs in the full user-item matrix. High sparsity (close to 100%) means most users have interacted with only a small fraction of the available items, which is typical for recommendation datasets.
- Avg. len: Average number of interactions per input sequence (user history).

These datasets are widely used benchmarks in sequential recommendation, making them effective for validating the proposed method and ensuring comparability with existing research. They cover diverse product categories, allowing assessment of the model's generalization capability.
5.2. Evaluation Metrics
The performance of the recommendation models is evaluated using two standard ranking metrics: Recall@k and Normalized Discounted Cumulative Gain (NDCG@k). Results are reported for $k \in \{1, 5, 10\}$. NDCG@1 is omitted because it is mathematically identical to Recall@1.
5.2.1. Recall@k
- Conceptual Definition: Recall@k measures the proportion of relevant items that are successfully retrieved within the top-k recommendations, i.e., whether the true next item appears among the top k items recommended by the model. A higher Recall@k indicates the model is better at surfacing relevant items.
- Mathematical Formula:
$
\mathrm{Recall@k} = \frac{\text{Number of relevant items in top-}k\text{ recommendations}}{\text{Total number of relevant items}}
$
For leave-one-out evaluation, where there is exactly one relevant item (the true next item), this simplifies to:
$ \mathrm{Recall@k} = \mathbb{1}(\text{true next item} \in \text{top-}k\text{ recommendations}) $
- Symbol Explanation:
  - $\mathbb{1}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
  - true next item: The actual item the user interacted with after their history, used for testing.
  - top-k recommendations: The list of the $k$ highest-ranked items returned by the model.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@k)
- Conceptual Definition: NDCG@k measures ranking quality while accounting for the position of relevant items in the recommendation list: relevant items placed higher earn more credit. It also supports graded relevance, although in leave-one-out evaluation relevance is binary. A higher NDCG@k indicates a better-ordered list in which relevant items are ranked prominently.
- Mathematical Formula: First, the Discounted Cumulative Gain (DCG@k) is calculated:
$ \mathrm{DCG@k} = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
Then NDCG@k normalizes DCG@k by the Ideal DCG (IDCG@k), the maximum possible DCG for a perfect ranking:
$ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $
- Symbol Explanation:
  - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommendation list. For leave-one-out, this is binary (1 if the item is the true next item, 0 otherwise).
  - $j$: The rank (position) of an item in the recommendation list, starting from 1.
  - $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of relevant items found at lower ranks.
  - $\mathrm{IDCG@k}$: The DCG of the ideal ranking, where all relevant items are placed at the top in decreasing order of relevance. For leave-one-out, IDCG@k equals 1 whenever the true next item appears in the top-k list. Both metrics are illustrated in the short sketch below.
5.3. Baselines
The paper compares LLaDA-Rec against a comprehensive set of baseline models, categorized into Item ID-based and Semantic ID-based approaches.
5.3.1. Item ID-based Baselines
These models typically use unique numerical item IDs and learn embeddings for them, often focusing on capturing sequential patterns in user interactions.
- GRU4Rec [11]: A pioneering sequential recommender that uses Gated Recurrent Units (GRUs) to model the sequence of user interactions and predict the next item.
- SASRec [16]: Self-Attentive Sequential Recommendation. It applies a unidirectional Transformer encoder to capture sequential dependencies, letting each item attend to all preceding items in the user's history.
- BERT4Rec [34]: Bidirectional Encoder Representations from Transformers for Recommendation. It uses a bidirectional Transformer trained with a cloze-style objective (masking random items in a sequence and predicting them), analogous to BERT for language modeling.
- FMLP-Rec [50]: Filter-Enhanced MLP is All You Need for Sequential Recommendation. It employs multi-layer perceptrons (MLPs) with learnable filters to model sequential patterns, offering an alternative to Transformers and RNNs.
- LRURec [44]: Linear Recurrent Unit for Sequential Recommendation. It integrates linear recurrent units (LRUs) to efficiently process long-range user interactions, aiming to overcome the limitations of traditional RNNs on very long sequences.
- DreamRec [42]: Reshapes sequential recommendation via guided diffusion. It uses SASRec outputs as initial embeddings and a diffusion denoising module to refine them, and is designed to avoid negative sampling by training only on positive samples. This is a continuous diffusion model.
- DiffuRec [19]: A Diffusion Model for Sequential Recommendation. It combines generative diffusion with sequential recommendation, using a Transformer approximator to reconstruct target item embeddings. This is also a continuous diffusion model.
5.3.2. Semantic ID-based Generative Baselines
These models represent items as semantic IDs (sequences of discrete tokens) and use generative models to predict the semantic ID of the next item.
- VQ-Rec [12]: Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. It applies product quantization to tokenize items into semantic IDs, which are then pooled to obtain item representations for sequential recommendation.
- TIGER [29]: Recommender Systems with Generative Retrieval. It utilizes a Residual Quantization Variational Autoencoder (RQ-VAE) to generate codebook identifiers, embedding semantic information into discrete code sequences, and then uses an autoregressive Transformer for generation.
- TIGER-SAS [29]: A variant of TIGER whose semantic IDs are derived from item embeddings trained by SASRec (a sequential model) rather than solely from text embeddings, so as to incorporate collaborative signals.
- LETTER [36]: Learnable Item Tokenization for Generative Recommendation. It develops a learnable tokenizer that incorporates hierarchical semantics, collaborative signals, and code-assignment diversity during item tokenization, followed by autoregressive generation.
- LC-Rec [48]: Generative Recommender with End-to-End Learnable Item Tokenization. It exploits identifiers with auxiliary alignment tasks to associate the generated codes with natural-language descriptions, enhancing the interpretability and quality of semantic IDs for autoregressive generation.
- RPG [14]: Retrieval-augmented Personalized Generative Recommendation. A lightweight semantic-ID-based model that generates long, unordered semantic IDs in parallel via multi-token prediction [7]. Unlike LLaDA-Rec, it performs this prediction in a single step without iterative refinement.

These baselines cover a wide range of state-of-the-art approaches, including traditional ID-based methods, recent continuous diffusion models for recommendation, and contemporary semantic-ID-based generative recommenders, ensuring a robust comparison.
5.4. Implementation Details
5.4.1. Parallel Tokenization (Multi-Head VQ-VAE)
- Item Embedding: Sentence-T5 [24] is used to encode the title and other textual information of each item into an initial semantic embedding.
- Codebook Configuration:
  - Number of codebooks ($M$): 4
  - Number of code vectors per codebook ($K$): 256
  - Dimension of each code vector ($d/M$): 32
  - Total latent dimension ($d$): $4 \times 32 = 128$
- Hyperparameter: The weight $\alpha$ in the VQ-VAE loss (Eq. 10) is set to 0.25.
- Training:
  - Optimizer: AdamW [23]
  - Learning Rate:
  - Batch Size: 2,048
  - Epochs: 10,000
5.4.2. Discrete Diffusion Model (MASK Predictor)
- Architecture: A bidirectional Transformer encoder.
- Transformer Configuration:
  - Token embedding dimension: 256
  - Attention heads per layer: 8
  - Number of layers: 4 for the Scientific and Instrument datasets; 6 for the Game dataset.
- Initialization: Model parameters are randomly initialized.
- Training:
  - Loss function: The designed joint loss (Eq. 14).
  - Weighting coefficient $\lambda_{\mathrm{His\text{-}Mask}}$: Tuned over a range of candidate values.
  - Optimizer: AdamW [23]
  - Epochs: 150, with early stopping.
  - Learning Rate: Tuned over a range of candidate values.
  - Weight Decay: Tuned over a range of candidate values.
  - Batch Size: 1,024.
6. Results & Analysis
6.1. Core Results Analysis
The following are the results from Table 3 of the original paper:
(GRU4Rec through DiffuRec are item ID-based baselines; VQ-Rec through LLaDA-Rec are semantic ID-based.)

| Dataset | Metric | GRU4Rec | SASRec | BERT4Rec | FMLP-Rec | LRURec | DreamRec | DiffuRec | VQ-Rec | TIGER | TIGER-SAS | LETTER | LC-Rec | RPG | LLaDA-Rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scientific | Recall@1 | 0.0071 | 0.0063 | 0.0045 | 0.0046 | 0.0049 | 0.0052 | 0.0050 | 0.0076 | 0.0084 | 0.0067 | 0.0082 | 0.0091 | 0.0087 | 0.0098 |
| Scientific | Recall@5 | 0.0184 | 0.0240 | 0.0157 | 0.0181 | 0.0169 | 0.0184 | 0.0190 | 0.0248 | 0.0282 | 0.0221 | 0.0273 | 0.0280 | 0.0257 | 0.0310 |
| Scientific | Recall@10 | 0.0272 | 0.0379 | 0.0264 | 0.0300 | 0.0267 | 0.0299 | 0.0310 | 0.0385 | 0.0446 | 0.0356 | 0.0423 | 0.0434 | 0.0395 | 0.0474 |
| Scientific | NDCG@5 | 0.0128 | 0.0152 | 0.0100 | 0.0113 | 0.0110 | 0.0118 | 0.0119 | 0.0162 | 0.0183 | 0.0144 | 0.0179 | 0.0186 | 0.0174 | 0.0203 |
| Scientific | NDCG@10 | 0.0156 | 0.0197 | 0.0134 | 0.0151 | 0.0141 | 0.0155 | 0.0158 | 0.0206 | 0.0236 | 0.0187 | 0.0227 | 0.0235 | 0.0218 | 0.0256 |
| Instrument | Recall@1 | 0.0094 | 0.0089 | 0.0065 | 0.0086 | 0.0071 | 0.0069 | 0.0077 | 0.0099 | 0.0105 | 0.0102 | 0.0114 | 0.0119 | 0.0118 | 0.0128 |
| Instrument | Recall@5 | 0.0297 | 0.0331 | 0.0255 | 0.0299 | 0.0272 | 0.0245 | 0.0283 | 0.0345 | 0.0359 | 0.0342 | 0.0362 | 0.0379 | 0.0362 | 0.0406 |
| Instrument | Recall@10 | 0.0453 | 0.0525 | 0.0412 | 0.0496 | 0.0431 | 0.0423 | 0.0465 | 0.0532 | 0.0566 | 0.0521 | 0.0562 | 0.0587 | 0.0545 | 0.0623 |
| Instrument | NDCG@5 | 0.0196 | 0.0211 | 0.0160 | 0.0193 | 0.0172 | 0.0157 | 0.0179 | 0.0222 | 0.0233 | 0.0223 | 0.0239 | 0.0251 | 0.0241 | 0.0268 |
| Instrument | NDCG@10 | 0.0246 | 0.0273 | 0.0211 | 0.0257 | 0.0223 | 0.0214 | 0.0237 | 0.0282 | 0.0300 | 0.0280 | 0.0303 | 0.0318 | 0.0300 | 0.0337 |
| Game | Recall@1 | 0.0149 | 0.0128 | 0.0082 | 0.0099 | 0.0134 | 0.0125 | 0.0111 | 0.0150 | 0.0166 | 0.0170 | 0.0169 | 0.0165 | 0.0209 | 0.0203 |
| Game | Recall@5 | 0.0461 | 0.0516 | 0.0315 | 0.0395 | 0.0480 | 0.0381 | 0.0425 | 0.0497 | 0.0529 | 0.0548 | 0.0552 | 0.0567 | 0.0579 | 0.0623 |
| Game | Recall@10 | 0.0712 | 0.0823 | 0.0530 | 0.0649 | 0.0753 | 0.0611 | 0.0709 | 0.0769 | 0.0823 | 0.0847 | 0.0863 | 0.0891 | 0.0853 | 0.0942 |
| Game | NDCG@5 | 0.0307 | 0.0323 | 0.0199 | 0.0246 | 0.0308 | 0.0253 | 0.0268 | 0.0325 | 0.0348 | 0.0360 | 0.0362 | 0.0366 | 0.0397 | 0.0415 |
| Game | NDCG@10 | 0.0387 | 0.0421 | 0.0267 | 0.0328 | 0.0396 | 0.0326 | 0.0359 | 0.0412 | 0.0442 | 0.0457 | 0.0462 | 0.0471 | 0.0485 | 0.0517 |
6.1.1. Overall Performance of LLaDA-Rec
The experimental results consistently demonstrate that LLaDA-Rec achieves state-of-the-art (SOTA) performance across all three datasets (Scientific, Instrument, Game) and all evaluated metrics (Recall@1, @5, @10, NDCG@5, @10).
- Superiority over All Baselines: LLaDA-Rec consistently outperforms both traditional item-ID-based approaches and existing generative semantic-ID-based approaches, confirming the effectiveness of the proposed discrete diffusion training and inference mechanisms together with the Multi-Head VQ-VAE parallel tokenization.
  - For example, on the Scientific dataset, LLaDA-Rec achieves a Recall@5 of 0.0310, surpassing the best semantic-ID-based baseline LC-Rec (0.0280) and all item-ID-based baselines. Similar trends hold for the other metrics and datasets.
- Generative vs. Traditional ID-based Methods: Generative recommendation methods based on semantic IDs (e.g., VQ-Rec, TIGER, LETTER, LC-Rec, RPG, LLaDA-Rec) generally outperform traditional ID-based methods (e.g., GRU4Rec, SASRec, BERT4Rec, FMLP-Rec, LRURec). This reinforces the advantage of semantic IDs for capturing richer semantic correlations between items and the benefits of generative approaches in general.
- Parallel Semantic IDs: Both RPG and LLaDA-Rec, which use parallel semantic IDs, achieve promising results compared to hierarchical RQ-VAE-based methods. The superior performance of LLaDA-Rec further highlights the benefits of its discrete diffusion framework over RPG's single-step parallel generation.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 4 of the original paper:
| Model | Scientific R@5 | Scientific N@5 | Instrument R@5 | Instrument N@5 | Game R@5 | Game N@5 |
|---|---|---|---|---|---|---|
| LLaDA-Rec | 0.0310 | 0.0203 | 0.0406 | 0.0268 | 0.0623 | 0.0415 |
| **Tokenizer** | | | | | | |
| RQ-VAE | 0.0293 | 0.0191 | 0.0367 | 0.0244 | 0.0604 | 0.0399 |
| RQ-Kmeans | 0.0250 | 0.0165 | 0.0344 | 0.0224 | 0.0552 | 0.0370 |
| OPQ | 0.0237 | 0.0155 | 0.0340 | 0.0229 | 0.0552 | 0.0362 |
| **Training** | | | | | | |
| w/o LHis-Mask | 0.0255 | 0.0169 | 0.0321 | 0.0209 | 0.0544 | 0.0356 |
| w/o LItem-Mask | 0.0264 | 0.0172 | 0.0355 | 0.0231 | 0.0571 | 0.0376 |
| **Inference** | | | | | | |
| w/o Beam Search | 0.0077 | 0.0077 | 0.0091 | 0.0091 | 0.0162 | 0.0162 |
6.2.1. Tokenizer Ablation
- Comparison: LLaDA-Rec with its Multi-Head VQ-VAE is compared against other common semantic ID generation methods: RQ-VAE [29, 48], RQ-Kmeans [4], and OPQ [12, 14].
- Results: Multi-Head VQ-VAE consistently outperforms all other tokenization methods. RQ-VAE performs worse than the Multi-Head VQ-VAE variant but still better than RQ-Kmeans and OPQ. Clustering-based approaches (RQ-Kmeans, OPQ) show the lowest performance.
- Analysis:
  - Mismatch with RQ: The inferior performance of semantic IDs derived from residual quantization (RQ-VAE, RQ-Kmeans) supports the hypothesis that RQ methods are not well-aligned with bidirectional Transformers. RQ imposes a hierarchy in which earlier tokens are more dominant, which conflicts with the uniformly distributed token importance in a bidirectional architecture like LLaDA-Rec (see the sketch after this list).
  - Robustness of the LLaDA-Rec framework: Even when Multi-Head VQ-VAE is replaced with RQ-VAE, LLaDA-Rec still often surpasses the baselines (e.g., compare LLaDA-Rec with RQ-VAE to LC-Rec or TIGER in Table 3), suggesting the overall robustness and architectural advantages of the discrete diffusion generative framework itself.
  - VAE-based quantization superiority: The stronger results of RQ-VAE and Multi-Head VQ-VAE over clustering-based approaches (RQ-Kmeans, OPQ) indicate that VAE-based quantization generally offers greater representational capacity thanks to its learned encoding and decoding.
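To make the mismatch concrete, the following is a minimal sketch (not the authors' code) contrasting residual quantization, which assigns tokens sequentially over a shared embedding space so that earlier tokens carry the dominant information, with multi-head quantization in the spirit of Multi-Head VQ-VAE, which assigns tokens in parallel over disjoint slices of the embedding. The codebooks here are random toy tensors and all sizes are illustrative.

```python
import torch

def residual_quantize(x, codebooks):
    # Hierarchical: each codebook encodes the residual left by the previous
    # ones, so earlier tokens capture the coarse, dominant structure.
    ids, residual = [], x.clone()
    for cb in codebooks:                                   # cb: (K, d)
        idx = torch.cdist(residual.unsqueeze(0), cb).argmin().item()
        ids.append(idx)
        residual -= cb[idx]
    return ids

def multi_head_quantize(x, codebooks):
    # Parallel: the embedding is split into equal slices and each slice is
    # quantized independently, so no token position dominates the others.
    return [torch.cdist(c.unsqueeze(0), cb).argmin().item()
            for c, cb in zip(x.chunk(len(codebooks)), codebooks)]

d, H, K = 256, 4, 8                                        # toy sizes
x = torch.randn(d)                                         # item embedding
rq_books = [torch.randn(K, d) for _ in range(H)]           # full-dimension codebooks
mh_books = [torch.randn(K, d // H) for _ in range(H)]      # per-slice codebooks
print(residual_quantize(x, rq_books), multi_head_quantize(x, mh_books))
```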
6.2.2. Training Ablation
- Impact of User-History level masking (w/o LHis-Mask): When the User-History level masking loss (Eq. 11) is removed, Recall@5 and NDCG@5 drop significantly across all datasets.
  - Analysis: This confirms that the loss is crucial for enabling the MASK predictor to capture inter-item sequential dependencies and global dependencies among all tokens within the user's interaction history. Without it, the model's understanding of the historical context is diminished.
- Impact of Next-Item level masking (w/o LItem-Mask): Similarly, removing the Next-Item level masking loss (Eq. 12) also leads to a notable performance degradation.
  - Analysis: This loss is essential for teaching the model intra-item semantics (relationships among tokens within the same item) and for conditioning the generation of the next item on the given history. Its absence weakens the model's ability to compose coherent and relevant semantic IDs for recommended items.
- Conclusion: Both masking mechanisms contribute significantly to the overall effectiveness of LLaDA-Rec, highlighting the importance of capturing both inter-item and intra-item dependencies. A rough sketch of how the two objectives could be combined is given below.
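As an illustration only (not the authors' implementation), the sketch below assumes a hypothetical bidirectional `predictor` mapping a corrupted token sequence of shape (batch, length) to per-position logits; `MASK_ID`, the mask ratio, and the single weighting coefficient `lam` are assumptions that mirror the joint loss of Eq. 14 only loosely.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the special [MASK] token

def masked_ce(predictor, tokens, maskable, mask_ratio):
    # Randomly mask a subset of the maskable positions, predict them with the
    # bidirectional MASK predictor, and score cross-entropy on those positions.
    mask = maskable & (torch.rand(tokens.shape) < mask_ratio)
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = predictor(corrupted)                              # (B, seq_len, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])

def joint_loss(predictor, history, next_item, lam, mask_ratio=0.5):
    # history: (B, hist_len) token ids; next_item: (B, id_len) token ids.
    seq = torch.cat([history, next_item], dim=1)
    hist_maskable = torch.ones_like(seq, dtype=torch.bool)     # user-history level:
    item_maskable = torch.zeros_like(seq, dtype=torch.bool)    #   any token may be masked
    item_maskable[:, history.size(1):] = True                  # next-item level: only the
    l_his = masked_ce(predictor, seq, hist_maskable, mask_ratio)   # next item's tokens
    l_item = masked_ce(predictor, seq, item_maskable, mask_ratio)
    return l_item + lam * l_his       # lam plays the role of the Eq. 14 weighting coefficient
```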
6.2.3. Inference Ablation
- Impact of Beam Search (w/o Beam Search): Removing the adapted beam search strategy (i.e., using a greedy strategy that returns only the top-1 result, as standard diffusion language models typically sample) causes a substantial drop in performance. The table shows Recall@5 and NDCG@5 values that are extremely low and identical (e.g., 0.0077 for Scientific), because only a single top-1 item is available for these top-k metrics, which inherently caps their values.
- Analysis: This confirms the critical importance of the adapted beam search strategy for generative recommendation. Recommender systems must produce a ranked list of top-k items, not just a single top-1 prediction, and the ability to explore multiple candidate sequences and keep the best ones is essential for high Recall and NDCG at various cutoffs. Adapting beam search to the adaptive generation order of discrete diffusion is therefore a key enabling component of LLaDA-Rec's success; a sketch of the idea follows.
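The paper's exact decoding procedure is not reproduced in this analysis; the sketch below shows one way beam search could be married to adaptive-order unmasking, assuming a hypothetical `predictor(history, tokens)` that returns per-position logits over the semantic-ID vocabulary. The joint scoring of the positions filled within one step is simplified relative to a full implementation.

```python
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token

def diffusion_beam_search(predictor, history, id_len, beam_size, steps):
    # Each beam is a partially filled semantic ID. At every step a beam fills
    # its most confident masked positions; the top `beam_size` extensions by
    # cumulative log-probability survive. Beams are returned ranked best-first,
    # which yields the ranked list that Recall@k / NDCG@k require.
    beams = [(0.0, torch.full((id_len,), MASK_ID))]
    per_step = -(-id_len // steps)                           # ceil: positions per step
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            masked = (tokens == MASK_ID).nonzero().flatten()
            if masked.numel() == 0:                          # beam already complete
                candidates.append((score, tokens))
                continue
            logprobs = predictor(history, tokens).log_softmax(-1)   # (id_len, vocab)
            conf = logprobs[masked].max(-1).values           # adaptive order: fill the
            fill = masked[conf.argsort(descending=True)][:per_step] # most confident slots
            top = logprobs[fill].topk(beam_size, dim=-1)
            for b in range(beam_size):                       # one joint extension per rank
                new = tokens.clone()
                new[fill] = top.indices[:, b]
                candidates.append((score + top.values[:, b].sum().item(), new))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams
```

In this sketch, greedy decoding corresponds to `beam_size = 1`, which returns a single candidate and explains why Recall@5 and NDCG@5 collapse to the same value in the "w/o Beam Search" row of Table 4.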
6.2.4. Impact of the Attention Mechanism
The following figure (Figure 3 from the original paper) illustrates the attention masks and performance for different attention mechanisms:
Figure 3 (from the original paper): Comparison of attention mechanisms. Panel (a) shows the attention masks for the Causal, Inter-Item Causal, Intra-Item Causal, and Bidirectional mechanisms; panels (b) and (c) report results on the Instrument and Game datasets using NDCG@5 and Recall@5.
- Attention masks (Figure 3a):
  - Causal attention: Each token can attend only to itself and preceding tokens, as is typical for autoregressive models.
  - Inter-item causal attention: Within each item's semantic ID, tokens attend bidirectionally; across items in the history, attention is causal (only to previous items).
  - Intra-item causal attention: Within each item's semantic ID, attention is causal; across items in the history, attention is bidirectional.
  - Bidirectional attention: Each token can attend to all other tokens in the entire sequence (both within its own item's semantic ID and across the user history), providing full context. This is what LLaDA-Rec uses. (A sketch constructing these four masks follows the list below.)
- Performance (Figure 3b, 3c):
  - Bidirectional attention consistently yields the best NDCG@5 and Recall@5 on both the Instrument and Game datasets, owing to its ability to capture contextual dependencies from both directions.
  - Causal attention performs the worst; its unidirectional constraint severely limits its ability to exploit contextual information, producing the lowest Recall and NDCG values.
  - Inter-item causal and intra-item causal attention achieve competitive performance, typically falling between causal and bidirectional. This shows that incorporating bidirectional attention, whether within items or across items in the history, is crucial for effective contextual modeling; the best results are obtained when it is applied universally.
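The four patterns can be expressed as boolean attention masks. The sketch below is an illustration (not the paper's code) that builds them for a toy sequence grouped into items of equal length.

```python
import torch

def build_attention_mask(num_items, tokens_per_item, kind):
    # Returns a boolean matrix whose entry (q, k) is True if query position q
    # may attend to key position k; positions are grouped into consecutive items.
    n = num_items * tokens_per_item
    pos = torch.arange(n)
    item = pos // tokens_per_item                          # item index of each token
    token_causal = pos.unsqueeze(1) >= pos.unsqueeze(0)    # q is at or after k
    item_causal = item.unsqueeze(1) >= item.unsqueeze(0)   # q's item is at or after k's
    same_item = item.unsqueeze(1) == item.unsqueeze(0)
    if kind == "causal":             # left-to-right at the token level
        return token_causal
    if kind == "inter_item_causal":  # bidirectional inside an item, causal across items
        return item_causal
    if kind == "intra_item_causal":  # causal inside an item, bidirectional across items
        return (token_causal & same_item) | ~same_item
    return torch.ones(n, n, dtype=torch.bool)              # fully bidirectional

# Example: two items of two tokens each, inter-item causal pattern.
print(build_attention_mask(2, 2, "inter_item_causal").int())
```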
6.2.5. Impact of Generation Order
The following figure (Figure 4 from the original paper) illustrates the performance under different generation orders:
Figure 4 (from the original paper): Performance under different generation orders. Panel (a) shows NDCG@5 and Recall@5 on Instrument, and panel (b) shows the same metrics on Game; differently colored bars denote the fixed left-to-right, fixed right-to-left, and adaptive generation orders.
- Comparison: LLaDA-Rec's adaptive generation order is compared against fixed left-to-right (left2right) and fixed right-to-left (right2left) orders.
- Results: The adaptive approach consistently delivers the highest NDCG@5 and Recall@5 on both the Instrument and Game datasets.
- Analysis: The left2right order, which is standard in autoregressive models, occasionally produces the poorest results, underscoring the limitations of a rigidly fixed generation order when errors can propagate. By dynamically choosing the generation order, prioritizing easier tokens (those with higher model confidence) and iteratively refining its predictions, LLaDA-Rec mitigates error accumulation and produces more accurate item semantic IDs. A greedy decoding sketch contrasting these orders is shown below.
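The sketch below is illustrative only and again assumes a hypothetical `predictor(history, tokens)` returning per-position logits; it contrasts the three orders in a greedy decoding loop, and its `steps` argument is the same quantity analyzed in the next subsection.

```python
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token

def decode(predictor, history, id_len, steps, order="adaptive"):
    # At each of `steps` rounds an equal share of the still-masked positions is
    # filled, chosen left-to-right, right-to-left, or adaptively by confidence.
    tokens = torch.full((id_len,), MASK_ID)
    per_step = -(-id_len // steps)                           # ceil division
    for _ in range(steps):
        masked = (tokens == MASK_ID).nonzero().flatten()
        if masked.numel() == 0:
            break
        logprobs = predictor(history, tokens).log_softmax(-1)   # (id_len, vocab)
        if order == "left2right":
            fill = masked[:per_step]
        elif order == "right2left":
            fill = masked[-per_step:]
        else:                                                # adaptive: most confident first
            conf = logprobs[masked].max(-1).values
            fill = masked[conf.argsort(descending=True)][:per_step]
        tokens[fill] = logprobs[fill].argmax(-1)             # commit, then re-predict the rest
    return tokens
```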
6.2.6. Impact of Generation Steps
The following figure (Figure 5 from the original paper) illustrates the performance under different generation steps:
Figure 5 (from the original paper): Recommendation performance under different numbers of generation steps, with Instrument on the left and Game on the right; each panel shows NDCG@5 (blue bars) and Recall@5 (red line) across the five step settings.
- Comparison: Performance (NDCG@5, Recall@5) is analyzed as the number of generation steps varies from 1 to 5. Recall that a semantic ID contains a fixed number of tokens and that an equal share of them is generated at each step, so more steps mean fewer tokens generated per step and more rounds of iterative refinement.
- Results: Increasing the number of generation steps generally improves performance on both the Instrument and Game datasets. A single step yields the lowest performance, and results gradually improve as the number of steps increases to 5.
- Analysis: This indicates that the iterative refinement process of discrete diffusion is effective. More steps allow the model to re-evaluate and re-predict masked tokens multiple times, leveraging updated context from already generated (high-confidence) tokens, thereby improving accuracy. However, fewer steps substantially improve generation efficiency, so there is a trade-off between efficiency (fewer steps) and performance (more steps). The authors note that achieving a better balance with fewer steps remains an open question, pointing to recent studies on diffusion language models that explore this (e.g., [2, 10]).
6.2.7. Impact of Hyper-parameters
The following figure (Figure 6 from the original paper) illustrates performance under different values of the weighting coefficient in Eq. (14):
Figure 6 (from the original paper): NDCG@5 and Recall@5 under different values of the weighting coefficient, shown as bars and lines; the left panel reports the Instrument dataset and the right panel the Game dataset.
- Comparison: The impact of the weighting coefficient, which balances the User-History level masking loss within the total training loss, is investigated over values from 0 to 5.
- Results:
  - When the coefficient is 0 (i.e., User-History level masking is not applied), performance is relatively low.
  - Performance generally improves as the coefficient increases from 0 to about 2 or 3.
  - However, if the coefficient becomes too large (e.g., 4 or 5), performance starts to decline or plateau.
- Analysis:
  - A moderate value is beneficial because it lets the model effectively learn global dependencies among all tokens within the user history, building a richer contextual understanding.
  - If the coefficient is set too high, the model may over-emphasize learning patterns within the history itself (the goal of the User-History level masking loss) at the expense of its main task of predicting the next item conditioned on that history (driven by the Next-Item level masking loss), which can hinder its ability to generate relevant recommendations. Careful tuning of this hyperparameter is therefore needed to find an optimal balance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LLaDA-Rec, a novel generative recommendation framework that leverages discrete diffusion models to address key limitations of existing autoregressive approaches: unidirectional constraints and error accumulation. By reformulating recommendation as parallel semantic ID generation, LLaDA-Rec incorporates bidirectional attention and an adaptive generation order to enhance the modeling of inter-item sequential dependencies and intra-item semantic relationships, while mitigating the propagation of prediction errors.
The framework's core innovations include:
- Parallel Tokenization: A Multi-Head VQ-VAE scheme that generates semantic IDs suitable for bidirectional modeling, resolving the mismatch with hierarchical quantization methods.
- Dual Masking Mechanisms: Distinct User-History level and Next-Item level masking strategies that train the discrete diffusion model effectively for recommendation tasks.
- Adapted Beam Search: A tailored beam search strategy that enables top-k recommendation generation with adaptive-order discrete diffusion decoding.

Extensive experiments on three real-world datasets consistently show that LLaDA-Rec achieves state-of-the-art performance, outperforming both traditional item-ID-based recommenders and existing semantic-ID-based generative recommendation models. This work establishes discrete diffusion as a powerful new paradigm for generative recommendation.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to "Limitations and Future Work." However, some implicit limitations and future directions can be inferred from the experimental analysis:
- Efficiency vs. performance trade-off in generation steps: As shown in the "Impact of Generation Steps" analysis (Figure 5), increasing the number of generation steps improves performance but inherently reduces efficiency. The paper states that "How to achieve a better trade-off between efficiency and performance with fewer steps remains an open question," suggesting that optimizing the multi-step generation process for faster inference without significant performance degradation is a key direction for future research. This could involve techniques such as knowledge distillation or more advanced sampling schedules for diffusion models.
- Computational cost of diffusion: While not explicitly stated as a limitation, discrete diffusion models, especially with many generation steps and beam search, can be computationally intensive at inference time compared to single-pass autoregressive models. Optimizing this aspect is a general challenge for diffusion models.
- Dependence on item embeddings: The Multi-Head VQ-VAE relies on high-quality initial item semantic representations (e.g., from Sentence-T5). The performance of the entire system could be sensitive to the quality and robustness of these initial embeddings. Future work might explore end-to-end learning of item representations integrated with the diffusion process.
- Hyperparameter tuning complexity: The model has several hyperparameters (e.g., the VQ-VAE codebook settings, the loss weighting coefficient, learning rates, weight decays, the number of Transformer layers, the beam size, and the number of generation steps). Tuning all of these for optimal performance can be complex and time-consuming.
7.3. Personal Insights & Critique
LLaDA-Rec presents a highly innovative and compelling approach to generative recommendation. The shift from autoregressive to discrete diffusion is a significant architectural advancement that addresses fundamental limitations in current methods.
- Inspirations and transferability: The core idea of adaptive-order generation and iterative refinement through masking and denoising is powerful. It could transfer to other discrete sequence generation tasks where fixed left-to-right generation is suboptimal or suffers from error accumulation, for instance code generation (where semantic coherence across a larger block of code matters), molecule design (generating sequences of chemical units), or dialogue generation, where bidirectional context and the ability to refine earlier parts of a response based on later content could be beneficial. The Multi-Head VQ-VAE for parallel tokenization is also a generalizable way to prepare discrete data for bidirectional models.
- Potential issues and areas for improvement:
  - Inference latency: While the paper addresses top-k generation, the iterative nature of diffusion models with multiple steps and beam search suggests higher latency than single-pass autoregressive methods. This may be a practical concern for real-time recommendation systems with strict latency requirements; further research into fast inference techniques for discrete diffusion is needed.
  - Interpretability of semantic IDs: Although semantic IDs are intended to be semantically rich, their interpretability, i.e., understanding what a given token sequence means to a human, remains challenging. The paper does not examine how these semantic IDs correlate with human-understandable attributes.
  - Cold-start problem: The model still relies on pre-trained item embeddings and a VQ-VAE to generate semantic IDs. For completely new items without textual information or interaction history, the cold-start problem would likely persist. How LLaDA-Rec would handle items with very sparse information, or truly novel items not seen during VQ-VAE training, is not explicitly discussed.
  - Scale of codebooks: Using a fixed number of codebooks, each with a fixed number of codes, bounds the space of possible semantic IDs. While large, this could still be a bottleneck for truly vast item catalogs; dynamic codebook sizes or more flexible quantization could be explored.

Overall, LLaDA-Rec is a rigorous and well-designed piece of research that pushes the boundaries of generative recommendation. Its success demonstrates the untapped potential of discrete diffusion models for complex sequence generation problems in recommender systems.