LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
TL;DR Summary
LLaDA-Rec is a discrete diffusion framework for generative recommendation that addresses the unidirectional constraints and error accumulation of autoregressive decoding. By integrating bidirectional attention and an adaptive generation order, it models inter-item and intra-item dependencies effectively and surpasses existing ID-based and generative recommenders on three real-world datasets.
Abstract
Generative recommendation represents each item as a semantic ID, i.e., a sequence of discrete tokens, and generates the next item through autoregressive decoding. While effective, existing autoregressive models face two intrinsic limitations: (1) unidirectional constraints, where causal attention restricts each token to attend only to its predecessors, hindering global semantic modeling; and (2) error accumulation, where the fixed left-to-right generation order causes prediction errors in early tokens to propagate to the predictions of subsequent tokens. To address these issues, we propose LLaDA-Rec, a discrete diffusion framework that reformulates recommendation as parallel semantic ID generation. By combining bidirectional attention with an adaptive generation order, the approach models inter-item and intra-item dependencies more effectively and alleviates error accumulation. Specifically, our approach comprises three key designs: (1) a parallel tokenization scheme that produces semantic IDs for bidirectional modeling, addressing the mismatch between residual quantization and bidirectional architectures; (2) two masking mechanisms at the user-history and next-item levels to capture both inter-item sequential dependencies and intra-item semantic relationships; and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, resolving the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets show that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
1.2. Authors
- Teng Shi (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Chenglei Shen (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Weijie Yu (School of Information Technology and Management, University of International Business and Economics, Beijing, China)
- Shen Nie (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Chongxuan Li (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Xiao Zhang (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
- Ming He (AI Lab at Lenovo Research, Beijing, China)
- Yan Han (AI Lab at Lenovo Research, Beijing, China)
- Jun Xu (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China)
1.3. Journal/Conference
The paper is listed as "In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA". This indicates that it is intended for publication in an ACM conference, although the specific conference name is a placeholder in the provided text. ACM conferences are highly reputable venues in computer science, particularly in areas like information retrieval, data mining, and artificial intelligence.
1.4. Publication Year
2025 (Based on the Published at (UTC): 2025-11-09T07:12:15.000Z metadata).
1.5. Abstract
Generative recommendation systems represent items as semantic IDs (sequences of discrete tokens) and use autoregressive models to generate the next item. However, these models suffer from two main issues: unidirectional constraints (causal attention limits token interaction to predecessors, hindering global semantic modeling) and error accumulation (errors in early tokens propagate). To overcome these, the paper introduces LLaDA-Rec, a discrete diffusion framework that reframes recommendation as parallel semantic ID generation. By combining bidirectional attention with an adaptive generation order, LLaDA-Rec aims to model inter-item and intra-item dependencies more effectively and reduce error propagation. The framework includes three key designs: (1) a parallel tokenization scheme using Multi-Head VQ-VAE to make semantic IDs suitable for bidirectional architectures, (2) two masking mechanisms (at user-history and next-item levels) to capture both sequential dependencies and intra-item semantics, and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, which resolves the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets demonstrate that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
1.6. Original Source Link
https://arxiv.org/abs/2511.06254v1 (This is a preprint on arXiv, not an officially published version in a peer-reviewed journal/conference yet). PDF Link: https://arxiv.org/pdf/2511.06254v1.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses crucial limitations of existing generative recommendation models, which have gained prominence by applying large language models (LLMs) to item recommendation.
- Core Problem: Current generative recommendation approaches, which represent items as semantic IDs (sequences of discrete tokens) and use autoregressive (AR) decoding to predict the next item, face two intrinsic limitations:
  - Unidirectional Constraints: AR models typically employ causal attention (also known as unidirectional attention), meaning each token can only attend to (process information from) its preceding tokens. This restriction hinders the model's ability to capture global relationships among all tokens that collectively define an item, leading to less semantically coherent and expressive generated items.
  - Error Accumulation: During the inference (generation) phase, AR models generate tokens one by one, conditioning each new token on the previously sampled ones. Unlike the training phase, where teacher forcing (providing the ground-truth token at each step) prevents early errors, at inference any prediction error in an early token cannot be corrected and propagates to subsequent tokens, amplifying its negative effect throughout the generated semantic ID.
- Importance: Overcoming these limitations is crucial for enhancing the accuracy, coherence, and overall quality of recommendations generated by LLM-based systems, making them more effective and reliable for users.
- Innovative Idea: The paper proposes LLaDA-Rec, a novel framework that leverages discrete diffusion models to reformulate recommendation as parallel semantic ID generation. This approach addresses the unidirectional constraints through bidirectional attention and mitigates error accumulation via an adaptive generation order.
2.2. Main Contributions / Findings
The primary contributions of LLaDA-Rec are:
- Addressing Autoregressive Limitations: It identifies and tackles the unidirectional constraints and error accumulation issues prevalent in existing autoregressive generative recommendation models, which limit their performance.
- Novel Discrete Diffusion Framework: It proposes LLaDA-Rec, a generative recommendation model based on discrete diffusion. The framework introduces parallel semantic IDs and develops discrete diffusion training and inference methods tailored for recommendation.
- Key Design Elements: LLaDA-Rec incorporates three essential designs:
  - Parallel Tokenization Scheme: It introduces Multi-Head VQ-VAE to produce semantic IDs that are inherently suitable for bidirectional modeling, resolving the architectural mismatch between residual quantization (RQ) and bidirectional Transformers.
  - Dual Masking Mechanisms: It employs two distinct masking mechanisms during training: user-history level masking to capture inter-item sequential dependencies and next-item level masking to model intra-item semantic relationships, enabling the model to effectively understand and generate item semantic IDs.
  - Adapted Beam Search Strategy: It devises an adapted beam search strategy for adaptive-order discrete diffusion decoding, which overcomes the incompatibility of standard beam search with the dynamic, non-left-to-right generation process of diffusion models.
- State-of-the-Art Performance: Extensive experiments on three real-world datasets demonstrate that LLaDA-Rec consistently outperforms both traditional item-ID-based methods and state-of-the-art semantic-ID-based generative recommenders, establishing discrete diffusion as a promising new paradigm for generative recommendation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Generative Recommendation
Generative recommendation is a paradigm shift from traditional discriminative recommendation. Instead of predicting a score for a user-item pair or classifying items, it formulates the task as generating the characteristics of the next item a user might interact with. This often involves representing items as semantic IDs, which are sequences of discrete tokens (similar to words in a sentence), and then using a generative model to "write" the semantic ID of the target item.
3.1.2. Semantic IDs
Semantic IDs (SIDs) are a core concept in generative recommendation. Unlike traditional item IDs (which are just unique numbers with no inherent meaning, e.g., Item #123), SIDs represent an item as a sequence of discrete tokens. These tokens are learned to encode the item's semantic information, such as its attributes, categories, or descriptive text. For example, an item like "red apple" might be tokenized into [fruit, red, round]. This allows the generative model to compose new items by generating new sequences of tokens, enabling recommendation of novel or out-of-vocabulary items.
3.1.3. Autoregressive (AR) Models
Autoregressive models are a class of generative models that predict a sequence of data elements one step at a time, where each step's prediction is conditioned on all previously predicted elements. In language modeling, this means predicting the next word based on all preceding words.
- Left-to-Right Generation: The most common AR models generate sequences in a fixed left-to-right order.
- Causal Attention: This unidirectional constraint is enforced by causal attention mechanisms in Transformers, where each position can attend only to earlier positions. This prevents tokens from seeing future information, which is necessary for sequential generation.
- Teacher Forcing: During training, AR models often use teacher forcing: at each step, the ground-truth previous token is fed as input rather than the model's own (potentially erroneous) prediction. This stabilizes training and speeds up convergence. During inference, however, teacher forcing cannot be used, which leads to error accumulation.
3.1.4. Discrete Diffusion Models
Discrete diffusion models are a type of generative model that learn to reverse a noise process applied to discrete data (like tokens).
- Forward Noise Process: During training, the original discrete sequence is progressively corrupted with masking noise: tokens are replaced by a special [MASK] token with a progressively increasing masking ratio, until the sequence is fully masked.
- Reverse Denoising Process: The model is trained to predict the original (unmasked) tokens from the partially masked sequence. During inference, this learned reverse process generates a clean sequence from a fully masked one: the model iteratively predicts masked tokens, keeps high-confidence predictions, and re-masks low-confidence ones for later refinement.
- Bidirectional Transformer: Unlike AR models, discrete diffusion models typically use a bidirectional Transformer (encoder) as their core component. This allows the model to leverage context from both preceding and succeeding tokens when predicting a masked token, providing a more holistic understanding of the sequence.
- Adaptive Generation Order: During inference, discrete diffusion models do not follow a fixed (e.g., left-to-right) generation order. Instead, they predict all masked tokens in parallel at each step and keep the tokens with the highest prediction confidence; the remaining low-confidence tokens are re-masked and re-predicted in subsequent steps. This adaptive generation order lets the model settle "easier" tokens first, which helps mitigate error accumulation.
3.1.5. Transformer Architecture
The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that has revolutionized sequence modeling. It relies heavily on self-attention mechanisms.
- Attention Mechanism: The core idea is to let the model weigh the importance of different parts of the input sequence when processing a specific part. It computes a context vector as a weighted sum of value vectors, where the weights are derived from the similarity between a query vector and key vectors:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  - $d_k$ is the dimension of the key vectors, used for scaling so that large dot products do not push the softmax into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes the scores into probabilities.
- Self-Attention: Here Q, K, and V are all derived from the same input sequence, so each token can attend to every other token in that sequence.
- Causal (Unidirectional) Attention: In decoder blocks of AR Transformers, a mask is applied to the attention scores so that a token at position $i$ can only attend to positions $j \le i$. This prevents "looking into the future."
- Bidirectional Attention: In encoder blocks of Transformers (like BERT), there is no such masking. Each token can attend to all other tokens in the sequence (both preceding and succeeding), providing full contextual understanding. A minimal code contrast of the two masking regimes follows below.
3.1.6. Vector Quantization (VQ-VAE)
Vector Quantization Variational Autoencoder (VQ-VAE) is a generative model that learns to map continuous input vectors into discrete codebook entries.
- Encoder: Maps a continuous input (e.g., an item embedding) to a continuous latent vector.
- Codebook: A finite set of learnable discrete code vectors (also called embeddings or prototypes).
- Quantization: For each latent vector produced by the encoder, the closest code vector in the codebook is found (e.g., by Euclidean distance). The index of this closest code vector becomes the discrete semantic ID token.
- Decoder: Reconstructs the original input from the selected code vector(s).
- Loss Functions: VQ-VAE typically uses a reconstruction loss (to ensure the decoded output matches the original input) and a vector quantization loss (to pull the encoder outputs toward codebook entries and to update the codebook entries themselves).
3.1.7. Residual Quantization (RQ-VAE)
Residual Quantization (RQ) is a hierarchical approach to vector quantization. Instead of quantizing a vector directly, it quantizes the residual error from a previous quantization step.
- Hierarchical Nature: The input vector is first quantized by a codebook, producing a residual error. This error is then quantized by a second codebook, generating a second residual, and so on. Each step adds another discrete token.
- Dependency: This creates an intrinsic hierarchical dependency in which earlier tokens (from earlier quantization stages) capture coarse-grained information and later tokens refine it. The structure naturally aligns with left-to-right autoregressive generation, where early tokens are fixed before later ones are generated. A small sketch of this level-by-level quantization is given below.
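A minimal sketch of the residual quantization idea, assuming one codebook tensor per level; it illustrates why earlier tokens carry the coarse information while later tokens only refine what is left over.

```python
import torch

def residual_quantize(z, codebooks):
    """z: (d,) latent vector; codebooks: list of (K, d) tensors, one per quantization level."""
    tokens, recon = [], torch.zeros_like(z)
    for codebook in codebooks:                 # level 1 quantizes z, level 2 its residual, and so on
        residual = z - recon                   # what earlier levels have not yet explained
        idx = ((residual - codebook) ** 2).sum(dim=-1).argmin()
        tokens.append(int(idx))
        recon = recon + codebook[idx]          # coarse-to-fine reconstruction
    return tokens, recon

codebooks = [torch.randn(256, 64) for _ in range(4)]   # 4 levels, 256 codes each
tokens, recon = residual_quantize(torch.randn(64), codebooks)
```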
3.1.8. Beam Search
Beam search is a heuristic search algorithm used in sequence generation (e.g., machine translation, language modeling, generative recommendation) to find the most probable sequence.
- Mechanism: Instead of greedily choosing only the single most probable next token at each step (which can lead to suboptimal sequences), beam search maintains a fixed number $B$ (the beam size) of the most probable partial sequences (or "beams"). At each step, it extends every partial sequence with all possible next tokens, evaluates their probabilities, and then prunes the set back to the top $B$ most probable sequences.
- Purpose: It aims to find a sequence with higher overall probability than greedy search without exploring the entire exponentially large search space (as breadth-first search would).
- Fixed Order: Traditionally, beam search is designed for fixed left-to-right decoding, where the sequence grows by one token at each step and tokens are added sequentially. A minimal sketch of this fixed-order procedure follows below.
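A minimal sketch of standard left-to-right beam search over a generic `next_token_logprobs(prefix)` scoring function (an assumed interface, not the paper's); it shows the expand-then-prune loop that the adapted diffusion variant later has to rework.

```python
import math

def beam_search(next_token_logprobs, vocab_size, seq_len, beam_size):
    """next_token_logprobs(prefix) -> list of log-probabilities over the vocabulary."""
    beams = [([], 0.0)]                                   # (token prefix, cumulative log-prob)
    for _ in range(seq_len):                              # fixed left-to-right order
        candidates = [
            (prefix + [token], score + next_token_logprobs(prefix)[token])
            for prefix, score in beams
            for token in range(vocab_size)
        ]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams                                          # top-B complete sequences

# Toy usage with a uniform "model" over a 5-token vocabulary.
uniform = lambda prefix: [math.log(1 / 5)] * 5
print(beam_search(uniform, vocab_size=5, seq_len=3, beam_size=2))
```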
3.2. Previous Works
The paper contextualizes LLaDA-Rec by discussing two main streams of related work: Generative Recommendation and Discrete Diffusion Models.
3.2.1. Generative Recommendation
Inspired by Large Language Models (LLMs) like GPT ([1] Achiam et al., 2023; [21] Liu et al., 2024), generative recommendation ([4, 14, 15, 22, 29, 32, 36, 46, 48, 49]) represents items as semantic IDs and uses generative models to predict the next item. It generally involves two stages:
- Item Tokenization: Assigning semantic IDs (SIDs) to items.
  - Clustering-based: SEATER [33] and EAGER [38] cluster item embeddings.
  - Vector Quantization-based:
    - Residual Quantization (RQ) methods: TIGER [29], LETTER [36], and LC-Rec [48] (using RQ-VAE [45]), and OneRec [4, 49] (using RQ-KMeans). These methods inherently create a hierarchical dependency in which earlier tokens are more dominant.
    - Product Quantization [6] methods: RPG [14].
- Autoregressive Generation: Most existing methods (e.g., TIGER, LETTER, LC-Rec) use an autoregressive paradigm to generate SIDs sequentially. RPG [14] is an exception, employing multi-token prediction [7] to generate unordered semantic IDs in parallel in a single step, without iterative refinement.
- Enhanced Generative Recommendation: Some studies ([3, 35]) enhance generative recommendation through latent reasoning [9].
3.2.2. Discrete Diffusion Models
Discrete diffusion models are a newer class of generative models built on bidirectional Transformer backbones. They learn a denoising process to reconstruct data from masked inputs.
- LLaDA [26]: The first diffusion-based language model to achieve performance comparable to autoregressive models. It established the basic framework of a bidirectional Transformer trained with forward token-masking noise and reverse denoising reconstruction.
- LLaDA-V [43]: An extension that adapts the LLaDA framework to visual understanding.
- MMaDA [40]: Further generalizes the LLaDA framework to multimodal understanding and generation.
- LLaDA 1.5 [51]: An improved version that integrates DPO-based post-training for additional performance gains.
- Continuous Diffusion in Recommendation: The paper also briefly mentions continuous diffusion models, which operate in continuous latent spaces and have been applied to image generation [30, 41] and sequential recommendation [18, 19, 37, 42] (e.g., DreamRec [42], DiffuRec [19], DimeRec [18]). These typically generate latent representations that require a separate retrieval stage, unlike LLaDA-Rec, which directly generates discrete semantic IDs.
3.3. Technological Evolution
The field of recommender systems has evolved through several stages:
- Traditional ID-based Methods: Early methods (e.g., matrix factorization, collaborative filtering) represented items and users with discrete IDs and learned embeddings for them. Sequential recommenders (e.g., GRU4Rec, SASRec, BERT4Rec) extended this by modeling user interaction sequences over these IDs. These are discriminative models that predict scores or rankings.
- Generative Recommendation with Semantic IDs: Inspired by the success of LLMs, this paradigm shifted to generative models. Items are no longer plain IDs but are tokenized into semantic IDs (sequences of discrete tokens), and the task becomes generating the semantic ID of the next item. This allows more explicit semantic understanding and the generation of new, unseen items. RQ-VAE-based methods (e.g., TIGER, LETTER, LC-Rec) and product quantization methods (e.g., RPG) are prominent here, typically coupled with autoregressive Transformers.
- Diffusion Models for Recommendation (LLaDA-Rec's contribution): The latest evolution, exemplified by LLaDA-Rec, introduces discrete diffusion models to generative recommendation. It moves beyond autoregressive generation by leveraging bidirectional attention and adaptive generation orders, directly generating semantic IDs through a parallel, iterative denoising process and thereby addressing the inherent limitations of autoregressive approaches. Concurrently, continuous diffusion models have also been adapted for recommendation, usually generating item embeddings for retrieval.
3.4. Differentiation Analysis
LLaDA-Rec differentiates itself from existing methods primarily through its adoption of discrete diffusion and specialized designs for generative recommendation:
- Autoregressive Generative Recommendation (e.g., TIGER, LETTER, LC-Rec):
  - Core Difference: LLaDA-Rec uses bidirectional attention and an adaptive generation order via discrete diffusion, whereas AR models use unidirectional (causal) attention and a fixed left-to-right generation order.
  - Innovation: This allows LLaDA-Rec to capture global inter-item and intra-item dependencies more effectively and to mitigate error accumulation by re-masking and re-predicting low-confidence tokens. AR models are prone to error propagation due to their fixed, sequential nature.
  - Tokenization: AR models often use hierarchical quantization such as RQ-VAE, where earlier tokens are more important. LLaDA-Rec uses parallel tokenization (Multi-Head VQ-VAE), where all tokens are equally important, better suiting bidirectional attention.
- RPG [14] (Parallel Semantic ID Generation):
  - Core Difference: While RPG also generates semantic IDs in parallel (multi-token prediction [7]), it does so in a single step without iterative refinement. LLaDA-Rec uses an iterative denoising process with re-masking and re-prediction.
  - Innovation: LLaDA-Rec's iterative nature allows an adaptive generation order and dynamic refinement of predictions, which is missing in RPG's single-step parallel generation.
- Continuous Diffusion Models for Recommendation (e.g., DiffuRec, DreamRec):
  - Core Difference: Continuous diffusion models operate in continuous latent spaces and typically generate item embeddings, which then require a separate retrieval stage (e.g., similarity search) to find actual items. LLaDA-Rec is a discrete diffusion model that directly generates the discrete semantic IDs of items.
  - Innovation: By directly outputting SIDs, LLaDA-Rec unifies generation and retrieval into a single optimization process, simplifying the inference pipeline and avoiding the potential mismatch between generated embeddings and retrieved items.
- Traditional Item ID-based Methods (e.g., SASRec, BERT4Rec):
  - Core Difference: These models predict the next item ID directly or learn item embeddings for discriminative ranking, whereas LLaDA-Rec generates semantic IDs.
  - Innovation: The semantic ID approach allows a richer representation of items, potential generalization to new items (whose semantic IDs can be composed), and leverages the power of generative models.

In summary, the key distinguishing feature of LLaDA-Rec is its novel application of discrete diffusion to generative recommendation, specifically designed to overcome the unidirectional constraints and error accumulation of autoregressive methods, and to integrate generation and retrieval more tightly than continuous diffusion methods.
4. Methodology
The LLaDA-Rec framework addresses the limitations of autoregressive generative recommendation by formulating the task as parallel semantic ID generation using a discrete diffusion approach. This involves three main modules: Parallel Tokenization, Discrete Diffusion Training, and Discrete Diffusion Inference.
4.1. Parallel Tokenization via Multi-Head VQ-VAE
4.1.1. Motivation for Parallel Tokenization
Existing generative recommendation models often use hierarchical quantization methods like Residual Quantization (RQ-VAE) or RQ-KMeans. In these hierarchical schemes, tokens are generated sequentially, with earlier tokens (e.g., the first token) holding more influence as subsequent tokens are conditionally dependent on them. This aligns well with autoregressive models that generate tokens in a fixed left-to-right order.
However, LLaDA-Rec utilizes a bidirectional Transformer for its discrete diffusion model. In a bidirectional Transformer, all tokens interact mutually and are equally important in the representation and generation process, irrespective of their position. The hierarchical dependencies of RQ-VAE are mismatched with this bidirectional nature. To better align with the bidirectional Transformer and treat all semantic ID tokens on an equal footing, LLaDA-Rec proposes a Multi-Head VQ-VAE architecture for parallel tokenization, eliminating hierarchical dependencies.
4.1.2. Multi-Head VQ-VAE Architecture
The Multi-Head VQ-VAE works as follows:
- Item Semantic Representation: Each item $i$ is first represented by a continuous semantic vector $\mathbf{v}_i$, obtained by encoding the item's textual information (e.g., title, description) with a pre-trained embedding model such as BERT [5] or Sentence-T5 [24].
- Encoder Projection: The semantic vector is projected into a latent space through an Encoder (implemented as a multi-layer perceptron (MLP)):
$ \mathbf{z}_i = \operatorname{Encoder}(\mathbf{v}_i) $
where $\mathbf{z}_i$ is the latent representation.
- Sub-vector Partitioning: The latent vector $\mathbf{z}_i$ is partitioned into $M$ equal-sized sub-vectors, each corresponding to a separate quantization "head":
$ \mathbf{z}_i = [ \mathbf{z}_{i,1} ; \mathbf{z}_{i,2} ; \ldots ; \mathbf{z}_{i,M} ] $
where $\mathbf{z}_{i,m} \in \mathbb{R}^{d/M}$ for $m = 1, \ldots, M$.
- Independent Quantization: $M$ distinct codebooks are maintained, one per sub-vector. The $m$-th codebook is $C_m = \{ \mathbf{e}_{m,k} \}_{k=1}^{K}$, where $K$ is the codebook size (number of code vectors) and the $\mathbf{e}_{m,k}$ are learnable code embeddings. Each sub-vector is quantized independently by finding the closest code vector in its codebook $C_m$:
$ c_{i,m} = \arg\min_{k} \Vert \mathbf{z}_{i,m} - \mathbf{e}_{m,k} \Vert_2^2, \quad \mathbf{e}_{m,k} \in C_m $
The chosen index $c_{i,m}$ from the $m$-th codebook becomes the $m$-th token of the item's semantic ID.
- Semantic ID Formation: After quantizing all sub-vectors, the semantic ID of item $i$ is the sequence of these discrete tokens:
$ s_i = [ c_{i,1}, c_{i,2}, \ldots, c_{i,M} ] $
with corresponding code embeddings $\mathbf{e}_{c_{i,1}}, \ldots, \mathbf{e}_{c_{i,M}}$.
- Quantized Representation and Decoder: The selected code embeddings are concatenated to form the quantized representation $\hat{\mathbf{z}}_i$:
$ \hat{\mathbf{z}}_i = [ \mathbf{e}_{c_{i,1}} ; \mathbf{e}_{c_{i,2}} ; \ldots ; \mathbf{e}_{c_{i,M}} ] $
This concatenated vector is then passed through a Decoder (also an MLP) to reconstruct the original semantic vector:
$ \hat{\mathbf{v}}_i = \operatorname{Decoder}(\hat{\mathbf{z}}_i) $
A compact code sketch of this pipeline is given after the list.
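The following is a minimal PyTorch sketch of the parallel tokenization pipeline described above (encode, split into $M$ sub-vectors, quantize each against its own codebook, decode); module names, dimensions, and the plain-MLP encoder/decoder are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiHeadVQ(nn.Module):
    """Minimal sketch of parallel tokenization: M independent codebooks, one per sub-vector."""
    def __init__(self, input_dim=768, latent_dim=128, num_heads=4, codebook_size=256):
        super().__init__()
        self.num_heads = num_heads
        self.sub_dim = latent_dim // num_heads                      # d / M
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, input_dim))
        # One learnable codebook of shape (K, d/M) per head.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, self.sub_dim)) for _ in range(num_heads)])

    def forward(self, v):                                           # v: (batch, input_dim)
        z = self.encoder(v)
        subs = z.view(-1, self.num_heads, self.sub_dim)             # split into M sub-vectors
        tokens, quantized = [], []
        for m in range(self.num_heads):
            dists = torch.cdist(subs[:, m], self.codebooks[m])      # (batch, K) distances
            idx = dists.argmin(dim=-1)                              # m-th token of the semantic ID
            tokens.append(idx)
            quantized.append(self.codebooks[m][idx])
        z_hat = torch.cat(quantized, dim=-1)                        # concatenated code embeddings
        return torch.stack(tokens, dim=-1), z_hat, self.decoder(z_hat), z

sid, z_hat, v_hat, z = MultiHeadVQ()(torch.randn(8, 768))           # sid: (8, 4) semantic IDs
```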
4.1.3. VQ-VAE Loss Function
The overall training objective for the Multi-Head VQ-VAE combines a reconstruction loss and a vector quantization loss:
$
\begin{aligned}
\mathcal{L}_{\mathrm{Recon}} &= \Vert \mathbf{v}_i - \hat{\mathbf{v}}_i \Vert_2^2, \\
\mathcal{L}_{\mathrm{VQ}} &= \sum_{m=1}^{M} \Big( \Vert \mathrm{sg}[\mathbf{z}_{i,m}] - \mathbf{e}_{c_{i,m}} \Vert_2^2 + \alpha \Vert \mathbf{z}_{i,m} - \mathrm{sg}[\mathbf{e}_{c_{i,m}}] \Vert_2^2 \Big), \\
\mathcal{L}_{\mathrm{VQ\text{-}VAE}} &= \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{VQ}}.
\end{aligned}
$
Where:
- $\mathcal{L}_{\mathrm{Recon}}$: Reconstruction loss, the squared Euclidean distance between the original item semantic vector $\mathbf{v}_i$ and its reconstruction $\hat{\mathbf{v}}_i$. It ensures that the discrete semantic ID tokens can effectively represent the item's original semantics.
- $\mathcal{L}_{\mathrm{VQ}}$: Vector quantization loss, consisting of two parts for each sub-vector $m$:
  - $\Vert \mathrm{sg}[\mathbf{z}_{i,m}] - \mathbf{e}_{c_{i,m}} \Vert_2^2$: minimizes the distance between the encoder's output sub-vector (with stop-gradient applied) and its chosen code embedding; this term updates the codebook embeddings.
  - $\alpha \Vert \mathbf{z}_{i,m} - \mathrm{sg}[\mathbf{e}_{c_{i,m}}] \Vert_2^2$: the commitment term, which minimizes the distance between the encoder's output sub-vector and the chosen code embedding (with stop-gradient applied to the code embedding); it encourages the encoder to produce latent vectors close to the codebook entries and helps prevent codebook collapse (where only a few code vectors are ever used).
- $\alpha$: A hyperparameter balancing the contribution of the commitment term.
- $\mathrm{sg}[\cdot]$: The stop-gradient operation, which prevents gradients from flowing through the specified variable, effectively treating it as a constant during backpropagation along that path. The short code sketch below illustrates these loss terms.
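A minimal sketch of the three loss terms above, using `.detach()` as the stop-gradient operator and assuming the `z` / `z_hat` tensors produced by a multi-head quantizer like the earlier sketch; the straight-through trick needed to propagate reconstruction gradients through the quantizer is omitted for brevity.

```python
import torch.nn.functional as F

def vqvae_loss(v, v_hat, z, z_hat, alpha=0.25):
    """v / v_hat: original and reconstructed item vectors; z / z_hat: encoder output and
    selected code embeddings (same shape). .detach() plays the role of sg[.]."""
    recon = F.mse_loss(v_hat, v)              # L_Recon: keep semantics reconstructable
    codebook = F.mse_loss(z_hat, z.detach())  # pulls code embeddings toward encoder outputs
    commit = F.mse_loss(z, z_hat.detach())    # keeps encoder outputs committed to the codebook
    return recon + codebook + alpha * commit
```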
4.2. Discrete Diffusion Training
The discrete diffusion model in LLaDA-Rec is trained to capture both inter-item sequential dependencies (relationships between items in a user's history) and intra-item semantic relationships (relationships among tokens within a single item). This is achieved through two distinct masking mechanisms: User-History level masking and Next-Item level masking.
4.2.1. Problem Formulation and Probabilistic Comparison
The overarching goal is to predict the next item $s_n$ for a user given the interaction history. Each item is represented by its semantic ID $s_i = [c_{i,1}, \ldots, c_{i,M}]$, so the user history becomes a token sequence $S_{\mathcal{H}}$. The task is to maximize the conditional probability:
$
\theta^* = \arg\max_{\theta} \mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}})
$
where $\theta$ denotes the model parameters.
- Autoregressive Modeling (Eq. 3): Existing generative recommendation methods predominantly generate tokens sequentially from left to right. The probability of generating the entire semantic ID is a product of conditional probabilities:
$
\mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) = \prod_{m=1}^{M} \mathrm{P}_{\theta}(c_{n,m} \mid c_{n,<m}, S_{\mathcal{H}})
$
where $c_{n,<m}$ denotes all tokens of the current item preceding $c_{n,m}$. This formulation requires exactly $M$ generation steps.
- Discrete Diffusion Modeling (Eq. 4): In contrast, discrete diffusion generates tokens iteratively over $T$ steps, starting from a fully masked sequence of [MASK] tokens. At each step $t$, the Mask Predictor (a bidirectional Transformer encoder) predicts all masked positions in parallel, retains the highest-confidence predictions, and re-masks the rest for subsequent steps:
$
\mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) = \prod_{t=1}^{T} \prod_{m=1}^{M}
\begin{cases}
\mathrm{P}_{\theta}\big( c_{n,m} \mid s_n^t, S_{\mathcal{H}} \big), & \text{if } c_{n,m}^t = [\mathsf{MASK}], \\
1, & \text{otherwise}.
\end{cases}
$
Where:
  - $T$: The total number of generation steps.
  - $s_n^t$: The input sequence for the next item at step $t$, containing some already-generated tokens and some [MASK] tokens; $s_n^1$ is all [MASK] tokens.
  - $c_{n,m}^t$: The $m$-th token of $s_n^t$.
  - The Mask Predictor (a Transformer encoder with bidirectional attention) predicts the original token $c_{n,m}$ given the current partially masked sequence $s_n^t$ and the user history $S_{\mathcal{H}}$.
  - The product accumulates probabilities only for tokens that are [MASK] at step $t$; tokens that were already predicted are fixed (probability 1). This process offers parallel generation, an adaptive generation order, and explicit control over the number of generation steps.
4.2.2. Discrete Diffusion Process Overview
The discrete diffusion model operates in two stages:
- Forward Process: Tokens in an input sequence are progressively masked. A masking ratio $r \in [0, 1]$ determines the probability of each token being masked; at $r = 1$, all tokens are [MASK].
- Reverse Denoising Process: The model learns to reconstruct the original sequence from a partially masked one. During inference, it starts from a fully masked sequence and iteratively fills in tokens as $r$ decreases from 1 to 0.

Based on this, LLaDA-Rec designs two diffusion masking training strategies:
4.2.3. User-History Level Masking
This strategy applies the discrete diffusion masking process to the token sequence of the user history, $S_{\mathcal{H}}$. Its objective is to train the MASK predictor to capture global dependencies among all tokens within the user's interaction history.
- Mechanism: At each diffusion step (characterized by a masking ratio $r$), each token in $S_{\mathcal{H}}$ is independently masked with probability $r$ or remains visible with probability $1 - r$. The resulting partially masked history, denoted $S_{\mathcal{H}}^r$, is then fed to the MASK predictor.
- Training Loss: The model is trained to reconstruct the masked tokens in $S_{\mathcal{H}}^r$. The loss is defined as:
$
\mathcal{L}_{\mathrm{His\text{-}Mask}} = - \mathbb{E}_{r, S_{\mathcal{H}}, S_{\mathcal{H}}^r} \left[ \frac{1}{r} \sum_{i=1}^{M \times (n-1)} \mathbb{1}\left[ S_{\mathcal{H},i}^r = [\mathsf{MASK}] \right] \log \mathrm{P}_{\theta}\big( S_{\mathcal{H},i} \mid S_{\mathcal{H}}^r \big) \right]
$
Where:
  - $\mathbb{E}_{r, S_{\mathcal{H}}, S_{\mathcal{H}}^r}$: Expectation over the masking ratio $r$, the ground-truth user history $S_{\mathcal{H}}$, and its masked version $S_{\mathcal{H}}^r$.
  - $M \times (n-1)$: Total number of tokens in the user history (assuming $n-1$ items, each with $M$ tokens).
  - $\mathbb{1}[\cdot]$: An indicator function that is 1 if the $i$-th history token is masked, and 0 otherwise.
  - $\log \mathrm{P}_{\theta}(S_{\mathcal{H},i} \mid S_{\mathcal{H}}^r)$: The log-probability the model assigns to the ground-truth token given the partially masked history.
  - The $\frac{1}{r}$ term acts as a weighting factor, emphasizing steps with lower masking ratios (i.e., less noise) to help the model learn more precise reconstructions. The code sketch below illustrates this masking and weighting.
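A minimal sketch of this objective, assuming a `mask_predictor` that maps a token sequence to per-position logits and a reserved `MASK_ID`; the masking ratio is drawn per sample and the cross-entropy on masked positions is re-weighted by $1/r$ as in the loss above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed reserved [MASK] token id

def history_mask_loss(mask_predictor, history_tokens):
    """history_tokens: (batch, L) ground-truth semantic-ID tokens of the user history."""
    r = torch.rand(history_tokens.size(0), 1)                          # one masking ratio per sample
    is_masked = torch.rand_like(history_tokens, dtype=torch.float) < r
    corrupted = history_tokens.masked_fill(is_masked, MASK_ID)         # forward (noising) process
    logits = mask_predictor(corrupted)                                 # (batch, L, vocab), bidirectional
    nll = F.cross_entropy(logits.transpose(1, 2), history_tokens, reduction="none")
    return (nll * is_masked.float() / r).sum(dim=-1).mean()            # masked positions only, 1/r-weighted
```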
4.2.4. Next-Item Level Masking
This strategy focuses on the target item. The discrete diffusion masking process is applied to the semantic ID of the next item, $s_n$, while the user history $S_{\mathcal{H}}$ is kept fully visible (unmasked).
- Mechanism: At each diffusion step (masking ratio $r$), each of the $M$ tokens in the next item $s_n$ is independently masked with probability $r$ or remains visible with probability $1 - r$. The resulting partially masked sequence $s_n^r$ is concatenated with the fully visible historical tokens $S_{\mathcal{H}}$, and the combined sequence is fed into the MASK predictor.
- Training Objective: The model aims to reconstruct the masked tokens of the next item:
$
\mathcal{L}_{\mathrm{Item\text{-}Mask}} = - \mathbb{E}_{r, s_n, s_n^r} \left[ \frac{1}{r} \sum_{i=1}^{M} \mathbb{1}\left[ c_{n,i}^r = [\mathsf{MASK}] \right] \log \mathrm{P}_{\theta}\big( c_{n,i} \mid s_n^r, S_{\mathcal{H}} \big) \right]
$
Where:
  - $\mathbb{E}_{r, s_n, s_n^r}$: Expectation over the masking ratio $r$, the ground-truth next item $s_n$, and its masked version $s_n^r$.
  - $M$: The number of tokens in the next item's semantic ID.
  - $\mathbb{1}[\cdot]$: An indicator function that is 1 if the $i$-th token of the next item is masked, and 0 otherwise.
  - $\log \mathrm{P}_{\theta}(c_{n,i} \mid s_n^r, S_{\mathcal{H}})$: The log-probability the model assigns to the ground-truth token given the partially masked next item and the full user history.
- Theoretical Justification: The loss in Eq. (12) can be shown to be an upper bound on the negative log-likelihood of the conditional model distribution in Eq. (2) ([27, 31]):
$
- \mathbb{E}\left[ \log \mathrm{P}_{\theta}(s_n \mid S_{\mathcal{H}}) \right] \le \mathcal{L}_{\mathrm{Item\text{-}Mask}}.
$
Minimizing $\mathcal{L}_{\mathrm{Item\text{-}Mask}}$ therefore maximizes a lower bound on the desired conditional probability for generative recommendation.
4.2.5. Joint Training
To holistically train the MASK predictor to capture both inter-item and intra-item dependencies, the two loss functions are combined:
$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{Item\text{-}Mask}} + \lambda_{\mathrm{His\text{-}Mask}} \mathcal{L}_{\mathrm{His\text{-}Mask}} + \lambda_{\mathrm{Reg}} \Vert \theta \Vert_2^2
$
Where:
- $\mathcal{L}_{\mathrm{Item\text{-}Mask}}$: Guides the model to predict the next item conditioned on the history and on intra-item semantic relationships.
- $\mathcal{L}_{\mathrm{His\text{-}Mask}}$: Helps the model better understand the relationships among tokens across the entire user history.
- $\lambda_{\mathrm{His\text{-}Mask}}$: A weighting coefficient that balances the contributions of the two masking losses.
- $\lambda_{\mathrm{Reg}}$: Controls the strength of the L2 regularization term $\Vert \theta \Vert_2^2$, which helps prevent overfitting by penalizing large parameter values. A small combined sketch of this objective follows below.
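A minimal sketch of how the joint objective might be assembled, reusing `history_mask_loss` from the earlier sketch; `item_mask_loss` and the weight values are hypothetical placeholders, not the paper's settings.

```python
def training_step(mask_predictor, history_tokens, next_item_tokens,
                  lambda_his=0.5, lambda_reg=1e-4):
    """Joint objective: next-item masking loss + weighted history masking loss + L2 penalty.
    `item_mask_loss` is a hypothetical analogue of `history_mask_loss` that masks only the
    next item's M tokens while keeping the history fully visible."""
    l_item = item_mask_loss(mask_predictor, history_tokens, next_item_tokens)
    l_his = history_mask_loss(mask_predictor, history_tokens)
    l_reg = sum((p ** 2).sum() for p in mask_predictor.parameters())
    return l_item + lambda_his * l_his + lambda_reg * l_reg
```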
4.3. Discrete Diffusion Inference
After training, the goal is to generate the top-k recommended items. This presents challenges for discrete diffusion: traditional diffusion models often rely on probabilistic sampling for top-1 outputs, and conventional beam search is designed for fixed left-to-right decoding, not adaptive-order diffusion. LLaDA-Rec adapts beam search for this setting.
4.3.1. Initialization
The generation process is divided into $T$ discrete steps.
- $\mathcal{PG}_t$: The set of positions that have already been generated (filled) at step $t$; initially $\mathcal{PG}_1 = \emptyset$.
- $s_n^t$: The token sequence for the next item at step $t$. At $t = 1$ it is initialized as $s_n^1 = [[\mathsf{MASK}], \ldots, [\mathsf{MASK}]]$, containing $M$ [MASK] tokens.
- At each step $t$, the MASK predictor takes the current partially generated sequence $s_n^t$ (for the next item) and the user history $S_{\mathcal{H}}$ as input and outputs a probability distribution over the vocabulary for each masked position:
$
\mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}}) \in [0, 1], \quad m \in \{1, \ldots, M\} \setminus \mathcal{PG}_t, \quad w \in \{1, \ldots, |\mathcal{W}|\}
$
Where:
  - $m$: Indexes the positions that are currently masked (not yet generated).
  - $w$: Indexes candidate tokens in the vocabulary $\mathcal{W}$.
  - $\mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}})$: The probability of token $w$ at position $m$ at step $t$.
4.3.2. Generation Position Selection
Unlike autoregressive generation, discrete diffusion predicts all [MASK] positions in parallel. To generate tokens iteratively, LLaDA-Rec first determines which positions to fill at step $t$. Since $M$ tokens must be generated over $T$ steps, at each step the model selects the top $\frac{M}{T}$ unfilled positions with the highest maximum token probabilities (i.e., the highest confidence in their best prediction):
$
\begin{aligned}
\mathcal{M}_t &= \underset{m \in \{1, \ldots, M\} \setminus \mathcal{PG}_t}{\operatorname{top-}\frac{M}{T}} \left( \max_{w \in \{1, \ldots, |\mathcal{W}|\}} \mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}}) \right), \\
\mathcal{PG}_{t+1} &= \mathcal{PG}_t \cup \mathcal{M}_t.
\end{aligned}
$
Where:
- $\mathcal{M}_t$: The set of positions selected at step $t$ based on the top $\frac{M}{T}$ highest confidence scores.
- $\operatorname{top-}\frac{M}{T}(\cdot)$: Selects the $\frac{M}{T}$ elements with the highest scores.
- $\max_{w} \mathrm{P}_{\theta}^{t,m}(w \mid s_n^t, S_{\mathcal{H}})$: The maximum probability of any token at masked position $m$, i.e., the model's confidence for that position.
- $\mathcal{PG}_{t+1}$: The set of already-generated positions, updated by adding the newly selected positions in $\mathcal{M}_t$. This selection rule is illustrated in the sketch below.
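A minimal sketch of this position-selection rule: unfilled positions are ranked by the maximum probability the mask predictor assigns to any token at that position, and the $M/T$ most confident ones are chosen.

```python
import torch

def select_positions(probs, generated, num_to_fill):
    """probs: (M, vocab) per-position distributions from the mask predictor;
    generated: boolean (M,) marking positions already filled."""
    confidence = probs.max(dim=-1).values                 # best-token probability per position
    confidence = confidence.masked_fill(generated, -1.0)  # never re-select filled positions
    return confidence.topk(num_to_fill).indices           # the M/T most confident positions
```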
4.3.3. Beam Search for Discrete Diffusion
Once the positions in $\mathcal{M}_t$ are selected, beam search is applied sequentially to these positions.
- Beam Set: Let $\mathcal{B}_t$ be the set of current candidate sequences (beams) at step $t$.
- Expansion and Pruning: For each position $m_i$ in the set of selected positions $\mathcal{M}_t$:
  - The current beam set (initially $\mathcal{B}_{t,0} = \mathcal{B}_t$) is expanded: for each beam in $\mathcal{B}_{t,i-1}$, the top $B$ candidate tokens for position $m_i$ (based on $\mathrm{P}_{\theta}^{t,m_i}(w \mid s_n^t, S_{\mathcal{H}})$) are considered.
  - The expanded set of beams is then pruned back to the top $B$ beams according to their model scores (joint probabilities).
$
\begin{array}{rl}
& \mathcal{B}_{t,0} \gets \mathcal{B}_t, \quad
\mathcal{B}_{t,i} \gets \mathcal{B}_{t,i-1} \cup \operatorname{top-}B \big( \mathrm{P}_{\theta}^{t,m_i}(w \mid s_n^t, S_{\mathcal{H}}) \big), \\
& \mathcal{B}_{t,i} \gets \underset{b \in \mathcal{B}_{t,i}}{\operatorname{top-}B} \big( \mathrm{P}_{\theta}^{t}(b \mid s_n^t, S_{\mathcal{H}}) \big), \quad
\mathcal{B}_{t+1} \gets \mathcal{B}_{t, |\mathcal{M}_t|}.
\end{array}
$
Where:
  - $B$: The beam size.
  - $\operatorname{top-}B(\cdot)$: Selects the $B$ elements with the highest scores.
  - $\mathrm{P}_{\theta}^{t}(b \mid s_n^t, S_{\mathcal{H}})$: The joint probability of beam $b$ at step $t$, typically the product of the probabilities of all tokens generated in that beam so far.
- Sequence Update: After beam search across all selected positions in $\mathcal{M}_t$, the tokens at these positions in $s_n^t$ are replaced with the newly generated tokens from the winning beams to form the updated sequence $s_n^{t+1}$.
4.3.4. Iterative Generation
This process continues for $T$ steps:
- At each iteration $t$, the MASK predictor re-evaluates all currently masked positions in the context of the partially generated sequence $s_n^t$.
- The set of positions to generate, $\mathcal{M}_t$, is determined using Eq. (16).
- Beam search is performed on these selected positions to fill them.
- All unselected (still masked) positions are re-masked for the next iteration.

This loop repeats until all $M$ positions of the next item's semantic ID have been filled. Finally, the candidate semantic ID sequences (from the beam search at the last step) are ranked by their overall probabilities, and the top-k sequences are converted back to items and returned as recommendations. This iterative refinement allows LLaDA-Rec to dynamically adjust predictions and generate high-quality top-k outputs. A simplified end-to-end sketch of this decoding loop is given below.
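Putting the pieces together, the following is a simplified sketch of the adaptive-order decoding loop with a beam, assuming a `mask_predictor(tokens, history)` that returns per-position probability distributions and reusing the `select_positions` helper from the previous sketch; among other simplifications, positions are chosen from the top beam only.

```python
import math
import torch

MASK_ID = 0   # assumed reserved [MASK] token id

def diffusion_beam_decode(mask_predictor, history, M=4, T=4, beam_size=5):
    """Each beam is (tokens, cumulative log-prob); M/T positions are filled per step."""
    beams = [(torch.full((M,), MASK_ID), 0.0)]
    generated = torch.zeros(M, dtype=torch.bool)
    for _ in range(T):
        # Pick the positions to fill this step from the top beam's confidences (simplification).
        probs = mask_predictor(beams[0][0], history)                 # (M, vocab)
        positions = select_positions(probs, generated, M // T)
        for pos in positions.tolist():                               # beam search over chosen positions
            candidates = []
            for tokens, score in beams:
                pos_probs = mask_predictor(tokens, history)[pos]     # (vocab,)
                top_p, top_w = pos_probs.topk(beam_size)
                for p, w in zip(top_p.tolist(), top_w.tolist()):
                    new_tokens = tokens.clone()
                    new_tokens[pos] = w
                    candidates.append((new_tokens, score + math.log(p + 1e-12)))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
            generated[pos] = True
    return beams   # candidate semantic IDs, ranked by accumulated log-probability
```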
4.4. Discussion
4.4.1. Continuous vs. Discrete Diffusion in Recommendation
- Continuous Diffusion Models: Operate in continuous spaces, generating latent representations (e.g., item embeddings) through a denoising process. These models are often used for sequential recommendation (e.g., DiffuRec, DreamRec). Crucially, after generating a latent representation, a separate retrieval stage (e.g., similarity search over a large item-embedding database) is required to map it back to actual items. This separation can introduce an optimization mismatch between the generation and retrieval stages.
- Discrete Diffusion Models (LLaDA-Rec): Operate directly on discrete tokens. LLaDA-Rec directly generates the semantic IDs (sequences of discrete tokens) of items, so the model's output is immediately an item identifier. This unifies generation and retrieval in a single optimization process, eliminating the separate retrieval stage, simplifying the inference pipeline, and often improving recommendation performance.
4.4.2. Advantages over Autoregressive Models
LLaDA-Rec offers several advantages over autoregressive (AR) generative recommendation methods (e.g., TIGER, LETTER, LC-Rec):
- Unidirectional vs. Bidirectional Attention: AR models use causal attention, restricting tokens to attend only to predecessors. LLaDA-Rec uses bidirectional attention, allowing tokens to attend to both preceding and succeeding positions and capturing richer global contextual semantics within the semantic ID and the user history.
- Fixed vs. Adaptive Generation Order: AR models generate tokens in a fixed left-to-right order, making them susceptible to error accumulation (an early mistake propagates). LLaDA-Rec uses an adaptive, confidence-driven generation order, settling high-certainty (easier) tokens first and iteratively re-masking and re-predicting low-confidence ones, which reduces the impact of early errors and alleviates error accumulation.
- Single-Step vs. Iterative Refinement (vs. RPG): While RPG [14] also performs parallel generation via multi-token prediction [7], it is a single-step prediction. LLaDA-Rec's discrete diffusion is an iterative process, allowing dynamic refinement of predictions through re-masking and re-prediction over multiple steps.
- Controllable Generation Steps: Discrete diffusion naturally supports a controllable number of generation steps ($T$), allowing a trade-off between generation speed and quality.

The key differences are summarized in the following table (Table 1 from the original paper):
| Methods | Attention Mechanism | Generation Order | Controllable Generation Step |
|---|---|---|---|
| TIGER [29] | Causal | Left2Right | ✗ |
| LETTER [36] | Causal | Left2Right | ✗ |
| LC-Rec [48] | Causal | Left2Right | ✗ |
| RPG [14] | Causal | Parallel | ✗ |
| LLaDA-Rec | Bidirectional | Adaptive | ✓ |
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three datasets from the widely used Amazon 2023 Review dataset [13]. These datasets are categories of products:
- Industrial Scientific (Scientific): industrial and scientific products.
- Musical Instruments (Instrument): musical instruments.
- Video Games (Game): video games.
Characteristics and Preprocessing:
- Each user's historical reviews are treated as interaction records.
- Interactions are ordered chronologically, with the earliest review first.
- Leave-one-out protocol [16, 29]: The last item in each user's sequence is reserved for testing, the second-to-last item is used for validation, and the remaining preceding items form the training history.

The following are the results from Table 2 of the original paper:

| Dataset | #Users | #Items | #Interactions | Sparsity | Avg. len |
|---|---|---|---|---|---|
| Scientific | 50,985 | 25,848 | 412,947 | 99.969% | 8.10 |
| Instrument | 57,439 | 24,587 | 511,836 | 99.964% | 8.91 |
| Game | 94,762 | 25,612 | 814,586 | 99.966% | 8.60 |

- #Users: Number of unique users.
- #Items: Number of unique items.
- #Interactions: Total number of user-item interactions.
- Sparsity: The proportion of unobserved user-item pairs in the full user-item matrix. High sparsity (close to 100%) means most users have interacted with only a small fraction of the available items, which is typical for recommendation datasets.
- Avg. len: Average number of interactions per input sequence (user history).

These datasets are widely used benchmarks in sequential recommendation, making them effective for validating the proposed method and ensuring comparability with existing research. They cover diverse product categories, allowing assessment of the model's generalization capability.
5.2. Evaluation Metrics
The performance of the recommendation models is evaluated using two standard ranking metrics: Recall@k and Normalized Discounted Cumulative Gain (NDCG@k). Results are reported for $k \in \{1, 5, 10\}$. NDCG@1 is omitted because it is mathematically identical to Recall@1.
5.2.1. Recall@k
- Conceptual Definition: Recall@k measures the proportion of relevant items that are successfully retrieved within the top-k recommendations, i.e., whether the true next item appears among the top k items recommended by the model. A higher Recall@k indicates the model is better at surfacing relevant items.
- Mathematical Formula:
$
\mathrm{Recall@k} = \frac{\text{Number of relevant items in top-}k\text{ recommendations}}{\text{Total number of relevant items}}
$
For leave-one-out evaluation, where there is exactly one relevant item (the true next item), this simplifies to:
$ \mathrm{Recall@k} = \mathbb{1}(\text{true next item} \in \text{top-}k\text{ recommendations}) $
- Symbol Explanation:
  - $\mathbb{1}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
  - true next item: The actual item the user interacted with after their history, used for testing.
  - top-k recommendations: The list of the $k$ highest-ranked items returned by the model.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@k)
- Conceptual Definition: NDCG@k measures ranking quality while accounting for the position of relevant items in the recommendation list: relevant items placed higher earn more credit. It also supports graded relevance, although in leave-one-out evaluation relevance is binary. A higher NDCG@k indicates a better-ordered list in which relevant items are ranked prominently.
- Mathematical Formula: First, the Discounted Cumulative Gain (DCG@k) is calculated:
$ \mathrm{DCG@k} = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $
Then NDCG@k normalizes DCG@k by the Ideal DCG (IDCG@k), the maximum possible DCG for a perfect ranking:
$ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $
- Symbol Explanation:
  - $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommendation list. For leave-one-out, this is binary (1 if the item is the true next item, 0 otherwise).
  - $j$: The rank (position) of an item in the recommendation list, starting from 1.
  - $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of relevant items found at lower ranks.
  - $\mathrm{IDCG@k}$: The DCG of the ideal ranking, where all relevant items are placed at the top in decreasing order of relevance. For leave-one-out, IDCG@k equals 1 whenever the true next item appears in the top-k list. Both metrics are illustrated in the short sketch below.
5.3. Baselines
The paper compares LLaDA-Rec against a comprehensive set of baseline models, categorized into Item ID-based and Semantic ID-based approaches.
5.3.1. Item ID-based Baselines
These models typically use unique numerical item IDs and learn embeddings for them, often focusing on capturing sequential patterns in user interactions.
- GRU4Rec [11]: A pioneering sequential recommender that uses Gated Recurrent Units (GRUs) to model the sequence of user interactions and predict the next item.
- SASRec [16]: Self-Attentive Sequential Recommendation. It applies a unidirectional Transformer encoder to capture sequential dependencies, letting each item attend to all preceding items in the user's history.
- BERT4Rec [34]: Bidirectional Encoder Representations from Transformers for Recommendation. It uses a bidirectional Transformer trained with a cloze-style objective (masking random items in a sequence and predicting them), analogous to BERT for language modeling.
- FMLP-Rec [50]: Filter-Enhanced MLP is All You Need for Sequential Recommendation. It employs multi-layer perceptrons (MLPs) with learnable filters to model sequential patterns, offering an alternative to Transformers and RNNs.
- LRURec [44]: Linear Recurrent Unit for Sequential Recommendation. It integrates linear recurrent units (LRUs) to efficiently process long-range user interactions, aiming to overcome the limitations of traditional RNNs on very long sequences.
- DreamRec [42]: Reshapes sequential recommendation via guided diffusion. It uses SASRec outputs as initial embeddings and a diffusion denoising module to refine them, and is designed to avoid negative sampling by training only on positive samples. This is a continuous diffusion model.
- DiffuRec [19]: A Diffusion Model for Sequential Recommendation. It combines generative diffusion with sequential recommendation, using a Transformer approximator to reconstruct target item embeddings. This is also a continuous diffusion model.
5.3.2. Semantic ID-based Generative Baselines
These models represent items as semantic IDs (sequences of discrete tokens) and use generative models to predict the semantic ID of the next item.
- VQ-Rec [12]: Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. It applies product quantization to tokenize items into semantic IDs, which are then pooled to obtain item representations for sequential recommendation.
- TIGER [29]: Recommender Systems with Generative Retrieval. It utilizes a Residual Quantization Variational Autoencoder (RQ-VAE) to generate codebook identifiers, embedding semantic information into discrete code sequences, and then uses an autoregressive Transformer for generation.
- TIGER-SAS [29]: A variant of TIGER whose semantic IDs are derived from item embeddings trained by SASRec (a sequential model) rather than solely from text embeddings, so as to incorporate collaborative signals.
- LETTER [36]: Learnable Item Tokenization for Generative Recommendation. It develops a learnable tokenizer that incorporates hierarchical semantics, collaborative signals, and code-assignment diversity during item tokenization, followed by autoregressive generation.
- LC-Rec [48]: Generative Recommender with End-to-End Learnable Item Tokenization. It exploits identifiers with auxiliary alignment tasks to associate the generated codes with natural-language descriptions, enhancing the interpretability and quality of semantic IDs for autoregressive generation.
- RPG [14]: Retrieval-augmented Personalized Generative Recommendation. A lightweight semantic-ID-based model that generates long, unordered semantic IDs in parallel via multi-token prediction [7]. Unlike LLaDA-Rec, it performs this prediction in a single step without iterative refinement.

These baselines cover a wide range of state-of-the-art approaches, including traditional ID-based methods, recent continuous diffusion models for recommendation, and contemporary semantic-ID-based generative recommenders, ensuring a robust comparison.
5.4. Implementation Details
5.4.1. Parallel Tokenization (Multi-Head VQ-VAE)
- Item Embedding: Sentence-T5 [24] is used to encode the title and other textual information of each item into an initial semantic embedding.
- Codebook Configuration:
  - Number of codebooks ($M$): 4
  - Number of code vectors per codebook ($K$): 256
  - Dimension of each code vector ($d/M$): 32
  - Total latent dimension ($d$): $4 \times 32 = 128$
- Hyperparameter: The weight $\alpha$ in the VQ-VAE loss (Eq. 10) is set to 0.25.
- Training:
  - Optimizer: AdamW [23]
  - Learning Rate:
  - Batch Size: 2,048
  - Epochs: 10,000
5.4.2. Discrete Diffusion Model (MASK Predictor)
- Architecture: A bidirectional Transformer encoder.
- Transformer Configuration:
  - Token embedding dimension: 256
  - Attention heads per layer: 8
  - Number of layers: 4 for the Scientific and Instrument datasets; 6 for the Game dataset.
- Initialization: Model parameters are randomly initialized.
- Training:
  - Loss function: The designed joint loss (Eq. 14).
  - Weighting coefficient $\lambda_{\mathrm{His\text{-}Mask}}$: Tuned over a range of candidate values.
  - Optimizer: AdamW [23]
  - Epochs: 150, with early stopping.
  - Learning Rate: Tuned over a range of candidate values.
  - Weight Decay: Tuned over a range of candidate values.
  - Batch Size: 1,024.
6. Results & Analysis
6.1. Core Results Analysis
The following are the results from Table 3 of the original paper:
(GRU4Rec through DiffuRec are item ID-based baselines; VQ-Rec through LLaDA-Rec are semantic ID-based.)

| Dataset | Metric | GRU4Rec | SASRec | BERT4Rec | FMLP-Rec | LRURec | DreamRec | DiffuRec | VQ-Rec | TIGER | TIGER-SAS | LETTER | LC-Rec | RPG | LLaDA-Rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scientific | Recall@1 | 0.0071 | 0.0063 | 0.0045 | 0.0046 | 0.0049 | 0.0052 | 0.0050 | 0.0076 | 0.0084 | 0.0067 | 0.0082 | 0.0091 | 0.0087 | 0.0098 |
| Scientific | Recall@5 | 0.0184 | 0.0240 | 0.0157 | 0.0181 | 0.0169 | 0.0184 | 0.0190 | 0.0248 | 0.0282 | 0.0221 | 0.0273 | 0.0280 | 0.0257 | 0.0310 |
| Scientific | Recall@10 | 0.0272 | 0.0379 | 0.0264 | 0.0300 | 0.0267 | 0.0299 | 0.0310 | 0.0385 | 0.0446 | 0.0356 | 0.0423 | 0.0434 | 0.0395 | 0.0474 |
| Scientific | NDCG@5 | 0.0128 | 0.0152 | 0.0100 | 0.0113 | 0.0110 | 0.0118 | 0.0119 | 0.0162 | 0.0183 | 0.0144 | 0.0179 | 0.0186 | 0.0174 | 0.0203 |
| Scientific | NDCG@10 | 0.0156 | 0.0197 | 0.0134 | 0.0151 | 0.0141 | 0.0155 | 0.0158 | 0.0206 | 0.0236 | 0.0187 | 0.0227 | 0.0235 | 0.0218 | 0.0256 |
| Instrument | Recall@1 | 0.0094 | 0.0089 | 0.0065 | 0.0086 | 0.0071 | 0.0069 | 0.0077 | 0.0099 | 0.0105 | 0.0102 | 0.0114 | 0.0119 | 0.0118 | 0.0128 |
| Instrument | Recall@5 | 0.0297 | 0.0331 | 0.0255 | 0.0299 | 0.0272 | 0.0245 | 0.0283 | 0.0345 | 0.0359 | 0.0342 | 0.0362 | 0.0379 | 0.0362 | 0.0406 |
| Instrument | Recall@10 | 0.0453 | 0.0525 | 0.0412 | 0.0496 | 0.0431 | 0.0423 | 0.0465 | 0.0532 | 0.0566 | 0.0521 | 0.0562 | 0.0587 | 0.0545 | 0.0623 |
| Instrument | NDCG@5 | 0.0196 | 0.0211 | 0.0160 | 0.0193 | 0.0172 | 0.0157 | 0.0179 | 0.0222 | 0.0233 | 0.0223 | 0.0239 | 0.0251 | 0.0241 | 0.0268 |
| Instrument | NDCG@10 | 0.0246 | 0.0273 | 0.0211 | 0.0257 | 0.0223 | 0.0214 | 0.0237 | 0.0282 | 0.0300 | 0.0280 | 0.0303 | 0.0318 | 0.0300 | 0.0337 |
| Game | Recall@1 | 0.0149 | 0.0128 | 0.0082 | 0.0099 | 0.0134 | 0.0125 | 0.0111 | 0.0150 | 0.0166 | 0.0170 | 0.0169 | 0.0165 | 0.0209 | 0.0203 |
| Game | Recall@5 | 0.0461 | 0.0516 | 0.0315 | 0.0395 | 0.0480 | 0.0381 | 0.0425 | 0.0497 | 0.0529 | 0.0548 | 0.0552 | 0.0567 | 0.0579 | 0.0623 |
| Game | Recall@10 | 0.0712 | 0.0823 | 0.0530 | 0.0649 | 0.0753 | 0.0611 | 0.0709 | 0.0769 | 0.0823 | 0.0847 | 0.0863 | 0.0891 | 0.0853 | 0.0942 |
| Game | NDCG@5 | 0.0307 | 0.0323 | 0.0199 | 0.0246 | 0.0308 | 0.0253 | 0.0268 | 0.0325 | 0.0348 | 0.0360 | 0.0362 | 0.0366 | 0.0397 | 0.0415 |
| Game | NDCG@10 | 0.0387 | 0.0421 | 0.0267 | 0.0328 | 0.0396 | 0.0326 | 0.0359 | 0.0412 | 0.0442 | 0.0457 | 0.0462 | 0.0471 | 0.0485 | 0.0517 |
6.1.1. Overall Performance of LLaDA-Rec
The experimental results consistently demonstrate that LLaDA-Rec achieves state-of-the-art (SOTA) performance across all three datasets (Scientific, Instrument, Game) and all evaluated metrics (Recall@1, @5, @10, NDCG@5, @10).
- Superiority over All Baselines: LLaDA-Rec consistently outperforms both traditional item-ID-based approaches and existing generative semantic-ID-based approaches, confirming the effectiveness of the proposed discrete diffusion training and inference mechanisms together with the Multi-Head VQ-VAE parallel tokenization.
  - For example, on the Scientific dataset, LLaDA-Rec achieves a Recall@5 of 0.0310, surpassing the best semantic-ID-based baseline LC-Rec (0.0280) and all item-ID-based baselines. Similar trends hold for the other metrics and datasets.
- Generative vs. Traditional ID-based Methods: Generative recommendation methods based on semantic IDs (e.g., VQ-Rec, TIGER, LETTER, LC-Rec, RPG, LLaDA-Rec) generally outperform traditional ID-based methods (e.g., GRU4Rec, SASRec, BERT4Rec, FMLP-Rec, LRURec). This reinforces the advantage of semantic IDs for capturing richer semantic correlations between items and the benefits of generative approaches in general.
- Parallel Semantic IDs: Both RPG and LLaDA-Rec, which use parallel semantic IDs, achieve promising results compared to hierarchical RQ-VAE-based methods. The superior performance of LLaDA-Rec further highlights the benefits of its discrete diffusion framework over RPG's single-step parallel generation.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 4 of the original paper:
| Model | Scientific R@5 | Scientific N@5 | Instrument R@5 | Instrument N@5 | Game R@5 | Game N@5 |
|---|---|---|---|---|---|---|
| LLaDA-Rec | 0.0310 | 0.0203 | 0.0406 | 0.0268 | 0.0623 | 0.0415 |
| **Tokenizer** | | | | | | |
| RQ-VAE | 0.0293 | 0.0191 | 0.0367 | 0.0244 | 0.0604 | 0.0399 |
| RQ-Kmeans | 0.0250 | 0.0165 | 0.0344 | 0.0224 | 0.0552 | 0.0370 |
| OPQ | 0.0237 | 0.0155 | 0.0340 | 0.0229 | 0.0552 | 0.0362 |
| **Training** | | | | | | |
| w/o LHis-Mask | 0.0255 | 0.0169 | 0.0321 | 0.0209 | 0.0544 | 0.0356 |
| w/o LItem-Mask | 0.0264 | 0.0172 | 0.0355 | 0.0231 | 0.0571 | 0.0376 |
| **Inference** | | | | | | |
| w/o Beam Search | 0.0077 | 0.0077 | 0.0091 | 0.0091 | 0.0162 | 0.0162 |
6.2.1. Tokenizer Ablation
- Comparison: LLaDA-Rec with its Multi-Head VQ-VAE is compared against other common semantic ID generation methods: RQ-VAE [29, 48], RQ-Kmeans [4], and OPQ [12, 14].
- Results: Multi-Head VQ-VAE consistently outperforms all other tokenization methods. RQ-VAE performs worse than the Multi-Head VQ-VAE variant but still better than RQ-Kmeans and OPQ. Clustering-based approaches (RQ-Kmeans, OPQ) show the lowest performance.
- Analysis:
  - Mismatch with RQ: The inferior performance of semantic IDs derived from residual quantization (RQ-VAE, RQ-Kmeans) supports the hypothesis that RQ methods are not well-aligned with bidirectional Transformers. RQ imposes a hierarchy in which earlier tokens are more dominant, which conflicts with the uniformly distributed token importance in a bidirectional architecture like LLaDA-Rec (see the sketch after this list).
  - Robustness of the LLaDA-Rec framework: Even when Multi-Head VQ-VAE is replaced with RQ-VAE, LLaDA-Rec still often surpasses the baselines (e.g., compare LLaDA-Rec with RQ-VAE to LC-Rec or TIGER in Table 3), suggesting the overall robustness and architectural advantages of the discrete diffusion generative framework itself.
  - VAE-based quantization superiority: The stronger results of RQ-VAE and Multi-Head VQ-VAE over clustering-based approaches (RQ-Kmeans, OPQ) indicate that VAE-based quantization generally offers greater representational capacity thanks to its learned encoding and decoding.
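To make the mismatch concrete, the following is a minimal sketch (not the authors' code) contrasting residual quantization, which assigns tokens sequentially over a shared embedding space so that earlier tokens carry the dominant information, with multi-head quantization in the spirit of Multi-Head VQ-VAE, which assigns tokens in parallel over disjoint slices of the embedding. The codebooks here are random toy tensors and all sizes are illustrative.

```python
import torch

def residual_quantize(x, codebooks):
    # Hierarchical: each codebook encodes the residual left by the previous
    # ones, so earlier tokens capture the coarse, dominant structure.
    ids, residual = [], x.clone()
    for cb in codebooks:                                   # cb: (K, d)
        idx = torch.cdist(residual.unsqueeze(0), cb).argmin().item()
        ids.append(idx)
        residual -= cb[idx]
    return ids

def multi_head_quantize(x, codebooks):
    # Parallel: the embedding is split into equal slices and each slice is
    # quantized independently, so no token position dominates the others.
    return [torch.cdist(c.unsqueeze(0), cb).argmin().item()
            for c, cb in zip(x.chunk(len(codebooks)), codebooks)]

d, H, K = 256, 4, 8                                        # toy sizes
x = torch.randn(d)                                         # item embedding
rq_books = [torch.randn(K, d) for _ in range(H)]           # full-dimension codebooks
mh_books = [torch.randn(K, d // H) for _ in range(H)]      # per-slice codebooks
print(residual_quantize(x, rq_books), multi_head_quantize(x, mh_books))
```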
6.2.2. Training Ablation
- Impact of User-History level masking (w/o LHis-Mask): When the User-History level masking loss (Eq. 11) is removed, Recall@5 and NDCG@5 drop significantly across all datasets.
  - Analysis: This confirms that the loss is crucial for enabling the MASK predictor to capture inter-item sequential dependencies and global dependencies among all tokens within the user's interaction history. Without it, the model's understanding of the historical context is diminished.
- Impact of Next-Item level masking (w/o LItem-Mask): Similarly, removing the Next-Item level masking loss (Eq. 12) also leads to a notable performance degradation.
  - Analysis: This loss is essential for teaching the model intra-item semantics (relationships among tokens within the same item) and for conditioning the generation of the next item on the given history. Its absence weakens the model's ability to compose coherent and relevant semantic IDs for recommended items.
- Conclusion: Both masking mechanisms contribute significantly to the overall effectiveness of LLaDA-Rec, highlighting the importance of capturing both inter-item and intra-item dependencies. A rough sketch of how the two objectives could be combined is given below.
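As an illustration only (not the authors' implementation), the sketch below assumes a hypothetical bidirectional `predictor` mapping a corrupted token sequence of shape (batch, length) to per-position logits; `MASK_ID`, the mask ratio, and the single weighting coefficient `lam` are assumptions that mirror the joint loss of Eq. 14 only loosely.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the special [MASK] token

def masked_ce(predictor, tokens, maskable, mask_ratio):
    # Randomly mask a subset of the maskable positions, predict them with the
    # bidirectional MASK predictor, and score cross-entropy on those positions.
    mask = maskable & (torch.rand(tokens.shape) < mask_ratio)
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = predictor(corrupted)                              # (B, seq_len, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])

def joint_loss(predictor, history, next_item, lam, mask_ratio=0.5):
    # history: (B, hist_len) token ids; next_item: (B, id_len) token ids.
    seq = torch.cat([history, next_item], dim=1)
    hist_maskable = torch.ones_like(seq, dtype=torch.bool)     # user-history level:
    item_maskable = torch.zeros_like(seq, dtype=torch.bool)    #   any token may be masked
    item_maskable[:, history.size(1):] = True                  # next-item level: only the
    l_his = masked_ce(predictor, seq, hist_maskable, mask_ratio)   # next item's tokens
    l_item = masked_ce(predictor, seq, item_maskable, mask_ratio)
    return l_item + lam * l_his       # lam plays the role of the Eq. 14 weighting coefficient
```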
6.2.3. Inference Ablation
- Impact of Beam Search (w/o Beam Search): Removing the adapted beam search strategy (i.e., using a greedy strategy that returns only the top-1 result, as standard diffusion language models typically sample) causes a substantial drop in performance. The table shows Recall@5 and NDCG@5 values that are extremely low and identical (e.g., 0.0077 for Scientific), because only a single top-1 item is available for these top-k metrics, which inherently caps their values.
- Analysis: This confirms the critical importance of the adapted beam search strategy for generative recommendation. Recommender systems must produce a ranked list of top-k items, not just a single top-1 prediction, and the ability to explore multiple candidate sequences and keep the best ones is essential for high Recall and NDCG at various cutoffs. Adapting beam search to the adaptive generation order of discrete diffusion is therefore a key enabling component of LLaDA-Rec's success; a sketch of the idea follows.
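The paper's exact decoding procedure is not reproduced in this analysis; the sketch below shows one way beam search could be married to adaptive-order unmasking, assuming a hypothetical `predictor(history, tokens)` that returns per-position logits over the semantic-ID vocabulary. The joint scoring of the positions filled within one step is simplified relative to a full implementation.

```python
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token

def diffusion_beam_search(predictor, history, id_len, beam_size, steps):
    # Each beam is a partially filled semantic ID. At every step a beam fills
    # its most confident masked positions; the top `beam_size` extensions by
    # cumulative log-probability survive. Beams are returned ranked best-first,
    # which yields the ranked list that Recall@k / NDCG@k require.
    beams = [(0.0, torch.full((id_len,), MASK_ID))]
    per_step = -(-id_len // steps)                           # ceil: positions per step
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            masked = (tokens == MASK_ID).nonzero().flatten()
            if masked.numel() == 0:                          # beam already complete
                candidates.append((score, tokens))
                continue
            logprobs = predictor(history, tokens).log_softmax(-1)   # (id_len, vocab)
            conf = logprobs[masked].max(-1).values           # adaptive order: fill the
            fill = masked[conf.argsort(descending=True)][:per_step] # most confident slots
            top = logprobs[fill].topk(beam_size, dim=-1)
            for b in range(beam_size):                       # one joint extension per rank
                new = tokens.clone()
                new[fill] = top.indices[:, b]
                candidates.append((score + top.values[:, b].sum().item(), new))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams
```

In this sketch, greedy decoding corresponds to `beam_size = 1`, which returns a single candidate and explains why Recall@5 and NDCG@5 collapse to the same value in the "w/o Beam Search" row of Table 4.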
6.2.4. Impact of the Attention Mechanism
The following figure (Figure 3 from the original paper) illustrates the attention masks and performance for different attention mechanisms:
Figure 3 (from the original paper): Comparison of attention mechanisms. Panel (a) shows the attention masks for the Causal, Inter-Item Causal, Intra-Item Causal, and Bidirectional mechanisms; panels (b) and (c) report results on the Instrument and Game datasets using NDCG@5 and Recall@5.
- Attention masks (Figure 3a):
  - Causal attention: Each token can attend only to itself and preceding tokens, as is typical for autoregressive models.
  - Inter-item causal attention: Within each item's semantic ID, tokens attend bidirectionally; across items in the history, attention is causal (only to previous items).
  - Intra-item causal attention: Within each item's semantic ID, attention is causal; across items in the history, attention is bidirectional.
  - Bidirectional attention: Each token can attend to all other tokens in the entire sequence (both within its own item's semantic ID and across the user history), providing full context. This is what LLaDA-Rec uses. (A sketch constructing these four masks follows the list below.)
- Performance (Figure 3b, 3c):
  - Bidirectional attention consistently yields the best NDCG@5 and Recall@5 on both the Instrument and Game datasets, owing to its ability to capture contextual dependencies from both directions.
  - Causal attention performs the worst; its unidirectional constraint severely limits its ability to exploit contextual information, producing the lowest Recall and NDCG values.
  - Inter-item causal and intra-item causal attention achieve competitive performance, typically falling between causal and bidirectional. This shows that incorporating bidirectional attention, whether within items or across items in the history, is crucial for effective contextual modeling; the best results are obtained when it is applied universally.
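The four patterns can be expressed as boolean attention masks. The sketch below is an illustration (not the paper's code) that builds them for a toy sequence grouped into items of equal length.

```python
import torch

def build_attention_mask(num_items, tokens_per_item, kind):
    # Returns a boolean matrix whose entry (q, k) is True if query position q
    # may attend to key position k; positions are grouped into consecutive items.
    n = num_items * tokens_per_item
    pos = torch.arange(n)
    item = pos // tokens_per_item                          # item index of each token
    token_causal = pos.unsqueeze(1) >= pos.unsqueeze(0)    # q is at or after k
    item_causal = item.unsqueeze(1) >= item.unsqueeze(0)   # q's item is at or after k's
    same_item = item.unsqueeze(1) == item.unsqueeze(0)
    if kind == "causal":             # left-to-right at the token level
        return token_causal
    if kind == "inter_item_causal":  # bidirectional inside an item, causal across items
        return item_causal
    if kind == "intra_item_causal":  # causal inside an item, bidirectional across items
        return (token_causal & same_item) | ~same_item
    return torch.ones(n, n, dtype=torch.bool)              # fully bidirectional

# Example: two items of two tokens each, inter-item causal pattern.
print(build_attention_mask(2, 2, "inter_item_causal").int())
```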
6.2.5. Impact of Generation Order
The following figure (Figure 4 from the original paper) illustrates the performance under different generation orders:
Figure 4 (from the original paper): Performance under different generation orders. Panel (a) shows NDCG@5 and Recall@5 on Instrument, and panel (b) shows the same metrics on Game; differently colored bars denote the fixed left-to-right, fixed right-to-left, and adaptive generation orders.
- Comparison: LLaDA-Rec's adaptive generation order is compared against fixed left-to-right (left2right) and fixed right-to-left (right2left) orders.
- Results: The adaptive approach consistently delivers the highest NDCG@5 and Recall@5 on both the Instrument and Game datasets.
- Analysis: The left2right order, which is standard in autoregressive models, occasionally produces the poorest results, underscoring the limitations of a rigidly fixed generation order when errors can propagate. By dynamically choosing the generation order, prioritizing easier tokens (those with higher model confidence) and iteratively refining its predictions, LLaDA-Rec mitigates error accumulation and produces more accurate item semantic IDs. A greedy decoding sketch contrasting these orders is shown below.
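The sketch below is illustrative only and again assumes a hypothetical `predictor(history, tokens)` returning per-position logits; it contrasts the three orders in a greedy decoding loop, and its `steps` argument is the same quantity analyzed in the next subsection.

```python
import torch

MASK_ID = 0  # hypothetical id of the special [MASK] token

def decode(predictor, history, id_len, steps, order="adaptive"):
    # At each of `steps` rounds an equal share of the still-masked positions is
    # filled, chosen left-to-right, right-to-left, or adaptively by confidence.
    tokens = torch.full((id_len,), MASK_ID)
    per_step = -(-id_len // steps)                           # ceil division
    for _ in range(steps):
        masked = (tokens == MASK_ID).nonzero().flatten()
        if masked.numel() == 0:
            break
        logprobs = predictor(history, tokens).log_softmax(-1)   # (id_len, vocab)
        if order == "left2right":
            fill = masked[:per_step]
        elif order == "right2left":
            fill = masked[-per_step:]
        else:                                                # adaptive: most confident first
            conf = logprobs[masked].max(-1).values
            fill = masked[conf.argsort(descending=True)][:per_step]
        tokens[fill] = logprobs[fill].argmax(-1)             # commit, then re-predict the rest
    return tokens
```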
6.2.6. Impact of Generation Steps
The following figure (Figure 5 from the original paper) illustrates the performance under different generation steps:
Figure 5 (from the original paper): Recommendation performance under different numbers of generation steps, with Instrument on the left and Game on the right; each panel shows NDCG@5 (blue bars) and Recall@5 (red line) across the five step settings.
- Comparison: Performance (NDCG@5, Recall@5) is analyzed as the number of generation steps varies from 1 to 5. Recall that a semantic ID contains a fixed number of tokens and that an equal share of them is generated at each step, so more steps mean fewer tokens generated per step and more rounds of iterative refinement.
- Results: Increasing the number of generation steps generally improves performance on both the Instrument and Game datasets. A single step yields the lowest performance, and results gradually improve as the number of steps increases to 5.
- Analysis: This indicates that the iterative refinement process of discrete diffusion is effective. More steps allow the model to re-evaluate and re-predict masked tokens multiple times, leveraging updated context from already generated (high-confidence) tokens, thereby improving accuracy. However, fewer steps substantially improve generation efficiency, so there is a trade-off between efficiency (fewer steps) and performance (more steps). The authors note that achieving a better balance with fewer steps remains an open question, pointing to recent studies on diffusion language models that explore this (e.g., [2, 10]).
6.2.7. Impact of Hyper-parameters
The following figure (Figure 6 from the original paper) illustrates performance under different values of the weighting coefficient in Eq. (14):
Figure 6 (from the original paper): NDCG@5 and Recall@5 under different values of the weighting coefficient, shown as bars and lines; the left panel reports the Instrument dataset and the right panel the Game dataset.
- Comparison: The impact of the weighting coefficient, which balances the User-History level masking loss within the total training loss, is investigated over values from 0 to 5.
- Results:
  - When the coefficient is 0 (i.e., User-History level masking is not applied), performance is relatively low.
  - Performance generally improves as the coefficient increases from 0 to about 2 or 3.
  - However, if the coefficient becomes too large (e.g., 4 or 5), performance starts to decline or plateau.
- Analysis:
  - A moderate value is beneficial because it lets the model effectively learn global dependencies among all tokens within the user history, building a richer contextual understanding.
  - If the coefficient is set too high, the model may over-emphasize learning patterns within the history itself (the goal of the User-History level masking loss) at the expense of its main task of predicting the next item conditioned on that history (driven by the Next-Item level masking loss), which can hinder its ability to generate relevant recommendations. Careful tuning of this hyperparameter is therefore needed to find an optimal balance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LLaDA-Rec, a novel generative recommendation framework that leverages discrete diffusion models to address key limitations of existing autoregressive approaches: unidirectional constraints and error accumulation. By reformulating recommendation as parallel semantic ID generation, LLaDA-Rec incorporates bidirectional attention and an adaptive generation order to enhance the modeling of inter-item sequential dependencies and intra-item semantic relationships, while mitigating the propagation of prediction errors.
The framework's core innovations include:
- Parallel Tokenization: A Multi-Head VQ-VAE scheme that generates semantic IDs suitable for bidirectional modeling, resolving the mismatch with hierarchical quantization methods.
- Dual Masking Mechanisms: Distinct User-History level and Next-Item level masking strategies that train the discrete diffusion model effectively for recommendation tasks.
- Adapted Beam Search: A tailored beam search strategy that enables top-k recommendation generation with adaptive-order discrete diffusion decoding.

Extensive experiments on three real-world datasets consistently show that LLaDA-Rec achieves state-of-the-art performance, outperforming both traditional item-ID-based recommenders and existing semantic-ID-based generative recommendation models. This work establishes discrete diffusion as a powerful new paradigm for generative recommendation.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to "Limitations and Future Work." However, some implicit limitations and future directions can be inferred from the experimental analysis:
- Efficiency vs. performance trade-off in generation steps: As shown in the "Impact of Generation Steps" analysis (Figure 5), increasing the number of generation steps improves performance but inherently reduces efficiency. The paper states that "How to achieve a better trade-off between efficiency and performance with fewer steps remains an open question," suggesting that optimizing the multi-step generation process for faster inference without significant performance degradation is a key direction for future research. This could involve techniques such as knowledge distillation or more advanced sampling schedules for diffusion models.
- Computational cost of diffusion: While not explicitly stated as a limitation, discrete diffusion models, especially with many generation steps and beam search, can be computationally intensive at inference time compared to single-pass autoregressive models. Optimizing this aspect is a general challenge for diffusion models.
- Dependence on item embeddings: The Multi-Head VQ-VAE relies on high-quality initial item semantic representations (e.g., from Sentence-T5). The performance of the entire system could be sensitive to the quality and robustness of these initial embeddings. Future work might explore end-to-end learning of item representations integrated with the diffusion process.
- Hyperparameter tuning complexity: The model has several hyperparameters (e.g., the VQ-VAE codebook settings, the loss weighting coefficient, learning rates, weight decays, the number of Transformer layers, the beam size, and the number of generation steps). Tuning all of these for optimal performance can be complex and time-consuming.
7.3. Personal Insights & Critique
LLaDA-Rec presents a highly innovative and compelling approach to generative recommendation. The shift from autoregressive to discrete diffusion is a significant architectural advancement that addresses fundamental limitations in current methods.
- Inspirations and transferability: The core idea of adaptive-order generation and iterative refinement through masking and denoising is powerful. It could transfer to other discrete sequence generation tasks where fixed left-to-right generation is suboptimal or suffers from error accumulation, for instance code generation (where semantic coherence across a larger block of code matters), molecule design (generating sequences of chemical units), or dialogue generation, where bidirectional context and the ability to refine earlier parts of a response based on later content could be beneficial. The Multi-Head VQ-VAE for parallel tokenization is also a generalizable way to prepare discrete data for bidirectional models.
- Potential issues and areas for improvement:
  - Inference latency: While the paper addresses top-k generation, the iterative nature of diffusion models with multiple steps and beam search suggests higher latency than single-pass autoregressive methods. This may be a practical concern for real-time recommendation systems with strict latency requirements; further research into fast inference techniques for discrete diffusion is needed.
  - Interpretability of semantic IDs: Although semantic IDs are intended to be semantically rich, their interpretability, i.e., understanding what a given token sequence means to a human, remains challenging. The paper does not examine how these semantic IDs correlate with human-understandable attributes.
  - Cold-start problem: The model still relies on pre-trained item embeddings and a VQ-VAE to generate semantic IDs. For completely new items without textual information or interaction history, the cold-start problem would likely persist. How LLaDA-Rec would handle items with very sparse information, or truly novel items not seen during VQ-VAE training, is not explicitly discussed.
  - Scale of codebooks: Using a fixed number of codebooks, each with a fixed number of codes, bounds the space of possible semantic IDs. While large, this could still be a bottleneck for truly vast item catalogs; dynamic codebook sizes or more flexible quantization could be explored.

Overall, LLaDA-Rec is a rigorous and well-designed piece of research that pushes the boundaries of generative recommendation. Its success demonstrates the untapped potential of discrete diffusion models for complex sequence generation problems in recommender systems.