
DiffGRM: Diffusion-based Generative Recommendation Model

Published: 10/21/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DiffGRM employs masked discrete diffusion and parallel semantic encoding to enable any-order parallel generation of semantic ID digits, addressing limitations of autoregressive models and improving recommendation accuracy and training efficiency.

Abstract

Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID) and predicts the next item by autoregressively generating its SID conditioned on the user's history. However, two structural properties of SIDs make ARMs ill-suited. First, intra-item consistency: the n digits jointly specify one item, yet the left-to-right causality trains each digit only under its prefix and blocks bidirectional cross-digit evidence, collapsing supervision to a single causal path. Second, inter-digit heterogeneity: digits differ in semantic granularity and predictability, while the uniform next-token objective assigns equal weight to all digits, overtraining easy digits and undertraining hard digits. To address these two issues, we propose DiffGRM, a diffusion-based GR model that replaces the autoregressive decoder with a masked discrete diffusion model (MDM), thereby enabling bidirectional context and any-order parallel generation of SID digits for recommendation. Specifically, we tailor DiffGRM in three aspects: (1) tokenization with Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information; (2) training with On-policy Coherent Noising (OCN) that prioritizes uncertain digits via coherent masking to concentrate supervision on high-value signals; and (3) inference with Confidence-guided Parallel Denoising (CPD) that fills higher-confidence digits first and generates diverse Top-K candidates. Experiments show consistent gains over strong generative and discriminative recommendation baselines on multiple datasets, improving NDCG@10 by 6.9%-15.5%. Code is available at https://github.com/liuzhao09/DiffGRM.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "DiffGRM: Diffusion-based Generative Recommendation Model".

1.2. Authors

The authors are:

  • Zhao Liu (Kuaishou Technology, Beijing, China)

  • Yichen Zhu (Kuaishou Technology, Beijing, China)

  • Yiqing Yang (Kuaishou Technology, Beijing, China)

  • Guoping Tang (Kuaishou Technology, Beijing, China)

  • Rui Huang (Kuaishou Technology, Beijing, China)

  • Qiang Luo (Kuaishou Technology, Beijing, China)

  • Xiao Lv (Kuaishou Technology, Beijing, China)

  • Ruiming Tang (Kuaishou Technology, Beijing, China)

  • Kun Gai (Unaffiliated, Beijing, China)

  • Guorui Zhou (Kuaishou Technology, Beijing, China)

    Most authors are affiliated with Kuaishou Technology, a major internet company known for its short-video platform, indicating a strong industry research background with a focus on practical applications in recommendation systems. Kun Gai is listed as unaffiliated.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (submitted 2025-10-21 UTC; https://arxiv.org/abs/2510.21805). While not yet formally peer-reviewed and published in a specific conference or journal, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like machine learning and artificial intelligence, allowing for early sharing and feedback.

1.4. Publication Year

The publication year is 2025, based on the provided UTC timestamp.

1.5. Abstract

Generative recommendation (GR) systems predict the next item by treating each item as an $n$-digit semantic ID (SID) and autoregressively generating its SID based on user history. However, current autoregressive models (ARMs) are ill-suited for SIDs due to two issues: intra-item consistency (digits jointly define an item, but ARMs' left-to-right causality blocks bidirectional context) and inter-digit heterogeneity (digits vary in semantic granularity and predictability, yet ARMs apply a uniform next-token objective, leading to imbalanced training).

To overcome these, the paper proposes DiffGRM, a diffusion-based GR model. DiffGRM replaces the autoregressive decoder with a masked discrete diffusion model (MDM), enabling bidirectional context and parallel, any-order generation of SID digits. It tailors the MDM in three key aspects:

  1. Tokenization: Employs Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information, contrasting with residual quantization.

  2. Training: Introduces On-policy Coherent Noising (OCN) which prioritizes uncertain digits through coherent masking, concentrating supervision on high-value signals and avoiding combinatorial explosion of masking patterns.

  3. Inference: Develops Confidence-guided Parallel Denoising (CPD) which fills higher-confidence digits first and generates diverse Top-K candidates through a global parallel beam search.

    Experiments demonstrate that DiffGRM achieves consistent gains over strong generative and discriminative recommendation baselines across multiple datasets, improving NDCG@10 by 6.9%-15.5%.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the emerging paradigm of Generative Recommendation (GR). In GR, items are represented as sequences of semantic IDs (SIDs), and the recommendation task is framed as predicting the SID of the next item a user might interact with. This prediction is typically handled by autoregressive models (ARMs), often GPT-style Transformers, which generate the SID digits one by one, from left to right, conditioned on the user's historical interactions.

This problem is important because GR offers several advantages, such as unifying item representation and prediction, benefiting from large-scale language modeling techniques, and supporting open-vocabulary recommendation. However, the paper identifies two critical structural properties of SIDs that make ARMs ill-suited, posing significant challenges:

  1. Intra-item consistency: The $n$ digits of an SID are not independent tokens like words in a sentence; they collectively and jointly specify one single item. For instance, a SID might encode "Dior Rouge 999 Velvet" across its digits. ARMs enforce a strict left-to-right causal dependency, meaning each digit is predicted only based on its preceding digits. This inherently blocks the use of bidirectional context or "cross-digit evidence" that could help verify the overall item identity, leading to a collapse of supervision into a single causal path. This can cause early errors to propagate through the entire SID generation.

  2. Inter-digit heterogeneity: The SID digits often encode different semantic granularities (e.g., Category, Brand, Type, Size). Consequently, these digits differ significantly in their semantic load, predictability, and inherent difficulty. For example, predicting a general Category might be easier than a specific Size. Yet, the standard next-token objective used by ARMs assigns equal weight to predicting every digit. This uniform supervision leads to an imbalance: easy digits can be overtrained, while hard digits receive insufficient training signal, hindering accurate SID generation.

    The paper's entry point and innovative idea is to draw inspiration from the rapid advancements in discrete diffusion modeling to replace the ARM decoder. Masked Discrete Diffusion Models (MDMs) inherently support bidirectional context, parallel generation, and richer supervision through random noising, which align better with the structural characteristics of SID representations. However, directly applying MDMs is not optimal; task-specific adaptations for recommendation are necessary.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Introduction of DiffGRM: Proposing DiffGRM, the first diffusion-based generative recommendation framework. It replaces the autoregressive decoder with a masked discrete diffusion model (MDM) operating over SID digits. This fundamental shift removes left-to-right causal constraints, enabling the exploitation of bidirectional cross-digit context, which is crucial for intra-item consistency.

  2. Novel Adaptations for Tokenization, Training, and Inference:

    • Parallel Semantic Encoding (PSE): For tokenization, DiffGRM adopts PSE (e.g., OPQ-based) to decouple SID digits. This move away from residual quantization (RQ) ensures digits are independent, balances per-digit information, and allows for fully parallel prediction, addressing the inter-digit heterogeneity and intra-item consistency issues at the representation level.
    • On-policy Coherent Noising (OCN): For training, OCN addresses the combinatorial explosion of supervision signals in MDMs when applied to SIDs. It uses the current model's confidence to identify and prioritize the most "uncertain" or "hard" digits for masking, thereby focusing the training budget on high-value signals. This improves sample efficiency and allocates supervision more effectively than random masking.
    • Confidence-guided Parallel Denoising (CPD): For inference, CPD is designed to meet the recommendation task's need for diverse Top-K candidates, unlike typical MDM greedy decoding for single outputs. CPD performs a global parallel beam search, filling higher-confidence digits first and then completing the rest, yielding accurate and diverse Top-K SID candidates.
  3. State-of-the-Art Performance: The paper demonstrates that DiffGRM achieves state-of-the-art results across multiple public datasets (Amazon Reviews: Sports, Beauty, Toys). It consistently outperforms strong generative and discriminative recommendation baselines, improving NDCG@10 by a significant margin (6.9%-15.5%). This finding strongly validates the accuracy and generalization strength of the DiffGRM framework.

    These contributions collectively solve the identified problems by reconciling the joint nature of SID digits with parallel generation, providing a more balanced and efficient training mechanism, and enabling effective Top-K recommendation, thereby advancing the field of generative recommendation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand DiffGRM, several foundational concepts in recommendation systems and deep learning are essential.

  • Recommendation Systems (RS): Systems designed to predict user preferences and suggest items (e.g., products, movies, articles) that a user might like. They are crucial for personalization in various online platforms.
  • Sequential Recommendation: A subfield of RS where the order of user interactions (e.g., a sequence of purchased items) is important. The goal is to predict the next item in a sequence based on past interactions.
  • Generative Recommendation (GR): An emerging paradigm in RS that frames the recommendation task as generating the representation of the target item. Instead of scoring existing items, GR aims to directly produce the "description" or "ID" of a suitable next item.
  • Semantic ID (SID): In GR, an item is not represented by a simple integer ID but by a sequence of $n$ discrete digits, called a semantic ID. Each digit typically corresponds to a code from a learned codebook, and collectively, these $n$ digits encode the item's semantic features (e.g., brand, category, style). This allows for representing items based on their content and properties rather than just arbitrary identifiers.
  • Tokenization: The process of converting raw item content (e.g., text descriptions, images) into a discrete, fixed-length sequence of SIDs. This usually involves an encoder that maps content to dense vectors, followed by a quantization step that discretizes these vectors into tokens (digits).
    • Vector Quantization (VQ): A technique used in tokenization to map continuous vectors (embeddings) to discrete codes (tokens). It involves learning a set of "codebook" vectors (centroids), and each continuous vector is assigned the ID of its nearest codebook vector.
    • Codebook: A collection of discrete codes or centroids used in Vector Quantization. In SID generation, there are typically $n$ codebooks, one for each digit, each containing $M$ possible codes.
  • Autoregressive Models (ARMs): A class of generative models that predict future elements in a sequence based on past elements. In the context of language models and GR, this means predicting the next token (or SID digit) conditioned on all previously generated tokens in a left-to-right fashion. GPT-style Transformers are prominent examples of ARMs.
  • Transformer: A neural network architecture introduced by Vaswani et al. (2017) that relies heavily on the self-attention mechanism. It has become foundational for many sequence-to-sequence tasks, including natural language processing and sequential recommendation.
    • Self-Attention: A mechanism within Transformers that allows a model to weigh the importance of different parts of an input sequence when processing a specific element. Unlike recurrent neural networks (RNNs), it can capture long-range dependencies efficiently.
    • Encoder-Decoder Architecture: A common Transformer configuration. The encoder processes the input sequence (e.g., user history) to create a contextual representation, and the decoder uses this representation to generate the output sequence (e.g., target SID).
  • Diffusion Models: A class of generative models that learn to reverse a gradual noising process.
    • Discrete Diffusion Models (DDMs): Adaptations of diffusion models for discrete data (like text tokens or SID digits). They typically involve a "forward" process that gradually corrupts clean data by replacing tokens with a special [MASK] token or randomly sampling from the vocabulary, and a "reverse" process that learns to predict the original clean data from the corrupted version.
    • Masked Discrete Diffusion Model (MDM): A specific type of DDM where the corruption process involves masking out a subset of tokens in the sequence, and the model learns to predict the original tokens for these masked positions in parallel. This is in contrast to autoregressive models that predict one token at a time.
  • Cross-Entropy Loss: A common loss function used in classification tasks, including next-token prediction and masked language modeling. It measures the difference between the predicted probability distribution over classes (e.g., possible SID digits) and the true distribution.
  • Label Smoothing: A regularization technique for classification models that prevents them from becoming overconfident. Instead of using hard one-hot labels (e.g., [0, 1, 0]), it softens the target distribution by giving a small probability mass to incorrect labels (e.g., [0.05, 0.9, 0.05]). This can improve generalization.

3.2. Previous Works

The paper contextualizes DiffGRM against both Generative Recommendation Models and Discrete Diffusion Language Models.

3.2.1. Generative Recommendation Models

Prior GR approaches often cast recommendation as sequence generation, where items are discretized into SIDs and a Transformer predicts the target SID token by token. Key examples include:

  • TIGER [48]: A representative model that uses Residual Quantization (RQ)-VAE for tokenizing items into SIDs and then decodes them autoregressively. This is a primary baseline for DiffGRM, highlighting the ARM approach that DiffGRM aims to improve upon.

  • HSTU [65]: Frames recommendation as large-scale sequence transduction, discretizing raw item features into tokens.

  • RPG [20]: A generative model that predicts unordered semantic IDs in parallel using a multi-token objective, combined with graph-guided decoding. This model is a closer conceptual relative to DiffGRM in its aim for parallel prediction, but DiffGRM leverages diffusion rather than a multi-token objective and graph decoding.

  • Other AR-based GRs: GenNewsRec [9], MTGRec [72], ETEGRec [32] focus on various aspects like integrating LLM reasoning, enhancing quantization, or improving token quality.

  • ActionPiece [21]: Focuses on context-aware tokenization, representing actions as unordered sets of item features.

    The common thread among many of these, especially TIGER, is the reliance on residual quantization (RQ) and autoregressive decoding, which DiffGRM identifies as problematic.

3.2.2. Discrete Diffusion Language Models

Diffusion models originated for continuous data (DDPMs [16, 37, 56, 57]) and were later extended to discrete spaces [1, 3, 17].

  • Structured Denoising Diffusion Models [1, 2, 3]: These works laid the groundwork for applying diffusion to discrete state-spaces, moving beyond continuous data. They explore how to define forward noising and reverse denoising processes for discrete tokens.

  • Masked Diffusion Language Models (MDMs): Specifically, MDMs (e.g., [42, 64]) are relevant as they corrupt sequences by masking tokens and learn to predict them in parallel. This mechanism is central to DiffGRM's decoder.

  • Advanced Sampling Strategies [35, 44, 46]: Research has focused on improving reverse-sampling strategies for DDMs to enhance performance and efficiency in natural language tasks.

    The paper notes that most advances in DDMs are targeted at free-form, single-output text generation, whereas GR requires structured $n$-digit SIDs and a Top-K candidate set, necessitating specific adaptations.

3.3. Technological Evolution

Recommendation systems have evolved significantly:

  1. Early Systems (e.g., Collaborative Filtering): Focused on user-item interaction patterns without deep understanding of item content.
  2. Item ID-based Discriminative Models (e.g., GRU4Rec, SASRec, BERT4Rec): These models learn embeddings for item IDs and predict the next item ID directly. They improved sequential modeling but often lacked semantic richness.
  3. Semantic-enhanced Discriminative Models (e.g., FDSA, $S^3$-Rec, vQ-Rec, RecJPQ): Incorporated item content features (text, images) alongside item IDs, often through pre-trained language models or quantization, to enrich item representations and improve prediction.
  4. Semantic ID-based Generative Models (e.g., TIGER, HSTU, RPG, ActionPiece): This is the direct lineage for DiffGRM. These models move beyond just predicting the next item ID to generating its semantic representation (SID). This allows for open-vocabulary recommendation (generating items not seen during training) and leveraging powerful Transformer architectures from NLP. Initially, these were predominantly autoregressive.
  5. Diffusion-based Generative Models (DiffGRM): DiffGRM represents a new evolutionary step within generative recommendation, replacing the autoregressive generation of SIDs with diffusion-based parallel generation. This aims to overcome the inherent limitations of ARMs (causality, uniform supervision) when dealing with the unique structure of SIDs. Simultaneously, it adapts discrete diffusion models, which have seen success in NLP, to the specific requirements of Top-K recommendation for structured SIDs.

3.4. Differentiation Analysis

Compared to the main methods in related work, DiffGRM introduces several core differences and innovations:

  • From Autoregressive Generation to Masked Diffusion: The most fundamental difference is replacing the autoregressive decoder (common in TIGER, HSTU) with a masked discrete diffusion model. This shifts from sequential, left-to-right generation to parallel, any-order generation.
    • Innovation: ARMs suffer from intra-item consistency issues (lack of bidirectional context) and inter-digit heterogeneity (uniform supervision). MDMs naturally provide bidirectional context and can offer richer, more flexible supervision signals, directly tackling these limitations. RPG also aims for parallel prediction but uses a multi-token objective and graph-guided decoding, which is different from the diffusion process.
  • From Residual Quantization (RQ) to Parallel Semantic Encoding (PSE): Most generative recommender tokenizers, like RQ-VAE in TIGER, use RQ. RQ introduces residual dependencies and unbalanced information distribution across digits.
    • Innovation: PSE (e.g., OPQ-based) decouples SID digits into independent subspaces. This balances per-digit information and removes the sequential coupling inherent in RQ, which aligns much better with the MDM's parallel and any-order prediction capabilities.
  • Novel Training Strategy (OCN): MDMs can face a combinatorial explosion of masking patterns and fragmented supervision.
    • Innovation: OCN is a task-specific training strategy for MDMs in GR. It uses an "on-policy" approach to identify and prioritize the "hardest" or most uncertain digits based on the current model's confidence. By coherently masking these digits and constructing nested views, OCN focuses the training budget on high-value signals, improving sample efficiency and optimizing supervision allocation.
  • Novel Inference Strategy (CPD): Standard MDMs often rely on greedy decoding for a single output, which is insufficient for Top-K recommendation.
    • Innovation: CPD adapts MDM inference for recommendation by implementing a Confidence-guided Parallel Denoising process. It performs a global parallel beam search, filling higher-confidence digits first. This allows DiffGRM to generate diverse Top-K SID candidates, a crucial requirement for recommendation systems, while still leveraging the MDM's parallel capabilities.

      In essence, DiffGRM innovatively merges the strengths of discrete diffusion models with specific adaptations tailored to the unique challenges of semantic ID representations in recommendation, thereby addressing fundamental limitations of previous autoregressive generative recommenders.

4. Methodology

4.1. Principles

The core idea behind DiffGRM is to replace the autoregressive decoder in existing Generative Recommendation (GR) frameworks with a masked discrete diffusion model (MDM). This fundamental shift is motivated by the desire to overcome two key limitations of ARMs when dealing with Semantic IDs (SIDs):

  1. Intra-item consistency: ARMs predict SID digits sequentially (left-to-right), which prevents bidirectional context and mutual verification among digits that jointly define a single item. MDMs naturally enable bidirectional context and parallel prediction.

  2. Inter-digit heterogeneity: ARMs apply a uniform next-token objective to all SID digits, regardless of their varying semantic granularity and predictability. MDMs, through their masking mechanism, can be designed to allocate supervision more strategically.

    The theoretical basis and intuition are that a diffusion model, by learning to denoise a corrupted input, implicitly learns the underlying data distribution and can generate diverse samples. By applying a masking corruption process to SID digits, the model is trained to predict the original digits given a partial context. This allows it to learn dependencies among all digits, not just causal prefixes, and to focus on more difficult predictions. The proposed DiffGRM tailors this MDM framework specifically for GR through three key adaptations: Parallel Semantic Encoding (PSE), On-policy Coherent Noising (OCN), and Confidence-guided Parallel Denoising (CPD).

4.2. Core Methodology In-depth (Layer by Layer)

DiffGRM operates as an encoder-decoder architecture. The encoder processes the user's interaction history, and the MD-Decoder generates the $n$-digit SID of the next item.

4.2.1. Overall Architecture and Workflow

The workflow begins by processing raw item content. Items are first tokenized into SIDs using Parallel Semantic Encoding (PSE). This converts each user's interaction history into a sequence of SIDs.

  1. Encoder: For each item SID in the user's history, its $n$ digits are embedded, concatenated, and then projected into an item vector. Positional embeddings are added to this sequence of item vectors, and the entire sequence is fed into a Transformer encoder. The encoder outputs a contextual representation of the user history, denoted $\mathbf{H}_u \in \mathbb{R}^{L_{\mathrm{input}} \times d_m}$, where $L_{\mathrm{input}}$ is the encoder input length and $d_m$ is the model's hidden dimension. This $\mathbf{H}_u$ summarizes the user's past behavior.

  2. MD-Decoder: The MD-Decoder takes a partially masked $n$-digit input $\mathbf{y}$, which represents the target item's SID. Unlike autoregressive decoders, it applies non-causal (bidirectional) self-attention across all digits of $\mathbf{y}$, so each digit can attend to every other digit, enabling bidirectional intra-item semantics and cross-digit mutual verification. The MD-Decoder then performs cross-attention to $\mathbf{H}_u$ (the encoder-side keys and values are derived from $\mathbf{H}_u$), integrating user-history context. Finally, it predicts all masked digits in parallel.

    During multi-view training, the encoder output $\mathbf{H}_u$ is computed once per sample and cached. The encoder-side keys and values (the key and value projections of the encoder output used in cross-attention) are also cached and reused across different "views" (masked inputs) and CPD steps, amortizing the cross-attention cost. A minimal sketch of this encoder-decoder flow is given below.
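
Below is a minimal PyTorch sketch of the encoder/MD-Decoder data flow described above. It is an illustration under assumed shapes and off-the-shelf `nn.Transformer*` modules, not the authors' implementation; the embedding layout, layer counts, and the sizes `n_digits`, `M`, `d_model`, and `L_hist` are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the paper's actual hyperparameters may differ.
n_digits, M, d_model, L_hist = 4, 256, 128, 20
MASK_ID = M  # reserve one extra id per digit vocabulary for [MASK]

digit_emb = nn.Embedding(n_digits * (M + 1), d_model)  # per-digit codebooks (+ [MASK]), flattened
item_proj = nn.Linear(n_digits * d_model, d_model)     # concat digit embeddings -> one item vector
pos_emb = nn.Embedding(L_hist, d_model)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
out_head = nn.Linear(d_model, M)                       # per-digit logits over the codebook

def encode_history(hist_sids):                         # hist_sids: (B, L_hist, n_digits), long
    B, L, n = hist_sids.shape
    offsets = torch.arange(n) * (M + 1)                # disjoint embedding range per digit position
    e = digit_emb(hist_sids + offsets)                 # (B, L, n, d_model)
    items = item_proj(e.reshape(B, L, n * d_model))    # (B, L, d_model)
    items = items + pos_emb(torch.arange(L))           # add positional embeddings
    return encoder(items)                              # H_u: (B, L, d_model)

def decode_digits(masked_sid, H_u):                    # masked_sid: (B, n_digits), MASK_ID where masked
    offsets = torch.arange(n_digits) * (M + 1)
    y = digit_emb(masked_sid + offsets)                # (B, n_digits, d_model)
    h = decoder(tgt=y, memory=H_u)                     # bidirectional self-attention: no causal mask
    return out_head(h)                                 # (B, n_digits, M): all digits predicted in parallel

hist = torch.randint(0, M, (2, L_hist, n_digits))      # toy user histories
target = torch.full((2, n_digits), MASK_ID)            # fully masked target SID
logits = decode_digits(target, encode_history(hist))
print(logits.shape)                                    # torch.Size([2, 4, 256])
```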

4.2.2. SID Generation Objective

The overall objective is to learn a conditional generator $p_{\theta}(\mathbf{y} \mid X_u)$ that maximizes the conditional log-likelihood of the target SID given the user history:
$$\max_{\theta} \; \mathbb{E}_{u \in \mathcal{U}} \big[ \log p_{\theta}(\mathbf{y}^* \mid X_u) \big]$$
Here:

  • $\theta$: The parameters of the model (including the MD-Decoder).

  • $\mathcal{U}$: The set of all users.

  • $\mathbb{E}_{u \in \mathcal{U}}[\cdot]$: The expectation taken over users in $\mathcal{U}$.

  • $p_{\theta}(\mathbf{y}^* \mid X_u)$: The conditional probability distribution of the target SID $\mathbf{y}^*$ given the user history $X_u$.

  • $\mathbf{y}^*$: The SID of the next item (the ground-truth target).

  • $X_u$: The sequence of SIDs representing user $u$'s historical interactions.

    The model $p_{\theta}$ is instantiated by the MD-Decoder, which is trained with digit-wise supervision on masked digits.

4.2.3. Masked Diffusion for Parallel Token Prediction

DiffGRM adopts masked diffusion for generating the $n$-digit SID.

  • Forward Process: This process applies an absorbing-state mask corruption to a clean SID sequence $\mathbf{x}_0$, replacing a subset of digits with a special [MASK] token according to a time-dependent schedule. The corrupted sequence at mask ratio $\tau$ is denoted $\mathbf{x}_{\tau}$.

  • Reverse Process: The MD-Decoder is trained to predict all masked digits in parallel from the corrupted sequence $\mathbf{x}_{\tau}$.

    The MD-Decoder is trained using a masked-digit cross-entropy loss:
$$\mathcal{L}(\theta) = - \mathbb{E}_{\mathbf{x}_0, \tau, \mathbf{x}_{\tau} \sim q_{\tau|0}(\cdot \mid \mathbf{x}_0)} \left[ \frac{1}{|\mathcal{M}_{\tau}|} \sum_{k \in \mathcal{M}_{\tau}} \log \mathcal{P}_{\theta} \big( x_0^k \mid \mathbf{x}_{\tau}, \tau \big) \right]$$
Where:

  • $\mathcal{L}(\theta)$: The masked-digit cross-entropy loss for model parameters $\theta$.

  • $\mathbf{x}_0$: The clean (ground-truth) SID sequence.

  • $\tau$: The mask ratio (or time step), indicating the proportion of digits masked, with $\tau \in [0, 1)$.

  • $\mathbf{x}_{\tau}$: The corrupted version of $\mathbf{x}_0$ at mask ratio $\tau$.

  • $q_{\tau|0}(\cdot \mid \mathbf{x}_0)$: The forward noising process, which samples a corrupted sequence $\mathbf{x}_{\tau}$ given $\mathbf{x}_0$ and $\tau$.

  • $\mathcal{M}_{\tau}$: The set of indices of digits that are masked at mask ratio $\tau$.

  • $|\mathcal{M}_{\tau}|$: The number of masked digits.

  • $k$: An index iterating over the masked digits.

  • $\mathcal{P}_{\theta}(x_0^k \mid \mathbf{x}_{\tau}, \tau)$: The MD-Decoder's predicted probability distribution for the original digit $x_0^k$ at position $k$, given the corrupted sequence $\mathbf{x}_{\tau}$ and mask ratio $\tau$; i.e., the model predicts the clean digit for each masked position.

    This objective removes causal constraints, allows for richer supervision signals (by varying the masked sets), and enables efficient parallel generation. A minimal sketch of the random absorbing-state (forward) masking process appears below.
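
As a concrete illustration of the absorbing-state forward process, the toy sketch below samples a mask ratio and corrupts one clean SID. The shapes and the reserved `MASK_ID` are assumptions for illustration, not the paper's code; in DiffGRM the choice of which digits to mask is later refined by OCN.

```python
import torch

n_digits, M = 4, 256
MASK_ID = M                                   # reserved absorbing-state token id

x0 = torch.randint(0, M, (n_digits,))         # clean SID x_0
tau = torch.rand(()).item()                   # mask ratio in [0, 1)
num_masked = max(1, int(round(tau * n_digits)))
masked_idx = torch.randperm(n_digits)[:num_masked]  # uniform choice of masked positions

x_tau = x0.clone()
x_tau[masked_idx] = MASK_ID                   # absorbing state: masked digits become [MASK]
print(x0.tolist(), "->", x_tau.tolist(), f"(tau~{tau:.2f})")
```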

The training process computes $\mathbf{H}_u$ once per sample. The MD-Decoder then takes a partially masked $n$-digit input $\mathbf{y}^{(r)}$ (a "view" generated by OCN) and predicts digits. The loss for a single view $r$ is:
$$\mathcal{L}^{(r)} = \frac{1}{\vert \mathcal{M}^{(r)} \vert} \sum_{k \in \mathcal{M}^{(r)}} \left( - \sum_{v = 0}^{M - 1} \tilde{q}_v^{(k)} \log p_{\theta}^{(r, k)} \big( v \mid \mathbf{y}^{(r)}, \mathbf{H}_u \big) \right)$$
Where:

  • $\mathcal{L}^{(r)}$: The loss for view $r$.

  • $\mathcal{M}^{(r)}$: The set of masked indices for view $r$.

  • $k$: The index of a masked digit.

  • $v$: A value from the codebook $\{0, \ldots, M-1\}$.

  • $M$: The size of the per-digit codebook.

  • $\tilde{q}_v^{(k)}$: The smoothed one-hot target distribution for digit $k$ (using label smoothing). If the true digit is $s^k = v_0$, then $\tilde{q}_{v_0}^{(k)}$ is close to 1 and the other $\tilde{q}_v^{(k)}$ are small.

  • $p_{\theta}^{(r, k)}(v \mid \mathbf{y}^{(r)}, \mathbf{H}_u)$: The MD-Decoder's predicted probability that digit $k$ takes value $v$, given the partially masked input $\mathbf{y}^{(r)}$ and the user-history representation $\mathbf{H}_u$.

  • $\mathbf{y}^{(r)}$: The partially masked $n$-digit input for view $r$.

  • $\mathbf{H}_u$: The user-history representation from the encoder.

    The total loss aggregates across the small set of views constructed by OCN:
$$\mathcal{L} = \frac{1}{R} \sum_{r = 1}^{R} \mathcal{L}^{(r)}$$
Where:

  • $\mathcal{L}$: The total training loss.

  • $R$: The number of coherent views generated by OCN for a single sample. A minimal sketch of this multi-view masked-digit loss is given below.
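
The following sketch computes the per-view masked-digit cross-entropy with label smoothing and averages it over views. The logits, target SID, and view masks are random stand-ins, and the shapes and smoothing constant `eps` are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

R, n_digits, M, eps = 3, 4, 256, 0.1

logits = torch.randn(R, n_digits, M)                    # stand-in for MD-Decoder outputs per view
s = torch.randint(0, M, (n_digits,))                    # ground-truth SID digits
masked = torch.tensor([[1, 0, 0, 0],                    # view 1: lightest corruption
                       [1, 1, 0, 0],                    # view 2
                       [1, 1, 1, 1]], dtype=torch.bool) # view R: fully masked

# Smoothed one-hot targets \tilde{q}: (1 - eps) on the true code, eps spread over the rest.
q = torch.full((n_digits, M), eps / (M - 1))
q[torch.arange(n_digits), s] = 1.0 - eps

log_p = F.log_softmax(logits, dim=-1)                   # (R, n_digits, M)
per_digit_ce = -(q.unsqueeze(0) * log_p).sum(-1)        # (R, n_digits)

# Average only over the masked digits of each view, then over the R views.
per_view = (per_digit_ce * masked).sum(-1) / masked.sum(-1)
loss = per_view.mean()
print(loss.item())
```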

4.2.4. Parallel Semantic Encoding (PSE)

To address the issues of residual quantization (RQ) (unbalanced information, sequential dependency), DiffGRM uses Parallel Semantic Encoding (PSE).

  1. Item Embedding: An item $i$ with content features $\mathbf{f}_i$ is first mapped to a $d$-dimensional continuous representation $\mathbf{h}_i = E(\mathbf{f}_i)$ using a semantic encoder $E(\cdot)$ (e.g., Sentence-T5).
  2. Orthogonal Rotation: A learned orthogonal rotation matrix $\mathbf{R}_o$ is applied to $\mathbf{h}_i$ to reduce downstream quantization distortion: $\tilde{\mathbf{h}}_i = \mathbf{R}_o \mathbf{h}_i$.
  3. Partitioning: The rotated vector $\tilde{\mathbf{h}}_i$ is evenly partitioned into $n$ subvectors, $\tilde{\mathbf{h}}_i = \mathbf{v}_i^0 \oplus \cdots \oplus \mathbf{v}_i^{n-1}$, where each subvector $\mathbf{v}_i^k$ corresponds to a specific digit position.
  4. Independent Quantization: Each subvector $\mathbf{v}_i^k$ is independently quantized to a code from its per-digit codebook $\mathbf{C}^{(k)} = \{ \mathbf{c}_0^{(k)}, \dots, \mathbf{c}_{M-1}^{(k)} \}$. The SID digit $s_i^k$ is obtained by finding the nearest centroid:
$$s_i^k = \operatorname*{arg\,min}_{j} \| \mathbf{v}_i^k - \mathbf{c}_j^{(k)} \|_2^2$$
Where:
     • $s_i^k$: The $k$-th SID digit for item $i$.

     • $\mathbf{v}_i^k$: The $k$-th subvector of the rotated item embedding.

     • $\mathbf{c}_j^{(k)}$: The $j$-th centroid vector in the $k$-th codebook.

     • $\| \cdot \|_2^2$: The squared Euclidean distance.

       This process yields an $n$-digit SID $\mathrm{SID}_i = [s_i^0, \ldots, s_i^{n-1}]$ with decoupled per-digit assignments. It removes the residual sequential dependence found in RQ and enables fully parallel prediction across digits, aligning with the MDM's capabilities. A minimal sketch of this tokenization pipeline follows after this list.

4.2.5. On-policy Coherent Noising (OCN)

OCN aims to make MDM training more efficient by focusing supervision on "hard" digits, addressing the combinatorial explosion of masking patterns.

  1. Difficulty Estimation: For each training sample, the MD-Decoder is run once on a fully masked $n$-digit input (the "last view" $R$, with $m_R = n$ digits masked and mask ratio $t_R = 1$). This "probe" yields a predictive distribution for each digit $k$, from which two scores are computed:

     • Maximum Confidence: $p_{\max}^{(k)} = \max_{v \in \{0, \ldots, M-1\}} \mathcal{P}_{\theta}^{(R, k)}(v)$
     • Difficulty Score: $\delta^{(k)} = 1 - p_{\max}^{(k)}$

     Where:
     • $p_{\max}^{(k)}$: The maximum predicted probability (confidence) for digit $k$ when all digits are masked.
     • $\mathcal{P}_{\theta}^{(R, k)}(v)$: The MD-Decoder's predicted probability that digit $k$ takes value $v$ in the fully masked view $R$.
     • $\delta^{(k)}$: The difficulty score for digit $k$; a larger $\delta^{(k)}$ indicates lower confidence (higher perplexity) and thus higher difficulty. These scores induce a policy $\pi_{\theta}(k) \propto \delta^{(k)}$ for selecting digits.
  2. Digit Ordering: Digits are sorted in descending order of their difficulty scores $\delta^{(k)}$ to obtain a permutation $\sigma$ from hardest to easiest. Ties are broken with a fixed digit order.

  3. Coherent View Construction: A small nested set of $R$ views is constructed per sample, ordered from light to heavy corruption. For view $r$ (with $1 \leq r \leq R$), $m_r$ digits are masked, where the $m_r$ follow a non-decreasing schedule ($1 \leq m_1 < \cdots < m_R \leq n$). The masked index set $\mathcal{M}^{(r)}$ for view $r$ is
$$\mathcal{M}^{(r)} = \{ \sigma(1), \ldots, \sigma(m_r) \}$$
so view $r$ masks the $m_r$ hardest digits according to the current model's policy. The remaining digits ($k \notin \mathcal{M}^{(r)}$) stay visible with their clean embeddings.

  4. Layer-0 Input for Views: The input embedding for digit $k$ in view $r$, denoted $\mathbf{y}^{(r, k)}$, is constructed as
$$\mathbf{y}^{(r, k)} = \begin{cases} \mathbf{E}_{\mathrm{mask}}[k], & \text{if } k \in \mathcal{M}^{(r)}, \\ \mathbf{E}_{\mathrm{sid}}^{(k)}[s^k], & \text{if } k \notin \mathcal{M}^{(r)}, \end{cases}$$
Where:

     • $\mathbf{E}_{\mathrm{mask}}[k]$: The embedding of the [MASK] token at position $k$.
     • $\mathbf{E}_{\mathrm{sid}}^{(k)}[s^k]$: The embedding of the true SID digit $s^k$ at position $k$.

     The masked sets are nested, meaning the visible context grows as $m_r$ decreases (from fully masked to progressively fewer masked digits), while the corruption ratio $t_r = m_r / n$ increases from light to heavy masking. This strategy avoids combinatorial masking, stabilizes optimization with progressively richer evidence, and concentrates gradients on the same hard digits under increasing context.
  5. Binary View Matrix: All views can be represented by a binary matrix
$$\mathbf{M}_{\mathrm{ocn}} = \begin{bmatrix} (\mathbf{m}^{(1)})^{\top} \\ \vdots \\ (\mathbf{m}^{(R)})^{\top} \end{bmatrix} \in \{0, 1\}^{R \times n}$$
Where:

     • $\mathbf{M}_{\mathrm{ocn}}$: The matrix of masking patterns for all views.

     • $\mathbf{m}^{(r)}$: A binary vector for view $r$, with $m_k^{(r)} = 1$ if digit $k$ is masked in view $r$ and 0 otherwise.

       The encoder output $\mathbf{H}_u$ is reused across all $R$ views, and the difficulty order is computed once. A minimal sketch of the difficulty probe and nested-view construction follows below.
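
The sketch below illustrates OCN's view construction: a fully masked probe gives per-digit confidences, digits are ordered hardest-first, and nested views mask the $m_r$ hardest digits. The probe logits and the masking schedule are stand-ins, not the paper's settings.

```python
import torch
import torch.nn.functional as F

n_digits, M, R = 4, 256, 3
mask_schedule = [1, 2, 4]                        # m_r: non-decreasing number of masked digits per view

probe_logits = torch.randn(n_digits, M)          # stand-in for the fully-masked "probe" pass
p_max = F.softmax(probe_logits, dim=-1).max(dim=-1).values
difficulty = 1.0 - p_max                         # delta^(k): low confidence => hard digit
order = torch.argsort(difficulty, descending=True)   # permutation sigma: hardest first

# Nested coherent views: view r masks the m_r hardest digits, the rest stay visible.
views = []
for m_r in mask_schedule:
    mask = torch.zeros(n_digits, dtype=torch.bool)
    mask[order[:m_r]] = True
    views.append(mask)

print("difficulty:", [round(x, 3) for x in difficulty.tolist()])
print("masked views:", [v.tolist() for v in views])
```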

4.2.6. Confidence-guided Parallel Denoising (CPD)

For inference, CPD generates diverse Top-K SID candidates via a global parallel beam search. The process maintains an active set of partial SIDs, denoted $\mathcal{B}$, and progresses over reverse steps $\{t_r\}_{r=1}^R$ from $t_R = 1$ (fully masked) to $t_1 = 0$ (fully resolved). The encoder-side keys and values are computed once and cached.

  1. Initialization ($t_R = 1$): Starting from a fully masked input $\mathbf{y}^{(R)}$, the MD-Decoder predicts distributions for all digits. The active set $\mathcal{B}_R$ is initialized by selecting the Top-$B_{\mathrm{act}}$ digit-codeword pairs by predicted log-probability:
$$\mathcal{B}_R = \operatorname*{Top}_{B_{\mathrm{act}}} \Big\{ \log p_{\theta} \big( y_k = c \mid \mathbf{y}^{(R)}, \mathbf{H}_u \big) \Big\}$$
Where:

     • $B_{\mathrm{act}}$: The per-step beam width (number of top candidates kept).
     • $y_k$: The $k$-th SID digit.
     • $c$: A possible codeword from $\{0, \ldots, M-1\}$.
     • $\log p_{\theta}(\cdot)$: The predicted log-probability of assigning codeword $c$ to digit $y_k$.
     • $\operatorname*{Top}_{B_{\mathrm{act}}}\{\cdot\}$: Selects the top $B_{\mathrm{act}}$ scores across all possible $(k, c)$ pairs.
  2. Denoising Steps ($t_r \to t_{r-1}$): At each subsequent denoising step $t_r$, for each branch $b$ in the current active set $\mathcal{B}_r$, CPD scores filling one of its still-masked digits. The score for extending branch $b$ by filling digit $k$ with codeword $c$ is
$$s_{r-1}(b, k, c) = \mathrm{score}_r(b) + \log p_{\theta} \big( y_k = c \mid \mathbf{y}_b^{(r)}, \mathbf{H}_u \big)$$
Where:

     • $s_{r-1}(b, k, c)$: The accumulated score of the new partial SID after filling digit $k$ with $c$ in branch $b$.
     • $\mathrm{score}_r(b)$: The accumulated log-probability of branch $b$ at step $t_r$.
     • $\mathbf{y}_b^{(r)}$: The partially masked SID sequence of branch $b$ at step $t_r$.
     • $k \in \mathcal{M}_r^{(b)}$: An index of a digit that is still masked in branch $b$.
  3. Per-step Truncation: After scoring all possible extensions of all active branches, the Top-$B_{\mathrm{act}}$ highest-scoring new partial SIDs form the next active set:
$$\mathcal{B}_{r-1} = \operatorname*{Top}_{B_{\mathrm{act}}} \big\{ s_{r-1}(b, k, c) \big\}$$
For each selected tuple $(b, k, c)$, digit $k$ is filled with codeword $c$ and removed from that branch's masked-index set; all other digits remain masked. The mask ratio thus decreases from $t_r$ to $t_{r-1}$.

  4. Completion: The iteration continues until $t_1 = 0$, at which point all digits are filled, yielding the final set of candidate SIDs $\mathcal{B}_0$. Generated sequences are then deduplicated, and the Top-K distinct SIDs are kept by accumulated score, ensuring diverse Top-K recommendations. A simplified sketch of this confidence-guided beam search follows below.
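
Below is a simplified, self-contained sketch of a CPD-style confidence-guided beam search. The MD-Decoder is replaced by a random stand-in scorer, and the caching of encoder keys/values is omitted; the beam width, codebook size, and K are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

n_digits, M, B_act, K = 4, 8, 6, 3
torch.manual_seed(0)

def score_digits(partial):
    """Stand-in for the MD-Decoder: log-probs (n_digits, M) given a partial SID."""
    return F.log_softmax(torch.randn(n_digits, M), dim=-1)

# Initialization: from the fully masked input, keep the top B_act (digit, code) pairs.
logp = score_digits([None] * n_digits)
flat = logp.flatten().topk(B_act)
beams = []
for score, idx in zip(flat.values.tolist(), flat.indices.tolist()):
    k, c = divmod(idx, M)
    sid = [None] * n_digits
    sid[k] = c
    beams.append((score, sid))

# Reverse steps: extend each beam by filling one still-masked digit, then truncate to B_act.
for _ in range(n_digits - 1):
    candidates = []
    for score, sid in beams:
        logp = score_digits(sid)
        for k in range(n_digits):
            if sid[k] is None:
                best = logp[k].topk(B_act)
                for c_score, c in zip(best.values.tolist(), best.indices.tolist()):
                    new_sid = list(sid)
                    new_sid[k] = c
                    candidates.append((score + c_score, new_sid))
    candidates.sort(key=lambda x: -x[0])
    beams = candidates[:B_act]

# Deduplicate completed SIDs and keep the Top-K by accumulated score.
seen, topk = set(), []
for score, sid in beams:
    key = tuple(sid)
    if key not in seen:
        seen.add(key)
        topk.append((key, round(score, 3)))
    if len(topk) == K:
        break
print(topk)
```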

4.2.7. Discussion

  • Why not RQ: Residual Quantization (RQ), used in many generative recommendation tokenizers, introduces a strict left-to-right residual dependency between digits. This creates two problems for DiffGRM: (1) it leads to unbalanced information distribution and inter-digit heterogeneity, which conflicts with the goal of balanced per-digit information, and (2) its hierarchical residual dependency creates a left-to-right bias that directly conflicts with the MDM's parallel, any-order prediction. PSE addresses this by factorizing the representation into independent subspaces.

  • Complexity: The following are the leading terms of the asymptotic complexity, with $N$ the history length, $n$ the SID digit count, $d_m$ the model hidden size, $M$ the codebook size, $R$ the number of coherent views, and $B_{\mathrm{act}}$ the active beam width.

    Module             ARM                                  DiffGRM
    Encoder            $O(N^2 d_m)$                         $O(N^2 d_m)$
    Decoder—training   $O(n^2 d_m + n N d_m + n M d_m)$     $R \cdot O(n^2 d_m + n N d_m + n M d_m)$
    Decoder—inference  $O(B_{\mathrm{act}} n N d_m)$        $O(B_{\mathrm{act}} n^2 N d_m)$
    • Training Complexity: The encoder is computed once. ARM needs one decoder pass per sample, whereas DiffGRM requires $R$ decoder passes (one per coherent view). Since $R$ is typically small and the history length $N$ is often large in industrial settings, the encoder term $O(N^2 d_m)$ dominates, making the overall training complexity similar for both.
    • Inference Complexity: Both run the encoder once. ARM's decoder performs $n$ incremental steps. DiffGRM with CPD performs an initial fully masked pass plus multiple reverse steps, scoring $n + (n-1) + \dots + 1 = n(n+1)/2 = \Theta(n^2)$ masked positions per candidate path, which adds a factor of $n$ to the MD-Decoder term. However, because $n$ is small (e.g., 4) and $N$ is typically large, the shared encoder term remains the dominant cost. The extra scoring in DiffGRM runs in parallel across digits and beams using cached encoder keys/values, so it increases compute but does not necessarily lengthen the critical path (wall-clock latency). A worked instance for $n = 4$ is given below.
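
As a worked instance of the step-count comparison above, using the commonly reported setting $n = 4$ (illustrative only):

```latex
\underbrace{n}_{\text{ARM decoder steps}} = 4,
\qquad
\underbrace{\sum_{r=1}^{n} r}_{\text{positions scored by CPD per path}}
= \frac{n(n+1)}{2} = \frac{4 \cdot 5}{2} = 10 .
```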

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three categories from the widely used Amazon Reviews dataset [38]:

  • "Sports and Outdoors" (Sports): A dataset related to sports and outdoor equipment.

  • "Beauty" (Beauty): A dataset related to cosmetic and beauty products.

  • "Toys and Games" (Toys): A dataset related to toys and games.

    These datasets are standard benchmarks for semantic ID-based generative recommendation [23, 24, 48]. Each user's historical reviews are treated as interactions and sorted chronologically to form input sequences. The leave-last-out evaluation strategy [27, 48, 69] is adopted: the last item in each sequence is used for testing, the second-to-last for validation, and the remaining interactions for training.

The following are the statistics from Table 2 of the original paper:

Dataset #Users #Items #Interactions Avg. tp
Sports 35,598 18,357 260,739 8.32
Beauty 22,363 12,101 176,139 8.87
Toys 19,412 11,924 148,185 8.63

Where Avg. $t^p$ denotes the average number of interactions per input sequence. These datasets are effective for validating the method's performance because they represent diverse product domains and provide realistic user interaction histories, suitable for sequential recommendation tasks and testing the generation of semantic IDs for varied item types.

5.2. Evaluation Metrics

The performance of DiffGRM and baselines is evaluated using Recall@K and NDCG@K, with $K \in \{5, 10\}$. These are standard metrics in recommendation systems for evaluating the quality of ranked lists.

5.2.1. Recall@K

  • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the ability of the recommender system to find as many relevant items as possible, without regard for their ranking within the top $K$. In the context of Top-K recommendation, if the one ground-truth item is in the top $K$ predictions, it counts as a hit.
  • Mathematical Formula:
$$\text{Recall@K} = \frac{|\text{Relevant Items} \cap \text{Top-K Recommendations}|}{|\text{Relevant Items}|}$$
In a typical next-item recommendation scenario with a single ground-truth item, this simplifies to:
$$\text{Recall@K} = \frac{\text{Number of users for whom the target item is in Top-K}}{\text{Total number of users}}$$
  • Symbol Explanation:
    • $|\text{Relevant Items} \cap \text{Top-K Recommendations}|$: The number of relevant items found within the Top-K recommended list.
    • $|\text{Relevant Items}|$: The total number of relevant items for a given user (often 1 in next-item prediction).
    • Number of users for whom the target item is in Top-K: A count of test cases where the single ground-truth item was predicted within the top $K$ positions.
    • Total number of users: The total number of test cases (users) in the evaluation set.

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

  • Conceptual Definition: NDCG@K is a measure of ranking quality. It accounts for both the relevance of recommended items and their position in the ranked list. Higher relevance at higher (earlier) positions contributes more to the NDCG score. This metric is particularly useful because it differentiates between placing a relevant item at the 1st position versus the 10th position.
  • Mathematical Formula: First, the Discounted Cumulative Gain (DCG@K) is calculated:
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}$$
Then, NDCG@K normalizes DCG@K by the Ideal DCG (IDCG@K), the DCG of a perfectly sorted list of relevant items:
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
  • Symbol Explanation:
    • $K$: The number of top recommendations considered.
    • $i$: The rank (position) of an item in the recommendation list, starting from 1.
    • $\text{rel}_i$: The relevance score of the item at position $i$. In next-item recommendation with a single ground-truth item, $\text{rel}_i$ is 1 if the item at position $i$ is the ground-truth item, and 0 otherwise.
    • $\log_2(i+1)$: The logarithmic discount factor, which reduces the contribution of items at lower ranks.
    • $\text{DCG@K}$: The Discounted Cumulative Gain at rank $K$.
    • $\text{IDCG@K}$: The Ideal Discounted Cumulative Gain at rank $K$, computed by assuming the perfect ranking in which all relevant items appear at the top in decreasing order of relevance. For a single ground-truth item, $\text{IDCG@K} = 1/\log_2(1+1) = 1$ whenever $K \geq 1$. A small computation sketch for both metrics follows below.
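
A small Python helper showing how both metrics reduce for the single-ground-truth next-item setting; the function name and item identifiers are illustrative, not from the paper's code.

```python
import math

def recall_and_ndcg_at_k(ranked_items, target, k=10):
    """Recall@K and NDCG@K for next-item prediction with one ground-truth item."""
    topk = ranked_items[:k]
    if target not in topk:
        return 0.0, 0.0
    rank = topk.index(target) + 1           # 1-based position of the hit
    recall = 1.0                            # the single relevant item was retrieved
    ndcg = 1.0 / math.log2(rank + 1)        # IDCG = 1 for a single relevant item
    return recall, ndcg

# Example: the target sits at rank 3 of the recommendation list.
print(recall_and_ndcg_at_k(["i7", "i2", "i9", "i4"], target="i9", k=10))
# -> (1.0, 0.5)
```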

5.3. Baselines

DiffGRM is compared against a comprehensive set of baselines, grouped into three families:

5.3.1. Item ID-based (Discriminative) Models

These models primarily rely on item identifiers and sequential patterns.

  • GRU4Rec [15]: A Gated Recurrent Unit (GRU)-based model designed for session-based recommendation, capturing sequential dynamics.
  • HGN [36]: Hierarchical Gating Networks, enhancing RNN-based sequence modeling with a gating mechanism.
  • SASRec [27]: Self-Attentive Sequential Recommendation, a Transformer-based model using self-attention for next-item prediction, trained with binary cross-entropy.
  • BERT4Rec [58]: A Bidirectional Encoder Representations from Transformers (BERT)-style model adapted for recommendation, using a Cloze-style objective (predicting masked item IDs) on item sequences.

5.3.2. Semantic-enhanced (Discriminative) Models

These models incorporate item content features to enrich representations.

  • FDSA [67]: Feature-level Deeper Self-Attention Network, which models both item-ID and feature sequences using self-attention and fuses them.
  • $S^3$-Rec [74]: Self-Supervised Learning for Sequential Recommendation, using self-supervised pretraining on features and IDs before fine-tuning for next-item prediction.
  • vQ-Rec [18]: Learns vector-quantized item representations from text features, pooling them as item representations.
  • RecJPQ [47]: Replaces item embeddings with concatenated jointly product-quantized sub-embeddings.

5.3.3. Semantic ID-based (Generative) Models

These models aim to generate semantic IDs or tokenized representations of items.

  • TIGER [48]: Transformer-based Generative Recommender, which uses RQ-VAE for item tokenization into SIDs and then autoregressively generates the next SID token. This is a key autoregressive baseline.

  • HSTU [65]: Hierarchical Sequential Transduction Units, which discretize raw item features into tokens for generative recommendation. The paper notes using 4-digit OPQ-tokenized SIDs for consistency.

  • ActionPiece [21]: A model that uses context-aware tokenization, representing each action (user interaction) as an unordered set of item features for generative recommendation.

  • RPG [20]: Predicts unordered SID tokens in parallel using a multi-token objective, combined with graph-guided decoding. This is another strong generative baseline that attempts parallel prediction.

    These baselines are representative because they cover a wide spectrum of recommendation approaches, from traditional ID-based to advanced semantic-enhanced and generative methods, allowing for a thorough comparison of DiffGRM's performance and architectural innovations.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DiffGRM consistently outperforms both discriminative and generative recommendation baselines across all three datasets (Sports, Beauty, Toys). The evaluation is based on Recall@K and NDCG@K with $K \in \{5, 10\}$. The improvements highlight DiffGRM's effectiveness in leveraging masked diffusion for SID generation.

The following are the results from Table 3 of the original paper:

Methods | Sports and Outdoors | Beauty | Toys and Games
(each dataset block lists Recall@5, NDCG@5, Recall@10, NDCG@10 in that order)
Item ID-based (Discriminative)
GRU4Rec 0.0129 0.0086 0.0204 0.0110 0.0164 0.0099 0.0283 0.0137 0.0097 0.0059 0.0176 0.0084
HGN 0.0189 0.0120 0.0313 0.0159 0.0325 0.0206 0.0512 0.0266 0.0321 0.0221 0.0497 0.0277
SASRec 0.0233 0.0154 0.0350 0.0192 0.0387 0.0249 0.0605 0.0318 0.0463 0.0306 0.0675 0.0374
BERT4Rec 0.0115 0.0075 0.0191 0.0099 0.0203 0.0124 0.0347 0.0170 0.0116 0.0071 0.0203 0.0099
Semantic-enhanced (Discriminative)
FDSA 0.0182 0.0122 0.0288 0.0156 0.0267 0.0163 0.0407 0.0208 0.0228 0.0140 0.0381 0.0189
s3-Rec 0.0251 0.0161 0.0385 0.0204 0.0387 0.0244 0.0647 0.0327 0.0443 0.0294 0.0700 0.0376
vQ-Rec 0.0208 0.0144 0.0300 0.0173 0.0457 0.0317 0.0664 0.0383 0.0497 0.0346 0.0737 0.0423
RecJPQ 0.0141 0.0076 0.0220 0.0102 0.0311 0.0167 0.0482 0.0222 0.0331 0.0182 0.0484 0.0231
Semantic ID-based (Generative)
TIGER 0.0264 0.0181 0.0400 0.0225 0.0454 0.0321 0.0648 0.0384 0.0521 0.0371 0.0712 0.0432
HSTU 0.0258 0.0165 0.0414 0.0215 0.0469 0.0314 0.0704 0.0389 0.0433 0.0281 0.0669 0.0357
RPG 0.0316 0.0205 0.0500 0.0264 0.0511 0.0340 0.0775 0.0424
ActionPiece 0.0314 0.0216 0.0463 0.0263 0.0550 0.0381 0.0809 0.0464 0.0592 0.0401 0.0869 0.0490
DiffGRM **0.0363** **0.0245** **0.0550** **0.0305** **0.0603** **0.0414** **0.0876** **0.0502** **0.0618** **0.0455** **0.0834** **0.0524**
Improv. +14.87% +13.43% +10.00% +15.53% +9.64% +8.19% +8.28% +8.19% +4.39% +13.47% -4.03% +6.94%

Key Findings:

  • Overall Dominance: DiffGRM achieves the best overall results, ranking first on 11 out of 12 metrics across the three datasets. This indicates strong generalization capability and superiority over existing methods.
  • Significant NDCG Improvements: Relative to the strongest baseline (which varies per dataset and metric, but often ActionPiece or RPG), DiffGRM shows impressive improvements in NDCG@10:
    • Sports: +15.53%
    • Beauty: +8.19%
    • Toys: +6.94%. NDCG is a crucial metric for recommendation because it rewards correct predictions placed at higher ranks.
  • Recall Improvements: Recall@10 also sees substantial gains:
    • Sports: +10.00%
    • Beauty: +8.28%. On Toys, Recall@10 is slightly lower (-4.03%) than the best baseline (ActionPiece), but DiffGRM still achieves higher NDCG@5 and NDCG@10, suggesting that while it may miss a few relevant items that ActionPiece catches, it ranks its correct predictions more accurately.
  • Semantic vs. ID-based Models: As expected, methods leveraging semantic information generally outperform ID-only discriminative models. Within the semantic family, semantic ID-based generative models tend to surpass semantic-enhanced discriminative ones, affirming the promise of the generative paradigm.
  • Reasons for Gains: The paper attributes these gains to the core innovations:
    • The masked-diffusion training over SIDs, which provides dense per-digit supervision and leverages bidirectional context among SID digits, addressing the intra-item consistency issue.

    • OCN, which effectively allocates supervision by prioritizing difficult digits, tackling the inter-digit heterogeneity and improving training efficiency.

    • CPD, which enables accurate and diverse Top-K SID generation in parallel, meeting a critical requirement for practical recommendation systems.

      The results strongly validate the effectiveness of DiffGRM's approach in addressing the limitations of autoregressive models for SID-based generative recommendation.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Effectiveness of OCN (RQ2)

This section evaluates On-policy Coherent Noising (OCN)'s ability to achieve more balanced supervision allocation under a limited training budget, thereby improving sample efficiency. The concept of effective sample passes (ESP) is introduced, calculated as best_epoch × (training views per sample per epoch).

As can be seen from the results in Figure 3 of the original paper:

Figure 3: Analysis of performance (NDCG@10 / Recall@10) w.r.t. effective sample passes (ESP). The figure contains subplots for the Sports, Beauty, and Toys datasets comparing DiffGRM with $k$-times coherent-path noising baselines; star markers denote DiffGRM ("ours"), which reaches strong performance at lower ESP, while fitted curves show how baseline performance scales with ESP for different $k$.

  • Scaling with $k$: The figure shows that increasing $k$ (the number of coherent paths, which effectively increases the number of training views per sample and thus ESP) generally raises performance for $k$-times coherent-path noising. The dashed lines represent a logarithmic least-squares fit, indicating how performance scales with ESP.
  • OCN's Superiority: OCN (represented by the star markers) consistently achieves better results at the same or even lower ESP compared to the coherent-path noising variants. This demonstrates that OCN is more sample-efficient.
  • Mechanism: OCN achieves this by using the current model's confidence to select the most uncertain positions (hardest digits) along coherent paths. This strategy focuses training on high-value signals, avoiding the scattered supervision of random masking and leading to better performance with fewer effective training steps.

6.2.2. Ablation Study (RQ3)

An ablation study quantifies the contribution of each proposed module to DiffGRM's overall performance. The performance is measured using NDCG@10.

The following are the results from Table 4 of the original paper:

Variants Sports Beauty Toys
Semantic ID Setting
(1.1) PSE → RQ-Kmeans 0.0200 0.0343 0.0305
(1.2) PSE → Random 0.0138 0.0300 0.0206
Training strategy
(2.1) w/o OCN 0.0250 0.0368 0.0385
(2.2) w/o On-policy 0.0263 0.0455 0.0430
Inference strategy
(3.1) w/o CPD 0.0273 0.0496 0.0499
DiffGRM (ours) 0.0305 0.0502 0.0524

Analysis:

  • Parallel Semantic Encoding (PSE):
    • (1.1) PSE → RQ-Kmeans: Replacing PSE with RQ-KMeans significantly degrades performance across all datasets (e.g., from 0.0305 to 0.0200 on Sports). This confirms that RQ's residual dependencies and left-to-right bias conflict with the MDM's bidirectional and parallel denoising, validating the choice of PSE.
    • (1.2) PSE → Random: Using random tokens instead of semantically structured SIDs further hurts performance dramatically. This underlines the necessity of using PSE to capture and balance semantic information effectively for the MDM.
  • On-policy Coherent Noising (OCN):
    • (2.1) w/o OCN: This variant removes coherent noising and on-policy selection, using DDMs-style random masking. Performance drops noticeably (e.g., from 0.0305 to 0.0250 on Sports). This shows that random masking disperses supervision, leading to insufficient training for infrequent items and less effective learning.
    • (2.2) w/o On-policy: This variant keeps coherent noising but removes the on-policy selection (i.e., it doesn't prioritize uncertain digits). It performs better than w/o OCN but still below the full DiffGRM. This indicates that while coherent noising (structured views) is beneficial, the on-policy selection further refines supervision by focusing on the "weakest links" (hardest digits), contributing to overall performance.
  • Confidence-guided Parallel Denoising (CPD):
    • (3.1) w/o CPD: This variant replaces CPD with a random fixed-order beam search, meaning digits are decoded in a fixed random permutation without using confidence feedback. Performance decreases across all datasets (e.g., from 0.0305 to 0.0273 on Sports). This highlights the importance of CPD's confidence-guided mechanism for accurate and effective Top-K SID generation during inference.

      The ablation study clearly demonstrates that PSE, OCN, and CPD are not only effective but also necessary components, each contributing significantly to DiffGRM's superior performance.

6.2.3. OCN Strategy Analysis

This analysis compares four OCN variants based on two dimensions: selection policy ("least" confident vs. "most" confident digits) and refresh frequency ("static" order vs. "refresh" order at each step). Performance is measured by NDCG@10 on Beauty and Toys.

The following are the results from Table 5 of the original paper:

| Dataset | Metric | L-S (Ours) | L-R | M-S | M-R |
|---|---|---|---|---|---|
| Beauty | CPD | 0.0502 | 0.0484 | 0.0476 | 0.0382 |
| Beauty | w/o CPD | 0.0496 | 0.0470 | 0.0444 | 0.0309 |
| Beauty | Improv. | -1.20% | -2.89% | -6.72% | -19.11% |
| Toys | CPD | 0.0524 | 0.0481 | 0.0516 | 0.0421 |
| Toys | w/o CPD | 0.0499 | 0.0455 | 0.0506 | 0.0318 |
| Toys | Improv. | -4.71% | -5.41% | -1.94% | -24.47% |

Where:

  • L-S (Ours): "Least" confident digits selected, "Static" refresh frequency. This is the DiffGRM's default OCN strategy.
  • L-R: "Least" confident digits selected, "Refresh" frequency.
  • M-S: "Most" confident digits selected, "Static" refresh frequency.
  • M-R: "Most" confident digits selected, "Refresh" frequency.
  • Improv.: the relative change of the w/o CPD score versus the CPD score; negative values quantify how much performance degrades when confidence-guided decoding is removed.
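To make the selection-policy and refresh dimensions concrete, here is a minimal sketch (not the authors' implementation) of deriving a masking order from per-digit confidences; the confidence values are hypothetical.

```python
import torch

def masking_order(confidence: torch.Tensor, policy: str = "least") -> list:
    """Order the n digit positions for masking from a per-digit confidence vector.

    policy="least" (L-*) puts the lowest-confidence (hardest) digits first;
    policy="most"  (M-*) puts the highest-confidence (easiest) digits first.
    """
    descending = (policy == "most")
    return torch.argsort(confidence, descending=descending).tolist()

# Hypothetical confidences for an n=4-digit SID (not taken from the paper).
conf = torch.tensor([0.91, 0.35, 0.72, 0.58])

# Static variants (L-S / M-S): estimate confidence once and fix the order.
print("L-S order:", masking_order(conf, policy="least"))  # hardest digits first
print("M-S order:", masking_order(conf, policy="most"))   # easiest digits first

# Refresh variants (L-R / M-R) would instead re-run the MD-Decoder and recompute
# `conf` after every denoising step, re-deriving the order each time.
```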

Findings:

  1. Selection Policy: Least-based scheduling (prioritizing the lowest-confidence/hardest digits) consistently outperforms most-based scheduling (prioritizing highest-confidence/easiest digits) under the same refresh setting. This validates the OCN design to focus supervision on areas where the model is weakest, leading to better learning.
  2. Refresh Frequency: Static ordering (estimating uncertainty once and fixing the order for the example) outperforms refresh (re-estimating uncertainty after each denoising step). Re-estimating the order at every step likely introduces instability or changes the training "plan" too frequently, hindering effective learning.
  3. Order Sensitivity: The largest performance degradation, especially when CPD is replaced with w/o CPD (meaning using fixed-order decoding instead of confidence-guided), occurs in the M-R (most confident, refresh) variant. This is because prioritizing most confident digits with stepwise refresh makes the model rely on an "easy-first" order. If this order changes or is not followed during inference (as in w/o CPD), the hard digits become undertrained and performance drops sharply. This reinforces the need for OCN's least-confident strategy and CPD's confidence guidance.

6.2.4. CPD Beam-Size Analysis

The impact of the CPD beam size on DiffGRM performance (NDCG@10) is analyzed by varying it across {32, 64, 128, 256}.

As can be seen from the results in Figure 4 of the original paper:

Figure 4: Analysis of DiffGRM performance (NDCG@10) w.r.t. beam size in CPD. The figure plots NDCG@10 against beam size on Sports, Beauty, and Toys; NDCG@10 rises as the beam size grows, with the gains most pronounced on Beauty and Toys.

  • General Trend: Across all three datasets (Sports, Beauty, Toys), NDCG@10 generally improves as the beam size increases. This is the expected behavior of beam search: a larger beam width lets the search explore more candidate paths, mitigating local optima and yielding better-quality solutions (a toy sketch follows this list).
  • Dataset Variation: The improvement trend is particularly pronounced for the Beauty and Toys datasets, indicating that these datasets might benefit more from a broader search space during inference.
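For intuition on why a wider beam helps, the toy sketch below runs a fixed-order beam search over SID digits; the digit distributions are synthetic, and real CPD additionally chooses which digit to fill next by confidence, which this sketch omits.

```python
import numpy as np

def toy_beam_search(logprobs, beam_size):
    """Toy fixed-order beam search over n SID digits with M codes each.

    logprobs has shape [n, M]; a wider beam keeps more partial candidates
    alive at every step, widening the search space.
    """
    beams = [((), 0.0)]                      # (partial SID, cumulative log-prob)
    n, M = logprobs.shape
    for pos in range(n):
        expanded = [(prefix + (code,), score + logprobs[pos, code])
                    for prefix, score in beams
                    for code in range(M)]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams

rng = np.random.default_rng(0)
digit_logprobs = np.log(rng.dirichlet(np.ones(8), size=4))  # hypothetical [n=4, M=8]
print(toy_beam_search(digit_logprobs, beam_size=32)[0])     # best full SID
```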

6.2.5. Expressive Ability Analysis

To verify DiffGRM's ability to extract and leverage semantics, its performance (NDCG@10) is evaluated with different semantic encoders (pretrained language models of varying sizes).

The following are the results from Table 9 of the original paper:

| Model | Semantic encoder | Sports | Beauty | Toys |
|---|---|---|---|---|
| RPG | sentence-t5-base | 0.0238 | 0.0429 | 0.0460 |
| RPG | bge-large-en-v1.5 | 0.0248 | 0.0408 | 0.0421 |
| RPG | gte-large-en-v1.5 | 0.0229 | 0.0423 | 0.0469 |
| DiffGRM | sentence-t5-base | 0.0305 | 0.0502 | 0.0524 |
| DiffGRM | bge-large-en-v1.5 | 0.0327 | 0.0564 | 0.0508 |
| DiffGRM | gte-large-en-v1.5 | 0.0342 | 0.0549 | 0.0510 |

Findings:

  • Performance with Larger Encoders: For DiffGRM, performance generally rises with the capacity of the semantic encoder (sentence-t5-base, 110M; bge-large-en-v1.5, 335M; gte-large-en-v1.5, 434M), though not uniformly on every dataset. For instance, NDCG@10 on Sports rises from 0.0305 (sentence-t5-base) to 0.0342 (gte-large-en-v1.5); a rough encoding sketch follows this list.
  • DiffGRM Benefits More: The paper states that DiffGRM benefits the most from more powerful semantic encoders. This implies DiffGRM is better at utilizing rich semantic information encoded by larger PLMs (pre-trained language models), suggesting its architecture is well-suited to leverage high-quality item representations.
  • RPG's Inconsistency: In contrast, RPG's performance with larger encoders is less consistent, sometimes even decreasing (bge-large-en-v1.5 on Beauty and Toys compared to sentence-t5-base). This indicates DiffGRM has a stronger ability to capture and effectively integrate the semantics provided by the upstream encoders.
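As a rough sketch of the upstream encoding step, the snippet below embeds made-up item texts with the `sentence-transformers` library; the Hugging Face model ID is an assumption for the sentence-t5-base encoder in Table 9, and the larger encoders would be loaded analogously.

```python
from sentence_transformers import SentenceTransformer

# Made-up item descriptions standing in for real catalog text.
item_texts = [
    "Wilson tennis racket, lightweight graphite frame",
    "Organic lavender body lotion, 8 oz",
]

# Assumed model ID for the smallest encoder in Table 9; bge-large-en-v1.5 and
# gte-large-en-v1.5 would be swapped in the same way.
encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")
item_embeddings = encoder.encode(item_texts, normalize_embeddings=True)

# Downstream, these dense vectors are what PSE/OPQ quantizes into n-digit SIDs.
print(item_embeddings.shape)  # (num_items, embedding_dim)
```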

6.2.6. ARM Sample Expansion

This experiment investigates whether autoregressive models (ARMs) benefit from data augmentation in the same way DiffGRM's MDM does by exposing more supervision signals. An ARM-style generative recommender (re-implemented with the GRID framework using RQ-KMeans) is trained on the Toys dataset.

The following are the results from Table 11 of the original paper:

| Setting | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|
| 1x | 0.0415 | 0.0624 | 0.0273 | 0.0341 |
| 4x | 0.0422 | 0.0627 | 0.0274 | 0.0340 |

Findings:

  • Negligible Difference: Duplicating training instances four times (4x) for the ARM (while keeping targets and token order identical) shows only negligible differences in performance compared to using the original dataset (1x).
  • Interpretation: This confirms that for ARMs with teacher forcing, a single pass already covers all n digits sequentially, so simply duplicating the same training instance does not introduce new or diverse supervision signals. In contrast, MDMs (like DiffGRM) generate multiple "views" (masked inputs) from a single sample, each providing distinct supervision signals to the model. This experiment supports the claim that DiffGRM's gains come from exposing richer supervision through its masked diffusion paradigm and OCN, rather than just having more "samples" in a superficial sense.
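A small sketch of this intuition, using a made-up 4-digit SID: duplicating an ARM instance repeats the same prefix-to-digit pairs, whereas different MDM masks produce genuinely different training views.

```python
import random

sid = [12, 7, 33, 5]   # hypothetical n=4-digit SID of the target item
n = len(sid)

# ARM with teacher forcing: one pass already supervises every digit under its
# left-to-right prefix, so a 4x copy just repeats the same (prefix -> digit) pairs.
arm_pairs = [(tuple(sid[:i]), sid[i]) for i in range(n)]
arm_pairs_4x = arm_pairs * 4   # no new supervision signal

# MDM: each sampled mask is a different view of the same SID, conditioning the
# masked digits on a different subset of visible ones.
def random_view(digits, mask_ratio=0.5):
    masked = set(random.sample(range(len(digits)), k=int(len(digits) * mask_ratio)))
    return [None if i in masked else d for i, d in enumerate(digits)]

mdm_views = [random_view(sid) for _ in range(4)]  # four distinct supervision views
print(arm_pairs)
print(mdm_views)
```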

6.2.7. Sliding Window Data Augmentation

This analysis examines the effect of sliding-window augmentation on training sample count and performance for DiffGRM.

The following are the results from Table 12 of the original paper:

| Dataset | Setting | NDCG@10 | Samples |
|---|---|---|---|
| Sports | No sliding window | 0.0237 | 35,598 |
| Sports | Sliding window | 0.0305 | 152,346 |
| Beauty | No sliding window | 0.0350 | 22,363 |
| Beauty | Sliding window | 0.0502 | 105,668 |
| Toys | No sliding window | 0.0396 | 19,412 |
| Toys | Sliding window | 0.0524 | 87,180 |

Findings:

  • Consistent Gains: Applying sliding-window augmentation consistently leads to significant gains in NDCG@10 across all datasets. For example, on Sports, NDCG@10 improves from 0.0237 to 0.0305.
  • Increased Training Samples: Sliding-window augmentation expands each user's interaction sequence into multiple contiguous sub-sequences, greatly increasing the number of effective training samples (e.g., 35,598 → 152,346 on Sports). This exposes the model to more item-SID correspondences and richer contexts (see the sketch after this list).
  • Benefits: This augmentation helps the model learn more generalizable patterns, mitigates overfitting, and improves robustness, especially important given the potential sparsity or noise in real-world recommendation data.
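A minimal sketch of sliding-window expansion under a leave-last-item-as-target convention; the window length and item IDs are assumptions, not taken from the paper.

```python
def sliding_window(sequence, max_len, min_len=2):
    """Expand one interaction sequence into (history, target) training samples.

    Each contiguous prefix (truncated to the last max_len history items) becomes
    its own sample, with the item that follows it as the prediction target.
    """
    samples = []
    for end in range(min_len, len(sequence) + 1):
        history = sequence[max(0, end - 1 - max_len):end - 1]
        target = sequence[end - 1]
        samples.append((history, target))
    return samples

# Hypothetical user history of item IDs; max_len=3 is an assumed window length.
print(sliding_window([101, 102, 103, 104, 105], max_len=3))
```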

6.2.8. Hidden Dimension Analysis

This analysis investigates the impact of the model's hidden dimension d_m on DiffGRM's performance (NDCG@10) and convergence speed (best epoch). The values of d_m tested are {64, 128, 256, 512, 1024}.

As can be seen from the results in Figure 6 of the original paper:

Figure 6: Analysis of DiffGRM performance (NDCG@10) and best epoch w.r.t. hidden dimension d_m. The figure traces NDCG@10 and the best training epoch across values of d_m on Sports, Beauty, and Toys, with separate curves for the two quantities.

  • Performance vs. d_m:
    • For Sports and Beauty, performance (NDCG@10) generally increases with d_m up to a certain point and then plateaus or slightly decreases. The "knee" of the curve, indicating the optimal trade-off, appears around d_m = 256.
    • For Toys, performance continues to improve significantly even up to d_m = 1024, suggesting that this dataset might require a larger model capacity to fully capture its complexities.
  • Convergence Speed vs. d_m:
    • As d_m increases, the "best epoch" (the epoch at which the model achieves its peak performance on the validation set) generally decreases. This indicates that larger models, while more computationally expensive per epoch, tend to converge faster in terms of the number of epochs.
  • Practical Choice: Based on this analysis, the paper chose d_m = 256 for Sports and Beauty (balancing accuracy and efficiency) and d_m = 1024 for Toys (to maximize performance). This highlights the importance of hyperparameter tuning based on dataset characteristics.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces DiffGRM, a novel generative recommendation framework that addresses the inherent limitations of autoregressive models (ARMs) when generating Semantic IDs (SIDs). The core innovation is replacing the ARM decoder with a masked discrete diffusion model (MDM), enabling bidirectional context and parallel, any-order generation of SID digits.

DiffGRM achieves this through three key, interconnected components:

  1. Parallel Semantic Encoding (PSE): Utilizes OPQ subspace quantization to decouple SID digits, balancing per-digit information and removing sequential dependencies that conflict with parallel generation.

  2. On-policy Coherent Noising (OCN): A novel training strategy that uses the model's confidence to identify and prioritize "hard" digits for masking. This coherently allocates supervision, focuses the training budget on high-value signals, and improves sample efficiency, overcoming the combinatorial explosion of masking patterns in MDMs.

  3. Confidence-guided Parallel Denoising (CPD): An inference strategy that performs a global parallel beam search, filling higher-confidence digits first. This allows DiffGRM to generate diverse Top-K SID candidates, a crucial requirement for practical recommendation systems, without sacrificing accuracy.

    Collectively, DiffGRM successfully reconciles the complex cross-digit semantics of SIDs with parallel generation. Experimental results across multiple Amazon datasets demonstrate consistent state-of-the-art performance, with NDCG@10 improvements ranging from 6.9% to 15.5% over strong generative and discriminative baselines. This confirms DiffGRM's accuracy, generalization strength, and ability to produce robust Top-K recommendations.

7.2. Limitations & Future Work

The authors explicitly state one main direction for future work:

  • Inference Efficiency and Scalability: While the paper discusses complexity, it notes that DiffGRM's MD-Decoder has an additional factor of n in its inference complexity compared to ARMs. Although they argue that this introduces only a modest overhead for small n and that CPD's parallel nature helps, further investigation into optimizing inference efficiency and scalability for very large item catalogs or extremely low-latency requirements remains an open area. This suggests that while DiffGRM is accurate, there might still be room for improvement in its real-world deployment performance, especially in highly demanding industrial scenarios.

7.3. Personal Insights & Critique

This paper presents a highly innovative approach to generative recommendation, effectively addressing known limitations of autoregressive models when applied to the structured nature of Semantic IDs.

Inspirations and Applications:

  • Bridging Paradigms: The paper masterfully bridges the gap between diffusion models (successful in image and text generation) and recommender systems. This shows a promising direction for applying advanced generative modeling techniques to structured data beyond traditional language or image domains.
  • Structured Data Generation: The explicit handling of intra-item consistency and inter-digit heterogeneity for SIDs offers valuable insights for generating other forms of structured discrete data. Many real-world entities can be represented as multi-attribute discrete sequences (e.g., product configurations, chemical compounds, user profiles). DiffGRM's PSE and OCN strategies could be adapted to these domains.
  • Targeted Supervision: The On-policy Coherent Noising (OCN) mechanism is a particularly clever innovation. Instead of relying on brute-force random masking or rigid patterns, using the model's own uncertainty to guide supervision is a powerful concept. This could be applied to other masked language modeling or denoising autoencoder tasks to improve learning efficiency, especially with sparse or imbalanced data.
  • Diverse Top-K Generation: The Confidence-guided Parallel Denoising (CPD) is a pragmatic solution for the Top-K requirement in recommendation, which is often overlooked by generative models designed for single-output tasks. This makes MDMs more viable for practical RS applications.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Sensitivity to Semantic Encoder Quality: While the paper shows DiffGRM benefits from larger semantic encoders, the performance is still heavily reliant on the quality of the initial item embeddings generated by these encoders. If the upstream encoder fails to capture crucial semantics, DiffGRM's ability to generate meaningful SIDs would be limited. Further research could explore end-to-end learning or adaptive fine-tuning of the semantic encoder alongside DiffGRM.

  • Interpretability of SIDs: The paper defines SIDs as n-digit sequences. While PSE helps decouple them, the semantic meaning of each digit and how they combine is still somewhat abstract. Improving the interpretability of individual SID digits (e.g., explicitly linking them to human-understandable attributes like "brand", "color", "style") could enhance debugging, transparency, and potentially allow for more granular control over generation.

  • Generalization to Novel SIDs: While generative recommendation promises open-vocabulary capabilities, the extent to which DiffGRM can generate truly novel SIDs (i.e., combinations of digits corresponding to items not seen in training, or entirely new SID structures) is not fully explored. The Top-K candidates are likely variations of observed SIDs.

  • Computational Cost for OCN: While OCN improves sample efficiency, calculating the difficulty scores for each digit requires an initial fully masked pass through the MD-Decoder for every training sample. While amortized across R views, this still adds a constant factor overhead per sample that might be significant for extremely large datasets or very frequent model updates.

  • Hyperparameter Sensitivity: As with many complex deep learning models, DiffGRM likely has sensitivity to hyperparameters such as n (number of digits), M (codebook size), B_act (beam width), and the schedule of OCN's m_r. The paper provides default settings, but tuning these for new domains could be intricate.

  • Beyond SID as Atomic Unit: The paper focuses on generating SIDs as the atomic unit. What about generating sequences of features, or even directly generating textual descriptions of items? Extending the diffusion paradigm to more complex output formats could be a powerful future direction.

    Overall, DiffGRM represents a significant step forward in generative recommendation, providing a robust and theoretically sound framework that addresses critical limitations of prior work. Its innovative use of discrete diffusion and tailored training/inference strategies offer a promising path for building more effective and flexible recommender systems.
