DiffGRM: Diffusion-based Generative Recommendation Model
TL;DR Summary
DiffGRM employs masked discrete diffusion and parallel semantic encoding to enable any-order parallel generation of semantic ID digits, addressing limitations of autoregressive models and improving recommendation accuracy and training efficiency.
Abstract
Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID) and predicts the next item by autoregressively generating its SID conditioned on the user's history. However, two structural properties of SIDs make ARMs ill-suited. First, intra-item consistency: the n digits jointly specify one item, yet the left-to-right causality trains each digit only under its prefix and blocks bidirectional cross-digit evidence, collapsing supervision to a single causal path. Second, inter-digit heterogeneity: digits differ in semantic granularity and predictability, while the uniform next-token objective assigns equal weight to all digits, overtraining easy digits and undertraining hard digits. To address these two issues, we propose DiffGRM, a diffusion-based GR model that replaces the autoregressive decoder with a masked discrete diffusion model (MDM), thereby enabling bidirectional context and any-order parallel generation of SID digits for recommendation. Specifically, we tailor DiffGRM in three aspects: (1) tokenization with Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information; (2) training with On-policy Coherent Noising (OCN) that prioritizes uncertain digits via coherent masking to concentrate supervision on high-value signals; and (3) inference with Confidence-guided Parallel Denoising (CPD) that fills higher-confidence digits first and generates diverse Top-K candidates. Experiments show consistent gains over strong generative and discriminative recommendation baselines on multiple datasets, improving NDCG@10 by 6.9%-15.5%. Code is available at https://github.com/liuzhao09/DiffGRM.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DiffGRM: Diffusion-based Generative Recommendation Model".
1.2. Authors
The authors are:
- Zhao Liu (Kuaishou Technology, Beijing, China)
- Yichen Zhu (Kuaishou Technology, Beijing, China)
- Yiqing Yang (Kuaishou Technology, Beijing, China)
- Guoping Tang (Kuaishou Technology, Beijing, China)
- Rui Huang (Kuaishou Technology, Beijing, China)
- Qiang Luo (Kuaishou Technology, Beijing, China)
- Xiao Lv (Kuaishou Technology, Beijing, China)
- Ruiming Tang (Kuaishou Technology, Beijing, China)
- Kun Gai (Unaffiliated, Beijing, China)
- Guorui Zhou (Kuaishou Technology, Beijing, China)
Most authors are affiliated with Kuaishou Technology, a major internet company known for its short-video platform, indicating a strong industry research background with a focus on practical applications in recommendation systems. Kun Gai is listed as unaffiliated.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (published 2025-10-21T03:23:32 UTC; source link: https://arxiv.org/abs/2510.21805). While not yet formally peer-reviewed in a specific conference or journal, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like machine learning and artificial intelligence, allowing for early sharing and feedback.
1.4. Publication Year
The publication year is 2025, based on the provided UTC timestamp.
1.5. Abstract
Generative recommendation (GR) systems predict the next item by treating each item as an $n$-digit semantic ID (SID) and autoregressively generating its SID based on user history. However, current autoregressive models (ARMs) are ill-suited for SIDs due to two issues: intra-item consistency (digits jointly define an item, but ARMs' left-to-right causality blocks bidirectional context) and inter-digit heterogeneity (digits vary in semantic granularity and predictability, yet ARMs apply a uniform next-token objective, leading to imbalanced training).
To overcome these, the paper proposes DiffGRM, a diffusion-based GR model. DiffGRM replaces the autoregressive decoder with a masked discrete diffusion model (MDM), enabling bidirectional context and parallel, any-order generation of SID digits. It tailors the MDM in three key aspects:
- Tokenization: Employs Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information, contrasting with residual quantization.
- Training: Introduces On-policy Coherent Noising (OCN), which prioritizes uncertain digits through coherent masking, concentrating supervision on high-value signals and avoiding a combinatorial explosion of masking patterns.
- Inference: Develops Confidence-guided Parallel Denoising (CPD), which fills higher-confidence digits first and generates diverse Top-K candidates through a global parallel beam search.

Experiments demonstrate that DiffGRM achieves consistent gains over strong generative and discriminative recommendation baselines across multiple datasets, improving NDCG@10 by 6.9%-15.5%.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.21805
- PDF Link: https://arxiv.org/pdf/2510.21805v1.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the emerging paradigm of Generative Recommendation (GR). In GR, items are represented as sequences of semantic IDs (SIDs), and the recommendation task is framed as predicting the SID of the next item a user might interact with. This prediction is typically handled by autoregressive models (ARMs), often GPT-style Transformers, which generate the SID digits one by one, from left to right, conditioned on the user's historical interactions.
This problem is important because GR offers several advantages, such as unifying item representation and prediction, benefiting from large-scale language modeling techniques, and supporting open-vocabulary recommendation. However, the paper identifies two critical structural properties of SIDs that make ARMs ill-suited, posing significant challenges:
- Intra-item consistency: The digits of an SID are not independent tokens like words in a sentence; they collectively and jointly specify one single item. For instance, an SID might encode "Dior Rouge 999 Velvet" across its digits. ARMs enforce a strict left-to-right causal dependency, meaning each digit is predicted only from its preceding digits. This inherently blocks the use of bidirectional context or "cross-digit evidence" that could help verify the overall item identity, collapsing supervision into a single causal path. Early errors can therefore propagate through the entire SID generation.
- Inter-digit heterogeneity: The SID digits often encode different semantic granularities (e.g., Category, Brand, Type, Size). Consequently, these digits differ significantly in their semantic load, predictability, and inherent difficulty. For example, predicting a general Category might be easier than a specific Size. Yet, the standard next-token objective used by ARMs assigns equal weight to predicting every digit. This uniform supervision leads to an imbalance: easy digits can be overtrained, while hard digits receive insufficient training signal, hindering accurate SID generation.

The paper's entry point and innovative idea is to draw inspiration from the rapid advancements in discrete diffusion modeling to replace the ARM decoder. Masked Discrete Diffusion Models (MDMs) inherently support bidirectional context, parallel generation, and richer supervision through random noising, which align better with the structural characteristics of SID representations. However, directly applying MDMs is not optimal; task-specific adaptations for recommendation are necessary.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introduction of DiffGRM: Proposing DiffGRM, the first diffusion-based generative recommendation framework. It replaces the autoregressive decoder with a masked discrete diffusion model (MDM) operating over SID digits. This fundamental shift removes left-to-right causal constraints, enabling the exploitation of bidirectional cross-digit context, which is crucial for intra-item consistency.
- Novel Adaptations for Tokenization, Training, and Inference:
  - Parallel Semantic Encoding (PSE): For tokenization, DiffGRM adopts PSE (e.g., OPQ-based) to decouple SID digits. This move away from residual quantization (RQ) makes digits independent, balances per-digit information, and allows for fully parallel prediction, addressing the inter-digit heterogeneity and intra-item consistency issues at the representation level.
  - On-policy Coherent Noising (OCN): For training, OCN addresses the combinatorial explosion of supervision signals in MDMs when applied to SIDs. It uses the current model's confidence to identify and prioritize the most "uncertain" or "hard" digits for masking, thereby focusing the training budget on high-value signals. This improves sample efficiency and allocates supervision more effectively than random masking.
  - Confidence-guided Parallel Denoising (CPD): For inference, CPD meets the recommendation task's need for diverse Top-K candidates, unlike typical MDM greedy decoding for single outputs. CPD performs a global parallel beam search, filling higher-confidence digits first and then completing the rest, yielding accurate and diverse Top-K SID candidates.
- State-of-the-Art Performance: The paper demonstrates that DiffGRM achieves state-of-the-art results across multiple public datasets (Amazon Reviews: Sports, Beauty, Toys). It consistently outperforms strong generative and discriminative recommendation baselines, improving NDCG@10 by a significant margin (6.9%-15.5%). This finding validates the accuracy and generalization strength of the DiffGRM framework.

These contributions collectively solve the identified problems by reconciling the joint nature of SID digits with parallel generation, providing a more balanced and efficient training mechanism, and enabling effective Top-K recommendation, thereby advancing the field of generative recommendation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DiffGRM, several foundational concepts in recommendation systems and deep learning are essential.
- Recommendation Systems (RS): Systems designed to predict user preferences and suggest items (e.g., products, movies, articles) that a user might like. They are crucial for personalization in various online platforms.
- Sequential Recommendation: A subfield of RS where the order of user interactions (e.g., a sequence of purchased items) is important. The goal is to predict the next item in a sequence based on past interactions.
- Generative Recommendation (GR): An emerging paradigm in RS that frames the recommendation task as generating the representation of the target item. Instead of scoring existing items, GR aims to directly produce the "description" or "ID" of a suitable next item.
- Semantic ID (SID): In GR, an item is not represented by a simple integer ID but by a sequence of discrete digits, called a semantic ID. Each digit typically corresponds to a code from a learned codebook, and collectively these digits encode the item's semantic features (e.g., brand, category, style). This allows items to be represented by their content and properties rather than arbitrary identifiers.
- Tokenization: The process of converting raw item content (e.g., text descriptions, images) into a discrete, fixed-length sequence of SID digits. This usually involves an encoder that maps content to dense vectors, followed by a quantization step that discretizes these vectors into tokens (digits).
  - Vector Quantization (VQ): A technique used in tokenization to map continuous vectors (embeddings) to discrete codes (tokens). It involves learning a set of "codebook" vectors (centroids), and each continuous vector is assigned the ID of its nearest codebook vector.
  - Codebook: A collection of discrete codes or centroids used in Vector Quantization. In SID generation, there are typically $n$ codebooks, one for each digit, each containing $V$ possible codes.
- Autoregressive Models (ARMs): A class of generative models that predict future elements in a sequence based on past elements. In the context of language models and GR, this means predicting the next token (or SID digit) conditioned on all previously generated tokens in a left-to-right fashion. GPT-style Transformers are prominent examples of ARMs.
- Transformer: A neural network architecture introduced by Vaswani et al. (2017) that relies heavily on the self-attention mechanism. It has become foundational for many sequence-to-sequence tasks, including natural language processing and sequential recommendation.
  - Self-Attention: A mechanism within Transformers that allows a model to weigh the importance of different parts of an input sequence when processing a specific element. Unlike recurrent neural networks (RNNs), it can capture long-range dependencies efficiently.
  - Encoder-Decoder Architecture: A common Transformer configuration. The encoder processes the input sequence (e.g., user history) to create a contextual representation, and the decoder uses this representation to generate the output sequence (e.g., the target SID).
- Diffusion Models: A class of generative models that learn to reverse a gradual noising process.
  - Discrete Diffusion Models (DDMs): Adaptations of diffusion models for discrete data (like text tokens or SID digits). They typically involve a "forward" process that gradually corrupts clean data by replacing tokens with a special [MASK] token or randomly sampling from the vocabulary, and a "reverse" process that learns to predict the original clean data from the corrupted version.
  - Masked Discrete Diffusion Model (MDM): A specific type of DDM where the corruption process masks out a subset of tokens in the sequence, and the model learns to predict the original tokens for these masked positions in parallel. This contrasts with autoregressive models that predict one token at a time (a short illustrative sketch follows this list).
- Cross-Entropy Loss: A common loss function used in classification tasks, including next-token prediction and masked language modeling. It measures the difference between the predicted probability distribution over classes (e.g., possible SID digits) and the true distribution.
- Label Smoothing: A regularization technique for classification models that prevents them from becoming overconfident. Instead of using hard one-hot labels (e.g., [0, 1, 0]), it softens the target distribution by giving a small probability mass to incorrect labels (e.g., [0.05, 0.9, 0.05]). This can improve generalization.
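To make the masking corruption concrete, here is a minimal, self-contained sketch; the 4-digit SID values, vocabulary, and mask ratio are invented for illustration and are not taken from the paper.

```python
import random

MASK = "[MASK]"

def forward_mask(sid, mask_ratio, rng=None):
    """Absorbing-state corruption: replace a random subset of SID digits with [MASK]."""
    rng = rng or random.Random(0)
    n = len(sid)
    num_masked = max(1, round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    corrupted = [MASK if i in masked_idx else d for i, d in enumerate(sid)]
    return corrupted, masked_idx

clean_sid = [17, 203, 5, 88]                      # hypothetical 4-digit semantic ID
corrupted, masked = forward_mask(clean_sid, 0.5)  # mask half of the digits
print(corrupted)                                  # e.g. [17, '[MASK]', 5, '[MASK]']
print(sorted(masked))                             # positions the model must reconstruct in parallel
```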
3.2. Previous Works
The paper contextualizes DiffGRM against both Generative Recommendation Models and Discrete Diffusion Language Models.
3.2.1. Generative Recommendation Models
Prior GR approaches often cast recommendation as sequence generation, where items are discretized into SIDs and a Transformer predicts the target SID token by token. Key examples include:
- TIGER [48]: A representative model that uses Residual Quantization (RQ)-VAE for tokenizing items into SIDs and then decodes them autoregressively. This is a primary baseline for DiffGRM, highlighting the ARM approach that DiffGRM aims to improve upon.
- HSTU [65]: Frames recommendation as large-scale sequence transduction, discretizing raw item features into tokens.
- RPG [20]: A generative model that predicts unordered semantic IDs in parallel using a multi-token objective, combined with graph-guided decoding. This model is a closer conceptual relative to DiffGRM in its aim for parallel prediction, but DiffGRM leverages diffusion rather than a multi-token objective and graph decoding.
- Other AR-based GRs: GenNewsRec [9], MTGRec [72], and ETEGRec [32] focus on various aspects like integrating LLM reasoning, enhancing quantization, or improving token quality.
- ActionPiece [21]: Focuses on context-aware tokenization, representing actions as unordered sets of item features.

The common thread among many of these, especially TIGER, is the reliance on residual quantization (RQ) and autoregressive decoding, which DiffGRM identifies as problematic.
3.2.2. Discrete Diffusion Language Models
Diffusion models originated for continuous data (DDPMs [16, 37, 56, 57]) and were later extended to discrete spaces [1, 3, 17].
- Structured Denoising Diffusion Models [1, 2, 3]: These works laid the groundwork for applying diffusion to discrete state-spaces, moving beyond continuous data. They explore how to define forward noising and reverse denoising processes for discrete tokens.
- Masked Diffusion Language Models (MDMs): MDMs (e.g., [42, 64]) are particularly relevant as they corrupt sequences by masking tokens and learn to predict them in parallel. This mechanism is central to DiffGRM's decoder.
- Advanced Sampling Strategies [35, 44, 46]: Research has focused on improving reverse-sampling strategies for DDMs to enhance performance and efficiency in natural language tasks.

The paper notes that most advances in DDMs target free-form, single-output text generation, whereas GR requires structured $n$-digit SIDs and a Top-K candidate set, necessitating specific adaptations.
3.3. Technological Evolution
Recommendation systems have evolved significantly:
- Early Systems (e.g., Collaborative Filtering): Focused on user-item interaction patterns without deep understanding of item content.
- Item ID-based Discriminative Models (e.g., GRU4Rec, SASRec, BERT4Rec): These models learn embeddings for item IDs and predict the next item ID directly. They improved sequential modeling but often lacked semantic richness.
- Semantic-enhanced Discriminative Models (e.g., FDSA, S³-Rec, VQ-Rec, RecJPQ): Incorporated item content features (text, images) alongside item IDs, often through pre-trained language models or quantization, to enrich item representations and improve prediction.
- Semantic ID-based Generative Models (e.g., TIGER, HSTU, RPG, ActionPiece): This is the direct lineage for DiffGRM. These models move beyond predicting the next item ID to generating its semantic representation (SID). This allows for open-vocabulary recommendation (generating items not seen during training) and leveraging powerful Transformer architectures from NLP. Initially, these were predominantly autoregressive.
- Diffusion-based Generative Models (DiffGRM): DiffGRM represents a new evolutionary step within generative recommendation, replacing the autoregressive generation of SIDs with diffusion-based parallel generation. This aims to overcome the inherent limitations of ARMs (causality, uniform supervision) when dealing with the unique structure of SIDs. Simultaneously, it adapts discrete diffusion models, which have seen success in NLP, to the specific requirements of Top-K recommendation over structured SIDs.
3.4. Differentiation Analysis
Compared to the main methods in related work, DiffGRM introduces several core differences and innovations:
- From Autoregressive Generation to Masked Diffusion: The most fundamental difference is replacing the autoregressive decoder (common in TIGER and HSTU) with a masked discrete diffusion model. This shifts from sequential, left-to-right generation to parallel, any-order generation.
  - Innovation: ARMs suffer from intra-item consistency issues (lack of bidirectional context) and inter-digit heterogeneity (uniform supervision). MDMs naturally provide bidirectional context and can offer richer, more flexible supervision signals, directly tackling these limitations. RPG also aims for parallel prediction but uses a multi-token objective and graph-guided decoding, which differs from the diffusion process.
- From Residual Quantization (RQ) to Parallel Semantic Encoding (PSE): Most generative recommender tokenizers, like RQ-VAE in TIGER, use RQ. RQ introduces residual dependencies and an unbalanced information distribution across digits.
  - Innovation: PSE (e.g., OPQ-based) decouples SID digits into independent subspaces. This balances per-digit information and removes the sequential coupling inherent in RQ, which aligns much better with the MDM's parallel, any-order prediction capabilities.
- Novel Training Strategy (OCN): MDMs can face a combinatorial explosion of masking patterns and fragmented supervision.
  - Innovation: OCN is a task-specific training strategy for MDMs in GR. It uses an "on-policy" approach to identify and prioritize the "hardest" or most uncertain digits based on the current model's confidence. By coherently masking these digits and constructing nested views, OCN focuses the training budget on high-value signals, improving sample efficiency and optimizing supervision allocation.
- Novel Inference Strategy (CPD): Standard MDMs often rely on greedy decoding for a single output, which is insufficient for Top-K recommendation.
  - Innovation: CPD adapts MDM inference for recommendation by implementing a Confidence-guided Parallel Denoising process. It performs a global parallel beam search, filling higher-confidence digits first. This allows DiffGRM to generate diverse Top-K SID candidates, a crucial requirement for recommendation systems, while still leveraging the MDM's parallel capabilities.

In essence, DiffGRM innovatively merges the strengths of discrete diffusion models with adaptations tailored to the unique challenges of semantic ID representations in recommendation, thereby addressing fundamental limitations of previous autoregressive generative recommenders.
4. Methodology
4.1. Principles
The core idea behind DiffGRM is to replace the autoregressive decoder in existing Generative Recommendation (GR) frameworks with a masked discrete diffusion model (MDM). This fundamental shift is motivated by the desire to overcome two key limitations of ARMs when dealing with Semantic IDs (SIDs):
- Intra-item consistency: ARMs predict SID digits sequentially (left to right), which prevents bidirectional context and mutual verification among digits that jointly define a single item. MDMs naturally enable bidirectional context and parallel prediction.
- Inter-digit heterogeneity: ARMs apply a uniform next-token objective to all SID digits, regardless of their varying semantic granularity and predictability. MDMs, through their masking mechanism, can be designed to allocate supervision more strategically.

The theoretical basis and intuition are that a diffusion model, by learning to denoise a corrupted input, implicitly learns the underlying data distribution and can generate diverse samples. By applying a masking corruption process to SID digits, the model is trained to predict the original digits given a partial context. This allows it to learn dependencies among all digits, not just causal prefixes, and to focus on more difficult predictions. The proposed DiffGRM tailors this MDM framework specifically for GR through three key adaptations: Parallel Semantic Encoding (PSE), On-policy Coherent Noising (OCN), and Confidence-guided Parallel Denoising (CPD).
4.2. Core Methodology In-depth (Layer by Layer)
DiffGRM operates as an encoder-decoder architecture. The encoder processes the user's interaction history, and the MD-Decoder generates the $n$-digit SID of the next item.
4.2.1. Overall Architecture and Workflow
The workflow begins by processing raw item content. Items are first tokenized into SIDs using Parallel Semantic Encoding (PSE). This converts each user's interaction history into a sequence of SIDs.
- Encoder: For each item SID in the user's history, its digits are embedded, concatenated, and then projected into an item vector. Positional embeddings are added to this sequence of item vectors, and the entire sequence is fed into a Transformer encoder. The encoder outputs a contextual representation of the user history, denoted $\mathbf{H} \in \mathbb{R}^{L \times d}$, where $L$ is the encoder input length and $d$ is the model's hidden dimension. This summarizes the user's past behavior.
- MD-Decoder: The MD-Decoder takes a partially masked $n$-digit input representing the target item's SID. Unlike autoregressive decoders, it applies non-causal (bidirectional) self-attention across all digits of this input, so each digit can attend to every other digit, enabling bidirectional intra-item semantics and cross-digit mutual verification. The MD-Decoder then performs cross-attention to $\mathbf{H}$ (the encoder-side k/v are derived from $\mathbf{H}$), integrating user-history context. Finally, it predicts all masked digits in parallel (see the sketch below).

During multi-view training, the encoder output $\mathbf{H}$ is computed once per sample and cached. The encoder-side k/v (key and value projections from the encoder output, used in cross-attention) are also cached for reuse across different "views" (masked inputs) and CPD steps, amortizing the cross-attention cost.
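The following is a minimal PyTorch-style sketch of this encoder/decoder wiring. The sizes (4 digits, a 256-way codebook, hidden width 128) and the generic `nn.Transformer` layers are illustrative assumptions rather than the paper's exact architecture; the point is only that the decoder's self-attention over SID digits is non-causal and that it cross-attends to the cached history representation $\mathbf{H}$.

```python
import torch
import torch.nn as nn

n_digits, vocab, d = 4, 256, 128   # hypothetical sizes, not the paper's exact config

class TinyDiffGRM(nn.Module):
    def __init__(self):
        super().__init__()
        self.digit_emb = nn.Embedding(vocab + 1, d)      # extra id = [MASK] token
        self.pos_emb = nn.Embedding(n_digits, d)
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.md_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.head = nn.Linear(d, vocab)                  # per-digit distribution over the codebook

    def encode_history(self, hist_emb):
        # hist_emb: (B, L, d) item vectors built from each history item's SID digits
        return self.encoder(hist_emb)                    # H, cached and reused across views/steps

    def forward(self, masked_digits, H):
        # masked_digits: (B, n) digit ids, with `vocab` used as the [MASK] id
        pos = torch.arange(n_digits, device=masked_digits.device)
        x = self.digit_emb(masked_digits) + self.pos_emb(pos)
        # no causal mask: every digit attends to every other digit (bidirectional)
        # and cross-attends to the encoded user history H
        h = self.md_decoder(tgt=x, memory=H)
        return self.head(h)                              # (B, n, vocab) logits for all digits in parallel

model = TinyDiffGRM()
H = model.encode_history(torch.randn(2, 10, d))              # 2 users, 10 history items each
masked = torch.full((2, n_digits), vocab, dtype=torch.long)  # fully masked target SID
logits = model(masked, H)
print(logits.shape)                                          # torch.Size([2, 4, 256])
```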
4.2.2. SID Generation Objective
The overall objective is to learn a conditional generator that maximizes the conditional log-likelihood of the target SID given the user history:

$$\max_{\theta}\ \mathbb{E}_{u \in \mathcal{U}}\big[\log p_{\theta}(y_u \mid S_u)\big]$$

Here:
- $\theta$: the parameters of the model (including the MD-Decoder).
- $\mathcal{U}$: the set of all users.
- $\mathbb{E}_{u \in \mathcal{U}}[\cdot]$: the expectation taken over users in $\mathcal{U}$.
- $p_{\theta}(y_u \mid S_u)$: the conditional probability distribution of the target SID given the user history.
- $y_u$: the SID of the next item (the ground-truth target).
- $S_u$: the sequence of SIDs representing user $u$'s historical interactions.

The model is instantiated by the MD-Decoder, which is trained with digit-wise supervision on masked digits.
4.2.3. Masked Diffusion for Parallel Token Prediction
DiffGRM adopts masked diffusion for generating the $n$-digit SID.

- Forward Process: This process applies an absorbing-state mask corruption to a clean SID sequence $x_0 = (s_1, \ldots, s_n)$. It replaces a subset of digits with a special [MASK] token according to a time-dependent schedule. The corrupted sequence at mask ratio $t$ is denoted $x_t$.
- Reverse Process: The MD-Decoder is trained to predict all masked digits in parallel from the corrupted sequence $x_t$.

The MD-Decoder is trained using a masked-digit cross-entropy loss:

$$\mathcal{L}_{\mathrm{MDM}}(\theta) = \mathbb{E}_{x_0,\, t,\, x_t \sim q(x_t \mid x_0, t)}\left[\frac{1}{|\mathcal{M}_t|}\sum_{i \in \mathcal{M}_t} -\log p_{\theta}\big(x_0^{(i)} \mid x_t, t\big)\right]$$

Where:
- $\mathcal{L}_{\mathrm{MDM}}(\theta)$: the masked-digit cross-entropy loss for model parameters $\theta$.
- $x_0$: the clean (ground-truth) SID sequence.
- $t$: the mask ratio (or time step), indicating the proportion of digits masked, with $t \in (0, 1]$.
- $x_t$: the corrupted version of $x_0$ at mask ratio $t$.
- $q(x_t \mid x_0, t)$: the forward noising process, which samples a corrupted sequence given $x_0$ and $t$.
- $\mathcal{M}_t$: the set of indices of digits that are masked at mask ratio $t$.
- $|\mathcal{M}_t|$: the number of masked digits.
- $i$: an index iterating over the masked digits.
- $p_{\theta}(x_0^{(i)} \mid x_t, t)$: the MD-Decoder's predicted probability distribution for the original digit at position $i$, given the corrupted sequence $x_t$ and mask ratio $t$; that is, the model predicts what the clean digit should be at each masked position.

This objective removes causal constraints, allows for richer supervision signals (by varying masked sets), and enables efficient parallel generation.
The training process computes the encoder output $\mathbf{H}$ once per sample. The MD-Decoder then takes a partially masked $n$-digit input $x^{(v)}$ (a "view" generated by OCN) and predicts the masked digits. The loss for a single view $v$ is:

$$\mathcal{L}_v = -\sum_{i \in \mathcal{M}_v} \sum_{c \in \mathcal{C}} \tilde{y}_i(c)\, \log p_{\theta}\big(s_i = c \mid x^{(v)}, \mathbf{H}\big)$$

Where:
- $\mathcal{L}_v$: the loss for view $v$.
- $\mathcal{M}_v$: the set of masked indices for view $v$.
- $i$: an index of a masked digit.
- $c$: a value from the codebook $\mathcal{C}$.
- $V = |\mathcal{C}|$: the size of the per-digit codebook.
- $\tilde{y}_i(c)$: the smoothed one-hot target distribution for digit $i$ (using label smoothing). If the true digit is $s_i = c$, then $\tilde{y}_i(c)$ is close to 1, and the other entries are small.
- $p_{\theta}(s_i = c \mid x^{(v)}, \mathbf{H})$: the MD-Decoder's predicted probability that digit $i$ takes value $c$, given the partially masked input $x^{(v)}$ and the user-history representation $\mathbf{H}$.
- $x^{(v)}$: the partially masked $n$-digit input for view $v$.
- $\mathbf{H}$: the user-history representation from the encoder.

The total loss aggregates across the small set of views constructed by OCN:

$$\mathcal{L} = \sum_{v=1}^{M} \mathcal{L}_v$$

Where:
- $\mathcal{L}$: the total training loss.
- $M$: the number of coherent views generated by OCN for a single sample.
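A minimal sketch of the per-view loss under these definitions; the label-smoothing value, tensor shapes, and digit/codebook sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def view_loss(logits, targets, masked_idx, eps=0.1):
    """Masked-digit cross-entropy with label smoothing for one OCN view.

    logits:     (n, V) per-digit scores from the MD-Decoder
    targets:    (n,)   ground-truth SID digits
    masked_idx: digit positions masked in this view
    """
    n, V = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)
    # smoothed one-hot targets: (1 - eps) on the true digit, eps spread elsewhere
    smooth = torch.full((n, V), eps / (V - 1))
    smooth[torch.arange(n), targets] = 1.0 - eps
    per_digit = -(smooth * log_probs).sum(dim=-1)   # cross-entropy for every digit
    return per_digit[list(masked_idx)].sum()        # supervise only the masked digits

logits = torch.randn(4, 256)                # n = 4 digits, V = 256 codewords (assumed sizes)
targets = torch.tensor([17, 203, 5, 88])    # hypothetical clean SID
loss = view_loss(logits, targets, masked_idx=[1, 3])
# the per-sample total loss would sum view_loss over the M coherent views built by OCN
print(loss.item())
```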
4.2.4. Parallel Semantic Encoding (PSE)
To address the issues of residual quantization (RQ) (unbalanced information, sequential dependency), DiffGRM uses Parallel Semantic Encoding (PSE).
- Item Embedding: An item with content features is first mapped to a $D$-dimensional continuous representation $z$ using a semantic encoder (e.g., Sentence-T5).
- Orthogonal Rotation: An orthogonal rotation matrix $R$ is learned and applied to $z$ to reduce downstream quantization distortion: $\tilde{z} = Rz$.
- Partitioning: The rotated vector is evenly partitioned into $n$ subvectors: $\tilde{z} = [\tilde{z}_1; \tilde{z}_2; \ldots; \tilde{z}_n]$. Each subvector corresponds to a specific digit position.
- Independent Quantization: Each subvector is independently quantized to a code from its respective per-digit codebook $\mathcal{C}_k = \{e_{k,1}, \ldots, e_{k,V}\}$. The SID digit is obtained by finding the nearest centroid:
  $$s_k = \arg\min_{c \in \{1, \ldots, V\}} \big\|\tilde{z}_k - e_{k,c}\big\|_2^2$$
  Where:
  - $s_k$: the $k$-th SID digit for the item.
  - $\tilde{z}_k$: the $k$-th subvector of the rotated item embedding.
  - $e_{k,c}$: the $c$-th centroid vector in the $k$-th codebook.
  - $\|\cdot\|_2^2$: squared Euclidean distance.

This process yields an $n$-digit SID with decoupled per-digit assignments. It removes the residual sequential dependence found in RQ and enables fully parallel prediction across digits, aligning with the MDM's capabilities.
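Below is a minimal sketch of this parallel, product-quantization-style tokenization using scikit-learn k-means for the per-digit codebooks. The embedding dimension, digit count, codebook size, and the identity rotation stand in for the learned OPQ components and are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D, n_digits, V = 64, 4, 16            # assumed: embedding dim, #digits, per-digit codebook size
items = rng.normal(size=(500, D))     # stand-in for Sentence-T5 item embeddings

R = np.eye(D)                         # OPQ learns an orthogonal rotation; identity here for brevity
rotated = items @ R.T
subvectors = np.split(rotated, n_digits, axis=1)   # one D/n-dimensional subspace per digit

# fit one independent codebook per digit position
codebooks = [KMeans(n_clusters=V, n_init=10, random_state=0).fit(sub) for sub in subvectors]

def tokenize(item_emb):
    """Map one item embedding to its n-digit SID by nearest-centroid search per subspace."""
    z = (R @ item_emb).reshape(n_digits, -1)
    return [int(cb.predict(z[k][None, :])[0]) for k, cb in enumerate(codebooks)]

print(tokenize(items[0]))             # e.g. [3, 11, 7, 0] -- each digit is assigned independently
```

A learned rotation (as in OPQ) would replace the identity matrix `R`; the property being illustrated is that every digit is assigned in its own subspace, with no residual dependence on the other digits.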
4.2.5. On-policy Coherent Noising (OCN)
OCN aims to make MDM training more efficient by focusing supervision on "hard" digits, addressing the combinatorial explosion of masking patterns.

- Difficulty Estimation: For each training sample, the MD-Decoder is run once on a fully masked $n$-digit input (representing the "last view", with all $n$ digits masked, i.e., mask ratio 1). This "probe" produces a predictive distribution for each digit $k$. From it, two scores are computed:
  - Maximum Confidence: $\hat{p}_k = \max_{c \in \mathcal{C}} p_{\theta}(s_k = c \mid x^{\mathrm{mask}}, \mathbf{H})$
  - Difficulty Score: $d_k = -\log \hat{p}_k$

  Where:
  - $\hat{p}_k$: the maximum predicted probability (confidence) for digit $k$ when all digits are masked.
  - $p_{\theta}(s_k = c \mid x^{\mathrm{mask}}, \mathbf{H})$: the MD-Decoder's predicted probability that digit $k$ takes value $c$ in the fully masked view $x^{\mathrm{mask}}$.
  - $d_k$: the difficulty score for digit $k$. A larger $d_k$ indicates lower confidence (higher perplexity) and thus higher difficulty.

  These scores induce a policy for selecting digits.
- Digit Ordering: Digits are sorted in descending order of their difficulty scores to obtain a permutation $\pi$ from hardest to easiest. Ties are broken with a fixed digit order.
- Coherent View Construction: A small nested set of $M$ views is constructed per sample, ordered from light to heavy corruption. For view $v$ (with $v = 1, \ldots, M$), $m_v$ digits are masked, where $m_1 \le m_2 \le \cdots \le m_M = n$ is a non-decreasing schedule. The masked index set for view $v$ is defined as (see the sketch after this list):
  $$\mathcal{M}_v = \{\pi(1), \pi(2), \ldots, \pi(m_v)\}$$
  This means view $v$ masks the $m_v$ hardest digits according to the current model's policy. The remaining digits ($k \notin \mathcal{M}_v$) are kept visible with their clean embeddings.
- Layer-0 Input for Views: The input embedding for digit $k$ in view $v$, denoted $e_k^{(v)}$, is constructed as:
  $$e_k^{(v)} = \begin{cases} e_{[\mathrm{MASK}],k}, & k \in \mathcal{M}_v \\ e(s_k), & k \notin \mathcal{M}_v \end{cases}$$
  Where:
  - $e_{[\mathrm{MASK}],k}$: the embedding for the [MASK] token at position $k$.
  - $e(s_k)$: the embedding for the true SID digit at position $k$.

  The masked sets are nested: the visible context grows as $v$ decreases (from fully masked toward progressively fewer masked digits), while the corruption ratio increases from light to heavy masking across views. This strategy avoids combinatorial masking, stabilizes optimization with progressively richer evidence, and concentrates gradients on the same hard digits under increasing context.
- Binary View Matrix: All views can be represented by a binary matrix:
  $$\mathbf{V} = [\mathbf{v}_1; \ldots; \mathbf{v}_M] \in \{0, 1\}^{M \times n}$$
  Where:
  - $\mathbf{V}$: the matrix representing the masking patterns for all views.
  - $\mathbf{v}_v$: a binary vector for view $v$, where $\mathbf{v}_v[k] = 1$ if digit $k$ is masked in view $v$, and 0 otherwise.

The encoder output $\mathbf{H}$ is reused across all views, and the difficulty order is computed once.
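A minimal sketch of the view-construction step, assuming the probe's per-digit confidences are already available; the schedule, the example numbers, and the use of negative log-confidence as the monotone difficulty score are illustrative choices.

```python
import math

def build_ocn_views(max_confidences, schedule):
    """Order digits from hardest to easiest and build nested masked-index sets.

    max_confidences: per-digit max probability from a probe on the fully masked input
    schedule:        non-decreasing number of masked digits per view, ending at n
    """
    n = len(max_confidences)
    difficulty = [-math.log(p) for p in max_confidences]                # larger = harder
    order = sorted(range(n), key=lambda k: difficulty[k], reverse=True) # hardest first
    views = [set(order[:m]) for m in schedule]                          # nested masked sets
    return order, views

# probe of a 4-digit SID: digit 2 is the least confident, digit 1 the most
confidences = [0.60, 0.90, 0.15, 0.35]
order, views = build_ocn_views(confidences, schedule=[1, 2, 4])
print(order)   # [2, 3, 0, 1]  -> hardest-to-easiest permutation pi
print(views)   # [{2}, {2, 3}, {0, 1, 2, 3}]  -> light-to-heavy coherent masking
```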
4.2.6. Confidence-guided Parallel Denoising (CPD)
For inference, CPD generates diverse Top-K SID candidates using a global parallel beam search.
The process maintains an active set of partial SIDs and progresses over $n$ reverse steps, from a fully masked input to a fully resolved SID. The encoder-side k/v are computed once and cached.

- Initialization (step 1): Starting from a fully masked input $x^{\mathrm{mask}}$, the MD-Decoder predicts distributions for all digits. The active set is initialized by selecting the top-$B$ digit-codeword pairs based on their predicted log-probabilities:
  $$\mathcal{A}_1 = \operatorname{Top-}B_{(k,\, c)}\ \big\{\log p_{\theta}(s_k = c \mid x^{\mathrm{mask}}, \mathbf{H})\big\}$$
  Where:
  - $B$: the per-step beam width (number of top candidates to keep).
  - $k$: the index of an SID digit.
  - $c$: a possible codeword from the codebook $\mathcal{C}$.
  - $\log p_{\theta}(s_k = c \mid x^{\mathrm{mask}}, \mathbf{H})$: the predicted log-probability of assigning codeword $c$ to digit $k$.
  - $\operatorname{Top-}B$: selects the $B$ highest scores across all possible $(k, c)$ pairs.
- Denoising Steps (steps $2, \ldots, n$): At each subsequent denoising step, for each branch $b$ in the current active set, CPD scores filling one of its still-masked digits. The score for extending branch $b$ by filling digit $k$ with codeword $c$ is:
  $$\mathrm{score}(b, k, c) = \ell_b + \log p_{\theta}\big(s_k = c \mid x^{(b)}, \mathbf{H}\big)$$
  Where:
  - $\mathrm{score}(b, k, c)$: the accumulated score for the new partial SID after filling digit $k$ with $c$ in branch $b$.
  - $\ell_b$: the accumulated log-probability of branch $b$ at the current step.
  - $x^{(b)}$: the partially masked SID sequence for branch $b$ at the current step.
  - $k$: an index of a digit that is still masked in branch $b$.
- Per-step Truncation: After scoring all possible extensions from all active branches, the $B$ highest-scoring new partial SIDs are selected to form the next active set. For each selected tuple $(b, k, c)$, digit $k$ is filled with codeword $c$, and $k$ is removed from the masked-index set of that specific branch. All other digits in the SID (those not just filled) remain masked, so the number of masked digits decreases by one at every step.
- Completion: The iteration continues until no masked digits remain, at which point all digits are filled, yielding the final set of candidate SIDs. Finally, generated sequences are deduplicated, and the Top-K distinct SIDs are kept based on their accumulated scores. This ensures diverse Top-K recommendations (a small sketch of this procedure follows).
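Below is a minimal sketch of this confidence-guided beam expansion, with a random stub standing in for the real MD-Decoder; the beam width, digit count, codebook size, and the scoring stub are illustrative assumptions.

```python
import heapq
import math
import random

n_digits, vocab, beam = 4, 8, 3          # assumed toy sizes; real codebooks are much larger
random.seed(0)

def digit_log_probs(partial):
    """Stub for the MD-Decoder: log-probs for every still-masked digit.

    In DiffGRM this would come from one parallel decoder pass conditioned on the
    cached user-history representation H.
    """
    out = {}
    for k, value in enumerate(partial):
        if value is None:                                  # None plays the role of [MASK]
            scores = [random.random() for _ in range(vocab)]
            total = sum(scores)
            out[k] = [math.log(s / total) for s in scores]
    return out

beams = [(0.0, tuple([None] * n_digits))]                  # (accumulated log-prob, partial SID)
for _ in range(n_digits):                                  # one digit gets resolved per reverse step
    candidates = []
    for acc, partial in beams:
        for k, logps in digit_log_probs(partial).items():  # score all (digit, codeword) fills
            for c, lp in enumerate(logps):
                filled = list(partial)
                filled[k] = c
                candidates.append((acc + lp, tuple(filled)))
    beams = heapq.nlargest(beam, candidates, key=lambda x: x[0])   # per-step truncation to B branches

best = {}                                                  # deduplicate finished SIDs, keep best score
for acc, sid in beams:
    if sid not in best or acc > best[sid]:
        best[sid] = acc
for sid, acc in sorted(best.items(), key=lambda x: -x[1]):
    print(sid, round(acc, 3))                              # Top-K distinct candidate SIDs
```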
4.2.7. Discussion
- Why not RQ: Residual Quantization (RQ), used in many generative recommendation tokenizers, introduces a strict left-to-right residual dependency between digits. This creates two problems for DiffGRM: (1) it leads to an unbalanced information distribution and inter-digit heterogeneity, conflicting with the goal of balanced per-digit information, and (2) its hierarchical residual dependency creates a left-to-right bias that directly conflicts with the MDM's parallel, any-order prediction. PSE addresses this by factorizing the representation into independent subspaces.
- Complexity: Let $L$ be the history length, $n$ the SID digit count, $d$ the model hidden size, $V$ the codebook size, $M$ the number of coherent views, and $B$ the active beam width. The leading per-module costs compare as follows:

| Module | ARM | DiffGRM |
|---|---|---|
| Encoder | one pass over the length-$L$ history | same (computed once and cached) |
| Decoder—training | 1 decoder pass per sample | $M$ decoder passes (one per coherent view) |
| Decoder—inference | $n$ incremental decoding steps | $n + 1$ parallel denoising steps over $B$ beams |

  - Training Complexity: The encoder is computed once. ARM needs one decoder pass per sample, while DiffGRM requires $M$ decoder passes (one for each coherent view). Since $M$ is typically small and the history length $L$ is often large in industrial settings, the encoder term dominates, making the overall training complexity similar for both.
  - Inference Complexity: Both run the encoder once. ARM's decoder performs $n$ incremental steps. DiffGRM with CPD involves an initial fully masked pass plus $n$ reverse steps, i.e., $n + 1$ effective steps (each step resolves one digit out of $V$ possibilities). This introduces an additional factor of $B$ in the MD-Decoder term. However, because $n$ is small (e.g., 4) and $L$ is typically large, the common encoder term remains substantial. The extra scoring in DiffGRM runs in parallel across digits and beams using cached encoder k/v, so it increases compute but does not necessarily slow down the critical path (wall-clock latency).
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three categories from the widely used Amazon Reviews dataset [38]:
- "Sports and Outdoors" (Sports): A dataset related to sports and outdoor equipment.
- "Beauty" (Beauty): A dataset related to cosmetic and beauty products.
- "Toys and Games" (Toys): A dataset related to toys and games.

These datasets are standard benchmarks for semantic ID-based generative recommendation [23, 24, 48]. Each user's historical reviews are treated as interactions and sorted chronologically to form input sequences. The leave-last-out evaluation strategy [27, 48, 69] is adopted: the last item in each sequence is used for testing, the second-to-last for validation, and the remaining interactions for training.
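A minimal sketch of the leave-last-out split described above; the toy interaction sequence is invented.

```python
def leave_last_out(sequence):
    """Split one user's chronologically ordered items into train/valid/test roles."""
    if len(sequence) < 3:
        return None                      # too short to form all three splits
    return {
        "train": sequence[:-2],                    # interactions used for training
        "valid": (sequence[:-2], sequence[-2]),    # predict the second-to-last item
        "test":  (sequence[:-1], sequence[-1]),    # predict the last item
    }

user_history = ["itemA", "itemB", "itemC", "itemD", "itemE"]
splits = leave_last_out(user_history)
print(splits["test"])    # (['itemA', 'itemB', 'itemC', 'itemD'], 'itemE')
```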
The following are the statistics from Table 2 of the original paper:
| Dataset | #Users | #Items | #Interactions | Avg. length |
|---|---|---|---|---|
| Sports | 35,598 | 18,357 | 260,739 | 8.32 |
| Beauty | 22,363 | 12,101 | 176,139 | 8.87 |
| Toys | 19,412 | 11,924 | 148,185 | 8.63 |
Here, Avg. length denotes the average number of interactions per input sequence.
These datasets are effective for validating the method's performance because they represent diverse product domains and provide realistic user interaction histories, suitable for sequential recommendation tasks and testing the generation of semantic IDs for varied item types.
5.2. Evaluation Metrics
The performance of DiffGRM and baselines is evaluated using Recall@K and NDCG@K, with $K \in \{5, 10\}$. These are standard metrics in recommendation systems for evaluating the quality of ranked lists.
5.2.1. Recall@K
- Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. It focuses on the ability of the recommender system to find as many relevant items as possible, without regard for their ranking within the top $K$. In Top-K next-item recommendation, if the one ground-truth item is in the top $K$ predictions, it counts as a hit.
- Mathematical Formula:
  $$\mathrm{Recall@}K = \frac{\#\{\text{relevant items in the top-}K\ \text{list}\}}{\#\{\text{relevant items}\}}$$
  In a typical next-item recommendation scenario with a single ground-truth item, this simplifies to:
  $$\mathrm{Recall@}K = \frac{\#\{\text{users whose target item appears in the top-}K\}}{\#\{\text{users}\}}$$
- Symbol Explanation:
  - Relevant items in the top-$K$ list: the number of relevant items found within the Top-K recommended list.
  - Relevant items: the total number of relevant items for a given user (often 1 in next-item prediction).
  - Users whose target item appears in the top-$K$: the count of test cases where the single ground-truth item was successfully predicted within the top $K$ positions.
  - Users: the total number of test cases (users) in the evaluation set.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
- Conceptual Definition: NDCG@K is a measure of ranking quality. It accounts for both the relevance of recommended items and their position in the ranked list. Higher relevance at higher (earlier) positions contributes more to the NDCG score. This metric is particularly useful because it differentiates between placing a relevant item at the 1st position versus the 10th position.
- Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated:
  $$\mathrm{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}$$
  Then, NDCG@K normalizes DCG@K by the Ideal DCG (IDCG@K), which is the DCG of a perfectly sorted list of relevant items:
  $$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}$$
- Symbol Explanation:
  - $K$: the number of top recommendations being considered.
  - $i$: the rank (position) of an item in the recommendation list, starting from 1.
  - $rel_i$: the relevance score of the item at position $i$. In next-item recommendation with a single ground-truth item, $rel_i$ is 1 if the item at position $i$ is the ground-truth item and 0 otherwise.
  - $\log_2(i + 1)$: the logarithmic discount factor, which reduces the contribution of items at lower ranks.
  - $\mathrm{DCG@}K$: the Discounted Cumulative Gain at rank $K$.
  - $\mathrm{IDCG@}K$: the Ideal Discounted Cumulative Gain at rank $K$, calculated by assuming the perfect ranking where all relevant items appear at the top of the list in decreasing order of relevance. For a single ground-truth item, $\mathrm{IDCG@}K = 1$ whenever $K \ge 1$.
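Because each test case has a single ground-truth item, both metrics reduce to simple rank checks; a minimal sketch (the ranking list is a toy example):

```python
import math

def recall_at_k(ranked_items, target, k):
    """1.0 if the single ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """With one relevant item, DCG is 1/log2(rank + 1) and IDCG is 1."""
    if target in ranked_items[:k]:
        rank = ranked_items.index(target) + 1     # 1-based position
        return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["item9", "item4", "item7", "item1", "item3"]   # a model's Top-K list
print(recall_at_k(ranked, "item7", k=5))   # 1.0 -- hit within the top 5
print(ndcg_at_k(ranked, "item7", k=5))     # 0.5 -- 1/log2(3 + 1), since the hit is at rank 3
# dataset-level Recall@K / NDCG@K are the averages of these values over all test users
```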
5.3. Baselines
DiffGRM is compared against a comprehensive set of baselines, grouped into three families:
5.3.1. Item ID-based (Discriminative) Models
These models primarily rely on item identifiers and sequential patterns.
- GRU4Rec [15]: A Gated Recurrent Unit (GRU)-based model designed for session-based recommendation, capturing sequential dynamics.
- HGN [36]: Hierarchical Gating Networks, enhancing RNN-based sequence modeling with a gating mechanism.
- SASRec [27]: Self-Attentive Sequential Recommendation, a Transformer-based model using self-attention for next-item prediction, trained with binary cross-entropy.
- BERT4Rec [58]: A Bidirectional Encoder Representations from Transformers (BERT)-style model adapted for recommendation, using a Cloze-style objective (predicting masked item IDs) on item sequences.
5.3.2. Semantic-enhanced (Discriminative) Models
These models incorporate item content features to enrich representations.
- FDSA [67]: Feature-level Deeper Self-Attention Network, which models both item-ID and feature sequences using self-attention and fuses them.
- S³-Rec [74]: Self-Supervised Learning for Sequential Recommendation, using self-supervised pretraining on features and IDs before fine-tuning for next-item prediction.
- VQ-Rec [18]: Learns vector-quantized item representations from text features, pooling them as item representations.
- RecJPQ [47]: Replaces item embeddings with concatenated jointly product-quantized sub-embeddings.
5.3.3. Semantic ID-based (Generative) Models
These models aim to generate semantic IDs or tokenized representations of items.
- TIGER [48]: Transformer-based Generative Recommender, which uses RQ-VAE for item tokenization into SIDs and then autoregressively generates the next SID token. This is a key autoregressive baseline.
- HSTU [65]: Hierarchical Sequential Transduction Units, which discretizes raw item features as tokens for generative recommendation. The paper notes using 4-digit OPQ-tokenized SIDs for consistency.
- ActionPiece [21]: A model that uses context-aware tokenization, representing each action (user interaction) as an unordered set of item features for generative recommendation.
- RPG [20]: Predicts unordered SID tokens in parallel using a multi-token objective, combined with graph-guided decoding. This is another strong generative baseline that attempts parallel prediction.

These baselines are representative because they cover a wide spectrum of recommendation approaches, from traditional ID-based to advanced semantic-enhanced and generative methods, allowing for a thorough comparison of DiffGRM's performance and architectural innovations.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that DiffGRM consistently outperforms both discriminative and generative recommendation baselines across all three datasets (Sports, Beauty, Toys). The evaluation is based on Recall@K and NDCG@K with $K \in \{5, 10\}$. The improvements highlight DiffGRM's effectiveness in leveraging masked diffusion for SID generation.
The following are the results from Table 3 of the original paper:
| Methods | Sports Recall@5 | Sports NDCG@5 | Sports Recall@10 | Sports NDCG@10 | Beauty Recall@5 | Beauty NDCG@5 | Beauty Recall@10 | Beauty NDCG@10 | Toys Recall@5 | Toys NDCG@5 | Toys Recall@10 | Toys NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Item ID-based (Discriminative) | ||||||||||||
| GRU4Rec | 0.0129 | 0.0086 | 0.0204 | 0.0110 | 0.0164 | 0.0099 | 0.0283 | 0.0137 | 0.0097 | 0.0059 | 0.0176 | 0.0084 |
| HGN | 0.0189 | 0.0120 | 0.0313 | 0.0159 | 0.0325 | 0.0206 | 0.0512 | 0.0266 | 0.0321 | 0.0221 | 0.0497 | 0.0277 |
| SASRec | 0.0233 | 0.0154 | 0.0350 | 0.0192 | 0.0387 | 0.0249 | 0.0605 | 0.0318 | 0.0463 | 0.0306 | 0.0675 | 0.0374 |
| BERT4Rec | 0.0115 | 0.0075 | 0.0191 | 0.0099 | 0.0203 | 0.0124 | 0.0347 | 0.0170 | 0.0116 | 0.0071 | 0.0203 | 0.0099 |
| Semantic-enhanced (Discriminative) | ||||||||||||
| FDSA | 0.0182 | 0.0122 | 0.0288 | 0.0156 | 0.0267 | 0.0163 | 0.0407 | 0.0208 | 0.0228 | 0.0140 | 0.0381 | 0.0189 |
| S³-Rec | 0.0251 | 0.0161 | 0.0385 | 0.0204 | 0.0387 | 0.0244 | 0.0647 | 0.0327 | 0.0443 | 0.0294 | 0.0700 | 0.0376 |
| VQ-Rec | 0.0208 | 0.0144 | 0.0300 | 0.0173 | 0.0457 | 0.0317 | 0.0664 | 0.0383 | 0.0497 | 0.0346 | 0.0737 | 0.0423 |
| RecJPQ | 0.0141 | 0.0076 | 0.0220 | 0.0102 | 0.0311 | 0.0167 | 0.0482 | 0.0222 | 0.0331 | 0.0182 | 0.0484 | 0.0231 |
| Semantic ID-based (Generative) | ||||||||||||
| TIGER | 0.0264 | 0.0181 | 0.0400 | 0.0225 | 0.0454 | 0.0321 | 0.0648 | 0.0384 | 0.0521 | 0.0371 | 0.0712 | 0.0432 |
| HSTU | 0.0258 | 0.0165 | 0.0414 | 0.0215 | 0.0469 | 0.0314 | 0.0704 | 0.0389 | 0.0433 | 0.0281 | 0.0669 | 0.0357 |
| RPG | 0.0316 | 0.0205 | 0.0500 | 0.0264 | 0.0511 | 0.0340 | 0.0775 | 0.0424 | − | − | − | − |
| ActionPiece | 0.0314 | 0.0216 | 0.0463 | 0.0263 | 0.0550 | 0.0381 | 0.0809 | 0.0464 | 0.0592 | 0.0401 | 0.0869 | 0.0490 |
| DiffGRM | **0.0363** | **0.0245** | **0.0550** | **0.0305** | **0.0603** | **0.0414** | **0.0876** | **0.0502** | **0.0618** | **0.0455** | **0.0834** | **0.0524** |
| Improv. | +14.87% | +13.43% | +10.00% | +15.53% | +9.64% | +8.19% | +8.28% | +8.19% | +4.39% | +13.47% | -4.03% | +6.94% |
Key Findings:
- Overall Dominance: DiffGRM achieves the best overall results, ranking first on 11 out of 12 metrics across the three datasets. This indicates strong generalization capability and superiority over existing methods.
- Significant NDCG Improvements: Relative to the strongest baseline (which varies per dataset and metric, but is often ActionPiece or RPG), DiffGRM shows impressive improvements in NDCG@10:
  - Sports: +15.53%
  - Beauty: +8.19%
  - Toys: +6.94%

  NDCG is a crucial metric for recommendation as it values correct predictions at higher ranks.
- Recall Improvements: Recall@10 also sees substantial gains:
  - Sports: +10.00%
  - Beauty: +8.28%

  On Toys, Recall@10 is slightly lower (-4.03%) than the best baseline (ActionPiece), but DiffGRM still achieves higher NDCG@5 and NDCG@10, suggesting that while it may miss a few relevant items that ActionPiece catches, it ranks its correct predictions more accurately.
- Semantic vs. ID-based Models: As expected, methods leveraging semantic information generally outperform ID-only discriminative models. Within the semantic family, semantic ID-based generative models tend to surpass semantic-enhanced discriminative ones, affirming the promise of the generative paradigm.
- Reasons for Gains: The paper attributes these gains to the core innovations:
  - The masked-diffusion training over SIDs, which provides dense per-digit supervision and leverages bidirectional context among SID digits, addressing the intra-item consistency issue.
  - OCN, which effectively allocates supervision by prioritizing difficult digits, tackling inter-digit heterogeneity and improving training efficiency.
  - CPD, which enables accurate and diverse Top-K SID generation in parallel, meeting a critical requirement for practical recommendation systems.

The results strongly validate the effectiveness of DiffGRM's approach in addressing the limitations of autoregressive models for SID-based generative recommendation.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Effectiveness of OCN (RQ2)
This section evaluates On-policy Coherent Noising (OCN)'s ability to achieve more balanced supervision allocation under a limited training budget, thereby improving sample efficiency. The concept of effective sample passes (ESP) is introduced, calculated as best_epoch × (training views per sample per epoch).
As can be seen from the results in Figure 3 of the original paper:
Figure 3: Multi-panel chart comparing DiffGRM with baseline noising variants on NDCG@10 and Recall@10 across the Sports, Beauty, and Toys datasets as effective sample passes (ESP) increase. The "ours" points are marked with stars, showing strong results even at low ESP; the fitted curves show how the baselines scale under different values of k.
- Scaling with $k$: The figure shows that increasing $k$ (the number of coherent paths, which effectively increases the number of training views per sample and thus ESP) generally raises performance for $k$-times coherent-path noising. The dashed lines represent a logarithmic least-squares fit, indicating how performance scales with ESP.
- OCN's Superiority: OCN (represented by the star markers) consistently achieves better results at the same or even lower ESP compared to the coherent-path noising variants. This demonstrates that OCN is more sample-efficient.
- Mechanism: OCN achieves this by using the current model's confidence to select the most uncertain positions (hardest digits) along coherent paths. This strategy focuses training on high-value signals, avoiding the scattered supervision of random masking and leading to better performance with fewer effective training steps.
6.2.2. Ablation Study (RQ3)
An ablation study quantifies the contribution of each proposed module to DiffGRM's overall performance. The performance is measured using NDCG@10.
The following are the results from Table 4 of the original paper:
| Variants | Sports | Beauty | Toys |
|---|---|---|---|
| Semantic ID Setting | |||
| (1.1) PSE → RQ-Kmeans | 0.0200 | 0.0343 | 0.0305 |
| (1.2) PSE → Random | 0.0138 | 0.0300 | 0.0206 |
| Training strategy | |||
| (2.1) w/o OCN | 0.0250 | 0.0368 | 0.0385 |
| (2.2) w/o On-policy | 0.0263 | 0.0455 | 0.0430 |
| Inference strategy | |||
| (3.1) w/o CPD | 0.0273 | 0.0496 | 0.0499 |
| DiffGRM (ours) | 0.0305 | 0.0502 | 0.0524 |
Analysis:
- Parallel Semantic Encoding (PSE):
  - (1.1) PSE → RQ-Kmeans: Replacing PSE with RQ-KMeans significantly degrades performance across all datasets (e.g., from 0.0305 to 0.0200 on Sports). This confirms that RQ's residual dependencies and left-to-right bias conflict with the MDM's bidirectional and parallel denoising, validating the choice of PSE.
  - (1.2) PSE → Random: Using random tokens instead of semantically structured SIDs hurts performance even more dramatically. This underlines the necessity of PSE for capturing and balancing semantic information effectively for the MDM.
- On-policy Coherent Noising (OCN):
  - (2.1) w/o OCN: This variant removes coherent noising and on-policy selection, using DDM-style random masking. Performance drops noticeably (e.g., from 0.0305 to 0.0250 on Sports). This shows that random masking disperses supervision, leading to insufficient training for infrequent items and less effective learning.
  - (2.2) w/o On-policy: This variant keeps coherent noising but removes the on-policy selection (i.e., it does not prioritize uncertain digits). It performs better than w/o OCN but still below the full DiffGRM. This indicates that while coherent noising (structured views) is beneficial, on-policy selection further refines supervision by focusing on the "weakest links" (hardest digits), contributing to overall performance.
- Confidence-guided Parallel Denoising (CPD):
  - (3.1) w/o CPD: This variant replaces CPD with a random fixed-order beam search, meaning digits are decoded in a fixed random permutation without confidence feedback. Performance decreases across all datasets (e.g., from 0.0305 to 0.0273 on Sports). This highlights the importance of CPD's confidence-guided mechanism for accurate Top-K SID generation during inference.

The ablation study clearly demonstrates that PSE, OCN, and CPD are not only effective but also necessary components, each contributing significantly to DiffGRM's superior performance.
6.2.3. OCN Strategy Analysis
This analysis compares four OCN variants based on two dimensions: selection policy ("least" confident vs. "most" confident digits) and refresh frequency ("static" order vs. "refresh" order at each step). Performance is measured by NDCG@10 on Beauty and Toys.
The following are the results from Table 5 of the original paper:
| Dataset | Metric | L-S (Ours) | L-R | M-S | M-R |
|---|---|---|---|---|---|
| Beauty | CPD | 0.0502 | 0.0484 | 0.0476 | 0.0382 |
| | w/o CPD | 0.0496 | 0.0470 | 0.0444 | 0.0309 |
| | Improv. | -1.20% | -2.89% | -6.72% | -19.11% |
| Toys | CPD | 0.0524 | 0.0481 | 0.0516 | 0.0421 |
| | w/o CPD | 0.0499 | 0.0455 | 0.0506 | 0.0318 |
| | Improv. | -4.71% | -5.41% | -1.94% | -24.47% |
Where:
- L-S (Ours): "Least" confident digits selected, "Static" refresh frequency. This is DiffGRM's default OCN strategy.
- L-R: "Least" confident digits selected, "Refresh" frequency.
- M-S: "Most" confident digits selected, "Static" refresh frequency.
- M-R: "Most" confident digits selected, "Refresh" frequency.
Findings:
- Selection Policy: Least-based scheduling (prioritizing the lowest-confidence/hardest digits) consistently outperforms most-based scheduling (prioritizing the highest-confidence/easiest digits) under the same refresh setting. This validates the OCN design of focusing supervision where the model is weakest, leading to better learning.
- Refresh Frequency: Static ordering (estimating uncertainty once and fixing the order for the example) outperforms refresh (re-estimating uncertainty after each denoising step). Re-estimating the order at every step likely introduces instability and changes the training "plan" too frequently, hindering effective learning.
- Order Sensitivity: The largest performance degradation, especially when CPD is replaced with w/o CPD (i.e., fixed-order decoding instead of confidence-guided), occurs in the M-R (most confident, refresh) variant. This is because prioritizing the most confident digits with stepwise refresh makes the model rely on an "easy-first" order. If this order changes or is not followed during inference (as in w/o CPD), the hard digits become undertrained and performance drops sharply. This reinforces the need for OCN's least-confident strategy and CPD's confidence guidance.
6.2.4. CPD Beam-Size Analysis
The impact of the CPD beam size on DiffGRM performance (NDCG@10) is analyzed by varying it across {32, 64, 128, 256}.
As can be seen from the results in Figure 4 of the original paper:
Figure 4: NDCG@10 of DiffGRM's CPD as the beam size varies on the Sports, Beauty, and Toys datasets. NDCG@10 improves as the beam size grows, with the gains most pronounced on Beauty and Toys.
- General Trend: Across all three datasets (Sports, Beauty, Toys), NDCG@10 generally improves as the beam size increases. This is a common phenomenon in beam search, where a larger beam width allows the search to explore more candidate paths, mitigating local optima and leading to better-quality solutions.
- Dataset Variation: The improvement is particularly pronounced for the Beauty and Toys datasets, indicating that these datasets may benefit more from a broader search space during inference.
6.2.5. Expressive Ability Analysis
To verify DiffGRM's ability to extract and leverage semantics, its performance (NDCG@10) is evaluated with different semantic encoders (pretrained language models of varying sizes).
The following are the results from Table 9 of the original paper:
| Model | semantic encoder | Sports | Beauty | Toys |
|---|---|---|---|---|
| RPG | sentence-t5-base | 0.0238 | 0.0429 | 0.0460 |
| bge-large-en-v1.5 | 0.0248 | 0.0408 | 0.0421 | |
| gte-large-en-v1.5 | 0.0229 | 0.0423 | 0.0469 | |
| DiffGRM | sentence-t5-base | 0.0305 | 0.0502 | 0.0524 |
| bge-large-en-v1.5 | 0.0327 | 0.0564 | 0.0508 | |
| gte-large-en-v1.5 | 0.0342 | 0.0549 | 0.0510 |
Findings:
- Performance with Larger Encoders: For DiffGRM, performance generally rises with the size and capacity of the semantic encoder (sentence-t5-base (110M) < bge-large-en-v1.5 (335M) < gte-large-en-v1.5 (434M) on some datasets). For instance, NDCG@10 on Sports for DiffGRM goes from 0.0305 (t5-base) to 0.0342 (gte-large).
- DiffGRM Benefits More: The paper states that DiffGRM benefits the most from more powerful semantic encoders. This implies DiffGRM is better at utilizing the rich semantic information encoded by larger PLMs (pre-trained language models), suggesting its architecture is well-suited to leveraging high-quality item representations.
- RPG's Inconsistency: In contrast, RPG's performance with larger encoders is less consistent, sometimes even decreasing (bge-large-en-v1.5 on Beauty and Toys compared to sentence-t5-base). This indicates DiffGRM has a stronger ability to capture and effectively integrate the semantics provided by the upstream encoders.
6.2.6. ARM Sample Expansion
This experiment investigates whether autoregressive models (ARMs) benefit from data augmentation the way DiffGRM's MDM does, i.e., by exposing more supervision signals. An ARM-style generative recommender (re-implemented with the GRID framework and RQ-KMeans) is trained on the Toys dataset.
The following are the results from Table 11 of the original paper:
| Setting | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|
| 1x | 0.0415 | 0.0624 | 0.0273 | 0.0341 |
| 4x | 0.0422 | 0.0627 | 0.0274 | 0.0340 |
Findings:
- Negligible Difference: Duplicating training instances four times (4x) for the ARM (while keeping targets and token order identical) shows only negligible differences in performance compared to using the original dataset (1x).
- Interpretation: This confirms that for ARMs with teacher forcing, a single pass already covers all digits sequentially, so simply duplicating the same training instance does not introduce new or diverse supervision signals. In contrast, MDMs (like DiffGRM) generate multiple "views" (masked inputs) from a single sample, each providing distinct supervision signals to the model. This experiment supports the claim that DiffGRM's gains come from exposing richer supervision through its masked-diffusion paradigm and OCN, rather than from simply having more "samples" in a superficial sense.
6.2.7. Sliding Window Data Augmentation
This analysis examines the effect of sliding-window augmentation on training sample count and performance for DiffGRM.
The following are the results from Table 12 of the original paper:
| Dataset | Setting | NDCG@10 | Samples |
|---|---|---|---|
| Sports | No sliding window | 0.0237 | 35,598 |
| Sports | Sliding window | 0.0305 | 152,346 |
| Beauty | No sliding window | 0.0350 | 22,363 |
| Beauty | Sliding window | 0.0502 | 105,668 |
| Toys | No sliding window | 0.0396 | 19,412 |
| Toys | Sliding window | 0.0524 | 87,180 |
Findings:
- Consistent Gains: Applying sliding-window augmentation consistently yields significant NDCG@10 gains across all datasets. For example, on Sports, NDCG@10 improves from 0.0237 to 0.0305.
- Increased Training Samples: Sliding-window augmentation expands each user's interaction sequence into multiple contiguous sub-sequences (as sketched below), vastly increasing the number of effective training samples and exposing the model to more item-SID correspondences and richer contexts.
- Benefits: This augmentation helps the model learn more generalizable patterns, mitigates overfitting, and improves robustness, which is especially important given the sparsity and noise of real-world recommendation data.
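A minimal sketch of the sliding-window idea, assuming each training sample is a (history, next-item target) pair; the window length and minimum history below are illustrative, not the paper's settings.

```python
def sliding_window_samples(seq, max_len=4, min_hist=2):
    """Expand one user's interaction sequence into multiple (history, target) pairs:
    every sufficiently long prefix yields a sample, with the history truncated to
    the most recent `max_len` interactions."""
    samples = []
    for t in range(min_hist, len(seq)):
        samples.append((seq[max(0, t - max_len):t], seq[t]))
    return samples

# one user who interacted with items A..F in chronological order
for hist, target in sliding_window_samples(["A", "B", "C", "D", "E", "F"]):
    print(hist, "->", target)
# 4 training samples instead of 1, e.g. ['A', 'B'] -> C ... ['B', 'C', 'D', 'E'] -> F
```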
6.2.8. Hidden Dimension Analysis
This analysis investigates the impact of the model's hidden dimension on DiffGRM's performance (NDCG@10) and convergence speed (best epoch). The tested hidden dimensions are {64, 128, 256, 512, 1024}.
As can be seen from the results in Figure 6 of the original paper:
Figure 6 (original paper): NDCG@10 and best training epoch as a function of the hidden dimension on the Sports, Beauty, and Toys datasets, with separate curves for performance and for the best-epoch values.
- Performance vs. hidden dimension:
  - For Sports and Beauty, performance (NDCG@10) generally increases with the hidden dimension up to a certain point and then plateaus or slightly decreases; the "knee" of the curve, marking the best accuracy-efficiency trade-off, appears at an intermediate dimension.
  - For Toys, performance continues to improve significantly as the hidden dimension grows, suggesting that this dataset requires a larger model capacity to fully capture its complexity.
- Convergence Speed vs. hidden dimension:
  - As the hidden dimension increases, the "best epoch" (the epoch at which the model achieves its peak validation performance) generally decreases. Larger models, while more computationally expensive per epoch, tend to converge in fewer epochs.
- Practical Choice: Based on this analysis, the paper chose a smaller hidden dimension for Sports and Beauty (balancing accuracy and efficiency) and a larger one for Toys (to maximize performance). This highlights the importance of tuning this hyperparameter to dataset characteristics.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiffGRM, a novel generative recommendation framework that addresses the inherent limitations of autoregressive models (ARMs) when generating Semantic IDs (SIDs). The core innovation is replacing the ARM decoder with a masked discrete diffusion model (MDM), enabling bidirectional context and parallel, any-order generation of SID digits.
DiffGRM achieves this through three key, interconnected components:
- Parallel Semantic Encoding (PSE): Uses OPQ subspace quantization to decouple SID digits, balancing per-digit information and removing the sequential dependencies that conflict with parallel generation.
- On-policy Coherent Noising (OCN): A training strategy that uses the model's own confidence to identify and prioritize "hard" digits for masking (a simplified sketch follows this summary). It coherently allocates supervision, concentrates the training budget on high-value signals, and improves sample efficiency, sidestepping the combinatorial explosion of masking patterns in MDMs.
- Confidence-guided Parallel Denoising (CPD): An inference strategy that performs a global parallel beam search, filling higher-confidence digits first. This lets DiffGRM generate diverse Top-K SID candidates, a crucial requirement for practical recommendation systems, without sacrificing accuracy.

Collectively, DiffGRM reconciles the complex cross-digit semantics of SIDs with parallel generation. Experimental results across multiple Amazon datasets demonstrate consistent state-of-the-art performance, with NDCG@10 improvements ranging from 6.9% to 15.5% over strong generative and discriminative baselines, confirming DiffGRM's accuracy, generalization strength, and ability to produce robust Top-K recommendations.
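To make the OCN summary above more concrete, here is a simplified sketch of coherent, confidence-guided masking. `digit_confidence` is a hypothetical stand-in for a fully masked MD-Decoder pass, and the rule of masking the k hardest digits per view simplifies the paper's actual schedule.

```python
import numpy as np

MASK = -1   # sentinel code for a masked digit

def ocn_views(sid, digit_confidence, n_views=3):
    """On-policy coherent noising, simplified: rank digits by the current model's
    confidence and build noised views that keep easy digits visible while masking
    progressively larger blocks of the hardest digits."""
    conf = digit_confidence(sid)                  # shape (n_digits,), higher = easier
    hardest_first = np.argsort(conf)              # ascending confidence
    views = []
    for k in range(1, n_views + 1):               # view k masks the k hardest digits
        masked = set(hardest_first[:k].tolist())
        visible = [MASK if i in masked else d for i, d in enumerate(sid)]
        targets = {i: sid[i] for i in masked}
        views.append((visible, targets))
    return views

# toy confidences: pretend the model is least sure about digits 2 and 0
toy_confidence = lambda sid: np.array([0.35, 0.90, 0.20, 0.75])
for view in ocn_views([12, 7, 31, 4], toy_confidence):
    print(view)
```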
7.2. Limitations & Future Work
The authors explicitly state one main direction for future work:
- Inference Efficiency and Scalability: While the paper discusses complexity, it notes that DiffGRM's MD-Decoder carries an additional multiplicative factor in its inference complexity compared to ARMs. Although the authors argue that this introduces only a modest overhead when that factor is small, and that CPD's parallel nature helps, further work on optimizing inference efficiency and scalability for very large item catalogs or extremely low-latency requirements remains open. This suggests that while DiffGRM is accurate, there is still room to improve its real-world deployment performance, especially in demanding industrial scenarios.
7.3. Personal Insights & Critique
This paper presents a highly innovative approach to generative recommendation, effectively addressing known limitations of autoregressive models when applied to the structured nature of Semantic IDs.
Inspirations and Applications:
- Bridging Paradigms: The paper effectively bridges diffusion models (successful in image and text generation) and recommender systems, pointing to a promising direction for applying advanced generative modeling to structured data beyond traditional language or image domains.
- Structured Data Generation: The explicit handling of intra-item consistency and inter-digit heterogeneity for SIDs offers valuable insights for generating other forms of structured discrete data. Many real-world entities can be represented as multi-attribute discrete sequences (e.g., product configurations, chemical compounds, user profiles), and DiffGRM's PSE and OCN strategies could be adapted to such domains.
- Targeted Supervision: On-policy Coherent Noising (OCN) is a particularly clever innovation. Instead of relying on brute-force random masking or rigid patterns, it uses the model's own uncertainty to guide supervision, a concept that could also improve learning efficiency in other masked language modeling or denoising autoencoder tasks, especially with sparse or imbalanced data.
- Diverse Top-K Generation: Confidence-guided Parallel Denoising (CPD) is a pragmatic solution for the Top-K requirement in recommendation, which generative models designed for single-output tasks often overlook. This makes MDMs more viable for practical recommender system applications.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Sensitivity to Semantic Encoder Quality: Although DiffGRM benefits from larger semantic encoders, its performance still relies heavily on the quality of the initial item embeddings these encoders produce. If the upstream encoder fails to capture crucial semantics, DiffGRM's ability to generate meaningful SIDs is limited. Future research could explore end-to-end learning or adaptive fine-tuning of the semantic encoder alongside DiffGRM.
- Interpretability of SIDs: The paper defines SIDs as n-digit sequences. While PSE helps decouple the digits, the semantic meaning of each digit and how they combine remains somewhat abstract. Improving the interpretability of individual SID digits (e.g., explicitly linking them to human-understandable attributes such as brand, color, or style) could aid debugging, increase transparency, and potentially allow more granular control over generation.
- Generalization to Novel SIDs: While generative recommendation promises open-vocabulary capabilities, the extent to which DiffGRM can generate truly novel SIDs (i.e., digit combinations corresponding to items unseen in training, or entirely new SID structures) is not fully explored; the Top-K candidates are likely variations of observed SIDs.
- Computational Cost of OCN: Although OCN improves sample efficiency, computing the per-digit difficulty scores requires an initial fully masked pass through the MD-Decoder for every training sample. While this cost is amortized across views, it still adds a constant per-sample overhead that could be significant for extremely large datasets or very frequent model updates.
- Hyperparameter Sensitivity: As with many complex deep learning models, DiffGRM is likely sensitive to hyperparameters such as the number of SID digits, the codebook size, the beam width, and the schedule used by OCN. The paper provides default settings, but tuning these for new domains could be intricate.
- Beyond the SID as Atomic Unit: The paper focuses on generating SIDs as the atomic unit. What about generating sequences of features, or even textual descriptions of items directly? Extending the diffusion paradigm to more complex output formats could be a powerful future direction.

Overall, DiffGRM represents a significant step forward in generative recommendation, providing a robust and theoretically sound framework that addresses critical limitations of prior work. Its innovative use of discrete diffusion and tailored training and inference strategies offers a promising path toward more effective and flexible recommender systems.