UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration
TL;DR Summary
The paper introduces UNGER, a generative recommendation approach that integrates semantic and collaborative information into a unified code to reduce storage and inference costs. Using a two-phase framework for effective code construction, it demonstrates significant improvements over existing generative and traditional methods on three public benchmarks.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration".
1.2. Authors
The authors of the paper are:
- LONGTAO XIAO (School of Computer Science and Technology, Huazhong University of Science and Technology, China)
- HAOZHAO WANG (School of Computer Science and Technology, Huazhong University of Science and Technology, China)
- CHENG WANG (Huawei Technologies Ltd, China)
- LINFEI JI (Huazhong University of Science and Technology, China)
- YIFAN WANG (Huazhong University of Science and Technology, China)
- JIEMING ZHU (Huawei Noah's Ark Lab, China)
- ZHENHUA DONG (Huawei Noah's Ark Lab, China)
- RUI ZHANG (School of Computer Science and Technology, Huazhong University of Science and Technology, China, www.ruizhang.info)
- RUIXUAN LI (School of Computer Science and Technology, Huazhong University of Science and Technology, China)
The authors' research backgrounds appear to be in computer science and technology, with affiliations at a prominent Chinese university (Huazhong University of Science and Technology) and an industry research lab (Huawei Technologies Ltd / Huawei Noah's Ark Lab), indicating a blend of academic research and practical industry application.
1.3. Journal/Conference
The publication venue is not explicitly stated in the provided text, but the format and content suggest it is a research paper submitted to a conference or journal in the field of information systems or recommender systems.
1.4. Publication Year
The paper's publication timestamp is 2025-10-28 (UTC), indicating a publication date of October 2025.
1.5. Abstract
This paper introduces UNGER, a novel framework for generative recommendation that utilizes a unified code to integrate both semantic and collaborative knowledge. Traditional generative recommender systems often employ separate codes for different modalities (e.g., semantic and collaborative), leading to increased storage and inference costs, as well as a failure to fully exploit the complementary strengths of these knowledge types due to inherent misalignment. UNGER addresses these challenges by proposing a unified code (referred to as Unicodes) that reduces storage and inference time significantly. The framework integrates knowledge adaptively through a learnable modality adaptation layer and a joint optimization task that combines cross-modality knowledge alignment with next-item prediction. To counteract information loss during the necessary quantization process, UNGER also incorporates an intra-modality knowledge distillation task. Extensive experiments on three public recommendation benchmarks demonstrate UNGER's superior performance compared to existing generative and traditional methods, while also exhibiting scaling law characteristics.
1.6. Original Source Link
The original source link is /files/papers/692c3e981db011de57153258/paper.pdf. This appears to be a link to a PDF file, suggesting it's likely a preprint or an internal link to the paper. Its publication status is unknown from the given context, but the publication date in 2025 indicates it's a forthcoming work.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve revolves around the limitations of existing recommendation systems, particularly in the context of generative recommendation.
- Information Overload & Traditional RecSys Limitations: Modern society is plagued by information overload, making recommendation systems crucial. Traditional recommendation systems typically rely on embedding-based approaches in which users and items are represented by dense vectors, and recommendations are made by finding nearest neighbors via dot-product or cosine similarity. This often necessitates Approximate Nearest Neighbor (ANN) search indexes (e.g., Faiss, SCANN) for efficient retrieval from large candidate pools. However, these ANN indexes are independent of the model's optimization process, which can limit overall effectiveness and introduce computational overhead.
- Emergence of Generative Recommendation: Generative recommendation has emerged as a promising direction. Instead of matching embeddings, it frames the recommendation task as autoregressive code sequence generation. Items are first encoded into discrete codes (sequences of tokens), and a model then predicts the next item's code based on the user's history. This paradigm has the potential for more efficient decoding without ANN indexes.
- Challenges in Existing Generative Recommendation: The key challenge UNGER targets in existing generative recommendation lies in how different modalities of knowledge (e.g., collaborative and semantic) are handled:
  - Separate Codes: Current methods often construct independent codes for different modalities. For instance, RecForest uses collaborative codes, TIGER uses semantic codes, and EAGER uses two separate sets of codes for both.
  - Increased Costs: A dual-code framework significantly increases storage and inference costs, making large-scale deployment impractical. Figure 3 demonstrates that two separate codes are 2.8x slower at inference than a unified code.
  - Intrinsic Misalignment & Underutilization: Treating semantic and collaborative knowledge as independent entities limits the full exploitation of their complementary strengths. Direct concatenation of features, a common practice, suffers from a semantic dominance issue, where semantic features, often richer and pre-trained on vast textual data, overwhelm collaborative signals (Figure 4). This can even degrade performance below using semantic knowledge alone. The root cause is the inherent representational misalignment and signal-strength variation between modalities.

The paper's innovative idea is to integrate both collaborative and semantic knowledge into a single unified code (Unicodes) to overcome these challenges, thereby improving efficiency, reducing costs, and enhancing recommendation effectiveness by truly harnessing the synergistic potential of both modalities while addressing the semantic dominance problem.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of generative recommendation:
- Unified Code for Generative Recommendation:
  - Contribution: Introduction of UNGER (Unified Generative Recommendation), which leverages Unicodes, a novel unified code that integrates both collaborative and semantic knowledge for generative recommendation.
  - Problem Solved: This addresses the practical deployment challenges of dual-code systems by halving the storage space and achieving significantly faster inference compared to setups with two separate codes. It enables a more compact and efficient representation of items.
- Adaptive Knowledge Integration to Resolve Semantic Domination:
  - Contribution: Proposal of a learnable modality adaptation layer and a joint optimization framework that combines cross-modality knowledge alignment (CKA) with next item prediction tasks.
  - Problem Solved: This approach adaptively learns an integrated embedding, effectively resolving the semantic domination issue observed when features are simply concatenated. It ensures a balanced contribution from both collaborative and semantic modalities, allowing their complementary strengths to be fully exploited.
- Intra-modality Knowledge Distillation for Comprehensive Learning:
  - Contribution: Introduction of an intra-modality knowledge distillation (IKD) task, utilizing a specially designed token.
  - Problem Solved: This task compensates for the potential information loss inherent in the quantization process (mapping continuous embeddings to discrete codes). By distilling high-level modality-specific knowledge, it ensures comprehensive and sufficient learning, further improving autoregressive generation quality.
- Empirical Validation and Scaling Law Characteristics:
  - Contribution: Extensive experiments on three public recommendation benchmarks.
  - Problem Solved: The experiments demonstrate UNGER's significant superiority over existing generative and traditional recommendation methods. Additionally, the study confirms that UNGER exhibits desirable scaling law characteristics with respect to model depth, width, and data volume, indicating its potential for performance gains with increased resources.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the UNGER framework, a beginner needs a foundational understanding of several key concepts from recommender systems, deep learning, and natural language processing.
- Recommendation Systems (RecSys): At its core, a recommender system aims to predict user preferences and suggest items (e.g., movies, products, music) that a user is likely to be interested in. It combats information overload by filtering vast amounts of available content.
  - Collaborative Filtering (CF): A widely used technique that makes recommendations based on the past behavior and preferences of similar users or items. If user A likes items X and Y, and user B likes X, then user B might also like Y. This captures behavioral patterns.
  - Content-Based Recommendation: Recommends items similar to those a user has liked in the past, based on item attributes or content. For example, a user who likes sci-fi movies will be recommended other sci-fi movies. This uses semantic information.
  - Hybrid Approaches: Combine collaborative and content-based methods to leverage the strengths of both and mitigate their individual weaknesses (e.g., the cold-start problem for new users/items in CF).
- Embeddings: In machine learning, an embedding is a dense vector representation of discrete entities (such as users, items, or words) in a continuous vector space. Entities with similar properties or relationships are mapped close to each other in this space; for example, the embedding of "apple" might be close to that of "pear" but far from "car".
  - Item ID Embeddings: Traditional recommendation systems often assign a unique ID to each item and learn a corresponding embedding vector for it. These are ID-based signals or collaborative signals derived purely from interaction patterns.
  - Semantic Embeddings: Derived from rich textual descriptions (e.g., item titles, descriptions, reviews) using language models. These embeddings capture the semantic meaning and contextual information of an item.
- Deep Learning Architectures:
  - Recurrent Neural Networks (RNNs) / Gated Recurrent Units (GRUs): Neural networks designed to process sequential data. GRUs are a type of RNN that can capture dependencies in sequences, useful for modeling user interaction history (e.g., GRU4Rec).
  - Convolutional Neural Networks (CNNs): Neural networks primarily used for image processing, but also adaptable to sequence modeling (e.g., Caser) by applying convolutional filters to learn local patterns.
  - Transformer Models: A powerful neural network architecture, particularly dominant in Natural Language Processing (NLP), known for its self-attention mechanism.
    - Self-Attention: A mechanism that allows a model to weigh the importance of different parts of the input sequence when processing each element. It computes attention scores between query, key, and value vectors.
    - Multi-head Self-Attention: Extends self-attention by running multiple attention mechanisms in parallel, allowing the model to focus on different aspects of the sequence simultaneously.
    - Encoder-Decoder Architecture: A common Transformer setup where an encoder processes the input sequence to create a representation and a decoder uses this representation to generate an output sequence.
    - Autoregressive Generation: A process where each element of an output sequence is generated one at a time, conditioned on the previously generated elements. This is fundamental to generative recommendation.
- Embedding Quantization: The process of mapping continuous embeddings (dense real-valued vectors) into discrete codes (sequences of integer tokens). This is crucial for generative recommendation, as it turns item representations into a format suitable for autoregressive generation by a Transformer decoder, which typically generates sequences of discrete tokens.
  - Vector Quantization (VQ): A general technique for mapping vectors from a continuous space to a finite set of codebook vectors.
  - Residual Vector Quantization (RVQ) / Hierarchical K-means: Advanced VQ techniques that iteratively quantize residuals (the difference between the original vector and its approximation) at multiple layers to preserve more information. Hierarchical K-means is a variant that applies K-means clustering in a hierarchical manner.
- Contrastive Learning: A self-supervised learning paradigm where the model learns representations by pulling positive pairs (similar examples) closer together in the embedding space and pushing negative pairs (dissimilar examples) further apart.
  - Info-NCE Loss: A popular contrastive loss function often used in self-supervised learning to maximize the mutual information between different views of the same data point.
- Knowledge Distillation: A technique where a smaller "student" model learns from a larger, more complex "teacher" model. In UNGER, it is used to compensate for information loss during quantization by guiding the model with the original embeddings.
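To make the Info-NCE objective described above concrete, here is a minimal, illustrative PyTorch sketch of a batch-wise Info-NCE loss between two sets of embeddings. The function name, the use of cosine similarity, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Batch-wise Info-NCE: row i of `anchors` is paired with row i of `positives`;
    all other rows in the batch serve as negatives."""
    anchors = F.normalize(anchors, dim=-1)        # cosine similarity via normalized dot products
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # [B, B] similarity matrix
    labels = torch.arange(anchors.size(0), device=anchors.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: 8 items, 64-dim embeddings from two "views" (e.g., two modalities).
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```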
3.2. Previous Works
The paper discusses related work across pre-training in recommender systems, sequential recommendation, and generative approaches.
3.2.1. Pre-training in Recommender Systems
Inspired by NLP breakthroughs, pre-training addresses data sparsity and cold-start problems in recommendation.
- ID-based Pre-training:
  - PeterRec [65]: Learns universal user representations from large-scale interaction data, transferring knowledge via parameter-efficient fine-tuning.
  - Conure [66]: A lifelong learning framework that incrementally updates user embeddings without catastrophic forgetting.
  - CLUE [4]: Improves representation quality through contrastive views, similar to self-supervised learning.
  - These ID-based methods often struggle with transferability to new domains because they rely on discrete identifiers.
- Semantic-enhanced Pre-training:
  - ZESRec [9]: Replaces item IDs with textual metadata to enable zero-shot generalization across disjoint domains.
  - UniSRec [18] and MISSRec [56]: Leverage multimodal features (text, images) to build transferable user/item representations.
  - P5 [12]: Unifies various recommendation tasks into a text-to-text framework using pre-trained language models (PLMs), transforming all inputs and outputs into natural language. However, it still relies on relatively small PLMs.
3.2.2. Sequential Recommendation
Focuses on modeling user behavior as a chronologically ordered sequence of interactions to predict the next item.
- Traditional Approaches:
  - Markov Chains (MCs) [16, 48]: Early methods capturing item transition probabilities.
  - GRU4Rec [21]: Pioneered GRU-based RNNs for sequential recommendation. The GRU (Gated Recurrent Unit) is a type of RNN that uses gates to control the flow of information, mitigating the vanishing gradient problem. It processes a sequence by updating a hidden state based on the current input $x_t$ and the previous hidden state $h_{t-1}$:
    $$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
    $$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
    Where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate hidden state, $\odot$ is the element-wise product, and $\sigma$ is the sigmoid function.
  - SASRec [24]: Introduced self-attention (similar to a decoder-only Transformer) to capture long-range dependencies in item sequences. The core attention mechanism is:
    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
    Where $Q$, $K$, $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the key vectors. SASRec applies this self-attention to the sequence of item embeddings.
  - BERT4Rec [51]: Utilizes Transformers with masking strategies (like masked language modeling in BERT) for sequential recommendation, allowing bidirectional context.
  - HGN [39]: Adopts hierarchical gating networks to capture both long-term and short-term user interests.
  - FDSA [69]: Leverages self-attention networks at both the item level and the feature level to model sequence dynamics.
  - S^3-Rec [73]: Pre-trains a bidirectional Transformer using mutual information maximization (MIM) via self-supervised tasks to learn correlations among attributes, subsequences, and sequences.
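As a concrete illustration of the scaled dot-product attention used by the SASRec-style models above, here is a minimal NumPy sketch. The function name, single head, and absence of causal masking are simplifying assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # [T, T] pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # [T, d_v] context vectors

# Toy usage: a sequence of 5 item embeddings of dimension 16 attending to itself.
X = np.random.randn(5, 16)
out = scaled_dot_product_attention(X, X, X)
```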
3.2.3. Generative Approaches
These methods predict item identifiers directly, moving beyond embedding-based matching and ANN search.
- LLM-based Methods:
  - LC-Rec [71]: Uses code-based vector quantization for semantic item indexing and fine-tuning to align collaborative signals with LLM representations.
  - CCF-LLM [38]: Transforms user-item interactions into hybrid prompts (encoding both semantic and collaborative knowledge) and uses an attentive cross-modal fusion strategy.
  - SC-Rec [26]: Employs multiple item indices, prompt templates, and a self-consistent re-ranking mechanism to merge collaborative and semantic knowledge.
- From-Scratch Methods (Custom-designed Models):
  - Tree-based methods (RecForest [11], [74, 75]): Construct multiple trees and integrate Transformer-based structures for routing, enhancing accuracy and memory efficiency.
  - TIGER [47]: Introduced semantic IDs (item tokens derived from descriptions) and predicts next-item tokens in a sequence-to-sequence manner, using RQ-VAE quantization.
  - ColaRec [61]: Integrates user-item interactions and content data within an end-to-end framework, leveraging pretrained collaborative identifiers, an item indexing task, and a contrastive loss to align the semantic and collaborative spaces.
  - EAGER [62]: Employs a dual-stream generative framework with shared encoding but separate decoding for semantic and behavioral information, using two separate sets of codes. It then fuses results based on confidence scores.
3.3. Technological Evolution
The evolution of recommendation systems has moved from simple content-based or collaborative filtering methods to complex deep learning models.
- Early Stages (Pre-Deep Learning): Matrix Factorization, Markov Chains, and basic content-based filtering.
- ID-based Deep Learning: RNNs (e.g., GRU4Rec), CNNs (e.g., Caser), and later Transformers (e.g., SASRec, BERT4Rec) applied to sequences of item IDs. These largely rely on implicit collaborative signals.
- Pre-training for RecSys: Borrowing from NLP, pre-training strategies emerged to tackle cold-start and sparsity issues. This includes ID-based pre-training and, more recently, incorporating side information (text, images) to build semantic-rich representations (ZESRec, UniSRec, P5).
- Generative Recommendation (Code-based): A paradigm shift from retrieval-based to generation-based recommendation. Items are converted into discrete codes, and the model autoregressively predicts the next item's code. Early methods focused on single modalities (RecForest for collaborative, TIGER for semantic).
- Multi-modal Generative Recommendation (Separate Codes): Recognizing the value of combining modalities, methods like EAGER and SC-Rec started integrating both semantic and collaborative knowledge, but typically with separate codes for each modality. UNGER fits into this latest stage of the evolution, pushing multi-modal generative recommendation forward.
3.4. Differentiation Analysis
Compared to the main methods in related work, UNGER introduces key innovations:
- Unified Code vs. Separate Codes:
  - Related Work (EAGER, SC-Rec): These models integrate multiple modalities (semantic, collaborative) but typically maintain separate codes for each, leading to $O(kn)$ computational cost and $O(km)$ storage cost (where $k$ is the number of modalities). This results in higher inference latency and storage requirements.
  - UNGER: Proposes a single unified code (Unicodes) that cohesively encodes both semantic and collaborative knowledge. This reduces the computational cost to $O(n)$ and the storage cost to $O(m)$, making it significantly more efficient and scalable for large-scale deployment.
- Addressing Semantic Domination:
  - Related Work (Concatenation): A common baseline for integrating modalities is direct feature concatenation. However, UNGER highlights that this approach suffers from semantic dominance, where the richer semantic embeddings (e.g., from LLMs) disproportionately influence the final representation, marginalizing collaborative signals. Figure 4 clearly illustrates this with 97.33% semantic similarity vs. 2.67% collaborative similarity.
  - UNGER: Explicitly tackles semantic dominance through a learnable modality adaptation layer with AdaLN and a cross-modality knowledge alignment (CKA) task. This ensures a balanced and adaptive fusion of knowledge, leading to a more effective combined representation (e.g., 59.89% semantic vs. 40.11% collaborative similarity in Table 6).
- Compensating Quantization Loss:
  - Related Work: While quantization is necessary for generative models, it inherently introduces approximation errors and information loss. Previous methods may not explicitly address this, instead relying on the quantized codes being sufficient.
  - UNGER: Introduces an intra-modality knowledge distillation (IKD) task. This auxiliary objective leverages the original (pre-quantization) embeddings and a special [c_dis] token to provide more complete guidance, thereby compensating for information loss during quantization and improving the quality of autoregressive generation.

In essence, UNGER's core innovation lies in its holistic approach to multi-modal generative recommendation: not just combining modalities, but doing so efficiently with a unified code, harmoniously by resolving semantic dominance, and robustly by mitigating quantization loss.
4. Methodology
4.1. Principles
The core idea behind UNGER is to integrate semantic and collaborative knowledge into a single unified code (called Unicodes) for generative recommendation. This unification aims to enhance efficiency, reduce costs, and improve recommendation quality by leveraging the complementary strengths of both modalities while explicitly addressing the semantic dominance issue and mitigating information loss during the quantization process. The overall approach is structured as a two-stage framework: first, generating robust item Unicodes, and second, using these Unicodes for autoregressive generative recommendation.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
Given an item corpus $\mathcal{I}$ and a user's historical interaction sequence $U = [u_1, u_2, \ldots, u_{t-1}]$, where each $u_i \in \mathcal{I}$, the objective of a sequential recommendation system is to predict the next most likely item $u_t \in \mathcal{I}$ that the user may interact with.
In the generative framework, each item is represented by a sequence of codes $C = [c_1, c_2, \ldots, c_m]$, where $m$ denotes the length of the code sequence. The sequential recommendation task is thereby transformed into predicting the codes $C_t$ of the next item $u_t$ based on the user's historical interaction sequence $U$. During training, the model first encodes $U$ and then autoregressively generates the codes of the target item $u_t$ step by step at the decoder. The decoding process is defined by the following formula:
$$p(C_t \mid U) = \prod_{i=1}^{m} p(c_i \mid U, c_1, c_2, \ldots, c_{i-1})$$
Where:
- $p(C_t \mid U)$ is the probability of generating the entire code sequence of the next item given the user's interaction history $U$.
- $m$ is the length of the code sequence of an item.
- $p(c_i \mid U, c_1, c_2, \ldots, c_{i-1})$ is the probability of generating the $i$-th code $c_i$, conditioned on the user history and all previously generated codes of the current item.
During the inference phase, the decoder performs beam search to autoregressively generate the codes of the top-$k$ items.
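The factorized objective above can be illustrated with a short sketch that scores a target code sequence under teacher forcing. The helper name `sequence_log_prob` and the way the per-step logits are produced are assumptions; in practice the logits come from the Transformer decoder conditioned on the user history.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(step_logits: torch.Tensor, target_codes: torch.Tensor) -> torch.Tensor:
    """log p(C_t | U) = sum_i log p(c_i | U, c_1..c_{i-1}).

    step_logits: [m, vocab_size] logits for the m code positions
                 (already conditioned on the user history and previous codes).
    target_codes: [m] ground-truth code sequence of the next item.
    """
    log_probs = F.log_softmax(step_logits, dim=-1)             # per-step distributions
    return log_probs.gather(1, target_codes.unsqueeze(1)).sum()

# Toy usage: code length m=4, codebook size 256.
logits = torch.randn(4, 256)
codes = torch.randint(0, 256, (4,))
print(sequence_log_prob(logits, codes))
```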
4.2.2. Overall Pipeline
UNGER operates in two main stages, as illustrated in Figure 5 from the original paper.
The figure is a schematic of the overall UNGER framework. It is divided into two stages: Stage I performs item Unicode generation, including a pre-trained semantic encoder and a knowledge alignment task with the collaborative modality; Stage II performs generative recommendation, feeding the user history into a Transformer encoder and applying contrastive learning and knowledge distillation. Arrows indicate data flow, and the relationships between tasks and modules are marked with boxes and connecting lines.
- Stage I: Item Unicodes Generation: The goal is to create item Unicodes that encapsulate both collaborative and semantic knowledge in a single unified codebook. This stage involves:
  - Extracting modality-specific embeddings using pre-trained models (e.g., DIN for collaborative, Llama2-7b for semantic).
  - Fusing these embeddings through a modality-adaptive fusion module.
  - Jointly optimizing this fusion with a cross-modality knowledge alignment (CKA) task and a next item prediction task to resolve semantic dominance.
  - Quantizing the integrated embeddings into discrete Unicodes using hierarchical K-means clustering.
- Stage II: Generative Recommendation: This stage focuses on the autoregressive generation of Unicodes for recommendation. It comprises:
  - A Transformer-based encoder to capture user interests from the interaction history.
  - A Transformer-based decoder to predict the Unicode sequence of the next item.
  - An intra-modality knowledge distillation (IKD) task to compensate for information loss during quantization.
  - After training, beam search is used to generate top-k recommended Unicodes, which are then mapped back to items.
4.2.3. Stage I: Item Unicodes Generation
This stage aims to construct item Unicodes that encode both collaborative and semantic knowledge within a single unified codebook.
- Initial Embedding Extraction:
  - A user's history is represented by its item ID sequence and the corresponding item semantic information (e.g., titles).
  - A randomly initialized sequential recommendation model (e.g., DIN [72]) encodes the ID sequence into collaborative embeddings $E_C$.
  - A pre-trained semantic encoder (e.g., Llama2-7b [53]) encodes the item text into semantic embeddings $E_S$.
- Modality Adaptation Layer (MAL): To bridge the modality gap between the two embeddings and align them into a common space, UNGER proposes a modality adaptation layer (MAL) that maps the semantic embeddings $E_S$ into the same embedding space as the collaborative embeddings. This layer utilizes AdaLN (Adaptive Layer Normalization) [43], which introduces learnable affine parameters conditioned on the input itself, allowing dynamic adjustment of normalization based on modality-specific properties. The mapping process is defined as $E_T = \mathrm{MAL}(E_S)$, where:
  - $E_S$ represents the original semantic embeddings.
  - $E_T$ denotes the transformed semantic embeddings after adaptation, now aligned with the collaborative embedding space.
  - MAL is the modality adaptation layer.
  - $W$ and $b$ are learnable parameters (weight matrix and bias vector, respectively) inside the MAL that transform the semantic embeddings.
  - AdaLN is the Adaptive Layer Normalization function, which dynamically adjusts the normalization based on the input features, helping preserve and align modality-specific signals from content-driven semantics and behavior-driven preferences.
- Cross-modality Knowledge Alignment Task (CKA): To effectively integrate semantic and collaborative knowledge and counteract semantic dominance, a cross-modality knowledge alignment task is introduced. For a given item $i$, its collaborative embedding $E_{C_i}$ is pulled closer to its transformed semantic embedding $E_{T_i}$, while being pushed away from the transformed semantic embeddings $E_{T_j}$ of other items (negative samples). This encourages the learned integrated embedding for item $i$ to encapsulate knowledge from both modalities. The Info-NCE loss [3] is adopted for this task:
  $$\mathcal{L}_{\mathrm{CKA}} = -\sum_{i \in \mathcal{B}} \log \frac{\exp\big(s(E_{C_i}, E_{T_i}) / \tau\big)}{\sum_{j \in \mathcal{B}} \exp\big(s(E_{C_i}, E_{T_j}) / \tau\big)}$$
  Where:
  - $\mathcal{B}$ is the set of items in the current batch.
  - $E_{C_i}$ is the collaborative embedding of item $i$.
  - $E_{T_i}$ is the transformed semantic embedding of item $i$ (from the MAL).
  - $E_{T_j}$ is the transformed semantic embedding of a negative sample item $j$ from the same batch.
  - $s(\cdot, \cdot)$ is a similarity function (e.g., dot product or cosine similarity).
  - $\tau$ is a temperature parameter that controls the smoothness of the similarity distribution.
- Next Item Prediction Task: In addition to CKA, the model also optimizes a standard next item prediction task using the collaborative representations. This task takes the user's historical interaction sequence as input, learns a representation of user preferences, and computes matching scores with candidate items. The loss function is a standard cross-entropy loss for sequential prediction:
  $$\mathcal{L}_{\mathrm{Rec}} = -\sum_{t=1}^{n} \log p(u_t \mid u_1, u_2, \ldots, u_{t-1})$$
  Where:
  - $n$ is the length of the sequence.
  - $p(u_t \mid u_1, \ldots, u_{t-1})$ is the probability of predicting item $u_t$ given the preceding items in the sequence.
- Joint Optimization for Stage I: The total loss function for the first stage (Stage I) combines the next item prediction loss and the cross-modality knowledge alignment loss:
  $$\mathcal{L}_{\mathrm{Stage\,I}} = \mathcal{L}_{\mathrm{Rec}} + \alpha \mathcal{L}_{\mathrm{CKA}}$$
  Where:
  - $\mathcal{L}_{\mathrm{Rec}}$ is the next item prediction loss.
  - $\mathcal{L}_{\mathrm{CKA}}$ is the cross-modality knowledge alignment loss.
  - $\alpha$ is a tunable hyperparameter that adjusts the relative importance of the alignment loss.
- Unicodes Generation (Hierarchical K-means): After Stage I training, the integrated embeddings (the collaborative embeddings fused with the transformed semantic embeddings, implicitly via the CKA loss aligning $E_T$ to $E_C$ in the target space) are used to generate Unicodes via hierarchical K-means clustering. This method efficiently encodes high-dimensional item embeddings into discrete hierarchical codes while preserving as much information as possible. The core relationship at each layer is:
  $$r_i^0 = v_i, \qquad r_i^l = r_i^{l-1} - c_i^l$$
  Where:
  - $v_i$ denotes the original integrated embedding of item $i$.
  - $r_i^l$ is the residual vector for item $i$ at layer $l$.
  - $c_i^l$ represents the centroid assigned to item $i$ at the $l$-th layer of quantization.

  The Hierarchical K-means Clustering algorithm (Algorithm 1 in the paper) proceeds as follows (a runnable sketch follows this subsection):

  Algorithm 1 Hierarchical K-means Clustering
  Input: Item embeddings $V = \{v_1, v_2, \ldots\}$, number of clusters $K$, hierarchy depth $L$
  Output: Unicode sequences $[z_i^1, z_i^2, \ldots, z_i^L]$ for all items
  1: for each item $i$ do
  2:   Initialize residual: $r_i^0 \leftarrow v_i$
  3: end for
  4: for layer $l \leftarrow 1$ to $L$ do
  5:   Collect residuals: $R^{l-1} \leftarrow \{r_1^{l-1}, r_2^{l-1}, \ldots\}$
  6:   Perform K-means clustering on $R^{l-1}$ with $K$ clusters
  7:   Store the centroids in codebook $\mathcal{C}^l = \{c_1^l, \ldots, c_K^l\}$
  8:   for each item $i$ do
  9:     Compute distances: $d_k = \lVert r_i^{l-1} - c_k^l \rVert_2$ for all $k \in [1, K]$
  10:    Assign cluster index: $z_i^l \leftarrow \arg\min_k d_k$
  11:    Record centroid: $c_i^l \leftarrow c_{z_i^l}^l$
  12:    Update residual: $r_i^l \leftarrow r_i^{l-1} - c_i^l$
  13:   end for
  14: end for

  Breaking down the steps:
  - Initialization: For each item $i$, its residual vector is initialized with its original integrated embedding, i.e., $r_i^0 = v_i$.
  - Iterative Quantization (Layers): The process iterates for $L$ layers (the hierarchy depth).
    - Collect Residuals: At each layer $l$, all current residual vectors $r_i^{l-1}$ from the previous layer are collected.
    - K-means Clustering: K-means clustering is performed on these residuals to partition them into $K$ clusters, identifying common patterns.
    - Store Centroids: The resulting centroids are stored in a codebook $\mathcal{C}^l$.
    - Assign Cluster Index & Update Residuals: For every item $i$:
      - The Euclidean distance is computed between its current residual vector $r_i^{l-1}$ and each centroid in $\mathcal{C}^l$.
      - The item is assigned to the cluster whose centroid is nearest: $z_i^l = \arg\min_k \lVert r_i^{l-1} - c_k^l \rVert_2$. This index is the discrete code for item $i$ at layer $l$.
      - The assigned centroid $c_i^l$ is recorded.
      - The residual vector is updated by subtracting the assigned centroid: $r_i^l = r_i^{l-1} - c_i^l$. This removes the information captured by the current centroid, allowing the next layer to focus on the remaining details.
  After $L$ layers, each item obtains a sequence of discrete cluster indices $[z_i^1, \ldots, z_i^L]$, which serves as its compact and hierarchical Unicode representation. An item-Unicode lookup table is then constructed to map each item to its Unicode sequence.
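Below is a compact, illustrative Python sketch of the residual (hierarchical) K-means procedure in Algorithm 1, using scikit-learn's KMeans. The function name is an assumption, and the toy values of K and L are small stand-ins for the paper's settings (K = 256, L = 4).

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_codes(V: np.ndarray, K: int, L: int, seed: int = 0):
    """Quantize item embeddings V [num_items, dim] into L-level codes with K clusters per level.

    Returns (codes [num_items, L], codebooks: list of L arrays [K, dim])."""
    residuals = V.copy()                           # r_i^0 <- v_i
    codes = np.zeros((V.shape[0], L), dtype=np.int64)
    codebooks = []
    for level in range(L):
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(residuals)
        codebooks.append(km.cluster_centers_)      # store this level's centroids
        codes[:, level] = km.labels_               # z_i^l <- index of the nearest centroid
        residuals = residuals - km.cluster_centers_[km.labels_]  # r_i^l <- r_i^{l-1} - c_i^l
    return codes, codebooks

# Toy usage: 1,000 items with 96-dim integrated embeddings, K=8 clusters, L=4 levels.
codes, books = hierarchical_kmeans_codes(np.random.randn(1000, 96), K=8, L=4)
print(codes[:3])   # each row is one item's Unicode, e.g. [3, 0, 7, 2]
```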
4.2.4. Stage II: Generative Recommendation
This stage uses the generated Unicodes for autoregressive generative recommendation.
- Encoding Process: Given a user's interaction history, the interacted items are converted into their respective Unicode sequences. The resulting sequence of Unicodes is fed into an encoder (consisting of stacked multi-head self-attention layers and feed-forward layers, following the Transformer architecture). The encoder processes this sequence to produce a feature representation that captures the user's interests and is passed to the decoder.
- Decoding Process: On the decoder side, the objective is to predict the Unicode sequence of the next item $x_t$. For training, a special [BOS] (Begin-of-Sequence) token is prepended to the item Unicode sequence to form the decoder input. The generative recommendation loss is computed with a cross-entropy loss function:
  $$\mathcal{L}_{\mathrm{Gen}} = -\sum_{i=1}^{m} \log p(c_i \mid H, \mathrm{[BOS]}, c_1, \ldots, c_{i-1})$$
  Where:
  - $m$ is the length of the Unicode sequence.
  - $H$ represents the encoded user history (the output of the encoder).
  - [BOS] is the Begin-of-Sequence token.
  - $c_i$ is the $i$-th code in the target item's Unicode sequence.
  - $p(c_i \mid \cdot)$ is the probability of predicting the $i$-th code, conditioned on the encoded user history and the previously generated codes of the current item.
- Intra-modality Knowledge Distillation Task (IKD): To compensate for information loss introduced by the quantization process in Stage I, an intra-modality knowledge distillation task is introduced. Inspired by the [CLS] token in BERT for capturing global context, a learnable token [c_dis] is appended to the end of the decoder input sequence and is designed to capture global information about the sequence. The final-layer output corresponding to [c_dis] is used in a global contrastive learning objective:
  - The positive sample is the integrated embedding $E_t$ of the target item $x_t$, which was learned in Stage I before quantization.
  - Negative samples are randomly selected integrated embeddings of other items in the corpus, excluding $x_t$.
  - This objective pulls the [c_dis] output closer to the positive sample $E_t$ and pushes it away from the negative samples. The loss function for this distillation task is:
  $$\mathcal{L}_{\mathrm{IKD}} = -\log \frac{\exp(h_{\mathrm{dis}} \cdot E_t)}{\exp(h_{\mathrm{dis}} \cdot E_t) + \sum_{j \ne t} \exp(h_{\mathrm{dis}} \cdot E_j)}$$
  Where:
  - $h_{\mathrm{dis}}$ denotes the final-layer output of the decoder for the special token [c_dis].
  - $E_t$ represents the integrated embedding of the target item $x_t$, as learned in Stage I.
  - $E_j$ is the integrated embedding of a negative sample item, also learned in Stage I.
  - The dot product ($\cdot$) measures similarity.
- Total Loss for Stage II: The total loss for Stage II combines the generative recommendation loss and the intra-modality knowledge distillation loss:
  $$\mathcal{L}_{\mathrm{Stage\,II}} = \mathcal{L}_{\mathrm{Gen}} + \beta \mathcal{L}_{\mathrm{IKD}}$$
  Where:
  - $\mathcal{L}_{\mathrm{Gen}}$ is the generative recommendation loss.
  - $\mathcal{L}_{\mathrm{IKD}}$ is the intra-modality knowledge distillation loss.
  - $\beta$ is a hyperparameter that balances the two objectives.
4.2.5. Training and Inference
- Training: UNGER employs a two-stage training process:
  - Stage I Training: A sequential recommendation model (e.g., DIN) is used as the backbone and optimized with the loss $\mathcal{L}_{\mathrm{Stage\,I}}$. After this stage, the item integrated embeddings are extracted, and the item Unicodes are derived using hierarchical K-means clustering.
  - Stage II Training: A Transformer encoder-decoder model is trained for generative recommendation, optimized with the loss $\mathcal{L}_{\mathrm{Stage\,II}}$.
- Inference: During inference, only the trained Transformer model from Stage II is utilized.
  - The decoder performs beam search to autoregressively generate each token of the item Unicode sequence.
  - Once the Unicode sequences are generated, they are mapped back to their corresponding items using the item-Unicode lookup table.
  - A top-$k$ recommendation list is produced based on the confidence scores of the generated Unicodes.
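The inference procedure just described can be illustrated with a simplified beam-search sketch over code tokens. `decode_step` is a hypothetical function returning next-token log-probabilities for a given prefix, and the item lookup is shown with a plain dict; neither is part of the paper.

```python
import numpy as np

def beam_search(decode_step, code_len, vocab_size, beam_width=100):
    """Autoregressively generate `beam_width` code sequences of length `code_len`.

    decode_step(prefix) -> np.ndarray of shape [vocab_size] with next-token log-probs."""
    beams = [([], 0.0)]                                    # (prefix, cumulative log-prob)
    for _ in range(code_len):
        candidates = []
        for prefix, score in beams:
            log_probs = decode_step(prefix)
            top = np.argsort(log_probs)[-beam_width:]      # keep only promising extensions
            candidates += [(prefix + [int(t)], score + float(log_probs[t])) for t in top]
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams                                           # ranked (Unicode, score) pairs

# Toy usage with a uniform dummy decoder and a hypothetical Unicode -> item lookup table.
dummy = lambda prefix: np.log(np.full(8, 1.0 / 8))
results = beam_search(dummy, code_len=4, vocab_size=8, beam_width=5)
lookup = {}  # e.g. {(3, 0, 7, 2): "item_42"}
top_items = [lookup.get(tuple(code)) for code, _ in results]
```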
4.2.6. Computational and Storage Costs Analysis
The paper provides a comparison of UNGER's efficiency against existing generative recommendation methods.
- Computational Cost:
  - UNGER: The encoding phase (processing the user history) occurs only once. The main burden is the decoding stage, where the decoder autoregressively generates each token of the target item code. If the code length is fixed and decoding one item code takes $O(n)$ time, the total decoding complexity is $O(n)$.
  - Existing Methods (e.g., EAGER, SC-Rec): These adopt separate modality-specific codes, typically maintaining $k$ parallel decoders (e.g., one for semantic, one for collaborative). They must decode $k$ modality-specific codes in parallel, leading to a cumulative decoding cost of $O(kn)$. An additional ranking or fusion step is often required.
  - Advantage: UNGER reduces the computational complexity from $O(kn)$ to $O(n)$.
- Storage Cost:
  - UNGER: Compresses all relevant information into a single unified code. If storing one modality-specific code requires $O(m)$ space, UNGER requires only $O(m)$ storage per item.
  - Existing Methods: Store $k$ codes per item, leading to a total storage cost of $O(km)$.
  - Advantage: UNGER reduces storage requirements from $O(km)$ to $O(m)$.
The following are the results from Table 1 of the original paper:
| | TIGER | EAGER | UNGER (Ours) |
| Computation Cost | O(n) | O(2n) | O(n) |
| Storage Cost | O(m) | O(2m) | O(m) |
| Used Modality | Semantic Only | Semantic + Collaborative | Semantic + Collaborative |
This table clearly illustrates UNGER's efficiency advantages. While TIGER also has O(n) computation and O(m) storage, it uses only semantic information. EAGER uses both semantic and collaborative information but incurs O(2n) computation and O(2m) storage due to its dual-code approach. UNGER achieves the best of both worlds: leveraging both modalities with the efficiency of a single-modality system.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three public benchmarks derived from the Amazon Product Reviews dataset [40]. This dataset contains user reviews and item metadata from May 1996 to July 2014. The chosen categories for sequential recommendation are "Beauty", "Sports and Outdoors", and "Toys and Games".
For all datasets, user interaction records are grouped by user and sorted chronologically by timestamp. To ensure data quality and relevance, the 5-core dataset filtering strategy was applied, which means only users and items with at least five interaction records are retained, filtering out unpopular items and inactive users. This helps mitigate sparsity and focuses on more active engagement patterns.
The following are the results from Table 2 of the original paper:
| Dataset | #Users | #Items | #Interactions | #Density |
| Beauty | 22,363 | 12,101 | 198,360 | 0.00073 |
| Sports and Outdoors | 35,598 | 18,357 | 296,175 | 0.00045 |
| Toys and Games | 19,412 | 11,924 | 167,526 | 0.00073 |
These datasets are widely used in sequential recommendation research, making them effective for validating the method's performance and allowing for fair comparison with existing benchmarks. The relatively low density across all datasets (ranging from 0.00045 to 0.00073) highlights the inherent sparsity of recommendation data, which UNGER aims to address by integrating diverse knowledge modalities.
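The 5-core filtering described above can be illustrated with a small pandas sketch that iteratively drops users and items with fewer than five interactions until the dataset is stable. The column names and helper name are assumptions about the raw review-data layout.

```python
import pandas as pd

def k_core_filter(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Repeatedly drop users/items with fewer than k interactions until stable."""
    while True:
        before = len(df)
        df = df[df.groupby("user_id")["item_id"].transform("size") >= k]
        df = df[df.groupby("item_id")["user_id"].transform("size") >= k]
        if len(df) == before:          # no rows removed in this round -> converged
            return df

# Toy usage on a tiny interaction log (later sorted per user by timestamp).
log = pd.DataFrame({"user_id": [1, 1, 2, 2, 2], "item_id": [10, 11, 10, 11, 12],
                    "timestamp": [5, 6, 1, 2, 3]})
print(k_core_filter(log, k=2))
```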
5.2. Evaluation Metrics
The paper uses two standard metrics for evaluating recommendation performance: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K). These metrics are reported for K = 10 and K = 20.
- Recall@K:
  - Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top-K recommended items. It focuses on how many of the "truly good" items for a user are actually present in the recommended list, regardless of their specific ranking within that list. A higher Recall@K indicates that the model is better at identifying and including relevant items.
  - Mathematical Formula:
  $$\mathrm{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{1}(r_u \le K)$$
  - Symbol Explanation:
    - $|\mathcal{U}|$: The total number of users in the evaluation set.
    - $\mathcal{U}$: The set of all users.
    - $u$: A specific user.
    - $\mathbb{1}(\cdot)$: An indicator function that returns 1 if its argument is true, and 0 otherwise.
    - $r_u$: The rank of the true next item (ground-truth item) for user $u$ in the recommendation list; if the true item is within the top-K list, $r_u \le K$ holds.
- NDCG@K:
  - Conceptual Definition: NDCG@K evaluates the quality of the ranking of the top-K recommended items. Unlike Recall@K, NDCG@K considers not only whether relevant items are present but also their positions in the list: relevant items ranked higher contribute more to the score. It is normalized by the Ideal DCG so that scores are comparable across users. A higher NDCG@K indicates better ranking quality.
  - Mathematical Formula: The component formulas for NDCG@K are:
  Discounted Cumulative Gain (DCG) for a user $u$ at $K$:
  $$\mathrm{DCG}_u@K = \sum_{j=1}^{K} \frac{y_{u,j}}{\log_2(j+1)}$$
  Ideal Discounted Cumulative Gain (IDCG) for a user $u$ at $K$:
  $$\mathrm{IDCG}_u@K = \sum_{j=1}^{K} \frac{y^*_{u,j}}{\log_2(j+1)}$$
  Normalized Discounted Cumulative Gain (NDCG) for a user $u$ at $K$:
  $$\mathrm{NDCG}_u@K = \frac{\mathrm{DCG}_u@K}{\mathrm{IDCG}_u@K}$$
  Average NDCG@K over all users:
  $$\mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathrm{NDCG}_u@K$$
  - Symbol Explanation:
    - $y_{u,j}$: The relevance of the $j$-th recommended item for user $u$. In binary relevance settings (common in sequential recommendation), $y_{u,j} = 1$ if the $j$-th item is the ground-truth item (the next item the user actually interacted with), and 0 otherwise.
    - $y^*_{u,j}$: The relevance of the $j$-th item in the ideal ranking for user $u$. The ideal ranking places all ground-truth relevant items at the top; for binary relevance with a single relevant item, $y^*_{u,1} = 1$ and all other positions are 0.
    - $K$: The number of top recommended items considered.
    - $\log_2(j+1)$: A logarithmic discount factor, meaning items ranked lower are given less weight.
    - $|\mathcal{U}|$: The total number of users.
- Evaluation Protocol: The leave-one-out strategy is used, a standard protocol [24]. For each user, the last interaction is used as the ground-truth item for testing and the second-to-last item for validation; all preceding interactions form the training sequence. The user history length is limited to 20 items during training.
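Under the leave-one-out protocol above, each user has exactly one ground-truth item, so both metrics reduce to simple functions of its rank. The sketch below makes this explicit; the helper names are illustrative, ranks are 1-based, and a rank of None is taken to mean the item was not retrieved at all.

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of users whose ground-truth item appears in the top-k list."""
    return np.mean([1.0 if r is not None and r <= k else 0.0 for r in ranks])

def ndcg_at_k(ranks, k):
    """With a single relevant item, IDCG@k = 1 and DCG@k = 1 / log2(rank + 1)."""
    return np.mean([1.0 / np.log2(r + 1) if r is not None and r <= k else 0.0 for r in ranks])

# Toy usage: ranks of the held-out item for four users (None = missed entirely).
ranks = [1, 4, None, 12]
print(recall_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```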
5.3. Baselines
UNGER is compared against a comprehensive set of representative baselines, categorized into classical sequential methods and generative methods.
5.3.1. Classical Sequential Methods
These models typically predict the next item based on item ID sequences.
- GRU4REC [17]: An RNN-based model using Gated Recurrent Units (GRUs) to model user click sequences.
- Caser [52]: A CNN-based method that captures high-order Markov Chains by modeling user behaviors with both horizontal and vertical convolutional operations.
- SASRec [24]: A self-attention-based sequential recommendation model that utilizes a unidirectional Transformer encoder to predict the next item.
- BERT4Rec [51]: Adopts a Transformer with a bidirectional self-attention mechanism and a Cloze objective loss for item sequence modeling, inspired by BERT.
- HGN [39]: Employs hierarchical gating networks to capture both long-term and short-term user interests.
- FDSA [69]: Leverages self-attention networks to model item-level and feature-level sequences separately, emphasizing feature transition dynamics.
- S^3-Rec [73]: A bidirectional Transformer pre-trained with mutual information maximization via self-supervised tasks for sequential recommendation.
5.3.2. Generative Methods
These models frame recommendation as an autoregressive generation task.
- RecForest [11]: Jointly learns latent embeddings and indices through multiple K-ary trees, using hierarchical balanced clustering and a Transformer-based encoder-decoder routing network.
- TIGER [47]: Utilizes a pre-trained T5 encoder to learn semantic identifiers for items, then autoregressively decodes target candidates, incorporating RQ-VAE quantization.
- ColaRec [61]: Integrates user-item interactions and content data end-to-end, aligning the semantic and collaborative spaces via pretrained collaborative identifiers, an item indexing task, and a contrastive loss.
- EAGER [62]: Integrates behavioral and semantic information through a two-stream architecture with shared encoding but separate decoding pipelines for each modality, using contrastive and semantic-guided learning.
5.4. Implementation Details
The paper provides detailed implementation specifics for UNGER.
- Model Architecture:
  - Encoder Layers: 1
  - Decoder Layers: 4
  - Embedding Dimension: 96
  - Hidden Size: 256
  - Number of Attention Heads: 6
- Quantization:
  - Number of Clusters (K): 256
  - Cluster Depth (L): 4
- Encoders:
  - Collaborative Encoder: DIN [72]
  - Semantic Encoder: Pre-trained Llama2-7b [53] (hidden size 128, as reported in [47, 6])
- Training:
  - Optimizer: Adam
  - Learning Rate: 1e-3
  - Warmup Strategy: Applied for stable training, with 2,000 warmup steps and a warmup initial learning rate of 1e-7.
  - Batch Size: 256
  - Training Steps: 20,000
  - Dropout Rate: 0
  - Activation Function: ReLU
  - Weight Decay: 1e-7
- Loss Coefficients:
  - α (for the CKA task in Stage I): 1.0
  - β (for the IKD task in Stage II): 1.0
  - τ (temperature parameter for the Info-NCE loss): 1.0
  - The authors note UNGER's robustness to α and β due to fast convergence, allowing both to be set to 1.0.
- Inference:
  - Beam Width: 100
- Reproducibility: Each experiment is conducted five times with different random seeds (chosen from [2020, 2021, 2022, 2023, 2024]), and the average score is reported. Statistical significance is tested with a paired t-test.
The following are the results from Table 4 of the original paper:
| Parameter | Value |
| Embedding Dimension | 96 |
| Model Layers | 4 |
| Hidden Size | 256 |
| Heads | 6 |
| Num of Clusters | 256 |
| Cluster Depth | 4 |
| Learning Rate | 1e-3 |
| Optimizer | Adam |
| Semantic Encoder | Llama2-7b |
| Collaborative Encoder | DIN |
| Batch Size | 256 |
| Training Steps | 20000 |
| Dropout Rate | 0 |
| Beam Width | 100 |
| Activation Function | ReLU |
| Weight Decay | 1e-7 |
| Warmup Steps | 2000 |
| Warmup Initial Learning Rate | 1e-7 |
| Random Seed | 2024 |
| α | 1.0 |
| β | 1.0 |
| τ | 1.0 |
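The warmup settings in Table 4 (2,000 warmup steps ramping from 1e-7 toward the 1e-3 base learning rate) can be realized with a simple schedule such as the sketch below. The linear ramp and the constant rate after warmup are assumptions about the schedule shape, not details confirmed by the paper.

```python
def warmup_lr(step: int, base_lr: float = 1e-3, warmup_init: float = 1e-7,
              warmup_steps: int = 2000) -> float:
    """Linearly ramp the learning rate from warmup_init to base_lr, then hold it constant."""
    if step < warmup_steps:
        return warmup_init + (base_lr - warmup_init) * step / warmup_steps
    return base_lr

# Toy usage: learning rate at a few points of training.
print([round(warmup_lr(s), 8) for s in (0, 1000, 2000, 10000)])
```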
The paper also provides information on practical resources required.
The following are the results from Table 5 of the original paper:
| Information | Value |
| Device | NVIDIA RTX 3090 GPU (24GB) ×1 |
| Training Time | less than 1 GPU hour |
| Inference Latency (ms) | 19.5 |
| Model Parameter | 10249988 |
This indicates that UNGER is relatively efficient in terms of training time and inference latency, requiring less than an hour of training on a single NVIDIA RTX 3090 GPU (a high-end consumer/prosumer GPU) and having a low inference latency of 19.5 ms. The model parameter count is around 10 million, which is moderate for deep learning models.
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
The paper thoroughly evaluates UNGER's performance against both classical and generative sequential recommendation baselines across three Amazon datasets (Beauty, Sports and Outdoors, Toys and Games) using Recall@10, Recall@20, NDCG@10, and NDCG@20.
The following are the results from Table 3 of the original paper:
| Dataset | Metric | Classical | | | | | | | Generative | | | | UNGER (Ours) |
| | | GRU4REC | Caser | SASRec | BERT4Rec | HGN | FDSA | S3-Rec | RecForest | TIGER | ColaRec | EAGER | |
| Beauty | Recall@10 | 0.0283 | 0.0605 | 0.0347 | 0.0512 | 0.0407 | 0.0647 | 0.0664 | 0.0617* | 0.0524* | 0.0836 | 0.0939 ±0.00093 | ||
| Recall@20 | 0.0479 | | 0.0902 | 0.0599 | 0.0773 | 0.0656 | 0.0994 | 0.0915 | 0.0924* | 0.0807* | 0.1124 | 0.1289 ±0.00084 | |
| NDCG@10 | 0.0137 | 0.0176 | 0.0318 | 0.0170 | 0.0266 | 0.0208 | 0.0327 | 0.0400 | 0.0339* | 0.0263* | 0.0525 | 0.0559 ±0.00037 | ||
| NDCG@20 | 0.0229 | 0.0394 | 0.0233 | 0.0332 | 0.0270 | 0.0414 | 0.0464 | 0.0417* | 0.0335* | 0.0599 | 0.0646 ±0.00041 | |||
| Sports | Recall@10 | 0.0204 | 0.0194 | 0.0350 | 0.0191 | 0.0313 | 0.0288 | 0.0385 | 0.0247 | 0.0376* | 0.0348* | 0.0441 | 0.0471 ±0.00078 | |
| Recall@20 | 0.0333 | 0.0314 | 0.0507 | 0.0315 | 0.0477 | 0.0463 | 0.0607 | 0.0375 | 0.0577* | 0.0533* | 0.0659 | 0.0710 ±0.00075 | ||
| NDCG@10 | 0.0110 | 0.0097 | 0.0192 | 0.0099 | 0.0159 | 0.0156 | 0.0204 | 0.0133 | 0.0196* | 0.0179* | 0.0236 | 0.0259 ±0.00034 | ||
| NDCG@20 | 0.0142 | 0.0126 | 0.0231 | 0.0130 | 0.0201 | 0.0200 | 0.0260 | 0.0164 | 0.0246* | 0.0226* | 0.0291 | 0.0319 ±0.00049 | ||
| Toys | Recall@10 | 0.0176 | 0.0270 | 0.0675 | 0.0203 | 0.0497 | 0.0381 | 0.0700 | 0.0383 | 0.0578* | 0.0474* | 0.0714 | 0.0822 ±0.00085 | |
| Recall@20 | 0.0301 | 0.0420 | 0.0941 | 0.0358 | 0.0716 | 0.0632 | 0.1065 | 0.0483 | 0.0838* | 0.0704* | 0.1024 | 0.1154 ±0.00070 | ||
| NDCG@10 | 0.0084 | 0.0141 | 0.0374 | 0.0099 | 0.0277 | 0.0189 | 0.0376 | 0.0285 | 0.0321* | 0.0242* | 0.0505 | 0.0489 ±0.00032 | ||
| NDCG@20 | 0.0116 | 0.0179 | 0.0441 | 0.0138 | 0.0332 | 0.0252 | 0.0468 | 0.0310 | 0.0386* | 0.0300* | 0.0538 | 0.0573 ±0.00025 | ||
Key Observations:
- UNGER's Superiority: UNGER consistently achieves the highest performance across all datasets and metrics. On the Beauty dataset, it outperforms the second-best generative baseline (EAGER) by 14.68% in Recall@20 (0.1289 vs. 0.1124) and 7.85% in NDCG@20 (0.0646 vs. 0.0599). This significant gain is attributed to UNGER's effective integration of collaborative filtering signals and semantic content representations via a unified discrete code, which enhances efficiency and reduces redundancy.
- Generative Models Outperform Classical Models: A clear trend emerges: generative models generally outperform classical embedding-based or ID-based sequential models. This highlights the advantage of representing items as structured, discrete hierarchical codes that encapsulate rich prior knowledge and align well with Transformer-based autoregressive generation. Traditional methods often treat item IDs as atomic symbols that lack contextual information.
- Efficiency Advantage: Despite its superior performance, UNGER is also more lightweight and scalable, requiring approximately half the total parameters of EAGER, which uses dual-code streams. This aligns with the computational and storage cost analysis presented earlier.
6.2. Semantic Domination Issue (RQ2)
To investigate UNGER's effectiveness in mitigating the semantic dominance issue, experiments were conducted on the Amazon Beauty dataset, comparing different methods of integrating semantic and collaborative information.
The following are the results from Table 6 of the original paper:
| Approach | Semantic modality | Collaborative modality |
| Concat | 97.33% | 2.67% |
| Ours | 59.89% | 40.11% |
Methodology for Analysis:
- Embedding Preparation: Semantic embeddings are extracted using a pre-trained Llama2-7b, and collaborative embeddings are obtained from DIN.
- Concatenation Method (Concat): To mimic a common fusion approach, the semantic embeddings are concatenated with the collaborative embeddings. To address the dimensionality and value-distribution mismatches, PCA is applied to reduce the semantic embedding dimension, and both embeddings are normalized before concatenation.
- UNGER's Method (Ours): The integrated embeddings are obtained using UNGER's proposed approach.
- Visualization: T-SNE is used to visualize the embedding distributions, with K-means clustering for color-coding points from each cluster. As illustrated in Figure 6, the concatenation method leads to a fused embedding distribution that closely mirrors the original semantic embeddings, while UNGER achieves a more balanced distribution.
The figure (Figure 6) visualizes the different embedding distributions, including the collaborative, semantic, concatenated, and UNGER-integrated embeddings; colors denote embedding clusters, making it easy to compare how the methods differ in representation.
- Quantitative Measurement: KL divergence is used to quantify the relative similarity of each modality to the final integrated embeddings. KL divergence (Kullback-Leibler divergence) is a non-symmetric measure of how much one probability distribution differs from a second, reference distribution. In this context, it quantifies the dissimilarity between the modality-specific embeddings and the final integrated embeddings. The relative similarity of a modality is obtained by comparing the KL divergences of the two modalities with respect to the integrated embeddings and normalizing them into shares, so that the modality whose distribution is closer to the fused distribution receives the larger share. In the notation used here:
  - $E_F$ represents the final integrated embeddings (from Concat or Ours).
  - $E_S$ denotes the semantic embeddings.
  - $E_C$ refers to the collaborative embeddings.
  - $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence between two distributions; a lower KL value indicates higher similarity.
Analysis of Results:
- Semantic Dominance in Concat: Table 6 shows that the concatenation method results in a highly imbalanced fusion, with 97.33% similarity from the semantic modality and only 2.67% from the collaborative modality. This confirms the semantic dominance problem: semantic embeddings (often pre-trained on large textual corpora with smooth distributional properties) exert disproportionate influence, underutilizing collaborative knowledge. Figure 4 provides a visual representation of this.
The figure (Figure 4) is a pie chart showing, on the Beauty dataset, the relative similarity of each modality to the final representation under the concatenation method: the semantic modality accounts for 97.33% and the collaborative modality for 2.67%.
- UNGER's Balanced Integration: In contrast, UNGER effectively addresses this issue, achieving a much more balanced fusion with 59.89% semantic similarity and 40.11% collaborative similarity. This indicates that UNGER's modality adaptation layer and cross-modality knowledge alignment task successfully learn to integrate both knowledge types more proportionally at the distributional level.
The following are the results from Table 7 of the original paper (all on the Beauty dataset):
| Approach | Recall@10 | Recall@20 | NDCG@10 | NDCG@20 |
| Semantic Only | 0.0791 | 0.1077 | 0.0462 | 0.0535 |
| Collaborative Only | 0.0744 | 0.0997 | 0.0469 | 0.0532 |
| Concat | 0.0759 | 0.1071 | 0.0452 | 0.0530 |
| Ours | 0.0939 | 0.1289 | 0.0559 | 0.0646 |
- Performance Impact: Table 7 further validates UNGER's superiority. UNGER (Ours) significantly outperforms both single-modality approaches (Semantic Only, Collaborative Only) and, critically, the Concatenation method across all metrics. The Concatenation method even performs worse than the Semantic Only approach, underscoring the negative impact of semantic dominance and the failure of naive fusion. This confirms that UNGER's method successfully captures the complementary nature of the two knowledge types, leading to a more effective fusion.
6.3. Ablation Study (RQ3)
An ablation study was conducted to understand the individual contributions of UNGER's key auxiliary tasks: intra-modality knowledge distillation (IKD) and cross-modality knowledge alignment (CKA).
The following are the results from Table 8 of the original paper:
| Dataset | Metric | w/o CKA & IKD | CKA | CKA + IKD (Ours) |
| Beauty | Recall@10 | 0.0759 | 0.0827 | 0.0939 |
| | Recall@20 | 0.1071 | 0.1122 | 0.1289 |
| | NDCG@10 | 0.0452 | 0.0509 | 0.0559 |
| | NDCG@20 | 0.0530 | 0.0583 | 0.0646 |
Analysis: The table compares three variants on the Beauty dataset:
- Baseline (without CKA or IKD): The first results column (whose values match the Concat baseline in Table 7, e.g., 0.0759 Recall@10 and 0.1071 Recall@20) represents the model without either CKA or IKD. This variant performs the worst.
- CKA (with only the CKA task): The CKA column shows improved performance over the baseline, demonstrating that cross-modality knowledge alignment alone provides significant gains by enabling the model to capture richer, more holistic item representations.
- CKA + IKD (Ours) (full UNGER model): This variant, representing the full UNGER model with both tasks, achieves the best performance across all metrics.
Conclusion:
- Indispensable Roles: The degradation in performance when either CKA or IKD is removed strongly suggests that both tasks are critical to UNGER's success.
- Effectiveness of CKA: The CKA task acts as a bridge, effectively aligning collaborative and semantic knowledge, which helps harness their complementary information.
- Synergistic Effect of CKA and IKD: The best performance achieved by the full UNGER model highlights the synergistic effect of the two tasks. CKA establishes strong cross-modal connections, while IKD reinforces intra-modal coherence and compensates for quantization loss. This combination enables the learning of powerful, unified item codes, significantly enhancing recommendation quality.
6.4. Scaling Law Study (RQ4)
The paper investigates the scaling law characteristics of UNGER with respect to model depth, model width, and data volume, a relatively underexplored area in generative recommendation.
6.4.1. Model Depth
The impact of model depth (number of layers) on performance was studied, varying the layers from 1 to 8.
As illustrated in Figure 7 from the original paper, the overall trend indicates consistent performance improvement as the model layers increase.
Figure 7: Effect of the number of layers on Recall@20 and NDCG@20. As depth increases, 'Ours' performs best, while the variants without the collaborative or semantic modality degrade.
- Observation: Both Recall@20 and NDCG@20 show steady gains, particularly within the range of 1 to 8 layers. This suggests that deeper models can capture more complex user-item interaction patterns.
- UNGER's Advantage: The superior performance of UNGER is attributed to its unified code structure, which jointly encodes collaborative and semantic information, providing richer and more diverse learning signals at every layer.
- Synergy with Depth: Deeper layers can refine and abstract over both types of information in a complementary manner, enhancing generalization. UNGER consistently outperforms the variants lacking the collaborative or the semantic modality across all depth settings, and this performance gap widens with increasing depth.
6.4.2. Data Size
The influence of data volume on generative recommendation performance was examined by training models on subsets of the Beauty dataset (20% to 100%).
As shown in Figure 8 from the original paper, both Recall@20 and NDCG@20 consistently improve with increasing data size.
Figure 8: Effect of training data volume (20% to 100% of the Beauty dataset) on Recall@20 (left) and NDCG@20 (right); both metrics increase steadily as the data volume grows.
- Observation: The model benefits effectively from larger training corpora, and its data-efficient nature is reflected in the sharp Recall@20 gain when the data increases from 20% to 40%, indicating rapid leveraging of additional interaction signals.
- Conclusion: This upward trend underscores the importance of training data scale in enhancing the capacity of generative recommendation models and validates the reliability of UNGER as it scales with data.
6.4.3. Model Width
The study also investigated how model width (embedding dimension) affects performance, varying it from 24 to 768.
As presented in Figure 9 from the original paper, both Recall@20 and NDCG@20 exhibit an initial upward trend, followed by a decline as width continues to increase.
Figure 9: Effect of model width (embedding dimension, 24 to 768) on Recall@20 (left) and NDCG@20 (right) on the Beauty dataset; Recall@20 peaks at a width of 96, while NDCG@20 grows relatively steadily.
- Observation: The model achieves optimal performance when the width is set to 96. Increasing the width beyond this point decreases performance, likely due to over-parameterization and reduced training stability under limited data conditions.
- Conclusion: This highlights the importance of selecting an appropriate width to balance model capacity and generalization. Overly large widths can introduce optimization difficulties and suboptimal generalization, emphasizing the need for careful tuning.
6.5. Hyper-Parameter Analysis (RQ4)
6.5.1. Pretrained Semantic Encoders
The paper analyzes the impact of different pretrained semantic encoders on recommendation performance on the Beauty dataset. Three models were compared: BERT [7], Sentence-T5 [41], and Llama2-7b [53].
As presented in Figure 10 from the original paper, Llama2-7b achieves the best performance.
Figure 10: Recall@20 (left) and NDCG@20 (right) for different pretrained semantic encoders (BERT, Llama2-7b, Sentence-T5) on the Beauty dataset.
- Observation: Llama2-7b significantly outperforms BERT and Sentence-T5. This is attributed to Llama2-7b's substantially larger and more diverse training corpus, enabling it to learn richer contextual and domain-specific semantics. Sentence-T5 (optimized for semantic similarity in short texts) performs less effectively, likely due to the complex, domain-specific, and nuanced nature of e-commerce product descriptions. BERT (general-purpose masked language modeling) also shows suboptimal performance, as its embeddings are less effective at distinguishing polysemous words and capturing long-tail attributes relevant to product-level distinctions.
- Modularity: UNGER's design is plug-and-play, allowing the semantic encoder to be replaced without altering other components. This makes it adaptable to future advancements in language models or multimodal encoders (e.g., CLIP-like models [30, 35, 46]) as long as the output dimensions are compatible (a minimal sketch of such an encoder swap follows).
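To illustrate the plug-and-play idea, the sketch below hides different Hugging Face encoders behind one interface that returns fixed-size item embeddings via mean pooling. The model names and the `encode_items` helper are illustrative assumptions, not the paper's actual pipeline; only the output dimension must match what the downstream quantizer expects.

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def encode_items(texts, model_name="bert-base-uncased", max_length=128):
    """Encode item descriptions into dense semantic embeddings (mean pooling).

    Swapping `model_name` changes the semantic encoder without touching the rest
    of the pipeline, as long as the output dimension is handled downstream.
    Notes: encoder-decoder checkpoints (e.g., Sentence-T5) should use their encoder
    component only, and some tokenizers (e.g., Llama-2's) need a pad token set first.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    hidden = model(**batch).last_hidden_state             # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, dim)

# Illustrative usage:
# item_embeddings = encode_items(["Hydrating face cream, 50 ml", "Matte lipstick"])
```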
6.5.2. Choice of Different Quantization Methods
The impact of different quantization strategies on model performance was evaluated on the Beauty dataset, comparing Random Assignment, RQVAE [28], and Hierarchical K-means.
As illustrated in Figure 11 from the original paper, the Hierarchical K-means approach significantly outperforms the other two methods.
Figure 11: Recall@20 and NDCG@20 for different quantization methods (Random, K-means, RQVAE) on the Beauty dataset; the K-means approach performs best on Recall@20.
- Observation: Hierarchical K-means performs best, highlighting the importance of preserving structural information during quantization: it encodes items in a way that reflects hierarchical proximity, improving the retrieval of relevant items. Random Assignment yields the poorest performance, as it provides no meaningful structure in the discretized space, limiting generalization. RQVAE, while theoretically capable, suffered from codebook collisions in practice; although mitigation strategies (such as appending unique identifiers) were used, they introduced noise and led to suboptimal performance.
- Conclusion: These findings underscore the advantages of hierarchical quantization methods in capturing and preserving the structural and semantic information of item embeddings for generative recommendation (see the sketch after this list).
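As a concrete reference, the sketch below shows one common form of hierarchical K-means quantization: cluster all item embeddings into K groups to obtain the first code token, then recursively re-cluster the items within each group for the next token, and so on for a fixed code length. This is an assumption-level illustration built on scikit-learn's KMeans, not the paper's exact tokenizer; in particular, collision handling for items that end up sharing a full code is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_codes(embeddings, k=256, levels=3, seed=0):
    """Assign each item a sequence of `levels` discrete codes via recursive K-means.

    embeddings: (num_items, dim) array of item embeddings.
    Returns an integer array of shape (num_items, levels); row i is item i's code prefix.
    """
    n = embeddings.shape[0]
    codes = np.zeros((n, levels), dtype=int)
    groups = [np.arange(n)]                       # start with one group holding every item
    for level in range(levels):
        next_groups = []
        for idx in groups:
            n_clusters = min(k, len(idx))         # a group may hold fewer than k items
            labels = KMeans(n_clusters=n_clusters, random_state=seed,
                            n_init=10).fit_predict(embeddings[idx])
            codes[idx, level] = labels
            for c in range(n_clusters):           # recurse into each sub-cluster
                next_groups.append(idx[labels == c])
        groups = [g for g in next_groups if len(g) > 0]
    return codes

# Toy usage: 1,000 random 64-d embeddings, 3-level codes with up to 8 clusters per level.
codes = hierarchical_kmeans_codes(np.random.default_rng(0).normal(size=(1000, 64)), k=8, levels=3)
```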
6.5.3. Number of Clusters
An investigation into how the number of clusters in the hierarchical K-means algorithm influences performance was conducted on the Beauty dataset, with the number of clusters varied from 128 to 512.
The results are summarized in Figure 12 from the original paper.
Figure 12: Recall@20 and NDCG@20 as the number of clusters varies from 128 to 512; Recall@20 peaks at 256 clusters (about 0.128), where NDCG@20 also performs best (about 0.065).
- Observation:
  - Increasing the number of clusters from 128 to 256 consistently improves both Recall@20 and NDCG@20, indicating enhanced representational capacity: a larger cluster count allows more discriminative codes for individual items.
  - Further increasing the number of clusters from 256 to 512 leads to NDCG@20 continuing to improve, suggesting better ranking quality for top items. However, Recall@20 slightly declines, hypothesized to be due to the expanded search space increasing decoding complexity and potentially yielding suboptimal retrieval results during inference.
- Conclusion: The number of clusters must be carefully chosen to balance representational richness and computational efficiency. A moderate value (e.g., 256), proportional to the overall number of items in the dataset, appears to be a practical heuristic.
6.5.4. Sensitivity of Hyper-parameter α
The sensitivity of the hyper-parameter α (which balances the next-item prediction and cross-modality knowledge alignment losses in Stage I) was investigated by varying it from 0.2 to 5.0.
As shown in Figure 13 from the original paper, both Recall@20 and NDCG@20 improve significantly when α increases from 0.2 to 1.0.
Figure 13: Sensitivity of hyper-parameter α; Recall@20 (left) and NDCG@20 (right) both increase as α grows.
- Observation: Performance peaks when α is set to 1.0 or 2.0, indicating that moderate emphasis on the alignment objective leads to optimal integration.
- Robustness: Performance remains stable across a broad range of α values (from 0.5 to 5.0), demonstrating UNGER's robustness to this hyper-parameter. This implies the model is not overly reliant on fine-tuning α and benefits from the alignment signal as long as it is sufficiently incorporated (a sketch of this weighted Stage I objective follows).
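The role of α can be made concrete with a minimal sketch of a Stage I-style objective that adds a weighted alignment term to the next-item prediction loss. The contrastive (InfoNCE-style) alignment used below is an assumption about how cross-modality alignment could be realized; the paper's exact loss may differ, and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def stage_one_loss(next_item_logits, target_items, sem_emb, col_emb,
                   alpha=1.0, temperature=0.07):
    """Next-item prediction loss plus an alpha-weighted cross-modality alignment term.

    next_item_logits: (B, num_items) scores for the next item; target_items: (B,) ids.
    sem_emb, col_emb: (B, D) semantic and collaborative embeddings of the same items,
    pulled together contrastively so matching rows become positives (InfoNCE-style sketch).
    """
    rec_loss = F.cross_entropy(next_item_logits, target_items)

    sem = F.normalize(sem_emb, dim=-1)
    col = F.normalize(col_emb, dim=-1)
    logits = sem @ col.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(sem.size(0), device=sem.device)  # positives on the diagonal
    align_loss = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

    return rec_loss + alpha * align_loss
```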
6.5.5. Sensitivity of the Stage II Hyper-parameter
The sensitivity of the Stage II hyper-parameter (which balances the generative recommendation and intra-modality knowledge distillation losses) was analyzed by varying it from 0.2 to 5.0.
As illustrated in Figure 14 from the original paper, both Recall@20 and NDCG@20 initially improve as the weight increases from 0.2 to 1.0.
Figure 14: Sensitivity analysis of the Stage II hyper-parameter; Recall@20 (left) and NDCG@20 (right) under different weight values, both evaluated at a cutoff of 20.
- Observation: Performance is best when the weight is set to 1.0, confirming the benefit of incorporating the distillation loss to enhance the learned representations and compensate for quantization loss.
- Robustness: The model remains stable across a relatively wide range of values (from 0.5 to 2.0), indicating robustness to this choice; even when the weight is increased to 5.0, performance does not drastically deteriorate (a sketch of this weighted Stage II objective follows).
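To make the Stage II weighting concrete, the sketch below adds a weighted intra-modality distillation term to the generative (code-sequence) loss, here realized as a simple MSE between a pre-quantization item embedding (teacher) and the hidden state at a dedicated special token (student). The MSE target is an illustrative assumption, not the paper's exact formulation, and all tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def stage_two_loss(code_logits, target_codes, special_token_hidden,
                   pre_quant_embedding, distill_weight=1.0):
    """Generative recommendation loss plus a weighted intra-modality distillation term.

    code_logits: (B, L, vocab) logits over unified-code tokens; target_codes: (B, L) ids.
    special_token_hidden: (B, D) hidden state at a dedicated distillation token (student).
    pre_quant_embedding: (B, D) continuous item embedding before quantization (teacher).
    """
    gen_loss = F.cross_entropy(code_logits.reshape(-1, code_logits.size(-1)),
                               target_codes.reshape(-1))
    distill_loss = F.mse_loss(special_token_hidden, pre_quant_embedding.detach())
    return gen_loss + distill_weight * distill_loss
```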
6.6. Case Study
To demonstrate UNGER's interpretability and effectiveness, a case study on personalized video game recommendation is presented. This highlights how UNGER uses autoregressive decoding over discrete unified codes to infer user intent and generate recommendations.
As illustrated in Figure 15 from the original paper, the case study demonstrates the process from user history to a specific recommendation.
Figure 15: Illustration of UNGER in a video-game recommendation scenario, showing the relationship between the user's interaction history and the recommendation; the predicted item Unicode is [66, 100, 168, 99], corresponding to the recommended game Red Dead Redemption.
6.6.1. User Historical Interaction Profiling
- Example User History: Black Myth: Wukong, Cyberpunk 2077, Assassin's Creed Shadows, Call of Duty: Black Ops.
- Unified Features: UNGER extracts unified codes that reflect consistent user affinities across dimensions such as Gameplay Type and Genre (action-oriented), Audience Targeting (mature content), and Platform Engagement (PlayStation-compatible titles). Unlike static attributes, these Unicodes dynamically capture item-level and user-dependent information.
6.6.2. Autoregressive Unified Code Decoding
UNGER decodes the next most probable unified code sequence, e.g., [66, 100, 168, 99].
- Code 66 (Domain: Video Games): Anchors the generation within the gaming domain.
- Code 100 (Mature Content Affinity): Infers a preference for mature content from prior choices.
- Code 168 (Action-Oriented Gameplay): Captures interest in fast-paced action.
- Code 99 (Platform: PlayStation): Implies consistent engagement with PlayStation titles. These codes, though not direct human labels, align with high-level concepts, providing interpretable projections and guiding the generation process.
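This step-by-step narrowing corresponds to autoregressive decoding over discrete codes, typically constrained so that only prefixes of real item codes can be generated. The sketch below is a minimal greedy version using a prefix map built from the item-code table; the `score_next_code` function standing in for the trained decoder is a hypothetical placeholder, not UNGER's actual inference routine.

```python
from collections import defaultdict

def build_prefix_map(item_codes):
    """Map each code prefix to the set of code tokens that can legally follow it."""
    prefix_map = defaultdict(set)
    for codes in item_codes.values():
        for i in range(len(codes)):
            prefix_map[tuple(codes[:i])].add(codes[i])
    return prefix_map

def greedy_decode(score_next_code, prefix_map, code_length=4):
    """Greedily generate one unified code sequence, restricted to valid item prefixes.

    score_next_code(prefix) -> dict {token: score}, standing in for the trained model.
    """
    prefix = ()
    for _ in range(code_length):
        allowed = prefix_map[prefix]                       # legal continuations only
        scores = score_next_code(prefix)
        prefix += (max(allowed, key=lambda t: scores.get(t, float("-inf"))),)
    return list(prefix)

# Toy usage: two items; a dummy scorer that happens to prefer Red Dead Redemption's codes.
item_codes = {"Red Dead Redemption": [66, 100, 168, 99], "Stardew Valley": [12, 45, 7, 3]}
prefix_map = build_prefix_map(item_codes)
scorer = lambda prefix: {66: 0.9, 100: 0.8, 168: 0.7, 99: 0.6, 12: 0.1, 45: 0.1, 7: 0.1, 3: 0.1}
print(greedy_decode(scorer, prefix_map))                   # [66, 100, 168, 99]
```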
6.6.3. Recommendation Interpretation: Red Dead Redemption
The generated unified code sequence maps to Red Dead Redemption.
- Alignment: The recommended game aligns perfectly with the inferred codes: Video Game Domain, Mature Content, Action-Adventure Gameplay, and PlayStation Platform.
- Interpretability: This demonstrates that UNGER's learned representations, while discrete, align well with interpretable item features, enabling natural post-hoc explanations.
6.6.4. Discussion and Insights
- Bridging Behavior and Reasoning: The case study highlights UNGER's ability to bridge implicit user behavior and explicit recommendation reasoning through discrete code generation.
- Incremental High-Fidelity Representation: Unlike traditional systems that rely on uninterpretable latent embeddings, UNGER builds high-fidelity representations of user intent incrementally.
- Transparent and Controllable Generation: The sequential decoding procedure refines the search space across multiple dimensions (domain, content style, genre, platform), mimicking a step-by-step decision-making process. The discrete nature of Unicodes provides a transparent layer between user history and item selection, validating UNGER's design objectives: transparency, personalization, and controllable generation.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces UNGER, a novel and effective framework for generative recommendation that addresses critical limitations of existing approaches. Its core innovation lies in leveraging a unified code (Unicodes) to cohesively integrate both semantic and collaborative knowledge. UNGER achieves this through a two-stage process: first, a unified code generation framework that employs a learnable modality adaptation layer with AdaLN and a joint optimization of cross-modality knowledge alignment and next-item prediction tasks to adaptively fuse embeddings and resolve semantic dominance. Second, a generative recommendation phase augmented with an intra-modality knowledge distillation task using a special token, which effectively compensates for information loss introduced during quantization. Extensive experiments on three Amazon datasets demonstrate UNGER's consistent superiority over state-of-the-art classical and generative methods, while also exhibiting desirable scaling law characteristics with respect to model depth, width, and data volume. The case study further highlights its interpretability and ability to generate highly personalized and explainable recommendations.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Enriching Semantic Modality:
  - Limitation: Current semantic embeddings are primarily textual.
  - Future Work: Extend UNGER by incorporating visual embeddings (from item images/video frames using pre-trained deep neural networks [15, 46] and Vision Transformers [10]) and audio/speech representations (for music/podcasts). This would capture stylistic, aesthetic, and affective aspects of user preferences not conveyed by text alone.
- Improving the Tokenizer:
  - Limitation: The current quantization process (Hierarchical K-means) could be further optimized.
  - Future Work: Adopt more advanced vector quantization techniques from other fields, such as LFQ [64], IBQ [50], and ActionPiece [19]. These techniques promise better convergence properties, higher codebook utilization, and more expressive latent representations, which are expected to enhance generation quality and model robustness.
- Evaluation on Denser Datasets:
  - Limitation: The current evaluation is restricted to relatively sparse subsets of the Amazon dataset, consistent with prior work, so UNGER's behavior in denser recommendation scenarios has not been systematically explored.
  - Future Work: Conduct evaluations on denser recommendation scenarios to establish the broader applicability and generalizability of the proposed framework.
7.3. Personal Insights & Critique
UNGER presents a highly compelling and well-motivated advancement in generative recommendation. The core idea of a unified code is elegant and addresses a practical pain point (cost and complexity) of dual-code systems. The explicit tackling of semantic dominance is particularly insightful, as naive fusion of modalities is a common pitfall in multimodal learning. The quantitative evidence from Table 6 and Table 7 strongly supports the authors' claims regarding this issue and UNGER's success in overcoming it.
- Applicability & Transferability: The unified code concept, coupled with the plug-and-play nature of the semantic encoder, makes UNGER highly adaptable. Its methodology could be transferred to domains beyond e-commerce, such as news recommendation (integrating article text with user click history) or even scientific paper recommendation (integrating abstract/full text with citation networks). The framework's modularity suggests that as multimodal foundation models improve, UNGER could seamlessly incorporate richer semantic embeddings (e.g., combining text, image, and video for movie recommendations).
- Potential Issues/Areas for Improvement:
  - Codebook Design & Granularity: While Hierarchical K-means is effective, the trade-off between representational richness and decoding complexity (as seen in the number-of-clusters analysis) suggests that optimal codebook design remains crucial. Further research into adaptive codebook sizes or more dynamically learned hierarchical structures could be beneficial.
  - Computational Cost of Stage I: Although Stage II is efficient, Stage I training involves multiple components (DIN, Llama2-7b, CKA, seq loss). While the paper indicates less than 1 GPU hour for total training, a breakdown of Stage I's computational cost would help confirm it is not a bottleneck in scenarios with rapidly evolving item sets that require frequent Unicode regeneration.
  - Cold-Start Performance: While UNGER implicitly helps cold-start items by leveraging their semantic information through Unicodes, a dedicated cold-start evaluation might further highlight this benefit, especially for new items with no interaction history.
  - Explainability beyond the Case Study: The case study provides excellent interpretability, attributing codes to high-level concepts. Further exploration into a more systematic, perhaps quantitative, measure of Unicode interpretability could be valuable; for instance, can Unicodes directly correspond to human-defined tags with high fidelity?
- Inspiration: The paper provides a strong blueprint for designing efficient and effective multimodal generative models. It emphasizes that true integration goes beyond mere concatenation, requiring adaptive alignment and thoughtful compensation for intrinsic architectural limitations like quantization loss. The scaling law study is also inspiring, pointing towards a future where larger models and datasets could unlock even greater performance in recommendation, aligning the field with trends seen in NLP and computer vision.