Paper status: completed

UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration

Published: 2025-10-28
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces UNGER, a generative recommendation approach that integrates semantic and collaborative information into a unified code to reduce storage and inference costs. Using a two-phase framework for effective code construction, it demonstrates significant improvements over existing generative and traditional methods on three public benchmarks.



In-depth Reading


1. Bibliographic Information

1.1. Title

The title of the paper is "UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration", which states its central topic.

1.2. Authors

The authors of the paper are:

  • LONGTAO XIAO (School of Computer Science and Technology, Huazhong University of Science and Technology, China)

  • HAOZHAO WANG (School of Computer Science and Technology, Huazhong University of Science and Technology, China)

  • CHENG WANG (Huawei Technologies Ltd, China)

  • LINFEI JI (Huazhong University of Science and Technology, China)

  • YIFAN WANG (Huazhong University of Science and Technology, China)

  • JIEMING ZHU (Huawei Noah's Ark Lab, China)

  • ZHENHUA DONG (Huawei Noah's Ark Lab, China)

  • RUI ZHANG (School of Computer Science and Technology, Huazhong University of Science and Technology, China, www.ruizhang.info)

  • RUIXUAN LI (School of Computer Science and Technology, Huazhong University of Science and Technology, China)

    The authors' research backgrounds appear to be in computer science and technology, with affiliations at a prominent Chinese university (Huazhong University of Science and Technology) and an industry research lab (Huawei Technologies Ltd / Huawei Noah's Ark Lab), indicating a blend of academic research and practical industry application.

1.3. Journal/Conference

The publication venue is not explicitly stated in the provided text, but the format and content suggest it is a research paper submitted to a conference or journal in the field of information systems or recommender systems.

1.4. Publication Year

The paper was published on 2025-10-28 (UTC), i.e., in October 2025.

1.5. Abstract

This paper introduces UNGER, a novel framework for generative recommendation that utilizes a unified code to integrate both semantic and collaborative knowledge. Traditional generative recommender systems often employ separate codes for different modalities (e.g., semantic and collaborative), leading to increased storage and inference costs, as well as a failure to fully exploit the complementary strengths of these knowledge types due to inherent misalignment. UNGER addresses these challenges by proposing a unified code (referred to as Unicodes) that reduces storage and inference time significantly. The framework integrates knowledge adaptively through a learnable modality adaptation layer and a joint optimization task that combines cross-modality knowledge alignment with next-item prediction. To counteract information loss during the necessary quantization process, UNGER also incorporates an intra-modality knowledge distillation task. Extensive experiments on three public recommendation benchmarks demonstrate UNGER's superior performance compared to existing generative and traditional methods, while also exhibiting scaling law characteristics.

The original source link is /files/papers/692c3e981db011de57153258/paper.pdf, which points to a PDF copy of the paper, likely a preprint hosted alongside this analysis.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve revolves around the limitations of existing recommendation systems, particularly in the context of generative recommendation.

  • Information Overload & Traditional RecSys Limitations: Modern society is plagued by information overload, making recommendation systems crucial. Traditional recommendation systems typically rely on embedding-based approaches where users and items are represented by dense vectors, and recommendations are made by finding nearest neighbors using dot-product or cosine similarity. This often necessitates Approximate Nearest Neighbor (ANN) search indexes (e.g., Faiss, SCANN) for efficient retrieval from large candidate pools. However, these ANN indexes are independent of the model's optimization process, which can limit overall effectiveness and introduce computational overhead.

  • Emergence of Generative Recommendation: Generative recommendation has emerged as a promising direction. Instead of matching embeddings, it frames the recommendation task as autoregressive code sequence generation. Items are first encoded into discrete codes (sequences of tokens), and then a model predicts the next item's code based on user history. This paradigm has the potential for more efficient decoding without ANN indexes.

  • Challenges in Existing Generative Recommendation: The key challenge UNGER targets in existing generative recommendation lies in how different modalities of knowledge (e.g., collaborative and semantic) are handled:

    • Separate Codes: Current methods often construct independent codes for different modalities. For instance, Recforest uses collaborative codes, TIGER uses semantic codes, and EAGER uses two separate sets of codes for both.

    • Increased Costs: A dual-code framework significantly increases storage and inference costs, making large-scale deployment impractical. Figure 3 demonstrates that two separate codes are 2.8x slower for inference compared to a unified code.

    • Intrinsic Misalignment & Underutilization: Treating semantic and collaborative knowledge as independent entities limits the full exploitation of their complementary strengths. Direct concatenation of features, a common practice, suffers from a semantic dominance issue, where semantic features, often richer and pre-trained on vast textual data, overwhelm collaborative signals (Figure 4). This can even degrade performance below using semantic knowledge alone. The root cause is the inherent representational misalignment and signal strength variations between modalities.

      The paper's innovative idea is to integrate both collaborative and semantic knowledge into a single unified code (Unicodes) to overcome these challenges, thereby improving efficiency, reducing costs, and enhancing recommendation effectiveness by truly harnessing the synergistic potential of both modalities while addressing the semantic dominance problem.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of generative recommendation:

  • Unified Code for Generative Recommendation:

    • Contribution: Introduction of UNGER (Unified Generative Recommendation), which leverages Unicodes—a novel unified code that integrates both collaborative and semantic knowledge for generative recommendation.
    • Problem Solved: This addresses the practical deployment challenges of dual-code systems by reducing storage space by half and achieving significantly faster inference compared to setups with two separate codes. It enables a more compact and efficient representation of items.
  • Adaptive Knowledge Integration to Resolve Semantic Domination:

    • Contribution: Proposal of a learnable modality adaptation layer and a joint optimization framework that combines cross-modality knowledge alignment (CKA) with next item prediction tasks.
    • Problem Solved: This approach adaptively learns an integrated embedding, effectively resolving the semantic domination issue observed when simply concatenating features. It ensures a balanced contribution from both collaborative and semantic modalities, allowing their complementary strengths to be fully exploited.
  • Intra-modality Knowledge Distillation for Comprehensive Learning:

    • Contribution: Introduction of an intra-modality knowledge distillation (IKD) task, utilizing a specially designed token.
    • Problem Solved: This task compensates for the potential information loss inherent in the quantization process (mapping continuous embeddings to discrete codes). By distilling high-level modality-specific knowledge, it ensures comprehensive and sufficient learning, further improving autoregressive generation quality.
  • Empirical Validation and Scaling Law Characteristics:

    • Contribution: Extensive experiments on three public recommendation benchmarks.
    • Problem Solved: The experiments demonstrate UNGER's significant superiority over existing generative and traditional recommendation methods. Additionally, the study confirms that UNGER exhibits desirable scaling law characteristics with respect to model depth, width, and data volume, indicating its potential for performance gains with increased resources.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the UNGER framework, a foundational understanding of several key concepts in recommender systems, deep learning, and natural language processing is crucial for a beginner.

  • Recommendation Systems (RecSys): At its core, a recommender system aims to predict user preferences and suggest items (e.g., movies, products, music) that a user is likely to be interested in. It combats information overload by filtering vast amounts of available content.

    • Collaborative Filtering (CF): A widely used technique that makes recommendations based on the past behavior and preferences of similar users or items. If user A likes items X and Y, and user B likes X, then user B might also like Y. This captures behavioral patterns.
    • Content-Based Recommendation: Recommends items similar to those a user has liked in the past, based on item attributes or content. For example, if a user likes sci-fi movies, they will be recommended other sci-fi movies. This uses semantic information.
    • Hybrid Approaches: Combine collaborative and content-based methods to leverage the strengths of both and mitigate their individual weaknesses (e.g., cold-start problem for new users/items in CF).
  • Embeddings: In machine learning, an embedding is a dense vector representation of discrete entities (like users, items, words) in a continuous vector space. Items with similar properties or relationships are mapped closer to each other in this space. For example, the embedding of "apple" might be close to the embedding of "pear" but far from "car".

    • Item ID Embeddings: Traditional recommendation systems often assign a unique ID to each item and learn a corresponding embedding vector for it. These are ID-based signals or collaborative signals derived purely from interaction patterns.
    • Semantic Embeddings: Derived from rich textual descriptions (e.g., item titles, descriptions, reviews) using language models. These embeddings capture the semantic meaning and contextual information of an item.
  • Deep Learning Architectures:

    • Recurrent Neural Networks (RNNs) / Gated Recurrent Units (GRUs): Neural networks designed to process sequential data. GRUs are a type of RNN that can capture dependencies in sequences, useful for modeling user interaction history (e.g., GRU4Rec).
    • Convolutional Neural Networks (CNNs): Neural networks primarily used for image processing, but also adaptable for sequence modeling (e.g., Caser) by applying convolutional filters to learn local patterns.
    • Transformer Models: A powerful neural network architecture, particularly dominant in Natural Language Processing (NLP), known for its self-attention mechanism.
      • Self-Attention: A mechanism that allows a model to weigh the importance of different parts of the input sequence when processing each element. It computes attention scores between query, key, and value vectors (a minimal code sketch follows at the end of this list).
      • Multi-head Self-Attention: Extends self-attention by running multiple attention mechanisms in parallel, allowing the model to focus on different aspects of the sequence simultaneously.
      • Encoder-Decoder Architecture: A common Transformer setup where an encoder processes the input sequence to create a representation, and a decoder uses this representation to generate an output sequence.
      • Autoregressive Generation: A process where each element in an output sequence is generated one at a time, conditioned on the previously generated elements. This is fundamental to generative recommendation.
  • Embedding Quantization: The process of mapping continuous embeddings (dense real-valued vectors) into discrete codes (sequences of integer tokens). This is crucial for generative recommendation as it transforms item representation into a format suitable for autoregressive generation by a Transformer decoder, which typically generates sequences of discrete tokens.

    • Vector Quantization (VQ): A general technique for mapping vectors from a continuous space to a finite set of codebook vectors.
    • Residual Vector Quantization (RVQ) / Hierarchical K-means: Advanced VQ techniques that quantize residuals (the difference between the original vector and its approximation) iteratively at multiple layers to preserve more information. Hierarchical K-means is a type of RVQ that uses K-means clustering in a hierarchical manner.
  • Contrastive Learning: A self-supervised learning paradigm where the model learns representations by pulling positive pairs (similar examples) closer together in the embedding space and pushing negative pairs (dissimilar examples) further apart.

    • Info-NCE Loss: A popular contrastive loss function often used in self-supervised learning to maximize the mutual information between different views of the same data point.
  • Knowledge Distillation: A technique where a smaller "student" model learns from a larger, more complex "teacher" model. In UNGER, it's used to compensate for information loss during quantization by guiding the model with original embeddings.
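
To make the self-attention mechanism described above concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy shapes and random inputs are illustrative assumptions, not details from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of value vectors

# Toy example: a sequence of 5 item embeddings of dimension 8 attending to itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(X, X, X)               # self-attention: Q = K = V = X
print(out.shape)                                          # (5, 8)
```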

3.2. Previous Works

The paper discusses related work across pre-training in recommender systems, sequential recommendation, and generative approaches.

3.2.1. Pre-training in Recommender Systems

Inspired by NLP breakthroughs, pre-training addresses data sparsity and cold-start problems in recommendation.

  • ID-based Pre-training:
    • PeterRec [65]: Learns universal user representations from large-scale interaction data, transferring knowledge via parameter-efficient fine-tuning.
    • Conure [66]: A lifelong learning framework that incrementally updates user embeddings without catastrophic forgetting.
    • CLUE [4]: Improves representation quality through contrastive views, similar to self-supervised learning.
    • These ID-based methods often struggle with transferability to new domains due to reliance on discrete identifiers.
  • Semantic-enhanced Pre-training:
    • ZESRec [9]: Replaces item IDs with textual metadata to enable zero-shot generalization across disjoint domains.
    • UniSRec [18] and MISSRec [56]: Leverage multimodal features (text, images) to build transferable user/item representations.
    • P5 [12]: Unifies various recommendation tasks into a text-to-text framework using pre-trained language models (PLMs), transforming all input/output into natural language. However, it still relies on smaller PLMs.

3.2.2. Sequential Recommendation

Focuses on modeling user behavior as a chronologically ordered sequence of interactions to predict the next item.

  • Traditional Approaches:
    • Markov Chains (MCs) [16, 48]: Early methods capturing item transition probabilities.
    • GRU4Rec [21]: Pioneered GRU-based RNNs for sequential recommendation. The GRU (Gated Recurrent Unit) is a type of RNN that uses gates to control the flow of information, mitigating the vanishing gradient problem. It processes sequences by updating a hidden state $h_t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$, $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$, $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate hidden state, $\odot$ is the element-wise product, and $\sigma$ is the sigmoid function (a minimal code sketch follows this list).
    • SASRec [24]: Introduced self-attention (similar to a decoder-only Transformer) to capture long-range dependencies in item sequences. The core attention mechanism is: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$, $K$, $V$ are query, key, and value matrices, and $d_k$ is the dimension of the key vectors. SASRec applies this self-attention to the sequence of item embeddings.
    • BERT4Rec [51]: Utilized Transformers with masking strategies (like masked language modeling in BERT) for sequential recommendation, allowing bidirectional context.
    • HGN [39]: Adopts hierarchical gating networks to capture both long-term and short-term user interests.
    • FDSA [69]: Leverages self-attention networks at both item-level and feature-level to model sequence dynamics.
    • S^3-Rec [73]: Pre-trains a bidirectional Transformer using mutual information maximization (MIM) via self-supervised tasks to learn correlations among attributes, subsequences, and sequences.
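
As a concrete companion to the GRU update equations quoted under GRU4Rec above, the following is a minimal NumPy sketch of a single GRU step applied over a toy interaction sequence; the weight shapes and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update following the z_t, r_t, h~_t, h_t equations quoted above."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde              # new hidden state

# Toy dimensions: input dim 4, hidden dim 3, and a length-5 interaction sequence.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, params)
print(h.shape)                                               # (3,)
```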

3.2.3. Generative Approaches

These predict item identifiers directly, moving beyond embedding-based matching and ANN search.

  • LLM-based Methods:
    • LC-Rec [71]: Uses code-based vector quantization for semantic item indexing and fine-tuning to align collaborative signals with LLM representations.
    • CCF-LLM [38]: Transforms user-item interactions into hybrid prompts (encoding both semantic and collaborative knowledge) and uses an attentive cross-modal fusion strategy.
    • SC-Rec [26]: Employs multiple item indices, prompt templates, and a self-consistent re-ranking mechanism to merge collaborative and semantic knowledge.
  • From-Scratch Methods (Custom-designed Models):
    • Tree-based methods (RecForest [11], [74, 75]): Construct multiple trees and integrate Transformer-based structures for routing, enhancing accuracy and memory efficiency.
    • TIGER [47]: Introduced semantic IDs (item tokens derived from descriptions) and predicts next item tokens in a sequence-to-sequence manner, using RQ-VAE quantization.
    • ColaRec [61]: Integrates user-item interactions and content data within an end-to-end framework, leveraging pretrained collaborative identifiers, an item indexing task, and contrastive loss to align semantic and collaborative spaces.
    • EAGER [62]: Employs a dual-stream generative framework with shared encoding but separate decoding for semantic and behavioral information, using two separate codes. It then fuses results based on confidence scores.

3.3. Technological Evolution

The evolution of recommendation systems has moved from simple content-based or collaborative filtering methods to complex deep learning models.

  1. Early Stages (Pre-Deep Learning): Matrix Factorization, Markov Chains, basic content-based filtering.

  2. ID-based Deep Learning: RNNs (e.g., GRU4Rec), CNNs (e.g., Caser), and later Transformers (e.g., SASRec, BERT4Rec) applied to sequences of item IDs. These largely rely on implicit collaborative signals.

  3. Pre-training for RecSys: Borrowing from NLP, pre-training strategies emerged to tackle cold-start and sparsity. This includes ID-based pre-training and, more recently, incorporating side information (text, images) to build semantic-rich representations (ZESRec, UniSRec, P5).

  4. Generative Recommendation (Code-based): A paradigm shift from retrieval-based to generation-based. Items are converted into discrete codes, and the model autoregressively predicts the next item's code. Early methods focused on single modalities (RecForest for collaborative, TIGER for semantic).

  5. Multi-modal Generative Recommendation (Separate Codes): Recognizing the value of combining modalities, methods like EAGER and SC-Rec started integrating both semantic and collaborative knowledge, but often using separate codes for each modality.

    UNGER fits into the latest stage of this evolution, pushing multi-modal generative recommendation forward.

3.4. Differentiation Analysis

Compared to the main methods in related work, UNGER introduces key innovations:

  • Unified Code vs. Separate Codes:

    • Related Work (EAGER, SC-Rec): These models integrate multiple modalities (semantic, collaborative) but typically maintain separate codes for each, leading to $O(Kn)$ computational cost and $O(Km)$ storage cost (where $K$ is the number of modalities). This results in higher inference latency and storage requirements.
    • UNGER: Proposes a single unified code (Unicodes) that cohesively encodes both semantic and collaborative knowledge. This reduces computational cost to $O(n)$ and storage cost to $O(m)$, making it significantly more efficient and scalable for large-scale deployments.
  • Addressing Semantic Domination:

    • Related Work (Concatenation): A common baseline for integrating modalities is direct feature concatenation. However, UNGER highlights that this approach suffers from semantic dominance, where the richer semantic embeddings (e.g., from LLMs) disproportionately influence the final representation, marginalizing collaborative signals. Figure 4 clearly illustrates this with 97.33% semantic similarity vs. 2.67% collaborative similarity.
    • UNGER: Explicitly tackles semantic dominance through a learnable modality adaptation layer with AdaLN and a cross-modality knowledge alignment (CKA) task. This ensures a balanced and adaptive fusion of knowledge, leading to a more effective combined representation (e.g., 59.89% semantic, 40.11% collaborative similarity in Table 6).
  • Compensating Quantization Loss:

    • Related Work: While quantization is necessary for generative models, it inherently introduces approximation errors and information loss. Previous methods might not explicitly address this or rely on the quantized codes being sufficient.

    • UNGER: Introduces an intra-modality knowledge distillation (IKD) task. This auxiliary objective leverages the original (pre-quantization) embeddings and a special [c_dis] token to provide more complete guidance, thereby compensating for information loss during quantization and improving the quality of autoregressive generation.

      In essence, UNGER's core innovation lies in its holistic approach to multi-modal generative recommendation: not just combining modalities, but doing so efficiently with a unified code, harmoniously by resolving semantic dominance, and robustly by mitigating quantization loss.

4. Methodology

4.1. Principles

The core idea behind UNGER is to integrate semantic and collaborative knowledge into a single unified code (called Unicodes) for generative recommendation. This unification aims to enhance efficiency, reduce costs, and improve recommendation quality by leveraging the complementary strengths of both modalities while explicitly addressing the semantic dominance issue and mitigating information loss during the quantization process. The overall approach is structured as a two-stage framework: first, generating robust item Unicodes, and second, using these Unicodes for autoregressive generative recommendation.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

Given an item corpus $I$ and a user's historical interaction sequence $U = [u_1, u_2, \ldots, u_{t-1}]$, where each $u \in I$, the objective of a sequential recommendation system is to predict the next most likely item $u_t \in I$ that the user may interact with.

In the generative framework, each item $u$ is represented by a sequence of codes $C = [c_1, c_2, \ldots, c_L]$, where $L$ denotes the length of the code sequence. The sequential recommendation task thus transforms into predicting the codes $C_t$ of the next item $u_t$ based on the user's historical interaction sequence $U$. During training, the model first encodes $U$ and then autoregressively generates the codes $C_t$ of the target item $u_t$ step by step at the decoder. The decoding process is defined by the following formula:

$p(C_t \mid U) = \prod_{i=1}^{L} p(c_i \mid U, c_1, c_2, \ldots, c_{i-1})$

Where:

  • $p(C_t \mid U)$ is the probability of generating the entire code sequence $C_t$ for the next item, given the user's interaction history $U$.

  • $L$ is the length of the code sequence for an item.

  • $p(c_i \mid U, c_1, c_2, \ldots, c_{i-1})$ is the probability of generating the $i$-th code $c_i$, conditioned on the user history $U$ and all previously generated codes $c_1, \ldots, c_{i-1}$ of the current item.

    During the inference phase, the decoder performs beam search to autoregressively generate the codes of the top-k items.
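
The factorization above can be sketched directly in code. In the toy example below, `step_distribution` stands in for the Transformer decoder's per-step softmax over the code vocabulary; it is a hypothetical placeholder, not part of the paper.

```python
import numpy as np

def sequence_log_prob(codes, step_distribution, user_history):
    """log p(C_t | U) = sum_i log p(c_i | U, c_1..c_{i-1}) for one candidate code sequence."""
    prefix, total = [], 0.0
    for c in codes:
        probs = step_distribution(user_history, prefix)   # softmax over the code vocabulary
        total += np.log(probs[c])
        prefix.append(c)
    return total

# Toy placeholder distribution: uniform over a vocabulary of 256 codes per level.
uniform = lambda history, prefix: np.full(256, 1.0 / 256)
print(sequence_log_prob([3, 17, 200, 42], uniform, user_history=None))  # 4 * log(1/256)
```

During inference, beam search keeps the highest-scoring partial code sequences at every step instead of scoring a single candidate as in this sketch.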

4.2.2. Overall Pipeline

UNGER operates in two main stages, as illustrated in Figure 5 from the original paper.

Figure 5 (framework overview): the diagram shows the two-stage framework. Stage I (Item Unicodes Generation) combines a pre-trained semantic encoder and a collaborative model with a cross-modality knowledge alignment task; Stage II (Generative Recommendation) encodes the user history with a Transformer encoder and applies contrastive learning and knowledge distillation. Arrows indicate the data flow, and the relationships between tasks and modules are marked with boxes and connecting lines.

  • Stage I: Item Unicodes Generation: The goal is to create item Unicodes that encapsulate both collaborative and semantic knowledge into a single unified codebook. This stage involves:
    1. Extracting modality-specific embeddings using pre-trained models (e.g., DIN for collaborative, Llama2-7b for semantic).
    2. Fusing these embeddings through a modality-adaptive fusion module.
    3. Jointly optimizing this fusion using a cross-modality knowledge alignment (CKA) task and a next item prediction task to resolve semantic dominance.
    4. Quantizing the integrated embeddings into discrete Unicodes using hierarchical K-means clustering.
  • Stage II: Generative Recommendation: This stage focuses on the autoregressive generation of Unicodes for recommendation. It comprises:
    1. A Transformer-based encoder to capture user interests from their interaction history.
    2. A Transformer-based decoder to predict the Unicode sequence of the next item.
    3. An intra-modality knowledge distillation (IKD) task to compensate for information loss during quantization.
    4. After training, beam search is used to generate top-k recommended Unicodes, which are then mapped back to items.

4.2.3. Stage I: Item Unicodes Generation

This stage aims to construct item Unicodes that encode both collaborative and semantic knowledge within a single unified codebook.

  • Initial Embedding Extraction:

    • Given a user's historical interaction sequence, denoted by the ID sequence $X = [x_1, x_2, \ldots, x_{t-1}]$, and the corresponding item semantic information (e.g., titles) $S$.
    • A randomly initialized sequential recommendation model (e.g., DIN [72]) encodes $X$ into collaborative embeddings $E_C$.
    • A pre-trained semantic encoder (e.g., Llama2-7b [53]) encodes $S$ into semantic embeddings $E_S$.
  • Modality Adaptation Layer (MAL): To bridge the modality gap between the two embeddings and align them into a common space, UNGER proposes a modality adaptation layer (MAL) that maps the semantic embeddings $E_S$ into the same embedding space as the collaborative embeddings. This layer utilizes AdaLN (Adaptive Layer Normalization) [43], which introduces learnable affine parameters conditioned on the input itself, allowing dynamic adjustment of normalization based on modality-specific properties. The mapping process is defined as: $E_T = \mathrm{MAL}(E_S) = \mathrm{AdaLN}(W E_S + b)$ Where:

    • $E_S$ represents the original semantic embeddings.
    • $E_T$ denotes the transformed semantic embeddings after adaptation, now aligned with the collaborative embedding space.
    • MAL is the modality adaptation layer.
    • $W$ and $b$ are learnable parameters (weight matrix and bias vector, respectively) that transform the semantic embeddings.
    • AdaLN is the Adaptive Layer Normalization function, which dynamically adjusts the normalization based on the input features, helping preserve and align modality-specific signals from content-driven semantics and behavior-driven preferences.
  • Cross-modality Knowledge Alignment Task (CKA): To effectively integrate semantic and collaborative knowledge and counteract semantic dominance, a cross-modality knowledge alignment task is introduced. For a given item $i$, its collaborative embedding $E_{C_i}$ is pulled closer to its transformed semantic embedding $E_{T_i}$, while being pushed away from the transformed semantic embeddings $E_{T_j}$ of other items $j$ (negative samples). This encourages the learned integrated embedding for item $i$ to encapsulate knowledge from both modalities. The Info-NCE loss [3] is adopted for this task: $\mathcal{L}_{\mathrm{align}} = -\sum_{i \in I} \log \frac{\exp(\mathrm{sim}(E_{C_i}, E_{T_i}) / \tau)}{\sum_{j \in \text{in-batch negatives}} \exp(\mathrm{sim}(E_{C_i}, E_{T_j}) / \tau)}$ Where:

    • $I$ is the set of items in the current batch.
    • $E_{C_i}$ is the collaborative embedding of item $i$.
    • $E_{T_i}$ is the transformed semantic embedding of item $i$ (from the MAL).
    • $E_{T_j}$ is the transformed semantic embedding of a negative sample item $j$ from the same batch.
    • $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., dot product or cosine similarity).
    • $\tau$ is a temperature parameter that controls the smoothness of the similarity distribution.
  • Next Item Prediction Task: In addition to CKA, the model also optimizes a standard next item prediction task using the collaborative representations. This task takes the user's historical interaction sequence $[x_1, x_2, \ldots, x_{t-1}]$ as input, learns a representation of user preferences, and computes matching scores with candidate items. The loss function is a standard cross-entropy loss for sequential prediction: $\mathcal{L}_{\mathrm{seq}} = -\sum_{t=2}^{L} \log p(x_t \mid x_1, x_2, \ldots, x_{t-1})$ Where:

    • $L$ is the length of the sequence.
    • $p(x_t \mid x_1, x_2, \ldots, x_{t-1})$ is the probability of predicting item $x_t$ given the preceding items in the sequence.
  • Joint Optimization for Stage I: The total loss function for the first stage (Stage I) combines the next item prediction loss and the cross-modality knowledge alignment loss: $\mathcal{L}_{\mathrm{StageI}} = \mathcal{L}_{\mathrm{seq}} + \alpha \mathcal{L}_{\mathrm{align}}$ Where:

    • $\mathcal{L}_{\mathrm{seq}}$ is the next item prediction loss.
    • $\mathcal{L}_{\mathrm{align}}$ is the cross-modality knowledge alignment loss.
    • $\alpha$ is a tunable hyperparameter that adjusts the relative importance of the alignment loss.
  • Unicodes Generation (Hierarchical K-means): After Stage I training, the integrated embeddings (the transformed semantic embeddings $E_T$ fused with the collaborative embeddings $E_C$, implicitly via the CKA loss aligning $E_T$ to $E_C$ in the target space) are used to generate Unicodes via hierarchical K-means clustering. This method efficiently encodes high-dimensional item embeddings into discrete hierarchical codes, preserving as much information as possible. The core relationship at each layer is: $\mathbf{r}_i^l = \mathbf{r}_i^{l-1} - \mathbf{C}_i^l$ Where:

    • $\mathbf{r}_i^0 = \mathbf{v}_i$ denotes the original integrated embedding of item $i$.

    • $\mathbf{r}_i^l$ is the residual vector for item $i$ at layer $l$.

    • $\mathbf{C}_i^l$ represents the centroid assigned to item $i$ at the $l$-th layer of quantization.

      The Hierarchical K-means Clustering algorithm (Algorithm 1 in the paper) proceeds as follows:

Algorithm 1 Hierarchical K-means Clustering

Input: Item embeddings V = {v_1, v_2, ...}, number of clusters K, hierarchy depth L
Output: Unicode sequences c_i = [c_i^1, c_i^2, ..., c_i^L] for all items
1: for each item i do
2:   Initialize residual: r_i^0 ← v_i
3: end for
4: for layer l ← 1 to L do
5:   Collect residuals: R^{l-1} ← {r_1^{l-1}, r_2^{l-1}, ...}
6:   Perform K-means clustering on R^{l-1} with K clusters
7:   Store centroids in codebook C_l = {C_1^l, C_2^l, ..., C_K^l}
8:   for each item i do
9:     Compute distances: d_k ← ‖r_i^{l-1} − C_k^l‖² for all k ∈ [1, K]
10:    Assign cluster index: c_i^l ← arg min_k d_k
11:    Record centroid: C_i^l ← C_{c_i^l}^l
12:    Update residual: r_i^l ← r_i^{l-1} − C_i^l
13:   end for
14: end for

Let's break down the steps:

  1. Initialization: For each item $i$, its residual vector $\mathbf{r}_i$ is initialized with its original integrated embedding $\mathbf{v}_i$, i.e., $\mathbf{r}_i^0 = \mathbf{v}_i$.
  2. Iterative Quantization (Layers): The process iterates for $L$ layers (the hierarchy depth).
    • Collect Residuals: At each layer $l$, all current residual vectors $\mathcal{R}^{l-1}$ from the previous layer are collected.
    • K-means Clustering: K-means clustering is performed on these residuals $\mathcal{R}^{l-1}$ to partition them into $K$ clusters, identifying common patterns.
    • Store Centroids: The $K$ resulting centroids are stored in a codebook $C_l = \{\mathbf{C}_1^l, \mathbf{C}_2^l, \ldots, \mathbf{C}_K^l\}$.
    • Assign Cluster Index & Update Residuals: For every item $i$:
      • The Euclidean distance $d_k = \lVert \mathbf{r}_i^{l-1} - \mathbf{C}_k^l \rVert^2$ is computed between its current residual vector $\mathbf{r}_i^{l-1}$ and each centroid $\mathbf{C}_k^l$ in $C_l$.

      • The item is assigned to the cluster whose centroid is nearest: $c_i^l = \arg\min_k d_k$. This $c_i^l$ is the discrete code for item $i$ at layer $l$.

      • The assigned centroid $\mathbf{C}_i^l = \mathbf{C}_{c_i^l}^l$ is recorded.

      • The residual vector is updated by subtracting the assigned centroid: $\mathbf{r}_i^l = \mathbf{r}_i^{l-1} - \mathbf{C}_i^l$. This removes the information captured by the current centroid, allowing the next layer to focus on the remaining details.

        After $L$ layers, each item $i$ obtains a sequence of discrete cluster indices $\mathbf{c}_i = [c_i^1, c_i^2, \ldots, c_i^L]$, which serves as its compact and hierarchical Unicode representation. An item-unicode lookup table is then constructed to map each item to its Unicode sequence.
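
The following is a minimal sketch of the hierarchical (residual) K-means procedure described above, using scikit-learn's KMeans. The toy embedding matrix and the small settings (K = 4, L = 2) are illustrative assumptions; the paper's configuration uses K = 256 and L = 4.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_codes(V, K=4, L=2, seed=0):
    """Quantize item embeddings V (num_items x dim) into L-level codes with K clusters per level."""
    residuals = V.copy()
    codes = np.zeros((V.shape[0], L), dtype=int)
    codebooks = []
    for level in range(L):
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(residuals)
        codes[:, level] = km.labels_                              # c_i^l: nearest-centroid index
        codebooks.append(km.cluster_centers_)                     # codebook C_l
        residuals = residuals - km.cluster_centers_[km.labels_]   # r_i^l = r_i^{l-1} - C_i^l
    return codes, codebooks

# Toy example: 100 items with 16-dimensional integrated embeddings.
rng = np.random.default_rng(0)
codes, _ = hierarchical_kmeans_codes(rng.normal(size=(100, 16)))
print(codes[:3])   # each row is an item's Unicode, e.g. [2 0]
```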

4.2.4. Stage II: Generative Recommendation

This stage uses the generated Unicodes for autoregressive generative recommendation.

  • Encoding Process: Given a user's interaction history $X = [x_1, x_2, \ldots, x_{t-1}]$, these items are converted into their respective Unicode sequences. The sequence of Unicodes is then fed into an encoder (consisting of stacked multi-head self-attention layers and feed-forward layers, following the Transformer architecture). The encoder processes this sequence to produce a feature representation $H$, which captures the user's interests and is passed to the decoder.

  • Decoding Process: On the decoder side, the objective is to predict the Unicode sequence $[c_1, c_2, \ldots, c_L]$ of the next item $x_t$. For training, a special Begin-of-Sequence token <BOS> is prepended to the item Unicode sequence to form the decoder input. The generative recommendation loss is a cross-entropy loss: $\mathcal{L}_{\mathrm{gen}} = -\sum_{i=1}^{L} \log p(c_i \mid x, \mathrm{<BOS>}, c_1, \ldots, c_{i-1})$ Where:

    • $L$ is the length of the Unicode sequence.
    • $x$ represents the encoded user history (the output of the encoder).
    • <BOS> is the Begin-of-Sequence token.
    • $c_i$ is the $i$-th code in the target item's Unicode sequence.
    • $p(c_i \mid x, \mathrm{<BOS>}, c_1, \ldots, c_{i-1})$ is the probability of predicting the $i$-th code, conditioned on the encoded user history and the previously generated codes of the current item.
  • Intra-modality Knowledge Distillation Task (IKD): To compensate for information loss introduced by the quantization process in Stage I, an intra-modality knowledge distillation task is introduced. Inspired by the [CLS] token in BERT for capturing global context, a learnable token [c_dis] is appended to the end of the decoder input sequence. This token is designed to capture global information about the sequence. The final layer output corresponding to [c_dis] is used in a global contrastive learning objective:

    • The positive sample is the integrated embedding $E_t$ of the target item $x_t$, which was learned in Stage I before quantization.
    • Negative samples $E_{\mathrm{neg}}$ are randomly selected integrated embeddings from other items in the corpus, excluding $x_t$. This objective pulls the [c_dis] output closer to the positive sample $E_t$ and pushes it away from the negative samples $E_{\mathrm{neg}}$. The loss function for this distillation task is: $\mathcal{L}_{\mathrm{distillation}} = -\log \frac{\exp(c_{\mathrm{dis}} \cdot E_t)}{\exp(c_{\mathrm{dis}} \cdot E_t) + \sum \exp(c_{\mathrm{dis}} \cdot E_{\mathrm{neg}})}$ Where:
    • $c_{\mathrm{dis}}$ denotes the final-layer output of the decoder for the special token [c_dis].
    • $E_t$ represents the integrated embedding of the target item $x_t$, as learned in Stage I.
    • $E_{\mathrm{neg}}$ is the integrated embedding of a negative sample item, also learned in Stage I.
    • The dot product ($\cdot$) measures similarity.
  • Total Loss for Stage II: The total loss for Stage II combines the generative recommendation loss and the intra-modality knowledge distillation loss: $\mathcal{L}_{\mathrm{StageII}} = \mathcal{L}_{\mathrm{gen}} + \beta \mathcal{L}_{\mathrm{distillation}}$ Where:

    • $\mathcal{L}_{\mathrm{gen}}$ is the generative recommendation loss.
    • $\mathcal{L}_{\mathrm{distillation}}$ is the intra-modality knowledge distillation loss.
    • $\beta$ is a hyperparameter that balances the two objectives.
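
Here is a minimal NumPy sketch of how the Stage II objective might be assembled: the distillation term follows the contrastive form given above, and the generative cross-entropy is passed in as a precomputed scalar. The tensor sizes and toy inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def distillation_loss(c_dis, e_target, e_negatives):
    """L_dis = -log( exp(c_dis . E_t) / (exp(c_dis . E_t) + sum_j exp(c_dis . E_neg_j)) )."""
    pos = np.exp(c_dis @ e_target)
    neg = np.exp(e_negatives @ c_dis).sum()
    return -np.log(pos / (pos + neg))

def stage2_loss(gen_loss, c_dis, e_target, e_negatives, beta=1.0):
    """Total Stage II objective: generative cross-entropy plus the weighted distillation term."""
    return gen_loss + beta * distillation_loss(c_dis, e_target, e_negatives)

# Toy example: a 96-d [c_dis] output, one positive and 8 negative Stage I embeddings.
rng = np.random.default_rng(0)
c_dis = rng.normal(size=96) * 0.1
print(stage2_loss(gen_loss=2.3, c_dis=c_dis,
                  e_target=rng.normal(size=96) * 0.1,
                  e_negatives=rng.normal(size=(8, 96)) * 0.1))
```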

4.2.5. Training and Inference

  • Training: UNGER employs a two-stage training process:

    1. Stage I Training: A sequential recommendation model (e.g., DIN) is used as the backbone and optimized with the loss $\mathcal{L}_{\mathrm{StageI}}$. After this stage, the item integrated embeddings are extracted, and item Unicodes are derived using hierarchical K-means clustering.
    2. Stage II Training: A Transformer encoder-decoder model is trained for generative recommendation, optimized with the loss $\mathcal{L}_{\mathrm{StageII}}$.
  • Inference: During inference, only the trained Transformer model from Stage II is utilized.

    • The decoder performs beam search to autoregressively generate each token within the item Unicode sequence.
    • Once the Unicode sequences are generated, they are mapped back to their corresponding items using the item-unicode lookup table.
    • A top-$k$ recommendation list is produced based on the confidence scores of the generated Unicodes.
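
The sketch below illustrates only the final step of inference: mapping beam-search-generated Unicode sequences back to item IDs through the item-unicode lookup table and ranking them by score. The beam scores and table contents are hypothetical; the actual beam search is performed by the trained Transformer decoder.

```python
def unicodes_to_topk_items(beam_results, item_lookup, k=10):
    """Map generated Unicode sequences back to items and return the top-k by beam score.

    beam_results: list of (unicode_tuple, log_prob) pairs from beam search.
    item_lookup:  dict mapping unicode_tuple -> item_id (the item-unicode lookup table).
    """
    scored = [(item_lookup[code], score)
              for code, score in beam_results if code in item_lookup]  # drop invalid codes
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:k]]

# Hypothetical beam output over 4-token Unicodes and a tiny lookup table.
lookup = {(3, 17, 200, 42): "item_A", (3, 17, 201, 7): "item_B"}
beams = [((3, 17, 200, 42), -1.2), ((3, 17, 201, 7), -2.5), ((9, 9, 9, 9), -0.5)]
print(unicodes_to_topk_items(beams, lookup, k=2))   # ['item_A', 'item_B']
```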

4.2.6. Computational and Storage Costs Analysis

The paper provides a comparison of UNGER's efficiency against existing generative recommendation methods.

  • Computational Cost:

    • UNGER: The encoding phase (processing the user history) occurs only once. The main burden is the decoding stage, where the decoder autoregressively generates each token of the target item code. If the code length is fixed and decoding one item code takes $O(n)$ time, the total decoding complexity is $O(n)$.
    • Existing Methods (e.g., EAGER, SC-Rec): These adopt separate modality-specific codes, typically maintaining $K$ parallel decoders (e.g., one for semantic, one for collaborative). They must decode $K$ modality-specific codes in parallel, leading to a cumulative decoding cost of $O(Kn)$. An additional ranking or fusion step is often required.
    • Advantage: UNGER reduces the computational complexity from $O(Kn)$ to $O(n)$.
  • Storage Cost:

    • UNGER: Compresses all relevant information into a single unified code. If storing one modality-specific code requires $O(m)$ space, UNGER requires only $O(m)$ storage per item.

    • Existing Methods: Store $K$ codes per item, leading to a total storage cost of $O(Km)$.

    • Advantage: UNGER reduces the storage requirement from $O(Km)$ to $O(m)$.

      The following are the results from Table 1 of the original paper:

| | TIGER | EAGER | UNGER (Ours) |
|---|---|---|---|
| Computation Cost | O(n) | O(2n) | O(n) |
| Storage Cost | O(m) | O(2m) | O(m) |
| Used Modality | Semantic Only | Semantic + Collaborative | Semantic + Collaborative |

This table clearly illustrates UNGER's efficiency advantages. While TIGER also has $O(n)$ computation and $O(m)$ storage, it only uses semantic information. EAGER uses both semantic and collaborative information but incurs $O(2n)$ computation and $O(2m)$ storage due to its dual-code approach. UNGER achieves the best of both worlds: it leverages both modalities with the efficiency of a single-modality system.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three public benchmarks derived from the Amazon Product Reviews dataset [40]. This dataset contains user reviews and item metadata from May 1996 to July 2014. The chosen categories for sequential recommendation are "Beauty", "Sports and Outdoors", and "Toys and Games".

For all datasets, user interaction records are grouped by user and sorted chronologically by timestamp. To ensure data quality and relevance, the 5-core dataset filtering strategy was applied, which means only users and items with at least five interaction records are retained, filtering out unpopular items and inactive users. This helps mitigate sparsity and focuses on more active engagement patterns.

The following are the results from Table 2 of the original paper:

| Dataset | #Users | #Items | #Interactions | Density |
|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,360 | 0.00073 |
| Sports and Outdoors | 35,598 | 18,357 | 296,175 | 0.00045 |
| Toys and Games | 19,412 | 11,924 | 167,526 | 0.00073 |

These datasets are widely used in sequential recommendation research, making them effective for validating the method's performance and allowing for fair comparison with existing benchmarks. The relatively low density across all datasets (ranging from 0.00045 to 0.00073) highlights the inherent sparsity of recommendation data, which UNGER aims to address by integrating diverse knowledge modalities.
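
As a hedged illustration of the 5-core filtering described above, here is a small pandas sketch that iteratively drops users and items with fewer than k interactions (iterative k-core is one common reading of "5-core"); the column names and toy data are assumptions, not the paper's actual preprocessing code.

```python
import pandas as pd

def k_core_filter(df, k=5):
    """Iteratively drop users and items with fewer than k interactions (k-core filtering)."""
    while True:
        user_counts = df["user_id"].value_counts()
        item_counts = df["item_id"].value_counts()
        keep = (df["user_id"].map(user_counts) >= k) & (df["item_id"].map(item_counts) >= k)
        if keep.all():
            return df
        df = df[keep]

# Toy interaction log; the real inputs would be the Amazon Product Reviews dumps, with k=5.
df = pd.DataFrame({"user_id": ["u1"] * 6 + ["u2"] * 2,
                   "item_id": ["i1", "i2", "i3", "i4", "i5", "i6", "i1", "i2"]})
print(len(k_core_filter(df, k=2)))   # 4 rows survive the 2-core filter in this toy example
```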

5.2. Evaluation Metrics

The paper uses two standard metrics for evaluating recommendation performance: Recall@K and Normalized Discounted Cumulative Gain (NDCG@K). These metrics are reported for $K = 10$ and $K = 20$.

  • Recall@K:

    1. Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top-$K$ recommended items. It focuses on how many of the "truly good" items for a user are actually present in the recommended list, regardless of their specific ranking within that list. A higher Recall@K indicates that the model is better at identifying and including relevant items.
    2. Mathematical Formula: $\mathrm{Recall}@K = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{I}(r_u < K)$
    3. Symbol Explanation:
      • $|\mathcal{U}|$: The total number of users in the evaluation set.
      • $\mathcal{U}$: The set of all users.
      • $u$: A specific user.
      • $\mathbb{I}(\cdot)$: An indicator function that returns 1 if its argument is true, and 0 otherwise.
      • $r_u$: The rank of the true next item (ground-truth item) for user $u$ in the recommendation list. If the true item is within the top-$K$ list, $r_u < K$ is true.
  • NDCG@K:

    1. Conceptual Definition: NDCG@K evaluates the quality of the ranking of the top-$K$ recommended items. Unlike Recall@K, NDCG@K considers not only whether relevant items are present but also their positions in the list. Highly relevant items ranked higher contribute more to the NDCG score. It is normalized by the ideal DCG to ensure scores are comparable across different queries. A higher NDCG@K indicates better ranking quality.
    2. Mathematical Formula: The paper provides the component formulas for NDCG@K:
      • Discounted Cumulative Gain (DCG) for a user $u$ at $K$: $\mathrm{DCG}_u@K = \sum_{j=1}^{K} \frac{2^{y_{u,j}} - 1}{\log_2(j+1)}$
      • Ideal Discounted Cumulative Gain (iDCG) for a user $u$ at $K$: $\mathrm{iDCG}_u@K = \sum_{j=1}^{K} \frac{2^{y^*_{u,j}} - 1}{\log_2(j+1)}$
      • Normalized Discounted Cumulative Gain (NDCG) for a user $u$ at $K$: $\mathrm{NDCG}_u@K = \frac{\mathrm{DCG}_u@K}{\mathrm{iDCG}_u@K}$
      • Average NDCG@K over all users: $\mathrm{NDCG}@K = \frac{1}{|\mathcal{U}|} \sum_{u=1}^{|\mathcal{U}|} \mathrm{NDCG}_u@K$
    3. Symbol Explanation:
      • $y_{u,j}$: The relevance of the $j$-th recommended item for user $u$. In binary relevance settings (common in sequential recommendation), $y_{u,j} = 1$ if the $j$-th item is the ground-truth item (the next item the user actually interacted with), and 0 otherwise.
      • $y^*_{u,j}$: The relevance of the $j$-th item in the ideal ranking for user $u$. The ideal ranking places the relevant item at the top; for binary relevance with a single ground-truth item, $y^*_{u,1} = 1$ and all other $y^*_{u,j} = 0$.
      • $K$: The number of top recommended items considered.
      • $\log_2(j+1)$: A logarithmic discount factor, so items ranked lower are given less weight.
      • $|\mathcal{U}|$: The total number of users.
  • Evaluation Protocol: The leave-one-out strategy is used, a standard protocol [24]. For each user, the last interaction is used as the ground-truth item for testing, and the second-to-last item for validation. All preceding interactions form the training sequence. User history length is limited to 20 items during training.
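
To make the leave-one-out evaluation concrete, here is a minimal sketch that computes Recall@K and NDCG@K when each user has exactly one held-out ground-truth item (so IDCG = 1). The function and variable names are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def recall_ndcg_at_k(ranked_lists, ground_truth, k=10):
    """Leave-one-out Recall@K and NDCG@K with a single relevant item per user."""
    recalls, ndcgs = [], []
    for recs, target in zip(ranked_lists, ground_truth):
        top_k = recs[:k]
        if target in top_k:
            rank = top_k.index(target) + 1            # 1-indexed position of the hit
            recalls.append(1.0)
            ndcgs.append(1.0 / np.log2(rank + 1))     # IDCG = 1 for one relevant item
        else:
            recalls.append(0.0)
            ndcgs.append(0.0)
    return np.mean(recalls), np.mean(ndcgs)

# Toy example: two users, each with a top-3 recommendation list and one held-out item.
recall, ndcg = recall_ndcg_at_k([["i5", "i2", "i9"], ["i1", "i4", "i7"]],
                                ["i2", "i8"], k=3)
print(recall, ndcg)   # 0.5, ~0.315
```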

5.3. Baselines

UNGER is compared against a comprehensive set of representative baselines, categorized into classical sequential methods and generative methods.

5.3.1. Classical Sequential Methods

These models typically predict the next item based on item ID sequences.

  • GRU4REC [17]: An RNN-based model using Gated Recurrent Units (GRUs) to model user click sequences.
  • Caser [52]: A CNN-based method that captures high-order Markov Chains by modeling user behaviors using both horizontal and vertical convolutional operations.
  • SASRec [24]: A self-attention-based sequential recommendation model that utilizes a unidirectional Transformer encoder to predict the next item.
  • BERT4Rec [51]: Adopts a Transformer with a bidirectional self-attention mechanism and a Cloze objective loss for item sequence modeling, inspired by BERT.
  • HGN [39]: Employs hierarchical gating networks to capture both long-term and short-term user interests.
  • FDSA [69]: Leverages self-attention networks to model item-level and feature-level sequences separately, emphasizing feature transition dynamics.
  • $S^3$-Rec [73]: A bidirectional Transformer pre-trained with mutual information maximization via self-supervised tasks for sequential recommendation.

5.3.2. Generative Methods

These models frame recommendation as an autoregressive generation task.

  • RecForest [11]: Jointly learns latent embeddings and indices through multiple K-ary trees, using hierarchical balanced clustering and a Transformer-based encoder-decoder routing network.
  • TIGER [47]: Utilizes a pre-trained T5 encoder to learn semantic identifiers for items, then autoregressively decodes target candidates, incorporating RQ-VAE quantization.
  • ColaRec [61]: Integrates user-item interactions and content data end-to-end, aligning semantic and collaborative spaces via pretrained collaborative identifiers, an item indexing task, and contrastive loss.
  • EAGER [62]: Integrates behavioral and semantic information through a two-stream architecture with shared encoding but separate decoding pipelines for each modality, using contrastive and semantic-guided learning.

5.4. Implementation Details

The paper provides detailed implementation specifics for UNGER.

  • Model Architecture:

    • Encoder Layers: 1
    • Decoder Layers: 4
    • Embedding Dimension: 96
    • Hidden Size: 256
    • Number of Attention Heads: 6
  • Quantization:

    • Number of Clusters ($K$): 256
    • Cluster Depth ($L$): 4
  • Encoders:

    • Collaborative Encoder: DIN [72]
    • Semantic Encoder: Pre-trained Llama2-7b [53] (hidden size 128 as reported in [47, 6])
  • Training:

    • Optimizer: Adam
    • Learning Rate: 1e-3
    • Warmup Strategy: Applied for stable training, with warmup steps = 2000 and warmup initial learning rate = 1e-7.
    • Batch Size: 256
    • Training Steps: 20000
    • Dropout Rate: 0
    • Activation Function: ReLU
    • Weight Decay: 1e-7
  • Loss Coefficients:

    • $\alpha$ (for the CKA task in Stage I): 1.0
    • $\beta$ (for the IKD task in Stage II): 1.0
    • $\tau$ (temperature parameter for the Info-NCE loss): 1.0
    • The authors note UNGER's robustness to $\alpha$ and $\beta$ due to fast convergence, allowing them to be set to 1.0.
  • Inference:

    • Beam Width: 100
  • Reproducibility: Each experiment is conducted five times with different random seeds (chosen from [2020, 2021, 2022, 2023, 2024]), and the average score is reported. Statistical significance is tested with a paired t-test ($p < 0.05$).

    The following are the results from Table 4 of the original paper:

| Parameter | Value |
|---|---|
| Embedding Dimension | 96 |
| Model Layers | 4 |
| Hidden Size | 256 |
| Heads | 6 |
| Num of Clusters | 256 |
| Cluster Depth | 4 |
| Learning Rate | 1e-3 |
| Optimizer | Adam |
| Semantic Encoder | Llama2-7b |
| Collaborative Encoder | DIN |
| Batch Size | 256 |
| Training Steps | 20000 |
| Dropout Rate | 0 |
| Beam Width | 100 |
| Activation Function | ReLU |
| Weight Decay | 1e-7 |
| Warmup Steps | 2000 |
| Warmup Initial Learning Rate | 1e-7 |
| Random Seed | 2024 |
| α | 1.0 |
| β | 1.0 |
| τ | 1.0 |

The paper also provides information on practical resources required. The following are the results from Table 5 of the original paper:

| Information | Value |
|---|---|
| Device | NVIDIA RTX 3090 GPU (24GB) ×1 |
| Training Time | less than 1 GPU hour |
| Inference Latency (ms) | 19.5 |
| Model Parameters | 10,249,988 |

This indicates that UNGER is relatively efficient in terms of training time and inference latency, requiring less than an hour of training on a single NVIDIA RTX 3090 GPU (a high-end consumer/prosumer GPU) and having a low inference latency of 19.5 ms. The model parameter count is around 10 million, which is moderate for deep learning models.

6. Results & Analysis

6.1. Core Results Analysis (RQ1)

The paper thoroughly evaluates UNGER's performance against both classical and generative sequential recommendation baselines across three Amazon datasets (Beauty, Sports and Outdoors, Toys and Games) using Recall@10, Recall@20, NDCG@10, and NDCG@20.

The following are the results from Table 3 of the original paper:

Dataset Metric Classical Generative UNGER (Ours)
GRU4REC Caser SASRec BERT4Rec HGN FDSA S3-Rec RecForest TIGER ColaRec EAGER
Beauty Recall@10 0.0283 0.0605 0.0347 0.0512 0.0407 0.0647 0.0664 0.0617* 0.0524* 0.0836 0.0939 ±0.00093
Recall@20 0.0479 C0 0.0902 0.0599 0.0773 0.0656 0.0994 0.0915 0.0924* 0.0807* 0.1124 0.1289 ±0.00084
NDCG@10 0.0137 0.0176 0.0318 0.0170 0.0266 0.0208 0.0327 0.0400 0.0339* 0.0263* 0.0525 0.0559 ±0.00037
NDCG@20 0.0229 0.0394 0.0233 0.0332 0.0270 0.0414 0.0464 0.0417* 0.0335* 0.0599 0.0646 ±0.00041
Sports Recall@10 0.0204 0.0194 0.0350 0.0191 0.0313 0.0288 0.0385 0.0247 0.0376* 0.0348* 0.0441 0.0471 ±0.00078
Recall@20 0.0333 0.0314 0.0507 0.0315 0.0477 0.0463 0.0607 0.0375 0.0577* 0.0533* 0.0659 0.0710 ±0.00075
NDCG@10 0.0110 0.0097 0.0192 0.0099 0.0159 0.0156 0.0204 0.0133 0.0196* 0.0179* 0.0236 0.0259 ±0.00034
NDCG@20 0.0142 0.0126 0.0231 0.0130 0.0201 0.0200 0.0260 0.0164 0.0246* 0.0226* 0.0291 0.0319 ±0.00049
Toys Recall@10 0.0176 0.0270 0.0675 0.0203 0.0497 0.0381 0.0700 0.0383 0.0578* 0.0474* 0.0714 0.0822 ±0.00085
Recall@20 0.0301 0.0420 0.0941 0.0358 0.0716 0.0632 0.1065 0.0483 0.0838* 0.0704* 0.1024 0.1154 ±0.00070
NDCG@10 0.0084 0.0141 0.0374 0.0099 0.0277 0.0189 0.0376 0.0285 0.0321* 0.0242* 0.0505 0.0489 ±0.00032
NDCG@20 0.0116 0.0179 0.0441 0.0138 0.0332 0.0252 0.0468 0.0310 0.0386* 0.0300* 0.0538 0.0573 ±0.00025

Key Observations:

  • UNGER's Superiority: UNGER consistently achieves the highest performance across all datasets and metrics. On the Beauty dataset, it outperforms the second-best generative baseline (EAGER) by 14.68% in Recall@20 (0.1289 vs 0.1124) and 7.85% in NDCG@20 (0.0646 vs 0.0599). This significant gain is attributed to UNGER's effective integration of collaborative filtering signals and semantic content representations via a unified discrete code, which enhances efficiency and reduces redundancy.
  • Generative Models Outperform Classical Models: A clear trend emerges: generative models generally outperform classical embedding-based or ID-based sequential models. This highlights the advantage of representing items as structured, discrete hierarchical codes that encapsulate rich prior knowledge, aligning well with Transformer-based autoregressive generation. Traditional methods often treat item IDs as atomic symbols, lacking contextual information.
  • Efficiency Advantage: Despite its superior performance, UNGER is also more lightweight and scalable, requiring approximately half the number of total parameters compared to EAGER, which uses dual-code streams. This aligns with the computational and storage cost analysis presented earlier.

6.2. Semantic Domination Issue (RQ2)

To investigate UNGER's effectiveness in mitigating the semantic dominance issue, experiments were conducted on the Amazon Beauty dataset, comparing different methods of integrating semantic and collaborative information.

The following are the results from Table 6 of the original paper:

| Approach | Semantic modality | Collaborative modality |
| --- | --- | --- |
| Concat | 97.33% | 2.67% |
| Ours | 59.89% | 40.11% |

Methodology for Analysis:

  1. Embedding Preparation: Semantic embeddings are extracted using a pre-trained Llama2-7b, and collaborative embeddings are obtained from DIN.

  2. Concatenation Method (Concat): To mimic a common fusion approach, semantic embeddings are concatenated with collaborative embeddings. To address dimensionality and value distribution mismatches, PCA is applied to reduce the semantic embedding dimension, and both embeddings are normalized before concatenation.

  3. UNGER's Method (Ours): The integrated embeddings are obtained using UNGER's proposed approach.

  4. Visualization: T-SNE is used to visualize the embedding distributions, with K-means clustering for color-coding points from each cluster. As illustrated in Figure 6, the concatenation method leads to a fused embedding distribution that closely mirrors the original semantic embeddings, while UNGER achieves a more balanced distribution.

    Fig. 6. Visualization of the collaborative, semantic, concatenated, and integrated (Ours) embedding distributions.

  5. Quantitative Measurement: KL divergence is used to quantify the relative similarity of each modality to the final integrated embeddings (a runnable sketch combining steps 2 and 5 appears after this list).

    • KL divergence (Kullback-Leibler divergence) is a non-symmetric measure of how one probability distribution differs from a second, reference probability distribution. In this context, it quantifies the dissimilarity between the modality-specific embeddings (S or C) and the final integrated embeddings (E).
    • The relative similarity of a modality is calculated as: $$\mathrm{Similarity}_{\mathrm{semantic}} = 1 - \frac{\operatorname{KL}(S \parallel E)}{\operatorname{KL}(S \parallel E) + \operatorname{KL}(C \parallel E)}, \qquad \mathrm{Similarity}_{\mathrm{collaborative}} = 1 - \frac{\operatorname{KL}(C \parallel E)}{\operatorname{KL}(S \parallel E) + \operatorname{KL}(C \parallel E)}$$ where:
      • E represents the final integrated embeddings (from Concat or Ours).
      • S denotes the semantic embeddings.
      • C refers to the collaborative embeddings.
      • KL(A ∥ B) is the KL divergence from distribution A to distribution B; a lower KL value indicates higher similarity.
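
The sketch below puts steps 2 and 5 together: a Concat-style fusion followed by the KL-based similarity measurement. The input arrays, the PCA target dimension, and the histogram-based estimate of each distribution are assumptions made for illustration; the paper does not specify these implementation details.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import entropy

# Hypothetical stand-ins for the real embeddings (semantic from an LLM encoder,
# collaborative from a CF model); shapes are illustrative only.
rng = np.random.default_rng(0)
sem_emb = rng.normal(size=(1000, 4096))      # e.g., Llama2-7b item embeddings
collab_emb = rng.normal(size=(1000, 64))     # e.g., DIN item embeddings

def standardize(x):
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# --- Step 2: Concat baseline (PCA to match dimensions, then normalize + concatenate) ---
sem_reduced = PCA(n_components=64).fit_transform(sem_emb)
fused = np.concatenate([standardize(sem_reduced), standardize(collab_emb)], axis=1)

# --- Step 5: KL-based relative similarity (one simple instantiation) ---
def to_dist(x, bins=100, value_range=(-5, 5)):
    """Histogram the flattened embedding values as a proxy distribution."""
    hist, _ = np.histogram(x.ravel(), bins=bins, range=value_range)
    return hist + 1e-12                      # scipy normalizes; epsilon avoids zeros

S = to_dist(standardize(sem_reduced))
C = to_dist(standardize(collab_emb))
E = to_dist(fused)

kl_se, kl_ce = entropy(S, E), entropy(C, E)  # KL(S||E), KL(C||E)
sim_semantic = 1 - kl_se / (kl_se + kl_ce)
sim_collaborative = 1 - kl_ce / (kl_se + kl_ce)
print(f"semantic share: {sim_semantic:.2%}, collaborative share: {sim_collaborative:.2%}")
```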

Analysis of Results:

  • Semantic Dominance in Concat: Table 6 shows that the concatenation method results in a highly imbalanced fusion, with 97.33% similarity from the semantic modality and only 2.67% from the collaborative modality. This confirms the semantic dominance problem: semantic embeddings (often pre-trained on large textual data with smooth distributional properties) exert disproportionate influence, underutilizing collaborative knowledge. Figure 4 provides a visual representation of this.

    Fig. 4. Proportional similarity of semantic modality and collaborative modality to the final representation with concatenation method on the Beauty dataset.

  • UNGER's Balanced Integration: In contrast, UNGER effectively addresses this issue, achieving a much more balanced fusion with 59.89% semantic similarity and 40.11% collaborative similarity. This indicates that UNGER's modality adaptation layer and cross-modality knowledge alignment task successfully learn to integrate both knowledge types more proportionally at the distributional level.

    The following are the results from Table 7 of the original paper:

    | Method (Beauty dataset) | Recall@10 | Recall@20 | NDCG@10 | NDCG@20 |
    | --- | --- | --- | --- | --- |
    | Semantic Only | 0.0791 | 0.1077 | 0.0462 | 0.0535 |
    | Collaborative Only | 0.0744 | 0.0997 | 0.0469 | 0.0532 |
    | Concat | 0.0759 | 0.1071 | 0.0452 | 0.0530 |
    | Ours | 0.0939 | 0.1289 | 0.0559 | 0.0646 |
  • Performance Impact: Table 7 further validates UNGER's superiority. UNGER (Ours) significantly outperforms both single-modality approaches (Semantic Only, Collaborative Only) and, critically, the Concatenation method across all metrics. The Concatenation method even performs worse than the Semantic Only approach, underscoring the negative impact of semantic dominance and the failure of naive fusion. This confirms that UNGER's method successfully captures the complementary nature of the two knowledge types, leading to a more effective fusion.

6.3. Ablation Study (RQ3)

An ablation study was conducted to understand the individual contributions of UNGER's key auxiliary tasks: intra-modality knowledge distillation (IKD) and cross-modality knowledge alignment (CKA).

The following are the results from Table 8 of the original paper:

| Metric (Beauty) | w/o CKA & IKD | CKA | CKA + IKD (Ours) |
| --- | --- | --- | --- |
| Recall@10 | 0.0759 | 0.0827 | 0.0939 |
| Recall@20 | 0.1071 | 0.1122 | 0.1289 |
| NDCG@10 | 0.0452 | 0.0509 | 0.0559 |
| NDCG@20 | 0.0530 | 0.0583 | 0.0646 |

Analysis: The table compares three variants on the Beauty dataset:

  1. Baseline (without CKA or IKD): The first results column (matching the Concat baseline from Table 7: 0.0759 Recall@10, 0.1071 Recall@20, etc.) represents the model without either CKA or IKD. This variant performs the worst.
  2. CKA (with only CKA task): The column CKA shows improved performance over the baseline, demonstrating that cross-modality knowledge alignment alone provides significant gains by enabling the model to capture richer, more holistic item representations.
  3. CKA + IKD (Ours) (full UNGER model): This variant, representing the full UNGER model with both tasks, achieves the best performance across all metrics.

Conclusion:

  • Indispensable Roles: The degradation in performance when either CKA or IKD is removed strongly suggests that both tasks are critical to UNGER's success.
  • Effectiveness of CKA: The CKA task acts as a bridge, effectively aligning collaborative and semantic knowledge, which helps in harnessing their complementary information.
  • Synergistic Effect of CKA and IKD: The best performance achieved by the full UNGER model highlights the synergistic effect of both tasks. CKA establishes strong cross-modal connections, while IKD reinforces intra-modal coherence and compensates for quantization loss. This combination enables the learning of powerful, unified item codes, significantly enhancing recommendation quality.

6.4. Scaling Law Study (RQ4)

The paper investigates the scaling law characteristics of UNGER with respect to model depth, model width, and data volume, a relatively underexplored area in generative recommendation.

6.4.1. Model Depth

The impact of model depth (number of layers) on performance was studied, varying the layers from 1 to 8.

As illustrated in Figure 7 from the original paper, the overall trend indicates consistent performance improvement as the model layers increase.

Fig. 7. Impact of model depth (number of layers) on Recall@20 and NDCG@20, comparing UNGER (Ours) with variants that remove the collaborative or semantic modality.

  • Observation: Both Recall@20 and NDCG@20 show steady gains, particularly within the range of 1 to 8 layers. This suggests that deeper models can capture more complex user-item interaction patterns.
  • UNGER's Advantage: The superior performance of UNGER is attributed to its unified code structure, which jointly encodes collaborative and semantic information. This provides richer and more diverse learning signals at every layer.
  • Synergy with Depth: Deeper layers can refine and abstract over both types of information in a complementary manner, enhancing generalization. UNGER consistently outperforms models lacking both collaborative and semantic modalities across all depth settings, and this performance gap widens with increasing depth.

6.4.2. Data Size

The influence of data volume on generative recommendation performance was examined by training models on subsets of the Beauty dataset (20% to 100%).

As shown in Figure 8 from the original paper, both Recall@20 and NDCG@20 consistently improve with increasing data size.

Fig. 8. Impact of data volume on the Beauty dataset.

  • Observation: The model benefits consistently from larger training corpora, showing that it scales well with additional data. Recall@20 gains the most when the data fraction increases from 20% to 40%, indicating that the model quickly leverages additional interaction signals.
  • Conclusion: This upward trend underscores the importance of training data scale in enhancing the capacity of generative recommendation models and validates the reliability of UNGER as it scales with data.

6.4.3. Model Width

The study also investigated how model width (embedding dimension) affects performance, varying it from 24 to 768.

As presented in Figure 9 from the original paper, both Recall@20 and NDCG@20 exhibit an initial upward trend, followed by a decline as width continues to increase.

Fig. 9. Impact of model width on the Beauty dataset. Model width varies from 24 to 768.

  • Observation: The model achieves optimal performance when the width is set to 96. Increasing width beyond this point leads to decreased performance, likely due to over-parameterization and reduced training stability under limited data conditions.
  • Conclusion: This highlights the importance of selecting an appropriate width to balance model capacity and generalization. Overly large widths can introduce optimization difficulties and suboptimal generalization, emphasizing the need for careful tuning.

6.5. Hyper-Parameter Analysis (RQ4)

6.5.1. Pretrained Semantic Encoders

The paper analyzes the impact of different pretrained semantic encoders on recommendation performance on the Beauty dataset. Three models were compared: BERT [7], Sentence-T5 [41], and Llama2-7b [53].

As presented in Figure 10 from the original paper, Llama2-7b achieves the best performance.

Fig. 10. Analysis of different pretrained semantic encoders.

  • Observation:
    • Llama2-7b significantly outperforms BERT and Sentence-T5. This is attributed to Llama2-7b's substantially larger and more diverse training corpus, enabling it to learn richer contextual and domain-specific semantics.
    • Sentence-T5 (optimized for semantic similarity in short texts) performs less effectively, likely due to the complex, domain-specific, and nuanced nature of e-commerce product descriptions.
    • BERT (general-purpose masked language modeling) also shows suboptimal performance, as its embeddings are less effective in distinguishing polysemous words and capturing long-tail attributes relevant to product-level distinctions.
  • Modularity: UNGER's design is plug-and-play, allowing the semantic encoder to be replaced without altering other components. This makes it adaptable to future advancements in language models or multimodal encoders (e.g., CLIP-like models [30, 35, 46]) as long as output dimensions are compatible.
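
As a small illustration of this plug-and-play property, the sketch below hides the encoder behind a single function so that it can be swapped freely; the model identifier is an assumption for illustration, and any encoder with a compatible output dimension could be substituted.

```python
from sentence_transformers import SentenceTransformer

def item_semantic_embeddings(item_texts,
                             model_name="sentence-transformers/sentence-t5-base"):
    """Encode item texts with an interchangeable pretrained encoder."""
    encoder = SentenceTransformer(model_name)   # swap in another encoder here
    return encoder.encode(item_texts, normalize_embeddings=True)

emb = item_semantic_embeddings([
    "Matte lipstick, long lasting, 12 shades",
    "Trail running shoes, waterproof, men's size 10",
])
print(emb.shape)   # (2, d), where d depends on the chosen encoder
```

As noted above, only the output dimension needs to stay compatible with the downstream components when the encoder is replaced.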

6.5.2. Choice of Different Quantization Methods

The impact of different quantization strategies on model performance was evaluated on the Beauty dataset, comparing Random Assignment, RQVAE [28], and Hierarchical K-means.

As illustrated in Figure 11 from the original paper, the Hierarchical K-means approach significantly outperforms the other two methods.

Fig. 11. Analysis of different quantization methods.

  • Observation:
    • Hierarchical K-means performs best, highlighting the importance of preserving structural information during quantization. It encodes items in a way that reflects hierarchical proximity, improving the retrieval of relevant items.
    • Random Assignment yields the poorest performance, as it provides no meaningful structure to the discretized space, limiting generalization.
    • RQVAE, while theoretically capable, suffered from codebook collisions in practice. Although mitigation strategies (like appending unique identifiers) were used, they introduced noise, leading to suboptimal performance.
  • Conclusion: These findings underscore the advantages of hierarchical quantization methods in capturing and preserving the structural and semantic information of item embeddings for generative recommendation.
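
To make the hierarchical quantization step concrete, here is a minimal sketch of hierarchical K-means code assignment. The embedding source, the code depth, and the per-level cluster count are assumptions for illustration (the cluster analysis in Section 6.5.3 suggests K around 256 at the paper's dataset scale); this is not UNGER's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_codes(emb, k=8, depth=3, seed=0):
    """Assign each item a `depth`-token discrete code by clustering recursively:
    level 0 clusters all items, level 1 re-clusters within each level-0 cluster, etc."""
    codes = np.zeros((len(emb), depth), dtype=int)

    def recurse(idx, level):
        if level == depth or len(idx) == 0:
            return
        n_clusters = min(k, len(idx))                     # guard tiny partitions
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(emb[idx])
        codes[idx, level] = labels
        for c in range(n_clusters):
            recurse(idx[labels == c], level + 1)

    recurse(np.arange(len(emb)), 0)
    return codes

# Usage with hypothetical unified item embeddings (small k for a quick demo):
item_emb = np.random.default_rng(0).normal(size=(2000, 32))
unicodes = hierarchical_kmeans_codes(item_emb, k=8, depth=3)
print(unicodes[:3])   # each row is a discrete hierarchical code, e.g., [5, 2, 7]
```

Because sibling items share code prefixes, hierarchically close items end up with similar codes, which is the structural property credited above for this method's advantage over random assignment.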

6.5.3. Number of Clusters

An investigation into how the number of clusters K in the hierarchical K-means algorithm influences performance was conducted on the Beauty dataset, with K varying from 128 to 512.

The results are summarized in Figure 12 from the original paper.

Fig. 12. Impact of the number of clusters, with K varying from 128 to 512.

  • Observation:
    • Increasing K from 128 to 256 consistently improves both Recall@20 and NDCG@20, indicating enhanced representational capacity. A larger K allows for more discriminative codes for individual items.
    • Further increasing K from 256 to 512 leads to NDCG@20 continuing to improve, suggesting better ranking quality for top items. However, Recall@20 slightly declines. This is hypothesized to be due to an expanded search space, increasing decoding complexity and potentially leading to suboptimal retrieval results during inference.
  • Conclusion: The number of clusters K must be carefully chosen to balance representational richness and computational efficiency. A moderate K (e.g., 256) proportional to the overall number of items in the dataset appears to be a practical heuristic.

6.5.4. Sensitivity of Hyper-parameter α

The sensitivity of the hyper-parameter α (balancing the next-item prediction and cross-modality knowledge alignment losses in Stage I) was investigated by varying α from 0.2 to 5.0.
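
In equation form (notation ours, inferred from this description rather than quoted from the paper), the Stage I objective weights the two losses as:

$$\mathcal{L}_{\text{Stage I}} = \mathcal{L}_{\text{next-item}} + \alpha \cdot \mathcal{L}_{\text{CKA}}$$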

As shown in Figure 13 from the original paper, both Recall@20 and NDCG@20 improve significantly when α increases from 0.2 to 1.0.

Fig. 13. Sensitivity of hyper-parameter α.

  • Observation: Performance peaks when α is set to 1.0 or 2.0, indicating that moderate emphasis on the alignment objective leads to optimal integration.
  • Robustness: Performance remains stable across a broad range of α values (from 0.5 to 5.0), demonstrating the robustness of UNGER to this hyper-parameter. This implies the model is not overly reliant on fine-tuning α and benefits from the alignment signal as long as it is sufficiently incorporated.

6.5.5. Sensitivity of Hyper-parameter β

The sensitivity of the hyper-parameter β (balancing the generative recommendation and intra-modality knowledge distillation losses in Stage II) was analyzed by varying β from 0.2 to 5.0.
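
Analogously (again in our own notation), the Stage II objective combines the generative recommendation loss with the distillation term:

$$\mathcal{L}_{\text{Stage II}} = \mathcal{L}_{\text{gen}} + \beta \cdot \mathcal{L}_{\text{IKD}}$$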

As illustrated in Figure 14 from the original paper, both Recall@20 and NDCG@20 initially improve as β increases from 0.2 to 1.0.

Fig. 14. Sensitivity of hyper-parameter β.

  • Observation: Performance achieves its best when β = 1.0, confirming the benefit of incorporating the distillation loss to enhance learned representations and compensate for quantization loss.
  • Robustness: The model remains stable across a relatively wide range (from 0.5 to 2.0), indicating robustness to the choice of β. Even when β increases to 5.0, performance does not drastically deteriorate.

6.6. Case Study

To demonstrate UNGER's interpretability and effectiveness, a case study on personalized video game recommendation is presented. This highlights how UNGER uses autoregressive decoding over discrete unified codes to infer user intent and generate recommendations.

As illustrated in Figure 15 from the original paper, the case study demonstrates the process from user history to a specific recommendation.

Fig. 15. An example that illustrates the application of UNGER to the video game recommendation scenario.

6.6.1. User Historical Interaction Profiling

  • Example User History: Black Myth: Wukong, Cyberpunk 2077, Assassin's Creed Shadows, Call of Duty: Black Ops.
  • Unified Features: UNGER extracts unified codes that reflect consistent user affinities across dimensions like Gameplay Type and Genre (action-oriented), Audience Targeting (mature content), and Platform Engagement (PlayStation-compatible titles). Unlike static attributes, these Unicodes dynamically capture item-level and user-dependent information.

6.6.2. Autoregressive Unified Code Decoding

UNGER decodes the next most probable unified code sequence, e.g., [66, 100, 168, 99]; a minimal sketch of this decoding loop follows the list below.

  • Code 66 (Domain: Video Games): Anchors the generation within the gaming domain.
  • Code 100 (Mature Content Affinity): Infers a preference for mature content from prior choices.
  • Code 168 (Action-Oriented Gameplay): Captures interest in fast-paced action.
  • Code 99 (Platform: PlayStation): Implies consistent engagement with PlayStation titles. These codes, though not direct human labels, align with high-level concepts, providing interpretable projections and guiding the generation process.
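
The decoding loop itself can be sketched as follows. Everything here is an illustrative stand-in rather than UNGER's actual implementation: the tiny recurrent decoder replaces the real Transformer, greedy selection replaces beam search, and the code-to-item lookup is hypothetical.

```python
import torch

VOCAB, CODE_LEN = 256, 4                       # assumed code vocabulary and code length

class TinyDecoder(torch.nn.Module):
    """Stand-in for the model that predicts the next unified-code token."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 32)
        self.rnn = torch.nn.GRU(32, 32, batch_first=True)
        self.out = torch.nn.Linear(32, VOCAB)

    def forward(self, tokens):                 # tokens: (1, t) history of code tokens
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h[:, -1])              # logits over the next code token

model = TinyDecoder().eval()
tokens = torch.tensor([[66, 100, 12, 7]])      # user history flattened into code tokens

generated = []
with torch.no_grad():
    for _ in range(CODE_LEN):                  # emit one code token per step
        next_tok = model(tokens).argmax(-1)    # greedy choice; beam search in practice
        generated.append(int(next_tok))
        tokens = torch.cat([tokens, next_tok.view(1, 1)], dim=1)

code_to_item = {(66, 100, 168, 99): "Red Dead Redemption"}   # illustrative lookup
print(generated, "->", code_to_item.get(tuple(generated), "<unseen code>"))
```

In the real system, constrained beam search over valid code prefixes would typically replace the greedy argmax so that every generated sequence maps to an existing item.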

6.6.3. Recommendation Interpretation: Red Dead Redemption

The generated unified code sequence maps to Red Dead Redemption.

  • Alignment: The recommended game aligns perfectly with the inferred codes: Video Game Domain, Mature Content, Action-Adventure Gameplay, and PlayStation Platform.
  • Interpretability: This demonstrates that UNGER's learned representations, while discrete, align well with interpretable item features, enabling natural post-hoc explanations.

6.6.4. Discussion and Insights

  • Bridging Behavior and Reasoning: The case study highlights UNGER's ability to bridge implicit user behavior and explicit recommendation reasoning through discrete code generation.
  • Incremental High-Fidelity Representation: Unlike traditional systems that rely on uninterpretable latent embeddings, UNGER builds high-fidelity representations of user intent incrementally.
  • Transparent and Controllable Generation: The sequential decoding procedure refines the search space across multiple dimensions (domain, content style, genre, platform), mimicking a step-by-step decision-making process. The discrete nature of Unicodes provides a transparent layer between user history and item selection, validating UNGER's design objectives: transparency, personalization, and controllable generation.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces UNGER, a novel and effective framework for generative recommendation that addresses critical limitations of existing approaches. Its core innovation lies in leveraging a unified code (Unicodes) to cohesively integrate both semantic and collaborative knowledge. UNGER achieves this through a two-stage process: first, a unified code generation framework that employs a learnable modality adaptation layer with AdaLN and a joint optimization of cross-modality knowledge alignment and next-item prediction tasks to adaptively fuse embeddings and resolve semantic dominance. Second, a generative recommendation phase augmented with an intra-modality knowledge distillation task using a special token, which effectively compensates for information loss introduced during quantization. Extensive experiments on three Amazon datasets demonstrate UNGER's consistent superiority over state-of-the-art classical and generative methods, while also exhibiting desirable scaling law characteristics with respect to model depth, width, and data volume. The case study further highlights its interpretability and ability to generate highly personalized and explainable recommendations.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Enriching Semantic Modality:
    • Limitation: Current semantic embeddings are primarily textual.
    • Future Work: Extend UNGER by incorporating visual embeddings (from item images/video frames using pre-trained deep neural networks [15, 46] and Vision Transformers [10]) and audio/speech representations (for music/podcasts). This would capture stylistic, aesthetic, and affective aspects of user preferences not conveyed by text alone.
  • Improving Tokenizer:
    • Limitation: The current quantization process (Hierarchical K-means) could be further optimized.
    • Future Work: Adopt more advanced vector quantization techniques from other fields, such as LFQ [64], IBQ [50], and ActionPiece [19]. These techniques promise better convergence properties, higher codebook utilization, and more expressive latent representations, which are expected to enhance generation quality and model robustness.
  • Evaluation on Denser Datasets:
    • Limitation: The current evaluation is restricted to relatively sparse subsets of the Amazon dataset, consistent with prior work. This means UNGER's behavior on denser recommendation scenarios has not been systematically explored.
    • Future Work: Conduct evaluations on denser recommendation scenarios to establish broader applicability and generalizability of the proposed framework.

7.3. Personal Insights & Critique

UNGER presents a highly compelling and well-motivated advancement in generative recommendation. The core idea of a unified code is elegant and addresses a practical pain point (cost and complexity) of dual-code systems. The explicit tackling of semantic dominance is particularly insightful, as naive fusion of modalities is a common pitfall in multimodal learning. The quantitative evidence from Table 6 and Table 7 strongly supports the authors' claims regarding this issue and UNGER's success in overcoming it.

  • Applicability & Transferability: The unified code concept, coupled with the plug-and-play nature of the semantic encoder, makes UNGER highly adaptable. Its methodology could be transferred to other domains beyond e-commerce, such as news recommendation (integrating article text with user click history) or even scientific paper recommendation (integrating abstract/full text with citation networks). The framework's modularity suggests that as multimodal foundation models improve, UNGER could seamlessly incorporate richer semantic embeddings (e.g., combining text, image, and video for movie recommendations).

  • Potential Issues/Areas for Improvement:

    • Codebook Design & Granularity: While Hierarchical K-means is effective, the trade-off between representational richness and decoding complexity (as seen in the number of clusters analysis) suggests that optimal codebook design remains crucial. Further research into adaptive codebook sizes or more dynamically learned hierarchical structures could be beneficial.
    • Computational Cost of Stage I: Although Stage II is efficient, the Stage I training involves multiple components (DIN, Llama2-7b, CKA, seq loss). While the paper indicates less than 1 GPU hour for total training, it would be insightful to understand the breakdown of computational cost for Stage I to ensure it's not a bottleneck in scenarios with rapidly evolving item sets that require frequent Unicode regeneration.
    • Cold-Start Performance: While UNGER implicitly helps cold-start items by leveraging their semantic information through Unicodes, a dedicated cold-start evaluation might further highlight this benefit, especially for new items with no interaction history.
    • Explainability beyond Case Study: The case study provides excellent interpretability, attributing codes to high-level concepts. Further exploration into a more systematic, perhaps quantitative, measure of Unicode interpretability could be valuable. For instance, can Unicodes directly correspond to human-defined tags with high fidelity?
  • Inspiration: The paper provides a strong blueprint for designing efficient and effective multimodal generative models. It emphasizes that true integration goes beyond mere concatenation, requiring adaptive alignment and thoughtful compensation for intrinsic architectural limitations like quantization loss. The scaling law study is also inspiring, pointing towards a future where larger models and datasets could unlock even greater performance in recommendation, aligning the field with trends seen in NLP and computer vision.
