
From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

Published: 11/11/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents GenCDR, a novel generative cross-domain recommendation framework that overcomes limitations of traditional methods by using domain-adaptive tokenization for disentangled semantic IDs, significantly improving recommendation accuracy and generalization across multiple domains.

Abstract

Cross-domain recommendation (CDR) is crucial for improving recommendation accuracy and generalization, yet traditional methods are often hindered by the reliance on shared user/item IDs, which are unavailable in most real-world scenarios. Consequently, many efforts have focused on learning disentangled representations through multi-domain joint training to bridge the domain gaps. While recent Large Language Model (LLM)-based approaches show promise, they still face critical challenges, including: (1) the item ID tokenization dilemma, which leads to vocabulary explosion and fails to capture high-order collaborative knowledge; and (2) insufficient domain-specific modeling for the complex evolution of user interests and item semantics. To address these limitations, we propose GenCDR, a novel Generative Cross-Domain Recommendation framework. GenCDR first employs a Domain-adaptive Tokenization module, which generates disentangled semantic IDs for items by dynamically routing between a universal encoder and domain-specific adapters. Symmetrically, a Cross-domain Autoregressive Recommendation module models user preferences by fusing universal and domain-specific interests. Finally, a Domain-aware Prefix-tree enables efficient and accurate generation. Extensive experiments on multiple real-world datasets demonstrate that GenCDR significantly outperforms state-of-the-art baselines. Our code is available in the supplementary materials.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

1.2. Authors

  • Peiyu Hu (1, 2)

  • Wayne Lu (1, 2)

  • Jia Wang (1, 2)

    Affiliations:

  1. Xi'an Jiaotong-Liverpool University, Suzhou, China
  2. University of Liverpool, Liverpool, United Kingdom

1.3. Journal/Conference

The paper is published on arXiv, a preprint server, under the identifier arxiv.org/abs/2511.08006v1. While arXiv hosts preprints, papers often undergo peer review and are subsequently published in reputable conferences or journals in fields like recommender systems, machine learning, or natural language processing (e.g., SIGIR, KDD, AAAI, NeurIPS, ACM TiiS, TKDE). The specified publication date suggests it's a recent work.

1.4. Publication Year

2025

1.5. Abstract

Cross-domain recommendation (CDR) is crucial for improving recommendation accuracy and generalization, yet traditional methods are often hindered by the reliance on shared user/item IDs, which are unavailable in most real-world scenarios. Consequently, many efforts have focused on learning disentangled representations through multi-domain joint training to bridge the domain gaps. While recent Large Language Model (LLM)-based approaches show promise, they still face critical challenges, including: (1) the item ID tokenization dilemma, which leads to vocabulary explosion and fails to capture high-order collaborative knowledge; and (2) insufficient domain-specific modeling for the complex evolution of user interests and item semantics. To address these limitations, we propose GenCDR, a novel Generative Cross-Domain Recommendation framework. GenCDR first employs a Domain-adaptive Tokenization module, which generates disentangled semantic IDs for items by dynamically routing between a universal encoder and domain-specific adapters. Symmetrically, a Cross-domain Autoregressive Recommendation module models user preferences by fusing universal and domain-specific interests. Finally, a Domain-aware Prefix-tree enables efficient and accurate generation. Extensive experiments on multiple real-world datasets demonstrate that GenCDR significantly outperforms state-of-the-art baselines.

  • Official Source/PDF Link: https://arxiv.org/pdf/2511.08006.pdf
  • Publication Status: This paper is a preprint available on arXiv, indicating it has not yet undergone formal peer review or been accepted by a conference or journal.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is improving Cross-Domain Recommendation (CDR) accuracy and generalization, especially in real-world scenarios where shared user or item identifiers (IDs) are often unavailable. Traditional recommendation methods heavily rely on these IDs for knowledge transfer between domains, which creates a significant bottleneck when IDs are not strictly aligned.

The problem is important because users frequently interact across diverse online services (e-commerce, social media, content streaming), generating rich behavioral data that, if leveraged effectively, can greatly enhance recommendation quality. However, existing CDR methods face several challenges:

  1. Reliance on Shared IDs: Most traditional CDR methods assume shared user/item IDs, which is unrealistic for many cross-domain applications.

  2. Item ID Tokenization Dilemma: Recent Large Language Model (LLM)-based approaches, while promising due to their semantic understanding capabilities, struggle with item representation. Using traditional item IDs with LLMs leads to a vocabulary explosion (too many unique IDs) and fails to capture high-order collaborative knowledge (complex relationships between items and users).

  3. Insufficient Domain-Specific Modeling: Current LLM-based methods often do not adequately model the complex evolution of user interests and item semantics in a domain-specific manner, failing to disentangle universal preferences from domain-specific nuances (e.g., an "Apple" as a fruit vs. an "Apple" as a tech brand).

    The paper's entry point or innovative idea is to address these challenges by introducing a generative semantic ID paradigm into LLM-based cross-domain recommendation. It moves away from arbitrary item IDs towards semantically rich tokens that are transferable across domains, while also incorporating adaptive mechanisms to model both universal and domain-specific aspects of items and user interests.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. Novel Generative Framework (GenCDR): It proposes GenCDR, the first framework to introduce the generative semantic ID paradigm into LLM-based cross-domain recommendation, effectively resolving the item ID tokenization dilemma. This allows LLMs to process items from diverse domains using semantically meaningful tokens.

  2. Domain-adaptive Tokenization Module: GenCDR includes a module that dynamically disentangles and precisely models both universal and item-wise domain-specific knowledge at the semantic level. This ensures items are represented with both shared meaning and domain-specific attributes.

  3. Cross-Domain Autoregressive Recommendation Module: A symmetric module is designed to effectively disentangle and fuse universal and user-wise domain-specific interests during the recommendation process, enabling personalized recommendations that consider the user's overall preferences and their specific preferences within a target domain.

  4. Domain-aware Prefix-tree Decoding Strategy: A Prefix-tree mechanism is introduced to guide the decoding process, ensuring efficient and accurate generation of valid semantic IDs in cross-domain scenarios, preventing the generation of "hallucinated" (non-existent) recommendations.

    The key findings demonstrate that GenCDR significantly outperforms state-of-the-art baselines across multiple real-world cross-domain datasets in terms of both accuracy and generalization, validating the effectiveness of its generative semantic ID paradigm and adaptive modeling approach.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Cross-Domain Recommendation (CDR)

Cross-Domain Recommendation (CDR) refers to the task of improving recommendation performance in one target domain by transferring knowledge from other source domains. This is particularly useful in situations like cold-start (new users or items with limited data) or data sparsity (insufficient interactions in a single domain). The core idea is that user preferences or item characteristics learned in one domain might be relevant and transferable to another, even if the domains are different (e.g., recommending movies based on a user's book preferences).

Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They excel at understanding, generating, and processing human language. Key capabilities relevant to recommendation include:

  • Semantic Understanding: LLMs can capture the meaning and relationships between words, phrases, and concepts, which is crucial for understanding item descriptions and user reviews.
  • Sequence Generation: LLMs can generate coherent and contextually relevant sequences of text, making them suitable for generative recommendation tasks where the goal is to predict a sequence of items.
  • World Knowledge: Through pre-training on diverse internet data, LLMs acquire a broad base of world knowledge that can enrich item and user representations beyond what is available in a specific recommendation dataset.

Item ID Tokenization

In recommender systems, item ID tokenization is the process of mapping unique items (e.g., a specific product, movie, or song) to discrete tokens or identifiers that can be processed by a model. Traditionally, this involves assigning an arbitrary integer ID to each item. However, for LLMs, which operate on sequences of meaningful tokens (like words or subwords), these arbitrary item IDs are problematic. The "item ID tokenization dilemma" arises because:

  • Vocabulary Explosion: If each item is a unique token, the vocabulary size becomes extremely large, making LLM training inefficient and difficult.
  • Lack of Semantics: Arbitrary IDs carry no inherent semantic meaning or relationship to other items, hindering an LLM's ability to generalize or understand collaborative knowledge (patterns of user interactions). The paper proposes semantic IDs to address this, meaning the tokens themselves encode meaningful information about the item.

Residual-Quantized Variational Autoencoder (RQ-VAE)

A Variational Autoencoder (VAE) is a type of neural network that learns a compressed, probabilistic representation (a latent space) of input data. It consists of an encoder that maps input data to the latent space and a decoder that reconstructs the input from a sample in the latent space. Quantization is the process of mapping continuous values to a finite set of discrete values (codebook entries). RQ-VAE is a variant that performs residual quantization. Instead of quantizing the entire latent vector at once, it quantizes the residual (the error or remaining information) after a previous quantization step. This allows for a more expressive and hierarchical representation using multiple codebooks, where each codebook captures different levels of detail or aspects of the input. The output is a sequence of discrete codes (semantic IDs).
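To make the residual quantization step concrete, here is a minimal Python (NumPy) sketch of how a latent vector is mapped to a sequence of discrete codes. The number of codebooks, codebook size, latent dimension, and random initialization are purely illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of residual quantization (the core of an RQ-VAE); illustrative only.
import numpy as np

def residual_quantize(z, codebooks):
    """Map a continuous latent vector z to M discrete codes (one per codebook)."""
    residual = z.copy()
    codes, quantized = [], np.zeros_like(z)
    for codebook in codebooks:                      # one quantization level per codebook
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))                 # nearest codebook entry
        codes.append(idx)
        quantized += codebook[idx]                  # accumulate chosen vectors
        residual = residual - codebook[idx]         # the remainder is quantized next level
    return codes, quantized                         # codes form the item's semantic ID

# Example: 3 codebooks of 256 entries each over a 32-d latent space (hypothetical sizes)
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]
codes, z_hat = residual_quantize(rng.normal(size=32), codebooks)
print(codes)   # e.g. [17, 203, 88] -> a 3-token semantic ID
```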

Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a Parameter-Efficient Fine-Tuning (PEFT) technique used to adapt large pre-trained models (like LLMs) to new tasks or domains with minimal computational cost. Instead of fine-tuning all parameters of the large model, LoRA injects small, trainable low-rank matrices into the Transformer layers. When fine-tuning, only these low-rank matrices are updated, while the original pre-trained model weights remain frozen. This significantly reduces the number of trainable parameters, memory usage, and training time, making fine-tuning LLMs more accessible and efficient.

Autoregressive Models

An autoregressive model is a type of statistical model that predicts future values based on past values. In the context of sequence generation (like LLMs generating text or recommending items), it means that the prediction of the next token in a sequence is conditioned on all previously generated tokens. For example, to predict the third word in a sentence, an autoregressive model would consider the first and second words. This allows LLMs to generate coherent and contextually relevant sequences.

Prefix-tree

A prefix-tree (also known as a trie) is a tree-like data structure used to store a dynamic set or associative array where the keys are usually strings. In the context of generative recommendation, a domain-aware prefix-tree can store all valid sequences of semantic IDs for items within a specific domain. During the generation process, this tree can be used to constrain the LLM's output, ensuring that only valid and existing semantic ID sequences (corresponding to real items) are generated. This prevents the LLM from "hallucinating" non-existent items or generating semantically incorrect sequences.
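As an illustration of how a trie constrains generation, the following minimal sketch stores a few hypothetical semantic-ID sequences and returns the codes allowed after a given prefix; the IDs are made up for the example.

```python
# Minimal sketch of a prefix-tree (trie) over valid semantic-ID sequences.
class PrefixTree:
    def __init__(self):
        self.root = {}

    def insert(self, sid_sequence):
        node = self.root
        for code in sid_sequence:
            node = node.setdefault(code, {})

    def valid_next(self, prefix):
        """Return the set of codes that can legally follow `prefix`."""
        node = self.root
        for code in prefix:
            if code not in node:
                return set()          # prefix does not correspond to any real item
            node = node[code]
        return set(node.keys())

tree = PrefixTree()
for sid in [(17, 203, 88), (17, 203, 91), (42, 7, 5)]:   # hypothetical item SIDs
    tree.insert(sid)
print(tree.valid_next((17, 203)))    # {88, 91}: only codes leading to real items survive
```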

Variational Information Bottleneck (VIB)

The Variational Information Bottleneck (VIB) principle is a theoretical framework that aims to learn compressed representations of data that are maximally informative about a target variable while minimizing the information content related to other irrelevant aspects. In this paper, it's used as a regularization technique for the routing networks. By applying VIB, the routers are encouraged to extract only the most essential information from an item or user history to make a routing decision (i.e., whether to prioritize universal or domain-specific knowledge), thereby promoting more disentangled representations and preventing overfitting. It measures the mutual information between the input and the learned representation and tries to compress it while retaining task-relevant information. The KL-divergence term $D_{\mathrm{KL}}(q(\mathbf{z}_r \mid \mathbf{x}) \parallel p(\mathbf{z}_r))$ in the paper is a common way to implement VIB regularization, encouraging the posterior distribution $q(\mathbf{z}_r \mid \mathbf{x})$ (the router's internal representation distribution given the input) to be close to a simple prior distribution $p(\mathbf{z}_r)$ (e.g., a standard normal distribution), thus compressing the information.

3.2. Previous Works

The paper discusses related work in three main categories:

Cross-Domain Sequential Recommendation (CDSR)

  • Traditional CDSR: Many early methods rely on collaborative item IDs as a bridge for knowledge transfer. They often employ architectures like gating mechanisms, attention modules (e.g., Kang and McAuley 2018; Lu and Yin 2025; Cui et al. 2025), or Graph Neural Networks (GNNs) (e.g., Liu et al. 2024b; Li and Lu 2024; Cao et al. 2022) to fuse and transfer knowledge, sometimes enhanced with contrastive learning objectives (e.g., Ma et al. 2024; Xie et al. 2022).
  • Limitations: These methods are inherently limited by the availability of shared user/item IDs, which is a strong assumption often not met in real-world scenarios.
  • Semantic-enhanced CDSR: More recent trends incorporate richer semantic information from pre-trained language models (e.g., Liu et al. 2025c; Li et al. 2022) to overcome ID-based limitations.
  • Gap Addressed by GenCDR: GenCDR aims to effectively integrate these semantics into a unified generative framework, explicitly disentangling shared and domain-specific knowledge, which remains an open challenge for existing methods.

Generative Recommendation

  • Paradigm Shift: Generative recommendation reframes the recommendation task from item ranking to autoregressive sequence generation of semantic item IDs (e.g., Petrov and Macdonald 2023; Hou et al. 2025).
  • Item Tokenization: A critical aspect is the construction of item IDs. Approaches include:
    • Content-based tokenization: Using vector quantization like RQ-VAE (e.g., Li et al. 2025a; Rajput et al. 2023).
    • Structure-aware methods: Employing hierarchical clustering (e.g., Si et al. 2024).
    • Collaborative signals: Embedding these directly into tokenization (e.g., Mo et al. 2024).
  • Limitations: These techniques have been developed almost exclusively for single-domain datasets (e.g., Zheng et al. 2025).
  • Gap Addressed by GenCDR: GenCDR extends this generative paradigm to complex, multi-domain environments, which is an unexplored research question in prior generative recommendation works.

Large Language Models for Recommendation

  • LLM as Auxiliary Components: LLMs are used to enhance traditional models by providing rich semantic features or data augmentation (e.g., Sun et al. 2024; Yin et al. 2025; Zhang et al. 2025a; Yuan et al. 2025).
  • LLM as Core Generative Engines: LLMs are used to reformulate recommendation as autoregressively predicting item IDs (e.g., Rajput et al. 2023; Zheng et al. 2024; Lin et al. 2024).
  • Fine-tuning: Parameter-efficient fine-tuning (PEFT) techniques like LoRA (e.g., Hu et al. 2022; Bao et al. 2023; Liu et al. 2025a; Zhang et al. 2023) are crucial for aligning LLMs to recommendation tasks.
  • Limitations: Existing work predominantly focuses on single-domain applications.
  • Gap Addressed by GenCDR: GenCDR tackles the challenge of effective knowledge transfer and representation across heterogeneous domains for LLM-based recommenders, which has largely been unaddressed.

3.3. Technological Evolution

The field of recommender systems has evolved from collaborative filtering based on user-item interactions (often using IDs) to incorporating richer content-based features. With the rise of deep learning, sequential recommendation models (e.g., SASRec, BERT4Rec) emerged, leveraging Transformer architectures to capture dynamic user preferences within a single domain.

The cross-domain recommendation paradigm arose to combat data sparsity and cold-start issues by transferring knowledge between related domains. Early CDR methods often relied on shared user/item IDs or mapping functions to align entities across domains. However, the limitation of ID-based approaches led to the exploration of semantic information and representation learning to bridge domain gaps more robustly.

The advent of Large Language Models (LLMs) marked a significant shift. Initially, LLMs were used to enrich item/user features with their vast world knowledge. More recently, researchers started framing recommendation as a language generation task, directly using LLMs to predict item sequences, often represented by item IDs. However, these LLM-based approaches still struggled with the inherent limitations of arbitrary item IDs (vocabulary explosion, lack of semantics) and the challenge of effectively disentangling universal and domain-specific knowledge in heterogeneous cross-domain settings.

GenCDR fits into this evolution by pushing the LLM-based generative recommendation paradigm further into the cross-domain setting. It innovates by:

  1. Replacing arbitrary item IDs with generative semantic IDs derived from item content, making them inherently transferable and meaningful for LLMs.
  2. Developing adaptive mechanisms (Domain-adaptive Tokenization, Cross-Domain Autoregressive Recommendation with routing networks) that explicitly model and fuse both universal and domain-specific aspects of items and user interests, overcoming the "insufficient domain-specific modeling" challenge.
  3. Ensuring efficient and valid generation through a Domain-aware Prefix-tree, which is crucial for practical LLM-based recommenders.

3.4. Differentiation Analysis

Compared to the main methods in related work, GenCDR offers several core differences and innovations:

  • Compared to Traditional ID-based CDSR (e.g., C2DSR, TriCDR):

    • Core Difference: GenCDR completely bypasses the reliance on shared user/item IDs by introducing semantic IDs. Traditional CDSR models use IDs or align ID-based embeddings.
    • Innovation: Semantic IDs are inherently transferable and capture rich content semantics, enabling knowledge transfer without explicit ID mapping. This tackles the ID-dilemma directly.
  • Compared to Generative Recommendation Systems (GRS) (e.g., VQ-Rec, TIGER, HSTU):

    • Core Difference: Existing GRS models are primarily designed for single-domain scenarios. While they use vector quantization for item tokenization, they don't explicitly address the complexities of cross-domain knowledge transfer or domain-specific adaptation.
    • Innovation: GenCDR extends the generative paradigm to cross-domain recommendation by integrating a Domain-adaptive Tokenization module and a Cross-Domain Autoregressive Recommendation module with dynamic routing, allowing for disentanglement and fusion of universal and domain-specific knowledge. This makes generative recommendations effective in heterogeneous environments.
  • Compared to LLM-based Recommendation (e.g., LLM4CDSR):

    • Core Difference: While LLM4CDSR formulates CDR as a text generation task using LLMs, it still faces the item ID tokenization dilemma and insufficient domain-specific modeling. LLM4CDSR might rely on prompting or fine-tuning LLMs to predict item IDs or textual descriptions, but without a structured semantic ID system or explicit disentanglement mechanisms.
    • Innovation:
      1. GenCDR explicitly tackles the item ID tokenization dilemma by generating disentangled semantic IDs using an RQ-VAE and domain-specific adapters, providing a more structured and semantically rich input for the LLM.

      2. It uses dynamic routing networks at both item and user levels to adaptively fuse universal and domain-specific knowledge, addressing the "insufficient domain-specific modeling" challenge directly, which is often implicitly handled or overlooked in other LLM-based CDR methods.

      3. The Domain-aware Prefix-tree ensures efficient and valid semantic ID generation, a practical consideration for LLM-based systems.

        In essence, GenCDR uniquely combines the power of generative models and LLMs with a principled approach to semantic item representation and adaptive knowledge fusion to overcome the fundamental challenges in cross-domain recommendation.

4. Methodology

4.1. Principles

The core idea behind GenCDR is to move beyond arbitrary item IDs and embrace discrete semantic IDs (SIDs) as the fundamental representation for items in cross-domain recommendation. This paradigm shift allows Large Language Models (LLMs) to leverage their powerful semantic understanding and sequence generation capabilities more effectively. The theoretical basis is that raw semantic information (e.g., text descriptions) is inherently more transferable across domains than arbitrary item IDs.

The framework operates on three key intuitions:

  1. Semantic Richness and Transferability: By converting items into semantic IDs that capture their inherent meaning, these IDs become transferable between domains, circumventing the need for shared item IDs. This is achieved by balancing domain-agnostic universal semantics with domain-specific discriminative features.

  2. Adaptive Disentanglement and Fusion: User interests and item characteristics are a blend of universal (general) and domain-specific (niche) aspects. GenCDR proposes to explicitly model and dynamically fuse these two types of knowledge at both the item-level (for semantic ID generation) and the user-level (for recommendation generation), preventing negative transfer and enabling fine-grained personalization.

  3. Efficient and Valid Generation: To ensure practicality, the generative process must be efficient and produce valid recommendations. This is addressed by constraining the generation space using a prefix-tree.

The overall architecture of GenCDR is illustrated in Figure 2 (a). It consists of three main modules:

  1. Domain-adaptive Tokenization: Converts items into semantic IDs.

  2. Cross-Domain Autoregressive Recommendation: Models user preferences and generates semantic IDs for recommendations.

  3. Domain-aware Prefix-tree: Guides efficient and accurate inference.

4.2. Core Methodology In-depth (Layer by Layer)

The GenCDR framework is designed to tackle the item ID tokenization dilemma and insufficient domain-specific modeling. Let's break down its components:

4.2.1. Domain-adaptive Tokenization

This module is responsible for generating unified Semantic IDs (SIDs) for items. It balances domain-agnostic universal semantics with domain-specific discriminative features to create expressive representations suitable for generative recommendation tasks. The SIDs are designed to have Semantic Richness (capturing comprehensive item semantics) and Semantic Similarity (ensuring similar items across domains share comparable IDs). The process is visualized in Figure 2 (b).

4.2.1.1. Domain-Universal Semantic Token Generation

To establish a foundational universal semantic understanding, the module uses a Universal Discrete Semantic Encoder based on a Residual-Quantized Variational Autoencoder (RQ-VAE) framework. The RQ-VAE consists of an encoder $E$, a decoder $D$, and $M$ codebooks. It is pre-trained on the textual features of all items.

The process for converting an item's feature embedding $\mathbf{x}$ into a sequence of discrete codes $\mathbf{c} = (c_0, \dots, c_{M-1})$ is as follows:

  1. The encoder $E$ maps the item embedding $\mathbf{x}$ to a continuous latent representation $\mathbf{z} = E(\mathbf{x})$.

  2. An initial residual vector is set: $\mathbf{r}_0 = \mathbf{z}$.

  3. For each level $d$ from $0$ to $M-1$:

    • The current residual $\mathbf{r}_d$ is quantized by finding the nearest vector $\mathbf{e}_{c_d}$ in the $d$-th codebook $\mathcal{C}_d$; the index $c_d$ corresponds to this nearest vector:
      $$c_d = \arg\min_k \|\mathbf{r}_d - \mathbf{e}_k\|^2$$
    • The next residual is computed by subtracting the quantized vector: $\mathbf{r}_{d+1} = \mathbf{r}_d - \mathbf{e}_{c_d}$.
  4. The final quantized representation $\hat{\mathbf{z}}$ is the sum of all chosen codebook vectors: $\hat{\mathbf{z}} = \sum_{d=0}^{M-1} \mathbf{e}_{c_d}$.

  5. This quantized representation $\hat{\mathbf{z}}$ is then passed through the decoder $D$ to reconstruct the original item embedding: $\hat{\mathbf{x}} = D(\hat{\mathbf{z}})$.

    The RQ-VAE is optimized using a joint objective function during pre-training:

  • Reconstruction Loss ($\mathcal{L}_{\mathrm{REC}}$): Measures how well the decoder reconstructs the original input.
    $$\mathcal{L}_{\mathrm{REC}} = \|\bar{\mathbf{x}} - \hat{\mathbf{x}}\|^2$$
    • $\bar{\mathbf{x}}$: The original input item feature embedding.
    • $\hat{\mathbf{x}}$: The reconstructed item feature embedding from the decoder.
    • $\|\cdot\|^2$: The squared Euclidean distance (L2 norm).
  • Quantization Loss ($\mathcal{L}_Q$): Ensures the encoder's output aligns with the codebook vectors. It includes commitment terms from VQ-VAE to pull the encoder output towards the codebook vectors and codebook vectors towards the encoder output.
    $$\mathcal{L}_Q = \sum_{d=0}^{M-1} \|\mathrm{sg}(\mathbf{r}_d) - \mathbf{e}_{c_d}\|^2 + \beta \|\mathbf{r}_d - \mathrm{sg}(\mathbf{e}_{c_d})\|^2$$
    • $\mathbf{r}_d$: The residual vector at level $d$ before quantization.
    • $\mathbf{e}_{c_d}$: The chosen codebook vector from codebook $\mathcal{C}_d$.
    • $\mathrm{sg}(\cdot)$: The stop-gradient operator, which means its argument is treated as a constant during backpropagation.
    • $\beta$: A hyperparameter balancing the two terms of the quantization loss.
    • The first term encourages the encoder output to commit to the codebook vector, while the second term updates the codebook vector towards the encoder output.
  • Masked Token Modeling (MTM) Loss ($\mathcal{L}_{\mathrm{MTM}}$): To ensure SIDs are contextually coherent, this loss trains the model to predict masked codes from their surrounding context, similar to BERT.
    $$\mathcal{L}_{\mathrm{MTM}} = -\mathbb{E}_{\mathbf{x} \sim \mathcal{X},\, I_{\mathrm{mask}}} \left[ \sum_{i \in I_{\mathrm{mask}}} \log P(c_i \mid S_{\mathrm{masked}}; \theta_{\mathrm{ctx}}) \right]$$
    • $\mathcal{X}$: The set of all item feature embeddings.

    • $I_{\mathrm{mask}}$: The set of indices of masked codes in a semantic ID sequence.

    • $c_i$: A masked semantic ID at index $i$.

    • $S_{\mathrm{masked}}$: The semantic ID sequence with some codes masked.

    • $\theta_{\mathrm{ctx}}$: Parameters of a contextual model (e.g., a Transformer) that predicts masked codes.

    • $\log P(c_i \mid S_{\mathrm{masked}}; \theta_{\mathrm{ctx}})$: The log-probability of predicting the correct masked code $c_i$ given the masked sequence and contextual model parameters.

      The total pre-training loss is a weighted sum:
      $$\mathcal{L}_{\mathrm{pretrain}} = \mathcal{L}_{\mathrm{REC}} + \mu \mathcal{L}_Q + \lambda \mathcal{L}_{\mathrm{MTM}}$$

  • $\mu, \lambda$: Hyperparameters balancing the different loss terms. Upon completion of this pre-training, the universal encoder and codebooks are frozen.
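A hedged PyTorch-style sketch of these pre-training losses follows. It implements the reconstruction and quantization terms, uses a standard straight-through estimator (an assumption beyond what the text states), and omits the MTM term, which would require a separate contextual model. All shapes, codebook sizes, and loss weights are illustrative, not the paper's settings.

```python
# Sketch of the RQ-VAE reconstruction + quantization objective; illustrative only.
import torch

def rqvae_losses(x, encoder, decoder, codebooks, beta=0.25, mu=1.0):
    """x: (B, d_feat) item embeddings; codebooks: list of (K, D) tensors."""
    z = encoder(x)                                    # continuous latent (B, D)
    residual, z_hat, loss_q = z, torch.zeros_like(z), 0.0
    for codebook in codebooks:                        # one quantization level per codebook
        dists = torch.cdist(residual, codebook)       # (B, K) distances to codebook entries
        e = codebook[dists.argmin(dim=1)]             # nearest entries e_{c_d}, shape (B, D)
        # commitment + codebook terms with stop-gradient (.detach())
        loss_q = loss_q + ((residual.detach() - e) ** 2).sum(-1).mean() \
                        + beta * ((residual - e.detach()) ** 2).sum(-1).mean()
        z_hat = z_hat + e                             # running sum of chosen vectors
        residual = residual - e                       # r_{d+1} = r_d - e_{c_d}
    x_rec = decoder(z + (z_hat - z).detach())         # straight-through estimator
    loss_rec = ((x - x_rec) ** 2).sum(-1).mean()      # reconstruction loss
    return loss_rec + mu * loss_q                     # MTM term omitted in this sketch

# Tiny usage example with hypothetical sizes
enc, dec = torch.nn.Linear(768, 32), torch.nn.Linear(32, 768)
books = [torch.nn.Parameter(torch.randn(256, 32)) for _ in range(3)]
loss = rqvae_losses(torch.randn(8, 768), enc, dec, books)
loss.backward()
```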

4.2.1.2. Domain-specific Semantic Token Adapters

While the universal encoder learns shared semantics, domain-specific nuances need to be captured. This is done using Domain-specific Semantic Token Adapters, implemented with Low-Rank Adaptation (LoRA). For each domain $d$, a lightweight LoRA module is introduced to adapt the frozen universal encoder $E$.

A LoRA module consists of two low-rank matrices, $B_d \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ and $A_d \in \mathbb{R}^{r \times d_{\mathrm{in}}}$, where $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$ is the rank. These matrices augment the frozen weights $W_0 \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$ of the universal encoder. The modified forward pass for an input $h_{\mathrm{in}}$ becomes:
$$h_{\mathrm{out}} = W_0 h_{\mathrm{in}} + B_d A_d h_{\mathrm{in}}$$

  • $h_{\mathrm{in}}$: Input to the augmented layer.

  • $W_0$: Original, frozen weight matrix of the universal encoder.

  • $B_d A_d$: The low-rank update matrix for domain $d$. Only $B_d$ and $A_d$ are trainable.

    The adapted encoder for domain $d$ is denoted $E_{\theta_d}$, where $\theta_d = \{B_d, A_d\}$ are its trainable parameters. These parameters are fine-tuned in a second training phase. For each item embedding $\mathbf{x}$ from domain $d$, the objective is to minimize a self-supervised reconstruction loss:
    $$\mathcal{L}_{\mathrm{adapter}} = \mathbb{E}_{\mathbf{x} \sim \mathcal{X}_d} \left[ \|\mathbf{x} - D(Q(E_{\theta_d}(\mathbf{x})))\|_2^2 \right]$$

  • $\mathcal{X}_d$: The set of item feature embeddings specific to domain $d$.

  • $E_{\theta_d}(\mathbf{x})$: The latent representation produced by the domain-adapted encoder for item $\mathbf{x}$.

  • $Q(\cdot)$: The quantization function that maps the continuous latent representation to discrete semantic IDs.

  • $D(\cdot)$: The decoder that reconstructs the item embedding from the quantized SIDs.

  • The quantization function $Q$ and decoder $D$ remain frozen during this phase. This ensures efficient adaptation by only training the LoRA parameters.
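The following is a minimal PyTorch sketch of a LoRA-augmented linear layer of the kind described above, with the base weight frozen and only the low-rank factors trainable; the dimensions and rank are illustrative assumptions.

```python
# Sketch of a LoRA-augmented linear layer: frozen W0 plus trainable low-rank B_d A_d.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                   # W0 stays frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)    # A_d
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # B_d (zero-init)

    def forward(self, h_in):
        # h_out = W0 h_in + B_d A_d h_in
        return self.base(h_in) + h_in @ self.A.T @ self.B.T

layer = LoRALinear(64, 64, rank=8)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)   # only ['A', 'B'] are updated during domain adaptation
```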

4.2.1.3. Item-level Dynamic Semantic Routing Network

To integrate the universal and domain-specific representations, an Item-level Dynamic Semantic Routing Network is used. This network adaptively balances these representations on a per-item basis, mitigating negative transfer from static fusion.

The routing network $R_\phi$ (e.g., a multi-layer perceptron (MLP)) with parameters $\phi$ takes an item's embedding $\mathbf{x}$ as input and outputs a gating weight $\alpha \in [0, 1]$.

  1. For an item from domain $d$, two latent representations are computed before quantization:
    • Universal representation: $\mathbf{z}_{\mathrm{uni}} = E(\mathbf{x})$ from the frozen universal encoder.
    • Domain-specific representation: $\mathbf{z}_{\mathrm{spec}} = E_{\theta_d}(\mathbf{x})$ from the adapted encoder for domain $d$.
  2. The router calculates the gating weight $\alpha$:
    $$\alpha = \sigma(R_\phi(\mathbf{x}))$$
    • $\sigma(\cdot)$: The sigmoid function, which squashes the output to a range between 0 and 1.
    • $R_\phi(\mathbf{x})$: The output of the routing network for item $\mathbf{x}$.
  3. The universal and domain-specific representations are then fused based on $\alpha$:
    $$\mathbf{z}_{\mathrm{fused}} = (1 - \alpha) \cdot \mathbf{z}_{\mathrm{uni}} + \alpha \cdot \mathbf{z}_{\mathrm{spec}}$$
    • $\mathbf{z}_{\mathrm{fused}}$: The final fused latent representation. If $\alpha$ is close to 0, the universal representation dominates; if $\alpha$ is close to 1, the domain-specific representation dominates.

      The routing network is regularized using the Variational Information Bottleneck (VIB) principle to promote disentangled representations and prevent overfitting. This is enforced via a KL-divergence term:
      $$\mathcal{L}_{\mathrm{VIB}} = D_{\mathrm{KL}}(q(\mathbf{z}_r \mid \mathbf{x}) \parallel p(\mathbf{z}_r))$$

  • $D_{\mathrm{KL}}(\cdot \parallel \cdot)$: The Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions.
  • $q(\mathbf{z}_r \mid \mathbf{x})$: The posterior distribution of the router's internal representation $\mathbf{z}_r$ given the item embedding $\mathbf{x}$. This is typically a learned distribution (e.g., Gaussian with mean and variance predicted by $R_\phi$).
  • $p(\mathbf{z}_r)$: A prior distribution for the router's internal representation (e.g., a standard normal distribution). This loss is incorporated into the second-phase training objective to ensure that the router learns to make routing decisions based on essential information while compressing irrelevant details.
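A hedged sketch of the item-level router is given below: a small MLP predicts a Gaussian over an internal representation $\mathbf{z}_r$ (yielding the VIB-style KL term against a standard normal prior), and a sigmoid gate on that representation blends the two latents. The layer sizes and this particular router architecture are illustrative assumptions.

```python
# Sketch of the item-level dynamic routing with a VIB-style KL penalty; illustrative only.
import torch
import torch.nn as nn

class ItemRouter(nn.Module):
    def __init__(self, d_item, d_router=16):
        super().__init__()
        self.enc = nn.Linear(d_item, 2 * d_router)   # predicts mean and log-variance of z_r
        self.gate = nn.Linear(d_router, 1)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z_r = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        alpha = torch.sigmoid(self.gate(z_r))                    # gating weight in [0, 1]
        # KL(q(z_r | x) || N(0, I)) -- the VIB regularizer
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return alpha, kl

router = ItemRouter(d_item=768)
x = torch.randn(4, 768)                                # item embeddings (hypothetical size)
z_uni, z_spec = torch.randn(4, 32), torch.randn(4, 32) # universal / domain-specific latents
alpha, kl = router(x)
z_fused = (1 - alpha) * z_uni + alpha * z_spec         # per-item blend of the two encoders
```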

4.2.2. Cross-Domain Autoregressive Recommendation

This module leverages the unified SIDs to model user interests and generate personalized recommendations. It employs a parameter-efficient, two-phase fine-tuning strategy, mirroring the item tokenization's dual focus on universal and domain-specific aspects. The process is visualized in Figure 2 (c).

4.2.2.1. Universal Interest Modeling Network

To capture domain-agnostic interest patterns, a Universal Interest Modeling Network is developed. This is achieved by enhancing a pre-trained Large Language Model (LLM) (e.g., Qwen2.5-7B) with a mixture of multiple Low-Rank Adaptation (LoRA) adapters. This collection of $N$ adapters, called universal experts, is trained jointly on aggregated data from all domains. The parameters of the $i$-th universal expert are $\theta_{\mathrm{uni},i}$, and the complete set is $\Theta_{\mathrm{uni}} = \{\theta_{\mathrm{uni},1}, \dots, \theta_{\mathrm{uni},N}\}$.

The input to this network consists of sequences of cross-domain SIDs, $S^u = (c_1^u, c_2^u, \dots, c_t^u)$, representing a user's interaction history. In the initial fine-tuning phase, the universal parameters $\Theta_{\mathrm{uni}}$ are optimized using a standard autoregressive objective: predicting the next semantic ID given the preceding sequence. The training loss is defined as:
$$\mathcal{L}_{\mathrm{uni}} = -\sum_{u \in \mathcal{U}} \sum_{k=1}^{|S^u| - 1} \log P(c_{k+1}^u \mid c_{\le k}^u; \theta_{\mathrm{LLM}}, \Theta_{\mathrm{uni}})$$

  • $\mathcal{U}$: The set of all users.
  • $S^u$: The sequence of semantic IDs for user $u$.
  • $|S^u|$: The length of the sequence for user $u$.
  • $c_{k+1}^u$: The $(k+1)$-th semantic ID in user $u$'s sequence.
  • $c_{\le k}^u$: The sequence of semantic IDs up to index $k$ for user $u$.
  • $\theta_{\mathrm{LLM}}$: The frozen parameters of the base LLM.
  • $\Theta_{\mathrm{uni}}$: The trainable parameters of the universal LoRA adapters. This loss aims to maximize the likelihood of observing the next semantic ID in the sequence given the history. After this phase, $\theta_{\mathrm{LLM}}$ and $\Theta_{\mathrm{uni}}$ are fixed.
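The universal objective is a standard next-token cross-entropy over semantic-ID sequences; the sketch below shows the shape of that computation, where `model` is a hypothetical stand-in for the LoRA-augmented LLM.

```python
# Sketch of the universal autoregressive loss over semantic-ID sequences; illustrative only.
import torch
import torch.nn.functional as F

def universal_interest_loss(model, sid_sequences):
    """sid_sequences: LongTensor (batch, seq_len) of semantic-ID tokens."""
    inputs, targets = sid_sequences[:, :-1], sid_sequences[:, 1:]
    logits = model(inputs)                        # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten positions
        targets.reshape(-1),                      # next-token targets c_{k+1}
    )
```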

4.2.2.2. Domain-specific Interest Adaptation

To capture domain-specific nuances, a second fine-tuning phase trains domain-specific LoRA adapters. For each domain $d \in \mathcal{D}$, a dedicated, trainable LoRA adapter, denoted $\theta_{\mathrm{spec}_d}$, is added to the frozen model. Both the base LLM parameters $\theta_{\mathrm{LLM}}$ and the universal parameters $\Theta_{\mathrm{uni}}$ remain frozen during this phase.

The training loss for each domain $d$ focuses on user sequences $S_d^u$ within that domain, for users $u \in \mathcal{U}_d$ (users who interacted in domain $d$):
$$\mathcal{L}_{\mathrm{spec}_d} = -\sum_{u \in \mathcal{U}_d} \sum_{k=1}^{|S_d^u| - 1} \log P(c_{k+1}^u \mid c_{\le k}^u; \theta_{\mathrm{LLM}}, \Theta_{\mathrm{uni}}, \theta_{\mathrm{spec}_d})$$

  • $\mathcal{U}_d$: The set of users who have interacted in domain $d$.
  • $S_d^u$: The sequence of semantic IDs for user $u$ within domain $d$.
  • $\theta_{\mathrm{spec}_d}$: The trainable parameters of the domain-specific LoRA adapter for domain $d$. This approach efficiently learns domain-specific interest patterns.

4.2.2.3. User-level Dynamic Interest Routing Network

Symmetric to the item-level router, a VIB-regularized User-level Dynamic Interest Routing Network is employed to fuse the predictions from the universal and domain-specific models during inference. This lightweight gate takes the user's history representation $\mathbf{h}_t$ as input to compute a dynamic weight $\gamma \in [0, 1]$.

This weight $\gamma$ is used to fuse the probability distributions over semantic IDs from the universal model ($P_{\mathrm{uni}}$) and the domain-adapted model ($P_{\mathrm{spec}}$):
$$P_{\mathrm{final}}(i \mid S^u) = (1 - \gamma) \cdot P_{\mathrm{uni}}(i \mid S^u) + \gamma \cdot P_{\mathrm{spec}}(i \mid S^u)$$

  • $P_{\mathrm{final}}(i \mid S^u)$: The final predicted probability of recommending item $i$ given user $u$'s sequence $S^u$.
  • $P_{\mathrm{uni}}(i \mid S^u)$: The probability distribution output from the frozen universal network (parameterized by $\Theta_{\mathrm{uni}}$).
  • $P_{\mathrm{spec}}(i \mid S^u)$: The probability distribution output from the network augmented with domain-specific adapters (parameterized by $\Theta_{\mathrm{uni}}$ and $\theta_{\mathrm{spec}_d}$).
  • $\gamma$: The dynamic weight from the user-level routing network, determining the blend of universal and domain-specific predictions. The VIB regularization ensures efficient and robust fusion.
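A minimal sketch of this inference-time fusion is shown below; the gate architecture and the hidden size of the history representation are illustrative assumptions.

```python
# Sketch of the user-level fusion of universal and domain-adapted predictions; illustrative only.
import torch
import torch.nn as nn

gate = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def fuse_predictions(h_t, p_uni, p_spec):
    """h_t: (batch, 512) history representation; p_uni, p_spec: (batch, vocab) distributions."""
    gamma = gate(h_t)                             # (batch, 1), dynamic weight in [0, 1]
    return (1 - gamma) * p_uni + gamma * p_spec   # P_final over semantic-ID tokens
```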

4.2.3. Inference - Domain-aware Prefix-tree

To ensure efficient and valid semantic ID generation, GenCDR utilizes a Domain-aware Prefix-tree mechanism. This addresses the limitations of standard autoregressive decoding (computational inefficiency and invalid ID outputs).

  1. For each domain $d \in \mathcal{D}$, an offline prefix tree $T_d$ is constructed. This tree encodes all valid semantic ID sequences that can be produced by the Domain-adaptive Tokenization module for items in that domain.
  2. During inference, when a target domain $d_t$ is specified, its corresponding tree $T_{d_t}$ guides the generation process.
  3. At each decoding step $k$, the prefix-tree identifies a valid subset of next codes $V_{\mathrm{valid}}(s_{k-1}) \subset C_k$ based on the current prefix $s_{k-1}$ (the sequence of semantic IDs generated so far).
  4. The LLM's predictions are then constrained to this subset using a masked softmax function:
    $$P(c_k \mid s_{k-1}, T_{d_t}) = \frac{\exp(z_k)}{\sum_{c' \in V_{\mathrm{valid}}(s_{k-1})} \exp(z_{c'})}$$
    • $P(c_k \mid s_{k-1}, T_{d_t})$: The probability of choosing semantic ID $c_k$ as the next token, conditioned on the previous sequence $s_{k-1}$ and the prefix-tree $T_{d_t}$ for the target domain.
    • $z_k$: The LLM's raw logit score for semantic ID $c_k$.
    • $V_{\mathrm{valid}}(s_{k-1})$: The set of valid next semantic IDs allowed by the prefix-tree given the current prefix $s_{k-1}$.
    • The softmax is computed only over the valid semantic IDs, ensuring that only existing and semantically correct items are generated. This approach ensures valid sequence generation while significantly reducing computational overhead by pruning the search space.
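To make the constrained decoding step concrete, the sketch below masks logits outside the prefix-tree's valid set before the softmax; `valid_next` is assumed to behave like the trie sketched earlier in the Prefix-tree background, and greedy selection stands in for whatever decoding strategy (e.g., beam search) is actually used.

```python
# Sketch of one prefix-tree-constrained decoding step; illustrative only.
import torch

def constrained_step(logits, prefix, valid_next):
    """logits: (vocab,) raw scores for the next code given `prefix` (a tuple of codes)."""
    allowed = valid_next(tuple(prefix))              # codes permitted by the tree (assumed non-empty)
    mask = torch.full_like(logits, float("-inf"))
    mask[list(allowed)] = 0.0                        # keep only valid codes
    probs = torch.softmax(logits + mask, dim=-1)     # masked softmax over the valid subset
    return int(torch.argmax(probs))                  # greedy pick of the next semantic ID
```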

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three pairs of cross-domain datasets, each representing different real-world scenarios:

  1. Sports-Clothing (Leisure): Derived from the Amazon product review dataset (McAuley et al. 2015).

  2. Phones-Electronics (Technology): Derived from the Amazon product review dataset (McAuley et al. 2015).

  3. Books-Movies (Entertainment): Collected from Douban (Zhu et al. 2019, 2020).

    These datasets are chosen to validate the model's performance across diverse domains with varying characteristics and overlap levels. For data samples, the paper states: "Following (Rajput et al. 2023; Zhou et al. 2020), we treat users' historical reviews as interactions arranged chronologically." This implies that for each user, their interactions (e.g., buying a product, watching a movie) are recorded in the order they occurred. An example of a data sample for a user might look like a sequence of item IDs (which GenCDR converts to semantic IDs): User A: [item_ID_1 (from Sports), item_ID_2 (from Clothing), item_ID_3 (from Sports)]. The textual features of these items (e.g., product descriptions, movie summaries) are used for generating the semantic IDs.

The evaluation protocol used is leave-last-out, where the very last item in a user's sequence is reserved for testing, and the second-to-last item is used for validation, ensuring that the model predicts future interactions based on past behavior.

The following are the statistics of the datasets used in the experiments (Table 1 from the original paper):

| Dataset | #Users | #Items | #Interactions | Sparsity | Overlap |
|---|---|---|---|---|---|
| Sports | 35,598 | 18,357 | 296,337 | 99.95% | 1.73% |
| Clothing | 39,387 | 23,033 | 278,677 | 99.97% | (704) |
| Phones | 27,879 | 10,429 | 194,439 | 99.93% | 0.55% |
| Electronics | 192,403 | 63,001 | 1,689,188 | 99.99% | (404) |
| Books | 1,713 | 8,601 | 104,295 | 99.29% | 7.48% |
| Movies | 2,628 | 20,964 | 1,249,016 | 97.73% | (2,058) |
  • #Users, #Items, #Interactions: Total counts of users, items, and interactions in each domain.
  • Sparsity: Indicates the proportion of empty entries in the user-item interaction matrix, calculated as $1 - \frac{\#\mathrm{Interactions}}{\#\mathrm{Users} \times \#\mathrm{Items}}$. A high sparsity (close to 100%) means very few interactions relative to the possible total, indicating a challenging recommendation scenario.
  • Overlap: Represents the percentage of common users between the two domains in a pair (e.g., Sports and Clothing). For example, 1.73% overlap between Sports and Clothing means 1.73% of users interacted in both domains. The numbers in parentheses for Clothing, Electronics, and Movies likely refer to the absolute number of overlapping users.

5.2. Evaluation Metrics

The paper adopts Recall@K and NDCG@K as evaluation metrics, with $K$ set to 5 and 10, following standard practice in sequential recommendation. These metrics measure the accuracy of the top-$K$ recommended items.

Recall@K

Recall@K measures the proportion of relevant items (i.e., the true next item in the sequence) that are present in the top-$K$ recommended items. It indicates how well the recommender system can find all relevant items up to a certain cutoff.

  • Conceptual Definition: For a single user, Recall@K is 1 if the actual next item is among the top-$K$ recommendations, and 0 otherwise. For a set of users, it's the average of these binary outcomes. It focuses on whether any relevant item is captured within the top-$K$ list.

  • Mathematical Formula:
    $$\mathrm{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{I}(\mathrm{true\_item}_u \in \mathrm{top\_K\_recommendations}_u)$$

  • Symbol Explanation:

    • $|\mathcal{U}|$: The total number of users in the evaluation set.
    • $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
    • $\mathrm{true\_item}_u$: The actual item that user $u$ interacted with next.
    • $\mathrm{top\_K\_recommendations}_u$: The set of $K$ items recommended to user $u$.

NDCG@K (Normalized Discounted Cumulative Gain at K)

NDCG@K is a widely used metric for evaluating ranking quality, especially when items have varying degrees of relevance. It emphasizes relevant items appearing higher in the ranked list.

  • Conceptual Definition: NDCG@K considers the position of the relevant item in the recommendation list. Relevant items found at higher ranks contribute more to the score. It normalizes the score to a value between 0 and 1, where 1 represents a perfect ranking.

  • Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated:
    $$\mathrm{DCG@K} = \sum_{j=1}^{K} \frac{\mathrm{rel}_j}{\log_2(j+1)}$$
    Then, NDCG@K is obtained by normalizing DCG@K by the Ideal DCG (IDCG@K):
    $$\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$

  • Symbol Explanation:

    • $K$: The cutoff position in the recommendation list.
    • $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list. In binary relevance (where an item is either relevant or not), $\mathrm{rel}_j$ is 1 if the item at position $j$ is the true next item, and 0 otherwise.
    • $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of items at lower ranks.
    • $\mathrm{IDCG@K}$: The DCG score for the ideal ranking, where all relevant items are ranked at the top according to their true relevance. For binary relevance with only one true next item, $\mathrm{IDCG@K} = \frac{1}{\log_2(1+1)} = 1$.
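Under the leave-last-out protocol each user has a single held-out relevant item, so both metrics simplify considerably; the following minimal sketch computes them for one user, with hypothetical item identifiers.

```python
# Sketch of Recall@K and NDCG@K for a single user with one relevant item; illustrative only.
import math

def recall_at_k(ranked_items, true_item, k):
    return 1.0 if true_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, true_item, k):
    # With a single relevant item, IDCG@K = 1 / log2(1 + 1) = 1
    if true_item in ranked_items[:k]:
        rank = ranked_items.index(true_item) + 1     # 1-based position in the list
        return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["item_42", "item_7", "item_99"]            # hypothetical top-K list
print(recall_at_k(ranked, "item_7", k=2),            # 1.0
      ndcg_at_k(ranked, "item_7", k=2))              # ~0.631 (relevant item at rank 2)
```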

5.3. Baselines

To comprehensively evaluate GenCDR, it is compared against three categories of state-of-the-art models:

5.3.1. Single-domain Sequential Recommendation (SDSR)

These models are designed for recommendation within a single domain and do not explicitly leverage cross-domain information.

  • SASRec (Kang and McAuley 2018): Employs a unidirectional Transformer to model users' sequential preferences through self-attention. It predicts the next item by focusing on relevant past interactions.
  • BERT4Rec (Sun et al. 2019): Adapts the BERT architecture for recommendation using a masked item prediction objective. It learns bidirectional context by considering both past and future dependencies in a sequence.
  • STOSA (Fan et al. 2022): Introduces stochastic self-attention for long sequences, improving efficiency, and incorporates self-supervised objectives to learn more robust item representations.

5.3.2. Generative Recommendation Systems (GRS)

These models recast recommendation as an autoregressive sequence generation problem, often using vector quantization for item tokenization.

  • VQ-Rec (Hou et al. 2023): Combines VQ-VAE-based tokenization with Transformer sequence modeling. It maps item embeddings to discrete codes and then predicts the next item in this code space.
  • TIGER (Rajput et al. 2023): Enhances generative retrieval by optimizing item tokenization with collaborative constraints. It produces semantic IDs that capture both content and user-item interaction signals.
  • HSTU (Zhai et al. 2024): Proposes a hierarchical tokenization framework that encodes items at multiple semantic levels (from coarse to fine-grained), aiming to improve both generation accuracy and efficiency.

5.3.3. Cross-domain Sequential Recommendation (CDSR)

These models are specifically designed to leverage information across multiple domains.

  • C2DSR (Cao et al. 2022): Constructs a unified user-item interaction graph across domains and uses a GNN-based propagation mechanism with adaptive gating to regulate inter-domain knowledge transfer.
  • TriCDR (Ma et al. 2024): Utilizes triplet-based contrastive learning to align user embeddings across domains. It minimizes cross-domain intra-user distances (making the same user similar across domains) and maximizes inter-user separability (making different users distinct).
  • LLM4CDSR (Liu et al. 2025c): Reformulates CDR as a text generation task. It converts user histories and item attributes into textual prompts for LLMs to model implicit cross-domain semantic correlations.

5.4. Implementation Details

  • Framework: Implemented in PyTorch with Hugging Face PEFT for LoRA-based fine-tuning.
  • Training Stages:
    1. Domain-adaptive Tokenization module:
      • RQ-VAE pre-trained on all item embeddings using AdamW (lr = $1 \times 10^{-4}$, batch size = 512) for 100 epochs.
      • Domain-specific LoRA adapters: rank = 64, $\alpha$ = 32, dropout = 0.05, fine-tuned for 50 epochs with lr = $5 \times 10^{-5}$.
      • Router network: A two-layer MLP with 128 hidden units, trained jointly with a VIB regularization weight of $10^{-3}$.
    2. Cross-Domain Autoregressive Recommendation module:
      • LLM backbone: Qwen2.5-7B.
      • Universal LoRA experts: $N = 4$ experts, rank = 64, $\alpha$ = 128. Trained on combined cross-domain data for 10 epochs.
      • Domain-specific adapters: Fine-tuned for 10-20 epochs per domain.
  • Optimization: All models optimized with AdamW (lr = $5 \times 10^{-5}$, batch size = 8) under mixed-precision (FP16) on NVIDIA H200 GPUs.
  • Evaluation: Checkpoint with the best Recall@10 on the validation set is selected for final testing.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that GenCDR consistently and significantly outperforms all baseline models across the tested datasets and metrics. This validates its effectiveness in cross-domain sequential recommendation.

The results also provide insights into the performance hierarchy of different baseline categories:

  • Cross-domain (CDSR) models generally perform better than Single-domain Sequential Recommendation (SDSR) models. This reinforces the core premise that leveraging cross-domain information is beneficial for recommendation.

  • Generative (GenRec) models show an improvement over SDSR baselines, but their performance typically lags behind specialized CDSR models. This highlights the gap that GenCDR aims to bridge: simply applying existing generative models (often single-domain focused) to cross-domain scenarios is not optimal. GenCDR's advantage stems from deeply integrating the generative paradigm with the unique challenges of cross-domain knowledge transfer through its novel semantic ID representation and adaptive modeling.

    The superior performance of GenCDR can be attributed to its key innovations:

  1. Resolution of Item Tokenization Dilemma: By generating disentangled semantic IDs instead of relying on arbitrary item IDs, GenCDR provides LLMs with semantically rich and transferable representations, overcoming a major limitation of previous LLM-based CDR methods.

  2. Adaptive Modeling of Universal and Domain-Specific Knowledge: The Domain-adaptive Tokenization and Cross-Domain Autoregressive Recommendation modules, equipped with dynamic routing, effectively disentangle and fuse universal and domain-specific interests at both the item and user levels. This allows for nuanced personalization and knowledge transfer without negative transfer.

  3. Efficient and Valid Inference: The Domain-aware Prefix-tree ensures that the generative process is computationally efficient and produces only valid semantic IDs, leading to accurate and practical recommendations.

    The following is the overall performance comparison on all datasets (Table 2 from the original paper):

| Scene | Domain | Metric | BERT4Rec | SASRec | STOSA | VQ-Rec | TIGER | HSTU | C2DSR | TriCDR | LLM4CDSR | GenCDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Leisure | Sports | R@5 | 0.0188 | 0.0197 | 0.0236 | 0.0261 | 0.0267 | 0.0254 | 0.0265 | 0.0266 | 0.0263 | 0.0274 |
| | | N@5 | 0.0121 | 0.0126 | 0.0162 | 0.0238 | 0.0244 | 0.0241 | 0.0253 | 0.0255 | 0.0257 | 0.0261 |
| | | R@10 | 0.0325 | 0.0334 | 0.0346 | 0.0389 | 0.0397 | 0.0381 | 0.0395 | 0.0396 | 0.0398 | 0.0403 |
| | | N@10 | 0.0169 | 0.0173 | 0.0283 | 0.0281 | 0.0287 | 0.0277 | 0.0258 | 0.0259 | 0.0260 | 0.0262 |
| | Clothing | R@5 | 0.0128 | 0.0132 | 0.0162 | 0.0171 | 0.0173 | 0.0175 | 0.0172 | 0.0174 | 0.0176 | 0.0181 |
| | | N@5 | 0.0078 | 0.0081 | 0.0119 | 0.0129 | 0.0125 | 0.0132 | 0.0158 | 0.0161 | 0.0163 | 0.0167 |
| | | R@10 | 0.0219 | 0.0227 | 0.0223 | 0.0248 | 0.0241 | 0.0253 | 0.0255 | 0.0258 | 0.0261 | 0.0265 |
| | | N@10 | 0.0105 | 0.0108 | 0.0135 | 0.0170 | 0.0167 | 0.0174 | 0.0191 | 0.0194 | 0.0196 | 0.0203 |
| Technology | Phones | R@5 | 0.0331 | 0.0345 | 0.0415 | 0.0411 | 0.0423 | 0.0415 | 0.0428 | 0.0434 | 0.0431 | 0.0431 |
| | | N@5 | 0.0215 | 0.0224 | 0.0283 | 0.0308 | 0.0315 | 0.0327 | 0.0392 | 0.0396 | 0.0401 | 0.0406 |
| | | R@10 | 0.0524 | 0.0537 | 0.0618 | 0.0607 | 0.0613 | 0.0615 | 0.0589 | 0.0593 | 0.0614 | 0.0620 |
| | | N@10 | 0.0278 | 0.0287 | 0.0346 | 0.0399 | 0.0406 | 0.0425 | 0.0493 | 0.0505 | 0.0506 | 0.0512 |
| | Electronics | R@5 | 0.0179 | 0.0186 | 0.0213 | 0.0219 | 0.0228 | 0.0232 | 0.0235 | 0.0238 | 0.0237 | 0.0241 |
| | | N@5 | 0.0118 | 0.0122 | 0.0148 | 0.0211 | 0.0214 | 0.0226 | 0.0229 | 0.0231 | 0.0230 | 0.0235 |
| | | R@10 | 0.0276 | 0.0285 | 0.0315 | 0.0318 | 0.0322 | 0.0328 | 0.0336 | 0.0339 | 0.0338 | 0.0342 |
| Entertainment | Books | R@5 | 0.0089 | 0.0093 | 0.0142 | 0.0175 | 0.0172 | 0.0181 | 0.0152 | 0.0155 | 0.0161 | 0.0192 |
| | | N@5 | 0.0071 | 0.0076 | 0.0117 | 0.0178 | 0.0177 | 0.0180 | 0.0143 | 0.0148 | 0.0153 | 0.0187 |
| | | R@10 | 0.0176 | 0.0182 | 0.0219 | 0.0224 | 0.0221 | 0.0230 | 0.0205 | 0.0211 | 0.0216 | 0.0237 |
| | Movies | R@5 | 0.1503 | 0.1542 | 0.1562 | 0.1680 | 0.1652 | 0.1682 | 0.1588 | 0.1601 | 0.1613 | 0.1713 |
| | | N@5 | 0.1015 | 0.1047 | 0.1063 | 0.1182 | 0.1156 | 0.1189 | 0.1092 | 0.1105 | 0.1149 | 0.1215 |
| | | R@10 | 0.1798 | 0.1825 | 0.1753 | 0.1922 | 0.1893 | 0.1931 | 0.1854 | 0.1865 | 0.1878 | 0.1971 |
  • R@K and N@K denote Recall and NDCG at cutoff $K$ (a minimal sketch of how these metrics are computed is given right after this list).
  • In the original table, the best results are shown in bold and the best baseline results are underlined.
  • $t$-tests showed the performance improvements to be statistically significant ($p \leq 0.05$).
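
For readers less familiar with these metrics, below is a minimal sketch of how Recall@K and NDCG@K are typically computed when each test instance has a single ground-truth next item, as is common in sequential recommendation evaluation; the function names are illustrative and not taken from the paper's code.

```python
import math

def recall_at_k(ranked_items, target_item, k):
    """Recall@K with a single ground-truth item: 1 if the target appears
    in the top-K ranked list, else 0."""
    return 1.0 if target_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target_item, k):
    """NDCG@K with a single ground-truth item: the ideal DCG is 1, so the
    score is 1 / log2(rank + 2) if the target is within the top K, else 0."""
    if target_item in ranked_items[:k]:
        rank = ranked_items.index(target_item)  # 0-based position in the list
        return 1.0 / math.log2(rank + 2)
    return 0.0

# Example: the target item is ranked 3rd in the model's top-5 list.
print(recall_at_k(["a", "b", "c", "d", "e"], "c", 5))  # 1.0
print(ndcg_at_k(["a", "b", "c", "d", "e"], "c", 5))    # 1 / log2(4) = 0.5
```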

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study (RQ2)

To understand the contribution of each component of GenCDR, an ablation study was conducted. The results consistently show that each module plays a crucial role in the overall performance.

The following are the results of the ablation study on GenCDR components across four datasets (NDCG@10) (Table 3 from the original paper):

| Category | Variant | Phones | Electronics | Sports | Clothing |
|---|---|---|---|---|---|
| Full Model | GenCDR | 0.0512 | 0.0283 | 0.0262 | 0.0203 |
| Tokenization | w/o MTM | 0.0483 (↓5.7%) | 0.0267 (↓5.7%) | 0.0245 (↓6.5%) | 0.0190 (↓6.4%) |
| | w/o Item Adapter | 0.0466 (↓9.0%) | 0.0255 (↓9.9%) | 0.0238 (↓9.2%) | 0.0183 (↓9.9%) |
| Autoregressive Recommendation | w/o Specific Expert | 0.0448 (↓12.5%) | 0.0245 (↓13.4%) | 0.0226 (↓13.7%) | 0.0173 (↓14.8%) |
| | w/o Universal Experts | 0.0425 (↓17.0%) | 0.0232 (↓18.0%) | 0.0212 (↓19.1%) | 0.0162 (↓20.2%) |
| | w/o MoE Gate (Avg.) | 0.0475 (↓7.2%) | 0.0262 (↓7.4%) | 0.0242 (↓7.6%) | 0.0186 (↓8.4%) |
| Inference Strategy | w/o Prefix Tree | 0.0498 (↓2.7%) | 0.0274 (↓3.2%) | 0.0255 (↓2.7%) | 0.0198 (↓2.5%) |

  • Impact of Contextual Code Modeling (w/o MTM): Removing the Masked Token Modeling (MTM) loss from the Domain-adaptive Tokenization module leads to a performance degradation (e.g., 5.7% on Phones, 6.5% on Sports). This confirms that learning the contextual relationships and "grammar" of semantic codes is vital, going beyond simple reconstruction of item features.
  • Impact of Item-specific Adaptation (w/o Item Adapter): When the item-specific adapter (domain-specific LoRA for tokenization) is removed, performance drops significantly (e.g., 9.0% on Phones, 9.9% on Clothing). This validates the necessity of capturing domain-specific item semantics to refine the universal representations.
  • Impact of Domain-specific Expert (w/o Specific Expert): Removing the domain-specific experts (LoRA adapters) in the Cross-Domain Autoregressive Recommendation module results in a substantial performance drop (e.g., 12.5% on Phones, 14.8% on Clothing). This highlights their crucial role in modeling fine-grained user-wise domain-specific interests.
  • Impact of Universal Experts (w/o Universal Experts): The most significant performance degradation is observed when all universal experts (LoRA adapters) are removed (e.g., 17.0% on Phones, 20.2% on Clothing). This clearly shows that a shared cross-domain knowledge foundation for user interests is indispensable for effective CDR.
  • Impact of MoE Gate (w/o MoE Gate (Avg.)): Replacing the trainable Mixture of Experts (MoE) gate (the user-level dynamic interest routing network) with simple averaging of universal and domain-specific predictions leads to a noticeable performance decrease (e.g., 7.2% on Phones, 8.4% on Clothing). This emphasizes the importance of a dynamic, context-aware selection mechanism over naive fusion for combining different interest models (a minimal sketch contrasting the two fusion strategies follows this list).
  • Impact of Constrained Decoding (w/o Prefix Tree): Removing the Domain-aware Prefix-tree constraint results in a consistent but smaller performance drop (e.g., 2.7% on Phones, 2.5% on Clothing). This confirms its role in guaranteeing the generation of valid item IDs and preventing hallucinated (non-existent or semantically incorrect) recommendations, thus ensuring accuracy and efficiency.
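
To make the "w/o MoE Gate (Avg.)" variant concrete, the following is a minimal sketch contrasting a trainable, context-aware gate with naive averaging over universal and domain-specific expert outputs. The module name, layer sizes, and gating form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse universal and domain-specific expert outputs with a learned,
    context-dependent gate (the ablation replaces this with a fixed average)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 2), nn.Softmax(dim=-1))

    def forward(self, h_universal: torch.Tensor, h_specific: torch.Tensor) -> torch.Tensor:
        weights = self.gate(torch.cat([h_universal, h_specific], dim=-1))  # (batch, 2)
        return weights[..., :1] * h_universal + weights[..., 1:] * h_specific

def average_fusion(h_universal, h_specific):
    """The 'w/o MoE Gate (Avg.)' ablation: context-independent 0.5/0.5 averaging."""
    return 0.5 * (h_universal + h_specific)

# Usage: fuse two 64-dimensional user representations for a batch of 4 users.
h_u, h_s = torch.randn(4, 64), torch.randn(4, 64)
fused = GatedFusion(64)(h_u, h_s)
```

The design point is that the gate's weights depend on the input representations, so the fusion can shift per user and per domain, whereas averaging applies the same fixed weights everywhere.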

6.2.2. In-depth Analysis (RQ3)

To qualitatively assess the framework's ability to learn disentangled representations, the authors visualized the final item representations ($\mathbf{z}_{\mathrm{fused}}$) using t-SNE; a minimal sketch of how such a visualization can be produced is included after the figure discussion below.

The following figure (Figure 3 from the original paper) shows t-SNE visualization of item embeddings in three different settings.

Figure 3: t-SNE visualization of item embeddings in three different settings: (a) the original item embeddings, (b) embeddings from the shared (universal) LoRA adapters only, and (c) embeddings from the domain-specific LoRA adapters, with each domain shown in a different color.

  • Figure 3 (a): Original Item Embeddings. This visualization shows the raw item embeddings before any tokenization or adaptation. The items are mixed, with no clear separation by domain.
  • Figure 3 (b): Universal Adapters Only. When only universal adapters are used (without domain-specific adapters), item embeddings from different domains are still mixed together. While some clustering might occur based on shared semantic concepts, domain-specific distinctions are not pronounced. This confirms that universal knowledge alone is not sufficient to fully separate domain-specific item characteristics.
  • Figure 3 (c): Full GenCDR Model (with Domain-specific Adapters). In contrast, the full GenCDR model, which includes both universal and domain-specific adapters with dynamic routing, shows item embeddings forming clearly separated domain-specific clusters. This visual evidence strongly supports the importance and effectiveness of the domain-specific adaptation in learning disentangled representations, allowing the model to distinguish items belonging to different domains even if they share some high-level semantic concepts (like "Apple" in Figure 1).
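
For reference, a t-SNE plot like Figure 3 can be produced with scikit-learn; this is a generic sketch using synthetic embeddings as stand-ins for $\mathbf{z}_{\mathrm{fused}}$, not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_item_embeddings(embeddings: np.ndarray, domain_labels: np.ndarray, title: str):
    """Project item embeddings to 2D with t-SNE and color points by domain."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    for domain in np.unique(domain_labels):
        mask = domain_labels == domain
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=str(domain))
    plt.legend()
    plt.title(title)
    plt.show()

# Usage with synthetic stand-ins for the fused item representations.
fake_embeddings = np.random.randn(500, 128).astype(np.float32)
fake_domains = np.random.choice(["Phones", "Electronics"], size=500)
plot_item_embeddings(fake_embeddings, fake_domains, "t-SNE of item embeddings")
```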

6.2.3. Hyper-parameter Analysis (RQ4)

The sensitivity of key hyperparameters for LoRA fine-tuning was analyzed on the Cloth dataset.

The following figure (Figure 4 from the original paper) shows the sensitivity of LoRA fine-tuning to key hyperparameters on the Cloth dataset.

Figure 4: Sensitivity of LoRA fine-tuning to key hyperparameters on the Cloth dataset. The subplots show how the number of universal experts, the LoRA rank, LoRA alpha, and the LoRA dropout rate affect Recall@5, Recall@10, NDCG@5, and NDCG@10.

The plot shows the effect of varying:

  • Number of Universal Experts ($N$): Performance generally improves up to an optimum (e.g., $N=4$), after which it might plateau or decline due to increased complexity or potential redundancy.

  • LoRA Rank ($r$): The rank parameter determines the capacity of the LoRA adapters. There is a clear optimal rank (e.g., $r=64$). A low rank might not capture enough information, while a very high rank can lead to overfitting (as it approaches full fine-tuning).

  • LoRA Alpha ($\alpha$): This parameter scales the LoRA updates; the optimal value balances the contribution of the LoRA layers against the base model.

  • LoRA Dropout Rate: A small LoRA Dropout Rate (e.g., 0.05) is shown to be effective for regularization, preventing overfitting without overly hindering learning.

    These findings highlight a balanced trade-off between model capacity and generalization, demonstrating the framework's robustness and tunability. Finding these optimal points ensures the model captures sufficient complexity without overfitting to the training data.
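
To connect these hyperparameters to practice, the sketch below shows how rank, alpha, and dropout are typically exposed through the Hugging Face peft library. The target modules, alpha value, and backbone checkpoint are illustrative assumptions, not the paper's reported configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values near the optima discussed above (r=64, small dropout);
# alpha and target_modules are assumptions, not the authors' exact settings.
lora_config = LoraConfig(
    r=64,                                  # LoRA rank: capacity of the low-rank update
    lora_alpha=128,                        # scaling factor applied to the LoRA update
    lora_dropout=0.05,                     # light regularization on the LoRA branch
    target_modules=["q_proj", "v_proj"],   # attention projections of the backbone
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # example checkpoint
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```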

6.2.4. Analysis of Efficiency (RQ5)

6.2.4.1. Training Efficiency

GenCDR leverages LoRA-based fine-tuning, which offers significant efficiency advantages compared to full fine-tuning of Large Language Models (LLMs).

The following figure (Figure 5 from the original paper) shows a comparison of training efficiency using the Qwen2.5-7B model.

Figure 5: Comparison of training efficiency using the Qwen2.5-7B model. The plots show (a) trainable parameters (log scale), (b) training time, and (c) peak GPU memory for the LoRA-based GenCDR versus a fully fine-tuned (Full FT) version.

  • Figure 5 (a): Trainable Parameters (log scale). LoRA-based GenCDR dramatically reduces the number of trainable parameters compared to Full Fine-Tuning. This is a core benefit of PEFT techniques, making training feasible for large LLMs.

  • Figure 5 (b): Training Time. With fewer trainable parameters, GenCDR requires substantially less training time than Full Fine-Tuning.

  • Figure 5 (c): Peak GPU Memory. LoRA-based GenCDR also consumes significantly less GPU memory, which is critical for training large models on available hardware.

    These results confirm that GenCDR's LoRA-centric architecture provides substantial training efficiency, making it practical to fine-tune LLMs for cross-domain recommendation.
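
As a rough illustration of the gap in panel (a), trainable-versus-total parameter counts can be compared directly in PyTorch; this is a generic sketch, not the authors' measurement script.

```python
import torch

def count_parameters(model: torch.nn.Module):
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Under full fine-tuning every weight is trainable, so trainable == total
# (on the order of 7B parameters for Qwen2.5-7B). With LoRA, the backbone is
# frozen and only the low-rank adapter matrices require gradients, typically
# well under 1% of the total, which is what drives the reduced training time
# and peak GPU memory in panels (b) and (c).
trainable, total = count_parameters(torch.nn.Linear(128, 128))  # toy example
print(f"{trainable}/{total} parameters trainable")
```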

6.2.4.2. Inference Efficiency and Scalability

GenCDR also demonstrates superior inference efficiency and scalability, primarily due to its Domain-aware Prefix-tree constrained generative architecture.

The following figure (Figure 6 from the original paper) shows a comparison of runtime memory and inference time w.r.t. the item pool size for TriCDR, TIGER, and GenCDR (Qwen2.5-0.5B).

Figure 6: Comparison of runtime memory and inference time w.r.t. the item pool size for TriCDR, TIGER, and GenCDR (Qwen2.5-0.5B). The x-axis is the number of items (in units of $10^4$); the left y-axis reports memory (GB) and the right y-axis reports inference time (s).

  • Runtime Memory: GenCDR maintains a relatively constant runtime memory footprint regardless of the item pool size. This is a crucial advantage for real-world applications with vast item catalogs. Baselines like TriCDR and TIGER show increasing memory consumption as the item pool grows.

  • Inference Time: Similarly, GenCDR's inference time remains stable even as the item pool size increases. This is because the prefix-tree effectively prunes the search space, ensuring that the LLM only considers valid semantic ID sequences, rather than exhaustively searching through all possible items. In contrast, TriCDR and TIGER exhibit increasing inference times with larger item pools, indicating a less scalable approach for very large item sets.

    This scalability makes GenCDR highly suitable for production environments where recommendation systems must operate efficiently with dynamically changing and extensive item catalogs.
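
To illustrate why constrained decoding scales gracefully with the item pool, below is a minimal sketch of a prefix tree (trie) over semantic-ID code sequences; at each generation step only the children of the current prefix are allowed as next tokens. The class and method names are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List

class SemanticIDTrie:
    """Prefix tree over valid semantic-ID code sequences for one domain.
    During constrained decoding, only the children of the current prefix
    are allowed as next tokens, so non-existent IDs are never generated."""

    def __init__(self):
        self.root: Dict[int, dict] = {}

    def insert(self, codes: List[int]) -> None:
        node = self.root
        for code in codes:
            node = node.setdefault(code, {})

    def allowed_next(self, prefix: List[int]) -> List[int]:
        node = self.root
        for code in prefix:
            node = node.get(code)
            if node is None:
                return []  # prefix does not correspond to any real item
        return list(node.keys())

# Usage: three items, each represented by a 3-level semantic ID.
trie = SemanticIDTrie()
for semantic_id in [[12, 7, 3], [12, 7, 9], [45, 2, 1]]:
    trie.insert(semantic_id)

print(trie.allowed_next([]))       # [12, 45] -> valid first-level codes
print(trie.allowed_next([12, 7]))  # [3, 9]   -> only valid completions
```

Because each decoding step only inspects the children of the current node, the per-step cost depends on the branching factor of the tree rather than on the total number of items, which is consistent with the flat inference-time curve observed for GenCDR.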

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively addresses two critical challenges in LLM-based cross-domain recommendation (CDR): the item ID tokenization dilemma and insufficient domain-specific modeling. The authors propose GenCDR, a novel generative framework that leverages semantic IDs and adaptive mechanisms.

The key contributions and findings include:

  • Generative Semantic ID Paradigm: GenCDR introduces generative semantic IDs to replace traditional item IDs, enabling LLMs to process items based on their inherent meaning rather than arbitrary identifiers, thus resolving the tokenization dilemma.
  • Domain-adaptive Tokenization Module: This module dynamically generates hybrid semantic IDs by fusing universal and domain-specific item knowledge, ensuring rich and contextually relevant representations.
  • Cross-Domain Autoregressive Recommendation Module: Symmetrically, this module models user preferences by adaptively fusing universal and user-wise domain-specific interests through dynamic routing, leading to more personalized and accurate recommendations.
  • Domain-aware Prefix-tree for Inference: An efficient prefix-tree mechanism constrains the LLM's generative process, ensuring the output of valid and accurate semantic IDs while improving inference speed and scalability.
  • Superior Performance and Efficiency: Extensive experiments on multiple real-world datasets demonstrate that GenCDR significantly outperforms state-of-the-art baselines in terms of recommendation accuracy and generalization. Furthermore, it exhibits superior training and inference efficiency compared to full fine-tuning and other generative models, especially with large item pools.

7.2. Limitations & Future Work

The authors identify that GenCDR's current focus is primarily on textual features for generating semantic IDs. A stated direction for future work is to explore incorporating multimodal features (e.g., images, videos, audio) for richer representations. This suggests a current limitation where non-textual aspects of items might not be fully captured, potentially impacting performance in domains where visual or auditory information is highly discriminative.

7.3. Personal Insights & Critique

Inspirations

GenCDR offers several inspiring aspects:

  1. Bridging LLMs and Structured Data: The elegant solution of converting items into generative semantic IDs is a powerful way to bridge the gap between LLMs (which excel at text) and structured recommendation data. This approach could be highly valuable in other domains where LLMs need to interact with non-textual entities.
  2. Adaptive Knowledge Fusion: The dynamic routing mechanisms at both item and user levels for fusing universal and domain-specific knowledge are particularly insightful. This fine-grained control over knowledge transfer is crucial for complex multi-domain settings and could be adapted to other multi-task or meta-learning problems where disentangling shared and specific features is important.
  3. Practicality of Generative Models: The inclusion of a Domain-aware Prefix-tree highlights a practical consideration often overlooked in theoretical generative models. Ensuring valid and efficient generation is key to deploying LLM-based recommenders in real-world scenarios, and this mechanism offers a robust solution for constrained generation tasks.

Potential Issues, Unverified Assumptions, or Areas for Improvement

  1. Reliance on Initial Item Feature Embeddings: The Domain-adaptive Tokenization module starts with item feature embeddings ($\mathbf{x}$). The quality and richness of these initial embeddings (derived from textual descriptions) are paramount. If the textual descriptions are sparse or low-quality, the generated semantic IDs might not be as effective. The paper mentions "textual features," but the robustness to various qualities of textual data (e.g., short titles vs. long descriptions) could be explored.

  2. Complexity of Training Pipeline: While LoRA improves efficiency, the overall training pipeline involves multiple stages: RQ-VAE pre-training, domain-specific adapter fine-tuning for tokenization, universal expert fine-tuning for recommendation, and domain-specific expert fine-tuning for recommendation, all with various hyperparameters. Managing and optimizing this multi-stage process can be complex and computationally intensive, especially for a novice.

  3. Generalizability of RQ-VAE to Rare Items/Domains: The RQ-VAE is pre-trained on all item features. How well does it handle items from very sparse domains or extremely rare items within a domain, especially if their textual descriptions are limited? The semantic IDs might not be as well-defined or disentangled for such long-tail items.

  4. Interpretability of Semantic IDs: While semantic IDs are intuitively more meaningful than arbitrary IDs, their exact interpretability for human understanding is not fully detailed. Can a sequence of semantic IDs be easily translated back into human-understandable attributes or concepts? This could be important for debugging or explaining recommendations.

  5. Impact of VIB Regularization: The paper states that VIB regularization is used for the routing networks to promote disentangled representations. While conceptually sound, a deeper empirical analysis of how effectively VIB achieves this disentanglement in the context of routing decisions (beyond the t-SNE visualization) would strengthen the argument. For instance, analyzing the information content passed by the router under different VIB strengths.
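
For reference, a generic Variational Information Bottleneck objective applied to a routing representation $\mathbf{z}$ takes the following form; the paper's exact instantiation may differ:

$$
\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\left[-\log p_\theta(y\mid\mathbf{z})\right] + \beta\,\mathrm{KL}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x})\,\middle\|\,p(\mathbf{z})\right)
$$

Sweeping the weight $\beta$ and measuring how routing decisions and downstream accuracy change would be one concrete way to probe how much information the router actually passes.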

    Overall, GenCDR presents a significant step forward in LLM-based cross-domain recommendation by tackling fundamental representation and adaptation challenges. Its innovations offer a solid foundation for future research in generative and adaptive recommendation systems.
