Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation
TL;DR Summary
The paper introduces `UniTok`, a unified item tokenization framework that addresses the limitations of separate models for different item domains in LLM-based recommendation systems. It employs a mixture-of-experts architecture and a mutual information calibration mechanism to pr
Abstract
Large language model (LLM)-based recommender systems have achieved high-quality performance by bridging the discrepancy between the item space and the language space through item tokenization. However, existing item tokenization methods typically require training separate models for each item domain, limiting generalization. Moreover, the diverse distributions and semantics across item domains make it difficult to construct a unified tokenization that preserves domain-specific information. To address these challenges, we propose UniTok, a Unified item Tokenization framework that integrates our own mixture-of-experts (MoE) architecture with a series of codebooks to convert items into discrete tokens, enabling scalable tokenization while preserving semantic information across multiple item domains. Specifically, items from different domains are first projected into a unified latent space through a shared encoder. They are then routed to domain-specific experts to capture the unique semantics, while a shared expert, which is always active, encodes common knowledge transferable across domains. Additionally, to mitigate semantic imbalance across domains, we present a mutual information calibration mechanism, which guides the model towards retaining similar levels of semantic information for each domain. Comprehensive experiments on wide-ranging real-world datasets demonstrate that the proposed UniTok framework is (a) highly effective: achieving up to 51.89% improvements over strong benchmarks, (b) theoretically sound: showing the analytical validity of our architectural design and optimization; and (c) highly generalizable: demonstrating robust performance across diverse domains without requiring per-domain retraining, a capability not supported by existing baselines.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation
1.2. Authors
The authors are Yu Hou and Won-Yong Shin, both affiliated with Yonsei University.
1.3. Journal/Conference
The paper was published on 2025-11-17T03:18:04.000Z (UTC). The specific venue is not mentioned in the provided text; given the topic, a natural target would be a top-tier conference in AI, ML, or recommender systems (e.g., NeurIPS, ICML, KDD, SIGIR).
1.4. Publication Year
2025
1.5. Abstract
This paper addresses the limitations of existing item tokenization methods for Large Language Model (LLM)-based recommender systems, which typically require separate models for each item domain and struggle with diverse domain semantics. The authors propose UniTok, a Unified item Tokenization framework. UniTok integrates a novel mixture-of-experts (MoE) architecture with codebooks to convert items into discrete tokens. It uses a shared encoder for a unified latent space, domain-specific experts for unique semantics, and a shared expert for common knowledge. To ensure semantic balance, UniTok includes a mutual information (MI) calibration mechanism. Experiments on diverse real-world datasets show UniTok is highly effective (up to 51.89% improvement over benchmarks), theoretically sound (analytical validity of design), and highly generalizable (robust performance across domains without retraining).
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.12922v1
- PDF Link: https://arxiv.org/pdf/2511.12922v1.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficiency and lack of generalization in item tokenization for Large Language Model (LLM)-based recommender systems, particularly in multi-domain settings.
This problem is important because LLMs are becoming a promising paradigm for generative recommendation, leveraging their strong generalization and language understanding capabilities. However, to effectively use LLMs, items must be converted into discrete tokens (item tokenization) to bridge the gap between the item space and the language space. Existing item tokenization methods are predominantly designed for single-domain scenarios. This leads to several critical issues in real-world multi-domain applications:
- Training Overhead (C1): Training separate tokenizers for each new item domain is computationally expensive, resource-intensive, and inefficient. As the number of domains grows, this siloed approach becomes unsustainable.
- Semantic Alignment (C2): Item domains often have vastly different data distributions and semantics. A naïvely shared token space across domains can lead to semantic mixing and biased token assignments, where the tokenizer fails to capture the rich, unique semantics of each domain or struggles to maintain consistency in semantic informativeness across them.

The paper's entry point or innovative idea is to design a unified item tokenization framework that can generalize across multiple domains with minimal computational overhead, addressing both the training inefficiency and semantic alignment challenges.
2.2. Main Contributions / Findings
The paper makes several significant contributions to address the challenges of multi-domain item tokenization for LLM-based recommendation:
- New Methodology (UniTok Framework):
    - Proposes UniTok, a unified item tokenization framework that integrates a customized Mixture-of-Experts (MoE) architecture, dubbed TokenMoE, with codebooks.
    - TokenMoE is designed with domain-specific experts to capture unique semantics of each domain and a shared expert to encode common knowledge, enabling effective domain-aware tokenization and knowledge transfer.
    - Introduces a mutual information (MI) calibration mechanism to ensure semantic balance across diverse domains by guiding the model to retain similar levels of semantic information for each domain.
- Extensive Evaluations and Superior Performance:
    - Demonstrates UniTok's superiority in multi-domain scenarios, achieving substantial improvements of up to 51.89% in NDCG@10 over strong baselines.
    - Shows UniTok's efficiency, with a 9.63x reduction in model size (trainable parameters) compared to traditional codebook-based methods that require per-domain training.
    - Shows UniTok's strong generalization capability, exhibiting robust performance in zero-shot settings (unseen domains) without additional retraining, a capability lacking in existing baselines.
- Theoretical Justifications:
    - Provides theoretical proofs that UniTok induces a higher entropy token space (Theorem 1), indicating greater capacity and richness of tokens.
    - Demonstrates that UniTok achieves a lower quantization error (Theorem 2), meaning more precise and accurate item tokenization.
    - Shows that UniTok ensures semantic consistency across domains by reducing MI variance (Theorem 3), leading to more stable and balanced recommendation performance.

These findings collectively address the training overhead and semantic inconsistency problems, making LLM-based recommendation more scalable and effective in real-world multi-domain environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts:
Large Language Models (LLMs)
Large Language Models (LLMs) are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. In the context of recommender systems, LLMs can leverage their strong generalization abilities, natural language understanding, and world knowledge to generate recommendations. They can process items as if they were part of natural language sequences, making generative recommendation possible. This paradigm shifts from traditional recommendation methods that often rely on user-item interaction matrices or collaborative filtering.
Item Tokenization
Item tokenization is the process of converting items (e.g., products, movies, articles) into discrete, symbolic representations called tokens. Just as words in natural language are tokenized before being fed into an LLM, items in a recommender system need to be tokenized so that LLMs can understand and process them. This process bridges the gap between the item space (where items exist with their rich features and metadata) and the language space (where LLMs operate using discrete tokens). Tokens can be ID-based (simple numerical IDs), textual descriptors (using item metadata directly), or codebook-based (as explored in this paper).
Mixture-of-Experts (MoE)
A Mixture-of-Experts (MoE) architecture is a type of neural network design that enhances model capacity and efficiency by conditionally activating different sub-networks (called experts) for different inputs. Instead of processing every input with the entire model, an MoE uses a gating mechanism or router to select a subset of experts that are most relevant to a given input. The outputs of these selected experts are then combined, often weighted by the router's scores.
- Purpose: MoE models can scale to a very large number of parameters (increasing model capacity) while keeping the computational cost for any single input relatively low (improving computational efficiency), as only a fraction of the parameters are activated.
- Application in Multi-Domain: MoE is particularly well-suited for multi-domain learning because its gating mechanism allows adaptive expert selection for each domain. This helps the model specialize in different data distributions without interference, facilitating generalization across diverse data sources.

Formally, given an input $\mathbf{x}$, the output of an MoE model is typically represented as:

$$
\mathbf{y} = \sum_{k=1}^{K} G_k(\mathbf{x})\, E_k(\mathbf{x}),
$$

where $K$ is the total number of experts, $E_k(\mathbf{x})$ is the output of the $k$-th expert, and $G_k(\mathbf{x})$ is a softmax-based router function that assigns a probability (or weight) to the $k$-th expert for the given input.
Codebook-based Identifiers (Residual Quantization - RQ)
Codebook-based identifiers are a method of item tokenization that converts continuous item embeddings into discrete token sequences using a technique called Residual Quantization (RQ).
- Codebook: A codebook is a collection of discrete code vectors (or codewords).
- Residual Quantization (RQ): This process encodes an input vector (e.g., an item embedding) into a sequence of codes by iteratively finding the best-matching code vector from a codebook and then quantizing the residual (the difference between the input and the selected code) in subsequent stages.
    - Start with an initial residual $\mathbf{r}_1 = \mathbf{z}$ (the input vector).
    - At each level $l$ (from 1 to $L$ total levels), find the code vector $\mathbf{e}_{c_l}$ from the codebook $\mathcal{C}_l$ that is closest to the current residual $\mathbf{r}_l$.
    - Update the residual for the next level: $\mathbf{r}_{l+1} = \mathbf{r}_l - \mathbf{e}_{c_l}$.
    - The final approximation of the original input is the sum of all selected code vectors: $\hat{\mathbf{z}} = \sum_{l=1}^{L} \mathbf{e}_{c_l}$.

Each selected code vector corresponds to a discrete index $c_l \in \{1, \ldots, T\}$ (where $T$ is the number of code vectors in each codebook). The sequence $(c_1, \ldots, c_L)$ forms the token sequence representing the item. This method provides compact and semantically meaningful tokens.
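The residual quantization steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names `nearest_index` and `residual_quantize`, the toy codebooks, and the input vector are all hypothetical, and indices are 0-based here.

```python
import math

def nearest_index(codebook, vec):
    """Index of the code vector closest to vec in Euclidean distance."""
    return min(range(len(codebook)), key=lambda t: math.dist(codebook[t], vec))

def residual_quantize(z, codebooks):
    """Return (indices, approximation) for input z, one index per codebook level."""
    residual = list(z)
    approx = [0.0] * len(z)
    indices = []
    for codebook in codebooks:
        t = nearest_index(codebook, residual)
        code = codebook[t]
        indices.append(t)
        # Accumulate the selected code vector and pass the remainder down a level.
        approx = [a + c for a, c in zip(approx, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, approx

# Hypothetical toy codebooks: 2 levels, 3 two-dimensional code vectors each.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],     # coarse level
    [[0.1, 0.0], [0.0, 0.2], [-0.1, -0.1]],   # fine level
]
indices, approx = residual_quantize([0.9, 1.15], codebooks)
print(indices, approx)
```

With these toy values the coarse level picks the third code vector and the fine level picks the second, so the item is represented by the index pair `[2, 1]` and approximated by roughly `[1.0, 1.2]`. Each extra level only has to encode what the previous levels missed, which is why a short index sequence can still be precise.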
Mutual Information (MI) and Hilbert-Schmidt Independence Criterion (HSIC)
Mutual Information (MI) is a concept from information theory that measures the statistical dependence between two random variables. A high MI value indicates a strong relationship, meaning knowing one variable provides a lot of information about the other. Conversely, a low MI suggests that the variables are nearly independent.
- Purpose in this paper: In UniTok, MI is used to quantify how much semantic information from the original input item embeddings is preserved in their learned latent embeddings for each domain.
- Hilbert-Schmidt Independence Criterion (HSIC): Calculating MI directly can be challenging. HSIC is a non-parametric measure of dependence between random variables that the paper uses as a tractable proxy for MI. It is based on the cross-covariance between two Reproducing Kernel Hilbert Spaces (RKHSs). A higher HSIC value implies stronger statistical dependence.
    - It operates on kernel matrices (e.g., Gaussian kernel matrices) computed over the input data, which implicitly map data into high-dimensional RKHSs where dependencies might be more easily detected.
    - The centering matrix $\mathbf{H}$ (where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ for $n$ samples) is used to ensure that the embeddings in RKHS have a zero mean, which is standard practice in kernel methods.
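To make the HSIC description concrete, here is a small dependency-free sketch of the standard biased empirical estimator, tr(K_x H K_z H) / (n - 1)^2, with Gaussian kernels. The normalization, the kernel bandwidth `sigma`, the function names, and the toy data are assumptions chosen for illustration; the paper may use a different estimator or bandwidth.

```python
import math

def gaussian_kernel(points, sigma=1.0):
    """n x n Gaussian (RBF) kernel matrix over a list of vectors."""
    return [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
            for p in points]

def center(K):
    """Compute H K H with H = I - (1/n) 1 1^T, without building H explicitly."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic(xs, zs, sigma=1.0):
    """Biased empirical HSIC between paired samples xs and zs."""
    n = len(xs)
    Kx_centered = center(gaussian_kernel(xs, sigma))
    Kz = gaussian_kernel(zs, sigma)
    # tr(K_x H K_z H) = tr((H K_x H) K_z); K_z is symmetric, so this is an
    # element-wise product summed over all entries.
    trace = sum(Kx_centered[i][j] * Kz[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

xs = [[0.0], [1.0], [2.0], [3.0]]            # toy "input embeddings"
zs_dependent = [[2.0 * x[0]] for x in xs]    # latent that is a function of the input
zs_constant = [[5.0] for _ in xs]            # latent that ignores the input

hsic_dependent = hsic(xs, zs_dependent)
hsic_constant = hsic(xs, zs_constant)
print(hsic_dependent, hsic_constant)
```

The dependent pair yields a clearly positive value, while the constant latent yields a value that is zero up to floating-point error, matching the reading that a higher HSIC means more of the input is reflected in the latent embedding.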
3.2. Previous Works
The paper frames UniTok in contrast to existing item tokenization methods and traditional collaborative filtering (CF) approaches for recommendation.
- Traditional Collaborative Filtering (CF) Methods: These methods typically rely on user-item interaction data to find patterns and make recommendations.
    - MF (Matrix Factorization): A classic CF method that decomposes the user-item interaction matrix into lower-dimensional user and item latent factor matrices.
    - LightGCN: A Graph Convolutional Network (GCN)-based method that simplifies the GCN architecture for recommendation by keeping only the neighborhood aggregation component and dropping feature transformations and nonlinear activations.
    - SASRec (Self-Attentive Sequential Recommendation): A sequential recommender that uses a self-attention mechanism (similar to Transformers) to capture short-term and long-term dependencies in user interaction sequences.
    - Bert4Rec: Another sequential recommender that adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to model user behavior sequences bidirectionally, predicting masked items.
- LLM-based Generative Recommendation (with Item Tokenization): These methods bridge item space and language space using LLMs.
    - P5-TID (P5 with title-based IDs): A variant of P5 (the Pretrain, Personalized Prompt & Predict Paradigm) that uses item titles as textual identifiers for LLM-based generative recommendation, so that the LLM generates items directly from textual information.
    - P5-SemID (P5 with semantic item IDs): Another P5 variant that uses semantic item IDs for LLM-based generative recommendation. It focuses on representing items with semantic codewords for next-item prediction.
    - TIGER (Transformer Index for GEnerative Recommenders): A codebook-based item tokenization method that converts item metadata into hierarchical code sequences. It is a foundational work in using discrete codebooks for generative recommendation.
    - LC-Rec (Learnable Code-based Recommendation): Learns code indices for items via vector quantization and fine-tunes LLMs through alignment tasks to generate items directly.
    - LETTER (Learnable Item Tokenization for Generative Recommendation): This method learns a tokenizer for LLM-based generative recommendation, aiming to incorporate collaborative signals and mitigate code assignment bias. It also uses codebook-based identifiers.
3.3. Technological Evolution
The evolution of recommendation systems has moved from traditional methods focusing on explicit user-item interactions (e.g., Matrix Factorization, early CF) to more complex models capable of capturing sequential patterns (SASRec, Bert4Rec) and leveraging graph structures (LightGCN). More recently, the advent of Large Language Models has opened up the generative recommendation paradigm, where LLMs can directly generate recommendations, often by processing items as tokens within a language context.
Initially, item tokenization methods for LLM-based systems focused on single-domain scenarios (e.g., TIGER, LC-Rec, LETTER). These approaches, while effective within their specific domains, suffered from scalability issues when applied to multi-domain settings, requiring separate tokenizers and leading to redundant training and parameter inefficiency.
The field of machine learning more broadly has seen a strong trend towards unified models for multi-domain learning in areas like natural language processing and computer vision. This paper's work on UniTok fits within this broader trend, aiming to bring unified multi-domain learning principles to item tokenization for LLM-based recommendation. It represents a significant step towards creating general-purpose tokenization interfaces for foundation models in recommendation.
3.4. Differentiation Analysis
Compared to the main methods in related work, UniTok introduces several core innovations:
-
Unified Multi-domain Approach vs. Single-domain Specialization: The primary differentiation is
UniTok's ability to tokenize items acrossmultiple domainsusing asingle unified model, whereas most existingitem tokenizationmethods (e.g.,TIGER,LC-Rec,LETTER) are designed forsingle-domainsettings and require separate training for each domain. This directly addresses thetraining overhead (C1)challenge. -
Novel
TokenMoEArchitecture:UniTokproposes a specializedMixture-of-Experts (MoE)architecture (TokenMoE) integrated directly into thetokenization module. UnlikeMoEs used inTransformerlayers,TokenMoEexplicitly combinesdomain-specific experts(for unique semantics) and ashared expert(for common knowledge). This allows the model to retaindomain specializationwithout sacrificingglobal knowledge sharing, which is crucial for handling diverse semantics (partially addressingC2). -
Mutual Information (MI) Calibration:
UniTokintroduces a uniqueMI calibration mechanism() to explicitly ensuresemantic balanceacross domains. This mechanism minimizes the variance ofMIbetween input and latent embeddings, mitigatingsemantic imbalanceand ensuring consistentinformativenessfor each domain. This directly addresses thesemantic alignment (C2)challenge, which is largely overlooked by priorsingle-domainstudies. -
Efficiency and Generalizability: By learning a
unified tokenizationmodel,UniTokachieves significant efficiency gains (9.63x reduction in trainable parameters) and demonstrates strongzero-shot generalizationto unseen domains, capabilities not supported by existing baselines. This is a direct consequence of its architectural innovations andMI calibration.In essence, while previous
codebook-based tokenizationmethods focus on effective item representation within a single domain,UniTokinnovates by extending this concept to amulti-domainsetting through an intelligentMoEdesign andsemantic balancingmechanism, leading to a more scalable, efficient, and robust solution.
4. Methodology
4.1. Principles
The core idea behind UniTok is to enable unified item tokenization across multiple, diverse domains for LLM-based recommendation. It operates on two main principles:
- Disentangling Domain-Specific and Shared Knowledge: Different item domains exhibit distinct data distributions and semantics. To handle this with a single model, UniTok needs to internally separate domain-specific learning (to capture unique patterns) from shared representations (to leverage common knowledge and improve efficiency). This is achieved through a novel Mixture-of-Experts (MoE) architecture, TokenMoE, where specialized experts handle domain-specific nuances while a shared expert captures cross-domain commonalities.
- Ensuring Semantic Balance and Informativeness: When tokenizing items from diverse domains within a unified framework, it is critical to ensure that the learned latent embeddings retain sufficient and consistent semantic information for each domain. Naïvely sharing a token space can lead to semantic mixing or imbalance, where some domains might lose crucial details. UniTok addresses this by introducing a mutual information (MI) calibration mechanism that guides the model to preserve similar levels of informativeness across all domains, promoting stable and balanced performance.

By combining these principles, UniTok aims to create a scalable and generalizable item tokenizer that is efficient in training and deployment while maintaining high semantic fidelity across heterogeneous item domains.
4.2. Core Methodology In-depth (Layer by Layer)
The UniTok framework consists of four key components: a shared autoencoder, TokenMoE, codebook-based identifiers, and an MI calibration mechanism.
4.2.1. Task Formulation: Multi-domain Item Tokenization
The paper considers a multi-domain setting with a mixture of $K$ distinct item domains, denoted as $\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K\}$. Each domain $\mathcal{D}_k$ has an item set with associated textual metadata (e.g., titles, categories, features).
The objective is to learn a mapping function that projects each continuous semantic embedding $\mathbf{x}_i^k$ of an item from domain $\mathcal{D}_k$ into a discrete codeword $\mathbf{c}_i^k \in \mathcal{C}$. Here, $\mathcal{C}$ represents a shared space of discrete item tokens across all domains. This tokenization should operate independently of user data, enabling scalable and general-purpose recommendations for LLM-based systems.
4.2.2. Shared Autoencoder
UniTok begins by projecting item embeddings from various domains into a unified latent space using a shared autoencoder. This step ensures that items from different domains initially share a common representation format, capturing structural patterns before domain-specific processing.
Components:
- Encoder ($f_\theta$): Maps an input item's semantic embedding to a latent embedding.
- Decoder ($g_\phi$): Reconstructs the original semantic embedding from the processed latent embedding.

Process: For each input item from domain $\mathcal{D}_k$:
- The encoder produces a latent embedding: $\mathbf{z}_i^k = f_\theta(\mathbf{x}_i^k)$, where $\mathbf{x}_i^k$ is the semantic embedding of the $i$-th item in domain $\mathcal{D}_k$, and $\mathbf{z}_i^k$ is its corresponding latent embedding.
- This latent embedding $\mathbf{z}_i^k$ then passes through the TokenMoE module to produce a quantized representation $\hat{\mathbf{z}}_i^k$.
- The decoder reconstructs the input item's semantic embedding from $\hat{\mathbf{z}}_i^k$: $\hat{\mathbf{x}}_i^k = g_\phi(\hat{\mathbf{z}}_i^k)$, where $\hat{\mathbf{x}}_i^k$ is the reconstructed semantic embedding.

Optimization:
The shared autoencoder is optimized using a reconstruction loss, which minimizes the difference between the original and reconstructed semantic embeddings:

$$
\mathcal{L}_{\mathrm{Rec}} = \left\| \mathbf{x}_i^k - \hat{\mathbf{x}}_i^k \right\|_2^2,
$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm. This loss ensures that the latent space is informative and retains core semantics for subsequent tokenization.
4.2.3. TokenMoE (Mixture-of-Experts for Tokenization)
TokenMoE is a specialized Mixture-of-Experts architecture within UniTok that routes items to both domain-specific experts and a shared expert. This design allows for domain-aware tokenization by capturing both unique domain patterns and transferable common knowledge.
Components:
- Router Function ($h(\cdot)$): Takes an item's latent embedding and produces a softmax distribution over the $K$ domain-specific experts.
- Domain-Specific Experts ($E_1, \ldots, E_K$): $K$ individual expert modules, each designed to specialize in patterns unique to a particular domain.
- Shared Expert ($E_{\mathrm{share}}$): An additional expert module that captures common knowledge transferable across all domains and is always active.

Process:
- After the shared encoder produces an item's latent embedding $\mathbf{z}$, it is fed into the router function. The router computes router logits using a learnable linear transformation $\mathbf{W}$: $h(\mathbf{z}) = \mathbf{W}\mathbf{z}$.
- The softmax distribution for each domain-specific expert is calculated as $G_k = \frac{\exp(h_k)}{\sum_{j=1}^{K} \exp(h_j)}$, where $h_k$ is the logit corresponding to the $k$-th domain-specific expert.
- The item is then routed to the top-$N$ domain-specific experts with the highest $G_k$ values, while simultaneously being deterministically assigned to the shared expert.
- The final processed latent embedding $\hat{\mathbf{z}}$ (which is then fed to the decoder) is a weighted combination of the outputs of the selected domain-specific experts and the shared expert: $\hat{\mathbf{z}} = \sum_{k \in \mathcal{T}_N} G_k\, E_k(\mathbf{z}) + E_{\mathrm{share}}(\mathbf{z})$, where $E_k(\cdot)$ denotes the $k$-th expert module, $E_{\mathrm{share}}(\cdot)$ is the shared expert module, and $\mathcal{T}_N$ is the set of indices for the top-$N$ experts selected by the router.
To encourage expert specialization, each expert is initialized with the mean feature of a specific domain, providing an inductive bias for domain-aware tokenization.
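The routing described in this subsection can be sketched as follows. This is an illustrative approximation under stated assumptions: the experts here are simple scaling functions standing in for the quantizing experts, the gate weights of the selected experts are used as-is (without renormalizing over the top-N), and all names (`token_moe`, `make_scaling_expert`, `router_weights`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_moe(z, router_weights, experts, shared_expert, top_n):
    """Combine the top-N gated domain experts with the always-active shared expert."""
    logits = [sum(w * v for w, v in zip(row, z)) for row in router_weights]
    gates = softmax(logits)
    top = sorted(range(len(gates)), key=lambda k: gates[k], reverse=True)[:top_n]
    out = list(shared_expert(z))  # the shared expert bypasses the router
    for k in top:
        expert_out = experts[k](z)
        out = [o + gates[k] * e for o, e in zip(out, expert_out)]
    return out, top, gates

def make_scaling_expert(scale):
    """Hypothetical stand-in; a real expert would quantize z with its codebooks."""
    return lambda z: [scale * v for v in z]

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]
shared_expert = make_scaling_expert(0.1)
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row per domain expert

out, top, gates = token_moe([1.0, 2.0], router_weights, experts, shared_expert, top_n=2)
print(top, [round(g, 3) for g in gates], out)
```

For the input `[1.0, 2.0]` the router logits are 1.0, 2.0, and 1.5, so the second and third experts are selected. The shared expert contributes regardless of the router's decision, which is what lets common knowledge flow to every domain.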
4.2.4. Codebook-based Identifiers
Within each expert (both domain-specific and shared), Residual Quantization (RQ) is employed to discretize the item latent embeddings into compact token sequences.
Components:
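- Codebooks ($\{\mathcal{C}_1, \ldots, \mathcal{C}_L\}$): A series of $L$ codebooks, each containing $T$ code vectors.

Process:
- For a latent embedding $\mathbf{z}$ (output of the shared encoder, before TokenMoE routing), RQ approximates it hierarchically.
- At each level $l$ (from 1 to $L$):
    - The residual $\mathbf{r}_l$ from the previous step is used. Initially, $\mathbf{r}_1 = \mathbf{z}$.
    - The nearest code vector from codebook $\mathcal{C}_l$ to the current residual is selected: $c_l = \arg\min_{t \in \{1, \ldots, T\}} \|\mathbf{r}_l - \mathbf{e}_t\|_2$, with $\mathbf{e}_t \in \mathcal{C}_l$.
    - The residual is updated for the next level: $\mathbf{r}_{l+1} = \mathbf{r}_l - \mathbf{e}_{c_l}$.
- The sum of all selected code vectors approximates the original latent embedding: $E_k(\mathbf{z}) = \sum_{l=1}^{L} \mathbf{e}_{c_l}$. Here, $E_k(\mathbf{z})$ refers to the output of the $k$-th expert module, which effectively performs this quantization.
- Each selected code vector corresponds to a discrete index $c_l$. The final discrete codeword for an item is a combination of these indices and the chosen expert IDs: $\mathbf{c} = (c_1, \ldots, c_L, a_1, \ldots, a_N)$, where $a_n$ indicates the expert ID of the $n$-th top-$N$ expert chosen by the router.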
Optimization:
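The quantization process is trained using the RQ loss:

$$
\mathcal{L}_{\mathrm{RQ}} = \sum_{l=1}^{L} \left( \left\| \mathrm{sg}[\mathbf{r}_l] - \mathbf{e}_{c_l} \right\|_2^2 + \beta \left\| \mathbf{r}_l - \mathrm{sg}[\mathbf{e}_{c_l}] \right\|_2^2 \right),
$$

where $\mathbf{r}_l$ is the residual vector at level $l$, $\mathbf{e}_{c_l}$ is the selected code vector from $\mathcal{C}_l$, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\beta$ is a balancing hyperparameter.
- The first term encourages the code vector to match the residual (training the codebooks).
- The second term forces the encoder and router to produce latent embeddings that are close to the selected quantized code vector (making the encoder and router commit to the quantization).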
4.2.5. MI Calibration
To address semantic imbalance and ensure consistent informativeness across diverse domains, UniTok introduces a mutual information (MI) calibration mechanism.
Mechanism:
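- The Hilbert-Schmidt Independence Criterion (HSIC) is used as a proxy for MI. A higher HSIC value indicates stronger dependence and thus more preserved information.
- For each domain $\mathcal{D}_k$, $\hat{I}^{(k)}$ measures the dependence between the input semantic embeddings $\mathbf{X}^k$ and their latent embeddings $\mathbf{Z}^k$ in a Reproducing Kernel Hilbert Space (RKHS):

$$
\hat{I}^{(k)} = \mathrm{HSIC}(\mathbf{X}^k, \mathbf{Z}^k) = \frac{1}{(n_k - 1)^2}\, \mathrm{tr}\left( \mathbf{K}_X \mathbf{H} \mathbf{K}_Z \mathbf{H} \right),
$$

where $n_k$ is the number of items from domain $\mathcal{D}_k$ being compared, and:
- $\mathbf{K}_X, \mathbf{K}_Z \in \mathbb{R}^{n_k \times n_k}$ are Gaussian kernel matrices computed over $\mathbf{X}^k$ and $\mathbf{Z}^k$, respectively. Gaussian kernel matrices measure similarity between data points, implicitly mapping them into a high-dimensional feature space (RKHS).
- $\mathbf{H} = \mathbf{I} - \frac{1}{n_k}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, which ensures zero-mean embeddings in RKHS. $\mathbf{I}$ is the identity matrix, and $\mathbf{1}$ is a vector of all ones.
- $\mathrm{tr}(\mathbf{K}_X \mathbf{H} \mathbf{K}_Z \mathbf{H})$ computes the squared Hilbert-Schmidt norm of the empirical cross-covariance between the two RKHSs, quantifying their dependence.

The following figure (Figure 3 from the original paper) illustrates the MI calibration process.

![Figure 3: The schematic overview of MI calibration.](/files/papers/691c3d1225edee2b759f32ed/images/3.jpg)

Optimization: The MI calibration loss ($\mathcal{L}_{\mathrm{MI}}$) is designed to enforce semantic balance across domains:

$$
\mathcal{L}_{\mathrm{MI}} = \mathrm{Var}\big[\hat{I}^{(k)}\big] - \gamma\, \mathbb{E}\big[\hat{I}^{(k)}\big],
$$

where $\hat{I}^{(k)}$ is the MI estimate for domain $\mathcal{D}_k$, and $\gamma$ is a weighting hyperparameter.
- The first term, $\mathrm{Var}[\hat{I}^{(k)}]$, penalizes high variance of MI across domains, aiming to reduce semantic imbalance and ensure consistent informativeness.
- The second term, $-\gamma\, \mathbb{E}[\hat{I}^{(k)}]$, encourages each domain to retain a sufficient amount of domain-specific information (maximizing the expected MI).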
4.2.6. Optimization
The overall loss function for training UniTok combines the three losses:
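$$
\mathcal{L} = \mathcal{L}_{\mathrm{Rec}} + \lambda_{\mathrm{RQ}}\, \mathcal{L}_{\mathrm{RQ}} + \lambda_{\mathrm{MI}}\, \mathcal{L}_{\mathrm{MI}},
$$

where $\lambda_{\mathrm{RQ}}$ and $\lambda_{\mathrm{MI}}$ are hyperparameters that control the strength of the RQ and MI losses, respectively.
Once trained, UniTok tokenizes each item into discrete semantic tokens. These item token sequences are then used as input to LLM-based recommender systems. For example, user interaction histories are converted into sequences of item tokens, and the LLM learns to predict the tokens of the next interacted item.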
4.2.7. Theoretical Analyses
The paper provides theoretical justifications for the design choices and effectiveness of UniTok.
4.2.7.1. Theorem 1: Higher Entropy Token Space
Statement: The token space induced by UniTok exhibits strictly higher entropy than that of standard codebook-based methods:

$$
H(\mathcal{C}_{\mathrm{UniTok}}) > H(\mathcal{C}_{\mathrm{standard}}),
$$

where $\mathcal{C}_{\mathrm{UniTok}}$ and $\mathcal{C}_{\mathrm{standard}}$ denote the discrete token distributions generated by UniTok and standard codebook-based methods (using a single codebook), respectively.
Explanation: Entropy in information theory measures the uncertainty or diversity of a random variable. A higher entropy for a token space means that there are more unique and distinct tokens that can be generated, implying a richer and more expressive representation capacity.
The proof relies on the chain rule of entropy. For standard codebook-based methods with $L$ levels and $T$ code vectors per codebook, the entropy is $L \log T$. For UniTok, the token generation involves two steps: first, selecting an expert (with router entropy $H(G)$), and then generating a token from that expert's codebooks (which also has entropy $L \log T$). Since the router entropy is always positive when there is more than one expert ($K > 1$), UniTok's total entropy is $H(G) + L \log T$, which is strictly greater than $L \log T$.
Implication: This theorem proves that TokenMoE significantly expands the capacity of the token space, allowing UniTok to represent items with greater diversity and expressiveness.
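The chain-rule argument can be checked with simple arithmetic. The sketch below assumes 4 codebook levels, 256 code vectors per codebook, and 4 experts; these sizes are illustrative and are not taken from the paper. Entropies are measured in bits.

```python
import math

def codebook_entropy_bits(num_levels, codebook_size):
    """Entropy of the code sequence when every code is used uniformly."""
    return num_levels * math.log2(codebook_size)

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

standard_bits = codebook_entropy_bits(4, 256)       # single tokenizer
router_bits = entropy_bits([0.25, 0.25, 0.25, 0.25])  # router spreads items evenly
unitok_bits = router_bits + standard_bits           # chain rule: expert, then codes

# A router that always picks the same expert adds no entropy.
collapsed_bits = entropy_bits([1.0, 0.0, 0.0, 0.0])
print(standard_bits, router_bits, unitok_bits)
```

With these sizes the single tokenizer has 32 bits of capacity and the router adds 2 more. The last line of the computation illustrates the condition behind the strict inequality: the gain comes from the router actually distributing items over more than one expert.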
4.2.7.2. Theorem 2: Lower Quantization Error
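Statement: Let $\mathcal{E}_{\mathrm{UniTok}}$ and $\mathcal{E}_{\mathrm{standard}}$ denote the expected quantization error of UniTok and standard codebook-based methods using a single shared codebook, respectively. Then, the following inequality holds:

$$
\mathcal{E}_{\mathrm{UniTok}} \le \mathcal{E}_{\mathrm{standard}}.
$$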
Explanation: Quantization error refers to the loss of information when a continuous value is converted to a discrete one. A lower quantization error means that the discrete representation (the tokens) more accurately approximates the original continuous item embedding, thus preserving more semantic information.
The proof uses Jensen's inequality and the convexity of the squared norm function. It treats the output of UniTok as a convex combination (weighted average by router probabilities) of the quantization errors from individual experts. Since a convex combination of errors can achieve a lower or equal error than any single expert (or a single standard codebook which can be seen as a degenerate MoE with one active expert), UniTok can achieve better approximation.
Implication: This theorem theoretically guarantees that UniTok's TokenMoE architecture, by leveraging expert specialization, can represent items more precisely and reduce the information loss during tokenization compared to a single, monolithic tokenizer.
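The Jensen step in this argument can be verified numerically. The sketch assumes an idealized setting in which the combined output is a convex combination of expert outputs (weights that are non-negative and sum to one), so that the combined error vector is the same convex combination of the per-expert error vectors. The error vectors and weights are made-up values for illustration.

```python
def squared_norm(vec):
    """Squared Euclidean norm of a vector."""
    return sum(v * v for v in vec)

def convex_combination(weights, vectors):
    """Weighted average of vectors; weights are assumed to sum to one."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

# Hypothetical per-expert quantization errors z - E_k(z) and router weights.
errors = [[0.4, -0.2], [-0.3, 0.1], [0.1, 0.5]]
weights = [0.5, 0.3, 0.2]

# Error of the combined output versus the weighted average of individual errors.
mixed_error = squared_norm(convex_combination(weights, errors))
average_error = sum(w * squared_norm(e) for w, e in zip(weights, errors))
print(mixed_error, average_error)
```

Because the squared norm is convex, the combined error never exceeds the weighted average of the individual squared errors. In this example the errors point in different directions and partly cancel, so the gap is large.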
4.2.7.3. Theorem 3: Performance Stability via MI Calibration
Statement: Suppose that the loss $\mathcal{L}^{(k)}$ on the $k$-th domain is Lipschitz-continuous with respect to the informativeness of representations. Then, the performance variability across domains is upper-bounded by the variance of MI:

$$
\mathrm{Var}\big[\mathcal{L}^{(k)}\big] \le C \cdot \mathrm{Var}\big[\hat{I}^{(k)}\big],
$$

where $\mathrm{Var}[\hat{I}^{(k)}]$ is the variance of MI estimates across domains and $C$ is a constant.
Explanation: Lipschitz continuity implies that a small change in the input (here, informativeness or MI) leads to at most a proportionally small change in the output (here, the loss or performance). Performance variability refers to how much the model's performance differs from one domain to another.
The proof links the performance gap between any two domains ($|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}|$) to the difference in their MI estimates ($|\hat{I}^{(i)} - \hat{I}^{(j)}|$). It then shows that the maximum difference in MI estimates across domains is bounded by a term related to the variance of MI across all domains. Therefore, reducing the variance of MI estimates directly reduces the performance variability.
Implication: This theorem rigorously supports the effectiveness of the MI calibration mechanism. By minimizing $\mathrm{Var}[\hat{I}^{(k)}]$ through $\mathcal{L}_{\mathrm{MI}}$, UniTok ensures that the learned latent embeddings retain a consistent level of semantic information across all domains, leading to more stable and balanced recommendation performance in a multi-domain setting.
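One way to see how a pairwise performance gap can be tied to the MI variance is the short derivation below. It is a sketch under assumed notation, not the paper's proof: $\ell$ denotes the Lipschitz constant, $\hat{I}^{(k)}$ the MI estimate of domain $k$, $\bar{I}$ the mean of these estimates, and the variance is the population variance over the $K$ domains.

```latex
\begin{aligned}
\big|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}\big|
  &\le \ell\, \big|\hat{I}^{(i)} - \hat{I}^{(j)}\big|
  && \text{(Lipschitz continuity)} \\
\big(\hat{I}^{(i)} - \hat{I}^{(j)}\big)^2
  &\le 2\big(\hat{I}^{(i)} - \bar{I}\big)^2 + 2\big(\hat{I}^{(j)} - \bar{I}\big)^2
  \le 2K\, \mathrm{Var}\big[\hat{I}^{(k)}\big]
  && \text{(since } \textstyle\sum_{k}\big(\hat{I}^{(k)} - \bar{I}\big)^2 = K\,\mathrm{Var}\big[\hat{I}^{(k)}\big]\text{)} \\
\max_{i,j} \big|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}\big|
  &\le \ell\, \sqrt{2K\, \mathrm{Var}\big[\hat{I}^{(k)}\big]}
\end{aligned}
```

Under these assumptions, driving the MI variance toward zero forces the largest gap between any two domains toward zero as well, which is the intuition behind the stability claim.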
The following figure (Figure 2 from the original paper) provides a schematic overview of the UniTok framework, illustrating the interaction between its components.
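This image is a schematic diagram showing the overall structure of the proposed UniTok framework. It shows item inputs from different item domains being transformed into semantic embeddings by a shared encoder and assigned by a router to specialized experts, with the aim of preserving domain-specific information while mitigating semantic imbalance across domains.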
The UniTok training procedure is formally described in Algorithm 1 of the original paper, reproduced below:
Algorithm 1: UniTok Training Procedure
Input: Multi-domain item datasets $\left\{ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_K \right\}$ with text metadata (e.g., title, category, description)
Parameters: Encoder $f_\theta$, decoder $g_\phi$, router $h(\cdot)$, expert modules $\{E_1, ..., E_K, E_{\mathrm{share}}\}$ with associated codebooks
Output: Discrete tokens $\mathbf{c}_i^k \in \mathcal{C}$ for all items
1: Initialize all modules: $f_\theta, g_\phi$, router $h(\cdot)$, experts $E_k$ (including $E_{\mathrm{share}}$), and their codebooks
2: for $k = 1$ to $K$ do
3: for item $i_i^k \in \mathcal{D}_k$ do
4: $\mathbf{x}_i^k \gets$ SemanticEmbedding$(i_i^k)$ // Pre-compute semantic embeddings
5: end for
6: end for
7: for epoch $= 1$ to $E$ do
8: Sample mini-batch $B = \{ \mathbf{x}_i^k \}$ from mixed domains
9: for each item $\mathbf{x}_i^k$ in batch $B$ do
10: $\mathbf{z}_i^k \gets f_\theta(\mathbf{x}_i^k)$ // Encode input
11: $\hat{\mathbf{z}}_i^k, \mathbf{c}_i^k \gets \mathrm{TokenMoE}(\mathbf{z}_i^k)$ // Discretize via TokenMoE
12: $\hat{\mathbf{x}}_i^k \gets g_\phi(\hat{\mathbf{z}}_i^k)$ // Decode reconstruction
13: end for
14: Compute reconstruction loss $\mathcal{L}_{\mathrm{Rec}}$ (Eq. (3) of the main manuscript)
15: Compute RQ loss $\mathcal{L}_{\mathrm{RQ}}$ (Eq. (8) of the main manuscript)
16: Compute MI loss $\mathcal{L}_{\mathrm{MI}}$ (Eq. (10) of the main manuscript)
17: Update $f_\theta, g_\phi, h(\cdot)$, and expert parameters
18: end for
19: return token assignments $\mathbf{c}_i^k$ for all items
Algorithm 1 Explanation:
- Initialization: All model components are initialized: the encoder ($f_\theta$), the decoder ($g_\phi$), the router ($h(\cdot)$), and all expert modules ($E_1, \dots, E_K$ for the domain-specific experts, $E_{\mathrm{share}}$ for the shared expert), along with their respective codebooks.
- Semantic Embedding Pre-computation: For each item $i_i^k$ in every domain $\mathcal{D}_k$, its semantic embedding $\mathbf{x}_i^k$ is obtained using a pre-trained embedding model. This step is performed once before the main training loop.
- Epoch Loop: The model trains for a specified number of epochs ($E$).
- Mini-batch Sampling: In each epoch, a mini-batch $B$ of semantic embeddings is sampled. Crucially, this mini-batch contains items from mixed domains, reflecting the multi-domain nature of the problem.
- Item Processing Loop: For each item in the mini-batch:
  - Encoding: The shared encoder $f_\theta$ converts the semantic embedding $\mathbf{x}_i^k$ into a latent embedding $\mathbf{z}_i^k$.
  - Tokenization via TokenMoE: The TokenMoE module takes $\mathbf{z}_i^k$, routes it through the selected experts (domain-specific and shared), applies residual quantization, and outputs a quantized embedding $\hat{\mathbf{z}}_i^k$ and the corresponding discrete token $\mathbf{c}_i^k$.
  - Decoding: The shared decoder $g_\phi$ attempts to reconstruct the original semantic embedding from the quantized embedding, producing $\hat{\mathbf{x}}_i^k$.
- Loss Computation: After processing all items in the mini-batch, the three loss components are calculated: the reconstruction loss $\mathcal{L}_{\mathrm{Rec}}$ from Eq. (3), the residual quantization loss $\mathcal{L}_{\mathrm{RQ}}$ from Eq. (8), and the mutual information calibration loss $\mathcal{L}_{\mathrm{MI}}$ from Eq. (10).
- Parameter Update: All learnable parameters (of the encoder, decoder, router, and expert modules with their codebooks) are updated using an optimizer based on the total loss (Eq. 11).
- Output: After training completes, the algorithm returns the discrete token assignments $\mathbf{c}_i^k$ for all items, which can then be used by LLM-based recommenders.
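The residual quantization step inside the tokenization stage can be sketched in a few lines. This is a generic illustration of residual quantization, not the paper's TokenMoE implementation: routing, expert weighting, and training are omitted, the codebooks are random rather than learned, and the sizes (`DIM`, `LEVELS`, `CODES`) are toy values chosen for readability. All function and variable names are hypothetical.

```python
import random

def nearest_code(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` in squared L2 distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda j: sq_dist(codebook[j]))
    return index, codebook[index]

def residual_quantize(z, codebooks):
    """Quantize z level by level; each level encodes what earlier levels left over."""
    residual = list(z)
    z_hat = [0.0] * len(z)
    tokens = []
    for codebook in codebooks:
        index, code = nearest_code(residual, codebook)
        tokens.append(index)                                # one discrete token per level
        z_hat = [q + c for q, c in zip(z_hat, code)]        # running quantized embedding
        residual = [r - c for r, c in zip(residual, code)]  # remainder for the next level
    return tokens, z_hat

random.seed(0)
DIM, LEVELS, CODES = 4, 3, 8  # toy sizes; the paper reports 32, 4, and 256
codebooks = [
    [[random.gauss(0.0, 1.0 / (level + 1)) for _ in range(DIM)] for _ in range(CODES)]
    for level in range(LEVELS)
]
z = [random.gauss(0.0, 1.0) for _ in range(DIM)]  # stand-in for a latent embedding
tokens, z_hat = residual_quantize(z, codebooks)
print("tokens:", tokens)
print("z_hat:", [round(v, 3) for v in z_hat])
```

The returned `tokens` list plays the role of the discrete identifier for one item, and `z_hat` is what a decoder would receive for reconstruction.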
5. Experimental Setup
5.1. Datasets
The experiments are conducted on a wide range of real-world datasets from various domains. The paper uses ten source datasets for training and evaluation of UniTok's multi-domain performance, and three target datasets for zero-shot generalization evaluation.
Each item in these datasets contains metadata such as title, category, and features. An example of an item could be a product with its title ("Echo Dot (3rd Gen) - Smart speaker with Alexa"), category ("Electronics > Smart Home > Smart Speakers"), and features ("Voice control, Bluetooth, Wi-Fi enabled").
The following are the results from [Table 6] of the original paper:
| Dataset | # of users | # of items | # of interactions |
| --- | --- | --- | --- |
| Beauty | 22,363 | 12,101 | 198,502 |
| Cellphones | 27,879 | 10,429 | 194,439 |
| Grocery | 14,681 | 8,713 | 151,254 |
| Instruments | 24,772 | 9,922 | 206,153 |
| Office Products | 4,905 | 2,420 | 53,258 |
| Pet Supplies | 19,856 | 8,510 | 157,836 |
| Tools | 16,638 | 10,217 | 134,476 |
| Toys | 19,412 | 11,924 | 167,597 |
| Games | 24,303 | 10,672 | 231,780 |
| Yelp | 30,431 | 20,033 | 316,354 |
| Clothing | 39,387 | 23,033 | 278,677 |
| Health | 38,609 | 18,533 | 346,355 |
| Sports | 35,598 | 18,357 | 296,337 |
Dataset Characteristics:
- Source Domains (10 datasets): Beauty, Cellphones, Grocery, Instruments, Office Products, Pet Supplies, Tools, Toys, Games, and Yelp. These datasets represent diverse categories with varying numbers of users, items, and interactions, making them suitable for evaluating multi-domain performance.
- Target Domains (3 datasets for zero-shot): Clothing, Health, and Sports. These are used to test UniTok's generalization ability to unseen domains.
- Metadata: The reliance on item metadata (title, category, features) for generating semantic embeddings is crucial, as UniTok operates independently of user data for tokenization.

These datasets were chosen for their diversity and common use in recommendation research. Their varied characteristics allow for a robust evaluation of UniTok's ability to handle different data distributions and semantic nuances in a unified manner.
5.2. Evaluation Metrics
The paper adopts two widely used ranking metrics to evaluate recommendation performance under a full-ranking protocol: Recall@M and NDCG@M. The results reported below use $M = 10$.
5.2.1. Recall@M (R@M)
Conceptual Definition: Recall@M measures the proportion of relevant items that are retrieved within the top $M$ recommendations. It captures the recommender's ability to surface the user's preferred items among the highest-ranked suggestions, regardless of their order within the top $M$.
Mathematical Formula:
$$\mathrm{Recall@}M = \frac{|\text{Relevant Items} \cap \text{Top-}M\text{ Recommended Items}|}{|\text{Relevant Items}|}$$
Symbol Explanation:
- $|\text{Relevant Items} \cap \text{Top-}M\text{ Recommended Items}|$: The number of relevant items (i.e., items in the ground truth set) that are present in the top $M$ recommendations.
- $|\text{Relevant Items}|$: The total number of relevant items for a given user or interaction.
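As a small illustration of the definition above, the following sketch computes Recall@M for a single user. The function name and the example item IDs are hypothetical, and the ranked list is assumed to contain no duplicate items.

```python
def recall_at_m(ranked_items, relevant_items, m):
    """Fraction of the relevant items that appear in the top-m of the ranked list."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0  # convention chosen here for users with no ground-truth items
    hits = sum(1 for item in ranked_items[:m] if item in relevant)
    return hits / len(relevant)

ranked = ["a", "b", "c", "d", "e"]  # model output, best first
relevant = {"b", "e", "z"}          # ground-truth items for this user
print(recall_at_m(ranked, relevant, 3))  # only "b" is in the top 3
print(recall_at_m(ranked, relevant, 5))  # "b" and "e" are in the top 5
```

In an evaluation run, this per-user value would be averaged over all test users.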
5.2.2. NDCG@M (Normalized Discounted Cumulative Gain)
Conceptual Definition: NDCG@M is a metric that considers both the relevance of recommended items and their ranking positions. It assigns higher scores to relevant items that appear at higher ranks and penalizes relevant items that appear at lower ranks. It is normalized to lie between 0 and 1, where 1 represents the ideal ranking.
Mathematical Formula:
$$\mathrm{NDCG@}M = \frac{\mathrm{DCG@}M}{\mathrm{IDCG@}M}, \qquad \mathrm{DCG@}M = \sum_{i=1}^{M} \frac{\mathbb{I}(\text{item}_i \in \text{Relevant Items})}{\log_2(i+1)}$$
Symbol Explanation:
- $M$: The number of top recommendations being considered.
- $\mathbb{I}(\text{item}_i \in \text{Relevant Items})$: An indicator function that is 1 if the item at position $i$ in the ranked list is a relevant item (i.e., in the ground truth), and 0 otherwise. This effectively assigns a binary relevance score.
- $\log_2(i+1)$: The logarithmic discount factor. Items at higher ranks (smaller $i$) receive less discount, meaning they contribute more to the DCG score.
- $\mathrm{IDCG@}M$: Ideal Discounted Cumulative Gain at $M$. This is the maximum possible DCG value for the given set of ground truth relevant items. It is calculated by placing all ground truth items at the top positions, which ensures that NDCG is normalized between 0 and 1.
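The NDCG@M definition above translates directly into code for the binary-relevance case. This is a minimal sketch with hypothetical function names and example data. It assumes a duplicate-free ranked list and computes the ideal DCG by placing as many relevant items as fit into the top $M$ positions.

```python
import math

def dcg_at_m(ranked_items, relevant_items, m):
    """Sum of 1 / log2(position + 1) over relevant items in the top-m (positions start at 1)."""
    relevant = set(relevant_items)
    return sum(
        1.0 / math.log2(pos + 1)
        for pos, item in enumerate(ranked_items[:m], start=1)
        if item in relevant
    )

def ndcg_at_m(ranked_items, relevant_items, m):
    """DCG@m divided by the DCG of an ideal ranking of the same relevant set."""
    ideal_hits = min(len(set(relevant_items)), m)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg_at_m(ranked_items, relevant_items, m) / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
print(round(ndcg_at_m(ranked, relevant, 5), 4))
```

Moving a relevant item from position 5 to position 1 raises the score, which is the rank sensitivity that Recall@M does not capture.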
5.3. Baselines
UniTok is compared against nine benchmark recommendation methods, categorized into two groups:
5.3.1. Collaborative Filtering (CF) Methods
These are traditional recommenders that primarily rely on user-item interaction patterns.
- MF (Matrix Factorization) (Rendle et al. 2009): A fundamental collaborative filtering technique that aims to discover latent features that explain user-item interactions by factorizing the user-item interaction matrix.
- LightGCN (He et al. 2020): A simplified Graph Convolutional Network (GCN) for recommendation that leverages graph convolutional layers to propagate embeddings across the user-item interaction graph.
- SASRec (Self-Attentive Sequential Recommendation) (Kang and McAuley 2018): A sequential recommendation model that uses a Transformer-like self-attention mechanism to capture long-term dependencies in user interaction sequences.
- Bert4Rec (Sun et al. 2019): Another sequential recommendation model that adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to model user behavior sequences bidirectionally.
5.3.2. Item Tokenization-aided LLM-based Recommendation Methods
These methods are more directly comparable as they also focus on item tokenization for LLM-based recommendations.
- P5-TID (P5 with title-based item IDs) (Hua et al. 2023): An LLM-based method where items are represented by their title text, which is then fed to the LLM for generative recommendation.
- P5-SemID (P5 with semantic item IDs) (Hua et al. 2023): Similar to P5-TID, but uses semantic item IDs, that is, identifiers built from item metadata so that they reflect item semantics, as tokens for the LLM.
- TIGER (Transformer Index for GEnerative Recommenders) (Rajput et al. 2023): A pioneering codebook-based item tokenization method that converts item metadata into hierarchical code sequences for generative retrieval.
- LC-Rec (Zheng et al. 2024): This method learns code indices for items via vector quantization and then tunes LLMs through alignment tasks for direct item generation.
- LETTER (Learnable Tokenizer for Generative Recommendation) (Wang et al. 2024): A method that proposes a learnable tokenizer for LLM-based generative recommendation, designed to integrate collaborative signals and mitigate code assignment bias. It also relies on codebook-based identifiers.

These baselines are chosen to represent both traditional and modern LLM-based recommendation paradigms, allowing for a comprehensive comparison against UniTok's unified item tokenization approach.
5.4. Implementation Details
- Item Tokenization: UniTok uses 4-level codebook-based identifiers. Each codebook comprises 256 code vectors, and each code vector has a dimension of 32.
- Loss Hyperparameters: The RQ loss weight $\lambda_{\mathrm{RQ}}$ is set to 1 and the MI loss weight $\lambda_{\mathrm{MI}}$ is set to 0.03. The coefficient inside the RQ loss (Eq. 8) is set to 0.25, and the coefficient inside the MI loss (Eq. 10) is set to 1.
- Pre-trained Content Encoder: For semantic embeddings, a pre-trained TIGER encoder is used, specifically fine-tuned on the Beauty dataset. This encoder is chosen for its effectiveness in LLM-based generative recommendation.
- Training: UniTok is trained for 10,000 epochs using the AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2019), with a learning rate of 1e-3 and a batch size of 1,024.
- Baselines: For TIGER, fine-tuning is performed to convergence based on validation performance, with learning rates of 1e-3, 5e-4, 1e-4, 2e-4, and 3e-4. Other baseline methods are implemented using the parameters described in their original articles.
- Hardware: All experiments are conducted on Intel(R) 12-Core (TM) E5-1650 v4 CPUs @ 3.60 GHz and NVIDIA GeForce RTX 3090 GPUs.
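The settings listed above can be collected into a single configuration object, which also makes two derived quantities easy to inspect. This is only an organizational sketch: the class name, field names, and helper methods are hypothetical, and the two inner loss coefficients are named generically because their symbols are not given in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UniTokConfig:  # hypothetical container, not from the paper's code
    num_levels: int = 4          # codebook levels per identifier
    codebook_size: int = 256     # code vectors per codebook
    code_dim: int = 32           # dimension of each code vector
    lambda_rq: float = 1.0       # weight of the RQ loss
    lambda_mi: float = 0.03      # weight of the MI loss
    rq_inner_coef: float = 0.25  # coefficient inside the RQ loss (Eq. 8)
    mi_inner_coef: float = 1.0   # coefficient inside the MI loss (Eq. 10)
    epochs: int = 10_000
    learning_rate: float = 1e-3
    batch_size: int = 1024
    optimizer: str = "AdamW"

    def params_per_codebook_set(self) -> int:
        """Trainable scalars in one stack of `num_levels` codebooks."""
        return self.num_levels * self.codebook_size * self.code_dim

    def sequences_per_codebook_set(self) -> int:
        """Number of distinct code sequences one stack of codebooks can express."""
        return self.codebook_size ** self.num_levels

cfg = UniTokConfig()
print(cfg.params_per_codebook_set())     # 4 * 256 * 32
print(cfg.sequences_per_codebook_set())  # 256 ** 4
```

With these values, one stack of codebooks holds 32,768 trainable scalars and can address 4,294,967,296 distinct code sequences.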
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Can One Tokenizer Serve All Domains? (Recommendation Accuracy)
This section evaluates UniTok's recommendation accuracy against benchmarks across ten diverse source datasets. A key distinction is that UniTok trains a single unified model to handle all ten datasets jointly, unlike competitors that require separate tokenizers for each dataset.
The following are the results from [Table 1] of the original paper:
| Method | Beauty | Cellphones | Grocery | Instruments | Office | Pet Supplies | Tools | Toys | Games | Yelp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MF | 0.0369 | 0.0267 | 0.0216 | 0.0710 | 0.0255 | 0.0268 | 0.0169 | 0.0192 | 0.0366 | 0.0144 |
| LightGCN | 0.0285 | 0.0456 | 0.0357 | 0.0781 | 0.0301 | 0.0289 | 0.0257 | 0.0287 | 0.0417 | 0.0195 |
| SASRec | 0.0314 | 0.0446 | 0.0376 | 0.0609 | 0.0285 | 0.0301 | 0.0234 | 0.0239 | 0.0412 | 0.0183 |
| Bert4Rec | 0.0194 | 0.0268 | 0.0237 | 0.0573 | 0.0274 | 0.0161 | 0.0092 | 0.0177 | 0.0379 | 0.0131 |
| P5-TID | 0.0255 | 0.0357 | 0.0316 | 0.0721 | 0.0239 | 0.0243 | 0.0198 | 0.0202 | 0.0388 | 0.0154 |
| P5-SemID | 0.0304 | 0.0406 | 0.0351 | 0.0730 | 0.0283 | 0.0282 | 0.0237 | 0.0231 | 0.0432 | 0.0188 |
| TIGER | 0.0324 | 0.0446 | 0.0375 | 0.0788 | 0.0295 | 0.0279 | 0.0284 | 0.0268 | 0.0427 | 0.0208 |
| LC-Rec | 0.0381 | 0.0458 | 0.0369 | 0.0802 | 0.0311 | 0.0335 | 0.0307 | 0.0279 | 0.0451 | 0.0215 |
| LETTER | 0.0364 | 0.0473 | 0.0392 | 0.0831 | 0.0326 | 0.0307 | 0.0298 | 0.0291 | 0.0469 | 0.0231 |
| UniTok | 0.0478 | 0.0647 | 0.0533 | 0.0884 | 0.0432 | 0.0496 | 0.0439 | 0.0442 | 0.0476 | 0.0321 |
| Gain | 25.46% | 36.78% | 35.97% | 6.38% | 32.52% | 48.06% | 42.99% | 51.89% | 1.49% | 38.96% |
(Note: NDCG@10 values are shown. Improvements are statistically significant on average (p = 0.0219 < 0.05) based on paired t-tests over five runs across all datasets.)
Analysis:
- UniTok consistently outperforms all benchmark methods across all ten datasets in terms of NDCG@10. The gains over the best competitor range from 1.49% (Games) to 51.89% (Toys), which supports the effectiveness of UniTok's unified tokenization framework in handling multiple domains.
- Superiority of LLM-based Tokenization: The results show that item tokenization-aided LLM-based recommender systems (e.g., TIGER, LC-Rec, LETTER, and UniTok) generally perform better than traditional collaborative filtering methods (MF, LightGCN, SASRec, Bert4Rec). This highlights the benefit of leveraging rich semantic understanding beyond user-item interactions alone.
- Impact of Item Tokenization Design: Among the LLM-based methods, those incorporating carefully designed item tokenization (TIGER, LC-Rec, LETTER, UniTok) show further gains compared to methods relying solely on item metadata for tokenization (P5-TID, P5-SemID). This underscores item tokenization as a crucial bridge for effective generative recommendation.
- UniTok achieves this performance with a single model for all domains, unlike the domain-specific baselines, which demonstrates its practical generality.
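The "Gain" row in Table 1 is the relative improvement of UniTok over the best competing method on each dataset. The sketch below recomputes it for three columns using the NDCG@10 values from the table. The function name is hypothetical.

```python
def relative_gain(ours, baselines):
    """Percentage improvement of `ours` over the best baseline score."""
    best = max(baselines.values())
    return 100.0 * (ours - best) / best

METHODS = ["MF", "LightGCN", "SASRec", "Bert4Rec", "P5-TID",
           "P5-SemID", "TIGER", "LC-Rec", "LETTER"]

# NDCG@10 values copied from Table 1.
beauty = dict(zip(METHODS, [0.0369, 0.0285, 0.0314, 0.0194, 0.0255, 0.0304, 0.0324, 0.0381, 0.0364]))
toys = dict(zip(METHODS, [0.0192, 0.0287, 0.0239, 0.0177, 0.0202, 0.0231, 0.0268, 0.0279, 0.0291]))
games = dict(zip(METHODS, [0.0366, 0.0417, 0.0412, 0.0379, 0.0388, 0.0432, 0.0427, 0.0451, 0.0469]))

for name, ours, table in (("Beauty", 0.0478, beauty), ("Toys", 0.0442, toys), ("Games", 0.0476, games)):
    print(f"{name}: {relative_gain(ours, table):.2f}%")
```

For Beauty the best competitor is LC-Rec (0.0381), so the gain is $(0.0478 - 0.0381) / 0.0381 \approx 25.46\%$, matching the table.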
6.1.2. Is UniTok More Efficient than Traditional Tokenizers? (Model Size)
This section compares the efficiency of UniTok in terms of the total number of trainable parameters against traditional codebook-based competitors (TIGER, LC-Rec, LETTER).
The following are the results from [Table 2] of the original paper:
| Module | Codebook-based methods | UniTok |
| --- | --- | --- |
| Codebook | 0.33M | 0.36M |
| Autoencoder | 87.45M | 8.75M |
| Router | - | 0.01M |
| Total | 87.78M | 9.11M |
Analysis:
- UniTok achieves approximately a 9.63x reduction in the total number of trainable parameters (9.11M for UniTok vs. 87.78M for competitors).
- This efficiency gain primarily comes from the use of a shared autoencoder in UniTok (8.75M parameters) compared to the domain-specific autoencoders used by competitors (87.45M accumulated across 10 domains).
- The additional parameters introduced by UniTok's codebooks (0.36M vs. 0.33M for competitors' accumulated codebooks) and the TokenMoE router (0.01M) are negligible in comparison to the savings from the shared autoencoder.
- This demonstrates that UniTok offers substantial advantages in scalability and deployment efficiency by eliminating the need for domain-specific tokenization models.
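The codebook rows of Table 2 can be sanity-checked from the implementation details (4 levels, 256 code vectors, 32 dimensions). The accounting below is an interpretation rather than something stated in the paper: it assumes one stack of codebooks per domain for the baselines, and one stack per expert for UniTok with ten domain experts plus one shared expert, following the expert set $\{E_1, \dots, E_K, E_{\mathrm{share}}\}$ in Algorithm 1 with $K = 10$.

```python
LEVELS, CODES_PER_LEVEL, CODE_DIM = 4, 256, 32
NUM_DOMAINS = 10

params_per_codebook_set = LEVELS * CODES_PER_LEVEL * CODE_DIM

# Baselines: one codebook stack per domain (assumption).
baseline_codebooks = NUM_DOMAINS * params_per_codebook_set
# UniTok: one stack per domain expert plus one for the shared expert (assumption).
unitok_codebooks = (NUM_DOMAINS + 1) * params_per_codebook_set

print(f"baseline codebooks: {baseline_codebooks / 1e6:.2f}M")
print(f"UniTok codebooks:   {unitok_codebooks / 1e6:.2f}M")

# Overall reduction factor from the totals reported in Table 2.
reduction = 87.78 / 9.11
print(f"reduction factor:   {reduction:.2f}x")
```

Under these assumptions the counts come out at roughly 0.33M and 0.36M, in line with the table, and the ratio of the reported totals is about 9.6.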
6.1.3. Performance under Unified Training Setup
To further validate UniTok's efficiency and architectural advantages, competitors are also evaluated under a single unified training setup, where they are trained jointly across all ten datasets with a comparable number of trainable parameters to UniTok.
The following are the results from [Table 3] of the original paper:
| Method | Beauty R@10 | Beauty N@10 | Cellphones R@10 | Cellphones N@10 | Grocery R@10 | Grocery N@10 |
| --- | --- | --- | --- | --- | --- | --- |
| TIGER | 0.0499 | 0.0267 | 0.0661 | 0.0342 | 0.0576 | 0.0273 |
| LC-Rec | 0.0564 | 0.0302 | 0.0647 | 0.0337 | 0.0584 | 0.0287 |
| LETTER | 0.0528 | 0.0288 | 0.0678 | 0.0363 | 0.0618 | 0.0315 |
| UniTok | 0.0934 | 0.0478 | 0.1251 | 0.0647 | 0.1061 | 0.0533 |
| Gain | 65.60% | 58.28% | 84.51% | 78.23% | 71.68% | 69.21% |
Analysis:
- When forced to train in a unified setup (using a shared model across domains with a parameter budget similar to UniTok's), the competing methods (TIGER, LC-Rec, LETTER) show a substantial performance degradation compared to their single-domain results (Table 1). This is because they struggle to distinguish and specialize for items from different domains when using a naively shared tokenization without UniTok's MoE and MI calibration.
- In contrast, UniTok maintains consistently superior recommendation performance, achieving large improvements (e.g., 84.51% in Recall@10 on Cellphones) even with a similar trainable parameter budget.
- This further highlights the strength of UniTok's modular TokenMoE architecture, which allows domain-specific experts to learn semantics independently while operating within a unified token space, making it effective for multi-domain scenarios.
6.1.4. Can UniTok be Generalized to Unseen Domains? (Zero-shot Performance)
To assess UniTok's generalization ability, a zero-shot setting is used. UniTok is trained once on the ten source datasets and then directly tested on three unseen target domains (Clothing, Health, Sports) without any additional training or fine-tuning.
The following are the results from [Table 4] of the original paper:
| Method | Clothing R@10 | Clothing N@10 | Health R@10 | Health N@10 | Sports R@10 | Sports N@10 |
| --- | --- | --- | --- | --- | --- | --- |
| TIGER | 0.0501 | 0.0242 | 0.0677 | 0.0342 | 0.0469 | 0.0228 |
| LC-Rec | 0.0527 | 0.0266 | 0.0694 | 0.0358 | 0.0494 | 0.0246 |
| LETTER | 0.0515 | 0.0257 | 0.0717 | 0.0375 | 0.0510 | 0.0265 |
| UniTok | 0.0592 | 0.0288 | 0.0835 | 0.0442 | 0.0591 | 0.0298 |
| Gain | 12.33% | 8.27% | 16.46% | 17.87% | 15.88% | 12.45% |
Analysis:
- UniTok outperforms existing item tokenization-based recommender systems on all three unseen domains. The improvements in NDCG@10 are notable, reaching up to 17.87% on Health.
- This robust zero-shot performance is a critical finding. It demonstrates that UniTok learns a discrete token space that captures transferable item semantics across diverse domains, without needing any additional training or fine-tuning for new domains.
- In contrast, competitors would typically require retraining on each new dataset to achieve reasonable tokenization results, highlighting UniTok's superior generalizability and practical utility in dynamic environments.
6.2. Ablation Studies / Parameter Analysis
6.2.1. What Makes UniTok Effective? (Ablation Study)
An ablation study is conducted to assess the contribution of each core module within UniTok by progressively removing or modifying them.
- UniTok-1: Removes TokenMoE and MI calibration, using only a single set of codebooks (monolithic approach without MoE).
- UniTok-2: Keeps TokenMoE, but removes the shared expert and MI calibration.
- UniTok-3: Removes only the MI calibration (keeps TokenMoE with the shared expert).
- UniTok: The full model with all components.

The following are the results from [Table 8] of the original paper:
| Method | Beauty R@10 | Beauty N@10 | Cellphones R@10 | Cellphones N@10 | Grocery R@10 | Grocery N@10 | Instruments R@10 | Instruments N@10 | Yelp R@10 | Yelp N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UniTok-1 | 0.0558 | 0.0304 | 0.0702 | 0.0371 | 0.0633 | 0.0342 | 0.0926 | 0.0742 | 0.0345 | 0.0177 |
| UniTok-2 | 0.0896 | 0.0436 | 0.1194 | 0.0606 | 0.0989 | 0.0497 | 0.1273 | 0.0851 | 0.0624 | 0.0281 |
| UniTok-3 | 0.0915 | 0.0457 | 0.1225 | 0.0622 | 0.1044 | 0.0515 | 0.1327 | 0.0868 | 0.0657 | 0.0303 |
| UniTok | 0.0934 | 0.0478 | 0.1251 | 0.0647 | 0.1061 | 0.0533 | 0.1361 | 0.0884 | 0.0684 | 0.0321 |
Analysis:
- Impact of TokenMoE (UniTok-1 vs. UniTok-2): Comparing UniTok-1 (no MoE, no MI calibration) with UniTok-2 (TokenMoE, but no shared expert or MI calibration) shows a large performance jump across all datasets. For example, NDCG@10 on Beauty improves from 0.0304 to 0.0436. This highlights the crucial role of the TokenMoE module in capturing domain-specific semantics and improving accuracy.
- Impact of Shared Expert (UniTok-2 vs. UniTok-3): UniTok-2 and UniTok-3 both omit MI calibration and differ only in the shared expert, so this comparison isolates its contribution. NDCG@10 on Beauty improves from 0.0436 to 0.0457, which suggests that the shared expert helps transfer common knowledge across domains.
- Impact of MI Calibration (UniTok-3 vs. UniTok): Comparing UniTok-3 (full TokenMoE but no MI calibration) with the full UniTok model shows that MI calibration provides a further improvement (e.g., NDCG@10 on Beauty from 0.0457 to 0.0478). This indicates that semantic balance and consistent informativeness across domains contribute to the final performance.
- Overall, the ablation study shows that each component of UniTok (TokenMoE, shared expert, and MI calibration) contributes to its effectiveness, with TokenMoE providing the largest boost and the shared expert and MI calibration refining the performance.
6.2.2. How Sensitive is UniTok to Key Parameters?
The paper analyzes the sensitivity of UniTok to key hyperparameters: the number of quantization levels ($L$), the codebook size ($T$), and the loss weights ($\lambda_{\mathrm{RQ}}$ and $\lambda_{\mathrm{MI}}$), using NDCG@10 on the Beauty, Cellphones, and Grocery datasets.
The following figures (Figures 4, 5, and 6 from the original paper) illustrate the sensitivity analysis on Beauty, Cellphones, and Grocery datasets, respectively.
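Figure 4: Sensitivity analysis on the Beauty domain, with four subplots (a)-(d). In each subplot the y-axis shows NDCG@10 and the x-axis shows a different parameter: $L$ (a), $T$ (b), $\lambda_{\mathrm{RQ}}$ (c), and $\lambda_{\mathrm{MI}}$ (d). The plots show how recommendation performance changes under different parameter settings.
Figure 5: Sensitivity analysis on the Cellphones domain, with the same four subplots (a)-(d). The y-axis shows NDCG@10 and the x-axis shows the values of the corresponding parameter, indicating where performance rises or falls as each parameter changes.
Figure 6: Sensitivity analysis on the Grocery dataset.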
Analysis of Parameter Sensitivity:
- Number of Quantization Levels ($L$): (Figures 4a, 5a, 6a)
  - Performance generally improves as $L$ increases from 2 to 4. This is because a longer sequence of codes allows for a more fine-grained and expressive representation of item semantics.
  - However, increasing $L$ beyond 4 leads to performance degradation. This is attributed to error accumulation in longer autoregressive sequences during recommendation, potentially making the tokenization too complex or introducing noise.
- Codebook Size ($T$): (Figures 4b, 5b, 6b)
  - Performance tends to improve as $T$ (the number of code vectors per codebook) increases, up to a certain point (e.g., 256 or 512). A larger codebook provides more distinct code vectors, enabling richer semantic representation.
  - However, too large a codebook size can lead to overfitting to spurious or less meaningful patterns, causing performance to plateau or slightly degrade.
- RQ Loss Weight ($\lambda_{\mathrm{RQ}}$): (Figures 4c, 5c, 6c)
  - Performance peaks at an intermediate value of $\lambda_{\mathrm{RQ}}$ across the datasets (the experiments in Section 5.4 use $\lambda_{\mathrm{RQ}} = 1$).
  - Setting $\lambda_{\mathrm{RQ}}$ too low means insufficient emphasis on the quantization process, leading to poor tokenization.
  - Setting $\lambda_{\mathrm{RQ}}$ too high might over-regularize the codebook-based identifiers, potentially underutilizing their flexibility or forcing the encoder and router to commit to suboptimal quantizations. This highlights the need to balance reconstruction quality with quantization fidelity.
- MI Calibration Loss Weight ($\lambda_{\mathrm{MI}}$): (Figures 4d, 5d, 6d)
  - The highest NDCG@10 is achieved at an intermediate value of $\lambda_{\mathrm{MI}}$ (the experiments in Section 5.4 use $\lambda_{\mathrm{MI}} = 0.03$).
  - A high $\lambda_{\mathrm{MI}}$ may over-constrain the latent space, forcing it to align MI values at the expense of domain-specific information or reconstruction quality.
  - A low $\lambda_{\mathrm{MI}}$ weakens the influence of mutual information preservation, limiting the semantic alignment of token representations and thus hindering generalization.
  - This indicates that a properly tuned $\lambda_{\mathrm{MI}}$ is crucial for balancing semantic preservation with overall model performance and generalization.
6.2.3. Empirical Validation of Theoretical Claims
The paper provides empirical evidence to support its three theoretical theorems.
6.2.3.1. Token Space Entropy (Theorem 1)
The following are the results from [Table 9] of the original paper:
| Method | Token Space Entropy |
| --- | --- |
| Codebook-based methods | 9.63 |
| UniTok without the router | 9.63 |
| UniTok (full) | 10.42 |
Analysis:
- Standard codebook-based methods (and UniTok without the router) exhibit a token space entropy of 9.63 (base-2 logarithm, corresponding to approximately $2^{9.63} \approx 792$ distinct states).
- The full UniTok model, with its TokenMoE router, achieves a higher token space entropy of 10.42.
- This empirical result is consistent with Theorem 1: incorporating multiple experts and a router increases the overall entropy, thereby expanding the capacity and richness of the token space.
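To make the entropy figures concrete, the sketch below computes the Shannon entropy (in bits) of an empirical token distribution and converts the reported values into an equivalent number of equally likely states. How the paper measures token space entropy is not described in this section, so the estimator here is an assumption, and the example token lists are invented.

```python
import math
from collections import Counter

def token_entropy_bits(tokens):
    """Shannon entropy (base 2) of the empirical distribution over observed tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented examples: tokens may be any hashable value, e.g. tuples of code indices.
uniform = list(range(8))             # eight tokens, each used once
skewed = [0] * 5 + [1, 2, 3]         # one token dominates
print(token_entropy_bits(uniform))   # log2(8) bits
print(round(token_entropy_bits(skewed), 3))

# Equivalent number of equally likely states for the values in Table 9.
for bits in (9.63, 10.42):
    print(bits, "bits ->", round(2 ** bits), "states")
```

Read this way, the increase from 9.63 to 10.42 bits corresponds to going from roughly 792 to roughly 1,370 effectively used states.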
6.2.3.2. Quantization Error (Theorem 2)
The following figure (Figure 7 from the original paper) compares the residual quantization loss over training epochs between UniTok and LETTER.

Analysis:
- Figure 7 shows that UniTok achieves a significantly lower residual quantization loss than LETTER over training epochs. While LETTER's loss converges at a higher value, UniTok's loss steadily decreases to a much lower point.
- This empirically supports Theorem 2, which states that UniTok's expected quantization error is less than or equal to that of standard codebook-based methods. The TokenMoE framework's ability to leverage expert specialization leads to more precise item tokenization and lower information loss, even in heterogeneous environments.
6.2.3.3. MI Variance Analysis (Theorem 3)
The following figure (Figure 8 from the original paper) illustrates the relationship between MI variance and performance gap.

Analysis:
- Figure 8 plots the relationship between the variance of MI estimates across domains and the observed performance variability.
- The graph shows a positive correlation: as the variance of MI increases, the performance variability (the maximum difference in loss between any two domains) also tends to increase.
- This empirical trend is consistent with Theorem 3 and its Lipschitz continuity assumption. It indicates that reducing the variance of MI across domains (as UniTok's MI calibration mechanism aims to do) leads to more consistent semantic representations and, consequently, more stable and reliable downstream performance in multi-domain recommendation systems.
6.3. Complete Set of Experimental Results (Appendix D.5)
The appendix provides the complete results for both Recall@10 and NDCG@10 across all ten source datasets.
The following are the results from [Table 7] of the original paper:
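| Method | Beauty R@10 | Beauty N@10 | Cellphones R@10 | Cellphones N@10 | Grocery R@10 | Grocery N@10 | Instruments R@10 | Instruments N@10 | Office Products R@10 | Office Products N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |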
| MF | 0.0614 | 0.0369 | 0.0604 | 0.0267 | 0.0418 | 0.0216 | 0.0930 | 0.0710 | 0.0569 | 0.0255 |
| LightGCN | 0.0639 | 0.0285 | 0.0668 | 0.0456 | 0.0697 | 0.0357 | 0.1008 | 0.0781 | 0.0587 | 0.0301 |
| SASRec | 0.0646 | 0.0314 | 0.0651 | 0.0446 | 0.0701 | 0.0376 | 0.0905 | 0.0609 | 0.0574 | 0.0285 |
| Bert4Rec | 0.0372 | 0.0194 | 0.0507 | 0.0268 | 0.0448 | 0.0237 | 0.0791 | 0.0573 | 0.0563 | 0.0274 |
| P5-TID | 0.0532 | 0.0255 | 0.0648 | 0.0357 | 0.0617 | 0.0316 | 0.0928 | 0.0721 | 0.0557 | 0.0239 |
| P5-SemID | 0.0584 | 0.0304 | 0.0737 | 0.0406 | 0.0641 | 0.0351 | 0.0964 | 0.0730 | 0.0592 | 0.0283 |
| TIGER | 0.0624 | 0.0324 | 0.0838 | 0.0446 | 0.0706 | 0.0375 | 0.1047 | 0.0788 | 0.0594 | 0.0295 |
| LC-Rec | 0.0684 | 0.0381 | 0.0859 | 0.0458 | 0.0722 | 0.0369 | 0.1066 | 0.0802 | 0.0637 | 0.0311 |
| LETTER | 0.0672 | 0.0364 | 0.0876 | 0.0473 | 0.0731 | 0.0392 | 0.1122 | 0.0831 | 0.0649 | 0.0326 |
| UniTok | 0.0934 | 0.0478 | 0.1251 | 0.0647 | 0.1061 | 0.0533 | 0.1361 | 0.0884 | 0.0897 | 0.0432 |
| Improve | 36.55% | 25.46% | 42.81% | 36.78% | 45.14% | 35.97% | 21.30% | 6.38% | 38.21% | 32.52% |

| Method | Pet Supplies R@10 | Pet Supplies N@10 | Tools R@10 | Tools N@10 | Toys R@10 | Toys N@10 | Games R@10 | Games N@10 | Yelp R@10 | Yelp N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MF | 0.0503 | 0.0268 | 0.0356 | 0.0169 | 0.0405 | 0.0192 | 0.0359 | 0.0366 | 0.0304 | 0.0144 |
| LightGCN | 0.0529 | 0.0289 | 0.0482 | 0.0257 | 0.0495 | 0.0287 | 0.0407 | 0.0417 | 0.0368 | 0.0195 |
| SASRec | 0.0538 | 0.0301 | 0.0475 | 0.0234 | 0.0473 | 0.0239 | 0.0401 | 0.0412 | 0.0354 | 0.0183 |
| Bert4Rec | 0.0313 | 0.0161 | 0.0184 | 0.0092 | 0.0333 | 0.0177 | 0.0363 | 0.0379 | 0.0272 | 0.0131 |
| P5-TID | 0.0485 | 0.0243 | 0.0377 | 0.0198 | 0.0419 | 0.0202 | 0.0372 | 0.0388 | 0.0316 | 0.0154 |
| P5-SemID | 0.0569 | 0.0282 | 0.0405 | 0.0237 | 0.0445 | 0.0231 | 0.0398 | 0.0432 | 0.0324 | 0.0188 |
| TIGER | 0.0546 | 0.0279 | 0.0507 | 0.0284 | 0.0486 | 0.0268 | 0.0438 | 0.0427 | 0.0394 | 0.0208 |
| LC-Rec | 0.0648 | 0.0335 | 0.0561 | 0.0307 | 0.0538 | 0.0279 | 0.0442 | 0.0451 | 0.0418 | 0.0215 |
| LETTER | 0.0596 | 0.0307 | 0.0556 | 0.0298 | 0.0546 | 0.0291 | 0.0559 | 0.0469 | 0.0426 | 0.0231 |
| UniTok | 0.0955 | 0.0496 | 0.0852 | 0.0439 | 0.0902 | 0.0442 | 0.0565 | 0.0476 | 0.0684 | 0.0321 |
| Improve | 47.38% | 48.06% | 51.87% | 42.99% | 65.20% | 51.89% | 1.07% | 1.49% | 60.56% | 38.96% |
(Note: Improvement values indicate the gain of UniTok over the best competitor for each metric and dataset. UniTok consistently outperforms competitors across both Recall@10 and NDCG@10, with statistically significant improvements (p < 0.05).)
Analysis:
These tables reinforce the findings from the main manuscript, showing UniTok's consistent superiority across both Recall@10 and NDCG@10 on all ten datasets. The percentage gains are often substantial, ranging from about 1% on Games to more than 60% in Recall@10 on Toys and Yelp, which supports the effectiveness of UniTok's unified item tokenization in multi-domain scenarios. The statistically significant improvements ($p < 0.05$) further support the robustness of these results. Both traditional CF methods and existing LLM-based tokenization methods are outperformed by UniTok, which additionally manages heterogeneous item domains with a single, efficient model.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully addresses a fundamental challenge in multi-domain LLM-based recommendation: the inefficiency and semantic inconsistency of existing item tokenization methods. The proposed UniTok framework offers a novel and effective solution. It integrates a customized Mixture-of-Experts (TokenMoE) architecture with codebooks to generate semantically meaningful tokens across diverse domains. By employing a shared encoder and routing items to both domain-specific and a shared expert, UniTok adeptly captures both unique and common semantic patterns. Furthermore, its mutual information (MI) calibration mechanism ensures semantic balance and consistent informativeness across domains.
Empirical evaluations on a wide range of real-world datasets demonstrate UniTok's significant advantages:
- Effectiveness: Achieving up to 51.89% improvements in NDCG@10 over strong benchmarks.
- Efficiency: Reducing trainable parameters by 9.63x compared to single-domain specialized tokenizers.
- Generalizability: Showing robust zero-shot performance on unseen domains without requiring any retraining.

These findings are further supported by theoretical analyses, which validate that UniTok induces a higher token space entropy, achieves lower quantization error, and ensures semantic consistency by reducing MI variance.
7.2. Limitations & Future Work
The paper primarily focuses on the item tokenization aspect for LLM-based recommendation. While it demonstrates UniTok's effectiveness in generating item tokens, it does not delve deeply into the integration of collaborative signals (user-item interactions) or the specific LLM architecture used for downstream recommendation. The evaluation is on the quality of tokens for improving recommendation performance, rather than the LLM itself.
The authors propose future work that includes extending UniTok into a general-purpose tokenization interface for foundation models in recommendation. This suggests a vision where UniTok could become a standardized preprocessing layer, simplifying the adoption of powerful foundation models for diverse recommendation tasks.
7.3. Personal Insights & Critique
This paper presents an elegant and practical solution to a critical problem in LLM-based recommendation. The unified tokenization concept is highly valuable for real-world applications where recommender systems often operate across many domains.
- Innovation of TokenMoE: The TokenMoE architecture is a clever adaptation of MoE principles, specifically tailored for multi-domain tokenization. The inclusion of both domain-specific experts and a shared expert provides an intuitive way to manage the trade-off between specialization and generalization. The inductive bias from initializing experts with mean features is a neat trick to guide specialization without explicit supervision.
- Power of MI Calibration: The MI calibration mechanism is a sophisticated and crucial component. Semantic imbalance is a subtle but significant issue in multi-domain learning, and explicitly optimizing for MI variance provides a robust way to ensure that all domains are adequately represented, preventing performance disparities. The theoretical backing for this, especially Theorem 3, adds credibility to its design.
- Efficiency and Scalability: The demonstrated 9.63x parameter reduction is a substantial practical benefit, making UniTok more deployable and sustainable than per-domain tokenizers. This efficiency, coupled with strong zero-shot generalization, makes UniTok a promising candidate for practical deployment.
- Potential Areas for Further Exploration:
  - Dynamic Expert Routing: While the Top-N routing is effective, exploring more dynamic or adaptive routing mechanisms for $N$ (e.g., varying $N$ based on domain complexity or item characteristics) could offer further refinements.
  - Computational Cost of HSIC: While effective, HSIC computation can be demanding for very large datasets due to kernel matrix operations. Investigating more scalable or approximate MI estimation techniques in the context of UniTok could be beneficial.
  - Integration with LLMs: The paper positions UniTok as a tokenizer for LLMs. A deeper dive into how different LLM architectures (e.g., encoder-decoder, decoder-only) specifically leverage UniTok's tokens, and whether certain tokenization settings within UniTok (e.g., specific codebook sizes or levels) interact differently with various LLM types, would be interesting.
  - Cold-Start Scenarios for New Domains: While zero-shot performance is strong, more investigation into how UniTok handles domains with extremely sparse metadata or very novel item types could yield further insights.

Overall, UniTok represents a significant step forward in multi-domain recommendation, offering a theoretically grounded and empirically robust solution that aligns well with the growing trend of foundation models and unified AI systems.