Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation

Won-Yong Shin

Published: 11/17/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces `UniTok`, a unified item tokenization framework that addresses the limitations of separate models for different item domains in LLM-based recommendation systems. It employs a mixture-of-experts architecture and a mutual information calibration mechanism to pr

Abstract

Large language model (LLM)-based recommender systems have achieved high-quality performance by bridging the discrepancy between the item space and the language space through item tokenization. However, existing item tokenization methods typically require training separate models for each item domain, limiting generalization. Moreover, the diverse distributions and semantics across item domains make it difficult to construct a unified tokenization that preserves domain-specific information. To address these challenges, we propose UniTok, a Unified item Tokenization framework that integrates our own mixture-of-experts (MoE) architecture with a series of codebooks to convert items into discrete tokens, enabling scalable tokenization while preserving semantic information across multiple item domains. Specifically, items from different domains are first projected into a unified latent space through a shared encoder. They are then routed to domain-specific experts to capture the unique semantics, while a shared expert, which is always active, encodes common knowledge transferable across domains. Additionally, to mitigate semantic imbalance across domains, we present a mutual information calibration mechanism, which guides the model towards retaining similar levels of semantic information for each domain. Comprehensive experiments on wide-ranging real-world datasets demonstrate that the proposed UniTok framework is (a) highly effective: achieving up to 51.89% improvements over strong benchmarks, (b) theoretically sound: showing the analytical validity of our architectural design and optimization; and (c) highly generalizable: demonstrating robust performance across diverse domains without requiring per-domain retraining, a capability not supported by existing baselines.

In-depth Reading

1. Bibliographic Information

1.1. Title

Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation

1.2. Authors

The authors are Yu Hou and Won-Yong Shin, both affiliated with Yonsei University.

1.3. Journal/Conference

The paper was posted on 2025-11-17 (timestamp 2025-11-17T03:18:04.000Z, UTC). No venue is named in the provided text. Given the topic, a plausible target would be a top-tier AI, ML, or recommender-systems venue (e.g., NeurIPS, ICML, KDD, SIGIR), but this is speculation.

1.4. Publication Year

2025

1.5. Abstract

This paper addresses the limitations of existing item tokenization methods for Large Language Model (LLM)-based recommender systems, which typically require separate models for each item domain and struggle with diverse domain semantics. The authors propose UniTok, a Unified item Tokenization framework. UniTok integrates a novel mixture-of-experts (MoE) architecture with codebooks to convert items into discrete tokens. It uses a shared encoder for a unified latent space, domain-specific experts for unique semantics, and a shared expert for common knowledge. To ensure semantic balance, UniTok includes a mutual information (MI) calibration mechanism. Experiments on diverse real-world datasets show UniTok is highly effective (up to 51.89% improvement over benchmarks), theoretically sound (analytical validity of design), and highly generalizable (robust performance across domains without retraining).

1.6. Original Source Link

Original Source Link: https://arxiv.org/abs/2511.12922v1
PDF Link: https://arxiv.org/pdf/2511.12922v1.pdf
The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency and lack of generalization in item tokenization for Large Language Model (LLM)-based recommender systems, particularly in multi-domain settings.

This problem is important because LLMs are becoming a promising paradigm for generative recommendation, leveraging their strong generalization and language understanding capabilities. However, to effectively use LLMs, items must be converted into discrete tokens (item tokenization) to bridge the gap between the item space and the language space. Existing item tokenization methods are predominantly designed for single-domain scenarios. This leads to several critical issues in real-world multi-domain applications:

Training Overhead (C1): Training separate tokenizers for each new item domain is computationally expensive, resource-intensive, and inefficient. As the number of domains grows, this siloed approach becomes unsustainable.
Semantic Alignment (C2): Item domains often have vastly different data distributions and semantics. A naïvely shared token space across domains can lead to semantic mixing and biased token assignments, where the tokenizer fails to capture the rich, unique semantics of each domain or struggles to maintain consistency in semantic informativeness across them.

The paper's entry point or innovative idea is to design a unified item tokenization framework that can generalize across multiple domains with minimal computational overhead, addressing both the training inefficiency and semantic alignment challenges.

2.2. Main Contributions / Findings

The paper makes several significant contributions to address the challenges of multi-domain item tokenization for LLM-based recommendation:

New Methodology (UniTok Framework):
- Proposes UniTok, a unified item tokenization framework that integrates a customized Mixture-of-Experts (MoE) architecture, dubbed TokenMoE, with codebooks.
- TokenMoE is designed with domain-specific experts to capture unique semantics of each domain and a shared expert to encode common knowledge, enabling effective domain-aware tokenization and knowledge transfer.
- Introduces a mutual information (MI) calibration mechanism to ensure semantic balance across diverse domains by guiding the model to retain similar levels of semantic information for each domain.
Extensive Evaluations and Superior Performance:
- Demonstrates UniTok's superiority in multi-domain scenarios, achieving substantial improvements of up to 51.89% in NDCG@10 over strong baselines.
- Shows UniTok's efficiency, with a 9.63x reduction in model size (trainable parameters) compared to traditional codebook-based methods that require per-domain training.
- Proves UniTok's strong generalization capability, exhibiting robust performance in zero-shot settings (unseen domains) without additional retraining, a capability lacking in existing baselines.
Theoretical Justifications:
- Provides theoretical proofs that UniTok induces a higher entropy token space (Theorem 1), indicating greater capacity and richness of tokens.
- Demonstrates that UniTok achieves a lower quantization error (Theorem 2), meaning more precise and accurate item tokenization.
- Shows that UniTok ensures semantic consistency across domains by reducing MI variance (Theorem 3), leading to more stable and balanced recommendation performance.
  
Together, these findings address the training-overhead and semantic-inconsistency problems, making LLM-based recommendation more scalable and effective in real-world multi-domain environments.

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts:

Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. In the context of recommender systems, LLMs can leverage their strong generalization abilities, natural language understanding, and world knowledge to generate recommendations. They can process items as if they were part of natural language sequences, making generative recommendation possible. This paradigm shifts from traditional recommendation methods that often rely on user-item interaction matrices or collaborative filtering.

Item Tokenization

Item tokenization is the process of converting items (e.g., products, movies, articles) into discrete, symbolic representations called tokens. Just as words in natural language are tokenized before being fed into an LLM, items in a recommender system need to be tokenized so that LLMs can understand and process them. This process bridges the gap between the item space (where items exist with their rich features and metadata) and the language space (where LLMs operate using discrete tokens). Tokens can be ID-based (simple numerical IDs), textual descriptors (using item metadata directly), or codebook-based (as explored in this paper).

Mixture-of-Experts (MoE)

A Mixture-of-Experts (MoE) architecture is a type of neural network design that enhances model capacity and efficiency by conditionally activating different sub-networks (called experts) for different inputs. Instead of processing every input with the entire model, an MoE uses a gating mechanism or router to select a subset of experts that are most relevant to a given input. The outputs of these selected experts are then combined, often weighted by the router's scores.

Purpose: MoE models can scale to a very large number of parameters (increasing model capacity) while keeping the computational cost for any single input relatively low (improving computational efficiency), as only a fraction of the parameters are activated.
Application in Multi-Domain: MoE is particularly well-suited for multi-domain learning because its gating mechanism allows adaptive expert selection for each domain. This helps the model specialize in different data distributions without interference, facilitating generalization across diverse data sources. Formally, given an input $\mathbf{x}$ , the output of an MoE model is typically represented as: $\mathrm{MoE}(\mathbf{x}) = \sum_{k=1}^K G_k(\mathbf{x}) E_k(\mathbf{x})$ where $K$ is the total number of experts, $E_k(\mathbf{x})$ is the output of the $k$ -th expert, and $G_k(\mathbf{x})$ is a softmax-based router function that assigns a probability (or weight) to the $k$ -th expert for the given input.
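As a minimal sketch of the formula above, the weighted combination can be written in NumPy. The linear experts, the linear router, and all sizes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def moe_forward(x, router_w, expert_ws):
    """Dense MoE: sum_k G_k(x) * E_k(x), with linear experts (an assumption)."""
    gates = softmax(router_w @ x)                   # G_k(x), shape (K,)
    outputs = np.stack([w @ x for w in expert_ws])  # E_k(x), shape (K, d)
    return gates @ outputs, gates                   # weighted sum over experts

rng = np.random.default_rng(0)
d, K = 4, 3                                         # hypothetical sizes
x = rng.normal(size=d)
router_w = rng.normal(size=(K, d))
expert_ws = [rng.normal(size=(d, d)) for _ in range(K)]
y, gates = moe_forward(x, router_w, expert_ws)
```

Here every expert contributes to the output. Sparse variants zero out all but a few gates before the sum.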

Codebook-based Identifiers (Residual Quantization - RQ)

Codebook-based identifiers are a method of item tokenization that converts continuous item embeddings into discrete token sequences using a technique called Residual Quantization (RQ).

Codebook: A codebook is a collection of discrete code vectors (or codewords).
Residual Quantization (RQ): This process encodes an input vector (e.g., an item embedding) into a sequence of codes by iteratively finding the best matching code vector from a codebook and then quantizing the residual (the difference between the input and the selected code) in subsequent stages.
1. Start with an initial residual $\mathbf{r}^{(0)} = \mathbf{x}$ (the input vector).
2. At each level $\ell$ (from 1 to $L$ total levels), find the code vector $\mathbf{c}_\ell$ from the codebook $C_\ell$ that is closest to the current residual $\mathbf{r}^{(\ell-1)}$ .
3. Update the residual for the next level: $\mathbf{r}^{(\ell)} = \mathbf{r}^{(\ell-1)} - \mathbf{c}_\ell$ .
4. The final approximation of the original input is the sum of all selected code vectors: $\hat{\mathbf{x}} = \sum_{\ell=1}^L \mathbf{c}_\ell$.

Each selected code vector $\mathbf{c}_\ell$ corresponds to a discrete index $z_\ell \in \{1, \dots, T\}$ (where $T$ is the number of code vectors in each codebook). The sequence $(z_1, \dots, z_L)$ forms the token sequence representing the item. This method provides compact and semantically meaningful tokens.
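The four steps above can be sketched in a few lines of NumPy. The codebooks here are random and the sizes are hypothetical, purely to make the example runnable. In practice the codebooks are learned.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization.

    x: (d,) input vector; codebooks: list of L arrays, each of shape (T, d).
    Returns the index sequence (z_1, ..., z_L) and the approximation x_hat.
    Indices are 0-based here, whereas the text uses 1..T.
    """
    residual = x.copy()
    indices = []
    x_hat = np.zeros_like(x)
    for codebook in codebooks:
        # Squared Euclidean distance from the residual to every code vector.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        z = int(np.argmin(dists))
        indices.append(z)
        x_hat += codebook[z]
        residual = residual - codebook[z]  # pass what is left to the next level
    return indices, x_hat

rng = np.random.default_rng(0)
d, T, L = 8, 16, 3                         # hypothetical sizes
x = rng.normal(size=d)
# Random codebooks with shrinking scale, for illustration only.
codebooks = [rng.normal(scale=0.5 ** level, size=(T, d)) for level in range(L)]
indices, x_hat = residual_quantize(x, codebooks)
```

The returned `indices` list plays the role of the token sequence $(z_1, \dots, z_L)$, and `x_hat` is the sum of the selected code vectors.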

Mutual Information (MI) and Hilbert-Schmidt Independence Criterion (HSIC)

Mutual Information (MI) is a concept from information theory that measures the statistical dependence between two random variables. A high MI value indicates a strong relationship, meaning knowing one variable provides a lot of information about the other. Conversely, a low MI suggests that the variables are nearly independent.

Purpose in this paper: In UniTok, MI is used to quantify how much semantic information from the original input item embeddings is preserved in their learned latent embeddings for each domain.
Hilbert-Schmidt Independence Criterion (HSIC): Calculating MI directly can be challenging. HSIC is a non-parametric measure of dependence between random variables that serves as an empirical estimate or proxy for MI. It measures the cross-covariance between two Reproducing Kernel Hilbert Spaces (RKHSs). A higher HSIC value implies stronger statistical dependence.
- It operates on kernel matrices (e.g., Gaussian kernel matrices) computed over the input data, which implicitly map data into high-dimensional RKHSs where dependencies might be more easily detected.
- The centering matrix $\mathbf{H}$ (where $\mathbf{H} = \mathbf{I} - \frac{1}{|\mathcal{T}_k|}\mathbf{1}\mathbf{1}^\top$ ) is used to ensure that the embeddings in RKHS have a zero mean, which is standard practice in kernel methods.

3.2. Previous Works

The paper frames UniTok in contrast to existing item tokenization methods and traditional collaborative filtering (CF) approaches for recommendation.

Traditional Collaborative Filtering (CF) Methods: These methods typically rely on user-item interaction data to find patterns and make recommendations.
- MF (Matrix Factorization): A classic CF method that decomposes the user-item interaction matrix into lower-dimensional user and item latent factor matrices.
- LightGCN: A Graph Convolutional Network (GCN)-based method that simplifies the GCN architecture for recommendation by keeping only the neighborhood aggregation component and removing feature transformations and nonlinear activations.
- SASRec (Self-Attentive Sequential Recommendation): A sequential recommender that uses a self-attention mechanism (similar to Transformers) to capture short-term and long-term dependencies in user interaction sequences.
- Bert4Rec: Another sequential recommender that adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to model user behavior sequences bidirectionally, predicting masked items.
LLM-based Generative Recommendation (with Item Tokenization): These methods bridge item space and language space using LLMs.
- P5-TID (P5 - Text-based Item Description): A variant of P5 (Pretrain, Personalized Prompt, and Predict Paradigm) that uses textual item descriptions for tokenization in LLM-based generative recommendation. It leverages LLMs for direct sequence generation of item IDs based on textual information.
- P5-SemID (P5 - Semantic Item ID): Another P5 variant that uses semantic item IDs for LLM-based generative recommendation. It focuses on representing items with semantic codewords for next-item prediction.
- TIGER (Transformer Index for GEnerative Recommenders): A codebook-based item tokenization method that converts item metadata into hierarchical code sequences. It is a foundational work in using discrete codebooks for generative recommendation.
- LC-Rec (Learnable Code-based Recommendation): Learns code indices for items via vector quantization and fine-tunes LLMs through alignment tasks to generate items directly.
- LETTER (Learnable Item Tokenization for Generative Recommendation): This method learns a tokenizer for LLM-based generative recommendation, aiming to combine collaborative signals and mitigate code assignment bias. It also uses codebook-based identifiers.

3.3. Technological Evolution

The evolution of recommendation systems has moved from traditional methods focusing on explicit user-item interactions (e.g., Matrix Factorization, early CF) to more complex models capable of capturing sequential patterns (SASRec, Bert4Rec) and leveraging graph structures (LightGCN). More recently, the advent of Large Language Models has opened up the generative recommendation paradigm, where LLMs can directly generate recommendations, often by processing items as tokens within a language context.

Initially, item tokenization methods for LLM-based systems focused on single-domain scenarios (e.g., TIGER, LC-Rec, LETTER). These approaches, while effective within their specific domains, suffered from scalability issues when applied to multi-domain settings, requiring separate tokenizers and leading to redundant training and parameter inefficiency.

The field of machine learning more broadly has seen a strong trend towards unified models for multi-domain learning in areas like natural language processing and computer vision. This paper's work on UniTok fits within this broader trend, aiming to bring unified multi-domain learning principles to item tokenization for LLM-based recommendation. It represents a significant step towards creating general-purpose tokenization interfaces for foundation models in recommendation.

3.4. Differentiation Analysis

Compared to the main methods in related work, UniTok introduces several core innovations:

Unified Multi-domain Approach vs. Single-domain Specialization: The primary differentiation is UniTok's ability to tokenize items across multiple domains using a single unified model, whereas most existing item tokenization methods (e.g., TIGER, LC-Rec, LETTER) are designed for single-domain settings and require separate training for each domain. This directly addresses the training overhead (C1) challenge.
Novel TokenMoE Architecture: UniTok proposes a specialized Mixture-of-Experts (MoE) architecture (TokenMoE) integrated directly into the tokenization module. Unlike MoEs used in Transformer layers, TokenMoE explicitly combines domain-specific experts (for unique semantics) and a shared expert (for common knowledge). This allows the model to retain domain specialization without sacrificing global knowledge sharing, which is crucial for handling diverse semantics (partially addressing C2).
Mutual Information (MI) Calibration: UniTok introduces a unique MI calibration mechanism ( $\mathcal{L}_{\mathrm{MI}}$ ) to explicitly ensure semantic balance across domains. This mechanism minimizes the variance of MI between input and latent embeddings, mitigating semantic imbalance and ensuring consistent informativeness for each domain. This directly addresses the semantic alignment (C2) challenge, which is largely overlooked by prior single-domain studies.
Efficiency and Generalizability: By learning a unified tokenization model, UniTok achieves significant efficiency gains (9.63x reduction in trainable parameters) and demonstrates strong zero-shot generalization to unseen domains, capabilities not supported by existing baselines. This is a direct consequence of its architectural innovations and MI calibration.

In essence, while previous codebook-based tokenization methods focus on effective item representation within a single domain, UniTok innovates by extending this concept to a multi-domain setting through an intelligent MoE design and semantic balancing mechanism, leading to a more scalable, efficient, and robust solution.

4. Methodology

4.1. Principles

The core idea behind UniTok is to enable unified item tokenization across multiple, diverse domains for LLM-based recommendation. It operates on two main principles:

Disentangling Domain-Specific and Shared Knowledge: Different item domains exhibit distinct data distributions and semantics. To handle this with a single model, UniTok needs to internally separate domain-specific learning (to capture unique patterns) from shared representations (to leverage common knowledge and improve efficiency). This is achieved through a novel Mixture-of-Experts (MoE) architecture, TokenMoE, where specialized experts handle domain-specific nuances while a shared expert captures cross-domain commonalities.
Ensuring Semantic Balance and Informativeness: When tokenizing items from diverse domains within a unified framework, it's critical to ensure that the learned latent embeddings retain sufficient and consistent semantic information for each domain. Naïvely sharing a token space can lead to semantic mixing or imbalance, where some domains might lose crucial details. UniTok addresses this by introducing a mutual information (MI) calibration mechanism that guides the model to preserve similar levels of informativeness across all domains, promoting stable and balanced performance.

By combining these principles, UniTok aims to create a scalable and generalizable item tokenizer that is efficient in training and deployment while maintaining high semantic fidelity across heterogeneous item domains.

4.2. Core Methodology In-depth (Layer by Layer)

The UniTok framework consists of four key components: a shared autoencoder, TokenMoE, codebook-based identifiers, and an MI calibration mechanism.

4.2.1. Task Formulation: Multi-domain Item Tokenization

The paper considers a multi-domain setting with a mixture of $K$ distinct item domains, denoted as $\mathcal{D} = \{D_1, \ldots, D_K\}$ . Each domain $D_k$ has an item set $\mathcal{T}_k$ with associated textual metadata (e.g., titles, categories, features). The objective is to learn a mapping function $\mathcal{F} : \mathbb{R}^d \to \mathcal{C}$ that projects each continuous semantic embedding $\mathbf{x}_i^k$ of an item from domain $D_k$ into a discrete codeword $\mathbf{c}_i^k \in \mathcal{C}$ . Here, $\mathcal{C}$ represents a shared space of discrete item tokens across all domains. This tokenization should operate independently of user data, enabling scalable and general-purpose recommendations for LLM-based systems.

4.2.2. Shared Autoencoder

UniTok begins by projecting item embeddings from various domains into a unified latent space using a shared autoencoder. This step ensures that items from different domains initially share a common representation format, capturing structural patterns before domain-specific processing.

Components:

Encoder ( $f_\theta$ ): Maps an input item's semantic embedding to a latent embedding.
Decoder ( $g_\phi$ ): Reconstructs the original semantic embedding from the processed latent embedding.

Process: For each input item $\mathbf{x}_i^k \in \mathbf{X}^k$ from domain $D_k$:

The encoder $f_\theta$ produces a latent embedding: $\mathbf{z}_i^k = f_\theta(\mathbf{x}_i^k)$ where $\mathbf{x}_i^k$ is the semantic embedding of the $i$ -th item in domain $k$ , and $\mathbf{z}_i^k$ is its corresponding latent embedding.
This latent embedding $\mathbf{z}_i^k$ then passes through the TokenMoE module to produce a quantized representation $\hat{\mathbf{z}}_i^k = \mathrm{TokenMoE}(\mathbf{z}_i^k)$ .
The decoder $g_\phi$ reconstructs the input item's semantic embedding from $\hat{\mathbf{z}}_i^k$ : $\hat{\mathbf{x}}_i^k = g_\phi(\hat{\mathbf{z}}_i^k)$ where $\hat{\mathbf{x}}_i^k$ is the reconstructed semantic embedding.

Optimization: The shared autoencoder is optimized using a reconstruction loss, which aims to minimize the difference between the original and reconstructed semantic embeddings: $\mathcal{L}_{\mathrm{Rec}} = \sum_{k=1}^K \sum_{\mathbf{x}_i^k \in \mathbf{X}^k} \left\| \mathbf{x}_i^k - \hat{\mathbf{x}}_i^k \right\|^2$ where $\lVert \cdot \rVert$ denotes the $L_2$ norm. This loss ensures that the latent space is informative and retains core semantics for subsequent tokenization.

4.2.3. TokenMoE (Mixture-of-Experts for Tokenization)

TokenMoE is a specialized Mixture-of-Experts architecture within UniTok that routes items to both domain-specific experts and a shared expert. This design allows for domain-aware tokenization by capturing both unique domain patterns and transferable common knowledge.

Components:

Router Function ( $G(\cdot)$ ): Takes an item's latent embedding and produces a softmax distribution over $K$ domain-specific experts.
Domain-Specific Experts ( $E_k(\cdot)$ ): $K$ individual expert modules, each designed to specialize in patterns unique to a particular domain.
Shared Expert ( $E_{\mathrm{share}}(\cdot)$ ): An additional expert module that captures common knowledge transferable across all domains, and is always active.

Process:

After the shared encoder produces an item's latent embedding $\mathbf{z}_i^k$ , it is fed into the router function $G(\cdot)$ . The router computes router logits $s_i \in \mathbb{R}^K$ using a learnable linear transformation $h(\cdot)$ : $s_i = h(\mathbf{z}_i^k)$ .
The softmax distribution $G_k$ for each domain-specific expert $k$ is calculated as: $G_k = \frac{\exp(s_i^{(k)})}{\sum_{j=1}^K \exp(s_i^{(j)})}$ where $s_i^{(k)}$ is the logit corresponding to the $k$ -th domain-specific expert.
The item is then routed to the top-$N$ domain-specific experts based on the highest $G_k$ values, while simultaneously being deterministically assigned to the shared expert.
The final processed latent embedding $\hat{\mathbf{z}}_i^k$ (which is then fed to the decoder) is a weighted combination of the outputs of the selected domain-specific experts and the shared expert: $\begin{cases} \hat{\mathbf{z}}_i^k = \mathrm{TokenMoE}(\mathbf{z}_i^k) = \displaystyle \sum_{k=1}^K G_k E_k(\mathbf{z}_i^k) + E_{\mathrm{share}}(\mathbf{z}_i^k), \\ G_k = \begin{cases} G_k, & \mathrm{if} \ k \in \mathrm{Top}_N(G(\mathbf{z}_i^k)) \\ 0, & \mathrm{otherwise} \end{cases} \end{cases}$ where $E_k(\cdot)$ denotes the $k$ -th expert module, $E_{\mathrm{share}}(\cdot)$ is the shared expert module, and $\mathrm{Top}_N(G(\mathbf{z}_i^k))$ is the set of indices for the top- $N$ experts selected by the router.

To encourage expert specialization, each expert is initialized with the mean feature of a specific domain, providing an inductive bias for domain-aware tokenization.
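The routing rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the experts are plain linear maps standing in for the quantizing experts, the router is a single weight matrix, and all sizes are hypothetical. Gates outside the top-$N$ set are zeroed without renormalization, matching the formula as written.

```python
import numpy as np

def token_moe(z, router_w, experts, shared_expert, top_n):
    """Top-N routing over domain experts plus an always-active shared expert."""
    logits = router_w @ z                      # router logits s_i, shape (K,)
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                       # softmax G_k
    top = np.argsort(gates)[-top_n:]           # indices of the N largest gates
    routed = sum(gates[k] * experts[k](z) for k in top)
    return routed + shared_expert(z), sorted(top.tolist())

rng = np.random.default_rng(0)
d, K, N = 4, 3, 2                              # hypothetical sizes
mats = [rng.normal(size=(d, d)) for _ in range(K + 1)]
# Placeholder experts: linear maps, not the RQ-based experts of the paper.
experts = [lambda v, m=m: m @ v for m in mats[:K]]
shared = lambda v: mats[K] @ v
router_w = rng.normal(size=(K, d))
z = rng.normal(size=d)
z_hat, chosen = token_moe(z, router_w, experts, shared, N)
```

The list `chosen` holds the selected expert IDs, which is the information later appended to the item's codeword.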

4.2.4. Codebook-based Identifiers

Within each expert (both domain-specific and shared), Residual Quantization (RQ) is employed to discretize the item latent embeddings into compact token sequences.

Components:

Codebooks ( $\{C_1, C_2, \dots, C_L\}$ ): A series of $L$ codebooks, each $C_\ell$ containing $T$ code vectors $\{\mathbf{c}_t\}_{t=1}^T$ .

Process:

For a latent embedding $\mathbf{z}_i^k \in \mathbb{R}^d$ (output of the shared encoder, before TokenMoE routing), RQ approximates it hierarchically.
At each level $\ell$ (from 1 to $L$):
- The residual $\mathbf{r}^{(\ell-1)}$ from the previous step is used. Initially, $\mathbf{r}^{(0)} = \mathbf{z}_i^k$ .
- The nearest code vector $\mathbf{c}_\ell$ from codebook $C_\ell$ to the current residual is selected: $\mathbf{c}_\ell = \arg\min_{\mathbf{c} \in C_\ell} \left\| \mathbf{r}^{(\ell-1)} - \mathbf{c} \right\|^2$ .
- The residual is updated for the next level: $\mathbf{r}^{(\ell)} = \mathbf{r}^{(\ell-1)} - \mathbf{c}_\ell$ .
The sum of all selected code vectors approximates the original latent embedding: $E_k(\mathbf{z}_i^k) \approx \sum_{\ell=1}^L \mathbf{c}_\ell, \quad \mathrm{where} \ \mathbf{c}_\ell \in C_\ell$. Here, $E_k(\mathbf{z}_i^k)$ refers to the output of the $k$-th expert module, which effectively performs this quantization.
Each selected code vector $\mathbf{c}_\ell$ corresponds to a discrete index $z_\ell \in \{1, \dots, T\}$. The final discrete codeword for an item is a combination of these indices and the chosen expert IDs: $\mathbf{z}_i^k \mapsto \mathbf{c}_i^k = (z_1, \ldots, z_L, e_1, \ldots, e_N)$ where $e_n \in \{1, \ldots, K\}$ indicates the ID of the $n$-th of the top-$N$ experts chosen by the router.

Optimization: The quantization process is trained using the RQ loss: $\mathcal{L}_{\mathrm{RQ}} := \sum_{\ell=1}^L \left( \left\| \mathrm{sg}[\mathbf{r}^{(\ell-1)}] - \mathbf{c}_\ell \right\|^2 + \alpha \left\| \mathbf{r}^{(\ell-1)} - \mathrm{sg}[\mathbf{c}_\ell] \right\|^2 \right)$ where $\mathbf{r}^{(\ell-1)}$ is the residual quantized at level $\ell$ (with $\mathbf{r}^{(0)} = \mathbf{z}_i^k$), $\mathbf{c}_\ell$ is the selected code vector from $C_\ell$, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\alpha$ is a balancing hyperparameter.

The first term encourages the code vector $\mathbf{c}_\ell$ to match the residual $\mathbf{r}^{(\ell-1)}$ (training the codebooks).
The second term forces the encoder and router to produce latent embeddings that are close to the selected quantized code vector (making the encoder and router commit to the quantization).
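
As a short worked note on the two terms, write $\mathbf{r}$ for the residual being quantized at a given level and $\mathbf{c}$ for the code vector selected for it (a shorthand introduced here, not the paper's notation). Because $\mathrm{sg}[\cdot]$ acts as the identity in the forward pass, both terms evaluate to the same number, so the per-level loss value is $(1 + \alpha) \left\| \mathbf{r} - \mathbf{c} \right\|^2$. The terms differ only in their gradients:

$\frac{\partial}{\partial \mathbf{c}} \left\| \mathrm{sg}[\mathbf{r}] - \mathbf{c} \right\|^2 = 2(\mathbf{c} - \mathbf{r}), \qquad \frac{\partial}{\partial \mathbf{r}} \, \alpha \left\| \mathbf{r} - \mathrm{sg}[\mathbf{c}] \right\|^2 = 2\alpha(\mathbf{r} - \mathbf{c})$

A gradient step on the first term therefore moves the code vector toward the residual without touching the encoder, and a step on the second term moves the encoder output toward the code vector without touching the codebook.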

4.2.5. MI Calibration

To address semantic imbalance and ensure consistent informativeness across diverse domains, UniTok introduces a mutual information (MI) calibration mechanism.

Mechanism:

The Hilbert-Schmidt Independence Criterion (HSIC) is used as a proxy for MI. A higher HSIC value indicates stronger dependence and thus more preserved information.
For each domain $k$, $\widehat{\mathrm{HSIC}}(\mathbf{X}^k, \mathbf{Z}^k)$ measures the dependence between the input semantic embeddings $\mathbf{X}^k = \{\mathbf{x}_1^k, \ldots, \mathbf{x}_{|\mathcal{T}_k|}^k\}$ and their latent embeddings $\mathbf{Z}^k = \{\mathbf{z}_1^k, \ldots, \mathbf{z}_{|\mathcal{T}_k|}^k\}$ in a Reproducing Kernel Hilbert Space (RKHS): $\widehat{\mathrm{HSIC}}(\mathbf{X}^k, \mathbf{Z}^k) = \frac{1}{(|\mathcal{T}_k| - 1)^2} \mathrm{Tr}(\mathbf{UHVH})$ where:
- $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{|\mathcal{T}_k| \times |\mathcal{T}_k|}$ are Gaussian kernel matrices computed over $\mathbf{X}^k$ and $\mathbf{Z}^k$ , respectively. Gaussian kernel matrices measure similarity between data points, implicitly mapping them into a high-dimensional feature space (RKHS).
- $\mathbf{H} = \mathbf{I} - \frac{1}{|\mathcal{T}_k|}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, which ensures zero-mean embeddings in RKHS. $\mathbf{I}$ is the identity matrix, and $\mathbf{1}$ is a vector of all ones.
- $\mathrm{Tr}(\mathbf{UHVH})$, once normalized, is the empirical squared Hilbert-Schmidt norm of the cross-covariance operator between the two RKHSs, quantifying their dependence.
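
The estimator defined above can be computed directly from two sets of embeddings. The sketch below is a minimal NumPy version. The fixed kernel bandwidth of 1.0 and the toy data are assumptions for illustration, and the paper's bandwidth choice is not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    """Pairwise Gaussian kernel matrix for the rows of a, shape (n, d)."""
    sq_norms = np.sum(a ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * a @ a.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, z, sigma_x=1.0, sigma_z=1.0):
    """Empirical HSIC: Tr(U H V H) / (n - 1)^2."""
    n = x.shape[0]
    u = gaussian_kernel(x, sigma_x)        # kernel matrix over the inputs
    v = gaussian_kernel(z, sigma_z)        # kernel matrix over the latents
    h = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(u @ h @ v @ h) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))              # toy "input embeddings"
noise = rng.normal(size=(100, 1))          # drawn independently of x
dependent = hsic(x, x)                     # latent fully determined by input
independent = hsic(x, noise)               # latent unrelated to input
```

On this toy data the dependent pair scores higher than the independent pair, which is the behavior that lets HSIC stand in for mutual information.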
  
Figure 3 of the original paper illustrates the MI calibration process.
  
  Optimization: The MI calibration loss ( $\mathcal{L}_{\mathrm{MI}}$ ) is designed to enforce semantic balance across domains: $\mathcal{L}_{\mathrm{MI}} = \mathrm{Var}\left[ \widehat{I}^{(k)} \right] - \beta \mathbb{E}\left[ \widehat{I}^{(k)} \right]$ where $\widehat{I}^{(k)} = \widehat{\mathrm{HSIC}}(\mathbf{X}^k, \mathbf{Z}^k)$ is the MI estimate for domain $k$ , and $\beta$ is a weighting hyperparameter.

The first term, $\mathrm{Var}\left[ \widehat{I}^{(k)} \right]$ , penalizes high variance of MI across domains, aiming to reduce semantic imbalance and ensure consistent informativeness.
The second term, $-\beta \mathbb{E}\left[ \widehat{I}^{(k)} \right]$ , encourages each domain to retain a sufficient amount of domain-specific information (maximizing the expected MI).
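
To make the trade-off between the two terms concrete, the following sketch evaluates the loss on two sets of per-domain HSIC values that share the same mean. The HSIC values, the value of $\beta$, and the use of the population variance are all assumptions made for illustration.

```python
import numpy as np

def mi_calibration_loss(hsic_per_domain, beta):
    """Var[I^(k)] - beta * E[I^(k)], taken over the K domains."""
    values = np.asarray(hsic_per_domain, dtype=float)
    # np.var defaults to the population variance (ddof=0), an assumption here.
    return values.var() - beta * values.mean()

# Hypothetical per-domain estimates with the same mean (0.30).
balanced = mi_calibration_loss([0.30, 0.30, 0.30], beta=0.1)
imbalanced = mi_calibration_loss([0.10, 0.30, 0.50], beta=0.1)
```

With equal means, the balanced set yields the lower loss because its variance term is zero, so minimizing the loss favors domains that retain similar amounts of information.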

4.2.6. Optimization

The overall loss function for training UniTok combines the three losses: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Rec}} + \lambda_{\mathrm{RQ}} \mathcal{L}_{\mathrm{RQ}} + \lambda_{\mathrm{MI}} \mathcal{L}_{\mathrm{MI}}$ where $\lambda_{\mathrm{RQ}}$ and $\lambda_{\mathrm{MI}}$ are hyperparameters that control the strength of the RQ and MI losses, respectively.

Once trained, UniTok tokenizes each item into discrete semantic tokens. These item token sequences are then used as input to LLM-based recommender systems. For example, user interaction histories $\mathbf{u}$ are converted into item token sequences $\mathbf{u} = [\widetilde{\mathbf{c}}_1, \widetilde{\mathbf{c}}_2, \dots, \widetilde{\mathbf{c}}_P]$ , and the LLM learns to predict the next interacted item token $\widetilde{\mathbf{c}}_{P+1}$ .
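
The conversion from an interaction history to a next-item prediction example can be sketched as below. The item names and codewords are hypothetical. How tokens are rendered into the LLM's vocabulary or prompt is not specified here, so it is left out.

```python
def build_training_pair(history, item_to_codeword):
    """Turn an interaction history into (context tokens, target tokens)."""
    codewords = [item_to_codeword[item] for item in history]
    # Flatten every item except the last into one context sequence.
    context = [tok for cw in codewords[:-1] for tok in cw]
    # The last item's codeword is what the model should generate next.
    target = list(codewords[-1])
    return context, target

# Hypothetical codewords of the form (z_1, z_2, z_3, e_1).
item_to_codeword = {
    "item_a": (12, 7, 3, 1),
    "item_b": (12, 9, 0, 1),
    "item_c": (4, 2, 8, 2),
}
context, target = build_training_pair(
    ["item_a", "item_b", "item_c"], item_to_codeword
)
```

Because the tokenizer is trained without user data, the `item_to_codeword` table can be built once and reused for any user's history.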

4.2.7. Theoretical Analyses

The paper provides theoretical justifications for the design choices and effectiveness of UniTok.

4.2.7.1. Theorem 1: Higher Entropy Token Space

Statement: The token space induced by UniTok exhibits strictly higher entropy than that of standard codebook-based methods: $H(\mathcal{C}_{\mathrm{UniTok}}) > H(\mathcal{C}_{\mathrm{standard}})$ where $\mathcal{C}_{\mathrm{UniTok}}$ and $\mathcal{C}_{\mathrm{standard}}$ denote the discrete token distributions generated by UniTok and standard codebook-based methods (using a single codebook), respectively.

Explanation: Entropy in information theory measures the uncertainty or diversity of a random variable. A higher-entropy token space spreads items over more distinct tokens, implying a richer and more expressive representation capacity. The proof relies on the chain rule of entropy. For standard codebook-based methods with $L$ levels and $T$ code vectors per codebook, the entropy is taken to be $L \log T$ , the maximum value, which is attained when all code combinations are used uniformly. For UniTok, token generation involves two steps: first, selecting an expert (with entropy $H(\mathcal{G})$ ), and then generating a token from that expert's codebooks (again with entropy $L \log T$ ). Since the router entropy $H(\mathcal{G})$ is positive whenever there is more than one expert ( $K > 1$ ) and the router does not always select the same one, UniTok's total entropy is $H(\mathcal{G}) + L \log T$ , which is strictly greater than $L \log T$ .
Implication: This theorem shows that TokenMoE expands the capacity of the token space, allowing UniTok to represent items with greater diversity and expressiveness.
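The chain-rule argument can be checked numerically with a short sketch. It uses $L = 4$ and $T = 256$ from the implementation details; taking $K = 10$ experts (one per source domain) and a uniform router are assumptions made only to obtain concrete numbers.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

L, T, K = 4, 256, 10  # K = 10 and a uniform router are illustrative assumptions

codebook_entropy = L * math.log2(T)              # upper bound for one set of codebooks
router_entropy = entropy_bits([1.0 / K] * K)     # equals log2(K) for a uniform router
unitok_entropy = router_entropy + codebook_entropy

# A router that always picks the same expert contributes no extra entropy.
collapsed_router_entropy = entropy_bits([1.0] + [0.0] * (K - 1))
```

Under these assumptions the router adds about 3.32 bits on top of the 32 bits available from the codebooks, while a collapsed router adds nothing, which is why the strict inequality needs a non-degenerate router.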

4.2.7.2. Theorem 2: Lower Quantization Error

Statement: Let $\mathbb{E}[\mathcal{L}_{\mathrm{UniTok}}]$ and $\mathbb{E}[\mathcal{L}_{\mathrm{standard}}]$ denote the expected quantization error of UniTok and standard codebook-based methods using a single shared codebook, respectively. Then, the following inequality holds: $\mathbb{E}[\mathcal{L}_{\mathrm{UniTok}}] \le \mathbb{E}[\mathcal{L}_{\mathrm{standard}}]$

Explanation: Quantization error refers to the loss of information when a continuous value is converted to a discrete one. A lower quantization error means that the discrete representation (the tokens) more accurately approximates the original continuous item embedding, thus preserving more semantic information. The proof uses Jensen's inequality and the convexity of the squared norm function. It treats UniTok's quantization error as a convex combination (a weighted average by router probabilities) of the individual experts' errors. By convexity, the squared norm of this combination is at most the router-weighted average of the experts' squared errors. A single standard codebook can be seen as a degenerate MoE with one active expert, so UniTok's expected error is no larger than that of the standard method.
Implication: This theorem indicates that UniTok's TokenMoE architecture, by leveraging expert specialization, can represent items more precisely and reduce the information loss during tokenization compared to a single, monolithic tokenizer.
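The Jensen step used in this argument can be illustrated numerically. The sketch below only demonstrates the inequality $\|\mathbf{x} - \sum_k g_k \mathbf{q}_k\|^2 \le \sum_k g_k \|\mathbf{x} - \mathbf{q}_k\|^2$ for made-up expert outputs and gate weights; it is not the paper's proof, and how UniTok actually combines expert outputs is assumed from the description above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Made-up target embedding, expert reconstructions, and router weights.
x = [1.0, 2.0]
expert_outputs = [[0.0, 2.0], [2.0, 3.0], [1.0, 0.0]]
gate = [0.5, 0.3, 0.2]  # non-negative and sums to 1

# Router-weighted (convex) combination of the expert outputs.
combined = [
    sum(g * q[d] for g, q in zip(gate, expert_outputs))
    for d in range(len(x))
]

error_of_combination = sq_dist(x, combined)
weighted_average_error = sum(g * sq_dist(x, q) for g, q in zip(gate, expert_outputs))
```

Here the combined output `[0.8, 1.9]` has squared error 0.05, well below the weighted average of the individual errors (1.9), as convexity of the squared norm guarantees.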

4.2.7.3. Theorem 3: Performance Stability via MI Calibration

Statement: Suppose that the loss $\mathcal{L}^{(k)}$ on the $k$ -th domain is Lipschitz-continuous with respect to the informativeness of representations. Then, the performance variability across domains is upper-bounded by the variance of MI: $|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}| \leq C \sqrt{\mathrm{Var}\left[ \widehat{I}^{(k)} \right]}, \forall i, j$ where $\mathrm{Var}\left[ \widehat{I}^{(k)} \right]$ is the variance of MI estimates across domains and $C$ is a constant.

Explanation: Lipschitz continuity implies that a small change in the input (here, informativeness or MI) leads to at most a proportionally small change in the output (here, the loss or performance). Performance variability refers to how much the model's performance differs from one domain to another. The proof links the performance gap between any two domains ( $|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}|$ ) to the difference in their MI estimates ( $|\widehat{I}^{(i)} - \widehat{I}^{(j)}|$ ). It then shows that the maximum difference in MI estimates across domains is bounded by a term related to the variance of MI across all domains. Therefore, reducing the variance of MI estimates tightens the upper bound on performance variability.
Implication: This theorem provides theoretical support for the MI calibration mechanism. By minimizing $\mathrm{Var}\left[ \widehat{I}^{(k)} \right]$ through $\mathcal{L}_{\mathrm{MI}}$ , UniTok encourages the learned latent embeddings to retain a consistent level of semantic information across all domains, leading to more stable and balanced recommendation performance in a multi-domain setting.
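One way such a bound can be obtained is sketched below. The Lipschitz constant $\ell$ , the mean $\bar{I} = \frac{1}{K}\sum_k \widehat{I}^{(k)}$ , and the use of the population variance over $K$ domains are assumptions of this sketch; the resulting constant $C = \ell\sqrt{2K}$ is illustrative and need not match the constant in the paper's proof.

```latex
\begin{aligned}
|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}|
  &\le \ell \, |\widehat{I}^{(i)} - \widehat{I}^{(j)}|
     && \text{(Lipschitz continuity)} \\
  &\le \ell \left( |\widehat{I}^{(i)} - \bar{I}| + |\widehat{I}^{(j)} - \bar{I}| \right)
     && \text{(triangle inequality)} \\
  &\le \ell \sqrt{2 \left[ (\widehat{I}^{(i)} - \bar{I})^2 + (\widehat{I}^{(j)} - \bar{I})^2 \right]}
     && \left( a + b \le \sqrt{2(a^2 + b^2)} \right) \\
  &\le \ell \sqrt{2 \sum_{k=1}^{K} (\widehat{I}^{(k)} - \bar{I})^2}
   = \ell \sqrt{2K} \, \sqrt{\mathrm{Var}\left[ \widehat{I}^{(k)} \right]}
     && (i \neq j)
\end{aligned}
```

For $i = j$ the left-hand side is zero, so the bound holds trivially.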

The following figure (Figure 2 from the original paper) provides a schematic overview of the UniTok framework, illustrating the interaction between its components.

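Figure 2: The schematic overview of the proposed UniTok framework. The image is a schematic diagram showing the overall structure of the proposed UniTok framework. Item inputs from different item domains are converted into semantic embeddings by a shared encoder and then assigned by the router to specialized experts, with the aim of preserving domain-specific information while mitigating semantic imbalance across domains.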

The UniTok training procedure is formally described in Algorithm 1 of the original paper, reproduced below:

Algorithm 1: UniTok Training Procedure
Input: Multi-domain item datasets $\left\{ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_K \right\}$ with text metadata (e.g., title, category, description)
Parameters: Encoder $f_\theta$ , decoder $g_\phi$ , router $h(\cdot)$ , expert modules $\{E_1, ..., E_K, E_{\mathrm{share}}\}$ with associated codebooks
Output: Discrete tokens $\mathbf{c}_i^k \in \mathcal{C}$ for all items
Initialize all modules: $f_\theta, g_\phi$ , router $h(\cdot)$ , experts $E_k$ (including $E_{\mathrm{share}}$ ), and their codebooks
for $k = 1$ to $K$ do
    for item $i_i^k \in \mathcal{D}_k$ do
        $\mathbf{x}_i^k \gets$ SemanticEmbedding $(i_i^k)$ // Pre-compute semantic embeddings
    end for
end for
for epoch $= 1$ to $E$ do
    Sample mini-batch $B = \{ \mathbf{x}_i^k \}$ from mixed domains
    for each item $\mathbf{x}_i^k$ in batch $B$ do
        $\mathbf{z}_i^k \gets f_\theta(\mathbf{x}_i^k)$ // Encode input
        $\hat{\mathbf{z}}_i^k, \mathbf{c}_i^k \gets \mathrm{TokenMoE}(\mathbf{z}_i^k)$ // Discretize via TokenMoE
        $\hat{\mathbf{x}}_i^k \gets g_\phi(\hat{\mathbf{z}}_i^k)$ // Decode reconstruction
    end for
    Compute reconstruction loss $\mathcal{L}_{\mathrm{Rec}}$ (Eq. (3) of the main manuscript)
    Compute RQ loss $\mathcal{L}_{\mathrm{RQ}}$ (Eq. (8) of the main manuscript)
    Compute MI loss $\mathcal{L}_{\mathrm{MI}}$ (Eq. (10) of the main manuscript)
    Update $f_\theta, g_\phi, h(\cdot)$ , and expert parameters
end for
return token assignments $\mathbf{c}_i^k$ for all items

Algorithm 1 Explanation:

Initialization: All model components—the encoder ( $f_\theta$ ), decoder ( $g_\phi$ ), router ( $h(\cdot)$ ), and all expert modules ( $E_k$ for domain-specific, $E_{\mathrm{share}}$ for shared), along with their respective codebooks—are initialized.
Semantic Embedding Pre-computation: For each item $i_i^k$ in every domain $\mathcal{D}_k$ , its semantic embedding $\mathbf{x}_i^k$ is obtained using a pre-trained embedding model. This step is performed once before the main training loop.
Epoch Loop: The model trains for a specified number of epochs ( $E$ ).
Mini-batch Sampling: In each epoch, a mini-batch $B$ of semantic embeddings is sampled. Crucially, this mini-batch contains items from mixed domains, reflecting the multi-domain nature of the problem.
Item Processing Loop: For each item $\mathbf{x}_i^k$ in the mini-batch:
- Encoding: The shared encoder $f_\theta$ converts the semantic embedding $\mathbf{x}_i^k$ into a latent embedding $\mathbf{z}_i^k$ .
- Tokenization via TokenMoE: The TokenMoE module takes $\mathbf{z}_i^k$ , routes it through selected experts (domain-specific and shared), applies residual quantization, and outputs a quantized embedding $\hat{\mathbf{z}}_i^k$ and the corresponding discrete token $\mathbf{c}_i^k$ .
- Decoding: The shared decoder $g_\phi$ maps the quantized embedding $\hat{\mathbf{z}}_i^k$ to $\hat{\mathbf{x}}_i^k$ , a reconstruction of the original semantic embedding $\mathbf{x}_i^k$ .
Loss Computation: After processing all items in the mini-batch, the three loss components are calculated:
- Reconstruction Loss ( $\mathcal{L}_{\mathrm{Rec}}$ ) from Eq. (3).
- Residual Quantization Loss ( $\mathcal{L}_{\mathrm{RQ}}$ ) from Eq. (8).
- Mutual Information Calibration Loss ( $\mathcal{L}_{\mathrm{MI}}$ ) from Eq. (10).
Parameter Update: All learnable parameters (of the encoder, decoder, router, and expert modules with their codebooks) are updated using an optimizer based on the total loss $\mathcal{L}_{\mathrm{total}}$ (Eq. 11).
Output: After training completes, the algorithm returns the discrete token assignments $\mathbf{c}_i^k$ for all items, which can then be used by LLM-based recommenders.
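The residual quantization step inside the tokenization stage can be sketched for a single expert as follows. This is a toy illustration under stated assumptions: nearest-neighbor code selection by squared Euclidean distance, two tiny hand-made codebooks, and no routing, shared expert, or training. The function names are hypothetical and do not come from the paper.

```python
def nearest_code(codebook, vec):
    """Index of the code vector closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def residual_quantize(z, codebooks):
    """Quantize z level by level, each level encoding the remaining residual."""
    residual = list(z)
    quantized = [0.0] * len(z)
    codes = []
    for codebook in codebooks:
        idx = nearest_code(codebook, residual)
        code = codebook[idx]
        codes.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return codes, quantized

# Made-up 2-level codebooks: a coarse level followed by fine corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes, z_hat = residual_quantize([1.3, 0.9], codebooks)
```

The returned `codes` play the role of the discrete token $\mathbf{c}_i^k$ and `z_hat` the role of the quantized embedding $\hat{\mathbf{z}}_i^k$ that is passed to the decoder.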

5. Experimental Setup

5.1. Datasets

The experiments are conducted on a wide range of real-world datasets from various domains. The paper uses ten source datasets for training and evaluation of UniTok's multi-domain performance, and three target datasets for zero-shot generalization evaluation.

Each item in these datasets contains metadata such as title, category, and features. An example of an item could be a product with its title ("Echo Dot (3rd Gen) - Smart speaker with Alexa"), category ("Electronics > Smart Home > Smart Speakers"), and features ("Voice control, Bluetooth, Wi-Fi enabled").

The following are the dataset statistics from [Table 6] of the original paper:

Dataset	# of users	# of items	# of interactions
Beauty	22,363	12,101	198,502
Cellphones	27,879	10,429	194,439
Grocery	14,681	8,713	151,254
Instruments	24,772	9,922	206,153
Office Products	4,905	2,420	53,258
Pet Supplies	19,856	8,510	157,836
Tools	16,638	10,217	134,476
Toys	19,412	11,924	167,597
Games	24,303	10,672	231,780
Yelp	30,431	20,033	316,354
Clothing	39,387	23,033	278,677
Health	38,609	18,533	346,355
Sports	35,598	18,357	296,337

Dataset Characteristics:

Source Domains (10 datasets): Beauty, Cellphones, Grocery, Instruments, Office Products, Pet Supplies, Tools, Toys, Games, and Yelp. These datasets represent diverse categories with varying numbers of users, items, and interactions, making them suitable for evaluating multi-domain performance.
Target Domains (3 datasets for zero-shot): Clothing, Health, and Sports. These are used to test UniTok's generalization ability to unseen domains.
Metadata: The reliance on item metadata (title, category, features) for generating semantic embeddings is crucial, as UniTok operates independently of user data for tokenization.

These datasets were chosen for their diversity and common use in recommendation research. Their varied characteristics allow for a robust evaluation of UniTok's ability to handle different data distributions and semantic nuances in a unified manner.

5.2. Evaluation Metrics

The paper adopts two widely used ranking metrics to evaluate the performance of recommendations in a full-ranking protocol: Recall@M and NDCG@M. Here, $M \in \{5, 10\}$ .

5.2.1. Recall@M ( $\mathrm{R}@M$ )

Conceptual Definition: Recall@M measures the proportion of relevant items that are successfully retrieved within the top $M$ recommendations. It focuses on the ability of the recommender system to find a significant portion of the user's preferred items among the highest-ranked suggestions, regardless of their precise ranking order within the top $M$ .

Mathematical Formula: $\mathrm{Recall}@M = \frac{ |\mathrm{Top}-M \mathrm{ \ recommended \ items} \cap \mathrm{Ground \ truth \ items}| }{ |\mathrm{Ground \ truth \ items}| }$

Symbol Explanation:

$|\mathrm{Top}-M \mathrm{ \ recommended \ items} \cap \mathrm{Ground \ truth \ items}|$ : The number of relevant items (i.e., items that are in the ground truth set) that are present in the top $M$ recommendations.
$|\mathrm{Ground \ truth \ items}|$ : The total number of relevant items for a given user or interaction.
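A minimal sketch of this formula for a single user is shown below. The function name and the example item IDs are made up for illustration.

```python
def recall_at_m(ranked_items, ground_truth, m):
    """Fraction of ground-truth items that appear in the top-m recommendations."""
    if not ground_truth:
        return 0.0
    hits = len(set(ranked_items[:m]) & set(ground_truth))
    return hits / len(ground_truth)

ranked = ["i3", "i7", "i1", "i9", "i4"]  # model output, best first
truth = {"i1", "i4", "i8"}               # held-out relevant items

recall_at_3 = recall_at_m(ranked, truth, 3)  # only i1 is in the top 3
recall_at_5 = recall_at_m(ranked, truth, 5)  # i1 and i4 are in the top 5
```

In practice the per-user values are averaged over all test users to obtain the reported Recall@M.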

5.2.2. NDCG@M (Normalized Discounted Cumulative Gain)

Conceptual Definition: NDCG@M is a metric that considers both the relevance of recommended items and their ranking positions. It assigns higher scores to relevant items that appear at higher ranks and penalizes relevant items that appear at lower ranks. It is normalized to be between 0 and 1, where 1 represents the ideal ranking.

Mathematical Formula: $\mathrm{NDCG}@M = \frac{1}{\mathrm{IDCG}@M} \sum_{i=1}^M \frac{1\{\mathrm{item}_i \in \mathrm{Ground \ truth}\}}{\log_2(i + 1)}$

Symbol Explanation:

$M$ : The number of top recommendations being considered.
$1\{\mathrm{item}_i \in \mathrm{Ground \ truth}\}$ : An indicator function that is 1 if the item at position $i$ in the ranked list is a relevant item (i.e., in the ground truth), and 0 otherwise. This effectively assigns a relevance score.
$\log_2(i + 1)$ : The logarithmic discount factor. Items at higher ranks (smaller $i$ ) receive less discount, meaning they contribute more to the DCG score.
$\mathrm{IDCG}@M$ : Ideal Discounted Cumulative Gain at M. This is the maximum possible DCG value for the given set of ground truth relevant items. It's calculated by placing all ground truth items at the top $M$ positions in decreasing order of their relevance, ensuring NDCG is normalized between 0 and 1.
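The binary-relevance form of the formula above can be sketched as follows. The function name and example items are illustrative; the single-item ground truth mirrors a next-item prediction setting, which is an assumption about the evaluation protocol.

```python
import math

def ndcg_at_m(ranked_items, ground_truth, m):
    """NDCG@m with binary relevance, following the formula above."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked_items[:m], start=1)
        if item in ground_truth
    )
    # Ideal ranking: all ground-truth items (at most m) placed at the top.
    ideal_hits = min(len(ground_truth), m)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["i3", "i7", "i1", "i9", "i4"]

ndcg_rank3 = ndcg_at_m(ranked, {"i1"}, 5)  # hit at position 3: 1 / log2(4)
ndcg_rank1 = ndcg_at_m(ranked, {"i3"}, 5)  # hit at position 1: ideal ranking
```

A hit at rank 3 scores 0.5 while the same hit at rank 1 scores 1.0, which shows how the logarithmic discount rewards placing relevant items higher.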

5.3. Baselines

UniTok is compared against nine benchmark recommendation methods, categorized into two groups:

5.3.1. Collaborative Filtering (CF) Methods

These are traditional recommenders that primarily rely on user-item interaction patterns.

MF (Matrix Factorization) (Rendle et al. 2009): A fundamental collaborative filtering technique that aims to discover latent features that explain user-item interactions by factorizing the user-item interaction matrix.
LightGCN (He et al. 2020): A simplified Graph Convolutional Network (GCN) for recommendation that leverages graph convolutional layers to propagate embeddings across the user-item interaction graph.
SASRec (Self-Attentive Sequential Recommendation) (Kang and McAuley 2018): A sequential recommendation model that uses a Transformer-like self-attention mechanism to capture long-term dependencies in user interaction sequences.
Bert4Rec (Sun et al. 2019): Another sequential recommendation model that adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to model user behavior sequences bidirectionally.

5.3.2. Item Tokenization-aided LLM-based Recommendation Methods

These methods are more directly comparable as they also focus on item tokenization for LLM-based recommendations.

P5-TID (P5 with title-based item IDs) (Hua et al. 2023): An LLM-based method in which each item is identified by its title text, which is fed to the LLM for generative recommendation.
P5-SemID (P5 - Semantic Item ID) (Hua et al. 2023): Similar to P5-TID, but utilizes semantic item IDs constructed from item metadata as tokens for the LLM.
TIGER (Transformer Index for GEnerative Recommenders) (Rajput et al. 2023): A pioneering codebook-based item tokenization method that converts item metadata into hierarchical code sequences for LLM-based generative retrieval.
LC-Rec (Learnable Code-based Recommendation) (Zheng et al. 2024): This method learns code indices for items via vector quantization and then tunes LLMs through alignment tasks for direct item generation.
LETTER (Learnable Item Tokenization for Generative Recommendation) (Wang et al. 2024): A method that proposes a learnable tokenizer for LLM-based generative recommendation, designed to integrate collaborative signals and mitigate code assignment bias. It also relies on codebook-based identifiers.

These baselines are chosen to represent both traditional and modern LLM-based recommendation paradigms, allowing for a comprehensive comparison against UniTok's unified item tokenization approach.

5.4. Implementation Details

Item Tokenization: UniTok uses 4-level codebook-based identifiers. Each codebook comprises 256 code vectors, and each code vector has a dimension of 32.
Loss Hyperparameters: The RQ loss weight $\lambda_{\mathrm{RQ}}$ is set to 1. The MI loss weight $\lambda_{\mathrm{MI}}$ is set to 0.03. The parameter $\alpha$ in the RQ loss (Eq. 8) is set to 0.25, and $\beta$ in the MI loss (Eq. 10) is set to 1.
Pre-trained Content Encoder: For semantic embeddings, a pre-trained TIGER encoder is used, specifically fine-tuned on the Beauty dataset. This encoder is chosen for its effectiveness in LLM-based generative recommendation.
Training: UniTok is trained for 10,000 epochs using the AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2019), with a learning rate of 1e-3 and a batch size of 1,024.
Baselines: For TIGER, fine-tuning is performed to convergence based on validation performance, with learning rates of 1e-3, 5e-4, 1e-4, 2e-4, 3e-4. Other baseline methods are implemented using parameters described in their original articles.
Hardware: All experiments are conducted on Intel(R) 12-Core (TM) E5-1650 v4 CPUs @ 3.60 GHz and NVIDIA GeForce RTX 3090 GPUs.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Can One Tokenizer Serve All Domains? (Recommendation Accuracy)

This section evaluates UniTok's recommendation accuracy against benchmarks across ten diverse source datasets. A key distinction is that UniTok trains a single unified model to handle all ten datasets jointly, unlike competitors that require separate tokenizers for each dataset.

The following are the results from [Table 1] of the original paper:

Method	Beauty	Cellphones	Grocery	Instruments	Office	Pet Supplies	Tools	Toys	Games	Yelp
MF	0.0369	0.0267	0.0216	0.0710	0.0255	0.0268	0.0169	0.0192	0.0366	0.0144
LightGCN	0.0285	0.0456	0.0357	0.0781	0.0301	0.0289	0.0257	0.0287	0.0417	0.0195
SASRec	0.0314	0.0446	0.0376	0.0609	0.0285	0.0301	0.0234	0.0239	0.0412	0.0183
Bert4Rec	0.0194	0.0268	0.0237	0.0573	0.0274	0.0161	0.0092	0.0177	0.0379	0.0131
P5-TID	0.0255	0.0357	0.0316	0.0721	0.0239	0.0243	0.0198	0.0202	0.0388	0.0154
P5-SemID	0.0304	0.0406	0.0351	0.0730	0.0283	0.0282	0.0237	0.0231	0.0432	0.0188
TIGER	0.0324	0.0446	0.0375	0.0788	0.0295	0.0279	0.0284	0.0268	0.0427	0.0208
LC-Rec	0.0381	0.0458	0.0369	0.0802	0.0311	0.0335	0.0307	0.0279	0.0451	0.0215
LETTER	0.0364	0.0473	0.0392	0.0831	0.0326	0.0307	0.0298	0.0291	0.0469	0.0231
UniTok	0.0478	0.0647	0.0533	0.0884	0.0432	0.0496	0.0439	0.0442	0.0476	0.0321
Gain	25.46%	36.78%	35.97%	6.38%	32.52%	48.06%	42.99%	51.89%	1.49%	38.96%

(Note: NDCG@10 values are shown. Improvements are statistically significant on average (p = 0.0219 < 0.05) based on paired t-tests over five runs across all datasets.)

Analysis:

UniTok consistently outperforms all benchmark methods across all ten datasets in terms of NDCG@10. The improvements are substantial, with gains ranging from 1.49% (Games) to an impressive 51.89% (Toys). This strongly validates the effectiveness of UniTok's unified tokenization framework in handling multiple domains.
Superiority of LLM-based Tokenization: The results show that item tokenization-aided LLM-based recommender systems (e.g., TIGER, LC-Rec, LETTER, and UniTok) generally perform better than traditional collaborative filtering methods (MF, LightGCN, SASRec, Bert4Rec). This highlights the benefits of LLMs in leveraging rich semantic understanding beyond mere user-item interactions.
Impact of Item Tokenization Design: Among the LLM-based methods, those incorporating carefully designed item tokenization (like TIGER, LC-Rec, LETTER, UniTok) show further gains compared to LLMs relying solely on item metadata for tokenization (P5-TID, P5-SemID). This underscores item tokenization as a crucial bridge for effective generative recommendation.
UniTok's ability to achieve this superior performance with a single model for all domains, unlike domain-specific baselines, demonstrates its conceptual strength and practical generality.
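The Gain row in Table 1 is consistent with the relative improvement of UniTok over the strongest baseline in each column. A minimal sketch of that computation, using the Beauty column, is given below; the function name is illustrative.

```python
def relative_gain(ours, baselines):
    """Relative improvement (in percent) over the best-performing baseline."""
    best = max(baselines.values())
    return (ours - best) / best * 100.0

# NDCG@10 on Beauty, copied from Table 1.
beauty_baselines = {
    "MF": 0.0369, "LightGCN": 0.0285, "SASRec": 0.0314, "Bert4Rec": 0.0194,
    "P5-TID": 0.0255, "P5-SemID": 0.0304, "TIGER": 0.0324,
    "LC-Rec": 0.0381, "LETTER": 0.0364,
}

beauty_gain = relative_gain(0.0478, beauty_baselines)
```

For Beauty the strongest baseline is LC-Rec (0.0381), and the computed gain rounds to the 25.46% reported in the table.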

6.1.2. Is UniTok More Efficient than Traditional Tokenizers? (Model Size)

This section compares the efficiency of UniTok in terms of the total number of trainable parameters against traditional codebook-based competitors (TIGER, LC-Rec, LETTER).

The following are the results from [Table 2] of the original paper:

Module	Codebook-based methods	UniTok
Codebook	0.33M	0.36M
Autoencoder	87.45M	8.75M
Router		0.01M
Total	87.78M	9.11M

Analysis:

UniTok achieves approximately a 9.63x reduction in the total number of trainable parameters (9.11M for UniTok vs. 87.78M for competitors).
This efficiency gain primarily comes from the use of a shared autoencoder in UniTok (8.75M parameters) compared to the domain-specific autoencoders used by competitors (87.45M accumulated across 10 domains).
The additional parameters introduced by UniTok's codebook (0.36M vs. 0.33M for competitors' accumulated codebooks) and TokenMoE router (0.01M) are negligible in comparison to the savings from the shared autoencoder.
This demonstrates that UniTok offers substantial advantages in scalability and deployment efficiency by eliminating the need for domain-specific tokenization models.

6.1.3. Performance under Unified Training Setup

To further validate UniTok's efficiency and architectural advantages, competitors are also evaluated under a single unified training setup, where they are trained jointly across all ten datasets with a comparable number of trainable parameters to UniTok.

The following are the results from [Table 3] of the original paper:

	Beauty		Cellphones		Grocery
Method	R@10	N@10	R@10	N@10	R@10	N@10
TIGER	0.0499	0.0267	0.0661	0.0342	0.0576	0.0273
LC-Rec	0.0564	0.0302	0.0647	0.0337	0.0584	0.0287
LETTER	0.0528	0.0288	0.0678	0.0363	0.0618	0.0315
UniTok	0.0934	0.0478	0.1251	0.0647	0.1061	0.0533
Gain	65.60%	58.28%	84.51%	78.23%	71.68%	69.21%

Analysis:

When forced to train in a unified setup (using a shared model across domains with a similar parameter budget as UniTok), the competing methods (TIGER, LC-Rec, LETTER) show a substantial performance degradation compared to their single-domain results (Table 1). This is because they struggle to distinguish and specialize for items from different domains when using a naïvely shared tokenization without UniTok's MoE and MI calibration.
In contrast, UniTok maintains consistently superior recommendation performance, achieving impressive improvements (e.g., 84.51% in Recall@10 on Cellphones) even with a similar trainable parameter budget.
This further highlights the strength of UniTok's modular TokenMoE architecture, which allows domain-specific experts to learn semantics independently while operating within a unified token space, making it truly effective for multi-domain scenarios.

6.1.4. Can UniTok be Generalized to Unseen Domains? (Zero-shot Performance)

To assess UniTok's generalization ability, a zero-shot setting is used. UniTok is trained once on the ten source datasets and then directly tested on three unseen target domains (Clothing, Health, Sports) without any additional training or fine-tuning.

The following are the results from [Table 4] of the original paper:

	Clothing		Health		Sports
Method	R@10	N@10	R@10	N@10	R@10	N@10
TIGER	0.0501	0.0242	0.0677	0.0342	0.0469	0.0228
LC-Rec	0.0527	0.0266	0.0694	0.0358	0.0494	0.0246
LETTER	0.0515	0.0257	0.0717	0.0375	0.0510	0.0265
UniTok	0.0592	0.0288	0.0835	0.0442	0.0591	0.0298
Gain	12.33%	8.27%	16.46%	17.87%	15.88%	12.45%

Analysis:

UniTok significantly outperforms existing item tokenization-based recommender systems on all three unseen domains. The improvements in NDCG@10 are notable, reaching up to 17.87% on Health.
This robust zero-shot performance is a critical finding. It demonstrates that UniTok effectively learns a discrete token space that captures transferable item semantics across diverse domains, without needing any additional training or fine-tuning for new domains.
In contrast, competitors would typically require retraining on each new dataset to achieve reasonable tokenization results, highlighting UniTok's superior generalizability and practical utility in dynamic environments.

6.2. Ablation Studies / Parameter Analysis

6.2.1. What Makes UniTok Effective? (Ablation Study)

An ablation study is conducted to assess the contribution of each core module within UniTok by progressively removing or modifying them.

UniTok-1: Removes TokenMoE and MI calibration, using only a single set of codebooks (monolithic approach without MoE).
UniTok-2: Keeps TokenMoE, but removes the shared expert and MI calibration.
UniTok-3: Removes only the MI calibration (keeps TokenMoE with shared expert).

UniTok: The full model with all components.

The following are the results from [Table 8] of the original paper:

	Beauty		Cellphones		Grocery		Instruments		Yelp
Method	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10
UniTok-1	0.0558	0.0304	0.0702	0.0371	0.0633	0.0342	0.0926	0.0742	0.0345	0.0177
UniTok-2	0.0896	0.0436	0.1194	0.0606	0.0989	0.0497	0.1273	0.0851	0.0624	0.0281
UniTok-3	0.0915	0.0457	0.1225	0.0622	0.1044	0.0515	0.1327	0.0868	0.0657	0.0303
UniTok	0.0934	0.0478	0.1251	0.0647	0.1061	0.0533	0.1361	0.0884	0.0684	0.0321

Analysis:

Impact of TokenMoE (UniTok-1 vs. UniTok-2): Comparing UniTok-1 (no MoE, no MI) with UniTok-2 (with TokenMoE, but no shared expert or MI) shows a significant performance jump across all datasets. For example, NDCG@10 on Beauty improves from 0.0304 to 0.0436. This highlights the crucial role of the TokenMoE module in capturing domain-specific semantics and improving accuracy.
Impact of Shared Expert (UniTok-2 vs. UniTok-3): UniTok-2 and UniTok-3 both omit MI calibration and differ only in the shared expert, so this comparison isolates its contribution. Adding the shared expert improves NDCG@10 on Beauty from 0.0436 to 0.0457, with similar gains on the other datasets, which suggests that the shared expert helps transfer common knowledge across domains.
Impact of MI Calibration (UniTok-3 vs. UniTok): Comparing UniTok-3 (full TokenMoE but no MI calibration) with the full UniTok model reveals that MI calibration provides further performance enhancements (e.g., NDCG@10 on Beauty from 0.0457 to 0.0478). This confirms that ensuring semantic balance and consistent informativeness across domains is vital for optimal performance.
Overall, the ablation study confirms that each component of UniTok (TokenMoE, shared expert, and MI calibration) is essential for its high effectiveness, with TokenMoE providing the largest boost and MI calibration refining the performance.

6.2.2. How Sensitive is UniTok to Key Parameters?

The paper analyzes the sensitivity of UniTok to key hyperparameters: number of quantization levels ( $L$ ), codebook size ( $T$ ), and the loss weights ( $\lambda_{\mathrm{RQ}}$ and $\lambda_{\mathrm{MI}}$ ) using NDCG@10 on Beauty, Cellphones, and Grocery datasets.

The following figures (Figures 4, 5, and 6 from the original paper) illustrate the sensitivity analysis on Beauty, Cellphones, and Grocery datasets, respectively.

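Figure 4: Sensitivity analysis on Beauty. The image is a chart showing the sensitivity analysis for the Beauty domain, with four subplots (a), (b), (c), and (d). In each subplot, the y-axis shows NDCG@10 and the x-axis shows a different parameter: $L$ (a), $T$ (b), $\lambda_{RQ}$ (c), and $\lambda_{MI}$ (d). The results show how recommendation performance changes under different parameter settings.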

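Figure 5: Sensitivity analysis on Cellphones. The image is a chart showing the sensitivity analysis for the Cellphones domain. It contains four subplots (a, b, c, d), each evaluating the effect of one parameter on NDCG@10: (a) varies $L$ , (b) varies $T$ , (c) varies $\lambda_{RQ}$ , and (d) varies $\lambda_{MI}$ . In each subplot, the y-axis is NDCG@10 and the x-axis is the value of the corresponding parameter, showing where performance rises or falls as the parameter changes.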

Figure 6: Sensitivity analysis on Grocery.

Analysis of Parameter Sensitivity:

Number of Quantization Levels ( $L$ ): (Figures 4a, 5a, 6a)
- Performance generally improves as $L$ increases from 2 to 4. This is because a longer sequence of codes allows for a more fine-grained and expressive representation of item semantics.
- However, increasing $L$ beyond 4 leads to performance degradation. This is attributed to error accumulation in longer autoregressive sequences during recommendations, potentially making the tokenization too complex or introducing noise.
Codebook Size ( $T$ ): (Figures 4b, 5b, 6b)
- Performance tends to improve as $T$ (number of code vectors per codebook) increases, up to a certain point (e.g., 256 or 512). A larger codebook provides more distinct code vectors, enabling richer semantic representation.
- However, too large a codebook size can lead to overfitting to spurious or less meaningful patterns, causing performance to plateau or slightly degrade.
RQ Loss Weight ( $\lambda_{RQ}$ ): (Figures 4c, 5c, 6c)
- Performance peaks around $\lambda_{RQ} = 1$ across the datasets.
- Setting $\lambda_{RQ}$ too low means insufficient emphasis on the quantization process, leading to poor tokenization.
- Setting $\lambda_{RQ}$ too high might over-regularize the codebook-based identifiers, potentially underutilizing their flexibility or forcing encoder/router to commit to suboptimal quantizations. This highlights the need to balance the reconstruction quality with the quantization fidelity.
MI Calibration Loss Weight ( $\lambda_{MI}$ ): (Figures 4d, 5d, 6d)
- The highest NDCG@10 is achieved at $\lambda_{MI} = 0.03$ .
- A high $\lambda_{MI}$ may over-constrain the latent space, forcing it to align MI values at the expense of domain-specific information or reconstruction quality.
- A low $\lambda_{MI}$ weakens the influence of mutual information preservation, limiting the semantic alignment of token representations and thus hindering generalization.
- This indicates that a properly tuned $\lambda_{MI}$ is crucial for balancing semantic preservation with overall model performance and generalization.

6.2.3. Empirical Validation of Theoretical Claims

The paper provides empirical evidence to support its three theoretical theorems.

6.2.3.1. Token Space Entropy (Theorem 1)

The following are the results from [Table 9] of the original paper:

Method	Token Space Entropy
Codebook-based methods	9.63
UniTok without the router	9.63
UniTok (full)	10.42

Analysis:

Standard codebook-based methods (or UniTok without the router) exhibit a token space entropy of 9.63 (base-2 logarithm, approximately $2^{9.63}$ distinct states).
The full UniTok model, with its TokenMoE router, achieves a higher token space entropy of 10.42.
This empirical result directly validates Theorem 1, demonstrating that the judicious incorporation of multiple experts and a router significantly increases the overall entropy, thereby expanding the capacity and richness of the token space.

6.2.3.2. Quantization Error (Theorem 2)

The following figure (Figure 7 from the original paper) compares the residual quantization loss over training epochs between UniTok and LETTER.

Figure 7: Comparison of residual quantization loss over training epochs between UniTok and LETTER.

Analysis:

Figure 7 shows that UniTok achieves a significantly lower residual quantization loss than LETTER over training epochs. While LETTER's loss converges at a higher value, UniTok's loss steadily decreases to a much lower point.
This empirically supports Theorem 2, which states that UniTok's expected quantization error is less than or equal to standard codebook-based methods. The TokenMoE framework's ability to leverage expert specialization leads to more precise item tokenization and lower information loss, even in heterogeneous environments.

6.2.3.3. MI Variance Analysis (Theorem 3)

The following figure (Figure 8 from the original paper) illustrates the relationship between MI variance and performance gap.

Figure 8: Relationship between MI variance and performance gap.

Analysis:

Figure 8 plots the relationship between the variance of MI estimates ( $\mathrm{Var}[\widehat{I}^{(k)}]$ ) across domains and the observed performance variability ( $\max_{i,j}|\mathcal{L}^{(i)} - \mathcal{L}^{(j)}|$ ).
The graph shows a positive correlation: as the variance of MI increases, the performance variability (the maximum difference in loss between any two domains) also tends to increase.
This trend is consistent with Theorem 3 and its Lipschitz continuity assumption. It suggests that reducing the variance of MI across domains, as UniTok's MI calibration mechanism aims to do, leads to more consistent semantic representations and, in turn, to more stable downstream performance in multi-domain recommendation.
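The two quantities on the axes of Figure 8 are simple to compute once per-domain MI estimates and per-domain losses are available. The sketch below is a minimal illustration under that assumption. The domain names are taken from the datasets in this analysis, but every numeric value is made up for demonstration.

```python
import itertools
import statistics

def mi_variance(mi_by_domain):
    """Population variance of the per-domain MI estimates."""
    return statistics.pvariance(list(mi_by_domain.values()))

def max_loss_gap(loss_by_domain):
    """Largest absolute loss difference between any two domains."""
    pairs = itertools.combinations(loss_by_domain.values(), 2)
    return max(abs(a - b) for a, b in pairs)

# Hypothetical values, not taken from the paper.
balanced_mi = {"Beauty": 0.50, "Tools": 0.52, "Yelp": 0.49}
imbalanced_mi = {"Beauty": 0.80, "Tools": 0.30, "Yelp": 0.55}
losses = {"Beauty": 0.5, "Tools": 0.9, "Yelp": 0.7}

print(mi_variance(balanced_mi), mi_variance(imbalanced_mi))
print(max_loss_gap(losses))
```

Plotting `mi_variance` against `max_loss_gap` over several training runs or checkpoints would give a scatter of the kind Figure 8 describes.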

6.3. Complete Set of Experimental Results (Appendix D.5)

The appendix provides the complete results for both Recall@10 and NDCG@10 across all ten source datasets.

The following are the results from [Table 7] of the original paper:

	Beauty		Cellphones		Grocery		Instruments		Office Products
Method	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10
MF	0.0614	0.0369	0.0604	0.0267	0.0418	0.0216	0.0930	0.0710	0.0569	0.0255
LightGCN	0.0639	0.0285	0.0668	0.0456	0.0697	0.0357	0.1008	0.0781	0.0587	0.0301
SASRec	0.0646	0.0314	0.0651	0.0446	0.0701	0.0376	0.0905	0.0609	0.0574	0.0285
Bert4Rec	0.0372	0.0194	0.0507	0.0268	0.0448	0.0237	0.0791	0.0573	0.0563	0.0274
P5-TID	0.0532	0.0255	0.0648	0.0357	0.0617	0.0316	0.0928	0.0721	0.0557	0.0239
P5-SemID	0.0584	0.0304	0.0737	0.0406	0.0641	0.0351	0.0964	0.0730	0.0592	0.0283
TIGER	0.0624	0.0324	0.0838	0.0446	0.0706	0.0375	0.1047	0.0788	0.0594	0.0295
LC-Rec	0.0684	0.0381	0.0859	0.0458	0.0722	0.0369	0.1066	0.0802	0.0637	0.0311
LETTER	0.0672	0.0364	0.0876	0.0473	0.0731	0.0392	0.1122	0.0831	0.0649	0.0326
UniTok	0.0934	0.0478	0.1251	0.0647	0.1061	0.0533	0.1361	0.0884	0.0897	0.0432
Improve	36.55%	25.46%	42.81%	36.78%	45.14%	35.97%	21.30%	6.38%	38.21%	32.52%

	Pet Supplies		Tools		Toys		Games		Yelp
Method	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10	R@10	N@10
MF	0.0503	0.0268	0.0356	0.0169	0.0405	0.0192	0.0359	0.0366	0.0304	0.0144
LightGCN	0.0529	0.0289	0.0482	0.0257	0.0495	0.0287	0.0407	0.0417	0.0368	0.0195
SASRec	0.0538	0.0301	0.0475	0.0234	0.0473	0.0239	0.0401	0.0412	0.0354	0.0183
Bert4Rec	0.0313	0.0161	0.0184	0.0092	0.0333	0.0177	0.0363	0.0379	0.0272	0.0131
P5-TID	0.0485	0.0243	0.0377	0.0198	0.0419	0.0202	0.0372	0.0388	0.0316	0.0154
P5-SemID	0.0569	0.0282	0.0405	0.0237	0.0445	0.0231	0.0398	0.0432	0.0324	0.0188
TIGER	0.0546	0.0279	0.0507	0.0284	0.0486	0.0268	0.0438	0.0427	0.0394	0.0208
LC-Rec	0.0648	0.0335	0.0561	0.0307	0.0538	0.0279	0.0442	0.0451	0.0418	0.0215
LETTER	0.0596	0.0307	0.0556	0.0298	0.0546	0.0291	0.0559	0.0469	0.0426	0.0231
UniTok	0.0955	0.0496	0.0852	0.0439	0.0902	0.0442	0.0565	0.0476	0.0684	0.0321
Improve	47.38%	48.06%	51.87%	42.99%	65.20%	51.89%	1.07%	1.49%	60.56%	38.96%

(Note: Improvement values indicate the gain of UniTok over the best competitor for each metric and dataset. UniTok consistently outperforms competitors across both Recall@10 and NDCG@10, with statistically significant improvements (p < 0.05).)

Analysis: These tables reinforce the findings from the main manuscript: UniTok outperforms every baseline on both Recall@10 and NDCG@10 on all ten datasets. The gains over the best competitor are often substantial, ranging from 1.07% (Games, R@10) to 65.20% (Toys, R@10); Games is the only dataset where both margins fall below 2%. The statistically significant improvements ( $p < 0.05$ ) further support the robustness of these results. UniTok surpasses both traditional CF methods and existing LLM-based tokenization methods while handling heterogeneous item domains with a single model.
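The Improve row can be reproduced from the table itself as the relative gain of UniTok over the best-performing baseline. The short sketch below does this for two cells, using values copied from Table 7 above; the function name is an illustrative choice.

```python
def improvement_pct(ours, baselines):
    """Relative gain (in percent) over the strongest baseline."""
    best = max(baselines.values())
    return (ours - best) / best * 100.0

# Recall@10 on Beauty, baseline values from Table 7.
beauty_r10 = {
    "MF": 0.0614, "LightGCN": 0.0639, "SASRec": 0.0646, "Bert4Rec": 0.0372,
    "P5-TID": 0.0532, "P5-SemID": 0.0584, "TIGER": 0.0624,
    "LC-Rec": 0.0684, "LETTER": 0.0672,
}
# NDCG@10 on Toys, baseline values from Table 7.
toys_n10 = {
    "MF": 0.0192, "LightGCN": 0.0287, "SASRec": 0.0239, "Bert4Rec": 0.0177,
    "P5-TID": 0.0202, "P5-SemID": 0.0231, "TIGER": 0.0268,
    "LC-Rec": 0.0279, "LETTER": 0.0291,
}

beauty_gain = improvement_pct(0.0934, beauty_r10)  # UniTok R@10 on Beauty
toys_gain = improvement_pct(0.0442, toys_n10)      # UniTok N@10 on Toys
print(f"{beauty_gain:.2f}% {toys_gain:.2f}%")      # prints "36.55% 51.89%"
```

Note that the best baseline differs by cell: LC-Rec is strongest for Beauty R@10, while LETTER is strongest for Toys N@10.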

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper addresses a fundamental challenge in multi-domain LLM-based recommendation: the inefficiency and semantic inconsistency of existing item tokenization methods. The proposed UniTok framework integrates a customized Mixture-of-Experts (TokenMoE) architecture with codebooks to generate semantically meaningful tokens across diverse domains. By employing a shared encoder and routing items to both domain-specific experts and an always-active shared expert, UniTok captures both unique and common semantic patterns. Furthermore, its mutual information (MI) calibration mechanism promotes semantic balance and consistent informativeness across domains.

Empirical evaluations on a wide range of real-world datasets demonstrate UniTok's significant advantages:

Effectiveness: Achieving up to 51.89% improvements in NDCG@10 over strong benchmarks.
Efficiency: Reducing trainable parameters by 9.63x compared to single-domain specialized tokenizers.
Generalizability: Showing robust zero-shot performance on unseen domains without requiring any retraining.

These findings are further supported by theoretical analyses, which validate that UniTok induces a higher token space entropy, achieves lower quantization error, and ensures semantic consistency by reducing MI variance.

7.2. Limitations & Future Work

The paper primarily focuses on the item tokenization aspect of LLM-based recommendation. While it demonstrates UniTok's effectiveness in generating item tokens, it does not delve deeply into the integration of collaborative signals (user-item interactions) or the specific LLM architecture used for downstream recommendation. The evaluation therefore measures how token quality affects recommendation performance rather than assessing the LLM itself.

The authors propose future work that includes extending UniTok into a general-purpose tokenization interface for foundation models in recommendation. This suggests a vision where UniTok could become a standardized preprocessing layer, simplifying the adoption of powerful foundation models for diverse recommendation tasks.

7.3. Personal Insights & Critique

This paper presents an elegant and practical solution to a critical problem in LLM-based recommendation. The unified tokenization concept is highly valuable for real-world applications where recommender systems often operate across many domains.

Innovation of TokenMoE: The TokenMoE architecture is a clever adaptation of MoE principles, specifically tailored for multi-domain tokenization. The inclusion of both domain-specific experts and a shared expert provides an intuitive way to manage the trade-off between specialization and generalization. The inductive bias from initializing experts with mean features is a neat trick to guide specialization without explicit supervision.
Power of MI Calibration: The MI calibration mechanism is a sophisticated and crucial component. Semantic imbalance is a subtle but significant issue in multi-domain learning, and explicitly optimizing for MI variance provides a robust way to ensure that all domains are adequately represented, preventing performance disparities. The theoretical backing for this, especially Theorem 3, adds strong credibility to its design.
Efficiency and Scalability: The demonstrated 9.63x parameter reduction is a massive practical benefit, making UniTok significantly more deployable and sustainable than per-domain tokenizers. This efficiency, coupled with strong zero-shot generalization, positions UniTok as a potential industry standard.
Potential Areas for Further Exploration:
- Dynamic Expert Routing: While the Top-N routing is effective, exploring more dynamic or adaptive routing mechanisms for $N$ (e.g., varying $N$ based on domain complexity or item characteristics) could offer further refinements.
- Computational Cost of HSIC: While effective, HSIC computation can be demanding for very large datasets due to kernel matrix operations. Investigating more scalable or approximate MI estimation techniques in the context of UniTok could be beneficial.
- Integration with LLMs: The paper positions UniTok as a tokenizer for LLMs. A deeper dive into how different LLM architectures (e.g., encoder-decoder, decoder-only) specifically leverage UniTok's tokens, and whether certain tokenization strategies within UniTok (e.g., specific codebook sizes or levels) interact differently with various LLM types, would be interesting.
- Cold-Start Scenarios for New Domains: While zero-shot performance is strong, more investigation into how UniTok handles domains with extremely sparse metadata or very novel item types could yield further insights.
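To illustrate the HSIC cost noted in the list above, here is a minimal sketch of the standard biased HSIC estimator, $\mathrm{tr}(KHLH)/(n-1)^2$, with Gaussian kernels. The kernel choice, the bandwidth `sigma`, and the synthetic inputs are assumptions made for illustration and are not confirmed details of UniTok's implementation. The two $n \times n$ kernel matrices are what make the computation expensive for large $n$.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for the rows of x: (n, n)."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    """Biased HSIC estimate; needs O(n^2) memory for K, L, and H."""
    n = x.shape[0]
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_dependent = x + 0.1 * rng.normal(size=(200, 2))   # strongly tied to x
y_independent = rng.normal(size=(200, 2))           # unrelated to x

hsic_dep = hsic_biased(x, y_dependent)
hsic_ind = hsic_biased(x, y_independent)
print(hsic_dep, hsic_ind)
```

Computing the estimate on mini-batches rather than on the full dataset is one common way to keep the quadratic cost manageable, at the price of a noisier estimate.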
  
Overall, UniTok represents a significant step forward in multi-domain recommendation, offering a theoretically sound and empirically robust solution that aligns well with the growing trend of foundation models and unified AI systems.
