Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
TL;DR Summary
This paper introduces Semantic ID prefix ngram, a novel token parameterization technique that enhances embedding stability in recommendation systems by hierarchically clustering items based on their content, addressing key challenges like data pollution and performance degradation.
Abstract
The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the ID space and control the corresponding model parameters (i.e., the embedding table). However, this approach introduces data pollution from multiple IDs sharing the same embedding, leading to degraded model performance and embedding representation instability. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID". It focuses on improving the stability and performance of item embeddings in large-scale recommendation systems using a novel approach called Semantic ID.
1.2. Authors
The paper lists numerous authors, primarily affiliated with AI at Meta, indicating a strong industry research background. Carolina Zheng is also affiliated with Columbia University and performed this work during a 2024 Internship at Meta. The extensive list of authors from Meta suggests a collaborative effort within a large research team, common for impactful industrial research.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, with the original source link being https://arxiv.org/abs/2504.02137v1. While arXiv is a reputable platform for sharing research quickly and openly, it hosts preprints that have not yet undergone formal peer review for a specific journal or conference. However, given the affiliations (Meta, Columbia University) and the detailed experimental results, it is likely intended for a top-tier machine learning or recommendation systems conference (e.g., KDD, WWW, RecSys, NeurIPS, ICML) or journal. The "Published at (UTC): 2025-04-02T21:28:38.000Z" timestamp indicates when the preprint was posted to arXiv.
1.4. Publication Year
The publication year is stated as 2025 (based on the Published at (UTC) timestamp).
1.5. Abstract
This paper addresses critical challenges in ID-based models within industrial recommendation systems, such as extremely high item cardinality, dynamic ID spaces, skewed engagement distributions, and prediction instability caused by the natural lifecycle of item IDs. Existing solutions like random hashing mitigate memory issues but introduce data pollution from collisions, degrading model performance and embedding representation stability.
The paper introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly enhances the original Semantic ID approach. Unlike random hashing which assigns IDs randomly, Semantic ID prefix-ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings. Through extensive experimentation, the authors demonstrate that this method not only resolves embedding instability but also substantially improves tail ID modeling, reduces overfitting, and mitigates representation shifts. The approach shows considerable performance improvements when integrated into attention-based models that contextualize user histories. The paper concludes by reporting successful integration of Semantic ID into Meta's production Ads Ranking system, yielding notable performance gains and enhanced prediction stability in live deployments.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2504.02137v1
- PDF Link: https://arxiv.org/pdf/2504.02137v1.pdf
- Publication Status: This paper is currently a preprint on arXiv, as indicated by the version tag and the platform. While widely accessible, it has not yet undergone formal peer review for a conference or journal; the Published at (UTC) date reflects when the preprint was posted.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper lies in the practical challenges of learning effective item embedding representations in large-scale, industrial recommendation systems. These systems often deal with billions of items, leading to several key data-related issues:
- Item Cardinality: The sheer number of distinct items (e.g., products, ads) makes it infeasible to assign a unique embedding to every single item due to memory and computational constraints.
- Impression Skew: A small fraction of "head" items receives the vast majority of user impressions and interactions, while a "long tail" of items has very few interactions. This makes it difficult to learn robust embeddings for tail items due to insufficient training data.
- ID Drifting: The item space is highly dynamic, with new items constantly entering and old items leaving the system. This "raw ID drifting" causes embedding representations to become unstable over time, as the meaning of an embedding might change as it is assigned to different items.

Traditional solutions, such as random hashing, map multiple raw item IDs to a shared embedding space to manage cardinality. However, this introduces random collisions, where semantically unrelated items might share the same embedding. This leads to:

- Degraded Model Performance: Contradictory gradient updates for randomly colliding items hinder learning.
- Embedding Representation Instability: The learned embeddings lack a stable semantic meaning, especially over long training periods or in dynamic item environments.
- Poor Tail Item Modeling: Random hashing doesn't facilitate knowledge sharing between popular and unpopular items, leaving tail items with sparse or unstable representations.

The paper's innovative idea is to move beyond random assignments and leverage semantic similarity for item representation. It explores Semantic ID, a recently proposed approach that derives item IDs from hierarchical clusters based on the semantic similarity of their content (text, image, video). The motivation is that a fixed, semantically meaningful ID space can inherently address the instability and knowledge-sharing issues that random hashing fails to resolve.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Empirical Understanding of Semantic ID Stability: Through experiments on a simplified Meta ads ranking model, the authors deepen the empirical understanding of how Semantic ID improves embedding representation stability compared to random hashing and individual embeddings.
- Novel Token Parameterization (Semantic ID prefix-ngram): They propose a new token parameterization technique, Semantic ID prefix-ngram, which significantly enhances the performance of the original Semantic ID. This method effectively incorporates the hierarchical nature of item clusters, allowing for better knowledge sharing.
- Characterization of Item Distribution Challenges: The paper clearly characterizes the challenges of item cardinality, impression skew, and ID drifting and explains their direct impact on embedding representation stability.
- Addressing Key Challenges:
  - Improved Tail ID Modeling: Semantic ID significantly benefits tail items and new cold-start items by enabling knowledge transfer from semantically similar, popular items.
  - Reduced Overfitting and Representation Shifts: The approach leads to more stable learned representations over time, making the model less sensitive to ID drifting and distribution shifts.
  - Outsized Gains in Contextualizing Models: Semantic ID provides substantial performance improvements when integrated into attention-based user history models (e.g., Transformer, Pooled Multihead Attention (PMA)) by enabling more focused and meaningful attention patterns.
- Successful Productionization at Meta: The authors describe the successful integration of Semantic ID prefix-ngram features into Meta's production Ads Ranking system.
  - Significant Online Performance Gains: Semantic ID features resulted in a notable 0.15% online performance gain on top-line metrics, considered highly significant for a highly optimized system.
  - Enhanced Prediction Stability: The approach significantly reduces A/A variance (prediction variance for identical items), leading to more robust ad ranking orders and improved advertiser trust.
  - Correlation of Semantic and Prediction Similarity: Experiments demonstrate that prediction similarity is correlated with semantic similarity, and deeper Semantic ID prefixes monotonically reduce the click loss rate.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several fundamental concepts in recommendation systems and deep learning.
- Recommendation Systems: Systems that suggest items (products, movies, ads, etc.) to users based on their preferences and past behavior. The goal is to predict which items a user will like or interact with.
- Item IDs: Unique identifiers assigned to each item in a recommendation system. These are typically categorical features.
- Embeddings (Embedding Vectors): Dense, low-dimensional vector representations of discrete (categorical) features, like item IDs, users, or categories. Instead of treating each ID as a distinct one-hot encoded vector (which would be very high-dimensional and sparse), embeddings map them to a continuous vector space where semantically similar items are closer together. These vectors are learned during model training.
  - Embedding Table: A matrix where each row corresponds to a unique embedding for a specific categorical feature (e.g., an item ID). When an item ID is input to the model, its corresponding embedding vector is looked up in this table.
- High Cardinality: A situation where a categorical feature (like item ID) can take on a very large number of distinct values (e.g., billions of unique items). This poses challenges for embedding tables, as they would become prohibitively large.
- Random Hashing: A technique used to manage high cardinality by mapping a large, sparse feature space into a smaller, dense space. For example, item IDs can be hashed to a fixed number of embedding slots. While this saves memory, it introduces "collisions" where different original IDs map to the same embedding. If hashing is random, these collisions are arbitrary, leading to data pollution.
- ID Drifting: In dynamic systems, old item IDs are constantly retired and new ones are introduced. This causes the distribution of active item IDs to shift over time. If embeddings are tied directly to these volatile raw IDs (especially with random hashing), the meaning of an embedding vector can "drift" as it might represent different sets of items over time.
- Impression Skew: The phenomenon where a small number of "head" items receive a disproportionately high number of impressions or interactions, while the vast majority of "tail" items receive very few. This makes it challenging to learn good embedding representations for tail items due to sparse data.
- Deep Learning Recommendation Model (DLRM): A widely deployed deep learning architecture for recommendation systems (Covington et al., 2016; Naumov et al., 2019). It typically consists of an embedding layer for categorical features, a dense layer for numerical features, an interaction layer to combine features, and MLPs (multi-layer perceptrons) for the final prediction.
- Vector Quantization (VQ): A technique that approximates continuous vectors with discrete "codebook" vectors. It's like clustering, where each continuous vector is assigned to the closest vector in a learned codebook. This effectively discretizes a continuous space into a finite set of codes.
- Residual Quantized Variational Autoencoder (RQ-VAE): An advanced vector quantization model (Zeghidour et al., 2021). RQ-VAE uses a stack of quantizers (a "residual" approach) to progressively refine the representation. Instead of quantizing a vector once, it quantizes the residual (the error) from the previous quantization layer. This creates a hierarchical sequence of discrete codes, where earlier codes capture coarser semantic categories and later codes capture finer details.
- Transformer: A neural network architecture (Vaswani et al., 2017) that relies heavily on self-attention mechanisms. It processes sequences by weighing the importance of different parts of the input sequence to each other.
  - Attention Mechanism: A core component in Transformers that allows the model to selectively focus on different parts of the input sequence when processing a particular element. It calculates a weighted sum of "value" vectors, where the weights are determined by the similarity between a "query" vector and "key" vectors. The standard Attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    - $Q$: Query matrix. Each row is a query vector representing an element in the target sequence.
    - $K$: Key matrix. Each row is a key vector representing an element in the source sequence.
    - $V$: Value matrix. Each row is a value vector associated with each element in the source sequence.
    - $d_k$: The dimension of the key vectors, used to scale the dot products to prevent vanishing gradients.
    - $\mathrm{softmax}$: A function that converts raw scores into probability distributions. This mechanism allows the model to "attend" to relevant parts of the input sequence.
  - Multihead Attention: An extension of Attention where the attention mechanism is run multiple times in parallel, each with different learned linear projections of the queries, keys, and values. The outputs are concatenated and linearly transformed. This allows the model to capture different types of relationships or aspects of the input.
- Pooled Multihead Attention (PMA): A variant of Multihead Attention (Lee et al., 2019) that uses a fixed set of learnable "seed" vectors as queries, rather than using the input sequence elements themselves as queries. This is particularly useful for aggregating information from a sequence into a fixed-size representation, making it suitable for summarizing user history sequences.
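To make the Attention and PMA definitions above concrete, here is a minimal NumPy sketch (not from the paper) of scaled dot-product attention; the only change for PMA-style pooling is that the queries come from a small set of learnable seed vectors instead of from the sequence itself. All array names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (num_queries, num_keys)
    return softmax(scores, axis=-1) @ V      # (num_queries, d_v)

rng = np.random.default_rng(0)
T, d = 8, 16                                 # history length, embedding dim (illustrative)
X = rng.normal(size=(T, d))                  # sequence of item embeddings

# Self-attention (Transformer-style): queries come from the sequence itself.
self_attn_out = attention(X, X, X)           # shape (T, d)

# PMA-style pooling: queries are k seed vectors, giving a fixed-size summary of the history.
k = 4
seeds = rng.normal(size=(k, d))              # stand-in for learned seed parameters
pma_out = attention(seeds, X, X)             # shape (k, d), independent of T
```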
3.2. Previous Works
The paper contextualizes its work by referencing several related areas:
- Item Representations in Recommendation:
- Modern deep learning recommendation models (Covington et al., 2016; Naumov et al., 2019; Naumov, 2019) heavily rely on trained embeddings for categorical features.
- Random hashing (Weinberger et al., 2009) is a simple solution for high item cardinality.
- More advanced hashing methods include collision-free hashing (Liu et al., 2022), which dynamically manages memory for individual embeddings, and double hashing (Zhang et al., 2020), which reduces memory usage with two hash functions. Learning-to-hash methods (Wang et al., 2017) train ML-based hash functions to preserve similarity.
- Approaches to impression skew often involve contrastive learning or clustering (Yao et al., 2021; Chang et al., 2024), which the authors view as complementary.
- Stable Embedding Representation:
- The concept of a stable ID is inspired by tokenization techniques in Natural Language Processing (NLP) (Sennrich, 2015; Kudo, 2018; Devlin, 2018), where a fixed vocabulary of tokens represents text.
- In recommendation, prior work proposed vector-quantizing item content embeddings to learn transferable sequential recommenders, and a masked vector-quantizer to transfer collaborative filtering representations to generative recommenders.
- Semantic ID itself was introduced concurrently in Singh et al. (2023) and Rajput et al. (2024), building on RQ-VAE for quantization and demonstrating benefits in generalization and sequential recommendation. This paper adapts Semantic ID as its stable ID method.
3.3. Technological Evolution
The evolution of item representation in recommendation systems has moved from simple one-hot encodings to trained embeddings, then to techniques for managing high cardinality (like random hashing), and more recently, to methods aiming for stable and semantically meaningful representations.
Initially, each unique item was represented by a discrete ID. As item catalogs grew, one-hot encoding became impractical due to sparsity and dimensionality. Embedding layers emerged as a powerful solution, mapping IDs to dense, continuous vectors. However, the sheer scale of items in industrial settings (billions of unique IDs) still challenged embedding table sizes and training efficiency.
This led to techniques like random hashing, which aggressively reduced the embedding table size by allowing multiple IDs to share an embedding slot. While memory-efficient, this sacrificed semantic integrity and introduced instability.
The current paper fits into the next phase of this evolution, which seeks to combine the efficiency of hashing (or quantization) with semantic coherence and stability. By leveraging content understanding models and vector quantization (specifically RQ-VAE), the paper introduces Semantic ID to create a fixed, semantically meaningful, and hierarchical ID space. This stable ID space mitigates the drawbacks of random hashing by ensuring that collisions are semantically relevant, rather than random. It represents a shift from purely ID-based representations to content-aware, semantically anchored representations that can better handle ID drifting and impression skew.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Semantic vs. Random Collisions: The most significant differentiation is Semantic ID's approach to collisions. Unlike random hashing (Weinberger et al., 2009; Zhang et al., 2020), which causes arbitrary collisions, Semantic ID (via RQ-VAE) ensures that items sharing an embedding (or a prefix) are semantically similar. This transforms a drawback (collisions) into an advantage (knowledge sharing).
- Fixed, Hierarchical ID Space: Instead of a dynamically changing raw ID space, Semantic ID constructs a fixed ID space (a vocabulary of semantic codes) that has intrinsic semantic meaning and a hierarchical structure. This directly addresses ID drifting and embedding instability.
- Novel Token Parameterization (prefix-ngram): The paper's primary innovation is Semantic ID prefix-ngram. While previous Semantic ID works (Singh et al., 2023; Rajput et al., 2024) introduced the concept, this paper proposes prefix-ngram as an improved parameterization. It explicitly leverages the multi-granularity (coarse-to-fine) nature of RQ-VAE codes, allowing the model to learn representations at different levels of semantic abstraction, which the Trigram, Fourgram, or All-bigrams parameterizations do not fully capture. This leads to more effective knowledge sharing and better performance.
- Holistic Approach to Stability: Instead of tackling high cardinality or impression skew in isolation (e.g., through contrastive learning or dynamic memory management), Semantic ID provides a holistic "stable ID space" solution that inherently mitigates issues arising from all three challenges (cardinality, skew, drifting).
- Enhanced Contextualization in User History: The paper specifically highlights the outsized gains when Semantic ID is used in attention-based user history models, demonstrating that semantically stable representations improve the model's ability to contextualize and aggregate past interactions.
4. Methodology
The core of this paper's methodology revolves around the Semantic ID concept and its novel prefix-ngram token parameterization. The aim is to create stable and semantically meaningful item representations that can overcome the challenges of high cardinality, impression skew, and ID drifting in large-scale recommendation systems.
4.1. Principles
The fundamental principle behind Semantic ID is to represent items not by their arbitrary raw IDs, but by discrete codes derived from their semantic content. The intuition is that if items share similar content (e.g., two ads for pizza), they should share similar representations, even if they have different raw IDs or are new to the system. This allows for:
- Knowledge Sharing: Semantically similar items can contribute to learning a shared representation, benefiting tail items with sparse data.
- Stability: The semantic categories of items tend to be more stable over time than their individual raw IDs. Thus, representations based on semantics will drift less.
- Efficiency: Discretizing the semantic space into a fixed set of codes provides a manageable ID space, addressing high cardinality without random collisions.

This is achieved by a two-stage process:

- Content Understanding: First, items are processed by a content understanding model (e.g., a multimodal image and text foundation model) to generate dense content embeddings. These embeddings capture the semantic meaning of the item's content.
- Vector Quantization (RQ-VAE): Second, these continuous content embeddings are quantized into a sequence of discrete codes using an RQ-VAE model. This sequence of codes forms the item's Semantic ID. The RQ-VAE is specifically chosen for its ability to produce hierarchical clusters, where earlier codes represent broader semantic categories and later codes represent finer details.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Ranking Model Overview
The recommendation problem is framed as a classification task. The model predicts a binary label (interaction or conversion) given user- and item-side features. The architecture is based on the Deep Learning Recommendation Model (DLRM) (Covington et al., 2016; Naumov et al., 2019), comprising three main sections:
- Information Aggregation Section: Processes sparse (categorical), dense, and user history-based features independently. Each module outputs a list of embedding vectors.
- Interaction Layer: Concatenates the embedding lists and performs dot products (or higher-order interactions) between all pairs of vectors.
- MLP and Sigmoid: Transforms the output of the interaction layer via an MLP to produce a logit score, followed by a sigmoid function to output a probability. The model is trained using cross-entropy loss.

The paper focuses on the information aggregation section.
Embedding Module
For categorical features (like item IDs), an embedding table stores the vector representations.
Let $N$ be the total number of raw IDs in the system and $[1..N]$ denote the integers from 1 to $N$.
The embedding table is a matrix $E \in \mathbb{R}^{M \times D}$, where $D$ is the embedding dimension and $M$ is the total number of embeddings (rows) in the table.
An embedding lookup function $h$ maps a raw ID to a set of embedding table row indices.
For each raw ID $x \in [1..N]$, the sparse module looks up $m$ embedding rows and sum-pools them to produce a single output embedding: $e_x = \sum_{j=1}^{m} E_{h(x)_j}$.
- $e_x$: The resulting aggregated embedding for raw ID $x$.
- $m$: The number of embedding table rows looked up for a single raw ID. In Semantic ID, this corresponds to the number of tokens (or n-grams) derived from the Semantic ID.
- $E_{h(x)_j}$: The $j$-th embedding vector looked up from the table, at index $h(x)_j$.
Sparse Module
A sparse feature is a set of raw IDs $\{x_1, \ldots, x_k\} \subseteq [1..N]$. For example, this could be multiple product category IDs for an item. A single embedding is produced by sum-pooling the embeddings of each constituent raw ID.
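To illustrate the lookup-and-sum-pooling behavior described in the Embedding and Sparse Module paragraphs, here is a small NumPy sketch; the table sizes, the modulo hash, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

D, M = 16, 1_000_000                              # embedding dim, table rows (illustrative)
rng = np.random.default_rng(0)
table = rng.normal(scale=0.01, size=(M, D))       # the embedding table E

def embed_tokens(token_ids):
    """Sum-pool the rows for all tokens derived from one raw ID (e.g., prefix-ngram tokens)."""
    rows = [table[t % M] for t in token_ids]      # modulo hash keeps indices inside the table
    return np.sum(rows, axis=0)

def embed_sparse_feature(raw_id_token_lists):
    """A sparse feature is a set of raw IDs; sum-pool the per-ID embeddings."""
    return np.sum([embed_tokens(tokens) for tokens in raw_id_token_lists], axis=0)

# Two raw IDs, each already mapped to its list of Semantic ID tokens (hypothetical values).
print(embed_sparse_feature([[2, 12, 55], [7, 31, 140]]).shape)  # (16,)
```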
User History Module
This module models a user's item interaction history as a sequence of sparse features along with interaction timestamps. The module is designed to contextualize this sequence.
First, each sparse feature in the history is embedded using the sparse module described above, and a learned timestamp embedding is added; the sum is denoted $z_i$ for the $i$-th history item.
The resulting sequence of embeddings is $Z = (z_1, \ldots, z_T)$, where $T$ is the sequence length. This sequence is then contextualized by one of three aggregation modules: Bypass, Transformer, or Pooled Multihead Attention (PMA).
The architectures for these aggregation modules are defined in Appendix A and are critical for understanding user history modeling:
- Bypass: This is the simplest module. It applies a linear transformation to each embedding in the sequence independently: $\mathrm{Bypass}(Z) = ZW$.
  - $Z \in \mathbb{R}^{T \times D}$: The input sequence of embeddings, where $T$ is the sequence length and $D$ is the embedding dimension.
  - $W$: A learnable weight matrix.
  The Bypass module processes each item in the history without considering its context from other items in the sequence.
- Transformer: This module applies a Transformer layer to the embedding sequence, incorporating self-attention to contextualize each item based on all other items in the sequence. The core Attention submodule is $ \mathrm{Attention}(Z) = \mathrm{softmax}\left(\frac{(ZW_Q)(ZW_K)^T}{\sqrt{d}}\right) ZW_V $.
  - $Z$: The input embedding sequence.
  - $W_Q, W_K, W_V$: Learnable weight matrices for the query, key, and value transformations, respectively.
  - $d$: The dimension of the query/key/value vectors.
  - $\mathrm{softmax}$: Normalizes the attention scores.
  - The term $ZW_Q$ generates queries, $ZW_K$ generates keys, and $ZW_V$ generates values.
  The full Transformer module then consists of an attention layer followed by an MLP, with LayerNorm and residual connections:
  - $H = \mathrm{LayerNorm}(Z + \mathrm{Attention}(Z))$: Output after the attention sub-layer with residual connection.
  - $\mathrm{LayerNorm}(H + \mathrm{MLP}(H))$: Final output after the position-wise MLP sub-layer with residual connection.
  - $\mathrm{LayerNorm}$: Normalizes activations across the feature dimension.
  - $\mathrm{MLP}$: A multi-layer perceptron.
  Standard positional embeddings are added to the encoding before applying Transformer modules to incorporate sequence order information.
- Pooled Multihead Attention (PMA): A variant of Transformer where the attention query vectors are replaced by a fixed set of learnable seed vectors, allowing a fixed-size summary of the sequence: $ \mathrm{PMAttention}(Z) = \mathrm{softmax}\left(\frac{(SW_Q)(ZW_K)^T}{\sqrt{d}}\right) ZW_V $.
  - $S$: A matrix comprised of learnable query vectors (seeds); the experiments use a small fixed number of seeds.
  - Other symbols are as defined for the Transformer's Attention module. The PMA module is formed using the same equations as the Transformer module (Equations 7 and 8 in the paper, corresponding to the two LayerNorm equations above), but with PMAttention replacing Attention.
4.2.2. Semantic ID Learning (RQ-VAE)
Semantic IDs are learned for items in two stages:
- Content Embeddings: A content understanding model (e.g., a multimodal image and text foundation model pre-trained on large datasets) processes item content (text, image, video) to produce dense content embeddings.
- Vector Quantization with RQ-VAE: An RQ-VAE (Residual Quantized Variational Autoencoder) is trained on these content embeddings.
  - Encoder: Maps the continuous content embedding $x$ to a continuous latent representation $z$.
  - Residual Quantizer: Quantizes $z$ into a sequence of discrete codes $(c_1, \ldots, c_L)$, where $L$ is the number of layers (the length of the code sequence) and $K$ is the codebook size (the number of clusters at each layer). The quantization is hierarchical: each layer $l$ has its own codebook containing $K$ vectors $e^{(l)}_1, \ldots, e^{(l)}_K$. The code at layer $l$ is chosen by finding the codebook vector that best approximates the residual of $z$ after subtracting the codebook vectors chosen at layers $l-1$ down to 1. The residual for layer $l$ is defined as $ r_l = z - \sum_{t=1}^{l-1} e^{(t)}_{c_t} $.
    - $r_l$: The residual vector at layer $l$.
    - $z$: The continuous latent representation from the encoder.
    - $\sum_{t=1}^{l-1} e^{(t)}_{c_t}$: The sum of codebook vectors chosen by previous layers.
    The code for layer $l$ is then selected as the index of the codebook vector that is closest to this residual: $ c_l = \arg\min_{k \in [1..K]} \lVert e^{(l)}_k - r_l \rVert^2 $.
    - $c_l$: The discrete code chosen for layer $l$.
    - $\arg\min_k$: Finds the index $k$ that minimizes the following expression.
    - $\lVert e^{(l)}_k - r_l \rVert^2$: The (squared) Euclidean distance between a codebook vector from the $l$-th codebook and the residual $r_l$. This process is illustrated in Figure 1, showing how the residual quantizers progressively refine the representation.
Figure 1: The RQ-VAE model, showing the encoder, the stack of residual quantizers, and the decoder that reconstructs the input content embedding.
  - Decoder: Reconstructs the original content embedding $x$ from the sequence of discrete codes $(c_1, \ldots, c_L)$.

The RQ-VAE is trained with two loss terms, a reconstruction loss and a codebook loss: $ \mathcal{L}(x) = \lVert x - \hat{x} \rVert^2 + \sum_{l=1}^{L} \left( \lVert \mathrm{sg}[r_l] - e^{(l)}_{c_l} \rVert^2 + \beta \lVert r_l - \mathrm{sg}[e^{(l)}_{c_l}] \rVert^2 \right) $.
- $\mathcal{L}(x)$: The total loss for an input content embedding $x$.
- $\lVert x - \hat{x} \rVert^2$: The reconstruction loss, which measures how well the decoder can reconstruct the original content embedding from the quantized codes.
- The per-layer terms are the codebook loss terms, which encourage the residuals $r_l$ to be close to their chosen codebook vectors $e^{(l)}_{c_l}$.
- $\mathrm{sg}[\cdot]$: The stop-gradient operator, which blocks gradients through the term it encloses. The term $\lVert \mathrm{sg}[r_l] - e^{(l)}_{c_l} \rVert^2$ therefore updates the codebook vectors toward the (fixed) residuals, while $\beta \lVert r_l - \mathrm{sg}[e^{(l)}_{c_l}] \rVert^2$ updates the encoder toward the (fixed) codebook vectors. This dual optimization updates both the encoder and the codebook.
- $\beta$: A hyperparameter, set to 0.5 in the experiments.

A Semantic ID is defined as the sequence of discrete codes $(c_1, \ldots, c_L)$ produced by the encoder and residual quantizer. The hierarchical nature means $c_1$ represents the broadest category (e.g., "food ads"), $(c_1, c_2)$ refines it (e.g., "pizza ads"), and the full sequence $(c_1, \ldots, c_L)$ offers the finest detail (e.g., "pizza ads in English").
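The residual-quantization encoding step described above can be summarized with a short NumPy sketch. This only shows greedy code selection against fixed codebooks; in the actual RQ-VAE the codebooks, encoder, and decoder are trained jointly with the loss above, and all sizes below are illustrative.

```python
import numpy as np

def rq_encode(z, codebooks):
    """Greedy residual quantization: return the Semantic ID (c_1, ..., c_L).

    z: latent vector from the encoder, shape (d,).
    codebooks: list of L arrays, each of shape (K, d), one codebook per layer.
    """
    codes = []
    residual = z.copy()
    for codebook in codebooks:
        # Pick the codebook vector closest (in Euclidean distance) to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        c = int(np.argmin(distances))
        codes.append(c)
        # Subtract the chosen vector; the next layer quantizes what is left over.
        residual = residual - codebook[c]
    return codes

rng = np.random.default_rng(0)
d, K, L = 16, 8, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]
z = rng.normal(size=d)
print(rq_encode(z, codebooks))  # a depth-3 Semantic ID, one code per layer
```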
4.2.3. Token Parameterization
After obtaining the Semantic ID sequence $(c_1, \ldots, c_L)$ for a raw item ID, the next step is to map this sequence of codes to embedding table rows. This is done via a token parameterization function $p$, which determines how the Semantic ID is represented in the embedding table. The choice of parameterization is crucial because it controls the amount and structure of information the recommendation model receives. Since the full fine-grained tuple (all $L$ codes) has extremely high cardinality ($K^L$ possible combinations), there is a tradeoff between information granularity and embedding table size.
The paper defines several possible token parameterization techniques in Table 1:
The following are the results from Table 1 of the original paper:
| Token Param | p(c1, ..., cL; K) |
|---|---|
| Trigram | [K²c1 + Kc2 + c3] |
| Fourgram | [K³c1 + K²c2 + Kc3 + c4] |
| All bigrams | [K² × (i − 1) + Kci + ci+1, for i in [1..L−1]] |
| Prefix-ngram | [∑t=1..i K^(i−t)(ct + 1) − 1, for i in [1..n]] |
Let's break down these parameterizations:
- Trigram / Fourgram: These parameterizations treat a fixed sequence of codes (e.g., (c1, c2, c3) for Trigram) as a single, unique identifier. They combine the codes into a single integer index using a base-K representation, effectively creating a "flat" ID that represents a specific combination of codes. For Trigram, the index K²c1 + Kc2 + c3 means that c1 (the coarsest code) contributes most significantly, followed by c2 and then c3, mapping a specific tuple (c1, c2, c3) to a unique embedding ID. Fourgram extends this to four codes: K³c1 + K²c2 + Kc3 + c4.
- All bigrams: This parameterization generates multiple IDs for an item, each representing a bigram (a pair of consecutive codes) from the Semantic ID sequence. It generates separate IDs for (c1, c2), (c2, c3), and so on up to (c_{L−1}, c_L). The K² × (i − 1) term provides a shifting factor so that bigrams from different positions in the sequence (e.g., (c1, c2) vs. (c2, c3)) do not collide if they happen to have the same code values. This allows the model to learn representations for pairs of codes at different hierarchical levels.
- Prefix-ngram: This is the novel and most effective parameterization proposed in the paper. It generates n IDs for an item, where each ID represents a prefix of the Semantic ID sequence, increasing in granularity (see the code sketch below).
  - n: The maximum length of the prefix (e.g., for Prefix-3gram, n = 3).
  - For each i from 1 to n, an ID is generated for the prefix (c1, ..., ci).
  - The sum ∑t=1..i K^(i−t)(ct + 1) − 1 effectively calculates a unique integer from the codes in the prefix (c1, ..., ci); the +1 and −1 terms account for 0-indexed vs. 1-indexed codes.
  - This means Prefix-ngram generates IDs for c1, (c1, c2), (c1, c2, c3), and so on up to (c1, ..., cn). This allows the model to capture information at various levels of granularity, from very coarse (e.g., c1) to more fine-grained (e.g., (c1, ..., cn)). This is crucial for leveraging the hierarchical nature of RQ-VAE clusters.

When the Semantic ID cardinality exceeds the embedding table size, a modulo hash function is applied. For multiple IDs (as in All bigrams or Prefix-ngram), a shifting factor is added to prevent collisions between IDs from different positions.
The paper's experiments (Table 2) confirm that Prefix-ngram is the best parameterization, highlighting the importance of incorporating the hierarchical clustering information. Increasing the depth () of Prefix-ngram and the RQ-VAE cardinality ( and ) both further improve performance.
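As an illustration of the Prefix-ngram formula from Table 1, the following Python sketch maps a Semantic ID to one token ID per prefix depth, assuming 0-indexed codes; the shift-and-modulo step for fitting IDs into a finite embedding table is a simplified stand-in for the paper's hashing scheme, not its exact implementation.

```python
def prefix_ngram_ids(codes, K, n, table_size=None):
    """Map a Semantic ID (c1, ..., cL) to n token IDs, one per prefix depth.

    codes: list of 0-indexed RQ-VAE codes, each in [0, K).
    K: codebook size per layer; n: maximum prefix depth.
    table_size: optional embedding table size; if given, IDs are modulo-hashed
    with a per-depth shift so prefixes of different depths land in different regions.
    """
    ids = []
    for i in range(1, n + 1):
        prefix = codes[:i]
        # Base-K encoding of the prefix, matching sum_{t=1..i} K^(i-t) * (c_t + 1) - 1.
        token = sum(K ** (i - t) * (c + 1) for t, c in enumerate(prefix, start=1)) - 1
        if table_size is not None:
            token = (token + i * table_size // (n + 1)) % table_size  # illustrative shift + modulo hash
        ids.append(token)
    return ids

# Example: a depth-3 Semantic ID with codebook size 4.
print(prefix_ngram_ids([2, 0, 3], K=4, n=3))  # -> [2, 12, 55]
```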
4.2.4. Item Impression Distribution Issues (Addressed by Semantic ID)
The paper elaborates on three key data distribution issues and how Semantic ID addresses them:
- Item Cardinality: The number of distinct items is much larger than the feasible embedding table size. Random hashing causes random collisions.
  - Semantic ID Solution: Semantic ID creates a fixed and manageable ID space through RQ-VAE. Instead of random collisions, items sharing a Semantic ID (or prefix) are semantically similar. This ensures that when collisions occur (by design or by modulo hashing), they are meaningful, facilitating knowledge sharing.
- Impression Skew: A small percentage of "head" items dominates impressions, leaving "tail" items with few examples (Figure 2). Random hashing doesn't allow effective knowledge sharing.

  Figure 2: Impression skew, shown as cumulative impression share as a function of the share of items considered. As items are sorted by impression count, one sees that the majority of impressions comes from a small fraction of the most popular items.

  - Semantic ID Solution: If a tail item has similar content to a head or torso item, their Semantic IDs will match or share a prefix. This allows the tail item to "inherit" knowledge from the more popular, semantically similar item, improving its representation learning despite sparse individual data. The paper suggests that the Semantic ID space exhibits less skew (Appendix B, Figure 6).
- ID Drifting: The item ID space is highly dynamic, with constant entry and exit of items (Figure 3). Random hashing leads to embedding representation drift.

  Figure 3: ID drift, shown as the share of items that remain active in the initial corpus as a function of time. Half of the original corpus exits the system after 6 days. An equal number of new items enters the system, creating a severe item distribution drift.

  - Semantic ID Solution: When an old ad retires and a new, semantically similar ad enters, their Semantic IDs (or prefixes) will likely match. This ensures temporal stability of semantic concepts, leading to stable Semantic ID encodings. The embedding weights for new items can leverage pre-existing knowledge from semantically similar items, rather than being learned from scratch or being randomly assigned.
5. Experimental Setup
5.1. Datasets
The experiments were conducted using production data from Meta's ads ranking system.
- Data Source: Production user interaction data from Meta's ads ranking platform.
- Scale: Training data spans a four-day time period, processed sequentially for a single epoch. Evaluation is performed on the first six hours of the next day's data.
- Characteristics: The data exhibits the discussed challenges:
- High Item Cardinality: Over one billion items in the user history module.
- Impression Skew: A small percentage of items (0.1% head) accounts for 25% of impressions, while 94.4% (tail) accounts for the remaining 25%.
- ID Drifting: A significant portion of items (half) exits the system within 6 days, with new items constantly entering.
- Content Embeddings: The item content embeddings for Semantic ID generation are obtained from a multimodal image and text foundation model. This model is pre-trained on the public CC100 dataset (Conneau, 2019) and then fine-tuned on internal ads datasets.
- RQ-VAE Training Data: The RQ-VAE itself is trained on the content embeddings of all target items from the past three months.
- Choice of Datasets: These datasets are representative of real-world, large-scale industrial recommendation scenarios, making them highly effective for validating the method's performance under practical constraints and challenges. While the paper doesn't provide specific content examples, it implies that items are advertisements with associated text, images, or videos.
5.2. Evaluation Metrics
5.2.1. Normalized Entropy (NE)
The primary offline model performance metric is Normalized Entropy (NE).
- Conceptual Definition: Normalized Entropy measures the predictive quality of a classification model relative to a baseline predictor that always predicts the mean frequency of positive labels. A lower NE indicates better model performance, as it means the model's cross-entropy loss is lower than that of the naive mean-frequency predictor. It essentially normalizes the cross-entropy to provide a more interpretable score.
- Mathematical Formula: $ \mathrm{NE} = \frac{-\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log p_i + (1-y_i)\log(1-p_i)\right)}{-\left(\bar{p}\log\bar{p} + (1-\bar{p})\log(1-\bar{p})\right)} $
- Symbol Explanation:
  - $\mathrm{NE}$: Normalized Entropy.
  - $n$: The total number of training examples.
  - $y_i$: The true binary label for example $i$ (0 for no interaction, 1 for interaction).
  - $p_i$: The model's predicted probability of a positive label for example $i$.
  - $\bar{p}$: The overall mean frequency of positive labels in the dataset, calculated as $\bar{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
  - $\log$: Natural logarithm.

The numerator is the cross-entropy loss of the model, and the denominator is the cross-entropy of a baseline model that always predicts the overall mean positive rate $\bar{p}$.
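The NE computation can be expressed in a few lines of Python; this sketch follows the formula above and is not taken from the paper's codebase.

```python
import numpy as np

def normalized_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy of the model divided by the cross-entropy of a
    constant predictor that always outputs the mean positive rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    model_ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    p = np.clip(y_true.mean(), eps, 1 - eps)         # mean frequency of positive labels
    baseline_ce = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return model_ce / baseline_ce

# Toy example: NE < 1 means the model beats the mean-rate baseline.
print(normalized_entropy([0, 1, 0, 0, 1], [0.1, 0.8, 0.2, 0.05, 0.6]))
```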
5.2.2. Attention Score-based Evaluation Metrics
For user history modeling, four metrics are computed on the attention scores $A$ from the PMA and Transformer aggregation modules, where $A_{i,j}$ denotes the attention score from target token $i$ to source token $j$:
- First source token attention: $\frac{1}{T}\sum_{i=1}^{T} A_{i,1}$, the average weight placed on the very first token in the source sequence.
  - $T$: Target sequence length.
  - $A_{i,1}$: Attention score from target token $i$ to the first source token.
  - This metric indicates whether the model disproportionately focuses on the initial item in the history.
- Padding token attention: $\frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{S} A_{i,j}\,\mathbb{1}[j \text{ is padding}]$, the average attention given to padding tokens (placeholders used to make sequences of equal length).
  - $S$: Source sequence length.
  - $\mathbb{1}[\cdot]$: Indicator function, which is 1 if source token $j$ is a padding token and 0 otherwise.
  - This metric reflects how efficiently the model ignores irrelevant padding tokens. Lower values are desirable.
- Entropy: $-\frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{S} A_{i,j}\log A_{i,j}$, the diversity or "diffuseness" of the attention distribution for each target token, averaged over all target tokens.
  - Higher entropy means attention is more spread out; lower entropy means it is more focused on specific tokens. Lower entropy is often desired for more decisive attention.
- Token self-attention: For Transformer models, $\frac{1}{T}\sum_{i=1}^{T} A_{i,i}$, which measures how much a token attends to itself.
  - $A_{i,i}$: Attention score from target token $i$ to source token $i$.
  - This indicates whether a token primarily relies on its own representation or seeks context from other tokens.
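A small NumPy sketch of these four summaries, assuming a row-stochastic (targets × sources) attention matrix; the exact averaging conventions used in the paper may differ.

```python
import numpy as np

def attention_metrics(A, pad_mask):
    """Summarize an attention matrix A (targets x sources).

    A: array of shape (T, S); each row sums to 1.
    pad_mask: boolean array of shape (S,), True where the source token is padding.
    """
    T, S = A.shape
    first = A[:, 0].mean()                                   # first source token attention
    pad = (A * pad_mask[None, :]).sum(axis=1).mean()         # padding token attention
    entropy = (-(A * np.log(A + 1e-12)).sum(axis=1)).mean()  # attention entropy
    self_attn = np.trace(A[:, :T]) / T if S >= T else None   # token self-attention (square case)
    return first, pad, entropy, self_attn

A = np.full((4, 6), 1 / 6)                  # uniform attention over 6 source tokens
pad_mask = np.array([False] * 4 + [True] * 2)
print(attention_metrics(A, pad_mask))
```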
5.2.3. Online Metrics
- Online Performance Gain: Measured as a percentage gain on a top-line online metric (e.g., Click-Through Rate (CTR), Conversion Rate) in live A/B tests. A 0.15% gain is considered significant in Meta's highly optimized system.
- Click Loss Rate: Used to measure the correlation between semantic similarity and prediction similarity in online A/B tests; it captures the relative drop in CTR when a recommended item is swapped for a semantically similar one, i.e., roughly $(\mathrm{CTR}(R) - \mathrm{CTR}(R')) / \mathrm{CTR}(R)$.
  - $\mathrm{CTR}(R)$: Click-Through Rate for the original set of recommended items $R$.
  - $\mathrm{CTR}(R')$: Click-Through Rate for the mutated set of items $R'$, where an item in $R$ is swapped with a semantically similar item (same Semantic ID prefix).
  - A smaller Click Loss Rate (ideally close to 0) indicates that replacing an item with a semantically similar one does not significantly impact user clicks, implying that prediction similarity aligns with semantic similarity.
- A/A Prediction Difference (AAR): Measures the relative difference in predictions for an identical item and its copy (an A/A pair), of the form $|f(a) - f(a')| / (f(a) + \epsilon)$; it is used to quantify prediction variance.
  - $(a, a')$: An A/A pair, representing an original item and its exact copy (with a different raw ID).
  - $f(a), f(a')$: The ranking model's predictions (e.g., probability of click) for items $a$ and $a'$, respectively.
  - $\epsilon$: A small constant to prevent division by zero.
  - Lower AAR values indicate less prediction variance between identical items, which is desirable for stability and advertiser trust.
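For concreteness, here is a short Python sketch of the two relative metrics as described above; the exact denominators are not spelled out in this summary, so these formulas are plausible forms rather than the paper's definitions.

```python
def aa_prediction_difference(f_a, f_a_copy, eps=1e-9):
    """Relative prediction gap for an A/A pair (one plausible form of the AAR metric)."""
    return abs(f_a - f_a_copy) / (f_a + eps)

def click_loss_rate(ctr_original, ctr_mutated):
    """Relative CTR drop when items are swapped with semantically similar ones."""
    return (ctr_original - ctr_mutated) / ctr_original

print(aa_prediction_difference(0.031, 0.027))   # ~0.129: identical ads scored noticeably differently
print(click_loss_rate(0.0150, 0.01497))         # ~0.002: negligible click loss
```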
5.3. Baselines
The paper compares Semantic ID (SemID) against two baseline item representation approaches:
- Individual Embeddings (IE):
  - Description: Each raw item ID is assigned its own unique row in the embedding table, i.e., $M = N$ (the embedding table size equals the total number of items). During evaluation, any ID not seen during training is mapped to a randomly initialized, untrained embedding.
  - Representativeness: While unrealistic for production-scale systems due to memory constraints, IE serves as an illustrative upper bound or ideal scenario for understanding item-specific representation quality, as it avoids any form of collision.
- Random Hashing (RH):
  - Description: Raw item IDs are randomly hashed to embedding table rows using a standard hash function (e.g., a modulo hash). This is used when the total number of items is much larger than the feasible embedding table size ($M \ll N$), creating random collisions in which multiple unrelated raw IDs share the same embedding.
  - Representativeness: RH is a popular and simple approach in industrial systems for managing high cardinality under system constraints. It serves as a strong, practical baseline that Semantic ID aims to outperform by resolving the random-collision issue.

For the offline experiments focusing on the target item sparse feature, the item embedding table size for IE is equal to the total number of items, while for RH and SemID it is set to a smaller size, resulting in an average collision factor of 3. User history features are mapped using random hashing in these baseline comparisons.
6. Results & Analysis
The offline experiments investigate the hypotheses regarding Semantic ID's advantages, using a simplified version of Meta's production ads ranking model.
6.1. Segment Analysis
To understand the impact of impression skew and ID drifting, items are segmented by their impression count during training (head, torso, tail) and whether they are new (cold start) items.
The following are the results from Table 3a of the original paper:
| Cum. Exs. | Item Percentile | Eval NE (RH) | Eval NE (IE) | Eval NE (SemID) | SemID NE Gain vs. RH | SemID NE Gain vs. IE |
|---|---|---|---|---|---|---|
| 25% (Head) | 0.1 | 0.80105 | 0.80101 | 0.80108 | 0.00% | 0.01% |
| 75% (Torso) | 5.6 | 0.83589 | 0.83583 | 0.83580 | -0.01% | -0.00% |
| 100% (Tail) | 100 | 0.83904 | 0.83886 | 0.83872 | -0.04% | -0.02% |
| Items Seen in Training | | 0.82626 | 0.82612 | 0.82600 | -0.03% | -0.02% |
| New Items | | 0.83524 | 0.83453 | 0.83180 | -0.41% | -0.33% |
| All Items | | 0.82663 | 0.82645 | 0.82621 | -0.05% | -0.03% |
Table 3a: Evaluation NE (lower is better). Semantic ID enables knowledge transfer to tail and new cold start items.
Analysis of Table 3a:
- Tail Items: Semantic ID (SemID) shows the most significant NE gain (improvement) for tail items (-0.04% vs. RH, -0.02% vs. IE). This confirms the hypothesis that SemID facilitates knowledge sharing, allowing items with few impressions to benefit from semantically similar, more popular items.
- New Items (Cold Start): SemID achieves substantial gains for new items (-0.41% vs. RH, -0.33% vs. IE). This is a critical finding, demonstrating that SemID effectively uses pre-trained weights from semantically similar items, avoiding the randomly initialized or non-relevant weights encountered by IE and RH, respectively, for cold-start items.
- Head Items: SemID is NE-neutral for head items (0.00% vs. RH, 0.01% vs. IE), which is expected as these items already have ample data for learning individual representations.
- Torso Items: SemID is slightly beneficial for torso items (-0.01% vs. RH, -0.00% vs. IE).
- Overall: SemID provides a modest 0.05% NE gain over RH and 0.03% over IE across all items, driven mostly by improvements in the tail and new-item segments. This indicates SemID isn't just better at clustering than RH but actively enables semantically driven knowledge transfer.

To measure the effect of embedding representation drifting due to ID drifting, the models are evaluated on different temporal segments of the training data. The metric is the difference in NE between an earlier period (42-48 hours prior to the end of training) and the latest period (last six hours of training). A smaller value indicates less impact from embedding representation shift.
The following are the results from Table 3b of the original paper:
| Cum. Exs. | RH | IE | SemID |
|---|---|---|---|
| 25% (Head) | 0.0057 | 0.0065 | 0.0059 |
| 75% (Torso) | 0.0087 | 0.0075 | 0.0076 |
| 100% (Tail) | 0.0128 | 0.0103 | 0.0106 |
| All Items | 0.0083 | 0.0074 | 0.0073 |
Table 3b: Sensitivity to distribution shift, measured as the NE gap between the earlier and latest evaluation periods. Lower is better.
Analysis of Table 3b:
- RH generally has higher NE gaps, especially for tail items (0.0128), indicating that its embedding representations suffer more from ID drifting over time. The model's ability to represent older items degrades as weights are updated for new items.
- IE shows a smaller performance gap, suggesting individual embeddings are more stable, as each item theoretically retains its distinct representation.
- SemID matches or slightly outperforms IE in terms of stability across all segments (0.0073 for all items vs. 0.0074 for IE). This supports the claim that SemID leads to more stable learned representations over time, even with a smaller embedding table size than IE.

The paper further investigates ID drifting by training models over a longer period (20 days vs. 4 days).
The following are the results from Table 4 of the original paper:
| | RH | Semantic ID |
|---|---|---|
| Eval NE Gain | -0.18% | -0.23% |
Table 4: NE improvement from training for 20 days of data instead of 4 days.
Analysis of Table 4:
SemID demonstrates better scalability with longer training data, achieving a -0.23% NE gain compared to -0.18% for RH. This supports the conjecture that improved representation stability allows SemID models to generalize better over extended training durations, where ID drifting is more pronounced.
6.2. Parameterization Analysis
The paper evaluates different token parameterization techniques for Semantic ID.
The following are the results from Table 2 of the original paper:
| RQ-VAE K × L | Token Parameterization | Train NE Gain |
|---|---|---|
| [2048] × 3 | Trigram | -0.028% |
| [2048] × 4 | Fourgram | -0.035% |
| [2048] × 4 | All bigrams | -0.091% |
| [512] × 3 | Prefix-3gram | -0.034% |
| [1024] × 3 | Prefix-3gram | -0.097% |
| [2048] × 3 | Prefix-3gram | -0.141% |
| [2048] × 5 | Prefix-5gram | -0.208% |
| [2048] × 6 | Prefix-6gram | -0.215% |
Table 2: NE performance for different tokenization parameterizations
Analysis of Table 2:
- Prefix-ngram Superiority: Prefix-ngram consistently outperforms the other parameterizations (Trigram, Fourgram, All bigrams). For example, Prefix-3gram with K = 2048 yields a -0.141% gain, significantly better than Trigram (-0.028%) or All bigrams (-0.091%) for similar RQ-VAE configurations. This strongly supports the idea that incorporating the hierarchical nature of RQ-VAE clusters (i.e., representing prefixes of codes at different granularities) is essential for effectively sharing knowledge and boosting performance.
- Depth Matters: Increasing the depth of Prefix-ngram (from Prefix-3gram to Prefix-5gram and Prefix-6gram with K = 2048) leads to improved NE performance (from -0.141% to -0.215%). Deeper prefixes capture more fine-grained semantic information, which contributes to better item differentiation and representation.
- RQ-VAE Cardinality: Increasing the RQ-VAE cardinality (either K or L) also improves NE. For Prefix-3gram, increasing K from 512 to 2048 (while keeping L = 3) improves the NE gain from -0.034% to -0.141%. A larger K (codebook size) and L (number of layers) allow for a richer and more precise semantic space.
6.3. Item Representation Space
This section examines the quality of the item embeddings themselves by comparing how Random Hashing (RH) and Semantic ID (SemID) partition the raw item ID corpus. The goal is to see if SemID creates more effective summaries of individual embeddings. IE embeddings are used as a reference to compute metrics for these partitions. The collision factor is set to 5, meaning clusters contain 5 items on average. SemID clusters can have variable sizes due to the nature of RQ-VAE.
The following are the results from Table 5 of the original paper:
| | Variance | Pairwise distance |
|---|---|---|
| Random Hashing | 1.52 × 10⁻³ (8.0 × 10⁻⁴) | 0.22 (0.04) |
| SemID (small) | 1.31 × 10⁻³ (1.0 × 10⁻³) | 0.24 (0.09) |
| SemID (top 1,000) | 1.23 × 10⁻³ (5.5 × 10⁻⁴) | 0.06 (0.02) |
Table 5: Intra- and inter-cluster variances and pairwise distances for random hashing and SemID-based partitions.
Analysis of Table 5:
- Intra-cluster Variance: Semantic ID partitions (both small and top 1,000 clusters) exhibit lower intra-cluster variance compared to Random Hashing. For example, SemID (top 1,000) has a variance of 1.23 × 10⁻³ compared to RH's 1.52 × 10⁻³. This means that items grouped together by SemID are more semantically homogeneous, leading to a more coherent summary embedding.
- Inter-cluster Pairwise Distance: The results are mixed:
  - SemID (small) clusters have a slightly higher pairwise distance (0.24) than RH (0.22), which is a good indication of distinct cluster representations.
  - However, SemID (top 1,000) clusters show a significantly lower pairwise distance (0.06). The authors hypothesize this is because RQ-VAE places multiple centroids in regions of highest data density to minimize overall model loss. While this might seem counter-intuitive (lower inter-cluster distance usually implies less distinct clusters), in the context of RQ-VAE and its residual quantization, it could mean that very popular, related clusters are finely differentiated within a dense semantic region.

Overall, the lower intra-cluster variance for SemID confirms that it creates more semantically coherent groups of items, which is crucial for effective knowledge sharing.
6.4. User History Modeling
This section explores the benefits of Semantic ID when used with different user history aggregation modules.
The following are the results from Table 6 of the original paper:
| | Train NE Gain | Eval NE Gain |
|---|---|---|
| Bypass | -0.056% | -0.085% |
| Transformer | -0.071% | -0.110% |
| PMA | -0.073% | -0.100% |
Table 6: Performance for three aggregation modules. Baseline: model with RH for each module. Semantic ID brings larger gains to the contextualizing modules.
Analysis of Table 6:
- Semantic ID consistently provides NE gains across all aggregation modules compared to a baseline using Random Hashing (RH).
- The gains are outsized for the contextualizing attention-based modules (Transformer and PMA) compared to Bypass. For example, Transformer shows an Eval NE Gain of -0.110% and PMA -0.100%, both larger than Bypass's -0.085%. This indicates that the stable and semantically meaningful Semantic ID representations significantly enhance the ability of attention mechanisms to contextualize user histories. PMA shows the best Train NE Gain (-0.073%).

To further understand this, attention score-based evaluation metrics are computed.
The following are the results from Table 7 of the original paper:
| | First | Pad | Entropy | Self |
|---|---|---|---|---|
| Transformer + RH | 0.030 | 0.460 | 2.149 | 0.052 |
| Transformer + SemID | 0.043 | 0.418 | 1.967 | 0.045 |
| PMA + RH | 0.071 | 0.351 | 3.075 | |
| PMA + SemID | 0.074 | 0.313 | 3.025 | |
Table 7: Attention score-based evaluation metrics for random hashing and SemID-based models for the user history item interaction features.
Analysis of Table 7:
- First source token attention: SemID-based models (both Transformer and PMA) show higher attention on the first source token (0.043 vs. 0.030 for Transformer, 0.074 vs. 0.071 for PMA). This suggests SemID helps the model place more weight on high-signal (e.g., most recent) tokens in the sequence.
- Padding token attention: SemID-based models show lower attention to padding tokens (0.418 vs. 0.460 for Transformer, 0.313 vs. 0.351 for PMA). This indicates that SemID representations make it easier for attention mechanisms to disregard irrelevant padded portions of the history.
- Entropy: SemID-based models exhibit lower entropy (1.967 vs. 2.149 for Transformer, 3.025 vs. 3.075 for PMA). Lower entropy means the attention distributions are less diffuse and more focused on relevant tokens, suggesting more decisive and meaningful contextualization.
- Token self-attention: For Transformer, SemID results in lower token self-attention (0.045 vs. 0.052). This implies that with SemID, the Transformer is better able to find useful contextual information from other tokens in the sequence, rather than primarily attending to itself.

These attention metrics confirm that Semantic ID representations are more stable and meaningful, enabling more effective and focused user history modeling by attention-based architectures.
7. Productionization
Semantic ID features have been successfully integrated into Meta's Ads Recommendation System for over a year, becoming top sparse features by importance.
7.1. Offline RQ-VAE Training
- Content Understanding (CU) Models: The RQ-VAE models are trained on embeddings generated by Content Understanding (CU) models. These CU models are initially pre-trained on the public CC100 dataset (Conneau, 2019) and then fine-tuned on Meta's internal ads datasets.
- RQ-VAE Training: RQ-VAE models are trained offline using ad IDs and their content embeddings sampled from the past three months of data.
- Production Configuration: For production, RQ-VAEs are configured with L = 6 quantization layers and a fixed codebook size K. Semantic ID utilizes the prefix-5gram parameterization (n = 5 in Prefix-ngram) from Section 4.2, with an embedding table size of 50 million rows.
- Deployment: A frozen (fixed) RQ-VAE checkpoint is used for online serving after training.
7.2. Online Semantic ID Serving System
The Semantic ID serving pipeline is illustrated in Figure 4.
Figure 4: The Semantic ID serving pipeline. New entities flow through content understanding and the RQ-VAE model to produce entity Semantic IDs, which are stored and then joined with user requests (user engagement features and target item features) to feed the ranking models.
Pipeline Steps:
- Ad Creation Time: When a new ad is created, its content information (text, image, video) is processed by the CU models.
- Semantic ID Generation: The output CU embeddings are then fed into the RQ-VAE model, which computes the Semantic ID signal (the sequence of discrete codes) for each raw ad ID.
- Data Storage: This Semantic ID signal is stored in the Entity Data Store, a centralized repository for item metadata.
- Feature Generation Stage: During feature generation, raw item IDs (for the target item) and user engagement raw ID histories are enriched by looking up their corresponding Semantic ID signals from the Entity Data Store to create semantic features.
- Serving Requests: When a user request for recommendations arrives, the precomputed semantic features (along with other features) are fetched.
- Ranking Models: These features are then passed to downstream ranking models to generate predictions and deliver ranked ads.

This real-time serving pipeline ensures that Semantic ID features are available for both target item features and user engagement history features during live inference.
7.3. Production Performance Improvement
The integration of Semantic ID features into Meta's flagship ads ranking model yielded significant performance gains.
- Feature Creation: Six sparse features and one sequential feature were created from different content embedding sources (text, image, video) using Semantic ID.
- Significance Threshold: In Meta ads ranking, an offline NE gain greater than 0.02% is considered significant.

The following are the results from Table 8 of the original paper:
| | Train NE Gain | Eval NE Gain |
|---|---|---|
| Baseline + 6 sparse features | -0.063% | -0.071% |
| Baseline + 1 sequential feature | -0.110% | -0.123% |
Table 8: NE improvement from incorporating Semantic ID features in the flagship Meta ads ranking model.
Analysis of Table 8:
- Adding 6 sparse Semantic ID features alone resulted in an Eval NE Gain of −0.071%.
- Adding 1 sequential Semantic ID feature (presumably for user history) yielded an even greater Eval NE Gain of −0.123%. Since lower NE is better, these negative gains are improvements (see the NE definition sketched below).
- Overall Online Gain: Across multiple ads ranking models, incorporating Semantic ID features led to a 0.15% gain in the top-line online metric. This is considered highly significant for a system as optimized and large-scale as Meta Ads, serving billions of users.
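For readers unfamiliar with the metric, the following sketch assumes NE is the standard Normalized Entropy used in ads click prediction (the model's cross-entropy normalized by the entropy of the average empirical CTR), with NE gain reported as the relative change versus the baseline; the paper's exact convention may differ slightly.

$$
\mathrm{NE} = \frac{-\tfrac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1-y_i)\log(1-p_i)\bigr]}{-\bigl[\bar{p}\log\bar{p} + (1-\bar{p})\log(1-\bar{p})\bigr]},
\qquad
\text{NE gain} = \frac{\mathrm{NE}_{\text{new}} - \mathrm{NE}_{\text{base}}}{\mathrm{NE}_{\text{base}}} \times 100\%,
$$

where $y_i \in \{0, 1\}$ are click labels, $p_i$ are predicted click probabilities, and $\bar{p}$ is the average empirical CTR. Under this convention, the negative gains in Table 8 correspond to lower NE, i.e., better predictions.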
7.4. Semantic and Prediction Similarity
The paper investigates whether semantic similarity (captured by Semantic ID) correlates with prediction similarity (user engagement patterns).
- Online A/B Test: An online A/B test was conducted. For 50% of users, a recommended item was randomly swapped with a different item that shared the same Semantic ID prefix.
- Metric: The Click Loss Rate was measured as the relative CTR drop caused by the swap,

  $$\text{Click Loss Rate} = \frac{\mathrm{CTR}(R) - \mathrm{CTR}(R')}{\mathrm{CTR}(R)},$$

  where:
  - $R$: original set of recommended items;
  - $R'$: mutated set, where an item is replaced by a semantically similar one (same Semantic ID prefix);
  - CTR: Click-Through Rate.

  A smaller Click Loss Rate indicates that swapping with a semantically similar item does not significantly harm CTR, implying that items with similar semantics also have similar user engagement predictions. (A sketch of the mutation step appears below.)
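Below is a minimal sketch of the mutation applied in the treatment arm, assuming items are grouped by their k-prefix Semantic ID and one slate position is swapped at random; function names and data layouts are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def build_prefix_index(semantic_ids: Dict[int, List[int]], k: int) -> Dict[Tuple[int, ...], List[int]]:
    """Group item IDs by their k-prefix Semantic ID."""
    index = defaultdict(list)
    for item_id, codes in semantic_ids.items():
        index[tuple(codes[:k])].append(item_id)
    return index

def mutate_slate(slate: Sequence[int],
                 semantic_ids: Dict[int, List[int]],
                 prefix_index: Dict[Tuple[int, ...], List[int]],
                 k: int,
                 rng: random.Random = random.Random(0)) -> List[int]:
    """Swap one recommended item for a different item sharing the same k-prefix."""
    mutated = list(slate)
    pos = rng.randrange(len(mutated))
    prefix = tuple(semantic_ids[mutated[pos]][:k])
    candidates = [i for i in prefix_index[prefix] if i != mutated[pos]]
    if candidates:
        mutated[pos] = rng.choice(candidates)
    return mutated

# Click Loss Rate is then (CTR(original) - CTR(mutated)) / CTR(original),
# aggregated over the exposed traffic.
```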
Figure 5 Click Loss Rate reduction from Semantic ID: the Click Loss Rate at each Semantic ID prefix depth, decreasing from 0-prefix to 3-prefix.
Analysis of Figure 5:
- The chart shows that as the Semantic ID depth (prefix length) increases, the Click Loss Rate monotonically decreases.
  - For example, using a 0-prefix (broadest semantic category, essentially no semantic constraint) results in the highest Click Loss Rate (around 0.15%).
  - As the prefix depth increases to 1-prefix, 2-prefix, and 3-prefix, the Click Loss Rate steadily drops, approaching 0% for 3-prefix.
- Conclusion: This demonstrates a strong correlation between semantic similarity (as defined by Semantic ID) and prediction similarity (user click behavior). Deeper Semantic ID prefixes capture finer-grained semantic details, which translates to even closer prediction behavior. This validates the representation space analysis from Section 6.3 and underscores the robustness of Semantic ID for ranking models.

The following are the results from Figure 6 of the original paper:
Figure 6 The 30-day click distribution in raw ID and Semantic ID spaces.
Analysis of Figure 6:
- This bar chart visually compares the click distribution for raw Ad IDs and Semantic IDs over a 30-day period.
- For raw Ad IDs, the click distribution is extremely skewed, with a very small number of IDs receiving a disproportionately high number of clicks and the vast majority receiving very few (a long tail).
- For Semantic IDs, while still exhibiting some skew, the distribution appears less extreme. The "head" is less dominant, and the "tail" is potentially smoother or better represented, indicating that Semantic ID helps to normalize the click distribution by grouping similar items. This aligns with the idea of reducing impression skew by enabling knowledge transfer.
7.5. A/A Variance
Random hashing introduces prediction variance for identical items. If an advertiser creates an exact copy of an ad with a different raw ID, random hashing might assign them to different embeddings, leading to different model predictions and delivery behaviors. This A/A variance (where "A/A" refers to identical items) is undesirable.
- Semantic ID Solution: Semantic ID mitigates A/A variance by ensuring that exact copies or very similar items will often have the same k-prefix Semantic ID, leading to identical or very similar embeddings.
- Measurement: An online shadow ads experiment was set up to measure the relative A/A prediction difference (AAR) for pairs of identical items, i.e., the absolute difference between the two model predictions $p_1$ and $p_2$ for the A/A pair, normalized by their magnitude (a sketch of this computation follows this list).
- Result: The production model with six Semantic ID sparse features achieved a 43% reduction in average AAR compared to the same model without these features.
- Implication: This significant reduction in A/A variance improves the robustness of ad ranking orders, enhances the system's ability to accurately target audiences, and, crucially, builds advertiser trust in the consistency of Meta's recommendation system. The authors believe the majority of this reduction comes from improved tail item modeling, where Semantic ID helps stabilize representations for less popular items.
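A minimal sketch of the AAR computation described above, assuming a symmetric relative difference between the two predictions of an A/A pair; the exact normalization used in the production shadow experiment is not specified here.

```python
from typing import Iterable, Tuple

def relative_aa_difference(p1: float, p2: float) -> float:
    """Relative prediction difference for one pair of identical (A/A) ads."""
    return abs(p1 - p2) / ((p1 + p2) / 2.0)

def average_aar(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average AAR over many A/A pairs collected from shadow traffic."""
    pairs = list(pairs)
    return sum(relative_aa_difference(p1, p2) for p1, p2 in pairs) / len(pairs)

# The reported result: a 43% reduction in average AAR once the six
# Semantic ID sparse features are added to the production model.
print(average_aar([(0.031, 0.029), (0.120, 0.118)]))
```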
8. Conclusion & Reflections
8.1. Conclusion Summary
This paper successfully demonstrates the utility of Semantic ID in creating a stable ID space for item representation within large-scale recommendation systems. It introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly enhances the performance of Semantic ID in ranking models. Through extensive offline experiments, the authors confirm that Semantic ID effectively mitigates the detrimental effects of embedding representation instability caused by item cardinality, impression skew, and ID drifting, outperforming both random hashing and individual embeddings baselines, particularly for tail items and new cold start items. The benefits extend to user history modeling, where Semantic ID leads to outsized gains in attention-based contextualizing modules. The paper also reports the successful productionization of Semantic ID features in Meta's ads recommendation system, achieving notable 0.15% online performance gains and significantly reducing downstream ad delivery variance (A/A variance) in live deployments, thereby improving prediction stability and advertiser trust.
8.2. Limitations & Future Work
The paper implicitly points to several limitations and areas for future work, primarily through the challenges it aims to solve and the improvements it highlights:
- Content Understanding Model Dependency: The quality of Semantic IDs is directly dependent on the underlying content understanding models (e.g., multimodal image and text foundation models). Improvements in these upstream models would likely further enhance Semantic ID effectiveness.
- RQ-VAE Configuration: The paper explores different RQ-VAE configurations (number of layers L and codebook size K) and prefix-ngram depths, but there may be further optimization opportunities in RQ-VAE architectures, training objectives, or adaptive quantization strategies.
- Generalizability beyond Ads Ranking: While validated in Meta's ads ranking, further research could explore its efficacy across diverse recommendation domains (e.g., e-commerce, content platforms with different item types and user behaviors).
- Dynamic Semantic ID Adaptation: While Semantic ID aims for stability, semantic concepts themselves can evolve over very long periods. Future work might explore adaptive RQ-VAE training or codebook updates to account for gradual shifts in item semantics.
- Trade-offs between Granularity and Cardinality: The choice of prefix-ngram depth and RQ-VAE cardinality involves a trade-off. Investigating methods to automatically determine optimal granularity given computational constraints could be a future direction.
8.3. Personal Insights & Critique
- Innovation of Semantic Collisions: The core innovation of transforming "random collisions" into "semantically meaningful collisions" is a powerful paradigm shift. Instead of fighting collisions, this work embraces and structures them to facilitate knowledge transfer. This is a very elegant solution to a long-standing problem in high-cardinality feature learning.
- Applicability to Other Domains: The Semantic ID approach, particularly the RQ-VAE for content-based quantization, is highly transferable. Any domain dealing with high-cardinality categorical features that also have rich content (e.g., e-commerce products, news articles, scientific papers, short-form videos) could benefit. For instance, in scientific paper recommendation, Semantic IDs could be derived from paper abstracts and titles, enabling better cold-start recommendations for new papers and improving recommendations for niche research areas.
- Addressing Trust Issues: The reduction in A/A variance is a crucial practical benefit, especially for advertising platforms. Ensuring consistent predictions for identical items builds trust with advertisers, which is often as important as raw performance gains in real-world systems. This highlights an often overlooked, yet critical, aspect of system reliability.
- Beyond Item Embeddings: The concept of a stable, semantically meaningful ID space could potentially be extended to other entities beyond items, such as user interests or query segments, to enhance stability and interpretability across various model components.
- Potential Issues/Assumptions:
  - Content Model Quality: The entire Semantic ID framework rests on the assumption that the upstream content understanding models provide high-quality, semantically meaningful embeddings. If the content models are biased or inaccurate, the Semantic IDs will inherit these flaws.
  - Semantic Drift of Clusters: While item raw IDs drift, the meaning of clusters can also slowly drift over time, especially in very dynamic environments (e.g., fashion trends, new product categories emerging rapidly). The paper states that broad semantic categories remain "temporally stable," which is generally true, but extreme long-term shifts could eventually necessitate RQ-VAE retraining or adaptive codebook management.
  - Computational Cost of RQ-VAE: Training RQ-VAE on billions of items (the past three months of data in production) and generating content embeddings from multimodal inputs can be computationally intensive, though this is managed offline. The trade-off between RQ-VAE complexity (L, K) and the resulting performance gains is well explored, but operational cost remains a factor.
  - Explainability: While Semantic ID improves interpretability by grouping similar items, the exact semantic meaning of a specific code sequence might still require human interpretation or auxiliary tools to fully explain why certain items are clustered together.