Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
TL;DR Summary
This paper introduces Semantic ID prefix ngram, a novel token parameterization technique that enhances embedding stability in recommendation systems by hierarchically clustering items based on their content, addressing key challenges like data pollution and performance degradation.
Abstract
The exponential growth of online content has posed significant challenges to ID-based models in industrial recommendation systems, ranging from extremely high cardinality and dynamically growing ID space, to highly skewed engagement distributions, to prediction instability as a result of natural ID life cycles (e.g., the birth of new IDs and retirement of old IDs). To address these issues, many systems rely on random hashing to handle the ID space and control the corresponding model parameters (i.e., the embedding table). However, this approach introduces data pollution from multiple IDs sharing the same embedding, leading to degraded model performance and embedding representation instability. This paper examines these challenges and introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. Semantic ID prefix ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings, as opposed to random assignments. Through extensive experimentation, we demonstrate that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We further highlight the advantages of Semantic ID prefix ngram in attention-based models that contextualize user histories, showing substantial performance improvements. We also report our experience of integrating Semantic ID into Meta production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID". It focuses on improving the stability and performance of item embeddings in large-scale recommendation systems using a novel approach called Semantic ID.
1.2. Authors
The paper lists numerous authors, primarily affiliated with AI at Meta, indicating a strong industry research background. Carolina Zheng is also affiliated with Columbia University and performed this work during a 2024 Internship at Meta. The extensive list of authors from Meta suggests a collaborative effort within a large research team, common for impactful industrial research.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, with the original source link being https://arxiv.org/abs/2504.02137v1. While arXiv is a reputable platform for sharing research quickly and openly, it hosts preprints that have not yet undergone formal peer review for a specific journal or conference. However, given the affiliations (Meta, Columbia University) and the detailed experimental results, it is likely intended for a top-tier machine learning or recommendation systems conference (e.g., KDD, WWW, RecSys, NeurIPS, ICML) or journal. The "Published at (UTC): 2025-04-02T21:28:38.000Z" timestamp indicates when the preprint was posted to arXiv.
1.4. Publication Year
The publication year is stated as 2025 (based on the Published at (UTC) timestamp).
1.5. Abstract
This paper addresses critical challenges in ID-based models within industrial recommendation systems, such as extremely high item cardinality, dynamic ID spaces, skewed engagement distributions, and prediction instability caused by the natural lifecycle of item IDs. Existing solutions like random hashing mitigate memory issues but introduce data pollution from collisions, degrading model performance and embedding representation stability.
The paper introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly enhances the original Semantic ID approach. Unlike random hashing which assigns IDs randomly, Semantic ID prefix-ngram creates semantically meaningful collisions by hierarchically clustering items based on their content embeddings. Through extensive experimentation, the authors demonstrate that this method not only resolves embedding instability but also substantially improves tail ID modeling, reduces overfitting, and mitigates representation shifts. The approach shows considerable performance improvements when integrated into attention-based models that contextualize user histories. The paper concludes by reporting successful integration of Semantic ID into Meta's production Ads Ranking system, yielding notable performance gains and enhanced prediction stability in live deployments.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2504.02137v1
- PDF Link: https://arxiv.org/pdf/2504.02137v1.pdf
- Publication Status: This paper is currently a preprint on arXiv, as indicated by the version tag and the platform. While widely accessible, it has not yet undergone formal peer review for a conference or journal; the Published at (UTC) date reflects when the preprint was posted.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper lies in the practical challenges of learning effective item embedding representations in large-scale, industrial recommendation systems. These systems often deal with billions of items, leading to several key data-related issues:
- Item Cardinality: The sheer number of distinct items (e.g., products, ads) makes it infeasible to assign a unique embedding to every single item due to memory and computational constraints.
- Impression Skew: A small fraction of "head" items receives the vast majority of user impressions and interactions, while a "long tail" of items has very few interactions. This makes it difficult to learn robust embeddings for tail items due to insufficient training data.
- ID Drifting: The item space is highly dynamic, with new items constantly entering and old items leaving the system. This "raw ID drifting" causes embedding representations to become unstable over time, as the meaning of an embedding might change as it is assigned to different items.

Traditional solutions, such as random hashing, map multiple raw item IDs to a shared embedding space to manage cardinality. However, this introduces random collisions, where semantically unrelated items might share the same embedding. This leads to:

- Degraded Model Performance: Contradictory gradient updates for randomly colliding items hinder learning.
- Embedding Representation Instability: The learned embeddings lack a stable semantic meaning, especially over long training periods or in dynamic item environments.
- Poor Tail Item Modeling: Random hashing doesn't facilitate knowledge sharing between popular and unpopular items, leaving tail items with sparse or unstable representations.

The paper's innovative idea is to move beyond random assignments and leverage semantic similarity for item representation. It explores Semantic ID, a recently proposed approach that derives item IDs from hierarchical clusters based on the semantic similarity of their content (text, image, video). The motivation is that a fixed, semantically meaningful ID space can inherently address the instability and knowledge-sharing issues that random hashing fails to resolve.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Empirical Understanding of Semantic ID Stability: Through experiments on a simplified Meta ads ranking model, the authors deepen the empirical understanding of how Semantic ID improves embedding representation stability compared to random hashing and individual embeddings.
- Novel Token Parameterization (Semantic ID prefix-ngram): They propose a new token parameterization technique, Semantic ID prefix-ngram, which significantly enhances the performance of the original Semantic ID. This method effectively incorporates the hierarchical nature of item clusters, allowing for better knowledge sharing.
- Characterization of Item Distribution Challenges: The paper clearly characterizes the challenges of item cardinality, impression skew, and ID drifting and explains their direct impact on embedding representation stability.
- Addressing Key Challenges:
  - Improved Tail ID Modeling: Semantic ID significantly benefits tail items and new cold-start items by enabling knowledge transfer from semantically similar, popular items.
  - Reduced Overfitting and Representation Shifts: The approach leads to more stable learned representations over time, making the model less sensitive to ID drifting and distribution shifts.
  - Outsized Gains in Contextualizing Models: Semantic ID provides substantial performance improvements when integrated into attention-based user history models (e.g., Transformer, Pooled Multihead Attention (PMA)) by enabling more focused and meaningful attention patterns.
- Successful Productionization at Meta: The authors describe the successful integration of Semantic ID prefix-ngram features into Meta's production Ads Ranking system.
  - Significant Online Performance Gains: Semantic ID features resulted in a notable 0.15% online performance gain on top-line metrics, considered highly significant for a highly optimized system.
  - Enhanced Prediction Stability: The approach significantly reduces A/A variance (prediction variance for identical items), leading to more robust ad ranking orders and improved advertiser trust.
  - Correlation of Semantic and Prediction Similarity: Experiments demonstrate that prediction similarity is correlated with semantic similarity, and deeper Semantic ID prefixes monotonically reduce the click loss rate.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several fundamental concepts in recommendation systems and deep learning.
- Recommendation Systems: Systems that suggest items (products, movies, ads, etc.) to users based on their preferences and past behavior. The goal is to predict which items a user will like or interact with.
- Item IDs: Unique identifiers assigned to each item in a recommendation system. These are typically categorical features.
- Embeddings (Embedding Vectors): Dense, low-dimensional vector representations of discrete (categorical) features, like item IDs, users, or categories. Instead of treating each ID as a distinct one-hot encoded vector (which would be very high-dimensional and sparse), embeddings map them to a continuous vector space where semantically similar items are closer together. These vectors are learned during model training.
  - Embedding Table: A matrix where each row corresponds to a unique embedding for a specific categorical feature (e.g., an item ID). When an item ID is input to the model, its corresponding embedding vector is looked up in this table.
- High Cardinality: A situation where a categorical feature (like item ID) can take on a very large number of distinct values (e.g., billions of unique items). This poses challenges for embedding tables, as they would become prohibitively large.
- Random Hashing: A technique used to manage high cardinality by mapping a large, sparse feature space into a smaller, dense space. For example, item IDs can be hashed to a fixed number of embedding slots. While this saves memory, it introduces "collisions" where different original IDs map to the same embedding. If hashing is random, these collisions are arbitrary, leading to data pollution.
- ID Drifting: In dynamic systems, old item IDs are constantly retired and new ones are introduced. This causes the distribution of active item IDs to shift over time. If embeddings are tied directly to these volatile raw IDs (especially with random hashing), the meaning of an embedding vector can "drift" as it might represent different sets of items over time.
- Impression Skew: The phenomenon where a small number of "head" items receive a disproportionately high number of impressions or interactions, while the vast majority of "tail" items receive very few. This makes it challenging to learn good embedding representations for tail items due to sparse data.
- Deep Learning Recommendation Model (DLRM): A widely deployed deep learning architecture for recommendation systems (Covington et al., 2016; Naumov et al., 2019). It typically consists of an embedding layer for categorical features, a dense layer for numerical features, an interaction layer to combine features, and MLPs (multi-layer perceptrons) for the final prediction.
- Vector Quantization (VQ): A technique that approximates continuous vectors with discrete "codebook" vectors. It's like clustering, where each continuous vector is assigned to the closest vector in a learned codebook. This effectively discretizes a continuous space into a finite set of codes.
- Residual Quantized Variational Autoencoder (RQ-VAE): An advanced vector quantization model (Zeghidour et al., 2021). RQ-VAE uses a stack of quantizers (a "residual" approach) to progressively refine the representation. Instead of quantizing a vector once, it quantizes the residual (the error) from the previous quantization layer. This creates a hierarchical sequence of discrete codes, where earlier codes capture coarser semantic categories and later codes capture finer details.
- Transformer: A neural network architecture (Vaswani et al., 2017) that relies heavily on self-attention mechanisms. It processes sequences by weighing the importance of different parts of the input sequence to each other.
  - Attention Mechanism: A core component in Transformers that allows the model to selectively focus on different parts of the input sequence when processing a particular element. It calculates a weighted sum of "value" vectors, where the weights are determined by the similarity between a "query" vector and "key" vectors. The standard Attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    - $Q$: Query matrix. Each row is a query vector representing an element in the target sequence.
    - $K$: Key matrix. Each row is a key vector representing an element in the source sequence.
    - $V$: Value matrix. Each row is a value vector associated with each element in the source sequence.
    - $d_k$: The dimension of the key vectors, used to scale the dot products to prevent vanishing gradients.
    - $\mathrm{softmax}$: A function that converts raw scores into probability distributions. This mechanism allows the model to "attend" to relevant parts of the input sequence.
  - Multihead Attention: An extension of Attention where the attention mechanism is run multiple times in parallel, each with different learned linear projections of the queries, keys, and values. The outputs are concatenated and linearly transformed. This allows the model to capture different types of relationships or aspects of the input.
- Pooled Multihead Attention (PMA): A variant of Multihead Attention (Lee et al., 2019) that uses a fixed set of learnable "seed" vectors as queries, rather than using the input sequence elements themselves as queries. This is particularly useful for aggregating information from a sequence into a fixed-size representation, making it suitable for summarizing user history sequences.
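To make the Attention and PMA definitions above concrete, here is a minimal NumPy sketch (not from the paper) of scaled dot-product attention; the only change for PMA-style pooling is that the queries come from a small set of learnable seed vectors instead of from the sequence itself. All array names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (num_queries, num_keys)
    return softmax(scores, axis=-1) @ V      # (num_queries, d_v)

rng = np.random.default_rng(0)
T, d = 8, 16                                 # history length, embedding dim (illustrative)
X = rng.normal(size=(T, d))                  # sequence of item embeddings

# Self-attention (Transformer-style): queries come from the sequence itself.
self_attn_out = attention(X, X, X)           # shape (T, d)

# PMA-style pooling: queries are k seed vectors, giving a fixed-size summary of the history.
k = 4
seeds = rng.normal(size=(k, d))              # stand-in for learned seed parameters
pma_out = attention(seeds, X, X)             # shape (k, d), independent of T
```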
3.2. Previous Works
The paper contextualizes its work by referencing several related areas:
- Item Representations in Recommendation:
- Modern deep learning recommendation models (Covington et al., 2016; Naumov et al., 2019; Naumov, 2019) heavily rely on trained embeddings for categorical features.
- Random hashing (Weinberger et al., 2009) is a simple solution for high item cardinality.
- More advanced hashing methods include collision-free hashing (Liu et al., 2022), which dynamically manages memory for individual embeddings, and double hashing (Zhang et al., 2020), which reduces memory usage with two hash functions. Learning-to-hash methods (Wang et al., 2017) train ML-based hash functions to preserve similarity.
- Approaches to impression skew often involve contrastive learning or clustering (Yao et al., 2021; Chang et al., 2024), which the authors view as complementary.
- Stable Embedding Representation:
- The concept of a stable ID is inspired by tokenization techniques in Natural Language Processing (NLP) (Sennrich, 2015; Kudo, 2018; Devlin, 2018), where a fixed vocabulary of tokens represents text.
- In recommendation, prior work proposed vector-quantizing item content embeddings to learn transferable sequential recommenders, and a masked vector-quantizer to transfer collaborative filtering representations to generative recommenders.
- Semantic ID itself was introduced concurrently in Singh et al. (2023) and Rajput et al. (2024), building on RQ-VAE for quantization and demonstrating benefits in generalization and sequential recommendation. This paper adapts Semantic ID as its stable ID method.
3.3. Technological Evolution
The evolution of item representation in recommendation systems has moved from simple one-hot encodings to trained embeddings, then to techniques for managing high cardinality (like random hashing), and more recently, to methods aiming for stable and semantically meaningful representations.
Initially, each unique item was represented by a discrete ID. As item catalogs grew, one-hot encoding became impractical due to sparsity and dimensionality. Embedding layers emerged as a powerful solution, mapping IDs to dense, continuous vectors. However, the sheer scale of items in industrial settings (billions of unique IDs) still challenged embedding table sizes and training efficiency.
This led to techniques like random hashing, which aggressively reduced the embedding table size by allowing multiple IDs to share an embedding slot. While memory-efficient, this sacrificed semantic integrity and introduced instability.
The current paper fits into the next phase of this evolution, which seeks to combine the efficiency of hashing (or quantization) with semantic coherence and stability. By leveraging content understanding models and vector quantization (specifically RQ-VAE), the paper introduces Semantic ID to create a fixed, semantically meaningful, and hierarchical ID space. This stable ID space mitigates the drawbacks of random hashing by ensuring that collisions are semantically relevant, rather than random. It represents a shift from purely ID-based representations to content-aware, semantically anchored representations that can better handle ID drifting and impression skew.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Semantic vs. Random Collisions: The most significant differentiation is Semantic ID's approach to collisions. Unlike random hashing (Weinberger et al., 2009; Zhang et al., 2020), which causes arbitrary collisions, Semantic ID (via RQ-VAE) ensures that items sharing an embedding (or a prefix) are semantically similar. This transforms a drawback (collisions) into an advantage (knowledge sharing).
- Fixed, Hierarchical ID Space: Instead of a dynamically changing raw ID space, Semantic ID constructs a fixed ID space (a vocabulary of semantic codes) that has intrinsic semantic meaning and a hierarchical structure. This directly addresses ID drifting and embedding instability.
- Novel Token Parameterization (prefix-ngram): The paper's primary innovation is Semantic ID prefix-ngram. While previous Semantic ID works (Singh et al., 2023; Rajput et al., 2024) introduced the concept, this paper proposes prefix-ngram as an improved parameterization. It explicitly leverages the multi-granularity (coarse-to-fine) nature of RQ-VAE codes, allowing the model to learn representations at different levels of semantic abstraction, which the Trigram, Fourgram, or All-bigrams parameterizations do not fully capture. This leads to more effective knowledge sharing and better performance.
- Holistic Approach to Stability: Instead of tackling high cardinality or impression skew in isolation (e.g., through contrastive learning or dynamic memory management), Semantic ID provides a holistic "stable ID space" solution that inherently mitigates issues arising from all three challenges (cardinality, skew, drifting).
- Enhanced Contextualization in User History: The paper specifically highlights the outsized gains when Semantic ID is used in attention-based user history models, demonstrating that semantically stable representations improve the model's ability to contextualize and aggregate past interactions.
4. Methodology
The core of this paper's methodology revolves around the Semantic ID concept and its novel prefix-ngram token parameterization. The aim is to create stable and semantically meaningful item representations that can overcome the challenges of high cardinality, impression skew, and ID drifting in large-scale recommendation systems.
4.1. Principles
The fundamental principle behind Semantic ID is to represent items not by their arbitrary raw IDs, but by discrete codes derived from their semantic content. The intuition is that if items share similar content (e.g., two ads for pizza), they should share similar representations, even if they have different raw IDs or are new to the system. This allows for:
- Knowledge Sharing: Semantically similar items can contribute to learning a shared representation, benefiting tail items with sparse data.
- Stability: The semantic categories of items tend to be more stable over time than their individual raw IDs. Thus, representations based on semantics will drift less.
- Efficiency: Discretizing the semantic space into a fixed set of codes provides a manageable ID space, addressing high cardinality without random collisions.

This is achieved by a two-stage process:

- Content Understanding: First, items are processed by a content understanding model (e.g., a multimodal image and text foundation model) to generate dense content embeddings. These embeddings capture the semantic meaning of the item's content.
- Vector Quantization (RQ-VAE): Second, these continuous content embeddings are quantized into a sequence of discrete codes using an RQ-VAE model. This sequence of codes forms the item's Semantic ID. The RQ-VAE is specifically chosen for its ability to produce hierarchical clusters, where earlier codes represent broader semantic categories and later codes represent finer details.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Ranking Model Overview
The recommendation problem is framed as a classification task. The model predicts a binary label (interaction or conversion) given user- and item-side features. The architecture is based on the Deep Learning Recommendation Model (DLRM) (Covington et al., 2016; Naumov et al., 2019), comprising three main sections:
- Information Aggregation Section: Processes sparse (categorical), dense, and user history-based features independently. Each module outputs a list of embedding vectors.
- Interaction Layer: Concatenates the embedding lists and performs dot products (or higher-order interactions) between all pairs of vectors.
- MLP and Sigmoid: Transforms the output of the interaction layer via an MLP to produce a logit score, followed by a sigmoid function to output a probability. The model is trained using cross-entropy loss.

The paper focuses on the information aggregation section.
Embedding Module
For categorical features (like item IDs), an embedding table stores the vector representations.
Let $N$ be the total number of raw IDs in the system and $[1..N]$ denote the integers from 1 to $N$.
The embedding table is a matrix $E \in \mathbb{R}^{M \times D}$, where $D$ is the embedding dimension and $M$ is the total number of embeddings (rows) in the table.
An embedding lookup function $h$ maps a raw ID to a set of embedding table row indices.
For each raw ID $x \in [1..N]$, the sparse module looks up $m$ embedding rows and sum-pools them to produce a single output embedding: $e_x = \sum_{j=1}^{m} E_{h(x)_j}$.
- $e_x$: The resulting aggregated embedding for raw ID $x$.
- $m$: The number of embedding table rows looked up for a single raw ID. In Semantic ID, this corresponds to the number of tokens (or n-grams) derived from the Semantic ID.
- $E_{h(x)_j}$: The $j$-th embedding vector looked up from the table, at index $h(x)_j$.
Sparse Module
A sparse feature is a set of raw IDs $\{x_1, \ldots, x_k\} \subseteq [1..N]$. For example, this could be multiple product category IDs for an item. A single embedding is produced by sum-pooling the embeddings of each constituent raw ID.
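To illustrate the lookup-and-sum-pooling behavior described in the Embedding and Sparse Module paragraphs, here is a small NumPy sketch; the table sizes, the modulo hash, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

D, M = 16, 1_000_000                              # embedding dim, table rows (illustrative)
rng = np.random.default_rng(0)
table = rng.normal(scale=0.01, size=(M, D))       # the embedding table E

def embed_tokens(token_ids):
    """Sum-pool the rows for all tokens derived from one raw ID (e.g., prefix-ngram tokens)."""
    rows = [table[t % M] for t in token_ids]      # modulo hash keeps indices inside the table
    return np.sum(rows, axis=0)

def embed_sparse_feature(raw_id_token_lists):
    """A sparse feature is a set of raw IDs; sum-pool the per-ID embeddings."""
    return np.sum([embed_tokens(tokens) for tokens in raw_id_token_lists], axis=0)

# Two raw IDs, each already mapped to its list of Semantic ID tokens (hypothetical values).
print(embed_sparse_feature([[2, 12, 55], [7, 31, 140]]).shape)  # (16,)
```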
User History Module
This module models a user's item interaction history as a sequence of sparse features along with interaction timestamps. The module is designed to contextualize this sequence.
First, each sparse feature in the history is embedded using the sparse module described above, and a learned timestamp embedding is added; the sum is denoted $z_i$ for the $i$-th history item.
The resulting sequence of embeddings is $Z = (z_1, \ldots, z_T)$, where $T$ is the sequence length. This sequence is then contextualized by one of three aggregation modules: Bypass, Transformer, or Pooled Multihead Attention (PMA).
The architectures for these aggregation modules are defined in Appendix A and are critical for understanding user history modeling:
- Bypass: This is the simplest module. It applies a linear transformation to each embedding in the sequence independently: $\mathrm{Bypass}(Z) = ZW$.
  - $Z \in \mathbb{R}^{T \times D}$: The input sequence of embeddings, where $T$ is the sequence length and $D$ is the embedding dimension.
  - $W$: A learnable weight matrix.
  The Bypass module processes each item in the history without considering its context from other items in the sequence.
- Transformer: This module applies a Transformer layer to the embedding sequence, incorporating self-attention to contextualize each item based on all other items in the sequence. The core Attention submodule is $ \mathrm{Attention}(Z) = \mathrm{softmax}\left(\frac{(ZW_Q)(ZW_K)^T}{\sqrt{d}}\right) ZW_V $.
  - $Z$: The input embedding sequence.
  - $W_Q, W_K, W_V$: Learnable weight matrices for the query, key, and value transformations, respectively.
  - $d$: The dimension of the query/key/value vectors.
  - $\mathrm{softmax}$: Normalizes the attention scores.
  - The term $ZW_Q$ generates queries, $ZW_K$ generates keys, and $ZW_V$ generates values.
  The full Transformer module then consists of an attention layer followed by an MLP, with LayerNorm and residual connections:
  - $H = \mathrm{LayerNorm}(Z + \mathrm{Attention}(Z))$: Output after the attention sub-layer with residual connection.
  - $\mathrm{LayerNorm}(H + \mathrm{MLP}(H))$: Final output after the position-wise MLP sub-layer with residual connection.
  - $\mathrm{LayerNorm}$: Normalizes activations across the feature dimension.
  - $\mathrm{MLP}$: A multi-layer perceptron.
  Standard positional embeddings are added to the encoding before applying Transformer modules to incorporate sequence order information.
- Pooled Multihead Attention (PMA): A variant of Transformer where the attention query vectors are replaced by a fixed set of learnable seed vectors, allowing a fixed-size summary of the sequence: $ \mathrm{PMAttention}(Z) = \mathrm{softmax}\left(\frac{(SW_Q)(ZW_K)^T}{\sqrt{d}}\right) ZW_V $.
  - $S$: A matrix comprised of learnable query vectors (seeds); the experiments use a small fixed number of seeds.
  - Other symbols are as defined for the Transformer's Attention module. The PMA module is formed using the same equations as the Transformer module (Equations 7 and 8 in the paper, corresponding to the two LayerNorm equations above), but with PMAttention replacing Attention.
4.2.2. Semantic ID Learning (RQ-VAE)
Semantic IDs are learned for items in two stages:
- Content Embeddings: A content understanding model (e.g., a multimodal image and text foundation model pre-trained on large datasets) processes item content (text, image, video) to produce dense content embeddings.
- Vector Quantization with RQ-VAE: An RQ-VAE (Residual Quantized Variational Autoencoder) is trained on these content embeddings.
  - Encoder: Maps the continuous content embedding $x$ to a continuous latent representation $z$.
  - Residual Quantizer: Quantizes $z$ into a sequence of discrete codes $(c_1, \ldots, c_L)$, where $L$ is the number of layers (the length of the code sequence) and $K$ is the codebook size (the number of clusters at each layer). The quantization is hierarchical: each layer $l$ has its own codebook containing $K$ vectors $e^{(l)}_1, \ldots, e^{(l)}_K$. The code at layer $l$ is chosen by finding the codebook vector that best approximates the residual of $z$ after subtracting the codebook vectors chosen at layers $l-1$ down to 1. The residual for layer $l$ is defined as $ r_l = z - \sum_{t=1}^{l-1} e^{(t)}_{c_t} $.
    - $r_l$: The residual vector at layer $l$.
    - $z$: The continuous latent representation from the encoder.
    - $\sum_{t=1}^{l-1} e^{(t)}_{c_t}$: The sum of codebook vectors chosen by previous layers.
    The code for layer $l$ is then selected as the index of the codebook vector that is closest to this residual: $ c_l = \arg\min_{k \in [1..K]} \lVert e^{(l)}_k - r_l \rVert^2 $.
    - $c_l$: The discrete code chosen for layer $l$.
    - $\arg\min_k$: Finds the index $k$ that minimizes the following expression.
    - $\lVert e^{(l)}_k - r_l \rVert^2$: The (squared) Euclidean distance between a codebook vector from the $l$-th codebook and the residual $r_l$. This process is illustrated in Figure 1, showing how the residual quantizers progressively refine the representation.
Figure 1: The RQ-VAE model, showing the encoder, the stack of residual quantizers, and the decoder that reconstructs the input content embedding.
  - Decoder: Reconstructs the original content embedding $x$ from the sequence of discrete codes $(c_1, \ldots, c_L)$.

The RQ-VAE is trained with two loss terms, a reconstruction loss and a codebook loss: $ \mathcal{L}(x) = \lVert x - \hat{x} \rVert^2 + \sum_{l=1}^{L} \left( \lVert \mathrm{sg}[r_l] - e^{(l)}_{c_l} \rVert^2 + \beta \lVert r_l - \mathrm{sg}[e^{(l)}_{c_l}] \rVert^2 \right) $.
- $\mathcal{L}(x)$: The total loss for an input content embedding $x$.
- $\lVert x - \hat{x} \rVert^2$: The reconstruction loss, which measures how well the decoder can reconstruct the original content embedding from the quantized codes.
- The per-layer terms are the codebook loss terms, which encourage the residuals $r_l$ to be close to their chosen codebook vectors $e^{(l)}_{c_l}$.
- $\mathrm{sg}[\cdot]$: The stop-gradient operator, which blocks gradients through the term it encloses. The term $\lVert \mathrm{sg}[r_l] - e^{(l)}_{c_l} \rVert^2$ therefore updates the codebook vectors toward the (fixed) residuals, while $\beta \lVert r_l - \mathrm{sg}[e^{(l)}_{c_l}] \rVert^2$ updates the encoder toward the (fixed) codebook vectors. This dual optimization updates both the encoder and the codebook.
- $\beta$: A hyperparameter, set to 0.5 in the experiments.

A Semantic ID is defined as the sequence of discrete codes $(c_1, \ldots, c_L)$ produced by the encoder and residual quantizer. The hierarchical nature means $c_1$ represents the broadest category (e.g., "food ads"), $(c_1, c_2)$ refines it (e.g., "pizza ads"), and the full sequence $(c_1, \ldots, c_L)$ offers the finest detail (e.g., "pizza ads in English").
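The residual-quantization encoding step described above can be summarized with a short NumPy sketch. This only shows greedy code selection against fixed codebooks; in the actual RQ-VAE the codebooks, encoder, and decoder are trained jointly with the loss above, and all sizes below are illustrative.

```python
import numpy as np

def rq_encode(z, codebooks):
    """Greedy residual quantization: return the Semantic ID (c_1, ..., c_L).

    z: latent vector from the encoder, shape (d,).
    codebooks: list of L arrays, each of shape (K, d), one codebook per layer.
    """
    codes = []
    residual = z.copy()
    for codebook in codebooks:
        # Pick the codebook vector closest (in Euclidean distance) to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        c = int(np.argmin(distances))
        codes.append(c)
        # Subtract the chosen vector; the next layer quantizes what is left over.
        residual = residual - codebook[c]
    return codes

rng = np.random.default_rng(0)
d, K, L = 16, 8, 3
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]
z = rng.normal(size=d)
print(rq_encode(z, codebooks))  # a depth-3 Semantic ID, one code per layer
```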
4.2.3. Token Parameterization
After obtaining the Semantic ID sequence $(c_1, \ldots, c_L)$ for a raw item ID, the next step is to map this sequence of codes to embedding table rows. This is done via a token parameterization function $p$, which determines how the Semantic ID is represented in the embedding table. The choice of parameterization is crucial because it controls the amount and structure of information the recommendation model receives. Since the full fine-grained tuple (all $L$ codes) has extremely high cardinality ($K^L$ possible combinations), there is a tradeoff between information granularity and embedding table size.
The paper defines several possible token parameterization techniques in Table 1:
The following are the results from Table 1 of the original paper:
| Token Param | p(c1, ..., cL; K) |
|---|---|
| Trigram | [K²c1 + Kc2 + c3] |
| Fourgram | [K³c1 + K²c2 + Kc3 + c4] |
| All bigrams | [K² × (i − 1) + Kci + ci+1, for i in [1..L−1]] |
| Prefix-ngram | [∑t=1..i K^(i−t)(ct + 1) − 1, for i in [1..n]] |
Let's break down these parameterizations:
- Trigram / Fourgram: These parameterizations treat a fixed sequence of codes (e.g., (c1, c2, c3) for Trigram) as a single, unique identifier. They combine the codes into a single integer index using a base-K representation, effectively creating a "flat" ID that represents a specific combination of codes. For Trigram, the index K²c1 + Kc2 + c3 means that c1 (the coarsest code) contributes most significantly, followed by c2 and then c3, mapping a specific tuple (c1, c2, c3) to a unique embedding ID. Fourgram extends this to four codes: K³c1 + K²c2 + Kc3 + c4.
- All bigrams: This parameterization generates multiple IDs for an item, each representing a bigram (a pair of consecutive codes) from the Semantic ID sequence. It generates separate IDs for (c1, c2), (c2, c3), and so on up to (c_{L−1}, c_L). The K² × (i − 1) term provides a shifting factor so that bigrams from different positions in the sequence (e.g., (c1, c2) vs. (c2, c3)) do not collide if they happen to have the same code values. This allows the model to learn representations for pairs of codes at different hierarchical levels.
- Prefix-ngram: This is the novel and most effective parameterization proposed in the paper. It generates n IDs for an item, where each ID represents a prefix of the Semantic ID sequence, increasing in granularity (see the code sketch below).
  - n: The maximum length of the prefix (e.g., for Prefix-3gram, n = 3).
  - For each i from 1 to n, an ID is generated for the prefix (c1, ..., ci).
  - The sum ∑t=1..i K^(i−t)(ct + 1) − 1 effectively calculates a unique integer from the codes in the prefix (c1, ..., ci); the +1 and −1 terms account for 0-indexed vs. 1-indexed codes.
  - This means Prefix-ngram generates IDs for c1, (c1, c2), (c1, c2, c3), and so on up to (c1, ..., cn). This allows the model to capture information at various levels of granularity, from very coarse (e.g., c1) to more fine-grained (e.g., (c1, ..., cn)). This is crucial for leveraging the hierarchical nature of RQ-VAE clusters.

When the Semantic ID cardinality exceeds the embedding table size, a modulo hash function is applied. For multiple IDs (as in All bigrams or Prefix-ngram), a shifting factor is added to prevent collisions between IDs from different positions.
The paper's experiments (Table 2) confirm that Prefix-ngram is the best parameterization, highlighting the importance of incorporating the hierarchical clustering information. Increasing the depth () of Prefix-ngram and the RQ-VAE cardinality ( and ) both further improve performance.
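As an illustration of the Prefix-ngram formula from Table 1, the following Python sketch maps a Semantic ID to one token ID per prefix depth, assuming 0-indexed codes; the shift-and-modulo step for fitting IDs into a finite embedding table is a simplified stand-in for the paper's hashing scheme, not its exact implementation.

```python
def prefix_ngram_ids(codes, K, n, table_size=None):
    """Map a Semantic ID (c1, ..., cL) to n token IDs, one per prefix depth.

    codes: list of 0-indexed RQ-VAE codes, each in [0, K).
    K: codebook size per layer; n: maximum prefix depth.
    table_size: optional embedding table size; if given, IDs are modulo-hashed
    with a per-depth shift so prefixes of different depths land in different regions.
    """
    ids = []
    for i in range(1, n + 1):
        prefix = codes[:i]
        # Base-K encoding of the prefix, matching sum_{t=1..i} K^(i-t) * (c_t + 1) - 1.
        token = sum(K ** (i - t) * (c + 1) for t, c in enumerate(prefix, start=1)) - 1
        if table_size is not None:
            token = (token + i * table_size // (n + 1)) % table_size  # illustrative shift + modulo hash
        ids.append(token)
    return ids

# Example: a depth-3 Semantic ID with codebook size 4.
print(prefix_ngram_ids([2, 0, 3], K=4, n=3))  # -> [2, 12, 55]
```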
4.2.4. Item Impression Distribution Issues (Addressed by Semantic ID)
The paper elaborates on three key data distribution issues and how Semantic ID addresses them:
- Item Cardinality: The number of distinct items is much larger than the feasible embedding table size. Random hashing causes random collisions.
  - Semantic ID Solution: Semantic ID creates a fixed and manageable ID space through RQ-VAE. Instead of random collisions, items sharing a Semantic ID (or prefix) are semantically similar. This ensures that when collisions occur (by design or by modulo hashing), they are meaningful, facilitating knowledge sharing.
- Impression Skew: A small percentage of "head" items dominates impressions, leaving "tail" items with few examples (Figure 2). Random hashing doesn't allow effective knowledge sharing.

  Figure 2: Impression skew, shown as cumulative impression share as a function of the share of items considered. As items are sorted by impression count, one sees that the majority of impressions comes from a small fraction of the most popular items.

  - Semantic ID Solution: If a tail item has similar content to a head or torso item, their Semantic IDs will match or share a prefix. This allows the tail item to "inherit" knowledge from the more popular, semantically similar item, improving its representation learning despite sparse individual data. The paper suggests that the Semantic ID space exhibits less skew (Appendix B, Figure 6).
- ID Drifting: The item ID space is highly dynamic, with constant entry and exit of items (Figure 3). Random hashing leads to embedding representation drift.

  Figure 3: ID drift, shown as the share of items that remain active in the initial corpus as a function of time. Half of the original corpus exits the system after 6 days. An equal number of new items enters the system, creating a severe item distribution drift.

  - Semantic ID Solution: When an old ad retires and a new, semantically similar ad enters, their Semantic IDs (or prefixes) will likely match. This ensures temporal stability of semantic concepts, leading to stable Semantic ID encodings. The embedding weights for new items can leverage pre-existing knowledge from semantically similar items, rather than being learned from scratch or being randomly assigned.
5. Experimental Setup
5.1. Datasets
The experiments were conducted using production data from Meta's ads ranking system.
- Data Source: Production user interaction data from Meta's ads ranking platform.
- Scale: Training data spans a four-day time period, processed sequentially for a single epoch. Evaluation is performed on the first six hours of the next day's data.
- Characteristics: The data exhibits the discussed challenges:
- High Item Cardinality: Over one billion items in the user history module.
- Impression Skew: A small percentage of items (0.1% head) accounts for 25% of impressions, while 94.4% (tail) accounts for the remaining 25%.
- ID Drifting: A significant portion of items (half) exits the system within 6 days, with new items constantly entering.
- Content Embeddings: The item content embeddings for Semantic ID generation are obtained from a multimodal image and text foundation model. This model is pre-trained on the public CC100 dataset (Conneau, 2019) and then fine-tuned on internal ads datasets.
- RQ-VAE Training Data: The RQ-VAE itself is trained on the content embeddings of all target items from the past three months.
- Choice of Datasets: These datasets are representative of real-world, large-scale industrial recommendation scenarios, making them highly effective for validating the method's performance under practical constraints and challenges. While the paper doesn't provide specific content examples, it implies that items are advertisements with associated text, images, or videos.
5.2. Evaluation Metrics
5.2.1. Normalized Entropy (NE)
The primary offline model performance metric is Normalized Entropy (NE).
- Conceptual Definition: Normalized Entropy measures the predictive quality of a classification model relative to a baseline predictor that always predicts the mean frequency of positive labels. A lower NE indicates better model performance, as it means the model's cross-entropy loss is lower than that of the naive mean-frequency predictor. It essentially normalizes the cross-entropy to provide a more interpretable score.
- Mathematical Formula: $ \mathrm{NE} = \frac{-\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log p_i + (1-y_i)\log(1-p_i)\right)}{-\left(\bar{p}\log\bar{p} + (1-\bar{p})\log(1-\bar{p})\right)} $
- Symbol Explanation:
  - $\mathrm{NE}$: Normalized Entropy.
  - $n$: The total number of training examples.
  - $y_i$: The true binary label for example $i$ (0 for no interaction, 1 for interaction).
  - $p_i$: The model's predicted probability of a positive label for example $i$.
  - $\bar{p}$: The overall mean frequency of positive labels in the dataset, calculated as $\bar{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
  - $\log$: Natural logarithm.

The numerator is the cross-entropy loss of the model, and the denominator is the cross-entropy of a baseline model that always predicts the overall mean positive rate $\bar{p}$.
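The NE computation can be expressed in a few lines of Python; this sketch follows the formula above and is not taken from the paper's codebase.

```python
import numpy as np

def normalized_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy of the model divided by the cross-entropy of a
    constant predictor that always outputs the mean positive rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    model_ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    p = np.clip(y_true.mean(), eps, 1 - eps)         # mean frequency of positive labels
    baseline_ce = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return model_ce / baseline_ce

# Toy example: NE < 1 means the model beats the mean-rate baseline.
print(normalized_entropy([0, 1, 0, 0, 1], [0.1, 0.8, 0.2, 0.05, 0.6]))
```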
5.2.2. Attention Score-based Evaluation Metrics
For user history modeling, four metrics are computed on the attention scores $A$ from the PMA and Transformer aggregation modules, where $A_{i,j}$ denotes the attention score from target token $i$ to source token $j$:
- First source token attention: $\frac{1}{T}\sum_{i=1}^{T} A_{i,1}$, the average weight placed on the very first token in the source sequence.
  - $T$: Target sequence length.
  - $A_{i,1}$: Attention score from target token $i$ to the first source token.
  - This metric indicates whether the model disproportionately focuses on the initial item in the history.
- Padding token attention: $\frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{S} A_{i,j}\,\mathbb{1}[j \text{ is padding}]$, the average attention given to padding tokens (placeholders used to make sequences of equal length).
  - $S$: Source sequence length.
  - $\mathbb{1}[\cdot]$: Indicator function, which is 1 if source token $j$ is a padding token and 0 otherwise.
  - This metric reflects how efficiently the model ignores irrelevant padding tokens. Lower values are desirable.
- Entropy: $-\frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{S} A_{i,j}\log A_{i,j}$, the diversity or "diffuseness" of the attention distribution for each target token, averaged over all target tokens.
  - Higher entropy means attention is more spread out; lower entropy means it is more focused on specific tokens. Lower entropy is often desired for more decisive attention.
- Token self-attention: For Transformer models, $\frac{1}{T}\sum_{i=1}^{T} A_{i,i}$, which measures how much a token attends to itself.
  - $A_{i,i}$: Attention score from target token $i$ to source token $i$.
  - This indicates whether a token primarily relies on its own representation or seeks context from other tokens.
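A small NumPy sketch of these four summaries, assuming a row-stochastic (targets × sources) attention matrix; the exact averaging conventions used in the paper may differ.

```python
import numpy as np

def attention_metrics(A, pad_mask):
    """Summarize an attention matrix A (targets x sources).

    A: array of shape (T, S); each row sums to 1.
    pad_mask: boolean array of shape (S,), True where the source token is padding.
    """
    T, S = A.shape
    first = A[:, 0].mean()                                   # first source token attention
    pad = (A * pad_mask[None, :]).sum(axis=1).mean()         # padding token attention
    entropy = (-(A * np.log(A + 1e-12)).sum(axis=1)).mean()  # attention entropy
    self_attn = np.trace(A[:, :T]) / T if S >= T else None   # token self-attention (square case)
    return first, pad, entropy, self_attn

A = np.full((4, 6), 1 / 6)                  # uniform attention over 6 source tokens
pad_mask = np.array([False] * 4 + [True] * 2)
print(attention_metrics(A, pad_mask))
```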
5.2.3. Online Metrics
- Online Performance Gain: Measured as a percentage gain on a top-line online metric (e.g., Click-Through Rate (CTR), Conversion Rate) in live A/B tests. A 0.15% gain is considered significant in Meta's highly optimized system.
- Click Loss Rate: Used to measure the correlation between semantic similarity and prediction similarity in online A/B tests; it captures the relative drop in CTR when a recommended item is swapped for a semantically similar one, i.e., roughly $(\mathrm{CTR}(R) - \mathrm{CTR}(R')) / \mathrm{CTR}(R)$.
  - $\mathrm{CTR}(R)$: Click-Through Rate for the original set of recommended items $R$.
  - $\mathrm{CTR}(R')$: Click-Through Rate for the mutated set of items $R'$, where an item in $R$ is swapped with a semantically similar item (same Semantic ID prefix).
  - A smaller Click Loss Rate (ideally close to 0) indicates that replacing an item with a semantically similar one does not significantly impact user clicks, implying that prediction similarity aligns with semantic similarity.
- A/A Prediction Difference (AAR): Measures the relative difference in predictions for an identical item and its copy (an A/A pair), of the form $|f(a) - f(a')| / (f(a) + \epsilon)$; it is used to quantify prediction variance.
  - $(a, a')$: An A/A pair, representing an original item and its exact copy (with a different raw ID).
  - $f(a), f(a')$: The ranking model's predictions (e.g., probability of click) for items $a$ and $a'$, respectively.
  - $\epsilon$: A small constant to prevent division by zero.
  - Lower AAR values indicate less prediction variance between identical items, which is desirable for stability and advertiser trust.
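For concreteness, here is a short Python sketch of the two relative metrics as described above; the exact denominators are not spelled out in this summary, so these formulas are plausible forms rather than the paper's definitions.

```python
def aa_prediction_difference(f_a, f_a_copy, eps=1e-9):
    """Relative prediction gap for an A/A pair (one plausible form of the AAR metric)."""
    return abs(f_a - f_a_copy) / (f_a + eps)

def click_loss_rate(ctr_original, ctr_mutated):
    """Relative CTR drop when items are swapped with semantically similar ones."""
    return (ctr_original - ctr_mutated) / ctr_original

print(aa_prediction_difference(0.031, 0.027))   # ~0.129: identical ads scored noticeably differently
print(click_loss_rate(0.0150, 0.01497))         # ~0.002: negligible click loss
```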
5.3. Baselines
The paper compares Semantic ID (SemID) against two baseline item representation approaches:
- Individual Embeddings (IE):
  - Description: Each raw item ID is assigned its own unique row in the embedding table, i.e., $M = N$ (the embedding table size equals the total number of items). During evaluation, any ID not seen during training is mapped to a randomly initialized, untrained embedding.
  - Representativeness: While unrealistic for production-scale systems due to memory constraints, IE serves as an illustrative upper bound or ideal scenario for understanding item-specific representation quality, as it avoids any form of collision.
- Random Hashing (RH):
  - Description: Raw item IDs are randomly hashed to embedding table rows using a standard hash function (e.g., a modulo hash). This is used when the total number of items is much larger than the feasible embedding table size ($M \ll N$), creating random collisions in which multiple unrelated raw IDs share the same embedding.
  - Representativeness: RH is a popular and simple approach in industrial systems for managing high cardinality under system constraints. It serves as a strong, practical baseline that Semantic ID aims to outperform by resolving the random-collision issue.

For the offline experiments focusing on the target item sparse feature, the item embedding table size for IE is equal to the total number of items, while for RH and SemID it is set to a smaller size, resulting in an average collision factor of 3. User history features are mapped using random hashing in these baseline comparisons.
6. Results & Analysis
The offline experiments investigate the hypotheses regarding Semantic ID's advantages, using a simplified version of Meta's production ads ranking model.
6.1. Segment Analysis
To understand the impact of impression skew and ID drifting, items are segmented by their impression count during training (head, torso, tail) and whether they are new (cold start) items.
The following are the results from Table 3a of the original paper:
| Cum. Exs. | Item Percentile | Eval NE (RH) | Eval NE (IE) | Eval NE (SemID) | SemID NE Gain vs. RH | SemID NE Gain vs. IE |
|---|---|---|---|---|---|---|
| 25% (Head) | 0.1 | 0.80105 | 0.80101 | 0.80108 | 0.00% | 0.01% |
| 75% (Torso) | 5.6 | 0.83589 | 0.83583 | 0.83580 | -0.01% | -0.00% |
| 100% (Tail) | 100 | 0.83904 | 0.83886 | 0.83872 | -0.04% | -0.02% |
| Items Seen in Training | | 0.82626 | 0.82612 | 0.82600 | -0.03% | -0.02% |
| New Items | | 0.83524 | 0.83453 | 0.83180 | -0.41% | -0.33% |
| All Items | | 0.82663 | 0.82645 | 0.82621 | -0.05% | -0.03% |
Table 3a: Evaluation NE (lower is better). Semantic ID enables knowledge transfer to tail and new cold start items.
Analysis of Table 3a:
- Tail Items: Semantic ID (SemID) shows the most significant NE gain (improvement) for tail items (-0.04% vs. RH, -0.02% vs. IE). This confirms the hypothesis that SemID facilitates knowledge sharing, allowing items with few impressions to benefit from semantically similar, more popular items.
- New Items (Cold Start): SemID achieves substantial gains for new items (-0.41% vs. RH, -0.33% vs. IE). This is a critical finding, demonstrating that SemID effectively uses pre-trained weights from semantically similar items, avoiding the randomly initialized or non-relevant weights encountered by IE and RH, respectively, for cold-start items.
- Head Items: SemID is NE-neutral for head items (0.00% vs. RH, 0.01% vs. IE), which is expected as these items already have ample data for learning individual representations.
- Torso Items: SemID is slightly beneficial for torso items (-0.01% vs. RH, -0.00% vs. IE).
- Overall: SemID provides a modest 0.05% NE gain over RH and 0.03% over IE across all items, driven mostly by improvements in the tail and new-item segments. This indicates SemID isn't just better at clustering than RH but actively enables semantically driven knowledge transfer.

To measure the effect of embedding representation drifting due to ID drifting, the models are evaluated on different temporal segments of the training data. The metric is the difference in NE between an earlier period (42-48 hours prior to the end of training) and the latest period (last six hours of training). A smaller value indicates less impact from embedding representation shift.
The following are the results from Table 3b of the original paper:
| Cum. Exs. | RH | IE | SemID |
|---|---|---|---|
| 25% (Head) | 0.0057 | 0.0065 | 0.0059 |
| 75% (Torso) | 0.0087 | 0.0075 | 0.0076 |
| 100% (Tail) | 0.0128 | 0.0103 | 0.0106 |
| All Items | 0.0083 | 0.0074 | 0.0073 |
Table 3b: Sensitivity to distribution shift, measured as the NE gap between the earlier and latest evaluation periods. Lower is better.
Analysis of Table 3b:
- RH generally has higher NE gaps, especially for tail items (0.0128), indicating that its embedding representations suffer more from ID drifting over time. The model's ability to represent older items degrades as weights are updated for new items.
- IE shows a smaller performance gap, suggesting individual embeddings are more stable, as each item theoretically retains its distinct representation.
- SemID matches or slightly outperforms IE in terms of stability across all segments (0.0073 for all items vs. 0.0074 for IE). This supports the claim that SemID leads to more stable learned representations over time, even with a smaller embedding table size than IE.

The paper further investigates ID drifting by training models over a longer period (20 days vs. 4 days).
The following are the results from Table 4 of the original paper:
| | RH | Semantic ID |
|---|---|---|
| Eval NE Gain | -0.18% | -0.23% |
Table 4: NE improvement from training for 20 days of data instead of 4 days.
Analysis of Table 4:
SemID demonstrates better scalability with longer training data, achieving a -0.23% NE gain compared to -0.18% for RH. This supports the conjecture that improved representation stability allows SemID models to generalize better over extended training durations, where ID drifting is more pronounced.
6.2. Parameterization Analysis
The paper evaluates different token parameterization techniques for Semantic ID.
The following are the results from Table 2 of the original paper:
| RQ-VAE K × L | Token Parameterization | Train NE Gain |
|---|---|---|
| [2048] × 3 | Trigram | -0.028% |
| [2048] × 4 | Fourgram | -0.035% |
| [2048] × 4 | All bigrams | -0.091% |
| [512] × 3 | Prefix-3gram | -0.034% |
| [1024] × 3 | Prefix-3gram | -0.097% |
| [2048] × 3 | Prefix-3gram | -0.141% |
| [2048] × 5 | Prefix-5gram | -0.208% |
| [2048] × 6 | Prefix-6gram | -0.215% |
Table 2: NE performance for different tokenization parameterizations
Analysis of Table 2:
- Prefix-ngram Superiority: Prefix-ngram consistently outperforms the other parameterizations (Trigram, Fourgram, All bigrams). For example, Prefix-3gram with K = 2048 yields a -0.141% gain, significantly better than Trigram (-0.028%) or All bigrams (-0.091%) for similar RQ-VAE configurations. This strongly supports the idea that incorporating the hierarchical nature of RQ-VAE clusters (i.e., representing prefixes of codes at different granularities) is essential for effectively sharing knowledge and boosting performance.
- Depth Matters: Increasing the depth of Prefix-ngram (from Prefix-3gram to Prefix-5gram and Prefix-6gram with K = 2048) leads to improved NE performance (from -0.141% to -0.215%). Deeper prefixes capture more fine-grained semantic information, which contributes to better item differentiation and representation.
- RQ-VAE Cardinality: Increasing the RQ-VAE cardinality (either K or L) also improves NE. For Prefix-3gram, increasing K from 512 to 2048 (while keeping L = 3) improves the NE gain from -0.034% to -0.141%. A larger K (codebook size) and L (number of layers) allow for a richer and more precise semantic space.
6.3. Item Representation Space
This section examines the quality of the item embeddings themselves by comparing how Random Hashing (RH) and Semantic ID (SemID) partition the raw item ID corpus. The goal is to see if SemID creates more effective summaries of individual embeddings. IE embeddings are used as a reference to compute metrics for these partitions. The collision factor is set to 5, meaning clusters contain 5 items on average. SemID clusters can have variable sizes due to the nature of RQ-VAE.
The following are the results from Table 5 of the original paper:
| | Variance | Pairwise distance |
|---|---|---|
| Random Hashing | 1.52 × 10⁻³ (8.0 × 10⁻⁴) | 0.22 (0.04) |
| SemID (small) | 1.31 × 10⁻³ (1.0 × 10⁻³) | 0.24 (0.09) |
| SemID (top 1,000) | 1.23 × 10⁻³ (5.5 × 10⁻⁴) | 0.06 (0.02) |
Table 5: Intra- and inter-cluster variances and pairwise distances for random hashing and SemID-based partitions.
Analysis of Table 5:
- Intra-cluster Variance: Semantic ID partitions (both small and top 1,000 clusters) exhibit lower intra-cluster variance compared to Random Hashing. For example, SemID (top 1,000) has a variance of 1.23 × 10⁻³ compared to RH's 1.52 × 10⁻³. This means that items grouped together by SemID are more semantically homogeneous, leading to a more coherent summary embedding.
- Inter-cluster Pairwise Distance: The results are mixed:
  - SemID (small) clusters have a slightly higher pairwise distance (0.24) than RH (0.22), which is a good indication of distinct cluster representations.
  - However, SemID (top 1,000) clusters show a significantly lower pairwise distance (0.06). The authors hypothesize this is because RQ-VAE places multiple centroids in regions of highest data density to minimize overall model loss. While this might seem counter-intuitive (lower inter-cluster distance usually implies less distinct clusters), in the context of RQ-VAE and its residual quantization, it could mean that very popular, related clusters are finely differentiated within a dense semantic region.

Overall, the lower intra-cluster variance for SemID confirms that it creates more semantically coherent groups of items, which is crucial for effective knowledge sharing.
6.4. User History Modeling
This section explores the benefits of Semantic ID when used with different user history aggregation modules.
The following are the results from Table 6 of the original paper:
| | Train NE Gain | Eval NE Gain |
|---|---|---|
| Bypass | -0.056% | -0.085% |
| Transformer | -0.071% | -0.110% |
| PMA | -0.073% | -0.100% |
Table 6: Performance for three aggregation modules. Baseline: model with RH for each module. Semantic ID brings larger gains to the contextualizing modules.
Analysis of Table 6:
- Semantic ID consistently provides NE gains across all aggregation modules compared to a baseline using Random Hashing (RH).
- The gains are outsized for the contextualizing attention-based modules (Transformer and PMA) compared to Bypass. For example, Transformer shows an Eval NE Gain of -0.110% and PMA -0.100%, both larger than Bypass's -0.085%. This indicates that the stable and semantically meaningful Semantic ID representations significantly enhance the ability of attention mechanisms to contextualize user histories. PMA shows the best Train NE Gain (-0.073%).

To further understand this, attention score-based evaluation metrics are computed.
The following are the results from Table 7 of the original paper:
| | First | Pad | Entropy | Self |
|---|---|---|---|---|
| Transformer + RH | 0.030 | 0.460 | 2.149 | 0.052 |
| Transformer + SemID | 0.043 | 0.418 | 1.967 | 0.045 |
| PMA + RH | 0.071 | 0.351 | 3.075 | |
| PMA + SemID | 0.074 | 0.313 | 3.025 | |
Table 7: Attention score-based evaluation metrics for random hashing and SemID-based models for the user history item interaction features.
Analysis of Table 7:
- First source token attention: SemID-based models (both Transformer and PMA) show higher attention on the first source token (0.043 vs. 0.030 for Transformer, 0.074 vs. 0.071 for PMA). This suggests SemID helps the model place more weight on high-signal (e.g., most recent) tokens in the sequence.
- Padding token attention: SemID-based models show lower attention to padding tokens (0.418 vs. 0.460 for Transformer, 0.313 vs. 0.351 for PMA). This indicates that SemID representations make it easier for attention mechanisms to disregard irrelevant padded portions of the history.
- Entropy: SemID-based models exhibit lower entropy (1.967 vs. 2.149 for Transformer, 3.025 vs. 3.075 for PMA). Lower entropy means the attention distributions are less diffuse and more focused on relevant tokens, suggesting more decisive and meaningful contextualization.
- Token self-attention: For Transformer, SemID results in lower token self-attention (0.045 vs. 0.052). This implies that with SemID, the Transformer is better able to find useful contextual information from other tokens in the sequence, rather than primarily attending to itself.

These attention metrics confirm that Semantic ID representations are more stable and meaningful, enabling more effective and focused user history modeling by attention-based architectures.
7. Productionization
Semantic ID features have been successfully integrated into Meta's Ads Recommendation System for over a year, becoming top sparse features by importance.
7.1. Offline RQ-VAE Training
- Content Understanding (CU) Models: The RQ-VAE models are trained on embeddings generated by Content Understanding (CU) models. These CU models are initially pre-trained on the public CC100 dataset (Conneau, 2019) and then fine-tuned on Meta's internal ads datasets.
- RQ-VAE Training: RQ-VAE models are trained offline using ad IDs and their content embeddings sampled from the past three months of data.
- Production Configuration: For production, RQ-VAEs are configured with L = 6 quantization layers and a fixed codebook size K. Semantic ID utilizes the prefix-5gram parameterization (n = 5 in Prefix-ngram) from Section 4.2, with an embedding table size of 50 million rows.
- Deployment: A frozen (fixed) RQ-VAE checkpoint is used for online serving after training.
7.2. Online Semantic ID Serving System
The Semantic ID serving pipeline is illustrated in Figure 4.
Figure 4: The Semantic ID serving pipeline. New entities flow through content understanding and the RQ-VAE model to produce entity Semantic IDs, which are stored and then joined with user requests (user engagement features and target item features) to feed the ranking models.
Pipeline Steps:
- Ad Creation Time: When a new ad is created, its content information (text, image, video) is processed by the CU models.
- Semantic ID Generation: The output CU embeddings are then fed into the RQ-VAE model, which computes the Semantic ID signal (the sequence of discrete codes) for each raw ad ID.
- Data Storage: This Semantic ID signal is stored in the Entity Data Store, a centralized repository for item metadata.
- Feature Generation Stage: During feature generation, raw item IDs (for the target item) and user engagement raw ID histories are enriched by looking up their corresponding Semantic ID signals from the Entity Data Store to create semantic features.
- Serving Requests: When a user request for recommendations arrives, the precomputed semantic features (along with other features) are fetched.
- Ranking Models: These features are then passed to downstream ranking models to generate predictions and deliver ranked ads.

This real-time serving pipeline ensures that Semantic ID features are available for both target item features and user engagement history features during live inference.
7.3. Production Performance Improvement
The integration of Semantic ID features into Meta's flagship ads ranking model yielded significant performance gains.
- Feature Creation: Six sparse features and one sequential feature were created from different content embedding sources (text, image, video) using Semantic ID.
- Significance Threshold: In Meta ads ranking, an offline NE gain greater than 0.02% is considered significant.

The following are the results from Table 8 of the original paper:
| | Train NE Gain | Eval NE Gain |
|---|---|---|
| Baseline + 6 sparse features | -0.063% | -0.071% |
| Baseline + 1 sequential feature | -0.110% | -0.123% |
Table 8: NE improvement from incorporating Semantic ID features in the flagship Meta ads ranking model.
Analysis of Table 8:
- Adding 6 sparse Semantic ID features alone resulted in an Eval NE Gain of −0.071%.
- Adding 1 sequential Semantic ID feature (presumably for user history) yielded an even greater Eval NE Gain of −0.123%. Since lower NE is better, these negative gains are improvements (see the NE definition sketched below).
- Overall Online Gain: Across multiple ads ranking models, incorporating Semantic ID features led to a 0.15% gain in the top-line online metric. This is considered highly significant for a system as optimized and large-scale as Meta Ads, serving billions of users.
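For readers unfamiliar with the metric, the following sketch assumes NE is the standard Normalized Entropy used in ads click prediction (the model's cross-entropy normalized by the entropy of the average empirical CTR), with NE gain reported as the relative change versus the baseline; the paper's exact convention may differ slightly.

$$
\mathrm{NE} = \frac{-\tfrac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1-y_i)\log(1-p_i)\bigr]}{-\bigl[\bar{p}\log\bar{p} + (1-\bar{p})\log(1-\bar{p})\bigr]},
\qquad
\text{NE gain} = \frac{\mathrm{NE}_{\text{new}} - \mathrm{NE}_{\text{base}}}{\mathrm{NE}_{\text{base}}} \times 100\%,
$$

where $y_i \in \{0, 1\}$ are click labels, $p_i$ are predicted click probabilities, and $\bar{p}$ is the average empirical CTR. Under this convention, the negative gains in Table 8 correspond to lower NE, i.e., better predictions.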
7.4. Semantic and Prediction Similarity
The paper investigates whether semantic similarity (captured by Semantic ID) correlates with prediction similarity (user engagement patterns).
- Online A/B Test: An online A/B test was conducted. For 50% of users, a recommended item was randomly swapped with a different item that shared the same Semantic ID prefix.
- Metric: The Click Loss Rate was measured as the relative CTR drop caused by the swap,

  $$\text{Click Loss Rate} = \frac{\mathrm{CTR}(R) - \mathrm{CTR}(R')}{\mathrm{CTR}(R)},$$

  where:
  - $R$: original set of recommended items;
  - $R'$: mutated set, where an item is replaced by a semantically similar one (same Semantic ID prefix);
  - CTR: Click-Through Rate.

  A smaller Click Loss Rate indicates that swapping with a semantically similar item does not significantly harm CTR, implying that items with similar semantics also have similar user engagement predictions. (A sketch of the mutation step appears below.)
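Below is a minimal sketch of the mutation applied in the treatment arm, assuming items are grouped by their k-prefix Semantic ID and one slate position is swapped at random; function names and data layouts are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def build_prefix_index(semantic_ids: Dict[int, List[int]], k: int) -> Dict[Tuple[int, ...], List[int]]:
    """Group item IDs by their k-prefix Semantic ID."""
    index = defaultdict(list)
    for item_id, codes in semantic_ids.items():
        index[tuple(codes[:k])].append(item_id)
    return index

def mutate_slate(slate: Sequence[int],
                 semantic_ids: Dict[int, List[int]],
                 prefix_index: Dict[Tuple[int, ...], List[int]],
                 k: int,
                 rng: random.Random = random.Random(0)) -> List[int]:
    """Swap one recommended item for a different item sharing the same k-prefix."""
    mutated = list(slate)
    pos = rng.randrange(len(mutated))
    prefix = tuple(semantic_ids[mutated[pos]][:k])
    candidates = [i for i in prefix_index[prefix] if i != mutated[pos]]
    if candidates:
        mutated[pos] = rng.choice(candidates)
    return mutated

# Click Loss Rate is then (CTR(original) - CTR(mutated)) / CTR(original),
# aggregated over the exposed traffic.
```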
Figure 5 Click Loss Rate reduction from Semantic ID: the Click Loss Rate at each Semantic ID prefix depth, decreasing from 0-prefix to 3-prefix.
Analysis of Figure 5:
- The chart shows that as the Semantic ID depth (prefix length) increases, the Click Loss Rate monotonically decreases.
  - For example, using a 0-prefix (broadest semantic category, essentially no semantic constraint) results in the highest Click Loss Rate (around 0.15%).
  - As the prefix depth increases to 1-prefix, 2-prefix, and 3-prefix, the Click Loss Rate steadily drops, approaching 0% for 3-prefix.
- Conclusion: This demonstrates a strong correlation between semantic similarity (as defined by Semantic ID) and prediction similarity (user click behavior). Deeper Semantic ID prefixes capture finer-grained semantic details, which translates to even closer prediction behavior. This validates the representation space analysis from Section 6.3 and underscores the robustness of Semantic ID for ranking models.

The following are the results from Figure 6 of the original paper:
Figure 6 The 30-day click distribution in raw ID and Semantic ID spaces.
Analysis of Figure 6:
- This bar chart visually compares the click distribution for raw Ad IDs and Semantic IDs over a 30-day period.
- For raw Ad IDs, the click distribution is extremely skewed, with a very small number of IDs receiving a disproportionately high number of clicks and the vast majority receiving very few (a long tail).
- For Semantic IDs, while still exhibiting some skew, the distribution appears less extreme. The "head" is less dominant, and the "tail" is potentially smoother or better represented, indicating that Semantic ID helps to normalize the click distribution by grouping similar items. This aligns with the idea of reducing impression skew by enabling knowledge transfer.
7.5. A/A Variance
Random hashing introduces prediction variance for identical items. If an advertiser creates an exact copy of an ad with a different raw ID, random hashing might assign them to different embeddings, leading to different model predictions and delivery behaviors. This A/A variance (where "A/A" refers to identical items) is undesirable.
- Semantic ID Solution: Semantic ID mitigates A/A variance by ensuring that exact copies or very similar items will often have the same k-prefix Semantic ID, leading to identical or very similar embeddings.
- Measurement: An online shadow ads experiment was set up to measure the relative A/A prediction difference (AAR) for pairs of identical items, i.e., the absolute difference between the two model predictions $p_1$ and $p_2$ for the A/A pair, normalized by their magnitude (a sketch of this computation follows this list).
- Result: The production model with six Semantic ID sparse features achieved a 43% reduction in average AAR compared to the same model without these features.
- Implication: This significant reduction in A/A variance improves the robustness of ad ranking orders, enhances the system's ability to accurately target audiences, and, crucially, builds advertiser trust in the consistency of Meta's recommendation system. The authors believe the majority of this reduction comes from improved tail item modeling, where Semantic ID helps stabilize representations for less popular items.
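A minimal sketch of the AAR computation described above, assuming a symmetric relative difference between the two predictions of an A/A pair; the exact normalization used in the production shadow experiment is not specified here.

```python
from typing import Iterable, Tuple

def relative_aa_difference(p1: float, p2: float) -> float:
    """Relative prediction difference for one pair of identical (A/A) ads."""
    return abs(p1 - p2) / ((p1 + p2) / 2.0)

def average_aar(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average AAR over many A/A pairs collected from shadow traffic."""
    pairs = list(pairs)
    return sum(relative_aa_difference(p1, p2) for p1, p2 in pairs) / len(pairs)

# The reported result: a 43% reduction in average AAR once the six
# Semantic ID sparse features are added to the production model.
print(average_aar([(0.031, 0.029), (0.120, 0.118)]))
```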
8. Conclusion & Reflections
8.1. Conclusion Summary
This paper successfully demonstrates the utility of Semantic ID in creating a stable ID space for item representation within large-scale recommendation systems. It introduces Semantic ID prefix-ngram, a novel token parameterization technique that significantly enhances the performance of Semantic ID in ranking models. Through extensive offline experiments, the authors confirm that Semantic ID effectively mitigates the detrimental effects of embedding representation instability caused by item cardinality, impression skew, and ID drifting, outperforming both random hashing and individual embeddings baselines, particularly for tail items and new cold start items. The benefits extend to user history modeling, where Semantic ID leads to outsized gains in attention-based contextualizing modules. The paper also reports the successful productionization of Semantic ID features in Meta's ads recommendation system, achieving notable 0.15% online performance gains and significantly reducing downstream ad delivery variance (A/A variance) in live deployments, thereby improving prediction stability and advertiser trust.
8.2. Limitations & Future Work
The paper implicitly points to several limitations and areas for future work, primarily through the challenges it aims to solve and the improvements it highlights:
- Content Understanding Model Dependency: The quality of Semantic IDs is directly dependent on the underlying content understanding models (e.g., multimodal image and text foundation models). Improvements in these upstream models would likely further enhance Semantic ID effectiveness.
- RQ-VAE Configuration: The paper explores different RQ-VAE configurations (number of layers L and codebook size K) and prefix-ngram depths, but there may be further optimization opportunities in RQ-VAE architectures, training objectives, or adaptive quantization strategies.
- Generalizability beyond Ads Ranking: While validated in Meta's ads ranking, further research could explore its efficacy across diverse recommendation domains (e.g., e-commerce, content platforms with different item types and user behaviors).
- Dynamic Semantic ID Adaptation: While Semantic ID aims for stability, semantic concepts themselves can evolve over very long periods. Future work might explore adaptive RQ-VAE training or codebook updates to account for gradual shifts in item semantics.
- Trade-offs between Granularity and Cardinality: The choice of prefix-ngram depth and RQ-VAE cardinality involves a trade-off. Investigating methods to automatically determine optimal granularity given computational constraints could be a future direction.
8.3. Personal Insights & Critique
- Innovation of Semantic Collisions: The core innovation of transforming "random collisions" into "semantically meaningful collisions" is a powerful paradigm shift. Instead of fighting collisions, this work embraces and structures them to facilitate knowledge transfer. This is a very elegant solution to a long-standing problem in high-cardinality feature learning.
- Applicability to Other Domains: The Semantic ID approach, particularly the RQ-VAE for content-based quantization, is highly transferable. Any domain dealing with high-cardinality categorical features that also have rich content (e.g., e-commerce products, news articles, scientific papers, short-form videos) could benefit. For instance, in scientific paper recommendation, Semantic IDs could be derived from paper abstracts and titles, enabling better cold-start recommendations for new papers and improving recommendations for niche research areas.
- Addressing Trust Issues: The reduction in A/A variance is a crucial practical benefit, especially for advertising platforms. Ensuring consistent predictions for identical items builds trust with advertisers, which is often as important as raw performance gains in real-world systems. This highlights an often overlooked, yet critical, aspect of system reliability.
- Beyond Item Embeddings: The concept of a stable, semantically meaningful ID space could potentially be extended to other entities beyond items, such as user interests or query segments, to enhance stability and interpretability across various model components.
- Potential Issues/Assumptions:
  - Content Model Quality: The entire Semantic ID framework rests on the assumption that the upstream content understanding models provide high-quality, semantically meaningful embeddings. If the content models are biased or inaccurate, the Semantic IDs will inherit these flaws.
  - Semantic Drift of Clusters: While item raw IDs drift, the meaning of clusters can also slowly drift over time, especially in very dynamic environments (e.g., fashion trends, new product categories emerging rapidly). The paper states that broad semantic categories remain "temporally stable," which is generally true, but extreme long-term shifts could eventually necessitate RQ-VAE retraining or adaptive codebook management.
  - Computational Cost of RQ-VAE: Training RQ-VAE on billions of items (the past three months of data in production) and generating content embeddings from multimodal inputs can be computationally intensive, though this is managed offline. The trade-off between RQ-VAE complexity (L, K) and the resulting performance gains is well explored, but operational cost remains a factor.
  - Explainability: While Semantic ID improves interpretability by grouping similar items, the exact semantic meaning of a specific code sequence might still require human interpretation or auxiliary tools to fully explain why certain items are clustered together.