MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation
TL;DR Summary
MMQ-v2 adaptively aligns, denoises, and amplifies multimodal signals to enhance semantic ID learning, overcoming noise and signal ambiguity in recommendation systems, leading to superior performance on large-scale datasets.
Abstract
Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail data. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: Indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Signal Obscurity: The equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic IDs learning. The semantic IDs generated by this framework are named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router to amplify critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation." This title indicates a focus on improving Semantic IDs (SIDs) in recommender systems by adaptively integrating and refining multimodal (content and behavioral) information.
1.2. Authors
The authors are:
- Yi Xu (Alibaba International Digital Commerce Group, Beijing, China)
- Moyu Zhang (Alibaba International Digital Commerce Group, Beijing, China)
- Chaofan Fan (Alibaba International Digital Commerce Group, Beijing, China)
- Jinxin Hu (Alibaba International Digital Commerce Group, Beijing, China)
- Xiaochen Li (Alibaba International Digital Commerce Group, Beijing, China)
- Yu Zhang (Alibaba International Digital Commerce Group, Beijing, China)
- Xiaoyi Zeng (Alibaba International Digital Commerce Group, Beijing, China)
- Jing Zhang (Wuhan University, School of Computer Science, Wuhan, China)
Most authors are affiliated with Alibaba International Digital Commerce Group, indicating an industrial research background with a focus on practical applications in large-scale e-commerce. Jing Zhang's affiliation with Wuhan University suggests an academic contribution, potentially bridging theoretical advancements with industrial relevance.
1.3. Journal/Conference
The publication venue is listed only as an ACM proceedings placeholder ("Conference acronym 'XX", ACM, New York, NY, USA), meaning the camera-ready conference name had not yet been filled in. This suggests the paper is submitted to an ACM conference, which typically implies a peer-reviewed venue in computer science, likely related to information retrieval, data mining, or artificial intelligence, given the topic. ACM conferences are highly reputable in these fields.
1.4. Publication Year
The publication timestamp provided in the abstract is 2025-10-29T15:27:23.000Z, indicating a publication date of October 29, 2025. This suggests the paper is a very recent or forthcoming publication.
1.5. Abstract
Industrial recommender systems often struggle with the scalability and generalization limitations of unique Item Identifiers (ItemIDs) in dynamic, sparse, and long-tail datasets. Content-based Semantic IDs (SIDs) offer a partial solution by quantizing content to share knowledge, but they lack the dynamic behavioral properties essential for comprehensive recommendations. Existing methods attempting to incorporate behavioral information fail to account for the highly skewed and diverse nature of user-item interactions, particularly the vast information gap between popular and long-tail items. This oversight leads to two critical problems: Noise Corruption, where sparse behavioral data from long-tail items corrupts their content representations, and Signal Obscurity, where equal weighting of SIDs prevents distinguishing important behavioral signals from uninformative ones.
To address these issues, the paper proposes MMQ-v2, a mixture-of-quantization framework that adaptively Aligns, Denoises, and Amplifies multimodal information from content and behavior modalities to learn Semantic IDs. The resulting SIDs are named ADA-SID. The framework introduces two key innovations: an adaptive behavior-content alignment mechanism that considers information richness to prevent noise corruption, and a dynamic behavioral router that amplifies critical signals by assigning different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's superior performance in both generative and discriminative recommendation tasks.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.25622
- PDF Link: https://arxiv.org/pdf/2510.25622v2.pdf

The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the limitations of item representation in industrial recommender systems. Traditionally, items are identified by unique ItemIDs. While simple, this approach struggles significantly in large-scale, dynamic environments characterized by an abundance of new items and long-tail data (items with very few interactions). ItemIDs lack inherent semantic meaning, making it difficult for the system to generalize to new or rarely interacted items.
Content-based Semantic IDs (SIDs) emerged as an improvement, addressing generalization by quantizing multimodal content (e.g., images, text descriptions) into discrete codes. This allows items with similar content to share similar SIDs, thereby enabling knowledge sharing. However, this approach inherently ignores the dynamic behavioral properties of items—how users actually interact with them. User interactions reveal evolving popularity, changing styles, and specific preferences that static content alone cannot capture, thus limiting the expressiveness and performance of purely content-based SIDs.
Existing attempts to incorporate behavioral information into SIDs, primarily through behavior-content alignment, also face significant challenges. The critical oversight is the inherent disparity and skewed distribution of user-item interactions. Unlike content features, which might be relatively uniform, behavioral data is highly irregular:
- Popular Items (Head Items): Have abundant, diverse, and rich interaction data.
- Long-tail Items (Tail Items): Suffer from extreme sparsity and very few interactions, leading to unreliable behavioral signals.

This information gap leads to two critical limitations in current methods:

- Noise Corruption: Indiscriminate alignment between behavioral and content information can be detrimental. For long-tail items, sparse and noisy behavioral signals can corrupt their otherwise stable and reliable content representations. For popular items, uniform alignment may overcompress their rich behavioral patterns, leading to a loss of unique behavioral signatures.
- Signal Obscurity: Current methods often treat all SIDs equally, failing to reflect the varying importance of different behavioral signals. This makes it difficult for downstream recommendation tasks to prioritize informative SIDs (e.g., those from popular items with strong signals) over uninformative SIDs (e.g., those from long-tail items with weak signals), leading to increased optimization burden and reduced accuracy.

The paper's entry point is to explicitly address this fundamental disparity in behavioral information richness. Its innovative idea is to adaptively handle the fusion of content and behavior, tailoring the alignment and amplification processes to the specific information richness of each item's behavioral data.
2.2. Main Contributions / Findings
The paper proposes ADA-SID (Adaptive Denoise and Amplify Semantic IDs), a novel mixture-of-quantization framework called MMQ-v2, to tackle the aforementioned issues. Its primary contributions are:
- Customized Multimodal SID Learning based on Information Richness: The paper is the first to propose customizing behavior-content multimodal SIDs for items by considering the information richness of their collaborative signals. This significantly enhances the expressiveness of SIDs and improves generalization for downstream recommendation tasks.
- Adaptive Behavior-Content Alignment Mechanism: An adaptive tri-modal (behavior-vision-text) alignment strategy is introduced. This mechanism dynamically calibrates the alignment strength between the behavioral and static content modalities. By employing an alignment strength controller that adjusts intensity based on an item's interaction data richness, it mitigates noise corruption for long-tail items while preserving diverse behavioral information for popular ones.
- Dynamic Behavioral Router: A dynamic behavioral router is proposed that learns to assign adaptive importance weights to an item's set of behavioral SIDs. This mechanism effectively amplifies critical collaborative signals and attenuates uninformative ones, improving robustness and downstream recommendation performance.
- Comprehensive Experimental Validation: Extensive offline experiments on both public and large-scale industrial datasets, along with online A/B tests, demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks. Notably, online A/B tests showed a +3.50% increase in Advertising Revenue and a +1.15% increase in Click-Through Rate (CTR).

The key conclusion is that by adaptively managing the fusion of multimodal information, specifically by recognizing and responding to the varying reliability of behavioral signals across popular and long-tail items, ADA-SID generates more robust, expressive, and effective Semantic IDs. This approach leads to substantial performance gains in real-world recommender systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Item Identifiers (ItemIDs): In recommender systems, ItemIDs are unique, discrete numerical or alphanumeric labels assigned to each item (e.g., a product, movie, song). They are simple to use but inherently lack semantic meaning, meaning the ID itself doesn't convey any information about the item's features or properties.
  - Limitation: When a new item is introduced (cold-start problem) or an item has very few interactions (long-tail data), there's no semantic basis to recommend it. Scalability also becomes an issue in very large item catalogs, as the embedding table for ItemIDs grows proportionally.
- Semantic IDs (SIDs): Semantic IDs are discrete codes that carry semantic meaning, allowing similar items to have similar or shared IDs. Instead of unique identifiers for each item, SIDs represent item properties or characteristics.
  - Benefit: They enable knowledge sharing between similar items, improving generalization, especially for cold-start and long-tail items. They can also represent items more compactly.
- Content Quantization: This is the process of converting continuous-valued content features (e.g., numerical vectors representing text or images) into discrete, symbolic representations (SIDs). It's akin to reducing the precision of data while trying to preserve as much semantic information as possible.
  - Example: Using a Vector Quantization (VQ) technique to map a high-dimensional image embedding to a discrete code from a predefined codebook.
- Sparse Long-Tail Data: In many real-world datasets, particularly in e-commerce or media, a small number of "head" items (e.g., popular products, blockbuster movies) account for a large proportion of interactions, while a vast number of "long-tail" items (niche products, obscure films) have very few interactions. This creates a highly skewed distribution, known as sparse long-tail data.
  - Challenge: Learning meaningful representations or making accurate recommendations for long-tail items is difficult due to insufficient interaction data.
- Multimodal Information: Refers to data from multiple modalities, such as text (product descriptions, titles), vision (product images), and behavioral/collaborative signals (user interaction history). Combining these different types of data can provide a richer and more comprehensive understanding of items.
- Behavioral/Collaborative Signals: Information derived from how users interact with items. This includes clicks, purchases, views, ratings, etc.
  - Collaborative Filtering: A classic recommendation technique that leverages these signals, assuming users who agreed in the past will agree in the future.
  - Dynamic Nature: Behavioral signals are dynamic because user preferences and item popularity evolve over time.
- Vector Quantization (VQ): A technique used in data compression and signal processing to map a vector from a continuous vector space into a finite set of vectors (a codebook). Each vector in the codebook is called a codeword. The input vector is replaced by the index of its closest codeword in the codebook.
  - How it works: Given an input vector, it finds the closest codeword in the codebook (e.g., using Euclidean distance or cosine similarity) and replaces the input vector with that codeword; a minimal sketch appears after this list.
  - Application in SIDs: Used to map continuous item embeddings into discrete SIDs.
- RQ-VAE (Residual Quantized Variational Autoencoder): An advanced Vector Quantization technique, often used in generating Semantic IDs. It quantizes representations hierarchically or iteratively: instead of quantizing the entire embedding at once, it quantizes a residual (the difference between the original embedding and the quantized version) in multiple stages. The residual lookup is included in the sketch after this list. This allows for more granular and expressive discrete representations.
  - Components: Typically involves an encoder that maps the input to a latent space, a quantizer that maps latent vectors to discrete codes from a codebook, and a decoder that reconstructs the input from the quantized codes.
- Contrastive Learning: A self-supervised learning paradigm where the model learns to bring representations of "similar" data points (positive pairs) closer together in an embedding space, while pushing "dissimilar" data points (negative pairs) further apart.
  - Application in SIDs: Used to align representations from different modalities (e.g., content and behavior) by treating corresponding multimodal embeddings for the same item as positive pairs and those from different items as negative pairs.
- Mixture-of-Experts (MoE): A machine learning technique where multiple "expert" models are trained to handle different subsets or aspects of the input data. A "gating network" or "router" learns to decide which expert(s) to route each input to.
  - Benefit: Can increase model capacity and performance without a proportional increase in computational cost by sparsely activating experts.
  - Application in MMQ-v2: Used to have shared experts for common multimodal information and specific experts for modality-specific information, along with a dynamic behavioral router to selectively activate behavioral SIDs.
3.2. Previous Works
The paper categorizes previous works into traditional ItemIDs, content-based SIDs, and behavior-content aligned SIDs.
- Traditional ItemIDs:
  - Concept: Unique identifiers for items.
  - Limitations: Lack semantics, struggle with cold-start items and generalization, as discussed in [1, 2, 7] (e.g., A. Sigh et al., 2024; C. Zheng et al., 2025).
- Content-based SIDs:
  - Concept: Quantize item content features into discrete codes to share knowledge among similar items.
  - Examples:
    - RQ-VAE (Residual Quantized Variational Autoencoder), pioneered by TIGER [22]: Transforms multimodal content (titles, descriptions, images) into textual embeddings using a pre-trained Large Language Model (LLM) and then quantizes these embeddings into hierarchical SIDs.
    - OPQ (Optimized Product Quantization) [14]: Used in RPG [25] to convert pre-trained textual embeddings into a tuple of unordered SIDs. OPQ aims to find an optimal orthogonal rotation for product quantization, minimizing quantization distortion.
    - SPM-SID [1] and PMA [2]: Inspired by LLMs, these methods explore more granular subword-based compositions for SIDs.
  - Limitation: Performance ceiling due to ignoring dynamic behavioral properties [20, 30, 34].
- Behavior-Content Aligned SIDs: These methods attempt to incorporate dynamic behavioral information into SIDs. They are broadly categorized into two approaches:
  - Injecting Explicit Collaborative Signals:
    - LCRec [31]: Designs a series of alignment tasks to unify semantic and collaborative information.
    - ColaRec [32]: Distills collaborative signals directly from a pre-trained recommendation model and combines them with content information.
    - IDGenRec [34]: Leverages LLMs to generate semantically rich textual identifiers, showing potential in zero-shot settings by incorporating behavioral context.
  - Aligning Pre-trained Representations:
    - EAGER [30]: Generates separate collaborative and content SIDs using K-means on pre-trained embeddings and then aligns them in downstream tasks.
    - DAS [42]: Employs multi-view contrastive learning to maximize the mutual information between SIDs and collaborative signals, generating hierarchical, behavior-aware content SIDs using RQ-VAE.
    - LETTER [27]: Integrates hierarchical semantics, collaborative signals, and code assignment diversity to generate behavior-content fused SIDs using RQ-VAE.
    - MM-RQ-VAE [17]: Generates collaborative, textual, and visual SIDs from pre-trained collaborative and multimodal embeddings. It also introduces contrastive learning for behavior-content alignment.
3.3. Technological Evolution
The evolution of item representation for recommender systems can be summarized as follows:
- ItemIDs (Traditional): Simple, unique identifiers. Early systems relied heavily on these, but they became a bottleneck for scalability, cold-start, and generalization in large, dynamic catalogs.
- Content-based SIDs: Introduced Vector Quantization techniques (like RQ-VAE and OPQ) to embed semantic meaning directly into item representations by quantizing content features. This marked a shift towards knowledge sharing and better generalization for new items.
- Behavior-Content Aligned SIDs (Early Attempts): Recognized the limitations of content-only SIDs and started integrating behavioral/collaborative signals. This often involved explicit injection of collaborative data or alignment of pre-trained content and behavioral embeddings using techniques like contrastive learning.
- MMQ-v2 (ADA-SID): This paper represents a crucial next step. While previous aligned methods acknowledge the importance of behavior, they largely ignore the inherent heterogeneity and information asymmetry (skewed distribution) of behavioral data across items. MMQ-v2 explicitly addresses this by introducing adaptive mechanisms (an alignment strength controller and a dynamic behavioral router) to intelligently Align, Denoise, and Amplify signals based on their information richness, thereby overcoming the Noise Corruption and Signal Obscurity issues prevalent in earlier aligned SID approaches.
3.4. Differentiation Analysis
Compared to the main methods in related work, ADA-SID's core innovations and differentiators are:
- Adaptive Handling of Behavioral Information Richness: This is the most significant difference. Existing behavior-content aligned SIDs (e.g., DAS, LETTER, MM-RQ-VAE) tend to apply alignment uniformly across all items. ADA-SID explicitly recognizes that the reliability and quantity of behavioral signals differ vastly between popular and long-tail items, and designs mechanisms to adapt to this information richness.
  - Contrast with Uniform Alignment: While methods like DAS and MM-RQ-VAE use contrastive learning for alignment, ADA-SID modulates the strength of this alignment, which is crucial for denoising.
- Noise Corruption Mitigation: For long-tail items with sparse and noisy behavioral data, ADA-SID's adaptive behavior-content alignment mechanism (the Alignment Strength Controller) dynamically reduces the influence of these unreliable signals during alignment. This prevents the corruption of their stable content representations, a problem that indiscriminate alignment in other methods exacerbates.
- Signal Obscurity Resolution and Amplification: ADA-SID introduces a dynamic behavioral router. Instead of assigning equal importance to all SIDs, this router learns to assign adaptive weights to behavioral SIDs based on their information richness. This amplifies critical signals from popular items (where behavioral data is rich and informative) and attenuates uninformative signals from long-tail items, effectively resolving the signal obscurity issue. This is a novel approach compared to methods that simply fuse or align without explicit weighting based on signal quality.
- Mixture-of-Quantization Network: ADA-SID uses a mixture-of-quantization network with shared experts and specific experts to capture both shared multimodal information and modality-specific details. This architecture, combined with the adaptive alignment and routing, provides a more nuanced and flexible way to represent items than the simpler RQ-VAE structures used in many baselines (e.g., RQ-VAE, RQ-Kmeans, LETTER).

In essence, ADA-SID doesn't just incorporate behavioral signals; it intelligently manages them, making it more robust and effective across the entire spectrum of item popularity, especially for the challenging long tail.
4. Methodology
4.1. Principles
The core idea behind ADA-SID (Adaptive Denoise and Amplify Semantic IDs) is to create expressive and noise-robust Semantic IDs (SIDs) by intelligently integrating multimodal information from content (text and vision) and behavior. The fundamental principle is to acknowledge and leverage the varying information richness of behavioral signals across items. Instead of treating all behavioral signals equally, ADA-SID aims to:
- Align: Fuse dynamic behavioral information with static content.
- Denoise: Prevent noisy behavioral signals (especially from long-tail items) from corrupting reliable content representations.
- Amplify: Highlight and prioritize critical behavioral signals (especially from popular items) to improve their impact on downstream tasks.

This is achieved through a mixture-of-quantization framework that adaptively controls the strength of behavior-content alignment and dynamically weights behavioral SIDs based on an item's interaction data richness.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The task is to quantize pre-trained textual, visual, and behavioral embeddings of an item into a sequence of discrete Semantic IDs.
For a given item, we first obtain:

- a pre-trained vision embedding $\mathbf{e}_v$,
- a pre-trained text embedding $\mathbf{e}_t$, and
- a pre-trained behavioral embedding $\mathbf{e}_b$ (obtained from a sequential recommender model like SASRec [16] using collaborative signals).

An item tokenizer then quantizes these high-dimensional embeddings into a discrete sequence of SIDs. Denoting the SID sequence by $(s_1, s_2, \dots, s_L)$, the formulation can be written as

$$(s_1, s_2, \dots, s_L) = \mathrm{Tokenizer}(\mathbf{e}_v, \mathbf{e}_t, \mathbf{e}_b),$$

where $L$ is the length of the Semantic IDs sequence and $s_i$ denotes the $i$-th discrete semantic ID in the sequence. The notation implies a concatenation or fusion of the three pre-trained embeddings as input to the tokenizer.
4.2.2. Behavior-Content Mixture-of-Quantization Network
This network is designed to simultaneously capture both shared information across modalities and modality-specific information. It consists of Shared Experts and Specific Experts.
Shared Experts
The shared experts are responsible for quantizing the aligned behavior-content information into shared latent embeddings, which then generate shared SIDs.
- Projection: The pre-trained textual ($\mathbf{e}_t$), visual ($\mathbf{e}_v$), and behavioral ($\mathbf{e}_b$) embeddings are first projected into a unified high-dimensional space using small two-layer deep neural networks (DNNs). The resulting hidden representations are $\mathbf{h}_t$, $\mathbf{h}_v$, and $\mathbf{h}_b$, respectively.
- Concatenation: These projected hidden representations are then concatenated to form a unified representation $\mathbf{h} = [\mathbf{h}_t; \mathbf{h}_v; \mathbf{h}_b]$.
- Adaptive Alignment: This concatenated representation is optimized by the adaptive behavior-content alignment mechanism (detailed in Section 4.2.3), which ensures the behavioral and content modalities are aligned appropriately based on information richness.
- Encoding and Quantization: For each of the $N_s$ shared experts, the aligned representation is encoded into a shared latent embedding. This latent embedding is then quantized into a discrete semantic ID by finding the most similar codeword in the shared codebook, i.e., the codeword that maximizes the cosine similarity with the latent embedding. Here, $N_s$ is the number of shared experts and the codebook contains $K$ codewords, where $K$ is the codebook size. A minimal sketch of this shared-expert path is shown below.
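As a rough illustration of the shared-expert path described above, the sketch below projects each modality with a small two-layer DNN, concatenates the hidden states, and lets each shared expert pick its most cosine-similar codeword. All module names, dimensions, and expert/codebook sizes are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertQuantizer(nn.Module):
    """Illustrative shared-expert path: project each modality, concatenate,
    encode with several experts, and pick the most cosine-similar codeword."""
    def __init__(self, dim_in=768, dim_h=128, n_experts=4, codebook_size=300):
        super().__init__()
        proj = lambda: nn.Sequential(nn.Linear(dim_in, dim_h), nn.ReLU(), nn.Linear(dim_h, dim_h))
        self.proj_t, self.proj_v, self.proj_b = proj(), proj(), proj()
        self.experts = nn.ModuleList(nn.Linear(3 * dim_h, dim_h) for _ in range(n_experts))
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim_h))

    def forward(self, e_t, e_v, e_b):
        h = torch.cat([self.proj_t(e_t), self.proj_v(e_v), self.proj_b(e_b)], dim=-1)
        sids, codewords = [], []
        for expert in self.experts:
            z = expert(h)                                                  # shared latent embedding
            sims = F.cosine_similarity(z.unsqueeze(1), self.codebook.unsqueeze(0), dim=-1)
            idx = sims.argmax(dim=-1)                                      # most similar codeword = shared SID
            sids.append(idx)
            codewords.append(self.codebook[idx])
        return torch.stack(sids, dim=1), torch.stack(codewords, dim=1)

# toy usage with a batch of 2 items
e = [torch.randn(2, 768) for _ in range(3)]
ids, cw = SharedExpertQuantizer()(*e)
print(ids.shape, cw.shape)   # (2, 4) SIDs and (2, 4, 128) codeword embeddings
```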
Specific Experts
The specific experts are designed to learn modality-specific information and generate modality-specific SIDs. For each modality (text, vision, behavior), there is a group of specific experts and corresponding codebooks.
- Encoding: For the text modality, the pre-trained $\mathbf{e}_t$ is transformed by the text-specific experts into latent representations. Similarly, for the vision and behavior modalities, the pre-trained $\mathbf{e}_v$ and $\mathbf{e}_b$ are transformed into their own latent representations. (The analysis notes an apparent typo in the paper's corresponding formulas, where the text embedding symbol is reused in place of the vision and behavior embeddings; the intended inputs are presumably the vision and behavior embeddings, but the paper's formulas are kept as written.)
- Quantization: Analogous to the shared experts, the modality-specific SIDs for text, vision, and behavior are searched from their respective codebooks using cosine distance, yielding the corresponding codeword representations. Here, $N_t$, $N_v$, and $N_b$ denote the numbers of experts for the text, vision, and behavior modalities, respectively.
Decoder and Reconstruction
The decoder reconstructs the original fused pre-trained embeddings from the combined latent representations and their quantized codeword representations.

- Fused Latent Representations: The total latent representation $\mathbf{z}$ is a sum of the shared latent embeddings and the modality-specific latent embeddings, weighted by gating mechanisms; its quantized counterpart $\hat{\mathbf{z}}$ is formed analogously from the corresponding codewords. The gates for the text and visual modalities are learned scalar weights, while the weight on the $i$-th behavioral SID is the output of the dynamic behavioral router, which selectively determines its importance (detailed in Section 4.2.4).
  - Note on formula: The analysis points out a potential typo in the paper's formula for the quantized fused representation, where a quantized shared-expert embedding appears to be used in place of the quantized vision-specific-expert embedding. For consistency with the unquantized formula and the idea of modality-specific experts, the vision-specific codeword would be expected there, but the paper's formula is presented as written.
- Reconstruction Loss: The decoder takes the fused latent representation and its quantized version and reconstructs the original embeddings. The reconstruction loss $\mathcal{L}_{recon}$ is typically a mean squared error between the original embeddings and their reconstructions. The quantized representation enters the decoder via a stop-gradient operator $\mathrm{sg}[\cdot]$, so gradients do not flow through that branch, a common practice in VQ-VAE training that ensures the encoder learns to map embeddings close to the codebook entries. The term $\mathbf{z} + \mathrm{sg}[\hat{\mathbf{z}} - \mathbf{z}]$ effectively uses $\hat{\mathbf{z}}$ for the forward pass but $\mathbf{z}$ for the backward-pass gradients (see the sketch below).
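The stop-gradient trick mentioned above is the standard VQ-VAE straight-through estimator. A minimal PyTorch sketch, assuming a conventional reconstruction + codebook + commitment objective (the paper's exact loss weighting may differ):

```python
import torch

def straight_through(z, z_q):
    """Forward pass uses the quantized z_q; gradients flow back to z as if no quantization happened."""
    return z + (z_q - z).detach()

def vq_losses(x, x_hat, z, z_q, beta=0.25):
    """Typical VQ-VAE objective: reconstruction + codebook + commitment terms.
    sg[.] is implemented with .detach(); beta weights the commitment term."""
    recon = torch.mean((x - x_hat) ** 2)                  # L_recon (MSE)
    codebook = torch.mean((z.detach() - z_q) ** 2)        # pull codewords toward encoder outputs
    commit = beta * torch.mean((z - z_q.detach()) ** 2)   # keep encoder close to chosen codewords
    return recon + codebook + commit
```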
4.2.3. Adaptive Behavior-Content Alignment
This mechanism aims to align behavior and content modalities while being sensitive to the varying information richness of behavioral signals, thereby preventing noise corruption.
Alignment Strength Controller
This component outputs a weight to modulate the intensity of the behavior-content alignment for each item. It's based on two principles:

- Long-tail items: Alignment strength should smoothly decay towards zero.
- Popular items: Alignment strength should increase with their estimated information richness.

The L2-magnitude of an item's behavioral embedding $\mathbf{e}_b$ is used as a proxy for its information richness.

- Normalization: Given a pre-trained behavior embedding matrix with $N$ items (vectors $\mathbf{e}_b^{(1)}, \dots, \mathbf{e}_b^{(N)}$ of dimension $d$), the maximum ($m_{\max}$) and minimum ($m_{\min}$) L2-magnitudes are calculated across all behavioral embeddings. For the $i$-th embedding, its normalized L2-magnitude is computed as
  $$\tilde{m}_i = \frac{\|\mathbf{e}_b^{(i)}\|_2 - m_{\min}}{m_{\max} - m_{\min}}.$$
- Weight Calculation: The alignment strength weight $w_i$ is then calculated using a sigmoid-like function of the form
  $$w_i = \sigma\left(\alpha\,\tilde{m}_i - \beta\right),$$
  where $\sigma$ is the sigmoid function, and $\alpha$ and $\beta$ are hyperparameters that control the steepness and threshold of the curve, determining how rapidly alignment strength changes with information richness. In the experiments, $\alpha = 10$ and $\beta = 9$ were found to be optimal (see Table 4). This setup ensures that for items with a low L2-magnitude (long-tail), $\tilde{m}_i$ is small, making the sigmoid argument a large negative number and pushing $w_i$ towards 0. For popular items, $\tilde{m}_i$ is close to 1, making the argument positive and pushing $w_i$ towards 1. A code sketch of this controller follows below.
Behavior-Content Contrastive Learning
This module uses contrastive learning to learn shared information between behavior and content. It's a two-stage process:

- Content Representation Unification: Contrastive learning is first applied to align the text ($\mathbf{h}_t$) and image ($\mathbf{h}_v$) modalities to obtain a unified content representation. The content loss $\mathcal{L}_{content}$ is an InfoNCE loss that encourages positive pairs (text and vision from the same item) to be close and negative pairs (text from one item, vision from another) to be far apart. Similarity is measured by cosine similarity, negatives are drawn from the other items in the batch, and the temperature coefficient is set to 0.07.
- Behavior-Content Alignment: Contrastive learning is then performed between the behavioral representation ($\mathbf{h}_b$) and the unified content representation to maximize their mutual information. A positive pair is the behavior and content of the same item, and negative pairs match the current item's behavior with other items' content.
- Total Alignment Loss: The total alignment loss combines the content loss $\mathcal{L}_{content}$ and the behavior-content alignment loss $\mathcal{L}_{bc}$, with the latter weighted by the alignment strength controller's output $w_i$:
  $$\mathcal{L}_{align} = \mathcal{L}_{content} + w_i \cdot \mathcal{L}_{bc}.$$
  This formulation is key for denoising: if $w_i$ is low (for long-tail items), the influence of $\mathcal{L}_{bc}$ is reduced, preventing noisy behavioral signals from corrupting content. If $w_i$ is high (for popular items), strong alignment is encouraged. A sketch of this adaptive loss appears after this list.
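A minimal sketch of the two-stage contrastive alignment with the per-item strength weight applied to the behavior-content term (in-batch negatives, temperature 0.07). The choice of unified content representation and the exact way the two losses are combined are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """In-batch InfoNCE: matching rows of a and b are positives, all other rows are negatives."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def adaptive_alignment_loss(h_t, h_v, h_b, w, tau=0.07):
    """L_align = L_content (text<->vision) + per-item-weighted behavior<->content InfoNCE."""
    l_content = 0.5 * (info_nce(h_t, h_v, tau) + info_nce(h_v, h_t, tau))
    h_c = 0.5 * (h_t + h_v)                      # a simple stand-in for the unified content representation
    logits = F.normalize(h_b, dim=-1) @ F.normalize(h_c, dim=-1).T / tau
    labels = torch.arange(h_b.size(0), device=h_b.device)
    per_item = F.cross_entropy(logits, labels, reduction="none")
    l_bc = (w * per_item).mean()                 # alignment strength w scales each item's behavior term
    return l_content + l_bc
```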
4.2.4. Dynamic Behavioral Router Mechanism
This mechanism addresses Signal Obscurity by dynamically weighting the importance of behavioral Semantic IDs based on the item's interaction frequency, thereby amplifying critical signals.
Behavior-Guided Dynamic Router
The behavior-guided dynamic router assigns calibrated importance scores to behavioral SIDs. It up-weights SIDs from head items (popular) and down-weights those from long-tail items. The router output can be summarized as a gated score of the form $\mathbf{r} = \sigma(\tilde{m}) \cdot \mathrm{ReLU}(\mathrm{MLP}(\mathbf{e}_b))$, where:

- $\mathbf{r}$ is the output of the router, a vector of importance scores for the behavioral SIDs.
- $\sigma$ is the sigmoid function and $\tilde{m}$ is the normalized L2-magnitude of the behavioral embedding (as defined in Section 4.2.3). This acts as a magnitude-based scaler, mapping weights to [0, 1] and calibrating them by information richness.
- $\mathrm{MLP}(\cdot)$ is a Multi-Layer Perceptron that processes $\mathbf{e}_b$ to capture its specific semantic patterns.
- $\mathrm{ReLU}$ is the Rectified Linear Unit activation function. It induces exact-zero sparsity, meaning that if the output of the MLP is negative, the score is zero, effectively deactivating certain behavioral SIDs.
- The multiplication is element-wise.

This learnable gate amplifies critical collaborative signals while attenuating uninformative ones, leading to improved robustness. It is trained end-to-end without manual thresholds. A minimal sketch of the gate follows below.
Sparsity Regularization
To further refine the dynamic router, a sparsity regularization loss is introduced. The goal is to encourage the router to produce sparser SID sequences (activate fewer behavioral SIDs) for long-tail items and denser sequences for popular items.
The regularization loss penalizes the deviation from an item-specific target sparsity. The target sparsity is inversely proportional to the item's information richness (approximated by the L2-magnitude of its behavioral embedding).
The paper provides a complex block of formulas for sparsity regularization and load balancing:
- Interpretation based on common MoE practices and the paper's description:
  - The first term appears to be a sparsity penalty, possibly an L1 regularization on the router's output for each behavioral SID, scaled by a dynamic coefficient. The double summation may be a typo and should likely be a single summation over behavioral SIDs (or a summation over SIDs and the batch), i.e., a regularization of the norm of the router outputs across a batch.
  - A dynamic sparsity coefficient appears to adjust over iterations, influencing the strength of the sparsity penalty. Its update rule looks like an adaptive adjustment based on a sign function comparing current and target sparsity, but the notation is highly condensed.
  - The current sparsity of the router's output seems to be calculated as 1 minus the proportion of activated SIDs, using an indicator function over non-zero router outputs.
  - The target sparsity for an item is proportional to 1 minus its normalized information richness. This means less rich items (long-tail) will have a higher target sparsity, encouraging fewer SIDs to be activated. A max-minus-min term in the paper's denominator seems redundant given that the richness is already normalized to [0, 1]; a simple scaling factor would be expected instead.
  - A load-balancing mechanism is included to prevent router collapse (where one expert or SID receives all the traffic). This loss ensures that activated behavioral SIDs are distributed somewhat evenly among the available options, or at least that some minimum number of SIDs stay activated. Its formula is also highly condensed and unusual, making a precise interpretation difficult without further context; load balancing in MoE typically aims to equalize the average utilization of experts.

In summary, the dynamic router, guided by this sparsity and load-balancing regularization, produces a flexible and semantically rich representation. It captures diverse item facets while adaptively controlling the length and complexity of the representation to align with both the item's intrinsic properties and the demands of downstream tasks.
*Figure 1: Schematic diagram of the ADA-SID framework (Figure 2 from the original paper).*
The above Figure 1 (Figure 2 in the original paper) illustrates the overall ADA-SID framework. It shows how visual, text, and collaborative filtering (behavioral) encoders generate embeddings. These are fed into shared experts and specific experts for quantization. The adaptive behavior-content alignment module (bottom left) controls the influence of behavioral signals based on information richness. The dynamic behavioral router (bottom right) applies adaptive weights to the behavioral SIDs. The final SIDs are then used for generative or discriminative recommendation.
5. Experimental Setup
5.1. Datasets
The authors evaluate ADA-SID on both an industrial dataset and a public dataset.

- Industrial Dataset:
  - Source: Collected from a leading e-commerce advertising platform in Southeast Asia.
  - Period: October 2024 to May 2025, i.e., the months preceding the paper's October 2025 release.
  - Scale: Encompasses 35,154,135 users and 48,106,880 advertisements (items).
  - Interactions: Contains 75,730,321,793 user interactions, indicating a massive scale.
  - Content: Rich, multimodal item content (images, titles, descriptions, etc.).
  - Characteristics: User behavior sequences have an average length of 128. Its scale and complexity make it suitable for evaluating real-world performance.
- Public Dataset:
  - Name: "Beauty" subset of the Amazon Product Reviews dataset [38].
  - Scale: 22,363 users, 12,101 items.
  - Interactions: 198,360 interactions.
  - Usage for Generative Retrieval: A 5-core filter is applied (users/items with at least 5 interactions), and chronological user sequences with a maximum length of 20 are constructed.
  - Usage for Discriminative Ranking: Ratings are binarized (positive if >3, negative if ≤3). A chronological 90%/10% split is used for training and testing.

The following are the results from [Table 1] of the original paper:

| Dataset | Industrial Dataset | Beauty |
| #User | 35,154,135 | 22,363 |
| #Item | 48,106,880 | 12,101 |
| #Interaction | 75,730,321,793 | 198,360 |

These datasets were chosen to cover both large-scale industrial scenarios, reflecting real-world complexity and volume, and a commonly used public benchmark, allowing for comparison with prior academic research. This dual evaluation approach strengthens the claims of the method's effectiveness and scalability.
5.2. Evaluation Metrics
The effectiveness of ADA-SID is evaluated using both quantization metrics and recommendation metrics.
5.2.1. Quantization Metrics

- Reconstruction Loss ($\mathcal{L}_{recon}$):
  - Conceptual Definition: Measures how accurately the model can reconstruct the original input vector from its quantized (compressed) representation. Lower values indicate higher fidelity, meaning less information is lost during quantization.
  - Mathematical Formula: For a given input vector $\mathbf{x}$ and its reconstructed version $\hat{\mathbf{x}}$, the Reconstruction Loss is typically calculated using Mean Squared Error (MSE):
    $$\mathcal{L}_{recon} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2.$$
  - Symbol Explanation:
    - $\mathbf{x}$: The original input vector (e.g., concatenated pre-trained embeddings).
    - $\hat{\mathbf{x}}$: The reconstructed vector obtained from the decoder after quantization.
    - $\|\cdot\|_2^2$: The squared L2 norm, representing the squared Euclidean distance.
- Token Distribution Entropy (Entropy):
  - Conceptual Definition: Measures the diversity and balance of the distribution of Semantic IDs (tokens/codewords) across the codebooks. Higher entropy indicates that the model is utilizing a wider range of SIDs more uniformly, which is generally desirable for rich and expressive representations. Low entropy suggests that only a few SIDs are predominantly used.
  - Mathematical Formula: For a discrete probability distribution over codewords, the Shannon Entropy is:
    $$H = -\sum_{k=1}^{K} p_k \log_2 p_k.$$
  - Symbol Explanation:
    - $K$: The total number of unique codewords in the codebook.
    - $p_k$: The probability of selecting the $k$-th codeword (e.g., its frequency of use divided by the total number of codeword assignments).
    - $\log_2$: Logarithm base 2.
- Codebook Utilization (Util.):
  - Conceptual Definition: Reflects the efficiency with which the model uses the vectors in its codebook. It's the percentage of codewords in the codebook that have been used (assigned to at least one input vector) during training. A value close to 1 (or 100%) indicates that most or all codewords are being actively learned and used, which is good for leveraging the full capacity of the codebook. Low utilization might suggest redundant codewords or issues in the quantization process (both statistics are computed in the short sketch after this list).
  - Mathematical Formula:
    $$\mathrm{Util.} = \frac{\text{Number of used codewords}}{\text{Total number of codewords in the codebook}}.$$
  - Symbol Explanation:
    - Number of used codewords: Count of unique codewords that have been assigned to at least one input embedding.
    - Total number of codewords in the codebook: The predefined size of the codebook.
5.2.2. Recommendation Metrics
Generative Retrieval Tasks
For generative retrieval, the goal is often to generate a sequence of items.

- Recall@N (R@N):
  - Conceptual Definition: Measures the proportion of relevant items that are successfully retrieved within the top N recommendations. It indicates how many of the actual target items are present in the recommended list. Higher values are better.
  - Mathematical Formula:
    $$\mathrm{Recall@N} = \frac{|\text{relevant items} \cap \text{recommended items within top } N|}{|\text{relevant items}|}.$$
  - Symbol Explanation:
    - relevant items: The set of items that are truly relevant to the user (e.g., items the user actually interacted with in the test set).
    - recommended items within top N: The set of items recommended by the system.
    - $|\cdot|$: Cardinality of a set.
    - $\cap$: Set intersection.
- Normalized Discounted Cumulative Gain@N (NDCG@N):
  - Conceptual Definition: A ranking quality metric that accounts for the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher up in the list. It is "normalized" by comparing the DCG (Discounted Cumulative Gain) to an Ideal DCG (IDCG), which is the DCG of a perfectly ordered list of relevant items. Higher values are better (a computation sketch for both metrics follows this list).
  - Mathematical Formula:
    $$\mathrm{NDCG@N} = \frac{\mathrm{DCG@N}}{\mathrm{IDCG@N}}, \qquad \mathrm{DCG@N} = \sum_{i=1}^{N} \frac{rel_i}{\log_2(i+1)}.$$
  - Symbol Explanation:
    - $N$: The number of top recommendations considered.
    - $rel_i$: The relevance score of the item at position $i$ in the recommended list (e.g., 1 for relevant, 0 for irrelevant).
    - $\mathrm{IDCG@N}$: The DCG of the ideal list, i.e., the items sorted perfectly by relevance.
    - $\log_2(i+1)$: Discount factor, by which relevance at lower ranks is discounted.
Discriminative Ranking Tasks
For discriminative ranking, the goal is to predict user preferences or likelihood of interaction.

- Area Under the Receiver Operating Characteristic Curve (AUC):
  - Conceptual Definition: Measures the ability of a model to distinguish between positive and negative classes (e.g., whether a user will click on an item or not). It represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance. Values range from 0 to 1, with 0.5 being random performance and 1 being perfect classification. Higher values are better.
  - Mathematical Formula: Conceptually, AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings:
    $$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\; d(\mathrm{FPR}).$$
  - Symbol Explanation:
    - $\mathrm{TPR}$: True Positive Rate (Sensitivity) at a given threshold, calculated as $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
    - $\mathrm{FPR}$: False Positive Rate (1 - Specificity) at a given threshold, calculated as $\mathrm{FP}/(\mathrm{FP}+\mathrm{TN})$.
- Group AUC (GAUC):
  - Conceptual Definition: A weighted average of AUC values calculated per user (or per impression session/group). It is particularly useful in recommendation systems where user-specific performance is important, as it mitigates the impact of users with many interactions on the overall AUC score. It represents the probability that a randomly chosen positive item will be ranked higher than a randomly chosen negative item for the same user. Higher values are better (an illustrative computation follows this list).
  - Mathematical Formula:
    $$\mathrm{GAUC} = \frac{\sum_{u \in \mathcal{U}} w_u \cdot \mathrm{AUC}_u}{\sum_{u \in \mathcal{U}} w_u}.$$
  - Symbol Explanation:
    - $\mathcal{U}$: The set of all users.
    - $\mathrm{AUC}_u$: The AUC score for user $u$.
    - $w_u$: A weighting factor for user $u$, often the number of impressions or positive samples for that user.
Online Experiment Metrics
For online experiments (A/B tests):
- Advertising Revenue (AR):
- Conceptual Definition: A key business metric measuring the total revenue generated from advertisements shown to users in the experimental group.
- Calculation: Typically sum of revenue generated per click or impression, often related to bids and ad value.
- Click-Through Rate (CTR):
- Conceptual Definition: The percentage of users who click on a recommended item after viewing it. It measures the effectiveness of recommendations in capturing user interest.
- Mathematical Formula: $\mathrm{CTR} = \dfrac{\text{Number of Clicks}}{\text{Number of Impressions}}$
- Symbol Explanation:
  - Number of Clicks: Total clicks on recommended items.
  - Number of Impressions: Total times recommended items were shown.
5.3. Baselines
The proposed ADA-SID method is compared against several state-of-the-art (SOTA) Semantic ID generation approaches, categorized into two groups, plus the traditional Item ID.

- Item ID:
  - Description: The conventional approach where each item has a unique identifier. Used as a baseline specifically for discriminative ranking tasks, where it traditionally serves as the item representation. In generative retrieval, SIDs act as the item identifier.
- Content-based SIDs: These methods quantize only content features.
  - RQ-VAE [40]: (From TIGER [22]) Uses RQ-VAE to quantize textual embeddings (derived from content features like titles, descriptions, and categories using a pre-trained LLM) into hierarchical SIDs.
  - OPQ [14]: (From RPG [25]) Utilizes Optimized Product Quantization to convert pre-trained textual embeddings into a tuple of unordered SIDs.
- Behavior-content Aligned SIDs: These methods attempt to incorporate collaborative (behavioral) signals.
  - RQ-Kmeans [43]: (From One-rec [39]) Integrates RQ-VAE and K-means to quantize behavior-finetuned multimodal representations. K-means is applied to residuals in a coarse-to-fine manner.
  - DAS [42]: Introduces multi-view contrastive alignment to maximize mutual information between SIDs and collaborative signals. It generates hierarchical, behavior-aware content SIDs using RQ-VAE.
  - LETTER [27]: Integrates hierarchical semantics, collaborative signals, and code assignment diversity to generate behavior-content-fused SIDs using RQ-VAE.
  - RQ-VAE++: An extension introduced by the authors. It incorporates both pre-trained content representation and pre-trained collaborative representation into the RQ-VAE framework. It generates collaborative, textual, and visual SIDs for each item, highlighting the importance of collaborative information.
  - MM-RQ-VAE [17]: Generates separate collaborative, textual, and visual SIDs from pre-trained collaborative and multimodal embeddings. It also employs contrastive learning for behavior-content alignment.
5.4. Experiment Setup

- Recommendation Foundations:
  - Generative Retrieval: The REG4Rec [18] model (a strong multi-token prediction model) is adopted as the base framework.
  - Discriminative Ranking: The Parameter Personalized Network (PPNet) [62] is used as the backbone architecture.
- Implementation Details:
  - Industrial Dataset: Codebook size is set to 300, and the length of SIDs is set to 8. For ADA-SID specifically, the paper also fixes the numbers of shared, text, vision, and behavior experts as well as the target sparsity.
  - Public Dataset (Amazon Beauty): Codebook size is set to 100, and the length of SIDs is 6, with ADA-SID's expert counts and target sparsity set analogously.
  - Pre-trained Representations: Obtained from Qwen3-Embedding 7B [15] (likely for text), SASRec [16] (for behavior), and PailiTAO v8 (an e-commerce advertising platform's internal model, likely for vision or combined content).
  - Behavioral SID Activation: Behavioral SIDs with router scores above a threshold (set to 0 in this paper) are retained, while those below are replaced with a padding token. This threshold can be adjusted for specific recommendation task requirements.
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
The overall performance comparison of ADA-SID against baselines on both generative retrieval and discriminative ranking tasks is summarized in Table 2 (a) and (b).
The following are the results from [Table 2 (a)] of the original paper:
| Methods | Industrial Dataset | | | | | | | Amazon Beauty | | | | | | |
| | Lrecon↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ | Lrecon↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ |
| RQ-VAE | 0.0033 | 4.2481 | 1.0000 | 0.1854 | 0.2083 | 0.1337 | 0.1421 | 0.6028 | 3.4904 | 0.9900 | 0.1213 | 0.2398 | 0.0803 | 0.1304 |
| OPQ | 0.0038 | 4.3981 | 0.7563 | 0.1972 | 0.2104 | 0.1491 | 0.1518 | 0.9647 | 3.3980 | 0.9600 | 0.1117 | 0.2189 | 0.0802 | 0.1302 |
| RQ-Kmeans | 0.0065 | 4.7232 | 1.0000 | 0.1844 | 0.2202 | 0.1462 | 0.1578 | 0.6240 | 1.7100 | 1.0000 | 0.1385 | 0.2398 | 0.0843 | 0.1507 |
| LETTER | 0.0054 | 4.2072 | 1.0000 | 0.1812 | 0.2213 | 0.1582 | 0.1675 | 0.5431 | 2.6819 | 1.0000 | 0.1513 | 0.2492 | 0.0937 | 0.1453 |
| DAS | 0.0051 | 4.3539 | 1.0000 | 0.1864 | 0.2237 | 0.1576 | 0.1697 | 0.5432 | 3.6819 | 1.0000 | 0.1503 | 0.2403 | 0.0933 | 0.1445 |
| RQ-VAE++ | 0.0034 | 3.5566 | 0.9283 | 0.2254 | 0.2709 | 0.1628 | 0.1706 | 0.6028 | 3.4904 | 0.9900 | 0.1683 | 0.2991 | 0.0943 | 0.1507 |
| MM-RQ-VAE | 0.0055 | 4.2125 | 0.9850 | 0.2181 | 0.2542 | 0.1592 | 0.1707 | 0.5081 | 2.8449 | 0.9950 | 0.1674 | 0.2596 | 0.0915 | 0.1322 |
| ADA-SID(Ours) | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 | 0.4470 | 4.4206 | 1.0000 | 0.1855 | 0.3675 | 0.0996 | 0.1784 |
| Improv. | +3.03% | +7.92% | +0.00% | +22.45% | +7.53% | +3.56% | +0.23% | +12.02% | +20.06% | +0.00% | +10.21% | +22.86% | +5.62% | +18.38% |
The following are the results from [Table 2 (b)] of the original paper:
| Methods | Industrial Dataset | | | | | Amazon Beauty | | | | |
| | Lrecon↓ | Entropy↑ | Util.↑ | AUC↑ | GAUC↑ | Lrecon↓ | Entropy↑ | Util.↑ | AUC↑ | GAUC↑ |
| Item ID | - | - | 0.7078 | 0.5845 | - | - | - | 0.6455 | 0.5897 | |||
| RQ-VAE | 0.0033 | 4.2481 | 1.0000 | 0.7071 | 0.5805 | 0.6028 | 3.4904 | 0.9900 | 0.6446 | 0.5852 | ||
| OPQ | 0.0038 | 4.3981 | 0.7563 | 0.7086 | 0.5829 | 0.9647 | 3.3980 | 0.9600 | 0.6449 | 0.5898 | ||
| RQ-Kmeans | 0.0065 | 4.7232 | 1.0000 | 0.7089 | 0.5832 | 0.6240 | 1.7100 | 1.0000 | 0.6472 | 0.5999 | ||
| LETTER | 0.0054 | 4.2072 | 1.0000 | 0.7089 | 0.5828 | 0.5431 | 2.6819 | 1.0000 | 0.6444 | 0.5973 | ||
| DAS | 0.0051 | 4.3539 | 1.0000 | 0.7091 | 0.5845 | 0.5432 | 3.6819 | 1.0000 | 0.6466 | 0.5933 | ||
| RQ-VAE++ | 0.0034 | 3.5566 | 0.9283 | 0.7100 | 0.5838 | 0.6028 | 3.4904 | 0.9900 | 0.6466 | 0.5952 | ||
| MM-RQ-VAE | 0.0055 | 4.2125 | 0.9850 | 0.7095 | 0.5843 | 0.5081 | 2.8449 | 0.9950 | 0.6453 | 0.5991 | ||
| ADA-SID(Ours) | 0.0032 | 5.0977 | 1.0000 | 0.7101 | 0.5846 | 0.4470 | 4.4206 | 1.0000 | 0.6480 | 0.6125 |
| Improv. | +3.03% | +7.92% | +0.00% | +0.07% | +0.02% | +12.02% | +20.06% | +0.00% | +0.12% | +2.10% |
Key Observations:

- Superiority of Behavior-Content Aligned SIDs over Content-Only SIDs:
  - Methods that incorporate behavioral information (e.g., RQ-Kmeans, LETTER, DAS, RQ-VAE++, MM-RQ-VAE) consistently outperform purely content-based SIDs (RQ-VAE, OPQ) across Recall@100, NDCG@100, and AUC metrics. This strongly validates the necessity of integrating dynamic behavioral signals to overcome the limitations of static content. For instance, on the Industrial Dataset in generative retrieval, RQ-VAE achieves an R@100 of 0.2083, while RQ-VAE++ (with collaborative information) reaches 0.2709, a substantial improvement.
  - This highlights that content alone cannot capture the evolving preferences and dynamic nature of user-item interactions.
- Critical Value of Collaborative Information (RQ-VAE++):
  - The RQ-VAE++ baseline, which explicitly integrates both pre-trained content and collaborative representations into RQ-VAE, shows a significant performance lift compared to the original RQ-VAE. This directly demonstrates that simply adding collaborative information, even without sophisticated adaptive mechanisms, is highly beneficial. For example, RQ-VAE++ improves R@100 on the Industrial Dataset by 30% over RQ-VAE (0.2709 vs 0.2083).
- Effectiveness of Dedicated Collaborative SIDs (MM-RQ-VAE, RQ-VAE++):
  - Comparing different alignment strategies, the results suggest that explicitly generating dedicated SIDs for collaborative signals (as done by MM-RQ-VAE and RQ-VAE++) is generally more effective at capturing complex interaction patterns than more implicit alignment approaches (like LETTER or DAS). This leads to better downstream performance, indicating that the richness of behavioral data benefits from its own specialized representation.
- Overall Superiority of ADA-SID:
  - ADA-SID consistently achieves the best performance across all evaluated metrics and datasets, for both generative retrieval and discriminative ranking tasks.
  - Generative Retrieval (Table 2a):
    - On the Industrial Dataset, ADA-SID improves R@50 by +22.45%, R@100 by +7.53%, N@50 by +3.56%, and N@100 by +0.23% over the best baselines (RQ-VAE++ for R@100 and N@100, MM-RQ-VAE for R@50 and N@50).
    - On Amazon Beauty, ADA-SID shows even more significant gains: R@50 by +10.21%, R@100 by +22.86%, N@50 by +5.62%, and N@100 by +18.38% over the best baselines (RQ-VAE++ for R@100, N@50, and N@100, MM-RQ-VAE for R@50).
  - Discriminative Ranking (Table 2b):
    - On the Industrial Dataset, ADA-SID achieves the highest AUC (0.7101) and GAUC (0.5846), marginally outperforming Item ID and the other SIDs, demonstrating its effectiveness even in traditional ranking setups.
    - On Amazon Beauty, ADA-SID improves AUC by +0.12% and GAUC by +2.10% over the best baselines (RQ-Kmeans for AUC, MM-RQ-VAE for GAUC).
  - Quantization Metrics: ADA-SID also shows strong Entropy (5.0977 Industrial, 4.4206 Beauty) and perfect Codebook Utilization (1.0000 on both), indicating rich and diverse SID representations, while maintaining low Reconstruction Loss. This suggests that its adaptive mechanisms are effective in generating high-quality SIDs.

These results strongly validate ADA-SID's unique design, which intelligently fuses content and behavior information by assessing information richness, adaptively amplifying critical signals, and suppressing noise. This leads to more robust and expressive item representations crucial for diverse recommendation tasks.
6.2. Ablation Studies / Parameter Analysis (RQ2)
Ablation experiments were conducted on the Industrial Dataset to understand the contribution of each component of ADA-SID.
The following are the results from [Table 3] of the original paper:
| Variants | Lrecon ↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ | AUC↑ | GAUC↑ |
| ADA-SID | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 | 0.7101 | 0.5846 |
| w/o Alignment Strength Controller | 0.0032 | 5.0710 | 1.0000 | 0.2701 | 0.2854 | 0.1618 | 0.1643 | 0.7104 | 0.5845 |
| w/o Behavior-content Contrastive Learning | 0.0032 | 5.1153 | 1.0000 | 0.2733 | 0.2874 | 0.1653 | 0.1676 | 0.7097 | 0.5846 |
| w/o Sparsity Regularization | 0.0034 | 5.0571 | 1.0000 | 0.2757 | 0.2903 | 0.1675 | 0.1698 | 0.7097 | 0.5846 |
| w/o Behavior-Guided Dynamic Router | 0.0033 | 5.0896 | 1.0000 | 0.2705 | 0.2861 | 0.1616 | 0.1641 | 0.7098 | 0.5845 |
Impact of Adaptive Behavior-Content Alignment:
w/o Alignment Strength Controller: Removing this controller leads to a performance drop across all generative retrieval metrics (e.g.,R@50drops from 0.2772 to 0.2701,N@100from 0.1714 to 0.1643). This confirms that indiscriminately aligning content with behavioral embeddings is detrimental. The controller's role in suppressingnoise corruptionfrom long-tail items by dynamically adjusting alignment intensity is crucial.w/o Behavior-content Contrastive Learning: Disabling the entireadaptive Behavior-Content Alignmentmodule results in a consistent and more significant drop inRecallandNDCG(e.g.,R@50from 0.2772 to 0.2733,N@100from 0.1714 to 0.1676). This indicates a substantialmodality gapbetween content and behavioral domains. Thecontrastive learningcomponent is vital for bridging this gap and achieving effective information fusion. The slight increase inEntropy(from 5.0977 to 5.1153) might suggest that without proper alignment, SIDs become more diverse but less semantically meaningful in a multi-modal context.
Impact of Dynamic Behavioral Router Mechanism:
- w/o Behavior-Guided Dynamic Router: Removing this component leads to a notable performance degradation across generative retrieval metrics (e.g., R@50 drops from 0.2772 to 0.2705, N@100 from 0.1714 to 0.1641) and also slightly impacts discriminative ranking. This demonstrates that the router's ability to estimate and weight collaborative signals based on information richness is essential for recommendation accuracy. Without it, the model cannot effectively amplify critical signals or attenuate uninformative ones.
- w/o Sparsity Regularization: Disabling sparsity regularization causes a performance drop (e.g., R@50 from 0.2772 to 0.2757, N@100 from 0.1714 to 0.1698). This component plays two critical roles:
  1. It encourages sparse activations (selecting fewer SIDs) for long-tail items, leading to more specialized and disentangled representations for each SID and increasing the model's overall capacity, akin to Mixture-of-Experts (MoE) models.
  2. The item-specific sparsity target ensures that the model wisely allocates its representational budget, using fewer, high-level SIDs for long-tail items and more detailed SIDs for popular items. Its absence leads to less expressive and less adaptive representations.

The ablation studies confirm that both the adaptive behavior-content alignment and the dynamic behavioral router (with sparsity regularization) are crucial components contributing significantly to ADA-SID's overall superior performance. A hedged sketch of how such a router and sparsity penalty could be wired together is given below.
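As a rough illustration of how a behavior-guided router with an item-specific sparsity budget could be implemented, consider the following PyTorch sketch. The class name, sigmoid gating, and L1 penalty are assumptions made for this sketch; the paper's actual router and regularization losses (including load balancing) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicBehavioralRouter(nn.Module):
    """Illustrative MoE-style router over behavioral SIDs.

    A gate driven by the item's behavioral embedding produces per-SID weights;
    an L1 penalty pushes the number of active SIDs toward an item-specific
    budget (larger for popular items, smaller for long-tail items).
    """
    def __init__(self, behavior_dim: int, num_sids: int):
        super().__init__()
        self.gate = nn.Linear(behavior_dim, num_sids)

    def forward(self, behavior_emb: torch.Tensor, sid_emb: torch.Tensor,
                sparsity_target: torch.Tensor):
        # behavior_emb: (B, behavior_dim); sid_emb: (B, num_sids, d);
        # sparsity_target: (B,) per-item budget of active SIDs.
        weights = torch.sigmoid(self.gate(behavior_emb))        # (B, num_sids)
        routed = (weights.unsqueeze(-1) * sid_emb).sum(dim=1)   # (B, d)
        # Penalize deviation of the effective number of active SIDs
        # from the per-item budget (the L1 form is an assumption).
        sparsity_loss = F.l1_loss(weights.sum(dim=-1), sparsity_target)
        return routed, sparsity_loss
```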
6.3. Hyper-Parameter Analysis (RQ3)
6.3.1. The Strength of Sparsity Regularization
The paper investigates the impact of sparsity regularization strength on ADA-SID's performance.
*Figure 2: Hyper-Parameter Analysis on Sparsity Regularization (Figure 3 from the original paper).*
Figure 2 (Figure 3 in the paper) shows that as the sparsity intensity is reduced (the x-axis label, "Regularization parameter", is not defined further), the model's performance generally improves across the metrics (R@50, R@100, N@50, N@100, AUC, GAUC). A weaker sparsity penalty allows more behavioral SIDs to remain active, which enlarges the effective encoder capacity and strengthens encoding, especially for head items with complex behavioral patterns. The paper highlights that ADA-SID's variable-length flexibility (allowing head items to use longer collaborative SID sequences) lets it exploit this increased capacity for richer representations, yielding significant downstream performance gains.
6.3.2. Sensitivity to Alignment Strength Controller Hyperparameters
A hyperparameter study was conducted on the Alignment Strength Controller's parameters (α, β), which define the weighting curve for behavior-content alignment.
The following are the results from [Table 4] of the original paper:
| Variants | Lrecon ↓ | Entropy ↑ | Utilization ↑ | R@50 ↑ | R@100 ↑ | N@50 ↑ | N@100 ↑ | AUC ↑ | GAUC ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| α=20, β=7 | 0.0032 | 5.0844 | 1.0000 | 0.2749 | 0.2894 | 0.1664 | 0.1688 | 0.7106 | 0.5840 |
| α=20, β=9 | 0.0032 | 5.0711 | 1.0000 | 0.2750 | 0.2889 | 0.1677 | 0.1709 | 0.7105 | 0.5842 |
| α=20, β=14 | 0.0033 | 5.0967 | 1.0000 | 0.2760 | 0.2911 | 0.1686 | 0.1707 | 0.7105 | 0.5839 |
| α=10, β=9 | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 | 0.7101 | 0.5846 |
Figure 3: Illustration of alignment strength controller with different hyperparameters (Figure 4 from the original paper).
Table 4 presents results for four (α, β) configurations, and Figure 3 (Figure 4 in the paper) illustrates the corresponding alignment strength curves.
- The α=10, β=9 setting achieved the highest recommendation accuracy (R@50, R@100, N@50, N@100, GAUC).
- This optimal result suggests that, for the Industrial Dataset's distribution, noise filtering is most effective when applied to approximately the 40% least frequent (long-tail) items. Figure 3 visually confirms how different (α, β) pairs shift the curve, controlling the threshold and steepness at which items transition from low to high alignment strength (see the sketch after this list).
- The tunability of these parameters demonstrates ADA-SID's flexibility to adapt to diverse data landscapes and fine-tune the denoising process.
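To visualize what α and β do, here is a minimal, hypothetical sketch of a sigmoid-shaped alignment strength controller over an item's popularity percentile. The functional form, the popularity measure, and the rescaling of β are assumptions chosen only to mimic the roles the paper assigns to these hyperparameters; they are not the paper's formula.

```python
import numpy as np

def alignment_strength(popularity_percentile: np.ndarray,
                       alpha: float = 10.0, beta: float = 9.0) -> np.ndarray:
    """Hypothetical alignment-strength controller.

    popularity_percentile: values in [0, 1], where 0 = least popular item.
    Returns weights in (0, 1): long-tail items get weights near 0 (weak
    alignment, shielding content from noisy behavior), popular items get
    weights near 1. `alpha` sets the steepness of the transition; `beta`
    (arbitrarily rescaled here) shifts the percentile threshold.
    """
    threshold = beta / 20.0  # arbitrary rescaling for this sketch
    return 1.0 / (1.0 + np.exp(-alpha * (popularity_percentile - threshold)))
```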
6.4. Item Popularity Stratified Performance Analysis (RQ4)
To understand how ADA-SID performs across different item popularity levels, a stratified analysis was conducted by categorizing items into popular (top 25% by impression counts) and long-tail (bottom 25% by impression counts) groups. The AUC for each group was then evaluated.
*Figure 4: Item Popularity Stratified Performance Comparison (Figure 5 from the original paper).*
Figure 4 (Figure 5 in the paper) illustrates the performance comparison:
- For Head Items (Popular Items):
  - Content-based SIDs (e.g., RQ-VAE) underperform traditional Item IDs in ranking tasks. This is because Item IDs can learn highly independent and specialized representations for popular items thanks to their abundant interaction data, something content alone cannot fully capture.
  - Integrating collaborative information significantly improves performance, as it enhances the SIDs' expressiveness in capturing complex behavioral patterns. Methods like RQ-VAE++ and MM-RQ-VAE show clear gains.
  - ADA-SID advances this further by explicitly aligning, denoising, and amplifying the fusion of content and behavioral modalities, yielding a more expressive semantic representation that surpasses even Item IDs for popular items.
- For Tail Items (Long-Tail Items):
  - All SID-based methods (including ADA-SID and the baselines) outperform Item IDs. This is a classic strength of SIDs: by sharing knowledge across semantically similar items, SIDs can make meaningful recommendations even for items with scarce interaction data, where Item IDs would fail due to insufficient data for learning.
  - ADA-SID achieves the largest performance gain among all methods on long-tail items. Its adaptive alignment mechanism is crucial here, as it effectively shields the stable content representations of tail items from their noisy and sparse behavioral signals.
  - Concurrently, its dynamic behavioral router learns to produce a sparser, more robust representation for long-tail items by relying more on high-level content semantics than on unreliable fine-grained behavioral cues. This dual mechanism significantly boosts performance on the long tail.

This stratified analysis provides strong evidence that ADA-SID effectively addresses the challenge of information disparity. It successfully balances the expressive power needed for popular items with the generalization capability required for long-tail items, ultimately outperforming traditional Item IDs across the entire popularity spectrum.
6.5. Online Experiments
The authors conducted a 5-day online A/B test on a large-scale e-commerce platform's generative retrieval system.
- Setup: The experimental group, using ADA-SID's 8-token SIDs, was allocated 10% of random user traffic and compared against the production system, which used Item ID-based representations.
- Results: ADA-SID yielded significant improvements in key business metrics: a +3.50% increase in Advertising Revenue and a +1.15% increase in Click-Through Rate (CTR).
- Conclusion: These online gains confirm the practical value, scalability, and production-readiness of the proposed method in a real-world industrial setting.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces ADA-SID, a novel mixture-of-quantization framework called MMQ-v2, designed to learn expressive and noise-robust Semantic IDs (SIDs) for industrial recommender systems. The core innovation lies in its adaptive approach to integrating multimodal information (content and behavior) by explicitly considering the information richness of behavioral signals. The framework addresses two critical limitations of prior methods: Noise Corruption (from sparse long-tail data) and Signal Obscurity (due to equal weighting of SIDs).
ADA-SID achieves this through two main components:
1. Adaptive Behavior-Content Alignment: Utilizes an alignment strength controller to dynamically adjust the alignment intensity between content and behavior based on the richness of an item's interaction data. This denoises representations for long-tail items by reducing the influence of unreliable behavioral signals.
2. Dynamic Behavioral Router: Learns to assign adaptive importance weights to behavioral SIDs, thereby amplifying critical collaborative signals from popular items and attenuating uninformative ones from long-tail items. It also incorporates sparsity regularization and load balancing.

Extensive offline experiments on both industrial and public datasets demonstrate ADA-SID's superior performance in both generative retrieval and discriminative ranking tasks. Crucially, online A/B tests on a large-scale e-commerce platform confirmed significant improvements in key business metrics, including Advertising Revenue and Click-Through Rate (CTR). The work pioneers an adaptive fusion approach that considers information richness, paving the way for more robust and personalized recommender systems.
7.2. Limitations & Future Work
The authors implicitly highlight some limitations through their design choices and explicitly mention future work:
- Hyperparameter Tuning: The Alignment Strength Controller's hyperparameters require careful tuning to adapt to different data distributions, as shown in the analysis; while flexible, finding optimal values can be dataset-dependent.
- Complexity of Sparsity Regularization: The sparsity regularization loss formulation, especially with its dynamic and load-balancing terms, appears quite intricate and potentially difficult to stabilize or interpret without a deep understanding of its internal mechanics.
- Focus on Item-side Modeling: The current work primarily focuses on improving item representations.
- Future Direction: The authors explicitly state that future work includes applying these principles to user-side modeling, i.e., extending the adaptive behavior mining concept to create more robust and dynamic user representations, which could further enhance personalization.
7.3. Personal Insights & Critique
This paper presents a highly relevant and well-structured approach to a crucial problem in recommender systems. The explicit breakdown of limitations in existing behavior-content alignment into Noise Corruption and Signal Obscurity is insightful and provides a clear motivation for their proposed solutions.
Key Strengths:
- Novelty in Adaptive Fusion: The core idea of adaptively aligning and weighting signals based on information richness is a significant step forward. It moves beyond simply combining modalities to intelligently managing their fusion, which is particularly vital under the skewed data distributions of real-world systems.
- Comprehensive Problem Addressing: The Alignment Strength Controller directly tackles Noise Corruption, and the Dynamic Behavioral Router directly addresses Signal Obscurity. This targeted approach makes the solution robust.
- Strong Empirical Evidence: The extensive offline experiments on both large-scale industrial data and public benchmarks, coupled with successful online A/B tests, provide compelling evidence of the method's effectiveness and practicality.
- Bridging Head and Tail Performance: The stratified analysis demonstrating ADA-SID's superior performance for both popular and long-tail items is a strong indicator of its balanced design, which is often a challenge in recommendation.
Potential Issues/Areas for Improvement:
- Clarity of Sparsity Regularization Formulas: While the intent of the sparsity regularization and load balancing is clear, the mathematical formulations (Eq. 23) are exceptionally condensed and contain what appear to be notational ambiguities or potential typos (e.g., the double summation, an ambiguous denominator, and an undefined symbol). A more detailed breakdown of these formulas would greatly benefit reproducibility and understanding for beginners. I had to interpret them based on common MoE practices, but strict fidelity to the paper means presenting them as is, even if they are slightly unclear.
- Computational Overhead: While Mixture-of-Experts designs can offer increased capacity without proportional cost, the introduction of multiple experts, gating networks, and adaptive controllers might add to overall complexity and training time, especially with very large codebooks and item counts. The paper doesn't delve deeply into the computational cost compared to baselines, though the online success implies it is manageable.
- Generalizability of Hyperparameters: While α and β are tunable, finding their optimal values might still require significant effort for new datasets; the paper's conclusion that noise filtering is optimal for the 40% least frequent items is specific to its dataset.
Transferability:
The principles of adaptive multimodal fusion based on information richness are highly transferable. This concept could be applied to:
- User Modeling: Similar disparities exist in user behavior (active vs. cold-start users). Adapting user representation learning based on the richness of each user's interaction history could lead to more robust user embeddings.
- Other Multimodal Learning Tasks: Beyond recommendation, any task involving the fusion of modalities with varying reliability (e.g., combining noisy sensor data with cleaner visual data) could benefit from adaptive weighting or alignment.
- Dynamic Feature Engineering: The idea of using L2-magnitude as a proxy for information richness could inspire other dynamic feature weighting or selection mechanisms in various machine learning contexts (a small sketch of this proxy is given at the end of this section).

Overall, MMQ-v2 (ADA-SID) is a significant and practical contribution to the field of recommender systems, offering a more nuanced and effective way to leverage diverse information sources.
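As a small illustration of the L2-magnitude idea mentioned above, the following sketch normalizes embedding norms into a richness score. The min-max normalization and the function name are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def information_richness(behavior_emb: torch.Tensor) -> torch.Tensor:
    """Toy proxy for information richness based on L2 magnitude.

    behavior_emb: (num_items, dim) behavioral embeddings. Items with richer
    interaction histories tend to accumulate larger-magnitude embeddings, so
    the min-max normalized L2 norm can serve as a cheap richness score that
    downstream components (e.g., an alignment controller or router) can use
    for weighting.
    """
    norms = behavior_emb.norm(p=2, dim=-1)
    return (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)
```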