MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation

TL;DR Summary

MMQ-v2 adaptively aligns, denoises, and amplifies multimodal signals to enhance semantic ID learning, overcoming the noise corruption and signal obscurity that afflict recommendation systems, and achieving superior performance on large-scale datasets.

Abstract

Industrial recommender systems rely on unique Item Identifiers (ItemIDs). However, this method struggles with scalability and generalization in large, dynamic datasets that have sparse long-tail data. Content-based Semantic IDs (SIDs) address this by sharing knowledge through content quantization. However, by ignoring dynamic behavioral properties, purely content-based SIDs have limited expressive power. Existing methods attempt to incorporate behavioral information but overlook a critical distinction: unlike relatively uniform content features, user-item interactions are highly skewed and diverse, creating a vast information gap in quality and quantity between popular and long-tail items. This oversight leads to two critical limitations: (1) Noise Corruption: indiscriminate behavior-content alignment allows collaborative noise from long-tail items to corrupt their content representations, leading to the loss of critical multimodal information. (2) Signal Obscurity: the equal-weighting scheme for SIDs fails to reflect the varying importance of different behavioral signals, making it difficult for downstream tasks to distinguish important SIDs from uninformative ones. To tackle these issues, we propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align, Denoise, and Amplify multimodal information from content and behavior modalities for semantic ID learning. The semantic IDs generated by this framework are named ADA-SID. It introduces two innovations: an adaptive behavior-content alignment that is aware of information richness to shield representations from noise, and a dynamic behavioral router that amplifies critical signals by applying different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "MMQ-v2: Align, Denoise, and Amplify: Adaptive Behavior Mining for Semantic IDs Learning in Recommendation." This title indicates a focus on improving Semantic IDs (SIDs) in recommender systems by adaptively integrating and refining multimodal (content and behavioral) information.

1.2. Authors

The authors are:

  • Yi Xu (Alibaba International Digital Commerce Group, Beijing, China)

  • Moyu Zhang (Alibaba International Digital Commerce Group, Beijing, China)

  • Chaofan Fan (Alibaba International Digital Commerce Group, Beijing, China)

  • Jinxin Hu (Alibaba International Digital Commerce Group, Beijing, China)

  • Xiaochen Li (Alibaba International Digital Commerce Group, Beijing, China)

  • Yu Zhang (Alibaba International Digital Commerce Group, Beijing, China)

  • Xiaoyi Zeng (Alibaba International Digital Commerce Group, Beijing, China)

  • Jing Zhang (Wuhan University, School of Computer Science, Wuhan, China)

    Most authors are affiliated with Alibaba International Digital Commerce Group, indicating an industrial research background with a focus on practical applications in large-scale e-commerce. Jing Zhang's affiliation with Wuhan University suggests an academic contribution, potentially bridging theoretical advancements with industrial relevance.

1.3. Journal/Conference

The paper's venue line reads "Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA." This is the unedited ACM proceedings template placeholder, indicating the paper was formatted for (and likely submitted to) an ACM conference, which typically implies a peer-reviewed venue in computer science, likely related to information retrieval, data mining, or artificial intelligence, given the topic. ACM conferences are highly reputable in these fields.

1.4. Publication Year

The publication timestamp provided with the paper is 2025-10-29T15:27:23.000Z, i.e., October 29, 2025. This suggests the paper is a very recent or forthcoming publication.

1.5. Abstract

Industrial recommender systems often struggle with the scalability and generalization limitations of unique Item Identifiers (ItemIDs) in dynamic, sparse, and long-tail datasets. Content-based Semantic IDs (SIDs) offer a partial solution by quantizing content to share knowledge, but they lack the dynamic behavioral properties essential for comprehensive recommendations. Existing methods attempting to incorporate behavioral information fail to account for the highly skewed and diverse nature of user-item interactions, particularly the vast information gap between popular and long-tail items. This oversight leads to two critical problems: Noise Corruption, where sparse behavioral data from long-tail items corrupts their content representations, and Signal Obscurity, where equal weighting of SIDs prevents distinguishing important behavioral signals from uninformative ones.

To address these issues, the paper proposes MMQ-v2, a mixture-of-quantization framework that adaptively Aligns, Denoises, and Amplifies multimodal information from content and behavior modalities to learn Semantic IDs. The resulting SIDs are named ADA-SID. The framework introduces two key innovations: an adaptive behavior-content alignment mechanism that considers information richness to prevent noise corruption, and a dynamic behavioral router that amplifies critical signals by assigning different weights to SIDs. Extensive experiments on public and large-scale industrial datasets demonstrate ADA-SID's superior performance in both generative and discriminative recommendation tasks.

  • Original Source Link: https://arxiv.org/abs/2510.25622
  • PDF Link: https://arxiv.org/pdf/2510.25622v2.pdf

The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the limitations of item representation in industrial recommender systems. Traditionally, items are identified by unique ItemIDs. While simple, this approach struggles significantly in large-scale, dynamic environments characterized by an abundance of new items and long-tail data (items with very few interactions). ItemIDs lack inherent semantic meaning, making it difficult for the system to generalize to new or rarely interacted items.

Content-based Semantic IDs (SIDs) emerged as an improvement, addressing generalization by quantizing multimodal content (e.g., images, text descriptions) into discrete codes. This allows items with similar content to share similar SIDs, thereby enabling knowledge sharing. However, this approach inherently ignores the dynamic behavioral properties of items—how users actually interact with them. User interactions reveal evolving popularity, changing styles, and specific preferences that static content alone cannot capture, thus limiting the expressiveness and performance of purely content-based SIDs.

Existing attempts to incorporate behavioral information into SIDs, primarily through behavior-content alignment, also face significant challenges. The critical oversight is the inherent disparity and skewed distribution of user-item interactions. Unlike content features, which might be relatively uniform, behavioral data is highly irregular:

  1. Popular Items (Head Items): Have abundant, diverse, and rich interaction data.

  2. Long-tail Items (Tail Items): Suffer from extreme sparsity and very few interactions, leading to unreliable behavioral signals.

    This information gap leads to two critical limitations in current methods:

  • Noise Corruption: Indiscriminate alignment between behavioral and content information can be detrimental. For long-tail items, their sparse and noisy behavioral signals can corrupt their otherwise stable and reliable content representations. For popular items, uniform alignment might overcompress their rich behavioral patterns, leading to a loss of unique behavioral signatures.

  • Signal Obscurity: Current methods often treat all SIDs equally, failing to reflect the varying importance of different behavioral signals. This makes it difficult for downstream recommendation tasks to prioritize informative SIDs (e.g., those from popular items with strong signals) over uninformative SIDs (e.g., those from long-tail items with weak signals), leading to increased optimization burden and reduced accuracy.

    The paper's entry point is to explicitly address this fundamental disparity in behavioral information richness. Its innovative idea is to adaptively handle the fusion of content and behavior, tailoring the alignment and amplification processes to the specific information richness of each item's behavioral data.

2.2. Main Contributions / Findings

The paper proposes MMQ-v2, a novel mixture-of-quantization framework, to tackle the aforementioned issues; the Semantic IDs it generates are named ADA-SID (Adaptive Denoise and Amplify Semantic IDs). Its primary contributions are:

  • Customized Multimodal SID Learning based on Information Richness: The paper is the first to propose customizing behavior-content multimodal SIDs for items by considering the information richness of their collaborative signals. This significantly enhances the expressiveness of SIDs and improves generalization for downstream recommendation tasks.

  • Adaptive Behavior-Content Alignment Mechanism: An adaptive tri-modal (behavior-vision-text) alignment strategy is introduced. This mechanism dynamically calibrates the alignment strength between behavioral and static content modalities. By employing an alignment strength controller that adjusts intensity based on an item's interaction data richness, it mitigates noise corruption for long-tail items while preserving diverse behavioral information for popular ones.

  • Dynamic Behavioral Router: A dynamic behavioral router is proposed that learns to assign adaptive importance weights to an item's set of behavioral SIDs. This mechanism effectively amplifies critical collaborative signals and attenuates uninformative ones, improving robustness and downstream recommendation performance.

  • Comprehensive Experimental Validation: Extensive offline experiments on both public and large-scale industrial datasets, along with online A/B tests, demonstrate ADA-SID's significant superiority in both generative and discriminative recommendation tasks. Notably, online A/B tests showed a +3.50% increase in Advertising Revenue and a +1.15% increase in Click-Through Rate (CTR).

    The key conclusions are that by adaptively managing the fusion of multimodal information, specifically by recognizing and responding to the varying reliability of behavioral signals across popular and long-tail items, ADA-SID generates more robust, expressive, and effective Semantic IDs. This approach leads to substantial performance gains in real-world recommender systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Item Identifiers (ItemIDs): In recommender systems, ItemIDs are unique, discrete numerical or alphanumeric labels assigned to each item (e.g., a product, movie, song). They are simple to use but inherently lack semantic meaning, meaning the ID itself doesn't convey any information about the item's features or properties.

    • Limitation: When a new item is introduced (cold-start problem) or an item has very few interactions (long-tail data), there's no semantic basis to recommend it. Scalability also becomes an issue in very large item catalogs as the embedding table for ItemIDs grows proportionally.
  • Semantic IDs (SIDs): Semantic IDs are discrete codes that carry semantic meaning, allowing similar items to have similar or shared IDs. Instead of unique identifiers for each item, SIDs represent item properties or characteristics.

    • Benefit: They enable knowledge sharing between similar items, improving generalization, especially for cold-start and long-tail items. They can also represent items more compactly.
  • Content Quantization: This is the process of converting continuous-valued content features (e.g., numerical vectors representing text or images) into discrete, symbolic representations (SIDs). It's akin to reducing the precision of data while trying to preserve as much semantic information as possible.

    • Example: Using a Vector Quantization (VQ) technique to map a high-dimensional image embedding to a discrete code from a predefined codebook.
  • Sparse Long-Tail Data: In many real-world datasets, particularly in e-commerce or media, a small number of "head" items (e.g., popular products, blockbuster movies) account for a large proportion of interactions, while a vast number of "long-tail" items (niche products, obscure films) have very few interactions. This creates a highly skewed distribution, known as sparse long-tail data.

    • Challenge: Learning meaningful representations or making accurate recommendations for long-tail items is difficult due to insufficient interaction data.
  • Multimodal Information: Refers to data from multiple modalities, such as text (product descriptions, titles), vision (product images), and behavioral/collaborative signals (user interaction history). Combining these different types of data can provide a richer and more comprehensive understanding of items.

  • Behavioral/Collaborative Signals: Information derived from how users interact with items. This includes clicks, purchases, views, ratings, etc.

    • Collaborative Filtering: A classic recommendation technique that leverages these signals, assuming users who agreed in the past will agree in the future.
    • Dynamic Nature: Behavioral signals are dynamic because user preferences and item popularity evolve over time.
  • Vector Quantization (VQ): A technique used in data compression and signal processing to map a vector from a continuous vector space into a finite set of vectors (a codebook). Each vector in the codebook is called a codeword. The input vector is replaced by the index of its closest codeword in the codebook.

    • How it works: Given an input vector, it finds the closest codeword in the codebook (e.g., using Euclidean distance or cosine similarity) and replaces the input vector with that codeword.
    • Application in SIDs: Used to map continuous item embeddings into discrete SIDs.
  • RQ-VAE (Residual Quantized Variational Autoencoder): An advanced Vector Quantization technique, often used in generating Semantic IDs. It quantizes representations hierarchically or iteratively. Instead of quantizing the entire embedding at once, it quantizes a residual (the difference between the original embedding and the quantized version) in multiple stages. This allows for more granular and expressive discrete representations.

    • Components: Typically involves an encoder that maps input to a latent space, a quantizer that maps latent vectors to discrete codes from a codebook, and a decoder that reconstructs the input from the quantized codes.
  • Contrastive Learning: A self-supervised learning paradigm where the model learns to bring representations of "similar" data points (positive pairs) closer together in an embedding space, while pushing "dissimilar" data points (negative pairs) further apart.

    • Application in SIDs: Used to align representations from different modalities (e.g., content and behavior) by treating corresponding multimodal embeddings for the same item as positive pairs and those from different items as negative pairs.
  • Mixture-of-Experts (MoE): A machine learning technique where multiple "expert" models are trained to handle different subsets or aspects of the input data. A "gating network" or "router" learns to decide which expert(s) to route each input to.

    • Benefit: Can increase model capacity and performance without a proportional increase in computational cost by sparsely activating experts.
    • Application in MMQ-v2: Used to have shared experts for common multimodal information and specific experts for modality-specific information, along with a dynamic behavioral router to selectively activate behavioral SIDs.

3.2. Previous Works

The paper categorizes previous works into traditional ItemIDs, content-based SIDs, and behavior-content aligned SIDs.

  • Traditional ItemIDs:

    • Concept: Unique identifiers for items.
    • Limitations: Lack semantics, struggle with cold-start items and generalization, as discussed in [1, 2, 7] (e.g., A. Singh et al., 2024; C. Zheng et al., 2025).
  • Content-based SIDs:

    • Concept: Quantize item content features into discrete codes to share knowledge among similar items.
    • Examples:
      • RQ-VAE (Residual Quantized Variational Autoencoder) pioneered by TIGER [22]: Transforms multimodal content (titles, descriptions, images) into textual embeddings using a pre-trained Large Language Model (LLM) and then quantizes these embeddings into hierarchical SIDs.
      • OPQ (Optimized Product Quantization) [14]: Used in RPG [25] to convert pre-trained textual embeddings into a tuple of unordered SIDs. OPQ aims to find an optimal orthogonal rotation for product quantization, minimizing quantization distortion.
      • SPM-SID [1] and PMA [2]: Inspired by LLMs, these methods explore more granular subword-based compositions for SIDs.
    • Limitation: Performance ceiling due to ignoring dynamic behavioral properties [20, 30, 34].
  • Behavior-Content Aligned SIDs: These methods attempt to incorporate dynamic behavioral information into SIDs. They are broadly categorized into two approaches:

    1. Injecting Explicit Collaborative Signals:
      • LCRec [31]: Designs a series of alignment tasks to unify semantic and collaborative information.
      • ColaRec [32]: Distills collaborative signals directly from a pre-trained recommendation model and combines them with content information.
      • IDGenRec [34]: Leverages LLMs to generate semantically rich textual identifiers, showing potential in zero-shot settings by incorporating behavioral context.
    2. Aligning Pre-trained Representations:
      • EAGER [30]: Generates separate collaborative and content SIDs using K-means on pre-trained embeddings and then aligns them in downstream tasks.
      • DAS [42]: Employs multi-view contrastive learning to maximize the mutual information between SIDs and collaborative signals, generating hierarchical, behavior-aware content SIDs using RQ-VAE.
      • LETTER [27]: Integrates hierarchical semantics, collaborative signals, and code assignment diversity to generate behavior-content fused SIDs using RQ-VAE.
      • MM-RQ-VAE [17]: Generates collaborative, textual, and visual SIDs from pre-trained collaborative and multimodal embeddings. It also introduces contrastive learning for behavior-content alignment.

3.3. Technological Evolution

The evolution of item representation for recommender systems can be summarized as follows:

  1. ItemIDs (Traditional): Simple, unique identifiers. Early systems relied heavily on these, but they became a bottleneck for scalability, cold-start, and generalization in large, dynamic catalogs.
  2. Content-based SIDs: Introduced Vector Quantization techniques (like RQ-VAE, OPQ) to embed semantic meaning directly into item representations by quantizing content features. This marked a shift towards knowledge sharing and better generalization for new items.
  3. Behavior-Content Aligned SIDs (Early Attempts): Recognized the limitations of content-only SIDs and started integrating behavioral/collaborative signals. This often involved explicit injection of collaborative data or alignment of pre-trained content and behavioral embeddings using techniques like contrastive learning.
  4. MMQ-v2 (ADA-SID): This paper represents a crucial next step. While previous aligned methods acknowledge the importance of behavior, they largely ignore the inherent heterogeneity and information asymmetry (skewed distribution) of behavioral data across items. MMQ-v2 explicitly addresses this by introducing adaptive mechanisms (alignment strength controller, dynamic behavioral router) to intelligently Align, Denoise, and Amplify signals based on their information richness, thereby overcoming the Noise Corruption and Signal Obscurity issues prevalent in earlier aligned SID approaches.

3.4. Differentiation Analysis

Compared to the main methods in related work, ADA-SID's core innovations and differentiators are:

  • Adaptive Handling of Behavioral Information Richness: This is the most significant difference. Existing behavior-content aligned SIDs (e.g., DAS, LETTER, MM-RQ-VAE) tend to apply alignment uniformly across all items. ADA-SID explicitly recognizes that the reliability and quantity of behavioral signals differ vastly between popular and long-tail items. It designs mechanisms to adapt to this information richness.

    • Contrast with Uniform Alignment: While methods like DAS and MM-RQ-VAE use contrastive learning for alignment, ADA-SID modulates the strength of this alignment, which is crucial for denoising.
  • Noise Corruption Mitigation: For long-tail items with sparse and noisy behavioral data, ADA-SID's adaptive behavior-content alignment mechanism (Alignment Strength Controller) dynamically reduces the influence of these unreliable signals during alignment. This prevents the corruption of their stable content representations, a problem that indiscriminate alignment in other methods exacerbates.

  • Signal Obscurity Resolution and Amplification: ADA-SID introduces a dynamic behavioral router. Instead of assigning equal importance to all SIDs, this router learns to assign adaptive weights to behavioral SIDs based on their information richness. This amplifies critical signals from popular items (where behavioral data is rich and informative) and attenuates uninformative signals from long-tail items, effectively resolving the signal obscurity issue. This is a novel approach compared to methods that just fuse or align without explicit weighting based on signal quality.

  • Mixture-of-Quantization Network: ADA-SID uses a mixture-of-quantization network with shared experts and specific experts to capture both shared multimodal information and modality-specific details. This architecture, combined with the adaptive alignment and routing, provides a more nuanced and flexible way to represent items compared to simpler RQ-VAE structures used in many baselines (e.g., RQ-VAE, RQ-Kmeans, LETTER).

    In essence, ADA-SID doesn't just incorporate behavioral signals; it intelligently manages them, making it more robust and effective across the entire spectrum of item popularity, especially for the challenging long-tail.

4. Methodology

4.1. Principles

The core idea behind ADA-SID (Adaptive Denoise and Amplify Semantic IDs) is to create expressive and noise-robust Semantic IDs (SIDs) by intelligently integrating multimodal information from content (text and vision) and behavior. The fundamental principle is to acknowledge and leverage the varying information richness of behavioral signals across items. Instead of treating all behavioral signals equally, ADA-SID aims to:

  1. Align: Fuse dynamic behavioral information with static content.

  2. Denoise: Prevent noisy behavioral signals (especially from long-tail items) from corrupting reliable content representations.

  3. Amplify: Highlight and prioritize critical behavioral signals (especially from popular items) to improve their impact on downstream tasks.

    This is achieved through a mixture-of-quantization framework that adaptively controls the strength of behavior-content alignment and dynamically weights behavioral SIDs based on an item's interaction data richness.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The task is to quantize pre-trained textual, visual, and behavioral embeddings of an item into a sequence of discrete Semantic IDs.

For a given item, we first obtain:

  • Pre-trained vision embedding $\mathbf{e}_v$

  • Pre-trained text embedding $\mathbf{e}_t$

  • Pre-trained behavioral embedding $\mathbf{e}_b$ (obtained from a sequential recommender model like SASRec [16] using collaborative signals).

    An item tokenizer $\mathcal{T}_{\mathrm{item}}$ then quantizes these high-dimensional embeddings into a discrete sequence of SIDs.

The formulation is given as: $$ Semantic\_IDs = (c_1, c_2, \ldots, c_l) = \mathcal{T}_{\mathrm{item}}([\mathbf{e}_t, \mathbf{e}_v, \mathbf{e}_b]) $$ Here, $l$ represents the length of the Semantic ID sequence, and $c_i$ denotes the $i$-th discrete semantic ID in the sequence. The notation $[\mathbf{e}_t, \mathbf{e}_v, \mathbf{e}_b]$ implies a concatenation or fusion of the three pre-trained embeddings as input to the tokenizer.

4.2.2. Behavior-Content Mixture-of-Quantization Network

This network is designed to simultaneously capture both shared information across modalities and modality-specific information. It consists of Shared Experts and Specific Experts.

Shared Experts

The shared experts are responsible for quantizing the aligned behavior-content information into shared latent embeddings, which then generate shared SIDs.

  1. Projection: The pre-trained textual ($\mathbf{e}_t$), visual ($\mathbf{e}_v$), and behavioral ($\mathbf{e}_b$) embeddings are first projected into a unified high-dimensional space using small two-layer deep neural networks (DNNs), denoted $D_t$, $D_v$, and $D_b$ respectively. The resulting hidden representations are $\mathbf{h}_t$, $\mathbf{h}_v$, and $\mathbf{h}_b$: $$ \mathbf{h}_t = D_t(\mathbf{e}_t), \quad \mathbf{h}_v = D_v(\mathbf{e}_v), \quad \mathbf{h}_b = D_b(\mathbf{e}_b) $$
  2. Concatenation: These projected hidden representations are then concatenated to form a unified representation $\mathbf{h}$: $$ \mathbf{h} = [\mathbf{h}_t, \mathbf{h}_v, \mathbf{h}_b] $$
  3. Adaptive Alignment: This concatenated representation $\mathbf{h}$ is optimized by the adaptive behavior-content alignment mechanism (detailed in Section 4.2.3), which ensures the behavioral and content modalities are aligned appropriately based on information richness.
  4. Encoding and Quantization: For each of the $N_s$ shared experts $E_{s,i}$, the aligned representation $\mathbf{h}$ is encoded into a shared latent embedding $\mathbf{z}_{s,i}$, which is then quantized into a discrete semantic ID $c_{s,i}$ by finding the most similar codeword in the shared codebook $C_{s,i}$ (containing $K$ codewords $\{\mathbf{z}_{q_k}\}_{k=1}^K$): $$ \mathbf{z}_{s,i} = E_{s,i}(\mathbf{h}) $$ The quantization step selects the codeword $\mathbf{z}_{q,j}$ in $C_{s,i}$ that maximizes the cosine similarity with $\mathbf{z}_{s,i}$: $$ c_{s,i} = \underset{j \in \{1, \ldots, K\}}{\arg\max}\ \frac{\mathbf{z}_{s,i}^{\top} \mathbf{z}_{q,j}}{\|\mathbf{z}_{s,i}\|\,\|\mathbf{z}_{q,j}\|}, \qquad i = 1, \ldots, N_s $$ Here, $N_s$ is the number of shared experts and $K$ is the codebook size; a code sketch of this lookup follows below.
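To make the lookup concrete, here is a minimal PyTorch sketch of quantization by cosine similarity; the function name `cosine_vq_lookup` and the tensor shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def cosine_vq_lookup(z, codebook):
    """Quantize latent vectors to their nearest codewords under cosine similarity.

    z:        (batch, d) latent embeddings from one shared expert E_{s,i}
    codebook: (K, d)     learnable codewords of that expert's codebook C_{s,i}
    Returns integer SIDs c_{s,i} of shape (batch,) and the chosen codewords.
    """
    z_n = F.normalize(z, dim=-1)           # unit-norm latents
    cb_n = F.normalize(codebook, dim=-1)   # unit-norm codewords
    sims = z_n @ cb_n.t()                  # (batch, K) cosine similarities
    sids = sims.argmax(dim=-1)             # argmax_j cos(z_{s,i}, z_{q,j})
    return sids, codebook[sids]            # quantized representation z_q
```

In the full model, each shared expert $E_{s,i}$ would own its own codebook $C_{s,i}$, and the returned codewords feed the straight-through reconstruction described below.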

Specific Experts

The specific experts are designed to learn modality-specific information and generate modality-specific SIDs. For each modality (text, vision, behavior), there is a group of specific experts and corresponding codebooks.

  1. Encoding: For the text modality, the pre-trained $\mathbf{e}_t$ is transformed by $N_t$ experts $\{E_{t,i}\}_{i=1}^{N_t}$ into latent representations $\{\mathbf{z}_{t,i}\}_{i=1}^{N_t}$. Similarly, for the vision and behavior modalities, the pre-trained $\mathbf{e}_v$ and $\mathbf{e}_b$ are transformed into $\{\mathbf{z}_{v,i}\}_{i=1}^{N_v}$ and $\{\mathbf{z}_{b,i}\}_{i=1}^{N_b}$ respectively: $$ \mathbf{z}_{t,i} = E_{t,i}(\mathbf{e}_t), \quad \mathbf{z}_{v,i} = E_{v,i}(\mathbf{e}_v), \quad \mathbf{z}_{b,i} = E_{t,i}(\mathbf{e}_b) $$ Note: the paper writes $E_{t,i}$ in the behavior term, which appears to be a typo; for consistency with the specific-expert notation it should be $E_{b,i}(\mathbf{e}_b)$. The formula is reproduced here as written in the paper.
  2. Quantization: Analogous to the shared experts, modality-specific SIDs $\{c_{t,i}\}_{i=1}^{N_t}$, $\{c_{v,i}\}_{i=1}^{N_v}$, and $\{c_{b,i}\}_{i=1}^{N_b}$ are retrieved from their respective codebooks $\{C_{t,i}\}_{i=1}^{N_t}$, $\{C_{v,i}\}_{i=1}^{N_v}$, and $\{C_{b,i}\}_{i=1}^{N_b}$ using cosine similarity. The corresponding codeword representations are $\{\mathbf{z}_{q_{t,i}}\}_{i=1}^{N_t}$, $\{\mathbf{z}_{q_{v,i}}\}_{i=1}^{N_v}$, and $\{\mathbf{z}_{q_{b,i}}\}_{i=1}^{N_b}$. Here, $N_t$, $N_v$, and $N_b$ are the numbers of experts for the text, vision, and behavior modalities, respectively.

Decoder and Reconstruction

The decoder reconstructs the original fused pre-trained embeddings $\mathbf{e} = [\mathbf{e}_t, \mathbf{e}_v, \mathbf{e}_b]$ from the combined latent representations and their quantized codeword representations.

  1. Fused Latent Representations: The total latent representation $\mathbf{z}$ is the sum of the shared latent embeddings and the gated modality-specific latent embeddings; its quantized counterpart $\mathbf{z}_q$ is formed analogously from the codeword representations: $$ \mathbf{z} = \sum_{i=1}^{N_s} \mathbf{z}_{s,i} + \sum_{i=1}^{N_v} g_{v,i}\, \mathbf{z}_{v,i} + \sum_{i=1}^{N_t} g_{t,i}\, \mathbf{z}_{t,i} + \sum_{i=1}^{N_b} g_{b,i}\, \mathbf{z}_{b,i} $$ $$ \mathbf{z}_q = \sum_{i=1}^{N_s} \mathbf{z}_{q_{s,i}} + \sum_{i=1}^{N_v} g_{v,i}\, \mathbf{z}_{q_{s,i}} + \sum_{i=1}^{N_t} g_{t,i}\, \mathbf{z}_{q_{t,i}} + \sum_{i=1}^{N_b} R(\mathbf{e}_b)_i\, \mathbf{z}_{q_{b,i}} $$ Here, $g_t$ and $g_v$ are gating mechanisms for the text and visual modalities: $$ g_t = \mathrm{softmax}(MLP_t(\mathbf{e}_t) + b_t), \qquad g_v = \mathrm{softmax}(MLP_v(\mathbf{e}_v) + b_v) $$ The term $R(\mathbf{e}_b)_i$ refers to the output of the dynamic behavioral router for the $i$-th behavioral SID, which selectively determines its importance (detailed in Section 4.2.4).

    • Note on the $\mathbf{z}_q$ formula: the vision-specific term is written $\sum_{i=1}^{N_v} g_{v,i}\, \mathbf{z}_{q_{s,i}}$ in the paper, which is likely a typo; for consistency with the formula for $\mathbf{z}$ and the idea of modality-specific experts, it should presumably be $\mathbf{z}_{q_{v,i}}$ (the quantized vision-specific expert embedding). The formula above is reproduced as written in the paper.
  2. Reconstruction Loss: The decoder takes the fused latent representation $\mathbf{z}$ and its quantized version $\mathbf{z}_q$ to reconstruct the original embeddings, typically via a mean squared error: $$ \mathcal{L}_{recon} = \|\mathbf{e} - decoder(\mathbf{z} + sg(\mathbf{z}_q - \mathbf{z}))\|^2 $$ Here, $sg(\cdot)$ is the stop-gradient operator, through which no gradients flow, a common practice in VQ-VAE training that ensures the encoder learns to map embeddings close to the codebook entries. The term $\mathbf{z} + sg(\mathbf{z}_q - \mathbf{z})$ effectively uses $\mathbf{z}_q$ in the forward pass but routes gradients to $\mathbf{z}$ in the backward pass (the straight-through estimator); a sketch follows below.
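The stop-gradient trick is compact enough to show directly; this is a minimal PyTorch sketch, assuming a generic `decoder` module and fused embedding tensor `e` (both hypothetical names):

```python
import torch

def straight_through(z, z_q):
    """Forward pass behaves like z_q; backward pass routes gradients to z,
    because the residual (z_q - z) is detached (sg(.) == .detach())."""
    return z + (z_q - z).detach()

# Hypothetical usage inside a training step:
# z_st = straight_through(z, z_q)          # quantized forward, smooth backward
# loss_recon = ((e - decoder(z_st)) ** 2).sum(dim=-1).mean()
```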

4.2.3. Adaptive Behavior-Content Alignment

This mechanism aims to align behavior and content modalities while being sensitive to the varying information richness of behavioral signals, thereby preventing noise corruption.

Alignment Strength Controller

This component outputs a weight ww to modulate the intensity of the behavior-content alignment for each item. It's based on two principles:

  1. Long-tail items: Alignment strength should smoothly decay towards zero.

  2. Popular items: Alignment strength should increase with their estimated information richness.

The L2-magnitude of an item's behavioral embedding $\mathbf{e}_b$ is used as a proxy for its information richness.

  3. Normalization: Given a pre-trained behavior embedding matrix $\mathbf{E} \in \mathbb{R}^{K \times D}$ (where $K$ is the number of items and $D$ is the embedding dimension), consisting of vectors $\{\mathbf{e}_{b,1}, \mathbf{e}_{b,2}, \dots, \mathbf{e}_{b,K}\}$, the maximum ($N_{\max}$) and minimum ($N_{\min}$) L2-magnitudes are calculated across all behavioral embeddings: $$ N_{\max} = \max_{i \in \{1, \dots, K\}} \|\mathbf{e}_{b,i}\|_2, \qquad N_{\min} = \min_{i \in \{1, \dots, K\}} \|\mathbf{e}_{b,i}\|_2 $$ For the $j$-th embedding $\mathbf{e}_{b,j}$, its normalized L2-magnitude is computed as: $$ N_{\mathrm{norm}}(\mathbf{e}_{b,j}) = \frac{\|\mathbf{e}_{b,j}\|_2 - N_{\min}}{N_{\max} - N_{\min}} $$

  4. Weight Calculation: The alignment strength weight $w$ is then calculated using a rescaled sigmoid: $$ w = \frac{\sigma(\alpha N_{\mathrm{norm}}(\mathbf{e}_{b,j}) - \beta)}{\sigma(\alpha - \beta)} $$ Here, $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, and $\alpha$ and $\beta$ are hyperparameters that control the steepness and threshold of the curve, determining how rapidly alignment strength changes with information richness. In the experiments, $\alpha = 10$ and $\beta = 9$ were found to be optimal. This setup ensures that for items with low L2-magnitude (long-tail), $N_{\mathrm{norm}}$ is small, $\alpha N_{\mathrm{norm}} - \beta$ is strongly negative, and $w$ approaches 0; for popular items, $N_{\mathrm{norm}}$ is close to 1, so $w$ approaches 1.
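A minimal PyTorch sketch of the controller, assuming the dataset-wide magnitudes $N_{\min}$ and $N_{\max}$ have been precomputed over the behavior embedding matrix (function and argument names are illustrative):

```python
import torch

def alignment_strength(e_b, n_min, n_max, alpha=10.0, beta=9.0):
    """Per-item alignment weight w from the L2-magnitude of e_b.

    e_b:          (batch, D) pre-trained behavioral embeddings
    n_min, n_max: scalar dataset-wide min/max L2-magnitudes
    Returns w in (0, 1], near 0 for long-tail items and near 1 for head items.
    """
    mag = e_b.norm(dim=-1)                                # ||e_b||_2
    n_norm = (mag - n_min) / (n_max - n_min + 1e-8)       # normalize to [0, 1]
    w = torch.sigmoid(alpha * n_norm - beta)
    return w / torch.sigmoid(torch.tensor(alpha - beta))  # so w = 1 at n_norm = 1
```

With $\alpha = 10$ and $\beta = 9$, the weight stays close to zero until roughly $N_{\mathrm{norm}} \approx 0.6$ and then rises sharply, matching the paper's intent of smoothly decaying alignment for sparse-interaction items.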

Behavior-Content Contrastive Learning

This module uses contrastive learning to learn shared information between behavior and content. It's a two-stage process:

  1. Content Representation Unification: Contrastive learning is first applied to align the text ($h_t$) and image ($h_v$) modalities, yielding a unified content representation $h_c = h_t + h_v$. The content loss $\mathcal{L}_{content}$ is a symmetric InfoNCE loss that encourages positive pairs (text and vision from the same item) to be close and negative pairs (text from one item, vision from another) to be far apart: $$ \mathcal{L}_{content} = - \log \frac{\exp(\mathrm{sim}(h_t, h_{v^+}) / \tau)}{\exp(\mathrm{sim}(h_t, h_{v^+}) / \tau) + \sum_{i=1}^{B-1} \exp(\mathrm{sim}(h_t, h_{v_i^-}) / \tau)} - \log \frac{\exp(\mathrm{sim}(h_v, h_{t^+}) / \tau)}{\exp(\mathrm{sim}(h_v, h_{t^+}) / \tau) + \sum_{i=1}^{B-1} \exp(\mathrm{sim}(h_v, h_{t_i^-}) / \tau)} $$ Here, $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity; $h_{v^+}$ and $h_{t^+}$ are the positive visual and text representations for the current item, while $h_{v_i^-}$ and $h_{t_i^-}$ are negative samples from other items in the batch ($B$ is the batch size); $\tau$ is the temperature coefficient (set to 0.07).

  2. Behavior-Content Alignment: Contrastive learning is then performed between the behavioral representation $h_b$ and the unified content representation $h_c$ to maximize their mutual information. A positive pair is $(h_b, h_{c^+})$ (behavior and content from the same item), and negative pairs are $(h_b, h_{c_i^-})$ (behavior from the current item, content from other items): $$ \mathcal{L}_{align} = - \log \frac{\exp(\mathrm{sim}(h_b, h_{c^+}) / \tau)}{\exp(\mathrm{sim}(h_b, h_{c^+}) / \tau) + \sum_{i=1}^{B-1} \exp(\mathrm{sim}(h_b, h_{c_i^-}) / \tau)} $$

  3. Total Alignment Loss: The total alignment loss $\mathcal{L}_{align\_total}$ combines the content loss and the behavior-content alignment loss, weighted by the alignment strength controller's output $w$: $$ \mathcal{L}_{align\_total} = \mathcal{L}_{content} + w\, \mathcal{L}_{align} $$ This formulation is key for denoising: when $w$ is low (long-tail items), the influence of $\mathcal{L}_{align}$ is reduced, preventing noisy behavioral signals from corrupting content; when $w$ is high (popular items), strong alignment is encouraged. A sketch of the full weighted objective follows below.
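The following PyTorch sketch puts the two stages together using in-batch negatives and the per-item weight $w$; the helper name `info_nce` and the exact batching scheme are assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, tau=0.07):
    """In-batch InfoNCE: row i of `query` matches row i of `keys`; all other
    rows act as negatives. Returns the per-item loss, shape (batch,)."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau                               # (B, B) cosine sims
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels, reduction="none")

def alignment_total_loss(h_t, h_v, h_b, w, tau=0.07):
    """L_align_total = L_content + w * L_align, with a per-item weight w (B,)."""
    l_content = info_nce(h_t, h_v, tau) + info_nce(h_v, h_t, tau)  # symmetric
    h_c = h_t + h_v                                        # unified content rep
    l_align = info_nce(h_b, h_c, tau)
    return (l_content + w * l_align).mean()
```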

4.2.4. Dynamic Behavioral Router Mechanism

This mechanism addresses Signal Obscurity by dynamically weighting the importance of behavioral Semantic IDs based on the item's interaction frequency, thereby amplifying critical signals.

Behavior-Guided Dynamic Router

The behavior-guided dynamic router $R(\mathbf{e}_b)$ assigns calibrated importance scores to behavioral SIDs, up-weighting SIDs from head (popular) items and down-weighting those from long-tail items: $$ R(\mathbf{e}_b) = \sigma(N_{\mathrm{norm}}(\mathbf{e}_b)) * \mathrm{relu}(MLP(\mathbf{e}_b) + b) $$ Here:

  • $R(\mathbf{e}_b)$ is the output of the router, a vector of importance scores for the behavioral SIDs.

  • $\sigma(\cdot)$ is the sigmoid function, and $N_{\mathrm{norm}}(\mathbf{e}_b)$ is the normalized L2-magnitude of the behavioral embedding (as defined in Section 4.2.3). This acts as a magnitude-based scaler, mapping weights to $[0, 1]$ and calibrating them by information richness.

  • $MLP(\mathbf{e}_b)$ is a multi-layer perceptron that processes $\mathbf{e}_b$ to capture its specific semantic patterns.

  • $\mathrm{relu}(\cdot)$ is the rectified linear unit activation function. It induces exact-zero sparsity: whenever $MLP(\mathbf{e}_b) + b$ is negative, the output is zero, effectively deactivating the corresponding behavioral SIDs.

  • The multiplication $*$ is element-wise.

    This learnable gate amplifies critical collaborative signals while attenuating uninformative ones, leading to improved robustness. It's trained end-to-end without manual thresholds.
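A minimal PyTorch sketch of the router; the module name, layer sizes, and the way $N_{\min}$/$N_{\max}$ are passed in are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DynamicBehavioralRouter(nn.Module):
    """R(e_b) = sigmoid(N_norm(e_b)) * relu(MLP(e_b) + b).

    Outputs one importance score per behavioral SID; exact zeros from the
    ReLU deactivate uninformative behavioral SIDs."""
    def __init__(self, dim, n_behavior_sids, n_min, n_max):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_behavior_sids))
        self.bias = nn.Parameter(torch.zeros(n_behavior_sids))
        self.n_min, self.n_max = n_min, n_max     # dataset-wide magnitudes

    def forward(self, e_b):                       # e_b: (batch, dim)
        n_norm = (e_b.norm(dim=-1, keepdim=True) - self.n_min) \
                 / (self.n_max - self.n_min + 1e-8)
        scale = torch.sigmoid(n_norm)             # magnitude-based scaler
        return scale * torch.relu(self.mlp(e_b) + self.bias)  # (batch, N_b)
```

Scores of exactly zero can then be mapped to the padding token mentioned in Section 5.4, so the number of active behavioral SIDs varies per item.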

Sparsity Regularization

To further refine the dynamic router, a sparsity regularization loss is introduced. The goal is to encourage the router to produce sparser SID sequences (activate fewer behavioral SIDs) for long-tail items and denser sequences for popular items. The regularization loss $L_{reg}$ penalizes the deviation from an item-specific target sparsity, which is inversely proportional to the item's information richness (approximated by the L2-magnitude of $\mathbf{e}_b$). The paper provides a complex block of formulas for sparsity regularization and load balancing: $$ L_{reg_i} = \lambda_i \frac{1}{B} \sum_j^{N_b} \sum_j^{N_b} f_{ib}\, \|R(\mathbf{e}_b)_j\|_1 $$ $$ \lambda_i = \lambda_{i-1}\, \alpha^{\mathrm{sign}(s_i \alpha \sigma \epsilon^{-1} s_{current})} $$ $$ \lambda_{current} = 1 - \sum_j^{N_b} \mathbb{1}\{R(\mathbf{e}_b)_j > 0\} / N_b $$ $$ \lambda_{target} = \theta * \big(1 - N_{\mathrm{norm}}(\mathbf{e}_b) / (N_{\max} - N_{\min})\big) $$ $$ \mathcal{L}_{lb} = \frac{1}{(1 - s_{target})B} \sum_t^B \sum_j^{N_b} \mathbb{1}\{R(\mathbf{e}_b)_j > 0\} / N_b $$

  • Interpretation based on common MoE practices and the paper's description:
    • The first line seems to be a sparsity penalty term, possibly an L1 regularization (1||\cdot||_1) on the router's output R(eb)jR(\mathbf{e}_b)_j for each behavioral SID jj, scaled by a dynamic λi\lambda_i. The double summation jNbjNb\sum_j^{N_b} \sum_j^{N_b} might be a typo and should likely be a single summation over NbN_b behavioral SIDs, or perhaps refers to an interaction between different SIDs if fibf_{ib} is an interaction term. Assuming it's meant to regularize the L1L_1 norm of router outputs across a batch BB.

    • λi\lambda_i appears to be a dynamic sparsity coefficient that adjusts over iterations, influencing the strength of the sparsity penalty. The formula for λi\lambda_i looks like an adaptive learning rate adjustment based on a sign function and current vs. target sparsity, but its notation is highly condensed.

    • λcurrent\lambda_{current} seems to be the current sparsity of the router's output, calculated as 1 - (proportion of activated SIDs). 1{}1\{\cdot\} is an indicator function.

    • λtarget\lambda_{target} is the target sparsity for an item, which is proportional to 1 - normalized information richness. This means less rich items (long-tail) will have a higher target sparsity, encouraging fewer SIDs to be activated. The term (NmaxNmin)(N_{max} - N_{min}) in the denominator seems incorrect given Nnorm(eb)N_{\mathrm{norm}}(e_b) is already normalized to [0,1] using these values. It should likely be λtarget=θ(1Nnorm(eb))\lambda_{target} = \theta \ast (1 - N_{\mathrm{norm}}(e_b)), where θ\theta is a scaling factor.

    • lb\iint_{lb} is a load-balancing mechanism to prevent router collapse (where one expert or SID gets all the traffic). This loss ensures that activated behavioral SIDs are distributed somewhat evenly among available options, or at least that some minimum number of SIDs are activated to prevent collapse. The notation lb\iint_{lb} and its formula are also highly condensed and unusual, making a precise interpretation challenging without further context. Typically, load balancing in MoE aims to make the average utilization of experts equal.

      In summary, the dynamic router, guided by this complex sparsity and load-balancing regularization, produces a flexible and semantically rich representation. It captures diverse item facets while adaptively controlling the length and complexity of the representation to align with both the item's intrinsic properties and the demands of downstream tasks.
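Because the published formulas are condensed and partly garbled, the following is only a hedged sketch of one plausible reading of the core idea, i.e., penalizing the gap between the router's current sparsity and an item-specific target sparsity; every name here is an interpretive assumption, not the authors' code:

```python
import torch

def sparsity_regularizer(router_out, n_norm, theta=0.5):
    """One plausible reading of the sparsity regularization: penalize the gap
    between the router's current sparsity and an item-specific target sparsity
    that grows as behavioral information richness shrinks. `theta` is the
    scaling factor from the paper; the rest is an interpretive assumption.

    router_out: (batch, N_b) scores R(e_b) from the dynamic router
    n_norm:     (batch,)     normalized L2-magnitudes N_norm(e_b)
    """
    active = (router_out > 0).float()           # indicator 1{R(e_b)_j > 0}
    s_current = 1.0 - active.mean(dim=-1)       # per-item current sparsity
    s_target = theta * (1.0 - n_norm)           # long-tail items -> sparser
    gap_penalty = (s_current - s_target).abs().mean()
    # The indicator is non-differentiable, so an L1 term on the raw scores
    # supplies the differentiable pressure toward exact zeros.
    l1_penalty = router_out.abs().mean()
    return gap_penalty + l1_penalty
```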


ADA-SID Framework

Figure 1: Schematic diagram of the ADA-SID framework (Figure 2 from the original paper).


The above Figure 1 (Figure 2 in the original paper) illustrates the overall ADA-SID framework. It shows how visual, text, and collaborative filtering (behavioral) encoders generate embeddings. These are fed into shared experts and specific experts for quantization. The adaptive behavior-content alignment module (bottom left) controls the influence of behavioral signals based on information richness. The dynamic behavioral router (bottom right) applies adaptive weights to the behavioral SIDs. The final SIDs are then used for generative or discriminative recommendation.

5. Experimental Setup

5.1. Datasets

The authors evaluate ADA-SID on both an industrial dataset and a public dataset.

  • Industrial Dataset:

    • Source: Collected from a leading e-commerce advertising platform in Southeast Asia.
    • Period: October 2024 to May 2025, i.e., the months immediately preceding the paper's October 2025 publication.
    • Scale: Encompasses 35,154,135 users and 48,106,880 advertisements (items).
    • Interactions: Contains 75,730,321,793 user interactions, indicating a massive scale.
    • Content: Rich, multimodal item content (images, titles, descriptions, etc.).
    • Characteristics: User behavior sequences have an average length of 128. Its scale and complexity make it suitable for evaluating real-world performance.
  • Public Dataset:

    • Name: "Beauty" subset of the Amazon Product Reviews dataset [38].

    • Scale: 22,363 users, 12,101 items.

    • Interactions: 198,360 interactions.

    • Usage for Generative Retrieval: A 5-core filter is applied (users/items with at least 5 interactions), and chronological user sequences with a maximum length of 20 are constructed.

    • Usage for Discriminative Ranking: Ratings are binarized (positive if the rating is greater than 3, negative otherwise). A chronological 90%/10% split is used for training and testing.

      The following are the statistics from [Table 1] of the original paper:

| Dataset | Industrial Dataset | Beauty |
| --- | --- | --- |
| #User | 35,154,135 | 22,363 |
| #Item | 48,106,880 | 12,101 |
| #Interaction | 75,730,321,793 | 198,360 |

These datasets were chosen to cover both large-scale industrial scenarios, reflecting real-world complexity and volume, and a commonly used public benchmark, allowing for comparison with prior academic research. This dual evaluation approach strengthens the claims of the method's effectiveness and scalability.

5.2. Evaluation Metrics

The effectiveness of ADA-SID is evaluated using both quantization metrics and recommendation metrics.

5.2.1. Quantization Metrics

  1. Reconstruction Loss ($\mathcal{L}_{recon}$):

    • Conceptual Definition: Measures how accurately the model can reconstruct the original input vector from its quantized (compressed) representation. Lower values indicate higher fidelity, meaning less information is lost during quantization.
    • Mathematical Formula: For a given input vector $\mathbf{x}$ and its reconstructed version $\hat{\mathbf{x}}$, the reconstruction loss is typically calculated using the mean squared error (MSE): $$ \mathcal{L}_{recon} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 $$
    • Symbol Explanation:
      • $\mathbf{x}$: the original input vector (e.g., concatenated pre-trained embeddings).
      • $\hat{\mathbf{x}}$: the reconstructed vector obtained from the decoder after quantization.
      • $\|\cdot\|^2$: the squared L2 norm, representing the squared Euclidean distance.
  2. Token Distribution Entropy (Entropy):

    • Conceptual Definition: Measures the diversity and balance of the distribution of Semantic IDs (tokens/codewords) across the codebooks. Higher entropy indicates that the model is utilizing a wider range of SIDs more uniformly, which is generally desirable for rich and expressive representations. Low entropy suggests that only a few SIDs are predominantly used.
    • Mathematical Formula: For a discrete probability distribution $P = \{p_1, p_2, \dots, p_K\}$ over $K$ codewords, the Shannon entropy is: $$ H(P) = -\sum_{i=1}^{K} p_i \log_2(p_i) $$
    • Symbol Explanation:
      • $K$: the total number of unique codewords in the codebook.
      • $p_i$: the probability of selecting the $i$-th codeword (e.g., its frequency of use divided by the total number of codeword assignments).
      • $\log_2$: logarithm base 2.
  3. Codebook Utilization (Util.):

    • Conceptual Definition: Reflects the efficiency with which the model uses the vectors in its codebook. It's the percentage of codewords in the codebook that have been used (assigned to at least one input vector) during training. A value close to 1 (or 100%) indicates that most or all codewords are being actively learned and used, which is good for leveraging the full capacity of the codebook. Low utilization might suggest redundant codewords or issues in the quantization process.
    • Mathematical Formula: $$ \text{Utilization} = \frac{\text{Number of used codewords}}{\text{Total number of codewords in the codebook}} $$
    • Symbol Explanation:
      • Number of used codewords: Count of unique codewords that have been assigned to at least one input embedding.
      • Total number of codewords in the codebook: The predefined size of the codebook.
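Both metrics are easy to compute from the codeword assignments produced during training; here is a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def codebook_stats(assignments, codebook_size):
    """Token distribution entropy and codebook utilization from a list of
    codeword assignments (one integer SID per quantized embedding)."""
    counts = np.bincount(assignments, minlength=codebook_size)
    probs = counts / counts.sum()                  # p_i = usage frequency
    nonzero = probs[probs > 0]
    entropy = -(nonzero * np.log2(nonzero)).sum()  # H(P) in bits
    utilization = (counts > 0).mean()              # fraction of used codewords
    return entropy, utilization

# Example: 6 assignments over a codebook of size 4 (codeword 3 unused)
# -> entropy ~1.46 bits, utilization 0.75
print(codebook_stats(np.array([0, 0, 1, 1, 1, 2]), codebook_size=4))
```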

5.2.2. Recommendation Metrics

Generative Retrieval Tasks

For generative retrieval, the goal is often to generate a sequence of items.

  1. Recall@N (R@N):

    • Conceptual Definition: Measures the proportion of relevant items that are successfully retrieved within the top NN recommendations. It indicates how many of the actual target items are present in the recommended list. Higher values are better.
    • Mathematical Formula: $$ \text{Recall@N} = \frac{|\{\text{relevant items}\} \cap \{\text{recommended items within top } N\}|}{|\{\text{relevant items}\}|} $$
    • Symbol Explanation:
      • relevant items: The set of items that are truly relevant to the user (e.g., items the user actually interacted with in the test set).
      • recommended items within top N: The set of $N$ items recommended by the system.
      • $|\cdot|$: cardinality of a set.
      • $\cap$: set intersection.
  2. Normalized Discounted Cumulative Gain@N (NDCG@N):

    • Conceptual Definition: A ranking quality metric that accounts for the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher up in the list. It is "normalized" by comparing the DCG (Discounted Cumulative Gain) to an Ideal DCG (IDCG), which is the DCG of a perfectly ordered list of relevant items. Higher values are better.
    • Mathematical Formula: $$ \text{DCG@N} = \sum_{i=1}^{N} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}, \qquad \text{IDCG@N} = \sum_{i=1}^{N} \frac{2^{\text{rel}_{i,ideal}} - 1}{\log_2(i+1)}, \qquad \text{NDCG@N} = \frac{\text{DCG@N}}{\text{IDCG@N}} $$
    • Symbol Explanation:
      • $N$: the number of top recommendations considered.
      • $\text{rel}_i$: the relevance score of the item at position $i$ in the recommended list (e.g., 1 for relevant, 0 for irrelevant).
      • $\text{rel}_{i,ideal}$: the relevance score of the item at position $i$ in the ideal (perfectly sorted by relevance) list.
      • $\log_2(i+1)$: the discount factor; relevance at lower ranks is discounted.
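A small NumPy sketch of both retrieval metrics for the binary-relevance case used here (the function names are illustrative):

```python
import numpy as np

def recall_at_n(recommended, relevant, n):
    """Fraction of the relevant set retrieved within the top-n list."""
    hits = len(set(recommended[:n]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_n(recommended, relevant, n):
    """Binary-relevance NDCG@N: DCG of the ranked list divided by the DCG
    of an ideal ranking that places all relevant items at the top."""
    relevant = set(relevant)
    rel = [1.0 if item in relevant else 0.0 for item in recommended[:n]]
    dcg = sum((2**r - 1) / np.log2(i + 2) for i, r in enumerate(rel))
    n_ideal = min(len(relevant), n)   # ideal list: all ones first
    idcg = sum(1.0 / np.log2(i + 2) for i in range(n_ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant item "b" is ranked 2nd in a top-3 list.
print(recall_at_n(["a", "b", "c"], ["b"], 3))  # 1.0
print(ndcg_at_n(["a", "b", "c"], ["b"], 3))    # ~0.631 = 1/log2(3)
```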

Discriminative Ranking Tasks

For discriminative ranking, the goal is to predict user preferences or likelihood of interaction.

  1. Area Under the Receiver Operating Characteristic Curve (AUC):

    • Conceptual Definition: Measures the ability of a model to distinguish between positive and negative classes (e.g., whether a user will click on an item or not). It represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance. Values range from 0 to 1, with 0.5 being random performance and 1 being perfect classification. Higher values are better.
    • Mathematical Formula: Conceptually, AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings: $$ \text{AUC} = \int_{0}^{1} \text{TPR}(t)\, \mathrm{d}\,\text{FPR}(t) $$
    • Symbol Explanation:
      • $\text{TPR}(t)$: True Positive Rate (sensitivity) at threshold $t$, calculated as $\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$.
      • $\text{FPR}(t)$: False Positive Rate (1 - specificity) at threshold $t$, calculated as $\frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$.
  2. Group AUC (GAUC):

    • Conceptual Definition: A weighted average of AUC values calculated per user (or per impression session/group). It is particularly useful in recommendation systems where user-specific performance is important, as it mitigates the impact of users with many interactions on the overall AUC score. It represents the probability that a randomly chosen positive item will be ranked higher than a randomly chosen negative item for the same user. Higher values are better.
    • Mathematical Formula: $$ \text{GAUC} = \frac{\sum_{u \in U} \text{weight}_u \cdot \text{AUC}_u}{\sum_{u \in U} \text{weight}_u} $$
    • Symbol Explanation:
      • $U$: the set of all users.
      • $\text{AUC}_u$: the AUC score for user $u$.
      • $\text{weight}_u$: a weighting factor for user $u$, often the number of impressions or positive samples for that user.
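A minimal sketch of impression-weighted GAUC using scikit-learn's `roc_auc_score`; skipping users whose labels are all one class (where AUC is undefined) is a common convention and an assumption here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC."""
    user_ids, labels, scores = map(np.asarray, (user_ids, labels, scores))
    total, weight_sum = 0.0, 0.0
    for u in np.unique(user_ids):
        mask = user_ids == u
        y = labels[mask]
        if y.min() == y.max():          # single-class user: AUC undefined
            continue
        w = mask.sum()                  # weight = number of impressions
        total += w * roc_auc_score(y, scores[mask])
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else float("nan")
```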

Online Experiment Metrics

For online experiments (A/B tests):

  1. Advertising Revenue (AR):
    • Conceptual Definition: A key business metric measuring the total revenue generated from advertisements shown to users in the experimental group.
    • Calculation: Typically sum of revenue generated per click or impression, often related to bids and ad value.
  2. Click-Through Rate (CTR):
    • Conceptual Definition: The percentage of users who click on a recommended item after viewing it. It measures the effectiveness of recommendations in capturing user interest.
    • Mathematical Formula: $$ \text{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}} \times 100\% $$
    • Symbol Explanation:
      • Number of Clicks: Total clicks on recommended items.
      • Number of Impressions: Total times recommended items were shown.

5.3. Baselines

The proposed ADA-SID method is compared against several state-of-the-art (SOTA) Semantic ID generation approaches, categorized into two groups, plus the traditional Item ID.

  1. Item ID:

    • Description: The conventional approach where each item has a unique identifier. Used as a baseline specifically for discriminative ranking tasks, where it traditionally serves as the item representation. In generative retrieval, SIDs act as the item identifier.
  2. Content-based SIDs: These methods quantize only content features.

    • RQ-VAE [40]: (From TIGER [22]) Uses RQ-VAE to quantize textual embeddings (derived from content features like titles, descriptions, categories using a pre-trained LLM) into hierarchical SIDs.
    • OPQ [14]: (From RPG [25]) Utilizes Optimized Product Quantization to convert pre-trained textual embeddings into a tuple of unordered SIDs.
  3. Behavior-content Aligned SIDs: These methods attempt to incorporate collaborative (behavioral) signals.

    • RQ-Kmeans [43]: (From OneRec [39]) Integrates RQ-VAE and K-means to quantize behavior-finetuned multimodal representations. K-means is applied to residuals in a coarse-to-fine manner.
    • DAS [42]: Introduces multi-view contrastive alignment to maximize mutual information between SIDs and collaborative signals. It generates hierarchical, behavior-aware content SIDs using RQ-VAE.
    • LETTER [27]: Integrates hierarchical semantics, collaborative signals, and code assignment diversity to generate behavior-content-fused SIDs using RQ-VAE.
    • RQ-VAE++: An extension introduced by the authors. It incorporates both pre-trained content representation and pre-trained collaborative representation into the RQ-VAE framework. It generates collaborative, textual, and visual SIDs for each item, highlighting the importance of collaborative information.
    • MM-RQ-VAE [17]: Generates separate collaborative, textual, and visual SIDs from pre-trained collaborative and multimodal embeddings. It also employs contrastive learning for behavior-content alignment.

5.4. Experiment Setup

  • Recommendation Foundations:

    • Generative Retrieval: The REG4Rec [18] model (a strong multi-token prediction model) is adopted as the base framework.
    • Discriminative Ranking: The Parameter Personalized Network (PPNet) [62] is used as the backbone architecture.
  • Implementation Details:

    • Industrial Dataset: The codebook size is set to 300 and the SID length to 8. For ADA-SID specifically: $N_s = 2$ (shared experts), $N_t = 2$ (text experts), $N_v = 2$ (vision experts), $N_b = 6$ (behavior experts); the target sparsity is $s_{target} = \frac{1}{3}$.
    • Public Dataset (Amazon Beauty): The codebook size is set to 100 and the SID length to 6. For ADA-SID specifically: $N_s = 1$, $N_t = 1$, $N_v = 1$, $N_b = 5$; the target sparsity is $s_{target} = \frac{3}{5}$.
    • Pre-trained Representations: Obtained from Qwen3-Embedding 7B [15] (likely for text), SASRec [16] (for behavior), and PailiTAO v8 (an e-commerce advertising platform's internal model, likely for vision or combined content).
    • Behavioral SID Activation: Behavioral SIDs with scores above a threshold (set to 0 in this paper) are retained, while those below are replaced with a padding token. This threshold can be adjusted for specific recommendation task requirements.
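A minimal sketch of this activation rule, assuming per-SID router scores are available as a tensor; the padding-token id and the shapes are illustrative, not taken from the paper:

```python
import torch

PAD_TOKEN = 0  # hypothetical padding id; the paper does not specify its value

def activate_behavioral_sids(sids: torch.Tensor,
                             scores: torch.Tensor,
                             threshold: float = 0.0) -> torch.Tensor:
    """Keep behavioral SIDs whose router score exceeds the threshold;
    replace the rest with a padding token (threshold=0 as in the paper)."""
    return torch.where(scores > threshold, sids, torch.full_like(sids, PAD_TOKEN))

# Toy usage: 6 behavioral SIDs for one item, signed router scores
sids = torch.tensor([12, 7, 256, 33, 90, 4])
scores = torch.tensor([0.8, -0.1, 1.3, 0.0, -2.0, 0.4])
print(activate_behavioral_sids(sids, scores))  # tensor([ 12,   0, 256,   0,   0,   4])
```

Raising the threshold above 0 prunes more behavioral SIDs, which is the per-task knob the authors mention.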

6. Results & Analysis

6.1. Core Results Analysis (RQ1)

The overall performance comparison of ADA-SID against baselines on both generative retrieval and discriminative ranking tasks is summarized in Table 2 (a) and (b).

The following are the results from [Table 2 (a)] of the original paper:

Industrial Dataset:

| Methods | Lrecon↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RQ-VAE | 0.0033 | 4.2481 | 1.0000 | 0.1854 | 0.2083 | 0.1337 | 0.1421 |
| OPQ | 0.0038 | 4.3981 | 0.7563 | 0.1972 | 0.2104 | 0.1491 | 0.1518 |
| RQ-Kmeans | 0.0065 | 4.7232 | 1.0000 | 0.1844 | 0.2202 | 0.1462 | 0.1578 |
| LETTER | 0.0054 | 4.2072 | 1.0000 | 0.1812 | 0.2213 | 0.1582 | 0.1675 |
| DAS | 0.0051 | 4.3539 | 1.0000 | 0.1864 | 0.2237 | 0.1576 | 0.1697 |
| RQ-VAE++ | 0.0034 | 3.5566 | 0.9283 | 0.2254 | 0.2709 | 0.1628 | 0.1706 |
| MM-RQ-VAE | 0.0055 | 4.2125 | 0.9850 | 0.2181 | 0.2542 | 0.1592 | 0.1707 |
| ADA-SID (Ours) | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 |
| Improv. | +3.03% | +7.92% | +0.00% | +22.45% | +7.53% | +3.56% | +0.23% |

Amazon Beauty:

| Methods | Lrecon↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RQ-VAE | 0.6028 | 3.4904 | 0.9900 | 0.1213 | 0.2398 | 0.0803 | 0.1304 |
| OPQ | 0.9647 | 3.3980 | 0.9600 | 0.1117 | 0.2189 | 0.0802 | 0.1302 |
| RQ-Kmeans | 0.6240 | 1.7100 | 1.0000 | 0.1385 | 0.2398 | 0.0843 | 0.1507 |
| LETTER | 0.5431 | 2.6819 | 1.0000 | 0.1513 | 0.2492 | 0.0937 | 0.1453 |
| DAS | 0.5432 | 3.6819 | 1.0000 | 0.1503 | 0.2403 | 0.0933 | 0.1445 |
| RQ-VAE++ | 0.6028 | 3.4904 | 0.9900 | 0.1683 | 0.2991 | 0.0943 | 0.1507 |
| MM-RQ-VAE | 0.5081 | 2.8449 | 0.9950 | 0.1674 | 0.2596 | 0.0915 | 0.1322 |
| ADA-SID (Ours) | 0.4470 | 4.4206 | 1.0000 | 0.1855 | 0.3675 | 0.0996 | 0.1784 |
| Improv. | +12.02% | +20.06% | +0.00% | +10.21% | +22.86% | +5.62% | +18.38% |

The following are the results from [Table 2 (b)] of the original paper:

Industrial Dataset:

| Methods | Lrecon↓ | Entropy↑ | Util.↑ | AUC↑ | GAUC↑ |
| --- | --- | --- | --- | --- | --- |
| Item ID | - | - | - | 0.7078 | 0.5845 |
| RQ-VAE | 0.0033 | 4.2481 | 1.0000 | 0.7071 | 0.5805 |
| OPQ | 0.0038 | 4.3981 | 0.7563 | 0.7086 | 0.5829 |
| RQ-Kmeans | 0.0065 | 4.7232 | 1.0000 | 0.7089 | 0.5832 |
| LETTER | 0.0054 | 4.2072 | 1.0000 | 0.7089 | 0.5828 |
| DAS | 0.0051 | 4.3539 | 1.0000 | 0.7091 | 0.5845 |
| RQ-VAE++ | 0.0034 | 3.5566 | 0.9283 | 0.7100 | 0.5838 |
| MM-RQ-VAE | 0.0055 | 4.2125 | 0.9850 | 0.7095 | 0.5843 |
| ADA-SID (Ours) | 0.0032 | 5.0977 | 1.0000 | 0.7101 | 0.5846 |
| Improv. | +3.03% | +7.92% | +0.00% | +0.07% | +0.02% |

Amazon Beauty:

| Methods | Lrecon↓ | Entropy↑ | Util.↑ | AUC↑ | GAUC↑ |
| --- | --- | --- | --- | --- | --- |
| Item ID | - | - | - | 0.6455 | 0.5897 |
| RQ-VAE | 0.6028 | 3.4904 | 0.9900 | 0.6446 | 0.5852 |
| OPQ | 0.9647 | 3.3980 | 0.9600 | 0.6449 | 0.5898 |
| RQ-Kmeans | 0.6240 | 1.7100 | 1.0000 | 0.6472 | 0.5999 |
| LETTER | 0.5431 | 2.6819 | 1.0000 | 0.6444 | 0.5973 |
| DAS | 0.5432 | 3.6819 | 1.0000 | 0.6466 | 0.5933 |
| RQ-VAE++ | 0.6028 | 3.4904 | 0.9900 | 0.6466 | 0.5952 |
| MM-RQ-VAE | 0.5081 | 2.8449 | 0.9950 | 0.6453 | 0.5991 |
| ADA-SID (Ours) | 0.4470 | 4.4206 | 1.0000 | 0.6480 | 0.6125 |
| Improv. | +12.02% | +20.06% | +0.00% | +0.12% | +2.10% |

Key Observations:

  1. Superiority of Behavior-Content Aligned SIDs over Content-Only SIDs:

    • Methods that incorporate behavioral information (e.g., RQ-Kmeans, LETTER, DAS, RQ-VAE++, MM-RQ-VAE) consistently outperform purely content-based SIDs (RQ-VAE, OPQ) across Recall@100, NDCG@100, and AUC metrics. This strongly validates the necessity of integrating dynamic behavioral signals to overcome the limitations of static content. For instance, on the Industrial Dataset in generative retrieval, RQ-VAE achieves R@100 of 0.2083, while RQ-VAE++ (with collaborative info) reaches 0.2709, a substantial improvement.
    • This highlights that content alone cannot capture the evolving preferences and dynamic nature of user-item interactions.
  2. Critical Value of Collaborative Information (RQ-VAE++):

    • The RQ-VAE++ baseline, which explicitly integrates both pre-trained content and collaborative representations into RQ-VAE, shows a significant performance lift compared to the original RQ-VAE. This directly demonstrates that simply adding collaborative information, even without sophisticated adaptive mechanisms, is highly beneficial. For example, RQ-VAE++ improves R@100 on the Industrial Dataset by 30% over RQ-VAE (0.2709 vs 0.2083).
  3. Effectiveness of Dedicated Collaborative SIDs (MM-RQ-VAE, RQ-VAE++):

    • Comparing different alignment strategies, the results suggest that explicitly generating dedicated SIDs for collaborative signals (as done by MM-RQ-VAE and RQ-VAE++) is generally more effective at capturing complex interaction patterns than more implicit alignment approaches (like LETTER or DAS). This leads to better downstream performance, indicating the richness of behavioral data benefits from its own specialized representation.
  4. Overall Superiority of ADA-SID:

    • ADA-SID consistently achieves the best performance across all evaluated metrics and datasets, for both generative retrieval and discriminative ranking tasks.

    • Generative Retrieval (Table 2a):

      • On the Industrial Dataset, ADA-SID improves R@50 by +22.45%, R@100 by +7.53%, N@50 by +3.56%, and N@100 by +0.23% over the strongest baseline for each metric (chiefly RQ-VAE++ and MM-RQ-VAE).
      • On Amazon Beauty, the gains are even larger: R@50 by +10.21%, R@100 by +22.86%, N@50 by +5.62%, and N@100 by +18.38% over the strongest baselines (again RQ-VAE++ and MM-RQ-VAE).
    • Discriminative Ranking (Table 2b):

      • On the Industrial Dataset, ADA-SID achieves the highest AUC (0.7101) and GAUC (0.5846), marginally outperforming Item ID and other SIDs, demonstrating its ability to be effective even in traditional ranking setups.
      • On Amazon Beauty, ADA-SID improves AUC by +0.12% and GAUC by +2.10% over the best baseline (RQ-Kmeans on both metrics).
    • Quantization Metrics: ADA-SID also shows strong Entropy (5.0977 Industrial, 4.4206 Beauty) and perfect Codebook Utilization (1.0000 on both), indicating rich and diverse SID representations, while maintaining low Reconstruction Loss. This suggests that its adaptive mechanisms are effective in generating high-quality SIDs. (A sketch of how these two quantization metrics are typically computed follows at the end of this subsection.)

      These results strongly validate ADA-SID's unique design, which intelligently fuses content and behavior information by assessing information richness, adaptively amplifying critical signals, and suppressing noise. This leads to more robust and expressive item representations crucial for diverse recommendation tasks.
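For readers who want to reproduce the quantization metrics, here is a minimal sketch under the standard definitions (Shannon entropy of the empirical code-usage distribution, and the fraction of codewords used at least once); the paper may aggregate these over SID levels:

```python
import numpy as np

def codebook_stats(assignments: np.ndarray, codebook_size: int) -> tuple[float, float]:
    """Return (entropy, utilization) of codeword usage for one SID level.
    assignments: 1-D array of codeword indices assigned to items."""
    counts = np.bincount(assignments, minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy = float(-(nonzero * np.log(nonzero)).sum())   # nats; higher = more uniform usage
    utilization = float((counts > 0).mean())              # fraction of codewords ever used
    return entropy, utilization

# Toy usage: 10,000 items over a 300-entry codebook
rng = np.random.default_rng(0)
ent, util = codebook_stats(rng.integers(0, 300, size=10_000), 300)
print(f"entropy={ent:.4f}, utilization={util:.4f}")
```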

6.2. Ablation Studies / Parameter Analysis (RQ2)

Ablation experiments were conducted on the Industrial Dataset to understand the contribution of each component of ADA-SID.

The following are the results from [Table 3] of the original paper:

| Variants | Lrecon↓ | Entropy↑ | Util.↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ | AUC↑ | GAUC↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADA-SID | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 | 0.7101 | 0.5846 |
| w/o Alignment Strength Controller | 0.0032 | 5.0710 | 1.0000 | 0.2701 | 0.2854 | 0.1618 | 0.1643 | 0.7104 | 0.5845 |
| w/o Behavior-content Contrastive Learning | 0.0032 | 5.1153 | 1.0000 | 0.2733 | 0.2874 | 0.1653 | 0.1676 | 0.7097 | 0.5846 |
| w/o Sparsity Regularization | 0.0034 | 5.0571 | 1.0000 | 0.2757 | 0.2903 | 0.1675 | 0.1698 | 0.7097 | 0.5846 |
| w/o Behavior-Guided Dynamic Router | 0.0033 | 5.0896 | 1.0000 | 0.2705 | 0.2861 | 0.1616 | 0.1641 | 0.7098 | 0.5845 |

Impact of Adaptive Behavior-Content Alignment:

  • w/o Alignment Strength Controller: Removing this controller leads to a performance drop across all generative retrieval metrics (e.g., R@50 drops from 0.2772 to 0.2701, N@100 from 0.1714 to 0.1643). This confirms that indiscriminately aligning content with behavioral embeddings is detrimental. The controller's role in suppressing noise corruption from long-tail items by dynamically adjusting alignment intensity is crucial.
  • w/o Behavior-content Contrastive Learning: Disabling the behavior-content contrastive objective results in a consistent drop in Recall and NDCG (e.g., R@50 from 0.2772 to 0.2733, N@100 from 0.1714 to 0.1676). This indicates a substantial modality gap between the content and behavioral domains, which the contrastive learning component is vital for bridging to achieve effective information fusion. The slight increase in Entropy (from 5.0977 to 5.1153) may suggest that without proper alignment, SIDs become more diverse but less semantically meaningful in a multi-modal context.

Impact of Dynamic Behavioral Router Mechanism:

  • w/o Behavior-Guided Dynamic Router: Removing this component leads to a notable performance degradation across generative retrieval metrics (e.g., R@50 drops from 0.2772 to 0.2705, N@100 from 0.1714 to 0.1641) and also slightly impacts discriminative ranking. This demonstrates that the router's ability to estimate and weight collaborative signals based on information richness is essential for recommendation accuracy. Without it, the model cannot effectively amplify critical signals or attenuate uninformative ones.
  • w/o Sparsity Regularization: Disabling sparsity regularization causes a performance drop (e.g., R@50 from 0.2772 to 0.2757, N@100 from 0.1714 to 0.1698). This component plays two critical roles (a toy sketch of a router-with-sparsity-target objective follows at the end of this subsection):
    1. It encourages sparse activations (selecting fewer SIDs) for long-tail items, leading to more specialized and disentangled representations for each SID and increasing the model's overall capacity, akin to Mixture-of-Experts (MoE) models.

    2. The item-specific sparsity target ensures that the model allocates its representational budget wisely, using fewer, high-level SIDs for long-tail items and more detailed SIDs for popular items. Its absence leads to less expressive and less adaptive representations.

      The ablation studies confirm that both the adaptive behavior-content alignment and the dynamic behavioral router (with sparsity regularization) are crucial components contributing significantly to ADA-SID's overall superior performance.
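To make the router-plus-sparsity idea tangible, below is a minimal, hypothetical sketch of a gating network whose mean activation is pulled toward an item-specific sparsity target. The linear gate, the squared-error penalty, and the target values are assumptions for illustration; the paper's actual objective, with its dynamic $\lambda_i$ and load-balancing term, is more involved:

```python
import torch
import torch.nn as nn

class BehavioralRouter(nn.Module):
    """Toy dynamic router: scores each behavioral SID slot from the item's
    behavioral embedding, then penalizes deviation from a sparsity target."""

    def __init__(self, emb_dim: int, num_behavior_sids: int = 6):
        super().__init__()
        self.gate = nn.Linear(emb_dim, num_behavior_sids)

    def forward(self, behavior_emb: torch.Tensor, target_sparsity: torch.Tensor):
        scores = self.gate(behavior_emb)               # (batch, num_sids)
        active = torch.sigmoid(scores)                 # soft activation in (0, 1)
        # Penalize each item's mean activation for straying from its target
        # (long-tail items get a lower target -> fewer active behavioral SIDs).
        sparsity_loss = ((active.mean(dim=1) - target_sparsity) ** 2).mean()
        return scores, sparsity_loss

# Toy usage: popular items get target 1.0 (all SIDs active), tail items 1/3
router = BehavioralRouter(emb_dim=64, num_behavior_sids=6)
emb = torch.randn(4, 64)
targets = torch.tensor([1.0, 1.0, 1 / 3, 1 / 3])
scores, loss = router(emb, targets)
print(scores.shape, loss.item())
```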

6.3. Hyper-Parameter Analysis (RQ3)

6.3.1. The Strength of Sparsity Regularization

The paper investigates the impact of sparsity regularization strength on ADA-SID's performance.


*Figure 2: Hyper-Parameter Analysis on Sparsity Regularization (Figure 3 from the original paper).*


Figure 2 (Figure 3 in the paper) shows that as the sparsity intensity (the regularization strength on the x-axis) is reduced, performance generally improves across R@50, R@100, N@50, N@100, AUC, and GAUC. A weaker sparsity penalty allows more behavioral SIDs to remain active, yielding a larger effective encoder capacity and stronger encoding ability, especially for head items with complex behavioral patterns. The paper highlights that ADA-SID's variable-length flexibility (head items can use longer collaborative SID sequences) lets it convert this extra capacity into richer representations and significant downstream gains.

6.3.2. Sensitivity to Alignment Strength Controller Hyperparameters

A hyperparameter study was conducted on $(\alpha, \beta)$ for the Alignment Strength Controller, which defines the weighting curve for behavior-content alignment.

The following are the results from [Table 4] of the original paper:

| Variants | Lrecon↓ | Entropy↑ | Utilization↑ | R@50↑ | R@100↑ | N@50↑ | N@100↑ | AUC↑ | GAUC↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\alpha=20, \beta=7$ | 0.0032 | 5.0844 | 1.0000 | 0.2749 | 0.2894 | 0.1664 | 0.1688 | 0.7106 | 0.5840 |
| $\alpha=20, \beta=9$ | 0.0032 | 5.0711 | 1.0000 | 0.2750 | 0.2889 | 0.1677 | 0.1709 | 0.7105 | 0.5842 |
| $\alpha=20, \beta=14$ | 0.0033 | 5.0967 | 1.0000 | 0.2760 | 0.2911 | 0.1686 | 0.1707 | 0.7105 | 0.5839 |
| $\alpha=10, \beta=9$ | 0.0032 | 5.0977 | 1.0000 | 0.2772 | 0.2926 | 0.1689 | 0.1714 | 0.7101 | 0.5846 |

*Figure 3: Illustration of the alignment strength controller with different hyperparameters $(\alpha, \beta)$ (Figure 4 from the original paper).*


Table 4 presents results for four configurations of $(\alpha, \beta)$, and Figure 3 (Figure 4 in the paper) illustrates the corresponding alignment strength curves.

  • The setting $(\alpha = 10, \beta = 9)$ achieved the highest recommendation accuracy (R@50, R@100, N@50, N@100, GAUC).
  • This optimum suggests that, for the Industrial Dataset's distribution, noise filtering is most effective when applied to roughly the 40% least frequent (long-tail) items. Figure 3 visually confirms how different $(\alpha, \beta)$ pairs shift the curve, controlling the threshold and steepness at which items transition from low to high alignment strength (a toy sketch of one plausible controller shape follows this list).
  • The tunability of these parameters demonstrates the flexibility of ADA-SID to adapt to diverse data landscapes and fine-tune the denoising process.
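The controller's exact functional form is given in the paper's method section, which is not reproduced here. Purely for intuition, the sketch below assumes a sigmoid shape $w(x) = \sigma(\alpha x - \beta)$ over a normalized popularity score $x$, which reproduces the qualitative behavior described above ($\beta/\alpha$ sets the transition point, $\alpha$ the steepness); the input scaling is an assumption:

```python
import numpy as np

def alignment_strength(x: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Hypothetical controller: sigmoid over a normalized popularity score x,
    so low-popularity (long-tail) items receive weak behavior-content alignment."""
    return 1.0 / (1.0 + np.exp(-(alpha * x - beta)))

x = np.linspace(0.0, 2.0, 5)  # assumed normalized popularity scale
for a, b in [(20, 7), (20, 9), (20, 14), (10, 9)]:
    w = alignment_strength(x, a, b)
    print(f"alpha={a:2d}, beta={b:2d} -> transition near x={b / a:.2f}, w={np.round(w, 3)}")
```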

6.4. Item Popularity Stratified Performance Analysis (RQ4)

To understand how ADA-SID performs across different item popularity levels, a stratified analysis was conducted by categorizing items into popular (top 25% by impression counts) and long-tail (bottom 25% by impression counts) groups. The AUC for each group was then evaluated (a reproduction sketch follows at the end of this subsection).


*Figure 4: Item Popularity Stratified Performance Comparison (Figure 5 from the original paper).*


Figure 4 (Figure 5 in the paper) illustrates the performance comparison:

  • For Head Items (Popular Items):

    • Content-based SIDs (e.g., RQ-VAE) underperform traditional Item IDs in ranking tasks. This is because Item IDs can learn highly independent and specialized representations for popular items due to their abundant interaction data, something content alone cannot fully capture.
    • Integrating collaborative information significantly improves performance, as it enhances the SIDs' expressiveness to capture complex behavioral patterns. Methods like RQ-VAE++ and MM-RQ-VAE show clear gains.
    • ADA-SID further advances this by explicitly aligning, denoising, and amplifying the fusion of content and behavioral modalities. This leads to a more expressive semantic representation, surpassing even Item IDs for popular items.
  • For Tail Items (Long-Tail Items):

    • All SID-based methods (including ADA-SID and baselines) outperform Item IDs. This is a classic strength of SIDs: by sharing knowledge across semantically similar items, SIDs can make meaningful recommendations even for items with scarce interaction data, where Item IDs would fail due to insufficient data for learning.

    • ADA-SID achieves the largest performance gain among all methods on long-tail items. Its adaptive alignment mechanism is crucial here, as it effectively shields the stable content representations of tail items from their noisy and sparse behavioral signals.

    • Concurrently, its dynamic behavioral router learns to produce a sparser, more robust representation for long-tail items by relying more on high-level content semantics than on unreliable fine-grained behavioral cues. This dual mechanism significantly boosts performance on the long tail.

      This stratified analysis provides strong evidence that ADA-SID effectively addresses the challenge of information disparity. It successfully balances the expressive power needed for popular items with the generalization capability required for long-tail items, ultimately outperforming traditional Item IDs across the entire popularity spectrum.
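The stratified protocol itself is straightforward to replicate; below is a minimal sketch, assuming per-item impression counts and per-interaction labels and scores are available (all names and the synthetic data are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc(item_ids, impressions, y_true, y_score, quantile=0.25):
    """AUC on interactions with popular (top-25%) vs long-tail (bottom-25%) items,
    where popularity is measured by per-item impression counts."""
    item_ids, y_true, y_score = map(np.asarray, (item_ids, y_true, y_score))
    counts = np.array([impressions[i] for i in item_ids])
    lo, hi = np.quantile(list(impressions.values()), [quantile, 1 - quantile])
    head = counts >= hi   # popular items (ties may widen the stratum slightly)
    tail = counts <= lo   # long-tail items
    return (roc_auc_score(y_true[head], y_score[head]),
            roc_auc_score(y_true[tail], y_score[tail]))

# Toy usage with synthetic impressions, labels, and scores
rng = np.random.default_rng(0)
impressions = {i: int(v) for i, v in enumerate(rng.zipf(1.5, 1000).clip(1, 10_000))}
ids = rng.integers(0, 1000, 5000)
labels = rng.integers(0, 2, 5000)
scores = rng.random(5000)
print(stratified_auc(ids, impressions, labels, scores))
```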

6.5. Online Experiments

The authors conducted a 5-day online A/B test on a large-scale e-commerce platform's generative retrieval system.

  • Setup: The experimental group, serving ADA-SID's 8-token SIDs, was allocated a random 10% of user traffic and compared against the production system's Item ID-based representations.
  • Results: ADA-SID yielded significant improvements in key business metrics:
    • +3.50% increase in Advertising Revenue.
    • +1.15% increase in Click-Through Rate (CTR).
  • Conclusion: These online gains confirm the practical value, scalability, and production-readiness of the proposed method in a real-world industrial setting.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces ADA-SID, a novel mixture-of-quantization framework called MMQ-v2, designed to learn expressive and noise-robust Semantic IDs (SIDs) for industrial recommender systems. The core innovation lies in its adaptive approach to integrating multimodal information (content and behavior) by explicitly considering the information richness of behavioral signals. The framework addresses two critical limitations of prior methods: Noise Corruption (from sparse long-tail data) and Signal Obscurity (due to equal weighting of SIDs).

ADA-SID achieves this through two main components:

  1. Adaptive Behavior-Content Alignment: Utilizes an alignment strength controller to dynamically adjust the alignment intensity between content and behavior based on the item's interaction data richness. This denoises representations for long-tail items by reducing the influence of unreliable behavioral signals.

  2. Dynamic Behavioral Router: Learns to assign adaptive importance weights to behavioral SIDs, thereby amplifying critical collaborative signals from popular items and attenuating uninformative ones from long-tail items. This also incorporates sparsity regularization and load balancing.

    Extensive offline experiments on both industrial and public datasets demonstrate ADA-SID's superior performance in both generative retrieval and discriminative ranking tasks. Crucially, online A/B tests on a large-scale e-commerce platform confirmed significant improvements in key business metrics, including Advertising Revenue and Click-Through Rate (CTR). The work pioneers an adaptive fusion approach that considers information richness, paving the way for more robust and personalized recommender systems.

7.2. Limitations & Future Work

The authors implicitly highlight some limitations through their design choices and explicitly mention future work:

  • Hyperparameter Tuning: The Alignment Strength Controller's hyperparameters $(\alpha, \beta)$ require careful tuning to adapt to different data distributions, as shown in their analysis. While flexible, finding optimal values can be dataset-dependent.
  • Complexity of Sparsity Regularization: The sparsity regularization loss, especially with its dynamic $\lambda_i$ and load-balancing terms, is quite intricate and potentially difficult to stabilize or interpret without a deep understanding of its internal mechanics.
  • Focus on Item-side Modeling: The current work primarily focuses on improving item representations.
  • Future Direction: The authors explicitly state that future directions include applying these principles to user-side modeling. This suggests extending the adaptive behavior mining concept to create more robust and dynamic user representations, which could further enhance personalization.

7.3. Personal Insights & Critique

This paper presents a highly relevant and well-structured approach to a crucial problem in recommender systems. The explicit breakdown of limitations in existing behavior-content alignment into Noise Corruption and Signal Obscurity is insightful and provides a clear motivation for their proposed solutions.

Key Strengths:

  • Novelty in Adaptive Fusion: The core idea of adaptively aligning and weighting signals based on information richness is a significant step forward. It moves beyond simply combining modalities to intelligently managing their fusion, which is particularly vital in real-world skewed data distributions.
  • Comprehensive Problem Addressing: The Alignment Strength Controller directly tackles Noise Corruption, and the Dynamic Behavioral Router directly addresses Signal Obscurity. This targeted approach makes the solution robust.
  • Strong Empirical Evidence: The extensive offline experiments on both large-scale industrial data and public benchmarks, coupled with successful online A/B tests, provide compelling evidence of the method's effectiveness and practicality.
  • Bridging Head and Tail Performance: The stratified analysis demonstrating ADA-SID's superior performance for both popular and long-tail items is a strong indicator of its balanced design, which is often a challenge in recommendation.

Potential Issues/Areas for Improvement:

  • Clarity of Sparsity Regularization Formulas: While the intent of the sparsity regularization and load balancing is clear, the mathematical formulations for $L_{reg_i}$, $\lambda_i$, and $\iint_{lb}$ (Eq. 23) are exceptionally condensed and contain what appear to be notational ambiguities or typos (e.g., the double summation, the denominator in $\lambda_{target}$, the $\iint$ symbol). A more detailed breakdown of these formulas would greatly aid reproducibility and understanding for beginners; they are best interpreted in light of common MoE practices, though strict fidelity to the paper means presenting them as written, even where unclear.
  • Computational Overhead: While Mixture-of-Experts can offer increased capacity without proportional cost, the introduction of multiple experts, gating networks, and adaptive controllers might add to the overall complexity and training time, especially with very large codebooks and item counts. The paper doesn't delve deeply into the computational cost compared to baselines, though online success implies it's manageable.
  • Generalizability of Hyperparameters: While $(\alpha, \beta)$ are tunable, finding their optimal values may still require significant effort on new datasets. The paper's finding that noise filtering is optimal for the 40% least frequent items is specific to its dataset.

Transferability: The principles of adaptive multimodal fusion based on information richness are highly transferable. This concept could be applied to:

  • User Modeling: Similar disparities exist in user behavior (active vs. cold-start users). Adapting user representation learning based on the richness of their interaction history could lead to more robust user embeddings.

  • Other Multimodal Learning Tasks: Beyond recommendation, any task involving fusion of modalities with varying reliability (e.g., combining noisy sensor data with cleaner visual data) could benefit from adaptive weighting or alignment.

  • Dynamic Feature Engineering: The idea of using L2-magnitude as a proxy for information richness could inspire other dynamic feature weighting or selection mechanisms in various machine learning contexts.

    Overall, MMQ-v2 (ADA-SID) is a significant and practical contribution to the field of recommender systems, offering a more nuanced and effective way to leverage diverse information sources.
