
Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Published: 03/04/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces the COBRA framework, which integrates sparse semantic IDs and dense vectors through alternating generation. This end-to-end training enables dynamic optimization of representations, effectively capturing semantic and collaborative insights from user-item interactions.

Abstract

Generative models have recently gained attention in recommendation systems by directly predicting item identifiers from user interaction sequences. However, existing methods suffer from significant information loss due to the separation of stages such as quantization and sequence modeling, hindering their ability to achieve the modeling precision and accuracy of sequential dense retrieval techniques. Integrating generative and dense retrieval methods remains a critical challenge. To address this, we introduce the Cascaded Organized Bi-Represented generAtive retrieval (COBRA) framework, which innovatively integrates sparse semantic IDs and dense vectors through a cascading process. Our method alternates between generating these representations by first generating sparse IDs, which serve as conditions to aid in the generation of dense vectors. End-to-end training enables dynamic refinement of dense representations, capturing both semantic insights and collaborative signals from user-item interactions. During inference, COBRA employs a coarse-to-fine strategy, starting with sparse ID generation and refining them into dense vectors via the generative model. We further propose BeamFusion, an innovative approach combining beam search with nearest neighbor scores to enhance inference flexibility and recommendation diversity. Extensive experiments on public datasets and offline tests validate our method's robustness. Online A/B tests on a real-world advertising platform with over 200 million daily users demonstrate substantial improvements in key metrics, highlighting COBRA's practical advantages.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

1.2. Authors

1.3. Journal/Conference

The venue is not explicitly stated; the ACM reference-format placeholder reads "In . ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn". The ACM Reference Format and CCS Concepts (Information systems, Recommender systems) suggest the paper is intended for a reputable ACM conference or journal related to information systems or recommender systems.

1.4. Publication Year

2025

1.5. Abstract

Generative models in recommendation systems predict item identifiers from user interaction sequences but often lose information due to separated stages like quantization and sequence modeling, failing to match the precision of sequential dense retrieval. This paper introduces COBRA (Cascaded Organized Bi-Represented generAtive retrieval), a framework that unifies sparse semantic IDs and dense vectors through a cascading generative process. COBRA first generates sparse IDs, which then condition the generation of dense vectors. An end-to-end training approach dynamically refines these dense representations, integrating semantic insights and collaborative signals. During inference, COBRA employs a coarse-to-fine strategy: generating sparse IDs first, then refining them into dense vectors. To enhance inference flexibility and recommendation diversity, the paper proposes BeamFusion, which combines beam search with nearest neighbor scores. Extensive experiments on public datasets, offline tests, and online A/B tests on a large-scale advertising platform (over 200 million daily users) demonstrate COBRA's significant improvements in key metrics and practical advantages.

https://arxiv.org/abs/2503.02453 (Preprint on arXiv)

https://arxiv.org/pdf/2503.02453v1.pdf (Preprint on arXiv)

The paper is currently a preprint on arXiv, dated March 4, 2025.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the significant information loss prevalent in existing generative recommendation models. While generative models have gained traction for directly predicting item identifiers from user interaction sequences, they often involve separate stages (e.g., quantization for sparse IDs and sequence modeling) that lead to a degradation in information quality. This hinders their ability to achieve the high modeling precision and accuracy typically found in sequential dense retrieval techniques, which rely on rich, continuous item embeddings.

This problem is important because recommendation systems are fundamental to modern digital platforms, driving user engagement and economic value. The limitations of current generative models create a critical challenge: how to integrate the efficiency and emerging abilities (like reasoning and few-shot learning) of generative models with the fine-grained accuracy and robustness of dense retrieval methods. Previous attempts have either focused solely on sparse IDs, leading to a lack of fine-grained detail, or used static, pre-trained dense representations, limiting dynamic refinement.

The paper's entry point and innovative idea revolve around bridging this gap by proposing a unified framework that synergistically combines sparse semantic IDs and dense vectors. It introduces a novel cascaded approach where sparse IDs provide a high-level categorical foundation, which then conditions the generation of fine-grained dense vectors, thereby mitigating information loss and enabling dynamic refinement of representations.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Cascaded Bi-Represented Retrieval Framework (COBRA): Introduction of a novel framework that alternates between generating sparse semantic IDs and dense vectors. This cascading process integrates dense representations into the ID sequence, addressing the information loss common in ID-based methods. By using sparse IDs as conditions for generating dense vectors, COBRA simplifies the learning of dense representations and promotes mutual learning between the two representation types.

  • Learnable Dense Representations via End-to-End Training: Unlike models that use static or pre-trained embeddings, COBRA's dense vectors are dynamically learned through an end-to-end training process, using original item data as input. This allows the model to capture both semantic information and fine-grained details specific to the recommendation task.

  • Coarse-to-Fine Generation Process with BeamFusion: During inference, COBRA employs a coarse-to-fine strategy. It first generates sparse IDs to capture the categorical essence, which are then fed back into the model to produce refined dense representations. Additionally, the BeamFusion mechanism is proposed, combining beam search with nearest neighbor retrieval scores to offer flexible and diverse recommendations.

  • Comprehensive Empirical Validation: Extensive experiments on public benchmark datasets demonstrate that COBRA achieves superior performance in recommendation accuracy compared to existing state-of-the-art methods. Offline tests and online A/B tests on a real-world advertising platform (with over 200 million daily users) show substantial improvements in key metrics like conversion and Average Revenue Per User (ARPU), highlighting the practical advantages and robustness of COBRA.

    The key conclusions and findings include:

  • COBRA consistently outperforms various state-of-the-art baselines (e.g., TIGER, BERT4Rec, P5) across multiple public datasets (Beauty, Sports and Outdoors, Toys and Games) in terms of Recall@K and NDCG@K.

  • Ablation studies on an industrial dataset confirm that both sparse IDs and dense vectors are crucial for performance, with the cascaded approach significantly enhancing Recall@K. The BeamFusion mechanism also plays a vital role in integrating sparse signals effectively.

  • The model's representation learning capabilities demonstrate strong intra-ID cohesion (items within the same category are close) and inter-ID separation (different categories are distinct), confirming that sparse IDs help organize the semantic space for dense vectors.

  • COBRA provides a controllable Recall-Diversity equilibrium through the BeamFusion mechanism, allowing practitioners to tune between accuracy and diversity.

  • Real-world online A/B tests on a large-scale advertising platform confirm significant business impact, with a 3.60% increase in conversion and a 4.15% increase in ARPU, proving its practical advantages in production environments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the COBRA framework, a grasp of several fundamental concepts in recommendation systems and deep learning is essential:

  • Recommendation Systems: These systems aim to predict user preferences and suggest items (e.g., products, movies, articles) that are most relevant to them. They are crucial for enhancing user experience and driving engagement on various platforms.
  • Sequential Recommendation: A subfield of recommendation systems that specifically models the sequential patterns of user interactions. Instead of treating interactions as independent events, sequential recommenders learn from the order in which users interact with items (e.g., Item A then Item B, then Item C), often predicting the next item a user will engage with.
  • Generative Models in Recommendation: Traditionally, recommendation systems are discriminative, predicting a score for each item and ranking them. Generative models, however, directly generate item identifiers (e.g., a unique ID or a textual description) as their output. This paradigm shift offers advantages like direct item prediction, potential for few-shot learning, and handling large item catalogs more efficiently.
  • Dense Retrieval: Refers to recommendation methods that represent users and items as continuous, high-dimensional dense vectors (also known as embeddings). Similarity between users and items is typically calculated using vector operations (e.g., dot product, cosine similarity). These methods excel at capturing fine-grained relationships and semantic nuances but often require substantial storage and computational resources for large item catalogs.
  • Sparse Retrieval: In contrast to dense retrieval, sparse retrieval methods often rely on discrete, categorical representations, such as sparse IDs or one-hot encodings. These representations can be more memory-efficient and allow for direct indexing, but may struggle to capture the subtle semantic similarities between items that dense vectors can.
  • Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling. It relies heavily on self-attention mechanisms to weigh the importance of different parts of an input sequence when processing each element.
    • Self-Attention Mechanism: The core component of Transformers. It calculates a weighted sum of input values, where the weights are determined by the similarity (or "attention") between the current input element and all other elements in the sequence. This allows the model to capture long-range dependencies efficiently. The standard attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
      • $QK^T$ calculates the dot-product similarity between queries and keys.
      • $d_k$ is the dimension of the key vectors; scaling by $\sqrt{d_k}$ keeps the dot products from saturating the softmax and causing vanishing gradients.
      • $\mathrm{softmax}$ normalizes the scores to produce attention weights.
      • The output is a weighted sum of the Value vectors. (A minimal code sketch of this computation appears after this concept list.)
    • Transformer Encoder-Decoder: The original Transformer consists of an encoder stack and a decoder stack. The encoder processes the input sequence, and the decoder generates the output sequence, often attending to the encoder's output. COBRA utilizes a Transformer Decoder for sequential prediction.
  • Residual Quantized Variational Autoencoder (RQ-VAE): An architecture for learning discrete latent representations from continuous data. An Autoencoder learns to compress input data into a lower-dimensional latent space (encoding) and then reconstruct it (decoding). Variational Autoencoders (VAEs) introduce a probabilistic element. Residual Quantization involves quantizing a continuous vector into a discrete code from a codebook, then learning a residual and quantizing it again, allowing for hierarchical discrete representations (like semantic IDs in this paper) with improved fidelity. These discrete codes serve as sparse IDs.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns to pull similar data points (positives) closer together in an embedding space while pushing dissimilar data points (negatives) apart. In recommendation, this can involve making a user's interaction with a positive item similar to its representation, and dissimilar to negative items. The InfoNCE loss is a common objective function used in contrastive learning.
  • Beam Search: A heuristic search algorithm used in sequence generation tasks (e.g., natural language generation, item ID generation). Instead of choosing the single most probable next token at each step (greedy search), beam search keeps track of the top $B$ (beam width) most probable partial sequences. At each step, it expands all $B$ sequences by considering all possible next tokens, then prunes them again to keep only the top $B$ highest-scoring sequences. This improves the chances of finding a globally better sequence than greedy search.
  • Approximate Nearest Neighbor (ANN) Search: An algorithm used to find data points in a high-dimensional space that are "close" to a given query point, but without the computational cost of exhaustively checking every single point. For large datasets, exact nearest neighbor search is too slow. ANN algorithms (e.g., FAISS, HNSW) provide a fast, approximate solution, widely used in retrieval tasks to quickly find candidate items based on their dense vector similarity to a query vector.
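
To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention (referenced from the Self-Attention Mechanism entry in this list). The shapes and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_q, seq_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy example: 4 query/key/value vectors of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```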

3.2. Previous Works

The paper discusses various prior studies, broadly categorizing them into Sequential Dense Recommendation and Generative Recommendation.

3.2.1. Sequential Dense Recommendation

These methods focus on learning dense representations for users and items from interaction sequences.

  • GRU4Rec [14]: One of the early influential models, using Gated Recurrent Units (GRUs), a type of Recurrent Neural Network (RNN), to capture temporal dependencies in user behavior for session-based recommendations.
  • Caser [39]: Applied Convolutional Neural Networks (CNNs) to sequential recommendation, treating interaction sequences like "images" to extract spatial features.
  • SASRec [18]: A pioneering Transformer-based model for sequential recommendation. It uses self-attention to capture long-term user dependencies and models the next item prediction as an autoregressive task.
  • BERT4Rec [37]: Another Transformer-based model, but inspired by BERT from NLP. It uses a bidirectional self-attention mechanism and a cloze objective (masked item prediction) to learn user representations.
  • FDSA [52]: A self-attentive model that specifically targets item-feature transitions to enhance sequential modeling.
  • PinnerFormer [30]: Leverages Transformers for modeling long-term user behavior, specifically in the context of Pinterest.
  • S3-Rec [55]: Explores self-supervised learning for sequential recommendation, often employing contrastive learning techniques to derive robust user and item representations.
  • ZESRec [8], UniSRec [15], RecFormer [21]: More recent works emphasizing cross-domain transferability, incorporating textual features, and using contrastive learning. RecFormer in particular unifies language understanding and sequence recommendation using bidirectional Transformers. COBRA w/o ID is noted to resemble RecFormer.

3.2.2. Generative Recommendation

These models directly generate item identifiers.

  • P5 [11]: A foundational model that transforms various recommendation tasks (e.g., rating prediction, item generation) into natural language sequences, providing a universal framework using unique training objectives and prompts.
  • TIGER [33]: A pioneering approach in generative retrieval for recommendations. It uses a Residual Quantized Variational AutoEncoder (RQ-VAE) to encode item content features into hierarchical semantic IDs. A Transformer-based model then generates these item identifiers from user histories. COBRA w/o Dense is similar to TIGER.
  • LC-Rec [53]: Extends TIGER by aligning semantic IDs with collaborative filtering signals through additional alignment tasks, using RQ-VAE.
  • IDGenRec [38]: Leverages Large Language Models (LLMs) to generate unique, concise, and semantically rich textual identifiers for recommended items, demonstrating strong potential in zero-shot settings.
  • SEATER [34]: Focuses on maintaining semantic consistency in generative retrieval through balanced k-ary tree-structured indexes, refined by contrastive and multi-task learning.
  • ColaRec [45]: Aligns content-based semantic spaces with collaborative interaction spaces to improve recommendation efficacy, often by deriving generative identifiers from pre-trained recommendation models.
  • LIGER [48]: A hybrid model that combines generative and dense retrieval by simultaneously generating sparse IDs and dense representations. It treats them as complementary representations of the same granularity. LIGER's dense representations are pre-trained and fixed. This is a key point of differentiation for COBRA.

3.3. Technological Evolution

Recommendation systems have evolved significantly:

  1. Early Methods (Collaborative Filtering, Matrix Factorization): Focused on implicit or explicit feedback to find similar users/items.

  2. Session-based/Sequential Methods (RNNs, CNNs): Models like GRU4Rec and Caser started capturing temporal dependencies using recurrent and convolutional networks.

  3. Transformer Era (Self-Attention): SASRec, BERT4Rec, and PinnerFormer brought the power of Transformers to sequential recommendation, enabling better modeling of long-range dependencies in user behavior. These are largely dense retrieval methods.

  4. Generative Paradigm: P5 and TIGER shifted towards directly generating item IDs or natural language descriptions, offering flexibility but often facing challenges in precision and information retention compared to dense methods.

  5. Hybrid Approaches: Recognizing the strengths of both, models like LIGER began exploring the integration of generative (sparse ID) and dense retrieval.

    This paper's work (COBRA) fits into the latest stage, aiming to create a more tightly integrated and dynamically learned hybrid approach, addressing the limitations of prior generative and hybrid models by directly learning cascaded sparse-dense representations within a unified generative framework.

3.4. Differentiation Analysis

Compared to the main methods in related work, COBRA presents several core differences and innovations:

  • Unified Cascaded Generation (vs. TIGER/Sparse-only Generative Models):

    • TIGER [33] and similar methods (e.g., COBRA w/o Dense variant) rely solely on sparse semantic IDs for generative retrieval. While efficient, this can lead to information loss and difficulty in capturing fine-grained user preferences.
    • COBRA innovatively integrates both sparse IDs and dense vectors in a cascading generative process. It first generates a sparse ID (coarse-grained, categorical essence) and then uses this sparse ID as a condition to generate a dense vector (fine-grained details). This mitigates the information loss inherent in sparse-only methods.
  • Dynamically Learned Dense Representations (vs. LIGER):

    • LIGER [48] also proposes a hybrid approach generating both sparse IDs and dense representations. However, LIGER's dense representations are typically pre-trained and fixed, treating both representations as having the same granularity.
    • COBRA's dense representations are end-to-end trainable. They are dynamically refined during the entire training process alongside the sparse IDs. This allows the dense vectors to better adapt to the specific recommendation task, capturing semantic insights and collaborative signals more effectively. The cascaded nature also implies a different granularity for sparse (coarse) and dense (fine) representations, which is a key distinction.
  • Coarse-to-Fine Inference Strategy with BeamFusion (vs. Standard Beam Search):

    • Most generative models use standard beam search for ID generation.
    • COBRA combines a coarse-to-fine generation process during inference, starting with sparse ID generation and then refining these into dense vectors.
    • Furthermore, it introduces BeamFusion, a novel sampling technique that combines beam search scores (from sparse ID generation) with nearest neighbor retrieval scores (from dense vector similarity). This allows for a more flexible and controllable balance between recommendation accuracy and diversity, which is a significant practical advantage not typically found in simpler retrieval or generation strategies.
  • Holistic Optimization: The end-to-end training of COBRA with a dual-objective loss function (for both sparse ID and dense vector prediction) ensures that both representation types are jointly optimized and mutually inform each other, leading to a more robust and precise recommendation model.

4. Methodology

4.1. Principles

The core idea behind COBRA (Cascaded Organized Bi-Represented generAtive retrieval) is to overcome the limitations of existing generative recommendation models, particularly the information loss associated with discrete item IDs, by synergistically integrating sparse semantic IDs and dense vectors within a unified generative framework. The theoretical basis is rooted in the belief that while sparse IDs provide a robust, categorical structure (coarse-grained semantics), dense vectors capture nuanced, fine-grained details (continuous feature resolution). By generating these representations in a cascaded manner—first predicting the coarse sparse ID and then using it to condition the generation of the fine dense vector—the model aims to leverage the strengths of both, ensuring both high-level semantic consistency and detailed item characterization. This coarse-to-fine approach, coupled with end-to-end training, allows for dynamic refinement of representations and an improved balance between accuracy and diversity during inference.

4.2. Core Methodology In-depth (Layer by Layer)

COBRA consists of four main components: Sparse-Dense Representation, Sequential Modeling, End-to-End Training, and Coarse-to-Fine Generation. Figure 2 from the original paper provides an excellent overview of the framework's architecture.

The following figure (Figure 2 from the original paper) illustrates the overall framework of COBRA:

Image 2: The image is a schematic diagram illustrating the cascaded process of sparse and dense representations in the COBRA framework. It includes the alternating generation of sparse ID and dense vector, demonstrating the dynamic transfer of information through bidirectional Transformer encoders and decoders.

4.2.1. Sparse-Dense Representation

This component focuses on how items are represented in COBRA.

4.2.1.1. Sparse Representation

COBRA generates sparse IDs for each item, similar to the approach in TIGER [33].

  1. Item Attributes to Text: For each item, its various attributes (e.g., title, price, category, description) are extracted and combined to form a textual description.
  2. Text Embedding: This textual description is then embedded into a continuous, dense vector space using a text encoder.
  3. Quantization with RQ-VAE: The dense vector is subsequently quantized to produce sparse IDs. This process is typically performed by a Residual Quantized Variational Autoencoder (RQ-VAE). The RQ-VAE converts the continuous item embeddings into discrete codes (sparse IDs), which capture the categorical essence and high-level semantics of the items.
  4. Hierarchical IDs: While the methodology description assumes a single level for simplicity, the paper notes that this approach can be easily extended to multiple levels of sparse IDs, allowing for hierarchical semantic categorization.
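
The quantization step described above can be illustrated with a minimal residual-quantization sketch: at each level the current residual is matched to its nearest codeword and the leftover is passed to the next level. The codebook sizes and dimensions below are illustrative (for the public datasets the paper uses 3 levels with codebooks of size 32), and a real RQ-VAE learns the codebooks jointly with an encoder and decoder rather than using fixed random ones.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization: returns one code index per level.

    x: (d,) item embedding; codebooks: list of (K_l, d) arrays. The codebooks are
    assumed fixed here; an RQ-VAE would learn them end to end."""
    ids, residual = [], x.copy()
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)   # distance to each codeword
        idx = int(dists.argmin())
        ids.append(idx)
        residual = residual - cb[idx]                   # quantize the leftover at the next level
    return ids

rng = np.random.default_rng(0)
item_embedding = rng.normal(size=64)
codebooks = [rng.normal(size=(32, 64)) for _ in range(3)]   # 3 levels x 32 codes each
print(residual_quantize(item_embedding, codebooks))          # three code indices, one per level
```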

4.2.1.2. Dense Representation

To complement the sparse IDs and capture more nuanced attribute information, COBRA introduces an end-to-end trainable dense encoder.

  1. Item Textual Contents: Each item's attributes are flattened into a text sentence.
  2. [CLS] Token Prefix: A special [CLS] token is prefixed to this text sentence.
  3. Transformer-based Text Encoder: The entire sequence ( [CLS] + item textual contents) is fed into a Transformer-based text encoder (denoted as Encoder). This encoder processes the text to produce context-aware embeddings.
  4. Dense Vector Extraction: The dense representation for the item, denoted as $\mathbf{v}_t$, is extracted from the output corresponding to the [CLS] token. This vector aims to capture the fine-grained details of the item's textual content.
  5. Positional and Type Embeddings: As illustrated in Figure 2 (lower part), position embeddings and type embeddings are incorporated and added to the token embeddings.
    • Position embeddings (or positional encodings) provide information about the order of tokens in the sequence, as Transformers are permutation-invariant without them.
    • Type embeddings differentiate between different types of tokens or segments within the input. These augmentations enhance the model's ability to distinguish between tokens and their context.
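
A rough PyTorch sketch of the dense encoder described above: flatten the item attributes into text, prepend a [CLS] token, add position and type embeddings, run a Transformer encoder, and take the output at the [CLS] position as $\mathbf{v}_t$. The vocabulary size, dimensions, and layer counts are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseItemEncoder(nn.Module):
    """Toy [CLS]-pooled Transformer text encoder (hyperparameters are illustrative)."""
    def __init__(self, vocab_size=30522, dim=128, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)       # position embeddings
        self.type_emb = nn.Embedding(2, dim)            # type embeddings (e.g. attribute segments)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.cls_id = 0                                 # assume token id 0 is [CLS]

    def forward(self, token_ids, type_ids):
        b = token_ids.size(0)
        cls = torch.full((b, 1), self.cls_id, dtype=torch.long)
        ids = torch.cat([cls, token_ids], dim=1)                        # prepend [CLS]
        types = torch.cat([torch.zeros(b, 1, dtype=torch.long), type_ids], dim=1)
        pos = torch.arange(ids.size(1)).unsqueeze(0).expand(b, -1)
        x = self.tok_emb(ids) + self.pos_emb(pos) + self.type_emb(types)
        return self.encoder(x)[:, 0]                                    # v_t = [CLS] output

enc = DenseItemEncoder()
tokens = torch.randint(1, 30522, (2, 10))    # two items, 10 tokens of flattened attributes
v = enc(tokens, torch.zeros(2, 10, dtype=torch.long))
print(v.shape)  # torch.Size([2, 128])
```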

4.2.1.3. Cascaded Representation

The cascaded representation is the key to unifying sparse and dense information within COBRA.

  1. Combined Form: For each item at time step $t$, its sparse ID $ID_t$ and its dense vector $\mathbf{v}_t$ are combined to form a cascaded representation $(ID_t, \mathbf{v}_t)$.
  2. Complementary Strengths: This combination leverages the strengths of both:
    • Sparse IDs provide a stable, discrete, categorical foundation, offering robust semantic consistency.
    • Dense vectors maintain continuous feature resolution, ensuring that the model captures detailed, fine-grained information. This joint representation allows COBRA to characterize items more comprehensively.

4.2.2. Sequential Modeling

COBRA uses a unified generative model based on the Transformer architecture to model user interaction sequences.

4.2.2.1. Probabilistic Decomposition

The model factorizes the probability distribution of the target item into two stages, explicitly leveraging the complementary nature of sparse and dense representations. Instead of directly predicting the next item $s_{t+1}$ based on the historical interaction sequence $S_{1:t}$, COBRA predicts its sparse ID $ID_{t+1}$ and dense vector $\mathbf{v}_{t+1}$ separately:

$ P(ID_{t+1}, \mathbf{v}_{t+1} \mid S_{1:t}) = P(ID_{t+1} \mid S_{1:t}) \, P(\mathbf{v}_{t+1} \mid ID_{t+1}, S_{1:t}) $

Where:

  • $P(ID_{t+1}, \mathbf{v}_{t+1} \mid S_{1:t})$ is the joint probability of the next item's sparse ID and dense vector given the historical sequence.

  • $S_{1:t}$ represents the historical interaction sequence up to time step $t$.

  • $P(ID_{t+1} \mid S_{1:t})$ is the probability of generating the sparse ID $ID_{t+1}$ based on the historical sequence $S_{1:t}$. This captures the categorical essence or coarse-grained semantics of the next item.

  • $P(\mathbf{v}_{t+1} \mid ID_{t+1}, S_{1:t})$ is the probability of generating the dense vector $\mathbf{v}_{t+1}$ given both the predicted sparse ID $ID_{t+1}$ and the historical sequence $S_{1:t}$. This captures the fine-grained details, conditioned on the coarse category.

    This decomposition is crucial because it allows the model to first narrow down the search space to a category (via sparse ID) and then refine the prediction within that category (via dense vector), making the learning task more manageable and precise.

4.2.2.2. Sequential Modeling with a Unified Generative Model

A Transformer Decoder forms the core of the sequential model. It processes sequences of cascaded representations.

  1. Embedding Sparse IDs: Each sparse ID $ID_t$ is converted into a dense vector space using an embedding layer. $ \mathbf{e}_t = \mathrm{Embed}(ID_t) $ Where:

    • $\mathbf{e}_t$ is the dense embedding of the sparse ID $ID_t$.
    • $\mathrm{Embed}$ is the embedding layer function.
  2. Forming Model Input: This sparse ID embedding $\mathbf{e}_t$ is then concatenated with the item's dense vector $\mathbf{v}_t$ to form the complete input representation $\mathbf{h}_t$ for each item at each time step. $ \mathbf{h}_t = [\mathbf{e}_t ; \mathbf{v}_t] $ Where:

    • $\mathbf{h}_t$ is the concatenated representation for the $t$-th item.
    • $[\cdot\,;\cdot]$ denotes concatenation.
  3. Transformer Modeling: The Transformer Decoder takes a sequence of these $\mathbf{h}_t$ representations. It is augmented with item position embeddings and type embeddings (as mentioned in section 4.2.1.2) to capture sequential and contextual information. The decoder then processes this enriched input to produce contextualized representations for prediction.

  4. Sparse ID Prediction:

    • Input Sequence: To predict the sparse ID $ID_{t+1}$, the Transformer receives the historical interaction sequence $S_{1:t}$ as its input. This sequence is constructed from the cascaded representations: $ \mathbf{S}_{1:t} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_t] = [\mathbf{e}_1, \mathbf{v}_1, \mathbf{e}_2, \mathbf{v}_2, \ldots, \mathbf{e}_t, \mathbf{v}_t] $ Where $\mathbf{S}_{1:t}$ is the sequence of concatenated item representations.
    • Transformer Output: The Transformer Decoder processes this sequence, producing a sequence of output vectors $\mathbf{y}_t$. $ \mathbf{y}_t = \mathrm{TransformerDecoder}(\mathbf{S}_{1:t}) $
    • Logit Calculation: The logits for sparse ID prediction are then derived from $\mathbf{y}_t$ using a dedicated SparseHead (typically a linear layer followed by a softmax activation for classification). $ \mathbf{z}_{t+1} = \mathrm{SparseHead}(\mathbf{y}_t) $ Where $\mathbf{z}_{t+1}$ represents the logits for predicting the next sparse ID $ID_{t+1}$.
  5. Dense Vector Prediction:

    • Input Sequence: For predicting the dense vector $\mathbf{v}_{t+1}$, the Transformer's input sequence is augmented. It includes the historical sequence $S_{1:t}$ plus the embedding of the predicted (or ground truth during training) sparse ID $ID_{t+1}$. This is a crucial step in the cascaded process, where the sparse prediction conditions the dense prediction. $ \bar{\mathbf{S}}_{1:t} = [\mathbf{S}_{1:t}, \mathbf{e}_{t+1}] = [\mathbf{e}_1, \mathbf{v}_1, \mathbf{e}_2, \mathbf{v}_2, \ldots, \mathbf{e}_t, \mathbf{v}_t, \mathbf{e}_{t+1}] $ Where $\bar{\mathbf{S}}_{1:t}$ is the extended input sequence, now including the embedding of the (target) sparse ID for the next item, $\mathbf{e}_{t+1}$.
    • Transformer Output: The Transformer Decoder then processes this extended sequence $\bar{\mathbf{S}}_{1:t}$ to output the predicted dense vector $\hat{\mathbf{v}}_{t+1}$. $ \hat{\mathbf{v}}_{t+1} = \mathrm{TransformerDecoder}(\bar{\mathbf{S}}_{1:t}) $
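
A compact sketch of the interleaved input construction and the two prediction steps above. The class and head names, dimensions, and the use of an encoder stack as a stand-in for the causal Transformer Decoder are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CobraStyleDecoder(nn.Module):
    """Toy model over interleaved [e_1, v_1, ..., e_t, v_t] sequences."""
    def __init__(self, num_ids=1024, dim=128):
        super().__init__()
        self.id_emb = nn.Embedding(num_ids, dim)                    # Embed(ID_t)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the causal decoder
        self.sparse_head = nn.Linear(dim, num_ids)                  # produces logits z_{t+1}
        self.dense_head = nn.Linear(dim, dim)                       # produces \hat{v}_{t+1}

    def forward(self, sparse_ids, dense_vecs, next_id):
        e = self.id_emb(sparse_ids)                                 # (B, T, dim)
        # Interleave ID embeddings and dense vectors: [e_1, v_1, ..., e_t, v_t]
        seq = torch.stack([e, dense_vecs], dim=2).flatten(1, 2)     # (B, 2T, dim)
        id_logits = self.sparse_head(self.backbone(seq)[:, -1])     # predict the next sparse ID
        # Cascade: append Embed(ID_{t+1}) and predict the next dense vector
        ext = torch.cat([seq, self.id_emb(next_id).unsqueeze(1)], dim=1)
        v_hat = self.dense_head(self.backbone(ext)[:, -1])
        return id_logits, v_hat

model = CobraStyleDecoder()
ids = torch.randint(0, 1024, (2, 5))            # 5 past interactions for 2 users
vecs = torch.randn(2, 5, 128)                   # their dense item vectors
logits, v_hat = model(ids, vecs, next_id=torch.randint(0, 1024, (2,)))
print(logits.shape, v_hat.shape)                # torch.Size([2, 1024]) torch.Size([2, 128])
```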

4.2.3. End-to-End Training

COBRA employs an end-to-end training process, optimizing both sparse ID and dense vector prediction jointly using a composite loss function.

  1. Sparse ID Prediction Loss ($\mathcal{L}_{\mathrm{sparse}}$): This loss function ensures the model learns to accurately predict the next sparse ID based on the historical sequence. It uses a standard cross-entropy loss (negative log-likelihood for classification). $ \mathcal{L}_{\mathrm{sparse}} = - \sum_{t=1}^{T-1} \log \left( \frac{\exp(z_{t+1}^{ID_{t+1}})}{\sum_{j=1}^{C} \exp(z_{t+1}^{j})} \right) $ Where:

    • $T$ is the length of the historical interaction sequence.
    • $ID_{t+1}$ is the ground truth sparse ID of the item at time step $t+1$.
    • $z_{t+1}^{ID_{t+1}}$ represents the predicted logit for the ground truth sparse ID $ID_{t+1}$ at time step $t+1$, generated by the Transformer Decoder and SparseHead.
    • $C$ denotes the total number of unique sparse IDs (i.e., the size of the sparse ID vocabulary).
    • The term $\frac{\exp(z_{t+1}^{ID_{t+1}})}{\sum_{j=1}^{C} \exp(z_{t+1}^{j})}$ is the softmax probability of predicting the correct sparse ID; the negative logarithm of this probability is minimized.
  2. Dense Vector Prediction Loss ($\mathcal{L}_{\mathrm{dense}}$): This loss aims to refine the dense vectors such that the predicted vector is close to the ground truth positive item's vector and far from negative items. It uses a contrastive learning objective (similar to InfoNCE). $ \mathcal{L}_{\mathrm{dense}} = - \sum_{t=1}^{T-1} \log \frac{\exp(\cos(\hat{\mathbf{v}}_{t+1}, \mathbf{v}_{t+1}))}{\sum_{item_j \in \mathrm{Batch}} \exp(\cos(\hat{\mathbf{v}}_{t+1}, \mathbf{v}_{item_j}))} $ Where:

    • $\hat{\mathbf{v}}_{t+1}$ is the predicted dense vector for the item at time step $t+1$.
    • $\mathbf{v}_{t+1}$ is the ground truth dense vector for the positive item at time step $t+1$.
    • $\mathbf{v}_{item_j}$ represents the dense vectors of all items $item_j$ within the current training batch, which serve as negative samples (except for the positive item itself).
    • $\cos(\mathbf{a}, \mathbf{b})$ denotes the cosine similarity between two vectors $\mathbf{a}$ and $\mathbf{b}$. A higher cosine similarity (closer to 1) means the vectors are more similar in direction.
    • The numerator pushes the predicted vector $\hat{\mathbf{v}}_{t+1}$ closer to the positive ground truth $\mathbf{v}_{t+1}$.
    • The denominator pulls $\hat{\mathbf{v}}_{t+1}$ away from all other items in the batch, which act as negative samples. This loss dynamically refines the dense vectors generated by the end-to-end trainable encoder, adapting them to the specific recommendation task.
  3. Overall Loss Function ($\mathcal{L}$): The total loss is a simple sum of the two component losses. $ \mathcal{L} = \mathcal{L}_{\mathrm{sparse}} + \mathcal{L}_{\mathrm{dense}} $ This dual-objective loss function facilitates a balanced optimization, where the model simultaneously learns to categorize items accurately (sparse ID) and capture their fine-grained features (dense vector), with dense vectors being dynamically refined under the guidance of sparse IDs.
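
A minimal sketch of the two training objectives, assuming a batch of sparse-ID logits and predicted dense vectors with in-batch negatives; temperature scaling and other details are omitted, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def cobra_style_loss(id_logits, target_ids, v_pred, v_pos):
    """id_logits: (B, C) sparse-ID logits; target_ids: (B,) ground-truth IDs;
    v_pred: (B, d) predicted dense vectors; v_pos: (B, d) ground-truth item vectors
    (the other rows in the batch serve as in-batch negatives)."""
    l_sparse = F.cross_entropy(id_logits, target_ids)                  # -log softmax prob of the true ID
    sim = F.normalize(v_pred, dim=-1) @ F.normalize(v_pos, dim=-1).T   # (B, B) cosine similarities
    labels = torch.arange(v_pred.size(0))                              # positives sit on the diagonal
    l_dense = F.cross_entropy(sim, labels)                             # InfoNCE-style contrastive loss
    return l_sparse + l_dense                                          # L = L_sparse + L_dense

loss = cobra_style_loss(torch.randn(4, 1024), torch.randint(0, 1024, (4,)),
                        torch.randn(4, 128), torch.randn(4, 128))
print(float(loss))
```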

4.2.4. Coarse-to-Fine Generation

During inference, COBRA implements a coarse-to-fine generation procedure to produce recommendations, as illustrated in Figure 3.

The following figure (Figure 3 from the original paper) illustrates the Coarse-to-Fine Generation process:

Figure 3: Illustration of the Coarse-to-Fine Generation process. During inference, $M$ sparse IDs are generated via Beam Search, and appended to the sequence. Dense vectors are then generated and use… Image 3: The image is a diagram illustrating the process between sparse ID generation and candidate dense vectors using a generative model. It generates $ID_{M+1}$ via Beam Search, then generates candidate dense vectors, and finally uses BeamFusion to combine the Beam Score and NN Score to rank candidates for selecting the Top K ads.

  1. Sparse ID Generation:

    • Given a user's historical interaction sequence $S_{1:T}$, the Transformer Decoder models the probability distribution for the next sparse ID, $P(ID_{T+1} \mid S_{1:T})$.
    • The Beam Search algorithm is applied to this distribution to generate the top $M$ most probable sparse IDs. $ \{\hat{ID}_{T+1}^{k}\}_{k=1}^{M} = \mathrm{BeamSearch}(\mathrm{TransformerDecoder}(\mathbf{S}_{1:T}), M) $ Where:
    • $\{\hat{ID}_{T+1}^{k}\}_{k=1}^{M}$ is the set of top $M$ generated sparse IDs.
    • $k \in \{1, 2, \ldots, M\}$ indexes the generated sparse IDs.
    • $\mathrm{BeamSearch}$ is the beam search algorithm.
    • $\mathrm{TransformerDecoder}(\mathbf{S}_{1:T})$ refers to the output logits from the Transformer for sparse ID prediction.
    • $M$ is the beam width, specifying how many top sparse ID candidates to retain.
    • Each generated sparse ID $\hat{ID}_{T+1}^{k}$ is associated with a beam score $\phi_{\hat{ID}_{T+1}^{k}}$.
  2. Dense Vector Generation and Candidate Retrieval:

    • Each of the $M$ generated sparse IDs $\hat{ID}_{T+1}^{k}$ is converted into an embedding $\mathrm{Embed}(\hat{ID}_{T+1}^{k})$.
    • This embedding is then appended to the historical sequence $\mathbf{S}_{1:T}$ to form the extended input sequence, similar to the training phase for dense vector prediction.
    • The Transformer Decoder then processes this extended sequence to generate the corresponding dense vector $\hat{\mathbf{v}}_{T+1}^{k}$. $ \hat{\mathbf{v}}_{T+1}^{k} = \mathrm{TransformerDecoder}([\mathbf{S}_{1:T}, \mathrm{Embed}(\hat{ID}_{T+1}^{k})]) $
    • For each generated dense vector $\hat{\mathbf{v}}_{T+1}^{k}$, an Approximate Nearest Neighbor (ANN) search is performed to retrieve the top $N$ candidate items from the item catalog that belong to the category indicated by $\hat{ID}_{T+1}^{k}$. This effectively narrows down the search space within the predicted sparse category. $ \mathcal{R}_{k} = \mathrm{ANN}(\hat{\mathbf{v}}_{T+1}^{k}, C(\hat{ID}_{T+1}^{k}), N) $ Where:
      • $\mathcal{R}_{k}$ is the set of top $N$ candidate items retrieved for the $k$-th sparse ID and its generated dense vector.
      • $\hat{\mathbf{v}}_{T+1}^{k}$ is the generated dense vector.
      • $C(\hat{ID}_{T+1}^{k})$ represents the subset of items in the entire item catalog that are associated with the sparse ID $\hat{ID}_{T+1}^{k}$. This ensures that the ANN search is performed within the relevant category.
      • $N$ is the number of nearest neighbors to retrieve within that category.
  3. BeamFusion Mechanism:

    • To achieve a balance between precision (dense vector similarity) and diversity (sparse ID variety), COBRA introduces BeamFusion. This mechanism computes a globally comparable score for candidate items by combining the beam score (from sparse ID generation) and the cosine similarity (from dense vector retrieval). $ \Phi(\hat{\mathbf{v}}_{T+1}^{k}, \hat{ID}_{T+1}^{k}, \mathbf{a}) = \mathrm{Softmax}(\tau \, \phi_{\hat{ID}_{T+1}^{k}}) \times \mathrm{Softmax}(\psi \cos(\hat{\mathbf{v}}_{T+1}^{k}, \mathbf{a})) $ Where:
    • $\Phi(\hat{\mathbf{v}}_{T+1}^{k}, \hat{ID}_{T+1}^{k}, \mathbf{a})$ is the BeamFusion score for a candidate item $\mathbf{a}$ from the set $\mathcal{R}_{k}$.
    • $\mathbf{a}$ represents the dense vector of a candidate item.
    • $\tau$ and $\psi$ are tunable coefficients that control the relative importance of the sparse ID beam score and the dense vector cosine similarity.
    • $\phi_{\hat{ID}_{T+1}^{k}}$ denotes the beam score (e.g., negative log-likelihood) obtained during the beam search process for the sparse ID $\hat{ID}_{T+1}^{k}$.
    • $\cos(\hat{\mathbf{v}}_{T+1}^{k}, \mathbf{a})$ is the cosine similarity between the generated dense vector $\hat{\mathbf{v}}_{T+1}^{k}$ and the dense vector of the candidate item $\mathbf{a}$.
    • The Softmax functions normalize these scores, making them comparable.
  4. Final Recommendations:

    • All candidate items from all $M$ sparse ID branches ($\bigcup_{k=1}^{M} \mathcal{R}_{k}$) are scored using the BeamFusion mechanism.
    • Finally, the top $K$ items with the highest BeamFusion scores are selected as the final recommendations. $ \mathcal{R} = \mathrm{TopK}\left( \bigcup_{k=1}^{M} \mathcal{R}_{k}, \Phi, K \right) $ Where:
    • $\mathcal{R}$ is the set of final top $K$ recommendations.
    • $\mathrm{TopK}$ is the operation of selecting the top $K$ items.
    • $\Phi$ represents the BeamFusion score function.
    • $K$ is the desired number of final recommendations.
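
The inference procedure above can be summarized in a short sketch. The beam scores, per-ID catalog buckets, and brute-force cosine search are placeholders (a production system would use an ANN index restricted to each sparse-ID bucket), and the coefficients tau and psi are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def beamfusion_rank(beam, catalog, tau=1.0, psi=5.0, n_per_id=50, top_k=10):
    """beam: list of (sparse_id, beam_score, v_hat) from the coarse stage.
    catalog: dict sparse_id -> (item_ids, item_vecs) for that ID bucket."""
    beam_w = softmax(tau * np.array([b[1] for b in beam]))           # Softmax over beam scores
    candidates = []
    for (sid, _, v_hat), w in zip(beam, beam_w):
        item_ids, vecs = catalog[sid]
        cos = vecs @ v_hat / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(v_hat) + 1e-8)
        nn_idx = np.argsort(-cos)[:n_per_id]                         # top-N within this ID bucket
        nn_w = softmax(psi * cos[nn_idx])                            # Softmax over NN scores
        for rank, j in enumerate(nn_idx):
            candidates.append((item_ids[j], w * nn_w[rank]))         # BeamFusion score
    candidates.sort(key=lambda t: -t[1])
    return candidates[:top_k]

rng = np.random.default_rng(0)
catalog = {s: (list(range(s * 100, s * 100 + 100)), rng.normal(size=(100, 16))) for s in range(3)}
beam = [(s, score, rng.normal(size=16)) for s, score in zip(range(3), [-0.1, -0.8, -1.5])]
print(beamfusion_rank(beam, catalog))   # top-10 (item_id, fused score) pairs
```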

5. Experimental Setup

5.1. Datasets

COBRA was evaluated on both public and industrial datasets.

5.1.1. Public Datasets

The experiments utilized the Amazon Product Reviews dataset [13, 29], a widely recognized benchmark for recommendation tasks. This dataset comprises product reviews and metadata collected between May 1996 and September 2014.

  • Subsets Used:

    • "Beauty"
    • "Sports and Outdoors"
    • "Toys and Games"
  • Item Embeddings: Item attributes such as title, price, category, and description were leveraged to construct item embeddings.

  • Data Filtering: A 5-core filtering process was applied to ensure data quality. This means:

    • Items with fewer than five user interactions were removed.
    • Users with fewer than five item interactions were removed.
  • Characteristics: These subsets represent various product domains, allowing for testing the model's generalization across different types of items and user preferences. The 5-core filtering ensures that there is sufficient interaction data for both users and items to learn meaningful patterns (a small sketch of this filtering follows Table 1).

    The following are the results from Table 1 of the original paper:

    Dataset  |  # Users  |  # Items  |  Seq. Length (Mean)  |  Seq. Length (Median)
    Beauty  |  22,363  |  12,101  |  8.87  |  6
    Sports and Outdoors  |  35,598  |  18,357  |  8.32  |  6
    Toys and Games  |  19,412  |  11,924  |  8.63  |  6

Table 1: Dataset Statistics
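
The 5-core filtering mentioned in Section 5.1.1 is commonly applied iteratively, since removing low-activity items can push some users below the threshold and vice versa. A generic sketch (not the paper's actual preprocessing script) is given below.

```python
from collections import Counter

def five_core_filter(interactions):
    """Iteratively drop users/items with fewer than 5 interactions until stable.

    `interactions` is a list of (user_id, item_id) pairs; this is a generic sketch,
    not the paper's exact preprocessing code."""
    while True:
        user_counts = Counter(u for u, _ in interactions)
        item_counts = Counter(i for _, i in interactions)
        kept = [(u, i) for u, i in interactions
                if user_counts[u] >= 5 and item_counts[i] >= 5]
        if len(kept) == len(interactions):   # nothing removed on this pass -> done
            return kept
        interactions = kept

# Toy usage: 5 users x 5 items (each with 5 interactions) plus one low-activity user.
logs = [(f"u{u}", f"i{k}") for u in range(5) for k in range(5)] + [("u9", "i0")]
print(len(five_core_filter(logs)))   # 25: u9's single interaction is removed
```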

5.1.2. Industrial Dataset

For large-scale validation, COBRA was evaluated on the Baidu Industrial Dataset.

  • Source: Derived from user interaction logs on the Baidu advertising platform.
  • Scale: Consists of five million users and two million advertisements, representing a significant real-world scale and diversity.
  • Scenarios: Encompasses diverse recommendation scenarios, including list-page, dual-column, and short-video.
  • Advertiser/Advertisement Representation: Advertisers and advertisements are characterized by attributes such as title, industry labels, brand, and campaign text.
  • Dual Representation Encoding: These attributes are processed and encoded into two-level sparse IDs and dense vectors, which capture both coarse-grained and fine-grained semantic information. This dual representation is crucial for COBRA to model user preferences and item characteristics effectively.
  • Data Split:
    • D_train: User interaction logs collected over the first 60 days.
    • D_test: Logs from the day immediately following the D_train period, used for performance assessment.

5.2. Evaluation Metrics

The paper employs various metrics to assess the performance of COBRA, covering accuracy, ranking quality, and business impact.

5.2.1. Offline Evaluation Metrics

  • Recall@K (R@K):

    1. Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top K recommendations. It focuses on how many of the actual preferred items appear within the recommended list of a certain length.
    2. Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{R}(u, K) \cap \mathrm{T}(u)|}{|\mathrm{T}(u)|} $
    3. Symbol Explanation:
      • $U$: The set of all users.
      • $|U|$: The total number of users.
      • $\mathrm{R}(u, K)$: The set of top $K$ items recommended to user $u$.
      • $\mathrm{T}(u)$: The set of items that user $u$ actually interacted with (ground truth).
      • $|\cdot|$: Denotes the cardinality (number of elements) of a set.
      • $\cap$: Set intersection.
  • Normalized Discounted Cumulative Gain at K (NDCG@K):

    1. Conceptual Definition: NDCG@K is a measure of ranking quality that considers the position of relevant items. It assigns higher scores to relevant items that appear earlier in the recommendation list. It's normalized to range from 0 to 1, where 1 indicates a perfect ranking.
    2. Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@K}(u)}{\mathrm{IDCG@K}(u)} $ Where $\mathrm{DCG@K}(u)$ is the Discounted Cumulative Gain for user $u$ at rank $K$, and $\mathrm{IDCG@K}(u)$ is the Ideal Discounted Cumulative Gain for user $u$ at rank $K$. $ \mathrm{DCG@K}(u) = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i + 1)} $ $ \mathrm{IDCG@K}(u) = \sum_{i=1}^{\min(K, |\mathrm{T}(u)|)} \frac{2^{\mathrm{rel}_{\mathrm{ideal}}(i)} - 1}{\log_2(i + 1)} $
    3. Symbol Explanation:
      • $U$: The set of all users.
      • $|U|$: The total number of users.
      • $K$: The number of top recommendations considered.
      • $\mathrm{rel}(i)$: The relevance score of the item at position $i$ in the recommended list. For binary relevance (relevant/not relevant), this is typically 1 or 0.
      • $\mathrm{rel}_{\mathrm{ideal}}(i)$: The relevance score of the item at position $i$ in the ideal (perfectly ranked) list.
      • $\log_2(i + 1)$: Discount factor, giving less weight to relevant items at lower ranks.
  • Diversity (for Recall-Diversity Curves):

    1. Conceptual Definition: As defined in the paper, this metric measures the number of different IDs (likely sparse IDs or categories) present in the recalled items. It reflects the model's ability to offer a broad range of item categories, avoiding redundancy and promoting exploration.
    2. Mathematical Formula: Not explicitly provided, but conceptually it is: $ \mathrm{Diversity} = |\mathrm{UniqueIDs}(\mathcal{R})| $
    3. Symbol Explanation:
      • $\mathrm{UniqueIDs}(\mathcal{R})$: The set of unique sparse IDs present among the final recommended items $\mathcal{R}$.
      • $|\cdot|$: Denotes the cardinality of the set.
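
For reference, a small sketch of how Recall@K, NDCG@K (with binary relevance, where $2^{\mathrm{rel}} - 1$ reduces to 1), and the ID-based diversity measure could be computed per user; the function names and the binary-relevance simplification are assumptions.

```python
import numpy as np

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (rel = 1 if the item is in the ground truth)."""
    rel_set = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(recommended[:k]) if item in rel_set)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def id_diversity(recommended, item_to_sparse_id):
    """Number of distinct sparse IDs among the recommended items."""
    return len({item_to_sparse_id[i] for i in recommended})

recs, truth = ["a", "b", "c", "d"], ["b", "d", "x"]
print(recall_at_k(recs, truth, 4), round(ndcg_at_k(recs, truth, 4), 3))   # 0.667, 0.498
print(id_diversity(recs, {"a": 1, "b": 1, "c": 2, "d": 3}))               # 3 distinct sparse IDs
```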

5.2.2. Online Evaluation Metrics

For online A/B tests, COBRA uses business-oriented metrics:

  • Conversion: Measures the percentage of users who perform a desired action (e.g., click on an ad, make a purchase) after receiving recommendations. It directly reflects user engagement and the effectiveness of recommendations in driving specific behaviors.
  • Average Revenue Per User (ARPU): Measures the average revenue generated from each user over a specific period. This is a key economic metric, reflecting the direct business value generated by the recommendation system.

5.3. Baselines

To provide a comprehensive evaluation, COBRA is compared against various state-of-the-art recommendation methods, including both sequential dense and generative approaches, as well as its own ablated variants.

5.3.1. Public Dataset Baselines

  • P5 [11]: (Generative) Transforms recommendation tasks into natural language sequences.
  • Caser [39]: (Sequential Dense) Captures sequential patterns using convolutional layers.
  • HGN [28]: (Sequential Dense) Employs Hierarchical Gating Networks to model long-term and short-term user interests.
  • GRU4Rec [14]: (Sequential Dense) Uses Gated Recurrent Units for session-based recommendations.
  • SASRec [18]: (Sequential Dense) A Transformer-based model for self-attentive sequential recommendation, capturing long-term dependencies.
  • FDSA [52]: (Sequential Dense) A Feature-level Deeper Self-Attention network for sequential recommendation.
  • BERT4Rec [37]: (Sequential Dense) Utilizes bidirectional self-attention with a cloze objective for sequential recommendation.
  • S3-Rec [55]: (Sequential Dense) Employs Self-Supervised learning with mutual information maximization for sequential recommendation.
  • TIGER [33]: (Generative) A pioneering generative retrieval model that uses RQ-VAE to encode item content features into hierarchical semantic IDs and a Transformer for generation.

5.3.2. Industrial Dataset Baselines (COBRA Variants)

These variants are designed as ablation studies to understand the contribution of each component within COBRA.

  • COBRA w/o ID: This variant removes the sparse ID component, relying solely on dense vectors for recommendations. It resembles RecFormer [21] in its use of lightweight Transformers for sequence modeling based on dense representations.
  • COBRA w/o Dense: This variant removes the dense vector component, using only sparse IDs for retrieval. Due to the coarse-grained nature of IDs, this variant is analogous to generative retrieval methods like TIGER [33], which leverage semantic IDs for retrieval. It uses 3-level semantic IDs (256x256x256) for a more fine-grained representation than the default 2-level for other COBRA industrial variants.
  • COBRA w/o BeamFusion: This variant removes the BeamFusion module from the inference process. Instead, it typically uses the top-1 predicted sparse ID and performs standard nearest-neighbor retrieval (based on the generated dense vector corresponding to that top-1 ID) to obtain the top $K$ results. This tests the importance of combining beam search scores and nearest neighbor scores for diversity and precision.

5.4. Implementation Details

  • Semantic ID Generation (Public Datasets): A method similar to [33] (TIGER) is adopted, but with a different configuration. COBRA employs a 3-level semantic ID structure, where each level corresponds to a codebook size of 32. These semantic IDs are generated using the T5 model for encoding item attributes.
  • Architecture (Public Datasets): COBRA uses a lightweight Transformer architecture, specifically a 1-layer encoder and a 2-layer decoder.
  • Semantic ID Configuration (Industrial Dataset):
    • For the full COBRA and COBRA w/o ID variants, advertisement text is processed into sequences by the text encoder, and the sparse ID head predicts 2-level semantic IDs configured as 32 x 32.
    • For the COBRA w/o Dense variant (which relies solely on sparse IDs), 3-level semantic IDs are used, configured as 256 x 256 x 256 to compensate for the absence of dense vectors and provide more fine-grained modeling.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Public Dataset Performance

COBRA demonstrates superior performance across all evaluated metrics on public datasets, consistently outperforming baseline models. The paper presents these results in Table 2, broken down by dataset. Note: the extracted text of Table 2 is too garbled to be reliably reconstructed into tabular form, so the summary below reports the key figures quoted in Section 4.1.4 of the paper.

Table 2 of the original paper reports R@5, N@5, R@10, and N@10 for P5, Caser, HGN, GRU4Rec, BERT4Rec, FDSA, SASRec, S3-Rec, TIGER, and COBRA on the Beauty, Sports and Outdoors, and Toys and Games datasets. The extracted text of the table is too scrambled to transcribe faithfully; please refer to the original paper for the complete numbers. The narrative summary below quotes the headline results.

Summary of Results from Narrative:

  • Beauty Dataset:
    • COBRA achieves Recall@5 of 0.0537 and Recall@10 of 0.0725.
    • This represents an 18.3% improvement over TIGER for Recall@5 and 11.9% for Recall@10.
  • Sports and Outdoors Dataset:
    • COBRA records Recall@5 of 0.0305 and NDCG@10 of 0.0215.
    • This outperforms TIGER by 15.5% for Recall@5 and 18.8% for NDCG@10.
  • Toys and Games Dataset:
    • COBRA attains Recall@10 of 0.0462 and NDCG@10 of 0.0515.

    • This surpasses TIGER by 24.5% for Recall@10 and 19.2% for NDCG@10.

      These results indicate that COBRA consistently achieves superior performance across diverse public datasets, demonstrating its effectiveness in balancing precision and diversity compared to existing generative and sequential dense recommendation methods. The significant improvements over TIGER, a leading generative retrieval model, highlight the advantages of COBRA's cascaded sparse-dense representation and coarse-to-fine generation.

6.1.2. Industrial-scale Experiments

On the large-scale Baidu Industrial Dataset, COBRA is compared against its ablated variants. These experiments validate the contributions of COBRA's individual components in a real-world setting.

The following are the results from Table 3 of the original paper:

Method R@50 R@100 R@200 R@500 R@800
COBRA 0.1180 0.1737 0.2470 0.3716 0.4466
COBRA w/o ID 0.0611 0.0964 0.1474 0.2466 0.3111
COBRA w/o Dense 0.0690 0.1032 0.1738 0.2709 0.3273
COBRA w/o BeamFusion 0.0856 0.1254 0.1732 0.2455 0.2855

Table 3: Performance comparison on industrial dataset

As shown in Table 3, COBRA consistently outperforms all its variants across all Recall@K metrics.

  • At $K=500$, COBRA achieves a Recall@500 of 0.3716, which the paper describes as a 42.2% improvement over COBRA w/o Dense. Computed directly from Table 3, the relative gain over COBRA w/o Dense is (0.3716 − 0.2709) / 0.2709 ≈ 37.2% (and ≈ 40.8% if COBRA w/o ID's 0.2466 is used as the base), so the stated 42.2% likely reflects a different base or a typo. Either way, the gain is substantial.
  • At $K=800$, COBRA attains a Recall@800 of 0.4466. This reflects:
    • A 43.6% improvement over COBRA w/o ID ((0.4466 - 0.3111) / 0.3111 \approx 0.4355).

    • A 36.1% enhancement compared to COBRA w/o BeamFusion. Note that (0.4466 − 0.2855) / 0.2855 ≈ 56.4%, whereas (0.4466 − 0.2855) / 0.4466 ≈ 36.1%, so the paper's figure evidently uses COBRA's own recall as the base. In either reading, the trend is clear: COBRA significantly outperforms the ablated variants.

The results underscore the importance of the cascaded representations:

  • At smaller $K$ values, the absence of either the dense or the ID representation leads to more pronounced performance declines, indicating that both are crucial for granularity and precision in initial retrieval.
  • As $K$ increases, the advantages of BeamFusion become more apparent, demonstrating its effectiveness in industrial recall systems for broader retrieval.
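To make the two percentage conventions explicit, the short sketch below recomputes the $K=800$ figures from Table 3 under both of them; the variable names are ours, not the paper's.

```python
# Recall@800 values transcribed from Table 3.
cobra = 0.4466
variants = {"w/o ID": 0.3111, "w/o Dense": 0.3273, "w/o BeamFusion": 0.2855}

for name, value in variants.items():
    gain_over_variant = (cobra - value) / value   # improvement of COBRA relative to the ablation
    drop_from_cobra = (cobra - value) / cobra     # drop of the ablation relative to full COBRA
    print(f"{name:>15}: +{gain_over_variant:.1%} over variant, -{drop_from_cobra:.1%} vs. COBRA")
```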

6.1.3. Component Contributions

The ablation study further quantifies the contribution of specific components:

  • Excluding sparse IDs (COBRA w/o ID): the stated recall reduction ranges from 26.7% to 41.5%; recomputed from Table 3 as a relative drop against full COBRA, it spans roughly 30.3% at R@800 ($(0.4466 - 0.3111)/0.4466$) to 48.2% at R@50 ($(0.1180 - 0.0611)/0.1180$). Regardless of the exact figures, the pronounced drop highlights the critical role of the semantic categorization provided by sparse IDs.

  • Removing dense vectors (COBRA w/o Dense): the stated drop is between 30.3% and 48.3%; recomputed the same way, it spans roughly 26.7% at R@800 to 41.5% at R@50 ($(0.1180 - 0.0690)/0.1180$). The recomputed values suggest that the two stated ranges for the w/o ID and w/o Dense variants were transposed in the narrative. Either way, the drop underscores the importance of the fine-grained modeling and continuous feature resolution that dense vectors provide.

  • Eliminating BeamFusion (COBRA w/o BeamFusion): leads to a recall decrease of 27.5% at R@50 ($(0.1180 - 0.0856)/0.1180$) to 36.1% at R@800, emphasizing BeamFusion's significance in integrating sparse signals into the overall retrieval process.

These results validate that each core component of COBRA (sparse IDs, dense vectors, and BeamFusion) contributes significantly to overall recommendation performance; the per-$K$ drops can be recomputed from Table 3 with the short sketch below.
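A minimal sketch for that recomputation; the lists below simply transcribe Table 3, and the loop reports the relative drop of each ablated variant at every $K$ along with the resulting range.

```python
# Recall@K values transcribed from Table 3.
ks = [50, 100, 200, 500, 800]
cobra = [0.1180, 0.1737, 0.2470, 0.3716, 0.4466]
ablations = {
    "w/o ID":         [0.0611, 0.0964, 0.1474, 0.2466, 0.3111],
    "w/o Dense":      [0.0690, 0.1032, 0.1738, 0.2709, 0.3273],
    "w/o BeamFusion": [0.0856, 0.1254, 0.1732, 0.2455, 0.2855],
}

for name, values in ablations.items():
    # Relative drop of each ablated variant versus full COBRA at every K.
    drops = [(full - abl) / full for full, abl in zip(cobra, values)]
    per_k = ", ".join(f"R@{k}: -{d:.1%}" for k, d in zip(ks, drops))
    print(f"{name:>15} | {per_k} | range {min(drops):.1%}-{max(drops):.1%}")
```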

6.2. Further Analysis

6.2.1. Analysis of Representation Learning

To assess COBRA's ability to learn effective item representations, similarity matrices and t-SNE visualizations are employed.

The following figure (Figure 4 from the original paper) displays the comparison of cosine similarity matrices between the COBRA method and the version without IDs:

Figure 4: Comparison of cosine similarity matrices between the COBRA method and the version without IDs, where (a) shows the similarity matrix for COBRA, (b) the similarity matrix for the variant without IDs, and (c) the difference between the two. The comparison makes the effect of incorporating sparse IDs on the learned advertisement representations directly visible.

  • Similarity Matrices (Figure 4):

    • Figure 4a (COBRA): Shows significant intra-ID cohesion (items within the same sparse ID category are highly similar) and strong inter-ID separation (items from different sparse ID categories are clearly distinct). This indicates that COBRA's dense embeddings effectively capture fine-grained item characteristics while maintaining strong semantic consistency within broader categories defined by the sparse IDs.

    • Figure 4b (COBRA w/o ID): The model variant without sparse IDs exhibits weaker category separation, with less distinct boundaries between different groups of items. This highlights that sparse IDs play a crucial role in providing a structural backbone that helps organize the dense representation space.

    • Figure 4c (Difference Matrix): Quantitatively confirms that the incorporation of sparse IDs in COBRA enhances both the cohesion within categories and the separation between categories.

The following figure (Figure 5 from the original paper) illustrates the distribution of advertisement embeddings using t-SNE:

Figure 5: Embedding visualization using t-SNE. The plot shows the distribution of 10,000 randomly sampled advertisement embeddings in a two-dimensional space for COBRA; the scattered colored points form distinct clusters corresponding to different advertisement categories.

  • t-SNE Visualization (Figure 5):

    • The t-SNE plot visualizes 10,000 randomly sampled advertisement embeddings in a two-dimensional space.
    • It clearly reveals distinct clustering centers for various categories (represented by different colors), indicating strong cohesion within categories.
    • For example, clusters in purple correspond to novels, teal to games, light green to legal services, and dark green to clothing. This visualization empirically validates that COBRA's advertisement representations effectively capture and organize semantic information according to item categories. A generic sketch reproducing both diagnostics follows this list.
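The two diagnostics above are straightforward to reproduce for any set of item embeddings grouped by sparse ID. The sketch below is a generic illustration rather than the paper's analysis code: `embeddings` and `sparse_ids` stand in for a matrix of learned item vectors and their (first-level) sparse-ID assignments, and the random data is only a placeholder.

```python
import numpy as np
from sklearn.manifold import TSNE

def similarity_diagnostics(embeddings: np.ndarray, sparse_ids: np.ndarray):
    """Average intra-ID vs. inter-ID cosine similarity for ID-grouped item embeddings."""
    # Row-normalize so that dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    same_id = sparse_ids[:, None] == sparse_ids[None, :]
    off_diag = ~np.eye(len(embeddings), dtype=bool)
    intra = sim[same_id & off_diag].mean()   # cohesion within a sparse-ID category
    inter = sim[~same_id].mean()             # separation across categories
    return intra, inter

# Toy usage with random data standing in for learned advertisement embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
ids = rng.integers(0, 8, size=1000)
print(similarity_diagnostics(emb, ids))

# 2-D projection for a Figure-5-style scatter plot (color the points by `ids`).
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
```

A large gap between the intra-ID and inter-ID averages corresponds to the block-diagonal pattern visible in Figure 4a.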

6.2.2. Recall-Diversity Equilibrium

The BeamFusion mechanism is designed to balance recommendation accuracy (Recall) and Diversity. This trade-off is analyzed through recall-diversity curves.

The following figure (Figure 6 from the original paper) illustrates the relationship between recall and diversity:

Figure 6: Recall-Diversity Curves. The x-axis represents the coefficient $\tau$, and the y-axis shows the Recall@2000 and Diversity metrics.

  • Recall-Diversity Curves (Figure 6):
    • The curves depict how the Recall@2000 and Diversity metrics change as the coefficient $\tau$ (which weights the sparse ID beam score in BeamFusion) varies, while $\psi$ (the weight for dense vector cosine similarity) is fixed at 16.
    • Increasing $\tau$ generally decreases diversity. This is because a higher $\tau$ places more emphasis on the beam score of the initially generated sparse IDs; if the beam search consistently favors a few highly probable IDs, the resulting recommendations will be less diverse across categories.
    • COBRA achieves an optimal balance between recall and diversity at $\tau = 0.9$ and $\psi = 16$. At this point, the model maintains high accuracy (Recall@2000 remains high) while ensuring that recommendations cover a sufficiently diverse set of items.
    • The diversity metric (the number of different IDs among the recalled items) confirms the model's ability to provide a broader range of options, thus avoiding redundancy.
    • This fine-grained control over $\tau$ and $\psi$ allows practitioners to adjust the emphasis based on specific business objectives (e.g., a lower $\tau$ for exploration, a higher $\tau$ for precision within expected categories). This adaptability makes COBRA flexible for diverse recommendation scenarios; a minimal sketch of the fused scoring rule follows this list.
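To make the roles of $\tau$ and $\psi$ concrete, here is a minimal sketch of a BeamFusion-style fused score consistent with the description above: each candidate item is ranked by a weighted combination of the beam score of its sparse ID and the cosine similarity between its dense vector and the generated query vector. The function name, candidate layout, and normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def beamfusion_rank(id_beam_scores, item_ids, item_vectors, query_vector,
                    tau=0.9, psi=16.0, top_k=10):
    """Rank candidates by tau * (sparse-ID beam score) + psi * (dense cosine similarity)."""
    vecs = np.asarray(item_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    cos = vecs @ query / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query) + 1e-12)
    beam = np.array([id_beam_scores[i] for i in item_ids])   # beam score of each item's sparse ID
    fused = tau * beam + psi * cos
    order = np.argsort(-fused)[:top_k]
    return order, fused[order]

# Toy usage: three sparse IDs survive beam search; five candidate items carry those IDs.
scores = {"id_a": 0.7, "id_b": 0.2, "id_c": 0.1}
ids = ["id_a", "id_a", "id_b", "id_c", "id_c"]
vecs = np.random.default_rng(1).normal(size=(5, 8))
query = np.random.default_rng(2).normal(size=8)
print(beamfusion_rank(scores, ids, vecs, query))
```

Shrinking $\tau$ (or growing $\psi$) lets the dense similarity dominate, pulling candidates from more sparse-ID categories and raising diversity, while a larger $\tau$ concentrates the ranking on the highest-scoring IDs.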

6.3. Online Results

To confirm its real-world applicability and impact, COBRA underwent online A/B tests in January 2025 on the Baidu advertising platform from which the industrial dataset is drawn.

  • Test Setup: The test involved 10% of user traffic to ensure statistical significance.
  • Evaluation Metrics: The primary online metrics were conversion (reflecting user engagement) and Average Revenue Per User (ARPU) (reflecting economic value).
  • Results: Within the traffic segment served by the proposed COBRA strategy, the online A/B tests demonstrated:
    • A 3.60% increase in conversion.
    • A 4.15% increase in ARPU.

These significant improvements in key business metrics validate COBRA's practical advantages, demonstrating that its hybrid architecture not only enhances recommendation quality in offline evaluations but also translates into measurable positive business outcomes in a large-scale production environment.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces COBRA (Cascaded Organized Bi-Represented generAtive retrieval), a novel generative recommendation framework that effectively integrates cascaded sparse and dense representations. COBRA addresses the information loss in traditional generative models by employing a coarse-to-fine generation process: it first generates sparse IDs to capture the categorical essence of an item, then refines this with a dynamically generated dense vector for fine-grained details. The framework is trained end-to-end with a dual-objective loss function and incorporates a BeamFusion mechanism during inference to balance accuracy and diversity. Extensive experiments on public datasets show COBRA's superior recommendation accuracy over state-of-the-art methods. Crucially, offline and online A/B tests on a real-world industrial advertising platform with over 200 million daily users confirm substantial improvements in key business metrics like conversion and ARPU, highlighting its robustness and practical applicability in large-scale scenarios.
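To make the coarse-to-fine inference flow summarized above concrete, the following is a minimal sketch of the loop it implies: generate candidate sparse IDs first, then produce a dense query vector conditioned on each ID and the interaction history, and finally retrieve nearest items within that ID's candidate set. The interfaces `id_head`, `dense_head`, and `ann_lookup` are hypothetical placeholders, not the paper's API.

```python
from typing import Callable, List, Sequence, Tuple

def cascaded_next_items(
    history: Sequence,                                          # user interaction sequence S_{1:t}
    id_head: Callable[[Sequence], List[Tuple[str, float]]],     # candidate sparse IDs with beam scores
    dense_head: Callable[[Sequence, str], List[float]],         # dense vector conditioned on history and an ID
    ann_lookup: Callable[[str, List[float], int], List[str]],   # nearest items within the ID's candidate set
    beam_width: int = 3,
    top_k: int = 10,
) -> List[str]:
    """Coarse-to-fine retrieval following P(ID | S) and then P(v | ID, S)."""
    candidates: List[str] = []
    # Coarse step: keep the top sparse IDs proposed for the next position.
    for sparse_id, _beam_score in id_head(history)[:beam_width]:
        # Fine step: generate a dense query vector conditioned on this sparse ID.
        query_vec = dense_head(history, sparse_id)
        # Retrieve nearest neighbors restricted to items carrying this sparse ID.
        candidates.extend(ann_lookup(sparse_id, query_vec, top_k))
    return candidates[:top_k]

# Toy usage with stub heads standing in for the trained model and the ANN index.
stub_ids = lambda hist: [("id_a", 0.7), ("id_b", 0.3)]
stub_dense = lambda hist, sid: [0.1, 0.2, 0.3]
stub_ann = lambda sid, vec, k: [f"{sid}_item{i}" for i in range(k)]
print(cascaded_next_items(["i1", "i2"], stub_ids, stub_dense, stub_ann, top_k=4))
```

In the full system the pooled candidates would then be re-scored with a BeamFusion-style rule (see Section 6.2.2) rather than simply truncated as in this sketch.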

7.2. Limitations & Future Work

The authors do not explicitly enumerate limitations or future work within a dedicated section in this paper. However, based on the context and the problem statement, potential areas for future research and inherent limitations can be inferred:

  • Complexity of Multi-Level IDs: While COBRA uses 2-level or 3-level semantic IDs, the complexity of optimally designing and learning hierarchical sparse IDs for very large and diverse item catalogs remains a challenge. The trade-off between the number of levels, codebook size, and the precision/recall of the RQ-VAE could be further explored.
  • Computational Cost of Dense Vector Generation: While sparse IDs are efficient, generating a dense vector for each candidate sparse ID during inference, followed by ANN search and BeamFusion, adds computational overhead compared to purely sparse retrieval. Optimizing the efficiency of the dense vector generation and ANN lookup in real-time for extremely low-latency scenarios could be a direction.
  • Generative Model Latency: Transformer-based generative models can have higher inference latency compared to simpler discriminative ranking models, especially with longer interaction sequences and larger beam widths. Further work on model compression, distillation, or more efficient Transformer architectures could be beneficial.
  • Interpretability: While sparse IDs offer some level of interpretability (e.g., category-based suggestions), the dense vectors and the complex interaction within the Transformer Decoder might still pose challenges for explaining specific recommendations to users or understanding model decisions.
  • Generalization to New Items/Cold Start: The RQ-VAE and dense encoder rely on item textual attributes. While COBRA is flexible, handling completely new items with minimal attributes or a rapidly evolving item catalog might still be challenging for generating optimal sparse IDs and dense vectors.

7.3. Personal Insights & Critique

COBRA presents a compelling solution to a critical problem in generative recommendation: bridging the gap between the efficiency of discrete ID-based generation and the precision of dense retrieval. The cascaded approach is particularly insightful, formalizing the intuitive idea of a coarse-to-fine recommendation process. By first narrowing down the category with a sparse ID and then refining with a dense vector, the model elegantly manages the complexity of predicting from a vast item space.

A key strength is the end-to-end training of the dense representations. This contrasts with previous hybrid models like LIGER that use fixed dense embeddings, which can limit adaptability. Allowing the dense vectors to dynamically learn from the recommendation task, guided by sparse IDs, is a powerful design choice that likely contributes significantly to COBRA's superior performance.

The BeamFusion mechanism is another notable innovation. Recommendation systems often face a dilemma between accuracy and diversity. BeamFusion provides a practical knob (τ,ψ\tau, \psi) for practitioners to explicitly control this trade-off, which is invaluable in real-world applications where business objectives might prioritize exploration over exploitation at different times. The empirical validation, especially the online A/B tests on a massive industrial platform, provides strong evidence of COBRA's robustness and practical impact, moving beyond theoretical benchmarks to tangible business value.

Potential areas for further exploration or unverified assumptions:

  • Optimality of Decomposition: The probabilistic decomposition $P(\mathrm{ID}_{t+1}, \mathbf{v}_{t+1} \mid S_{1:t}) = P(\mathrm{ID}_{t+1} \mid S_{1:t})\,P(\mathbf{v}_{t+1} \mid \mathrm{ID}_{t+1}, S_{1:t})$ is exact by the chain rule; the real modeling assumption is that conditioning the dense-vector head on the generated sparse ID (and the interaction history) is sufficient to simplify dense vector generation. Whether this conditioning captures enough, or whether richer interactions (e.g., joint attention over sparse and dense tokens at each step) would be beneficial, is worth investigating.

  • Scalability of ANN and $C(\mathrm{ID})$: For extremely large item catalogs, maintaining and querying the item database efficiently, especially the per-sparse-ID candidate subset $C(\hat{\mathbf{ID}}_{T+1}^{k})$, can be challenging. The performance relies heavily on the efficiency of the underlying ANN index and on how well items are partitioned by sparse ID.

  • Robustness to Noisy/Sparse Attributes: The quality of sparse IDs and dense vectors heavily depends on the richness and quality of item textual attributes. In domains with very sparse or noisy attribute information, the initial representation learning might suffer.

  • Transferability to Other Domains: While evaluated on product reviews and advertising, applying COBRA to domains with different interaction patterns (e.g., implicit feedback only, short-form content) or less structured item metadata would be an interesting test of its generality.

Overall, COBRA represents a significant step forward in unifying generative and dense retrieval for sequential recommendation, offering a robust and practically effective solution for large-scale, real-world systems. Its architectural elegance and strong empirical results make it an inspiring model for future research in hybrid recommendation paradigms.
