Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation
TL;DR Summary
BLOGER unifies tokenizer and recommender optimization via bi-level learning and gradient surgery, enhancing item identifiers' quality and aligning with recommendation goals for improved generative recommendation performance.
Abstract
Generative recommendation is emerging as a transformative paradigm by directly generating recommended items, rather than relying on matching. Building such a system typically involves two key components: (1) optimizing the tokenizer to derive suitable item identifiers, and (2) training the recommender based on those identifiers. Existing approaches often treat these components separately--either sequentially or in alternation--overlooking their interdependence. This separation can lead to misalignment: the tokenizer is trained without direct guidance from the recommendation objective, potentially yielding suboptimal identifiers that degrade recommendation performance. To address this, we propose BLOGER, a Bi-Level Optimization for GEnerative Recommendation framework, which explicitly models the interdependence between the tokenizer and the recommender in a unified optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both the tokenization loss and recommendation loss. We adopt a meta-learning approach to solve this bi-level optimization efficiently, and introduce gradient surgery to mitigate gradient conflicts in the upper-level updates, thereby ensuring that item identifiers are both informative and recommendation-aligned. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods while maintaining practical efficiency with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation
1.2. Authors
-
Yimeng Bai (University of Science and Technology of China)
-
Chang Liu (Beihang University)
-
Yang Zhang (National University of Singapore)
-
Dingxian Wang (Upwork)
-
Frank Yang (Upwork)
-
Andrew Rabinovich (Upwork)
-
Wenge Rong (Beihang University)
-
Fuli Feng (University of Science and Technology of China)
The authors represent a mix of academic institutions and industry entities (Upwork), suggesting a research effort that combines theoretical rigor with practical applicability.
1.3. Journal/Conference
The ACM reference within the paper lists: Woodstock '18: ACM Symposium on Neural Gaze Detection, June 03-05, 2018, Woodstock, NY. ACM, New York, NY, USA.
Note: The publication date provided in the metadata is 2025-10-24T08:25:56.000Z, while the ACM reference within the paper states 2018. The Woodstock '18 entry is the standard placeholder from the ACM LaTeX template, so the paper has not yet been formally published at a specific venue; it is currently a 2025 arXiv preprint. The ACM is a highly reputable organization for computer science research, and its conferences (like SIGIR, KDD, WSDM) are top venues for recommender systems research.
1.4. Publication Year
2025 (as indicated by the arXiv timestamp, considering the ACM reference year as a placeholder).
1.5. Abstract
Generative recommendation aims to directly generate recommended items rather than matching them. Such systems typically involve two components: an item tokenizer to create suitable item identifiers, and a recommender trained on these identifiers. Current methods often treat these components separately (sequentially or alternately), leading to misalignment where the tokenizer lacks direct guidance from the recommendation objective, potentially yielding suboptimal identifiers. This paper introduces BLOGER (Bi-Level Optimization for GEnerative Recommendation), a framework that explicitly models the interdependence between the tokenizer and recommender using a unified bi-level optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both tokenization loss and a recommendation loss, which acts as direct supervision. BLOGER employs a meta-learning approach for efficiency and incorporates gradient surgery to mitigate gradient conflicts in upper-level updates. Experiments on real-world datasets show BLOGER consistently outperforms state-of-the-art generative recommendation methods with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
1.6. Original Source Link
https://arxiv.org/abs/2510.21242 Publication status: This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in the emerging paradigm of generative recommendation. Unlike traditional discriminative (retrieve-and-rank) systems that rely on matching user queries to candidate items, generative recommenders directly generate item identifiers autoregressively. This offers significant advantages such as improved scalability for large item corpora, reduced susceptibility to feedback loops, and enhanced user personalization.
The construction of a generative recommendation system typically involves two main components:
- Item Tokenization: Optimizing a tokenizer to convert each item into a sequence of tokens, which serves as its unique identifier.
- Autoregressive Generation: Training a recommender (often a Transformer-based model like T5) to predict the next item's identifier based on the user's historical interaction sequence.
The core problem identified by the authors is the decoupled optimization of these two components. Existing approaches typically train the tokenizer and recommender either sequentially (first tokenizer, then recommender) or alternately (iterative updates). This separation creates an interdependence misalignment: the tokenizer is optimized without direct feedback from the ultimate recommendation objective. Consequently, the generated item identifiers might be suboptimal for the recommender's task, failing to capture user-behavior correlations and ultimately degrading recommendation performance. For instance, a tokenizer might assign unrelated identifiers to items frequently co-purchased (e.g., "baby formula" and "diapers") if it only considers modality differences, thereby hindering the recommender's ability to link them.
The paper's entry point and innovative idea is to explicitly model this intricate interdependence within a unified optimization framework, thereby ensuring that the tokenizer's output (item identifiers) is directly aligned with the recommender's goal (accurate item prediction).
2.2. Main Contributions / Findings
The primary contributions of this work are:
- Bi-Level Optimization Formulation: The paper proposes reformulating generative recommendation as a bi-level optimization problem. This explicitly models the mutual dependence between the tokenizer and the recommender, where the recommender's training forms the inner loop and the tokenizer's optimization (guided by the recommender's performance) forms the outer loop.
- BLOGER Framework: Introduction of BLOGER (Bi-Level Optimization for GEnerative Recommendation), a novel framework designed to efficiently and effectively solve this bi-level optimization problem. It employs meta-learning (specifically, a MAML-like approach) for efficient optimization and incorporates gradient surgery to resolve potential conflicts between the tokenization loss and recommendation loss during the outer-level updates. This ensures that the generated item identifiers are both informative and aligned with the recommendation objective, without introducing significant additional computational overhead.
- Empirical Validation: Extensive experiments conducted on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods. The analysis further reveals that the performance gains largely stem from improved codebook utilization, indicating that the recommendation loss effectively acts as a natural indicator of collaborative information, rectifying codebook allocation bias that might arise from relying solely on textual information.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Generative Recommendation
Generative recommendation is a paradigm shift from traditional discriminative recommendation (often termed retrieve-and-rank). Instead of selecting items from a pre-defined candidate set by matching user queries or preferences, generative recommenders directly create or generate the identifier of the next item a user is likely to interact with. This is typically done in an autoregressive manner, meaning each part of the item identifier is generated sequentially, conditioned on previously generated parts and the user's historical interactions. This approach eliminates the need for one-to-one comparisons with potentially millions of candidate items, leading to better scalability and enabling the generation of novel items.
Item Tokenization
Item tokenization is the process of converting an item into a sequence of discrete units called tokens. Just as natural language processing models break sentences into words or sub-word units, generative recommenders need a way to represent items in a discrete, machine-understandable format that can be generated sequentially. These token sequences serve as unique identifiers or "semantic IDs" for items. The quality of these token sequences is crucial: they should ideally capture rich semantic information about the item and its relationships with other items, while also being concise and easy for a generative model to learn and produce.
Autoregressive Generation
Autoregressive generation refers to a sequence generation process where each element in a sequence is generated one at a time, conditioned on all previously generated elements in that sequence. In the context of generative recommendation, an autoregressive model predicts the first token of an item identifier, then predicts the second token based on the first, and so on, until the entire item identifier (sequence of tokens) is complete. This is a common mechanism in sequence-to-sequence models like Transformers and recurrent neural networks (RNNs).
Transformer
The Transformer is a neural network architecture introduced in 2017, known for its effectiveness in sequence-to-sequence tasks, particularly in natural language processing. It revolutionized sequence modeling by entirely eschewing recurrence and convolutions, instead relying solely on attention mechanisms.
A Transformer typically consists of an encoder and a decoder.
- Encoder: Processes the input sequence (e.g., the user's historical item tokens) and produces a contextualized representation. It consists of multiple identical layers, each containing a multi-head self-attention mechanism and a feed-forward network.
- Decoder: Generates the output sequence (e.g., the next item's tokens) autoregressively. It also consists of multiple identical layers, but each layer includes an additional masked multi-head self-attention mechanism (to prevent attending to future tokens) and an encoder-decoder attention mechanism (to attend to the encoder's output).
The T5 (Text-to-Text Transfer Transformer) model, specifically mentioned in the paper, is a variant of the Transformer architecture that frames all NLP tasks as a text-to-text problem, making it highly versatile for various sequence generation tasks.
Attention Mechanism (Self-Attention)
The core component of a Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
Given an input sequence, self-attention computes three matrices:
-
Query (Q): Represents the current token being processed.
-
Key (K): Represents all tokens in the sequence.
-
Value (V): Represents the information content of all tokens.
The attention score between a Query and all Keys determines how much focus to place on each Value. The output is a weighted sum of the Values. The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
- $Q$ is the Query matrix.
- $K$ is the Key matrix.
- $V$ is the Value matrix.
- $d_k$ is the dimension of the Key vectors, used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
- $QK^T$ computes the dot product similarity between Query and Key vectors.
- softmax converts these scores into probability distributions.
Multi-head attention extends this by running the attention mechanism multiple times in parallel with different learned linear projections of Q, K, V, and then concatenating their outputs.
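As a concrete reference for the formula above, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking); it is illustrative only and not tied to the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values

# Toy usage: 3 query tokens, 4 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 8)
```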
Residual Quantization (RQ-VAE)
Residual Quantization (RQ) is a technique used to compress continuous vector representations into discrete token sequences, often in a hierarchical manner. It works by progressively quantizing the residual (the part of the vector not yet explained by previous quantization steps).
- The first quantization step quantizes the original continuous vector.
- The residual error from the first step (original vector minus its quantized approximation) is then quantized in the second step.
- This process repeats for multiple levels, with each level quantizing the residual from the previous level.
This hierarchical approach allows for representing items with increasing granularity and creates a natural tree-structured item space.
RQ-VAE refers to combining Residual Quantization with a Variational Autoencoder (VAE) framework, where the quantization process is part of the VAE's bottleneck.
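The hierarchy can be illustrated with a short NumPy sketch of greedy residual quantization; the codebooks here are random and untrained, and no VAE encoder/decoder is included.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the nearest code
    embedding for the current residual, then subtract it.
    codebooks: list of (K, d) arrays; z: (d,) vector. Returns token ids."""
    tokens, residual = [], z.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # distance to each code
        c = int(dists.argmin())                              # hard token assignment
        tokens.append(c)
        residual = residual - codebook[c]                    # quantize what remains
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]   # e.g. L=4 levels, K=256 codes
item_embedding = rng.normal(size=32)
print(residual_quantize(item_embedding, codebooks))          # e.g. [id1, id2, id3, id4]
```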
Bi-Level Optimization
Bi-level optimization involves two nested optimization problems. It's a hierarchical optimization framework where the solution of an inner (lower-level) problem constrains or determines parameters for an outer (upper-level) problem.
The general form is:
$
\begin{array}{rl}
\min_{x} & F(x, y^*(x)) \\
\mathrm{s.t.} & y^*(x) = \arg\min_{y} G(x, y)
\end{array}
$
Where:
- $x$ are the outer-level variables.
- $y$ are the inner-level variables.
- $F(x, y^*(x))$ is the outer-level objective function.
- $G(x, y)$ is the inner-level objective function.
- $y^*(x)$ denotes the optimal solution of the inner-level problem for a given $x$.
In the context of this paper, the tokenizer parameters would be the outer-level variables, and the recommender parameters would be the inner-level variables. The recommender is optimized given a tokenizer, and then the tokenizer is optimized based on how well the recommender performs with its outputs.
Meta-Learning (MAML)
Meta-learning, or "learning to learn," aims to train models that can quickly adapt to new tasks or environments with minimal data. MAML (Model-Agnostic Meta-Learning) is a popular meta-learning algorithm. The core idea is to find a good model initialization that, after one or a few gradient descent steps on a new task, leads to a good performance on that task.
For bi-level optimization, meta-learning can be used to efficiently approximate the inner-level optimization. Instead of fully converging the inner-level model, MAML performs a "tentative update" (one or a few gradient descent steps) of the inner-level parameters. The outer-level then uses the performance of this tentatively updated inner-level model to compute its meta-gradients and update its own parameters. This avoids the computational burden of fully solving the inner-level problem at each outer-level iteration.
Gradient Surgery
Gradient surgery is a technique used in multi-task learning to address gradient conflicts, which occur when gradients from different tasks point in opposing directions, potentially hindering overall learning. When tasks have conflicting objectives, a naive summation of their gradients can lead to oscillations or suboptimal convergence.
Gradient surgery identifies gradients that conflict (e.g., by checking cosine similarity) and then modifies them. A common strategy, as adopted in this paper, is to project one gradient onto the normal plane of another, effectively removing any component that opposes the second gradient. This ensures that the update direction is at least orthogonal (or aligned) with the desired gradient and does not negatively impact the other task's progress.
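A minimal NumPy sketch of this PCGrad-style projection, with toy two-dimensional gradients (illustrative only):

```python
import numpy as np

def project_conflicting(g_rec, g_token):
    """If g_rec conflicts with g_token (negative dot product), remove the
    component of g_rec that opposes g_token by projecting onto its normal plane."""
    dot = np.dot(g_rec, g_token)
    if dot < 0:
        g_rec = g_rec - dot / np.dot(g_token, g_token) * g_token
    return g_rec

g_token = np.array([1.0, 0.0])
g_rec = np.array([-1.0, 1.0])                 # conflicts with g_token
g_proj = project_conflicting(g_rec, g_token)
print(g_proj, np.dot(g_proj, g_token))        # [0. 1.] 0.0 -> no longer opposes g_token
```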
3.2. Previous Works
The paper categorizes previous works into Traditional recommender models and Generative recommender models.
Traditional Recommender Models
These models typically operate on discrete item IDs and focus on query-candidate matching or predicting a score for potential items.
- MF (Matrix Factorization) [19]: Decomposes the user-item interaction matrix into lower-dimensional user and item embeddings. This allows for capturing latent features and predicting missing interactions.
- LightGCN [13]: A lightweight Graph Convolutional Network (GCN) that simplifies the propagation rule of GCNs for recommendation by removing feature transformations and non-linear activations. It focuses on learning user and item embeddings through explicit graph convolution on the user-item interaction graph, capturing high-order interactions.
- Caser [38]: Convolutional Sequence Embedding Recommendation uses horizontal and vertical convolutional filters to capture point-wise and sequential patterns in user interaction sequences.
- GRU4Rec [15]: An RNN-based sequential recommender that utilizes Gated Recurrent Units (GRUs) to encode historical interactions and predict the next item. It is effective for capturing short-term user interests.
- HGN (Hierarchical Gating Networks) [27]: Employs hierarchical gating networks to capture both long-term and short-term user interests from item sequences, allowing for flexible aggregation of information at different time scales.
- SASRec (Self-Attentive Sequential Recommendation) [18]: Utilizes self-attention mechanisms to capture long-term dependencies within a user's interaction history, enabling it to model complex sequential patterns more effectively than RNN-based models.
Generative Recommender Models
These models aim to directly generate item identifiers. They can be broadly classified by how they integrate item representation and generation.
LLM-based Generative Recommenders
These approaches leverage Large Language Models (LLMs) for recommendation, often by formulating the task as a text generation problem.
- P5 [10]: Frames recommendation as language processing by using a Unified Pretrain, Personalized Prompt & Predict Paradigm. It integrates collaborative knowledge into LLM-based recommenders by generating item identifiers, often derived through spectral clustering on item co-occurrence graphs.
- BIGRec [3]: Represents items using their titles and fine-tunes LLMs for recommendation under a bi-step grounding paradigm.
- IDGenRec [37]: Focuses on generating unique, concise, semantically rich, and platform-agnostic textual identifiers for items, making them more adaptable for LLM-based recommender systems.
- BinLLM [48]: Transforms collaborative embeddings from external models into binary sequences, aligning item representations with LLM-consumable formats.
Sequence-to-Sequence Generative Recommenders
These models combine item tokenization with autoregressive generation, often using Transformer-like architectures. The paper highlights a distinction based on their optimization strategy:
-
Sequential Optimization (Decoupled):
- TIGER [34]: Represents each item with a Semantic ID (a tuple of codewords) and uses a Transformer to predict the next one. It leverages text embeddings to construct these semantic IDs, typically in a sequential optimization manner where tokenization is done first and then the recommender is trained.
- LETTER [40]: Enhances TIGER by introducing a learnable tokenizer that incorporates hierarchical semantics and collaborative signals, and promotes diversity in code assignment. It still largely follows a sequential optimization strategy.
- QARM [26]: Focuses on quantitative code learning guided by high-quality downstream-task item-item pairs.
-
Alternating Optimization:
ETEGRec [22]: A Generative Recommender with End-to-End Learnable Item Tokenization. It introduces two auxiliary losses to align the intermediate representations of the tokenizer and the recommender and adopts an alternating optimization strategy for end-to-end training. While iterative, it still uses auxiliary losses rather than the direct recommendation loss for tokenizer guidance.
Bi-Level Optimization in Recommendation
Bi-level optimization has been applied in recommendation for various tasks:
- BOD [44]: Utilizes bi-level optimization to adaptively learn sample weights to mitigate noise in implicit feedback.
- LabelCraft [2]: Applies bi-level optimization to automatically generate reliable labels from raw feedback in short video recommendation, aiming for direct alignment with platform objectives.
- Other applications include adversarial training for unlearning protected user attributes [9] or personalized ranking [14].
The paper claims to be the first to apply bi-level optimization to reshape generative recommendation, specifically bridging item tokenization and autoregressive generation.
3.3. Technological Evolution
The evolution of recommendation systems has moved from simple collaborative filtering and matrix factorization to deep learning-based sequential models using RNNs and Transformers. A major paradigm shift is from discriminative (matching) to generative (direct generation).
Initially, generative recommenders focused on either LLM-based approaches that directly used textual descriptions or collaborative signals in prompts, or sequence-to-sequence models that relied on pre-processed or sequentially optimized item tokenizers (TIGER, LETTER). The challenge with these sequence-to-sequence approaches has been the decoupled nature of tokenization and generation. ETEGRec attempted to bridge this with alternating optimization and auxiliary losses, but still lacked a direct, unified optimization of the tokenizer based on the ultimate recommendation objective.
This paper's work, BLOGER, fits into this timeline by proposing a more principled and direct approach to coupling tokenization and generation. By using bi-level optimization, it moves beyond sequential or alternating updates with auxiliary objectives, directly aligning the tokenizer's learning with the recommender's performance via the recommendation loss. This represents a significant step towards truly end-to-end optimized generative recommendation systems.
3.4. Differentiation Analysis
BLOGER distinguishes itself from previous methods primarily through its novel bi-level optimization framework and integrated learning scheme:
-
Optimization Formulation:
- TIGER & LETTER: Employ sequential optimization. The tokenizer is trained independently to produce static item identifiers, which are then fed to the recommender. This completely decouples the tokenizer's objective from the recommender's performance.
- ETEGRec: Uses alternating optimization. The tokenizer and recommender are iteratively updated. While better than sequential, it still relies on auxiliary losses to align intermediate representations, which are indirect proxies for the final recommendation objective.
- BLOGER: Introduces bi-level optimization. The tokenizer (outer level) is optimized based on the recommendation loss achieved by the recommender (inner level). This provides direct, end-to-end alignment and supervision from the downstream recommendation task.
-
Alignment with Recommendation Objectives:
- TIGER & LETTER: Lack direct guidance from the recommendation objective. The tokenizer optimizes for its intrinsic quality (e.g., reconstruction fidelity, diversity) but not necessarily for how well its tokens enable the recommender to make predictions.
- ETEGRec: Achieves alignment through auxiliary losses applied to intermediate layers, but this is an indirect alignment.
- BLOGER: Directly incorporates the recommendation loss into the outer-level objective for tokenizer optimization. This eliminates the need for handcrafted auxiliary objectives and ensures the tokenization process is inherently geared towards improving recommendation performance.
-
Coupling Mechanism:
- TIGER & LETTER: No direct gradient-level coupling between tokenizer and recommender during the tokenization phase.
- ETEGRec: Employs alternating updates, but the coupling at the gradient level is implicit through auxiliary losses rather than a meta-gradient approach.
- BLOGER: Leverages meta-gradients to establish a direct gradient-level connection. This allows the tokenizer to receive optimization signals that reflect the recommender's performance, enabling more effective bidirectional learning. Furthermore, gradient surgery is applied to mitigate conflicts between the tokenization loss and recommendation loss gradients, ensuring stable and beneficial tokenizer updates.
-
Token Sequence Generation:
-
TIGER & LETTER: Use pre-processed or statically assigned token sequences.
-
ETEGRec & BLOGER: Gradually refine token sequences during training, allowing them to adapt and improve over time.
In essence, BLOGER provides a more principled and tightly coupled optimization framework for generative recommendation by directly integrating the recommendation objective into the tokenizer's learning process via bi-level optimization and meta-gradients, while using gradient surgery to manage gradient conflicts.
4. Methodology
4.1. Principles
The core principle behind BLOGER is to resolve the misalignment between item tokenization and autoregressive generation in generative recommendation by explicitly modeling their interdependence within a unified bi-level optimization framework. The intuition is that an effective tokenizer should produce item identifiers that not only accurately represent items but also facilitate the recommender in predicting future interactions. Conversely, the recommender relies on high-quality tokenized inputs to accurately model user preferences.
BLOGER reformulates the problem such that the recommender's training becomes an inner-level optimization task, which is dependent on the tokenizer's parameters. The tokenizer's optimization then becomes an outer-level task, where it learns to produce better tokens by observing how well the recommender performs with those tokens, in addition to its intrinsic tokenization quality. To efficiently solve this nested optimization, BLOGER adopts a meta-learning strategy, approximating the inner-level solution, and employs gradient surgery to manage gradient conflicts between the tokenization and recommendation objectives, ensuring stable and beneficial updates to the tokenizer.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1 Problem Reformulation
The paper models the tokenizer as $\mathcal{T}_{\phi}$ and the recommender as $\mathcal{R}_{\theta}$, where $\phi$ and $\theta$ are their respective model parameters. The bi-level optimization problem is formulated as follows:
$
\begin{array}{rl}
\min_{\phi} & \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta^*}) + \lambda \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}) \\
\mathrm{s.t.} & \theta^* = \arg\min_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}).
\end{array}
$ (2)
Here, we break down the components of this formulation:
- $\mathcal{L}_{\mathrm{rec}}$: This is the recommendation loss. It measures the performance of the recommender on the training data, given the tokenized outputs produced by the tokenizer.
- $\mathcal{L}_{\mathrm{token}}$: This is the tokenization loss. It evaluates the intrinsic quality of the tokenizer itself, independent of the recommender (e.g., how well it can reconstruct item embeddings).
- $\lambda$: This is a hyperparameter that controls the relative importance of the tokenization loss in the overall outer-level objective. A larger $\lambda$ means more emphasis on intrinsic tokenizer quality, while a smaller $\lambda$ prioritizes recommendation performance.
- $\theta^*$: This represents the optimal parameters for the recommender. Critically, $\theta^*$ is obtained by minimizing the recommendation loss for a fixed tokenizer $\mathcal{T}_{\phi}$, meaning $\theta^*$ is implicitly a function of $\phi$.
- $\mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta^*})$: This term in the outer-level objective represents the recommendation loss of the recommender after it has been optimally trained using the tokens from tokenizer $\mathcal{T}_{\phi}$. This is the key mechanism through which the tokenizer receives direct feedback about its impact on recommendation performance.
Equation (2b) defines the inner-level objective: for a given tokenizer, the recommender is trained to minimize its recommendation loss. Equation (2a) defines the outer-level objective: the tokenizer is optimized to ensure that the optimally trained recommender (which depends on $\phi$) achieves minimal recommendation loss, while simultaneously minimizing its own tokenization loss. This setup ensures mutual reinforcement and explicit alignment.
4.2.2 BLOGER Framework
An overview of the BLOGER framework is illustrated in Figure 2.

This figure (Figure 2 of the paper) is a schematic overview of the BLOGER framework. The left side shows the model architecture, comprising the Tokenizer, Embedding Table, and Recommender, which construct a probability-based mixed representation; the right side shows the meta-learning strategy and gradient surgery, where the latter mitigates gradient conflicts by making $G_{\mathrm{rec}}$ orthogonal to $G_{\mathrm{token}}$.
Figure 2: An overview of the proposed BLOGER framework, which comprises an encoder-decoder model architecture and a tailored learning scheme. Specifically, we construct a mixed representation based on the assignment probabilities over the codebook to ensure the differentiability of the recommendation loss to the tokenizer. Besides, we adopt a meta-learning strategy to enable efficient optimization, and apply gradient surgery to extract beneficial components from the gradients of the recommendation loss.
4.2.2.1 Tokenizer
The tokenizer is implemented using a Residual Quantization Variational Autoencoder (RQVAE) model, building on prior work. This choice is motivated by the benefits of hierarchical encoding, which induces a tree-structured item space suitable for generative modeling and enables sharing of information among related items through common prefix tokens.
Given an item $i$, its semantic embedding $z$ (e.g., a pretrained text embedding or a collaborative embedding) is input to the tokenizer $\mathcal{T}_{\phi}$ to produce a sequence of discrete tokens:
$
[ c _ { 1 } , . . . , c _ { L } ] = \mathcal { T } _ { \phi } ( z ) ,
$ (3)
Where $c_l$ is the token assigned to item $i$ at the $l$-th level of the hierarchy.
Quantization:
The process begins by encoding the input semantic embedding $z$ into a latent representation $r$ using an MLP-based encoder:
$
r = \operatorname { E n c o d e r } _ { \mathcal { T } } ( z ) .
$ (4)
The latent representation $r$ is then progressively quantized into discrete tokens. At each level $l$, there is a codebook $\{e_l^k\}_{k=1}^{K}$, which contains $K$ distinct code embeddings. The quantization proceeds as follows:
$
\begin{array}{c}
P(k \mid v_l) = \dfrac{\exp\left(-\| v_l - e_l^k \|^2\right)}{\sum_{j=1}^{K} \exp\left(-\| v_l - e_l^j \|^2\right)}, \\
c_l = \arg\max_{k} P(k \mid v_l), \\
v_l = v_{l-1} - e_{l-1}^{c_{l-1}}.
\end{array}
$ (5)
Here:
- $P(k \mid v_l)$: This is the probability of assigning the residual vector $v_l$ to the $k$-th token in the current codebook. It is computed using a softmax-like function based on the Euclidean distance between $v_l$ and each codebook entry $e_l^k$. A smaller distance implies higher probability.
- $v_l$: This denotes the residual vector at the $l$-th level. For the first level, $v_1 = r$ (the initial latent representation). For subsequent levels, $v_l$ is the residual error from the previous quantization step, specifically $v_l = v_{l-1} - e_{l-1}^{c_{l-1}}$.
- $e_l^k$: The $k$-th code embedding (vector) in the codebook at level $l$.
- $c_l$: This is the selected token at level $l$, chosen as the codebook entry that maximizes the probability $P(k \mid v_l)$. This is a hard assignment.
Reconstruction:
After obtaining the tokens $[c_1, \ldots, c_L]$ for an item, a quantized representation $\tilde{r}$ is formed by summing the selected codebook entries:
$
\tilde { r } = \sum _ { l = 1 } ^ { L } e _ { l } ^ { c _ { l } }
$
This is then fed into an MLP-based decoder to reconstruct the original semantic embedding $z$:
$
\tilde { z } = \mathrm { D e c o d e r } _ { \mathcal { T } } ( \tilde { r } ) .
$ (6)
Tokenization Loss:
The tokenization loss for an item is defined as:
$
\mathcal { L } _ { \mathrm { t o k e n } } ( \mathcal { T } _ { \phi } ) = \sum _ { i } ( \Vert \tilde { \boldsymbol { z } } - \boldsymbol { z } \Vert ^ { 2 } + \sum _ { l = 1 } ^ { L } \Vert \mathrm { s g } [ \boldsymbol { v } _ { l } ] - \boldsymbol { e } _ { l } ^ { c _ { l } } \Vert ^ { 2 } + \beta \Vert \boldsymbol { v } _ { l } - \mathrm { s g } [ \boldsymbol { e } _ { l } ^ { c _ { l } } ] \Vert ^ { 2 } ) ,
$ (7)
This loss is summed over all items in the sequence. It comprises three main parts:
- $\| \tilde{z} - z \|^2$: This is a reconstruction loss (mean squared error) that encourages the decoder to accurately reconstruct the original semantic embedding $z$ from its quantized representation $\tilde{r}$.
- $\| \mathrm{sg}[v_l] - e_l^{c_l} \|^2$: This is the codebook loss. It minimizes the distance between the (stop-gradient'd) residual vector $v_l$ and the chosen codebook entry $e_l^{c_l}$. sg (stop-gradient) means that gradients do not flow through $v_l$ in this term; it primarily updates the codebook entries to be closer to the residual vectors that map to them.
- $\beta \| v_l - \mathrm{sg}[e_l^{c_l}] \|^2$: This is the commitment loss. It minimizes the distance between the residual vector $v_l$ and the (stop-gradient'd) chosen codebook entry $e_l^{c_l}$. sg here means gradients do not flow through $e_l^{c_l}$; this term primarily updates the encoder parameters to ensure that the residual vectors are committed to (close to) the codebook entries they select.
- $\beta$: A weighting coefficient, typically set to 0.25, to balance the updates between the encoder and the codebooks.
- $\mathrm{sg}[\cdot]$: The stop-gradient operation. It prevents gradients from flowing through its argument during backpropagation. This is crucial for training VAEs with discrete bottlenecks like RQ-VAE to ensure stable updates.
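For concreteness, here is a small PyTorch sketch of the three loss terms for a single quantization level, using detach() as the stop-gradient; the shapes and random values are assumptions for illustration, not the paper's code.

```python
import torch

z = torch.randn(32)                            # original semantic embedding
z_rec = torch.randn(32, requires_grad=True)    # decoder output (stand-in for the reconstruction)
v = torch.randn(32, requires_grad=True)        # residual vector at some level l
e = torch.randn(32, requires_grad=True)        # chosen codebook entry e_l^{c_l}
beta = 0.25

recon = ((z_rec - z) ** 2).sum()               # reconstruction term
codebook = ((v.detach() - e) ** 2).sum()       # codebook term: gradients reach e only
commit = beta * ((v - e.detach()) ** 2).sum()  # commitment term: gradients reach the encoder side
token_loss = recon + codebook + commit
token_loss.backward()
```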
4.2.2.2 Recommender
The generative recommender is based on a Transformer-based encoder-decoder architecture, specifically T5.
Mixed Representation:
A key challenge is that the tokenizer outputs discrete token sequences, which are non-differentiable. To allow gradients from the recommendation loss to flow back to the tokenizer for optimization, BLOGER uses differentiable soft representations.
Let $E$ be the vocabulary embeddings of the recommender, where each row corresponds to the embedding of a token in the vocabulary. For a given token $c_l^t$ (the $l$-th token of item $t$):
- Hard Embedding: A hard embedding $h_l^t$ is obtained by a direct lookup in $E$ using the discrete token $c_l^t$.
- Soft Embedding: The probability distribution $P(\cdot \mid v_l)$ (from Equation 5) is padded with zeros to match the size of the full vocabulary. The soft embedding $s_l^t$ is then computed as a weighted average over $E$, using these padded probabilities as weights.
Finally, the
mixed representation $o_l^t$ for the token is computed as: $ o_l^t = s_l^t + \operatorname{sg}[h_l^t - s_l^t] , $ (8) Here:
- $s_l^t$: The soft embedding, which is differentiable as it is a weighted sum.
- $h_l^t$: The hard embedding, which is numerically precise.
- $\operatorname{sg}[\cdot]$: Stop-gradient operation. This ensures that gradients flow only through the soft embedding (which connects to the tokenizer's output probabilities), while the hard embedding provides the numerical value but does not contribute to gradient flow for tokenizer updates. This effectively combines the benefits of discrete precision with differentiable soft representations.
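A minimal PyTorch sketch of the straight-through mixed representation in Equation (8); the vocabulary size, the slice of token ids assumed to belong to this level, and the variable names are illustrative assumptions only.

```python
import torch

E = torch.nn.Embedding(1024, 128)                 # recommender vocabulary embeddings
logits = torch.randn(256, requires_grad=True)     # stand-in for -||v_l - e_l^k||^2
level_probs = torch.softmax(logits, dim=0)        # P(k | v_l) over this level's codebook
# Pad to the full vocabulary, assuming this level owns token ids 100..355 (illustrative).
probs = torch.cat([torch.zeros(100), level_probs, torch.zeros(1024 - 356)])
c = 100 + int(level_probs.argmax())               # discrete token id (hard assignment)

soft = probs @ E.weight                           # s_l^t: differentiable weighted average
hard = E(torch.tensor(c))                         # h_l^t: exact lookup
mixed = soft + (hard - soft).detach()             # Eq. (8): value of hard, gradient of soft
mixed.sum().backward()                            # gradients reach `logits` via the soft path
print(logits.grad is not None)                    # True
```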
Seq2Seq Formulation:
During training, the user's item-level interaction sequence $X$ is tokenized and converted into the input sequence embedding $E^X$ using the mixed representation for each token. This is fed into the recommender's encoder:
$
H ^ { E } = \operatorname { E n c o d e r } _ { \mathcal { R } } ( E ^ { X } ) .
$ (9)
Where $H^E$ is the hidden representation from the encoder.
For decoding the target item (tokenized as $Y$), a special beginning-of-sequence ([BOS]) token is prepended, and its mixed representation $E^Y$ is created. The decoder then takes $H^E$ and $E^Y$ as input to model user preference:
$
H ^ { D } = \operatorname { D e c o d e r } _ { \mathcal { R } } ( H ^ { E } , E ^ { Y } ) .
$ (10)
Where $H^D$ is the hidden representation from the decoder.
Recommendation Loss:
The decoder's output $H^D$ is projected onto the vocabulary space (e.g., by an inner product with the vocabulary embedding matrix $E$) to predict the target item tokens. The recommendation loss is defined as the negative log-likelihood of the target tokens, following the sequence-to-sequence learning paradigm:
$
\mathcal { L } _ { \mathrm { r e c } } ( \mathcal { T } _ { \phi } , \mathcal { R } _ { \theta } ) = - \sum _ { l = 1 } ^ { L } \log P ( Y _ { l } | X , Y _ { < l } ) ,
$ (11)
Here:
- $Y_l$: The $l$-th token in the target sequence for the next item.
- $X$: The tokenized input sequence (historical interactions).
- $Y_{<l}$: All previously generated tokens in the target sequence (from $1$ to $l-1$).
This loss function trains the recommender to generate each token of the target item autoregressively.
4.2.2.3 Tailored Learning Scheme
To solve the bi-level optimization problem (Equation 2), BLOGER employs a meta-learning strategy, inspired by MAML, which iteratively updates $\theta$ and $\phi$.
Meta-Learning Strategy:
The meta-learning approach avoids the computationally expensive process of fully training the recommender to convergence for each tokenizer update.
- Update of $\theta$ (Inner Level): For the inner-level objective, the recommender parameters $\theta$ are updated by directly minimizing the recommendation loss for a fixed tokenizer. This is a standard gradient descent step:
$ \theta \leftarrow \theta - \eta_{\mathcal{R}} \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}) , $ (14)
Where $\eta_{\mathcal{R}}$ is the learning rate for the recommender.
- Update of $\phi$ (Outer Level): Updating the tokenizer parameters $\phi$ is more complex, as it depends on the optimal recommender parameters $\theta^*$. Instead of computing $\theta^*$, BLOGER performs a tentative update of $\theta$. This process involves two steps:
  - Meta Training: The tokenizer is kept unchanged. A single step of gradient descent is performed on the recommender parameters to obtain a tentative recommender:
$ \theta' = \theta - \eta_{\mathcal{R}} \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}) . $ (15)
Here, $\theta'$ represents the recommender parameters after one tentative update.
  - Meta Testing: With the tentative recommender $\mathcal{R}_{\theta'}$, the recommendation loss is computed. This loss, combined with the tokenization loss, forms the outer-level objective for updating $\phi$:
$ \phi \leftarrow \phi - \eta_{\mathcal{T}} \left( \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}) + \lambda \nabla_{\phi} \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}) \right) , $ (16)
Where $\eta_{\mathcal{T}}$ is the learning rate for the tokenizer. The meta-gradient for the recommendation loss term, $\nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'})$, is computed by backpropagating through the chain:
$ \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}) \propto \frac{\partial \mathcal{L}_{\mathrm{rec}}}{\partial \theta'} \cdot \frac{\partial \theta'}{\partial \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}} \cdot \frac{\partial \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}}{\partial \phi} . $
This meta-gradient effectively tells the tokenizer how to update its parameters so that the recommender, after its tentative update, achieves a lower recommendation loss.
Gradient Surgery:
During outer-level optimization (Equation 16), the gradients from the recommendation loss and the tokenization loss might conflict. To mitigate this, gradient surgery is applied. When the cosine similarity between the gradient of the recommendation loss ($G_{\mathrm{rec}}$) and the tokenization loss ($G_{\mathrm{token}}$) is negative, the conflicting component of $G_{\mathrm{rec}}$ is removed by projecting it onto the normal plane of $G_{\mathrm{token}}$.
The adjusted gradient for recommendation loss is:
$
\begin{array}{rl}
& G_{\mathrm{rec}}^{\mathrm{proj}} = G_{\mathrm{rec}} - \dfrac{G_{\mathrm{rec}} \cdot G_{\mathrm{token}}}{\| G_{\mathrm{token}} \|^2} G_{\mathrm{token}}, \\
& G_{\mathrm{rec}} = \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}), \\
& G_{\mathrm{token}} = \nabla_{\phi} \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}).
\end{array}
$ (17)
Here:
- $G_{\mathrm{rec}}$: The meta-gradient of the recommendation loss with respect to the tokenizer parameters $\phi$.
- $G_{\mathrm{token}}$: The gradient of the tokenization loss with respect to the tokenizer parameters $\phi$.
- $G_{\mathrm{rec}}^{\mathrm{proj}}$: The projected gradient of the recommendation loss. This new gradient is orthogonal to $G_{\mathrm{token}}$ (i.e., $G_{\mathrm{rec}}^{\mathrm{proj}} \cdot G_{\mathrm{token}} = 0$), ensuring it does not oppose the tokenization objective while still contributing towards improving recommendation performance. The outer-level update (Equation 16) then uses $G_{\mathrm{rec}}^{\mathrm{proj}}$ instead of $G_{\mathrm{rec}}$ when conflict is detected.
The overall training procedure is summarized in Algorithm 1.
Algorithm 1 Training of BLOGER
1: while not converged do
2:     Update $\theta$ according to Equation (14);
3:     if Step % M == 0 then
4:         Compute $\theta'$ with Equation (15);
5:         Compute $G_{\mathrm{rec}}$ and $G_{\mathrm{token}}$;
6:         if $G_{\mathrm{rec}} \cdot G_{\mathrm{token}} < 0$ then
7:             $G_{\mathrm{rec}} \leftarrow G_{\mathrm{rec}}^{\mathrm{proj}}$ with Equation (17);
8:         end if
9:         Update $\phi$ according to Equation (16) (using the potentially projected $G_{\mathrm{rec}}$);
10:     end if
11: end while
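As a concrete illustration of this loop, here is a minimal, self-contained PyTorch sketch with toy linear stand-ins for the tokenizer and the recommender; the losses, shapes, learning rates, and update schedule are assumptions for illustration only, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
phi = torch.randn(16, 8, requires_grad=True)    # toy "tokenizer" parameters
theta = torch.randn(8, 1, requires_grad=True)   # toy "recommender" parameters
x, y = torch.randn(32, 16), torch.randn(32, 1)
eta_R, eta_T, lam, M = 0.05, 0.01, 0.1, 4

rec_loss = lambda p, t: ((x @ p @ t - y) ** 2).mean()   # stand-in for L_rec
token_loss = lambda p: (p ** 2).mean()                  # stand-in for L_token

for step in range(1, 101):
    # Inner level (Eq. 14): plain gradient step on theta.
    g_theta, = torch.autograd.grad(rec_loss(phi, theta), theta)
    theta = (theta - eta_R * g_theta).detach().requires_grad_(True)

    if step % M == 0:
        # Meta training (Eq. 15): tentative recommender, keeping the graph for meta-gradients.
        g_theta, = torch.autograd.grad(rec_loss(phi, theta), theta, create_graph=True)
        theta_prime = theta - eta_R * g_theta
        # Meta testing (Eq. 16): meta-gradient of L_rec and gradient of L_token w.r.t. phi.
        G_rec, = torch.autograd.grad(rec_loss(phi, theta_prime), phi)
        G_token, = torch.autograd.grad(token_loss(phi), phi)
        # Gradient surgery (Eq. 17): project out the conflicting component.
        if (G_rec * G_token).sum() < 0:
            G_rec = G_rec - (G_rec * G_token).sum() / (G_token * G_token).sum() * G_token
        phi = (phi - eta_T * (G_rec + lam * G_token)).detach().requires_grad_(True)
```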
Algorithm 1 outlines the iterative training loop:
- Recommender Update (Inner Loop): In each step, the recommender parameters $\theta$ are updated using a gradient descent step on the recommendation loss (Line 2).
- Tokenizer Update (Outer Loop - Conditional): The tokenizer parameters $\phi$ are updated less frequently, every $M$ steps (Line 3). This allows the recommender to stabilize before providing meaningful meta-gradients.
  - First, a tentative recommender $\mathcal{R}_{\theta'}$ is computed (Line 4).
  - Then, the meta-gradient $G_{\mathrm{rec}}$ and the tokenization loss gradient $G_{\mathrm{token}}$ are calculated (Line 5).
  - If these gradients conflict (their dot product is negative), gradient surgery is applied to $G_{\mathrm{rec}}$ (Lines 6-8).
  - Finally, $\phi$ is updated using the combined (and potentially adjusted) gradients (Line 9).
4.2.2.4 Complexity Analysis
Let be the model dimension, the size of each codebook, the number of codebooks, and the sequence length.
-
Item Tokenization: For a single item, the encoder/decoder layers have complexity. Obtaining the
mixed representationinvolvescodebook lookupand soft representation, incurring .Tokenization losscomputation adds . Total for one item: . -
Generative Recommendation: The
Transformer'sself-attentionandfeed-forward layerscost . Computing therecommendation lossincurs .The total training cost of
BLOGERis , which matches the order of magnitude ofTIGER. Themeta-learningandgradient surgeryintroduce minimal additional overhead as they primarily involve a few extra forward/backward passes, not altering the asymptotic complexity. Inference complexity remains identical to standardTransformer-based generative models. This indicates practical efficiency.
4.2.3 Discussion
The paper explicitly compares BLOGER to related generative recommendation methods:
| Method | Optimization | Alignment | Token Sequence |
|---|---|---|---|
| TIGER [34] | Sequential | × | Pre-processed |
| LETTER [40] | Sequential | × | Pre-processed |
| ETEGRec [22] | Alternating | Auxiliary Loss | Gradually Refined |
| BLOGER | Bi-Level | Rec Loss | Gradually Refined |
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three subsets of the Amazon Review Data [12, 29], commonly used in sequential recommendation research:
-
Beauty: Products related to beauty.
-
Instruments: Musical instruments.
-
Arts: Arts, Crafts & Sewing products.
The datasets were processed following established practices:
-
5-core filtering: Users and items with fewer than five interactions were removed. This helps ensure data quality and relevance.
-
Sequence Length: Each user's interaction history was truncated or padded to a fixed length of 20.
-
Data Split: A
leave-one-out strategywas used:-
Test set: The latest interaction for each user.
-
Validation set: The second most recent interaction for each user.
-
Training set: All other historical interactions.
The following are the results from Table 2 of the original paper:
| Dataset | #User | #Item | #Interaction | Sparsity | AvgLen |
|---|---|---|---|---|---|
| Beauty | 22363 | 12101 | 198502 | 99.93% | 8.87 |
| Instruments | 24772 | 9922 | 206153 | 99.92% | 8.32 |
| Arts | 45141 | 20956 | 390832 | 99.96% | 8.66 |
-
Table 2: Statistical details of the evaluation datasets, where "AvgLen" represents the average length of item sequences.
These datasets are widely used benchmarks in recommendation systems, making them suitable for validating the performance of new methods. Their characteristics (e.g., varying number of users/items, high sparsity) reflect real-world e-commerce interaction data.
5.2. Evaluation Metrics
The performance of sequential recommendation methods is evaluated using two widely adopted metrics: Recall@K and NDCG@K, with $K \in \{5, 10\}$. The evaluation uses full ranking over the entire item set to avoid biases from item sampling. Constrained decoding with a Trie is applied during autoregressive generation to ensure only valid item identifiers are generated, with a beam size of 20.
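As an illustration of how such constrained decoding can restrict generation to valid identifiers, here is a tiny prefix-trie sketch; the 3-token IDs and function names are hypothetical and not the paper's implementation.

```python
from collections import defaultdict

def build_trie(identifiers):
    """Map each observed prefix to the set of tokens that may legally follow it."""
    trie = defaultdict(set)
    for ident in identifiers:
        for i, tok in enumerate(ident):
            trie[tuple(ident[:i])].add(tok)
    return trie

valid_items = [(3, 7, 1), (3, 7, 2), (5, 0, 9)]   # hypothetical 3-token semantic IDs
trie = build_trie(valid_items)

prefix = (3, 7)
print(sorted(trie[prefix]))   # [1, 2]: only tokens completing a real item are allowed
```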
Recall@K
- Conceptual Definition:
Recall@Kmeasures the proportion of truly relevant items that are successfully retrieved within the top recommendations. It focuses on the completeness of the recommendations, indicating how many of the items a user would actually interact with are present in the recommended list. - Mathematical Formula: $ \text{Recall}@K = \frac{\text{Number of relevant items in top-K recommendations}}{\text{Total number of relevant items}} $
- Symbol Explanation:
Number of relevant items in top-K recommendations: The count of actual next items (ground truth) that appear in the model's top predicted items.Total number of relevant items: In theleave-one-outsetting fornext-item recommendation, this is typically 1 (the single next item the user interacted with). If there are multiple relevant items, it would be the total count.
NDCG@K (Normalized Discounted Cumulative Gain at K)
- Conceptual Definition:
NDCG@Kis a measure of ranking quality that takes into account the position of relevant items. It assigns higher scores to relevant items that appear at higher ranks (closer to the top of the list) and penalizes relevant items that appear lower. It also handles graded relevance, though innext-item recommendation, relevance is often binary (1 for the true next item, 0 otherwise). - Mathematical Formula:
$
\text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}
$
Where:
$
\text{DCG}@K = \sum_{i=1}^K \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}
$
And
IDCG@Kis theIdeal DCG@K, which is the maximum possibleDCG@Kachievable by ranking all relevant items at the top. - Symbol Explanation:
- :
Discounted Cumulative Gainat rank . It sums the relevance scores of items in the recommended list, with a logarithmic discount applied to items at lower ranks. - :
Ideal Discounted Cumulative Gainat rank . This is theDCG@Kcalculated for the perfect ranking, where all relevant items are placed at the top of the list. It serves as a normalization factor. - : The relevance score of the item at position in the recommended list. In binary relevance settings (like
next-item prediction), is 1 if the item at rank is the true next item, and 0 otherwise. - : The number of top items considered in the recommendation list.
- : The logarithmic discount factor, which reduces the contribution of relevant items as their rank increases.
- :
5.3. Baselines
The BLOGER framework is compared against two categories of baseline models:
(1) Traditional Recommender Models:
- MF [19]
- LightGCN [13]
- Caser [38]
- GRU4Rec [15]
- HGN [27]
- SASRec [18]
These models represent various popular approaches, including matrix factorization, graph neural networks, and sequential deep learning models. The implementations for these methods follow those provided by [53].
(2) Generative Recommender Models:
-
P5[10]: AnLLM-basedgenerative recommenderthat integratescollaborative knowledge. -
TIGER[34]: Agenerative retrievalmethod that usestext embeddingsto constructsemantic IDsfor items and aTransformerforsequential recommendation. -
LETTER[40]: An enhancement ofTIGERthat introduces a learnabletokenizerwithhierarchical semantics,collaborative signals, anddiversity regularization. -
ETEGRec[22]: Agenerative recommenderwithend-to-end learnable item tokenizationthat usesauxiliary lossesandalternating optimization.For
generative methods, a unifiedrecommenderconfiguration is adopted for fair comparison.ETEGRecuses pretrainedSASRec embeddingsas semantic embeddings, while other generative methods (includingBLOGER) utilizeLLaMA2-7B[39] to generatetext embeddingsfrom item titles and descriptions.
5.4. Implementation Details
- Recommender Configuration:
Transformer-basedencoder-decoder(likeT5).- Layers: 4 layers for both
encoderanddecoder. - Model Dimension: 128.
- Dropout Rate: 0.1.
- Attention Heads: 6 heads per layer, each with a dimension of 64.
- MLP Hidden Dimension: 1024.
- Activation:
ReLU.
- Item Tokenization (RQ-VAE):
- Codebook Levels (L): 4.
- Codebook Initialization:
k-meansclustering. - Codebook Size (K): 256 code embeddings per
codebook. - Code Embedding Dimension: 32.
- Training Details for BLOGER:
- Tokenizer Pretraining:
- Epochs: 20,000.
- Batch Size: 1,024.
- Optimizer:
AdamW[25]. - Learning Rate: 1e-3.
- Weight Decay: 1e-4.
- Bi-Level Optimization (Main Training):
- Batch Size: 256.
- Optimizer:
AdamW. - Early Stopping: Patience of 20 epochs, based on validation
recommendation loss. - Learning Rates:
Generative recommender(): 5e-4.Item tokenizer(): 1e-4.
- Hyperparameter : Tuned over the set . This parameter controls the weight of the
tokenization lossin theouter-levelobjective. - Hyperparameter (Algorithm 1): Tuned such that the
tokenizerupdate frequency occurs times per epoch. This controls how often theouter-leveloptimization happens.
- Tokenizer Pretraining:
- Baseline Hyperparameters: Searched within the ranges defined in their respective papers.
6. Results & Analysis
6.1. Core Results Analysis
The overall performance of BLOGER compared to baselines is presented in Table 3.
| Dataset | Metric | MF | LightGCN | Caser | GRU4Rec | HGN | SASRec | P5 | TIGER | LETTER | ETEGRec | BLOGER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | R@5 | 0.0226 | 0.0208 | 0.0110 | 0.0229 | 0.0381 | 0.0372 | 0.0400 | 0.0384 | 0.0424 | 0.0413 | 0.0444 |
| Beauty | N@5 | 0.0142 | 0.0127 | 0.0072 | 0.0155 | 0.0241 | 0.0237 | 0.0274 | 0.0256 | 0.0264 | 0.0271 | 0.0293 |
| Beauty | R@10 | 0.0389 | 0.0351 | 0.0186 | 0.0327 | 0.0558 | 0.0602 | 0.0590 | 0.0609 | 0.0612 | 0.0632 | 0.0654 |
| Beauty | N@10 | 0.0195 | 0.0174 | 0.0096 | 0.0186 | 0.0297 | 0.0311 | 0.0335 | 0.0329 | 0.0334 | 0.0342 | 0.0361 |
| Instruments | R@5 | 0.0456 | 0.0722 | 0.0425 | 0.0559 | 0.0797 | 0.0709 | 0.0809 | 0.0865 | 0.0872 | 0.0878 | 0.0882 |
| Instruments | N@5 | 0.0376 | 0.0608 | 0.0348 | 0.0459 | 0.0676 | 0.0565 | 0.0695 | 0.0736 | 0.0747 | 0.0745 | 0.0753 |
| Instruments | R@10 | 0.0511 | 0.0887 | 0.0528 | 0.0702 | 0.0967 | 0.0922 | 0.0987 | 0.1062 | 0.1082 | 0.1079 | 0.1100 |
| Instruments | N@10 | 0.0394 | 0.0661 | 0.0381 | 0.0505 | 0.0731 | 0.0633 | 0.0751 | 0.0799 | 0.0814 | 0.0810 | 0.0822 |
| Arts | R@5 | 0.0473 | 0.0464 | 0.0571 | 0.0669 | 0.0667 | 0.0758 | 0.0724 | 0.0807 | 0.0841 | 0.0860 | 0.0879 |
| Arts | N@5 | 0.0271 | 0.0244 | 0.0407 | 0.0518 | 0.0516 | 0.0632 | 0.0607 | 0.0640 | 0.0675 | 0.0687 | 0.0703 |
| Arts | R@10 | 0.0753 | 0.0755 | 0.0781 | 0.0834 | 0.0910 | 0.0945 | 0.0902 | 0.1017 | 0.1081 | 0.1084 | 0.1108 |
| Arts | N@10 | 0.0361 | 0.0338 | 0.0474 | 0.0523 | 0.0595 | 0.0693 | 0.0664 | 0.0707 | 0.0752 | 0.0759 | 0.0777 |
Table 3: The overall performance comparisons between the baselines and BLOGER. Recall@K and NDCG@K are abbreviated as R@K and N@K, respectively. The best and second-best results are highlighted in bold and underlined font, respectively.
Key observations from the table:
- BLOGER's Superiority: BLOGER consistently achieves the best performance across all datasets (Beauty, Instruments, Arts) and all metrics (Recall@5, NDCG@5, Recall@10, NDCG@10). This strongly validates the effectiveness of its bi-level optimization formulation and tailored learning strategy. The fact that BLOGER achieves this without auxiliary alignment losses, using only the recommendation loss for direct tokenizer supervision, supports the intuition that this direct alignment is highly effective.
- Generative vs. Traditional Methods: Generative recommendation methods (P5, TIGER, LETTER, ETEGRec, BLOGER) generally outperform traditional recommendation methods (MF, LightGCN, Caser, GRU4Rec, HGN, SASRec). This is attributed to generative models' ability to capture hierarchical semantic structures and differentiate items with fine-grained representations, especially when incorporating RQ-VAE for item tokenization. Traditional methods, relying on discrete ID representations, are limited in this aspect.
- Improvements within Generative Methods: LETTER and ETEGRec show performance improvements over TIGER. LETTER benefits from integrating collaborative embeddings and diversity regularization into its tokenizer. ETEGRec improves through alternating optimization and auxiliary losses that align tokenizer and recommender intermediate representations. BLOGER further surpasses these methods, indicating that its bi-level optimization and direct recommendation loss guidance for tokenization are more effective than alternating optimization with auxiliary losses or sequential optimization.
6.2. Ablation Study (RQ2)
To understand the contribution of each component, an ablation study was conducted on the Beauty dataset. Various BLOGER variants were tested:
- w/ LETTER: Integrates the LETTER tokenizer (with its collaborative regularization) into BLOGER.
- w/o GS: BLOGER without the gradient surgery operation (Equation 17).
- Joint: Combines recommendation loss and tokenization loss into a single joint optimization for both models, removing the bi-level optimization structure.
- Joint w/ GS: Applies gradient surgery on top of the Joint optimization.
- Fixed: The tokenizer is fixed (pretrained and then frozen), and only the recommender is optimized, similar to TIGER.
The following are the results from Table 4 of the original paper:

| Method | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
|---|---|---|---|---|
| BLOGER | 0.0444 | 0.0293 | 0.0654 | 0.0361 |
| w/ LETTER | 0.0489 | 0.0322 | 0.0732 | 0.0400 |
| w/o GS | 0.0329 | 0.0209 | 0.0530 | 0.0274 |
| Joint | 0.0418 | 0.0272 | 0.0637 | 0.0342 |
| Joint w/ GS | 0.0412 | 0.0273 | 0.0631 | 0.0343 |
| Fixed | 0.0384 | 0.0256 | 0.0609 | 0.0329 |
Table 4: Results of the ablation study for BLOGER on Beauty.
Observations:
- w/ LETTER Performance: The w/ LETTER variant achieves even better performance than BLOGER alone. This suggests that the benefits of BLOGER (bi-level optimization, direct alignment) are complementary to more advanced tokenizer designs (like LETTER's collaborative regularization). This confirms BLOGER's generalizability across diverse tokenizer backbones.
- Importance of Gradient Surgery (w/o GS): Removing gradient surgery (w/o GS) leads to a significant performance degradation across all metrics. This indicates that without mitigating gradient conflicts, the direct supervision from the recommendation objective to the tokenizer becomes ineffective, and naive joint optimization of conflicting gradients can harm performance.
- Effectiveness of Bi-Level Optimization vs. Joint:
  - Joint optimization (without bi-level) outperforms the Fixed tokenizer setting, as it allows for tokenizer refinement.
  - However, Joint optimization still lags behind BLOGER. This underscores the effectiveness of bi-level optimization and meta-gradients in properly coordinating the learning of the two interdependent components, which is not achieved by simple joint training.
  - Adding gradient surgery to Joint (Joint w/ GS) does not bring significant gains, possibly because meta-gradient signals are absent, hindering the coordination needed for gradient surgery to be fully effective in this context.
6.3. Hyper-Parameter Analysis (RQ3)
The impact of two key hyper-parameters, $\lambda$ (balancing the recommendation loss and tokenization loss) and $M$ (tokenizer update frequency per epoch), is analyzed.
The results for varying $\lambda$ (from 1e-4 to 5) and $M$ (from 1 to 10 updates per epoch) are summarized in Figure 3.

This figure consists of two line charts showing BLOGER's performance under different hyper-parameter values. In the left chart, the x-axis is the loss weight $\lambda$ and the y-axis shows Recall@5 and NDCG@5; in the right chart, the x-axis is the update frequency of the tokenizer parameters and the y-axis again shows Recall@5 and NDCG@5. The red line denotes Recall@5 and the blue line denotes NDCG@5, reflecting the impact of the hyper-parameters on recommendation performance.
Figure 3: Results of the performance of BLOGER across different values of hyper-parameters. $\lambda$ controls the balance between the recommendation loss and tokenization loss in the outer level, and $M$ represents the update frequency of the tokenizer parameters $\phi$.
Observations from Figure 3:
- Impact of the loss weight:
  - When the loss weight is too small (e.g., 1e-4, 1e-3), the tokenization loss has insufficient weight. The recommendation loss can then dominate and pull token representations too far from their intrinsic quality, disrupting the tokenizer's ability to maintain a good identifier structure.
  - Conversely, when the loss weight is too large (e.g., 5), the tokenization loss becomes too dominant, and the recommendation loss loses its ability to effectively supervise and align the tokenizer with the recommendation objective.
  - The empirical evidence suggests that an intermediate value strikes the best balance, preserving identifier structure while allowing effective recommendation-driven supervision.
- Impact of the tokenizer update frequency:
  - BLOGER shows stable performance when the number of tokenizer updates per epoch is between 1 and 4. Updating the tokenizer less frequently gives the recommender more steps between updates, allowing it to optimize sufficiently and provide stable, reliable meta-gradients to guide the tokenizer.
  - When the update frequency is too high (e.g., 8-10 updates per epoch), the recommender does not get enough iterations to optimize between tokenizer updates. This leads to unstable meta-gradients, which in turn undermine tokenizer training and degrade overall performance.
6.4. In-Depth Analysis (RQ4 & RQ5)
6.4.1 Time Efficiency
A comparison of the average training and testing time per epoch for BLOGER and TIGER (a comparable generative method) was conducted on an NVIDIA RTX 3090 GPU.
| Method | Train (s/epoch) | Test (s/epoch) |
| --- | --- | --- |
| TIGER | 23.7 | 66.7 |
| BLOGER | 26.6 | 66.9 |
Table 5: Results of the time efficiency comparison between BLOGER and TIGER, where the average training and testing time per epoch are reported.
Observations:
- BLOGER incurs only a slight increase in training time (from 23.7 s/epoch for TIGER to 26.6 s/epoch for BLOGER).
- The testing time remains nearly identical (66.7 s/epoch for TIGER vs. 66.9 s/epoch for BLOGER), as both methods use greedy decoding at test time.

These findings confirm the complexity analysis (Section 3.2.4), demonstrating that BLOGER's bi-level optimization strategy introduces minimal additional computational overhead and maintains practical efficiency.
6.4.2 Visualization Analysis
To understand why BLOGER performs better, a visualization analysis of the learned tokenizer was performed on the Beauty dataset. This involved examining the utilization patterns of the codebook at each level.
Two metrics were used:
- Density: The ratio of activated codewords to the total number of codewords in a codebook. A higher density means more codewords are being used.
- Entropy: Measures the distribution of codeword usage, indicating how evenly the codewords are utilized. Higher entropy means codewords are used more uniformly, suggesting better coverage and less redundancy. (A minimal computation sketch for both metrics is shown below.)

For comparison, TIGER (which can be seen as a BLOGER variant without the bi-level optimization designs) was also analyzed.
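As an illustration only, the two metrics could be computed from the per-level codeword assignments as in the sketch below; this is a re-implementation of the definitions above, not the authors' code.

```python
import numpy as np

def codebook_utilization(assignments: np.ndarray, codebook_size: int):
    """Compute Density and Entropy of codeword usage for one codebook level.

    `assignments` holds the codeword index chosen for every item at this
    quantization level. Density is the fraction of codewords used at least
    once; Entropy (in bits) is maximal when usage is perfectly uniform.
    """
    counts = np.bincount(assignments, minlength=codebook_size)
    density = np.count_nonzero(counts) / codebook_size
    probs = counts[counts > 0] / counts.sum()
    entropy = float(-(probs * np.log2(probs)).sum())
    return density, entropy
```

For a codebook with K codewords, the maximum possible entropy is log2(K), so higher values indicate more uniform codeword usage.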
Figure 4: Codebook utilization comparison between TIGER and BLOGER, where "Codebook i" denotes the i-th codebook in the residual quantization process. "Density" is defined as the ratio of activated codewords to the total number of codewords, and "Entropy" measures the distribution of codeword usage, indicating how evenly the codewords are utilized.
Observations from Figure 4:
- First-Level Codebook Improvement: BLOGER shows a significant improvement over TIGER specifically in the first-level codebook. Density increases from 47% (TIGER) to 65% (BLOGER), and Entropy increases from 6.05 (TIGER) to 7.14 (BLOGER). This indicates that BLOGER encourages more diverse and uniform usage of codewords at the coarsest granularity level.
- Subsequent Levels: No substantial differences are observed in the subsequent levels (Codebooks 2, 3, and 4).
- Reasoning: The paper attributes this to the first-level codebook carrying the most information due to the residual quantization process (see the sketch below), which makes it more responsive to additional supervision. The recommendation loss, acting as a natural indicator of collaborative information, effectively guides the tokenizer at this crucial first level. It helps alleviate the coarse-grained codebook allocation bias that can arise from relying solely on textual information (as TIGER largely does). By incorporating this collaborative signal via the bi-level optimization framework, the tokenizer learns better item representations, leading to improved recommendation performance.

This analysis provides deeper insight into the mechanism behind BLOGER's effectiveness: the performance gains are rooted in a more robust and collaboratively aligned item tokenization, particularly at the fundamental first level of granularity.
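To make the residual-quantization intuition concrete, here is a minimal sketch with a simple nearest-codeword assignment; the shapes and interface are assumptions for illustration. Each level quantizes only the residual left by the previous level, so the first codebook captures the largest, coarsest-grained component of the item embedding.

```python
import torch

def residual_quantize(embedding: torch.Tensor, codebooks: list) -> list:
    """Illustrative residual quantization of one item embedding.

    `codebooks` is a list of tensors of shape (num_codewords, dim). Each
    level encodes the residual left by the previous level, which is why the
    first codebook carries the most information about the item.
    """
    residual = embedding
    codes = []
    for codebook in codebooks:
        # Nearest codeword by squared Euclidean distance.
        distances = ((codebook - residual) ** 2).sum(dim=1)
        idx = int(distances.argmin())
        codes.append(idx)
        residual = residual - codebook[idx]  # only the leftover is passed on
    return codes
```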
7. Conclusion & Reflections
7.1. Conclusion Summary
This work successfully reformulates generative recommendation as a bi-level optimization problem, explicitly addressing the crucial interdependence between item tokenization and autoregressive generation. The proposed BLOGER framework efficiently solves this problem by integrating meta-learning for the nested optimization and gradient surgery to resolve gradient conflicts between the tokenization and recommendation objectives. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods without significant additional computational overhead. A key finding is that the recommendation loss provides effective collaborative signals that improve codebook utilization in the tokenizer, particularly at the first (coarsest) level, thereby bridging the gap between item tokenization and effective autoregressive generation.
7.2. Limitations & Future Work
The authors identify several directions for future research:
- Generalization Capabilities: Further validation of BLOGER's generalization on more diverse datasets, especially large-scale industrial datasets, is planned to assess its scalability and practical applicability.
- Practical Deployment: Investigating practical deployment considerations to facilitate the adoption of BLOGER in production environments, for example optimizations for real-time inference or integration with existing infrastructure.
- Reasoning Capabilities: Exploring the integration of reasoning capabilities within the recommendation context, possibly by leveraging Large Language Models (LLMs) or other advanced reasoning mechanisms, to enhance the explanatory power or decision-making of the recommender.
7.3. Personal Insights & Critique
This paper presents a highly rigorous and principled approach to a critical problem in generative recommendation. The use of bi-level optimization to unify tokenizer and recommender training is a significant conceptual leap beyond prior sequential or alternating optimization strategies.
Insights:
- Principled Alignment: The bi-level optimization framework offers a theoretically sound way to ensure the tokenizer's output directly serves the recommendation objective. This is a more elegant solution than relying on auxiliary losses or indirect alignment mechanisms.
- Meta-Learning for Efficiency: Employing meta-learning (MAML) is a smart choice for managing the computational complexity inherent in bi-level optimization, making the approach practical for deep learning models.
- Gradient Surgery for Stability: Integrating gradient surgery is crucial for the stability of multi-objective optimization. Gradient conflicts are a common issue in such setups, and explicitly addressing them ensures that tokenizer updates are beneficial, or at least non-detrimental, to both objectives.
- Interpretability of Gains: The visualization analysis, specifically the improved codebook utilization at the first level, provides valuable interpretability. It suggests that the recommendation loss acts as a powerful collaborative signal that refines the initial, coarse-grained item representations, a concrete and understandable mechanism for the performance improvement.
Critique/Areas for Improvement:
- Training Time Overhead: While the paper states "no significant additional computational overhead" and the asymptotic complexity is the same as TIGER, the absolute training time increased by about 12% (23.7 s to 26.6 s per epoch). For very large-scale industrial applications, even a 12% increase can be substantial, especially over many epochs. Further optimization or distributed training strategies might be needed for true industrial deployment.
- Initial Semantic Embeddings: BLOGER relies on initial semantic embeddings (e.g., LLaMA2-7B text embeddings, or SASRec embeddings for ETEGRec). The overall performance is still bounded by the quality of these initial representations. Future work could explore jointly learning these initial embeddings within the bi-level framework or making the tokenizer less dependent on their pre-trained quality.
- Codebook Design: The RQVAE structure is effective, but exploring dynamic codebook sizes, adaptive quantization levels, or more advanced codebook learning mechanisms could yield further improvements. The observation that only the first codebook significantly improves might suggest that the higher levels are already well-optimized or less sensitive to recommendation-loss guidance.
- Generalizability to Diverse Data: The datasets used are from Amazon reviews, which are typical e-commerce interaction datasets. While robust, testing on datasets with different characteristics (e.g., highly sparse social networks, content-rich media platforms) could further stress-test the framework's versatility.
- Long-Term Effects: The paper focuses on next-item recommendation. Exploring how BLOGER impacts long-term recommendation performance or user satisfaction over extended periods would be valuable.

This paper provides a strong foundation for future research in generative recommendation, particularly in developing more tightly integrated and end-to-end optimized systems. The methodological rigor and clear experimental validation make it a significant contribution to the field.