Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation
TL;DR Summary
BLOGER unifies tokenizer and recommender optimization via bi-level learning and gradient surgery, enhancing item identifiers' quality and aligning with recommendation goals for improved generative recommendation performance.
Abstract
Generative recommendation is emerging as a transformative paradigm by directly generating recommended items, rather than relying on matching. Building such a system typically involves two key components: (1) optimizing the tokenizer to derive suitable item identifiers, and (2) training the recommender based on those identifiers. Existing approaches often treat these components separately--either sequentially or in alternation--overlooking their interdependence. This separation can lead to misalignment: the tokenizer is trained without direct guidance from the recommendation objective, potentially yielding suboptimal identifiers that degrade recommendation performance. To address this, we propose BLOGER, a Bi-Level Optimization for GEnerative Recommendation framework, which explicitly models the interdependence between the tokenizer and the recommender in a unified optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both the tokenization loss and recommendation loss. We adopt a meta-learning approach to solve this bi-level optimization efficiently, and introduce gradient surgery to mitigate gradient conflicts in the upper-level updates, thereby ensuring that item identifiers are both informative and recommendation-aligned. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods while maintaining practical efficiency with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation
1.2. Authors
-
Yimeng Bai (University of Science and Technology of China)
-
Chang Liu (Beihang University)
-
Yang Zhang (National University of Singapore)
-
Dingxian Wang (Upwork)
-
Frank Yang (Upwork)
-
Andrew Rabinovich (Upwork)
-
Wenge Rong (Beihang University)
-
Fuli Feng (University of Science and Technology of China)
The authors represent a mix of academic institutions and industry entities (Upwork), suggesting a research effort that combines theoretical rigor with practical applicability.
1.3. Journal/Conference
The ACM reference within the paper lists: Woodstock '18: ACM Symposium on Neural Gaze Detection, June 03-05, 2018, Woodstock, NY. ACM, New York, NY, USA.
Note: The publication date provided in the metadata is 2025-10-24T08:25:56.000Z, while the ACM reference within the paper states 2018. The Woodstock '18 entry is the standard placeholder from the ACM LaTeX template, so the paper has not yet been formally published at a specific venue; it is currently a 2025 arXiv preprint. The ACM is a highly reputable organization for computer science research, and its conferences (like SIGIR, KDD, WSDM) are top venues for recommender systems research.
1.4. Publication Year
2025 (as indicated by the arXiv timestamp, considering the ACM reference year as a placeholder).
1.5. Abstract
Generative recommendation aims to directly generate recommended items rather than matching them. Such systems typically involve two components: an item tokenizer to create suitable item identifiers, and a recommender trained on these identifiers. Current methods often treat these components separately (sequentially or alternately), leading to misalignment where the tokenizer lacks direct guidance from the recommendation objective, potentially yielding suboptimal identifiers. This paper introduces BLOGER (Bi-Level Optimization for GEnerative Recommendation), a framework that explicitly models the interdependence between the tokenizer and recommender using a unified bi-level optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both tokenization loss and a recommendation loss, which acts as direct supervision. BLOGER employs a meta-learning approach for efficiency and incorporates gradient surgery to mitigate gradient conflicts in upper-level updates. Experiments on real-world datasets show BLOGER consistently outperforms state-of-the-art generative recommendation methods with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
1.6. Original Source Link
https://arxiv.org/abs/2510.21242 Publication status: This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in the emerging paradigm of generative recommendation. Unlike traditional discriminative (retrieve-and-rank) systems that rely on matching user queries to candidate items, generative recommenders directly generate item identifiers autoregressively. This offers significant advantages such as improved scalability for large item corpora, reduced susceptibility to feedback loops, and enhanced user personalization.
The construction of a generative recommendation system typically involves two main components:
- Item Tokenization: Optimizing a tokenizer to convert each item into a sequence of tokens, which serves as its unique identifier.
- Autoregressive Generation: Training a recommender (often a Transformer-based model like T5) to predict the next item's identifier based on the user's historical interaction sequence.
The core problem identified by the authors is the decoupled optimization of these two components. Existing approaches typically train the tokenizer and recommender either sequentially (first tokenizer, then recommender) or alternately (iterative updates). This separation creates an interdependence misalignment: the tokenizer is optimized without direct feedback from the ultimate recommendation objective. Consequently, the generated item identifiers might be suboptimal for the recommender's task, failing to capture user-behavior correlations and ultimately degrading recommendation performance. For instance, a tokenizer might assign unrelated identifiers to items frequently co-purchased (e.g., "baby formula" and "diapers") if it only considers modality differences, thereby hindering the recommender's ability to link them.
The paper's entry point and innovative idea is to explicitly model this intricate interdependence within a unified optimization framework, thereby ensuring that the tokenizer's output (item identifiers) is directly aligned with the recommender's goal (accurate item prediction).
2.2. Main Contributions / Findings
The primary contributions of this work are:
- Bi-Level Optimization Formulation: The paper proposes reformulating generative recommendation as a bi-level optimization problem. This explicitly models the mutual dependence between the tokenizer and the recommender, where the recommender's training forms the inner loop and the tokenizer's optimization (guided by the recommender's performance) forms the outer loop.
- BLOGER Framework: Introduction of BLOGER (Bi-Level Optimization for GEnerative Recommendation), a novel framework designed to efficiently and effectively solve this bi-level optimization problem. It employs meta-learning (specifically, a MAML-like approach) for efficient optimization and incorporates gradient surgery to resolve potential conflicts between the tokenization loss and recommendation loss during the outer-level updates. This ensures that the generated item identifiers are both informative and aligned with the recommendation objective, without introducing significant additional computational overhead.
- Empirical Validation: Extensive experiments conducted on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods. The analysis further reveals that the performance gains largely stem from improved codebook utilization, indicating that the recommendation loss effectively acts as a natural indicator of collaborative information, rectifying codebook allocation bias that might arise from relying solely on textual information.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Generative Recommendation
Generative recommendation is a paradigm shift from traditional discriminative recommendation (often termed retrieve-and-rank). Instead of selecting items from a pre-defined candidate set by matching user queries or preferences, generative recommenders directly create or generate the identifier of the next item a user is likely to interact with. This is typically done in an autoregressive manner, meaning each part of the item identifier is generated sequentially, conditioned on previously generated parts and the user's historical interactions. This approach eliminates the need for one-to-one comparisons with potentially millions of candidate items, leading to better scalability and enabling the generation of novel items.
Item Tokenization
Item tokenization is the process of converting an item into a sequence of discrete units called tokens. Just as natural language processing models break sentences into words or sub-word units, generative recommenders need a way to represent items in a discrete, machine-understandable format that can be generated sequentially. These token sequences serve as unique identifiers or "semantic IDs" for items. The quality of these token sequences is crucial: they should ideally capture rich semantic information about the item and its relationships with other items, while also being concise and easy for a generative model to learn and produce.
Autoregressive Generation
Autoregressive generation refers to a sequence generation process where each element in a sequence is generated one at a time, conditioned on all previously generated elements in that sequence. In the context of generative recommendation, an autoregressive model predicts the first token of an item identifier, then predicts the second token based on the first, and so on, until the entire item identifier (sequence of tokens) is complete. This is a common mechanism in sequence-to-sequence models like Transformers and recurrent neural networks (RNNs).
Transformer
The Transformer is a neural network architecture introduced in 2017, known for its effectiveness in sequence-to-sequence tasks, particularly in natural language processing. It revolutionized sequence modeling by entirely eschewing recurrence and convolutions, instead relying solely on attention mechanisms.
A Transformer typically consists of an encoder and a decoder.
- Encoder: Processes the input sequence (e.g., the user's historical item tokens) and produces a contextualized representation. It consists of multiple identical layers, each containing a multi-head self-attention mechanism and a feed-forward network.
- Decoder: Generates the output sequence (e.g., the next item's tokens) autoregressively. It also consists of multiple identical layers, but each layer includes an additional masked multi-head self-attention mechanism (to prevent attending to future tokens) and an encoder-decoder attention mechanism (to attend to the encoder's output).
The T5 (Text-to-Text Transfer Transformer) model, specifically mentioned in the paper, is a variant of the Transformer architecture that frames all NLP tasks as a text-to-text problem, making it highly versatile for various sequence generation tasks.
Attention Mechanism (Self-Attention)
The core component of a Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
Given an input sequence, self-attention computes three matrices:
-
Query (Q): Represents the current token being processed.
-
Key (K): Represents all tokens in the sequence.
-
Value (V): Represents the information content of all tokens.
The attention score between a Query and all Keys determines how much focus to place on each Value. The output is a weighted sum of the Values. The formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
- $Q$ is the Query matrix.
- $K$ is the Key matrix.
- $V$ is the Value matrix.
- $d_k$ is the dimension of the Key vectors, used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
- $QK^T$ computes the dot product similarity between Query and Key vectors.
- softmax converts these scores into probability distributions.
Multi-head attention extends this by running the attention mechanism multiple times in parallel with different learned linear projections of Q, K, V, and then concatenating their outputs.
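As a concrete reference for the formula above, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking); it is illustrative only and not tied to the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values

# Toy usage: 3 query tokens, 4 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 8)
```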
Residual Quantization (RQ-VAE)
Residual Quantization (RQ) is a technique used to compress continuous vector representations into discrete token sequences, often in a hierarchical manner. It works by progressively quantizing the residual (the part of the vector not yet explained by previous quantization steps).
- The first quantization step quantizes the original continuous vector.
- The residual error from the first step (original vector minus its quantized approximation) is then quantized in the second step.
- This process repeats for multiple levels, with each level quantizing the residual from the previous level.
This hierarchical approach allows for representing items with increasing granularity and creates a natural tree-structured item space.
RQ-VAE refers to combining Residual Quantization with a Variational Autoencoder (VAE) framework, where the quantization process is part of the VAE's bottleneck.
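The hierarchy can be illustrated with a short NumPy sketch of greedy residual quantization; the codebooks here are random and untrained, and no VAE encoder/decoder is included.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the nearest code
    embedding for the current residual, then subtract it.
    codebooks: list of (K, d) arrays; z: (d,) vector. Returns token ids."""
    tokens, residual = [], z.copy()
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # distance to each code
        c = int(dists.argmin())                              # hard token assignment
        tokens.append(c)
        residual = residual - codebook[c]                    # quantize what remains
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]   # e.g. L=4 levels, K=256 codes
item_embedding = rng.normal(size=32)
print(residual_quantize(item_embedding, codebooks))          # e.g. [id1, id2, id3, id4]
```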
Bi-Level Optimization
Bi-level optimization involves two nested optimization problems. It's a hierarchical optimization framework where the solution of an inner (lower-level) problem constrains or determines parameters for an outer (upper-level) problem.
The general form is:
$
\begin{array}{rl}
\min_{x} & F(x, y^*(x)) \\
\mathrm{s.t.} & y^*(x) = \arg\min_{y} G(x, y)
\end{array}
$
Where:
- $x$ are the outer-level variables.
- $y$ are the inner-level variables.
- $F(x, y^*(x))$ is the outer-level objective function.
- $G(x, y)$ is the inner-level objective function.
- $y^*(x)$ denotes the optimal solution of the inner-level problem for a given $x$.
In the context of this paper, the tokenizer parameters would be the outer-level variables, and the recommender parameters would be the inner-level variables. The recommender is optimized given a tokenizer, and then the tokenizer is optimized based on how well the recommender performs with its outputs.
Meta-Learning (MAML)
Meta-learning, or "learning to learn," aims to train models that can quickly adapt to new tasks or environments with minimal data. MAML (Model-Agnostic Meta-Learning) is a popular meta-learning algorithm. The core idea is to find a good model initialization that, after one or a few gradient descent steps on a new task, leads to a good performance on that task.
For bi-level optimization, meta-learning can be used to efficiently approximate the inner-level optimization. Instead of fully converging the inner-level model, MAML performs a "tentative update" (one or a few gradient descent steps) of the inner-level parameters. The outer-level then uses the performance of this tentatively updated inner-level model to compute its meta-gradients and update its own parameters. This avoids the computational burden of fully solving the inner-level problem at each outer-level iteration.
Gradient Surgery
Gradient surgery is a technique used in multi-task learning to address gradient conflicts, which occur when gradients from different tasks point in opposing directions, potentially hindering overall learning. When tasks have conflicting objectives, a naive summation of their gradients can lead to oscillations or suboptimal convergence.
Gradient surgery identifies gradients that conflict (e.g., by checking cosine similarity) and then modifies them. A common strategy, as adopted in this paper, is to project one gradient onto the normal plane of another, effectively removing any component that opposes the second gradient. This ensures that the update direction is at least orthogonal (or aligned) with the desired gradient and does not negatively impact the other task's progress.
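A minimal NumPy sketch of this PCGrad-style projection, with toy two-dimensional gradients (illustrative only):

```python
import numpy as np

def project_conflicting(g_rec, g_token):
    """If g_rec conflicts with g_token (negative dot product), remove the
    component of g_rec that opposes g_token by projecting onto its normal plane."""
    dot = np.dot(g_rec, g_token)
    if dot < 0:
        g_rec = g_rec - dot / np.dot(g_token, g_token) * g_token
    return g_rec

g_token = np.array([1.0, 0.0])
g_rec = np.array([-1.0, 1.0])                 # conflicts with g_token
g_proj = project_conflicting(g_rec, g_token)
print(g_proj, np.dot(g_proj, g_token))        # [0. 1.] 0.0 -> no longer opposes g_token
```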
3.2. Previous Works
The paper categorizes previous works into Traditional recommender models and Generative recommender models.
Traditional Recommender Models
These models typically operate on discrete item IDs and focus on query-candidate matching or predicting a score for potential items.
- MF (Matrix Factorization) [19]: Decomposes the user-item interaction matrix into lower-dimensional user and item embeddings. This allows for capturing latent features and predicting missing interactions.
- LightGCN [13]: A lightweight Graph Convolutional Network (GCN) that simplifies the propagation rule of GCNs for recommendation by removing feature transformations and non-linear activations. It focuses on learning user and item embeddings through explicit graph convolution on the user-item interaction graph, capturing high-order interactions.
- Caser [38]: Convolutional Sequence Embedding Recommendation uses horizontal and vertical convolutional filters to capture point-wise and sequential patterns in user interaction sequences.
- GRU4Rec [15]: An RNN-based sequential recommender that utilizes Gated Recurrent Units (GRUs) to encode historical interactions and predict the next item. It is effective for capturing short-term user interests.
- HGN (Hierarchical Gating Networks) [27]: Employs hierarchical gating networks to capture both long-term and short-term user interests from item sequences, allowing for flexible aggregation of information at different time scales.
- SASRec (Self-Attentive Sequential Recommendation) [18]: Utilizes self-attention mechanisms to capture long-term dependencies within a user's interaction history, enabling it to model complex sequential patterns more effectively than RNN-based models.
Generative Recommender Models
These models aim to directly generate item identifiers. They can be broadly classified by how they integrate item representation and generation.
LLM-based Generative Recommenders
These approaches leverage Large Language Models (LLMs) for recommendation, often by formulating the task as a text generation problem.
- P5 [10]: Frames recommendation as language processing by using a Unified Pretrain, Personalized Prompt & Predict Paradigm. It integrates collaborative knowledge into LLM-based recommenders by generating item identifiers, often derived through spectral clustering on item co-occurrence graphs.
- BIGRec [3]: Represents items using their titles and fine-tunes LLMs for recommendation under a bi-step grounding paradigm.
- IDGenRec [37]: Focuses on generating unique, concise, semantically rich, and platform-agnostic textual identifiers for items, making them more adaptable for LLM-based recommender systems.
- BinLLM [48]: Transforms collaborative embeddings from external models into binary sequences, aligning item representations with LLM-consumable formats.
Sequence-to-Sequence Generative Recommenders
These models combine item tokenization with autoregressive generation, often using Transformer-like architectures. The paper highlights a distinction based on their optimization strategy:
-
Sequential Optimization (Decoupled):
- TIGER [34]: Represents each item with a Semantic ID (a tuple of codewords) and uses a Transformer to predict the next one. It leverages text embeddings to construct these semantic IDs, typically in a sequential optimization manner where tokenization is done first and then the recommender is trained.
- LETTER [40]: Enhances TIGER by introducing a learnable tokenizer that incorporates hierarchical semantics and collaborative signals, and promotes diversity in code assignment. It still largely follows a sequential optimization strategy.
- QARM [26]: Focuses on quantitative code learning guided by high-quality downstream-task item-item pairs.
-
Alternating Optimization:
ETEGRec [22]: A Generative Recommender with End-to-End Learnable Item Tokenization. It introduces two auxiliary losses to align the intermediate representations of the tokenizer and the recommender and adopts an alternating optimization strategy for end-to-end training. While iterative, it still uses auxiliary losses rather than the direct recommendation loss for tokenizer guidance.
Bi-Level Optimization in Recommendation
Bi-level optimization has been applied in recommendation for various tasks:
- BOD [44]: Utilizes bi-level optimization to adaptively learn sample weights to mitigate noise in implicit feedback.
- LabelCraft [2]: Applies bi-level optimization to automatically generate reliable labels from raw feedback in short video recommendation, aiming for direct alignment with platform objectives.
- Other applications include adversarial training for unlearning protected user attributes [9] or personalized ranking [14].
The paper claims to be the first to apply bi-level optimization to reshape generative recommendation, specifically bridging item tokenization and autoregressive generation.
3.3. Technological Evolution
The evolution of recommendation systems has moved from simple collaborative filtering and matrix factorization to deep learning-based sequential models using RNNs and Transformers. A major paradigm shift is from discriminative (matching) to generative (direct generation).
Initially, generative recommenders focused on either LLM-based approaches that directly used textual descriptions or collaborative signals in prompts, or sequence-to-sequence models that relied on pre-processed or sequentially optimized item tokenizers (TIGER, LETTER). The challenge with these sequence-to-sequence approaches has been the decoupled nature of tokenization and generation. ETEGRec attempted to bridge this with alternating optimization and auxiliary losses, but still lacked a direct, unified optimization of the tokenizer based on the ultimate recommendation objective.
This paper's work, BLOGER, fits into this timeline by proposing a more principled and direct approach to coupling tokenization and generation. By using bi-level optimization, it moves beyond sequential or alternating updates with auxiliary objectives, directly aligning the tokenizer's learning with the recommender's performance via the recommendation loss. This represents a significant step towards truly end-to-end optimized generative recommendation systems.
3.4. Differentiation Analysis
BLOGER distinguishes itself from previous methods primarily through its novel bi-level optimization framework and integrated learning scheme:
-
Optimization Formulation:
- TIGER & LETTER: Employ sequential optimization. The tokenizer is trained independently to produce static item identifiers, which are then fed to the recommender. This completely decouples the tokenizer's objective from the recommender's performance.
- ETEGRec: Uses alternating optimization. The tokenizer and recommender are iteratively updated. While better than sequential, it still relies on auxiliary losses to align intermediate representations, which are indirect proxies for the final recommendation objective.
- BLOGER: Introduces bi-level optimization. The tokenizer (outer level) is optimized based on the recommendation loss achieved by the recommender (inner level). This provides direct, end-to-end alignment and supervision from the downstream recommendation task.
-
Alignment with Recommendation Objectives:
- TIGER & LETTER: Lack direct guidance from the recommendation objective. The tokenizer optimizes for its intrinsic quality (e.g., reconstruction fidelity, diversity) but not necessarily for how well its tokens enable the recommender to make predictions.
- ETEGRec: Achieves alignment through auxiliary losses applied to intermediate layers, but this is an indirect alignment.
- BLOGER: Directly incorporates the recommendation loss into the outer-level objective for tokenizer optimization. This eliminates the need for handcrafted auxiliary objectives and ensures the tokenization process is inherently geared towards improving recommendation performance.
-
Coupling Mechanism:
- TIGER & LETTER: No direct gradient-level coupling between tokenizer and recommender during the tokenization phase.
- ETEGRec: Employs alternating updates, but the coupling at the gradient level is implicit through auxiliary losses rather than a meta-gradient approach.
- BLOGER: Leverages meta-gradients to establish a direct gradient-level connection. This allows the tokenizer to receive optimization signals that reflect the recommender's performance, enabling more effective bidirectional learning. Furthermore, gradient surgery is applied to mitigate conflicts between the tokenization loss and recommendation loss gradients, ensuring stable and beneficial tokenizer updates.
-
Token Sequence Generation:
-
TIGER & LETTER: Use pre-processed or statically assigned token sequences.
-
ETEGRec & BLOGER: Gradually refine token sequences during training, allowing them to adapt and improve over time.
In essence, BLOGER provides a more principled and tightly coupled optimization framework for generative recommendation by directly integrating the recommendation objective into the tokenizer's learning process via bi-level optimization and meta-gradients, while using gradient surgery to manage gradient conflicts.
4. Methodology
4.1. Principles
The core principle behind BLOGER is to resolve the misalignment between item tokenization and autoregressive generation in generative recommendation by explicitly modeling their interdependence within a unified bi-level optimization framework. The intuition is that an effective tokenizer should produce item identifiers that not only accurately represent items but also facilitate the recommender in predicting future interactions. Conversely, the recommender relies on high-quality tokenized inputs to accurately model user preferences.
BLOGER reformulates the problem such that the recommender's training becomes an inner-level optimization task, which is dependent on the tokenizer's parameters. The tokenizer's optimization then becomes an outer-level task, where it learns to produce better tokens by observing how well the recommender performs with those tokens, in addition to its intrinsic tokenization quality. To efficiently solve this nested optimization, BLOGER adopts a meta-learning strategy, approximating the inner-level solution, and employs gradient surgery to manage gradient conflicts between the tokenization and recommendation objectives, ensuring stable and beneficial updates to the tokenizer.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1 Problem Reformulation
The paper models the tokenizer as $\mathcal{T}_{\phi}$ and the recommender as $\mathcal{R}_{\theta}$, where $\phi$ and $\theta$ are their respective model parameters. The bi-level optimization problem is formulated as follows:
$
\begin{array}{rl}
\min_{\phi} & \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta^*}) + \lambda \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}) \\
\mathrm{s.t.} & \theta^* = \arg\min_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}).
\end{array}
$ (2)
Here, we break down the components of this formulation:
- $\mathcal{L}_{\mathrm{rec}}$: This is the recommendation loss. It measures the performance of the recommender on the training data, given the tokenized outputs produced by the tokenizer.
- $\mathcal{L}_{\mathrm{token}}$: This is the tokenization loss. It evaluates the intrinsic quality of the tokenizer itself, independent of the recommender (e.g., how well it can reconstruct item embeddings).
- $\lambda$: This is a hyperparameter that controls the relative importance of the tokenization loss in the overall outer-level objective. A larger $\lambda$ means more emphasis on intrinsic tokenizer quality, while a smaller $\lambda$ prioritizes recommendation performance.
- $\theta^*$: This represents the optimal parameters for the recommender. Critically, $\theta^*$ is obtained by minimizing the recommendation loss for a fixed tokenizer $\mathcal{T}_{\phi}$, meaning $\theta^*$ is implicitly a function of $\phi$.
- $\mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta^*})$: This term in the outer-level objective represents the recommendation loss of the recommender after it has been optimally trained using the tokens from tokenizer $\mathcal{T}_{\phi}$. This is the key mechanism through which the tokenizer receives direct feedback about its impact on recommendation performance.
Equation (2b) defines the inner-level objective: for a given tokenizer, the recommender is trained to minimize its recommendation loss. Equation (2a) defines the outer-level objective: the tokenizer is optimized to ensure that the optimally trained recommender (which depends on $\phi$) achieves minimal recommendation loss, while simultaneously minimizing its own tokenization loss. This setup ensures mutual reinforcement and explicit alignment.
4.2.2 BLOGER Framework
An overview of the BLOGER framework is illustrated in Figure 2.

This figure (Figure 2 of the paper) is a schematic overview of the BLOGER framework. The left side shows the model architecture, comprising the Tokenizer, Embedding Table, and Recommender, which construct a probability-based mixed representation; the right side shows the meta-learning strategy and gradient surgery, where the latter mitigates gradient conflicts by making $G_{\mathrm{rec}}$ orthogonal to $G_{\mathrm{token}}$.
Figure 2: An overview of the proposed BLOGER framework, which comprises an encoder-decoder model architecture and a tailored learning scheme. Specifically, we construct a mixed representation based on the assignment probabilities over the codebook to ensure the differentiability of the recommendation loss to the tokenizer. Besides, we adopt a meta-learning strategy to enable efficient optimization, and apply gradient surgery to extract beneficial components from the gradients of the recommendation loss.
4.2.2.1 Tokenizer
The tokenizer is implemented using a Residual Quantization Variational Autoencoder (RQVAE) model, building on prior work. This choice is motivated by the benefits of hierarchical encoding, which induces a tree-structured item space suitable for generative modeling and enables sharing of information among related items through common prefix tokens.
Given an item $i$, its semantic embedding $z$ (e.g., a pretrained text embedding or a collaborative embedding) is input to the tokenizer $\mathcal{T}_{\phi}$ to produce a sequence of discrete tokens:
$
[ c _ { 1 } , . . . , c _ { L } ] = \mathcal { T } _ { \phi } ( z ) ,
$ (3)
Where $c_l$ is the token assigned to item $i$ at the $l$-th level of the hierarchy.
Quantization:
The process begins by encoding the input semantic embedding $z$ into a latent representation $r$ using an MLP-based encoder:
$
r = \operatorname { E n c o d e r } _ { \mathcal { T } } ( z ) .
$ (4)
The latent representation $r$ is then progressively quantized into discrete tokens. At each level $l$, there is a codebook $\{e_l^k\}_{k=1}^{K}$, which contains $K$ distinct code embeddings. The quantization proceeds as follows:
$
\begin{array}{c}
P(k \mid v_l) = \dfrac{\exp\left(-\| v_l - e_l^k \|^2\right)}{\sum_{j=1}^{K} \exp\left(-\| v_l - e_l^j \|^2\right)}, \\
c_l = \arg\max_{k} P(k \mid v_l), \\
v_l = v_{l-1} - e_{l-1}^{c_{l-1}}.
\end{array}
$ (5)
Here:
- $P(k \mid v_l)$: This is the probability of assigning the residual vector $v_l$ to the $k$-th token in the current codebook. It is computed using a softmax-like function based on the Euclidean distance between $v_l$ and each codebook entry $e_l^k$. A smaller distance implies higher probability.
- $v_l$: This denotes the residual vector at the $l$-th level. For the first level, $v_1 = r$ (the initial latent representation). For subsequent levels, $v_l$ is the residual error from the previous quantization step, specifically $v_l = v_{l-1} - e_{l-1}^{c_{l-1}}$.
- $e_l^k$: The $k$-th code embedding (vector) in the codebook at level $l$.
- $c_l$: This is the selected token at level $l$, chosen as the codebook entry that maximizes the probability $P(k \mid v_l)$. This is a hard assignment.
Reconstruction:
After obtaining the tokens $[c_1, \ldots, c_L]$ for an item, a quantized representation $\tilde{r}$ is formed by summing the selected codebook entries:
$
\tilde { r } = \sum _ { l = 1 } ^ { L } e _ { l } ^ { c _ { l } }
$
This is then fed into an MLP-based decoder to reconstruct the original semantic embedding $z$:
$
\tilde { z } = \mathrm { D e c o d e r } _ { \mathcal { T } } ( \tilde { r } ) .
$ (6)
Tokenization Loss:
The tokenization loss for an item is defined as:
$
\mathcal { L } _ { \mathrm { t o k e n } } ( \mathcal { T } _ { \phi } ) = \sum _ { i } ( \Vert \tilde { \boldsymbol { z } } - \boldsymbol { z } \Vert ^ { 2 } + \sum _ { l = 1 } ^ { L } \Vert \mathrm { s g } [ \boldsymbol { v } _ { l } ] - \boldsymbol { e } _ { l } ^ { c _ { l } } \Vert ^ { 2 } + \beta \Vert \boldsymbol { v } _ { l } - \mathrm { s g } [ \boldsymbol { e } _ { l } ^ { c _ { l } } ] \Vert ^ { 2 } ) ,
$ (7)
This loss is summed over all items in the sequence. It comprises three main parts:
- $\| \tilde{z} - z \|^2$: This is a reconstruction loss (mean squared error) that encourages the decoder to accurately reconstruct the original semantic embedding $z$ from its quantized representation $\tilde{r}$.
- $\| \mathrm{sg}[v_l] - e_l^{c_l} \|^2$: This is the codebook loss. It minimizes the distance between the (stop-gradient'd) residual vector $v_l$ and the chosen codebook entry $e_l^{c_l}$. sg (stop-gradient) means that gradients do not flow through $v_l$ in this term; it primarily updates the codebook entries to be closer to the residual vectors that map to them.
- $\beta \| v_l - \mathrm{sg}[e_l^{c_l}] \|^2$: This is the commitment loss. It minimizes the distance between the residual vector $v_l$ and the (stop-gradient'd) chosen codebook entry $e_l^{c_l}$. sg here means gradients do not flow through $e_l^{c_l}$; this term primarily updates the encoder parameters to ensure that the residual vectors are committed to (close to) the codebook entries they select.
- $\beta$: A weighting coefficient, typically set to 0.25, to balance the updates between the encoder and the codebooks.
- $\mathrm{sg}[\cdot]$: The stop-gradient operation. It prevents gradients from flowing through its argument during backpropagation. This is crucial for training VAEs with discrete bottlenecks like RQ-VAE to ensure stable updates.
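For concreteness, here is a small PyTorch sketch of the three loss terms for a single quantization level, using detach() as the stop-gradient; the shapes and random values are assumptions for illustration, not the paper's code.

```python
import torch

z = torch.randn(32)                            # original semantic embedding
z_rec = torch.randn(32, requires_grad=True)    # decoder output (stand-in for the reconstruction)
v = torch.randn(32, requires_grad=True)        # residual vector at some level l
e = torch.randn(32, requires_grad=True)        # chosen codebook entry e_l^{c_l}
beta = 0.25

recon = ((z_rec - z) ** 2).sum()               # reconstruction term
codebook = ((v.detach() - e) ** 2).sum()       # codebook term: gradients reach e only
commit = beta * ((v - e.detach()) ** 2).sum()  # commitment term: gradients reach the encoder side
token_loss = recon + codebook + commit
token_loss.backward()
```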
4.2.2.2 Recommender
The generative recommender is based on a Transformer-based encoder-decoder architecture, specifically T5.
Mixed Representation:
A key challenge is that the tokenizer outputs discrete token sequences, which are non-differentiable. To allow gradients from the recommendation loss to flow back to the tokenizer for optimization, BLOGER uses differentiable soft representations.
Let $E$ be the vocabulary embeddings of the recommender, where each row corresponds to the embedding of a token in the vocabulary. For a given token $c_l^t$ (the $l$-th token of item $t$):
- Hard Embedding: A hard embedding $h_l^t$ is obtained by a direct lookup in $E$ using the discrete token $c_l^t$.
- Soft Embedding: The probability distribution $P(\cdot \mid v_l)$ (from Equation 5) is padded with zeros to match the size of the full vocabulary. The soft embedding $s_l^t$ is then computed as a weighted average over $E$, using these padded probabilities as weights.
Finally, the
mixed representation $o_l^t$ for the token is computed as: $ o_l^t = s_l^t + \operatorname{sg}[h_l^t - s_l^t] , $ (8) Here:
- $s_l^t$: The soft embedding, which is differentiable as it is a weighted sum.
- $h_l^t$: The hard embedding, which is numerically precise.
- $\operatorname{sg}[\cdot]$: Stop-gradient operation. This ensures that gradients flow only through the soft embedding (which connects to the tokenizer's output probabilities), while the hard embedding provides the numerical value but does not contribute to gradient flow for tokenizer updates. This effectively combines the benefits of discrete precision with differentiable soft representations.
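A minimal PyTorch sketch of the straight-through mixed representation in Equation (8); the vocabulary size, the slice of token ids assumed to belong to this level, and the variable names are illustrative assumptions only.

```python
import torch

E = torch.nn.Embedding(1024, 128)                 # recommender vocabulary embeddings
logits = torch.randn(256, requires_grad=True)     # stand-in for -||v_l - e_l^k||^2
level_probs = torch.softmax(logits, dim=0)        # P(k | v_l) over this level's codebook
# Pad to the full vocabulary, assuming this level owns token ids 100..355 (illustrative).
probs = torch.cat([torch.zeros(100), level_probs, torch.zeros(1024 - 356)])
c = 100 + int(level_probs.argmax())               # discrete token id (hard assignment)

soft = probs @ E.weight                           # s_l^t: differentiable weighted average
hard = E(torch.tensor(c))                         # h_l^t: exact lookup
mixed = soft + (hard - soft).detach()             # Eq. (8): value of hard, gradient of soft
mixed.sum().backward()                            # gradients reach `logits` via the soft path
print(logits.grad is not None)                    # True
```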
Seq2Seq Formulation:
During training, the user's item-level interaction sequence $X$ is tokenized and converted into the input sequence embedding $E^X$ using the mixed representation for each token. This is fed into the recommender's encoder:
$
H ^ { E } = \operatorname { E n c o d e r } _ { \mathcal { R } } ( E ^ { X } ) .
$ (9)
Where $H^E$ is the hidden representation from the encoder.
For decoding the target item (tokenized as $Y$), a special beginning-of-sequence ([BOS]) token is prepended, and its mixed representation $E^Y$ is created. The decoder then takes $H^E$ and $E^Y$ as input to model user preference:
$
H ^ { D } = \operatorname { D e c o d e r } _ { \mathcal { R } } ( H ^ { E } , E ^ { Y } ) .
$ (10)
Where $H^D$ is the hidden representation from the decoder.
Recommendation Loss:
The decoder's output $H^D$ is projected onto the vocabulary space (e.g., by an inner product with the vocabulary embedding matrix $E$) to predict the target item tokens. The recommendation loss is defined as the negative log-likelihood of the target tokens, following the sequence-to-sequence learning paradigm:
$
\mathcal { L } _ { \mathrm { r e c } } ( \mathcal { T } _ { \phi } , \mathcal { R } _ { \theta } ) = - \sum _ { l = 1 } ^ { L } \log P ( Y _ { l } | X , Y _ { < l } ) ,
$ (11)
Here:
- $Y_l$: The $l$-th token in the target sequence for the next item.
- $X$: The tokenized input sequence (historical interactions).
- $Y_{<l}$: All previously generated tokens in the target sequence (from $1$ to $l-1$).
This loss function trains the recommender to generate each token of the target item autoregressively.
4.2.2.3 Tailored Learning Scheme
To solve the bi-level optimization problem (Equation 2), BLOGER employs a meta-learning strategy, inspired by MAML, which iteratively updates $\theta$ and $\phi$.
Meta-Learning Strategy:
The meta-learning approach avoids the computationally expensive process of fully training the recommender to convergence for each tokenizer update.
- Update of $\theta$ (Inner Level): For the inner-level objective, the recommender parameters $\theta$ are updated by directly minimizing the recommendation loss for a fixed tokenizer. This is a standard gradient descent step:
$ \theta \leftarrow \theta - \eta_{\mathcal{R}} \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}) , $ (14)
Where $\eta_{\mathcal{R}}$ is the learning rate for the recommender.
- Update of $\phi$ (Outer Level): Updating the tokenizer parameters $\phi$ is more complex, as it depends on the optimal recommender parameters $\theta^*$. Instead of computing $\theta^*$, BLOGER performs a tentative update of $\theta$. This process involves two steps:
  - Meta Training: The tokenizer is kept unchanged. A single step of gradient descent is performed on the recommender parameters to obtain a tentative recommender:
$ \theta' = \theta - \eta_{\mathcal{R}} \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta}) . $ (15)
Here, $\theta'$ represents the recommender parameters after one tentative update.
  - Meta Testing: With the tentative recommender $\mathcal{R}_{\theta'}$, the recommendation loss is computed. This loss, combined with the tokenization loss, forms the outer-level objective for updating $\phi$:
$ \phi \leftarrow \phi - \eta_{\mathcal{T}} \left( \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}) + \lambda \nabla_{\phi} \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}) \right) , $ (16)
Where $\eta_{\mathcal{T}}$ is the learning rate for the tokenizer. The meta-gradient for the recommendation loss term, $\nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'})$, is computed by backpropagating through the chain:
$ \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}) \propto \frac{\partial \mathcal{L}_{\mathrm{rec}}}{\partial \theta'} \cdot \frac{\partial \theta'}{\partial \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}} \cdot \frac{\partial \nabla_{\theta} \mathcal{L}_{\mathrm{rec}}}{\partial \phi} . $
This meta-gradient effectively tells the tokenizer how to update its parameters so that the recommender, after its tentative update, achieves a lower recommendation loss.
Gradient Surgery:
During outer-level optimization (Equation 16), the gradients from the recommendation loss and the tokenization loss might conflict. To mitigate this, gradient surgery is applied. When the cosine similarity between the gradient of the recommendation loss ($G_{\mathrm{rec}}$) and the tokenization loss ($G_{\mathrm{token}}$) is negative, the conflicting component of $G_{\mathrm{rec}}$ is removed by projecting it onto the normal plane of $G_{\mathrm{token}}$.
The adjusted gradient for recommendation loss is:
$
\begin{array}{rl}
& G_{\mathrm{rec}}^{\mathrm{proj}} = G_{\mathrm{rec}} - \dfrac{G_{\mathrm{rec}} \cdot G_{\mathrm{token}}}{\| G_{\mathrm{token}} \|^2} G_{\mathrm{token}}, \\
& G_{\mathrm{rec}} = \nabla_{\phi} \mathcal{L}_{\mathrm{rec}}(\mathcal{T}_{\phi}, \mathcal{R}_{\theta'}), \\
& G_{\mathrm{token}} = \nabla_{\phi} \mathcal{L}_{\mathrm{token}}(\mathcal{T}_{\phi}).
\end{array}
$ (17)
Here:
- $G_{\mathrm{rec}}$: The meta-gradient of the recommendation loss with respect to the tokenizer parameters $\phi$.
- $G_{\mathrm{token}}$: The gradient of the tokenization loss with respect to the tokenizer parameters $\phi$.
- $G_{\mathrm{rec}}^{\mathrm{proj}}$: The projected gradient of the recommendation loss. This new gradient is orthogonal to $G_{\mathrm{token}}$ (i.e., $G_{\mathrm{rec}}^{\mathrm{proj}} \cdot G_{\mathrm{token}} = 0$), ensuring it does not oppose the tokenization objective while still contributing towards improving recommendation performance. The outer-level update (Equation 16) then uses $G_{\mathrm{rec}}^{\mathrm{proj}}$ instead of $G_{\mathrm{rec}}$ when conflict is detected.
The overall training procedure is summarized in Algorithm 1.
Algorithm 1 Training of BLOGER
1: while not converged do
2:     Update $\theta$ according to Equation (14);
3:     if Step % M == 0 then
4:         Compute $\theta'$ with Equation (15);
5:         Compute $G_{\mathrm{rec}}$ and $G_{\mathrm{token}}$;
6:         if $G_{\mathrm{rec}} \cdot G_{\mathrm{token}} < 0$ then
7:             $G_{\mathrm{rec}} \leftarrow G_{\mathrm{rec}}^{\mathrm{proj}}$ with Equation (17);
8:         end if
9:         Update $\phi$ according to Equation (16) (using the potentially projected $G_{\mathrm{rec}}$);
10:     end if
11: end while
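As a concrete illustration of this loop, here is a minimal, self-contained PyTorch sketch with toy linear stand-ins for the tokenizer and the recommender; the losses, shapes, learning rates, and update schedule are assumptions for illustration only, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
phi = torch.randn(16, 8, requires_grad=True)    # toy "tokenizer" parameters
theta = torch.randn(8, 1, requires_grad=True)   # toy "recommender" parameters
x, y = torch.randn(32, 16), torch.randn(32, 1)
eta_R, eta_T, lam, M = 0.05, 0.01, 0.1, 4

rec_loss = lambda p, t: ((x @ p @ t - y) ** 2).mean()   # stand-in for L_rec
token_loss = lambda p: (p ** 2).mean()                  # stand-in for L_token

for step in range(1, 101):
    # Inner level (Eq. 14): plain gradient step on theta.
    g_theta, = torch.autograd.grad(rec_loss(phi, theta), theta)
    theta = (theta - eta_R * g_theta).detach().requires_grad_(True)

    if step % M == 0:
        # Meta training (Eq. 15): tentative recommender, keeping the graph for meta-gradients.
        g_theta, = torch.autograd.grad(rec_loss(phi, theta), theta, create_graph=True)
        theta_prime = theta - eta_R * g_theta
        # Meta testing (Eq. 16): meta-gradient of L_rec and gradient of L_token w.r.t. phi.
        G_rec, = torch.autograd.grad(rec_loss(phi, theta_prime), phi)
        G_token, = torch.autograd.grad(token_loss(phi), phi)
        # Gradient surgery (Eq. 17): project out the conflicting component.
        if (G_rec * G_token).sum() < 0:
            G_rec = G_rec - (G_rec * G_token).sum() / (G_token * G_token).sum() * G_token
        phi = (phi - eta_T * (G_rec + lam * G_token)).detach().requires_grad_(True)
```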
Algorithm 1 outlines the iterative training loop:
- Recommender Update (Inner Loop): In each step, the recommender parameters $\theta$ are updated using a gradient descent step on the recommendation loss (Line 2).
- Tokenizer Update (Outer Loop - Conditional): The tokenizer parameters $\phi$ are updated less frequently, every $M$ steps (Line 3). This allows the recommender to stabilize before providing meaningful meta-gradients.
  - First, a tentative recommender $\mathcal{R}_{\theta'}$ is computed (Line 4).
  - Then, the meta-gradient $G_{\mathrm{rec}}$ and the tokenization loss gradient $G_{\mathrm{token}}$ are calculated (Line 5).
  - If these gradients conflict (their dot product is negative), gradient surgery is applied to $G_{\mathrm{rec}}$ (Lines 6-8).
  - Finally, $\phi$ is updated using the combined (and potentially adjusted) gradients (Line 9).
4.2.2.4 Complexity Analysis
Let be the model dimension, the size of each codebook, the number of codebooks, and the sequence length.
-
Item Tokenization: For a single item, the encoder/decoder layers have complexity. Obtaining the
mixed representationinvolvescodebook lookupand soft representation, incurring .Tokenization losscomputation adds . Total for one item: . -
Generative Recommendation: The
Transformer'sself-attentionandfeed-forward layerscost . Computing therecommendation lossincurs .The total training cost of
BLOGERis , which matches the order of magnitude ofTIGER. Themeta-learningandgradient surgeryintroduce minimal additional overhead as they primarily involve a few extra forward/backward passes, not altering the asymptotic complexity. Inference complexity remains identical to standardTransformer-based generative models. This indicates practical efficiency.
4.2.3 Discussion
The paper explicitly compares BLOGER to related generative recommendation methods:
| Method | Optimization | Alignment | Token Sequence |
|---|---|---|---|
| TIGER [34] | Sequential | × | Pre-processed |
| LETTER [40] | Sequential | × | Pre-processed |
| ETEGRec [22] | Alternating | Auxiliary Loss | Gradually Refined |
| BLOGER | Bi-Level | Rec Loss | Gradually Refined |
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three subsets of the Amazon Review Data [12, 29], commonly used in sequential recommendation research:
-
Beauty: Products related to beauty.
-
Instruments: Musical instruments.
-
Arts: Arts, Crafts & Sewing products.
The datasets were processed following established practices:
-
5-core filtering: Users and items with fewer than five interactions were removed. This helps ensure data quality and relevance.
-
Sequence Length: Each user's interaction history was truncated or padded to a fixed length of 20.
-
Data Split: A
leave-one-out strategywas used:-
Test set: The latest interaction for each user.
-
Validation set: The second most recent interaction for each user.
-
Training set: All other historical interactions.
The following are the results from Table 2 of the original paper:
| Dataset | #User | #Item | #Interaction | Sparsity | AvgLen |
|---|---|---|---|---|---|
| Beauty | 22363 | 12101 | 198502 | 99.93% | 8.87 |
| Instruments | 24772 | 9922 | 206153 | 99.92% | 8.32 |
| Arts | 45141 | 20956 | 390832 | 99.96% | 8.66 |
-
Table 2: Statistical details of the evaluation datasets, where "AvgLen" represents the average length of item sequences.
These datasets are widely used benchmarks in recommendation systems, making them suitable for validating the performance of new methods. Their characteristics (e.g., varying number of users/items, high sparsity) reflect real-world e-commerce interaction data.
5.2. Evaluation Metrics
The performance of sequential recommendation methods is evaluated using two widely adopted metrics: Recall@K and NDCG@K, with $K \in \{5, 10\}$. The evaluation uses full ranking over the entire item set to avoid biases from item sampling. Constrained decoding with a Trie is applied during autoregressive generation to ensure only valid item identifiers are generated, with a beam size of 20.
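As an illustration of how such constrained decoding can restrict generation to valid identifiers, here is a tiny prefix-trie sketch; the 3-token IDs and function names are hypothetical and not the paper's implementation.

```python
from collections import defaultdict

def build_trie(identifiers):
    """Map each observed prefix to the set of tokens that may legally follow it."""
    trie = defaultdict(set)
    for ident in identifiers:
        for i, tok in enumerate(ident):
            trie[tuple(ident[:i])].add(tok)
    return trie

valid_items = [(3, 7, 1), (3, 7, 2), (5, 0, 9)]   # hypothetical 3-token semantic IDs
trie = build_trie(valid_items)

prefix = (3, 7)
print(sorted(trie[prefix]))   # [1, 2]: only tokens completing a real item are allowed
```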
Recall@K
- Conceptual Definition:
Recall@Kmeasures the proportion of truly relevant items that are successfully retrieved within the top recommendations. It focuses on the completeness of the recommendations, indicating how many of the items a user would actually interact with are present in the recommended list. - Mathematical Formula: $ \text{Recall}@K = \frac{\text{Number of relevant items in top-K recommendations}}{\text{Total number of relevant items}} $
- Symbol Explanation:
Number of relevant items in top-K recommendations: The count of actual next items (ground truth) that appear in the model's top predicted items.Total number of relevant items: In theleave-one-outsetting fornext-item recommendation, this is typically 1 (the single next item the user interacted with). If there are multiple relevant items, it would be the total count.
NDCG@K (Normalized Discounted Cumulative Gain at K)
- Conceptual Definition:
NDCG@Kis a measure of ranking quality that takes into account the position of relevant items. It assigns higher scores to relevant items that appear at higher ranks (closer to the top of the list) and penalizes relevant items that appear lower. It also handles graded relevance, though innext-item recommendation, relevance is often binary (1 for the true next item, 0 otherwise). - Mathematical Formula:
$
\text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}
$
Where:
$
\text{DCG}@K = \sum_{i=1}^K \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}
$
And
IDCG@Kis theIdeal DCG@K, which is the maximum possibleDCG@Kachievable by ranking all relevant items at the top. - Symbol Explanation:
- :
Discounted Cumulative Gainat rank . It sums the relevance scores of items in the recommended list, with a logarithmic discount applied to items at lower ranks. - :
Ideal Discounted Cumulative Gainat rank . This is theDCG@Kcalculated for the perfect ranking, where all relevant items are placed at the top of the list. It serves as a normalization factor. - : The relevance score of the item at position in the recommended list. In binary relevance settings (like
next-item prediction), is 1 if the item at rank is the true next item, and 0 otherwise. - : The number of top items considered in the recommendation list.
- : The logarithmic discount factor, which reduces the contribution of relevant items as their rank increases.
- :
5.3. Baselines
The BLOGER framework is compared against two categories of baseline models:
(1) Traditional Recommender Models:
- MF [19]
- LightGCN [13]
- Caser [38]
- GRU4Rec [15]
- HGN [27]
- SASRec [18]
These models represent various popular approaches, including matrix factorization, graph neural networks, and sequential deep learning models. The implementations for these methods follow those provided by [53].
(2) Generative Recommender Models:
-
P5[10]: AnLLM-basedgenerative recommenderthat integratescollaborative knowledge. -
TIGER[34]: Agenerative retrievalmethod that usestext embeddingsto constructsemantic IDsfor items and aTransformerforsequential recommendation. -
LETTER[40]: An enhancement ofTIGERthat introduces a learnabletokenizerwithhierarchical semantics,collaborative signals, anddiversity regularization. -
ETEGRec[22]: Agenerative recommenderwithend-to-end learnable item tokenizationthat usesauxiliary lossesandalternating optimization.For
generative methods, a unifiedrecommenderconfiguration is adopted for fair comparison.ETEGRecuses pretrainedSASRec embeddingsas semantic embeddings, while other generative methods (includingBLOGER) utilizeLLaMA2-7B[39] to generatetext embeddingsfrom item titles and descriptions.
5.4. Implementation Details
- Recommender Configuration:
Transformer-basedencoder-decoder(likeT5).- Layers: 4 layers for both
encoderanddecoder. - Model Dimension: 128.
- Dropout Rate: 0.1.
- Attention Heads: 6 heads per layer, each with a dimension of 64.
- MLP Hidden Dimension: 1024.
- Activation:
ReLU.
- Item Tokenization (RQ-VAE):
- Codebook Levels (L): 4.
- Codebook Initialization:
k-meansclustering. - Codebook Size (K): 256 code embeddings per
codebook. - Code Embedding Dimension: 32.
- Training Details for BLOGER:
- Tokenizer Pretraining:
- Epochs: 20,000.
- Batch Size: 1,024.
- Optimizer:
AdamW[25]. - Learning Rate: 1e-3.
- Weight Decay: 1e-4.
- Bi-Level Optimization (Main Training):
- Batch Size: 256.
- Optimizer:
AdamW. - Early Stopping: Patience of 20 epochs, based on validation
recommendation loss. - Learning Rates:
Generative recommender(): 5e-4.Item tokenizer(): 1e-4.
- Hyperparameter : Tuned over the set . This parameter controls the weight of the
tokenization lossin theouter-levelobjective. - Hyperparameter (Algorithm 1): Tuned such that the
tokenizerupdate frequency occurs times per epoch. This controls how often theouter-leveloptimization happens.
- Tokenizer Pretraining:
- Baseline Hyperparameters: Searched within the ranges defined in their respective papers.
6. Results & Analysis
6.1. Core Results Analysis
The overall performance of BLOGER compared to baselines is presented in Table 3.
| Dataset | Metric | MF | LightGCN | Caser | GRU4Rec | HGN | SASRec | P5 | TIGER | LETTER | ETEGRec | BLOGER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | R@5 | 0.0226 | 0.0208 | 0.0110 | 0.0229 | 0.0381 | 0.0372 | 0.0400 | 0.0384 | 0.0424 | 0.0413 | 0.0444 |
| Beauty | N@5 | 0.0142 | 0.0127 | 0.0072 | 0.0155 | 0.0241 | 0.0237 | 0.0274 | 0.0256 | 0.0264 | 0.0271 | 0.0293 |
| Beauty | R@10 | 0.0389 | 0.0351 | 0.0186 | 0.0327 | 0.0558 | 0.0602 | 0.0590 | 0.0609 | 0.0612 | 0.0632 | 0.0654 |
| Beauty | N@10 | 0.0195 | 0.0174 | 0.0096 | 0.0186 | 0.0297 | 0.0311 | 0.0335 | 0.0329 | 0.0334 | 0.0342 | 0.0361 |
| Instruments | R@5 | 0.0456 | 0.0722 | 0.0425 | 0.0559 | 0.0797 | 0.0709 | 0.0809 | 0.0865 | 0.0872 | 0.0878 | 0.0882 |
| Instruments | N@5 | 0.0376 | 0.0608 | 0.0348 | 0.0459 | 0.0676 | 0.0565 | 0.0695 | 0.0736 | 0.0747 | 0.0745 | 0.0753 |
| Instruments | R@10 | 0.0511 | 0.0887 | 0.0528 | 0.0702 | 0.0967 | 0.0922 | 0.0987 | 0.1062 | 0.1082 | 0.1079 | 0.1100 |
| Instruments | N@10 | 0.0394 | 0.0661 | 0.0381 | 0.0505 | 0.0731 | 0.0633 | 0.0751 | 0.0799 | 0.0814 | 0.0810 | 0.0822 |
| Arts | R@5 | 0.0473 | 0.0464 | 0.0571 | 0.0669 | 0.0667 | 0.0758 | 0.0724 | 0.0807 | 0.0841 | 0.0860 | 0.0879 |
| Arts | N@5 | 0.0271 | 0.0244 | 0.0407 | 0.0518 | 0.0516 | 0.0632 | 0.0607 | 0.0640 | 0.0675 | 0.0687 | 0.0703 |
| Arts | R@10 | 0.0753 | 0.0755 | 0.0781 | 0.0834 | 0.0910 | 0.0945 | 0.0902 | 0.1017 | 0.1081 | 0.1084 | 0.1108 |
| Arts | N@10 | 0.0361 | 0.0338 | 0.0474 | 0.0523 | 0.0595 | 0.0693 | 0.0664 | 0.0707 | 0.0752 | 0.0759 | 0.0777 |
Table 3: The overall performance comparisons between the baselines and BLOGER. Recall@K and NDCG@K are abbreviated as R@K and N@K, respectively. The best and second-best results are highlighted in bold and underlined font, respectively.
Key observations from the table:
- BLOGER's Superiority: BLOGER consistently achieves the best performance across all datasets (Beauty, Instruments, Arts) and all metrics (Recall@5, NDCG@5, Recall@10, NDCG@10). This strongly validates the effectiveness of its bi-level optimization formulation and tailored learning strategy. The fact that BLOGER achieves this without auxiliary alignment losses, using only the recommendation loss for direct tokenizer supervision, supports the intuition that this direct alignment is highly effective.
- Generative vs. Traditional Methods: Generative recommendation methods (P5, TIGER, LETTER, ETEGRec, BLOGER) generally outperform traditional recommendation methods (MF, LightGCN, Caser, GRU4Rec, HGN, SASRec). This is attributed to generative models' ability to capture hierarchical semantic structures and differentiate items with fine-grained representations, especially when incorporating RQ-VAE for item tokenization. Traditional methods, relying on discrete ID representations, are limited in this aspect.
- Improvements within Generative Methods: LETTER and ETEGRec show performance improvements over TIGER. LETTER benefits from integrating collaborative embeddings and diversity regularization into its tokenizer. ETEGRec improves through alternating optimization and auxiliary losses that align tokenizer and recommender intermediate representations. BLOGER further surpasses these methods, indicating that its bi-level optimization and direct recommendation loss guidance for tokenization are more effective than alternating optimization with auxiliary losses or sequential optimization.
6.2. Ablation Study (RQ2)
To understand the contribution of each component, an ablation study was conducted on the Beauty dataset. Various BLOGER variants were tested:
- w/ LETTER: Integrates the LETTER tokenizer (with its collaborative regularization) into BLOGER.
- w/o GS: BLOGER without the gradient surgery operation (Equation 17).
- Joint: Combines recommendation loss and tokenization loss into a single joint optimization for both models, removing the bi-level optimization structure.
- Joint w/ GS: Applies gradient surgery on top of the Joint optimization.
- Fixed: The tokenizer is fixed (pretrained and then frozen), and only the recommender is optimized, similar to TIGER.
The following are the results from Table 4 of the original paper:

| Method | Recall@5 | NDCG@5 | Recall@10 | NDCG@10 |
|---|---|---|---|---|
| BLOGER | 0.0444 | 0.0293 | 0.0654 | 0.0361 |
| w/ LETTER | 0.0489 | 0.0322 | 0.0732 | 0.0400 |
| w/o GS | 0.0329 | 0.0209 | 0.0530 | 0.0274 |
| Joint | 0.0418 | 0.0272 | 0.0637 | 0.0342 |
| Joint w/ GS | 0.0412 | 0.0273 | 0.0631 | 0.0343 |
| Fixed | 0.0384 | 0.0256 | 0.0609 | 0.0329 |
Table 4: Results of the ablation study for BLOGER on Beauty.
Observations:
- w/ LETTER Performance: The w/ LETTER variant achieves even better performance than BLOGER alone. This suggests that the benefits of BLOGER (bi-level optimization, direct alignment) are complementary to more advanced tokenizer designs (like LETTER's collaborative regularization). This confirms BLOGER's generalizability across diverse tokenizer backbones.
- Importance of Gradient Surgery (w/o GS): Removing gradient surgery (w/o GS) leads to a significant performance degradation across all metrics. This indicates that without mitigating gradient conflicts, the direct supervision from the recommendation objective to the tokenizer becomes ineffective, and naive joint optimization of conflicting gradients can harm performance.
- Effectiveness of Bi-Level Optimization vs. Joint:
  - Joint optimization (without bi-level) outperforms the Fixed tokenizer setting, as it allows for tokenizer refinement.
  - However, Joint optimization still lags behind BLOGER. This underscores the effectiveness of bi-level optimization and meta-gradients in properly coordinating the learning of the two interdependent components, which is not achieved by simple joint training.
  - Adding gradient surgery to Joint (Joint w/ GS) does not bring significant gains, possibly because meta-gradient signals are absent, hindering the coordination needed for gradient surgery to be fully effective in this context.
6.3. Hyper-Parameter Analysis (RQ3)
The impact of two key hyper-parameters, $\lambda$ (balancing the recommendation loss and tokenization loss) and $M$ (tokenizer update frequency per epoch), is analyzed.
The results for varying $\lambda$ (from 1e-4 to 5) and $M$ (from 1 to 10 updates per epoch) are summarized in Figure 3.

This figure consists of two line charts showing BLOGER's performance under different hyper-parameter values. In the left chart, the x-axis is the loss weight $\lambda$ and the y-axis shows Recall@5 and NDCG@5; in the right chart, the x-axis is the update frequency of the tokenizer parameters and the y-axis again shows Recall@5 and NDCG@5. The red line denotes Recall@5 and the blue line denotes NDCG@5, reflecting the impact of the hyper-parameters on recommendation performance.
Figure 3: Results of the performance of BLOGER across different values of hyper-parameters. $\lambda$ controls the balance between the recommendation loss and tokenization loss in the outer level, and $M$ represents the update frequency of the tokenizer parameters $\phi$.
Observations from Figure 3:
- Impact of the loss weight:
  - When the loss weight is too small (e.g., 1e-4, 1e-3), the tokenization loss has insufficient weight. The recommendation loss can then dominate and pull token representations too far from their intrinsic quality, disrupting the tokenizer's ability to maintain a good identifier structure.
  - Conversely, when the loss weight is too large (e.g., 5), the tokenization loss becomes too dominant, and the recommendation loss loses its ability to effectively supervise and align the tokenizer with the recommendation objective.
  - The empirical evidence suggests that an intermediate value strikes the best balance, preserving identifier structure while allowing effective recommendation-driven supervision.
- Impact of the tokenizer update frequency:
  - BLOGER shows stable performance when the number of tokenizer updates per epoch is between 1 and 4. Updating the tokenizer less frequently gives the recommender more steps between updates, allowing it to optimize sufficiently and provide stable, reliable meta-gradients to guide the tokenizer.
  - When the update frequency is too high (e.g., 8-10 updates per epoch), the recommender does not get enough iterations to optimize between tokenizer updates. This leads to unstable meta-gradients, which in turn undermine tokenizer training and degrade overall performance.
6.4. In-Depth Analysis (RQ4 & RQ5)
6.4.1 Time Efficiency
A comparison of the average training and testing time per epoch for BLOGER and TIGER (a comparable generative method) was conducted on an NVIDIA RTX 3090 GPU.
| Method | Train (s/epoch) | Test (s/epoch) |
| --- | --- | --- |
| TIGER | 23.7 | 66.7 |
| BLOGER | 26.6 | 66.9 |
Table 5: Results of the time efficiency comparison between BLOGER and TIGER, where the average training and testing time per epoch are reported.
Observations:
- BLOGER incurs only a slight increase in training time (from 23.7 s/epoch for TIGER to 26.6 s/epoch for BLOGER).
- The testing time remains nearly identical (66.7 s/epoch for TIGER vs. 66.9 s/epoch for BLOGER), as both methods use greedy decoding at test time.

These findings confirm the complexity analysis (Section 3.2.4), demonstrating that BLOGER's bi-level optimization strategy introduces minimal additional computational overhead and maintains practical efficiency.
6.4.2 Visualization Analysis
To understand why BLOGER performs better, a visualization analysis of the learned tokenizer was performed on the Beauty dataset. This involved examining the utilization patterns of the codebook at each level.
Two metrics were used:
- Density: The ratio of activated codewords to the total number of codewords in a codebook. A higher density means more codewords are being used.
- Entropy: Measures the distribution of codeword usage, indicating how evenly the codewords are utilized. Higher entropy means codewords are used more uniformly, suggesting better coverage and less redundancy. (A minimal computation sketch for both metrics is shown below.)

For comparison, TIGER (which can be seen as a BLOGER variant without the bi-level optimization designs) was also analyzed.
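As an illustration only, the two metrics could be computed from the per-level codeword assignments as in the sketch below; this is a re-implementation of the definitions above, not the authors' code.

```python
import numpy as np

def codebook_utilization(assignments: np.ndarray, codebook_size: int):
    """Compute Density and Entropy of codeword usage for one codebook level.

    `assignments` holds the codeword index chosen for every item at this
    quantization level. Density is the fraction of codewords used at least
    once; Entropy (in bits) is maximal when usage is perfectly uniform.
    """
    counts = np.bincount(assignments, minlength=codebook_size)
    density = np.count_nonzero(counts) / codebook_size
    probs = counts[counts > 0] / counts.sum()
    entropy = float(-(probs * np.log2(probs)).sum())
    return density, entropy
```

For a codebook with K codewords, the maximum possible entropy is log2(K), so higher values indicate more uniform codeword usage.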
Figure 4: Codebook utilization comparison between TIGER and BLOGER, where "Codebook i" denotes the i-th codebook in the residual quantization process. "Density" is defined as the ratio of activated codewords to the total number of codewords, and "Entropy" measures the distribution of codeword usage, indicating how evenly the codewords are utilized.
Observations from Figure 4:
- First-Level Codebook Improvement: BLOGER shows a significant improvement over TIGER specifically in the first-level codebook. Density increases from 47% (TIGER) to 65% (BLOGER), and Entropy increases from 6.05 (TIGER) to 7.14 (BLOGER). This indicates that BLOGER encourages more diverse and uniform usage of codewords at the coarsest granularity level.
- Subsequent Levels: No substantial differences are observed in the subsequent levels (Codebooks 2, 3, and 4).
- Reasoning: The paper attributes this to the first-level codebook carrying the most information due to the residual quantization process (see the sketch below), which makes it more responsive to additional supervision. The recommendation loss, acting as a natural indicator of collaborative information, effectively guides the tokenizer at this crucial first level. It helps alleviate the coarse-grained codebook allocation bias that can arise from relying solely on textual information (as TIGER largely does). By incorporating this collaborative signal via the bi-level optimization framework, the tokenizer learns better item representations, leading to improved recommendation performance.

This analysis provides deeper insight into the mechanism behind BLOGER's effectiveness: the performance gains are rooted in a more robust and collaboratively aligned item tokenization, particularly at the fundamental first level of granularity.
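To make the residual-quantization intuition concrete, here is a minimal sketch with a simple nearest-codeword assignment; the shapes and interface are assumptions for illustration. Each level quantizes only the residual left by the previous level, so the first codebook captures the largest, coarsest-grained component of the item embedding.

```python
import torch

def residual_quantize(embedding: torch.Tensor, codebooks: list) -> list:
    """Illustrative residual quantization of one item embedding.

    `codebooks` is a list of tensors of shape (num_codewords, dim). Each
    level encodes the residual left by the previous level, which is why the
    first codebook carries the most information about the item.
    """
    residual = embedding
    codes = []
    for codebook in codebooks:
        # Nearest codeword by squared Euclidean distance.
        distances = ((codebook - residual) ** 2).sum(dim=1)
        idx = int(distances.argmin())
        codes.append(idx)
        residual = residual - codebook[idx]  # only the leftover is passed on
    return codes
```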
7. Conclusion & Reflections
7.1. Conclusion Summary
This work successfully reformulates generative recommendation as a bi-level optimization problem, explicitly addressing the crucial interdependence between item tokenization and autoregressive generation. The proposed BLOGER framework efficiently solves this problem by integrating meta-learning for the nested optimization and gradient surgery to resolve gradient conflicts between the tokenization and recommendation objectives. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods without significant additional computational overhead. A key finding is that the recommendation loss provides effective collaborative signals that improve codebook utilization in the tokenizer, particularly at the first (coarsest) level, thereby bridging the gap between item tokenization and effective autoregressive generation.
7.2. Limitations & Future Work
The authors identify several directions for future research:
- Generalization Capabilities: Further validation of BLOGER's generalization on more diverse datasets, especially large-scale industrial datasets, is planned to assess its scalability and practical applicability.
- Practical Deployment: Investigating practical deployment considerations to facilitate the adoption of BLOGER in production environments, for example optimizations for real-time inference or integration with existing infrastructure.
- Reasoning Capabilities: Exploring the integration of reasoning capabilities within the recommendation context, possibly by leveraging Large Language Models (LLMs) or other advanced reasoning mechanisms, to enhance the explanatory power or decision-making of the recommender.
7.3. Personal Insights & Critique
This paper presents a highly rigorous and principled approach to a critical problem in generative recommendation. The use of bi-level optimization to unify tokenizer and recommender training is a significant conceptual leap beyond prior sequential or alternating optimization strategies.
Insights:
- Principled Alignment: The bi-level optimization framework offers a theoretically sound way to ensure the tokenizer's output directly serves the recommendation objective. This is a more elegant solution than relying on auxiliary losses or indirect alignment mechanisms.
- Meta-Learning for Efficiency: Employing meta-learning (MAML) is a smart choice for managing the computational complexity inherent in bi-level optimization, making the approach practical for deep learning models.
- Gradient Surgery for Stability: Integrating gradient surgery is crucial for the stability of multi-objective optimization. Gradient conflicts are a common issue in such setups, and explicitly addressing them ensures that tokenizer updates are beneficial, or at least non-detrimental, to both objectives.
- Interpretability of Gains: The visualization analysis, specifically the improved codebook utilization at the first level, provides valuable interpretability. It suggests that the recommendation loss acts as a powerful collaborative signal that refines the initial, coarse-grained item representations, a concrete and understandable mechanism for the performance improvement.
Critique/Areas for Improvement:
- Training Time Overhead: While the paper states "no significant additional computational overhead" and the asymptotic complexity is the same as TIGER, the absolute training time increased by about 12% (23.7 s to 26.6 s per epoch). For very large-scale industrial applications, even a 12% increase can be substantial, especially over many epochs. Further optimization or distributed training strategies might be needed for true industrial deployment.
- Initial Semantic Embeddings: BLOGER relies on initial semantic embeddings (e.g., LLaMA2-7B text embeddings, or SASRec embeddings for ETEGRec). The overall performance is still bounded by the quality of these initial representations. Future work could explore jointly learning these initial embeddings within the bi-level framework or making the tokenizer less dependent on their pre-trained quality.
- Codebook Design: The RQVAE structure is effective, but exploring dynamic codebook sizes, adaptive quantization levels, or more advanced codebook learning mechanisms could yield further improvements. The observation that only the first codebook significantly improves might suggest that the higher levels are already well-optimized or less sensitive to recommendation-loss guidance.
- Generalizability to Diverse Data: The datasets used are from Amazon reviews, which are typical e-commerce interaction datasets. While robust, testing on datasets with different characteristics (e.g., highly sparse social networks, content-rich media platforms) could further stress-test the framework's versatility.
- Long-Term Effects: The paper focuses on next-item recommendation. Exploring how BLOGER impacts long-term recommendation performance or user satisfaction over extended periods would be valuable.

This paper provides a strong foundation for future research in generative recommendation, particularly in developing more tightly integrated and end-to-end optimized systems. The methodological rigor and clear experimental validation make it a significant contribution to the field.