
Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model

Published: 04/23/2025

TL;DR Summary

The study introduces the Unified Generative Recommendation Framework (UniGRF) to unify retrieval and ranking stages in recommendation systems, enhancing performance through information sharing and dynamic optimization, outperforming existing models in extensive experiments.

Abstract

In recommendation systems, the traditional multi-stage paradigm, which includes retrieval and ranking, often suffers from information loss between stages and diminishes performance. Recent advances in generative models, inspired by natural language processing, suggest the potential for unifying these stages to mitigate such loss. This paper presents the Unified Generative Recommendation Framework (UniGRF), a novel approach that integrates retrieval and ranking into a single generative model. By treating both stages as sequence generation tasks, UniGRF enables sufficient information sharing without additional computational costs, while remaining model-agnostic. To enhance inter-stage collaboration, UniGRF introduces a ranking-driven enhancer module that leverages the precision of the ranking stage to refine retrieval processes, creating an enhancement loop. Besides, a gradient-guided adaptive weighter is incorporated to dynamically balance the optimization of retrieval and ranking, ensuring synchronized performance improvements. Extensive experiments demonstrate that UniGRF significantly outperforms existing models on benchmark datasets, confirming its effectiveness in facilitating information transfer. Ablation studies and further experiments reveal that UniGRF not only promotes efficient collaboration between stages but also achieves synchronized optimization. UniGRF provides an effective, scalable, and compatible framework for generative recommendation systems.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is the unification of the retrieval and ranking stages in recommendation systems into a single generative model, aiming to overcome information loss and improve overall performance. The title, Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model, succinctly captures this dual objective.

1.2. Authors

The paper lists several authors from various institutions:

  • Luankang Zhang, Hao Wang*, Defu Lian, Enhong Chen: University of Science and Technology of China, Hefei, Anhui, China. Hao Wang is marked with an asterisk, typically indicating a corresponding author.

  • Kenan Song, Yi Quan Lee, Wei Guo, Huifeng Guo, Yong Liu: Huawei Noah's Ark Lab, with Kenan Song, Yi Quan Lee, Wei Guo, and Yong Liu based in Singapore and Huifeng Guo based in Shenzhen, China.

  • Yawen Li: Beijing University of Posts and Telecommunications, Beijing, China.

    The affiliations suggest a collaborative effort between academia (University of Science and Technology of China, Beijing University of Posts and Telecommunications) and industry research (Huawei Noah's Ark Lab), indicating a blend of theoretical rigor and practical application focus.

1.3. Journal/Conference

The paper's publication line reads In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX, which is the unmodified ACM template placeholder. This indicates that it is intended for publication at an ACM conference. ACM (Association for Computing Machinery) conferences are highly reputable venues in computer science, including information systems and recommendation systems, suggesting a peer-reviewed publication of significant standing in the field. The placeholder Conference acronym 'XX implies the specific conference name is yet to be confirmed or was omitted in the preprint.

1.4. Publication Year

The paper's metadata lists Published at (UTC): 2025-04-23T06:43:54.000Z, indicating a preprint publication date in April 2025.

1.5. Abstract

The paper addresses the challenge of information loss and diminished performance in traditional multi-stage recommendation systems (retrieval and ranking). Inspired by advances in generative models from natural language processing, it proposes the Unified Generative Recommendation Framework (UniGRF). UniGRF integrates both retrieval and ranking into a single generative model by treating them as sequence generation tasks, ensuring sufficient information sharing without additional computational costs and maintaining model-agnosticism. To enhance collaboration, it introduces a ranking-driven enhancer module that uses the ranking stage's precision to refine retrieval processes, creating an enhancement loop. Additionally, a gradient-guided adaptive weighter dynamically balances the optimization of both stages for synchronized performance improvements. Extensive experiments demonstrate UniGRF's superior performance over existing models on benchmark datasets, confirming its effectiveness in information transfer, inter-stage collaboration, and synchronized optimization. The paper concludes that UniGRF is an effective, scalable, and compatible framework for generative recommendation systems.

  • Original Source Link: https://arxiv.org/abs/2504.16454
  • PDF Link: https://arxiv.org/pdf/2504.16454v1.pdf The paper is currently available as a preprint on arXiv (version 1), indicating it has been submitted for peer review but its final, officially published version might still undergo revisions or appear in the specified ACM conference.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the information loss and diminished performance inherent in traditional multi-stage recommendation systems, particularly between the retrieval and ranking stages. In industrial applications, recommendation systems typically operate in two main stages:

  1. Retrieval Stage: A low-precision model efficiently sifts through a vast item pool (all available items) to select a smaller candidate set of potentially relevant items. This stage prioritizes efficiency.

  2. Ranking Stage: A higher-precision model then accurately predicts user preferences within this smaller candidate set to recommend the most suitable items. This stage prioritizes accuracy.

    This multi-stage paradigm, while practical for efficiency, suffers from several issues:

  • Information Loss: Training separate models for each stage means information learned by one stage might not be fully utilized by the other. For example, the ranking model's fine-grained understanding of user preferences might not directly inform or improve the retrieval model's initial selection.

  • Data Bias: The candidate set generated by the retrieval stage introduces sample selection bias for the ranking stage, as the ranking model only sees a subset of items, not the entire item pool.

  • Lack of Collaboration: Traditional approaches often train these stages independently or use basic mechanisms for information transfer (e.g., data or loss perspectives), which are insufficient to address the deep interdependencies.

    Recent advancements in generative models, particularly large language models (LLMs) inspired by natural language processing (NLP), have shown great promise in recommendation systems. These autoregressive paradigms predict sequences of items, similar to how LLMs generate text. However, current generative recommendation models often operate in a single-stage manner or still require separate training for each stage, perpetuating the information loss problem. The paper identifies two key challenges in unifying these stages within a single generative model:

  1. Efficient Collaboration Mechanism: Retrieval and ranking are inherently interdependent. A unified model needs a mechanism to leverage these dependencies for mutual enhancement.
  2. Synchronized Optimization: Different stages often have different loss functions and convergence rates. Without careful management, optimizing one stage might negatively impact the other (e.g., ranking performance declining while retrieval is still improving), preventing optimal overall performance.

2.2. Main Contributions / Findings

The paper proposes the Unified Generative Recommendation Framework (UniGRF) to address these challenges. Its primary contributions are:

  1. Novel Unified Generative Framework: UniGRF is presented as the first framework to integrate retrieval and ranking into a single generative model by reformulating both tasks as sequence generation tasks. This approach inherently enables sufficient information sharing through shared parameters and reduces time and space complexity by half compared to separate models, while being model-agnostic (can integrate with any autoregressive generative model).

  2. Ranking-Driven Enhancer Module: To facilitate inter-stage collaboration, this module leverages the high precision of the ranking stage to generate high-quality samples for the retrieval stage. It includes:

    • Hard-to-Detect Disliked Items Generator: Identifies challenging negative samples (items highly scored by retrieval but low by ranking) to improve retrieval training effectiveness.
    • Potential Favorite Items Generator: Corrects mislabeling by identifying items that are actually liked by the user (high ranking score) but were initially sampled as negatives, reassigning them as positive examples. This creates a mutually reinforcing loop with minimal computational overhead.
  3. Gradient-Guided Adaptive Weighter: To ensure synchronized optimization across stages, this module dynamically monitors the gradient update rates (approximated by loss decrease rates) of retrieval and ranking. It adaptively adjusts the weights of their respective loss functions to balance their convergence speeds, leading to simultaneous performance improvements for both stages.

  4. Extensive Experimental Validation: Through comprehensive experiments on MovieLens-1M, MovieLens-20M, and Amazon-Books datasets, UniGRF (instantiated with HSTU and Llama architectures) demonstrates significant outperformance over state-of-the-art baselines in both retrieval and ranking metrics. Ablation studies confirm the effectiveness of the enhancer and weighter modules, and further analysis highlights UniGRF's ability to achieve synchronized optimization and follow scaling laws.

    In summary, the key findings are that unifying retrieval and ranking within a single generative model, when coupled with intelligent collaboration mechanisms and synchronized optimization strategies, effectively mitigates information loss and significantly boosts recommendation performance, particularly in the ranking stage, which is crucial for overall quality.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following foundational concepts in recommendation systems and deep learning:

  • Recommendation Systems: Systems designed to predict user preferences for items (e.g., movies, products, articles) and suggest items that users might like. They typically rely on historical interaction data.
  • Multi-stage Paradigm (Retrieval and Ranking): A common architecture in large-scale recommendation systems due to computational constraints.
    • Retrieval Stage: The initial stage responsible for efficiently narrowing down a vast item pool (millions to billions of items) to a much smaller candidate set (hundreds to thousands) of potentially relevant items. It prioritizes recall and speed.
    • Ranking Stage: The subsequent stage that takes the candidate set from retrieval and precisely orders these items based on their estimated relevance or user preference. It prioritizes precision and accuracy.
  • Generative Models: A class of artificial intelligence models that learn to generate new data instances that resemble the training data. In the context of recommendation, this means generating a sequence of items that a user might interact with next, rather than just predicting a score for existing items.
  • Autoregressive Paradigm: A type of generative model where the generation of the current element in a sequence depends on the preceding elements. This is commonly seen in natural language processing (NLP) where models predict the next word given the previous words. In recommendation, it means predicting the next item given the user's historical interaction sequence.
  • Transformer Architecture: A neural network architecture introduced in 2017, foundational to modern large language models (LLMs). It relies heavily on a mechanism called self-attention.
    • Self-Attention: A mechanism within Transformers that allows the model to weigh the importance of different parts of an input sequence when processing each element. For a given element in a sequence, self-attention calculates how much it relates to all other elements in the same sequence, allowing it to capture long-range dependencies. The core idea is to compute three vectors for each input token: Query ($Q$), Key ($K$), and Value ($V$). The attention score is calculated as a scaled dot-product between $Q$ and $K$, followed by a softmax function, which is then multiplied by $V$ (a minimal code sketch appears after this list). The formula for Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$: Query matrix, derived from the input embeddings.
      • $K$: Key matrix, derived from the input embeddings.
      • $V$: Value matrix, derived from the input embeddings.
      • $d_k$: Dimensionality of the key vectors, used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
      • $QK^T$: Dot product between Query and Key matrices, representing similarity.
      • $\mathrm{softmax}(\cdot)$: Normalization function that converts scores into probabilities.
  • Embeddings: Low-dimensional vector representations of discrete entities (like users or items) that capture their semantic meaning. Items and users are mapped from sparse IDs to dense vectors, allowing neural networks to process them.
  • Loss Functions: Mathematical functions that quantify the difference between the predicted output of a model and the true target output. The goal of training is to minimize this loss.
    • Sampled Softmax Loss: A variant of softmax loss used when the number of possible output classes (e.g., items in a large item pool) is extremely large. Instead of calculating probabilities over all possible items, it calculates softmax over the true positive item and a randomly sampled subset of negative items. This significantly reduces computational cost.
    • Binary Cross-Entropy (BCE) Loss: A loss function commonly used for binary classification tasks (e.g., predicting click or no click, 0 or 1). It measures the difference between two probability distributions. For a true label $y \in \{0, 1\}$ and a predicted probability $p \in [0, 1]$, the BCELoss is: $ \mathcal{L}_{\mathrm{BCE}}(y, p) = -(y \log(p) + (1-y) \log(1-p)) $ Where:
      • $y$: The true binary label (e.g., 1 for click, 0 for no click).
      • $p$: The predicted probability of a positive outcome (e.g., click).
  • Gradient Descent and Backpropagation: The fundamental algorithms used to train neural networks. Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize the loss function by moving in the direction opposite to the gradient of the loss. Backpropagation is the method used to efficiently calculate these gradients.
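
To make the self-attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention; the toy dimensions and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key dimension
    return weights @ V                                # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)  # (4, 8)
```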

3.2. Previous Works

The paper contextualizes its work by discussing various prior studies in retrieval, ranking, and cascade frameworks, highlighting the evolution towards generative models and multi-stage integration.

3.2.1. Retrieval Models for Recommendation

  • Matrix Factorization (MF) [1]: A foundational approach that represents users and items as low-dimensional vectors (embeddings). User-item compatibility is determined by the inner product of their respective vectors. BPR-MF [37] is a prominent example, using Bayesian Personalized Ranking to optimize for pairwise preferences.
  • Sequential Recommendation Models: These models treat user behavior as a sequence of interactions and use techniques from language modeling to predict the next item.
    • GRU4Rec [17]: One of the early works using recurrent neural networks (RNNs) with Gated Recurrent Units (GRUs) to capture short-term user preferences from sequences. RNNs process sequences by maintaining a hidden state that is updated at each step, making them suitable for sequential data. GRUs are a type of RNN that help address the vanishing gradient problem.
    • SASRec [23]: Introduced Transformer-based self-attention mechanisms to model dynamic interest patterns in user sequences, allowing it to capture long-range dependencies more effectively than RNNs.
    • BERT4Rec [43]: Adapted the BERT (Bidirectional Encoder Representations from Transformers) architecture for sequential recommendation, using a bidirectional self-attention mechanism.
  • LLM-influenced Retrieval: Recent trends leverage Large Language Models (LLMs) to enhance user and item representation:
    • RLMRec [36]: Uses LLMs for semantic representation to improve item vector quality.
    • LLMRec [59]: Augments user-item interaction data using LLMs alongside internal semantic representations.
  • Generative Retrieval Models: The shift towards generative models in recommendation.
    • HSTU (Hierarchical Sequential Transduction Units) [67]: A model by Meta that demonstrated model capabilities scale with parameters, suggesting ranking and retrieval tasks can be reframed as causal autoregressive sequence-to-sequence generation problems. It introduced a point-wise aggregated attention mechanism and M-FALCON algorithm for efficiency.
    • HLLM (Hierarchical Large Language Model) [3]: A two-layer framework where an Item LLM integrates text features, and a User LLM predicts future interests.

3.2.2. Ranking Models for Recommendation

  • Feature Interaction Models: Early models focused on explicitly or implicitly modeling interactions between various features (user, item, context).
    • DCN (Deep & Cross Network) [56]: Combines a deep neural network (for complex non-linear interactions) with a cross network (for explicit feature interactions).
    • DeepFM [10]: Integrates Factorization Machines (for capturing second-order feature interactions) and deep neural networks (for higher-order interactions).
    • AutoInt [42]: Uses a self-attention network to implicitly model feature interactions.
    • GDCN (Gated Deep & Cross Network) [48]: Enhances DCN with a gated cross network to capture more complex relationships.
  • Sequential Ranking Models: Adapt autoregressive paradigms for ranking.
    • DIN (Deep Interest Network) [75]: Introduced a target-aware attention mechanism that dynamically weighs parts of a user's behavior sequence based on the candidate item being ranked, modeling diverse user interests.
    • DIEN (Deep Interest Evolution Network) [74]: Improved upon DIN by incorporating GRUs to model the evolution of a user's interests over time within their sequence.
    • SIM (Search Interest Model) [33]: Uses search techniques on long sequences to identify relevant behaviors and reduce noise for user representation.
  • LLM-influenced Ranking:
    • BAHE [9]: Uses Transformer blocks of an LLM to generate improved user behavior representations, with final blocks fine-tuned for CTR prediction.
    • HSTU [67] also adapted for ranking by appending candidate items to user sequences and adding MLP and prediction heads.
  • Llama [45]: While originally an LLM for NLP, its Transformer-based architecture has been adapted for recommendation tasks, leveraging its strong sequence modeling capabilities in a generative setting.

3.2.3. Cascade Framework for Retrievers and Rankers

These frameworks focus on improving information transfer between separately trained retrieval and ranking models to reduce sample bias.

  • Joint Optimization Methods [8, 57]: Aim to connect stage-specific objectives and jointly optimize them to balance efficiency and effectiveness.
  • ECM (Explicit Click Model) [41]: Incorporates unexposed data and refines the task into multi-label classification to address sample selection bias.
  • RankFlow [34]: Trains models for different stages separately, then jointly optimizes them by minimizing a global loss function to facilitate inter-stage information transfer.
  • CoRR (Cooperative Retriever and Ranker) [20]: Uses Kullback-Leibler (KL) divergence to distill knowledge between stages, aligning them for effective knowledge transfer. The KL divergence measures how one probability distribution diverges from a second, expected probability distribution.

3.3. Technological Evolution

The evolution of recommendation systems, as highlighted by the paper, has progressed through several phases:

  1. Early Systems (e.g., MF): Focused on collaborative filtering, representing users and items as static embeddings.

  2. Sequential Models (e.g., GRU4Rec, SASRec): Recognized the importance of interaction order, adopting RNNs and later Transformers to model user behavior sequences, moving towards autoregressive prediction.

  3. Large Generative Models (e.g., HSTU, HLLM, Llama): Inspired by LLMs, these models reformulate recommendation as a sequence generation task, leveraging scaling laws (performance improves with more parameters) to achieve state-of-the-art results.

  4. Multi-stage Optimization (e.g., RankFlow, CoRR): Addressed the practical cascade framework by attempting to mitigate information loss and bias between retrieval and ranking, but often still treating them as separate models with explicit transfer mechanisms.

    This paper's work, UniGRF, fits into the most recent wave, pushing the boundaries of generative recommendation. It directly tackles the multi-stage problem within the generative paradigm, aiming to unify retrieval and ranking into a single generative model, rather than just improving transfer between two distinct ones.

3.4. Differentiation Analysis

Compared to the main methods in related work, UniGRF offers several core differences and innovations:

  • Unified Generative Paradigm: Unlike HSTU and HLLM which are generative models but typically applied to a single stage or require separate training, UniGRF is the first to explicitly unify both retrieval and ranking within a single generative model. This fundamentally changes how information sharing occurs, moving from explicit transfer mechanisms to inherent sharing through shared parameters.

  • Inherent Information Sharing vs. Explicit Transfer: Traditional cascade frameworks like RankFlow and CoRR focus on information transfer (e.g., knowledge distillation, global loss optimization) between separate retrieval and ranking models. UniGRF eliminates the need for complex explicit transfer mechanisms by using a single model where parameters are fully shared, allowing for more sufficient information flow without additional computational overhead.

  • Addressing Data Bias and Information Loss at the Source: By generating negative samples from the entire item pool in the retrieval stage and using a unified model, UniGRF aims to learn the true data distribution, reducing data bias for the ranking stage and mitigating information loss from the outset.

  • Integrated Collaboration and Optimization Mechanisms: UniGRF introduces specific modules designed for a unified generative context:

    • The Ranking-Driven Enhancer proactively generates high-quality samples (hard negatives, potential positives) based on the ranking model's precision, directly refining retrieval training in a mutually reinforcing loop. This is distinct from passive information transfer.
    • The Gradient-Guided Adaptive Weighter directly addresses the challenge of asynchronous convergence speeds when training multiple tasks within a single model. This dynamic loss weighting mechanism is tailored to optimize the unified generative framework's performance synchronously.
  • Model-Agnostic and Scalable Framework: UniGRF is designed as a plug-and-play framework that can integrate with any autoregressive generative model architecture (e.g., HSTU, Llama). This emphasizes its flexibility and scalability, allowing it to leverage scaling laws effectively.

    In essence, UniGRF moves beyond merely connecting retrieval and ranking to truly integrating them within the generative paradigm, offering a more holistic and efficient solution to the multi-stage recommendation problem.

4. Methodology

The Unified Generative Recommendation Framework (UniGRF) is designed to integrate retrieval and ranking into a single generative model. It achieves this through a novel task reformulation, a ranking-driven enhancer module for inter-stage collaboration, and a gradient-guided adaptive weighter for synchronized optimization. An overview of the UniGRF architecture is shown in the following figure (Figure 1 from the original paper).


Figure 1: UniGRF framework overview. The image is a schematic diagram of the framework structure in the paper, illustrating an overall design of a unified recommendation model, including retrieval and ranking modules, the Ranking-Driven Enhancer, and the Gradient-Guided Adaptive Weighter, highlighting the flow and interaction between components.

4.1. A Unified Generative Recommendation Framework

The core idea of UniGRF is to treat both the retrieval and ranking tasks as sequence generation tasks within a single autoregressive Transformer-based generative model. This allows for inherent information sharing between stages.

For a given user $u$, their interaction history is represented as a sequence $X_u = \{i_1, b_1, \dots, i_n, b_n\}$, where $i_k \in \mathcal{I}$ is the $k$-th interacted item and $b_k \in \mathcal{B} = \{0, 1\}$ is the interaction type (1 for a click, 0 for no click). The sequence length $n$ is fixed, with padding or truncation applied as needed.

4.1.1. Encoding and Generative Process

First, each item $i_k$ and its corresponding interaction type $b_k$ in the user behavior sequence $X_u$ are encoded into embeddings. These embeddings are concatenated to form the input sequence $\mathbf{e}_u = \{\mathbf{e}_1^i, \mathbf{e}_1^b, \ldots, \mathbf{e}_k^i, \mathbf{e}_k^b, \ldots, \mathbf{e}_n^i, \mathbf{e}_n^b\}$.

This input sequence $\mathbf{e}_u$ is then fed into an autoregressive Transformer-based generative model. The Transformer processes these embeddings and produces a sequence of output embeddings:

$$ \mathrm{Transformer}(\mathbf{e}_u) = \{\mathbf{e}_1^{b'}, \mathbf{e}_2^{i'}, \dots, \mathbf{e}_k^{b'}, \mathbf{e}_{k+1}^{i'}, \dots, \mathbf{e}_n^{b'}, \mathbf{e}_{n+1}^{i'}\}, $$

Where:

  • $\mathbf{e}_u$: The input sequence of item and behavior embeddings.

  • $\mathrm{Transformer}(\cdot)$: The Transformer-based generative model.

  • $\mathbf{e}_k^{b'}$: The predicted latent embedding for the click probability of the $k$-th item. This output is used for the ranking stage.

  • $\mathbf{e}_{k+1}^{i'}$: The predicted latent embedding for the $(k+1)$-th item the user will interact with. This output is used for the retrieval stage.

    By strategically distinguishing the output embeddings based on their positions (e.g., $b'$ for click prediction and $i'$ for next item prediction), UniGRF unifies both retrieval and ranking within a single generative model. This design makes UniGRF plug-and-play and model-agnostic, meaning it can seamlessly integrate with any autoregressive generative model architecture. This unification also effectively reduces time and space complexity by half compared to training separate models.
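
As a rough illustration of this unified design, the following PyTorch sketch interleaves item and behavior embeddings into one causal sequence and splits the backbone outputs by position into ranking and retrieval signals. All module names, dimensions, and the use of a generic nn.TransformerEncoder are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedGenerativeBackbone(nn.Module):
    """Minimal sketch: one shared causal Transformer serves retrieval and ranking."""

    def __init__(self, num_items, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        self.behavior_emb = nn.Embedding(2, d_model)           # 0 = no click, 1 = click
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.rank_head = nn.Linear(d_model, 1)                  # plays the role of f_phi

    def forward(self, items, behaviors):
        # Interleave the sequence as [i_1, b_1, ..., i_n, b_n].
        e_i, e_b = self.item_emb(items), self.behavior_emb(behaviors)
        seq = torch.stack([e_i, e_b], dim=2).flatten(1, 2)      # (B, 2n, d)
        length = seq.size(1)
        causal = torch.triu(torch.ones(length, length, dtype=torch.bool), diagonal=1)
        h = self.backbone(seq, mask=causal)
        # Outputs at item positions -> click-probability embeddings e_k^{b'} (ranking).
        click_scores = torch.sigmoid(self.rank_head(h[:, 0::2])).squeeze(-1)
        # Outputs at behavior positions -> next-item embeddings e_{k+1}^{i'} (retrieval).
        next_item_logits = h[:, 1::2] @ self.item_emb.weight.T
        return next_item_logits, click_scores

# Example: batch of 2 users, sequences of length 5 over a pool of 1000 items.
model = UnifiedGenerativeBackbone(num_items=1000)
items = torch.randint(0, 1000, (2, 5))
behaviors = torch.randint(0, 2, (2, 5))
logits, scores = model(items, behaviors)
print(logits.shape, scores.shape)  # torch.Size([2, 5, 1000]) torch.Size([2, 5])
```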

4.1.2. Constraint on the Retrieval Stage in UniGRF

The retrieval stage aims to predict the next item embedding $\mathbf{e}_{k+1}^{i'}$ a user is likely to interact with. To achieve this, the model compares the predicted item embedding $\mathbf{e}_{k+1}^{i'}$ with all items in the item pool $\mathcal{I}$ to identify the most similar items, which form the candidate set $\mathcal{I}^c$.

To optimize this stage, Sampled Softmax Loss is employed. This loss function encourages the predicted item embedding $\mathbf{e}_k^{i'}$ to be maximally similar to the embedding $\mathbf{e}_k^i$ of the item the user actually interacted with at step $k$ (the true positive). Simultaneously, it pushes $\mathbf{e}_k^{i'}$ away from the embeddings of randomly sampled negative items $S$ from the item pool $\mathcal{I}$. The loss for the retrieval stage is defined as:

$$ \mathcal{L}_{\mathrm{retrieval}} = \sum_{u \in \mathcal{U}} \sum_{k=2}^{n} \mathrm{SampledSoftmaxLoss}(\mathbf{e}_k^{i'}) = -\sum_{u \in \mathcal{U}} \sum_{k=2}^{n} \log \left( \frac{\exp(\mathrm{sim}(\mathbf{e}_k^{i'}, \mathbf{e}_k^i))}{\exp(\mathrm{sim}(\mathbf{e}_k^{i'}, \mathbf{e}_k^i)) + \sum_{j \in S} \exp(\mathrm{sim}(\mathbf{e}_k^{i'}, \mathbf{e}_k^j))} \right), $$

Where:

  • $\mathcal{L}_{\mathrm{retrieval}}$: The loss for the retrieval stage.

  • $u \in \mathcal{U}$: Summation over all users in the user set $\mathcal{U}$.

  • $k=2$ to $n$: Summation over the sequence positions (starting from the second item to predict the next).

  • $\mathrm{SampledSoftmaxLoss}(\cdot)$: The Sampled Softmax Loss function.

  • $\mathrm{sim}(\mathbf{a}, \mathbf{b})$: A similarity function, specifically defined as the inner product $\mathbf{a} \cdot \mathbf{b}$ between two embedding vectors $\mathbf{a}$ and $\mathbf{b}$.

  • $\mathbf{e}_k^{i'}$: The predicted latent embedding for the $k$-th item.

  • $\mathbf{e}_k^i$: The actual embedding of the $k$-th interacted item (the positive sample).

  • $S$: A set of randomly sampled negative items from the item pool $\mathcal{I}$.

  • $\mathbf{e}_k^j$: The embedding of a negative item $j \in S$.

    This loss aims to make the predicted item embedding $\mathbf{e}_k^{i'}$ close to the actual next item embedding $\mathbf{e}_k^i$ while pushing it away from negative samples, thus optimizing the retrieval performance.
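
A minimal PyTorch sketch of a sampled softmax loss with inner-product similarity, consistent with the formula above, is shown below; the tensor shapes and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(pred_emb, pos_emb, neg_emb):
    """Sampled softmax over one positive item and |S| sampled negatives.

    pred_emb: (B, d)    predicted next-item embeddings e_k^{i'}
    pos_emb:  (B, d)    embeddings of the true next items e_k^{i}
    neg_emb:  (B, S, d) embeddings of randomly sampled negative items
    """
    pos_logit = (pred_emb * pos_emb).sum(-1, keepdim=True)       # (B, 1) inner-product similarity
    neg_logits = torch.einsum("bd,bsd->bs", pred_emb, neg_emb)   # (B, S)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)          # positive sits at index 0
    targets = torch.zeros(logits.size(0), dtype=torch.long)      # so the target class is 0
    return F.cross_entropy(logits, targets)
```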

4.1.3. Constraint on the Ranking Stage in UniGRF

The ranking stage focuses on predicting the probability of a user clicking on items within the candidate set $\mathcal{I}^c$ (generated by the retrieval stage). Items are then ranked by these probabilities.

To predict this click probability, a small neural network $f_\phi(\cdot)$ is applied to the predicted latent embedding $\mathbf{e}_k^{b'}$ from the Transformer for each input item $i_k$. The output is then passed through a sigmoid function to obtain a click score $score_k \in [0, 1]$. This score is compared with the user's actual interaction feedback $b_k$ using Binary Cross-Entropy (BCE) Loss:

$$ \mathcal{L}_{\mathrm{ranking}} = \sum_{u \in \mathcal{U}} \sum_{k=1}^{n} \mathrm{BCELoss}(score_k, b_k). $$

Where:

  • $\mathcal{L}_{\mathrm{ranking}}$: The loss for the ranking stage.

  • $u \in \mathcal{U}$: Summation over all users.

  • $k=1$ to $n$: Summation over all items in the interaction sequence.

  • $\mathrm{BCELoss}(score_k, b_k)$: The Binary Cross-Entropy Loss.

  • $score_k = \mathrm{sigmoid}(f_\phi(\mathbf{e}_k^{b'}))$: The predicted click score for item $k$.

  • $b_k$: The actual binary interaction type (click or no click) for item $k$.

  • $f_\phi(\cdot)$: A small neural network with parameters $\phi$.

  • $\mathrm{sigmoid}(\cdot)$: The sigmoid activation function, which squashes values between 0 and 1, suitable for probability prediction.

    By unifying retrieval and ranking into a single generative model, UniGRF effectively transfers knowledge between stages through shared parameters. The retrieval stage benefits from learning the true data distribution by sampling negatives from the entire item pool, reducing bias. The ranking stage provides precise user interest signals, which then implicitly enhance the retrieval task through this shared learning.
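
A corresponding sketch of the ranking loss, assuming the click scorer $f_\phi$ is a small linear head, might look like this; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ranking_bce_loss(rank_hidden, labels, f_phi):
    """BCE over per-position click predictions.

    rank_hidden: (B, n, d) latent embeddings e_k^{b'} from the shared Transformer
    labels:      (B, n)    binary interaction types b_k (1 = click, 0 = no click)
    f_phi:       small scoring network mapping d -> 1 (e.g., nn.Linear(d, 1))
    """
    logits = f_phi(rank_hidden).squeeze(-1)     # (B, n) raw scores
    # The sigmoid of the paper is applied implicitly inside the *_with_logits loss.
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```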

4.2. Ranking-Driven Enhancer

The Ranking-Driven Enhancer module is introduced to foster active collaboration between the retrieval and ranking stages, beyond just parameter sharing. It leverages the ranking model's precision to generate high-quality samples for the retrieval stage, creating a mutually reinforcing loop.

4.2.1. Hard-to-Detect Disliked Items Generator

Traditional retrieval stages often use random negative sampling, which can result in negative samples that are too easy for the model to distinguish from positive ones, limiting training effectiveness. This module uses the ranking model's precision to identify more challenging negative samples, termed hard-to-detect disliked items ($\mathcal{H}$). These items are hard negatives because the retrieval model might mistakenly give them high scores, while the more precise ranking model correctly identifies them as undesired.

The process to construct the set $\mathcal{H}$ involves comparing scores from both stages:

  1. Ranking Stage Score: For a given negative sample from the randomly sampled negative set $S$, its ranking score is calculated using the small neural network $f_\phi(\cdot)$ and a sigmoid activation:

     $$ score_{\mathrm{ranking}} = \mathrm{sigmoid}(f_\phi(\mathbf{e}^S)), $$

     Where:

     • $score_{\mathrm{ranking}}$: The click probability predicted by the ranking stage for a negative item.
     • $\mathbf{e}^S$: The embedding of the negative sample item.
     • $f_\phi(\cdot)$: The scoring function of the ranking stage.
     • $\mathrm{sigmoid}(\cdot)$: The sigmoid activation function.
  2. Retrieval Stage Score: The retrieval score for the same negative sample is calculated based on its similarity to the predicted user-interacted item embedding $\mathbf{e}^{i'}$:

     $$ score_{\mathrm{retrieval}} = \mathrm{sigmoid}(\mathrm{sim}(\mathbf{e}^{i'}, \mathbf{e}^S)), $$

     Where:

     • $score_{\mathrm{retrieval}}$: The similarity score between the predicted next item and the negative sample, passed through a sigmoid.
     • $\mathbf{e}^{i'}$: The predicted user-interacted item embedding from the retrieval stage.
     • $\mathrm{sim}(\cdot, \cdot)$: The similarity function (inner product).
  3. Relative Score for Hard Negatives: A relative score is defined to identify items that are highly ranked by retrieval but poorly by ranking, indicating they are hard-to-detect disliked items:

     $$ score = score_{\mathrm{retrieval}} \cdot \left( \frac{score_{\mathrm{retrieval}}}{score_{\mathrm{ranking}}} - 1 \right). $$

     Where:

     • $score$: The relative score. A high score indicates an item that the retrieval stage found relevant but the ranking stage did not, making it a hard negative.

     • The term $\left( \frac{score_{\mathrm{retrieval}}}{score_{\mathrm{ranking}}} - 1 \right)$ amplifies the score when $score_{\mathrm{retrieval}}$ is much higher than $score_{\mathrm{ranking}}$.

       The top $m$ items with the highest relative scores are selected as hard-to-detect disliked items and added to the set $\mathcal{H}$. This set $\mathcal{H}$ is then combined with new randomly sampled negatives to form the negative sample set for the next training epoch. This iterative process, where the model learns to distinguish hard negatives, facilitates information transfer from the more precise ranking stage to the retrieval stage. A brief code sketch of this selection follows the list below.
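
A possible implementation of the hard-negative selection, applied per prediction position, could look like the following sketch; the epsilon guard against division by zero is an added assumption, as are all names and shapes.

```python
import torch

def select_hard_negatives(pred_emb, neg_emb, f_phi, m, eps=1e-8):
    """Return indices of the m negatives that retrieval likes but ranking rejects.

    pred_emb: (d,)   predicted next-item embedding e^{i'}
    neg_emb:  (S, d) embeddings of the sampled negative items
    """
    score_ranking = torch.sigmoid(f_phi(neg_emb)).squeeze(-1)            # (S,) ranking scores
    score_retrieval = torch.sigmoid(neg_emb @ pred_emb)                  # (S,) sigmoid of inner product
    relative = score_retrieval * (score_retrieval / (score_ranking + eps) - 1.0)
    return relative.topk(m).indices      # hard-to-detect disliked items added to H
```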

4.2.2. Potential Favorite Items Generator

This module addresses another issue in negative sampling: some randomly selected negative samples might actually be potential favorite items for the user. Mislabeling these as negative can introduce bias and degrade performance.

For each item in the previous epoch's randomly generated negative sample set $S$, the module computes its ranking score $score_{\mathrm{ranking}}$. If an item's ranking score exceeds a predefined threshold $\alpha$ ($score_{\mathrm{ranking}} > \alpha$), it is identified as a potential favorite item and stored in the set $\mathcal{P}$.

In subsequent training iterations, potential favorite items in $\mathcal{P}$ are relabeled as positive samples for the retrieval stage. This correction ensures the generative model learns more accurately from these examples, enhancing its understanding of user preferences.
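
The relabeling step could be sketched as follows; the default threshold value is a placeholder, not the paper's setting.

```python
import torch

def find_potential_favorites(neg_emb, f_phi, alpha=0.8):
    """Return indices of sampled negatives whose ranking score exceeds alpha.

    neg_emb: (S, d) embeddings of the previous epoch's sampled negatives
    """
    score_ranking = torch.sigmoid(f_phi(neg_emb)).squeeze(-1)    # (S,) ranking scores
    return (score_ranking > alpha).nonzero(as_tuple=True)[0]     # relabel these as positives (set P)
```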

By generating hard negatives and correcting mislabelings, the Ranking-Driven Enhancer provides high-quality training data for the retrieval stage based on the ranking stage's fine-grained understanding of user preferences. This interaction creates a bidirectional enhancement loop due to the fully shared parameters of the unified model. Crucially, the computational overhead is minimal because ranking scores are calculated via a linear-complexity function $f_\phi(\cdot)$, and retrieval scores are implicitly handled within the retrieval loss calculation.

4.3. Gradient-Guided Adaptive Weighter

Even with shared parameters and collaborative mechanisms, retrieval and ranking tasks often have different convergence speeds, making simultaneous optimization challenging. The Gradient-Guided Adaptive Weighter addresses this by dynamically balancing the optimization rates of the two stages.

4.3.1. Monitoring Convergence Speed

The module monitors the loss decrease rate of each stage as an approximation of its optimization speed. A faster loss decrease implies faster gradient updates and quicker convergence. The convergence rates for retrieval and ranking are calculated as follows:

$$ r_a = \mathcal{L}_{\mathrm{retrieval}}^{t} / \mathcal{L}_{\mathrm{retrieval}}^{t-1}, \qquad r_b = \mathcal{L}_{\mathrm{ranking}}^{t} / \mathcal{L}_{\mathrm{ranking}}^{t-1}, $$

Where:

  • $r_a$: The convergence rate for the retrieval stage. A value greater than 1 implies the loss is increasing; less than 1 means it is decreasing.
  • $r_b$: The convergence rate for the ranking stage.
  • $\mathcal{L}_{\mathrm{retrieval}}^t$: The retrieval loss at the current time step $t$.
  • $\mathcal{L}_{\mathrm{retrieval}}^{t-1}$: The retrieval loss at the previous time step $t-1$.
  • $\mathcal{L}_{\mathrm{ranking}}^t$: The ranking loss at the current time step $t$.
  • $\mathcal{L}_{\mathrm{ranking}}^{t-1}$: The ranking loss at the previous time step $t-1$. A larger $r$ value indicates slower convergence (or even divergence if $r > 1$), implying that the corresponding task needs more weight to accelerate its optimization.

4.3.2. Adaptive Weight Calculation

Based on these convergence rates, the module adaptively calculates weights $w_a$ and $w_b$ for the retrieval and ranking losses, respectively, using a softmax-like scaling with a temperature coefficient:

$$ w_a = \frac{\lambda_a \exp(r_a / T)}{\exp(r_a / T) + \exp(r_b / T)}, \qquad w_b = \frac{\lambda_b \exp(r_b / T)}{\exp(r_a / T) + \exp(r_b / T)}, $$

Where:

  • $w_a$: The adaptive weight for the retrieval loss.
  • $w_b$: The adaptive weight for the ranking loss.
  • $T$: The temperature coefficient. A smaller $T$ value will lead to a larger difference between $w_a$ and $w_b$, making the weighting more aggressive. A larger $T$ smooths the weights, making them more similar.
  • $\lambda_a, \lambda_b$: Hyperparameters used to scale the losses of the two stages, bringing them to a similar magnitude. This is important because raw loss values can vary significantly between different loss functions. The use of $\exp(r/T)$ means that a task with a higher (slower) convergence rate $r$ will receive a relatively larger weight, pushing the model to pay more attention to it.

4.3.3. Final Loss Function

Finally, the adaptive weights are applied to the retrieval and ranking losses to form the total loss function $\mathcal{L}^t$ for optimization at time step $t$:

$$ \mathcal{L}^t = w_a \cdot \mathcal{L}_{\mathrm{retrieval}}^{t} + w_b \cdot \mathcal{L}_{\mathrm{ranking}}^{t}. $$

This gradient-guided adaptive weighted loss enables synchronous optimization of both retrieval and ranking tasks. By dynamically adjusting loss weights, UniGRF ensures that neither stage is over-optimized or under-optimized, allowing the unified generative framework to achieve optimal performance by fully leveraging information transfer and inter-stage collaboration.
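
A compact sketch of this weighting scheme, treating the convergence rates as constants (detached) so gradients only flow through the weighted losses, is given below; the default hyperparameter values are placeholders, not the paper's settings.

```python
import torch

def adaptive_loss_weights(loss_ret_t, loss_ret_prev, loss_rank_t, loss_rank_prev,
                          lambda_a=1.0, lambda_b=1.0, T=1.0):
    """Give the slower-converging stage the larger loss weight."""
    # Rates are computed from detached values so the weights act as constants.
    r_a = (loss_ret_t / loss_ret_prev).detach()     # retrieval convergence rate
    r_b = (loss_rank_t / loss_rank_prev).detach()   # ranking convergence rate
    denom = torch.exp(r_a / T) + torch.exp(r_b / T)
    w_a = lambda_a * torch.exp(r_a / T) / denom
    w_b = lambda_b * torch.exp(r_b / T) / denom
    return w_a, w_b

# Per step t:  L^t = w_a * L_retrieval^t + w_b * L_ranking^t
# w_a, w_b = adaptive_loss_weights(l_ret, prev_l_ret, l_rank, prev_l_rank)
# total_loss = w_a * l_ret + w_b * l_rank
```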

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three publicly available recommendation datasets, chosen for their varying sizes and widespread use in research:

  • MovieLens-1M [14]:
    • Source: GroupLens Research project.
    • Scale: Approximately 1 million user ratings.
    • Characteristics: Contains user ratings (1-5 stars) for movies, along with basic user information and movie metadata.
    • Domain: Movies.
  • MovieLens-20M [14]:
    • Source: GroupLens Research project.
    • Scale: Approximately 20 million user ratings.
    • Characteristics: A larger version of MovieLens, covering ratings from 1995 to 2015.
    • Domain: Movies.
  • Amazon-Books [31]:
    • Source: Subset of Amazon product reviews.
    • Scale: Millions of user ratings and reviews.
    • Characteristics: Focuses on the book category, providing a real-world, sparse dataset.
    • Domain: Books.

5.1.1. Data Processing

For all datasets:

  • User interaction records are arranged chronologically to form historical interaction sequences.

  • Sequences with fewer than three interactions are filtered out.

  • For rating data (e.g., 1-5 stars), binary processing is applied: ratings greater than 3 are marked as 1 (liked), otherwise 0 (disliked). This simplifies user interest representation.

    The following are the results from Table 1 of the original paper:

    Datasets #Users #Items #Inters. Avg. n
    MovieLens-1M 6,040 3,706 1,000,209 165.6
    MovieLens-20M 138,493 26,744 20,000,263 144.4
    Amazon-Books 694,897 686,623 10,053,086 14.5

Where:

  • #Users: Total number of unique users.

  • #Items: Total number of unique items.

  • #Inters.: Total number of user-item interactions.

  • Avg. n: Average sequence length of user interactions.

    These datasets were chosen because they are widely used benchmarks, cover different scales (from MovieLens-1M being relatively small to Amazon-Books being very large and sparse), and represent real-world user behavior, making them effective for validating the performance and scalability of recommendation methods.
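
Returning to the preprocessing described in Section 5.1.1, a minimal pandas sketch of the chronological ordering, short-sequence filtering, and rating binarization might look like this; the column names are assumptions.

```python
import pandas as pd

def preprocess(ratings: pd.DataFrame) -> pd.DataFrame:
    """Assumed columns: user_id, item_id, rating, timestamp."""
    df = ratings.sort_values(["user_id", "timestamp"])      # chronological interaction sequences
    df["label"] = (df["rating"] > 3).astype(int)            # rating > 3 -> 1 (liked), else 0
    seq_len = df.groupby("user_id")["item_id"].transform("size")
    return df[seq_len >= 3]                                  # drop users with fewer than 3 interactions
```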

5.1.2. Training, Validation, and Test Split

  • Test Set: The item from the user's last interaction.
  • Validation Set: The item from the user's second-to-last interaction.
  • Training Set: All other interactions.
  • Candidate Set for Retrieval: The entire item set is used as the candidate set to avoid sampling bias.

5.2. Evaluation Metrics

The paper uses standard evaluation metrics for both retrieval and ranking tasks.

5.2.1. Retrieval Metrics

These metrics are used to evaluate how well the model identifies relevant items from a large pool. $K$ denotes the number of top items considered in the recommendation list.

  • Normalized Discounted Cumulative Gain at K (NDCG@K):

    • Conceptual Definition: NDCG measures the quality of a ranking, considering both the relevance of recommended items and their position in the list. It assigns higher scores to more relevant items appearing earlier in the list. NDCG@K calculates this score for the top $K$ recommendations.
    • Mathematical Formula: $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)} $ $ \mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{rel_{j}^{ideal}} - 1}{\log_2(j+1)} $ $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $rel_j$: The relevance score of the item at position $j$ in the recommended list (typically 1 if relevant, 0 if not).
      • $rel_{j}^{ideal}$: The relevance score of the item at position $j$ in the ideal (perfect) ranking list.
      • $\mathrm{DCG@K}$: Discounted Cumulative Gain at K, where relevant items at lower ranks are penalized logarithmically.
      • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at K, the maximum possible DCG for a given set of relevant items.
      • $\mathrm{NDCG@K}$: The normalized DCG, typically ranging from 0 to 1.
  • Hit Rate at K (HR@K):

    • Conceptual Definition: HR@K (also known as Recall@K) measures whether the target item (the one the user actually interacted with) is present in the top $K$ recommended items. It indicates how often the model successfully recommends at least one relevant item.
    • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top K}}{\text{Total number of users}} $
    • Symbol Explanation:
      • "Number of users for whom the target item is in top K": Count of users for whom the ground truth item is among the KK recommended items.
      • "Total number of users": Total number of users in the evaluation set.
  • Mean Reciprocal Rank (MRR):

    • Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank 1, its reciprocal rank is 1; if at rank 2, it's 1/2, and so on. A higher MRR indicates that relevant items are found earlier in the ranked list.
    • Mathematical Formula: $ \mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\mathrm{rank}_u} $
    • Symbol Explanation:
      • $|\mathcal{U}|$: The total number of users.

      • $\mathrm{rank}_u$: The rank of the first relevant item for user $u$.

        The paper sets the value of $K$ to 10 and 50 for NDCG@K and HR@K.
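
In a leave-one-out evaluation where each user has a single target item, these metrics reduce to simple per-user quantities; a small sketch, assuming a 1-based rank of the target item in the full ranking, follows.

```python
import numpy as np

def retrieval_metrics(rank, k):
    """Per-user metrics given the 1-based rank of the single target item."""
    hit = 1.0 if rank <= k else 0.0                          # HR@K
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0     # NDCG@K with binary relevance
    rr = 1.0 / rank                                          # reciprocal rank; averaged over users -> MRR
    return hit, ndcg, rr

# Example: the target item is ranked 3rd in the full item pool, K = 10.
print(retrieval_metrics(rank=3, k=10))  # (1.0, 0.5, 0.333...)
```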

5.2.2. Ranking Metric

  • Area Under the ROC Curve (AUC):
    • Conceptual Definition: AUC is a commonly used metric for binary classification tasks (like click-through rate (CTR) prediction in ranking). It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 0.5 indicates performance no better than random, while 1.0 indicates perfect classification.
    • Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{positive class}} \sum_{j \in \text{negative class}} \mathbb{I}(\text{score}_i > \text{score}_j)}{|\text{positive class}| \cdot |\text{negative class}|} $
    • Symbol Explanation:
      • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition is true, and 0 otherwise.
      • $\text{score}_i$: The predicted score for a positive instance $i$.
      • $\text{score}_j$: The predicted score for a negative instance $j$.
      • $|\text{positive class}|$: The total number of positive instances.
      • $|\text{negative class}|$: The total number of negative instances.
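
The pairwise definition above can be computed directly for small score lists; the sketch below also gives ties half credit, a common convention that the formula as written omits.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    correct = (pos > neg).sum() + 0.5 * (pos == neg).sum()   # ties get half credit
    return correct / (pos.size * neg.size)

# 4 of the 6 positive-negative pairs are ordered correctly:
print(auc([0.9, 0.7, 0.4], [0.8, 0.3]))  # 0.666...
```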

5.3. Baselines

The paper compares UniGRF against a comprehensive set of baseline models categorized by their primary function and approach.

5.3.1. Retrieval Models

These models are typically trained separately for the retrieval task.

  • BPR-MF [37]: Bayesian Personalized Ranking - Matrix Factorization. A classic matrix factorization method optimized for implicit feedback, predicting pairwise preferences.
  • GRU4Rec [21]: A recurrent neural network model using Gated Recurrent Units to capture sequential user behavior for session-based recommendations.
  • NARM [27]: Neural Attentive Session-based Recommendation Model. Combines RNNs with attention mechanisms to capture users' diverse preferences in sessions.
  • SASRec [23]: Self-Attentive Sequential Recommendation. Uses a Transformer architecture with self-attention to model complex and long-range patterns in user behavior sequences.

5.3.2. Ranking Models

These models are typically trained separately for the ranking task.

  • DCN [56]: Deep & Cross Network. A model that combines deep neural networks and cross networks to capture both explicit and implicit high-order feature interactions.
  • DIN [75]: Deep Interest Network. Introduces a target-aware attention mechanism to dynamically capture a user's interest distribution based on the candidate item.
  • GDCN [47]: Gated Deep & Cross Network. An improved version of DCN that uses a gated cross network to capture more complex relationships.

5.3.3. Generative Models

These are autoregressive models that can be adapted for recommendation tasks, but are typically trained for a single stage.

  • HSTU [67]: Hierarchical Sequential Transduction Units. A generative model by Meta that emphasizes scaling laws and efficiency through point-wise aggregated attention and the M-FALCON algorithm. It can be adapted for both retrieval and ranking but is usually trained independently for each.
  • Llama [45]: A Transformer-based large language model architecture. Adapted here to build a recommendation system using a sequence generation method similar to HSTU for both retrieval and ranking.

5.3.4. Cascade Frameworks for Generative Model

These frameworks combine separate retrieval and ranking models (here, instantiated with HSTU) using specialized modules for information exchange.

  • RankFlow-HSTU [34]: A cascade framework that first trains retrieval and ranking models independently, then performs joint training to facilitate information transfer between them. Here, both retriever and ranker are HSTU models.
  • CoRR-HSTU [20]: Cooperative Retriever and Ranker. This framework uses knowledge distillation based on Kullback-Leibler (KL) divergence to align the retrieval and ranking stages for effective knowledge transfer. Here, both retriever and ranker are HSTU models.

5.4. Parameter Settings

To ensure a fair comparison, consistent hyperparameter settings were used across models for the same dataset.

  • Learning Rate: Uniformly set to $1 \times 10^{-3}$.
  • Negative Samples (Retrieval Stage):
    • MovieLens-1M: 128 negative samples.
    • MovieLens-20M and Amazon-Books: 256 negative samples.
  • Training Epochs:
    • Ranking-only models: 20 epochs.
    • Other models (including UniGRF and cascade frameworks): 100 epochs.
    • Early stopping: Applied if the AUC value (for ranking) or other primary metrics (for retrieval/unified) do not improve over 5 epochs.
  • Transformer Layers (for Transformer-based models):
    • MovieLens-1M: 2 layers.
    • MovieLens-20M and Amazon-Books: 4 layers (unless specified otherwise for scaling law experiments).
  • Distributed Training: All models were trained using 8 GPUs.

6. Results & Analysis

6.1. Overall Performance

The UniGRF framework was instantiated with two representative generative models, HSTU and Llama, resulting in UniGRF-HSTU and UniGRF-Llama. The performance comparison against various baselines is presented in Table 2 for retrieval and Table 3 for ranking.

The following are the results from Table 2 of the original paper:

Dataset MovieLens-1M MovieLens-20M Amazon-Books
NG@10 NG@50 HR@10 HR@50 MRR NG@10 NG@50 HR@10 HR@50 MRR NG@10 NG@50 HR@10 HR@50 MRR
BPRMF 0.0607 0.1027 0.1185 0.3127 0.0556 0.0629 0.1074 0.1241 0.3300 0.0572 0.0081 0.0135 0.0162 0.0412 0.0078
GRU4Rec 0.1015 0.1460 0.1816 0.3864 0.0895 0.0768 0.1155 0.1394 0.3177 0.0689 0.0111 0.0179 0.0207 0.0519 0.0107
NARM 0.1350 0.1894 0.2445 0.4915 0.1165 0.1037 0.1552 0.1926 0.4281 0.0910 0.0148 0.0243 0.0280 0.0722 0.0141
SASRec 0.1594 0.2208 0.2899 0.5671 0.1388 0.1692 0.2267 0.3016 0.5616 0.1440 0.0229 0.0353 0.0408 0.1059 0.0213
Llama 0.1746 0.2343 0.3110 0.5785 0.1477 0.1880 0.2451 0.3268 0.5846 0.1605 0.0338 0.0493 0.0603 0.1317 0.0305
HSTU 0.1708 0.2314 0.3132 0.5862 0.1431 0.1814 0.2325 0.3056 0.5371 0.1569 0.0337 0.0502 0.0613 0.1371 0.0305
RankFlow-HSTU 0.1096 0.2178 0.2184 0.5062 0.0988 0.0985 0.2064 0.2002 0.4245 0.1221 0.0281 0.0307 0.0324 0.0878 0.0249
CoRR-HSTU 0.1468 0.2662 0.2081 0.5435 0.1261 0.1629 0.2166 0.2796 0.5236 0.1409 0.0352 0.0521 0.0612 0.1360 0.0315
UniGRF-Llama 0.1765 0.2368 0.3219 0.5921 0.1478 0.1891 0.2464 0.3270 0.5846 0.1652 0.0351 0.0508 0.0624 0.1347 0.0316
UniGRF-HSTU 0.1756 0.2326 0.3140 0.5909 0.1484 0.1816 0.2384 0.3178 0.5747 0.1548 0.0354 0.0522 0.0638 0.1415 0.0319

Note: In the original paper, the best result in each column is shown in bold and the second-best is underlined, with statistical significance assessed at p-value < 0.05. That formatting is not reproduced in the plain-text table above, so the best and second-best values should be read directly from the numbers.

The following are the results from Table 3 of the original paper:

Dataset ML-1M ML-20M Books
Model AUC AUC AUC
DCN 0.7176 0.7098 0.7042
DIN 0.7329 0.7274 0.7159
GDCN 0.7364 0.7089 0.7061
Llama 0.7819 0.7722 0.7532
HSTU 0.7621 0.7807 0.7340
RankFlow-HSTU 0.6340 0.6439 0.6388
CoRR-HSTU 0.7497 0.7492 0.7357
UniGRF-Llama 0.7932 0.7776 0.7559
UniGRF-HSTU 0.7832 0.7941 0.7672

From these tables, several key conclusions can be drawn:

  • Generative Models Outperform Non-Generative Ones: In general, generative methods like HSTU and Llama (when trained separately) consistently outperform traditional non-generative retrieval models (e.g., BPR-MF, GRU4Rec, NARM, SASRec) and ranking models (e.g., DCN, DIN, GDCN). This validates the paper's premise about the significant potential of generative models in recommendation by learning richer representations and deeper user interest relationships.
  • UniGRF Achieves Superior Performance in Both Stages: UniGRF (both UniGRF-HSTU and UniGRF-Llama) consistently achieves the best or second-best performance across all retrieval metrics (NDCG@10/50, HR@10/50, MRR) and ranking AUC on all three datasets.
    • For example, on MovieLens-1M, UniGRF-HSTU achieves the highest MRR (0.1484) while UniGRF-Llama achieves the highest AUC (0.7932), and both post competitive NDCG and HR values. On MovieLens-20M, UniGRF-HSTU achieves the highest AUC (0.7941) and very strong retrieval metrics. On Amazon-Books, UniGRF-HSTU also leads significantly across almost all metrics.
    • This demonstrates that UniGRF effectively enhances performance by facilitating robust information transfer, efficient inter-stage collaboration, and synchronous optimization within its unified framework. Its model-agnostic nature is also highlighted by its consistent improvements across different underlying generative architectures (HSTU and Llama).
  • Traditional Cascade Frameworks Struggle with Generative Models: RankFlow-HSTU and CoRR-HSTU, which are traditional cascade frameworks instantiated with HSTU, perform poorly. RankFlow-HSTU often performs worse than the independently trained HSTU model, and in some cases, even worse than much simpler baselines. CoRR-HSTU shows slight improvements over individual HSTU in some retrieval metrics on Amazon-Books but generally falls short of UniGRF. This suggests that traditional cascade frameworks, designed for non-generative models, are not well-suited for generative recommendation models and fail to enable effective inter-stage information transfer in this new paradigm.
  • Greater Impact on Ranking Performance: While UniGRF improves retrieval performance, it delivers a more pronounced boost in the ranking stage, as the AUC values in Table 3 show. The paper posits that the ranking stage, being the simpler task, benefits more from the rich information transfer a unified model provides, information it would otherwise lack when trained in isolation. This is a positive outcome for practical applications, as the ranking stage has a more direct impact on the quality of final recommendations.
  • Effectiveness on Sparse Data (Amazon-Books): UniGRF exhibits a more significant performance improvement on the Amazon-Books dataset, which is known for its sparsity. This suggests that in data-scarce environments, the cross-stage information transfer facilitated by UniGRF becomes even more crucial, allowing the model to leverage limited information more effectively. CoRR-HSTU also performs relatively better on this dataset compared to others, reinforcing the idea that information transfer is vital for sparse data.
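
As a reference for the metrics discussed above, the sketch below shows how HR@K, NDCG@K, MRR, and the ranking AUC are typically computed in this style of evaluation. It assumes a single ground-truth item per test sequence; the function names and that assumption are illustrative additions, not taken from the paper, whose exact protocol is defined in its experimental setup.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def retrieval_metrics(rank_of_positive: int, k: int):
    """rank_of_positive: 1-based rank of the ground-truth item among all
    candidates after sorting by predicted score (single-positive assumption)."""
    hr = 1.0 if rank_of_positive <= k else 0.0                                     # HR@K: hit or miss
    ndcg = 1.0 / np.log2(rank_of_positive + 1) if rank_of_positive <= k else 0.0   # NDCG@K with one relevant item
    mrr = 1.0 / rank_of_positive                                                   # reciprocal rank (unclipped)
    return hr, ndcg, mrr

def ranking_auc(labels, scores):
    """Ranking AUC over all (item, click-label) pairs in the test set."""
    return roc_auc_score(labels, scores)

# Example: the positive item is ranked 3rd in the candidate pool.
print(retrieval_metrics(rank_of_positive=3, k=10))  # (1.0, 0.5, 0.333...)
```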

6.2. Ablation Study

To investigate the effectiveness of each proposed module, an ablation study was conducted on the MovieLens-1M dataset. UniGRF was instantiated with the HSTU model. The study involved three variants:

  1. w/o Enhancer: The Ranking-Driven Enhancer module is removed, and negative samples are generated purely through random sampling.

  2. w/o Weighter: The Gradient-Guided Adaptive Weighter is removed. The total loss is simply the sum of the retrieval and ranking losses: $\mathcal{L}^t = \mathcal{L}_{\mathrm{retrieval}}^t + \mathcal{L}_{\mathrm{ranking}}^t$.

  3. w/o Both: Both the Ranking-Driven Enhancer and the Gradient-Guided Adaptive Weighter are removed.

    Additionally, to verify the effectiveness of information transfer in generative recommendation beyond unified training, two fine-tuning (FT) paradigms were tested:

  • HSTU-FT-a: The HSTU model is first trained for retrieval (100 epochs). Its parameters then initialize a ranking model, which is further trained (20 epochs). The results for ranking are reported.

  • HSTU-FT-b: The HSTU model is first trained for ranking (20 epochs). Its parameters then initialize a retrieval model, which is further trained (100 epochs). The results for retrieval are reported.

    The following are the results from Table 4 of the original paper:

    Model AUC NG@10 NG@50 HR@10 HR@50 MRR
    HSTU 0.7621 0.1708 0.2314 0.3132 0.5862 0.1431
    (1) w/o Enhancer 0.7815 0.1717 0.2318 0.3117 0.5824 0.1446
    (2) w/o Weighter 0.7582 0.1734 0.2317 0.3199 0.5824 0.1441
    (3) w/o Both 0.7293 0.1681 0.2276 0.3113 0.5824 0.1392
    (a) HSTU-FT-a 0.7752 - - - - -
    (b) HSTU-FT-b - 0.1727 0.2308 0.3182 0.5807 0.1436
    UniGRF-HSTU 0.7832 0.1756 0.2326 0.3140 0.5909 0.1484

Analysis of the ablation study results:

  • Effectiveness of Both Modules: All variants (w/o Enhancer, w/o Weighter, w/o Both) show performance inferior to the full UniGRF-HSTU. This confirms that both the Ranking-Driven Enhancer and the Gradient-Guided Adaptive Weighter are crucial for UniGRF's superior performance.
  • "w/o Both" Variant's Poor Performance: The w/o Both variant performs the worst, even failing to surpass the independently optimized HSTU model. This is a critical finding: simply unifying retrieval and ranking without explicit mechanisms for collaboration and synchronized optimization is insufficient and can even lead to performance degradation. This validates the paper's initial premise that a thoughtful design is necessary for a successful unified framework.
  • Impact of the Ranking-Driven Enhancer (w/o Enhancer): Removing the Enhancer module (w/o Enhancer variant) leads to a slight degradation in retrieval performance metrics (NG@10, NG@50, HR@10, HR@50, MRR) compared to UniGRF-HSTU. This confirms that random negative sampling is less effective than the Ranking-Driven Enhancer in capturing deep user interests and providing informative negative samples. The Enhancer's ability to generate hard negatives and potential positives helps the model learn more robust representations.
  • Impact of the Gradient-Guided Adaptive Weighter (w/o Weighter): The w/o Weighter variant shows a notable degradation in ranking performance (AUC 0.7582 vs. 0.7832 for UniGRF-HSTU), while its retrieval metrics stay close to those of standalone HSTU but mostly fall short of UniGRF-HSTU. This highlights the critical role of the adaptive weighter in balancing the optimization speeds of the two stages. Without it, the model may over-optimize one stage (e.g., retrieval) at the expense of the other (ranking), preventing both from reaching their full potential synchronously. However, w/o Weighter still performs better than w/o Both, suggesting the Enhancer module provides benefits by improving inter-stage collaboration even without balanced optimization.
  • Comparison with Fine-Tuning Paradigms:
    • HSTU-FT-a (fine-tuning ranking after retrieval) achieves an AUC of 0.7752, which is better than standalone HSTU (0.7621) but lower than UniGRF-HSTU (0.7832).
    • HSTU-FT-b (fine-tuning retrieval after ranking) achieves retrieval metrics that are slightly better than standalone HSTU but generally lower than UniGRF-HSTU.
    • This indicates that while information transfer through fine-tuning (sequential training) is effective and improves performance over separate training, the unified generative framework of UniGRF (simultaneous, end-to-end training with shared parameters) facilitates more sufficient and effective information transfer, leading to superior overall performance.

6.3. Impact of Hyper-parameter

The Ranking-Driven Enhancer module introduces $m$ hard-to-detect disliked items. The paper investigates the impact of varying $m$ on UniGRF's performance while keeping the total number of negative samples fixed (128 for MovieLens-1M, 256 for MovieLens-20M). The values of $m$ tested are $\{0, 2, 5, 10, 20\}$. The results are presented in the following figure (Figure 2 from the original paper).

Figure 2: Impact of varying the number of hard-to-detect disliked items $m$. (The figure plots retrieval (red) and ranking (blue) performance against $m$ on MovieLens-1M and MovieLens-20M.)

6.3.1. MovieLens-20M (Larger Dataset)

For the larger dataset, MovieLens-20M, as $m$ increases from 0 to 20, the performance of both the retrieval and ranking stages shows a continuous upward trend. This suggests that:

  • Introducing a higher number of hard negative samples (up to 20 in this experiment) through the Ranking-Driven Enhancer effectively aids the model in learning more accurate user preferences.
  • These hard negatives facilitate mutual enhancement between stages, contributing to improved recommendation performance. The model needs more challenging examples to refine its boundaries in a larger, more complex data distribution.

6.3.2. MovieLens-1M (Smaller Dataset)

For the smaller dataset, MovieLens-1M, the performance of both retrieval and ranking first increases and then decreases as $m$ grows from 0 to 20, with optimal performance observed around $m = 5$. This indicates:

  • Moderate Introduction is Beneficial: On smaller datasets, a moderate number of hard negative samples can indeed enhance the model's learning ability and improve the capture of fine-grained user preferences.
  • Risk of Overfitting: When $m$ becomes too large, the model may over-focus on these hard negative samples. This can lead to overfitting, where the model performs well on the specific hard negatives but loses generalization ability on unseen data, thus degrading overall performance.
  • Reduced Sample Diversity: A high proportion of hard negative samples can also lead to repeated encounters with the same samples across epochs, reducing the diversity of training samples. This lack of variation can hinder the model's ability to learn broadly and limit its generalization ability. Therefore, on smaller datasets, careful selection of $m$ is crucial to strike a balance between improving model performance and avoiding overfitting (a small sketch of the negative-budget split is given after this list).
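
To make the role of $m$ concrete, the sketch below fills a fixed negative-sample budget with $m$ Enhancer-style hard negatives plus random negatives for the remainder. The relative score used to pick hard negatives is the one quoted from the paper later in this analysis; everything else (variable names, the stand-in scores, the exact way random negatives are drawn) is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_negatives(retrieval_scores, ranking_scores, candidate_ids, m, total_negatives):
    """Fill a fixed negative budget with m hard negatives (chosen via the
    paper's relative score) plus (total_negatives - m) random negatives."""
    # Relative score is large when retrieval rates an item much higher than ranking does.
    relative = retrieval_scores * (retrieval_scores / ranking_scores - 1.0)
    hard_ids = candidate_ids[np.argsort(-relative)[:m]]                 # top-m hardest items
    remaining = np.setdiff1d(candidate_ids, hard_ids)
    random_ids = rng.choice(remaining, size=total_negatives - m, replace=False)
    return np.concatenate([hard_ids, random_ids])

# Example: a 128-negative budget with m = 5 hard negatives from a 1,000-item pool.
candidates = np.arange(1000)
retr_scores = rng.random(1000) + 1e-6    # stand-in retrieval scores in (0, 1]
rank_scores = rng.random(1000) + 1e-6    # stand-in ranking scores in (0, 1]
negatives = select_negatives(retr_scores, rank_scores, candidates, m=5, total_negatives=128)
print(negatives.shape)                   # (128,)
```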

6.4. Further Analysis

6.4.1. Analysis of the Impact of Optimization on Unified Framework Performance

This analysis highlights the importance of synchronized optimization between stages. The MovieLens-20M dataset was used to observe the changes in recommendation performance over epochs for both retrieval and ranking stages. The results are presented in the following figure (Figure 3 from the original paper).

Figure 3: Effect of synchronized stage optimization on unified generative framework performance. (Both panels track ranking and retrieval performance over training epochs; the left panel omits UniGRF's proposed modules, while the right panel shows the full UniGRF-HSTU.)

  • Without Adaptive Weighter (Figure 3a):

    • When a simple sum of retrieval and ranking losses is used ($\mathcal{L}^t = \mathcal{L}_{\mathrm{retrieval}}^t + \mathcal{L}_{\mathrm{ranking}}^t$, corresponding to the w/o Weighter scenario in the ablation study), the retrieval stage's performance (red curve) shows a continuously rising trend.
    • In contrast, the ranking stage's performance (blue curve) quickly reaches an optimal value within a few epochs and then rapidly declines.
    • Explanation: This asynchronous optimization occurs because retrieval (which involves negative sampling from a large pool) is generally the more challenging task, while ranking is comparatively simpler. When optimizing a combined loss without balancing, the model increasingly favors the more challenging retrieval task; in later epochs the ranking stage is effectively under-trained and its performance drops sharply. This limits the potential for effective information sharing and inter-stage collaboration.
  • With Gradient-Guided Adaptive Weighter (Figure 3b):

    • When UniGRF's Gradient-Guided Adaptive Weighter is employed, the optimization pace of the two stages is balanced.
    • The retrieval stage's performance (red curve) exhibits a smoother and more stable improvement curve.
    • Crucially, the ranking stage's performance (blue curve) now also shows continuous improvement, instead of declining.
    • Explanation: The adaptive weighter dynamically monitors training rates and loss convergence speeds using gradients. It then adjusts the loss weights in real-time to ensure appropriate attention is given to both tasks throughout the training process. By achieving synchronous optimization, UniGRF fully leverages the benefits of the unified generative framework, significantly enhancing overall recommendation performance.
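
For concreteness, here is a minimal sketch of an adaptive loss-weighting step in the spirit of the mechanism described above: the task whose loss is decreasing more slowly receives a larger weight, so retrieval and ranking stay roughly in sync. The paper's weighter is gradient-guided and uses a temperature $T$ and scaling hyperparameters $\lambda_a, \lambda_b$ defined in its methodology; the GradNorm-style loss-ratio rule below is only an illustrative stand-in, not the authors' exact formula.

```python
import torch

def adaptive_weights(loss_retrieval, loss_ranking, init_losses, temperature=1.0):
    """Illustrative stand-in: weight each task by its training rate
    (current loss / initial loss), so the slower-converging task gets
    more attention and the two stages stay roughly in sync."""
    rates = torch.stack([loss_retrieval / init_losses[0],
                         loss_ranking / init_losses[1]]).detach()
    weights = 2.0 * torch.softmax(rates / temperature, dim=0)  # weights sum to 2
    return weights[0], weights[1]

# Toy example: retrieval has made less relative progress, so it gets the larger weight.
l_ret, l_rank = torch.tensor(2.4), torch.tensor(0.3)
w_ret, w_rank = adaptive_weights(l_ret, l_rank, init_losses=(torch.tensor(3.0), torch.tensor(0.7)))
print(float(w_ret), float(w_rank))

# Sketch of one training step (losses and optimizer assumed to exist):
# total_loss = w_ret * l_ret + w_rank * l_rank
# total_loss.backward(); optimizer.step()
```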

6.4.2. Analysis on Scaling Law

The paper also explores how UniGRF's performance changes with parameter expansion by adjusting the number of Transformer layers within the generative architecture (instantiated with HSTU). The number of layers was varied as $\{2, 4, 8, 16, 24, 32\}$. The experimental results on the MovieLens-20M dataset for UniGRF-HSTU are presented in the following figure (Figure 4 from the original paper).

Figure 4: Effect of parameter expansion on model loss and recommendation performance. (Panel (a) plots the ranking loss (blue) and retrieval loss (red) against the number of layers; panel (b) plots ranking and retrieval performance against the number of layers.)

  • Loss Reduction with Parameter Expansion (Figure 4a):

    • Figure 4a shows that as the number of Transformer layers increases (i.e., model parameter size expands), the loss value for both retrieval and ranking tasks consistently decreases.
    • Explanation: This trend indicates that UniGRF adheres to the scaling law, a phenomenon observed in large language models where increasing model capacity (number of parameters) leads to a reduction in loss. This suggests that the model can learn more complex patterns and achieve a better fit to the data with larger architectures.
  • Performance Improvement with Parameter Expansion (Figure 4b):

    • Figure 4b shows that the actual recommendation performance of UniGRF also improves as the number of Transformer layers increases.

    • Explanation: This is a crucial finding, as loss reduction doesn't always translate directly to performance improvement in practical metrics. The consistent improvement in retrieval and ranking performance metrics (e.g., NDCG@K, AUC) with increasing parameter size validates that UniGRF can effectively leverage larger models. This demonstrates UniGRF's strong scalability and its ability to capture more complex data patterns with more sophisticated architectures, indicating significant application potential.

      The paper notes that it temporarily focuses on constructing the unified framework and demonstrating its scalability on public datasets, with plans to study scaling laws on industrial datasets in future work.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces the Unified Generative Recommendation Framework (UniGRF), a novel approach that effectively addresses the persistent problem of information loss and diminished performance in traditional multi-stage recommendation systems. By unifying retrieval and ranking into a single generative model and treating both as sequence generation tasks, UniGRF enables inherent and sufficient information sharing through shared parameters, eliminating the need for complex explicit transfer mechanisms.

Key to UniGRF's success are two innovative components:

  1. The ranking-driven enhancer module: This module actively promotes inter-stage collaboration by leveraging the ranking stage's precision to generate high-quality training samples (e.g., hard-to-detect disliked items, potential favorite items) for the retrieval stage, creating a mutually reinforcing feedback loop with minimal computational overhead.

  2. The gradient-guided adaptive weighter: This component dynamically monitors the optimization speeds of the retrieval and ranking tasks and adaptively adjusts their loss weights. This ensures synchronized performance improvements across both stages, preventing one from being over- or under-optimized at the expense of the other.

    Extensive experiments on benchmark datasets (MovieLens-1M, MovieLens-20M, Amazon-Books) demonstrate that UniGRF consistently outperforms state-of-the-art baselines in both retrieval and ranking metrics. Ablation studies confirm the individual efficacy of the enhancer and weighter modules, highlighting their crucial roles in inter-stage collaboration and synchronized optimization. Further analysis also verifies UniGRF's adherence to scaling laws, showing that performance improves with increased model parameters.

In essence, UniGRF provides a highly effective, scalable, and compatible framework for generative recommendation systems, significantly advancing the field by offering a holistic solution to multi-stage recommendation.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and propose future research directions:

  • Unifying More Stages: Currently, UniGRF unifies retrieval and ranking. Future work could explore extending the framework to unify even more stages in a typical recommendation pipeline, such as pre-rank and rerank. This would lead to an even more comprehensive end-to-end generative system.
  • Application to Industrial Scenarios: While demonstrated on public datasets, the authors plan to apply UniGRF to specific industrial scenarios. This involves addressing real-world complexities and larger scales.
  • Extending Parameters for Performance Improvements: The scaling law analysis on MovieLens-20M showed promising results. The authors intend to further investigate and achieve performance improvements by significantly extending model parameters on industrial datasets, pushing the boundaries of large recommendation models.

7.3. Personal Insights & Critique

The UniGRF paper presents a compelling and timely solution to a long-standing problem in recommendation systems. The idea of unifying retrieval and ranking within a single generative model is a natural evolution given the success of Transformer-based LLMs in multi-task learning.

Personal Insights:

  • Elegant Solution to Information Loss: The most significant insight is how the paper re-frames information loss not as a problem of transferring between disparate components, but as a problem solvable by integration. By making the retrieval and ranking tasks different "heads" or "outputs" of the same underlying generative Transformer, information inherently flows through shared representations. This is much more elegant than complex knowledge distillation or joint optimization schemes between separate models.
  • Active Collaboration vs. Passive Transfer: The Ranking-Driven Enhancer is a clever mechanism for active collaboration. Instead of merely having ranking "inform" retrieval passively, it actively generates high-quality feedback (hard negatives, potential positives) that directly improves the retrieval model's learning. This iterative self-improvement loop is powerful and potentially applicable beyond recommendation, wherever a higher-precision component can guide the training of a lower-precision one.
  • Addressing Training Dynamics: The Gradient-Guided Adaptive Weighter is crucial. It highlights a practical challenge in multi-task learning: even if tasks are unified, their different complexities and convergence rates can lead to suboptimal training. This adaptive weighting approach is a robust way to manage this and could be generalized to other multi-task generative models where sub-tasks have varying training characteristics.
  • Scalability for the LLM Era: The explicit focus on model-agnosticism and scaling laws positions UniGRF well for the future of large recommendation models. As LLMs continue to grow, frameworks that can seamlessly integrate larger generative backbones will be essential.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Computational Cost of Hard Negative Mining (even if minimal): While the paper states the Ranking-Driven Enhancer has minimal overhead, calculating score_ranking and score_retrieval for all candidate negative samples (or a subset) in each epoch, even with a linear complexity function, can still add noticeable training time, especially with very large item pools and mini-batches. The paper could offer more detailed analysis of this practical runtime impact beyond just "minimal overhead."

  • Hyperparameter Sensitivity: The Gradient-Guided Adaptive Weighter introduces $T$ (temperature coefficient) and $\lambda_a, \lambda_b$ (scaling hyperparameters). The Ranking-Driven Enhancer introduces $m$ (number of hard negatives) and $\alpha$ (potential favorite threshold). Optimizing these hyperparameters can be tricky, especially as the optimal $m$ varied significantly between the large and small datasets. A more robust, possibly self-adaptive, way to set these parameters could be explored.

  • Definition of "Hard Negative": The relative score formula for hard negatives, $score = score_{\mathrm{retrieval}} \cdot \left( \frac{score_{\mathrm{retrieval}}}{score_{\mathrm{ranking}}} - 1 \right)$, is interesting. It assigns high scores to items that the retrieval model rates highly but the ranking model does not, rather than to items a user has explicitly shown dislike for, and it becomes zero or negative as soon as score_ranking matches or exceeds score_retrieval. Clarifying the intuitive meaning of this relative score, particularly in that boundary regime, would make its precise mechanism easier to follow (a small numeric illustration follows this list).

  • Generalizability of sigmoid for retrieval scores: Applying a sigmoid to the similarity score $\mathrm{sim}(e_i', e_S)$ to obtain score_retrieval simplifies the formulation but could mask the true distribution of similarity values. Depending on the embedding space, inner-product values can vary widely, and a sigmoid may compress important information. Further analysis of this choice, or of alternatives, would be valuable.

  • Complexity of Unifying More Stages: While a promising future direction, unifying pre-rank and rerank will exponentially increase the complexity of defining distinct output heads and loss functions within a single model. The current work handles two stages, but extending to four or more stages might require more sophisticated ways to manage the outputs and their interdependencies without making the model architecture unwieldy.
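
As referenced in the "Definition of Hard Negative" point above, the toy computation below (invented numbers, for illustration only) shows how the quoted relative score behaves: it is large only when the retrieval score clearly exceeds the ranking score, zero when the two agree, and negative when ranking rates the item higher.

```python
# Toy values (invented) for the relative score quoted above.
def hard_negative_score(s_retrieval, s_ranking):
    return s_retrieval * (s_retrieval / s_ranking - 1.0)

print(hard_negative_score(0.9, 0.3))   # ≈ 1.8  -> retrieval likes it, ranking does not: a prime hard negative
print(hard_negative_score(0.9, 0.9))   #   0.0  -> the two stages agree: not selected
print(hard_negative_score(0.3, 0.9))   # ≈ -0.2 -> ranking rates it higher: never selected as a hard negative
```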

    Overall, UniGRF is an important contribution, providing a solid foundation for building truly unified and performant generative recommendation systems. It offers practical modules to manage the inherent challenges of multi-stage optimization within such a framework, pushing the boundaries of what's possible in modern recommender systems.
