Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model
TL;DR Summary
The study introduces the Unified Generative Recommendation Framework (UniGRF), which unifies the retrieval and ranking stages of recommendation systems within a single generative model, improving performance through inter-stage information sharing and dynamically balanced optimization, and outperforming existing models in extensive experiments.
Abstract
In recommendation systems, the traditional multi-stage paradigm, which includes retrieval and ranking, often suffers from information loss between stages and diminishes performance. Recent advances in generative models, inspired by natural language processing, suggest the potential for unifying these stages to mitigate such loss. This paper presents the Unified Generative Recommendation Framework (UniGRF), a novel approach that integrates retrieval and ranking into a single generative model. By treating both stages as sequence generation tasks, UniGRF enables sufficient information sharing without additional computational costs, while remaining model-agnostic. To enhance inter-stage collaboration, UniGRF introduces a ranking-driven enhancer module that leverages the precision of the ranking stage to refine retrieval processes, creating an enhancement loop. Besides, a gradient-guided adaptive weighter is incorporated to dynamically balance the optimization of retrieval and ranking, ensuring synchronized performance improvements. Extensive experiments demonstrate that UniGRF significantly outperforms existing models on benchmark datasets, confirming its effectiveness in facilitating information transfer. Ablation studies and further experiments reveal that UniGRF not only promotes efficient collaboration between stages but also achieves synchronized optimization. UniGRF provides an effective, scalable, and compatible framework for generative recommendation systems.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is the unification of the retrieval and ranking stages in recommendation systems into a single generative model, aiming to overcome information loss and improve overall performance. The title, Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model, succinctly captures this dual objective.
1.2. Authors
The paper lists several authors from various institutions:
- Luankang Zhang, Hao Wang*, Defu Lian, Enhong Chen: University of Science and Technology of China, Hefei, Anhui, China. Hao Wang is marked with an asterisk, typically indicating a corresponding author.
- Kenan Song, Yi Quan Lee, Wei Guo, Yong Liu: Huawei Noah's Ark Lab, Singapore; Huifeng Guo: Huawei Noah's Ark Lab, Shenzhen, China.
- Yawen Li: Beijing University of Posts and Telecommunications, Beijing, China.
The affiliations suggest a collaborative effort between academia (University of Science and Technology of China, Beijing University of Posts and Telecommunications) and industry research (Huawei Noah's Ark Lab), indicating a blend of theoretical rigor and practical application focus.
1.3. Journal/Conference
The paper's venue line is an unfilled ACM template placeholder: "In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA, 12 pages. https://doi.org/XXXXXXX.XXXXXXX". This indicates that it is intended for publication at an ACM conference. ACM (Association for Computing Machinery) conferences are highly reputable venues in computer science, including information systems and recommendation systems, suggesting a peer-reviewed publication of significant standing in the field. The placeholder implies the specific conference name was yet to be confirmed or was omitted in the preprint.
1.4. Publication Year
The paper's metadata lists Published at (UTC): 2025-04-23T06:43:54.000Z, indicating a preprint publication date in April 2025.
1.5. Abstract
The paper addresses the challenge of information loss and diminished performance in traditional multi-stage recommendation systems (retrieval and ranking). Inspired by advances in generative models from natural language processing, it proposes the Unified Generative Recommendation Framework (UniGRF). UniGRF integrates both retrieval and ranking into a single generative model by treating them as sequence generation tasks, ensuring sufficient information sharing without additional computational costs and maintaining model-agnosticism. To enhance collaboration, it introduces a ranking-driven enhancer module that uses the ranking stage's precision to refine retrieval processes, creating an enhancement loop. Additionally, a gradient-guided adaptive weighter dynamically balances the optimization of both stages for synchronized performance improvements. Extensive experiments demonstrate UniGRF's superior performance over existing models on benchmark datasets, confirming its effectiveness in information transfer, inter-stage collaboration, and synchronized optimization. The paper concludes that UniGRF is an effective, scalable, and compatible framework for generative recommendation systems.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2504.16454
- PDF Link: https://arxiv.org/pdf/2504.16454v1.pdf

The paper is currently available as a preprint on arXiv (version 1), indicating it has been submitted for peer review but its final, officially published version might still undergo revisions or appear in the specified ACM conference.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the information loss and diminished performance inherent in traditional multi-stage recommendation systems, particularly between the retrieval and ranking stages. In industrial applications, recommendation systems typically operate in two main stages:
- Retrieval Stage: A low-precision model efficiently sifts through a vast item pool (all available items) to select a smaller candidate set of potentially relevant items. This stage prioritizes efficiency.
- Ranking Stage: A higher-precision model then accurately predicts user preferences within this smaller candidate set to recommend the most suitable items. This stage prioritizes accuracy.

This multi-stage paradigm, while practical for efficiency, suffers from several issues:

- Information Loss: Training separate models for each stage means information learned by one stage might not be fully utilized by the other. For example, the ranking model's fine-grained understanding of user preferences might not directly inform or improve the retrieval model's initial selection.
- Data Bias: The candidate set generated by the retrieval stage introduces sample selection bias for the ranking stage, as the ranking model only sees a subset of items, not the entire item pool.
- Lack of Collaboration: Traditional approaches often train these stages independently or use basic mechanisms for information transfer (e.g., from the data or loss perspectives), which are insufficient to address the deep interdependencies.

Recent advancements in generative models, particularly large language models (LLMs) inspired by natural language processing (NLP), have shown great promise in recommendation systems. These autoregressive paradigms predict sequences of items, similar to how LLMs generate text. However, current generative recommendation models often operate in a single-stage manner or still require separate training for each stage, perpetuating the information loss problem. The paper identifies two key challenges in unifying these stages within a single generative model:

- Efficient Collaboration Mechanism: Retrieval and ranking are inherently interdependent. A unified model needs a mechanism to leverage these dependencies for mutual enhancement.
- Synchronized Optimization: Different stages often have different loss functions and convergence rates. Without careful management, optimizing one stage might negatively impact the other (e.g., ranking performance declining while retrieval is still improving), preventing optimal overall performance.
2.2. Main Contributions / Findings
The paper proposes the Unified Generative Recommendation Framework (UniGRF) to address these challenges. Its primary contributions are:
- Novel Unified Generative Framework: UniGRF is presented as the first framework to integrate retrieval and ranking into a single generative model by reformulating both tasks as sequence generation tasks. This approach inherently enables sufficient information sharing through shared parameters and reduces time and space complexity by half compared to separate models, while being model-agnostic (it can integrate with any autoregressive generative model).
- Ranking-Driven Enhancer Module: To facilitate inter-stage collaboration, this module leverages the high precision of the ranking stage to generate high-quality samples for the retrieval stage. It includes:
  - Hard-to-Detect Disliked Items Generator: Identifies challenging negative samples (items highly scored by retrieval but low by ranking) to improve retrieval training effectiveness.
  - Potential Favorite Items Generator: Corrects mislabeling by identifying items that are actually liked by the user (high ranking score) but were initially sampled as negatives, reassigning them as positive examples.

  Together, these create a mutually reinforcing loop with minimal computational overhead.
- Gradient-Guided Adaptive Weighter: To ensure synchronized optimization across stages, this module dynamically monitors the gradient update rates (approximated by loss decrease rates) of retrieval and ranking. It adaptively adjusts the weights of their respective loss functions to balance their convergence speeds, leading to simultaneous performance improvements for both stages.
- Extensive Experimental Validation: Through comprehensive experiments on MovieLens-1M, MovieLens-20M, and Amazon-Books datasets, UniGRF (instantiated with HSTU and Llama architectures) demonstrates significant outperformance over state-of-the-art baselines in both retrieval and ranking metrics. Ablation studies confirm the effectiveness of the enhancer and weighter modules, and further analysis highlights UniGRF's ability to achieve synchronized optimization and follow scaling laws.

In summary, the key findings are that unifying retrieval and ranking within a single generative model, when coupled with intelligent collaboration mechanisms and synchronized optimization strategies, effectively mitigates information loss and significantly boosts recommendation performance, particularly in the ranking stage, which is crucial for overall quality.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following foundational concepts in recommendation systems and deep learning:
- Recommendation Systems: Systems designed to predict user preferences for items (e.g., movies, products, articles) and suggest items that users might like. They typically rely on historical interaction data.
- Multi-stage Paradigm (Retrieval and Ranking): A common architecture in large-scale recommendation systems due to computational constraints.
  - Retrieval Stage: The initial stage responsible for efficiently narrowing down a vast item pool (millions to billions of items) to a much smaller candidate set (hundreds to thousands) of potentially relevant items. It prioritizes recall and speed.
  - Ranking Stage: The subsequent stage that takes the candidate set from retrieval and precisely orders these items based on their estimated relevance or user preference. It prioritizes precision and accuracy.
- Generative Models: A class of artificial intelligence models that learn to generate new data instances that resemble the training data. In the context of recommendation, this means generating a sequence of items that a user might interact with next, rather than just predicting a score for existing items.
- Autoregressive Paradigm: A type of generative model where the generation of the current element in a sequence depends on the preceding elements. This is commonly seen in natural language processing (NLP), where models predict the next word given the previous words. In recommendation, it means predicting the next item given the user's historical interaction sequence.
- Transformer Architecture: A neural network architecture introduced in 2017, foundational to modern large language models (LLMs). It relies heavily on a mechanism called self-attention.
  - Self-Attention: A mechanism within Transformers that allows the model to weigh the importance of different parts of an input sequence when processing each element. For a given element in a sequence, self-attention calculates how much it relates to all other elements in the same sequence, allowing it to capture long-range dependencies. The core idea is to compute three vectors for each input token: Query (Q), Key (K), and Value (V). The attention score is calculated as a scaled dot-product between $Q$ and $K$, followed by a softmax function, which is then multiplied by $V$. The formula for Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ (a minimal code sketch of this formula follows this concept list). Where:
    - $Q$: Query matrix, derived from the input embeddings.
    - $K$: Key matrix, derived from the input embeddings.
    - $V$: Value matrix, derived from the input embeddings.
    - $d_k$: Dimensionality of the key vectors, used for scaling to prevent very large dot products that push the softmax into regions with tiny gradients.
    - $QK^T$: Dot product between Query and Key matrices, representing similarity.
    - $\mathrm{softmax}$: Normalization function that converts scores into probabilities.
- Embeddings: Low-dimensional vector representations of discrete entities (like users or items) that capture their semantic meaning. Items and users are mapped from sparse IDs to dense vectors, allowing neural networks to process them.
- Loss Functions: Mathematical functions that quantify the difference between the predicted output of a model and the true target output. The goal of training is to minimize this loss.
  - Sampled Softmax Loss: A variant of softmax loss used when the number of possible output classes (e.g., items in a large item pool) is extremely large. Instead of calculating probabilities over all possible items, it calculates softmax over the true positive item and a randomly sampled subset of negative items. This significantly reduces computational cost.
  - Binary Cross-Entropy (BCE) Loss: A loss function commonly used for binary classification tasks (e.g., predicting click or no click, 0 or 1). It measures the difference between two probability distributions. For a true label $y$ and a predicted probability $p$, the BCE loss is: $ \mathcal{L}_{\mathrm{BCE}}(y, p) = -(y \log(p) + (1-y) \log(1-p)) $ Where:
    - $y$: The true binary label (e.g., 1 for click, 0 for no click).
    - $p$: The predicted probability of a positive outcome (e.g., click).
- Gradient Descent and Backpropagation: The fundamental algorithms used to train neural networks.
Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize the loss function by moving in the direction opposite to the gradient of the loss. Backpropagation is the method used to efficiently calculate these gradients.
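To make the attention formula above concrete, here is a minimal, self-contained PyTorch sketch of scaled dot-product attention (the function name and tensor shapes are our own illustration, not from the paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V for batched inputs.

    Q, K: (batch, seq_len, d_k); V: (batch, seq_len, d_v).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # similarity scores
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ V                               # weighted sum of values

# Tiny usage example: self-attention with Q = K = V
x = torch.randn(1, 4, 8)                             # batch=1, seq_len=4, d=8
print(scaled_dot_product_attention(x, x, x).shape)   # torch.Size([1, 4, 8])
```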
3.2. Previous Works
The paper contextualizes its work by discussing various prior studies in retrieval, ranking, and cascade frameworks, highlighting the evolution towards generative models and multi-stage integration.
3.2.1. Retrieval Models for Recommendation
- Matrix Factorization (MF) [1]: A foundational approach that represents users and items as low-dimensional vectors (embeddings). User-item compatibility is determined by the inner product of their respective vectors. BPR-MF [37] is a prominent example, using Bayesian Personalized Ranking to optimize for pairwise preferences.
- Sequential Recommendation Models: These models treat user behavior as a sequence of interactions and use techniques from language modeling to predict the next item.
  - GRU4Rec [17]: One of the early works using recurrent neural networks (RNNs) with Gated Recurrent Units (GRUs) to capture short-term user preferences from sequences. RNNs process sequences by maintaining a hidden state that is updated at each step, making them suitable for sequential data. GRUs are a type of RNN that help address the vanishing gradient problem.
  - SASRec [23]: Introduced Transformer-based self-attention mechanisms to model dynamic interest patterns in user sequences, allowing it to capture long-range dependencies more effectively than RNNs.
  - BERT4Rec [43]: Adapted the BERT (Bidirectional Encoder Representations from Transformers) architecture for sequential recommendation, using a bidirectional self-attention mechanism.
- LLM-influenced Retrieval: Recent trends leverage Large Language Models (LLMs) to enhance user and item representation:
  - RLMRec [36]: Uses LLMs for semantic representation to improve item vector quality.
  - LLMRec [59]: Augments user-item interaction data using LLMs alongside internal semantic representations.
- Generative Retrieval Models: The shift towards generative models in recommendation.
  - HSTU (Hierarchical Sequential Transduction Units) [67]: A model by Meta that demonstrated model capabilities scale with parameters, suggesting ranking and retrieval tasks can be reframed as causal autoregressive sequence-to-sequence generation problems. It introduced a point-wise aggregated attention mechanism and the M-FALCON algorithm for efficiency.
  - HLLM (Hierarchical Large Language Model) [3]: A two-layer framework where an Item LLM integrates text features and a User LLM predicts future interests.
3.2.2. Ranking Models for Recommendation
- Feature Interaction Models: Early models focused on explicitly or implicitly modeling interactions between various features (user, item, context).
  - DCN (Deep & Cross Network) [56]: Combines a deep neural network (for complex non-linear interactions) with a cross network (for explicit feature interactions).
  - DeepFM [10]: Integrates Factorization Machines (for capturing second-order feature interactions) and deep neural networks (for higher-order interactions).
  - AutoInt [42]: Uses a self-attention network to implicitly model feature interactions.
  - GDCN (Gated Deep & Cross Network) [48]: Enhances DCN with a gated cross network to capture more complex relationships.
- Sequential Ranking Models: Adapt autoregressive paradigms for ranking.
  - DIN (Deep Interest Network) [75]: Introduced a target-aware attention mechanism that dynamically weighs parts of a user's behavior sequence based on the candidate item being ranked, modeling diverse user interests.
  - DIEN (Deep Interest Evolution Network) [74]: Improved upon DIN by incorporating GRUs to model the evolution of a user's interests over time within their sequence.
  - SIM (Search Interest Model) [33]: Uses search techniques on long sequences to identify relevant behaviors and reduce noise for user representation.
- LLM-influenced Ranking:
  - BAHE [9]: Uses Transformer blocks of an LLM to generate improved user behavior representations, with final blocks fine-tuned for CTR prediction.
  - HSTU [67]: Also adapted for ranking by appending candidate items to user sequences and adding MLP prediction heads.
- Llama [45]: While originally an LLM for NLP, its Transformer-based architecture has been adapted for recommendation tasks, leveraging its strong sequence modeling capabilities in a generative setting.
3.2.3. Cascade Framework for Retrievers and Rankers
These frameworks focus on improving information transfer between separately trained retrieval and ranking models to reduce sample bias.
- Joint Optimization Methods [8, 57]: Aim to connect stage-specific objectives and jointly optimize them to balance efficiency and effectiveness.
- ECM (Explicit Click Model) [41]: Incorporates unexposed data and refines the task into multi-label classification to address sample selection bias.
- RankFlow [34]: Trains models for different stages separately, then jointly optimizes them by minimizing a global loss function to facilitate inter-stage information transfer.
- CoRR (Cooperative Retriever and Ranker) [20]: Uses Kullback-Leibler (KL) divergence to distill knowledge between stages, aligning them for effective knowledge transfer. The KL divergence measures how one probability distribution diverges from a second, expected probability distribution.
3.3. Technological Evolution
The evolution of recommendation systems, as highlighted by the paper, has progressed through several phases:
- Early Systems (e.g., MF): Focused on collaborative filtering, representing users and items as static embeddings.
- Sequential Models (e.g., GRU4Rec, SASRec): Recognized the importance of interaction order, adopting RNNs and later Transformers to model user behavior sequences, moving towards autoregressive prediction.
- Large Generative Models (e.g., HSTU, HLLM, Llama): Inspired by LLMs, these models reformulate recommendation as a sequence generation task, leveraging scaling laws (performance improves with more parameters) to achieve state-of-the-art results.
- Multi-stage Optimization (e.g., RankFlow, CoRR): Addressed the practical cascade framework by attempting to mitigate information loss and bias between retrieval and ranking, but often still treating them as separate models with explicit transfer mechanisms.

This paper's work, UniGRF, fits into the most recent wave, pushing the boundaries of generative recommendation. It directly tackles the multi-stage problem within the generative paradigm, aiming to unify retrieval and ranking into a single generative model, rather than just improving transfer between two distinct ones.
3.4. Differentiation Analysis
Compared to the main methods in related work, UniGRF offers several core differences and innovations:
- Unified Generative Paradigm: Unlike HSTU and HLLM, which are generative models but typically applied to a single stage or requiring separate training, UniGRF is the first to explicitly unify both retrieval and ranking within a single generative model. This fundamentally changes how information sharing occurs, moving from explicit transfer mechanisms to inherent sharing through shared parameters.
- Inherent Information Sharing vs. Explicit Transfer: Traditional cascade frameworks like RankFlow and CoRR focus on information transfer (e.g., knowledge distillation, global loss optimization) between separate retrieval and ranking models. UniGRF eliminates the need for complex explicit transfer mechanisms by using a single model whose parameters are fully shared, allowing for more sufficient information flow without additional computational overhead.
- Addressing Data Bias and Information Loss at the Source: By generating negative samples from the entire item pool in the retrieval stage and using a unified model, UniGRF aims to learn the true data distribution, reducing data bias for the ranking stage and mitigating information loss from the outset.
- Integrated Collaboration and Optimization Mechanisms: UniGRF introduces specific modules designed for a unified generative context:
  - The Ranking-Driven Enhancer proactively generates high-quality samples (hard negatives, potential positives) based on the ranking model's precision, directly refining retrieval training in a mutually reinforcing loop. This is distinct from passive information transfer.
  - The Gradient-Guided Adaptive Weighter directly addresses the challenge of asynchronous convergence speeds when training multiple tasks within a single model. This dynamic loss weighting mechanism is tailored to optimize the unified generative framework's performance synchronously.
- Model-Agnostic and Scalable Framework: UniGRF is designed as a plug-and-play framework that can integrate with any autoregressive generative model architecture (e.g., HSTU, Llama). This emphasizes its flexibility and scalability, allowing it to leverage scaling laws effectively.

In essence, UniGRF moves beyond merely connecting retrieval and ranking to truly integrating them within the generative paradigm, offering a more holistic and efficient solution to the multi-stage recommendation problem.
4. Methodology
The Unified Generative Recommendation Framework (UniGRF) is designed to integrate retrieval and ranking into a single generative model. It achieves this through a novel task reformulation, a ranking-driven enhancer module for inter-stage collaboration, and a gradient-guided adaptive weighter for synchronized optimization. An overview of the UniGRF architecture is shown in the following figure (Figure 1 from the original paper).
Figure 1: UniGRF framework overview. The schematic illustrates the overall design of the unified recommendation model, including the retrieval and ranking modules, the Ranking-Driven Enhancer, and the Gradient-Guided Adaptive Weighter, highlighting the information flow and interaction between components.
4.1. A Unified Generative Recommendation Framework
The core idea of UniGRF is to treat both the retrieval and ranking tasks as sequence generation tasks within a single autoregressive Transformer-based generative model. This allows for inherent information sharing between stages.
For a given user $u$, their interaction history is represented as a sequence $S_u = \{(i_1, b_1), (i_2, b_2), \ldots, (i_n, b_n)\}$, where $i_t$ is the $t$-th interacted item and $b_t$ is the interaction type (1 for a click, 0 for no click). The sequence length $n$ is fixed, with padding or truncation applied as needed.
4.1.1. Encoding and Generative Process
First, each item $i_t$ and its corresponding interaction type $b_t$ in the user behavior sequence are encoded into embeddings. These embeddings are concatenated to form the input sequence $E = [e_{i_1}, e_{b_1}, \ldots, e_{i_n}, e_{b_n}]$.

This input sequence is then fed into an autoregressive Transformer-based generative model. The Transformer processes these embeddings and produces a sequence of output embeddings:

$ [\hat{e}_{b_1}, \hat{e}_{i_2}, \ldots, \hat{e}_{b_n}, \hat{e}_{i_{n+1}}] = \mathrm{GM}(E) $

Where:

- $E$: The input sequence of item and behavior embeddings.
- $\mathrm{GM}(\cdot)$: The Transformer-based generative model.
- $\hat{e}_{b_t}$: The predicted latent embedding for the click probability of the $t$-th item. This output is used for the ranking stage.
- $\hat{e}_{i_{t+1}}$: The predicted latent embedding for the $(t+1)$-th item the user will interact with. This output is used for the retrieval stage.

By strategically distinguishing the output embeddings based on their positions ($\hat{e}_{b_t}$ for click prediction and $\hat{e}_{i_{t+1}}$ for next-item prediction), UniGRF unifies both retrieval and ranking within a single generative model. This design makes UniGRF plug-and-play and model-agnostic, meaning it can seamlessly integrate with any autoregressive generative model architecture. This unification also effectively reduces time and space complexity by half compared to training separate models.
4.1.2. Constraint on the Retrieval Stage in UniGRF
The retrieval stage aims to predict the next item a user is likely to interact with. To achieve this, the model compares the predicted item embedding $\hat{e}_{i_{t+1}}$ with all items in the item pool $\mathcal{I}$ to identify the most similar items, which form the candidate set $\mathcal{C}$.

To optimize this stage, Sampled Softmax Loss is employed. This loss function encourages the predicted item embedding $\hat{e}_{i_t}$ to be maximally similar to the actual next interacted item's embedding $e_{i_t}$ (the true positive). Simultaneously, it pushes $\hat{e}_{i_t}$ away from the embeddings of randomly sampled negative items from the item pool $\mathcal{I}$. The loss for the retrieval stage is defined as:

$ \mathcal{L}_{\mathrm{retrieval}} = \sum_{u \in \mathcal{U}} \sum_{t=2}^{n} \mathcal{L}_{\mathrm{SSM}}\left(\mathrm{sim}(\hat{e}_{i_t}, e_{i_t}), \left\{\mathrm{sim}(\hat{e}_{i_t}, e_j)\right\}_{j \in \mathcal{N}}\right) $

Where:

- $\mathcal{L}_{\mathrm{retrieval}}$: The loss for the retrieval stage.
- $\sum_{u \in \mathcal{U}}$: Summation over all users in the user set $\mathcal{U}$.
- $t = 2$ to $n$: Summation over the sequence positions (starting from the second item, so that each prediction is conditioned on at least one preceding interaction).
- $\mathcal{L}_{\mathrm{SSM}}$: The Sampled Softmax Loss function.
- $\mathrm{sim}(a, b)$: A similarity function, specifically defined as the inner product between two embedding vectors $a$ and $b$.
- $\hat{e}_{i_t}$: The predicted latent embedding for the $t$-th item.
- $e_{i_t}$: The actual embedding of the $t$-th interacted item (the positive sample).
- $\mathcal{N}$: A set of randomly sampled negative items from the item pool $\mathcal{I}$.
- $e_j$: The embedding of a negative item $j$.

This loss aims to make the predicted item embedding close to the actual next item embedding while pushing it away from negative samples, thus optimizing the retrieval performance.
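For reference, here is a minimal sketch of a sampled softmax loss consistent with the definition above; the tensor shapes and the cross-entropy formulation (positive logit at index 0) are our assumptions:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(pred_emb, pos_emb, neg_emb):
    """pred_emb: (batch, d) predicted next-item embeddings;
    pos_emb:  (batch, d) true next-item embeddings;
    neg_emb:  (batch, K, d) embeddings of K sampled negatives.
    Similarity is the inner product, as in the paper."""
    pos_logit = (pred_emb * pos_emb).sum(-1, keepdim=True)     # (batch, 1)
    neg_logit = torch.einsum("bd,bkd->bk", pred_emb, neg_emb)  # (batch, K)
    logits = torch.cat([pos_logit, neg_logit], dim=-1)         # (batch, 1+K)
    target = torch.zeros(logits.size(0), dtype=torch.long)     # positive is index 0
    return F.cross_entropy(logits, target)
```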
4.1.3. Constraint on the Ranking Stage in UniGRF
The ranking stage focuses on predicting the probability of a user clicking on items within the candidate set (generated by the retrieval stage). Items are then ranked by these probabilities.
To predict this click probability, a small neural network $f_\theta$ is applied to the predicted latent embedding $\hat{e}_{b_t}$ from the Transformer for each input item $i_t$. The output is then passed through a sigmoid function to obtain a click score $\hat{b}_t$. This score is compared with the user's actual interaction feedback $b_t$ using Binary Cross-Entropy (BCE) Loss:

$ \mathcal{L}_{\mathrm{ranking}} = \sum_{u \in \mathcal{U}} \sum_{t=1}^{n} \mathcal{L}_{\mathrm{BCE}}\left(b_t, \hat{b}_t\right), \qquad \hat{b}_t = \sigma\left(f_\theta(\hat{e}_{b_t})\right) $

Where:

- $\mathcal{L}_{\mathrm{ranking}}$: The loss for the ranking stage.
- $\sum_{u \in \mathcal{U}}$: Summation over all users.
- $t = 1$ to $n$: Summation over all items in the interaction sequence.
- $\mathcal{L}_{\mathrm{BCE}}$: The Binary Cross-Entropy Loss.
- $\hat{b}_t$: The predicted click score for item $i_t$.
- $b_t$: The actual binary interaction type (click or no click) for item $i_t$.
- $f_\theta$: A small neural network with parameters $\theta$.
- $\sigma$: The sigmoid activation function, which squashes values between 0 and 1, suitable for probability prediction.

By unifying retrieval and ranking into a single generative model, UniGRF effectively transfers knowledge between stages through shared parameters. The retrieval stage benefits from learning the true data distribution by sampling negatives from the entire item pool, reducing bias. The ranking stage provides precise user interest signals, which then implicitly enhance the retrieval task through this shared learning.
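A minimal sketch of the ranking head, assuming a two-layer MLP for $f_\theta$ (the paper says only "a small neural network"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankingHead(nn.Module):
    """Maps latent embeddings to click scores via a small MLP f_theta."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.f_theta = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, rank_out):  # rank_out: (batch, n, d) latent embeddings
        return torch.sigmoid(self.f_theta(rank_out)).squeeze(-1)  # (batch, n)

def ranking_loss(click_scores, labels):
    """BCE between predicted click scores and observed interactions b_t."""
    return F.binary_cross_entropy(click_scores, labels.float())
```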
4.2. Ranking-Driven Enhancer
The Ranking-Driven Enhancer module is introduced to foster active collaboration between the retrieval and ranking stages, beyond just parameter sharing. It leverages the ranking model's precision to generate high-quality samples for the retrieval stage, creating a mutually reinforcing loop.
4.2.1. Hard-to-Detect Disliked Items Generator
Traditional retrieval stages often use random negative sampling, which can result in negative samples that are too easy for the model to distinguish from positive ones, limiting training effectiveness. This module uses the ranking model's precision to identify more challenging negative samples, termed hard-to-detect disliked items (collected in a set denoted $\mathcal{H}$ here). These items are hard negatives because the retrieval model might mistakenly give them high scores, while the more precise ranking model correctly identifies them as undesired.

The process to construct the set $\mathcal{H}$ involves comparing scores from both stages:
- Ranking Stage Score: For a given negative sample $j$ (from the randomly sampled negative set), its ranking score is calculated using the small neural network $f_\theta$ and a sigmoid activation:

  $ \mathrm{score}_{\mathrm{ranking}}(j) = \sigma\left(f_\theta(e_j)\right) $

  Where:
  - $\mathrm{score}_{\mathrm{ranking}}(j)$: The click probability predicted by the ranking stage for a negative item $j$.
  - $e_j$: The embedding of the negative sample item.
  - $f_\theta$: The scoring function of the ranking stage.
  - $\sigma$: The sigmoid activation function.
- Retrieval Stage Score: The retrieval score for a negative sample is calculated based on its similarity to the predicted user-interacted item embedding $\hat{e}_{i_t}$:

  $ \mathrm{score}_{\mathrm{retrieval}}(j) = \sigma\left(\mathrm{sim}(\hat{e}_{i_t}, e_j)\right) $

  Where:
  - $\mathrm{score}_{\mathrm{retrieval}}(j)$: The similarity score between the predicted next item and the negative sample, passed through a sigmoid.
  - $\hat{e}_{i_t}$: The predicted user-interacted item embedding from the retrieval stage.
  - $\mathrm{sim}(\cdot, \cdot)$: The similarity function (inner product).
- Relative Score for Hard Negatives: A relative score is defined to identify items that are highly ranked by retrieval but poorly by ranking, indicating they are hard-to-detect disliked items:

  $ \mathrm{score}(j) = \mathrm{score}_{\mathrm{retrieval}}(j) - \mathrm{score}_{\mathrm{ranking}}(j) $

  Where:
  - $\mathrm{score}(j)$: The relative score. A high score indicates an item that the retrieval stage found relevant but the ranking stage did not, making it a hard negative. The gap term grows when $\mathrm{score}_{\mathrm{retrieval}}$ is much higher than $\mathrm{score}_{\mathrm{ranking}}$.

The top $m$ items with the highest relative scores are selected as hard-to-detect disliked items and added to the set $\mathcal{H}$. This set is then combined with new randomly sampled negatives to form the negative sample set for the next training epoch. This iterative process, where the model learns to distinguish hard negatives, facilitates information transfer from the more precise ranking stage to the retrieval stage.
4.2.2. Potential Favorite Items Generator
This module addresses another issue in negative sampling: some randomly selected negative samples might actually be potential favorite items for the user. Mislabeling these as negative can introduce bias and degrade performance.
For each item $j$ in the previous epoch's randomly generated negative sample set $\mathcal{N}$, the module computes its ranking score $\mathrm{score}_{\mathrm{ranking}}(j)$. If an item's ranking score exceeds a predefined threshold (denoted $\tau$ here), it is identified as a potential favorite item and stored in the set $\mathcal{P}$.

In subsequent training iterations, potential favorite items in $\mathcal{P}$ are relabeled as positive samples for the retrieval stage. This correction ensures the generative model learns more accurately from these examples, enhancing its understanding of user preferences.

By generating hard negatives and correcting mislabelings, the Ranking-Driven Enhancer provides high-quality training data for the retrieval stage based on the ranking stage's fine-grained understanding of user preferences. This interaction creates a bidirectional enhancement loop due to the fully shared parameters of the unified model. Crucially, the computational overhead is minimal because ranking scores are calculated via the linear-complexity function $f_\theta$, and retrieval scores are implicitly handled within the retrieval loss calculation.
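Putting the two generators together, one enhancer step might look like the following simplified sketch; the difference-based relative score and the threshold value are assumptions based on the description above:

```python
import torch

def enhance_negatives(score_retrieval, score_ranking, neg_items, m, tau=0.9):
    """score_retrieval, score_ranking: (K,) sigmoid scores for K sampled negatives;
    neg_items: (K,) item ids; m: number of hard negatives to keep;
    tau: relabeling threshold (assumed value)."""
    # High retrieval score but low ranking score => hard-to-detect disliked item
    relative = score_retrieval - score_ranking
    hard_negatives = neg_items[torch.topk(relative, k=m).indices]
    # High ranking score => likely a mislabeled positive, relabel next epoch
    potential_favorites = neg_items[score_ranking > tau]
    return hard_negatives, potential_favorites
```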
4.3. Gradient-Guided Adaptive Weighter
Even with shared parameters and collaborative mechanisms, retrieval and ranking tasks often have different convergence speeds, making simultaneous optimization challenging. The Gradient-Guided Adaptive Weighter addresses this by dynamically balancing the optimization rates of the two stages.
4.3.1. Monitoring Convergence Speed
The module monitors the loss decrease rate of each stage as an approximation of its optimization speed. A faster loss decrease implies faster gradient updates and quicker convergence. The convergence rates for retrieval and ranking are calculated as follows:

$ r_{\mathrm{retrieval}}^{(t)} = \frac{\mathcal{L}_{\mathrm{retrieval}}^{(t)}}{\mathcal{L}_{\mathrm{retrieval}}^{(t-1)}}, \qquad r_{\mathrm{ranking}}^{(t)} = \frac{\mathcal{L}_{\mathrm{ranking}}^{(t)}}{\mathcal{L}_{\mathrm{ranking}}^{(t-1)}} $

Where:

- $r_{\mathrm{retrieval}}^{(t)}$: The convergence rate for the retrieval stage. A value greater than 1 implies the loss is increasing; less than 1 means it is decreasing.
- $r_{\mathrm{ranking}}^{(t)}$: The convergence rate for the ranking stage.
- $\mathcal{L}_{\mathrm{retrieval}}^{(t)}$: The retrieval loss at the current time step $t$.
- $\mathcal{L}_{\mathrm{retrieval}}^{(t-1)}$: The retrieval loss at the previous time step $t-1$.
- $\mathcal{L}_{\mathrm{ranking}}^{(t)}$: The ranking loss at the current time step $t$.
- $\mathcal{L}_{\mathrm{ranking}}^{(t-1)}$: The ranking loss at the previous time step $t-1$.

A larger value indicates slower convergence (or even divergence if $r > 1$), implying that the corresponding task needs more weight to accelerate its optimization.
4.3.2. Adaptive Weight Calculation
Based on these convergence rates, the module adaptively calculates weights $w_{\mathrm{retrieval}}^{(t)}$ and $w_{\mathrm{ranking}}^{(t)}$ for the retrieval and ranking losses, respectively, using a softmax-like scaling with a temperature coefficient:

$ w_{\mathrm{retrieval}}^{(t)} = \frac{\exp\left(\alpha \, r_{\mathrm{retrieval}}^{(t)} / T\right)}{\exp\left(\alpha \, r_{\mathrm{retrieval}}^{(t)} / T\right) + \exp\left(\beta \, r_{\mathrm{ranking}}^{(t)} / T\right)}, \qquad w_{\mathrm{ranking}}^{(t)} = 1 - w_{\mathrm{retrieval}}^{(t)} $

Where:

- $w_{\mathrm{retrieval}}^{(t)}$: The adaptive weight for the retrieval loss.
- $w_{\mathrm{ranking}}^{(t)}$: The adaptive weight for the ranking loss.
- $T$: The temperature coefficient. A smaller value will lead to a larger difference between $w_{\mathrm{retrieval}}^{(t)}$ and $w_{\mathrm{ranking}}^{(t)}$, making the weighting more aggressive. A larger $T$ smooths the weights, making them more similar.
- $\alpha, \beta$: Hyperparameters used to scale the losses of the two stages, bringing them to a similar magnitude. This is important because raw loss values can vary significantly between different loss functions.

The use of the softmax over convergence rates means that a task with a higher (slower) convergence rate will receive a relatively larger weight, pushing the model to pay more attention to it.
4.3.3. Final Loss Function
Finally, the adaptive weights are applied to the retrieval and ranking losses to form the total loss function for optimization at time step $t$:

$ \mathcal{L}^{(t)} = w_{\mathrm{retrieval}}^{(t)} \, \mathcal{L}_{\mathrm{retrieval}}^{(t)} + w_{\mathrm{ranking}}^{(t)} \, \mathcal{L}_{\mathrm{ranking}}^{(t)} $

This gradient-guided adaptive weighted loss enables synchronous optimization of both retrieval and ranking tasks. By dynamically adjusting loss weights, UniGRF ensures that neither stage is over-optimized or under-optimized, allowing the unified generative framework to achieve optimal performance by fully leveraging information transfer and inter-stage collaboration.
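A compact sketch of the weighter, assuming the softmax-with-temperature form reconstructed above ($\alpha$, $\beta$, and $T$ are hyperparameters):

```python
import torch

def weighted_total_loss(loss_retr, loss_rank, prev_retr, prev_rank,
                        alpha=1.0, beta=1.0, T=1.0):
    """loss_retr, loss_rank: current-step loss tensors;
    prev_retr, prev_rank: previous-step loss values (floats)."""
    # Convergence rates r = L^(t) / L^(t-1); values above 1 mean the loss rose
    r_retr = loss_retr.detach().item() / prev_retr
    r_rank = loss_rank.detach().item() / prev_rank
    # Softmax with temperature T: the slower-converging task gets more weight
    logits = torch.tensor([alpha * r_retr / T, beta * r_rank / T])
    w_retr, w_rank = torch.softmax(logits, dim=0).tolist()
    return w_retr * loss_retr + w_rank * loss_rank
```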
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three publicly available recommendation datasets, chosen for their varying sizes and widespread use in research:
- MovieLens-1M [14]:
- Source: GroupLens Research project.
- Scale: Approximately 1 million user ratings.
- Characteristics: Contains user ratings (1-5 stars) for movies, along with basic user information and movie metadata.
- Domain: Movies.
- MovieLens-20M [14]:
- Source: GroupLens Research project.
- Scale: Approximately 20 million user ratings.
- Characteristics: A larger version of MovieLens, covering ratings from 1995 to 2015.
- Domain: Movies.
- Amazon-Books [31]:
- Source: Subset of Amazon product reviews.
- Scale: Approximately 10 million user ratings and reviews (10,053,086 interactions; see Table 1).
- Characteristics: Focuses on the book category, providing a real-world, sparse dataset.
- Domain: Books.
5.1.1. Data Processing
For all datasets:
- User interaction records are arranged chronologically to form historical interaction sequences.
- Sequences with fewer than three interactions are filtered out.
- For rating data (e.g., 1-5 stars), binary processing is applied: ratings greater than 3 are marked as 1 (liked), otherwise 0 (disliked). This simplifies user interest representation.

The following are the results from Table 1 of the original paper:
| Datasets | #Users | #Items | #Inters. | Avg. n |
| MovieLens-1M | 6,040 | 3,706 | 1,000,209 | 165.6 |
| MovieLens-20M | 138,493 | 26,744 | 20,000,263 | 144.4 |
| Amazon-Books | 694,897 | 686,623 | 10,053,086 | 14.5 |
Where:
- #Users: Total number of unique users.
- #Items: Total number of unique items.
- #Inters.: Total number of user-item interactions.
- Avg. n: Average sequence length of user interactions.

These datasets were chosen because they are widely used benchmarks, cover different scales (from MovieLens-1M being relatively small to Amazon-Books being very large and sparse), and represent real-world user behavior, making them effective for validating the performance and scalability of recommendation methods.
5.1.2. Training, Validation, and Test Split
- Test Set: The item from the user's last interaction.
- Validation Set: The item from the user's second-to-last interaction.
- Training Set: All other interactions.
- Candidate Set for Retrieval: The entire item set is used as the candidate set to avoid sampling bias.
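In code, this leave-one-out split is straightforward (an illustrative sketch):

```python
def leave_one_out_split(sequence):
    """Last interaction -> test, second-to-last -> validation, rest -> train."""
    return sequence[:-2], sequence[-2], sequence[-1]

# Example: a user's chronologically ordered item ids
print(leave_one_out_split([10, 42, 7, 99, 3]))  # ([10, 42, 7], 99, 3)
```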
5.2. Evaluation Metrics
The paper uses standard evaluation metrics for both retrieval and ranking tasks.
5.2.1. Retrieval Metrics
These metrics are used to evaluate how well the model identifies relevant items from a large pool. $K$ denotes the number of top items considered in the recommendation list. (A minimal code sketch of these metrics follows this list.)

- Normalized Discounted Cumulative Gain at K (NDCG@K):
  - Conceptual Definition: NDCG measures the quality of a ranking, considering both the relevance of recommended items and their position in the list. It assigns higher scores to more relevant items appearing earlier in the list. NDCG@K calculates this score for the top $K$ recommendations.
  - Mathematical Formula: $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)} $ $ \mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{rel_{j}^{ideal}} - 1}{\log_2(j+1)} $ $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $rel_j$: The relevance score of the item at position $j$ in the recommended list (typically 1 if relevant, 0 if not).
    - $rel_j^{ideal}$: The relevance score of the item at position $j$ in the ideal (perfect) ranking list.
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at K, where relevant items at lower ranks are penalized logarithmically.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at K, the maximum possible DCG for a given set of relevant items.
    - $\mathrm{NDCG@K}$: The normalized DCG, typically ranging from 0 to 1.
- Hit Rate at K (HR@K):
  - Conceptual Definition: HR@K (also known as Recall@K) measures whether the target item (the one the user actually interacted with) is present in the top $K$ recommended items. It indicates how often the model successfully recommends at least one relevant item.
  - Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top K}}{\text{Total number of users}} $
  - Symbol Explanation:
    - "Number of users for whom the target item is in top K": Count of users for whom the ground truth item is among the top $K$ recommended items.
    - "Total number of users": Total number of users in the evaluation set.
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank 1, its reciprocal rank is 1; if at rank 2, it is 1/2, and so on. A higher MRR indicates that relevant items are found earlier in the ranked list.
  - Mathematical Formula: $ \mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\mathrm{rank}_u} $
  - Symbol Explanation:
    - $|\mathcal{U}|$: The total number of users.
    - $\mathrm{rank}_u$: The rank of the first relevant item for user $u$.

The paper sets $K$ to 10 and 50 for NDCG@K and HR@K.
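For reference, here is a minimal sketch of these three metrics under the paper's leave-one-out protocol, where each user has exactly one target item (our simplification; with a single relevant item, IDCG@K = 1):

```python
import numpy as np

def retrieval_metrics(rank, k):
    """rank: 1-based position of the target item among all candidates."""
    hit = 1.0 if rank <= k else 0.0                       # HR@K contribution
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0  # NDCG@K contribution
    mrr = 1.0 / rank                                      # MRR contribution
    return ndcg, hit, mrr

# Example: target item ranked 3rd for one user, K = 10
print(retrieval_metrics(rank=3, k=10))  # (0.5, 1.0, 0.333...)
```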
5.2.2. Ranking Metric
- Area Under the ROC Curve (AUC):
  - Conceptual Definition: AUC is a commonly used metric for binary classification tasks (like click-through rate (CTR) prediction in ranking). It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 0.5 indicates performance no better than random, while 1.0 indicates perfect classification. (A short implementation sketch follows this list.)
  - Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{positive class}} \sum_{j \in \text{negative class}} \mathbb{I}(\text{score}_i > \text{score}_j)}{|\text{positive class}| \cdot |\text{negative class}|} $
  - Symbol Explanation:
    - $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition is true, and 0 otherwise.
    - $\text{score}_i$: The predicted score for a positive instance $i$.
    - $\text{score}_j$: The predicted score for a negative instance $j$.
    - $|\text{positive class}|$: The total number of positive instances.
    - $|\text{negative class}|$: The total number of negative instances.
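A direct, if quadratic, implementation of this pairwise definition (counting ties as half-wins, a common convention not stated in the formula above):

```python
import numpy as np

def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0]))  # 0.75
```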
5.3. Baselines
The paper compares UniGRF against a comprehensive set of baseline models categorized by their primary function and approach.
5.3.1. Retrieval Models
These models are typically trained separately for the retrieval task.
- BPR-MF [37]: Bayesian Personalized Ranking - Matrix Factorization. A classic matrix factorization method optimized for implicit feedback, predicting pairwise preferences.
- GRU4Rec [21]: A recurrent neural network model using Gated Recurrent Units to capture sequential user behavior for session-based recommendations.
- NARM [27]: Neural Attentive Session-based Recommendation Model. Combines RNNs with attention mechanisms to capture users' diverse preferences in sessions.
- SASRec [23]: Self-Attentive Sequential Recommendation. Uses a Transformer architecture with self-attention to model complex and long-range patterns in user behavior sequences.
5.3.2. Ranking Models
These models are typically trained separately for the ranking task.
- DCN [56]: Deep & Cross Network. A model that combines deep neural networks and cross networks to capture both explicit and implicit high-order feature interactions.
- DIN [75]: Deep Interest Network. Introduces a target-aware attention mechanism to dynamically capture a user's interest distribution based on the candidate item.
- GDCN [47]: Gated Deep & Cross Network. An improved version of DCN that uses a gated cross network to capture more complex relationships.
5.3.3. Generative Models
These are autoregressive models that can be adapted for recommendation tasks, but are typically trained for a single stage.
- HSTU [67]: Hierarchical Sequential Transduction Units. A generative model by Meta that emphasizes scaling laws and efficiency through point-wise aggregated attention and the M-FALCON algorithm. It can be adapted for both retrieval and ranking but is usually trained independently for each.
- Llama [45]: A Transformer-based large language model architecture. Adapted here to build a recommendation system using a sequence generation method similar to HSTU for both retrieval and ranking.
5.3.4. Cascade Frameworks for Generative Model
These frameworks combine separate retrieval and ranking models (here, instantiated with HSTU) using specialized modules for information exchange.
- RankFlow-HSTU [34]: A cascade framework that first trains retrieval and ranking models independently, then performs joint training to facilitate information transfer between them. Here, both retriever and ranker are HSTU models.
- CoRR-HSTU [20]: Cooperative Retriever and Ranker. This framework uses knowledge distillation based on Kullback-Leibler (KL) divergence to align the retrieval and ranking stages for effective knowledge transfer. Here, both retriever and ranker are HSTU models.
5.4. Parameter Settings
To ensure a fair comparison, consistent hyperparameter settings were used across models for the same dataset.
- Learning Rate: Uniformly set to .
- Negative Samples (Retrieval Stage):
- MovieLens-1M: 128 negative samples.
- MovieLens-20M and Amazon-Books: 256 negative samples.
- Training Epochs:
  - Ranking-only models: 20 epochs.
  - Other models (including UniGRF and cascade frameworks): 100 epochs.
  - Early stopping: Applied if the AUC value (for ranking) or other primary metrics (for retrieval/unified) do not improve over 5 epochs (see the sketch after this list).
- Transformer Layers (for Transformer-based models):
- MovieLens-1M: 2 layers.
  - MovieLens-20M and Amazon-Books: 4 layers (unless specified otherwise for scaling law experiments).
- Distributed Training: All models were trained using 8 GPUs.
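The early-stopping rule from the list above can be sketched as follows (the class and names are illustrative, not from the paper's code):

```python
class EarlyStopper:
    """Stops training when the primary metric (e.g., AUC) fails to improve
    for `patience` consecutive epochs, matching the 5-epoch rule above."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, metric: float) -> bool:
        if metric > self.best:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```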
6. Results & Analysis
6.1. Overall Performance
The UniGRF framework was instantiated with two representative generative models, HSTU and Llama, resulting in UniGRF-HSTU and UniGRF-Llama. The performance comparison against various baselines is presented in Table 2 for retrieval and Table 3 for ranking.
The following are the results from Table 2 of the original paper:
| Dataset | MovieLens-1M | | | | | MovieLens-20M | | | | | Amazon-Books | | | | |
| Model | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| BPRMF | 0.0607 | 0.1027 | 0.1185 | 0.3127 | 0.0556 | 0.0629 | 0.1074 | 0.1241 | 0.3300 | 0.0572 | 0.0081 | 0.0135 | 0.0162 | 0.0412 | 0.0078 |
| GRU4Rec | 0.1015 | 0.1460 | 0.1816 | 0.3864 | 0.0895 | 0.0768 | 0.1155 | 0.1394 | 0.3177 | 0.0689 | 0.0111 | 0.0179 | 0.0207 | 0.0519 | 0.0107 |
| NARM | 0.1350 | 0.1894 | 0.2445 | 0.4915 | 0.1165 | 0.1037 | 0.1552 | 0.1926 | 0.4281 | 0.0910 | 0.0148 | 0.0243 | 0.0280 | 0.0722 | 0.0141 |
| SASRec | 0.1594 | 0.2208 | 0.2899 | 0.5671 | 0.1388 | 0.1692 | 0.2267 | 0.3016 | 0.5616 | 0.1440 | 0.0229 | 0.0353 | 0.0408 | 0.1059 | 0.0213 |
| Llama | 0.1746 | 0.2343 | 0.3110 | 0.5785 | 0.1477 | 0.1880 | 0.2451 | 0.3268 | 0.5846 | 0.1605 | 0.0338 | 0.0493 | 0.0603 | 0.1317 | 0.0305 |
| HSTU | 0.1708 | 0.2314 | 0.3132 | 0.5862 | 0.1431 | 0.1814 | 0.2325 | 0.3056 | 0.5371 | 0.1569 | 0.0337 | 0.0502 | 0.0613 | 0.1371 | 0.0305 |
| RankFlow-HSTU | 0.1096 | 0.2178 | 0.2184 | 0.5062 | 0.0988 | 0.0985 | 0.2064 | 0.2002 | 0.4245 | 0.1221 | 0.0281 | 0.0307 | 0.0324 | 0.0878 | 0.0249 |
| CoRR-HSTU | 0.1468 | 0.2662 | 0.2081 | 0.5435 | 0.1261 | 0.1629 | 0.2166 | 0.2796 | 0.5236 | 0.1409 | 0.0352 | 0.0521 | 0.0612 | 0.1360 | 0.0315 |
| UniGRF-Llama | 0.1765 | 0.2368 | 0.3219 | 0.5921 | 0.1478 | 0.1891 | 0.2464 | 0.3270 | 0.5846 | 0.1652 | 0.0351 | 0.0508 | 0.0624 | 0.1347 | 0.0316 |
| UniGRF-HSTU | 0.1756 | 0.2326 | 0.3140 | 0.5909 | 0.1484 | 0.1816 | 0.2384 | 0.3178 | 0.5747 | 0.1548 | 0.0354 | 0.0522 | 0.0638 | 0.1415 | 0.0319 |
Note: In the original paper, the best result in each column is shown in bold and the second-best is underlined, with statistical significance assessed at p < 0.05; that formatting is not reproduced in the plain table above.
The following are the results from Table 3 of the original paper:
| Dataset | ML-1M | ML-20M | Books |
| Model | AUC | AUC | AUC |
| DCN | 0.7176 | 0.7098 | 0.7042 |
| DIN | 0.7329 | 0.7274 | 0.7159 |
| GDCN | 0.7364 | 0.7089 | 0.7061 |
| Llama | 0.7819 | 0.7722 | 0.7532 |
| HSTU | 0.7621 | 0.7807 | 0.7340 |
| RankFlow-HSTU | 0.6340 | 0.6439 | 0.6388 |
| CoRR-HSTU | 0.7497 | 0.7492 | 0.7357 |
| UniGRF-Llama | 0.7932 | 0.7776 | 0.7559 |
| UniGRF-HSTU | 0.7832 | 0.7941 | 0.7672 |
From these tables, several key conclusions can be drawn:
- Generative Models Outperform Non-Generative Ones: In general, generative methods like HSTU and Llama (when trained separately) consistently outperform traditional non-generative retrieval models (e.g., BPR-MF, GRU4Rec, NARM, SASRec) and ranking models (e.g., DCN, DIN, GDCN). This validates the paper's premise about the significant potential of generative models in recommendation by learning richer representations and deeper user interest relationships.
- UniGRF Achieves Superior Performance in Both Stages: UniGRF (both UniGRF-HSTU and UniGRF-Llama) consistently achieves the best or second-best performance across all retrieval metrics (NDCG@K, HR@K, MRR) and ranking AUC on all three datasets.
  - For example, on MovieLens-1M, UniGRF-HSTU achieves the highest MRR (0.1484) and AUC (0.7832), and competitive NDCG and HR values. On MovieLens-20M, UniGRF-HSTU achieves the highest AUC (0.7941) and very strong retrieval metrics. On Amazon-Books, UniGRF-HSTU also leads significantly across almost all metrics.
  - This demonstrates that UniGRF effectively enhances performance by facilitating robust information transfer, efficient inter-stage collaboration, and synchronous optimization within its unified framework. Its model-agnostic nature is also highlighted by its consistent improvements across different underlying generative architectures (HSTU and Llama).
- Traditional Cascade Frameworks Struggle with Generative Models: RankFlow-HSTU and CoRR-HSTU, which are traditional cascade frameworks instantiated with HSTU, perform poorly. RankFlow-HSTU often performs worse than the independently trained HSTU model, and in some cases even worse than much simpler baselines. CoRR-HSTU shows slight improvements over individual HSTU in some retrieval metrics on Amazon-Books but generally falls short of UniGRF. This suggests that traditional cascade frameworks, designed for non-generative models, are not well-suited for generative recommendation models and fail to enable effective inter-stage information transfer in this new paradigm.
- Greater Impact on Ranking Performance: While UniGRF improves retrieval performance, it shows a more pronounced performance boost in the ranking stage, as is evident from the AUC values in Table 3. The paper posits this is because the ranking stage, being a simpler task, benefits more significantly from the rich information transfer that a unified model provides, especially when it might otherwise lack sufficient information when trained in isolation. This is a positive outcome for practical applications, as the ranking stage has a more direct impact on the quality of final recommendations.
- Effectiveness on Sparse Data (Amazon-Books): UniGRF exhibits a more significant performance improvement on the Amazon-Books dataset, which is known for its sparsity. This suggests that in data-scarce environments, the cross-stage information transfer facilitated by UniGRF becomes even more crucial, allowing the model to leverage limited information more effectively. CoRR-HSTU also performs relatively better on this dataset compared to others, reinforcing the idea that information transfer is vital for sparse data.
6.2. Ablation Study
To investigate the effectiveness of each proposed module, an ablation study was conducted on the MovieLens-1M dataset. UniGRF was instantiated with the HSTU model. The study involved three variants:
- w/o Enhancer: The Ranking-Driven Enhancer module is removed, and negative samples are generated purely through random sampling.
- w/o Weighter: The Gradient-Guided Adaptive Weighter is removed. The total loss is simply the sum of retrieval and ranking losses: $\mathcal{L} = \mathcal{L}_{\mathrm{retrieval}} + \mathcal{L}_{\mathrm{ranking}}$.
- w/o Both: Both the Ranking-Driven Enhancer and the Gradient-Guided Adaptive Weighter are removed.

Additionally, to verify the effectiveness of information transfer in generative recommendation beyond unified training, two fine-tuning (FT) paradigms were tested:

- HSTU-FT-a: The HSTU model is first trained for retrieval (100 epochs). Its parameters then initialize a ranking model, which is further trained (20 epochs). The results for ranking are reported.
- HSTU-FT-b: The HSTU model is first trained for ranking (20 epochs). Its parameters then initialize a retrieval model, which is further trained (100 epochs). The results for retrieval are reported.

The following are the results from Table 4 of the original paper:

| Model | AUC | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| HSTU | 0.7621 | 0.1708 | 0.2314 | 0.3132 | 0.5862 | 0.1431 |
| (1) w/o Enhancer | 0.7815 | 0.1717 | 0.2318 | 0.3117 | 0.5824 | 0.1446 |
| (2) w/o Weighter | 0.7582 | 0.1734 | 0.2317 | 0.3199 | 0.5824 | 0.1441 |
| (3) w/o Both | 0.7293 | 0.1681 | 0.2276 | 0.3113 | 0.5824 | 0.1392 |
| (a) HSTU-FT-a | 0.7752 | - | - | - | - | - |
| (b) HSTU-FT-b | - | 0.1727 | 0.2308 | 0.3182 | 0.5807 | 0.1436 |
| UniGRF-HSTU | 0.7832 | 0.1756 | 0.2326 | 0.3140 | 0.5909 | 0.1484 |
Analysis of the ablation study results:
- Effectiveness of Both Modules: All variants (w/o Enhancer, w/o Weighter, w/o Both) show performance inferior to the full UniGRF-HSTU. This confirms that both the Ranking-Driven Enhancer and the Gradient-Guided Adaptive Weighter are crucial for UniGRF's superior performance.
- "w/o Both" Variant's Poor Performance: The w/o Both variant performs the worst, even failing to surpass the independently optimized HSTU model. This is a critical finding: simply unifying retrieval and ranking without explicit mechanisms for collaboration and synchronized optimization is insufficient and can even lead to performance degradation. This validates the paper's initial premise that a thoughtful design is necessary for a successful unified framework.
- Impact of the Ranking-Driven Enhancer (w/o Enhancer): Removing the Enhancer module leads to a slight degradation in retrieval metrics (NG@10, NG@50, HR@10, HR@50, MRR) compared to UniGRF-HSTU. This confirms that random negative sampling is less effective than the Ranking-Driven Enhancer in capturing deep user interests and providing informative negative samples. The Enhancer's ability to generate hard negatives and potential positives helps the model learn more robust representations.
- Impact of the Gradient-Guided Adaptive Weighter (w/o Weighter): The w/o Weighter variant shows a notable degradation in ranking performance (AUC of 0.7582 vs. 0.7832 for UniGRF-HSTU), while retrieval metrics are slightly better or similar. This highlights the critical role of the adaptive weighter in balancing the optimization speeds of the two stages. Without it, the model might over-optimize one stage (e.g., retrieval) at the expense of the other (ranking), preventing both from reaching their full potential synchronously. However, w/o Weighter still performs better than w/o Both, suggesting the Enhancer module still provides benefits by improving inter-stage collaboration even without balanced optimization.
- Comparison with Fine-Tuning Paradigms:
  - HSTU-FT-a (fine-tuning ranking after retrieval) achieves an AUC of 0.7752, which is better than standalone HSTU (0.7621) but lower than UniGRF-HSTU (0.7832).
  - HSTU-FT-b (fine-tuning retrieval after ranking) achieves retrieval metrics that are slightly better than standalone HSTU but generally lower than UniGRF-HSTU.
  - This indicates that while information transfer through fine-tuning (sequential training) is effective and improves performance over separate training, the unified generative framework of UniGRF (simultaneous, end-to-end training with shared parameters) facilitates more sufficient and effective information transfer, leading to superior overall performance.
6.3. Impact of Hyper-parameter $m$
The Ranking-Driven Enhancer module introduces $m$ hard-to-detect disliked items. The paper investigates the impact of varying $m$ on UniGRF's performance, while keeping the total number of negative samples fixed (128 for MovieLens-1M, 256 for MovieLens-20M). The tested values of $m$ range from 0 to 20. The results are presented in the following figure (Figure 2 from the original paper).
The image is a chart showing the model's retrieval and ranking performance on the MovieLens-1M and MovieLens-20M datasets. The x-axis is the parameter m and the y-axis is the AUC value; the blue curves denote ranking and the red curves denote retrieval, and the trends show the performance differences between the two.
Figure 2: Impact of varying the number m of hard-to-detect disliked items.
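To make the setup concrete, below is a minimal sketch of how such a negative set could be assembled: m hard negatives chosen by a relative score, topped up with random negatives so the total stays fixed. The relative score used here (retrieval score minus ranking score, i.e., items retrieval likes but ranking does not) is one plausible reading of the Enhancer's selection rule, discussed further in Section 7.3, not the paper's exact formula; all names are illustrative.

```python
import torch

def build_negatives(score_retrieval, score_ranking, m, total, item_ids):
    """Assemble m hard negatives plus (total - m) random negatives."""
    # "hard-to-detect disliked items": scored high by retrieval, low by ranking
    relative = score_retrieval - score_ranking        # higher => harder
    hard = item_ids[torch.topk(relative, m).indices]
    # top up with uniform random negatives from the candidate pool
    # (a sketch: duplicates with the hard set are not filtered here)
    rand_idx = torch.randint(0, len(item_ids), (total - m,))
    return torch.cat([hard, item_ids[rand_idx]])

# usage: a pool of 10k candidate items, m = 10, total = 128 (as for ML-1M)
pool = torch.arange(10_000)
s_retr = torch.sigmoid(torch.randn(10_000))  # stand-in retrieval scores
s_rank = torch.sigmoid(torch.randn(10_000))  # stand-in ranking scores
negs = build_negatives(s_retr, s_rank, m=10, total=128, item_ids=pool)
```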
6.3.1. MovieLens-20M (Larger Dataset)
For the larger dataset, MovieLens-20M, as m increases from 0 to 20, the performance of both the retrieval and ranking stages shows a continuous upward trend. This suggests that:
- Introducing a higher number of hard negative samples (up to 20 in this experiment) through the Ranking-Driven Enhancer effectively aids the model in learning more accurate user preferences.
- These hard negatives facilitate mutual enhancement between the stages, contributing to improved recommendation performance. In a larger, more complex data distribution, the model needs more challenging examples to refine its decision boundaries.
6.3.2. MovieLens-1M (Smaller Dataset)
For the smaller dataset, MovieLens-1M, the performance of both retrieval and ranking first increases and then decreases as m grows from 0 to 20, with the best results observed at an intermediate value of m. This indicates:
- Moderate Introduction is Beneficial: On smaller datasets, a moderate number of hard negative samples can indeed enhance the model's learning ability and improve its capture of fine-grained user preferences.
- Risk of Overfitting: When m becomes too large, the model may over-focus on these hard negative samples. This can lead to overfitting, where the model does well on the specific hard negatives but loses generalization ability on unseen data, degrading overall performance.
- Reduced Sample Diversity: A high proportion of hard negative samples can also cause the same samples to be encountered repeatedly across epochs, reducing the diversity of training samples. This lack of variation can hinder broad learning and further limit generalization. On smaller datasets, therefore, m must be chosen carefully to balance performance gains against overfitting; a simple way to do this is sketched below.
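In practice, that selection can be handled with a plain validation sweep. A minimal sketch follows, with `train_and_eval` as a hypothetical helper that trains with a given m and returns a validation metric such as NDCG@10:

```python
def pick_m(candidate_ms, train_and_eval):
    """Return the m with the best validation metric, plus all results."""
    results = {m: train_and_eval(m) for m in candidate_ms}
    best = max(results, key=results.get)
    return best, results

# e.g., best_m, sweep = pick_m([0, 5, 10, 15, 20], train_and_eval)
```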
6.4. Further Analysis
6.4.1. Analysis of the Impact of Optimization on Unified Framework Performance
This analysis highlights the importance of synchronized optimization between stages. The MovieLens-20M dataset was used to observe the changes in recommendation performance over epochs for both retrieval and ranking stages. The results are presented in the following figure (Figure 3 from the original paper).
The image is a chart showing how the AUC of the Rank and Retrieve stages changes over training epochs under different settings: the left panel shows the results without the Enhancer and Weighter, and the right panel shows UniGRF-HSTU, clearly illustrating the difference in optimization behavior between the two approaches.
Figure 3: Effect of synchronized stage optimization on unified generative framework performance.
- Without Adaptive Weighter (Figure 3a):
  - When a simple sum of the retrieval and ranking losses is used (i.e., equal weights, corresponding to the w/o Weighter scenario in the ablation study), the retrieval stage's performance (red curve) shows a continuously rising trend.
  - In contrast, the ranking stage's performance (blue curve) quickly reaches an optimal value within a few epochs and then rapidly declines.
  - Explanation: This asynchronous optimization occurs because retrieval (involving negative sampling from a large pool) is generally the more challenging task, while ranking is comparatively simpler. When optimizing a combined loss without balancing, the model tends to prioritize the more challenging task. In later epochs it becomes overly biased towards retrieval, leaving the ranking stage under-trained or over-adapted in a way that hurts its performance, hence the significant drop. This limits the potential for effective information sharing and inter-stage collaboration.
- With Gradient-Guided Adaptive Weighter (Figure 3b):
  - When UniGRF's Gradient-Guided Adaptive Weighter is employed, the optimization pace of the two stages is balanced.
  - The retrieval stage's performance (red curve) exhibits a smoother and more stable improvement curve.
  - Crucially, the ranking stage's performance (blue curve) now also shows continuous improvement instead of declining.
  - Explanation: The adaptive weighter dynamically monitors training rates and loss convergence speeds using gradients, then adjusts the loss weights in real time so that both tasks receive appropriate attention throughout training (a rough illustration follows this list). By achieving synchronous optimization, UniGRF fully leverages the benefits of the unified generative framework, significantly enhancing overall recommendation performance.
6.4.2. Analysis on Scaling Law
The paper also explores how UniGRF's performance changes with parameter expansion by adjusting the number of Transformer layers within the generative architecture (instantiated with HSTU), varying the layer count across several settings. The experimental results on the MovieLens-20M dataset for UniGRF-HSTU are presented in the following figure (Figure 4 from the original paper).
The image is a two-panel diagram relating the number of layers to loss and performance. Panel (a) shows the ranking loss (blue line) and the retrieval loss (red line); panel (b) shows the evaluation at different layer counts for ranking (AUC) and retrieval (labeled MRG20 in the figure).
Figure 4: Effect of parameter expansion on model loss and recommendation performance.
- Loss Reduction with Parameter Expansion (Figure 4a):
  - Figure 4a shows that as the number of Transformer layers increases (i.e., the model parameter size expands), the loss value for both the retrieval and ranking tasks consistently decreases.
  - Explanation: This trend indicates that UniGRF adheres to the scaling law, a phenomenon observed in large language models where increasing model capacity (number of parameters) leads to a reduction in loss (a generic form of this law is given below). This suggests the model can learn more complex patterns and fit the data better with larger architectures.
- Performance Improvement with Parameter Expansion (Figure 4b):
  - Figure 4b shows that the actual recommendation performance of UniGRF also improves as the number of Transformer layers increases.
  - Explanation: This is a crucial finding, as loss reduction does not always translate directly into gains on practical metrics. The consistent improvement in retrieval and ranking metrics (e.g., NDCG@K, AUC) with increasing parameter size validates that UniGRF can effectively leverage larger models. This demonstrates UniGRF's strong scalability and its ability to capture more complex data patterns with more sophisticated architectures, indicating significant application potential.

The paper notes that it currently focuses on constructing the unified framework and demonstrating its scalability on public datasets, with plans to study scaling laws on industrial datasets in future work.
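For reference, the scaling law invoked here is typically written as a power law in model size. The form below is the generic parameter-count law from the LLM literature (Kaplan et al., 2020), shown only as context; the paper reports the qualitative trend but no fitted constants for UniGRF:

```latex
% N: number of model parameters; N_c, \alpha_N: fitted constants.
% Loss falls as a power law while model capacity grows (Kaplan et al., 2020).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```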
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces the Unified Generative Recommendation Framework (UniGRF), a novel approach that effectively addresses the persistent problem of information loss and diminished performance in traditional multi-stage recommendation systems. By unifying retrieval and ranking into a single generative model and treating both as sequence generation tasks, UniGRF enables inherent and sufficient information sharing through shared parameters, eliminating the need for complex explicit transfer mechanisms.
Key to UniGRF's success are two innovative components:
- The ranking-driven enhancer module: This module actively promotes inter-stage collaboration by leveraging the ranking stage's precision to generate high-quality training samples (e.g., hard-to-detect disliked items and potential favorite items) for the retrieval stage, creating a mutually reinforcing feedback loop with minimal computational overhead.
- The gradient-guided adaptive weighter: This component dynamically monitors the optimization speeds of the retrieval and ranking tasks and adaptively adjusts their loss weights, ensuring synchronized performance improvements across both stages and preventing one from being over- or under-optimized at the expense of the other.

Extensive experiments on benchmark datasets (MovieLens-1M, MovieLens-20M, Amazon-Books) demonstrate that UniGRF consistently outperforms state-of-the-art baselines in both retrieval and ranking metrics. Ablation studies confirm the individual efficacy of the enhancer and weighter modules, highlighting their crucial roles in inter-stage collaboration and synchronized optimization. Further analysis also verifies UniGRF's adherence to scaling laws, showing that performance improves with increased model parameters.
In essence, UniGRF provides a highly effective, scalable, and compatible framework for generative recommendation systems, significantly advancing the field by offering a holistic solution to multi-stage recommendation.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and propose future research directions:
- Unifying More Stages: Currently, UniGRF unifies retrieval and ranking. Future work could extend the framework to cover more stages of a typical recommendation pipeline, such as pre-rank and rerank, yielding an even more comprehensive end-to-end generative system.
- Application to Industrial Scenarios: While demonstrated on public datasets, the authors plan to apply UniGRF to specific industrial scenarios, which entails addressing real-world complexities and much larger scales.
- Extending Parameters for Performance Improvements: The scaling-law analysis on MovieLens-20M showed promising results. The authors intend to pursue further gains by significantly extending model parameters on industrial datasets, pushing the boundaries of large recommendation models.
7.3. Personal Insights & Critique
The UniGRF paper presents a compelling and timely solution to a long-standing problem in recommendation systems. The idea of unifying retrieval and ranking within a single generative model is a natural evolution given the success of Transformer-based LLMs in multi-task learning.
Personal Insights:
- Elegant Solution to Information Loss: The most significant insight is how the paper re-frames information loss not as a problem of transferring between disparate components, but as one solvable by integration. By making the retrieval and ranking tasks different "heads" or outputs of the same underlying generative Transformer, information inherently flows through shared representations. This is far more elegant than complex knowledge-distillation or joint-optimization schemes between separate models.
- Active Collaboration vs. Passive Transfer: The Ranking-Driven Enhancer is a clever mechanism for active collaboration. Instead of merely having ranking "inform" retrieval passively, it actively generates high-quality feedback (hard negatives, potential positives) that directly improves the retrieval model's learning. This iterative self-improvement loop is powerful and potentially applicable beyond recommendation, wherever a higher-precision component can guide the training of a lower-precision one.
- Addressing Training Dynamics: The Gradient-Guided Adaptive Weighter is crucial. It highlights a practical challenge in multi-task learning: even when tasks are unified, their different complexities and convergence rates can lead to suboptimal training. This adaptive weighting approach is a robust way to manage that, and it could generalize to other multi-task generative models whose sub-tasks have varying training characteristics.
- Scalability for the LLM Era: The explicit focus on model-agnosticism and scaling laws positions UniGRF well for the future of large recommendation models. As LLMs continue to grow, frameworks that can seamlessly integrate larger generative backbones will be essential.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost of Hard Negative Mining (even if minimal): While the paper states the Ranking-Driven Enhancer has minimal overhead, calculating score_ranking and score_retrieval for all candidate negative samples (or a subset) in each epoch, even with a linear-complexity function, can still add noticeable training time, especially with very large item pools and mini-batches. The paper could offer a more detailed analysis of this practical runtime impact beyond just "minimal overhead."
- Hyperparameter Sensitivity: The Gradient-Guided Adaptive Weighter introduces a temperature coefficient and scaling hyperparameters, and the Ranking-Driven Enhancer introduces the number of hard negatives m and a potential-favorite threshold. Optimizing these hyperparameters can be tricky, especially as the optimal m varied significantly between the large and small datasets. A more robust, possibly self-adaptive, way to set these parameters could be explored.
- Definition of "Hard Negative": The relative-score formula for hard negatives is interesting. It penalizes items where score_ranking is high, which is not exactly a "disliked" item. Instead, it seems to identify items that the retrieval model thinks are good but the ranking model does not find as good (or actively dislikes, if score_ranking is very low). Clarifying the intuitive meaning of this relative score, especially when score_ranking is close to score_retrieval or even higher, would aid understanding of its precise mechanism.
- Generalizability of sigmoid for retrieval scores: Applying a sigmoid to the similarity score to obtain score_retrieval might simplify things but could mask the true distribution of similarity scores. Depending on the embedding space, inner-product values can vary widely, and a sigmoid might compress important information. Further analysis of this choice, or of alternatives, would be valuable.
- Complexity of Unifying More Stages: While a promising future direction, unifying pre-rank and rerank will substantially increase the complexity of defining distinct output heads and loss functions within a single model. The current work handles two stages, but extending to four or more might require more sophisticated ways to manage the outputs and their interdependencies without making the model architecture unwieldy.

Overall, UniGRF is an important contribution, providing a solid foundation for building truly unified and performant generative recommendation systems. It offers practical modules to manage the inherent challenges of multi-stage optimization within such a framework, pushing the boundaries of what's possible in modern recommender systems.