Enhancing Sequential Recommendation with World Knowledge from Large Language Models
TL;DR Summary
This paper introduces GRASP, a framework that overcomes limitations of traditional sequential recommendation systems by integrating generation-augmented retrieval and multi-level attention. It effectively leverages world knowledge despite LLM hallucinations, enhancing the modeling of users' dynamic interests and achieving state-of-the-art performance on public and industrial benchmarks.
Abstract
Sequential Recommendation System (SRS) has become pivotal in modern society, which predicts subsequent actions based on the user's historical behavior. However, traditional collaborative filtering-based sequential recommendation models often lead to suboptimal performance due to the limited information of their collaborative signals. With the rapid development of LLMs, an increasing number of works have incorporated LLMs' world knowledge into sequential recommendation. Although they achieve considerable gains, these approaches typically assume the correctness of LLM-generated results and remain susceptible to noise induced by LLM hallucinations. To overcome these limitations, we propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible framework that integrates generation augmented retrieval for descriptive synthesis and similarity retrieval, and holistic attention enhancement which employs multi-level attention to effectively employ LLM's world knowledge even with hallucinations and better capture users' dynamic interests. The retrieved similar users/items serve as auxiliary contextual information for the later holistic attention enhancement module, effectively mitigating the noisy guidance of supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset reveal that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones. The code is available at: https://anonymous.4open.science/r/GRASP-SRS.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Enhancing Sequential Recommendation with World Knowledge from Large Language Models".
1.2. Authors
The authors and their affiliations are:
- Tianjie Dai, Shanghai Jiao Tong University, Shanghai, China
- Xu Chen, Taobao & Tmall Group, Hangzhou, China
- Yunmeng Shu, Taobao & Tmall Group, Hangzhou, China
- Jinsong Lan, Taobao & Tmall Group, Beijing, China
- Xiaoyong Zhu, Taobao & Tmall Group, Hangzhou, China
- Jiangchao Yao, Shanghai Jiao Tong University, Shanghai, China
- Bo Zheng, Taobao & Tmall Group, Hangzhou, China
1.3. Journal/Conference
This paper was published as a preprint on arXiv. The listed publication date is 2025-11-25T10:59:38.000Z. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference, but arXiv is a widely respected platform for disseminating early-stage research in computer science and other fields. Given the listed publication year (2025), it might be targeting a future major conference or journal in the field of recommendation systems or AI.
1.4. Publication Year
2025
1.5. Abstract
The abstract introduces Sequential Recommendation Systems (SRS) as crucial for predicting future user actions based on historical behavior. It highlights a limitation of traditional collaborative filtering-based SRS: suboptimal performance due to limited collaborative signals. The abstract notes the recent trend of incorporating Large Language Models (LLMs) for their world knowledge to enhance SRS. However, existing LLM-based approaches often assume the correctness of LLM-generated results and are vulnerable to noise induced by LLM hallucinations.
To address these issues, the paper proposes GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction). GRASP is presented as a flexible framework that combines two main components:
- Generation Augmented Retrieval: This involves using LLMs for descriptive synthesis and similarity retrieval.
- Holistic Attention Enhancement: This employs multi-level attention to effectively utilize the LLM's world knowledge, even in the presence of hallucinations, and to better capture users' dynamic interests.

The retrieved similar users/items serve as auxiliary contextual information for the holistic attention enhancement module, which helps mitigate the noisy guidance associated with supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset demonstrate that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.20177v1
- PDF Link: https://arxiv.org/pdf/2511.20177v1.pdf
- Publication Status: This paper is currently a preprint on arXiv, indicated by "v1" (version 1) and its presence on the arXiv platform.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the suboptimal performance of Sequential Recommendation Systems (SRS) when relying solely on ID-based embeddings and collaborative filtering. Traditional SRS models, while effective in capturing sequential patterns, often suffer from limited collaborative signals, leading to a narrow view of user interests. For instance, recommending only similar tents after a tent purchase, rather than related camping equipment like sleeping bags or stoves, misses broader user intentions. This limitation prevents these models from capturing comprehensive user behavior patterns, especially for sparse or long-tail items and users.
The emergence of Large Language Models (LLMs) has introduced a new paradigm, allowing the integration of rich semantic and world knowledge into recommendation systems. However, existing approaches that incorporate LLMs often face two significant challenges:
- Assumption of Correctness: Many methods implicitly assume the LLM-generated results are accurate.
- Susceptibility to Hallucinations: LLMs are known to hallucinate, meaning they can generate factually incorrect or nonsensical information. This issue is particularly pronounced for users with short interaction histories (tail users), where hallucination rates can be high. Directly using such potentially noisy or incorrect semantic features as supervision signals can introduce noise, ultimately impairing model performance and reliability.

The paper's entry point and innovative idea is to leverage the semantic understanding capabilities of LLMs while simultaneously building robustness against their hallucination tendencies. Instead of using LLM-generated content as direct supervision, GRASP proposes to integrate it as auxiliary contextual information, processed through a specialized attention mechanism, thereby mitigating the risks associated with noisy LLM outputs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Novel Framework (GRASP): The authors propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible and novel framework for enhancing sequential recommendation systems. This framework is designed to be orthogonal to (i.e., it can be integrated with) current SRS backbones, while specifically addressing the inaccuracies of attribute-based retrieval and the noisy guidance from LLM hallucinations.
- Generation Augmented Retrieval: This component utilizes LLMs to generate detailed user profiles and item descriptions, which are then used to build semantic embedding databases. A nearest-neighbor retrieval strategy identifies the top-k similar users or items, whose aggregated embeddings serve as rich auxiliary information. This approach enriches the sparse collaborative signals with semantic context.
- Holistic Attention Enhancement: This module integrates the retrieved information as contextual input rather than direct supervision, a key differentiator for mitigating hallucination noise. It employs a multi-level attention mechanism including: initial user-item attention for core interaction patterns, attention between similar user/item groups for neighborhood context, and concatenated attention for global interest modeling. The use of a Sigmoid function instead of Softmax in attention is highlighted for preserving diverse preferences.
- Empirical Superiority: Extensive experiments on two public benchmarks (Amazon Beauty, Amazon Fashion) and one industrial dataset (Industry-100K) demonstrate that GRASP consistently achieves state-of-the-art performance. It shows significant improvements over existing LLM-enhanced sequential recommendation models (like LLM-ESR) and traditional SRS backbones (GRU4Rec, BERT4Rec, SASRec), particularly in tail (data-scarce) scenarios, where hallucination risks are highest, without sacrificing performance in head scenarios.
- Online Validation: An online A/B test on an industrial e-commerce platform confirmed GRASP's practical value, showing a 0.14 point absolute increase in CTR, 1.69% relative growth in order volume, and 1.71% uplift in GMV.

These findings collectively highlight GRASP's ability to effectively integrate LLM world knowledge into sequential recommendation while robustly managing the inherent hallucination problem, making it a promising solution for real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand GRASP, a reader should be familiar with the following concepts:
- Sequential Recommendation System (SRS): An SRS is a type of recommender system that predicts the next item a user will interact with, based on their sequence of past interactions. Unlike traditional recommender systems that might only consider a user's overall preferences, SRS models explicitly capture the temporal order and dynamics of user behavior. For example, if a user buys a camera, then a lens, an SRS might predict they'll buy a camera bag next, recognizing the sequence.
- Collaborative Filtering (CF): A widely used technique in recommender systems. CF operates on the principle that users who agreed in the past on some items will likely agree again in the future. It identifies patterns by analyzing user-item interactions (e.g., ratings, purchases) to find similar users or similar items. ID-based collaborative filtering refers to methods where users and items are primarily represented by numerical identifiers (IDs), and their relationships are learned through interactions, often neglecting rich content information.
- Large Language Models (LLMs): These are powerful artificial intelligence models, like GPT-3, GPT-4, or Qwen, that have been trained on vast amounts of text data. They can understand, generate, and process human language, allowing them to perform tasks such as summarization, translation, question answering, and even creative writing. In the context of recommendation, LLMs can generate rich semantic descriptions for items or users, providing world knowledge (e.g., knowing that a "sleeping bag" is related to "camping" even if a user hasn't explicitly interacted with camping items).
- LLM Hallucinations: This refers to the phenomenon where LLMs generate content that is factually incorrect, nonsensical, or unfaithful to the input prompt, even though it may sound plausible or grammatically correct. In recommendation systems, an LLM might hallucinate a user's interest in a product category they've never shown interest in, or misrepresent an item's attributes. This can introduce noise and lead to unreliable recommendations.
- Attention Mechanism: A core component in many modern neural networks, especially in Transformers. The attention mechanism allows a model to weigh the importance of different parts of the input sequence when processing a specific element. It calculates a context vector by taking a weighted sum of value vectors, where the weights are determined by a query vector and key vectors. The standard self-attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  - $Q$: Query matrix (typically derived from the current element).
  - $K$: Key matrix (derived from all elements in the sequence).
  - $V$: Value matrix (derived from all elements in the sequence).
  - $QK^T$: Dot product to measure similarity between queries and keys.
  - $\sqrt{d_k}$: Scaling factor to prevent large dot products from pushing softmax into regions with tiny gradients; $d_k$ is the dimension of the keys.
  - $\mathrm{softmax}$: Normalizes the scores into probability distributions, ensuring weights sum to 1.
  - The output is a weighted sum of the values, focusing on the most relevant parts (a minimal code sketch of this computation appears after this concept list).
- Average Pooling: A simple aggregation technique where a set of numerical vectors is combined into a single vector by computing the average of their corresponding elements. For example, if you have two vectors [1, 2, 3] and [4, 5, 6], their average pooling would be [2.5, 3.5, 4.5]. In GRASP, it's used to aggregate embeddings of similar users or items.
- Cosine Similarity: A measure of similarity between two non-zero vectors that measures the cosine of the angle between them. A cosine similarity of 1 means the vectors are in the same direction (perfectly similar), 0 means they are orthogonal (no similarity), and -1 means they are in opposite directions (perfectly dissimilar). It's commonly used to find similar items or users in embedding spaces. The formula for two vectors $\mathbf{A}$ and $\mathbf{B}$ is: $ \text{cosine similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $ Where:
  - $\mathbf{A} \cdot \mathbf{B}$: The dot product of vectors $\mathbf{A}$ and $\mathbf{B}$.
  - $\|\mathbf{A}\|$: The Euclidean norm (magnitude) of vector $\mathbf{A}$.
  - $\|\mathbf{B}\|$: The Euclidean norm (magnitude) of vector $\mathbf{B}$.
- Multi-Layer Perceptron (MLP): A fundamental type of feedforward neural network consisting of at least three layers: an input layer, one or more hidden layers, and an output layer. Each node (neuron) in one layer connects to every node in the next layer with a specific weight. MLPs are used for various tasks, including non-linear transformations and feature mapping.
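As referenced in the Attention Mechanism entry above, the following is a minimal NumPy sketch of standard scaled dot-product attention. It is illustrative only; the toy shapes and random inputs are assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

# Toy example: 2 queries attending over 4 key/value vectors of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 8)
```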
3.2. Previous Works
The paper builds upon a rich history of sequential recommendation and LLM integration.
- Traditional Sequential Recommendation Models:
  - GRU4Rec [16]: One of the pioneering works in SRS, GRU4Rec utilizes Gated Recurrent Units (GRUs) to capture sequential patterns. GRUs are a type of recurrent neural network (RNN) designed to process sequences of data by maintaining a hidden state that evolves over time, allowing them to remember past information.
  - SASRec [21]: SASRec (Self-Attentive Sequential Recommendation) introduced the self-attention mechanism (inspired by Transformers) to sequential recommendation. Unlike RNNs that process tokens sequentially, self-attention allows the model to capture dependencies between any two items in a sequence regardless of their distance, making it effective for modeling long-range dependencies and complex item relationships.
  - BERT4Rec [35]: BERT4Rec adapted the Masked Language Modeling (MLM) paradigm from BERT (Bidirectional Encoder Representations from Transformers) to sequential recommendation. It pre-trains a Transformer-based model by masking certain items in a user's historical sequence and predicting them, learning rich contextual representations for items.

  These models primarily rely on collaborative filtering and ID embeddings, which, as the paper notes, can be limited by sparse collaborative signals.
- LLM for Sequential Recommendation (Two Paradigms): The paper categorizes existing LLM-based recommendation works into two main paradigms:
  - LLMs as the entire system:
    - ProLLM4Rec [43]: This approach designs specific prompts to directly leverage the generative and reasoning capabilities of LLMs for recommendation tasks, essentially using the LLM as the recommender itself.
    - TALLRec [2]: This method creates an instruction-tuning dataset from user history and fine-tunes an LLM to directly output recommendations.
    - Limitations: While powerful, this paradigm is often computationally expensive during inference, making it challenging to meet the high concurrency and low latency demands of industrial-scale recommendation systems.
  - Integrating LLM semantic understanding into existing systems:
    - HLLM [8]: HLLM designs separate item LLMs to extract rich content features from item descriptions and user LLMs to utilize these features for predicting future interests.
    - LLM-ESR [27]: This work is a primary point of comparison for GRASP. LLM-ESR uses LLMs to generate semantic features (hidden embeddings) to initialize ID embeddings for traditional sequential recommendation models. Crucially, it employs a dual network and a self-distillation loss function that leverages LLM embeddings to identify and supervise similar users. Specifically, it uses a regularization term in its loss function that directly pulls the user embedding towards the context embedding derived from the LLM, treating it as a supervisory signal.
    - LRD [45]: LRD leverages LLMs to discover latent item relations through language representations and integrates them with a discrete state variational autoencoder to enhance relation-aware sequential recommendation.
3.3. Technological Evolution
The field of sequential recommendation has evolved significantly:
- Early Models (e.g., Markov Chains): Focused on simple transitions between items.
- RNN-based Models (e.g., GRU4Rec): Introduced recurrent neural networks to capture more complex sequential dependencies.
- Attention-based/Transformer Models (e.g., SASRec, BERT4Rec): Revolutionized sequential modeling by enabling the capture of long-range dependencies and contextual item representations through self-attention. These models moved beyond strict sequential processing to allow global interaction within a sequence.
- Hybrid Models (Content-aware): Began to incorporate item attributes (e.g., text, images) alongside ID embeddings to enrich representations, addressing cold-start and sparsity issues.
- LLM-enhanced Models: The latest frontier involves integrating Large Language Models to inject world knowledge and semantic understanding. Initially, this involved using LLMs for feature extraction or embedding initialization, and later, using LLMs directly as recommenders or for complex reasoning tasks.

GRASP fits into this evolution by pushing the boundaries of LLM-enhanced models. It recognizes the power of LLMs for semantic understanding but critically addresses a major practical hurdle: hallucinations. By designing a robust way to integrate LLM-derived information as auxiliary input rather than direct supervision, it refines the LLM-enhanced paradigm, making it more reliable for real-world deployment.
3.4. Differentiation Analysis
Compared to the main methods in related work, GRASP offers distinct innovations, especially against LLM-ESR:
- Robustness to LLM Hallucinations: This is GRASP's most significant differentiator. Previous LLM-enhanced methods, particularly LLM-ESR, often treat LLM-generated embeddings or semantic features as reliable supervisory signals. As detailed in Appendix A.3, LLM-ESR uses a regularization term that directly pulls user embeddings towards the LLM context. If the LLM context is hallucinated or inaccurate, this direct supervision introduces noise, impairing performance. GRASP, in contrast, treats the LLM context as a learnable input feature. The holistic attention enhancement module fuses this LLM context with the original embeddings through learnable weights, which allows the model to adaptively down-weight noisy or uninformative LLM context by learning smaller weights when it does not contribute positively to the main task loss. This gating mechanism provides inherent robustness against hallucinations.
- Integration Strategy (Auxiliary Input vs. Supervision): GRASP integrates LLM knowledge as auxiliary contextual information through generation augmented retrieval and multi-level attention. This is a more flexible and less prescriptive approach than direct supervision. It enriches the representations without forcing them to conform to potentially erroneous LLM outputs.
- Multi-Level Attention: GRASP employs a sophisticated multi-level attention mechanism (self, similar, global) to dynamically capture user interests and integrate contextual information from similar users/items. This holistic approach allows for a richer and more nuanced understanding of user preferences compared to simpler fusion strategies.
- Flexible and Orthogonal Framework: GRASP is designed to be orthogonal to SRS backbones, meaning it can be easily integrated with various existing sequential recommendation models (e.g., GRU4Rec, BERT4Rec, SASRec) to enhance their performance, rather than replacing them entirely or requiring extensive modifications.

In summary, while previous works recognized the value of LLMs, GRASP innovates by providing a more resilient and flexible framework to harness LLM world knowledge while explicitly tackling the hallucination problem through a distinct integration strategy and architectural design.
4. Methodology
4.1. Principles
The core idea of GRASP is to enhance Sequential Recommendation Systems (SRS) by leveraging the rich world knowledge from Large Language Models (LLMs) in a robust manner. The theoretical basis and intuition behind it stem from two main observations:
- Limitations of ID-based CF: Traditional SRS models primarily use ID embeddings and collaborative filtering. These are inherently limited because they only capture patterns from observed interactions and lack external semantic knowledge. This leads to sparsity issues, cold-start problems, and an inability to infer broader user interests beyond directly similar items.
- LLM Power and Pitfalls: LLMs possess vast world knowledge and semantic understanding capabilities, which can significantly enrich user and item representations. However, LLMs are prone to hallucinations (generating incorrect information), especially for sparse data (e.g., users with short histories). Directly using LLM-generated features as supervision signals can introduce noise and degrade performance.

GRASP's principle is to bridge this gap by:
- Enriching Representations Semantically: Use LLMs to generate descriptive semantic embeddings for all users and items, moving beyond simple IDs.
- Providing Contextual Auxiliary Information: Instead of direct supervision, LLM-derived embeddings are used to retrieve similar users and items. The aggregated embeddings from these neighbors provide auxiliary contextual information. This context is less prone to the "single point of failure" of a hallucinated direct supervision signal because it averages over multiple similar entities.
- Adaptive Fusion with Multi-Level Attention: A specialized holistic attention mechanism is employed to adaptively fuse these LLM-derived semantic embeddings and retrieved contextual information with the core user-item interactions. This attention mechanism is designed to be robust to noise (e.g., from hallucinations) by allowing the model to learn to down-weight less reliable information, and to capture diverse, dynamic user preferences through multi-level attention and a Sigmoid activation (instead of Softmax).

In essence, GRASP aims to harness the semantic richness of LLMs while sidestepping the fragility introduced by their hallucination tendencies, by treating LLM outputs as learnable context rather than infallible ground truth.
4.2. Core Methodology In-depth
4.2.1. Problem Formulation
The goal of a Sequential Recommendation System (SRS) is to predict the next item a user is most likely to interact with, given their historical sequence of interactions.
Let $S_u = \{i_1, i_2, \ldots, i_{|S_u|}\}$ represent the interaction sequence of user $u$, where $i_j$ denotes the $j$-th item interacted with by the user. Here, $u \in \mathcal{U}$ is a user from the set of users, and $\mathcal{I}$ is the item set.
The task can be mathematically formulated as finding the item that maximizes the probability of being the next interaction: $ i^* = \underset{i_j \in \mathcal{I}}{\mathrm{argmax}} f(i_{|S_u|+1} = i_j \ | \ S_u) $ Where:
- $i^*$: The predicted next item that the user is most likely to interact with.
- $\mathcal{I}$: The set of all possible items.
- $f(\cdot)$: The SRS model, which takes the user's historical sequence as input.
- $i_{|S_u|+1}$: Represents the item the user will interact with next, immediately following their current sequence $S_u$.
- $|S_u|$: Denotes the length (number of items) in the user's historical interaction sequence.

GRASP primarily focuses on enhancing the representations of items ($i_j$) and users ($u$) before they are fed into the SRS backbone ($f$), making it orthogonal to the specific choice of $f$.
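As a minimal illustration of this formulation, the sketch below scores every candidate item against a user representation produced by some backbone $f(S_u)$ and picks the argmax. The dot-product scorer and the toy dimensions are assumptions for illustration, consistent with the sigmoid dot-product scoring used later in Eq. (6) (the sigmoid is monotone, so it does not change the argmax).

```python
import numpy as np

def predict_next_item(user_repr, item_embeddings):
    """Score every candidate item against the user representation produced by
    the SRS backbone f(S_u) and return the index of the highest-scoring one,
    i.e. the argmax over candidates in the problem formulation above."""
    scores = item_embeddings @ user_repr          # one score per candidate item
    return int(np.argmax(scores)), scores

# Toy example: a 64-d user representation and 1,000 candidate items.
rng = np.random.default_rng(0)
user_repr = rng.normal(size=64)
item_embeddings = rng.normal(size=(1000, 64))
best_item, _ = predict_next_item(user_repr, item_embeddings)
print(best_item)
```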
4.2.2. Generation Augmented Retrieval
This component aims to enrich user and item embeddings by leveraging the semantic understanding capabilities of LLMs. It consists of two main steps: Generation and Retrieval.
4.2.2.1. Generation
First, the system creates prompt templates. These templates are designed to incorporate existing item attributes (like name, brand, price, features, descriptions) or user profiles (like birthplace, gender, age, occupation, spending power) and their historical behaviors.
For example, a prompt for an item might be: "The beauty item has the following attributes: name is [name]; brand is [brand]; price is [price]. The item has the following features: [features]. The item has the following descriptions: [descriptions]."
For a user, it might include their demographics and a summary of their visited items, asking the LLM to "conclude the user's preference." (Detailed templates are in Appendix A.1 of the paper).
An LLM (e.g., OpenAI API for public datasets, Qwen2.5-7B-Instruct for industrial data) is then invoked to process these prompts. It interprets the provided information and generates detailed, descriptive texts for all items and users.
After generating these descriptive texts, embeddings are extracted from them. For open-source datasets, embeddings are directly obtained from OpenAI API. For internal industrial data, a pre-trained text encoder (e.g., LLM2Vec) is used to convert the generated text into semantic embeddings.
As a result, two semantic embedding databases are constructed:
- A user semantic embedding database: the LLM-generated semantic embeddings of all users.
- An item semantic embedding database: the LLM-generated semantic embeddings of all items.

Each embedding has a fixed semantic embedding dimension determined by the text encoder.
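A hypothetical sketch of the generation step is shown below: it fills an item prompt template (paraphrasing the example quoted above) with attributes and returns the text that would be sent to the LLM. The example item attributes and the helper name `build_item_prompt` are illustrative assumptions, not part of the paper.

```python
# The template paraphrases the item-prompt example quoted above; the item
# attributes and the helper name are purely illustrative.
ITEM_PROMPT = (
    "The beauty item has the following attributes: name is {name}; brand is {brand}; "
    "price is {price}. The item has the following features: {features}. "
    "The item has the following descriptions: {descriptions}."
)

def build_item_prompt(item: dict) -> str:
    """Fill the prompt template with one item's attributes."""
    return ITEM_PROMPT.format(**item)

prompt = build_item_prompt({
    "name": "Hydrating Face Cream", "brand": "ExampleBrand", "price": "$12.99",
    "features": "non-greasy; fragrance-free", "descriptions": "A daily moisturizer.",
})
# `prompt` would then be sent to the LLM (OpenAI API or Qwen2.5-7B-Instruct in the
# paper's setup), and the generated descriptive text encoded into a semantic
# embedding (OpenAI embeddings or LLM2Vec) to populate the databases above.
```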
4.2.2.2. Retrieval
To further enhance feature representation and address data sparsity, a nearest-neighbor retrieval strategy is employed. For each user or item, the system retrieves the top-k most similar users or items based on the cosine similarity between their LLM-generated semantic embeddings. These retrieved embeddings are then aggregated using average pooling.
Formally, for a given user with LLM embedding $\mathbf{u}$ and an item with LLM embedding $\mathbf{i}$, the retrieval process is expressed as:
$
\begin{array}{l}
\bar{\mathbf{u}} = \mathrm{AvgPooling}\left( \{ \mathbf{u}_i \mid \mathbf{u}_i \in \mathrm{Top@k}(\mathbf{u}) \setminus \{\mathbf{u}\} \} \right) \\
\bar{\mathbf{i}} = \mathrm{AvgPooling}\left( \{ \mathbf{i}_j \mid \mathbf{i}_j \in \mathrm{Top@k}(\mathbf{i}) \setminus \{\mathbf{i}\} \} \right)
\end{array}
$
Where:
- $\bar{\mathbf{u}}$: The averaged embedding representing the aggregated knowledge from the top-k similar users to user $u$.
- $\bar{\mathbf{i}}$: The averaged embedding representing the aggregated knowledge from the top-k similar items to item $i$.
- $\mathbf{u}$: The LLM embedding of the current user.
- $\mathbf{i}$: The LLM embedding of the current item.
- $\mathbf{u}_i$: The LLM embedding of a neighboring user.
- $\mathbf{i}_j$: The LLM embedding of a neighboring item.
- $\mathrm{Top@k}(\mathbf{u})$: Denotes the set of top-k user LLM embeddings that are most similar to $\mathbf{u}$ (based on cosine similarity).
- $\mathrm{Top@k}(\mathbf{i})$: Denotes the set of top-k item LLM embeddings that are most similar to $\mathbf{i}$.
- $\setminus \{\mathbf{u}\}$ and $\setminus \{\mathbf{i}\}$: Exclude the user/item itself from its own set of similar entities to retrieve distinct neighbors.
- $\mathrm{AvgPooling}(\cdot)$: The average pooling operation that sums the embeddings and divides by the count of elements.

Through this process, each user and item is not only represented by its own LLM embedding but also enriched with contextual information derived from its nearest neighbors. These retrieved and averaged embeddings ($\bar{\mathbf{u}}$ and $\bar{\mathbf{i}}$) are then frozen and cached for subsequent steps, meaning they are computed once offline and not updated during the model training.
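A minimal NumPy sketch of this retrieval step follows, assuming the LLM embedding database is held as a matrix with one row per entity; the database size and dimensions are toy assumptions.

```python
import numpy as np

def topk_neighbor_embedding(idx, all_embs, k=5):
    # Cosine similarity of entity `idx` to every entity, self excluded,
    # then average pooling over the k most similar neighbors (Eq. 2).
    normed = all_embs / np.linalg.norm(all_embs, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    sims[idx] = -np.inf                      # exclude the entity itself
    neighbors = np.argsort(-sims)[:k]
    return all_embs[neighbors].mean(axis=0)  # average pooling over neighbors

# Example: aggregate the 5 nearest users for the first user in a toy database.
rng = np.random.default_rng(0)
user_db = rng.normal(size=(1000, 1536))      # toy LLM embedding database
u_bar = topk_neighbor_embedding(0, user_db, k=5)
print(u_bar.shape)                           # (1536,)
```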
4.2.3. Holistic Attention Enhancement (HAE)
The Holistic Attention Enhancement module is designed to dynamically fuse the LLM-derived semantic embeddings and the retrieved similar neighbor embeddings to capture users' dynamic interests effectively.
The inputs to this module are:
- The LLM embedding of a specific user $u_i$: $\mathbf{u}_i$.
- The averaged embedding of users similar to $u_i$: $\bar{\mathbf{u}}_i$.
- The LLM embedding of a specific item $i_j$: $\mathbf{i}_j$.
- The averaged embedding of items similar to $i_j$: $\bar{\mathbf{i}}_j$.
4.2.3.1. Attention Mechanism Definition
The paper defines a specific attention mechanism $\mathcal{A}(\mathbf{q}, \mathbf{v})$. Unlike traditional softmax-based attention that normalizes scores to sum to 1 (potentially leading to a single dominant focus), this version uses a Sigmoid function $\sigma(\cdot)$. This choice is made to avoid the single-peak issue of softmax, allowing for a representation that better reflects users' diverse preferences while maintaining more raw interest patterns.
The attention mechanism is defined as:
$
\mathcal{A}(\mathbf{q}, \mathbf{v}) = \sigma \left( \frac{\mathbf{qv}^T}{\sqrt{d}} \right) \mathbf{v}
$
Where:
- $\mathcal{A}(\cdot, \cdot)$: The attention function.
- $\mathbf{q}$: The query vector (typically derived from the user or current context).
- $\mathbf{v}$: The value vector (representing an item or a set of items). In this formulation, $\mathbf{v}$ also implicitly serves as the key vector, as its transpose is used for the dot product with $\mathbf{q}$.
- $\mathbf{q}\mathbf{v}^T$: The dot product between the query and value (key) vectors, measuring their similarity.
- $\sqrt{d}$: A scaling factor to normalize the dot product, where $d$ is the dimension of the embeddings.
- $\sigma(\cdot)$: The Sigmoid activation function, which outputs a value between 0 and 1, allowing for independent weighting of different interest components, rather than forced competition as in softmax.
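The following is a short PyTorch sketch of this Sigmoid-gated attention for single query/value vectors; batching and exact layer placement are implementation details not specified here.

```python
import torch

def sigmoid_attention(q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """A(q, v) = sigmoid(q·v / sqrt(d)) * v. The scalar gate lies in (0, 1),
    so each interest component is weighted independently rather than competing
    through a softmax normalization."""
    d = q.shape[-1]
    gate = torch.sigmoid((q * v).sum(dim=-1, keepdim=True) / d ** 0.5)
    return gate * v

# Example with 64-dimensional embeddings.
q, v = torch.randn(64), torch.randn(64)
out = sigmoid_attention(q, v)   # same shape as v, scaled by its relevance to q
```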
4.2.3.2. Multi-level Attention Operations
The holistic attention enhancement is computed through a series of attention operations at different levels, capturing various aspects of user-item interaction and context.
- Self-Attention (Core Interaction): This captures the core interaction pattern between the current user and the current item, using their direct LLM embeddings. $ \mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}} = \mathcal{A}(\mathbf{u}_i, \mathbf{i}_j) $ Here, the user's LLM embedding acts as the query, and the item's LLM embedding acts as both the key and value. This component focuses on the direct semantic relationship.
- Similar-Attention (Neighborhood Context): This captures contextual information from the neighborhood of similar users and items, using their averaged embeddings. $ \mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}} = \mathcal{A}(\bar{\mathbf{u}}_i, \bar{\mathbf{i}}_j) $ Here, the averaged embedding of similar users acts as the query, and the averaged embedding of similar items acts as key and value. This component brings in broader collaborative-semantic context.
- Global-Attention (Holistic Context): To capture a more comprehensive global interest, the raw user embeddings and similar user embeddings are concatenated to form a global user query, and similarly for items to form a global item key/value. $ \mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}} = \mathcal{A}([\mathbf{u}_i \parallel \bar{\mathbf{u}}_i], [\mathbf{i}_j \parallel \bar{\mathbf{i}}_j]) $ Where:
  - $\parallel$: Denotes the concatenation operation, joining two vectors end-to-end.
  - $[\mathbf{u}_i \parallel \bar{\mathbf{u}}_i]$: The concatenated query vector combining the specific user's LLM embedding and their similar users' aggregated embedding.
  - $[\mathbf{i}_j \parallel \bar{\mathbf{i}}_j]$: The concatenated value/key vector combining the specific item's LLM embedding and its similar items' aggregated embedding.

  This global attention captures interaction patterns from a broader, more enriched perspective.
These three attention-enhanced vectors ($\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}$) are then concatenated together. This combined vector is passed through a Multi-Layer Perceptron (MLP) to adjust its dimension to fit the input size expected by the underlying SRS backbone $f$:
$
\mathbf{i}_{j,\mathrm{all}} = \mathrm{MLP} \left( [\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}} \parallel \mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}} \parallel \mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}] \right)
$
Where:
- $\mathbf{i}_{j,\mathrm{all}}$: The final holistic attention-enhanced embedding for item $i_j$.
- $\mathrm{MLP}(\cdot)$: A Multi-Layer Perceptron that transforms the concatenated vector into the appropriate dimension.

This process ensures that the semantic information from LLMs is preserved and adaptively fused through various levels of attention, creating a rich and robust representation for each item, which then serves as the input to the chosen SRS backbone.
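Putting the three attention levels and the MLP together, a hedged PyTorch sketch of the HAE module might look as follows. The 1536-dimensional LLM embeddings and 64-dimensional hidden size echo the implementation details reported later, but the two-layer ReLU MLP and the batching are assumptions, since the paper only specifies that the MLP maps the concatenation to the backbone's input size.

```python
import torch
import torch.nn as nn

class HolisticAttentionEnhancement(nn.Module):
    """Sketch of the HAE module: self-, similar-, and global-level sigmoid
    attention (Eq. 4), concatenated and projected by an MLP to the SRS
    backbone's hidden size (Eq. 5)."""

    def __init__(self, llm_dim: int, hidden_dim: int):
        super().__init__()
        # self (llm_dim) + similar (llm_dim) + global (2 * llm_dim) = 4 * llm_dim.
        self.mlp = nn.Sequential(
            nn.Linear(4 * llm_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    @staticmethod
    def _attn(q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # A(q, v) = sigmoid(q·v / sqrt(d)) * v, applied row-wise over the batch.
        d = q.shape[-1]
        gate = torch.sigmoid((q * v).sum(dim=-1, keepdim=True) / d ** 0.5)
        return gate * v

    def forward(self, u, u_bar, i, i_bar):
        h_self = self._attn(u, i)                                  # core interaction
        h_similar = self._attn(u_bar, i_bar)                       # neighborhood context
        h_global = self._attn(torch.cat([u, u_bar], dim=-1),
                              torch.cat([i, i_bar], dim=-1))       # holistic context
        return self.mlp(torch.cat([h_self, h_similar, h_global], dim=-1))

# Example: a batch of 32 items with 1536-d LLM embeddings mapped to a 64-d backbone input.
hae = HolisticAttentionEnhancement(llm_dim=1536, hidden_dim=64)
u, u_bar, i, i_bar = (torch.randn(32, 1536) for _ in range(4))
item_input = hae(u, u_bar, i, i_bar)   # shape (32, 64), fed to the SRS backbone
```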
4.2.4. Training and Deployment Complexity
GRASP is designed to be a flexible module that enhances existing SRS backbones rather than replacing them. This means it can be integrated on top of models like GRU4Rec, BERT4Rec, or SASRec.
The overall training objective uses the standard loss function of the chosen SRS backbone. For binary cross-entropy loss (common in recommendation for distinguishing positive from negative items), it is defined as:
$
\mathcal{L} = - \frac{1}{|\mathcal{B}|} \sum_j \left[ y_j \log(\hat{y}_j) + (1 - y_j) \log(1 - \hat{y}_j) \right]
$
And the predicted probability for item $i_j$ is calculated as:
$
\hat{y}_j = \sigma \left( \mathbf{o} \cdot \mathbf{i}_{j,\mathrm{all}} \right)
$
Where:
- $\mathcal{L}$: The total loss function to be minimized during training.
- $|\mathcal{B}|$: The size of the candidate pool of items (including both positive and negative samples).
- $y_j$: The ground truth label for item $i_j$ (1 if it's a positive interaction, 0 otherwise).
- $\hat{y}_j$: The predicted probability that item $i_j$ is the next interaction.
- $\sigma(\cdot)$: The Sigmoid function, used here to convert the dot-product score into a probability.
- $\mathbf{o}$: The user representation learned by the SRS backbone. This representation captures the user's dynamic interests from their historical sequence.
- $\mathbf{i}_{j,\mathrm{all}}$: The holistic attention-enhanced embedding for item $i_j$, computed via Equation (5) as described in the Holistic Attention Enhancement section.
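As a sketch of how this objective can be computed, the snippet below assumes PyTorch and a candidate pool of one positive item plus 100 sampled negatives (matching the evaluation protocol described later); the helper name and toy tensors are illustrative.

```python
import torch
import torch.nn.functional as F

def grasp_bce_loss(user_repr, item_embs, labels):
    """Binary cross-entropy over a candidate pool (Eq. 6):
    y_hat_j = sigmoid(o · i_{j,all}), averaged against the 0/1 labels."""
    logits = (item_embs * user_repr.unsqueeze(0)).sum(dim=-1)   # o · i_{j,all} per candidate
    return F.binary_cross_entropy_with_logits(logits, labels)    # mean over the pool

# Example: 1 positive + 100 sampled negatives for one user.
user_repr = torch.randn(64)                 # o, from the SRS backbone
item_embs = torch.randn(101, 64)            # i_{j,all}, from the HAE module
labels = torch.zeros(101); labels[0] = 1.0  # index 0 is the positive item
loss = grasp_bce_loss(user_repr, item_embs, labels)
```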
For practical deployment, the LLM-generated embeddings ($\mathbf{u}$, $\mathbf{i}$, $\bar{\mathbf{u}}$, $\bar{\mathbf{i}}$) are precomputed offline daily to account for user behavior changes. This ensures that the online system does not incur the high computational cost of LLM inference. The main online computational overhead introduced by GRASP is the holistic attention module, whose time complexity is limited once the sequence length and the latent dimension of the SRS backbone are fixed. To further optimize for industrial deployment, the retrieval of similar users and items can also be pre-retrieved offline within smaller groups (e.g., within the same category), rather than across the entire universe of users and items, significantly alleviating the nearest-neighbor search complexity.
Algorithm 1 Pseudo code of GRASP
The overall process of GRASP can be summarized by the following pseudo-code:
Algorithm 1 Pseudo code of GRASP.
Require: Interaction sequence $S_u$.
1: Generate the LLM embedding databases for users and items; retrieve similar users/items and generate $\bar{\mathbf{u}}$, $\bar{\mathbf{i}}$ by Eq. (2). (This step is performed offline.)

Training
2: Freeze $\mathbf{u}$, $\mathbf{i}$, $\bar{\mathbf{u}}$, and $\bar{\mathbf{i}}$. (These precomputed embeddings are static during training.)
3: for each iteration do
4: Compute the fine-grained and global enhanced embeddings using Eq. (4). (This refers to $\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}}$, and $\mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}$.)
5: Compute the input sequence embedding after holistic attention by Eq. (5). (This combines the enhanced embeddings into $\mathbf{i}_{j,\mathrm{all}}$.)
6: Calculate the loss function using Eq. (6). (The overall training objective.)
7: Update model parameters. (Parameters of the SRS backbone and the MLP in GRASP are updated.)
8: end for

Testing
9: Return the trained model parameters.
10: for each user in the test set do
11: Obtain the corresponding input embeddings from $\mathbf{u}$, $\mathbf{i}$, $\bar{\mathbf{u}}$, and $\bar{\mathbf{i}}$, and load the trained model parameters.
12: Compute the scores of items in the candidate set by Eq. (1) and return the ranked order. (Use the enhanced item embeddings and the user representation from the SRS backbone to predict scores and rank items.)
13: end for
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three datasets: two publicly available benchmarks and one industrial dataset.
- Amazon Beauty [28]: This dataset is sourced from Amazon and contains user reviews on beauty-related products.
- Domain: E-commerce, beauty products.
- Amazon Fashion [28]: Also from the Amazon collection, this dataset contains user reviews on fashion items.
- Domain: E-commerce, fashion products.
- Industry-100K: This is a subset collected from user purchase records on an internal e-commerce platform. It captures transactions from January 17, 2025, to February 23, 2025.
-
Domain: Industrial e-commerce, user purchase behavior.
-
Characteristics: Represents real-world e-commerce interactions.
The following are the statistics from Table 1 of the original paper:
| Dataset | # User | # Item | # AVG Length | Sparsity |
|---|---|---|---|---|
| Beauty | 52204 | 57289 | 7.56 | 99.99% |
| Fashion | 9049 | 4722 | 3.82 | 99.92% |
| Industry-100K | 99711 | 1205282 | 20.88 | 99.99% |
-
# User: The total number of unique users in the dataset.
-
# Item: The total number of unique items in the dataset.
-
# AVG Length: The average length of user interaction sequences.
-
Sparsity: The percentage of possible user-item interactions that are missing (i.e., not observed). A high sparsity (close to 100%) indicates very few interactions relative to the total possible interactions, which is common in recommendation systems and poses a challenge.

Preprocessing: The datasets were preprocessed following established practices from SASRec [21] and LLM-ESR [27].

Data Partitioning: A leave-one-out strategy was adopted for validation and testing (a minimal sketch is shown after this subsection). This means for each user, the last interaction is used as the test item, the second-to-last for validation, and the preceding items form the training sequence.

Head/Tail Partitioning: Data was also partitioned into head and tail segments based on the Pareto Principle. Head users/items are those with interaction frequencies in the top 20%, while tail users/items constitute the remaining 80%.
- Beauty: Head/tail user demarcation at 9 interactions; item demarcation at 4 interactions.
- Fashion: Head/tail user threshold at 3 interactions; item threshold at 4 interactions.
- Industry-100K: Head/tail user criterion at 29 interactions; item criterion at 2 interactions.

These datasets are effective for validating the method's performance because they cover both widely used public benchmarks and a large-scale, real-world industrial setting. They also include varying levels of sparsity and average sequence lengths, allowing for a robust evaluation across different data characteristics. The head/tail analysis specifically targets the hallucination problem in sparse (tail) scenarios.
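For concreteness, here is a minimal sketch of the leave-one-out split described above; the item IDs are arbitrary toy values, and sequences with fewer than three interactions would need special handling.

```python
def leave_one_out_split(sequence):
    """Leave-one-out protocol: the last item is held out for testing, the
    second-to-last for validation, and the preceding items form training."""
    return sequence[:-2], sequence[-2], sequence[-1]

# Toy interaction sequence of item IDs.
train_seq, val_item, test_item = leave_one_out_split([3, 17, 42, 8, 99])
print(train_seq, val_item, test_item)   # [3, 17, 42] 8 99
```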
5.2. Evaluation Metrics
To comprehensively assess the performance of the models, Normalized Discounted Cumulative Gain (NDCG) and Hit Rate (HR) are utilized. Both metrics are reported at various ranking positions $k$.
-
Normalized Discounted Cumulative Gain (NDCG@k):
  - Conceptual Definition: NDCG measures the ranking quality of a recommendation list, taking into account the position of relevant items. It assigns higher scores to highly relevant items appearing at the top of the list and penalizes relevant items that are ranked lower. The "Normalized" part means it's scaled by the ideal DCG (the DCG of a perfect ranking), so scores range from 0 to 1.
  - Mathematical Formula: First, Discounted Cumulative Gain (DCG) is calculated: $ DCG_k = \sum_{i=1}^k \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then, the Ideal Discounted Cumulative Gain (IDCG) is calculated, which is the DCG for the list perfectly sorted by relevance: $ IDCG_k = \sum_{i=1}^k \frac{2^{\mathrm{rel}_i^{opt}} - 1}{\log_2(i+1)} $ Finally, NDCG is obtained by normalizing DCG by IDCG: $ NDCG_k = \frac{DCG_k}{IDCG_k} $
  - Symbol Explanation:
    - $k$: The number of top items in the recommendation list being considered (e.g., 1, 3, 5, 10, 20).
    - $i$: The rank position of an item in the recommendation list.
    - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the actual recommendation list. In typical sequential recommendation with binary feedback, relevance is usually 1 for the ground truth next item and 0 for others.
    - $\mathrm{rel}_i^{opt}$: The relevance score of the item at position $i$ in the ideal recommendation list (where all relevant items are ranked highest).
    - $\log_2(i+1)$: A logarithmic discount factor, meaning items at lower ranks contribute less to the total score.
-
Hit Rate (HR@k):
  - Conceptual Definition: Hit Rate measures whether the target (ground truth) item is present anywhere within the top-$k$ recommended items. It indicates how often the recommender system successfully "hits" the correct item within its top-k predictions.
  - Mathematical Formula: $ HR@k = \frac{\text{Number of users for whom the target item is in top-}k}{\text{Total number of users}} $
  - Symbol Explanation:
    - $k$: The number of top items in the recommendation list being considered.
    - "Number of users for whom the target item is in top-k": This counts how many times the actual next item the user interacted with appeared within the top $k$ items recommended by the model.
    - "Total number of users": The total number of users for whom recommendations were generated.

During evaluation, negative sampling was performed with a size of 100, meaning for each positive interaction, 100 negative items were randomly sampled to create the candidate pool for ranking. A minimal sketch of how these two metrics are computed over such a candidate pool is given below.
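The sketch below computes HR@k and NDCG@k for one user under this protocol (a single relevant item plus 100 sampled negatives); the random scores stand in for real model outputs. With one relevant item, $IDCG_k = 1$, so NDCG@k reduces to $1 / \log_2(\mathrm{rank} + 2)$ when the item lands in the top-$k$.

```python
import numpy as np

def hr_ndcg_at_k(rank: int, k: int):
    """Given the 0-based rank of the single ground-truth item inside the scored
    candidate pool, return (HR@k, NDCG@k)."""
    if rank < k:
        return 1.0, 1.0 / np.log2(rank + 2)
    return 0.0, 0.0

# Example: one positive item (index 0) plus 100 sampled negatives, ranked by score.
scores = np.random.default_rng(0).normal(size=101)   # placeholder model scores
rank = int(np.argsort(-scores).tolist().index(0))    # 0-based rank of the positive item
hr10, ndcg10 = hr_ndcg_at_k(rank, k=10)
print(hr10, round(ndcg10, 4))
```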
5.3. Baselines
The paper compares GRASP against a set of sequential recommendation models, categorized into traditional backbones and LLM-enhanced models.
- Traditional Sequential Recommendation Backbones (GRASP integrates with these):
  - GRU4Rec [16]: A Recurrent Neural Network (RNN) based model using Gated Recurrent Units (GRUs) to capture sequential patterns.
  - BERT4Rec [35]: A Transformer-based model that adapts the Masked Language Modeling objective for sequential recommendation.
  - SASRec [21]: A Transformer-based model utilizing self-attention to capture long-range dependencies in user interaction sequences.
- LLM-Enhanced Sequential Recommendation Models (compared against):
  - RLMRec [34]: A recommendation model that uses representation learning with Large Language Models.
  - LLMInit [15, 17]: Methods that leverage LLMs for embedding initialization in sequential recommendation.
  - LLM-ESR [27]: A model that uses LLM hidden embeddings to identify and supervise similar users via a self-distillation loss to enhance sequential recommendation. This is a direct competitor that GRASP aims to surpass, particularly in hallucination robustness.

These baselines are representative because they cover the spectrum from foundational RNN-based models to advanced Transformer-based models, and recent LLM-enhanced approaches, providing a comprehensive comparison for GRASP's effectiveness and its unique contribution to the LLM4Rec field.
5.4. Implementation Details
- Hardware: All experiments were conducted on a single NVIDIA A100 GPU.
- Sequence Length: The maximum sequence length for user interaction histories was fixed at 100.
- Hidden Embedding Dimension: The hidden embedding dimension for all methods (including the SRS backbones and GRASP's internal representations before the MLP) was set to 64.
- Batch Size: The training batch size was 128.
- Optimizer: The Adam optimizer was used for training.
- Learning Rate: A fixed learning rate of 0.001 was applied.
- Early Stopping: To prevent overfitting, early stopping was implemented. Training would cease if the NDCG@10 metric on the validation set did not improve for 20 consecutive epochs.
- Robustness: To ensure the robustness and reliability of the reported results, experiments were run three times with different random seeds, and the average results are reported.
- LLM Embeddings for Public Datasets: For the Amazon Beauty and Fashion datasets, the OpenAI API was utilized to obtain LLM embeddings. The dimension of these embeddings was 1536.
- LLM Embeddings for Industrial Dataset: Due to data confidentiality requirements for the Industry-100K dataset, an open-source LLM was used. Qwen2.5-7B-Instruct [44] was employed to generate descriptive texts. Subsequently, the pre-trained text encoder LLM2Vec [3] was used to convert these texts into semantic embeddings, resulting in a dimension of 4096.

These details highlight a rigorous and consistent experimental setup, allowing for fair comparison across models and demonstrating the practicality of GRASP in both academic and industrial settings.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Performance
The paper demonstrates that GRASP consistently achieves state-of-the-art performance across the three tested benchmarks, outperforming both traditional SRS baselines and other LLM-enhanced sequential recommendation models. The results are presented in Table 2.
The following are the results from Table 2 of the original paper:
| Dataset | Model | N@1 | N@3 | N@5 | N@10 | N@20 | H@1 | H@3 | H@5 | H@10 | H@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | GRU4Rec | 11.48 | 16.88 | 19.23 | 22.42 | 25.62 | 11.48 | 20.83 | 26.56 | 36.45 | 49.17 |
| RLMRec | 11.03 | 16.50 | 18.93 | 22.25 | 25.48 | 11.03 | 20.51 | 26.41 | 36.75 | 49.60 | |
| - LLMInit | 14.33 | 20.53 | 23.05 | 26.29 | 29.37 | 14.33 | 25.08 | 31.20 | 41.24 | 53.46 | |
| - LLM-ESR | 17.50 | 25.25 | 28.18 | 31.82 | 35.05 | 17.50 | 30.88 | 38.02 | 49.28 | 62.09 | |
| - GRASP | 18.15 | 26.88 | 30.21 | 34.16 | 37.56 | 18.15 | 33.24 | 41.35 | 53.57 | 67.03 | |
| BERT4Rec | 10.49 | 16.86 | 19.68 | 23.33 | 26.67 | 10.49 | 21.56 | 28.43 | 39.72 | 52.93 | |
| - RLMRec | 10.45 | 16.82 | 19.66 | 23.20 | 26.72 | 10.45 | 21.50 | 28.42 | 39.43 | 53.39 | |
| - LLMInit | 16.57 | 24.74 | 27.92 | 31.65 | 34.89 | 16.57 | 30.71 | 38.45 | 50.00 | 62.86 | |
| - LLM-ESR | 21.66 | 29.58 | 32.37 | 35.76 | 38.59 | 21.66 | 35.27 | 42.05 | 52.55 | 63.77 | |
| - GRASP | 23.61 | 33.62 | 37.19 | 41.01 | 44.12 | 23.61 | 40.89 | 49.56 | 61.47 | 73.62 | |
| SASRec | 18.84 | 25.22 | 27.60 | 30.58 | 33.47 | 18.84 | 29.83 | 35.62 | 44.88 | 56.34 | |
| RLMRec | 17.93 | 24.16 | 26.56 | 29.64 | 32.50 | 17.93 | 28.70 | 34.56 | 44.10 | 55.48 | |
| - LLMInit | 19.00 | 27.40 | 30.50 | 34.08 | 37.02 | 19.00 | 33.51 | 41.05 | 52.14 | 63.78 | |
| - LLM-ESR | 20.73 | 29.73 | 33.10 | 36.99 | 40.26 | 20.73 | 36.27 | 44.49 | 56.50 | 69.44 | |
| - GRASP | 26.56 | 36.18 | 39.33 | 42.76 | 45.61 | 26.56 | 43.09 | 50.74 | 61.33 | 72.62 | |
| Fashion | GRU4Rec | 32.71 | 38.31 | 42.37 | 48.58 | 55.07 | 32.71 | 38.31 | 42.37 | 48.58 | 55.07 |
| - RLMRec | 25.84 | 29.04 | 34.20 | 36.66 | 39.81 | 29.04 | 37.76 | 41.56 | 47.74 | 55.91 | |
| - LLMInit | 33.31 | 37.48 | 38.71 | 40.29 | 42.21 | 33.31 | 40.32 | 43.31 | 48.26 | 55.89 | |
| - LLM-ESR | 37.90 | 42.11 | 43.42 | 45.43 | 47.38 | 37.90 | 45.03 | 49.77 | 56.01 | 64.36 | |
| - GRASP | 38.39 | 42.88 | 44.40 | 46.41 | 48.51 | 38.39 | 46.06 | 50.00 | 56.01 | 64.36 | |
| BERT4Rec | 28.61 | 32.00 | 33.37 | 35.58 | 37.76 | 28.61 | 34.39 | 37.74 | 44.68 | 53.23 | |
| - RLMRec | 26.95 | 31.92 | 33.40 | 35.41 | 37.36 | 26.95 | 35.33 | 38.95 | 45.16 | 52.91 | |
| - LLMInit | 33.99 | 37.84 | 38.92 | 40.62 | 42.43 | 33.99 | 40.48 | 43.12 | 48.42 | 55.67 | |
| - LLM-ESR | 37.70 | 42.37 | 43.75 | 45.43 | 47.19 | 37.70 | 45.70 | 49.04 | 54.26 | 61.26 | |
| - GRASP | 37.11 | 42.38 | 43.97 | 46.22 | 48.36 | 37.11 | 46.09 | 50.00 | 57.01 | 65.46 | |
| SASRec | 39.32 | 41.93 | 42.84 | 44.13 | 45.64 | 39.32 | 43.75 | 45.95 | 49.95 | 55.98 | |
| - RLMRec | 39.94 | 41.96 | 42.72 | 43.92 | 45.32 | 39.94 | 43.40 | 45.26 | 48.98 | 54.55 | |
| - LLMInit | 38.91 | 42.52 | 44.04 | 46.66 | 48.27 | 38.91 | 45.68 | 49.80 | 55.32 | 61.92 | |
| - LLM-ESR | 39.93 | 43.92 | 45.29 | 47.15 | 49.17 | 39.93 | 46.79 | 50.13 | 55.92 | 64.02 | |
| - GRASP | 42.16 | 46.92 | 48.50 | 50.50 | 52.57 | 42.16 | 50.30 | 54.15 | 60.31 | 68.57 | |
| Industry-100K | GRU4Rec | 4.78 | 6.68 | 7.78 | 9.58 | 11.18 | 4.78 | 8.07 | 10.76 | 16.36 | 22.70 |
| - RLMRec | 4.10 | 6.13 | 7.27 | 8.92 | 10.42 | 4.10 | 7.65 | 10.42 | 15.56 | 21.50 | |
| - LLMInit | 4.55 | 8.64 | 10.98 | 14.57 | 18.66 | 4.55 | 11.74 | 17.44 | 28.60 | 44.89 | |
| - LLM-ESR | 11.84 | 18.32 | 21.25 | 25.13 | 28.92 | 11.84 | 23.10 | 30.25 | 42.28 | 57.32 | |
| - GRASP | 13.02 | 21.04 | 24.23 | 28.73 | 32.73 | 13.02 | 26.24 | 35.47 | 48.99 | 64.15 | |
| BERT4Rec | 4.78 | 6.68 | 7.78 | 9.58 | 11.18 | 4.78 | 8.07 | 10.76 | 16.36 | 22.70 | |
| - RLMRec | 4.10 | 6.13 | 7.27 | 8.92 | 10.42 | 4.10 | 7.65 | 10.42 | 15.56 | 21.50 | |
| - LLMInit | 4.55 | 8.64 | 10.98 | 14.57 | 18.66 | 4.55 | 11.74 | 17.44 | 28.60 | 44.89 | |
| - LLM-ESR | 11.84 | 18.32 | 21.25 | 25.13 | 28.92 | 11.84 | 23.10 | 30.25 | 42.28 | 57.32 | |
| - GRASP | 13.21 | 21.04 | 24.30 | 28.66 | 32.73 | 13.21 | 26.09 | 35.35 | 48.87 | 64.15 |
- Beauty Dataset: GRASP with SASRec as backbone achieves the highest NDCG@k and HR@k scores. For NDCG@10, GRASP reaches 42.76, significantly surpassing the best LLM-ESR score of 36.99 (paired with SASRec). This represents an average improvement of 4.56% over the previous best-performing model (LLM-ESR) across all metrics for SASRec. GRASP also boosts GRU4Rec and BERT4Rec performance significantly compared to their base and other LLM-enhanced variants. For instance, GRASP with BERT4Rec achieves NDCG@10 of 41.01, compared to LLM-ESR's 35.76.
- Fashion Dataset: Similar trends are observed. GRASP with SASRec again yields the highest NDCG@10 at 50.50, outperforming LLM-ESR's 47.15. This constitutes a 1.81% improvement over LLM-ESR.
- Industry-100K Dataset: On this large-scale industrial dataset, GRASP demonstrates an even more substantial gain. For NDCG@10, GRASP with SASRec achieves 28.66, compared to LLM-ESR's 25.13. This represents a remarkable 6.68% overall improvement.

The consistent performance boosts across GRU4Rec, BERT4Rec, and SASRec backbones highlight GRASP's flexibility and transferability. It effectively enhances diverse SRS architectures, affirming its general applicability and robustness.
6.1.2. Performance Under Different Groups
The paper further analyzes GRASP's effectiveness in tail scenarios (data-scarce users or items), where LLM hallucinations are more prevalent. The results, presented in Table 3, compare GRASP against LLM-ESR and traditional SRS baselines.
The following are the results from Table 3 of the original paper:
| Dataset | Model | Tail User | Tail Item | Head User | Head Item | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N@5 | H@5 | N@10 | H@10 | N@5 | H@5 | N@10 | H@10 | N@5 | H@5 | N@10 | H@10 | N@5 | H@5 | N@10 | H@10 | ||
| Beauty | GRU4Rec | 18.51 | 25.68 | 21.73 | 35.67 | 5.11 | 6.28 | 5.52 | 7.58 | 22.53 | 30.58 | 25.56 | 40.00 | 22.60 | 31.39 | 26.45 | 43.33 |
| - LLM-ESR | 27.58 | 37.34 | 31.26 | 48.76 | 6.72 | 10.33 | 8.61 | 16.23 | 30.96 | 41.10 | 34.35 | 51.64 | 33.30 | 44.62 | 37.35 | 57.16 | |
| - GRASP | 29.60 | 40.57 | 33.64 | 53.04 | 15.88 | 23.92 | 19.65 | 35.63 | 34.68 | 46.91 | 38.60 | 59.05 | 34.00 | 45.95 | 38.07 | 58.53 | |
| BERT4Rec | 18.90 | 27.34 | 22.51 | 38.54 | 0.05 | 0.12 | 0.26 | 0.76 | 23.24 | 33.40 | 27.04 | 45.11 | 24.36 | 35.18 | 28.83 | 49.01 | |
| - LLM-ESR | 31.56 | 41.04 | 34.97 | 51.59 | 7.05 | 9.17 | 8.22 | 12.83 | 36.06 | 46.67 | 39.38 | 56.94 | 38.41 | 49.89 | 42.33 | 62.03 | |
| - GRASP | 36.44 | 48.57 | 40.26 | 60.39 | 14.62 | 22.83 | 18.44 | 34.74 | 40.59 | 54.07 | 44.44 | 65.98 | 42.57 | 55.92 | 46.46 | 67.76 | |
| SASRec | 26.83 | 34.52 | 29.82 | 43.78 | 5.89 | 6.90 | 6.52 | 8.89 | 31.08 | 40.63 | 34.07 | 49.88 | 32.77 | 42.46 | 36.32 | 53.46 | |
| - LLM-ESR | 32.31 | 43.51 | 36.20 | 55.56 | 7.44 | 12.58 | 10.29 | 21.47 | 36.74 | 48.96 | 40.56 | 60.79 | 39.23 | 52.10 | 43.35 | 64.85 | |
| - GRASP | 38.83 | 49.93 | 42.20 | 60.36 | 23.03 | 31.70 | 26.34 | 41.82 | 41.63 | 54.49 | 45.29 | 65.79 | 43.23 | 55.28 | 46.67 | 65.98 | |
| Fashion | GRU4Rec | 22.16 | 31.00 | 24.69 | 38.79 | 0.36 | 0.70 | 0.80 | 2.12 | 50.86 | 57.11 | 52.20 | 61.28 | 48.31 | 58.96 | 50.95 | 67.08 |
| - LLM-ESR | 32.73 | 38.58 | 35.08 | 45.82 | 2.42 | 3.90 | 3.65 | 7.76 | 57.28 | 60.77 | 58.85 | 65.60 | 59.74 | 65.89 | 62.06 | 73.01 | |
| - GRASP | 34.00 | 40.37 | 36.37 | 47.74 | 5.89 | 9.73 | 8.23 | 17.31 | 57.96 | 61.97 | 59.43 | 66.72 | 59.77 | 65.71 | 61.60 | 71.41 | |
| BERT4Rec | 19.82 | 24.56 | 22.54 | 33.11 | 0.82 | 1.20 | 1.20 | 3.49 | 50.94 | 54.84 | 52.50 | 59.69 | 46.32 | 52.28 | 49.27 | 61.51 | |
| - LLM-ESR | 32.73 | 39.28 | 34.67 | 45.24 | 1.61 | 2.82 | 2.57 | 5.87 | 58.03 | 61.71 | 59.39 | 65.95 | 60.52 | 67.45 | 62.50 | 73.52 | |
| - GRASP | 33.26 | 40.66 | 35.97 | 48.98 | 3.30 | 5.97 | 5.64 | 13.31 | 57.86 | 62.13 | 59.57 | 67.43 | 60.16 | 67.53 | 62.38 | 74.40 | |
| SASRec - LLM-ESR | 32.35 | 35.60 | 33.82 | 40.18 | 1.68 | 2.39 | 2.13 | 3.78 | 56.45 | 59.38 | 57.49 | 62.62 | 59.22 | 63.29 | 60.85 | 68.32 | |
| - GRASP | 35.02 | 40.55 | 37.31 | 47.67 | 3.28 | 5.33 | 4.96 | 10.58 | 58.61 | 62.57 | 59.90 | 66.61 | 62.01 | 67.97 | 63.94 | 73.97 | |
| Industry-100K | GRU4Rec | 7.89 | 10.96 | 9.65 | 16.41 | 0.48 | 0.82 | 0.71 | 1.53 | 7.69 | 10.66 | 9.37 | 15.91 | 15.86 | 21.85 | 19.25 | 32.38 |
| - LLM-ESR | 20.78 | 29.68 | 24.67 | 41.74 | 20.97 | 29.97 | 24.81 | 41.87 | 23.21 | 32.67 | 27.04 | 44.53 | 21.55 | 30.57 | 25.47 | 42.73 | |
| - GRASP | 24.09 | 34.94 | 28.46 | 48.47 | 24.30 | 35.35 | 28.66 | 48.87 | 26.41 | 37.70 | 30.75 | 51.14 | 24.80 | 35.61 | 29.17 | 49.12 | |
| BERT4Rec | 14.16 | 18.17 | 15.67 | 22.86 | 5.49 | 5.64 | 5.50 | 5.64 | 17.96 | 21.60 | 19.30 | 25.76 | 25.10 | 33.17 | 28.18 | 42.73 | |
| - LLM-ESR | 26.64 | 35.85 | 30.38 | 51.09 | 26.09 | 37.07 | 30.38 | 50.34 | 25.58 | 36.30 | 30.00 | 49.99 | 26.81 | 37.56 | 31.30 | 51.47 | |
| - GRASP | 26.64 | 37.54 | 31.02 | 51.09 | 26.09 | 37.07 | 30.38 | 50.34 | 25.58 | 36.30 | 30.00 | 49.99 | 26.81 | 37.56 | 31.30 | 51.47 | |
| SASRec - LLM-ESR | 39.53 | 46.25 | 41.85 | 53.44 | 12.28 | 17.79 | 15.07 | 26.46 | 60.14 | 64.39 | 61.77 | 69.24 | 62.92 | 68.63 | 64.59 | 73.78 | |
| - GRASP | 39.53 | 46.25 | 41.85 | 53.44 | 12.28 | 17.79 | 15.07 | 26.46 | 60.14 | 64.39 | 61.77 | 69.24 | 62.92 | 68.63 | 64.59 | 73.78 | |
- Tail Scenarios (Tail User / Tail Item): In tail scenarios, where interaction data is sparse and LLM hallucination risks are highest, GRASP consistently and significantly outperforms LLM-ESR and the SRS baselines.
  - On the Fashion dataset, GRASP surpasses LLM-ESR by an average of 5.00% across NDCG@k and HR@k for tail users. For tail items, the improvement is even more dramatic; for example, GRASP with SASRec achieves NDCG@10 of 4.96 for tail items compared to LLM-ESR's 2.13.
  - The Beauty dataset shows the most substantial enhancement, with GRASP demonstrating a remarkable 9.99% average increase over LLM-ESR for tail users and particularly strong gains for tail items (e.g., GRASP with SASRec has NDCG@10 of 26.34 vs. SASRec+LLM-ESR's 10.29).
  - On the real-world Industry-100K dataset, GRASP achieves an impressive 8.42% improvement over LLM-ESR or the SRS baselines in tail scenarios.
- Head Scenarios (Head User / Head Item): GRASP maintains strong performance in head scenarios (abundant interaction data, less hallucination risk). It improves over LLM-ESR by 0.57% on Fashion, 4.30% on Beauty, and 6.41% on Industry-100K. This indicates that the gains in long-tail scenarios do not come at the expense of performance for well-represented users and items.

This balanced performance underscores GRASP's ability to mitigate hallucination effects in data-scarce situations while preserving high recommendation accuracy where data is rich. This robustness is a direct consequence of GRASP's design, which treats LLM-derived information as contextual input rather than rigid supervision.
6.1.3. Case Study: Robustness to Hallucinations
To further illustrate GRASP's robustness, the paper presents a case study in Figure 3, showing examples from Industry-100K where LLMs produced hallucinated descriptions.
The following figure (Figure 3 from the original paper) shows the purchase cases and their corresponding LLM hallucinatory responses:
Figure 3 description: The figure is a diagram illustrating two purchase cases and their corresponding LLM hallucinatory responses. Case 1 has a purchase sequence length of 2, while Case 2 has a length of 3; both show the user-next-item matching scores of GRASP and LLM-ESR.
The figure demonstrates two purchase cases:
- Case 1: A user's sequence length is 2. The LLM generates a description that includes hallucinated information (e.g., "user needs a new phone case and wants to match the phone color").
- Case 2: A user's sequence length is 3. Similarly, the LLM generates hallucinated content (e.g., "user is planning a party").

For each case, the ground truth next item is provided, along with user interaction scores (obtained by applying a sigmoid to the dot product of user and item embeddings) for both GRASP and LLM-ESR. The crucial observation is that GRASP consistently shows higher interaction scores for the ground truth next item compared to LLM-ESR, even when hallucinated descriptions are present. This indicates that GRASP is more aligned with actual user expectations. The hallucinated descriptions can introduce noise that misguides LLM-ESR, which relies on these descriptions as direct supervision. In contrast, GRASP's holistic attention mechanism, designed to adaptively process LLM-derived context, appears to filter out or down-weight the harmful effects of hallucinations, leading to more accurate predictions. This qualitatively supports GRASP's claim of enhanced robustness against LLM hallucination issues.
6.2. Ablation Study / Parameter Analysis
6.2.1. Ablation Study on Each Component
To demonstrate the effectiveness of each component within GRASP, an ablation study was conducted. This involves systematically removing or modifying parts of the Holistic Attention Enhancement (HAE) module and observing the impact on performance. The results are presented in Table 4, using SASRec as the backbone on the Beauty dataset.
The following are the results from Table 4 of the original paper:
| Module | Setting | N@1 | N@3 | N@5 | N@10 | N@20 | H@1 | H@3 | H@5 | H@10 | H@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HAE | - w/o Attention | 18.24 | 26.63 | 29.83 | 33.41 | 36.70 | 18.24 | 32.71 | 40.48 | 51.59 | 64.63 |
| | - w/o HAE similar | 18.74 | 27.16 | 30.48 | 34.35 | 37.75 | 18.74 | 33.29 | 41.36 | 53.35 | 66.83 |
| | - w/o HAE global | 20.00 | 29.29 | 32.72 | 36.62 | 39.87 | 20.00 | 36.02 | 44.36 | 56.44 | 69.31 |
| | - Softmax | 14.59 | 22.60 | 25.92 | 30.00 | 33.63 | 14.59 | 28.46 | 36.54 | 49.17 | 63.55 |
| | GRASP | 26.56 | 36.18 | 39.33 | 42.76 | 45.61 | 26.56 | 43.09 | 50.74 | 61.33 | 72.62 |
Here, "GRASP" refers to the full model, typically .
- w/o Attention: This setting removes the attention mechanism entirely from the HAE module, likely falling back to simple concatenation or element-wise operations instead of weighted fusion. Performance drops significantly compared to the full GRASP (e.g., NDCG@10 of 33.41 vs. 42.76), highlighting that the attention mechanism is crucial for effectively integrating the various semantic signals.
- w/o HAE similar: This removes the similar-attention component that incorporates information from similar users/items. NDCG@10 drops to 34.35, demonstrating the importance of explicitly modeling neighborhood context to enrich representations.
- w/o HAE global: This removes the global-attention component that concatenates the original and similar embeddings for broader interaction. NDCG@10 drops to 36.62, showing that a holistic view of both direct and contextual user-item signals is beneficial.
- Softmax: This replaces the Sigmoid function used in GRASP's attention mechanism with the traditional Softmax function. Performance drops sharply (NDCG@10 of 30.00), even below the results of removing entire attention sub-components. This result is particularly significant: it validates the design choice of Sigmoid over Softmax. The Sigmoid function allows a multi-peak representation, preserving diverse preferences and weighting interest patterns independently, whereas Softmax (forcing weights to sum to 1) may suppress all but one dominant aspect.

The substantial degradation in performance when any part of the Holistic Attention Enhancement is removed, or when Sigmoid is replaced by Softmax, confirms the efficacy of GRASP's multi-level attention design and its fine-grained user-item integration approach. A minimal sketch of such a sigmoid-gated fusion appears below.
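To see why independent sigmoid gates behave differently from a softmax over sources, consider the following minimal, hypothetical sketch in the spirit of the ablated module; the layer shapes, names, and exact fusion rule are assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of sigmoid-gated fusion of "self", "similar", and "global"
# signals; unlike a softmax over the three sources, the gates need not sum to 1,
# so several interest patterns can stay active at once.
import torch
import torch.nn as nn

class SigmoidFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate_self = nn.Linear(dim, 1)
        self.gate_similar = nn.Linear(dim, 1)
        self.gate_global = nn.Linear(2 * dim, 1)
        self.proj_global = nn.Linear(2 * dim, dim)

    def forward(self, h_self: torch.Tensor, h_similar: torch.Tensor) -> torch.Tensor:
        h_pair = torch.cat([h_self, h_similar], dim=-1)
        w_self = torch.sigmoid(self.gate_self(h_self))       # each weight lies in (0, 1)
        w_sim = torch.sigmoid(self.gate_similar(h_similar))  # and is learned independently
        w_glob = torch.sigmoid(self.gate_global(h_pair))
        return w_self * h_self + w_sim * h_similar + w_glob * self.proj_global(h_pair)
```

Replacing the three independent sigmoid gates with a softmax over the three weights would force the sources to compete for a fixed budget, which corresponds to the "- Softmax" variant that degrades performance in Table 4.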
6.2.2. Impacts of Hyper-parameters
The paper also analyzes the impact of two key hyper-parameters:
- Candidate pool size (the k in Top@k retrieval): the size of the candidate pool used for retrieving similar users/items. A minimal retrieval sketch follows the figure description below.
- Hidden embedding dimension: the hidden embedding dimension of the SRS backbone.

The following figure (Figure 4 from the original paper) shows the hyperparameter analysis of GRASP based on SASRec on the Beauty dataset.
VLM Description: The image is a chart illustrating the hyperparameter analysis of GRASP based on SASRec on the Beauty dataset. The two graphs on the left show the relationship between the candidate pool size and NDCG@10 and HR@10; the two graphs on the right display the relationship between the hidden dimension and NDCG@10 and HR@10. Data points in each graph are marked with different colors and symbols to demonstrate the impact of various hyperparameters on model performance.
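Before walking through the trends, the following is a minimal, hypothetical sketch of the kind of Top@k similarity retrieval whose pool size k is being varied here; the use of cosine similarity over precomputed embeddings is an assumption for illustration, not necessarily the paper's exact metric.

```python
# Hypothetical sketch of offline Top@k retrieval over precomputed LLM embeddings.
import torch
import torch.nn.functional as F

def topk_similar(query_emb: torch.Tensor, all_embs: torch.Tensor, k: int = 10):
    # query_emb: (d,), all_embs: (N, d); returns indices and scores of the k most
    # similar entries, skipping the top hit, which is the query itself when it is
    # contained in all_embs.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), all_embs, dim=-1)
    scores, idx = sims.topk(k + 1)
    return idx[1:], scores[1:]
```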
- Impact of the candidate pool size k: The left graphs in Figure 4 show the effect of k on NDCG@10 and HR@10.
  - When k is too small, the model fails to capture sufficient similar patterns from neighbors; very low values result in lower performance.
  - As k increases, performance generally improves up to a certain point.
  - However, excessively large values of k can introduce noise and irrelevant information by including less relevant neighbors, leading to a slight drop or plateau in performance.
  - The optimal value of k appears to be around 10, where performance peaks.
- Impact of the hidden dimension: The right graphs in Figure 4 illustrate the relationship between the hidden embedding dimension and NDCG@10 and HR@10.
  - Insufficient dimensionality cannot adequately represent complex user-item relationships and semantic information, resulting in lower performance.
  - As the dimension increases, the model's capacity to learn richer representations grows, leading to performance improvements.
  - However, excessive dimensionality can lead to diminished returns or potential overfitting if the model becomes too complex for the available data.
  - The optimal value is identified as 64, the setting used in the main experiments, indicating a good balance between representation power and the risks of computational cost and overfitting.

These hyper-parameter analyses guide the selection of appropriate settings for GRASP, balancing the trade-offs between information richness, noise, and model complexity.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a novel and flexible framework designed to enhance sequential recommendation models by integrating Large Language Model (LLM) world knowledge. The core innovation of GRASP lies in its dual approach:
- Generation Augmented Retrieval: This component enriches user and item representations by generating detailed semantic embeddings with LLMs and then retrieving and aggregating the top-k similar users/items to provide auxiliary contextual information.
- Holistic Attention Enhancement: This module employs a multi-level attention mechanism (self, similar, and global attention) to dynamically and adaptively fuse the LLM-derived semantic embeddings and the retrieved contextual information. A key design choice is the use of a Sigmoid function in the attention, which allows robust handling of diverse user preferences and mitigates noisy guidance from potential LLM hallucinations.

Crucially, GRASP addresses the significant challenge of LLM hallucinations by treating LLM-generated content as learnable input features rather than rigid supervisory signals. This adaptive fusion mechanism allows the model to down-weight or filter out unreliable LLM outputs, leading to enhanced robustness.
Comprehensive experiments on two public datasets (Amazon Beauty, Amazon Fashion) and one industrial dataset (Industry-100K) demonstrate GRASP's consistent superiority over state-of-the-art baselines, including other LLM-enhanced models like LLM-ESR. The framework shows particular strength in tail scenarios (data-scarce), where hallucination risks are highest, without compromising performance in head scenarios. Furthermore, GRASP's flexibility allows it to be integrated seamlessly with diverse SRS backbones (GRU4Rec, BERT4Rec, SASRec), consistently yielding performance boosts. An online A/B test confirmed its practical value in a real-world e-commerce setting, showing positive uplifts in CTR, order volume, and GMV.
7.2. Limitations & Future Work
The authors explicitly mention one key direction for future work:
- Combining with LLM-in-SRS-backbone methods: GRASP utilizes LLM world knowledge primarily as a front-end feature augmentation module, in contrast to other LLM-based recommendation works [39, 50] that use the pre-trained weights of an LLM inside the SRS backbone. The authors note that GRASP's technique is orthogonal to these methods and plan to explore combining them for potentially stronger performance. This implies that while GRASP enhances embeddings, it does not directly leverage the generative or reasoning capabilities of LLMs within the core sequential modeling process itself.
7.3. Personal Insights & Critique
GRASP presents a very compelling and practical solution to a critical problem in LLM-enhanced recommendation systems: hallucinations.
- Innovation in Robustness: The core innovation of distinguishing between using LLM outputs as learnable input features versus rigid supervisory signals is profound. It moves beyond simply incorporating LLMs to intelligently managing their inherent flaws. The theoretical analysis in Appendix A.3, clearly outlining the gradient differences between GRASP and LLM-ESR, provides strong justification for this design choice. The multi-level attention with Sigmoid further refines this robustness by allowing selective weighting and diverse preference capture.
- Flexibility and Real-World Applicability: The orthogonality of GRASP to existing SRS backbones is a significant advantage: GRASP can be adopted by practitioners with existing SRS deployments without a complete overhaul. The offline precomputation of LLM embeddings and retrieval also makes it highly practical for industrial deployment, minimizing online latency, as confirmed by the successful A/B test. This validation at massive scale (50 million DAU, 5% traffic allocation) lends immense credibility to the paper's claims.
- Addressing Data Sparsity: GRASP's strong performance in tail scenarios is a testament to its ability to alleviate data sparsity. By leveraging LLM world knowledge and similar user/item context, it can provide meaningful recommendations even for users or items with limited historical interactions, a common pain point in recommender systems.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Quality of Initial LLM Descriptions: While GRASP is robust to hallucinations in the guidance signal, its performance is still fundamentally tied to the initial quality of the LLM-generated descriptive texts and semantic embeddings. If the LLM consistently produces poor-quality or highly hallucinated descriptions for certain users or items, even GRASP's adaptive mechanisms might struggle to extract useful information. The paper mentions using Qwen2.5-7B-Instruct for the industrial dataset, but the impact of different LLMs and different prompt engineering strategies on the initial embedding quality (and thus GRASP's performance) could be explored further.
- Complexity of Prompt Engineering: Appendix A.1 shows detailed and comprehensive prompt templates, especially for Industry-100K with CoT (Chain-of-Thought). The effectiveness of GRASP relies heavily on well-designed prompts to elicit rich and relevant descriptions from LLMs. Designing and optimizing these prompts can be complex and time-consuming, and their sensitivity to slight changes is a known challenge.
- Scalability of Retrieval for Very Large Datasets: While the paper mentions offline pre-retrieval in small groups to mitigate nearest-neighbor search complexity, for truly massive and dynamic item/user catalogs, maintaining top-k similar neighbors efficiently and keeping them up to date can still be a non-trivial engineering challenge. The number of similar items/users (the k in Top@k) was tuned, but the inherent scalability limits of similarity search remain.
- Interpretability of Holistic Attention: While the multi-level attention is effective, understanding precisely which aspects of LLM knowledge (direct or similar) are being leveraged and how they contribute to a specific recommendation is less transparent than in simpler models. Further work on interpretable AI could help shed light on the learned attention weights.
- Implicit vs. Explicit Feedback: The paper focuses on sequential recommendation, which typically uses implicit feedback (purchases, clicks). The effectiveness of GRASP with explicit feedback (ratings), and how LLM knowledge might interact with explicit user preferences, could be another area of exploration.

Overall, GRASP makes a significant contribution by providing a practically robust method for integrating LLM world knowledge into sequential recommendation. Its careful design to sidestep the hallucination problem is a crucial step towards making LLM-enhanced recommenders more reliable and deployable in real-world, large-scale systems.
Similar papers
Recommended via semantic vector search.