
Enhancing Sequential Recommendation with World Knowledge from Large Language Models


TL;DR Summary

This paper introduces GRASP, a framework that overcomes limitations of traditional sequential recommendation systems by integrating generation-augmented retrieval and multi-level attention. It effectively leverages world knowledge despite LLM hallucinations, enhancing the modeling of users' dynamic interests.

Abstract

Sequential Recommendation System (SRS) has become pivotal in modern society, which predicts subsequent actions based on the user's historical behavior. However, traditional collaborative filtering-based sequential recommendation models often lead to suboptimal performance due to the limited information of their collaborative signals. With the rapid development of LLMs, an increasing number of works have incorporated LLMs' world knowledge into sequential recommendation. Although they achieve considerable gains, these approaches typically assume the correctness of LLM-generated results and remain susceptible to noise induced by LLM hallucinations. To overcome these limitations, we propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible framework that integrates generation augmented retrieval for descriptive synthesis and similarity retrieval, and holistic attention enhancement which employs multi-level attention to effectively employ LLM's world knowledge even with hallucinations and better capture users' dynamic interests. The retrieved similar users/items serve as auxiliary contextual information for the later holistic attention enhancement module, effectively mitigating the noisy guidance of supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset reveal that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones. The code is available at: https://anonymous.4open.science/r/GRASP-SRS.


1. Bibliographic Information

1.1. Title

The central topic of the paper is "Enhancing Sequential Recommendation with World Knowledge from Large Language Models".

1.2. Authors

The authors and their affiliations are:

  • Tianjie Dai, Shanghai Jiao Tong University, Shanghai, China
  • Xu Chen, Taobao & Tmall Group, Hangzhou, China
  • Yunmeng Shu, Taobao & Tmall Group, Hangzhou, China
  • Jinsong Lan, Taobao & Tmall Group, Beijing, China
  • Xiaoyong Zhu, Taobao & Tmall Group, Hangzhou, China
  • Jiangchao Yao, Shanghai Jiao Tong University, Shanghai, China
  • Bo Zheng, Taobao & Tmall Group, Hangzhou, China

1.3. Journal/Conference

This paper was published as a preprint on arXiv, with a listed publication date of 2025-11-25. As a preprint, it has not yet undergone formal peer review for a journal or conference, but arXiv is a widely respected platform for disseminating early-stage research in computer science and other fields. Given the 2025 publication year, it may be targeting a future major conference or journal in the field of recommendation systems or AI.

1.4. Publication Year

2025

1.5. Abstract

The abstract introduces Sequential Recommendation Systems (SRS) as crucial for predicting future user actions based on historical behavior. It highlights a limitation of traditional collaborative filtering-based SRS: suboptimal performance due to limited collaborative signals. The abstract notes the recent trend of incorporating Large Language Models (LLMs) for their world knowledge to enhance SRS. However, existing LLM-based approaches often assume the correctness of LLM-generated results and are vulnerable to noise induced by LLM hallucinations.

To address these issues, the paper proposes GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction). GRASP is presented as a flexible framework that combines two main components:

  1. Generation Augmented Retrieval: This involves using LLMs for descriptive synthesis and similarity retrieval.

  2. Holistic Attention Enhancement: This employs multi-level attention to effectively utilize LLM's world knowledge, even in the presence of hallucinations, and to better capture users' dynamic interests.

    The retrieved similar users/items serve as auxiliary contextual information for the holistic attention enhancement module, which helps mitigate noisy guidance associated with supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset demonstrate that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the suboptimal performance of Sequential Recommendation Systems (SRS) when relying solely on ID-based embeddings and collaborative filtering. Traditional SRS models, while effective in capturing sequential patterns, often suffer from limited collaborative signals, leading to a narrow view of user interests. For instance, recommending only similar tents after a tent purchase, rather than related camping equipment like sleeping bags or stoves, misses broader user intentions. This limitation prevents these models from capturing comprehensive user behavior patterns, especially for sparse or long-tail items and users.

The emergence of Large Language Models (LLMs) has introduced a new paradigm, allowing the integration of rich semantic and world knowledge into recommendation systems. However, existing approaches that incorporate LLMs often face two significant challenges:

  1. Assumption of Correctness: Many methods implicitly assume the LLM-generated results are accurate.

  2. Susceptibility to Hallucinations: LLMs are known to hallucinate, meaning they can generate factually incorrect or nonsensical information. This issue is particularly pronounced for users with short interaction histories (tail users), where hallucination rates can be high. Directly using such potentially noisy or incorrect semantic features as supervision signals can introduce noise, ultimately impairing model performance and reliability.

    The paper's entry point and innovative idea is to leverage the semantic understanding capabilities of LLMs while simultaneously building robustness against their hallucination tendencies. Instead of using LLM-generated content as direct supervision, GRASP proposes to integrate it as auxiliary contextual information, processed through a specialized attention mechanism, thereby mitigating the risks associated with noisy LLM outputs.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Novel Framework (GRASP): The authors propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible and novel framework for enhancing sequential recommendation systems. This framework is designed to be orthogonal to current SRS backbones (meaning it can be integrated with them), while specifically addressing the inaccuracies of attribute-based retrieval and the noisy guidance caused by LLM hallucinations.

  • Generation Augmented Retrieval: This component utilizes LLMs to generate detailed user profiles and item descriptions, which are then used to build semantic embedding databases. A nearest-neighbor retrieval strategy identifies top-k similar users or items, whose aggregated embeddings serve as rich auxiliary information. This approach enriches the sparse collaborative signals with semantic context.

  • Holistic Attention Enhancement: This module integrates the retrieved information as contextual input rather than direct supervision, a key differentiator for mitigating hallucination noise. It employs a multi-level attention mechanism including:

    • Initial user-item attention for core interaction patterns.
    • Attention between similar user/item groups for neighborhood context.
    • Concatenated attention for global interest modeling. The use of a Sigmoid function instead of Softmax in attention is highlighted for preserving diverse preferences.
  • Empirical Superiority: Extensive experiments on two public benchmarks (Amazon Beauty, Amazon Fashion) and one industrial dataset (Industry-100K) demonstrate that GRASP consistently achieves state-of-the-art performance. It shows significant improvements over existing LLM-enhanced sequential recommendation models (like LLM-ESR) and traditional SRS backbones (GRU4Rec, BERT4Rec, SASRec), particularly in tail scenarios (data-scarce), where hallucination risks are highest, without sacrificing performance in head scenarios.

  • Online Validation: An online A/B test on an industrial e-commerce platform confirmed GRASP's practical value, showing a 0.14 point absolute increase in CTR, 1.69% relative growth in order volume, and 1.71% uplift in GMV.

    These findings collectively highlight GRASP's ability to effectively integrate LLM world knowledge into sequential recommendation while robustly managing the inherent hallucination problem, making it a promising solution for real-world applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand GRASP, a reader should be familiar with the following concepts:

  • Sequential Recommendation System (SRS): An SRS is a type of recommender system that predicts the next item a user will interact with, based on their sequence of past interactions. Unlike traditional recommender systems that might only consider a user's overall preferences, SRS models explicitly capture the temporal order and dynamics of user behavior. For example, if a user buys a camera, then a lens, an SRS might predict they'll buy a camera bag next, recognizing the sequence.
  • Collaborative Filtering (CF): A widely used technique in recommender systems. CF operates on the principle that users who agreed in the past on some items will likely agree again in the future. It identifies patterns by analyzing either user-item interactions (e.g., ratings, purchases) to find similar users or similar items. ID-based collaborative filtering refers to methods where users and items are primarily represented by numerical identifiers (IDs), and their relationships are learned through interactions, often neglecting rich content information.
  • Large Language Models (LLMs): These are powerful artificial intelligence models, like GPT-3, GPT-4, or Qwen, that have been trained on vast amounts of text data. They can understand, generate, and process human language, allowing them to perform tasks such as summarization, translation, question answering, and even creative writing. In the context of recommendation, LLMs can generate rich semantic descriptions for items or users, providing world knowledge (e.g., knowing that a "sleeping bag" is related to "camping" even if a user hasn't explicitly interacted with camping items).
  • LLM Hallucinations: This refers to the phenomenon where LLMs generate content that is factually incorrect, nonsensical, or unfaithful to the input prompt, even though it may sound plausible or grammatically correct. In recommendation systems, an LLM might hallucinate a user's interest in a product category they've never shown interest in, or misrepresent an item's attributes. This can introduce noise and lead to unreliable recommendations.
  • Attention Mechanism: A core component in many modern neural networks, especially in Transformers. The attention mechanism allows a model to weigh the importance of different parts of the input sequence when processing a specific element. It calculates a context vector by taking a weighted sum of value vectors, where the weights are determined by a query vector and key vectors. The standard self-attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$: Query matrix (typically derived from the current element).
    • $K$: Key matrix (derived from all elements in the sequence).
    • $V$: Value matrix (derived from all elements in the sequence).
    • $QK^T$: Dot product that measures the similarity between the query and each key.
    • $\sqrt{d_k}$: Scaling factor that prevents large dot products from pushing the softmax into regions with tiny gradients; $d_k$ is the dimension of the keys.
    • $\mathrm{softmax}$: Normalizes the scores into a probability distribution, ensuring the weights sum to 1.
    • The output is a weighted sum of the values, focusing on the most relevant parts (a minimal sketch of this and the next two concepts follows this list).
  • Average Pooling: A simple aggregation technique where a set of numerical vectors is combined into a single vector by averaging their corresponding elements. For example, average pooling the two vectors [1, 2, 3] and [4, 5, 6] yields $[\frac{1+4}{2}, \frac{2+5}{2}, \frac{3+6}{2}] = [2.5, 3.5, 4.5]$. In GRASP, it is used to aggregate the embeddings of similar users or items.
  • Cosine Similarity: A measure of similarity between two non-zero vectors, given by the cosine of the angle between them. A cosine similarity of 1 means the vectors point in the same direction (perfectly similar), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (perfectly dissimilar). It is commonly used to find similar items or users in embedding spaces. The formula for two vectors $\mathbf{A}$ and $\mathbf{B}$ is: $ \text{cosine similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $ Where:
    • $\mathbf{A} \cdot \mathbf{B}$: The dot product of vectors $\mathbf{A}$ and $\mathbf{B}$.
    • $\|\mathbf{A}\|$: The Euclidean norm (magnitude) of vector $\mathbf{A}$.
    • $\|\mathbf{B}\|$: The Euclidean norm (magnitude) of vector $\mathbf{B}$.
  • Multi-Layer Perceptron (MLP): A fundamental type of feedforward neural network consisting of at least three layers: an input layer, one or more hidden layers, and an output layer. Each node (neuron) in one layer connects to every node in the next layer with a specific weight. MLPs are used for various tasks, including non-linear transformations and feature mapping.
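To make the attention, cosine similarity, and average pooling concepts above concrete, here is a minimal NumPy sketch. All shapes and variable names are illustrative assumptions, not code from the paper.

```python
# Minimal NumPy sketch of scaled dot-product attention, cosine similarity,
# and average pooling; shapes/names are illustrative assumptions.
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # weighted sum of values

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def average_pooling(vectors):
    return vectors.mean(axis=0)

Q = np.random.randn(1, 64)                 # one query
K = V = np.random.randn(5, 64)             # five keys/values
print(softmax_attention(Q, K, V).shape)    # (1, 64)
print(average_pooling(np.array([[1., 2., 3.], [4., 5., 6.]])))  # [2.5 3.5 4.5]
```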

3.2. Previous Works

The paper builds upon a rich history of sequential recommendation and LLM integration.

  • Traditional Sequential Recommendation Models:

    • GRU4Rec [16]: One of the pioneering works in SRS, GRU4Rec utilizes Gated Recurrent Units (GRUs) to capture sequential patterns. GRUs are a type of recurrent neural network (RNN) designed to process sequences of data by maintaining a hidden state that evolves over time, allowing them to remember past information.
    • SASRec [21]: SASRec (Self-Attentive Sequential Recommendation) introduced the self-attention mechanism (inspired by Transformers) to sequential recommendation. Unlike RNNs that process tokens sequentially, self-attention allows the model to capture dependencies between any two items in a sequence regardless of their distance, making it effective for modeling long-range dependencies and complex item relationships.
    • BERT4Rec [35]: BERT4Rec adapted the Masked Language Modeling (MLM) paradigm from BERT (Bidirectional Encoder Representations from Transformers) to sequential recommendation. It pre-trains a Transformer-based model by masking certain items in a user's historical sequence and predicting them, learning rich contextual representations for items. These models primarily rely on collaborative filtering and ID embeddings, which, as the paper notes, can be limited by sparse collaborative signals.
  • LLM for Sequential Recommendation (Two Paradigms): The paper categorizes existing LLM-based recommendation works into two main paradigms:

    1. LLMs as the entire system:

      • ProLLM4Rec [43]: This approach designs specific prompts to directly leverage the generative and reasoning capabilities of LLMs for recommendation tasks, essentially using the LLM as the recommender itself.
      • TALLRec [2]: This method creates an instruction-tuning dataset from user history and fine-tunes an LLM to directly output recommendations.
      • Limitations: While powerful, this paradigm is often computationally expensive during inference, making it challenging to meet the high concurrency and low latency demands of industrial-scale recommendation systems.
    2. Integrating LLM semantic understanding into existing systems:

      • HLLM [8]: HLLM designs separate item LLMs to extract rich content features from item descriptions and user LLMs to utilize these features for predicting future interests.
      • LLM-ESR [27]: This work is a primary point of comparison for GRASP. LLM-ESR uses LLMs to generate semantic features (hidden embeddings) to initialize ID embeddings for traditional sequential recommendation models. Crucially, it employs a dual network and a self-distillation loss function that leverages LLM embeddings to identify and supervise similar users. Specifically, it adds a regularization term to its loss function, $\mathcal{L} = \mathcal{L}_{\mathrm{main}} + \lambda \|\mathbf{e}_u - \mathbf{e}_c\|_2^2$, where $\mathbf{e}_u$ is the user embedding and $\mathbf{e}_c$ is the context embedding derived from the LLM. This term directly pulls the user embedding towards the LLM-generated context, treating it as a supervisory signal.
      • LRD [45]: LRD leverages LLMs to discover latent item relations through language representations and integrates them with a discrete state variational autoencoder to enhance relation-aware sequential recommendation.

3.3. Technological Evolution

The field of sequential recommendation has evolved significantly:

  • Early Models (e.g., Markov Chains): Focused on simple transitions between items.

  • RNN-based Models (e.g., GRU4Rec): Introduced recurrent neural networks to capture more complex sequential dependencies.

  • Attention-based/Transformer Models (e.g., SASRec, BERT4Rec): Revolutionized sequential modeling by enabling the capture of long-range dependencies and contextual item representations through self-attention. These models moved beyond strict sequential processing to allow global interaction within a sequence.

  • Hybrid Models (Content-aware): Began to incorporate item attributes (e.g., text, images) alongside ID embeddings to enrich representations, addressing cold-start and sparsity issues.

  • LLM-enhanced Models: The latest frontier involves integrating Large Language Models to inject world knowledge and semantic understanding. Initially, this involved using LLMs for feature extraction or embedding initialization, and later, using LLMs directly as recommenders or for complex reasoning tasks.

    GRASP fits into this evolution by pushing the boundaries of LLM-enhanced models. It recognizes the power of LLMs for semantic understanding but critically addresses a major practical hurdle: hallucinations. By designing a robust way to integrate LLM-derived information as auxiliary input rather than direct supervision, it refines the LLM-enhanced paradigm, making it more reliable for real-world deployment.

3.4. Differentiation Analysis

Compared to the main methods in related work, GRASP offers distinct innovations, especially against LLM-ESR:

  • Robustness to LLM Hallucinations: This is GRASP's most significant differentiator. Previous LLM-enhanced methods, particularly LLM-ESR, often treat LLM-generated embeddings or semantic features as reliable supervisory signals. As detailed in Appendix A.3, LLM-ESR uses a regularization term that directly pulls user embeddings towards the LLM context (e.g., $\lambda \|\mathbf{e}_u - \mathbf{e}_c\|_2^2$). If the LLM context is hallucinated or inaccurate, this direct supervision introduces noise and impairs performance. GRASP, in contrast, treats the LLM context as a learnable input feature. The holistic attention enhancement module fuses this context with the original embeddings through learnable weights (e.g., $\mathbf{h}_u = \mathbf{W}_e \mathbf{e}_u + \mathbf{W}_c \mathbf{e}_c + \mathbf{b}$). The model can therefore adaptively down-weight noisy or uninformative LLM context by learning smaller weights ($\mathbf{W}_c \to \mathbf{0}$) whenever it does not contribute to the main task loss. This gating mechanism provides inherent robustness against hallucinations (a short sketch contrasting the two integration styles follows this subsection).

  • Integration Strategy (Auxiliary Input vs. Supervision): GRASP integrates LLM knowledge as auxiliary contextual information through generation augmented retrieval and multi-level attention. This is a more flexible and less prescriptive approach than direct supervision. It enriches the representations without forcing them to conform to potentially erroneous LLM outputs.

  • Multi-Level Attention: GRASP employs a sophisticated multi-level attention mechanism (self, similar, global) to dynamically capture user interests and integrate contextual information from similar users/items. This holistic approach allows for a richer and more nuanced understanding of user preferences compared to simpler fusion strategies.

  • Flexible and Orthogonal Framework: GRASP is designed to be orthogonal to SRS backbones, meaning it can be easily integrated with various existing sequential recommendation models (e.g., GRU4Rec, BERT4Rec, SASRec) to enhance their performance, rather than replacing them entirely or requiring extensive modifications.

    In summary, while previous works recognized the value of LLMs, GRASP innovates by providing a more resilient and flexible framework to harness LLM world knowledge while explicitly tackling the hallucination problem through a distinct integration strategy and architectural design.
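To make the contrast above concrete, here is a hedged PyTorch sketch of the two integration styles: LLM-ESR-style direct supervision versus GRASP-style learnable fusion. The tensor names, shapes, and the placeholder main loss are illustrative assumptions, not the authors' implementation.

```python
# Hedged PyTorch sketch contrasting supervision vs. learnable fusion.
import torch
import torch.nn as nn

d = 64
e_u = torch.randn(8, d, requires_grad=True)   # trainable user embeddings
e_c = torch.randn(8, d)                       # frozen LLM-derived context

# LLM-ESR-style supervision: the regularizer pulls e_u toward e_c,
# so a hallucinated e_c directly distorts the learned embedding.
lam = 0.1
main_loss = torch.tensor(0.0)  # stands in for the backbone loss L_main
esr_loss = main_loss + lam * ((e_u - e_c) ** 2).sum(dim=1).mean()

# GRASP-style fusion: e_c enters only as an input through learnable
# weights (h_u = W_e e_u + W_c e_c + b); if e_c is uninformative,
# training can drive W_c toward zero instead of corrupting e_u.
W_e = nn.Linear(d, d, bias=False)
W_c = nn.Linear(d, d, bias=True)  # the bias plays the role of b
h_u = W_e(e_u) + W_c(e_c)
```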

4. Methodology

4.1. Principles

The core idea of GRASP is to enhance Sequential Recommendation Systems (SRS) by leveraging the rich world knowledge from Large Language Models (LLMs) in a robust manner. The theoretical basis and intuition behind it stem from two main observations:

  1. Limitations of ID-based CF: Traditional SRS models primarily use ID embeddings and collaborative filtering. These are inherently limited because they only capture patterns from observed interactions and lack external semantic knowledge. This leads to sparsity issues, cold-start problems, and an inability to infer broader user interests beyond directly similar items.

  2. LLM Power and Pitfalls: LLMs possess vast world knowledge and semantic understanding capabilities, which can significantly enrich user and item representations. However, LLMs are prone to hallucinations (generating incorrect information), especially for sparse data (e.g., users with short histories). Directly using LLM-generated features as supervision signals can introduce noise and degrade performance.

    GRASP's principle is to bridge this gap by:

  • Enriching Representations Semantically: Use LLMs to generate descriptive semantic embeddings for all users and items, moving beyond simple IDs.

  • Providing Contextual Auxiliary Information: Instead of direct supervision, LLM-derived embeddings are used to retrieve similar users and items. The aggregated embeddings from these neighbors provide auxiliary contextual information. This context is less prone to the "single point of failure" of a hallucinated direct supervision signal because it averages over multiple similar entities.

  • Adaptive Fusion with Multi-Level Attention: A specialized holistic attention mechanism is employed to adaptively fuse these LLM-derived semantic embeddings and retrieved contextual information with the core user-item interactions. This attention mechanism is designed to be robust to noise (e.g., from hallucinations) by allowing the model to learn to down-weight less reliable information, and to capture diverse, dynamic user preferences through multi-level attention and a Sigmoid activation (instead of Softmax).

    In essence, GRASP aims to harness the semantic richness of LLMs while sidestepping the fragility introduced by their hallucination tendencies, by treating LLM outputs as learnable context rather than infallible ground truth.

4.2. Core Methodology In-depth

4.2.1. Problem Formulation

The goal of a Sequential Recommendation System (SRS) is to predict the next item a user is most likely to interact with, given their historical sequence of interactions. Let $\mathcal{S}_u = \{i_1, i_2, \dots\}$ represent the interaction sequence of user $u$, where $i_j$ denotes the $j$-th item interacted with by the user. Here, $u \in \mathcal{U} = \{u_1, u_2, \dots, u_n\}$ is a user from the set of $n$ users, and $\mathcal{I} = \{i_1, i_2, \dots, i_m\}$ is the item set.

The task can be mathematically formulated as finding the item $i^*$ that maximizes the probability of being the next interaction: $ i^* = \underset{i_j \in \mathcal{I}}{\mathrm{argmax}} \ f(i_{|\mathcal{S}_u|+1} = i_j \mid \mathcal{S}_u) $ Where:

  • $i^*$: The predicted next item that the user is most likely to interact with.

  • $\mathcal{I}$: The set of all possible items.

  • $f$: The SRS model, which takes the user's historical sequence $\mathcal{S}_u$ as input.

  • $i_{|\mathcal{S}_u|+1}$: The item the user will interact with next, immediately following their current sequence $\mathcal{S}_u$.

  • $|\mathcal{S}_u|$: The length (number of items) of the user's historical interaction sequence.

    GRASP primarily focuses on enhancing the representations of items ($i$) and users ($u$) before they are fed into the SRS backbone ($f$), making it orthogonal to the specific choice of $f$ (see the sketch below).
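As a simple illustration of this formulation, the sketch below encodes the history with a generic sequence encoder, scores every candidate item, and returns the argmax. The `backbone` interface and all names are assumptions for illustration; the paper's actual backbones (e.g., SASRec) expose different APIs.

```python
# Illustrative sketch of the next-item prediction task.
import torch

def predict_next_item(backbone, item_embeddings, sequence):
    """sequence: LongTensor of item ids; item_embeddings: (m, d) table."""
    history = item_embeddings[sequence].unsqueeze(0)  # (1, |S_u|, d)
    o = backbone(history)                             # user state, (1, d)
    scores = o @ item_embeddings.T                    # score of every item j
    return scores.argmax(dim=-1)                      # i* = argmax_j
```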

4.2.2. Generation Augmented Retrieval

This component aims to enrich user and item embeddings by leveraging the semantic understanding capabilities of LLMs. It consists of two main steps: Generation and Retrieval.

4.2.2.1. Generation

First, the system creates prompt templates. These templates are designed to incorporate existing item attributes (like name, brand, price, features, descriptions) or user profiles (like birthplace, gender, age, occupation, spending power) and their historical behaviors. For example, a prompt for an item might be: "The beauty item has the following attributes: name is <TITLE>; brand is <BRAND>; price is <PRICE>. The item has the following features: <FEATURE>. The item has the following descriptions: <DESCRIPTION>." For a user, it might include their demographics and a summary of their visited items, asking the LLM to "conclude the user's preference." (Detailed templates are in Appendix A.1 of the paper.)

An LLM (e.g., OpenAI API for public datasets, Qwen2.5-7B-Instruct for industrial data) is then invoked to process these prompts. It interprets the provided information and generates detailed, descriptive texts for all items and users. After generating these descriptive texts, embeddings are extracted from them. For open-source datasets, embeddings are directly obtained from OpenAI API. For internal industrial data, a pre-trained text encoder (e.g., LLM2Vec) is used to convert the generated text into semantic embeddings. As a result, two semantic embedding databases are constructed:

  • $\mathbf{U} \in \mathbb{R}^{n \times d}$: A database of LLM-generated semantic embeddings for all $n$ users.
  • $\mathbf{I} \in \mathbb{R}^{m \times d}$: A database of LLM-generated semantic embeddings for all $m$ items. Here, $d$ is the semantic embedding dimension (a minimal sketch of this step follows).
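A minimal sketch of the generation step: fill the prompt template with item attributes, ask an LLM for a description, and embed the result. `call_llm` and `embed_text` are hypothetical stand-ins for the OpenAI API / LLM2Vec encoder, not real client calls.

```python
# Sketch of descriptive synthesis for one item; all names are assumptions.
ITEM_TEMPLATE = (
    "The beauty item has the following attributes: name is {title}; "
    "brand is {brand}; price is {price}. "
    "The item has the following features: {features}. "
    "The item has the following descriptions: {description}."
)

def item_semantic_embedding(item_attrs, call_llm, embed_text):
    prompt = ITEM_TEMPLATE.format(**item_attrs)
    description = call_llm(prompt)   # descriptive synthesis by the LLM
    return embed_text(description)   # one row of the database I in R^{m x d}
```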

4.2.2.2. Retrieval

To further enhance feature representation and address data sparsity, a nearest-neighbor retrieval strategy is employed. For each user or item, the system retrieves the top-k most similar users or items based on the cosine similarity between their LLM-generated semantic embeddings. These retrieved embeddings are then aggregated using average pooling.

Formally, for a given user $u$ with LLM embedding $\mathbf{u}$ and an item $i$ with LLM embedding $\mathbf{i}$, the retrieval process is expressed as: $ \bar{\mathbf{u}} = \mathrm{AvgPooling}(\{\mathbf{u}_i \mid \mathbf{u}_i \in \mathrm{Top@k}(\mathbf{u}) \setminus \{\mathbf{u}\}\}) \qquad \bar{\mathbf{i}} = \mathrm{AvgPooling}(\{\mathbf{i}_j \mid \mathbf{i}_j \in \mathrm{Top@k}(\mathbf{i}) \setminus \{\mathbf{i}\}\}) $ Where:

  • $\bar{\mathbf{u}}$: The averaged embedding representing the aggregated knowledge from the top-k users most similar to user $u$.

  • $\bar{\mathbf{i}}$: The averaged embedding representing the aggregated knowledge from the top-k items most similar to item $i$.

  • $\mathbf{u}$: The LLM embedding of the current user.

  • $\mathbf{i}$: The LLM embedding of the current item.

  • $\mathbf{u}_i$: The LLM embedding of a neighboring user.

  • $\mathbf{i}_j$: The LLM embedding of a neighboring item.

  • $\mathrm{Top@k}(\mathbf{u})$: The set of top-k user LLM embeddings most similar to $\mathbf{u}$ (by cosine similarity).

  • $\mathrm{Top@k}(\mathbf{i})$: The set of top-k item LLM embeddings most similar to $\mathbf{i}$.

  • $\setminus \{\mathbf{u}\}$ and $\setminus \{\mathbf{i}\}$: Exclude the user/item itself from its own set of similar entities, so that only distinct neighbors are retrieved.

  • $\mathrm{AvgPooling}$: The average pooling operation that sums the embeddings and divides by the number of elements.

    Through this process, each user and item is not only represented by its own LLM embedding but is also enriched with contextual information derived from its nearest neighbors. These retrieved and averaged embeddings ($\bar{\mathbf{u}}$ and $\bar{\mathbf{i}}$) are then frozen and cached for subsequent steps, meaning they are computed once offline and not updated during model training.
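The retrieval step can be sketched in a few lines of NumPy, assuming the full embedding database fits in memory; industrial deployments would use approximate nearest-neighbor search within smaller groups, as the paper notes later. Shapes and names are illustrative.

```python
# NumPy sketch of top-k neighbor retrieval plus average pooling.
import numpy as np

def retrieve_and_pool(E, k=10):
    """E: (n, d) LLM embedding database -> (n, d) pooled neighbor embeddings."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E_norm @ E_norm.T                 # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude the entity itself
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k nearest neighbors
    return E[topk].mean(axis=1)             # average pooling -> \bar{u} / \bar{i}

U = np.random.randn(1000, 1536)             # e.g., a user embedding database
U_bar = retrieve_and_pool(U, k=10)          # frozen and cached offline
```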

4.2.3. Holistic Attention Enhancement (HAE)

The Holistic Attention Enhancement module is designed to dynamically fuse the LLM-derived semantic embeddings and the retrieved similar neighbor embeddings to capture users' dynamic interests effectively.

The inputs to this module are:

  • LLM embedding of a specific user $u_i$: $\mathbf{u}_i \in \mathbb{R}^d$.
  • Averaged embedding of users similar to $u_i$: $\bar{\mathbf{u}}_i \in \mathbb{R}^d$.
  • LLM embedding of a specific item $i_j$: $\mathbf{i}_j \in \mathbb{R}^d$.
  • Averaged embedding of items similar to $i_j$: $\bar{\mathbf{i}}_j \in \mathbb{R}^d$.

4.2.3.1. Attention Mechanism Definition

The paper defines a specific attention mechanism $\mathcal{A}(\mathbf{q}, \mathbf{v})$. Unlike traditional softmax-based attention, which normalizes scores to sum to 1 (potentially leading to a single dominant focus), this version uses a Sigmoid function $\sigma$. This choice avoids the single-peak issue of softmax, allowing a representation that better reflects users' diverse preferences while retaining more of the raw interest patterns.

The attention mechanism is defined as: $ \mathcal{A}(\mathbf{q}, \mathbf{v}) = \sigma \left( \frac{\mathbf{qv}^T}{\sqrt{d}} \right) \mathbf{v} $ Where:

  • $\mathcal{A}$: The attention function.
  • $\mathbf{q}$: The query vector (typically derived from the user or current context).
  • $\mathbf{v}$: The value vector (representing an item or a set of items). In this formulation, $\mathbf{v}$ also implicitly serves as the key vector, since its transpose $\mathbf{v}^T$ is used for the dot product with $\mathbf{q}$.
  • $\mathbf{q}\mathbf{v}^T$: The dot product between the query and value (key) vectors, measuring their similarity.
  • $\sqrt{d}$: A scaling factor that normalizes the dot product, where $d$ is the embedding dimension.
  • $\sigma$: The Sigmoid activation function, which outputs a value between 0 and 1, allowing independent weighting of different interest components rather than the forced competition of softmax.
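A minimal PyTorch sketch of this sigmoid attention, under the assumption that $\mathbf{q}$ and $\mathbf{v}$ are single $d$-dimensional vectors as in the formulation above:

```python
# PyTorch sketch of A(q, v) = sigma(q v^T / sqrt(d)) v for single vectors.
import torch

def sigmoid_attention(q, v):
    """q, v: (d,) vectors; v doubles as the key."""
    d = q.shape[-1]
    gate = torch.sigmoid(q @ v / d ** 0.5)  # independent weight in (0, 1)
    return gate * v
```

Because each gate lies independently in (0, 1), multiple interest components can stay active simultaneously instead of competing for probability mass as under softmax.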

4.2.3.2. Multi-level Attention Operations

The holistic attention enhancement is computed through a series of attention operations at different levels, capturing various aspects of user-item interaction and context.

  1. Self-Attention (Core Interaction): This captures the core interaction pattern between the current user and the current item, using their direct LLM embeddings. $ \mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}} = \mathcal{A}(\mathbf{u}_i, \mathbf{i}_j) $ Here, the user's LLM embedding $\mathbf{u}_i$ acts as the query, and the item's LLM embedding $\mathbf{i}_j$ acts as both key and value. This component focuses on the direct semantic relationship.

  2. Similar-Attention (Neighborhood Context): This captures contextual information from the neighborhood of similar users and items, using their averaged embeddings. $ \mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}} = \mathcal{A}(\bar{\mathbf{u}}_i, \bar{\mathbf{i}}_j) $ Here, the averaged embedding of similar users $\bar{\mathbf{u}}_i$ acts as the query, and the averaged embedding of similar items $\bar{\mathbf{i}}_j$ acts as key and value. This component brings in broader collaborative-semantic context.

  3. Global-Attention (Holistic Context): To capture a more comprehensive global interest, the raw user embedding and the similar-user embedding are concatenated to form a global user query, and similarly for items to form a global item key/value. $ \mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}} = \mathcal{A}([\mathbf{u}_i \parallel \bar{\mathbf{u}}_i], [\mathbf{i}_j \parallel \bar{\mathbf{i}}_j]) $ Where:

    • $\parallel$: Denotes concatenation, joining two vectors end-to-end.
    • $[\mathbf{u}_i \parallel \bar{\mathbf{u}}_i]$: The concatenated query vector combining the specific user's LLM embedding and the aggregated embedding of their similar users.
    • $[\mathbf{i}_j \parallel \bar{\mathbf{i}}_j]$: The concatenated key/value vector combining the specific item's LLM embedding and the aggregated embedding of its similar items. This global attention captures interaction patterns from a broader, more enriched perspective.

These three attention-enhanced vectors ($\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}$) are then concatenated together and passed through a Multi-Layer Perceptron (MLP) that adjusts the dimension to fit the input size expected by the underlying SRS backbone $f$: $ \mathbf{i}_{j,\mathrm{all}} = \mathrm{MLP}\left( [\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}} \parallel \mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}} \parallel \mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}] \right) $ Where:

  • $\mathbf{i}_{j,\mathrm{all}}$: The final holistic attention-enhanced embedding for item $i_j$.

  • $\mathrm{MLP}$: A Multi-Layer Perceptron that transforms the concatenated vector into the appropriate dimension.

    This process ensures that the semantic information from LLMs is preserved and adaptively fused through multiple levels of attention, creating a rich and robust representation for each item, which then serves as the input to the chosen SRS backbone (a minimal PyTorch sketch of this module follows).
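The following is a minimal PyTorch sketch of the full HAE module under these equations. The MLP depth and hidden sizes are assumptions; only the three attention calls and the final concatenation follow the paper's description.

```python
# Minimal PyTorch sketch of Holistic Attention Enhancement (HAE).
import torch
import torch.nn as nn

class HAE(nn.Module):
    def __init__(self, d_llm, d_backbone):
        super().__init__()
        # self (d) + similar (d) + global (2d) -> 4d after concatenation
        self.mlp = nn.Sequential(
            nn.Linear(4 * d_llm, d_backbone), nn.ReLU(),
            nn.Linear(d_backbone, d_backbone),
        )

    @staticmethod
    def attn(q, v):  # sigmoid attention A(q, v)
        gate = torch.sigmoid((q * v).sum(-1, keepdim=True) / q.shape[-1] ** 0.5)
        return gate * v

    def forward(self, u, u_bar, i, i_bar):
        i_self = self.attn(u, i)                         # core interaction
        i_similar = self.attn(u_bar, i_bar)              # neighborhood context
        i_global = self.attn(torch.cat([u, u_bar], -1),  # holistic context
                             torch.cat([i, i_bar], -1))
        return self.mlp(torch.cat([i_self, i_similar, i_global], -1))

hae = HAE(d_llm=1536, d_backbone=64)
i_all = hae(*[torch.randn(1536) for _ in range(4)])  # -> (64,)
```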

4.2.4. Training and Deployment Complexity

GRASP is designed to be a flexible module that enhances existing SRS backbones rather than replacing them. This means it can be integrated on top of models like GRU4Rec, BERT4Rec, or SASRec.

The overall training objective uses the standard loss function of the chosen SRS backbone. For binary cross-entropy loss (common in recommendation for distinguishing positive from negative items), it is defined as: $ \mathcal{L} = - \frac{1}{|\mathcal{B}|} \sum_j \left[ y_j \log(\hat{y}_j) + (1 - y_j) \log(1 - \hat{y}_j) \right] $ And the predicted probability $\hat{y}_j$ for item $j$ is calculated as: $ \hat{y}_j = \sigma \left( \mathbf{o} \cdot \mathbf{i}_{j,\mathrm{all}} \right) $ Where:

  • $\mathcal{L}$: The total loss function minimized during training.

  • $|\mathcal{B}|$: The size of the candidate pool of items (including both positive and negative samples).

  • $y_j$: The ground-truth label for item $j$ (1 for a positive interaction, 0 otherwise).

  • $\hat{y}_j$: The predicted probability that item $j$ is the next interaction.

  • $\sigma$: The Sigmoid function, used here to convert the dot-product score into a probability.

  • $\mathbf{o}$: The user representation learned by the SRS backbone, capturing the user's dynamic interests from their historical sequence.

  • $\mathbf{i}_{j,\mathrm{all}}$: The holistic attention-enhanced embedding for item $j$, computed via Equation (5) in the Holistic Attention Enhancement section.

    For practical deployment, the LLM-generated embeddings ($\mathbf{U}$, $\mathbf{I}$, $\bar{\mathbf{U}}$, $\bar{\mathbf{I}}$) are precomputed offline daily to account for changes in user behavior. This ensures the online system does not incur the high computational cost of LLM inference. The main online overhead introduced by GRASP is the holistic attention module, which has a limited time complexity of $O(l^2 d)$ for fixed sequence length $l$ and latent dimension $d$ of the SRS backbone. To further optimize industrial deployment, similar users and items can also be pre-retrieved offline within smaller groups (e.g., within the same category) rather than across the entire universe of users and items, significantly alleviating the nearest-neighbor search complexity.
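A compact sketch of the objective above, assuming a single user state and a candidate pool of positive and negative items; `binary_cross_entropy_with_logits` fuses the sigmoid in $\hat{y}_j = \sigma(\mathbf{o} \cdot \mathbf{i}_{j,\mathrm{all}})$ with the BCE term for numerical stability.

```python
# Sketch of the BCE objective in Eq. (6); names/shapes are illustrative.
import torch
import torch.nn.functional as F

def grasp_loss(o, i_all, labels):
    """o: (d,) user state; i_all: (|B|, d) candidates; labels: (|B|,) 0/1."""
    logits = i_all @ o  # o . i_{j,all} for every candidate j
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```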

Algorithm 1 Pseudo code of GRASP

The overall process of GRASP can be summarized by the following pseudo-code:

Algorithm 1 Pseudo code of GRASP.
Require: Interaction sequence $\mathcal{S}_u$

1: Generate the LLM embedding databases $\mathbf{U}$, $\mathbf{I}$; retrieve similar users/items and generate $\bar{\mathbf{U}}$, $\bar{\mathbf{I}}$ by Eq. (2). (This step is performed offline.)

Training
2: Freeze $\mathbf{U}$, $\mathbf{I}$, $\bar{\mathbf{U}}$, and $\bar{\mathbf{I}}$. (These precomputed embeddings are static during training.)
3: for each iteration do
4:   Compute the fine-grained and global enhanced embeddings using Eq. (4). (This refers to $\mathbf{i}_{j,\mathrm{self}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}}$, $\mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}$.)
5:   Compute the input sequence embedding $\mathbf{i}_{\mathrm{all}}$ after holistic attention by Eq. (5).
6:   Calculate the loss function $\mathcal{L}$ using Eq. (6).
7:   Update model parameters. (Parameters of the SRS backbone and GRASP's MLP are updated.)
8: end for

Testing
9: for $u$ in $\mathcal{U}$ do
10:   Obtain the corresponding input embeddings from $\mathbf{U}$, $\mathbf{I}$, $\bar{\mathbf{U}}$, and $\bar{\mathbf{I}}$, and load the trained model parameters.
11:   Compute the scores of the items in the candidate set by Eq. (1) and return the ranked order.
12: end for

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three datasets: two publicly available benchmarks and one industrial dataset.

  • Amazon Beauty [28]: This dataset is sourced from Amazon and contains user reviews on beauty-related products.
    • Domain: E-commerce, beauty products.
  • Amazon Fashion [28]: Also from the Amazon collection, this dataset contains user reviews on fashion items.
    • Domain: E-commerce, fashion products.
  • Industry-100K: This is a subset collected from user purchase records on an internal e-commerce platform. It captures transactions from January 17, 2025, to February 23, 2025.
    • Domain: Industrial e-commerce, user purchase behavior.

    • Characteristics: Represents real-world e-commerce interactions.

      The following are the statistics from Table 1 of the original paper:

The following are the results from Table 1 of the original paper:

Dataset # User # Item # AVG Length Sparsity
Beauty 52204 57289 7.56 99.99%
Fashion 9049 4722 3.82 99.92%
Industry-100K 99711 1205282 20.88 99.99%
  • # User: The total number of unique users in the dataset.

  • # Item: The total number of unique items in the dataset.

  • # AVG Length: The average length of user interaction sequences.

  • Sparsity: The percentage of possible user-item interactions that are missing (i.e., not observed). A high sparsity (close to 100%) indicates very few interactions relative to the total possible interactions, which is common in recommendation systems and poses a challenge.

    Preprocessing: The datasets were preprocessed following established practices from SASRec [21] and LLM-ESR [27].

    Data Partitioning: A leave-one-out strategy was adopted for validation and testing: for each user, the last interaction is the test item, the second-to-last is used for validation, and the preceding items form the training sequence.

    Head/Tail Partitioning: Data was also partitioned into head and tail segments following the Pareto Principle. Head users/items are those with interaction frequencies in the top 20%, while tail users/items constitute the remaining 80% (a small sketch of this split follows the thresholds below).

  • Beauty: Head/tail user demarcation at 9 interactions; item demarcation at 4 interactions.

  • Fashion: Head/tail user threshold at 3 interactions; item threshold at 4 interactions.

  • Industry-100K: Head/tail user criterion at 29 interactions; item criterion at 2 interactions.

    These datasets are effective for validating the method's performance because they cover both widely used public benchmarks and a large-scale, real-world industrial setting. They also include varying levels of sparsity and average sequence lengths, allowing for a robust evaluation across different data characteristics. The head/tail analysis specifically targets the hallucination problem in sparse (tail) scenarios.
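Returning to the head/tail partitioning described above, a small sketch of a Pareto-style split is shown below; the quantile-based threshold is an assumption about how such demarcations can be derived from interaction counts.

```python
# Sketch of a Pareto-style head/tail split over interaction counts.
import numpy as np

def head_tail_split(counts):
    threshold = np.quantile(counts, 0.8)  # boundary of the top 20%
    head = counts >= threshold
    return head, ~head                    # boolean masks over users/items
```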

5.2. Evaluation Metrics

To comprehensively assess the performance of the models, Normalized Discounted Cumulative Gain (NDCG) and Hit Rate (HR) are utilized. Both metrics are reported at ranking positions $k \in \{1, 3, 5, 10, 20\}$.

  • Normalized Discounted Cumulative Gain (NDCG@k):

    • Conceptual Definition: NDCG measures the ranking quality of a recommendation list, taking into account the position of relevant items. It assigns higher scores to highly relevant items appearing at the top of the list and penalizes relevant items that are ranked lower. The "Normalized" part means it's scaled by the ideal DCG (the DCG of a perfect ranking), so scores range from 0 to 1.
    • Mathematical Formula: First, the Discounted Cumulative Gain (DCG) is calculated: $ DCG_k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then, the Ideal Discounted Cumulative Gain (IDCG) is the DCG of the list perfectly sorted by relevance: $ IDCG_k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i^{\mathrm{opt}}} - 1}{\log_2(i+1)} $ Finally, NDCG is obtained by normalizing DCG by IDCG: $ NDCG_k = \frac{DCG_k}{IDCG_k} $
    • Symbol Explanation:
      • $k$: The number of top items in the recommendation list being considered (e.g., 1, 3, 5, 10, 20).
      • $i$: The rank position of an item in the recommendation list.
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the actual recommendation list. In typical sequential recommendation with binary feedback, relevance is 1 for the ground-truth next item and 0 for all others.
      • $\mathrm{rel}_i^{\mathrm{opt}}$: The relevance score of the item at position $i$ in the ideal recommendation list (where all relevant items are ranked highest).
      • $\log_2(i+1)$: A logarithmic discount factor, so items at lower ranks contribute less to the total score.
  • Hit Rate (HR@k):

    • Conceptual Definition: Hit Rate measures whether the target (ground truth) item is present anywhere within the top kk recommended items. It indicates how often the recommender system successfully "hits" the correct item within its top-k predictions.
    • Mathematical Formula: $ HR@k = \frac{\text{Number of users for whom the target item is in top-k}}{\text{Total number of users}} $
    • Symbol Explanation:
      • $k$: The number of top items in the recommendation list being considered.

      • "Number of users for whom the target item is in top-k": This counts how many times the actual next item the user interacted with appeared within the top kk items recommended by the model.

      • "Total number of users": The total number of users for whom recommendations were generated.

        During evaluation, negative sampling was performed with a size of 100, meaning for each positive interaction, 100 negative items were randomly sampled to create the candidate pool for ranking.
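Putting the protocol together, the sketch below computes HR@k and NDCG@k for one user under the paper's sampled-candidate setup (one ground-truth item versus 100 negatives); with a single relevant item, NDCG@k reduces to $1/\log_2(\mathrm{rank}+1)$ when the item lands in the top k.

```python
# NumPy sketch of HR@k / NDCG@k under the sampled-candidate protocol.
import numpy as np

def hr_ndcg_at_k(score_pos, scores_neg, k=10):
    """score_pos: scalar score of the target; scores_neg: (100,) scores."""
    rank = 1 + int(np.sum(scores_neg >= score_pos))  # 1-based rank of target
    hit = float(rank <= k)                            # HR@k contribution
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0
    return hit, ndcg
```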

5.3. Baselines

The paper compares GRASP against a set of sequential recommendation models, categorized into traditional backbones and LLM-enhanced models.

  • Traditional Sequential Recommendation Backbones (GRASP integrates with these):

    • GRU4Rec [16]: A Recurrent Neural Network (RNN) based model using Gated Recurrent Units (GRUs) to capture sequential patterns.
    • BERT4Rec [35]: A Transformer-based model that adapts the Masked Language Modeling objective for sequential recommendation.
    • SASRec [21]: A Transformer-based model utilizing self-attention to capture long-range dependencies in user interaction sequences.
  • LLM-Enhanced Sequential Recommendation Models (Compared against):

    • RLMRec [34]: A recommendation model that uses representation learning with Large Language Models.

    • LLMInit [15, 17]: Methods that leverage LLMs for embedding initialization in sequential recommendation.

    • LLM-ESR [27]: A model that uses LLM hidden embeddings to identify and supervise similar users via a self-distillation loss to enhance sequential recommendation. This is a direct competitor model that GRASP aims to surpass, particularly in hallucination robustness.

      These baselines are representative because they cover the spectrum from foundational RNN-based models to advanced Transformer-based models, and recent LLM-enhanced approaches, providing a comprehensive comparison for GRASP's effectiveness and its unique contribution to the LLM4Rec field.

5.4. Implementation Details

  • Hardware: All experiments were conducted on a single NVIDIA A100 GPU.

  • Sequence Length: The maximum sequence length for user interaction histories was fixed at 100.

  • Hidden Embedding Dimension: The hidden embedding dimension for all methods (including the SRS backbones and GRASP's internal representations before the MLP) was set to 64.

  • Batch Size: Training batch size was 128.

  • Optimizer: The Adam optimizer was used for training.

  • Learning Rate: A fixed learning rate of 0.001 was applied.

  • Early Stopping: To prevent overfitting, early stopping was implemented. Training would cease if the NDCG@10 metric on the validation set did not improve for 20 consecutive epochs.

  • Robustness: To ensure the robustness and reliability of the reported results, experiments were run three times with different random seeds ($\{42, 43, 44\}$), and the average results are reported.

  • LLM Embeddings for Public Datasets: For the Amazon Beauty and Fashion datasets, OpenAI API was utilized to obtain LLM embeddings. The dimension of these embeddings was 1536.

  • LLM Embeddings for Industrial Dataset: Due to data confidentiality requirements for the Industry-100K dataset, an open-source LLM was used. Qwen2.5-7B-Instruct [44] was employed to generate descriptive texts. Subsequently, the pre-trained text encoder LLM2Vec [3] was used to convert these texts into semantic embeddings, resulting in a dimension of 4096.

    These details highlight a rigorous and consistent experimental setup, allowing for fair comparison across models and demonstrating the practicality of GRASP in both academic and industrial settings.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Overall Performance

The paper demonstrates that GRASP consistently achieves state-of-the-art performance across the three tested benchmarks, outperforming both traditional SRS baselines and other LLM-enhanced sequential recommendation models. The results are presented in Table 2.

The following are the results from Table 2 of the original paper:

Dataset Model N@1 N@3 N@5 N@10 N@20 H@1 H@3 H@5 H@10 H@20
Beauty GRU4Rec 11.48 16.88 19.23 22.42 25.62 11.48 20.83 26.56 36.45 49.17
- RLMRec 11.03 16.50 18.93 22.25 25.48 11.03 20.51 26.41 36.75 49.60
- LLMInit 14.33 20.53 23.05 26.29 29.37 14.33 25.08 31.20 41.24 53.46
- LLM-ESR 17.50 25.25 28.18 31.82 35.05 17.50 30.88 38.02 49.28 62.09
- GRASP 18.15 26.88 30.21 34.16 37.56 18.15 33.24 41.35 53.57 67.03
BERT4Rec 10.49 16.86 19.68 23.33 26.67 10.49 21.56 28.43 39.72 52.93
- RLMRec 10.45 16.82 19.66 23.20 26.72 10.45 21.50 28.42 39.43 53.39
- LLMInit 16.57 24.74 27.92 31.65 34.89 16.57 30.71 38.45 50.00 62.86
- LLM-ESR 21.66 29.58 32.37 35.76 38.59 21.66 35.27 42.05 52.55 63.77
- GRASP 23.61 33.62 37.19 41.01 44.12 23.61 40.89 49.56 61.47 73.62
SASRec 18.84 25.22 27.60 30.58 33.47 18.84 29.83 35.62 44.88 56.34
- RLMRec 17.93 24.16 26.56 29.64 32.50 17.93 28.70 34.56 44.10 55.48
- LLMInit 19.00 27.40 30.50 34.08 37.02 19.00 33.51 41.05 52.14 63.78
- LLM-ESR 20.73 29.73 33.10 36.99 40.26 20.73 36.27 44.49 56.50 69.44
- GRASP 26.56 36.18 39.33 42.76 45.61 26.56 43.09 50.74 61.33 72.62
Fashion GRU4Rec 32.71 38.31 42.37 48.58 55.07 32.71 38.31 42.37 48.58 55.07
- RLMRec 25.84 29.04 34.20 36.66 39.81 29.04 37.76 41.56 47.74 55.91
- LLMInit 33.31 37.48 38.71 40.29 42.21 33.31 40.32 43.31 48.26 55.89
- LLM-ESR 37.90 42.11 43.42 45.43 47.38 37.90 45.03 49.77 56.01 64.36
- GRASP 38.39 42.88 44.40 46.41 48.51 38.39 46.06 50.00 56.01 64.36
BERT4Rec 28.61 32.00 33.37 35.58 37.76 28.61 34.39 37.74 44.68 53.23
- RLMRec 26.95 31.92 33.40 35.41 37.36 26.95 35.33 38.95 45.16 52.91
- LLMInit 33.99 37.84 38.92 40.62 42.43 33.99 40.48 43.12 48.42 55.67
- LLM-ESR 37.70 42.37 43.75 45.43 47.19 37.70 45.70 49.04 54.26 61.26
- GRASP 37.11 42.38 43.97 46.22 48.36 37.11 46.09 50.00 57.01 65.46
SASRec 39.32 41.93 42.84 44.13 45.64 39.32 43.75 45.95 49.95 55.98
- RLMRec 39.94 41.96 42.72 43.92 45.32 39.94 43.40 45.26 48.98 54.55
- LLMInit 38.91 42.52 44.04 46.66 48.27 38.91 45.68 49.80 55.32 61.92
- LLM-ESR 39.93 43.92 45.29 47.15 49.17 39.93 46.79 50.13 55.92 64.02
- GRASP 42.16 46.92 48.50 50.50 52.57 42.16 50.30 54.15 60.31 68.57
Industry-100K GRU4Rec 4.78 6.68 7.78 9.58 11.18 4.78 8.07 10.76 16.36 22.70
- RLMRec 4.10 6.13 7.27 8.92 10.42 4.10 7.65 10.42 15.56 21.50
- LLMInit 4.55 8.64 10.98 14.57 18.66 4.55 11.74 17.44 28.60 44.89
- LLM-ESR 11.84 18.32 21.25 25.13 28.92 11.84 23.10 30.25 42.28 57.32
- GRASP 13.02 21.04 24.23 28.73 32.73 13.02 26.24 35.47 48.99 64.15
BERT4Rec 4.78 6.68 7.78 9.58 11.18 4.78 8.07 10.76 16.36 22.70
- RLMRec 4.10 6.13 7.27 8.92 10.42 4.10 7.65 10.42 15.56 21.50
- LLMInit 4.55 8.64 10.98 14.57 18.66 4.55 11.74 17.44 28.60 44.89
- LLM-ESR 11.84 18.32 21.25 25.13 28.92 11.84 23.10 30.25 42.28 57.32
- GRASP 13.21 21.04 24.30 28.66 32.73 13.21 26.09 35.35 48.87 64.15
  • Beauty Dataset: GRASP with SASRec as backbone achieves the highest NDCG@k and HR@k scores. For NDCG@10, GRASP reaches 42.76, significantly surpassing the best LLM-ESR score of 36.99 (paired with SASRec). This represents an average improvement of 4.56% over the previous best-performing model (LLM-ESR) across all metrics for SASRec. GRASP also boosts GRU4Rec and BERT4Rec performance significantly compared to their base and other LLM-enhanced variants. For instance, GRASP with BERT4Rec achieves NDCG@10 of 41.01, compared to LLM-ESR's 35.76.

  • Fashion Dataset: Similar trends are observed. GRASP with SASRec again yields the highest NDCG@10 at 50.50, outperforming LLM-ESR's 47.15, a 1.81% improvement over LLM-ESR.

  • Industry-100K Dataset: On this large-scale industrial dataset, GRASP demonstrates an even more substantial gain. For NDCG@10, GRASP with SASRec achieves 28.66, compared to LLM-ESR's 25.13. This represents a remarkable 6.68% overall improvement.

    The consistent performance boosts across GRU4Rec, BERT4Rec, and SASRec backbones highlight GRASP's flexibility and transferability. It effectively enhances diverse SRS architectures, affirming its general applicability and robustness.

6.1.2. Performance Under Different Groups

The paper further analyzes GRASP's effectiveness in tail scenarios (data-scarce users or items), where LLM hallucinations are more prevalent. The results, presented in Table 3, compare GRASP against LLM-ESR and traditional SRS baselines.

The following are the results from Table 3 of the original paper:

Dataset Model Tail User Tail Item Head User Head Item
N@5 H@5 N@10 H@10 N@5 H@5 N@10 H@10 N@5 H@5 N@10 H@10 N@5 H@5 N@10 H@10
Beauty GRU4Rec 18.51 25.68 21.73 35.67 5.11 6.28 5.52 7.58 22.53 30.58 25.56 40.00 22.60 31.39 26.45 43.33
- LLM-ESR 27.58 37.34 31.26 48.76 6.72 10.33 8.61 16.23 30.96 41.10 34.35 51.64 33.30 44.62 37.35 57.16
- GRASP 29.60 40.57 33.64 53.04 15.88 23.92 19.65 35.63 34.68 46.91 38.60 59.05 34.00 45.95 38.07 58.53
BERT4Rec 18.90 27.34 22.51 38.54 0.05 0.12 0.26 0.76 23.24 33.40 27.04 45.11 24.36 35.18 28.83 49.01
- LLM-ESR 31.56 41.04 34.97 51.59 7.05 9.17 8.22 12.83 36.06 46.67 39.38 56.94 38.41 49.89 42.33 62.03
- GRASP 36.44 48.57 40.26 60.39 14.62 22.83 18.44 34.74 40.59 54.07 44.44 65.98 42.57 55.92 46.46 67.76
SASRec 26.83 34.52 29.82 43.78 5.89 6.90 6.52 8.89 31.08 40.63 34.07 49.88 32.77 42.46 36.32 53.46
- LLM-ESR 32.31 43.51 36.20 55.56 7.44 12.58 10.29 21.47 36.74 48.96 40.56 60.79 39.23 52.10 43.35 64.85
- GRASP 38.83 49.93 42.20 60.36 23.03 31.70 26.34 41.82 41.63 54.49 45.29 65.79 43.23 55.28 46.67 65.98
Fashion GRU4Rec 22.16 31.00 24.69 38.79 0.36 0.70 0.80 2.12 50.86 57.11 52.20 61.28 48.31 58.96 50.95 67.08
- LLM-ESR 32.73 38.58 35.08 45.82 2.42 3.90 3.65 7.76 57.28 60.77 58.85 65.60 59.74 65.89 62.06 73.01
- GRASP 34.00 40.37 36.37 47.74 5.89 9.73 8.23 17.31 57.96 61.97 59.43 66.72 59.77 65.71 61.60 71.41
BERT4Rec 19.82 24.56 22.54 33.11 0.82 1.20 1.20 3.49 50.94 54.84 52.50 59.69 46.32 52.28 49.27 61.51
- LLM-ESR 32.73 39.28 34.67 45.24 1.61 2.82 2.57 5.87 58.03 61.71 59.39 65.95 60.52 67.45 62.50 73.52
- GRASP 33.26 40.66 35.97 48.98 3.30 5.97 5.64 13.31 57.86 62.13 59.57 67.43 60.16 67.53 62.38 74.40
SASRec - LLM-ESR 32.35 35.60 33.82 40.18 1.68 2.39 2.13 3.78 56.45 59.38 57.49 62.62 59.22 63.29 60.85 68.32
- GRASP 35.02 40.55 37.31 47.67 3.28 5.33 4.96 10.58 58.61 62.57 59.90 66.61 62.01 67.97 63.94 73.97
Industry-100K GRU4Rec 7.89 10.96 9.65 16.41 0.48 0.82 0.71 1.53 7.69 10.66 9.37 15.91 15.86 21.85 19.25 32.38
- LLM-ESR 20.78 29.68 24.67 41.74 20.97 29.97 24.81 41.87 23.21 32.67 27.04 44.53 21.55 30.57 25.47 42.73
- GRASP 24.09 34.94 28.46 48.47 24.30 35.35 28.66 48.87 26.41 37.70 30.75 51.14 24.80 35.61 29.17 49.12
BERT4Rec 14.16 18.17 15.67 22.86 5.49 5.64 5.50 5.64 17.96 21.60 19.30 25.76 25.10 33.17 28.18 42.73
- LLM-ESR 26.64 35.85 30.38 51.09 26.09 37.07 30.38 50.34 25.58 36.30 30.00 49.99 26.81 37.56 31.30 51.47
- GRASP 26.64 37.54 31.02 51.09 26.09 37.07 30.38 50.34 25.58 36.30 30.00 49.99 26.81 37.56 31.30 51.47
SASRec - LLM-ESR 39.53 46.25 41.85 53.44 12.28 17.79 15.07 26.46 60.14 64.39 61.77 69.24 62.92 68.63 64.59 73.78
- GRASP 39.53 46.25 41.85 53.44 12.28 17.79 15.07 26.46 60.14 64.39 61.77 69.24 62.92 68.63 64.59 73.78
  • Tail Scenarios (Tail User / Tail Item): In tail scenarios, where interaction data is sparse and LLM hallucination risks are highest, GRASP consistently and significantly outperforms LLM-ESR and SRS baselines.

    • On the Fashion dataset, GRASP surpasses LLM-ESR by an average of 5.00% across NDCG@k and HR@k for tail users. For tail items, the improvement is even more dramatic, for example, GRASP with SASRec achieves NDCG@10 of 4.96 for tail items compared to LLM-ESR's 2.13.
    • The Beauty dataset shows the most substantial enhancement, with GRASP demonstrating a remarkable 9.99% average increase over LLM-ESR for tail users and particularly strong gains for tail items (e.g., SASRec+GRASP reaches a tail-item NDCG@10 of 26.34 vs. SASRec+LLM-ESR's 10.29).
    • On the real-world Industry-100K dataset, GRASP achieves an impressive 8.42% improvement over LLM-ESR or SRS baselines in tail scenarios.
  • Head Scenarios (Head User / Head Item): GRASP maintains strong performance in head scenarios (abundant interaction data, less hallucination risk).

    • It improves over LLM-ESR by 0.57% on Fashion, 4.30% on Beauty, and 6.41% on Industry-100K. This indicates that the gains in long-tail scenarios do not come at the expense of performance for well-represented users and items.

      This balanced performance underscores GRASP's ability to mitigate hallucination effects in data-scarce situations while preserving high recommendation accuracy where data is rich. This robustness is a direct consequence of GRASP's design, which treats LLM-derived information as contextual input rather than rigid supervision.

6.1.3. Case Study: Robustness to Hallucinations

To further illustrate GRASP's robustness, the paper presents a case study in Figure 3, showing examples from Industry-100K where LLMs produced hallucinated descriptions.

The following figure (Figure 3 from the original paper) shows the purchase cases and their corresponding LLM hallucinatory responses:


VLM Description: The image is a diagram illustrating two purchase cases and their corresponding LLM hallucinated responses. Case 1 has a purchase sequence length of 2, while Case 2 has a length of 3; both show the user-next-item matching scores of GRASP and LLM-ESR.

The figure demonstrates two purchase cases:

  • Case 1: A user's sequence length is 2. The LLM generates a description that includes hallucinated information (e.g., "user needs a new phone case and wants to match the phone color").

  • Case 2: A user's sequence length is 3. Similarly, the LLM generates hallucinated content (e.g., "user is planning a party").

    For each case, the ground-truth next-item is provided, along with user interaction scores (obtained by applying a sigmoid to the dot product of the user and item embeddings; a minimal sketch of this scoring rule follows) for both GRASP and LLM-ESR. The crucial observation is that GRASP consistently assigns higher interaction scores to the ground-truth next-item than LLM-ESR does, even when hallucinated descriptions are present, indicating that GRASP is better aligned with actual user expectations. Hallucinated descriptions can introduce noise that misguides LLM-ESR, which relies on them as direct supervision. In contrast, GRASP's holistic attention mechanism, designed to adaptively process LLM-derived context, appears to filter out or down-weight the harmful effects of hallucinations, leading to more accurate predictions. This qualitatively supports GRASP's claim of enhanced robustness against LLM hallucinations.
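
The scoring rule used in this comparison is simple enough to sketch. Below is a minimal, self-contained Python example that scores a user-item pair as the sigmoid of the embedding dot product, as described above; the embeddings, scales, and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def interaction_score(user_emb: np.ndarray, item_emb: np.ndarray) -> float:
    """Sigmoid of the user-item embedding dot product, as in the case study."""
    logit = float(np.dot(user_emb, item_emb))
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative embeddings (d = 64, matching the paper's hidden dimension).
rng = np.random.default_rng(0)
user = rng.normal(scale=0.1, size=64)
next_item = user + rng.normal(scale=0.02, size=64)   # close to the user's interests
random_item = rng.normal(scale=0.1, size=64)         # unrelated candidate

print(interaction_score(user, next_item))    # noticeably higher
print(interaction_score(user, random_item))  # hovers near 0.5
```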

6.2. Ablation Study / Parameter Analysis

6.2.1. Ablation Study on Each Component

To demonstrate the effectiveness of each component within GRASP, an ablation study was conducted. This involves systematically removing or modifying parts of the Holistic Attention Enhancement (HAE) module and observing the impact on performance. The results are presented in Table 4, using SASRec as the backbone on the Beauty dataset.

The following are the results from Table 4 of the original paper:

| Module | Setting | N@1 | N@3 | N@5 | N@10 | N@20 | H@1 | H@3 | H@5 | H@10 | H@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HAE | w/o Attention | 18.24 | 26.63 | 29.83 | 33.41 | 36.70 | 18.24 | 32.71 | 40.48 | 51.59 | 64.63 |
| | w/o HAE similar | 18.74 | 27.16 | 30.48 | 34.35 | 37.75 | 18.74 | 33.29 | 41.36 | 53.35 | 66.83 |
| | w/o HAE global | 20.00 | 29.29 | 32.72 | 36.62 | 39.87 | 20.00 | 36.02 | 44.36 | 56.44 | 69.31 |
| | Softmax | 14.59 | 22.60 | 25.92 | 30.00 | 33.63 | 14.59 | 28.46 | 36.54 | 49.17 | 63.55 |
| GRASP | (full model) | 26.56 | 36.18 | 39.33 | 42.76 | 45.61 | 26.56 | 43.09 | 50.74 | 61.33 | 72.62 |

Here, "GRASP" refers to the full model, typically SASRec+GRASPSASRec+GRASP.

  • w/o Attention: This setting likely removes the attention mechanism entirely from the HAE module, possibly replacing weighted fusion with simple concatenation or element-wise operations. Performance (e.g., NDCG@10 of 33.41) drops significantly compared to the full GRASP (42.76), highlighting that the attention mechanism is crucial for effectively integrating the various semantic signals.

  • w/o HAE similar: This removes the similar-attention component (i.e., $\mathbf{i}_{j,\mathrm{similar}}^{\mathrm{HAE}}$) that incorporates information from similar users/items. The NDCG@10 drops to 34.35, demonstrating the importance of explicitly modeling neighborhood context to enrich representations.

  • w/o HAE global: This removes the global-attention component (i.e., $\mathbf{i}_{j,\mathrm{global}}^{\mathrm{HAE}}$) that concatenates original and similar embeddings for a broader interaction. The NDCG@10 drops to 36.62, showing that a holistic view of both direct and contextual user-item signals is beneficial.

  • Softmax: This replaces the Sigmoid function used in GRASP's attention mechanism with the traditional Softmax function. Performance drops sharply (NDCG@10 of 30.00), even below the results of removing entire attention sub-components. This result is particularly significant: it validates the design choice of Sigmoid over Softmax. Sigmoid allows a multi-peak representation, preserving diverse preferences and weighting interest patterns independently, whereas Softmax (forcing weights to sum to 1) can suppress all but one dominant aspect (see the numerical sketch after this list).

    The substantial degradation in performance when any part of the Holistic Attention Enhancement is removed or when Sigmoid is replaced by Softmax confirms the efficacy of GRASP's multi-level attention design and its fine-grained user-item integration approach.
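
To make the Sigmoid-versus-Softmax distinction concrete, here is a minimal numerical sketch (the logit values are illustrative, not taken from the paper). With two strong interest signals present at once, softmax forces them to compete for a fixed probability mass, while sigmoid weights each one independently:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative attention logits over three signals (e.g., self / similar / global),
# where two interest patterns are strong at the same time (a "multi-peak" user).
logits = np.array([2.0, 1.9, -1.0])

print(softmax(logits))  # ~[0.51, 0.46, 0.03]: weights compete and must sum to 1
print(sigmoid(logits))  # ~[0.88, 0.87, 0.27]: both strong signals kept near 1
```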

6.2.2. Impacts of Hyper-parameters

The paper also analyzes the impact of two key hyper-parameters:

  • $N$: The size of the candidate pool for retrieving similar users/items. This corresponds to the $k$ in Top@$k$ retrieval.

  • $d$: The hidden embedding dimension for the SRS backbone.

    The following figure (Figure 4 from the original paper) shows the hyperparameter analysis of GRASP based on SASRec on the Beauty dataset.

    Figure 4: Analysis of hyper-parameters on the Beauty dataset of GRASP based on SASRec. Left: $N$ is the size of the candidate pool for similar retrieval. Right: $d$ is the hidden dimension for SRS.

VLM Description: The image is a chart illustrating the hyperparameter analysis of GRASP based on SASRec on the Beauty dataset. The two graphs on the left show the relationship between the candidate pool size $N$ and NDCG@10 and HR@10; the two graphs on the right display the relationship between the hidden dimension $d$ and NDCG@10 and HR@10. Data points in each graph are marked with different colors and symbols to demonstrate the impact of various hyperparameters on model performance.

  • Impact of Candidate Pool Size $N$: The left graphs in Figure 4 show the effect of $N$ on NDCG@10 and HR@10 (a minimal retrieval sketch follows this analysis).

    • When $N$ is too small, the model fails to capture sufficient similar patterns from neighbors, so very small candidate pools yield lower performance.
    • As $N$ increases, performance generally improves up to a certain point.
    • However, excessively large values of $N$ pull in less relevant neighbors, introducing noise and irrelevant information and causing performance to plateau or dip slightly.
    • Performance peaks at around $N = 10$, the optimal value.
  • Impact of Hidden Dimension $d$: The right graphs in Figure 4 illustrate the relationship between the hidden embedding dimension $d$ and NDCG@10 and HR@10.

    • Insufficient dimensionality (small $d$) cannot adequately represent complex user-item relationships and semantic information, resulting in lower performance.

    • As $d$ increases, the model's capacity to learn richer representations grows, leading to performance improvements.

    • However, excessive dimensionality yields diminishing returns and risks overfitting once the model becomes too complex for the available data.

    • The optimal value of $d$ is 64, the setting used in the main experiments, balancing representational power against computational cost and overfitting risk.

      These hyper-parameter analyses guide the selection of appropriate settings for GRASP, balancing information richness, noise, and model complexity.
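
As a complement to the analysis of $N$, the following is a minimal numpy sketch of Top@$N$ neighbor retrieval over precomputed embeddings by cosine similarity. The function and variable names are illustrative; the paper's actual pipeline precomputes retrieval offline in small groups, which this sketch does not reproduce.

```python
import numpy as np

def top_n_similar(query: np.ndarray, pool: np.ndarray, n: int = 10) -> np.ndarray:
    """Return indices of the n most cosine-similar rows of `pool` to `query`."""
    q = query / np.linalg.norm(query)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    scores = p @ q                  # cosine similarities against every candidate
    return np.argsort(-scores)[:n]  # sort descending, keep the top n

# Illustrative pool of LLM-derived user embeddings (1000 users, d = 64).
rng = np.random.default_rng(0)
user_embs = rng.normal(size=(1000, 64))
query_user = rng.normal(size=64)

neighbors = top_n_similar(query_user, user_embs, n=10)  # N = 10, the reported optimum
print(neighbors)
```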

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a novel and flexible framework designed to enhance sequential recommendation models by integrating Large Language Model (LLM) world knowledge. The core innovation of GRASP lies in its dual approach:

  1. Generation Augmented Retrieval: This component enriches user and item representations by generating detailed semantic embeddings using LLMs and then retrieving and aggregating top-k similar users/items to provide auxiliary contextual information.

  2. Holistic Attention Enhancement: This module employs a multi-level attention mechanism (self, similar, and global attention) to dynamically and adaptively fuse the LLM-derived semantic embeddings and the retrieved contextual information. A key design choice is the use of a Sigmoid function in attention, which allows for robust handling of diverse user preferences and mitigates noisy guidance from potential LLM hallucinations.

    Crucially, GRASP addresses the significant challenge of LLM hallucinations by treating LLM-generated content as learnable input features rather than rigid supervisory signals. This adaptive fusion mechanism allows the model to down-weight or filter out unreliable LLM outputs, leading to enhanced robustness.
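
As a rough illustration of this idea only (not the paper's exact HAE equations), here is a hypothetical sigmoid-gated fusion of an entity's own LLM-derived embedding with its retrieved-neighbor context; all names and the specific gating form are simplifying assumptions:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(self_emb: np.ndarray, similar_emb: np.ndarray,
               w_self: np.ndarray, w_sim: np.ndarray) -> np.ndarray:
    """Hypothetical sigmoid-gated fusion of an entity's own embedding with its
    retrieved-neighbor context. Each gate is computed independently, so a noisy
    (e.g., hallucination-tainted) signal can be down-weighted on its own rather
    than competing for a fixed probability mass as under softmax."""
    g_self = sigmoid(self_emb @ w_self)   # scalar gate for the entity's own signal
    g_sim = sigmoid(similar_emb @ w_sim)  # scalar gate for the neighbor context
    return g_self * self_emb + g_sim * similar_emb

# Illustrative usage with learned gate vectors stubbed as small random weights.
rng = np.random.default_rng(0)
d = 64
fused = gated_fuse(rng.normal(size=d), rng.normal(size=d),
                   0.1 * rng.normal(size=d), 0.1 * rng.normal(size=d))
```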

Comprehensive experiments on two public datasets (Amazon Beauty, Amazon Fashion) and one industrial dataset (Industry-100K) demonstrate GRASP's consistent superiority over state-of-the-art baselines, including other LLM-enhanced models like LLM-ESR. The framework shows particular strength in tail scenarios (data-scarce), where hallucination risks are highest, without compromising performance in head scenarios. Furthermore, GRASP's flexibility allows it to be integrated seamlessly with diverse SRS backbones (GRU4Rec, BERT4Rec, SASRec), consistently yielding performance boosts. An online A/B test confirmed its practical value in a real-world e-commerce setting, showing positive uplifts in CTR, order volume, and GMV.

7.2. Limitations & Future Work

The authors explicitly mention one key direction for future work:

  • Combining with LLM-in-SRS-backbone methods: GRASP uses the LLM's world knowledge primarily as a front-end feature-augmentation module, in contrast to other LLM-based recommendation works [39, 50] that embed the LLM's pre-trained weights directly in the SRS backbone. The authors note that GRASP's technique is orthogonal to these methods and plan to explore combining them for potentially stronger performance. This implies that while GRASP enhances embeddings, it does not directly leverage the generative or reasoning capabilities of LLMs within the core sequential modeling process itself.

7.3. Personal Insights & Critique

GRASP presents a very compelling and practical solution to a critical problem in LLM-enhanced recommendation systems: hallucinations.

  • Innovation in Robustness: The core innovation of distinguishing between using LLM outputs as learnable input features versus rigid supervisory signals is profound. It moves beyond simply incorporating LLMs to intelligently managing their inherent flaws. The theoretical analysis in Appendix A.3, clearly outlining the gradient differences between GRASP and LLM-ESR, provides strong justification for this design choice. The multi-level attention with Sigmoid further refines this robustness by allowing selective weighting and diverse preference capture.
  • Flexibility and Real-World Applicability: The orthogonality of GRASP to existing SRS backbones is a significant advantage. This means GRASP can be adopted by practitioners with existing SRS deployments without requiring a complete overhaul. The offline precomputation of LLM embeddings and retrieval also makes it highly practical for industrial deployment, minimizing online latency, as confirmed by the successful A/B test. This validation on a massive scale (50 million DAU, 5% traffic allocation) lends immense credibility to the paper's claims.
  • Addressing Data Sparsity: GRASP's strong performance in tail scenarios is a testament to its ability to alleviate data sparsity. By leveraging LLM world knowledge and similar user/item context, it can provide meaningful recommendations even for users or items with limited historical interactions, a common pain point in recommender systems.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Quality of Initial LLM Descriptions: While GRASP is robust to hallucinations in the guidance signal, its performance is still fundamentally tied to the initial quality of LLM-generated descriptive texts and semantic embeddings. If the LLM consistently produces poor quality or highly hallucinated basic descriptions for certain users or items, even GRASP's adaptive mechanisms might struggle to extract useful information. The paper mentions using Qwen2.5-7B-Instruct for the industrial dataset, but the impact of different LLMs and different prompt engineering strategies on the initial embedding quality (and thus GRASP's performance) could be explored further.

  • Complexity of Prompt Engineering: Appendix A.1 shows detailed and comprehensive prompt templates, especially for Industry-100K with CoT (Chain-of-Thought). The effectiveness of GRASP relies heavily on well-designed prompts to elicit rich and relevant descriptions from LLMs. The process of designing and optimizing these prompts can be complex and time-consuming, and their sensitivity to slight changes is a known challenge.

  • Scalability of Retrieval for Very Large Datasets: While the paper mentions offline pre-retrieval in small groups to mitigate nearest-neighbor search complexity, for truly massive and dynamic item/user catalogs, efficiently maintaining and refreshing top-$k$ similar neighbors remains a non-trivial engineering challenge. The number of similar items/users $N$ (the $k$ in Top@$k$) was tuned, but the inherent scalability limits of similarity search remain.

  • Interpretability of Holistic Attention: While the multi-level attention is effective, understanding precisely which aspects of LLM knowledge (direct or similar) are being leveraged and how they contribute to a specific recommendation can be less transparent compared to simpler models. Further work on interpretable AI could help shed light on the learned attention weights.

  • Implicit vs. Explicit Feedback: The paper focuses on sequential recommendation which typically uses implicit feedback (purchases, clicks). The effectiveness of GRASP with explicit feedback (ratings) and how LLM knowledge might interact with explicit user preferences could be another area of exploration.

    Overall, GRASP makes a significant contribution by providing a practically robust method for integrating LLM world knowledge into sequential recommendation. Its careful design to sidestep the hallucination problem is a crucial step towards making LLM-enhanced recommenders more reliable and deployable in real-world, large-scale systems.
