Generative Sequential Recommendation with GPTRec
TL;DR Summary
GPTRec, a GPT-2-based generative sequential recommender, uses SVD tokenization to handle large vocabularies and a Next-K strategy to optimize recommendations, achieving SASRec-level quality on MovieLens-1M with 40% smaller embeddings.
Abstract
Sequential recommendation is an important recommendation task that aims to predict the next item in a sequence. Recently, adaptations of language models, particularly Transformer-based models such as SASRec and BERT4Rec, have achieved state-of-the-art results in sequential recommendation. In these models, item ids replace tokens in the original language models. However, this approach has limitations. First, the vocabulary of item ids may be many times larger than in language models. Second, the classical Top-K recommendation approach used by these models may not be optimal for complex recommendation objectives, including auxiliary objectives such as diversity, coverage or coherence. Recent progress in generative language models inspires us to revisit generative approaches to address these challenges. This paper presents the GPTRec sequential recommendation model, which is based on the GPT-2 architecture. GPTRec can address large vocabulary issues by splitting item ids into sub-id tokens using a novel SVD Tokenisation algorithm based on quantised item embeddings from an SVD decomposition of the user-item interaction matrix. The paper also presents a novel Next-K recommendation strategy, which generates recommendations item-by-item, considering already recommended items. The Next-K strategy can be used for producing complex interdependent recommendation lists. We experiment with GPTRec on the MovieLens-1M dataset and show that using sub-item tokenisation GPTRec can match the quality of SASRec while reducing the embedding table by 40%. We also show that the recommendations generated by GPTRec on MovieLens-1M using the Next-K recommendation strategy match the quality of SASRec in terms of NDCG@10, meaning that the model can serve as a strong starting point for future research.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Generative Sequential Recommendation with GPTRec
1.2. Authors
- Aleksandr V. Petrov: Affiliated with the University of Glasgow, United Kingdom.
- Craig Macdonald: Affiliated with the University of Glasgow, United Kingdom.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference proceeding itself, arXiv is a highly reputable platform for disseminating research rapidly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published here often precede formal publication in top-tier venues.
1.4. Publication Year
2023
1.5. Abstract
The paper introduces GPTRec, a generative sequential recommendation model based on the GPT-2 architecture. It addresses two key limitations of existing Transformer-based sequential recommenders like SASRec and BERT4Rec: the large vocabulary of item IDs and the inflexibility of the classical Top-K recommendation approach for complex objectives (e.g., diversity, coverage). GPTRec tackles the large vocabulary issue with a novel SVD Tokenisation algorithm that splits item IDs into sub-ID tokens, leveraging quantised item embeddings from an SVD decomposition of the user-item interaction matrix. For recommendation generation, it proposes a Next-K strategy, which generates recommendations item-by-item, considering previously recommended items, thus enabling complex interdependent recommendation lists. Experiments on the MovieLens-1M dataset show that SVD Tokenisation allows GPTRec to match SASRec's quality while reducing the embedding table size by 40%. The Next-K strategy also matches SASRec's NDCG@10 quality, indicating GPTRec's potential as a foundation for future research in generative sequential recommendation.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2306.11114v1
- PDF Link: https://arxiv.org/pdf/2306.11114v1.pdf
- Publication Status: This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses lies within sequential recommendation systems, which aim to predict the next item a user might interact with based on their past sequence of interactions. While Transformer-based models (like SASRec and BERT4Rec) have achieved state-of-the-art results by adapting language models (LMs) to recommendation tasks (replacing words with item IDs), they face significant limitations:
- Large item ID vocabulary: The number of unique items in a recommendation catalogue can be orders of magnitude larger than the vocabulary of a typical language model. This leads to massive embedding tables for item IDs, consuming substantial GPU memory and making training infeasible for very large catalogues (e.g., more than 1M items).
- Inflexible Top-K recommendation: Traditional Top-K strategies, where items are scored independently and the top-scoring ones are selected, are suboptimal for complex recommendation objectives. Such objectives may include diversity (avoiding redundant recommendations of similar items), coverage (ensuring a broad range of items is considered), or coherence (recommending complementary items). Independent scoring often produces recommendation lists dominated by highly similar items, failing to capture a user's varied interests or provide a well-rounded experience.

The paper is motivated by the rapid progress in generative language models (such as GPT and T5), which have shown great flexibility and power across text generation tasks. This inspires the authors to revisit generative approaches for sequential recommendation, believing they can inherently address the limitations of current discriminative, score-and-rank methods.
The paper's entry point is to adapt a generative Transformer architecture (GPT-2) to sequential recommendation, introducing novel mechanisms (SVD Tokenisation and the Next-K strategy) to overcome the identified challenges.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of sequential recommendation:
- Introduction of GPTRec: The paper proposes GPTRec, a generative sequential recommendation model built upon the GPT-2 architecture. Unlike previous Transformer-based recommenders, GPTRec is inherently generative, opening avenues for more flexible recommendation strategies.
- Novel SVD-based sub-item tokenisation: To tackle the issue of large embedding tables, the paper introduces SVD Tokenisation. This algorithm decomposes item IDs into multiple sub-item tokens by quantising item embeddings derived from an SVD of the user-item interaction matrix. It significantly reduces the memory footprint (e.g., a 40% reduction for MovieLens-1M compared to SASRec while matching its quality), because the model stores a small vocabulary of sub-item token embeddings rather than one embedding per item.
- Novel Next-K recommendation strategy: As an alternative to the traditional Top-K approach, the paper proposes the Next-K strategy. This method generates recommendations item by item, with each subsequent recommendation conditioned on the items already generated in the list. This allows the creation of complex interdependent recommendation lists, addressing objectives such as diversity, complementarity, and coherence that are difficult to optimise with independent scoring.
- Empirical validation: Experiments on the MovieLens-1M dataset demonstrate the viability of GPTRec and its novel components:
  - GPTRec in one-token-per-item mode with Top-K generation performs comparably to BERT4Rec and outperforms SASRec (NDCG@10 of 0.146 vs. SASRec's 0.108), suggesting the benefit of Cross-Entropy loss over Binary Cross-Entropy for this task.
  - With SVD Tokenisation (multi-token-per-item mode), GPTRec can match SASRec's quality (NDCG@10 of 0.108 for both) while substantially reducing memory usage (a 40% smaller embedding table).
  - The Next-K strategy, despite not being specifically tuned for this generation mode, still achieves recommendation quality (NDCG@10) similar to SASRec, establishing it as a strong starting point for future research, particularly with advanced tuning techniques such as reinforcement learning.

These findings address the memory inefficiency of large item catalogues and the inflexibility of optimising for complex recommendation objectives, paving the way for more powerful and adaptable recommender systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand GPTRec and its contributions, a foundational understanding of several key concepts is essential:
- Sequential recommendation systems: These are a class of recommender systems that leverage the order of user-item interactions. Unlike traditional collaborative filtering or content-based systems that treat interactions independently, sequential recommenders recognise that user preferences evolve over time and that the next item a user interacts with often depends on their most recent interactions. The primary task is typically next-item prediction: predicting the single item a user will engage with next based on their interaction history. For example, if a user watches movies A, B, and C in that order, a sequential recommender tries to predict movie D.
- Language models (LMs): Originally from Natural Language Processing (NLP), LMs learn to predict the probability of a sequence of words. They model the distribution of natural language by predicting the next word in a sequence given the preceding words (e.g., "The cat sat on the _").
  - Autoregressive language modelling (LM): The model predicts the next token in a sequence given all previous tokens. Models like GPT are trained with this objective.
  - Masked language modelling (MLM): A percentage of tokens in the input sequence is masked, and the model is trained to predict the original masked tokens from their context (both left and right). BERT is a prominent example trained with MLM.
- Transformer architecture: Introduced by Vaswani et al. (2017) in "Attention Is All You Need," the Transformer is a neural network architecture that has revolutionised NLP. It relies entirely on self-attention mechanisms to draw global dependencies between input and output.
  - Encoder-decoder structure: The original Transformer consists of an encoder stack and a decoder stack.
    - The encoder processes the input sequence and produces a sequence of contextualised representations.
    - The decoder then takes these encoder outputs, along with its own previous outputs, to generate the output sequence.
  - Self-attention mechanism: This is the core component of Transformers. It allows the model to weigh the importance of different parts of the input sequence when processing each element: for each token, self-attention computes a weighted sum over all tokens, where the weights are learned from the relevance between tokens. The basic formula for Scaled Dot-Product Attention is:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, representing different projections of the input.
    - $QK^T$ computes the dot products between queries and keys, measuring similarity.
    - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the keys; it prevents the dot products from becoming too large, which can push the softmax into regions with tiny gradients.
    - The softmax normalises the scores into attention weights.
    - The output is a weighted sum of the Value vectors.
- GPT-2 architecture: GPT-2 (Generative Pre-trained Transformer 2) is a large Transformer-based language model developed by OpenAI. It is an autoregressive model that uses only the decoder part of the Transformer architecture and is trained on a massive text corpus to predict the next word in a sequence. Its key characteristic is the ability to generate coherent, contextually relevant text. GPTRec reuses this generative nature and architectural backbone.
- Singular Value Decomposition (SVD): SVD is a matrix factorisation technique from linear algebra, often used for dimensionality reduction and for finding latent factors. For a matrix $M$ (e.g., a user-item interaction matrix), SVD decomposes it into three matrices:
  $ \boldsymbol{M} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^T $
  Or, in the context of truncated SVD for item embeddings as used in the paper:
  $ \boldsymbol{M} \approx \boldsymbol{U} \times \boldsymbol{\Sigma} \times \boldsymbol{E}^T $
  Where:
  - $M$ is the original user-item interaction matrix.
  - $U$ is a matrix containing left singular vectors (often interpreted as user embeddings).
  - $\Sigma$ is a diagonal matrix containing singular values, ordered from largest to smallest. In truncated SVD, only the largest singular values and their corresponding vectors are kept.
  - $V$ (or $E$ in the paper's notation) is a matrix containing right singular vectors (often interpreted as item embeddings).
  SVD can capture underlying patterns and reduce noise by representing items (or users) in a lower-dimensional latent space.
- Quantisation: In the context of SVD Tokenisation, quantisation maps a continuous range of values (e.g., an item embedding dimension) to a finite, discrete set of values. For example, if an embedding dimension ranges from 0 to 1, quantising it into 4 bins maps values in [0, 0.25) to 0, values in [0.25, 0.5) to 1, and so on. This discretisation is crucial for creating tokens from continuous embeddings.
- Evaluation metrics (NDCG@K, Recall@K; an illustrative code sketch follows this list):
  - Recall@K:
    - Conceptual definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top K recommendations, i.e., how many of the truly relevant items for a user appear in the recommended list.
    - Mathematical formula: $ \mathrm{Recall@K} = \frac{|\text{relevant items} \cap \text{recommended items at K}|}{|\text{relevant items}|} $
    - Symbol explanation:
      - $|\text{relevant items} \cap \text{recommended items at K}|$ is the number of relevant items found within the top K recommendations.
      - $|\text{relevant items}|$ is the total number of relevant items for the given user.
  - NDCG@K (Normalised Discounted Cumulative Gain at K):
    - Conceptual definition: NDCG@K measures ranking quality. It accounts for both the relevance of recommended items and their positions: highly relevant items appearing higher in the list contribute more to the score. It is "normalised" because the score is divided by the ideal DCG (achieved if all relevant items were perfectly ordered), making scores comparable across users.
    - Mathematical formula:
      $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $
      $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i^{ideal}} - 1}{\log_2(i+1)} $
      $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    - Symbol explanation:
      - $K$: the number of top recommendations considered.
      - $rel_i$: the relevance of the item at position $i$ in the recommended list (often 1 if relevant and 0 otherwise, for binary relevance).
      - $\log_2(i+1)$: a discounting factor, so items at lower ranks (larger $i$) contribute less to the score.
      - $\mathrm{DCG@K}$: Discounted Cumulative Gain at K.
      - $rel_i^{ideal}$: the relevance of the item at position $i$ in the ideal (perfectly sorted by relevance) list.
      - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at K, the maximum possible DCG for the given set of relevant items.
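To make these metric definitions concrete, here is a minimal Python sketch assuming binary relevance (as in the leave-one-out evaluation used later in the paper); it is an illustration, not the evaluation code used by the authors:

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K with binary relevance: rel = 1 if the item is relevant, else 0."""
    relevant = set(relevant)
    # (2^1 - 1) / log2(rank + 1) for each hit, with 1-based rank = i + 1
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single held-out item (id 42) is ranked 3rd in the list.
print(recall_at_k([5, 9, 42, 7], relevant=[42], k=10))  # 1.0
print(ndcg_at_k([5, 9, 42, 7], relevant=[42], k=10))    # 1 / log2(4) = 0.5
```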
3.2. Previous Works
The paper builds upon and distinguishes itself from several lines of prior research:
- Transformer-based sequential recommenders:
  - SASRec [13] (Self-Attentive Sequential Recommendation) adapts the Transformer decoder architecture for sequential recommendation. It is trained with a Language Modelling (LM) objective, predicting the next item in a sequence given the preceding ones, uses Binary Cross-Entropy loss, and generates recommendations with a Top-K strategy.
  - BERT4Rec [26] (Sequential Recommendation with Bidirectional Encoder Representations from Transformer) adapts the Transformer encoder architecture. It is trained with a Masked Language Modelling (MLM) objective, predicting masked items in a sequence from bidirectional context, and uses Cross-Entropy loss with a Top-K strategy.
  - Other models such as DuoRec [19], CBiT [7], and ALBERT4Rec [18] also primarily use Transformer encoders or decoders as their main component, often with additional training tasks or architectural tweaks, but generally adhere to the Top-K recommendation strategy and a one-token-per-item representation.
- Recommendations as text generation / generative IR:
  - Recent works such as P5 [8] and M6-Rec [5] leverage large pre-trained language models (such as T5 and GPT-3) for recommendation by treating item IDs or titles as text strings to be generated. These models often rely on pre-trained knowledge and side information (e.g., item descriptions, user reviews) to represent items semantically. TIGER [23] introduces semantic IDs, where items are represented by tokens derived from their side information.
  - The broader field of generative IR has identified similar challenges around large document-ID vocabularies and the need for sub-document-ID tokenisation ([15]; Learning to Tokenize for Generative Retrieval [27]).
  - Differentiation: GPTRec distinguishes itself from these generative approaches by focusing on the classic sequential recommendation setting where only interaction sequences are available, without relying on external side information or large pre-trained LMs. This allows a clearer study of the generative approach's properties in a more fundamental context.
- Sub-word tokenisation in NLP: The idea behind GPTRec's SVD Tokenisation, splitting atomic units into sub-units to manage vocabulary size, is inspired by sub-word tokenisation in NLP, such as WordPiece encoding [32] (used by BERT) and Byte-Pair Encoding [25] (used by the GPT family). These methods represent large word vocabularies with a smaller, fixed vocabulary of sub-word units, enabling the handling of rare words and reducing embedding table size.
- Diversity and interdependent recommendations: Methods such as MMR [3] (Maximal Marginal Relevance) and multi-aspect methods [2] re-rank Top-K lists to improve diversity. However, these are typically post-processing steps that add complexity and computational cost. GPTRec's Next-K strategy aims to address interdependent objectives more intrinsically during generation.
3.3. Technological Evolution
The field of sequential recommendation has evolved significantly, mirroring advancements in deep learning and NLP:
- Early neural models (RNNs/GRUs): Starting with models like GRU4Rec [11] (2016), sequential recommenders began leveraging Recurrent Neural Networks (RNNs) and their variants (GRUs) to model sequential patterns, inspired by their success in machine translation. These models capture dependencies over time but can struggle with long sequences.
- Transformer-based models (2018-present): The Transformer architecture [30] (2017) revolutionised NLP by replacing recurrence with attention mechanisms, allowing parallel processing and better handling of long-range dependencies. It was quickly adopted for sequential recommendation: SASRec [13] (2018) was among the first to adapt the Transformer decoder for next-item prediction, and BERT4Rec [26] (2019) followed, adapting the Transformer encoder with masked language modelling. These models rapidly became state-of-the-art thanks to their ability to capture complex sequential patterns, but they inherited the large embedding table problem and typically employ the Top-K score-and-rank approach.
- Generative models and pre-trained language models (2020-present): The success of very large generative language models such as GPT-3 [1] and T5 [22] on text generation tasks sparked interest in applying them to recommendation. This led to research such as P5 [8], which treats recommendation as a text generation problem, often relying on pre-trained models and side information.
- GPTRec's position: GPTRec sits at the intersection of Transformer-based sequential recommendation and the emerging trend of generative models. It extends the Transformer-decoder paradigm (like SASRec) but embraces a fully generative approach (like GPT-2) to address fundamental limitations. It focuses on the memory-footprint and objective-flexibility issues using novel techniques (SVD Tokenisation, the Next-K strategy) in a side-information-agnostic setting, providing a foundational study for generative sequential recommendation.
3.4. Differentiation Analysis
Compared to the main methods in related work, GPTRec introduces several core innovations:
- Generative approach vs. discriminative score-and-rank:
  - SASRec and BERT4Rec are primarily discriminative: they learn to score individual items given a user's history, and the top-scoring items are then selected (Top-K). While powerful for next-item prediction, they score items independently and are less flexible for optimising list-level objectives such as diversity or coherence.
  - GPTRec adopts a truly generative approach, predicting items sequentially. This enables the Next-K strategy, in which each subsequent item can be conditioned on previously recommended items, allowing interdependent recommendation lists to be optimised natively. This is a fundamental shift from ranking to generation.
- Memory efficiency for large item vocabularies:
  - SASRec and BERT4Rec both use a one-token-per-item approach, so every unique item ID requires its own embedding. The resulting embedding tables become a GPU-memory bottleneck as catalogues grow to millions or tens of millions of items.
  - GPTRec introduces SVD Tokenisation, which splits item IDs into multiple sub-item tokens. This dramatically reduces the required embedding table, because the model only stores embeddings for the much smaller vocabulary of sub-item tokens rather than for every item, which is crucial for scaling to very large catalogues.
- Flexibility in recommendation objectives:
  - SASRec and BERT4Rec are optimised for single-item prediction (e.g., maximising NDCG or Recall of the next relevant item). Adapting them for diversity, serendipity, or complementarity typically requires post-processing re-ranking, which adds computational overhead and is hard to train end-to-end.
  - GPTRec's Next-K strategy, by generating items sequentially and conditionally, provides a native mechanism for incorporating complex, interdependent objectives into the generation process. Although not fully explored in this paper, it lays the groundwork for using reinforcement learning to tune the model for specific list-level qualities.
- Loss function:
  - SASRec uses Binary Cross-Entropy loss, which the paper shows empirically to underperform Cross-Entropy.
  - BERT4Rec and GPTRec both use Cross-Entropy (softmax) loss, which GPTRec demonstrates is more effective for next-item prediction.
- Reliance on side information:
  - P5, M6-Rec, and TIGER often rely heavily on external pre-trained language models and side information (item descriptions, user reviews) to give items semantic meaning and aid generation.
  - GPTRec focuses on the traditional sequential recommendation setting where only interaction sequences are used, making it a more fundamental study of the generative approach without confounding factors from side information.

In essence, GPTRec offers a more memory-efficient and flexible generative framework for sequential recommendation, directly addressing the scaling and objective-complexity limitations inherent in many existing Transformer-based models.
4. Methodology
4.1. Principles
The core idea behind GPTRec is to re-frame sequential recommendation as a generative task, similar to language modeling, rather than a discriminative score-and-rank task. By leveraging the GPT-2 architecture, GPTRec aims to predict the entire sequence of future recommendations item-by-item. This generative approach enables two key innovations:
- Memory efficiency: By treating item IDs not as atomic units but as compositions of sub-item tokens, the model can handle very large item catalogues without an explosion in embedding table size.
- Flexible recommendation objectives: The item-by-item generation process (the Next-K strategy) allows the model to take already recommended items into account when generating subsequent ones, opening the door to optimising complex, interdependent objectives such as diversity, coherence, or serendipity, which are difficult to achieve with traditional Top-K methods.

The theoretical basis draws heavily on the success of Transformer-based language models in handling sequences and generating coherent outputs. The intuition is that if GPT-2 can generate coherent text from sub-word tokens, a similar mechanism can generate coherent item sequences from sub-item tokens.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Architecture
GPTRec is designed for generative sequential recommendation and is built upon the GPT-2 architecture [21]. The GPT-2 architecture, in turn, is based on the decoder part of the original Transformer model [30]. This means GPTRec is an autoregressive model, meaning it generates sequences one item (or sub-item token) at a time, conditioning each prediction on all preceding ones.
The specific architectural modifications GPT-2 makes compared to the original Transformer decoder are:
- Moving layer normalisation to the beginning of each Transformer block (pre-normalisation).
- Using a modified initialisation with scaled residual weights.
- Employing learnable positional embeddings instead of the original sine-based positional encodings; these allow the model to learn position information about the items in the sequence (an illustrative sketch of such a block follows this list).
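As a rough illustration of these modifications (not the authors' implementation), a pre-normalisation decoder block with learnable positional embeddings can be sketched in PyTorch as follows; the layer sizes and head count are placeholders, and the scaled residual-weight initialisation is omitted for brevity:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """GPT-2-style decoder block: LayerNorm is applied *before* attention and the MLP."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        seq_len = x.size(1)
        # causal mask: position i may only attend to positions <= i
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                                      # pre-normalisation
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)   # causal self-attention
        x = x + attn_out                                     # residual connection
        return x + self.mlp(self.ln2(x))                     # pre-norm MLP + residual

# Learnable (rather than sinusoidal) positional embeddings, as in GPT-2:
positions = nn.Embedding(num_embeddings=100, embedding_dim=256)
```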
4.2.2. Tokenisation
Existing Transformer-based sequential recommendation models, such as SASRec [13] and BERT4Rec [26], typically use item IDs directly as tokens. This is referred to as the one-token-per-item mode. However, as discussed in the introduction, for large item catalogues, this results in an extremely large embedding table that is computationally expensive and difficult to fit into GPU memory.
Inspired by language models that split whole words into sub-word tokens (e.g., WordPiece for BERT, BytePair for GPT models), GPTRec introduces a multi-token-per-item mode. In this mode, each item is represented by t sub-item tokens, where each of these tokens can take one of v alternative values.
The paper proposes a novel SVD-based Tokenisation algorithm to decompose items into these sub-item tokens. The algorithm consists of three main steps, illustrated in Figure 1:
Step 1: SVD Decomposition for Item Embeddings
The algorithm first constructs a user-item interaction matrix $M$. This matrix typically records whether a user has interacted with an item (e.g., 1 for an interaction, 0 for no interaction, or a rating). A truncated SVD decomposition with t latent components is then performed on this matrix. The purpose of truncated SVD here is to obtain lower-dimensional item embeddings that capture the items' underlying characteristics and relationships based on user interactions.
The decomposition is given by:
$ \boldsymbol{M} \approx \boldsymbol{U} \times \boldsymbol{\Sigma} \times \boldsymbol{E}^T $
Where:
- $M$: the user-item interaction matrix, with one row per user and one column per item.
- $U$: a matrix of user embeddings, with one row per user; each row is a user's embedding in a t-dimensional latent space.
- $\Sigma$: a diagonal t × t matrix whose diagonal contains the t largest singular values, which represent the strength of each latent component.
- $E$: a matrix of item embeddings, with one row per item; its transpose $E^T$ appears in the formula.

Both user and item embeddings ($U$ and $E$) have t latent components, effectively providing a t-dimensional embedding for each item.
Step 2: Normalization, Noise Addition, and Quantization of Item Embeddings
After obtaining the item embeddings $E$ (from Step 1), each dimension of these embeddings is processed independently:
- Normalisation: Each dimension (column) of $E$ is normalised so that its values fall within the [0, 1] interval. This is a common preprocessing step to ensure consistent scaling.
- Noise addition: A small amount of Gaussian noise is added to each dimension. This is crucial to prevent quantisation from mapping two distinct items (which might have identical or very close embeddings due to sparse interaction data) to exactly the same sub-item token representation. In the experiments, Gaussian noise with zero mean and a small standard deviation is used.
- Quantisation: Each dimension of the (normalised, noisy) item embeddings is then quantised into v discrete values: the continuous range [0, 1] of each dimension is divided into v bins, and each value is assigned the integer index of its bin. These discrete values form the sub-item tokens.
Step 3: Offsetting Quantised Values
Finally, to ensure that each dimension's quantised values are distinct in the overall token vocabulary, an offset is applied: the quantised values of the i-th dimension are offset by (i-1)·v. This means:
- The first dimension's tokens lie in the range [0, v).
- The second dimension's tokens lie in the range [v, 2v).
- The t-th dimension's tokens lie in the range [(t-1)·v, t·v).
This ensures that the token vocabulary of each dimension is unique and non-overlapping. The final representation of an item is the sequence of its t sub-item tokens.
The SVD Tokenisation algorithm is summarized in Algorithm 3:
    Algorithm 3 SVD Tokenisation Algorithm
    Require: M is the user-item interaction matrix
    Require: t is the number of tokens per item
    Require: v is the number of alternatives per token
    Ensure: Sub-item token representation of items
     1: procedure SVDTOKENISE(M, t, v)
     2:     Compute item embeddings E with truncated SVD on M (t latent components)
     3:     for i = 1 to t do
     4:         Normalise E_i to range [0, 1]
     5:         Add small Gaussian noise to E_i
     6:         Quantise E_i into v bins
     7:         Offset quantised values by (i - 1) * v
     8:     end for
     9:     Concatenate quantised components for sub-item tokens
    10:     return Sub-item token representation
    11: end procedure
Explanation of Algorithm 3:
- Lines 1-2: The procedure takes the user-item interaction matrix M, the desired number of sub-item tokens per item t, and the number of alternatives (bins) per token v. It computes the item embeddings E by performing truncated SVD on M, keeping t latent components.
- Lines 3-8: This loop iterates t times, once for each dimension (component) of the item embeddings.
  - Line 4: Normalises the i-th column of E (the i-th embedding dimension of all items) to the range [0, 1].
  - Line 5: Adds a small amount of Gaussian noise to this normalised dimension.
  - Line 6: Quantises the values of the i-th dimension into v discrete bins.
  - Line 7: Offsets these quantised values by (i-1)·v to create a unique token range for this dimension.
- Lines 9-10: After processing all dimensions, the quantised and offset components of each item are concatenated to form its sub-item token representation, which is returned.

The key advantage of SVD Tokenisation is memory efficiency. Instead of requiring an embedding for each item in the catalogue, the model only needs to store embeddings for the t·v unique sub-item tokens. Embedding memory therefore becomes independent of the total number of items, making GPTRec viable for datasets with millions of items, where traditional approaches would be infeasible due to GPU memory limits.
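To make the procedure concrete, the following NumPy/SciPy sketch implements the three steps under stated assumptions; it is not the authors' code, and the noise scale is an arbitrary small value:

```python
import numpy as np
from scipy.sparse.linalg import svds

def svd_tokenise(M, t, v, noise_std=1e-5, seed=0):
    """Map each item to t sub-item tokens with v alternatives per token.

    M: (sparse) user-item interaction matrix of shape (num_users, num_items).
    Returns an int array of shape (num_items, t) with token ids in [0, t*v).
    """
    rng = np.random.default_rng(seed)
    # Step 1: truncated SVD with t latent components; rows of E are item embeddings.
    _, _, Vt = svds(M.astype(np.float32), k=t)
    E = Vt.T                                              # (num_items, t)
    tokens = np.empty(E.shape, dtype=np.int64)
    for i in range(t):
        col = E[:, i]
        # Step 2: normalise to [0, 1], add small Gaussian noise, quantise into v bins.
        col = (col - col.min()) / (col.max() - col.min() + 1e-12)
        col = col + rng.normal(0.0, noise_std, size=col.shape)
        q = np.clip((col * v).astype(np.int64), 0, v - 1)
        # Step 3: offset dimension i by i*v so each dimension has its own token range.
        tokens[:, i] = q + i * v
    return tokens
```

An item is then fed to the model as its t tokens in order, so the token embedding table needs only t·v rows regardless of catalogue size.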
The following table from the paper illustrates the potential memory savings:
The following are the results from Table 1 of the original paper:
Each multi-token column reports the sub-item embedding table size as a percentage of the one-token-per-item embedding memory.

| Dataset | Num Items | GPU memory (one-token-per-item) | t=2, v=128 (256 embs; 0.25 MB) | t=2, v=512 (1024 embs; 1.00 MB) | t=2, v=2048 (4096 embs; 4.00 MB) | t=4, v=128 (512 embs; 0.50 MB) | t=4, v=512 (2048 embs; 2.00 MB) | t=4, v=2048 (8192 embs; 8.00 MB) | t=8, v=128 (1024 embs; 1.00 MB) | t=8, v=512 (4096 embs; 4.00 MB) | t=8, v=2048 (16384 embs; 16.00 MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MovieLens-1M | 3,416 | 3.34 MB | 7.494% | 29.977% | 119.906% | 14.988% | 59.953% | 239.813% | 29.977% | 119.906% | 479.625% |
| MovieLens-20M | 138,493 | 135.25 MB | 0.185% | 0.739% | 2.958% | 0.370% | 1.479% | 5.915% | 0.739% | 2.958% | 11.830% |
| Yelp | 150,346 | 146.82 MB | 0.170% | 0.681% | 2.724% | 0.341% | 1.362% | 5.449% | 0.681% | 2.724% | 10.898% |
| Gowalla | 1,280,969 | 1.22 GB | 0.020% | 0.080% | 0.320% | 0.040% | 0.160% | 0.640% | 0.080% | 0.320% | 1.279% |
| Amazon Books | 5,264,307 | 5.02 GB | 0.005% | 0.019% | 0.078% | 0.010% | 0.039% | 0.156% | 0.019% | 0.078% | 0.311% |
| LastFM-1b | 32,291,134 | 30.80 GB | 0.001% | 0.003% | 0.013% | 0.002% | 0.006% | 0.025% | 0.003% | 0.013% | 0.051% |
The table clearly shows that for larger datasets (e.g., LastFM-1b with 32M items), the multi-token-per-item mode reduces embedding memory from 30.80 GB to as little as 0.001% of that (well under a megabyte) in the smallest configuration, and to only 0.051% (16 MB) in the largest configuration (t=8, v=2048). For MovieLens-1M, some configurations exceed the original embedding table, because the item catalogue is small and the number of sub-item embeddings (t·v) can exceed the item count. The gains, however, are substantial for large item catalogues.
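The arithmetic behind these percentages is easy to verify; a back-of-the-envelope check, assuming 256-dimensional float32 embeddings (the dimension used in the experiments):

```python
def embedding_table_mb(num_embeddings, dim=256, bytes_per_float=4):
    """Size of an embedding table in MiB."""
    return num_embeddings * dim * bytes_per_float / 2**20

one_token = embedding_table_mb(3_416)        # MovieLens-1M, one token per item: ~3.34 MB
multi_token = embedding_table_mb(4 * 512)    # t=4, v=512 -> 2048 sub-item embeddings: ~2.00 MB
print(one_token, multi_token, multi_token / one_token)  # ratio ~0.60, i.e. ~40% smaller
```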
4.2.3. Training Objective and Loss Function
GPTRec is trained using a standard Language Modelling (LM) objective, similar to GPT-2: the model is trained to predict the next token in a sequence given all preceding tokens. The probability of a sequence of tokens is factorised as a product of conditional probabilities:
$ P(S) = \prod_{i=1}^{n} p(s_i \mid s_1, s_2, \ldots, s_{i-1}) $
Where:
- $S = (s_1, s_2, \ldots, s_n)$ is a sequence of tokens. In GPTRec, $s_i$ is either an item ID (in one-token-per-item mode) or a sub-item token (in multi-token-per-item mode).
- $p(s_i \mid s_1, s_2, \ldots, s_{i-1})$ is the conditional probability of the i-th token given all preceding tokens.

The model is trained to maximise this probability, which is equivalent to minimising the Cross-Entropy loss (derived from the maximum log-likelihood principle):
$ \mathcal{L}_{LM} = -\frac{1}{n} \sum_{i=1}^{n} \log\left( p(s_i \mid s_1, s_2, \ldots, s_{i-1}) \right) $
Where:
- $\mathcal{L}_{LM}$ is the Language Modelling loss.
- $n$ is the length of the sequence $S$.
- $p(s_i \mid s_1, s_2, \ldots, s_{i-1})$ is the predicted probability of the actual i-th token given the preceding tokens; the model obtains it via a softmax over its output at the i-th position.

During training, GPTRec is configured to "shift its input one token to the left": given an input sequence, the model at each position predicts the token that follows it. This aligns directly with the next-item prediction task. The final prediction corresponds to a new token that was not in the input sequence, which can then be fed back into the model for autoregressive generation.
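A minimal PyTorch sketch of this shifted next-token objective, assuming a model that maps token ids to logits over the token vocabulary (an illustration, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: LongTensor of shape (batch, seq_len) holding item or sub-item tokens."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift targets one position left
    logits = model(inputs)                                   # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))              # mean of -log p(s_i | s_<i)
```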
4.2.4. Generating Recommendations with GPTRec
GPTRec supports different strategies for generating recommendation lists:
4.2.4.1. Top-K generation, one-token-per-item
This is the most straightforward generation mode, analogous to how SASRec operates.
- Process: In one-token-per-item mode, each item corresponds to a single token ID. For a given user's interaction sequence, GPTRec outputs a probability distribution $ p(s_i \mid s_1, s_2, \ldots, s_{i-1}) $ over the next token (item) at each position. For Top-K generation, the crucial output is the last distribution, which is the model's prediction for the next most likely item in the sequence. GPTRec uses these probabilities as item scores and applies the standard Top-K procedure (Algorithm 1) to select the K highest-scoring items.
- Relevance to Algorithm 1: This mode maps directly to the Top-K recommendation strategy, where the scoring function f in Algorithm 1 is the probability GPTRec assigns to the item being next in the sequence.

The following is Algorithm 1 from the original paper:

    Algorithm 1 Top-K Recommendation Strategy
    Require: user is the target user for recommendations
    Require: items is the set of all available items
    Require: K is the number of recommendations to generate
    Require: f is an existing scoring model
     1: procedure TOPKRECOMMEND(user, items, K, f)
     2:     item_scores ← empty list
     3:     for item ∈ items do
     4:         score ← f(user, item)
     5:         item_scores.append((item, score))
     6:     end for
     7:     sorted_items ← sorted(item_scores,
     8:                           key = score, reverse = True)
     9:     recommended_items ← [item
    10:         for (item, score) in sorted_items[:K]]
    11:     return recommended_items

Explanation of Algorithm 1:
- Input: a user, the set of all available items, the desired number of recommendations K, and a scoring model f.
- Lines 2-6: For each item in the catalogue, the model computes a relevance score, and the (item, score) pairs are stored.
- Lines 7-8: The item_scores are sorted in descending order of score.
- Lines 9-10: The top K items of the sorted list are selected as recommended_items.
- Line 11: The list of recommended_items is returned.
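A direct Python rendering of this strategy, using a hypothetical score_fn that returns GPTRec's last-step probabilities as per-item scores (a sketch, not the paper's implementation):

```python
import numpy as np

def top_k_recommend(score_fn, user_history, k=10):
    """score_fn(user_history) -> one relevance score per item id (e.g. GPTRec's final softmax)."""
    scores = np.asarray(score_fn(user_history), dtype=float).copy()
    scores[list(user_history)] = -np.inf      # optionally mask items the user has already seen
    return np.argsort(-scores)[:k].tolist()   # ids of the K highest-scoring items
```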
4.2.4.2. Top-K generation, multi-token-per-item
This mode combines GPTRec's multi-token-per-item representation with a Top-K generation approach.
- Process: GPTRec uses autoregressive generation to produce candidate items. This involves:
  - Predicting the probability distribution of the next sub-item token.
  - Sampling a token from this distribution.
  - Appending the sampled token to the input sequence.
  - Repeating this process t times (where t is the number of tokens per item) to generate a full sub-item token sequence that represents a candidate item.
- Scoring candidates: After generating a candidate item (represented by its t tokens), the model scores it using the chain rule from Equation (2), i.e., the probability of the entire sub-item token sequence.
- Final Top-K selection: A set of candidate items is generated this way, and the standard Top-K strategy (Algorithm 1) is then applied to these candidates. Any item that was never generated as a candidate is implicitly excluded, i.e., treated as having the lowest possible score.
- Challenges (see the sketch after this list):
  - Some generated sub-item token sequences may not correspond to any valid item ID in the catalogue; these are ignored.
  - Duplicate candidates can be generated.
  - To obtain enough valid, unique recommendations, the number of autoregressively generated candidates must be significantly higher than the desired number of final recommendations (e.g., substantially more than 10 candidates for a list of 10 recommendations).
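A rough sketch of the sampling-and-filtering loop described above; next_token_probs and token_seq_to_item are assumed interfaces standing in for the model's softmax output and the inverse tokenisation lookup, and are not part of the paper's code:

```python
import numpy as np

def generate_candidates(next_token_probs, history_tokens, token_seq_to_item,
                        t, num_candidates, seed=0):
    """Sample candidate items token by token and score them with the chain rule."""
    rng = np.random.default_rng(seed)
    scores = {}
    for _ in range(num_candidates):
        tokens, log_p = list(history_tokens), 0.0
        for _ in range(t):                               # one item = t sub-item tokens
            probs = next_token_probs(tokens)             # model distribution over the token vocab
            tok = int(rng.choice(len(probs), p=probs))
            log_p += float(np.log(probs[tok]))
            tokens.append(tok)
        item = token_seq_to_item.get(tuple(tokens[-t:]))  # None if no valid item id matches
        if item is not None:
            scores[item] = max(scores.get(item, -np.inf), log_p)  # collapse duplicate candidates
    return sorted(scores, key=scores.get, reverse=True)   # candidates ranked by sequence probability
```

The final Top-K list is then the first K entries of this ranking; items that were never generated simply do not appear, matching their implicit exclusion described above.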
4.2.4.3. Next-K generation, one-token-per-item
This is GPTRec's novel sequential generation strategy, equivalent to greedy search in language models.
- Process: The recommendation list is built iteratively, item by item:
  - At each iteration, GPTRec takes the user's initial interaction sequence plus the sequence of already generated recommendations.
  - It predicts the next most likely item (token) based on this combined history.
  - The highest-scored item that has not yet been recommended is added to the recommended_items list.
  - This process is repeated K times until the full recommendation list is generated.
- Relevance to Algorithm 2: This mode directly implements the Next-K recommendation strategy; a Python sketch is given at the end of this subsection.

The following is Algorithm 2 from the original paper:

    Algorithm 2 Next-K Recommendation Strategy
    Require: user is the target user for recommendations
    Require: items is the set of all available items
    Require: K is the number of recommendations to generate
    Require: f is an existing scoring model
     1: procedure NEXTKRECOMMEND(user, items, K, f)
     2:     recommended_items ← empty list
     3:     for k = 1 to K do
     4:         max_score ← −∞
     5:         best_item ← None
     6:         for item ∈ items \ recommended_items do
     7:             score ← f(user, recommended_items, item)
     8:             if score > max_score then
     9:                 max_score ← score
    10:                 best_item ← item
    11:             end if
    12:         end for
    13:         recommended_items.append(best_item)
    14:     end for
    15:     return recommended_items
    16: end procedure

Explanation of Algorithm 2:
- Input: a user, the set of all available items, the desired number of recommendations K, and a scoring model f.
- Line 2: An empty recommended_items list is initialised to store the recommendations.
- Lines 3-14: This loop runs K times, generating one recommendation per iteration.
  - Lines 4-5: Initialise max_score and best_item for finding the highest-scoring item in the current iteration.
  - Lines 6-12: Iterate through all available items that have not yet been recommended. For each such item, the scoring model f computes a score. Crucially, this score depends not only on the user and the item but also on the already recommended items (recommended_items). The item with the highest score becomes best_item.
  - Line 13: best_item is appended to recommended_items.
- Line 15: The final list of recommended_items is returned.
- Limitations:
  - Computational cost: This strategy is more expensive than Top-K because it requires K separate inference steps, each of which scores all remaining (non-recommended) items.
  - Training-generation misalignment: The model is trained with a simple LM objective to predict one next token, whereas Next-K generation, especially for large K, predicts a sequence of tokens iteratively. This misalignment means the model is not explicitly optimised for generating a coherent list of items. Advanced techniques such as reinforcement learning (e.g., as in InstructGPT [16]) would be required to align the training objective with the iterative generation process for optimal Next-K performance. The paper emphasises that its goal is to check whether a model trained for Top-K can still yield meaningful results in Next-K mode, as a starting point for future work.
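Algorithm 2 translates almost line for line into Python; in this sketch, score_fn stands in for a GPTRec forward pass conditioned on the user history plus the partial recommendation list (an illustration, not the authors' code):

```python
import numpy as np

def next_k_recommend(score_fn, user_history, k=10):
    """Greedy Next-K generation: each pick is conditioned on the items chosen so far."""
    recommended = []
    for _ in range(k):
        # score_fn plays the role of f(user, recommended_items, item) for every item at once
        scores = np.asarray(score_fn(user_history, recommended), dtype=float).copy()
        for idx in list(user_history) + recommended:   # never re-recommend seen or chosen items
            scores[idx] = -np.inf
        recommended.append(int(np.argmax(scores)))
    return recommended
```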
4.2.5. Comparison of GPTRec with SASRec and BERT4Rec
The paper explicitly compares GPTRec with SASRec and BERT4Rec, summarizing their key characteristics, especially concerning generation strategy, architecture, training task, and loss function.
The salient characteristics of these models are portrayed in the left side of Table 3 from the paper:
The following are the results from Table 3 of the original paper:
| Model name | Generation Strategy | Architecture | Training Task | Loss | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|---|
| BERT4Rec | TopK | Encoder | MLM (Item Masking) | Cross Entropy (Softmax Loss) | 0.282 | 0.152 |
| GPTRec-TopK | TopK | Decoder | LM (Sequence Shifting) | Cross Entropy (Softmax Loss) | 0.254 | 0.146 |
| SASRec | TopK | Decoder | LM (Sequence Shifting) | Binary Cross-Entropy | 0.199 | 0.108 |
| GPTRec-NextK | NextK | Decoder | LM (Sequence Shifting) | Cross Entropy (Softmax Loss) | 0.157 | 0.105 |
Key differences highlighted:
- SASRec vs. GPTRec:
  - Both use the Transformer decoder and an LM (sequence shifting) training task.
  - The primary difference is the loss function: SASRec uses Binary Cross-Entropy, while GPTRec uses Cross-Entropy (softmax) loss. The paper's experiments show Cross-Entropy to be superior.
  - SASRec is limited to Top-K generation with one token per item; GPTRec extends this with multi-token-per-item representations (for memory efficiency) and the Next-K strategy (for flexible objectives).
- BERT4Rec vs. GPTRec:
  - BERT4Rec uses a Transformer encoder and an MLM (item masking) training task, which is less aligned with next-item prediction than LM; GPTRec uses a Transformer decoder and LM.
  - Both use Cross-Entropy loss.
  - BERT4Rec is limited to Top-K and one token per item, whereas GPTRec offers the memory-efficient multi-token-per-item mode and the flexible Next-K strategy.

In summary, GPTRec retains the strong predictive power of Transformer-decoder models while introducing innovations for scalability (SVD Tokenisation) and flexibility in recommendation generation (the Next-K strategy).
5. Experimental Setup
5.1. Datasets
The experiments primarily use the MovieLens-1M dataset [10].
-
Source: MovieLens datasets are widely used benchmarks in recommender systems research, collected by the GroupLens Research Project at the University of Minnesota.
-
Scale and Characteristics:
- Number of users: 6,040
- Number of items: 3,416 (movies)
- Number of interactions: 999,611
- Average sequence length (interactions per user): 165.49
- Median sequence length: 96
A concrete example of a data sample in MovieLens-1M is a sequence of movie ratings by a specific user, ordered by timestamp. For instance, User 123 watched Movie A (rated 4 stars), then Movie B (rated 5 stars), then Movie C (rated 3 stars), and so on. The sequential recommendation task is then to predict the next movie User 123 is likely to watch.
- Choice rationale: Despite acknowledged issues (e.g., users do not necessarily rate movies in the exact order they watch them), MovieLens-1M is used consistently across many research papers. This widespread adoption allows direct comparison with published results, supporting reproducibility and comparability of the experimental findings.

The following are the results from Table 2 of the original paper:

| Statistic | Value |
|---|---|
| Number of users | 6,040 |
| Number of items | 3,416 |
| Number of interactions | 999,611 |
| Average sequence length | 165.49 |
| Median sequence length | 96 |
5.2. Evaluation Metrics
The paper uses two common evaluation metrics for sequential recommendation: Recall@10 and NDCG@10. The @10 indicates that the top 10 recommendations are considered for evaluation.
- Recall@K:
  - Conceptual definition: Recall@K measures the fraction of relevant items that are successfully retrieved within the top K recommendations; it quantifies how completely the recommendation list covers the known relevant items.
  - Mathematical formula: $ \mathrm{Recall@K} = \frac{|\text{relevant items} \cap \text{recommended items at K}|}{|\text{relevant items}|} $
  - Symbol explanation:
    - $|\text{relevant items} \cap \text{recommended items at K}|$: the count of relevant items present in the top K recommendations.
    - $|\text{relevant items}|$: the total count of relevant items for the specific user in the test set.
- NDCG@K (Normalised Discounted Cumulative Gain at K):
  - Conceptual definition: NDCG@K assesses the quality of a ranking by considering both the relevance of the recommended items and their positions: highly relevant items ranked higher contribute more significantly to the score. It is normalised by the Ideal DCG to give a score between 0 and 1 that is comparable across users.
  - Mathematical formula. First, Discounted Cumulative Gain (DCG) at K is calculated:
    $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $
    Then the Ideal DCG (IDCG) at K, i.e., the DCG of the list perfectly sorted by relevance:
    $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i^{ideal}} - 1}{\log_2(i+1)} $
    Finally, NDCG@K is the ratio of the two:
    $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol explanation:
    - $K$: the number of top recommendations being evaluated.
    - $rel_i$: the relevance of the item at position $i$ in the actual recommendation list; in many sequential recommendation setups this is binary (1 if the item is the held-out next item, 0 otherwise).
    - $rel_i^{ideal}$: the relevance of the item at position $i$ in the ideal ranking (all relevant items ranked before non-relevant ones, most relevant first).
    - $\log_2(i+1)$: a logarithmic discounting factor that penalises relevant items appearing lower in the list.
5.3. Baselines
The paper compares GPTRec against two popular and state-of-the-art Transformer-based sequential recommendation models:
- BERT4Rec [26]: adapts the Transformer encoder architecture and is trained with a Masked Language Modelling (MLM) objective; it is representative of bidirectional Transformer approaches.
- SASRec [13]: adapts the Transformer decoder architecture and is trained with a Language Modelling (LM) objective; it is representative of autoregressive Transformer approaches and is structurally the closest to GPTRec.

These baselines are chosen because they represent the state-of-the-art in Transformer-based sequential recommendation, allowing a direct and relevant comparison of GPTRec's performance. The results for SASRec and BERT4Rec are copied from a previous reproducibility study [18] by the authors.
5.4. Implementation Details and Hyperparameters
- Frameworks: GPTRec is implemented with the GPT-2 architecture from the HuggingFace Transformers library [31], integrated into the aprec framework from a recent replicability study [18]. An illustrative configuration sketch follows this list.
- Evaluation strategy: A leave-one-out strategy is used for testing, i.e., the latest interaction of each user is held out as the test set.
- Validation set: For 128 randomly selected users, the second-to-last interaction is held out to form a validation set, which is used for early stopping and for monitoring model quality during training.
- Model depth: A relatively shallow Transformer with three Transformer blocks is used.
- Sequence length: The maximum sequence length is 100; longer user histories are truncated to their last 100 interactions.
- Embedding dimension: 256-dimensional token embeddings are used in all experiments.
- Early stopping: Training stops if the validation NDCG@10 does not improve for 300 epochs, preventing overfitting and aiding generalisation.
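For reference, a configuration along these lines could be expressed with the HuggingFace Transformers API roughly as follows; this is a sketch reflecting the stated hyper-parameters, and the head count and the extra padding token are assumptions not given in the text:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=3416 + 1,   # one token per MovieLens-1M item, plus a padding token (assumption)
    n_positions=100,       # maximum sequence length used in the experiments
    n_embd=256,            # token embedding dimension
    n_layer=3,             # three Transformer blocks
    n_head=4,              # number of attention heads: an assumption, not stated in the text
)
model = GPT2LMHeadModel(config)
print(model.num_parameters())  # rough parameter count for this configuration
```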
6. Results & Analysis
6.1. Core Results Analysis
The experiments aim to answer three research questions (RQ1, RQ2, RQ3) regarding GPTRec's performance.
6.1.1. RQ1: Comparison with existing transformer-based models
This research question investigates how GPTRec performs in its one-token-per-item mode (where each item is a single token) against BERT4Rec and SASRec. Two variations of GPTRec are evaluated: GPTRec-TopK (using the Top-K recommendation strategy) and GPTRec-NextK (using the Next-K recommendation strategy).
The following are the results from Table 3 of the original paper:
| Model name | Generation Strategy | Architecture | Training Task | Loss | Recall@10 | NDCG@10 |
|---|---|---|---|---|---|---|
| BERT4Rec | TopK | Encoder | MLM (Item Masking) | Cross Entropy (Softmax Loss) | 0.282 | 0.152 |
| GPTRec-TopK | TopK | Decoder | LM (Sequence Shifting) | Cross Entropy (Softmax Loss) | 0.254 | 0.146 |
| SASRec | TopK | Decoder | LM (Sequence Shifting) | Binary Cross-Entropy | 0.199 | 0.108 |
| GPTRec-NextK | NextK | Decoder | LM (Sequence Shifting) | Cross Entropy (Softmax Loss) | 0.157 | 0.105 |
Analysis:
- GPTRec-TopK vs. baselines:
  - GPTRec-TopK achieves an NDCG@10 of 0.146, significantly better than SASRec's 0.108 (about +35%) and very close to BERT4Rec's 0.152 (about -4%).
  - In terms of Recall@10, GPTRec-TopK scores 0.254, better than SASRec's 0.199 but slightly worse than BERT4Rec's 0.282.
  - Insight: The strong performance of GPTRec-TopK relative to SASRec is particularly noteworthy. Both models share a Transformer decoder architecture and an LM (sequence shifting) training task; the primary difference is GPTRec's Cross-Entropy loss versus SASRec's Binary Cross-Entropy. This result empirically supports the claim that Cross-Entropy loss is more beneficial for next-item prediction. GPTRec-TopK's performance, comparable to BERT4Rec, also validates it as a state-of-the-art Transformer-based sequential recommender in the conventional Top-K setting.
- GPTRec-NextK performance:
  - GPTRec-NextK reaches an NDCG@10 of 0.105 and a Recall@10 of 0.157, lower than GPTRec-TopK and BERT4Rec.
  - Insight: This is expected, as discussed in Section 4.2.4.3: GPTRec-NextK is used in a generation mode (Next-K) that is misaligned with its training objective (LM prediction of a single next token). Despite this misalignment, its NDCG@10 (0.105) is very similar to SASRec's (0.108), indicating a strong foundation. The authors suggest that training techniques designed for the Next-K strategy (such as reinforcement learning) could improve it significantly, establishing GPTRec-NextK as a viable starting point for research into complex recommendation objectives.
6.1.2. RQ2: Effect of the number of tokens per item and the number of values per token in multi-token-per-item mode
This question explores the impact of the SVD Tokenisation parameters (the number of tokens per item t and the number of values per token v) on GPTRec's performance and memory footprint.
The following figure (Figure 2 from the original paper) shows the performance in multi-token-per-item mode:
Analysis:
- Performance degradation vs. one-token-per-item: The multi-token-per-item mode generally shows degraded performance compared to GPTRec-TopK in one-token-per-item mode (which achieved an NDCG@10 of 0.146). The highest NDCG@10 in multi-token-per-item mode is around 0.124, reached in the largest configurations.
- Competitiveness with SASRec: Despite the degradation, GPTRec in multi-token-per-item mode remains competitive with SASRec: in one of the evaluated configurations it almost matches SASRec's NDCG@10 (0.108 for both), although its Recall@10 (0.182) is still slightly below SASRec's (0.199).
- Impact of v (values per token): Both Recall@10 and NDCG@10 increase as v grows; for a fixed number of tokens per item, NDCG@10 rises from 0.0914 at the smallest v to 0.124 at the largest. A larger v means finer-grained quantisation and a more precise representation of the item embeddings, improving performance.
- Impact of t (tokens per item): Increasing t also generally improves performance; for most values of v the larger t outperforms the smaller one, with the exception of the largest v, where the two settings reach similar NDCG@10 (around 0.124).
- Memory vs. performance trade-off: The primary benefit of the multi-token-per-item mode is memory reduction. As Table 1 (Section 4.2.2) shows, for MovieLens-1M with t=4 and v=512 the embedding table is 2.00 MB, i.e., 59.953% of the one-token-per-item table (3.34 MB), and for much larger datasets the ratio drops below 1%. This is a critical trade-off: GPTRec can reach quality comparable to SASRec while substantially reducing GPU memory, which is vital for scaling to large item catalogues. The slight quality drop relative to GPTRec-TopK in one-token-per-item mode is the price of this memory efficiency.
6.1.3. RQ3: Effect of cutoff K in Next-K generation
This question examines how GPTRec's performance changes with increasing cutoff K when comparing Top-K and Next-K recommendation strategies.
The following figure (Figure 3 from the original paper) compares GPTRec's Top-K and Next-K strategies:
Analysis:
- Equivalence at K=1: At K=1, the Top-K and Next-K strategies show identical performance; when recommending a single item, both select the item with the highest probability and therefore behave identically.
- Performance degradation with increasing K for Next-K: As the cutoff K grows, the NDCG@K of the Next-K strategy gradually degrades relative to the Top-K strategy; Figure 3b shows that at the larger cutoffs the Next-K strategy's quality drops to approximately 75% of the Top-K strategy's.
- Insight into the degradation: This is an expected consequence of the training-generation misalignment. The model is trained to predict a single next item, not an entire list, so when generating multiple items sequentially in Next-K mode, errors can compound, and the model is not explicitly optimised for the coherence or quality of the whole list.
- Next-K competitiveness with SASRec: Despite the degradation, Figure 3a shows that even at K=10, GPTRec with the Next-K strategy (NDCG@10 of roughly 0.105) performs similarly to SASRec (0.108). This is a significant finding: even without dedicated tuning for Next-K generation (e.g., via reinforcement learning), GPTRec produces recommendations of reasonable quality in this flexible mode, making it a strong starting point for future research aimed at optimising Next-K generation for complex objectives.
6.2. Data Presentation (Tables)
All relevant tables from the original paper have been transcribed and presented in the methodology and experimental setup sections:
- Table 1: Memory savings with SVD Tokenisation (Section 4.2.2).
- Table 2: Salient characteristics of the MovieLens-1M dataset (Section 5.1).
- Table 3: Comparison of GPTRec with SASRec and BERT4Rec (Sections 4.2.5 and 6.1.1).
6.3. Ablation Studies / Parameter Analysis
The paper performs a form of parameter analysis in RQ2 by varying the number of tokens per item and the number of values per token in the SVD Tokenisation algorithm. This analysis serves as an implicit ablation study for the multi-token-per-item mode, showing how different configurations affect the trade-off between memory efficiency and recommendation quality.
- NDCG@10 and Recall@10 generally improve with an increased number of values per token, indicating that finer-grained quantisation retains more information from the original item embeddings.
- Similarly, increasing the number of tokens per item also tends to improve performance, suggesting that more latent components allow for a richer representation of items.
- These analyses help identify SVD Tokenisation configurations that balance memory reduction against performance (a back-of-the-envelope memory estimate is sketched below). No explicit ablation studies on architectural components of GPTRec itself (e.g., the impact of the number of Transformer blocks) were reported, as the focus was on the novel tokenisation and generation strategies.
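To illustrate the memory side of this trade-off, the snippet below gives a back-of-the-envelope estimate of float32 embedding-table sizes under different configurations. The formula (one embedding row per item in one-token mode, one row per position-value pair in multi-token mode, with a hypothetical embedding width) is an assumption for illustration and is not meant to reproduce the exact figures in Table 1.

```python
from typing import Optional


def embedding_table_mb(num_items: int,
                       emb_dim: int,
                       tokens_per_item: Optional[int] = None,
                       values_per_token: Optional[int] = None) -> float:
    """Rough float32 embedding-table size in megabytes."""
    if tokens_per_item is None:
        rows = num_items                           # one-token-per-item mode
    else:
        rows = tokens_per_item * values_per_token  # multi-token-per-item mode
    return rows * emb_dim * 4 / (1024 * 1024)


# Hypothetical sweep over a 10M-item catalogue with 256-dimensional embeddings:
print(f"one-token: {embedding_table_mb(10_000_000, 256):.0f} MB")
for t in (2, 4, 8):
    for v in (256, 512, 1024):
        print(f"t={t}, v={v}: {embedding_table_mb(10_000_000, 256, t, v):.2f} MB")
```

On a catalogue of that size, the multi-token table stays in the single-digit-megabyte range regardless of the number of items, which is why the relative table size can fall below 1% for large datasets.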
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces GPTRec, a novel generative sequential recommendation model built upon the GPT-2 architecture. GPTRec addresses two critical limitations of current Transformer-based recommenders: the massive memory footprint due to large item vocabularies and the inflexibility of the Top-K strategy for complex recommendation objectives.
The key contributions and findings are:
- Memory Efficiency: The proposed SVD Tokenisation algorithm effectively splits item IDs into sub-item tokens based on quantised SVD embeddings. This drastically reduces the size of the embedding table (e.g., a 40% reduction on MovieLens-1M compared to SASRec while matching its quality), making GPTRec scalable to much larger item catalogues.
- Flexible Generation: The Next-K recommendation strategy allows GPTRec to generate recommendations item-by-item, conditioning each new item on previously recommended ones. This opens possibilities for optimising complex, interdependent objectives such as diversity or coherence.
- Strong Performance: In its standard one-token-per-item Top-K mode, GPTRec performs comparably to BERT4Rec and outperforms SASRec on MovieLens-1M, highlighting the benefit of the Cross-Entropy loss. Even in the Next-K mode (which was not specifically tuned), GPTRec achieves NDCG@10 results similar to SASRec, establishing a strong baseline for future research.

Overall, the paper concludes that generative sequential recommendation is a highly promising direction, offering tangible benefits in memory efficiency and adaptability for complex recommendation tasks.
7.2. Limitations & Future Work
The authors openly acknowledge several limitations and propose avenues for future research:
- Limited Dataset Evaluation: The primary evaluation was conducted on MovieLens-1M. Future work will expand evaluation to more, particularly larger, datasets, where the memory-efficiency gains from multi-token-per-item mode would be most pronounced.
- Basic Generation Techniques: The current multi-token-per-item Top-K generation relies on simple sampling, which can suffer from diversity problems. More advanced generation techniques, such as Beam Search, will be explored to mitigate this (a minimal beam-search sketch over sub-id tokens follows this list).
- Unaligned Next-K Training: The Next-K strategy, while promising, currently suffers from training-generation misalignment: the model is trained for single next-item prediction but used for sequential generation. Future work will focus on more advanced training objectives, such as reinforcement learning (e.g., inspired by InstructGPT), to explicitly tune the model for Next-K generation and complex objectives like diversity and coverage.
- Hyperparameter Tuning: The current experiments used a limited set of hyperparameter settings. More extensive hyperparameter tuning could further improve performance.
- Adaptation to Generative IR: The concepts of SVD Tokenisation and the Next-K strategy have direct applicability to the broader field of Generative Information Retrieval (IR). The authors plan to adapt these methods to address large document-ID vocabularies and complex search-result problems (e.g., diversity, fairness) in Generative IR.
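As referenced in the generation-techniques bullet above, here is a minimal sketch of beam search over the sub-id tokens that make up a single item. `next_token_logprobs` is a hypothetical function returning a log-probability per candidate token given the sequence so far; it is not part of the paper's code, and production decoding would also need to discard sub-id sequences that map to no valid item.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search_sub_ids(prefix: Sequence[int],
                        next_token_logprobs: Callable[[Sequence[int]], List[float]],
                        tokens_per_item: int,
                        beam_size: int) -> List[Tuple[List[int], float]]:
    """Return the beam_size most probable sub-id token sequences for one item."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]
    for _ in range(tokens_per_item):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, logp in beams:
            step = next_token_logprobs(list(prefix) + tokens)
            for tok, tok_logp in enumerate(step):
                candidates.append((tokens + [tok], logp + tok_logp))
        # Keep only the highest-scoring partial sub-id sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

Unlike independent sampling, the beam keeps several high-probability sub-id sequences alive instead of committing to a single sampled path, which is the usual motivation for exploring it here.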
7.3. Personal Insights & Critique
This paper presents a compelling argument for shifting from discriminative score-and-rank to generative approaches in sequential recommendation, especially given the challenges of scale and objective complexity.
- Innovation and Potential: SVD Tokenisation is a clever and practical solution to the embedding-table explosion, directly drawing inspiration from sub-word tokenisation in NLP. Its ability to achieve comparable performance with significant memory savings is a major step forward, particularly for real-world applications with vast item catalogues. The Next-K strategy is conceptually powerful; it is intuitive that considering previous recommendations should lead to better list-level coherence and diversity. The fact that GPTRec-NextK matches SASRec's NDCG@10 without explicit tuning for this mode is indeed a strong starting point, suggesting inherent potential.
- Potential Issues/Unverified Assumptions:
  - SVD Tokenisation Dependence on the Interaction Matrix: The quality of SVD Tokenisation relies heavily on the quality and density of the user-item interaction matrix. For very sparse datasets or new items (the cold-start problem), the SVD embeddings may be less reliable, potentially degrading tokenisation quality. The "small Gaussian noise" helps prevent identical embeddings, but fundamentally, sparse data will yield less informative SVD components. This dependency could be a limitation in domains with extremely sparse interaction data.
  - Loss of Information in Quantisation: While memory efficient, quantisation inherently loses some information from the continuous SVD embeddings. The paper shows that increasing the number of values per token improves performance, indicating that finer-grained quantisation is beneficial. The trade-off between quantisation granularity, memory, and information loss is therefore critical.
  - Computational Cost of Next-K: While conceptually powerful, the Next-K strategy is stated to be "more computationally expensive" during inference. For real-time recommendation systems, the need to re-score the catalogue at each of the K generation steps could be a bottleneck, especially if the item catalogue remains large even after filtering. The current evaluation on MovieLens-1M (3,416 items) may not fully expose this challenge compared to datasets with millions of items.
  - Complexity of Reinforcement Learning for Next-K: Aligning Next-K generation with a suitable training objective using reinforcement learning is non-trivial. Defining appropriate reward functions for diversity, serendipity, or coherence in an end-to-end differentiable manner can be challenging (a toy example of such a reward is sketched at the end of this section).
- Transferability to Other Domains: The methods presented are highly transferable:
  - Generative IR: The authors explicitly mention this. SVD Tokenisation could indeed solve the large document-ID vocabulary problem, and Next-K could be used to generate diverse search result lists.
  - Any Sequential Prediction with a Large Categorical Output: Fields beyond recommendation or IR that involve predicting a sequence of discrete, high-cardinality items could benefit, for instance predicting sequences of actions in complex systems or medical treatment plans, where the "items" are specific interventions or medications.

Overall, GPTRec provides a solid framework and practical solutions for addressing some long-standing scalability and flexibility issues in sequential recommendation. The work encourages a more holistic, generative view of recommendation, moving beyond simple next-item prediction to the generation of contextually rich and diverse recommendation lists.
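To make the reward-design point above more concrete, here is a toy sketch of one possible list-level reward that blends relevance with intra-list diversity. The cosine-distance formulation, the `diversity_weight` parameter, and the use of item embeddings are illustrative assumptions, not a proposal from the paper.

```python
import numpy as np


def list_reward(recommended: list,
                relevance: dict,
                item_emb: np.ndarray,
                diversity_weight: float = 0.3) -> float:
    """Toy reward: average relevance plus mean pairwise dissimilarity of the list.

    relevance: item id -> graded relevance (e.g., 1.0 for a held-out positive).
    item_emb: (num_items, dim) embedding matrix used only to measure how
    different the recommended items are from each other.
    """
    rel = sum(relevance.get(i, 0.0) for i in recommended) / len(recommended)
    vecs = item_emb[recommended]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(recommended)
    # Mean off-diagonal cosine similarity; diversity is its complement.
    mean_sim = (sims.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
    diversity = 1.0 - mean_sim
    return (1.0 - diversity_weight) * rel + diversity_weight * diversity
```

Such a reward could in principle drive a policy-gradient update of the Next-K generator, but as the critique notes, getting the weighting and credit assignment right is far from trivial.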