Paper status: completed

Generative Sequential Recommendation with GPTRec

Published: 06/20/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

GPTRec, a GPT-2-based generative sequential recommender, uses SVD tokenization to handle large item vocabularies and a Next-K strategy to optimize recommendations, achieving SASRec-level quality on MovieLens-1M with a 40% smaller embedding table.

Abstract

Sequential recommendation is an important recommendation task that aims to predict the next item in a sequence. Recently, adaptations of language models, particularly Transformer-based models such as SASRec and BERT4Rec, have achieved state-of-the-art results in sequential recommendation. In these models, item ids replace tokens in the original language models. However, this approach has limitations. First, the vocabulary of item ids may be many times larger than in language models. Second, the classical Top-K recommendation approach used by these models may not be optimal for complex recommendation objectives, including auxiliary objectives such as diversity, coverage or coherence. Recent progress in generative language models inspires us to revisit generative approaches to address these challenges. This paper presents the GPTRec sequential recommendation model, which is based on the GPT-2 architecture. GPTRec can address large vocabulary issues by splitting item ids into sub-id tokens using a novel SVD Tokenisation algorithm based on quantised item embeddings from an SVD decomposition of the user-item interaction matrix. The paper also presents a novel Next-K recommendation strategy, which generates recommendations item-by-item, considering already recommended items. The Next-K strategy can be used for producing complex interdependent recommendation lists. We experiment with GPTRec on the MovieLens-1M dataset and show that using sub-item tokenisation GPTRec can match the quality of SASRec while reducing the embedding table by 40%. We also show that the recommendations generated by GPTRec on MovieLens-1M using the Next-K recommendation strategy match the quality of SASRec in terms of NDCG@10, meaning that the model can serve as a strong starting point for future research.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Generative Sequential Recommendation with GPTRec

1.2. Authors

  • Aleksandr V. Petrov: Affiliated with the University of Glasgow, United Kingdom.
  • Craig Macdonald: Affiliated with the University of Glasgow, United Kingdom.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server. While not a peer-reviewed journal or conference proceeding itself, arXiv is a highly reputable platform for disseminating research rapidly in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published here often precede formal publication in top-tier venues.

1.4. Publication Year

2023

1.5. Abstract

The paper introduces GPTRec, a generative sequential recommendation model based on the GPT-2 architecture. It addresses two key limitations of existing Transformer-based sequential recommenders like SASRec and BERT4Rec: the large vocabulary of item IDs and the inflexibility of the classical Top-K recommendation approach for complex objectives (e.g., diversity, coverage). GPTRec tackles the large vocabulary issue with a novel SVD Tokenisation algorithm that splits item IDs into sub-ID tokens, leveraging quantised item embeddings from an SVD decomposition of the user-item interaction matrix. For recommendation generation, it proposes a Next-K strategy, which generates recommendations item-by-item, considering previously recommended items, thus enabling complex interdependent recommendation lists. Experiments on the MovieLens-1M dataset show that SVD Tokenisation allows GPTRec to match SASRec's quality while reducing the embedding table size by 40%. The Next-K strategy also matches SASRec's NDCG@10 quality, indicating GPTRec's potential as a foundation for future research in generative sequential recommendation.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses lies within sequential recommendation systems, which aim to predict the next item a user might interact with based on their past sequence of interactions. While Transformer-based models (like SASRec and BERT4Rec) have achieved state-of-the-art results by adapting language models (LMs) to recommendation tasks (replacing words with item IDs), they face significant limitations:

  1. Large Item ID Vocabulary: The number of unique items in a recommendation catalogue can be orders of magnitude larger than the vocabulary in typical language models. This leads to massive embedding tables for item IDs, consuming substantial GPU memory and making training infeasible for very large datasets (e.g., more than 1M items).

  2. Inflexible Top-K Recommendation: Traditional Top-K recommendation strategies, where items are scored independently and the top-scoring ones are selected, are suboptimal for complex recommendation objectives. These objectives might include diversity (avoiding redundant recommendations of similar items), coverage (ensuring a broad range of items is considered), or coherence (recommending complementary items). The independent scoring approach often leads to recommendation lists dominated by highly similar items, failing to capture a user's varied interests or provide a well-rounded experience.

    The paper is motivated by the rapid progress in generative language models (like GPT and T5), which have shown great flexibility and power across various text generation tasks. This inspires the authors to revisit generative approaches for sequential recommendation, believing they can inherently address the limitations of current discriminative, score-and-rank methods.

The paper's entry point is to adapt a generative Transformer architecture (GPT-2) to sequential recommendation, introducing novel mechanisms (SVD Tokenisation and Next-K strategy) to overcome the identified challenges.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of sequential recommendation:

  1. Introduction of GPTRec: The paper proposes GPTRec, a generative sequential recommendation model built upon the GPT-2 architecture. Unlike previous Transformer-based models, GPTRec is inherently generative, opening avenues for more flexible recommendation strategies.
  2. Novel SVD-based Sub-item Tokenisation: To tackle the issue of large embedding tables, the paper introduces SVD Tokenisation. This algorithm decomposes item IDs into multiple sub-item tokens by quantising item embeddings derived from an SVD of the user-item interaction matrix. This approach significantly reduces memory footprint (e.g., 40% reduction for MovieLens-1M compared to SASRec while matching quality) by storing a smaller vocabulary of sub-item tokens rather than individual item embeddings.
  3. Novel Next-K Recommendation Strategy: As an alternative to the traditional Top-K approach, the paper proposes the Next-K strategy. This method generates recommendations item-by-item, with each subsequent recommendation conditioned on the items already generated in the list. This allows for the creation of complex interdependent recommendation lists, addressing objectives like diversity, complementarity, and coherence that are difficult to optimize with independent scoring.
  4. Empirical Validation: Experiments on the MovieLens-1M dataset demonstrate the viability of GPTRec and its novel components:
    • GPTRec in one-token-per-item mode with Top-K generation shows performance comparable to BERT4Rec and outperforms SASRec (e.g., NDCG@10 of 0.146 vs. SASRec's 0.108), suggesting the benefit of Cross-Entropy loss over Binary Cross-Entropy for this task.

    • When using SVD Tokenisation (multi-token-per-item mode), GPTRec can match SASRec's quality (e.g., NDCG@10 of 0.108 for both) while substantially reducing memory usage (e.g., 40% smaller embedding table).

    • The Next-K strategy, despite not being specifically tuned for this generation mode, still achieves recommendation quality (NDCG@10) similar to SASRec, establishing it as a strong starting point for future research, particularly with advanced tuning techniques like reinforcement learning.

      These findings solve the problems of memory inefficiency in large item catalogues and inflexibility in optimizing for complex recommendation objectives, paving the way for more powerful and adaptable recommender systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand GPTRec and its contributions, a foundational understanding of several key concepts is essential:

  • Sequential Recommendation Systems: These are a class of recommender systems that leverage the order or sequence of user-item interactions. Unlike traditional collaborative filtering or content-based systems that treat interactions independently, sequential recommenders recognize that user preferences evolve over time and that the next item a user interacts with often depends on their most recent interactions. The primary task is typically next-item prediction, where the system predicts the single item a user will engage with next based on their interaction history. For example, if a user watches movies A, B, and C in that order, a sequential recommender tries to predict movie D.

  • Language Models (LMs): Originally from Natural Language Processing (NLP), LMs learn to predict the probability of a sequence of words. They model the distribution of natural language by predicting the next word in a sequence given the preceding words (e.g., "The cat sat on the _").

    • Autoregressive Language Modeling (LM): This is a type of language modeling where the model predicts the next token in a sequence given all previous tokens. Models like GPT are trained with this objective.
    • Masked Language Modeling (MLM): In MLM, a percentage of tokens in the input sequence are masked, and the model is trained to predict the original masked tokens based on their context (both left and right). BERT is a prominent example trained with MLM.
  • Transformer Architecture: Introduced by Vaswani et al. (2017) in "Attention Is All You Need," the Transformer is a neural network architecture that has revolutionized NLP. It relies entirely on self-attention mechanisms to draw global dependencies between input and output.

    • Encoder-Decoder Structure: The original Transformer consists of an encoder stack and a decoder stack.
      • The encoder processes the input sequence and produces a sequence of contextualized representations.
      • The decoder then takes these encoder outputs, along with its own previous outputs, to generate the output sequence.
    • Self-Attention Mechanism: This is the core component of Transformers. It allows the model to weigh the importance of different parts of the input sequence when processing each element. For each token in a sequence, self-attention computes a weighted sum of all other tokens, where the weights are learned based on the relevance between tokens. The basic formula for Scaled Dot-Product Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings, representing different projections of the input.
      • $QK^T$ calculates the dot product between queries and keys, measuring similarity.
      • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the keys. This prevents the dot products from becoming too large, which can push the softmax function into regions with tiny gradients.
      • $\mathrm{softmax}$ normalizes the scores to obtain attention weights.
      • The output is a weighted sum of the Value vectors (a NumPy sketch of this computation appears at the end of this subsection).
  • GPT-2 Architecture: GPT-2 (Generative Pre-trained Transformer 2) is a large Transformer-based language model developed by OpenAI. It is an autoregressive model that uses only the decoder part of the Transformer architecture. It is trained on a massive dataset of text to predict the next word in a sequence. Its key characteristic is its ability to generate coherent and contextually relevant text. GPTRec uses this generative nature and architectural backbone.

  • Singular Value Decomposition (SVD): SVD is a matrix factorization technique used in linear algebra, often for dimensionality reduction and finding latent factors. For a matrix $M$ (e.g., a user-item interaction matrix), SVD decomposes it into three matrices: $ M = U \Sigma V^T $ Or, in the context of truncated SVD for item embeddings as used in the paper: $ M \approx U \Sigma E^T $ Where:

    • $M$ is the original user-item interaction matrix.
    • $U$ is a matrix containing left singular vectors (often interpreted as user embeddings).
    • $\Sigma$ is a diagonal matrix containing singular values, ordered from largest to smallest. In truncated SVD, only the top $t$ singular values and corresponding vectors are kept.
    • $E$ (or $V$) is a matrix containing right singular vectors (often interpreted as item embeddings). SVD can capture underlying patterns and reduce noise in data by representing items (or users) in a lower-dimensional latent space.
  • Quantization: In the context of SVD Tokenisation, quantization refers to the process of mapping a continuous range of values (e.g., item embedding dimensions) to a finite, discrete set of values. For example, if an embedding dimension ranges from 0 to 1, quantizing it into 4 bins would map values in [0, 0.25) to 0, [0.25, 0.5) to 1, etc. This discretisation is crucial for creating tokens from continuous embeddings.

  • Evaluation Metrics (NDCG@K, Recall@K):

    • Recall@K:
      • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved among the top $K$ recommendations. It focuses on how many of the truly relevant items for a user were present in the recommended list.
      • Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\{\text{relevant items}\} \cap \{\text{recommended items at K}\}|}{|\{\text{relevant items}\}|} $
      • Symbol Explanation:
        • $|\{\text{relevant items}\} \cap \{\text{recommended items at K}\}|$ denotes the number of relevant items found within the top $K$ recommended items.
        • $|\{\text{relevant items}\}|$ denotes the total number of relevant items for a given user.
    • NDCG@K (Normalized Discounted Cumulative Gain at K):
      • Conceptual Definition: NDCG@K is a measure of ranking quality. It accounts for both the relevance of recommended items and their position in the list. Highly relevant items appearing higher in the recommendation list contribute more to the NDCG score. It's "normalized" because the score is divided by the ideal DCG (achieved if all relevant items were perfectly ordered), ensuring scores are comparable across different queries.
      • Mathematical Formula: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_{i_{ideal}}} - 1}{\log_2(i+1)} $ $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
      • Symbol Explanation:
        • $K$: The number of top recommendations considered.
        • $rel_i$: The relevance score of the item at position $i$ in the recommended list. (Often 1 if relevant, 0 if not, for binary relevance.)
        • $\log_2(i+1)$: A discounting factor, meaning items at lower ranks (higher $i$) contribute less to the score.
        • $\mathrm{DCG@K}$: Discounted Cumulative Gain at $K$.
        • $rel_{i_{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfectly sorted by relevance) recommendation list.
        • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at $K$. This is the maximum possible DCG for the given set of relevant items.
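A minimal NumPy sketch of scaled dot-product attention (single head, no masking, toy shapes) is given below for illustration; it is not code from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of value vectors

# Toy example: 5 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)         # (5, 8)
```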

3.2. Previous Works

The paper builds upon and distinguishes itself from several lines of prior research:

  • Transformer-based Sequential Recommenders:

    • SASRec [13]: Self-Attentive Sequential Recommendation. This model adapts the Transformer decoder architecture for sequential recommendation. It is trained with a Language Modeling (LM) objective, predicting the next item in a sequence given the preceding ones. It uses Binary Cross-Entropy loss and generates recommendations using a Top-K strategy.
    • BERT4Rec [26]: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. This model adapts the Transformer encoder architecture. It is trained with a Masked Language Modeling (MLM) objective, predicting masked items in a sequence based on bidirectional context. It uses Cross-Entropy loss and a Top-K strategy.
    • Other models like DuoRec [19], CBiT [7], ALBERT4Rec [18] also primarily use Transformer encoders or decoders as their main component, often with additional training tasks or architectural tweaks, but generally adhere to the Top-K recommendation strategy and one-token-per-item representation.
  • Recommendations as Text Generation / Generative IR:

    • Recent works like P5 [8] and M6-Rec [5] leverage large pre-trained Language Models (like T5 and GPT-3) for recommendation by treating item IDs or titles as text strings to be generated. These models often rely on existing pre-trained knowledge and side information (e.g., item descriptions, user reviews) to represent items semantically.
    • TIGER [23] introduces semantic IDs, where items are represented by tokens derived from their side information.
    • In the broader field of Generative IR, similar challenges of large document ID vocabularies and the need for sub-document ID tokenization (DI+ [15], Learning to Tokenize for Generative Retrieval [27]) have been identified.
    • Differentiation: GPTRec distinguishes itself from these generative approaches by focusing on the classic sequential recommendation setting where only interaction sequences are available, without relying on external side information or large pre-trained LMs. This allows for a clearer study of the generative approach's properties in a more fundamental context.
  • Sub-word Tokenization in NLP: The concept of splitting atomic units into sub-units to manage vocabulary size in GPTRec's SVD Tokenisation is inspired by sub-word tokenization techniques in NLP. Examples include WordPiece encoding [32] (used by BERT) and BytePair encoding [25] (used by GPT family models). These methods allow large vocabularies of words to be represented by a smaller, fixed vocabulary of sub-word units, enabling handling of rare words and reducing embedding table size.

  • Diversity and Interdependent Recommendations: Methods like MMR [3] (Maximal Marginal Relevance) and multi-aspect methods [2] are used for re-ranking Top-K lists to improve diversity. However, these are often post-processing steps and can increase complexity and computational demands. The Next-K strategy of GPTRec aims to address interdependent objectives more intrinsically during generation.

3.3. Technological Evolution

The field of sequential recommendation has evolved significantly, mirroring advancements in deep learning and NLP:

  1. Early Neural Models (RNNs/GRUs): Starting with models like GRU4Rec [11] (2016), sequential recommenders began leveraging Recurrent Neural Networks (RNNs) and their variants (GRUs) to model sequential patterns, inspired by their success in machine translation. These models capture dependencies over time but can struggle with long sequences.
  2. Transformer-based Models (2018-present): The introduction of the Transformer architecture [30] (2017) revolutionized NLP by replacing recurrence with attention mechanisms, allowing for parallel processing and better handling of long-range dependencies. This quickly led to its adoption in sequential recommendation:
    • SASRec [13] (2018) was among the first to adapt the Transformer decoder for next-item prediction.
    • BERT4Rec [26] (2019) followed, adapting the Transformer encoder with masked language modeling. These models rapidly became state-of-the-art due to their ability to capture complex sequential patterns. However, they inherited the large embedding table problem and typically employed the Top-K score-and-rank approach.
  3. Generative Models and Pre-trained Language Models (2020-present): The success of very large generative language models like GPT-3 [1] and T5 [22] in various text generation tasks sparked interest in applying them to recommendation. This led to research like P5 [8], which treats recommendation as a text generation problem, often relying on pre-trained models and side information.
  4. GPTRec's Position: GPTRec sits at the intersection of Transformer-based sequential recommendation and the emerging trend of generative models. It extends the Transformer-decoder paradigm (like SASRec) but embraces a fully generative approach (like GPT-2) to address fundamental limitations. It focuses on solving the memory footprint and objective flexibility issues using novel techniques (SVD Tokenisation, Next-K strategy) in a "side-information-agnostic" setting, providing a foundational study for generative sequential recommendation.

3.4. Differentiation Analysis

Compared to the main methods in related work, GPTRec introduces several core innovations:

  • Generative Approach vs. Discriminative/Score-and-Rank:

    • SASRec and BERT4Rec: These models are primarily discriminative. They learn to score individual items based on user history, and then the top-scoring items are selected (Top-K strategy). While powerful for next-item prediction, they treat items independently and are less flexible for optimizing complex list-level objectives like diversity or coherence.
    • GPTRec: Adopts a truly generative approach, predicting items sequentially. This enables the Next-K strategy, where each subsequent item can be conditioned on previously recommended items, allowing for the inherent optimization of interdependent recommendation lists. This is a fundamental shift from ranking to generation.
  • Memory Efficiency for Large Item Vocabularies:

    • SASRec and BERT4Rec: Both use a one-token-per-item approach, meaning each unique item ID requires its own embedding. This leads to extremely large embedding tables that become a bottleneck for GPU memory as the number of items grows into the millions or tens of millions.
    • GPTRec: Introduces SVD Tokenisation, which splits item IDs into multiple sub-item tokens. This dramatically reduces the size of the required embedding table because the model only needs to store embeddings for the smaller vocabulary of sub-item tokens, not every individual item. This is crucial for scaling to very large catalogues.
  • Flexibility in Recommendation Objectives:

    • SASRec and BERT4Rec: Primarily optimized for single-item prediction (e.g., maximizing NDCG or Recall of the next relevant item). Adapting them for diversity, serendipity, or complementarity typically requires complex post-processing re-ranking methods, which can add computational overhead and are hard to train end-to-end.
    • GPTRec (Next-K strategy): By generating items sequentially and conditionally, GPTRec provides a native mechanism to incorporate complex, interdependent objectives directly into the generation process. Although not fully explored in this paper, the Next-K strategy lays the groundwork for using reinforcement learning to tune the model for specific list-level qualities.
  • Loss Function:

    • SASRec: Uses Binary Cross-Entropy loss, which the paper empirically shows to underperform Cross-Entropy loss.
    • BERT4Rec and GPTRec: Both use Cross-Entropy (or Softmax) loss, which GPTRec demonstrates is more effective for next-item prediction.
  • Reliance on Side Information:

    • P5, M6-Rec, TIGER: These generative recommendation models often rely heavily on external pre-trained language models and side information (e.g., item descriptions, user reviews) to imbue items with semantic meaning and aid generation.

    • GPTRec: Focuses on the more traditional sequential recommendation setting where only interaction sequences are used. This makes it a more fundamental study of the generative approach's capabilities without confounding factors from side information.

      In essence, GPTRec offers a more memory-efficient and flexible generative framework for sequential recommendation, directly addressing scaling and objective complexity limitations inherent in many existing Transformer-based models.

4. Methodology

4.1. Principles

The core idea behind GPTRec is to re-frame sequential recommendation as a generative task, similar to language modeling, rather than a discriminative score-and-rank task. By leveraging the GPT-2 architecture, GPTRec aims to predict the entire sequence of future recommendations item-by-item. This generative approach enables two key innovations:

  1. Memory Efficiency: By treating item IDs not as atomic units but as compositions of sub-item tokens, the model can manage very large item catalogues without an explosion in embedding table size.

  2. Flexible Recommendation Objectives: The item-by-item generation process (Next-K strategy) allows the model to consider already recommended items when generating subsequent ones, opening the door for optimizing complex, interdependent objectives like diversity, coherence, or serendipity, which are difficult to achieve with traditional Top-K methods.

    The theoretical basis draws heavily from the success of Transformer-based language models in handling sequences and generating coherent outputs. The intuition is that if GPT-2 can generate coherent text from sub-word tokens, a similar mechanism can generate coherent item sequences from sub-item tokens.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Architecture

GPTRec is designed for generative sequential recommendation and is built upon the GPT-2 architecture [21]. The GPT-2 architecture, in turn, is based on the decoder part of the original Transformer model [30]. GPTRec is therefore an autoregressive model: it generates sequences one item (or sub-item token) at a time, conditioning each prediction on all preceding ones.

The specific architectural modifications GPT-2 makes compared to the original Transformer decoder are:

  • Moving the layer normalisation to the beginning of each Transformer block (pre-normalization).
  • Implementing a modified initialization with scaled residual weights.
  • Employing learnable positional encodings instead of the original sine-based positional encodings. These learnable embeddings allow the model to learn relative or absolute position information about the items in the sequence.

4.2.2. Tokenisation

Existing Transformer-based sequential recommendation models, such as SASRec [13] and BERT4Rec [26], typically use item IDs directly as tokens. This is referred to as the one-token-per-item mode. However, as discussed in the introduction, for large item catalogues, this results in an extremely large embedding table that is computationally expensive and difficult to fit into GPU memory.

Inspired by language models that split whole words into sub-word tokens (e.g., WordPiece for BERT, BytePair for GPT models), GPTRec introduces a multi-token-per-item mode. In this mode, each item is represented by $t$ sub-item tokens, where each of these $t$ tokens can be chosen from $v$ alternatives.

The paper proposes a novel SVD-based Tokenisation algorithm to decompose items into these sub-item tokens. The algorithm consists of three main steps, illustrated in Figure 1:

Step 1: SVD Decomposition for Item Embeddings The algorithm first constructs a user-item interaction matrix $M$. This matrix typically records whether a user has interacted with an item (e.g., 1 for interaction, 0 for no interaction, or a rating). A truncated SVD decomposition is then performed on this matrix with $t$ latent components. The purpose of truncated SVD here is to obtain lower-dimensional embeddings for items that capture their underlying characteristics and relationships based on user interactions.

The decomposition is given by: $ M \approx U \Sigma E^T $ Where:

  • $M$: The user-item interaction matrix. Its dimensions are $N_{\text{users}} \times N_{\text{items}}$.

  • $U$: A matrix of user embeddings. Its dimensions are $N_{\text{users}} \times t$. Each row represents a user's embedding in a $t$-dimensional latent space.

  • $\Sigma$: A diagonal matrix of singular values. Its dimensions are $t \times t$. The diagonal elements contain the $t$ largest singular values, which represent the strength of each latent component.

  • $E$: A matrix of item embeddings. Its dimensions are $N_{\text{items}} \times t$. Each row represents an item's embedding in a $t$-dimensional latent space. The transpose $E^T$ is used in the formula.

    Both user and item embeddings ($U$ and $E$) now have $t$ latent components, effectively providing a $t$-dimensional embedding for each item.

Step 2: Normalization, Noise Addition, and Quantization of Item Embeddings After obtaining the item embeddings $E$ (from Step 1), each dimension of these embeddings is processed independently:

  • Normalization: Each dimension (column) of $E$ is normalized so that its values fall within the [0, 1] interval. This is a common preprocessing step to ensure consistent scaling.
  • Noise Addition: A small amount of Gaussian noise is added to each dimension. This is crucial to prevent quantization from mapping two distinct items (that might have identical or very close embeddings due to sparse interaction data) to the exact same sub-item token representation. In the experiments, Gaussian noise with zero mean and a standard deviation of $10^{-5}$ is used.
  • Quantization: Each dimension of the (normalized and noisy) item embeddings is then quantised into $v$ discrete values. This means that the continuous range [0, 1] for each embedding dimension is divided into $v$ bins, and each value is assigned to the integer corresponding to its bin. These discrete values will form the sub-item tokens.

Step 3: Offsetting Quantised Values Finally, to ensure that each dimension's quantised values are distinct in the overall token vocabulary, an offset is applied. The $i$-th dimension's quantised values are offset by $v \times (i-1)$. This means:

  • The first dimension's tokens will be in the range $[0, v-1]$.
  • The second dimension's tokens will be in the range $[v, 2v-1]$.
  • The $i$-th dimension's tokens will be in the range $[v \times (i-1), v \times i - 1]$. This ensures that the token vocabulary for each dimension is unique and non-overlapping. The final representation of an item is a sequence of these $t$ sub-item tokens.

The SVD Tokenisation algorithm is summarized in Algorithm 3:

Algorithm 3 SVD Tokenisation Algorithm
Require: MM is the user-item interaction matrix
Require: tt is the number of tokens per item
Require: v is the number of alternatives per token
Ensure: Sub-item token representation of items
1: procedure SVDTOKENISE(M, t, v)
2:     Compute item embeddings E with truncated SVD on M
3:     for i = 1 to t do
4:         Normalise E_i to range [0, 1]
5:         Add small Gaussian noise to E_i
6:         Quantise E_i into v bins
7:         Offset quantised values by (i - 1) * v
8:     end for
9:     Concatenate quantised components for sub-item tokens
10:    return Sub-item token representation
11: end procedure

Explanation of Algorithm 3:

  • Line 1-2: The procedure takes the user-item interaction matrix $M$, the desired number of sub-item tokens $t$ per item, and the number of alternatives (bins) $v$ per token. It then computes the item embeddings $E$ by performing truncated SVD on $M$, keeping $t$ latent components.

  • Line 3-8: This loop iterates $t$ times, once for each dimension (or component) of the item embeddings.

    • Line 4: Normalizes the $i$-th column of $E$ (which represents the $i$-th embedding dimension for all items) to the range [0, 1].
    • Line 5: Adds a small amount of Gaussian noise to this normalized $i$-th dimension.
    • Line 6: Quantises the values in the $i$-th dimension into $v$ discrete bins.
    • Line 7: Offsets these quantised values by $v \times (i-1)$ to create a unique range of tokens for this dimension.
  • Line 9-10: After processing all $t$ dimensions, the quantised and offset components for each item are concatenated to form its sub-item token representation. This final representation is returned.

    The key advantage of SVD Tokenisation is memory efficiency. Instead of requiring an embedding for each of the $N_{\text{items}}$ items, the model now only needs to store embeddings for the $t \times v$ unique sub-item tokens. This allows the memory consumption for embeddings to be independent of the total number of items in the catalogue, making GPTRec viable for datasets with millions of items where traditional approaches would be infeasible due to GPU memory limitations.
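To make the procedure concrete, the following Python sketch follows Algorithm 3 using scikit-learn's TruncatedSVD and NumPy. The function name svd_tokenise, the seed handling, and the normalisation details are illustrative choices rather than the authors' exact implementation; the paper only specifies Gaussian noise with a standard deviation of $10^{-5}$.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def svd_tokenise(M, t, v, noise_std=1e-5, seed=0):
    """Sketch of SVD Tokenisation (Algorithm 3).

    M: user-item interaction matrix (num_users x num_items), dense or scipy sparse.
    t: number of sub-item tokens per item.
    v: number of alternatives (quantisation bins) per token.
    Returns an array of shape (num_items, t) with token ids in [0, t*v).
    """
    rng = np.random.default_rng(seed)
    # Step 1: truncated SVD; components_.T holds a t-dimensional embedding per item.
    svd = TruncatedSVD(n_components=t, random_state=seed)
    svd.fit(M)
    E = svd.components_.T                                  # (num_items, t)

    tokens = np.empty(E.shape, dtype=np.int64)
    for i in range(t):
        col = E[:, i]
        # Step 2: normalise to [0, 1], add small Gaussian noise, quantise into v bins.
        col = (col - col.min()) / (col.max() - col.min() + 1e-12)
        col = col + rng.normal(0.0, noise_std, size=col.shape)
        bins = np.clip((col * v).astype(np.int64), 0, v - 1)
        # Step 3: offset so each dimension occupies its own disjoint token range.
        tokens[:, i] = bins + i * v
    return tokens
```

Each item is then fed to the model as its $t$ token ids, so the embedding table only needs $t \times v$ rows.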

The following table from the paper illustrates the potential memory savings:

The following are the results from Table 1 of the original paper:

Each percentage is the size of the sub-item embedding table relative to the one-token-per-item table for that dataset.

| Dataset | Num Items | GPU Memory (one-token-per-item) | t=2, v=128 (256 embs; 0.25 MB) | t=2, v=512 (1024 embs; 1.00 MB) | t=2, v=2048 (4096 embs; 4.00 MB) | t=4, v=128 (512 embs; 0.50 MB) | t=4, v=512 (2048 embs; 2.00 MB) | t=4, v=2048 (8192 embs; 8.00 MB) | t=8, v=128 (1024 embs; 1.00 MB) | t=8, v=512 (4096 embs; 4.00 MB) | t=8, v=2048 (16384 embs; 16.00 MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MovieLens-1M | 3,416 | 3.34 MB | 7.494% | 29.977% | 119.906% | 14.988% | 59.953% | 239.813% | 29.977% | 119.906% | 479.625% |
| MovieLens-20M | 138,493 | 135.25 MB | 0.185% | 0.739% | 2.958% | 0.370% | 1.479% | 5.915% | 0.739% | 2.958% | 11.830% |
| Yelp | 150,346 | 146.82 MB | 0.170% | 0.681% | 2.724% | 0.341% | 1.362% | 5.449% | 0.681% | 2.724% | 10.898% |
| Gowalla | 1,280,969 | 1.22 GB | 0.020% | 0.080% | 0.320% | 0.040% | 0.160% | 0.640% | 0.080% | 0.320% | 1.279% |
| Amazon Books | 5,264,307 | 5.02 GB | 0.005% | 0.019% | 0.078% | 0.010% | 0.039% | 0.156% | 0.019% | 0.078% | 0.311% |
| LastFM-1b | 32,291,134 | 30.80 GB | 0.001% | 0.003% | 0.013% | 0.002% | 0.006% | 0.025% | 0.003% | 0.013% | 0.051% |
The table clearly shows that for larger datasets (e.g., LastFM-1b with 32M items), the multi-token-per-item mode reduces embedding memory requirements from 30.80 GB to as little as 0.001% of that (well under a megabyte), depending on the $t$ and $v$ parameters, and to at most 0.051% (16 MB) in the largest configuration $t=8, v=2048$. For MovieLens-1M, the sub-item table can be larger than the original one because the item catalogue is small and $t \times v$ can exceed the number of items. However, the gains are significant for larger item catalogues.
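The percentages in Table 1 can be reproduced with a simple back-of-the-envelope calculation, assuming 256-dimensional float32 embeddings (the dimension reported in Section 5.4); this is an illustrative sketch, not code from the paper:

```python
BYTES_PER_FLOAT = 4
DIM = 256  # token embedding dimension used in the paper's experiments

def embedding_table_mb(num_embeddings, dim=DIM):
    """Size of an embedding table in MiB for float32 embeddings."""
    return num_embeddings * dim * BYTES_PER_FLOAT / 2**20

num_items = 3_416                       # MovieLens-1M catalogue size
full = embedding_table_mb(num_items)    # one-token-per-item table: ~3.34 MB
sub = embedding_table_mb(2 * 128)       # t=2, v=128 -> 256 sub-item embeddings: 0.25 MB
print(f"{full:.2f} MB, {sub:.2f} MB, {100 * sub / full:.3f}%")  # 3.34 MB, 0.25 MB, 7.494%
```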

4.2.3. Training Objective and Loss Function

GPTRec is trained using a standard Language Modelling (LM) objective, similar to GPT-2. This means the model is trained to predict the next token in a sequence given all preceding tokens. The probability of a sequence of tokens $S = \{ s_1, s_2, \dots, s_n \}$ is factorised as the product of conditional probabilities: $ P(S) = \prod_{i=1}^{n} p(s_i | s_1, s_2, \dots, s_{i-1}) $ Where:

  • $S$: A sequence of tokens. In GPTRec, $s_i$ represents either an item ID (in one-token-per-item mode) or a sub-item token value (in multi-token-per-item mode).

  • $p(s_i | s_1, s_2, \dots, s_{i-1})$: The conditional probability of the $i$-th token given all tokens from the first to the $(i-1)$-th position.

    The model is trained to maximize this probability, which translates to minimizing the Cross-Entropy loss (derived from the Maximum Log-likelihood principle). The loss function is: $ \mathcal{L}_{LM} = - \frac{1}{n} \sum_{i=1}^{n} \log p(s_i | s_1, s_2, \dots, s_{i-1}) $ Where:

  • $\mathcal{L}_{LM}$: The Language Modelling loss.

  • $n$: The length of the sequence $S$.

  • $p(s_i | s_1, s_2, \dots, s_{i-1})$: The predicted probability of the actual $i$-th token given the preceding tokens. The model predicts this probability using a softmax operation over its output for the $i$-th position.

    During training, GPTRec is typically configured to "shift its input one token to the left": at each position $i$, the model takes the prefix $(s_1, \dots, s_i)$ and is trained to predict the next token $s_{i+1}$. This aligns directly with the next-item prediction task. The token predicted at the last position corresponds to a new token that was not in the input sequence, which can then be fed back into the model for autoregressive generation.
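A minimal PyTorch sketch of this shifted next-token Cross-Entropy loss is shown below; it assumes the decoder already produces per-position logits over the token vocabulary and is only meant to illustrate the objective, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def shifted_lm_loss(logits, token_ids):
    """Cross-entropy LM loss with targets shifted one position to the left.

    logits:    (batch, seq_len, vocab_size) decoder outputs.
    token_ids: (batch, seq_len) input token ids (item ids or sub-item tokens).
    """
    # Position i predicts token i+1, so drop the last logit and the first target.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)  # averages -log p(s_i | s_1 .. s_{i-1})
```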

4.2.4. Generating Recommendations with GPTRec

GPTRec supports different strategies for generating recommendation lists:

4.2.4.1. Top-K generation, one-token-per-item

This is the most straightforward generation mode, analogous to how SASRec operates.

  • Process: When GPTRec is used in one-token-per-item mode, each item corresponds to a single token ID. For a given user's interaction sequence, GPTRec outputs a probability distribution $p(s_i | s_1, s_2, \dots, s_{i-1})$ for the next token (item) at each position. The crucial part for Top-K is the last probability distribution, which represents the model's prediction for the next most likely item in the sequence. GPTRec uses these probabilities as item scores and then applies the standard Top-K scoring procedure (as described in Algorithm 1) to select the $K$ items with the highest scores.

  • Relevance to Algorithm 1: This mode directly maps to the Top-K recommendation strategy where the function $f(\text{user}, \text{item})$ in Algorithm 1 represents the probability output by GPTRec for that item being the next in sequence.

    The following is the Algorithm 1 from the original paper:

    Algorithm 1 Top-K Recommendation Strategy
    Require: user is the target user for recommendations
    Require: items is the set of all available items
    Require: K is the number of recommendations to generate
    Require: f is an existing scoring model
    1: procedure TOPKRECOMMEND(user, items, K, f)
    2:     item_scores ← empty list
    3:     for item ∈ items do
    4:         score ← f(user, item)
    5:         item_scores.append((item, score))
    6:     end for
    7:     sorted_items ← sorted(item_scores,
    8:                           key = score, reverse = True)
    9:     recommended_items ← [item
    10:                          for (item, score) in sorted_items[:K]]
    11:    return recommended_items
    12: end procedure

Explanation of Algorithm 1:

  • Input: A user, the set of all available items, the desired number of recommendations $K$, and a scoring model $f$.
  • Line 2-6: For each item in the catalogue, the model $f$ computes a relevance score, and these (item, score) pairs are stored.
  • Line 7-8: The item_scores are sorted in descending order based on their scores.
  • Line 9-10: The top $K$ items from the sorted list are selected as recommended_items.
  • Line 11: The list of recommended_items is returned.
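A minimal Python sketch of Algorithm 1 follows; next_item_probs is a hypothetical helper standing in for the model's softmax output over the next position, not part of the paper's code:

```python
def top_k_recommend(user_history, item_ids, k, next_item_probs):
    """Algorithm 1 sketch: score every item independently and keep the K best.

    next_item_probs(user_history) is assumed to return a dict item_id -> probability
    taken from the model's final softmax over the next position.
    """
    probs = next_item_probs(user_history)
    scored = [(item, probs.get(item, 0.0)) for item in item_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # highest score first
    return [item for item, _ in scored[:k]]
```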

4.2.4.2. Top-K generation, multi-token-per-item

This mode combines GPTRec's multi-token-per-item representation with a Top-K generation approach.

  • Process: GPTRec uses autoregressive generation to produce candidate items. This involves:
    1. Predicting the probability distribution of the next sub-item token.
    2. Sampling a token from this distribution.
    3. Appending the sampled token to the input sequence.
    4. Repeating this process $t$ times (where $t$ is the number of tokens per item) to generate a full sub-item token sequence that represents a candidate item.
  • Scoring Candidates: After generating a candidate item (represented by $t$ tokens), the model scores this candidate using the chain rule from Equation (2), which calculates the probability of the entire sub-item token sequence.
  • Final Top-K Selection: A set of candidate items is generated this way. Then, the standard Top-K strategy (Algorithm 1) is applied to these candidate items. Any item that was not generated as a candidate is implicitly assigned a score of $-\infty$.
  • Challenges:
    • Some generated sub-item token sequences might not correspond to any valid item ID in the original catalogue; these are ignored.
    • Duplicate candidates can be generated.
    • To get a sufficient number of valid, unique recommendations, the number of candidates generated autoregressively must be significantly higher than the desired number of final recommendations (e.g., $K=50$ candidates for a final list of 10).
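The following Python sketch illustrates this candidate-generation loop under stated assumptions: sample_next_token and token_seq_to_item are hypothetical helpers standing in for the model's sampling step and for the inverse of the SVD tokeniser's item-to-token mapping.

```python
import math

def generate_candidates(prefix_tokens, t, num_candidates,
                        sample_next_token, token_seq_to_item):
    """Sketch of Top-K generation in multi-token-per-item mode.

    sample_next_token(tokens) is assumed to return (token_id, prob) sampled from the
    model's next-token distribution; token_seq_to_item maps a t-token tuple to an
    item id, or None if the sequence does not correspond to any catalogue item.
    Returns a dict item_id -> best log-probability score for ranking.
    """
    scores = {}
    for _ in range(num_candidates):
        tokens, log_prob = list(prefix_tokens), 0.0
        for _ in range(t):                       # autoregressively emit t sub-item tokens
            token, prob = sample_next_token(tokens)
            tokens.append(token)
            log_prob += math.log(prob + 1e-12)   # chain rule: sum of log-probabilities
        item = token_seq_to_item(tuple(tokens[-t:]))
        if item is not None:                     # ignore invalid token sequences
            scores[item] = max(scores.get(item, float("-inf")), log_prob)
    return scores                                # items never generated score -inf
```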

4.2.4.3. Next-K generation, one-token-per-item

This is GPTRec's novel sequential generation strategy, equivalent to greedy search in language models.

  • Process: The recommendation list is built iteratively, item-by-item:

    1. At each iteration, GPTRec takes the user's initial interaction sequence and the sequence of already generated recommendations.
    2. It then predicts the next most likely item (token) based on this combined history.
    3. The highest-scored item that has not yet been recommended is added to the recommended_items list.
    4. This process is repeated $K$ times until the full recommendation list is generated.
  • Relevance to Algorithm 2: This mode directly implements the Next-K recommendation strategy.

    The following is the Algorithm 2 from the original paper:

    Algorithm 2 Next-K Recommendation Strategy
    Require: user is the target user for recommendations
    Require: items is the set of all available items
    Require: K is the number of recommendations to generate
    Require: f is an existing scoring model
    1: procedure NEXTKRECOMMEND(user, items, K, f)
    2:     recommended_items ← empty list
    3:     for k = 1 to K do
    4:         max_score ← −∞
    5:         best_item ← None
    6:         for item ∈ items \ recommended_items do
    7:             score ← f(user, recommended_items, item)
    8:             if score > max_score then
    9:                 max_score ← score
    10:                best_item ← item
    11:            end if
    12:        end for
    13:        recommended_items.append(best_item)
    14:    end for
    15:    return recommended_items
    16: end procedure

Explanation of Algorithm 2:

  • Input: A user, the set of all available items, the desired number of recommendations $K$, and a scoring model $f$.

  • Line 2: An empty list recommended_items is initialized to store the recommendations.

  • Line 3-14: This loop runs $K$ times, generating one recommendation at a time.

    • Line 4-5: Initializes max_score and best_item for finding the highest-scoring item in the current iteration.
    • Line 6-12: Iterates through all available items that have not yet been recommended. For each such item, the scoring model $f$ calculates a score. Crucially, this score depends not only on the user and the item but also on the already recommended items (recommended_items). The item with the highest score is identified as best_item. (A minimal Python sketch of this procedure is given after the limitations below.)
    • Line 13: The best_item is added to the recommended_items list.
  • Line 15: The final list of recommended_items is returned.

  • Limitations:

    • Computational Cost: This strategy is more computationally expensive than Top-K because it requires $K$ separate inference steps, where each step involves scoring all available (non-recommended) items.
    • Training-Generation Misalignment: The model is trained with a simple LM objective to predict one next token. The Next-K generation, especially when $K$ is high, involves predicting a sequence of tokens iteratively. This misalignment means that the model is not explicitly optimized for generating a coherent list of $K$ items. Advanced techniques like reinforcement learning (e.g., InstructGPT [16]) would be required to align the training objective with the iterative generation process for optimal performance in Next-K mode. The paper emphasizes that its goal is to see if a model trained for Top-K can still yield meaningful results in Next-K as a starting point for future work.
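A minimal Python sketch of this greedy Next-K procedure (Algorithm 2) is shown below; as in the earlier Top-K sketch, next_item_probs is a hypothetical helper returning the model's next-item probabilities, here conditioned on the history concatenated with the items already recommended:

```python
def next_k_recommend(user_history, item_ids, k, next_item_probs):
    """Algorithm 2 sketch: build the list item by item, conditioning each step
    on what is already in it (greedy decoding over the next-item distribution)."""
    recommended = []
    for _ in range(k):
        # Scores are conditioned on the history *and* the items recommended so far.
        probs = next_item_probs(user_history + recommended)
        candidates = [item for item in item_ids if item not in recommended]
        best = max(candidates, key=lambda item: probs.get(item, float("-inf")))
        recommended.append(best)
    return recommended
```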

4.2.5. Comparison of GPTRec with SASRec and BERT4Rec

The paper explicitly compares GPTRec with SASRec and BERT4Rec, summarizing their key characteristics, especially concerning generation strategy, architecture, training task, and loss function.

The salient characteristics of these models are portrayed in the left side of Table 3 from the paper:

The following are the results from Table 3 of the original paper:

| Model name | Generation Strategy | Architecture | Training Task | Loss | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT4Rec | TopK | Encoder | MLM (Item Masking) | Cross-Entropy (Softmax Loss) | 0.282 | 0.152 |
| GPTRec-TopK | TopK | Decoder | LM (Sequence Shifting) | Cross-Entropy (Softmax Loss) | 0.254 | 0.146 |
| SASRec | TopK | Decoder | LM (Sequence Shifting) | Binary Cross-Entropy | 0.199 | 0.108 |
| GPTRec-NextK | NextK | Decoder | LM (Sequence Shifting) | Cross-Entropy (Softmax Loss) | 0.157 | 0.105 |

Key differences highlighted:

  • SASRec vs. GPTRec:
    • Both use the Transformer decoder and an LM (Sequence Shifting) training task.
    • The primary difference is the loss function: SASRec uses Binary Cross-Entropy, while GPTRec uses Cross-Entropy (Softmax Loss). The paper's experiments show Cross-Entropy to be superior.
    • SASRec is limited to Top-K and one-token-per-item. GPTRec extends this with multi-token-per-item (for memory efficiency) and Next-K (for flexible objectives).
  • BERT4Rec vs. GPTRec:
    • BERT4Rec uses a Transformer encoder and an MLM (Item Masking) training task, which is less aligned with next-item prediction compared to LM.

    • GPTRec uses a Transformer decoder and LM.

    • Both use Cross-Entropy loss.

    • BERT4Rec is limited to Top-K and one-token-per-item. GPTRec offers the memory-efficient multi-token-per-item and the flexible Next-K strategies.

      In summary, GPTRec retains the strong predictive power of Transformer-decoder models while introducing innovations for scalability (SVD Tokenisation) and enhanced flexibility in recommendation generation (Next-K strategy).

5. Experimental Setup

5.1. Datasets

The experiments primarily use the MovieLens-1M dataset [10].

  • Source: MovieLens datasets are widely used benchmarks in recommender systems research, collected by the GroupLens Research Project at the University of Minnesota.

  • Scale and Characteristics:

    • Number of users: 6,040
    • Number of items: 3,416 (movies)
    • Number of interactions: 999,611
    • Average sequence length (interactions per user): 165.49
    • Median sequence length: 96

    A concrete example of a data sample in MovieLens-1M would be a sequence of movie ratings by a specific user, ordered by timestamp. For instance, User 123 watched Movie A (rated 4 stars), then Movie B (rated 5 stars), then Movie C (rated 3 stars), etc. The sequential recommendation task would then be to predict the next movie User 123 is likely to watch.
  • Choice Rationale: Despite acknowledged issues (e.g., users don't necessarily rate movies in the exact order they watch them), MovieLens-1M is consistently used across many research papers. This widespread adoption allows for direct comparison of published results, ensuring the reproducibility and comparability of the experimental findings.

    The following are the results from Table 2 of the original paper:

    Number of users: 6,040
    Number of items: 3,416
    Number of interactions: 999,611
    Average sequence length: 165.49
    Median sequence length: 96

5.2. Evaluation Metrics

The paper uses two common evaluation metrics for sequential recommendation: Recall@10 and NDCG@10. The @10 indicates that the top 10 recommendations are considered for evaluation.

  • Recall@K:

    • Conceptual Definition: Recall@K measures the fraction of relevant items that are successfully retrieved within the top $K$ items recommended to a user. It quantifies how complete the recommendation list is in terms of covering all known relevant items.
    • Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\{\text{relevant items}\} \cap \{\text{recommended items at K}\}|}{|\{\text{relevant items}\}|} $
    • Symbol Explanation:
      • $|\{\text{relevant items}\} \cap \{\text{recommended items at K}\}|$: The count of relevant items that are present in the top $K$ recommended items.
      • $|\{\text{relevant items}\}|$: The total count of relevant items for the specific user in the test set.
  • NDCG@K (Normalized Discounted Cumulative Gain at K):

    • Conceptual Definition: NDCG@K is a metric that assesses the quality of a ranking by considering both the relevance of the recommended items and their positions. Highly relevant items ranked higher in the list contribute more significantly to the score. It is normalized by the Ideal DCG to provide a score between 0 and 1, making it comparable across different queries.
    • Mathematical Formula: First, Discounted Cumulative Gain (DCG) at $K$ is calculated: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ Then, the Ideal DCG (IDCG) at $K$ is calculated, which is the DCG of the perfectly sorted (by relevance) list: $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_{i_{ideal}}} - 1}{\log_2(i+1)} $ Finally, NDCG@K is the ratio of DCG@K to IDCG@K: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $K$: The number of top recommendations being evaluated.
      • $rel_i$: The relevance score of the item at position $i$ in the actual recommendation list. In many sequential recommendation setups, this is binary (1 if the item is the true next item, 0 otherwise, or 1 if the item is in the relevant set, 0 otherwise).
      • $rel_{i_{ideal}}$: The relevance score of the item at position $i$ in the ideal ranking (where all relevant items are ranked before non-relevant ones, and highly relevant items are ranked first).
      • $\log_2(i+1)$: A logarithmic discounting factor, which penalizes relevant items that appear lower in the ranked list.
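A short Python sketch of both metrics under binary relevance (as used with the paper's leave-one-out protocol, where each user has a single held-out relevant item) is given below; it is illustrative rather than the evaluation code used in the paper:

```python
import math

def recall_at_k(recommended, relevant, k=10):
    """Fraction of relevant items that appear in the top-K recommendations."""
    top_k = set(recommended[:k])
    return len(top_k & set(relevant)) / len(relevant)

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)              # (2^rel - 1) = 1 for each relevant item
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

# Leave-one-out example: one held-out item per user; it is found at rank 3.
print(ndcg_at_k(["a", "b", "c"], relevant={"c"}))  # 1/log2(4) = 0.5
```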

5.3. Baselines

The paper compares GPTRec against two popular and state-of-the-art Transformer-based sequential recommendation models:

  • BERT4Rec [26]: This model adapts the Transformer encoder architecture and is trained using a Masked Language Modeling (MLM) objective. It is representative of bidirectional Transformer approaches.

  • SASRec [13]: This model adapts the Transformer decoder architecture and is trained using a Language Modeling (LM) objective. It is representative of autoregressive Transformer approaches, and is the most structurally similar to GPTRec.

    These baselines are chosen because they represent the current state-of-the-art in Transformer-based sequential recommendation, allowing for a direct and relevant comparison of GPTRec's performance. The results for SASRec and BERT4Rec are copied from a previous reproducibility study [18] by the authors.

5.4. Implementation Details and Hyperparameters

  • Frameworks: GPTRec is implemented using the GPT-2 architecture from the HuggingFace Transformers [31] library, integrated into the aprec framework (from a recent replicability study [18]).
  • Evaluation Strategy: A leave-one-out strategy is used for testing, meaning the latest interaction of each user is held out for the test set.
  • Validation Set: For 128 randomly selected users, their second-to-last interaction is held out to form a validation set. This set is used for early stopping and monitoring model quality during training.
  • Model Depth: A relatively shallow Transformer model with three Transformer blocks is used.
  • Sequence Length: The maximum sequence length is set to 100. User interaction sequences longer than 100 are truncated to the last 100 interactions.
  • Embedding Dimension: 256-dimensional token embeddings are used for all experiments.
  • Early Stopping: Training is terminated if the validation NDCG@10 metric does not improve for 300 epochs, preventing overfitting and ensuring generalization.
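For illustration, the reported hyper-parameters roughly correspond to the following HuggingFace Transformers configuration; the number of attention heads and the way the 100-item limit translates into a token-level n_positions in multi-token mode are not stated in the paper and are assumptions here:

```python
from transformers import GPT2Config, GPT2LMHeadModel

t, v = 4, 512                        # example sub-item tokenisation parameters
config = GPT2Config(
    vocab_size=t * v,                # sub-item token vocabulary (t*v tokens)
    n_positions=100 * t,             # assumed: up to 100 items, each expanded into t tokens
    n_embd=256,                      # 256-dimensional token embeddings (Section 5.4)
    n_layer=3,                       # three Transformer blocks (Section 5.4)
    n_head=4,                        # not reported in the paper; assumed here
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```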

6. Results & Analysis

6.1. Core Results Analysis

The experiments aim to answer three research questions (RQ1, RQ2, RQ3) regarding GPTRec's performance.

6.1.1. RQ1: Comparison with existing transformer-based models

This research question investigates how GPTRec performs in its one-token-per-item mode (where each item is a single token) against BERT4Rec and SASRec. Two variations of GPTRec are evaluated: GPTRec-TopK (using the Top-K recommendation strategy) and GPTRec-NextK (using the Next-K recommendation strategy).

The following are the results from Table 3 of the original paper:

| Model name | Generation Strategy | Architecture | Training Task | Loss | Recall@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT4Rec | TopK | Encoder | MLM (Item Masking) | Cross-Entropy (Softmax Loss) | 0.282 | 0.152 |
| GPTRec-TopK | TopK | Decoder | LM (Sequence Shifting) | Cross-Entropy (Softmax Loss) | 0.254 | 0.146 |
| SASRec | TopK | Decoder | LM (Sequence Shifting) | Binary Cross-Entropy | 0.199 | 0.108 |
| GPTRec-NextK | NextK | Decoder | LM (Sequence Shifting) | Cross-Entropy (Softmax Loss) | 0.157 | 0.105 |

Analysis:

  • GPTRec-TopK vs. Baselines:

    • GPTRec-TopK achieves an NDCG@10 of 0.146. This is significantly better than SASRec's 0.108 (a +35% improvement) and very close to BERT4Rec's 0.152 (only -4% lower).
    • In terms of Recall@10, GPTRec-TopK scores 0.254, which is better than SASRec's 0.199, but slightly worse than BERT4Rec's 0.282.
    • Insight: The strong performance of GPTRec-TopK compared to SASRec is particularly noteworthy. Both models share a Transformer decoder architecture and an LM (Sequence Shifting) training task. The primary difference is GPTRec's use of Cross-Entropy loss versus SASRec's Binary Cross-Entropy. This result empirically confirms the paper's claim that Cross-Entropy loss is more beneficial for next-item prediction than Binary Cross-Entropy. GPTRec-TopK's comparable performance to BERT4Rec also validates its effectiveness as a state-of-the-art Transformer-based sequential recommender in the conventional Top-K setting.
  • GPTRec-NextK Performance:

    • GPTRec-NextK shows an NDCG@10 of 0.105 and Recall@10 of 0.157. This is lower than GPTRec-TopK and BERT4Rec.
    • Insight: This result is expected, as discussed in Section 4.2.4.3. GPTRec-NextK is used in a generation mode (Next-K) that is misaligned with its training objective (LM for a single next token). Despite this misalignment, its NDCG@10 (0.105) is very similar to SASRec's (0.108), indicating that it provides a strong foundation. The authors suggest that with more advanced training techniques (like reinforcement learning) specifically designed to align with the Next-K generation strategy, its performance could be significantly improved. This establishes GPTRec-NextK as a viable starting point for research into complex recommendation objectives.

6.1.2. RQ2: Effect of the number of tokens and the number of tokens per item in multi-token-per-item mode

This question explores the impact of SVD Tokenisation parameters (number of tokens per item tt and number of values per token vv) on GPTRec's performance and memory footprint.

The following figure (Figure 2 from the original paper) shows the performance in multi-token-per-item mode:

The figure (Figure 2 from the original paper) shows Recall@10 and NDCG@10 of GPTRec on MovieLens-1M for different Tokens per Item and Values per Token settings; the two subplots compare Recall@10 and NDCG@10 against the SASRec and GPTRec-TopK reference levels.

Analysis:

  • Performance Degradation vs. one-token-per-item: The multi-token-per-item mode generally shows degraded performance compared to GPTRec-TopK in one-token-per-item mode (which achieved NDCG@10 of 0.146). The highest NDCG@10 in multi-token-per-item mode is around 0.124 (t=2,v=2048t=2, v=2048 or t=4,v=2048t=4, v=2048).
  • Competitiveness with SASRec: Despite the degradation, GPTRec in multi-token-per-item mode remains competitive with SASRec. For example, with t=4t=4 tokens per item and v=512v=512 values per token, GPTRec almost matches SASRec's NDCG@10 (both 0.108). However, its Recall@10 (0.182) is still slightly worse than SASRec's (0.199).
  • Impact of vv (values per token): The figure clearly shows that both Recall@10 and NDCG@10 increase as the number of values per token (vv) increases. For t=2t=2, NDCG@10 rises from 0.0914 (v=128v=128) to 0.124 (v=2048v=2048). A larger vv means finer granularity in quantization, allowing for a more precise representation of item embeddings, thus improving performance.
  • Impact of tt (tokens per item): Increasing the number of tokens per item (tt) generally improves performance. For most vv values, t=4t=4 outperforms t=2t=2. The exception is when v=2048v=2048, where t=2t=2 and t=4t=4 achieve similar NDCG@10 (around 0.124).
  • Memory vs. Performance Trade-off: The primary benefit of multi-token-per-item mode is memory reduction. As shown in Table 1 (Section 4.2.2), with t=4 and v=512 the MovieLens-1M embedding table shrinks to 2.00 MB, which is 59.953% of the 3.34 MB required in one-token-per-item mode; for much larger datasets the percentage can drop below 1%. This demonstrates the critical trade-off: GPTRec can achieve quality comparable to SASRec while significantly reducing GPU memory consumption, which is vital for scaling to large item catalogues. The slight performance hit relative to GPTRec-TopK in one-token-per-item mode is the cost of this memory efficiency (the arithmetic behind these figures is sketched after this list).
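
The memory figures quoted above follow from simple arithmetic over embedding-table rows: one row per item in one-token-per-item mode versus one row per sub-id token in multi-token-per-item mode. The sketch below reproduces the numbers under the assumptions that the sub-id vocabulary has t × v entries (one per position-value pair) and that embeddings are 256-dimensional float32 vectors; these assumptions are chosen because they happen to reproduce the quoted 3.34 MB and 2.00 MB values, and are not taken directly from the paper.

```python
# Rough embedding-table sizes for one-token-per-item vs. SVD-tokenised GPTRec.
BYTES_PER_FLOAT = 4
EMB_DIM = 256          # assumed embedding dimension (not taken from the paper)

def table_size_mb(num_rows: int, dim: int = EMB_DIM) -> float:
    """Size in MiB of an embedding table with `num_rows` rows of float32 values."""
    return num_rows * dim * BYTES_PER_FLOAT / 2**20

num_items = 3416                        # MovieLens-1M catalogue size
one_token = table_size_mb(num_items)    # one embedding row per item id

t, v = 4, 512                           # tokens per item, values per token
multi_token = table_size_mb(t * v)      # one row per (position, value) sub-id token

print(f"one-token-per-item: {one_token:.2f} MB")                      # ~3.34 MB
print(f"t=4, v=512 tokens : {multi_token:.2f} MB "
      f"({100 * multi_token / one_token:.1f}% of the baseline)")      # ~2.00 MB, ~60%
# The t*v row count is fixed by the tokeniser, so for catalogues with millions of
# items the multi-token table shrinks to well below 1% of the one-token table.
```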

6.1.3. RQ3: Effect of cutoff K in Next-K generation

This question examines how GPTRec's performance changes with increasing cutoff K when comparing Top-K and Next-K recommendation strategies.

The following figure (Figure 3 from the original paper) compares GPTRec's Top-K and Next-K strategies:

Figure 3 (from the original paper): NDCG@K of the Next-K and Top-K recommendation strategies, in absolute (a) and relative (b) terms. Panel (a) shows Top-K outperforming Next-K on NDCG@K; panel (b) shows the relative performance of Next-K declining as K increases.

Analysis:

  • K=1 Equivalence: At K=1, the Top-K and Next-K strategies show identical performance. When recommending only one item, both strategies simply select the item with the highest predicted probability, so they behave identically.
  • Performance Degradation with Increasing K for Next-K: As the cutoff K increases, the NDCG@K of the Next-K strategy gradually degrades relative to the Top-K strategy. Figure 3b shows that at K=10, the Next-K strategy's quality drops to approximately 75% of the Top-K strategy's quality.
  • Insight into Degradation: This degradation is an expected consequence of the training-generation misalignment. The model is trained to predict a single next item, not an entire sequence of K items. When generating multiple items sequentially in Next-K mode, errors can compound, and the model is not explicitly optimized for the coherence or quality of the entire list.
  • Next-K Competitiveness with SASRec: Despite the degradation, Figure 3a shows that even at K=10, GPTRec with the Next-K strategy (approx. 0.105 NDCG@10) performs similarly to SASRec (0.108 NDCG@10). This is a significant finding: it suggests that even without dedicated tuning for Next-K generation (e.g., via reinforcement learning), GPTRec can produce recommendations of reasonable quality in this more flexible mode, making it a strong starting point for future research that optimises Next-K generation for complex objectives (both generation modes are sketched in code after this list).
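
To make the two generation modes concrete, the sketch below contrasts a single-pass Top-K cut-off with a greedy Next-K loop that re-scores the catalogue after each recommendation and masks items already in the list. The `score_items` function is a hypothetical stand-in for one forward pass of a trained GPTRec model, and greedy selection is an assumed decoding choice, not necessarily the paper's.

```python
import numpy as np

def score_items(history: list[int], num_items: int) -> np.ndarray:
    """Hypothetical stand-in for one forward pass of a trained model: returns a
    score for every item in the catalogue given the interaction sequence so far."""
    rng = np.random.default_rng(abs(hash(tuple(history))) % 2**32)
    return rng.random(num_items)

def recommend_top_k(history: list[int], num_items: int, k: int) -> list[int]:
    """Top-K: a single forward pass, then take the k highest-scoring items."""
    scores = score_items(history, num_items)
    return list(np.argsort(-scores)[:k])

def recommend_next_k(history: list[int], num_items: int, k: int) -> list[int]:
    """Next-K: generate item-by-item, conditioning each step on the items
    already recommended and never repeating an item."""
    recommended: list[int] = []
    for _ in range(k):
        scores = score_items(history + recommended, num_items)
        scores[recommended] = -np.inf          # mask already-recommended items
        recommended.append(int(np.argmax(scores)))
    return recommended

print(recommend_top_k([10, 42, 7], num_items=3416, k=10))
print(recommend_next_k([10, 42, 7], num_items=3416, k=10))
```

The key operational difference is cost: Top-K needs one forward pass per user, while Next-K needs K passes, which is the inference overhead raised later in the critique.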

6.2. Data Presentation (Tables)

All relevant tables from the original paper have been transcribed and presented in the methodology and experimental setup sections:

  • Table 1: Illustrates the memory savings achieved with SVD Tokenisation (Section 4.2.2).
  • Table 2: Salient characteristics of the MovieLens-1M dataset (Section 5.1).
  • Table 3: Comparison of GPTRec with SASRec and BERT4Rec (Section 4.2.5 and 6.1.1).

6.3. Ablation Studies / Parameter Analysis

The paper performs a form of parameter analysis in RQ2 by varying the number of tokens per item (t) and the number of values per token (v) in the SVD Tokenisation algorithm. This analysis serves as an implicit ablation study for the multi-token-per-item mode, showing how different configurations affect the trade-off between memory efficiency and recommendation quality.

  • The results show that NDCG@10 and Recall@10 generally improve with an increased number of values per token (v), indicating that finer-grained quantisation helps retain more information from the original item embeddings (a rough sketch of this quantisation step follows the list).
  • Similarly, increasing the number of tokens per item (t) also tends to improve performance, suggesting that more latent components allow for a richer representation of items.
  • These analyses help in understanding the optimal configuration for SVD Tokenisation to balance memory reduction with performance. No explicit ablation studies on architectural components of GPTRec itself (e.g., impact of Transformer blocks) were reported, as the focus was on the novel tokenisation and generation strategies.
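
As a rough illustration of the quantisation step discussed in the first bullet above, the sketch below takes a toy user-item interaction matrix, computes a truncated SVD, keeps t latent components per item, and quantises each component into v buckets. The toy matrix and the quantile-based bucket boundaries are assumptions made for illustration; the paper's exact discretisation scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items = 1000, 200     # toy sizes for illustration
t, v = 4, 8                          # tokens per item, values per token

# Toy implicit-feedback interaction matrix (users x items, ~5% density).
interactions = (rng.random((num_users, num_items)) < 0.05).astype(np.float32)

# Truncated SVD of the interaction matrix; keep the top t latent components per item.
# A little Gaussian noise ensures no two items share identical embeddings,
# mirroring the small Gaussian noise the paper adds for the same purpose.
_, _, vt = np.linalg.svd(interactions, full_matrices=False)
item_emb = vt[:t].T + rng.normal(scale=1e-6, size=(num_items, t))

# Quantise each latent component into v buckets; item i is then represented by
# the sequence of its t bucket indices (its sub-id tokens).
tokens = np.empty((num_items, t), dtype=np.int64)
for comp in range(t):
    boundaries = np.quantile(item_emb[:, comp], np.linspace(0, 1, v + 1)[1:-1])
    tokens[:, comp] = np.digitize(item_emb[:, comp], boundaries)

print(tokens[:5])   # each row: the t sub-id tokens (values in [0, v)) for one item
```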

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces GPTRec, a novel generative sequential recommendation model built upon the GPT-2 architecture. GPTRec addresses two critical limitations of current Transformer-based recommenders: the massive memory footprint due to large item vocabularies and the inflexibility of the Top-K strategy for complex recommendation objectives.

The key contributions and findings are:

  1. Memory Efficiency: The proposed SVD Tokenisation algorithm effectively splits item IDs into sub-item tokens based on quantised SVD embeddings. This drastically reduces the size of the embedding table (e.g., 40% reduction on MovieLens-1M compared to SASRec while matching quality), making GPTRec scalable to much larger item catalogues.

  2. Flexible Generation: The Next-K recommendation strategy allows GPTRec to generate recommendations item-by-item, conditioning each new item on previously recommended ones. This opens possibilities for optimizing for complex, interdependent objectives like diversity or coherence.

  3. Strong Performance: In its standard one-token-per-item Top-K mode, GPTRec performs comparably to BERT4Rec and outperforms SASRec on MovieLens-1M, highlighting the benefit of Cross-Entropy loss. Even in the Next-K mode (which was not specifically tuned), GPTRec achieves NDCG@10 results similar to SASRec, establishing a strong baseline for future research.

    Overall, the paper concludes that generative sequential recommendation is a highly promising direction, offering tangible benefits in memory efficiency and adaptability for complex recommendation tasks.

7.2. Limitations & Future Work

The authors openly acknowledge several limitations and propose avenues for future research:

  • Limited Dataset Evaluation: The primary evaluation was conducted on MovieLens-1M. Future work will expand evaluation to more, particularly larger, datasets where the memory efficiency gains from multi-token-per-item mode would be most pronounced.
  • Basic Generation Techniques: The current multi-token-per-item Top-K generation relies on simple sampling, which can suffer from diversity problems. More advanced generation techniques, such as Beam Search, will be explored to mitigate this.
  • Unaligned Next-K Training: The Next-K strategy, while promising, currently suffers from training-generation misalignment. The model is trained for single next-item prediction but used for sequential generation. Future work will focus on more advanced training objectives, such as reinforcement learning (e.g., inspired by InstructGPT), to explicitly tune the model for Next-K generation and complex objectives like diversity and coverage.
  • Hyperparameter Tuning: The current experiments used a limited set of hyperparameter settings. More extensive hyperparameter tuning could further improve performance.
  • Adaptation to Generative IR: The concepts of SVD Tokenisation and the Next-K strategy have direct applicability to the broader field of Generative Information Retrieval (IR). The authors plan to adapt these methods to address large document ID vocabularies and complex search result problems (e.g., diversity, fairness) in Generative IR.

7.3. Personal Insights & Critique

This paper presents a compelling argument for shifting from discriminative score-and-rank to generative approaches in sequential recommendation, especially given the challenges of scale and objective complexity.

  • Innovation and Potential: The SVD Tokenisation is a clever and practical solution to the embedding table explosion, directly drawing inspiration from sub-word tokenisation in NLP. Its ability to achieve comparable performance with significant memory savings is a major step forward, particularly for real-world applications with vast item catalogs. The Next-K strategy is conceptually powerful; it's intuitive that considering previous recommendations should lead to better list-level coherence and diversity. The fact that GPTRec-NextK matches SASRec's NDCG@10 without explicit tuning for this mode is indeed a strong starting point, suggesting inherent potential.

  • Potential Issues/Unverified Assumptions:

    • SVD Tokenisation Dependence on Interaction Matrix: The quality of SVD Tokenisation relies heavily on the quality and density of the user-item interaction matrix. For very sparse datasets or new items (cold-start problem), the SVD embeddings might be less reliable, potentially impacting tokenization quality. The "small Gaussian noise" helps prevent identical embeddings, but fundamentally, sparse data will yield less informative SVD components. This dependency could be a limitation when applying it to domains with extremely sparse interaction data.
    • Loss of Information in Quantization: While memory efficient, quantization inherently involves some loss of information from the continuous SVD embeddings. The paper shows that increasing vv (number of values per token) improves performance, indicating that finer-grained quantization is beneficial. This trade-off between quantization granularity, memory, and information loss is critical.
    • Computational Cost of Next-K: While conceptually powerful, the Next-K strategy is stated to be "more computationally expensive" during inference. For real-time recommendation systems, the need to re-score the candidate set (the catalogue excluding already-recommended items) at each of the K generation steps could be a bottleneck, especially if the item catalogue remains large even after filtering. The paper's current evaluation on MovieLens-1M (3,416 items) might not fully expose this challenge compared to datasets with millions of items.
    • Complexity of Reinforcement Learning for Next-K: Aligning the Next-K generation with a suitable training objective using reinforcement learning is a non-trivial task. Defining appropriate reward functions for diversity, serendipity, or coherence in an end-to-end differentiable manner can be challenging.
  • Transferability to Other Domains: The methods presented are highly transferable:

    • Generative IR: The authors explicitly mention this. SVD Tokenisation could indeed solve the large document ID vocabulary problem, and Next-K could be used to generate diverse search result lists.

    • Any Sequential Prediction with Large Categorical Output: Fields beyond recommendation or IR that involve predicting a sequence of discrete, high-cardinality items could benefit. For instance, predicting sequences of actions in complex systems or medical treatment plans, where the "items" are specific interventions or medications.

      Overall, GPTRec provides a solid framework and practical solutions for addressing some long-standing scalability and flexibility issues in sequential recommendation. The work encourages a more holistic, generative view of recommendation, moving beyond simple next-item prediction to the generation of contextually rich and diverse recommendation lists.
