
Order-agnostic Identifier for Large Language Model-based Generative Recommendation

Published: 02/15/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents an order-agnostic identifier design for LLM-based generative recommendations, addressing efficiency and performance issues. By integrating CF and semantic information using the SETRec framework, it significantly enhances recommendation effectiveness and generation efficiency.

Abstract

Leveraging Large Language Models (LLMs) for generative recommendation has attracted significant research interest, where item tokenization is a critical step. It involves assigning item identifiers for LLMs to encode user history and generate the next item. Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings. Token-sequence identifiers face issues such as the local optima problem in beam search and low generation efficiency due to step-by-step generation. In contrast, single-token identifiers fail to capture rich semantics or encode Collaborative Filtering (CF) information, resulting in suboptimal performance. To address these issues, we propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information to fully capture multi-dimensional item information, and 2) designing order-agnostic identifiers without token dependency, mitigating the local optima issue and achieving simultaneous generation for generation efficiency. Accordingly, we introduce a novel set identifier paradigm for LLM-based generative recommendation, representing each item as a set of order-agnostic tokens. To implement this paradigm, we propose SETRec, which leverages CF and semantic tokenizers to obtain order-agnostic multi-dimensional tokens. To eliminate token dependency, SETRec uses a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. We instantiate SETRec on T5 and Qwen (from 1.5B to 7B). Extensive experiments demonstrate its effectiveness under various scenarios (e.g., full ranking, warm- and cold-start ranking, and various item popularity groups). Moreover, results validate SETRec's superior efficiency and show promising scalability on cold-start items as model sizes increase.


1. Bibliographic Information

1.1. Title

Order-agnostic Identifier for Large Language Model-based Generative Recommendation

1.2. Authors

1.3. Journal/Conference

Published at SIGIR '25, July 13-18, 2025, Padua, Italy. The ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) is a premier international forum for the presentation of new research results and for the demonstration of new systems and techniques in information retrieval and recommender systems. Its reputation is highly influential in the relevant field, indicating a high-quality publication.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses critical issues in item tokenization for Large Language Model (LLM)-based generative recommendation. It identifies problems with existing token-sequence identifiers (local optima in beam search, low generation efficiency) and single-token identifiers (failure to capture rich semantics or Collaborative Filtering (CF) information). To overcome these, the authors propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information, and 2) designing order-agnostic identifiers without token dependency to mitigate local optima and enable simultaneous generation.

Based on these principles, the paper introduces a novel set identifier paradigm, representing each item as a set of order-agnostic tokens. To implement this, they propose SETRec, which uses CF and semantic tokenizers to obtain multi-dimensional tokens. SETRec eliminates token dependency through a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. The method is instantiated on T5 and Qwen (1.5B to 7B models). Extensive experiments on four datasets demonstrate its effectiveness across various scenarios (full ranking, warm-/cold-start, item popularity groups), superior efficiency, and promising scalability for cold-start items with increasing model sizes.

https://arxiv.org/abs/2502.10833v2 (Preprint) https://arxiv.org/pdf/2502.10833v2.pdf (PDF Link) These links point to the arXiv preprint version of the paper accepted at SIGIR '25.

2. Executive Summary

2.1. Background & Motivation

The recent success of Large Language Models (LLMs) in personalized recommendation has sparked significant research interest. A critical step in LLM-based generative recommendation is item tokenization, which involves assigning unique identifiers to items. These identifiers allow LLMs to encode a user's historical interactions and generate the next recommended item.

However, existing item tokenization approaches face significant challenges:

  1. Token-sequence identifiers: These represent items as sequences of discrete tokens (e.g., item titles, generated tags).
    • Local Optima Problem: When LLMs use beam search for autoregressive generation (generating tokens one by one), they greedily select sequences with the highest probabilities. If the initial tokens of a target item identifier have low probabilities, they might be pruned early, preventing the correct item from ever being generated, even if the complete sequence is highly relevant. This leads to sub-optimal recommendations.
    • Low Generation Efficiency: Autoregressive generation requires multiple, sequential LLM calls for each token in the sequence. This is computationally expensive and slow, posing a major barrier to real-world deployment, especially for large models.
  2. Single-token identifiers: These represent each item with a single continuous token, typically an ID embedding or a semantic embedding.
    • Suboptimal Performance:
      • ID embeddings (e.g., from Collaborative Filtering models) heavily rely on abundant interaction data. They struggle with long-tailed items (items with few interactions) or cold-start items (new items with no interactions), as there isn't enough data to learn meaningful representations.

      • Semantic embeddings (e.g., from pre-trained text encoders) capture rich item content but often overlook Collaborative Filtering (CF) information, which is crucial for personalized recommendations based on user behavior patterns.

        The core problem the paper aims to solve is how to design item identifiers that enable effective and efficient LLM-based recommendations, overcoming the limitations of both token-sequence and single-token approaches. The paper's entry point is the recognition that item identifiers need to capture both CF and semantic information in an order-agnostic manner to address these issues.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. Fundamental Principles for Item Identifier Design: The authors propose two key principles:
    • Integration of semantic and CF information: This allows leveraging LLMs' knowledge for generalization (e.g., cold-start) while incorporating user behavior for rich personalization.
    • Order-agnostic Identifier: This principle suggests representing multi-dimensional item information as a set of tokens without inherent ordering, thereby mitigating the local optima problem and facilitating simultaneous generation for efficiency.
  2. Novel Set Identifier Paradigm: Based on these principles, the paper introduces a new paradigm for LLM-based generative recommendation, where each item is represented as a set of order-agnostic tokens that integrate both CF and semantic information.
  3. SETRec Framework: The paper proposes SETRec as an effective implementation of this new paradigm. Key technical innovations within SETRec include:
    • CF and Semantic Tokenizers: To obtain order-agnostic multi-dimensional tokens.
    • Sparse Attention Mask: For user history encoding, it specifically discards token dependencies within an item's identifier while retaining dependencies on previous item identifiers, ensuring order agnosticism and boosting efficiency.
    • Query-Guided Generation Mechanism: Employs learnable query vectors to guide LLMs to simultaneously generate tokens for each specific information dimension, addressing the challenge of generating multiple independent tokens.
    • Token Set Grounding Strategy: Collects tokens from all items as grounding heads to effectively map generated token sets to existing items.
  4. Extensive Experimental Validation: SETRec is instantiated on T5 and Qwen (from 1.5B to 7B models) and evaluated on four real-world datasets. The main findings include:
    • Effectiveness: SETRec significantly outperforms existing baselines across various scenarios, including full ranking, warm-start, and particularly cold-start recommendations, as well as different item popularity groups.

    • Efficiency: SETRec demonstrates superior inference efficiency, achieving substantial speedups (e.g., an average 15× speedup on Toys compared to token-sequence identifiers) due to simultaneous generation.

    • Generalization and Scalability: SETRec shows strong generalization across different LLM architectures and promising scalability on cold-start items as model sizes increase, suggesting that larger models can better leverage its semantic understanding capabilities.

      These findings collectively solve the identified problems by offering a more robust, efficient, and semantically rich item tokenization strategy for LLM-based generative recommendation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the transformer architecture, that have been trained on vast amounts of text data (e.g., books, articles, websites). This pre-training allows them to learn complex patterns, grammar, and world knowledge, making them proficient in various natural language processing tasks like text generation, translation, summarization, and question answering. In the context of recommendation, LLMs are leveraged for their ability to understand complex user behaviors and diverse item characteristics, often by treating recommendation as a sequence generation task.

Generative Recommendation

Generative recommendation is an emerging paradigm in recommender systems where, instead of merely predicting a score for existing items or retrieving items from a candidate pool, the model directly generates the identifiers or representations of items that a user might like. This often involves using LLMs to generate tokens that correspond to items or item attributes, allowing for more flexible and potentially novel recommendations.

Item Tokenization

Item tokenization is the process of converting items (e.g., products, movies, articles) into a format that LLMs can understand and process. This usually means representing each item as one or more numerical tokens or embeddings. It's a critical step because the quality of these item identifiers directly impacts the LLM's ability to encode user history accurately and generate relevant recommendations.

Autoregressive Generation

Autoregressive generation is a common method for sequence generation, where each token in a sequence is generated one at a time, conditioned on all previously generated tokens and the input context. For example, when generating a sentence, the model predicts the first word, then the second word based on the first, then the third based on the first two, and so on. In LLM-based recommendation using token-sequence identifiers, autoregressive generation is used to generate the sequence of tokens that represents the recommended item.

Beam Search

Beam search is a heuristic search algorithm often used in sequence generation tasks (like autoregressive generation) to find the most probable sequence of tokens. Instead of exploring all possible token combinations (which is computationally prohibitive), beam search keeps track of the $K$ most promising partial sequences (the "beam") at each step. When generating the next token, it extends all $K$ partial sequences with all possible next tokens, then selects the top $K$ new partial sequences to continue the search.

  • Local Optima Problem: A significant drawback of beam search is its greedy nature. If the globally optimal sequence (e.g., the identifier for the target item) starts with a token that has a relatively low probability at an early step, that partial sequence might be discarded from the beam, preventing the model from ever reaching the true target item, even if the subsequent tokens would have made it a high-probability sequence. This is what the paper refers to as the local optima problem.
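
To make the local optima issue concrete, here is a toy sketch with invented probabilities (not taken from the paper): the globally most probable two-token identifier starts with a low-probability first token, so beam search prunes it at step one and never recovers it.

```python
import math

# Invented conditional probabilities p(token | prefix) for a two-step generation task.
step1 = {"a": 0.40, "b": 0.35, "c": 0.25}
step2 = {
    "a": {"x": 1 / 3, "y": 1 / 3, "z": 1 / 3},
    "b": {"x": 1 / 3, "y": 1 / 3, "z": 1 / 3},
    "c": {"x": 0.95, "y": 0.03, "z": 0.02},   # ("c", "x") is the true target identifier
}

def beam_search(width):
    # Step 1: keep only the `width` most probable first tokens (greedy pruning).
    beams = sorted(
        [((t,), math.log(p)) for t, p in step1.items()],
        key=lambda b: b[1], reverse=True,
    )[:width]
    # Step 2: extend every surviving prefix and keep the best complete sequences.
    expanded = [
        (prefix + (t,), lp + math.log(p))
        for prefix, lp in beams
        for t, p in step2[prefix[0]].items()
    ]
    return sorted(expanded, key=lambda b: b[1], reverse=True)[:width]

exhaustive_best = max(
    (((t1, t2), step1[t1] * step2[t1][t2]) for t1 in step1 for t2 in step2[t1]),
    key=lambda x: x[1],
)
print("exhaustive best:", exhaustive_best)    # (('c', 'x'), 0.2375)
print("beam (width 2) :", beam_search(2)[0])  # an ('a', ...) sequence: prefix 'c' was pruned early
```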

Collaborative Filtering (CF)

Collaborative Filtering (CF) is a traditional and highly effective technique in recommender systems that makes recommendations based on the preferences or behaviors of similar users or items. The core idea is that if users have agreed in the past (e.g., by buying the same items), they will agree in the future. CF methods typically rely on user-item interaction data (e.g., ratings, purchases, clicks) to find these similarities.

  • CF information: Refers to the patterns and similarities derived from user-item interactions, which are essential for personalized recommendations.

Semantic Embeddings

Semantic embeddings are numerical vector representations of items that capture their inherent meaning or characteristics. These embeddings are typically learned from textual descriptions (e.g., titles, descriptions, categories) or other rich metadata associated with items. Models like SentenceT5 (mentioned in the paper) can generate such embeddings. Semantic embeddings are valuable because they can generalize to cold-start items (items with no interaction history) by using their content information.

Attention Mechanism

The attention mechanism is a core component of transformer models, including LLMs. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing a specific token. Instead of treating all input tokens equally, attention enables the model to focus on the most relevant parts. The general formula for Scaled Dot-Product Attention (a common form of attention) is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  • $Q, K, V \in \mathbb{R}^{n \times d_k}$, where $n$ is the sequence length and $d_k$ is the dimension of the keys (and queries).
  • $QK^T$ calculates the dot-product similarity between each query and all keys.
  • $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with very small gradients.
  • $\mathrm{softmax}(\cdot)$ normalizes the scores to obtain attention weights.
  • The result is a weighted sum of the Value vectors, where the weights indicate the importance of each Value to the Query.
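
As a quick illustration of the formula above, the following NumPy sketch (with arbitrary toy shapes) computes single-head, unmasked scaled dot-product attention.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention matching the formula above (no masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity scores
    scores -= scores.max(axis=-1, keepdims=True)      # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))                           # 4 query tokens, d_k = 8
K = rng.normal(size=(6, 8))                           # 6 key/value tokens
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```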

Autoencoders (AE)

An Autoencoder (AE) is a type of neural network that learns efficient data codings (representations) in an unsupervised manner. It consists of two main parts:

  1. Encoder: Maps the input data into a lower-dimensional latent space representation (the embedding).
  2. Decoder: Reconstructs the input data from the latent space representation. The AE is trained to minimize the reconstruction error between the input and its reconstructed output. In this paper, an AE is used as a semantic tokenizer to compress rich semantic information into a set of order-agnostic semantic embeddings.

LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning technique for large pre-trained models like LLMs. Instead of fine-tuning all parameters of the large model, LoRA injects small, trainable low-rank matrices into the transformer blocks (specifically, in the attention layers). During fine-tuning, the original pre-trained weights remain frozen, and only these much smaller LoRA matrices are updated. This significantly reduces the number of trainable parameters and computational cost, making it feasible to fine-tune very large models on domain-specific tasks with limited resources. The paper uses LoRA for fine-tuning Qwen models.
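
A minimal PyTorch sketch of the LoRA idea, with illustrative rank and scaling values rather than the paper's actual fine-tuning configuration: the base weight stays frozen and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x.
    Rank and scaling here are illustrative, not the paper's settings."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init so training starts at W x
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only A and B are trainable: 8 * 512 + 512 * 8 = 8192 parameters
```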

3.2. Previous Works

The paper categorizes previous works on item identifiers for LLM-based generative recommendation into two main groups:

Token-Sequence Identifiers

These methods represent each item as a sequence of discrete tokens. The LLM then generates this sequence token by token.

  • Based on Human Vocabulary:
    • BIGRec [1]: Uses item titles as identifiers. The tokens are taken directly from human language, allowing LLMs to leverage their inherent linguistic knowledge.
    • IDGenRec [31]: Aims to learn concise but informative tags from human vocabulary to represent each item. It's a learnable ID generator.
    • Pros: Can leverage LLMs' rich world knowledge (especially for human-readable tokens), potentially offering better generalization on cold-start items if semantic information is strong.
    • Cons: Suffer from the local optima problem in beam search and low generation efficiency due to autoregressive generation.
  • Based on External Tokens:
    • CID [9]: Leverages hierarchical clustering to obtain token sequences. It uses an item co-occurrence matrix to ensure items with similar interactions share similar tokens.
    • SemID [9]: Represents items with external token sequences derived from hierarchical item categories.
    • TIGER [26]: Employs RQ-VAE (Residual Quantization Variational Autoencoder) with codebooks to quantize item semantic information into token sequences. The identifier sequentially contains coarse-grained to fine-grained information.
    • LETTER [36]: A state-of-the-art method that incorporates both semantic and CF information into RQ-VAE training to create multi-dimensional identifiers, aiming for improved diversity.
    • Pros: Can encode rich, hierarchical information, potentially including both CF and semantic cues.
    • Cons: Still face the local optima problem and inference inefficiency of autoregressive generation. External tokens might not align well with LLMs' pre-trained knowledge, requiring extensive interaction data for training.

Single-Token Identifiers

These methods represent each item with a single continuous token (an embedding). The LLM generates this embedding, which is then mapped to an actual item.

  • ID Embedding-based:
    • DreamRec [46]: Leverages an ID embedding to represent each item and uses a diffusion model to refine the ID embedding generated by LLMs.
    • E4SRec [14]: Utilizes a pre-trained CF model to obtain ID embeddings and then uses a linear projection layer to map generated embeddings to item scores efficiently.
    • Pros: Improves inference efficiency by bypassing token-by-token autoregressive generation.
    • Cons: Rely heavily on sufficient interactions to capture CF information, making them vulnerable to long-tailed or cold-start items. A single embedding might not capture rich, multi-dimensional item information effectively.
  • Semantic Embedding-based:
    • LITE-LLM4Rec [34] (mentioned in abstract): Uses semantic embeddings.
    • Pros: Can leverage semantic information for cold-start recommendations.
    • Cons: Often overlooks the crucial CF information necessary for personalized recommendations, leading to suboptimal performance.

3.3. Technological Evolution

The evolution of recommender systems has seen a shift from traditional Collaborative Filtering and content-based methods to deep learning approaches, and more recently, the integration of Large Language Models (LLMs). Initially, LLMs were used to enhance existing recommenders (discriminative recommendation) by providing better feature representations or assisting in tasks like feature engineering. The current frontier involves using LLMs directly as recommender models (generative recommendation).

Within LLM-based generative recommendation, item tokenization has evolved:

  1. Initial approaches used human-readable token sequences (e.g., titles) to leverage LLMs' linguistic capabilities.

  2. Later works introduced external tokens and hierarchical token sequences to encode more complex CF and semantic information, often trained with VQ-VAE or RQ-VAE-like structures.

  3. To address efficiency concerns, single-token identifiers (ID or semantic embeddings) emerged, sacrificing some expressiveness for speed.

    This paper's work (SETRec) fits into this timeline by attempting to synthesize the best aspects of these approaches. It aims to combine the rich information capture of multi-dimensional tokenization with the efficiency of single-token generation, while addressing the local optima and inference efficiency challenges that plague existing token-sequence methods, and the information-loss issues of single-token methods. It pushes the boundary by introducing order-agnostic set identifiers and simultaneous generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, SETRec introduces several core differences and innovations:

  • Addressing Local Optima and Efficiency: Unlike token-sequence identifiers (e.g., BIGRec, IDGenRec, CID, SemID, TIGER, LETTER), SETRec proposes an order-agnostic identifier paradigm that eliminates sequential token dependencies within an item's identifier. This fundamentally avoids the local optima problem inherent in beam search for autoregressive generation and significantly boosts inference efficiency through simultaneous generation. Existing token-sequence methods, even advanced ones like LETTER, are still subject to these issues.

  • Comprehensive Information Integration: While some token-sequence methods (e.g., LETTER, TIGER) attempt to integrate CF and semantic information, they often do so by encoding it into a single ordered sequence of tokens. SETRec explicitly integrates both CF and semantic information into a set of distinct, order-agnostic tokens, allowing for clearer separation and representation of these multi-dimensional aspects without forced dependencies.

  • Overcoming Single-Token Limitations: In contrast to single-token identifiers (e.g., DreamRec, E4SRec), which either lack rich semantic information (ID embeddings) or CF information (semantic embeddings), SETRec uses a set of tokens for each item. This set explicitly includes both a CF embedding and multiple semantic embeddings, ensuring that both types of crucial information are captured comprehensively, leading to better performance, especially in cold-start scenarios where CF is scarce.

  • Novel Generation Mechanism: SETRec introduces a query-guided generation mechanism combined with a sparse attention mask. This is a novel way to guide LLMs to generate multiple order-agnostic tokens simultaneously, each aligned with a specific information dimension (CF or different semantic aspects), without introducing spurious dependencies during user history encoding. This differs from prior work that either generates a single embedding or relies on sequential, autoregressive token generation.

    In essence, SETRec innovates by shifting from sequential, dependent item representations to a parallel, independent set representation, providing a robust solution for information integration, generation efficiency, and overcoming the local optima challenge.

4. Methodology

The core idea of SETRec is to design item identifiers that are order-agnostic and integrate both Collaborative Filtering (CF) and semantic information. This allows for efficient and effective LLM-based generative recommendation by mitigating the local optima problem and enabling simultaneous generation.

4.1. Principles

The two fundamental principles guiding SETRec's design are:

  1. Integration of semantic and CF information: Items possess multi-dimensional information. Semantic information (e.g., item descriptions, categories) is crucial for generalization, especially for cold-start items, by leveraging the rich knowledge within LLMs. CF information (derived from user interactions) is essential for personalized recommendations, capturing user preferences from behavioral patterns. SETRec aims to combine both to achieve comprehensive item representation.
  2. Order-agnostic Identifier: Representing multi-dimensional item information with a single token can lead to embedding collapse or loss of detail. While using multiple tokens (a sequence) can capture more information, an ordered sequence introduces unnecessary dependencies that can lead to the local optima problem and hinder generation efficiency (due to autoregressive generation). The principle of order-agnostic identifiers suggests that if different dimensions of item information (e.g., "price" and "category") are inherently independent, their representation should also be independent. This approach allows for simultaneous generation of these independent tokens, significantly improving inference speed.

4.2. Core Methodology In-depth (Layer by Layer)

SETRec implements the set identifier paradigm through two main stages: order-agnostic item tokenization and simultaneous item generation.

4.2.1. Order-agnostic Item Tokenization

This stage focuses on converting each item into a set identifier comprising CF and semantic embeddings.

CF Tokenizer

To incorporate Collaborative Filtering (CF) information, SETRec uses a pre-trained conventional recommender model (like SASRec [10]) to generate an item's CF embedding. This embedding is then passed through a linear projection layer to obtain the final CF token $z_{\mathrm{CF}} \in \mathbb{R}^d$, where $d$ is the hidden dimension of the LLMs used.

  • Purpose: This CF token helps LLM-based recommenders to provide accurate recommendations for users and items with rich interaction histories.
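
A minimal sketch of this step with invented dimensions: a frozen embedding table stands in for the pre-trained CF model (e.g., SASRec), and a linear layer projects its item embeddings into the LLM hidden space to obtain the CF tokens.

```python
import torch
import torch.nn as nn

cf_dim, llm_dim, n_items = 64, 512, 10_000
cf_item_emb = nn.Embedding(n_items, cf_dim)        # stand-in for pre-trained CF (e.g., SASRec) item embeddings
cf_item_emb.weight.requires_grad = False           # keep the CF model frozen
cf_projection = nn.Linear(cf_dim, llm_dim)         # trainable projection into the LLM hidden space

item_ids = torch.tensor([3, 17, 42])
z_cf = cf_projection(cf_item_emb(item_ids))        # CF tokens z_CF in R^d, one per item
print(z_cf.shape)                                  # torch.Size([3, 512])
```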

Semantic Tokenizer

To capture rich semantic information, SETRec uses a semantic tokenizer to generate a set of semantic embeddings.

  1. Semantic Representation Extraction: Given an item's semantic information (e.g., title, categories), a pre-trained semantic extractor (e.g., SentenceT5 [23]) first extracts a high-dimensional semantic representation $\pmb{s}$.
  2. Multi-dimensional Semantic Embedding via Autoencoder (AE): Instead of compressing $\pmb{s}$ into a single embedding (which could lead to embedding collapse and loss of fine-grained information), SETRec tokenizes $\pmb{s}$ into $N$ order-agnostic semantic embeddings. This is achieved using a unified Autoencoder (AE): $z = \operatorname{Encoder}(s)$ Where:
    • $s$: The input high-dimensional semantic representation of an item.
    • $\operatorname{Encoder}(\cdot)$: The encoder component of the Autoencoder.
    • $z = [z_{S_1}, z_{S_2}, \ldots, z_{S_N}] \in \mathbb{R}^{Nd}$: The concatenated semantic embeddings.
    • $z_{S_n} \in \mathbb{R}^d$: The $n$-th semantic embedding, representing a distinct latent semantic dimension (e.g., aspects like "brand", "price", "material"). Each $z_{S_n}$ is an order-agnostic token. The use of a unified AE (rather than $N$ separate AEs) reduces model parameters and improves training stability.
  3. Reconstruction Loss: To ensure these semantic embeddings preserve useful information, the AE is trained with a reconstruction loss: $\mathcal{L}_{\mathrm{AE}} = \| \pmb{s} - \hat{\pmb{s}} \|_2^2$ Where:
    • $\pmb{s}$: The original semantic representation.
    • $\hat{\pmb{s}} = \operatorname{Decoder}(z)$: The reconstructed semantic representation obtained by passing the concatenated semantic embeddings $z$ through the Decoder component of the Autoencoder.
    • $\| \cdot \|_2^2$: The squared L2 norm, i.e., the squared error between the original and reconstructed representations.
    • Purpose: This loss encourages the AE to learn meaningful and reconstructable semantic embeddings.
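
To make the semantic tokenizer concrete, here is a minimal PyTorch sketch under simplified, illustrative layer sizes (not the paper's exact configuration): a single unified autoencoder whose encoder maps one semantic vector to $N$ order-agnostic $d$-dimensional tokens, and whose decoder reconstructs the input for the reconstruction loss.

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Sketch: encode a semantic vector s into N order-agnostic d-dimensional tokens
    with a single (unified) autoencoder; the decoder reconstructs s for the AE loss."""
    def __init__(self, sem_dim=768, d=512, n_tokens=4):
        super().__init__()
        self.n_tokens, self.d = n_tokens, d
        self.encoder = nn.Sequential(
            nn.Linear(sem_dim, 512), nn.ReLU(), nn.Linear(512, n_tokens * d)
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_tokens * d, 512), nn.ReLU(), nn.Linear(512, sem_dim)
        )

    def forward(self, s):
        z = self.encoder(s)                              # concatenated [z_S1, ..., z_SN]
        s_hat = self.decoder(z)                          # reconstruction \hat{s}
        tokens = z.view(-1, self.n_tokens, self.d)       # (batch, N, d): the semantic token set
        return tokens, s_hat

tokenizer = SemanticTokenizer()
s = torch.randn(8, 768)                                  # e.g., SentenceT5 item representations
tokens, s_hat = tokenizer(s)
loss_ae = ((s - s_hat) ** 2).sum(dim=-1).mean()          # L_AE = ||s - s_hat||_2^2, batch-averaged
```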

Token Corpus

After item tokenization, each item $\tilde{i}$ is represented by a set identifier: $\tilde{i} = \{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}$. All tokens generated across all items are collected to form token corpora for each information dimension: $\mathcal{Z}_{\mathrm{CF}}, \mathcal{Z}_{S_1}, \ldots, \mathcal{Z}_{S_N}$.

  • Purpose: These token corpora serve as grounding heads during the simultaneous item generation phase to map generated embeddings back to existing items.

4.2.2. Simultaneous Item Generation

This stage focuses on efficiently generating the set identifier for the next recommended item.

Query-guided Generation

To enable simultaneous generation and ensure that each generated token aligns with its specific information dimension (CF or a particular semantic aspect), SETRec introduces learnable query vectors. These vectors guide the LLM during generation.

  1. Input to LLM: The user's historical interactions are transformed into an identifier sequence. For a user with $L$ historical interactions, the input to the LLM becomes: $\pmb{x} = [\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^1, \ldots, \{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^L]$ Where $\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^j$ represents the set identifier for the $j$-th item in the user's history.
  2. Token Generation: For each dimension $k \in \{\mathrm{CF}, S_1, S_2, \ldots, S_N\}$, the corresponding token $\hat{z}_k$ is generated by the LLM layers conditioned on the transformed user history $\pmb{x}$ and a specific learnable query vector $\pmb{q}_k$: $\hat{z}_k = \mathrm{LLM\_Layers}(\pmb{x}, \pmb{q}_k)$ Where:
    • $\mathrm{LLM\_Layers}(\cdot)$: Represents the attention layers of the LLM.
    • $\pmb{q}_k \in \mathbb{R}^d$: A learnable query vector specifically for dimension $k$. This vector acts as a prompt, guiding the LLM to focus on generating information relevant to that dimension.
    • $\hat{z}_k$: The generated continuous token (embedding) for dimension $k$. This process generates a set identifier $\hat{i} = \{\hat{z}_{\mathrm{CF}}, \hat{z}_{S_1}, \ldots, \hat{z}_{S_N}\}$ for the next item (a minimal sketch of this mechanism follows this list).
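
A minimal sketch of query-guided simultaneous generation, using a tiny `nn.TransformerEncoder` as a stand-in for the LLM layers (the paper instantiates T5 and Qwen); all shapes are illustrative. The learnable query vectors are appended to the flattened user history, and the hidden states at the query positions are read off as the generated set identifier in a single forward pass.

```python
import torch
import torch.nn as nn

d, L, M = 64, 5, 4                                    # hidden dim, history length, tokens per item (CF + N semantic)
history = torch.randn(1, L * M, d)                    # flattened set identifiers of the L history items
queries = nn.Parameter(torch.randn(1, M, d))          # one learnable query vector per information dimension

llm_layers = nn.TransformerEncoder(                   # stand-in for the LLM's attention layers
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2
)

x = torch.cat([history, queries], dim=1)              # [history tokens ; query tokens]
hidden = llm_layers(x)                                # a full implementation would also pass the sparse
                                                      # attention mask described in the next subsection
generated = hidden[:, -M:, :]                         # \hat{z}_CF, \hat{z}_S1, ..., \hat{z}_SN, all at once
print(generated.shape)                                # torch.Size([1, 4, 64]): the whole set in one LLM call
```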

Token Generation Optimization

To train the model to generate accurate tokens for each dimension, a loss function is used that encourages the generated token to be similar to the target token for its dimension while being dissimilar to the other tokens of that dimension; this is a form of contrastive learning. $\mathcal{L}_{\mathrm{Gen}} = -\frac{1}{|\mathcal{D}|} \sum_{\mathcal{D}} \sum_{k \in \mathcal{F}} \log \frac{\exp(\operatorname{sim}(\hat{z}_k, z_k))}{\sum_{z \in \mathcal{Z}_k} \exp(\operatorname{sim}(\hat{z}_k, z))}$ Where:

  • $|\mathcal{D}|$: The total number of user interaction sequences in the dataset.
  • $\mathcal{F} = \{\mathrm{CF}, S_1, \ldots, S_N\}$: The set of all information dimensions.
  • $\hat{z}_k$: The generated token for dimension $k$.
  • $z_k$: The true target token for dimension $k$ (from the ground-truth item).
  • $\operatorname{sim}(\cdot)$: A similarity function, typically inner product or cosine similarity.
  • $\mathcal{Z}_k$: The token corpus for dimension $k$, containing all possible tokens for that dimension from all items.
  • Purpose: This loss maximizes the similarity between the generated token $\hat{z}_k$ and the ground-truth token $z_k$ for dimension $k$, while minimizing similarity with all other tokens in the corpus $\mathcal{Z}_k$ for that same dimension. This pushes the model to generate tokens that are distinct and accurate for each dimension.
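
A minimal sketch of this loss for one dimension $k$, assuming inner product as $\operatorname{sim}(\cdot)$ and invented shapes: `F.cross_entropy` over similarities to the whole token corpus implements the (log-)softmax contrast of the target token against every other item's token.

```python
import torch
import torch.nn.functional as F

n_items, d, batch = 1000, 64, 8
Z_k = torch.randn(n_items, d)                        # token corpus Z_k: one token per item for dimension k
z_hat_k = torch.randn(batch, d)                      # generated tokens \hat{z}_k for a batch of users
target_items = torch.randint(0, n_items, (batch,))   # indices of the ground-truth items

logits = z_hat_k @ Z_k.T                             # sim(\hat{z}_k, z) for every token z in Z_k
loss_k = F.cross_entropy(logits, target_items)       # -log softmax probability of the target token

# Summing loss_k over all dimensions k in {CF, S_1, ..., S_N} (and averaging over the
# dataset) gives the overall generation loss L_Gen.
```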

Token Generation Grounding

After generating the set identifier $\hat{i} = \{\hat{z}_{\mathrm{CF}}, \hat{z}_{S_1}, \ldots, \hat{z}_{S_N}\}$, the next step is to map these generated embeddings to existing items. This is challenging because the number of possible token combinations can be vast. SETRec addresses this with a token set grounding strategy that leverages the pre-computed token corpora as grounding heads. The scores for each item $i \in \mathcal{I}$ are obtained as follows: $\begin{cases} s_k = W_k \hat{z}_k \\ s = (1 - \beta)\, s_{\mathrm{CF}} + \beta \sum_{k \in \mathcal{F} \setminus \mathrm{CF}} s_k \end{cases}$ Where:

  • $s_k \in \mathbb{R}^{|\mathcal{I}|}$: A vector in which each element is the score of an item based on the generated token $\hat{z}_k$ for dimension $k$.
  • $W_k \in \mathbb{R}^{|\mathcal{I}| \times d}$: The matrix formed by stacking the tokens from the token corpus $\mathcal{Z}_k$ (i.e., $W_k$ contains $z_k$ for every item). Multiplying $\hat{z}_k$ by $W_k$ performs a similarity lookup against all items' tokens for dimension $k$.
  • $s \in \mathbb{R}^{|\mathcal{I}|}$: The final score vector for all items.
  • $\beta \in [0, 1]$: A hyper-parameter that balances the contribution of CF scores ($s_{\mathrm{CF}}$) and semantic scores ($\sum_{k \in \mathcal{F} \setminus \mathrm{CF}} s_k$).
  • $\mathcal{F} \setminus \mathrm{CF}$: The set of all semantic dimensions ($S_1, \ldots, S_N$).
  • Purpose: This strategy computes a score for every item based on its match with each generated dimension's token. The final item scores are a weighted sum of these CF and semantic scores. This approach is extendable to new items (e.g., cold-start items) because their semantic tokens are available in the token corpora, allowing for scoring even without historical CF interactions.
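
A minimal sketch of this grounding step with invented shapes: each grounding head is the stacked token corpus for one dimension, so a single matrix-vector product scores every item at once, and $\beta$ mixes the CF and semantic scores.

```python
import torch

n_items, d, n_sem, beta = 1000, 64, 3, 0.5

W_cf = torch.randn(n_items, d)                              # stacked CF tokens of all items
W_sem = [torch.randn(n_items, d) for _ in range(n_sem)]     # stacked semantic tokens, one matrix per dimension

z_hat_cf = torch.randn(d)                                   # generated CF token
z_hat_sem = [torch.randn(d) for _ in range(n_sem)]          # generated semantic tokens

s_cf = W_cf @ z_hat_cf                                      # s_CF in R^{|I|}
s_sem = sum(W @ z for W, z in zip(W_sem, z_hat_sem))        # summed semantic scores
scores = (1 - beta) * s_cf + beta * s_sem                   # final item scores
print(scores.topk(10).indices)                              # top-10 recommended item indices
```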

Sparse Attention Mask

To realize the order-agnostic principle for user history encoding and boost efficiency, SETRec employs a sparse attention mask.

  • Problem: In the transformed user history $\pmb{x}$, the tokens for a single item (e.g., $z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}$) are concatenated. Without a special mask, the attention mechanism might create spurious dependencies between these tokens within the same item's identifier (e.g., a semantic token attending to a CF token), violating the order-agnostic principle.
  • Solution: The sparse attention mask is designed to:
    1. Eliminate Intra-item Dependencies: Tokens belonging to the same item's identifier cannot attend to each other. This ensures that the components of an item's set identifier remain independent during encoding.
    2. Retain Inter-item Dependencies: All tokens can attend to all tokens from previously interacted items. This preserves the sequential nature of user history, allowing the LLM to understand how past items influence future preferences. This is illustrated in Figure 5. Figure 5(a) shows the original attention mask where tokens within an item can attend to each other, creating unwanted dependencies. Figure 5(b) shows the sparse attention mask where attention within an item (marked by \bigoplus) is blocked, while attention to previous items is allowed.
  • Time Complexity Analysis: The sparse attention mask also significantly improves generation efficiency.
    • For a sequence with $L$ historical items, each represented by $M$ tokens (the CF token plus $N$ semantic tokens, so $M = N + 1$), the total length of the flattened input sequence is $M \cdot L$.
    • With the original attention mask, generating the $M$ tokens of the next item autoregressively requires $M$ sequential decoding steps, each attending over a sequence of length $M \cdot L$, which the paper reports as a time complexity of $M^3 L^2 d$, where $d$ is the hidden dimension.
    • With the proposed sparse attention mask and query-guided simultaneous generation, all $M$ tokens are produced in a single forward pass, reducing the complexity to $O(M^2 L^2 d)$. Beyond the asymptotic saving, the main efficiency benefit comes from replacing $M$ sequential LLM calls with one (see the mask sketch below).
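
A minimal sketch of such a mask (boolean convention: `True` means attention is allowed; how the mask is consumed depends on the LLM implementation): tokens may attend to themselves and to every token of previous items, but not to the other tokens of the same item.

```python
import torch

def sparse_attention_mask(L: int, M: int) -> torch.Tensor:
    """Mask for L history items with M tokens each: allow self-attention and attention to
    all tokens of strictly earlier items; block attention within the same item's identifier."""
    n = L * M
    item_of = torch.arange(n) // M                              # which item each position belongs to
    earlier = item_of.unsqueeze(0) < item_of.unsqueeze(1)       # key's item comes before query's item
    return earlier | torch.eye(n, dtype=torch.bool)

mask = sparse_attention_mask(L=3, M=2)
print(mask.int())
# Position 3 (second token of item 1) attends to both tokens of item 0 and to itself,
# but not to position 2 (the other token of item 1).
```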

4.2.3. Instantiation

To train SETRec, the CF and semantic tokenizers, the learnable query vectors, and the LLM parameters are optimized jointly by minimizing a combined loss function: $\mathcal{L} = \mathcal{L}_{\mathrm{Gen}} + \alpha \mathcal{L}_{\mathrm{AE}}$ Where:

  • $\mathcal{L}_{\mathrm{Gen}}$: The token generation optimization loss (Equation 6), which encourages accurate and distinct token generation for each dimension.

  • $\mathcal{L}_{\mathrm{AE}}$: The Autoencoder reconstruction loss (Equation 4), which ensures the semantic embeddings preserve rich information.

  • $\alpha$: A hyper-parameter that controls the relative weight of the AE reconstruction loss during training. A higher $\alpha$ places more emphasis on accurate reconstruction of semantic information.

    Inference Process:

  1. Item Tokenization: All available items are first tokenized into set identifiers $\tilde{i} = \{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}$. The token corpora $\mathcal{Z}_{\mathrm{CF}}, \mathcal{Z}_{S_1}, \ldots, \mathcal{Z}_{S_N}$ are formed.

  2. User History Transformation: A user's historical interaction sequence is transformed into the input $\pmb{x}$ using these set identifiers.

  3. Simultaneous Generation: The LLM performs query-guided simultaneous generation with the sparse attention mask (Equation 5) to generate the set identifier $\hat{i} = \{\hat{z}_{\mathrm{CF}}, \hat{z}_{S_1}, \ldots, \hat{z}_{S_N}\}$ for the next recommended item. This step generates all tokens in parallel in a single LLM forward pass.

  4. Item Grounding: The generated token set $\hat{i}$ is then grounded to existing items using the token corpora as extendable grounding heads (Equation 7), producing final scores for all candidate items.

  5. Ranking: Items are ranked based on their scores, and the top-ranked items are recommended.

    This entire process ensures that SETRec is both effective (integrating multi-dimensional information) and efficient (simultaneous generation, sparse attention).

5. Experimental Setup

5.1. Datasets

The experiments are conducted on four real-world datasets from various domains:

  • Amazon Review Datasets: These datasets contain rich user interactions (e.g., purchases, reviews) and extensive textual metadata (title, description, category, brand) for items.

    1. Toys: Products in the Toys category.
    2. Beauty: Products in the Beauty category.
    3. Sports: Products in the Sports category.
  • Steam: A video games dataset proposed in [10], which includes substantial user interactions and abundant textual semantic information about video games.

    For all datasets, the authors follow previous work [37] for preprocessing:

  • User interactions are sorted chronologically according to their timestamps.

  • The data is split into training, validation, and testing sets with a ratio of 8:1:1, respectively.

  • Items are categorized into warm items (those appearing in the training set) and cold items (those not appearing in the training set). This allows for evaluating the model's performance on items with varying levels of interaction history.

    These datasets are well-suited for validating the method because they provide diverse domains, rich semantic information necessary for LLM-based recommendation, and sufficient user interaction data for CF and for training/testing warm-start and cold-start scenarios.

5.2. Evaluation Metrics

The experiments use two widely accepted metrics for evaluating recommender systems, Recall@K and NDCG@K, with $K = 5$ and $K = 10$. Additionally, evaluations are performed under three distinct settings:

  1. All Items: Evaluation over the entire set of items.
  2. Warm Items Only: Evaluation focusing only on items that appeared in the training set.
  3. Cold Items Only: Evaluation focusing only on items that did not appear in the training set (i.e., new items).

Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved (recommended) among the top $K$ recommendations. It focuses on the completeness of the recommendations: how many of the truly relevant items the system finds within the top-$K$ list.

Mathematical Formula: $\mathrm{Recall@K} = \frac{\sum_{u \in U} |\{\text{recommended items for } u \text{ in top } K\} \cap \{\text{relevant items for } u\}|}{\sum_{u \in U} |\{\text{relevant items for } u\}|}$

Symbol Explanation:

  • $U$: The set of all users in the test set.
  • $u$: A specific user.
  • $\{\text{recommended items for } u \text{ in top } K\}$: The set of top-$K$ items recommended to user $u$.
  • $\{\text{relevant items for } u\}$: The set of all items that are truly relevant to user $u$ (e.g., items the user actually interacted with in the test set).
  • $|\cdot|$: The cardinality (number of elements) of a set.
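
A small sketch of Recall@K exactly as defined above (hits summed over users, divided by the total number of relevant items); the toy lists are invented.

```python
def recall_at_k(recommended, relevant, k):
    """recommended: per-user ranked item lists; relevant: per-user sets of relevant items."""
    hits = sum(len(set(recs[:k]) & rel) for recs, rel in zip(recommended, relevant))
    total_relevant = sum(len(rel) for rel in relevant)
    return hits / total_relevant

# Two toy users, one relevant item each (leave-one-out style evaluation).
recommended = [[5, 2, 9, 1, 7], [3, 8, 4, 6, 0]]
relevant = [{9}, {1}]
print(recall_at_k(recommended, relevant, k=5))   # 0.5: only the first user's item is in the top-5
```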

NDCG@K (Normalized Discounted Cumulative Gain at K)

Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher up in the list (i.e., discounted cumulative gain) and normalizes these scores by comparing them to an ideal ranking where all relevant items are perfectly ordered (ideal discounted cumulative gain). NDCG emphasizes getting relevant items to the top of the list.

Mathematical Formula: $\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$, where $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$ and $\mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i^{\mathrm{ideal}}} - 1}{\log_2(i+1)}$

Symbol Explanation:

  • $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$. It sums the relevance scores of items in the recommended list, penalizing items that appear lower.
  • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at position $K$. This is the maximum possible DCG if the recommended list were perfectly ordered by relevance; it serves as a normalization factor.
  • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the actual recommended list. For binary relevance (relevant/not relevant), $\mathrm{rel}_i$ is typically 1 if the item is relevant, and 0 otherwise.
  • $\mathrm{rel}_i^{\mathrm{ideal}}$: The relevance score of the item at position $i$ in the ideal recommended list (where all relevant items are ranked before non-relevant ones, and ties are broken arbitrarily).
  • $\log_2(i+1)$: A logarithmic discount factor that reduces the contribution of items further down the list.
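
A small sketch of NDCG@K for a single user under binary relevance (so $2^{\mathrm{rel}_i} - 1$ is 1 for a hit and 0 otherwise); the toy list is invented.

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@K for one user: DCG of the recommended list divided by the
    DCG of an ideal list that ranks all relevant items first."""
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([5, 2, 9, 1, 7], {9}, k=5))   # relevant item at rank 3 -> 1 / log2(4) = 0.5
```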

5.3. Baselines

The paper compares SETRec against a comprehensive set of competitive baselines, categorized by their item identifier type:

Single-Token Identifiers

  • DreamRec [46]: Leverages ID embeddings for items. It employs a diffusion model to refine the ID embedding generated by LLMs, aiming to improve the quality of generated item representations.
  • E4SRec [14]: Uses a pre-trained Collaborative Filtering (CF) model (e.g., SASRec) to obtain ID embeddings for items. For recommendation, it generates an item embedding and then uses a linear projection layer to efficiently compute scores for all items, mapping the generated embedding back to existing item IDs.

Token-Sequence Identifiers

  • BIGRec [1]: Represents items using their titles. The tokens are drawn from human vocabulary, allowing the LLM to directly utilize its linguistic understanding.

  • IDGenRec [31]: A learnable ID generator that aims to produce concise yet informative tags (from human vocabulary) to represent each item.

  • CID [9]: Employs hierarchical clustering based on item co-occurrence patterns to generate token sequences. Items with similar interaction histories are expected to have similar identifiers.

  • SemID [9]: Represents items using external token sequences derived from hierarchical item categories. This emphasizes semantic similarity.

  • TIGER [26]: Utilizes RQ-VAE (Residual Quantization Variational Autoencoder) with codebooks to quantize item semantic information into token sequences. These sequences are designed to convey information from coarse-grained to fine-grained.

  • LETTER [36]: A state-of-the-art method that integrates both semantic and Collaborative Filtering (CF) information into the training of an RQ-VAE to create multi-dimensional item identifiers, aiming for improved diversity and richness.

    These baselines are representative because they cover the main existing paradigms for item tokenization in LLM-based generative recommendation, including methods that rely on human language, learned external tokens, and different combinations of CF and semantic information.

5.4. Implementation Details

  • LLM Architectures: SETRec and all baselines are instantiated on two different LLM architectures:
    • T5-small [25]: An encoder-decoder transformer model.
    • Qwen2.5 [45]: A decoder-only transformer model, evaluated with different model sizes: 1.5B, 3B, and 7B parameters. This allows for a comprehensive evaluation of scalability.
  • Hidden Layer Dimensions: For methods that use Autoencoders (AE) in their tokenizer training (e.g., TIGER, LETTER, SETRec), the hidden layer dimensions are set to 512, 256, and 128 with ReLU activation.
  • Prompt for LLM Training: A consistent prompt is used for all methods to ensure a fair comparison: "What would the user be likely to purchase next after buying items history?;"
  • Fine-tuning Strategy:
    • T5 models: Fully fine-tuned (all parameters updated).
    • Qwen models: Parameter-Efficient Fine-Tuning (PEFT) technique LoRA [8] is used to reduce computational costs.
  • Hardware: All experiments are conducted on four NVIDIA RTX A5000 GPUs.
  • Hyperparameter Selection for SETRec:
    • Number of semantic tokens ($N$): Selected from $\{1, 2, 3, 4, 5, 6\}$.
    • Strength of the AE loss ($\alpha$): Selected from $\{0.1, 0.3, 0.5, 0.7, 0.9\}$.
    • Semantic strength for inference ($\beta$): Selected from $\{0, 0.1, 0.2, \ldots, 0.9, 1.0\}$.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Performance on T5 (RQ1)

The following are the results from Table 1 of the original paper:

Each row: Method, then All (R@5, R@10, N@5, N@10), Warm (R@5, R@10, N@5, N@10), Cold (R@5, R@10, N@5, N@10), and Inf. Time (s); the dataset name prefixes the first row of each group.
Toys DreamRec 0.0020 0.0027 0.0015 0.0018 0.0027 0.0039 0.0020 0.0024 0.0066 0.0168 0.0045 0.0082 912
E4SRec 0.0061 0.0098 0.0051 0.0064 0.0081 0.0128 0.0065 0.0082 0.0065 0.0122 0.0056 0.0078 55
BIGRec 0.0008 0.0013 0.0007 0.0009 0.0014 0.0019 0.0011 0.0013 0.0278 0.0360 0.0196 0.0223 2,079
IDGenRec 0.0044 0.0082 0.0040 0.0053 0.0065 0.0128 0.0049 0.0071 0.0059 0.0111 0.0047 0.0066 810
CID 0.0063 0.0110 0.0052 0.0069 0.0109 0.0161 0.0081 0.0102 0.0318 0.0589 0.0236 0.0335 658
SemID 0.0071 0.0108 0.0061 0.0074 0.0086 0.0153 0.0075 0.0100 0.0307 0.0507 0.0220 0.0292 1,215
TIGER 0.0064 0.0106 0.0060 0.0076 0.0091 0.0147 0.0080 0.0102 0.0315 0.0555 0.0228 0.0314 448
LETTER 0.0081 0.0117 0.0077 0.0091 0.0109 0.0155 0.0083 0.0101 0.0183 0.0395 0.0115 0.0190 448
SETRec 0.0110* 0.0189* 0.0089* 0.0118* 0.0139* 0.0236* 0.0112* 0.0147* 0.0443* 0.0812* 0.0310* 0.0445* 60
Beauty DreamRec 0.0012 0.0025 0.0013 0.0017 0.0016 0.0028 0.0016 0.0019 0.0078 0.0161 0.0065 0.0094 1,102
E4SRec 0.0061 0.0092 0.0052 0.0063 0.0080 0.0121 0.0067 0.0082 0.0072 0.0118 0.0065 0.0077 120
BIGRec 0.0008 0.0009 0.0006 0.0008 0.0106 0.0251 0.0095 0.0151 4,544 0.0054 0.0064 0.0051 0.0054
IDGenRec 0.0080 0.0115 0.0066 0.0078 0.0106 0.0165 0.0078 0.0099 0.0187 0.0350 0.0186 0.0224 840
CID 0.0071 0.0125 0.0060 0.0080 0.0098 0.0166 0.0077 0.0101 0.0087 0.0183 0.0071 0.0104 815
SemID 0.0071 0.0131 0.0056 0.0078 0.0098 0.0174 0.0074 0.0103 0.0260 0.0465 0.0178 0.0255 1,310
TIGER 0.0063 0.0098 0.0050 0.0062 0.0086 0.0131 0.0065 0.0082 0.0190 0.0325 0.0130 0.0178 430
LETTER 0.0071 0.0103 0.0061 0.0070 0.0094 0.0135 0.0079 0.0091 0.0251 0.0410 0.0241 0.0285 430
SETRec 0.0106* 0.0161* 0.0083* 0.0103* 0.0139* 0.0212* 0.0108* 0.0134* 0.0384* 0.0761* 0.0280* 0.0413* 126
Sports DreamRec 0.0027 0.0044 0.0025 0.0031 0.0032 0.0052 0.0028 0.0035 0.0045 0.0108 0.0026 0.0049 2,100
E4SRec 0.0079 0.0131 0.0075 0.0094 0.0092 0.0154 0.0085 0.0107 0.0031 0.0093 0.0019 0.0039 117
BIGRec 0.0033 0.0042 0.0030 0.0033 0.0001 0.0002 0.0001 0.0001 0.0059 0.0104 0.0043 0.0061 7,822
IDGenRec 0.0087 0.0127 0.0079 0.0092 0.0101 0.0149 0.0091 0.0107 0.0181 0.0302 0.0134 0.0179 1,724
CID 0.0077 0.0131 0.0073 0.0092 0.0074 0.0119 0.0045 0.0061 0.0082 0.0149 0.0075 0.0099 2,135
SemID 0.0094 0.0167 0.0088 0.0114 0.0119 0.0201 0.0104 0.0135 0.0254 0.0495 0.0175 0.0256 2,367
TIGER 0.0085 0.0129 0.0080 0.0095 0.0100 0.0151 0.0091 0.0109 0.0190 0.0310 0.0120 0.0159 481
LETTER 0.0077 0.0131 0.0073 0.0092 0.0074 0.0119 0.0045 0.0061 0.0082 0.0149 0.0075 0.0099 481
SETRec 0.0114* 0.0185* 0.0101* 0.0126* 0.0134* 0.0216* 0.0115* 0.0144* 0.0341* 0.0595* 0.0233* 0.0323* 136
Steam DreamRec 0.0029 0.0057 0.0037 0.0046 0.0042 0.0080 0.0045 0.0059 0.0017 0.0029 0.0013 0.0018 4,620
E4SRec 0.0194 0.0351 0.0220 0.0270 0.0312 0.0558 0.0283 0.0370 0.0006 0.0010 0.0006 0.0006 328
BIGRec 0.0099 0.0107 0.0099 0.0103 0.0088 0.0097 0.0088 0.0092 0.0011 0.0010 0.0010 0.0010 3,120
IDGenRec 0.0163 0.0284 0.0152 0.0200 0.0204 0.0360 0.0190 0.0250 0.0083 0.0117 0.0076 0.0093 1,438
CID 0.0189 0.0325 0.0202 0.0250 0.0276 0.0478 0.0296 0.0369 0.0019 0.0033 0.0018 0.0024 1,760
SemID 0.0175 0.0288 0.0184 0.0227 0.0222 0.0366 0.0234 0.0288 0.0077 0.0122 0.0071 0.0091 2,000
TIGER 0.0201 0.0357 0.0225 0.0279 0.0273 0.0494 0.0305 0.0381 0.0031 0.0051 0.0028 0.0036 720
LETTER 0.0195 0.0347 0.0210 0.0264 0.0259 0.0463 0.0274 0.0346 0.0034 0.0062 0.0027 0.0040 720
SETRec 0.0231* 0.0396* 0.0260* 0.0319* 0.0294* 0.0506* 0.0326* 0.0401* 0.0152* 0.0264* 0.0135* 0.0187* 100

The performance comparison of SETRec against baselines on T5 reveals several key observations:

  • Token-sequence vs. Single-token Identifiers: Generally, token-sequence identifiers (e.g., BIGRec, CID, TIGER, LETTER) outperform single-token identifiers (e.g., DreamRec, E4SRec) across all, warm, and cold settings. This is expected because token-sequence identifiers inherently represent items with multiple tokens, allowing them to encode richer, multi-dimensional information explicitly.
  • External Tokens vs. Human Vocabulary: Among token-sequence identifiers, methods using external tokens (e.g., CID, SemID, TIGER, LETTER) typically perform better than those relying solely on human vocabulary (e.g., BIGRec, IDGenRec) in all and warm settings. This is attributed to the hierarchical structure of external identifiers, which can represent coarse-grained to fine-grained semantics, potentially alleviating the local optima problem to some extent in autoregressive generation.
  • Cold-Start Performance:
    • Methods relying predominantly on CF information (e.g., DreamRec, E4SRec, CID) show poor results on cold items (e.g., E4SRec on Toys R@5: 0.0065 vs. SemID 0.0307). This is because CF requires substantial interaction data, which is absent for cold items.
    • Methods that integrate semantic information into identifiers (e.g., BIGRec, IDGenRec, SemID, TIGER, LETTER) demonstrate better generalization ability in cold-start scenarios. BIGRec and IDGenRec, which use human vocabulary, show competitive performance here, likely leveraging the LLM's rich world knowledge.
  • SETRec's Superiority: SETRec consistently and significantly outperforms all baselines across all datasets and under the all, warm, and cold settings. The improvements are statistically significant (marked with * in the table).
    • This superior performance is attributed to its dual principles: 1) effectively integrating both CF and semantic information into a set of tokens, which provides accurate recommendations for warm items and strong generalization for cold items; and 2) its order-agnostic identifier design, which avoids inaccurate dependencies between tokens within an item, overcoming the local optima issue.
  • Inference Efficiency: SETRec achieves remarkable inference efficiency. It substantially reduces inference time compared to token-sequence identifiers (e.g., on Toys, SETRec takes 60s vs. 2,079s for BIGRec and 448s for LETTER). On average, SETRec achieves speedups of 15× (Toys), 11× (Beauty), 18× (Sports), and 8× (Steam) compared to token-sequence identifiers. This efficiency stems from its simultaneous generation mechanism, which generates all of an item's tokens in a single LLM call.

6.1.2. Performance on Qwen-1.5B (RQ1)

The following are the results from Table 2 of the original paper:

Each row: Method, then All (R@5, R@10, N@5, N@10), Warm (R@5, R@10, N@5, N@10), Cold (R@5, R@10, N@5, N@10), and Inf. Time (s); the dataset name prefixes the first row of each group.
Toys DreamRec 0.0006 0.0013 0.0005 0.0008 0.0008 0.0019 0.0007 0.0012 0.0076 0.0137 0.0052 0.0074 1,093
E4SRec 0.0065 0.0108 0.0056 0.0072 0.0089 0.0144 0.0075 0.0096 0.0084 0.0235 0.0055 0.0111 905
BIGRec 0.0009 0.0016 0.0009 0.0012 0.0011 0.0013 0.0010 0.0011 0.0194 0.0311 0.0147 0.0191 43,304
IDGenRec 0.0030 0.0053 0.0022 0.0031 0.0043 0.0086 0.0032 0.0048 0.0189 0.0364 0.0161 0.0224 30,720
CID 0.0027 0.0047 0.0025 0.0033 0.0055 0.0084 0.0044 0.0056 0.0055 0.0156 0.0044 0.0081 27,248
SemID 0.0024 0.0042 0.0018 0.0024 0.0034 0.0055 0.0026 0.0034 0.0140 0.0275 0.0095 0.0143 32,288
TIGER 0.0068 0.0117 0.0054 0.0072 0.0094 0.0159 0.0070 0.0095 0.0384 0.0715 0.0291 0.0408 13,800
LETTER 0.0057 0.0093 0.0050 0.0064 0.0080 0.0126 0.0066 0.0085 0.0217 0.0416 0.0170 0.0239 13,800
SETRec 0.0116* 0.0188* 0.0095* 0.0120* 0.0144* 0.0236* 0.0118* 0.0151* 0.0531* 0.0883* 0.0382* 0.0507* 926
Beauty DreamRec 0.0007 0.0009 0.0005 0.0005 0.0010 0.0011 0.0007 0.0007 0.0090 0.0167 0.0075 0.0103 1,326
E4SRec 0.0067 0.0109 0.0056 0.0072 0.0088 0.0146 0.0072 0.0094 0.0017 0.0071 0.0010 0.0029 910
BIGRec 0.0006 0.0010 0.0006 0.0007 0.0010 0.0010 0.0008 0.0008 0.0141 0.0246 0.0094 0.0135 29,500
IDGenRec 0.0042 0.0078 0.0030 0.0043 0.0045 0.0104 0.0033 0.0054 0.0254 0.0471 0.0207 0.0292 35,040
CID 0.0046 0.0077 0.0040 0.0052 0.0059 0.0107 0.0051 0.0068 0.0075 0.0155 0.0071 0.0096 27,792
SemID 0.0030 0.0045 0.0027 0.0033 0.0050 0.0076 0.0042 0.0052 0.0159 0.0227 0.0116 0.0159 45,160
TIGER 0.0041 0.0065 0.0032 0.0041 0.0054 0.0085 0.0042 0.0054 0.0083 0.0167 0.0064 0.0091 12,600
LETTER 0.0040 0.0069 0.0031 0.0042 0.0051 0.0088 0.0039 0.0054 0.0043 0.0129 0.0043 0.0071 12,600
SETRec 0.0104* 0.0167* 0.0085* 0.0108* 0.0140* 0.0221* 0.0109* 0.0141* 0.0477* 0.0748* 0.0370* 0.0464* 1,050

The evaluation of SETRec and baselines on Qwen-1.5B (a decoder-only LLM) reveals differences compared to T5:

  • Limited Competitiveness of Token-Sequence Identifiers: On Qwen-1.5B, token-sequence identifiers show less competitiveness compared to their performance on T5. A possible reason suggested by the authors is that Qwen-1.5B might possess richer pre-trained knowledge within its parameters. This larger knowledge base could amplify the knowledge gap between the general pre-training task and the specific recommendation tasks with limited interaction data, making it harder for these methods to adapt.
  • Competitive Performance of E4SRec: E4SRec (a single-token identifier based on ID embedding) often yields competitive performance on Qwen-1.5B. This is likely because E4SRec replaces the original LLM vocabulary head with an item projection head, which effectively adapts the LLM to the recommendation task by directly mapping generated embeddings to item scores, bypassing the vocabulary mismatch issue.
  • Human Vocabulary vs. External Tokens on Cold Items:
    • BIGRec and IDGenRec (using human vocabulary) sometimes outperform their T5 counterparts on cold items (e.g., on Beauty). This indicates that Qwen-1.5B's richer world knowledge can be better leveraged when item representations are in human-readable language, leading to improved generalization for cold-start items.
    • Conversely, identifiers with external tokens (e.g., CID, TIGER, LETTER) show inferior cold performance compared to their T5 counterparts. This is because training external tokens requires substantial interaction data, which is difficult to achieve for cold items, leading to poor generalization due to low generation probability for these tokens.
  • SETRec's Consistent Superiority: Despite these shifts in baseline performance, SETRec consistently outperforms all baselines on Qwen-1.5B across all settings.
    • Notably, SETRec instantiated on Qwen-1.5B often surpasses SETRec on T5, especially in the cold-start setting. This validates SETRec's strong generalization ability across different LLM architectures.
  • Enhanced Efficiency on Qwen: As the LLM size increases (even from T5 to Qwen-1.5B), the efficiency improvements of SETRec over token-sequence identifiers become even more significant, achieving an average 20× speedup across the tested datasets. This highlights the practical advantages of simultaneous generation for larger LLMs.
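For context on how such an item projection head operates, here is a generic, minimal sketch (not E4SRec's actual implementation): the LLM's vocabulary head is swapped for a linear layer that maps the final hidden state directly to one score per item.

```python
import torch.nn as nn

class ItemProjectionHead(nn.Module):
    """Generic sketch of an item projection head: map the LLM's final hidden
    state directly to a score for every item, replacing the vocabulary head."""

    def __init__(self, hidden_dim: int, num_items: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_items, bias=False)

    def forward(self, hidden_state):            # hidden_state: [batch, hidden_dim]
        return self.proj(hidden_state)           # item scores: [batch, num_items]
```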

6.2. Ablation Study (RQ2)

The following figure (Figure 6 from the original paper) shows the ablation study results on Toys:

Figure 6: Ablation study on Toys. The figure reports Recall@10 and NDCG@10 for T5 and Qwen-1.5B under the full, warm-start, and cold-start settings, comparing SETRec with its variants without semantic tokens, CF tokens, query vectors, and the sparse attention mask.

The ablation study investigates the contribution of each component of SETRec by removing them one by one. The variants are:

  • w/o Sem: SETRec without semantic tokens (only CF tokens).

  • w/o CF: SETRec without CF tokens (only semantic tokens).

  • w/o Query: SETRec using random frozen vectors instead of learnable query vectors.

  • w/o SA: SETRec using the original attention mask instead of the sparse attention mask.

Key observations from the ablation study on both T5 and Qwen-1.5B on the Toys dataset:

  1. Effectiveness of Each Component: Removing any component (Sem, CF, Query, SA) consistently leads to performance drops across all, warm, and cold settings. This validates that every component of SETRec contributes positively to its overall effectiveness.
  2. Necessity of Semantic Information: Discarding semantic tokens (w/o Sem) drastically degrades recommendation accuracy, particularly under cold settings. This strongly underscores the critical role of integrating semantic information in item identifiers for handling items with sparse interactions.
  3. Significance of Multi-dimensional Semantics: Removing semantic tokens (w/o Sem) generally leads to worse performance than removing CF tokens (w/o CF). This suggests that leveraging multiple semantic tokens to represent multi-dimensional semantic aspects is highly beneficial, potentially mitigating embedding collapse and capturing richer item details. This aligns with findings in other research [18].
  4. Role of CF Tokens:
    • For T5, removing CF tokens (w/o CF) degrades performance even on cold items. Although counterintuitive at first glance (cold items have few interactions), this suggests that CF signals, once integrated with semantics, still provide useful context for such items.
    • For Qwen, removing CF tokens likewise hurts cold-item performance, but the effect is less pronounced than removing semantic tokens. The authors attribute this to Qwen's stronger pre-trained knowledge and better semantic understanding, which make the CF contribution less critical, especially for cold items where CF signals are inherently sparse.
  5. Impact of Query Vectors and Sparse Attention: Both w/o Query and w/o SA lead to performance degradation, demonstrating the importance of learnable query vectors for guiding simultaneous generation and of the sparse attention mask for ensuring order-agnostic encoding and efficiency (an illustrative mask construction is sketched below).
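To make the w/o SA comparison concrete, below is an illustrative construction of such a sparse attention mask, assuming each history item contributes a fixed number of order-agnostic tokens and that every token may attend to itself and to tokens of preceding items, but not to sibling tokens of the same item; the exact mask used in SETRec may differ in detail.

```python
import torch

def sparse_history_mask(num_items: int, tokens_per_item: int) -> torch.Tensor:
    """Boolean [L, L] mask; entry (i, j) is True if query token i may attend to key token j."""
    length = num_items * tokens_per_item
    item_idx = torch.arange(length) // tokens_per_item          # item index of each token
    # Allow attention to tokens of strictly earlier items (preserves item-level order) ...
    earlier = item_idx.unsqueeze(0) < item_idx.unsqueeze(1)     # key j's item precedes query i's item
    # ... plus self-attention, while blocking sibling tokens of the same item.
    return earlier | torch.eye(length, dtype=torch.bool)

# Example: 3 history items with 2 order-agnostic tokens each.
mask = sparse_history_mask(num_items=3, tokens_per_item=2)
```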

6.3. Item Group Analysis (RQ3)

The following figure (Figure 7 from the original paper) shows the performance of SETRec, LETTER, and E4SRec (T5) on item groups with different popularity on Toys:

Figure 7: Performance of SETRec, LETTER, and E4SRec (T5) on item groups with different popularity on Toys. The figure reports Recall@10 (panel (a)) and NDCG@10 (panel (b)) for the three methods across the popularity groups.

This analysis evaluates SETRec's performance across items grouped by their popularity (G1: most popular, G4: least popular). SETRec is compared with LETTER (a strong token-sequence identifier) and E4SRec (a single-token identifier based on CF).

Key observations:

  1. Popularity-Performance Trend: Performance (both Recall@10 and NDCG@10) generally declines from the most popular items (G1) to the least popular (G4) for all methods. This is expected, as LLMs (and recommender systems in general) have less data to learn from for less popular items, making accurate recommendations more challenging.
  2. E4SRec's Strengths and Weaknesses:
    • E4SRec (CF-only single-token identifier) performs well on the most popular items (G1), sometimes outperforming LETTER. This highlights the strength of CF information when abundant interactions are available.
    • However, E4SRec yields significantly inferior performance on unpopular items (G2-G4). This is a direct consequence of its reliance on CF information, which is sparse for these items.
  3. LETTER's Generalization: LETTER, which incorporates both semantic and CF information (though as a token-sequence), shows better generalization on sparser items (G2-G4) compared to E4SRec, leveraging semantic information to compensate for limited CF.
  4. SETRec's Consistent Excellence: SETRec consistently outperforms both E4SRec and LETTER across all popularity groups (G1-G4).
    • Crucially, the improvements of SETRec are more significant on the sparser, less popular items (G2-G4). This indicates that SETRec's design (integrating multi-dimensional CF and semantic information with an order-agnostic set identifier) is particularly effective in challenging scenarios with limited interaction data. This superior generalization on sparse items largely explains SETRec's overall performance gains.

6.4. Scalability on Model Parameters (RQ3)

The following are the results from Table 3 of the original paper:

| Model Size | Method | All R@10 | All N@10 | Warm R@10 | Warm N@10 | Cold R@10 | Cold N@10 |
|---|---|---|---|---|---|---|---|
| 1.5B | LETTER | 0.0093 | 0.0064 | 0.0126 | 0.0085 | 0.0416 | 0.0239 |
| 1.5B | E4SRec | 0.0108 | 0.0072 | 0.0144 | 0.0096 | 0.0235 | 0.0111 |
| 1.5B | SETRec | 0.0188 | 0.0120 | 0.0236 | 0.0151 | 0.0883 | 0.0507 |
| 3B | LETTER | 0.0109 | 0.0072 | 0.0151 | 0.0097 | 0.0471 | 0.0236 |
| 3B | E4SRec | 0.0096 | 0.0061 | 0.0129 | 0.0081 | 0.0218 | 0.0103 |
| 3B | SETRec | 0.0195 | 0.0123 | 0.0258 | 0.0159 | 0.0964 | 0.0571 |
| 7B | LETTER | 0.0099 | 0.0061 | 0.0137 | 0.0081 | 0.0406 | 0.0216 |
| 7B | E4SRec | 0.0088 | 0.0057 | 0.0114 | 0.0072 | 0.0133 | 0.0065 |
| 7B | SETRec | 0.0194 | 0.0115 | 0.0239 | 0.0140 | 0.1016 | 0.0613 |

This analysis investigates how SETRec scales with increasing LLM model sizes (Qwen 1.5B, 3B, and 7B) compared to E4SRec and LETTER on the Toys dataset.

Key observations:

  1. SETRec's Scalability on Cold Items: SETRec demonstrates clear and continuous performance improvements on cold-start items as the model size scales from 1.5B to 7B (R@10 increasing from 0.0883 to 0.1016). This indicates promising scalability for cold items, suggesting that larger models, with their enhanced semantic understanding and general knowledge, can better leverage SETRec's semantic information for items with sparse interactions.

  2. Limited Scalability on Warm Items: SETRec's performance on warm items (and overall all items) shows minor improvements or even slight fluctuations as model size increases (e.g., R@10 for warm items goes from 0.0236 to 0.0258 then to 0.0239). This suggests LLMs might not necessarily lead to proportionally better CF information understanding with increased size, or that the CF signals are already effectively captured by smaller models. The limited improvements of E4SRec (a CF-focused method) on warm items also support this.

  3. LETTER's Weak Scalability: LETTER generally shows weak scalability across all three settings (all, warm, cold). Its performance either stagnates or slightly fluctuates with increasing model size. This is primarily attributed to its reliance on external tokens, which may not align well with the pre-trained knowledge embedded in LLMs. Consequently, simply increasing the LLM's parameter count does not translate into significant improvements for LETTER.

  4. E4SRec's Performance with Scaling: E4SRec's performance also fluctuates and does not show consistent improvements with model scaling. On cold items, its performance actually decreases significantly from 1.5B to 7B (from 0.0235 to 0.0133), reinforcing the idea that larger LLMs might not inherently improve CF-based recommendations on sparse data without semantic guidance.

In summary, SETRec exhibits strong scalability, particularly for cold-start items, by effectively leveraging the enhanced semantic understanding capabilities of larger LLMs through its multi-dimensional and order-agnostic set identifier design.

6.5. Effect of Semantic Strength β (RQ4)

The following figure (Figure 8 from the original paper) shows the performance of SETRec (T5) with different strengths of semantics β for inference:

Figure 8: Performance of SETRec (T5) with different strengths of semantics β for inference. The figure reports warm-start Recall@10 and NDCG@10 (left) and the corresponding cold-start metrics (right) across β values.

This analysis examines the impact of the hyper-parameter β on SETRec's performance. β controls the balance between CF scores and semantic scores during item grounding (Equation 7), where β = 0 means only CF scores are used, and β = 1 means only semantic scores are used.
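As a rough illustration of this grounding step (a minimal sketch only: the exact form of Equation 7, including any score normalization, is an assumption here, chosen to match the stated β = 0 and β = 1 behavior):

```python
import numpy as np

def ground_items(cf_scores: np.ndarray, sem_scores: np.ndarray, beta: float, k: int = 10):
    """Fuse per-item CF and semantic grounding scores and return the top-k item indices.

    cf_scores, sem_scores: arrays of shape [num_items]; beta in [0, 1] weights the
    semantic side (beta = 0 -> CF scores only, beta = 1 -> semantic scores only).
    """
    fused = (1.0 - beta) * cf_scores + beta * sem_scores
    return np.argsort(-fused)[:k]
```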

Key observations:

  1. Necessity of Semantic Information for Grounding: Incorporating semantic information during inference (any β > 0) is generally beneficial compared to relying solely on CF scores (β = 0). This indicates that the multi-dimensional semantic information contributes to a more robust global ranking and stronger generalization ability.
  2. Significant Improvements on Cold Items: The inclusion of semantic scores brings particularly significant improvements on cold items. For instance, R@10 for cold items increases from below 0.03 at β = 0 to over 0.08 at β = 0.4. This highlights the crucial role of semantic information in providing recommendations for items with sparse or no interaction history.
  3. Optimal Balance: There is often an optimal value for β that balances the CF and semantic contributions. For warm items, performance is high across a broad range of β, peaking around β = 0.4. For cold items, the optimal β is also around 0.4, indicating that a balanced integration of both types of information is best.
  4. Competitive Performance with Pure Semantics: Even when relying solely on semantic scores (β = 1), SETRec maintains competitive performance on warm items. This suggests an implicit alignment between CF and semantic tokens learned during training, where the semantic embeddings can still capture aspects relevant to CF preferences.

6.6. Hyper-parameter Sensitivity (RQ4)

The following figure (Figure 9 from the original paper) shows the performance of SETRec (T5) with different strengths of the AE loss α and different numbers of semantic tokens N:

Figure 9: Performance of SETRec (T5) with different strengths of the AE loss α and different numbers of semantic tokens N. The figure reports Recall@10 under varying α (subfigure (a)) and varying N (subfigure (b)), each across the all, warm-start, and cold-start settings.

This analysis explores the sensitivity of SETRec to two key hyper-parameters: α (strength of the AE loss) and N (number of semantic embeddings).

6.6.1. Effect of α

  • α controls the weighting of the Autoencoder (AE) reconstruction loss (L_AE) in the total training objective (Equation 8); a sketch of this weighting appears after this list.
  • Observations: When α is increased from 0 (no AE loss) to around 0.7, the overall performance of SETRec generally improves. This indicates that encouraging the semantic tokenizer to accurately reconstruct the original semantic representation helps in learning richer and more useful semantic embeddings.
  • However, increasing α too much (e.g., beyond 0.7) can cause performance to drop, especially on warm items. This suggests that an overly strong reconstruction constraint might make the semantic embeddings too focused on literal content, potentially limiting their ability to learn subtle semantic nuances or adapt to CF-driven preferences. The authors recommend an empirical range of 0.5 to 0.7 for α.
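A minimal sketch of how this weighting could enter training, assuming Equation 8 simply adds the AE reconstruction term to the recommendation loss with weight α (the MSE reconstruction below is a placeholder, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def total_loss(rec_loss: torch.Tensor,
               sem_repr: torch.Tensor,         # original item semantic representation
               reconstructed: torch.Tensor,    # AE reconstruction from the semantic tokens
               alpha: float = 0.6) -> torch.Tensor:
    """L_total = L_rec + alpha * L_AE, with L_AE taken here as a simple reconstruction MSE."""
    ae_loss = F.mse_loss(reconstructed, sem_repr)
    return rec_loss + alpha * ae_loss
```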

6.6.2. Effect of N

  • N represents the number of order-agnostic semantic embeddings used to represent each item.
  • Observations: Increasing the number of semantic tokens (N) generally improves performance. This supports the idea that using multiple embeddings can better capture the multi-dimensional semantic information of an item, potentially mitigating the embedding collapse issue [7] and resolving potential information conflicts [36] that arise when forcing diverse semantic aspects into fewer dimensions.
  • However, blindly increasing N beyond a certain point (e.g., N = 5 or N = 6 in some cases) might lead to diminishing returns or even slight performance degradation. This could be because it becomes increasingly challenging for the AE to consistently learn distinct and meaningful category-level preferences for a very large number of semantic dimensions, particularly when aligning them with real-world scenarios [20, 22]. Finding an optimal N is crucial to balancing expressiveness and learnability.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously addresses fundamental issues in item tokenization for LLM-based generative recommendation. It identifies that existing token-sequence identifiers suffer from local optima in beam search and low generation efficiency due to autoregressive generation, while single-token identifiers fail to capture rich semantics or Collaborative Filtering (CF) information adequately.

To overcome these challenges, the authors propose two guiding principles for item identifier design:

  1. Integration of both CF and semantic information: To leverage both user behavioral patterns and item content richness.

  2. Order-agnostic identifiers: To represent multi-dimensional item information as a set of independent tokens, eliminating spurious dependencies and enabling efficient simultaneous generation.

Based on these principles, the paper introduces a novel set identifier paradigm and implements it with SETRec. SETRec utilizes CF and semantic tokenizers to create order-agnostic multi-dimensional tokens. It employs a sparse attention mask for user history encoding to preserve sequential item dependencies while removing intra-item token dependencies, and a query-guided generation mechanism for simultaneous token generation.
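To make the generation side more tangible, the sketch below illustrates query-guided simultaneous generation under simplifying assumptions: an abstract `llm` callable consumes the encoded history together with all learnable query vectors in a single forward pass, and each returned embedding is scored against its own token corpus by inner product. The interfaces and shapes are illustrative assumptions, not SETRec's actual implementation.

```python
import torch

def simultaneous_generation(llm, history: torch.Tensor,
                            query_vectors: torch.Tensor,
                            corpora: list[torch.Tensor]) -> list[torch.Tensor]:
    """One-pass generation sketch (no step-by-step decoding).

    history:       encoded user history, e.g. [seq_len, d].
    query_vectors: [N + 1, d] learnable queries (one CF dimension + N semantic dimensions).
    corpora:       N + 1 tensors of shape [num_items, d] holding every item's token
                   embedding for the corresponding dimension (extendable for new items).
    Returns per-dimension item scores, which can then be fused for item grounding.
    """
    # Assumed interface: the LLM emits one output embedding per query vector at once.
    outputs = llm(history, query_vectors)                   # [N + 1, d]
    return [outputs[i] @ corpora[i].T for i in range(len(corpora))]
```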

Extensive experiments on four datasets (Amazon Toys, Beauty, Sports, and Steam) demonstrate SETRec's effectiveness, efficiency, generalization ability, and scalability. It consistently outperforms strong baselines across various scenarios, including full ranking, warm-start, and particularly cold-start recommendations, and different item popularity groups. The simultaneous generation significantly boosts inference efficiency, and SETRec shows promising scalability on cold-start items with increasing LLM sizes (Qwen 1.5B to 7B), showcasing its ability to harness larger models' semantic understanding.

7.2. Limitations & Future Work

The authors identify several promising avenues for future research:

  1. Discrete Set Identifiers: While SETRec uses continuous tokens, exploring how discrete set identifiers (a set of order-agnostic discrete tokens) perform on generative recommendation is a valuable direction. This could potentially align even better with LLMs' pre-training tasks and fully utilize their embedded knowledge.
  2. Open-ended Recommendation: SETRec shows strong generalization ability in challenging scenarios. The authors suggest applying SETRec for open-ended recommendation in contexts with open-domain user behaviors. This implies tackling more complex, less structured recommendation tasks beyond fixed item catalogs.

7.3. Personal Insights & Critique

This paper presents a highly insightful and impactful contribution to LLM-based recommendation. The explicit articulation of the two design principles (information integration and order-agnosticism) provides a strong theoretical foundation for the set identifier paradigm. This moves beyond incremental improvements to existing tokenization schemes by fundamentally rethinking how items are represented and generated.

Innovations and Strengths:

  • Conceptual Clarity: The paper clearly identifies the core problems with existing item identifiers and proposes well-reasoned principles. The local optima problem and inference inefficiency are critical practical bottlenecks, and SETRec offers an elegant solution.

  • Comprehensive Information Capture: Integrating both CF and semantic information in a structured, multi-dimensional way is crucial. The order-agnostic set identifier prevents embedding collapse and allows for flexible representation of diverse item attributes.

  • Efficiency Boost: The simultaneous generation mechanism, enabled by order-agnosticism and the sparse attention mask, is a major practical advantage, making LLM-based generative recommendation much more feasible for real-world deployment. The significant speedups observed are compelling.

  • Cold-Start Performance: SETRec's strong performance on cold-start items and less popular groups highlights its robustness and generalization ability, which is a persistent challenge in recommender systems.

  • Scalability: The demonstrated scalability on cold-start items with increasing model size is a strong indicator of its future potential, as LLMs continue to grow.

Potential Issues/Areas for Improvement:

  • Defining N Semantic Dimensions: While the paper shows sensitivity analysis for N, the optimal number of semantic tokens (N) is dataset-dependent. Determining how to automatically or more adaptively define these N dimensions (and what each dimension represents) could be a complex research problem. Currently, it relies on empirical tuning. The conceptual idea of "latent semantic dimensions" is powerful, but their precise interpretation and optimal number remain somewhat opaque.

  • Interpretability of Semantic Tokens: While the semantic tokens capture rich information, their individual interpretability for human understanding might be limited. For debuggability or explaining recommendations, understanding what each semantic token z_{S_n} represents could be valuable.

  • Computational Cost of Tokenization: The semantic tokenizer involves training an Autoencoder. While this is done offline, the initial cost of generating CF and semantic embeddings for all items and constructing the token corpora should be considered, especially for very large and dynamic item catalogs.

  • Applicability of Learnable Query Vectors: The query-guided generation mechanism is novel. Further exploration into the design of these query vectors and their interaction with the LLM could reveal more insights. Are they truly learning to isolate specific dimensions, or are they acting as more general prompts?

Transferability and Future Value: SETRec's principles of order-agnosticism and multi-dimensional information integration are broadly transferable.

  • Beyond Recommendation: This paradigm could be applied to other generative tasks where composite entities need to be represented and generated (e.g., generating complex objects in creative AI, multimodal content generation).

  • Multimodal Recommendation: Integrating more modalities (images, audio) into the set identifier is a natural extension. Each modality could contribute its own order-agnostic token or set of tokens.

  • Dynamic Item Catalogs: For frequently changing item catalogs, efficient updates to the token corpora would be crucial. The extendable grounding heads already provide a good foundation for this.

Overall, SETRec offers a significant conceptual and practical advancement in LLM-based generative recommendation, addressing key limitations and opening new avenues for research into more flexible, efficient, and robust item representations.
