Order-agnostic Identifier for Large Language Model-based Generative Recommendation
TL;DR Summary
This paper presents an order-agnostic identifier design for LLM-based generative recommendation, addressing efficiency and performance issues. By integrating CF and semantic information through the SETRec framework, it significantly improves recommendation effectiveness and generation efficiency.
Abstract
Leveraging Large Language Models (LLMs) for generative recommendation has attracted significant research interest, where item tokenization is a critical step. It involves assigning item identifiers for LLMs to encode user history and generate the next item. Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings. Token-sequence identifiers face issues such as the local optima problem in beam search and low generation efficiency due to step-by-step generation. In contrast, single-token identifiers fail to capture rich semantics or encode Collaborative Filtering (CF) information, resulting in suboptimal performance. To address these issues, we propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information to fully capture multi-dimensional item information, and 2) designing order-agnostic identifiers without token dependency, mitigating the local optima issue and achieving simultaneous generation for generation efficiency. Accordingly, we introduce a novel set identifier paradigm for LLM-based generative recommendation, representing each item as a set of order-agnostic tokens. To implement this paradigm, we propose SETRec, which leverages CF and semantic tokenizers to obtain order-agnostic multi-dimensional tokens. To eliminate token dependency, SETRec uses a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. We instantiate SETRec on T5 and Qwen (from 1.5B to 7B). Extensive experiments demonstrate its effectiveness under various scenarios (e.g., full ranking, warm- and cold-start ranking, and various item popularity groups). Moreover, results validate SETRec's superior efficiency and show promising scalability on cold-start items as model sizes increase.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Order-agnostic Identifier for Large Language Model-based Generative Recommendation
1.2. Authors
- Xinyu Lin (xylin1028@gmail.com) - National University of Singapore, Singapore
- Haihan Shi (shh924@mail.ustc.edu.cn) - University of Science and Technology of China, Hefei, China
- Wenjie Wang (wenjiewang96@gmail.com) - University of Science and Technology of China, Hefei, China
- Fuli Feng (fulifeng93@gmail.com) - University of Science and Technology of China, Hefei, China
- Qifan Wang (wqfcr@meta.com) - Meta AI, Menlo Park, USA
- See-Kiong Ng (seekiong@nus.edu.sg) - National University of Singapore, Singapore
- Tat-Seng Chua (dcscts@nus.edu.sg) - National University of Singapore, Singapore
1.3. Journal/Conference
Published at SIGIR '25, July 13-18, 2025, Padua, Italy. The ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) is a premier international forum for the presentation of new research results and for the demonstration of new systems and techniques in information retrieval and recommender systems. Its reputation is highly influential in the relevant field, indicating a high-quality publication.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses critical issues in item tokenization for Large Language Model (LLM)-based generative recommendation. It identifies problems with existing token-sequence identifiers (local optima in beam search, low generation efficiency) and single-token identifiers (failure to capture rich semantics or Collaborative Filtering (CF) information). To overcome these, the authors propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information, and 2) designing order-agnostic identifiers without token dependency to mitigate local optima and enable simultaneous generation.
Based on these principles, the paper introduces a novel set identifier paradigm, representing each item as a set of order-agnostic tokens. To implement this, they propose SETRec, which uses CF and semantic tokenizers to obtain multi-dimensional tokens. SETRec eliminates token dependency through a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. The method is instantiated on T5 and Qwen (1.5B to 7B models). Extensive experiments on four datasets demonstrate its effectiveness across various scenarios (full ranking, warm-/cold-start, item popularity groups), superior efficiency, and promising scalability for cold-start items with increasing model sizes.
1.6. Original Source Link
https://arxiv.org/abs/2502.10833v2 (abstract page); https://arxiv.org/pdf/2502.10833v2.pdf (PDF). The links point to the arXiv preprint version of the paper accepted at SIGIR '25.
2. Executive Summary
2.1. Background & Motivation
The recent success of Large Language Models (LLMs) in personalized recommendation has sparked significant research interest. A critical step in LLM-based generative recommendation is item tokenization, which involves assigning unique identifiers to items. These identifiers allow LLMs to encode a user's historical interactions and generate the next recommended item.
However, existing item tokenization approaches face significant challenges:
- Token-sequence identifiers: These represent items as sequences of discrete tokens (e.g., item titles, generated tags).
  - Local Optima Problem: When LLMs use beam search for autoregressive generation (generating tokens one by one), they greedily select sequences with the highest probabilities. If the initial tokens of a target item identifier have low probabilities, they might be pruned early, preventing the correct item from ever being generated, even if the complete sequence is highly relevant. This leads to sub-optimal recommendations.
  - Low Generation Efficiency: Autoregressive generation requires multiple, sequential LLM calls for each token in the sequence. This is computationally expensive and slow, posing a major barrier to real-world deployment, especially for large models.
- Single-token identifiers: These represent each item with a single continuous token, typically an ID embedding or a semantic embedding.
  - Suboptimal Performance:
    - ID embeddings (e.g., from Collaborative Filtering models) heavily rely on abundant interaction data. They struggle with long-tailed items (items with few interactions) or cold-start items (new items with no interactions), as there isn't enough data to learn meaningful representations.
    - Semantic embeddings (e.g., from pre-trained text encoders) capture rich item content but often overlook Collaborative Filtering (CF) information, which is crucial for personalized recommendations based on user behavior patterns.

The core problem the paper aims to solve is how to design item identifiers that enable effective and efficient LLM-based recommendations, overcoming the limitations of both token-sequence and single-token approaches. The paper's entry point is the recognition that item identifiers need to capture both CF and semantic information in an order-agnostic manner to address these issues.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
- Fundamental Principles for Item Identifier Design: The authors propose two key principles:
  - Integration of semantic and CF information: This allows leveraging LLMs' knowledge for generalization (e.g., cold-start) while incorporating user behavior for rich personalization.
  - Order-agnostic Identifier: This principle suggests representing multi-dimensional item information as a set of tokens without inherent ordering, thereby mitigating the local optima problem and facilitating simultaneous generation for efficiency.
- Novel Set Identifier Paradigm: Based on these principles, the paper introduces a new paradigm for LLM-based generative recommendation, where each item is represented as a set of order-agnostic tokens that integrate both CF and semantic information.
- SETRec Framework: The paper proposes SETRec as an effective implementation of this new paradigm. Key technical innovations within SETRec include:
  - CF and Semantic Tokenizers: To obtain order-agnostic multi-dimensional tokens.
  - Sparse Attention Mask: For user history encoding, it discards token dependencies within an item's identifier while retaining dependencies on previous item identifiers, ensuring order agnosticism and boosting efficiency.
  - Query-Guided Generation Mechanism: Employs learnable query vectors to guide LLMs to simultaneously generate tokens for each specific information dimension, addressing the challenge of generating multiple independent tokens.
  - Token Set Grounding Strategy: Collects tokens from all items as grounding heads to effectively map generated token sets to existing items.
- Extensive Experimental Validation: SETRec is instantiated on T5 and Qwen (from 1.5B to 7B models) and evaluated on four real-world datasets. The main findings include:
  - Effectiveness: SETRec significantly outperforms existing baselines across various scenarios, including full ranking, warm-start, and particularly cold-start recommendations, as well as different item popularity groups.
  - Efficiency: SETRec demonstrates superior inference efficiency, achieving substantial speedups over token-sequence identifiers (e.g., on Toys) thanks to simultaneous generation.
  - Generalization and Scalability: SETRec shows strong generalization across different LLM architectures and promising scalability on cold-start items as model sizes increase, suggesting that larger models can better leverage its semantic understanding capabilities.

These findings collectively address the identified problems by offering a more robust, efficient, and semantically rich item tokenization strategy for LLM-based generative recommendation.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Large Language Models (LLMs)
Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the transformer architecture, that have been trained on vast amounts of text data (e.g., books, articles, websites). This pre-training allows them to learn complex patterns, grammar, and world knowledge, making them proficient in various natural language processing tasks like text generation, translation, summarization, and question answering. In the context of recommendation, LLMs are leveraged for their ability to understand complex user behaviors and diverse item characteristics, often by treating recommendation as a sequence generation task.
Generative Recommendation
Generative recommendation is an emerging paradigm in recommender systems where, instead of merely predicting a score for existing items or retrieving items from a candidate pool, the model directly generates the identifiers or representations of items that a user might like. This often involves using LLMs to generate tokens that correspond to items or item attributes, allowing for more flexible and potentially novel recommendations.
Item Tokenization
Item tokenization is the process of converting items (e.g., products, movies, articles) into a format that LLMs can understand and process. This usually means representing each item as one or more numerical tokens or embeddings. It's a critical step because the quality of these item identifiers directly impacts the LLM's ability to encode user history accurately and generate relevant recommendations.
Autoregressive Generation
Autoregressive generation is a common method for sequence generation, where each token in a sequence is generated one at a time, conditioned on all previously generated tokens and the input context. For example, when generating a sentence, the model predicts the first word, then the second word based on the first, then the third based on the first two, and so on. In LLM-based recommendation using token-sequence identifiers, autoregressive generation is used to generate the sequence of tokens that represents the recommended item.
Beam Search
Beam search is a heuristic search algorithm often used in sequence generation tasks (like autoregressive generation) to find the most probable sequence of tokens. Instead of exploring all possible token combinations (which is computationally prohibitive), beam search keeps track of the most promising partial sequences (the "beam") at each step. When generating the next token, it extends all partial sequences with all possible next tokens, then selects the top new partial sequences to continue the search.
- Local Optima Problem: A significant drawback of beam search is its greedy nature. If the globally optimal sequence (e.g., the identifier for the target item) starts with a token that has a relatively low probability at an early step, that partial sequence might be discarded from the beam, preventing the model from ever reaching the true target item, even if the subsequent tokens would have made it a high-probability sequence. This is what the paper refers to as the local optima problem.
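To make the pruning effect concrete, here is a tiny self-contained Python sketch with made-up step-wise probabilities (not taken from the paper): the globally best two-token sequence starts with the weaker first token, so a beam of size 1 never recovers it.

```python
# Toy illustration (hypothetical numbers): beam search can prune the globally
# best two-token sequence when its first token has low probability.
step1 = {"A": 0.6, "B": 0.4}                     # P(first token)
step2 = {"A": {"X": 0.5, "Y": 0.5},              # P(second token | first token)
         "B": {"X": 0.95, "Y": 0.05}}

def beam_search(beam_size):
    # keep only the top-`beam_size` prefixes after the first step
    kept = sorted(step1, key=step1.get, reverse=True)[:beam_size]
    candidates = [(p + t, step1[p] * step2[p][t]) for p in kept for t in step2[p]]
    return max(candidates, key=lambda c: c[1])

best = max(((p + t, step1[p] * step2[p][t]) for p in step1 for t in step2[p]),
           key=lambda c: c[1])
print("exhaustive best :", best)            # ('BX', 0.38)
print("beam size = 1   :", beam_search(1))  # ('AX', 0.30) -- 'BX' was pruned early
print("beam size = 2   :", beam_search(2))  # ('BX', 0.38) -- recovered
```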
Collaborative Filtering (CF)
Collaborative Filtering (CF) is a traditional and highly effective technique in recommender systems that makes recommendations based on the preferences or behaviors of similar users or items. The core idea is that if users have agreed in the past (e.g., by buying the same items), they will agree in the future. CF methods typically rely on user-item interaction data (e.g., ratings, purchases, clicks) to find these similarities.
- CF information: Refers to the patterns and similarities derived from user-item interactions, which are essential for personalized recommendations.
Semantic Embeddings
Semantic embeddings are numerical vector representations of items that capture their inherent meaning or characteristics. These embeddings are typically learned from textual descriptions (e.g., titles, descriptions, categories) or other rich metadata associated with items. Models like SentenceT5 (mentioned in the paper) can generate such embeddings. Semantic embeddings are valuable because they can generalize to cold-start items (items with no interaction history) by using their content information.
Attention Mechanism
The attention mechanism is a core component of transformer models, including LLMs. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing a specific token. Instead of treating all input tokens equally, attention enables the model to focus on the most relevant parts.
The general formula for Scaled Dot-Product Attention (a common form of attention) is:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
- $Q, K \in \mathbb{R}^{n \times d_k}$, where $n$ is the sequence length and $d_k$ is the dimension of the keys (and queries).
- $QK^T$ calculates the dot-product similarity between each query and all keys.
- $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax function into regions with very small gradients.
- $\mathrm{softmax}(\cdot)$ normalizes the scores to obtain attention weights.
- The result is a weighted sum of the Value vectors, where the weights indicate the importance of each Value to the Query.
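A minimal NumPy sketch of the formula above, with arbitrary shapes, makes the softmax weighting concrete:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```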
Autoencoders (AE)
An Autoencoder (AE) is a type of neural network used for unsupervised learning of efficient data codings (representations). It consists of two main parts:
- Encoder: Maps the input data into a lower-dimensional latent space representation (the embedding).
- Decoder: Reconstructs the input data from the latent space representation.

The AE is trained to minimize the reconstruction error between the input and its reconstructed output. In this paper, an AE is used as a semantic tokenizer to compress rich semantic information into a set of order-agnostic semantic embeddings.
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique for large pre-trained models like LLMs. Instead of fine-tuning all parameters of the large model, LoRA injects small, trainable low-rank matrices into the transformer blocks (specifically, in the attention layers). During fine-tuning, the original pre-trained weights remain frozen, and only these much smaller LoRA matrices are updated. This significantly reduces the number of trainable parameters and computational cost, making it feasible to fine-tune very large models on domain-specific tasks with limited resources. The paper uses LoRA for fine-tuning Qwen models.
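As a rough illustration of how such adapters are attached in practice with Hugging Face's peft library (the model name, rank, scaling, and target modules below are assumptions made for this sketch, not the paper's reported configuration):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")  # assumed checkpoint
lora = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
)
model = get_peft_model(base, lora)        # frozen base + small trainable adapters
model.print_trainable_parameters()        # only the LoRA matrices are trainable
```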
3.2. Previous Works
The paper categorizes previous works on item identifiers for LLM-based generative recommendation into two main groups:
Token-Sequence Identifiers
These methods represent each item as a sequence of discrete tokens. The LLM then generates this sequence token by token.
- Based on Human Vocabulary:
- BIGRec [1]: Uses item titles as identifiers. The tokens are taken directly from human language, allowing LLMs to leverage their inherent linguistic knowledge.
- IDGenRec [31]: Aims to learn concise but informative tags from human vocabulary to represent each item. It's a learnable ID generator.
- Pros: Can leverage LLMs' rich world knowledge (especially for human-readable tokens), potentially offering better generalization on cold-start items if semantic information is strong.
- Cons: Suffer from the local optima problem in beam search and low generation efficiency due to autoregressive generation.
- Based on External Tokens:
- CID [9]: Leverages hierarchical clustering to obtain token sequences. It uses an item co-occurrence matrix to ensure items with similar interactions share similar tokens.
- SemID [9]: Represents items with external token sequences derived from hierarchical item categories.
- TIGER [26]: Employs RQ-VAE (Residual Quantization Variational Autoencoder) with codebooks to quantize item semantic information into token sequences. The identifier sequentially contains coarse-grained to fine-grained information.
- LETTER [36]: A state-of-the-art method that incorporates both semantic and CF information into RQ-VAE training to create multi-dimensional identifiers, aiming for improved diversity.
- Pros: Can encode rich, hierarchical information, potentially including both CF and semantic cues.
- Cons: Still face the local optima problem and inference inefficiency of autoregressive generation. External tokens might not align well with LLMs' pre-trained knowledge, requiring extensive interaction data for training.
Single-Token Identifiers
These methods represent each item with a single continuous token (an embedding). The LLM generates this embedding, which is then mapped to an actual item.
- ID Embedding-based:
  - DreamRec [46]: Leverages an ID embedding to represent each item and uses a diffusion model to refine the ID embedding generated by LLMs.
  - E4SRec [14]: Utilizes a pre-trained CF model to obtain ID embeddings and then uses a linear projection layer to map generated embeddings to item scores efficiently.
  - Pros: Improves inference efficiency by bypassing token-by-token autoregressive generation.
  - Cons: Rely heavily on sufficient interactions to capture CF information, making them vulnerable to long-tailed or cold-start items. A single embedding might not capture rich, multi-dimensional item information effectively.
- Semantic Embedding-based:
  - LITE-LLM4Rec [34]: Uses semantic embeddings.
  - Pros: Can leverage semantic information for cold-start recommendations.
  - Cons: Often overlooks the crucial CF information necessary for personalized recommendations, leading to suboptimal performance.
3.3. Technological Evolution
The evolution of recommender systems has seen a shift from traditional Collaborative Filtering and content-based methods to deep learning approaches, and more recently, the integration of Large Language Models (LLMs). Initially, LLMs were used to enhance existing recommenders (discriminative recommendation) by providing better feature representations or assisting in tasks like feature engineering. The current frontier involves using LLMs directly as recommender models (generative recommendation).
Within LLM-based generative recommendation, item tokenization has evolved:
- Initial approaches used human-readable token sequences (e.g., titles) to leverage LLMs' linguistic capabilities.
- Later works introduced external tokens and hierarchical token sequences to encode more complex CF and semantic information, often trained with VQ-VAE or RQ-VAE-like structures.
- To address efficiency concerns, single-token identifiers (ID or semantic embeddings) emerged, sacrificing some expressiveness for speed.

This paper's work (SETRec) fits into this timeline by attempting to synthesize the best aspects of these approaches. It aims to combine the rich information capture of multi-dimensional tokenization with the efficiency of single-token generation, while addressing the local optima and inference efficiency challenges that plague existing token-sequence methods, and the information-loss issues of single-token methods. It pushes the boundary by introducing order-agnostic set identifiers and simultaneous generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, SETRec introduces several core differences and innovations:
- Addressing Local Optima and Efficiency: Unlike token-sequence identifiers (e.g., BIGRec, IDGenRec, CID, SemID, TIGER, LETTER), SETRec proposes an order-agnostic identifier paradigm that eliminates sequential token dependencies within an item's identifier. This fundamentally avoids the local optima problem inherent in beam search for autoregressive generation and significantly boosts inference efficiency through simultaneous generation. Existing token-sequence methods, even advanced ones like LETTER, are still subject to these issues.
- Comprehensive Information Integration: While some token-sequence methods (e.g., LETTER, TIGER) attempt to integrate CF and semantic information, they often do so by encoding it into a single ordered sequence of tokens. SETRec explicitly integrates both CF and semantic information into a set of distinct, order-agnostic tokens, allowing for clearer separation and representation of these multi-dimensional aspects without forced dependencies.
- Overcoming Single-Token Limitations: In contrast to single-token identifiers (e.g., DreamRec, E4SRec), which either lack rich semantic information (ID embeddings) or CF information (semantic embeddings), SETRec uses a set of tokens for each item. This set explicitly includes both a CF embedding and multiple semantic embeddings, ensuring that both types of crucial information are captured comprehensively, leading to better performance, especially in cold-start scenarios where CF is scarce.
- Novel Generation Mechanism: SETRec introduces a query-guided generation mechanism combined with a sparse attention mask. This is a novel way to guide LLMs to generate multiple order-agnostic tokens simultaneously, each aligned with a specific information dimension (CF or different semantic aspects), without introducing spurious dependencies during user history encoding. This differs from prior work that either generates a single embedding or relies on sequential, autoregressive token generation.

In essence, SETRec innovates by shifting from sequential, dependent item representations to a parallel, independent set representation, providing a robust solution for information integration, generation efficiency, and overcoming the local optima challenge.
4. Methodology
The core idea of SETRec is to design item identifiers that are order-agnostic and integrate both Collaborative Filtering (CF) and semantic information. This allows for efficient and effective LLM-based generative recommendation by mitigating the local optima problem and enabling simultaneous generation.
4.1. Principles
The two fundamental principles guiding SETRec's design are:
- Integration of semantic and CF information: Items possess multi-dimensional information. Semantic information (e.g., item descriptions, categories) is crucial for generalization, especially for cold-start items, by leveraging the rich knowledge within LLMs. CF information (derived from user interactions) is essential for personalized recommendations, capturing user preferences from behavioral patterns. SETRec aims to combine both to achieve comprehensive item representation.
- Order-agnostic Identifier: Representing multi-dimensional item information with a single token can lead to embedding collapse or loss of detail. While using multiple tokens (a sequence) can capture more information, an ordered sequence introduces unnecessary dependencies that can lead to the local optima problem and hinder generation efficiency (due to autoregressive generation). The principle of order-agnostic identifiers suggests that if different dimensions of item information (e.g., "price" and "category") are inherently independent, their representation should also be independent. This allows simultaneous generation of the independent tokens, significantly improving inference speed.
4.2. Core Methodology In-depth (Layer by Layer)
SETRec implements the set identifier paradigm through two main stages: order-agnostic item tokenization and simultaneous item generation.
4.2.1. Order-agnostic Item Tokenization
This stage focuses on converting each item into a set identifier comprising CF and semantic embeddings.
CF Tokenizer
To incorporate Collaborative Filtering (CF) information, SETRec uses a pre-trained conventional recommender model (like SASRec [10]) to generate an item's CF embedding. This embedding is then passed through a linear projection layer to obtain the final CF token $z_{\mathrm{CF}} \in \mathbb{R}^{d}$, where $d$ is the hidden dimension of the LLM used.
- Purpose: This CF token helps LLM-based recommenders to provide accurate recommendations for users and items with rich interaction histories.
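A minimal PyTorch sketch of this projection step (the dimensions are assumptions; in practice the item embedding would come from a frozen pre-trained SASRec):

```python
import torch
import torch.nn as nn

cf_dim, llm_dim = 64, 512                      # assumed SASRec / LLM hidden sizes
cf_item_embedding = torch.randn(1, cf_dim)     # stand-in for a pre-trained CF embedding
project = nn.Linear(cf_dim, llm_dim)           # trainable projection into the LLM space
z_cf = project(cf_item_embedding)              # CF token z_CF, shape (1, llm_dim)
print(z_cf.shape)
```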
Semantic Tokenizer
To capture rich semantic information, SETRec uses a semantic tokenizer to generate a set of semantic embeddings.
- Semantic Representation Extraction: Given an item's semantic information (e.g., title, categories), a pre-trained semantic extractor (e.g., SentenceT5 [23]) first extracts a high-dimensional semantic representation $s$.
- Multi-dimensional Semantic Embedding via Autoencoder (AE): Instead of compressing $s$ into a single embedding (which could lead to embedding collapse and loss of fine-grained information), SETRec tokenizes $s$ into $N$ order-agnostic semantic embeddings using a unified Autoencoder (AE): $ z = \mathrm{Encoder}(s) = [z_{S_1}, \ldots, z_{S_N}] $ Where:
  - $s$: The input high-dimensional semantic representation of an item.
  - $\mathrm{Encoder}(\cdot)$: The encoder component of the Autoencoder.
  - $z$: The concatenated semantic embeddings.
  - $z_{S_k}$: The $k$-th semantic embedding, representing a distinct latent semantic dimension (e.g., aspects like "brand", "price", "material"). Each $z_{S_k}$ is an order-agnostic token. The use of a unified AE (rather than $N$ separate AEs) reduces model parameters and improves training stability.
- Reconstruction Loss: To ensure these semantic embeddings preserve useful information, the AE is trained with a reconstruction loss: $ \mathcal{L}_{\mathrm{AE}} = \| s - \hat{s} \|_2^2 $ Where:
  - $s$: The original semantic representation.
  - $\hat{s} = \mathrm{Decoder}(z)$: The reconstructed semantic representation obtained by passing the concatenated semantic embeddings through the Decoder component of the Autoencoder.
  - $\|\cdot\|_2^2$: The squared L2 norm, i.e., the squared error between the original and reconstructed representations.
  - Purpose: This loss encourages the AE to learn meaningful and reconstructable semantic embeddings.
Token Corpus
After item tokenization, each item is represented by a set identifier $\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}$.
All tokens generated across all items are collected to form a token corpus $\mathcal{Z}_k$ for each information dimension $k$.
- Purpose: These token corpora serve as grounding heads during the simultaneous item generation phase to map generated embeddings back to existing items (a sketch of the semantic tokenizer follows below).
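A compact sketch of the AE-based semantic tokenizer described above, assuming illustrative layer sizes and $N = 4$ semantic tokens (the actual sizes are hyper-parameters of the paper):

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Encode a semantic vector s into N order-agnostic embeddings and reconstruct s."""
    def __init__(self, sem_dim=768, llm_dim=512, n_tokens=4):
        super().__init__()
        self.n_tokens = n_tokens
        self.encoder = nn.Sequential(nn.Linear(sem_dim, 512), nn.ReLU(),
                                     nn.Linear(512, n_tokens * llm_dim))
        self.decoder = nn.Sequential(nn.Linear(n_tokens * llm_dim, 512), nn.ReLU(),
                                     nn.Linear(512, sem_dim))

    def forward(self, s):
        z = self.encoder(s)                                       # concatenated embeddings
        tokens = z.chunk(self.n_tokens, dim=-1)                   # N order-agnostic tokens z_{S_1..S_N}
        recon_loss = ((s - self.decoder(z)) ** 2).sum(-1).mean()  # reconstruction loss L_AE
        return tokens, recon_loss

s = torch.randn(8, 768)                          # e.g., SentenceT5 embeddings for 8 items
tokens, loss = SemanticTokenizer()(s)
print(len(tokens), tokens[0].shape, loss.item()) # 4 tokens of shape (8, 512)
```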
4.2.2. Simultaneous Item Generation
This stage focuses on efficiently generating the set identifier for the next recommended item.
Query-guided Generation
To enable simultaneous generation and ensure that each generated token aligns with its specific information dimension (CF or a particular semantic aspect), SETRec introduces learnable query vectors. These vectors guide the LLM during generation.
- Input to LLM: The user's historical interactions are transformed into an identifier sequence. For a user with $L$ historical interactions, the input to the LLM becomes: $ x = [\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^1, \ldots, \{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^L] $ Where $\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}^l$ represents the set identifier for the $l$-th item in the user's history.
- Token Generation: For each dimension $k$, the corresponding token is generated by the LLM layers conditioned on the transformed user history and a specific learnable query vector: $ \hat{z}_k = \mathrm{LLM\_Layers}(x, q_k) $ Where:
  - $\mathrm{LLM\_Layers}(\cdot)$: Represents the attention layers of the LLM.
  - $q_k$: A learnable query vector specifically for dimension $k$. This vector acts as a prompt, guiding the LLM to focus on generating information relevant to that dimension.
  - $\hat{z}_k$: The generated continuous token (embedding) for dimension $k$.

This process generates a set identifier $\{\hat{z}_{\mathrm{CF}}, \hat{z}_{S_1}, \ldots, \hat{z}_{S_N}\}$ for the next item (see the sketch below).
Token Generation Optimization
To train the model to generate accurate tokens for each dimension, a specific loss function is used, encouraging the generated token to be similar to the target token for its dimension while being dissimilar to other tokens of that dimension. This is a form of contrastive learning:
$
\mathcal{L}_{\mathrm{Gen}} = - \frac{1}{|\mathcal{D}|} \sum_{\mathcal{D}} \sum_{k \in \mathcal{F}} \log \frac{\exp(\operatorname{sim}(\hat{z}_k, z_k))}{\sum_{z \in \mathcal{Z}_k} \exp(\operatorname{sim}(\hat{z}_k, z))}
$
Where:
- $|\mathcal{D}|$: The total number of user interaction sequences in the dataset.
- $\mathcal{F}$: The set of all information dimensions.
- $\hat{z}_k$: The generated token for dimension $k$.
- $z_k$: The true target token for dimension $k$ (from the ground-truth item).
- $\operatorname{sim}(\cdot, \cdot)$: A similarity function, typically inner product or cosine similarity.
- $\mathcal{Z}_k$: The token corpus for dimension $k$, containing all possible tokens for that dimension from all items.
- Purpose: This loss maximizes the similarity between the generated token $\hat{z}_k$ and the ground-truth token $z_k$ for dimension $k$, while minimizing similarity with all other tokens in the corpus for that same dimension. This pushes the model to generate tokens that are distinct and accurate for each dimension.
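Written for a single dimension, this objective reduces to a softmax cross-entropy over similarities to every token in that dimension's corpus. A hedged sketch (shapes and the inner-product similarity are assumptions):

```python
import torch
import torch.nn.functional as F

def generation_loss(z_hat, corpus, target_idx):
    """z_hat: (B, d) generated tokens; corpus: (M, d) all tokens Z_k of dimension k;
    target_idx: (B,) index of the ground-truth token within the corpus."""
    logits = z_hat @ corpus.T                    # sim(\hat z_k, z) for every corpus token
    return F.cross_entropy(logits, target_idx)   # -log softmax at the target token

z_hat = torch.randn(4, 512)
corpus = torch.randn(1000, 512)                  # token corpus Z_k for dimension k
target = torch.randint(0, 1000, (4,))
print(generation_loss(z_hat, corpus, target).item())
```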
Token Generation Grounding
After generating the set identifier $\{\hat{z}_{\mathrm{CF}}, \hat{z}_{S_1}, \ldots, \hat{z}_{S_N}\}$, the next step is to map these generated embeddings to existing items. This is challenging because the number of possible token combinations can be vast. SETRec addresses this with a token set grounding strategy that leverages the pre-computed token corpora as grounding heads.
The scores for each item are obtained as follows:
$
\begin{cases}
s_k = W_k \hat{z}_k \\
s = (1 - \beta)\, s_{\mathrm{CF}} + \beta \sum_{k \in \mathcal{F} \setminus \{\mathrm{CF}\}} s_k
\end{cases}
$
Where:
- $s_k$: A vector where each element represents the score of an item based on the generated token for dimension $k$.
- $W_k$: The matrix formed by stacking the tokens from the token corpus $\mathcal{Z}_k$ (i.e., $W_k$ contains the dimension-$k$ token of every item). Multiplying $\hat{z}_k$ by $W_k$ performs a similarity lookup against all items' tokens for dimension $k$.
- $s$: The final score vector over all items.
- $\beta$: A hyper-parameter that balances the contribution of the CF scores ($s_{\mathrm{CF}}$) and the semantic scores ($s_k$ for the semantic dimensions).
- $\mathcal{F} \setminus \{\mathrm{CF}\}$: The set of all semantic dimensions ($S_1, \ldots, S_N$).
- Purpose: This strategy computes a score for every item based on its match with each generated dimension's token, and the final item scores are a weighted sum of these CF and semantic scores. The approach is extendable to new items (e.g., cold-start items) because their semantic tokens are available in the token corpora, allowing them to be scored even without historical CF interactions (a sketch follows below).
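A sketch of this scoring step with assumed shapes and an arbitrary $\beta$:

```python
import torch

n_items, d, beta = 1000, 512, 0.5
# Grounding heads: per-dimension matrices stacking every item's token of that dimension.
W_cf = torch.randn(n_items, d)                        # CF tokens of all items
W_sem = [torch.randn(n_items, d) for _ in range(4)]   # semantic tokens z_{S_1..S_4} of all items

z_hat_cf = torch.randn(d)                             # generated CF token
z_hat_sem = [torch.randn(d) for _ in range(4)]        # generated semantic tokens

s_cf = W_cf @ z_hat_cf                                # per-item CF scores
s_sem = sum(W @ z for W, z in zip(W_sem, z_hat_sem))  # summed per-item semantic scores
scores = (1 - beta) * s_cf + beta * s_sem             # final item scores
print(scores.topk(10).indices)                        # top-10 recommended item ids
```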
Sparse Attention Mask
To realize the order-agnostic principle for user history encoding and boost efficiency, SETRec employs a sparse attention mask.
- Problem: In the transformed user history $x$, the tokens for a single item (e.g., $\{z_{\mathrm{CF}}, z_{S_1}, \ldots, z_{S_N}\}$) are concatenated. Without a special mask, the attention mechanism might create spurious dependencies between these tokens within the same item's identifier (e.g., a semantic token attending to a CF token), violating the order-agnostic principle.
- Solution: The sparse attention mask is designed to:
  - Eliminate Intra-item Dependencies: Tokens belonging to the same item's identifier cannot attend to each other. This ensures that the components of an item's set identifier remain independent during encoding.
  - Retain Inter-item Dependencies: All tokens can attend to all tokens from previously interacted items. This preserves the sequential nature of user history, allowing the LLM to understand how past items influence future preferences. This is illustrated in Figure 5: Figure 5(a) shows the original attention mask, where tokens within an item can attend to each other, creating unwanted dependencies; Figure 5(b) shows the sparse attention mask, where attention within an item is blocked while attention to previous items is allowed.
- Time Complexity Analysis: The sparse attention mask also significantly improves generation efficiency.
  - For a sequence of $L$ historical items, each represented by $N+1$ tokens (one CF token plus $N$ semantic tokens), the total length of the flattened input sequence is $L(N+1)$.
  - With the original (dense) attention mask, the attention cost grows quadratically with this flattened sequence length.
  - With the proposed sparse attention mask, attention among an item's own tokens is removed, which reduces the attention computation; more importantly, the main efficiency gain comes from enabling simultaneous generation, which replaces many sequential LLM calls with a single forward pass. A sketch of the mask construction follows below.
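A sketch of how such a mask could be constructed (a boolean matrix where True means "may attend"; keeping self-attention on the diagonal is an assumption of this sketch, and the paper's exact masking may differ):

```python
import numpy as np

def sparse_set_mask(n_items, tokens_per_item):
    """Allow attention to all tokens of *previous* items, block attention among
    tokens inside the same item's set identifier."""
    L = n_items * tokens_per_item
    item_of = np.arange(L) // tokens_per_item     # which item each position belongs to
    allow = item_of[:, None] > item_of[None, :]   # tokens of earlier items are visible
    allow |= np.eye(L, dtype=bool)                # each token still sees itself
    return allow

print(sparse_set_mask(n_items=3, tokens_per_item=2).astype(int))
```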
4.2.3. Instantiation
To train SETRec, the CF and semantic tokenizers, the learnable query vectors, and the LLM parameters are optimized jointly by minimizing a combined loss function:
$
\mathcal{L} = \mathcal{L}_{\mathrm{Gen}} + \alpha \mathcal{L}_{\mathrm{AE}}
$
Where:
- $\mathcal{L}_{\mathrm{Gen}}$: The token generation optimization loss (Equation 6), which encourages accurate and distinct token generation for each dimension.
- $\mathcal{L}_{\mathrm{AE}}$: The Autoencoder reconstruction loss (Equation 4), which ensures the semantic embeddings preserve rich information.
- $\alpha$: A hyper-parameter that controls the relative weight of the AE reconstruction loss during training. A higher $\alpha$ places more emphasis on accurate reconstruction of semantic information.

Inference Process:
1. Item Tokenization: All available items are first tokenized into set identifiers, and the token corpora are formed.
2. User History Transformation: A user's historical interaction sequence is transformed into the input $x$ using these set identifiers.
3. Simultaneous Generation: The LLM performs query-guided simultaneous generation with the sparse attention mask (Equation 5) to generate the set identifier for the next recommended item. This step generates all tokens in parallel in a single LLM forward pass.
4. Item Grounding: The generated token set is then grounded to existing items using the token corpora as extendable grounding heads (Equation 7), producing final scores for all candidate items.
5. Ranking: Items are ranked based on their scores, and the top-ranked items are recommended.

This entire process ensures that SETRec is both effective (integrating multi-dimensional information) and efficient (simultaneous generation, sparse attention).
5. Experimental Setup
5.1. Datasets
The experiments are conducted on four real-world datasets from various domains:
- Amazon Review Datasets: These datasets contain rich user interactions (e.g., purchases, reviews) and extensive textual metadata (title, description, category, brand) for items.
  - Toys: Products in the Toys category.
  - Beauty: Products in the Beauty category.
  - Sports: Products in the Sports category.
- Steam: A video games dataset proposed in [10], which includes substantial user interactions and abundant textual semantic information about video games.

For all datasets, the authors follow previous work [37] for preprocessing:
- User interactions are sorted chronologically according to their timestamps.
- The data is split into training, validation, and testing sets with a ratio of 8:1:1, respectively.
- Items are categorized into warm items (those appearing in the training set) and cold items (those not appearing in the training set). This allows for evaluating the model's performance on items with varying levels of interaction history.

These datasets are well-suited for validating the method because they provide diverse domains, rich semantic information necessary for LLM-based recommendation, and sufficient user interaction data for CF and for training/testing warm-start and cold-start scenarios.
5.2. Evaluation Metrics
The experiments use two widely accepted metrics for evaluating recommender systems, Recall@K and NDCG@K, with K = 5 and K = 10. Additionally, evaluations are performed under three distinct settings:
- All Items: Evaluation over the entire set of items.
- Warm Items Only: Evaluation focusing only on items that appeared in the training set.
- Cold Items Only: Evaluation focusing only on items that did not appear in the training set (i.e., new items).
Recall@K
Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved (recommended) among the top recommendations. It focuses on the completeness of the recommendations – how many of the truly relevant items were found by the system within the top list.
Mathematical Formula: $ \mathrm{Recall@K} = \frac{\sum_{u \in U} |{\text{recommended items for } u \text{ in top K}} \cap {\text{relevant items for } u}|}{\sum_{u \in U} |{\text{relevant items for } u}|} $
Symbol Explanation:
- $U$: The set of all users in the test set.
- $u$: A specific user.
- recommended items for $u$ in top K: The top-$K$ items recommended to user $u$.
- relevant items for $u$: The set of all items that are truly relevant to user $u$ (e.g., items the user actually interacted with in the test set).
- $|\cdot|$: The cardinality (number of elements) of a set.
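A small sketch of Recall@K for a single user's ranked list:

```python
def recall_at_k(recommended, relevant, k):
    """recommended: ranked item ids; relevant: set of ground-truth item ids."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

print(recall_at_k(recommended=[7, 3, 9, 1, 5], relevant={9}, k=5))  # 1.0
print(recall_at_k(recommended=[7, 3, 9, 1, 5], relevant={2}, k=5))  # 0.0
```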
NDCG@K (Normalized Discounted Cumulative Gain at K)
Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher up in the list (i.e., discounted cumulative gain) and normalizes these scores by comparing them to an ideal ranking where all relevant items are perfectly ordered (ideal discounted cumulative gain). NDCG emphasizes getting relevant items to the top of the list.
Mathematical Formula: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $ where $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ and $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i^{\mathrm{ideal}}} - 1}{\log_2(i+1)} $
Symbol Explanation:
- $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$. It sums the relevance scores of items in the recommended list, penalizing items that appear lower.
- $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at position $K$. This is the maximum possible DCG if the recommended list were perfectly ordered by relevance; it serves as a normalization factor.
- $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the actual recommended list. For binary relevance (relevant/not relevant), $\mathrm{rel}_i$ is typically 1 if the item is relevant, and 0 otherwise.
- $\mathrm{rel}_i^{\mathrm{ideal}}$: The relevance score of the item at position $i$ in the ideal recommended list (where all relevant items are ranked before non-relevant ones, and ties are broken arbitrarily).
- $\log_2(i+1)$: A logarithmic discount factor that reduces the contribution of items further down the list.
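A matching sketch of NDCG@K with binary relevance:

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """Binary relevance: rel_i = 1 if the i-th recommended item is in `relevant`."""
    dcg = sum(1.0 / math.log2(i + 2)               # 1-based position i+1 -> log2(i+2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([7, 3, 9, 1, 5], {9}, k=5))        # hit at rank 3 -> 1 / log2(4) = 0.5
```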
5.3. Baselines
The paper compares SETRec against a comprehensive set of competitive baselines, categorized by their item identifier type:
Single-Token Identifiers
- DreamRec [46]: Leverages ID embeddings for items. It employs a diffusion model to refine the ID embedding generated by LLMs, aiming to improve the quality of generated item representations.
- E4SRec [14]: Uses a pre-trained Collaborative Filtering (CF) model (e.g., SASRec) to obtain ID embeddings for items. For recommendation, it generates an item embedding and then uses a linear projection layer to efficiently compute scores for all items, mapping the generated embedding back to existing item IDs.
Token-Sequence Identifiers
- BIGRec [1]: Represents items using their titles. The tokens are drawn from human vocabulary, allowing the LLM to directly utilize its linguistic understanding.
- IDGenRec [31]: A learnable ID generator that aims to produce concise yet informative tags (from human vocabulary) to represent each item.
- CID [9]: Employs hierarchical clustering based on item co-occurrence patterns to generate token sequences. Items with similar interaction histories are expected to have similar identifiers.
- SemID [9]: Represents items using external token sequences derived from hierarchical item categories. This emphasizes semantic similarity.
- TIGER [26]: Utilizes RQ-VAE (Residual Quantization Variational Autoencoder) with codebooks to quantize item semantic information into token sequences. These sequences are designed to convey information from coarse-grained to fine-grained.
- LETTER [36]: A state-of-the-art method that integrates both semantic and Collaborative Filtering (CF) information into the training of an RQ-VAE to create multi-dimensional item identifiers, aiming for improved diversity and richness.

These baselines are representative because they cover the main existing paradigms for item tokenization in LLM-based generative recommendation, including methods that rely on human language, learned external tokens, and different combinations of CF and semantic information.
5.4. Implementation Details
- LLM Architectures: SETRec and all baselines are instantiated on two different LLM architectures:
  - T5-small [25]: An encoder-decoder transformer model.
  - Qwen2.5 [45]: A decoder-only transformer model, evaluated with different model sizes: 1.5B, 3B, and 7B parameters. This allows for a comprehensive evaluation of scalability.
- Hidden Layer Dimensions: For methods that use Autoencoders (AE) in their tokenizer training (e.g., TIGER, LETTER, SETRec), the hidden layer dimensions are set to 512, 256, and 128 with ReLU activation.
- Prompt for LLM Training: A consistent prompt is used for all methods to ensure a fair comparison: "What would the user be likely to purchase next after buying items history?"
- Fine-tuning Strategy:
- T5 models: Fully fine-tuned (all parameters updated).
  - Qwen models: The Parameter-Efficient Fine-Tuning (PEFT) technique LoRA [8] is used to reduce computational costs.
- Hardware: All experiments are conducted on four NVIDIA RTX A5000 GPUs.
- Hyperparameter Selection for SETRec:
  - Number of semantic tokens ($N$): selected from a candidate set of values.
  - Strength of the AE loss ($\alpha$): selected from a candidate set of values.
  - Semantic strength for inference ($\beta$): selected from a candidate set of values.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance on T5 (RQ1)
The following are the results from Table 1 of the original paper:
| Dataset | Method | All R@5 | All R@10 | All N@5 | All N@10 | Warm R@5 | Warm R@10 | Warm N@5 | Warm N@10 | Cold R@5 | Cold R@10 | Cold N@5 | Cold N@10 | Inf. Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Toys | DreamRec | 0.0020 | 0.0027 | 0.0015 | 0.0018 | 0.0027 | 0.0039 | 0.0020 | 0.0024 | 0.0066 | 0.0168 | 0.0045 | 0.0082 | 912 |
| E4SRec | 0.0061 | 0.0098 | 0.0051 | 0.0064 | 0.0081 | 0.0128 | 0.0065 | 0.0082 | 0.0065 | 0.0122 | 0.0056 | 0.0078 | 55 | |
| BIGRec | 0.0008 | 0.0013 | 0.0007 | 0.0009 | 0.0014 | 0.0019 | 0.0011 | 0.0013 | 0.0278 | 0.0360 | 0.0196 | 0.0223 | 2,079 | |
| IDGenRec | 0.0044 | 0.0082 | 0.0040 | 0.0053 | 0.0065 | 0.0128 | 0.0049 | 0.0071 | 0.0059 | 0.0111 | 0.0047 | 0.0066 | 810 | |
| CID | 0.0063 | 0.0110 | 0.0052 | 0.0069 | 0.0109 | 0.0161 | 0.0081 | 0.0102 | 0.0318 | 0.0589 | 0.0236 | 0.0335 | 658 | |
| SemID | 0.0071 | 0.0108 | 0.0061 | 0.0074 | 0.0086 | 0.0153 | 0.0075 | 0.0100 | 0.0307 | 0.0507 | 0.0220 | 0.0292 | 1,215 | |
| TIGER | 0.0064 | 0.0106 | 0.0060 | 0.0076 | 0.0091 | 0.0147 | 0.0080 | 0.0102 | 0.0315 | 0.0555 | 0.0228 | 0.0314 | 448 | |
| LETTER | 0.0081 | 0.0117 | 0.0077 | 0.0091 | 0.0109 | 0.0155 | 0.0083 | 0.0101 | 0.0183 | 0.0395 | 0.0115 | 0.0190 | 448 | |
| SETRec | 0.0110* | 0.0189* | 0.0089* | 0.0118* | 0.0139* | 0.0236* | 0.0112* | 0.0147* | 0.0443* | 0.0812* | 0.0310* | 0.0445* | 60 | |
| Beauty | DreamRec | 0.0012 | 0.0025 | 0.0013 | 0.0017 | 0.0016 | 0.0028 | 0.0016 | 0.0019 | 0.0078 | 0.0161 | 0.0065 | 0.0094 | 1,102 |
| E4SRec | 0.0061 | 0.0092 | 0.0052 | 0.0063 | 0.0080 | 0.0121 | 0.0067 | 0.0082 | 0.0072 | 0.0118 | 0.0065 | 0.0077 | 120 | |
| BIGRec | 0.0008 | 0.0009 | 0.0006 | 0.0008 | 0.0054 | 0.0064 | 0.0051 | 0.0054 | 0.0106 | 0.0251 | 0.0095 | 0.0151 | 4,544 | |
| IDGenRec | 0.0080 | 0.0115 | 0.0066 | 0.0078 | 0.0106 | 0.0165 | 0.0078 | 0.0099 | 0.0187 | 0.0350 | 0.0186 | 0.0224 | 840 | |
| CID | 0.0071 | 0.0125 | 0.0060 | 0.0080 | 0.0098 | 0.0166 | 0.0077 | 0.0101 | 0.0087 | 0.0183 | 0.0071 | 0.0104 | 815 | |
| SemID | 0.0071 | 0.0131 | 0.0056 | 0.0078 | 0.0098 | 0.0174 | 0.0074 | 0.0103 | 0.0260 | 0.0465 | 0.0178 | 0.0255 | 1,310 | |
| TIGER | 0.0063 | 0.0098 | 0.0050 | 0.0062 | 0.0086 | 0.0131 | 0.0065 | 0.0082 | 0.0190 | 0.0325 | 0.0130 | 0.0178 | 430 | |
| LETTER | 0.0071 | 0.0103 | 0.0061 | 0.0070 | 0.0094 | 0.0135 | 0.0079 | 0.0091 | 0.0251 | 0.0410 | 0.0241 | 0.0285 | 430 | |
| SETRec | 0.0106* | 0.0161* | 0.0083* | 0.0103* | 0.0139* | 0.0212* | 0.0108* | 0.0134* | 0.0384* | 0.0761* | 0.0280* | 0.0413* | 126 | |
| Sports | DreamRec | 0.0027 | 0.0044 | 0.0025 | 0.0031 | 0.0032 | 0.0052 | 0.0028 | 0.0035 | 0.0045 | 0.0108 | 0.0026 | 0.0049 | 2,100 |
| E4SRec | 0.0079 | 0.0131 | 0.0075 | 0.0094 | 0.0092 | 0.0154 | 0.0085 | 0.0107 | 0.0031 | 0.0093 | 0.0019 | 0.0039 | 117 | |
| BIGRec | 0.0033 | 0.0042 | 0.0030 | 0.0033 | 0.0001 | 0.0002 | 0.0001 | 0.0001 | 0.0059 | 0.0104 | 0.0043 | 0.0061 | 7,822 | |
| IDGenRec | 0.0087 | 0.0127 | 0.0079 | 0.0092 | 0.0101 | 0.0149 | 0.0091 | 0.0107 | 0.0181 | 0.0302 | 0.0134 | 0.0179 | 1,724 | |
| CID | 0.0077 | 0.0131 | 0.0073 | 0.0092 | 0.0074 | 0.0119 | 0.0045 | 0.0061 | 0.0082 | 0.0149 | 0.0075 | 0.0099 | 2,135 | |
| SemID | 0.0094 | 0.0167 | 0.0088 | 0.0114 | 0.0119 | 0.0201 | 0.0104 | 0.0135 | 0.0254 | 0.0495 | 0.0175 | 0.0256 | 2,367 | |
| TIGER | 0.0085 | 0.0129 | 0.0080 | 0.0095 | 0.0100 | 0.0151 | 0.0091 | 0.0109 | 0.0190 | 0.0310 | 0.0120 | 0.0159 | 481 | |
| LETTER | 0.0077 | 0.0131 | 0.0073 | 0.0092 | 0.0074 | 0.0119 | 0.0045 | 0.0061 | 0.0082 | 0.0149 | 0.0075 | 0.0099 | 481 | |
| SETRec | 0.0114* | 0.0185* | 0.0101* | 0.0126* | 0.0134* | 0.0216* | 0.0115* | 0.0144* | 0.0341* | 0.0595* | 0.0233* | 0.0323* | 136 | |
| Steam | DreamRec | 0.0029 | 0.0057 | 0.0037 | 0.0046 | 0.0042 | 0.0080 | 0.0045 | 0.0059 | 0.0017 | 0.0029 | 0.0013 | 0.0018 | 4,620 |
| E4SRec | 0.0194 | 0.0351 | 0.0220 | 0.0270 | 0.0312 | 0.0558 | 0.0283 | 0.0370 | 0.0006 | 0.0010 | 0.0006 | 0.0006 | 328 | |
| BIGRec | 0.0099 | 0.0107 | 0.0099 | 0.0103 | 0.0088 | 0.0097 | 0.0088 | 0.0092 | 0.0011 | 0.0010 | 0.0010 | 0.0010 | 3,120 | |
| IDGenRec | 0.0163 | 0.0284 | 0.0152 | 0.0200 | 0.0204 | 0.0360 | 0.0190 | 0.0250 | 0.0083 | 0.0117 | 0.0076 | 0.0093 | 1,438 | |
| CID | 0.0189 | 0.0325 | 0.0202 | 0.0250 | 0.0276 | 0.0478 | 0.0296 | 0.0369 | 0.0019 | 0.0033 | 0.0018 | 0.0024 | 1,760 | |
| SemID | 0.0175 | 0.0288 | 0.0184 | 0.0227 | 0.0222 | 0.0366 | 0.0234 | 0.0288 | 0.0077 | 0.0122 | 0.0071 | 0.0091 | 2,000 | |
| TIGER | 0.0201 | 0.0357 | 0.0225 | 0.0279 | 0.0273 | 0.0494 | 0.0305 | 0.0381 | 0.0031 | 0.0051 | 0.0028 | 0.0036 | 720 | |
| LETTER | 0.0195 | 0.0347 | 0.0210 | 0.0264 | 0.0259 | 0.0463 | 0.0274 | 0.0346 | 0.0034 | 0.0062 | 0.0027 | 0.0040 | 720 | |
| SETRec | 0.0231* | 0.0396* | 0.0260* | 0.0319* | 0.0294* | 0.0506* | 0.0326* | 0.0401* | 0.0152* | 0.0264* | 0.0135* | 0.0187* | 100 | |
The performance comparison of SETRec against baselines on T5 reveals several key observations:
- Token-sequence vs. Single-token Identifiers: Generally, token-sequence identifiers (e.g., BIGRec, CID, TIGER, LETTER) outperform single-token identifiers (e.g., DreamRec, E4SRec) across the all, warm, and cold settings. This is expected because token-sequence identifiers inherently represent items with multiple tokens, allowing them to encode richer, multi-dimensional information explicitly.
- External Tokens vs. Human Vocabulary: Among token-sequence identifiers, methods using external tokens (e.g., CID, SemID, TIGER, LETTER) typically perform better than those relying solely on human vocabulary (e.g., BIGRec, IDGenRec) in the all and warm settings. This is attributed to the hierarchical structure of external identifiers, which can represent coarse-grained to fine-grained semantics, potentially alleviating the local optima problem to some extent in autoregressive generation.
- Cold-Start Performance:
  - Methods relying predominantly on CF information (e.g., DreamRec, E4SRec, CID) show poor results on cold items (e.g., E4SRec on Toys R@5: 0.0065 vs. SemID 0.0307). This is because CF requires substantial interaction data, which is absent for cold items.
  - Methods that integrate semantic information into identifiers (e.g., BIGRec, IDGenRec, SemID, TIGER, LETTER) demonstrate better generalization ability in cold-start scenarios. BIGRec and IDGenRec, which use human vocabulary, show competitive performance here, likely leveraging the LLM's rich world knowledge.
- SETRec's Superiority: SETRec consistently and significantly outperforms all baselines across all datasets and under the all, warm, and cold settings (improvements marked with * are statistically significant). This superior performance is attributed to its dual principles: 1) effectively integrating both CF and semantic information into a set of tokens, which provides accurate recommendations for warm items and strong generalization for cold items; and 2) its order-agnostic identifier design, which avoids inaccurate dependencies between tokens within an item, overcoming the local optima issue.
- Inference Efficiency: SETRec achieves remarkable inference efficiency. It substantially reduces inference time compared to token-sequence identifiers (e.g., on Toys, SETRec takes 60s vs. BIGRec's 2,079s and LETTER's 448s). Averaged over the token-sequence baselines, this corresponds to roughly an order-of-magnitude speedup on each dataset (about 10-20x based on the inference times in Table 1). This efficiency stems from the simultaneous generation mechanism, which generates all of an item's tokens in a single LLM call.
6.1.2. Performance on Qwen-1.5B (RQ1)
The following are the results from Table 2 of the original paper:
| Dataset | Method | All R@5 | All R@10 | All N@5 | All N@10 | Warm R@5 | Warm R@10 | Warm N@5 | Warm N@10 | Cold R@5 | Cold R@10 | Cold N@5 | Cold N@10 | Inf. Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Toys | DreamRec | 0.0006 | 0.0013 | 0.0005 | 0.0008 | 0.0008 | 0.0019 | 0.0007 | 0.0012 | 0.0076 | 0.0137 | 0.0052 | 0.0074 | 1,093 |
| E4SRec | 0.0065 | 0.0108 | 0.0056 | 0.0072 | 0.0089 | 0.0144 | 0.0075 | 0.0096 | 0.0084 | 0.0235 | 0.0055 | 0.0111 | 905 | |
| BIGRec | 0.0009 | 0.0016 | 0.0009 | 0.0012 | 0.0011 | 0.0013 | 0.0010 | 0.0011 | 0.0194 | 0.0311 | 0.0147 | 0.0191 | 43,304 | |
| IDGenRec | 0.0030 | 0.0053 | 0.0022 | 0.0031 | 0.0043 | 0.0086 | 0.0032 | 0.0048 | 0.0189 | 0.0364 | 0.0161 | 0.0224 | 30,720 | |
| CID | 0.0027 | 0.0047 | 0.0025 | 0.0033 | 0.0055 | 0.0084 | 0.0044 | 0.0056 | 0.0055 | 0.0156 | 0.0044 | 0.0081 | 27,248 | |
| SemID | 0.0024 | 0.0042 | 0.0018 | 0.0024 | 0.0034 | 0.0055 | 0.0026 | 0.0034 | 0.0140 | 0.0275 | 0.0095 | 0.0143 | 32,288 | |
| TIGER | 0.0068 | 0.0117 | 0.0054 | 0.0072 | 0.0094 | 0.0159 | 0.0070 | 0.0095 | 0.0384 | 0.0715 | 0.0291 | 0.0408 | 13,800 | |
| LETTER | 0.0057 | 0.0093 | 0.0050 | 0.0064 | 0.0080 | 0.0126 | 0.0066 | 0.0085 | 0.0217 | 0.0416 | 0.0170 | 0.0239 | 13,800 | |
| SETRec | 0.0116* | 0.0188* | 0.0095* | 0.0120* | 0.0144* | 0.0236* | 0.0118* | 0.0151* | 0.0531* | 0.0883* | 0.0382* | 0.0507* | 926 | |
| Beauty | DreamRec | 0.0007 | 0.0009 | 0.0005 | 0.0005 | 0.0010 | 0.0011 | 0.0007 | 0.0007 | 0.0090 | 0.0167 | 0.0075 | 0.0103 | 1,326 |
| E4SRec | 0.0067 | 0.0109 | 0.0056 | 0.0072 | 0.0088 | 0.0146 | 0.0072 | 0.0094 | 0.0017 | 0.0071 | 0.0010 | 0.0029 | 910 | |
| BIGRec | 0.0006 | 0.0010 | 0.0006 | 0.0007 | 0.0010 | 0.0010 | 0.0008 | 0.0008 | 0.0141 | 0.0246 | 0.0094 | 0.0135 | 29,500 | |
| IDGenRec | 0.0042 | 0.0078 | 0.0030 | 0.0043 | 0.0045 | 0.0104 | 0.0033 | 0.0054 | 0.0254 | 0.0471 | 0.0207 | 0.0292 | 35,040 | |
| CID | 0.0046 | 0.0077 | 0.0040 | 0.0052 | 0.0059 | 0.0107 | 0.0051 | 0.0068 | 0.0075 | 0.0155 | 0.0071 | 0.0096 | 27,792 | |
| SemID | 0.0030 | 0.0045 | 0.0027 | 0.0033 | 0.0050 | 0.0076 | 0.0042 | 0.0052 | 0.0159 | 0.0227 | 0.0116 | 0.0159 | 45,160 | |
| TIGER | 0.0041 | 0.0065 | 0.0032 | 0.0041 | 0.0054 | 0.0085 | 0.0042 | 0.0054 | 0.0083 | 0.0167 | 0.0064 | 0.0091 | 12,600 | |
| LETTER | 0.0040 | 0.0069 | 0.0031 | 0.0042 | 0.0051 | 0.0088 | 0.0039 | 0.0054 | 0.0043 | 0.0129 | 0.0043 | 0.0071 | 12,600 | |
| SETRec | 0.0104* | 0.0167* | 0.0085* | 0.0108* | 0.0140* | 0.0221* | 0.0109* | 0.0141* | 0.0477* | 0.0748* | 0.0370* | 0.0464* | 1,050 | |
The evaluation of SETRec and baselines on Qwen-1.5B (a decoder-only LLM) reveals differences compared to T5:
- Limited Competitiveness of Token-Sequence Identifiers: On Qwen-1.5B, token-sequence identifiers are less competitive than they are on T5. A possible reason suggested by the authors is that Qwen-1.5B possesses richer pre-trained knowledge within its parameters. This larger knowledge base could amplify the knowledge gap between the general pre-training task and the specific recommendation task with limited interaction data, making it harder for these methods to adapt.
- Competitive Performance of E4SRec: E4SRec (a single-token identifier based on ID embeddings) often yields competitive performance on Qwen-1.5B. This is likely because E4SRec replaces the original LLM vocabulary head with an item projection head, which effectively adapts the LLM to the recommendation task by directly mapping generated embeddings to item scores, bypassing the vocabulary mismatch issue.
- Human Vocabulary vs. External Tokens on Cold Items:
  - BIGRec and IDGenRec (using human vocabulary) sometimes outperform their T5 counterparts on cold items (e.g., on Beauty). This indicates that Qwen-1.5B's richer world knowledge can be better leveraged when item representations are in human-readable language, leading to improved generalization for cold-start items.
  - Conversely, identifiers with external tokens (e.g., CID, TIGER, LETTER) show inferior cold performance compared to their T5 counterparts. This is because training external tokens requires substantial interaction data, which is difficult to obtain for cold items, leading to poor generalization due to low generation probability for these tokens.
- SETRec's Consistent Superiority: Despite these shifts in baseline performance, SETRec consistently outperforms all baselines on Qwen-1.5B across all settings. Notably, SETRec instantiated on Qwen-1.5B often surpasses SETRec on T5, especially in the cold-start setting. This validates SETRec's strong generalization ability across different LLM architectures.
- Enhanced Efficiency on Qwen: As the LLM size increases (even from T5-small to Qwen-1.5B), the efficiency improvement of SETRec over token-sequence identifiers becomes even more pronounced (on the order of tens of times faster, based on the inference times in Table 2). This highlights the practical advantages of simultaneous generation for larger LLMs.
6.2. Ablation Study (RQ2)
The following figure (Figure 6 from the original paper) shows the ablation study results on Toys:
(Figure 6 description: Recall@10 and NDCG@10 of the T5 and Qwen-1.5B models under the all, warm-start, and cold-start settings, comparing SETRec with its variants without semantic tokens, query vectors, sparse attention, and CF tokens.)
The ablation study investigates the contribution of each component of SETRec by removing them one at a time. The variants are:
- w/o Sem: SETRec without semantic tokens (only CF tokens).
- w/o CF: SETRec without CF tokens (only semantic tokens).
- w/o Query: SETRec using random frozen vectors instead of learnable query vectors.
- w/o SA: SETRec using the original attention mask instead of the sparse attention mask.

Key observations from the ablation study on both T5 and Qwen-1.5B on the Toys dataset:
- Effectiveness of Each Component: Removing any component (Sem, CF, Query, SA) consistently leads to performance drops across the all, warm, and cold settings, validating that every component of SETRec contributes positively to its overall effectiveness.
- Necessity of Semantic Information: Discarding semantic tokens (w/o Sem) drastically degrades recommendation accuracy, particularly under cold settings. This underscores the critical role of integrating semantic information into item identifiers when handling items with sparse interactions.
- Significance of Multi-dimensional Semantics: Removing semantic tokens (w/o Sem) generally hurts more than removing CF tokens (w/o CF). This suggests that leveraging multiple semantic tokens to represent multi-dimensional semantic aspects is highly beneficial, potentially mitigating embedding collapse and capturing richer item details, which aligns with findings in other research [18].
- Role of CF Tokens:
  - For T5, removing CF tokens (w/o CF) leads to inferior performance on cold items. This is somewhat counterintuitive, but it may indicate that even limited CF signals, once integrated, provide useful context.
  - For Qwen, removing CF tokens (w/o CF) also tends to hurt cold-item performance, but the impact is less pronounced than removing semantics. The authors suggest that larger Qwen models, with their stronger pre-trained knowledge, understand semantics better, making the CF contribution less critical, especially for cold items where CF signals are inherently sparse.
- Impact of Query Vectors and Sparse Attention: Both w/o Query and w/o SA lead to performance degradation, demonstrating the importance of the learnable query vectors for guiding simultaneous generation and of the sparse attention mask for ensuring order-agnostic encoding and efficiency (see the sketch below).
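To make the sparse attention idea concrete, here is a minimal PyTorch sketch (not the authors' code) of how such a mask could be built for a flattened user history in which every item contributes a fixed number of order-agnostic tokens. The masking rule used here, letting a token attend to itself and to all tokens of earlier items but not to sibling tokens of the same item, is an assumption; SETRec's exact mask may differ.

```python
import torch

def set_identifier_mask(num_items: int, tokens_per_item: int) -> torch.Tensor:
    """Boolean (L, L) mask with True = "may attend", where L = num_items * tokens_per_item.

    Assumed rule: each token attends to itself and to every token of earlier items,
    but not to the other tokens of its own item, so no intra-item order is imposed.
    """
    L = num_items * tokens_per_item
    item_idx = torch.arange(L) // tokens_per_item         # item index of each token
    mask = item_idx.unsqueeze(1) > item_idx.unsqueeze(0)   # attend to earlier items only
    mask |= torch.eye(L, dtype=torch.bool)                 # plus self-attention
    return mask

# Example: 3 history items with 2 set tokens each -> a 6x6 mask.
print(set_identifier_mask(3, 2).int())
```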
6.3. Item Group Analysis (RQ3)
The following figure (Figure 7 from the original paper) shows the performance of SETRec, LETTER, and E4SRec (T5) on item groups with different popularity on Toys:
The figure shows the performance of SETRec, LETTER, and E4SRec on item groups of different popularity, with Recall@10 in panel (a) and NDCG@10 in panel (b); SETRec performs strongly on the popular item groups.
This analysis evaluates SETRec's performance across items grouped by their popularity (G1: most popular, G4: least popular). SETRec is compared with LETTER (a strong token-sequence identifier) and E4SRec (a single-token identifier based on CF).
Key observations:
- Popularity-Performance Trend: Performance (both Recall@10 and NDCG@10) generally declines from the most popular items (G1) to the least popular (G4) for all methods. This is expected, as LLMs (and recommender systems in general) have less data to learn from for less popular items, making accurate recommendation more challenging.
- E4SRec's Strengths and Weaknesses: E4SRec (a CF-only single-token identifier) performs well on the most popular items (G1), sometimes outperforming LETTER, which highlights the strength of CF information when abundant interactions are available. However, E4SRec yields significantly inferior performance on the unpopular items (G2-G4), a direct consequence of its reliance on CF information, which is sparse for these items.
- LETTER's Generalization: LETTER, which incorporates both semantic and CF information (though as a token sequence), generalizes better on the sparser items (G2-G4) than E4SRec, leveraging semantic information to compensate for limited CF.
- SETRec's Consistent Excellence: SETRec consistently outperforms both E4SRec and LETTER across all popularity groups (G1-G4). Crucially, its improvements are larger on the sparser, less popular items (G2-G4), indicating that SETRec's design (integrating multi-dimensional CF and semantic information in an order-agnostic set identifier) is particularly effective in challenging scenarios with limited interaction data. This superior generalization on sparse items largely explains SETRec's overall performance gains.
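For readers reproducing this kind of analysis, here is a small pandas sketch of one common way to form such popularity groups; the column name `item_id` and the equal-sized quartile binning are assumptions, and the paper's exact grouping protocol may differ.

```python
import pandas as pd

def split_by_popularity(interactions: pd.DataFrame, n_groups: int = 4) -> pd.Series:
    """Assign each item to a popularity group G1 (most popular) ... G4 (least popular),
    based on its interaction count in the training data."""
    counts = interactions["item_id"].value_counts()       # interactions per item
    ranks = counts.rank(method="first", ascending=False)   # rank 1 = most popular
    labels = [f"G{i + 1}" for i in range(n_groups)]
    return pd.qcut(ranks, q=n_groups, labels=labels)       # equal-sized bins

# Per-group Recall@10 / NDCG@10 can then be computed by restricting the test
# interactions to target items in each group.
```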
6.4. Scalability on Model Parameters (RQ3)
The following are the results from Table 3 of the original paper:
| Size | Method | All R@10 | All N@10 | Warm R@10 | Warm N@10 | Cold R@10 | Cold N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1.5B | LETTER | 0.0093 | 0.0064 | 0.0126 | 0.0085 | 0.0416 | 0.0239 |
| 1.5B | E4SRec | 0.0108 | 0.0072 | 0.0144 | 0.0096 | 0.0235 | 0.0111 |
| 1.5B | SETRec | 0.0188 | 0.0120 | 0.0236 | 0.0151 | 0.0883 | 0.0507 |
| 3B | LETTER | 0.0109 | 0.0072 | 0.0151 | 0.0097 | 0.0471 | 0.0236 |
| 3B | E4SRec | 0.0096 | 0.0061 | 0.0129 | 0.0081 | 0.0218 | 0.0103 |
| 3B | SETRec | 0.0195 | 0.0123 | 0.0258 | 0.0159 | 0.0964 | 0.0571 |
| 7B | LETTER | 0.0099 | 0.0061 | 0.0137 | 0.0081 | 0.0406 | 0.0216 |
| 7B | E4SRec | 0.0088 | 0.0057 | 0.0114 | 0.0072 | 0.0133 | 0.0065 |
| 7B | SETRec | 0.0194 | 0.0115 | 0.0239 | 0.0140 | 0.1016 | 0.0613 |
This analysis investigates how SETRec scales with increasing LLM model sizes (Qwen 1.5B, 3B, and 7B) compared to E4SRec and LETTER on the Toys dataset.
Key observations:
- SETRec's Scalability on Cold Items: SETRec demonstrates clear and continuous improvements on cold-start items as the model scales from 1.5B to 7B (R@10 increases from 0.0883 to 0.1016). This indicates promising scalability for cold items, suggesting that larger models, with their enhanced semantic understanding and general knowledge, can better leverage SETRec's semantic information for items with sparse interactions.
- Limited Scalability on Warm Items: SETRec's performance on warm items (and on all items overall) shows only minor improvements or slight fluctuations as model size increases (e.g., warm R@10 goes from 0.0236 to 0.0258 and then to 0.0239). This suggests that larger LLMs do not necessarily understand CF information proportionally better, or that the CF signals are already effectively captured by smaller models; the limited gains of E4SRec (a CF-focused method) on warm items support this reading.
- LETTER's Weak Scalability: LETTER generally shows weak scalability across all three settings (all, warm, cold); its performance stagnates or fluctuates slightly with increasing model size. This is primarily attributed to its reliance on external tokens, which may not align well with the pre-trained knowledge embedded in LLMs, so simply increasing the parameter count does not translate into significant improvements for LETTER.
- E4SRec's Performance with Scaling: E4SRec's performance also fluctuates and does not improve consistently with model scaling. On cold items, it actually decreases markedly from 1.5B to 7B (from 0.0235 to 0.0133), reinforcing that larger LLMs do not inherently improve CF-based recommendation on sparse data without semantic guidance.

In summary, SETRec exhibits strong scalability, particularly for cold-start items, by leveraging the enhanced semantic understanding of larger LLMs through its multi-dimensional, order-agnostic set identifier design.
6.5. Effect of Semantic Strength (RQ4)
The following figure (Figure 8 from the original paper) shows the performance of SETRec (T5) with different strengths of semantics during inference:
The figure shows the performance of SETRec (T5) under different semantic strengths η in the warm-start and cold-start scenarios: the left panels plot Recall@10 and NDCG@10 for warm-start items, and the right panels plot the cold-start metrics, illustrating how different η values affect the model.
This analysis examines the impact of the semantic-strength hyper-parameter η on SETRec's performance. η controls the balance between CF scores and semantic scores during item grounding (Equation 7): η = 0 means only CF scores are used, while η = 1 means only semantic scores are used.
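As one concrete (hypothetical) reading of this grounding step, the fused score of each candidate item can be taken as a convex combination of its CF and semantic matching scores; the sketch below assumes this simple linear fusion, which may differ from the exact form of Equation 7.

```python
import numpy as np

def ground_items(cf_scores: np.ndarray, sem_scores: np.ndarray,
                 eta: float = 0.4, k: int = 10) -> np.ndarray:
    """Fuse per-item CF and semantic scores with weight eta and return the top-k item indices.
    eta = 0 -> CF scores only; eta = 1 -> semantic scores only."""
    fused = (1.0 - eta) * cf_scores + eta * sem_scores
    return np.argsort(-fused)[:k]

# Example with random scores over 1,000 candidate items.
rng = np.random.default_rng(0)
top10 = ground_items(rng.random(1000), rng.random(1000), eta=0.4)
```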
Key observations:
- Necessity of Semantic Information for Grounding: Incorporating semantic information during inference (any η > 0) is generally beneficial compared to relying solely on CF scores (η = 0), indicating that the multi-dimensional semantic information contributes to a more robust global ranking and stronger generalization.
- Significant Improvements on Cold Items: Including semantic scores brings particularly significant improvements on cold items; for instance, cold-item R@10 rises from below 0.03 with CF-only grounding to over 0.08 once semantic scores are incorporated. This highlights the crucial role of semantic information for items with sparse or no interaction history.
- Optimal Balance: There is an optimal η that balances the CF and semantic contributions. For warm items, performance is high across a broad range of η and peaks around 0.4; for cold items, the optimum is also around 0.4, indicating that a balanced integration of both types of information works best.
- Competitive Performance with Pure Semantics: Even when relying solely on semantic scores (η = 1), SETRec maintains competitive performance on warm items. This suggests an implicit alignment between CF and semantic tokens learned during training, where the semantic embeddings still capture aspects relevant to CF preferences.
6.6. Hyper-parameter Sensitivity (RQ4)
The following figure (Figure 9 from the original paper) shows the performance of SETRec (T5) with different strengths of the AE loss and different numbers of semantic tokens:
The figure shows Recall@10 of SETRec (T5) under different AE loss strengths and different numbers of semantic tokens: sub-figure (a) plots the effect of the AE loss strength on Recall@10 across the all, warm-start, and cold-start settings, and sub-figure (b) plots the effect of the number of semantic tokens under the same settings, with lines of different colors denoting the different settings.
This analysis explores the sensitivity of SETRec to two key hyper-parameters: the strength of the AE loss and the number of semantic embeddings used per item.
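For orientation, the AE loss strength enters the training objective as a weighting term. The sketch below uses λ as a stand-in symbol (the paper's own notation is not reproduced here), with $\mathcal{L}_{\text{rec}}$ denoting the recommendation loss and $\mathcal{L}_{\text{AE}}$ the autoencoder reconstruction loss; Equation 8 in the paper may contain additional terms.

$$
\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda \, \mathcal{L}_{\text{AE}}
$$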
6.6.1. Effect of the AE Loss Strength
- This hyper-parameter controls the weighting of the Autoencoder (AE) reconstruction loss in the total training objective (Equation 8).
- Observations: Increasing it from 0 (no AE loss) to around 0.7 generally improves SETRec's overall performance, indicating that encouraging the semantic tokenizer to accurately reconstruct the original semantic representation helps it learn richer and more useful semantic embeddings.
- However, increasing it too far (e.g., beyond 0.7) can cause performance to drop, especially on warm items. An overly strong reconstruction constraint may make the semantic embeddings too focused on literal content, limiting their ability to capture subtle semantic nuances or adapt to CF-driven preferences. The authors recommend an empirical range of 0.5 to 0.7.
6.6.2. Effect of the Number of Semantic Tokens
- This hyper-parameter is the number of order-agnostic semantic embeddings used to represent each item.
- Observations: Increasing the number of semantic tokens generally improves performance, supporting the idea that multiple embeddings better capture an item's multi-dimensional semantic information, potentially mitigating the embedding collapse issue [7] and resolving information conflicts [36] that arise when diverse semantic aspects are forced into fewer dimensions.
- However, blindly increasing it beyond a certain point can yield diminishing returns or even slight performance degradation, possibly because it becomes increasingly challenging for the AE to consistently learn distinct and meaningful category-level preferences across a very large number of semantic dimensions, particularly when aligning them with real-world scenarios [20, 22]. Choosing an appropriate value is therefore crucial to balancing expressiveness and learnability.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously addresses fundamental issues in item tokenization for LLM-based generative recommendation. It identifies that existing token-sequence identifiers suffer from local optima in beam search and low generation efficiency due to autoregressive generation, while single-token identifiers fail to capture rich semantics or Collaborative Filtering (CF) information adequately.
To overcome these challenges, the authors propose two guiding principles for item identifier design:
- Integration of both CF and semantic information: to leverage both user behavioral patterns and rich item content.
- Order-agnostic identifiers: to represent multi-dimensional item information as a set of independent tokens, eliminating spurious dependencies and enabling efficient simultaneous generation.

Based on these principles, the paper introduces a novel set identifier paradigm and implements it with SETRec. SETRec utilizes CF and semantic tokenizers to create order-agnostic multi-dimensional tokens. It employs a sparse attention mask for user history encoding, preserving sequential item dependencies while removing intra-item token dependencies, and a query-guided generation mechanism for simultaneous token generation.
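To illustrate the idea of query-guided simultaneous generation, here is a minimal PyTorch sketch (not the authors' implementation): a fixed number of learnable query vectors, one per identifier token, is appended to the encoded user history and decoded in a single forward pass instead of step-by-step autoregressive decoding. The module names and the use of a plain Transformer layer as a stand-in backbone are assumptions; SETRec's actual interaction with the LLM backbone and its grounding heads may differ.

```python
import torch
import torch.nn as nn

class QueryGuidedGenerator(nn.Module):
    """Sketch: generate all set-identifier token embeddings in one forward pass."""

    def __init__(self, hidden: int, num_queries: int):
        super().__init__()
        # one learnable query vector per identifier token
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        # stand-in for the LLM backbone; hidden must be divisible by nhead
        self.backbone = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                                   batch_first=True)

    def forward(self, history_states: torch.Tensor) -> torch.Tensor:
        # history_states: (batch, seq_len, hidden) encoded user history
        b = history_states.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (batch, num_queries, hidden)
        x = torch.cat([history_states, q], dim=1)          # append the queries
        out = self.backbone(x)                              # single forward pass
        return out[:, -self.queries.size(0):]               # one embedding per token

# The returned embeddings would then be matched against the CF/semantic grounding
# heads to score candidate items simultaneously, avoiding beam search.
```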
Extensive experiments on four datasets (Amazon Toys, Beauty, Sports, and Steam) demonstrate SETRec's effectiveness, efficiency, generalization ability, and scalability. It consistently outperforms strong baselines across various scenarios, including full ranking, warm-start, and particularly cold-start recommendations, and different item popularity groups. The simultaneous generation significantly boosts inference efficiency, and SETRec shows promising scalability on cold-start items with increasing LLM sizes (Qwen 1.5B to 7B), showcasing its ability to harness larger models' semantic understanding.
7.2. Limitations & Future Work
The authors identify several promising avenues for future research:
- Discrete Set Identifiers: While SETRec uses continuous tokens, exploring how discrete set identifiers (a set of order-agnostic discrete tokens) perform in generative recommendation is a valuable direction; this could align even better with LLMs' pre-training tasks and fully utilize their embedded knowledge.
- Open-ended Recommendation: SETRec shows strong generalization in challenging scenarios. The authors suggest applying SETRec to open-ended recommendation with open-domain user behaviors, i.e., tackling more complex, less structured recommendation tasks beyond fixed item catalogs.
7.3. Personal Insights & Critique
This paper presents a highly insightful and impactful contribution to LLM-based recommendation. The explicit articulation of the two design principles (information integration and order-agnosticism) provides a strong theoretical foundation for the set identifier paradigm. This moves beyond incremental improvements to existing tokenization schemes by fundamentally rethinking how items are represented and generated.
Innovations and Strengths:
- Conceptual Clarity: The paper clearly identifies the core problems with existing item identifiers and proposes well-reasoned principles. The local optima problem and inference inefficiency are critical practical bottlenecks, and SETRec offers an elegant solution.
- Comprehensive Information Capture: Integrating both CF and semantic information in a structured, multi-dimensional way is crucial. The order-agnostic set identifier prevents embedding collapse and allows flexible representation of diverse item attributes.
- Efficiency Boost: The simultaneous generation mechanism, enabled by order-agnosticism and the sparse attention mask, is a major practical advantage, making LLM-based generative recommendation far more feasible for real-world deployment. The observed speedups are compelling.
- Cold-Start Performance: SETRec's strong performance on cold-start items and less popular groups highlights its robustness and generalization ability, addressing a persistent challenge in recommender systems.
- Scalability: The demonstrated scalability on cold-start items with increasing model size is a strong indicator of its future potential as LLMs continue to grow.

Potential Issues/Areas for Improvement:
- Defining Semantic Dimensions: Although the paper includes a sensitivity analysis, the optimal number of semantic tokens is dataset-dependent. Determining how to automatically or more adaptively define these dimensions (and what each dimension represents) is a nontrivial research problem; currently it relies on empirical tuning. The conceptual idea of "latent semantic dimensions" is powerful, but their precise interpretation and optimal number remain somewhat opaque.
- Interpretability of Semantic Tokens: While the semantic tokens capture rich information, their individual interpretability for human understanding may be limited. For debugging or explaining recommendations, understanding what each token represents would be valuable.
- Computational Cost of Tokenization: The semantic tokenizer involves training an Autoencoder. Although this is done offline, the initial cost of generating CF and semantic embeddings for all items and constructing the token corpora should be considered, especially for very large and dynamic item catalogs.
- Applicability of Learnable Query Vectors: The query-guided generation mechanism is novel. Further exploration of the design of these query vectors and their interaction with the LLM could reveal more insights: are they truly learning to isolate specific dimensions, or are they acting as more general prompts?

Transferability and Future Value: SETRec's principles of order-agnosticism and multi-dimensional information integration are broadly transferable.
- Beyond Recommendation: This paradigm could be applied to other generative tasks where composite entities need to be represented and generated (e.g., generating complex objects in creative AI or multimodal content generation).
- Multimodal Recommendation: Integrating more modalities (images, audio) into the set identifier is a natural extension; each modality could contribute its own order-agnostic token or set of tokens.
- Dynamic Item Catalogs: For frequently changing item catalogs, efficient updates to the token corpora would be crucial. The extendable grounding heads already provide a good foundation for this.

Overall, SETRec offers a significant conceptual and practical advancement in LLM-based generative recommendation, addressing key limitations and opening new avenues for research into more flexible, efficient, and robust item representations.