
Pctx: Tokenizing Personalized Context for Generative Recommendation

Published: 10/24/2025





This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a personalized context-aware tokenizer generating context-dependent semantic IDs, enhancing generative recommendation personalization and improving NDCG@10 by up to 11.44% across datasets.

Abstract

Generative recommendation (GR) models tokenize each action into a few discrete tokens (called semantic IDs) and autoregressively generate the next tokens as predictions, showing advantages such as memory efficiency, scalability, and the potential to unify retrieval and ranking. Despite these benefits, existing tokenization methods are static and non-personalized. They typically derive semantic IDs solely from item features, assuming a universal item similarity that overlooks user-specific perspectives. However, under the autoregressive paradigm, semantic IDs with the same prefixes always receive similar probabilities, so a single fixed mapping implicitly enforces a universal item similarity standard across all users. In practice, the same item may be interpreted differently depending on user intentions and preferences. To address this issue, we propose a personalized context-aware tokenizer that incorporates a user's historical interactions when generating semantic IDs. This design allows the same item to be tokenized into different semantic IDs under different user contexts, enabling GR models to capture multiple interpretive standards and produce more personalized predictions. Experiments on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. Our code is available at https://github.com/YoungZ365/Pctx.


1. Bibliographic Information

1.1. Title

Pctx: Tokenizing Personalized Context for Generative Recommendation

1.2. Authors

Qiyong Zhong (Zhejiang University)
Jiajie Su (Zhejiang University)
Yunshan Ma (Singapore Management University)
Julian McAuley (University of California, San Diego)
Yupeng Hou (University of California, San Diego)

1.3. Journal/Conference

This paper is a preprint on arXiv (arXiv:2510.21276), posted on 2025-10-24 (timestamp 2025-10-24T09:22:04.000Z). arXiv is a widely recognized open-access preprint server for research in physics, mathematics, computer science, and related disciplines. Papers posted on arXiv have typically not yet been peer reviewed.

1.4. Publication Year

2025

1.5. Abstract

Generative Recommendation (GR) models represent each user action as a sequence of discrete tokens, known as semantic IDs, and make predictions by autoregressively generating the next tokens. While GR offers advantages like memory efficiency and scalability, existing tokenization methods are often static and non-personalized, deriving semantic IDs solely from item features. This approach assumes a universal item similarity, overlooking individual user preferences. The autoregressive nature of GR models means that semantic IDs with common prefixes will have similar probabilities, implicitly enforcing this universal similarity standard.

To address this limitation, the paper proposes Pctx, a personalized context-aware tokenizer. Pctx incorporates a user's historical interactions to generate semantic IDs, allowing the same item to be tokenized into different semantic IDs based on varying user contexts. This design enables GR models to capture multiple interpretative standards for an item and produce more personalized predictions. Experiments conducted on three public datasets demonstrate Pctx's effectiveness, showing up to an 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. The authors have made their code publicly available.

1.6. Original Source Link

Official Source Link: https://arxiv.org/abs/2510.21276
PDF Link: https://arxiv.org/pdf/2510.21276v1.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of recommender systems has seen the rise of Generative Recommendation (GR) models, which offer significant advantages over traditional ID-based approaches. Instead of treating each item as a unique identifier, GR models convert user actions (interactions with items) into a few discrete tokens, called semantic IDs, and then use autoregressive models to predict the next semantic IDs in a sequence. This paradigm brings benefits such as enhanced memory efficiency, better scalability, and the potential to unify retrieval and ranking stages in recommendation pipelines.

However, a critical limitation of existing action tokenization methods in GR is their static and non-personalized nature. These methods typically generate semantic IDs based purely on item features (e.g., titles, descriptions), assuming that all users perceive items similarly. This universal item similarity assumption is problematic because, in reality, a single item can hold different meanings or appeal to different users based on their unique intentions and historical preferences. For example, a high-end watch could be an investment for one user, a gift for another, or a fashion statement for a third. Under the autoregressive generation paradigm, semantic IDs with shared prefixes are inherently assigned similar probabilities, which reinforces this static similarity standard and hinders the model's ability to provide truly personalized recommendations that account for diverse user interpretations.

The core problem the paper aims to solve is this lack of personalization and context-awareness in action tokenization for Generative Recommendation. The existing methods fail to capture the nuanced, user-specific ways items are interpreted, leading to less personalized and potentially suboptimal recommendations. The paper's innovative idea, or entry point, is to design a tokenizer that can adaptively generate semantic IDs not just from item features, but also by incorporating a user's historical interactions as a personalized context. This allows the same item to have multiple semantic IDs, each reflecting a different user-specific interpretation.

2.2. Main Contributions / Findings

The paper introduces Pctx, a novel personalized context-aware tokenizer for Generative Recommendation, making several key contributions:

Personalized Context-Aware Tokenization: Pctx proposes the first tokenizer that explicitly incorporates a user's historical interactions into the semantic ID generation process. This allows a single item to be mapped to different semantic IDs depending on the user's specific context, thereby capturing diverse user interpretations and overcoming the universal item similarity assumption of previous methods.
Balancing Generalizability and Personalizability: The paper addresses the challenge of creating personalized semantic IDs without sacrificing the generalizability often sought in tokenization. It introduces several strategies:
- Adaptive Clustering: Context representations are clustered into a variable number of groups, with cluster centroids serving as prototype representations.
- Merging Infrequent Semantic IDs: Low-frequency semantic IDs are merged with semantically similar ones of the same item to reduce sparsity.
- Data Augmentation: The training process is enhanced by augmenting actions with alternative semantic IDs for the same item, both in model inputs and prediction targets, connecting different interpretations.
Multi-Facet Semantic ID Generation: During inference, Pctx enables the Generative Recommendation model to decode multiple potential semantic IDs for the next item, each representing a different user interpretation. This allows for a richer understanding of recommendation probabilities and enhances the explainability of the recommendations.
Empirical Validation: Extensive experiments on three public datasets (Amazon Reviews: "Musical Instruments", "Industrial & Scientific", and "Video Games") demonstrate the effectiveness of Pctx. The model achieves significant performance improvements, up to 11.44% in NDCG@10, compared to non-personalized action tokenization baselines, confirming that personalized context-aware tokenization leads to more accurate and relevant recommendations.

In summary, Pctx successfully introduces personalization into the tokenization phase of Generative Recommendation models, enabling them to capture the multifaceted nature of user preferences and item interpretations, leading to substantial improvements in recommendation quality and explainability.

3.1. Foundational Concepts

To fully understand Pctx, it is essential to grasp several fundamental concepts in recommender systems and machine learning:

Sequential Recommendation: This is a subfield of recommender systems where the goal is to predict the user's next interaction (e.g., next item purchase, next movie watched) based on their sequence of past interactions. The order of interactions is crucial, as user preferences often evolve over time.
- ID-based Approaches: Traditional sequential recommenders (e.g., SASRec, GRU4Rec) typically represent each item with a unique integer ID. These IDs are then embedded into dense vectors (item embeddings), which are learned during training. A major challenge is managing a large embedding table for millions of items, leading to high memory consumption and scalability issues.
Generative Recommendation (GR): A newer paradigm that contrasts with ID-based methods. Instead of directly predicting item IDs, GR models convert each item or action into a sequence of discrete tokens, called semantic IDs. The model then autoregressively generates the semantic IDs for the next predicted item.
- Benefits of GR:
  - Memory Efficiency: By using a compact vocabulary of tokens, GR models can significantly reduce memory usage compared to large item ID embedding tables.
  - Scalability: The token-based approach allows for better scalability to large item catalogs.
  - Unifying Retrieval and Ranking: GR models can potentially perform both item retrieval (finding relevant items) and ranking (ordering them) within a single generative framework.
Semantic IDs (Tokens): In Generative Recommendation, a semantic ID is a short sequence of discrete tokens (e.g., $[token_1, token_2, \ldots, token_G]$) that collectively represent an item or action. These tokens are drawn from a shared, compact vocabulary. The process of converting an item into its semantic ID is called tokenization.
Autoregressive Models: These models are designed to predict the next element in a sequence based on the preceding elements. In the context of GR, an autoregressive model generates semantic IDs token by token. For instance, after generating $token_1$, it uses $token_1$ to predict $token_2$, and so on. This mechanism implies that semantic IDs sharing common prefixes will naturally receive similar prediction probabilities.
Tokenization (General): In computer science, tokenization is the process of breaking down a sequence of characters (or other data) into smaller units called tokens. In natural language processing, this often involves splitting sentences into words or subword units. In GR, it means converting item features or representations into discrete semantic ID tokens.
Contrastive Learning: A self-supervised learning technique where the model learns to pull "similar" (positive) samples closer together in the embedding space while pushing "dissimilar" (negative) samples farther apart. DuoRec, mentioned in the paper, uses contrastive learning to learn user context representations that are more distinguishable, helping to mitigate representation degeneration where distinct inputs map to similar embeddings.
Residual Quantization Variational AutoEncoder (RQ-VAE): A neural network architecture used for quantization. VAEs (Variational AutoEncoders) are generative models that learn a compressed, latent representation of data. Residual Quantization involves quantizing residuals (errors) sequentially, allowing for finer-grained representation with multiple codebooks. In Pctx, RQ-VAE is used to convert continuous item representations into discrete semantic ID tokens.
k-means++ Clustering: An advanced initialization method for the k-means clustering algorithm. k-means aims to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid). k-means++ improves the quality of clustering by selecting initial cluster centers that are far apart, reducing the chance of converging to suboptimal solutions. Pctx uses it to condense multiple context representations for an item into a smaller set of representative centroids.
sentence-t5-base: A pre-trained sentence embedding model based on the T5 (Text-To-Text Transfer Transformer) architecture. It is used to generate dense vector representations (embeddings) from textual features (e.g., item titles, descriptions).
FAISS (Facebook AI Similarity Search): A library for efficient similarity search and clustering of dense vectors. Pctx uses FAISS for quantizing representations, which involves finding the closest codebook entry for each vector.
PCA (Principal Component Analysis) and Whitening:
- PCA: A dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components.
- Whitening: A data preprocessing step that transforms a set of variables so that they are uncorrelated and have unit variance. It often follows PCA. These techniques are used in Pctx to refine the semantic quality of item representations before quantization.
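
The prefix effect noted under Autoregressive Models can be made concrete with a toy example. The sketch below is only an illustration: the conditional probabilities and the two-token semantic IDs are made up, and `sequence_prob` is a hypothetical helper that applies the chain rule $p(m_1, \ldots, m_G) = \prod_g p(m_g \mid m_1, \ldots, m_{g-1})$.

```python
# Toy next-token distributions, keyed by the prefix generated so far.
# All numbers are invented for illustration only.
COND_PROBS = {
    (): {"a1": 0.7, "a2": 0.3},
    ("a1",): {"b1": 0.6, "b2": 0.4},
    ("a2",): {"b1": 0.5, "b2": 0.5},
}


def sequence_prob(tokens):
    """Chain rule: multiply the probability of each token given its prefix."""
    prob = 1.0
    for g, token in enumerate(tokens):
        prob *= COND_PROBS[tuple(tokens[:g])][token]
    return prob


# Items x and y share the prefix "a1", so both inherit the factor 0.7.
p_item_x = sequence_prob(["a1", "b1"])  # 0.7 * 0.6
p_item_y = sequence_prob(["a1", "b2"])  # 0.7 * 0.4
# Item z starts with "a2", so its probability is capped by the factor 0.3.
p_item_z = sequence_prob(["a2", "b1"])  # 0.3 * 0.5
```

Because the first-token factor is shared, items whose semantic IDs start with the same token rise and fall together, whichever user the model is scoring. This is the coupling that a single fixed item-to-ID mapping imposes.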

3.2. Previous Works

The paper discusses previous works primarily in two categories: Conventional Sequential Recommendation and Generative Recommendation.

3.2.1. Conventional Sequential Recommendation

These models typically rely on unique item IDs and embedding tables.

Caser (Tang & Wang, 2018): Applies convolutional neural networks (CNNs) to capture both sequential (temporal) and positional dependencies in user interaction sequences.
HGN (Ma et al., 2019): Uses hierarchical gating networks at both feature and instance levels to refine user preference representations.
GRU4Rec (Hidasi et al., 2016): Employs Gated Recurrent Units (GRUs) to model sequential dynamics in user behaviors, an early and influential deep learning model for session-based recommendation.
BERT4Rec (Sun et al., 2019): Adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to sequential recommendation. It uses a masked item prediction objective, where some items in a sequence are masked, and the model tries to predict them using bidirectional context.
- A core concept in Transformer-based models like BERT and SASRec is the self-attention mechanism. For an input sequence of vectors $X = [x_1, \dots, x_N]$, self-attention calculates the output $Y = [y_1, \dots, y_N]$, where each $y_i$ is a weighted sum of all $x_j$, with weights determined by their pairwise compatibility: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, and $d_k$ is the dimension of the keys.
SASRec (Kang & McAuley, 2018): A prominent model that utilizes a unidirectional self-attention mechanism to capture user interests along behavior trajectories. It focuses on how past items influence the next item in a sequence.
DuoRec (Qiu et al., 2022): Addresses representation degeneration (where distinct inputs get mapped to similar representations) in sequential modeling by using contrastive learning with dropout-based augmentation and supervised sampling. This is particularly relevant as Pctx uses DuoRec for user context encoding.
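
To make the attention formula above concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is an illustration rather than any model's actual code: the function names are assumptions, and the optional `causal` flag is added only to contrast the unidirectional attention described for SASRec with the bidirectional attention described for BERT4Rec.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Unidirectional variant: position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


X = [[1.0, 0.0], [0.0, 1.0]]
bidirectional = attention(X, X, X)
unidirectional = attention(X, X, X, causal=True)
```

With `causal=True`, the first position can only see itself, so its output equals the first value vector. Without the mask, every output mixes information from both positions.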

3.2.2. Generative Recommendation (GR)

These models are the direct predecessors and contemporaries that Pctx builds upon and differentiates itself from.

TIGER (Rajput et al., 2023): One of the foundational GR models. It applies RQ-VAE to discretize item embeddings into semantic IDs and then uses a generative retrieval paradigm for recommendation. TIGER uses static tokenization, meaning each item is always mapped to the same semantic ID.
LETTER (Wang et al., 2024a): Extends TIGER by incorporating collaborative information and diversity-oriented constraints into the RQ-VAE process to improve semantic ID quality. It also employs static tokenization.
ActionPiece (Hou et al., 2025b): The first context-aware tokenization approach mentioned by the paper. It merges frequent co-occurring features with probabilistic weighting and introduces set permutation regularization to better exploit action sequences. However, its context is typically limited to adjacent actions, making it less effective at capturing long-term user personalities.
Multi-Identifier Tokenizers (e.g., MTGRec (Zheng et al., 2025)): Assign multiple semantic IDs to each item. However, the paper clarifies that MTGRec's approach is for data augmentation during pre-training, sampling semantic IDs from different epochs of the same RQ-VAE model. It still relies on the universal similarity assumption and is not inherently personalized based on user context.

3.3. Technological Evolution

The evolution in recommender systems has moved from simple collaborative filtering to matrix factorization, then to complex deep learning models for sequential recommendation (e.g., GRU4Rec, SASRec, BERT4Rec). These ID-based methods face challenges with memory and scalability due to large item embedding tables.

Generative Recommendation emerged as a solution to these issues, by tokenizing items into compact semantic IDs and using autoregressive models for prediction. Early GR models like TIGER and LETTER demonstrated the potential of this paradigm but inherited a critical limitation from ID-based systems: the static and non-personalized nature of item representations. They assumed a universal notion of item similarity.

ActionPiece took a step towards context-awareness, but its context was typically local. Pctx represents the next leap in this evolution by introducing truly personalized, long-term context-aware tokenization. It acknowledges that the semantic ID of an item should reflect the user's specific historical interactions, thereby enabling the GR model to capture diverse interpretations and generate highly personalized recommendations. Pctx fits into this timeline by addressing a fundamental personalization gap in the tokenization stage of Generative Recommendation.

3.4. Differentiation Analysis

Pctx differentiates itself from previous Generative Recommendation approaches primarily through its novel personalized context-aware tokenizer:

Compared to Static Tokenizers (TIGER, LETTER):
- Core Difference: Static tokenizers assign a fixed semantic ID to each item, regardless of the user or interaction context. This implicitly enforces a universal item similarity standard, as items with shared semantic ID prefixes will always receive similar prediction probabilities in autoregressive models.
- Pctx Innovation: Pctx overcomes this by tokenizing each item into different semantic IDs conditioned on the personalized user context (historical interactions). This allows the model to capture multiple interpretations of the same item, breaking free from the static similarity assumption.
Compared to Multi-Identifier Tokenizers (MTGRec):
- Core Difference: While MTGRec also assigns multiple semantic IDs to an item, its primary mechanism is data augmentation during pre-training by sampling semantic IDs from different model states. This approach does not inherently provide personalization based on user context; it still largely operates under the universal similarity assumption regarding the meaning of the different semantic IDs.
- Pctx Innovation: Pctx's multi-identifier mapping is explicitly driven by distinct user interpretations derived from personalized context. Each semantic ID for an item corresponds to a unique way a user might perceive or interact with that item, making the distinction inherently personalized and context-dependent.
Compared to Context-Aware Tokenizers (ActionPiece):
- Core Difference: ActionPiece is indeed context-aware, but its context is typically limited to adjacent actions within a sequence. This local context can capture immediate sequential patterns but often falls short in reflecting a user's broader, longer-term personality or preferences.
- Pctx Innovation: Pctx expands the perceived context window to incorporate the entire user interaction history. This long-term context allows the tokenizer to capture deeper personalities and evolving preferences, leading to more nuanced and accurate personalized semantic IDs.
  
In essence, Pctx's core innovation lies in its ability to dynamically adapt item tokenization based on an individual user's comprehensive interaction history, thereby enabling Generative Recommendation models to produce predictions that reflect diverse and personalized user intentions.

4. Methodology

4.1. Principles

The core idea behind Pctx is to overcome the limitations of static, non-personalized action tokenization in Generative Recommendation (GR) models. The fundamental principle is that the meaning or interpretation of an item can vary significantly based on the individual user's context, specifically their past interactions. Therefore, instead of assigning a fixed semantic ID to each item, Pctx aims to generate personalized semantic IDs that reflect these diverse user interpretations.

The theoretical intuition is that if a user has consistently interacted with items of a certain type, their interpretation of a new, potentially multi-faceted item will be colored by that historical preference. For example, a user who frequently buys story-driven games might perceive "StarCraft II" primarily for its narrative, while a user interested in real-time strategy might focus on its strategic gameplay. Pctx's design allows the Generative Recommendation model to capture these distinct perspectives by tokenizing the same item into different semantic IDs depending on the user context. This dynamic tokenization then enables the autoregressive model to generate predictions that are genuinely personalized, anticipating how a user might interpret a potential next item.

4.2. Core Methodology In-depth (Layer by Layer)

Pctx operates by taking both the current item and the user's interaction history as input to produce personalized semantic IDs. The framework involves deriving rich context representations, condensing them, constructing semantic IDs, and then training a Generative Recommendation model with these personalized semantic IDs. Figure 2 provides an overview of the Pctx framework.

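Figure 2: Overall framework of Pctx. The image is a schematic diagram showing the overall framework of the Pctx model, including the input features and user context, the generation of personalized semantic IDs, and the autoregressive generative model's multi-facet semantic ID prediction. The left side shows the training data and context representations, the middle part shows the fusion into personalized semantic IDs, and the right side shows the generative model's multi-facet semantic ID generation mechanism and probability prediction.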

The overall framework of Pctx illustrates how user context is integrated into the tokenization process. On the left, user interaction history is used to derive context representations. These are then fused with item feature representations and quantized into personalized semantic IDs in the middle. The right side shows an autoregressive generative model that predicts these semantic IDs in a multi-faceted manner.

4.2.1. Problem Formulation

Following the standard sequential recommendation setting, the paper represents each user's historical interactions as a chronologically ordered sequence of items: $ \mathcal{S} = [v_1, v_2, \ldots, v_n] $ where:

$v_i \in \mathcal{V}$ denotes an interacted item from the item set $\mathcal{V}$.
$n$ is the number of past interactions in the sequence.

The traditional goal is to predict the next item given $\mathcal{S}$. Generative Recommendation models reformulate this task. Each item $v_i$ is tokenized into a sequence of discrete tokens: $ [m_1^i, m_2^i, \ldots, m_G^i] $ This sequence is referred to as a semantic ID, where $G$ is the fixed number of tokens per semantic ID. The task then becomes predicting the semantic ID(s) of the target item given a sequence formed by concatenating the semantic IDs of historical items.
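
The reformulated task can be illustrated with a short sketch. The mapping `ITEM_TO_SID`, the token values, and the helper `build_example` are hypothetical, and the example uses a static mapping with $G = 3$ purely to show how the input and target token sequences are formed.

```python
def build_example(history, target, item_to_sid):
    """Concatenate the semantic IDs of historical items; the target is one semantic ID."""
    input_tokens = [tok for item in history for tok in item_to_sid[item]]
    target_tokens = list(item_to_sid[target])
    return input_tokens, target_tokens


# Hypothetical mapping with G = 3 tokens per item (token values are arbitrary).
ITEM_TO_SID = {
    "v1": (12, 130, 301),
    "v2": (12, 145, 377),
    "v3": (40, 130, 390),
}

inputs, targets = build_example(["v1", "v2"], "v3", ITEM_TO_SID)
# inputs  -> [12, 130, 301, 12, 145, 377]
# targets -> [40, 130, 390]
```

Under Pctx the lookup `item_to_sid[item]` would depend on the user context as well as the item, but the flattening step stays the same.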

4.2.2. Personalized Action Tokenization

This is the core component of Pctx, designed to tokenize an item based on the user's context.

4.2.2.1. Personalized Context Representation

This step involves obtaining rich context representations from the training data.

User Context Encoding: An auxiliary model is used to encode the user context for each item $v_i$. This model takes the current item and its preceding historical interactions as input: $ \pmb{e}_{v_i}^{ctx} = f([v_1, v_2, \ldots, v_i]) $ where:
- $\pmb{e}_{v_i}^{ctx} \in \mathbb{R}^{d_1}$ is the context embedding (or representation) for item $v_i$.
- $[v_1, v_2, \ldots, v_i]$ represents the sequence of items up to and including $v_i$.
- $f(\cdot)$ is a sequence model responsible for encoding this context. The paper specifies that the goal is not merely next-item prediction, but to derive user context representations that are sufficiently distinguishable to capture user personalities. For this, DuoRec (Qiu et al., 2022) is adopted, which uses contrastive learning to prevent representation degeneration (where distinct inputs map to very similar embeddings).
Multi-Facet Condensation of Context Representations: An item might appear many times in the training data, each time with a different user context, indicating diverse user interpretations. To manage the number of semantic IDs and avoid sparsity (where each semantic ID appears too rarely), Pctx groups these context representations by the item $v_i$ and condenses them. Specifically, for each item $v_i$, k-means++ clustering is applied to its associated context representations ($\pmb{e}_{v_i}^{ctx}$). This process generates $C_{v_i}$ centroids, which serve as representative context representations for that item. The number of centroids, $C_{v_i}$, is chosen proportionally to the number of available context representations for $v_i$, reflecting the richness of its interaction data while avoiding excessive splitting. The exact determination of $C_{v_i}$ is detailed in Section 4.2.4.
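
The per-item condensation step can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`kmeans`, `condense_contexts`), the fixed iteration count, and the seed are assumptions, and a real pipeline would likely rely on an optimized clustering library.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability proportional to
    its squared distance from the nearest center chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point already coincides with a center
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd iterations started from k-means++ seeds; returns the centroids."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def condense_contexts(ctx_by_item, n_centroids):
    """Cluster each item's context embeddings separately into C_{v_i} centroids."""
    return {item: kmeans(ctxs, n_centroids[item]) for item, ctxs in ctx_by_item.items()}


# Toy data: item "v1" is seen in two clearly different kinds of user context.
toy_contexts = {"v1": [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]}
toy_centroids = condense_contexts(toy_contexts, {"v1": 2})
```

Clustering is done per item, so the centroids of one item never mix with the contexts of another. Each centroid then stands for one typical way users arrive at that item.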

4.2.2.2. Personalized Semantic ID

After obtaining the representative context representations, these are tokenized into discrete semantic IDs.

Semantic ID Construction from Context Representations: In addition to the context representations, Pctx incorporates item feature representations to provide more comprehensive information. A feature representation $\pmb{e}_{v_i}^{feat} \in \mathbb{R}^{d_2}$ is derived for each item $v_i$ by encoding textual features (e.g., titles, descriptions) using a pre-trained sentence embedding model like sentence-t5-base (Ni et al., 2022). The context and feature representations for an item $v_i$ are then fused for each of its $C_{v_i}$ representative contexts. The fused representation for the $k$-th context of item $v_i$ is given by: $ \pmb{e}_{v_i, k} = \operatorname{concat}(\alpha \cdot \pmb{e}_{v_i, k}^{ctx}, (1 - \alpha) \cdot \pmb{e}_{v_i}^{feat}), \quad k \in \{1, 2, \ldots, C_{v_i}\} $ where:
- $\pmb{e}_{v_i, k} \in \mathbb{R}^{d_1 + d_2}$ is the $k$-th fused representation for item $v_i$.
- $\pmb{e}_{v_i, k}^{ctx}$ is the $k$-th encoded user context representation (one of the $C_{v_i}$ centroids for $v_i$).
- $\alpha$ is a hyperparameter (a scalar weight between 0 and 1) that balances the contribution of the context representation and the feature representation to the fused embedding.
- $\operatorname{concat}(\cdot, \cdot)$ denotes the concatenation operation, combining the two vectors.
After obtaining these fused representations for all items, Pctx follows Rajput et al. (2023) and applies RQ-VAE (Residual Quantization Variational AutoEncoder, Zeghidour et al., 2021) to quantize each fused representation. This converts the continuous vector into a sequence of $G-1$ discrete tokens. An additional token is appended to this sequence to avoid conflicts between semantic IDs, resulting in a final $G$-digit semantic ID.
Redundant Semantic ID Merging: To further improve generalizability and prevent sparsity caused by too many unique semantic IDs, two types of merging strategies are applied:
- Merging of duplicated semantic IDs: It's possible for an item to be assigned multiple semantic IDs that are identical except for their very last token. Since the last token is purely for conflict resolution and carries no semantic meaning, these semantic IDs are considered semantically equivalent. Pctx merges these by retaining only one of them, ensuring the last token is only used to distinguish semantic IDs between different items, not within the same item.
- Merging of infrequent semantic IDs: Some semantic IDs may appear very rarely in the dataset, potentially due to outliers or an excessive number of centroids during clustering. These infrequent IDs can harm generalization if kept. Pctx sets a frequency threshold $\tau$. Any semantic ID appearing less often than $\tau$ is removed, and all instances previously associated with it are re-assigned to the nearest remaining centroid of the same item. This balances personalization with the need for sufficient training data for each semantic ID.

As a result of these steps, each item can now be associated with multiple semantic IDs, where each semantic ID represents a typical user interpretation under different contexts.
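
The fusion, quantization, and infrequent-ID merging steps described in this subsection can be sketched as below. Several simplifications are assumptions made for illustration: the codebooks are fixed toy lists rather than learned RQ-VAE codebooks, the RQ-VAE encoder and decoder are omitted, the conflict-resolution token is not modeled, and `merge_infrequent` measures "nearest" by centroid-to-centroid distance, which is one possible reading of the description.

```python
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuse(ctx_centroid, feat, alpha):
    """concat(alpha * e_ctx, (1 - alpha) * e_feat)."""
    return [alpha * x for x in ctx_centroid] + [(1 - alpha) * x for x in feat]


def residual_quantize(vec, codebooks):
    """Greedy residual quantization: at each level pick the nearest code,
    emit its index as a token, and pass the remaining residual to the next level."""
    residual, tokens = list(vec), []
    for codebook in codebooks:
        idx = min(range(len(codebook)), key=lambda j: sq_dist(residual, codebook[j]))
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens)


def merge_infrequent(centroids, assignments, tau):
    """Drop centroids of one item used fewer than tau times and remap their
    instances to the nearest kept centroid of that item."""
    counts = Counter(assignments)
    kept = [c for c in range(len(centroids)) if counts[c] >= tau]
    if not kept:  # assumption: always keep at least the most frequent centroid
        kept = [counts.most_common(1)[0][0]]

    def nearest_kept(c):
        return min(kept, key=lambda k: sq_dist(centroids[c], centroids[k]))

    return [c if c in kept else nearest_kept(c) for c in assignments]


fused = fuse([1.0, 2.0], [4.0], alpha=0.5)
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # level 1
    [[0.0, 0.0], [0.5, 0.5]],  # level 2, applied to the level-1 residual
]
tokens = residual_quantize([1.0, 1.0], toy_codebooks)
remapped = merge_infrequent([[0.0], [1.0], [10.0]], [0, 0, 0, 1, 2, 2, 2], tau=2)
```

In the toy call, centroid 1 is used only once, so its single instance is folded into centroid 0, the closest centroid that survives the threshold.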

4.2.3. Generative Recommendation Under Pctx

This section describes how the personalized semantic IDs are used for training and inference in a Generative Recommendation model.

Training with Data Augmentation: An autoregressive encoder-decoder model is trained on sequences of personalized semantic IDs using a next-token prediction loss (similar to Rajput et al., 2023). During the tokenization process for training, when an item $v_i$ and its user context $[v_1, v_2, \ldots, v_{i-1}]$ are considered, a fused personalized semantic representation is derived using the fusion equation from Section 4.2.2.2 (Equation (2) in the paper). The semantic ID for $v_i$ is then selected as the one whose centroid (from the $C_{v_i}$ representative centroids) is closest to this fused representation. By performing this for all items in a user's sequence, training sequences of personalized semantic IDs are constructed.

To further enhance data diversity and implicitly connect different semantic IDs for the same item, an augmentation strategy is introduced: each personalized semantic ID in a training sequence is randomly replaced with another semantic ID corresponding to the same item with a probability $\gamma$. This means that if an item $v_i$ has semantic IDs $SID_A$ and $SID_B$, and the chosen personalized semantic ID was $SID_A$, there is a $\gamma$ chance it is swapped to $SID_B$. Even if the augmented sequence does not always reflect the most accurate user interpretation, it still represents a valid interaction possibility and helps the model generalize across different facets of an item.
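
The augmentation rule can be sketched as follows. The function name `augment_sequence`, the data layout (a dictionary from item to its list of semantic IDs), and the toy IDs are assumptions for illustration.

```python
import random


def augment_sequence(items, chosen_sids, sids_by_item, gamma, rng):
    """With probability gamma, swap each semantic ID for a different
    semantic ID of the same item; otherwise keep the selected one."""
    augmented = []
    for item, sid in zip(items, chosen_sids):
        alternatives = [s for s in sids_by_item[item] if s != sid]
        if alternatives and rng.random() < gamma:
            sid = rng.choice(alternatives)
        augmented.append(sid)
    return augmented


# Hypothetical semantic IDs: "v1" has two facets, "v2" has only one.
SIDS_BY_ITEM = {"v1": [(1, 2, 0), (3, 4, 0)], "v2": [(5, 6, 0)]}
items = ["v1", "v2"]
chosen = [(1, 2, 0), (5, 6, 0)]

unchanged = augment_sequence(items, chosen, SIDS_BY_ITEM, 0.0, random.Random(0))
all_swapped = augment_sequence(items, chosen, SIDS_BY_ITEM, 1.0, random.Random(0))
```

Items with a single semantic ID are never altered, so the augmentation only affects items that actually have several facets.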
Multi-Facet Semantic ID Generation: During inference, Pctx utilizes beam search (following Rajput et al., 2023; Zheng et al., 2024) to generate semantic ID predictions. Since an item can have multiple personalized semantic IDs, different decoding paths in beam search can lead to distinct personalized semantic IDs for the same underlying item. Each of these predicted semantic IDs will have an associated probability, representing the likelihood of a user perceiving a potential next item from a specific facet or interpretation. These probabilities for different semantic IDs of the same item are then aggregated to obtain the final next-item probabilities. This multi-facet semantic ID generation not only provides the predicted items but also offers insights into the likelihoods of various user interpretations, thereby improving the explainability of the recommendation process.
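
A minimal sketch of the aggregation step is shown below. The text only says that the probabilities are aggregated, so summing them per item is an assumption here, as are the function name and the toy beam outputs.

```python
from collections import defaultdict


def aggregate_item_scores(beams, sid_to_item):
    """Combine the probabilities of all decoded semantic IDs that map to the
    same item (summation assumed), then rank items by the combined score."""
    scores = defaultdict(float)
    for sid, prob in beams:
        item = sid_to_item.get(sid)
        if item is not None:  # ignore decoded sequences that match no item
            scores[item] += prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical beam search output: (semantic ID, probability) pairs.
beams = [((1, 2), 0.30), ((3, 4), 0.25), ((5, 6), 0.40), ((9, 9), 0.05)]
SID_TO_ITEM = {(1, 2): "A", (3, 4): "A", (5, 6): "B"}

ranked = aggregate_item_scores(beams, SID_TO_ITEM)
```

In this toy case item A outranks item B after aggregation, although B holds the single most probable semantic ID. The per-ID probabilities remain available as an indication of which facet drove the recommendation.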

4.2.4. Determination of the Number of Centroids Per Item (Appendix B)

This section explains how Pctx determines $C_{v_i}$ , the number of context centroids for each item $v_i$ . The goal is to assign more centroids to items with higher user interpretation diversity while preventing excessive splitting that leads to sparsity. The strategy avoids a simple linear scaling with interaction count, which could over-allocate to popular items and under-allocate to rare ones.

The proposed strategy has three parts:

Interaction-aware Grouping:
- All items are sorted in ascending order based on their number of context representations (i.e., how many times they appear in user histories).
- These sorted items are then partitioned into $T$ groups.
- The proportion of items assigned to each group is determined by sampling $T$ discrete support points from a normalized Gamma distribution $\overline{\mathrm{Gamma}}(K, \theta=1)$ over the integer interval [1, T].
- The shape parameter $K$ of the Gamma distribution controls the skewness of this allocation: a smaller $K$ favors items in the tail (less popular items), while a larger $K$ allocates more capacity towards head items (more popular items).
- This ensures that items with similar interaction volumes are grouped together.
Group-based Centroid Allocation:
- Each group $t$ (where $t \in \{1, \ldots, T\}$ ) is assigned a predefined number of centroids based on an arithmetic progression.
- The number of centroids for group $t$, denoted $\overline{C}^{(t)}$, is calculated as: $ \overline{C}^{(t)} = C_{\mathrm{start}} + (t - 1) \cdot \delta $ where:
  - $C_{\mathrm{start}}$ is the starting number of centroids (for the first group).
  - $\delta$ is a small step size, which determines how much the number of centroids increases from one group to the next.
- All items $v_i$ within the same group $t$ are assigned the same number of centroids, $C_{v_i} = \overline{C}^{(t)}$ . This provides a smooth scaling of capacity and ensures consistent treatment for items with similar interaction levels.
Practical Adjustment:
- For rare items (those with a number of context representations smaller than their initially assigned $C_{v_i}$ ), a simplification is applied. Instead of attempting to form multiple clusters from insufficient data, $C_{v_i}$ is set to 1.
- Clustering is then performed with a single centroid for these rare items. This provides a robust solution for context condensation in the presence of long-tailed data (many rare items), balancing specialization (multiple centroids for diverse items) and generalization (single centroid for less diverse or rare items).
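
The three parts above can be sketched together as follows. Several details are assumptions rather than the paper's procedure: the group proportions are obtained by evaluating the Gamma$(K, \theta=1)$ density at the integers $1, \ldots, T$ and normalizing, group boundaries are rounded to whole items, and all function names are hypothetical.

```python
import math

def group_proportions(T, K):
    """Gamma(K, theta=1) density at integers 1..T, normalized to sum to 1."""
    pdf = [t ** (K - 1) * math.exp(-t) / math.gamma(K) for t in range(1, T + 1)]
    total = sum(pdf)
    return [p / total for p in pdf]

def allocate_centroids(context_counts, T, K, c_start, delta):
    """context_counts: dict item -> number of context representations."""
    items = sorted(context_counts, key=context_counts.get)  # ascending order
    props = group_proportions(T, K)
    # Cumulative boundaries of the T groups over the sorted item list.
    bounds, acc = [], 0.0
    for p in props:
        acc += p
        bounds.append(round(acc * len(items)))
    bounds[-1] = len(items)  # guard against rounding drift
    allocation, start = {}, 0
    for t, end in enumerate(bounds, start=1):
        c_t = c_start + (t - 1) * delta  # arithmetic progression
        for item in items[start:end]:
            # Practical adjustment: too few contexts -> single centroid.
            allocation[item] = c_t if context_counts[item] >= c_t else 1
        start = end
    return allocation

counts = {"a": 1, "b": 2, "c": 5, "d": 9}
allocation = allocate_centroids(counts, T=2, K=1, c_start=2, delta=1)
```

In this toy run, item "a" lands in the first group (two centroids) but has only one context representation, so the practical adjustment reduces it to a single centroid.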

4.3. Discussion

Pctx is positioned within the landscape of action tokenization paradigms in Generative Recommendation:

Static Tokenizers (TIGER, LC-Rec): These assign fixed semantic IDs to each item. As discussed, this imposes a universal standard of item similarity due to the autoregressive nature of GR models, limiting their representational power. Pctx directly addresses this by making tokenization context-dependent.
Multi-Identifier Tokenizers (MTGRec): While appearing to offer multiple semantic IDs per item, MTGRec's approach is primarily a data augmentation strategy. It samples semantic IDs from different model states but does not inherently link these multiple IDs to distinct user interpretations based on dynamic user context. Pctx explicitly ensures that each of the multiple semantic IDs for an item reflects a distinct user interpretation.
Context-Aware Tokenizers (ActionPiece): ActionPiece tokenizes items based on their surrounding action context. Pctx belongs to this family but extends the concept of context. ActionPiece typically considers only adjacent actions (local context), which may not fully capture a user's personality. Pctx incorporates the entire user interaction history, allowing it to capture personalities reflected in longer-term contexts. This makes Pctx a more comprehensively personalized context-aware tokenizer.

5. Experimental Setup

5.1. Datasets

The experiments in this paper are conducted on three public datasets derived from the latest Amazon Reviews dataset (Hou et al., 2024). These datasets fall into different product categories, allowing for evaluation across diverse domains.

Source: Latest Amazon Reviews dataset.
Categories Used:
- "Musical Instruments" (Instrument)
- "Industrial & Scientific" (Scientific)
- "Video Games" (Game)
Preprocessing Pipeline (following Rajput et al., 2023; Zhou et al., 2020):
- Users and items with fewer than five interactions are excluded to mitigate data sparsity and noise.
- User-specific interaction histories are constructed and ordered chronologically.
- The maximum sequence length for user interactions is capped at 20 items.
Characteristics and Domain: These datasets represent real-world e-commerce interactions, covering diverse product types. "Musical Instruments" and "Video Games" are consumer-oriented, often reflecting personal hobbies and interests, while "Industrial & Scientific" might involve more professional or specialized purchasing patterns. This diversity helps validate the robustness of the proposed method.
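
The preprocessing pipeline listed above can be sketched as follows. Two choices are assumptions of this sketch: the five-interaction filter is applied repeatedly until no more users or items are dropped, and sequences longer than the cap keep their most recent items. The function name and tuple layout are hypothetical.

```python
from collections import Counter, defaultdict

def preprocess(interactions, min_count=5, max_len=20):
    """interactions: list of (user, item, timestamp) tuples."""
    # Drop users/items with fewer than min_count interactions
    # (iterating to a fixed point is an assumption of this sketch).
    while True:
        users = Counter(u for u, _, _ in interactions)
        items = Counter(i for _, i, _ in interactions)
        kept = [(u, i, t) for u, i, t in interactions
                if users[u] >= min_count and items[i] >= min_count]
        if len(kept) == len(interactions):
            break
        interactions = kept
    # Build chronological per-user histories, keeping the latest max_len items.
    histories = defaultdict(list)
    for u, i, t in sorted(interactions, key=lambda x: x[2]):
        histories[u].append(i)
    return {u: seq[-max_len:] for u, seq in histories.items()}

toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "a", 3),
       ("u2", "a", 4), ("u2", "b", 5), ("u3", "c", 6)]
histories = preprocess(toy, min_count=2, max_len=2)
```

The toy call uses smaller thresholds than the paper's (5 and 20) so that the effect of each step is visible on six interactions.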

The following are the results from Table 5 of the original paper:

Datasets	Users	Items	Interactions	Sparsity	AvgLen
Instrument	57,439	24,587	511,836	99.964%	8.91
Scientific	50,985	25,848	412,947	99.969%	8.10
Game	94,762	25,612	814,586	99.966%	8.60
Users: Number of unique users in the dataset.
Items: Number of unique items in the dataset.
Interactions: Total number of recorded interactions between users and items.
Sparsity: A measure of how few interactions there are compared to all possible interactions (User $\times$ Item matrix). A high sparsity (close to 100%) indicates a very sparse dataset, which is common in recommendation. For example, 99.964% sparsity means only 0.036% of all possible user-item interactions have occurred.
AvgLen: Average length of user interaction sequences after preprocessing.
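
As a sanity check, the Sparsity and AvgLen columns can be recomputed from the other three columns. The sketch below assumes the usual definitions, sparsity $= 1 - \text{Interactions} / (\text{Users} \times \text{Items})$ and AvgLen $= \text{Interactions} / \text{Users}$; the function name is hypothetical.

```python
def dataset_stats(users, items, interactions):
    """Return (sparsity in percent, average sequence length)."""
    sparsity = 100.0 * (1.0 - interactions / (users * items))
    avg_len = interactions / users
    return round(sparsity, 3), round(avg_len, 2)

instrument = dataset_stats(57_439, 24_587, 511_836)
scientific = dataset_stats(50_985, 25_848, 412_947)
game = dataset_stats(94_762, 25_612, 814_586)
```

Under these definitions the recomputed values agree with the table to the reported precision.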


5.2. Evaluation Metrics

The paper uses two widely adopted metrics for evaluating recommendation system performance, particularly in ranking tasks: Recall@K and Normalized Discounted Cumulative Gain@K (NDCG@K). $K$ is set to 5 and 10 in the experiments.

5.2.1. Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top $K$ recommendations. In the context of sequential recommendation (specifically, the leave-one-out setting used here), it indicates whether the single ground-truth next item is present in the list of the top $K$ predicted items. A higher Recall@K implies that the model is better at identifying the relevant items.
Mathematical Formula: For a single user $u$ with a single ground-truth relevant item $GT_u$, Recall@K is calculated as: $ \text{Recall@K}_u = \begin{cases} 1 & \text{if } GT_u \in \text{Top-K recommendations for user } u \\ 0 & \text{otherwise} \end{cases} $ When averaged over $N$ users, the overall Recall@K is: $ \text{Recall@K} = \frac{1}{N} \sum_{u=1}^{N} \text{Recall@K}_u $
Symbol Explanation:
- $N$ : The total number of users in the evaluation set.
- $u$ : An individual user.
- $GT_u$ : The ground-truth relevant item for user $u$ (in this paper's setting, the next item in the user's sequence).
- $\text{Top-K recommendations for user } u$ : The list of the top $K$ items recommended by the model for user $u$ .
- $\in$ : Denotes membership (i.e., whether an item is present in a set or list).
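
The leave-one-out Recall@K defined above can be computed with a few lines of Python. This is a minimal sketch with hypothetical names and toy data, not the paper's evaluation code.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Mean of per-user hit indicators under leave-one-out evaluation."""
    hits = [1.0 if gt in ranked[:k] else 0.0
            for ranked, gt in zip(ranked_lists, ground_truth)]
    return sum(hits) / len(hits)

# Two users: the first ground-truth item is ranked second, the second is missing.
ranked_lists = [["a", "b", "c"], ["d", "e", "f"]]
ground_truth = ["b", "z"]
recall_2 = recall_at_k(ranked_lists, ground_truth, k=2)
```

Because each user has exactly one relevant item, the per-user value is either 0 or 1 and the overall metric is simply the hit rate.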

5.2.2. Normalized Discounted Cumulative Gain@K (NDCG@K)

Conceptual Definition: NDCG@K is a measure of ranking quality that accounts for the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear earlier in the list. The "Discounted Cumulative Gain" part sums the utility of items in the list, penalizing items at lower ranks. "Normalized" means it's divided by the Ideal DCG (the DCG of a perfectly ordered list), ensuring the score is between 0 and 1, regardless of the number of relevant items. A higher NDCG@K indicates that relevant items are not only retrieved but also ranked highly.
Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated: $ \text{DCG@K} = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $ Then, the Ideal DCG (IDCG@K) is calculated for a perfectly ordered list of relevant items: $ \text{IDCG@K} = \sum_{i=1}^{|\text{Relevant items}|} \frac{2^{\text{rel}_{\text{ideal}, i}} - 1}{\log_2(i+1)} $ Finally, NDCG@K is: $ \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} $ In the context of leave-one-out evaluation, where there is only one ground-truth relevant item ($GT_u$) at position $p$ in the recommended list (if found within the top $K$): $ \text{DCG@K}_u = \frac{1}{\log_2(p+1)} \quad \text{if } GT_u \text{ is at position } p \le K; \quad 0 \text{ otherwise} $ $ \text{IDCG@K}_u = \frac{1}{\log_2(1+1)} = 1 $ So for a single user, if $GT_u$ is at position $p \le K$: $ \text{NDCG@K}_u = \frac{1}{\log_2(p+1)} $ Otherwise, $\text{NDCG@K}_u = 0$. The overall NDCG@K is then the average over all users: $ \text{NDCG@K} = \frac{1}{N} \sum_{u=1}^{N} \text{NDCG@K}_u $
Symbol Explanation:
- $K$ : The number of top recommendations considered.
- $\text{rel}_i$ : The relevance score of the item at position $i$ in the recommended list. In binary relevance (item is either relevant or not), $\text{rel}_i$ is 1 if the item is relevant, and 0 otherwise. In this paper's setting, it's 1 for the ground-truth next item, 0 for others.
- $i$ : The rank (position) of an item in the recommendation list.
- $\text{rel}_{\text{ideal}, i}$ : The relevance score of the item at position $i$ in the ideal (perfectly ordered) recommendation list.
- $p$ : The rank (position) of the single ground-truth relevant item in the recommended list.
- $\log_2(i+1)$ : The logarithmic discount factor, which reduces the importance of relevant items found at lower ranks.
- $N$ : The total number of users.
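
The simplified leave-one-out form of NDCG@K, where each user contributes $1/\log_2(p+1)$ or 0, can be sketched as follows. Names and toy data are hypothetical.

```python
import math

def ndcg_at_k(ranked_lists, ground_truth, k):
    """Leave-one-out NDCG@K: 1/log2(p+1) if the target is at rank p <= k."""
    total = 0.0
    for ranked, gt in zip(ranked_lists, ground_truth):
        top = ranked[:k]
        if gt in top:
            p = top.index(gt) + 1  # 1-based rank of the ground-truth item
            total += 1.0 / math.log2(p + 1)
    return total / len(ranked_lists)

# Two users: the first target is at rank 2, the second is not retrieved.
ranked_lists = [["a", "b", "c"], ["d", "e", "f"]]
ground_truth = ["b", "z"]
ndcg_2 = ndcg_at_k(ranked_lists, ground_truth, k=2)
```

Unlike Recall@K, a hit at rank 2 contributes $1/\log_2 3 \approx 0.63$ rather than 1, so the metric rewards placing the target nearer the top.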

5.3. Baselines

The paper compares Pctx against a comprehensive set of baselines, categorized into Conventional Sequential Recommendation and Generative Recommendation models.

5.3.1. Conventional Sequential Recommendation

These models predict the next item based on its unique ID.

Caser (Tang & Wang, 2018): Uses convolutional neural networks to capture sequential patterns.
HGN (Ma et al., 2019): Leverages hierarchical gating networks for user preference representation.
GRU4Rec (Hidasi et al., 2016): An early deep learning model using Gated Recurrent Units for session-based recommendations.
BERT4Rec (Sun et al., 2019): Applies a bidirectional Transformer encoder with masked item prediction.
SASRec (Kang & McAuley, 2018): A popular self-attentive sequential recommendation model.
FMLP-Rec (Zhou et al., 2022): A fully MLP-based framework using learnable filters.
HSTU (Zhai et al., 2024): Incorporates action-timestamp signals and hierarchical sequential transducers. (Still ID-based despite being recent)
DuoRec (Qiu et al., 2022): Addresses representation collapse using contrastive learning. (Notably, Pctx uses DuoRec as its auxiliary context encoder).
FDSA (Zhang et al., 2019): Employs a dual-stream self-attention design.
S3-Rec (Zhou et al., 2020): Improves representation learning with self-supervised objectives.

5.3.2. Generative Recommendation

These models utilize action tokenization.

TIGER (Rajput et al., 2023): A pioneering GR model using RQ-VAE to discretize item embeddings into semantic IDs. (Static tokenizer)
LETTER (Wang et al., 2024a): Extends TIGER by injecting collaborative information and diversity constraints. (Static tokenizer)
ActionPiece (Hou et al., 2025b): Proposes a context-aware tokenization framework, but limited to local context (adjacent actions).

These baselines are representative as they cover both traditional and modern sequential recommendation approaches, as well as the latest advancements in generative recommendation, including those with context-aware capabilities. This allows for a thorough comparison of Pctx's personalized context approach.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Pctx consistently outperforms all baseline methods across all three datasets and evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10). This strongly validates the effectiveness of the proposed personalized context-aware tokenization approach.

The following are the results from Table 1 of the original paper:

Methods	Instrument				Scientific				Game
Methods	R@5	R@10	N@5	N@10	R@5	R@10	N@5	N@10	R@5	R@10	N@5	N@10
Caser	0.0241	0.0386	0.0151	0.0197	0.0159	0.0257	0.0101	0.0132	0.0330	0.0553	0.0209	0.0281
HGN	0.0321	0.0517	0.0202	0.0265	0.0212	0.0351	0.0131	0.0176	0.0424	0.0687	0.0281	0.0356
GRU4Rec	0.0324	0.0501	0.0209	0.0266	0.0202	0.0338	0.0129	0.0173	0.0499	0.0799	0.0320	0.0416
BERT4Rec	0.0307	0.0485	0.0195	0.0252	0.0186	0.0296	0.0119	0.0155	0.0460	0.0735	0.0298	0.0386
SASRec	0.0333	0.0523	0.0213	0.0274	0.0259	0.0412	0.0150	0.0199	0.0535	0.0847	0.0331	0.0438
FMLP-Rec	0.0339	0.0536	0.0218	0.0282	0.0269	0.0422	0.0155	0.0204	0.0528	0.0857	0.0338	0.0444
HSTU	0.0343	0.0577	0.0191	0.0271	0.0271	0.0429	0.0147	0.0198	0.0578	0.0903	0.0334	0.0442
DuoRec	0.0347	0.0547	0.0227	0.0291	0.0234	0.0389	0.0146	0.0196	0.0524	0.0827	0.0336	0.0433
FDSA	0.0347	0.0545	0.0230	0.0293	0.0262	0.0421	0.0169	0.0213	0.0544	0.0852	0.0361	0.0448
S3-Rec	0.0317	0.0496	0.0199	0.0257	0.0263	0.0418	0.0171	0.0219	0.0485	0.0769	0.0315	0.0406
TIGER	0.0370	0.0564	0.0244	0.0306	0.0264	0.0422	0.0175	0.0226	0.0559	0.0868	0.0366	0.0467
LETTER	0.0372	0.0580	0.0246	0.0313	0.0279	0.0435	0.0182	0.0232	0.0563	0.0877	0.0372	0.0473
ActionPiece	0.0383	0.0615	0.0243	0.0318	0.0284	0.0452	0.0182	0.0236	0.0591	0.0927	0.0382	0.0490
Pctx	0.0419	0.0655	0.0275	0.0350	0.0323	0.0504	0.0205	0.0263	0.0638	0.0981	0.0416	0.0527
Improvements	+9.40%	+6.50%	+11.79%	+10.06%	+13.73%	+11.50%	+12.64%	+11.44%	+7.95%	+5.82%	+8.90%	+7.55%

Key Observations:

GR Models vs. ID-based Models: Generally, Generative Recommendation (GR) models (TIGER, LETTER, ActionPiece, Pctx) achieve superior performance compared to conventional ID-based sequential recommendation approaches. This confirms the benefits of action tokenization and the generative retrieval paradigm for improving recommendation quality.
ActionPiece's Strength: Among the baselines, ActionPiece performs best on nearly all metrics (LETTER is marginally ahead on Instrument NDCG@5, 0.0246 vs. 0.0243, and ties it on Scientific NDCG@5). This indicates that incorporating context-aware action tokenization, even if limited to local context, provides stronger expressive power than static tokenization methods like TIGER and LETTER.
Pctx's Superiority: Pctx significantly outperforms all baselines on all four metrics (Recall@5, Recall@10, NDCG@5, NDCG@10) across all three datasets.
- The improvements are substantial, reaching up to 11.44% in NDCG@10 over the best-performing baseline (ActionPiece) on the "Scientific" dataset. This highlights Pctx's ability to provide more personalized and accurate predictions.
- The "Scientific" dataset shows the largest percentage improvements for Pctx, suggesting that personalized context might be particularly impactful in domains with potentially more diverse or specialized interpretations of items.
Reason for Pctx's Success: The paper attributes Pctx's success to its unique design as the first paradigm to introduce a personalized context-aware tokenizer for GR. By allowing the same action to be tokenized into different personalized semantic IDs based on a user's entire interaction history, Pctx enables the model to capture diverse user interpretations and generate more personalized recommendations, which is a fundamental advantage over existing approaches.
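
The "Improvements" row of Table 1 reports the relative gain of Pctx over the best baseline for each metric. A minimal sketch of that calculation, using the Scientific NDCG@10 values of the three GR baselines from the table (the function name is hypothetical):

```python
def relative_improvement(ours, baselines):
    """Percentage gain of `ours` over the best baseline score."""
    best = max(baselines.values())
    return 100.0 * (ours - best) / best

scientific_ndcg10 = {"TIGER": 0.0226, "LETTER": 0.0232, "ActionPiece": 0.0236}
gain = relative_improvement(0.0263, scientific_ndcg10)
```

Rounded to two decimals, `gain` matches the +11.44% reported for NDCG@10 on Scientific.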

6.2. Ablation Studies / Parameter Analysis

To understand the contribution of each component within Pctx, an ablation study was conducted.

The following are the results from Table 2 of the original paper:

Variants	Instrument				Scientific
Variants	R@5	R@10	N@5	N@10	R@5	R@10	N@5	N@10
Personalized context
(1.1) with SASRec	0.0395	0.0612	0.0261	0.0330	0.0294	0.0458	0.0190	0.0243
(1.2) with SASRec Item Embedding	0.0360	0.0573	0.0231	0.0300	0.0281	0.0448	0.0182	0.0235
(1.3) with DuoRec Item Embedding	0.0378	0.0594	0.0249	0.0318	0.0278	0.0445	0.0180	0.0235
TIGER	0.0370	0.0564	0.0244	0.0306	0.0264	0.0422	0.0175	0.0226
Tokenization
(2.1) w/o Clustering	0.0386	0.0596	0.0249	0.0316	0.0295	0.0462	0.0192	0.0245
(2.2) w/o Redundant SID Merging	0.0270	0.0415	0.0175	0.0221	0.0201	0.0316	0.0133	0.0170
Model training and inference
(3.1) w/o Data Augmentation	0.0366	0.0577	0.0240	0.0308	0.0291	0.0457	0.0188	0.0242
(3.2) w/o Multi-Facet Generation	0.0376	0.0594	0.0242	0.0312	0.0282	0.0449	0.0181	0.0235
Pctx	0.0419	0.0655	0.0275	0.0350	0.0323	0.0504	0.0205	0.0263

6.2.1. Study of Personalized Context

This part investigates the impact of the source and nature of personalized context representations.

(a) Pctx vs. (1.1) with SASRec: Pctx (using DuoRec for context encoding) performs better than (1.1) with SASRec (using SASRec). This suggests that DuoRec, which uses contrastive learning to make sequence representations more distinguishable, is more effective for generating rich user context representations suitable for personalization, even if SASRec might sometimes perform better on the next-item prediction task itself. The ability to capture distinct user personalities for tokenization is crucial.
(b) Pctx vs. (1.2) with SASRec Item Embedding & (1.3) with DuoRec Item Embedding: Variants using item embeddings from pre-trained models (static representations) show larger performance degradation compared to using sequence representations. This confirms that incorporating actual user context (sequential interactions) is vital, as item embeddings alone cannot capture dynamic user perspectives.

6.2.2. Effects of Tokenization

This section examines the impact of the strategies for managing personalized semantic IDs.

(2.1) w/o Clustering: Removing the clustering step (which condenses context representations into centroids) leads to a performance drop. This indicates that context condensation is important for creating meaningful and manageable semantic ID prototypes, preventing over-personalization that could lead to sparsity.
(2.2) w/o Redundant SID Merging: Disabling the redundant semantic ID merging strategy results in a more severe performance drop. This emphasizes the importance of managing the number of semantic IDs. Without merging, the system likely generates too many sparse semantic IDs, hindering the generalization ability of the GR model. The merging strategy is crucial for striking a balance between personalization and generalizability.

6.2.3. Model Training and Inference

This part evaluates the strategies applied during the GR model's training and inference phases.

(3.1) w/o Data Augmentation: When data augmentation (randomly replacing personalized semantic IDs with other valid semantic IDs for the same item) is removed, there's a clear performance drop. This confirms that the augmentation strategy is effective in enhancing data diversity, implicitly connecting different semantic IDs associated with the same item, and thereby improving the generalization ability of the GR model.
(3.2) w/o Multi-Facet Generation: If the model is restricted to a single decoding path (a single semantic ID) during inference instead of leveraging multi-facet generation (considering multiple candidate semantic IDs and aggregating their probabilities), performance also drops. This highlights the importance of allowing the GR model to decode multiple potential user interpretations during prediction, reflecting the nuanced nature of user preferences.

6.3. In-depth Analysis

6.3.1. Model Ensemble

To ensure Pctx's improvements are not simply due to combining strengths of existing models, an ensemble analysis was performed. SASRec and DuoRec predictions were ensembled with TIGER using a voting scheme.

The following are the results from Table 3 of the original paper:

Methods	Instrument				Scientific
Methods	Recall@5	Recall@10	NDCG@5	NDCG@10	Recall@5	Recall@10	NDCG@5	NDCG@10
SASRec	0.0333	0.0523	0.0213	0.0274	0.0259	0.0412	0.0150	0.0199
DuoRec	0.0347	0.0547	0.0227	0.0291	0.0234	0.0389	0.0146	0.0196
TIGER	0.0370	0.0564	0.0244	0.0306	0.0264	0.0422	0.0175	0.0226
TIGER+SASRec	0.0374	0.0582	0.0245	0.0311	0.0268	0.0427	0.0169	0.0221
TIGER+DuoRec	0.0376	0.0586	0.0247	0.0314	0.0258	0.0418	0.0163	0.0215
Pctx	0.0419	0.0655	0.0275	0.0350	0.0323	0.0504	0.0205	0.0263

Key Findings:

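On Instrument, the ensembled models (TIGER+SASRec, TIGER+DuoRec) slightly outperform their individual components on all four metrics, suggesting that the different models capture some complementary information. On Scientific the picture is mixed: TIGER+SASRec improves Recall but not NDCG over TIGER alone, and TIGER+DuoRec falls below TIGER on all four metrics.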
However, even the best ensembled results remain significantly below Pctx's performance. This confirms that Pctx is not merely a simple combination of existing models but that its fundamental innovation—personalized semantic IDs—expands the capabilities of GR models in a unique way.

6.3.2. Study of the Number of Personalized Semantic IDs

This analysis focuses on the distribution of personalized semantic IDs per item, illustrated in Figure 3.

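Figure 3: The number of personalized semantic IDs (simplified as SIDs) every item possesses. The image is a comparative bar chart showing the distribution of the number of personalized semantic IDs (SIDs) per item in the Scientific and Instrument datasets, plotted on a logarithmic scale and comparing Pctx with TIGER.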

This bar chart shows the distribution of the number of personalized semantic IDs assigned to each item for Pctx (across two datasets) compared to TIGER. TIGER (static tokenizer) assigns exactly one semantic ID per item, whereas Pctx shows a distribution across multiple semantic IDs.

Key Observations:

Static vs. Personalized: TIGER, as a static tokenizer, assigns only one semantic ID to each item, completely hindering personalization. In contrast, Pctx assigns multiple personalized semantic IDs to the same item.
Distribution: In Pctx, the majority of items are assigned two personalized semantic IDs, followed by one, then three, and a smaller fraction exceeding four.
Single SID Items: Items with only a single semantic ID are typically infrequent or long-tail entities with limited interactions, offering restricted diversity in user interpretations.
Redundancy Management: The number of items with an excessive number of personalized IDs remains small. This is attributed to the redundant semantic ID merging strategy, which effectively consolidates redundant representations and prevents over-personalization and sparsity.

6.3.3. Parameter Analysis (from Appendix D.1)

6.3.3.1. Performance w.r.t. the Augmentation Probability $\gamma$

Figure 5 illustrates how NDCG@10 changes with varying augmentation probability $\gamma$ .

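Figure 5: Two line charts showing how NDCG@10 on the Instrument and Scientific datasets changes with the augmentation probability $\gamma$, reflecting the sensitivity of model performance to $\gamma$.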

The line charts display the NDCG@10 performance on the Instrument and Scientific datasets as the augmentation probability $\gamma$ varies from 0.0 to 0.9.

Key Insights:

Effectiveness of Augmentation: Setting $\gamma = 0$ (disabling data augmentation) results in performance notably worse than most configurations with non-zero $\gamma$ . This validates the effectiveness of the proposed data augmentation strategy in enhancing generalization.
Critical Hyperparameter: $\gamma$ is a critical hyperparameter. Inappropriate settings can lead to significant performance degradation.
Stable Range: Performance remains relatively stable and within an acceptable margin when $\gamma$ is in the range of 0.3 to 0.7.
Extreme Values: Excessively small values of $\gamma$ lead to underwhelming outcomes due to insufficient augmentation. Overly large values introduce instability and may cause extreme performance fluctuations, suggesting a trade-off where too much augmentation can dilute the core personalized signals.

6.3.3.2. Performance w.r.t. the Frequency Threshold $\tau$

Figure 6 analyzes the NDCG@10 performance and the percentage of semantic IDs in use as the frequency threshold $\tau$ varies.

Figure 6: Analysis of performance (NDCG@10, ↑) and the quantity of semantic IDs in use (↓) w.r.t. the frequency threshold $\tau$. Each bar represents the percentage of… The chart shows, for different values of the frequency threshold $\tau$, the NDCG@10 performance and the percentage of semantic IDs in use for the personalized and static tokenizers on two datasets (Instrument and Scientific).

This chart shows the NDCG@10 (lines) and the percentage of utilized semantic IDs (bars, relative to static tokenizer) as frequency threshold $\tau$ increases.

Main Observations:

Semantic ID Count: As $\tau$ increases, the number of utilized semantic IDs decreases monotonically. This is because higher $\tau$ values cause more infrequent semantic IDs to be merged. The total number of semantic IDs does not grow excessively, as most items are low-frequency.
Performance Trend: NDCG@10 and the other evaluation metrics initially improve with increasing $\tau$, then begin to decline once $\tau$ exceeds approximately 0.2. The best performance is observed around $\tau = 0.2$ on both datasets.
Sparsity vs. Personalization:
- An excessive number of personalized semantic IDs (when $\tau$ is too low) leads to poor performance due to sparsity issues.
- While a higher $\tau$ alleviates sparsity by merging infrequent IDs, if it's too high, it sacrifices too much personalization, leading to a decline in performance.
Balance: Varying $\tau$ essentially embodies a crucial balance between sparsity reduction (improving generalizability) and personalization preservation.

6.3.4. Popular and Personalization (from Appendix D.3)

This section investigates the relationship between an item's position in the input sequence and the probability of it being tokenized as its most popular semantic ID. The popular rate is defined as the mean probability of an item being tokenized as its popular semantic ID at a given position.

Figure 7 illustrates this analysis across different models/variants.

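Figure 7: The heatmap illustrating the relationship between the position of an item and the probability of its tokenization as the most popular semantic ID. position is the index of an interaction se… The image shows heatmaps for different model variants across three domains (Instrument, Scientific, Game) of the probability that an item is tokenized as its most popular semantic ID as the position in the interaction sequence changes. The axes show the position index, and color intensity indicates the probability, with lighter colors meaning lower probability.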

The heatmap displays the popular rate (probability of tokenization as the most popular semantic ID) at different sequence positions for various models/variants on three datasets. Lighter colors indicate lower popular rates.

Key Observations:

TIGER (Static Tokenizer): As TIGER uses a static tokenizer, every item is always tokenized into its single, fixed semantic ID (which is, by definition, its popular semantic ID). Thus, the popular rate is consistently 1 (or close to 1, depicted by dark blue/purple) across all sequence positions, independent of context.
w/o Data Augmentation (Pctx with $\gamma=0$ ): In this variant, as the sequence length (position) increases, the probability of tokenizing an item with its popular semantic ID decreases (lighter colors appear). This confirms that as more user context accumulates, the influence of personalized context rises, making it more likely for the item to be tokenized into a more personalized semantic ID that reflects the specific user's evolving preferences, rather than just its general 'popular' interpretation.
With Augmentation Probability $\gamma = 1$: When the augmentation probability $\gamma$ is set to 1, items are equally likely to be tokenized with any of their possible semantic IDs. This leads to a uniform distribution of semantic IDs across the sequence, and consequently, the popular rate is consistently low and uniform across all positions (lightest colors). This variant harms personalization by excessively introducing noise and losing the contextual signal.

These findings strongly support that Pctx's personalized context-aware tokenizer adaptively tokenizes items based on user context, moving beyond static representations and enabling more personalized representations.
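
The popular rate can be estimated empirically as sketched below. This is an illustration under assumptions, not the paper's procedure: an item's "popular" semantic ID is taken to be the one it is tokenized into most often, and the rate at a position is the fraction of tokens at that position that equal their item's popular semantic ID. All names are hypothetical.

```python
from collections import Counter, defaultdict

def popular_rate_by_position(sequences):
    """sequences: list of lists of (item, sid) pairs, one list per user."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for item, sid in seq:
            counts[item][sid] += 1
    # Most frequent semantic ID per item (assumed definition of "popular").
    popular = {item: c.most_common(1)[0][0] for item, c in counts.items()}
    hits, totals = Counter(), Counter()
    for seq in sequences:
        for pos, (item, sid) in enumerate(seq):
            totals[pos] += 1
            hits[pos] += sid == popular[item]
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}

sequences = [
    [("a", "A1"), ("b", "B1")],
    [("a", "A1"), ("b", "B2")],
    [("b", "B1"), ("a", "A2")],
]
rates = popular_rate_by_position(sequences)
```

In the toy data every item at position 0 receives its popular semantic ID, while only one in three does at position 1, mirroring the trend the heatmap shows for the variant without augmentation.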

6.3.5. Explainability (from Appendix D.4 and D.5)

To assess whether the personalized semantic IDs generated by Pctx correspond to human-interpretable user preferences, an explainability experiment was conducted using GPT-4o (a Large Language Model).

The following are the results from Table 6 of the original paper:

Methods	Instrument (Acc.)	Scientific (Acc.)	Game (Acc.)
with SASRec	0.8333	0.8030	0.8240
Pctx	0.8533	0.8534	0.8690

Experimental Design:

Item Selection: Items with at least two personalized semantic IDs were randomly selected.
Preference Summarization: For each selected item, user interaction sequences associated with each of its semantic IDs were grouped. GPT-4o was then used to summarize the underlying user preference for each semantic ID into keywords and a descriptive summary.
Accuracy Assessment: For each selected item, 50 test sequences where the item was the target were randomly sampled. For each sequence, the semantic ID that appeared first in Pctx's prediction list was identified. GPT-4o was then prompted to assess whether the summarized preference for this top-ranked semantic ID aligned better with the sequence context than the preferences of other semantic IDs for the same item. A binary "Yes" or "No" judgment, along with an explanation, was obtained.
Metric: Accuracy was defined as the proportion of "Yes" responses. This process was repeated for 25 items per dataset, totaling 1250 samples per dataset.

Key Findings:

Pctx achieved high accuracy (over 0.85 across all three datasets), indicating that its personalized semantic IDs indeed capture diverse and coherent user preferences in a human-interpretable manner.
The variant with SASRec (using SASRec as the auxiliary model) underperformed Pctx, but still achieved high accuracy (above 0.80). This further reinforces that the quality of context representation impacts the interpretability of the semantic IDs.
The high accuracy demonstrates that Pctx effectively aligns its predictions with these learned preferences, validating the interpretability of its tokenization mechanism.

6.3.5.1. Case Study (from Figure 4 and Appendix D.5)

Figure 4 and the detailed case in Appendix D.5 provide a concrete example of Pctx's personalized tokenization.

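Figure 4: Case Study. The upper row denotes a story-driven game player, while the lower row depicts a real-time strategy game player. The same item StarCraft II is tokenized into different semantic ID… The figure is a schematic showing the different semantic IDs (tokens) assigned to the same item, StarCraft II, for a story-driven game player and a real-time strategy game player, illustrating how the personalized context-aware tokenizer represents one item differently under different user contexts.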

The case study illustrates how Pctx assigns different semantic IDs to the same item, StarCraft II: Heart of the Swarm, based on different user contexts: a story-driven game player (upper row) and a real-time strategy game player (lower row).

Scenario: The item StarCraft II: Heart of the Swarm is a hybrid game, appealing to both story-driven and real-time strategy (RTS) players.

User 1 (Story-driven player): This user's history includes items like Tomb Raider, The Last of Us, Saints Row: The Third (emphasizing narrative, adventure). For this user, Pctx tokenizes StarCraft II into the semantic ID [53, 395, 576, 770].
User 2 (RTS player): This user's history includes items like Warcraft II, Command & Conquer, Company of Heroes (emphasizing strategy, management, competitive gameplay). For this user, Pctx tokenizes StarCraft II into the semantic ID [53, 412, 576, 770].

GPT-4o Explainability Example: An example of the GPT-4o prompt and response is provided for a user whose historical interactions primarily consist of RTS games (Command & Conquer, Company of Heroes, World in Conflict).
Top-ranked Semantic ID's Preference: Keywords like "Gaming, RTS, Adventure, Strategy, Multiplayer, Fantasy, Competitive, Role-playing, Decision-making, Management" with a summary emphasizing competitive RTS and strategizing.
Other Semantic ID's Preference: Keywords like "Adventure, Narrative, Multiplayer, Open-world, Action, Fantasy, Survival, Shooter, Strategy, Customization" with a summary emphasizing narrative-driven and open-world experiences.
GPT-4o's Judgment: Yes. The LLM confirms that the historical sequence strongly aligns with the top-ranked semantic ID's preference for RTS games, confirming Pctx's ability to adaptively tokenize and predict based on context.
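
As a rough illustration of how yes/no judgments like the one above could be aggregated into the accuracy figure discussed in this section, here is a minimal sketch. It assumes accuracy is simply the fraction of "Yes" responses; that definition, the function name `judgment_accuracy`, and the example responses are assumptions made for illustration, not details or results taken from the paper.

```python
def judgment_accuracy(judgments):
    """Fraction of LLM responses that begin with 'yes' (case-insensitive).

    Assumption: each response is a free-text string starting with Yes or No.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    hits = sum(1 for j in judgments if j.strip().lower().startswith("yes"))
    return hits / len(judgments)


# Hypothetical responses for illustration only; NOT data from the paper.
example_judgments = ["Yes", "No", "yes.", "Yes"]
accuracy = judgment_accuracy(example_judgments)
print(accuracy)  # 0.75
```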

This case study vividly demonstrates Pctx's capability to adaptively tokenize the same action (StarCraft II) into distinct personalized semantic IDs under different user contexts, thereby enabling the GR model to produce more user-specific and interpretable predictions.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Pctx, a pioneering personalized context-aware tokenizer designed for Generative Recommendation (GR) models. Pctx addresses a critical limitation of existing GR approaches, which typically rely on static, non-personalized action tokenization, implicitly assuming a universal standard of item similarity. By integrating a user's historical interactions into the semantic ID generation process, Pctx allows the same item to be represented by different semantic IDs under varying user contexts. This novel design effectively captures the diverse interpretations users may have for an item, enhancing the GR model's ability to generate truly personalized predictions. The method achieves a crucial balance between generalizability and personalization through strategies like adaptive clustering, redundant semantic ID merging, and data augmentation. Extensive experiments on three public datasets demonstrate Pctx's superior performance, yielding up to an 11.44% improvement in NDCG@10 over non-personalized tokenization baselines. This work is significant as it represents the first successful attempt to introduce a personalized action tokenizer within the Generative Recommendation paradigm, paving the way for more nuanced and user-centric recommendation systems.

7.2. Limitations & Future Work

The authors identify several directions for future research:

Scaling Effective Semantic IDs: Investigating approaches for scaling the generation and management of effective semantic IDs within a broader semantic ID space remains an open challenge. As the number of items and potential interpretations grows, efficiently handling and learning from a vast array of personalized semantic IDs will be crucial.
End-to-End Personalized Action Tokenizers: Developing fully end-to-end personalized action tokenizers is another future goal. Currently, Pctx relies on an auxiliary model (DuoRec) for context encoding and a separate RQ-VAE for quantization. An end-to-end learning framework could potentially optimize the entire tokenization process more cohesively.
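
For readers unfamiliar with the quantization stage mentioned above, the following is a minimal sketch of residual quantization, the nearest-codeword lookup that an RQ-VAE applies level by level. It is not the authors' implementation: the codebooks here are tiny, hand-written, and hypothetical, whereas a real RQ-VAE learns them jointly with an encoder and decoder. The function names `nearest` and `residual_quantize` are likewise illustrative.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(c):
        return sum((ci - vi) ** 2 for ci, vi in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the leftover residual.

    Returns the list of chosen codeword indices (one token per level) and the
    residual that remains after the last level.
    """
    tokens = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


# Toy codebooks (hypothetical; real codebooks are learned, not hand-written).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # level 1: coarse codewords
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # level 2: finer corrections
]
tokens, residual = residual_quantize([1.2, 1.0], toy_codebooks)
print(tokens)  # [1, 1]
```

Because the context encoder and the quantizer are trained separately in the pipeline described above, the quantizer never receives a gradient from the recommendation loss, which is the gap an end-to-end tokenizer would aim to close.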

7.3. Personal Insights & Critique

The Pctx paper presents a highly innovative and necessary advancement in Generative Recommendation. The core idea of moving beyond static item representations to context-aware, personalized semantic IDs is a fundamental shift that significantly enhances the capabilities of GR models.

Novelty and Impact: The paper's primary contribution, introducing personalized context into action tokenization, is genuinely novel for GR. Previous "context-aware" tokenizers were limited to local context, and "multi-identifier" approaches lacked true personalization. Pctx's comprehensive approach, considering full user history, allows for a more nuanced understanding of user intent and item interpretation. This has a direct impact on recommendation quality, as evidenced by the strong experimental results. The improvements, especially in NDCG@10, are compelling.
Balancing Act: The explicit focus on balancing generalizability and personalizability is a strong point. The strategies for context condensation, semantic ID merging, and data augmentation demonstrate a practical understanding of the trade-offs involved in managing a dynamic token vocabulary. Without these, over-personalization could lead to extreme sparsity, rendering the system ineffective.
Explainability: The explainability experiment using GPT-4o is a particularly interesting and forward-looking aspect. While the rigor of LLM-based evaluation in this context is still a nascent area of research, it provides compelling qualitative evidence that Pctx's semantic IDs correspond to meaningful and distinct user preferences. This aligns with the broader trend of making AI systems more transparent and understandable. The detailed case study further strengthens this point by visually demonstrating the adaptive tokenization.
Potential Areas for Improvement/Future Research (Beyond Authors' Scope):
- Dynamic Context Window: While Pctx uses the "entire user interaction history," the maximum sequence length is capped at 20 items. Investigating more dynamic or adaptive context windows that can weigh recent interactions differently from older ones, or dynamically select relevant past interactions, could further refine personalization without increasing computational burden excessively.
- Computational Cost: Generating personalized semantic IDs involves training an auxiliary model, clustering, and then quantizing. The computational overhead of these steps, especially for very large item catalogs and complex histories, could be a practical consideration for deployment. Future work could explore more efficient approximation techniques.
- Fine-grained Personalization: The current approach clusters context representations into a fixed number of centroids per group. Perhaps a more fine-grained or hierarchical approach to context representation could capture even subtler personalized facets, especially for highly complex or multi-modal items.
- Unified Learning Objective: The current framework has distinct steps (context encoding, semantic ID generation, GR model training). While RQ-VAE is learnable, an end-to-end system where the semantic ID generation is jointly optimized with the recommendation task might yield further benefits.
Transferability: The methods proposed in Pctx could potentially be transferred to other domains where contextual interpretation of discrete entities is important. For example, in knowledge graph completion or relation extraction, where the "meaning" of an entity or relation might depend on its surrounding context within a specific graph path or query.
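
The "Dynamic Context Window" suggestion above can be sketched as follows. This is one possible reading of that critique, not part of Pctx: histories are truncated to the 20 most recent items (the cap mentioned above), and the remaining item embeddings are averaged with exponentially decaying recency weights. The function names, the `half_life` parameter, and the use of a weighted average are all assumptions made for illustration.

```python
MAX_LEN = 20  # sequence-length cap mentioned in the critique above


def truncate_history(history, max_len=MAX_LEN):
    """Keep only the most recent max_len interactions (input is oldest first)."""
    return history[-max_len:]


def recency_weights(n, half_life=5.0):
    """Normalized weights that halve every half_life steps into the past."""
    raw = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_context(embeddings, half_life=5.0):
    """Recency-weighted average of per-item embedding vectors (hypothetical)."""
    if not embeddings:
        raise ValueError("history must contain at least one item")
    recent = truncate_history(embeddings)
    weights = recency_weights(len(recent), half_life)
    dim = len(recent[0])
    return [sum(w * e[d] for w, e in zip(weights, recent)) for d in range(dim)]


# Two one-dimensional toy embeddings; the newer item gets twice the weight.
context = weighted_context([[0.0], [1.0]], half_life=1.0)
```

With `half_life=1.0` the older item receives weight 1/3 and the newer item 2/3, so the resulting context value is 2/3. A learned attention mechanism could replace the fixed decay, at the cost of extra parameters.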

Overall, Pctx is a robust and timely contribution to Generative Recommendation, demonstrating a clear path forward for achieving truly personalized and interpretable recommendations by revolutionizing the foundational tokenization step.
