Paper status: completed

Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation

Published: 09/11/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces LAMIA, a novel multi-aspect semantic tokenization framework that enhances generative recommendation systems. Unlike traditional RQ-VAE-based methods, it learns independent embeddings that capture multiple facets of each item, significantly improving recommendation accuracy over existing ID-based and semantic-code-based methods.

Abstract

Traditional recommendation models often rely on unique item identifiers (IDs) to distinguish between items, which can hinder their ability to effectively leverage item content information and generalize to long-tailed or cold-start items. Recently, semantic tokenization has been proposed as a promising solution that aims to tokenize each item's semantic representation into a sequence of discrete tokens. These semantic tokens have become fundamental in training generative recommendation models. However, existing methods typically rely on RQ-VAE, a residual vector quantizer, for semantic tokenization. This reliance introduces several key limitations, including challenges in embedding extraction, hierarchical coarse-to-fine quantization, and training stability. To address these issues, we introduce LAMIA, a novel approach for multi-aspect semantic tokenization. Unlike RQ-VAE, which uses a single embedding, LAMIA learns an ``item palette''--a collection of independent and semantically parallel embeddings that capture multiple aspects of items. Additionally, LAMIA enhances the semantic encoders through domain-specific tuning using text-based reconstruction tasks, resulting in more representative item palette embeddings. We have conducted extensive experiments to validate the effectiveness of the LAMIA framework across various recommendation tasks and datasets. Our results demonstrate significant improvements in recommendation accuracy over existing methods. To facilitate reproducible research, we will release the source code, data, and configurations.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is a novel semantic tokenization framework called LAMIA (Learning Multi-Aspect Item Palette) for generative recommendation systems. It aims to improve how items are represented by capturing their multifaceted semantic information, especially for long-tailed or cold-start items, thereby enhancing recommendation accuracy.

1.2. Authors

The authors are:

  • Qijiong Liu: The HK PolyU, Hong Kong, China (liu@qijiong.work)

  • Jieming Zhu: Huawei Noah's Ark Lab, Shenzhen, China (jiemingzhu@ieee.org)

  • Zhaocheng Du: Huawei Noah's Ark Lab, Shenzhen, China (zhaochengdu@huawei.com)

  • Lu Fan: The HK PolyU, Hong Kong, China (cslfan@comp.polyu.edu.hk)

  • Zhou Zhao: Zhejiang University, Hangzhou, China (zhaozhou@zju.edu.cn)

  • Xiao-Ming Wu: The HK PolyU, Hong Kong, China (xiao-ming.wu@polyu.edu.hk)

    Their affiliations suggest a research background in academia (The Hong Kong Polytechnic University, Zhejiang University) and industrial research (Huawei Noah's Ark Lab), indicating expertise in artificial intelligence, machine learning, and recommender systems.

1.3. Journal/Conference

The paper's ACM reference format still contains the template placeholder for the venue: "Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX'). ACM, New York, NY, USA, 11 pages." In other words, the target conference is not yet specified. The ACM (Association for Computing Machinery) is a highly reputable and influential organization in computer science, and publication in an ACM conference would signify a high-quality, peer-reviewed contribution to the field.

1.4. Publication Year

The ACM reference format lists 2018, but the Published at (UTC) metadata indicates 2024-09-11T13:49:48.000Z. Given that the content references recent models such as LLaMA (2023) and OPT-350M (2022), the 2018 date is a template placeholder or error; the actual publication year is almost certainly 2024.

1.5. Abstract

Traditional recommendation models struggle with item content utilization and generalizing to rare items due to their reliance on unique item identifiers (IDs). Semantic tokenization, which converts item semantic representations into discrete token sequences, offers a solution and is crucial for generative recommendation models. However, current methods largely depend on RQ-VAE (residual vector quantizer), leading to issues like difficult embedding extraction, hierarchical quantization, and training instability. This paper introduces LAMIA, a novel multi-aspect semantic tokenization framework. Unlike RQ-VAE's single embedding approach, LAMIA learns an "item palette"—a collection of independent, semantically parallel embeddings capturing multiple item facets. It further improves semantic encoders through domain-specific tuning using text-based reconstruction. Extensive experiments confirm LAMIA's effectiveness, showing significant accuracy improvements across various recommendation tasks and datasets. The authors plan to release code and data for reproducibility.

Official Source Link: https://arxiv.org/abs/2409.07276 PDF Link: https://arxiv.org/pdf/2409.07276v3.pdf Publication Status: This is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies in the limitations of traditional recommendation models, particularly sequential recommenders. These models often rely on unique item identifiers (IDs), which present several key challenges:

  1. Overfitting and Sparsity: ID-based representations can easily overfit, especially with sparse and imbalanced training data, a common issue in real-world recommendation scenarios.

  2. Limited Content Leverage: They fail to effectively utilize rich item content information (e.g., text descriptions, images), which is vital for improving recommendations, particularly for long-tailed (infrequently interacted with) and cold-start (new) items. Without content, these items lack sufficient interaction data to learn robust ID embeddings.

    To address these limitations, semantic tokenization has emerged as a promising solution. This approach converts an item's semantic representation (derived from its content) into a sequence of discrete tokens. These semantic tokens can be shared across different items, allowing for a more nuanced understanding of item similarity and better generalization. They also form the foundation for generative recommendation models, where the goal shifts from predicting a specific ID to generating a sequence of semantic tokens for the next recommended item.

However, existing semantic tokenization methods predominantly rely on RQ-VAE (Residual Vector Quantized Variational Autoencoder), which, despite its initial promise, introduces its own set of limitations:

  1. Single Aspect Focus: RQ-VAE primarily captures a dominant semantic aspect of an item, with subsequent layers merely refining details of this primary aspect. This falls short in representing items with complex, multifaceted characteristics. For instance, a news article might be about "science," but also have a "political" or "environmental" angle, which RQ-VAE might miss if it's not the primary aspect.

  2. Training Instability: RQ-VAE training is known to be sensitive and prone to code collapse, where multiple input embeddings map to the same codebook entry, leading to a loss of representational diversity.

  3. Domain Mismatch: Existing methods often use embeddings directly from pre-trained generic encoders (like LLMs) for quantization without domain-specific tuning. This can lead to semantic identifiers that are not representative of the specific recommendation domain, resulting in information loss.

    The paper's entry point is to critically re-evaluate and refine the standard semantic tokenization framework by proposing a novel approach that explicitly handles the multi-aspect nature of items and addresses the training challenges of RQ-VAE.

2.2. Main Contributions / Findings

The paper introduces LAMIA (Learning Multi-Aspect Item Palette), a novel framework for multi-aspect semantic tokenization, offering several primary contributions:

  1. Multi-Aspect Item Palette: LAMIA proposes learning an "item palette"—a collection of independent and semantically parallel embeddings—for each item. Unlike RQ-VAE's hierarchical, single-aspect quantization, LAMIA's palette captures multiple distinct facets of an item's content simultaneously and with equal weight, leading to a richer and more comprehensive item representation. This directly addresses the RQ-VAE limitation of focusing on only one dominant aspect.

  2. Domain-Specific Tuning with Text-Based Reconstruction: LAMIA enhances semantic encoders through domain-specific tuning. It uses text-based reconstruction tasks (e.g., reconstructing item title, abstract, or category from the item palette) rather than embedding-based reconstruction. This approach minimizes information loss, adapts the LLM to the specific recommendation domain, and ensures that the learned palette embeddings are more representative and relevant to the content.

  3. Enhanced Training Stability with Contrastive Learning: To ensure the independence and mutual exclusivity of the multi-aspect embeddings within the item palette, LAMIA incorporates intra-palette contrastive loss. Additionally, an inter-palette contrastive loss is used to mitigate the code collision problem, encouraging distinct semantic identifiers across different items. These contrastive tasks lead to more stable and meaningful palette embeddings.

  4. Simplified and Training-Free Quantization: By learning independent multi-aspect embeddings, LAMIA can utilize simple, training-free clustering algorithms (like K-Means) for quantization, bypassing the complex, sensitive, and prone-to-collapse training process associated with RQ-VAE. This makes the tokenization process more robust and efficient.

  5. Significant Performance Improvements: Extensive experiments across various recommendation tasks and datasets (MIND, Amazon CDs, H&M) demonstrate that LAMIA significantly outperforms existing RQ-VAE-based and ID-based recommendation methods in terms of Recall and NDCG metrics. This validates the effectiveness of the proposed framework in real-world scenarios.

    In summary, LAMIA provides a new, more robust, and semantically richer way to tokenize items for generative recommendation, addressing key limitations of prior approaches and leading to superior recommendation performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp several fundamental concepts in recommender systems and natural language processing:

  • Recommender Systems (RS): Software systems that suggest items (products, movies, news articles, etc.) to users based on their preferences, past behavior, and/or item characteristics. The goal is to personalize user experience and increase engagement.
  • Sequential Recommendation: A sub-field of recommender systems that focuses on predicting the next item a user will interact with, given their historical sequence of interactions. This is crucial in dynamic contexts like e-commerce or streaming.
  • Item Identifiers (IDs): In traditional recommender systems, each item is assigned a unique numerical identifier. These IDs are then typically mapped to dense, learnable embedding vectors (numerical representations) that capture the item's features and relationships within the system.
  • Embedding Vector: A dense, low-dimensional numerical representation of discrete items (like IDs, words, users, items) that captures their semantic and relational properties. Items with similar properties are expected to have similar embedding vectors.
  • Long-tailed Items: Items that are interacted with very infrequently. They reside in the "long tail" of the popularity distribution.
  • Cold-start Items: New items that have no or very few interactions in the system. Recommending these items effectively is challenging because there isn't enough historical data to learn their ID embeddings or user preferences for them.
  • Semantic Tokenization: The process of converting an item's semantic representation (derived from its content, e.g., text description, images) into a sequence of discrete tokens (like words in a sentence, but representing item properties). These tokens are shared across items, allowing for a more generalized and content-aware representation than unique IDs.
  • Generative Recommendation: A paradigm where the recommender system directly generates a sequence of items or semantic tokens that constitute a recommendation, rather than just ranking existing items. This often leverages techniques from natural language generation.
  • RQ-VAE (Residual Vector Quantized Variational Autoencoder): A type of Variational Autoencoder (VAE) that incorporates Residual Vector Quantization (RVQ).
    • Vector Quantization (VQ): A process that maps continuous input vectors to a finite set of discrete codebook vectors. Each input vector is replaced by the closest codebook vector.
    • Residual Vector Quantization: Instead of quantizing an embedding once, RVQ quantizes the residual (the difference between the original embedding and its quantized version) in multiple layers. This allows for a hierarchical, coarse-to-fine quantization. The first layer captures broad features, and subsequent layers add finer details.
    • Variational Autoencoder: A type of neural network that learns a compressed, probabilistic representation (latent space) of its input data. It consists of an encoder (maps input to latent space) and a decoder (reconstructs input from latent space). RQ-VAE replaces the continuous latent space with a discrete one using VQ.
    • Codebook Collapse: A common problem in VQ-based models where only a small subset of the codebook vectors are actively used during training, leading to a loss of representational capacity.
  • Large Language Models (LLMs): Very large neural networks trained on vast amounts of text data, capable of understanding, generating, and processing human language. Examples include GPT models, LLaMA, and OPT. They often use the Transformer architecture.
  • Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling. Key components include:
    • Self-Attention Mechanism: Allows the model to weigh the importance of different parts of the input sequence when processing each element. The core formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
      • $QK^T$ calculates the similarity between queries and keys.
      • $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax into regions with tiny gradients; $d_k$ is the dimension of the key vectors.
      • $\mathrm{softmax}$ normalizes the scores into probability distributions.
      • The result is a weighted sum of the Value vectors, where the weights are the attention scores. (A minimal code sketch of this operation appears at the end of this list of concepts.)
    • Decoder-Only Transformer: A variant of the Transformer architecture that only uses the decoder blocks. These models are typically used for generative tasks where they predict the next token in a sequence, only attending to previous tokens (causal attention).
  • K-Means Clustering: An unsupervised machine learning algorithm used to partition $N$ observations into $K$ clusters. It aims to minimize the sum of squared distances between each point and the centroid of its assigned cluster. It is "training-free" in the sense that it does not involve gradient-based optimization like neural networks but rather an iterative assignment and update process.
  • PCA (Principal Component Analysis): A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It's often used for dimensionality reduction to simplify data while retaining as much variance as possible.
  • Recall: An evaluation metric that measures the proportion of relevant items that are successfully retrieved (recommended) by the system out of all relevant items. $ \text{Recall}@K = \frac{\text{Number of relevant items in top-K recommendations}}{\text{Total number of relevant items}} $ Where:
    • $K$ is the number of top recommendations considered.
    • "Relevant items" usually refers to items the user actually interacted with in the test set.
  • NDCG (Normalized Discounted Cumulative Gain): An evaluation metric used for ranking tasks, which considers the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher up in the list. $ \text{DCG}@K = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $ $ \text{IDCG}@K = \sum_{i=1}^{K} \frac{2^{\text{rel}_{\text{ideal},i}} - 1}{\log_2(i+1)} $ $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $ Where:
    • $K$ is the number of top recommendations considered.
    • $\text{rel}_i$ is the relevance score of the item at position $i$ in the recommended list (typically 1 if relevant, 0 if not).
    • $\text{rel}_{\text{ideal},i}$ is the relevance score of the item at position $i$ in the ideal (perfect) recommendation list, ordered by relevance.
    • $\text{DCG}$ (Discounted Cumulative Gain) accumulates relevance scores, discounting items at lower positions.
    • $\text{IDCG}$ (Ideal Discounted Cumulative Gain) is the maximum possible DCG for a given list of relevant items, used for normalization.
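
To make the self-attention formula referenced above concrete, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask. The toy shapes and random inputs are purely illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity matrix
    if causal:
        # Causal mask: position i may only attend to positions <= i.
        n_q, n_k = scores.shape
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)          # attention weights per query
    return weights @ V                          # weighted sum of the Value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings (self-attention: Q = K = V = X).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X, causal=True)
print(out.shape)  # (4, 8)
```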

3.2. Previous Works

The paper contextualizes its contribution by referencing several prior works, particularly in the domain of generative recommendation and semantic tokenization.

  • Traditional ID-based Recommender Systems: Models like SASRec [14] and BERT4Rec [33] represent items using unique IDs. While effective for sequence modeling, they suffer from cold-start and long-tail issues and cannot leverage item content directly. GRU4Rec [9] and Caser [34] are other examples of ID-based sequential recommenders. P5 [6] is an LLM-based recommender that also often represents items with IDs or simple token representations.
  • Semantic Tokenization using RQ-VAE:
    • TIGER [29]: One of the pioneering works that introduced semantic identifiers to replace unique item IDs. It uses SentenceBERT as an embedder and RQ-VAE as the quantizer, feeding these tokens into a Transformer for generative recommendation. TIGER's core idea is to shift from next-item ID prediction to next-code prediction.
    • LC-Rec [46]: Builds upon TIGER by using a larger LLM (Llama1-7B) as both embedder and recommender, still relying on RQ-VAE for quantization. It adds an alignment task to better integrate collaborative knowledge.
    • CoST [47]: Also uses SentenceT5 with RQ-VAE and a Transformer recommender. It focuses on contrastive quantization to improve token quality.
    • LETTER [38]: Leverages Llama1-7B for embedding and RQ-VAE for tokenization, with a Transformer recommender. It also incorporates collaborative features.
    • TokenRec [28]: Uses MQ-VAE (a variant of RQ-VAE) for quantization and LightGCN for collaborative embeddings, with Llama1-7B as the recommender. It also includes an alignment task.
  • Semantic Tokenization using K-Means or other clustering:
    • EAGER [39]: Employs SentenceT5 for content embedding and DIN for collaborative features, using K-Means clustering for quantization. A key distinction is that EAGER incorporates both content and behavioral knowledge into its identifiers, which the authors note makes direct comparison potentially inequitable for content-only tokenization methods. EAGER also uses hierarchical K-Means, which is a form of hierarchical clustering.

Crucial Formula for Understanding Semantic Tokenization (RQ-VAE context): While the paper discusses RQ-VAE conceptually, it does not provide its full mathematical formulation. For a beginner, understanding the quantization process is key. The core idea of Vector Quantization (VQ) is to map an input vector $x \in \mathbb{R}^D$ to a discrete codebook entry $e_k \in \mathbb{R}^D$ from a codebook $\mathcal{C} = \{e_1, \dots, e_K\}$, where $e_k$ is the codebook vector closest to $x$. This selection is done with an argmin operation: $ z_q = e_k \quad \text{where} \quad k = \mathrm{argmin}_j \|x - e_j\|_2 $ Where:

  • $x$ is the input embedding (e.g., from an LLM).

  • $e_j$ is the $j$-th codebook vector in the codebook $\mathcal{C}$.

  • $z_q$ is the quantized output vector.

  • $\| \cdot \|_2$ denotes the L2 norm (Euclidean distance).

    RQ-VAE extends this by iteratively quantizing the residual. If $x$ is the original embedding:

  1. Quantize $x$ to $z_{q,1}$ using codebook $\mathcal{C}_1$.
  2. Calculate the residual $r_1 = x - z_{q,1}$.
  3. Quantize $r_1$ to $z_{q,2}$ using codebook $\mathcal{C}_2$.
  4. And so on, for $L$ layers. The final token sequence corresponds to the indices of the selected codebook vectors from each layer.
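
The residual quantization recursion above can be sketched in a few lines of NumPy. Note that this only illustrates the inference-time codebook lookup with fixed (random) codebooks; in RQ-VAE the codebooks are learned jointly with the encoder and decoder, which is exactly the training process LAMIA avoids.

```python
import numpy as np

def vector_quantize(x, codebook):
    # Return the index and vector of the nearest codebook entry (L2 distance).
    dists = np.linalg.norm(codebook - x, axis=1)
    k = int(np.argmin(dists))
    return k, codebook[k]

def residual_quantize(x, codebooks):
    # Quantize the residual layer by layer (coarse-to-fine).
    indices, residual = [], x.copy()
    for codebook in codebooks:
        k, z_q = vector_quantize(residual, codebook)
        indices.append(k)
        residual = residual - z_q
    return indices  # one token (codebook index) per layer

# Toy usage: a 16-dim embedding and 3 layers of 256-entry codebooks.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]
print(residual_quantize(x, codebooks))  # three codebook indices, one per layer
```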

3.3. Technological Evolution

The field of recommender systems has evolved from simple collaborative filtering (relying on user-item interaction patterns) to more sophisticated content-based and hybrid approaches.

  1. ID-based Models: Early models primarily used item IDs and user IDs, learning embeddings for them. Techniques like matrix factorization, neural collaborative filtering, and later sequential recommenders (SASRec, BERT4Rec) emerged to capture interaction patterns. However, their reliance on IDs limited their ability to handle cold-start items and incorporate rich item content.
  2. Content-based Models & Embeddings: The rise of deep learning, particularly LLMs and Computer Vision models, allowed for extracting rich content embeddings from text, images, and other modalities. These embeddings could then be used in content-based filtering or combined with ID embeddings in hybrid systems.
  3. Semantic Tokenization: As LLMs became more powerful, the idea of semantic tokenization gained traction. Instead of just using content embeddings directly, converting them into discrete, shareable tokens offered advantages similar to word tokens in natural language, enabling generative capabilities and better generalization. RQ-VAE became a popular tool for this, but its hierarchical nature and training difficulties presented new challenges.
  4. LLMs for Recommendation (General): Beyond tokenization, LLMs are being directly applied in recommendation through various paradigms:
    • Pre-training: Developing foundational recommendation models by pre-training LLMs on diverse user behaviors (e.g., PITM [40], M6 [3], P5 [6]).

    • Prompting: Using LLMs to infer user preferences or item knowledge without parameter updates, often via feature augmentation (e.g., Xi et al. [42], Wang et al. [37]).

    • Fine-tuning: Adapting existing powerful LLMs to specific recommendation tasks, sometimes using parameter-efficient fine-tuning (PEFT) techniques like LoRA [10] (e.g., Friedman et al. [5], Shen et al. [31]).

      LAMIA fits into this evolution by pushing the boundaries of semantic tokenization. It acknowledges the power of LLMs for content understanding but critiques the conventional RQ-VAE approach, proposing a more robust and multi-faceted tokenization framework that addresses its limitations, particularly for generative recommendation.

3.4. Differentiation Analysis

Compared to the main methods in related work, LAMIA introduces several core differences and innovations:

  1. Multi-Aspect vs. Hierarchical Single-Aspect Tokenization:

    • Previous (RQ-VAE-based): Methods like TIGER, LC-Rec, CoST, and LETTER rely on RQ-VAE. RQ-VAE uses a hierarchical process where the first quantization layer captures a dominant semantic aspect, and subsequent layers refine the residual of that initial quantization. This means later tokens are dependent on and merely add detail to the primary aspect. As illustrated in Figure 2, this can lead to information loss if an item has multiple equally important, yet distinct, aspects (e.g., a news article about "science" and "environment" might only be categorized under "science" with RQ-VAE).
    • LAMIA: Introduces an "item palette" – a collection of independent, mutually exclusive, equally weighted, and semantically parallel embeddings. Each embedding in the palette aims to capture a distinct facet of the item's content. This explicitly addresses the multi-faceted nature of items, preventing the loss of secondary but important semantic information.
  2. Text-Based Reconstruction for Domain Adaptation vs. Embedding-Based Reconstruction:

    • Previous (RQ-VAE-based): Typically quantize embeddings directly from pre-trained LLMs (e.g., SentenceBERT, LLaMA). The reconstruction task (if any) is usually embedding-based, trying to reconstruct the original dense embedding from the quantized codes. This approach assumes the pre-trained LLM embeddings are perfectly aligned with the recommendation domain and susceptible to information loss due to data distribution shifts.
    • LAMIA: Employs domain-specific tuning using text-based reconstruction tasks. Instead of reconstructing an embedding, LAMIA's LLM is trained to reconstruct item attributes (e.g., title, abstract, category) from the learned item palette. This ensures the LLM adapts its semantic encoders to the specific domain, making the palette embeddings more representative and minimizing information loss at the text level.
  3. Simple Clustering for Quantization vs. Differentiable Vector Quantization:

    • Previous (RQ-VAE-based): Relies on differentiable vector quantization, which is notoriously sensitive to training, prone to code collapse, and computationally intensive.
    • LAMIA: Once the multi-aspect item palette is learned through generative and contrastive tasks, LAMIA quantizes these dense vectors into discrete semantic codes using simple, training-free clustering algorithms like K-Means. This completely bypasses the stability and training challenges associated with RQ-VAE. EAGER also uses K-Means, but in a hierarchical fashion and often with collaborative features, while LAMIA applies it to independent, parallel aspects.
  4. Contrastive Learning for Palette Quality:

    • Previous: While CoST uses contrastive quantization, LAMIA introduces specific intra-palette and inter-palette contrastive losses.
    • LAMIA:
      • Intra-palette contrastive loss ensures that the embeddings within a single item's palette are maximally independent and mutually exclusive, preventing redundancy.

      • Inter-palette contrastive loss addresses code collision across different items, making sure distinct items have distinct semantic identifiers. This significantly improves the quality and distinctiveness of the learned semantic tokens.

        In essence, LAMIA moves from a single, hierarchically refined item representation to a truly multi-faceted, independent representation, trained to be domain-adaptive and robustly quantized, thereby overcoming key limitations of prior semantic tokenization approaches for generative recommendation.

4. Methodology

4.1. Principles

The core idea behind LAMIA is to move beyond the limitations of RQ-VAE based semantic tokenization by explicitly learning a multi-aspect representation for each item, which the authors term an "item palette." Instead of a single embedding that is hierarchically quantized, LAMIA aims to capture multiple, independent, and semantically parallel facets of an item's content. The theoretical basis and intuition are that real-world items are inherently complex and multifaceted. A news article might simultaneously be about "politics," "economy," and "social issues," or a product might have "material," "functionality," and "style" aspects. A single, dominant-aspect representation, as provided by RQ-VAE, inevitably loses information.

LAMIA's approach allows for a richer, more comprehensive semantic representation that can then be easily discretized into multiple semantic codes without the training instabilities of RQ-VAE. This is achieved by:

  1. Domain-Adaptive Tuning: Leveraging a decoder-only Large Language Model (LLM) and tuning it with domain-specific text-level reconstruction tasks. This ensures the LLM learns to encode features relevant to the recommendation context.

  2. Multi-Aspect Palette Learning: Training the LLM to compress variable-length item content into a fixed-size "item palette" (a collection of dense embeddings).

  3. Palette Independence: Enforcing independence and mutual exclusivity among the palette embeddings through contrastive learning tasks. This ensures each part of the palette captures a distinct semantic aspect.

  4. Training-Free Quantization: Once the robust, multi-aspect dense palette embeddings are learned, they are discretized using simple clustering algorithms (like K-Means), avoiding the challenges of differentiable vector quantization.

    The overarching principle is to create semantic identifiers that are more expressive, stable, and domain-relevant, leading to improved generative recommendation.

4.2. Core Methodology In-depth (Layer by Layer)

The LAMIA framework is designed to overcome the limitations of RQ-VAE by learning a multi-aspect item palette. It comprises a specialized architecture, self-supervised learning objectives, and a simple quantization strategy. The process can be broken down into several key components:

4.2.1. Architecture of LAMIA

LAMIA is compatible with any decoder-only Large Language Model (LLM) and utilizes a block-wise input scheme along with hierarchical attention masking. This architecture, inspired by gisting frameworks that condense prompts, allows the LLM to compress variable-length text content into fixed-length item palettes.

The input sequence for an item X\mathbb{X} is structured into four distinct blocks:

  1. Content Block: This block contains selected attributes of the item, such as text descriptions. For an item with $m$ attributes $\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_m$, the content block uses the first $r \le m$ attributes: $ \langle \mathrm{content} \rangle = [\mathbf{a}_1; \mathbf{a}_2; \dots; \mathbf{a}_r] $ Where:

    • $[\cdot\,; \cdot]$ denotes concatenation of the attributes.
    • $\mathbf{a}_i$ represents the $i$-th attribute of the item (e.g., title, abstract). For instance, if an article has a title, abstract, and category ($m=3$), the content block might use the title and abstract ($r=2$), forming a sequence like "title: Yellowstone tourist injured. abstract: A tourist suffered severe burns..".
  2. Learnable Palette Block (LaP): This block consists of $L$ predefined special tokens. These tokens are crucial for learning the multi-aspect representation. $ \langle \mathrm{LaP} \rangle = [\langle \mathrm{LaP}_1 \rangle, \langle \mathrm{LaP}_2 \rangle, \langle \mathrm{LaP}_3 \rangle] $ Where:

    • $\langle \mathrm{LaP}_i \rangle$ denotes the $i$-th learnable palette token.
    • $L$ is the predefined palette size (e.g., $L=3$ as shown in Figure 3 for simplicity, but $L=4$ in the experiments).
    • These tokens have learnable embeddings that are randomly initialized and updated during training to integrate item content via the Transformer network. The output embeddings corresponding to these LaP tokens after processing by the LLM form the Item Palette.
  3. Learned Palette Block (LdP): This block serves as a placeholder during the initial input construction. It mirrors the LaP block in length but initially contains generic placeholder tokens. $ \langle \mathrm{LdP} \rangle = [\langle \mathrm{LdP}_1 \rangle, \langle \mathrm{LdP}_2 \rangle, \langle \mathrm{LdP}_3 \rangle] $ Where:

    • $\langle \mathrm{LdP}_i \rangle$ denotes the $i$-th placeholder token. This block is later replaced with the actual output embeddings from the LaP block before processing by subsequent Transformer layers. This mechanism allows the task block to leverage the learned item palette for text generation.
  4. Task Block: This block concludes the sequence and defines the specific generative task. $ \langle \mathrm{task} \rangle = [t_i; a_i] $ Where:

    • $t_i$ denotes a task token that specifies the goal (e.g., reconstruct the title, predict the category).
    • $a_i$ is the answer sequence for that task (e.g., the actual title text, the category name). The task is either reconstruction (when $0 < i \le r$, for attributes used in the content block) or prediction (when $r < i \le m$, for unseen attributes). An example answer for a category-prediction task is "category: travel".

Hierarchical Attention Masking: Decoder-only LLMs typically use causal attention masks, meaning each token can only attend to itself and preceding tokens. However, for LAMIA, the goal is to have the item palette fully influence the task output. To achieve this, LAMIA implements a hierarchical attention masking scheme (illustrated in Figure 4).

  • Inner-block masking: Within each block (e.g., content, LaP), causal attention is maintained to preserve sequential knowledge.
  • Inter-block masking: This defines attention patterns between blocks:
    • The content block fully attends to the learnable palette block (LaP). This means information from the item's content flows into the LaP to form the palette embeddings.

    • The learned palette block (LdP) fully attends to the task block. This ensures the item palette embeddings (which replace the LdP placeholders) can inform the generation of the task output.

    • All other inter-block attentions are disabled. For example, the task block does not attend to the content block directly, ensuring the task output is generated primarily from the compressed item palette.

      This masking strategy is critical because it controls the information flow, forcing the LLM to compress the content into the LaP and then use only the LaP (via LdP) to generate the task output.
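
A minimal NumPy sketch of such a hierarchical mask is given below. It follows the information flow described above (content into LaP, LdP into task), with causal attention inside each block and everything else disabled. The block sizes are illustrative, and the authors' exact masking implementation may differ.

```python
import numpy as np

def lamia_attention_mask(n_content, n_lap, n_ldp, n_task):
    """Boolean mask M where M[i, j] = True means token i may attend to token j."""
    sizes = [n_content, n_lap, n_ldp, n_task]
    starts = np.cumsum([0] + sizes)
    mask = np.zeros((starts[-1], starts[-1]), dtype=bool)

    # Inner-block masking: causal attention inside each of the four blocks.
    for b in range(4):
        s, e = starts[b], starts[b + 1]
        mask[s:e, s:e] = np.tril(np.ones((e - s, e - s), dtype=bool))

    # LaP tokens read the full content block (content flows into the palette).
    mask[starts[1]:starts[2], starts[0]:starts[1]] = True

    # Task tokens read the full LdP block (the palette informs the task output).
    mask[starts[3]:starts[4], starts[2]:starts[3]] = True
    return mask

# Toy usage: 5 content tokens, a palette of size 3, and 4 task tokens.
print(lamia_attention_mask(n_content=5, n_lap=3, n_ldp=3, n_task=4).astype(int))
```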

The detailed architecture of LAMIA (Figure 3) shows the item content being processed by the encoder, the learnable palette tokens capturing the semantics, and the contrastive losses applied to these palette embeddings. The generative tasks (title reconstruction, category prediction) then train the model to map the palette to specific item attributes.

4.2.2. Learning Multi-aspect Item Palette

The training objectives for LAMIA are designed to optimize the item palette's functionality: it should capture sufficient item content information for accurate reconstruction and prediction, and each embedding within the palette should be as independent and mutually exclusive as possible to minimize redundancy. These objectives are achieved through a combination of generative reconstruction/prediction and contrastive learning tasks.

Generative Reconstruction or Prediction: This task ensures that the item palette effectively incorporates content knowledge. For each item, distinct input samples are created by iterating through the task ID $t_i$ (for $i$ from 1 to $m$, representing different attributes). Each $t_i$ corresponds to a unique generative challenge: reconstructing an attribute present in the content block ($0 < i \le r$) or predicting an unseen attribute ($r < i \le m$). The LLM is tuned with a standard next-token prediction task, optimized using cross-entropy loss: $ \mathcal{L}_{\mathrm{gen}} = -\log P(a_{i,j+1} \mid a_{i,1}, a_{i,2}, \ldots, a_{i,j}) $ Where:

  • $\mathcal{L}_{\mathrm{gen}}$ is the generative loss.

  • $P(a_{i,j+1} \mid a_{i,1}, a_{i,2}, \ldots, a_{i,j})$ is the probability of predicting the $(j+1)$-th token of attribute $a_i$, given its preceding tokens.

  • $a_{i,j}$ denotes the $j$-th token of the $i$-th attribute.

    To enable the item palette embeddings (the outputs at the LaP positions) to influence the task block, a dual forward propagation mechanism is used. During the initial forward pass, the item palette embeddings (the output of the LaP block) are captured. These captured embeddings then replace the placeholder tokens in the LdP block for the subsequent pass, allowing the task block to generate its output conditioned on this populated LdP block.
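
The dual forward propagation can be illustrated with a toy PyTorch model. The sketch below uses a small `nn.TransformerEncoder` as a stand-in for the decoder-only LLM and omits the hierarchical mask for brevity; it is an assumed simplification of the mechanism, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for the decoder-only LLM; the hierarchical mask is omitted here.
d_model, L = 32, 3
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

content = torch.randn(1, 5, d_model)                  # embeddings of the content block
lap = torch.randn(1, L, d_model, requires_grad=True)  # learnable palette (LaP) tokens
ldp = torch.zeros(1, L, d_model)                      # LdP placeholder tokens
task = torch.randn(1, 4, d_model)                     # task block (prompt + answer tokens)

# Pass 1: encode content + LaP and capture the hidden states at the LaP positions.
h1 = encoder(torch.cat([content, lap], dim=1))
palette = h1[:, -L:, :]                               # the item palette

# Pass 2: the captured palette overwrites the LdP placeholders before the task block.
ldp = palette
h2 = encoder(torch.cat([content, lap, ldp, task], dim=1))
task_states = h2[:, -task.shape[1]:, :]               # fed to the LM head for next-token loss
print(task_states.shape)                              # torch.Size([1, 4, 32])
```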

Contrastive Learning: Contrastive learning plays a crucial role in shaping the quality of the item palette embeddings.

  • Intra-Palette Contrastive Loss: This loss encourages independence and mutual exclusivity among the $L$ palette embeddings within the same item. Since each palette embedding is meant to capture a distinct perspective, their pairwise similarity should be minimized. This ensures that the subsequent clustering results for each palette position are independent. First, the palette embeddings are normalized: $ \hat{\mathsf{B}}_{i,j} = \frac{\mathsf{B}_{i,j}}{\| \mathsf{B}_{i,j} \|_2} $ Where:

    • $\mathsf{B} \in \mathbb{R}^{B \times L \times d}$ is a batch of palette embeddings.
    • $B$ is the batch size, $L$ is the palette size, and $d$ is the embedding dimension.
    • $\mathsf{B}_{i,j}$ represents the palette embedding of the $j$-th aspect (order) for the $i$-th item in the batch.
    • $\| \cdot \|_2$ denotes the L2 norm used for normalization. Then, the intra-palette contrastive loss is formulated using a hinge loss that restricts the cosine similarity between different-order palette embeddings of the same item: $ \mathcal{L}_{\mathrm{intra}} = \sum_{i=1}^{B} \sum_{j=1}^{L} \sum_{k=1, k \neq j}^{L} \max\big(0, s(\hat{\mathsf{B}}_{i,j}, \hat{\mathsf{B}}_{i,k}) - \alpha_{\mathrm{intra}}\big)^2 $ Where:
    • $s(\cdot, \cdot)$ denotes the cosine similarity function.
    • $\hat{\mathsf{B}}_{i,j}$ and $\hat{\mathsf{B}}_{i,k}$ are the normalized palette embeddings for the $j$-th and $k$-th aspects of the $i$-th item.
    • $\alpha_{\mathrm{intra}}$ is a margin hyperparameter; similarities below this threshold are not penalized, while the $\max(0, \dots)^2$ term penalizes similarities that exceed $\alpha_{\mathrm{intra}}$.
  • Inter-Palette Contrastive Loss: This loss addresses the code collision problem, where different items might collapse to the same semantic identifiers. It encourages palette embeddings of the same order (aspect) to remain semantically distinct across different items in a batch. $ \mathcal{L}_{\mathrm{inter}} = \sum_{i=1}^{B} \sum_{k=1, k \neq i}^{B} \sum_{j=1}^{L} \max\big(0, s(\mathbf{B}_{i,j}, \bar{\mathbf{B}}_{k,j}) - \alpha_{\mathrm{inter}}\big)^2 $ Where:

    • $\mathbf{B}_{i,j}$ is the $j$-th palette embedding of the $i$-th item.
    • $\bar{\mathbf{B}}_{k,j}$ is the $j$-th palette embedding of a different (negative) item $k$ from the same batch.
    • $\alpha_{\mathrm{inter}}$ is a margin hyperparameter, analogous to $\alpha_{\mathrm{intra}}$, that restricts the similarity between same-order palette embeddings of different items.
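
A PyTorch sketch of the two hinge-style contrastive losses, following the formulas above, is shown below. Normalizing the embeddings in the inter-palette term as well is an assumption carried over from the intra-palette term; the toy batch is illustrative.

```python
import torch
import torch.nn.functional as F

def palette_contrastive_losses(B, alpha_intra=0.1, alpha_inter=0.25):
    """Intra-/inter-palette hinge losses on a batch of palettes B of shape (batch, L, d)."""
    Bn = F.normalize(B, dim=-1)                        # unit-norm palette embeddings

    # Intra-palette: penalize similarity between different aspects of the same item.
    sim_intra = torch.einsum("bld,bmd->blm", Bn, Bn)   # (batch, L, L) cosine similarities
    L = B.shape[1]
    off_diag = ~torch.eye(L, dtype=torch.bool)
    intra = torch.clamp(sim_intra - alpha_intra, min=0.0)[:, off_diag].pow(2).sum()

    # Inter-palette: penalize similarity of same-order aspects across different items.
    sim_inter = torch.einsum("bld,cld->lbc", Bn, Bn)   # (L, batch, batch)
    batch = B.shape[0]
    off_diag_b = ~torch.eye(batch, dtype=torch.bool)
    inter = torch.clamp(sim_inter - alpha_inter, min=0.0)[:, off_diag_b].pow(2).sum()
    return intra, inter

# Toy usage: a batch of 8 items, palette size 4, embedding dimension 64.
intra, inter = palette_contrastive_losses(torch.randn(8, 4, 64))
print(float(intra), float(inter))
```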

Final Training Objective: The overall training objective for LAMIA combines the generative reconstruction/prediction loss and the contrastive losses: $ \mathcal{L}_{\mathrm{LAMIA}} = \mathcal{L}_{\mathrm{gen}} + \gamma \mathcal{L}_{\mathrm{cl}} = \mathcal{L}_{\mathrm{gen}} + \gamma \left( \mathcal{L}_{\mathrm{intra}} + \mathcal{L}_{\mathrm{inter}} \right) $ Where:

  • $\mathcal{L}_{\mathrm{cl}}$ represents the total contrastive loss.
  • $\gamma$ is a hyperparameter that balances the contribution of the generative and contrastive components.

4.2.3. Quantization Using A Simple Clustering Algorithm

Unlike RQ-VAE's complex differentiable vector quantization, LAMIA quantizes its learned dense item palette vectors using a simpler, training-free approach. This process happens after the LAMIA model has been trained to produce the palette embeddings.

  1. Aggregation of Palette Embeddings: First, the output item palette embeddings for all items in the dataset are collected and aggregated into a matrix: $ \mathbf{E} = \begin{bmatrix} \mathbb{E}_{1,1} & \mathbb{E}_{1,2} & \dots & \mathbb{E}_{1,n} \\ \mathbb{E}_{2,1} & \mathbb{E}_{2,2} & \dots & \mathbb{E}_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E}_{L,1} & \mathbb{E}_{L,2} & \dots & \mathbb{E}_{L,n} \end{bmatrix} $ Where:

    • $n$ denotes the total number of items.
    • $\mathbb{E}_{i,j}$ represents the $i$-th palette embedding (for aspect $i$) of the $j$-th item. Each $\mathbb{E}_{i,j}$ is a vector of $D$ dimensions.
  2. Dimensionality Reduction with PCA: The palette embeddings can be high-dimensional (e.g., 1024 dimensions). To make clustering more efficient and potentially more robust, Principal Component Analysis (PCA) [25] is applied to reduce these dimensions.

    • PCA reduces each $\mathbb{E}_{i,j}$ from $D$ dimensions to a lower dimension $d$ (e.g., 32 or 64). The resulting reduced embeddings are denoted $\hat{\mathbf{e}}_{i,j}$.
  3. Clustering with K-Means: A simple, training-free clustering algorithm, such as K-Means [16], is applied independently to each row of the reduced embedding matrix. Each row $\hat{\mathbf{E}}[i,:] = [\hat{\mathbf{e}}_{i,1}, \hat{\mathbf{e}}_{i,2}, \dots, \hat{\mathbf{e}}_{i,n}]$ corresponds to the embeddings of a specific aspect (palette position) across all items.

    • For each aspect $i$ (i.e., each row of $\hat{\mathbf{E}}$), K-Means clusters all $\hat{\mathbf{e}}_{i,j}$ into $k$ clusters.
    • The output of this step is a matrix of cluster indices: $ \mathbf{C} = \begin{bmatrix} c_{1,1} & c_{1,2} & \dots & c_{1,n} \\ c_{2,1} & c_{2,2} & \dots & c_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{L,1} & c_{L,2} & \dots & c_{L,n} \end{bmatrix} $ Where:
      • $c_{i,j}$ represents the cluster index for the $i$-th aspect of the $j$-th item.
      • $1 \le c_{i,j} \le k$, where $k$ is the number of clusters (e.g., a codebook size of 256).
    • Thus, each item $j$ is now represented by a sequence of $L$ discrete tokens: $c_j = [c_{1,j}, c_{2,j}, \dots, c_{L,j}]$. These are the semantic identifiers for item $j$.
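
The quantization step can be reproduced with off-the-shelf scikit-learn components. The sketch below fits PCA and K-Means independently for each aspect; whether the paper fits PCA per aspect or once globally is not specified, so that detail is an assumption, and the toy sizes are for illustration only (the paper uses $D=1024$, $d=64$, $k=256$).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def quantize_palettes(E, n_components=64, n_clusters=256, seed=0):
    """E: (L, n_items, D) palette embeddings -> (L, n_items) cluster indices."""
    L, n_items, D = E.shape
    codes = np.zeros((L, n_items), dtype=int)
    for i in range(L):
        # Reduce dimensionality per aspect, then run an independent K-Means per aspect.
        reduced = PCA(n_components=n_components, random_state=seed).fit_transform(E[i])
        codes[i] = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(reduced)
    return codes

# Toy usage: 4 aspects, 1000 items, 1024-dim embeddings, a small codebook for speed.
E = np.random.default_rng(0).normal(size=(4, 1000, 1024))
codes = quantize_palettes(E, n_components=64, n_clusters=16)
print(codes.shape)  # (4, 1000): item j receives the token sequence codes[:, j]
```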

4.2.4. Generative Recommender

Once the semantic codes (discrete tokens) for all items are obtained, they can be used in a generative recommender system. This can be either a traditional Deep Learning-based Recommender Model (DLRM) or an LLM specifically fine-tuned for recommendation (LLM as RS).

Training: Given a user behavior sequence in which each item is represented by its $L$ semantic tokens, the sequence is flattened. For example, if the user interacted with item $u_1$ and then item $u_2$, the sequence becomes the tokens of $u_1$ followed by the tokens of $u_2$: $ \mathrm{R} = (\mathbf{u}_{1,1}, \dots, \mathbf{u}_{1,L}, \mathbf{u}_{2,1}, \dots, \mathbf{u}_{2,L}, \dots) $ Where:

  • $\mathbf{u}_{i,j}$ represents the $j$-th token of the $i$-th item in the user sequence.

    The next-item prediction task is the principal task for training the recommender: $ \mathcal{L}_{\mathrm{rip}} = -\sum_{i=1}^{I} \sum_{j=1}^{L} \log P(u_{i,j+1} \mid u_{1,k}, \dots, u_{i-1,k}, u_{i,1}, \ldots, u_{i,j}) $ Where:

  • $\mathcal{L}_{\mathrm{rip}}$ is the next-item prediction loss.

  • $I$ is the number of items in the user sequence.

  • $L$ is the length of the semantic identifier (number of tokens per item).

  • $P(\cdot \mid \cdot)$ is the probability of predicting the $(j+1)$-th token of the $i$-th item, given the preceding tokens in the flattened sequence.

  • $k$ indexes the tokens of previous items; the paper's notation is ambiguous here, but standard sequential modeling would condition on all prior tokens, so it most plausibly abbreviates the full context $u_{1,1}, \dots, u_{i-1,L}$.

    When an LLM is used as the backbone, the user sequence can be concatenated with natural language prompts. However, the loss computation $\mathcal{L}_{\mathrm{rip}}$ considers only the semantic tokens, ignoring the natural language parts.
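
A minimal sketch of how a user's history of semantic codes might be flattened into a single token sequence for the recommender is shown below. The position-specific vocabulary offsets are an assumption (a common convention in TIGER-style recommenders), not a detail confirmed by the paper.

```python
def flatten_history(history_codes, codebook_size=256):
    """Flatten a user's item history into one recommender token sequence.

    history_codes: list of per-item semantic codes, each of length L.
    The position-specific offset keeps the L codebooks disjoint in the
    recommender's vocabulary (an assumed convention, not from the paper).
    """
    seq = []
    for codes in history_codes:
        for pos, c in enumerate(codes):
            seq.append(pos * codebook_size + c)   # e.g. aspect 1 (0-based), code 17 -> 273
    return seq

# Toy usage: a user who interacted with three items, L = 4 codes per item.
history = [[12, 200, 7, 99], [12, 45, 7, 130], [88, 200, 64, 3]]
tokens = flatten_history(history)
inputs, targets = tokens[:-4], tokens[-4:]   # the last item's codes are the prediction target
print(inputs, "->", targets)
```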

Additionally, for LLM-based recommenders, a text-token alignment task can be used as a supplementary objective (following LC-Rec [46]). This task helps the LLM understand the relationship between item descriptions in natural language and their corresponding semantic tokens. As illustrated in Figure 5, the input sequence for this task might be structured as: $ \mathbf{s} = [s_1, \dots, s_l, \underline{\mathbf{e}}_1, \underline{\mathbf{e}}_2, \dots, \underline{\mathbf{e}}_l] $ Where:

  • $s_i$ are natural language tokens of the item description.
  • $\underline{\mathbf{e}}_i$ are the semantic tokens of the item. The alignment task loss is formalized as: $ \mathcal{L}_{\mathrm{align}} = -\sum_{i=1}^{v} \log P(c_{i+1} \mid s_k, \dots, \underline{\mathbf{e}}_i, \dots, \underline{\mathbf{e}}_l) $ Where:
  • $c_{i+1}$ is the next token to be predicted in the sequence.
  • The conditional probability $P(\cdot \mid \cdot)$ predicts tokens using both natural language and semantic tokens as context. The exact roles of $k$ and $v$ within the input sequence $\mathbf{s}$ are slightly ambiguous in the paper, but the task typically amounts to predicting the semantic tokens given the text description (or vice versa); in any case, it forces the LLM to learn the mapping between continuous text and discrete codes.

Inference: During inference, the generative recommender predicts the semantic tokens for the next item in an autoregressive manner. Following prior work [29, 46], beam search [4] is applied. Beam search explores multiple possible token sequences concurrently, maintaining the top-K most probable sequences, and ultimately selects the best sequence of semantic tokens to form the recommended item. These generated semantic tokens are then mapped back to actual items based on the learned semantic identifier to item ID mapping.
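
Mapping the generated code sequences back to items can be sketched as a simple dictionary lookup over the quantization output. The handling of invalid code combinations and duplicates below is an assumption for illustration, not a detail from the paper.

```python
def codes_to_items(beam_sequences, code_to_item, top_k=10):
    """Map beam-searched semantic-code sequences back to item IDs.

    beam_sequences: list of (code_sequence, score) pairs, best first.
    code_to_item: dict built offline from the quantization step,
                  e.g. {(c1, c2, c3, c4): item_id}. Code combinations
                  that match no item are simply skipped in this sketch.
    """
    recommended = []
    for codes, score in beam_sequences:
        item = code_to_item.get(tuple(codes))
        if item is not None and item not in recommended:
            recommended.append(item)
        if len(recommended) == top_k:
            break
    return recommended

# Toy usage with a hypothetical mapping and three beams.
code_to_item = {(12, 200, 7, 99): "item_A", (88, 200, 64, 3): "item_B"}
beams = [([12, 200, 7, 99], -0.3), ([1, 2, 3, 4], -0.9), ([88, 200, 64, 3], -1.2)]
print(codes_to_items(beams, code_to_item, top_k=2))   # ['item_A', 'item_B']
```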

The image 5.jpg from the paper, referenced as Figure 5, shows "Instruction templates for tuning large language models as generative recommenders." The upper template shows how a user behavior sequence (semantic tokens) is used to predict the next item (semantic tokens). The lower template shows the text-token alignment task, where a textual description (title, abstract) is used to retrieve the corresponding item (semantic tokens).


Overall, the LAMIA methodology provides a comprehensive pipeline from robust multi-aspect item representation learning to generative recommendation, addressing critical limitations of existing semantic tokenization methods.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three real-world content-based recommendation datasets, representing different domains:

  1. MIND (MIcrosoft News Dataset): A news recommendation dataset.

  2. Amazon CDs: A music (CDs) recommendation dataset from Amazon.

  3. H&M: A fashion recommendation dataset from H&M.

    The dataset statistics are summarized in Table 2:

The following are the results from Table 2 of the original paper:

MIND CDs H&M
#Items 25,634 19,684 15,889
#Users 45,000 45,000 45,000
#Finetune 40,000 40,000 40,000
#Test 5,000 5,000 5,000
Avg. User Length 11.78 5.19 8.67
Avg. Item Appearance 20.69 11.70 22.44

Characteristics and Domain:

  • MIND: News articles, implying rich textual content (titles, abstracts, categories). User behavior involves reading sequences of news.

  • Amazon CDs: Music albums/tracks, likely with textual descriptions, artist information, and potentially genre tags. User behavior involves purchasing or listening sequences.

  • H&M: Fashion items, characterized by descriptions, product types, colors, and appearance names. User behavior involves browsing or purchasing clothing items.

    These datasets are chosen because they are content-based, meaning items have rich descriptive content that LAMIA is designed to leverage. They represent diverse domains, allowing for a comprehensive validation of the framework's generalization ability. The varying "Avg. User Length" and "Avg. Item Appearance" indicate different levels of user engagement and item popularity, which are relevant for assessing long-tail and cold-start performance.

For the MIND dataset, the authors provide a concrete example of a news item's attributes: "title:Yellowstone tourist injured:. abstract:A tourist suffered severe burns..". This demonstrates the kind of textual content used for the content block and generative tasks. For H&M, attributes like "fashion description, product type, product group, appearance name, color master name, color value name, and index name" are mentioned, indicating the rich, multi-faceted content typical of fashion items.

5.2. Evaluation Metrics

The effectiveness of sequential recommenders is evaluated using standard metrics: Recall and NDCG [12]. The paper specifically reports Recall@K and NDCG@K for $K \in \{1, 5, 10, 20\}$.

  1. Recall (Recall@K):

    • Conceptual Definition: Recall measures the proportion of relevant items that were successfully identified and included in the top KK recommendations by the system, out of all possible relevant items. It focuses on the completeness of retrieval—how many of the truly desired items did the system manage to recommend.
    • Mathematical Formula: $ \text{Recall}@K = \frac{|\text{Relevant Items} \cap \text{Top-K Recommendations}|}{|\text{Relevant Items}|} $
    • Symbol Explanation:
      • $|\text{Relevant Items} \cap \text{Top-K Recommendations}|$: The number of items that are both actually relevant to the user and present in the system's top $K$ recommendations.
      • $|\text{Relevant Items}|$: The total number of items that are relevant to the user (e.g., items the user interacted with in the test set).
      • $K$: The size of the recommendation list (e.g., 1, 5, 10, 20).
  2. NDCG (Normalized Discounted Cumulative Gain - NDCG@K):

    • Conceptual Definition: NDCG is a widely used metric for evaluating the quality of ranked recommendation lists. It not only considers whether relevant items are recommended but also their positions in the list. Higher relevance at higher positions (top of the list) contributes more to the score, and relevance at lower positions is discounted. The score is normalized by dividing by the Ideal DCG (the DCG of a perfectly ordered list) to make it comparable across different queries or recommendation scenarios.
    • Mathematical Formula: The Discounted Cumulative Gain (DCG) at position $K$ is calculated as: $ \text{DCG}@K = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $ The Ideal Discounted Cumulative Gain (IDCG) at position $K$ is: $ \text{IDCG}@K = \sum_{i=1}^{K} \frac{2^{\text{rel}_{\text{ideal},i}} - 1}{\log_2(i+1)} $ Finally, NDCG@K is: $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $
    • Symbol Explanation:
      • $K$: The number of top recommendations considered.
      • $\text{rel}_i$: The relevance score of the item at position $i$ in the actual recommendation list. In typical recommendation tasks, this is often binary (1 if the item is relevant/interacted with, 0 otherwise).
      • $\text{rel}_{\text{ideal},i}$: The relevance score of the item at position $i$ in the ideal recommendation list, where all relevant items are ranked perfectly from most to least relevant.
      • $i$: The position of an item in the recommendation list (starting from 1).
      • $\log_2(i+1)$: The logarithmic discount factor, which reduces the contribution of items further down the list.
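
For reference, a minimal Python implementation of these two metrics under binary relevance is shown below; it assumes a single ranked recommendation list and a set of ground-truth relevant items, which is the usual leave-one-out evaluation setting.

```python
import numpy as np

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k following the formulas above."""
    rels = [1.0 if item in set(relevant) else 0.0 for item in recommended[:k]]
    dcg = sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted([1.0] * len(relevant), reverse=True)[:k]
    idcg = sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: one relevant item ranked at position 3 of the top-5 list.
recommended = ["i7", "i2", "i9", "i4", "i1"]
relevant = ["i9"]
print(recall_at_k(recommended, relevant, 5))   # 1.0
print(ndcg_at_k(recommended, relevant, 5))     # 1 / log2(4) = 0.5
```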

5.3. Baselines

The proposed LAMIA framework is benchmarked against a comprehensive set of baselines, categorized into unique ID-based recommenders and semantic code-based recommenders.

5.3.1. Unique ID-based Recommenders

These models represent each item with a unique ID and learn an embedding for it.

  • GRU4Rec [9]: A Gated Recurrent Unit (GRU)-based model for session-based recommendation, capturing sequential patterns.
  • Caser [34]: Uses Convolutional Sequence Embedding for personalized top-N sequential recommendation.
  • SASRec [14]: Self-Attentive Sequential Recommendation, a pioneering Transformer-based model for sequential recommendation. The paper evaluates SASRec with different numbers of layers: SASRec3L, SASRec6L, and SASRec12L.
  • Bert4Rec [33]: Sequential recommendation with bidirectional encoder representations from transformer, adapting BERT's masked language modeling objective to sequence prediction.
  • P5 [6]: Recommendation as Language Processing (RLP), a paradigm that unifies various recommendation tasks into a pre-train, personalized prompt & predict framework using an LLM.

5.3.2. Semantic Code-based Recommenders

These models utilize semantic tokenization to represent items, where items are encoded into sequences of discrete tokens.

  • TIGER [29]: Uses SentenceBERT as an embedder and RQ-VAE for semantic tokenization, with a Transformer recommender.

  • LC-Rec [46]: Employs Llama1-7B for both embedding and as the recommender, with RQ-VAE for tokenization and an alignment task.

  • CoST [47]: Uses SentenceT5 for embedding, RQ-VAE for quantization, and a Transformer recommender.

  • EAGER [39]: A two-stream generative recommender that incorporates behavior-semantic collaboration. It uses SentenceT5 for content embedding, DIN (Deep Interest Network) for collaborative features, and K-Means for quantization. The paper notes that comparing EAGER directly might be inequitable as it leverages both content and behavioral knowledge, unlike LAMIA and most other code-based methods that primarily focus on content.

  • LETTER [38]: Uses Llama1-7B for embedding, RQ-VAE for tokenization, and a Transformer recommender. It also incorporates collaborative features.

  • TokenRec [28]: Uses LightGCN for collaborative embeddings and MQ-VAE for quantization, with Llama1-7B as the recommender.

    All code-based recommenders (TIGER, LC-Rec, CoST, and LAMIA) represent each item using four codes ($L=4$), with a fixed code vocabulary of 256 at each position.

The baselines are representative as they cover both traditional ID-based approaches (including state-of-the-art sequential models) and contemporary semantic code-based methods that LAMIA aims to improve upon. The inclusion of LLM-based recommenders further strengthens the comparison in the modern context of LLM advancements.

5.4. Implementation Details

The paper provides detailed implementation specifics for LAMIA and its experimental setup:

  • i) LAMIA Configuration:

    • Backbone LLM: OPT-350M [44] is used as the pretrained decoder-only LLM to learn the item palette. OPT-350M is a smaller, open-source Transformer-based LLM from Meta (formerly Facebook AI), providing a balance of performance and computational efficiency.
    • Optimizer: Adam [15] is used for optimization.
    • Learning Rate: 1e-4.
    • Batch Size: 128.
    • LoRA Rank: 128. LoRA (Low-Rank Adaptation) [10] is a Parameter-Efficient Fine-Tuning (PEFT) technique that reduces the number of trainable parameters by injecting small, low-rank matrices into the Transformer layers, making fine-tuning more efficient. (A minimal configuration sketch appears at the end of this implementation-details list.)
    • Palette Size ($L$): 4. This means each item is represented by 4 distinct palette embeddings, leading to 4 semantic codes.
    • Intra-/Inter-Palette Contrastive Margins: $\alpha_{\mathrm{intra}}$ is set to 0.1 and $\alpha_{\mathrm{inter}}$ to 0.25. These margins define the similarity threshold in the hinge loss used for contrastive learning.
    • Palette Contrastive Weight ($\gamma$): 0.1. This hyperparameter balances the contribution of the contrastive losses relative to the generative loss in the overall LAMIA training objective.
  • ii) Self-supervised Generative Tasks: The specific generative tasks are tailored to each dataset's item attributes:

    • MIND dataset: $m=4$ attributes (title, abstract, category, subcategory). The content block uses the news title and abstract ($r=2$). Four generative tasks are designed: reconstructing the title, abstract, category, and subcategory.
    • H&M dataset: $m=7$ attributes (fashion description, product type, product group, appearance name, color master name, color value name, and index name). The content block uses all 7 attributes ($r=7$), and each attribute corresponds to a distinct generation task.
  • iii) Clusterer:

    • PCA: Principal Component Analysis [25] is applied to reduce the 1024-dimensional item embeddings to 64 components before clustering.
    • Clustering: Each of the 4 palette positions (aspects) is independently clustered into 256 groups (the codebook size) using K-Means. A minimal sketch of this clustering step, together with the collision handling in iv), is given after this list.
    • Hinge Loss Similarity Threshold: The threshold for the hinge loss in Eq. (8) is 0.25, consistent with $\alpha_{\mathrm{inter}}$ above.
    • Loss Weight ($\gamma$): 0.1, consistent with the palette contrastive weight above.
  • iv) Item Collision: An additional index token is used to ensure that even if two items have identical semantic content and thus identical semantic tokens, they can still be distinguished. This is a practical measure to avoid ambiguity.

  • v) Generative Recommender:

    • Max User History Length: 20 items. The last item in the sequence is used as the prediction target.

    • Backbone: The same pretrained OPT-base (OPT-350M) is used as the recommender backbone.

    • Learning Rate: 5e-4.

    • Batch Size: 64.

    • LoRA Rank: 128.

    • Training Strategy for LLM-based Recommender:

      1. Initial phase: joint learning of the generative recommendation task ($\mathcal{L}_{\mathrm{rip}}$) and the text-token alignment task ($\mathcal{L}_{\mathrm{align}}$).
      2. Subsequent phase: after the model converges in the initial phase, it is further tuned on the generative recommendation task ($\mathcal{L}_{\mathrm{rip}}$) alone.
    • Early Stopping: Used with a patience of 5 epochs to prevent overfitting.

    • Computational Resources: All experiments are conducted on a single NVIDIA A100 device with 80GB memory. This indicates a significant but manageable computational requirement for research.

    • Reproducibility: Source code, data, and configurations will be released.

    • Benchmark: RecBench [23] is employed for evaluating the recommendation abilities of LLMs.

      These details highlight the careful engineering and resource considerations involved in implementing and evaluating LAMIA.
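
As a concrete illustration of the configuration in item i), the snippet below is a minimal sketch of wrapping the OPT-350M backbone with rank-128 LoRA adapters and an Adam optimizer at learning rate 1e-4. The `target_modules`, `lora_alpha`, and `lora_dropout` values are illustrative assumptions, not values from the paper, and the snippet is not the authors' released code.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Pretrained decoder-only backbone used to learn the item palette.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Rank-128 LoRA adapters; alpha, dropout, and target modules are
# illustrative assumptions -- only the rank is reported.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable

# Adam with the reported learning rate of 1e-4.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```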
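
The next sketch illustrates the training-free quantization described in items iii) and iv): PCA reduces the 1024-dimensional palette embeddings to 64 components, each of the four palette positions is clustered independently into 256 groups with K-Means, and a trailing index token breaks ties between items that receive identical codes. Function and variable names are hypothetical, and whether PCA is fit per position or jointly is an assumption here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def tokenize_item_palettes(palettes, n_components=64, codebook_size=256, seed=0):
    """Turn dense palette embeddings into discrete semantic codes.

    palettes: array of shape (num_items, L, dim), e.g. (N, 4, 1024),
              holding the L aspect embeddings learned for each item.
    Returns an int array of shape (num_items, L + 1); the last column
    is the collision-breaking index token.
    """
    num_items, palette_size, _ = palettes.shape
    codes = np.zeros((num_items, palette_size + 1), dtype=np.int64)

    for pos in range(palette_size):
        emb = palettes[:, pos, :]
        # Reduce 1024 -> 64 dimensions before clustering (assumed per position).
        reduced = PCA(n_components=n_components, random_state=seed).fit_transform(emb)
        # One K-Means codebook of 256 entries per palette position.
        km = KMeans(n_clusters=codebook_size, random_state=seed, n_init=10)
        codes[:, pos] = km.fit_predict(reduced)

    # Items that share all L codes receive distinct index tokens so that
    # every item keeps a unique identifier.
    seen = {}
    for i in range(num_items):
        key = tuple(codes[i, :palette_size])
        codes[i, palette_size] = seen.get(key, 0)
        seen[key] = seen.get(key, 0) + 1
    return codes
```

With `palettes.shape == (N, 4, 1024)`, a call such as `tokenize_item_palettes(palettes)` would yield the `(N, 5)` code matrix consumed by the generative recommender.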

6. Results & Analysis

6.1. Core Results Analysis

The experimental results in Table 3 provide a comprehensive comparison of LAMIA against various baselines across the MIND, CDs, and H&M datasets. The primary goal is to validate LAMIA's effectiveness in retrieval scenarios.

The following are the results from Table 3 of the original paper:

Columns: Method | Embedder | Recommender | then, for each of MIND, CDs, and H&M in turn: Recall@5, Recall@10, Recall@20, NDCG@5, NDCG@10, NDCG@20.
Item Representation: Unique ID
GRU4Rec [9] N/A N/A 0.48 0.52 0.60 0.38 0.40 0.42 0.22 0.25 0.25 0.16 0.16 0.16 2.62 2.80 3.28 2.24 2.37 2.37
Caser [34] N/A N/A 0.88 1.02 1.18 0.38 0.40 0.40 0.22 0.22 0.22 0.14 0.14 0.14 2.02 2.24 3.69 2.69 1.82 1.82
Bert4Rec [33] N/A N/A 0.82 1.04 1.56 0.30 0.32 0.32 0.14 0.14 0.14 0.14 0.14 0.14 1.98 1.75 2.44 1.90 1.92 1.92
SASRec3L [14] N/A N/A 1.00 1.26 1.18 4.18 1.31 1.31 0.18 0.24 0.24 0.13 0.13 0.13 2.24 2.64 2.98 1.74 1.74 1.74
SASRec6L N/A N/A 1.08 3.54 4.18 1.24 2.41 2.41 0.24 0.21 0.21 0.16 0.16 0.16 3.12 2.68 4.26 2.09 8.07 8.07
SASRec12L N/A N/A 3.00 2.21 2.24 2.24 2.24 2.24 0.10 0.18 0.18 0.22 0.22 0.22 11.64 12.35 14.74 3.12 4.98 4.98
Item Representation: Hierarchical Semantic ID, using RQ-VAE or Hierarchical K-Means
CoST [47] SentenceT5 TRM3L 2.72 3.22 4.50 1.97 2.13 2.45 1.42 1.66 1.74 1.32 1.58 1.92 3.36 4.49 5.71 2.80 3.75 4.76
EAGER [39] SentenceT5* TRM3L 2.00 3.48 5.32 1.36 1.83 2.51 1.56 1.74 1.92 1.28 1.54 1.88 3.71 4.60 5.89 1.10 1.33 1.32
TIGER [29] SentenceBert TRM3L 2.98 4.64 6.52 2.15 2.66 3.13 0.00 0.00 0.00 0.00 0.00 0.00 1.04 1.32 1.66 0.75 0.84 0.93
TIGER [29] SentenceBert TRM6L 2.94 3.42 4.44 2.09 2.50 2.50 1.72 1.88 2.20 1.48 1.48 1.48 1.74 1.74 3.12 0.88 1.05 1.63
TIGER [29] SentenceBert TRM12L 2.52 3.30 4.82 1.76 2.61 3.74 1.74 1.84 2.36 1.45 1.52 1.52 6.66 26.16 26.16 0.97 1.05 1.85
TIGER [29] SentenceBert BERTbase 8.42 10.82 11.68 6.41 7.71 7.87 37.00 37.84 37.84 32.31 35.39 35.39 1.00 1.00 1.00 1.00 1.00 1.00
Item Representation: Parallel Semantic ID, using LAMIA (ours)
LAMIA (ours) OPT-350M BERTbase 9.08 9.96 10.56 7.42 7.71 7.87 38.88 39.70 40.04 36.16 36.43 36.52 11.50 12.68 13.44 9.17 9.56 9.75

Key Observations and Analysis:

  1. Semantic ID vs. Unique ID:

    • Generally, semantic identifier-based methods (the middle and bottom sections of the table) outperform unique identifier-based methods (top section) on the MIND and CDs datasets. This validates the core premise of semantic tokenization: leveraging item content and generalizable representations improves recommendation, especially for long-tail and cold-start items.
    • H&M Dataset Anomaly: On the H&M dataset, ID-based methods, particularly SASRec12L, perform remarkably well (e.g., Recall@20 of 14.74 for SASRec12L vs. 13.44 for LAMIA). The authors attribute this to potentially "low-quality content features" in H&M, where similar text descriptions (e.g., for different clothing items) may correspond to very different labels or subtle distinctions that are hard to capture from content alone. This suggests that for domains where content is ambiguous or highly similar across distinct items, ID-based models, which rely purely on behavioral signals, may still have an edge. Even so, LAMIA remains very competitive here.
  2. Impact of Recommender Backbone (LLM vs. Traditional Transformer):

    • Within the TIGER series, replacing a simple Transformer (TRM3L, TRM6L, TRM12L) with a BERTbase LLM as the generative recommender leads to a significant performance boost (e.g., for MIND, TIGER TRM3L Recall@5 is 2.98, while TIGER BERTbase is 8.42). This underscores the value of pretrained language models (LLMs) in interpreting semantic IDs and capturing rich semantic information, even when these capabilities are implicitly encoded in the tokens.
  3. Embedder Choice and Domain Knowledge:

    • LC-Rec uses LLaMA-1-7B as an embedder, while TIGER uses SentenceBERT. Despite LLaMA-1-7B possessing broader world knowledge, SentenceBERT (specifically pre-trained for sentence representation) achieves better results across all three datasets with TIGER. This indicates that domain-specific or task-specific pre-training (like SentenceBERT for sentence embeddings) can be more effective than larger, more general LLMs for embedding items in recommendation contexts, at least when directly used for quantization with RQ-VAE.
  4. LAMIA's Superiority:

    • Our proposed LAMIA framework, which uses parallel semantic IDs and OPT-350M as an embedder, consistently achieves the best performance on the MIND and CDs datasets across all Recall and NDCG metrics (e.g., MIND Recall@5: LAMIA 9.08 vs. TIGER BERTbase 8.42; CDs Recall@5: LAMIA 38.88 vs. TIGER BERTbase 37.00).

    • On H&M, while SASRec12L has a slightly higher Recall@20, LAMIA still achieves very strong results and outperforms all other semantic code-based methods.

    • These results strongly validate the effectiveness of LAMIA's novel design, particularly its ability to capture multi-aspect information and leverage domain-specific tuning. The use of parallel semantic IDs explicitly addresses the limitations of hierarchical semantic ID approaches like RQ-VAE.

      In summary, LAMIA demonstrates a significant step forward in generative recommendation by providing a more robust and semantically rich item tokenization framework, especially when content information is valuable.

6.2. Data Presentation (Tables)

The main results are presented in Table 3 (transcribed above).

6.3. Ablation Studies / Parameter Analysis

Table 4 presents the results of ablation studies, which investigate the effectiveness of different components within the LAMIA framework. Experiments are conducted on the MIND and H&M datasets.

The following are the results from Table 4 of the original paper:

Columns: Quantizer | Recommender | $\mathcal{L}_{\mathrm{cl}}$ | $\mathcal{L}_{\mathrm{align}}$ | then, for MIND and H&M in turn: R@5, R@10, R@20, N@5, N@10, N@20.
RQ-VAE BERTbase N/A × 6.74 6.90 7.17 5.02 5.35 5.58 5.25 6.38 6.86 3.94 4.15 4.50
LAMIA BERTbase × √ 4.30 6.06 7.94 3.41 3.57 3.83 8.38 9.42 10.14 6.02 6.70 7.15
LAMIA BERTbase √ × 8.84 9.79 10.26 7.06 7.28 7.38 10.88 11.80 12.64 8.55 8.93 9.08
LAMIA BERTbase √ √ 9.08 9.96 10.56 7.42 7.71 7.87 11.50 12.68 13.44 9.17 9.56 9.75

Observations from Ablation Studies:

  1. LAMIA vs. Conventional RQ-VAE:

    • The first row, RQ-VAE with a BERTbase recommender, represents a conventional RQ-VAE semantic tokenization pipeline paired with the same recommender backbone as LAMIA.
    • Comparing LAMIA (full model, last row) with this RQ-VAE baseline: LAMIA significantly outperforms RQ-VAE (e.g., MIND R@5: 9.08 vs. 6.74; H&M R@5: 11.50 vs. 5.25). This clearly demonstrates the advantages of LAMIA's domain-adaptive tuning, multi-aspect item palette, and robust quantization over RQ-VAE.
  2. Effectiveness of Contrastive Loss ($\mathcal{L}_{\mathrm{cl}}$):

    • Comparing LAMIA with the contrastive loss removed (× $\mathcal{L}_{\mathrm{cl}}$, √ $\mathcal{L}_{\mathrm{align}}$, second row) against the full LAMIA (√ $\mathcal{L}_{\mathrm{cl}}$, √ $\mathcal{L}_{\mathrm{align}}$, last row): removing $\mathcal{L}_{\mathrm{cl}}$ leads to a substantial drop in performance. For MIND, R@5 decreases from 9.08 to 4.30; for H&M, R@5 decreases from 11.50 to 8.38.
    • This confirms that $\mathcal{L}_{\mathrm{cl}}$ is crucial. Without it, the auxiliary intra- and inter-palette contrastive tasks are absent, so the palette embeddings are not explicitly forced to be independent within an item or distinct across items. This can lead to redundant information within the palette or code collisions between items, making the clustering step less effective and the overall representation quality suboptimal. A sketch of one plausible form of these hinge-based losses is given after this list.
  3. Effectiveness of Text-Token Alignment Task ($\mathcal{L}_{\mathrm{align}}$):

    • Comparing LAMIA with the alignment task removed (√ $\mathcal{L}_{\mathrm{cl}}$, × $\mathcal{L}_{\mathrm{align}}$, third row) against the full LAMIA (√ $\mathcal{L}_{\mathrm{cl}}$, √ $\mathcal{L}_{\mathrm{align}}$, last row): removing $\mathcal{L}_{\mathrm{align}}$ also decreases performance, though less severely than removing $\mathcal{L}_{\mathrm{cl}}$. For MIND, R@5 drops from 9.08 to 8.84; for H&M, R@5 drops from 11.50 to 10.88.

    • This indicates that while the generative and contrastive tasks are powerful, the text-token alignment task is still beneficial. It explicitly bridges the gap between the continuous textual content and the discrete semantic tokens, helping the LLM to better comprehend the real semantics behind the tokens. This enhances the LLM's ability to process user sequences and generate relevant recommendations based on the learned semantic representations.

      In conclusion, the ablation studies robustly demonstrate that both the contrastive loss and the text-token alignment task are vital components of the LAMIA framework, each contributing significantly to its superior performance.
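
For concreteness, the sketch below shows one plausible PyTorch form of the intra- and inter-palette hinge losses whose removal is ablated above, using cosine similarity with the reported margins ($\alpha_{\mathrm{intra}} = 0.1$, $\alpha_{\mathrm{inter}} = 0.25$). It is an illustration of the idea under these assumptions, not the formulation of Eq. (8) from the paper.

```python
import torch
import torch.nn.functional as F

def palette_contrastive_loss(palette, alpha_intra=0.1, alpha_inter=0.25):
    """Hinge-style contrastive penalty over a batch of item palettes.

    palette: (batch, L, dim) aspect embeddings for a batch of items.
    Intra-palette term: aspects of the same item should not be too
    similar to one another (margin alpha_intra).
    Inter-palette term: the same aspect of different items should not
    collapse onto each other (margin alpha_inter).
    """
    batch, num_aspects, _ = palette.shape
    normed = F.normalize(palette, dim=-1)

    # Pairwise cosine similarity between the aspects of each item.
    sim_intra = torch.einsum("bld,bmd->blm", normed, normed)
    off_diag = ~torch.eye(num_aspects, dtype=torch.bool, device=palette.device)
    loss_intra = F.relu(sim_intra - alpha_intra)[:, off_diag].mean()

    # Similarity between different items at the same palette position.
    sim_inter = torch.einsum("bld,cld->lbc", normed, normed)
    off_items = ~torch.eye(batch, dtype=torch.bool, device=palette.device)
    loss_inter = F.relu(sim_inter - alpha_inter)[:, off_items].mean()

    return loss_intra + loss_inter

# Combined with the generative objective using the reported weight gamma = 0.1
# (l_gen is a placeholder for the text-reconstruction loss):
# total_loss = l_gen + 0.1 * palette_contrastive_loss(palette)
```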

6.4. Effect of Semantic Identifier Length

The paper explores how the length of the semantic identifier (i.e., the item palette size $L$) affects performance. Figure 6 (from the original paper) plots the R@5 metric against semantic identifier length for both RQ-VAE (the TIGER model) and LAMIA on the MIND dataset.

As can be seen from the results in Figure 6:

  • Initial Improvement for Both: Both RQ-VAE and LAMIA show an improvement in R@5 as the semantic identifier length increases from 1 up to about 6. This is expected because more tokens (longer identifiers) allow for richer and more granular item representations, capturing more information.

  • RQ-VAE Performance Decline: Beyond a length of 6, RQ-VAE's performance starts to decline. This is attributed to its hierarchical and residual discretization method. In RQ-VAE, earlier tokens capture coarse-grained information, and subsequent tokens refine the residual. As the length increases, later tokens often contain less significant information or struggle to capture meaningful details due to the nature of residual quantization, leading to diminishing returns and eventual performance degradation.

  • LAMIA's Continued Improvement: In contrast, LAMIA continues to show slight improvements or maintains its performance even when the length reaches 8. This divergence highlights a key advantage of LAMIA's multi-aspect item palette approach. Since LAMIA treats each palette embedding equitably and each embedding encapsulates a distinct segment (aspect) of the item's content, increasing the palette length genuinely enhances the granularity and richness of the overall item representation without the inherent limitations of hierarchical residual refinement. More aspects mean more distinct facets of the item can be represented, leading to better performance.

    This analysis strongly supports LAMIA's design choice of parallel semantic IDs over RQ-VAE's hierarchical approach, demonstrating its ability to scale effectively with increased representational capacity.

6.5. Visualization

Figure 7 (from the original paper) provides a visualization of semantic identifiers generated by both RQ-VAE (specifically, the TIGER model) and LAMIA using the t-SNE [36] technique. t-SNE projects high-dimensional embeddings into two dimensions, where shorter distances between points indicate closer semantic relationships. The visualization shows clusters for randomly selected categories on the MIND dataset.

Figure 7 contains multiple scatter plots (a–f), each showing item embeddings colored by their assigned semantic category for a specific token position.
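
As a rough illustration of how such a plot can be produced, the sketch below projects the embeddings of one palette position into two dimensions with t-SNE and colors each point by its assigned semantic code. The array names (`embeddings`, `codes`) are hypothetical, and the actual figure in the paper may have been generated differently.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_token_position(embeddings, codes, position, categories):
    """2-D t-SNE view of one semantic-token position.

    embeddings: (num_items, dim) dense embeddings for this palette position.
    codes:      (num_items,) semantic code assigned at this position.
    categories: the handful of code values (clusters) to display.
    """
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for cat in categories:
        mask = codes == cat
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=f"Category {cat}")
    plt.legend()
    plt.title(f"t-SNE of semantic token position {position}")
    plt.show()

# e.g. plot_token_position(palette_embeddings[:, 0, :], codes[:, 0],
#                          position=0, categories=[197, 50])
```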

Key observations from Figure 7:

  1. RQ-VAE's Hierarchical Nature:

    • RQ-VAE's initial token (e.g., Figure 7a-d, though not explicitly labeled as token 1 vs. 2 etc.) tends to show more distinct and globally coherent clusters. This reflects its design where the first layer captures broad, coarse-grained semantic information.
    • However, RQ-VAE's subsequent tokens are based on residual quantization. This can lead to a dispersion of points that belong to the same initial broad category but are refined differently by later residual layers. The visualization shows that while RQ-VAE might form clusters, the overall organization can be less clear for later tokens compared to LAMIA.
  2. LAMIA's Multi-Aspect Parallelism:

    • LAMIA displays clear, focal clusters for each semantic token (e.g., Category 197 in Figure 7e, Category 50 in Figure 7f). The caption states, "Our LAMIA is designed to generate multi-aspect semantic tokens, with each token at different positions capturing a distinct aspect of item semantics. This design is reflected in the clustering behavior observed within certain categories."

    • This means that for LAMIA, if an item has multiple important semantic aspects (e.g., science, environment, politics), it will likely contribute to distinct clusters in different palette positions (different semantic tokens). For example, an item might fall into "Category 197" for its first aspect (e.g., "topic"), and "Category 50" for its second aspect (e.g., "sentiment").

    • The t-SNE visualization for LAMIA supports the idea that its palette embeddings are semantically parallel and independent. Each token position appears to effectively cluster items based on a specific, distinct semantic facet, rather than merely refining a single dominant aspect. This allows for a more comprehensive and flexible representation of an item's complex semantics.

      In summary, the t-SNE visualizations visually confirm the fundamental architectural differences: RQ-VAE exhibits a hierarchical clustering behavior influenced by its residual design, while LAMIA successfully creates independent clusters for different semantic tokens, each representing a distinct aspect of item semantics, thus validating its multi-aspect item palette approach.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LAMIA (Learning Multi-Aspect Item Palette), a novel and robust framework for semantic tokenization tailored for generative recommendation. LAMIA addresses critical limitations of existing RQ-VAE-based methods, which struggle with multi-faceted item representations and training stability. Its core innovation lies in learning an "item palette"—a collection of independent, semantically parallel embeddings that capture multiple distinct aspects of an item's content.

Key contributions include:

  • A novel architecture with a block-wise input scheme and hierarchical attention masking to effectively compress variable-length content into a multi-aspect palette.

  • Domain-specific tuning through text-based reconstruction tasks, ensuring the learned palette embeddings are highly representative of the recommendation domain.

  • Integration of intra-palette and inter-palette contrastive losses to enforce independence among aspects and prevent code collapse, leading to more stable and distinct semantic tokens.

  • A simplified, training-free quantization process using standard clustering algorithms like K-Means, bypassing the complexities and instabilities of differentiable vector quantization.

    Extensive experiments on MIND, Amazon CDs, and H&M datasets demonstrate LAMIA's superior performance, achieving significant improvements in Recall and NDCG compared to both traditional ID-based and RQ-VAE-based methods. Ablation studies further confirm the efficacy of each component, and analysis of semantic identifier length highlights LAMIA's ability to scale effectively with increased representational capacity.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Time-consuming Training: The self-supervised fine-tuning of Large Language Models using a text-level reconstruction task can be time-consuming. This is a common challenge with LLM training, especially when domain-specific adaptation is involved.

    As future work, the authors plan to:

  • Expedite Palette Learning: Explore strategies to accelerate the multi-aspect palette learning process. This could involve more efficient fine-tuning techniques (beyond LoRA), optimized training schedules, or novel architectural designs that reduce computational overhead.

7.3. Personal Insights & Critique

7.3.1. Inspirations and Applications

This paper offers several profound inspirations:

  • Beyond Single-Aspect Representations: The explicit modeling of multi-aspect item information is a crucial step forward. Many real-world entities are inherently multi-faceted, and forcing them into a single dominant semantic vector, even hierarchically refined, is an oversimplification. This approach can be applied to other domains beyond recommendation, such as knowledge graph construction (where entities have multiple properties), image retrieval (where images can be described by different visual aspects), or document summarization (where a document might cover several topics).
  • Domain Adaptation via Text-level Tasks: The use of text-based reconstruction for domain-specific tuning is highly valuable. Instead of relying solely on generic LLM embeddings, actively tuning the LLM to reconstruct relevant domain attributes forces it to learn representations that are meaningful within that specific context. This approach could be generalized to adapt LLMs for various specialized NLP tasks by designing appropriate text-level reconstruction/prediction objectives.
  • Robust Quantization Strategy: LAMIA's decoupling of dense embedding learning from discrete token assignment is elegant. By using contrastive learning to ensure high-quality dense embeddings and then employing a simple clustering method, it sidesteps the notorious code collapse and training instability issues of RQ-VAE. This principle could inspire more robust quantization schemes in other areas, such as speech recognition or computer vision, where converting continuous data into discrete tokens is beneficial.
  • Improved Generative Recommendation: For generative recommenders, providing multi-aspect semantic tokens as input allows for richer and potentially more diverse generated recommendations, moving beyond mere popularity or basic content similarity. This could lead to more serendipitous and satisfying recommendations.

7.3.2. Potential Issues, Unverified Assumptions, and Areas for Improvement

  • Subjectivity of "Aspects": While LAMIA aims to capture multi-aspect information, the definition and number of "aspects" ($L$, the palette size) are manually defined hyperparameters. The qualitative nature of what constitutes an "aspect" could be subjective. Does K-Means consistently find truly distinct, interpretable semantic aspects for each palette position, or does it sometimes capture arbitrary statistical variations? Further work could explore methods to dynamically determine the optimal number of aspects or to ensure the interpretability of each aspect.

  • Scalability of K-Means: While K-Means is simple and training-free, its performance can degrade in very high-dimensional spaces or on extremely large datasets. Although PCA is used for dimensionality reduction, for datasets with millions or billions of items, even K-Means on reduced dimensions can be computationally intensive, requiring efficient approximate clustering algorithms.

  • Computational Cost of LLMs: Despite using LoRA and a relatively smaller OPT-350M, LLM fine-tuning and inference still demand significant computational resources, as noted by the authors. Accelerating this process is critical for broader adoption, especially in real-time recommendation scenarios. Exploring knowledge distillation or more advanced PEFT methods could be beneficial.

  • H&M Dataset Performance: The slightly weaker performance on the H&M dataset compared to SASRec12L suggests a limitation where item content might be less discriminative or of "low quality" (as stated by authors). This highlights an area where LAMIA could potentially be improved by:

    • Incorporating collaborative filtering signals: EAGER's success by combining content and behavior suggests that even LAMIA could benefit from integrating behavioral embeddings into its item palette learning.
    • More sophisticated content encoders for ambiguous domains: Perhaps OPT-350M isn't fully capturing nuanced fashion distinctions, and specialized multimodal models (e.g., trained on fashion images and text) might yield better content representations for such domains.
  • Interpretability of Semantic Tokens: While LAMIA aims for independent aspects, further work could focus on making these semantic tokens more human-interpretable. Can we label what "aspect 1" or "category 197" represents? This would greatly enhance trust and explainability in recommendation.

    LAMIA presents a compelling paradigm shift in semantic tokenization for generative recommendation. Its strengths lie in explicitly tackling the multi-aspect nature of items and ensuring robust representation learning. Addressing its current limitations, particularly around computational efficiency and domain-specific content challenges, would further solidify its impact.
