
RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation

Published: 09/03/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents RecBase, a domain-agnostic foundational model for zero-shot recommendation, overcoming cross-domain limitations of existing LLMs. By pretraining with a recommendation-oriented objective and leveraging a large, heterogeneous corpus, RecBase improves zero-shot and cross-domain recommendation performance, matching or surpassing much larger LLM baselines.

Abstract

Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains. To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization. To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation

1.2. Authors

The paper is authored by Sashuai Zhou, Weinan Gan, Qijiong Liu, Ke Lei, Jieming Zhu, Hai Huang, Yan Xia, Ruiming Tang, Zhenhua Dong, and Zhou Zhao. Their affiliations include Zhejiang University, Huawei Noah's Ark Lab, Shanghai AI Lab, and The Hong Kong Polytechnic University. The multiple affiliations, especially including a major tech company's research lab (Huawei Noah's Ark Lab) and prominent universities, suggest a collaborative effort between academia and industry, often indicative of practical relevance and substantial research resources.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (https://arxiv.org/abs/2509.03131v1). As a preprint, it has not yet undergone formal peer review or been published in a specific journal or conference proceedings. However, arXiv is a widely respected platform for disseminating cutting-edge research in computer science, and papers published here often represent significant advancements that are later submitted to top-tier conferences (e.g., KDD, SIGIR, WWW, NeurIPS, ICML) or journals in the field of recommender systems and artificial intelligence.

1.4. Publication Year

2025 (indicated by the publication date 2025-09-03T08:33:43.000Z).

1.5. Abstract

The paper introduces RecBase, a domain-agnostic foundational model specifically designed for zero-shot and cross-domain recommendation tasks. It addresses the limitation of current large language model (LLM)-based recommendation systems, which suffer from a mismatch between their language-centric pretraining and the item-level dynamics required for effective recommendation. RecBase tackles this by leveraging a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings. A key innovation is the unified item tokenizer that encodes items into hierarchical concept identifiers, allowing for structured representation and efficient vocabulary sharing across domains. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. The authors claim their 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters on eight real-world datasets in zero-shot and cross-domain recommendation.

Official Source: https://arxiv.org/abs/2509.03131v1 PDF Link: https://arxiv.org/pdf/2509.03131v1.pdf Publication Status: This is a preprint, meaning it has been publicly shared but has not yet undergone formal peer review for publication in a journal or conference.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limited cross-domain generalization ability of existing large language model (LLM)-based recommendation systems. While LLMs have shown remarkable capabilities in zero-shot learning and multi-task unification across various domains, their direct application to recommender systems (often termed LLM-based recommendation) faces a fundamental mismatch.

This problem is important because an ideal foundational model for recommender systems should be capable of addressing diverse recommendation tasks across different domains and performing effectively in zero-shot and few-shot (e.g., cold-start) settings. However, current LLM-based approaches face several challenges:

  1. Input Representation Mismatch: Recommendation data, which often involves structured item information and user interaction sequences, is usually mapped into language modalities (e.g., item descriptions). This language-centric representation may not effectively capture dynamic, item-level user interests and sequential patterns.

  2. Knowledge Gap: The pretraining of LLMs is primarily focused on language understanding, leading to a knowledge gap when applied to recommendation tasks. They struggle with modeling intricate item-item co-relationships, which are crucial for effective zero-shot recommendations.

  3. Model Alignment Issues: Fine-tuning general-purpose LLMs to recommendation tasks using downstream datasets can compromise their zero-shot and cross-domain generalization abilities, as the fine-tuning might specialize them too much to specific domains or tasks.

    The paper's entry point and innovative idea is to bridge this gap by pretraining a foundational model (RecBase) from scratch, specifically tailored for recommendation. Instead of relying on LLMs' language-level knowledge for item relationships, RecBase uses LLMs solely as encoders for unified semantic representation and then models item-item relationships through generative pretraining on large-scale, open-domain, recommendation-oriented item sequence data. This shifts the focus from language understanding to direct item-level sequence modeling.

2.2. Main Contributions / Findings

The paper makes several key technical contributions:

  1. Data Collection and Representation: They compile a large-scale, open-domain recommendation dataset spanning 15 different domains (4.5M items and 35M interactions). They uniformly extract textual representations of items to serve as a data source for pretraining across various domains, ensuring a consistent input format.

  2. Unified Item Tokenizer: Instead of traditional ID-based or verbose language-based modeling, they propose a general unified item tokenizer. This tokenizer encodes each item into multi-level concept IDs, learned in a coarse-to-fine manner inspired by curriculum learning. This hierarchical encoding facilitates semantic alignment, reduces vocabulary size, and enables effective knowledge transfer across diverse domains.

  3. Autoregressive Pretraining: The model is trained using an autoregressive modeling paradigm. This objective predicts the next token (concept ID) in a sequence, allowing the model to capture complex item-level sequential patterns and learn item co-relationships within a unified concept token space. This approach enhances the model's generalization in zero-shot and cross-domain settings.

    The key findings are:

  • The RecBase model (specifically, RecBase-1.5B) consistently matches or surpasses the performance of LLM baselines up to 7 billion parameters in zero-shot and cross-domain recommendation tasks across eight real-world datasets. This demonstrates the effectiveness of a recommendation-oriented pretraining strategy.

  • The unified item tokenizer with Curriculum Learning Enhanced RQ-VAE (CL-VAE) effectively creates a well-distributed concept ID space, mitigating codebook collapse and improving cross-domain generalization.

  • RecBase-0.3B, a smaller version of the model, also delivers competitive results at a substantially lower computational cost, highlighting its efficiency.

  • While focused on zero-shot, RecBase shows significant further performance improvements with in-domain fine-tuning, indicating its adaptability.

  • RecBase demonstrates superior inference efficiency compared to other state-of-the-art models, attributing this to its specialized ID vocabulary space.

    These findings solve the problem of limited cross-domain generalization in LLM-based recommenders by providing a model specifically designed to understand item-level semantics and relationships, rather than relying solely on language-level knowledge.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand RecBase, a reader should be familiar with the following concepts:

  • Recommender Systems (RS): These are information filtering systems that aim to predict the "rating" or "preference" a user would give to an item. They are ubiquitous in e-commerce, streaming services, and social media, helping users discover content (products, movies, music, news) they might be interested in. A common type relevant here is sequential recommendation, where the system predicts the next item a user will interact with based on their past sequence of interactions.

  • Large Language Models (LLMs): These are neural networks, typically based on the Transformer architecture, trained on massive amounts of text data. They learn to predict the next word in a sequence, allowing them to understand, generate, and process human language. LLMs exhibit powerful capabilities such as:

    • Zero-shot learning: Performing tasks they were not explicitly trained on, by leveraging their vast pre-trained knowledge.
    • Multi-task unification: Handling various tasks (e.g., question answering, summarization, translation) within a single model.
    • Multi-domain generalization: Adapting to diverse domains (e.g., medical, legal, creative writing) without extensive domain-specific fine-tuning.
  • Zero-Shot/Few-Shot Recommendation:

    • Zero-shot recommendation refers to the ability of a model to make accurate recommendations in a new domain or for new items/users without any prior training data from that specific domain, or for those specific items/users. The model leverages knowledge acquired from other domains or general item properties.
    • Few-shot recommendation is a related concept where the model can learn to make recommendations with only a very small amount of specific training data (e.g., a few interactions for a new user or item, also known as cold-start scenarios).
  • Autoregressive Models: An autoregressive model is a type of statistical model where the output variable depends linearly on its own previous values. In deep learning, particularly with Transformers, an autoregressive model predicts the next token in a sequence based on all preceding tokens. For example, in language modeling, it predicts the next word given all the words that came before it. This property is crucial for generating sequences.

  • Variational Autoencoders (VAEs): VAEs are generative models that learn a compressed, continuous latent space representation of input data. They consist of two main parts:

    • Encoder: Maps input data (e.g., an image) into a distribution (mean and variance) in the latent space.
    • Decoder: Samples from this latent space distribution and reconstructs the original input data. The goal is to learn a meaningful latent representation that captures the underlying structure of the data, allowing for generation of new, similar data points.
  • Vector Quantized Variational Autoencoder (VQ-VAE): A variant of VAEs that introduces a discrete latent space. Instead of a continuous latent vector, the encoder outputs a discrete code (an index into a codebook). The codebook is a learned set of embedding vectors, and the encoder's output is "snapped" to the nearest vector in this codebook. This forces the latent representations to be categorical, which is beneficial for tasks like discrete token generation (e.g., text, images). The codebook size determines the number of discrete representations available.

  • Residual Quantized Variational Autoencoder (RQ-VAE): An extension of VQ-VAE that introduces hierarchical quantization. Instead of quantizing the entire latent vector at once, RQ-VAE quantizes it in multiple stages. At each stage (or level), it quantizes the residual (the difference between the original latent vector and the quantized vector from the previous level). This allows for a more fine-grained and multi-scale representation. Each level has its own codebook, and the final discrete representation for an input is a tuple of codes (one from each level). This hierarchical approach can capture information at different granularities and potentially compress information more efficiently.

  • Curriculum Learning: An approach to machine learning where a model is trained on progressively more difficult examples or tasks. Inspired by how humans learn, it starts with simpler concepts to build a strong foundation before moving to complex ones. In the context of RQ-VAE, this means gradually introducing higher levels of quantization or more complex feature learning, rather than trying to optimize everything at once.

  • t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction technique used for visualizing high-dimensional data. It maps high-dimensional data points to a lower-dimensional space (typically 2D or 3D) in such a way that similar points in the high-dimensional space are modeled by nearby points in the low-dimensional space, and dissimilar points are modeled by distant points. This makes it useful for visualizing clusters and relationships within complex datasets.

3.2. Previous Works

The paper discusses previous work in LLM-based recommendation and item representation for recommendation.

LLM-based Recommendation:

  • Item Scoring and Generation:
    • Item Scoring: Models like M6-Rec (Cui et al., 2022), Prompt4NR (Zhang and Wang, 2023), TabLLM (Hegselmann et al., 2023), and TALLRec (Bao et al., 2023) transform user-item data into natural language representations (e.g., generating item descriptions for scoring or reframing as cloze-style prediction). ONCE (Liu et al., 2024a) offers a generative framework for content-based recommendation, and CLLM4Rec (Zhu et al., 2024b) focuses on collaboration.
    • Item Generation: Approaches like GPT4Rec (Petrov and Macdonald, 2023), P5 (Geng et al., 2022), EAGER (Wang et al., 2024a), and EAGER-LLM (Hong et al., 2025) use generative models to predict the next item based on user behavior. DiffuRec incorporates uncertainty into sequential recommendations, and GIRL (Zheng et al., 2023) applies LLMs to job recommendations.
  • General Limitations of LLM-based methods: The authors highlight that these methods struggle with the semantic gap between language-based representations (which LLMs are pre-trained on) and structured recommendation data (which focuses on item-item relationships and user preferences). This gap hinders their ability to model item-item co-relationships and achieve strong zero-shot generalization.

Item Representation for Recommendation:

  • Hierarchical Models: Works using graph neural networks (Li et al., 2020; Wang et al., 2021a) aggregate item information to refine user profiles and capture dependencies.
  • Contrastive Learning: Cross-view contrastive learning (Ma et al., 2022) models user-bundle and user-item interactions for better generalization.
  • Sequential Recommendation: Methods addressing item representation divergence (Peng et al., 2022) enhance learning efficiency in sequential tasks.
  • Semantic IDs: The use of Semantic IDs (Rajput et al., 2023; Zheng et al., 2024; Wang et al., 2024b; Zhu et al., 2024a) for items, where generative retrieval frameworks predict the next item, has shown promise for generalization, especially in zero-shot scenarios.
  • General Limitations of Item Representation methods: These methods are often constrained by domain-specific data, limiting their ability to generalize across different contexts.

3.3. Technological Evolution

The field of recommender systems has evolved from traditional collaborative filtering and matrix factorization to deep learning based approaches, including sequential recommendation models that capture user preferences over time. More recently, the success of Large Language Models (LLMs) in natural language processing has inspired their application to recommendation tasks, leading to the LLM-based recommendation paradigm.

Initially, LLMs were seen as powerful knowledge sources that could be prompted to generate recommendations or augment existing recommender systems. This involved translating recommendation tasks into natural language prompts. However, as noted by the authors, directly applying language-centric LLMs to item-centric recommendation tasks introduced a semantic gap. LLMs are excellent at understanding and generating human language, but they are not inherently designed to capture the nuanced, dynamic relationships between items or the sequential patterns of user interactions in a discrete item space.

This paper's work, RecBase, represents a further evolution by moving beyond merely adapting existing LLMs. It proposes building a foundational model specifically for recommendation from scratch. This model shifts the focus from language-level knowledge to item-level sequential patterns and unified discrete item representations. It leverages the architectural strengths of LLMs (like Transformers for autoregressive modeling) but applies them to a novel representation space (hierarchical concept IDs) derived directly from item features, rather than general natural language tokens. This positions RecBase as a generative foundation model tailored to the unique challenges of recommendation, especially for zero-shot and cross-domain settings.

3.4. Differentiation Analysis

Compared to the main methods in related work, RecBase introduces several core differences and innovations:

  • Pretraining Objective:

    • Existing LLM-based Recs: Primarily rely on pretraining on vast text corpora for language understanding, then adapt this knowledge to recommendation (e.g., through prompting or fine-tuning). The core knowledge is language-level.
    • RecBase: Pretrains from scratch with a recommendation-oriented objective. It models item-item relationships through generative pretraining directly on large-scale, open-domain, item sequence data. It uses LLMs solely as encoders for semantic representation, not as the primary source of recommendation knowledge.
  • Item Representation:

    • Existing LLM-based Recs: Often map recommendation data into natural language modalities. This can be verbose and may not effectively represent user sequences or item-item co-relationships. ID-based models lack semantics.
    • RecBase: Introduces a unified item tokenizer that encodes items into multi-level concept IDs. This is a novel structured representation that offers several advantages:
      • Semantic Alignment: Bridges the semantic gap by converting diverse item descriptions into a common, structured ID space.
      • Efficiency: Avoids the verbosity of language-based representations and the lack of semantics in raw ID-based models.
      • Vocabulary Sharing: The hierarchical nature and curriculum learning facilitate efficient sharing of vocabulary across domains, promoting generalization.
  • Generalization Mechanism:

    • Existing LLM-based Recs: Their zero-shot generalization mostly stems from their broad language knowledge. However, this often falls short when specific item-level, dynamic user interests and cross-domain item semantics are critical. Fine-tuning for downstream tasks can also reduce zero-shot capabilities.
    • RecBase: Achieves zero-shot and cross-domain generalization by:
      • Domain-agnostic pretraining on a heterogeneous corpus.
      • Unified textual representations and feature mappings.
      • Hierarchical concept IDs (learned via CL-VAE) that align item semantics across domains.
      • Autoregressive modeling on these concept ID sequences, directly learning item co-relationships in a unified space.
  • Computational Efficiency:

    • Existing LLM-based Recs: Often use large LLMs (e.g., 7B, 13B parameters or more) which are computationally expensive for inference.

    • RecBase: A 1.5B-parameter model is shown to outperform LLMs up to 7B parameters, and a 0.3B-parameter version is competitive. This indicates a more efficient design tailored for recommendation tasks, particularly due to its specialized ID vocabulary space compared to general-purpose LLMs.

      In essence, RecBase differentiates itself by creating a dedicated foundational model for recommendation that explicitly addresses the unique data modalities and learning objectives of recommender systems, rather than attempting to force general language models into this specialized role.

4. Methodology

4.1. Principles

The core idea behind RecBase's methodology is to develop a domain-agnostic foundational model for recommendation by moving away from language-centric pretraining, which often mismatches the recommendation task. Instead, RecBase aims to capture dynamic, item-level user interests and sequential patterns directly. This is achieved through two main stages:

  1. Unified Discretization: Item representations are first mapped into a unified, discrete concept ID space. This step leverages a specialized variant of Residual Quantized Variational Autoencoder (RQ-VAE) enhanced with Curriculum Learning (CL-VAE) to generate hierarchical concept IDs for each item. This ensures a consistent and semantically rich representation across diverse domains.
  2. Autoregressive Modeling: Once items are represented as sequences of these concept IDs, an autoregressive model is trained. This model learns to predict the next concept ID in a user's interaction sequence, thereby capturing complex item-level sequential patterns and inter-item dependencies. This generative approach enables effective zero-shot and cross-domain recommendation by directly modeling user behavior in a semantically aligned discrete space.

4.2. Core Methodology In-depth

4.2.1. Preliminary: Residual Quantized Variational Autoencoder (RQ-VAE)

To enable large language models to learn user behavior based on item history, continuous item semantic embeddings must be transformed into discrete, unified tokens. Residual Quantized Variational Autoencoder (RQ-VAE) (Lee et al., 2022) is employed for this purpose.

Given an input embedding $e \in \mathbb{R}^d$ (representing an item's semantic features), RQ-VAE first uses an encoder $\mathcal{E}$ to map this input into a latent space, producing a continuous latent representation $z$:
$$z := \mathcal{E}(e)$$
Here, $e$ is the input item embedding (a continuous vector of dimension $d$), $z$ is the continuous latent representation generated by the encoder, and $\mathcal{E}$ is the encoder function.

RQ-VAE extends the Vector Quantized Variational Autoencoder (VQ-VAE) by introducing hierarchical quantization, meaning that quantization occurs over multiple successive levels. At each level $d$ (here $d$ indexes the quantization level, starting from 0), a residual $r_d$ is quantized. The residual is the portion of the latent representation that has not yet been represented by the previous quantization levels. This residual $r_d$ is mapped to the nearest embedding $e_{c_d}$ within a level-specific codebook $C_d$, a collection of $K$ learnable embedding vectors $C_d := \{e_k\}_{k=1}^K$. The nearest embedding is selected by minimizing the Euclidean distance:
$$c_d = \arg\min_k \| r_d - e_k \|$$
In this equation, $c_d$ is the index of the chosen codebook vector at level $d$, $r_d$ is the current residual being quantized, $e_k$ is the $k$-th embedding vector in codebook $C_d$, and $\|\cdot\|$ denotes the Euclidean norm. The equation finds the codebook vector $e_k$ closest to the residual $r_d$; $c_d$ is the identifier (index) of that closest vector.

After a codebook vector $e_{c_d}$ is selected, the residual for the next level, $r_{d+1}$, is computed by subtracting the selected codebook vector from the current residual, so that only the unrepresented portion of the information is passed to the next quantization level:
$$r_{d+1} := r_d - e_{c_d}$$
Here, $r_{d+1}$ is the residual passed to the next level, $r_d$ is the residual at the current level, and $e_{c_d}$ is the codebook vector chosen at the current level.

This process is repeated recursively $m$ times, generating a tuple of $m$ codewords (concept IDs) that collectively form the Semantic ID of the input item. This tuple approximates the input embedding from coarse to fine granularity. Using a separate codebook at each level allows for varying granularities of representation, as the norms of the residuals decrease with each successive quantization step.
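
For concreteness, the residual quantization loop described above can be sketched in a few lines of NumPy. This is a minimal illustration with assumed shapes and names (the encoder and learned codebooks are taken as given), not the authors' implementation.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize a latent vector z into m concept IDs using per-level codebooks.

    z:         (d,) continuous latent produced by the encoder.
    codebooks: list of m arrays of shape (K, d), the learnable code vectors
               e_k of each level-specific codebook C_d.
    Returns the tuple of codeword indices (the item's Semantic ID) and the
    quantized approximation (sum of the selected code vectors).
    """
    residual = z.copy()
    ids = []
    quantized = np.zeros_like(z)
    for codebook in codebooks:                      # levels d = 0 .. m-1
        dists = np.linalg.norm(residual - codebook, axis=1)
        c = int(np.argmin(dists))                   # c_d = argmin_k ||r_d - e_k||
        ids.append(c)
        quantized += codebook[c]
        residual = residual - codebook[c]           # r_{d+1} = r_d - e_{c_d}
    return tuple(ids), quantized

# Illustrative sizes: 4 levels, codebook size 2048, latent dimension 64.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(2048, 64)) for _ in range(4)]
semantic_id, z_hat = residual_quantize(rng.normal(size=64), codebooks)
print(semantic_id)   # one concept ID per level
```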

4.2.2. Unified Feature Representation Space: Curriculum Learning Enhanced RQ-VAE (CL-VAE)

To ensure a consistent and structured representation of items across different domains, items are first standardized into a unified textual format. This allows for uniform processing and conversion into feature embeddings using a shared encoder (e.g., NV-Embed-v2).

A crucial challenge for generalizable recommendation models using RQ-VAE is codebook collapse. This occurs when most inputs, regardless of their domain, are mapped to only a small subset of codebook vectors, leading to uneven distribution of IDs in the latent space. This is problematic for zero-shot scenarios, as a new item might be encoded into an unencountered token, degrading performance. To address this, the paper proposes Curriculum Learning Enhanced RQ-VAE (CL-VAE).

RecBase Model Structure

Figure 1: RecBase model structure. (a) illustrates the RecBase model's use of unified item representations and autoregressive sequence prediction. (b) details the curriculum process for optimizing codebook learning within CL-VAE. (c) demonstrates how the autoregressive model leverages discretized concept IDs to predict the next item in the sequence, effectively capturing item relationships for recommendation.

As illustrated in Figure 1b, CL-VAE incorporates curriculum learning by progressively training the RQ-VAE codebooks. The core idea is to train the model from simple to complex tasks. The hierarchical structure of RQ-VAE naturally aligns with this principle. The training proceeds in stages for different levels of the codebook:

  1. Initially, only the first layer's codebook is trained for $n$ epochs to learn basic feature representations.
  2. Once the loss stabilizes, the second layer's codebook is added and trained.
  3. This process continues, adding one level at a time, until all $m$ levels are trained. This staged approach reduces initial training complexity, enhances convergence stability, and helps CL-VAE effectively map item representations from diverse domains into a unified concept ID space, improving generalization.

To further mitigate codebook collapse, particularly for sparsely utilized codebook vectors, CL-VAE includes a reinitialization mechanism. If the usage rate of a first-level codebook vector falls below a certain threshold, it is reinitialized. This provides new optimization starting points for low-level features, preventing collapse and enhancing overall model performance.
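The curriculum schedule and the reinitialization check described above can be sketched as follows. This is a minimal illustration with assumed helper names, epoch granularity, and usage threshold, not the authors' training code.

```python
import torch
import torch.nn as nn

def active_levels(epoch, epochs_per_level, max_levels):
    """Curriculum schedule: start with only the first codebook level and add
    one more level every `epochs_per_level` epochs until all are active."""
    return min(max_levels, 1 + epoch // epochs_per_level)

def reinit_rare_codes(codebook: nn.Embedding, usage_counts: torch.Tensor,
                      threshold: float = 1e-3):
    """Reinitialize first-level codebook vectors whose usage rate falls below
    `threshold`, giving collapsed codes a fresh optimization starting point."""
    usage_rate = usage_counts.float() / usage_counts.sum().clamp(min=1)
    rare = usage_rate < threshold
    with torch.no_grad():
        codebook.weight[rare] = torch.randn_like(codebook.weight[rare]) * 0.02
    return int(rare.sum())

# Example: with 4 levels and (say) 5 epochs per stage, epoch 12 trains 3 levels.
print(active_levels(epoch=12, epochs_per_level=5, max_levels=4))  # -> 3
```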

The training process for CL-VAE involves several loss components:

  • Reconstruction Loss ($\mathcal{L}_{\mathrm{R}}$): Measures how well the decoder can reconstruct the original input item embedding from its quantized latent representation. It is typically a mean squared error:
    $$\mathcal{L}_{\mathrm{R}} := \| x - \hat{x} \|^2$$
    Here, $x$ is the original input item embedding and $\hat{x}$ is the reconstructed embedding output by the decoder; the squared difference measures the reconstruction error.
  • Codebook Loss ($\mathcal{L}_{\mathrm{Q}}$): This loss term is adopted directly from RQ-VAE and VQ-VAE. It comprises two parts:
    1. A term that pulls the codebook embeddings $e_{c_d}$ toward the encoder's residuals $r_d$ (using stop-gradient on $r_d$ to prevent gradients from flowing to the encoder).
    2. A commitment term that encourages the encoder's residuals $r_d$ to "commit" to the chosen codebook embeddings $e_{c_d}$ (using stop-gradient on $e_{c_d}$ to prevent gradients from flowing to the codebook).
    $$\mathcal{L}_{\mathrm{Q}} := \sum_{d=0}^{m-1} \| \mathrm{sg}[r_d] - e_{c_d} \|^2 + \beta \| r_d - \mathrm{sg}[e_{c_d}] \|^2$$
    In this formula, $d$ iterates over the $m$ codebook levels, $r_d$ is the residual produced at level $d$, and $e_{c_d}$ is the codebook vector selected at that level. $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, through which no gradients are propagated, and $\beta$ is a hyperparameter controlling the strength of the commitment term. The first term updates the codebook embeddings, while the second encourages the encoder's output to stay close to the selected codebook entries.
  • Entropy Loss ($\mathcal{L}_{\mathrm{E}}$): An additional penalty designed to promote more diverse utilization of the codebook vectors, counteracting codebook collapse by encouraging uniform usage across all entries:
    $$\mathcal{L}_{\mathrm{E}} := - \sum_{d=0}^{m-1} \sum_{j=1}^{K} p_{d,j} \log p_{d,j}$$
    Here, $p_{d,j}$ is the usage frequency of the $j$-th codebook vector at level $d$. This is the entropy of the usage distribution, which is maximized when usage is uniform across all $j$ and $d$; the regularizer therefore pushes the codebooks toward more diverse utilization.

The total loss for CL-VAE is a weighted sum of these components:
$$\mathcal{L}(x) := \mathcal{L}_{\mathrm{R}} + \mathcal{L}_{\mathrm{Q}} + \gamma \mathcal{L}_{\mathrm{E}}$$
where $\gamma$ is a hyperparameter balancing the importance of the entropy term. These loss terms jointly train the encoder, decoder, and codebooks.
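
The three loss terms can be written compactly in PyTorch-style pseudocode, with the stop-gradient realized via `detach()`. This is a minimal sketch under assumed tensor shapes and placeholder weights; the sign convention for the entropy regularizer follows the stated intent of encouraging uniform codebook usage rather than a confirmed implementation detail.

```python
import torch
import torch.nn.functional as F

def cl_vae_loss(x, x_hat, residuals, codes, usage_probs, beta=0.25, gamma=0.1):
    """x, x_hat:       original and reconstructed item embeddings.
    residuals[d]:      encoder residual r_d at level d.
    codes[d]:          selected codebook vector e_{c_d} at level d.
    usage_probs[d]:    empirical usage frequencies p_{d,j} of the level-d codebook.
    beta, gamma:       placeholder loss weights (not the paper's values)."""
    # Reconstruction loss ||x - x_hat||^2
    loss_r = F.mse_loss(x_hat, x)

    # Codebook + commitment loss, with stop-gradient via detach()
    loss_q = x.new_zeros(())
    for r_d, e_d in zip(residuals, codes):
        loss_q = loss_q + F.mse_loss(e_d, r_d.detach()) \
                        + beta * F.mse_loss(r_d, e_d.detach())

    # Entropy regularizer: penalize low entropy of codebook usage so that
    # minimizing the total loss pushes usage toward uniform (sign assumed).
    neg_entropy = x.new_zeros(())
    for p in usage_probs:
        neg_entropy = neg_entropy + (p * (p + 1e-12).log()).sum()

    return loss_r + loss_q + gamma * neg_entropy
```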

t-SNE Clustering Visualization

Figure 2: Visualization of t-SNE Clustering in the ID Space: (a) Discrete ID space learned by RQ-VAE, (b) Discrete ID space learned by CL-VAE.

Figure 2 visually demonstrates the effectiveness of CL-VAE. In Figure 2a, the discrete ID space learned by a traditional RQ-VAE shows features from different domains as largely independent and minimally interacting. In contrast, Figure 2b, representing the discrete ID space learned by CL-VAE, shows increased overlap and interaction between features from diverse datasets. This indicates that CL-VAE successfully maps items into a more unified and better-distributed concept space, which is crucial for cross-domain generalization.

4.2.3. Autoregressive Modeling

After the unified discretization process, each item is represented as an $m$-bit semantic ID, i.e., a sequence of $m$ concept IDs. A user's interaction history is then converted into a chronological sequence of these $m$-bit concept IDs. If an item $i$ has semantic ID $s_i$, its $m$-bit representation is $s_i = (s_i^1, s_i^2, \ldots, s_i^m)$, where $s_i^j$ is the $j$-th bit (the concept ID at level $j$) of that item's semantic ID.
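
Converting a user's history into model input is then a simple flattening of per-item semantic IDs into one token sequence, as in this small sketch (function and variable names are illustrative):

```python
def flatten_history(item_sequence, semantic_ids):
    """Turn a chronological item sequence into the flat concept-ID token
    sequence (s_1^1, ..., s_1^m, s_2^1, ..., s_2^m, ...) fed to the model.

    item_sequence: list of item identifiers in interaction order.
    semantic_ids:  mapping item -> tuple of its m concept IDs.
    """
    tokens = []
    for item in item_sequence:
        tokens.extend(semantic_ids[item])
    return tokens

# Example with m = 4 levels
semantic_ids = {"movie_a": (12, 7, 1930, 5), "movie_b": (12, 88, 4, 201)}
print(flatten_history(["movie_a", "movie_b"], semantic_ids))
# [12, 7, 1930, 5, 12, 88, 4, 201]
```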

RecBase Model Structure

Figure 1(c): the autoregressive model leverages discretized concept IDs to predict the next item in the sequence, effectively capturing item relationships for recommendation.

As depicted in Figure 1c, an autoregressive model is trained to predict the next ID in a sequence. Given a sequence of historical interactions $S = (s_1, s_2, \ldots, s_n)$ (where each $s_t$ is an $m$-bit semantic ID), the model takes the preceding sequence $S_{<t}$ as input and outputs a probability distribution over each bit of the next ID $s_t$. The prediction of the entire $m$-bit concept ID $s_t$ is factorized into per-bit predictions, conditioned on the previous bits and the interaction history:
$$P(s_t \mid S_{<t}) = \prod_{j=1}^m P(s_t^j \mid s_t^{<j}, S_{<t})$$
Here, $P(s_t \mid S_{<t})$ is the probability of the entire $m$-bit semantic ID $s_t$ given the historical sequence $S_{<t}$, and $P(s_t^j \mid s_t^{<j}, S_{<t})$ is the probability of the $j$-th bit of $s_t$ given its preceding bits $s_t^{<j}$ and the full history $S_{<t}$. In other words, for each new item the model predicts its first concept-ID bit, then its second conditioned on the first, and so on.

The model is trained using the negative log-likelihood (NLL) loss, which maximizes the probability of the observed ground-truth concept IDs:
$$\mathcal{L} = - \sum_{t=1}^n \sum_{j=1}^m \log P(s_t^{j*} \mid s_t^{<j*}, S_{<t})$$
In this formula, $s_t^{j*}$ is the ground-truth $j$-th bit of the $t$-th item in the sequence, and $s_t^{<j*}$ denotes its ground-truth preceding bits. The sum runs over all items in the sequence ($t = 1$ to $n$) and all bits within each item's concept ID ($j = 1$ to $m$). Minimizing this loss trains the model to accurately predict the subsequent concept IDs in user interaction sequences.

During inference, the trained autoregressive model generates the next item's concept ID bit by bit. These concept IDs are treated as tokens in the model's vocabulary. The model takes a user's historical interaction sequence as input and outputs logits (raw prediction scores) for the predicted item concept IDs. These logits represent the joint probability distribution over the possible concept IDs for the next item. By comparing these probabilities, the model identifies the item a user is most likely to engage with, based on their past behavior. This structured prediction process allows RecBase to rank items and generate recommendations effectively.
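
To make the inference procedure concrete, the sketch below scores one candidate item by summing the log-probability of each of its $m$ concept-ID bits, conditioned on the user's history and the bits generated so far. The model interface (a Hugging Face-style causal LM returning `.logits`) and the function name are assumptions for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def score_candidate(model, history_tokens, candidate_bits):
    """Return sum_j log P(s^j | s^{<j}, history) for one candidate item.

    model:          autoregressive LM over the concept-ID vocabulary.
    history_tokens: 1-D LongTensor of the user's flattened m-bit item history.
    candidate_bits: the m concept IDs of the candidate item.
    """
    tokens = history_tokens.clone()
    log_prob = 0.0
    for bit in candidate_bits:
        logits = model(tokens.unsqueeze(0)).logits[0, -1]        # next-token logits
        log_prob += torch.log_softmax(logits, dim=-1)[bit].item()
        tokens = torch.cat([tokens, tokens.new_tensor([bit])])   # condition on this bit
    return log_prob

# Ranking: score every candidate's semantic ID this way and sort descending;
# the top-scoring items form the recommendation list.
```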

5. Experimental Setup

5.1. Datasets

The RecBase model is pretrained on a large-scale corpus compiled from 15 diverse training datasets across various domains and is evaluated on 8 additional cross-domain datasets.

Training and test datasets distribution

Figure 4: Training and test datasets distribution. The chart illustrates the domain distribution for both training and test datasets, showing a mix of categories like movie, video, goods, news, and hotel.

The following are the results from Table 2 of the original paper:

| | Item size | User size | History Avg. length |
| --- | --- | --- | --- |
| Training datasets | 4,595,003 | 35,047,682 | 20.37 |
| Finetune datasets | 1,005,745 | 5,098,084 | 17.83 |
| Test datasets | 623,615 | 145,975 | 15.01 |

Table 2: Statistics of training, finetune, and test datasets.

Details of Datasets (from Appendix A):

  • Training datasets:

    • EBNeRD (Kruse et al., 2024) and PENS (Ao et al., 2021): News-related datasets. EBNeRD for news recommendation, PENS for personalized news headline generation.
    • PixelRec (Cheng et al., 2024) and KuaiRec (Gao et al., 2022): Short video recommendation datasets, containing user interactions with short videos and corresponding thumbnails.
    • Amazon Reviews 2023 (Hou et al., 2024a): Large-scale e-commerce review dataset with user feedback (ratings, reviews, helpful votes) and product metadata.
    • Amazon Review Dataset (various sub-datasets): Classic dataset of user evaluations on Amazon products. Sub-datasets such as Amazon Beauty, Amazon Books, Amazon Sports, and Amazon Toys focus on specific product categories.
    • Netflix dataset: A classic movie recommendation dataset with over 100 million user ratings for movies.
  • Evaluation datasets (unseen during pretraining):

    • MIND (Wu et al., 2020): Large-scale news recommendation dataset from Microsoft News user click logs, with rich article information (title, abstract, body, category).

    • MovieLens (Harper and Konstan, 2015): Classic movie recommendation dataset by GroupLens, containing numerous user movie ratings.

    • MicroLens (Ni et al., 2023): Large-scale, content-driven short video recommendation dataset with 1 billion user interactions and rich modality information.

    • Goodreads (Wan et al., 2019): Book recommendation dataset from Goodreads, with book metadata, user-book interactions, and detailed reviews (228 million interactions).

    • Yelp Open Dataset: Business, review, and user data from Yelp, including 160,000 businesses, 8.63 million reviews, and 200,000 images across eight metropolitan areas.

    • Steam Dataset: Multi-dimensional dataset based on Steam, covering game purchase records and playtime for millions of users.

    • H&M dataset: Product information, customer information, and transaction records provided by H&M.

    • HotelRec (Antognini and Faltings, 2020): Large-scale hotel recommendation dataset from TripAdvisor (50 million hotel reviews), the largest single-domain dataset with text reviews.

      These datasets were chosen for their diversity in domains and scale, making them suitable for validating the method's ability to generalize across different contexts and for zero-shot scenarios. The training datasets have a combined item size of 4,595,003 and user size of 35,047,682, with an average history length of 20.37. The test datasets include 623,615 items and 145,975 users.

Data Processing Details (from Appendix B):

  • Biases in Data Distribution and Modality Handling: The pretraining corpus was curated to achieve a balanced category distribution, mitigating overrepresentation of news and audio-visual content. For multimodal datasets, only textual descriptions (metadata, reviews) were used, discarding other modalities to ensure fair comparison with text-centric baselines.
  • Data Preprocessing and Noise Filtering:
    • Text Standardization: Item-related content (titles, attributes, reviews) was structured into a unified textual format. Non-informative or off-topic reviews were filtered out.
    • User History Filtering: Users with fewer than 15 interactions were removed due to insufficient sequence signal. Extremely long histories (exceeding 2500 interactions) were truncated to avoid overfitting and memory inefficiency.
  • Negative Sampling: Unlike traditional models, RecBase is trained on real user interaction sequences using an autoregressive objective, meaning no artificial negative sampling is involved during training. It learns from positive (observed) user feedback.
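
A minimal sketch of the user-history filtering step is shown below. The thresholds follow the text; the function name and the choice to keep the most recent interactions when truncating are assumptions, since the paper does not specify which end is truncated.

```python
def filter_histories(user_histories, min_len=15, max_len=2500):
    """Drop users with fewer than `min_len` interactions and truncate overly
    long histories to at most `max_len` items (here: the most recent ones)."""
    filtered = {}
    for user, items in user_histories.items():
        if len(items) < min_len:
            continue                        # insufficient sequence signal
        filtered[user] = items[-max_len:]   # avoid overfitting and memory inefficiency
    return filtered

# Example
histories = {"u1": list(range(10)), "u2": list(range(3000))}
print({u: len(h) for u, h in filter_histories(histories).items()})  # {'u2': 2500}
```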

Examples of Item Formatted Text Descriptions (from Appendix C): The NV-Embed-v2 model is used to convert unstructured text into dense, semantically rich embeddings. Examples of formatted text descriptions for movies:

'Describe a movie:\n{\n"title": "Toy Story (1995)",\n"genres": "Adventure | Animation | Children | Comedy | Fantasy"\n}'

'Describe a movie:\n{\n"title": "Jumanji (1995)",\n"genres": "Adventure | Children | Fantasy"\n}'

'Describe a movie:\n{\n"title": "Grumpier Old Men (1995)",\n"genres": "Comedy | Romance"\n}'
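
A small helper in the spirit of these templates might look as follows; the prompt format mirrors the examples above, while the function name is illustrative.

```python
def format_movie(title, genres):
    """Build the unified textual description later embedded by the text encoder."""
    return (
        "Describe a movie:\n"
        "{\n"
        f'"title": "{title}",\n'
        f'"genres": "{" | ".join(genres)}"\n'
        "}"
    )

print(format_movie("Toy Story (1995)",
                   ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]))
```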

Test cases for large language model benchmarks often involve a user behavior sequence and a candidate item, framed as a "Yes/No" question:

'User behavior sequence:\n(1) This Ford GT40 Movie Rig From "Ford V Ferrari" Looks Absurd\n(2) Kendall Jenner Wore the Tiniest Dress to Go Jewelry Shopping\nCandidate item: 9 fashion trends inspired by the 2000s that are coming back in style'

5.2. Evaluation Metrics

The evaluation of RecBase focuses on its generalization ability in zero-shot and multi-domain settings, framed as a ranking problem. The primary metric used is Area Under Curve (AUC).

  1. Conceptual Definition of AUC: Area Under Curve (AUC) is a performance metric commonly used for binary classification problems (like predicting user interest or click probability in recommendation). It quantifies the ability of a classifier to distinguish between positive and negative classes. In the context of recommendation, a higher AUC indicates that the model is better at ranking items a user is interested in (positive items) above items they are not interested in (negative items). It essentially represents the probability that the model will rank a randomly chosen positive item higher than a randomly chosen negative item.

  2. Mathematical Formula for AUC: AUC can be interpreted as the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. One common way to compute it is via the Wilcoxon-Mann-Whitney statistic:
$$\text{AUC} = \frac{\sum_{i \in \text{Positive}} \sum_{j \in \text{Negative}} \big( I(\text{score}_i > \text{score}_j) + 0.5 \cdot I(\text{score}_i = \text{score}_j) \big)}{N_{\text{Positive}} \cdot N_{\text{Negative}}}$$
Note: The paper does not explicitly provide this formula; it is the standard definition.

  3. Symbol Explanation:

    • $\text{Positive}$: The set of actual positive instances (e.g., items a user clicked or engaged with).
    • $\text{Negative}$: The set of actual negative instances (e.g., items a user did not click or showed no interest in).
    • $I(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
    • $\text{score}_i$: The predicted score (e.g., click probability, interest score) given by the model for a positive item $i$.
    • $\text{score}_j$: The predicted score given by the model for a negative item $j$.
    • $N_{\text{Positive}}$: The total number of positive instances.
    • $N_{\text{Negative}}$: The total number of negative instances. The term $0.5 \cdot I(\text{score}_i = \text{score}_j)$ handles ties, assigning half a point when a positive and a negative item receive the same predicted score.
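
For reference, this pairwise definition translates directly into code. The sketch below is a brute-force implementation; in practice a rank-based formula or `sklearn.metrics.roc_auc_score` would be used for large candidate sets.

```python
def auc(pos_scores, neg_scores):
    """Wilcoxon-Mann-Whitney estimate of AUC, with 0.5 credit for tied scores."""
    wins = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.7, 0.6], [0.8, 0.3, 0.2]))  # 7 of 9 pairs ranked correctly ~ 0.778
```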

For LLM-based baselines, the paper describes how the click probability is derived from the logits (raw output scores) of the 'YES' and 'NO' tokens:
$$\mathrm{Click\ Probability} = p_{\mathrm{yes}} = \frac{e^{l_{\mathrm{yes}}}}{e^{l_{\mathrm{yes}}} + e^{l_{\mathrm{no}}}}$$

  • $p_{\mathrm{yes}}$: The calculated probability that the LLM's response indicates 'YES' (i.e., the user is interested).
  • $l_{\mathrm{yes}}$: The logit (unnormalized output) corresponding to the 'YES' token from the LLM's classifier.
  • $l_{\mathrm{no}}$: The logit corresponding to the 'NO' token. This formula applies a softmax over the two relevant tokens ('YES' and 'NO') to obtain a probability. For closed-source models, textual responses "YES" or "NO" are directly mapped to interest scores of 1.0 and 0.0.
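
The same two-token softmax can be computed in a numerically stable way as in this minimal sketch:

```python
import math

def click_probability(logit_yes, logit_no):
    """p_yes = exp(l_yes) / (exp(l_yes) + exp(l_no)), shifted for stability."""
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

print(round(click_probability(2.3, 0.7), 3))  # ~0.832
```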

5.3. Baselines

The paper compares RecBase against a comprehensive set of baselines, including various sizes of LLMs and LLM-based recommendation models:

  • General LLM-based Zero-Shot Methods:

    • BERT-base (Devlin et al., 2019): A foundational transformer-based language model.
    • OPT (Zhang et al., 2022): Open Pre-trained Transformer language models, evaluated in the OPT-base and OPT-large versions.
    • Qwen-2 (Yang et al., 2024): A family of open-source large language models.
    • Phi-2: A small, high-quality language model.
    • Llama-2 (Touvron et al., 2023): Open-source LLM by Meta, used for comparison in larger sizes.
    • Llama-3 (Dubey et al., 2024): The successor to Llama-2.
    • Mistral (Jiang et al., 2023): A highly efficient LLM.
    • GPT-3.5 (Brown et al., 2020): A powerful closed-source LLM from OpenAI.
  • Finetuned LLM-based Recommendation Models:

    • RecGPT (Zhang et al., 2024): Generative personalized prompts for sequential recommendation.

    • P5 (Geng et al., 2022): A "Pretrain, Personalized Prompt & Predict" paradigm for recommendation.

    • Deepseek-Qwen2: Another advanced LLM variant.

      These baselines are representative because they cover a wide spectrum of LLM sizes (from 110M to 8B parameters), include both open-source and closed-source models, and represent common approaches to leveraging LLMs for recommendation (either directly zero-shot or via fine-tuning/prompting). This allows for a robust comparison of RecBase's domain-specific pretraining strategy against more general language-centric approaches.

5.4. Implementation Details

  • Item Textual Embeddings: The NV-Embed-v2 (Lee et al., 2024) model is used to convert unstructured item text descriptions into dense, semantically rich embeddings.
  • CL-VAE Configuration:
    • A 4-level codebook is used for the CL-VAE model.
    • Each level of the codebook has a size of 2048. This setup progressively extracts structured features from raw data.
  • RecBase Model Versions: Two versions of RecBase are trained and evaluated:
    • RecBase-0.3B (Base version):
      • Hidden size: 1024
      • Intermediate size: 2816
      • Attention heads: 16
      • Layers: 24
      • Maximum position embedding length: 32,768
    • RecBase-1.5B (Large version):
      • Hidden size: 1536
      • Intermediate size: 8960
      • Attention heads: 12
      • Layers: 28
      • Position embedding length and sliding window: 131,072 (enhanced capacity for longer sequences)
  • Shared Settings: Both models use a vocabulary size of 20,000 and share other key architectural settings derived from the Qwen2 (Yang et al., 2024) architecture.
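
Since both RecBase versions share key architectural settings with Qwen2, the reported RecBase-1.5B hyperparameters could be reproduced with the Hugging Face `transformers` Qwen2 implementation roughly as sketched below. This is an illustrative reconstruction from the numbers listed above, not the authors' released configuration; the grouped-query-attention setting is an assumption borrowed from the public Qwen2-1.5B config.

```python
from transformers import Qwen2Config, Qwen2ForCausalLM

# Illustrative reconstruction of the reported RecBase-1.5B settings.
config = Qwen2Config(
    vocab_size=20_000,               # shared concept-ID vocabulary
    hidden_size=1536,
    intermediate_size=8960,
    num_hidden_layers=28,
    num_attention_heads=12,
    num_key_value_heads=2,           # assumption: GQA setting of public Qwen2-1.5B
    max_position_embeddings=131_072,
    sliding_window=131_072,
)
model = Qwen2ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```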

6. Results & Analysis

6.1. Core Results Analysis

The experiments evaluate RecBase's generalization ability in zero-shot and multi-domain recommendation on eight previously unseen real-world datasets. The primary metric for evaluation is Area Under Curve (AUC).

The following are the results from Table 1 of the original paper:

| Model | Size (M) | MIND | MovieLens | MicroLens | Goodreads | Yelp | Steam | H&M | HotelRec | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P5 | 223 | 0.4911 | 0.5138 | 0.5017 | 0.5027 | 0.5080 | 0.5296 | 0.4845 | 0.4905 | 0.5027 |
| RecGPT | 6,649 | 0.5078 | 0.5069 | 0.4703 | 0.5083 | 0.5140 | 0.4924 | 0.4875 | 0.4937 | 0.4976 |
| BERT-base | 110 | 0.4963 | 0.4934 | 0.4992 | 0.4958 | 0.4914 | 0.5002 | 0.5204 | 0.4955 | 0.4990 |
| OPT-base | 331 | 0.5490 | 0.5104 | 0.4773 | 0.5015 | 0.5158 | 0.4257 | 0.4555 | 0.5028 | 0.4922 |
| OPT-large | 1,316 | 0.5338 | 0.5174 | 0.5236 | 0.5042 | 0.5026 | 0.3825 | 0.5650 | 0.5026 | 0.5039 |
| Qwen-2 | 494 | 0.4886 | 0.5138 | 0.5701 | 0.5148 | 0.5077 | 0.6399 | 0.6287 | 0.5311 | 0.5493 |
| Phi-2 | 2,780 | 0.4851 | 0.5296 | 0.5078 | 0.5049 | 0.5186 | 0.6061 | 0.5447 | 0.4986 | 0.5244 |
| Llama-2 | 6,738 | 0.4945 | 0.6030 | 0.4877 | 0.5273 | 0.5378 | 0.5622 | 0.4519 | 0.5305 | 0.5243 |
| Llama-3 | 8,030 | 0.4904 | 0.6412 | 0.5577 | 0.5191 | 0.5267 | 0.7690 | 0.5454 | 0.5342 | 0.5729 |
| Mistral | 7,248 | 0.4833 | 0.6933 | 0.5590 | 0.5321 | 0.5313 | 0.8102 | 0.5762 | 0.5677 | 0.5941 |
| Deepseek-Qwen2 | 7,615 | 0.5117 | 0.5407 | 0.5630 | 0.5165 | 0.5303 | 0.5905 | 0.5994 | 0.5648 | 0.5520 |
| GPT-3.5 | - | 0.5057 | 0.5170 | 0.5110 | 0.5122 | 0.5039 | 0.6184 | 0.5801 | 0.5076 | 0.5319 |
| RecBase-base | 313 | 0.5508 | 0.5352 | 0.5401 | 0.5029 | 0.5320 | 0.7450 | 0.5870 | 0.4874 | 0.5601 |
| RecBase-large | 1,318 | 0.5442 | 0.6474 | 0.5712 | 0.5329 | 0.5326 | 0.8343 | 0.6761 | 0.5124 | 0.6063 |

Table 1: Zero-shot recommendation evaluation across multi-domain datasets. AUC scores are reported.

As shown in Table 1, RecBase consistently demonstrates superior performance in zero-shot recommendation tasks.

  • Superiority over LLM Baselines: RecBase-large (1.318 billion parameters) achieves an impressive Overall AUC of 0.6063, which surpasses all other LLM baselines, including much larger models like Mistral (7.248 billion parameters, AUC 0.5941), Llama-3 (8.030 billion parameters, AUC 0.5729), and GPT-3.5 (AUC 0.5319). This highlights that recommendation-oriented pretraining is more effective than relying on general language-level knowledge for zero-shot recommendation.

  • Strong Generalization Capabilities: RecBase shows particularly notable improvements on challenging datasets. For instance, on the H&M dataset, RecBase-large achieves an AUC of 0.6761 compared to Mistral's 0.5762 and Qwen-2's 0.6287. Similarly, on the Steam dataset, RecBase-large scores 0.8343, outperforming Mistral's 0.8102. This indicates that RecBase effectively captures fine-grained semantic nuances of recommended items and generalizes robustly across diverse domains.

  • Efficiency and Competitiveness of Smaller Models: RecBase-base, with only 313 million parameters, achieves an Overall AUC of 0.5601. This score is competitive with, and even outperforms, some larger LLM baselines such as Qwen-2 (494M parameters, AUC 0.5493) and Phi-2 (2.78B parameters, AUC 0.5244), as well as BERT-base and OPT-base. This demonstrates RecBase's efficiency, offering strong performance at a substantially lower computational cost.

    The results strongly validate the effectiveness of the proposed method, especially its domain-specific pretraining strategy and the use of unified item concept IDs, which allow it to bridge the semantic gap inherent in LLM-based approaches for recommendation tasks.

6.2. Unified Representation Performance

The effectiveness of the unified concept space generated by the CL-VAE method is crucial for RecBase's performance. An ideal code discretization approach should distribute input data uniformly across the latent space, maximizing token utilization for the autoregressive model.

  • t-SNE Clustering Visualization (Figure 2): As discussed in Section 4.2.2, Figure 2 effectively illustrates this.

    • Figure 2a (traditional RQ-VAE) shows that features from different domains are distributed independently in the latent space, with minimal interaction. This suggests a less unified representation, potentially leading to codebook collapse and poor generalization.
    • Figure 2b (CL-VAE) demonstrates a significant improvement. It shows increased overlap and interaction between features from diverse datasets within the unified concept space. This indicates that CL-VAE successfully creates a more integrated and well-distributed representation, enabling better semantic alignment across domains.
  • Codebook Usage Frequency (Figure 5c):

    Codebook structure vs. Collision rate and Utilization rate

    Figure 5: Codebook structure vs. Collision rate and Utilization rate. (a) demonstrates the impact of codebook size on the collision rate and utilization rate. (b) reflects the influence of codebook level on the aforementioned metrics. (c) Codebook usage frequency in CL-VAE and RQ-VAE.

    Figure 5c (embedded within Figure 5 in the original paper) displays the codebook usage frequency for both CL-VAE and RQ-VAE. The shift in the frequency distribution of code usage across different levels confirms that CL-VAE achieves a more balanced distribution of concept IDs. This ensures that each token in the codebook is utilized effectively, preventing codebook collapse and allowing the autoregressive model to learn from a richer, more diverse set of discrete representations. The hierarchical approach enables the capture of both fine-grained and coarse-grained features, enhancing generalization across various recommendation scenarios.

By establishing a unified and well-distributed concept space, CL-VAE facilitates more efficient and accurate predictions by the autoregressive model, ultimately improving the overall performance of the recommendation system.

6.3. Ablation Studies / Parameter Analysis

The paper includes ablation studies to evaluate the contribution of individual components of CL-VAE to RecBase's performance and generalization.

The following are the results from Table 4 of the original paper:

| | Yelp | Steam | H&M | HotelRec | Overall |
| --- | --- | --- | --- | --- | --- |
| RecBase-base | 0.5320 | 0.7450 | 0.5870 | 0.4874 | 0.5879 |
| w/o format. | 0.5204 | 0.7187 | 0.5668 | 0.4966 | 0.5756 |
| w/o init. | 0.4912 | 0.5924 | 0.5319 | 0.4909 | 0.5266 |
| w/o cur. | 0.5073 | 0.6815 | 0.5412 | 0.4815 | 0.5529 |

Table 4: Modular ablation study. format., init. and cur. represent formatted text description, reinitialization and curriculum learning in CL-VAE respectively.

6.3.1. Ablation Analysis on Key Components (Table 4):

  • RecBase-base (Complete Model): Achieves the best Overall AUC of 0.5879, confirming the effectiveness of the complete CL-VAE method.
  • w/o format. (Without formatted text description): Removing structured text representations leads to a noticeable drop in performance (Overall AUC 0.5756). This indicates that structured text descriptions are vital for the model to capture relevant features and generalize well across domains.
  • w/o init. (Without reinitialization): The absence of the reinitialization step results in a significant decline in performance (Overall AUC 0.5266). This highlights the criticality of this mechanism in stabilizing learning, preventing codebook collapse, and ensuring sufficient learning of low-level features.
  • w/o cur. (Without curriculum learning): Excluding the curriculum learning module causes further performance degradation (Overall AUC 0.5529), particularly in more complex recommendation scenarios. This underscores the value of progressively training the model on increasingly difficult examples to enhance convergence stability and hierarchical representation learning.

6.3.2. Ablation Study on the Codebook (Figure 5a and 5b): The analysis on codebook size and number of levels provides insights into CL-VAE's configuration:

  • Codebook Size (Figure 5a): As codebook size increases, the collision rate between concept IDs generally decreases, indicating better optimization of representations and more distinct codes. However, beyond a certain point (e.g., 4096 in the figure), utilization rate decreases, leading to redundant vocabulary space. Thus, a size of 2048 for each layer was selected to balance distinctiveness and utilization.

  • Codebook Levels (Figure 5b): Increasing the number of levels exponentially expands the number of products representable by concept IDs. However, for spaces beyond four levels, the utilization of IDs becomes very low, and the performance gains plateau. Furthermore, more levels incur additional inference costs during decoding. Consequently, the model adopts a strategy of four levels to optimize the balance between performance and efficiency.

    These ablation studies confirm that each component of the CL-VAE (formatted text description, reinitialization, and curriculum learning) and the chosen codebook parameters (size and levels) are essential for RecBase's robust performance and generalization capabilities.

6.4. In-Domain Adaptation via Fine-Tuning

While the paper primarily focuses on zero-shot generalization, it also explores the model's adaptability through in-domain fine-tuning.

The following are the results from Table 3 of the original paper:

| | Microlens | Steam | MovieLens | H&M | Yelp |
| --- | --- | --- | --- | --- | --- |
| Zero-shot | 0.5401 | 0.7450 | 0.5352 | 0.5870 | 0.5320 |
| Fine-tuned | 0.5602 | 0.9173 | 0.6216 | 0.6261 | 0.6125 |
| Improve. (%) | 3.70% | 23.12% | 16.14% | 6.66% | 15.13% |

Table 3: Performance of our model under zero-shot and fine-tuning settings on various datasets.

Table 3 presents a comparison of RecBase's performance under zero-shot and fine-tuning settings across five representative datasets.

  • Consistent Improvement: Fine-tuning consistently improves performance across all evaluated datasets compared to the zero-shot setting. This demonstrates the model's adaptability and ability to benefit from domain-specific supervision.

  • Significant Gains:

    • The Steam dataset shows the most substantial improvement, with an AUC increase from 0.7450 (zero-shot) to 0.9173 (fine-tuned), a 23.12% relative improvement. This suggests that for domains with rich interaction data, RecBase can be highly optimized.
    • MovieLens also sees a considerable gain, from 0.5352 to 0.6216 (a 16.14% improvement).
    • Yelp improves from 0.5320 to 0.6125 (a 15.13% improvement).
  • Moderate Gains: Even datasets with smaller initial zero-shot performance or narrower domain focus, such as Microlens (3.70% improvement) and H&M (6.66% improvement), benefit from fine-tuning.


Figure 3: Illustration of the zero-shot transfer and domain-specific fine-tuning process. The diagram shows how the pretrained model is evaluated in a zero-shot setting and then fine-tuned for in-domain performance.

Figure 3 illustrates this fine-tuning process, emphasizing how in-domain adaptation refines the model's representations. These results highlight RecBase's flexibility: it provides strong zero-shot performance for cold-start scenarios and new domains, and its architecture allows for further performance enhancement when sufficient in-domain data is available for fine-tuning. This dual capability makes it a versatile foundational model for recommendation.
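
For readers interested in what such in-domain adaptation might look like in code, the following is a minimal sketch of continuing the autoregressive next-concept-ID objective on target-domain sequences only. The model interface, data loader, and hyperparameters are placeholders, not the authors' released implementation.

```python
# Hypothetical sketch of in-domain fine-tuning: keep the next-concept-ID
# objective, but train only on target-domain interaction sequences.
# `model` is assumed to map a [B, T] tensor of concept IDs to
# [B, T, vocab_size] logits; this interface is an assumption.
import torch
import torch.nn.functional as F

def finetune(model, domain_loader, epochs=3, lr=1e-5, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in domain_loader:              # batch: LongTensor [B, T] of concept IDs
            batch = batch.to(device)
            inputs, targets = batch[:, :-1], batch[:, 1:]
            logits = model(inputs)               # [B, T-1, vocab_size]
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),  # flatten positions
                targets.reshape(-1),                  # next-ID labels
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```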

6.5. Analysis of Inference Efficiency

The paper also presents an analysis of inference efficiency, comparing RecBase with several state-of-the-art models.


Figure 6: Comparison of Inference Latency. The bar chart shows inference times in seconds for various models on 20,000 interactive test data.

Figure 6 illustrates the inference latency (time taken for inference) on 20,000 interactive test data points.

  • Superior Efficiency of RecBase: RecBase-0.3B and RecBase-1.5B demonstrate significantly lower inference latency compared to other models. RecBase-0.3B takes approximately 295 seconds, and RecBase-1.5B takes about 390 seconds.

  • Comparison with Baselines: In contrast, GpdRec-7B (likely a typo for RecGPT) takes around 774 seconds, Mistral-1B takes about 946 seconds, Qwen2 takes approximately 1500 seconds, and Phi takes over 2000 seconds.

  • Reason for Efficiency: This superior efficiency is attributed to RecBase's specialized ID vocabulary space, which is designed specifically for recommendation tasks. Unlike general-purpose large language models that operate on vast vocabulary spaces based on natural language representations, RecBase utilizes a much smaller, more efficient vocabulary tailored to the discrete concept IDs of items; a rough illustration of how vocabulary size affects per-step decoding cost follows after this list.

    This design choice not only enhances the model's efficiency but also positions it as a more suitable and practical base model for recommendation-related tasks, especially in real-world scenarios where latency is a critical factor. The efficiency gains, coupled with strong performance, make RecBase a powerful solution.
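
As a rough illustration of the vocabulary-size argument, the sketch below times the per-step output projection (logit computation) for a compact concept-ID head versus a typical LLM-sized subword vocabulary. The hidden size and vocabulary sizes are assumptions for illustration, not measurements from the paper.

```python
# Rough illustration: the output projection at each decoding step scales with
# vocabulary size, so a compact concept-ID vocabulary lowers per-step cost.
import time
import numpy as np

hidden    = 1024
rec_vocab = 4 * 2048      # e.g. four levels of 2048 concept IDs each (assumed)
llm_vocab = 150_000       # typical subword vocabulary of a general-purpose LLM (assumed)

h     = np.random.randn(1, hidden).astype(np.float32)
W_rec = np.random.randn(hidden, rec_vocab).astype(np.float32)
W_llm = np.random.randn(hidden, llm_vocab).astype(np.float32)

def time_logits(W, reps=50):
    start = time.perf_counter()
    for _ in range(reps):
        _ = h @ W         # one decoding step's logit computation
    return (time.perf_counter() - start) / reps

print(f"concept-ID head: {time_logits(W_rec) * 1e3:.2f} ms/step")
print(f"LLM-vocab head:  {time_logits(W_llm) * 1e3:.2f} ms/step")
```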

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces RecBase, a novel foundational model specifically engineered for the challenges of zero-shot and multi-domain recommendation. By moving beyond the limitations of language-centric pretraining common in existing LLM-based recommenders, RecBase adopts a recommendation-oriented pretraining approach. Key to its success is the pretraining on a large-scale, heterogeneous, cross-domain corpus, utilizing structured text representations and unified feature mappings to enable strong generalization.

The core innovations include a unified item tokenizer that encodes items into hierarchical concept IDs (learned via Curriculum Learning Enhanced RQ-VAE, or CL-VAE). This approach mitigates semantic discrepancies between domains and provides a structured, efficient representation space. Furthermore, RecBase employs an autoregressive training paradigm to effectively capture inter-item dependencies within these concept ID sequences.

Extensive evaluations on eight real-world datasets confirm the effectiveness of RecBase. The model, particularly RecBase-1.5B, matches or surpasses the zero-shot and cross-domain recommendation performance of LLM baselines up to 7 billion parameters, while also demonstrating superior inference efficiency. These findings underscore the significant potential of recommendation-oriented pretraining for building robust and adaptable recommender systems, especially in cold-start scenarios and new domains.

7.2. Limitations & Future Work

The authors acknowledge several limitations inherent to recommendation data, despite RecBase's promising performance:

  • Data Sparsity and Distribution Imbalance: These issues can still impair generalization, particularly for cold-start users (users with very few interactions) and long-tail items (items with very few interactions). While cross-domain pretraining partially alleviates this, the model might still underrepresent certain domains or items with sparse interactions.

  • Biases in Training Data: Biases present in the large-scale training corpus can limit the model's generalization to entirely new domains or diverse user populations that are not adequately represented in the training data.

    Based on these limitations, the authors suggest the following directions for future work:

  • Data Augmentation: Exploring techniques to enrich the training data and mitigate sparsity.

  • Active Learning: Incorporating strategies where the model can intelligently query for new data to improve its knowledge in uncertain or underrepresented areas.

  • Bias Mitigation Strategies: Developing methods to detect and reduce biases in the training data and model predictions.

  • Larger, More Heterogeneous Benchmarks: Evaluating the model on even larger and more diverse benchmarks to further enhance its scalability and real-world robustness.

  • Integration of Discretized Multimodal Features: The paper mentions that for multimodal datasets, only textual descriptions were used. The authors suggest that integrating discretized multimodal features into their unified framework is a promising direction for future research, which could further enhance performance.

7.3. Personal Insights & Critique

This paper presents a highly insightful and practical approach to addressing a critical challenge in modern recommender systems: cross-domain and zero-shot generalization.

Inspirations:

  • Paradigm Shift: The most significant inspiration is the explicit shift from a "fine-tuning LLMs for recommendation" paradigm to "building a foundation model for recommendation." This recognizes that while LLMs are powerful, their core inductive biases (language) are not perfectly aligned with recommendation (item relationships and sequences). This targeted pretraining approach is likely to yield more optimized and efficient solutions.
  • Unified Item Tokenization with Hierarchical Concept IDs: The CL-VAE and multi-level concept IDs are brilliant. It's a clever way to:
    1. Bridge the Semantic Gap: Convert heterogeneous item descriptions into a structured, unified, and semantically meaningful discrete representation.
    2. Reduce Vocabulary Size: Compared to raw language tokens, a compact set of concept IDs is more efficient for modeling item sequences.
    3. Facilitate Knowledge Transfer: The hierarchical nature allows learning at different granularities, promoting better generalization across domains by finding common abstract concepts.
  • Curriculum Learning for Codebook Stability: Addressing codebook collapse with curriculum learning and reinitialization is an elegant solution to a known problem in VQ-VAE-like models, ensuring the learned discrete space is robust and well-utilized.
  • Efficiency: Demonstrating superior performance with significantly fewer parameters (1.5B vs. 7B LLMs) and lower inference latency is a strong practical advantage, especially for real-world deployment.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Exclusivity of Textual Information: The paper states that for multimodal datasets, only textual descriptions were used to ensure fair comparison with text-centric baselines. While understandable for evaluation fairness, it's an important limitation. Real-world items often have rich multimodal features (images, audio, video). The performance of RecBase might be further enhanced if it could directly ingest and discretize these multimodal signals within its CL-VAE framework. The authors do mention this as future work, which is good.

  • True Cold-Start for Users: The User History Filtering removes users with fewer than 15 interactions. While this ensures sufficient sequence signal for training, it slightly sidesteps the "true cold-start user" problem (e.g., a brand new user with 0 or 1 interaction). How well RecBase would perform in such extreme cold-start scenarios, where minimal sequence data is available, could be further investigated.

  • Sensitivity to m (Number of Levels): The paper analyzes the impact of the number of codebook levels on collision and utilization rates (Figure 5b) and concludes that four levels offer a good balance. A deeper analysis of how the choice of m (the number of levels in a concept ID) affects recommendation quality for items of differing complexity and across domains could be beneficial.

  • Comparison with Larger LLMs: While RecBase-1.5B outperforms LLMs up to 7B parameters, state-of-the-art LLMs now exceed 70B parameters. Although these larger models are computationally more expensive, it would be insightful to understand the performance ceiling and trade-offs against even larger, general-purpose LLMs, if RecBase's efficiency gains hold up proportionately.

  • Interpretability of Concept IDs: While the concept IDs are learned to be semantically rich, their direct interpretability for humans might be limited. Understanding what specific "concepts" each ID or combination of IDs represents could offer valuable insights for debugging or system design.

  • Dataset Biases: The authors acknowledge biases in the overall training data distribution (overrepresentation of news/audio-visual). While curation aimed to balance this, the success largely depends on the diversity and quality of the initial 15 datasets chosen for pretraining. A more formal analysis of the representation of different item types and domains in the unified concept space could be a useful diagnostic.

    Overall, RecBase is a significant step forward in building specialized foundational models for recommendation, offering a robust framework for handling diverse domains and zero-shot scenarios with impressive efficiency. Its innovations in item representation and pretraining objective pave the way for more effective and scalable recommender systems.
