
MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation

Published: 10/28/2025

TL;DR Summary

MiniOneRec, the first fully open-source generative recommendation framework, constructs Semantic IDs (SIDs) with a Residual Quantized VAE and post-trains Qwen backbones from 0.5B to 7B parameters, confirming scaling benefits and improving ranking accuracy and candidate diversity through full-process SID alignment and reinforcement learning with constrained decoding.

Abstract

The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.


In-depth Reading


1. Bibliographic Information

1.1. Title

MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation

1.2. Authors

Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He. The authors are affiliated with the University of Science and Technology of China and the National University of Singapore. Their research backgrounds appear to be in computer science, with a focus on recommender systems, large language models (LLMs), and reinforcement learning (RL).

1.3. Journal/Conference

This paper is published on CoRR (the Computing Research Repository), the computer science preprint repository hosted on arXiv, where research papers are publicly shared before, or in parallel with, formal peer review and publication in journals or conferences.

1.4. Publication Year

2025 (the arXiv preprint is dated October 28, 2025).

1.5. Abstract

The abstract outlines the paper's investigation into whether recommender systems can benefit from the scaling laws observed in large language models (LLMs). It highlights a shift from traditional, embedding-heavy recommenders to an emerging generative paradigm that uses Semantic ID (SID) sequences produced by autoregressive Transformers. The paper addresses two open questions: (1) if scaling laws apply to public benchmarks, and (2) what minimal post-training steps achieve competitive performance. The authors introduce MiniOneRec, presented as the first fully open-source generative recommendation framework. This framework covers an end-to-end workflow including SID construction using a Residual Quantized VAE (RQ-VAE), supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL). Experiments on the Amazon Review dataset with Qwen backbones (0.5B to 7B parameters) demonstrate that increasing model size leads to a consistent decrease in training and evaluation losses, validating the parameter efficiency of the generative approach. To further improve performance, they propose a lightweight post-training pipeline that enforces full-process SID alignment and applies reinforcement learning with constrained decoding and hybrid rewards. These techniques are shown to significantly enhance both ranking accuracy and candidate diversity.

Official Source Link: https://arxiv.org/abs/2510.24431
PDF Link: https://arxiv.org/pdf/2510.24431v1.pdf
Publication Status: This paper is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limitation of scaling in conventional recommender systems and the lack of open-source, reproducible frameworks for the emerging generative recommendation paradigm.

Why is this problem important in the current field?

The success of large language models (LLMs) has demonstrated predictable performance gains with increased model size, a phenomenon known as scaling laws. This has spurred interest in whether similar benefits can be achieved in recommender systems. However, traditional recommenders, which rely heavily on massive embedding tables for user and item representations, tend to plateau in performance as these embeddings grow. This embedding-heavy design leads to diminishing returns beyond moderate scales. In contrast, the generative paradigm compresses items into compact Semantic ID (SID) sequences, allowing autoregressive Transformers to handle the bulk of parameters, theoretically enabling better scaling.

Existing industrial deployments of generative recommenders, such as OneRec and OnePiece, have shown significant promise, but they remain proprietary and rely on massive private datasets. This leaves the broader research community with critical unanswered questions:

  1. Do the expected scaling laws observed in LLMs also hold for generative recommendation models when applied to public benchmarks?
  2. What is the minimum, yet effective, post-training recipe required to achieve competitive performance with these generative models?

What is the paper's entry point or innovative idea?

The paper's entry point is to fill this gap by providing the first fully open-source, end-to-end framework for generative recommendation, named MiniOneRec. This framework not only aims to validate the scaling properties of generative recommenders on public datasets but also to offer a reproducible blueprint for building high-performing generative models using a lightweight post-training pipeline.

2.2. Main Contributions / Findings

The paper makes several primary contributions and reports key findings:

  • First Fully Open-Source Generative Recommendation Framework (MiniOneRec): The authors present MiniOneRec, which includes complete source code, reproducible training pipelines, and publicly available model checkpoints. It offers an end-to-end workflow covering SID construction, supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL).

  • Validation of Scaling Laws on Public Benchmarks: Through systematic investigation on the Amazon Review dataset, MiniOneRec variants (0.5B to 7B parameters) demonstrate a consistent downward trend in both training and evaluation losses as model size increases. This empirically validates the parameter efficiency and scaling laws of the generative paradigm, suggesting that larger models consistently perform better.

  • Optimized Lightweight Post-training Pipeline: The paper proposes an effective two-pronged post-training strategy:

    1. Full-Process SID Alignment: This involves augmenting the vocabulary with dedicated SID tokens and enforcing auxiliary alignment objectives throughout both the SFT and RL stages. This approach is shown to be crucial for grounding SID generation in world knowledge derived from the underlying LLM.
    2. Reinforced Preference Optimization: This leverages reinforcement learning with constrained decoding (masking invalid tokens for legal item generation) and a hybrid reward signal. The hybrid reward combines a rule-based accuracy term with a ranking-aware penalty for hard negatives, enhancing both ranking accuracy and candidate diversity.
  • Superior Performance: MiniOneRec consistently surpasses strong sequential, generative, and LLM-based baselines across various metrics (HR@K, NDCG@K) on the Amazon Review benchmarks.

  • Demonstrated Transferability and Impact of Pre-trained LLMs: The framework exhibits robust out-of-distribution (OOD) transferability (e.g., from Industrial to Office domains) by effectively discovering reusable interaction patterns. Furthermore, models initialized with pre-trained LLM weights significantly outperform those trained from scratch, highlighting the immense value of world knowledge and general reasoning abilities embedded in large pre-trained models.

    In summary, the paper successfully demonstrates the viability and advantages of scaling generative recommendation models on public data, providing a practical, open-source framework and an optimized post-training recipe to achieve state-of-the-art performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the MiniOneRec framework, a beginner needs to grasp several fundamental concepts from recommender systems, natural language processing, and reinforcement learning.

  • Recommender Systems: Systems designed to predict user preferences for items (products, movies, news articles, etc.).

    • Conventional Recommenders: Often rely on embedding tables to represent users and items as dense vectors. Recommendations are typically made by calculating similarity (e.g., dot product) between user and item embeddings. Examples include matrix factorization models and sequential recommenders like SASRec.
    • Generative Recommenders: A newer paradigm where recommendation is framed as a sequence generation problem. Instead of matching embeddings, these models generate sequences of identifiers for items that a user is likely to interact with next. This approach draws inspiration from Large Language Models (LLMs).
  • Large Language Models (LLMs): Very large neural networks trained on vast amounts of text data to understand, generate, and predict human language. They typically employ the Transformer architecture.

    • Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.
    • Autoregressive Models: Models that predict the next item in a sequence based on all preceding items. In LLMs, this means predicting the next word given the previous words. In generative recommenders, it means predicting the next Semantic ID (SID) given previous SIDs.
  • Semantic ID (SID): A compact, discrete token representation of an item, designed to capture its semantic meaning. Instead of using raw text or a simple numerical ID, SIDs are generated by compressing rich information (like item descriptions) into a sequence of special tokens. This allows generative models, which operate on discrete tokens, to handle items effectively. SIDs are analogous to words or sub-words in natural language processing.

  • Quantization: The process of mapping continuous values to a finite set of discrete values. In MiniOneRec, quantization is used to convert continuous item embeddings (from text encoders) into discrete SIDs.

    • Residual Quantized Variational Autoencoder (RQ-VAE): An advanced quantization technique. A Variational Autoencoder (VAE) learns a compressed, latent representation of data. Residual Quantization involves quantizing the residual errors iteratively, allowing for a more accurate and hierarchical compression of information into multiple discrete codes (SIDs). Each SID can be seen as a "byte" or "code" representing a part of the item's features, and a sequence of these codes (e.g., three bytes) forms the unique SID for an item.
  • Supervised Fine-Tuning (SFT): A common technique in LLM training. After an LLM is pre-trained on a general corpus, it is fine-tuned on a smaller, task-specific dataset using supervised learning (i.e., with labeled input-output pairs). In MiniOneRec, SFT warms up the LLM to generate item SIDs based on user interaction histories.

  • Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.

    • Policy Gradient Methods: A class of RL algorithms that directly optimize the policy (the agent's strategy for choosing actions).
    • Group Relative Policy Optimization (GRPO): A specific policy gradient algorithm used in MiniOneRec. It draws multiple candidate outputs per input (prompt) and normalizes rewards within that group. This helps reduce gradient variance and stabilizes training, especially when rule-based rewards are used instead of a learned reward model.
    • Reinforcement Learning with Human Feedback (RLHF): A popular technique for aligning LLMs with human preferences, often using Proximal Policy Optimization (PPO). GRPO can be seen as a lightweight alternative to RLHF when rule-based rewards are sufficient.
    • Constrained Decoding: A technique used during text generation (or SID generation) to ensure that the generated output adheres to certain rules or formats. For MiniOneRec, this means ensuring that only valid item SIDs are produced, preventing the generation of non-existent items.
    • Beam Search: A search algorithm used in sequence generation tasks to find the most probable sequence of tokens. Instead of just picking the single best token at each step (greedy decoding), it keeps track of the $k$ most probable partial sequences (beams) and extends them. This helps explore a wider range of possibilities and often leads to higher-quality generations.
  • Evaluation Metrics for Recommender Systems:

    • Hit Rate (HR@K): Measures how often the ground-truth item (the item the user actually interacted with) appears within the top $K$ recommended items.
      • Conceptual Definition: HR@K is a recall-based metric that quantifies the proportion of users for whom the target item is present in the top K recommendations. It focuses on whether the model successfully found the relevant item.
      • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom target item is in top K}}{\text{Total number of users}} $
      • Symbol Explanation:
        • Number of users for whom target item is in top K: The count of unique users for whom the relevant item was included in the recommendation list of size K.
        • Total number of users: The total number of users in the evaluation set.
    • Normalized Discounted Cumulative Gain (NDCG@K): A ranking-aware metric that considers not only whether relevant items are in the top $K$ but also their position in the list. Higher positions for relevant items yield higher scores.
      • Conceptual Definition: NDCG@K is a measure of ranking quality, assessing the usefulness of an item based on its position in the recommendation list. It assigns higher scores to relevant items that appear earlier in the list. The "Normalized" part means it is scaled to the ideal possible score, making it comparable across different queries.
      • Mathematical Formula: First, calculate DCG@K (Discounted Cumulative Gain): $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ Then, calculate IDCG@K (Ideal DCG@K) by arranging all relevant items in the dataset in decreasing order of relevance for the given query: $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_{i_{ideal}}} - 1}{\log_2(i+1)} $ Finally, NDCG@K is: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
      • Symbol Explanation:
        • $K$: The number of items in the recommendation list being considered.
        • $rel_i$: The relevance score of the item at position $i$ in the recommended list (typically binary: 1 if relevant, 0 if not).
        • $rel_{i_{ideal}}$: The relevance score of the item at position $i$ in the ideal ranked list (where all relevant items are ranked highest).
        • $\log_2(i+1)$: The discount factor, which reduces the weight of relevant items as their position $i$ in the list increases.
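
To make these two metrics concrete, the following minimal Python sketch computes HR@K and NDCG@K under the common leave-one-out setting, where each user has exactly one ground-truth next item (binary relevance); the function names and toy data are illustrative, not taken from the paper's code.

```python
import math

def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of users whose ground-truth item appears in their top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

def ndcg_at_k(ranked_lists, targets, k):
    """NDCG@k with a single relevant item per user (binary relevance).

    With one relevant item, IDCG@k = (2^1 - 1) / log2(1 + 1) = 1, so NDCG
    reduces to 1 / log2(rank + 1) when the target sits at position `rank`.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked[:k]:
            rank = ranked.index(target) + 1      # 1-indexed position of the hit
            total += 1.0 / math.log2(rank + 1)   # discounted gain for the hit
    return total / len(targets)

# Toy usage: two users with top-3 recommendation lists.
recs = [["i5", "i2", "i9"], ["i1", "i7", "i3"]]
gold = ["i2", "i4"]
print(hit_rate_at_k(recs, gold, 3))  # 0.5  (only the first user is a hit)
print(ndcg_at_k(recs, gold, 3))      # ~0.32 (hit at rank 2 -> 1/log2(3), averaged)
```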

3.2. Previous Works

The paper builds upon and compares itself to several prior works in generative recommendation, LLM applications in recommendation, and reinforcement learning.

  • Generative Recommendation Models:

    • TIGER (Rajput et al., 2023): One of the early explorations in generative recommendation. It uses RQ-VAE to map textual embeddings of titles and descriptions into SIDs, and then a Transformer model generates these SIDs. MiniOneRec adopts a similar RQ-VAE approach for SID construction.
    • HSTU (Zhai et al., 2024): Introduces a streaming architecture for generative recommendation, designed to handle high-cardinality and non-stationary interaction logs efficiently.
    • LC-Rec (Zheng et al., 2024): Aligns an LLM with SIDs through multi-task learning, enabling the model to understand and generate these symbols effectively. This highlights the importance of aligning SIDs with LLM's linguistic knowledge, a concept MiniOneRec further develops with its full-process SID alignment.
    • RecForest (Feng et al., 2022): Clusters items using hierarchical k-means and uses cluster indices as tokens, an alternative approach to SID generation.
    • EAGER (Wang et al., 2024) & TokenRec (Qu et al., 2024): These works focus on improving item tokenization by fusing collaborative and semantic evidence directly into the tokenizer.
    • Industrial-Scale Generative Recommenders:
      • OneRec (Deng et al., 2025; Zhou et al., 2025a): A pioneering industrial deployment that achieved significant improvements on Kuaishou's platform. It features a two-stage post-training pipeline with SFT and RL.
      • OneRec-V2 (Zhou et al., 2025b): Further advances OneRec with a lazy decoder architecture and preference alignment.
      • OnePiece (Dai et al., 2025): Integrates LLM-style context engineering and reasoning into industrial retrieval and ranking pipelines, also exploring inference in the latent space.
      • MTGR (Wang et al., 2025): Focuses on boosting large-scale generative recommendation models by keeping original DLRM features, adding user-level compression, and accelerating training/inference.
    • The common thread among these works is the shift from traditional embedding matching to sequence generation using Transformers and item tokenization. MiniOneRec differentiates itself by being open-source and systematically validating scaling laws on public data while providing a refined post-training strategy.
  • LLM and RL in Recommendation:

    • Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm in RLHF for fine-tuning LLMs. It is known for its stability but can be memory-intensive.
    • Direct Preference Optimization (DPO) (Rafailov et al., 2023): An alternative to PPO that removes the need for a separate value network, directly optimizing a preference-based objective.
    • S-DPO (Chen et al., 2024b): Adapts DPO for recommendation by treating softmax-based negative sampling as implicit pairwise preference.
    • Group Relative Policy Optimization (GRPO) (Shao et al., 2024; DeepSeek-AI et al., 2024): The RL algorithm adopted by MiniOneRec. It normalizes rewards within a small batch of roll-outs and typically uses rule-based reward signals instead of a learned reward model, making it more lightweight and efficient than PPO or DPO in some contexts. MiniOneRec integrates GRPO with constrained decoding and hybrid rewards tailored for recommendation.
    • LLM-based Recommendation Models:
      • BigRec (Bao et al., 2023): An example of an LLM-based recommender.
      • D3 (Bao et al., 2024): Another LLM-based model for recommendation, often used as a baseline.
    • These works illustrate the increasing trend of leveraging LLMs and RL techniques to enhance recommendation systems, moving beyond traditional collaborative filtering or content-based methods.

3.3. Technological Evolution

The field of recommender systems has seen a significant evolution:

  1. Early Systems (e.g., Collaborative Filtering): Focused on user-item interaction patterns, often using matrix factorization or neighborhood-based methods.

  2. Deep Learning Era (e.g., GRU4Rec, Caser, SASRec): Introduced deep neural networks, particularly Recurrent Neural Networks (RNNs) and Transformers, to model sequential user behavior, leading to sequential recommenders. These still largely relied on embedding tables.

  3. Embedding-Heavy Models (e.g., DeepFM, DCN): Further scaled deep learning models by using massive embedding tables for categorical features, but these often hit performance plateaus due to the fixed nature of embeddings and the limitations of shallow scoring networks.

  4. Generative Recommendation Paradigm (e.g., TIGER, OneRec): Inspired by the success of LLMs, this paradigm shifts from embedding matching to sequence generation. Items are tokenized into SIDs, and autoregressive Transformers generate sequences of SIDs. This allows for a more flexible and scalable architecture, where model capacity can be redirected from static embedding tables to dynamic generative components.

  5. LLM-Integrated Generative Recommendation (e.g., OnePiece, MiniOneRec): The latest stage, where pre-trained LLMs are explicitly leveraged not just as a backbone but also for their world knowledge and reasoning capabilities. This often involves multi-task learning, alignment objectives, and reinforcement learning to fine-tune LLMs for recommendation-specific tasks.

    MiniOneRec fits into this latest stage by providing an open-source framework that integrates RQ-VAE for SID construction, Qwen-based LLMs as backbones, SFT for initial training, and GRPO-based RL for preference optimization, crucially incorporating SID alignment with LLM's world knowledge.

3.4. Differentiation Analysis

Compared to the main methods in related work, MiniOneRec offers several core differences and innovations:

  • Open-Source & Reproducibility: Unlike many industrial generative recommenders (e.g., OneRec, OnePiece) that are proprietary, MiniOneRec is the "first fully open-source generative recommendation framework." This is a significant contribution to the research community, enabling transparency, reproducibility, and further development.

  • Systematic Validation of Scaling Laws: MiniOneRec provides the first systematic investigation into how generative recommendation models scale on public datasets (Amazon Review). This directly addresses a critical unanswered question for the research community, providing empirical evidence for the parameter efficiency of the generative paradigm across different model sizes (0.5B to 7B parameters).

  • Comprehensive Post-training Recipe: The paper proposes a lightweight yet effective post-training pipeline with two key components:

    1. Full-Process SID Alignment: While LC-Rec also aligns LLMs with SIDs, MiniOneRec emphasizes "full-process SID alignment," maintaining alignment objectives throughout both SFT and RL stages. This ensures deeper semantic understanding by grounding SID generation in LLM's world knowledge. This is more robust than models that treat recommendation purely as a SID-to-SID task.
    2. Reinforced Preference Optimization with Hybrid Rewards and Constrained Decoding: MiniOneRec applies GRPO with specific enhancements:
      • Constrained decoding: Masks invalid tokens to guarantee legal item generation, which is crucial for practical recommender systems.
      • Beam search for efficient exploration of diverse candidates, addressing the poor sampling diversity that reinforcement learning with verifiable rewards (RLVR) suffers from in limited action spaces.
      • Hybrid rewards: Combines rule-based accuracy with a novel ranking-aware penalty for hard negatives. This goes beyond simple binary rewards used in standard GRPO and provides richer supervision for ranking quality, a common weakness of RL in recommendation.
  • Compact SID Space for Efficiency: By operating in the compact SID space rather than verbose textual titles, MiniOneRec requires substantially fewer context tokens, leading to faster inference, lower latency, and smaller memory footprints compared to LLM-based models that might process full text descriptions.

  • Demonstrated Transferability: The MiniOneRec-w/RL-OOD variant, trained on one domain (Industrial) and evaluated on an unseen one (Office) without SFT on the target, highlights the framework's ability to uncover reusable interaction patterns and its robust out-of-distribution (OOD) performance, which is a key challenge in recommendation.

    In essence, MiniOneRec distinguishes itself by democratizing generative recommendation research, rigorously validating its scaling properties, and offering a finely tuned, effective, and efficient methodology that integrates the strengths of LLMs and RL while addressing practical challenges like output validity and diversity.

4. Methodology

The MiniOneRec framework (as illustrated in Figure 2) provides an end-to-end workflow for generative recommendation. It starts by converting item textual information into discrete Semantic IDs (SIDs), incorporates world knowledge from Large Language Models (LLMs) through alignment objectives, and then refines the recommendation policy using supervised fine-tuning (SFT) followed by reinforced preference optimization (RL).

4.1. Principles

The core idea behind MiniOneRec is to leverage the scaling capabilities of autoregressive Transformers, similar to LLMs, for recommendation. Instead of relying on massive and static embedding tables common in traditional recommenders, MiniOneRec represents items as compact sequences of discrete Semantic IDs (SIDs). This allows the LLM backbone to learn complex sequential patterns and generate SID sequences corresponding to recommended items. A key principle is to integrate the LLM's vast world knowledge by aligning its language space with the SID space, thereby enriching the semantic understanding of items. Furthermore, reinforcement learning is employed to optimize the LLM's generation policy directly towards recommendation-specific objectives like ranking accuracy and candidate diversity, overcoming the limitations of standard supervised learning which often treats all negative items equally.

4.2. Core Methodology In-depth (Layer by Layer)

The MiniOneRec framework consists of three main stages: Item Tokenization, Alignment with LLMs (incorporating SFT), and Reinforced Preference Optimization (using RL).

Figure 2: The MiniOneRec framework. RQ-VAE builds the item SID codebook. We then perform SFT to warm up the LLM and obtain an initial alignment. In RL, beam search with constrained decoding ensures the model sequentially produces a ranked list of distinct, valid SIDs. GRPO updates the policy, and SID alignment is enforced end-to-end. This alignment objective is preserved throughout both the SFT and RL stages, fostering deeper semantic understanding.

4.2.1. Task Formulation

Recommendation is formulated as a sequence generation problem. For each user $u$, their historically interacted items are arranged chronologically into a sequence $H_u = [i_1, i_2, \dots, i_T]$. Each item $i_t$ is not represented by its raw ID or text, but by a short sequence of discrete codes, its Semantic ID (SID), typically denoted as $\{ c_0^{i_t}, c_1^{i_t}, c_2^{i_t} \}$. These SIDs are designed to preserve hierarchical semantics through quantization techniques applied to semantic embeddings of the items.

A generative policy, denoted as $\pi_\theta$, is implemented as an autoregressive model with parameters $\theta$. This policy takes the entire interaction history $H_u$ as input and is trained to predict the next item $i^+$ that best matches user $u$'s preferences from the catalog of available items. During inference, the model recursively generates item tokens. To produce a recommendation list, the beam search algorithm is used to keep the $k$ most promising sequences (beams), which are then returned as the recommended items.
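
As a concrete illustration of this formulation, the sketch below shows what next-item inference could look like with a Hugging Face causal LM whose vocabulary has been extended with SID tokens. The backbone name, prompt wording, and SID strings are illustrative assumptions, not the authors' released code; constrained decoding (discussed later) would additionally restrict generation to valid catalog SIDs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative backbone; MiniOneRec post-trains Qwen models of several sizes.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A user's chronological history, already rendered as SID tokens (hypothetical values).
history = "<a_12><b_201><c_7>, <a_12><b_44><c_130>, <a_90><b_3><c_255>"
prompt = (
    "The user has interacted with items "
    f"{history} in chronological order. Recommend the next item."
)
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search keeps the k most promising SID sequences; each item is L = 3 codes.
k = 10
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=3,          # one token per SID level
        num_beams=k,
        num_return_sequences=k,
        early_stopping=True,
    )

# Strip the prompt and decode only the generated SID of each beam.
prompt_len = inputs["input_ids"].shape[1]
ranked_items = [tokenizer.decode(seq[prompt_len:]) for seq in outputs]
print(ranked_items)  # top-k candidates, ordered by beam score
```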

4.2.2. Item Tokenization

The first step in SID-style generative recommenders is to convert each item into a sequence of discrete tokens. MiniOneRec follows TIGER (Rajput et al., 2023) and employs Residual Quantized Variational Autoencoder (RQ-VAE) (Zeghidour et al., 2022) for this purpose. The pipeline is as follows:

  1. Textual Input Preparation: For every item ii, its title and textual description are concatenated into a single sentence.
  2. Semantic Vector Encoding: This concatenated sentence is fed through a frozen text encoder (specifically, Qwen3-Embedding-4B in this work), which produces a $d$-dimensional continuous semantic vector $\mathbf{x} \in \mathbb{R}^d$. This vector captures the item's rich semantic information.
  3. RQ-VAE Application: The RQ-VAE then quantizes this continuous vector x\mathbf{x}. The process involves multiple levels of quantization to progressively refine the representation.
    • The paper sets the number of levels $L = 3$ and the codebook size $K = 256$ for each level. This means each item is represented by three discrete codes (bytes), offering $2^{24}$ possible unique codes, which is sufficient for large catalogs while keeping the vocabulary manageable.
    • The quantization operates by iteratively finding the closest codebook entry and then quantizing the residual error. The residual is initialized as $\mathbf{r}_0 = \mathbf{x}$. At each level $l$ (where $0 \leq l < L$), the process is defined by: $ c_l = \arg\min_k \left\| \mathbf{r}_l - \mathbf{e}_k^{(l)} \right\|_2 , \qquad \mathbf{r}_{l+1} = \mathbf{r}_l - \mathbf{e}_{c_l}^{(l)} . $
      • Symbol Explanation:
        • $c_l$: The index of the chosen codebook entry at level $l$.
        • $\mathbf{r}_l$: The residual vector at level $l$, representing the remaining information to be quantized.
        • $\mathbf{e}_k^{(l)}$: The $k$-th codebook entry (vector) at level $l$.
        • $\left\| \cdot \right\|_2$: The L2 norm, calculating the Euclidean distance between vectors.
        • $\arg\min_k$: The argument (index $k$) that minimizes the expression.
        • $\mathbf{r}_{l+1}$: The updated residual for the next level, calculated by subtracting the chosen codebook entry from the current residual.
  4. Item Token Sequence Formation: The collected indices $(c_0, \dots, c_{L-1})$ form the discrete token sequence for item $i$. These sequences are the SIDs that the subsequent generative recommender model consumes.
  5. Quantized Latent Reconstruction: The quantized latent representation $\mathbf{z}_\mathrm{q}$ is reconstructed by summing the chosen codebook entries across all levels. This reconstructed latent is then passed through a decoder $D(\cdot)$ to reconstruct the original semantic vector $\hat{\mathbf{x}}$: $ \mathbf{z}_\mathrm{q} = \sum_{l=0}^{L-1} \mathbf{e}_{c_l}^{(l)} , \qquad \hat{\mathbf{x}} = D(\mathbf{z}_\mathrm{q}) . $
    • Symbol Explanation:
      • $\mathbf{z}_\mathrm{q}$: The reconstructed quantized latent vector.
      • $\mathbf{e}_{c_l}^{(l)}$: The codebook entry (vector) at level $l$ corresponding to the chosen index $c_l$.
      • $\hat{\mathbf{x}}$: The reconstructed semantic vector.
      • $D(\cdot)$: The decoder function.
  6. Loss Function: The RQ-VAE is trained to minimize a loss function that combines a reconstruction term ($\mathcal{L}_{\mathrm{RECO}}$) and an RQ regularizer ($\mathcal{L}_{\mathrm{RQ}}$): $ \mathcal{L}(\mathbf{x}) = \underbrace{ \| \mathbf{x} - \hat{\mathbf{x}} \|_2^2 }_{\mathcal{L}_{\mathrm{RECO}}} + \underbrace{ \sum_{l=0}^{L-1} \bigl( \| \mathrm{sg}[\mathbf{r}_l] - \mathbf{e}_{c_l}^{(l)} \|_2^2 + \beta \| \mathbf{r}_l - \mathrm{sg}[\mathbf{e}_{c_l}^{(l)}] \|_2^2 \bigr) }_{\mathcal{L}_{\mathrm{RQ}}} , $
    • Symbol Explanation:
      • $\mathcal{L}(\mathbf{x})$: The total loss for a given semantic vector $\mathbf{x}$.
      • $\mathcal{L}_{\mathrm{RECO}}$: The reconstruction loss, measuring the squared L2 difference between the original semantic vector $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$.
      • $\mathcal{L}_{\mathrm{RQ}}$: The RQ regularizer, consisting of two terms:
        • $\| \mathrm{sg}[\mathbf{r}_l] - \mathbf{e}_{c_l}^{(l)} \|_2^2$: The codebook loss, which pulls the selected codebook entry $\mathbf{e}_{c_l}^{(l)}$ towards the encoder output $\mathbf{r}_l$. $\mathrm{sg}$ (stop-gradient) means the gradient does not flow through $\mathbf{r}_l$.
        • $\beta \| \mathbf{r}_l - \mathrm{sg}[\mathbf{e}_{c_l}^{(l)}] \|_2^2$: The commitment loss, which encourages the encoder output $\mathbf{r}_l$ to "commit" to the selected codebook entry $\mathbf{e}_{c_l}^{(l)}$. $\beta$ is a hyper-parameter controlling the strength of this term. $\mathrm{sg}$ here means the gradient does not flow through $\mathbf{e}_{c_l}^{(l)}$.
  7. Warm-Start Initialization: To prevent codebook collapse (where only a few codebook entries are used), the codebooks are initialized with k-means centroids computed on the first training batch, a common practice in RQ-VAE training.
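
The residual quantization step above can be summarized in a few lines of NumPy. This is a minimal sketch of the encoding pass only, assuming already-trained codebooks; the decoder, the training loss, and the k-means warm start are omitted.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Quantize a semantic vector x (shape [d]) into L discrete codes.

    `codebooks` is a list of L arrays of shape [K, d]; in the paper L = 3 and
    K = 256. Returns the SID (c_0, ..., c_{L-1}) and the quantized latent
    z_q = sum of the selected codebook entries.
    """
    residual = x.copy()                      # r_0 = x
    codes, z_q = [], np.zeros_like(x)
    for codebook in codebooks:
        dists = np.linalg.norm(residual[None, :] - codebook, axis=1)
        c = int(np.argmin(dists))            # c_l = argmin_k ||r_l - e_k^(l)||_2
        codes.append(c)
        z_q += codebook[c]
        residual = residual - codebook[c]    # r_{l+1} = r_l - e_{c_l}^(l)
    return codes, z_q

# Toy usage with random codebooks (d = 8, L = 3, K = 256).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
sid, z_q = rq_encode(rng.normal(size=8), codebooks)
print(sid)   # e.g. [c0, c1, c2], rendered as "<a_c0><b_c1><c_c2>" for the LLM
```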

4.2.3. Alignment with LLMs

LLMs possess extensive world knowledge and understanding of human behaviors. MiniOneRec aims to incorporate this knowledge by aligning the LLM's language space with the SID representations, which is an expansion of the original OneRec architecture. The framework augments the LLM's vocabulary with a three-layer codebook (each layer with 256 unique SIDs) and treats these inserted codes as indivisible tokens. This allows the LLM to directly read and write SID sequences.

To bridge the semantic gap between the natural language vocabulary and the new SID space, several alignment objectives are introduced. These tasks are optimized jointly during both the SFT stage and the subsequent RL stage, fostering a deeper semantic understanding. During the RL phase, constrained decoding is applied, ensuring that the model only produces tokens from a predefined list of valid SIDs and their canonical titles, which facilitates rule-based reward computation.

The alignment objectives are categorized into two major groups:

4.2.3.1. Recommendation Tasks

These tasks directly train the LLM to perform recommendation-related predictions.

  1. Generative Retrieval: The LLM receives a chronologically ordered sequence of SIDs representing a user's recent interactions, along with a clear instruction (e.g., "Recommend the next item."). The model is tasked with predicting the SID of the next item the user might engage with.

    Figure 5: Generative Retrieval Prompt.

    Figure 5 from the original paper shows an example of a Generative Retrieval Prompt. The input is a sequence of SIDs, and the response is the predicted SID for the next item.

  2. Asymmetric Item Prediction: This involves two sub-tasks to bridge different modalities:

    • (a) Textual History to SID Prediction: Given a textual user history (e.g., item titles), predict the SID of the next item.

      Figure 6: Asymmetric Item Prediction Prompt1.

      Figure 6 from the original paper shows an example of Asymmetric Item Prediction Prompt1. The input is a natural language description of interacted items, and the response is the predicted SID for the next item.

    • (b) SID-only History to Textual Title Generation: Given a history consisting only of SIDs, generate the textual title of the next expected item.

      Figure 7 from the original paper shows an example of Asymmetric Item Prediction Prompt2. The input is a sequence of SIDs, and the response is the predicted natural language title for the next item.

      Input: The user has interacted with items <a_71><b_44><c_249>, <a_71><b_1714><c_136>, <a_67><b_244><c_35> in chronological order. Can you predict the next possible item's title that the user may expect? Response: [Expected Title]

4.2.3.2. Alignment Tasks

These tasks explicitly enforce a two-way mapping between natural language and the SID space, grounding discrete codes in text and injecting linguistic knowledge into their embeddings.

  1. SID-Text Semantic Alignment: This task trains the model to associate SIDs with their corresponding textual representations.

    • (a) SID to Textual Title Prediction: Predict an item's textual title given its SID.

      Figure 8 from the original paper shows an example of SID-Text Semantic Alignment Prompt1. The input is an SID, and the response is its corresponding textual title.

      Input: Given the SID <a_104><b_60><c_152>, what is the item's title? Response: [Item Title]

    • (b) Textual Title to SID Prediction: Predict the SID from an item's textual title.

      Figure 9: SID-Text Semantic Alignment Prompt2.

      Figure 9 from the original paper shows an example of SID-Text Semantic Alignment Prompt2. The input is a natural language title, and the response is its corresponding SID. This example shows the model predicting <a_202><b_202><c_29> from the input "Wicmas he ash e al elu lp" (which appears to be a tokenized or obscured form of an item title).

  2. Item Description Reconstruction: To imbue SIDs with richer semantics, the model is asked to generate the item description from a single SID and, conversely, infer the SID from a description. This task is performed only during the SFT stage due to the open-ended output space of descriptions.

    Figure 10 from the original paper shows an example of Item Description Reconstruction Prompt. The input is a SID and the task is to generate its description.

    Input: Given the SID <a_104><b_60><c_152>, what is the item's description? Response: [Item Description]

  3. User Preference Summarization: Given a sequence of SIDs representing user interactions, the model generates a short natural-language profile summarizing the user's interests. Since the raw dataset lacks explicit preference labels, DEEPSEEK (DeepSeek-AI et al., 2024) is used to extract summaries from item metadata and user reviews, which serve as pseudo labels. This task is also restricted to the SFT stage due to its open-ended output space.

    Figure 11 from the original paper shows an example of User Preference Summarization Prompt. The input is a sequence of SIDs, and the response is a natural language summary of the user's preferences.

    Input: The user has interacted with items <a_39><b_47><c_1>, <a_39><b_28><c_16>, ...142><c_35> in chronological order. Can you summarize the user's preference? Response: Based on the items purchased, this user has a clear preference for durable, high-performance, and practical supplies for hands-on tasks, electrical work, and general repairs, necessitating dependable personal protective equipment and precision tools.
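
A rough sketch of how such multi-task SFT examples might be assembled is shown below. The prompt wording paraphrases Figures 5-11, and the helper names and field layout are assumptions for illustration rather than the authors' data pipeline.

```python
def sid_string(sid):
    """Render a 3-level SID, e.g. (104, 60, 152) -> '<a_104><b_60><c_152>'."""
    return "<a_{}><b_{}><c_{}>".format(*sid)

def build_examples(history_sids, next_item):
    """history_sids: list of (c0, c1, c2) tuples; next_item: dict with 'sid' and 'title'."""
    hist = ", ".join(sid_string(s) for s in history_sids)
    target = sid_string(next_item["sid"])
    return [
        # Recommendation task: generative retrieval entirely in the SID space.
        {"prompt": f"The user has interacted with items {hist} in chronological order. "
                   "Recommend the next item.",
         "response": target},
        # Alignment task: SID -> textual title.
        {"prompt": f"Given the SID {target}, what is the item's title?",
         "response": next_item["title"]},
        # Alignment task: textual title -> SID.
        {"prompt": f"Given the title \"{next_item['title']}\", what is the item's SID?",
         "response": target},
    ]

examples = build_examples(
    [(71, 44, 249), (67, 244, 35)],
    {"sid": (104, 60, 152), "title": "Kreg SML-C150-100 Pocket Screws"},
)
```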

4.2.4. Reinforced Preference Optimization

After the SFT stage provides an initial policy, MiniOneRec further refines it using Group Relative Policy Optimization (GRPO) (Shao et al., 2024; DeepSeek-AI et al., 2024). GRPO differs from classic Reinforcement Learning with Human Feedback (RLHF) by generating multiple candidates per prompt and normalizing rewards within that group, which helps reduce gradient variance.

The steps for GRPO are:

  1. Roll-out Candidates: For every input prompt $x$ sampled from the data distribution $D$, the current frozen policy $\pi_{\theta_{\mathrm{old}}}$ (the policy before the current update) is used to generate $G$ candidate output sequences, denoted as $\mathcal{V}(x) = \{ y^{(1)}, \dots, y^{(G)} \}$.
  2. Assign Scores: Each candidate sequence $y^{(i)}$ is assigned a scalar score $S_i$ based on a predefined reward function (discussed below).
  3. Standardize Advantages: The scores are then standardized within the group of $G$ candidates to compute advantages: $ \hat{A}_i = \frac{S_i - \mu_{1:G}}{\sigma_{1:G}} , $
    • Symbol Explanation:
      • $\hat{A}_i$: The standardized advantage for the $i$-th candidate sequence.

      • $S_i$: The scalar score (reward) assigned to the $i$-th candidate sequence.

      • $\mu_{1:G}$: The mean of the $G$ rewards (scores) within the current group.

      • $\sigma_{1:G}$: The standard deviation of the $G$ rewards (scores) within the current group.

        The GRPO algorithm then optimizes a surrogate objective function (a minimal code sketch of this update follows the symbol explanations below): $ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x, y^{(i)}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y^{(i)}|} \sum_{t=1}^{|y^{(i)}|} \Bigl( \min\bigl( w_{i,t} \hat{A}_{i,t}, \, \mathrm{clip}(w_{i,t}, 1-\epsilon, 1+\epsilon) \hat{A}_{i,t} \bigr) - \beta \, \mathrm{KL}\bigl[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \bigr] \Bigr) \right] , $

  • Symbol Explanation:
    • $J_{\mathrm{GRPO}}(\theta)$: The GRPO surrogate objective function to be maximized with respect to policy parameters $\theta$.

    • $\mathbb{E}_{x, y^{(i)}}$: Expectation over prompts $x$ and generated candidate sequences $y^{(i)}$.

    • $G$: The number of candidate sequences (roll-outs) generated per prompt.

    • $|y^{(i)}|$: The length of the $i$-th candidate sequence.

    • $w_{i,t}$: The token-level importance ratio for the $t$-th token in the $i$-th sequence, defined as $\frac{\pi_\theta(y_t^{(i)} \mid x, y_{<t}^{(i)})}{\pi_{\theta_{\mathrm{old}}}(y_t^{(i)} \mid x, y_{<t}^{(i)})}$. This ratio measures how much the probability of generating a token has changed under the new policy $\pi_\theta$ compared to the old policy $\pi_{\theta_{\mathrm{old}}}$.

    • $\hat{A}_{i,t}$: The advantage for the $t$-th token of the $i$-th sequence. (The paper uses $\hat{A}_i$ for the sequence, but here implies a token-level application, consistent with the PPO formulation; assuming $\hat{A}_{i,t} = \hat{A}_i$ if tokens are not individually scored.)

    • $\mathrm{clip}(w_{i,t}, 1-\epsilon, 1+\epsilon)$: A clipping function that limits the importance ratio $w_{i,t}$ within the range $[1-\epsilon, 1+\epsilon]$. This stabilizes training by preventing excessively large policy updates.

    • $\epsilon$: A hyper-parameter controlling the clipping range.

    • $\beta$: A hyper-parameter controlling the strength of the KL divergence penalty.

    • $\mathrm{KL}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$: The Kullback-Leibler (KL) divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{\mathrm{ref}}$. This term acts as a regularizer, preventing the new policy from diverging too far from a stable reference (often the policy before the current update or the SFT policy).

      MiniOneRec addresses two key obstacles when applying RL to recommendation:

  1. Unique Generation Space: The action space for recommendation (item SIDs) is discrete and much smaller than natural language vocabularies. This can lead to poor sampling diversity where re-sampling often produces duplicate items.
  2. Sparse Ranking Supervision: A simple binary reward (1 for correct, 0 for incorrect) provides limited guidance on ranking quality. Recommendation models benefit from distinguishing between hard and easy negatives.
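
To make the GRPO update above concrete, here is a minimal PyTorch sketch: group-standardized advantages are broadcast over tokens, importance ratios are clipped PPO-style, and a KL penalty keeps the policy near a reference model. Tensor shapes, the specific KL estimator, and the hyper-parameter values are illustrative assumptions (it also assumes all sequences in the group share the same length).

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """GRPO surrogate for one prompt.

    logp_new / logp_old / logp_ref: [G, T] per-token log-probs of the G sampled
    sequences under the current, roll-out, and reference policies.
    rewards: [G] scalar rewards (in MiniOneRec, the hybrid rule + rank-aware reward).
    Returns a scalar loss to minimize (the negated surrogate objective).
    """
    # Group-standardized advantage, shared by every token of a sequence.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    adv = adv[:, None]                                          # broadcast to [G, T]

    # Token-level importance ratios w_{i,t} with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)                      # [G, T]
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # Per-token KL penalty against the reference policy (k3 estimator, one common choice).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    return -(surrogate - beta * kl).mean()
```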

4.2.4.1. Sampling Strategy

To combat the issue of poor sampling diversity, MiniOneRec investigates two remedies:

  • Diversity Metric: The diversity of generated items is measured using: $ \mathrm{Div}\big( \{ e_k \}_{k=1}^{G} \big) = \frac{ \big| \mathrm{Unique}\big( \{ e_k \}_{k=1}^{G} \big) \big| }{ G } , $

    • Symbol Explanation:
      • $\mathrm{Div}(\cdot)$: The diversity score.
      • $\{ e_k \}_{k=1}^{G}$: The set of $G$ generated items.
      • $|\mathrm{Unique}(\{ e_k \}_{k=1}^{G})|$: The count of unique items among the $G$ generations.
      • $G$: The total number of generated items.
      • Higher values indicate richer supervision for RL.
  • Dynamic Sampling (Yu et al., 2025): This method first over-samples candidates and then intelligently selects a subset. The subset must include the ground-truth item and maximize internal diversity. While helpful, it requires extra forward passes and can still deteriorate as training progresses.

  • Beam Search: MiniOneRec ultimately adopts beam search (without length normalization) as its default sampler. By its nature, beam search ensures that all generated beams (candidate sequences) are distinct. This guarantees zero duplication within each group, offering a better diversity-efficiency trade-off. Constrained beam search is used to ensure all generated items are valid.
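
One common way to implement the constrained beam search described above is to walk a prefix trie of all valid catalog SIDs and allow the decoder to choose only among legal continuations at each step; Hugging Face's `generate` exposes this through its `prefix_allowed_tokens_fn` callback. The sketch below is illustrative, not the authors' implementation, and assumes the catalog's SIDs have already been tokenized into id tuples.

```python
def build_prefix_trie(sid_token_ids):
    """sid_token_ids: iterable of token-id tuples, one per catalog item."""
    trie = {}
    for ids in sid_token_ids:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
    return trie

def make_prefix_allowed_tokens_fn(trie, prompt_len, eos_token_id):
    """Callback compatible with transformers' `prefix_allowed_tokens_fn`."""
    def allowed(batch_id, input_ids):
        node = trie
        for tok in input_ids[prompt_len:].tolist():   # walk the SID prefix generated so far
            node = node.get(tok, {})
        return list(node.keys()) or [eos_token_id]    # stop once a full SID is emitted
    return allowed

# Usage with constrained beam search (continuing the earlier inference sketch):
# fn = make_prefix_allowed_tokens_fn(trie, prompt_len, tokenizer.eos_token_id)
# outputs = model.generate(**inputs, num_beams=G, num_return_sequences=G,
#                          max_new_tokens=3, prefix_allowed_tokens_fn=fn)
```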

4.2.4.2. Reward Design

To provide richer supervision than a simple binary reward, MiniOneRec introduces a rank-aware reward and combines it with a rule-based reward.

  • Rank-Aware Reward: This reward penalizes negative items more strongly if the model was more confident in recommending them (i.e., they appear prominently in the model's own ranking). Given a negative candidate item $e_k$ whose generation probability ranks $\rho_k$ (where $\rho_k = 1$ means it is the most probable), and the true item $e_t$: $ \tilde{R}_{\mathrm{rank}}(e_k, e_t) = \begin{cases} 0, & e_k = e_t, \\ -\frac{1}{\log(\rho_k + 1)}, & \text{otherwise}, \end{cases} \qquad R_{\mathrm{rank}}(e_k, e_t) = - \frac{ \tilde{R}_{\mathrm{rank}}(e_k, e_t) }{ \sum_{j=1}^{G} \tilde{R}_{\mathrm{rank}}(e_j, e_t) } . $

    • Symbol Explanation:
      • $\tilde{R}_{\mathrm{rank}}(e_k, e_t)$: The unnormalized rank-aware reward for candidate $e_k$ given the true item $e_t$.
      • $e_k$: A candidate item generated by the policy.
      • $e_t$: The ground-truth (target) item.
      • $\rho_k$: The rank of item $e_k$ among the generated candidates, based on the model's generation probabilities (1 for the most probable, 2 for the second, etc.).
      • $R_{\mathrm{rank}}(e_k, e_t)$: The normalized rank-aware reward. The negative sign ensures that lower scores are assigned to hard negatives (those with low $\rho_k$).
      • $\sum_{j=1}^{G} \tilde{R}_{\mathrm{rank}}(e_j, e_t)$: The sum of unnormalized rank-aware rewards for all $G$ candidates, used to normalize the reward.
  • Final Hybrid Reward: The final reward signal combines the rule-based accuracy term with the ranking-aware component: $ R ( e _ { k } , e _ { t } ) = R _ { \mathrm { r u l e } } ( e _ { k } , e _ { t } ) + R _ { \mathrm { r a n k } } ( e _ { k } , e _ { t } ) , $

    • Symbol Explanation:
      • $R(e_k, e_t)$: The total reward for candidate $e_k$ given true item $e_t$.

      • $R_{\mathrm{rule}}(e_k, e_t)$: The rule-based reward term.

      • $R_{\mathrm{rank}}(e_k, e_t)$: The rank-aware reward term.

        The rule-based term simply assigns a positive reward for correctly predicting the true item and zero otherwise: $ R_{\mathrm{rule}}(e_k, e_t) = \begin{cases} 1, & e_k = e_t, \\ 0, & \text{otherwise}. \end{cases} $

    • Symbol Explanation:
      • $R_{\mathrm{rule}}(e_k, e_t)$: The rule-based reward for candidate $e_k$ given true item $e_t$.

      • 1: Reward for a correct prediction.

      • 0: Reward for an incorrect prediction.

        MiniOneRec ultimately adopts this combined ranking-and-rule reward as its default choice. The paper also explored using collaborative rewards (e.g., logits from a pre-trained collaborative-filtering model), but found that this led to reward hacking and performance degradation, indicating a misalignment with the true objective.
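
Putting the two terms together, a minimal Python sketch of the hybrid reward for one group of beam candidates might look as follows. The log base is left as the natural log (the paper does not specify it), and the item ids and ranks are toy values.

```python
import math

def hybrid_rewards(candidates, target, ranks):
    """Rule-based + rank-aware reward for a group of G distinct beam candidates.

    candidates: list of G generated item ids; target: the ground-truth item id;
    ranks: each candidate's rank under the model's own generation probabilities
    (1 = most probable).
    """
    # Unnormalized rank-aware term: 0 for the true item, harsher (more negative)
    # penalties for negatives the model itself ranked highly (hard negatives).
    raw = [0.0 if c == target else -1.0 / math.log(r + 1)
           for c, r in zip(candidates, ranks)]
    denom = sum(raw)
    rank_term = [0.0 if denom == 0 else -x / denom for x in raw]
    # Rule-based accuracy term: 1 for the true item, 0 otherwise.
    rule_term = [1.0 if c == target else 0.0 for c in candidates]
    return [a + b for a, b in zip(rule_term, rank_term)]

# Toy usage: the true item "i2" was the model's second-ranked beam out of four.
print(hybrid_rewards(["i9", "i2", "i5", "i7"], "i2", ranks=[1, 2, 3, 4]))
# -> roughly [-0.52, 1.0, -0.26, -0.22]: hard negatives get the strongest penalty.
```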

5. Experimental Setup

This section details the empirical study conducted to evaluate MiniOneRec.

5.1. Datasets

The experiments are conducted on two real-world subsets of the Amazon Review dataset (Hou et al., 2024):

  • Industrial_and_Scientific

  • Office_Products

    These datasets were chosen because they represent diverse product domains and are publicly available, allowing for reproducibility.

To manage computational costs, the datasets undergo a trimming strategy based on (Bao et al., 2024):

  1. Filtering: Users and items with fewer than five interactions are removed. This ensures sufficient interaction history for sequential modeling.

  2. Time-based Filtering:

    • For Toys_and_Games (MiniOneRec itself uses the Industrial and Office subsets; this illustrates the general trimming strategy), events from October 2016 to November 2018 are kept.
    • For Industrial_and_Scientific (a smaller dataset), all events between October 1996 and November 2018 are retained.
  3. Sequence Truncation: Each user's interaction history is truncated to a maximum of ten items. This limits the input sequence length for the Transformer models.

  4. Data Splitting: Each dataset is chronologically split into training, validation, and test sets with an 8:1:1 ratio. This ensures that the model is evaluated on future interactions.

    The following are the statistics of the resulting splits, from Table 4 of the original paper:

Datasets    Industrial    Office
Items       3,685         3,459
Train       36,259        38,924
Valid       4,532         4,866
Test        4,533         4,866

Data Sample: While the paper does not provide a specific image of a data sample, the nature of the data involves item titles, textual descriptions (used for SID generation), and chronological sequences of user-item interactions. For instance, a user's history might look like: [item_A_SID, item_B_SID, item_C_SID], where item_A_SID corresponds to the SID generated from the title and description of "Kreg SML-C150-100 Pocket Screws...".
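
For reference, a rough pandas sketch of the trimming strategy described in this section is given below. The column names, the global (rather than per-user) chronological split, and the exact filtering loop are assumptions for illustration; the paper's released code should be consulted for the authoritative preprocessing.

```python
import pandas as pd

def preprocess(interactions: pd.DataFrame, max_len: int = 10):
    """interactions: columns assumed to be user_id, item_id, timestamp."""
    df = interactions.sort_values("timestamp")

    # Iterative 5-core filtering: drop users and items with fewer than 5 interactions.
    while True:
        user_counts = df["user_id"].map(df["user_id"].value_counts())
        item_counts = df["item_id"].map(df["item_id"].value_counts())
        keep = (user_counts >= 5) & (item_counts >= 5)
        if keep.all():
            break
        df = df[keep]

    # Truncate each user's history to the most recent `max_len` interactions.
    df = df.groupby("user_id").tail(max_len)

    # Chronological 8:1:1 split.
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    valid = df.iloc[int(0.8 * n): int(0.9 * n)]
    test = df.iloc[int(0.9 * n):]
    return train, valid, test
```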

5.2. Evaluation Metrics

The top-$K$ recommendation accuracy is measured using standard metrics:

  1. Hit Rate (HR@K):

    • Conceptual Definition: HR@K measures the proportion of users for whom the target item (the next item they actually interacted with) is present in the top $K$ items recommended by the model. It is a recall-based metric, indicating whether the model successfully retrieved the relevant item, irrespective of its position within the top $K$.
    • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users with hit in top K}}{\text{Total number of users}} $
    • Symbol Explanation:
      • Number of users with hit in top K: The count of unique users for whom the ground-truth item was found among the top $K$ recommendations.
      • Total number of users: The total number of users in the evaluation set for whom a recommendation was generated.
  2. Normalized Discounted Cumulative Gain (NDCG@K):

    • Conceptual Definition: NDCG@K is a ranking-aware metric that evaluates the quality of a recommendation list by giving higher scores to relevant items that appear at higher (better) positions. It accounts for both the presence of relevant items and their rank. The normalization ensures that NDCG@K values are comparable across different queries and recommendation list lengths.
    • Mathematical Formula: First, the Discounted Cumulative Gain (DCG@K) is calculated: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ Then, the Ideal Discounted Cumulative Gain (IDCG@K) is computed, which is the maximum possible DCG@K if all relevant items were perfectly ranked: $ \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_{i_{ideal}}} - 1}{\log_2(i+1)} $ Finally, NDCG@K is derived as: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $K$: The maximum rank position to consider in the recommendation list.
      • $rel_i$: The relevance score of the item at position $i$ in the actual recommended list. For recommendation tasks, this is often binary (1 if the item is the true next item, 0 otherwise).
      • $rel_{i_{ideal}}$: The relevance score of the item at position $i$ in the ideal recommendation list (i.e., where all relevant items are listed first).
      • $\log_2(i+1)$: A logarithmic discount factor that penalizes relevant items appearing at lower ranks.

5.3. Baselines

MiniOneRec is compared against three categories of representative baseline models:

  1. Traditional Recommendation Models: These are established sequential recommenders.

    • GRU4Rec (Hidasi et al., 2016): A Gated Recurrent Unit (GRU)-based model for session-based recommendations.
    • Caser (Tang and Wang, 2018): Uses convolutional neural networks (CNNs) to capture both general and personalized sequential patterns.
    • SASRec (Kang and McAuley, 2018): A self-attentive sequential recommendation model that uses Transformers to capture long-range dependencies in user behavior sequences.
    • Why representative? They represent the state-of-the-art in non-generative, embedding-based sequential recommendation prior to the LLM era.
  2. Generative Recommendation Models: These models also frame recommendation as a generation task, often using SIDs.

    • HSTU (Zhai et al., 2024): A generative model with a streaming architecture.
    • TIGER (Rajput et al., 2023): Uses RQ-VAE for SID generation and a Transformer backbone, similar to MiniOneRec's SID construction.
    • LC-Rec (Zheng et al., 2024): Aligns an LLM with SIDs using multi-task learning.
    • Why representative? They are direct competitors in the generative recommendation paradigm, allowing comparison of MiniOneRec's specific architectural and training choices.
  3. LLM-based Recommendation Models: These are more recent models that leverage LLMs for recommendation.

    • BigRec (Bao et al., 2023): A method leveraging LLMs for recommendation.

    • D3 (Bao et al., 2024): Another LLM-based recommendation model.

    • S-DPO (Chen et al., 2024b): An LLM-based approach that adapts Direct Preference Optimization (DPO) for recommendation.

    • Why representative? They represent the cutting edge of integrating LLMs into recommendation, testing MiniOneRec against methods that also exploit LLM capabilities.

      All LLM-powered systems, including MiniOneRec, share the Qwen2.5-Instruct backbone and use the AdamW optimizer for fair comparison. Training details (learning rates, batch sizes, epochs) are carefully specified for each stage and baseline to ensure a rigorous evaluation. For conventional recommenders, binary cross-entropy and Adam optimizer are used, with hyper-parameter tuning for learning rates and weight decay.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Scaling

The paper demonstrates the scaling capabilities of MiniOneRec by observing how the loss curves behave with increasing model size.

Figure 1 from the original paper. Left: scaling curves from 0.5B to 7B parameters, plotting training FLOPs against converged loss. Right: the effect of world knowledge, reporting HR@10 at both the SFT and SFT-then-RL stages for three variants: MiniOneRec-w/o Align (pretrained LLM weights but no SID-text alignment), MiniOneRec-Scratch (random initialization and no alignment), and the fully aligned MiniOneRec.

Figure 3 from the original paper plots evaluation loss against SFT training epoch for Qwen2.5-Instruct backbones of different sizes; evaluation loss falls steadily over training, and larger models sit consistently lower.

Analysis: As shown in Figure 1 (Left), there is a clear and consistent downward trend in convergence loss as the model size of MiniOneRec increases from 0.5B to 7B parameters. This indicates that larger models are better able to learn the underlying patterns in the data and achieve lower training errors. Figure 3 further reinforces this finding by tracking the evaluation loss on the SFT training set over epochs. Models with larger parameter counts consistently maintain lower evaluation losses throughout the training process and converge more rapidly. This directly validates the parameter efficiency of the generative paradigm in recommendation, confirming that scaling laws (where increased capacity leads to consistent performance gains) indeed hold on public datasets when using autoregressive Transformers with SIDs. This superior scaling effect suggests that generative recommenders have the potential to be the next generation of recommendation models.
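To illustrate what such a downward trend means quantitatively, a common way to summarize converged loss versus model size is to fit a power law of the form $L(N) = aN^{-b}$. The sketch below uses entirely hypothetical loss values, not the paper's measurements, and is only meant to show the fitting procedure.

```python
import numpy as np

# Hypothetical converged-loss values for illustration only; NOT the paper's numbers.
params = np.array([0.5e9, 1.5e9, 3e9, 7e9])   # model sizes N (0.5B to 7B)
loss = np.array([0.92, 0.85, 0.80, 0.76])     # converged evaluation loss L(N)

# Fit L(N) = a * N^(-b) by linear regression in log-log space.
slope, log_a = np.polyfit(np.log(params), np.log(loss), deg=1)
a, b = np.exp(log_a), -slope
print(f"fitted prefactor a = {a:.3f}, exponent b = {b:.4f}")

# A roughly straight log-log line (a stable exponent b) is the usual signature
# of the scaling behavior the paper reports for its 0.5B-7B backbones.
```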

6.1.2. Performance Comparison

The paper benchmarks MiniOneRec against various baselines on the Industrial and Office Amazon Review datasets.

The following are the results from Table 1 of the original paper:

| Dataset | Category | Method | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| Industrial | Traditional | GRU4Rec | 0.0638 | 0.0542 | 0.0774 | 0.0598 | 0.0999 | 0.0669 |
| Industrial | Traditional | Caser | 0.0618 | 0.0514 | 0.0717 | 0.0555 | 0.0942 | 0.0628 |
| Industrial | Traditional | SASRec | 0.0790 | 0.0700 | 0.0909 | 0.0748 | 0.1088 | 0.0806 |
| Industrial | Generative | HSTU | 0.0927 | 0.0885 | 0.1037 | 0.0918 | 0.1163 | 0.0958 |
| Industrial | Generative | TIGER | 0.0852 | 0.0742 | 0.1010 | 0.0807 | 0.1321 | 0.0908 |
| Industrial | Generative | LC-Rec | 0.0915 | 0.0805 | 0.1057 | 0.0862 | 0.1332 | 0.0952 |
| Industrial | LLM-based | BIGRec | 0.0931 | 0.0841 | 0.1092 | 0.0907 | 0.1370 | 0.0997 |
| Industrial | LLM-based | D3 | 0.1024 | 0.0991 | 0.1213 | 0.0989 | 0.1500 | 0.1082 |
| Industrial | LLM-based | S-DPO | 0.1032 | 0.0906 | 0.1238 | 0.0991 | 0.1524 | 0.1082 |
| Industrial | Ours | MiniOneRec | 0.1143 | 0.1011 | 0.1321 | 0.1084 | 0.1586 | 0.1167 |
| Office | Traditional | GRU4Rec | 0.0629 | 0.0528 | 0.0789 | 0.0595 | 0.1019 | 0.0669 |
| Office | Traditional | Caser | 0.0748 | 0.0615 | 0.0865 | 0.0664 | 0.1093 | 0.0737 |
| Office | Traditional | SASRec | 0.0861 | 0.0769 | 0.0949 | 0.0805 | 0.1120 | 0.0858 |
| Office | Generative | HSTU | 0.1134 | 0.1031 | 0.1252 | 0.1079 | 0.1400 | 0.1126 |
| Office | Generative | TIGER | 0.0986 | 0.0852 | 0.1163 | 0.0960 | 0.1408 | 0.1002 |
| Office | Generative | LC-Rec | 0.0921 | 0.0807 | 0.1048 | 0.0859 | 0.1237 | 0.0920 |
| Office | LLM-based | BIGRec | 0.1069 | 0.0961 | 0.1204 | 0.1017 | 0.1434 | 0.1091 |
| Office | LLM-based | D3 | 0.1204 | 0.1055 | 0.1406 | 0.1139 | 0.1634 | 0.1213 |
| Office | LLM-based | S-DPO | 0.1169 | 0.1033 | 0.1356 | 0.1110 | 0.1587 | 0.1255 |
| Office | Ours | MiniOneRec | 0.1217 | 0.1088 | 0.1420 | 0.1172 | 0.1634 | 0.1242 |

Analysis: Table 1 reveals two critical insights:

  1. Utility of LLM World Knowledge: Models powered by LLMs (e.g., BIGRec, D3, S-DPO) consistently outperform traditional recommenders (GRU4Rec, Caser, SASRec) across all metrics and both datasets. This strongly suggests that the vast world knowledge and general reasoning abilities embedded in LLMs translate into better recommendation accuracy. Even generative models without explicit LLM alignment (TIGER) show improvements over traditional methods, but those with LLM integration (BIGRec, D3, S-DPO) further extend this lead.
  2. Effectiveness of MiniOneRec: MiniOneRec consistently achieves the highest scores across most reported metrics on both the Industrial and Office datasets, outperforming all traditional, generative, and other LLM-based solutions. For instance, on the Industrial dataset, MiniOneRec scores HR@3 of 0.1143 and NDCG@10 of 0.1167, surpassing S-DPO's 0.1032 and 0.1082 respectively. Similarly, on the Office dataset, MiniOneRec achieves HR@3 of 0.1217 and NDCG@10 of 0.1242, which are competitive with or better than the leading baselines (D3 and S-DPO). This superior performance is attributed to MiniOneRec's design:
    • Full Generation Process Alignment: By aligning the entire generation process with the task objective.
    • Reinforced Preference Optimization: Utilizing RL with its carefully designed constrained decoding and hybrid rewards.
    • Compact SID Space: Operating in the compact SID space reduces context tokens, leading to faster inference, lower latency, and smaller memory footprints at serving time, addressing practical deployment concerns.

6.1.3. Transferability

The paper evaluates the out-of-distribution (OOD) robustness of MiniOneRec through an experiment framed as SID pattern discovery: the model is trained exclusively on the Industrial domain and then deployed, without any further tuning, on the unseen Office domain. A MiniOneRec-w/ RL-OOD variant, trained with RL only (skipping SFT) to emphasize generalization, is included.

The following are the results from Table 2 of the original paper:

| Dataset | Method | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| Office | GRU4Rec | 0.0629 | 0.0528 | 0.0789 | 0.0595 | 0.1019 | 0.0669 |
| Office | Qwen-Text | 0.0031 | 0.0021 | 0.0044 | 0.0026 | 0.0057 | 0.0030 |
| Office | Qwen-SID | 0.0300 | 0.0214 | 0.0456 | 0.0282 | 0.0733 | 0.0373 |
| Office | MiniOneRec-w/ RL-OOD | 0.0553 | 0.0433 | 0.0691 | 0.0489 | 0.0892 | 0.0553 |

Analysis: Table 2 highlights the importance of SIDs and the transferability of RL-trained models.

  • Qwen-Text performs very poorly, indicating that directly processing raw text for OOD prediction without specific fine-tuning is ineffective.
  • Qwen-SID (using SID tokens but no fine-tuning) performs noticeably better than Qwen-Text, suggesting that a structured SID vocabulary makes it easier for an LLM to extract patterns, even without domain-specific training.
  • MiniOneRec-w/RL-OOD (trained via GRPO on Industrial only and evaluated on Office) achieves competitive accuracy compared to GRU4Rec (which was trained directly on the Office domain). While it falls short of the full MiniOneRec (which would be trained on Office), its reinforcement-only training provides excellent transferability. This implies that MiniOneRec successfully uncovers reusable interaction patterns that generalize across domains, despite substantial domain shift and potential semantic drift among SIDs. This underscores the framework's promise for cross-domain recommendation.

6.1.4. Pre-trained LLM Impact

The paper investigates the impact of pre-trained LLMs by comparing two MiniOneRec variants: one initialized from a general-purpose pre-trained LLM and another trained from scratch with random weights.

The following are the results from Table 3 of the original paper:

| Dataset | Method | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| Industrial | MiniOneRec-scratch | 0.0757 | 0.0672 | 0.0891 | 0.0726 | 0.1134 | 0.0804 |
| Industrial | MiniOneRec | 0.1125 | 0.0988 | 0.1259 | 0.1046 | 0.1546 | 0.1139 |
| Office | MiniOneRec-scratch | 0.0959 | 0.0855 | 0.1057 | 0.0896 | 0.1196 | 0.0941 |
| Office | MiniOneRec | 0.1217 | 0.1088 | 0.1420 | 0.1172 | 0.1634 | 0.1242 |

Analysis: Table 3 shows a consistent and significant pattern across both Industrial and Office datasets: MiniOneRec initialized with pre-trained weights (MiniOneRec) substantially outperforms its randomly initialized counterpart (MiniOneRec-scratch). For example, on the Industrial dataset, MiniOneRec achieves an HR@10 of 0.1546, significantly higher than MiniOneRec-scratch's 0.1134. Similar improvements are seen on the Office dataset.

The authors attribute this to two main factors:

  1. General Reasoning Ability: The general reasoning ability acquired during large-scale language pre-training allows the model to interpret the next-SID prediction task as a problem of pattern discovery, making it more effective at learning user preferences.
  2. Factual Knowledge: The factual knowledge already encoded in the pre-trained LLM provides a head start in understanding the real-world semantics behind each SID, which can be transferred to the recommendation domain. This highlights the critical role of pre-trained LLMs as powerful backbones for generative recommendation, providing a strong foundation that would otherwise require extensive training from scratch.

6.2. Ablation Studies / Parameter Analysis

The paper conducts ablation studies to validate the effectiveness of MiniOneRec's individual components.

Figure 4 from the original paper studies the effectiveness of MiniOneRec's individual components: Figure 4a examines model performance under different alignment strategies, Figure 4b investigates various sampling strategies, and Figure 4c evaluates the impact of alternative reward designs.

6.2.1. Aligning Strategy

This ablation study examines the impact of different language-SID alignment strategies on model performance.

  • MiniOneRec-w/o Align: Removes any language-SID alignment, treating recommendation purely as a SID-to-SID task.
  • MiniOneRec-w/ SFT-Align: Keeps the alignment objective only during the SFT stage, while RL uses SID data alone.
  • MiniOneRec-w/ RL-Align: SFT relies solely on SID supervision, and alignment tasks are introduced only later, in the RL stage.
  • MiniOneRec (full model): Maintains alignment throughout the entire pipeline (SFT and RL).

Analysis (Figure 4a): Figure 4a clearly shows that the complete MiniOneRec model, which maintains SID alignment throughout both the SFT and RL stages, delivers the highest scores across all metrics (HR@K and NDCG@K). MiniOneRec-w/o Align performs the worst. This indicates that grounding SID generation in world knowledge (via language-SID alignment) is essential for effective generative recommendation. The variants that only align during SFT or RL (MiniOneRec-w/ SFT-Align, MiniOneRec-w/ RL-Align) perform better than the unaligned variant but worse than the full MiniOneRec, underscoring the importance of continuous, full-process alignment for maximizing performance.
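For intuition, the following is a minimal sketch of what a language-SID alignment training pair might look like; the prompt wording, the example item title, and the `<a_12><b_7><c_3>` SID token format are illustrative assumptions rather than the paper's exact templates.

```python
# Hypothetical SID-text alignment examples; both the prompts and the SID
# token format "<a_12><b_7><c_3>" are illustrative assumptions.
item_title = "Stanley 5-Inch Needle Nose Plier"
item_sid = "<a_12><b_7><c_3>"

sid_to_text = {
    "prompt": f"What item does the semantic ID {item_sid} refer to?",
    "response": item_title,
}
text_to_sid = {
    "prompt": f"Give the semantic ID of the item: {item_title}",
    "response": item_sid,
}

# Such pairs are mixed with the next-SID prediction data so that SID tokens
# stay grounded in the LLM's world knowledge during both SFT and RL.
```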

6.2.2. Sampling Strategy

This study investigates how different roll-out methods for generating candidate sequences affect MiniOneRec during the RL stage.

  • MiniOneRec-Common: Uses a plain Top-k decoder to produce exactly the required number of candidate paths. This method often produces many duplicate sequences.
  • MiniOneRec-Dynamic: Employs a two-step sampler: it first over-samples (e.g., 1.5 times the budget) and then retains as many unique sequences as possible for RL.
  • MiniOneRec (full model): Adopts beam search with a width of 16, ensuring distinct candidate sequences.

Analysis (Figure 4b): Figure 4b demonstrates that the complete MiniOneRec (using beam search) achieves the highest accuracy among the tested sampling strategies, and it does so more cost-efficiently, using approximately two-thirds of the samples required by the dynamic variant. This indicates that beam search is the most effective and economical choice for generating diverse, high-quality candidate trajectories during RL, outperforming both plain Top-k decoding and the more complex dynamic sampling, which still struggles with duplicate candidates.
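The contrast between duplicate-prone sampling and beam-search roll-outs can be illustrated with the Hugging Face `generate` API. This is only a sketch under the assumption of a generic Qwen2.5 checkpoint and a toy prompt; the actual MiniOneRec roll-out additionally applies constrained decoding so that only valid SID sequences can be produced.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; MiniOneRec fine-tunes Qwen2.5 backbones with SID tokens.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The user interacted with <a_3><b_9><c_1>. Predict the next item's semantic ID:"
inputs = tokenizer(prompt, return_tensors="pt")

# Top-k sampling: 16 independent draws, so duplicate sequences are likely.
sampled = model.generate(
    **inputs, do_sample=True, top_k=50, num_return_sequences=16, max_new_tokens=8
)

# Beam search with width 16: the 16 returned hypotheses are distinct by construction.
beamed = model.generate(
    **inputs, num_beams=16, num_return_sequences=16, do_sample=False, max_new_tokens=8
)
print(len(set(tokenizer.batch_decode(beamed, skip_special_tokens=True))))
```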

6.2.3. Reward Design

This ablation study evaluates the impact of alternative reward designs for reinforcement learning.

  • MiniOneRec-w/ Acc: Relies solely on a binary correctness signal (1 for the correct item, 0 otherwise). This is a basic rule-based reward.
  • MiniOneRec-w/ Collaborative: Replaces the ranking term in the hybrid reward with logits taken from a frozen SASRec model, aiming to inject collaborative cues.
  • MiniOneRec (full model): Uses the default hybrid reward combining rule-based accuracy and the ranking-aware penalty.

Analysis (Figure 4c): Figure 4c shows that the full MiniOneRec model, with its hybrid reward (combining rule-based and rank-aware components), achieves the best overall performance. MiniOneRec-w/ Acc performs worse, highlighting the limitations of simple binary rewards, which fail to distinguish between different types of negative items or provide nuanced ranking supervision. Interestingly, MiniOneRec-w/ Collaborative performs significantly worse than even MiniOneRec-w/ Acc. The authors hypothesize that this degradation is due to reward hacking: the collaborative reward signal may become misaligned with the true objective, causing the model to optimize for a proxy that does not correspond to actual recommendation accuracy. This finding emphasizes the importance of carefully designing reward functions in RL for recommendation, where domain-specific considerations (such as ranking and hard negatives) are crucial.
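To ground the discussion, here is a hedged sketch of what a hybrid accuracy-plus-rank reward could look like; the exact functional form, the weighting coefficient `alpha`, and the hard-negative penalty used in the paper are not reproduced, so every numeric choice below is an assumption.

```python
def hybrid_reward(generated_sid, target_sid, beam_rank, beam_size, alpha=0.5):
    """Illustrative hybrid reward: rule-based accuracy plus a rank-aware penalty.

    - accuracy: 1.0 if the generated SID exactly matches the ground truth, else 0.0
    - rank term: incorrect candidates that the model ranks highly (small beam_rank)
      are treated as hard negatives and penalized more strongly.
    """
    accuracy = 1.0 if generated_sid == target_sid else 0.0
    if accuracy == 1.0:
        return 1.0
    # Linear penalty in [0, alpha]: rank 0 (top of the beam) is penalized most.
    rank_penalty = alpha * (1.0 - beam_rank / max(beam_size - 1, 1))
    return -rank_penalty

# Example: an incorrect candidate at the top of a 16-wide beam gets -0.5.
print(hybrid_reward("<a_1><b_2><c_3>", "<a_9><b_9><c_9>", beam_rank=0, beam_size=16))
```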

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces MiniOneRec, a pioneering fully open-source framework for generative recommendation. It provides a comprehensive, end-to-end workflow covering Semantic ID (SID) construction via RQ-VAE, supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL). A key contribution is the systematic validation of scaling laws on public benchmarks, demonstrating that larger generative recommenders (from 0.5B to 7B parameters) consistently achieve lower training and evaluation losses. This confirms the superior parameter-efficiency of the SID-based paradigm compared to traditional embedding-centric models.

MiniOneRec further proposes an optimized post-training pipeline centered on two main techniques:

  1. Full-process SID alignment: This embeds SID tokens into the model vocabulary and enforces auxiliary alignment tasks across both SFT and RL stages, effectively grounding SID generation in the LLM's world knowledge.

  2. Reinforced preference optimization: This utilizes Group Relative Policy Optimization (GRPO) with practical enhancements including constrained decoding (ensuring valid item generation), beam-based sampling (for candidate diversity), and a hybrid reward design (combining rule-based accuracy with a ranking-aware penalty for hard negatives).

    Extensive experiments on the Amazon Review dataset show that MiniOneRec consistently outperforms strong traditional, generative, and LLM-based baselines in ranking accuracy and candidate diversity, all while maintaining a lean post-training footprint. The framework also demonstrates robust transferability across domains and highlights the critical impact of pre-trained LLM weights on performance.

7.2. Limitations & Future Work

The paper explicitly outlines a roadmap for future developments rather than direct limitations, but these implicitly suggest current constraints or areas for improvement:

  • Codebase Maintenance and Extension: The immediate future work involves continuously maintaining and extending the MiniOneRec codebase.

  • New Datasets: Future developments will include support for new datasets to further test the generalizability and performance of the framework across diverse domains.

  • Advanced Tokenization Schemes: Exploring more advanced tokenization schemes beyond the current RQ-VAE could further improve SID quality and semantic richness.

  • Larger Backbone Models: The current study scales up to 7B parameters. Future work aims to incorporate larger backbone models to investigate the limits of scaling laws in generative recommendation.

  • Enhanced Training Pipelines: Continuously improving the training pipelines (e.g., more efficient RL algorithms, better SFT strategies) is a direction for future research.

    While not explicitly stated as limitations, some implicit areas for consideration could be:

  • Computational Cost: Despite being lightweight, RL training, especially with beam search and large LLMs, can still be computationally intensive.

  • Complexity of SID Generation: The RQ-VAE based SID generation process adds a preprocessing step and requires careful tuning. Its scalability for extremely large and dynamic item catalogs might be a challenge.

  • Reward Hacking: The paper itself notes the issue of reward hacking when using collaborative rewards, indicating that reward function design remains a delicate balance.

7.3. Personal Insights & Critique

MiniOneRec represents a significant step forward in making generative recommendation accessible and scientifically rigorous.

Personal Insights:

  • Open-Source Impact: The commitment to an open-source framework is invaluable. It democratizes research in a field heavily dominated by proprietary industrial solutions, fostering transparency, reproducibility, and collaborative innovation. This alone could accelerate the adoption and development of generative recommenders.
  • Rigorous Validation of Scaling Laws: The systematic validation of scaling laws on public benchmarks is crucial. It provides concrete evidence that the LLM paradigm of "more data, more parameters, better performance" holds for recommendation, guiding future research toward building larger, more capable models.
  • Practical Post-training Recipe: The detailed and effective post-training pipeline, particularly the full-process SID alignment and hybrid reward design in RL, offers a practical blueprint for researchers and practitioners. The ablation studies clearly demonstrate the value of each component, making the methodology transparent and adaptable.
  • Leveraging LLM World Knowledge: The emphasis on integrating LLM's world knowledge is a powerful insight. It moves beyond treating LLMs merely as sequence generators and instead leverages their pre-trained semantic understanding to enrich item representations and user preferences.
  • Addressing RL Challenges: The paper effectively addresses common challenges in applying RL to recommendation, such as poor sampling diversity and sparse ranking supervision, through constrained decoding, beam search, and rank-aware rewards. This shows a deep understanding of the practicalities of RL in this domain.

Critique & Potential Issues:

  • Definition of "Minimal Recipe": While the paper refers to its post-training as "minimal" or "lightweight," it still involves multi-stage training (RQ-VAE, SFT, RL), careful reward engineering, and sophisticated sampling strategies. For a true beginner or smaller organizations, this might still represent a significant undertaking compared to simpler embedding-based models. The term "minimal" might be relative to the complexity of industrial-scale RLHF systems, but the overall pipeline remains non-trivial.

  • Complexity of SID Interpretation: While SIDs are efficient for models, their direct interpretability for humans or for debugging purposes might be limited. The mapping from SID back to human-readable item features is handled by the LLM during alignment tasks, but errors in SID generation or interpretation could be hard to trace.

  • Generalizability Beyond Amazon Reviews: The experiments are primarily conducted on Amazon Review subsets. While the transferability study to an unseen Amazon domain is promising, further validation on vastly different types of datasets (e.g., news, movies, diverse cultural contexts) would strengthen the claims of general applicability. Semantic IDs for highly abstract or subjective items might also pose challenges.

  • Cost of Large Backbones: While scaling laws are validated, the computational cost associated with training and serving 7B+ parameter LLMs remains substantial, even with optimized methods. This could be a barrier for widespread adoption outside of well-resourced institutions.

  • Dynamic Catalog Updates: The paper mentions HSTU for handling non-stationary logs. While RQ-VAE allows for dynamic generation of SIDs, the efficiency and stability of constantly updating SID codebooks or adapting LLMs to rapidly changing item catalogs is an area that could benefit from more detailed discussion.

    Overall, MiniOneRec offers a robust and forward-looking framework, providing valuable insights and a solid foundation for future research in generative recommendation. Its open-source nature is a commendable contribution to the research community.
