MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
TL;DR Summary
MiniOneRec, the first open-source generative recommendation framework, uses Residual Quantized VAE for SID and post-trains 0.5B–7B parameter Qwen models, confirming scaling benefits and improving ranking accuracy and diversity via aligned SID processing and constrained RL.
Abstract
The recent success of large language models (LLMs) has renewed interest in whether recommender systems can achieve similar scaling benefits. Conventional recommenders, dominated by massive embedding tables, tend to plateau as embedding dimensions grow. In contrast, the emerging generative paradigm replaces embeddings with compact Semantic ID (SID) sequences produced by autoregressive Transformers. Yet most industrial deployments remain proprietary, leaving two fundamental questions open: (1) Do the expected scaling laws hold on public benchmarks? (2) What is the minimal post-training recipe that enables competitive performance? We present MiniOneRec, to the best of our knowledge, the first fully open-source generative recommendation framework, which provides an end-to-end workflow spanning SID construction, supervised fine-tuning, and recommendation-oriented reinforcement learning. We generate SIDs via a Residual Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters on the Amazon Review dataset. Our experiments reveal a consistent downward trend in both training and evaluation losses with increasing model size, validating the parameter efficiency of the generative approach. To further enhance performance, we propose a lightweight yet effective post-training pipeline that (1) enforces full-process SID alignment and (2) applies reinforcement learning with constrained decoding and hybrid rewards. Together, these techniques yield significant improvements in both ranking accuracy and candidate diversity.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
MiniOneRec: An Open-Source Framework for Scaling Generative Recommendation
1.2. Authors
Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He. The authors are affiliated with the University of Science and Technology of China and the National University of Singapore. Their research backgrounds appear to be in computer science, with a focus on recommender systems, large language models (LLMs), and reinforcement learning (RL).
1.3. Journal/Conference
This paper is published on CoRR, which stands for Computing Research Repository. It is a preprint server, often associated with arXiv, where research papers are publicly shared before, or in parallel with, formal peer review and publication in journals or conferences.
1.4. Publication Year
2025 (arXiv submission date: 2025-10-28).
1.5. Abstract
The abstract outlines the paper's investigation into whether recommender systems can benefit from the scaling laws observed in large language models (LLMs). It highlights a shift from traditional, embedding-heavy recommenders to an emerging generative paradigm that uses Semantic ID (SID) sequences produced by autoregressive Transformers. The paper addresses two open questions: (1) if scaling laws apply to public benchmarks, and (2) what minimal post-training steps achieve competitive performance. The authors introduce MiniOneRec, presented as the first fully open-source generative recommendation framework. This framework covers an end-to-end workflow including SID construction using a Residual Quantized VAE (RQ-VAE), supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL). Experiments on the Amazon Review dataset with Qwen backbones (0.5B to 7B parameters) demonstrate that increasing model size leads to a consistent decrease in training and evaluation losses, validating the parameter efficiency of the generative approach. To further improve performance, they propose a lightweight post-training pipeline that enforces full-process SID alignment and applies reinforcement learning with constrained decoding and hybrid rewards. These techniques are shown to significantly enhance both ranking accuracy and candidate diversity.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2510.24431 PDF Link: https://arxiv.org/pdf/2510.24431v1.pdf Publication Status: This paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of scaling in conventional recommender systems and the lack of open-source, reproducible frameworks for the emerging generative recommendation paradigm.
Why is this problem important in the current field?
The success of large language models (LLMs) has demonstrated predictable performance gains with increased model size, a phenomenon known as scaling laws. This has spurred interest in whether similar benefits can be achieved in recommender systems. However, traditional recommenders, which rely heavily on massive embedding tables for user and item representations, tend to plateau in performance as these embeddings grow. This embedding-heavy design leads to diminishing returns beyond moderate scales. In contrast, the generative paradigm compresses items into compact Semantic ID (SID) sequences, allowing autoregressive Transformers to handle the bulk of parameters, theoretically enabling better scaling.
Existing industrial deployments of generative recommenders, such as OneRec and OnePiece, have shown significant promise, but they remain proprietary and rely on massive private datasets. This leaves the broader research community with critical unanswered questions:
- Do the expected scaling laws observed in LLMs also hold for generative recommendation models when applied to public benchmarks?
- What is the minimum, yet effective, post-training recipe required to achieve competitive performance with these generative models?
What is the paper's entry point or innovative idea?
The paper's entry point is to fill this gap by providing the first fully open-source, end-to-end framework for generative recommendation, named MiniOneRec. This framework not only aims to validate the scaling properties of generative recommenders on public datasets but also to offer a reproducible blueprint for building high-performing generative models using a lightweight post-training pipeline.
2.2. Main Contributions / Findings
The paper makes several primary contributions and reports key findings:
- First Fully Open-Source Generative Recommendation Framework (MiniOneRec): The authors present MiniOneRec with complete source code, reproducible training pipelines, and publicly available model checkpoints. It offers an end-to-end workflow covering SID construction, supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL).
- Validation of Scaling Laws on Public Benchmarks: Through a systematic investigation on the Amazon Review dataset, MiniOneRec variants (0.5B to 7B parameters) demonstrate a consistent downward trend in both training and evaluation losses as model size increases. This empirically validates the parameter efficiency and scaling behavior of the generative paradigm, suggesting that larger models consistently perform better.
- Optimized Lightweight Post-training Pipeline: The paper proposes an effective two-pronged post-training strategy:
  - Full-Process SID Alignment: The vocabulary is augmented with dedicated SID tokens, and auxiliary alignment objectives are enforced throughout both the SFT and RL stages. This approach is shown to be crucial for grounding SID generation in the world knowledge of the underlying LLM.
  - Reinforced Preference Optimization: Reinforcement learning is applied with constrained decoding (masking invalid tokens so that only legal items can be generated) and a hybrid reward signal that combines a rule-based accuracy term with a ranking-aware penalty for hard negatives, enhancing both ranking accuracy and candidate diversity.
- Superior Performance: MiniOneRec consistently surpasses strong sequential, generative, and LLM-based baselines across various metrics (HR@K, NDCG@K) on the Amazon Review benchmarks.
- Demonstrated Transferability and Impact of Pre-trained LLMs: The framework exhibits robust out-of-distribution (OOD) transferability (e.g., from the Industrial to the Office domain) by discovering reusable interaction patterns. Furthermore, models initialized with pre-trained LLM weights significantly outperform those trained from scratch, highlighting the value of the world knowledge and general reasoning abilities embedded in large pre-trained models.

In summary, the paper demonstrates the viability and advantages of scaling generative recommendation models on public data, providing a practical, open-source framework and an optimized post-training recipe to achieve state-of-the-art performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the MiniOneRec framework, a beginner needs to grasp several fundamental concepts from recommender systems, natural language processing, and reinforcement learning.
- Recommender Systems: Systems designed to predict user preferences for items (products, movies, news articles, etc.).
  - Conventional Recommenders: Often rely on embedding tables to represent users and items as dense vectors. Recommendations are typically made by calculating similarity (e.g., dot product) between user and item embeddings. Examples include matrix factorization models and sequential recommenders like SASRec.
  - Generative Recommenders: A newer paradigm where recommendation is framed as a sequence generation problem. Instead of matching embeddings, these models generate sequences of identifiers for items that a user is likely to interact with next. This approach draws inspiration from Large Language Models (LLMs).
- Large Language Models (LLMs): Very large neural networks trained on vast amounts of text data to understand, generate, and predict human language. They typically employ the Transformer architecture.
  - Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which revolutionized sequence modeling. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.
  - Autoregressive Models: Models that predict the next item in a sequence based on all preceding items. In LLMs, this means predicting the next word given the previous words; in generative recommenders, it means predicting the next Semantic ID (SID) given previous SIDs.
- Semantic ID (SID): A compact, discrete token representation of an item, designed to capture its semantic meaning. Instead of using raw text or a simple numerical ID, SIDs are generated by compressing rich information (like item descriptions) into a sequence of special tokens. This allows generative models, which operate on discrete tokens, to handle items effectively. SIDs are analogous to words or sub-words in natural language processing.
- Quantization: The process of mapping continuous values to a finite set of discrete values. In MiniOneRec, quantization is used to convert continuous item embeddings (from text encoders) into discrete SIDs.
  - Residual Quantized Variational Autoencoder (RQ-VAE): An advanced quantization technique. A Variational Autoencoder (VAE) learns a compressed, latent representation of data; residual quantization then quantizes the remaining errors iteratively, allowing a more accurate, hierarchical compression of information into multiple discrete codes. Each code can be seen as a "byte" representing part of the item's features, and a sequence of these codes (e.g., three codes) forms the unique SID for an item.
- Supervised Fine-Tuning (SFT): A common technique in LLM training. After an LLM is pre-trained on a general corpus, it is fine-tuned on a smaller, task-specific dataset using supervised learning (i.e., with labeled input-output pairs). In MiniOneRec, SFT warms up the LLM to generate item SIDs based on user interaction histories.
- Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal.
  - Policy Gradient Methods: A class of RL algorithms that directly optimize the policy (the agent's strategy for choosing actions).
  - Group Relative Policy Gradient (GRPO): A specific policy gradient algorithm used in MiniOneRec. It draws multiple candidate outputs per input (prompt) and normalizes rewards within that group. This helps reduce gradient variance and stabilizes training, especially when rule-based rewards are used instead of a learned reward model.
  - Reinforcement Learning with Human Feedback (RLHF): A popular technique for aligning LLMs with human preferences, often using Proximal Policy Optimization (PPO). GRPO can be seen as a lightweight alternative to RLHF when rule-based rewards are sufficient.
  - Constrained Decoding: A technique used during text generation (or SID generation) to ensure that the generated output adheres to certain rules or formats. For MiniOneRec, this means ensuring that only valid item SIDs are produced, preventing the generation of non-existent items.
  - Beam Search: A search algorithm used in sequence generation to find the most probable sequences of tokens. Instead of picking the single best token at each step (greedy decoding), it keeps track of the most probable partial sequences (beams) and extends them. This explores a wider range of possibilities and often leads to higher-quality generations.
- Evaluation Metrics for Recommender Systems:
  - Hit Rate (HR@K): Measures how often the ground-truth item (the item the user actually interacted with) appears within the top-$K$ recommended items.
    - Conceptual Definition: HR@K is a recall-based metric that quantifies the proportion of users for whom the target item is present in the top $K$ recommendations. It focuses on whether the model successfully found the relevant item.
    - Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in the top } K}{\text{Total number of users}} $
    - Symbol Explanation: the numerator counts the users whose relevant item was included in their recommendation list of size $K$; the denominator is the total number of users in the evaluation set.
  - Normalized Discounted Cumulative Gain (NDCG@K): A ranking-aware metric that considers not only whether relevant items are in the top-$K$ but also their position in the list. Higher positions for relevant items yield higher scores.
    - Conceptual Definition: NDCG@K measures ranking quality, assessing the usefulness of an item based on its position in the recommendation list. It assigns higher scores to relevant items that appear earlier; the "Normalized" part means the score is scaled by the ideal possible score, making it comparable across queries.
    - Mathematical Formula: first compute the Discounted Cumulative Gain (DCG@K), then the Ideal DCG@K (obtained by arranging all relevant items in decreasing order of relevance), and finally their ratio:
      $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i^{\mathrm{ideal}}} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    - Symbol Explanation:
      - $K$: the number of items in the recommendation list being considered.
      - $rel_i$: the relevance score of the item at position $i$ in the recommended list (typically binary: 1 if relevant, 0 if not).
      - $rel_i^{\mathrm{ideal}}$: the relevance score of the item at position $i$ in the ideal ranked list (where all relevant items are ranked highest).
      - $\log_2(i+1)$: the discount factor, which reduces the weight of relevant items as their position in the list increases.
3.2. Previous Works
The paper builds upon and compares itself to several prior works in generative recommendation, LLM applications in recommendation, and reinforcement learning.
- Generative Recommendation Models:
  - TIGER (Rajput et al., 2023): One of the early explorations in generative recommendation. It uses RQ-VAE to map textual embeddings of titles and descriptions into SIDs, and then a Transformer model generates these SIDs. MiniOneRec adopts a similar RQ-VAE approach for SID construction.
  - HSTU (Zhai et al., 2024): Introduces a streaming architecture for generative recommendation, designed to handle high-cardinality and non-stationary interaction logs efficiently.
  - LC-Rec (Zheng et al., 2024): Aligns an LLM with SIDs through multi-task learning, enabling the model to understand and generate these symbols effectively. This highlights the importance of aligning SIDs with the LLM's linguistic knowledge, a concept MiniOneRec further develops with its full-process SID alignment.
  - RecForest (Feng et al., 2022): Clusters items using hierarchical k-means and uses cluster indices as tokens, an alternative approach to SID generation.
  - EAGER (Wang et al., 2024) & TokenRec (Qu et al., 2024): These works focus on improving item tokenization by fusing collaborative and semantic evidence directly into the tokenizer.
  - Industrial-Scale Generative Recommenders:
    - OneRec (Deng et al., 2025; Zhou et al., 2025a): A pioneering industrial deployment that achieved significant improvements on Kuaishou's platform. It features a two-stage post-training pipeline with SFT and RL.
    - OneRec-V2 (Zhou et al., 2025b): Further advances OneRec with a lazy decoder architecture and preference alignment.
    - OnePiece (Dai et al., 2025): Integrates LLM-style context engineering and reasoning into industrial retrieval and ranking pipelines, also exploring inference in the latent space.
    - MTGR (Wang et al., 2025): Focuses on boosting large-scale generative recommendation models by keeping original DLRM features, adding user-level compression, and accelerating training/inference.
  - The common thread among these works is the shift from traditional embedding matching to sequence generation using Transformers and item tokenization. MiniOneRec differentiates itself by being open-source and systematically validating scaling laws on public data while providing a refined post-training strategy.
- LLM and RL in Recommendation:
  - Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm in RLHF for fine-tuning LLMs. It is known for its stability but can be memory-intensive.
  - Direct Preference Optimization (DPO) (Rafailov et al., 2023): An alternative to PPO that removes the need for a separate value network, directly optimizing a preference-based objective.
  - S-DPO (Chen et al., 2024b): Adapts DPO for recommendation by treating softmax-based negative sampling as an implicit pairwise preference.
  - Group Relative Policy Gradient (GRPO) (Shao et al., 2024; DeepSeek-AI et al., 2024): The RL algorithm adopted by MiniOneRec. It normalizes rewards within a small batch of roll-outs and typically uses rule-based reward signals instead of a learned reward model, making it more lightweight and efficient than PPO or DPO in some contexts. MiniOneRec integrates GRPO with constrained decoding and hybrid rewards tailored for recommendation.
  - LLM-based Recommendation Models:
    - BigRec (Bao et al., 2023): An example of an LLM-based recommender.
    - D3 (Bao et al., 2024): Another LLM-based model for recommendation, often used as a baseline.
  - These works illustrate the increasing trend of leveraging LLMs and RL techniques to enhance recommendation systems, moving beyond traditional collaborative filtering or content-based methods.
3.3. Technological Evolution
The field of recommender systems has seen a significant evolution:
- Early Systems (e.g., Collaborative Filtering): Focused on user-item interaction patterns, often using matrix factorization or neighborhood-based methods.
- Deep Learning Era (e.g., GRU4Rec, Caser, SASRec): Introduced deep neural networks, particularly Recurrent Neural Networks (RNNs) and Transformers, to model sequential user behavior, leading to sequential recommenders. These still largely relied on embedding tables.
- Embedding-Heavy Models (e.g., DeepFM, DCN): Further scaled deep learning models by using massive embedding tables for categorical features, but these often hit performance plateaus due to the fixed nature of embeddings and the limitations of shallow scoring networks.
- Generative Recommendation Paradigm (e.g., TIGER, OneRec): Inspired by the success of LLMs, this paradigm shifts from embedding matching to sequence generation. Items are tokenized into SIDs, and autoregressive Transformers generate sequences of SIDs. This allows a more flexible and scalable architecture, where model capacity can be redirected from static embedding tables to dynamic generative components.
- LLM-Integrated Generative Recommendation (e.g., OnePiece, MiniOneRec): The latest stage, where pre-trained LLMs are explicitly leveraged not just as a backbone but also for their world knowledge and reasoning capabilities. This often involves multi-task learning, alignment objectives, and reinforcement learning to fine-tune LLMs for recommendation-specific tasks. MiniOneRec fits into this latest stage by providing an open-source framework that integrates RQ-VAE for SID construction, Qwen-based LLMs as backbones, SFT for initial training, and GRPO-based RL for preference optimization, crucially incorporating SID alignment with the LLM's world knowledge.
3.4. Differentiation Analysis
Compared to the main methods in related work, MiniOneRec offers several core differences and innovations:
- Open-Source & Reproducibility: Unlike many industrial generative recommenders (e.g., OneRec, OnePiece) that are proprietary, MiniOneRec is the "first fully open-source generative recommendation framework." This is a significant contribution to the research community, enabling transparency, reproducibility, and further development.
- Systematic Validation of Scaling Laws: MiniOneRec provides the first systematic investigation into how generative recommendation models scale on public datasets (Amazon Review). This directly addresses a critical unanswered question for the research community, providing empirical evidence for the parameter efficiency of the generative paradigm across model sizes from 0.5B to 7B parameters.
- Comprehensive Post-training Recipe: The paper proposes a lightweight yet effective post-training pipeline with two key components:
  - Full-Process SID Alignment: While LC-Rec also aligns LLMs with SIDs, MiniOneRec emphasizes "full-process SID alignment," maintaining alignment objectives throughout both the SFT and RL stages. This ensures deeper semantic understanding by grounding SID generation in the LLM's world knowledge, and is more robust than models that treat recommendation purely as a SID-to-SID task.
  - Reinforced Preference Optimization with Hybrid Rewards and Constrained Decoding: MiniOneRec applies GRPO with specific enhancements. Constrained decoding masks invalid tokens to guarantee legal item generation, which is crucial for practical recommender systems. Beam search enables efficient exploration of diverse candidates, addressing the poor sampling diversity of RL with verifiable rewards in a limited action space. Hybrid rewards combine rule-based accuracy with a novel ranking-aware penalty for hard negatives, going beyond the simple binary rewards used in standard GRPO and providing richer supervision for ranking quality, a common weakness of RL in recommendation.
- Compact SID Space for Efficiency: By operating in the compact SID space rather than on verbose textual titles, MiniOneRec requires substantially fewer context tokens, leading to faster inference, lower latency, and smaller memory footprints compared to LLM-based models that process full text descriptions.
- Demonstrated Transferability: The MiniOneRec-w/RL-OOD variant, trained on one domain (Industrial) and evaluated on an unseen one (Office) without SFT on the target, highlights the framework's ability to uncover reusable interaction patterns and its robust out-of-distribution (OOD) performance, a key challenge in recommendation.

In essence, MiniOneRec distinguishes itself by democratizing generative recommendation research, rigorously validating its scaling properties, and offering a finely tuned, effective, and efficient methodology that integrates the strengths of LLMs and RL while addressing practical challenges like output validity and diversity.
4. Methodology
The MiniOneRec framework (as illustrated in Figure 2) provides an end-to-end workflow for generative recommendation. It starts by converting item textual information into discrete Semantic IDs (SIDs), incorporates world knowledge from Large Language Models (LLMs) through alignment objectives, and then refines the recommendation policy using supervised fine-tuning (SFT) followed by reinforced preference optimization (RL).
4.1. Principles
The core idea behind MiniOneRec is to leverage the scaling capabilities of autoregressive Transformers, similar to LLMs, for recommendation. Instead of relying on massive and static embedding tables common in traditional recommenders, MiniOneRec represents items as compact sequences of discrete Semantic IDs (SIDs). This allows the LLM backbone to learn complex sequential patterns and generate SID sequences corresponding to recommended items. A key principle is to integrate the LLM's vast world knowledge by aligning its language space with the SID space, thereby enriching the semantic understanding of items. Furthermore, reinforcement learning is employed to optimize the LLM's generation policy directly towards recommendation-specific objectives like ranking accuracy and candidate diversity, overcoming the limitations of standard supervised learning which often treats all negative items equally.
4.2. Core Methodology In-depth (Layer by Layer)
The MiniOneRec framework consists of three main stages: Item Tokenization, Alignment with LLMs (incorporating SFT), and Reinforced Preference Optimization (using RL).

Figure 2 from the original paper: an overview of the MiniOneRec framework with full-process SID alignment. RQ-VAE builds the item SID codebook; SFT then warms up the LLM and obtains an initial alignment. In RL, beam search with constrained decoding ensures the model sequentially produces a ranked list of distinct, valid SIDs, and GRPO updates the policy. The SID alignment objective is preserved throughout both the SFT and RL stages, fostering deeper semantic understanding.
4.2.1. Task Formulation
Recommendation is formulated as a sequence generation problem. For each user $u$, the historically interacted items are arranged chronologically into a sequence $[i_1, i_2, \ldots, i_n]$. Each item is represented not by its raw ID or text but by a short sequence of structural IDs called SIDs (e.g., a three-code tuple such as `<a_71><b_44><c_249>`). These SIDs are designed to preserve hierarchical semantics through quantization techniques applied to the items' semantic embeddings.
A generative policy, denoted $\pi_\theta$, is implemented as an autoregressive model with parameters $\theta$. This policy takes the entire interaction history as input and is trained to predict the next item that best matches user $u$'s preferences from the catalog of available items. During inference, the model recursively generates item tokens; to produce a recommendation list, beam search keeps the most promising sequences (beams), which are then returned as the recommended items.
4.2.2. Item Tokenization
The first step in SID-style generative recommenders is to convert each item into a sequence of discrete tokens. MiniOneRec follows TIGER (Rajput et al., 2023) and employs Residual Quantized Variational Autoencoder (RQ-VAE) (Zeghidour et al., 2022) for this purpose. The pipeline is as follows:
- Textual Input Preparation: For every item $i$, its title and textual description are concatenated into a single sentence.
- Semantic Vector Encoding: The concatenated sentence is fed through a frozen text encoder (specifically, Qwen3-Embedding-4B in this work), which produces a $d$-dimensional continuous semantic vector $\mathbf{x}$. This vector captures the item's rich semantic information.
- RQ-VAE Application: The RQ-VAE then quantizes this continuous vector $\mathbf{x}$ through multiple levels of quantization that progressively refine the representation.
  - The paper uses $L = 3$ quantization levels with a codebook of 256 entries per level. Each item is therefore represented by three discrete codes, offering $256^3 \approx 1.6 \times 10^7$ possible unique codes, which is sufficient for large catalogs while keeping the vocabulary manageable.
  - The quantization iteratively finds the closest codebook entry and then quantizes the residual error. The residual is initialized as $\mathbf{r}_0 = \mathbf{x}$. At each level $l$ (where $0 \le l \le L-1$), the process is defined by:
    $ c_l = \arg\min_k \left\| \mathbf{r}_l - \mathbf{e}_k^{(l)} \right\|_2, \qquad \mathbf{r}_{l+1} = \mathbf{r}_l - \mathbf{e}_{c_l}^{(l)}. $
    - Symbol Explanation:
      - $c_l$: the index of the chosen codebook entry at level $l$.
      - $\mathbf{r}_l$: the residual vector at level $l$, representing the remaining information to be quantized.
      - $\mathbf{e}_k^{(l)}$: the $k$-th codebook entry (vector) at level $l$.
      - $\|\cdot\|_2$: the L2 norm, i.e., the Euclidean distance between vectors.
      - $\arg\min_k$: the index $k$ that minimizes the expression.
      - $\mathbf{r}_{l+1}$: the updated residual for the next level, obtained by subtracting the chosen codebook entry from the current residual.
- Item Token Sequence Formation: The collected indices $(c_0, c_1, \ldots, c_{L-1})$ form the discrete token sequence for item $i$. These sequences are the SIDs that the subsequent generative recommender model consumes.
- Quantized Latent Reconstruction: The quantized latent representation is obtained by summing the chosen codebook entries across all levels and is then passed through a decoder to reconstruct the original semantic vector:
  $ \mathbf{z}_{\mathrm{q}} = \sum_{l=0}^{L-1} \mathbf{e}_{c_l}^{(l)}, \qquad \hat{\mathbf{x}} = D(\mathbf{z}_{\mathrm{q}}). $
  - Symbol Explanation:
    - $\mathbf{z}_{\mathrm{q}}$: the reconstructed quantized latent vector.
    - $\mathbf{e}_{c_l}^{(l)}$: the codebook entry at level $l$ corresponding to the chosen index $c_l$.
    - $\hat{\mathbf{x}}$: the reconstructed semantic vector.
    - $D(\cdot)$: the decoder function.
- Loss Function: The RQ-VAE is trained to minimize a loss that combines a reconstruction term ($\mathcal{L}_{\mathrm{RECON}}$) and an RQ regularizer ($\mathcal{L}_{\mathrm{RQ}}$):
  $ \mathcal{L}(\mathbf{x}) = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}_{\mathcal{L}_{\mathrm{RECON}}} + \underbrace{\sum_{l=0}^{L-1} \bigl( \|\mathrm{sg}[\mathbf{r}_l] - \mathbf{e}_{c_l}^{(l)}\|_2^2 + \beta \|\mathbf{r}_l - \mathrm{sg}[\mathbf{e}_{c_l}^{(l)}]\|_2^2 \bigr)}_{\mathcal{L}_{\mathrm{RQ}}}, $
  - Symbol Explanation:
    - $\mathcal{L}(\mathbf{x})$: the total loss for a given semantic vector $\mathbf{x}$.
    - $\mathcal{L}_{\mathrm{RECON}}$: the reconstruction loss, measuring the squared L2 difference between the original semantic vector $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$.
    - $\mathcal{L}_{\mathrm{RQ}}$: the RQ regularizer, consisting of two terms:
      - $\|\mathrm{sg}[\mathbf{r}_l] - \mathbf{e}_{c_l}^{(l)}\|_2^2$: the codebook loss, which pulls the codebook entries towards the encoder outputs; $\mathrm{sg}$ (stop-gradient) means no gradient flows through $\mathbf{r}_l$.
      - $\beta \|\mathbf{r}_l - \mathrm{sg}[\mathbf{e}_{c_l}^{(l)}]\|_2^2$: the commitment loss, which encourages the encoder output to "commit" to the selected codebook entry; $\beta$ is a hyper-parameter controlling its strength, and the stop-gradient here blocks gradients through $\mathbf{e}_{c_l}^{(l)}$.
- Warm-Start Initialization: To prevent codebook collapse (where only a few codebook entries are used), the codebooks are initialized with k-means centroids computed on the first training batch, a common practice in RQ-VAE training.
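To make the residual quantization step concrete, here is a minimal NumPy sketch of the encoding loop described above. It is illustrative only: the codebooks would in practice be learned jointly with the encoder and decoder and warm-started with k-means, as the paper describes; here they are random placeholders.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Residual-quantize a semantic vector x into L discrete codes.

    x:         (d,) continuous item embedding from the text encoder.
    codebooks: (L, K, d) array; codebooks[l] holds the K entries of level l.
    Returns the code indices (c_0, ..., c_{L-1}) and the quantized latent z_q.
    """
    residual = x.copy()
    codes, z_q = [], np.zeros_like(x)
    for level in codebooks:                               # loop over the L levels
        dists = np.linalg.norm(level - residual, axis=1)  # distance to each entry
        c = int(np.argmin(dists))                         # closest codebook entry
        codes.append(c)
        z_q += level[c]                                   # accumulate quantized latent
        residual = residual - level[c]                    # quantize the leftover next
    return codes, z_q

# Toy usage with L=3 levels and 256 entries per level, as in the paper.
rng = np.random.default_rng(0)
d, L, K = 32, 3, 256
codebooks = rng.normal(size=(L, K, d))
codes, z_q = rq_encode(rng.normal(size=d), codebooks)
print(codes)  # e.g. [17, 203, 64] -> SID tokens <a_17><b_203><c_64>
```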
4.2.3. Alignment with LLMs
LLMs possess extensive world knowledge and understanding of human behaviors. MiniOneRec aims to incorporate this knowledge by aligning the LLM's language space with the SID representations, which is an expansion of the original OneRec architecture. The framework augments the LLM's vocabulary with a three-layer codebook (each layer with 256 unique SIDs) and treats these inserted codes as indivisible tokens. This allows the LLM to directly read and write SID sequences.
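As a concrete illustration of this vocabulary augmentation, the sketch below adds the 3x256 SID codes as special tokens to a Hugging Face tokenizer and resizes the model's embedding matrix. The model name and the `<a_i>/<b_i>/<c_i>` naming scheme follow the prompt examples quoted later in this analysis; treat the exact identifiers as assumptions rather than the authors' released code.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed backbone; the paper uses Qwen2.5-Instruct (0.5B-7B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One codebook layer per SID position: <a_0>..<a_255>, <b_0>..<b_255>, <c_0>..<c_255>.
sid_tokens = [f"<{layer}_{idx}>" for layer in ("a", "b", "c") for idx in range(256)]

# Register them as special tokens so they are treated as indivisible units.
tokenizer.add_special_tokens({"additional_special_tokens": sid_tokens})
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are trained during SFT/RL
```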
To bridge the semantic gap between the natural language vocabulary and the new SID space, several alignment objectives are introduced. These tasks are optimized jointly during both the SFT stage and the subsequent RL stage, fostering a deeper semantic understanding. During the RL phase, constrained decoding is applied, ensuring that the model only produces tokens from a predefined list of valid SIDs and their canonical titles, which facilitates rule-based reward computation.
The alignment objectives are categorized into two major groups:
4.2.3.1. Recommendation Tasks
These tasks directly train the LLM to perform recommendation-related predictions.
- Generative Retrieval: The LLM receives a chronologically ordered sequence of SIDs representing a user's recent interactions, along with a clear instruction (e.g., "Recommend the next item."). The model is tasked with predicting the SID of the next item the user might engage with.
  Figure 5 from the original paper shows an example of the Generative Retrieval Prompt: the input is the user's chronologically ordered sequence of item SIDs, and the response is the predicted SID for the next item.
- Asymmetric Item Prediction: This involves two sub-tasks that bridge different modalities:
  - (a) Textual History to SID Prediction: Given a textual user history (e.g., item titles), predict the SID of the next item.
    Figure 6 from the original paper shows an example of Asymmetric Item Prediction Prompt 1: the input is a natural-language description of the interacted items, and the response is the predicted SID for the next item.
  - (b) SID-only History to Textual Title Generation: Given a history consisting only of SIDs, generate the textual title of the next expected item.
    Figure 7 from the original paper shows an example of Asymmetric Item Prediction Prompt 2: the input is a sequence of SIDs, and the response is the predicted natural-language title for the next item. (Note: the provided image `images/7.jpg` is SID-Text Semantic Alignment Prompt 2, not Asymmetric Item Prediction Prompt 2, so the prompt example from the text is used instead.)
    Input: The user has interacted with items `<a_71><b_44><c_249>`, `<a_71><b_1714><c_136>`, ... in chronological order. Can you predict the next possible item's title that the user may expect?
    Response: [Expected Title]
4.2.3.2. Alignment Tasks
These tasks explicitly enforce a two-way mapping between natural language and the SID space, grounding discrete codes in text and injecting linguistic knowledge into their embeddings.
- SID-Text Semantic Alignment: This task trains the model to associate SIDs with their corresponding textual representations.
  - (a) SID to Textual Title Prediction: Predict an item's textual title given its SID.
    Figure 8 from the original paper shows an example of SID-Text Semantic Alignment Prompt 1: the input is an SID and the response is its corresponding textual title. (Note: the image for Figure 8 is not provided separately in the raw text, so the example is reproduced from the text description.)
    Input: Given the SID <a_104><b_60><c_152>, what is the item's title?
    Response: [Item Title]
  - (b) Textual Title to SID Prediction: Predict the SID from an item's textual title.
    Figure 9 from the original paper shows an example of SID-Text Semantic Alignment Prompt 2: the input is a natural-language title and the response is its corresponding SID. The example shows the model predicting from the input "Wicmas he ash e al elu lp" (which appears to be a tokenized or obscured form of an item title).
- Item Description Reconstruction: To imbue SIDs with richer semantics, the model is asked to generate the item description from a single SID and, conversely, to infer the SID from a description. This task is performed only during the SFT stage due to the open-ended output space of descriptions.
  Figure 10 from the original paper shows an example of the Item Description Reconstruction Prompt: the input is an SID and the task is to generate its description. (Note: the image for Figure 10 is not provided separately in the raw text, so the example is reproduced from the text description.)
  Input: Given the SID <a_104><b_60><c_152>, what is the item's description?
  Response: [Item Description]
- User Preference Summarization: Given a sequence of SIDs representing user interactions, the model generates a short natural-language profile summarizing the user's interests. Since the raw dataset lacks explicit preference labels, DEEPSEEK (DeepSeek-AI et al., 2024) is used to extract summaries from item metadata and user reviews, which serve as pseudo labels. This task is also restricted to the SFT stage due to its open-ended output space.
  Figure 11 from the original paper shows an example of the User Preference Summarization Prompt: the input is a sequence of SIDs, and the response is a natural-language summary of the user's preferences.
  Input: The user has interacted with items [SID sequence] in chronological order. Can you summarize the user's preference?
  Response: Based on the items purchased, this user has a clear preference for durable, high-performance, and practical supplies for hands-on tasks, electrical work, and general repairs, necessitating dependable personal protective equipment and precision tools.
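The prompts above are simple templates over SID strings. Below is a small illustrative helper (not from the released code) that renders the generative-retrieval and preference-summarization prompts from a user's SID history; the wording follows the figure examples quoted above, and the code tuples are placeholders.

```python
def sid_string(codes):
    """Format a 3-level code tuple such as (71, 44, 249) as '<a_71><b_44><c_249>'."""
    return "".join(f"<{layer}_{code}>" for layer, code in zip(("a", "b", "c"), codes))

def generative_retrieval_prompt(history):
    """history: list of 3-code tuples, ordered chronologically."""
    items = ", ".join(sid_string(c) for c in history)
    return (f"The user has interacted with items {items} in chronological order. "
            "Can you predict the next possible item that the user may expect?")

def preference_summarization_prompt(history):
    items = ", ".join(sid_string(c) for c in history)
    return (f"The user has interacted with items {items} in chronological order. "
            "Can you summarize the user's preference?")

print(generative_retrieval_prompt([(71, 44, 249), (71, 171, 136)]))
```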
4.2.4. Reinforced Preference Optimization
After the SFT stage provides an initial policy, MiniOneRec further refines it using Group Relative Policy Gradient (GRPO) (Shao et al., 2024; DeepSeek-AI et al., 2024). GRPO differs from classic Reinforcement Learning with Human Feedback (RLHF) by generating multiple candidates per prompt and normalizing rewards within that group, which helps reduce gradient variance.
The steps of GRPO are:
- Roll-out Candidates: For every input prompt $x$ sampled from the data distribution, the current frozen policy $\pi_{\theta_{\mathrm{old}}}$ (the policy before the current update) generates $G$ candidate output sequences $y^{(1)}, \ldots, y^{(G)}$.
- Assign Scores: Each candidate sequence $y^{(i)}$ is assigned a scalar score $S_i$ based on a predefined reward function (discussed below).
- Standardize Advantages: The scores are standardized within the group of $G$ candidates to compute advantages:
  $ \hat{A}_i = \frac{S_i - \mu_{1:G}}{\sigma_{1:G}}, $
  - Symbol Explanation:
    - $\hat{A}_i$: the standardized advantage for the $i$-th candidate sequence.
    - $S_i$: the scalar score (reward) assigned to the $i$-th candidate sequence.
    - $\mu_{1:G}$: the mean of the rewards within the current group.
    - $\sigma_{1:G}$: the standard deviation of the rewards within the current group.

The GRPO algorithm then optimizes a surrogate objective function:
$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x, \{y^{(i)}\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y^{(i)}|} \sum_{t=1}^{|y^{(i)}|} \Bigl( \min\bigl( w_{i,t} \hat{A}_{i,t}, \; \mathrm{clip}(w_{i,t}, 1-\epsilon, 1+\epsilon) \, \hat{A}_{i,t} \bigr) - \beta \, \mathrm{KL}\bigl[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \bigr] \Bigr) \right], $
- Symbol Explanation:
  - $J_{\mathrm{GRPO}}(\theta)$: the GRPO surrogate objective, maximized with respect to the policy parameters $\theta$.
  - $\mathbb{E}_{x, \{y^{(i)}\}}$: expectation over prompts $x$ and the generated candidate sequences $y^{(i)}$.
  - $G$: the number of candidate sequences (roll-outs) generated per prompt.
  - $|y^{(i)}|$: the length of the $i$-th candidate sequence.
  - $w_{i,t}$: the token-level importance ratio for the $t$-th token of the $i$-th sequence, defined as $w_{i,t} = \frac{\pi_\theta(y^{(i)}_t \mid x, y^{(i)}_{<t})}{\pi_{\theta_{\mathrm{old}}}(y^{(i)}_t \mid x, y^{(i)}_{<t})}$. This ratio measures how much the probability of generating a token has changed under the new policy $\pi_\theta$ compared to the old policy $\pi_{\theta_{\mathrm{old}}}$.
  - $\hat{A}_{i,t}$: the advantage for the $t$-th token of the $i$-th sequence. Since candidates are scored at the sequence level, in practice $\hat{A}_{i,t} = \hat{A}_i$ for all tokens of sequence $i$, consistent with the PPO-style formulation.
  - $\mathrm{clip}(w_{i,t}, 1-\epsilon, 1+\epsilon)$: a clipping function that limits the importance ratio to the range $[1-\epsilon, 1+\epsilon]$, stabilizing training by preventing excessively large policy updates.
  - $\epsilon$: a hyper-parameter controlling the clipping range.
  - $\beta$: a hyper-parameter controlling the strength of the KL-divergence penalty.
  - $\mathrm{KL}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$: the Kullback-Leibler (KL) divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{\mathrm{ref}}$. This term acts as a regularizer, preventing the new policy from drifting too far from a stable reference (often the SFT policy).

MiniOneRec addresses two key obstacles when applying RL to recommendation:
- Unique Generation Space: The action space for recommendation (item SIDs) is discrete and much smaller than natural-language vocabularies. This can lead to poor sampling diversity, where re-sampling often produces duplicate items.
- Sparse Ranking Supervision: A simple binary reward (1 for correct, 0 for incorrect) provides limited guidance on ranking quality. Recommendation models benefit from distinguishing between hard and easy negatives.
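To make the update concrete, here is a minimal PyTorch-style sketch of the GRPO loss for one prompt group, assuming sequence-level rewards broadcast to tokens ($\hat{A}_{i,t} = \hat{A}_i$) and per-token log-probabilities already gathered from the current, old, and reference policies. It is a simplified illustration, not the released training code (padding masks and the exact KL estimator may differ).

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """logp_new/logp_old/logp_ref: (G, T) token log-probs of the G sampled sequences.
    rewards: (G,) scalar reward per sequence (e.g., the hybrid rule + rank reward).
    Returns the negative GRPO surrogate, to be minimized."""
    # Group-normalized, sequence-level advantages, broadcast to every token.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1).expand_as(logp_new)                  # (G, T)

    # Token-level importance ratios w_{i,t} and the PPO-style clipped term.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    policy_term = torch.min(unclipped, clipped)

    # Simple per-token KL penalty against the frozen reference policy.
    kl = logp_new - logp_ref

    # Average over tokens within each sequence, then over the group; negate for minimization.
    return -(policy_term - beta * kl).mean(dim=1).mean()
```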
4.2.4.1. Sampling Strategy
To combat the issue of poor sampling diversity, MiniOneRec investigates two remedies:
- Diversity Metric: The diversity of the generated items $\{e_k\}_{k=1}^{G}$ is measured as the fraction of unique generations:
  $ \mathrm{Div}\bigl(\{e_k\}_{k=1}^{G}\bigr) = \frac{\bigl|\mathrm{Unique}\bigl(\{e_k\}_{k=1}^{G}\bigr)\bigr|}{G}, $
  - Symbol Explanation:
    - $\mathrm{Div}(\cdot)$: the diversity score.
    - $\{e_k\}_{k=1}^{G}$: the set of generated items.
    - $|\mathrm{Unique}(\cdot)|$: the count of unique items among the generations.
    - $G$: the total number of generated items.
  - Higher values indicate richer supervision for RL.
- Dynamic Sampling (Yu et al., 2025): This method first over-samples candidates and then selects a subset that must include the ground-truth item and maximize internal diversity. While helpful, it requires extra forward passes, and diversity can still deteriorate as training progresses.
- Beam Search: MiniOneRec ultimately adopts beam search (without length normalization) as its default sampler. By construction, all generated beams (candidate sequences) are distinct, guaranteeing zero duplication within each group and offering a better diversity-efficiency trade-off. Constrained beam search is used to ensure that all generated items are valid.
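The constraint can be implemented with a prefix trie over all valid SID sequences: at each decoding step, every token that does not extend a valid prefix is masked out before beam expansion. The sketch below is an illustrative, framework-agnostic version of that idea (in practice the allowed-token list would be plugged into the decoder's per-step logits, for example via Hugging Face's `prefix_allowed_tokens_fn` hook); the catalog and token IDs are placeholders.

```python
def build_sid_trie(valid_sids):
    """valid_sids: iterable of token-id tuples, one per catalog item, e.g. (a_id, b_id, c_id)."""
    trie = {}
    for sid in valid_sids:
        node = trie
        for tok in sid:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Return the token ids that keep `prefix` (already-generated SID tokens) valid."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []          # invalid prefix: nothing is allowed
    return list(node.keys())   # children of the current trie node

# Example: three catalog items, each a 3-token SID.
catalog = [(7, 44, 249), (7, 12, 136), (91, 3, 8)]
trie = build_sid_trie(catalog)
print(allowed_next_tokens(trie, ()))    # [7, 91] -> only valid first codes
print(allowed_next_tokens(trie, (7,)))  # [44, 12]
```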
4.2.4.2. Reward Design
To provide richer supervision than a simple binary reward, MiniOneRec introduces a rank-aware reward and combines it with a rule-based reward.
- Rank-Aware Reward: This reward penalizes negative items more strongly if the model was more confident in recommending them (i.e., they appear prominently in the model's own ranking). Given a candidate item $e_k$ whose generation probability ranks $\rho_k$ (where $\rho_k = 1$ means it is the most probable), and the true item $e_t$:
  $ \tilde{R}_{\mathrm{rank}}(e_k, e_t) = \begin{cases} 0, & e_k = e_t, \\ -\dfrac{1}{\log(\rho_k + 1)}, & \text{otherwise}, \end{cases} \qquad R_{\mathrm{rank}}(e_k, e_t) = -\frac{\tilde{R}_{\mathrm{rank}}(e_k, e_t)}{\sum_{j=1}^{G} \tilde{R}_{\mathrm{rank}}(e_j, e_t)}. $
  - Symbol Explanation:
    - $\tilde{R}_{\mathrm{rank}}(e_k, e_t)$: the unnormalized rank-aware reward for candidate $e_k$ given the true item $e_t$.
    - $e_k$: a candidate item generated by the policy.
    - $e_t$: the ground-truth (target) item.
    - $\rho_k$: the rank of item $e_k$ among the generated candidates, based on the model's generation probabilities (1 for the most probable, 2 for the second, and so on).
    - $R_{\mathrm{rank}}(e_k, e_t)$: the normalized rank-aware reward; the negative sign ensures that lower scores are assigned to hard negatives (those with small $\rho_k$).
    - $\sum_{j=1}^{G} \tilde{R}_{\mathrm{rank}}(e_j, e_t)$: the sum of unnormalized rank-aware rewards over all candidates, used for normalization.
- Final Hybrid Reward: The final reward signal combines the rule-based accuracy term with the ranking-aware component:
  $ R(e_k, e_t) = R_{\mathrm{rule}}(e_k, e_t) + R_{\mathrm{rank}}(e_k, e_t), $
  where the rule-based term simply assigns a positive reward for correctly predicting the true item and zero otherwise:
  $ R_{\mathrm{rule}}(e_k, e_t) = \begin{cases} 1, & e_k = e_t, \\ 0, & \text{otherwise}. \end{cases} $
  - Symbol Explanation:
    - $R(e_k, e_t)$: the total reward for candidate $e_k$ given the true item $e_t$.
    - $R_{\mathrm{rule}}(e_k, e_t)$: the rule-based reward term (1 for a correct prediction, 0 for an incorrect one).
    - $R_{\mathrm{rank}}(e_k, e_t)$: the rank-aware reward term.
  MiniOneRec ultimately adopts this combined ranking-and-rule reward as its default choice. The paper also explored collaborative rewards (e.g., logits from a pre-trained collaborative-filtering model), but found that they led to reward hacking and performance degradation, indicating a misalignment with the true objective.
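A minimal sketch of the hybrid reward for one group of beam-searched candidates, following the formulas above: candidates are assumed to be ordered by generation probability, so the rank $\rho_k$ is simply the 1-based beam position. This is an illustration, not the released implementation.

```python
import math

def hybrid_rewards(candidates, target):
    """candidates: list of item ids ordered by generation probability (beam order).
    target: the ground-truth item id.
    Returns one hybrid reward per candidate: R_rule + R_rank."""
    # Unnormalized rank-aware term: 0 for the true item, -1/log(rank+1) for negatives.
    tilde = [0.0 if c == target else -1.0 / math.log(rank + 1)
             for rank, c in enumerate(candidates, start=1)]
    denom = sum(tilde)
    # Normalized rank-aware term (zero if no negative candidate exists).
    rank_r = [0.0 if denom == 0 else -t / denom for t in tilde]
    rule_r = [1.0 if c == target else 0.0 for c in candidates]
    return [ru + ra for ru, ra in zip(rule_r, rank_r)]

# Hard negatives (high in the beam) receive the strongest penalty.
print(hybrid_rewards(["item_9", "item_4", "item_2"], target="item_4"))
```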
5. Experimental Setup
This section details the empirical study conducted to evaluate MiniOneRec.
5.1. Datasets
The experiments are conducted on two real-world subsets of the Amazon Review dataset (Hou et al., 2024):
- Industrial_and_Scientific
- Office_Products

These datasets were chosen because they represent diverse product domains and are publicly available, allowing for reproducibility.
To manage computational costs, the datasets undergo a trimming strategy following (Bao et al., 2024):
- Filtering: Users and items with fewer than five interactions are removed. This ensures sufficient interaction history for sequential modeling.
- Time-based Filtering:
  - For Toys_and_Games (not used by MiniOneRec, but indicative of the general trimming strategy), events from October 2016 to November 2018 are kept.
  - For Industrial_and_Scientific (a smaller dataset), all events between October 1996 and November 2018 are retained.
- Sequence Truncation: Each user's interaction history is truncated to a maximum of ten items. This limits the input sequence length for the Transformer models.
- Data Splitting: Each dataset is chronologically split into training, validation, and test sets with an 8:1:1 ratio. This ensures that the model is evaluated on future interactions.

The following are the statistics of the resulting splits, from Table 4 of the original paper:
| Datasets | Industrial | Office |
| --- | --- | --- |
| Items | 3,685 | 3,459 |
| Train | 36,259 | 38,924 |
| Valid | 4,532 | 4,866 |
| Test | 4,533 | 4,866 |
Data Sample: While the paper does not provide a specific image of a data sample, the nature of the data involves item titles, textual descriptions (used for SID generation), and chronological sequences of user-item interactions. For instance, a user's history might look like: [item_A_SID, item_B_SID, item_C_SID], where item_A_SID corresponds to the SID generated from the title and description of "Kreg SML-C150-100 Pocket Screws...".
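As an illustration of the preprocessing just described, the sketch below applies a 5-core-style filter (the paper may use a single filtering pass rather than an iterative one) and truncates each user's chronologically sorted history to ten items; the interaction format is a hypothetical `(user_id, item_id, timestamp)` tuple.

```python
def filter_and_truncate(interactions, min_count=5, max_len=10):
    """interactions: list of (user_id, item_id, timestamp) tuples."""
    changed = True
    while changed:  # iteratively drop sparse users/items until stable
        user_cnt, item_cnt = {}, {}
        for u, i, _ in interactions:
            user_cnt[u] = user_cnt.get(u, 0) + 1
            item_cnt[i] = item_cnt.get(i, 0) + 1
        kept = [(u, i, t) for u, i, t in interactions
                if user_cnt[u] >= min_count and item_cnt[i] >= min_count]
        changed = len(kept) != len(interactions)
        interactions = kept

    histories = {}
    for u, i, t in sorted(interactions, key=lambda x: x[2]):  # chronological order
        histories.setdefault(u, []).append(i)
    return {u: items[-max_len:] for u, items in histories.items()}
```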
5.2. Evaluation Metrics
The top-$K$ recommendation accuracy is measured using standard metrics:
- Hit Rate (HR@K):
  - Conceptual Definition: HR@K measures the proportion of users for whom the target item (the next item they actually interacted with) is present in the top-$K$ items recommended by the model. It is a recall-based metric, indicating whether the model successfully retrieved the relevant item, irrespective of its position within the top $K$.
  - Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users with a hit in the top } K}{\text{Total number of users}} $
  - Symbol Explanation: the numerator counts users for whom the ground-truth item was found among their top-$K$ recommendations; the denominator is the total number of users in the evaluation set for whom a recommendation was generated.
- Normalized Discounted Cumulative Gain (NDCG@K):
  - Conceptual Definition: NDCG@K is a ranking-aware metric that evaluates the quality of a recommendation list by giving higher scores to relevant items that appear at higher (better) positions. It accounts for both the presence of relevant items and their rank, and the normalization ensures that NDCG@K values are comparable across different queries and recommendation list lengths.
  - Mathematical Formula: first the Discounted Cumulative Gain (DCG@K) is calculated, then the Ideal DCG@K (the maximum possible DCG@K if all relevant items were perfectly ranked), and finally their ratio:
    $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i^{\mathrm{ideal}}} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $K$: the maximum rank position considered in the recommendation list.
    - $rel_i$: the relevance score of the item at position $i$ in the actual recommended list; for this task it is binary (1 if the item is the true next item, 0 otherwise).
    - $rel_i^{\mathrm{ideal}}$: the relevance score at position $i$ in the ideal recommendation list (i.e., where all relevant items are listed first).
    - $\log_2(i+1)$: a logarithmic discount factor that penalizes relevant items appearing at lower ranks.
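A minimal sketch of how HR@K and NDCG@K reduce in the single-target setting used here (one ground-truth next item per user, binary relevance), written as plain Python for clarity.

```python
import math

def hr_at_k(ranked_items, target, k):
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """With a single relevant item, IDCG@K = 1 and DCG@K = 1/log2(pos+1)."""
    for pos, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(pos + 1)
    return 0.0

# Averaging over users gives the reported HR@K / NDCG@K.
users = [(["i3", "i7", "i1"], "i7"), (["i2", "i9", "i4"], "i8")]
print(sum(hr_at_k(r, t, 3) for r, t in users) / len(users))    # 0.5
print(sum(ndcg_at_k(r, t, 3) for r, t in users) / len(users))  # ~0.315
```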
5.3. Baselines
MiniOneRec is compared against three categories of representative baseline models:
- Traditional Recommendation Models: Established sequential recommenders.
  - GRU4Rec (Hidasi et al., 2016): A Gated Recurrent Unit (GRU)-based model for session-based recommendations.
  - Caser (Tang and Wang, 2018): Uses convolutional neural networks (CNNs) to capture both general and personalized sequential patterns.
  - SASRec (Kang and McAuley, 2018): A self-attentive sequential recommendation model that uses Transformers to capture long-range dependencies in user behavior sequences.
  - Why representative? They represent the state of the art in non-generative, embedding-based sequential recommendation prior to the LLM era.
- Generative Recommendation Models: Models that also frame recommendation as a generation task, often using SIDs.
  - HSTU (Zhai et al., 2024): A generative model with a streaming architecture.
  - TIGER (Rajput et al., 2023): Uses RQ-VAE for SID generation and a Transformer backbone, similar to MiniOneRec's SID construction.
  - LC-Rec (Zheng et al., 2024): Aligns an LLM with SIDs using multi-task learning.
  - Why representative? They are direct competitors in the generative recommendation paradigm, allowing comparison of MiniOneRec's specific architectural and training choices.
- LLM-based Recommendation Models: More recent models that leverage LLMs for recommendation.
  - BigRec (Bao et al., 2023): A method leveraging LLMs for recommendation.
  - D3 (Bao et al., 2024): Another LLM-based recommendation model.
  - S-DPO (Chen et al., 2024b): An LLM-based approach that adapts Direct Preference Optimization (DPO) for recommendation.
  - Why representative? They represent the cutting edge of integrating LLMs into recommendation, testing MiniOneRec against methods that also exploit LLM capabilities.

All LLM-powered systems, including MiniOneRec, share the Qwen2.5-Instruct backbone and use the AdamW optimizer for fair comparison. Training details (learning rates, batch sizes, epochs) are specified for each stage and baseline to ensure a rigorous evaluation. The conventional recommenders are trained with binary cross-entropy and the Adam optimizer, with hyper-parameter tuning over learning rates and weight decay.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Scaling
The paper demonstrates the scaling capabilities of MiniOneRec by observing how the loss curves behave with increasing model size.

Figure 1 from the original paper. Left: scaling curves (training FLOPs versus convergence loss) for backbones from 0.5B to 7B parameters. Right: the effect of world knowledge on HR@10 at both the SFT and SFT-then-RL stages, comparing MiniOneRec-W/O ALIGN (pretrained LLM weights, no SID-text alignment) and MiniOneRec-Scratch (random initialization, no alignment).

Figure 3 from the original paper: evaluation loss versus SFT training epoch for Qwen2.5-Instruct backbones of different sizes; larger models maintain lower losses throughout training.
Analysis:
As shown in Figure 1 (Left), there is a clear and consistent downward trend in convergence loss as the model size of MiniOneRec increases from 0.5B to 7B parameters. This indicates that larger models are better able to learn the underlying patterns in the data and achieve lower training errors. Figure 3 further reinforces this finding by tracking the evaluation loss on the SFT training set over epochs. Models with larger parameter counts consistently maintain lower evaluation losses throughout the training process and converge more rapidly. This directly validates the parameter efficiency of the generative paradigm in recommendation, confirming that scaling laws (where increased capacity leads to consistent performance gains) indeed hold on public datasets when using autoregressive Transformers with SIDs. This superior scaling effect suggests that generative recommenders have the potential to be the next generation of recommendation models.
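The qualitative trend in Figure 1 (Left) can be quantified by fitting a saturating power law to (model size, converged loss) pairs; the snippet below shows one common way to do this with SciPy. The data points are placeholders for illustration, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, alpha, c):
    """Saturating power law L(N) = a * N^(-alpha) + c, a common scaling-law form."""
    return a * n_params ** (-alpha) + c

# Placeholder (model size in billions, converged loss) pairs -- NOT the paper's numbers.
sizes = np.array([0.5, 1.5, 3.0, 7.0])
losses = np.array([0.62, 0.58, 0.55, 0.53])

(a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=[0.1, 0.5, 0.5], maxfev=10000)
print(f"fitted exponent alpha = {alpha:.3f}")  # how quickly loss falls with model size
```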
6.1.2. Performance Comparison
The paper benchmarks MiniOneRec against various baselines on the Industrial and Office Amazon Review datasets.
The following are the results from Table 1 of the original paper:
| Datasets | Methods | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Industrial | Traditional | | | | | | |
| | GRU4Rec | 0.0638 | 0.0542 | 0.0774 | 0.0598 | 0.0999 | 0.0669 |
| | Caser | 0.0618 | 0.0514 | 0.0717 | 0.0555 | 0.0942 | 0.0628 |
| | SASRec | 0.0790 | 0.0700 | 0.0909 | 0.0748 | 0.1088 | 0.0806 |
| | Generative | | | | | | |
| | HSTU | 0.0927 | 0.0885 | 0.1037 | 0.0918 | 0.1163 | 0.0958 |
| | TIGER | 0.0852 | 0.0742 | 0.1010 | 0.0807 | 0.1321 | 0.0908 |
| | LCRec | 0.0915 | 0.0805 | 0.1057 | 0.0862 | 0.1332 | 0.0952 |
| | LLM-based | | | | | | |
| | BIGRec | 0.0931 | 0.0841 | 0.1092 | 0.0907 | 0.1370 | 0.0997 |
| | D3 | 0.1024 | 0.0991 | 0.1213 | 0.0989 | 0.1500 | 0.1082 |
| | S-DPO | 0.1032 | 0.0906 | 0.1238 | 0.0991 | 0.1524 | 0.1082 |
| | Ours | | | | | | |
| | MiniOneRec | 0.1143 | 0.1011 | 0.1321 | 0.1084 | 0.1586 | 0.1167 |
| Office | Traditional | | | | | | |
| | GRU4Rec | 0.0629 | 0.0528 | 0.0789 | 0.0595 | 0.1019 | 0.0669 |
| | Caser | 0.0748 | 0.0615 | 0.0865 | 0.0664 | 0.1093 | 0.0737 |
| | SASRec | 0.0861 | 0.0769 | 0.0949 | 0.0805 | 0.1120 | 0.0858 |
| | Generative | | | | | | |
| | HSTU | 0.1134 | 0.1031 | 0.1252 | 0.1079 | 0.1400 | 0.1126 |
| | TIGER | 0.0986 | 0.0852 | 0.1163 | 0.0960 | 0.1408 | 0.1002 |
| | LCRec | 0.0921 | 0.0807 | 0.1048 | 0.0859 | 0.1237 | 0.0920 |
| | LLM-based | | | | | | |
| | BIGRec | 0.1069 | 0.0961 | 0.1204 | 0.1017 | 0.1434 | 0.1091 |
| | D3 | 0.1204 | 0.1055 | 0.1406 | 0.1139 | 0.1634 | 0.1213 |
| | S-DPO | 0.1169 | 0.1033 | 0.1356 | 0.1110 | 0.1587 | 0.1255 |
| | Ours | | | | | | |
| | MiniOneRec | 0.1217 | 0.1088 | 0.1420 | 0.1172 | 0.1634 | 0.1242 |
Analysis: Table 1 reveals two critical insights:
- Utility of LLM World Knowledge: Models powered by LLMs (e.g., BIGRec, D3, S-DPO) consistently outperform traditional recommenders (GRU4Rec, Caser, SASRec) across all metrics and both datasets. This strongly suggests that the vast world knowledge and general reasoning abilities embedded in LLMs translate into better recommendation accuracy. Even generative models without explicit LLM alignment (TIGER) improve over traditional methods, but those with LLM integration (BIGRec, D3, S-DPO) extend this lead further.
- Effectiveness of MiniOneRec: MiniOneRec consistently achieves the highest scores on most reported metrics on both the Industrial and Office datasets, outperforming all traditional, generative, and other LLM-based solutions. For instance, on the Industrial dataset, MiniOneRec scores HR@3 of 0.1143 and NDCG@10 of 0.1167, surpassing S-DPO's 0.1032 and 0.1082 respectively. Similarly, on the Office dataset, MiniOneRec achieves HR@3 of 0.1217 and NDCG@10 of 0.1242, which are competitive with or better than the leading baselines (D3 and S-DPO). This superior performance is attributed to MiniOneRec's design:
  - Full generation-process alignment: aligning the entire generation process with the task objective.
  - Reinforced preference optimization: RL with carefully designed constrained decoding and hybrid rewards.
  - Compact SID space: operating in the compact SID space reduces context tokens, leading to faster inference, lower latency, and a smaller memory footprint at serving time, addressing practical deployment concerns.
6.1.3. Transferability
The paper evaluates the out-of-distribution (OOD) robustness of MiniOneRec through an SID pattern discovery experiment: the model is trained exclusively on the Industrial domain and then deployed, without any further tuning, to the unseen Office domain. A MiniOneRec-w/ RL-OOD variant (RL only, skipping SFT, to emphasize generalization) is also included.
The following are the results from Table 2 of the original paper:
| Dataset | Method | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| Office | GRU4Rec | 0.0629 | 0.0528 | 0.0789 | 0.0595 | 0.1019 | 0.0669 |
| Office | Qwen-Text | 0.0031 | 0.0021 | 0.0044 | 0.0026 | 0.0057 | 0.0030 |
| Office | Qwen-SID | 0.0300 | 0.0214 | 0.0456 | 0.0282 | 0.0733 | 0.0373 |
| Office | MiniOneRec-w/ RL-OOD | 0.0553 | 0.0433 | 0.0691 | 0.0489 | 0.0892 | 0.0553 |
Analysis:
Table 2 highlights the importance of SIDs and the transferability of RL-trained models.
Qwen-Text performs very poorly, indicating that directly processing raw text for OOD prediction without specific fine-tuning is ineffective. Qwen-SID (using SID tokens but no fine-tuning) performs noticeably better than Qwen-Text, suggesting that a structured SID vocabulary makes it easier for an LLM to extract patterns, even without domain-specific training. MiniOneRec-w/ RL-OOD (trained via GRPO on Industrial only and evaluated on Office) achieves competitive accuracy compared to GRU4Rec, which was trained directly on the Office domain. While it falls short of the full MiniOneRec (which would be trained on Office), its reinforcement-only training provides excellent transferability. This implies that MiniOneRec successfully uncovers reusable interaction patterns that generalize across domains, despite substantial domain shift and potential semantic drift among SIDs. This underscores the framework's promise for cross-domain recommendation.
6.1.4. Pre-trained LLM Impact
The paper investigates the impact of pre-trained LLMs by comparing two MiniOneRec variants: one initialized from a general-purpose pre-trained LLM and another trained from scratch with random weights.
The following are the results from Table 3 of the original paper:
| Datasets | Methods | HR@3 | NDCG@3 | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|
| Industrial | MiniOneRec-scratch | 0.0757 | 0.0672 | 0.0891 | 0.0726 | 0.1134 | 0.0804 |
| Industrial | MiniOneRec | 0.1125 | 0.0988 | 0.1259 | 0.1046 | 0.1546 | 0.1139 |
| Office | MiniOneRec-scratch | 0.0959 | 0.0855 | 0.1057 | 0.0896 | 0.1196 | 0.0941 |
| Office | MiniOneRec | 0.1217 | 0.1088 | 0.1420 | 0.1172 | 0.1634 | 0.1242 |
Analysis:
Table 3 shows a consistent and significant pattern across both Industrial and Office datasets: MiniOneRec initialized with pre-trained weights (MiniOneRec) substantially outperforms its randomly initialized counterpart (MiniOneRec-scratch). For example, on the Industrial dataset, MiniOneRec achieves an HR@10 of 0.1546, significantly higher than MiniOneRec-scratch's 0.1134. Similar improvements are seen on the Office dataset.
The authors attribute this to two main factors:
- General Reasoning Ability: The general reasoning ability acquired during large-scale language pre-training allows the model to interpret the next-SID prediction task as a problem of pattern discovery, making it more effective at learning user preferences.
- Factual Knowledge: The factual knowledge already encoded in the pre-trained LLM provides a head start in understanding the real-world semantics behind each SID, and this knowledge transfers to the recommendation domain.

This highlights the critical role of pre-trained LLMs as powerful backbones for generative recommendation, providing a strong foundation that would otherwise require extensive training from scratch.
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to validate the effectiveness of MiniOneRec's individual components.

Figure 4 from the original paper studies the effectiveness of MiniOneRec's individual components: Figure 4a compares model performance under different alignment strategies, Figure 4b investigates various sampling strategies, and Figure 4c evaluates the impact of alternative reward designs.
6.2.1. Aligning Strategy
This ablation study examines the impact of different language-SID alignment strategies on model performance.
- MiniOneRec-w/o Align: removes any language-SID alignment, treating recommendation purely as a SID-to-SID task.
- MiniOneRec-w/ SFT Align: keeps the alignment objective only during the SFT stage, while RL uses SID data alone.
- MiniOneRec-w/ RL Align: SFT relies solely on SID supervision, and alignment tasks are introduced later in the RL stage.
- MiniOneRec (full model): maintains alignment throughout the entire pipeline (SFT and RL).
Analysis (Figure 4a):
Figure 4a clearly shows that the complete MiniOneRec model, which maintains SID alignment throughout both the SFT and RL stages, delivers the highest scores across all metrics (HR@K and NDCG@K), while MiniOneRec-w/o Align performs the worst. This indicates that grounding SID generation in world knowledge via language-SID alignment is essential for effective generative recommendation. The variants that align only during SFT or only during RL (MiniOneRec-w/ SFT Align, MiniOneRec-w/ RL Align) perform better than the unaligned variant but worse than the full MiniOneRec, underscoring the importance of continuous, full-process alignment for maximizing performance.
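For intuition, the sketch below shows one way such language-SID alignment pairs could be constructed as bidirectional SID-to-text and text-to-SID examples. The prompt wording, field names, and SID token format are illustrative assumptions, not the paper's exact templates.

```python
# Hedged sketch: build bidirectional alignment examples for one catalogue item.
# Prompt templates and the SID token format are assumptions for illustration.
def build_alignment_examples(item):
    """item: dict with an 'sid' token sequence and textual metadata."""
    sid_str = "".join(item["sid"])            # e.g. "<a_12><b_305><c_7>"
    return [
        # SID -> text: ask the model to describe the item behind an SID.
        {"prompt": f"What is the title of the item {sid_str}?",
         "response": item["title"]},
        # text -> SID: ask the model to emit the SID for a described item.
        {"prompt": f"Which item does this description refer to? {item['title']}",
         "response": sid_str},
    ]

item = {"sid": ["<a_12>", "<b_305>", "<c_7>"],
        "title": "Ergonomic Office Chair with Lumbar Support"}
for ex in build_alignment_examples(item):
    print(ex["prompt"], "->", ex["response"])
```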
6.2.2. Sampling Strategy
This study investigates how different roll-out methods for generating candidate sequences affect MiniOneRec during the RL stage.
- MiniOneRec-Common: uses a plain top-k decoder to produce exactly the required number of candidate paths; this often yields many duplicate sequences.
- MiniOneRec-Dynamic: employs a two-step sampler that first over-samples (e.g., 1.5 times the budget) and then retains as many unique items as possible for RL.
- MiniOneRec (full model): adopts beam search with a width of 16, ensuring distinct candidate sequences.
Analysis (Figure 4b):
Figure 4b demonstrates that the complete MiniOneRec (using beam search) achieves the highest accuracy among the tested sampling strategies, while using roughly two-thirds of the samples required by the dynamic variant. This indicates that beam search is the most cost-efficient and effective choice for generating diverse, high-quality candidate trajectories during RL, outperforming both the simpler top-k decoding and the more complex dynamic sampling, which still struggles with diversity.
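As a rough illustration of this roll-out, the sketch below uses a Hugging Face-style `generate` call with beam width 16 and a prefix-constrained decoding function that keeps generations inside the valid SID catalogue. The helper `allowed_next_tokens` (a prefix-trie lookup over item SIDs) and the token budget are assumptions; this is not the released MiniOneRec implementation.

```python
# Sketch of a constrained beam-search roll-out: beam width 16, all beams kept
# as distinct candidates, next-token choices restricted to valid SID prefixes.
# `allowed_next_tokens` is a hypothetical helper backed by a trie of item SIDs.
def constrained_beam_rollout(model, tokenizer, prompt, allowed_next_tokens, width=16):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        # Return the token ids that keep the partial sequence inside the SID trie.
        return allowed_next_tokens(input_ids.tolist())

    outputs = model.generate(
        **inputs,
        num_beams=width,
        num_return_sequences=width,   # keep all 16 distinct beams as RL candidates
        max_new_tokens=4,             # e.g. a 3-level SID plus an end-of-item token
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
        early_stopping=True,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:]) for seq in outputs]
```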
6.2.3. Reward Design
This ablation study evaluates the impact of alternative reward designs for reinforcement learning.
- MiniOneRec-w/ Acc: relies solely on a binary correctness signal (1 for the correct item, 0 otherwise); this is a basic rule-based reward.
- MiniOneRec-w/ Collaborative: replaces the ranking term in the hybrid reward with logits taken from a frozen SASRec model, aiming to inject collaborative cues.
- MiniOneRec (full model): uses the default hybrid reward combining rule-based accuracy with the ranking-aware penalty.
Analysis (Figure 4c):
Figure 4c shows that the full MiniOneRec model, with its hybrid reward combining rule-based and rank-aware components, achieves the best overall performance. MiniOneRec-w/ Acc performs worse, highlighting the limitation of a simple binary reward, which fails to distinguish between different types of negative items or provide nuanced ranking supervision. Interestingly, MiniOneRec-w/ Collaborative performs significantly worse than even MiniOneRec-w/ Acc. The authors hypothesize that this degradation stems from reward hacking: the collaborative reward signal may become misaligned with the true objective, causing the model to optimize for a proxy that does not correspond to actual recommendation accuracy. This finding emphasizes the importance of carefully designing reward functions in RL for recommendation, where domain-specific considerations such as ranking and hard negatives are crucial.
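A hedged sketch of a hybrid reward in this spirit is shown below: a rule-based correctness term plus a ranking-aware term computed over the beam of candidates. The weighting `alpha` and the exact form of the ranking term are assumptions, not the paper's precise formula.

```python
# Hedged sketch of a hybrid reward: rule-based accuracy plus a rank-aware term
# that pushes the ground-truth item above its beam-sampled hard negatives.
# `alpha` and the penalty shape are assumptions, not MiniOneRec's exact design.
def hybrid_reward(generated_sid, target_sid, beam_ranked_sids, alpha=0.5):
    # Rule-based accuracy: 1 if the sampled sequence is the ground-truth item.
    acc = 1.0 if generated_sid == target_sid else 0.0

    # Ranking-aware term: larger when the ground-truth item sits higher inside
    # the beam of candidates (its hard negatives), zero if it is absent.
    if target_sid in beam_ranked_sids:
        rank = beam_ranked_sids.index(target_sid)            # 0-based
        rank_term = 1.0 - rank / len(beam_ranked_sids)       # 1.0 at the top
    else:
        rank_term = 0.0

    return acc + alpha * rank_term
```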
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces MiniOneRec, a pioneering fully open-source framework for generative recommendation. It provides a comprehensive, end-to-end workflow covering Semantic ID (SID) construction via RQ-VAE, supervised fine-tuning (SFT), and recommendation-oriented reinforcement learning (RL). A key contribution is the systematic validation of scaling laws on public benchmarks, demonstrating that larger generative recommenders (from 0.5B to 7B parameters) consistently achieve lower training and evaluation losses. This confirms the superior parameter-efficiency of the SID-based paradigm compared to traditional embedding-centric models.
MiniOneRec further proposes an optimized post-training pipeline centered on two main techniques:
- Full-process SID alignment: embeds SID tokens into the model vocabulary and enforces auxiliary alignment tasks across both the SFT and RL stages, effectively grounding SID generation in the LLM's world knowledge.
- Reinforced preference optimization: utilizes Group Relative Policy Optimization (GRPO) with practical enhancements, including constrained decoding (ensuring valid item generation), beam-based sampling (for candidate diversity), and a hybrid reward design (combining rule-based accuracy with a ranking-aware penalty for hard negatives).

Extensive experiments on the Amazon Review dataset show that MiniOneRec consistently outperforms strong traditional, generative, and LLM-based baselines in ranking accuracy and candidate diversity, all while maintaining a lean post-training footprint. The framework also demonstrates robust transferability across domains and highlights the critical impact of pre-trained LLM weights on performance.
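For reference, the group-relative advantage at the heart of GRPO-style optimization can be sketched as below: rewards for the candidate SID sequences sampled from the same prompt are normalized within the group, removing the need for a separate value model. This follows the standard GRPO formulation and is not lifted from the MiniOneRec codebase.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style training:
# rewards within one prompt's candidate group are mean/std normalized.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-candidate rewards for one user prompt (one group)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled SID sequences for one prompt, one of them rewarded highly.
print(group_relative_advantages([1.3, 0.0, 0.2, 0.0]))
```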
7.2. Limitations & Future Work
The paper explicitly outlines a roadmap for future developments rather than direct limitations, but these implicitly suggest current constraints or areas for improvement:
- Codebase Maintenance and Extension: The immediate future work involves continuously maintaining and extending the MiniOneRec codebase.
- New Datasets: Future developments will include support for new datasets to further test the generalizability and performance of the framework across diverse domains.
- Advanced Tokenization Schemes: Exploring more advanced tokenization schemes beyond the current RQ-VAE could further improve SID quality and semantic richness.
- Larger Backbone Models: The current study scales up to 7B parameters. Future work aims to incorporate larger backbone models to investigate the limits of scaling laws in generative recommendation.
- Enhanced Training Pipelines: Continuously improving the training pipelines (e.g., more efficient RL algorithms, better SFT strategies) is a direction for future research.

While not explicitly stated as limitations, some implicit areas for consideration include:
- Computational Cost: Despite being lightweight, RL training, especially with beam search and large LLMs, can still be computationally intensive.
- Complexity of SID Generation: The RQ-VAE-based SID generation process adds a preprocessing step and requires careful tuning. Its scalability for extremely large and dynamic item catalogs might be a challenge.
- Reward Hacking: The paper itself notes the issue of reward hacking when using collaborative rewards, indicating that reward function design remains a delicate balance.
7.3. Personal Insights & Critique
MiniOneRec represents a significant step forward in making generative recommendation accessible and scientifically rigorous.
Personal Insights:
- Open-Source Impact: The commitment to an open-source framework is invaluable. It democratizes research in a field heavily dominated by proprietary industrial solutions, fostering transparency, reproducibility, and collaborative innovation. This alone could accelerate the adoption and development of generative recommenders.
- Rigorous Validation of Scaling Laws: The systematic validation of scaling laws on public benchmarks is crucial. It provides concrete evidence that the LLM paradigm of "more data, more parameters, better performance" holds for recommendation, guiding future research toward building larger, more capable models.
- Practical Post-training Recipe: The detailed and effective post-training pipeline, particularly the full-process SID alignment and hybrid reward design in RL, offers a practical blueprint for researchers and practitioners. The ablation studies clearly demonstrate the value of each component, making the methodology transparent and adaptable.
- Leveraging LLM World Knowledge: The emphasis on integrating the LLM's world knowledge is a powerful insight. It moves beyond treating LLMs merely as sequence generators and instead leverages their pre-trained semantic understanding to enrich item representations and user preferences.
- Addressing RL Challenges: The paper effectively addresses common challenges in applying RL to recommendation, such as poor sampling diversity and sparse ranking supervision, through constrained decoding, beam search, and rank-aware rewards. This shows a deep understanding of the practicalities of RL in this domain.
Critique & Potential Issues:
- Definition of "Minimal Recipe": While the paper refers to its post-training as "minimal" or "lightweight," it still involves multi-stage training (RQ-VAE, SFT, RL), careful reward engineering, and sophisticated sampling strategies. For a true beginner or smaller organizations, this might still represent a significant undertaking compared to simpler embedding-based models. The term "minimal" may be relative to the complexity of industrial-scale RLHF systems, but the overall pipeline remains non-trivial.
- Complexity of SID Interpretation: While SIDs are efficient for models, their direct interpretability for humans or for debugging purposes may be limited. The mapping from an SID back to human-readable item features is handled by the LLM during alignment tasks, but errors in SID generation or interpretation could be hard to trace.
- Generalizability Beyond Amazon Reviews: The experiments are primarily conducted on Amazon Review subsets. While the transferability study to an unseen Amazon domain is promising, further validation on vastly different types of datasets (e.g., news, movies, diverse cultural contexts) would strengthen the claims of general applicability. Semantic IDs for highly abstract or subjective items might also pose challenges.
- Cost of Large Backbones: While scaling laws are validated, the computational cost of training and serving 7B+ parameter LLMs remains substantial, even with optimized methods. This could be a barrier to widespread adoption outside well-resourced institutions.
- Dynamic Catalog Updates: The paper mentions HSTU for handling non-stationary logs. While RQ-VAE allows for dynamic generation of SIDs, the efficiency and stability of constantly updating SID codebooks or adapting LLMs to rapidly changing item catalogs is an area that could benefit from more detailed discussion.

Overall, MiniOneRec offers a robust and forward-looking framework, providing valuable insights and a solid foundation for future research in generative recommendation. Its open-source nature is a commendable contribution to the research community.