OpenOneRec Technical Report
TL;DR Summary
The paper introduces RecIF-Bench, a holistic benchmark for generative recommendation and a large dataset of 96 million interactions. It open-sources a comprehensive training pipeline, demonstrating scalable capabilities while reducing catastrophic forgetting, with the OneRec foun
Abstract
While the OneRec series has successfully unified the fragmented recommendation pipeline into an end-to-end generative framework, a significant gap remains between recommendation systems and general intelligence. Constrained by isolated data, they operate as domain specialists-proficient in pattern matching but lacking world knowledge, reasoning capabilities, and instruction following. This limitation is further compounded by the lack of a holistic benchmark to evaluate such integrated capabilities. To address this, our contributions are: 1) RecIF Bench & Open Data: We propose RecIF-Bench, a holistic benchmark covering 8 diverse tasks that thoroughly evaluate capabilities from fundamental prediction to complex reasoning. Concurrently, we release a massive training dataset comprising 96 million interactions from 160,000 users to facilitate reproducible research. 2) Framework & Scaling: To ensure full reproducibility, we open-source our comprehensive training pipeline, encompassing data processing, co-pretraining, and post-training. Leveraging this framework, we demonstrate that recommendation capabilities can scale predictably while mitigating catastrophic forgetting of general knowledge. 3) OneRec-Foundation: We release OneRec Foundation (1.7B and 8B), a family of models establishing new state-of-the-art (SOTA) results across all tasks in RecIF-Bench. Furthermore, when transferred to the Amazon benchmark, our models surpass the strongest baselines with an average 26.8% improvement in Recall@10 across 10 diverse datasets (Figure 1). This work marks a step towards building truly intelligent recommender systems. Nonetheless, realizing this vision presents significant technical and theoretical challenges, highlighting the need for broader research engagement in this promising direction.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "OpenOneRec Technical Report: An Open Foundation Model and Benchmark to Accelerate Generative Recommendation."
1.2. Authors
The paper is authored by the OneRec Team, with a detailed list of contributors provided in Appendix A. The Core Contributors include Guorui Zhou, Honghui Bao, Jiaming Huang, Jiaxin Deng, Jinghao Zhang, Junda She, Kuo Cai, Lejian Ren, Lu Ren, Qiang Luo, Qianqian Wang, Qigen Hu, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Wuchao Li, Xiangyu Wu, Xinchen Luo, Xingmei Wang, Yifei Hu, Yunfan Wu, Zhanyu Liu, Zhiyang Zhang, and Zixing Zhang. Additional Contributors are Bo Chen, Bin Wen, Chaoyi Ma, Chengru Song, Chenglong Chu, Defu Lian, Fan Yang, Feng Jiang, Hongtao Cheng, Huanjie Wang, Kun Gai, Pengfei Zheng, Qiang Wang, Rui Huang, Siyang Mao, Tingting Gao, Wei Yuan, Yan Wang, Yang Zhou, Yi Su, Zexuan Cheng, Zhixin Ling, and Ziming Li. Their affiliations are not explicitly stated in the provided text, but the GitHub and Hugging Face links suggest they are associated with Kuaishou.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (https://arxiv.org/abs/2512.24762). arXiv is a widely recognized open-access repository for preprints of scientific papers in fields like mathematics, physics, computer science, and quantitative biology. While it is not a peer-reviewed journal or conference in itself, it serves as a crucial platform for rapid dissemination of research findings and allows for community feedback before formal publication. Its influence is significant in facilitating timely academic discourse.
1.4. Publication Year
The paper was published at (UTC): 2025-12-31T10:15:53.000Z.
1.5. Abstract
The abstract states that the OneRec series has successfully integrated the fragmented recommendation pipeline into an end-to-end generative framework. However, a significant gap persists between recommendation systems and general intelligence. Current recommendation systems act as domain specialists, adept at pattern matching but lacking world knowledge, reasoning capabilities, and instruction following due to isolated data. This limitation is exacerbated by the absence of a holistic benchmark. To address this, the authors make three contributions:
-
RecIF-Bench & Open Data: They propose
RecIF-Bench, a holistic benchmark with 8 diverse tasks, evaluating capabilities from fundamental prediction to complex reasoning. They also release a training dataset of 96 million interactions from 160,000 users. -
Framework & Scaling: They open-source a comprehensive training pipeline (data processing,
co-pretraining,post-training) to ensure reproducibility. They demonstrate predictable scaling of recommendation capabilities while mitigatingcatastrophic forgettingof general knowledge. -
OneRec-Foundation: They release
OneRec-Foundationmodels (1.7B and 8B parameters) that achieve newstate-of-the-art (SOTA)results across allRecIF-Benchtasks. When transferred to theAmazon benchmark, these models surpass baselines with an average 26.8% improvement inRecall@10across 10 datasets.The work is presented as a step towards intelligent recommender systems, acknowledging significant technical and theoretical challenges and calling for broader research engagement.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2512.24762 and the PDF link is https://arxiv.org/pdf/2512.24762v1.pdf. The paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in the field of recommender systems: the significant gap between current generative recommender systems and general intelligence. While recent advancements, particularly the OneRec series, have successfully unified the traditional multi-stage recommendation pipeline into an end-to-end generative framework, these systems remain constrained by several limitations:
-
Isolated Data Silos: Existing models are typically trained on isolated, domain-specific data, preventing them from leveraging the massive data scaling that drives the emergent capabilities of
Large Language Models (LLMs). -
Lack of General Intelligence: Consequently, these models operate as
domain specialists, excelling atcollaborative pattern matchingbut lacking crucialworld knowledge,reasoning capabilities, andinstruction followingabilities that are hallmarks ofgeneral intelligence. This limits their adaptability and broader utility. -
Catastrophic Forgetting: Attempts to integrate
LLMcapabilities, such as aligning discrete recommendation identifiers withLLMlinguistic space, often lead tocatastrophic forgettingof theLLMbackbone's inherent generalization capabilities, especially when fine-tuned on limited, task-homogeneous data. -
Inadequate Benchmarking: A major impediment is the absence of a holistic benchmark capable of evaluating the integrated capabilities (prediction, reasoning, instruction-following) essential for these next-generation recommendation
foundation models. Traditional benchmarks are confined to narrow, specialized tasks, usually focusing onclosed-set ranking accuracywithin single domains.The core problem is to bridge this semantic and functional gap to build truly intelligent recommender systems that can understand complex user intent, reason about recommendations, and follow diverse instructions, much like
LLMsdo in general domains.
2.2. Main Contributions / Findings
To address the identified challenges, the paper presents several key contributions:
- RecIF-Bench: A Holistic Recommendation Instruction-Following Benchmark & Open Data:
- Contribution: Introduction of
RecIF-Bench, a multi-dimensional benchmark featuring 8 diverse tasks spanning 4 capability layers (fromsemantic alignmenttocomplex reasoning). It covers short-video, e-commerce, and online advertising domains, designed to rigorously assess the multifaceted capabilities ofrecommender foundation modelsviainstruction following. It is the first benchmark to coverlong-sequenceinteractions,multi-task,multi-modal,cross-domain,multi-behavioralinteractions,interleaved data(text anditemic tokens), andrecommendation explanation. - Finding: It provides a robust testbed for quantifying the synergy between
instruction followingand recommendation. Additionally, theOneRec-Foundationmodels are evaluated on 7 widely-recognized general benchmarks to verify retention of broad reasoning and coding skills. A comprehensive training dataset (96 million interactions from 160,000 users) is released to facilitate reproducible research.
- Contribution: Introduction of
- Open-Source Framework & Validated Scaling Laws:
- Contribution: Open-sourcing of a complete training pipeline (built on
PyTorchandVeRL) encompassing data processing,co-pretraining, andpost-trainingprotocols. A novel two-stagealignment strategyis introduced, combiningon-policy distillationandrecommendation-oriented Reinforcement Learning (Rec-RL). - Finding: Empirical validation of
scaling lawsin the recommendation domain. The study demonstrates predictable capability scaling, with optimal compute allocation requiring data volume to scale more aggressively than model parameters (a data-intensive regime where and ). This framework effectively mitigatescatastrophic forgettingof general knowledge.
- Contribution: Open-sourcing of a complete training pipeline (built on
- OneRec-Foundation Model Family:
- Contribution: Release of the
OneRec-Foundationseries, including 1.7B and 8B parameter models, built uponQwen. This series moves beyond simple fine-tuning, endowing theLLMbackbone with intrinsic recommendation capabilities.Standardversions are trained on the open-source dataset, andProversions are enhanced with a hundred-billion-token industrial corpus. - Finding: The models achieve new
state-of-the-art (SOTA)performance across allRecIF-Benchtasks. Furthermore, they demonstrate exceptional cross-domain transferability, surpassing strong baselines on 10Amazon datasetswith an average 26.8% improvement inRecall@10, underscoring their robustness asfoundation models.
- Contribution: Release of the
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the OpenOneRec paper, a beginner should be familiar with several fundamental concepts in machine learning, particularly in the domains of natural language processing and recommender systems.
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are typically based on the
Transformerarchitecture and are known for their ability to learn complex patterns and perform a wide range of tasks, from translation to creative writing, throughnext-token prediction. Their "emergent capabilities" refer to advanced behaviors (like complex reasoning) that appear only when scaled to very large sizes and data volumes. - Generative Models: In machine learning, a generative model learns the distribution of data points in a dataset and can then generate new data samples that are similar to the training data. For example, a
generative LLMcan generate novel sentences or paragraphs. In recommendation,generative recommenderspredict the next item by generating its identifier or description, rather than just ranking existing items. - Recommender Systems: These systems aim to predict users' preferences for items (e.g., movies, products, videos) and suggest items they are likely to enjoy.
- Collaborative Filtering: A common technique in recommender systems that predicts user preferences based on the preferences of similar users or the characteristics of similar items. It essentially finds patterns in user-item interaction data.
- Sequential Recommendation: A sub-field of recommendation focusing on modeling the temporal dynamics of user behavior. It predicts the next item a user will interact with based on their sequence of past interactions.
- Transformer Architecture: The dominant neural network architecture for
LLMs. It relies heavily onself-attention mechanismsto weigh the importance of different parts of the input sequence when processing each element. This allows it to capture long-range dependencies in data, unlike recurrent neural networks. - Autoregressive Modeling: A type of statistical model where the output at a given time step is dependent on previous outputs. In
LLMs, this means generating text one token at a time, with each new token being predicted based on all the preceding tokens in the sequence. The paper formulates recommendation as aNext-Token Predictionproblem within this paradigm. - Instruction Following: The ability of an
LLMto understand and execute commands or requests given in natural language.Instruction tuningis a training method where models are fine-tuned on a diverse set of tasks presented as natural language instructions, enabling them to generalize to new, unseen instructions. - Catastrophic Forgetting: A phenomenon in neural networks where training a model on a new task causes it to forget previously learned information or skills from older tasks. This is a significant challenge when adapting
general LLMsto specific domains like recommendation, as fine-tuning might erase theirworld knowledge. - Reinforcement Learning (RL): A paradigm of machine learning where an "agent" learns to make decisions by performing actions in an environment to maximize a cumulative
rewardsignal. It's often used to fine-tuneLLMsfor better alignment with human preferences or specific task objectives. - Itemic Tokens: A novel representation for discrete items (like videos or products) that bridges the gap between item IDs and natural language. Instead of arbitrary IDs, items are represented by sequences of discrete
tokens(codes) derived from their semantic embeddings, allowingLLMsto process them as if they were words.
3.2. Previous Works
The paper builds upon and differentiates itself from several lines of prior research:
- OneRec Series (Deng et al., 2025; Zhou et al., 2025a,b): The
OpenOneRecpaper is presented as part of theOneRecseries, which is noted for successfully unifying the multi-stage recommendation pipeline into anend-to-end generative framework. This means moving from separate stages like retrieval, ranking, and explanation to a single model that can generate recommendations directly. - Generative Recommenders (Geng et al., 2022; Wu et al., 2023): This trend in recommendation treats the task as a language processing problem. The paper highlights this shift, noting that
LLMsoffer a new paradigm. - LLM-Recommendation Alignment (Zheng et al., 2024; Liu et al., 2025):
- LC-Rec (Zheng et al., 2024): A study that aligns discrete recommendation identifiers with the linguistic space of
LLMs. The paper citesLC-Recas a baseline and notes its limitation to alimited set of downstream tasks, potentially causingcatastrophic forgetting. - OneRec-Think (Liu et al., 2025): Another related work aiming to bridge the semantic gap.
- LC-Rec (Zheng et al., 2024): A study that aligns discrete recommendation identifiers with the linguistic space of
- Traditional Discriminative Recommender Models: The paper compares
OpenOneRecagainst several established sequential recommendation models:- BERT4Rec (Sun et al., 2019): Uses a
Transformer-basedarchitecture withbidirectional encoder representationsto predict masked items in a user's sequence, capturing rich contextual information. - GRU4Rec (Hidasi et al., 2016): Employs
Gated Recurrent Units (GRUs)to model sequential user behavior, particularly effective for session-based recommendations. - SASRec (Kang and McAuley, 2018):
Self-Attentive Sequential Recommendationmodel that uses theself-attentionmechanism to capture long-range dependencies and item transitions in user behavior sequences. - HSTU (Zhai et al., 2024):
Trillion-parameter sequential transducersfor generative recommendations, indicating very large-scale models. - ReaRec (Tang et al., 2025): Focuses on unleashing latent
reasoning powerfor sequential recommendation, suggesting a move towards more intelligent recommendation.
- BERT4Rec (Sun et al., 2019): Uses a
- Generative Recommender Baselines:
- TIGER (Rajput et al., 2023): A
generative retrievalmodel for recommender systems, serving as a strong baseline forgenerative recommendation.
- TIGER (Rajput et al., 2023): A
- Item Tokenization Techniques (Luo et al., 2025; Zhou et al., 2025a): The paper adopts
Itemic Tokensand mentionsRQ-Kmeans(Luo et al., 2025) for discretizing itemsemantic embeddingsintohierarchical discrete codes. This is a crucial step for integrating items into anLLM's token space.Finite Scalar Quantization (FSQ)(Mentzer et al., 2023) is also used for extending token depth intransfer learning. - Base
LLMBackbone:- Qwen (Yang et al., 2025): The
OneRec-Foundationmodels are built onQwen, a family of open-sourceLLMs. This provides the foundational linguistic andreasoning capabilities.
- Qwen (Yang et al., 2025): The
LLMScaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022): The paper references these works for its analysis of how model performance scales withmodel parameters() andtraining tokens().Chinchilla scaling laws(Hoffmann et al., 2022) are specifically mentioned for comparison.Reinforcement LearningFrameworks:- Group Relative Policy Optimization (GRPO) (Shao et al., 2024): Used in the
Rec-RLstage for fine-tuning. It's noted for computingadvantagerelative to a group of sampled trajectories, reducing computational overhead compared toActor-Critic algorithmslikePPO.
- Group Relative Policy Optimization (GRPO) (Shao et al., 2024): Used in the
- Data Deduplication:
- MinHash Algorithm (Broder, 1997): Used for
efficient fuzzy deduplicationto ensure evaluation benchmarks are not leaked into the training data.
- MinHash Algorithm (Broder, 1997): Used for
3.3. Technological Evolution
The field of recommender systems has seen a significant evolution:
-
Early Systems (e.g., Matrix Factorization): Focused on discovering latent factors from user-item interaction matrices. These were often
discriminativemodels, predicting ratings or clicks for existing items. -
Deep Learning for Sequential Recommendation: With the rise of deep learning, models like
GRU4RecandBERT4Recstarted to capture temporal dependencies in user behavior, treating interactions as sequences. This improved the ability to predict the next item. -
Multi-modal and Cross-domain Approaches: Researchers began incorporating rich
metadata(text, images) and leveraging data from multiple domains to enrich user and item representations. -
Generative Recommender Systems: The recent paradigm shift, inspired by
LLMs, views recommendation as agenerative task. Instead of just predicting scores or ranking, these systems generate item IDs, descriptions, or explanations. TheOneRecseries represents a significant step in this direction, unifying previously fragmented pipelines. -
LLM-based Foundation Models for Recommendation: This paper marks the latest stage, aiming to integrate thegeneral intelligence,reasoning, andinstruction-followingcapabilities ofLLMsinto recommendation. The goal is to move beyond domain-specific expertise to truly intelligent agents that can engage in complex dialogues about recommendations.OpenOneRecfits into this timeline at the cutting edge, striving to establishfoundation modelsfor recommendation that can leverageLLMscalability and generalizability, while also providing a holistic benchmark to evaluate these advanced capabilities.
3.4. Differentiation Analysis
Compared to existing methods, OpenOneRec introduces several core innovations and differentiations:
- Holistic Benchmarking with RecIF-Bench:
- Differentiation: Unlike existing benchmarks (e.g.,
PixelRec,KuaiSAR,NineRec,Yelp,Amazon) that partially addressmulti-modal,multi-task, orcross-domainaspects,RecIF-Benchis the first to simultaneously cover all seven critical dimensions:long-sequenceinteractions,multi-task,multi-modal,cross-domain,multi-behavioralinteractions,interleaved data(text anditemic tokens), andrecommendation explanation. This provides a truly comprehensive testbed forfoundation models. - Innovation: The inclusion of
Interleaved DataandRecommendation Explanationis particularly novel, directly assessing theinstruction-followingandreasoningcapabilities central toLLM-based recommenders.
- Differentiation: Unlike existing benchmarks (e.g.,
- Unified Generative Framework with Scalable Pre-training:
- Differentiation: While
OneRecseries andTIGERalso adoptgenerative frameworks,OpenOneRecspecifically integratescollaborative signalswithgeneral semanticsthroughItemic-Text Alignmentandmixed-domain Co-Pretraining. This approach is designed to overcome thedata silosandcatastrophic forgettingissues prevalent in prior attempts to adaptLLMsfor recommendation. - Innovation: The two-stage pre-training (
Itemic-Text AlignmentthenFull-Parameter Co-Pretraining) systematically bridges themodality gapand injects recommendation knowledge while preservinggeneral world knowledgefrom theQwen3backbone.
- Differentiation: While
- Hybrid Post-Training for Capability Balance:
- Differentiation: Many
LLM-based recommenders struggle to balance domain-specific precision withgeneral reasoning abilities.OpenOneRecintroduces a novel hybridpost-trainingpipeline:Multi-task Supervised Fine-tuning(SFT) is followed by alternatingOn-policy Distillation for General CapabilityandReinforcement Learning for Recommendation (Rec-RL). - Innovation:
On-policy distillationactively recoversgeneral reasoning abilitiesby using the originalLLMas a teacher, whileRec-RL(specificallyGroup Relative Policy Optimizationwith a rule-basedHit reward) directly optimizes for discreteranking metricsin recommendation, effectively balancing both objectives.
- Differentiation: Many
- Validated Scaling Laws in Recommendation:
- Differentiation: The paper empirically validates
scaling lawsspecifically in the recommendation domain. This extends the understanding from generalLLMsto a specialized application. - Innovation: The finding of a
data-intensive scaling regime( vs. ) suggests that, unlikeChinchilla's near-equiproportional split for general text, optimal compute allocation in recommendation requires scaling data volume more aggressively than model parameters.
- Differentiation: The paper empirically validates
- Exceptional Transferability and Few-Shot Learning Performance:
- Differentiation:
OpenOneRec-Foundationmodels demonstrate superiorcross-domain transferabilitycompared to baselines likeTIGER, especially infew-shot learningscenarios. - Innovation: The
Text-Augmented Itemic Tokensstrategy, which concatenates originalitemic tokenswithkeyword representations, effectively leverages thefoundation model's diverse capabilities (collaborative filtering, knowledge, semantic understanding) while maintainingpre-trained structural integrity. This adaptability is crucial for robust performance in new domains with limited data.
- Differentiation:
4. Methodology
4.1. Principles
The core idea behind the OpenOneRec framework is to bridge the semantic and functional gap between traditional recommendation systems and general intelligence by treating recommendation as an autoregressive generation problem within a Large Language Model (LLM) framework. The theoretical basis is that LLMs, with their vast world knowledge, reasoning capabilities, and instruction-following prowess, can transcend the limitations of domain-specific recommenders that are constrained by isolated data and lack broader intelligence.
The intuition is that if LLMs can process and generate natural language, they should also be able to process and generate item identifiers (or representations) if these items are properly aligned with the LLM's token space. This involves:
- Bridging the Modality Gap: Representing discrete items as
itemic tokensthat theLLMcan understand and generate, akin to natural languagetokens. - Unified Generative Formulation: Framing all recommendation tasks, from basic prediction to complex
reasoning, as anext-token predictionproblem, allowing a singleLLMarchitecture to handle diverse scenarios. - Scalable Pre-training: Integrating
collaborative signalsandgeneral semanticsduringpre-trainingto endow theLLMbackbone with intrinsic recommendation capabilities while preserving itsgeneral world knowledge. - Hybrid Post-training: Employing a multi-stage
post-trainingprocess that includessupervised fine-tuning,on-policy distillation, andreinforcement learningto simultaneously enhanceinstruction-followingfor recommendation tasks and restoregeneral intelligencecapabilities, effectively balancing domain-specific precision with broader reasoning. - Holistic Evaluation: Developing a comprehensive benchmark (
RecIF-Bench) to thoroughly assess these integrated capabilities, pushing beyond traditionalranking metricsto includeinstruction followingandexplanation generation.
4.2. Core Methodology In-depth (Layer by Layer)
The OpenOneRec framework is structured into three main phases: Pre-Training, Post-Training, and Evaluation, as depicted in Figure 2.
4.2.1. Items as Tokens: Bridging the Modality Gap
A foundational challenge in applying LLMs to recommendation is the inherent mismatch between the continuous semantic space of language and the discrete identifier space of items. Representing items with extensive textual descriptions is inefficient due to long context lengths for user histories and doesn't guarantee the generation of in-corpus items.
To address this, the paper adopts Itemic Tokens (Luo et al., 2025; Zhou et al., 2025a), treating items as a distinct modality. This involves:
- Hierarchical Quantization: The
RQ-Kmeans(Luo et al., 2025) algorithm is used to discretizepre-trained semantic embeddingsof itemmetadata(e.g., text, visual features) intohierarchical discrete codes. This process compresses item semantics into short, fixed-length sequences oftokens. - Vocabulary Extension: These
itemic tokensare appended to theLLM's original vocabulary, creating a unified vocabulary . - Benefits:
- Efficient Long-Context Modeling: Short
itemic tokensequences allow for processing longer user histories. - Preservation of Collaborative Structure: The hierarchical nature ensures items with similar semantics share common prefixes, enabling the model to
transfer knowledgebased on token proximity, analogous to natural language. - In-corpus Generation: Facilitates the generation of actual, known items.
- Efficient Long-Context Modeling: Short
4.2.2. Recommendation as Autoregressive Modeling
With items represented as tokens, the recommendation task is unified into a standard autoregressive generation problem. A user's interaction history is treated as a long context sequence comprising item tokens, optionally interleaved with text.
Tasks, from prediction (e.g., retrieval) to reasoning (e.g., explanation), are formulated as a Next-Token Prediction problem. Given an instruction and a user context (which includes interaction history and optional queries), the model maximizes the likelihood of the target response . The objective function is:
$
\mathcal { L } ( \theta ) = - \sum _ { t = 1 } ^ { | Y | } \log P _ { \theta } ( y _ { t } | \mathcal { I } , C , y _ { < t } )
$
Where:
-
: The negative log-likelihood loss, which the model aims to minimize. represents the parameters of the model.
-
: The length of the target response sequence.
-
: The probability of predicting the -th token in the target sequence, conditioned on the task instruction , the user context , and all previously generated tokens (i.e., tokens ).
-
: The -th token in the target response sequence.
-
: The task-specific instruction provided to the model.
-
: The personalized user context, which can be the user's interaction history () or their
User Portrait(). -
: The sequence of tokens generated before the current token .
This formulation allows leveraging standard
Transformer architecturesandLLM training infrastructurewithout task-specific architectural changes.
4.2.3. RecIF-Bench: Task Taxonomy
RecIF-Bench organizes 8 distinct tasks into a four-layer capability hierarchy to assess different levels of intelligence in recommendation foundation models. The tasks are formalized as sequence-to-sequence problems , where is the input (instruction and context) and is the target (item ID, sequence of IDs, or natural language explanation).
The four layers are:
- Layer 0: Semantic Alignment (
Item Understanding)- Input (X): Item .
- Target (Y): Item Description (textual metadata).
- Evaluation Focus: Verifies if the model can map
itemic tokensto natural language, bridging the modality gap.
- Layer 1: Fundamental Recommendation
Short Video Recommendation- Input (X): History (user's video viewing history).
- Target (Y): Next item .
- Evaluation Focus: Canonical
next-item predictionwithin a single domain.
Ad Recommendation- Input (X): History (video viewing history plus ad click history).
- Target (Y): Next item .
- Evaluation Focus: Cross-domain interest transfer and holistic user modeling.
Product Recommendation- Input (X): History (video viewing history plus product click history).
- Target (Y): Next item .
- Evaluation Focus: Cross-domain interest transfer and holistic user modeling.
Label Prediction- Input (X): History .
- Target (Y): Binary (Yes/No) engagement.
- Evaluation Focus: Pointwise estimation complementing
generative recommendation.
- Layer 2: Instruction Following
Interactive Recommendation- Input (X): Portrait (user profile plus natural language intent).
- Target (Y): Item user engages with.
- Evaluation Focus: Model's ability to adapt predictions based on explicit natural language instructions.
Label-Conditional Recommendation- Input (X): History (video history plus target behavior label, e.g., "Like").
- Target (Y): Item with action .
- Evaluation Focus: Fine-grained behavior modeling based on specified actions.
- Layer 3: Reasoning
Recommendation Explanation-
Input (X): Portrait .
-
Target (Y): Explanation (natural language justification).
-
Evaluation Focus: Model's ability to synthesize information and articulate
user-item compatibilityreasoning.Figure 7 from the original paper (Figure 4 in original) provides a visual summary of the
Task Taxonomy of RecIF-Bench.
该图像是示意图,展示了 RecIF-Bench 中 8 个任务的分类架构,涵盖了 4 个能力层级,并展示了任务之间的关系与结构。
-
Figure 7 | Task Taxonomy of RecIF-Bench. Weorganize 8 tasks across 4 capability layers, speing the instruction, context, and target.
4.2.4. Pre-Training
The pre-training phase is designed to inject recommendation knowledge and align itemic tokens while preserving general world knowledge. It uses a Qwen3 architecture as its backbone and is structured into two stages.
4.2.4.1. Pre-training Data
The pre-training corpus consists of Recommendation Corpora and General-Domain Corpora.
- Recommendation Corpora: Derived from anonymized user logs from Kuaishou, covering user-side, item-side, and interaction-side
metadata.- Itemic Tokenization: Item
multimodal embeddingsare quantized into 3-layerhierarchical itemic tokensusingRQ-Kmeans, with a codebook size of 8192 per layer. Each item is mapped to a tuple of codes , flattened into a token sequence wrapped by special tokens (e.g.,<|item_begin|><item_a_C1><item_b_C2><item_c_C3><|item_end|>). - Data Types:
Itemic Dense Caption Data: Trains the model to generate natural-language captions givenitemic tokens, establishing a semantic bridge.Sequential User Behavior Data: Captures chronological user-item interactions (views, likes, shares) fornext-item predictionand learningcollaborative filteringsignals.Interleaved User Persona Grounding Data: Constructs narrative-styleUser Portraitsby interleaving discrete item representations with heterogeneous usermetadata(demographics, search history, interaction sequences, summarized interests). This fosters deeper semantic grounding between user characteristics and behaviors.
- Scale: The primary training corpus for the standard
OneRecvariant involves ~160k users, 13 million item captions, and corresponding interactions. For theOneRec-Provariant, this scales to ~20 million users and 98 million item captions, leveraging an in-house industrial corpus.
- Itemic Tokenization: Item
- General-Domain Corpora: High-quality general-domain text corpora are mixed in during
co-pretrainingto mitigatecatastrophic forgettingand maintaingeneral intelligence.- Diversity: Includes multiple languages (Chinese, English) and specialized domains (Mathematic, Medical).
- Reasoning-Intensive Data: Prioritizes data like mathematical derivations, logical puzzles, and code-centric corpora to enhance
reasoning capabilities. - Deduplication: The
MinHash algorithmis used to filter out general-domain samples similar to evaluation benchmarks, ensuring reliable generalization.
4.2.4.2. Training Recipe
The pre-training uses the Qwen3 architecture. Two model variants are developed: OneRec (standard, 33B tokens, 41.3 million samples from public datasets) and OneRec-Pro (enhanced, 130B tokens, 179.1 million samples from in-house corpus).
The pre-training methodology involves two distinct stages:
- Stage 1: Itemic-Text Alignment:
- Objective: Establish a preliminary alignment between
itemic tokensandtext tokens. - Process: The vocabulary is expanded by appending
itemic special tokensto theQwen3 tokenizer. Theembedding parametersfor theseitemic tokensare initialized from amultivariate normal distribution(mean and covariance of existing embeddings). Only theseitemic token embedding parametersare trainable, while all other model parameters are frozen. For larger models (8B+),output projection parameterscorresponding toitemic tokensare also trainable.
- Objective: Establish a preliminary alignment between
- Stage 2: Full-Parameter Co-Pretraining:
- Objective: Inject recommendation knowledge while preserving
general world knowledge. - Process: All model parameters are unfrozen. Full-parameter
pre-trainingis conducted on a mixed dataset of recommendation-domain samples and a considerable proportion ofgeneral-domain knowledge datato preventcatastrophic forgetting.
- Objective: Inject recommendation knowledge while preserving
- Training Recipe Details:
- Optimizer:
AdamWwith , , andweight decayof 0.1. - Learning Rate Schedule:
Cosine decaywith alinear warmup phase.- Peak LR: for Stage 1, for Stage 2.
- Minimum LR: for Stage 1, for Stage 2.
- Warmup Duration: First 10% of training steps.
- Maximum Context Length: 32K tokens to accommodate long
sequential user behavior data.
- Optimizer:
4.2.4.3. Scaling Laws in Recommendation
To optimize the allocation of compute budget between model parameters and training tokens , the paper follows Hoffmann et al. (2022). The Qwen3 architecture is evaluated across parameters. The compute budget is approximated as .
The compute-optimal frontier is derived from the convex hull of the final training loss. Power-law scaling relations are fitted:
$
N _ { \mathrm { o p t } } \propto C ^ { a } , \quad D _ { \mathrm { o p t } } \propto C ^ { b }
$
-
Scaling Laws on Recommendation Data: Empirical fit yields exponents:
-
This indicates a
data-intensive scaling regime(), suggesting that optimal compute in recommendation requires scaling data volume more aggressively than model parameters, a deviation fromChinchilla scaling laws(which imply ).
-
Parametric Fit and Interpretation: The final loss
L(N, D)is modeled using a parametric function: $ L ( N , D ) = E + \frac { A } { N ^ { \alpha } } + \frac { B } { D ^ { \beta } } $ Where:-
: Irreducible loss floor (minimum possible loss).
-
: Component of loss due to finite model capacity, where is a coefficient and is the model capacity exponent.
-
: Component of loss due to finite data size, where is a coefficient and is the data exponent.
Fitting this to experimental data yields: $ L ( N , D ) = 0 . 4 2 3 2 + \frac { 5 0 2 . 3 2 } { N ^ { 0 . 3 3 2 5 } } + \frac { 7 . 0 2 } { D ^ { 0 . 1 8 6 5 } } $ From these coefficients, three insights are derived:
-
Data-Hungry Scaling (): The model capacity exponent () is consistent with
LLMliterature, but the data exponent () is lower than typical text-domain values (). Since , this necessitates , confirming the need for more aggressive data scaling. -
Impact of Warm-Starting (High A, Low B): The imbalance between (502.32) and (7.02) is attributed to
transfer learningfrom theQwen3backbone, which lowers initial data entropy (low ). The inflated captures performance gains from larger models being trained with more data. -
Low Entropy of Recommendation Tasks (Low E): The estimated irreducible loss floor is substantially lower than for natural text (). This suggests
recommendation taskswith structured features (likeItemic Dense Captions) have lower inherent entropy, allowing the model to approach a more deterministic state, emphasizing the need for diverse and high-quality recommendation corpora.
-
4.2.5. Post-Training
After pre-training, the model aligns itemic tokens and encodes collaborative filtering signals but may still lack advanced instruction-following and reasoning capabilities. The post-training phase (Figure 6) aims to enhance recommendation capabilities and restore general task performance using identical training data and strategies for both OneRec and OneRec-Pro.
该图像是OneRec系列模型的后训练流程示意图,展示了模型从预训练到多任务微调、一般策略蒸馏,再到推荐任务的转变过程。每个阶段的处理流程清晰可见,强调了推荐系统的构建步骤。
Figure 6 | Post-training pipeline of the OneRec series models.
4.2.5.1. Multi-task Supervised Fine-tuning (SFT)
- Objective: Restore and enhance
foundational instruction-followingandreasoning capabilitiesacross bothgeneralandrecommendation domains, establishing a robust base for subsequent stages. - Data: A specialized
SFTcorpus is curated by blending complexinstruction-response pairsfromRecIF-Bench-specific cleaned metadata (from 160K users) with high-quality open-sourcegeneral-domain datasetsfocusing oninstruction-followingandcomplex reasoning. This corpus ensures no leakage from evaluation benchmarks. - Format: All instances are organized into a
conversational formatand serialized using theQwen3 chat template. - Training: Fine-tuning on this unified dataset uses a recipe consistent with
pre-trainingbut with a reducedlearning rate(from to ). - Outcome: Successfully resuscitates
instruction-following.Reasoning abilityfromgeneral-domain datacross-fertilizes withrecommendation tasks, leading to coherent reasoning trajectories for complex recommendation queries even without explicit supervision in recommendation samples.
4.2.5.2. On-policy Distillation for General Capability
-
Objective: Address a persistent capability gap in
general-domain reasoningobserved afterSFT, likely due todistributional shiftandRL-initialized backbones. -
Method:
On-policy distillation via Policy Gradient. Unlikeoff-policy distillation(learning from static pre-generated data),on-policy distillationinvolves the student model generating its own trajectories, which are then evaluated and supervised by a teacher. -
Objective Function: Per-token
reverse KL divergencebetween the student's distribution () and the teacher's (). $ \mathbb { D } _ { K L } \left( \pi _ { \theta } \parallel \pi _ { \mathsf { t e a c h e r } } \right) = \mathbb { E } _ { x \sim \pi _ { \theta } } \left[ \log \pi _ { \theta } ( x _ { t + 1 } | x _ { 1 . . t } ) - \log \pi _ { \mathsf { t e a c h e r } } ( x _ { t + 1 } | x _ { 1 . . t } ) \right] $ Where:- : The
Kullback-Leibler (KL) divergence, a measure of how one probability distribution () is different from a second, reference probability distribution (). Minimizing this divergence makes the student policy's outputs similar to the teacher's. - : The student policy (the model being trained), parameterized by .
- : The teacher policy (a pre-trained, typically stronger model).
- : Expected value over trajectories sampled from the student policy .
- : Log probability of the next token given the current trajectory prefix under the student policy.
- : Log probability of the next token given the current trajectory prefix under the teacher policy.
- : The
-
Policy Gradient Optimization: The policy is optimized using
policy gradient methods. For eachinput prompt, a trajectory is sampled, and thereverse KL divergenceis used as areward signal. The objective is to maximize the expected reward viagradient ascent: $ \nabla _ { \theta } J ( \theta ) = \mathbb { E } _ { o \sim \mathcal { D } , x \sim \pi _ { \theta } } \left[ \sum _ { t = 1 } ^ { T } \nabla _ { \theta } \log \pi _ { \theta } ( x _ { t } | o , x _ { < t } ) \cdot R _ { K L } ( o , x ) \right] $ Where:- : The gradient of the objective function with respect to model parameters . Maximizing this objective means updating to increase the likelihood of actions that lead to higher rewards.
- : Expected value over prompts sampled from dataset and trajectories sampled from the student policy .
- : The length of the sampled trajectory .
- : The gradient of the log-probability of taking action at time , given the prompt and previous actions , under the student policy. This is the "score function" or "log-likelihood gradient."
- : The
per-token rewardderived from the teacher's distribution for the trajectory generated from prompt .
-
Reward Clipping: To mitigate numerical instability from extreme
log-probability ratios, aclipping mechanismis applied to thereverse KL divergence: $ R _ { K L } ( o , x ) = \mathrm { c l i p } \left( - \mathbb { D } _ { K L } ( \pi _ { \theta } | | \pi _ { \mathrm { t e a c h e r } } ) , \alpha , \beta \right) $ Where:-
: A function that limits the value of its input. If the input is less than , it becomes ; if it's greater than , it becomes ; otherwise, it remains unchanged.
-
: Lower and upper clipping thresholds. This prevents outlier reward signals from destabilizing training.
The comprehensive pipeline is illustrated in Figure 10 from the original paper (Figure 7 in original).
该图像是示意图,展示了通过策略梯度进行在线政策蒸馏的流程。图中,学生模型(Policy Model)针对给定的提示 采样出轨迹 ,同时教师模型通过反向KL散度提供反馈作为奖励 。学生模型基于奖励 通过策略梯度方法进行迭代优化。
Figure 7 | The pipeline of On-policy Distillation via Policy Gradient. The student model (Policy Model) samples a trajectory o for the given prompt , while the Teacher Model provides feedback through Reverse KL divergence as reward The Policy Model is iteratively optimized using policy gradient methods based on reward.
-
-
On-Policy Distillation on General-Domain:
- Teacher Model: The original
Qwen3model (of the same parameter scale) serves as the teacher (). - Vocabulary Discrepancy Strategy: The
Qwen3teacher cannot recognizeitemic tokens. To handle this:Prompt Selection: Queries are sampled exclusively fromgeneral-domain datasets, expecting pure text generation.Itemic Token Penalty & Truncation: If anitemic tokenappears in a sampled trajectory at step , is set to a minimal value (e.g.,-1e9) to simulate zero probability, and the trajectory is truncated. This provides a strong negative signal.Enhanced Exploration: A hightemperature coefficientduring sampling encourages exploration, allowing the distillation process to identify and correctitemic tokenactivations ingeneral-domain tasks.
- Thinking Paradigms: To recover
instruction-followingon thinking, a suffix (/think,/no_think, or empty) is randomly appended to user prompts to align withforced-thinking,non-thinking, andauto-thinkingparadigms, as described in theQwen3technical report.
- Teacher Model: The original
4.2.5.3. Reinforcement Learning for Recommendation (Rec-RL)
-
Objective: Directly optimize discrete
ranking metrics(e.g.,Recall,NDCG) thatSFTdoesn't directly address, which often suffers fromexposure biasand struggles to distinguish "near-misses." -
Method:
Group Relative Policy Optimization (GRPO)(Shao et al., 2024). This framework avoids a separatecritic modelby computing theadvantageof a response relative to a group of sampled trajectories for the same prompt, reducing computational overhead while maintaining stability. -
Objective Function: Maximize: $ \mathcal { L } _ { G R P O } ( \theta ) = \frac { 1 } { G } \sum _ { i = 1 } ^ { G } \left( \mathsf { A d v } _ { i } \cdot \log \pi _ { \theta } ( R _ { i } | q ) \right) - \beta \cdot K L ( \pi _ { \theta } | | \pi _ { r e f } ) $ Where:
- : The
GRPOobjective function to be maximized. - : The number of candidate responses sampled for a given prompt .
- : The -th sampled candidate response.
- : The
relative advantageof response , calculated by normalizingrewardswithin the group. This term dictates how much the probability of should be increased or decreased. - : The log-probability of generating response given prompt under the current policy .
- : A
hyperparametercontrolling the strength of theKL penalty. - : The
KL divergencebetween the current policy and areference policy. This penalty encourages the new policy not to deviate too far from the previous, stable policy (afteron-policy distillation), preventing instability andcatastrophic forgettingofgeneral intelligence.
- : The
-
Rule-based Recommendation Reward: A sparse, rule-based
reward functionis designed for the 5 corerecommendation tasks(Short Video Rec,Ad Rec,Product Rec,Interactive Rec,Label-Conditional Rec). Therewardis focused on "Hit" events: $ r ( R _ { i } ) = \left{ \begin{array} { l l } { + 1 . 0 } & { { \mathrm { i f ~ t h e ~ t a r g e t ~ i t e m i c ~ t o k e n ~ } } s \in R _ { i } } \ { 0 . 0 } & { { \mathrm { o t h e r w i s e } } } \end{array} \right. $ This encourages the model to assign higher probability mass toitemic tokensthat lead to successful hits, effectively performing "Soft Ranking" within thegenerative space. -
Implementation Details:
- Initialization: The
RL traineris initialized with the model afteron-policy distillation. - KL Penalty: A strict
KL penalty() against is maintained to prevent sacrificinggeneral intelligencefor domain-specific precision. - Dataset: Uses the same dataset as the
SFTstage.
- Initialization: The
-
Outcome: Significant boost in
recommendation metrics, aligninggenerative behaviorwith recommender system goals.
5. Experimental Setup
5.1. Datasets
The experiments in OpenOneRec utilize several datasets to evaluate both recommendation-specific and general intelligence capabilities.
5.1.1. RecIF-Bench
RecIF-Bench is a newly proposed, comprehensive benchmark designed to rigorously evaluate recommendation foundation models.
-
Source & Scale: Aggregates approximately 120 million (
120M) interactions from 200,000 (200K) distinct users. -
Domain Coverage: Spans three heterogeneous industrial domains, each capturing different user behavior patterns:
- Short Video (Content Domain): Short-form videos from Kuaishou, including viewing behaviors across various app tabs. Provides impression sequences with corresponding interaction types.
- Ad (Commercial Domain): Promotional short videos sponsored by advertisers on the Kuaishou platform, typically with clickable redirects. Provides click sequences of user ad click behaviors.
- Product (E-commerce Domain): Products listed in the Kuaishou Mall. Provides click sequences of user product click behaviors.
-
Rich Metadata: Beyond interaction logs,
RecIF-Benchprovides comprehensive metadata:- User-side: Each user has a
User Portrait—a unified narrative interleaving natural language descriptions withitemic tokens. This portrait includes demographics (gender, age), content creation history, recent searches, followed creator types, viewing preferences, comments, livestream views, purchase records, shopping cart items, local service coupons, ad exposures, and commercial intent signals. - Item-side: Each item is associated with
multimodal embeddings(4096-dim text embedding and 5-frame visual embeddings with 1152-dim per frame). Dense captions are also provided for approximately 13 million videos. - Interaction-side: For each user-video pair in the exposure sequence, multi-label behavioral signals are recorded, including like, follow, comment, effective view, and dislike.
- User-side: Each user has a
-
Itemic Tokenization: All items are pre-tokenized into tuples of
discrete tokenss = ( c _ { 1 } , c _ { 2 } , . . . , c _ { k } )using ahierarchical quantization strategy, enabling direct consumption byLLM-based recommenders. The paper notes flexibility for researchers to train customitemic tokenizersor use traditional item IDs. -
Data Splitting Strategy: A strict
user-based splitting strategyis used. 20% of users are randomly selected as the held-out test set, ensuring zero leakage. For each user, interactions are partitioned temporally: interactions before a timestamp form the history , and those after serve as the target .The following are the statistics of
RecIF-Benchfrom Table 1 of the original paper:Domain # Users # Items # Interactions Avg. Hist. Item Avg. Tgt. Item Short Video 195,026 13,107,675 94,443,611 458.1 8.6 Ad 151,259 177,548 5,341,911 29.9 5.5 Product 144,307 2,055,240 20,087,210 132.5 6.7 Total 202,359 15,340,463 119,872,732 574.9 17.5
The following are the data distribution analysis of RecIF-Bench from Figure 3 of the original paper:
该图像是图表,展示了 RecIF-Bench 数据分布分析。其中 (a) 显示各域项目的热门程度分布(对数-对数坐标),(b-d) 展示短视频、广告和产品域的用户历史长度分布。各子图反映了不同类型项目的互动计数和历史长度特点。
Figure 3 | Data distribution analysis of RecIF-Bench. (a) Item popularity distribution (log-log scale) across domains. (b-d) Distribution of user history lengths for Short Video, Ad, and Product domains, respectively.
5.1.2. Amazon Benchmark
To evaluate cross-domain transferability and generalization capabilities, the models are tested on 10 real-world datasets from the popular Amazon review benchmark (McAuley et al., 2015).
- Domains: Baby, Beauty, Cell Phones and Accessories, Grocery and Gourmet Food, Health and Personal Care, Home and Kitchen, Pet Supplies, Sports and Outdoors, Tools and Home Improvement, and Toys and Games.
- Purpose: These diverse domains rigorously validate whether the
foundation model,pre-trainedon open-domain data, provides a fundamental transfer advantage for specific downstream recommendation distributions. - Data Pre-processing: Sparse users and items with fewer than 5 interactions are discarded.
- Splitting Strategy:
Leave-one-out strategy(Rajput et al., 2023; Wang et al., 2024) for training and evaluation in thesequential recommendation setting.
5.1.3. General Intelligence Sanity Check
To ensure that recommendation-focused models retain general intelligence, a sanity check suite is included, covering four categories:
- Math & Text Reasoning:
MATH-500,GSM8K, andAIME'24. - General Tasks:
MMLU-ProandGPQA-Diamond. - Alignment:
IFEVAL(strict prompt). - Coding:
LiveCodeBench v5.
5.1.4. Pre-training Data
The pre-training data for OneRec (standard variant) consists of 33B tokens across 41.3 million samples, derived from a mixture of general-domain and recommendation-domain corpora. OneRec-Pro uses 130B tokens and 179.1 million samples.
The following are the detailed data composition and token budgets for pre-training from Table 13 of the original paper:
| Dataset | Weight (%) | Category | Subtotal (%) | Token Budget |
| Nemotron_CC_Math_v1 | 37.41% | Math | 62.34% | 29B |
| Nemotron_Pretraining_Code_v1 | 12.91% | Code | ||
| Nemotron_CC_v2 | 5.59% | Math, General, Code | 4B | |
| reasoning_v1_20m | 4.04% | General | ||
| OpenMathReasoning | 1.16% | Math | ||
| NuminaMath-QwQ-CoT-5M | 0.79% | Math | ||
| KodCode_V1_SFT_R1 | 0.26% | Code | ||
| Chinese-Reasoning-Distil-Data | 0.09% | General | ||
| medical-o1-reasoning-SFT | 0.09% | Medical | ||
| Itemic Dense Caption Data | 12.01% | Reco | 37.66% | |
| Interleaved User Persona Grounding Data | 10.03% | Reco | ||
| Sequential User Behavior Data | 15.62% | Reco | ||
| Total | 100% | 100% | 33B |
The following are the data composition and token budgets for pre-training stages from Table 14 of the original paper:
| Model | Stage | Training | General-Domain | Reco-Domain | Token Budget |
| OneRec | stage1 | Itemic-related Parameters | - | - | 16B |
| stage2 | Full-Parameter | 62.34% | 37.66% | 33B | |
| OneRec-Pro | stage1 | Itemic-related Parameters | - | - | 30B |
| stage2 | Full-Parameter | 53% | 47% | 130B |
5.1.5. Multi-task SFT Data
The Multi-task Supervised Fine-tuning (SFT) stage uses a blended corpus of general-domain reasoning samples and recommendation-specific tasks.
The following are the data mixture for Multi-task SFT from Table 15 of the original paper:
| Dataset | Weight (%) | Category | Subtotal (%) |
| OpenMathReasoning | 12.971% | Math | 64.978% |
| R1-Distill-SFT | 12.784% | General Instruction | |
| Infinity_Instruct | 11.359% | Instruction | |
| OpenCoderReasoning | 11.130% | Code | |
| Chinese-Reasoning-Distil-Data | 4.552% | General | |
| Reasoning_Multi_subject_RLVR | 4.376% | Multi-subject | |
| Reasoning_KodCode_V1_SFT_R1 | 4.167% | Code | |
| DeepMath103K | 2.362% | Math | |
| medical-o1-reasoning-SFT | 1.277% | Medical | |
| Label prediction | 7.800% | Reco | 35.022% |
| SID to Caption generation | 7.493% | Reco | |
| Interactive recommendation | 6.392% | Reco | |
| Video recommendation | 3.971% | Reco | |
| Label conditional recommendation | 3.575% | Reco | |
| Total | 100% | 100% |
5.2. Evaluation Metrics
The paper employs a dual-metric evaluation system to cover both recommendation accuracy and generation quality.
5.2.1. Recommendation Metrics (for Layer 1 & 2 tasks)
- Pass@K:
- Conceptual Definition:
Pass@Kmeasures whether theground truth itemis present among the top items generated or recommended by the model. It is a binary hit/miss metric. If any of the predicted items matches a ground truth relevant item, it's considered a "pass." - Mathematical Formula: $ \mathrm{Pass@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{relevant_item}{u} \in \text{TopK_predictions}{u}) $
- Symbol Explanation:
- : Total number of users in the evaluation set.
- : An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
- : The actual relevant item(s) for user in the test set.
- : The set of top items recommended by the model for user .
- Conceptual Definition:
- Recall@K:
- Conceptual Definition:
Recall@Kmeasures the proportion ofrelevant itemsthat are successfully retrieved by the recommendation system within its top recommendations. It focuses on how many of the truly relevant items the model was able to "find." - Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\text{relevant_items}{u} \cap \text{TopK_predictions}{u}|}{|\text{relevant_items}_{u}|} $
- Symbol Explanation:
- : Total number of users in the evaluation set.
- : The set of all relevant items for user in the test set.
- : The set of top items recommended by the model for user .
- : Denotes the cardinality (number of elements) of a set.
- : Set intersection operator.
- Conceptual Definition:
- AUC (Area Under the Receiver Operating Characteristic Curve):
- Conceptual Definition:
AUCis a commonly used metric for binary classification problems (likeLabel Prediction). It measures the ability of a model to distinguish between positive and negative classes across all possible classification thresholds. A higherAUCindicates better discrimination power. AnAUCof 0.5 means the model performs no better than random guessing, while anAUCof 1.0 indicates perfect classification. - Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{positive}} \sum_{j \in \text{negative}} \mathbb{I}(P(i) > P(j)) + 0.5 \cdot \mathbb{I}(P(i) = P(j))}{|\text{positive}| \cdot |\text{negative}|} $
- Symbol Explanation:
positive: Set of positive samples (e.g., actual engagements).negative: Set of negative samples (e.g., non-engagements).P(i): The predicted probability (or score) that sample belongs to the positive class.- : An indicator function.
- : Number of positive samples.
- : Number of negative samples.
- Conceptual Definition:
- NDCG@K (Normalized Discounted Cumulative Gain at K):
  - Conceptual Definition: NDCG@K is a measure of ranking quality, especially useful when relevance judgments are graded (e.g., highly relevant, somewhat relevant, not relevant). It accounts for both the relevance of recommended items and their position in the ranked list: highly relevant items ranked higher contribute more to the score. The DCG is normalized by the ideal DCG to yield a score between 0 and 1.
  - Mathematical Formula: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $, $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the ranked list.
    - $\log_2(i+1)$: The discount factor for items at lower positions.
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$, i.e., the maximum possible DCG if all relevant items were perfectly ranked.
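To make these definitions concrete, here is a minimal, self-contained Python sketch of the four recommendation metrics. It is not the paper's evaluation harness; the input formats (`predictions`, `relevant`, `ranked_relevances`) are assumptions made for illustration.

```python
import math


def pass_at_k(predictions, relevant, k):
    """Pass@K: fraction of users whose top-k list contains at least one relevant item.

    predictions: user_id -> ranked list of item ids (best first)
    relevant:    user_id -> non-empty set of ground-truth item ids
    """
    users = list(relevant)
    hits = sum(1 for u in users if set(predictions.get(u, [])[:k]) & relevant[u])
    return hits / len(users) if users else 0.0


def recall_at_k(predictions, relevant, k):
    """Recall@K: average share of a user's relevant items retrieved in the top-k."""
    users = list(relevant)
    total = sum(
        len(set(predictions.get(u, [])[:k]) & relevant[u]) / len(relevant[u])
        for u in users
    )
    return total / len(users) if users else 0.0


def auc(pos_scores, neg_scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        return 0.5
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def ndcg_at_k(ranked_relevances, k):
    """NDCG@K for one ranked list of graded relevance scores, in model-ranked order."""
    def dcg(rels):
        # position i+1 is 1-indexed, so the discount is log2((i+1) + 1)
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```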
5.2.2. Text Generation Metrics (for Layer 0 & 3 tasks)
- LLM-as-Judge Score:
  - Conceptual Definition: For the Item Understanding and Recommendation Explanation tasks, an independent LLM (Gemini-2.5-Flash-Lite) is used as a judge to rate the quality of generated text on dimensions like accuracy and coherence. This overcomes the limitations of traditional N-gram based metrics (like BLEU or ROUGE), which may not capture semantic quality for open-ended generation.
  - Evaluation Process (Appendix B.1):
    - Information Point Extraction: The LLM judge decomposes both the ground truth captions (generated by Gemini-2.5-Pro) and the model-generated captions into atomic Weighted Information Points (WIPs). Each WIP contains a fact statement and an importance score (1-5).
    - Semantic Matching: The LLM judge aligns model-generated WIPs with ground truth WIPs, identifying valid matches, hallucinations (unmatched model WIPs), and omissions (unmatched GT WIPs).
    - Weighted Scoring: A Double-Weighted F1 Score is computed. For each matched pair $(w_{gt}, w_{model})$, a match quality score $q$ is calculated using BERTScore (with bert-base-chinese). A minimal code sketch of this aggregation follows the symbol list below.
  - Mathematical Formula for F1 Score:
    $ TP_i = \sum_{(w_{gt}, w_{model}) \in \mathrm{Matches}} \mathrm{score}(w_{gt}) \times q $
    $ FN_i = \sum_{(w_{gt}, w_{model}) \in \mathrm{Matches}} \mathrm{score}(w_{gt}) \times (1 - q) + \sum_{w_{gt} \in \mathrm{Unmatched\ GT}} \mathrm{score}(w_{gt}) $
    $ FP_i = \sum_{(w_{gt}, w_{model}) \in \mathrm{Matches}} \mathrm{score}(w_{model}) \times (1 - q) + \sum_{w_{model} \in \mathrm{Unmatched\ Model}} \mathrm{score}(w_{model}) $
    $ F1_i = \frac{2 \cdot TP_i}{2 \cdot TP_i + FP_i + FN_i} $
    $ \mathrm{LLM\text{-}Judge\ Score} = \frac{1}{N} \sum_{i=1}^{N} F1_i $
  - Symbol Explanation:
    - $TP_i$: Weighted True Positives for sample $i$: the sum of importance scores of matched ground truth WIPs, weighted by their BERTScore match quality $q$.
    - $FN_i$: Weighted False Negatives for sample $i$: scores of matched GT WIPs weighted by $(1-q)$ (missed aspects of matched WIPs) plus scores of completely unmatched GT WIPs (omissions).
    - $FP_i$: Weighted False Positives for sample $i$: scores of model WIPs that matched GT WIPs weighted by $(1-q)$ (incorrect aspects of matched WIPs) plus scores of completely unmatched model WIPs (hallucinations).
    - $\mathrm{score}(\cdot)$: Importance score of a WIP.
    - $q$: BERTScore (F1) match quality for a pair of WIPs.
    - Matches: Set of matched WIP pairs.
    - Unmatched GT: Set of ground truth WIPs that were not matched.
    - Unmatched Model: Set of model-generated WIPs that were not matched.
    - $F1_i$: The F1 score for sample $i$.
    - $\mathrm{LLM\text{-}Judge\ Score}$: The final average F1 score across all $N$ samples.
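As a minimal sketch of the aggregation step only (assuming WIP extraction, matching, and the BERTScore qualities have already been produced by the judge pipeline; the function name and input format below are hypothetical):

```python
def double_weighted_f1(matches, unmatched_gt, unmatched_model):
    """Aggregate one sample's Double-Weighted F1 from weighted information points (WIPs).

    matches:         list of (gt_score, model_score, q), where q is the BERTScore
                     match quality in [0, 1] for a matched WIP pair
    unmatched_gt:    list of importance scores of omitted ground-truth WIPs
    unmatched_model: list of importance scores of hallucinated model WIPs
    """
    tp = sum(gt * q for gt, _, q in matches)
    fn = sum(gt * (1 - q) for gt, _, q in matches) + sum(unmatched_gt)
    fp = sum(m * (1 - q) for _, m, q in matches) + sum(unmatched_model)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```

The final LLM-Judge Score would then be the mean of this per-sample F1 over all evaluated samples.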
5.3. Baselines
The OneRec-Foundation models are compared against two groups of competitive baselines:
5.3.1. Discriminative Recommender Models
These are traditional models that typically focus on predicting scores or ranking items.
- BERT4Rec (Sun et al., 2019)
- GRU4Rec (Hidasi et al., 2016)
- SASRec (Kang and McAuley, 2018)
- HSTU (Zhai et al., 2024)
- ReaRec (Tang et al., 2025)
- Adaptation: These methods are inherently task-specific, requiring separate training for each RecIF-Bench task. They were adapted with task-specific modifications to support the diverse RecIF tasks.
5.3.2. Generative Recommender Models
These models also aim to generate recommendations, but OpenOneRec distinguishes itself by its holistic approach and foundation model capabilities.
- TIGER (Rajput et al., 2023)
- LC-Rec (Zheng et al., 2024): This model was specifically implemented as "LC-Rec-8B" using a comparable Qwen3-8B backbone to ensure a fair comparison at the foundation model scale.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Main Results on RecIF-Bench
The evaluation on RecIF-Bench demonstrates OpenOneRec-Foundation's superior performance across a wide range of recommendation tasks.
The following are the unified performance comparison across all tasks from Table 4 of the original paper:
| OneRec-1.7B | OneRec-1.7B-Pro | OneRec-8B | OneRec-8B-Pro | ||||||
| Task | Metric | Best Baseline | LC-Rec-8B | SASRec | BERT4Rec | GRU4Rec | HSTU | ReaRec | TIGER |
| Short Video Rec | Pass@1 | 0.0045 | 0.0040 | 0.0051 | 0.0043 | 0.0052 | 0.0168 | 0.0341 | 0.0496 |
| Pass@32 | 0.0951 | 0.0993 | 0.1003 | 0.1010 | 0.1002 | 0.1061 | 0.1306 | 0.1710 | |
| Recall@32 | 0.0113 | 0.0119 | 0.0119 | 0.0120 | 0.0117 | 0.0132 | 0.0180 | 0.0272 | |
| Ad Rec | Pass@1 | 0.0044 | 0.0061 | 0.0059 | 0.0035 | 0.0076 | 0.0125 | 0.0197 | 0.0169 |
| Pass@32 | 0.0980 | 0.1225 | 0.1102 | 0.1054 | 0.1266 | 0.1769 | 0.2096 | 0.2037 | |
| Recall@32 | 0.0293 | 0.0381 | 0.0336 | 0.0327 | 0.0409 | 0.0581 | 0.0723 | 0.0707 | |
| Product Rec | Pass@1 | 0.0052 | 0.0054 | 0.0047 | 0.0030 | 0.0055 | 0.0120 | 0.0178 | 0.0144 |
| Pass@32 | 0.0914 | 0.0936 | 0.0821 | 0.0907 | 0.0914 | 0.1276 | 0.1809 | 0.1571 | |
| Recall@32 | 0.0175 | 0.0193 | 0.0161 | 0.0189 | 0.0178 | 0.0283 | 0.0416 | 0.0360 | |
| Label-Cond. Rec | Pass@1 | 0.0026 | 0.0026 | 0.0032 | 0.0026 | 0.0027 | 0.0044 | 0.0079 | 0.0064 |
| Pass@32 | 0.0380 | 0.0372 | 0.0393 | 0.0381 | 0.0383 | 0.0337 | 0.0420 | 0.0431 | |
| Recall@32 | 0.0140 | 0.0135 | 0.0143 | 0.0137 | 0.0139 | 0.0123 | 0.0170 | 0.0184 | |
| Label Pred. | AUC | 0.6244 | 0.6598 | 0.6640 | 0.6204 | 0.6581 | 0.6675 | 0.6139 | 0.6184 |
| Interactive Rec | Pass@1 | - | - | - | - | - | - | 0.0890 | 0.0660 |
| Pass@32 | - | - | - | - | - | - | 0.3730 | 0.3170 | |
| Recall@32 | - | - | - | - | - | - | 0.2394 | 0.1941 | |
| Item Understand. | LLM-Judge Score | - | - | - | - | - | - | 0.2517 | 0.3175 |
| Rec. Explanation | LLM-Judge Score | - | - | - | - | - | - | 3.9350 | 3.3540 |
- State-of-the-Art Recommendation Performance: OneRec-Foundation consistently outperforms all baselines (both discriminative and generative) across the vast majority of tasks in RecIF-Bench. This is evident from the higher Pass@K, Recall@K, and AUC scores of the OneRec-Foundation variants compared to LC-Rec-8B and the other baselines.
- Scaling Effects:
  - Data Scaling: OneRec-Pro models (trained on a larger industrial corpus) consistently surpass OneRec models of the same parameter size (e.g., OneRec-8B-Pro generally outperforms OneRec-8B). This validates the importance of data scale in enhancing recommendation capabilities.
  - Model Scaling: Larger models (8B parameters) generally outperform smaller models (1.7B parameters) across all variants (e.g., OneRec-8B > OneRec-1.7B). This confirms predictable capability scaling with increasing model size, analogous to findings in general LLMs.
6.1.2. Trade-off on General Capabilities
While excelling in recommendation, the models exhibit a trade-off in general capabilities.
The following are the performance comparison on general capability (Thinking) from Table 5 of the original paper:
| Category | Task | Qwen3-1.7B | OneRec-1.7B | OneRec-1.7B-Pro | Qwen3-8B | OneRec-8B | OneRec-8B-Pro |
| Math & Text Reasoning | MATH-500 | 0.8780 | 0.8840 | 0.8840 | 0.9520 | 0.9460 | 0.9380 |
| GSM8K | 0.9121 | 0.8984 | 0.8999 | 0.9568 | 0.9575 | 0.9575 | |
| AIME'24 | 0.4938 | 0.4104 | 0.4146 | 0.7917 | 0.7250 | 0.7188 | |
| General Tasks | MMLU-Pro | 0.5422 | 0.3548 | 0.3932 | 0.7235 | 0.5342 | 0.5204 |
| GPQA-Diamond | 0.3788 | 0.3232 | 0.3333 | 0.5606 | 0.5000 | 0.5051 | |
| Alignment Tasks | IFEVAL (strict prompt) | 0.6969 | 0.5471 | 0.5416 | 0.8577 | 0.7893 | 0.7634 |
| Coding | LiveCodeBench v5 | 0.3907 | 0.2832 | 0.2832 | 0.5484 | 0.4910 | 0.4667 |
The following are the performance comparison on general capability (Non-Thinking) from Table 6 of the original paper:
| Category | Task | Qwen3-1.7B | OneRec-1.7B | OneRec-1.7B-Pro | Qwen3-8B | OneRec-8B | OneRec-8B-Pro |
| Math & Text Reasoning | MATH-500 | 0.6980 | 0.7060 | 0.6940 | 0.8380 | 0.8240 | 0.7980 |
| GSM8K | 0.8218 | 0.8036 | 0.8158 | 0.9303 | 0.9310 | 0.9196 | |
| AIME'24 | 0.1313 | 0.1271 | 0.1250 | 0.2729 | 0.2417 | 0.2271 | |
| General Tasks | MMLU-Pro | 0.4384 | 0.3072 | 0.2804 | 0.6632 | 0.5795 | 0.4521 |
| GPQA-Diamond | 0.3030 | 0.3131 | 0.2778 | 0.3990 | 0.4040 | 0.3939 | |
| Alignment Tasks | IFEVAL (strict prompt) | 0.6747 | 0.4769 | 0.5250 | 0.8392 | 0.7357 | 0.7098 |
| Coding | LiveCodeBench v5 | 0.1219 | 0.1219 | 0.1147 | 0.2760 | 0.2401 | 0.2401 |
- Retention of General Capabilities: The models retain most of the general capabilities of the Qwen3 backbone. Performance on mathematical benchmarks (MATH-500, GSM8K) shows minimal degradation, and in some cases (e.g., GSM8K for OneRec-8B) even slight improvements over the base Qwen3-8B in Thinking mode.
- Performance Trade-off: A noticeable trade-off exists, particularly on broader general-knowledge (MMLU-Pro, GPQA-Diamond) and coding (LiveCodeBench v5) tasks, where OneRec-Foundation variants generally score lower than their base Qwen3 counterparts. This suggests that while the distillation process effectively preserves reasoning proficiency, the limited diversity or quality of the general data used during post-training may constrain the model's broader general capabilities, indicating a need for more refined data strategies to achieve a better balance.
6.1.3. Transfer Learning on Amazon Benchmark
OneRec-Foundation demonstrates exceptional transferability and cross-domain generalization on the Amazon benchmark.
The following are the Cross-Domain generalization performance on Amazon domains from Table 7 of the original paper:
| Model | Metric | Baby | Beauty | Cell | Grocery | Health | Home | Pet | Sports | Tools | Toys |
| SASRec | R@5 | 0.0232 | 0.0393 | 0.0482 | 0.0480 | 0.0295 | 0.0133 | 0.0377 | 0.0240 | 0.0269 | 0.0420 |
| R@10 | 0.0381 | 0.0639 | 0.0782 | 0.0789 | 0.0506 | 0.0212 | 0.0607 | 0.0389 | 0.0437 | 0.0658 | |
| N@5 | 0.0137 | 0.0209 | 0.0281 | 0.0262 | 0.0173 | 0.0070 | 0.0222 | 0.0130 | 0.0149 | 0.0217 | |
| N@10 | 0.0185 | 0.0289 | 0.0378 | 0.0361 | 0.0242 | 0.0098 | 0.0296 | 0.0178 | 0.0203 | 0.0294 | |
| BERT4Rec | R@5 | 0.0117 | 0.0219 | 0.0325 | 0.0307 | 0.0204 | 0.0063 | 0.0218 | 0.0151 | 0.0145 | 0.0200 |
| R@10 | 0.0228 | 0.0419 | 0.0569 | 0.0534 | 0.0353 | 0.0113 | 0.0412 | 0.0261 | 0.0264 | 0.0362 | |
| N@5 | 0.0065 | 0.0120 | 0.0190 | 0.0174 | 0.0117 | 0.0038 | 0.0123 | 0.0083 | 0.0083 | 0.0102 | |
| N@10 | 0.0101 | 0.0185 | 0.0268 | 0.0247 | 0.0165 | 0.0054 | 0.0186 | 0.0119 | 0.0121 | 0.0154 | |
| GRU4Rec | R@5 | 0.0202 | 0.0322 | 0.0430 | 0.0362 | 0.0256 | 0.0090 | 0.0264 | 0.0174 | 0.0176 | 0.0266 |
| R@10 | 0.0346 | 0.0539 | 0.0676 | 0.0591 | 0.0423 | 0.0156 | 0.0449 | 0.0278 | 0.0305 | 0.0453 | |
| N@5 | 0.0124 | 0.0201 | 0.0275 | 0.0230 | 0.0164 | 0.0058 | 0.0163 | 0.0110 | 0.0116 | 0.0171 | |
| N@10 | 0.0170 | 0.0271 | 0.0355 | 0.0303 | 0.0217 | 0.0079 | 0.0222 | 0.0144 | 0.0158 | 0.0231 | |
| HSTU | R@5 | 0.0226 | 0.0456 | 0.0475 | 0.0458 | 0.0330 | 0.0134 | 0.0362 | 0.0227 | 0.0231 | 0.0489 |
| R@10 | 0.0350 | 0.0643 | 0.0725 | 0.0712 | 0.0485 | 0.0197 | 0.0521 | 0.0347 | 0.0337 | 0.0649 | |
| N@5 | 0.0156 | 0.0308 | 0.0314 | 0.0297 | 0.0215 | 0.0092 | 0.0239 | 0.0151 | 0.0159 | 0.0339 | |
| N@10 | 0.0196 | 0.0368 | 0.0395 | 0.0378 | 0.0265 | 0.0112 | 0.0290 | 0.0190 | 0.0193 | 0.0391 | |
| ReaRec | R@5 | 0.0197 | 0.0488 | 0.0444 | 0.0454 | 0.0326 | 0.0150 | 0.0299 | 0.0231 | 0.0219 | 0.0517 |
| R@10 | 0.0320 | 0.0702 | 0.0711 | 0.0730 | 0.0481 | 0.0210 | 0.0486 | 0.0348 | 0.0310 | 0.0706 | |
| N@5 | 0.0123 | 0.0341 | 0.0269 | 0.0289 | 0.0213 | 0.0101 | 0.0189 | 0.0152 | 0.0143 | 0.0369 | |
| N@10 | 0.0163 | 0.0409 | 0.0355 | 0.0378 | 0.0263 | 0.0121 | 0.0249 | 0.0189 | 0.0173 | 0.0430 | |
| TIGER | R@5 | 0.0191 | 0.0413 | 0.0540 | 0.0447 | 0.0328 | 0.0142 | 0.0343 | 0.0216 | 0.0228 | 0.0367 |
| R@10 | 0.0318 | 0.0628 | 0.0786 | 0.0691 | 0.0534 | 0.0216 | 0.0542 | 0.0331 | 0.0344 | 0.0527 | |
| N@5 | 0.0125 | 0.0277 | 0.0350 | 0.0295 | 0.0222 | 0.0094 | 0.0232 | 0.0145 | 0.0148 | 0.0255 | |
| N@10 | 0.0162 | 0.0346 | 0.0429 | 0.0373 | 0.0289 | 0.0118 | 0.0295 | 0.0182 | 0.0184 | 0.0307 | |
| LC-Rec | R@5 | 0.0232 | 0.0495 | 0.0585 | 0.0501 | 0.0412 | 0.0199 | 0.0388 | 0.0269 | 0.0288 | 0.0350 |
| R@10 | 0.0344 | 0.0764 | 0.0883 | 0.0790 | 0.0616 | 0.0293 | 0.0612 | 0.0418 | 0.0438 | 0.0549 | |
| N@5 | 0.0151 | 0.0338 | 0.0392 | 0.0328 | 0.0272 | 0.0138 | 0.0247 | 0.0177 | 0.0187 | 0.0221 | |
| N@10 | 0.0187 | 0.0424 | 0.0488 | 0.0421 | 0.0338 | 0.0168 | 0.0320 | 0.0225 | 0.0235 | 0.0285 | |
| Ours | R@5 | 0.0352 | 0.0646 | 0.0717 | 0.0688 | 0.0534 | 0.0279 | 0.0563 | 0.0365 | 0.0412 | 0.0693 |
| R@10 | 0.0513 | 0.0924 | 0.1036 | 0.1029 | 0.0768 | 0.0390 | 0.0834 | 0.0547 | 0.0593 | 0.0953 | |
| N@5 | 0.0238 | 0.0456 | 0.0490 | 0.0460 | 0.0376 | 0.0202 | 0.0389 | 0.0252 | 0.0295 | 0.0496 | |
| N@10 | 0.0289 | 0.0545 | 0.0593 | 0.0570 | 0.0452 | 0.0237 | 0.0476 | 0.0310 | 0.0354 | 0.0579 | |
| Improve (%) | R@10 | 34.6↑ | 20.9↑ | 17.3↑ | 30.3↑ | 24.7↑ | 33.1↑ | 36.3↑ | 30.9↑ | 35.4↑ | 35.0↑ |
- SOTA Results Across Domains: OneRec-Foundation (Ours) achieves new state-of-the-art results across all 10 Amazon datasets.
- Significant Improvement: The model secures an average improvement of 26.8% in Recall@10 over the second-best baseline on each domain. This empirically confirms that large-scale generative pre-training endows the model with robust transfer capabilities far exceeding traditional collaborative filtering approaches.
6.1.3.1. Adaptive Strategies for Pre-trained Model Utilization
To address the distributional shift of item identifiers in transfer learning (where the pre-trained tokenizer might not granularly distinguish items in specific target domains), three strategies were explored.
The following are the Performance Comparison of Adaptive Strategies for Pre-trained Model Utilization from Table 8 of the original paper:
| Strategy | Metric | Baby | Beauty | Cell | Grocery | Health | Home | Pet | Sports | Tools | Toys |
| Extended Residual Quantization | R@5 | 0.0288 | 0.0534 | 0.0574 | 0.0562 | 0.0479 | 0.0227 | 0.0518 | 0.0315 | 0.0350 | 0.0511 |
| R@10 | 0.0407 | 0.0799 | 0.0830 | 0.0861 | 0.0673 | 0.0313 | 0.0758 | 0.0447 | 0.0495 | 0.0701 | |
| N@5 | 0.0201 | 0.0364 | 0.0389 | 0.0383 | 0.0333 | 0.0162 | 0.0356 | 0.0215 | 0.0243 | 0.0360 | |
| N@10 | 0.0239 | 0.0449 | 0.0471 | 0.0480 | 0.0396 | 0.0190 | 0.0433 | 0.0258 | 0.0289 | 0.0421 | |
| Text-Only Adaptation | R@5 | 0.0317 | 0.0630 | 0.0688 | 0.0687 | 0.0529 | 0.0285 | 0.0548 | 0.0368 | 0.0414 | 0.0668 |
| R@10 | 0.0448 | 0.0883 | 0.0985 | 0.1048 | 0.0752 | 0.0398 | 0.0850 | 0.0548 | 0.0615 | 0.0931 | |
| N@5 | 0.0227 | 0.0445 | 0.0473 | 0.0460 | 0.0368 | 0.0199 | 0.0382 | 0.0256 | 0.0288 | 0.0483 | |
| N@10 | 0.0269 | 0.0526 | 0.0569 | 0.0576 | 0.0440 | 0.0235 | 0.0478 | 0.0314 | 0.0354 | 0.0568 | |
| Text-Augmented Itemic Tokens | R@5 | 0.0352 | 0.0646 | 0.0717 | 0.0688 | 0.0534 | 0.0285 | 0.0563 | 0.0368 | 0.0414 | 0.0693 |
| R@10 | 0.0513 | 0.0924 | 0.1036 | 0.1029 | 0.0768 | 0.0398 | 0.0834 | 0.0547 | 0.0593 | 0.0953 | |
| N@5 | 0.0238 | 0.0456 | 0.0490 | 0.0460 | 0.0376 | 0.0202 | 0.0389 | 0.0256 | 0.0295 | 0.0496 | |
| N@10 | 0.0289 | 0.0545 | 0.0593 | 0.0576 | 0.0452 | 0.0237 | 0.0478 | 0.0314 | 0.0354 | 0.0579 |
- Strategy 1: Extended Residual Quantization: Extends the hierarchical depth of the itemic tokens using Finite Scalar Quantization (FSQ), reducing the collision rate to 3.05%. It achieves a 10.0% improvement in average R@10 over LC-Rec; however, the non-pre-trained fourth layer disrupts the original hierarchical semantics.
- Strategy 2: Text-Only Adaptation: Bypasses itemic tokens entirely, representing items via 5 keywords from metadata (collision rate 4.27%). It achieves an 18.8% improvement in average R@10 over Extended Residual Quantization, since the model's linguistic core remains intact and natural language representations are more expressive in narrow domains, but it sacrifices collaborative filtering signals.
- Strategy 3: Text-Augmented Itemic Tokens: Concatenates the original three-layer pre-trained itemic tokens with the keyword representations, preserving the original itemic token structure and its hierarchical semantics. The keywords provide semantic disambiguation (collision rate 0.47%) and enable full utilization of the model's linguistic capabilities. This strategy achieves state-of-the-art performance across nearly all datasets, validating that effective transfer learning requires maximizing utilization of the foundation model's diverse capabilities while preserving pre-trained structural integrity. A rough illustration of this item representation and of measuring collision rate is sketched below.
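The following is a rough illustration, not the released pipeline, of how Strategy 3 might represent an item and how a collision rate can be measured; the token format and helper names are hypothetical.

```python
from collections import Counter


def text_augmented_repr(itemic_tokens, keywords):
    """Concatenate pre-trained itemic tokens with metadata keywords (illustrative format)."""
    token_part = "".join(f"<item_{t}>" for t in itemic_tokens)  # e.g. three-layer codes
    keyword_part = " ".join(keywords[:5])                       # e.g. up to 5 metadata keywords
    return f"{token_part} {keyword_part}"


def collision_rate(representations):
    """Fraction of items whose representation is shared with at least one other item."""
    counts = Counter(representations)
    collided = sum(c for c in counts.values() if c > 1)
    return collided / len(representations) if representations else 0.0


# Example: two items with identical itemic codes are disambiguated by their keywords.
reps = [
    text_augmented_repr([12, 7, 301], ["wireless", "earbuds", "bluetooth"]),
    text_augmented_repr([12, 7, 301], ["charging", "case", "replacement"]),
]
print(collision_rate(reps))  # 0.0 once keywords are appended
```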
6.1.3.2. Domain-Specific Training vs. Multi-Domain Joint Training
The paper investigates the impact of training strategies on transfer performance.
The following are the Impact of Training Strategies (Domain-Specific vs Multi-Domain Joint) and Few-Shot Learning on Transfer Performance from Figure 11 of the original paper:
The image is a chart showing the impact of different training strategies (few-shot, domain-specific, and multi-domain joint) on transfer performance. It compares OneRec-Foundation and TIGER on Recall@10 across four Amazon domains, showing how our method performs under the different settings. The green dashed line represents the performance difference (in Recall@10) between our method and TIGER.
Figure 8 | Impact of Training Strategies (Domain-Specific vs Multi-Domain Joint) and Few-Shot Learning on Transfer Performance. We compare OneRec-Foundation (Ours) against TIGER across four Amazon domains under three settings: (1) Few-shot learning with training data, (2) Full-data training with domain-specific strategy, and (3) Full-data training with joint multi-domain strategy. The green dashed line represents the performance gain (Recall@10 difference) of Ours over TIGER.
- Divergent Performance: TIGER (a traditional generative recommender) shows a consistent performance decline (average 10.6% drop in Recall@10) under joint training across multiple domains. In contrast, OneRec-Foundation demonstrates an average 2.3% improvement.
- Foundation Model Advantage: This divergence highlights that OneRec-Foundation, with its pre-trained recommendation knowledge and semantic understanding, extracts generalizable patterns rather than memorizing domain-specific statistics. Multi-domain joint training further enriches the model by exposing it to diverse interaction patterns, enabling effective cross-domain knowledge transfer, while its large parameter capacity allows encoding domain-specific nuances alongside shared high-level patterns.
6.1.3.3. Few-Shot Learning: Amplified Transfer Advantage
- Significant Gap Widening: The transfer learning advantage of foundation models is amplified under data scarcity. While OneRec-Foundation surpasses TIGER by an average of 77.7% in Recall@10 with full training data, this gap widens dramatically to 219.7% in the 10% few-shot regime.
- Resilience under Data Constraints: OneRec-Foundation preserves 45.2% of its full-data performance with only 10% of the data, whereas TIGER retains only 23.0%. This striking resilience validates that large-scale pre-training confers robust, transferable representations that enable effective domain adaptation even under severe data constraints.
6.2. Ablation Study
6.2.1. Ablation Study on Pre-training Strategies
An ablation study quantifies the contribution of the itemic-text alignment (Stage 1) in the pre-training pipeline. The full model is compared against a variant w/o Align that bypasses this initial phase. Both variants undergo Multi-task SFT before evaluation.
The following are the Ablation study on pre-training strategies from Table 9 of the original paper:
| Task | Metric | 0.6B (Ours) | 0.6B (w/o Align) | 1.7B (Ours) | 1.7B (w/o Align) | 8B (Ours) | 8B (w/o Align) |
| Short Video Rec | Pass@32 | 0.1401 | 0.1397 | 0.1636 | 0.1605 | 0.2034 | 0.1933 |
| Recall@32 | 0.0210 | 0.0210 | 0.0254 | 0.0251 | 0.0334 | 0.0310 | |
| Ad Rec | Pass@32 | 0.1740 | 0.1680 | 0.1961 | 0.1922 | 0.2350 | 0.2401 |
| Recall@32 | 0.0586 | 0.0569 | 0.0673 | 0.0669 | 0.0821 | 0.0841 | |
| Product Rec | Pass@32 | 0.1139 | 0.1064 | 0.1512 | 0.1395 | 0.1893 | 0.1911 |
| Recall@32 | 0.0257 | 0.0243 | 0.0343 | 0.0312 | 0.0447 | 0.0442 | |
| Label-Cond. Rec | Pass@32 | 0.0350 | 0.0343 | 0.0426 | 0.0401 | 0.0537 | 0.0537 |
| Recall@32 | 0.0146 | 0.0145 | 0.0181 | 0.0171 | 0.0227 | 0.0230 | |
| Interactive Rec | Pass@32 | 0.2460 | 0.2360 | 0.3110 | 0.3050 | 0.4650 | 0.4490 |
| Recall@32 | 0.1402 | 0.1357 | 0.1908 | 0.1770 | 0.3039 | 0.2910 | |
| Label Pred. | AUC | 0.6488 | 0.5807 | 0.6392 | 0.5796 | 0.6879 | 0.6285 |
| Item Understanding | LLM-Judge Score | 0.3174 | 0.3112 | 0.3170 | 0.3181 | 0.3225 | 0.3103 |
| Rec. Explanation | LLM-Judge Score | 2.9960 | 2.8635 | 3.0922 | 3.3160 | 3.9420 | 3.9329 |
- Semantic Bridge: Stage 1 (itemic-text alignment) acts as a fundamental semantic bridge for cold-started itemic token embeddings. Aligning these newly initialized parameters with the pre-trained latent space before full-parameter fine-tuning establishes a robust semantic foundation (a minimal sketch of registering such tokens is given after this list).
- Model Size Dependence: The marginal gains from Stage 1 scale inversely with model size, making it particularly necessary for smaller models (0.6B, 1.7B). Larger backbones possess greater inherent generalization capability, reducing the relative impact of this initial alignment stage.
- Domain-Specific Precision: The stage nonetheless remains essential for domain-specific precision, especially for Label Prediction (a significant AUC drop without alignment) and Interactive Recommendation. This underscores that explicit alignment is a prerequisite for optimizing recommendation performance across all model scales.
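For intuition, here is a minimal sketch, assuming a Hugging Face-style interface, of how cold-started itemic tokens could be registered in an LLM vocabulary before an alignment stage. The backbone name, token format, and codebook size are assumptions for illustration, not the released code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # assumed backbone; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical three-layer itemic vocabulary: one special token per code at each layer.
itemic_tokens = [f"<item_l{layer}_{code}>" for layer in range(3) for code in range(8192)]
num_added = tokenizer.add_tokens(itemic_tokens, special_tokens=True)

# Newly added rows of the embedding matrix are randomly initialized ("cold-started");
# an alignment stage would train these parameters on itemic-text alignment data
# before full-parameter fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} itemic tokens; new vocabulary size: {len(tokenizer)}")
```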
6.2.2. Evolution of Model Capabilities Across Post-training Stages
The performance of the model is analyzed after each key post-training phase (Multi-task SFT, On-policy Distillation, Reinforcement Learning).
The following are the Performance comparison on general capability across post-training stages (Thinking) from Table 10 of the original paper:
| Model | math_500 | gsm8k | AIME'24 | mmlu_pro | gpqa_diamond | IFEVAL | LiveCodeBench |
| Qwen3-8B (Base) | 0.952 | 0.9568 | 0.7917 | 0.7235 | 0.5606 | 0.8577 | 0.5484 |
| Stage 1: Multi-task SFT | 0.936 | 0.9083 | 0.5104 | 0.5307 | 0.4949 | 0.6174 | 0.4516 |
| Stage 2: On-Policy Distillation | 0.948 | 0.9538 | 0.7125 | 0.5454 | 0.5 | 0.7653 | 0.4659 |
| Stage 3: Reinforcement Learning | 0.938 | 0.9575 | 0.7188 | 0.5204 | 0.5051 | 0.7634 | 0.4667 |
The following are the Performance comparison on general capability across post-training stages (Non-Thinking) from Table 11 of the original paper:
| Model | math_500 | gsm8k | AIME'24 | mmlu_pro | gpqa_diamond | IFEVAL | LiveCodeBench |
| Qwen3-8B (Base) | 0.838 | 0.9303 | 0.2729 | 0.6632 | 0.399 | 0.8392 | 0.276 |
| Stage 1: Multi-task SFT | 0.876 | 0.906 | 0.0688 | 0.4909 | 0.3384 | 0.5638 | 0.1756 |
| Stage 2: On-Policy Distillation | 0.848 | 0.9234 | 0.2521 | 0.583 | 0.4091 | 0.7689 | 0.2545 |
| Stage 3: Reinforcement Learning | 0.798 | 0.9196 | 0.2271 | 0.4521 | 0.3939 | 0.7098 | 0.2401 |
The following are the Recommendation benchmark performance across post-training stages from Table 12 of the original paper:
| Model | Video Rec | Ad Rec | Product Rec | Label Cond. | Interactive | Label Pred. | Item Understanding | Reco Reason |
| Stage 1: Multi-task SFT | 0.0324 | 0.0925 | 0.0532 | 0.0229 | 0.3461 | 0.6979 | 0.3274 | 3.8795 |
| Stage 2: On-Policy Distillation | 0.0304 | 0.0596 | 0.0330 | 0.0200 | 0.2419 | 0.6944 | 0.3319 | 3.9479 |
| Stage 3: Reinforcement Learning | 0.0370 | 0.0967 | 0.0536 | 0.0236 | 0.3458 | 0.6908 | 0.3209 | 4.0381 |
- Impact of On-policy Distillation on General Capabilities:
  - Restoration of General Capabilities: Comparing Stage 1 (Multi-task SFT) with Stage 2 (On-policy Distillation) in Tables 10 and 11 shows that on-policy distillation significantly restores general capabilities, realigning the model with the Qwen3 baseline on most general benchmarks, particularly in Math & Text Reasoning and Alignment Tasks. For example, AIME'24 (Thinking) jumps from 0.5104 (SFT) to 0.7125 (Distillation), close to Qwen3-8B's 0.7917.
  - Persistent Gap: Despite this restoration, a performance gap relative to the original Qwen3 base model persists on several metrics (MMLU-Pro, GPQA-Diamond, LiveCodeBench). This is attributed to the current data composition and quality during the distillation phase, suggesting potential for further improvement with refined data strategies.
  - Mitigation of Instruction Drift: Multi-task SFT occasionally led to instruction drift, where the model disregarded /no_think tags and generated Chain-of-Thought (CoT) reasoning in "Non-Thinking" mode, leading to inflated scores. On-policy distillation effectively mitigates this, restoring the model's ability to faithfully switch between distinct reasoning modes.
- Advancements through RL for Recommendation:
  - Targeted Improvements: Stage 3 (Reinforcement Learning) demonstrates targeted improvements on core recommendation tasks (Table 12). The RL-trained model achieves consistent gains on Video Rec, Ad Rec, Product Rec, Label Cond., and Interactive Rec compared to Stage 2.
  - Optimization for Ranking Accuracy: These improvements stem from the rule-based Hit reward in Rec-RL, which directly optimizes for ranking accuracy by encouraging the model to assign higher probability mass to the target itemic tokens. A minimal sketch of such a reward is given below.
  - Transfer to Explanation Generation: Interestingly, the Reco Reason task (Layer 3) also benefits from RL training, showing an increase in LLM-Judge Score (from 3.9479 to 4.0381). This suggests that the refined "recommendation intuition" acquired through RL transfers to explanation generation, producing more coherent and relevant reasoning.
6.3. Data Presentation (Figure 1 from original paper)
The following are the holistic performance overview of OpenOneRec-Foundation from Figure 1 of the original paper:
The image is a chart showing capability evaluation across multiple RecIF-Bench tasks (left) and recall results on the Amazon benchmark (right). The left side shows performance on the different tasks, with the red and yellow lines representing OneRec-Pro-8B and the best baseline respectively; the right bar chart compares the models' Recall@10 results across the different categories.
Figure 1 | Holistic Performance Overview. Left: Evaluation on RecIF-Bench and general LLM benchmarks. "Best Baseline" denotes the highest performance achieved by existing methods for each specific task. Right: Amazon Benchmark results. Our model demonstrates exceptional cross-domain transferability, consistently surpassing leading baselines across 10 diverse datasets.
The left chart of Figure 1 visually confirms the SOTA performance of OneRec-Pro-8B across RecIF-Bench tasks, often significantly outperforming the "Best Baseline." The general LLM benchmarks section on the left also illustrates the trade-off, where OneRec-Pro-8B retains strong performance in areas like MATH-500 but shows some decrease compared to the base Qwen3-8B in broader General Tasks. The right chart (Amazon Benchmark) clearly shows OneRec-Pro-8B consistently outperforming "Best Baseline" in Recall@10 across all 10 Amazon datasets, visually reinforcing its cross-domain transferability.
7. Conclusion & Reflections
7.1. Conclusion Summary
In this work, the authors successfully presented OpenOneRec, a comprehensive framework aimed at bridging the critical gap between traditional recommendation systems and Large Language Models (LLMs). Their contributions are multi-faceted:
-
RecIF-Bench: They introduced
RecIF-Bench, the first holistic benchmark specifically designed to evaluate recommendation instruction-following capabilities. This benchmark covers 8 diverse tasks, ranging from fundamental prediction to complex reasoning, and incorporates crucial dimensions like multi-modal, cross-domain, multi-behavioral interactions, interleaved data, and recommendation explanation.
Open-Sourced Framework & Scaling Laws: To ensure reproducibility and foster scalable research, they open-sourced a full-stack training pipeline, encompassing
data processing,co-pretrainingstrategies (includingItemic-Text Alignment), andhybrid post-trainingprotocols (Multi-task SFT,On-policy Distillation, andRec-RL). Through empirical analysis, they validatedscaling lawsin the recommendation domain, demonstrating adata-intensive scaling regimewhere increasing training data is more critical than merely scaling model parameters. -
OneRec-Foundation Models: Leveraging their framework, they released the
OneRec-Foundationmodel family (1.7B and 8B parameters). Extensive experiments demonstrated that these models achieve newstate-of-the-art (SOTA)performance across allRecIF-Benchtasks. Furthermore, they exhibited exceptionalcross-domain transferabilityto theAmazon benchmark, surpassing strong baselines with an average 26.8% improvement inRecall@10.This work signifies a crucial step towards building truly intelligent recommender systems that can understand, reason, and adapt to diverse user instructions, much like
general LLMs.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline important future research directions:
- Tokenizer Transferability: While
OpenOneRec enhances downstream performance, the magnitude of gains is currently limited by tokenizer transferability. The pre-trained tokenizer, optimized on broad open-domain data, may not optimally distinguish items in a specific vertical, leading to collision rates. - Future Direction: Maximizing the reuse of
foundation model priors while ensuring high-quality item indexing (code quality) for downstream tasks is a promising avenue. This could involve more adaptive or domain-specific itemic tokenization strategies during transfer.
- Balancing General Intelligence and Domain Precision: Maintaining the model's
general intelligence and reasoning capabilities necessitates mixing vast amounts of general-domain text during training. This process currently presents a trade-off. - Future Direction: Investigating optimal data mixing ratios and improving data utilization efficiency are urgent challenges to better balance domain-specific precision with
general capabilities, possibly through more advanced curriculum learning or data weighting techniques.
- Chain-of-Thought Reasoning Consistency:
Chain-of-Thought (CoT) reasoning currently yields improvements only in limited settings. - Future Direction: A more rigorous exploration of
test-time scaling strategies is needed to unlock consistent reasoning gains across diverse recommendation scenarios. This could involve exploring novel prompting techniques, self-consistency methods, or integrating external knowledge graphs more effectively.
7.3. Personal Insights & Critique
The OpenOneRec paper makes a highly significant contribution by systematically addressing the integration of LLMs into recommendation systems. The authors' commitment to open-sourcing their benchmark and training pipeline is commendable and truly accelerates research in this emerging field.
Personal Insights:
- Paradigm Shift Validation: This paper strongly validates the paradigm shift towards
generative recommendation and the potential of foundation models to bring general intelligence to this domain. The improvements on RecIF-Bench and the Amazon benchmark are substantial, particularly the few-shot learning results, which highlight the power of transferable representations under data scarcity. This suggests a future where recommender systems are not just prediction engines but intelligent conversational agents. - Bridging Modalities: The
itemic tokenization strategy is a crucial innovation. Effectively bridging the discrete item space with the LLM's linguistic token space is fundamental to this integration. The detailed exploration of adaptive strategies for tokenizer transferability (e.g., Text-Augmented Itemic Tokens) is particularly insightful, providing concrete guidance for future work. - Balanced Training Regimen: The meticulously designed
hybrid post-training pipeline (SFT, On-policy Distillation, Rec-RL) is a key strength. It demonstrates a sophisticated understanding of how to mitigate catastrophic forgetting and simultaneously optimize for diverse objectives (general intelligence vs. domain-specific metrics). This multi-stage approach is likely to become a standard for adapting LLMs to specialized domains. - Scaling Laws in a New Domain: The empirical validation of
scaling laws specifically for recommendation data is a valuable theoretical contribution. The finding of a data-intensive scaling regime (where data scaling is more impactful than parameter scaling) provides critical guidance for resource allocation in developing future generative recommenders. This challenges the direct applicability of general LLM scaling laws and underscores the unique characteristics of recommendation data.
Critique & Areas for Improvement:
- General Intelligence Trade-off: While the paper successfully mitigates
catastrophic forgetting, the persistent performance gap on some general intelligence benchmarks (e.g., MMLU-Pro, LiveCodeBench) compared to the base Qwen3 suggests that the current data mixing and distillation strategies could be further refined. Future work could explore more dynamic or adaptive data sampling techniques during pre-training and post-training to better balance the two types of knowledge. - Complexity of
LLM-as-Judge: While LLM-as-Judge is a good approach for text generation, its multi-step process (WIP extraction, semantic matching, weighted F1) is complex and potentially sensitive to the choice of the judge LLM itself. More transparency on its robustness and potential biases would be beneficial. - Computational Cost: The scale of training (
hundred-billion-token corpus, 8B parameter models, multi-stage training) is substantial. While scaling laws are explored, practical deployment considerations (inference cost, latency) for such large generative recommenders in real-world industrial settings remain a significant challenge. The paper could delve more into the efficiency of these models at inference time. - Interpretability of
Itemic Tokens: While itemic tokens are explained as maintaining hierarchical semantics, their direct interpretability for humans or for debugging purposes might be limited compared to natural language descriptions. Exploring ways to make itemic tokens more human-understandable could enhance trust and utility.
Transferability to Other Domains:
The methodology presented in OpenOneRec has high transferability to other specialized domains where LLMs are being adapted. The core principles—bridging modality gaps (e.g., sensor data, medical images, financial time series to LLM-understandable tokens), scalable co-pretraining with general-domain knowledge, and hybrid post-training with on-policy distillation for general skills and RL for domain-specific metrics—could be applied to:
-
Medical AI: Integrating
LLMs with structured medical data (e.g., patient records, lab results, imaging reports) for diagnosis, treatment planning, or drug discovery.
Financial Services: Combining
LLMswith market data, financial reports, and expert knowledge for quantitative trading, risk management, or personalized financial advice. -
Robotics/Control: Using
LLMsfor high-level planning and instruction in robotic systems, where sensor data and control signals need to be aligned with natural language commands.By open-sourcing their framework,
OpenOneRecprovides a solid foundation, inviting the broader research community to tackle these remaining challenges and further accelerate the evolution towards truly intelligent AI systems across various applications.