
OpenOneRec Technical Report

Published: 12/31/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces RecIF-Bench, a holistic benchmark for generative recommendation, together with a large open dataset of 96 million interactions. It open-sources a comprehensive training pipeline, demonstrating predictable capability scaling while mitigating catastrophic forgetting, with the OneRec-Foundation models (1.7B and 8B) achieving state-of-the-art results across all benchmark tasks.

Abstract

While the OneRec series has successfully unified the fragmented recommendation pipeline into an end-to-end generative framework, a significant gap remains between recommendation systems and general intelligence. Constrained by isolated data, they operate as domain specialists-proficient in pattern matching but lacking world knowledge, reasoning capabilities, and instruction following. This limitation is further compounded by the lack of a holistic benchmark to evaluate such integrated capabilities. To address this, our contributions are: 1) RecIF Bench & Open Data: We propose RecIF-Bench, a holistic benchmark covering 8 diverse tasks that thoroughly evaluate capabilities from fundamental prediction to complex reasoning. Concurrently, we release a massive training dataset comprising 96 million interactions from 160,000 users to facilitate reproducible research. 2) Framework & Scaling: To ensure full reproducibility, we open-source our comprehensive training pipeline, encompassing data processing, co-pretraining, and post-training. Leveraging this framework, we demonstrate that recommendation capabilities can scale predictably while mitigating catastrophic forgetting of general knowledge. 3) OneRec-Foundation: We release OneRec Foundation (1.7B and 8B), a family of models establishing new state-of-the-art (SOTA) results across all tasks in RecIF-Bench. Furthermore, when transferred to the Amazon benchmark, our models surpass the strongest baselines with an average 26.8% improvement in Recall@10 across 10 diverse datasets (Figure 1). This work marks a step towards building truly intelligent recommender systems. Nonetheless, realizing this vision presents significant technical and theoretical challenges, highlighting the need for broader research engagement in this promising direction.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The full title of the paper is "OpenOneRec Technical Report: An Open Foundation Model and Benchmark to Accelerate Generative Recommendation."

1.2. Authors

The paper is authored by the OneRec Team, with a detailed list of contributors provided in Appendix A. The Core Contributors include Guorui Zhou, Honghui Bao, Jiaming Huang, Jiaxin Deng, Jinghao Zhang, Junda She, Kuo Cai, Lejian Ren, Lu Ren, Qiang Luo, Qianqian Wang, Qigen Hu, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Wuchao Li, Xiangyu Wu, Xinchen Luo, Xingmei Wang, Yifei Hu, Yunfan Wu, Zhanyu Liu, Zhiyang Zhang, and Zixing Zhang. Additional Contributors are Bo Chen, Bin Wen, Chaoyi Ma, Chengru Song, Chenglong Chu, Defu Lian, Fan Yang, Feng Jiang, Hongtao Cheng, Huanjie Wang, Kun Gai, Pengfei Zheng, Qiang Wang, Rui Huang, Siyang Mao, Tingting Gao, Wei Yuan, Yan Wang, Yang Zhou, Yi Su, Zexuan Cheng, Zhixin Ling, and Ziming Li. Their affiliations are not explicitly stated in the provided text, but the GitHub and Hugging Face links suggest they are associated with Kuaishou.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (https://arxiv.org/abs/2512.24762). arXiv is a widely recognized open-access repository for preprints of scientific papers in fields like mathematics, physics, computer science, and quantitative biology. While it is not a peer-reviewed journal or conference in itself, it serves as a crucial platform for rapid dissemination of research findings and allows for community feedback before formal publication. Its influence is significant in facilitating timely academic discourse.

1.4. Publication Year

The paper was published on 2025-12-31 at 10:15:53 (UTC).

1.5. Abstract

The abstract states that the OneRec series has successfully integrated the fragmented recommendation pipeline into an end-to-end generative framework. However, a significant gap persists between recommendation systems and general intelligence. Current recommendation systems act as domain specialists, adept at pattern matching but lacking world knowledge, reasoning capabilities, and instruction following due to isolated data. This limitation is exacerbated by the absence of a holistic benchmark. To address this, the authors make three contributions:

  1. RecIF-Bench & Open Data: They propose RecIF-Bench, a holistic benchmark with 8 diverse tasks, evaluating capabilities from fundamental prediction to complex reasoning. They also release a training dataset of 96 million interactions from 160,000 users.

  2. Framework & Scaling: They open-source a comprehensive training pipeline (data processing, co-pretraining, post-training) to ensure reproducibility. They demonstrate predictable scaling of recommendation capabilities while mitigating catastrophic forgetting of general knowledge.

  3. OneRec-Foundation: They release OneRec-Foundation models (1.7B and 8B parameters) that achieve new state-of-the-art (SOTA) results across all RecIF-Bench tasks. When transferred to the Amazon benchmark, these models surpass baselines with an average 26.8% improvement in Recall@10 across 10 datasets.

    The work is presented as a step towards intelligent recommender systems, acknowledging significant technical and theoretical challenges and calling for broader research engagement.

The official source link is https://arxiv.org/abs/2512.24762 and the PDF link is https://arxiv.org/pdf/2512.24762v1.pdf. The paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The paper addresses a critical challenge in the field of recommender systems: the significant gap between current generative recommender systems and general intelligence. While recent advancements, particularly the OneRec series, have successfully unified the traditional multi-stage recommendation pipeline into an end-to-end generative framework, these systems remain constrained by several limitations:

  • Isolated Data Silos: Existing models are typically trained on isolated, domain-specific data, preventing them from leveraging the massive data scaling that drives the emergent capabilities of Large Language Models (LLMs).

  • Lack of General Intelligence: Consequently, these models operate as domain specialists, excelling at collaborative pattern matching but lacking crucial world knowledge, reasoning capabilities, and instruction following abilities that are hallmarks of general intelligence. This limits their adaptability and broader utility.

  • Catastrophic Forgetting: Attempts to integrate LLM capabilities, such as aligning discrete recommendation identifiers with LLM linguistic space, often lead to catastrophic forgetting of the LLM backbone's inherent generalization capabilities, especially when fine-tuned on limited, task-homogeneous data.

  • Inadequate Benchmarking: A major impediment is the absence of a holistic benchmark capable of evaluating the integrated capabilities (prediction, reasoning, instruction-following) essential for these next-generation recommendation foundation models. Traditional benchmarks are confined to narrow, specialized tasks, usually focusing on closed-set ranking accuracy within single domains.

    The core problem is to bridge this semantic and functional gap to build truly intelligent recommender systems that can understand complex user intent, reason about recommendations, and follow diverse instructions, much like LLMs do in general domains.

2.2. Main Contributions / Findings

To address the identified challenges, the paper presents several key contributions:

  • RecIF-Bench: A Holistic Recommendation Instruction-Following Benchmark & Open Data:
    • Contribution: Introduction of RecIF-Bench, a multi-dimensional benchmark featuring 8 diverse tasks spanning 4 capability layers (from semantic alignment to complex reasoning). It covers short-video, e-commerce, and online advertising domains, designed to rigorously assess the multifaceted capabilities of recommender foundation models via instruction following. It is the first benchmark to cover long-sequence interactions, multi-task, multi-modal, cross-domain, multi-behavioral interactions, interleaved data (text and itemic tokens), and recommendation explanation.
    • Finding: It provides a robust testbed for quantifying the synergy between instruction following and recommendation. Additionally, the OneRec-Foundation models are evaluated on 7 widely-recognized general benchmarks to verify retention of broad reasoning and coding skills. A comprehensive training dataset (96 million interactions from 160,000 users) is released to facilitate reproducible research.
  • Open-Source Framework & Validated Scaling Laws:
    • Contribution: Open-sourcing of a complete training pipeline (built on PyTorch and VeRL) encompassing data processing, co-pretraining, and post-training protocols. A novel two-stage alignment strategy is introduced, combining on-policy distillation and recommendation-oriented Reinforcement Learning (Rec-RL).
    • Finding: Empirical validation of scaling laws in the recommendation domain. The study demonstrates predictable capability scaling, with optimal compute allocation requiring data volume to scale more aggressively than model parameters (a data-intensive regime where $N_{\mathrm{opt}} \propto C^{0.44}$ and $D_{\mathrm{opt}} \propto C^{0.56}$). This framework effectively mitigates catastrophic forgetting of general knowledge.
  • OneRec-Foundation Model Family:
    • Contribution: Release of the OneRec-Foundation series, including 1.7B and 8B parameter models, built upon Qwen. This series moves beyond simple fine-tuning, endowing the LLM backbone with intrinsic recommendation capabilities. Standard versions are trained on the open-source dataset, and Pro versions are enhanced with a hundred-billion-token industrial corpus.
    • Finding: The models achieve new state-of-the-art (SOTA) performance across all RecIF-Bench tasks. Furthermore, they demonstrate exceptional cross-domain transferability, surpassing strong baselines on 10 Amazon datasets with an average 26.8% improvement in Recall@10, underscoring their robustness as foundation models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the OpenOneRec paper, a beginner should be familiar with several fundamental concepts in machine learning, particularly in the domains of natural language processing and recommender systems.

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are typically based on the Transformer architecture and are known for their ability to learn complex patterns and perform a wide range of tasks, from translation to creative writing, through next-token prediction. Their "emergent capabilities" refer to advanced behaviors (like complex reasoning) that appear only when scaled to very large sizes and data volumes.
  • Generative Models: In machine learning, a generative model learns the distribution of data points in a dataset and can then generate new data samples that are similar to the training data. For example, a generative LLM can generate novel sentences or paragraphs. In recommendation, generative recommenders predict the next item by generating its identifier or description, rather than just ranking existing items.
  • Recommender Systems: These systems aim to predict users' preferences for items (e.g., movies, products, videos) and suggest items they are likely to enjoy.
    • Collaborative Filtering: A common technique in recommender systems that predicts user preferences based on the preferences of similar users or the characteristics of similar items. It essentially finds patterns in user-item interaction data.
    • Sequential Recommendation: A sub-field of recommendation focusing on modeling the temporal dynamics of user behavior. It predicts the next item a user will interact with based on their sequence of past interactions.
  • Transformer Architecture: The dominant neural network architecture for LLMs. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element. This allows it to capture long-range dependencies in data, unlike recurrent neural networks.
  • Autoregressive Modeling: A type of statistical model where the output at a given time step is dependent on previous outputs. In LLMs, this means generating text one token at a time, with each new token being predicted based on all the preceding tokens in the sequence. The paper formulates recommendation as a Next-Token Prediction problem within this paradigm.
  • Instruction Following: The ability of an LLM to understand and execute commands or requests given in natural language. Instruction tuning is a training method where models are fine-tuned on a diverse set of tasks presented as natural language instructions, enabling them to generalize to new, unseen instructions.
  • Catastrophic Forgetting: A phenomenon in neural networks where training a model on a new task causes it to forget previously learned information or skills from older tasks. This is a significant challenge when adapting general LLMs to specific domains like recommendation, as fine-tuning might erase their world knowledge.
  • Reinforcement Learning (RL): A paradigm of machine learning where an "agent" learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. It's often used to fine-tune LLMs for better alignment with human preferences or specific task objectives.
  • Itemic Tokens: A novel representation for discrete items (like videos or products) that bridges the gap between item IDs and natural language. Instead of arbitrary IDs, items are represented by sequences of discrete tokens (codes) derived from their semantic embeddings, allowing LLMs to process them as if they were words.

3.2. Previous Works

The paper builds upon and differentiates itself from several lines of prior research:

  • OneRec Series (Deng et al., 2025; Zhou et al., 2025a,b): The OpenOneRec paper is presented as part of the OneRec series, which is noted for successfully unifying the multi-stage recommendation pipeline into an end-to-end generative framework. This means moving from separate stages like retrieval, ranking, and explanation to a single model that can generate recommendations directly.
  • Generative Recommenders (Geng et al., 2022; Wu et al., 2023): This trend in recommendation treats the task as a language processing problem. The paper highlights this shift, noting that LLMs offer a new paradigm.
  • LLM-Recommendation Alignment (Zheng et al., 2024; Liu et al., 2025):
    • LC-Rec (Zheng et al., 2024): A study that aligns discrete recommendation identifiers with the linguistic space of LLMs. The paper cites LC-Rec as a baseline and notes its limitation to a limited set of downstream tasks, potentially causing catastrophic forgetting.
    • OneRec-Think (Liu et al., 2025): Another related work aiming to bridge the semantic gap.
  • Traditional Discriminative Recommender Models: The paper compares OpenOneRec against several established sequential recommendation models:
    • BERT4Rec (Sun et al., 2019): Uses a Transformer-based architecture with bidirectional encoder representations to predict masked items in a user's sequence, capturing rich contextual information.
    • GRU4Rec (Hidasi et al., 2016): Employs Gated Recurrent Units (GRUs) to model sequential user behavior, particularly effective for session-based recommendations.
    • SASRec (Kang and McAuley, 2018): Self-Attentive Sequential Recommendation model that uses the self-attention mechanism to capture long-range dependencies and item transitions in user behavior sequences.
    • HSTU (Zhai et al., 2024): Trillion-parameter sequential transducers for generative recommendations, indicating very large-scale models.
    • ReaRec (Tang et al., 2025): Focuses on unleashing latent reasoning power for sequential recommendation, suggesting a move towards more intelligent recommendation.
  • Generative Recommender Baselines:
    • TIGER (Rajput et al., 2023): A generative retrieval model for recommender systems, serving as a strong baseline for generative recommendation.
  • Item Tokenization Techniques (Luo et al., 2025; Zhou et al., 2025a): The paper adopts Itemic Tokens and mentions RQ-Kmeans (Luo et al., 2025) for discretizing item semantic embeddings into hierarchical discrete codes. This is a crucial step for integrating items into an LLM's token space. Finite Scalar Quantization (FSQ) (Mentzer et al., 2023) is also used for extending token depth in transfer learning.
  • Base LLM Backbone:
    • Qwen (Yang et al., 2025): The OneRec-Foundation models are built on Qwen, a family of open-source LLMs. This provides the foundational linguistic and reasoning capabilities.
  • LLM Scaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022): The paper references these works for its analysis of how model performance scales with model parameters ($N$) and training tokens ($D$). Chinchilla scaling laws (Hoffmann et al., 2022) are specifically mentioned for comparison.
  • Reinforcement Learning Frameworks:
    • Group Relative Policy Optimization (GRPO) (Shao et al., 2024): Used in the Rec-RL stage for fine-tuning. It's noted for computing advantage relative to a group of sampled trajectories, reducing computational overhead compared to Actor-Critic algorithms like PPO.
  • Data Deduplication:
    • MinHash Algorithm (Broder, 1997): Used for efficient fuzzy deduplication to ensure evaluation benchmarks are not leaked into the training data.

3.3. Technological Evolution

The field of recommender systems has seen a significant evolution:

  1. Early Systems (e.g., Matrix Factorization): Focused on discovering latent factors from user-item interaction matrices. These were often discriminative models, predicting ratings or clicks for existing items.

  2. Deep Learning for Sequential Recommendation: With the rise of deep learning, models like GRU4Rec and BERT4Rec started to capture temporal dependencies in user behavior, treating interactions as sequences. This improved the ability to predict the next item.

  3. Multi-modal and Cross-domain Approaches: Researchers began incorporating rich metadata (text, images) and leveraging data from multiple domains to enrich user and item representations.

  4. Generative Recommender Systems: The recent paradigm shift, inspired by LLMs, views recommendation as a generative task. Instead of just predicting scores or ranking, these systems generate item IDs, descriptions, or explanations. The OneRec series represents a significant step in this direction, unifying previously fragmented pipelines.

  5. LLM-based Foundation Models for Recommendation: This paper marks the latest stage, aiming to integrate the general intelligence, reasoning, and instruction-following capabilities of LLMs into recommendation. The goal is to move beyond domain-specific expertise to truly intelligent agents that can engage in complex dialogues about recommendations.

    OpenOneRec fits into this timeline at the cutting edge, striving to establish foundation models for recommendation that can leverage LLM scalability and generalizability, while also providing a holistic benchmark to evaluate these advanced capabilities.

3.4. Differentiation Analysis

Compared to existing methods, OpenOneRec introduces several core innovations and differentiations:

  • Holistic Benchmarking with RecIF-Bench:
    • Differentiation: Unlike existing benchmarks (e.g., PixelRec, KuaiSAR, NineRec, Yelp, Amazon) that partially address multi-modal, multi-task, or cross-domain aspects, RecIF-Bench is the first to simultaneously cover all seven critical dimensions: long-sequence interactions, multi-task, multi-modal, cross-domain, multi-behavioral interactions, interleaved data (text and itemic tokens), and recommendation explanation. This provides a truly comprehensive testbed for foundation models.
    • Innovation: The inclusion of Interleaved Data and Recommendation Explanation is particularly novel, directly assessing the instruction-following and reasoning capabilities central to LLM-based recommenders.
  • Unified Generative Framework with Scalable Pre-training:
    • Differentiation: While OneRec series and TIGER also adopt generative frameworks, OpenOneRec specifically integrates collaborative signals with general semantics through Itemic-Text Alignment and mixed-domain Co-Pretraining. This approach is designed to overcome the data silos and catastrophic forgetting issues prevalent in prior attempts to adapt LLMs for recommendation.
    • Innovation: The two-stage pre-training (Itemic-Text Alignment then Full-Parameter Co-Pretraining) systematically bridges the modality gap and injects recommendation knowledge while preserving general world knowledge from the Qwen3 backbone.
  • Hybrid Post-Training for Capability Balance:
    • Differentiation: Many LLM-based recommenders struggle to balance domain-specific precision with general reasoning abilities. OpenOneRec introduces a novel hybrid post-training pipeline: Multi-task Supervised Fine-tuning (SFT) is followed by alternating On-policy Distillation for General Capability and Reinforcement Learning for Recommendation (Rec-RL).
    • Innovation: On-policy distillation actively recovers general reasoning abilities by using the original LLM as a teacher, while Rec-RL (specifically Group Relative Policy Optimization with a rule-based Hit reward) directly optimizes for discrete ranking metrics in recommendation, effectively balancing both objectives.
  • Validated Scaling Laws in Recommendation:
    • Differentiation: The paper empirically validates scaling laws specifically in the recommendation domain. This extends the understanding from general LLMs to a specialized application.
    • Innovation: The finding of a data-intensive scaling regime ($D_{\mathrm{opt}} \propto C^{0.56}$ vs. $N_{\mathrm{opt}} \propto C^{0.44}$) suggests that, unlike Chinchilla's near-equiproportional split for general text, optimal compute allocation in recommendation requires scaling data volume more aggressively than model parameters.
  • Exceptional Transferability and Few-Shot Learning Performance:
    • Differentiation: OpenOneRec-Foundation models demonstrate superior cross-domain transferability compared to baselines like TIGER, especially in few-shot learning scenarios.
    • Innovation: The Text-Augmented Itemic Tokens strategy, which concatenates original itemic tokens with keyword representations, effectively leverages the foundation model's diverse capabilities (collaborative filtering, knowledge, semantic understanding) while maintaining pre-trained structural integrity. This adaptability is crucial for robust performance in new domains with limited data.

4. Methodology

4.1. Principles

The core idea behind the OpenOneRec framework is to bridge the semantic and functional gap between traditional recommendation systems and general intelligence by treating recommendation as an autoregressive generation problem within a Large Language Model (LLM) framework. The theoretical basis is that LLMs, with their vast world knowledge, reasoning capabilities, and instruction-following prowess, can transcend the limitations of domain-specific recommenders that are constrained by isolated data and lack broader intelligence.

The intuition is that if LLMs can process and generate natural language, they should also be able to process and generate item identifiers (or representations) if these items are properly aligned with the LLM's token space. This involves:

  1. Bridging the Modality Gap: Representing discrete items as itemic tokens that the LLM can understand and generate, akin to natural language tokens.
  2. Unified Generative Formulation: Framing all recommendation tasks, from basic prediction to complex reasoning, as a next-token prediction problem, allowing a single LLM architecture to handle diverse scenarios.
  3. Scalable Pre-training: Integrating collaborative signals and general semantics during pre-training to endow the LLM backbone with intrinsic recommendation capabilities while preserving its general world knowledge.
  4. Hybrid Post-training: Employing a multi-stage post-training process that includes supervised fine-tuning, on-policy distillation, and reinforcement learning to simultaneously enhance instruction-following for recommendation tasks and restore general intelligence capabilities, effectively balancing domain-specific precision with broader reasoning.
  5. Holistic Evaluation: Developing a comprehensive benchmark (RecIF-Bench) to thoroughly assess these integrated capabilities, pushing beyond traditional ranking metrics to include instruction following and explanation generation.

4.2. Core Methodology In-depth (Layer by Layer)

The OpenOneRec framework is structured into three main phases: Pre-Training, Post-Training, and Evaluation, as depicted in Figure 2.

4.2.1. Items as Tokens: Bridging the Modality Gap

A foundational challenge in applying LLMs to recommendation is the inherent mismatch between the continuous semantic space of language and the discrete identifier space of items. Representing items with extensive textual descriptions is inefficient due to long context lengths for user histories and doesn't guarantee the generation of in-corpus items.

To address this, the paper adopts Itemic Tokens (Luo et al., 2025; Zhou et al., 2025a), treating items as a distinct modality. This involves:

  • Hierarchical Quantization: The RQ-Kmeans (Luo et al., 2025) algorithm is used to discretize pre-trained semantic embeddings of item metadata (e.g., text, visual features) into hierarchical discrete codes. This process compresses item semantics into short, fixed-length sequences of tokens.
  • Vocabulary Extension: These itemic tokens are appended to the LLM's original vocabulary, creating a unified vocabulary $\mathcal{V} = \mathcal{V}_{text} \cup \mathcal{V}_{item}$.
  • Benefits:
    • Efficient Long-Context Modeling: Short itemic token sequences allow for processing longer user histories.
    • Preservation of Collaborative Structure: The hierarchical nature ensures items with similar semantics share common prefixes, enabling the model to transfer knowledge based on token proximity, analogous to natural language.
    • In-corpus Generation: Facilitates the generation of actual, known items.
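To make the hierarchical quantization concrete, the following is a minimal sketch of residual quantization in the spirit of RQ-Kmeans: each layer clusters the residual left by the previous layers, so semantically similar items share code prefixes. The function name and the scikit-learn KMeans backend are illustrative assumptions rather than the paper's implementation; the 3-layer / 8192-codes configuration mentioned later in the report is only mirrored in the defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_itemic_codes(item_embeddings, num_layers=3, codebook_size=8192, seed=0):
    """Residually quantize item embeddings into hierarchical discrete codes.

    Layer l clusters the residual left by layers < l, so items with similar
    semantics share code prefixes (illustrative, not the exact RQ-Kmeans code).
    """
    residual = item_embeddings.astype(np.float64).copy()
    codes = []
    for layer in range(num_layers):
        km = KMeans(n_clusters=codebook_size, n_init=1, random_state=seed + layer)
        layer_codes = km.fit_predict(residual)              # (num_items,)
        residual -= km.cluster_centers_[layer_codes]         # subtract assigned centroid
        codes.append(layer_codes)
    return np.stack(codes, axis=1)                            # each row is (c1, c2, c3)

# Tiny demo: 10k random "items" with 64-dim embeddings and a small codebook.
if __name__ == "__main__":
    emb = np.random.default_rng(0).normal(size=(10_000, 64))
    print(build_itemic_codes(emb, num_layers=3, codebook_size=64)[:3])
```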

4.2.2. Recommendation as Autoregressive Modeling

With items represented as tokens, the recommendation task is unified into a standard autoregressive generation problem. A user's interaction history is treated as a long context sequence comprising item tokens, optionally interleaved with text.

Tasks, from prediction (e.g., retrieval) to reasoning (e.g., explanation), are formulated as a Next-Token Prediction problem. Given an instruction $\mathcal{I}$ and a user context $C$ (which includes interaction history and optional queries), the model maximizes the likelihood of the target response $Y$. The objective function is: $\mathcal{L}(\theta) = - \sum_{t=1}^{|Y|} \log P_{\theta}(y_t \mid \mathcal{I}, C, y_{<t})$ Where:

  • $\mathcal{L}(\theta)$: The negative log-likelihood loss, which the model aims to minimize. $\theta$ represents the parameters of the model.

  • $|Y|$: The length of the target response sequence.

  • $P_{\theta}(y_t \mid \mathcal{I}, C, y_{<t})$: The probability of predicting the $t$-th token $y_t$ in the target sequence, conditioned on the task instruction $\mathcal{I}$, the user context $C$, and all previously generated tokens $y_{<t}$ (i.e., tokens $y_1, \ldots, y_{t-1}$).

  • $y_t$: The $t$-th token in the target response sequence.

  • $\mathcal{I}$: The task-specific instruction provided to the model.

  • $C$: The personalized user context, which can be the user's interaction history ($\mathcal{H}_u$) or their User Portrait ($\mathcal{P}_u$).

  • $y_{<t}$: The sequence of tokens generated before the current token $y_t$.

    This formulation allows leveraging standard Transformer architectures and LLM training infrastructure without task-specific architectural changes.
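As a concrete illustration of this objective, here is a minimal PyTorch sketch of the masked next-token loss over a concatenated [instruction; context; response] sequence, where only the response tokens contribute to the loss. The tensor layout and the -100 ignore index are standard conventions assumed for illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits, input_ids, response_start):
    """Negative log-likelihood over the target response tokens only.

    logits:         (B, T, V) model outputs for the concatenated [I; C; Y] sequence
    input_ids:      (B, T)    token ids (text and itemic tokens share one vocabulary)
    response_start: (B,)      index where the target response Y begins in each sample
    """
    # Shift so that the logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Mask instruction/context positions: only tokens of Y contribute to the loss.
    positions = torch.arange(shift_labels.size(1), device=shift_labels.device)
    ignore = positions.unsqueeze(0) < (response_start.unsqueeze(1) - 1)
    shift_labels[ignore] = -100                          # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy usage: batch of 2, sequence length 8, vocab 50; responses start at positions 5 and 6.
loss = ntp_loss(torch.randn(2, 8, 50), torch.randint(0, 50, (2, 8)), torch.tensor([5, 6]))
print(loss)
```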

4.2.3. RecIF-Bench: Task Taxonomy

RecIF-Bench organizes 8 distinct tasks into a four-layer capability hierarchy to assess different levels of intelligence in recommendation foundation models. The tasks are formalized as sequence-to-sequence problems $Y = \mathcal{F}(X)$, where $X = [\mathcal{I}, C]$ is the input (instruction and context) and $Y$ is the target (item ID, sequence of IDs, or natural language explanation).

The four layers are:

  • Layer 0: Semantic Alignment (Item Understanding)
    • Input (X): Item $i$.
    • Target (Y): Item Description (textual metadata).
    • Evaluation Focus: Verifies if the model can map itemic tokens to natural language, bridging the modality gap.
  • Layer 1: Fundamental Recommendation
    • Short Video Recommendation
      • Input (X): History $\mathcal{H}_{video}$ (user's video viewing history).
      • Target (Y): Next item $i_{video}$.
      • Evaluation Focus: Canonical next-item prediction within a single domain.
    • Ad Recommendation
      • Input (X): History $\mathcal{H}_{video} + \mathcal{H}_{ad}$ (video viewing history plus ad click history).
      • Target (Y): Next item $i_{ad}$.
      • Evaluation Focus: Cross-domain interest transfer and holistic user modeling.
    • Product Recommendation
      • Input (X): History $\mathcal{H}_{video} + \mathcal{H}_{product}$ (video viewing history plus product click history).
      • Target (Y): Next item $i_{product}$.
      • Evaluation Focus: Cross-domain interest transfer and holistic user modeling.
    • Label Prediction
      • Input (X): History $\mathcal{H}_{video}$ + Item $i_{video}$.
      • Target (Y): Binary (Yes/No) engagement.
      • Evaluation Focus: Pointwise estimation complementing generative recommendation.
  • Layer 2: Instruction Following
    • Interactive Recommendation
      • Input (X): Portrait $\mathcal{P}$ + Query $q$ (user profile plus natural language intent).
      • Target (Y): Item $i$ the user engages with.
      • Evaluation Focus: Model's ability to adapt predictions based on explicit natural language instructions.
    • Label-Conditional Recommendation
      • Input (X): History $\mathcal{H}_{video}$ + Action $a$ (video history plus target behavior label, e.g., "Like").
      • Target (Y): Item $i$ with action $a$.
      • Evaluation Focus: Fine-grained behavior modeling based on specified actions.
  • Layer 3: Reasoning
    • Recommendation Explanation
      • Input (X): Portrait $\mathcal{P}$ + History $\mathcal{H}_{video}$ + Item $i$.

      • Target (Y): Explanation (natural language justification).

      • Evaluation Focus: Model's ability to synthesize information and articulate user-item compatibility reasoning.

        Figure 4 of the original paper provides a visual summary of the task taxonomy of RecIF-Bench.

Figure 4 | Task Taxonomy of RecIF-Bench. We organize 8 tasks across 4 capability layers, specifying the instruction, context, and target.

4.2.4. Pre-Training

The pre-training phase is designed to inject recommendation knowledge and align itemic tokens while preserving general world knowledge. It uses a Qwen3 architecture as its backbone and is structured into two stages.

4.2.4.1. Pre-training Data

The pre-training corpus consists of Recommendation Corpora and General-Domain Corpora.

  • Recommendation Corpora: Derived from anonymized user logs from Kuaishou, covering user-side, item-side, and interaction-side metadata.
    • Itemic Tokenization: Item multimodal embeddings are quantized into 3-layer hierarchical itemic tokens using RQ-Kmeans, with a codebook size of 8192 per layer. Each item $i$ is mapped to a tuple of codes $S_i = (c_1, c_2, c_3)$, flattened into a token sequence wrapped by special tokens (e.g., <|item_begin|><item_a_C1><item_b_C2><item_c_C3><|item_end|>); a serialization sketch follows this list.
    • Data Types:
      • Itemic Dense Caption Data: Trains the model to generate natural-language captions given itemic tokens, establishing a semantic bridge.
      • Sequential User Behavior Data: Captures chronological user-item interactions (views, likes, shares) for next-item prediction and learning collaborative filtering signals.
      • Interleaved User Persona Grounding Data: Constructs narrative-style User Portraits by interleaving discrete item representations with heterogeneous user metadata (demographics, search history, interaction sequences, summarized interests). This fosters deeper semantic grounding between user characteristics and behaviors.
    • Scale: The primary training corpus for the standard OneRec variant involves ~160k users, 13 million item captions, and corresponding interactions. For the OneRec-Pro variant, this scales to ~20 million users and 98 million item captions, leveraging an in-house industrial corpus.
  • General-Domain Corpora: High-quality general-domain text corpora are mixed in during co-pretraining to mitigate catastrophic forgetting and maintain general intelligence.
    • Diversity: Includes multiple languages (Chinese, English) and specialized domains (Mathematic, Medical).
    • Reasoning-Intensive Data: Prioritizes data like mathematical derivations, logical puzzles, and code-centric corpora to enhance reasoning capabilities.
    • Deduplication: The MinHash algorithm is used to filter out general-domain samples similar to evaluation benchmarks, ensuring reliable generalization.
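As referenced above, the following is a minimal sketch of how an item's hierarchical codes might be flattened into the special-token template quoted in the paper (<|item_begin|>...<|item_end|>). The function name and the per-layer prefixes are illustrative assumptions.

```python
def serialize_item(codes):
    """Flatten one item's hierarchical codes, e.g. (c1, c2, c3), into the
    special-token string spliced into the LLM vocabulary, following the
    template quoted above: <|item_begin|><item_a_C1><item_b_C2><item_c_C3><|item_end|>."""
    layer_prefixes = ("a", "b", "c")  # one prefix per quantization layer (assumed naming)
    body = "".join(
        f"<item_{layer_prefixes[level]}_{code}>" for level, code in enumerate(codes)
    )
    return f"<|item_begin|>{body}<|item_end|>"

# A user history then becomes a plain string the extended tokenizer can consume.
history = "".join(serialize_item(c) for c in [(17, 4021, 903), (17, 88, 5120)])
print(history)
```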

4.2.4.2. Training Recipe

The pre-training uses the Qwen3 architecture. Two model variants are developed: OneRec (standard, 33B tokens, 41.3 million samples from public datasets) and OneRec-Pro (enhanced, 130B tokens, 179.1 million samples from in-house corpus).

The pre-training methodology involves two distinct stages:

  • Stage 1: Itemic-Text Alignment:
    • Objective: Establish a preliminary alignment between itemic tokens and text tokens.
    • Process: The vocabulary is expanded by appending itemic special tokens to the Qwen3 tokenizer. The embedding parameters for these itemic tokens are initialized from a multivariate normal distribution (mean and covariance of existing embeddings). Only these itemic token embedding parameters are trainable, while all other model parameters are frozen. For larger models (8B+), output projection parameters corresponding to itemic tokens are also trainable.
  • Stage 2: Full-Parameter Co-Pretraining:
    • Objective: Inject recommendation knowledge while preserving general world knowledge.
    • Process: All model parameters are unfrozen. Full-parameter pre-training is conducted on a mixed dataset of recommendation-domain samples and a considerable proportion of general-domain knowledge data to prevent catastrophic forgetting.
  • Training Recipe Details:
    • Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight decay of 0.1.
    • Learning Rate Schedule: Cosine decay with a linear warmup phase.
      • Peak LR: $1 \times 10^{-3}$ for Stage 1, $1 \times 10^{-4}$ for Stage 2.
      • Minimum LR: $1 \times 10^{-4}$ for Stage 1, $2 \times 10^{-5}$ for Stage 2.
      • Warmup Duration: First 10% of training steps.
    • Maximum Context Length: 32K tokens to accommodate long sequential user behavior data.
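To make the Stage 1 setup concrete, below is a minimal PyTorch sketch of extending an embedding table with itemic tokens initialized from a multivariate normal fitted to the mean and covariance of the existing embeddings, and of training only the new rows while the rest of the model stays frozen. The function names, the jitter term, and the gradient-hook trick are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

def extend_vocab_with_itemic_tokens(embedding: nn.Embedding, num_itemic_tokens: int) -> nn.Embedding:
    """Append itemic-token rows to a pre-trained embedding table, initializing
    them from a multivariate normal fitted to the mean and covariance of the
    existing (text) embeddings."""
    old = embedding.weight.data                                   # (V_text, d)
    mean = old.mean(dim=0)
    cov = torch.cov(old.T) + 1e-5 * torch.eye(old.size(1))        # jitter keeps cov positive-definite
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    new_rows = dist.sample((num_itemic_tokens,))                  # (V_item, d)

    extended = nn.Embedding(old.size(0) + num_itemic_tokens, old.size(1))
    extended.weight.data[: old.size(0)] = old
    extended.weight.data[old.size(0):] = new_rows
    return extended

def train_only_itemic_rows(model: nn.Module, embedding: nn.Embedding, num_itemic_tokens: int) -> None:
    """Freeze every model parameter, then let gradients flow only into the
    appended itemic rows of the (already installed) extended embedding table."""
    for p in model.parameters():
        p.requires_grad = False
    embedding.weight.requires_grad = True
    num_frozen = embedding.num_embeddings - num_itemic_tokens

    def zero_frozen_rows(grad):
        grad = grad.clone()
        grad[:num_frozen] = 0.0                                   # original text rows never move
        return grad

    embedding.weight.register_hook(zero_frozen_rows)
```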

4.2.4.3. Scaling Laws in Recommendation

To optimize the allocation of compute budget $C$ between model parameters $N$ and training tokens $D$, the paper follows Hoffmann et al. (2022). The Qwen3 architecture is evaluated across $N \in \{0.6, 1.7, 4, 8, 14\} \times 10^9$ parameters. The compute budget is approximated as $C \approx 6ND$.

The compute-optimal frontier is derived from the convex hull of the final training loss. Power-law scaling relations are fitted: $N_{\mathrm{opt}} \propto C^{a}, \quad D_{\mathrm{opt}} \propto C^{b}$

  • Scaling Laws on Recommendation Data: Empirical fit yields exponents:

    • $N_{\mathrm{opt}} \propto C^{0.44}$
    • $D_{\mathrm{opt}} \propto C^{0.56}$. This indicates a data-intensive scaling regime ($b > a$), suggesting that optimal compute in recommendation requires scaling data volume more aggressively than model parameters, a deviation from Chinchilla scaling laws (which imply $a \approx 0.5$, $b \approx 0.5$).
  • Parametric Fit and Interpretation: The final loss $L(N, D)$ is modeled using a parametric function: $L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$ Where:

    • $E$: Irreducible loss floor (minimum possible loss).

    • $\frac{A}{N^{\alpha}}$: Component of loss due to finite model capacity, where $A$ is a coefficient and $\alpha$ is the model capacity exponent.

    • $\frac{B}{D^{\beta}}$: Component of loss due to finite data size, where $B$ is a coefficient and $\beta$ is the data exponent.

      Fitting this to experimental data yields: $L(N, D) = 0.4232 + \frac{502.32}{N^{0.3325}} + \frac{7.02}{D^{0.1865}}$ From these coefficients, three insights are derived:

    • Data-Hungry Scaling ($\alpha > \beta$): The model capacity exponent ($\alpha \approx 0.33$) is consistent with LLM literature, but the data exponent ($\beta \approx 0.19$) is lower than typical text-domain values ($\beta_{\text{text}} \approx 0.28$). Since $\alpha > \beta$, this necessitates $b > 0.5$, confirming the need for more aggressive data scaling.

    • Impact of Warm-Starting (High A, Low B): The imbalance between $A$ (502.32) and $B$ (7.02) is attributed to transfer learning from the Qwen3 backbone, which lowers initial data entropy (low $B$). The inflated $A$ captures performance gains from larger models being trained with more data.

    • Low Entropy of Recommendation Tasks (Low E): The estimated irreducible loss floor $E = 0.42$ is substantially lower than for natural text ($\approx 1.69$). This suggests recommendation tasks with structured features (like Itemic Dense Captions) have lower inherent entropy, allowing the model to approach a more deterministic state, emphasizing the need for diverse and high-quality recommendation corpora.
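For readers who want to reproduce this kind of analysis, the following is a minimal sketch of fitting the parametric form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with SciPy. The (N, D, loss) points are synthetic, generated from the reported coefficients purely to exercise the fitting code; they are not measurements from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(ND, E, A, alpha, B, beta):
    """Chinchilla-style parametric form: L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = ND
    return E + A / np.power(N, alpha) + B / np.power(D, beta)

# Synthetic grid over the studied model sizes (0.6B-14B) and a few token budgets;
# losses are generated from the reported coefficients, then lightly perturbed.
N = np.tile([0.6e9, 1.7e9, 4e9, 8e9, 14e9], 3)
D = np.repeat([8e9, 33e9, 130e9], 5).astype(float)
reported = (0.4232, 502.32, 0.3325, 7.02, 0.1865)
L = parametric_loss((N, D), *reported) + np.random.default_rng(0).normal(0, 1e-3, N.size)

popt, _ = curve_fit(parametric_loss, (N, D), L,
                    p0=[0.5, 400.0, 0.3, 5.0, 0.2], maxfev=50_000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], np.round(popt, 4))))
```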

4.2.5. Post-Training

After pre-training, the model aligns itemic tokens and encodes collaborative filtering signals but may still lack advanced instruction-following and reasoning capabilities. The post-training phase (Figure 6) aims to enhance recommendation capabilities and restore general task performance using identical training data and strategies for both OneRec and OneRec-Pro.

Figure 6 | Post-training pipeline of the OneRec series models, showing the progression from pre-training through multi-task fine-tuning and general-capability distillation to recommendation-oriented reinforcement learning.

4.2.5.1. Multi-task Supervised Fine-tuning (SFT)

  • Objective: Restore and enhance foundational instruction-following and reasoning capabilities across both general and recommendation domains, establishing a robust base for subsequent stages.
  • Data: A specialized SFT corpus is curated by blending complex instruction-response pairs from RecIF-Bench-specific cleaned metadata (from 160K users) with high-quality open-source general-domain datasets focusing on instruction-following and complex reasoning. This corpus ensures no leakage from evaluation benchmarks.
  • Format: All instances are organized into a conversational format and serialized using the Qwen3 chat template.
  • Training: Fine-tuning on this unified dataset uses a recipe consistent with pre-training but with a reduced learning rate (from $2 \times 10^{-5}$ to $5 \times 10^{-6}$).
  • Outcome: Successfully resuscitates instruction-following. Reasoning ability from general-domain data cross-fertilizes with recommendation tasks, leading to coherent reasoning trajectories for complex recommendation queries even without explicit supervision in recommendation samples.

4.2.5.2. On-policy Distillation for General Capability

  • Objective: Address a persistent capability gap in general-domain reasoning observed after SFT, likely due to distributional shift and RL-initialized backbones.

  • Method: On-policy distillation via Policy Gradient. Unlike off-policy distillation (learning from static pre-generated data), on-policy distillation involves the student model generating its own trajectories, which are then evaluated and supervised by a teacher.

  • Objective Function: Per-token reverse KL divergence between the student's distribution ($\pi_{\theta}$) and the teacher's ($\pi_{\mathrm{teacher}}$). $\mathbb{D}_{KL}(\pi_{\theta} \parallel \pi_{\mathrm{teacher}}) = \mathbb{E}_{x \sim \pi_{\theta}} \left[ \log \pi_{\theta}(x_{t+1} \mid x_{1..t}) - \log \pi_{\mathrm{teacher}}(x_{t+1} \mid x_{1..t}) \right]$ Where:

    • $\mathbb{D}_{KL}$: The Kullback-Leibler (KL) divergence, a measure of how one probability distribution ($\pi_{\theta}$) differs from a second, reference probability distribution ($\pi_{\mathrm{teacher}}$). Minimizing this divergence makes the student policy's outputs similar to the teacher's.
    • $\pi_{\theta}$: The student policy (the model being trained), parameterized by $\theta$.
    • $\pi_{\mathrm{teacher}}$: The teacher policy (a pre-trained, typically stronger model).
    • $\mathbb{E}_{x \sim \pi_{\theta}}[\cdot]$: Expected value over trajectories $x$ sampled from the student policy $\pi_{\theta}$.
    • $\log \pi_{\theta}(x_{t+1} \mid x_{1..t})$: Log probability of the next token $x_{t+1}$ given the current trajectory prefix $x_{1..t}$ under the student policy.
    • $\log \pi_{\mathrm{teacher}}(x_{t+1} \mid x_{1..t})$: Log probability of the next token $x_{t+1}$ given the current trajectory prefix $x_{1..t}$ under the teacher policy.
  • Policy Gradient Optimization: The policy $\pi_{\theta}$ is optimized using policy gradient methods. For each input prompt $o$ sampled from the dataset $\mathcal{D}$, a trajectory $x$ is sampled from the student policy, and the reverse KL divergence is used as a reward signal. The objective is to maximize the expected reward via gradient ascent: $\nabla_{\theta} J(\theta) = \mathbb{E}_{o \sim \mathcal{D}, x \sim \pi_{\theta}} \left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(x_t \mid o, x_{<t}) \cdot R_{KL}(o, x) \right]$ Where:

    • $\nabla_{\theta} J(\theta)$: The gradient of the objective function $J(\theta)$ with respect to model parameters $\theta$. Maximizing this objective means updating $\theta$ to increase the likelihood of actions that lead to higher rewards.
    • $\mathbb{E}_{o \sim \mathcal{D}, x \sim \pi_{\theta}}[\cdot]$: Expected value over prompts $o$ sampled from dataset $\mathcal{D}$ and trajectories $x$ sampled from the student policy $\pi_{\theta}$.
    • $T$: The length of the sampled trajectory $x$.
    • $\nabla_{\theta} \log \pi_{\theta}(x_t \mid o, x_{<t})$: The gradient of the log-probability of taking action $x_t$ at time $t$, given the prompt $o$ and previous actions $x_{<t}$, under the student policy. This is the "score function" or "log-likelihood gradient."
    • $R_{KL}(o, x)$: The per-token reward derived from the teacher's distribution for the trajectory $x$ generated from prompt $o$.
  • Reward Clipping: To mitigate numerical instability from extreme log-probability ratios, a clipping mechanism is applied to the reverse KL divergence: $R_{KL}(o, x) = \mathrm{clip}\left(-\mathbb{D}_{KL}(\pi_{\theta} \parallel \pi_{\mathrm{teacher}}), \alpha, \beta\right)$ Where:

    • $\mathrm{clip}(\cdot, \alpha, \beta)$: A function that limits the value of its input. If the input is less than $\alpha$, it becomes $\alpha$; if it is greater than $\beta$, it becomes $\beta$; otherwise, it remains unchanged.

    • $\alpha, \beta$: Lower and upper clipping thresholds. This prevents outlier reward signals from destabilizing training.

      The comprehensive pipeline is illustrated in Figure 7 of the original paper.

    Figure 7 | The pipeline of On-policy Distillation via Policy Gradient. The student model (Policy Model) samples a trajectory $o$ for the given prompt $q$, while the Teacher Model provides feedback through the reverse KL divergence as reward $r$. The Policy Model is iteratively optimized with policy gradient methods based on this reward.

  • On-Policy Distillation on General-Domain:

    • Teacher Model: The original Qwen3 model (of the same parameter scale) serves as the teacher ($\pi_{\mathrm{teacher}}$).
    • Vocabulary Discrepancy Strategy: The Qwen3 teacher cannot recognize itemic tokens. To handle this:
      • Prompt Selection: Queries $q$ are sampled exclusively from general-domain datasets, expecting pure text generation.
      • Itemic Token Penalty & Truncation: If an itemic token appears in a sampled trajectory $o$ at step $t$, $\log \pi_{\mathrm{teacher}}(x_t \mid x_{<t})$ is set to a minimal value (e.g., -1e9) to simulate zero probability, and the trajectory is truncated. This provides a strong negative signal.
      • Enhanced Exploration: A high temperature coefficient during sampling encourages exploration, allowing the distillation process to identify and correct itemic token activations in general-domain tasks.
    • Thinking Paradigms: To recover instruction-following on thinking, a suffix (/think, /no_think, or empty) is randomly appended to user prompts to align with forced-thinking, non-thinking, and auto-thinking paradigms, as described in the Qwen3 technical report.
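The reward computation described above can be sketched as follows: a per-token reverse-KL reward built from student and teacher log-probabilities, with the teacher's log-probability forced to a minimal value on itemic tokens and the trajectory truncated there. The clipping thresholds and function names are illustrative assumptions; this is not the paper's implementation.

```python
import torch

def reverse_kl_rewards(student_logps, teacher_logps, itemic_mask,
                       clip_lo=-10.0, clip_hi=10.0, penalty=-1e9):
    """Per-token reward for on-policy distillation (a sketch).

    student_logps / teacher_logps: (T,) log-probs of the sampled tokens under
        the student and the Qwen3 teacher, respectively.
    itemic_mask: (T,) bool, True where the sampled token is an itemic token the
        teacher cannot score; its teacher log-prob is replaced by `penalty` and
        the trajectory is cut off at that position.
    clip_lo / clip_hi: the clipping thresholds (alpha, beta in the text above).
    """
    teacher_logps = torch.where(itemic_mask,
                                torch.full_like(teacher_logps, penalty),
                                teacher_logps)
    per_token_kl = student_logps - teacher_logps        # reverse-KL estimate per token
    rewards = torch.clamp(-per_token_kl, min=clip_lo, max=clip_hi)

    if itemic_mask.any():                                # truncate after the first itemic token
        cut = int(itemic_mask.to(torch.int64).argmax())
        rewards[cut + 1:] = 0.0
    return rewards
```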

4.2.5.3. Reinforcement Learning for Recommendation (Rec-RL)

  • Objective: Directly optimize discrete ranking metrics (e.g., Recall, NDCG), which SFT does not directly address; SFT often suffers from exposure bias and struggles to distinguish "near-misses."

  • Method: Group Relative Policy Optimization (GRPO) (Shao et al., 2024). This framework avoids a separate critic model by computing the advantage of a response relative to a group of sampled trajectories for the same prompt, reducing computational overhead while maintaining stability.

  • Objective Function: Maximize: $\mathcal{L}_{GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \left( \mathsf{Adv}_i \cdot \log \pi_{\theta}(R_i \mid q) \right) - \beta \cdot KL(\pi_{\theta} \parallel \pi_{ref})$ Where:

    • $\mathcal{L}_{GRPO}(\theta)$: The GRPO objective function to be maximized.
    • $G$: The number of candidate responses sampled for a given prompt $q$.
    • $R_i$: The $i$-th sampled candidate response.
    • $\mathsf{Adv}_i$: The relative advantage of response $R_i$, calculated by normalizing rewards within the group. This term dictates how much the probability of $R_i$ should be increased or decreased.
    • $\log \pi_{\theta}(R_i \mid q)$: The log-probability of generating response $R_i$ given prompt $q$ under the current policy $\pi_{\theta}$.
    • $\beta$: A hyperparameter controlling the strength of the KL penalty.
    • $KL(\pi_{\theta} \parallel \pi_{ref})$: The KL divergence between the current policy $\pi_{\theta}$ and a reference policy $\pi_{ref}$. This penalty encourages the new policy not to deviate too far from the previous, stable policy (after on-policy distillation), preventing instability and catastrophic forgetting of general intelligence.
  • Rule-based Recommendation Reward: A sparse, rule-based reward function $r(R_i)$ is designed for the 5 core recommendation tasks (Short Video Rec, Ad Rec, Product Rec, Interactive Rec, Label-Conditional Rec). The reward is focused on "Hit" events: $r(R_i) = \begin{cases} +1.0 & \text{if the target itemic token } s \in R_i \\ 0.0 & \text{otherwise} \end{cases}$ This encourages the model to assign higher probability mass to itemic tokens that lead to successful hits, effectively performing "Soft Ranking" within the generative space.

  • Implementation Details:

    • Initialization: The RL trainer is initialized with the model after on-policy distillation.
    • KL Penalty: A strict KL penalty ($\beta$) against $\pi_{ref}$ is maintained to prevent sacrificing general intelligence for domain-specific precision.
    • Dataset: Uses the same dataset as the SFT stage.
  • Outcome: Significant boost in recommendation metrics, aligning generative behavior with recommender system goals.
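Below is a minimal sketch of the Rec-RL reward and the group-relative advantage used in GRPO: a rule-based "Hit" reward per sampled response, normalized against the other responses in the same group. This is an illustrative sketch; the full objective also includes the policy log-probability term and the KL penalty toward the reference policy.

```python
import torch

def hit_reward(response_tokens, target_item_tokens):
    """Rule-based reward: +1.0 if the target itemic-token sequence appears
    anywhere in the generated response, else 0.0 (the 'Hit' event above)."""
    resp, tgt = list(response_tokens), list(target_item_tokens)
    hit = any(resp[i:i + len(tgt)] == tgt for i in range(len(resp) - len(tgt) + 1))
    return 1.0 if hit else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against the group of G
    candidate responses sampled for the same prompt."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 4 candidates for one prompt, two of which contain the target item.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))   # positive advantage for hits, negative otherwise
```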

5. Experimental Setup

5.1. Datasets

The experiments in OpenOneRec utilize several datasets to evaluate both recommendation-specific and general intelligence capabilities.

5.1.1. RecIF-Bench

RecIF-Bench is a newly proposed, comprehensive benchmark designed to rigorously evaluate recommendation foundation models.

  • Source & Scale: Aggregates approximately 120 million (120M) interactions from 200,000 (200K) distinct users.

  • Domain Coverage: Spans three heterogeneous industrial domains, each capturing different user behavior patterns:

    • Short Video (Content Domain): Short-form videos from Kuaishou, including viewing behaviors across various app tabs. Provides impression sequences with corresponding interaction types.
    • Ad (Commercial Domain): Promotional short videos sponsored by advertisers on the Kuaishou platform, typically with clickable redirects. Provides click sequences of user ad click behaviors.
    • Product (E-commerce Domain): Products listed in the Kuaishou Mall. Provides click sequences of user product click behaviors.
  • Rich Metadata: Beyond interaction logs, RecIF-Bench provides comprehensive metadata:

    • User-side: Each user has a User Portrait—a unified narrative interleaving natural language descriptions with itemic tokens. This portrait includes demographics (gender, age), content creation history, recent searches, followed creator types, viewing preferences, comments, livestream views, purchase records, shopping cart items, local service coupons, ad exposures, and commercial intent signals.
    • Item-side: Each item is associated with multimodal embeddings (4096-dim text embedding and 5-frame visual embeddings with 1152-dim per frame). Dense captions are also provided for approximately 13 million videos.
    • Interaction-side: For each user-video pair in the exposure sequence, multi-label behavioral signals are recorded, including like, follow, comment, effective view, and dislike.
  • Itemic Tokenization: All items are pre-tokenized into tuples of discrete tokens $s = (c_1, c_2, \ldots, c_k)$ using a hierarchical quantization strategy, enabling direct consumption by LLM-based recommenders. The paper notes flexibility for researchers to train custom itemic tokenizers or use traditional item IDs.

  • Data Splitting Strategy: A strict user-based splitting strategy is used. 20% of users are randomly selected as the held-out test set, ensuring zero leakage. For each user, interactions are partitioned temporally: interactions before a timestamp form the history $\mathcal{H}$, and those after serve as the target $Y$.

    The following are the statistics of RecIF-Bench from Table 1 of the original paper:

    | Domain | # Users | # Items | # Interactions | Avg. Hist. Item | Avg. Tgt. Item |
    | --- | --- | --- | --- | --- | --- |
    | Short Video | 195,026 | 13,107,675 | 94,443,611 | 458.1 | 8.6 |
    | Ad | 151,259 | 177,548 | 5,341,911 | 29.9 | 5.5 |
    | Product | 144,307 | 2,055,240 | 20,087,210 | 132.5 | 6.7 |
    | Total | 202,359 | 15,340,463 | 119,872,732 | 574.9 | 17.5 |

The following are the data distribution analysis of RecIF-Bench from Figure 3 of the original paper:

Figure 3 | Data distribution analysis of RecIF-Bench. (a) Item popularity distribution (log-log scale) across domains. (b-d) Distribution of user history lengths for Short Video, Ad, and Product domains, respectively.
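As a concrete illustration of the splitting protocol described above (held-out users plus a per-user temporal cut into history and target), here is a minimal sketch; the 20% ratio follows the text, while the function names and cutoff handling are illustrative assumptions.

```python
import random

def split_users(user_ids, test_ratio=0.20, seed=42):
    """Strict user-based split: a random 20% of users are held out entirely,
    so no test-user interaction ever appears in training."""
    users = sorted(user_ids)
    random.Random(seed).shuffle(users)
    cut = int(len(users) * test_ratio)
    return set(users[cut:]), set(users[:cut])          # (train_users, test_users)

def temporal_partition(interactions, cutoff_ts):
    """Per-user temporal split: interactions before the cutoff timestamp form the
    history H, later ones form the target Y. `interactions` is a list of
    (timestamp, item_id) pairs."""
    ordered = sorted(interactions)
    history = [item for ts, item in ordered if ts < cutoff_ts]
    target = [item for ts, item in ordered if ts >= cutoff_ts]
    return history, target

# Usage: three toy interactions, cutoff at t = 100.
print(temporal_partition([(10, "v1"), (50, "v2"), (120, "v3")], cutoff_ts=100))
```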

5.1.2. Amazon Benchmark

To evaluate cross-domain transferability and generalization capabilities, the models are tested on 10 real-world datasets from the popular Amazon review benchmark (McAuley et al., 2015).

  • Domains: Baby, Beauty, Cell Phones and Accessories, Grocery and Gourmet Food, Health and Personal Care, Home and Kitchen, Pet Supplies, Sports and Outdoors, Tools and Home Improvement, and Toys and Games.
  • Purpose: These diverse domains rigorously validate whether the foundation model, pre-trained on open-domain data, provides a fundamental transfer advantage for specific downstream recommendation distributions.
  • Data Pre-processing: Sparse users and items with fewer than 5 interactions are discarded.
  • Splitting Strategy: Leave-one-out strategy (Rajput et al., 2023; Wang et al., 2024) for training and evaluation in the sequential recommendation setting.

5.1.3. General Intelligence Sanity Check

To ensure that recommendation-focused models retain general intelligence, a sanity check suite is included, covering four categories:

  • Math & Text Reasoning: MATH-500, GSM8K, and AIME'24.
  • General Tasks: MMLU-Pro and GPQA-Diamond.
  • Alignment: IFEVAL (strict prompt).
  • Coding: LiveCodeBench v5.

5.1.4. Pre-training Data

The pre-training data for OneRec (standard variant) consists of 33B tokens across 41.3 million samples, derived from a mixture of general-domain and recommendation-domain corpora. OneRec-Pro uses 130B tokens and 179.1 million samples.

The following are the detailed data composition and token budgets for pre-training from Table 13 of the original paper:

| Dataset | Weight (%) | Category | Subtotal (%) | Token Budget |
| --- | --- | --- | --- | --- |
| Nemotron_CC_Math_v1 | 37.41% | Math | 62.34% | 29B |
| Nemotron_Pretraining_Code_v1 | 12.91% | Code | | |
| Nemotron_CC_v2 | 5.59% | Math, General, Code | | 4B |
| reasoning_v1_20m | 4.04% | General | | |
| OpenMathReasoning | 1.16% | Math | | |
| NuminaMath-QwQ-CoT-5M | 0.79% | Math | | |
| KodCode_V1_SFT_R1 | 0.26% | Code | | |
| Chinese-Reasoning-Distil-Data | 0.09% | General | | |
| medical-o1-reasoning-SFT | 0.09% | Medical | | |
| Itemic Dense Caption Data | 12.01% | Reco | 37.66% | |
| Interleaved User Persona Grounding Data | 10.03% | Reco | | |
| Sequential User Behavior Data | 15.62% | Reco | | |
| Total | 100% | | 100% | 33B |

The following are the data composition and token budgets for pre-training stages from Table 14 of the original paper:

| Model | Stage | Training | General-Domain | Reco-Domain | Token Budget |
| --- | --- | --- | --- | --- | --- |
| OneRec | stage1 | Itemic-related Parameters | - | - | 16B |
| OneRec | stage2 | Full-Parameter | 62.34% | 37.66% | 33B |
| OneRec-Pro | stage1 | Itemic-related Parameters | - | - | 30B |
| OneRec-Pro | stage2 | Full-Parameter | 53% | 47% | 130B |

5.1.5. Multi-task SFT Data

The Multi-task Supervised Fine-tuning (SFT) stage uses a blended corpus of general-domain reasoning samples and recommendation-specific tasks.

The following are the data mixture for Multi-task SFT from Table 15 of the original paper:

| Dataset | Weight (%) | Category | Subtotal (%) |
| --- | --- | --- | --- |
| OpenMathReasoning | 12.971% | Math | 64.978% |
| R1-Distill-SFT | 12.784% | General Instruction | |
| Infinity_Instruct | 11.359% | Instruction | |
| OpenCoderReasoning | 11.130% | Code | |
| Chinese-Reasoning-Distil-Data | 4.552% | General | |
| Reasoning_Multi_subject_RLVR | 4.376% | Multi-subject | |
| Reasoning_KodCode_V1_SFT_R1 | 4.167% | Code | |
| DeepMath103K | 2.362% | Math | |
| medical-o1-reasoning-SFT | 1.277% | Medical | |
| Label prediction | 7.800% | Reco | 35.022% |
| SID to Caption generation | 7.493% | Reco | |
| Interactive recommendation | 6.392% | Reco | |
| Video recommendation | 3.971% | Reco | |
| Label conditional recommendation | 3.575% | Reco | |
| Total | 100% | | 100% |

5.2. Evaluation Metrics

The paper employs a dual-metric evaluation system to cover both recommendation accuracy and generation quality.

5.2.1. Recommendation Metrics (for Layer 1 & 2 tasks)

  • Pass@K:
    • Conceptual Definition: Pass@K measures whether the ground truth item appears among the top K items generated or recommended by the model. It is a binary hit/miss metric: if any of the K predicted items matches a ground truth relevant item, the sample counts as a "pass."
    • Mathematical Formula: $ \mathrm{Pass@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{relevant\_item}_{u} \in \text{TopK\_predictions}_{u}) $
    • Symbol Explanation:
      • $|U|$: Total number of users in the evaluation set.
      • $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
      • $\text{relevant\_item}_{u}$: The actual relevant item(s) for user $u$ in the test set.
      • $\text{TopK\_predictions}_{u}$: The set of top $K$ items recommended by the model for user $u$.
  • Recall@K:
    • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved by the recommendation system within its top K recommendations. It focuses on how many of the truly relevant items the model was able to "find."
    • Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\text{relevant\_items}_{u} \cap \text{TopK\_predictions}_{u}|}{|\text{relevant\_items}_{u}|} $
    • Symbol Explanation:
      • $|U|$: Total number of users in the evaluation set.
      • $\text{relevant\_items}_{u}$: The set of all relevant items for user $u$ in the test set.
      • $\text{TopK\_predictions}_{u}$: The set of top $K$ items recommended by the model for user $u$.
      • $|\cdot|$: The cardinality (number of elements) of a set.
      • $\cap$: Set intersection operator.
  • AUC (Area Under the Receiver Operating Characteristic Curve):
    • Conceptual Definition: AUC is a commonly used metric for binary classification problems (like Label Prediction). It measures the ability of a model to distinguish between positive and negative classes across all possible classification thresholds. A higher AUC indicates better discrimination power. An AUC of 0.5 means the model performs no better than random guessing, while an AUC of 1.0 indicates perfect classification.
    • Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{positive}} \sum_{j \in \text{negative}} \left[ \mathbb{I}(P(i) > P(j)) + 0.5 \cdot \mathbb{I}(P(i) = P(j)) \right]}{|\text{positive}| \cdot |\text{negative}|} $
    • Symbol Explanation:
      • positive: Set of positive samples (e.g., actual engagements).
      • negative: Set of negative samples (e.g., non-engagements).
      • $P(i)$: The predicted probability (or score) that sample $i$ belongs to the positive class.
      • $\mathbb{I}(\cdot)$: An indicator function.
      • $|\text{positive}|$: Number of positive samples.
      • $|\text{negative}|$: Number of negative samples.
  • NDCG@K (Normalized Discounted Cumulative Gain at K):
    • Conceptual Definition: NDCG@K is a measure of ranking quality, especially useful when relevance judgments are graded (e.g., highly relevant, somewhat relevant, not relevant). It takes into account both the relevance of recommended items and their position in the ranked list. Highly relevant items ranked higher contribute more to the score. It normalizes the DCG by the ideal DCG to get a score between 0 and 1.
    • Mathematical Formula: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the ranked list.
      • $\log_2(i+1)$: The discount factor for items at lower positions.
      • $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$.
      • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$, i.e., the maximum possible DCG if all relevant items were perfectly ranked.
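
As a reference for the metric definitions above, the per-user quantities can be computed in a few lines of Python; averaging over all users then gives the reported scores. This is a minimal illustrative sketch, with function and argument names chosen here rather than taken from the benchmark's evaluation code.

```python
import math

def pass_at_k(topk: list, relevant: set) -> float:
    """Pass@K: 1 if any of the top-K predictions hits a relevant item, else 0."""
    return 1.0 if any(item in relevant for item in topk) else 0.0

def recall_at_k(topk: list, relevant: set) -> float:
    """Recall@K: fraction of the user's relevant items found in the top-K list."""
    return len(set(topk) & relevant) / len(relevant)

def ndcg_at_k(topk: list, relevance: dict) -> float:
    """NDCG@K with graded relevance; `relevance` maps item -> rel score (0 if absent)."""
    dcg = sum(
        (2 ** relevance.get(item, 0) - 1) / math.log2(rank + 2)
        for rank, item in enumerate(topk)
    )
    ideal = sorted(relevance.values(), reverse=True)[: len(topk)]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def auc(scores_pos: list, scores_neg: list) -> float:
    """Pairwise AUC: probability that a random positive is ranked above a random negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))
```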

5.2.2. Text Generation Metrics (for Layer 0 & 3 tasks)

  • LLM-as-Judge Score:
    • Conceptual Definition: For Item Understanding and Recommendation Explanation tasks, an independent LLM (Gemini-2.5-Flash-Lite) is used as a judge to rate the quality of generated text on dimensions like accuracy and coherence. This method overcomes the limitations of traditional N-gram based metrics (like BLEU or ROUGE) which may not capture semantic quality for open-ended generation.
    • Evaluation Process (Appendix B.1):
      1. Information Point Extraction: The LLM judge decomposes both the ground truth captions (generated by Gemini-2.5-Pro) and model-generated captions into atomic Weighted Information Points (WIPs). Each WIP contains a fact statement and an importance score (1-5).
      2. Semantic Matching: The LLM judge aligns model-generated WIPs with ground truth WIPs, identifying valid matches, hallucinations (unmatched model WIPs), and omissions (unmatched GT WIPs).
      3. Weighted Scoring: A Double-Weighted F1 Score is computed. For each matched pair $(w_{gt}, w_{model})$, a match quality score $q \in [0, 1]$ is calculated using BERTScore (with bert-base-chinese).
    • Mathematical Formula for F1 Score: $ TP_i = \sum_{(w_{gt}, w_{model}) \in \text{Matches}} \mathrm{score}(w_{gt}) \times q $ $ FN_i = \sum_{(w_{gt}, w_{model}) \in \text{Matches}} \mathrm{score}(w_{gt}) \times (1 - q) + \sum_{w_{gt} \in \text{Unmatched GT}} \mathrm{score}(w_{gt}) $ $ FP_i = \sum_{(w_{gt}, w_{model}) \in \text{Matches}} \mathrm{score}(w_{model}) \times (1 - q) + \sum_{w_{model} \in \text{Unmatched Model}} \mathrm{score}(w_{model}) $ $ F1_i = \frac{2 \cdot TP_i}{2 \cdot TP_i + FP_i + FN_i} $ $ \text{LLM-Judge Score} = \frac{1}{N} \sum_{i=1}^{N} F1_i $
    • Symbol Explanation:
      • $TP_i$: Weighted True Positives for sample $i$: the sum of importance scores of matched ground truth WIPs, weighted by their BERTScore match quality $q$.
      • $FN_i$: Weighted False Negatives for sample $i$: the sum of scores of matched GT WIPs weighted by $(1-q)$ (missed aspects of matched WIPs) plus the scores of completely unmatched GT WIPs (omissions).
      • $FP_i$: Weighted False Positives for sample $i$: the sum of scores of model WIPs that matched GT WIPs weighted by $(1-q)$ (incorrect aspects of matched WIPs) plus the scores of completely unmatched model WIPs (hallucinations).
      • $\mathrm{score}(w)$: Importance score of a WIP.
      • $q$: BERTScore (F1) match quality for a pair of WIPs.
      • Matches: Set of matched WIP pairs.
      • Unmatched GT: Set of ground truth WIPs that were not matched.
      • Unmatched Model: Set of model-generated WIPs that were not matched.
      • $F1_i$: The F1 score for sample $i$.
      • $\text{LLM-Judge Score}$: The final average F1 score across all $N$ samples.
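
Putting the scoring formulas together, the per-sample Double-Weighted F1 reduces to simple bookkeeping once the judge has produced matches and importance scores. A minimal sketch, assuming the inputs are triples of (GT importance, model importance, BERTScore quality q) for matches and plain importance lists for unmatched WIPs:

```python
def double_weighted_f1(matches, unmatched_gt, unmatched_model) -> float:
    """Double-Weighted F1 for one sample, following the formulas above.

    `matches`: list of (gt_importance, model_importance, q) triples, q in [0, 1].
    `unmatched_gt` / `unmatched_model`: importance scores of omitted GT WIPs and
    hallucinated model WIPs, respectively.
    """
    tp = sum(s_gt * q for s_gt, _, q in matches)
    fn = sum(s_gt * (1 - q) for s_gt, _, q in matches) + sum(unmatched_gt)
    fp = sum(s_model * (1 - q) for _, s_model, q in matches) + sum(unmatched_model)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def llm_judge_score(per_sample_inputs) -> float:
    """Average the per-sample F1 values to obtain the final LLM-Judge Score."""
    scores = [double_weighted_f1(*x) for x in per_sample_inputs]
    return sum(scores) / len(scores)
```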

5.3. Baselines

The OneRec-Foundation models are compared against two groups of competitive baselines:

5.3.1. Discriminative Recommender Models

These are traditional models that typically focus on predicting scores or ranking items.

  • BERT4Rec (Sun et al., 2019)
  • GRU4Rec (Hidasi et al., 2016)
  • SASRec (Kang and McAuley, 2018)
  • HSTU (Zhai et al., 2024)
  • ReaRec (Tang et al., 2025)
  • Adaptation: These methods are inherently task-specific, so each was trained separately, with task-specific modifications, to support the diverse RecIF-Bench tasks.

5.3.2. Generative Recommender Models

These models also aim to generate recommendations, but OpenOneRec distinguishes itself by its holistic approach and foundation model capabilities.

  • TIGER (Rajput et al., 2023)
  • LC-Rec (Zheng et al., 2024): This model was specifically implemented as "LC-Rec-8B" using a comparable Qwen3-8B backbone to ensure a fair comparison at the foundation model scale.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Main Results on RecIF-Bench

The evaluation on RecIF-Bench demonstrates OpenOneRec-Foundation's superior performance across a wide range of recommendation tasks.

The following are the unified performance comparison across all tasks from Table 4 of the original paper:

OneRec-1.7BOneRec-1.7B-ProOneRec-8BOneRec-8B-Pro
TaskMetricBest BaselineLC-Rec-8BSASRecBERT4RecGRU4RecHSTUReaRecTIGER
Short Video RecPass@10.00450.00400.00510.00430.00520.01680.03410.0496
Pass@320.09510.09930.10030.10100.10020.10610.13060.1710
Recall@320.01130.01190.01190.01200.01170.01320.01800.0272
Ad RecPass@10.00440.00610.00590.00350.00760.01250.01970.0169
Pass@320.09800.12250.11020.10540.12660.17690.20960.2037
Recall@320.02930.03810.03360.03270.04090.05810.07230.0707
Product RecPass@10.00520.00540.00470.00300.00550.01200.01780.0144
Pass@320.09140.09360.08210.09070.09140.12760.18090.1571
Recall@320.01750.01930.01610.01890.01780.02830.04160.0360
Label-Cond. RecPass@10.00260.00260.00320.00260.00270.00440.00790.0064
Pass@320.03800.03720.03930.03810.03830.03370.04200.0431
Recall@320.01400.01350.01430.01370.01390.01230.01700.0184
Label Pred.AUC0.62440.65980.66400.62040.65810.66750.61390.6184
Interactive RecPass@1------0.08900.0660
Pass@32------0.37300.3170
Recall@32------0.23940.1941
Item Understand.LLM-Judge Score------0.25170.3175
Rec. ExplanationLLM-Judge Score------3.93503.3540
  • State-of-the-Art Recommendation Performance: OneRec-Foundation consistently outperforms all baselines (both discriminative and generative) across the vast majority of tasks in RecIF-Bench. This is evident from the higher Pass@K, Recall@K, and AUC scores for OneRec-Foundation variants compared to LC-Rec-8B and other baselines.
  • Scaling Effects:
    • Data Scaling: OneRec-Pro models (trained on a larger industrial corpus) consistently surpass OneRec models of the same parameter size (e.g., OneRec-8B-Pro generally outperforms OneRec-8B). This validates the importance of data scale in enhancing recommendation capabilities.
    • Model Scaling: Larger models (8B parameters) generally outperform smaller models (1.7B parameters) across all variants (e.g., OneRec-8B > OneRec-1.7B). This confirms predictable capability scaling with increasing model size, analogous to findings in general LLMs.

6.1.2. Trade-off on General Capabilities

While excelling in recommendation, the models exhibit a trade-off in general capabilities.

The following are the performance comparison on general capability (Thinking) from Table 5 of the original paper:

| Category | Task | Qwen3-1.7B | OneRec-1.7B | OneRec-1.7B-Pro | Qwen3-8B | OneRec-8B | OneRec-8B-Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Math & Text Reasoning | MATH-500 | 0.8780 | 0.8840 | 0.8840 | 0.9520 | 0.9460 | 0.9380 |
| | GSM8K | 0.9121 | 0.8984 | 0.8999 | 0.9568 | 0.9575 | 0.9575 |
| | AIME'24 | 0.4938 | 0.4104 | 0.4146 | 0.7917 | 0.7250 | 0.7188 |
| General Tasks | MMLU-Pro | 0.5422 | 0.3548 | 0.3932 | 0.7235 | 0.5342 | 0.5204 |
| | GPQA-Diamond | 0.3788 | 0.3232 | 0.3333 | 0.5606 | 0.5000 | 0.5051 |
| Alignment Tasks | IFEVAL (strict prompt) | 0.6969 | 0.5471 | 0.5416 | 0.8577 | 0.7893 | 0.7634 |
| Coding | LiveCodeBench v5 | 0.3907 | 0.2832 | 0.2832 | 0.5484 | 0.4910 | 0.4667 |

The following are the performance comparison on general capability (Non-Thinking) from Table 6 of the original paper:

| Category | Task | Qwen3-1.7B | OneRec-1.7B | OneRec-1.7B-Pro | Qwen3-8B | OneRec-8B | OneRec-8B-Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Math & Text Reasoning | MATH-500 | 0.6980 | 0.7060 | 0.6940 | 0.8380 | 0.8240 | 0.7980 |
| | GSM8K | 0.8218 | 0.8036 | 0.8158 | 0.9303 | 0.9310 | 0.9196 |
| | AIME'24 | 0.1313 | 0.1271 | 0.1250 | 0.2729 | 0.2417 | 0.2271 |
| General Tasks | MMLU-Pro | 0.4384 | 0.3072 | 0.2804 | 0.6632 | 0.5795 | 0.4521 |
| | GPQA-Diamond | 0.3030 | 0.3131 | 0.2778 | 0.3990 | 0.4040 | 0.3939 |
| Alignment Tasks | IFEVAL (strict prompt) | 0.6747 | 0.4769 | 0.5250 | 0.8392 | 0.7357 | 0.7098 |
| Coding | LiveCodeBench v5 | 0.1219 | 0.1219 | 0.1147 | 0.2760 | 0.2401 | 0.2401 |

  • Retention of General Capabilities: The models successfully retain most of the general capabilities of the Qwen3 backbone. Performance on mathematical benchmarks (MATH-500, GSM8K) shows minimal degradation, and in some cases (e.g., GSM8K for OneRec-8B), even slight improvements over the base Qwen3-8B in Thinking mode.
  • Performance Trade-off: A noticeable performance trade-off exists, particularly in broader general knowledge (MMLU-Pro, GPQA-Diamond) and coding (LiveCodeBench v5) tasks. OneRec-Foundation variants generally score lower than their base Qwen3 counterparts in these categories. This suggests that while the distillation process effectively preserves reasoning proficiency, the limited diversity or quality of the general data used during post-training might constrain the model's broader general capabilities, indicating a need for more refined data strategies to achieve a better balance.

6.1.3. Transfer Learning on Amazon Benchmark

OneRec-Foundation demonstrates exceptional transferability and cross-domain generalization on the Amazon benchmark.

The following are the Cross-Domain generalization performance on Amazon domains from Table 7 of the original paper:

| Model | Metric | Baby | Beauty | Cell | Grocery | Health | Home | Pet | Sports | Tools | Toys |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SASRec | R@5 | 0.0232 | 0.0393 | 0.0482 | 0.0480 | 0.0295 | 0.0133 | 0.0377 | 0.0240 | 0.0269 | 0.0420 |
| | R@10 | 0.0381 | 0.0639 | 0.0782 | 0.0789 | 0.0506 | 0.0212 | 0.0607 | 0.0389 | 0.0437 | 0.0658 |
| | N@5 | 0.0137 | 0.0209 | 0.0281 | 0.0262 | 0.0173 | 0.0070 | 0.0222 | 0.0130 | 0.0149 | 0.0217 |
| | N@10 | 0.0185 | 0.0289 | 0.0378 | 0.0361 | 0.0242 | 0.0098 | 0.0296 | 0.0178 | 0.0203 | 0.0294 |
| BERT4Rec | R@5 | 0.0117 | 0.0219 | 0.0325 | 0.0307 | 0.0204 | 0.0063 | 0.0218 | 0.0151 | 0.0145 | 0.0200 |
| | R@10 | 0.0228 | 0.0419 | 0.0569 | 0.0534 | 0.0353 | 0.0113 | 0.0412 | 0.0261 | 0.0264 | 0.0362 |
| | N@5 | 0.0065 | 0.0120 | 0.0190 | 0.0174 | 0.0117 | 0.0038 | 0.0123 | 0.0083 | 0.0083 | 0.0102 |
| | N@10 | 0.0101 | 0.0185 | 0.0268 | 0.0247 | 0.0165 | 0.0054 | 0.0186 | 0.0119 | 0.0121 | 0.0154 |
| GRU4Rec | R@5 | 0.0202 | 0.0322 | 0.0430 | 0.0362 | 0.0256 | 0.0090 | 0.0264 | 0.0174 | 0.0176 | 0.0266 |
| | R@10 | 0.0346 | 0.0539 | 0.0676 | 0.0591 | 0.0423 | 0.0156 | 0.0449 | 0.0278 | 0.0305 | 0.0453 |
| | N@5 | 0.0124 | 0.0201 | 0.0275 | 0.0230 | 0.0164 | 0.0058 | 0.0163 | 0.0110 | 0.0116 | 0.0171 |
| | N@10 | 0.0170 | 0.0271 | 0.0355 | 0.0303 | 0.0217 | 0.0079 | 0.0222 | 0.0144 | 0.0158 | 0.0231 |
| HSTU | R@5 | 0.0226 | 0.0456 | 0.0475 | 0.0458 | 0.0330 | 0.0134 | 0.0362 | 0.0227 | 0.0231 | 0.0489 |
| | R@10 | 0.0350 | 0.0643 | 0.0725 | 0.0712 | 0.0485 | 0.0197 | 0.0521 | 0.0347 | 0.0337 | 0.0649 |
| | N@5 | 0.0156 | 0.0308 | 0.0314 | 0.0297 | 0.0215 | 0.0092 | 0.0239 | 0.0151 | 0.0159 | 0.0339 |
| | N@10 | 0.0196 | 0.0368 | 0.0395 | 0.0378 | 0.0265 | 0.0112 | 0.0290 | 0.0190 | 0.0193 | 0.0391 |
| ReaRec | R@5 | 0.0197 | 0.0488 | 0.0444 | 0.0454 | 0.0326 | 0.0150 | 0.0299 | 0.0231 | 0.0219 | 0.0517 |
| | R@10 | 0.0320 | 0.0702 | 0.0711 | 0.0730 | 0.0481 | 0.0210 | 0.0486 | 0.0348 | 0.0310 | 0.0706 |
| | N@5 | 0.0123 | 0.0341 | 0.0269 | 0.0289 | 0.0213 | 0.0101 | 0.0189 | 0.0152 | 0.0143 | 0.0369 |
| | N@10 | 0.0163 | 0.0409 | 0.0355 | 0.0378 | 0.0263 | 0.0121 | 0.0249 | 0.0189 | 0.0173 | 0.0430 |
| TIGER | R@5 | 0.0191 | 0.0413 | 0.0540 | 0.0447 | 0.0328 | 0.0142 | 0.0343 | 0.0216 | 0.0228 | 0.0367 |
| | R@10 | 0.0318 | 0.0628 | 0.0786 | 0.0691 | 0.0534 | 0.0216 | 0.0542 | 0.0331 | 0.0344 | 0.0527 |
| | N@5 | 0.0125 | 0.0277 | 0.0350 | 0.0295 | 0.0222 | 0.0094 | 0.0232 | 0.0145 | 0.0148 | 0.0255 |
| | N@10 | 0.0162 | 0.0346 | 0.0429 | 0.0373 | 0.0289 | 0.0118 | 0.0295 | 0.0182 | 0.0184 | 0.0307 |
| LC-Rec | R@5 | 0.0232 | 0.0495 | 0.0585 | 0.0501 | 0.0412 | 0.0199 | 0.0388 | 0.0269 | 0.0288 | 0.0350 |
| | R@10 | 0.0344 | 0.0764 | 0.0883 | 0.0790 | 0.0616 | 0.0293 | 0.0612 | 0.0418 | 0.0438 | 0.0549 |
| | N@5 | 0.0151 | 0.0338 | 0.0392 | 0.0328 | 0.0272 | 0.0138 | 0.0247 | 0.0177 | 0.0187 | 0.0221 |
| | N@10 | 0.0187 | 0.0424 | 0.0488 | 0.0421 | 0.0338 | 0.0168 | 0.0320 | 0.0225 | 0.0235 | 0.0285 |
| Ours | R@5 | 0.0352 | 0.0646 | 0.0717 | 0.0688 | 0.0534 | 0.0279 | 0.0563 | 0.0365 | 0.0412 | 0.0693 |
| | R@10 | 0.0513 | 0.0924 | 0.1036 | 0.1029 | 0.0768 | 0.0390 | 0.0834 | 0.0547 | 0.0593 | 0.0953 |
| | N@5 | 0.0238 | 0.0456 | 0.0490 | 0.0460 | 0.0376 | 0.0202 | 0.0389 | 0.0252 | 0.0295 | 0.0496 |
| | N@10 | 0.0289 | 0.0545 | 0.0593 | 0.0570 | 0.0452 | 0.0237 | 0.0476 | 0.0310 | 0.0354 | 0.0579 |
| Improve (%) | R@10 | 34.6↑ | 20.9↑ | 17.3↑ | 30.3↑ | 24.7↑ | 33.1↑ | 36.3↑ | 30.9↑ | 35.4↑ | 35.0↑ |

  • SOTA Results Across Domains: OneRec-Foundation (Ours) achieves new state-of-the-art results across all 10 Amazon datasets.
  • Significant Improvement: The model secures an average improvement of 26.8% in Recall@10 over the second-best baseline on each domain. This empirically confirms that large-scale generative pre-training endows the model with robust transfer capabilities far exceeding traditional collaborative filtering approaches.

6.1.3.1. Adaptive Strategies for Pre-trained Model Utilization

To address the distributional shift of item identifiers in transfer learning (where the pre-trained tokenizer might not granularly distinguish items in specific target domains), three strategies were explored.

The following are the Performance Comparison of Adaptive Strategies for Pre-trained Model Utilization from Table 8 of the original paper:

| Strategy | Metric | Baby | Beauty | Cell | Grocery | Health | Home | Pet | Sports | Tools | Toys |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Extended Residual Quantization | R@5 | 0.0288 | 0.0534 | 0.0574 | 0.0562 | 0.0479 | 0.0227 | 0.0518 | 0.0315 | 0.0350 | 0.0511 |
| | R@10 | 0.0407 | 0.0799 | 0.0830 | 0.0861 | 0.0673 | 0.0313 | 0.0758 | 0.0447 | 0.0495 | 0.0701 |
| | N@5 | 0.0201 | 0.0364 | 0.0389 | 0.0383 | 0.0333 | 0.0162 | 0.0356 | 0.0215 | 0.0243 | 0.0360 |
| | N@10 | 0.0239 | 0.0449 | 0.0471 | 0.0480 | 0.0396 | 0.0190 | 0.0433 | 0.0258 | 0.0289 | 0.0421 |
| Text-Only Adaptation | R@5 | 0.0317 | 0.0630 | 0.0688 | 0.0687 | 0.0529 | 0.0285 | 0.0548 | 0.0368 | 0.0414 | 0.0668 |
| | R@10 | 0.0448 | 0.0883 | 0.0985 | 0.1048 | 0.0752 | 0.0398 | 0.0850 | 0.0548 | 0.0615 | 0.0931 |
| | N@5 | 0.0227 | 0.0445 | 0.0473 | 0.0460 | 0.0368 | 0.0199 | 0.0382 | 0.0256 | 0.0288 | 0.0483 |
| | N@10 | 0.0269 | 0.0526 | 0.0569 | 0.0576 | 0.0440 | 0.0235 | 0.0478 | 0.0314 | 0.0354 | 0.0568 |
| Text-Augmented Itemic Tokens | R@5 | 0.0352 | 0.0646 | 0.0717 | 0.0688 | 0.0534 | 0.0285 | 0.0563 | 0.0368 | 0.0414 | 0.0693 |
| | R@10 | 0.0513 | 0.0924 | 0.1036 | 0.1029 | 0.0768 | 0.0398 | 0.0834 | 0.0547 | 0.0593 | 0.0953 |
| | N@5 | 0.0238 | 0.0456 | 0.0490 | 0.0460 | 0.0376 | 0.0202 | 0.0389 | 0.0256 | 0.0295 | 0.0496 |
| | N@10 | 0.0289 | 0.0545 | 0.0593 | 0.0576 | 0.0452 | 0.0237 | 0.0478 | 0.0314 | 0.0354 | 0.0579 |

  • Strategy 1: Extended Residual Quantization: Extends the hierarchical depth of itemic tokens using Finite Scalar Quantization (FSQ). Reduces collision rate to 3.05%. Achieves a 10.0% improvement in average R@10 over LC-Rec. However, the non-pre-trained fourth layer disrupts original hierarchical semantics.
  • Strategy 2: Text-Only Adaptation: Bypasses itemic tokens entirely, representing items via 5 keywords from metadata. Collision rate is 4.27%. Achieves an 18.8% improvement in average R@10 over Extended Residual Quantization, as the model's linguistic core remains intact and natural language representations are more expressive in narrow domains. This sacrifices collaborative filtering signals.
  • Strategy 3: Text-Augmented Itemic Tokens: Concatenates the original three-layer pre-trained itemic tokens with keyword representations ([itemic_tokens] + [keywords]). This preserves the original itemic token structure, maintaining hierarchical semantics, while the keywords provide semantic disambiguation (collision rate 0.47%) and enable full utilization of the model's linguistic capabilities. This strategy achieves state-of-the-art performance across nearly all datasets, validating that effective transfer learning requires maximizing utilization of the foundation model's diverse capabilities while preserving pre-trained structural integrity.
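
For illustration, the item representations used by Strategies 2 and 3 can be sketched as simple string constructions. The itemic-token format shown (e.g., `<a_12><b_7><c_3>`) and the helper names are hypothetical, not the actual tokenizer output.

```python
def text_only_repr(keywords: list[str]) -> str:
    """Strategy 2: represent an item purely by (up to) 5 metadata keywords."""
    return " ".join(keywords[:5])

def text_augmented_repr(itemic_tokens: list[str], keywords: list[str]) -> str:
    """Strategy 3: keep the three pre-trained itemic tokens and append keywords
    for semantic disambiguation."""
    assert len(itemic_tokens) == 3, "the pre-trained tokenizer emits a 3-layer code"
    return "".join(itemic_tokens) + " " + " ".join(keywords[:5])

# Example with hypothetical tokens and keywords:
print(text_augmented_repr(["<a_12>", "<b_7>", "<c_3>"], ["wireless", "earbuds", "bluetooth"]))
```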

6.1.3.2. Domain-Specific Training vs. Multi-Domain Joint Training

The paper investigates the impact of training strategies on transfer performance.

The following are the Impact of Training Strategies (Domain-Specific vs Multi-Domain Joint) and Few-Shot Learning on Transfer Performance from Figure 8 of the original paper:

Figure 8 | Impact of Training Strategies (Domain-Specific vs Multi-Domain Joint) and Few-Shot Learning on Transfer Performance. We compare OneRec-Foundation (Ours) against TIGER across four Amazon domains under three settings: (1) Few-shot learning with 10% training data, (2) Full-data training with domain-specific strategy, and (3) Full-data training with joint multi-domain strategy. The green dashed line represents the performance gain (Recall@10 difference) of Ours over TIGER.

  • Divergent Performance: TIGER (a traditional generative recommender) shows a consistent performance decline (average 10.6% drop in Recall@10) under joint training across multiple domains. In contrast, OneRec-Foundation demonstrates an average 2.3% improvement.
  • Foundation Model Advantage: This divergence highlights that OneRec-Foundation, with its pre-trained recommendation knowledge and semantic understanding, extracts generalizable patterns rather than memorizing domain-specific statistics. Multi-domain joint training further enriches the model by exposing it to diverse interaction patterns, enabling effective cross-domain knowledge transfer. The massive parameter capacity allows encoding domain-specific nuances while maintaining shared high-level patterns.

6.1.3.3. Few-Shot Learning: Amplified Transfer Advantage

  • Significant Gap Widening: The transfer learning advantage of foundation models is amplified under data scarcity. While OneRec-Foundation surpasses TIGER by an average of 77.7% in Recall@10 with full training data, this gap dramatically widens to 219.7% in the 10% few-shot regime.
  • Resilience under Data Constraints: OneRec-Foundation preserves 45.2% of its full-data performance with only 10% of the data, whereas TIGER retains only 23.0%. This striking resilience validates that large-scale pre-training confers robust, transferable representations that enable effective domain adaptation even under severe data constraints.

6.2. Ablation Study

6.2.1. Ablation Study on Pre-training Strategies

An ablation study quantifies the contribution of the itemic-text alignment (Stage 1) in the pre-training pipeline. The full model is compared against a variant w/o Align that bypasses this initial phase. Both variants undergo Multi-task SFT before evaluation.

The following are the Ablation study on pre-training strategies from Table 9 of the original paper:

| Task | Metric | 0.6B (Ours) | 0.6B (w/o Align) | 1.7B (Ours) | 1.7B (w/o Align) | 8B (Ours) | 8B (w/o Align) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Short Video Rec | Pass@32 | 0.1401 | 0.1397 | 0.1636 | 0.1605 | 0.2034 | 0.1933 |
| | Recall@32 | 0.0210 | 0.0210 | 0.0254 | 0.0251 | 0.0334 | 0.0310 |
| Ad Rec | Pass@32 | 0.1740 | 0.1680 | 0.1961 | 0.1922 | 0.2350 | 0.2401 |
| | Recall@32 | 0.0586 | 0.0569 | 0.0673 | 0.0669 | 0.0821 | 0.0841 |
| Product Rec | Pass@32 | 0.1139 | 0.1064 | 0.1512 | 0.1395 | 0.1893 | 0.1911 |
| | Recall@32 | 0.0257 | 0.0243 | 0.0343 | 0.0312 | 0.0447 | 0.0442 |
| Label-Cond. Rec | Pass@32 | 0.0350 | 0.0343 | 0.0426 | 0.0401 | 0.0537 | 0.0537 |
| | Recall@32 | 0.0146 | 0.0145 | 0.0181 | 0.0171 | 0.0227 | 0.0230 |
| Interactive Rec | Pass@32 | 0.2460 | 0.2360 | 0.3110 | 0.3050 | 0.4650 | 0.4490 |
| | Recall@32 | 0.1402 | 0.1357 | 0.1908 | 0.1770 | 0.3039 | 0.2910 |
| Label Pred. | AUC | 0.6488 | 0.5807 | 0.6392 | 0.5796 | 0.6879 | 0.6285 |
| Item Understanding | LLM-Judge Score | 0.3174 | 0.3112 | 0.3170 | 0.3181 | 0.3225 | 0.3103 |
| Rec. Explanation | LLM-Judge Score | 2.9960 | 2.8635 | 3.0922 | 3.3160 | 3.9420 | 3.9329 |

  • Semantic Bridge: Stage 1 (itemic-text alignment) acts as a fundamental semantic bridge for cold-started itemic token embeddings. Aligning these initialized parameters with the pre-trained latent space before full-parameter fine-tuning establishes a robust semantic foundation.
  • Model Size Dependence: The marginal gains from Stage 1 scale inversely with model size, meaning it's particularly necessary for smaller models (0.6B, 1.7B). Larger backbones possess greater inherent generalization capabilities, reducing the relative impact of this initial alignment stage.
  • Domain-Specific Precision: This stage remains essential for domain-specific precision, especially in Label Prediction (significant AUC drop without alignment) and Interactive Recommendation. This underscores that explicit alignment is a prerequisite for optimizing recommendation performance across all model scales.

6.2.2. Evolution of Model Capabilities Across Post-training Stages

The performance of the model is analyzed after each key post-training phase (Multi-task SFT, On-policy Distillation, Reinforcement Learning).

The following are the Performance comparison on general capability across post-training stages (Thinking) from Table 10 of the original paper:

| Model | math_500 | gsm8k | AIME'24 | mmlu_pro | gpqa_diamond | IFEVAL | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B (Base) | 0.952 | 0.9568 | 0.7917 | 0.7235 | 0.5606 | 0.8577 | 0.5484 |
| Stage 1: Multi-task SFT | 0.936 | 0.9083 | 0.5104 | 0.5307 | 0.4949 | 0.6174 | 0.4516 |
| Stage 2: On-Policy Distillation | 0.948 | 0.9538 | 0.7125 | 0.5454 | 0.5 | 0.7653 | 0.4659 |
| Stage 3: Reinforcement Learning | 0.938 | 0.9575 | 0.7188 | 0.5204 | 0.5051 | 0.7634 | 0.4667 |

The following are the Performance comparison on general capability across post-training stages (Non-Thinking) from Table 11 of the original paper:

| Model | math_500 | gsm8k | AIME'24 | mmlu_pro | gpqa_diamond | IFEVAL | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B (Base) | 0.838 | 0.9303 | 0.2729 | 0.6632 | 0.399 | 0.8392 | 0.276 |
| Stage 1: Multi-task SFT | 0.876 | 0.906 | 0.0688 | 0.4909 | 0.3384 | 0.5638 | 0.1756 |
| Stage 2: On-Policy Distillation | 0.848 | 0.9234 | 0.2521 | 0.583 | 0.4091 | 0.7689 | 0.2545 |
| Stage 3: Reinforcement Learning | 0.798 | 0.9196 | 0.2271 | 0.4521 | 0.3939 | 0.7098 | 0.2401 |

The following are the Recommendation benchmark performance across post-training stages from Table 12 of the original paper:

| Model | Video Rec | Ad Rec | Product Rec | Label Cond. | Interactive | Label Pred. | Item Understanding | Reco Reason |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stage 1: Multi-task SFT | 0.0324 | 0.0925 | 0.0532 | 0.0229 | 0.3461 | 0.6979 | 0.3274 | 3.8795 |
| Stage 2: On-Policy Distillation | 0.0304 | 0.0596 | 0.0330 | 0.0200 | 0.2419 | 0.6944 | 0.3319 | 3.9479 |
| Stage 3: Reinforcement Learning | 0.0370 | 0.0967 | 0.0536 | 0.0236 | 0.3458 | 0.6908 | 0.3209 | 4.0381 |

  • Impact of On-policy Distillation on General Capabilities:

    • Restoration of General Capabilities: Comparing Stage 1 (Multi-task SFT) with Stage 2 (On-policy Distillation) (Tables 10 and 11) shows that on-policy distillation significantly restores general capabilities, realigning the model with the Qwen3 baseline on most general benchmarks, particularly in Math & Text Reasoning and Alignment Tasks. For example, AIME'24 (Thinking) jumps from 0.5104 (SFT) to 0.7125 (Distillation) close to Qwen3-8B's 0.7917.
    • Persistent Gap: Despite restoration, a performance gap relative to the original Qwen3 base model persists across several metrics (MMLU-Pro, GPQA-Diamond, LiveCodeBench). This is attributed to current data composition and quality during the distillation phase, suggesting potential for further improvement with refined data strategies.
    • Mitigation of Instruction Drift: Multi-task SFT occasionally led to instruction drift where the model disregarded /no_think tags and generated Chain-of-Thought (CoT) reasoning in "Non-Thinking" mode, leading to inflated scores. On-policy Distillation effectively mitigates this, restoring the model's ability to faithfully switch between distinct reasoning modes.
  • Advancements through RL for Recommendation:

    • Targeted Improvements: Stage 3 (Reinforcement Learning) demonstrates targeted improvements on core recommendation tasks (Table 12). The RL-trained model achieves consistent gains on Video Rec, Ad Rec, Product Rec, Label Cond., and Interactive Rec compared to Stage 2.
    • Optimization for Ranking Accuracy: These improvements stem from the rule-based Hit reward in Rec-RL which directly optimizes for ranking accuracy, encouraging the model to assign higher probability mass to target itemic tokens.
    • Transfer to Explanation Generation: Interestingly, the Reco Reason task (Layer 3) also benefits from RL training, showing an increase in LLM-Judge Score (from 3.9479 to 4.0381). This suggests that the refined "recommendation intuition" acquired through RL transfers to explanation generation, producing more coherent and relevant reasoning.
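
To make the reward design concrete, a rule-based Hit reward of this kind can be sketched in a few lines. The function below is an illustrative stand-in for the Rec-RL reward described above, not the released implementation.

```python
def hit_reward(generated_items: list[str], target_item: str) -> float:
    """Rule-based Hit reward: 1.0 if the ground-truth item appears among the
    generated candidates, 0.0 otherwise. Optimizing this signal encourages the
    policy to place more probability mass on the target itemic tokens."""
    return 1.0 if target_item in generated_items else 0.0
```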

6.3. Data Presentation (Figure 1 from original paper)

The following are the holistic performance overview of OpenOneRec-Foundation from Figure 1 of the original paper:

Figure 1 | Holistic Performance Overview. Left: Evaluation on RecIF-Bench and general LLM benchmarks; "Best Baseline" denotes the highest performance achieved by existing methods for each specific task. Right: Amazon Benchmark results. Our model demonstrates exceptional cross-domain transferability, consistently surpassing leading baselines across 10 diverse datasets.

The left chart of Figure 1 visually confirms the SOTA performance of OneRec-Pro-8B across RecIF-Bench tasks, often significantly outperforming the "Best Baseline." The general LLM benchmarks section on the left also illustrates the trade-off, where OneRec-Pro-8B retains strong performance in areas like MATH-500 but shows some decrease compared to the base Qwen3-8B in broader General Tasks. The right chart (Amazon Benchmark) clearly shows OneRec-Pro-8B consistently outperforming "Best Baseline" in Recall@10 across all 10 Amazon datasets, visually reinforcing its cross-domain transferability.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors successfully presented OpenOneRec, a comprehensive framework aimed at bridging the critical gap between traditional recommendation systems and Large Language Models (LLMs). Their contributions are multi-faceted:

  1. RecIF-Bench: They introduced RecIF-Bench, the first holistic benchmark specifically designed to evaluate recommendation instruction-following capabilities. This benchmark covers 8 diverse tasks, ranging from fundamental prediction to complex reasoning, and incorporates crucial dimensions like multi-modal, cross-domain, multi-behavioral interactions, interleaved data, and recommendation explanation.

  2. Open-Sourced Framework & Scaling Laws: To ensure reproducibility and foster scalable research, they open-sourced a full-stack training pipeline, encompassing data processing, co-pretraining strategies (including Itemic-Text Alignment), and hybrid post-training protocols (Multi-task SFT, On-policy Distillation, and Rec-RL). Through empirical analysis, they validated scaling laws in the recommendation domain, demonstrating a data-intensive scaling regime where increasing training data is more critical than merely scaling model parameters.

  3. OneRec-Foundation Models: Leveraging their framework, they released the OneRec-Foundation model family (1.7B and 8B parameters). Extensive experiments demonstrated that these models achieve new state-of-the-art (SOTA) performance across all RecIF-Bench tasks. Furthermore, they exhibited exceptional cross-domain transferability to the Amazon benchmark, surpassing strong baselines with an average 26.8% improvement in Recall@10.

    This work signifies a crucial step towards building truly intelligent recommender systems that can understand, reason, and adapt to diverse user instructions, much like general LLMs.

7.2. Limitations & Future Work

The authors acknowledge several limitations and outline important future research directions:

  • Tokenizer Transferability: While OpenOneRec enhances downstream performance, the magnitude of gains is currently limited by tokenizer transferability. The pre-trained tokenizer, optimized on broad open-domain data, may not optimally distinguish items in a specific vertical, leading to collision rates.
    • Future Direction: Maximizing the reuse of foundation model priors while ensuring high-quality item indexing (code quality) for downstream tasks is a promising avenue. This could involve more adaptive or domain-specific itemic tokenization strategies during transfer.
  • Balancing General Intelligence and Domain Precision: Maintaining the model's general intelligence and reasoning capabilities necessitates mixing vast amounts of general-domain text during training. This process currently presents a trade-off.
    • Future Direction: Investigating optimal data mixing ratios and improving data utilization efficiency are urgent challenges to better balance domain-specific precision with general capabilities, possibly through more advanced curriculum learning or data weighting techniques.
  • Chain-of-Thought Reasoning Consistency: Chain-of-Thought (CoT) reasoning currently yields improvements only in limited settings.
    • Future Direction: A more rigorous exploration of test-time scaling strategies is needed to unlock consistent reasoning gains across diverse recommendation scenarios. This could involve exploring novel prompting techniques, self-consistency methods, or integrating external knowledge graphs more effectively.

7.3. Personal Insights & Critique

The OpenOneRec paper makes a highly significant contribution by systematically addressing the integration of LLMs into recommendation systems. The authors' commitment to open-sourcing their benchmark and training pipeline is commendable and truly accelerates research in this emerging field.

Personal Insights:

  • Paradigm Shift Validation: This paper strongly validates the paradigm shift towards generative recommendation and the potential of foundation models to bring general intelligence to this domain. The improvements on RecIF-Bench and the Amazon benchmark are substantial, particularly the few-shot learning results, which highlight the power of transferable representations under data scarcity. This suggests a future where recommender systems are not just prediction engines but intelligent conversational agents.
  • Bridging Modalities: The itemic tokenization strategy is a crucial innovation. Effectively bridging the discrete item space with the LLM's linguistic token space is fundamental to this integration. The detailed exploration of adaptive strategies for tokenizer transferability (e.g., Text-Augmented Itemic Tokens) is particularly insightful, providing concrete guidance for future work.
  • Balanced Training Regimen: The meticulously designed hybrid post-training pipeline (SFT, On-policy Distillation, Rec-RL) is a key strength. It demonstrates a sophisticated understanding of how to mitigate catastrophic forgetting and simultaneously optimize for diverse objectives (general intelligence vs. domain-specific metrics). This multi-stage approach is likely to become a standard for adapting LLMs to specialized domains.
  • Scaling Laws in a New Domain: The empirical validation of scaling laws specifically for recommendation data is a valuable theoretical contribution. The finding of a data-intensive scaling regime (where data scaling is more impactful than parameter scaling) provides critical guidance for resource allocation in developing future generative recommenders. This challenges the direct applicability of general LLM scaling laws and underscores the unique characteristics of recommendation data.

Critique & Areas for Improvement:

  • General Intelligence Trade-off: While the paper successfully mitigates catastrophic forgetting, the persistent performance gap on some general intelligence benchmarks (e.g., MMLU-Pro, LiveCodeBench) compared to the base Qwen3 suggests that the current data mixing and distillation strategies could be further refined. Future work could explore more dynamic or adaptive data sampling techniques during pre-training and post-training to better balance the two types of knowledge.
  • Complexity of LLM-as-Judge: While LLM-as-Judge is a good approach for text generation, its multi-step process (WIP extraction, semantic matching, weighted F1) is complex and potentially sensitive to the choice of the judge LLM itself. More transparency on its robustness and potential biases would be beneficial.
  • Computational Cost: The scale of training (hundred-billion-token corpus, 8B parameter models, multi-stage training) is substantial. While scaling laws are explored, practical deployment considerations (inference cost, latency) for such large generative recommenders in real-world industrial settings remain a significant challenge. The paper could delve more into the efficiency of these models at inference time.
  • Interpretability of Itemic Tokens: While itemic tokens are explained as maintaining hierarchical semantics, their direct interpretability for humans or for debugging purposes might be limited compared to natural language descriptions. Exploring ways to make itemic tokens more human-understandable could enhance trust and utility.

Transferability to Other Domains: The methodology presented in OpenOneRec has high transferability to other specialized domains where LLMs are being adapted. The core principles—bridging modality gaps (e.g., sensor data, medical images, financial time series to LLM-understandable tokens), scalable co-pretraining with general-domain knowledge, and hybrid post-training with on-policy distillation for general skills and RL for domain-specific metrics—could be applied to:

  • Medical AI: Integrating LLMs with structured medical data (e.g., patient records, lab results, imaging reports) for diagnosis, treatment planning, or drug discovery.

  • Financial Services: Combining LLMs with market data, financial reports, and expert knowledge for quantitative trading, risk management, or personalized financial advice.

  • Robotics/Control: Using LLMs for high-level planning and instruction in robotic systems, where sensor data and control signals need to be aligned with natural language commands.

    By open-sourcing their framework, OpenOneRec provides a solid foundation, inviting the broader research community to tackle these remaining challenges and further accelerate the evolution towards truly intelligent AI systems across various applications.
