OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation
TL;DR Summary
OxygenREC is an e-commerce recommendation system that uses a Fast-Slow Thinking architecture for deep reasoning, addressing inconsistent multi-stage optimization objectives and independent per-scenario training, while enhancing recommendation quality with a semantic alignment mechanism.
Abstract
Traditional recommendation systems suffer from inconsistency in multi-stage optimization objectives. Generative Recommendation (GR) mitigates this through an end-to-end framework; however, existing methods still rely on matching mechanisms based on inductive patterns. Although responsive, they lack the ability to uncover complex user intents that require deductive reasoning based on world knowledge. Meanwhile, LLMs show strong deep reasoning capabilities, but their latency and computational costs remain challenging for industrial applications. More critically, there are performance bottlenecks in multi-scenario scalability: as shown in Figure 1, existing solutions require independent training and deployment for each scenario, leading to low resource utilization and high maintenance costs, a challenge unaddressed in the GR literature. To address these issues, we present OxygenREC, an industrial recommendation system that leverages Fast-Slow Thinking to deliver deep reasoning under the strict latency and multi-scenario requirements of real-world environments. First, we adopt a Fast-Slow Thinking architecture: slow thinking uses a near-line LLM pipeline to synthesize Contextual Reasoning Instructions, while fast thinking employs a high-efficiency encoder-decoder backbone for real-time generation. Second, to ensure reasoning instructions effectively enhance recommendation generation, we introduce a semantic alignment mechanism with Instruction-Guided Retrieval (IGR) to filter intent-relevant historical behaviors and a Query-to-Item (Q2I) loss for instruction-item consistency. Finally, to resolve multi-scenario scalability, we transform scenario information into controllable instructions, using unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align policies with diverse business objectives, realizing a train-once-deploy-everywhere paradigm.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation". The central topic is the development of a generative recommendation system for e-commerce that uses instruction-following capabilities, deep reasoning, and multi-scenario scalability, addressing challenges like latency and resource utilization.
1.2. Authors
The paper lists a large group of authors, all affiliated with JD.com, Beijing, China. This suggests the research is an industrial effort focused on practical applications within a large e-commerce platform. The contact emails provided are {haoxuegang.1, zhangming229, gongpinghua1}@jd.com.
1.3. Journal/Conference
The paper is published as a preprint on arXiv with the identifier arXiv:2512.22386. As a preprint, it has not yet undergone formal peer review by a journal or conference. However, arXiv is a highly influential platform for rapid dissemination of research, especially in fields like artificial intelligence, and many significant papers first appear there. The content suggests it is targeting top-tier AI/ML or Recommender Systems conferences/journals.
1.4. Publication Year
The paper was published on 2025-12-26 (UTC), indicating a very recent release, almost certainly a preprint targeting conferences in early 2026.
1.5. Abstract
The paper addresses two key limitations of existing generative recommendation (GR) systems:
- Limited Reasoning Capabilities: Existing GR methods primarily rely on inductive patterns, struggling with complex user intents that require deductive reasoning and world knowledge. While large language models (LLMs) offer strong reasoning, their latency and computational costs are prohibitive for industrial use.
- Multi-Scenario Scalability Bottlenecks: Current solutions require independent training and deployment for each recommendation scenario (e.g., homepage, search, cart), leading to low resource utilization and high maintenance costs.

To overcome these challenges, the authors propose OxygenREC, an industrial recommendation system built on a Fast-Slow Thinking architecture.

- Fast-Slow Thinking: Slow thinking involves a near-line LLM pipeline that synthesizes Contextual Reasoning Instructions (deductive reasoning based on world knowledge and complex user intents). Fast thinking employs a high-efficiency encoder-decoder backbone for real-time item sequence generation, conditioned on these instructions, without incurring online LLM latency.
- Semantic Alignment: To ensure the reasoning instructions effectively guide generation, OxygenREC uses Instruction-Guided Retrieval (IGR) to filter intent-relevant historical user behaviors and a Query-to-Item (Q2I) loss for consistency between instructions and target items.
- Multi-Scenario Scalability: The framework transforms scenario information into controllable instructions and employs unified reward mapping alongside Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align a single policy with diverse business objectives, enabling a "train-once-deploy-everywhere" paradigm.

The system is deployed at JD.com, showing significant increases in GMV (Gross Merchandise Value) and order volume across multiple core scenarios, demonstrating its flexibility and scalability.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2512.22386.
The PDF link is https://arxiv.org/pdf/2512.22386v1.pdf.
This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
Core Problem: Traditional recommendation systems, especially in large e-commerce platforms, face two significant challenges:
- Limited Deductive Reasoning in Generative Recommendation (GR): While Generative Recommendation (GR) offers an end-to-end framework, existing methods primarily rely on inductive pattern matching from user behavior. They struggle to uncover complex user intents that require deductive reasoning based on world knowledge (common sense, external facts). For example, inferring the need for "moisture-wicking baby sleepwear" for "young parents in Chengdu during winter solstice" goes beyond observed patterns. Large Language Models (LLMs) possess strong reasoning capabilities but are too slow and computationally expensive for real-time industrial deployment.
- Inefficient Multi-Scenario Scalability: E-commerce platforms operate across diverse scenarios (e.g., homepage, product detail page, shopping cart, search results), each with unique user behaviors, business objectives, and latency requirements. Training and deploying independent recommendation models for each scenario is costly in terms of resources (computation, maintenance, development) and leads to low utilization. The GR literature has largely left this multi-scenario scalability challenge unaddressed.
Why is this problem important?
- Enhanced User Experience: Understanding deep user intent through deductive reasoning leads to more relevant and surprising recommendations, moving beyond simple "what you've seen is what you get" patterns. This can significantly improve user satisfaction and engagement.
- Business Impact: More accurate and context-aware recommendations directly translate to higher Click-Through Rates (CTR), Conversion Rates (CVR), Order Volume, and Gross Merchandise Value (GMV), which are critical for e-commerce platforms.
- Operational Efficiency: A unified, scalable system reduces the massive operational overhead associated with managing a multitude of scenario-specific models, freeing up resources for innovation and broader deployment.
Specific challenges or gaps in prior research:
- GR systems are good at end-to-end optimization but remain primarily inductive. LLMs offer powerful reasoning but are too slow and expensive for online inference. The trade-off between deductive knowledge injection and online latency is a major hurdle.
- Multi-scenario recommendation research has mainly focused on discriminative models, often relying on complex, scenario-specific architectures that are not easily adaptable to generative paradigms or that lead to negative transfer issues (where learning for one task harms performance on another). The GR literature lacks effective solutions for train-once-deploy-everywhere deployment across diverse scenarios.
Paper's entry point or innovative idea:
The paper proposes OxygenREC as an industrial generative recommendation system that leverages Fast-Slow Thinking and instruction-following to address these challenges. It aims to inject deep reasoning capabilities without compromising real-time latency and to enable scalable multi-scenario services within a single, unified backbone network.
2.2. Main Contributions / Findings
The paper makes four primary contributions:
- Fast-Slow Thinking Architecture with Deductive Knowledge Injection:
  - Contribution: Introduces a novel Fast-Slow Thinking architecture to inject world knowledge and deductive reasoning into recommendations without adding online latency.
  - Mechanism: A near-line LLM pipeline (slow thinking) synthesizes high-precision Contextual Reasoning Instructions by performing deep intent reasoning. A high-throughput encoder-decoder backbone (fast thinking) then uses these instructions for real-time item sequence generation.
  - Problem Solved: Overcomes the trade-off between LLM reasoning power and strict online latency requirements.
- Semantic Alignment for Effective Instruction Control:
  - Contribution: Develops a semantic alignment mechanism to ensure that the generated reasoning instructions effectively guide the recommendation process.
  - Mechanism: Uses a Query-to-Item (Q2I) loss to map instructions into the item embedding space. This enables Instruction-Guided Retrieval (IGR) to filter out irrelevant historical user behaviors, focusing the model on intent-relevant data.
  - Problem Solved: Ensures that the model's output is tightly controlled by the user's inferred intent, improving precision and reducing noise from past irrelevant interactions.
- Scalable Multi-Scenario Alignment via Instruction and Reinforcement Learning (RL):
  - Contribution: Achieves scalable multi-scenario adaptation within a single generative model backbone.
  - Mechanism: Converts scenario-specific contexts into controllable instructions (scenario instructions), and employs a unified Reward Mapping Service and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align a single policy with diverse business objectives across scenarios.
  - Problem Solved: Realizes a "train-once-deploy-everywhere" paradigm, significantly reducing operational and computational costs by eliminating the need for separate models per scenario.
- Large-Scale Production Deployment:
  - Contribution: Successfully deployed OxygenREC in JD.com's core recommendation scenarios.
  - Mechanism: Built a unified PyTorch-based training framework achieving 40% Model FLOPs Utilization (MFU) and utilized xLLM for high-performance inference serving.
  - Problem Solved: Demonstrates significant gains in order volume and GMV in online A/B tests across multiple core scenarios, proving its practical value, efficiency, and scalability in demanding industrial environments.
Key Conclusions / Findings:
- The Fast-Slow Thinking architecture effectively injects deductive knowledge without online LLM latency.
- Semantic IDs and multimodal fusion are crucial for high-quality item representations.
- The instruction-following mechanism with IGR and the Q2I loss significantly enhances recommendation controllability and accuracy.
- The unified generative model with scenario instructions and SA-GCPO outperforms independent scenario-specific models, enabling efficient multi-scenario adaptation.
- Online A/B tests confirm OxygenREC's ability to drive substantial business growth (GMV and order volume) across diverse scenarios, from Homepage Floor to Checkout Path, demonstrating its robustness and practical impact.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand OxygenREC, a reader should be familiar with the following concepts:
- Recommendation Systems (RS): Software systems that suggest items (products, movies, articles, etc.) to users.
  - Traditional Cascading Methods: Often involve multiple sequential stages (e.g., matching, ranking, re-ranking). Each stage optimizes a local objective, leading to potential objective misalignment and error propagation.
  - Generative Recommendation (GR): A newer paradigm that reformulates recommendation as an end-to-end sequence generation task. Instead of scoring and selecting from a fixed pool, GR models directly generate sequences of item identifiers or their semantic IDs. This allows for global optimization and potentially more novel recommendations.
- Large Language Models (LLMs): Deep learning models with billions of parameters, pre-trained on vast amounts of text data. They excel at understanding, generating, and reasoning with human language.
  - Deep Reasoning Capabilities: LLMs can perform deductive reasoning (applying general rules to specific cases) and inductive reasoning (inferring general rules from specific observations). In recommendations, deductive reasoning can leverage world knowledge to infer complex user intents beyond observed patterns.
  - Latency and Computational Costs: A major challenge with LLMs is their high computational demand, especially during inference (generating output), which results in high latency and cost, making them difficult to deploy in real-time industrial applications.
- Fast-Slow Thinking (Dual-Process Theory): Inspired by cognitive psychology (e.g., Daniel Kahneman's "Thinking, Fast and Slow"), this concept divides cognitive processes into two systems:
  - System 1 (Fast Thinking): Intuitive, automatic, unconscious, and fast. In OxygenREC, this is the high-throughput encoder-decoder backbone for real-time generation.
  - System 2 (Slow Thinking): Deliberative, analytical, conscious, and slow. In OxygenREC, this is the near-line LLM pipeline that synthesizes complex Contextual Reasoning Instructions. The key is to leverage System 2 offline to inform System 1 online, avoiding latency issues.
- Encoder-Decoder Architecture: A common neural network architecture, especially for sequence-to-sequence tasks (like machine translation or text generation).
  - Encoder: Processes the input sequence (e.g., user history, profile, context) and compresses it into a fixed-size context vector or a sequence of hidden states.
  - Decoder: Takes the context vector from the encoder and generates an output sequence one element at a time (auto-regressively). It can also be conditioned on additional inputs.
- Transformer: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that revolutionized sequence modeling. It relies entirely on attention mechanisms rather than recurrent or convolutional layers.
  - Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence when processing each element.
  - Cross-Attention: Used in encoder-decoder Transformers, where the decoder attends to the encoder's output.
- Semantic IDs (SIDs): Instead of using raw item identifiers (which are categorical and lack semantic meaning), SIDs are discrete, semantically rich representations of items. They are learned by mapping items (text, image, features) into a continuous latent space and then quantizing this space into discrete codes. This allows generative models to "generate" items as sequences of SIDs (like words in a sentence).
  - Residual Quantization (RQ-KMeans): A technique to discretize continuous embeddings into hierarchical codes. It quantizes the residual (the error between the original embedding and its quantized version) iteratively, building a coarse-to-fine representation. This balances expressiveness with a compact vocabulary size (a minimal sketch appears at the end of this list).
- Contrastive Learning: A self-supervised learning paradigm where the model learns representations by pushing "similar" (positive) samples closer together in the embedding space and "dissimilar" (negative) samples farther apart. In OxygenREC, it's used for Item-to-Item (I2I) alignment and multi-source alignment for SIDs.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize a cumulative reward.
  - Policy: The agent's strategy for choosing actions given a state.
  - Reward: A numerical signal indicating the desirability of an action.
  - Policy Optimization: Algorithms (e.g., Proximal Policy Optimization (PPO), Clipped Policy Optimization (CPO)) used to update the agent's policy based on observed rewards.
  - Importance Sampling: A technique used in RL to estimate the expectation of a function under one distribution, given samples from another distribution. This is crucial for off-policy RL, where the data is collected using an old policy ($\pi_{\theta_{\mathrm{old}}}$) but used to update a new policy ($\pi_\theta$).
- Mixture-of-Experts (MoE): A neural network architecture where different "expert" sub-networks specialize in different parts of the input space. A gating network learns to route each input to one or a few relevant experts. This allows models to scale to trillions of parameters while only activating a small subset for each input, improving efficiency.
- KV-Cache (Key-Value Cache): In Transformer decoders, the key and value vectors from previous tokens are often cached to avoid recomputing them at each auto-regressive decoding step, significantly speeding up inference, especially for long sequences.
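As a concrete illustration of the Residual Quantization entry above, here is a minimal PyTorch sketch of RQ-style encoding. The codebooks are random stand-ins for ones learned by k-means over item embeddings, the function name is ours, and the depth-3 / 8,192-codes-per-level configuration mirrors what OxygenREC reports in Section 4.2.1.2.

```python
import torch

def rq_encode(embedding, codebooks):
    """Assign hierarchical semantic IDs via residual quantization.

    At each level, pick the codeword nearest to the current residual,
    then subtract it so the next (finer) level quantizes what remains.
    """
    residual = embedding.clone()
    codes = []
    for codebook in codebooks:                       # coarse-to-fine levels
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)
        idx = int(torch.argmin(dists))               # nearest codeword
        codes.append(idx)
        residual = residual - codebook[idx]          # carry the residual forward
    return tuple(codes)

# Depth 3, vocabulary 8,192 per level, 256-d embeddings (paper's configuration).
codebooks = [torch.randn(8192, 256) for _ in range(3)]
item_embedding = torch.randn(256)
sid = rq_encode(item_embedding, codebooks)           # e.g., (4210, 771, 6033)
```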
3.2. Previous Works
The paper discusses related work in three main categories:
- Generative Recommendation (GR):
  - Early GR: Used raw textual item IDs (e.g., product names or short descriptions) as tokens.
  - Modern GR: Shifted towards Semantic IDs (SIDs) [27, 29, 53, 54], which are discrete, semantically meaningful representations. RQ-VAE [48] and residual quantization [22] create hierarchical codes for SIDs to balance expressiveness and vocabulary size.
  - Unified Search and Recommendation: Works like SynerGen [25] and IntSR [64] explored unifying search and recommendation, but they are often built on traditional ranking models rather than semantic ID-based generation.
  - LLM-integrated GR (e.g., the RecGPT series [4, 5]): These approaches integrate LLMs for enhanced reasoning, but often use offline LLM inference for reasoning and rely on simple ANN-based dual-tower retrieval for online serving, which OxygenREC argues limits the full leverage of LLM reasoning in real time. OxygenREC instead directly integrates LLM reasoning outputs (Contextual Reasoning Instructions) into a generative backbone, aiming for deeper integration.
  - Architectures: Transformer-based encoder-decoder or decoder-only architectures are common [41, 58, 62], offering throughput advantages.
  - Limitation addressed by OxygenREC: Most GR methods are fundamentally inductive, lacking deductive reasoning for complex user intents.
- Large Language Models for Recommendation (LLM4Rec):
  - Direct LLM Usage [12, 24, 39]: Some approaches directly use LLMs as the recommendation backbone, leveraging their world knowledge and reasoning abilities. For example, OneRec-think [39] uses Chain-of-Thought (CoT) reasoning for interpretability.
  - Limitation addressed by OxygenREC: The primary challenge for LLM4Rec in industrial settings is severe latency and cost constraints due to the computational overhead of auto-regressive decoding with billion-scale parameters [69]. OxygenREC addresses this with its Fast-Slow Thinking architecture, keeping the LLM part near-line.
- Multi-Scenario Recommendation (MSR) and Controllability:
  - Traditional MSR for Discriminative Ranking Models: Approaches like Multi-Gate Mixture-of-Experts (MMoE) [40], STAR [51], and PEPNet [9] introduce specialized architectural interventions (gating, star-topology units, routing mechanisms) to mitigate negative transfer (when learning for one task harms another) in discriminative ranking models.
  - Limitation addressed by OxygenREC: These traditional MSR methods introduce structural complexity, implicit parameter modulation, and often require separate model instances, increasing operational costs. Adapting them directly to GR is challenging. Most existing GR multi-scenario approaches [25] are exploratory and don't fully solve the trade-off between unified modeling and scenario-specific adaptation. OxygenREC aims for a train-once-deploy-everywhere paradigm using explicit scenario instructions and RL-based alignment.
3.3. Technological Evolution
The field of recommendation systems has evolved significantly:
- Early Systems (Rule-based, Collaborative Filtering): Simple rules or user-item similarity based on past interactions.
- Traditional Multi-Stage Cascaded Pipelines: Introduced deep learning, but typically involved fragmented optimization (e.g., embedding -> matching -> ranking -> re-ranking stages) [15, 16, 71]. This led to objective misalignment and error propagation.
- Generative Recommendation (GR): Emerged as an end-to-end paradigm, treating recommendation as sequence generation [36, 68]. This unified approach aimed for global optimization. Key advancements included Semantic IDs [27, 53, 54] and Transformer-based architectures [41, 62].
- Integration of Large Language Models (LLMs): With the rise of powerful LLMs, researchers began exploring their use in recommendation for richer understanding and reasoning [12, 24, 39].
  - Challenge: Directly using LLMs online for real-time recommendations suffered from prohibitive latency and computational costs.
  - Challenge: GR systems, while unified, often still relied on inductive pattern matching and lacked deductive reasoning capabilities based on world knowledge.
  - Challenge: Scaling GR across diverse multi-scenario industrial environments remained largely unaddressed, leading to high operational costs.

OxygenREC fits into this evolution by attempting to bridge these gaps. It represents a step forward from basic GR and LLM4Rec by specifically tackling the latency problem of LLMs for deductive reasoning via Fast-Slow Thinking, and the scalability problem of GR across multiple scenarios via instruction-following and RL-based multi-scenario alignment. It aims to bring LLM-level reasoning to industrial GR without the typical performance bottlenecks, moving towards a more intelligent and efficient recommendation paradigm.
3.4. Differentiation Analysis
OxygenREC differentiates itself from previous works in several key areas:
- Deductive Reasoning with Latency Constraints (Fast-Slow Thinking):
  - Prior Work: LLM-based recommendation systems (e.g., OneRec-think [39], RecGPT [4]) leverage LLMs for reasoning, but often struggle with the high latency and computational costs of deploying LLMs online for real-time inference. Some offline LLM reasoning approaches still rely on simpler retrieval mechanisms online.
  - OxygenREC's Innovation: It directly tackles this latency-reasoning trade-off with its Fast-Slow Thinking architecture. The "slow" part (near-line LLM pipeline) generates Contextual Reasoning Instructions offline, injecting deductive world knowledge. The "fast" part (high-throughput encoder-decoder backbone) uses these pre-computed instructions for real-time generation, ensuring deep reasoning without online LLM calls.
- Multi-Scenario Scalability ("Train-Once-Deploy-Everywhere"):
  - Prior Work: Multi-scenario recommendation (MSR) research largely focused on discriminative ranking models (e.g., STAR [51], PEPNet [9]), often requiring complex scenario-specific towers, gating mechanisms, or separate model instances, leading to high operational costs and negative transfer issues. The GR literature had limited solutions for multi-scenario scalability.
  - OxygenREC's Innovation: It unifies diverse recommendation tasks into a single generative framework using instruction-following. Scenario information is transformed into explicit controllable instructions for the backbone. Unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) enable a single policy to align with various business objectives across scenarios, achieving a true train-once-deploy-everywhere paradigm.
- Effective Instruction Control and Semantic Alignment:
  - Prior Work: While instruction-following is a core capability of LLMs, its application to GR has been "long overlooked," according to the authors. Ensuring instructions effectively guide generation, especially with complex user history, is a challenge.
  - OxygenREC's Innovation: It introduces semantic alignment mechanisms. Instruction-Guided Retrieval (IGR) filters intent-relevant historical behaviors based on the instruction, reducing noise. The Query-to-Item (Q2I) loss explicitly aligns instruction embeddings with target item embeddings, ensuring strong control over the generated output.
- Robust Industrial Deployment:
  - Prior Work: Many research papers propose models but lack the engineering optimizations for large-scale, high-traffic industrial deployment.
  - OxygenREC's Innovation: Beyond the model architecture, it includes significant system implementation and optimization details, such as a unified PyTorch-based training framework with 40% MFU, xLLM-based inference optimization (xGR, xSchedule, xAttention, xBeam, Prefix-Constrained Decoding), and a robust near-line inference deployment strategy for reasoning instructions. This focus on practical deployment and significant online A/B test lifts sets it apart as a production-ready system.
4. Methodology
4.1. Principles
The core idea behind OxygenREC is to integrate the deep reasoning capabilities of LLMs into a generative recommendation system for e-commerce while simultaneously addressing the real-time latency constraints and multi-scenario scalability requirements of industrial applications. This is achieved through two main principles:
- Fast-Slow Thinking: This principle separates the complex, computationally intensive deductive reasoning (the "slow" part) from the real-time item generation (the "fast" part). The "slow" LLM pipeline operates near-line (not during live user requests) to synthesize high-quality Contextual Reasoning Instructions. These instructions then guide a high-efficiency encoder-decoder backbone for fast, real-time recommendation generation, effectively pre-computing LLM insights to avoid online LLM latency.
- Instruction-Following Unification: The system unifies diverse recommendation tasks and scenarios into a single generative model by treating them as instruction-following tasks. Both user intent (deduced by the LLM) and scenario context are encoded as explicit instructions that steer the model's generation process. This enables a "train-once-deploy-everywhere" paradigm, reducing operational costs and leveraging synergistic knowledge transfer across scenarios.
4.2. Core Methodology In-depth (Layer by Layer)
OxygenREC unifies various recommendation tasks into a single Instruction-Following Generative paradigm, addressing both LLM reasoning integration and multi-scenario adaptation.
The overall architecture of OxygenREC is illustrated in Figure 2.
(Image: Schematic of the overall OxygenREC architecture. The left side shows the instruction-based framework and multimodal quantized representations; the right side shows the generation pipeline for Contextual Reasoning Instructions and the reward mapping mechanism for multi-scenario alignment.)
Figure 2: The Overall Architecture of OxygenREC. (a) Instruction Following Framework: A transformer-based encoder-decoder backbone that generates semantic item sequences conditioned on specific instructions. (b) Multimodal Quantized Representations: Items are tokenized as multimodal semantic IDs via residual quantization of contrastively trained embeddings, enabling compact and expressive item representations. (c) Contextual Reasoning Instructions: A near-line LLM pipeline that analyzes user behavior and context to synthesize such instructions, bridging the gap between inductive patterns and deductive reasoning. (d) Multi-Scenario Alignment: We achieve a "train-once-deploy-everywhere" workflow by coupling scenario instructions with RL-based alignment.
The system workflow consists of four stages:
- Model Input Representation: Integrates user profiles, LLM-driven reasoning intents, multimodal Semantic IDs (SIDs) for items, and contextual inputs for scenario information.
- Instruction-Following Pre-training: The model learns to follow instructions using a multi-task objective combining Next Token Prediction (NTP) and semantic alignment.
- Post-training with Multi-Scenario Alignment: Refines the model using Reinforcement Learning (RL) with a reward mapping service and a novel policy optimization strategy.
- Multi-Scenario Serving: Deploys a single model across diverse scenarios, using prefix-constrained beam search to enforce scenario-specific rules.
4.2.1. LLM-driven User and Item Inputs
4.2.1.1. Model Input Representation
OxygenREC adopts an encoder-decoder architecture. The encoder processes user-side inputs into a latent space, and the decoder generates recommendation sequences conditioned on instructions.
- Encoder Input ($X$): Integrates three key sources:
  - User Profile: Static attributes like demographics.
  - User Behavior: Split into short-term (real-time interests) and long-term sequences. For long-term history, Instruction-Guided Retrieval (IGR) is used. The instruction acts as a query to retrieve only relevant past actions, enhancing instruction controllability and efficiency for deep user understanding.
  - Contextual Reasoning Instructions ($I_r$): Explained below.
- Decoder Input: Conditioned on the encoded user representation and a composite instruction prompt. This prompt combines two signals:
  - Scenario Instructions ($I_s$): For domain control (e.g., specific rules for a homepage vs. a cart scenario).
  - Contextual Reasoning Instructions ($I_r$): For deductive intent guidance (e.g., inferring user needs based on world knowledge).
  These instructions collectively steer the auto-regressive generation of the target item sequence.
4.2.1.2. Multimodal Quantized Item Representations
As shown in Figure 2 (b), OxygenREC constructs a unified vocabulary using Multimodal Semantic IDs (SIDs).
- A multimodal item encoder is trained via a contrastive Item-to-Item (I2I) objective [45, 46] on large-scale item pairs derived from cross-scenario co-occurrence behaviors.
- Each item is represented by textual metadata and product images, processed by separate encoders.
- A lightweight fusion module employs modality-specific projections to map inputs into a shared space, followed by Q-Former [34] and MLP layers for cross-modal interactions.
- The resulting 256-dimensional embeddings are discretized using the RQ-KMeans scheme [22]. This residual quantization process assigns each item a tuple of discrete codes (semantic IDs) in a coarse-to-fine manner.
- A hierarchical structure with a depth of 3 and a vocabulary size of 8,192 per level is used, creating a compact and expressive semantic ID space for auto-regressive generation.
4.2.1.3. Contextual Reasoning Instructions
As illustrated in Figure 3, this component integrates world knowledge and deductive reasoning while avoiding online LLM latency by operating near-line.
(Image: Schematic of the Contextual Reasoning Instructions generation pipeline. It contains three branches: the first generates instructions from spatiotemporal and user-profile signals, the second generates results from user behavior sequences, and the third rewrites recent queries; the outputs are finally merged to form the Instructions and Reasons.)
Figure 3: Overview of our Contextual Reasoning Instructions pipeline. Two parallel branches generate contextual instructions and reasons from spatiotemporal profile signals and recent user behavior sequences. In parallel, $LLM_{QR}$ rewrites noisy or truncated recent queries, and all outputs are combined to form the final Instructions and Reasons for downstream generation.
The core goal is to use LLMs to build a controllable and interpretable intermediate instruction layer. This layer acts as an explicit semantic bridge between raw user behaviors and downstream models, improving intent alignment and system stability. The system transforms complex recommendation signals into a reasoning process based on multi-signal input instructions, which combine spatiotemporal context, user profile, and historical behavior into text-based Instructions and corresponding Reasons.
- Spatiotemporal and Profile Reasoning: This module infers user intent from the external environment and static traits. It covers three aspects:
  - Event-driven reasoning: Identifies holidays or seasonal features (e.g., "winter solstice") and infers related shopping intent (e.g., "frozen dumplings").
  - Profile-driven reasoning: Uses static user traits (gender, age, spending power) to draw personalized product preference conclusions.
  - Temporal-Profile Fusion reasoning: Combines local culture with real-time weather/time/localization needs (e.g., a "Bosideng men's long down jacket" for Beijing's dry, sub-zero winter). In practice, the JoyAI LLM generates these instructions, stored hierarchically by "time-location-persona" for quick retrieval.
- User Query Rewrite ($LLM_{QR}$): Addresses incomplete or noisy real user queries (e.g., "foam case", "wine gift"). $LLM_{QR}$ performs semantic completion and normalization. It is trained on identity-preserving data and rewrite-required data (synthetic error samples). Supervised fine-tuning on Qwen3-0.6B achieves a 95.33% pass rate in human evaluation.
User Intent Reasoning: Given user behavior sequences and preferences, the model generates
intent-matched instructionsandreasoning reasons. The rationale explains the recommendation logic and helps uncover deeper motivations from noisy behaviors.- Data Refining and Auto-labeling Pipeline: To address the lack of explicit rationale labels in real logs,
LLM_QRmaps queries to normalized queries.Qwen3-32Baggregates/deduplicates these for standard intent targets.Qwen3-32Balso performs alignment filtering on the full behavior sequence to keep only relevant subsequences. Finally,DeepSeek-R1automatically generates rationales as pseudo labels. After supervised fine-tuning onQwen3-0.6B, the model achieves a 72% usability rate in human evaluation for intent-aligned instructions and reasons.
- Data Refining and Auto-labeling Pipeline: To address the lack of explicit rationale labels in real logs,
4.2.2. Instruction-Following Unified Pre-training
This stage injects deductive knowledge and enables multi-scenario adaptation.
4.2.2.1. Instruction Framework Design
Recommendation is reformulated as an instruction-following generation task, $P(Y \mid X, I_s, I_r)$ (consistent with the formulations in Table 1). The backbone learns to dynamically adjust its generation distribution based on the provided scenario instructions ($I_s$) and contextual reasoning instructions ($I_r$).
4.2.2.2. Dual-Instruction Formulation
The instruction prompt consists of Scenario Instructions ($I_s$) and Contextual Reasoning Instructions ($I_r$).

- Scenario Instructions ($I_s$): Specify the scenario context for controllable generation.
  - Scenario Information: Includes a scenario ID and contextual signals (e.g., Homepage, Channel Feeds). Guides the generation style and candidate/item distribution.
  - Optional Trigger Item: Provides local context (e.g., a channel entry item, or the main item for I2I recommendations). Only available in scenarios like Channel Feeds and I2I. $I_s$ allows one backbone to serve multiple scenarios by adapting generation style and item distribution.
- Contextual Reasoning Instructions ($I_r$): A dense embedding projected from a textual instruction via an adapter.
  - Online Inference: The textual instruction is synthesized by the near-line LLM pipeline (Section 2.2.3).
  - Training: For search data, the rewritten/normalized user query serves as a natural textual source for $I_r$. For recommendation scenarios without textual instructions, a default (learnable) instruction embedding is used. This dual approach provides a high-quality data source and improves robustness if LLM instructions are missing online.
4.2.2.3. Generative Backbone with IGR
OxygenREC uses an encoder-decoder architecture similar to OneRec [68], but the decoder is augmented with the instruction prompt ($I_s$, $I_r$) as conditional input for auto-regressive generation guided by latent intent and scenario context.
To enhance controllable generation, Instruction-Guided Retrieval (IGR) filters long-term user history, selecting the interactions most relevant to the instruction prompt. This aligns the input context with the control signals, leading to more precise outputs. As illustrated in Figure 2 (a), IGR consists of three components:
- Adapter: Projects instructions and items into a shared embedding space.
- Q2I Alignment: Uses the ground-truth target to supervise instruction-item similarity during training.
- IGR: Performs top-K retrieval at inference time using the aligned query embedding.
Adapter Mechanism for Feature Mapping:
The adapter layers [26] map the instruction-driven query and history items into a comparable shared embedding space.
- The textual instruction (from the near-line LLM pipeline) is encoded by the same text encoder used for item texts.
- The embeddings are defined as:
$
\begin{aligned}
\mathbf{e}_q &= \mathrm{Concat}\left[ \phi_{\mathrm{scn}}(I_s),\; g^{\mathrm{train}}(I_r^{\mathrm{text}}) \right] \\
\mathbf{e}_t &= \mathrm{Concat}\left[ \phi_{\mathrm{item}}(\nu_t),\; \phi_{\mathrm{side}}(u_t),\; g^{\mathrm{train}}(x_t) \right] \\
\mathbf{e}_h &= \mathrm{Concat}\left[ \phi_{\mathrm{item}}(\nu_h),\; \phi_{\mathrm{side}}(u_h),\; g^{\mathrm{frozen}}(x_h) \right] \\
\mathbf{q} &= \psi_q(\mathbf{e}_q), \quad \mathbf{t} = \psi_i(\mathbf{e}_t), \quad \mathbf{h} = \psi_i(\mathbf{e}_h)
\end{aligned}
$
- $\mathbf{e}_q$: The raw query embedding, concatenated from the scenario information embedding $\phi_{\mathrm{scn}}(I_s)$ and the textual reasoning instruction embedding $g^{\mathrm{train}}(I_r^{\mathrm{text}})$.
- $\mathbf{e}_t$: The raw target item embedding, concatenated from the item ID embedding $\phi_{\mathrm{item}}(\nu_t)$, side-information features embedding $\phi_{\mathrm{side}}(u_t)$, and textual description embedding $g^{\mathrm{train}}(x_t)$.
- $\mathbf{e}_h$: The raw history item embedding, similarly concatenated from the item ID embedding $\phi_{\mathrm{item}}(\nu_h)$, side-information features embedding $\phi_{\mathrm{side}}(u_h)$, and textual description embedding $g^{\mathrm{frozen}}(x_h)$.
- $\psi_q$: A projection network that maps the raw query embedding into the shared query embedding space, resulting in $\mathbf{q}$.
- $\psi_i$: A projection network that maps raw item embeddings ($\mathbf{e}_t$ for the target, $\mathbf{e}_h$ for history) into the shared item embedding space, resulting in $\mathbf{t}$ and $\mathbf{h}$ respectively.
- $I_s$: Scenario instructions, potentially including scenario information and a trigger item (using $Z_{\mathrm{def}}$ if absent).
- $I_r^{\mathrm{text}}$: Textual contextual reasoning instruction.
- $\nu_t, \nu_h$: Item IDs for target and history items.
- $u_t, u_h$: Side-information features for target and history items.
- $x_t, x_h$: Textual descriptions for target and history items.
- $g^{\mathrm{train}}$: Trainable text encoder.
- $g^{\mathrm{frozen}}$: Frozen text encoder for long-term history, used to reduce computation. Gradients are not backpropagated through the long-history branch, reducing overhead.
Q2I Alignment: To make the query embedding and history item embeddings comparable, $\mathbf{q}$ is aligned with the target item embedding $\mathbf{t}$ (available during training). This ensures $\mathbf{q}$ can retrieve relevant history interactions at serving time, when the ground-truth target is absent. For a batch of size $B$ with normalized query embeddings $\mathbf{Q} = \{q_i\}_{i=1}^{B}$ and normalized target item embeddings $\mathbf{T} = \{t_i\}_{i=1}^{B}$, the auxiliary objective is:
$
\mathcal{L}_{\mathrm{Q2I}} = \underbrace{ -\frac{1}{B} \sum_{i=1}^{B} q_i \cdot t_i }_{\mathrm{Alignment}} + \lambda_r \underbrace{ \left( -\log\left[ \mathrm{Var}(\mathbf{Q}) \cdot \mathrm{Var}(\mathbf{T}) \right] \right) }_{\mathrm{Regularization}} + \lambda_d \underbrace{ \frac{1}{B^2 - B} \sum_{i \neq j} (q_i^\top q_j)^2 }_{\mathrm{Decorrelation}}
$
- $\mathcal{L}_{\mathrm{Q2I}}$: The Query-to-Item (Q2I) alignment loss.
- Alignment term: Maximizes the cosine similarity (dot product of normalized vectors) between each query embedding $q_i$ and its corresponding target item embedding $t_i$ within a batch of size $B$. This pulls relevant query-item pairs closer.
- Regularization term (weighted by $\lambda_r$): Encourages variance in the embeddings of $\mathbf{Q}$ and $\mathbf{T}$ across dimensions within the batch, helping to prevent dimensional collapse (where embeddings collapse to a low-dimensional subspace).
- Decorrelation term (weighted by $\lambda_d$): Minimizes the squared dot product between different query embeddings $q_i$ and $q_j$ in the batch. This encourages decorrelation between different queries, reducing embedding redundancy.
- $\lambda_r, \lambda_d$: Hyperparameters controlling the strength of the regularization and decorrelation terms.
- $\mathrm{Var}(\mathbf{Q}), \mathrm{Var}(\mathbf{T})$: The average variance (across embedding dimensions) within the batch for query embeddings $\mathbf{Q}$ and target item embeddings $\mathbf{T}$, respectively.
IGR (Instruction-Guided Retrieval):
With the aligned embedding space, IGR uses the query embedding $\mathbf{q}$ to retrieve the top-K most relevant interactions from the long-term history (in the $\mathbf{h}$ space). This reduces noise and ensures the model focuses on the user's current request.
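A minimal PyTorch sketch of both steps, assuming "Var" in the regularizer means the mean per-dimension batch variance (one plausible reading) and using illustrative λ values: the Q2I loss used during training, and the top-K lookup IGR performs at inference.

```python
import torch
import torch.nn.functional as F

def q2i_loss(q, t, lambda_r=0.1, lambda_d=0.05):
    """Q2I objective: alignment + variance regularization + decorrelation.

    q, t: (B, D) query and target-item embeddings for one batch.
    """
    q, t = F.normalize(q, dim=-1), F.normalize(t, dim=-1)
    B = q.size(0)
    # Alignment: pull each query toward its own target item.
    align = -(q * t).sum(dim=-1).mean()
    # Regularization: keep per-dimension batch variance from collapsing.
    reg = -torch.log(q.var(dim=0).mean() * t.var(dim=0).mean())
    # Decorrelation: penalize similarity between different queries.
    gram = q @ q.T
    off_diag = gram - torch.diag(torch.diag(gram))
    decorr = (off_diag ** 2).sum() / (B * B - B)
    return align + lambda_r * reg + lambda_d * decorr

def instruction_guided_retrieval(q, history, k=50):
    """IGR at serving time: top-K history items most similar to the query.

    q: (D,) aligned query embedding; history: (N, D) aligned history items.
    """
    q, history = F.normalize(q, dim=-1), F.normalize(history, dim=-1)
    scores = history @ q                          # cosine similarity per item
    return torch.topk(scores, min(k, history.size(0))).indices

# Training: align query and target embeddings.
loss = q2i_loss(torch.randn(32, 128, requires_grad=True), torch.randn(32, 128))
# Inference: filter a long-term history of 2,000 items down to 50.
kept = instruction_guided_retrieval(torch.randn(128), torch.randn(2000, 128))
```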
4.2.2.4. Instruction-Following Pre-training: Data Mixture, Signals, and Objectives
- Data Mixture: The pre-training data is a mixture of search and multiple recommendation scenarios (e.g., Homepage, Channel Feeds, I2I related recommendations).
  - Benefits: Enables unified modeling of user trajectories, expands usable supervision (positive feedback is sparse in e-commerce, so mixing search and recommendation improves coverage), and improves robustness when instructions might be missing online.
- Training Signals and Scenario-Specific Formulation:
  - Scenario information ($s$) is always included as part of $I_s$.
  - Some scenarios provide an optional trigger item ($Z$).
  - For $I_r$, search data provides an observed query. For recommendation scenarios, a default (learnable) instruction embedding ($I_{\mathrm{def}}$) is used when a near-line instruction is unavailable during training.

The following are the results from Table 1 of the original paper:
| Scenario | Scenario Info | Trigger Item | Contextual Reasoning | Formulation |
| --- | --- | --- | --- | --- |
| Search | ✓ | $Z_{\mathrm{def}}$ | query-derived | $P(Y \mid X, I_s(s, Z_{\mathrm{def}}), I_r(q))$ |
| Homepage | ✓ | $Z_{\mathrm{def}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{def}}), I_{\mathrm{def}})$ |
| Channel Feeds | ✓ | $Z_{\mathrm{entry}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{entry}}), I_{\mathrm{def}})$ |
| Related Rec. (Item-to-Item) | ✓ | $Z_{\mathrm{main}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{main}}), I_{\mathrm{def}})$ |
Table 1: Training signals and scenario-specific conditional formulations. Scenario information ($s$) is always available. The trigger item is denoted by $Z$ (using $Z_{\mathrm{def}}$ when absent). $I_r(q)$ denotes the contextual reasoning instruction embedding projected from the (rewritten/normalized) query, while $I_{\mathrm{def}}$ denotes a default learnable embedding used when the textual instruction is unavailable in training.
- Joint Learning Objectives:
The overall training objective combines generation accuracy with query-to-item alignment:
$
\mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \lambda \mathcal{L}_{\mathrm{Q2I}}
$
- $\mathcal{L}$: The total joint learning objective.
- $\mathcal{L}_{\mathrm{NTP}}$: The Weighted Next Token Prediction loss, which is the primary objective for sequence generation.
- $\mathcal{L}_{\mathrm{Q2I}}$: The Query-to-Item alignment loss, as described above, ensuring semantic consistency.
- $\lambda$: A hyperparameter balancing the two loss components.

Weighted Next Token Prediction ($\mathcal{L}_{\mathrm{NTP}}$): Optimizes the auto-regressive likelihood of the target item sequence. Higher weights are assigned to conversion-related tokens (e.g., Purchase > Cart > Click) to prioritize high-value user behaviors.
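A minimal PyTorch sketch of the weighted NTP term and the joint objective: a token-level cross-entropy whose per-token weights upweight conversion-related behaviors. The concrete weight values and λ are illustrative (the paper only states the ordering Purchase > Cart > Click), and q2i_loss refers to the sketch given earlier.

```python
import torch
import torch.nn.functional as F

# Illustrative behavior weights; only the ordering comes from the paper.
BEHAVIOR_WEIGHTS = {"click": 1.0, "cart": 2.0, "purchase": 4.0}

def weighted_ntp_loss(logits, targets, token_weights):
    """Token-level cross-entropy, upweighting conversion-related tokens.

    logits: (B, T, V) decoder outputs; targets: (B, T) SID tokens;
    token_weights: (B, T) per-token behavior weights.
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (token_weights * ce).sum() / token_weights.sum()

# Joint pre-training objective L = L_NTP + lambda * L_Q2I,
# reusing the q2i_loss sketched in the IGR subsection above.
def joint_loss(logits, targets, token_weights, q, t, lam=0.2):
    return weighted_ntp_loss(logits, targets, token_weights) + lam * q2i_loss(q, t)
```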
4.2.3. Post-training with Multi-Scenario Alignment
The post-training phase adopts a multi-scenario modeling approach, unlike single-scenario fine-tuning. It involves a reward mapping system and RL-based post-training, designed for various tasks across different scenarios.
The post-training process is illustrated in Figure 4.
(Image: Schematic of OxygenREC's inference stage and reward service. Data from different scenarios flows into OxygenREC, which combines user information and sequential features to generate recommendations. The model obtains reward scores from the unified ranking model and couples them with the policy learning algorithms (SFT and SA-GCPO) for effective multi-scenario recommendation.)
Figure 4: The post-training process of OxygenREC
4.2.3.1. Post-training Framework
- Reward Service: An online reward mapping system that includes a unified ranking model service and other multi-task rewards.
- Inference Stage: Features based on user history and sequential behavior are fed into the policy model for generation, producing multiple candidate items. The unified ranking model acts as an online reward service, returning a score for each requested item.
- Policy Learning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are conducted using Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to refine the policy model. This process iterates until convergence.
4.2.3.2. Multi-Scenario Adaptation
To avoid the cost of separate GR models and ranking services for each scenario, OxygenREC uses a unified ranking model as a centralized reward model service and conducts post-training collectively on data from various scenarios.
- Scenario-Aware Reward Mapping: The RL stage uses rewards tailored to specific recommendation scenarios. The total reward is a weighted combination of the following components (a sketch of this weighted combination appears at the end of this subsection):
  - Format Reward: Penalizes structural errors in semantic ID output.
  - Relative Reward: Rewards items relevant to the user's immediate context and query (based on Q2I semantic relationships).
  - Ranking Reward: Rewards sequences that maximize business objectives like GMV or Conversion Rate.
  - Diversity Reward: Evaluates the diversity of generated items.
- Unified Ranking Model: A novel unified multi-scenario ranking model serves as the core of the Reward Mapping Service.
  - Traditional MSR models (e.g., STAR [51], PEPNet [9]) use complex structures to mitigate negative transfer. OxygenREC instead proposes a unified list-wise approach for both offline training and online learning.
  - It constructs token representations from heterogeneous user features and processes them through a shared Transformer-based feature extraction block for cross-scenario feature interaction and consistent scaling effects.
  - An adaptive modeling mechanism addresses input feature heterogeneity. The comparison of model architectures is shown in Figure 5.
(Image: Schematic comparing traditional single-scenario ranking models, traditional multi-scenario ranking models, and the unified ranking model. The left side shows the feature extraction structures of the single- and multi-scenario models; the right side shows the unified model's Transformer feature extraction block.)
Figure 5: Comparison of model architectures: traditional ranking models vs. our unified ranking model.
- Multi-scenario training samples: Created using a label packing strategy, transforming point-wise samples into list-wise samples arranged chronologically. A customized causal masking mechanism models user behavior trajectories explicitly, maximizing conversion gains across the user path.
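As referenced in the reward-mapping list above, here is a minimal sketch of the scenario-aware total reward: a weighted combination of the four components, with per-scenario weights. All weight values, scenario keys, and type names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    format: float     # structural validity of the generated semantic IDs
    relative: float   # Q2I relevance to the user's immediate context
    ranking: float    # business objectives (e.g., GMV, conversion)
    diversity: float  # variety among generated items

# Hypothetical per-scenario weightings for illustration only.
SCENARIO_WEIGHTS = {
    "homepage":      RewardWeights(1.0, 0.5, 2.0, 0.5),
    "channel_feeds": RewardWeights(1.0, 1.0, 1.5, 1.0),
}

def total_reward(scenario, r_format, r_relative, r_ranking, r_diversity):
    """Combine component rewards with scenario-specific weights."""
    w = SCENARIO_WEIGHTS[scenario]
    return (w.format * r_format + w.relative * r_relative
            + w.ranking * r_ranking + w.diversity * r_diversity)
```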
4.2.3.3. Reinforcement Learning
The paper introduces Soft Adaptive Group Clip Policy Optimization (SA-GCPO) for preference alignment during post-training.
- Motivation: Existing methods like GRPO [23], ECPO (OneRec-v1 [68]), and gradient truncation (OneRec-v2 [69]) use hard clipping strategies, which can lead to sample inefficiency, discontinuous gradients, and instability, especially in multi-environment unified training. SA-GCPO instead adopts a soft adaptive function for importance sampling weights.
- Key Idea: It incorporates reward scores from real user behavior as a threshold to distinguish positive and negative advantage samples and applies an asymmetric temperature control mechanism (different temperature coefficients for positive and negative samples). This accelerates gradient decay for negative samples, mitigating gradient diffusion and instability.
SA-GCPO Formulation: Given each sample $x$, for a group of $G$ items $\{y_i\}_{i=1}^{G}$ generated from the behavior policy $\pi_{\theta_{\mathrm{old}}}$, the optimization objective is:
$
\mathcal{J}_{\mathrm{SA-GCPO}}(\theta) = \mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} f_{i,t}\big(r_{i,t}(\theta)\big)\, \Gamma^{\mathrm{adv}}\big(\widehat{A}_{i,t}, R_g^*\big) \right]
$
- $\mathcal{J}_{\mathrm{SA-GCPO}}(\theta)$: The optimization objective for the current policy $\pi_\theta$.
- $\mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}$: Expectation over samples $x$ from the data distribution $D$ and item groups generated from the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $G$: The number of generated items for each sample (group size).
- $|y_i|$: The length of the token sequence for item $y_i$.
- $f_{i,t}$: The soft adaptive function that weights the importance ratio for token $t$ of item $y_i$.
- $\Gamma^{\mathrm{adv}}$: The threshold function that distinguishes positive and negative advantage samples.
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$.
- $R_g^*$: The reward of the target item in the current group.
The token-level importance ratio is defined as:
$
r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}
$
- $r_{i,t}(\theta)$: The ratio of the probability of generating token $y_{i,t}$ under the new policy $\pi_\theta$ to the probability under the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $\pi_\theta(y_{i,t} \mid x, y_{i,<t})$: Probability of generating token $y_{i,t}$ given input $x$ and previous tokens $y_{i,<t}$ under the new policy.
- $\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$: Probability of generating token $y_{i,t}$ under the old policy.
The soft adaptive function is given by:
$
f_{i,t}(\rho) = \sigma\left( \tau_{i,t} (\rho - 1) \right) \cdot \frac{4}{\tau_{i,t}}, \quad \tau_{i,t} = \begin{cases} \tau_{\mathrm{pos}}, & \Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*) > 0, \\ \tau_{\mathrm{neg}}, & \mathrm{otherwise}, \end{cases}
$
- $f_{i,t}(\rho)$: The soft adaptive function, where $\rho$ is the importance ratio $r_{i,t}(\theta)$.
- $\sigma$: The sigmoid function.
- $\tau_{i,t}$: The temperature coefficient, dynamically set based on whether the advantage is positive or negative.
- $\tau_{\mathrm{pos}}$: Temperature for positive advantage samples.
- $\tau_{\mathrm{neg}}$: Temperature for negative advantage samples.
- The term $\frac{4}{\tau_{i,t}}$ is a scaling factor.
The gradient weight is given by:
$
w_{i,t}(\theta) = 4\, p_{i,t}(\theta) \left( 1 - p_{i,t}(\theta) \right), \quad p_{i,t}(\theta) = \sigma\left( \tau_{i,t} \left( r_{i,t}(\theta) - 1 \right) \right)
$
- $w_{i,t}(\theta)$: The gradient weight for token $t$ of item $y_i$. This weight peaks at $r_{i,t}(\theta) = 1$ and smoothly decays as the ratio deviates, creating a soft trust region for the importance ratio.
- $p_{i,t}(\theta)$: The output of the sigmoid function, with $\tau_{i,t}$ applied to the difference between the importance ratio and 1.
$\Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*)$ in Equation 4 represents the threshold function that distinguishes positive and negative advantage samples, defined as:
$
\Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*) = \begin{cases} 0, & \widehat{A}_{i,t} > 0 \ \mathrm{and}\ R_i < R_g^*, \\ \widehat{A}_{i,t}, & \mathrm{otherwise}, \end{cases}
$
- $\Gamma^{\mathrm{adv}}$: The threshold function for the advantage.
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$.
- $R_i$: The reward score for item $y_i$.
- $R_g^*$: The reward of the target item in the current group.
- This function sets the advantage to 0 if $\widehat{A}_{i,t}$ is positive but the item's reward $R_i$ is less than the target item's reward $R_g^*$. This prevents learning from "false positive" advantages, where an item is preferred over the group baseline but is not as good as the actual target. Otherwise, it uses the raw normalized advantage.
$\widehat{A}_{i,t}$ is the normalized advantage for item $y_i$, calculated as:
$
\widehat{A}_{i,t} = \widehat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}
$
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$, equal to the advantage of the entire item, $\widehat{A}_i$.
- $R_i$: The reward score for item $y_i$.
- $\mathrm{mean}(\{R_i\}_{i=1}^{G})$: The mean reward of all items in the generated group $G$.
- $\mathrm{std}(\{R_i\}_{i=1}^{G})$: The standard deviation of rewards of all items in the generated group $G$.
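Putting the pieces together, here is a minimal PyTorch sketch of the SA-GCPO per-token weighting: group-normalized advantages, the $\Gamma^{\mathrm{adv}}$ threshold against the target reward $R_g^*$, asymmetric temperatures, and the soft sigmoid gate. Temperature values and tensor shapes are illustrative.

```python
import torch

def soft_gate(ratio, tau):
    """f(rho) = sigmoid(tau * (rho - 1)) * 4 / tau: a smooth trust region
    whose gradient weight peaks at rho = 1."""
    return torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau

def sa_gcpo_objective(logp_new, logp_old, rewards, r_target,
                      tau_pos=2.0, tau_neg=6.0):
    """logp_new / logp_old: (G, T) per-token log-probs for a group of G items.
    rewards: (G,) item-level reward scores; r_target: reward R_g* of the
    real target item, used as the positive/negative threshold.
    Returns the scalar objective J to be maximized.
    """
    # Group-normalized advantage, shared by every token of an item.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1).expand_as(logp_new)

    # Gamma^adv: zero out "false positives" whose advantage is positive
    # but whose reward still falls below the real target's reward.
    false_pos = (adv > 0) & (rewards.unsqueeze(1) < r_target)
    adv = torch.where(false_pos, torch.zeros_like(adv), adv)

    # Asymmetric temperatures: negative-advantage tokens decay faster.
    tau = torch.where(adv > 0,
                      torch.full_like(adv, tau_pos),
                      torch.full_like(adv, tau_neg))

    ratio = (logp_new - logp_old.detach()).exp()   # importance ratio r_{i,t}
    per_token = soft_gate(ratio, tau) * adv
    return per_token.mean(dim=1).mean()            # avg over tokens, then items

G, T = 8, 3                                        # group size, SID depth
logp_new = torch.randn(G, T, requires_grad=True)
logp_old = torch.randn(G, T)
obj = sa_gcpo_objective(logp_new, logp_old, torch.randn(G), r_target=0.5)
(-obj).backward()                                  # gradient ascent on J
```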
Main advantages of SA-GCPO:
- Adaptive smooth gating: Replaces hard clipping with a continuous sigmoid-based gate function, reducing optimization noise and enhancing training stability.
- Real user feedback as the pos/neg threshold: Uses reward scores from real user feedback to define positive and negative advantages, mitigating reward hacking (exploiting flaws in the reward function).
- Asymmetric temperature control: Uses different temperature settings for $\tau_{\mathrm{pos}}$ and $\tau_{\mathrm{neg}}$ to more rapidly attenuate negative-token gradients, improving stability.
- Sequence-level coherence: Since an SID sequence represents a single item, SA-GCPO reduces to a smooth sequence-level gate (similar to GSPO [67]) but without abrupt clipping.
4.2.4. System Implementation and Optimization
OxygenREC requires handling both terabyte-scale sparse embeddings (typical of recommendation systems) and billion-scale dense parameters (typical of LLMs).
4.2.4.1. Unified Training Framework
Built on PyTorch [35] on a production cluster with 128 NVIDIA H800 GPUs, achieving 40% Model FLOPs Utilization (MFU).
- Distributed Sparse Optimization:
  - Designed a large-scale distributed sparse engine in PyTorch with a non-overlapping partition strategy for embeddings across workers.
  - Utilizes hierarchical HBM-MEM caching and a multi-stage pipeline to hide embedding access latency.
  - Implements a dual-buffer mechanism for strong consistency, reducing sparse operation time from 15% to 5% and achieving a 1.1-2.4x speedup.
- Operator-Level Acceleration:
  - Integrates BF16 mixed-precision training [31] and ZeRO [47] for memory efficiency in the LLM backbone.
  - Employs advanced attention mechanisms [17, 49] and efficient architectures [19, 50].
  - Developed a dedicated attention acceleration library using CUTLASS [57] and TileLang [59] for custom kernel compilation, supporting flexible mask configurations and achieving 1.7x to 3.0x speedups over FlexAttention [18] and torch.compile [2].
- Scenario-Aware Reinforcement Learning:
  - A customized RL workflow is built on Ray [42].
  - In collocated deployment modes, shared-memory access for sparse tables eliminates redundant copying, ensuring efficient synchronization during high-throughput sample generation.
4.2.4.2. Inference Optimization based on xLLM
GR inference differs from standard LLM serving: long user-history prompts, relatively short outputs, and large beam width (e.g., 256-512). Decoding is the bottleneck due to sorting overhead, stochastic sampling, KV-cache pressure, and memory access inefficiency.
- OxygenREC uses xGR [4], a dedicated serving system built upon xLLM [38], with a three-tier architecture:
  - xSchedule (System Level): Manages task parallelism, enabling fine-grained pipeline overlapping across batching, request handling, and kernel execution for high GPU utilization.
  - xAttention (Operator Level): Based on xLLM's PagedAttention, customized for GR's attention patterns (long prompts + short decoding, hybrid masks), strengthening KV-cache management and using staged compute allocation for large-beam decoding.
  - xBeam (Algorithm Level): Handles the massive sorting overhead of large-beam decoding and supports advanced sampling strategies over billion-scale item spaces.
- Deep Customization for OxygenREC:
  - Specialized Beam Search & Sampling: xBeam implements an optimized Beam Sample kernel combining top-k selection with nucleus/multinomial sampling. It uses operator-level fusion to add stochasticity without performance degradation, reducing decoding latency.
  - Prefix-Constrained Decoding: Integrates a Trie Index mechanism into the inference loop to dynamically generate logit masks at each step, ensuring 100% generation legality within designated item pools with negligible runtime overhead.
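A minimal sketch of trie-based prefix-constrained decoding: SID sequences from the allowed item pool populate a prefix trie, and at each decoding step only legal continuations keep finite logits. The trie layout is our own illustration of the idea; xBeam's internal implementation is not described in the paper.

```python
import torch

def build_trie(item_sids):
    """Map each SID prefix to the set of legal next codes."""
    trie = {}
    for sid in item_sids:                      # sid: tuple of codes, e.g. depth 3
        for i in range(len(sid)):
            trie.setdefault(sid[:i], set()).add(sid[i])
    return trie

def constrained_logits(logits, prefix, trie, vocab_size):
    """Mask logits so only SIDs present in the item pool can be generated."""
    mask = torch.full((vocab_size,), float("-inf"))
    for code in trie.get(tuple(prefix), ()):   # legal continuations of prefix
        mask[code] = 0.0
    return logits + mask

item_pool = [(1, 7, 2), (1, 7, 5), (3, 0, 4)]  # toy 3-level semantic IDs
trie = build_trie(item_pool)
step_logits = torch.randn(8192)
legal = constrained_logits(step_logits, prefix=[1, 7], trie=trie, vocab_size=8192)
# Only codes 2 and 5 remain finite, guaranteeing generation legality.
```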
4.2.4.3. Inference Deployment of Reasoning Instructions
A near-line updating mechanism balances timeliness and system overhead.
- Architecture: An LLM-based instruction generation service and an Adapter-based text encoder service.
- The LLM instruction model operates near-line, synthesizing natural-language reasoning instructions in batch using spatiotemporal context and user behavioral history.
- The Adapter text encoder converts textual instructions into dense embedding vectors, indexed by user ID, and stored in a low-latency key-value store (e.g., a Redis cluster) for real-time consumption.
- Update Mechanisms:
  - Daily full refresh: An offline job regenerates spatiotemporal and behavioral instructions for all daily active users.
  - Near-line incremental update: Triggered by high-value user actions (searches, views, carts, purchases). A time-window aggregation strategy (e.g., 5 minutes) merges multiple behavioral events from the same user into a unified intent summary, executing a single instruction regeneration and storage write at the window's end. This preserves responsiveness while reducing backend load.

This design enables zero online LLM calls, low-latency serving, and high semantic fidelity in recommendations by retrieving precomputed instruction embeddings.
-
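As an illustration of the time-window aggregation strategy (the class, callbacks, and flush loop below are assumptions for the sketch, not details from the paper), the idea is to debounce per-user event bursts into one regeneration and one storage write:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 300  # e.g., the paper's 5-minute aggregation window

class NearLineAggregator:
    """Illustrative sketch: buffer high-value events per user and trigger a
    single instruction regeneration + KV write when a user's window closes."""

    def __init__(self, regenerate_fn, kv_write_fn):
        self._pending = defaultdict(list)   # user_id -> buffered events
        self._window_start = {}             # user_id -> window open time
        self._regenerate = regenerate_fn    # events -> instruction embedding
        self._kv_write = kv_write_fn        # (user_id, embedding) -> None

    def on_event(self, user_id, event):
        if user_id not in self._window_start:
            self._window_start[user_id] = time.time()
        self._pending[user_id].append(event)

    def flush_expired(self):
        now = time.time()
        for user_id in list(self._window_start):
            if now - self._window_start[user_id] >= WINDOW_SECONDS:
                events = self._pending.pop(user_id)
                del self._window_start[user_id]
                # One LLM call + one storage write per user per window.
                self._kv_write(user_id, self._regenerate(events))
```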
5. Experimental Setup
5.1. Datasets
The paper utilizes proprietary datasets from JD.com, a large e-commerce platform.
- Source: JD.com's core recommendation scenarios.
- Characteristics: The pre-training data is a mixture of search and multiple recommendation scenarios (e.g., Homepage, Channel Feeds, and I2I related recommendations), reflecting real user journeys across the app.
- Scale: OxygenREC is designed for large-scale e-commerce recommendation, handling terabyte-scale sparse embeddings and billion-scale dense parameters.
- Specific Examples: The Contextual Reasoning Instructions section (Appendix B) provides concrete examples of input signals:
  - Spatiotemporal & User Profile: Location: Beijing; Date: December 21, 2025; User Profile: male, 30 years old, white-collar, mid-to-high spending tier.
  - User Intent Reasoning Input: Added a full-suspension mountain bike (soft-tail) to cart on Dec 19 at 18:30 (high intent, L2: Outdoor Cycling); clicked Huawei Mate 70 and iPhone 16 Pro on Dec 20 during a lunch break (comparative browsing, L1: Consumer Electronics). Long-term Interests Input: digital products, men's fashion, cycling, outdoor sports.
- Synthetic Data: For post-training, the model generates a large amount of synthetic data by using user profiles and candidate items to query the ranking model for reward scores.
- Rationale for Choice: These datasets represent real-world, large-scale e-commerce traffic, allowing comprehensive validation of OxygenREC's ability to handle complex user behaviors, diverse scenarios, and strict industrial constraints. The mixture of search and recommendation data is crucial for unified modeling and for addressing the data sparsity common in positive feedback signals in e-commerce.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate both the model's recommendation performance and the quality of its learned Semantic IDs.
5.2.1. Model Evaluation Metrics
- HitRate@K (HR@K):
  - Conceptual Definition: Quantifies the precision of the generative process. It measures the proportion of test instances where a generated candidate exactly matches the ground-truth semantic ID sequence (across all hierarchical code levels) within the top-K beam search hypotheses. A higher HR@K indicates that the model generates the correct item (or its semantic representation) with high accuracy among its top suggestions.
  - Mathematical Formula: The paper describes the concept without giving a formula. A standard definition for HitRate@K in recommendation systems is:
    $ \mathrm{HitRate@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}\big(y_{gt,u} \in \mathrm{TopK}_u\big) $
    However, the paper specifies that the match is on the ground-truth semantic ID sequence across all hierarchical code levels. Adapting this for semantic IDs:
    $ \mathrm{HitRate@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}\big(\exists\, y_{gen} \in \mathrm{TopK}_u \ \text{s.t.}\ \mathrm{SID}(y_{gen}) = \mathrm{SID}(y_{gt,u})\big) $
  - Symbol Explanation:
    - $U$: The set of all users in the test set; $|U|$ is the total number of users.
    - $\mathbb{I}(\cdot)$: An indicator function that returns 1 if its argument is true, and 0 otherwise.
    - $y_{gt,u}$: The ground-truth item for user $u$ in the test set.
    - $\mathrm{TopK}_u$: The set of top-K recommended items for user $u$.
    - $\mathrm{SID}(y)$: The semantic ID sequence (tuple of discrete codes at all hierarchical levels) of item $y$.
    - The condition implies that the generated semantic ID sequence must be identical to the ground-truth sequence at every hierarchical level.
- Recall@K:
  - Conceptual Definition: Assesses the model's capacity to cover the user's relevant interests. It is the ratio of the user's daily positive interactions (ground-truth items) that are successfully identified within the top-K generated candidates, relative to the full set of observed user behaviors for that day. A higher Recall@K indicates that the model retrieves a larger proportion of the items the user actually interacted with.
  - Mathematical Formula: As with HitRate@K, the paper gives only a conceptual definition. A common formula for Recall@K, adapted for semantic IDs per the paper's context, is:
    $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\big|\{\, y \in \mathrm{DailyInteractions}_u \mid \exists\, y_{gen} \in \mathrm{TopK}_u \ \text{s.t.}\ \mathrm{SID}(y_{gen}) = \mathrm{SID}(y) \,\}\big|}{|\mathrm{DailyInteractions}_u|} $
  - Symbol Explanation:
    - $U$: The set of all users in the test set; $|U|$ is the total number of users.
    - $\mathrm{DailyInteractions}_u$: The full set of observed positive interactions (ground-truth items) for user $u$ on a given day.
    - $\mathrm{TopK}_u$: The set of top-K recommended items for user $u$.
    - $\mathrm{SID}(y)$: The semantic ID sequence of item $y$.
    - The numerator counts how many of the user's daily ground-truth items are matched (by semantic ID sequence) within the top-K recommendations.
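Under these definitions, both metrics reduce to set operations over SID tuples. A minimal per-user sketch (assuming items are represented as tuples of hierarchical codes; averaging over users is then straightforward):

```python
def hit_rate_at_k(topk_sids, gt_sid):
    """1 if the ground-truth SID tuple appears among the top-K generated SIDs."""
    return int(gt_sid in set(topk_sids))

def recall_at_k(topk_sids, daily_gt_sids):
    """Fraction of the user's daily positive items covered by the top-K SIDs."""
    if not daily_gt_sids:
        return 0.0
    topk = set(topk_sids)
    return sum(sid in topk for sid in daily_gt_sids) / len(daily_gt_sids)

# Example: 3-level SID tuples for one user.
topk = [(12, 7, 301), (12, 7, 302), (98, 4, 11)]
print(hit_rate_at_k(topk, (12, 7, 302)))             # 1
print(recall_at_k(topk, [(12, 7, 302), (5, 5, 5)]))  # 0.5
```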
5.2.2. Semantic ID Evaluation Metrics
These metrics evaluate the quality of the learned Multimodal Quantized Item Representations (Semantic IDs).
- SID Codebook Coverage (↑: higher is better):
  - Conceptual Definition: Evaluates the utilization efficiency of the hierarchical semantic space by measuring joint path occupancy at each quantization level. It reflects how effectively SKUs (Stock Keeping Units, i.e., individual items) populate the exponentially growing combination space of the codebook.
  - Mathematical Formula: For a given depth $L$ and codebook width $|V|$:
    $ \mathrm{CodebookCoverage@L} = \frac{|\{ (c_1, \ldots, c_L) \mid \exists\, \mathrm{SKU} \text{ mapped to this tuple} \}|}{|V|^L} $
  - Symbol Explanation:
    - $L$: The current quantization level or depth (e.g., 1, 2, or 3).
    - $V$: The vocabulary of codes at each level; $|V|$ is the codebook width (vocabulary size per level).
    - $(c_1, \ldots, c_L)$: A unique tuple of codes up to level $L$.
    - The numerator counts the number of unique code tuples with at least one SKU mapped to them; the denominator is the total number of possible code tuples at that level.
- Semantic Cluster Purity (↑: higher is better):
  - Conceptual Definition: Evaluates the alignment between the learned semantic clusters (items mapping to the same semantic ID) and human-defined taxonomies (e.g., product categories). It measures the mean dominance of the primary category within each semantic ID's corresponding SKU set. A higher score means semantic IDs effectively capture high-level category semantics (e.g., all SKUs under one SID consistently belong to "Electronics").
  - Mathematical Formula: Not explicitly provided. Conceptually, for each semantic ID one identifies the most frequent human-defined category among its assigned SKUs, computes that category's share of the cluster, and averages this share over all semantic IDs (see the formula sketch below).
  - Symbol Explanation: Cate1, Cate2, Cate3 refer to levels of a hierarchical product taxonomy (e.g., Category1 might be "Electronics", Category2 "Mobile Phones", Category3 "Smartphones"). Purity is measured at each categorical level.
- Semantic ID Collision (↓: lower is better):
  - Conceptual Definition: Quantifies the discriminative power of semantic identifiers in distinguishing individual items within the billion-scale SKU pool. It is measured by the distribution of item cardinality (number of unique SKUs) per unique SID tuple. Lower SKU counts at upper quantiles (e.g., P90, P99) indicate finer granularity, meaning the semantic ID sequence is closer to a unique item identifier. High collision means many distinct items map to the same SID, losing discriminative power.
  - Mathematical Formula: No explicit formula; it is a statistical summary of SKU counts per SID tuple.
  - Symbol Explanation: P90, P99, P999 are the 90th, 99th, and 99.9th percentiles of the distribution of SKU counts per SID tuple. For example, P99 = 21 means that 99% of SID tuples contain 21 or fewer SKUs.
- Codebook Load Balance (→ 100%):
  - Conceptual Definition: Measures the uniformity of item distribution across the quantization codebook. It quantifies the deviation of actual cluster sizes (number of SKUs assigned to a specific code) from the theoretical uniform size (total SKUs / codebook size). Examining this ratio at various quantiles assesses whether the RQ-KMeans process utilizes the full capacity of the codebook or suffers from mode collapse (some codes over-utilized, others under-utilized).
  - Mathematical Formula: No explicit formula; it is the distribution of cluster sizes relative to the ideal uniform size.
  - Symbol Explanation: P25, P75, P90 are percentiles of the distribution of cluster sizes (number of SKUs per code), expressed relative to the uniform size. The ideal value is 100% at all quantiles, indicating a perfectly uniform distribution.
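All four diagnostics can be computed directly from the SKU-to-SID mapping. A compact sketch, under stated assumptions: `sku_to_sid` maps SKU id to its full code tuple, `V` is the per-level codebook width, and code tuples at the chosen depth are treated as the clusters for load balance (one possible reading of the metric, not necessarily the paper's exact procedure):

```python
import numpy as np
from collections import Counter

def sid_diagnostics(sku_to_sid, V, level):
    """Coverage, collision percentiles, and load balance at a given SID depth."""
    prefixes = [sid[:level] for sid in sku_to_sid.values()]
    counts = Counter(prefixes)                      # SKUs per occupied code tuple

    coverage = len(counts) / (V ** level)           # occupied / possible tuples

    sizes = np.array(sorted(counts.values()))       # SKU count per SID tuple
    collision = {p: np.percentile(sizes, p) for p in (90, 99, 99.9)}

    uniform = len(prefixes) / (V ** level)          # ideal SKUs per tuple
    load_balance = {p: 100 * np.percentile(sizes, p) / uniform
                    for p in (25, 75, 90)}
    return coverage, collision, load_balance
```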
5.3. Baselines
OxygenREC is compared against various baselines depending on the specific experiment:
- Multimodal Encoder Design:
  - OneRec-style multimodal model: Uses a large pre-trained multimodal backbone (e.g., MiniCPMV [65]) with multiple Q-Former [34] layers for feature compression.
  - Lighter Vision-Language Models: Variants from the SAIL [11] and InternVL [13] families.
  - Pure-text encoders: e.g., Qwen3 [3, 63].
  - Pure-vision encoders: Not explicitly named, but implied for comparison.
  - Multimodal backbone models [33, 37, 60]: Generic multimodal models.
- Reinforcement Learning (RL) Methods (for Post-training):
  - GRPO [23]: A widely applied policy optimization method in LLMs, VLMs, and GRs.
  - GSPO [67]: A GRPO variant that improves performance by computing importance sampling along the sequence.
- Multi-Scenario Adaptation:
  - Independent SFT (Supervised Fine-Tuning) baselines: The industry-standard approach, where a separate model is fine-tuned exclusively for each scenario. OxygenREC compares its Unified Instruction-Following Model against this baseline.
  - The paper also implicitly compares against the limitations of traditional multi-stage cascaded pipelines and existing generative methods with limited world-knowledge-based reasoning and poor multi-scenario scalability (as depicted in Figure 1).
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Ablation Studies and Analysis
6.1.1.1. Lightweight Multimodal Design
The authors performed a systematic analysis of the multimodal encoder design.
- OneRec-style backbone (MiniCPMV + Q-Former): This initial heavy setup was the starting point.
- Lighter VLMs (SAIL, InternVL): Replacing the heavy backbone with lighter variants did not significantly improve recommendation performance, despite faster inference.
- Pure-text (Qwen3) vs. pure-vision vs. multimodal backbones: Surprisingly, the pure-text encoder Qwen3 substantially outperformed the multimodal alternatives.
- Final Design (Qwen3 + CLIP fusion): This observation led to the final design: a Qwen3 text encoder and a CLIP [45] image encoder independently extract representations, which are then fused and compressed via Q-Former and MLP layers, with residual initialization for the Q-Former query tokens. This fusion architecture yielded over 30% relative improvement in HitRate@1 over the OneRec-style variants and up to a 32x speedup in embedding inference.
6.1.1.2. Semantic ID Evolution
The authors investigated four versions of Semantic IDs (SIDs) to understand the impact of different item representation strategies. The quality was assessed using Codebook Coverage, Semantic Cluster Purity, SID Collision (P99), and Codebook Load Balance.
The following are the results from Table 2 of the original paper:
| Metric | Level | V1 (Textual) | V2 (MiniCPM) | V3 (Fusion) | V4 (Multi-Source) |
| Codebook Coverage (↑) | Codebook1 | 100% | 100% | 100% | 100% |
|  | Codebook2 | 72.33% | 31.22% | 70.03% | 69.56% |
|  | Codebook3 | 2.74% | 0.013% | 0.14% | 0.15% |
| Cluster Purity (↑) | Cate1 | 86.30% | 83.31% | 84.38% | 92.80% |
|  | Cate2 | 73.45% | 66.80% | 70.35% | 79.90% |
|  | Cate3 | 53.68% | 41.99% | 48.41% | 59.73% |
| SID Collision (↓) | P90 | 10 | 6 | 2 | 2 |
|  | P99 | 79 | 21 | 9 | 9 |
|  | P999 | 419 | 73 | 34 | 35 |
| Load Balance (→ 100%) | P25 | 32.49% | 88.53% | 89.25% | 89.92% |
|  | P75 | 144.22% | 114.66% | 107.39% | 105.40% |
|  | P90 | 212.84% | 122.14% | 118.60% | 118.09% |
Table 2: Detailed Evaluation of Semantic ID Versions.
Note: ↑: higher is better; ↓: lower is better; → 100%: closer to 100% is better.
Analysis of Semantic ID Quality:
- V1 (Textual Baseline): Suffers from a small codebook size (2048); its seemingly high Codebook Coverage is an artifact of that small size. Its high P999 collision (419) indicates poor resolution capacity.
- V2 (MiniCPM): Shows very low Codebook Coverage at deeper levels (0.013% at L3), suggesting partial codebook collapse (many codes go unused).
- V3 (Fusion) & V4 (Multi-Source): Successfully address the issues of V1 and V2 by leveraging robust multimodal alignment.
- V4 (Multi-Source Alignment): Achieves the best performance across all metrics: highest Cluster Purity (92.80% at Cate-L1) due to explicit injection of category semantics; superior SID collision (P999 of 35), ensuring fine-grained item distinction; and optimal load balance (P90 of 118.09%), confirming effective and uniform utilization of the latent space through multi-source contrastive learning. This version is identified as the optimal semantic ID representation.
6.1.1.3. Generative Backbone Architecture Ablation
Using the optimal V4 Semantic ID, the authors investigated the scaling laws and architectural hyperparameters of the generative backbone by varying depth of encoder/decoder layers, model dimensions, and Mixture-of-Experts (MoE) configurations.
The following are the results from Table 3 of the original paper:
| Model Size (Tot/Act) | Enc Layers | Dec Layers | Dim (Hidden/Inter) | Experts (Tot/Act) |
| 0.1B / 0.1B | 4 | 4 | 1024/512 | 2/1 |
| 0.4B / 0.3B | 4 | 6 | 2048/1024 | 4/2 |
| 0.7B / 0.4B | 4 | 8 | 2048/1024 | 8/2 |
| 1.5B / 0.4B | 4 | 8 | 2048/1024 | 24/2 |
| 3.0B / 0.6B | 4 | 16 | 2048/1024 | 24/2 |
Table 3: Model Configurations for Generative Backbone Ablation.
The generative performance was assessed using HitRate and Recall metrics, and training dynamics via NTP Loss.
The following are the results from Table 4 of the original paper:
| Model Size (Tot) | HR@1 | HR@10 | Recall@10 | Recall@30 |
| 0.1B | 3.99% | 13.17% | 10.10% | 15.11% |
| 0.4B | 4.42% | 15.03% | 11.38% | 17.34% |
| 0.7B | 4.84% | 16.33% | 12.32% | 18.71% |
| 1.6B | 4.92% | 16.61% | 12.51% | 19.01% |
| 3.0B | 5.02% | 16.99% | 12.78% | 19.53% |
Table 4: Performance Comparison.
The NTP Loss Scaling Laws are shown in Figure 6.
(Chart: training loss curves over training steps for models of different parameter counts, e.g., 0.1B, 0.4B; loss decreases as training proceeds, with larger models reaching lower loss.)
Figure 6: NTP Loss Scaling Laws.
Scaling Law Validation and MoE Saturation Analysis:
- General Trend: A positive correlation exists between model capacity and retrieval performance. Scaling from 0.1B to 3.0B total parameters yields monotonic improvements in HR@10 (from 13.17% to 16.99%) and NTP loss (larger models reach a lower asymptotic loss).
- MoE Saturation: A plateau is observed between the 0.7B and 1.6B variants in NTP loss. This is attributed to both configurations using the same number of active experts per token (2), which limits the immediate gains from merely expanding the expert pool in an MoE architecture.
- Overcoming the Bottleneck: The 3.0B model overcomes this plateau, likely due to its significantly deeper decoder (16 layers vs. 8), which extends the computational path and further reduces NTP loss.
6.1.2. Instruction-Following Unification Analysis
This section validates the effectiveness of the proposed instruction-following mechanism.
6.1.2.1. Instruction Token Integration Strategies
Five strategies for inserting the instruction token into the decoder's input sequence relative to the Begin-Of-Sequence (BOS) marker were compared.
The following are the results from Table 5 of the original paper:
| Integration Strategy | HR@1 | HR@10 | Recall@10 | Recall@30 |
| No Instruction | 2.78% | 10.38% | 8.18% | 13.01% |
| Replace BOS | 3.30% | 12.08% | 9.12% | 14.38% |
| Add to BOS | 3.50% | 12.59% | 9.52% | 14.93% |
| Insert Left of BOS | 3.33% | 12.17% | 9.21% | 14.50% |
| Insert Right of BOS | 3.53% | 12.68% | 9.58% | 14.91% |
Table 5: Performance comparison of different instruction token integration strategies.
Analysis: The Insert Right of BOS strategy consistently yields the highest retrieval metrics (e.g., HR@10 of 12.68%). This approach allows the decoder to first initialize its state with BOS and then immediately condition the subsequent auto-regressive generation on the specific context provided by the instruction, providing the most effective guidance flow.
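To make the winning strategy concrete, here is a hedged sketch of what "Insert Right of BOS" means for the decoder input at the embedding level (my reading of the description; the function and tensor names are illustrative):

```python
import torch

def build_decoder_inputs(bos_emb, instr_emb, target_embs):
    """Insert Right of BOS: decoder sees [BOS, instruction, SID tokens...].

    bos_emb:     (1, d)  embedding of the Begin-Of-Sequence marker
    instr_emb:   (1, d)  instruction token embedding (scenario/trigger signal)
    target_embs: (T, d)  embeddings of the target semantic-ID tokens
    """
    return torch.cat([bos_emb, instr_emb, target_embs], dim=0)

# The other ablated variants, for contrast (illustrative):
#   Replace BOS:        [instr, SIDs...]
#   Add to BOS:         [bos + instr, SIDs...]   (element-wise sum)
#   Insert Left of BOS: [instr, bos, SIDs...]
```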
6.1.2.2. Instruction Component Ablation
The composition of the instruction token (fusing Scenario ID and Trigger Item ID) was investigated.
The following are the results from Table 6 of the original paper:
| Instruction Components | HR@1 | HR@10 | Recall@10 | Recall@30 |
| No Instruction (Baseline) | 2.78% | 10.38% | 8.18% | 13.01% |
| Scenario ID Only (1 token) | 3.30% | 12.17% | 9.22% | 14.50% |
| Trigger Item ID Only (1 token) | 3.22% | 11.60% | 9.13% | 13.98% |
| Concatenated (Scenario + Trigger, 2 tokens) | 3.53% | 12.68% | 9.58% | 14.91% |
| Fused (Scenario + Trigger, 1 token) | 3.60% | 12.82% | 9.68% | 15.08% |
Table 6: Ablation study on Instruction Token components.
Analysis:
- Both Scenario ID Only and Trigger Item ID Only improve performance over the No Instruction baseline, confirming their individual utility.
- Concatenated (Scenario + Trigger) performs better than either alone, indicating complementary guidance.
- Fused (Scenario + Trigger) achieves the best performance across all metrics (e.g., HR@10 of 12.82%). This suggests that deeper interaction during fusion lets the model better capture the relationship between scenario context (global domain characteristics such as price sensitivity) and user intent (the Trigger Item ID provides fine-grained, localized context). A sketch of such a fusion is given below.
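The paper does not specify the fusion operator; a common choice consistent with "fused into one token" is an MLP over the concatenated embeddings, sketched here purely as an assumption:

```python
import torch
import torch.nn as nn

class InstructionFusion(nn.Module):
    """Hypothetical fusion of Scenario ID and Trigger Item ID into one token."""

    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, scenario_emb, trigger_emb):
        # (B, d) + (B, d) -> (B, d): one fused instruction token per request.
        return self.mlp(torch.cat([scenario_emb, trigger_emb], dim=-1))
```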
6.1.2.3. IGR Ablation
Ablation studies were conducted in search-dominated scenarios to validate the effectiveness of the IGR mechanism and Q2I alignment.
The following are the results from Table 7 of the original paper:
| Configuration | HR@1 | HR@10 | Recall@10 | Recall@30 |
| Base Model (w/o IGR/Q2I) | 3.76% | 12.20% | 9.87% | 15.53% |
| + IGR Only | 4.02% | 12.91% | 10.25% | 15.95% |
| + IGR+Q2I (Full) | 4.19% | 13.38% | 10.52% | 16.23% |
Table 7: Ablation study on IGR components
Analysis:
- Introducing IGR only improves retrieval quality (e.g., HR@10 from 12.20% to 12.91%) by focusing attention on intent-relevant historical interactions.
- The full model (+IGR+Q2I) achieves the best performance (HR@10 of 13.38%). This demonstrates that explicit alignment between the query and item spaces via the Q2I loss is crucial for effective IGR and overall recommendation quality. A sketch of such an alignment loss follows.
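The paper describes Q2I as a loss enforcing instruction-item consistency but does not give its exact form in this section, so the following InfoNCE-style sketch is one standard way to implement query-to-item alignment, not the paper's published loss:

```python
import torch
import torch.nn.functional as F

def q2i_loss(query_embs, item_embs, temperature=0.07):
    """InfoNCE-style query-to-item alignment (illustrative).

    query_embs: (B, d) instruction/query embeddings
    item_embs:  (B, d) embeddings of the matching positive items;
                other in-batch items serve as negatives.
    """
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(item_embs, dim=-1)
    logits = q @ v.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are positives
```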
6.1.2.4. Unified Instruction-Following Model vs. Independent SFT Baselines
The Unified Instruction Following Model was compared against the industry-standard Pretrain and Scenario Independent SFT approach across six core deployment scenarios.
The following are the results from Table 8 of the original paper:
| Metric | Model Type | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 |
| HR@1 | Independent SFT | 6.39% | 8.17% | 1.12% | 1.83% | 7.22% | 5.29% |
| Unified Model | 15.39% | 20.75% | 17.24% | 6.34% | 10.54% | 25.75% | |
| HR@10 | Independent SFT | 23.29% | 29.05% | 5.22% | 8.44% | 29.84% | 19.38% |
| Unified Model | 46.73% | 55.02% | 53.57% | 29.89% | 37.90% | 62.62% |
Table 8: Performance comparison: Unified Instruction-Following Model vs. Independent SFT Baselines across six core scenarios.
Analysis: The Unified Model consistently and significantly outperforms the Independent SFT baselines across all six core scenarios. For example, in Scenario 3, HR@1 jumps from 1.12% to 17.24%, and HR@10 from 5.22% to 53.57%. This superiority is attributed to:
- Synergistic Knowledge Transfer: High-resource scenarios (like Homepage) enhance representation learning for lower-resource scenarios.
- Universal User Modeling: Captures a holistic view of user interests across different contexts.
- Improved Operational Efficiency: A single model reduces maintenance and GPU overhead.
6.1.2.5. Trigger Instruction Sensitivity Analysis
This experiment verifies that the unified model actively utilizes the Trigger Item component of the instruction token as a control signal rather than ignoring it. It was conducted on Scenarios 3, 4, 5, and 6, which are explicitly trigger-item driven.
The following are the results from Table 9 of the original paper:
| Metric | Inference Setting | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 |
| HR@1 | Correct Trigger (Instruction) | 20.75% | 6.34% | 10.54% | 25.75% |
| Masked Trigger (Default) | 10.71% | 1.96% | 9.26% | 18.55% | |
| HR@10 | Correct Trigger (Instruction) | 55.02% | 29.89% | 37.90% | 62.62% |
| Masked Trigger (Default) | 40.96% | 11.31% | 35.74% | 50.52% |
Table 9: Sensitivity Analysis: Impact of masking the Trigger Item ID during inference (Scenarios 3, 4, 5, 6).
Analysis: Replacing the authentic Trigger Item ID with a generic "Default" embedding during decoding leads to a substantial performance drop across all tested scenarios. For example, HR@1 in Scenario 3 drops from 20.75% to 10.71%. This sharp drop confirms that the decoder heavily relies on the fine-grained trigger signal to effectively contextualize user intent, acting as a critical "steering wheel" for the generative process.
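Mechanically, this sensitivity test amounts to swapping the authentic trigger embedding for a learned default embedding at decode time; a minimal illustration (function and argument names are hypothetical, reusing the fusion module sketched earlier):

```python
def instruction_embedding(fusion, scenario_emb, trigger_emb,
                          default_trigger_emb, mask_trigger=False):
    """Build the instruction token, optionally masking the trigger signal.

    mask_trigger=True reproduces the 'Masked Trigger (Default)' setting:
    the authentic Trigger Item ID embedding is replaced by a generic default.
    """
    trig = default_trigger_emb if mask_trigger else trigger_emb
    return fusion(scenario_emb, trig)
```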
6.1.3. Post-training of OxygenREC
6.1.3.1. Effectiveness of Synthetic Data
The authors evaluated the effectiveness of using synthetic data for RL post-training, compared to OneRec-v2's online RL approach.
The evaluation of synthetic data is shown in Figure 7.
(Chart: Hit Rate@10 under different synthetic-data ratios; the blue curve is GRPO, the orange curve is SA-GCPO. Both improve as the synthetic-data ratio increases, peaking at 20% synthetic data.)
Figure 7: Evaluation of synthetic data
Analysis:
- The OxygenREC-0.7B MoE model was used as the backbone.
- GRPO showed considerable instability across varying proportions of synthetic data.
- SA-GCPO delivered more consistent results and superior performance compared to GRPO across all synthetic-data proportions, validating the effectiveness of the proposed method and the robustness of SA-GCPO.
6.1.3.2. Effectiveness of proposed SA-GCPO
The SA-GCPO method was compared against GRPO [23] and GSPO [67] (a modified GRPO that computes importance sampling along the sequence). OxygenREC-0.7B was warm-started, with 33% synthetic data.
The following are the results from Table 10 of the original paper:
| Methods | Ratio of synthetic data | HR@1 | HR@10 |
| OxygenREC0.7B-GRPO | 33% | 23.85% | 62.15% |
| OxygenREC0.7B-GSPO | 33% | 24.13% | 62.88% |
| OxygenREC0.7B-SA-GCPO | 33% | 25.58% | 65.95% |
Table 10: Evaluation of proposed SA-GCPO with other methods
Analysis:
- GSPO performs slightly better than GRPO (HR@1 24.13% vs. 23.85%; HR@10 62.88% vs. 62.15%).
- SA-GCPO significantly outperforms both, achieving +1.45pp and +1.73pp in HR@1 over GSPO and GRPO, respectively. For HR@10, SA-GCPO gains more than +3pp over both baselines (65.95% vs. 62.88% and 62.15%). This demonstrates the superior performance of the SA-GCPO RL method, particularly its adaptive smooth gating and real-user-feedback threshold.
6.1.3.3. Ablation study for different settings of Tpos and Tneg
Ablation experiments validated the effect of separate temperature settings for positive and negative advantage samples in SA-GCPO.
The following are the results from Table 11 of the original paper:
| Tpos | Tneg | HR@1 | HR@10 |
| 1.0 | 1.05 | 25.35% | 65.64% |
| 1.0 | 1.0 | 25.48% | 65.95% |
| 1.0 | 0.95 | 25.51% | 66.01% |
Table 11: Ablation study of temperatures set for positive and negative samples of SA-GCPO
Analysis:
- When Tneg < Tpos (e.g., Tpos = 1.0, Tneg = 0.95), model performance becomes more stable and slightly better (HR@1 25.51%, HR@10 66.01%).
- Assigning a lower temperature coefficient to negative samples helps prevent performance collapse during RL training and enhances overall training stability, confirming the benefit of asymmetric temperature control. A hedged sketch of one plausible implementation follows.
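The paper does not spell out where the temperatures enter SA-GCPO; one plausible reading, sketched below purely as an assumption, is that positive- and negative-advantage samples are rescaled by separate temperature coefficients before the clipped policy-gradient update:

```python
import torch

def temperature_scaled_advantages(adv, t_pos=1.0, t_neg=0.95):
    """Apply separate temperature coefficients to positive- and
    negative-advantage samples. This is an assumed reading of SA-GCPO's
    asymmetric temperature control, not the paper's published formula."""
    return torch.where(adv > 0, adv * t_pos, adv * t_neg)

# Example: with t_neg = 0.95, negative advantages are slightly attenuated,
# softening penalty gradients -- one way asymmetric control could stabilize
# RL training.
adv = torch.tensor([1.2, -0.8, 0.3, -2.0])
print(temperature_scaled_advantages(adv))  # tensor([ 1.2000, -0.7600,  0.3000, -1.9000])
```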
6.1.4. Online A/B Test Performance and Industrial Impact
OxygenREC was deployed across three sequentially dependent phases on the JD App, covering the user's entire session lifecycle:
- Phase 1: Interest Triggering (Homepage Floor): Scenarios 1 and 2; high traffic, low latency, visually engaging items to attract clicks.
- Phase 2: Deep Exploration (Feeds Recommendation): Scenarios 3 and 4; entered via homepage clicks, with recommendations based on the "trigger SKU" and user behavior, encouraging prolonged engagement.
- Phase 3: Immediate Conversion (Checkout Path Recommendations): Scenarios 5 and 6 (Add-to-Cart Overlay, Checkout Add-on); targets the transaction process, capitalizing on strong purchase intent for supplementary items.
Rigorous online A/B testing was conducted, with 10% of total traffic allocated to experimental and control groups.
The following are the results from Table 12 of the original paper:
| Phase | Scenario | UCTR | UCTCVR | Order Volume | GMV | Latency |
| Homepage Floor | Scenario 1 | +0.68% | +2.71% | +2.81% | +4.52% | 50ms |
|  | Scenario 2 | +3.55% | +2.26% | +2.21% | +8.40% |  |
| Channel Feeds | Scenario 3* | -0.25% | +7.89% | +8.03% | +1.46% | 80ms |
|  | Scenario 4 | +0.78% | +2.17% | +1.49% | +1.66% |  |
| Checkout Path | Scenario 5 | +0.40% | +4.21% | +4.28% | +11.80% | 50ms |
|  | Scenario 6 | +3.29% | +3.00% | +2.92% | +4.15% |  |
Table 12: Online A/B Test Lift at First Launch
Analysis:
- The generative model achieved statistically significant improvements across all key business metrics (UCTR, UCTCVR, Order Volume, GMV) in all scenarios.
- Homepage Floor (Scenarios 1 & 2): Positive lifts across all metrics, with GMV reaching +8.40% in Scenario 2.
- Channel Feeds (Scenarios 3 & 4): Notably, Scenario 3 shows a slight UCTR drop of -0.25% but a remarkable +7.89% UCTCVR and +8.03% Order Volume, indicating the model prioritizes high-quality, conversion-intent items over shallow clicks.
- Checkout Path (Scenarios 5 & 6): Significant gains in GMV (e.g., +11.80% in Scenario 5) and Order Volume, capitalizing on immediate purchase intent.
- Latency: The system meets strict latency requirements (50ms for Homepage/Checkout, 80ms for Feeds) despite handling billion-scale candidate spaces.
- These results validate that the generative framework effectively translates semantic understanding into tangible business growth across the user's entire shopping lifecycle in real-world industrial settings.
6.2. Ablation Studies / Parameter Analysis
The ablation studies and parameter analyses are thoroughly integrated into the Core Results Analysis section above, specifically under:
- Lightweight Multimodal Design: Evaluates different multimodal encoder configurations.
- Semantic ID Evolution: Compares SID versions (V1-V4) to determine the optimal item representation.
- Generative Backbone Architecture Ablation: Investigates scaling laws and MoE configurations for the encoder-decoder backbone.
- Instruction Token Integration Strategies: Analyzes how the instruction token position affects performance.
- Instruction Component Ablation: Breaks the instruction token down into its Scenario ID and Trigger Item ID components.
- IGR Ablation: Verifies the contribution of Instruction-Guided Retrieval and Q2I alignment.
- Unified Instruction-Following Model vs. Independent SFT Baselines: Compares the unified model against scenario-specific baselines.
- Trigger Instruction Sensitivity Analysis: Confirms the importance of the Trigger Item signal.
- Effectiveness of proposed SA-GCPO: Benchmarks SA-GCPO against GRPO and GSPO.
- Ablation study for different settings of Tpos and Tneg: Analyzes the impact of asymmetric temperature control in SA-GCPO.

These studies collectively demonstrate the individual contributions of each proposed component and validate the overall design choices of OxygenREC.
7. Conclusion & Reflections
7.1. Conclusion Summary
OxygenREC introduces a novel generative recommendation system that successfully integrates deep reasoning capabilities with the stringent latency and scalability demands of industrial deployments. The key innovations are:
- Fast-Slow Thinking Architecture: A near-line LLM pipeline (slow thinking) generates Contextual Reasoning Instructions based on user behavior and context, injecting world knowledge and deductive reasoning. A lightweight encoder-decoder model (fast thinking) then uses these pre-generated instructions for real-time recommendation generation, effectively avoiding online LLM latency.
- Instruction-Following Unification: OxygenREC transforms recommendation into an instruction-following task. Semantic alignment mechanisms, including Instruction-Guided Retrieval (IGR) and the Query-to-Item (Q2I) loss, ensure that user intent (Contextual Reasoning Instructions) and scenario context (Scenario Instructions) precisely control the generation process.
- Multi-Scenario Scalability: By converting scenario information into structured Scenario Instructions and utilizing unified reward mapping with Soft Adaptive Group Clip Policy Optimization (SA-GCPO), OxygenREC achieves a "train-once-deploy-everywhere" paradigm, allowing a single model to adapt to diverse business objectives across multiple scenarios.
- Industrial Impact: Deployed across core JD.com recommendation scenarios, OxygenREC demonstrated significant online A/B test gains in order volume and GMV, validating its robustness, efficiency, and practical value in high-traffic e-commerce environments.
7.2. Limitations & Future Work
The authors identify several promising directions for future research:
- Latency Optimization through Non-Autoregressive Generation:
  - Limitation: The current framework relies on sequential Next Token Prediction (NTP), where decoding latency increases linearly with the required recommendation list length. This fundamentally hinders high-throughput real-time deployment.
  - Future Work: Transitioning to a Non-Autoregressive (NAR) parallel generation paradigm to drastically reduce serving latency and maximize throughput by generating the entire sequence of semantic identifiers simultaneously. This is crucial for maintaining performance as model complexity and knowledge-integration depth increase.
- Multi-Scenario User Trajectory Modeling for Deep Intent Discovery:
  - Limitation: The current instruction system effectively uses immediate context and scene information, but users' true purchase intent often follows a complex decision trajectory spanning multiple distinct scenarios (e.g., Homepage, Search, Cart, Checkout).
  - Future Work: Focus on multi-scenario user trajectory modeling to capture the full context of user behavior. This involves integrating and analyzing cross-scenario sequences to uncover deep-seated user goals and intent evolution. The goal is to upgrade the instruction system to use richer, hierarchical intent signals for the LLM backbone, leading to more precise and long-term-optimal recommendations, further aligned with long-term user value via a robust closed-loop learning mechanism.
7.3. Personal Insights & Critique
OxygenREC presents a highly compelling and practically significant contribution to the field of recommendation systems, particularly for large-scale e-commerce platforms.
Personal Insights:
- Elegance of Fast-Slow Thinking: The Fast-Slow Thinking architecture is an elegant solution to the perennial LLM latency problem in real-time systems. By pre-computing complex deductive-reasoning instructions offline, it leverages LLM power without sacrificing online performance. This pattern could apply broadly to other domains that need LLM intelligence under strict latency budgets, such as intelligent assistants, content moderation, or fraud detection, where complex reasoning feeds real-time decisions.
- Instruction-Following for Unification: Using instruction-following to unify multi-scenario recommendations is powerful. It moves beyond the implicit parameter modulation of Multi-Gate Mixture-of-Experts (MoE) or tower models to explicit control signals, making the model more interpretable and adaptable. This paradigm could generalize to other complex multi-task learning settings where explicit guidance can disentangle objectives.
- Rigorous Engineering Focus: The paper's detailed account of system implementation and optimization (e.g., the PyTorch training framework, xLLM inference, distributed sparse optimization, custom attention kernels) highlights the immense engineering effort required to bring such a model to production. This level of detail is often missing from academic papers but is crucial for real-world impact.
- Synthetic Data for RL Stability: Combining synthetic data with SA-GCPO for RL post-training is a smart way to address the data sparsity and reward hacking issues that arise when relying solely on real-time user feedback, especially in new or data-scarce scenarios.
Critique & Areas for Improvement:
- Proprietary Data Limitation: The reliance on JD.com's proprietary datasets, while demonstrating real-world impact, limits reproducibility and direct comparison by external researchers. Public benchmarks (e.g., Amazon Reviews, Taobao) could have provided broader generalizability.
- LLM Model Specifics: While Qwen3 and DeepSeek-R1 are mentioned, more detail on the size, architecture, and specific prompting strategies of the near-line LLM pipeline would help researchers replicate or adapt this work. The fine-tuning process for LLM_QR and User Intent Reasoning is described, but full prompt templates and few-shot examples would improve transparency.
- Complexity of SA-GCPO: While innovative, the SA-GCPO formulation introduces several new components (soft adaptive function, asymmetric temperature, threshold function for the advantage). A deeper intuitive explanation of how these interact to deliver stability and performance, perhaps with illustrative examples of gradient behavior, would help newcomers.
- Cold Start Problem: The paper focuses on active users with rich historical data. It would be interesting to see how OxygenREC handles cold-start users or cold-start items, given its reliance on user behavior and semantic IDs. The LLM-driven Contextual Reasoning Instructions could offer a strong advantage here, and further analysis would be valuable.
- Interpretability of Instructions: While Contextual Reasoning Instructions are designed to be interpretable, the paper relies primarily on quantitative metrics. Case studies showing how these instructions lead to specific, nuanced recommendations that traditional systems would miss, and how users perceive the improved reasoning, would further strengthen the deep-reasoning claim.

Overall, OxygenREC represents a significant advance in bridging the gap between cutting-edge LLM research and the practical demands of industrial-scale recommendation systems, offering a robust and scalable framework for future development.