OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation
TL;DR Summary
OxygenREC is an e-commerce recommendation system that uses a Fast-Slow Thinking architecture for deep reasoning, addressing inconsistent multi-stage optimization objectives and independent per-scenario training, while enhancing recommendation quality with a semantic alignment mechanism.
Abstract
Traditional recommendation systems suffer from inconsistency in multi-stage optimization objectives. Generative Recommendation (GR) mitigates this through an end-to-end framework; however, existing methods still rely on matching mechanisms based on inductive patterns. Although responsive, they lack the ability to uncover complex user intents that require deductive reasoning based on world knowledge. Meanwhile, LLMs show strong deep reasoning capabilities, but their latency and computational costs remain challenging for industrial applications. More critically, there are performance bottlenecks in multi-scenario scalability: as shown in Figure 1, existing solutions require independent training and deployment for each scenario, leading to low resource utilization and high maintenance costs, a challenge unaddressed in the GR literature. To address these issues, we present OxygenREC, an industrial recommendation system that leverages Fast-Slow Thinking to deliver deep reasoning under the strict latency and multi-scenario requirements of real-world environments. First, we adopt a Fast-Slow Thinking architecture: slow thinking uses a near-line LLM pipeline to synthesize Contextual Reasoning Instructions, while fast thinking employs a high-efficiency encoder-decoder backbone for real-time generation. Second, to ensure reasoning instructions effectively enhance recommendation generation, we introduce a semantic alignment mechanism with Instruction-Guided Retrieval (IGR) to filter intent-relevant historical behaviors and a Query-to-Item (Q2I) loss for instruction-item consistency. Finally, to resolve multi-scenario scalability, we transform scenario information into controllable instructions, using unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align policies with diverse business objectives, realizing a train-once-deploy-everywhere paradigm.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation". The central topic is the development of a generative recommendation system for e-commerce that uses instruction-following capabilities, deep reasoning, and multi-scenario scalability, addressing challenges like latency and resource utilization.
1.2. Authors
The paper lists a large group of authors, all affiliated with JD.com, Beijing, China. This suggests the research is an industrial effort focused on practical applications within a large e-commerce platform. The contact emails provided are {haoxuegang.1, zhangming229, gongpinghua1}@jd.com.
1.3. Journal/Conference
The paper is published as a preprint on arXiv with the identifier arXiv:2512.22386. As a preprint, it has not yet undergone formal peer review by a journal or conference. However, arXiv is a highly influential platform for rapid dissemination of research, especially in fields like artificial intelligence, and many significant papers first appear there. The content suggests it is targeting top-tier AI/ML or Recommender Systems conferences/journals.
1.4. Publication Year
The paper was published on 2025-12-26 (UTC), indicating a very recent release, almost certainly a preprint targeting conferences in early 2026.
1.5. Abstract
The paper addresses two key limitations of existing generative recommendation (GR) systems:
- Limited Reasoning Capabilities: Existing GR methods primarily rely on inductive patterns, struggling with complex user intents that require deductive reasoning and world knowledge. While large language models (LLMs) offer strong reasoning, their latency and computational costs are prohibitive for industrial use.
- Multi-Scenario Scalability Bottlenecks: Current solutions require independent training and deployment for each recommendation scenario (e.g., homepage, search, cart), leading to low resource utilization and high maintenance costs.

To overcome these challenges, the authors propose OxygenREC, an industrial recommendation system built on a Fast-Slow Thinking architecture.

- Fast-Slow Thinking: Slow thinking involves a near-line LLM pipeline that synthesizes Contextual Reasoning Instructions (deductive reasoning based on world knowledge and complex user intents). Fast thinking employs a high-efficiency encoder-decoder backbone for real-time item sequence generation, conditioned on these instructions, without incurring online LLM latency.
- Semantic Alignment: To ensure the reasoning instructions effectively guide generation, OxygenREC uses Instruction-Guided Retrieval (IGR) to filter intent-relevant historical user behaviors and a Query-to-Item (Q2I) loss for consistency between instructions and target items.
- Multi-Scenario Scalability: The framework transforms scenario information into controllable instructions and employs unified reward mapping alongside Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align a single policy with diverse business objectives, enabling a "train-once-deploy-everywhere" paradigm.

The system is deployed at JD.com, showing significant increases in GMV (Gross Merchandise Value) and order volume across multiple core scenarios, demonstrating its flexibility and scalability.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2512.22386.
The PDF link is https://arxiv.org/pdf/2512.22386v1.pdf.
This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
Core Problem: Traditional recommendation systems, especially in large e-commerce platforms, face two significant challenges:
- Limited Deductive Reasoning in Generative Recommendation (GR): While Generative Recommendation (GR) offers an end-to-end framework, existing methods primarily rely on inductive pattern matching from user behavior. They struggle to uncover complex user intents that require deductive reasoning based on world knowledge (common sense, external facts). For example, inferring the need for "moisture-wicking baby sleepwear" for "young parents in Chengdu during winter solstice" goes beyond observed patterns. Large Language Models (LLMs) possess strong reasoning capabilities but are too slow and computationally expensive for real-time industrial deployment.
- Inefficient Multi-Scenario Scalability: E-commerce platforms operate across diverse scenarios (e.g., homepage, product detail page, shopping cart, search results), each with unique user behaviors, business objectives, and latency requirements. Training and deploying independent recommendation models for each scenario is costly in terms of resources (computation, maintenance, development) and leads to low utilization. The GR literature has largely left this multi-scenario scalability challenge unaddressed.
Why is this problem important?
- Enhanced User Experience: Understanding deep user intent through deductive reasoning leads to more relevant and surprising recommendations, moving beyond simple "what you've seen is what you get" patterns. This can significantly improve user satisfaction and engagement.
- Business Impact: More accurate and context-aware recommendations directly translate to higher Click-Through Rates (CTR), Conversion Rates (CVR), Order Volume, and Gross Merchandise Value (GMV), which are critical for e-commerce platforms.
- Operational Efficiency: A unified, scalable system reduces the massive operational overhead associated with managing a multitude of scenario-specific models, freeing up resources for innovation and broader deployment.
Specific challenges or gaps in prior research:
- GR systems are good at end-to-end optimization but remain primarily inductive. LLMs offer powerful reasoning but are too slow and expensive for online inference. The trade-off between deductive knowledge injection and online latency is a major hurdle.
- Multi-scenario recommendation research has mainly focused on discriminative models, often relying on complex, scenario-specific architectures that are not easily adaptable to generative paradigms or that lead to negative transfer issues (where learning for one task harms performance on another). The GR literature lacks effective solutions for train-once-deploy-everywhere deployment across diverse scenarios.
Paper's entry point or innovative idea:
The paper proposes OxygenREC as an industrial generative recommendation system that leverages Fast-Slow Thinking and instruction-following to address these challenges. It aims to inject deep reasoning capabilities without compromising real-time latency and to enable scalable multi-scenario services within a single, unified backbone network.
2.2. Main Contributions / Findings
The paper makes four primary contributions:
- Fast-Slow Thinking Architecture with Deductive Knowledge Injection:
  - Contribution: Introduces a novel Fast-Slow Thinking architecture to inject world knowledge and deductive reasoning into recommendations without adding online latency.
  - Mechanism: A near-line LLM pipeline (slow thinking) synthesizes high-precision Contextual Reasoning Instructions by performing deep intent reasoning. A high-throughput encoder-decoder backbone (fast thinking) then uses these instructions for real-time item sequence generation.
  - Problem Solved: Overcomes the trade-off between LLM reasoning power and strict online latency requirements.
- Semantic Alignment for Effective Instruction Control:
  - Contribution: Develops a semantic alignment mechanism to ensure that the generated reasoning instructions effectively guide the recommendation process.
  - Mechanism: Uses a Query-to-Item (Q2I) loss to map instructions into the item embedding space. This enables Instruction-Guided Retrieval (IGR) to filter out irrelevant historical user behaviors, focusing the model on intent-relevant data.
  - Problem Solved: Ensures that the model's output is tightly controlled by the user's inferred intent, improving precision and reducing noise from past irrelevant interactions.
- Scalable Multi-Scenario Alignment via Instruction and Reinforcement Learning (RL):
  - Contribution: Achieves scalable multi-scenario adaptation within a single generative model backbone.
  - Mechanism: Converts scenario-specific contexts into controllable instructions (scenario instructions), and employs a unified Reward Mapping Service and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align a single policy with diverse business objectives across scenarios.
  - Problem Solved: Realizes a "train-once-deploy-everywhere" paradigm, significantly reducing operational and computational costs by eliminating the need for separate models per scenario.
- Large-Scale Production Deployment:
  - Contribution: Successfully deployed OxygenREC in JD.com's core recommendation scenarios.
  - Mechanism: Built a unified PyTorch-based training framework achieving 40% Model FLOPs Utilization (MFU) and utilized xLLM for high-performance inference serving.
  - Problem Solved: Demonstrates significant gains in order volume and GMV in online A/B tests across multiple core scenarios, proving its practical value, efficiency, and scalability in demanding industrial environments.
Key Conclusions / Findings:
- The Fast-Slow Thinking architecture effectively injects deductive knowledge without online LLM latency.
- Semantic IDs and multimodal fusion are crucial for high-quality item representations.
- The instruction-following mechanism with IGR and the Q2I loss significantly enhances recommendation controllability and accuracy.
- The unified generative model with scenario instructions and SA-GCPO outperforms independent scenario-specific models, enabling efficient multi-scenario adaptation.
- Online A/B tests confirm OxygenREC's ability to drive substantial business growth (GMV and order volume) across diverse scenarios, from Homepage Floor to Checkout Path, demonstrating its robustness and practical impact.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand OxygenREC, a reader should be familiar with the following concepts:
- Recommendation Systems (RS): Software systems that suggest items (products, movies, articles, etc.) to users.
  - Traditional Cascading Methods: Often involve multiple sequential stages (e.g., matching, ranking, re-ranking). Each stage optimizes a local objective, leading to potential objective misalignment and error propagation.
  - Generative Recommendation (GR): A newer paradigm that reformulates recommendation as an end-to-end sequence generation task. Instead of scoring and selecting from a fixed pool, GR models directly generate sequences of item identifiers or their semantic IDs. This allows for global optimization and potentially more novel recommendations.
- Large Language Models (LLMs): Deep learning models with billions of parameters, pre-trained on vast amounts of text data. They excel at understanding, generating, and reasoning with human language.
  - Deep Reasoning Capabilities: LLMs can perform deductive reasoning (applying general rules to specific cases) and inductive reasoning (inferring general rules from specific observations). In recommendations, deductive reasoning can leverage world knowledge to infer complex user intents beyond observed patterns.
  - Latency and Computational Costs: A major challenge with LLMs is their high computational demand, especially during inference (generating output), which results in high latency and cost, making them difficult to deploy in real-time industrial applications.
- Fast-Slow Thinking (Dual-Process Theory): Inspired by cognitive psychology (e.g., Daniel Kahneman's "Thinking, Fast and Slow"), this concept divides cognitive processes into two systems:
  - System 1 (Fast Thinking): Intuitive, automatic, unconscious, and fast. In OxygenREC, this is the high-throughput encoder-decoder backbone for real-time generation.
  - System 2 (Slow Thinking): Deliberative, analytical, conscious, and slow. In OxygenREC, this is the near-line LLM pipeline that synthesizes complex Contextual Reasoning Instructions. The key is to leverage System 2 offline to inform System 1 online, avoiding latency issues.
- Encoder-Decoder Architecture: A common neural network architecture, especially for sequence-to-sequence tasks (like machine translation or text generation).
  - Encoder: Processes the input sequence (e.g., user history, profile, context) and compresses it into a fixed-size context vector or a sequence of hidden states.
  - Decoder: Takes the context vector from the encoder and generates an output sequence one element at a time (auto-regressively). It can also be conditioned on additional inputs.
- Transformer: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that revolutionized sequence modeling. It relies entirely on attention mechanisms rather than recurrent or convolutional layers.
  - Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence when processing each element.
  - Cross-Attention: Used in encoder-decoder Transformers, where the decoder attends to the encoder's output.
- Semantic IDs (SIDs): Instead of using raw item identifiers (which are categorical and lack semantic meaning), SIDs are discrete, semantically rich representations of items. They are learned by mapping items (text, image, features) into a continuous latent space and then quantizing this space into discrete codes. This allows generative models to "generate" items as sequences of SIDs (like words in a sentence).
  - Residual Quantization (RQ-KMeans): A technique to discretize continuous embeddings into hierarchical codes. It quantizes the residual (the error between the original embedding and its quantized version) iteratively, building a coarse-to-fine representation. This balances expressiveness with a compact vocabulary size (a minimal sketch appears at the end of this list).
- Contrastive Learning: A self-supervised learning paradigm where the model learns representations by pushing "similar" (positive) samples closer together in the embedding space and "dissimilar" (negative) samples farther apart. In OxygenREC, it's used for Item-to-Item (I2I) alignment and multi-source alignment for SIDs.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize a cumulative reward.
  - Policy: The agent's strategy for choosing actions given a state.
  - Reward: A numerical signal indicating the desirability of an action.
  - Policy Optimization: Algorithms (e.g., Proximal Policy Optimization (PPO), Clipped Policy Optimization (CPO)) used to update the agent's policy based on observed rewards.
  - Importance Sampling: A technique used in RL to estimate the expectation of a function under one distribution, given samples from another distribution. This is crucial for off-policy RL, where the data is collected using an old policy ($\pi_{\theta_{\mathrm{old}}}$) but used to update a new policy ($\pi_\theta$).
- Mixture-of-Experts (MoE): A neural network architecture where different "expert" sub-networks specialize in different parts of the input space. A gating network learns to route each input to one or a few relevant experts. This allows models to scale to trillions of parameters while only activating a small subset for each input, improving efficiency.
- KV-Cache (Key-Value Cache): In Transformer decoders, the key and value vectors from previous tokens are often cached to avoid recomputing them at each auto-regressive decoding step, significantly speeding up inference, especially for long sequences.
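As a concrete illustration of the Residual Quantization entry above, here is a minimal PyTorch sketch of RQ-style encoding. The codebooks are random stand-ins for ones learned by k-means over item embeddings, the function name is ours, and the depth-3 / 8,192-codes-per-level configuration mirrors what OxygenREC reports in Section 4.2.1.2.

```python
import torch

def rq_encode(embedding, codebooks):
    """Assign hierarchical semantic IDs via residual quantization.

    At each level, pick the codeword nearest to the current residual,
    then subtract it so the next (finer) level quantizes what remains.
    """
    residual = embedding.clone()
    codes = []
    for codebook in codebooks:                       # coarse-to-fine levels
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)
        idx = int(torch.argmin(dists))               # nearest codeword
        codes.append(idx)
        residual = residual - codebook[idx]          # carry the residual forward
    return tuple(codes)

# Depth 3, vocabulary 8,192 per level, 256-d embeddings (paper's configuration).
codebooks = [torch.randn(8192, 256) for _ in range(3)]
item_embedding = torch.randn(256)
sid = rq_encode(item_embedding, codebooks)           # e.g., (4210, 771, 6033)
```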
3.2. Previous Works
The paper discusses related work in three main categories:
- Generative Recommendation (GR):
  - Early GR: Used raw textual item IDs (e.g., product names or short descriptions) as tokens.
  - Modern GR: Shifted towards Semantic IDs (SIDs) [27, 29, 53, 54], which are discrete, semantically meaningful representations. RQ-VAE [48] and residual quantization [22] create hierarchical codes for SIDs to balance expressiveness and vocabulary size.
  - Unified Search and Recommendation: Works like SynerGen [25] and IntSR [64] explored unifying search and recommendation, but they are often built on traditional ranking models rather than semantic ID-based generation.
  - LLM-integrated GR (e.g., the RecGPT series [4, 5]): These approaches integrate LLMs for enhanced reasoning, but often use offline LLM inference for reasoning and rely on simple ANN-based dual-tower retrieval for online serving, which OxygenREC argues limits the full leverage of LLM reasoning in real time. OxygenREC instead directly integrates LLM reasoning outputs (Contextual Reasoning Instructions) into a generative backbone, aiming for deeper integration.
  - Architectures: Transformer-based encoder-decoder or decoder-only architectures are common [41, 58, 62], offering throughput advantages.
  - Limitation addressed by OxygenREC: Most GR methods are fundamentally inductive, lacking deductive reasoning for complex user intents.
- Large Language Models for Recommendation (LLM4Rec):
  - Direct LLM Usage [12, 24, 39]: Some approaches directly use LLMs as the recommendation backbone, leveraging their world knowledge and reasoning abilities. For example, OneRec-think [39] uses Chain-of-Thought (CoT) reasoning for interpretability.
  - Limitation addressed by OxygenREC: The primary challenge for LLM4Rec in industrial settings is severe latency and cost constraints due to the computational overhead of auto-regressive decoding with billion-scale parameters [69]. OxygenREC addresses this with its Fast-Slow Thinking architecture, keeping the LLM part near-line.
- Multi-Scenario Recommendation (MSR) and Controllability:
  - Traditional MSR for Discriminative Ranking Models: Approaches like Multi-Gate Mixture-of-Experts (MMoE) [40], STAR [51], and PEPNet [9] introduce specialized architectural interventions (gating, star-topology units, routing mechanisms) to mitigate negative transfer (when learning for one task harms another) in discriminative ranking models.
  - Limitation addressed by OxygenREC: These traditional MSR methods introduce structural complexity, implicit parameter modulation, and often require separate model instances, increasing operational costs. Adapting them directly to GR is challenging. Most existing GR multi-scenario approaches [25] are exploratory and don't fully solve the trade-off between unified modeling and scenario-specific adaptation. OxygenREC aims for a train-once-deploy-everywhere paradigm using explicit scenario instructions and RL-based alignment.
3.3. Technological Evolution
The field of recommendation systems has evolved significantly:
- Early Systems (Rule-based, Collaborative Filtering): Simple rules or user-item similarity based on past interactions.
- Traditional Multi-Stage Cascaded Pipelines: Introduced deep learning, but typically involved fragmented optimization (e.g., embedding -> matching -> ranking -> re-ranking stages) [15, 16, 71]. This led to objective misalignment and error propagation.
- Generative Recommendation (GR): Emerged as an end-to-end paradigm, treating recommendation as sequence generation [36, 68]. This unified approach aimed for global optimization. Key advancements included Semantic IDs [27, 53, 54] and Transformer-based architectures [41, 62].
- Integration of Large Language Models (LLMs): With the rise of powerful LLMs, researchers began exploring their use in recommendation for richer understanding and reasoning [12, 24, 39].
  - Challenge: Directly using LLMs online for real-time recommendations suffered from prohibitive latency and computational costs.
  - Challenge: GR systems, while unified, often still relied on inductive pattern matching and lacked deductive reasoning capabilities based on world knowledge.
  - Challenge: Scaling GR across diverse multi-scenario industrial environments remained largely unaddressed, leading to high operational costs.

OxygenREC fits into this evolution by attempting to bridge these gaps. It represents a step forward from basic GR and LLM4Rec by specifically tackling the latency problem of LLMs for deductive reasoning via Fast-Slow Thinking, and the scalability problem of GR across multiple scenarios via instruction-following and RL-based multi-scenario alignment. It aims to bring LLM-level reasoning to industrial GR without the typical performance bottlenecks, moving towards a more intelligent and efficient recommendation paradigm.
3.4. Differentiation Analysis
OxygenREC differentiates itself from previous works in several key areas:
- Deductive Reasoning with Latency Constraints (Fast-Slow Thinking):
  - Prior Work: LLM-based recommendation systems (e.g., OneRec-think [39], RecGPT [4]) leverage LLMs for reasoning, but often struggle with the high latency and computational costs of deploying LLMs online for real-time inference. Some offline LLM reasoning approaches still rely on simpler retrieval mechanisms online.
  - OxygenREC's Innovation: It directly tackles this latency-reasoning trade-off with its Fast-Slow Thinking architecture. The "slow" part (near-line LLM pipeline) generates Contextual Reasoning Instructions offline, injecting deductive world knowledge. The "fast" part (high-throughput encoder-decoder backbone) uses these pre-computed instructions for real-time generation, ensuring deep reasoning without online LLM calls.
- Multi-Scenario Scalability ("Train-Once-Deploy-Everywhere"):
  - Prior Work: Multi-scenario recommendation (MSR) research largely focused on discriminative ranking models (e.g., STAR [51], PEPNet [9]), often requiring complex scenario-specific towers, gating mechanisms, or separate model instances, leading to high operational costs and negative transfer issues. The GR literature had limited solutions for multi-scenario scalability.
  - OxygenREC's Innovation: It unifies diverse recommendation tasks into a single generative framework using instruction-following. Scenario information is transformed into explicit controllable instructions for the backbone. Unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) enable a single policy to align with various business objectives across scenarios, achieving a true train-once-deploy-everywhere paradigm.
- Effective Instruction Control and Semantic Alignment:
  - Prior Work: While instruction-following is a core capability of LLMs, its application to GR has been "long overlooked," according to the authors. Ensuring instructions effectively guide generation, especially with complex user history, is a challenge.
  - OxygenREC's Innovation: It introduces semantic alignment mechanisms. Instruction-Guided Retrieval (IGR) filters intent-relevant historical behaviors based on the instruction, reducing noise. The Query-to-Item (Q2I) loss explicitly aligns instruction embeddings with target item embeddings, ensuring strong control over the generated output.
- Robust Industrial Deployment:
  - Prior Work: Many research papers propose models but lack the engineering optimizations for large-scale, high-traffic industrial deployment.
  - OxygenREC's Innovation: Beyond the model architecture, it includes significant system implementation and optimization details, such as a unified PyTorch-based training framework with 40% MFU, xLLM-based inference optimization (xGR, xSchedule, xAttention, xBeam, Prefix-Constrained Decoding), and a robust near-line inference deployment strategy for reasoning instructions. This focus on practical deployment and significant online A/B test lifts sets it apart as a production-ready system.
4. Methodology
4.1. Principles
The core idea behind OxygenREC is to integrate the deep reasoning capabilities of LLMs into a generative recommendation system for e-commerce while simultaneously addressing the real-time latency constraints and multi-scenario scalability requirements of industrial applications. This is achieved through two main principles:
- Fast-Slow Thinking: This principle separates the complex, computationally intensive deductive reasoning (the "slow" part) from the real-time item generation (the "fast" part). The "slow" LLM pipeline operates near-line (not during live user requests) to synthesize high-quality Contextual Reasoning Instructions. These instructions then guide a high-efficiency encoder-decoder backbone for fast, real-time recommendation generation, effectively pre-computing LLM insights to avoid online LLM latency.
- Instruction-Following Unification: The system unifies diverse recommendation tasks and scenarios into a single generative model by treating them as instruction-following tasks. Both user intent (deduced by the LLM) and scenario context are encoded as explicit instructions that steer the model's generation process. This enables a "train-once-deploy-everywhere" paradigm, reducing operational costs and leveraging synergistic knowledge transfer across scenarios.
4.2. Core Methodology In-depth (Layer by Layer)
OxygenREC unifies various recommendation tasks into a single Instruction-Following Generative paradigm, addressing both LLM reasoning integration and multi-scenario adaptation.
The overall architecture of OxygenREC is illustrated in Figure 2.
(Image: Schematic of the overall OxygenREC architecture. The left side shows the instruction-based framework and multimodal quantized representations; the right side shows the generation pipeline for Contextual Reasoning Instructions and the reward mapping mechanism for multi-scenario alignment.)
Figure 2: The Overall Architecture of OxygenREC. (a) Instruction Following Framework: A transformer-based encoder-decoder backbone that generates semantic item sequences conditioned on specific instructions. (b) Multimodal Quantized Representations: Items are tokenized as multimodal semantic IDs via residual quantization of contrastively trained embeddings, enabling compact and expressive item representations. (c) Contextual Reasoning Instructions: A near-line LLM pipeline that analyzes user behavior and context to synthesize such instructions, bridging the gap between inductive patterns and deductive reasoning. (d) Multi-Scenario Alignment: We achieve a "train-once-deploy-everywhere" workflow by coupling scenario instructions with RL-based alignment.
The system workflow consists of four stages:
- Model Input Representation: Integrates user profiles, LLM-driven reasoning intents, multimodal Semantic IDs (SIDs) for items, and contextual inputs for scenario information.
- Instruction-Following Pre-training: The model learns to follow instructions using a multi-task objective combining Next Token Prediction (NTP) and semantic alignment.
- Post-training with Multi-Scenario Alignment: Refines the model using Reinforcement Learning (RL) with a reward mapping service and a novel policy optimization strategy.
- Multi-Scenario Serving: Deploys a single model across diverse scenarios, using prefix-constrained beam search to enforce scenario-specific rules.
4.2.1. LLM-driven User and Item Inputs
4.2.1.1. Model Input Representation
OxygenREC adopts an encoder-decoder architecture. The encoder processes user-side inputs into a latent space, and the decoder generates recommendation sequences conditioned on instructions.
- Encoder Input ($X$): Integrates three key sources:
  - User Profile: Static attributes like demographics.
  - User Behavior: Split into short-term (real-time interests) and long-term sequences. For long-term history, Instruction-Guided Retrieval (IGR) is used. The instruction acts as a query to retrieve only relevant past actions, enhancing instruction controllability and efficiency for deep user understanding.
  - Contextual Reasoning Instructions ($I_r$): Explained below.
- Decoder Input: Conditioned on the encoded user representation and a composite instruction prompt. This prompt combines two signals:
  - Scenario Instructions ($I_s$): For domain control (e.g., specific rules for a homepage vs. a cart scenario).
  - Contextual Reasoning Instructions ($I_r$): For deductive intent guidance (e.g., inferring user needs based on world knowledge).
  These instructions collectively steer the auto-regressive generation of the target item sequence.
4.2.1.2. Multimodal Quantized Item Representations
As shown in Figure 2 (b), OxygenREC constructs a unified vocabulary using Multimodal Semantic IDs (SIDs).
- A multimodal item encoder is trained via a contrastive Item-to-Item (I2I) objective [45, 46] on large-scale item pairs derived from cross-scenario co-occurrence behaviors.
- Each item is represented by textual metadata and product images, processed by separate encoders.
- A lightweight fusion module employs modality-specific projections to map inputs into a shared space, followed by Q-Former [34] and MLP layers for cross-modal interactions.
- The resulting 256-dimensional embeddings are discretized using the RQ-KMeans scheme [22]. This residual quantization process assigns each item a tuple of discrete codes (semantic IDs) in a coarse-to-fine manner.
- A hierarchical structure with a depth of 3 and a vocabulary size of 8,192 per level is used, creating a compact and expressive semantic ID space for auto-regressive generation.
4.2.1.3. Contextual Reasoning Instructions
As illustrated in Figure 3, this component integrates world knowledge and deductive reasoning while avoiding online LLM latency by operating near-line.
(Image: Schematic of the Contextual Reasoning Instructions generation pipeline. It contains three branches: the first generates instructions from spatiotemporal and user-profile signals, the second generates results from user behavior sequences, and the third rewrites recent queries; the outputs are finally merged to form the Instructions and Reasons.)
Figure 3: Overview of our Contextual Reasoning Instructions pipeline. Two parallel branches generate contextual instructions and reasons from spatiotemporal profile signals and recent user behavior sequences. In parallel, $LLM_{QR}$ rewrites noisy or truncated recent queries, and all outputs are combined to form the final Instructions and Reasons for downstream generation.
The core goal is to use LLMs to build a controllable and interpretable intermediate instruction layer. This layer acts as an explicit semantic bridge between raw user behaviors and downstream models, improving intent alignment and system stability. The system transforms complex recommendation signals into a reasoning process based on multi-signal input instructions, which combine spatiotemporal context, user profile, and historical behavior into text-based Instructions and corresponding Reasons.
- Spatiotemporal and Profile Reasoning: This module infers user intent from the external environment and static traits. It covers three aspects:
  - Event-driven reasoning: Identifies holidays or seasonal features (e.g., "winter solstice") and infers related shopping intent (e.g., "frozen dumplings").
  - Profile-driven reasoning: Uses static user traits (gender, age, spending power) to draw personalized product preference conclusions.
  - Temporal-Profile Fusion reasoning: Combines local culture with real-time weather/time/localization needs (e.g., a "Bosideng men's long down jacket" for Beijing's dry, sub-zero winter). In practice, the JoyAI LLM generates these instructions, stored hierarchically by "time-location-persona" for quick retrieval.
- User Query Rewrite ($LLM_{QR}$): Addresses incomplete or noisy real user queries (e.g., "foam case", "wine gift"). $LLM_{QR}$ performs semantic completion and normalization. It is trained on identity-preserving data and rewrite-required data (synthetic error samples). Supervised fine-tuning on Qwen3-0.6B achieves a 95.33% pass rate in human evaluation.
User Intent Reasoning: Given user behavior sequences and preferences, the model generates
intent-matched instructionsandreasoning reasons. The rationale explains the recommendation logic and helps uncover deeper motivations from noisy behaviors.- Data Refining and Auto-labeling Pipeline: To address the lack of explicit rationale labels in real logs,
LLM_QRmaps queries to normalized queries.Qwen3-32Baggregates/deduplicates these for standard intent targets.Qwen3-32Balso performs alignment filtering on the full behavior sequence to keep only relevant subsequences. Finally,DeepSeek-R1automatically generates rationales as pseudo labels. After supervised fine-tuning onQwen3-0.6B, the model achieves a 72% usability rate in human evaluation for intent-aligned instructions and reasons.
- Data Refining and Auto-labeling Pipeline: To address the lack of explicit rationale labels in real logs,
4.2.2. Instruction-Following Unified Pre-training
This stage injects deductive knowledge and enables multi-scenario adaptation.
4.2.2.1. Instruction Framework Design
Recommendation is reformulated as an instruction-following generation task, $P(Y \mid X, I_s, I_r)$ (consistent with the formulations in Table 1). The backbone learns to dynamically adjust its generation distribution based on the provided scenario instructions ($I_s$) and contextual reasoning instructions ($I_r$).
4.2.2.2. Dual-Instruction Formulation
The instruction prompt consists of Scenario Instructions ($I_s$) and Contextual Reasoning Instructions ($I_r$).

- Scenario Instructions ($I_s$): Specify the scenario context for controllable generation.
  - Scenario Information: Includes a scenario ID and contextual signals (e.g., Homepage, Channel Feeds). Guides the generation style and candidate/item distribution.
  - Optional Trigger Item: Provides local context (e.g., a channel entry item, or the main item for I2I recommendations). Only available in scenarios like Channel Feeds and I2I. $I_s$ allows one backbone to serve multiple scenarios by adapting generation style and item distribution.
- Contextual Reasoning Instructions ($I_r$): A dense embedding projected from a textual instruction via an adapter.
  - Online Inference: The textual instruction is synthesized by the near-line LLM pipeline (Section 2.2.3).
  - Training: For search data, the rewritten/normalized user query serves as a natural textual source for $I_r$. For recommendation scenarios without textual instructions, a default (learnable) instruction embedding is used. This dual approach provides a high-quality data source and improves robustness if LLM instructions are missing online.
4.2.2.3. Generative Backbone with IGR
OxygenREC uses an encoder-decoder architecture similar to OneRec [68], but the decoder is augmented with the instruction prompt ($I_s$, $I_r$) as conditional input for auto-regressive generation guided by latent intent and scenario context.
To enhance controllable generation, Instruction-Guided Retrieval (IGR) filters long-term user history, selecting the interactions most relevant to the instruction prompt. This aligns the input context with the control signals, leading to more precise outputs. As illustrated in Figure 2 (a), IGR consists of three components:
- Adapter: Projects instructions and items into a shared embedding space.
- Q2I Alignment: Uses the ground-truth target to supervise instruction-item similarity during training.
- IGR: Performs top-K retrieval at inference time using the aligned query embedding.
Adapter Mechanism for Feature Mapping:
The adapter layers [26] map the instruction-driven query and history items into a comparable shared embedding space.
- The textual instruction (from the near-line LLM pipeline) is encoded by the same text encoder used for item texts.
- The embeddings are defined as:
$
\begin{aligned}
\mathbf{e}_q &= \mathrm{Concat}\left[ \phi_{\mathrm{scn}}(I_s),\; g^{\mathrm{train}}(I_r^{\mathrm{text}}) \right] \\
\mathbf{e}_t &= \mathrm{Concat}\left[ \phi_{\mathrm{item}}(\nu_t),\; \phi_{\mathrm{side}}(u_t),\; g^{\mathrm{train}}(x_t) \right] \\
\mathbf{e}_h &= \mathrm{Concat}\left[ \phi_{\mathrm{item}}(\nu_h),\; \phi_{\mathrm{side}}(u_h),\; g^{\mathrm{frozen}}(x_h) \right] \\
\mathbf{q} &= \psi_q(\mathbf{e}_q), \quad \mathbf{t} = \psi_i(\mathbf{e}_t), \quad \mathbf{h} = \psi_i(\mathbf{e}_h)
\end{aligned}
$
- $\mathbf{e}_q$: The raw query embedding, concatenated from the scenario information embedding $\phi_{\mathrm{scn}}(I_s)$ and the textual reasoning instruction embedding $g^{\mathrm{train}}(I_r^{\mathrm{text}})$.
- $\mathbf{e}_t$: The raw target item embedding, concatenated from the item ID embedding $\phi_{\mathrm{item}}(\nu_t)$, side-information features embedding $\phi_{\mathrm{side}}(u_t)$, and textual description embedding $g^{\mathrm{train}}(x_t)$.
- $\mathbf{e}_h$: The raw history item embedding, similarly concatenated from the item ID embedding $\phi_{\mathrm{item}}(\nu_h)$, side-information features embedding $\phi_{\mathrm{side}}(u_h)$, and textual description embedding $g^{\mathrm{frozen}}(x_h)$.
- $\psi_q$: A projection network that maps the raw query embedding into the shared query embedding space, resulting in $\mathbf{q}$.
- $\psi_i$: A projection network that maps raw item embeddings ($\mathbf{e}_t$ for the target, $\mathbf{e}_h$ for history) into the shared item embedding space, resulting in $\mathbf{t}$ and $\mathbf{h}$ respectively.
- $I_s$: Scenario instructions, potentially including scenario information and a trigger item (using $Z_{\mathrm{def}}$ if absent).
- $I_r^{\mathrm{text}}$: Textual contextual reasoning instruction.
- $\nu_t, \nu_h$: Item IDs for target and history items.
- $u_t, u_h$: Side-information features for target and history items.
- $x_t, x_h$: Textual descriptions for target and history items.
- $g^{\mathrm{train}}$: Trainable text encoder.
- $g^{\mathrm{frozen}}$: Frozen text encoder for long-term history, used to reduce computation. Gradients are not backpropagated through the long-history branch, reducing overhead.
Q2I Alignment: To make the query embedding and history item embeddings comparable, $\mathbf{q}$ is aligned with the target item embedding $\mathbf{t}$ (available during training). This ensures $\mathbf{q}$ can retrieve relevant history interactions at serving time, when the ground-truth target is absent. For a batch of size $B$ with normalized query embeddings $\mathbf{Q} = \{q_i\}_{i=1}^{B}$ and normalized target item embeddings $\mathbf{T} = \{t_i\}_{i=1}^{B}$, the auxiliary objective is:
$
\mathcal{L}_{\mathrm{Q2I}} = \underbrace{ -\frac{1}{B} \sum_{i=1}^{B} q_i \cdot t_i }_{\mathrm{Alignment}} + \lambda_r \underbrace{ \left( -\log\left[ \mathrm{Var}(\mathbf{Q}) \cdot \mathrm{Var}(\mathbf{T}) \right] \right) }_{\mathrm{Regularization}} + \lambda_d \underbrace{ \frac{1}{B^2 - B} \sum_{i \neq j} (q_i^\top q_j)^2 }_{\mathrm{Decorrelation}}
$
- $\mathcal{L}_{\mathrm{Q2I}}$: The Query-to-Item (Q2I) alignment loss.
- Alignment term: Maximizes the cosine similarity (dot product of normalized vectors) between each query embedding $q_i$ and its corresponding target item embedding $t_i$ within a batch of size $B$. This pulls relevant query-item pairs closer.
- Regularization term (weighted by $\lambda_r$): Encourages variance in the embeddings of $\mathbf{Q}$ and $\mathbf{T}$ across dimensions within the batch, helping to prevent dimensional collapse (where embeddings collapse to a low-dimensional subspace).
- Decorrelation term (weighted by $\lambda_d$): Minimizes the squared dot product between different query embeddings $q_i$ and $q_j$ in the batch. This encourages decorrelation between different queries, reducing embedding redundancy.
- $\lambda_r, \lambda_d$: Hyperparameters controlling the strength of the regularization and decorrelation terms.
- $\mathrm{Var}(\mathbf{Q}), \mathrm{Var}(\mathbf{T})$: The average variance (across embedding dimensions) within the batch for query embeddings $\mathbf{Q}$ and target item embeddings $\mathbf{T}$, respectively.
IGR (Instruction-Guided Retrieval):
With the aligned embedding space, IGR uses the query embedding $\mathbf{q}$ to retrieve the top-K most relevant interactions from the long-term history (in the $\mathbf{h}$ space). This reduces noise and ensures the model focuses on the user's current request.
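A minimal PyTorch sketch of both steps, assuming "Var" in the regularizer means the mean per-dimension batch variance (one plausible reading) and using illustrative λ values: the Q2I loss used during training, and the top-K lookup IGR performs at inference.

```python
import torch
import torch.nn.functional as F

def q2i_loss(q, t, lambda_r=0.1, lambda_d=0.05):
    """Q2I objective: alignment + variance regularization + decorrelation.

    q, t: (B, D) query and target-item embeddings for one batch.
    """
    q, t = F.normalize(q, dim=-1), F.normalize(t, dim=-1)
    B = q.size(0)
    # Alignment: pull each query toward its own target item.
    align = -(q * t).sum(dim=-1).mean()
    # Regularization: keep per-dimension batch variance from collapsing.
    reg = -torch.log(q.var(dim=0).mean() * t.var(dim=0).mean())
    # Decorrelation: penalize similarity between different queries.
    gram = q @ q.T
    off_diag = gram - torch.diag(torch.diag(gram))
    decorr = (off_diag ** 2).sum() / (B * B - B)
    return align + lambda_r * reg + lambda_d * decorr

def instruction_guided_retrieval(q, history, k=50):
    """IGR at serving time: top-K history items most similar to the query.

    q: (D,) aligned query embedding; history: (N, D) aligned history items.
    """
    q, history = F.normalize(q, dim=-1), F.normalize(history, dim=-1)
    scores = history @ q                          # cosine similarity per item
    return torch.topk(scores, min(k, history.size(0))).indices

# Training: align query and target embeddings.
loss = q2i_loss(torch.randn(32, 128, requires_grad=True), torch.randn(32, 128))
# Inference: filter a long-term history of 2,000 items down to 50.
kept = instruction_guided_retrieval(torch.randn(128), torch.randn(2000, 128))
```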
4.2.2.4. Instruction-Following Pre-training: Data Mixture, Signals, and Objectives
- Data Mixture: The pre-training data is a mixture of search and multiple recommendation scenarios (e.g., Homepage, Channel Feeds, I2I related recommendations).
  - Benefits: Enables unified modeling of user trajectories, expands usable supervision (positive feedback is sparse in e-commerce, so mixing search and recommendation improves coverage), and improves robustness when instructions might be missing online.
- Training Signals and Scenario-Specific Formulation:
  - Scenario information ($s$) is always included as part of $I_s$.
  - Some scenarios provide an optional trigger item ($Z$).
  - For $I_r$, search data provides an observed query. For recommendation scenarios, a default (learnable) instruction embedding ($I_{\mathrm{def}}$) is used when a near-line instruction is unavailable during training.

The following are the results from Table 1 of the original paper:
| Scenario | Scenario Info | Trigger Item | Contextual Reasoning | Formulation |
| --- | --- | --- | --- | --- |
| Search | ✓ | $Z_{\mathrm{def}}$ | query-derived | $P(Y \mid X, I_s(s, Z_{\mathrm{def}}), I_r(q))$ |
| Homepage | ✓ | $Z_{\mathrm{def}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{def}}), I_{\mathrm{def}})$ |
| Channel Feeds | ✓ | $Z_{\mathrm{entry}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{entry}}), I_{\mathrm{def}})$ |
| Related Rec. (Item-to-Item) | ✓ | $Z_{\mathrm{main}}$ | default emb. | $P(Y \mid X, I_s(s, Z_{\mathrm{main}}), I_{\mathrm{def}})$ |
Table 1: Training signals and scenario-specific conditional formulations. Scenario information ($s$) is always available. The trigger item is denoted by $Z$ (using $Z_{\mathrm{def}}$ when absent). $I_r(q)$ denotes the contextual reasoning instruction embedding projected from the (rewritten/normalized) query, while $I_{\mathrm{def}}$ denotes a default learnable embedding used when the textual instruction is unavailable in training.
- Joint Learning Objectives:
The overall training objective combines generation accuracy with query-to-item alignment:
$
\mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \lambda \mathcal{L}_{\mathrm{Q2I}}
$
- $\mathcal{L}$: The total joint learning objective.
- $\mathcal{L}_{\mathrm{NTP}}$: The Weighted Next Token Prediction loss, which is the primary objective for sequence generation.
- $\mathcal{L}_{\mathrm{Q2I}}$: The Query-to-Item alignment loss, as described above, ensuring semantic consistency.
- $\lambda$: A hyperparameter balancing the two loss components.

Weighted Next Token Prediction ($\mathcal{L}_{\mathrm{NTP}}$): Optimizes the auto-regressive likelihood of the target item sequence. Higher weights are assigned to conversion-related tokens (e.g., Purchase > Cart > Click) to prioritize high-value user behaviors.
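A minimal PyTorch sketch of the weighted NTP term and the joint objective: a token-level cross-entropy whose per-token weights upweight conversion-related behaviors. The concrete weight values and λ are illustrative (the paper only states the ordering Purchase > Cart > Click), and q2i_loss refers to the sketch given earlier.

```python
import torch
import torch.nn.functional as F

# Illustrative behavior weights; only the ordering comes from the paper.
BEHAVIOR_WEIGHTS = {"click": 1.0, "cart": 2.0, "purchase": 4.0}

def weighted_ntp_loss(logits, targets, token_weights):
    """Token-level cross-entropy, upweighting conversion-related tokens.

    logits: (B, T, V) decoder outputs; targets: (B, T) SID tokens;
    token_weights: (B, T) per-token behavior weights.
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (token_weights * ce).sum() / token_weights.sum()

# Joint pre-training objective L = L_NTP + lambda * L_Q2I,
# reusing the q2i_loss sketched in the IGR subsection above.
def joint_loss(logits, targets, token_weights, q, t, lam=0.2):
    return weighted_ntp_loss(logits, targets, token_weights) + lam * q2i_loss(q, t)
```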
4.2.3. Post-training with Multi-Scenario Alignment
The post-training phase adopts a multi-scenario modeling approach, unlike single-scenario fine-tuning. It involves a reward mapping system and RL-based post-training, designed for various tasks across different scenarios.
The post-training process is illustrated in Figure 4.
(Image: Schematic of OxygenREC's inference stage and reward service. Data from different scenarios flows into OxygenREC, which combines user information and sequential features to generate recommendations. The model obtains reward scores from the unified ranking model and couples them with the policy learning algorithms (SFT and SA-GCPO) for effective multi-scenario recommendation.)
Figure 4: The post-training process of OxygenREC
4.2.3.1. Post-training Framework
- Reward Service: An online reward mapping system that includes a unified ranking model service and other multi-task rewards.
- Inference Stage: Features based on user history and sequential behavior are fed into the policy model for generation, producing multiple candidate items. The unified ranking model acts as an online reward service, returning a score for each requested item.
- Policy Learning: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are conducted using Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to refine the policy model. This process iterates until convergence.
4.2.3.2. Multi-Scenario Adaptation
To avoid the cost of separate GR models and ranking services for each scenario, OxygenREC uses a unified ranking model as a centralized reward model service and conducts post-training collectively on data from various scenarios.
- Scenario-Aware Reward Mapping: The RL stage uses rewards tailored to specific recommendation scenarios. The total reward is a weighted combination of the following components (a sketch of this weighted combination appears at the end of this subsection):
  - Format Reward: Penalizes structural errors in semantic ID output.
  - Relative Reward: Rewards items relevant to the user's immediate context and query (based on Q2I semantic relationships).
  - Ranking Reward: Rewards sequences that maximize business objectives like GMV or Conversion Rate.
  - Diversity Reward: Evaluates the diversity of generated items.
- Unified Ranking Model: A novel unified multi-scenario ranking model serves as the core of the Reward Mapping Service.
  - Traditional MSR models (e.g., STAR [51], PEPNet [9]) use complex structures to mitigate negative transfer. OxygenREC instead proposes a unified list-wise approach for both offline training and online learning.
  - It constructs token representations from heterogeneous user features and processes them through a shared Transformer-based feature extraction block for cross-scenario feature interaction and consistent scaling effects.
  - An adaptive modeling mechanism addresses input feature heterogeneity. The comparison of model architectures is shown in Figure 5.
(Image: Schematic comparing traditional single-scenario ranking models, traditional multi-scenario ranking models, and the unified ranking model. The left side shows the feature extraction structures of the single- and multi-scenario models; the right side shows the unified model's Transformer feature extraction block.)
Figure 5: Comparison of model architectures: traditional ranking models vs. our unified ranking model.
- Multi-scenario training samples: Created using a label packing strategy, transforming point-wise samples into list-wise samples arranged chronologically. A customized causal masking mechanism models user behavior trajectories explicitly, maximizing conversion gains across the user path.
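As referenced in the reward-mapping list above, here is a minimal sketch of the scenario-aware total reward: a weighted combination of the four components, with per-scenario weights. All weight values, scenario keys, and type names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    format: float     # structural validity of the generated semantic IDs
    relative: float   # Q2I relevance to the user's immediate context
    ranking: float    # business objectives (e.g., GMV, conversion)
    diversity: float  # variety among generated items

# Hypothetical per-scenario weightings for illustration only.
SCENARIO_WEIGHTS = {
    "homepage":      RewardWeights(1.0, 0.5, 2.0, 0.5),
    "channel_feeds": RewardWeights(1.0, 1.0, 1.5, 1.0),
}

def total_reward(scenario, r_format, r_relative, r_ranking, r_diversity):
    """Combine component rewards with scenario-specific weights."""
    w = SCENARIO_WEIGHTS[scenario]
    return (w.format * r_format + w.relative * r_relative
            + w.ranking * r_ranking + w.diversity * r_diversity)
```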
4.2.3.3. Reinforcement Learning
The paper introduces Soft Adaptive Group Clip Policy Optimization (SA-GCPO) for preference alignment during post-training.
- Motivation: Existing methods like GRPO [23], ECPO (OneRec-v1 [68]), and gradient truncation (OneRec-v2 [69]) use hard clipping strategies, which can lead to sample inefficiency, discontinuous gradients, and instability, especially in multi-environment unified training. SA-GCPO instead adopts a soft adaptive function for importance sampling weights.
- Key Idea: It incorporates reward scores from real user behavior as a threshold to distinguish positive and negative advantage samples and applies an asymmetric temperature control mechanism (different temperature coefficients for positive and negative samples). This accelerates gradient decay for negative samples, mitigating gradient diffusion and instability.
SA-GCPO Formulation: Given each sample $x$, for a group of $G$ items $\{y_i\}_{i=1}^{G}$ generated from the behavior policy $\pi_{\theta_{\mathrm{old}}}$, the optimization objective is:
$
\mathcal{J}_{\mathrm{SA-GCPO}}(\theta) = \mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} f_{i,t}\big(r_{i,t}(\theta)\big)\, \Gamma^{\mathrm{adv}}\big(\widehat{A}_{i,t}, R_g^*\big) \right]
$
- $\mathcal{J}_{\mathrm{SA-GCPO}}(\theta)$: The optimization objective for the current policy $\pi_\theta$.
- $\mathbb{E}_{x \sim D,\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}$: Expectation over samples $x$ from the data distribution $D$ and item groups generated from the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $G$: The number of generated items for each sample (group size).
- $|y_i|$: The length of the token sequence for item $y_i$.
- $f_{i,t}$: The soft adaptive function that weights the importance ratio for token $t$ of item $y_i$.
- $\Gamma^{\mathrm{adv}}$: The threshold function that distinguishes positive and negative advantage samples.
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$.
- $R_g^*$: The reward of the target item in the current group.
The token-level importance ratio is defined as:
$
r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}
$
- $r_{i,t}(\theta)$: The ratio of the probability of generating token $y_{i,t}$ under the new policy $\pi_\theta$ to the probability under the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $\pi_\theta(y_{i,t} \mid x, y_{i,<t})$: Probability of generating token $y_{i,t}$ given input $x$ and previous tokens $y_{i,<t}$ under the new policy.
- $\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$: Probability of generating token $y_{i,t}$ under the old policy.
The soft adaptive function is given by:
$
f_{i,t}(\rho) = \sigma\left( \tau_{i,t} (\rho - 1) \right) \cdot \frac{4}{\tau_{i,t}}, \quad \tau_{i,t} = \begin{cases} \tau_{\mathrm{pos}}, & \Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*) > 0, \\ \tau_{\mathrm{neg}}, & \mathrm{otherwise}, \end{cases}
$
- $f_{i,t}(\rho)$: The soft adaptive function, where $\rho$ is the importance ratio $r_{i,t}(\theta)$.
- $\sigma$: The sigmoid function.
- $\tau_{i,t}$: The temperature coefficient, dynamically set based on whether the advantage is positive or negative.
- $\tau_{\mathrm{pos}}$: Temperature for positive advantage samples.
- $\tau_{\mathrm{neg}}$: Temperature for negative advantage samples.
- The term $\frac{4}{\tau_{i,t}}$ is a scaling factor.
The gradient weight is given by:
$
w_{i,t}(\theta) = 4\, p_{i,t}(\theta) \left( 1 - p_{i,t}(\theta) \right), \quad p_{i,t}(\theta) = \sigma\left( \tau_{i,t} \left( r_{i,t}(\theta) - 1 \right) \right)
$
- $w_{i,t}(\theta)$: The gradient weight for token $t$ of item $y_i$. This weight peaks at $r_{i,t}(\theta) = 1$ and smoothly decays as the ratio deviates, creating a soft trust region for the importance ratio.
- $p_{i,t}(\theta)$: The output of the sigmoid function, with $\tau_{i,t}$ applied to the difference between the importance ratio and 1.
$\Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*)$ in Equation 4 represents the threshold function that distinguishes positive and negative advantage samples, defined as:
$
\Gamma^{\mathrm{adv}}(\widehat{A}_{i,t}, R_g^*) = \begin{cases} 0, & \widehat{A}_{i,t} > 0 \ \mathrm{and}\ R_i < R_g^*, \\ \widehat{A}_{i,t}, & \mathrm{otherwise}, \end{cases}
$
- $\Gamma^{\mathrm{adv}}$: The threshold function for the advantage.
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$.
- $R_i$: The reward score for item $y_i$.
- $R_g^*$: The reward of the target item in the current group.
- This function sets the advantage to 0 if $\widehat{A}_{i,t}$ is positive but the item's reward $R_i$ is less than the target item's reward $R_g^*$. This prevents learning from "false positive" advantages, where an item is preferred over the group baseline but is not as good as the actual target. Otherwise, it uses the raw normalized advantage.
$\widehat{A}_{i,t}$ is the normalized advantage for item $y_i$, calculated as:
$
\widehat{A}_{i,t} = \widehat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}
$
- $\widehat{A}_{i,t}$: The normalized advantage for token $t$ of item $y_i$, equal to the advantage of the entire item, $\widehat{A}_i$.
- $R_i$: The reward score for item $y_i$.
- $\mathrm{mean}(\{R_i\}_{i=1}^{G})$: The mean reward of all items in the generated group $G$.
- $\mathrm{std}(\{R_i\}_{i=1}^{G})$: The standard deviation of rewards of all items in the generated group $G$.
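Putting the pieces together, here is a minimal PyTorch sketch of the SA-GCPO per-token weighting: group-normalized advantages, the $\Gamma^{\mathrm{adv}}$ threshold against the target reward $R_g^*$, asymmetric temperatures, and the soft sigmoid gate. Temperature values and tensor shapes are illustrative.

```python
import torch

def soft_gate(ratio, tau):
    """f(rho) = sigmoid(tau * (rho - 1)) * 4 / tau: a smooth trust region
    whose gradient weight peaks at rho = 1."""
    return torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau

def sa_gcpo_objective(logp_new, logp_old, rewards, r_target,
                      tau_pos=2.0, tau_neg=6.0):
    """logp_new / logp_old: (G, T) per-token log-probs for a group of G items.
    rewards: (G,) item-level reward scores; r_target: reward R_g* of the
    real target item, used as the positive/negative threshold.
    Returns the scalar objective J to be maximized.
    """
    # Group-normalized advantage, shared by every token of an item.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1).expand_as(logp_new)

    # Gamma^adv: zero out "false positives" whose advantage is positive
    # but whose reward still falls below the real target's reward.
    false_pos = (adv > 0) & (rewards.unsqueeze(1) < r_target)
    adv = torch.where(false_pos, torch.zeros_like(adv), adv)

    # Asymmetric temperatures: negative-advantage tokens decay faster.
    tau = torch.where(adv > 0,
                      torch.full_like(adv, tau_pos),
                      torch.full_like(adv, tau_neg))

    ratio = (logp_new - logp_old.detach()).exp()   # importance ratio r_{i,t}
    per_token = soft_gate(ratio, tau) * adv
    return per_token.mean(dim=1).mean()            # avg over tokens, then items

G, T = 8, 3                                        # group size, SID depth
logp_new = torch.randn(G, T, requires_grad=True)
logp_old = torch.randn(G, T)
obj = sa_gcpo_objective(logp_new, logp_old, torch.randn(G), r_target=0.5)
(-obj).backward()                                  # gradient ascent on J
```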
Main advantages of SA-GCPO:
- Adaptive smooth gating: Replaces hard clipping with a continuous sigmoid-based gate function, reducing optimization noise and enhancing training stability.
- Real user feedback as the pos/neg threshold: Uses reward scores from real user feedback to define positive and negative advantages, mitigating reward hacking (exploiting flaws in the reward function).
- Asymmetric temperature control: Uses different temperature settings for $\tau_{\mathrm{pos}}$ and $\tau_{\mathrm{neg}}$ to more rapidly attenuate negative-token gradients, improving stability.
- Sequence-level coherence: Since an SID sequence represents a single item, SA-GCPO reduces to a smooth sequence-level gate (similar to GSPO [67]) but without abrupt clipping.
4.2.4. System Implementation and Optimization
OxygenREC requires handling both terabyte-scale sparse embeddings (typical of recommendation systems) and billion-scale dense parameters (typical of LLMs).
4.2.4.1. Unified Training Framework
Built on PyTorch [35] on a production cluster with 128 NVIDIA H800 GPUs, achieving 40% Model FLOPs Utilization (MFU).
- Distributed Sparse Optimization:
  - Designed a large-scale distributed sparse engine in PyTorch with a non-overlapping partition strategy for embeddings across workers.
  - Utilizes hierarchical HBM-MEM caching and a multi-stage pipeline to hide embedding access latency.
  - Implements a dual-buffer mechanism for strong consistency, reducing sparse operation time from 15% to 5% and achieving a 1.1-2.4x speedup.
- Operator-Level Acceleration:
  - Integrates BF16 mixed-precision training [31] and ZeRO [47] for memory efficiency in the LLM backbone.
  - Employs advanced attention mechanisms [17, 49] and efficient architectures [19, 50].
  - Developed a dedicated attention acceleration library using CUTLASS [57] and TileLang [59] for custom kernel compilation, supporting flexible mask configurations and achieving 1.7x to 3.0x speedups over FlexAttention [18] and torch.compile [2].
- Scenario-Aware Reinforcement Learning:
  - A customized RL workflow is built on Ray [42].
  - In collocated deployment modes, shared-memory access for sparse tables eliminates redundant copying, ensuring efficient synchronization during high-throughput sample generation.
4.2.4.2. Inference Optimization based on xLLM
GR inference differs from standard LLM serving: long user-history prompts, relatively short outputs, and large beam width (e.g., 256-512). Decoding is the bottleneck due to sorting overhead, stochastic sampling, KV-cache pressure, and memory access inefficiency.
- OxygenREC uses xGR [4], a dedicated serving system built upon xLLM [38], with a three-tier architecture:
  - xSchedule (System Level): Manages task parallelism, enabling fine-grained pipeline overlapping across batching, request handling, and kernel execution for high GPU utilization.
  - xAttention (Operator Level): Based on xLLM's PagedAttention, customized for GR's attention patterns (long prompts + short decoding, hybrid masks), strengthening KV-cache management and using staged compute allocation for large-beam decoding.
  - xBeam (Algorithm Level): Handles the massive sorting overhead of large-beam decoding and supports advanced sampling strategies over billion-scale item spaces.
- Deep Customization for OxygenREC:
  - Specialized Beam Search & Sampling: xBeam implements an optimized Beam Sample kernel combining top-k selection with nucleus/multinomial sampling. It uses operator-level fusion to add stochasticity without performance degradation, reducing decoding latency.
  - Prefix-Constrained Decoding: Integrates a Trie Index mechanism into the inference loop to dynamically generate logit masks at each step, ensuring 100% generation legality within designated item pools with negligible runtime overhead.
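A minimal sketch of trie-based prefix-constrained decoding: SID sequences from the allowed item pool populate a prefix trie, and at each decoding step only legal continuations keep finite logits. The trie layout is our own illustration of the idea; xBeam's internal implementation is not described in the paper.

```python
import torch

def build_trie(item_sids):
    """Map each SID prefix to the set of legal next codes."""
    trie = {}
    for sid in item_sids:                      # sid: tuple of codes, e.g. depth 3
        for i in range(len(sid)):
            trie.setdefault(sid[:i], set()).add(sid[i])
    return trie

def constrained_logits(logits, prefix, trie, vocab_size):
    """Mask logits so only SIDs present in the item pool can be generated."""
    mask = torch.full((vocab_size,), float("-inf"))
    for code in trie.get(tuple(prefix), ()):   # legal continuations of prefix
        mask[code] = 0.0
    return logits + mask

item_pool = [(1, 7, 2), (1, 7, 5), (3, 0, 4)]  # toy 3-level semantic IDs
trie = build_trie(item_pool)
step_logits = torch.randn(8192)
legal = constrained_logits(step_logits, prefix=[1, 7], trie=trie, vocab_size=8192)
# Only codes 2 and 5 remain finite, guaranteeing generation legality.
```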
4.2.4.3. Inference Deployment of Reasoning Instructions
A near-line updating mechanism balances timeliness and system overhead.
- Architecture: An LLM-based instruction generation service and an Adapter-based text encoder service.
- The LLM instruction model operates near-line, synthesizing natural-language reasoning instructions in batch using spatiotemporal context and user behavioral history.
- The Adapter text encoder converts textual instructions into dense embedding vectors, indexed by user ID, and stored in a low-latency key-value store (e.g., a Redis cluster) for real-time consumption.
- Update Mechanisms:
  - Daily full refresh: An offline job regenerates spatiotemporal and behavioral instructions for all daily active users.
  - Near-line incremental update: Triggered by high-value user actions (searches, views, carts, purchases). A time-window aggregation strategy (e.g., 5 minutes) merges multiple behavioral events from the same user into a unified intent summary, executing a single instruction regeneration and storage write at the window's end. This preserves responsiveness while reducing backend load.

This design enables zero online LLM calls, low-latency serving, and high semantic fidelity in recommendations by retrieving precomputed instruction embeddings.
-
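As an illustration of the time-window aggregation strategy (the class, callbacks, and flush loop below are assumptions for the sketch, not details from the paper), the idea is to debounce per-user event bursts into one regeneration and one storage write:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 300  # e.g., the paper's 5-minute aggregation window

class NearLineAggregator:
    """Illustrative sketch: buffer high-value events per user and trigger a
    single instruction regeneration + KV write when a user's window closes."""

    def __init__(self, regenerate_fn, kv_write_fn):
        self._pending = defaultdict(list)   # user_id -> buffered events
        self._window_start = {}             # user_id -> window open time
        self._regenerate = regenerate_fn    # events -> instruction embedding
        self._kv_write = kv_write_fn        # (user_id, embedding) -> None

    def on_event(self, user_id, event):
        if user_id not in self._window_start:
            self._window_start[user_id] = time.time()
        self._pending[user_id].append(event)

    def flush_expired(self):
        now = time.time()
        for user_id in list(self._window_start):
            if now - self._window_start[user_id] >= WINDOW_SECONDS:
                events = self._pending.pop(user_id)
                del self._window_start[user_id]
                # One LLM call + one storage write per user per window.
                self._kv_write(user_id, self._regenerate(events))
```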
5. Experimental Setup
5.1. Datasets
The paper utilizes proprietary datasets from JD.com, a large e-commerce platform.
- Source: JD.com's core recommendation scenarios.
- Characteristics: The pre-training data is a mixture of search and multiple recommendation scenarios (e.g., Homepage, Channel Feeds, and I2I related recommendations), reflecting real user journeys across the app.
- Scale: OxygenREC is designed for large-scale e-commerce recommendation, handling terabyte-scale sparse embeddings and billion-scale dense parameters.
- Specific Examples: The Contextual Reasoning Instructions section (Appendix B) provides concrete examples of input signals:
  - Spatiotemporal & User Profile: Location: Beijing; Date: December 21, 2025; User Profile: male, 30 years old, white-collar, mid-to-high spending tier.
  - User Intent Reasoning Input: Added a full-suspension mountain bike (soft-tail) to cart on Dec 19 at 18:30 (high intent, L2: Outdoor Cycling); clicked Huawei Mate 70 and iPhone 16 Pro on Dec 20 during a lunch break (comparative browsing, L1: Consumer Electronics). Long-term Interests Input: digital products, men's fashion, cycling, outdoor sports.
- Synthetic Data: For post-training, the model generates a large amount of synthetic data by using user profiles and candidate items to query the ranking model for reward scores.
- Rationale for Choice: These datasets represent real-world, large-scale e-commerce traffic, allowing comprehensive validation of OxygenREC's ability to handle complex user behaviors, diverse scenarios, and strict industrial constraints. The mixture of search and recommendation data is crucial for unified modeling and for addressing the data sparsity common in positive feedback signals in e-commerce.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate both the model's recommendation performance and the quality of its learned Semantic IDs.
5.2.1. Model Evaluation Metrics
- HitRate@K (HR@K):
  - Conceptual Definition: Quantifies the precision of the generative process. It measures the proportion of test instances where a generated candidate exactly matches the ground-truth semantic ID sequence (across all hierarchical code levels) within the top-K beam search hypotheses. A higher HR@K indicates that the model generates the correct item (or its semantic representation) with high accuracy among its top suggestions.
  - Mathematical Formula: The paper describes the concept without giving a formula. A standard definition for HitRate@K in recommendation systems is:
    $ \mathrm{HitRate@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}\big(y_{gt,u} \in \mathrm{TopK}_u\big) $
    However, the paper specifies that the match is on the ground-truth semantic ID sequence across all hierarchical code levels. Adapting this for semantic IDs:
    $ \mathrm{HitRate@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}\big(\exists\, y_{gen} \in \mathrm{TopK}_u \ \text{s.t.}\ \mathrm{SID}(y_{gen}) = \mathrm{SID}(y_{gt,u})\big) $
  - Symbol Explanation:
    - $U$: The set of all users in the test set; $|U|$ is the total number of users.
    - $\mathbb{I}(\cdot)$: An indicator function that returns 1 if its argument is true, and 0 otherwise.
    - $y_{gt,u}$: The ground-truth item for user $u$ in the test set.
    - $\mathrm{TopK}_u$: The set of top-K recommended items for user $u$.
    - $\mathrm{SID}(y)$: The semantic ID sequence (tuple of discrete codes at all hierarchical levels) of item $y$.
    - The condition implies that the generated semantic ID sequence must be identical to the ground-truth sequence at every hierarchical level.
- Recall@K:
  - Conceptual Definition: Assesses the model's capacity to cover the user's relevant interests. It is the ratio of the user's daily positive interactions (ground-truth items) that are successfully identified within the top-K generated candidates, relative to the full set of observed user behaviors for that day. A higher Recall@K indicates that the model retrieves a larger proportion of the items the user actually interacted with.
  - Mathematical Formula: As with HitRate@K, the paper gives only a conceptual definition. A common formula for Recall@K, adapted for semantic IDs per the paper's context, is:
    $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\big|\{\, y \in \mathrm{DailyInteractions}_u \mid \exists\, y_{gen} \in \mathrm{TopK}_u \ \text{s.t.}\ \mathrm{SID}(y_{gen}) = \mathrm{SID}(y) \,\}\big|}{|\mathrm{DailyInteractions}_u|} $
  - Symbol Explanation:
    - $U$: The set of all users in the test set; $|U|$ is the total number of users.
    - $\mathrm{DailyInteractions}_u$: The full set of observed positive interactions (ground-truth items) for user $u$ on a given day.
    - $\mathrm{TopK}_u$: The set of top-K recommended items for user $u$.
    - $\mathrm{SID}(y)$: The semantic ID sequence of item $y$.
    - The numerator counts how many of the user's daily ground-truth items are matched (by semantic ID sequence) within the top-K recommendations.
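Under these definitions, both metrics reduce to set operations over SID tuples. A minimal per-user sketch (assuming items are represented as tuples of hierarchical codes; averaging over users is then straightforward):

```python
def hit_rate_at_k(topk_sids, gt_sid):
    """1 if the ground-truth SID tuple appears among the top-K generated SIDs."""
    return int(gt_sid in set(topk_sids))

def recall_at_k(topk_sids, daily_gt_sids):
    """Fraction of the user's daily positive items covered by the top-K SIDs."""
    if not daily_gt_sids:
        return 0.0
    topk = set(topk_sids)
    return sum(sid in topk for sid in daily_gt_sids) / len(daily_gt_sids)

# Example: 3-level SID tuples for one user.
topk = [(12, 7, 301), (12, 7, 302), (98, 4, 11)]
print(hit_rate_at_k(topk, (12, 7, 302)))             # 1
print(recall_at_k(topk, [(12, 7, 302), (5, 5, 5)]))  # 0.5
```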
5.2.2. Semantic ID Evaluation Metrics
These metrics evaluate the quality of the learned Multimodal Quantized Item Representations (Semantic IDs).
- SID Codebook Coverage (↑: higher is better):
  - Conceptual Definition: Evaluates the utilization efficiency of the hierarchical semantic space by measuring joint path occupancy at each quantization level. It reflects how effectively SKUs (Stock Keeping Units, i.e., individual items) populate the exponentially growing combination space of the codebook.
  - Mathematical Formula: For a given depth $L$ and codebook width $|V|$:
    $ \mathrm{CodebookCoverage@L} = \frac{|\{ (c_1, \ldots, c_L) \mid \exists\, \mathrm{SKU} \text{ mapped to this tuple} \}|}{|V|^L} $
  - Symbol Explanation:
    - $L$: The current quantization level or depth (e.g., 1, 2, or 3).
    - $V$: The vocabulary of codes at each level; $|V|$ is the codebook width (vocabulary size per level).
    - $(c_1, \ldots, c_L)$: A unique tuple of codes up to level $L$.
    - The numerator counts the number of unique code tuples with at least one SKU mapped to them; the denominator is the total number of possible code tuples at that level.
- Semantic Cluster Purity (↑: higher is better):
  - Conceptual Definition: Evaluates the alignment between the learned semantic clusters (items mapping to the same semantic ID) and human-defined taxonomies (e.g., product categories). It measures the mean dominance of the primary category within each semantic ID's corresponding SKU set. A higher score means semantic IDs effectively capture high-level category semantics (e.g., all SKUs under one SID consistently belong to "Electronics").
  - Mathematical Formula: Not explicitly provided. Conceptually, for each semantic ID one identifies the most frequent human-defined category among its assigned SKUs, computes that category's share of the cluster, and averages this share over all semantic IDs (see the formula sketch below).
  - Symbol Explanation: Cate1, Cate2, Cate3 refer to levels of a hierarchical product taxonomy (e.g., Category1 might be "Electronics", Category2 "Mobile Phones", Category3 "Smartphones"). Purity is measured at each categorical level.
- Semantic ID Collision (↓: lower is better):
  - Conceptual Definition: Quantifies the discriminative power of semantic identifiers in distinguishing individual items within the billion-scale SKU pool. It is measured by the distribution of item cardinality (number of unique SKUs) per unique SID tuple. Lower SKU counts at upper quantiles (e.g., P90, P99) indicate finer granularity, meaning the semantic ID sequence is closer to a unique item identifier. High collision means many distinct items map to the same SID, losing discriminative power.
  - Mathematical Formula: No explicit formula; it is a statistical summary of SKU counts per SID tuple.
  - Symbol Explanation: P90, P99, P999 are the 90th, 99th, and 99.9th percentiles of the distribution of SKU counts per SID tuple. For example, P99 = 21 means that 99% of SID tuples contain 21 or fewer SKUs.
- Codebook Load Balance (→ 100%):
  - Conceptual Definition: Measures the uniformity of item distribution across the quantization codebook. It quantifies the deviation of actual cluster sizes (number of SKUs assigned to a specific code) from the theoretical uniform size (total SKUs / codebook size). Examining this ratio at various quantiles assesses whether the RQ-KMeans process utilizes the full capacity of the codebook or suffers from mode collapse (some codes over-utilized, others under-utilized).
  - Mathematical Formula: No explicit formula; it is the distribution of cluster sizes relative to the ideal uniform size.
  - Symbol Explanation: P25, P75, P90 are percentiles of the distribution of cluster sizes (number of SKUs per code), expressed relative to the uniform size. The ideal value is 100% at all quantiles, indicating a perfectly uniform distribution.
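All four diagnostics can be computed directly from the SKU-to-SID mapping. A compact sketch, under stated assumptions: `sku_to_sid` maps SKU id to its full code tuple, `V` is the per-level codebook width, and code tuples at the chosen depth are treated as the clusters for load balance (one possible reading of the metric, not necessarily the paper's exact procedure):

```python
import numpy as np
from collections import Counter

def sid_diagnostics(sku_to_sid, V, level):
    """Coverage, collision percentiles, and load balance at a given SID depth."""
    prefixes = [sid[:level] for sid in sku_to_sid.values()]
    counts = Counter(prefixes)                      # SKUs per occupied code tuple

    coverage = len(counts) / (V ** level)           # occupied / possible tuples

    sizes = np.array(sorted(counts.values()))       # SKU count per SID tuple
    collision = {p: np.percentile(sizes, p) for p in (90, 99, 99.9)}

    uniform = len(prefixes) / (V ** level)          # ideal SKUs per tuple
    load_balance = {p: 100 * np.percentile(sizes, p) / uniform
                    for p in (25, 75, 90)}
    return coverage, collision, load_balance
```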
5.3. Baselines
OxygenREC is compared against various baselines depending on the specific experiment:
- Multimodal Encoder Design:
  - OneRec-style multimodal model: Uses a large pre-trained multimodal backbone (e.g., MiniCPMV [65]) with multiple Q-Former [34] layers for feature compression.
  - Lighter Vision-Language Models: Variants from the SAIL [11] and InternVL [13] families.
  - Pure-text encoders: e.g., Qwen3 [3, 63].
  - Pure-vision encoders: Not explicitly named, but implied for comparison.
  - Multimodal backbone models [33, 37, 60]: Generic multimodal models.
- Reinforcement Learning (RL) Methods (for Post-training):
  - GRPO [23]: A widely applied policy optimization method in LLMs, VLMs, and GRs.
  - GSPO [67]: A GRPO variant that improves performance by computing importance sampling along the sequence.
- Multi-Scenario Adaptation:
  - Independent SFT (Supervised Fine-Tuning) baselines: The industry-standard approach, where a separate model is fine-tuned exclusively for each scenario. OxygenREC compares its Unified Instruction-Following Model against this baseline.
  - The paper also implicitly compares against the limitations of traditional multi-stage cascaded pipelines and existing generative methods with limited world-knowledge-based reasoning and poor multi-scenario scalability (as depicted in Figure 1).
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Ablation Studies and Analysis
6.1.1.1. Lightweight Multimodal Design
The authors performed a systematic analysis of the multimodal encoder design.
- OneRec-style backbone (MiniCPMV + Q-Former): This initial heavy setup was the starting point.
- Lighter VLMs (SAIL, InternVL): Replacing the heavy backbone with lighter variants did not significantly improve recommendation performance, despite faster inference.
- Pure-text (Qwen3) vs. pure-vision vs. multimodal backbones: Surprisingly, the pure-text encoder Qwen3 substantially outperformed the multimodal alternatives.
- Final Design (Qwen3 + CLIP fusion): This observation led to the final design: a Qwen3 text encoder and a CLIP [45] image encoder independently extract representations, which are then fused and compressed via Q-Former and MLP layers, with residual initialization for the Q-Former query tokens. This fusion architecture yielded over 30% relative improvement in HitRate@1 over the OneRec-style variants and up to a 32x speedup in embedding inference.
6.1.1.2. Semantic ID Evolution
The authors investigated four versions of Semantic IDs (SIDs) to understand the impact of different item representation strategies. The quality was assessed using Codebook Coverage, Semantic Cluster Purity, SID Collision (P99), and Codebook Load Balance.
The following are the results from Table 2 of the original paper:
| Metric | Level | V1 (Textual) | V2 (MiniCPM) | V3 (Fusion) | V4 (Multi-Source) |
| Codebook Coverage (↑) | Codebook1 | 100% | 100% | 100% | 100% |
|  | Codebook2 | 72.33% | 31.22% | 70.03% | 69.56% |
|  | Codebook3 | 2.74% | 0.013% | 0.14% | 0.15% |
| Cluster Purity (↑) | Cate1 | 86.30% | 83.31% | 84.38% | 92.80% |
|  | Cate2 | 73.45% | 66.80% | 70.35% | 79.90% |
|  | Cate3 | 53.68% | 41.99% | 48.41% | 59.73% |
| SID Collision (↓) | P90 | 10 | 6 | 2 | 2 |
|  | P99 | 79 | 21 | 9 | 9 |
|  | P999 | 419 | 73 | 34 | 35 |
| Load Balance (→ 100%) | P25 | 32.49% | 88.53% | 89.25% | 89.92% |
|  | P75 | 144.22% | 114.66% | 107.39% | 105.40% |
|  | P90 | 212.84% | 122.14% | 118.60% | 118.09% |
Table 2: Detailed Evaluation of Semantic ID Versions.
Note: ↑: higher is better; ↓: lower is better; → 100%: closer to 100% is better.
Analysis of Semantic ID Quality:
- V1 (Textual Baseline): Suffers from a small codebook size (2048); its seemingly high Codebook Coverage is an artifact of that small size. Its high P999 collision (419) indicates poor resolution capacity.
- V2 (MiniCPM): Shows very low Codebook Coverage at deeper levels (0.013% at L3), suggesting partial codebook collapse (many codes go unused).
- V3 (Fusion) & V4 (Multi-Source): Successfully address the issues of V1 and V2 by leveraging robust multimodal alignment.
- V4 (Multi-Source Alignment): Achieves the best performance across all metrics: highest Cluster Purity (92.80% at Cate-L1) due to explicit injection of category semantics; superior SID collision (P999 of 35), ensuring fine-grained item distinction; and optimal load balance (P90 of 118.09%), confirming effective and uniform utilization of the latent space through multi-source contrastive learning. This version is identified as the optimal semantic ID representation.
6.1.1.3. Generative Backbone Architecture Ablation
Using the optimal V4 Semantic ID, the authors investigated the scaling laws and architectural hyperparameters of the generative backbone by varying depth of encoder/decoder layers, model dimensions, and Mixture-of-Experts (MoE) configurations.
The following are the results from Table 3 of the original paper:
| Model Size (Tot/Act) | Enc Layers | Dec Layers | Dim (Hidden/Inter) | Experts (Tot/Act) |
| 0.1B / 0.1B | 4 | 4 | 1024/512 | 2/1 |
| 0.4B / 0.3B | 4 | 6 | 2048/1024 | 4/2 |
| 0.7B / 0.4B | 4 | 8 | 2048/1024 | 8/2 |
| 1.5B / 0.4B | 4 | 8 | 2048/1024 | 24/2 |
| 3.0B / 0.6B | 4 | 16 | 2048/1024 | 24/2 |
Table 3: Model Configurations for Generative Backbone Ablation.
The generative performance was assessed using HitRate and Recall metrics, and training dynamics via NTP Loss.
The following are the results from Table 4 of the original paper:
| Model Size (Tot) | HR@1 | HR@10 | Recall@10 | Recall@30 |
| 0.1B | 3.99% | 13.17% | 10.10% | 15.11% |
| 0.4B | 4.42% | 15.03% | 11.38% | 17.34% |
| 0.7B | 4.84% | 16.33% | 12.32% | 18.71% |
| 1.6B | 4.92% | 16.61% | 12.51% | 19.01% |
| 3.0B | 5.02% | 16.99% | 12.78% | 19.53% |
Table 4: Performance Comparison.
The NTP Loss Scaling Laws are shown in Figure 6.
(Chart: training loss curves over training steps for models of different parameter counts, e.g., 0.1B, 0.4B; loss decreases as training proceeds, with larger models reaching lower loss.)
Figure 6: NTP Loss Scaling Laws.
Scaling Law Validation and MoE Saturation Analysis:
- General Trend: A positive correlation exists between model capacity and retrieval performance. Scaling from 0.1B to 3.0B total parameters yields monotonic improvements in HR@10 (from 13.17% to 16.99%) and NTP loss (larger models reach a lower asymptotic loss).
- MoE Saturation: A plateau is observed between the 0.7B and 1.6B variants in NTP loss. This is attributed to both configurations using the same number of active experts per token (2), which limits the immediate gains from merely expanding the expert pool in an MoE architecture.
- Overcoming the Bottleneck: The 3.0B model overcomes this plateau, likely due to its significantly deeper decoder (16 layers vs. 8), which extends the computational path and further reduces NTP loss.
6.1.2. Instruction-Following Unification Analysis
This section validates the effectiveness of the proposed instruction-following mechanism.
6.1.2.1. Instruction Token Integration Strategies
Five strategies for inserting the instruction token into the decoder's input sequence relative to the Begin-Of-Sequence (BOS) marker were compared.
The following are the results from Table 5 of the original paper:
| Integration Strategy | HR@1 | HR@10 | Recall@10 | Recall@30 |
| No Instruction | 2.78% | 10.38% | 8.18% | 13.01% |
| Replace BOS | 3.30% | 12.08% | 9.12% | 14.38% |
| Add to BOS | 3.50% | 12.59% | 9.52% | 14.93% |
| Insert Left of BOS | 3.33% | 12.17% | 9.21% | 14.50% |
| Insert Right of BOS | 3.53% | 12.68% | 9.58% | 14.91% |
Table 5: Performance comparison of different instruction token integration strategies.
Analysis: The Insert Right of BOS strategy consistently yields the highest retrieval metrics (e.g., HR@10 of 12.68%). This approach allows the decoder to first initialize its state with BOS and then immediately condition the subsequent auto-regressive generation on the specific context provided by the instruction, providing the most effective guidance flow.
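To make the winning strategy concrete, here is a hedged sketch of what "Insert Right of BOS" means for the decoder input at the embedding level (my reading of the description; the function and tensor names are illustrative):

```python
import torch

def build_decoder_inputs(bos_emb, instr_emb, target_embs):
    """Insert Right of BOS: decoder sees [BOS, instruction, SID tokens...].

    bos_emb:     (1, d)  embedding of the Begin-Of-Sequence marker
    instr_emb:   (1, d)  instruction token embedding (scenario/trigger signal)
    target_embs: (T, d)  embeddings of the target semantic-ID tokens
    """
    return torch.cat([bos_emb, instr_emb, target_embs], dim=0)

# The other ablated variants, for contrast (illustrative):
#   Replace BOS:        [instr, SIDs...]
#   Add to BOS:         [bos + instr, SIDs...]   (element-wise sum)
#   Insert Left of BOS: [instr, bos, SIDs...]
```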
6.1.2.2. Instruction Component Ablation
The composition of the instruction token (fusing Scenario ID and Trigger Item ID) was investigated.
The following are the results from Table 6 of the original paper:
| Instruction Components | HR@1 | HR@10 | Recall@10 | Recall@30 |
| No Instruction (Baseline) | 2.78% | 10.38% | 8.18% | 13.01% |
| Scenario ID Only (1 token) | 3.30% | 12.17% | 9.22% | 14.50% |
| Trigger Item ID Only (1 token) | 3.22% | 11.60% | 9.13% | 13.98% |
| Concatenated (Scenario + Trigger, 2 tokens) | 3.53% | 12.68% | 9.58% | 14.91% |
| Fused (Scenario + Trigger, 1 token) | 3.60% | 12.82% | 9.68% | 15.08% |
Table 6: Ablation study on Instruction Token components.
Analysis:
- Both Scenario ID Only and Trigger Item ID Only improve performance over the No Instruction baseline, confirming their individual utility.
- Concatenated (Scenario + Trigger) performs better than either alone, indicating complementary guidance.
- Fused (Scenario + Trigger) achieves the best performance across all metrics (e.g., HR@10 of 12.82%). This suggests that deeper interaction during fusion lets the model better capture the relationship between scenario context (global domain characteristics such as price sensitivity) and user intent (the Trigger Item ID provides fine-grained, localized context). A sketch of such a fusion is given below.
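The paper does not specify the fusion operator; a common choice consistent with "fused into one token" is an MLP over the concatenated embeddings, sketched here purely as an assumption:

```python
import torch
import torch.nn as nn

class InstructionFusion(nn.Module):
    """Hypothetical fusion of Scenario ID and Trigger Item ID into one token."""

    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, scenario_emb, trigger_emb):
        # (B, d) + (B, d) -> (B, d): one fused instruction token per request.
        return self.mlp(torch.cat([scenario_emb, trigger_emb], dim=-1))
```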
6.1.2.3. IGR Ablation
Ablation studies were conducted in search-dominated scenarios to validate the effectiveness of the IGR mechanism and Q2I alignment.
The following are the results from Table 7 of the original paper:
| Configuration | HR@1 | HR@10 | Recall@10 | Recall@30 |
| Base Model (w/o IGR/Q2I) | 3.76% | 12.20% | 9.87% | 15.53% |
| + IGR Only | 4.02% | 12.91% | 10.25% | 15.95% |
| + IGR+Q2I (Full) | 4.19% | 13.38% | 10.52% | 16.23% |
Table 7: Ablation study on IGR components
Analysis:
- Introducing IGR only improves retrieval quality (e.g., HR@10 from 12.20% to 12.91%) by focusing attention on intent-relevant historical interactions.
- The full model (+IGR+Q2I) achieves the best performance (HR@10 of 13.38%). This demonstrates that explicit alignment between the query and item spaces via the Q2I loss is crucial for effective IGR and overall recommendation quality. A sketch of such an alignment loss follows.
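The paper describes Q2I as a loss enforcing instruction-item consistency but does not give its exact form in this section, so the following InfoNCE-style sketch is one standard way to implement query-to-item alignment, not the paper's published loss:

```python
import torch
import torch.nn.functional as F

def q2i_loss(query_embs, item_embs, temperature=0.07):
    """InfoNCE-style query-to-item alignment (illustrative).

    query_embs: (B, d) instruction/query embeddings
    item_embs:  (B, d) embeddings of the matching positive items;
                other in-batch items serve as negatives.
    """
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(item_embs, dim=-1)
    logits = q @ v.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are positives
```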
6.1.2.4. Unified Instruction-Following Model vs. Independent SFT Baselines
The Unified Instruction Following Model was compared against the industry-standard Pretrain and Scenario Independent SFT approach across six core deployment scenarios.
The following are the results from Table 8 of the original paper:
| Metric | Model Type | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 |
| HR@1 | Independent SFT | 6.39% | 8.17% | 1.12% | 1.83% | 7.22% | 5.29% |
| Unified Model | 15.39% | 20.75% | 17.24% | 6.34% | 10.54% | 25.75% | |
| HR@10 | Independent SFT | 23.29% | 29.05% | 5.22% | 8.44% | 29.84% | 19.38% |
| Unified Model | 46.73% | 55.02% | 53.57% | 29.89% | 37.90% | 62.62% |
Table 8: Performance comparison: Unified Instruction-Following Model vs. Independent SFT Baselines across six core scenarios.
Analysis: The Unified Model consistently and significantly outperforms the Independent SFT baselines across all six core scenarios. For example, in Scenario 3, HR@1 jumps from 1.12% to 17.24%, and HR@10 from 5.22% to 53.57%. This superiority is attributed to:
- Synergistic Knowledge Transfer: High-resource scenarios (like Homepage) enhance representation learning for lower-resource scenarios.
- Universal User Modeling: Captures a holistic view of user interests across different contexts.
- Improved Operational Efficiency: A single model reduces maintenance and GPU overhead.
6.1.2.5. Trigger Instruction Sensitivity Analysis
This experiment verifies that the unified model actively utilizes the Trigger Item component of the instruction token as a control signal rather than ignoring it. It was conducted on Scenarios 3, 4, 5, and 6, which are explicitly trigger-item driven.
The following are the results from Table 9 of the original paper:
| Metric | Inference Setting | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 |
| HR@1 | Correct Trigger (Instruction) | 20.75% | 6.34% | 10.54% | 25.75% |
| Masked Trigger (Default) | 10.71% | 1.96% | 9.26% | 18.55% | |
| HR@10 | Correct Trigger (Instruction) | 55.02% | 29.89% | 37.90% | 62.62% |
| Masked Trigger (Default) | 40.96% | 11.31% | 35.74% | 50.52% |
Table 9: Sensitivity Analysis: Impact of masking the Trigger Item ID during inference (Scenarios 3, 4, 5, 6).
Analysis: Replacing the authentic Trigger Item ID with a generic "Default" embedding during decoding leads to a substantial performance drop across all tested scenarios. For example, HR@1 in Scenario 3 drops from 20.75% to 10.71%. This sharp drop confirms that the decoder heavily relies on the fine-grained trigger signal to effectively contextualize user intent, acting as a critical "steering wheel" for the generative process.
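Mechanically, this sensitivity test amounts to swapping the authentic trigger embedding for a learned default embedding at decode time; a minimal illustration (function and argument names are hypothetical, reusing the fusion module sketched earlier):

```python
def instruction_embedding(fusion, scenario_emb, trigger_emb,
                          default_trigger_emb, mask_trigger=False):
    """Build the instruction token, optionally masking the trigger signal.

    mask_trigger=True reproduces the 'Masked Trigger (Default)' setting:
    the authentic Trigger Item ID embedding is replaced by a generic default.
    """
    trig = default_trigger_emb if mask_trigger else trigger_emb
    return fusion(scenario_emb, trig)
```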
6.1.3. Post-training of OxygenREC
6.1.3.1. Effectiveness of Synthetic Data
The authors evaluated the effectiveness of using synthetic data for RL post-training, compared to OneRec-v2's online RL approach.
The evaluation of synthetic data is shown in Figure 7.
(Chart: Hit Rate@10 under different synthetic-data ratios; the blue curve is GRPO, the orange curve is SA-GCPO. Both improve as the synthetic-data ratio increases, peaking at 20% synthetic data.)
Figure 7: Evaluation of synthetic data
Analysis:
- The OxygenREC-0.7B MoE model was used as the backbone.
- GRPO showed considerable instability across varying proportions of synthetic data.
- SA-GCPO delivered more consistent results and superior performance compared to GRPO across all synthetic-data proportions, validating the effectiveness of the proposed method and the robustness of SA-GCPO.
6.1.3.2. Effectiveness of proposed SA-GCPO
The SA-GCPO method was compared against GRPO [23] and GSPO [67] (a modified GRPO that computes importance sampling along the sequence). OxygenREC-0.7B was warm-started, with 33% synthetic data.
The following are the results from Table 10 of the original paper:
| Methods | Ratio of synthetic data | HR@1 | HR@10 |
| OxygenREC0.7B-GRPO | 33% | 23.85% | 62.15% |
| OxygenREC0.7B-GSPO | 33% | 24.13% | 62.88% |
| OxygenREC0.7B-SA-GCPO | 33% | 25.58% | 65.95% |
Table 10: Evaluation of proposed SA-GCPO with other methods
Analysis:
- GSPO performs slightly better than GRPO (HR@1 24.13% vs. 23.85%; HR@10 62.88% vs. 62.15%).
- SA-GCPO significantly outperforms both, achieving +1.45pp and +1.73pp in HR@1 over GSPO and GRPO, respectively. For HR@10, SA-GCPO gains more than +3pp over both baselines (65.95% vs. 62.88% and 62.15%). This demonstrates the superior performance of the SA-GCPO RL method, particularly its adaptive smooth gating and real-user-feedback threshold.
6.1.3.3. Ablation study for different settings of Tpos and Tneg
Ablation experiments validated the effect of separate temperature settings for positive and negative advantage samples in SA-GCPO.
The following are the results from Table 11 of the original paper:
| Tpos | Tneg | HR@1 | HR@10 |
| 1.0 | 1.05 | 25.35% | 65.64% |
| 1.0 | 1.0 | 25.48% | 65.95% |
| 1.0 | 0.95 | 25.51% | 66.01% |
Table 11: Ablation study of temperatures set for positive and negative samples of SA-GCPO
Analysis:
- When Tneg < Tpos (e.g., Tpos = 1.0, Tneg = 0.95), model performance becomes more stable and slightly better (HR@1 25.51%, HR@10 66.01%).
- Assigning a lower temperature coefficient to negative samples helps prevent performance collapse during RL training and enhances overall training stability, confirming the benefit of asymmetric temperature control. A hedged sketch of one plausible implementation follows.
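The paper does not spell out where the temperatures enter SA-GCPO; one plausible reading, sketched below purely as an assumption, is that positive- and negative-advantage samples are rescaled by separate temperature coefficients before the clipped policy-gradient update:

```python
import torch

def temperature_scaled_advantages(adv, t_pos=1.0, t_neg=0.95):
    """Apply separate temperature coefficients to positive- and
    negative-advantage samples. This is an assumed reading of SA-GCPO's
    asymmetric temperature control, not the paper's published formula."""
    return torch.where(adv > 0, adv * t_pos, adv * t_neg)

# Example: with t_neg = 0.95, negative advantages are slightly attenuated,
# softening penalty gradients -- one way asymmetric control could stabilize
# RL training.
adv = torch.tensor([1.2, -0.8, 0.3, -2.0])
print(temperature_scaled_advantages(adv))  # tensor([ 1.2000, -0.7600,  0.3000, -1.9000])
```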
6.1.4. Online A/B Test Performance and Industrial Impact
OxygenREC was deployed across three sequentially dependent phases on the JD App, covering the user's entire session lifecycle:
- Phase 1: Interest Triggering (Homepage Floor): Scenarios 1 and 2; high traffic, low latency, visually engaging items to attract clicks.
- Phase 2: Deep Exploration (Feeds Recommendation): Scenarios 3 and 4; entered via homepage clicks, with recommendations based on the "trigger SKU" and user behavior, encouraging prolonged engagement.
- Phase 3: Immediate Conversion (Checkout Path Recommendations): Scenarios 5 and 6 (Add-to-Cart Overlay, Checkout Add-on); targets the transaction process, capitalizing on strong purchase intent for supplementary items.
Rigorous online A/B testing was conducted, with 10% of total traffic allocated to experimental and control groups.
The following are the results from Table 12 of the original paper:
| Phase | Scenario | UCTR | UCTCVR | Order Volume | GMV | Latency |
| Homepage Floor | Scenario 1 | +0.68% | +2.71% | +2.81% | +4.52% | 50ms |
|  | Scenario 2 | +3.55% | +2.26% | +2.21% | +8.40% |  |
| Channel Feeds | Scenario 3* | -0.25% | +7.89% | +8.03% | +1.46% | 80ms |
|  | Scenario 4 | +0.78% | +2.17% | +1.49% | +1.66% |  |
| Checkout Path | Scenario 5 | +0.40% | +4.21% | +4.28% | +11.80% | 50ms |
|  | Scenario 6 | +3.29% | +3.00% | +2.92% | +4.15% |  |
Table 12: Online A/B Test Lift at First Launch
Analysis:
- The generative model achieved statistically significant improvements across all key business metrics (UCTR, UCTCVR, Order Volume, GMV) in all scenarios.
- Homepage Floor (Scenarios 1 & 2): Positive lifts across all metrics, with GMV reaching +8.40% in Scenario 2.
- Channel Feeds (Scenarios 3 & 4): Notably, Scenario 3 shows a slight UCTR drop of -0.25% but a remarkable +7.89% UCTCVR and +8.03% Order Volume, indicating the model prioritizes high-quality, conversion-intent items over shallow clicks.
- Checkout Path (Scenarios 5 & 6): Significant gains in GMV (e.g., +11.80% in Scenario 5) and Order Volume, capitalizing on immediate purchase intent.
- Latency: The system meets strict latency requirements (50ms for Homepage/Checkout, 80ms for Feeds) despite handling billion-scale candidate spaces.
- These results validate that the generative framework effectively translates semantic understanding into tangible business growth across the user's entire shopping lifecycle in real-world industrial settings.
6.2. Ablation Studies / Parameter Analysis
The ablation studies and parameter analyses are thoroughly integrated into the Core Results Analysis section above, specifically under:
- Lightweight Multimodal Design: Evaluates different multimodal encoder configurations.
- Semantic ID Evolution: Compares SID versions (V1-V4) to determine the optimal item representation.
- Generative Backbone Architecture Ablation: Investigates scaling laws and MoE configurations for the encoder-decoder backbone.
- Instruction Token Integration Strategies: Analyzes how the instruction token position affects performance.
- Instruction Component Ablation: Breaks the instruction token down into its Scenario ID and Trigger Item ID components.
- IGR Ablation: Verifies the contribution of Instruction-Guided Retrieval and Q2I alignment.
- Unified Instruction-Following Model vs. Independent SFT Baselines: Compares the unified model against scenario-specific baselines.
- Trigger Instruction Sensitivity Analysis: Confirms the importance of the Trigger Item signal.
- Effectiveness of proposed SA-GCPO: Benchmarks SA-GCPO against GRPO and GSPO.
- Ablation study for different settings of Tpos and Tneg: Analyzes the impact of asymmetric temperature control in SA-GCPO.

These studies collectively demonstrate the individual contributions of each proposed component and validate the overall design choices of OxygenREC.
7. Conclusion & Reflections
7.1. Conclusion Summary
OxygenREC introduces a novel generative recommendation system that successfully integrates deep reasoning capabilities with the stringent latency and scalability demands of industrial deployments. The key innovations are:
- Fast-Slow Thinking Architecture: A near-line LLM pipeline (slow thinking) generates Contextual Reasoning Instructions based on user behavior and context, injecting world knowledge and deductive reasoning. A lightweight encoder-decoder model (fast thinking) then uses these pre-generated instructions for real-time recommendation generation, effectively avoiding online LLM latency.
- Instruction-Following Unification: OxygenREC transforms recommendation into an instruction-following task. Semantic alignment mechanisms, including Instruction-Guided Retrieval (IGR) and the Query-to-Item (Q2I) loss, ensure that user intent (Contextual Reasoning Instructions) and scenario context (Scenario Instructions) precisely control the generation process.
- Multi-Scenario Scalability: By converting scenario information into structured Scenario Instructions and utilizing unified reward mapping with Soft Adaptive Group Clip Policy Optimization (SA-GCPO), OxygenREC achieves a "train-once-deploy-everywhere" paradigm, allowing a single model to adapt to diverse business objectives across multiple scenarios.
- Industrial Impact: Deployed across core JD.com recommendation scenarios, OxygenREC demonstrated significant online A/B test gains in order volume and GMV, validating its robustness, efficiency, and practical value in high-traffic e-commerce environments.
7.2. Limitations & Future Work
The authors identify several promising directions for future research:
- Latency Optimization through Non-Autoregressive Generation:
  - Limitation: The current framework relies on sequential Next Token Prediction (NTP), where decoding latency increases linearly with the required recommendation list length. This fundamentally hinders high-throughput real-time deployment.
  - Future Work: Transitioning to a Non-Autoregressive (NAR) parallel generation paradigm to drastically reduce serving latency and maximize throughput by generating the entire sequence of semantic identifiers simultaneously. This is crucial for maintaining performance as model complexity and knowledge-integration depth increase.
- Multi-Scenario User Trajectory Modeling for Deep Intent Discovery:
  - Limitation: The current instruction system effectively uses immediate context and scene information, but users' true purchase intent often follows a complex decision trajectory spanning multiple distinct scenarios (e.g., Homepage, Search, Cart, Checkout).
  - Future Work: Focus on multi-scenario user trajectory modeling to capture the full context of user behavior. This involves integrating and analyzing cross-scenario sequences to uncover deep-seated user goals and intent evolution. The goal is to upgrade the instruction system to use richer, hierarchical intent signals for the LLM backbone, leading to more precise and long-term-optimal recommendations, further aligned with long-term user value via a robust closed-loop learning mechanism.
7.3. Personal Insights & Critique
OxygenREC presents a highly compelling and practically significant contribution to the field of recommendation systems, particularly for large-scale e-commerce platforms.
Personal Insights:
- Elegance of Fast-Slow Thinking: The Fast-Slow Thinking architecture is an elegant solution to the perennial LLM latency problem in real-time systems. By pre-computing complex deductive-reasoning instructions offline, it leverages LLM power without sacrificing online performance. This pattern could apply broadly to other domains that need LLM intelligence under strict latency budgets, such as intelligent assistants, content moderation, or fraud detection, where complex reasoning feeds real-time decisions.
- Instruction-Following for Unification: Using instruction-following to unify multi-scenario recommendations is powerful. It moves beyond the implicit parameter modulation of Multi-Gate Mixture-of-Experts (MoE) or tower models to explicit control signals, making the model more interpretable and adaptable. This paradigm could generalize to other complex multi-task learning settings where explicit guidance can disentangle objectives.
- Rigorous Engineering Focus: The paper's detailed account of system implementation and optimization (e.g., the PyTorch training framework, xLLM inference, distributed sparse optimization, custom attention kernels) highlights the immense engineering effort required to bring such a model to production. This level of detail is often missing from academic papers but is crucial for real-world impact.
- Synthetic Data for RL Stability: Combining synthetic data with SA-GCPO for RL post-training is a smart way to address the data sparsity and reward hacking issues that arise when relying solely on real-time user feedback, especially in new or data-scarce scenarios.
Critique & Areas for Improvement:
- Proprietary Data Limitation: The reliance on JD.com's proprietary datasets, while demonstrating real-world impact, limits reproducibility and direct comparison by external researchers. Public benchmarks (e.g., Amazon Reviews, Taobao) could have provided broader generalizability.
- LLM Model Specifics: While Qwen3 and DeepSeek-R1 are mentioned, more detail on the size, architecture, and specific prompting strategies of the near-line LLM pipeline would help researchers replicate or adapt this work. The fine-tuning process for LLM_QR and User Intent Reasoning is described, but full prompt templates and few-shot examples would improve transparency.
- Complexity of SA-GCPO: While innovative, the SA-GCPO formulation introduces several new components (soft adaptive function, asymmetric temperature, threshold function for the advantage). A deeper intuitive explanation of how these interact to deliver stability and performance, perhaps with illustrative examples of gradient behavior, would help newcomers.
- Cold Start Problem: The paper focuses on active users with rich historical data. It would be interesting to see how OxygenREC handles cold-start users or cold-start items, given its reliance on user behavior and semantic IDs. The LLM-driven Contextual Reasoning Instructions could offer a strong advantage here, and further analysis would be valuable.
- Interpretability of Instructions: While Contextual Reasoning Instructions are designed to be interpretable, the paper relies primarily on quantitative metrics. Case studies showing how these instructions lead to specific, nuanced recommendations that traditional systems would miss, and how users perceive the improved reasoning, would further strengthen the deep-reasoning claim.

Overall, OxygenREC represents a significant advance in bridging the gap between cutting-edge LLM research and the practical demands of industrial-scale recommendation systems, offering a robust and scalable framework for future development.