
RecGPT-V2 Technical Report

Published: 12/16/2025

TL;DR Summary

RecGPT-V2 introduces four innovations to enhance intent reasoning and efficiency, reducing GPU consumption by 60% while improving generalization and evaluation consistency. Online tests show significant performance gains in metrics such as CTR and IPV, indicating its industrial applicability.

Abstract

Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.

In-depth Reading

1. Bibliographic Information

1.1. Title

RecGPT-V2 Technical Report

1.2. Authors

The core contributors are: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, and Zhujin Gao. Additional contributors include: Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin Yuan, Yuliang Yan, and Zile Zhou, along with contributors from Renmin University of China. The author listing is in alphabetical order of first names.

1.3. Journal/Conference

This is a technical report published as a preprint on arXiv. The arXiv platform serves as a repository for preprints of scientific papers, making research widely accessible before or in parallel with peer review. The presence of a technical report on arXiv indicates active research and development in the field of recommender systems and large language models.

1.4. Publication Year

2025

1.5. Abstract

Large language models (LLMs) have significantly advanced recommender systems by shifting from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 pioneered this by integrating LLM-based reasoning into user interest mining and item tag prediction, it faced four main limitations: (1) computational inefficiency and cognitive redundancy in its multi-route architecture, (2) insufficient explanation diversity due to fixed-template generation, (3) limited generalization from supervised learning, and (4) simplistic, outcome-focused evaluation misaligned with human standards.

To overcome these, RecGPT-V2 introduces four innovations. First, a Hierarchical Multi-Agent System (HMAS), combined with Hybrid Representation Inference, restructures intent reasoning through coordinated collaboration and context compression. This reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, increasing explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, resulting in a +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, enhancing human preference alignment. Online A/B tests on Taobao show significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 demonstrates both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging cognitive exploration and industrial utility.

Preprint: https://arxiv.org/abs/2512.14503
PDF: https://arxiv.org/pdf/2512.14503v1.pdf

2. Executive Summary

2.1. Background & Motivation

Recommender systems have evolved significantly, moving from traditional methods like matrix factorization and deep neural networks to more sophisticated approaches. However, a fundamental limitation persists: most systems primarily rely on historical behavioral patterns and log-fitting objectives, optimizing for behavioral pattern matching rather than explicitly reasoning about the underlying user intent. This results in recommendations that might lack transparency, control, and deep personalization.

The paper aims to solve the problem of making recommender systems more intelligent, efficient, and human-aligned by leveraging the reasoning capabilities of Large Language Models (LLMs). While RecGPT-V1 was a pioneer in integrating LLM-based reasoning for user interest mining and item tag prediction, it suffered from several critical drawbacks that hindered its scalability, efficiency, and overall effectiveness in real-world industrial deployment:

  1. Computational Inefficiency and Cognitive Redundancy: RecGPT-V1 used a multi-route architecture where multiple LLMs independently processed the same user behavior sequences, leading to duplicated representation encoding and overlapping reasoning outputs. This was computationally expensive and inefficient.

  2. Insufficient Explanation Diversity: Its reliance on fixed prompt templates for generating recommendation explanations resulted in generic, repetitive, and context-insensitive outputs, failing to capture the dynamic nature of user needs.

  3. Limited Generalization: Training based on supervised learning on static data constrained the model's ability to adapt to dynamically evolving user needs and complex, multi-objective generation tasks in real-world scenarios.

  4. Simplistic Outcome-Focused Evaluation: RecGPT-V1's LLM-as-a-Judge evaluation framework focused solely on direct outcome prediction, overlooking the multi-dimensional, step-by-step reasoning that human evaluators employ, leading to suboptimal alignment with human quality standards.

    The importance of addressing these issues stems from the need to move beyond simple pattern matching to truly understand and cater to explicit user intent, enabling more transparent, personalized, and engaging recommendation experiences at scale in industrial settings like Taobao.

2.2. Main Contributions / Findings

RecGPT-V2 addresses the identified limitations of RecGPT-V1 with four key innovations, significantly advancing LLM-powered recommender systems:

  1. Agentic Intent Reasoning:

    • Innovation: Introduces a Hierarchical Multi-Agent System (HMAS) combined with Hybrid Representation Inference. HMAS orchestrates specialized expert agents under a Global Planner and Decision Arbiter to perform coordinated intent reasoning, eliminating cognitive duplication. Hybrid Representation Inference compresses user behavior contexts using atomized entity encoding.
    • Findings: Achieves a 60% reduction in GPU consumption and improves exclusive recall from 9.39% to 10.99%. It delivers a 53.11% improvement in Model FLOPs Utilization (MFU) and significant throughput gains (QPS and TPS improvements).
  2. Dynamic Explanation Generation:

    • Innovation: Employs a Meta-Prompting framework that dynamically generates contextually adaptive prompts for personalized explanation generation. This two-stage process (style synthesis then style-conditioned generation) overcomes the limitations of fixed templates.
    • Findings: Improves explanation diversity by +7.3% and enhances human-rated explanation acceptance by +13.0%, demonstrating superior user engagement.
  3. Constrained Reinforcement Learning Optimization:

    • Innovation: Implements constrained reinforcement learning with a novel Constrained Reward Shaping (CRS) mechanism. This mitigates multi-reward conflicts by treating secondary objectives (like diversity) as conditional constraints for primary objectives (like accuracy), ensuring stable optimization.
    • Findings: Achieves a +24.1% improvement in item tag prediction (HR@30) and +13.0% in explanation acceptance, outperforming naive sum-based reward aggregation.
  4. Process-Oriented Multi-Step Evaluation (Agentic Judge Framework):

    • Innovation: Introduces an Agent-as-a-Judge framework that decomposes assessment into multi-step reasoning using Multi-Dimension Sub-Evaluators and a Senior Reviewer for three-tier judgments (S-A-B). This framework is complemented by Judge-as-a-Reward for distilling agent judgments into dense reward signals for RL.

    • Findings: Improves human preference alignment by achieving higher accuracy and F1 scores in identifying Superior (S) quality for both item tag prediction and explanation generation compared to LLM-as-a-Judge baselines.

      Overall Impact: Online A/B tests on Taobao confirm the commercial viability and technical feasibility of RecGPT-V2, showing significant gains: +2.98% CTR (Click-Through Rate), +3.71% IPV (Item Page Views), +2.19% TV (Transaction Volume), and +11.46% NER (Novelty Exposure Rate). RecGPT-V2 bridges the gap between cognitive exploration and industrial utility, establishing a new paradigm for LLM-powered recommenders at scale.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Recommender Systems (RS): Software systems that suggest items (products, movies, articles, etc.) to users based on their preferences, past behaviors, and other contextual information. Their goal is to help users discover relevant content and enhance engagement.
    • Traditional RS: Often rely on statistical methods (matrix factorization) or deep learning (deep neural networks) to identify patterns in user-item interactions (e.g., collaborative filtering, content-based filtering). These systems typically focus on behavioral pattern matching, learning from implicit signals like clicks, purchases, or ratings.
    • LLM-powered RS: A newer paradigm that integrates Large Language Models (LLMs) to move beyond implicit pattern matching. LLMs provide capabilities for explicit intent reasoning, semantic understanding, and natural language generation, allowing the system to understand why a user might be interested in something and to explain recommendations in human-like language.
  • Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They excel at understanding, generating, and reasoning with human language.
    • Transformer Architecture: A neural network architecture introduced in 2017 that revolutionized sequence processing tasks. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.
    • Tokens: The basic units of text that an LLM processes. A token can be a word, a subword, or even a single character, depending on the tokenizer used.
    • Input/Output Lengths ($L_{in}$, $L_{out}$): The number of tokens in the input prompt and the generated response, respectively.
    • Prefill Phase: In LLM inference, the initial phase where the input prompt tokens are processed to compute their representations and populate the key-value (KV) cache. This phase is typically compute-intensive due to the quadratic complexity of attention with respect to input length ($O(L_{in}^2)$).
    • Decode Phase: The subsequent phase where the LLM generates tokens one by one, autoregressively. Each new token depends on previously generated tokens and the KV cache. This phase is often memory-intensive due to frequent access to the KV cache and has a complexity of $O(L_{in} \times L_{out})$.
  • Model FLOPs Utilization (MFU): A metric that measures how efficiently a hardware accelerator (like a GPU) is being used. It compares the actual floating-point operations performed to the theoretical maximum, indicating the percentage of peak performance achieved. Higher MFU means better hardware utilization.
  • Queries Per Second (QPS): A measure of throughput, indicating the number of requests (e.g., LLM prompts processed) a system can handle per second. For the prefill stage, it indicates how many input sequences can be processed.
  • Tokens Per Second (TPS): Another throughput metric, specifically for token generation. For the decode stage, it indicates how many output tokens can be generated per second.
  • Embeddings: Dense vector representations of text, items, or users. These vectors capture semantic meaning, where items with similar meanings or properties are closer in the embedding space. Embedding models (e.g., BGE, Qwen3-Embedding) are specialized neural networks trained to produce these representations.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to maximize cumulative reward over time.
    • Policy: The strategy that an RL agent uses to determine its next action based on the current state.
    • Reward Function: Defines the goal of the RL agent by assigning numerical values to outcomes.
    • Policy Optimization: The process of adjusting the agent's policy to improve its performance (i.e., maximize rewards).
  • Supervised Fine-Tuning (SFT): A common technique in LLM training where a pre-trained LLM is further trained on a smaller, task-specific dataset with labeled examples. This adapts the model's general knowledge to a particular downstream task.
  • Prompt Engineering: The art and science of crafting effective prompts (inputs) for LLMs to guide their behavior and elicit desired outputs. Meta-Prompting involves dynamically generating these prompts or instructions.
  • A/B Testing: A randomized controlled experiment used to compare two versions of a system (A and B) to determine which one performs better. In recommender systems, A/B tests are crucial for evaluating new features or models on real user traffic.

3.2. Previous Works

The paper builds upon RecGPT-V1 and references several foundational and contemporary works in LLMs and recommender systems:

  • RecGPT-V1 (Yi et al., 2025): This paper is the direct predecessor. RecGPT-V1 was a paradigm-shifting framework that first integrated LLMs into recommender systems for user interest mining and item tag prediction, transforming traditional pattern matching into an intent-centric recommendation objective. It demonstrated promising online performance but suffered from the four limitations (computational inefficiency, fixed explanations, limited generalization, simplistic evaluation) that RecGPT-V2 aims to overcome.
  • Matrix Factorization (Koren et al., 2009): A foundational technique in recommender systems. It decomposes the user-item interaction matrix into two lower-rank matrices, representing latent features for users and items, which are then multiplied to predict missing ratings or preferences.
  • Deep Neural Networks (Tang et al., 2025): The evolution of recommender systems to use deep learning models for more complex pattern recognition and representation learning.
  • Transformer-based LLMs (Achiam et al., 2023 for GPT-4, Yang et al., 2025 for Qwen3-Embedding): The core technology enabling LLM-powered reasoning. The Transformer architecture's efficiency and ability to handle long-range dependencies are key.
  • Embedding Models (Xiao et al., 2023 for BGE, Zhang et al., 2025c for Qwen3-Embedding): Used for encoding entity information (item descriptions, user query histories) into compact vector representations. These models are crucial for atomized entity compression.
  • In-Context Learning (Brown et al., 2020; Dong et al., 2024): A property of large language models where they can learn new tasks or adapt their behavior from examples provided directly within the prompt, without explicit weight updates. GPT-4's use for dynamic QA pair generation leverages this.
  • OneRec-Think (Liu et al., 2025), LC-Rec (Zheng et al., 2024), CoLLM (Zhang et al., 2025b): Other contemporary works that aim to integrate collaborative embeddings into LLMs for recommendation. RecGPT-V2 differentiates its Atomized Entity Compression approach by using a lightweight adaptor network instead of directly inserting new tokens into the LLM's vocabulary, claiming better parameter efficiency, generalization, and modularity.
  • Disaggregated Prefill-Decode Architectures (Liu et al., 2024; Zhong et al., 2024): Prior work that inspired RecGPT-V2's infrastructure optimizations for LLM serving, acknowledging the different computational demands of the prefill and decode phases.
  • Group Relative Policy Optimization (GRPO) (Liu et al., 2024; Shao et al., 2024): The specific reinforcement learning algorithm used for constrained reinforcement optimization in RecGPT-V2. It optimizes policy by comparing against a group of sampled outputs from an old policy and includes a KL divergence penalty for stability.
  • Poly-Encoder (Humeau et al., 2019): A model architecture used for efficient retrieval in large-scale systems, specifically for multi-interest user encoding in RecGPT-V2. It uses multiple context codes to aggregate user behavioral embeddings into distinct interest vectors.
  • Mainstream Advances in Prompt Engineering (Suzgun and Kalai, 2024; Zhang et al., 2023): General trends in prompt design that inform RecGPT-V2's Meta-Prompting framework, moving towards dynamic and adaptive prompt generation.
  • Agent-as-a-Judge (Gou et al., 2025; Zhang et al., 2025a; Zhuge et al., 2024): Recent work that motivates RecGPT-V2's Agent-as-a-Judge framework, moving beyond one-shot LLM-as-a-Judge to more structured, multi-step agentic evaluation.
  • DeepSeek-R1 (Guo et al., 2025), Qwen3-235B (Yang et al., 2025): Powerful LLMs used to generate training samples for the Agent-as-a-Judge framework, ensuring high-quality and diverse data for adaptation.

3.3. Technological Evolution

The evolution of recommender systems can be broadly categorized:

  1. Early Systems (e.g., Matrix Factorization, 2000s): Focused on collaborative filtering and content-based methods, identifying implicit patterns in user-item interaction data. These were effective but lacked deep semantic understanding and interpretability.

  2. Deep Learning Era (e.g., Deep Neural Networks, 2010s-early 2020s): Applied neural networks to learn more complex, non-linear relationships and rich representations from diverse data (e.g., item features, user profiles). This improved accuracy but still largely operated on implicit signals and behavioral pattern matching.

  3. LLM-Powered Era (RecGPT-V1, 2020s-present): Integrated Large Language Models to bring explicit intent reasoning and semantic understanding to the forefront. This allowed systems to process natural language queries, understand nuances of user interests, generate human-readable explanations, and decompose complex recommendation tasks. RecGPT-V1 pioneered this shift.

    RecGPT-V2 represents the next generation within the LLM-powered era. It refines the initial LLM integration by tackling the practical challenges of scalability, efficiency, dynamism, and robust evaluation that arose from RecGPT-V1. It moves towards more sophisticated agentic architectures, dynamic content generation, and human-aligned evaluation, pushing the boundaries of what LLMs can achieve in industrial-scale recommender systems.

3.4. Differentiation Analysis

RecGPT-V2 differentiates itself from its predecessor, RecGPT-V1, and other contemporary methods primarily by addressing the four core limitations:

  • Computational Efficiency & Redundancy:

    • RecGPT-V1: Used isolated multi-route LLM channels, each processing the full user behavior sequence and leading to significant computational overhead ($O(L_{in}^2)$) and cognitive redundancy (13.46% inter-route duplication).
    • RecGPT-V2: Introduces Hybrid Representation Inference (compressing 32K tokens to 11K) and a Hierarchical Multi-Agent System (HMAS) where a Global Planner orchestrates Distributed Experts and a Decision Arbiter. This eliminates redundant full-sequence encoding and overlapping reasoning.
    • Differentiation from other compression methods (e.g., OneRec-Think, LC-Rec, CoLLM): RecGPT-V2 uses a lightweight adaptor network to project entity embeddings into the LLM's input space, keeping the LLM backbone frozen. This offers superior parameter efficiency, generalization, and modularity compared to methods that directly insert new tokens into the LLM's vocabulary and require full model fine-tuning.
  • Explanation Diversity:

    • RecGPT-V1: Relied on fixed prompt templates, producing homogeneous and context-insensitive explanations.
    • RecGPT-V2: Implements a Meta-Prompting framework that dynamically generates contextually adaptive prompts by synthesizing user interests, item attributes, and real-time contextual signals (e.g., weather, seasonal events).
    • Differentiation: This shift from static templates to dynamic, two-stage (style synthesis then style-conditioned generation) prompting significantly enhances creative capacity and contextual relevance, yielding more diverse and engaging explanations.
  • Generalization & Multi-Objective Optimization:

    • RecGPT-V1: Used supervised fine-tuning (SFT) on static data, which limited generalization to dynamic, multi-objective, and multi-constraint real-world scenarios.
    • RecGPT-V2: Employs constrained reinforcement learning with a novel Constrained Reward Shaping (CRS) mechanism. This explicitly addresses multi-reward conflicts by treating secondary objectives as hard constraints for the primary objective, enabling stable optimization within feasible domains.
    • Differentiation: This moves beyond naive sum-based aggregation of rewards, which often leads to suboptimal solutions due to conflicting gradients, ensuring better balance and stability in achieving diverse objectives.
  • Evaluation Quality & Human Alignment:

    • RecGPT-V1: Used LLM-as-a-Judge for one-shot outcome evaluation, directly predicting quality scores without decomposing the assessment into intermediate reasoning steps. This resulted in suboptimal alignment with human judgment.
    • RecGPT-V2: Introduces an Agent-as-a-Judge framework that mimics human cognitive evaluation through hierarchical multi-agent reasoning (dimension-specific sub-evaluators and a Senior Reviewer for three-tier judgments (S-A-B)). It also uses Judge-as-a-Reward to distill these judgments into continuous reward signals for RL.
    • Differentiation: This process-oriented approach provides more accurate, interpretable, and human-aligned quality judgments, overcoming the limitations of black-box, outcome-focused evaluations. The Judge-as-a-Reward mechanism further makes this sophisticated evaluation computationally feasible for continuous RL optimization.

4. Methodology

RecGPT-V2 introduces a comprehensive framework designed to overcome the limitations of its predecessor. The methodology is structured around four key innovations: Agentic Intent Reasoning, Dynamic Explanation Generation, Constrained Reinforcement Optimization, and an Agentic Judge Framework. Each of these innovations contributes to a more efficient, diverse, generalizable, and human-aligned recommender system.

The overall architecture of RecGPT-V2 is illustrated in Figure 2. The system processes lifelong user behaviors, compresses them into hybrid contextual representations (§2.1.1), which then feed into a Hierarchical Multi-Agent System (HMAS) for intent decomposition and item tag prediction (§2.2). The predicted tags are used to retrieve items, which are then augmented with personalized explanations (§3). To ensure and continuously improve generation quality, an Agent-as-a-Judge evaluation framework (§4.1) assesses generation tasks, and a Judge-as-a-Reward distillation method (§4.2) converts these assessments into optimization reward signals.

4.1. Agentic Intent Reasoning

The goal of Agentic Intent Reasoning is to mitigate the computational inefficiency and cognitive redundancy observed in RecGPT-V1's multi-route architecture. This is achieved by jointly improving representation compactness and cognitive coordination through two main components: Hybrid Representation Inference and a Hierarchical Multi-Agent System.

4.1.1. Hybrid Representation Inference

RecGPT-V1 suffered from computational and memory bottlenecks because user lifelong behaviors accounted for approximately 95.89% of its input tokens (averaging 32K tokens). This section introduces Hybrid Representation Inference to reduce this overhead.

Atomized Entity Compression

This technique aims to compress entity information (item descriptions, user query histories) into compact atomic units, significantly reducing context storage and computational overhead. It involves two stages:

Stage 1: Atomic Representation Encoding. Pretrained embedding models (e.g., BGE, Qwen3-Embedding, TBstars-Embedding) are used to encode textual entity information into dense vector representations. A lightweight adaptor network then projects these embeddings into an atomic representation compatible with the LLM.

Given an entity $e$ with its textual description $\mathbf{x} = [w_1, w_2, \ldots, w_n]$ consisting of $n$ tokens, its embedding representation $\mathbf{h}$ is obtained as:

$$\mathbf{h} = f_{\mathrm{embed}}(\mathbf{x}) \in \mathbb{R}^{d_{\mathrm{emb}}}$$

Where:

  • $\mathbf{h}$: The embedding representation of the entity.

  • $f_{\mathrm{embed}}(\cdot)$: The embedding function (e.g., BGE, Qwen3-Embedding) that maps variable-length textual sequences $\mathbf{x}$ to fixed-dimensional dense vectors.

  • $\mathbf{x}$: The textual description of the entity, represented as a sequence of $n$ tokens.

  • $d_{\mathrm{emb}}$: The dimension of the embedding vector.

To bridge the gap between the embedding space and the LLM's language space, a lightweight adaptor network $f_{\mathrm{adapt}}(\cdot)$ projects $\mathbf{h}$ into an atomic representation $\mathbf{z}$ compatible with the LLM input:

$$\mathbf{z} = f_{\mathrm{adapt}}(\mathbf{h}) = \mathbf{W}_2 \cdot \mathrm{ReLU}(\mathbf{W}_1 \mathbf{h} + \mathbf{b}_1) + \mathbf{b}_2 \in \mathbb{R}^{d_{\mathrm{LLM}}}$$

Where:

  • $\mathbf{z}$: The atomic representation of the entity, denoted as [entity] in context. This single unit replaces the original multi-token textual description.

  • $f_{\mathrm{adapt}}(\cdot)$: The lightweight adaptor network, typically a two-layer feed-forward network with a ReLU activation function.

  • $\mathbf{W}_1 \in \mathbb{R}^{d_{\mathrm{hidden}} \times d_{\mathrm{emb}}}$, $\mathbf{W}_2 \in \mathbb{R}^{d_{\mathrm{LLM}} \times d_{\mathrm{hidden}}}$: Projection matrices.

  • $\mathbf{b}_1, \mathbf{b}_2$: Bias terms.

  • $d_{\mathrm{hidden}}$: The dimension of the hidden layer in the adaptor network.

  • $d_{\mathrm{LLM}}$: The hidden dimension of the LLM, ensuring compatibility.

    This atomized entity compression significantly reduces token length. For example, a 12-token Chinese product title can be compressed into a single atomic representation, achieving a 12:1 compression ratio. A user profile with 21,349 tokens can be reduced to 5,158 tokens (76% reduction) by replacing item descriptions and query texts with atomic representations, while preserving user attributes and temporal metadata in natural language. This creates a hybrid representation that balances compactness and contextual richness.

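To make the adaptor concrete, the sketch below implements the two-layer projection under stated assumptions; the class name `EntityAdaptor` and the dimensions `d_emb`, `d_hidden`, and `d_llm` are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class EntityAdaptor(nn.Module):
    """Minimal sketch of the lightweight adaptor f_adapt that projects a frozen
    embedding-model vector h into the LLM's hidden space (the atomic unit z)."""

    def __init__(self, d_emb: int, d_hidden: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_emb, d_hidden),   # W1 h + b1
            nn.ReLU(),
            nn.Linear(d_hidden, d_llm),   # W2 (.) + b2
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_emb) entity embeddings from, e.g., BGE or Qwen3-Embedding
        return self.proj(h)               # z: (batch, d_llm), one "token" per entity

# Usage: replace a multi-token product title with a single atomic representation.
adaptor = EntityAdaptor(d_emb=1024, d_hidden=2048, d_llm=4096)  # dims are assumptions
h = torch.randn(1, 1024)                  # embedding of the full title text
z = adaptor(h)                            # shape (1, 4096): one LLM-space unit
```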
Hybrid Representation Adaptation

To enable the LLM to understand hybrid contexts (interleaving natural language with compressed atomic units), a two-tier training strategy is designed, keeping the LLM backbone frozen and only training the adaptor parameters for parameter efficiency and preservation of general knowledge.

Stage 2: Hybrid Representation Adaptation

  1. Self-Perception Tasks ("What-is-it" philosophy): GPT-4 is used to dynamically generate diverse, attribute-focused questions (QA pairs) that probe the semantic completeness of atomic representations. This ensures that the compressed representations retain critical entity attributes.

    The meta-prompt for dynamic QA pair generation (Prompt 1) guides GPT-4 to produce questions and answers directly from a product title, ensuring all questions are answerable solely from the input text and outputting in JSON format.

    Given an entity $e$ with original text $\mathbf{x}$, GPT-4 automatically generates $K$ attribute-focused question-answer pairs:

    $$\{(\mathbf{q}_i, \mathbf{a}_i)\}_{i=1}^K = \mathrm{LLM}(\mathbf{x})$$

    Where:

    • $(\mathbf{q}_i, \mathbf{a}_i)$: The $i$-th question-answer pair.
    • $\mathbf{q}_i$: The question probing a specific entity attribute.
    • $\mathbf{a}_i$: The answer extracted from the entity's original text $\mathbf{x}$.
    • $\mathrm{LLM}(\mathbf{x})$: The powerful LLM (e.g., GPT-4) that generates QA pairs from the entity text.
    • $K$: The number of generated QA pairs.
  2. Production-Oriented Alignment: To validate practical applicability, compressed atomic units are integrated into two core recommendation generation tasks from RecGPT-V1:

    • User Interest Mining: Infers user interest profiles from interaction histories.
    • Item Tag Prediction: Predicts relevant item tags based on inferred interests and historical behaviors. For these tasks, reference samples are constructed using full textual representations, and ground-truth responses from the frozen LLM serve as supervision signals.

Unified Training Formulation Both self-perception QA tasks and production-oriented tasks share an identical optimization paradigm. The adaptor is trained such that hybrid prompts (with compressed entities) can reproduce the same responses as full-text prompts.

Given a reference sample with a full-text prompt $\mathcal{P}_{\mathrm{full}}$ and its corresponding response $\mathbf{y}^*$, the hybrid prompt $\mathcal{P}_{\mathrm{hybrid}}$ is constructed by replacing all entity texts with adaptor-projected representations:

$$\mathcal{P}_{\mathrm{hybrid}} = \phi(\mathcal{P}_{\mathrm{full}}), \quad \mathrm{where}\ \phi(\mathbf{x}_e) = f_{\mathrm{adapt}}(f_{\mathrm{embed}}(\mathbf{x}_e)), \ \forall e \in \mathcal{E}$$

Where:

  • $\mathcal{P}_{\mathrm{hybrid}}$: The hybrid prompt with compressed entities.

  • $\mathcal{P}_{\mathrm{full}}$: The full-text prompt.

  • $\phi(\cdot)$: The function that performs entity-to-atomic replacement.

  • $\mathbf{x}_e$: The textual description of an entity $e$.

  • $\mathcal{E}$: The set of all entities in $\mathcal{P}_{\mathrm{full}}$.

  • $f_{\mathrm{adapt}}(\cdot)$ and $f_{\mathrm{embed}}(\cdot)$: The adaptor and embedding functions defined earlier for atomic representation encoding.

The adaptor is optimized by minimizing the cross-entropy loss between model predictions on compressed inputs and the reference responses:

$$\mathcal{L}(\theta_{\mathrm{adapt}}) = - \sum_{t=1}^{|\mathbf{y}^*|} \log p\left(y_t^* \mid \mathcal{P}_{\mathrm{hybrid}}, \mathbf{y}_{<t}^*\right)$$

Where:

  • $\mathcal{L}(\theta_{\mathrm{adapt}})$: The cross-entropy loss to be minimized.

  • $\theta_{\mathrm{adapt}}$: The parameters of the adaptor network.

  • $p(\cdot)$: The output distribution of the frozen LLM.

  • $y_t^*$: The $t$-th token in the ground-truth response $\mathbf{y}^*$.

  • $\mathbf{y}_{<t}^*$: The sequence of ground-truth tokens before position $t$.

    This objective ensures that the adaptor learns semantic-preserving projections that maintain functional equivalence between compressed and full-text representations.

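As a rough illustration of this adaptation objective, the sketch below (assumed, not from the paper) trains only the adaptor while the LLM stays frozen; the HuggingFace-style `llm(inputs_embeds=...)` call, the placement of atomic units at the front of the prompt, and the teacher-forcing layout are simplifications.

```python
import torch
import torch.nn.functional as F

def adaptation_step(llm, adaptor, optimizer, entity_embs, text_embeds, target_ids):
    """One optimization step for hybrid-representation adaptation (illustrative).

    llm         : decoder-only LM with frozen parameters (requires_grad=False)
    adaptor     : the trainable EntityAdaptor from the previous sketch
    entity_embs : (E, d_emb) frozen embedding-model vectors for the entities
    text_embeds : (1, T, d_llm) embedded hybrid prompt plus the teacher-forced response
    target_ids  : (1, L) token ids of the reference response y*
    """
    z = adaptor(entity_embs).unsqueeze(0)               # (1, E, d_llm) atomic units
    inputs_embeds = torch.cat([z, text_embeds], dim=1)  # spliced in front for simplicity;
                                                        # the real system interleaves them
                                                        # at the original entity positions
    logits = llm(inputs_embeds=inputs_embeds).logits    # (1, E+T, vocab)

    # Next-token prediction on the last L positions, i.e. the response y*.
    pred = logits[:, -target_ids.size(1) - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

    optimizer.zero_grad()
    loss.backward()   # gradients reach only the adaptor; the LLM stays frozen
    optimizer.step()
    return loss.item()
```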
Infrastructure Engineering Optimization

To meet industrial latency requirements, two complementary infrastructure optimizations are introduced:

  1. Disaggregated Prefill-Decode Serving Architecture:
    • This architecture addresses the asymmetric input-output characteristic of recommendation generation (long inputs, short outputs).
    • The prefill phase is compute-intensive ($O(L_{in}^2)$), while the decode phase is memory-intensive ($O(L_{in} \times L_{out})$).
    • GPU resources are partitioned: a larger GPU pool is assigned to prefill for parallel throughput, and fewer resources to decode for efficient memory access. KV caches are transferred between phases.
  2. XQA Kernel Integration:
    • Replaces the FlashInfer kernel with the XQA kernel to leverage FP8 precision inference on H20 GPUs. XQA kernel provides superior performance for FP8 quantized models, accelerating attention computation and reducing memory bandwidth.

      These optimizations collectively improve MFU from 11.56% (RecGPT-V1) to 17.04% and achieve substantial throughput gains (e.g., a 7.35× TPS improvement in the decode stage), enabling scalable deployment.

The following figure (Figure 3 from the original paper) compares the inference architectures between RecGPT-V1 and RecGPT-V2:

Figure 3 | Comparison of inference architectures between RecGPT-V1 (full-text representation with coupled prefill-decode) and RecGPT-V2 (hybrid representation with disaggregated prefill-decode). RecGPT-V2 demonstrates substantial gains in GPU utilization and computational efficiency.

The following figure (Figure 4 from the original paper) shows the computational efficiency comparison between RecGPT-V1 and RecGPT-V2:

Figure 4 | Computational efficiency comparison across MFU, QPS (Prefill), and TPS (Decode). RecGPT-V2 reaches 17.70% MFU versus 11.56% for RecGPT-V1; with RecGPT-V1 normalized to 1, RecGPT-V2 achieves 69.30× QPS in the prefill stage and 7.35× TPS in the decode stage.

4.1.2. Hierarchical Multi-Agent System

To eliminate computational waste and cognitive duplication (13.46% inter-route duplication in RecGPT-V1), RecGPT-V2 proposes a Hierarchical Multi-Agent System (HMAS) with a three-tier architecture: Planner-Experts-Arbiter.

The following figure (Figure 5 from the original paper) illustrates the architectural comparison between RecGPT-V1's isolated multi-route reasoning and RecGPT-V2's HMAS:

Figure 5 | Architectural comparison between RecGPT-V1's isolated multi-route reasoning and RecGPT-V2's Hierarchical Multi-Agent System (Global Planner → Distributed Experts → Decision Arbiter), demonstrating reduced cognitive redundancy through coordinated intent decomposition.

Global Planner

The Global Planner is the top-level orchestrator. It performs holistic intent analysis by synthesizing rich contextual signals to decompose complex user intent into $K$ specialized personas. This eliminates redundant processing of raw sequences by individual experts and ensures coordinated reasoning.

Context Representation: The Global Planner receives a comprehensive contextual representation $C$ from three sources:

  1. User Behavioral History ($\mathcal{B}$): Temporally ordered user interactions, where each interaction $(a_i, e_i, t_i)$ includes an action type, an entity (item/query), and a timestamp. Behavioral entities are represented through atomic compression (from §2.1.1).

  2. User Profile ($\mathcal{U}$):

    • Static Attributes ($\mathcal{U}_{\mathrm{attr}}$): Demographics (age, gender, location).
    • Dynamic Interests ($\mathcal{U}_{\mathrm{int}}$): Behavioral patterns (e.g., "cycling enthusiast").
  3. Environmental Context ($\mathcal{E}$): Real-time multi-source signals (weather, seasonal factors, trending events) for situational intent mining.

    These components form a rich hybrid context:

    $$C = \{\mathcal{B}, \mathcal{U}, \mathcal{E}\}$$

    Where:

  • $C$: The comprehensive contextual representation.

  • $\mathcal{B}$: User Behavioral History, with entities compressed.

  • $\mathcal{U}$: User Profile, with static attributes and dynamic interests.

  • $\mathcal{E}$: Environmental Context, in natural language.

    Intent Decomposition: The Global Planner analyzes $C$ through multi-dimensional reasoning (temporal trends, situational adaptation, behavioral consistency) to uncover latent user needs and decompose them into $K$ specialized personas $\{p_1, p_2, \ldots, p_K\}$, where each persona represents a distinct facet of user intent:

    $$\{p_1, p_2, \ldots, p_K\} = f_{\mathrm{planner}}(C)$$

    Where:

  • $\{p_1, p_2, \ldots, p_K\}$: The set of $K$ specialized personas.

  • $f_{\mathrm{planner}}(\cdot)$: The reasoning function of the Global Planner.

    This design eliminates computational redundancy (decomposition done once) and ensures cognitive coordination by orchestrating complementary reasoning perspectives.

Distributed Experts

Upon receiving personas from the Global Planner, the distributed expert ensemble executes parallel yet complementary item tag prediction tasks. Each expert agent operates under its assigned persona to generate a set of item tags reflecting a distinct facet of user intent:

$$\mathcal{T}_k = f_{\mathrm{expert}}(p_k)$$

Where:

  • $\mathcal{T}_k = \{t_1^k, t_2^k, \ldots, t_{M_k}^k\}$: The set of $M_k$ predicted item tags for persona $p_k$.

  • $f_{\mathrm{expert}}(\cdot)$: The expert model for tag prediction.

  • $p_k$: The $k$-th specialized persona.

    To enhance expert capabilities, a two-stage training strategy is used: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) optimization.

Stage 1: Supervised Fine-Tuning (SFT). SFT establishes foundational expert capabilities using persona-aligned training samples. For a given persona $p_k$, supervision signals are constructed from the user's subsequent interactions. GPT-4 identifies item categories from the user's next interactions ($C_{\mathrm{next}}$) that semantically align with the persona's intent focus:

$$C_k^{\mathrm{rel}} = \{c \in C_{\mathrm{next}} \mid f_{\mathrm{GPT-4}}(c, p_k) = \mathrm{True}\}$$

Where:

  • $C_k^{\mathrm{rel}}$: The set of item categories relevant to persona $p_k$.

  • $C_{\mathrm{next}}$: All item categories from the user's subsequent (held-out next) interactions.

  • $f_{\mathrm{GPT-4}}(c, p_k)$: A binary classifier (implemented by GPT-4) that determines whether category $c$ is semantically relevant to persona $p_k$.

A target set $C_k^{\mathrm{target}}$ with exactly 15 elements is created. If $|C_k^{\mathrm{rel}}| < 15$, it is augmented with GPT-4-generated synthetic tags; if $|C_k^{\mathrm{rel}}| > 15$, 15 tags are randomly sampled. The expert model is then trained in a token-prediction paradigm by minimizing the cross-entropy loss:

$$\mathcal{L}_{\mathrm{SFT}}(\theta_{\mathrm{expert}}) = - \mathbb{E}_{(p_k, C_k^{\mathrm{target}})} \left[ \log p_{\theta_{\mathrm{expert}}}\left(C_k^{\mathrm{target}} \mid p_k\right) \right]$$

Where:

  • $\mathcal{L}_{\mathrm{SFT}}(\theta_{\mathrm{expert}})$: The cross-entropy loss for supervised fine-tuning.

  • $\theta_{\mathrm{expert}}$: The parameters of the expert model.

  • $\mathbb{E}[\cdot]$: Expectation over persona-target pairs.

  • $p_{\theta_{\mathrm{expert}}}(\cdot)$: The expert model's output distribution.

    The training data composition balances domain-specific knowledge with general language capabilities (Table 1). The following are the results from Table 1 of the original paper:

| Data Type | Proportion (%) |
| --- | --- |
| Recommendation Task | |
| • Pure Behavior Patterns | 32.17 |
| • Trending Topics & Events | 6.97 |
| • Weather-Related Contexts | 1.19 |
| • Other Situational Signals | 7.36 |
| General Language Modeling | 52.31 |

Stage 2: Constrained Reinforcement Optimization. Building on SFT, reinforcement learning further enhances expert performance across multiple objectives (diversity, relevance, accuracy). A constrained reward shaping (CRS) mechanism is used to balance conflicting multi-reward optimization.

Policy Optimization Framework (GRPO): The Group Relative Policy Optimization (GRPO) algorithm is used. For each input sample, a group of $G$ outputs is sampled from the old policy $\pi_{\theta_{\mathrm{old}}}$, and the new policy $\pi_{\theta}$ is optimized by minimizing the following objective:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = - \mathbb{E}_{(x, y) \sim \pi_{\theta_{\mathrm{old}}}} \left[ \min\left( r(\theta) \hat{A}(x, y), \ \mathrm{clip}\left(r(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}(x, y) \right) - \beta \, \mathbb{D}_{\mathrm{KL}}\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right) \right]$$

Where:

  • $\mathcal{L}_{\mathrm{GRPO}}(\theta)$: The GRPO objective function to be minimized.
  • $\theta$: Parameters of the new policy $\pi_{\theta}$.
  • $(x, y) \sim \pi_{\theta_{\mathrm{old}}}$: Input context $x$ and output $y$ sampled from the old policy.
  • $r(\theta) = \frac{\pi_{\theta}(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}$: The ratio of the probability of output $y$ under the new policy to that under the old policy.
  • $\hat{A}(x, y) = R(x, y) - \frac{1}{G} \sum_{i=1}^G R(x, y_i)$: The group-normalized advantage, i.e., the reward of the current output $y$ minus the average reward of the group of $G$ outputs sampled from the old policy.
  • $R(x, y)$: The reward function for input $x$ and output $y$.
  • $\epsilon$: The clipping parameter, typically a small value (e.g., 0.2), used to limit the policy update step size.
  • $\beta$: A coefficient controlling the strength of the KL penalty.
  • $\mathbb{D}_{\mathrm{KL}}\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right)$: The Kullback-Leibler (KL) divergence between the new policy $\pi_{\theta}$ and the reference policy $\pi_{\mathrm{ref}}$, estimated as
    $$\mathbb{D}_{\mathrm{KL}}\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right) = \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_{\theta}(y \mid x)} - \log \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_{\theta}(y \mid x)} - 1.$$
    This term prevents the policy from deviating too far from the reference model (the SFT base model), ensuring stability and mitigating reward hacking.

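For intuition, here is a simplified, assumed sketch of the clipped GRPO objective with group-normalized advantages and the KL estimator above; the sequence-level `logp_*` inputs and the default `eps`/`beta` values are hypothetical.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Simplified GRPO objective for one prompt with a group of G sampled outputs.

    logp_new, logp_old, logp_ref : (G,) sequence log-probs of each sampled output
                                   under the current, old, and reference policies
    rewards                      : (G,) scalar rewards R(x, y_i) for each output
    """
    # Group-normalized advantage: reward minus the group mean.
    advantage = rewards - rewards.mean()

    # Importance ratio r(theta) between current and old policies, with clipping.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_term = torch.minimum(ratio * advantage, clipped * advantage)

    # KL penalty toward the reference (SFT) policy: r_ref - log(r_ref) - 1.
    ref_ratio = torch.exp(logp_ref - logp_new)
    kl = ref_ratio - (logp_ref - logp_new) - 1.0

    # GRPO minimizes the negative of (policy term minus the KL penalty).
    return -(policy_term - beta * kl).mean()
```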
Multi-Reward Modeling: Four complementary components make up the multi-objective reward function:

  1. Accuracy Reward ($R_{\mathrm{acc}}$): Encourages experts to predict tags aligned with online user behavior by measuring recall against ground-truth interactions. Given predicted tags $\mathcal{T}_k = \{t_1, \ldots, t_M\}$ and interacted item categories $C_{\mathrm{gt}} = \{c_1, \ldots, c_N\}$:
     $$R_{\mathrm{acc}} = \frac{1}{|C_{\mathrm{gt}}|} \sum_{c \in C_{\mathrm{gt}}} \mathbb{I}\left[c \in f_{\mathrm{tag2cat}}(\mathcal{T}_k)\right]$$
     Where:
     • $R_{\mathrm{acc}}$: The accuracy reward.
     • $|C_{\mathrm{gt}}|$: The number of ground-truth item categories.
     • $\mathbb{I}[\cdot]$: The indicator function, which is 1 if the condition is true and 0 otherwise.
     • $f_{\mathrm{tag2cat}}(\mathcal{T}_k)$: A function that maps predicted tags $\mathcal{T}_k$ to item categories.
  2. Alignment Reward ($R_{\mathrm{align}}$): Ensures predicted tags align with human quality standards and the assigned persona's intent. A dedicated reward model $f_{\mathrm{RM}}$ (trained on preference pairs derived from RecGPT-V1's quality criteria) scores each predicted tag $t_i$ with respect to persona $p_k$:
     $$R_{\mathrm{align}} = \frac{1}{M_k} \sum_{i=1}^{M_k} f_{\mathrm{RM}}(t_i, p_k)$$
     Where:
     • $R_{\mathrm{align}}$: The alignment reward.
     • $M_k$: The number of predicted tags for persona $p_k$.
     • $f_{\mathrm{RM}}(t_i, p_k)$: The reward model that outputs an alignment score for tag $t_i$ given persona $p_k$.
  3. Diversity Reward ($R_{\mathrm{div}}$): Encourages experts to explore diverse user interests. It measures the semantic richness of predicted tags by embedding them with the BGE model (Xiao et al., 2023) and computing the average cosine distance among tag representations:
     $$R_{\mathrm{div}} = 1 - \frac{2}{M_k(M_k - 1)} \sum_{i=1}^{M_k - 1} \sum_{j=i + 1}^{M_k} \frac{\mathbf{e}_i \cdot \mathbf{e}_j}{\|\mathbf{e}_i\| \, \|\mathbf{e}_j\|}$$
     Where:
     • $R_{\mathrm{div}}$: The diversity reward.
     • $\mathbf{e}_i = f_{\mathrm{BGE}}(t_i)$: The embedding of tag $t_i$ produced by the BGE embedding model.
     • $\mathbf{e}_i \cdot \mathbf{e}_j$: The dot product of embeddings $\mathbf{e}_i$ and $\mathbf{e}_j$.
     • $\|\mathbf{e}_i\|$: The L2 norm of embedding $\mathbf{e}_i$.
     Higher diversity scores encourage broader intent coverage without redundant predictions.
  4. Length Reward ($R_{\mathrm{len}}$): Promotes appropriate tag lengths (6-11 words for optimal informativeness and retrieval). For each predicted tag $t$ with word count $l$, the reward $R_{\mathrm{len}}(t)$ is defined as:
     $$R_{\mathrm{len}}(t) = \left\{ \begin{array}{ll} 1.0, & \mathrm{if}\ 6 \leq l \leq 11, \\ 0.5, & \mathrm{if}\ 4 \leq l < 6\ \mathrm{or}\ 11 < l \leq 13, \\ 0.0, & \mathrm{otherwise}. \end{array} \right.$$
     The overall length reward for a set of $M$ predicted tags is the average:
     $$R_{\mathrm{len}} = \frac{1}{M} \sum_{i=1}^M R_{\mathrm{len}}(t_i)$$

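To make these four components concrete, here is an assumed Python sketch; `tag_to_categories`, `alignment_model`, and `embed` stand in for the tag-to-category mapper, the alignment reward model, and the BGE embedder, none of which are specified in code by the paper.

```python
import numpy as np

def accuracy_reward(pred_tags, gt_categories, tag_to_categories):
    # Recall of ground-truth categories covered by the predicted tags.
    covered = set().union(*(tag_to_categories(t) for t in pred_tags)) if pred_tags else set()
    return sum(c in covered for c in gt_categories) / max(len(gt_categories), 1)

def alignment_reward(pred_tags, persona, alignment_model):
    # Mean reward-model score of each tag with respect to the persona.
    return float(np.mean([alignment_model(t, persona) for t in pred_tags]))

def diversity_reward(pred_tags, embed):
    # One minus the average pairwise cosine similarity of tag embeddings.
    embs = [embed(t) for t in pred_tags]
    sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for i, a in enumerate(embs) for b in embs[i + 1:]]
    return 1.0 - float(np.mean(sims)) if sims else 0.0

def length_reward(pred_tags):
    # Piecewise reward favoring 6-11 word tags, averaged over the tag set.
    def single(tag):
        l = len(tag.split())
        if 6 <= l <= 11:
            return 1.0
        if 4 <= l < 6 or 11 < l <= 13:
            return 0.5
        return 0.0
    return float(np.mean([single(t) for t in pred_tags]))
```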
Constrained Reward Shaping (CRS): Unlike naive sum-based aggregation of rewards (SUM), which leads to multi-reward conflicts, CRS treats certain rewards as hard constraints for optimizing the primary accuracy objective. It enforces a two-stage optimization: first satisfy secondary constraints, then optimize the primary reward. The following figure (Figure 6 from the original paper) compares reward shaping strategies:

Figure 6 | Comparison of reward shaping strategies. (a) Sum-based aggregation suffers from multi-reward conflicts. (b) Constrained reward shaping treats secondary rewards (e.g., diversity) as conditional constraints, enabling stable optimization of the primary reward (i.e., accuracy).

The composite reward is defined as a product of conditional indicators:

$$R_{\mathrm{total}} = R_{\mathrm{acc}} \cdot \mathbb{I} \big[ R_{\mathrm{align}} \ge \tau_{\mathrm{align}} \big] \cdot \mathbb{I} \big[ R_{\mathrm{div}} \ge \tau_{\mathrm{div}} \big] \cdot \mathbb{I} \big[ R_{\mathrm{len}} \ge \tau_{\mathrm{len}} \big]$$

Where:

  • $R_{\mathrm{total}}$: The total composite reward.

  • $R_{\mathrm{acc}}, R_{\mathrm{align}}, R_{\mathrm{div}}, R_{\mathrm{len}}$: The individual reward components defined above.

  • $\mathbb{I}[\cdot]$: The indicator function, which evaluates to 1 if the condition inside the brackets is true and 0 otherwise.

  • $\tau_{\mathrm{align}}, \tau_{\mathrm{div}}, \tau_{\mathrm{len}}$: Predefined thresholds for the alignment, diversity, and length rewards, respectively.

    This multiplicative formulation ensures that the accuracy reward is propagated only when all secondary objectives meet their minimum requirements, effectively mitigating conflicting gradient signals.

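A minimal sketch of constrained reward shaping versus sum-based aggregation, assuming the reward functions from the previous sketch; the threshold values below are placeholders, not the paper's settings.

```python
def crs_total_reward(r_acc, r_align, r_div, r_len,
                     tau_align=0.5, tau_div=0.3, tau_len=0.8):
    """Constrained reward shaping: secondary objectives gate the primary reward."""
    gates = (r_align >= tau_align) and (r_div >= tau_div) and (r_len >= tau_len)
    return r_acc if gates else 0.0

def sum_total_reward(r_acc, r_align, r_div, r_len):
    """Naive sum-based aggregation, shown for contrast; conflicting gradients
    from the secondary terms can dominate the accuracy signal."""
    return r_acc + r_align + r_div + r_len

# Example: diversity below its threshold zeroes the CRS reward,
# while the sum-based reward still looks deceptively high.
print(crs_total_reward(0.9, 0.7, 0.1, 1.0))  # -> 0.0
print(sum_total_reward(0.9, 0.7, 0.1, 1.0))  # -> 2.7
```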
The following figure (Figure 7 from the original paper) compares training dynamics between sum-based and constrained reward shaping:

Figure 7 | Training dynamics comparison between sum-based and constrained reward shaping: (a) gradient norm, (b) KL divergence from the reference model, (c) accuracy reward, (d) diversity reward. CRS maintains stable optimization across all metrics, while SUM suffers from multi-reward conflicts.

Decision Arbiter

After experts generate tag predictions, the Decision Arbiter performs final candidate selection from the aggregated tag pool $\mathcal{T}_{\mathrm{all}} = \bigcup_{k=1}^K \mathcal{T}_k$. It leverages the hybrid context $C = \{\mathcal{B}, \mathcal{U}, \mathcal{E}\}$ to holistically evaluate all candidate tags across multiple quality dimensions (relevance, consistency, specificity, and validity, defined in Appendix B). The arbiter performs joint reasoning to identify the top-$N$ tags that collectively maximize these dimensions:

$$\mathcal{T}_{\mathrm{final}} = f_{\mathrm{arbiter}}(\mathcal{T}_{\mathrm{all}}, C)$$

Where:

  • $\mathcal{T}_{\mathrm{final}}$: The refined set of item tags for downstream retrieval.

  • $f_{\mathrm{arbiter}}(\cdot)$: The Decision Arbiter.

  • $\mathcal{T}_{\mathrm{all}}$: The aggregated pool of tags from all expert agents.

  • $C$: The hybrid context.

    This joint evaluation considers inter-tag complementarity and avoids redundancy, consolidating expert outputs into a cohesive recommendation.

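Putting the three tiers together, the following assumed sketch shows the Planner-Experts-Arbiter control flow; `call_llm` is a placeholder for whatever prompted LLM backs each role, and the prompts are paraphrases rather than the paper's actual templates.

```python
def call_llm(role_prompt: str, payload: dict) -> dict:
    """Placeholder for an LLM call; assumed to return parsed JSON."""
    raise NotImplementedError

def hmas_predict_tags(context: dict, k_personas: int = 4, top_n: int = 30) -> list[str]:
    """Hierarchical Multi-Agent System: Global Planner -> Distributed Experts -> Arbiter."""
    # 1) Global Planner decomposes the hybrid context into K specialized personas.
    personas = call_llm(
        f"Decompose this user's intent into {k_personas} distinct personas.",
        context,
    )["personas"]

    # 2) Distributed Experts predict tag sets in parallel, one per persona.
    tag_pool = []
    for persona in personas:
        expert_out = call_llm(
            "Under this persona, predict item tags the user may want next.",
            {"persona": persona},
        )
        tag_pool.extend(expert_out["tags"])

    # 3) Decision Arbiter jointly selects the final top-N complementary tags.
    final = call_llm(
        f"Select the best {top_n} complementary, non-redundant tags.",
        {"context": context, "candidates": tag_pool},
    )
    return final["tags"]
```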
Online Item Recommendation

  • Multi-Interest User Encoding: Building on RecGPT-V1's three-tower architecture (user-item-tag), the user encoder is extended to capture multiple interest facets using a Poly-Encoder (Humeau et al., 2019). This introduces $K$ learnable context codes that aggregate user behavioral embeddings into multiple interest vectors $\{\mathbf{u}_1, \ldots, \mathbf{u}_K\}$ via attention mechanisms. These vectors represent distinct aspects of user preferences, enabling fine-grained matching (see the sketch after this list).
  • Traffic Allocation via Quadratic Programming: To balance exploration (cognitive channel items) and exploitation (existing utility channel items) under limited exposure budgets, traffic allocation is formulated as a quadratic programming problem. This dynamically adjusts the proportion of cognitive retrieval items in the recommendation slate to maximize overall system revenue while ensuring exploratory recommendations enhance long-term user engagement without compromising short-term business metrics. The detailed problem formulation and solution are provided in Appendix C.

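As a rough illustration of Poly-Encoder-style multi-interest encoding (an assumed sketch, not the production three-tower model), the module below uses K learnable context codes to attend over behavioral embeddings:

```python
import torch
import torch.nn as nn

class MultiInterestEncoder(nn.Module):
    """Poly-Encoder-style aggregation: K context codes attend over a user's
    behavioral embeddings to produce K interest vectors u_1..u_K (sketch)."""

    def __init__(self, d_model: int, num_interests: int):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_interests, d_model))  # K context codes

    def forward(self, behavior_embs: torch.Tensor) -> torch.Tensor:
        # behavior_embs: (seq_len, d_model) embeddings of the user's behavior sequence
        attn = torch.softmax(self.codes @ behavior_embs.T, dim=-1)      # (K, seq_len)
        return attn @ behavior_embs                                     # (K, d_model)

encoder = MultiInterestEncoder(d_model=128, num_interests=4)
interests = encoder(torch.randn(50, 128))   # four interest vectors for one user
```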
4.2. Dynamic Explanation Generation

RecGPT-V2 enhances explanation generation by addressing low information density, weak temporal adaptation, and homogenized expression of RecGPT-V1, which resulted from static prompt templates and incomplete evaluation. It introduces Meta-Prompting and Preference-Aware Reinforcement Learning.

4.2.1. Meta-Prompting

Unlike RecGPT-V1's direct one-step generation from fixed templates, Meta-Prompting decomposes the generation process into two stages: style synthesis followed by style-conditioned explanation generation. This hierarchical design allows for diverse and contextually adaptive explanations.

Expanded Evaluation Dimensions: RecGPT-V1's four dimensions (Relevance, Factuality, Clarity, Safety) are expanded to seven by adding Timeliness (alignment with trends/events), Informativeness (substantive insights), and Attractiveness (emotional appeal).

Two-Stage Generation Framework: Given user interests ($\mathcal{U}$), item attributes ($\mathcal{I}$), and contextual signals ($S$), the framework operates as follows:

  1. Stage 1: Style Synthesis: The model first generates a stylistic guideline $g$ specifying tone, rhetorical devices, target audience, and emotional resonance:
     $$g = f_{\mathrm{meta}}(\mathcal{U}, \mathcal{I}, S)$$
     Where:
     • $g$: The generated stylistic guideline.
     • $f_{\mathrm{meta}}(\cdot)$: The meta-prompt generator.
     • $\mathcal{U}$: User interests.
     • $\mathcal{I}$: Item attributes.
     • $S$: Situational signals (e.g., weather, seasonal trends).
  2. Stage 2: Style-Conditioned Explanation Generation: Conditioned on the style guideline $g$, the model generates the final explanation $e$ adhering to the specified stylistic constraints:
     $$e = f_{\mathrm{exp}}(g, \mathcal{U}, \mathcal{I}, S)$$
     Where:
     • $e$: The final explanation.

     • $f_{\mathrm{exp}}(\cdot)$: The explanation generation function.
      This two-stage approach provides flexibility and leverages the model's creative capacity.

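A minimal sketch of the two-stage call pattern, assuming a generic `generate(prompt)` LLM helper; the prompt wording is illustrative, not the paper's meta-prompt.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to the explanation LLM."""
    raise NotImplementedError

def explain(user_interests: str, item_attrs: str, context: str) -> str:
    # Stage 1: synthesize a style guideline g conditioned on (U, I, S).
    style = generate(
        "Propose a style guideline (tone, rhetoric, audience, emotion) for a "
        f"recommendation explanation.\nUser: {user_interests}\n"
        f"Item: {item_attrs}\nContext: {context}"
    )
    # Stage 2: generate the explanation e conditioned on g and (U, I, S).
    return generate(
        f"Following this style guideline:\n{style}\n"
        f"Write a short recommendation explanation.\nUser: {user_interests}\n"
        f"Item: {item_attrs}\nContext: {context}"
    )
```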
4.2.2. Preference-Aware Reinforcement Learning

Building on SFT, constrained reinforcement learning (similar to §2.2.2) further enhances explanation quality. It uses a hybrid reward framework combining rule-based diversity rewards and model-based alignment rewards, unified under the Constrained Reward Shaping (CRS) mechanism.

Policy Optimization Framework: The GRPO algorithm (Equation 6) is used, with the reward function replaced by an explanation-specific composite reward.

Hybrid Reward Modeling:

  1. Rule-Based Diversity Reward ($R_{\mathrm{div}}$): To encourage varied linguistic expressions, an IDF-inspired diversity reward is used. A memory buffer $\mathcal{M}$ (size 160) stores recently generated explanations. For a new explanation $e = \{w_1, w_2, \ldots, w_L\}$:
     $$R_{\mathrm{div}} = \frac{1}{L} \sum_{i=1}^L \log \frac{|\mathcal{M}|}{|\{e' \in \mathcal{M} : w_i \in e'\}| + 1}$$
     Where:
     • $R_{\mathrm{div}}$: The diversity reward for the explanation.
     • $L$: The length (number of tokens) of the explanation $e$.
     • $|\mathcal{M}|$: The size of the memory buffer $\mathcal{M}$.
     • $|\{e' \in \mathcal{M} : w_i \in e'\}|$: The count of stored explanations containing token $w_i$. The logarithmic term rewards rare tokens, and the $+1$ provides smoothing.
  2. Model-Based Alignment Reward ($R_{\mathrm{align}}$): To capture subjective quality (e.g., informativeness), a reward model $f_{\mathrm{RM}}(\cdot)$ is trained on preference data using listwise comparisons (detailed in §4.2, Judge-as-a-Reward). Given a generated explanation $e$:
     $$R_{\mathrm{align}} = f_{\mathrm{RM}}(e, \mathcal{U}, \mathcal{I}, S)$$
     Where:
     • $R_{\mathrm{align}}$: The alignment reward for the explanation.

     • $f_{\mathrm{RM}}(\cdot)$: The reward model.

     • $e$: The generated explanation.

     • $\mathcal{U}, \mathcal{I}, S$: User interests, item attributes, and situational signals, respectively.

Constrained Reward Shaping (CRS): Consistent with §2.2.2, CRS is adopted. For explanation generation, human preference alignment ($R_{\mathrm{align}}$) is the primary reward, with diversity ($R_{\mathrm{div}}$) as a secondary constraint:

$$R_{\mathrm{total}} = R_{\mathrm{align}} \cdot \mathbb{I} \left[ R_{\mathrm{div}} \geq \tau_{\mathrm{div}} \right]$$

Where:

  • $R_{\mathrm{total}}$: The total composite reward.
  • $\tau_{\mathrm{div}}$: The diversity threshold. This formulation treats diversity as a gating condition, preventing gradient interference and enabling stable optimization towards human-aligned, diverse explanations.

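The IDF-style diversity reward and its gating role are easy to sketch; the buffer size matches the paper's 160, while the diversity threshold and the naive whitespace tokenizer below are assumptions.

```python
import math
from collections import deque

class DiversityReward:
    """IDF-inspired diversity reward over a buffer of recent explanations (sketch)."""

    def __init__(self, buffer_size: int = 160):
        self.memory = deque(maxlen=buffer_size)

    def score(self, explanation: str) -> float:
        tokens = explanation.split()               # naive tokenization for illustration
        if not tokens or not self.memory:
            self.memory.append(set(tokens))
            return 0.0
        m = len(self.memory)
        # Average log(|M| / (doc frequency of token + 1)) over the explanation.
        reward = sum(
            math.log(m / (sum(tok in past for past in self.memory) + 1))
            for tok in tokens
        ) / len(tokens)
        self.memory.append(set(tokens))
        return reward

def explanation_reward(r_align: float, r_div: float, tau_div: float = 0.5) -> float:
    """CRS for explanation generation: alignment gated by the diversity constraint."""
    return r_align if r_div >= tau_div else 0.0
```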
4.3. Agentic Judge Framework

RecGPT-V1's LLM-as-a-Judge was outcome-focused, limiting its ability to capture nuanced quality. RecGPT-V2 introduces a novel Agentic Judge Framework with Agent-as-a-Judge and Judge-as-a-Reward to enhance evaluation quality and create a self-reinforcing Flywheel Effect.

The following figure (Figure 8 from the original paper) illustrates the Agent-as-a-Judge framework:

Figure 8 | Agent-as-a-Judge framework mimicking human process-oriented, fine-grained evaluation. Multi-dimension sub-evaluators independently assess specialized quality dimensions, and a Senior Reviewer aggregates their feedback into three-tier judgments (Superior/Average/Bad).

4.3.1. Agent-as-a-Judge

This framework mirrors human cognitive evaluation by decomposing holistic quality assessment into fine-grained, dimension-specific sub-evaluators followed by a multi-level review.

Multi-Dimension Sub-Evaluators: For each evaluation dimension (e.g., relevance, diversity, coherence, as detailed in Appendix B), a specialized sub-evaluator is instantiated. Each sub-evaluator Ei\mathcal{E}_i assesses the generated content yy along its assigned dimension did_i. si=Ei(y,di) s_i = \mathcal{E}_i (y, d_i) Where:

  • sis_i: The dimension-specific evaluation result for the ii-th dimension.

  • Ei\mathcal{E}_i: The specialized sub-evaluator for dimension did_i.

  • yy: The generated content (e.g., item tags, explanation).

  • did_i: The specific evaluation dimension.

    This decomposition transforms complex multi-objective evaluation into manageable single-objective tasks.

Three-Tier Judgment: A Senior Reviewer Agent aggregates the outputs {s1,,sD}\{s_1, \ldots, s_D\} from all sub-evaluators to produce a final overall quality judgment using a three-tier S-A-B scheme:

  • Superior (S): Output excels across all or most dimensions.

  • Average (A): Output meets minimum standards across dimensions.

  • Bad (B): Output fails to satisfy basic requirements in at least one critical dimension.

    The aggregation procedure involves a two-stage decision:

  1. Defect Detection: If any dimension receives a negative or unsatisfactory signal, the overall result is classified as Bad (B).

  2. Excellence Elevation: If no critical defects are detected, the Senior Reviewer distinguishes between Superior (S) and Average (A) based on the proportion or pattern of positive feedback among all dimensions, using a threshold τ\tau.
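As a rough illustration of this two-stage decision, the sketch below assumes each sub-evaluator emits a coarse positive/neutral/negative signal and uses an arbitrary threshold value; the actual dimension set and signal format follow Appendix B of the paper.

```python
# Illustrative sketch of the Senior Reviewer's two-stage S-A-B aggregation.
# Dimension names, the signal vocabulary, and tau are assumptions for illustration;
# the report specifies only defect detection followed by threshold-based elevation.
from typing import Dict

def senior_review(dimension_signals: Dict[str, str], tau: float = 0.8) -> str:
    """Aggregate sub-evaluator signals {dimension: 'positive'|'neutral'|'negative'} into S/A/B."""
    # Stage 1 - Defect Detection: any unsatisfactory dimension => Bad.
    if any(sig == "negative" for sig in dimension_signals.values()):
        return "B"
    # Stage 2 - Excellence Elevation: proportion of positive feedback vs. threshold tau.
    positive_ratio = sum(sig == "positive" for sig in dimension_signals.values()) / len(dimension_signals)
    return "S" if positive_ratio >= tau else "A"

# Example: relevance and coherence strong, diversity merely acceptable -> Average.
print(senior_review({"relevance": "positive", "coherence": "positive", "diversity": "neutral"}))  # "A"
```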

    Model Adaptation through Supervised Fine-Tuning: Evaluation agents are adapted to domain-specific quality standards by training on a corpus combining model-generated samples and outputs from powerful LLMs (e.g., DeepSeek-R1, Qwen3-235B). A hybrid annotation strategy is used: (1) in-batch shuffling for automatic construction of Bad-quality samples (e.g., mismatched user contexts for relevance); (2) human annotators for nuanced judgments across all evaluation dimensions and holistic S-A-B labels. A lightweight Qwen3-32B-Instruct model is fine-tuned using SFT.

4.3.2. Judge-as-a-Reward

While Agent-as-a-Judge provides accurate assessment, its discrete classification labels lack granularity for fine-grained policy gradient estimation, and its multi-step evaluation is computationally expensive for online RL training. Judge-as-a-Reward addresses this by distilling agent evaluation capabilities into lightweight reward models for dense optimization signals.

Reward Model Architecture: The reward model is initialized from the Agent Judge checkpoint (meaning it shares the same underlying LLM architecture) but with an added scalar value head. r=fRM(y,U,I,S) r = f_{\mathrm{RM}}(y, \mathcal{U}, \mathcal{I}, S) Where:

  • rRr \in \mathbb{R}: The predicted reward score, bounded to [0, 1] by a sigmoid activation.

  • fRM()f_{\mathrm{RM}}(\cdot): The reward model.

  • yy: The generated content.

  • U,I,S\mathcal{U}, \mathcal{I}, S: User interests, item attributes, and situational signals, respectively.

    Reward Model Training via Listwise Learning-to-Rank: To preserve fine-grained quality distinctions from the Senior Reviewer's three-tier labels (S, A, B), a listwise learning-to-rank approach is used. Samples in each batch are grouped by quality level. For any quality level gg, samples at gg are positive instances, and samples at lower levels (g<gg' < g) are negative. The reward model is trained to assign higher scores to higher-quality samples using a unified contrastive loss formulation: LRM=g{S,A}ygYglogexp(fRM(yg))exp(fRM(yg))+g<gygYgexp(fRM(yg)) \mathcal{L}_{\mathrm{RM}} = - \sum_{g \in \{\mathrm{S}, \mathrm{A}\}} \sum_{y_g \in \mathcal{Y}_g} \log \frac{\exp(f_{\mathrm{RM}}(y_g))}{\exp(f_{\mathrm{RM}}(y_g)) + \sum_{g' < g} \sum_{y_{g'} \in \mathcal{Y}_{g'}} \exp(f_{\mathrm{RM}}(y_{g'}))} Where:

  • LRM\mathcal{L}_{\mathrm{RM}}: The contrastive loss for the reward model.

  • g{S,A}g \in \{\mathrm{S}, \mathrm{A}\}: Quality levels (S for Superior, A for Average).

  • ygYgy_g \in \mathcal{Y}_g: A sample ygy_g belonging to the set of samples Yg\mathcal{Y}_g at quality level gg.

  • fRM(yg)f_{\mathrm{RM}}(y_g): The reward score predicted by the reward model for sample ygy_g.

  • g<gg' < g: Denotes all quality levels lower than gg (e.g., for g=Sg=\mathrm{S}, gg' includes A and B; for g=Ag=\mathrm{A}, gg' only includes B).

  • Yg\mathcal{Y}_{g'}: The set of samples at quality level gg'.

    This formulation implicitly captures all pairwise relationships (S vs. AB, A vs. B), enabling the reward model to learn the complete hierarchical preference ordering.
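A minimal PyTorch rendering of this listwise objective is sketched below. It assumes fRMf_{\mathrm{RM}} returns unnormalized scalar scores per candidate (the sigmoid-bounded output described above would be applied at inference) and uses toy score values, so it illustrates the loss structure rather than the report's exact implementation.

```python
# Sketch of the listwise learning-to-rank loss for Judge-as-a-Reward.
# Variable names and the toy scores are illustrative, not from the report.
import torch

def listwise_rm_loss(scores_by_level: dict) -> torch.Tensor:
    """L_RM over quality levels ordered S > A > B.

    scores_by_level maps 'S'/'A'/'B' to 1-D tensors of f_RM scores; each sample
    at level g is contrasted against all samples at strictly lower levels.
    """
    order = ["S", "A", "B"]
    loss = torch.tensor(0.0)
    for gi, g in enumerate(order[:-1]):  # g in {S, A}
        lower = [scores_by_level[g2] for g2 in order[gi + 1:] if len(scores_by_level[g2]) > 0]
        if len(scores_by_level[g]) == 0 or not lower:
            continue
        neg = torch.cat(lower)  # all strictly lower-quality candidates
        for s_pos in scores_by_level[g]:
            denom = torch.logsumexp(torch.cat([s_pos.view(1), neg]), dim=0)
            loss = loss - (s_pos - denom)  # negative log-softmax of the positive
    return loss

# Toy batch: two Superior, one Average, two Bad candidates.
scores = {
    "S": torch.tensor([2.1, 1.7], requires_grad=True),
    "A": torch.tensor([0.9], requires_grad=True),
    "B": torch.tensor([-0.5, 0.1], requires_grad=True),
}
print(listwise_rm_loss(scores))
```

This mirrors the formulation above: for g = S the negatives pool A and B samples, while for g = A they include only B samples, so all pairwise orderings (S vs. A/B, A vs. B) are captured implicitly.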

Engineering Acceleration via Prefix Sharing: To accelerate training, shared prefix representations are computed once and reused across all candidates within contrastive groups, enabling parallel inference and reducing redundant computation.

Self-Improving Flywheel Effect: The integration of Agent-as-a-Judge and Judge-as-a-Reward creates a self-reinforcing optimization cycle for continuous quality improvement without recurring human annotation costs:

  1. Policy Generation: Policy model generates diverse responses through SFT and RL.

  2. Agentic Evaluation: Agent-as-a-Judge decomposes samples into dimension-specific assessments and holistic S-A-B judgments.

  3. Reward Distillation: Judge-as-a-Reward distills discrete agent judgments into continuous, differentiable reward signals by learning preference structures.

  4. Policy Optimization: Distilled reward signals guide policy refinement via GRPO to maximize human-aligned preferences.

    This closed-loop architecture operates autonomously after initial human annotation, progressively aligning model behavior with human quality standards, with reward distillation ensuring computational efficiency and Multi-Dimension evaluation guaranteeing comprehensive quality improvements.

5. Experimental Setup

To validate RecGPT-V2's effectiveness in practical industrial applications, long-term online experiments were conducted on Taobao's platform.

5.1. Datasets

The paper describes the data used for training the Distributed Experts (Table 1) and for training the Agent-as-a-Judge framework. For the Distributed Experts (specifically for Supervised Fine-Tuning):

  • Recommendation Task Data:
    • Pure Behavior Patterns: 32.17%
    • Trending Topics & Events: 6.97%
    • Weather-Related Contexts: 1.19%
    • Other Situational Signals: 7.36%
  • General Language Modeling Data: 52.31%

This composition balances domain-specific knowledge with general language capabilities.

For Agent-as-a-Judge model adaptation:

  • Training Corpus: Combines model-generated samples and outputs from powerful LLMs (e.g., DeepSeek-R1, Qwen3-235B).

  • Bad-quality samples: Automatically constructed via in-batch shuffling (e.g., randomly pairing outputs with mismatched user contexts for relevance).

  • Nuanced judgment samples: Human annotators provide labels across all evaluation dimensions and holistic S-A-B judgments.

    The online A/B tests were conducted on Taobao's homepage "Guess What You Like" scenario, implying a real-world, large-scale e-commerce dataset of user interactions, item attributes, and contextual information. The specific details of this proprietary Taobao dataset (size, number of users/items) are not explicitly given but are implied to be massive given the "industrial scale" context.

Data Sample Example (from Case 2 in the paper): a concrete example of the full-text user behavioral sequence that undergoes compression:

Original Full-Text Context (21,349 tokens):

  • User Attributes: 28-year-old female, resident of Beijing; astrological signs: Gemini (Western), Ox (Chinese zodiac).
  • User Behavioral History:
    • Purchased 3 years ago: women's autumn-winter knee-high boots; topstitched satin-textured dress.
    • Searched 2 years ago: premium aesthetic outerwear; retro Bluetooth mini speaker.
    • Clicked 1 year ago: Korean-style loose-fit sweater; pure cotton 4-piece bedding set.
    • (numerous additional interactions omitted due to space)

This data sample shows a user's demographic information and a timeline of their past interactions (purchased, searched, clicked items). This demonstrates the type of rich, sequential user behavior data that RecGPT-V2 processes, often requiring compression due to its length.

5.2. Evaluation Metrics

The paper uses distinct evaluation metrics for different components and for overall online A/B testing.

For Item Tag Prediction (Offline & RL Optimization)

  • Hit Rate at Top-30 (HR@30):
    • Conceptual Definition: Measures whether any of the top 30 predicted item tags (after being mapped to item categories) successfully match any of the user's actual interaction categories in their subsequent behavior. It's a common metric for retrieval or recommendation quality, indicating how often the desired items are present in a ranked list.
    • Mathematical Formula: HR@K=Number of users for whom at least one relevant item is in top K predictionsTotal number of users \mathrm{HR}@K = \frac{\text{Number of users for whom at least one relevant item is in top K predictions}}{\text{Total number of users}} In the context of item tag prediction, it implies: HR@30=1UuUI[cCgtu such that cftag2cat(Tpredu[1..30])] \mathrm{HR}@30 = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}\left[ \exists c \in C_{\mathrm{gt}}^u \text{ such that } c \in f_{\mathrm{tag2cat}}(\mathcal{T}_{\mathrm{pred}}^u[1..30]) \right]
    • Symbol Explanation:
      • HR@30\mathrm{HR}@30: Hit Rate at top 30.
      • U|U|: Total number of users.
      • uu: A specific user.
      • I[]\mathbb{I}[\cdot]: Indicator function, returns 1 if the condition is true, 0 otherwise.
      • CgtuC_{\mathrm{gt}}^u: The set of ground-truth item categories from user uu's actual subsequent interactions.
      • Tpredu[1..30]\mathcal{T}_{\mathrm{pred}}^u[1..30]: The top 30 predicted item tags for user uu.
      • ftag2cat()f_{\mathrm{tag2cat}}(\cdot): A pre-trained model that maps predicted tags to item categories.
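A small sketch of how HR@30 could be computed under these definitions is given below; the tag-to-category mapper and the toy data are hypothetical stand-ins for the pre-trained ftag2catf_{\mathrm{tag2cat}} model and real interaction logs.

```python
# Minimal sketch of HR@30 for item tag prediction. `tag_to_category` stands in
# for the pre-trained tag->category mapper f_tag2cat; the toy data is made up.
from typing import Callable, Dict, List, Set

def hit_rate_at_k(
    predicted_tags: Dict[str, List[str]],    # user -> ranked predicted tags
    ground_truth_cats: Dict[str, Set[str]],  # user -> categories from subsequent interactions
    tag_to_category: Callable[[str], str],   # hypothetical f_tag2cat
    k: int = 30,
) -> float:
    hits = 0
    for user, tags in predicted_tags.items():
        mapped = {tag_to_category(t) for t in tags[:k]}
        if mapped & ground_truth_cats.get(user, set()):
            hits += 1  # at least one top-k tag maps to a ground-truth category
    return hits / len(predicted_tags) if predicted_tags else 0.0

# Toy example: one of two users has a top-k tag mapping to a ground-truth category.
preds = {"u1": ["wool cardigan", "dumbbell set"], "u2": ["phone case"]}
truth = {"u1": {"Womenswear"}, "u2": {"Home Textiles"}}
mapper = lambda t: {"wool cardigan": "Womenswear", "dumbbell set": "Fitness", "phone case": "3C Accessories"}[t]
print(hit_rate_at_k(preds, truth, mapper))  # 0.5
```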

For Explanation Generation (Offline & RL Optimization)

  • Diversity:
    • Conceptual Definition: Measures the lexical variety and non-redundancy within a set of generated explanations for a given item. Higher diversity indicates that the explanations are less repetitive and offer varied perspectives.
    • Mathematical Formula: Diversityi=12K(K1)j=1K1k=j+1KROUGE.L(eji,eki) \mathrm{Diversity}_i = 1 - \frac{2}{K(K - 1)} \sum_{j=1}^{K - 1} \sum_{k=j + 1}^{K} \mathrm{ROUGE.L}(e_j^i, e_k^i)
    • Symbol Explanation:
      • Diversityi\mathrm{Diversity}_i: The diversity score for item ii.
      • KK: The number of generated explanations for item ii.
      • eji,ekie_j^i, e_k^i: The jj-th and kk-th explanations generated for item ii.
      • ROUGE.L(,)\mathrm{ROUGE.L}(\cdot, \cdot): The ROUGE-L metric, which measures the longest common subsequence similarity between two explanations.
  • Quality (Human Evaluation Acceptance Rate):
    • Conceptual Definition: Measures the percentage of generated explanations that are deemed "high-quality" by human annotators, based on satisfying all seven expanded evaluation criteria (Relevance, Factuality, Clarity, Safety, Timeliness, Informativeness, Attractiveness).
    • Mathematical Formula: Not explicitly provided as a formula but as a percentage: Quality Acceptance Rate=Number of explanations rated as high-qualityTotal number of explanations evaluated×100% \text{Quality Acceptance Rate} = \frac{\text{Number of explanations rated as high-quality}}{\text{Total number of explanations evaluated}} \times 100\%
    • Symbol Explanation:
      • "high-quality": Refers to explanations that meet all criteria in Table 8.

For Agent-as-a-Judge Evaluation

  • Accuracy:
    • Conceptual Definition: Measures how often the Agent-as-a-Judge framework's Superior (S) classification matches human annotations, which serve as the ground truth.
    • Mathematical Formula: Accuracy=Number of samples whose classification matches the human labelTotal number of samples \mathrm{Accuracy} = \frac{\text{Number of samples whose classification matches the human label}}{\text{Total number of samples}}
    • Symbol Explanation: Standard accuracy definition.
  • F1 Score:
    • Conceptual Definition: The harmonic mean of precision and recall. It's particularly useful when the classes are imbalanced (e.g., fewer Superior samples) or when both false positives and false negatives are important.
    • Mathematical Formula: F1=2×Precision×RecallPrecision+Recall \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} Where: Precision=TruePositivesTruePositives+FalsePositives \mathrm{Precision} = \frac{\mathrm{True Positives}}{\mathrm{True Positives} + \mathrm{False Positives}} Recall=TruePositivesTruePositives+FalseNegatives \mathrm{Recall} = \frac{\mathrm{True Positives}}{\mathrm{True Positives} + \mathrm{False Negatives}}
    • Symbol Explanation:
      • Precision\mathrm{Precision}: Proportion of positive identifications that were actually correct.
      • Recall\mathrm{Recall}: Proportion of actual positives that were identified correctly.
      • TruePositives\mathrm{True Positives}: Samples correctly identified as Superior.
      • FalsePositives\mathrm{False Positives}: Samples incorrectly identified as Superior.
      • FalseNegatives\mathrm{False Negatives}: Superior samples incorrectly identified as non-Superior.

For Online A/B Test (Taobao)

  • Short-Term Metrics:
    • IPV (Item Page Views):
      • Conceptual Definition: The number of times users view an item's detail page after seeing a recommendation. Indicates initial user interest and engagement.
      • Mathematical Formula: Not explicitly provided, but typically a count.
    • CTR (Click-Through Rate):
      • Conceptual Definition: The ratio of clicks on recommended items to the total number of impressions (times items were shown). Measures the relevance and attractiveness of recommendations.
      • Mathematical Formula: CTR=Number of ClicksNumber of Impressions×100% \mathrm{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}} \times 100\%
    • TV (Transaction Volume):
      • Conceptual Definition: The total monetary value of completed purchases resulting from recommendations. Directly measures the revenue generated.
      • Mathematical Formula: Not explicitly provided, but typically a sum of purchase values.
    • GMV (Gross Merchandise Value):
      • Conceptual Definition: The total transaction value, including both orders and returns. A broader measure of commercial activity.
      • Mathematical Formula: Not explicitly provided, but typically a sum of all transaction values.
    • ATC (Add-to-Cart):
      • Conceptual Definition: The number of times recommended items are added to a user's shopping cart. Reflects purchase intent, even if a transaction isn't immediately completed.
      • Mathematical Formula: Not explicitly provided, but typically a count.
  • Long-Term Metrics:
    • NER (Novelty Exposure Rate):
      • Conceptual Definition: The percentage of recommended items that users have not previously interacted with (clicked, purchased, etc.). Measures the system's ability to expose users to new and diverse content, mitigating filter bubble effects.
      • Mathematical Formula: NER=Number of novel recommended itemsTotal number of recommended items×100% \mathrm{NER} = \frac{\text{Number of novel recommended items}}{\text{Total number of recommended items}} \times 100\%
    • LT-14 / LT-30 (User Retention Rates):
      • Conceptual Definition: The percentage of users who remain active on the platform (e.g., logging in, interacting with content) 14 days and 30 days after being exposed to the new recommendation system. Quantifies sustained user engagement and platform health.
      • Mathematical Formula: LTX=Number of active users on day XNumber of users exposed to system on day 0×100% \mathrm{LT-X} = \frac{\text{Number of active users on day X}}{\text{Number of users exposed to system on day 0}} \times 100\%

5.3. Baselines

  • RecGPT-V1: Serves as the primary control group and baseline for comparison in all experiments, both offline (e.g., tag prediction accuracy) and online (A/B tests). This allows for a direct assessment of the improvements introduced by RecGPT-V2's innovations.
  • Qwen-14B Base Model: Used as the base model for constructing RecGPT-V2 variants in offline experiments (e.g., tag prediction). This baseline helps validate the necessity of domain adaptation and fine-tuning.
  • SFT (Supervised Fine-Tuning) variant: A variant of RecGPT-V2 that only undergoes supervised fine-tuning, without the reinforcement learning optimization. This helps isolate the contribution of RL.
  • GRPO (SUM) variant: A variant using Group Relative Policy Optimization but with naive sum-based aggregation of rewards. This serves as a baseline to demonstrate the superiority of Constrained Reward Shaping (CRS) in handling multi-reward conflicts.
  • RecGPT-V2 (Point-wise RM) variant: A variant using reinforcement learning with a reward model trained using point-wise methods, contrasting with the list-wise approach of the full RecGPT-V2.

6. Results & Analysis

6.1. Core Results Analysis

RecGPT-V2 demonstrates significant improvements across computational efficiency, generation quality (tag prediction, explanation diversity), human alignment in evaluation, and real-world online performance.

Computational Efficiency (from Figure 4): RecGPT-V2 achieves substantial gains in computational efficiency compared to RecGPT-V1, validating the effectiveness of Hybrid Representation Inference and Infrastructure Engineering Optimization.

The following are the results from Figure 4 of the original paper:

Figure 4 | Computational efficiency comparison. RecGPT-V2 reaches 17.70% MFU versus 11.56% for RecGPT-V1; with RecGPT-V1 normalized to 1, RecGPT-V2 records 69.30 on QPS (Prefill) and 7.35 on TPS (Decode).

  • MFU (Model FLOPs Utilization): RecGPT-V2's MFU is 17.70%, a significant improvement over RecGPT-V1's 11.56%. This represents a 53.11% relative improvement in MFU, indicating much more efficient GPU utilization.
  • QPS (Prefill): With RecGPT-V1 normalized to 1, RecGPT-V2 reaches 69.30 in the prefill stage, a dramatic increase in throughput for processing input prompts.
  • TPS (Decode): RecGPT-V2 shows a 7.35x improvement in TPS during the decode stage relative to RecGPT-V1. Together, these results support the reported 60% reduction in GPU consumption and enable scalable deployment.

Item Tag Prediction Accuracy (from Table 2): The Constrained Reinforcement Optimization dramatically improves item tag prediction.

The following are the results from Table 2 of the original paper:

| Metric | RecGPT-V1 | RecGPT-V2 (Base) | RecGPT-V2 (SFT) | RecGPT-V2 (GRPO, SUM) | RecGPT-V2 (GRPO, CRS) |
| --- | --- | --- | --- | --- | --- |
| HR@30 | 26.29% | 23.08% | 29.20% | 27.38% | 32.60% |
  • Base Model: The Qwen-14B Base model, without domain adaptation, performs worse than RecGPT-V1 (23.08% vs. 26.29%), highlighting the necessity of domain-specific training.
  • SFT: Supervised fine-tuning significantly boosts performance to 29.20% HR@30, surpassing RecGPT-V1 by 2.91 percentage points (29.20 - 26.29). This demonstrates the effectiveness of persona-aligned supervision.
  • GRPO (SUM): Naive sum-based reward aggregation in GRPO (27.38%) performs worse than SFT, confirming that multi-reward conflicts can degrade performance.
  • GRPO (CRS): The full RecGPT-V2 with Constrained Reward Shaping (CRS) achieves the highest HR@30 at 32.60%. This is a 3.40-percentage-point improvement over SFT (32.60 - 29.20) and a substantial 6.31-percentage-point improvement over RecGPT-V1 (32.60 - 26.29), validating the efficacy of CRS in stable multi-objective optimization.

Explanation Generation Performance (from Table 3): Meta-Prompting and Preference-Aware Reinforcement Learning enhance explanation quality.

The following are the results from Table 3 of the original paper:

| Method | Diversity | Quality (%) |
| --- | --- | --- |
| RecGPT-V1 | 0.631 | 36.03 |
| RecGPT-V2 | 0.677 | 40.73 |
  • Diversity: RecGPT-V2 improves explanation diversity from 0.631 to 0.677, a relative increase of 7.3%.
  • Quality (%): The human-rated explanation acceptance rate increases from 36.03% to 40.73%, a relative improvement of 13.0% ((40.73-36.03)/36.03 * 100%). These gains validate the effectiveness of the meta-prompting framework and preference-aware RL for dynamic and engaging explanations.

Agent-as-a-Judge Human-Alignment (from Table 4): The Agent-as-a-Judge framework demonstrates superior alignment with human judgments.

The following are the results from Table 4 of the original paper:

| Task | Model | Accuracy (V1) | Accuracy (V2) | F1 (V1) | F1 (V2) |
| --- | --- | --- | --- | --- | --- |
| Item Tag Prediction | GPT5-mini | 0.7694 | 0.7704 | 0.7499 | 0.7535 |
| Item Tag Prediction | Qwen3-Base | 0.7844 | 0.7864 | 0.7991 | 0.8051 |
| Item Tag Prediction | Qwen3-SFT | 0.8210 | 0.8248 | 0.8095 | 0.8228 |
| Explanation Generation | GPT5-mini | 0.4481 | 0.4548 | 0.5673 | 0.5424 |
| Explanation Generation | Qwen3-Base | 0.3423 | 0.2764 | 0.0898 | 0.0904 |
| Explanation Generation | Qwen3-SFT | 0.6885 | 0.7006 | 0.6787 | 0.7307 |
  • Item Tag Prediction: RecGPT-V2 consistently shows higher accuracy and F1 scores across different underlying models (GPT5-mini, Qwen3-Base, Qwen3-SFT). For Qwen3-SFT, accuracy improves from 0.8210 to 0.8248 (+0.38 pp), and F1 improves from 0.8095 to 0.8228 (+1.33 pp).
  • Explanation Generation: RecGPT-V2 also generally outperforms RecGPT-V1. For Qwen3-SFT, accuracy increases from 0.6885 to 0.7006 (+1.21 pp), and F1 significantly jumps from 0.6787 to 0.7307 (+5.20 pp). The exception is Qwen3-Base for explanation generation, where V2 accuracy decreases, though F1 slightly increases. These results validate that the multi-step, process-oriented evaluation of Agent-as-a-Judge aligns more closely with human standards.

Impact of Reward Model Training Strategy (from Table 5): The listwise learning-to-rank approach for Judge-as-a-Reward is crucial.

The following are the results from Table 5 of the original paper:

| Method | HR@30 (Tag) | Quality (Explanation) |
| --- | --- | --- |
| RecGPT-V1 | 26.29% | 36.03% |
| RecGPT-V2 (Point-wise RM) | 31.24% | 37.64% |
| RecGPT-V2 (List-wise RM) | 32.60% | 40.73% |
  • Tag Prediction (HR@30): RecGPT-V2 with list-wise RM achieves 32.60%, outperforming RecGPT-V1 (26.29%, +24.1% relative improvement) and point-wise RM (31.24%, +4.4% relative improvement).
  • Explanation Quality: RecGPT-V2 with list-wise RM achieves 40.73%, outperforming RecGPT-V1 (36.03%, +13.0% relative improvement) and point-wise RM (37.64%, +8.2% relative improvement). This confirms that modeling hierarchical preference ordering with listwise learning-to-rank provides more discriminative optimization signals for RL.

6.2. Online A/B Test Results (from Table 6)

Online A/B tests on Taobao's "Guess What You Like" scenario over two weeks demonstrate that RecGPT-V2 consistently outperforms RecGPT-V1 across both item and feed scenarios, validating its commercial viability.

The following are the results from Table 6 of the original paper:

| Scenario | IPV | CTR | TV | GMV | ATC | NER | LT-14 | LT-30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Item | +3.64 | +3.01 | +2.11 | +3.39 | +3.47 | +11.46 | — | — |
| Feed | +1.29 | +1.50 | +0.34 | +1.53 | +0.99 | +4.49 | +0.04 | +0.05 |

Note: — indicates metrics not reported in the item scenario. The values in the table are relative percentage improvements over RecGPT-V1. IPV, CTR, TV, GMV, and ATC measure short-term engagement, while NER, LT-14, and LT-30 measure long-term retention.

Item Scenario:

  • Short-Term Engagement:
    • IPV: +3.64%
    • CTR: +3.01%
    • TV: +2.11%
    • GMV: +3.39%
    • ATC: +3.47%
      These robust gains indicate that RecGPT-V2's enhanced intent understanding directly translates into increased user interaction and transaction value when items are directly recommended.
  • Long-Term Retention:
    • NER: +11.46%
    • LT-14, LT-30 are not applicable/reported for the item scenario. The significant increase in Novelty Exposure Rate (NER) suggests improved recommendation diversity and exploration effectiveness, mitigating filter bubble effects.

Feed Scenario:

  • Short-Term Engagement:
    • IPV: +1.29%
    • CTR: +1.50%
    • TV: +0.34%
    • GMV: +1.53%
    • ATC: +0.99%
      Consistent positive trends in CTR, GMV, and ATC indicate improved recommendation relevance even in a mixed-content feed, although the percentage gains are smaller than in the item scenario, likely due to the more complex and diverse nature of feed content.
  • Long-Term Retention:
    • NER: +4.49%
    • LT-14: +0.04%
    • LT-30: +0.05% The NER gain further confirms improved novelty. While user retention rates (LT-14, LT-30) show modest absolute gains, these are significant in the context of long-term user behavior on large platforms, representing meaningful progress in sustained user engagement.

Overall Summary: RecGPT-V2's comprehensive innovations lead to tangible improvements across various online metrics, affirming its effectiveness and practical utility in a real-world industrial setting. The particularly strong gains in NER highlight its ability to foster user exploration, a key aspect of healthy recommender systems.

6.3. Case Study

A case study (Figure 9) further illustrates RecGPT-V2's strengths in dynamic intent understanding and context-aware recommendation generation.

The following figure (Figure 9 from the original paper) illustrates a real-world case:

Figure 9 | Case study. The figure shows multiple recommended items and their descriptions, including knitwear, children's products, and dumbbells, with the Decision Arbiter highlighted as part of the recommendation pipeline.

  • User Profile: Female, 35, Tianjin, with compressed behavioral history.

  • Environmental Context: Cooling weather, upcoming Mid-Autumn Festival and Halloween.

  • Global Planner: Decomposes these signals into three personas: Ladies' Fashion Expert, Kids' Products Expert, and Health Expert. This shows the HMAS effectively processing hybrid context and decomposing intent.

  • Distributed Experts:

    • Ladies' Fashion Expert: Predicts "Wool Blend Cardigan" (responding to cooling weather).
    • Kids' Products Expert: Generates "Kids' Hydrating Lotion" (for dry autumn climate) and "Kids' Halloween Costume" (anticipating holiday). This demonstrates temporal adaptation.
    • Health Expert: Recommends "Adjustable Dumbbell Set" (aligning weather-driven wellness with historical fitness interests).
  • Decision Arbiter: Synthesizes these expert predictions.

  • Contextually Adaptive Explanations: Final items are paired with explanations like "Wrapped in Autumn Sunshine," "Quench Your Little One's Skin," and "All You Need is Dumbbells," generated by the meta-prompting framework. These explanations are personalized and context-aware.

    This case study vividly validates RecGPT-V2's core capabilities: integrating real-time environmental signals into hierarchical multi-agent reasoning for diverse intent coverage and precise situational adaptation, moving beyond static behavioral pattern matching.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively introduces RecGPT-V2, a sophisticated agentic framework that significantly advances LLM-powered recommender systems. It systematically addresses the four major limitations of RecGPT-V1: computational inefficiency, lack of explanation diversity, limited generalization, and sub-optimal evaluation. Through its four key innovations—Hierarchical Multi-Agent System with Hybrid Representation Inference, Meta-Prompting for dynamic explanations, Constrained Reinforcement Learning, and an Agentic Judge Framework—RecGPT-V2 achieves remarkable technical and commercial success. The framework not only reduces GPU consumption by a substantial 60% and improves core recommendation metrics like recall and tag prediction accuracy but also enhances user experience through diverse and contextually adaptive explanations. Validated through extensive online A/B tests on Taobao, RecGPT-V2 demonstrates significant lifts in key business metrics (e.g., +2.98% CTR, +11.46% NER), establishing the practical viability and scalability of deploying LLM-based intent reasoning in real-world industrial environments. It successfully bridges the gap between cognitive exploration in research and concrete industrial utility.

7.2. Limitations & Future Work

The authors themselves point out a key direction for future work:

  • End-to-End Joint Optimization of Multi-Agent Collaboration with Reinforcement Learning: The paper suggests exploring how to jointly optimize multi-agent collaboration in an end-to-end fashion using reinforcement learning techniques. Currently, the Planner, Experts, and Arbiter components might be optimized somewhat independently or sequentially. An end-to-end optimization could potentially lead to even greater synergy and emergent behaviors among agents, further enhancing recommendation performance and user experience by allowing the entire system to learn from global rewards.

7.3. Personal Insights & Critique

This paper presents a highly impressive and comprehensive approach to integrating LLMs into recommender systems at an industrial scale. The rigor in addressing practical challenges like computational efficiency and real-world deployability is particularly commendable.

  • Bridging Research and Industry: The most significant contribution of RecGPT-V2 is its successful deployment and validation on Taobao. This provides strong evidence that advanced LLM-based reasoning, often seen as computationally intensive, can be made feasible and commercially impactful in a high-traffic, low-latency environment. The detailed Infrastructure Engineering Optimization and Hybrid Representation Inference are critical for this.
  • Holistic Problem Solving: Instead of focusing on a single aspect, the paper tackles multiple facets of LLM-powered recommendations—efficiency, generation quality, and evaluation—in an integrated manner. The Flywheel Effect idea is a powerful conceptual framework for continuous improvement.
  • Sophisticated Agentic Design: The Hierarchical Multi-Agent System is a clever way to manage the complexity and redundancy of LLM reasoning. Decomposing user intent into specialized personas for Distributed Experts, orchestrated by a Global Planner and consolidated by a Decision Arbiter, is an elegant solution to ensure both breadth and focus.
  • Addressing Reward Conflicts: The Constrained Reward Shaping mechanism is a crucial methodological innovation. Multi-objective optimization is notoriously difficult due to reward conflicts, and providing a stable method to achieve diverse objectives without sacrificing primary performance is a significant step forward, applicable beyond recommender systems.
  • Human-Aligned Evaluation: The Agent-as-a-Judge framework, with its multi-step, process-oriented evaluation, marks a move towards more trustworthy and interpretable AI systems. Moving beyond simple outcome prediction to mimic human reasoning in evaluation is vital for building systems that truly meet human quality standards.

Potential Issues/Areas for Improvement:

  • Interpretability of Agent Interactions: While the HMAS structure is beneficial for organization, the internal decision-making processes of the Global Planner (how it generates personas), Distributed Experts (how they generate tags based on persona), and Decision Arbiter (how it synthesizes) still rely on LLM reasoning. Further work could explore even more transparent and interpretable mechanisms for agent interaction and decision aggregation to provide deeper insights into the "why" behind recommendations.

  • Cost of GPT-4/Powerful LLMs: The methodology mentions using GPT-4 for QA pair generation and for identifying relevant item categories in SFT. While this ensures high-quality supervision, relying on such powerful (and typically proprietary/API-based) LLMs might be a significant cost factor in the data generation pipeline, especially if the volume of training data required is massive. The paper emphasizes the efficiency of the inference model, but the training data generation cost could be a practical limitation for other researchers or smaller companies without access to such budgets.

  • Generalizability of Hybrid Representation: While the adaptor network is designed for generalization, the effectiveness of atomized entity compression across vastly different domains (e.g., medical data, legal documents) and languages might need further investigation. The compression ratio and semantic preservation could vary.

  • Long-term A/B Test Horizon: While 14-day and 30-day retention are good indicators, long-term user behavior can sometimes take longer to manifest significant shifts, especially for metrics related to loyalty or habit formation. Longer-duration A/B tests (e.g., 3-6 months) could provide even more robust evidence for sustained impact, though these are challenging in industrial settings.

    Overall, RecGPT-V2 is a landmark paper that provides a clear blueprint for building highly efficient, effective, and intelligent LLM-powered recommender systems. Its innovations are likely to influence future research and development in this rapidly evolving field.
