RecGPT-V2 Technical Report
TL;DR Summary
RecGPT-V2 introduces four innovations to enhance intent reasoning and efficiency, reducing GPU consumption by 60% and improving generalization and evaluation consistency. Online tests show significant performance gains in metrics such as CTR and IPV, indicating its industrial applicability.
Abstract
Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
RecGPT-V2 Technical Report
1.2. Authors
The core contributors are: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, and Zhujin Gao. Additional contributors include: Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin Yuan, Yuliang Yan, and Zile Zhou, with Renmin University of China also credited among the contributors. Authors are listed alphabetically by first name.
1.3. Journal/Conference
This is a technical report published as a preprint on arXiv. The arXiv platform serves as a repository for preprints of scientific papers, making research widely accessible before or in parallel with peer review. The presence of a technical report on arXiv indicates active research and development in the field of recommender systems and large language models.
1.4. Publication Year
2025
1.5. Abstract
Large language models (LLMs) have significantly advanced recommender systems by shifting from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 pioneered this by integrating LLM-based reasoning into user interest mining and item tag prediction, it faced four main limitations: (1) computational inefficiency and cognitive redundancy in its multi-route architecture, (2) insufficient explanation diversity due to fixed-template generation, (3) limited generalization from supervised learning, and (4) simplistic, outcome-focused evaluation misaligned with human standards.
To overcome these, RecGPT-V2 introduces four innovations. First, a Hierarchical Multi-Agent System (HMAS), combined with Hybrid Representation Inference, restructures intent reasoning through coordinated collaboration and context compression. This reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, increasing explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, resulting in a +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, enhancing human preference alignment. Online A/B tests on Taobao show significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 demonstrates both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging cognitive exploration and industrial utility.
1.6. Original Source Link
https://arxiv.org/abs/2512.14503 (Preprint status) https://arxiv.org/pdf/2512.14503v1.pdf (PDF link)
2. Executive Summary
2.1. Background & Motivation
Recommender systems have evolved significantly, moving from traditional methods like matrix factorization and deep neural networks to more sophisticated approaches. However, a fundamental limitation persists: most systems primarily rely on historical behavioral patterns and log-fitting objectives, optimizing for behavioral pattern matching rather than explicitly reasoning about the underlying user intent. This results in recommendations that might lack transparency, control, and deep personalization.
The paper aims to solve the problem of making recommender systems more intelligent, efficient, and human-aligned by leveraging the reasoning capabilities of Large Language Models (LLMs). While RecGPT-V1 was a pioneer in integrating LLM-based reasoning for user interest mining and item tag prediction, it suffered from several critical drawbacks that hindered its scalability, efficiency, and overall effectiveness in real-world industrial deployment:
- Computational Inefficiency and Cognitive Redundancy: RecGPT-V1 used a multi-route architecture where multiple LLMs independently processed the same user behavior sequences, leading to duplicated representation encoding and overlapping reasoning outputs. This was computationally expensive and inefficient.
- Insufficient Explanation Diversity: Its reliance on fixed prompt templates for generating recommendation explanations resulted in generic, repetitive, and context-insensitive outputs, failing to capture the dynamic nature of user needs.
- Limited Generalization: Training based on supervised learning on static data constrained the model's ability to adapt to dynamically evolving user needs and complex, multi-objective generation tasks in real-world scenarios.
- Simplistic Outcome-Focused Evaluation: RecGPT-V1's LLM-as-a-Judge evaluation framework focused solely on direct outcome prediction, overlooking the multi-dimensional, step-by-step reasoning that human evaluators employ, leading to suboptimal alignment with human quality standards.

The importance of addressing these issues stems from the need to move beyond simple pattern matching to truly understand and cater to explicit user intent, enabling more transparent, personalized, and engaging recommendation experiences at scale in industrial settings like Taobao.
2.2. Main Contributions / Findings
RecGPT-V2 addresses the identified limitations of RecGPT-V1 with four key innovations, significantly advancing LLM-powered recommender systems:
- Agentic Intent Reasoning:
  - Innovation: Introduces a Hierarchical Multi-Agent System (HMAS) combined with Hybrid Representation Inference. HMAS orchestrates specialized expert agents under a Global Planner and Decision Arbiter to perform coordinated intent reasoning, eliminating cognitive duplication. Hybrid Representation Inference compresses user-behavior contexts using atomized entity encoding.
  - Findings: Achieves a 60% reduction in GPU consumption and improves exclusive recall from 9.39% to 10.99%. It delivers a 53.11% improvement in Model FLOPs Utilization (MFU) and significant throughput gains (QPS and TPS improvements).
- Dynamic Explanation Generation:
  - Innovation: Employs a Meta-Prompting framework that dynamically generates contextually adaptive prompts for personalized explanation generation. This two-stage process (style synthesis, then style-conditioned generation) overcomes the limitations of fixed templates.
  - Findings: Improves explanation diversity by +7.3% and raises human-rated explanation acceptance by +13.0%, demonstrating superior user engagement.
- Constrained Reinforcement Learning Optimization:
  - Innovation: Implements constrained reinforcement learning with a novel Constrained Reward Shaping (CRS) mechanism. This mitigates multi-reward conflicts by treating secondary objectives (like diversity) as conditional constraints on primary objectives (like accuracy), ensuring stable optimization.
  - Findings: Achieves a +24.1% improvement in item tag prediction (HR@30) and +13.0% in explanation acceptance, outperforming naive sum-based reward aggregation.
- Process-Oriented Multi-Step Evaluation (Agentic Judge Framework):
  - Innovation: Introduces an Agent-as-a-Judge framework that decomposes assessment into multi-step reasoning using Multi-Dimension Sub-Evaluators and a Senior Reviewer that issues three-tier judgments (S-A-B). This framework is complemented by Judge-as-a-Reward, which distills agent judgments into dense reward signals for RL.
  - Findings: Improves human preference alignment, achieving higher accuracy and F1 scores in identifying Superior (S) quality for both item tag prediction and explanation generation compared to LLM-as-a-Judge baselines.

Overall Impact: Online A/B tests on Taobao confirm the commercial viability and technical feasibility of RecGPT-V2, showing significant gains: +2.98% CTR (Click-Through Rate), +3.71% IPV (Item Page Views), +2.19% TV (Transaction Volume), and +11.46% NER (Novelty Exposure Rate). RecGPT-V2 bridges the gap between cognitive exploration and industrial utility, establishing a new paradigm for LLM-powered recommenders at scale.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Recommender Systems (RS): Software systems that suggest items (products, movies, articles, etc.) to users based on their preferences, past behaviors, and other contextual information. Their goal is to help users discover relevant content and enhance engagement.
  - Traditional RS: Often rely on statistical methods (matrix factorization) or deep learning (deep neural networks) to identify patterns in user-item interactions (e.g., collaborative filtering, content-based filtering). These systems typically focus on behavioral pattern matching, learning from implicit signals like clicks, purchases, or ratings.
  - LLM-powered RS: A newer paradigm that integrates Large Language Models (LLMs) to move beyond implicit pattern matching. LLMs provide capabilities for explicit intent reasoning, semantic understanding, and natural language generation, allowing the system to understand why a user might be interested in something and to explain recommendations in human-like language.
- Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They excel at understanding, generating, and reasoning with human language.
  - Transformer Architecture: A neural network architecture introduced in 2017 that revolutionized sequence processing tasks. It relies heavily on self-attention mechanisms to weigh the importance of different parts of the input sequence when processing each element.
  - Tokens: The basic units of text that an LLM processes. A token can be a word, a subword, or even a single character, depending on the tokenizer used.
  - Input/Output Lengths ($L_{\mathrm{in}}$, $L_{\mathrm{out}}$): The number of tokens in the input prompt and the generated response, respectively.
  - Prefill Phase: In LLM inference, the initial phase where the input prompt tokens are processed to compute their representations and populate the key-value (KV) cache. This phase is typically compute-intensive because attention cost grows quadratically with input length ($O(L_{\mathrm{in}}^2)$).
  - Decode Phase: The subsequent phase where the LLM generates tokens one by one, autoregressively. Each new token depends on previously generated tokens and the KV cache. This phase is often memory-intensive due to frequent access to the KV cache, with per-token cost that scales with the accumulated context length.
- Model FLOPs Utilization (MFU): A metric that measures how efficiently a hardware accelerator (like a GPU) is being used. It compares the actual floating-point operations performed to the theoretical maximum, indicating the percentage of peak performance achieved. Higher MFU means better hardware utilization.
- Queries Per Second (QPS): A measure of throughput, indicating the number of requests (e.g., LLM prompts processed) a system can handle per second. For the prefill stage, it indicates how many input sequences can be processed.
- Tokens Per Second (TPS): Another throughput metric, specifically for token generation. For the decode stage, it indicates how many output tokens can be generated per second.
- Embeddings: Dense vector representations of text, items, or users. These vectors capture semantic meaning, where items with similar meanings or properties are closer in the embedding space. Embedding models (e.g., BGE, Qwen3-Embedding) are specialized neural networks trained to produce these representations.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to maximize cumulative reward over time.
  - Policy: The strategy that an RL agent uses to determine its next action based on the current state.
  - Reward Function: Defines the goal of the RL agent by assigning numerical values to outcomes.
  - Policy Optimization: The process of adjusting the agent's policy to improve its performance (i.e., maximize rewards).
- Supervised Fine-Tuning (SFT): A common technique in LLM training where a pre-trained LLM is further trained on a smaller, task-specific dataset with labeled examples. This adapts the model's general knowledge to a particular downstream task.
- Prompt Engineering: The art and science of crafting effective prompts (inputs) for LLMs to guide their behavior and elicit desired outputs. Meta-Prompting involves dynamically generating these prompts or instructions.
- A/B Testing: A randomized controlled experiment used to compare two versions of a system (A and B) to determine which one performs better. In recommender systems, A/B tests are crucial for evaluating new features or models on real user traffic.
3.2. Previous Works
The paper builds upon RecGPT-V1 and references several foundational and contemporary works in LLMs and recommender systems:
- RecGPT-V1 (Yi et al., 2025): The direct predecessor. RecGPT-V1 was a paradigm-shifting framework that first integrated LLMs into recommender systems for user interest mining and item tag prediction, transforming traditional pattern matching into an intent-centric recommendation objective. It demonstrated promising online performance but suffered from the four limitations (computational inefficiency, fixed explanations, limited generalization, simplistic evaluation) that RecGPT-V2 aims to overcome.
- Matrix Factorization (Koren et al., 2009): A foundational technique in recommender systems. It decomposes the user-item interaction matrix into two lower-rank matrices, representing latent features for users and items, which are then multiplied to predict missing ratings or preferences.
- Deep Neural Networks (Tang et al., 2025): The evolution of recommender systems to use deep learning models for more complex pattern recognition and representation learning.
- Transformer-based LLMs (Achiam et al., 2023 for GPT-4; Yang et al., 2025 for Qwen3-Embedding): The core technology enabling LLM-powered reasoning. The Transformer architecture's efficiency and ability to handle long-range dependencies are key.
- Embedding Models (Xiao et al., 2023 for BGE; Zhang et al., 2025c for Qwen3-Embedding): Used for encoding entity information (item descriptions, user query histories) into compact vector representations. These models are crucial for atomized entity compression.
- In-Context Learning (Brown et al., 2020; Dong et al., 2024): A property of large language models where they can learn new tasks or adapt their behavior from examples provided directly within the prompt, without explicit weight updates. GPT-4's use for dynamic QA pair generation leverages this.
- OneRec-Think (Liu et al., 2025), LC-Rec (Zheng et al., 2024), CoLLM (Zhang et al., 2025b): Other contemporary works that aim to integrate collaborative embeddings into LLMs for recommendation. RecGPT-V2 differentiates its Atomized Entity Compression approach by using a lightweight adaptor network instead of directly inserting new tokens into the LLM's vocabulary, claiming better parameter efficiency, generalization, and modularity.
- Disaggregated Prefill-Decode Architectures (Liu et al., 2024; Zhong et al., 2024): Prior work that inspired RecGPT-V2's infrastructure optimizations for LLM serving, acknowledging the different computational demands of the prefill and decode phases.
- Group Relative Policy Optimization (GRPO) (Liu et al., 2024; Shao et al., 2024): The specific reinforcement learning algorithm used for constrained reinforcement optimization in RecGPT-V2. It optimizes the policy by comparing against a group of outputs sampled from an old policy and includes a KL-divergence penalty for stability.
- Poly-Encoder (Humeau et al., 2019): A model architecture used for efficient retrieval in large-scale systems, specifically for multi-interest user encoding in RecGPT-V2. It uses multiple context codes to aggregate user behavioral embeddings into distinct interest vectors.
- Mainstream Advances in Prompt Engineering (Suzgun and Kalai, 2024; Zhang et al., 2023): General trends in prompt design that inform RecGPT-V2's Meta-Prompting framework, moving towards dynamic and adaptive prompt generation.
- Agent-as-a-Judge (Gou et al., 2025; Zhang et al., 2025a; Zhuge et al., 2024): Recent work that motivates RecGPT-V2's Agent-as-a-Judge framework, moving beyond one-shot LLM-as-a-Judge to more structured, multi-step agentic evaluation.
- DeepSeek-R1 (Guo et al., 2025), Qwen3-235B (Yang et al., 2025): Powerful LLMs used to generate training samples for the Agent-as-a-Judge framework, ensuring high-quality and diverse data for adaptation.
3.3. Technological Evolution
The evolution of recommender systems can be broadly categorized:
- Early Systems (e.g., Matrix Factorization, 2000s): Focused on collaborative filtering and content-based methods, identifying implicit patterns in user-item interaction data. These were effective but lacked deep semantic understanding and interpretability.
- Deep Learning Era (e.g., Deep Neural Networks, 2010s-early 2020s): Applied neural networks to learn more complex, non-linear relationships and rich representations from diverse data (e.g., item features, user profiles). This improved accuracy but still largely operated on implicit signals and behavioral pattern matching.
- LLM-Powered Era (RecGPT-V1, 2020s-present): Integrated Large Language Models to bring explicit intent reasoning and semantic understanding to the forefront. This allowed systems to process natural language queries, understand nuances of user interests, generate human-readable explanations, and decompose complex recommendation tasks. RecGPT-V1 pioneered this shift.

RecGPT-V2 represents the next generation within the LLM-powered era. It refines the initial LLM integration by tackling the practical challenges of scalability, efficiency, dynamism, and robust evaluation that arose from RecGPT-V1. It moves towards more sophisticated agentic architectures, dynamic content generation, and human-aligned evaluation, pushing the boundaries of what LLMs can achieve in industrial-scale recommender systems.
3.4. Differentiation Analysis
RecGPT-V2 differentiates itself from its predecessor, RecGPT-V1, and other contemporary methods primarily by addressing the four core limitations:
- Computational Efficiency & Redundancy:
  - RecGPT-V1: Used isolated multi-route LLM channels, each processing the full user behavior sequence, leading to significant computational overhead and cognitive redundancy (13.46% inter-route duplication).
  - RecGPT-V2: Introduces Hybrid Representation Inference (compressing 32K tokens to 11K) and a Hierarchical Multi-Agent System (HMAS) in which a Global Planner orchestrates Distributed Experts and a Decision Arbiter. This eliminates redundant full-sequence encoding and overlapping reasoning.
  - Differentiation from other compression methods (e.g., OneRec-Think, LC-Rec, CoLLM): RecGPT-V2 uses a lightweight adaptor network to project entity embeddings into the LLM's input space, keeping the LLM backbone frozen. This offers superior parameter efficiency, generalization, and modularity compared to methods that directly insert new tokens into the LLM's vocabulary and require full model fine-tuning.
- Explanation Diversity:
  - RecGPT-V1: Relied on fixed prompt templates, producing homogeneous and context-insensitive explanations.
  - RecGPT-V2: Implements a Meta-Prompting framework that dynamically generates contextually adaptive prompts by synthesizing user interests, item attributes, and real-time contextual signals (e.g., weather, seasonal events).
  - Differentiation: This shift from static templates to dynamic, two-stage (style synthesis, then style-conditioned generation) prompting significantly enhances creative capacity and contextual relevance, yielding more diverse and engaging explanations.
- Generalization & Multi-Objective Optimization:
  - RecGPT-V1: Used supervised fine-tuning (SFT) on static data, which limited generalization to dynamic, multi-objective, and multi-constraint real-world scenarios.
  - RecGPT-V2: Employs constrained reinforcement learning with a novel Constrained Reward Shaping (CRS) mechanism. This explicitly addresses multi-reward conflicts by treating secondary objectives as hard constraints on the primary objective, enabling stable optimization within feasible domains.
  - Differentiation: This moves beyond naive sum-based aggregation of rewards, which often leads to suboptimal solutions due to conflicting gradients, ensuring better balance and stability in achieving diverse objectives.
- Evaluation Quality & Human Alignment:
  - RecGPT-V1: Used LLM-as-a-Judge for one-shot outcome evaluation, directly predicting quality scores without decomposing the assessment into intermediate reasoning steps. This resulted in suboptimal alignment with human judgment.
  - RecGPT-V2: Introduces an Agent-as-a-Judge framework that mimics human cognitive evaluation through hierarchical multi-agent reasoning (dimension-specific sub-evaluators and a Senior Reviewer issuing three-tier S-A-B judgments). It also uses Judge-as-a-Reward to distill these judgments into continuous reward signals for RL.
  - Differentiation: This process-oriented approach provides more accurate, interpretable, and human-aligned quality judgments, overcoming the limitations of black-box, outcome-focused evaluations. The Judge-as-a-Reward mechanism further makes this sophisticated evaluation computationally feasible for continuous RL optimization.
4. Methodology
RecGPT-V2 introduces a comprehensive framework designed to overcome the limitations of its predecessor. The methodology is structured around four key innovations: Agentic Intent Reasoning, Dynamic Explanation Generation, Constrained Reinforcement Optimization, and an Agentic Judge Framework. Each of these innovations contributes to a more efficient, diverse, generalizable, and human-aligned recommender system.
The overall architecture of RecGPT-V2 is illustrated in Figure 2. The system processes lifelong user behaviors, compresses them into hybrid contextual representations (§2.1.1), which then feed into a Hierarchical Multi-Agent System (HMAS) for intent decomposition and item tag prediction (§2.2). The predicted tags are used to retrieve items, which are then augmented with personalized explanations (§3). To ensure and continuously improve generation quality, an Agent-as-a-Judge evaluation framework (§4.1) assesses generation tasks, and a Judge-as-a-Reward distillation method (§4.2) converts these assessments into optimization reward signals.
4.1. Agentic Intent Reasoning
The goal of Agentic Intent Reasoning is to mitigate the computational inefficiency and cognitive redundancy observed in RecGPT-V1's multi-route architecture. This is achieved by jointly improving representation compactness and cognitive coordination through two main components: Hybrid Representation Inference and a Hierarchical Multi-Agent System.
4.1.1. Hybrid Representation Inference
RecGPT-V1 suffered from computational and memory bottlenecks because user lifelong behaviors accounted for approximately 95.89% of its input tokens (averaging 32K tokens). This section introduces Hybrid Representation Inference to reduce this overhead.
Atomized Entity Compression
This technique aims to compress entity information (item descriptions, user query histories) into compact atomic units, significantly reducing context storage and computational overhead. It involves two stages:
Stage 1: Atomic Representation Encoding Pretrained embedding models (e.g., BGE, Qwen3-Embedding, TBstars-Embedding) are used to encode textual entity information into dense vector representations. A lightweight adaptor network then projects these embeddings into an atomic representation compatible with the LLM.
Given an entity with a textual description $T = (t_1, t_2, \dots, t_n)$ consisting of $n$ tokens, its embedding representation is obtained as:

$$\mathbf{e} = f_{\mathrm{emb}}(T) \in \mathbb{R}^{d}$$

Where:

- $\mathbf{e}$: The embedding representation of the entity.
- $f_{\mathrm{emb}}$: The embedding function (e.g., BGE, Qwen3-Embedding) that maps variable-length textual sequences to fixed-dimensional dense vectors.
- $T$: The textual description of the entity, represented as a sequence of $n$ tokens.
- $d$: The dimension of the embedding vector.
To bridge the gap between the embedding space and the LLM's language space, a lightweight adaptor network projects $\mathbf{e}$ into an atomic representation compatible with the LLM input:

$$\mathbf{h} = f_{\mathrm{adapt}}(\mathbf{e}) = W_2\,\mathrm{ReLU}(W_1 \mathbf{e} + b_1) + b_2 \in \mathbb{R}^{d_{\mathrm{LLM}}}$$

Where:

- $\mathbf{h}$: The atomic representation of the entity, denoted as [entity] in context. This single unit replaces the original multi-token textual description.
- $f_{\mathrm{adapt}}$: The lightweight adaptor network, a two-layer feed-forward network with a ReLU activation function.
- $W_1 \in \mathbb{R}^{d_h \times d}$, $W_2 \in \mathbb{R}^{d_{\mathrm{LLM}} \times d_h}$: Projection matrices.
- $b_1$, $b_2$: Bias terms.
- $d_h$: The dimension of the hidden layer in the adaptor network.
- $d_{\mathrm{LLM}}$: The hidden dimension of the LLM, ensuring compatibility.
This atomized entity compression significantly reduces token length. For example, a 12-token Chinese product title can be compressed into a single atomic representation, achieving a 12:1 compression ratio. A user profile with 21,349 tokens can be reduced to 5,158 tokens (a 76% reduction) by replacing item descriptions and query texts with atomic representations, while preserving user attributes and temporal metadata in natural language. This creates a hybrid representation that balances compactness and contextual richness.
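To make the adaptor concrete, here is a minimal PyTorch-style sketch of the two-layer projection described above. The module name, dimensions, and the single-token-per-entity wiring are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class EntityAdaptor(nn.Module):
    """Two-layer feed-forward adaptor (hypothetical sketch): projects a frozen
    text-embedding vector (dim d) into the LLM hidden space (dim d_llm)."""
    def __init__(self, d_emb: int = 1024, d_hidden: int = 2048, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_emb, d_hidden),   # W1 e + b1
            nn.ReLU(),
            nn.Linear(d_hidden, d_llm),   # W2 (.) + b2
        )

    def forward(self, entity_emb: torch.Tensor) -> torch.Tensor:
        # entity_emb: [batch, d_emb] from a frozen embedding model (e.g., BGE)
        # returns:    [batch, d_llm] atomic representations, one slot per entity
        return self.proj(entity_emb)

# Usage sketch: each compressed entity occupies a single soft-token slot in the
# LLM input, replacing its original multi-token textual description.
adaptor = EntityAdaptor()
fake_bge_embeddings = torch.randn(8, 1024)    # 8 entities; embedding dim assumed to be 1024
atomic_tokens = adaptor(fake_bge_embeddings)  # -> [8, 4096]
```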
Hybrid Representation Adaptation
To enable the LLM to understand hybrid contexts (interleaving natural language with compressed atomic units), a two-tier training strategy is designed, keeping the LLM backbone frozen and only training the adaptor parameters for parameter efficiency and preservation of general knowledge.
Stage 2: Hybrid Representation Adaptation
- Self-Perception Tasks ("What-is-it" philosophy): GPT-4 is used to dynamically generate diverse, attribute-focused question-answer (QA) pairs that probe the semantic completeness of atomic representations. This ensures that the compressed representations retain critical entity attributes.

  The meta-prompt for dynamic QA pair generation (Prompt 1) guides GPT-4 to produce questions and answers directly from a product title, ensuring all questions are answerable solely from the input text and that the output is in JSON format.

  Given an entity with original text $T$, GPT-4 automatically generates attribute-focused question-answer pairs:

  $$\{(q_i, a_i)\}_{i=1}^{K} = f_{\mathrm{LLM}}(T)$$

  Where:
  - $(q_i, a_i)$: The $i$-th question-answer pair.
  - $q_i$: The question probing a specific entity attribute.
  - $a_i$: The answer extracted from the entity's original text $T$.
  - $f_{\mathrm{LLM}}$: The function representing the powerful LLM (e.g., GPT-4) generating QA pairs from the entity text.
  - $K$: The number of generated QA pairs.
- Production-Oriented Alignment: To validate practical applicability, compressed atomic units are integrated into two core recommendation generation tasks from RecGPT-V1:
  - User Interest Mining: Infers user interest profiles from interaction histories.
  - Item Tag Prediction: Predicts relevant item tags based on inferred interests and historical behaviors.

  For these tasks, reference samples are constructed using full textual representations, and ground-truth responses from the frozen LLM serve as supervision signals.
Unified Training Formulation
Both self-perception QA tasks and production-oriented tasks share an identical optimization paradigm. The adaptor is trained such that hybrid prompts (with compressed entities) can reproduce the same responses as full-text prompts.
Given a reference sample with a full-text prompt $P_{\mathrm{full}}$ and its corresponding response $Y$, the hybrid prompt is constructed by replacing all entity texts with adaptor-projected representations:

$$P_{\mathrm{hyb}} = \mathrm{Replace}\big(P_{\mathrm{full}};\ \{T_e \rightarrow f_{\mathrm{adapt}}(f_{\mathrm{emb}}(T_e))\}_{e \in \mathcal{E}}\big)$$

Where:

- $P_{\mathrm{hyb}}$: The hybrid prompt with compressed entities.
- $P_{\mathrm{full}}$: The full-text prompt.
- $\mathrm{Replace}(\cdot)$: A function that performs entity-to-atomic replacement.
- $T_e$: The textual description of an entity $e$.
- $\mathcal{E}$: The set of all entities in $P_{\mathrm{full}}$.
- $f_{\mathrm{emb}}$ and $f_{\mathrm{adapt}}$: The functions defined earlier for atomic representation encoding.
The adaptor is optimized by minimizing the cross-entropy loss between the model's predictions on compressed inputs and the reference responses:

$$\mathcal{L}(\phi) = -\sum_{t=1}^{|Y|} \log \pi_{\mathrm{LLM}}\big(y_t \mid P_{\mathrm{hyb}},\ y_{<t}\big)$$

Where:

- $\mathcal{L}(\phi)$: The cross-entropy loss to be minimized.
- $\phi$: The parameters of the adaptor network (the only trainable parameters).
- $\pi_{\mathrm{LLM}}$: The output distribution of the frozen LLM.
- $y_t$: The $t$-th token in the ground-truth response $Y$.
- $y_{<t}$: The sequence of ground-truth tokens before position $t$.
This objective ensures that the adaptor learns semantic-preserving projections that maintain functional equivalence between compressed and full-text representations.
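To make this objective concrete, here is a minimal PyTorch-style sketch of one adaptor training step, assuming a Hugging Face-style frozen causal LM. The function name, the reserved entity-slot convention, and the tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch

def adaptor_training_step(llm, adaptor, hybrid_prompt_ids, response_ids,
                          entity_slots, entity_embs):
    """One hybrid-representation training step (illustrative sketch).

    hybrid_prompt_ids : [1, P] prompt token ids with one placeholder position per entity
    response_ids      : [1, R] reference response produced from the full-text prompt
    entity_slots      : list of placeholder positions inside the prompt
    entity_embs       : [num_entities, d_emb] outputs of a frozen embedding model (e.g., BGE)
    """
    llm.eval()
    for p in llm.parameters():          # the LLM backbone stays frozen
        p.requires_grad_(False)

    ids = torch.cat([hybrid_prompt_ids, response_ids], dim=-1)       # [1, P+R]
    embeds = llm.get_input_embeddings()(ids).clone()                 # [1, P+R, d_llm]
    embeds[0, entity_slots] = adaptor(entity_embs)                   # inject atomic entity vectors

    labels = ids.clone()
    labels[:, : hybrid_prompt_ids.size(-1)] = -100                   # supervise response tokens only

    loss = llm(inputs_embeds=embeds, labels=labels).loss             # cross-entropy on y_t
    return loss                                                      # gradients reach only the adaptor
```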
Infrastructure Engineering Optimization
To meet industrial latency requirements, two complementary infrastructure optimizations are introduced:
- Disaggregated Prefill-Decode Serving Architecture:
  - This architecture addresses the asymmetric input-output characteristic of recommendation generation (long inputs, short outputs). The prefill phase is compute-intensive (attention cost grows quadratically with input length), while the decode phase is memory-intensive (dominated by KV-cache access).
  - GPU resources are partitioned accordingly: a larger GPU pool is assigned to prefill for parallel throughput, and fewer resources are assigned to decode for efficient memory access. KV caches are transferred between the two phases.
- XQA Kernel Integration:
  - Replaces the FlashInfer kernel with the XQA kernel to leverage FP8 precision inference on H20 GPUs. The XQA kernel provides superior performance for FP8 quantized models, accelerating attention computation and reducing memory-bandwidth pressure.

These optimizations collectively improve MFU from 11.56% (RecGPT-V1) to 17.04% and achieve substantial throughput gains (e.g., TPS improvement in the decode stage), enabling scalable deployment.
The following figure (Figure 3 from the original paper) compares the inference architectures between RecGPT-V1 and RecGPT-V2:
The image is a schematic comparing the inference architectures of RecGPT-V1 and RecGPT-V2. RecGPT-V2 adopts hybrid representation and disaggregated prefill-decode serving, which significantly improves GPU utilization and computational efficiency while reducing computational redundancy.
The following figure (Figure 4 from the original paper) shows the computational efficiency comparison between RecGPT-V1 and RecGPT-V2:
The image is a chart comparing the computational efficiency of RecGPT-V1 and RecGPT-V2 in terms of MFU%, QPS (Prefill), and TPS (Decode). RecGPT-V2 reaches 17.70% MFU versus 11.56% for RecGPT-V1; its relative QPS (Prefill) is 69.30 versus 1, and its relative TPS (Decode) is 7.35 versus 1. The chart shows RecGPT-V2's substantial gains in computational efficiency.
4.1.2. Hierarchical Multi-Agent System
To eliminate computational waste and cognitive duplication (13.46% inter-route duplication in RecGPT-V1), RecGPT-V2 proposes a Hierarchical Multi-Agent System (HMAS) with a three-tier architecture: Planner-Experts-Arbiter.
The following figure (Figure 5 from the original paper) illustrates the architectural comparison between RecGPT-V1's isolated multi-route reasoning and RecGPT-V2's HMAS:
The image is a schematic comparing the architectures of RecGPT-V1 and RecGPT-V2. The left side shows RecGPT-V1's isolated multi-route reasoning; the right side shows RecGPT-V2's hierarchical multi-agent system, which reduces cognitive redundancy through coordinated intent decomposition.
Global Planner
The Global Planner is the top-level orchestrator. It performs holistic intent analysis by synthesizing rich contextual signals to decompose complex user intent into specialized personas. This eliminates redundant processing of raw sequences by individual experts and ensures coordinated reasoning.
Context Representation: The Global Planner receives a comprehensive contextual representation from three sources:
- User Behavioral History ($\mathcal{H}$): Temporally ordered user interactions, where each interaction includes action type, entity (item/query), and timestamp. Behavioral entities are represented through atomic compression (from §2.1.1).
- User Profile ($\mathcal{U}$):
  - Static Attributes ($\mathcal{U}_{\mathrm{attr}}$): Demographics (age, gender, location).
  - Dynamic Interests ($\mathcal{U}_{\mathrm{int}}$): Behavioral patterns (e.g., "cycling enthusiast").
- Environmental Context ($\mathcal{E}$): Real-time multi-source signals (weather, seasonal factors, trending events) for situational intent mining.

These components form a rich hybrid context:

$$\mathcal{C} = \big(\mathcal{H}, \mathcal{U}, \mathcal{E}\big)$$

Where:

- $\mathcal{C}$: The comprehensive contextual representation.
- $\mathcal{H}$: User Behavioral History, with entities compressed.
- $\mathcal{U}$: User Profile, with static attributes and dynamic interests.
- $\mathcal{E}$: Environmental Context, in natural language.

Intent Decomposition: The Global Planner analyzes $\mathcal{C}$ through multi-dimensional reasoning (temporal trends, situational adaptation, behavioral consistency) to uncover latent user needs and decompose them into specialized personas $\mathcal{P} = \{p_1, \dots, p_M\}$, where each persona $p_k$ represents a distinct facet of user intent:

$$\mathcal{P} = f_{\mathrm{planner}}(\mathcal{C})$$

Where:

- $\mathcal{P}$: The set of specialized personas.
- $f_{\mathrm{planner}}$: The reasoning function of the Global Planner.
This design eliminates computational redundancy (decomposition done once) and ensures cognitive coordination by orchestrating complementary reasoning perspectives.
Distributed Experts
Upon receiving personas from the Global Planner, the distributed expert ensemble executes parallel yet complementary item tag prediction tasks. Each expert agent operates under its assigned persona $p_k$ to generate a set of item tags reflecting a distinct facet of user intent:

$$\mathcal{G}_k = f_{\mathrm{expert}}(\mathcal{C}, p_k)$$

Where:

- $\mathcal{G}_k$: The set of predicted item tags for persona $p_k$.
- $f_{\mathrm{expert}}$: The function representing the expert model for tag prediction.
- $p_k$: The $k$-th specialized persona.
To enhance expert capabilities, a two-stage training strategy is used: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) optimization.
Stage 1: Supervised Fine-Tuning (SFT)
SFT establishes foundational expert capabilities using persona-aligned training samples. For a given persona $p_k$, supervision signals are constructed from the user's subsequent interactions. GPT-4 identifies the item categories from the user's next interactions ($\mathcal{C}_{\mathrm{next}}$) that semantically align with the persona's intent focus:

$$\mathcal{Y}_k = \{\, c \in \mathcal{C}_{\mathrm{next}} \;:\; f_{\mathrm{cls}}(c, p_k) = 1 \,\}$$

Where:

- $\mathcal{Y}_k$: The set of item categories relevant to persona $p_k$.
- $\mathcal{C}_{\mathrm{next}}$: All item categories from the user's subsequent (held-out next) interactions.
- $f_{\mathrm{cls}}$: A binary classifier (implemented by GPT-4) that determines whether category $c$ is semantically relevant to persona $p_k$.

A target set with exactly 15 elements is created. If $|\mathcal{Y}_k| < 15$, it is augmented with GPT-4-generated synthetic tags; if $|\mathcal{Y}_k| > 15$, 15 tags are randomly sampled. The expert model is then trained under a token-prediction paradigm by minimizing the cross-entropy loss:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(p_k, \mathcal{Y}_k)} \sum_{t} \log \pi_{\theta}\big(y_t \mid \mathcal{C}, p_k, y_{<t}\big)$$

Where:

- $\mathcal{L}_{\mathrm{SFT}}(\theta)$: The cross-entropy loss for supervised fine-tuning.
- $\theta$: The parameters of the expert model.
- $\mathbb{E}_{(p_k, \mathcal{Y}_k)}$: Expectation over persona-target pairs.
- $\pi_{\theta}$: The expert model's output distribution.
The training data composition balances domain-specific knowledge with general language capabilities (Table 1). The following are the results from Table 1 of the original paper:
| Data Type | Proportion (%) |
| --- | --- |
| Recommendation Task | |
| - Pure Behavior Patterns | 32.17 |
| - Trending Topics & Events | 6.97 |
| - Weather-Related Contexts | 1.19 |
| - Other Situational Signals | 7.36 |
| General Language Modeling | 52.31 |
Stage 2: Constrained Reinforcement Optimization
Building on SFT, reinforcement learning further enhances expert performance across multiple objectives (diversity, relevance, accuracy). A constrained reward shaping (CRS) mechanism is used to balance conflicting multi-reward optimization.
Policy Optimization Framework (GRPO): The Group Relative Policy Optimization (GRPO) algorithm is used. For each input sample $x$, a group of $G$ outputs $\{y_i\}_{i=1}^{G}$ is sampled from the old policy $\pi_{\theta_{\mathrm{old}}}$, and the new policy $\pi_{\theta}$ is optimized with the following objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x,\, \{y_i\} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( r_i(\theta)\, \hat{A}(x, y_i),\ \mathrm{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}(x, y_i) \Big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\big) \right]$$

Where:

- $\mathcal{J}_{\mathrm{GRPO}}(\theta)$: The GRPO objective function.
- $\theta$: Parameters of the new policy $\pi_{\theta}$.
- $x$, $y_i$: The input context and the $i$-th output sampled from the old policy.
- $r_i(\theta) = \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}$: The ratio of probabilities of output $y_i$ under the new policy vs. the old policy.
- $\hat{A}(x, y) = R(x, y) - \frac{1}{G} \sum_{i=1}^G R(x, y_i)$: The group-normalized advantage, i.e., the reward of the current output minus the average reward of the group of outputs sampled from the old policy.
- $R(x, y)$: The reward function for input $x$ and output $y$.
- $\epsilon$: Clipping parameter, typically a small value (e.g., 0.2), used to limit the policy update step size.
- $\beta$: A coefficient controlling the strength of the KL penalty.
- $\mathbb{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$: The Kullback-Leibler (KL) divergence between the new policy and the reference policy $\pi_{\mathrm{ref}}$, estimated per sample as $\frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_{\theta}(y \mid x)} - \log \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_{\theta}(y \mid x)} - 1$. This term prevents the policy from deviating too far from the reference model (the SFT base model), ensuring stability and mitigating reward hacking.
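As an illustration of the group-normalized advantage at the heart of GRPO, here is a small NumPy sketch; the reward values and group size are made-up examples, not numbers from the paper.

```python
import numpy as np

def group_normalized_advantage(rewards):
    """GRPO-style advantage: each sampled output is scored relative to the
    mean reward of its own group (outputs drawn for the same input x)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Example: G = 4 completions for one prompt, with composite rewards already computed.
rewards = [0.9, 0.2, 0.6, 0.3]
print(group_normalized_advantage(rewards))   # [ 0.4 -0.3  0.1 -0.2]
```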
Multi-Reward Modeling: Four complementary components make up the multi-objective reward function:
- Accuracy Reward ($R_{\mathrm{acc}}$): Encourages experts to predict tags aligning with online user behavior. It measures recall against ground-truth interactions. Given predicted tags $\mathcal{G}$ and interacted item categories $\mathcal{C}_{\mathrm{gt}}$:

  $$R_{\mathrm{acc}} = \frac{1}{|\mathcal{C}_{\mathrm{gt}}|} \sum_{c \in \mathcal{C}_{\mathrm{gt}}} \mathbb{I}\big[c \in f_{\mathrm{map}}(\mathcal{G})\big]$$

  Where:
  - $R_{\mathrm{acc}}$: The accuracy reward.
  - $|\mathcal{C}_{\mathrm{gt}}|$: The number of ground-truth item categories.
  - $\mathbb{I}[\cdot]$: The indicator function, which is 1 if the condition is true and 0 otherwise.
  - $f_{\mathrm{map}}$: A function that maps predicted tags to item categories.
- Alignment Reward ($R_{\mathrm{align}}$): Ensures predicted tags align with human quality standards and the assigned persona's intent. A dedicated reward model $f_{\mathrm{RM}}$ (trained using preference pairs from RecGPT-V1's quality criteria) evaluates the alignment score of each predicted tag with respect to persona $p_k$:

  $$R_{\mathrm{align}} = \frac{1}{|\mathcal{G}_k|} \sum_{g \in \mathcal{G}_k} f_{\mathrm{RM}}(g, p_k)$$

  Where:
  - $R_{\mathrm{align}}$: The alignment reward.
  - $|\mathcal{G}_k|$: The number of predicted tags for persona $p_k$.
  - $f_{\mathrm{RM}}(g, p_k)$: The reward model that outputs an alignment score for tag $g$ given persona $p_k$.
- Diversity Reward ($R_{\mathrm{div}}$): Encourages experts to explore diverse user interests. It measures the semantic richness of predicted tags using the BGE embedding model (Xiao et al., 2023) and computing the average cosine distance among tag representations:

  $$R_{\mathrm{div}} = \frac{2}{|\mathcal{G}|(|\mathcal{G}|-1)} \sum_{i < j} \left( 1 - \frac{\mathbf{e}_i \cdot \mathbf{e}_j}{\|\mathbf{e}_i\| \, \|\mathbf{e}_j\|} \right)$$

  Where:
  - $R_{\mathrm{div}}$: The diversity reward.
  - $\mathbf{e}_i$: The embedding of tag $g_i$ generated by the BGE embedding model.
  - $\mathbf{e}_i \cdot \mathbf{e}_j$: The dot product of embeddings $\mathbf{e}_i$ and $\mathbf{e}_j$.
  - $\|\mathbf{e}_i\|$: The L2 norm of embedding $\mathbf{e}_i$.

  Higher diversity scores encourage broader intent coverage without redundant predictions.
- Length Reward ($R_{\mathrm{len}}$): Promotes appropriate tag lengths (6-11 words, for optimal informativeness and retrieval). For each predicted tag $g$ with word count $n_g$, a per-tag reward $r_{\mathrm{len}}(g)$ is assigned that is highest when $6 \le n_g \le 11$ (e.g., an indicator $\mathbb{I}[6 \le n_g \le 11]$), and the overall length reward for a set of predicted tags is the average:

  $$R_{\mathrm{len}} = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} r_{\mathrm{len}}(g)$$
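The sketch below implements the reward components as reconstructed above; the helper arguments `map_to_category` (tag-to-category mapping) and the pre-computed `tag_embeddings` are assumptions standing in for the production models.

```python
import numpy as np

def accuracy_reward(pred_tags, gt_categories, map_to_category):
    """Recall of ground-truth categories covered by (mapped) predicted tags."""
    mapped = {map_to_category(t) for t in pred_tags}
    return sum(c in mapped for c in gt_categories) / max(len(gt_categories), 1)

def diversity_reward(tag_embeddings):
    """Average pairwise cosine distance among tag embeddings (higher = more diverse)."""
    E = np.asarray(tag_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    n = len(E)
    if n < 2:
        return 0.0
    sims = E @ E.T
    off_diag = sims[np.triu_indices(n, k=1)]
    return float(np.mean(1.0 - off_diag))

def length_reward(pred_tags, lo=6, hi=11):
    """Fraction of tags whose word count falls in the preferred [lo, hi] range."""
    return sum(lo <= len(t.split()) <= hi for t in pred_tags) / max(len(pred_tags), 1)
```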
Constrained Reward Shaping (CRS): Unlike naive sum-based aggregation of rewards (SUM), which leads to multi-reward conflicts, CRS treats certain rewards as hard constraints for optimizing the primary accuracy objective. It enforces a two-stage optimization: first satisfy secondary constraints, then optimize the primary reward.
The following figure (Figure 6 from the original paper) compares reward shaping strategies:
The image is a chart comparing reward-shaping strategies. The left panel (a), sum-based reward shaping, exhibits multi-reward conflicts, while the right panel (b), constrained reward shaping, treats secondary rewards (such as diversity) as conditional constraints, enabling stable optimization of the primary reward (accuracy).
The composite reward is defined as a product of conditional indicators:

$$R = R_{\mathrm{acc}} \cdot \mathbb{I}\big[R_{\mathrm{align}} \ge \tau_{\mathrm{align}}\big] \cdot \mathbb{I}\big[R_{\mathrm{div}} \ge \tau_{\mathrm{div}}\big] \cdot \mathbb{I}\big[R_{\mathrm{len}} \ge \tau_{\mathrm{len}}\big]$$

Where:

- $R$: The total composite reward.
- $R_{\mathrm{acc}}$, $R_{\mathrm{align}}$, $R_{\mathrm{div}}$, $R_{\mathrm{len}}$: The individual reward components defined above.
- $\mathbb{I}[\cdot]$: The indicator function, which evaluates to 1 if the condition inside the brackets is true and 0 otherwise.
- $\tau_{\mathrm{align}}$, $\tau_{\mathrm{div}}$, $\tau_{\mathrm{len}}$: Predefined thresholds for the alignment, diversity, and length rewards, respectively.
This multiplicative formulation ensures that the accuracy reward is propagated only when all secondary objectives meet their minimum requirements, effectively mitigating conflicting gradient signals.
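A minimal sketch of this gating logic follows; the threshold values are illustrative assumptions, not the paper's settings.

```python
def constrained_reward(r_acc, r_align, r_div, r_len,
                       tau_align=0.5, tau_div=0.3, tau_len=0.8):
    """Constrained Reward Shaping: the primary (accuracy) reward is propagated
    only if every secondary objective clears its threshold; otherwise the
    sample receives zero reward, so gradients never trade accuracy against
    unmet constraints."""
    constraints_ok = (r_align >= tau_align) and (r_div >= tau_div) and (r_len >= tau_len)
    return r_acc if constraints_ok else 0.0
```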
The following figure (Figure 7 from the original paper) compares training dynamics between sum-based and constrained reward shaping:
The image is a chart comparing training dynamics under sum-based and constrained reward shaping: (a) gradient norm, (b) KL divergence, (c) accuracy reward, (d) diversity reward. CRS maintains stable optimization across all metrics, whereas SUM suffers from multi-reward conflicts.
Decision Arbiter
After the experts generate tag predictions, the Decision Arbiter performs final candidate selection from the aggregated tag pool $\mathcal{G}_{\mathrm{pool}} = \bigcup_k \mathcal{G}_k$. It leverages the hybrid context $\mathcal{C}$ to holistically evaluate all candidate tags across multiple quality dimensions (relevance, consistency, specificity, validity; defined in Appendix B). The arbiter performs joint reasoning to identify the top-$K$ tags that collectively maximize these dimensions:

$$\mathcal{G}^{\star} = f_{\mathrm{arbiter}}\big(\mathcal{G}_{\mathrm{pool}}, \mathcal{C}\big)$$

Where:

- $\mathcal{G}^{\star}$: The refined set of item tags for downstream retrieval.
- $f_{\mathrm{arbiter}}$: The function representing the Decision Arbiter.
- $\mathcal{G}_{\mathrm{pool}}$: The aggregated pool of tags from all expert agents.
- $\mathcal{C}$: The hybrid context.
This joint evaluation considers inter-tag complementarity and avoids redundancy, consolidating expert outputs into a cohesive recommendation.
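The end-to-end Planner-Experts-Arbiter flow can be sketched as below; the `call_planner`, `call_expert`, and `call_arbiter` callables are hypothetical stand-ins for the underlying LLM agents, not part of the released system.

```python
from concurrent.futures import ThreadPoolExecutor

def hmas_tag_prediction(context, call_planner, call_expert, call_arbiter, top_k=30):
    """Planner -> Experts -> Arbiter flow (illustrative sketch)."""
    personas = call_planner(context)                               # decompose intent once, globally
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:    # experts run in parallel
        tag_sets = list(pool.map(lambda p: call_expert(context, p), personas))
    tag_pool = [tag for tags in tag_sets for tag in tags]          # aggregate expert outputs
    return call_arbiter(context, tag_pool, top_k)                  # joint selection of top-K tags
```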
Online Item Recommendation
- Multi-Interest User Encoding: Building on RecGPT-V1's three-tower architecture (user-item-tag), the user encoder is extended to capture multiple interest facets using a Poly-Encoder (Humeau et al., 2019). This introduces learnable context codes that aggregate user behavioral embeddings into multiple interest vectors via attention mechanisms. These vectors represent distinct aspects of user preferences, enabling fine-grained matching (see the sketch after this list).
- Traffic Allocation via Quadratic Programming: To balance exploration (cognitive-channel items) and exploitation (existing utility-channel items) under limited exposure budgets, traffic allocation is formulated as a quadratic programming problem. This dynamically adjusts the proportion of cognitive retrieval items in the recommendation slate to maximize overall system revenue while ensuring exploratory recommendations enhance long-term user engagement without compromising short-term business metrics. The detailed problem formulation and solution are provided in Appendix C.
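The following is a minimal sketch of Poly-Encoder-style multi-interest aggregation; the embedding dimension and the number of context codes are assumptions, and this is not the production encoder.

```python
import torch
import torch.nn as nn

class MultiInterestEncoder(nn.Module):
    """Poly-Encoder-style aggregation (illustrative sketch): m learnable context
    codes attend over a user's behavioral embeddings to produce m interest vectors."""
    def __init__(self, d_model: int = 256, num_codes: int = 4):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, d_model))

    def forward(self, behavior_embs: torch.Tensor) -> torch.Tensor:
        # behavior_embs: [batch, seq_len, d_model]
        attn = torch.einsum('md,bsd->bms', self.codes, behavior_embs)  # code-to-behavior scores
        attn = attn.softmax(dim=-1)
        # Each interest vector is an attention-weighted mix of behaviors: [batch, m, d_model]
        return torch.einsum('bms,bsd->bmd', attn, behavior_embs)

# Usage: score an item against the best-matching interest facet.
enc = MultiInterestEncoder()
user_interests = enc(torch.randn(2, 50, 256))          # [2, 4, 256]
item_emb = torch.randn(2, 256)
scores = torch.einsum('bmd,bd->bm', user_interests, item_emb).max(dim=-1).values
```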
4.2. Dynamic Explanation Generation
RecGPT-V2 enhances explanation generation by addressing low information density, weak temporal adaptation, and homogenized expression of RecGPT-V1, which resulted from static prompt templates and incomplete evaluation. It introduces Meta-Prompting and Preference-Aware Reinforcement Learning.
4.2.1. Meta-Prompting
Unlike RecGPT-V1's direct one-step generation from fixed templates, Meta-Prompting decomposes the generation process into two stages: style synthesis followed by style-conditioned explanation generation. This hierarchical design allows for diverse and contextually adaptive explanations.
Expanded Evaluation Dimensions: RecGPT-V1's four dimensions (Relevance, Factuality, Clarity, Safety) are expanded to seven by adding Timeliness (alignment with trends/events), Informativeness (substantive insights), and Attractiveness (emotional appeal).
Two-Stage Generation Framework: Given user interests ($\mathcal{I}_u$), item attributes ($\mathcal{A}_i$), and contextual signals ($\mathcal{S}$), the framework operates as follows:

- Stage 1: Style Synthesis: The model first generates a stylistic guideline specifying tone, rhetorical devices, target audience, and emotional resonance:

  $$s = f_{\mathrm{meta}}\big(\mathcal{I}_u, \mathcal{A}_i, \mathcal{S}\big)$$

  Where:
  - $s$: The generated stylistic guideline.
  - $f_{\mathrm{meta}}$: The meta-prompt generator function.
  - $\mathcal{I}_u$: User interests.
  - $\mathcal{A}_i$: Item attributes.
  - $\mathcal{S}$: Situational signals (e.g., weather, seasonal trends).
- Stage 2: Style-Conditioned Explanation Generation: Conditioned on the style guideline $s$, the model generates the final explanation adhering to the specified stylistic constraints:

  $$y = f_{\mathrm{gen}}\big(\mathcal{I}_u, \mathcal{A}_i, \mathcal{S}, s\big)$$

  Where:
  - $y$: The final explanation.
  - $f_{\mathrm{gen}}$: The explanation generation function.

This two-stage approach provides flexibility and leverages the model's creative capacity.
4.2.2. Preference-Aware Reinforcement Learning
Building on SFT, constrained reinforcement learning (similar to §2.2.2) further enhances explanation quality. It uses a hybrid reward framework combining rule-based diversity rewards and model-based alignment rewards, unified under the Constrained Reward Shaping (CRS) mechanism.
Policy Optimization Framework: The GRPO algorithm (Equation 6) is used, with the reward function replaced by an explanation-specific composite reward.
Hybrid Reward Modeling:
- Rule-Based Diversity Reward ($R_{\mathrm{div}}$): To encourage varied linguistic expressions, an IDF-inspired diversity reward is used. A memory buffer $\mathcal{B}$ (size 160) stores recently generated explanations. For a new explanation $y$:

  $$R_{\mathrm{div}}(y) = \frac{1}{|y|} \sum_{w \in y} \log \frac{|\mathcal{B}|}{1 + n(w)}$$

  Where:
  - $R_{\mathrm{div}}(y)$: The diversity reward for the explanation.
  - $|y|$: The length (number of tokens) of the explanation $y$.
  - $|\mathcal{B}|$: The size of the memory buffer $\mathcal{B}$.
  - $n(w)$: The count of stored explanations containing token $w$. The logarithmic term rewards rare tokens, and the $+1$ is for smoothing.
- Model-Based Alignment Reward ($R_{\mathrm{align}}$): To capture subjective quality (e.g., informativeness), a reward model is trained on preference data using listwise comparisons (detailed in §4.2, Judge-as-a-Reward). Given a generated explanation $y$:

  $$R_{\mathrm{align}}(y) = f_{\mathrm{RM}}\big(y \mid \mathcal{I}_u, \mathcal{A}_i, \mathcal{S}\big)$$

  Where:
  - $R_{\mathrm{align}}(y)$: The alignment reward for the explanation.
  - $f_{\mathrm{RM}}$: The reward model.
  - $y$: The generated explanation.
  - $\mathcal{I}_u$, $\mathcal{A}_i$, $\mathcal{S}$: User interests, item attributes, and situational signals, respectively.
Constrained Reward Shaping (CRS): Consistent with §2.2.2, CRS is adopted. For explanation generation, human preference alignment ($R_{\mathrm{align}}$) is the main reward, with diversity ($R_{\mathrm{div}}$) as a secondary constraint:

$$R = R_{\mathrm{align}}(y) \cdot \mathbb{I}\big[R_{\mathrm{div}}(y) \ge \tau_{\mathrm{div}}\big]$$

Where:

- $R$: The total composite reward.
- $\tau_{\mathrm{div}}$: The diversity threshold.

This formulation treats diversity as a gating condition, preventing gradient interference and enabling stable optimization towards human-aligned, diverse explanations.
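A small sketch of the IDF-inspired diversity reward with a rolling memory buffer follows; the buffer size matches the text, but the exact token-level formula follows the reconstruction above and should be read as an assumption.

```python
import math
from collections import deque

class ExplanationDiversityReward:
    """IDF-inspired diversity reward (illustrative sketch): tokens that rarely
    appear in recently generated explanations earn a higher reward."""
    def __init__(self, buffer_size: int = 160):
        self.buffer = deque(maxlen=buffer_size)   # recent explanations, stored as token sets

    def __call__(self, explanation_tokens: list[str]) -> float:
        if not explanation_tokens:
            return 0.0
        n_stored = max(len(self.buffer), 1)
        score = 0.0
        for tok in explanation_tokens:
            doc_freq = sum(tok in stored for stored in self.buffer)
            score += math.log(n_stored / (1 + doc_freq))
        self.buffer.append(set(explanation_tokens))   # update rolling memory
        return score / len(explanation_tokens)
```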
4.3. Agentic Judge Framework
RecGPT-V1's LLM-as-a-Judge was outcome-focused, limiting its ability to capture nuanced quality. RecGPT-V2 introduces a novel Agentic Judge Framework with Agent-as-a-Judge and Judge-as-a-Reward to enhance evaluation quality and create a self-reinforcing Flywheel Effect.
The following figure (Figure 8 from the original paper) illustrates the Agent-as-a-Judge framework:
The image is a schematic of the Agent-as-a-Judge framework for fine-grained evaluation. It shows Multi-Dimension Sub-Evaluators independently assessing task-specific quality dimensions for tag prediction and explanation generation, and a Senior Reviewer aggregating their feedback into a three-tier judgment (Superior/Average/Bad).
4.3.1. Agent-as-a-Judge
This framework mirrors human cognitive evaluation by decomposing holistic quality assessment into fine-grained, dimension-specific sub-evaluators followed by a multi-level review.
Multi-Dimension Sub-Evaluators: For each evaluation dimension $d_i$ (e.g., relevance, diversity, coherence, as detailed in Appendix B), a specialized sub-evaluator is instantiated. Each sub-evaluator assesses the generated content along its assigned dimension:

$$E_i = f_{\mathrm{eval}}^{(i)}\big(y, d_i\big)$$

Where:

- $E_i$: The dimension-specific evaluation result for the $i$-th dimension.
- $f_{\mathrm{eval}}^{(i)}$: The specialized sub-evaluator for dimension $d_i$.
- $y$: The generated content (e.g., item tags, explanation).
- $d_i$: The specific evaluation dimension.
This decomposition transforms complex multi-objective evaluation into manageable single-objective tasks.
Three-Tier Judgment: A Senior Reviewer Agent aggregates the outputs from all sub-evaluators to produce a final overall quality judgment using a three-tier S-A-B scheme:

- Superior (S): Output excels across all or most dimensions.
- Average (A): Output meets minimum standards across dimensions.
- Bad (B): Output fails to satisfy basic requirements in at least one critical dimension.

The aggregation procedure involves a two-stage decision:

- Defect Detection: If any dimension receives a negative or unsatisfactory signal, the overall result is classified as Bad (B).
- Excellence Elevation: If no critical defects are detected, the Senior Reviewer distinguishes between Superior (S) and Average (A) based on the proportion or pattern of positive feedback among all dimensions, using a threshold $\tau$. A sketch of this aggregation logic follows below.

Model Adaptation through Supervised Fine-Tuning: Evaluation agents are adapted to domain-specific quality standards by training on a corpus combining model-generated samples and outputs from powerful LLMs (e.g., DeepSeek-R1, Qwen3-235B). A hybrid annotation strategy is used: (1) in-batch shuffling for automatic construction of Bad-quality samples (e.g., mismatched user contexts for relevance); (2) human annotators for nuanced judgments across all evaluation dimensions and holistic S-A-B labels. A lightweight Qwen3-32B-Instruct model is fine-tuned using SFT.
4.3.2. Judge-as-a-Reward
While Agent-as-a-Judge provides accurate assessment, its discrete classification labels lack granularity for fine-grained policy gradient estimation, and its multi-step evaluation is computationally expensive for online RL training. Judge-as-a-Reward addresses this by distilling agent evaluation capabilities into lightweight reward models for dense optimization signals.
Reward Model Architecture: The reward model is initialized from the Agent Judge checkpoint (i.e., it shares the same underlying LLM architecture) but with an added scalar value head:

$$r = \sigma\big(f_{\mathrm{RM}}(y \mid \mathcal{I}_u, \mathcal{A}_i, \mathcal{S})\big)$$

Where:

- $r$: The predicted reward score, bounded to [0, 1] by a sigmoid activation $\sigma$.
- $f_{\mathrm{RM}}$: The reward model.
- $y$: The generated content.
- $\mathcal{I}_u$, $\mathcal{A}_i$, $\mathcal{S}$: User interests, item attributes, and situational signals, respectively.
Reward Model Training via Listwise Learning-to-Rank: To preserve fine-grained quality distinctions from the Senior Reviewer's three-tier labels (S, A, B), a listwise learning-to-rank approach is used. Samples in each batch are grouped by quality level. For any quality level $\ell$, samples at level $\ell$ are positive instances, and samples at lower levels ($<\ell$) are negative. The reward model is trained to assign higher scores to higher-quality samples using a unified contrastive loss formulation:

$$\mathcal{L}_{\mathrm{RM}} = -\sum_{\ell \in \{S, A\}} \sum_{y^{+} \in \mathcal{D}_{\ell}} \log \frac{\exp\big(r(y^{+})\big)}{\exp\big(r(y^{+})\big) + \sum_{y^{-} \in \mathcal{D}_{<\ell}} \exp\big(r(y^{-})\big)}$$

Where:

- $\mathcal{L}_{\mathrm{RM}}$: The contrastive loss for the reward model.
- $\ell$: Quality levels (S for Superior, A for Average).
- $y^{+} \in \mathcal{D}_{\ell}$: A sample belonging to the set of samples at quality level $\ell$.
- $r(\cdot)$: The reward score predicted by the reward model for a sample.
- $\mathcal{D}_{<\ell}$: All quality levels lower than $\ell$ (e.g., for $\ell = S$, this includes A and B; for $\ell = A$, only B).
- $\mathcal{D}_{\ell}$: The set of samples at quality level $\ell$.
: Denotes all quality levels lower than (e.g., for , includes A and B; for , only includes B).
-
: The set of samples at quality level .
This formulation implicitly captures all pairwise relationships (S vs. AB, A vs. B), enabling the reward model to learn the complete hierarchical preference ordering.
Engineering Acceleration via Prefix Sharing: To accelerate training, shared prefix representations are computed once and reused across all candidates within contrastive groups, enabling parallel inference and reducing redundant computation.
Self-Improving Flywheel Effect: The integration of Agent-as-a-Judge and Judge-as-a-Reward creates a self-reinforcing optimization cycle for continuous quality improvement without recurring human annotation costs:
-
Policy Generation: Policy model generates diverse responses through SFT and RL.
-
Agentic Evaluation:
Agent-as-a-Judge decomposes samples into dimension-specific assessments and holistic S-A-B judgments.
Reward Distillation:
Judge-as-a-Rewarddistills discrete agent judgments into continuous, differentiable reward signals by learning preference structures. -
Policy Optimization: Distilled reward signals guide policy refinement via
GRPOto maximize human-aligned preferences.This closed-loop architecture operates autonomously after initial human annotation, progressively aligning model behavior with human quality standards, with
reward distillationensuring computational efficiency andMulti-Dimension evaluationguaranteeing comprehensive quality improvements.
5. Experimental Setup
To validate RecGPT-V2's effectiveness in practical industrial applications, long-term online experiments were conducted on Taobao's platform.
5.1. Datasets
The paper describes the data used for training the Distributed Experts (Table 1) and for training the Agent-as-a-Judge framework.
For the Distributed Experts (specifically for Supervised Fine-Tuning):
- Recommendation Task Data:
  - Pure Behavior Patterns: 32.17%
  - Trending Topics & Events: 6.97%
  - Weather-Related Contexts: 1.19%
  - Other Situational Signals: 7.36%
- General Language Modeling Data: 52.31%

This composition balances domain-specific knowledge with general language capabilities.
For Agent-as-a-Judge model adaptation:
-
Training Corpus: Combines model-generated samples (e.g., from DeepSeek-R1, Qwen3-235B) and outputs from powerful LLMs.
-
Bad-quality samples: Automatically constructed via
in-batch shuffling(e.g., randomly pairing outputs with mismatched user contexts for relevance). -
Nuanced judgment samples: Human annotators provide labels across all evaluation dimensions and holistic
S-A-Bjudgments.The online A/B tests were conducted on Taobao's homepage "Guess What You Like" scenario, implying a real-world, large-scale e-commerce dataset of user interactions, item attributes, and contextual information. The specific details of this proprietary Taobao dataset (size, number of users/items) are not explicitly given but are implied to be massive given the "industrial scale" context.
Data Sample Example (from Case 2 in the paper):
A concrete example of a User Behavioral Sequence Compression from the paper:
Original Full-Text Context (21,349 tokens):

- User Attributes: 28-year-old female resident of Beijing; astrological signs: Gemini (Western), Ox (Chinese zodiac).
- User Behavioral History:
  - Purchased 3 years ago: women's autumn-winter knee-high boots; topstitched satin-textured dress.
  - Searched 2 years ago: premium aesthetic outerwear; retro Bluetooth mini speaker.
  - Clicked 1 year ago: Korean-style loose-fit sweater; pure cotton 4-piece bedding set.
  - (numerous additional interactions omitted due to space)
This data sample shows a user's demographic information and a timeline of their past interactions (purchased, searched, clicked items). This demonstrates the type of rich, sequential user behavior data that RecGPT-V2 processes, often requiring compression due to its length.
5.2. Evaluation Metrics
The paper uses distinct evaluation metrics for different components and for overall online A/B testing.
For Item Tag Prediction (Offline & RL Optimization)
- Hit Rate at Top-30 (HR@30):
  - Conceptual Definition: Measures whether any of the top 30 predicted item tags (after being mapped to item categories) successfully match any of the user's actual interaction categories in their subsequent behavior. It is a common metric for retrieval or recommendation quality, indicating how often the desired items are present in a ranked list.
  - Mathematical Formula: In the context of item tag prediction, it implies:

    $$\mathrm{HR@30} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{I}\Big[\, \mathcal{C}_u \cap f_{\mathrm{map}}\big(\mathcal{G}_u^{30}\big) \neq \varnothing \,\Big]$$

  - Symbol Explanation:
    - $\mathrm{HR@30}$: Hit Rate at top 30.
    - $|\mathcal{U}|$: Total number of users.
    - $u$: A specific user.
    - $\mathbb{I}[\cdot]$: Indicator function; returns 1 if the condition is true, 0 otherwise.
    - $\mathcal{C}_u$: The set of ground-truth item categories from user $u$'s actual subsequent interactions.
    - $\mathcal{G}_u^{30}$: The top 30 predicted item tags for user $u$.
    - $f_{\mathrm{map}}$: A pre-trained model that maps predicted tags to item categories.
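A small sketch of HR@K as reconstructed above; the `map_fn` argument is a hypothetical stand-in for the tag-to-category mapping model.

```python
def hit_rate_at_k(users, k=30):
    """HR@K: a user counts as a 'hit' if any ground-truth category is covered
    by the categories of their top-K predicted tags.
    Each element of `users` is (gt_categories, predicted_tags, map_fn)."""
    hits = 0
    for gt_categories, predicted_tags, map_fn in users:
        pred_categories = {map_fn(tag) for tag in predicted_tags[:k]}
        hits += bool(set(gt_categories) & pred_categories)
    return hits / max(len(users), 1)
```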
For Explanation Generation (Offline & RL Optimization)
- Diversity:
  - Conceptual Definition: Measures the lexical variety and non-redundancy within a set of generated explanations for a given item. Higher diversity indicates that the explanations are less repetitive and offer varied perspectives.
  - Mathematical Formula:

    $$\mathrm{Diversity}_i = 1 - \frac{2}{n_i (n_i - 1)} \sum_{j < k} \mathrm{ROUGE\text{-}L}\big(e_i^{(j)}, e_i^{(k)}\big)$$

  - Symbol Explanation:
    - $\mathrm{Diversity}_i$: The diversity score for item $i$.
    - $n_i$: The number of generated explanations for item $i$.
    - $e_i^{(j)}$, $e_i^{(k)}$: The $j$-th and $k$-th explanations generated for item $i$.
    - $\mathrm{ROUGE\text{-}L}$: The ROUGE-L metric, which measures longest-common-subsequence similarity between two explanations.
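The following sketch computes a ROUGE-L-based diversity score consistent with the reconstruction above (average pairwise dissimilarity); it uses a simple LCS-based ROUGE-L F1 rather than any particular library implementation.

```python
from itertools import combinations

def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(x_tokens, y_tokens):
    lcs = lcs_len(x_tokens, y_tokens)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(x_tokens), lcs / len(y_tokens)
    return 2 * p * r / (p + r)

def explanation_diversity(explanations):
    """Average pairwise (1 - ROUGE-L) over an item's generated explanations."""
    pairs = list(combinations([e.split() for e in explanations], 2))
    if not pairs:
        return 0.0
    return sum(1 - rouge_l_f1(a, b) for a, b in pairs) / len(pairs)
```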
- Quality (Human Evaluation Acceptance Rate):
- Conceptual Definition: Measures the percentage of generated explanations that are deemed "high-quality" by human annotators, based on satisfying all seven expanded evaluation criteria (Relevance, Factuality, Clarity, Safety, Timeliness, Informativeness, Attractiveness).
- Mathematical Formula: Not explicitly provided as a formula, but reported as a percentage:
  $$ \mathrm{Quality} = \frac{\#\{\text{explanations rated high-quality}\}}{\#\{\text{evaluated explanations}\}} \times 100\% $$
- Symbol Explanation:
  - "high-quality": Refers to explanations that meet all criteria in Table 8.
For Agent-as-a-Judge Evaluation
- Accuracy:
  - Conceptual Definition: Measures how often the Agent-as-a-Judge framework's Superior (S) classification matches human annotations, which serve as the ground truth.
  - Mathematical Formula:
    $$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
  - Symbol Explanation: Standard accuracy definition over the binary Superior / non-Superior classification.
- F1 Score:
  - Conceptual Definition: The harmonic mean of precision and recall. It is particularly useful when the classes are imbalanced (e.g., fewer Superior samples) or when both false positives and false negatives matter.
  - Mathematical Formula:
    $$ F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} $$
  - Symbol Explanation:
    - Precision: Proportion of positive identifications that were actually correct.
    - Recall: Proportion of actual positives that were identified correctly.
    - $TP$: Samples correctly identified as Superior.
    - $FP$: Samples incorrectly identified as Superior.
    - $FN$: Superior samples incorrectly identified as non-Superior.
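A minimal sketch of both metrics for the binary Superior / non-Superior decision, assuming judge and human labels are encoded as "S" versus other strings (the label encoding is an assumption, not the paper's format):

```python
from typing import List, Tuple

def accuracy_and_f1(judge_labels: List[str], human_labels: List[str]) -> Tuple[float, float]:
    """Accuracy and F1 for the 'Superior' (S) class, with human labels as ground truth."""
    assert len(judge_labels) == len(human_labels) and judge_labels
    pairs = list(zip(judge_labels, human_labels))
    tp = sum(j == "S" and h == "S" for j, h in pairs)       # correctly flagged Superior
    fp = sum(j == "S" and h != "S" for j, h in pairs)       # wrongly flagged Superior
    fn = sum(j != "S" and h == "S" for j, h in pairs)       # missed Superior samples
    accuracy = sum(j == h for j, h in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```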
For Online A/B Test (Taobao)
- Short-Term Metrics:
- IPV (Item Page Views):
- Conceptual Definition: The number of times users view an item's detail page after seeing a recommendation. Indicates initial user interest and engagement.
- Mathematical Formula: Not explicitly provided, but typically a count.
- CTR (Click-Through Rate):
- Conceptual Definition: The ratio of clicks on recommended items to the total number of impressions (times items were shown). Measures the relevance and attractiveness of recommendations.
- Mathematical Formula:
  $$ \mathrm{CTR} = \frac{\#\{\text{clicks on recommended items}\}}{\#\{\text{impressions}\}} $$
- TV (Transaction Volume):
- Conceptual Definition: The total monetary value of completed purchases resulting from recommendations. Directly measures the revenue generated.
- Mathematical Formula: Not explicitly provided, but typically a sum of purchase values.
- GMV (Gross Merchandise Value):
- Conceptual Definition: The total transaction value, including both orders and returns. A broader measure of commercial activity.
- Mathematical Formula: Not explicitly provided, but typically a sum of all transaction values.
- ATC (Add-to-Cart):
- Conceptual Definition: The number of times recommended items are added to a user's shopping cart. Reflects purchase intent, even if a transaction isn't immediately completed.
- Mathematical Formula: Not explicitly provided, but typically a count.
- Long-Term Metrics:
- NER (Novelty Exposure Rate):
  - Conceptual Definition: The percentage of recommended items that users have not previously interacted with (clicked, purchased, etc.). Measures the system's ability to expose users to new and diverse content, mitigating filter bubble effects.
  - Mathematical Formula:
    $$ \mathrm{NER} = \frac{\#\{\text{exposed items with no prior user interaction}\}}{\#\{\text{exposed items}\}} \times 100\% $$
- LT-14 / LT-30 (User Retention Rates):
  - Conceptual Definition: The percentage of users who remain active on the platform (e.g., logging in, interacting with content) 14 days and 30 days after being exposed to the new recommendation system. Quantifies sustained user engagement and platform health.
  - Mathematical Formula:
    $$ \mathrm{LT\text{-}}k = \frac{\#\{\text{users still active } k \text{ days after exposure}\}}{\#\{\text{exposed users}\}} \times 100\%, \quad k \in \{14, 30\} $$
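These online metrics are computed from production logs that are not available here; the sketch below only illustrates the schematic per-user and aggregate calculations implied by the definitions above (all function and variable names are hypothetical).

```python
from typing import Dict, List, Set

def ctr(clicks: int, impressions: int) -> float:
    # Click-through rate over a logging window.
    return clicks / impressions if impressions else 0.0

def novelty_exposure_rate(exposed: List[str], previously_interacted: Set[str]) -> float:
    """Share of exposed items the user has never interacted with before."""
    if not exposed:
        return 0.0
    return sum(item not in previously_interacted for item in exposed) / len(exposed)

def retention_rate(active_days: Dict[str, Set[int]], exposed_users: List[str], k: int) -> float:
    """LT-k: fraction of exposed users still active k days after exposure.

    `active_days` maps each user to the set of days-after-exposure on which they were active.
    """
    if not exposed_users:
        return 0.0
    return sum(k in active_days.get(u, set()) for u in exposed_users) / len(exposed_users)
```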
5.3. Baselines
- RecGPT-V1: Serves as the primary control group and baseline for comparison in all experiments, both offline (e.g., tag prediction accuracy) and online (A/B tests). This allows for a direct assessment of the improvements introduced by RecGPT-V2's innovations.
- Qwen-14B Base Model: Used as the base model for constructing RecGPT-V2 variants in offline experiments (e.g., tag prediction). This baseline helps validate the necessity of domain adaptation and fine-tuning.
- SFT (Supervised Fine-Tuning) variant: A variant of RecGPT-V2 that only undergoes supervised fine-tuning, without the reinforcement learning optimization. This helps isolate the contribution of RL.
- GRPO (SUM) variant: A variant using Group Relative Policy Optimization but with naive sum-based aggregation of rewards. This serves as a baseline to demonstrate the superiority of Constrained Reward Shaping (CRS) in handling multi-reward conflicts.
- RecGPT-V2 (Point-wise RM) variant: A variant using reinforcement learning with a reward model trained point-wise, contrasting with the list-wise approach of the full RecGPT-V2.
6. Results & Analysis
6.1. Core Results Analysis
RecGPT-V2 demonstrates significant improvements across computational efficiency, generation quality (tag prediction, explanation diversity), human alignment in evaluation, and real-world online performance.
Computational Efficiency (from Figure 4):
RecGPT-V2 achieves substantial gains in computational efficiency compared to RecGPT-V1, validating the effectiveness of Hybrid Representation Inference and Infrastructure Engineering Optimization.
The following are the results from Figure 4 of the original paper:
Figure 4: A chart comparing the computational efficiency of RecGPT-V1 and RecGPT-V2 in terms of MFU (%), QPS (Prefill), and TPS (Decode). RecGPT-V2 reaches 17.70% MFU versus 11.56% for RecGPT-V1; relative to a RecGPT-V1 baseline normalized to 1, RecGPT-V2 achieves 69.30 in QPS (Prefill) and 7.35 in TPS (Decode), showing a substantial efficiency gain.
- MFU (Model FLOPs Utilization): RecGPT-V2 reaches 17.70% MFU, a significant improvement over RecGPT-V1's 11.56%. This is a 53.11% relative improvement, indicating much more efficient GPU utilization.
- QPS (Prefill): RecGPT-V2 achieves a 69.30x relative improvement in prefill QPS over RecGPT-V1 (normalized to 1), a dramatic increase in throughput for processing input prompts.
- TPS (Decode): RecGPT-V2 shows a 7.35x improvement in TPS during the decode stage relative to RecGPT-V1. These results confirm the claim of 60% GPU consumption reduction and enable scalable deployment.
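The report does not spell out how MFU is computed; the sketch below uses the standard definition (achieved model FLOPs per second divided by theoretical peak hardware FLOPs), with purely hypothetical numbers for illustration.

```python
def model_flops_utilization(achieved_tokens_per_s: float,
                            flops_per_token: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPs per second / theoretical peak FLOPs of the hardware."""
    achieved_flops_per_s = achieved_tokens_per_s * flops_per_token
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)

# Hypothetical example: a 14B-parameter model needs roughly 2 * 14e9 FLOPs per token
# for a forward pass; assume 8 GPUs with ~312 TFLOPS peak each.
mfu = model_flops_utilization(achieved_tokens_per_s=25_000,
                              flops_per_token=2 * 14e9,
                              num_gpus=8,
                              peak_flops_per_gpu=312e12)
print(f"MFU = {mfu:.2%}")
```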
Item Tag Prediction Accuracy (from Table 2):
The Constrained Reinforcement Optimization dramatically improves item tag prediction.
The following are the results from Table 2 of the original paper:
| Metric | RecGPT-V1 | RecGPT-V2 (Base) | RecGPT-V2 (SFT) | RecGPT-V2 (GRPO, SUM) | RecGPT-V2 (GRPO, CRS) |
| --- | --- | --- | --- | --- | --- |
| HR@30 | 26.29% | 23.08% | 29.20% | 27.38% | 32.60% |
- Base Model: The Qwen-14B Base model, without domain adaptation, performs worse than RecGPT-V1 (23.08% vs. 26.29%), highlighting the necessity of domain-specific training.
- SFT: Supervised fine-tuning boosts performance to 29.20% HR@30, surpassing RecGPT-V1 by 2.91 percentage points (29.20 − 26.29). This demonstrates the effectiveness of persona-aligned supervision.
- GRPO (SUM): Naive sum-based reward aggregation in GRPO (27.38%) performs worse than SFT, confirming that multi-reward conflicts can degrade performance.
- GRPO (CRS): The full RecGPT-V2 with Constrained Reward Shaping (CRS) achieves the highest HR@30 at 32.60%. This is a 3.40-point improvement over SFT (32.60 − 29.20) and a substantial 6.31-point improvement over RecGPT-V1 (32.60 − 26.29), validating the efficacy of CRS for stable multi-objective optimization (contrasted schematically with sum aggregation below).
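The exact CRS formulation is not reproduced in this section; the sketch below only illustrates the qualitative difference between naive sum aggregation and a constraint-gated aggregation in which auxiliary rewards count only once the primary objective clears a threshold. The gating rule and threshold are assumptions for illustration, not the paper's rule.

```python
from typing import Dict

def sum_reward(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    # Naive aggregation: conflicting objectives can pull the policy away from the primary goal.
    return sum(weights[k] * v for k, v in rewards.items())

def constrained_reward(rewards: Dict[str, float], weights: Dict[str, float],
                       primary: str = "accuracy", threshold: float = 0.5) -> float:
    # Illustrative constraint gating (an assumption, not the paper's exact CRS rule):
    # auxiliary rewards contribute only after the primary reward clears the threshold.
    if rewards[primary] < threshold:
        return weights[primary] * rewards[primary]
    aux = sum(weights[k] * v for k, v in rewards.items() if k != primary)
    return weights[primary] * rewards[primary] + aux

# Toy usage: with a weak primary reward, auxiliary signals are ignored under gating.
r = {"accuracy": 0.3, "diversity": 0.9}
w = {"accuracy": 1.0, "diversity": 0.5}
print(sum_reward(r, w), constrained_reward(r, w))
```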
Explanation Generation Performance (from Table 3):
Meta-Prompting and Preference-Aware Reinforcement Learning enhance explanation quality.
The following are the results from Table 3 of the original paper:
| Method | Diversity | Quality (%) |
| RecGPT-V1 | 0.631 | 36.03 |
| RecGPT-V2 | 0.677 | 40.73 |
- Diversity: RecGPT-V2 improves explanation diversity from 0.631 to 0.677, a relative increase of 7.3%.
- Quality (%): The human-rated explanation acceptance rate increases from 36.03% to 40.73%, a relative improvement of 13.0% ((40.73-36.03)/36.03 * 100%). These gains validate the effectiveness of the meta-prompting framework and preference-aware RL for dynamic and engaging explanations.
Agent-as-a-Judge Human-Alignment (from Table 4):
The Agent-as-a-Judge framework demonstrates superior alignment with human judgments.
The following are the results from Table 4 of the original paper:
| Task | Model | Accuracy (V1) | Accuracy (V2) | F1 (V1) | F1 (V2) |
| --- | --- | --- | --- | --- | --- |
| Item Tag Prediction | GPT5-mini | 0.7694 | 0.7704 | 0.7499 | 0.7535 |
| Item Tag Prediction | Qwen3-Base | 0.7844 | 0.7864 | 0.7991 | 0.8051 |
| Item Tag Prediction | Qwen3-SFT | 0.8210 | 0.8248 | 0.8095 | 0.8228 |
| Explanation Generation | GPT5-mini | 0.4481 | 0.4548 | 0.5673 | 0.5424 |
| Explanation Generation | Qwen3-Base | 0.3423 | 0.2764 | 0.0898 | 0.0904 |
| Explanation Generation | Qwen3-SFT | 0.6885 | 0.7006 | 0.6787 | 0.7307 |
- Item Tag Prediction: RecGPT-V2 consistently shows higher accuracy and F1 scores across different underlying models (GPT5-mini, Qwen3-Base, Qwen3-SFT). For Qwen3-SFT, accuracy improves from 0.8210 to 0.8248 (+0.38 pp), and F1 improves from 0.8095 to 0.8228 (+1.33 pp).
- Explanation Generation: RecGPT-V2 also generally outperforms RecGPT-V1. For Qwen3-SFT, accuracy increases from 0.6885 to 0.7006 (+1.21 pp), and F1 jumps from 0.6787 to 0.7307 (+5.20 pp). The exception is Qwen3-Base for explanation generation, where V2 accuracy decreases while F1 slightly increases. These results validate that the multi-step, process-oriented evaluation of Agent-as-a-Judge aligns more closely with human standards.
Impact of Reward Model Training Strategy (from Table 5):
The listwise learning-to-rank approach for Judge-as-a-Reward is crucial.
The following are the results from Table 5 of the original paper:
| Method | HR@30 (Tag) | Quality (Explanation) |
| RecGPT-V1 | 26.29% | 36.03% |
| RecGPT-V2 (Point-wise RM) | 31.24% | 37.64% |
| RecGPT-V2 (List-wise RM) | 32.60% | 40.73% |
- Tag Prediction (HR@30): RecGPT-V2 with the list-wise RM achieves 32.60%, outperforming RecGPT-V1 (26.29%, a +24.1% relative improvement) and the point-wise RM (31.24%, a +4.4% relative improvement).
- Explanation Quality: RecGPT-V2 with the list-wise RM achieves 40.73%, outperforming RecGPT-V1 (36.03%, a +13.0% relative improvement) and the point-wise RM (37.64%, a +8.2% relative improvement). This confirms that modeling hierarchical preference ordering with listwise learning-to-rank provides more discriminative optimization signals for RL (see the schematic contrast below).
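The paper's exact reward-model objectives are not given in this section; as a rough illustration of the point-wise vs. list-wise distinction, the sketch below contrasts a per-candidate regression loss with a softmax-based listwise ranking loss over a single candidate list (a ListNet-style simplification, not the authors' implementation).

```python
import math
from typing import List

def pointwise_loss(scores: List[float], labels: List[float]) -> float:
    # Point-wise RM: regress each candidate's score toward its own quality label (MSE),
    # treating candidates independently of one another.
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def listwise_loss(scores: List[float], best_index: int) -> float:
    # List-wise RM: softmax cross-entropy that pushes the preferred candidate above its
    # siblings, modeling the ordering within one candidate list rather than absolute values.
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[best_index] / sum(exps))

# Toy usage: three candidate explanations scored by the reward model; index 0 is preferred.
print(pointwise_loss([0.8, 0.3, 0.1], [1.0, 0.0, 0.0]))
print(listwise_loss([0.8, 0.3, 0.1], best_index=0))
```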
6.2. Online A/B Test Results (from Table 6)
Online A/B tests on Taobao's "Guess What You Like" scenario over two weeks demonstrate that RecGPT-V2 consistently outperforms RecGPT-V1 across both item and feed scenarios, validating its commercial viability.
The following are the results from Table 6 of the original paper:
| Scenario | IPV | CTR | TV | GMV | ATC | NER | LT-14 | LT-30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Item | +3.64 | +3.01 | +2.11 | +3.39 | +3.47 | +11.46 | – | – |
| Feed | +1.29 | +1.50 | +0.34 | +1.53 | +0.99 | +4.49 | +0.04 | +0.05 |

Note: "–" indicates metrics not applicable in the item scenario. IPV, CTR, TV, GMV, and ATC are short-term engagement metrics; NER, LT-14, and LT-30 are long-term retention metrics. The values are relative percentage improvements over RecGPT-V1.
Item Scenario:
- Short-Term Engagement:
  - IPV: +3.64%
  - CTR: +3.01%
  - TV: +2.11%
  - GMV: +3.39%
  - ATC: +3.47%
  These robust gains indicate that RecGPT-V2's enhanced intent understanding directly translates into increased user interaction and transaction value when items are directly recommended.
- Long-Term Retention:
  - NER: +11.46%
  - LT-14 and LT-30 are not reported for the item scenario.
  The significant increase in Novelty Exposure Rate (NER) suggests improved recommendation diversity and exploration effectiveness, mitigating filter bubble effects.
Feed Scenario:
- Short-Term Engagement:
- IPV: +1.29%
- CTR: +1.50%
- TV: +0.34%
- GMV: +1.53%
- ATC: +0.99%
  Consistent positive trends in CTR and GMV indicate improved recommendation relevance even in a mixed-content feed, although the gains are smaller than in the item scenario, likely due to the more complex and diverse nature of feed content.
- Long-Term Retention:
- NER: +4.49%
- LT-14: +0.04%
- LT-30: +0.05%
The NER gain further confirms improved novelty. While user retention rates (LT-14, LT-30) show modest absolute gains, such shifts are meaningful in the context of long-term user behavior on large platforms, representing real progress in sustained user engagement.
Overall Summary: RecGPT-V2's comprehensive innovations lead to tangible improvements across various online metrics, affirming its effectiveness and practical utility in a real-world industrial setting. The particularly strong gains in NER highlight its ability to foster user exploration, a key aspect of healthy recommender systems.
6.3. Case Study
A case study (Figure 9) further illustrates RecGPT-V2's strengths in dynamic intent understanding and context-aware recommendation generation.
The following figure (Figure 9 from the original paper) illustrates a real-world case:
Figure 9: An illustrative diagram showing multiple recommended items and their descriptions, including a sweater, children's products, and dumbbells, with rich visual information and organization. The Decision Arbiter icon in the figure highlights the agentic structure of the recommendation system.
- User Profile: Female, 35, Tianjin, with a compressed behavioral history.
- Environmental Context: Cooling weather, upcoming Mid-Autumn Festival and Halloween.
- Global Planner: Decomposes these signals into three personas: Ladies' Fashion Expert, Kids' Products Expert, and Health Expert. This shows the HMAS effectively processing hybrid context and decomposing intent.
- Distributed Experts:
  - Ladies' Fashion Expert: Predicts "Wool Blend Cardigan" (responding to cooling weather).
  - Kids' Products Expert: Generates "Kids' Hydrating Lotion" (for the dry autumn climate) and "Kids' Halloween Costume" (anticipating the holiday), demonstrating temporal adaptation.
  - Health Expert: Recommends "Adjustable Dumbbell Set" (aligning weather-driven wellness with historical fitness interests).
- Decision Arbiter: Synthesizes these expert predictions.
- Contextually Adaptive Explanations: Final items are paired with explanations like "Wrapped in Autumn Sunshine," "Quench Your Little One's Skin," and "All You Need is Dumbbells," generated by the meta-prompting framework. These explanations are personalized and context-aware.

This case study vividly validates RecGPT-V2's core capabilities: integrating real-time environmental signals into hierarchical multi-agent reasoning for diverse intent coverage and precise situational adaptation, moving beyond static behavioral pattern matching.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively introduces RecGPT-V2, a sophisticated agentic framework that significantly advances LLM-powered recommender systems. It systematically addresses the four major limitations of RecGPT-V1: computational inefficiency, lack of explanation diversity, limited generalization, and sub-optimal evaluation. Through its four key innovations—Hierarchical Multi-Agent System with Hybrid Representation Inference, Meta-Prompting for dynamic explanations, Constrained Reinforcement Learning, and an Agentic Judge Framework—RecGPT-V2 achieves remarkable technical and commercial success. The framework not only reduces GPU consumption by a substantial 60% and improves core recommendation metrics like recall and tag prediction accuracy but also enhances user experience through diverse and contextually adaptive explanations. Validated through extensive online A/B tests on Taobao, RecGPT-V2 demonstrates significant lifts in key business metrics (e.g., +2.98% CTR, +11.46% NER), establishing the practical viability and scalability of deploying LLM-based intent reasoning in real-world industrial environments. It successfully bridges the gap between cognitive exploration in research and concrete industrial utility.
7.2. Limitations & Future Work
The authors themselves point out a key direction for future work:
- End-to-End Joint Optimization of Multi-Agent Collaboration with Reinforcement Learning: The paper suggests exploring how to jointly optimize multi-agent collaboration in an end-to-end fashion using reinforcement learning. Currently, the Planner, Experts, and Arbiter components may be optimized somewhat independently or sequentially. End-to-end optimization could yield greater synergy and emergent behaviors among agents, further enhancing recommendation performance and user experience by allowing the entire system to learn from global rewards.
7.3. Personal Insights & Critique
This paper presents a highly impressive and comprehensive approach to integrating LLMs into recommender systems at an industrial scale. The rigor in addressing practical challenges like computational efficiency and real-world deployability is particularly commendable.
- Bridging Research and Industry: The most significant contribution of RecGPT-V2 is its successful deployment and validation on Taobao. This provides strong evidence that advanced LLM-based reasoning, often seen as computationally intensive, can be made feasible and commercially impactful in a high-traffic, low-latency environment. The detailed Infrastructure Engineering Optimization and Hybrid Representation Inference are critical for this.
- Holistic Problem Solving: Instead of focusing on a single aspect, the paper tackles multiple facets of LLM-powered recommendation (efficiency, generation quality, and evaluation) in an integrated manner. The Flywheel Effect idea is a powerful conceptual framework for continuous improvement.
- Sophisticated Agentic Design: The Hierarchical Multi-Agent System is a clever way to manage the complexity and redundancy of LLM reasoning. Decomposing user intent into specialized personas for Distributed Experts, orchestrated by a Global Planner and consolidated by a Decision Arbiter, is an elegant solution that ensures both breadth and focus.
- Addressing Reward Conflicts: The Constrained Reward Shaping mechanism is a crucial methodological innovation. Multi-objective optimization is notoriously difficult due to reward conflicts, and providing a stable method to pursue diverse objectives without sacrificing primary performance is a significant step forward, applicable beyond recommender systems.
- Human-Aligned Evaluation: The Agent-as-a-Judge framework, with its multi-step, process-oriented evaluation, marks a move toward more trustworthy and interpretable AI systems. Moving beyond simple outcome prediction to mimic human reasoning in evaluation is vital for building systems that truly meet human quality standards.
Potential Issues/Areas for Improvement:
- Interpretability of Agent Interactions: While the HMAS structure is beneficial for organization, the internal decision-making of the Global Planner (how it generates personas), the Distributed Experts (how they generate tags from a persona), and the Decision Arbiter (how it synthesizes) still relies on LLM reasoning. Further work could explore more transparent and interpretable mechanisms for agent interaction and decision aggregation, providing deeper insight into the "why" behind recommendations.
- Cost of GPT-4 / Powerful LLMs: The methodology mentions using GPT-4 for QA pair generation and for identifying relevant item categories in SFT. While this ensures high-quality supervision, relying on such powerful (and typically proprietary, API-based) LLMs can be a significant cost factor in the data-generation pipeline, especially if the volume of required training data is massive. The paper emphasizes the efficiency of the inference model, but training-data generation costs could be a practical limitation for researchers or smaller companies without comparable budgets.
- Generalizability of Hybrid Representation: While the adaptor network is designed for generalization, the effectiveness of atomized entity compression across vastly different domains (e.g., medical data, legal documents) and languages needs further investigation; the compression ratio and degree of semantic preservation could vary.
- Long-Term A/B Test Horizon: While 14-day and 30-day retention are good indicators, long-term user behavior can take longer to show significant shifts, especially for metrics related to loyalty or habit formation. Longer-duration A/B tests (e.g., 3-6 months) could provide even more robust evidence of sustained impact, though these are challenging to run in industrial settings.

Overall, RecGPT-V2 is a landmark paper that provides a clear blueprint for building highly efficient, effective, and intelligent LLM-powered recommender systems. Its innovations are likely to influence future research and development in this rapidly evolving field.