GPR: Towards a Generative Pre-trained One-Model Paradigm for Large-Scale Advertising Recommendation


TL;DR Summary

This paper presents GPR (Generative Pre-trained Recommender), a novel framework that redefines advertising recommendation as an end-to-end generative task, overcoming issues of objective misalignment and error propagation while enhancing semantic alignment and modeling consistency across heterogeneous data.

Abstract

As an intelligent infrastructure connecting users with commercial content, advertising recommendation systems play a central role in information flow and value creation within the digital economy. However, existing multi-stage advertising recommendation systems suffer from objective misalignment and error propagation, making it difficult to achieve global optimality, while unified generative recommendation models still struggle to meet the demands of practical industrial applications. To address these issues, we propose GPR (Generative Pre-trained Recommender), the first one-model framework that redefines advertising recommendation as an end-to-end generative task, replacing the traditional cascading paradigm with a unified generative approach. To realize GPR, we introduce three key innovations spanning unified representation, network architecture, and training strategy. First, we design a unified input schema and tokenization method tailored to advertising scenarios, mapping both ads and organic content into a shared multi-level semantic ID space, thereby enhancing semantic alignment and modeling consistency across heterogeneous data. Second, we develop the Heterogeneous Hierarchical Decoder (HHD), a dual-decoder architecture that decouples user intent modeling from ad generation, achieving a balance between training efficiency and inference flexibility while maintaining strong modeling capacity. Finally, we propose a multi-stage joint training strategy that integrates Multi-Token Prediction (MTP), Value-Aware Fine-Tuning and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm, forming a complete generative recommendation pipeline that unifies interest modeling, value alignment, and policy optimization. GPR has been fully deployed in the Tencent Weixin Channels advertising system, delivering significant improvements in key business metrics including GMV and CTCVR.

In-depth Reading

1. Bibliographic Information

1.1. Title

GPR: Towards a Generative Pre-trained One-Model Paradigm for Large-Scale Advertising Recommendation

1.2. Authors

Jun Zhang, Yi Li, Yue Liu, Chao Wang, Yuan Wan, Yuliang Xiong, Xun Lu, Hao Wu, Jian Li, En Zhang, Jiawei Sun, Xin Xu, Zishi Zhan, Ruoxue Liu, Shihao Zhang, Zhaoxin Zhang, Zhengkai Guo, Shuojin Yang, Meng-Hao Guo, Huan Yu, Jie Jiang, Shi-Min Hu

Their affiliations include Tencent Inc., China, and Tsinghua University, China. This indicates a collaboration between a major technology company with significant industrial advertising systems and a prestigious academic institution, suggesting a focus on both practical deployment and theoretical rigor.

1.3. Journal/Conference

The paper is published as an arXiv preprint. Original source link: https://arxiv.org/abs/2511.10138. PDF link: https://arxiv.org/pdf/2511.10138v1.pdf

As an arXiv preprint, the paper has been posted to arXiv, a free open-access archive for scientific preprints. While this enables early dissemination of research, the paper has not yet undergone formal peer review for publication in a conference or journal. That said, arXiv is a highly influential platform in machine learning and AI, and many seminal papers first appeared there.

1.4. Publication Year

2025 (published on arXiv on 2025-11-13)

1.5. Abstract

Advertising recommendation systems are critical infrastructure in the digital economy, connecting users with commercial content. However, current multi-stage systems face challenges like objective misalignment and error propagation, hindering global optimality. Existing unified generative models also struggle with industrial application demands. To overcome these issues, the authors propose GPR (Generative Pre-trained Recommender), an innovative one-model framework that reframes advertising recommendation as an end-to-end generative task, moving away from the traditional cascading pipeline. GPR introduces three main innovations: (1) a unified input schema and tokenization method for advertising scenarios, mapping diverse content into a shared multi-level semantic ID space for enhanced alignment; (2) the Heterogeneous Hierarchical Decoder (HHD), a dual-decoder architecture that separates user intent modeling from ad generation, balancing training efficiency and inference flexibility; and (3) a multi-stage joint training strategy comprising Multi-Token Prediction (MTP), Value-Aware Fine-Tuning, and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm, which unifies interest modeling, value alignment, and policy optimization. GPR has been successfully deployed in Tencent Weixin Channels advertising system, demonstrating significant improvements in key business metrics such as Gross Merchandise Volume (GMV) and CTCVR (Click-Through Conversion Rate), proving its strong competitiveness against highly optimized traditional systems.

2. Executive Summary

2.1. Background & Motivation

Core Problem

The core problem GPR aims to solve lies in the limitations of existing advertising recommendation systems.

  1. Multi-Stage Cascading Pipeline Issues: Traditional systems typically follow a "retrieval-pre-ranking-ranking" pipeline. This architecture suffers from:
    • Objective Misalignment: Different stages optimize for different objectives (e.g., retrieval for coverage, ranking for business outcome prediction), preventing global optimality.
    • Error Propagation: Errors or suboptimal decisions made in early stages (e.g., retrieval) propagate through the pipeline, prematurely eliminating potentially high-quality candidates and creating an information bottleneck for later stages.
    • Engineering Complexity: Such pipelines require complex engineering to maintain cross-stage consistency, slowing down algorithmic iteration and scalability.
  2. Struggles of Unified Generative Models: While generative recommendation models have shown promise, they still face significant challenges in meeting the demands of large-scale industrial advertising applications, particularly concerning:
    • Extreme Heterogeneity in Data and Behavior: Advertising platforms integrate diverse content (ads, short videos, social feeds, news) and user behaviors (clicks, conversions, views, reads), leading to noisy, complex data distributions that are difficult for models to represent and align semantically.
    • Efficiency-Flexibility Trade-off: Industrial systems need both efficient training for vast datasets and flexible inference to handle ultra-long user sequences and various constraints (targeting, bidding, budget) in real-time. Existing generative architectures (decoder-only vs. encoder-decoder) struggle to balance these.
    • Revenue and Multi-stakeholder Value Optimization: Advertising systems must optimize for multiple stakeholders (user experience, advertiser ROI, platform revenue). Existing pre-training methods often optimize simplified, single objectives in isolation, leading to objective misalignment and local optimality, failing to maximize overall ecosystem value.

Importance of the Problem

Advertising recommendation systems are a vital component of the digital economy, acting as intelligent infrastructure that connects users with commercial content. Their effectiveness directly impacts information flow and value creation. The scale of these systems is immense, serving hundreds of millions of users and tens of millions of dynamic ads under strict real-time, low-latency requirements. The timeliness and stability of their performance are critical to a multi-billion-dollar ecosystem. Achieving a dynamic balance among user experience, advertiser ROI, and platform revenue is a key challenge. Solving these problems promises significant economic and user experience benefits.

Paper's Entry Point and Innovative Idea

The paper's innovative idea is to redefine advertising recommendation as an end-to-end generative task, replacing the traditional cascading paradigm with a unified generative approach within a "one-model" framework called GPR. This shift aims to achieve global optimality by ensuring consistent optimization objectives and eliminating error propagation inherent in multi-stage systems. GPR specifically tackles the identified challenges through systematic innovations in representation, architecture, and training.

2.2. Main Contributions / Findings

Primary Contributions

The paper's primary contributions are the development and deployment of GPR (Generative Pre-trained Recommender), which is presented as the first end-to-end generative advertising recommendation framework successfully deployed in a large-scale real-world advertising system. To achieve this, GPR introduces three key innovations:

  1. Unified Input Schema and Tokenization:

    • GPR designs a novel unified input schema using four token types (User Token, Organic Token, Environment Token, Item Token) to represent the entire user journey, addressing extreme data heterogeneity and ultra-long user sequences.
    • It introduces RQ-Kmeans+, a new quantization model that maps both ads and organic content into a shared multi-level semantic ID space. This significantly improves codebook utilization efficiency, resolves "codebook collapse," and enhances semantic alignment and modeling consistency across heterogeneous data.
  2. Heterogeneous Hierarchical Decoder (HHD) Architecture:

    • GPR proposes HHD, a dual-decoder-based generative architecture comprising a Heterogeneous Sequence-wise Decoder (HSD), a Progressive Token-wise Decoder (PTD), and a Hierarchical Token-wise Evaluator (HTE).
    • This hierarchical structure decouples user intent modeling (by HSD) from ad generation (by PTD), allowing for finer-grained interest representation and more accurate recommendations.
    • HHD balances training efficiency and inference flexibility, integrating trie constraints, value guidance, and efficient multi-stage pruning during decoding to improve accuracy and reliability.
  3. Multi-Stage Joint Training Strategy:

    • GPR proposes a comprehensive training pipeline that integrates Multi-Token Prediction (MTP) for pre-training, Value-Aware Fine-Tuning (VAFT), and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm for post-training reinforcement learning.
    • MTP captures global, multi-interest user patterns under sparse signals.
    • VAFT aligns optimization with business priorities by reweighting updates towards higher-value items.
    • HEPO (with Anticipatory Request Rehearsal (ARR)) enables exploration beyond logged exposures, surfaces under-served high-value candidates, and addresses the credit assignment problem in hierarchical decoding by introducing process rewards and handling distribution shift.

Key Conclusions / Findings

The key conclusions and findings of the paper are:

  • Feasibility and Superiority of One-Model Generative Paradigm: GPR successfully demonstrates that an end-to-end generative "one-model" framework can effectively replace traditional multi-stage advertising recommendation systems, overcoming their inherent limitations of objective misalignment and error propagation to achieve global optimality.
  • Effective Handling of Industrial Challenges: The systematic innovations in GPR (unified representation, HHD architecture, and multi-stage training) effectively address real-world industrial challenges such as extreme data heterogeneity, the efficiency-flexibility trade-off, and the complex multi-stakeholder value optimization required in large-scale advertising.
  • Significant Business Impact in Production: GPR has been fully deployed in the Tencent Weixin Channels advertising system. Large-scale online A/B testing results show significant improvements in key business metrics, including Gross Merchandise Volume (GMV) and CTCVR (Click-Through Conversion Rate). This empirically validates GPR's strong competitiveness against highly optimized and mature cascading systems.
  • Enhanced Cold-Start Handling and User Engagement: Stratified analysis reveals consistent gains across various user activity groups and particularly strong performance for newly launched ads, indicating improved cold-start handling and better allocation towards higher-value ads even for high-activity users.

3. Prerequisite Knowledge & Related Work

This section provides foundational concepts and context from related research to fully understand the GPR paper.

3.1. Foundational Concepts

Recommendation Systems

Recommendation systems are information filtering systems that predict what a user might like. Their core task is to match users with relevant items (products, movies, news, ads) based on their past behaviors and preferences.

  • Traditional Multi-stage Pipeline: Many industrial recommendation systems, especially in advertising, adopt a multi-stage cascading architecture. This typically involves:
    1. Retrieval: Selects a large pool of relevant candidates from a vast item corpus, focusing on broad coverage. Often uses simpler models (e.g., collaborative filtering, item-to-item similarity) for speed.
    2. Pre-ranking: Narrows down the retrieved candidates to a smaller, more manageable set using slightly more complex models, focusing on initial relevance.
    3. Ranking: The final stage, where a sophisticated deep learning model ranks the remaining candidates based on predicted user engagement (e.g., click-through rate, conversion rate) and business objectives (e.g., revenue). This stage involves deep feature interactions.

Generative Models

Generative models are a class of artificial intelligence models that learn the underlying distribution of a dataset and can generate new, similar data instances. In the context of recommendation, a generative recommender learns to directly generate a sequence of item IDs or their representations given a user's history, rather than merely predicting a score for existing items. This contrasts with discriminative models, which typically predict a score or probability for a given input.

Large Language Models (LLMs) and Transformers

Large Language Models (LLMs) are neural networks, often based on the Transformer architecture, trained on massive amounts of text data. They excel at understanding, generating, and processing human language.

  • Transformer Architecture: Introduced by Vaswani et al. (2017), the Transformer is a neural network architecture that revolutionized sequence modeling. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element, capturing long-range dependencies efficiently without recurrence or convolutions.
    • Attention Mechanism: The core of the Transformer. It computes a weighted sum of input values, where the weights are determined by the similarity between a query and a set of keys.
    • Formula: The scaled dot-product attention function is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings through linear transformations. $Q$ represents what we are looking for, $K$ represents what is available, and $V$ represents the information associated with $K$.
      • $d_k$ is the dimension of the keys and queries, used as a scaling factor to prevent large dot products from pushing the softmax function into regions with very small gradients.
      • $QK^T$ computes similarity scores between each query vector and all key vectors.
      • $\mathrm{softmax}$ normalizes these scores into a probability distribution.
      • The result is a weighted sum of the value vectors (a minimal numerical sketch follows this list).
  • Decoder-only Architecture: A type of Transformer that processes input sequences and generates output sequences autoregressively (one token at a time), typically using a causal mask to prevent attention to future tokens. Often used in LLMs for text generation.
  • Encoder-Decoder Architecture: Another Transformer variant where an encoder processes the input sequence to create a representation, and a decoder then uses this representation to generate the output sequence. This allows for more flexible interaction between input and output.
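To make the scaled dot-product attention formula concrete, the following is a minimal NumPy sketch (not from the paper; the shapes, toy inputs, and the optional causal mask are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (T_q, T_k) similarity scores
    if causal:
        # -inf above the diagonal so position i cannot attend to any j > i
        mask = np.triu(np.full(scores.shape, -np.inf), k=1)
        scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of the values

# Toy example: 4 tokens, dimension 8, self-attention with a causal mask
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```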

Vector Quantization (VQ)

Vector Quantization is a technique used to map continuous vector embeddings into discrete, finite "codebook" entries (semantic IDs). This is useful for generative models because they typically generate discrete tokens.

  • Codebook: A finite set of learnable vectors, often called "codewords" or "embeddings."
  • Quantization Process: For a given input vector, the closest codeword in the codebook is selected, and its index (semantic ID) is returned.
  • RQ-VAE (Residual Quantized Variational Autoencoder): A method that combines VQ with VAEs. A Variational Autoencoder (VAE) is a generative model that learns a latent representation of data by encoding inputs into a probability distribution (mean and variance) and then decoding samples from that distribution to reconstruct the input.
    • VAE Loss Function: Typically consists of a reconstruction loss (to ensure decoded output is similar to input) and a KL divergence loss (to regularize the latent space to be close to a prior distribution). $ \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \,\|\, p(z)) $ Where:
      • $x$ is the input data.
      • $z$ is the latent variable.
      • $q_{\phi}(z|x)$ is the encoder's approximate posterior distribution of $z$ given $x$, parameterized by $\phi$.
      • $p_{\theta}(x|z)$ is the decoder's likelihood distribution of $x$ given $z$, parameterized by $\theta$.
      • $p(z)$ is the prior distribution over the latent variables (often a standard normal distribution).
      • $D_{KL}$ is the Kullback-Leibler divergence, measuring the difference between two probability distributions.
    • In RQ-VAE, quantization is often applied in a residual manner, where multiple codebooks are used sequentially to refine the representation.
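As a rough illustration of residual quantization, the general idea behind RQ-VAE and RQ-Kmeans (not the paper's exact implementation), the sketch below maps one item embedding to a coarse-to-fine sequence of semantic IDs by quantizing the residual at each level; codebook sizes and dimensions are made up:

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Map a vector x to L semantic IDs via residual quantization.

    codebooks: list of (codebook_size, dim) arrays, one per level.
    Returns the chosen indices (one per level) and the reconstruction.
    """
    residual = x.copy()
    ids, recon = [], np.zeros_like(x)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to every codeword
        idx = int(np.argmin(dists))                    # nearest codeword at this level
        ids.append(idx)
        recon += cb[idx]
        residual -= cb[idx]                            # quantize what is left over
    return ids, recon

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]   # L = 3 levels
item_embedding = rng.normal(size=16)
semantic_ids, approx = residual_quantize(item_embedding, codebooks)
print(semantic_ids)  # a coarse-to-fine semantic ID path, e.g. [17, 203, 88]
```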

Reinforcement Learning (RL)

Reinforcement Learning is a paradigm where an agent learns to make decisions by interacting with an environment. The agent performs actions in states, receives rewards or penalties, and learns a policy (a mapping from states to actions) that maximizes cumulative reward over time.

  • Policy ($\pi$): The agent's strategy, defining the probability of taking an action in a given state.
  • Reward ($R$): A scalar feedback signal indicating the desirability of an action taken in a state.
  • Value Function ($V$): Predicts the expected cumulative reward from a given state (or state-action pair).
  • Advantage ($A$): Measures how much better an action is than the average action in a given state. $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the action-value function (expected return from taking action $a$ in state $s$).
  • Generalized Advantage Estimation (GAE): A method to estimate the advantage function in RL, balancing the bias-variance trade-off by combining $n$-step returns with value function bootstrapping.
  • Off-policy Learning: Learning a policy from data generated by a different policy (the "behavior policy"). Requires importance sampling for correction.

Beam Search

Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is often used in sequence generation (e.g., in LLMs) to find high-probability sequences without exploring all possible paths, which would be computationally infeasible. At each step, it keeps track of the top $k$ (beam width) most probable partial sequences and extends them.

Trie Tree

A Trie (prefix tree) is a tree-like data structure used to store a dynamic set of strings or words. Each node in the tree represents a prefix. It allows for efficient retrieval of words with a common prefix. In recommendation, it can be used to constrain generation to only valid or targeted items.
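To illustrate how a trie over valid items can constrain generation, here is a minimal sketch (illustrative only; in GPR the trie is built from per-user targeting rules, as described later for the Value-Guided Trie-based Beam Search):

```python
def build_trie(valid_paths):
    """Nested-dict trie: each key is one semantic ID at one level."""
    root = {}
    for path in valid_paths:
        node = root
        for code in path:
            node = node.setdefault(code, {})
    return root

def allowed_next_codes(trie, prefix):
    """Codes that keep the partial path inside the trie (i.e., still valid)."""
    node = trie
    for code in prefix:
        if code not in node:
            return []
        node = node[code]
    return list(node.keys())

# Three eligible ads, each identified by a 3-level semantic ID path
trie = build_trie([(1, 4, 9), (1, 4, 2), (3, 7, 5)])
print(allowed_next_codes(trie, (1,)))    # [4]
print(allowed_next_codes(trie, (1, 4)))  # [9, 2]
```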

3.2. Previous Works

The paper discusses related work across four main areas:

  1. LLMs as Generative Rankers:

    • Concept: Recent research explores using LLMs for recommendation by transforming user behavior sequences into text and having the LLM generate the next item based on this textual input [10, 17, 18, 29, 36, 38, 43].
    • Limitation (addressed by GPR): A core limitation is the fixed vocabulary of LLMs, which struggles to adapt to the dynamically changing and large-scale item sets typical of modern advertising. GPR addresses this with quantized semantic IDs.
  2. Generative Recommendation Models:

    • Concept: Driven by LLM success, there's a shift towards generative recommendation to unify tasks and handle long sequences [2, 4, 12, 14, 19, 22, 24, 28, 32, 34].
    • TIGER [24]: A generative recommendation model leveraging semantic IDs and a sequence-to-sequence framework.
    • HLLM [4]: Uses pre-trained LLMs with a two-layer architecture for item representations and user interests.
    • HSTU [34]: A decoder-only architecture designed to process ultra-long user histories and generate item recommendations, adopting an efficient attention mechanism. GPR builds upon HSTU blocks.
    • MTGR [9]: Adopts the HSTU architecture while integrating deep learning recommendation model (DLRM) features.
    • COBRA [33]: Integrates sparse semantic IDs and dense vectors through a cascading process.
    • GPR's Differentiation: GPR introduces the novel HHD architecture, which explicitly decouples user understanding from item generation via a hierarchical structure and an "understanding-thinking-refining-generation" paradigm, offering finer-grained control and accuracy compared to prior generative approaches that entangle these tasks.
  3. End-to-End Recommendation Frameworks:

    • Concept: Efforts to build more unified, end-to-end systems to overcome multi-stage limitations.
    • OneRec [39, 40]: Unifies retrieval and ranking using an encoder-decoder architecture and preference alignment (e.g., DPO) in video recommendation. OneRec-V2 [40] shifted to a lazy decoder-only stack with RL.
    • GPR's Differentiation: While OneRec is a notable end-to-end framework, GPR is highlighted as the first end-to-end generative solution successfully deployed in a large-scale advertising system. Advertising poses unique challenges: behavior heterogeneity and sparsity, stringent multi-objective optimization, and precise value prediction, which GPR specifically addresses.
  4. RL in Recommender Systems:

    • Traditional RL in RecSys: Used to optimize slate/page decisions and long-horizon value, moving beyond just next-item accuracy.
      • Seq2Slate [3]: Formulates re-ranking as autoregressive slate generation, trained via policy gradients to capture cross-position effects.
      • SlateQ [13]: Decomposes long-term slate value into item-wise terms, making Q-learning tractable for page-wise optimization.
      • DEAR [37]: Addresses ad insertion as a joint decision using DQN to optimize long-run value in ad-mixed feeds.
      • Conservative Q-Learning (CQL) [15]: An offline RL method that penalizes out-of-distribution actions to prevent overestimation of unseen choices.
    • RL for Generative Recommendation:
      • GeMS [6]: Learns a variational latent space for slates, enabling an RL agent to act in this continuous space.
      • PrefRec [31]: Learns a reward model from human/trajectory preferences, then optimizes policies against it.
      • Direct Preference Optimization (DPO) [23]: Adapts preference learning to sequence models, turning pairwise preferences into stable training losses that can replace or complement RL during post-training.
        • DPO Loss Formula: $ \mathcal{L}_{\mathrm{DPO}}(\pi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \left( \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right) \right] $ Where:
          • $\pi$: The policy being optimized.
          • $\pi_{\text{ref}}$: A reference policy, often the initial supervised policy, used to prevent the optimized policy from diverging too far from the initial learned behavior.
          • $\sigma$: The sigmoid function, $\sigma(x) = 1 / (1 + e^{-x})$.
          • $\beta$: A hyperparameter that controls the strength of the reward, effectively scaling the implicit reward difference.
          • $x$: The input prompt or context (e.g., user history).
          • $y_w$: The preferred response (e.g., a higher-value recommendation).
          • $y_l$: The dispreferred response (e.g., a lower-value recommendation).
          • $\mathcal{D}$: The dataset of preference triples $(x, y_w, y_l)$. (A small numerical sketch of this loss follows the list below.)
      • GPR's Differentiation: GPR specifically proposes HEPO with hierarchical process rewards and Anticipatory Request Rehearsal (ARR) to address credit assignment, high variance, and dynamic environment shifts in advertising, going beyond standard DPO or RL applications.
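For reference, a small NumPy sketch of the DPO loss above (generic, not tied to GPR's implementation; the log-probabilities in the toy batch are made-up numbers):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Implicit reward margin between preferred (w) and dispreferred (l) responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) = log(1 + exp(-margin)), averaged over the batch
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy batch of two preference pairs (policy vs. reference log-probabilities)
print(dpo_loss(np.array([-2.0, -1.5]), np.array([-3.0, -2.5]),
               np.array([-2.2, -1.8]), np.array([-2.9, -2.4])))
```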

3.3. Technological Evolution

The field has evolved from heuristic-based recommendation to complex multi-stage deep learning pipelines. The advent of Transformer architecture and LLMs spurred a new wave of generative recommendation models, aiming for unification and end-to-end optimization. Initially, LLMs were explored as text-based rankers, facing vocabulary limitations. Then, generative models focused on generating item IDs or sequences, often building on Transformer decoder-only or encoder-decoder architectures (e.g., TIGER, HSTU). The challenge then shifted to adapting these for industrial scale, addressing heterogeneity, efficiency, and complex business objectives. Reinforcement Learning has been integrated to optimize long-term value and explore beyond logged data, with techniques like DPO enhancing preference alignment. GPR represents a further step in this evolution by being the first successfully deployed end-to-end generative framework specifically for large-scale advertising, integrating innovations across representation, architecture, and a sophisticated multi-stage RL training strategy to handle its unique complexities.

3.4. Differentiation Analysis

Compared to the main methods in related work, GPR's core differences and innovations are:

  • Unified Generative Paradigm for Advertising: Unlike traditional multi-stage cascading systems, GPR is the first to propose and successfully deploy a one-model, end-to-end generative framework specifically for large-scale advertising recommendation. This directly addresses objective misalignment and error propagation inherent in cascades.
  • Unified Representation for Extreme Heterogeneity: GPR introduces a novel unified input schema and the RQ-Kmeans+ tokenizer. This is a significant innovation tailored for advertising's extreme data heterogeneity (ads mixed with organic content, diverse behaviors, multi-modal data), allowing all content to be mapped into a shared multi-level semantic ID space. Previous generative models might struggle with such diverse input or suffer from codebook collapse, which RQ-Kmeans+ explicitly resolves.
  • Decoupled & Hierarchical Decoding (HHD): GPR's Heterogeneous Hierarchical Decoder (HHD) offers a unique dual-decoder architecture that explicitly decouples user intent modeling (HSD) from ad generation (PTD). This "understanding-thinking-refining-generation" paradigm, combined with Hybrid Attention, Token-Aware FFN/LN, MoR, and external LLM knowledge, allows for finer-grained user interest understanding and more accurate item generation. It yields a balance between training efficiency and inference flexibility that encoder-decoder models (like OneRec) struggle to achieve due to the high training cost of pointwise losses, and that decoder-only models (like HSTU) lack on the decoding-flexibility side.
  • Comprehensive Multi-Stage, Value-Aligned Training: GPR introduces a sophisticated training strategy that goes beyond simple pre-training or DPO.
    • Multi-Token Prediction (MTP) explicitly captures multiple concurrent user interests, which is crucial for advertising where users often have diverse needs.
    • Value-Aware Fine-Tuning (VAFT) directly integrates business value (eCPM, action types) into the loss, resolving the misalignment of vanilla likelihood-based objectives.
    • Hierarchy Enhanced Policy Optimization (HEPO) with Anticipatory Request Rehearsal (ARR) addresses the complex RL challenges in advertising, such as credit assignment in hierarchical generation, high variance of rewards, and the need for proactive adaptation to dynamic environments, which is beyond what standard DPO or basic RL approaches offer. It uses process rewards and Z-score normalization for stability.
  • Industrial Deployment and Proven Performance: The paper emphasizes GPR's successful deployment in a large-scale production advertising system (Tencent Weixin Channels) and demonstrates significant, statistically robust improvements in core business metrics like GMV and CTCVR against a highly optimized existing system, which is a strong validation of its practical utility and robustness.

4. Methodology

The GPR (Generative Pre-trained Recommender) framework redefines advertising recommendation as an end-to-end generative task, moving away from traditional multi-stage pipelines. Its overall architecture is shown in Figure 2, integrating innovations in unified representation, network architecture, and training strategy.

4.1. Principles

The core idea of GPR is to unify the entire advertising recommendation process into a single generative model. This is driven by several key principles:

  1. Global Optimality: By treating the task as end-to-end generation, GPR aims to optimize a consistent objective across all stages, overcoming the objective misalignment and error propagation issues of multi-stage systems.
  2. Unified Understanding: To handle the extreme heterogeneity and ultra-long sequences in advertising, the model must develop a unified semantic understanding of both users and diverse content (ads, organic content) within a shared representation space.
  3. Hierarchical Generation: User understanding and item generation are complex. Decoupling these tasks into a hierarchical decoding process allows for finer-grained modeling of user intent and more accurate, constrained item generation.
  4. Value-Aligned Optimization: Advertising systems require maximizing multi-stakeholder value. The training process must explicitly align with business objectives, incorporating value awareness and policy optimization to go beyond simple likelihood maximization.
  5. Efficiency and Flexibility: Industrial deployment demands efficient training for large-scale data and flexible inference for real-time, constrained recommendations. The architecture and decoding strategy are designed to balance these.

4.2. Core Methodology In-depth

4.2.1. Input Schema and Processing

GPR is designed to process noisy, heterogeneous, and ultra-long user information from diverse real-world advertising platforms. To achieve this, it proposes a unified input schema that represents a user's entire journey using four types of tokens:

  • User Token (U-Token): Represents user attributes and preferences (e.g., age, gender, demographics, long-term interests).

  • Organic Token (O-Token): Encapsulates users' interactions with organic content, such as short videos, articles, or social feeds. This is crucial as organic content influences ad engagement.

  • Environment Token (E-Token): Encodes the immediate contextual information surrounding an advertisement request, such as device type, time of day, ad position, placement type, and privacy settings.

  • Item Token (I-Token): Represents an ad item with which the user has interacted in the past.

    These tokens are combined into a continuous sequence to capture the full user journey and context.

Quantization with RQ-Kmeans+: To align with generative model paradigms and extract crucial item information, GPR converts all contents in O-Token and items in I-Token into discrete semantic IDs. Popular methods like RQ-VAE and RQ-Kmeans suffer from "codebook collapse" (some codebook vectors are rarely used) and "insufficient robustness of the latent space."

To address these issues, GPR proposes RQ-Kmeans+, a new quantization model consisting of an encoder, residual codebooks, and a decoder. The overall architecture of RQ-Kmeans+ is shown in the following figure (Figure 3 from the original paper):

Figure 3: Overall Architecture of RQ-Kmeans+. The figure shows how multimodal semantic embeddings are combined with a codebook initialized by RQ-Kmeans, encoded via a vanilla RQ-VAE, and finally quantized through multiple residual codebooks.

The key innovations of RQ-Kmeans+ are:

  1. High-Quality Codebook Initialization: It first employs RQ-Kmeans to generate a high-quality codebook, which is then used as initialization weights for the RQ-Kmeans+ model. This helps mitigate "dead vectors" from random initialization.

  2. Adaptive Codebook Updates: The codebook is subsequently updated using the same loss function as RQ-VAE, enabling adaptation to the current learnable latent space.

  3. Residual Connection on Encoder Side: A residual connection is introduced on the encoder side to ensure that the output distribution remains close to the input distribution in the early stages of training. This accelerates convergence and stabilizes latent-space alignment.

    Ultimately, RQ-Kmeans+ significantly improves codebook utilization and maintains flexibility in the latent space, effectively resolving the codebook collapse problem.
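A rough sketch of the initialization idea behind RQ-Kmeans+ (step 1 above), simplified from the description and not the authors' code: run k-means on the residuals at each level to seed that level's codebook, which is then fine-tuned with RQ-VAE-style losses during training.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means returning the learned centroids."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return centers

def init_residual_codebooks(embeddings, levels=3, k=64):
    """RQ-Kmeans-style seeding: k-means on the residuals, level by level."""
    residuals = embeddings.copy()
    codebooks = []
    for _ in range(levels):
        cb = kmeans(residuals, k)
        codebooks.append(cb)
        idx = np.argmin(((residuals[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        residuals = residuals - cb[idx]   # pass the residual error to the next level
    return codebooks                      # used here as initialization weights only

emb = np.random.default_rng(1).normal(size=(2000, 16))
cbs = init_residual_codebooks(emb, levels=3, k=64)
print([cb.shape for cb in cbs])  # [(64, 16), (64, 16), (64, 16)]
```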

4.2.2. Heterogeneous Hierarchical Decoder (HHD)

HHD is a decoder-only generative architecture consisting of three modules: a Heterogeneous Sequence-wise Decoder (HSD) Module, a Progressive Token-wise Decoder (PTD) Module, and a Hierarchical Token-wise Evaluator (HTE) Module. This hierarchical structure is designed to decouple user behavior understanding from next item prediction, leading to finer-grained user preference understanding and more accurate item recommendations.

The overall architecture of GPR, highlighting the HHD, is shown in the following figure (Figure 2 from the original paper):

Figure 2: Overall Architecture of GPR. The figure shows the integration of heterogeneous platforms and content (including image, text, and video ads), the hierarchical decoder and the construction of user profiles, and highlights the noise generator (refining module) and the reinforcement module; the formula portions cover the generation and optimization process and path evaluation.

Heterogeneous Sequence-wise Decoder (HSD): The HSD module is the primary decoder. It stacks HSTU blocks (a type of Transformer block [34]) and takes the unified token sequence (U-Token, O-Token, E-Token, I-Token) as input to understand user actions and generate high-quality intent embeddings. GPR introduces several critical enhancements to the foundational HSTU block:

  1. Hybrid Attention Mechanism:

    • Unlike standard attention, HSD Attention incorporates an extra embedding, $U$, which adaptively modulates the attention weights. This allows HSD to focus more effectively on relevant user behaviors and attenuate less informative interactions.
    • A Hybrid Attention mask, $M^{\mathrm{hybrid}}$, is introduced. Within the prefix block (U/O/E-Tokens), tokens can see each other freely using bi-directional attention, allowing the model to fully exploit contextual interplay among prompt tokens to construct a comprehensive context prior to prediction. For I-Tokens (items to be generated or predicted), a vanilla causal mask is applied.
    • The Hybrid Attention mechanism can be expressed as: $ \mathrm{HybridAttn}(\cdot) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + M^{\mathrm{hybrid}}\right) V \odot U $ Where:
      • $Q$: Query matrix derived from the input embedding.
      • $K$: Key matrix derived from the input embedding.
      • $V$: Value matrix derived from the input embedding.
      • $d$: The dimension of the queries and keys, used for scaling.
      • $M^{\mathrm{hybrid}}$: The Hybrid Attention mask, which dictates which tokens can attend to which other tokens.
      • $U$: An adaptive embedding that modulates the attention weights, allowing the model to focus on relevant behaviors.
      • $\odot$: Element-wise multiplication.
    • The Hybrid Attention mask $M^{\mathrm{hybrid}}$ is defined as: $ M_{ij}^{\mathrm{hybrid}} = \begin{cases} 0, & \text{if } i \geq j \text{ or } X_i, X_j \in \{\mathrm{U/O/E\text{-}Token}\} \\ -\infty, & \text{otherwise} \end{cases} $ Where:
      • $M_{ij}^{\mathrm{hybrid}}$: The mask value at row $i$ and column $j$.
      • $X_i, X_j$: Tokens at position $i$ and $j$ in the input sequence.
      • 0: Indicates that token $i$ can attend to token $j$.
      • $-\infty$: Indicates that token $i$ cannot attend to token $j$. When added before softmax, this effectively zeros out the attention weight.
      • The condition $i \geq j$ enforces causal attention for the generative (I-Token) part, while the condition $X_i, X_j \in \{\mathrm{U/O/E\text{-}Token}\}$ allows bi-directional attention within the prefix block. (A small sketch of this mask construction follows the list below.)
  2. Token-Aware Normalization and Feed-Forward Networks (FFN): Different token types possess distinct characteristics. HSD assigns independent normalization layers and FFNs to each token type. This projects them into their own semantic subspaces, allowing the model to fully capture the semantic diversity of heterogeneous sequences and reduce cross-type interference.

  3. Mixture-of-Recursions (MoR): This mechanism [2] is employed to increase the model's effective depth and reasoning capacity without adding extra parameters by sharing weights across recurrent layers.

  4. External Knowledge Integration: To further enhance reasoning, the model incorporates external knowledge. A fine-tuned Large Language Model (LLM) generates a textual "thought process" about users' potential interests. This "thought process" is then tokenized and integrated into the intent embeddings, strengthening semantic understanding and reasoning.
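Below is a minimal sketch of constructing the hybrid mask as reconstructed above (assuming a boolean flag marking which positions are prefix U/O/E-Tokens; this is an illustration, not the paper's implementation):

```python
import numpy as np

def hybrid_attention_mask(is_prefix):
    """Additive mask: 0 where attention is allowed, -inf where it is blocked.

    is_prefix: boolean array of length T, True for U/O/E-Tokens (prompt block),
               False for I-Tokens (generative block).
    """
    T = len(is_prefix)
    mask = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            both_prefix = is_prefix[i] and is_prefix[j]
            if j > i and not both_prefix:   # future position outside the prefix block
                mask[i, j] = -np.inf
    return mask

# 3 prefix tokens (U/O/E) followed by 2 item tokens
print(hybrid_attention_mask(np.array([True, True, True, False, False])))
```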

Progressive Token-wise Decoder (PTD): The PTD module acts as the secondary decoder. Guided by the intent embeddings generated by the HSD, it adapts a traditional Transformer decoder architecture to generate the target item. To address potential redundancy in intent embeddings, PTD adopts a novel "Thinking-Refining-Generation" paradigm for predicting the next item's semantic ID.

  1. Thinking Tokens: The PTD first utilizes a cross-attention mechanism where the intent embeddings serve as both keys and values. It is then compelled to generate $K$ thinking tokens (e.g., $K=4$). These tokens are designed to distill essential information and filter out irrelevant components from the intent embedding, serving as an implicit reasoning step.
  2. Refining Module: Inspired by research on LLM reasoning [21, 30], a refining module is integrated to strengthen cognitive and generative capacities. As shown in Figure 2 (c), this module is designed upon the diffusion paradigm [25]. It consists of a noise generator and a reverse process, modeled as a Markov chain. The reverse process iteratively removes noise using a conditional denoising module (with a Transformer architecture), conditioned on aggregated prefix thinking tokens (using Sum_Pooling). This module refines the initial reasoning results.
  3. Generation: Finally, leveraging both the thinking tokens and the refining token, the PTD generates a sequence of semantic codes (representing the target item). In inference, this generation process is further guided by a Trie-Constrained Value-Guided Beam Search (detailed in Section 4.2.3).

Hierarchical Token-wise Evaluator (HTE): Unlike traditional content recommendation, advertising systems must jointly optimize user engagement and platform revenue. This requires predicting multiple business metrics (e.g., click-through rate (CTR), conversion rate (CVR), effective cost per mille (eCPM)) for each candidate ad. These predictions are aggregated into a single scalar objective called final_value, which is the main optimization target.

The HTE module is an integrated value estimation module built upon the hierarchical GPR model. It combines item generation with value estimation, outputting estimated values for each semantic code and the final item. This integrated, end-to-end approach offers significant advantages:

  • Consistency: Enhances representation and objective consistency, mitigating conflicts between retrieval and ranking stages.
  • Efficiency: Improves overall computational efficiency by integrating value prediction directly into the generation process. Beyond inference, HTE subsequently serves as the critic model in reinforcement learning post-training, enabling value-based advantage estimation for policy optimization.

4.2.3. Value-Guided Trie-based Beam Search

The semantic codes predicted by PTD might lead to invalid (non-existent, geographically restricted, out-of-budget) or suboptimal items. Traditional beam search with post-filtering/ranking is computationally prohibitive. Therefore, GPR proposes Value-Guided Trie-based Beam Search to improve inference efficiency and performance.

This method integrates trie constraints and value estimations directly into the decoding step for early prefix evaluation:

  1. Dynamic Beam Width Adjustment: Given the predicted values for each semantic code by HTE, the algorithm dynamically adjusts the beam width. Larger predicted values correspond to a wider beam for the next semantic code, prioritizing paths with higher potential revenue.
  2. Trie Tree Pruning: The search space is pruned via a Trie Tree generated by current user profiles. This Trie Tree is constructed by applying user targeting strategies (e.g., age, gender) in the advertising system. It contains only candidate items consistent with the user's attributes, enabling early user-level targeted filtering, thus preventing the generation of invalid or irrelevant items.
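The following simplified sketch illustrates the combination of the two mechanisms above, trie pruning and value-guided beam widening (the `value_fn` scores, beam widths, and threshold are placeholder assumptions; in production the scores come from HTE and the trie from user targeting):

```python
def value_guided_trie_beam_search(trie, value_fn, levels, base_width=2, max_width=4):
    """Expand semantic-ID prefixes level by level, pruned by a trie and
    widened for prefixes with high estimated value."""
    beams = [((), 0.0)]                      # (prefix, accumulated value score)
    for _ in range(levels):
        expanded = []
        for prefix, score in beams:
            node = trie
            for code in prefix:
                node = node[code]
            for code in node:                # trie pruning: only legal next codes
                new_prefix = prefix + (code,)
                expanded.append((new_prefix, score + value_fn(new_prefix)))
        expanded.sort(key=lambda b: b[1], reverse=True)
        # dynamic beam width: keep more beams when the best estimated value is high
        width = max_width if expanded and expanded[0][1] > 1.0 else base_width
        beams = expanded[:width]
    return beams

trie = {1: {4: {9: {}, 2: {}}}, 3: {7: {5: {}}}}              # valid ID paths
value_fn = lambda prefix: 0.6 if prefix[0] == 1 else 0.3       # stand-in for HTE scores
print(value_guided_trie_beam_search(trie, value_fn, levels=3))
```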

4.2.4. Multi-Stage Training

The GPR model is trained using a three-stage regimen specifically designed for advertising recommendation, addressing sparse signals, multiple business objectives, and a dynamic item space. The full training pipeline is shown in the following figure (Figure 4 from the original paper):

Figure 4: Training Pipeline of GPR. The figure shows the multi-stage joint training strategy, including Multi-Token Prediction (MTP), Value-Aware Fine-Tuning, and reinforcement learning with HEPO, through which GPR integrates user intent modeling and ad generation.

1. Pre-training with Multi-Token Prediction (MTP): The pre-training stage injects advertising-scenario knowledge into GPR, focusing on capturing global, multi-interest user patterns despite sparse interaction signals.

  • Data: A large-scale industrial corpus from Tencent's advertising platform (one year, hundreds of millions of anonymized users), comprising both ad interactions (impressions, clicks, conversions) and organic engagements.
  • Sequence Construction: For each user, a chronological sequence is constructed using the unified four-token schema (U/O/E/I-Token). Items (ads) are encoded as $L$ coarse-to-fine semantic codes via residual vector quantization, providing a hierarchical, compact representation.
  • MTP Objective: While next-token prediction (NTP) assumes a dominant interest, MTP addresses users' multiple concurrent interests. As shown in Figure 4 (a), GPR extends the decoder with $N$ parallel heads (default $N = 4$). Each head independently predicts a complete $L$-level code path for one interest dimension, sharing the same backbone states but using separate projection layers. This enables concurrent modeling without mutual interference and preserves level-wise legality via masked decoding on each head.
  • Pre-training Loss: The pre-training objective aggregates per-head, per-level likelihoods using simplex-constrained head weights $\omega_j^H$ (where $\sum_j \omega_j^H = 1$), which are adaptively tuned to prioritize high-quality interest threads. $ L_{\mathrm{MTP}} = - \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{\ell=1}^{L} \omega_j^H \cdot \log P_j \left( I_{j,t,\ell} \mid S, C, I_{j,t,1:\ell-1} \right) $ Where:
    • $N$: Number of parallel heads in MTP (e.g., 4).
    • $T$: Length of the token sequence.
    • $L$: Number of semantic code levels for an item.
    • $j$: Index for the parallel heads, from 1 to $N$.
    • $t$: Position in the sequence, from 1 to $T$.
    • $\ell$: Level of semantic code, from 1 to $L$.
    • $\omega_j^H$: Simplex-constrained weight for head $j$, indicating its importance, initialized as $1/N$.
    • $P_j(I_{j,t,\ell} \mid S, C, I_{j,t,1:\ell-1})$: The masked conditional probability of emitting the $\ell$-th semantic code $I_{j,t,\ell}$ by head $j$ at position $t$, conditioned on the sequence history $S$, contextual features $C$, and previously emitted codes $I_{j,t,1:\ell-1}$ by the same head for the same item.
    • The loss yields a backbone encoding broad, disentangled interest structures. (A schematic computation of this loss is sketched below.)
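A schematic NumPy computation of the MTP loss above (head count, sequence length, levels, and the stand-in probabilities are illustrative assumptions):

```python
import numpy as np

def mtp_loss(probs, head_weights):
    """L_MTP = -sum_j sum_t sum_l w_j^H * log P_j(I_{j,t,l} | ...).

    probs: array of shape (N_heads, T, L) holding the probability each head
           assigns to the ground-truth semantic code at every position and level.
    head_weights: simplex-constrained weights w_j^H, shape (N_heads,).
    """
    log_p = np.log(probs)                              # (N, T, L)
    per_head = log_p.sum(axis=(1, 2))                  # sum over positions and levels
    return float(-(head_weights * per_head).sum())

rng = np.random.default_rng(0)
N, T, L = 4, 6, 3
probs = rng.uniform(0.2, 0.9, size=(N, T, L))          # stand-in model probabilities
w_head = np.full(N, 1.0 / N)                           # initialized to 1/N, sums to 1
print(mtp_loss(probs, w_head))
```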

2. Value-Aware Fine-Tuning (VAFT): This stage bridges the gap between multi-interest pre-training and monetization goals by injecting action value and eCPM awareness into the MTP framework. It ensures the model prioritizes high-value ads while preserving relevance.

  • Problem with Vanilla MTP: It assigns equal loss weight to ads with vastly different economic values (low-eCPM long-tail items can dominate gradients) and treats action types (impression, click, conversion) uniformly, ignoring their hierarchical business value (conversion > click > impression).
  • Solution: GPR introduces a per-head, per-position weight $\omega_{j,t}^V$ that encodes business value by combining the action type and the ad's eCPM, as shown in Figure 4 (b). This weight differentiates actions based on their value hierarchy and scales with a normalized eCPM to avoid magnitude distortion.
  • Value-Aligned MTP Loss ($L_{\mathrm{eCPM\text{-}MTP}}$): This loss multiplies the head importance ($\omega_j^H$) and the action/eCPM weight ($\omega_{j,t}^V$): $ L_{\mathrm{eCPM\text{-}MTP}} = - \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{\ell=1}^{L} \left( \omega_j^H \omega_{j,t}^V \right) \log P_j \left( I_{j,t,\ell} \mid S, C, I_{j,t,1:\ell-1} \right) $ Where:
    • $\omega_j^H$: Head-level interest quality weight from pre-training.
    • $\omega_{j,t}^V$: Position-level business value weight for head $j$ at position $t$.
    • All other symbols are as defined for $L_{\mathrm{MTP}}$.
    • The composite weight $(\omega_j^H \omega_{j,t}^V)$ biases updates toward high-eCPM actions while preserving multi-interest coverage, yielding stable gradients and aligning with revenue objectives.
  • Weight Denominators for $\omega_{j,t}^V$ (Monotonic Transforms):
    • Impression ($i=1$): Denominator $= 1$, so $\omega_{j,t}^V \propto \mathrm{eCPM}$ (basic revenue contribution).
    • Click ($i=2$): Denominator $= \mathrm{pCTR}$, so $\omega_{j,t}^V \propto \frac{\mathrm{eCPM}}{\mathrm{pCTR}}$ (rewards ads with high click quality, where $\mathrm{pCTR}$ is the predicted Click-Through Rate).
    • Conversion ($i=3$): Denominator $= \mathrm{pCTR} \times \mathrm{pCVR}$, so $\omega_{j,t}^V \propto \frac{\mathrm{eCPM}}{\mathrm{pCTR} \times \mathrm{pCVR}}$ (prioritizes ads driving actual conversions, where $\mathrm{pCVR}$ is the predicted Conversion Rate). This setup ensures $\omega_{j,t}^V$ aligns with the advertising business value hierarchy (conversion > click > impression). (A small sketch of these weights follows the list below.)
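A small sketch of the value weight $\omega_{j,t}^V$ following the monotonic transforms above (the eCPM normalizer `ecpm_norm` is a placeholder assumption, since the paper's exact normalization is not reproduced here):

```python
def value_weight(action, ecpm, pctr=None, pcvr=None, ecpm_norm=100.0):
    """Position-level business-value weight, larger for higher-value actions.

    action: 'impression', 'click', or 'conversion'.
    ecpm_norm: placeholder normalizer keeping weight magnitudes comparable.
    """
    ecpm = ecpm / ecpm_norm                     # normalized eCPM to avoid magnitude distortion
    if action == "impression":
        return ecpm                             # denominator 1
    if action == "click":
        return ecpm / pctr                      # rewards high click quality
    if action == "conversion":
        return ecpm / (pctr * pcvr)             # prioritizes actual conversions
    raise ValueError(f"unknown action type: {action}")

# conversion > click > impression for the same ad
print(value_weight("impression", ecpm=50.0))
print(value_weight("click", ecpm=50.0, pctr=0.05))
print(value_weight("conversion", ecpm=50.0, pctr=0.05, pcvr=0.1))
```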

3. Post-training with HEPO (Hierarchy Enhanced Policy Optimization): Supervised pre-training and fine-tuning rely on logged data, which provides limited action coverage (only sequences generated by the historical policy). This restricts the model to imitate past decisions. Reinforcement learning addresses this by using a high-fidelity simulation environment for counterfactual evaluation, allowing exploration of new high-value candidates.

  • RL Setup:
    • State ($s$): Includes user interaction history, contextual signals (device, time, scene), multi-level codes already emitted, and level-specific legal masks.
    • Action ($a$): A hierarchical decision produced by the $L$-level decoder (coarse to fine). At each level, the policy selects one quantized code from that level's legal candidate set. The final level resolves to the concrete ad.
    • Reward (R): Assigned at the final decoding step. Small shaping signals may be applied to earlier levels.
    • Episode: Corresponds to a single request or session.
  • Model Components in RL:
    • HSD: Produces intent embeddings $h = \mathrm{HSD}_{\theta}(s)$ from user context $s$.
    • PTD: Performs hierarchical decoding to generate action probabilities $\pi_{\theta}(z_{\ell})$ over semantic tokens $z_{\ell}$ at each level $\ell$. $\theta$ comprises the parameters of both HSD and PTD.
    • HTE: Serves as the value function $V_{\phi}$ in RL training, computing expected returns $v_{\ell} = V_{\phi}(s, z_{1:\ell-1})$, where $z_{1:\ell-1} = \{z_1, \dots, z_{\ell-1}\}$ denotes tokens selected at earlier levels. The value function operates on $\mathrm{stopgrad}(h)$ to maintain training stability by preventing value gradients from updating the backbone.

Reward Generation with Simulation Environment: Evaluating all generated candidates with a production ranking model in real-time is too costly. Instead, GPR uses a high-fidelity simulation environment that replicates the production serving system for offline reward evaluation.

  • Simulator Structure: Built upon production snapshots, preserving infrastructure (retrieval indices, feature processing, business constraints). It integrates production pCTR/pCVR ranking models (for reward fidelity) and periodically updated GPR policy models.
  • Candidate Generation & Evaluation: For each user request context $s$, the simulator performs beam search with the deployed GPR model to generate $K$ candidate advertisements (typically $K=40$). Each candidate is obtained through hierarchical decoding across $L$ levels of semantic IDs. Each candidate is then evaluated by the ranking models to obtain its predicted reward: $ R = \mathrm{final\_value}(s, \{z_{\ell}\}_{\ell=1}^{L}) = \mathrm{eCPM}(s, \{z_{\ell}\}_{\ell=1}^{L}) + \frac{1}{N} \sum_{i=1}^{N} \alpha_i \, \mathrm{target}_i(s, \{z_{\ell}\}_{\ell=1}^{L}) $ Where:
    • $R$: The predicted reward, representing the aggregated final value for a candidate advertisement.
    • $s$: The user request context.
    • $\{z_{\ell}\}_{\ell=1}^{L}$: The token sequence uniquely identifying the candidate advertisement.
    • $\mathrm{eCPM}(s, \{z_{\ell}\}_{\ell=1}^{L})$: The predicted effective cost per mille for the candidate.
    • $\mathrm{target}_i(s, \{z_{\ell}\}_{\ell=1}^{L})$: Auxiliary objectives such as predicted CTR or CVR.
    • $\alpha_i$: Weights for the auxiliary objectives, balancing them with eCPM.
    • The simulation environment also records the generation probabilities $\pi_{\theta_{\mathrm{old}}}(z_{\ell})$ at each decoding level, which serve as behavior policy probabilities for off-policy correction in RL.

Hierarchical Process Rewards (HEPO): A key challenge in hierarchical decoding is the credit assignment problem: when rewards are only at the final exposure level, intermediate decisions receive no direct feedback. HEPO addresses this by constructing process rewards at each hierarchical level, leveraging user-specific preference patterns.

  • Per-token Popularity Score: For each level $\ell$, a per-token popularity score $P_{\ell}(t) \in [0, 1]$ is derived from the user's successful historical interactions. This score represents how frequently token $t$ appeared in recommendations leading to positive outcomes.
  • Preference Signal ($\Delta_{\ell}$): For the chosen token $z_{\ell}$ at level $\ell$, a preference signal is computed by comparing its popularity against the average popularity of all legal candidates $S_{\ell}$ at that level: $ \Delta_{\ell} = P_{\ell}(z_{\ell}) - \frac{1}{|S_{\ell}|} \sum_{t \in S_{\ell}} P_{\ell}(t) $ Where:
    • $P_{\ell}(z_{\ell})$: Popularity score of the chosen token $z_{\ell}$ at level $\ell$.
    • $S_{\ell}$: The set of all legal candidate tokens at level $\ell$.
    • $|S_{\ell}|$: The number of legal candidate tokens at level $\ell$.
    • The baseline subtraction ensures zero-mean signals when all candidates have equal popularity, preventing systematic bias.
  • Step Reward ($r_{\ell}$): The step reward at each level is defined as: $ r_{\ell} = \begin{cases} \alpha_{\ell} \max(0, \Delta_{\ell}), & \ell < L, \\ R, & \ell = L, \end{cases} $ Where:
    • $r_{\ell}$: The reward for choosing token $z_{\ell}$ at level $\ell$.
    • $L$: The final level.
    • $\alpha_{\ell}$: Small scaling factors for intermediate process rewards ($\ell < L$), ensuring they guide learning without overwhelming the terminal reward.
    • $R$: The terminal reward obtained from the simulator (Eq. 5) at the final level $L$. (A minimal sketch of this step reward follows the list below.)
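A minimal sketch of the step-reward computation at a single level (popularity scores and the scaling factor are made-up values):

```python
def step_reward(level, chosen, popularity, legal, terminal_reward, alpha=0.1, L=3):
    """HEPO-style step reward: shaped preference signal below the final level,
    simulator reward R at the final level."""
    if level == L:
        return terminal_reward
    baseline = sum(popularity[t] for t in legal) / len(legal)   # mean popularity of legal codes
    delta = popularity[chosen] - baseline                       # preference signal Delta_l
    return alpha * max(0.0, delta)

popularity = {11: 0.8, 12: 0.2, 13: 0.5}    # per-token popularity P_l(t) in [0, 1]
print(step_reward(level=1, chosen=11, popularity=popularity,
                  legal=[11, 12, 13], terminal_reward=4.2))   # shaped intermediate reward
print(step_reward(level=3, chosen=42, popularity={}, legal=[],
                  terminal_reward=4.2))                       # final level returns R
```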

Advantage and Loss:

  • Advantages for Coarse Levels (<L\ell < L): Advantages are computed via Generalized Advantage Estimation (GAE) using the process rewards rr_{\ell}.
    • The cumulative return is G=k=0Lγkr+kG_{\ell} = \sum_{k=0}^{L-\ell} \gamma^k r_{\ell+k}.
    • The temporal difference errors are δ=r+γVϕ(s,z1:)Vϕ(s,z1:1)\delta_{\ell} = r_{\ell} + \gamma V_{\phi}(s, z_{1:\ell}) - V_{\phi}(s, z_{1:\ell-1}), where Vϕ()V_{\phi}(\cdot) is the value function and γ\gamma is the discount factor.
  • Advantages for Final Level (=L\ell = L): Terminal rewards exhibit high variance. For stable optimization, Z-score normalization is applied over the KK candidates generated for each request within the simulation environment.
  • The Advantage (AA_{\ell}): At each level is defined as: $ A _ { \ell } = \left{ \begin{array} { l l } { \sum _ { l = 0 } ^ { L - \ell - 1 } ( \gamma \lambda ) ^ { l } \delta _ { \ell + l } , } & { \ell < L , } \ { } \ { \frac { R - \mu _ { K } } { \sigma _ { K } + \epsilon } , } & { \ell = L , } \end{array} \right. $ Where:
    • AA_{\ell}: The advantage for level \ell.
    • γ\gamma: Discount factor.
    • λ\lambda: GAE parameter controlling bias-variance trade-off.
    • δ+l\delta_{\ell+l}: Temporal difference error at level +l\ell+l.
    • RR: Terminal reward from Eq. 5.
    • μK\mu_K: Mean of rewards over the KK candidates generated in simulation.
    • σK\sigma_K: Standard deviation of rewards over the KK candidates.
    • ϵ\epsilon: A small constant for numerical stability.
  • Policy Loss (Lθ\mathcal{L}_{\theta}): The policy model (parameters θ\theta) is updated by minimizing: $ \mathcal { L } _ { \theta } = \mathbb { E } \Bigg [ \sum _ { \ell = 1 } ^ { L } c _ { \ell } \operatorname* { m i n } \Big ( \rho _ { \ell } A _ { \ell } , \mathrm { c l i p } ( \rho _ { \ell } , 1 - \epsilon , 1 + \epsilon ) A _ { \ell } \Big ) \Bigg ] $ Where:
    • Lθ\mathcal{L}_{\theta}: The policy loss.
    • cc_{\ell}: Coefficients for each level, weighting their contribution to the total loss.
    • ρ=πθ(z)/πθold(z)\rho_{\ell} = \pi_{\theta}(z_{\ell}) / \pi_{\theta_{\mathrm{old}}}(z_{\ell}): The importance ratio, correcting for off-policy sampling (where πθold\pi_{\theta_{\mathrm{old}}} is the behavior policy used for sampling in simulation).
    • clip(ρ,1ϵ,1+ϵ)\mathrm{clip}(\rho_{\ell}, 1 - \epsilon, 1 + \epsilon): Clips the importance ratio to prevent excessively large policy updates, a common technique in Proximal Policy Optimization (PPO)-like algorithms.
    • AA_{\ell}: The advantage at level \ell.
  • Value Loss (\mathcal{L}_{\phi}): The value function (parameters \phi) is trained with mean squared error across all levels: $ \mathcal{L}_{\phi} = \mathbb{E}\left[ \sum_{\ell=1}^{L} \big( V_{\phi}(s, z_{1:\ell-1}) - G_{\ell} \big)^2 \right] $ Where:
    • \mathcal{L}_{\phi}: The value loss.
    • V_{\phi}(s, z_{1:\ell-1}): The predicted value of the state given the prefix up to token z_{\ell-1}.
    • G_{\ell}: The cumulative return from level \ell.
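
To make the level-wise credit assignment above concrete, here is a minimal NumPy sketch (not the authors' implementation) that computes GAE advantages for the coarse levels, a z-score-normalized advantage for the final level, and the PPO-style clipped policy surrogate. Names such as process_rewards, values, and level_coeffs are illustrative assumptions.

```python
import numpy as np

def hepo_advantages(process_rewards, terminal_rewards, values,
                    gamma=0.99, lam=0.95, eps=1e-8):
    """Hedged sketch of HEPO-style advantages for one generated candidate.

    process_rewards:  shape (L-1,), rewards r_1..r_{L-1} for the coarse levels.
    terminal_rewards: shape (K,), simulator rewards R for the K candidates of
                      this request; index 0 is the candidate being scored here.
    values:           shape (L,), critic estimates V(s, z_{1:l-1}) for l = 1..L.
    Returns advantages of shape (L,).
    """
    L = len(values)
    rewards = np.append(process_rewards, terminal_rewards[0])   # r_1..r_L
    next_values = np.append(values[1:], 0.0)                     # V(s, z_{1:l}); 0 past the end
    deltas = rewards + gamma * next_values - values              # TD errors delta_l
    adv = np.zeros(L)
    gae = 0.0
    for l in reversed(range(L - 1)):                             # GAE over coarse levels only
        gae = deltas[l] + gamma * lam * gae
        adv[l] = gae
    # Final level: z-score the terminal reward over the K candidates of this request.
    adv[L - 1] = (terminal_rewards[0] - terminal_rewards.mean()) / (terminal_rewards.std() + eps)
    return adv

def clipped_policy_loss(logp_new, logp_old, advantages, level_coeffs, clip_eps=0.2):
    """Clipped surrogate summed over levels, negated so it can be minimized."""
    ratios = np.exp(logp_new - logp_old)                         # importance ratios rho_l
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.sum(level_coeffs * np.minimum(unclipped, clipped))
```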

Anticipatory Request Rehearsal (ARR): To adapt to the highly dynamic advertising ecosystem (evolving user interests, changing inventory), GPR introduces ARR. This generates synthetic training samples that approximate users' future request states, enabling anticipatory adaptation rather than just reacting to historical data.

  • Synthetic Request Construction: Based on each user's current state, ARR constructs synthetic requests.
    • Sampling Frequency: Adapts to user activity (e.g., every 2-4 hours for high-activity users, proportionally adjusted for low-activity users).
    • Organic Token: Reconstructed using the user's most recently viewed organic content to reflect evolving interests.
    • User Token: Reused from the previous request for high-activity users (profile features are stable over short horizons).
    • Environment Token: Queried in real-time to capture the latest system state (predicted ad position, placement type, privacy settings).
  • These synthetic samples are processed identically to observed samples in the simulation environment: deployed GPR generates candidates, ranking model evaluates them, and advantages are computed for RL training.
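
To illustrate how ARR assembles a synthetic request, here is a hedged sketch; the Request structure, field names, and the 3-hour offset are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Request:
    user_token: dict      # stable profile features
    organic_token: dict   # recently viewed organic content
    env_token: dict       # placement type, predicted position, privacy settings
    timestamp: datetime

def build_synthetic_request(last_request: Request,
                            recent_organic: dict,
                            live_env_state: dict,
                            hours_ahead: float = 3.0) -> Request:
    """Hedged sketch of ARR-style synthetic request construction: reuse the
    stable user token, refresh the organic token from the latest viewed
    content, and query the current environment state."""
    return Request(
        user_token=last_request.user_token,      # profile stable over short horizons
        organic_token=recent_organic,            # reflects evolving interests
        env_token=live_env_state,                # latest system state
        timestamp=last_request.timestamp + timedelta(hours=hours_ahead),
    )
```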

5. Experimental Setup

5.1. Datasets

The experiments utilize a large-scale corpus curated from Tencent's advertising platform.

  • Content: It covers both advertisements and organic media (such as short videos, social feeds, and news articles).

  • Modality: The corpus comprises heterogeneous multimodal signals, including textual metadata (titles, tags, descriptions) and visual content (thumbnails and sampled frames). These signals are aligned at both the item and session levels.

  • Scale: The training data spans one year of anonymized user interactions and comprises hundreds of millions of users.

  • Data Characteristics: To ensure data quality, near-duplicate samples are filtered to reduce redundancy, and category distributions are balanced to limit sampling bias.

  • Splitting: The corpus is partitioned into an 80% training set and a 20% test set.

  • Sequence Construction: User histories are serialized in strict time order using the unified four-token schema (U/O/E/I-Tokens, as described in Section 4.2.1); a minimal serialization sketch appears at the end of this subsection.

  • Validation Data for Offline Experiments: For offline studies (Sections 4.2 and 4.3), validation uses the next calendar day's data following the one-year training period.

    Example Data Sample: The paper describes items as being represented by "textual metadata (titles, tags, descriptions) and visual content (thumbnails and sampled frames)". A hypothetical ad item sample might look like:

  • Textual Metadata: Title: "New Smartphone Model X: Capture stunning photos!", Tags: ["Smartphone", "Electronics", "Camera"], Description: "Experience the ultimate in mobile photography with our latest device."

  • Visual Content: Thumbnail: [image_vector_representation_of_phone_ad_thumbnail.jpg], Sampled Frames: [list_of_image_vectors_from_ad_video_creative.mp4]

  • Associated Organic Content: User viewed: "Short video: Top 5 travel destinations", "Article: Latest tech gadgets review"

    These datasets are effective because they directly represent the real-world, large-scale, and heterogeneous data environment of an industrial advertising platform, making them ideal for validating GPR's practical performance and scalability.
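
To make the sequence construction above concrete, the following sketch serializes a heterogeneous user history into the four-token schema; the event record layout, field names, and example values are illustrative assumptions, not the paper's data format.

```python
from typing import Dict, List, Tuple

# Illustrative event record: (timestamp, token_type, payload), where token_type
# is one of "U" (user), "O" (organic), "E" (environment), "I" (item/ad).
Event = Tuple[int, str, Dict]

def serialize_history(events: List[Event]) -> List[Tuple[str, Dict]]:
    """Hedged sketch: order heterogeneous events strictly by time and emit
    (token_type, payload) pairs that a downstream tokenizer could map to
    multi-level semantic IDs."""
    ordered = sorted(events, key=lambda e: e[0])   # strict time order
    return [(token_type, payload) for _, token_type, payload in ordered]

# Hypothetical usage mirroring the example data sample above.
history = [
    (1700000000, "U", {"age_bucket": "25-34", "region": "CN"}),
    (1700000100, "O", {"title": "Short video: Top 5 travel destinations"}),
    (1700000200, "E", {"placement": "feed", "predicted_position": 3}),
    (1700000300, "I", {"title": "New Smartphone Model X: Capture stunning photos!"}),
]
tokens = serialize_history(history)
```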

5.2. Evaluation Metrics

The paper employs a variety of evaluation metrics across different experimental stages (multimodal tokenization, user behavior modeling, business alignment, and online A/B testing).

For Multimodal Tokenization Performance (Section 4.1):

  1. Collision Rate (\downarrow):

    • Conceptual Definition: Measures the fraction of distinct original items (e.g., ads or organic content) that are mapped to an identical semantic code or ID by the tokenizer.
    • Mathematical Formula: $ \mathrm{Collision\ Rate} = \frac{\sum_{c \in C_{\text{codes}}} \left( \mathrm{count}(c) - 1 \right)}{\text{Total number of distinct items}} $
    • Symbol Explanation:
      • C_{\text{codes}}: The set of all unique semantic codes generated.
      • \mathrm{count}(c): The number of distinct original items that map to semantic code c.
      • Total number of distinct items: The total count of unique items in the dataset.
    • Goal: Lower values are better, indicating that fewer distinct items are erroneously grouped under the same semantic ID.
  2. Code Usage Rate at Level 1 (\mathrm{CUR}_{\mathrm{L1}} \uparrow):

    • Conceptual Definition: Measures the proportion of active (used) codes in the codebook at the first level of quantization. It indicates how effectively the codebook's capacity is being utilized.
    • Mathematical Formula: $ \mathrm{CUR}_{\mathrm{L1}} = \frac{\text{Number of active codes at level 1}}{\text{Total number of codes in the level-1 codebook}} \times 100\% $
    • Symbol Explanation:
      • Number of active codes at level 1: The count of unique semantic codes from the first level of the codebook that have been assigned to at least one item.
      • Total number of codes in the codebook at level 1: The total size of the codebook for the first quantization level.
    • Goal: Higher values are better, indicating that the codebook is being well-utilized without "dead codes."
  3. Path Average Similarity (PAS \uparrow):

    • Conceptual Definition: Measures the mean embedding similarity among all items that share an identical semantic code path. It assesses the semantic coherence of items grouped by the same code.
    • Mathematical Formula: The paper describes it as "mean embedding similarity among items that share a code." A common approach uses cosine similarity: $ \mathrm{PAS} = \frac{1}{|C_{\text{codes}}|} \sum_{c \in C_{\text{codes}}} \left( \frac{1}{\binom{\mathrm{count}(c)}{2}} \sum_{(i_1, i_2) \in \mathrm{Pairs}(c)} \mathrm{cosine\_similarity}(\mathbf{e}_{i_1}, \mathbf{e}_{i_2}) \right) $ (codes with \mathrm{count}(c) < 2 are excluded or contribute nothing to the average).
    • Symbol Explanation:
      • C_{\text{codes}}: The set of all unique semantic codes.
      • \mathrm{count}(c): The number of distinct original items that map to semantic code c.
      • \mathrm{Pairs}(c): The set of all unique pairs of distinct items that map to semantic code c.
      • \mathbf{e}_{i_1}, \mathbf{e}_{i_2}: The original dense embeddings of items i_1 and i_2.
      • \mathrm{cosine\_similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}: The cosine of the angle between the two vectors, indicating their directional similarity.
    • Goal: Higher values indicate more semantically coherent groupings of items under the same code. (A combined computation sketch for these three metrics follows this list.)
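
As a reference point, here is a hedged NumPy sketch that computes all three code-quality metrics for a set of tokenized items; the inputs codes (the full semantic code path assigned to each item) and embeddings are illustrative, and PAS uses cosine similarity as assumed above.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def tokenizer_metrics(codes, embeddings, codebook_size_l1):
    """Hedged sketch, not the paper's evaluation code.

    codes:            list of tuples, the semantic code path assigned to each item.
    embeddings:       (N, d) array of the items' original dense embeddings.
    codebook_size_l1: size of the level-1 codebook.
    Returns (collision_rate, cur_l1, pas) as fractions (multiply by 100 for %).
    """
    n_items = len(codes)
    counts = Counter(codes)
    # Collision rate: items that share a full code path with at least one other item.
    collision_rate = sum(c - 1 for c in counts.values()) / n_items
    # Code usage rate at level 1: distinct first-level codes actually assigned.
    cur_l1 = len({code[0] for code in codes}) / codebook_size_l1
    # Path average similarity: mean pairwise cosine similarity within each collided code.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups = {}
    for idx, code in enumerate(codes):
        groups.setdefault(code, []).append(idx)
    per_code_sims = []
    for members in groups.values():
        if len(members) < 2:
            continue   # codes without collisions contribute nothing
        pair_sims = [float(normed[i] @ normed[j]) for i, j in combinations(members, 2)]
        per_code_sims.append(np.mean(pair_sims))
    pas = float(np.mean(per_code_sims)) if per_code_sims else 1.0
    return collision_rate, cur_l1, pas
```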

For User Behavior Modeling Performance (Section 4.2):

  1. HitRate@100 (\uparrow):
    • Conceptual Definition: Measures the proportion of times the ground-truth next interacted item is found within the top 100 generated candidates. This is a common metric for evaluating retrieval-like tasks in recommendation systems, assessing whether the model can localize the correct item in the embedding space among many candidates.
    • Mathematical Formula: $ \mathrm{HitRate@K} = \frac{\text{Number of requests where the ground-truth item is in the top-}K}{\text{Total number of requests}} $
    • Symbol Explanation:
      • K: The cutoff rank (here, K = 100).
      • Number of requests where ground-truth item is in top-K: Count of instances where the actual item a user interacted with is present within the first KK items recommended by the model.
      • Total number of requests: The total number of recommendation instances evaluated.
    • Goal: Higher values are better.
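
A minimal computation sketch of this metric (illustrative function and argument names):

```python
def hit_rate_at_k(ranked_lists, ground_truth, k=100):
    """Hedged sketch: fraction of requests whose ground-truth item appears in the top-k.

    ranked_lists: list of per-request candidate-id lists, best first.
    ground_truth: list of the actually interacted item ids, one per request.
    """
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)
```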

For Business Alignment Performance (Section 4.3):

  1. nDCG (Normalized Discounted Cumulative Gain \uparrow):

    • Conceptual Definition: nDCG is a measure of ranking quality that considers the graded relevance of items in a ranked list. It assigns higher scores if highly relevant items appear at the top of the list and normalizes the score to be comparable across different queries or recommendation lists.
    • Mathematical Formula: $ \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $, $ \mathrm{IDCG}_p = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}'_i} - 1}{\log_2(i+1)} $, $ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} $
    • Symbol Explanation:
      • p: The position in the ranked list (e.g., the length of the recommended list).
      • \mathrm{rel}_i: The graded relevance score of the item at position i in the recommended list. For eCPM-oriented ranking, this could be the eCPM value itself or a derived relevance score.
      • \mathrm{rel}'_i: The graded relevance score of the item at position i in the ideal (perfectly sorted by relevance) ranked list.
      • \log_2(i+1): The discount factor, which reduces the importance of items further down the list.
      • \mathrm{DCG}_p: Discounted Cumulative Gain at position p.
      • \mathrm{IDCG}_p: Ideal Discounted Cumulative Gain at position p, i.e., the maximum possible DCG for the given set of relevant items.
    • Goal: Higher nDCG values indicate better ranking quality, especially for highly relevant items at the top.
  2. OPR (Ordered Pair Ratio \uparrow):

    • Conceptual Definition: OPR measures the fraction of item pairs that are correctly ordered according to an objective criterion (e.g., eCPM-oriented ranking). If item A should be ranked higher than item B based on eCPM, and the model ranks A higher, that pair is correctly ordered.
    • Mathematical Formula (a common interpretation for pairwise ranking accuracy): $ \mathrm{OPR} = \frac{\sum_{(i,j) \in P} \mathbb{I}\big(\mathrm{rank}(i) < \mathrm{rank}(j)\big)}{|P|} $
    • Symbol Explanation:
      • P: The set of all comparable pairs (i, j) where item i is objectively preferred over item j (e.g., i has a higher eCPM than j) according to the ground truth or reference ranking.
      • \mathrm{rank}(k): The position of item k in the model's predicted ranked list (a lower rank value means a higher position).
      • \mathbb{I}(\cdot): An indicator function, equal to 1 if the condition inside is true, and 0 otherwise.
    • Goal: Higher OPR values indicate better agreement with the desired pairwise ranking order.
  3. Avg final_value (\uparrow) and Max final_value (\uparrow):

    • Conceptual Definition: These metrics are derived from the final_value as defined in Eq. 5 (Section 4.2.4).
      • final_value aggregates predicted eCPM and auxiliary objectives (pCTR, pCVR) into a single scalar representing the overall business value of a recommendation.
      • Avg final_value is the mean of these normalized final values over the K candidates generated per request, aggregated across all evaluation requests.
      • Max final_value is the maximum of these normalized final values over the K candidates generated per request, aggregated across all evaluation requests.
    • Normalization: For confidentiality, these values are min-max normalized across all candidates and requests in the simulation environment.
    • Goal: Higher values indicate that the model is generating candidates with higher predicted business value.
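
For completeness, a hedged sketch of the two ranking metrics above (nDCG with the 2^rel − 1 gain, and OPR as pairwise agreement); function and argument names are illustrative.

```python
import numpy as np
from itertools import combinations

def ndcg(relevances):
    """Hedged sketch: nDCG of graded relevances listed in model-ranked order."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))   # 1 / log2(i + 1), i = 1..p
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    ideal = np.sort(rel)[::-1]                               # best possible ordering
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0

def ordered_pair_ratio(predicted_scores, reference_scores):
    """Hedged sketch: fraction of reference-preferred pairs the model orders correctly."""
    correct, total = 0, 0
    for i, j in combinations(range(len(reference_scores)), 2):
        if reference_scores[i] == reference_scores[j]:
            continue                                          # not a comparable pair
        total += 1
        # Correct if the predicted ordering agrees with the reference (e.g., eCPM) ordering.
        if (predicted_scores[i] - predicted_scores[j]) * (reference_scores[i] - reference_scores[j]) > 0:
            correct += 1
    return correct / total if total else 0.0
```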

For Online Performance (Section 4.4):

  1. GMV (Gross Merchandise Volume \uparrow):

    • Conceptual Definition: A primary Key Performance Indicator (KPI) in e-commerce and advertising, representing the total value of sales or transactions facilitated through the platform. It directly reflects business return.
    • Goal: Higher GMV indicates greater monetization and business success.
  2. Costs (\uparrow):

    • Conceptual Definition: Reflects the total advertising expenditure or cost incurred by advertisers through the system.
    • Goal: While higher costs might seem negative, in advertising, a well-performing system can drive higher costs if it effectively increases advertiser ROI and platform revenue (GMV). The goal is typically to increase GMV efficiently relative to costs.
  3. CTCVR (Click-Through Conversion Rate \uparrow):

    • Conceptual Definition: A composite metric that measures the efficiency of turning impressions into clicks and then clicks into conversions. It's often calculated as (Conversions / Clicks) * (Clicks / Impressions) or directly as Conversions / Impressions. The paper refers to it as a key business metric.
    • Goal: Higher CTCVR indicates a more efficient funnel from initial exposure to a valuable user action.
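
As a quick numeric illustration of the composite definition (hypothetical funnel numbers, not from the paper), the two formulations coincide:

```python
# Hypothetical funnel: 100,000 impressions -> 2,000 clicks -> 60 conversions.
impressions, clicks, conversions = 100_000, 2_000, 60
ctr = clicks / impressions           # 0.02
cvr = conversions / clicks           # 0.03
ctcvr = ctr * cvr                    # 0.0006, identical to conversions / impressions
assert abs(ctcvr - conversions / impressions) < 1e-12
```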

5.3. Baselines

The paper compares GPR against various baselines depending on the experimental stage:

For Multimodal Tokenization Performance (Section 4.1):

  • RQ-VAE [16]: Standard discrete tokenizer. It uses a Variational Autoencoder (VAE) framework combined with residual quantization. It's known to suffer from "dead codes" or codebook collapse, where some codebook vectors are rarely or never used.
  • RQ-Kmeans [20]: A quantization method that uses K-means for codebook initialization. It may offer better initialization than purely random but can still have limitations in adaptability and robustness of the latent space.

For User Behavior Modeling Performance (Section 4.2):

  • HSTU [34] (Decoder-only): A state-of-the-art generative recommendation model that adopts a decoder-only architecture to process ultra-long user histories and generate item recommendations. It serves as a strong baseline for generative models focused on long sequences.
  • OneRec [40] (Encoder-Decoder): A unified retrieve-and-rank framework using an encoder-decoder architecture. It aggregates richer feature views, which can provide stronger conditioning than strictly autoregressive decoders, but might have weaker universality at inference due to dependence on pre-structured fields.

For Business Alignment Performance (Section 4.3):

  • MTP (base): A baseline Multi-Token Prediction model (the pre-training stage of GPR) trained with uniform head/position weights under a likelihood-only objective. It serves to show the impact of value-aware fine-tuning and RL.
  • MTP+DPO [23]: A variant where the MTP pre-trained model is fine-tuned using Direct Preference Optimization (DPO). DPO is a preference-based fine-tuning method that learns from pairwise orderings constructed under predicted rewards, sharpening local orderings to favor higher-value items.

For Online Performance (Section 4.4):

  • Mature Multi-Stage Cascade: The existing production advertising system at Tencent Weixin Channels. This is a highly optimized and mature system, typically involving multiple retrieval methods and customized ranking strategies. This serves as a strong real-world benchmark against which GPR's practical efficacy is measured.

6. Results & Analysis

This section details the experimental results, comparing GPR's performance against baselines and analyzing the contributions of its various components.

6.1. Core Results Analysis

6.1.1. Multimodal Tokenization Performance

The paper first evaluates the RQ-Kmeans+ tokenizer against established baselines (RQ-VAE, RQ-Kmeans) using Collision Rate, Code Usage Rate at level 1 (CUR_L1), and Path Average Similarity (PAS).

The following are the results from Table 1 of the original paper:

| Model | Collision (%) ↓ | CUR_L1 (%) ↑ | PAS ↑ |
| --- | --- | --- | --- |
| RQ-VAE | 23.21 | 92.13 | 0.985 |
| RQ-Kmeans | 21.40 | 100 | 0.986 |
| RQ-KMeans+ (Ours) | 20.60 | 99.36 | 0.992 |

Analysis:

  • RQ-Kmeans+ demonstrates the best overall code quality.
  • It achieves the lowest Collision Rate (20.60%), a relative reduction of 11.2% compared to RQ-VAE (23.21%) and 3.7% compared to RQ-Kmeans (21.40%). A lower collision rate means fewer distinct items are mapped to the same semantic ID, indicating better item differentiation and less ambiguity in the discrete representation.
  • Code Usage Rate at level 1 (CUR_L1) for RQ-Kmeans+ is 99.36%, which is near saturation. This is comparable to RQ-Kmeans (100%) and significantly higher (+7.2 percentage points) than RQ-VAE (92.13%). This indicates that RQ-Kmeans+ effectively utilizes its codebook, avoiding the "codebook collapse" problem prevalent in RQ-VAE.
  • Crucially, Path Average Similarity (PAS) for RQ-Kmeans+ reaches 0.992, which is higher than RQ-VAE (0.985) and RQ-Kmeans (0.986). This implies that even for the items that do collide (i.e., map to the same semantic ID), their original dense embeddings exhibit higher semantic coherence.

Conclusion: These results confirm that RQ-Kmeans+ produces more reasonable code collisions and a more efficient, semantically aligned discrete representation space, which is critical for the subsequent generative modeling of heterogeneous data.

6.1.2. User Behavior Modeling Performance

This section evaluates the effectiveness of the HHD architecture and MTP for modeling long, heterogeneous user behaviors in predicting the next interacted item. The primary metric is HitRate@100.

The following are the results from Table 2 of the original paper:

| Model | HitR@100 (%) | Δ vs. HSTU |
| --- | --- | --- |
| Baselines | | |
| HSTU (Decoder-only) | 18.98 | – |
| OneRec (Encoder-Decoder) | 19.85 | +4.6% |
| HSD | | |
| + Hybrid Attention | 20.56 | +8.3% |
| + Token-Aware FFN | 21.98 | +15.8% |
| + Token-Aware Layer Norm | 20.76 | +9.4% |
| + Mixture of Recursions | 20.09 | +5.9% |
| + External Knowledge | 20.13 | +6.1% |
| PTD | | |
| + Thinking | 21.75 | +14.6% |
| + Refining | 19.61 | +3.3% |
| HTE | | |
| + HTE | 19.91 | +4.9% |
| Training | | |
| + Multi-Token Prediction | 22.38 | +17.9% |
| GPR (Full) | | |
| + All (HSD+PTD+HTE) | 27.32 | +43.9% |

Analysis:

  • Overall GPR Performance: The full GPR model with HHD achieves a HitRate@100 of 27.32%. This represents a substantial relative gain of +43.9% over HSTU (18.98%) and +37.6% over OneRec (19.85%). This clearly indicates GPR's superior capacity for user behavior modeling and next-item prediction compared to leading generative and end-to-end baselines. OneRec outperforms HSTU due to its ability to ingest and aggregate richer, non-purely sequential feature views, providing stronger conditioning, although at a cost to inference flexibility.
  • Ablation Study of HSD Components (User Intent Encoding):
    • + Hybrid Attention: Improves HitRate@100 to 20.56% (a relative gain of +8.3% over HSTU). This highlights the effectiveness of adapting the attention mask for heterogeneous prefix tokens and adaptively modulating attention weights.
    • + Token-Aware FFN: Leads to a significant improvement to 21.98% (relative gain of +15.8%). This validates the importance of independent FFNs for different token types to capture semantic diversity.
    • + Token-Aware Layer Norm: Contributes to 20.76% (relative gain of +9.4%), further supporting the need for type-specific normalization to reduce cross-type interference.
    • + Mixture of Recursions (MoR): Increases HitRate@100 to 20.09% (relative gain of +5.9%), showing that MoR effectively increases model depth and reasoning capacity without adding parameters.
    • + External Knowledge: Integrating "thought-process" tokens from an LLM yields 20.13% (relative gain of +6.1%), confirming that external knowledge enriches intent representation.
  • Ablation Study of PTD Components (Generation Quality):
    • + Thinking: Implementing the thinking tokens paradigm for multi-step latent refinement results in 21.75% (relative gain of +14.6%). This is a strong indicator of the benefit of an explicit reasoning step before generation.
    • + Refining: The refining module, based on the diffusion paradigm, further contributes to 19.61% (relative gain of +3.3%), showing its role in improving initial reasoning results.
  • Ablation Study of HTE (Value Prediction):
    • + HTE: The explicit value estimation module improves HitRate@100 to 19.91% (relative gain of +4.9%). This suggests that predicting business value sharpens candidate ordering, better preparing the generative system for downstream auction and demonstrating the consistency of value prediction with retrieval quality.
  • Ablation Study of Multi-Token Prediction (Training Strategy):
    • + Multi-Token Prediction: This training strategy provides the largest single gain, boosting HitRate@100 to 22.38% (relative gain of +17.9%). This strongly supports the multi-threaded-interest hypothesis, validating that modeling parallel user interests is crucial for advertising scenarios.

6.1.3. Scaling Properties

The paper investigates the scaling properties of GPR across six dense parameter sizes: 0.02B, 0.1B, 0.2B, 0.5B, 1B, and 2B. It notes that GPR's total size is dominated by sparse parameters, totaling approximately 80B.

The following figure (Figure 5 from the original paper) illustrates the loss curves:

Figure 5: Comparison of loss curves for six different GPR parameter sizes. (The chart plots training steps on the x-axis against loss on the y-axis, comparing training behavior across the six parameter settings.)

Analysis:

  • Figure 5 clearly shows a robust scaling law: models with greater parameter counts consistently achieve lower loss values as training progresses. This empirical observation validates the substantial potential for performance enhancement gained by scaling up the model size.
  • The curves show a clear separation, with larger models (e.g., 2B, 1B) achieving significantly lower training loss compared to smaller models (e.g., 0.02B, 0.1B) at any given training step. This is a common and desirable property in large-scale machine learning models, indicating that the architecture can effectively leverage increased capacity.

6.1.4. Business Alignment Performance

This study evaluates how the multi-stage training strategy (MTP pretraining, eCPM-aware fine-tuning, and HEPO) improves monetization alignment.

The following are the results from Table 3 of the original paper:

| Model | nDCG | OPR | Avg final_value | Max final_value |
| --- | --- | --- | --- | --- |
| Pretraining & Fine-tuning | | | | |
| MTP (base) | 0.3868 | 0.5292 | 0.2412 | 0.6201 |
| + VAFT | 0.3925 | 0.5348 | – | – |
| Post-training | | | | |
| + DPO | 0.4383 | 0.5463 | 0.2442 | 0.6659 |
| + HEPO | 0.4413 | 0.5509 | 0.2630 | 0.7619 |

Analysis:

  • Impact of Value-Aware Fine-Tuning (VAFT):
    • Relative to the MTP (base) model, MTP + VAFT improves nDCG from 0.3868 to 0.3925 and OPR from 0.5292 to 0.5348. This demonstrates that reweighting the MTP loss with action type and normalized eCPM successfully shifts learning toward high-value impressions while maintaining relevance, enhancing business alignment.
  • Impact of Reinforcement Learning (RL):
    • MTP + DPO: Post-training with DPO (which optimizes pairwise preferences for higher-value items) further improves nDCG to 0.4383 and OPR to 0.5463. More importantly, the normalized average final_value increases from 0.2412 to 0.2442, and max final_value from 0.6201 to 0.6659. This shows that RL, by evaluating policy-generated sequences in simulation, sharpens local orderings and improves the potential business value of generated candidates.
    • MTP + HEPO: HEPO surpasses DPO in all metrics. It achieves nDCG of 0.4413, OPR of 0.5509, a significantly higher average final_value of 0.2630 (compared to 0.2442 for DPO), and max final_value of 0.7619 (compared to 0.6659 for DPO). The substantial increase in average and max final_value clearly indicates that HEPO is more effective at optimizing for overall business value, likely due to its hierarchical process rewards and robust advantage estimation, enabling it to discover and prioritize high-value candidates more effectively.

6.1.5. Online Performance

GPR was fully deployed in the Tencent Weixin Channels advertising system, with its performance rigorously validated through sequential online A/B tests against a mature multi-stage cascading pipeline.

The following are the results from Table 4 of the original paper:

Launches with incremental changes:

| Version | GMV | GMV-Normal | Costs |
| --- | --- | --- | --- |
| v0.1: HSD+NTP+DPO | +2.11% | +2.42% | +3.29% |
| v0.2: +HEPO w/o ARR | +0.70% | +0.67% | +0.36% |
| v0.3: +MTP+Thinking | +0.63% | +0.94% | +0.21% |
| v0.4: +PTD | +0.71% | +1.04% | +0.12% |
| v0.5: +HEPO w/ ARR | +0.58% | +0.81% | +0.23% |

Analysis of Incremental Launches:

  • v0.1 (HSD+NTP+DPO): The initial full-scale deployment of a basic GPR stack (using HSD, NTP pre-training, and DPO post-training) established a strong baseline lift of +2.11% GMV and +3.29% Costs. GMV-Normal also increased by +2.42%, indicating improved monetization across ads optimized for clicks or conversions. This initial success validates the fundamental end-to-end generative approach.
  • v0.2 (+HEPO w/o ARR): Replacing DPO with HEPO (without Anticipatory Request Rehearsal) yielded an additional +0.70% GMV and +0.36% Costs. This confirms the superior value optimization capability of HEPO in a live production environment.
  • v0.3 (+MTP+Thinking): Introducing Multi-Token Prediction and the Thinking tokens mechanism contributed a further +0.63% GMV (+0.94% GMV-Normal). This highlights the importance of modeling parallel user interests and explicit reasoning steps for real-world advertising performance.
  • v0.4 (+PTD): The full Progressive Token-wise Decoder (PTD) contributed an additional +0.71% GMV (+1.04% GMV-Normal). This shows the benefit of the refined generation process, incorporating both thinking and refining steps.
  • v0.5 (+HEPO w/ ARR): Finally, applying HEPO with Anticipatory Request Rehearsal (ARR) added another +0.58% GMV (+0.81% GMV-Normal). This indicates that proactive adaptation to dynamic environments through synthetic future samples further enhances monetization.

Overall: Across all iterative rollouts, GPR consistently delivered stable, statistically significant incremental improvements in both GMV and GMV-Normal while managing Costs, demonstrating stronger monetization under real-time constraints.

The following are the results from Table 5 of the original paper:

| Segment | Subgroup | GMV | CTR | CVR | CTCVR |
| --- | --- | --- | --- | --- | --- |
| Overall | v0.1 | +2.11% | +1.69% | +1.15% | +3.16% |
| User Group | UG1 | +3.56% | +2.51% | +0.82% | +3.72% |
| | UG2 | +3.84% | +2.06% | +1.30% | +3.80% |
| | UG3 | +0.92% | +2.18% | +1.91% | +4.63% |
| | UG4 | +0.45% | +1.08% | +1.53% | +2.87% |
| | UG5 | +3.68% | +0.05% | +0.32% | +0.50% |
| Ad Group | new | +2.97% | +2.25% | +1.41% | +4.02% |
| | non-new | +1.65% | +1.42% | +1.12% | +2.78% |

Stratified Analysis of v0.1 Launch: The stratified analysis for the initial v0.1 launch (HSD+NTP+DPO) shows consistent gains across various user and ad segments:

  • User Groups:
    • UG1 and UG2 (low-activity users) show strong gains: e.g., UG1 with +3.56% GMV, +2.51% CTR, +0.82% CVR, and +3.72% CTCVR. This indicates GPR's ability to effectively engage even less active users.
    • UG3 and UG4 (mid-activity groups) also show improvements in engagement and efficiency. UG3 exhibits the largest CTCVR lift at +4.63%, alongside +0.92% GMV.
    • UG5 (high-activity users) shows smaller changes in CTR (+0.05%) and CVR (+0.32%) but still delivers a significant +3.68% GMV. This suggests that for heavy users, GPR might be leading to better allocation toward higher-value ads rather than just increased clicks/conversions.
  • Ad Groups:
    • Newly launched ads (\leq 3 days): Outperform established ads (> 3 days) with +2.97% GMV vs. +1.65% and +4.02% CTCVR vs. +2.78%. This pattern indicates GPR's strong capability in cold-start handling for new inventory, providing significant lifts to fresh advertisements.
    • Non-new ads (> 3 days): Still show respectable gains in GMV (+1.65%) and CTCVR (+2.78%), demonstrating that GPR preserves gains for mature inventory.

Conclusion: The online A/B tests confirm GPR's robust practical performance, its ability to drive significant business value, and its strong generalization across diverse user and ad segments, including effective cold-start handling.

6.2. Ablation Studies / Parameter Analysis

The detailed ablation studies presented in Table 2 for User Behavior Modeling Performance (Section 6.1.2) systematically validate the effectiveness of each proposed component:

  • HSD Enhancements (Hybrid Attention, Token-Aware FFN/LN, MoR, External Knowledge): These collectively demonstrate the importance of architectural innovations for deeply understanding heterogeneous user behaviors and generating high-quality intent embeddings. The Token-Aware FFN and Hybrid Attention show the largest individual contributions within HSD, highlighting the critical role of handling token heterogeneity and flexible attention masks.

  • PTD Enhancements (Thinking, Refining): The Thinking module's significant contribution (relative gain of +14.6%) underscores the benefit of introducing explicit reasoning steps for latent refinement before item generation, leading to more accurate predictions. The Refining module further solidifies this.

  • HTE Integration (+ HTE): The positive impact of HTE on HitRate@100 (+4.9%) shows that integrating value prediction directly sharpens candidate ordering, aligning the generative process with business objectives even at the prediction stage.

  • MTP (Training Strategy): The Multi-Token Prediction strategy provides the largest individual performance lift (+17.9% HitR@100). This is a crucial validation of the paper's hypothesis that modeling multiple, parallel user interests is essential for advertising scenarios where users often have diverse and concurrent needs.

    These studies confirm that GPR's strong overall performance is not due to a single silver bullet but rather the synergistic effect of its systematically designed components across representation, architecture, and training strategy. The scaling analysis (Figure 5) further supports that GPR's architecture is well-suited for larger models, promising further performance gains with increased computational resources.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces GPR (Generative Pre-trained Recommender), pioneering the first one-model framework that redefines advertising recommendation as an end-to-end generative task. By moving away from the conventional multi-stage cascading pipeline, GPR successfully addresses long-standing issues of objective inconsistency and error accumulation, thereby achieving global optimization and long-term value alignment within a single, holistic model.

The framework's core contributions lie in three systematic innovations:

  1. Unified Representation: A novel input schema with four token types (U/O/E/I-Token) and the RQ-Kmeans+ quantization model effectively map heterogeneous content into a shared multi-level semantic ID space, enabling robust modeling of diverse, ultra-long user sequences and resolving codebook collapse.

  2. Architectural Design: The Heterogeneous Hierarchical Decoder (HHD) offers a dual-decoder structure (HSD for user intent, PTD for ad generation) that decouples these complex tasks. Enhancements like Hybrid Attention, Token-Aware FFN/LN, Mixture-of-Recursions, and a "Thinking-Refining-Generation" paradigm enable finer-grained understanding and accurate, constrained item generation, balancing efficiency and flexibility.

  3. Training Strategy: A multi-stage joint training pipeline integrates Multi-Token Prediction (MTP) for pre-training multi-interest patterns, Value-Aware Fine-Tuning (VAFT) for business objective alignment, and Hierarchy Enhanced Policy Optimization (HEPO) with Anticipatory Request Rehearsal (ARR) for robust value optimization and exploration in dynamic environments.

    Extensive large-scale offline experiments and real-world online A/B testing in the Tencent Weixin Channels advertising system unequivocally demonstrate GPR's superiority. It delivers significant improvements in key business metrics, including Gross Merchandise Volume (GMV) and CTCVR, outperforming a highly optimized and mature cascading system. This validates GPR's practical utility, robustness, and strong competitiveness.

In essence, GPR propels advertising recommendation systems from stage-wise optimization toward truly end-to-end intelligent decision-making, unifying user intent understanding, long-term value optimization, and continuous adaptation within the digital economy ecosystem.

7.2. Limitations & Future Work

While the paper highlights GPR's significant advancements and successful deployment, some limitations and potential areas for future work can be inferred:

  • Complexity of Training Pipeline: The multi-stage joint training strategy, involving MTP, VAFT, and HEPO with ARR, is sophisticated. Setting up and maintaining such a pipeline, particularly the high-fidelity simulation environment and the various reward shaping mechanisms for RL, likely requires substantial engineering effort and expertise.

  • Hyperparameter Sensitivity: The numerous components and training stages (e.g., the number of parallel prediction heads N for MTP, the scaling factors \alpha_{\ell} for process rewards, and \gamma, \lambda, \epsilon for HEPO) imply a potentially high sensitivity to hyperparameter tuning, which can be challenging at industrial scale.

  • Generalizability to Different Advertising Scenarios: While deployed in Tencent Weixin Channels, different advertising platforms might have unique data distributions, business models, or constraint requirements. The generalizability of RQ-Kmeans+ and HHD to vastly different multimodal inputs or HEPO to different reward structures would need further validation.

  • Interpretability and Explainability: Generative models, especially large Transformer-based ones with complex training objectives, often lack direct interpretability. In advertising, understanding why a particular ad was generated and recommended can be crucial for debugging, auditing, and satisfying regulatory requirements. The paper does not explicitly discuss GPR's interpretability.

  • Real-time Inference Latency: While the paper mentions balancing training efficiency and inference flexibility, and uses Trie-based Beam Search for efficiency, processing ultra-long user sequences and generating multiple hierarchical tokens in real-time for hundreds of millions of users still poses a significant computational challenge. Any potential bottlenecks or trade-offs between model size/complexity and strict latency requirements are not fully detailed.

  • Ethical Considerations of Value Optimization: Aggressively optimizing for final_value (eCPM, GMV) in advertising could, in some cases, lead to a less optimal user experience if not carefully balanced. The paper highlights balancing user experience, advertiser ROI, and platform revenue as a challenge, and GPR aims to optimize the "overall ecosystem value," but the exact mechanisms and empirical guarantees for maintaining user experience are not extensively explored.

    Future work could focus on:

  • Further Scaling and Efficiency: Exploring even larger model capacities and more advanced techniques for efficient inference and distributed training.

  • Adaptive Reward Learning: Developing more adaptive or user-specific reward functions for RL, potentially through inverse reinforcement learning or more sophisticated preference learning from user feedback beyond just logged actions.

  • Enhanced Interpretability: Integrating techniques for model explainability to provide insights into why specific recommendations are made.

  • Broader Application and Generalization: Testing GPR on other advertising platforms or general recommendation domains with different characteristics to evaluate its broader applicability.

  • Robustness to Adversarial Attacks/Manipulation: Investigating GPR's robustness against potential manipulation attempts in a live advertising environment.

7.3. Personal Insights & Critique

GPR marks a significant stride in the evolution of advertising recommendation systems. Its transition to a "one-model" end-to-end generative paradigm is a bold and necessary move to overcome the inherent limitations of traditional cascading systems, particularly the objective misalignment and error propagation. The systematic innovations, from unified tokenization to hierarchical decoding and multi-stage RL training, demonstrate a deep understanding of the unique challenges in industrial advertising.

The RQ-Kmeans+ tokenizer is a practical solution to a common problem (codebook collapse) in discrete representation learning, essential for generative models. The Heterogeneous Hierarchical Decoder with its "Thinking-Refining-Generation" paradigm is particularly insightful; explicitly separating user intent modeling from item generation, and then introducing iterative refinement, intuitively mirrors human decision-making and offers a powerful way to manage complexity. The Multi-Token Prediction strategy for capturing diverse user interests is also highly relevant for advertising, where users often have multiple concurrent needs.

The HEPO algorithm with Hierarchical Process Rewards and Anticipatory Request Rehearsal is a highlight, showcasing how sophisticated RL techniques can be tailored to address the nuanced credit assignment problem in hierarchical generation and the dynamic nature of advertising. The strong online A/B test results, especially the consistent GMV gains and improved cold-start handling for new ads, provide compelling evidence of GPR's practical efficacy and its potential to drive substantial business value in a highly competitive environment.

However, a potential area for further detail could be the specific formulation of eCPM and final_value beyond a high-level aggregation, as these are critical to the "value-aware" aspects. Additionally, while the paper notes the model's total sparse parameters are approximately 80B, more detailed insight into the computational resources (e.g., GPU hours, memory) required for such a large-scale, multi-stage training pipeline would be beneficial for others attempting similar industrial deployments. The fine-tuning process with an external LLM for "thought process" generation is intriguing, and further exploration of this interaction (e.g., how sensitive GPR is to the quality of the LLM's "thoughts") could be valuable.

The success of GPR, particularly its deployment in a system as large as Tencent Weixin Channels, serves as a strong testament to the viability of generative pre-trained models for complex, high-stakes industrial applications. Its methods and conclusions could potentially be transferred to other domains requiring personalized recommendations under dynamic, heterogeneous, and multi-objective constraints, such as content platforms, e-commerce, or even personalized learning systems, provided the necessary adaptations for their specific data characteristics and reward structures are made. This paper undoubtedly pushes the frontier of generative recommendation in the industrial context.
