UniSearch: Rethinking Search System with a Unified Generative Architecture

Published: 09/09/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

UniSearch introduces a unified generative search framework for Kuaishou, replacing the traditional cascaded architecture. By integrating a Search Generator and Video Encoder, it achieves end-to-end optimization, addressing objective inconsistency and limited generalization, thus

Abstract

Modern search systems play a crucial role in facilitating information acquisition. Traditional search engines typically rely on a cascaded architecture, where results are retrieved through recall, pre-ranking, and ranking stages. The complexity of designing and maintaining multiple modules makes it difficult to achieve holistic performance gains. Recent advances in generative recommendation have motivated the exploration of unified generative search as an alternative. However, existing approaches are not genuinely end-to-end: they typically train an item encoder to tokenize candidates first and then optimize a generator separately, leading to objective inconsistency and limited generalization. To address these limitations, we propose UniSearch, a unified generative search framework for Kuaishou Search. UniSearch replaces the cascaded pipeline with an end-to-end architecture that integrates a Search Generator and a Video Encoder. The Generator produces semantic identifiers of relevant items given a user query, while the Video Encoder learns latent item embeddings and provides their tokenized representations. A unified training framework jointly optimizes both components, enabling mutual enhancement and improving representation quality and generation accuracy. Furthermore, we introduce Search Preference Optimization (SPO), which leverages a reward model and real user feedback to better align generation with user preferences. Extensive experiments on industrial-scale datasets, together with online A/B testing in both short-video and live search scenarios, demonstrate the strong effectiveness and deployment potential of UniSearch. Notably, its deployment in live search yields the largest single-experiment improvement in recent years of our product's history, highlighting its practical value for real-world applications.

1. Bibliographic Information

1.1. Title

UniSearch: Rethinking Search System with a Unified Generative Architecture

1.2. Authors

Jiahui Chen†, Xiaoze Jiang*†, Zhibo Wang†, Quanzhi Zhu†, Junyao Zhao†, Feng Hu†, Kang Pan, Ao Xie, Maohua Pei, Zhiheng Qin, Hongjing Zhang, Zhixin Zhai, Xiaobo Guo, Runbin Zhou, Kefeng Wang, Mingyang Geng, Cheng Chen, Jingshan Lv, Yupeng Huang, Xiao Liang, Han Li

The authors are primarily affiliated with Kuaishou Technology, Beijing, China, indicating a strong industry research focus on practical applications and large-scale deployment within a real-world product context.

1.3. Journal/Conference

Preprint Version of UniSearch, Beijing, China. ACM. The paper is published as a preprint, meaning it has not yet undergone formal peer review for a specific conference or journal. However, the ACM citation format indicates an intention or submission to an ACM-affiliated venue. Preprints are common in fast-moving fields like AI/ML for rapid dissemination of research findings.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces UniSearch, a unified generative search framework developed for Kuaishou Search. Modern search systems traditionally use a cascaded architecture (recall, pre-ranking, ranking), which is complex and often leads to performance inconsistencies due to distinct optimization objectives across modules. UniSearch addresses this by proposing an end-to-end architecture that integrates a Search Generator and a Video Encoder. The Generator creates semantic identifiers (SIDs) of relevant items from user queries, while the Video Encoder learns latent item embeddings and tokenizes them into SIDs. A unified training framework jointly optimizes both components, enhancing representation quality and generation accuracy. Additionally, UniSearch employs Search Preference Optimization (SPO) which uses a reward model and real user feedback to align generated results with user preferences. Extensive offline experiments and online A/B tests on Kuaishou's industrial-scale short-video and live search systems demonstrate UniSearch's effectiveness. Notably, its deployment in live search achieved the largest single-experiment improvement in the product's recent history, underscoring its practical value.

1.6. Original Source Link

https://arxiv.org/abs/2509.06887v2
PDF Link: https://arxiv.org/pdf/2509.06887v2.pdf
This paper is available as a preprint on arXiv. The 'v2' in the link indicates that it is the second posted version.

2. Executive Summary

2.1. Background & Motivation

Core Problem: Traditional search systems, widely used in industrial applications, predominantly rely on a multi-stage cascaded architecture (MCA). This architecture typically involves sequential stages like recall, pre-ranking, and ranking. While refined over time, this design suffers from fundamental issues:

Objective Misalignment: Each stage uses distinct models optimized with separate objectives, leading to misaligned training signals across the entire pipeline. This prevents optimal end-to-end relevance modeling and limits overall performance.
Complexity and Overhead: The multi-stage structure introduces substantial inference latency and high maintenance overhead, as each stage requires dedicated algorithmic design, development, and system support.

Why this problem is important: In large-scale industrial search applications like Kuaishou, efficient and highly relevant information retrieval is crucial for user satisfaction and business metrics. The limitations of MCA directly impact the user experience, development costs, and the system's ability to adapt and improve holistically.

Challenges in extending generative models to search: While generative recommendation has shown promise in replacing cascaded pipelines, existing approaches are often not truly end-to-end. They typically separate item encoding (tokenization) from generation, leading to objective inconsistency and limited generalization. Furthermore, applying generative models to search is more complex than recommendation because search involves cross-domain generation (text query to item), whereas recommendation often stays intra-domain (user/item embeddings to item). Unified generative methods that cover the full cascaded search architecture remain underexplored.

Paper's Innovative Idea: The paper proposes UniSearch, a unified generative search framework that replaces the entire cascaded search pipeline with a single end-to-end architecture. It addresses the limitations of previous generative approaches by jointly optimizing item encoding and query-to-item generation within a unified training framework, directly bridging the gap between text queries and item representations.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of generative search:

Unified End-to-End Generative Search Solution: UniSearch is presented as a deployed, industrial-scale generative search solution that replaces the traditional multi-stage cascaded architecture. It demonstrates superior retrieval quality and computational efficiency compared to its traditional counterparts. This marks a significant step towards fully unified search systems.
Novel Unified Training Framework: The paper introduces a unified training framework that seamlessly integrates item tokenization and generative modeling. This contrasts with prior methods that separate these tasks, leading to objective inconsistency. UniSearch's approach ensures semantic consistency and mutual enhancement between video encoding and query generation, resulting in more effective item representations and higher-quality generated results. Key components of this framework include:
- Residual Contrastive Learning with Coarse-to-Fine Strategy for semantic alignment between queries and videos.
- VQ-VAE for end-to-end discretization of latent embeddings into semantic IDs during training.
- Reject-Sampling Generative Training to prioritize high-quality item generation.
Search Preference Optimization (SPO) for Alignment: UniSearch incorporates an online post-training phase called Search Preference Optimization (SPO). This mechanism leverages a reward model combining system-estimated signals and real user feedback (e.g., clicks, watch time) to refine the generation policy. This ensures that the generated results are better aligned with actual user preferences.
Industrial-Scale Deployment and Validation: UniSearch has been successfully deployed in Kuaishou's live and short-video search systems, serving hundreds of millions of active users. Large-scale online A/B experiments confirm its practical effectiveness and efficiency.
- Notably, its deployment in live search resulted in the largest single-experiment improvement in the product's recent history, significantly boosting metrics like Total Play Counts (TPC).
- The framework demonstrates enhanced performance on long-tail queries and new users, and produces richer and more diverse results, addressing common challenges in large-scale search.

3.1. Foundational Concepts

Cascaded Architecture in Search Systems: This is a traditional multi-stage pipeline used in most industrial search engines. It processes a search query through several sequential filtering steps to narrow down a vast item corpus to a relevant few.
- Recall (Retrieval): The initial stage, responsible for efficiently retrieving a large set of candidate items (e.g., thousands or millions) that are broadly relevant to the query. This is often done using inverted indices or dual-encoder models.
- Pre-ranking: A filtering stage that further reduces the candidate set (e.g., to hundreds) for efficiency, using computationally lighter models than ranking.
- Ranking: The final stage, employing complex and computationally expensive models to precisely order the remaining candidates for presentation to the user, optimizing for relevance, diversity, and other quality metrics.
Generative Models: Models capable of generating new data instances that resemble the training data. In the context of this paper, it refers to models that generate semantic identifiers of items based on a query, effectively performing retrieval as a generation task.
- Transformer: A neural network architecture introduced in 2017, foundational to modern Large Language Models (LLMs). It relies heavily on self-attention mechanisms to process sequences, allowing it to weigh the importance of different parts of the input sequence when processing each element.
- Encoder-Decoder Paradigm: A common architecture for sequence-to-sequence (seq2seq) tasks. An encoder processes the input sequence (e.g., query) into a context vector, and a decoder generates an output sequence (e.g., item IDs) based on this context.
- Autoregressive Generation: A process where each token in the output sequence is generated conditioned on the previously generated tokens and the input.
Vector Quantized Variational Autoencoder (VQ-VAE): A type of generative model that learns discrete latent representations (a codebook) for data. It's used here to convert continuous video embeddings into semantic identifiers.
- Latent Embeddings: Numerical representations (vectors) that capture the underlying semantic information of an item (e.g., a video). These are typically high-dimensional.
- Codebook: In VQ-VAE, a discrete set of learned vectors. When a continuous latent embedding is generated, it is "quantized" by finding the closest vector in this codebook, and the index of that closest vector becomes the semantic ID.
- Semantic Identifiers (SIDs): Discrete tokens or indices representing the meaning or characteristics of an item. These are what the Search Generator ultimately produces.
Contrastive Learning: A self-supervised learning technique where the model learns to pull similar samples closer together in the embedding space and push dissimilar samples further apart. It requires defining positive and negative pairs.
Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In this paper, it's used for Search Preference Optimization (SPO) to align generated results with user feedback.
- Reward Model: A model that assigns a numerical reward score to an action or outcome, guiding the RL agent towards desired behaviors.
- Policy: In RL, the policy is the strategy that the agent uses to determine its next action based on the current state. Here, it refers to the UniSearch model's generation behavior.
Trie (Prefix Tree): A tree-like data structure that efficiently stores and retrieves a dynamic set of strings or sequences, organized by their prefixes. It's used here to constrain the generation of semantic IDs to only valid sequences.
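To make the role of the Trie more concrete, the following is a minimal sketch of how a prefix tree could restrict autoregressive decoding to semantic-ID sequences that correspond to real items. The class name `SidTrie`, the helper `build_trie`, and the toy SID sequences are hypothetical and chosen only for illustration; the source does not describe the production Trie server at this level of detail.

```python
from typing import Dict, Iterable, List, Sequence


class SidTrie:
    """Prefix tree over valid semantic-ID (SID) sequences (illustrative only)."""

    def __init__(self) -> None:
        # Maps the next SID token to the sub-trie that follows it.
        self.children: Dict[int, "SidTrie"] = {}

    def insert(self, sid_sequence: Sequence[int]) -> None:
        """Register one item's SID sequence as a valid path."""
        node = self
        for token in sid_sequence:
            node = node.children.setdefault(token, SidTrie())

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Return the tokens that may follow `prefix`; empty if the prefix is invalid."""
        node = self
        for token in prefix:
            if token not in node.children:
                return []
            node = node.children[token]
        return sorted(node.children)


def build_trie(sequences: Iterable[Sequence[int]]) -> SidTrie:
    """Build a trie from an iterable of SID sequences."""
    trie = SidTrie()
    for seq in sequences:
        trie.insert(seq)
    return trie


# Toy index with three items, each described by k = 3 semantic IDs (made-up values).
example_trie = build_trie([(3, 1, 4), (3, 1, 5), (2, 7, 1)])
```

During decoding, a generator would mask out every token that is not returned by `allowed_next` for the current prefix, so that only sequences present in the index can be produced. For the toy index above, `example_trie.allowed_next([3, 1])` returns `[4, 5]`.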

3.2. Previous Works

The paper primarily positions itself against two main lines of previous work:

Traditional Cascaded Search Systems (MCA):
- Description: These systems (e.g., [8, 28, 29, 33, 36]) involve recall, pre-ranking, and ranking stages. Modern versions often use BERT-based discriminative models [3] with dual-tower architectures for recall (contrastive learning) and pre-ranking (post-interaction), and single-tower fully interactive architectures for fine-ranking (point-wise learning).
- Relevance to UniSearch: UniSearch aims to replace this entire MCA with a single generative model, unifying objectives and improving end-to-end consistency. The paper highlights the misaligned training signals and high maintenance overhead as key drawbacks of MCA that UniSearch intends to solve.
Existing Generative Retrieval and OneModels:
- Description: This newer paradigm (e.g., [10-12, 15, 17, 19-22, 26, 34, 37]) formulates retrieval as a text generation task, where document identifiers or passages are generated. In recommendation, OneRec [2, 40], OneSug [5], and other "OneModel" approaches [25, 27, 39] have explored replacing cascaded pipelines with generative models.
- Relevance to UniSearch: UniSearch builds upon this trend but points out a critical limitation: most existing generative approaches are not truly end-to-end. They often involve two separate training steps:
  1. Training an item encoder to produce embeddings and semantic tokens (e.g., using VQ-VAE, FSQ [16], RQ-Kmeans).
  2. Separately training a generator to predict these tokens.
  This discrepancy between tokenization and generation objectives leads to inconsistent optimization and suboptimal generation quality. Furthermore, many rely on pre-tokenized candidate items, limiting generalization to unseen content. Some studies only apply generative models to early stages like recall [15, 31], not the full architecture.

3.3. Technological Evolution

Search technology has evolved from simple inverted indexing to sophisticated deep neural models. The shift from lexical matching to semantic understanding has been driven by advances in natural language processing (NLP) and deep learning. BERT [3] and other Transformer-based models [9, 18, 23, 30] have enabled powerful semantic matching and relevance modeling.

Recently, the success of Large Language Models (LLMs) and generative AI has inspired a new paradigm: generative retrieval. Instead of just matching or ranking items, systems can now generate identifiers for relevant items directly. This represents a move towards more unified and intelligent information access. UniSearch contributes to this evolution by proposing a fully end-to-end generative framework for search, aiming to overcome the inherent limitations of multi-stage systems.

3.4. Differentiation Analysis

Compared to the main methods in related work, UniSearch's core differences and innovations are:

Unified End-to-End Architecture for Full Search Pipeline: Unlike traditional MCA systems that use distinct models for recall, pre-ranking, and ranking, UniSearch replaces the entire pipeline with a single Search Generator and Video Encoder. This aims to resolve the objective misalignment and maintenance overhead of MCA.
Joint Optimization of Tokenization and Generation: A key differentiator from existing generative recommendation and retrieval methods (e.g., OneRec, OneSug, or those using VQ-VAE or RQ-Kmeans as separate steps) is UniSearch's unified pre-training framework. It jointly optimizes video encoding (which includes discretization into semantic IDs) and query-to-video generation. This eliminates the objective inconsistency and suboptimal generation quality often seen when these tasks are trained separately.
Residual Contrastive Learning with Coarse-to-Fine Strategy: UniSearch introduces residual contrastive learning combined with a coarse-to-fine optimization strategy. This novel approach allows the Video Encoder to learn robust recall capabilities through initial tokens and then progressively refine ranking and personalization with subsequent tokens. This is explicitly designed to address token redundancy and path collapse issues found in other discrete representation learning methods.
Online Reinforcement Learning for Preference Alignment (SPO): Beyond supervised pre-training, UniSearch integrates Search Preference Optimization (SPO). This post-training phase uses a reward model that combines both system-estimated signals and real user feedback (clicks, watch time) to continuously refine the model's generation policy. This ensures a stronger alignment with dynamic user preferences, a capability often lacking in purely offline-trained models.
Robustness in Search Context: While previous generative models might focus on recommendation (intra-domain mapping), UniSearch tackles the more challenging cross-domain generation from text queries directly to items in a search context, explicitly addressing the unique requirements and large-scale nature of industrial search.
Practical Deployment and Efficiency: The integration of Trie-based inference for constrained generation, TensorRT acceleration, and a KV-cache ensures that UniSearch is not only effective but also efficient and scalable for real-world industrial deployment, a critical aspect often overlooked in academic prototypes.

4. Methodology

The UniSearch framework aims to replace traditional cascaded search systems with a unified, end-to-end generative architecture. This section details its core components, training scheme, inference process, and deployment.

4.1. Principles

The core idea behind UniSearch is to treat the entire search process, from query understanding to item retrieval, as a single sequence generation task. Instead of multiple, loosely coupled models, UniSearch uses a single encoder-decoder generative model to directly produce semantic identifiers of relevant items given a user query. The theoretical basis is that by unifying the objectives of item representation learning and query-to-item generation, the system can achieve better end-to-end relevance modeling, overcome objective inconsistency, and reduce system complexity. The intuition is to enable the model to "speak the language" of items (via semantic IDs) based on the user's "language" (via query text).

4.2. Core Methodology In-depth (Layer by Layer)

UniSearch consists of two main components: a Search Generator and a Video Encoder, which are optimized together through a unified pre-training framework and further refined by Search Preference Optimization (SPO).

4.2.1. UniSearch Architecture

As illustrated in Figure 2 (a) from the original paper, UniSearch's architecture comprises the Search Generator and the Video Encoder.

The following figure (Figure 2 from the original paper) shows the architecture of UniSearch and its deployment:

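The figure is a schematic of UniSearch's model architecture and its online deployment and post-training workflow. The upper part shows the unified pre-training of the Search Generator and the Video Encoder, including the query input and the feature representations at different levels. The lower part depicts how UniSearch is applied and updated within the reinforcement learning training system, the Trie server, and the reward system.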

4.2.1.1. Search Generator

The Search Generator follows an encoder-decoder generative paradigm, similar to architectures like BART.

Encoder: This component is a bidirectional Transformer. It takes two main inputs:
1. The user query (textual input).
2. Auxiliary user features (e.g., GSU sequences representing personalized historical behaviors).
To capture the holistic semantics of a search request, a special $<cls>$ token is prepended to the input sequence. The representation corresponding to this $<cls>$ token, after being processed by the encoder, serves as the global query embedding $q$. This $q$ effectively summarizes the user's intent for the current search.
Decoder: This component receives the contextualized query representation $q$ from the encoder. It then autoregressively generates a sequence of tokens. These tokens are not raw text but rather semantic IDs (SIDs) of relevant videos. This process transforms a natural language query into a structured semantic representation of the desired search results.

4.2.1.2. Video Encoder

The Video Encoder is a unidirectional Transformer designed to learn latent embeddings and semantic IDs for each video item.

Input: Each video is represented by a combination of:
1. Textual metadata (e.g., title, description).
2. Multi-modal content features (e.g., visual features from video frames, audio features).
3. Side statistical features (e.g., view counts, engagement metrics).
These features are concatenated with $k$ learnable tokens and passed through the Transformer.
Output: The Transformer produces a sequence of $k$ latent embeddings: $D = \{ d^{(1)}, d^{(2)}, \ldots, d^{(k)} \}$ . These embeddings capture the semantic properties of the video in a continuous vector space.
Discretization with VQ-VAE: To convert these continuous latent embeddings into semantic identifiers (SIDs), the Video Encoder is augmented with a VQ-VAE module. This module maps the latent embeddings into a codebook space. The output is a sequence of semantic IDs: $S = \{ s^{(1)}, s^{(2)}, \ldots, s^{(k)} \}$ . This SID sequence provides a compact and interpretable representation of each video, crucial for indexing and retrieval within the generative framework.

4.2.2. Unified Pre-training

The unified pre-training framework is designed to enable the Search Generator and Video Encoder to learn cooperatively. It jointly optimizes video encoding and query-to-video generation, addressing the semantic consistency and task misalignment issues of prior methods.

Pre-training Data: Sampled from large-scale search logs, each record includes:
- query: The user's search query.
- user features: Contextual information about the user.
- candidate video set $\{d_1, d_2, \dots, d_N\}$ : Videos considered for the query (exposed and unexposed negatives).
- label set $\{l_1, l_2, \dots, l_N\}$ : Graded relevance annotations for each candidate, reflecting query-video relevance and video quality.

4.2.2.1. Residual Contrastive Learning for Semantic Alignment

This objective aligns the semantics of user queries (from the Generator encoder) with the latent representations of relevant items (from the Video Encoder).

Let $q$ be the query embedding from the Generator encoder.
Let $D_i$ be the latent representation of the $i$-th candidate video produced by the Video Encoder, with $d_i^{(n)}$ denoting its $n$-th residual embedding.

The residual contrastive learning formulation is defined by the following equations:

$ \mathcal{L}_{\mathrm{contrast}} = \sum_{n=1}^{k} \mathcal{L} \Big( q_i, \boldsymbol{\mathrm{sg}} \big[ \sum_{m < n} d_i^{(m)} \big] + d_i^{(n)} \Big) $

Where:

$\mathcal{L}_{\mathrm{contrast}}$ is the total contrastive loss.
$k$ is the number of latent embeddings (or levels in the codebook) for each video.
$n$ iterates from 1 to $k$ , representing each residual token.
$q_i$ is the query embedding for the $i$ -th query.
$\boldsymbol{\mathrm{sg}}(\cdot)$ denotes the stop-gradient operation. This means the gradient does not flow through the sum of previous residual embeddings, ensuring that each current residual $d_i^{(n)}$ focuses on learning new, complementary information.
$\sum_{m < n} d_i^{(m)}$ is the sum of previous residual embeddings for video $i$ .
$d_i^{(n)}$ is the current $n$ -th residual embedding for video $i$ .
The term $\boldsymbol{\mathrm{sg}} \big[ \sum_{m < n} d_i^{(m)} \big] + d_i^{(n)}$ represents the accumulated information up to the current $n$ -th residual, where the current residual $d_i^{(n)}$ is actively learned to contribute to the overall representation.
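The accumulation inside the loss can be illustrated with a small forward-only sketch. It computes the $k$ vectors that are compared against the query, one per residual level. Plain Python has no gradients, so the stop-gradient is only indicated in a comment; in an autograd framework the prefix sum would be detached before adding the current residual. The function name `accumulate_residuals` and the toy vectors are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def accumulate_residuals(residuals: List[Vector]) -> List[Vector]:
    """Return the cumulative sums d^(1), d^(1)+d^(2), ..., one per residual level.

    In training, the prefix (sum over m < n) would be wrapped in a stop-gradient
    so that only the current residual d^(n) receives gradient at level n.
    """
    running = [0.0] * len(residuals[0])
    accumulated = []
    for residual in residuals:
        # prefix (treated as a constant in training) + current residual
        running = [prefix + current for prefix, current in zip(running, residual)]
        accumulated.append(list(running))
    return accumulated


# Toy example with k = 3 residual embeddings of dimension 2 (made-up values).
levels = accumulate_residuals([[1.0, 0.0], [0.5, 0.5], [0.0, 0.25]])
```

Here `levels` is `[[1.0, 0.0], [1.5, 0.5], [1.5, 0.75]]`: each level refines the previous representation rather than replacing it, which is what lets later tokens specialize in finer distinctions.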

The specific contrastive loss function $\mathcal{L}(q, d)$ used for a query $q$ and a document (video) representation $d$ is:

$ \mathcal{L}(q, d) = - \log \frac{\exp\left( \mathrm{sim}(q, d) / \tau \right)}{\exp\left( \mathrm{sim}(q, d) / \tau \right) + \sum_{d^- \in N} \exp\left( \mathrm{sim}(q, d^-) / \tau \right)} $

Where:

$\mathrm{sim}(q, d)$ is a similarity function between the query $q$ and the document representation $d$ .
$\tau$ is the temperature parameter, which controls the sharpness of the probability distribution.
$N$ is the negative sample set, consisting of irrelevant document representations.
In contrast to conventional cosine similarity, the paper adopts L2 similarity: $\mathrm{sim}(q, d) = 1 - \| q - d \|_2^2$ . This choice is stated to better fit the residual accumulation mechanism and preserve the geometric structure of embeddings.
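As a worked illustration of the formula above, the sketch below evaluates the contrastive loss for one query with the L2-based similarity. The temperature value `tau=0.1` is an arbitrary assumption (the source does not state it), and the function names are hypothetical. A log-sum-exp shift is used only for numerical stability and does not change the result.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def l2_similarity(q: Vector, d: Vector) -> float:
    """sim(q, d) = 1 - ||q - d||_2^2, as described in the text."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(q, d))


def contrastive_loss(q: Vector, positive: Vector,
                     negatives: Sequence[Vector], tau: float = 0.1) -> float:
    """InfoNCE-style loss for one query; tau is an assumed temperature."""
    pos_logit = l2_similarity(q, positive) / tau
    logits = [pos_logit] + [l2_similarity(q, n) / tau for n in negatives]
    # Stable log of the denominator: log(sum(exp(logits))).
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_denominator - pos_logit


# A negative that coincides with the positive makes the two indistinguishable,
# so the loss equals log(2).
tie_loss = contrastive_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]])
```

Moving the negative further from the query lowers the loss toward zero, while moving the positive away from the query raises it.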

Coarse-to-Fine Strategy: This strategy mimics the cascaded ranking process of traditional search to strengthen semantic alignment.

The first residual token $d^{(1)}$ is trained for an "easy", coarse-grained matching task, ensuring strong recall capability. This is achieved by using "simple in-batch negatives" for the negative sample set $N$ .
Subsequent tokens $d^{(2)}, \ldots, d^{(k)}$ progressively handle more complex tasks, refining the representation for fine-grained ranking and personalization. For these tokens, increasingly hard negatives are sampled from semantically similar but irrelevant items. This staged learning helps decompose the task and prevents token redundancy, making efficient use of the codebook.

4.2.2.2. Discretization into Semantic IDs

The Video Encoder must discretize latent embeddings into semantic IDs for the generative model. VQ-VAE [24] is employed for this, allowing the codebook to be updated jointly during training, unlike post-clustering methods.

For each embedding $d_i^{(n)}$ , the VQ-VAE encoder performs a nearest-neighbor lookup in a learnable codebook to obtain a quantized embedding $e_i^{(n)}$ and its corresponding semantic ID $s_i^{(n)}$ . The sequence of quantized embeddings $E_i = \{ e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(k)} \}$ is then used to reconstruct the original embeddings $D_i$ .
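The nearest-neighbor lookup can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes one codebook per residual level (the source does not specify how codebooks are shared), and the names `quantize` and `tokenize_video` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return (semantic ID, quantized embedding) for the closest codebook entry."""
    index = min(range(len(codebook)),
                key=lambda j: squared_distance(embedding, codebook[j]))
    return index, codebook[index]


def tokenize_video(latents: Sequence[Vector],
                   codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Map k latent embeddings to k semantic IDs, one codebook per level (assumed)."""
    return [quantize(d, cb)[0] for d, cb in zip(latents, codebooks)]


# Toy codebook with three entries of dimension 2 (made-up values).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
sid, quantized = quantize([0.9, 1.2], toy_codebook)
```

In this toy case the embedding `[0.9, 1.2]` is closest to entry 1, so `sid` is `1` and `quantized` is `[1.0, 1.0]`.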

A codebook loss is introduced to ensure that $e_i^{(n)}$ approximates $d_i^{(n)}$ :

$ \mathcal{L}_{\mathrm{codebook}} = \sum_{n=1}^{k} \alpha_1 || \mathrm{sg}[d_i^{(n)}] - e_i^{(n)} ||_2^2 + \alpha_2 || d_i^{(n)} - \mathrm{sg}[e_i^{(n)}] ||_2^2 $

Where:

$\mathcal{L}_{\mathrm{codebook}}$ is the codebook loss.
$\mathrm{sg}(\cdot)$ is the stop-gradient operation.
$\alpha_1$ and $\alpha_2$ are hyperparameters balancing the terms.
The first term, $|| \mathrm{sg}[d_i^{(n)}] - e_i^{(n)} ||_2^2$ , is the codebook loss itself, pulling the quantized embedding $e_i^{(n)}$ towards the encoder output $d_i^{(n)}$ (with gradients only flowing to $e_i^{(n)}$ ).
The second term, $|| d_i^{(n)} - \mathrm{sg}[e_i^{(n)}] ||_2^2$ , is the commitment loss, ensuring that the encoder output $d_i^{(n)}$ stays close to the selected codebook vector $e_i^{(n)}$ (with gradients only flowing to $d_i^{(n)}$ ). The SimVQ strategy [41] is used to stabilize discretization and prevent codebook collapse.
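A forward-only sketch of this loss is given below. It reads the sum over $n$ as covering both terms, which is an interpretation of the formula rather than something the source states explicitly. The default weights `alpha1=1.0` and `alpha2=0.25` are illustrative assumptions, and SimVQ is not modeled here.

```python
from typing import Sequence

Vector = Sequence[float]


def codebook_loss(latents: Sequence[Vector], quantized: Sequence[Vector],
                  alpha1: float = 1.0, alpha2: float = 0.25) -> float:
    """Forward value of the codebook + commitment loss for one video.

    Both terms share the same numeric value ||d - e||^2. They differ only in
    where the gradient flows: the alpha1 term updates the codebook entry e
    (d is detached), and the alpha2 term updates the encoder output d
    (e is detached). Plain Python cannot express that distinction.
    """
    total = 0.0
    for d, e in zip(latents, quantized):
        distance = sum((a - b) ** 2 for a, b in zip(d, e))
        total += alpha1 * distance + alpha2 * distance
    return total


# One level, with latent [1, 0] quantized to [0, 0]: squared distance is 1.
example_loss = codebook_loss([[1.0, 0.0]], [[0.0, 0.0]])
```

With the assumed weights, `example_loss` evaluates to `1.25`. The split into two stop-gradient terms is what allows the codebook and the encoder to be pulled toward each other at different rates.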

4.2.2.3. Reject-Sampling Generative Training

Once items are mapped to semantic IDs, the Generator is trained to produce these sequences autoregressively using a next-token prediction objective with cross-entropy loss.

To enhance generation quality, reject-sampling strategies are used:

Low-quality items (identified by labels) are filtered out.
Losses for items of different quality levels are reweighted.

The next-token prediction loss is:

$ \mathcal{L}_{\mathrm{NTP}} = - w_i \sum_{n=1}^{k} \log p \big( s_i^{(n)} | q, u, s_i^{(<n)} \big) $

Where:

$\mathcal{L}_{\mathrm{NTP}}$ is the next-token prediction loss.
$w_i$ is a re-weight factor determined by the label $l_i$ of the item, ensuring the Generator prioritizes high-relevance, high-quality items.
$p \big( s_i^{(n)} | q, u, s_i^{(<n)} \big)$ is the probability of generating the $n$ -th semantic ID $s_i^{(n)}$ given the query $q$ , user context $u$ , and previously generated semantic IDs $s_i^{(<n)}$ .
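The reweighted objective can be illustrated with a short sketch. The label-to-weight table `LABEL_WEIGHTS` is entirely hypothetical (the source does not give the label scale or the weights); a weight of zero stands in for rejecting a low-quality item.

```python
import math
from typing import Dict, Sequence

# Hypothetical mapping from graded label to loss weight; 0.0 rejects the item.
LABEL_WEIGHTS: Dict[int, float] = {0: 0.0, 1: 0.5, 2: 1.0}


def weighted_ntp_loss(token_probs: Sequence[float], label: int) -> float:
    """-w_i * sum_n log p(s^(n) | q, u, s^(<n)) for one target item.

    `token_probs` holds the model probability of each correct semantic ID,
    in generation order.
    """
    weight = LABEL_WEIGHTS[label]
    if weight == 0.0:
        return 0.0  # rejected sample contributes nothing
    return -weight * sum(math.log(p) for p in token_probs)


# A k = 2 item whose tokens each receive probability 0.5.
high_quality = weighted_ntp_loss([0.5, 0.5], label=2)
rejected = weighted_ntp_loss([0.5, 0.5], label=0)
```

For the high-quality item the loss is $2 \log 2$, while the rejected item contributes zero, so gradient updates concentrate on items the labels mark as relevant and high quality.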

4.2.2.4. Overall Unified Pre-training Loss

The total pre-training loss combines these objectives:

$ \mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{contrast}} + \lambda_2 \mathcal{L}_{\mathrm{codebook}} + \lambda_3 \mathcal{L}_{\mathrm{NTP}} $

Where:

$\lambda_{1, 2, 3}$ are hyperparameters used to balance the loss scales and stabilize the training process. This joint optimization provides a robust foundation for subsequent training and inference.

4.2.3. Post-training with Search Preference Optimization (SPO)

After unified pre-training, UniSearch undergoes an online post-training phase called Search Preference Optimization (SPO) to further align its generation with user preferences. This is an online reinforcement learning framework conducted in a controlled online environment (see Figure 2 (b)).

4.2.3.1. Reward System

To leverage existing advancements in ranking models, UniSearch uses the production system's fine-ranking module as its Reward System.

Inputs: query, user context, and candidate video features.
Outputs: Multiple predictive scores covering query-item relevance, video quality, and expected user satisfaction. These form part of the reward signal.
User Interaction Feedback: Explicit feedback (clicks, watch time, likes, downloads) collected after videos are presented to users provides an additional reward signal.

The final reward score $R$ is a weighted combination of system-estimated scores and observed user interactions:

$ R = \gamma_1 R_{\mathrm{system}} + \gamma_2 R_{\mathrm{interaction}} $ Where:

$R_{\mathrm{system}}$ represents scores from the system's fine-ranking module.
$R_{\mathrm{interaction}}$ represents scores derived from user interaction feedback.
$\gamma_1$ and $\gamma_2$ are weights to balance the relative importance of these two types of signals.

4.2.3.2. Search Preference Optimization (SPO)

From online system logs, $G$ generated search result sequences $\{ S_1, S_2, \ldots, S_G \}$ are collected for a given query, along with their corresponding reward scores $\{ R_1, R_2, \ldots, R_G \}$ . Each sequence $S_i$ is composed of semantic IDs: $S_i = \{ s_i^{(1)}, s_i^{(2)}, \ldots, s_i^{(k)} \}$ . UniSearch assigns a generation probability $p_i = \pi_{\theta}(S_i \mid q, u)$ to each sequence, conditioned on the query $q$ and user context $u$ .

Following GRPO [14], the relative advantage of each candidate is computed by normalizing its reward against the expected reward of the batch:

$ A_i = \frac{R_i - \operatorname*{mean}(\{R_1, R_2, \ldots, R_G\})}{\operatorname*{std}(\{R_1, R_2, \ldots, R_G\})} $

Where:

$A_i$ is the relative advantage for the $i$ -th generated sequence.
$\operatorname*{mean}(\cdot)$ and $\operatorname*{std}(\cdot)$ are the mean and standard deviation of the rewards within the batch, respectively. This formulation ensures that the model prioritizes results better than the batch average and suppresses worse ones.
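The reward combination and the normalization step can be sketched together. The weights `gamma1` and `gamma2`, the use of the population standard deviation, and the small `eps` that guards against a zero denominator are all assumptions for illustration; the source does not specify them.

```python
import statistics
from typing import List, Sequence


def combine_reward(system_score: float, interaction_score: float,
                   gamma1: float = 0.5, gamma2: float = 0.5) -> float:
    """R = gamma1 * R_system + gamma2 * R_interaction (weights are assumed)."""
    return gamma1 * system_score + gamma2 * interaction_score


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Three candidate results for one query with made-up rewards.
advantages = group_advantages([1.0, 2.0, 3.0])
```

The result with the average reward gets an advantage of zero, the better one a positive value, and the worse one a negative value, which is the signal the policy update then amplifies or suppresses.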

The final SPO optimization objective $\mathcal{L}_{\mathrm{SPO}}(\theta)$ is defined as:

$ \mathcal{L}_{\mathrm{SPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{k} \sum_{n=1}^{k} \Bigl( \frac{\pi_{\theta}\bigl(s_i^{(n)} | q, u, s_i^{(<n)}\bigr)}{\pi_{\theta_{\mathrm{no-grad}}}\bigl(s_i^{(n)} | q, u, s_i^{(<n)}\bigr)} A_{\theta} - \beta \mathbb{D}_{\mathrm{KL}}[\pi_{\theta} || \pi_{\mathrm{ref}}] \Bigr) $ Where:

$\mathcal{L}_{\mathrm{SPO}}(\theta)$ is the Search Preference Optimization loss for model parameters $\theta$ .
$G$ is the number of generated search results in the batch.
$k$ is the length of the semantic ID sequence.
$\pi_{\theta}(s_i^{(n)} | q, u, s_i^{(<n)})$ is the probability of generating the $n$ -th semantic ID $s_i^{(n)}$ by the current policy (model) $\pi_{\theta}$ .
$\pi_{\theta_{\mathrm{no-grad}}}(s_i^{(n)} | q, u, s_i^{(<n)})$ is the probability of the same token under a gradient-detached copy of the policy, $\pi_{\theta_{\mathrm{no-grad}}}$ , typically the policy before the current update (gradients are stopped, hence no-grad). It is distinct from the KL reference policy $\pi_{\mathrm{ref}}$ . The ratio of the current policy to this detached copy is the importance ratio used in policy-gradient methods such as Proximal Policy Optimization (PPO).
$A_{\theta}$ is the relative advantage $A_i$ (from the previous equation), reflecting how much better the generated sequence is compared to the average.
The first term, $\frac{\pi_{\theta}(s_i^{(n)} | \ldots)}{\pi_{\theta_{\mathrm{no-grad}}}(s_i^{(n)} | \ldots)} A_{\theta}$ , aims to maximize the likelihood of preference-aligned generations weighted by their relative advantages.
$\beta$ is a coefficient that controls the strength of the KL regularization.
$\mathbb{D}_{\mathrm{KL}}[\pi_{\theta} || \pi_{\mathrm{ref}}]$ is the Kullback-Leibler (KL) divergence between the current policy $\pi_{\theta}$ and the reference policy $\pi_{\mathrm{ref}}$ (which is the pre-trained model's policy). This term regularizes the updated policy to prevent it from deviating too much from the pre-trained model, ensuring stability and avoiding catastrophic deviation.
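
To make the structure of the objective concrete, the sketch below evaluates its scalar value from precomputed per-token quantities. It is an illustration under assumptions, not the authors' implementation: the function name `spo_loss`, the default `beta`, and the choice to pass the KL term as a per-token estimate are all hypothetical, because the source does not say how the KL divergence is estimated. No automatic differentiation is involved; in a training framework the ratio would evaluate to 1 when the detached copy equals the current policy, while still carrying gradients.

```python
def spo_loss(new_probs, old_probs, advantages, kl_per_token, beta=0.01):
    """Evaluate the SPO objective for one group of G sequences.

    new_probs[i][n]    : current-policy probability of token n of sequence i
    old_probs[i][n]    : probability under the gradient-detached copy
    advantages[i]      : group-normalized advantage of sequence i
    kl_per_token[i][n] : KL estimate against the reference policy (assumed given)
    beta               : KL weight; the default is a placeholder, not a paper value
    """
    num_sequences = len(new_probs)
    total = 0.0
    for p_new, p_old, adv, kl in zip(new_probs, old_probs, advantages, kl_per_token):
        k = len(p_new)
        inner = sum((pn / po) * adv - beta * d
                    for pn, po, d in zip(p_new, p_old, kl))
        total += inner / k          # average over the k semantic-ID tokens
    return -total / num_sequences   # average over the G sequences, then negate


# Hypothetical group: two sequences of k = 2 tokens with opposite advantages.
loss = spo_loss(
    new_probs=[[0.5, 0.5], [0.2, 0.4]],
    old_probs=[[0.5, 0.5], [0.2, 0.4]],
    advantages=[1.0, -1.0],
    kl_per_token=[[0.0, 0.0], [0.0, 0.0]],
)
print(loss)
```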

4.2.4. Inference and Deployment

To ensure efficiency and scalability in real-world deployment, UniSearch employs several techniques:

Trie-based Inference: During inference, the generation of semantic ID sequences is constrained by a prefix tree (Trie) structure. This Trie stores all valid semantic ID sequences corresponding to actual candidate items. At each decoding step, the Generator queries a dedicated Trie service to retrieve feasible continuation tokens, eliminating invalid semantic IDs. This significantly reduces computational overhead and guarantees that generated results map to valid items. The Trie is optimized by combining prefix-tree search with binary search to handle large semantic ID spaces and alleviate memory overhead.
Deployment Architecture (Figure 2 (b)): UniSearch integrates four key components into a unified system in Kuaishou's production environment:
1. Reinforcement Learning (RL) Training System: Continuously fine-tunes UniSearch with online preference alignment (SPO) and synchronizes updated model parameters to the inference service.
2. Online Inference System: Serves real-time search requests. Models are deployed using TensorRT (an NVIDIA platform for high-performance deep learning inference) and accelerated by key-value (KV) caching (a technique to store intermediate computations in Transformer decoders to speed up subsequent token generation), ensuring low-latency responses.
3. Trie Server: Provides efficient constrained generation by maintaining valid semantic ID paths. It is dynamically updated to reflect changes in the item corpus (e.g., new live streams starting/ending).
4. Reward System: Evaluates candidate results using fine-grained ranking scores and user interaction signals, feeding back preference-aligned rewards to guide further RL training.
  
  This end-to-end design allows for scalable, low-latency inference and continuous improvement of search quality through online learning and preference optimization.
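
The Trie-constrained decoding described above can be sketched as follows. This is an in-process illustration only: the source describes a dedicated Trie service, and the sorted-array layout, the class name `SortedPathIndex`, and the greedy decoder with a user-supplied `score_fn` are assumptions chosen to show one way of combining prefix lookup with binary search.

```python
from bisect import bisect_left


class SortedPathIndex:
    """Valid semantic-ID paths stored as a sorted list of integer tuples."""

    def __init__(self, paths):
        self.paths = sorted({tuple(p) for p in paths})

    def next_tokens(self, prefix):
        """Return the tokens that can legally follow `prefix`."""
        prefix = tuple(prefix)
        depth = len(prefix)
        i = bisect_left(self.paths, prefix)   # first path >= prefix
        tokens = []
        while i < len(self.paths):
            path = self.paths[i]
            if path[:depth] != prefix:
                break                         # left the block sharing this prefix
            if len(path) == depth:
                i += 1                        # prefix is itself a complete path
                continue
            tok = path[depth]
            tokens.append(tok)
            # Binary-search past every path that continues with `tok`.
            i = bisect_left(self.paths, prefix + (tok + 1,), i)
        return tokens


def constrained_greedy_decode(index, score_fn, length):
    """Pick the best-scoring feasible token at each step (greedy, not beam)."""
    prefix = ()
    for _ in range(length):
        allowed = index.next_tokens(prefix)
        if not allowed:
            break
        prefix += (max(allowed, key=lambda t: score_fn(prefix, t)),)
    return prefix


# Hypothetical corpus of four items, each with a 3-token semantic ID.
index = SortedPathIndex([(1, 2, 3), (1, 2, 4), (1, 5, 0), (7, 0, 0)])
print(index.next_tokens((1,)))
print(constrained_greedy_decode(index, lambda prefix, tok: tok, length=3))
```

Because every decoding step only chooses among tokens returned by `next_tokens`, any full-length output is one of the stored paths. Updating the corpus in this sketch means rebuilding or re-sorting the list; a production service would need an incremental update path.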

5. Experimental Setup

5.1. Datasets

The experiments are conducted on large-scale industrial datasets derived from real search logs of the Kuaishou App.

Live Search Dataset:
- Scale: Approximately 80 million user search sessions.
- Period: July 9 to July 23, 2025.
- Training Data: Logs from July 9 to July 22.
- Test Set: Logs from July 23 (held-out).
- Purpose: All offline experiments in Sections 4.2 and 4.3 of the original paper are based on this dataset. The candidate pool for live search contains around 500K live sessions.
Short-Video Search Dataset:
- Purpose: Used in Section 4.4 of the original paper to validate UniSearch's generalization beyond live scenarios.
- Scale: The candidate pool for short-video search is much larger, roughly one billion videos.
  
  These datasets are representative of real-world industrial search scenarios, providing a robust environment for evaluating the model's effectiveness and scalability. The data includes queries, user features, candidate video sets, and labeled relevance information.

5.2. Evaluation Metrics

The paper uses two primary evaluation metrics: Recall@300 and Mean Reciprocal Rank (MRR). Experiments are conducted on both a ranking subset (RK) and a click subset (CK).

5.2.1. Recall@300

Conceptual Definition: Recall@300 assesses the model's ability to retrieve relevant results within the top 300 generated (or retrieved) items. It focuses on the completeness of relevant item retrieval, indicating whether a significant portion of known relevant items are present among the highly ranked results.

Mathematical Formula: $ \mathrm{Recall@300} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i}{P_i} $

Symbol Explanation:

$N$ : The total number of samples (queries) in the dataset.
$i$ : An index iterating through each sample (query).
$T_i$ : The number of true positive instances among the top 300 retrieved results for the $i$ -th sample. A true positive instance is one that is both predicted by the model as relevant (within the top 300) and labeled as truly relevant.
$P_i$ : The total number of positive (truly relevant) instances in the dataset for the $i$ -th sample.
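
The formula can be computed directly from ranked results and labeled positives. The sketch below is a minimal illustration; the function name `recall_at_k` and the decision to skip queries with no labeled positives (which would otherwise divide by zero) are assumptions, not details from the paper.

```python
def recall_at_k(ranked_lists, relevant_sets, k=300):
    """Average, over queries, of (relevant items in the top k) / (all relevant items)."""
    per_query = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        if not relevant:
            continue  # assumption: queries without positives are excluded
        hits = len(set(ranked[:k]) & set(relevant))
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy example with k = 2 so the cutoff is visible.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"a", "c"}, {"z"}]
print(recall_at_k(ranked, relevant, k=2))
```

In the toy example the first query retrieves one of its two positives within the cutoff and the second retrieves none, so the average is 0.25.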

5.2.2. Mean Reciprocal Rank (MRR)

Conceptual Definition: MRR measures the ranking ability of the model. It's particularly useful for tasks where there is only one or a few correct answers, and the goal is to rank them as highly as possible. The Reciprocal Rank is the inverse of the rank of the first relevant item found. If the first relevant item is at position 1, the Reciprocal Rank is $1/1 = 1$ . If it's at position 3, it's $1/3$ . MRR is the average of these Reciprocal Ranks across all queries. A higher MRR indicates better ranking performance, meaning relevant items are found earlier in the ranked list.

Mathematical Formula: $ \mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K_i} $

Symbol Explanation:

$N$ : The total number of samples (queries) in the dataset.
$i$ : An index iterating through each sample (query).
$K_i$ : The position (rank) of the first ground-truth relevant result for the $i$ -th sample in the ranked list of candidates. If no relevant item is found, $K_i$ is typically considered $\infty$ , making $1/K_i = 0$ .
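
A short sketch of the MRR computation, following the convention above that a query with no relevant result contributes 0. The function name `mean_reciprocal_rank` and the toy data are illustrative assumptions.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1 / (rank of the first relevant item); 0 if none is found."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_lists)


# First relevant item at rank 1, at rank 3, and missing entirely.
ranked = [["a", "b"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"a"}, {"z"}, {"r"}]
print(round(mean_reciprocal_rank(ranked, relevant), 4))
```

The three queries contribute 1, 1/3, and 0, giving an MRR of about 0.4444.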

5.2.3. Test Subsets

Ranking Subset (RK): This test set includes the top videos exposed to users by the search system. It is used to evaluate the alignment between generated videos and system preferences, reflecting how well the model can reproduce the system's intended ranking.
Click Subset (CK): This test set considers clicked items as positives. It demonstrates immediate user preferences and is used to evaluate how well the model aligns with actual user engagement.

5.3. Baselines

To verify the effectiveness of UniSearch, it is compared against existing generative search frameworks. These baselines are structured as two-stage pipelines, reflecting the common practices that UniSearch aims to improve upon.

The baselines are generally composed of:

An Embedding Model: Typically BERT-based variants (e.g., BERT-6 Layer, BERT-12 Layer, BERT-24 Layer). This model is trained to produce item embeddings.
A Codebook Construction Method: Used to tokenize the item embeddings into discrete semantic IDs. Methods include FSQ [16], VQ-VAE [24], or RQ-Kmeans.
A Generative Model: Typically BART-based variants (e.g., BART-6 Layer, BART-12 Layer, BART-24 Layer). This model is trained separately to generate semantic ID sequences based on queries.

The paper reproduces these approaches for fair comparison, with results reported in Table 1. The key point of these baselines is their reliance on multiple models and inconsistent training objectives, which UniSearch explicitly addresses.

5.4. Details of UniSearch Implementation

Architecture:
- Search Generator: Instantiated using BART [9] architecture.
- Video Encoder: Instantiated using BERT [3] architecture.
Live Search Configuration (smaller scale):
- Model Size: Lightweight 6-layer BART and BERT models.
- Hidden Size: 768.
- Codebook: Three-level codebook ( $k=3$ ), with each level containing 512 entries.
Short-Video Search Configuration (larger scale):
- Model Size: 12-layer UniSearch configuration (indicating larger BART and BERT components).
- Codebook: Three-level codebook ( $k=3$ ), with 8192 entries per level. This larger codebook is needed due to the candidate pool of approximately one billion videos.
Optimizer: Adam optimizer.
Batch Size: 64.
Learning Rates:
- Pre-training: Initial learning rate of $2 \times 10^{-4}$ , scheduled with a cosine decay strategy.
- Online Post-training (SPO): Fixed learning rate of $1 \times 10^{-5}$ .
Inference:
- Beam search with a beam size of 256 to generate high-quality semantic IDs. This balances computational efficiency with the ability to explore diverse candidate items.
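
The codebook configurations above can be checked with simple arithmetic: a codebook of depth $k$ with $w$ entries per level can address at most $w^k$ distinct semantic-ID paths. The sketch below compares that upper bound with the approximate candidate pool sizes quoted in the source; the function name `codebook_capacity` is hypothetical, and the pool sizes are the rounded figures from the text.

```python
def codebook_capacity(entries_per_level: int, depth: int) -> int:
    """Upper bound on the number of distinct semantic-ID paths."""
    return entries_per_level ** depth


live_paths = codebook_capacity(512, 3)     # live search configuration
video_paths = codebook_capacity(8192, 3)   # short-video configuration

live_pool = 500_000            # approximate, from the source
video_pool = 1_000_000_000     # approximate, from the source

print(live_paths, round(live_paths / live_pool, 1))
print(video_paths, round(video_paths / video_pool, 1))
```

Under these figures both configurations leave a few hundred addressable paths per candidate. This is only an upper bound on distinct paths; it says nothing about how evenly items are spread across them.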

6. Results & Analysis

6.1. Core Results Analysis

The paper presents extensive offline and online experiments to validate UniSearch's effectiveness.

6.1.1. Offline Performance Comparison

The following are the results from Table 1 of the original paper:

	Model			Recall@300(%)↑		MRR(%)↑
	Embed. Model	Codebook	Gen. Model	RK	CK	RK	CK
Baselines	BERT-6 Layer	FSQ	BART-6 Layer	55.42	62.26	8.36	9.93
	BERT-6 Layer	VQ-VAE	BART-6 Layer	64.83	63.67	7.98	9.50
	BERT-6 Layer	RQ-Kmeans	BART-6 Layer	65.56	64.12	9.24	14.81
	BERT-12 Layer	RQ-Kmeans	BART-12 Layer	67.79	67.90	12.13	15.68
	BERT-24 Layer	RQ-Kmeans	BART-24 Layer	69.05	69.27	13.84	16.02
UniSearch	UniSearch-6 Layer			67.45	68.17	9.65	15.36
	UniSearch-12 Layer			68.83	70.11	12.35	15.73
	UniSearch-24 Layer			69.78	72.03	14.62	16.80
	UniSearch-6 Layer (w/ Online SPO)			68.29	68.48	12.19	16.04
	UniSearch-12 Layer (w/ Online SPO)			69.53	70.25	13.56	16.71

Advantages of Unified Pre-training:

Consistent Outperformance: UniSearch consistently outperforms baseline methods across all model scales and metrics (Recall@300 and MRR on both RK and CK test sets).
Efficiency at Smaller Scale: UniSearch-6 Layer (MRR of 15.36% on CK) surpasses the strongest 6-layer baseline (RQ-Kmeans, 14.81% on CK), and its Recall@300 (67.45% on RK, 68.17% on CK) is close to that of the 12-layer baseline (67.79% on RK, 67.90% on CK). This indicates that UniSearch's unified objective enables more efficient learning and better query understanding and generation quality even with fewer parameters.
Strong Scalability: As model size increases (UniSearch-12 Layer, UniSearch-24 Layer), UniSearch continues to show substantial improvements over baselines of equivalent or even larger scale. For instance, UniSearch-24 Layer achieves the highest MRR of 16.80% on CK, demonstrating the framework's robustness and scalability.

Effect of Search Preference Optimization (SPO):

MRR Boost: The SPO post-training phase yields notable improvements, particularly in MRR. For UniSearch-6 Layer, MRR on RK increases from 9.65% to 12.19% (+2.54 percentage points), and on CK from 15.36% to 16.04% (+0.68 percentage points). Similar gains are observed for the 12-layer model.
Preference Alignment: This improvement in MRR suggests that SPO successfully guides the model to generate results that are not only relevant but also better aligned with user preferences, prioritizing items that are more likely to be favored by users. Recall@300 also sees moderate gains, but MRR's larger increase highlights the quality and ranking benefits of SPO.
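
The SPO gains can be recomputed from the Table 1 rows for the two model sizes that were post-trained. The snippet below is a small arithmetic check; the dictionary layout and the name `spo_gains` are illustrative, while the numbers are copied from Table 1.

```python
# (Recall@300 RK, Recall@300 CK, MRR RK, MRR CK), in percent, from Table 1.
TABLE1 = {
    "UniSearch-6 Layer": {
        "base": (67.45, 68.17, 9.65, 15.36),
        "spo": (68.29, 68.48, 12.19, 16.04),
    },
    "UniSearch-12 Layer": {
        "base": (68.83, 70.11, 12.35, 15.73),
        "spo": (69.53, 70.25, 13.56, 16.71),
    },
}


def spo_gains(row):
    """Percentage-point change from adding Online SPO, per metric."""
    return tuple(round(s - b, 2) for b, s in zip(row["base"], row["spo"]))


for name, row in TABLE1.items():
    print(name, spo_gains(row))
```

For both sizes the MRR gains exceed the Recall@300 gains, which is the pattern the analysis above attributes to preference alignment.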

6.1.2. Online A/B Testing Results

The paper reports the results of online A/B tests conducted on Kuaishou's live-streaming and short-video search systems.

The following are the results from Table 3 of the original paper:

Scenario	TPC ↑	CTR ↑	CQR ↓	FCP ↓
UniSearch - Live	+3.31%	+0.202%	-0.382%	-0.107%
Scenario	VPD ↑	PVD ↑	CQR ↓	LPC ↑
UniSearch - Video	+0.213%	+0.993%	-0.602%	+0.830%

Live-streaming Search Scenario:

Total Play Counts (TPC): +3.31% increase. This is highlighted as the most critical metric, directly reflecting search scale and platform vitality. The paper describes this gain as the largest single-experiment improvement in the product's recent history.
Page Click-Through Rates (CTR): +0.202% increase, indicating stronger user engagement.
Change Query Rates (CQR): -0.382% decrease, suggesting that UniSearch generates more satisfactory results, leading to fewer query reformulations.
Position of First-Click (PFC): -0.107% decrease (meaning the first click occurs higher up in the results list), indicating better result relevance and ranking.

Short-video Search Scenario:

Video Playback Duration (VPD): +0.213% improvement.
Page View Depth (PVD): +0.993% increase.
Change Query Rates (CQR): -0.602% reduction.
Long Play Count per PV (LPC): +0.830% increase.

These metrics collectively suggest that UniSearch not only satisfies immediate search needs but also encourages users to explore a wider variety of content, leading to more diversified interests and deeper engagement.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Effectiveness of Coarse-to-Fine Strategy and Residual Contrastive Learning

The paper conducts an ablation study to analyze the contributions of Coarse-to-Fine (CF) strategy and Residual Contrastive Learning (RCL).

The following are the results from Table 2 of the original paper:

Method	Recall@300(%) ↑		MRR(%) ↑
Method	RK	CK	RK	CK
UniSearch-Plain	63.54	63.15	6.55	12.16
UniSearch-Plain w/ CF	64.63	65.61	7.01	12.97
UniSearch-Plain w/ RCL	66.18	65.41	8.59	14.71
UniSearch	67.45	68.17	9.65	15.36

UniSearch-Plain: This variant treats each token independently without CF or RCL. It shows the weakest performance across all metrics (e.g., MRR of 12.16% on CK). This validates the need for structured learning and token diversity.
UniSearch-Plain w/ CF: Adding Coarse-to-Fine (CF) strategy improves performance (e.g., MRR of 12.97% on CK). This indicates that gradually learning from easier to more complex objectives enhances ranking capability. However, without RCL, token representations can become similar, leading to path collapsing.
UniSearch-Plain w/ RCL: Incorporating Residual Contrastive Learning (RCL) provides a substantial boost, especially in MRR (14.71% on CK). RCL encourages different tokens to capture complementary semantic aspects, mitigating path collapse and significantly boosting ranking precision.
Full UniSearch: The full model, integrating both CF and RCL, achieves the best performance (e.g., MRR of 15.36% on CK). The CF strategy provides a structured learning path, while RCL preserves token diversity. Their combination leads to more effective end-to-end discretization and superior overall search performance.

6.2.2. Ablation on Codebook Depth and Size

The paper analyzes the impact of codebook depth ( $k$ ) and codebook size ( $w$ ) on performance.

The following figure (Figure 3 from the original paper) shows the results of different Codebook Depth and Codebook Size on RK and CK test set.

Figure 3: The results of different Codebook Depth and Codebook Size on RK and CK test set. (a) The results of different Codebook Depth $k$ , and (b) The results of different Codebook Size $w$ . The charts show how codebook depth $k$ and codebook size $w$ affect the RK and CK test sets: in panel (a), the left plot reports Recall@300 and the right plot reports MRR; panel (b) reports the same two metrics for different codebook sizes.

Codebook Depth ( $k$ ):
- Figure 3(a) shows that when codebook size is fixed at 512, performance (Recall@300 and MRR) saturates once depth $k$ exceeds 3.
- Deeper codebooks (longer token sequences) intuitively should capture more semantic information. However, increasing depth beyond 3 leads to diminishing returns and can increase dispersion in the candidate space, weakening generalization.
- Considering both accuracy and inference latency, a depth of $k=3$ is chosen as optimal.
Codebook Size ( $w$ ):
- Figure 3(b) shows that when codebook depth is fixed at $k=3$ , increasing codebook size reveals divergent trends: Recall@300 decreases, while MRR steadily improves.
- Larger codebooks encourage finer semantic partitioning, which enhances ranking precision (higher MRR) but can reduce recall by fragmenting candidate coverage.
- For practical systems requiring a balance between recall and precision, a codebook size of $w=512$ is selected, offering strong performance across both metrics.

6.2.3. Impact of the Trie Constraint

The Trie-based constraint during inference is crucial for system reliability and efficiency.

The following figure (Figure 4 from the original paper) shows the results of incorporating the Trie on RK and CK test set, and Valid Rates of path to candidates.

Figure 4: The results of incorporating the Trie on RK and CK test set, and Valid Rates of path to candidates. The chart compares Recall@300, MRR, and Valid Rate for UniSearch with and without the Trie on the RK and CK test sets; the variant with the Trie improves clearly on every metric.

Improved Performance: The introduction of the Trie significantly improves both Recall@300 and MRR on both RK and CK test sets. For example, Recall@300 on RK increases from roughly 60% to over 67%, and MRR on RK from below 8% to over 9%.
Path Validity: Critically, the Trie increases path validity from 51.3% to 99.8%. This means nearly all generated results now map to valid candidate items, which is essential for search availability and user experience.
Efficiency: By constraining generation to valid paths, the Trie enhances computational efficiency during inference.
Dynamic Updates: The paper emphasizes the necessity of dynamically updating the Trie in live scenarios (e.g., for live streams that start/end or change content) to maintain robustness.
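
Path validity can be read as the fraction of generated semantic-ID paths that correspond to an actual candidate. The source reports the rates (51.3% without the Trie, 99.8% with it) but not the formula, so the definition below, the function name `valid_rate`, and the toy data are assumptions for illustration.

```python
def valid_rate(generated_paths, valid_paths):
    """Share of generated paths that exist in the candidate path set."""
    if not generated_paths:
        return 0.0
    valid = {tuple(p) for p in valid_paths}
    hits = sum(tuple(p) in valid for p in generated_paths)
    return hits / len(generated_paths)


# Hypothetical corpus and unconstrained generations.
corpus = [(1, 2, 3), (1, 2, 4), (7, 0, 0)]
generated = [(1, 2, 3), (1, 2, 9), (7, 0, 0), (5, 5, 5)]
print(valid_rate(generated, corpus))
```

In this toy case half of the unconstrained generations point at no item, which is the kind of waste that constrained decoding is meant to remove.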

6.3. Further Analysis of Online Performance

The paper provides further analysis of the online A/B testing results for the Live Search scenario, focusing on Total Play Count (TPC).

The following figure (Figure 5 from the original paper) shows the analysis of the improvements in Total Play Count (TPC) from query frequency and user type.

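Figure 5: The analysis of the improvements in Total Play Count (TPC) from query frequency and user type. The pie charts break down the 3.31% overall TPC improvement: by query frequency, 65.06% comes from long-tail queries and 34.94% from high-frequency queries; by user type, the segments are new users (23.02%), low-activity users (58.73%), and high-activity users (18.25%).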

Enhanced Performance on Long-tail Queries: Figure 5 shows that 65.06% of the TPC improvement comes from long-tail queries, nearly doubling the contribution from head queries. This indicates UniSearch's ability to effectively model sparse and semantically complex queries, a significant challenge for traditional systems. The unified pre-training and improved semantic representations likely enable better generalization to diverse, low-frequency queries.
Stronger Appeal to New Users: 58.73% of the TPC increase originates from new users, accounting for more than half of the total gain. This suggests that UniSearch delivers results particularly attractive to users less familiar with the platform. This is valuable for expanding the active user base.
Richer and More Diverse Results: Case studies, such as the MOBA game example shown in Figure 6, illustrate UniSearch's ability to generate both semantically relevant and diverse results.

The following figure (Figure 6 from the original paper) shows the search results comparison between the Multi-stage Cascading Architecture (MCA) and our UniSearch.

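Figure 6: Search results comparison between the Multi-stage Cascading Architecture (MCA) and UniSearch. The top row shows the MCA results, which list several "Honor of Kings" items; the bottom row shows the UniSearch results, which also include "League of Legends" and "英雄之魂", illustrating the gain in result diversity and accuracy.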

For the query "MOBA game":

Traditional MCA: Consistently shows "Honor of Kings" as the top results, exhibiting a bias towards a single popular item. This can lead to content homogenization and dissatisfaction for users with different preferences.
UniSearch: Surfaces additional relevant games like "League of Legends" and "Heroes Evolved". This broader coverage ensures a wider range of user preferences are met, improving user satisfaction and mitigating the risk of users abandoning the platform due to lack of diverse relevant content.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces UniSearch, a novel unified generative search framework that rethinks the architecture of modern search systems. By replacing the traditional multi-stage cascaded architecture with a single, end-to-end generative model, UniSearch overcomes critical limitations such as objective inconsistency and high maintenance overhead. The core innovation lies in its unified pre-training framework, which jointly optimizes video encoding (including semantic ID discretization via VQ-VAE and residual contrastive learning with a coarse-to-fine strategy) and query-to-item generation. Furthermore, Search Preference Optimization (SPO), an online reinforcement learning mechanism, effectively aligns the model's outputs with real user preferences using a reward model and user feedback. Extensive offline and industrial-scale online A/B tests on Kuaishou's live and short-video search systems demonstrate UniSearch's superior performance in both retrieval quality and computational efficiency. Its deployment in live search yielded the largest single-experiment improvement in the product's recent history, particularly benefiting long-tail queries and new users by providing richer and more diverse results.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and suggest future research directions:

Candidate Generation Diversity: Currently, UniSearch generates candidates in a point-wise manner using beam search, which might limit the diversity of the generated results.
Future Work Directions:
- Enhancing List-wise Generation: The authors plan to focus on improving list-wise generation to enhance both diversity and ranking accuracy. This would involve considering the entire list of generated items and their interrelationships rather than generating each item independently.
- Finer-grained Reward Methods: Exploring finer-grained reward methods in the SPO framework to further strengthen alignment with diverse and complex user preferences. This could involve more detailed breakdowns of user interactions or context-aware reward signals.

7.3. Personal Insights & Critique

Innovation of Unified Architecture: The shift from a cascaded architecture to a truly end-to-end generative model is a significant step forward in search systems. The core idea of jointly optimizing item tokenization and generation addresses a fundamental limitation in prior generative approaches. This unified training approach, especially with residual contrastive learning and VQ-VAE within the same optimization loop, is elegant and powerful. It essentially forces the model to "understand" and "speak" item IDs in a semantically consistent manner from the outset, rather than trying to map a query to pre-defined, possibly suboptimal, item representations.
Robustness of SPO: The integration of Search Preference Optimization (SPO) via reinforcement learning is crucial for practical deployment. Real-world user preferences are dynamic and often implicit. Relying solely on offline supervised signals can lead to models that perform well on static benchmarks but fail to adapt to evolving user behavior. SPO provides a continuous feedback loop that ensures the model's output remains aligned with user satisfaction, which is a major contributor to its industrial success. The use of both system-estimated rewards and actual user interactions makes the reward signal robust.
Practicality and Scalability: The emphasis on Trie-based inference, TensorRT acceleration, and KV-caching highlights the paper's strong practical focus. Many academic generative models struggle with real-time latency and scalability requirements for industrial-scale deployment. UniSearch demonstrates that these challenges can be overcome, making the generative paradigm viable for production systems. The dynamic Trie updates are particularly important for constantly changing content like live streams.
Transferability: The UniSearch framework, particularly its unified training and SPO components, appears highly transferable. The core principles could be applied to other information retrieval domains beyond video search, such as e-commerce product search, news article retrieval, or even document search, provided suitable multi-modal item features and semantic ID tokenization can be developed. The cross-domain generation aspect (text query to item) is generic.
Potential Areas for Improvement/Critique:
- Diversity in Beam Search: While beam search with a beam size of 256 helps generate high-quality results, the authors acknowledge the point-wise nature may limit diversity. Future work on list-wise generation is a critical next step. However, even with list-wise generation, maintaining recall for very diverse long-tail queries while ensuring fine-grained ranking for subtle differences remains a significant challenge.
- Interpretability of Semantic IDs: While semantic IDs provide a compact representation, their direct interpretability for human understanding or debugging might be limited without a clear mapping back to human-readable concepts. Understanding why certain ID sequences are generated could be complex.
- Computational Cost of RL Training: While SPO is effective, online reinforcement learning can be computationally intensive and sensitive to reward signal noise and exploration-exploitation trade-offs. Details on the stability and convergence of SPO in a highly dynamic environment would be interesting.
- Comparison to LLM-based Retrieval: The paper focuses on BART/BERT-based generative models. With the rapid advancement of very large LLMs, it would be interesting to see how UniSearch compares against or integrates with LLMs that directly output relevant item metadata or even generate executable search queries. However, LLMs might struggle with the discretization and efficiency requirements for large-scale production systems.
- Bias in Reward System: The reward system leverages the existing fine-ranking module and user interaction feedback. While practical, this carries the risk of inheriting and perpetuating biases present in the legacy system or user behavior data (e.g., popularity bias, filter bubbles). Further research into debiasing strategies within SPO could be valuable.
  
  Overall, UniSearch represents a robust and practically impactful solution for modern search systems, effectively bridging the gap between theoretical advancements in generative AI and the stringent demands of industrial deployment.
