OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
TL;DR Summary
OnePiece integrates context engineering and multi-step reasoning into industrial ranking systems, enhancing existing Transformer models. Key innovations include structured context engineering and progressive multi-task training, leading to significant performance improvements in
Abstract
Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including gains in GMV/UU and an increase in advertising revenue.
1. Bibliographic Information
1.1. Title
OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
1.2. Authors
The paper is co-authored by a team from multiple institutions:
- Sunhao Dai and Jiakai Tang from Renmin University of China.
- Jiahua Wu, Kunwang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, and Anxiang Zeng from Shopee.
- Wenjie Wang from the University of California, San Diego.
- Xu Chen and Jun Xu from Renmin University of China.
- See-Kiong Ng from the National University of Singapore.
The affiliations indicate a strong collaboration between academia (Renmin University of China, UCSD, NUS) and industry (Shopee), suggesting a paper with both theoretical grounding and practical deployment relevance.
1.3. Journal/Conference
The paper is published as a preprint on arXiv.
- Original Source Link: https://arxiv.org/abs/2509.18091
- PDF Link: https://arxiv.org/pdf/2509.18091v1.pdf
- Publication Status: As of 2025-09-22T17:59:07.000Z, it is a preprint, meaning it has not yet undergone formal peer review or been published in a specific journal or conference proceedings. Given the nature of the work and the affiliations, it is likely intended for a top-tier conference in information retrieval, data mining, or artificial intelligence.
1.4. Publication Year
2025 (Published at UTC: 2025-09-22T17:59:07.000Z)
1.5. Abstract
The paper introduces OnePiece, a unified framework designed to integrate large language model (LLM)-style context engineering and multi-step reasoning into industrial cascaded ranking systems, specifically for both retrieval and ranking models. The authors observe that existing industrial efforts primarily transplant Transformer architectures, yielding only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). They argue that the success of LLMs stems from two complementary mechanisms—context engineering (enriching inputs with contextual cues) and multi-step reasoning (iteratively refining outputs)—which remain underexplored in industrial ranking.
OnePiece is built on a pure Transformer backbone and features three key innovations:
- Structured Context Engineering: Augments user interaction history with `preference anchors` (auxiliary item sequences from domain knowledge) and `situational descriptors` (user profiles, query context), unifying them into a structured tokenized input sequence for both retrieval and ranking.
- Block-Wise Latent Reasoning: Equips the model with multi-step refinement of representations, scaling reasoning bandwidth via block size. This allows for iterative enhancement of hidden states.
- Progressive Multi-Task Training: Leverages natural user feedback chains (e.g., click, add-to-cart, order) to supervise reasoning steps effectively during training, aligning earlier steps with weak signals and later steps with stronger, sparser signals.

`OnePiece` has been deployed in Shopee's main personalized search scenario, demonstrating consistent online gains across key business metrics, including gains in GMV/UU (Gross Merchandise Volume per Unique User) and an increase in advertising revenue. Extensive offline experiments also validate the effectiveness of each core design, showing higher sample efficiency and better scaling with larger training spans compared to baselines.
1.6. Original Source Link
https://arxiv.org/abs/2509.18091
1.7. PDF Link
https://arxiv.org/pdf/2509.18091v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the limited success of current industrial search and recommender systems in fully replicating the breakthroughs seen in large language models (LLMs). While Transformer architectures have been widely adopted, they often provide only incremental improvements over existing strong Deep Learning Recommendation Models (DLRMs). This suggests that merely transplanting architectures is insufficient.
This problem is important because industrial ranking systems are crucial for e-commerce, content platforms, and other digital services, directly impacting user experience and business revenue. Achieving significant performance leaps in these systems could unlock immense value.
The paper identifies specific challenges and gaps:
- Limited LLM Mechanism Adoption: The authors argue that the true breakthroughs of LLMs come not just from their `Transformer` architectures, but from two complementary mechanisms: `context engineering` and `multi-step reasoning`. These mechanisms, which significantly expand LLMs' generalization and capability, remain largely underexplored in industrial ranking systems.
- Input Context Construction: Current `Transformer`-based industrial models primarily rely on raw user-item interaction sequences, which lack the rich, structured context of `LLM`-style prompts. There is a gap in how to effectively enrich context for ranking models to enable reasoning.
- Optimization of Multi-Step Reasoning: Unlike LLMs, where `chain-of-thought` annotations provide explicit supervision for reasoning, industrial ranking systems lack such direct supervision. It is difficult to articulate or supervise the latent decision paths underlying user behaviors, which makes multi-step reasoning challenging to train.

The paper's entry point and innovative idea is to systematically integrate these two underexplored but powerful LLM mechanisms, `context engineering` and `multi-step reasoning`, into industrial cascaded ranking pipelines, tailoring them to the unique characteristics and constraints of recommendation tasks.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- First Deployment of LLM Mechanisms in Industrial Ranking: To the best of the authors' knowledge, this is the first work to explore and successfully deploy `context engineering` and `multi-step reasoning` in industrial-scale ranking systems, achieving significant improvements over strong `DLRM` baselines in both retrieval and ranking tasks.
- Proposed OnePiece Framework: The paper introduces `OnePiece`, a unified framework built on a pure `Transformer` backbone that integrates:
  - Structured Context Engineering: Augments user interaction history with `preference anchors` and `situational descriptors`, unifying them into a structured tokenized input sequence for both retrieval and ranking.
  - Block-Wise Latent Reasoning: Equips the model with multi-step refinement of representations, allowing for adjustable reasoning bandwidth via block size.
  - Progressive Multi-Task Training: A strategy that leverages user feedback chains (e.g., click, add-to-cart, order) to effectively supervise reasoning steps during training, aligning tasks of increasing complexity to successive reasoning steps.
- Extensive Offline and Online Validation: The paper conducts comprehensive evaluations, including large-scale A/B testing in Shopee's main personalized search scenario. These experiments validate the effectiveness of each design choice, demonstrate favorable scaling and efficiency properties, and confirm the practicality of deploying `OnePiece` in real-world industrial environments.

The key conclusions and findings are:
- `OnePiece` significantly outperforms strong baselines like `DLRM`, `HSTU`, and `ReaRec` across various retrieval and ranking metrics offline.
- `Structured context engineering`, particularly the `preference anchors`, provides substantial improvements by enriching user context with domain knowledge and query-specific signals.
- `Block-wise latent reasoning` consistently enhances performance by enabling finer-grained preference refinement through multi-step processing.
- `Progressive multi-task training` is crucial for effectively supervising intermediate reasoning steps, preventing gradient conflicts, and allowing each reasoning block to develop specialized capabilities for tasks of increasing complexity.
- `OnePiece` exhibits higher sample efficiency and better scaling capabilities with increasing training data compared to baselines.
- Online A/B testing demonstrates consistent and significant business gains, including gains in GMV/UU and advertising revenue, along with improved recall coverage and exclusive contribution, supporting its real-world impact and efficiency.
3. Prerequisite Knowledge & Related Work
This section provides foundational knowledge necessary for understanding the OnePiece paper, followed by a summary of related work and a differentiation analysis.
3.1. Foundational Concepts
3.1.1. Transformer Architecture
The Transformer (Vaswani et al., 2017) is a neural network architecture that revolutionized sequence transduction tasks, particularly in natural language processing. It is distinct from recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in its heavy reliance on the attention mechanism to draw global dependencies between input and output.
- Self-Attention: This mechanism allows the model to weigh the importance of different words in an input sequence when encoding a particular word. For each token, it computes three vectors: a `Query (Q)`, a `Key (K)`, and a `Value (V)`. The attention score is calculated by taking the dot product of the `Query` vector with all `Key` vectors, followed by a scaling factor and a softmax function to get weights. These weights are then applied to the `Value` vectors to produce the output for that token. The core formula for `Attention` is:

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

  - $Q$: Query matrix. It contains the query vectors for all tokens in the sequence.
  - $K$: Key matrix. It contains the key vectors for all tokens.
  - $V$: Value matrix. It contains the value vectors for all tokens.
  - $d_k$: Dimension of the `Key` vectors. The factor $\sqrt{d_k}$ scales the dot products, preventing them from becoming too large and pushing the `softmax` function into regions with tiny gradients.
  - $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution, ensuring all elements are between 0 and 1 and sum to 1.
  - $QK^{\top}$: The dot product between queries and keys, representing how much each query should attend to each key.
- Multi-Head Self-Attention (MHSA): Instead of performing a single attention function, `MHSA` linearly projects the `Queries`, `Keys`, and `Values` $h$ times with different learned linear projections. Then, $h$ attention functions are applied in parallel. Each of these `attention heads` learns to focus on different parts of the input sequence. The outputs from these heads are then concatenated and linearly transformed to produce the final output. This allows the model to capture diverse contextual information from different representation subspaces.
- Feed-Forward Network (FFN): After the attention mechanism, each position in the sequence passes through an identical, independently applied `position-wise feed-forward network`. This typically consists of two linear transformations with a `ReLU` activation in between.
- Layer Normalization (LN): Applied around each sub-layer (attention and FFN) together with a residual connection. The original Transformer normalizes after the residual addition (post-LN), while many later models normalize the sub-layer input instead (pre-LN). `Layer Normalization` normalizes the inputs across the features for each sample independently, helping stabilize the training process.
- Positional Embeddings: Since `Transformers` do not inherently process sequences in order (unlike RNNs), `positional embeddings` are added to the input embeddings to inject information about the relative or absolute position of tokens in the sequence. These can be learned or fixed (e.g., sinusoidal functions).
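The scaled dot-product attention described above can be sketched in a few lines. This is a minimal, dependency-free illustration of the standard formula, not code from the paper; the function names and the list-of-lists matrix layout are assumptions made for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists). Returns n x d_v."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
uniform = scaled_dot_product_attention([[0.0, 0.0]], K, V)   # equal weights on both values
peaked = scaled_dot_product_attention([[100.0, 0.0]], K, V)  # almost all weight on the first value
```

A zero query scores every key equally, so the output is the plain average of the value vectors; a query strongly aligned with one key returns (almost exactly) that key's value.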
3.1.2. Large Language Models (LLMs)
LLMs are a class of neural networks, typically based on the Transformer architecture with billions of parameters, trained on vast amounts of text data. They exhibit emergent capabilities such as text generation, summarization, translation, and question answering.
- Scaling Laws: The performance of `LLMs` often scales predictably with model size, dataset size, and computational budget.
- Context Engineering (Prompt Engineering): Refers to the art and science of designing effective inputs (prompts) to guide `LLMs` towards desired outputs. This involves structuring the input, providing examples (in-context learning), and incorporating specific instructions or external knowledge.
- Multi-Step Reasoning (Chain-of-Thought): A technique where `LLMs` are prompted to break down complex problems into intermediate steps, explicitly showing their reasoning process. This improves performance on complex reasoning tasks by making the model's "thought process" more explicit and allowing for self-correction.
3.1.3. Industrial Search and Recommender Systems
These systems aim to provide users with relevant items (products, articles, videos, etc.) from a vast corpus.
- Cascade Ranking Paradigm: A dominant approach in large-scale industrial systems due to its efficiency. It organizes the decision process into multiple stages:
  - Retrieval Stage: The first stage, which efficiently selects a small set of highly relevant candidate items from a massive corpus (e.g., millions or billions of items). It uses lightweight models and aims for high recall (not missing potentially relevant items).
  - Pre-ranking Stage (Optional): An intermediate stage that further filters the retrieved candidates to a smaller set (e.g., from thousands to hundreds) using slightly more complex models.
  - Ranking Stage: The final stage, which takes a refined, smaller set of candidates and uses sophisticated, computationally expensive models to precisely estimate relevance scores and produce a finely ordered list for display to the user. It prioritizes precision.
- Dual-Tower Architecture (for Retrieval): Common in retrieval, it consists of two independent neural networks (`towers`): one for encoding the user query/context and another for encoding items. The output of each tower is a dense vector (embedding). Retrieval is performed by finding items whose embeddings are "close" to the user/query embedding in a vector space, typically using dot product or cosine similarity. This allows pre-computation of item embeddings, enabling fast `Approximate Nearest Neighbor (ANN)` search.
- Single-Tower Architecture (for Ranking): Common in ranking, it jointly encodes the user context, query, and candidate item within a single network. This allows for rich, explicit interactions between user, query, and item features, leading to more accurate preference prediction. However, it is computationally more expensive, as it must be run for each candidate item.
- Approximate Nearest Neighbor (ANN) Search: A family of algorithms used to find data points that are "closest" (most similar) to a given query point in a high-dimensional space, without exhaustively checking every single point. This is crucial for efficient retrieval from large item corpora. Examples include `HNSW` (Hierarchical Navigable Small World), which is used in the paper.
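The division of labour between the retrieval and ranking stages can be illustrated with a toy two-stage pipeline. This is a sketch under simplifying assumptions: brute-force dot-product search stands in for an ANN index such as HNSW, and the `precise` lookup table is a hypothetical placeholder for an expensive single-tower scorer. All names and values are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, k):
    """Dual-tower retrieval: score precomputed item embeddings against the
    user embedding and keep the top-k (brute force replaces the ANN index)."""
    scored = sorted(item_vecs.items(), key=lambda kv: dot(user_vec, kv[1]), reverse=True)
    return [item_id for item_id, _ in scored[:k]]

def rank(candidates, score_fn):
    """Ranking stage: run the heavier scorer once per surviving candidate."""
    return sorted(candidates, key=score_fn, reverse=True)

# Item embeddings are computed offline by the item tower (toy values).
item_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
user_vec = [1.0, 0.2]  # output of the user/query tower (toy value)

candidates = retrieve(user_vec, item_vecs, k=2)

# Hypothetical "precise" relevance scores from a single-tower ranking model.
precise = {"a": 0.3, "b": 0.9, "c": 0.5, "d": 0.1}
final = rank(candidates, lambda item_id: precise[item_id])
```

Retrieval keeps items `a` and `b` because their embeddings have the largest dot products with the user vector; the ranking stage then reorders only those two, so item `c` is never scored by the expensive model even though its precise score is higher than that of `a`. This is the recall-versus-cost trade-off the cascade accepts.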
3.1.4. Deep Learning Recommendation Models (DLRMs)
DLRMs (Naumov et al., 2019) are a class of deep learning models widely used in industrial recommendation systems. They typically handle both sparse (categorical, e.g., user ID, item ID) and dense (continuous, e.g., price, age) features.
- Embeddings: Categorical features are converted into dense `embedding` vectors.
- Feature Interactions: `DLRMs` often employ techniques to model interactions between features, such as `cross-network` components (e.g., `DCN`, `DCNv2`) or `attention mechanisms` (e.g., `DIN`-like attention) to focus on relevant historical items.
- MLP (Multi-Layer Perceptron): A series of fully connected layers used to combine processed features and output the final prediction score.
3.2. Previous Works
The paper compares OnePiece against several representative baselines and draws inspiration from related work.
- DLRM (Naumov et al., 2019): This is Shopee's production baseline, a highly optimized hybrid model integrating various state-of-the-art components.
  - Retrieval Mode (Two-Tower): Uses a dual-tower design inspired by `DSSM` (Deep Structured Semantic Models) (Huang et al., 2013). User context (query, history) is encoded separately from items. Components include `DIN`-like attention (Zhou et al., 2018) and `zero-attention` (Ai et al., 2019) for relevance, a lightweight text CNN for keyword features, and `DCNv2` (Wang et al., 2021) for high-order cross features. Sequential features are aggregated via mean pooling and fused with other features using an `MLP`.
  - Ranking Mode (Single-Tower): Incorporates candidate items into a single tower with user features. Uses `ResFlow` (Fu et al., 2024) as the backbone, combined with `DIN`-like target attention and `cross-attention` across sequential behaviors. `DCNv2` is again used for higher-order interactions, followed by `MLP` fusion. A `SENet` (Hu et al., 2017) module supports adaptive feature selection.
- HSTU (Zhai et al., 2024): A representative generative recommendation framework from Meta. Its input typically covers `Interaction History (IH)` and `Situational Descriptors (SD)`. For a fair comparison, the authors align its parameter size with `OnePiece` and also evaluate a variant that adds `Preference Anchors (PA)`.
- ReaRec (Tang et al., 2025): A reasoning-enhanced recommendation model that formulates user representation modeling as a `multi-step reasoning` process over item sequences.
  - The vanilla `ReaRec` supports retrieval with user interaction history. The authors adapt its backbone and feature inputs for the comparison.
  - For ranking, `ReaRec` is adapted by introducing candidate items with a `target-aware attention mask`, meaning sequence tokens can attend to the candidate, but candidates are mutually invisible. A PA-augmented variant also augments `IH` and `SD` with `Preference Anchors`.
- CLIP (Radford et al., 2021): A model that learns transferable visual models from natural language supervision using a `bidirectional contrastive learning` objective. `OnePiece` draws inspiration from CLIP for its `Bidirectional Contrastive Learning (BCL)` objective in retrieval mode.
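To make the bidirectional contrastive idea concrete, the sketch below implements the symmetric in-batch loss popularized by CLIP: each user/query embedding should score its own paired item above the other items in the batch, and each item should likewise prefer its own user. It illustrates the general technique only; the exact form of the BCL objective in `OnePiece` is not specified in this section, and the function names and the `temperature` default are assumptions.

```python
import math

def log_softmax_at(row, idx):
    """log softmax(row)[idx], computed in a numerically stable way."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - lse

def bidirectional_contrastive_loss(user_vecs, item_vecs, temperature=0.1):
    """Symmetric InfoNCE over a batch where user_vecs[i] pairs with item_vecs[i]."""
    n = len(user_vecs)
    sim = [[sum(a * b for a, b in zip(u, v)) / temperature for v in item_vecs]
           for u in user_vecs]
    # User -> item direction: each row should peak on its diagonal entry.
    u2i = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Item -> user direction: each column should peak on its diagonal entry.
    i2u = -sum(log_softmax_at([sim[r][j] for r in range(n)], j) for j in range(n)) / n
    return 0.5 * (u2i + i2u)

users = [[1.0, 0.0], [0.0, 1.0]]
aligned = bidirectional_contrastive_loss(users, [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_contrastive_loss(users, [[0.0, 1.0], [1.0, 0.0]])
```

When pairs are correctly aligned the loss is close to zero, and when the items are swapped it is large, which is the signal that pulls matched user and item embeddings together in the shared space.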
3.3. Technological Evolution
The field of industrial ranking systems has seen a progressive integration of advanced modeling techniques:
- Early Systems (Rule-based, Collaborative Filtering): Initial systems relied on hand-crafted rules, content-based filtering, or `collaborative filtering` (e.g., item-to-item similarity, user-to-user similarity). These were simple but struggled with scalability and cold-start problems.
- Feature Engineering & Traditional Machine Learning: Introduction of extensive feature engineering combined with traditional ML models like `logistic regression` or `gradient boosting decision trees` (GBDT).
- Deep Learning (DLRMs, DSSM): The advent of deep learning brought models like `DSSM` for retrieval and `DLRMs` for ranking. These models excel at learning complex feature interactions and representations, handling sparse and dense features efficiently. Architectures like `DIN` (Deep Interest Network) introduced `attention mechanisms` to model dynamic user interests.
- Sequential Recommendation (RNNs, Transformers): Recognizing the importance of user behavior sequences, `RNN`-based models (e.g., `GRU4Rec`) emerged, followed by `Transformer`-based models like `SASRec` (Self-Attentive Sequential Recommendation) and `BERT4Rec`. These models leverage `self-attention` to capture long-range dependencies in user interaction histories, leading to more personalized recommendations.
- LLM-Inspired Architectures (Current Trend): The recent success of `LLMs` has inspired researchers to transplant `Transformer`-based architectures into recommendation. However, as `OnePiece` points out, many of these efforts focus only on the architecture, yielding incremental gains.

`OnePiece` fits into this timeline by pushing the `LLM`-inspired trend further. Instead of just adopting `Transformer` backbones, it systematically integrates the mechanisms behind LLM success (`context engineering` and `multi-step reasoning`), which were largely overlooked in previous `Transformer`-based recommendation models. This represents a significant step towards more intelligent and adaptive industrial ranking systems.
3.4. Differentiation Analysis
Compared to the main methods in related work, OnePiece presents several core differences and innovations:
- Beyond Architectural Transplant: Unlike many `Transformer`-based models (`HSTU`, `SASRec`, `BERT4Rec`) that primarily transplant the architecture, `OnePiece` explicitly focuses on integrating the mechanisms of `context engineering` and `multi-step reasoning` from `LLMs`. This is a more principled approach to leveraging LLM breakthroughs.
- Unified Framework for Retrieval and Ranking: `OnePiece` provides a single, unified framework that integrates these `LLM` mechanisms into both the retrieval and ranking stages of a cascaded pipeline. This contrasts with approaches that typically optimize each stage independently with different model designs.
- Structured Context Engineering:
  - Most `Transformer`-based industrial models primarily use raw user-item interaction sequences. `OnePiece` enriches this by introducing `Preference Anchors (PA)` (auxiliary item sequences based on domain knowledge, like top-clicked items for a query) and `Situational Descriptors (SD)` (user profiles, query context).
  - This structured approach provides richer contextual cues than plain interaction history, addressing the "lack of structural richness" challenge compared to `LLM`-style prompts.
  - The PA-augmented variants of `HSTU` and `ReaRec` also incorporate `PA`, but `OnePiece`'s overall context engineering is part of a more integrated system with `block-wise reasoning`.
- Block-Wise Latent Reasoning:
  - While `ReaRec` also employs multi-step reasoning, `OnePiece` introduces `block-wise latent reasoning`. This means that instead of recycling a single hidden state across iterations (which might overly compress information), `OnePiece` iteratively refines a set of hidden states (a `block`), offering adjustable reasoning bandwidth.
  - This design provides greater flexibility and a better balance between information compression and retention, potentially leading to more expressive representations.
- Progressive Multi-Task Training:
  - To supervise the multi-step reasoning process effectively without `chain-of-thought` annotations, `OnePiece` introduces a `progressive multi-task training` strategy. It leverages natural user feedback chains (e.g., click, add-to-cart, order) as staged supervision signals, assigning tasks of increasing complexity to successive reasoning blocks.
  - This differs from traditional multi-task learning, where all tasks might supervise a single final representation, and from models like `ReaRec` that might use a single task or simpler supervision for reasoning. This progressive approach provides rich process supervision, helping each reasoning step develop specialized capabilities and mitigating gradient conflicts.
- Enhanced Inter-Candidate Interaction in Ranking: For the ranking stage, `OnePiece` explicitly models cross-candidate interactions within small `Candidate Item Set (CIS)` groups (`grouped setwise strategy`) by making them jointly visible. This goes beyond pointwise models and the adapted `ReaRec`, where candidates remain mutually invisible. It allows the model to compare candidates directly, which is crucial for fine-grained ranking.
- Online Deployment and Efficiency: `OnePiece` is designed and optimized for large-scale industrial deployment, demonstrating significant online gains and superior hardware utilization compared to `DLRM` baselines, validating its practicality and efficiency in real-world scenarios.
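The contrast between mutually invisible candidates and jointly visible candidate groups comes down to the candidate-to-candidate part of the attention mask. The sketch below builds only that sub-mask; how candidates and context tokens see each other is deliberately left out, and the function name, the boolean convention (`True` means "may attend"), and the use of consecutive chunks as groups are assumptions for illustration.

```python
def candidate_visibility(n_candidates, group_size=None):
    """Return mask[i][j] == True when candidate i may attend to candidate j.

    group_size=None  -> candidates are mutually invisible (each sees only itself),
                        as in a target-aware, pointwise-style mask.
    group_size=C     -> candidates are chunked into consecutive groups of C and are
                        jointly visible within their group (grouped setwise).
    """
    mask = []
    for i in range(n_candidates):
        row = []
        for j in range(n_candidates):
            if group_size is None:
                row.append(i == j)
            else:
                row.append(i // group_size == j // group_size)
        mask.append(row)
    return mask

isolated = candidate_visibility(4)               # identity-shaped mask
setwise = candidate_visibility(4, group_size=2)  # two 2x2 visible blocks
```

With `group_size=2`, candidates 0 and 1 can compare themselves against each other while remaining blind to candidates 2 and 3, so the cost of cross-candidate attention grows with the group size rather than with the full candidate list.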
4. Methodology
4.1. Principles
The core idea behind OnePiece is to systematically adapt and integrate two fundamental mechanisms that have driven the success of Large Language Models (LLMs)—context engineering and multi-step reasoning—into the specific domain of industrial cascaded ranking systems. The theoretical basis and intuition are as follows:
- Context Engineering: LLMs demonstrate that rich, structured input contexts (prompts) are crucial for eliciting their full capabilities. In ranking systems, traditional input sequences often lack this richness. The principle here is that by augmenting raw user interaction history with `preference anchors` (domain-specific reference points, e.g., top-clicked items) and `situational descriptors` (user and query context), the model can be provided with more informative cues. This enriched context helps the model better understand user intent and scenario specifics, analogous to how well-crafted prompts guide an LLM.
- Multi-Step Reasoning: LLMs solve complex problems by breaking them down into intermediate, iterative reasoning steps (`chain-of-thought`). The intuition is that user preference modeling in recommendation is also a complex task that benefits from iterative refinement. Instead of a single-shot prediction, a model can progressively refine its understanding of user preferences and item relevance. `OnePiece` proposes a `block-wise latent reasoning` mechanism, where hidden representations are iteratively updated, allowing for a deeper, more nuanced understanding that builds upon previous steps. This addresses the limitation of single-unit reasoning, which might overly compress signals.
- Supervision for Reasoning: A key challenge in applying multi-step reasoning to ranking is the lack of explicit "thought process" annotations. `OnePiece` addresses this by leveraging naturally occurring `user feedback chains` (e.g., exposure -> click -> add-to-cart -> order) as a form of `progressive multi-task supervision`. The principle is that these feedback chains represent a natural curriculum of increasing user commitment and task complexity. By assigning tasks of varying complexity to different reasoning steps, the model learns to develop specialized capabilities at each stage, guiding the latent reasoning process effectively.

By unifying these principles within a `Transformer`-based backbone, `OnePiece` aims to enhance context-awareness and reasoning depth across both retrieval and ranking stages of industrial systems, moving beyond incremental architectural adaptations to fundamental capability improvements.
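The interplay between block-wise refinement and progressive supervision can be sketched with a toy loop: a block of hidden states is refined once per reasoning step, and each step is supervised by the next task in the feedback chain. This is only an illustration of the idea as described above. `toy_backbone` is a hypothetical stand-in for the Transformer, and the task names, mean pooling, and linear heads are assumptions rather than the paper's actual components.

```python
import math

# Feedback chain ordered from weak/dense to strong/sparse signals (illustrative names).
TASKS = ["click", "add_to_cart", "order"]

def toy_backbone(context, block):
    """Placeholder for one reasoning step: moves every state in the block
    halfway toward the mean of the context tokens."""
    d = len(context[0])
    mean = [sum(tok[k] for tok in context) / len(context) for k in range(d)]
    return [[0.5 * h[k] + 0.5 * mean[k] for k in range(d)] for h in block]

def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def progressive_loss(context, block, heads, labels):
    """Step k refines the whole block, then is supervised by task k only."""
    per_step = {}
    for task in TASKS:
        block = toy_backbone(context, block)
        pooled = [sum(h[k] for h in block) / len(block) for k in range(len(block[0]))]
        logit = sum(w * x for w, x in zip(heads[task], pooled))
        per_step[task] = bce_with_logit(logit, labels[task])
    return sum(per_step.values()), per_step

context = [[1.0, 0.0], [0.0, 1.0]]
block = [[0.0, 0.0], [0.0, 0.0]]          # block size 2: two latent states per step
heads = {task: [1.0, 1.0] for task in TASKS}
labels = {"click": 1.0, "add_to_cart": 1.0, "order": 0.0}
total, per_step = progressive_loss(context, block, heads, labels)
```

The design point is that the loss for each task attaches to a different reasoning step rather than to one shared final representation, so earlier steps are shaped by the dense click signal and later steps by sparser signals such as orders.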
4.2. Core Methodology In-depth (Layer by Layer)
OnePiece is a unified framework combining structured context engineering, block-wise latent reasoning, and a progressive multi-task training strategy. Figure 2 illustrates its overall architecture in both retrieval and ranking modes.

Figure 2 | Overall architecture of the proposed OnePiece framework. Retrieval Mode (a) and Ranking Mode (b) both employ structured context engineering to construct unified input tokens, utilize block-wise latent reasoning to iteratively enhance representations across multiple reasoning steps, and are optimized through a progressive multi-task training strategy.
Both modes utilize structured context engineering to create unified input tokens, which are then processed by a Transformer-based backbone equipped with block-wise latent reasoning to iteratively refine representations. The entire system is optimized using a progressive multi-task training strategy.
4.2.1. Context Engineering
The first step in OnePiece is to transform all heterogeneous inputs into a unified token sequence that can be processed by a Transformer backbone. This is achieved through four complementary token types: Interaction History (IH), Preference Anchors (PA), Situational Descriptors (SD), and Candidate Item Set (CIS). Figure 3 provides a visual representation of this design.

Figure 3 | Context engineering and tokenizer design for input token sequences in OnePiece. Both retrieval and ranking share the same construction of interaction history (IH), preference anchors (PA), and situational descriptors (SD). The key difference is that ranking additionally incorporates candidate item set (CIS) tokens, enabling joint scoring within the single-tower architecture.
Following the problem formulation, a user $u$ has feature representation $x^U$, a query $q$ has $x^Q$, and an item $v$ has $x^I$. Entity-specific embedding functions $\phi_U$, $\phi_Q$, and $\phi_I$ map these entities' features (categorical and continuous) into concatenated embedding vectors. To unify these into the $d$-dimensional hidden space of the Transformer backbone, lightweight projection layers $\mathrm{Proj}$ are used. Specifically, $\mathrm{Proj}_U$, $\mathrm{Proj}_Q$, and $\mathrm{Proj}_C$ are defined, each mapping its input dimension to $d$. IH and PA components share a common projection layer, $\mathrm{Proj}_I$.
Let's detail each component of the input token sequence:
Interaction History (IH)
The IH component encodes the user's historical item interactions in chronological order. Each item descriptor is embedded using the shared projection layer:

$$\mathbf{z}^{IH}_i = \mathrm{Proj}_I\big(\phi_I(x^I_i)\big) \in \mathbb{R}^d$$

- $\mathbf{z}^{IH}_i$: The embedding of the $i$-th item in the user's interaction history.
- $\mathrm{Proj}_I$: A shared projection layer that maps the item's raw feature embedding to the model's hidden dimension $d$.
- $x^I_i$: The raw feature representation of the $i$-th item $v_i$, which includes its ID and associated content information.
- $\mathbb{R}^d$: The $d$-dimensional hidden space of the backbone model.

Temporal information is then incorporated by adding learnable `positional embeddings`:

$$\mathbf{e}^{IH}_i = \mathbf{z}^{IH}_i + \mathbf{p}^{IH}_i$$

- $\mathbf{e}^{IH}_i$: The final token embedding for the $i$-th interaction, combining content and temporal information.
- $\mathbf{p}^{IH}_i$: The learnable positional embedding for the $i$-th interaction in the sequence.
Preference Anchors (PA)
Preference Anchors are auxiliary item sequences constructed based on domain knowledge (e.g., top-clicked items under the current query). These anchors provide high-quality reference points, injecting inductive biases and guiding the model towards plausible prediction directions.
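For a given user $u$ and query $q$, $G$ anchor groups are provided, where the $g$-th group contains $n_g$ items. The token embedding for the $j$-th item in the $g$-th anchor group is computed similarly to IH items:

$$\mathbf{z}^{PA}_{g,j} = \mathrm{Proj}_I\big(\phi_I(x^I_{g,j})\big) \in \mathbb{R}^d, \qquad \mathbf{e}^{PA}_{g,j} = \mathbf{z}^{PA}_{g,j} + \mathbf{p}^{PA}_j$$

- $\mathbf{z}^{PA}_{g,j}$: The embedding of the $j$-th item in the $g$-th anchor group.
- $\mathbf{e}^{PA}_{g,j}$: The final token embedding for the $j$-th item in the $g$-th anchor group.
- $\mathbf{p}^{PA}_j$: The positional embedding for the $j$-th item within its group.

To preserve the group structure, each anchor group is wrapped with learnable boundary tokens: $\mathbf{e}_{\mathrm{BOS}}$ (Beginning of Sequence) and $\mathbf{e}_{\mathrm{EOS}}$ (End of Sequence), both in $\mathbb{R}^d$. The final token sequence for each anchor group is:

$$\big[\mathbf{e}_{\mathrm{BOS}},\ \mathbf{e}^{PA}_{g,1},\ \ldots,\ \mathbf{e}^{PA}_{g,n_g},\ \mathbf{e}_{\mathrm{EOS}}\big]$$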
Situational Descriptors (SD)
Situational Descriptors capture non-item information relevant to the ranking task, such as static user features and query-specific information.
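For the user $u$ with features $\mathbf{x}_u^U$, the embedding is:

$$\mathbf{z}^U = \mathrm{Proj}_U\big(\phi_U(\mathbf{x}_u^U)\big), \qquad \tilde{\mathbf{z}}^U = \mathbf{z}^U + \mathbf{p}_U^{SD}$$

- $\mathbf{z}^U$: The projected user embedding.
- $\tilde{\mathbf{z}}^U$: The final user token embedding.
- $\mathrm{Proj}_U$: A projection layer for user features.
- $\mathbf{x}_u^U$: The raw feature representation of the user.
- $\mathbf{p}_U^{SD}$: The positional embedding for the user token's position within the SD segment.

Similarly, for the query $q$ with features $\mathbf{x}_q^Q$ (omitted in recommendation scenarios without explicit queries):

$$\mathbf{z}^Q = \mathrm{Proj}_Q\big(\phi_Q(\mathbf{x}_q^Q)\big), \qquad \tilde{\mathbf{z}}^Q = \mathbf{z}^Q + \mathbf{p}_Q^{SD}$$

- $\mathbf{z}^Q$: The projected query embedding.
- $\tilde{\mathbf{z}}^Q$: The final query token embedding.
- $\mathrm{Proj}_Q$: A projection layer for query features.
- $\mathbf{x}_q^Q$: The raw feature representation of the query.
- $\mathbf{p}_Q^{SD}$: The positional embedding for the query token's position within the SD segment.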
Candidate Item Set (CIS, Ranking Mode Only)
In the ranking stage, OnePiece adopts a grouped setwise strategy to balance efficiency and expressiveness. The retrieved candidate set is randomly partitioned into smaller groups of size $C$ (e.g., 12). Each group is processed independently, allowing intra-group interaction among candidates.

Given a candidate group $\mathcal{G} = \{v_1, \dots, v_C\}$, each candidate item is embedded as:

$$\mathbf{z}_i^{CIS} = \mathrm{Proj}_C\big(\phi_I(\mathbf{x}_i^I)\big) \in \mathbb{R}^d$$

- $\mathbf{z}_i^{CIS}$: The projected embedding of the $i$-th candidate item in the group.
- $\mathrm{Proj}_C$: A projection layer for candidate item features.

Crucially, positional embeddings are deliberately excluded for candidate tokens to prevent the model from learning spurious correlations between position and relevance labels:

$$\tilde{\mathbf{z}}_i^{CIS} = \mathbf{z}_i^{CIS}$$

- $\tilde{\mathbf{z}}_i^{CIS}$: The final token embedding for the $i$-th candidate item, which is simply its projected content embedding.
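The random grouping of candidates described above can be illustrated with a minimal sketch. It assumes a uniform shuffle and lets the last group be smaller when the candidate count is not a multiple of the group size; the source does not say how leftovers are handled, and `partition_candidates` is a hypothetical helper name.

```python
import random


def partition_candidates(candidates, group_size, seed=None):
    """Randomly split candidates into groups of at most `group_size` items.

    Assumption: the final group may be shorter than `group_size`.
    """
    rng = random.Random(seed)
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size]
            for i in range(0, len(shuffled), group_size)]


# Hypothetical usage: 30 retrieved candidates, groups of 12 as in the example size.
groups = partition_candidates(range(30), group_size=12, seed=0)
group_lengths = [len(group) for group in groups]
```

Each resulting group would then be scored in its own forward pass, so candidates interact only with members of the same group.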
Sequence Packing and Ordering
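Let $\oplus$ denote concatenation of token subsequences. The final input sequence to the backbone model is constructed by packing these components according to fixed ordering rules:

- Retrieval Mode: The input token sequence is constructed as:

  $$\mathbf{S}_{\text{retrieval}} = \big[\tilde{\mathbf{z}}_1^{IH}, \dots, \tilde{\mathbf{z}}_{n_H}^{IH}\big] \oplus \big[\mathbf{S}_1^{PA} \oplus \dots \oplus \mathbf{S}_B^{PA}\big] \oplus \big[\tilde{\mathbf{z}}^U, \tilde{\mathbf{z}}^Q\big]$$

  where $n_H$ is the number of history items and $B$ is the number of anchor groups.
    - `IH` tokens are ordered by ascending interaction timestamp.
    - Each `PA` group is wrapped by `BOS`/`EOS` boundary tokens, and groups are ordered by predefined business rules.
    - `SD` tokens (user, query, etc.) have no temporal ordering and are placed in a segment with distinct positional indices.
- Ranking Mode: The retrieval-mode sequence is extended by appending candidate item tokens:

  $$\mathbf{S}_{\text{ranking}} = \mathbf{S}_{\text{retrieval}} \oplus \big[\tilde{\mathbf{z}}_1^{CIS}, \dots, \tilde{\mathbf{z}}_C^{CIS}\big]$$

    - `CIS` tokens are appended without positional encodings.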
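The ordering rules above can be sketched as a small packing routine. This is an illustration only: tokens are represented as tagged strings rather than embeddings, anchor groups are assumed to arrive already sorted by the business rules, and `pack_sequence` is a hypothetical name.

```python
def pack_sequence(history, anchor_groups, descriptors, candidates=None):
    """Pack token tags in the order IH, PA groups, SD, then CIS (ranking only).

    history:        list of (timestamp, item_id) pairs
    anchor_groups:  list of item-id lists, assumed pre-ordered by business rules
    descriptors:    list of descriptor names, e.g. ["user", "query"]
    candidates:     optional list of candidate item ids (ranking mode)
    """
    # IH tokens in ascending timestamp order.
    tokens = [f"IH:{item}" for _, item in sorted(history)]
    # Each PA group is wrapped in BOS/EOS boundary tokens.
    for group in anchor_groups:
        tokens += ["BOS"] + [f"PA:{item}" for item in group] + ["EOS"]
    # SD tokens follow, with no temporal ordering among them.
    tokens += [f"SD:{name}" for name in descriptors]
    # Ranking mode appends candidate tokens at the end.
    if candidates is not None:
        tokens += [f"CIS:{item}" for item in candidates]
    return tokens


retrieval_seq = pack_sequence([(3, "c"), (1, "a")], [["p1", "p2"]], ["user", "query"])
ranking_seq = pack_sequence([(3, "c"), (1, "a")], [["p1", "p2"]], ["user", "query"],
                            candidates=["x", "y"])
```

Placing the SD tokens last in retrieval mode and the CIS tokens last in ranking mode means the trailing tokens of the sequence are exactly the ones later used as the reasoning block.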
4.2.2. Backbone Architecture
The OnePiece backbone processes the packed token sequence uniformly for both retrieval and ranking.
Transformer-Based Sequential Encoding
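Let $\mathbf{S} \in \mathbb{R}^{N \times d}$ denote the final input tokens (from Section 3.2), where $N$ is the total sequence length. OnePiece adopts an $L$-layer bi-directional Transformer (Vaswani et al., 2017) with pre-normalization.

Let $\mathbf{H}^{(l)} \in \mathbb{R}^{N \times d}$ be the hidden states at layer $l$. The initial input is $\mathbf{H}^{(0)} = \mathbf{S}$. For the $l$-th layer ($l = 1, \dots, L$):

$$\hat{\mathbf{H}}^{(l)} = \mathbf{H}^{(l-1)} + \mathrm{MHSA}\big(\mathrm{LN}(\mathbf{H}^{(l-1)})\big)$$

$$\mathbf{H}^{(l)} = \hat{\mathbf{H}}^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(\hat{\mathbf{H}}^{(l)})\big)$$

- $\mathbf{H}^{(l-1)}$: Input hidden states from the previous layer.
- $\mathrm{LN}$: Layer Normalization.
- $\mathrm{MHSA}$: Multi-Head Self-Attention with bi-directional attention. Bi-directional attention allows tokens to attend to all other tokens in the sequence, which is suitable for non-autoregressive tasks like personalized ranking.
- $\hat{\mathbf{H}}^{(l)}$: The output of the `MHSA` sub-layer after adding the residual connection.
- $\mathrm{FFN}$: Position-wise Feed-Forward Network.
- $\mathbf{H}^{(l)}$: The final output hidden states of layer $l$, after the `FFN` sub-layer and its residual connection.

The final encoder output $\mathbf{H}^{(L)}$ serves as the foundation for subsequent reasoning.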
Block-Wise Multi-Step Reasoning
OnePiece introduces a block-wise reasoning mechanism to iteratively refine a set of hidden states across multiple steps, providing adjustable reasoning bandwidth.
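Let $M$ be the block size, which is task-dependent. Let $\mathbf{B}_k \in \mathbb{R}^{M \times d}$ denote the $k$-th reasoning block.

The initial block is constructed from the final encoder output $\mathbf{H}^{(L)}$:

$$\mathbf{B}_0 = \mathbf{H}^{(L)}[N-M+1 : N]$$

- $\mathbf{B}_0$: The initial reasoning block, extracted from the last $M$ hidden states of the `Transformer`'s final layer output $\mathbf{H}^{(L)}$.

For subsequent reasoning steps $k \geq 1$, the block is extracted from the output of the previous reasoning step:

$$\mathbf{B}_k = \mathbf{H}_{k-1}^{(L)}[-M:]$$

- $\mathbf{H}_{k-1}^{(L)}$: The final output hidden states from the Transformer after step $k-1$ (this includes the base sequence plus all blocks up to $\mathbf{B}_{k-1}$).

To distinguish different reasoning steps, `Reasoning Position Embeddings (RPE)` are introduced. Let $\mathbf{E}^{RPE} \in \mathbb{R}^{K \times d}$ be a learnable embedding matrix, where $K$ is the maximum number of reasoning steps. The enhanced blocks are defined as:

$$\tilde{\mathbf{B}}_k = \mathbf{B}_k + \mathbf{1}_M \otimes \mathbf{E}^{RPE}[k]$$

- $\mathbf{1}_M$: A vector of ones of length $M$.
- $\otimes$: Outer product.
- $\mathbf{E}^{RPE}[k]$: The $k$-th row of the `RPE` matrix, representing the positional embedding for reasoning step $k$.

At each step $k$, the base sequence $\mathbf{S}$ is concatenated with all previous enhanced blocks $\tilde{\mathbf{B}}_{<k}$ and the current block $\tilde{\mathbf{B}}_k$. This concatenated sequence is then passed through the `Transformer` backbone with a `block-wise causal mask`:

$$\mathbf{H}_k^{(L)} = \mathrm{Transformer}\big(\big[\mathbf{S} \oplus \tilde{\mathbf{B}}_{<k} \oplus \tilde{\mathbf{B}}_k\big];\ \mathcal{M}_{\text{block}}\big)$$

- $\mathrm{Transformer}(\cdot)$: The `Transformer` update function.
- $\mathcal{M}_{\text{block}}$: The `block-wise causal mask`. As shown in Figure 4(a), this mask ensures that current block tokens can attend to all base tokens and all historical blocks $\tilde{\mathbf{B}}_{<k}$, but tokens within the current reasoning block cannot attend to future reasoning block tokens.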
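The image is a diagram illustrating the block-wise reasoning mask and progressive multi-task training. Panel (a), on the left, shows how the layers in block-wise reasoning connect to one another through the causal attention mask; panel (b), on the right, shows task complexity increasing step by step in progressive training, which provides effective process supervision.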
Figure 4 | Block-wise reasoning mask and progressive multi-task training. (a) Causal attention mask enables reasoning blocks to attend to input and previous blocks. (b) Progressive training assigns tasks of increasing complexity to successive reasoning steps to provide effective process supervision.
This iterative procedure yields progressively refined reasoning states $\{\mathbf{B}_k\}_{k=1}^{K}$.

The block size $M$ is task-dependent:

- Retrieval Mode: $M$ is set equal to the length of the `Situational Descriptor (SD)` segment. The user and query tokens are designated as aggregation blocks, allowing iterative reasoning to reinforce personalization and relevance dimensions.
- Ranking Mode: $M$ is set equal to $C$, the number of candidate items in a group. Each block corresponds to all candidate item tokens. The final block contains the refined representations used for ranking. Randomized candidate grouping is applied during training to encourage robust set-wise reasoning.
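The block-wise causal mask can be sketched as a boolean matrix over the base sequence followed by the reasoning blocks. Two assumptions are made here that the text does not spell out: tokens inside the same block are mutually visible, and base tokens never attend to reasoning blocks (consistent with reusing cached base keys and values). `block_causal_mask` is a hypothetical helper.

```python
def block_causal_mask(n_base, block_size, n_blocks):
    """Return mask[q][k] == True when query position q may attend to key position k.

    Layout: n_base base tokens, then n_blocks reasoning blocks of block_size tokens.
    """
    total = n_base + block_size * n_blocks

    def segment(pos):
        # Segment 0 is the base sequence; reasoning blocks are segments 1, 2, ...
        return 0 if pos < n_base else 1 + (pos - n_base) // block_size

    # A token sees its own segment and every earlier one, never a later block.
    return [[segment(k) <= segment(q) for k in range(total)]
            for q in range(total)]


# Hypothetical sizes: 3 base tokens and 2 reasoning blocks of 2 tokens each.
mask = block_causal_mask(n_base=3, block_size=2, n_blocks=2)
```

With these sizes, the first block's tokens see positions 0 to 4, and the second block's tokens see the whole sequence.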
4.2.3. Progressive Multi-Task Training
Building on the block-wise multi-step reasoning, OnePiece obtains intermediate block representations $\{\mathbf{B}_k\}_{k=1}^{K}$. To effectively supervise this trajectory, a progressive multi-task training paradigm is introduced, implementing curriculum learning through gradually increasing task complexity. Figure 4(b) depicts this strategy.

$K$ learning objectives $\{t_1, \dots, t_K\}$ are arranged in a progressive curriculum (e.g., exposure $\rightarrow$ click $\rightarrow$ add-to-cart $\rightarrow$ purchase). Each reasoning step $k$ is assigned to optimize a single task $t_k$, providing structured guidance and enabling the model to gradually align with deeper levels of user preference.
Retrieval Mode
In retrieval, user representations are extracted from reasoning blocks and optimized using a combination of calibrated probability estimation (Binary Cross-Entropy) and bidirectional contrastive learning objectives.
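For each reasoning block $\mathbf{B}_k$, a step-specific user representation is extracted via layer normalization followed by mean pooling:

$$\mathbf{r}_k = \mathrm{MeanPool}\big(\mathrm{LN}(\mathbf{B}_k)\big) \in \mathbb{R}^d$$

- $\mathbf{r}_k$: The user representation derived from reasoning block $\mathbf{B}_k$.
- $\mathrm{MeanPool}$: Mean pooling operation across the block.
- $\mathrm{LN}$: Layer Normalization.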
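For each training instance, a candidate pool $\mathcal{S}$ is constructed. For task $t_k$ assigned to step $k$, and candidate $v \in \mathcal{S}$, $y_v^{t_k} \in \{0, 1\}$ is the behavioral label. The candidate pool is partitioned into positive and negative sets, $\mathcal{S}_k^{+}$ and $\mathcal{S}_k^{-}$.

OnePiece employs two complementary learning objectives:

(i) Binary Cross-Entropy Loss (BCE) This provides point-wise calibrated probability estimates for individual user-item pairs:

$$\mathcal{L}_k^{BCE} = -\frac{1}{|\mathcal{S}|} \sum_{v \in \mathcal{S}} \Big[ y_v^{t_k} \log \sigma(\mathbf{r}_k^\top \mathbf{e}_v) + (1 - y_v^{t_k}) \log\big(1 - \sigma(\mathbf{r}_k^\top \mathbf{e}_v)\big) \Big]$$

- $\mathcal{L}_k^{BCE}$: The `BCE` loss for reasoning step $k$ and its assigned task $t_k$.
- $\sigma(\cdot)$: The `sigmoid function`, $\sigma(x) = 1 / (1 + e^{-x})$, which squashes values between 0 and 1, interpreting them as probabilities.
- $\mathbf{r}_k^\top \mathbf{e}_v$: The inner product (similarity) between the user representation $\mathbf{r}_k$ and the item embedding $\mathbf{e}_v$.
- $\mathbf{e}_v$: The item embedding for candidate $v$.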
(ii) Bidirectional Contrastive Learning (BCL)
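Inspired by CLIP, BCL operates at the batch level, enabling global contrastive reasoning across in-batch samples. It has two symmetric components:

- User-to-Item (U2I) Contrastive Learning: Enables each user representation to distinguish positive items from negative candidates.

  $$\mathcal{L}_k^{U2I} = -\frac{1}{|\mathcal{S}_k^{+}|} \sum_{v \in \mathcal{S}_k^{+}} \log \frac{\exp(\mathbf{r}_k^\top \mathbf{e}_v / \tau)}{\sum_{v' \in \mathcal{S}} \exp(\mathbf{r}_k^\top \mathbf{e}_{v'} / \tau)}$$

    - $\mathcal{L}_k^{U2I}$: The `U2I` contrastive loss for step $k$.
    - $\tau$: The `temperature parameter`, which scales the logits before the `softmax` function, influencing the sharpness of the probability distribution. A smaller $\tau$ makes the distribution sharper.
- Item-to-User (I2U) Contrastive Learning: Enables each positive item to identify its corresponding user representation within the batch. Let $\mathcal{R}_k$ be the set of user representations for step $k$ in the current training batch of size $|\mathcal{R}_k|$.

  $$\mathcal{L}_k^{I2U} = -\frac{1}{|\mathcal{S}_k^{+}|} \sum_{v \in \mathcal{S}_k^{+}} \log \frac{\exp(\mathbf{r}_k^\top \mathbf{e}_v / \tau)}{\sum_{\mathbf{r}' \in \mathcal{R}_k} \exp(\mathbf{r}'^\top \mathbf{e}_v / \tau)}$$

    - $\mathcal{L}_k^{I2U}$: The `I2U` contrastive loss for step $k$.
    - $\mathbf{r}'$: Represents a user representation from the batch.

The complete `BCL` objective for step $k$ combines both symmetric components:

$$\mathcal{L}_k^{BCL} = \mathcal{L}_k^{U2I} + \mathcal{L}_k^{I2U}$$

The overall retrieval loss aggregates the objectives across all reasoning steps:

$$\mathcal{L}_{\text{retrieval}} = \sum_{k=1}^{K} \big( \mathcal{L}_k^{BCE} + \mathcal{L}_k^{BCL} \big)$$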
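The retrieval objectives for a single reasoning step can be sketched in plain Python for one user. This is a minimal illustration under stated assumptions: vectors are small lists, losses are averaged over candidates or positives, the terms are summed with equal weight, and the temperature value of 0.1 is arbitrary. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(scores, index):
    """Log of the softmax probability assigned to scores[index] (stable form)."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[index] - log_norm


def bce_loss(user, items, labels):
    """Point-wise binary cross-entropy over all candidates in the pool."""
    total = 0.0
    for item, label in zip(items, labels):
        prob = 1.0 / (1.0 + math.exp(-dot(user, item)))
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(items)


def u2i_loss(user, items, labels, temperature):
    """User-to-item: each positive competes against every candidate in the pool."""
    scores = [dot(user, item) / temperature for item in items]
    positives = [i for i, label in enumerate(labels) if label == 1]
    return -sum(log_softmax_at(scores, i) for i in positives) / len(positives)


def i2u_loss(batch_users, user_index, positive_items, temperature):
    """Item-to-user: each positive item must pick its own user out of the batch."""
    total = 0.0
    for item in positive_items:
        scores = [dot(user, item) / temperature for user in batch_users]
        total -= log_softmax_at(scores, user_index)
    return total / len(positive_items)


def retrieval_step_loss(batch_users, user_index, items, labels, temperature=0.1):
    """BCE plus both contrastive directions for one step (equal weights assumed)."""
    user = batch_users[user_index]
    positives = [item for item, label in zip(items, labels) if label == 1]
    return (bce_loss(user, items, labels)
            + u2i_loss(user, items, labels, temperature)
            + i2u_loss(batch_users, user_index, positives, temperature))


items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
labels = [1, 0, 0]
uninformed = retrieval_step_loss([[0.0, 0.0], [0.0, 0.0]], 0, items, labels)
aligned = retrieval_step_loss([[1.0, 0.0], [0.0, 1.0]], 0, items, labels)
```

An all-zero user representation gives the chance-level value $\log 2 + \log 3 + \log 2$ for this toy pool, while a representation aligned with the positive item drives the loss well below it. In the progressive scheme, the same computation would be repeated per step with that step's task labels.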
Ranking Mode
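In the ranking stage, the block size $M$ equals the candidate group size $C$. Each reasoning block $\mathbf{B}_k$ contains hidden states $\{\mathbf{h}_i^{(k)}\}_{i=1}^{C}$, where $\mathbf{h}_i^{(k)}$ is the hidden state for the $i$-th candidate at reasoning step $k$. For task $t_k$ assigned to step $k$, candidate-wise logits are computed via a task-specific scoring network:

$$s_{i,k} = \mathrm{MLP}^{t_k}\big(\mathbf{h}_i^{(k)}\big)$$

- $s_{i,k}$: The predicted score (logit) for candidate $i$ at reasoning step $k$ for task $t_k$.
- $\mathrm{MLP}^{t_k}$: A Multi-Layer Perceptron (MLP) specifically designed for task $t_k$.

Two complementary learning objectives are employed:

(i) Binary Cross-Entropy Loss (BCE) Provides point-wise probability calibration for individual candidates:

$$\mathcal{L}_k^{BCE} = -\frac{1}{C} \sum_{i=1}^{C} \Big[ y_i^{t_k} \log \sigma(s_{i,k}) + (1 - y_i^{t_k}) \log\big(1 - \sigma(s_{i,k})\big) \Big]$$

- $y_i^{t_k}$: The behavioral label of candidate $i$ for task $t_k$ (e.g., 1 if clicked, 0 otherwise).

(ii) Set Contrastive Learning (SCL) Operates at the set-wise level, enabling each positive candidate to distinguish itself from negative candidates within the group:

$$\mathcal{L}_k^{SCL} = -\frac{1}{|\mathcal{G}_k^{+}|} \sum_{i \in \mathcal{G}_k^{+}} \log \frac{\exp(s_{i,k})}{\sum_{j=1}^{C} \exp(s_{j,k})}$$

- The summation is over all positive candidates $\mathcal{G}_k^{+}$ in the group.
- The denominator includes all candidates in the group, making each positive candidate compete against all others for ranking position.

The overall ranking loss combines both objectives across all reasoning steps:

$$\mathcal{L}_{\text{ranking}} = \sum_{k=1}^{K} \big( \mathcal{L}_k^{BCE} + \mathcal{L}_k^{SCL} \big)$$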
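The ranking objectives can be sketched the same way, taking per-step logits as given. The sketch assumes losses are averaged (over candidates for BCE, over positives for the set-wise term), that steps are summed with equal weight, and that a group with no positive contributes zero to the set-wise term; none of these conventions is confirmed by the source. Function names are hypothetical.

```python
import math


def bce_from_logits(logits, labels):
    """Point-wise binary cross-entropy averaged over the candidate group."""
    total = 0.0
    for logit, label in zip(logits, labels):
        prob = 1.0 / (1.0 + math.exp(-logit))
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(logits)


def set_contrastive_loss(logits, labels):
    """Each positive competes against every candidate in its group."""
    positives = [logit for logit, label in zip(logits, labels) if label == 1]
    if not positives:
        return 0.0  # assumption: groups without positives are skipped
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return -sum(logit - log_norm for logit in positives) / len(positives)


def progressive_ranking_loss(step_logits, step_labels):
    """Sum BCE + set-wise loss over steps; step k uses the labels of task k."""
    return sum(bce_from_logits(logits, labels) + set_contrastive_loss(logits, labels)
               for logits, labels in zip(step_logits, step_labels))


# Hypothetical group of 4 candidates with a click -> add-to-cart -> order curriculum.
funnel_labels = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
flat = progressive_ranking_loss([[0.0] * 4] * 3, funnel_labels)
informed = progressive_ranking_loss(
    [[2.0, 2.0, -2.0, -2.0], [2.0, -2.0, -2.0, -2.0], [2.0, -2.0, -2.0, -2.0]],
    funnel_labels)
```

With all-zero logits every step contributes $\log 2 + \log 4$, so the flat total is $3 \log 8$; logits that agree with the labels reduce it.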
4.2.4. Time Complexity Analysis
The time complexity analysis for OnePiece considers both the backbone encoder and the reasoning phase.
- Backbone Encoder (without reasoning): The time complexity of each `Transformer` layer is $O(N^2 d + N d^2)$, where $N$ is the base sequence length and $d$ is the hidden dimension. With $L$ layers, the total cost is $O\big(L(N^2 d + N d^2)\big)$.
    - $O(N^2 d)$: Cost for the attention mechanism (computing $\mathbf{Q}\mathbf{K}^\top$).
    - $O(N d^2)$: Cost for linear projections (Q, K, V, and output projection) and the `FFN`.
- Reasoning Phase: `OnePiece` employs `KV Caching` (Key-Value Caching) to reuse historical key-value pairs, meaning each new reasoning step only incurs attention between the new block tokens and the cached tokens. At reasoning step $k$:
    - Compute Q, K, V for new block tokens: $O(M d^2)$.
    - Calculate attention between new tokens and all cached tokens: $O\big(M (N + kM) d\big)$, which is $O(M N d)$ when the reasoning blocks are short relative to the base sequence ($KM \ll N$).
    - Apply output projection (part of FFN and subsequent projections): $O(M d^2)$.

  Therefore, the complexity per layer at step $k$ is $O(M d^2 + M N d)$. Aggregating over $K$ reasoning steps and $L$ layers, the total additional reasoning cost is:

  $$O\big(L K (M d^2 + M N d)\big)$$

    - $L$: Number of `Transformer` layers.
    - $K$: Number of reasoning steps.
    - $M$: Block size.
    - $N$: Base sequence length.
    - $d$: Hidden dimension.

  This formula shows that the reasoning cost scales linearly with $L$, $K$, $M$, and $N$, which is crucial for controlling computational overhead in industrial settings.
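A rough way to see how the reasoning overhead compares with the encoder cost is to count the dominant terms directly. The sketch below ignores constant factors, counts attention at each step against the base tokens plus all reasoning blocks produced so far, and uses made-up sizes; the function names are hypothetical.

```python
def backbone_cost(n_layers, seq_len, hidden):
    """Dominant-term cost of the encoder: attention plus projections/FFN per layer."""
    return n_layers * (seq_len ** 2 * hidden + seq_len * hidden ** 2)


def reasoning_cost(n_layers, n_steps, block, seq_len, hidden):
    """Extra cost of KV-cached reasoning, summed over steps and layers."""
    total = 0
    for step in range(1, n_steps + 1):
        cached = seq_len + step * block          # keys visible to the new block
        projections = block * hidden ** 2        # Q/K/V, output projection, FFN
        attention = block * cached * hidden      # new tokens against the cache
        total += n_layers * (projections + attention)
    return total


# Hypothetical sizes, chosen only to illustrate the ratio.
encoder = backbone_cost(n_layers=2, seq_len=10, hidden=4)
extra = reasoning_cost(n_layers=2, n_steps=2, block=3, seq_len=10, hidden=4)
overhead_ratio = extra / encoder
```

Because only the new block's tokens are processed at each step, the extra cost stays below the cost of re-encoding the full sequence whenever the block is much shorter than the base sequence.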
5. Experimental Setup
5.1. Datasets
The experiments were conducted using 30-day logs from Shopee, a large e-commerce platform operating in Southeast Asia and Latin America, serving billions of users. The dataset includes multi-behavior user interactions.
The following are the results from Table 1 of the original paper:
| #User | #Item | #Query | #Impression | #Click | #Add-to-Cart | #Order |
|---|---|---|---|---|---|---|
| 10M | 93M | 12M | 0.24B | 60M | 12M | 6M |
- Source: Shopee E-commerce Platform.
- Scale: Contains data for 10 million unique users, 93 million unique items, and 12 million unique queries. The total number of impressions is 0.24 billion, with 60 million clicks, 12 million add-to-cart events, and 6 million orders.
- Characteristics: The dataset captures `multi-behavior user interaction`, which is essential for `progressive multi-task training` as it provides diverse feedback signals (impression, click, add-to-cart, order) for different reasoning steps.
- Domain: E-commerce, specifically personalized search.
5.1.1. Offline Dataset Construction (from Appendix A)
Retrieval Stage: The objective is to retrieve items users may potentially interact with. The training samples focus on impression and click objectives.
- Filtering: Session request data where users did not exhibit click behaviors are filtered out.
- Positive Samples: $m$ clicked items serve as positive samples for both `impression` and `click` tasks.
- Mixed Samples: $n$ exposed but unclicked items serve as positive samples for the `impression` task but simultaneously as negative samples for the `click` task.
- Additional Negative Samples: $k$ items are sampled from unexposed items within the top-500 results from the ranking stage. These serve as additional negative samples.
- Hard Negative Samples: $l$ items from the same category as the clicked items are sampled. These serve as `hard negative samples` to enhance model convergence and mitigate `homogeneous recommendation risks` (i.e., recommending too many similar items).

The specific values of $m$, $n$, $k$, and $l$ are determined by domain experts and empirical validation.
Ranking Stage: As a downstream stage, ranking requires finer refinement. The focus is on session requests with click behaviors.
- Positive Samples: Task-specific interaction types (impression, click, add-to-cart, order) are used as positive samples for their respective tasks.
- Negative Samples: Interactions from preceding tasks in the conversion funnel serve as negative samples for each respective task. For example, for the `order prediction task`, items that were exposed, clicked, and added to the cart but not purchased serve as negative samples.
- Augmented Hard Negative Samples: Similar to retrieval, items are randomly sampled from the top-500 ranking results that were not exposed to users. These serve as augmented `hard negative samples` to improve model performance.
5.2. Evaluation Metrics
5.2.1. Offline Evaluation Metrics
Retrieval Stage: The primary concern is the number of clicked items successfully recalled.
- Recall@K (R@K): Measures the proportion of relevant items (clicked items in this case) that are successfully retrieved within the top $K$ items.
    - Conceptual Definition: Recall quantifies the model's ability to find all relevant items in the corpus. `Recall@K` specifically measures this within the top $K$ retrieved items. A higher `Recall@K` indicates that the model is better at bringing relevant items into the candidate pool for subsequent stages.
    - Mathematical Formula:

      $$\text{Recall@}K = \frac{\text{Number of relevant items in top } K}{\text{Total number of relevant items}}$$

    - Symbol Explanation:
        - `Number of relevant items in top K`: The count of items truly relevant to the user's intent that appear within the top $K$ items returned by the retrieval model.
        - `Total number of relevant items`: The total count of items that are truly relevant to the user's intent in the entire corpus (or ground truth set).

  The paper reports `Recall@100` and `Recall@500`.
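The Recall@K definition above can be computed with a short helper. This sketch assumes the ground-truth set is non-empty and counts each retrieved item once even if the ranked list contains duplicates; `recall_at_k` is a hypothetical name.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear among the top-k retrieved items."""
    relevant = set(relevant_items)
    if not relevant:
        raise ValueError("Recall@K is undefined without any relevant items")
    hits = set(ranked_items[:k]) & relevant
    return len(hits) / len(relevant)


# Toy example: 3 clicked items, one of which is recalled in the top 2.
score_top2 = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=2)
score_top4 = recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=4)
```

Increasing $K$ can only keep recall the same or raise it, which is why `Recall@500` is never lower than `Recall@100` for the same model.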
Ranking Stage: Evaluates the model's ability to precisely estimate preference scores for different types of user feedback.
- AUC (Area Under the Receiver Operating Characteristic Curve):
    - Conceptual Definition: The `ROC curve` plots the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) at various threshold settings. `AUC` represents the area under this curve. It quantifies the model's ability to distinguish between positive and negative classes. An `AUC` of 1.0 means perfect classification, while 0.5 means random classification. In recommendation, it indicates how well the model ranks a randomly chosen positive item higher than a randomly chosen negative item.
    - Mathematical Formula:

      $$\text{AUC} = \frac{\sum_{i=1}^{N^{+}} \sum_{j=1}^{N^{-}} \Big[ \mathbb{I}(s_i^{+} > s_j^{-}) + 0.5\, \mathbb{I}(s_i^{+} = s_j^{-}) \Big]}{N^{+} N^{-}}$$

    - Symbol Explanation:
        - $N^{+}$: The total number of positive samples (e.g., clicked items).
        - $N^{-}$: The total number of negative samples (e.g., unclicked items).
        - $s_i^{+}$: The predicted score for the $i$-th positive sample.
        - $s_j^{-}$: The predicted score for the $j$-th negative sample.
        - $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition is true, and 0 otherwise.
        - The numerator sums up cases where a positive sample is ranked higher than a negative sample, with ties counting as 0.5.
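The pairwise reading of AUC can be checked with a direct implementation. This is a quadratic-time sketch meant for small examples, not the sort-based computation a production evaluator would use; `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in pos_scores:
        for neg in neg_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Four pairs: three correctly ordered and one tie.
example_auc = pairwise_auc([0.9, 0.4], [0.4, 0.1])
```

Here three pairs are ordered correctly and one is tied, giving $(3 + 0.5) / 4 = 0.875$.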
- GAUC (Group AUC / User-wise AUC):
    - Conceptual Definition: `AUC` can sometimes be misleading if the distribution of positive/negative samples varies greatly across users. `GAUC` addresses this by calculating the `AUC` for each user (or query group) separately and then averaging these per-user `AUC`s, often weighted by the number of impressions or positive samples for that user. This gives a more personalized and often more robust evaluation of ranking performance.
    - Mathematical Formula:

      $$\text{GAUC} = \frac{\sum_{u \in \mathcal{U}} w_u \cdot \text{AUC}_u}{\sum_{u \in \mathcal{U}} w_u}$$

    - Symbol Explanation:
        - $\mathcal{U}$: The set of all users.
        - $\text{AUC}_u$: The `AUC` calculated for user $u$.
        - $w_u$: The weight for user $u$, typically the number of impressions or positive samples generated by user $u$. This ensures that users with more interactions contribute more to the overall `GAUC`.

  The paper reports `AUC` and `GAUC` for three feedback types: `click (C-)`, `add-to-cart (A-)`, and `order (O-)`.
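The weighted per-user averaging behind GAUC can be sketched as follows. The sketch assumes users whose AUC is undefined (no positives or no negatives) are skipped, which is a common convention but is not stated in the source; the weights are supplied by the caller. Both function names are hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(1.0 if pos > neg else 0.5 if pos == neg else 0.0
               for pos in pos_scores for neg in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def group_auc(groups):
    """Weighted mean of per-user AUC.

    groups: iterable of (pos_scores, neg_scores, weight), one entry per user.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pos_scores, neg_scores, weight in groups:
        if not pos_scores or not neg_scores:
            continue  # assumption: users with undefined AUC are skipped
        weighted_sum += weight * pairwise_auc(pos_scores, neg_scores)
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no user with both positive and negative samples")
    return weighted_sum / weight_total


example_gauc = group_auc([
    ([0.9], [0.1, 0.2], 3),   # per-user AUC 1.0, weight 3
    ([0.5], [0.5], 1),        # per-user AUC 0.5, weight 1
    ([0.7], [], 10),          # no negatives: skipped
])
```

The two usable users give $(3 \cdot 1.0 + 1 \cdot 0.5) / 4 = 0.875$; the third user has no negative samples and does not affect the result.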
5.2.2. Online Evaluation Metrics (from Section 5.1.3)
These metrics track business and user engagement indicators in real-world A/B tests.
- GMV/UU (Gross Merchandise Volume per Unique User):
- Conceptual Definition: The average total value of goods sold per unique user. It's a key business metric reflecting the overall revenue generated from users.
- GMV(99.5%)/UU:
    - Conceptual Definition: `GMV per user` excluding the top `0.5%` high-value orders. This metric is used to filter out extreme outliers (e.g., very large, rare purchases) and reflect the stable contributions from regular transactions, providing a more robust measure of typical user spending.
- AR/UU (Advertising Revenue per Unique User):
- Conceptual Definition: The average advertising revenue generated per unique user. This metric reflects the effectiveness of the system in converting ad exposures into ad-related revenue.
- Order/UU:
- Conceptual Definition: The average number of orders placed per user, capturing transaction frequency.
- Paid Order/UU:
    - Conceptual Definition: The average number of successfully paid orders per user, counting only completed purchases without refunds. This is a more stringent measure of conversion than `Order/UU`.
- CTR (Click-Through-Rate):
    - Conceptual Definition: The ratio of clicked impressions to total impressions. It measures how often users click on items after seeing them, reflecting the attractiveness and relevance of the ranked results.
    - Mathematical Formula:

      $$\text{CTR} = \frac{\text{Number of Clicks}}{\text{Number of Impressions}}$$

    - Symbol Explanation:
        - `Number of Clicks`: Total count of times users clicked on items.
        - `Number of Impressions`: Total count of times items were displayed to users.
- CTCVR (Click-to-Conversion Rate):
    - Conceptual Definition: The ratio of successful conversions (e.g., purchases) to total clicks. It measures the effectiveness of transforming user engagement (clicks) into completed transactions, reflecting the quality of the clicked items.
    - Mathematical Formula:

      $$\text{CTCVR} = \frac{\text{Number of Conversions}}{\text{Number of Clicks}}$$

    - Symbol Explanation:
        - `Number of Conversions`: Total count of times users completed a desired action (e.g., order, add-to-cart).
        - `Number of Clicks`: Total count of times users clicked on items.
- Buyer:
- Conceptual Definition: The proportion of unique users who placed at least one order. It indicates the breadth of user conversion.
- Bad Query Rate:
- Conceptual Definition: The percentage of queries for which human evaluators judge the recommended content as irrelevant. This serves as an inverse measure of recommendation accuracy and user satisfaction, aiming for lower values.
5.3. Baselines
OnePiece is compared against several representative baselines to demonstrate its superiority:
- DLRM (Production baseline in Shopee): This is Shopee's highly optimized internal production model, representing a strong industrial benchmark.
    - Retrieval Mode: Uses a `two-tower` architecture (inspired by `DSSM`). User context (query, history) is encoded separately from items. It incorporates `DIN`-like attention (for relevance), `zero-attention`, a `lightweight text CNN` (for keyword features), `DCNv2` (for high-order cross features), and `mean pooling` for sequential features, all fused via an `MLP`.
    - Ranking Mode: Uses a `single-tower` architecture where candidate items are jointly encoded with user features. The backbone is `ResFlow` (Fu et al., 2024), combined with `DIN`-like target attention, `cross-attention` across sequential behaviors, `DCNv2` for higher-order interactions, and `MLP` fusion. A `SENet` module (Hu et al., 2017) further supports adaptive feature selection for different tasks.
- HSTU (Zhai et al., 2024): This is a generative recommendation framework proposed by Meta.
    - Core Idea: It typically considers `Interaction History (IH)` and `Situational Descriptors (SD)`.
    - Adaptation for Comparison: For a fair comparison, its parameter size is aligned with `OnePiece`. An additional variant, HSTU+PA, is also evaluated, where `Preference Anchors (PA)` are introduced into its input sequence, consistent with `OnePiece`'s context engineering.
- ReaRec (Tang et al., 2025): This is a reasoning-enhanced recommendation model that models user representation as a multi-step reasoning process over item sequences.
    - Core Idea: The vanilla `ReaRec` supports retrieval tasks using user interaction history.
    - Adaptation for Comparison: Its backbone and feature inputs are adapted to match `OnePiece`. For ranking, it is adapted by introducing candidate items into the input sequence and applying a `target-aware attention mask` (similar to `HSTU`'s design), where sequence tokens can attend to the candidate but candidates remain mutually invisible. A variant, ReaRec+PA, is also evaluated, augmenting `IH` and `SD` with `Preference Anchors`.
6. Results & Analysis
This section delves into the experimental results, including overall performance, ablation studies, scaling analysis, and online A/B testing, to validate the effectiveness and practicality of OnePiece.
6.1. Overall Performance
The following are the results from Table 2 of the original paper:
| Model | Retrieval R@100 | Retrieval R@500 | Ranking C-AUC | Ranking C-GAUC | Ranking A-AUC | Ranking A-GAUC | Ranking O-AUC | Ranking O-GAUC |
|---|---|---|---|---|---|---|---|---|
| DLRM | 0.458 | 0.679 | 0.856 | 0.851 | 0.893 | 0.843 | 0.931 | 0.854 |
| HSTU | 0.443 | 0.658 | 0.833 | 0.829 | 0.878 | 0.827 | 0.913 | 0.839 |
| HSTU+PA | 0.472 | 0.680 | 0.855 | 0.852 | 0.901 | 0.848 | 0.926 | 0.849 |
| ReaRec | 0.452 | 0.674 | 0.843 | 0.838 | 0.882 | 0.834 | 0.919 | 0.843 |
| ReaRec+PA | 0.485 | 0.701 | 0.862 | 0.863 | 0.908 | 0.851 | 0.927 | 0.851 |
| OnePiece | 0.517 | 0.731 | 0.911 | 0.909 | 0.952 | 0.897 | 0.963 | 0.886 |
Table 2 presents the performance comparison of different models on both retrieval and ranking tasks using 30 days of training data.
- DLRM as a Strong Baseline: The `DLRM` baseline, highly optimized within Shopee, shows strong performance, outperforming the vanilla `HSTU` and `ReaRec` on every metric. This indicates that `DLRM` effectively leverages rich feature interactions and various sequential features.
- Impact of Preference Anchors (PA): Both HSTU+PA and ReaRec+PA consistently outperform their vanilla counterparts across all metrics. For instance, HSTU+PA improves `R@100` from 0.443 to 0.472, and ReaRec+PA raises `R@100` from 0.452 to 0.485. This indicates that enriching user history with auxiliary `preference anchors` provides valuable, complementary information, guiding the model towards better understanding of context-specific user intents. ReaRec+PA scores higher than HSTU+PA on every metric, likely due to its `Transformer` backbone with bi-directional attention and reasoning capabilities.
- OnePiece's Superiority: `OnePiece` achieves the best overall results across all metrics and tasks.
    - Retrieval: Compared to the strongest baseline ReaRec+PA, `OnePiece` improves `Recall@100` from 0.485 to 0.517 (a relative gain of ~6.6%) and `Recall@500` from 0.701 to 0.731 (a relative gain of ~4.3%).
    - Ranking: In ranking, relative to ReaRec+PA, `OnePiece` boosts `C-AUC` from 0.862 to 0.911 (a relative gain of ~5.7%), `A-AUC` from 0.908 to 0.952 (a relative gain of ~4.8%), and `O-AUC` from 0.927 to 0.963 (a relative gain of ~3.9%). Similar improvements are observed for `GAUC` metrics.

These consistent gains support `OnePiece`'s core design principles, particularly its `block-wise latent reasoning` and `progressive multi-task training` strategy, which enable finer-grained preference refinement through multi-step reasoning. This positions `OnePiece` as a more powerful and unified framework for industrial retrieval and ranking.
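The relative gains quoted above follow from the Table 2 values as (new - baseline) / baseline. The short check below recomputes them against the ReaRec+PA row; `relative_gain_pct` is a hypothetical helper, and the numbers are copied from the table.

```python
def relative_gain_pct(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent, to one decimal."""
    return round(100.0 * (new - baseline) / baseline, 1)


# (ReaRec+PA, OnePiece) pairs taken from Table 2.
table2_pairs = {
    "R@100": (0.485, 0.517),
    "R@500": (0.701, 0.731),
    "C-AUC": (0.862, 0.911),
    "A-AUC": (0.908, 0.952),
    "O-AUC": (0.927, 0.963),
}
gains = {metric: relative_gain_pct(base, new)
         for metric, (base, new) in table2_pairs.items()}
```

Note that AUC has a floor of 0.5 for a random ranker, so a relative gain computed on the raw AUC value understates the improvement measured against that floor.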
6.2. Ablation Study
6.2.1. Context Engineering Ablation
The following are the results from Table 3 of the original paper:
| Version | Model | Retrieval R@100 | Retrieval R@500 | Ranking C-AUC | Ranking C-GAUC | Ranking A-AUC | Ranking A-GAUC | Ranking O-AUC | Ranking O-GAUC |
|---|---|---|---|---|---|---|---|---|---|
| V1 | IH(ID) | 0.407 | 0.646 | 0.802 | 0.802 | 0.860 | 0.819 | 0.908 | 0.835 |
| V2 | IH(ID+Side Info) | 0.428 | 0.657 | 0.846 | 0.844 | 0.871 | 0.839 | 0.918 | 0.845 |
| V3 | V2+PA(10) | 0.459 | 0.677 | 0.879 | 0.876 | 0.923 | 0.863 | 0.940 | 0.861 |
| V4 | V2+PA(20) | 0.467 | 0.686 | 0.885 | 0.886 | 0.929 | 0.869 | 0.946 | 0.866 |
| V5 | V2+PA(30) | 0.475 | 0.689 | 0.892 | 0.890 | 0.936 | 0.874 | 0.949 | 0.871 |
| V6 | V2+PA(60) | 0.491 | 0.707 | 0.901 | 0.900 | 0.945 | 0.886 | 0.956 | 0.880 |
| V7 | V2+PA(90) | 0.504 | 0.719 | 0.908 | 0.905 | 0.951 | 0.896 | 0.962 | 0.885 |
| V8 | V7+SD | 0.517 | 0.731 | 0.911 | 0.909 | 0.952 | 0.897 | 0.963 | 0.886 |
Table 3 details the ablation study on OnePiece's context engineering design, showing the progressive impact of adding Interaction History (IH), Preference Anchors (PA), and Situational Descriptors (SD).
- V1 (IH(ID) - Minimal Baseline): Starting with only user interaction sequences composed of raw item IDs and a two-layer `Transformer` with bi-directional attention, this configuration yields the lowest performance (e.g., `R@100` of 0.407, `C-AUC` of 0.802). This highlights the need for richer features and context.
- V2 (IH(ID+Side Info) - Adding Item Features): Introducing side information (additional features beyond raw IDs) for each item in the `IH` sequence leads to a clear improvement across all metrics (e.g., `R@100` increases to 0.428, `C-AUC` to 0.846). This demonstrates the importance of comprehensive item features.
- V3-V7 (V2+PA(L) - Incorporating Preference Anchors): Gradually increasing the length ($L$) of `Preference Anchors` from 10 to 90 consistently boosts performance. For example, `R@100` increases from 0.459 (PA(10)) to 0.504 (PA(90)), and `C-AUC` from 0.879 to 0.908. This shows a clear `scaling effect` of `PA`, where longer auxiliary item sequences provide richer query-specific context, enabling the model to capture more fine-grained user intent. `PA` introduces query-dependent signals that are absent in plain `IH`, helping differentiate user preferences under various queries.
- V8 (V7+SD - Adding Situational Descriptors): Finally, incorporating `Situational Descriptors` (static user features and query-specific information) yields the best overall results.
    - Retrieval Impact: `Recall@100` improves from 0.504 (V7) to 0.517, and `R@500` from 0.719 to 0.731. This gain suggests that `SD` provides stronger contextual grounding, which is particularly beneficial for the retrieval stage to find a broader set of relevant items.
    - Ranking Impact: The gains in ranking metrics are marginal (e.g., `C-AUC` from 0.908 to 0.911). This is because `IH` already provides rich personalization, and `PA` captures detailed query-specific preferences. Since ranking focuses on fine-grained comparisons among highly relevant candidates, `SD` serves as a weaker anchor in this stage.

In summary, this ablation study demonstrates the effectiveness of `OnePiece`'s structured `context engineering`. `IH` captures long-term personalization, `PA` offers scalable and query-specific anchors, and `SD` provides stable contextual grounding, all contributing in complementary ways to enrich user-query representations and achieve consistent improvements.
6.2.2. Training Strategy Ablation
The following are the results from Table 4 of the original paper:
| Version | Training Strategy | R@100 | R@500 |
|---|---|---|---|
| V1 | Causal Mask | 0.464 | 0.671 |
| V2 | Bi-Directional | 0.470 | 0.676 |
| V3 | V2 + 1-Step Reasoning, Click Task on Last Step | 0.490 | 0.708 |
| V4 | V2 + 1-Step Reasoning, Multi-Task on Last Step | 0.495 | 0.714 |
| V5 | V2 + 2-Step Reasoning, Multi-Task on Last Step | 0.510 | 0.726 |
| V6 | V2 + 2-Step Reasoning, Progressive Multi-Task | 0.517 | 0.731 |
Table 4 shows the impact of different training strategies on retrieval performance.
The following are the results from Table 5 of the original paper:
| Version | Training Strategy | C-AUC | C-GAUC | A-AUC | A-GAUC | O-AUC | O-GAUC |
|---|---|---|---|---|---|---|---|
| V1 | Causal Mask | 0.839 | 0.836 | 0.876 | 0.830 | 0.911 | 0.838 |
| V2 | Bi-Directional, CIS Inter-Invisible | 0.860 | 0.859 | 0.903 | 0.848 | 0.920 | 0.847 |
| V3 | Bi-Directional, CIS Inter-Visible | 0.881 | 0.879 | 0.918 | 0.857 | 0.937 | 0.854 |
| V4 | V3 + 1-Step Reasoning, Multi-Task on Last Step | 0.890 | 0.889 | 0.931 | 0.871 | 0.946 | 0.867 |
| V5 | V3 + 2-Step Reasoning, Multi-Task on Last Step | 0.893 | 0.894 | 0.936 | 0.876 | 0.948 | 0.869 |
| V6 | V3 + 3-Step Reasoning, Multi-Task on Last Step | 0.906 | 0.902 | 0.946 | 0.889 | 0.957 | 0.881 |
| V7 | V3 + 3-Step Reasoning, Progressive Multi-Task | 0.911 | 0.909 | 0.952 | 0.897 | 0.963 | 0.886 |
Table 5 details the impact of training strategies on ranking performance.
- Impact of Bi-Directional Attention (V1 vs. V2): Switching from
causal mask (V1) to bi-directional attention (V2) yields significant gains across both tasks. In retrieval, R@100 improves from 0.464 to 0.470. In ranking, C-AUC jumps from 0.839 to 0.860. This validates that bi-directional attention, by allowing tokens to condition on the full context, provides more comprehensive representation information crucial for non-autoregressive recommendation tasks.
- Impact of Candidate Inter-Visibility (V2 vs. V3 for Ranking): For ranking tasks, enabling Candidate Item Set (CIS) inter-visibility (V3) (allowing candidates to attend to each other) provides a major boost, with C-AUC increasing from 0.860 to 0.881. This confirms that the ability to perform rich comparative reasoning among candidates in a shared latent space is essential for accurate ranking.
- Impact of Block-Wise Reasoning (V2/V3 to V4-V6): The introduction of the block-wise reasoning mechanism consistently demonstrates cumulative performance gains:
  - Retrieval: Moving from simple bi-directional attention (V2) to 1-step reasoning with click prediction (V3) improves R@100 to 0.490. Using multi-task learning on the final step (V4) further increases R@100 to 0.495. Extending to 2-step reasoning (V5) pushes R@100 to 0.510.
  - Ranking: Starting from CIS inter-visible (V3), 1-step reasoning (V4) improves C-AUC to 0.890. 2-step reasoning (V5) brings C-AUC to 0.893, and 3-step reasoning (V6) achieves C-AUC of 0.906. These results indicate that each additional reasoning step meaningfully contributes to performance by capturing more nuanced user behavioral patterns, enabling increasingly sophisticated preference modeling.
- Impact of Progressive Multi-Task Training (V5/V6 vs. V6/V7): OnePiece's progressive multi-task training strategy consistently outperforms single-embedding multi-task learning (where all tasks supervise only the final step's embedding).
  - Retrieval: Progressive multi-task (V6) achieves R@100 of 0.517, surpassing the 2-step reasoning with multi-task on the last step (V5) at 0.510.
  - Ranking: Progressive multi-task (V7) achieves C-AUC of 0.911, compared to 3-step reasoning with multi-task on the last step (V6) at 0.906.

The key advantage lies in distributing different tasks across multiple reasoning steps, preventing gradient conflicts and encouraging each reasoning step to specialize in extracting task-specific information. The optimal number of reasoning steps also differs by task: retrieval benefits most from two steps (impression-click hierarchy), while ranking benefits from three steps (full conversion funnel), demonstrating the adaptive nature of the progressive framework.
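The difference between progressive supervision and last-step supervision can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the task names (click, add-to-cart, order, guessed from the C-/A-/O- metric prefixes), the idea of one scalar logit per task per step, and the plain binary cross-entropy loss. It is not the paper's actual loss formulation.

```python
import math


def bce_with_logit(logit: float, label: int) -> float:
    """Binary cross-entropy for one example, computed stably from a raw logit."""
    # softplus(x) = log(1 + exp(x)), written to avoid overflow for large |x|.
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


# Hypothetical task order along the feedback chain: one task per reasoning step.
TASKS = ["click", "add_to_cart", "order"]


def progressive_loss(step_logits, labels):
    """Progressive scheme: reasoning step k is supervised only by task k.

    step_logits[k][task] is the logit a (hypothetical) task head produces
    from the hidden state of reasoning step k.
    """
    return sum(
        bce_with_logit(step_logits[k][task], labels[task])
        for k, task in enumerate(TASKS)
    )


def last_step_loss(step_logits, labels):
    """Baseline scheme: every task supervises the final step only."""
    final = step_logits[-1]
    return sum(bce_with_logit(final[task], labels[task]) for task in TASKS)


# Made-up logits for a single training example.
labels = {"click": 1, "add_to_cart": 1, "order": 0}
step_logits = [
    {"click": 2.0, "add_to_cart": 0.0, "order": 0.0},
    {"click": 0.0, "add_to_cart": 1.0, "order": 0.0},
    {"click": 0.0, "add_to_cart": 0.0, "order": -1.5},
]
print("progressive:", progressive_loss(step_logits, labels))
print("last-step:  ", last_step_loss(step_logits, labels))
```

In the progressive variant each step receives gradients from a single task, which is one way to read the claim that the scheme reduces gradient conflicts; in the baseline all three tasks push on the same final-step representation.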
6.3. Scaling Analysis
6.3.1. Training Data Scaling

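The figure shows training convergence curves of different models on the retrieval and ranking tasks. The left panel plots Recall@100 in retrieval mode and the right panel plots Click AUC in ranking mode, both against training data span (days) on the x-axis, illustrating how OnePiece trends relative to the other models.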
Figure 5 | Training convergence curves of different models on retrieval and ranking tasks.
Figure 5 illustrates the training convergence curves of OnePiece compared to DLRM and HSTU over increasing training data spans (up to 60 days).
- Superior Data Efficiency: OnePiece already surpasses both baselines after only 7-10 days of training. This indicates OnePiece's superior data efficiency, attributable to its context-aware and multi-step reasoning architecture, which can extract more value from less data.
- Stronger Scaling Capabilities: While DLRM and HSTU quickly converge to a plateau, OnePiece continues to improve with longer training spans, and the performance gap widens. By day 60, OnePiece demonstrates a pronounced lead, showing continuous improvement potential. This suggests OnePiece possesses a stronger modeling capacity that can effectively exploit richer behavioral supervision from extended time horizons, indicating superior scaling capabilities compared to baselines.
- Robust Optimization: The training curves for OnePiece exhibit smooth and stable growth without significant fluctuations, demonstrating robust optimization under its progressive multi-task supervision.

These results confirm that OnePiece not only achieves higher sample efficiency but also scales more effectively as more training data becomes available, making it suitable for industrial environments with vast and continuously growing datasets.
6.3.2. Reasoning Scaling
The following are the results from Table 6 of the original paper:
| Block Size | C-AUC | C-GAUC | A-AUC | A-GAUC | O-AUC | O-GAUC |
| --- | --- | --- | --- | --- | --- | --- |
| M = C = 1 | 0.885 | 0.881 | 0.923 | 0.861 | 0.947 | 0.871 |
| M = C = 4 | 0.913 | 0.911 | 0.951 | 0.896 | 0.961 | 0.885 |
| M = C = 8 | 0.920 | 0.918 | 0.956 | 0.899 | 0.964 | 0.887 |
| M = C = 12 | 0.927 | 0.923 | 0.958 | 0.903 | 0.969 | 0.893 |
Table 6 investigates the impact of the reasoning block size M (which in ranking mode equals the number of candidate items C) on OnePiece's ranking performance, using 60 days of training data.
- Consistent Improvements: Increasing M from 1 to 12 yields consistent improvements across all evaluated metrics. For example, C-AUC increases from 0.885 at M = 1 to 0.927 at M = 12.
- Largest Initial Gain: The most substantial performance gain occurs when scaling from M = 1 to M = 4. This is because pointwise modeling (M = 1) lacks cross-sample comparisons, meaning candidates are evaluated in isolation. Grouping candidates into blocks (M > 1) enables the reasoning mechanism to contrast preferences more effectively, directly aligning with the intrinsic nature of ranking.
- Diminishing Returns: As the block size continues to increase beyond M = 4, the improvements become smaller yet remain positive (e.g., C-AUC increases from 0.913 at M = 4 to 0.920 at M = 8, then to 0.927 at M = 12). This suggests diminishing returns, possibly because overly large blocks might overload the reasoning medium with redundant information, saturating its representational capacity.

These findings reveal a trade-off: expanding reasoning bandwidth is beneficial, but there is a point where information redundancy might limit further gains. Selecting an appropriate block size is crucial for maximizing the effectiveness of block-wise reasoning.
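The diminishing-returns pattern can be checked directly from the C-AUC column of Table 6. The short sketch below (the helper name `marginal_gains` is ours, not the paper's) computes the absolute gain between consecutive block sizes and the gain per added block slot.

```python
# C-AUC by block size M, copied from Table 6.
c_auc = {1: 0.885, 4: 0.913, 8: 0.920, 12: 0.927}


def marginal_gains(metric_by_m):
    """Return (M_prev, M_next, absolute gain, gain per added block slot)."""
    sizes = sorted(metric_by_m)
    rows = []
    for prev, nxt in zip(sizes, sizes[1:]):
        gain = metric_by_m[nxt] - metric_by_m[prev]
        rows.append((prev, nxt, round(gain, 3), round(gain / (nxt - prev), 4)))
    return rows


for prev, nxt, gain, per_slot in marginal_gains(c_auc):
    print(f"M={prev:>2} -> M={nxt:>2}: +{gain:.3f} C-AUC ({per_slot:.4f} per added slot)")
```

The first step (M = 1 to M = 4) contributes 0.028 C-AUC, while each of the two later steps contributes 0.007, so the per-slot gain drops sharply once candidates can be compared at all.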
6.4. Online A/B Testing
OnePiece was evaluated through large-scale online A/B testing on Shopee's production search system to assess its real-world effectiveness, with a dedicated share of traffic allocated to the experiments.
6.4.1. Online Inference Details (from Section 5.1.1)
- Retrieval Stage: Offline training generates vector representations for the entire item pool. An Approximate Nearest Neighbor (ANN) index is constructed using the Hierarchical Navigable Small World (HNSW) algorithm (Malkov and Yashunin, 2018) to support efficient online retrieval.
- Ranking Stage: A score fusion strategy (Eq. 21 of the original paper) integrates outputs from different tasks. Its components are:
  - Final score: the final relevance score for an item.
  - Component weights: hyperparameters controlling the importance of the respective components, enabling a balance between user experience and business revenue.
  - Predicted click-through rate: produced by OnePiece's final reasoning step (logit of the click task).
  - Predicted click-to-conversion rate: produced by OnePiece's final reasoning step (logit of the order task).
  - a, b: parameters modulating the influence of the click and conversion tasks in the final ranking.
  - price: the item's price information.
  - ecpm: the item's advertising value component (effective cost per mille, i.e., cost per thousand impressions).
6.4.2. Overall Performance - Retrieval Mode
The following are the results from Table 7 of the original paper:
| GMV/UU | GMV(99.5%)/UU | Order/UU | Paid Order/UU | CTCVR | Buyer | Bad Query Rate |
| --- | --- | --- | --- | --- | --- | --- |
| +1.08% | +0.91% | +0.71% | +0.98% | +0.66% | +0.41% | -0.17% |
Table 7 shows the online A/B testing results for OnePiece in retrieval mode, replacing a User-to-Item (U2I) recall route.
- Consistent Business Gains: GMV/UU increases by 1.08%. GMV(99.5%)/UU also improves by 0.91%, indicating that gains are driven by stable, regular transactions, not just occasional high-value orders.
- Improved Conversion: Order/UU rises by 0.71%, and Paid Order/UU increases even faster (0.98%), suggesting higher conversion rates and reduced refunds. Buyer expands by 0.41%, meaning more unique users are completing purchases. CTCVR improves by 0.66%, reflecting a better end-to-end conversion from exposure to transaction.
- Enhanced User Experience: Crucially, Bad Query Rate decreases by 0.17%. This indicates better query relevance and an improved user experience, balancing personalization with relevance. Unlike previous personalized recall strategies that might boost GMV at the expense of relevance, OnePiece achieves balanced improvements.
6.4.3. Overall Performance - Ranking Mode
The following are the results from Table 8 of the original paper:
| GMV/UU | GMV(99.5%)/UU | AR/UU | Order/UU | Buyer | CTR | Bad Query Rate |
| --- | --- | --- | --- | --- | --- | --- |
| +1.12% | +0.65% | +2.90% | +0.08% | +0.08% | +0.29% | +0.21% |
Table 8 summarizes the online A/B testing results for OnePiece deployed in the pre-ranking stage.
- Strong Business Metrics: GMV/UU improves by 1.12%. Most notably, Advertising Revenue per Unique User (AR/UU) shows a substantial boost of 2.90%.
- Utility Translation: Order/UU and Buyer increase marginally (0.08% each), which is consistent with the score fusion function (Eq. 21) designed to translate order-related utility into GMV and advertising gains.
- Engagement and Relevance Trade-off: CTR improves by 0.29%, indicating stronger attractiveness of ranked results. However, Bad Query Rate increases by 0.21%. This minor increase is attributed to more advertising slots potentially introducing items less relevant to direct user interests, but it is outweighed by the substantial revenue gains.

Overall, OnePiece ranking strengthens core business metrics, especially advertising revenue, while achieving a practical trade-off between user experience and business objectives in large-scale industrial systems.
6.4.4. Recall Coverage and Exclusive Contribution
To further evaluate OnePiece in the retrieval stage, its overlap with other recall routes and its exclusive contribution are analyzed.
The following are the results from Table 9 of the original paper:
| Recall Route | STR1 | STR2 | Swing I2I | KPop | S2I |
| --- | --- | --- | --- | --- | --- |
| DLRM | 37.3% | 31.3% | 57.9% | 62.5% | 47.6% |
| OnePiece | 66.2% (+77.6%) | 64.4% (+105.8%) | 76.8% (+32.6%) | 77.2% (+23.5%) | 67.8% (+42.4%) |
Table 9 compares the overlap coverage between DLRM and OnePiece with respect to other recall strategies (e.g., STR1: sparse text recall with user-input keywords; STR2: sparse text recall with rewritten keywords; Swing I2I: graph-based item-to-item personalized recall; KPop: popularity-based recall under keywords; S2I: semantic vector-to-item recall).
- Higher Recall Coverage: OnePiece consistently achieves substantially higher recall coverage across all other recall routes compared to DLRM. For instance, STR2 coverage more than doubles (+105.8%), and STR1 coverage increases by 77.6%. Significant gains are also observed for Swing I2I (+32.6%), KPop (+23.5%), and S2I (+42.4%).
- Potential for Unified Model: These improvements suggest that OnePiece has strong potential to replace multiple specialized recall strategies with a single unified model, effectively balancing personalization, popularity, and relevance.
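The relative gains in parentheses in Table 9 follow from the two coverage rows. A minimal sketch of that arithmetic, using only the rounded values printed in the table (the helper `relative_gain` is our own name):

```python
# Overlap coverage (%) with each recall route as (DLRM, OnePiece), copied from Table 9.
coverage = {
    "STR1": (37.3, 66.2),
    "STR2": (31.3, 64.4),
    "Swing I2I": (57.9, 76.8),
    "KPop": (62.5, 77.2),
    "S2I": (47.6, 67.8),
}


def relative_gain(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


gains = {
    route: round(relative_gain(dlrm, onepiece), 1)
    for route, (dlrm, onepiece) in coverage.items()
}
for route, gain in gains.items():
    print(f"{route:<10} +{gain:.1f}%")
```

Recomputing from the rounded table entries reproduces the reported figures for four routes; STR1 comes out at +77.5% rather than the reported +77.6%, a gap that is consistent with the paper having computed the gain from unrounded coverage values.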
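The figure is a bar chart showing the exclusive contribution of OnePiece relative to DLRM at the impression and click stages. At the impression stage, OnePiece reaches 9.9% versus 3.6% for DLRM; at the click stage, OnePiece reaches 5.7% versus 2.4% for DLRM. OnePiece's contribution is substantially higher at both stages.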
Figure 6 | Exclusive contribution of OnePiece in the retrieval stage.
Figure 6 compares the exclusive contribution of OnePiece and DLRM in terms of impressions and clicks.
- Substantial Unique Contribution: OnePiece delivers a much larger unique contribution than DLRM: its exclusive impression share is 9.9% versus 3.6% for DLRM (about 2.8x), and its exclusive click share is 5.7% versus 2.4% (about 2.4x).
- Novel Impressions and Clicks: This indicates that OnePiece not only covers the exposure of other recall routes but also contributes significantly more novel impressions and clicks that are not captured by traditional DLRM-based recall. In essence, OnePiece more than doubles the independent value of traditional DLRM-based recall, enhancing overall recall performance.
6.4.5. Efficiency Analysis
The following are the results from Table 10 of the original paper:
Retrieval Mode

| Method | Infer. Time ↓ | MFU ↑ | MU ↑ |
| --- | --- | --- | --- |
| DLRM | 40ms/request | 35% | 30% |
| OnePiece | 30ms/request (-25%) | 80% (+129%) | 50% (+67%) |

Ranking Mode (batch size=128, KV-Cache enabled)

| Method | Infer. Time ↓ | MFU ↑ | MU ↑ |
| --- | --- | --- | --- |
| DLRM | 109ms/batch | 23% | 29% |
| OnePiece (M=1) | 110ms/batch (+0.9%) | 67% (+191%) | 38% (+31%) |
| OnePiece (M=4) | 112ms/batch (+2.8%) | | |
| OnePiece (M=8) | 115ms/batch (+5.5%) | | |
| OnePiece (M=12) | 120ms/batch (+10.1%) | | |
Table 10 presents a computational efficiency comparison between DLRM and OnePiece on a single NVIDIA A30 GPU. MFU denotes Model FLOPs Utilization (ratio of achieved FLOPs to theoretical peak), and MU denotes Memory Utilization (percentage of GPU memory occupied).
- Enhanced Hardware Utilization in Retrieval Mode:
  - OnePiece achieves a 25% reduction in inference time (30ms vs. 40ms per request) compared to DLRM.
  - It shows a dramatic 129% increase in MFU (from 35% to 80%) and a 67% increase in MU (from 30% to 50%).
  - This indicates that OnePiece's unified Transformer architecture is more compatible with modern GPU parallelization, effectively leveraging tensor computation units that are often underutilized by DLRM's heterogeneous, embedding-heavy design. The efficiency gains stem from streamlined data flow and reduced memory transfer overhead. This is crucial for reducing operational costs in large-scale industrial deployments.
- Controlled Computational Scaling in Ranking Mode:
  - While OnePiece incurs a modest overhead relative to DLRM at M = 1 (110ms/batch vs. 109ms/batch), its scaling behavior with increasing block size (reasoning capacity) is efficient.
  - Inference time increases from 110ms at M = 1 to 120ms at M = 12, which is only a 10.1% overhead over DLRM for a 12x expansion in block size. The progressive overhead relative to DLRM (0.9%, 2.8%, 5.5%, 10.1%) shows that latency grows far more slowly than block size: going from M = 1 to M = 12 adds 10ms per batch, under 1ms per additional block slot on average.
  - The MFU shows a dramatic 191% improvement (from 23% to 67%) even at M = 1. This indicates that OnePiece's architecture inherently aligns better with GPU computational paradigms, regardless of reasoning complexity. This efficiency is partly due to the KV-Caching mechanism enabling efficient batch processing.
  - This controlled scaling offers a favorable efficiency-performance trade-off: as shown in Table 6, C-AUC improves from 0.885 at M = 1 to 0.927 at M = 12 (a 4.7% relative improvement).

These findings establish OnePiece as a practical solution for production systems, offering configurable reasoning depth that allows flexible trade-offs between computational efficiency and model performance.
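The efficiency-performance trade-off can be tabulated by combining the ranking-mode latencies in Table 10 with the C-AUC column of Table 6. The sketch below is plain arithmetic on those published numbers; the function and field names are our own.

```python
# Ranking-mode latency per batch in milliseconds, copied from Table 10.
DLRM_MS = 109
ONEPIECE_MS = {1: 110, 4: 112, 8: 115, 12: 120}
# C-AUC by block size M, copied from Table 6.
C_AUC = {1: 0.885, 4: 0.913, 8: 0.920, 12: 0.927}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


def tradeoff_rows():
    """One row per block size: latency overhead and C-AUC gain."""
    rows = []
    for m in sorted(ONEPIECE_MS):
        rows.append({
            "M": m,
            "overhead_vs_dlrm_pct": round(pct_change(ONEPIECE_MS[m], DLRM_MS), 1),
            "overhead_vs_m1_pct": round(pct_change(ONEPIECE_MS[m], ONEPIECE_MS[1]), 1),
            "c_auc_gain_vs_m1_pct": round(pct_change(C_AUC[m], C_AUC[1]), 1),
        })
    return rows


for row in tradeoff_rows():
    print(row)
```

Note that the 10.1% figure at M = 12 is measured against DLRM; measured against OnePiece at M = 1 the latency increase is 9.1%, in exchange for a 4.7% relative C-AUC improvement.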
6.5. Attention Visualization Analysis (from Appendix C)
The attention visualization analyses (Figures 8 and 9) provide insights into how OnePiece processes information and performs multi-step reasoning.

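The figure shows case studies of OnePiece attention analysis in (a) retrieval mode and (b) ranking mode. Each sub-plot shows the attention matrix of a particular layer and head, revealing the relationships among input, preference, and scenario signals.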
Figure 8 | OnePiece Attention Analysis in Different Modes. The attention maps visualize the attention weights between different input components: Interaction History (I), Preference Anchor (P), Situational Descriptor (S), and Candidate Item Set (C). In these visualizations, the y-axis represents the Query, while the x-axis represents the Key-Value, corresponding to the attention weight matrix commonly used in Transformer-like architectures.
Attention Analysis of Context Input (Figure 8):
- Layer-wise Evolution: Early layers (e.g., Layer-1 heads in Figure 8(a)-1/4 and 8(b)-1/3) show concentrated or diagonal attention, indicating localized sequential processing within IH tokens or short-span links between SD and CIS. Later layers (e.g., Layer-2 heads in Figure 8(a)-5/8 and 8(b)-5/8) develop multi-region attention patterns, connecting multiple token groups simultaneously, suggesting a transition from localized processing to global integration of information.
- Head-level Specialization: Within the same layer, different attention heads learn specialized roles. Some heads emphasize intra-component coherence (e.g., IH-IH diagonals in Figure 8(b)-7), focusing on relationships within a single input type. Others prioritize cross-component flows (e.g., SD to IH in Figure 8(a)-5/6; CIS to IH in Figure 8(b)-6), indicating integration of information across different input types. This validates that the model develops hierarchical and diversified reasoning strategies.
- Mode-Specific Characteristics:
  - Retrieval Mode (Figure 8(a)): The three-token design (IH, PA, SD) fosters structured and compact cross-component attention. For instance, heads in Layer-2 (e.g., 5, 6) link IH with PA to guide long-term preference recall, while another head (e.g., 8) reinforces SD to IH connections, grounding retrieval in situational context. These interactions remain relatively localized, aligning with retrieval's coarse-grained filtering objective.
  - Ranking Mode (Figure 8(b)): The introduction of CIS tokens fundamentally expands the attention space to four-way interactions. Heads in Layer-2 (e.g., 6, 8) show attention flows spanning IH, PA, SD, and CIS simultaneously, enabling joint evaluation of user preference signals against explicit candidate items. IH tokens maintain temporal sequentiality (e.g., 7), while CIS tokens actively integrate with PA and SD for fine-grained candidate comparison (e.g., 4). This highlights the ranking stage's role in nuanced discrimination.

The figure is a schematic visualizing attention in OnePiece's multi-step block-wise reasoning. The left panel (a) shows retrieval mode with two reasoning steps, R1 and R2; the right panel (b) shows ranking mode with three reasoning steps, R1, R2, and R3. The heatmaps use color intensity to show attention weights between reasoning blocks (y-axis) and the input components plus earlier reasoning outputs (x-axis). The input components are Interaction History (I), Preference Anchors (P), Situational Descriptors (S), and Candidate Items (C, ranking mode only).
Figure 9 | Attention visualization of multi-step block-wise reasoning in OnePiece. The heatmaps show attention weights between reasoning blocks (y-axis, as queries) and input components with previous reasoning outputs (x-axis, as keys and values) for (a) retrieval mode with two reasoning steps (R1, R2) and (b) ranking mode with three reasoning steps (R1, R2, R3). Context Tokens include Interaction History (I), Preference Anchors (P), Situational Descriptors (S), and Candidate Items (C, ranking only).
Attention Analysis of Multi-Step Block-wise Reasoning (Figure 9): This analysis shows how reasoning blocks progressively query different information sources.
- Retrieval Mode (Figure 9(a), two steps R1 and R2):
  - R1 (first reasoning block) exhibits strong concentrated attention on Situational Descriptors (S) and moderate attention on Preference Anchors (P), with minimal attention to Interaction History (I). This indicates that initial reasoning prioritizes contextual and query-specific signals for understanding user intent.
  - R2 (second reasoning block) shows a pivotal shift, developing concentrated attention on specific regions within Interaction History (I) while also incorporating information from the previous reasoning block R1.
  - This demonstrates an evolution from a situational-preference focus to selective behavioral pattern recognition, where progressive reasoning transitions from broad contextual understanding to targeted sequential preference extraction.
- Ranking Mode (Figure 9(b), three steps R1, R2, and R3):
  - The three-step process reveals increasingly sophisticated attention integration with a hierarchical information enhancement pattern.
  - As reasoning progresses, later blocks (R3) demonstrate stronger attention to more recent reasoning outputs (R2), while exhibiting relatively weaker attention to earlier outputs (R1).
  - This suggests that each reasoning step progressively consolidates and refines information from previous steps. More recent reasoning blocks contain higher-level abstractions that effectively subsume earlier insights. The model learns to prioritize the most refined representations, indicating an efficient information compression mechanism where each reasoning step builds upon an increasingly compressed preference understanding to achieve discriminative candidate evaluation.
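Statements such as "R2 attends mostly to Interaction History" amount to summing a query's attention weights over each group of key tokens. The sketch below shows one way such a summary could be computed. The token layout and the weights are invented for illustration and are not read off Figure 9; the helper name is our own.

```python
from collections import defaultdict


def attention_mass_by_group(attn_row, key_groups):
    """Sum one query's attention weights over each key group (I, P, S, R1, ...)."""
    if len(attn_row) != len(key_groups):
        raise ValueError("attn_row and key_groups must have the same length")
    mass = defaultdict(float)
    for weight, group in zip(attn_row, key_groups):
        mass[group] += weight
    return dict(mass)


# Hypothetical key layout for a retrieval-mode R2 query: four history tokens,
# two preference anchors, two situational descriptors, one earlier reasoning token.
key_groups = ["I"] * 4 + ["P"] * 2 + ["S"] * 2 + ["R1"]
# Made-up softmax-normalized weights (they sum to 1); not taken from the paper.
r2_row = [0.30, 0.05, 0.05, 0.20, 0.05, 0.05, 0.05, 0.05, 0.20]

mass = attention_mass_by_group(r2_row, key_groups)
for group, value in sorted(mass.items(), key=lambda kv: -kv[1]):
    print(f"{group:<2} {value:.2f}")
```

With real attention matrices, the same aggregation applied per reasoning block and per head would give a compact numeric counterpart to the heatmaps.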
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces OnePiece, a pioneering unified framework that successfully integrates LLM-style context engineering and multi-step reasoning into industrial cascaded ranking systems. Built upon a pure Transformer backbone, OnePiece incorporates structured context engineering to organize heterogeneous signals (interaction history, preference anchors, situational descriptors, and candidate items) into a tokenized sequence. It equips the model with block-wise latent reasoning capacity for iterative representation refinement and optimizes this process through a progressive multi-task training strategy, leveraging natural user feedback chains.
Extensive offline experiments validate the effectiveness of each design component and demonstrate favorable scaling properties with respect to preference anchor length, training data span, and block size. Crucially, online A/B testing in Shopee's main personalized search scenario confirms significant real-world impact, including GMV/UU gains of +1.08% in retrieval mode and +1.12% in ranking mode, and a +2.90% increase in advertising revenue per unique user (AR/UU). OnePiece also exhibits superior efficiency and hardware utilization compared to traditional baselines. These results position OnePiece as a promising new paradigm for building scalable, reasoning-driven ranking models in real-world industrial environments.
7.2. Limitations & Future Work
The authors identify two promising future research directions:
- Unified Multi-Route Retrieval: Existing industrial multi-route retrieval systems maintain separate models with distinct parameters for various pathways (e.g., I2U, I2I, U2I, Q2I). This is resource-intensive and prone to redundancy. OnePiece's unified architecture, by processing tailored contexts for different retrieval scenarios within a single model, paves the way for "One For All" multi-route retrieval. The empirical evidence from recall coverage and exclusive contribution (Section 5.3) supports the feasibility of such a streamlined system. The authors suggest further exploring how context engineering can be adapted to serve diverse recommendation objectives with a single unified model, reducing system complexity and maintenance overhead.

The figure is a schematic comparing a traditional multi-route retrieval system with the OnePiece-based unified architecture. Part (a) shows that traditional multi-route retrieval must maintain several models with different parameters, while part (b) shows how OnePiece handles different retrieval scenarios with a single model, providing unified retrieval and reducing system complexity.
Figure 7 | Comparison between existing multi-route retrieval systems and OnePiece-based unified architecture. (a) Traditional multi-route retrieval requires maintaining separate models with distinct parameters for different retrieval pathways (I2U, I2I, U2I, Q2I, etc.), each utilizing specialized architectures and storage systems. (b) OnePiece achieves unified multi-route retrieval through a single model that processes tailored prompts for different retrieval scenarios, enabling "One For All" functionality while reducing system complexity and maintenance overhead.

- Scalable Latent Reasoning: While OnePiece represents a successful deployment of latent reasoning at industrial scale, the authors acknowledge inherent limitations in its current reasoning scalability. The primary challenge is obtaining sufficient multi-task signals to effectively supervise intermediate reasoning processes. This constraint limits the ability to scale reasoning capabilities further. Future research should explore more effective methodologies for scaling latent reasoning, such as incorporating online user feedback through reinforcement learning to adaptively determine optimal reasoning depth, or developing organic integration between model self-exploration and multi-task supervision processes to enable autonomous reasoning evolution.
7.3. Personal Insights & Critique
OnePiece presents a significant step forward in industrial recommendation systems by moving beyond mere architectural transplants from LLMs to a more fundamental integration of context engineering and multi-step reasoning.
Innovations and Strengths:
- Principled Adaptation: The core strength lies in its principled approach to adapting LLM mechanisms. By explicitly defining and integrating structured context engineering and block-wise latent reasoning, the paper offers a robust framework that goes beyond superficial application of Transformers.
- Practicality and Impact: The large-scale deployment at Shopee and the consistent online gains are compelling evidence of OnePiece's practical value and readiness for real-world industrial settings. The efficiency analysis (MFU, MU) further underscores its viability.
- Comprehensive Evaluation: The combination of extensive offline experiments (ablation, scaling) and online A/B testing provides strong validation for each component and the overall framework.
- "One For All" Vision: The idea of a unified model for multi-route retrieval is highly appealing. Current systems are often complex ensembles of specialized models. OnePiece's capability to generalize across different recall routes is a game-changer for system simplicity and maintenance.
Potential Issues and Areas for Improvement:
- Complexity of Progressive Multi-Task Loss: While effective, the progressive multi-task training strategy involves designing specific tasks for each reasoning step and carefully managing their supervision. This might require significant domain expertise and empirical tuning for new scenarios or larger numbers of reasoning steps. The paper's formulation provides a general framework, but practical implementation could be intricate.
- Dependency on Preference Anchors (PA): PAs, constructed from domain knowledge (e.g., top-clicked items), are shown to be highly effective. However, the quality and construction of these anchors are crucial. In domains where such rich, structured domain knowledge is scarce or difficult to extract, the PA component's effectiveness might diminish. Further research could explore more automated or self-supervised ways to generate high-quality anchors.
- Interpretability of Latent Reasoning: While block-wise latent reasoning significantly improves performance, the "latent" nature means the intermediate reasoning steps are not directly human-interpretable in natural language, unlike chain-of-thought in LLMs. This might pose challenges in debugging, understanding model failures, or building user trust, particularly in sensitive recommendation contexts. The attention visualizations offer some insight, but a full "why" remains opaque.
- Generalizability beyond E-commerce Search: While proven effective in Shopee's personalized search, the direct applicability and optimal configurations for other domains (e.g., news recommendation, video streaming) or recommendation types (e.g., cold-start scenarios, diverse recommendations) might require further adaptation and validation. The specific task definitions for progressive multi-task training might need to be re-calibrated.
Transferability and Future Value:
The core methodologies of structured context engineering and block-wise latent reasoning are highly transferable.
- Context Engineering: The idea of augmenting raw user data with various contextual cues (preference anchors, situational descriptors) can be applied to almost any sequential modeling task where rich context improves understanding. This could benefit fraud detection, medical diagnosis (integrating patient history with similar cases and current symptoms), or ad targeting.
- Block-Wise Latent Reasoning: The concept of iteratively refining representations in a block-wise manner offers a general mechanism for enhancing complex models beyond recommendation. It could be valuable in any domain requiring multi-step decision-making or hierarchical information processing, such as reinforcement learning agents, time-series forecasting with complex dependencies, or scientific discovery pipelines.
- Progressive Multi-Task Training: This strategy for supervising latent processes via a curriculum of related tasks is a powerful technique for models that perform multi-step computations without explicit human-provided reasoning traces. It could find applications in robotics (training complex motor skills through simpler sub-tasks), drug discovery (optimizing molecule design based on progressively complex biological properties), or any AI system where an internal "thought process" needs to be guided.

OnePiece sets a new standard for integrating advanced AI paradigms into industrial systems, highlighting that the future of recommendation lies not just in bigger models, but in smarter, more context-aware, and reasoning-capable architectures.