MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment

Published: 03/04/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

MAPS leverages LLMs to unify query and consultation embeddings, using Mixture of Attention Experts and dual alignment techniques to enhance motivation-aware personalized search, outperforming existing methods in e-commerce retrieval and ranking tasks.

Abstract

Personalized product search aims to retrieve and rank items that match users' preferences and search intent. Despite their effectiveness, existing approaches typically assume that users' query fully captures their real motivation. However, our analysis of a real-world e-commerce platform reveals that users often engage in relevant consultations before searching, indicating they refine intents through consultations based on motivation and need. The implied motivation in consultations is a key enhancing factor for personalized search. This unexplored area comes with new challenges including aligning contextual motivations with concise queries, bridging the category-text gap, and filtering noise within sequence history. To address these, we propose a Motivation-Aware Personalized Search (MAPS) method. It embeds queries and consultations into a unified semantic space via LLMs, utilizes a Mixture of Attention Experts (MoAE) to prioritize critical semantics, and introduces dual alignment: (1) contrastive learning aligns consultations, reviews, and product features; (2) bidirectional attention integrates motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic data show MAPS outperforms existing methods in both retrieval and ranking tasks.

In-depth Reading

1. Bibliographic Information

1.1. Title

MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment

1.2. Authors

Weicong Qin, Yi Xu, Weijie Yu, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu. Affiliations include Gaoling School of Artificial Intelligence, Renmin University of China, University of International Business and Economics, and AI Lab at Lenovo Research, Lenovo Group Limited, China.

1.3. Journal/Conference

This paper is currently a preprint, published on arXiv. The arXiv platform serves as a repository for preprints of scientific papers, typically prior to peer review or formal publication in a journal or conference. While it provides early access to research, its reputation is that of a preprint server rather than a peer-reviewed venue itself.

1.4. Publication Year

Published at (UTC): 2025-03-03T16:24:36.000Z. The publication year is 2025.

1.5. Abstract

Personalized product search aims to retrieve and rank items matching users' preferences and search intent. Existing methods often assume that user queries fully capture their true motivation. However, the authors' analysis of a real e-commerce platform reveals that users frequently engage in consultations before searching, indicating a refinement of their intentions based on underlying motivations and needs. These implied motivations in consultations are identified as a critical, yet unexplored, factor for enhancing personalized search. The paper addresses challenges associated with this, including aligning contextual motivations with concise queries, bridging the gap between product categories and natural language text, and filtering noise from historical sequences. To overcome these, the authors propose Motivation-Aware Personalized Search (MAPS). MAPS utilizes Large Language Models (LLMs) to embed queries and consultations into a unified semantic space. It employs a Mixture of Attention Experts (MoAE) to prioritize critical semantic information and introduces a dual alignment strategy: (1) contrastive learning to align consultations, reviews, and product features, and (2) bidirectional attention to integrate motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic datasets demonstrate that MAPS significantly outperforms existing methods in both retrieval and ranking tasks.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the inherent limitation of current personalized product search systems: they typically assume that a user's search query fully articulates their underlying needs and motivations. This assumption often falls short in real-world scenarios. In e-commerce, users frequently engage in preliminary consultations (e.g., with AI assistants or customer service) to clarify their needs or gather information before formulating a concise search query. These consultations contain rich, contextual motivation that drives the user's eventual search, but this motivation remains largely uncaptured and unutilized by existing search algorithms.

This problem is important because understanding a user's search motivation (their intrinsic goal or problem they want to solve) leads to more satisfactory and relevant search results than merely matching keywords. By addressing this gap, personalized search systems can move beyond superficial keyword matching to truly anticipate and fulfill user needs.

The paper's innovative idea is to explicitly model this "search motivation" embedded in consultation histories. This unexplored area presents specific challenges:

  1. Alignment with Queries: Consultations are often lengthy and complex natural language descriptions of needs, while queries are concise keywords. Bridging this semantic gap is crucial.
  2. Alignment with Product Features: Products have structured, categorical attributes, but motivations are in free-form text. Aligning these disparate data types is challenging.
  3. Alignment with User History: Not all past consultations are relevant to the current search; filtering noise and identifying pertinent information within a sequence is necessary.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Explicit Modeling of Search Motivation: It is the first work to explicitly define and model "search motivation" by leveraging consultation data within personalized search systems on e-commerce platforms. This highlights the critical role of pre-search consultations in understanding user intent.
  • Novel MAPS Framework: The paper proposes Motivation-Aware Personalized Search (MAPS), a comprehensive model framework designed to integrate LLM knowledge to bridge the gap between categorical ID (identifier) and natural language text embeddings.
    • LLMs are used to embed queries and consultation texts into a unified semantic space.
    • A Mixture of Attention Experts (MoAE) network is introduced to adaptively prioritize critical tokens and extract accurate semantic embeddings from varying text lengths and complexities.
    • Dual Alignment mechanisms are employed:
      • Mapping-based General Alignment: Uses contrastive learning to align consultations, reviews, and product features by establishing keyword-item relationships.
      • Sequence-based Personalized Alignment: Employs bidirectional attention within a transformer encoder to integrate motivation-aware embeddings derived from user consultation and search histories with individual user preferences.
  • Superior Performance: Extensive experiments conducted on both a real-world commercial dataset and a synthetic Amazon dataset demonstrate that MAPS significantly outperforms existing traditional retrieval methods, personalized search methods, and conversational retrieval methods in both retrieval and ranking tasks. This empirical evidence validates the effectiveness and superiority of the proposed approach.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Personalized Product Search: This refers to the task of retrieving and ranking products that are highly relevant not only to a user's explicit query but also to their implicit preferences, historical interactions, and context. It aims to tailor search results to individual users, improving satisfaction and conversion rates in e-commerce.

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are proficient in tasks like text summarization, translation, question answering, and generating embeddings (numerical representations) of text that capture semantic meaning. In this paper, LLMs are used to obtain high-quality token embeddings for natural language inputs, providing rich world knowledge and natural language understanding (NLU) capabilities.

  • Embeddings: In machine learning, an embedding is a low-dimensional, continuous vector representation of discrete data (like words, items, or users) in a continuous vector space. The idea is that semantically similar items will have embeddings that are close to each other in this space. For example, word embeddings represent words as vectors, where words with similar meanings have similar vector representations. Item embeddings represent products, and user embeddings represent users.

  • Attention Mechanism: A core component in modern deep learning architectures, particularly Transformers. The attention mechanism allows a model to weigh the importance of different parts of the input data when processing a sequence. Instead of processing all parts equally, it focuses on the most relevant parts.

    • Scaled Dot-Product Attention: This is a common form of attention. Given Query (Q), Key (K), and Value (V) matrices, the output is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$: A matrix of query vectors. The query is what the attention mechanism is looking for.
      • $K$: A matrix of key vectors. The keys are what the attention mechanism compares the query against.
      • $V$: A matrix of value vectors. The values are what the attention mechanism outputs, weighted by the attention scores.
      • $QK^T$: Dot products between query and key vectors, measuring their similarity.
      • $\sqrt{d_k}$: Scaling factor, where $d_k$ is the dimension of the key vectors. This prevents the dot products from becoming too large, which could push the softmax function into regions with very small gradients.
      • $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution, ensuring the weights sum to 1.
      • The output is a weighted sum of the value vectors, where the weights are determined by the attention scores (the softmax output). A minimal code sketch of this computation appears at the end of this subsection.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns representations by comparing similar and dissimilar pairs of data points. The goal is to learn an embedding space where positive pairs (e.g., different views of the same item, or an item and its associated text) are pulled closer together, while negative pairs (e.g., an item and an unrelated text) are pushed further apart. This helps the model distinguish between relevant and irrelevant information.

  • Transformers: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). They primarily rely on self-attention mechanisms to process sequential data, making them highly effective for Natural Language Processing (NLP) tasks. Unlike Recurrent Neural Networks (RNNs), Transformers can process all parts of a sequence in parallel, making them faster to train on large datasets and better at capturing long-range dependencies. A typical Transformer consists of an encoder stack and a decoder stack. The encoder maps an input sequence to a sequence of continuous representations, and the decoder generates an output sequence based on the encoder's output.

  • Dot Product Similarity: A common way to measure the similarity between two vectors. Given two vectors $\mathbf{a}$ and $\mathbf{b}$, their dot product similarity is defined as $\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos(\theta)$, where $\theta$ is the angle between them. If the vectors are normalized (unit vectors), the dot product directly gives the cosine of the angle, $\cos(\theta)$, which ranges from -1 (opposite) to 1 (identical). A higher dot product indicates higher similarity.

  • Negative Sampling: A technique used in machine learning, particularly for training word embeddings (like Word2Vec) and recommender systems. Instead of computing the loss over all possible negative examples (which can be millions for a large vocabulary or item set), negative sampling randomly selects a small number of negative examples (items or words that are not relevant or co-occurring) to update the model's weights. This makes training much more efficient while still providing a good signal for learning.
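
To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention (the mechanism described in the Attention Mechanism entry of this subsection). The array shapes, function names, and toy inputs are illustrative assumptions, not part of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) query-key similarities
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 for each query
    return weights @ V                   # weighted sum of value vectors

# Toy example: 2 queries attending over 3 key/value vectors of dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```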

3.2. Previous Works

The paper discusses several categories of related work:

  • Traditional Retrieval Algorithms:

    • BM25 (Robertson et al., 2009): A statistical bag-of-words retrieval function that ranks documents based on the presence of query terms, their frequency within documents (Term Frequency), and their rarity across the entire collection (Inverse Document Frequency), along with document length normalization. It primarily focuses on keyword matching and does not inherently support personalization or semantic understanding beyond term overlaps.
  • Dense Retrieval Algorithms:

    • BGE-M3 (Chen et al., 2024): This represents a newer generation of retrieval methods that use deep learning to embed queries and documents into a shared vector space. Retrieval is performed by finding documents whose embeddings are close to the query embedding (e.g., using cosine similarity). BGE-M3 specifically focuses on multi-lingual, multi-functionality, and multi-granularity text embeddings, enhancing retrieval and ranking capabilities by capturing deeper semantic relationships than BM25.
  • Conversational Retrieval Methods:

    • CHIQ (Mo et al., 2024): These methods aim to improve retrieval by incorporating contextual history from conversational interactions. They recognize that queries in a conversation are often ambiguous or incomplete on their own and require understanding the preceding turns to formulate effective search requests. CHIQ specifically tries to enhance query rewriting in conversational search by leveraging context. However, these methods typically focus on query rewriting for better retrieval and do not explicitly model underlying user motivation or deep personalization beyond the immediate conversation.
  • Personalized Search Methods (Focusing on User History & Intent):

    • QEM (Ai et al., 2019a): Query-Embedding Model. Primarily focuses on the direct similarity between the query and items, often using simple embedding matching. It considers basic query-item relevance.
    • DREM (Ai et al., 2019b): Dynamic Relation Embedding Model. Extends basic query-item matching by incorporating dynamic relationships, often for explainable product search.
    • HEM (Ai et al., 2017): Hierarchical Embedding Model. Incorporates user information and search history into a separate user embedding. It learns user preferences from past interactions and combines them with query embeddings to personalize search.
    • AEM (Ai et al., 2019a): Attention-Embedding Model. An attention-based personalized model that combines the user's previously interacted items with the current query, allowing the model to focus on relevant past interactions.
    • ZAM (Ai et al., 2019a): Zero Attention Model. An enhancement to AEM that includes a "zero vector" to the item list, designed to handle cases where no relevant historical items are found.
    • TEM (Bi et al., 2020): Transformer-based Embedding Model. Improves AEM by replacing its attention layer with a Transformer encoder, enabling more sophisticated modeling of user interaction sequences.
    • CoPPS (Dai et al., 2023): Contrastive learning for user sequence representation in Personalized Product Search. Leverages contrastive learning to learn robust user sequence representations, improving personalization by distinguishing between relevant and irrelevant user behaviors.
  • Multi-Scenario Methods (Integrating Search and Recommendation):

    • SESRec (Si et al., 2023): Search and Recommendation System. Uses contrastive learning to learn disentangled search representations specifically for recommendation tasks, aiming to bridge the gap between user search behavior and recommendation.
    • UnifiedSSR (Xie et al., 2023): Unified Sequential Search and Recommendation. A dual-branch network that jointly learns user behavior history across both search and recommendation scenarios, aiming for a unified understanding of user preferences.
    • UniSAR (Shi et al., 2024): Unified Search and Recommendation. Employs Transformers and cross-attention to model different types of fine-grained behavior transitions between search and recommendation, aiming for a more holistic user understanding.

3.3. Technological Evolution

The evolution of personalized search can be traced through several stages:

  1. Early Keyword Matching (e.g., BM25): Focused solely on lexical overlap between queries and documents. No personalization.

  2. Embedding-based Retrieval (e.g., BGE-M3): Moved to dense vector representations for semantic matching, improving relevance beyond exact keywords. Still largely query-centric.

  3. Personalized Search (e.g., HEM, AEM, TEM, CoPPS): Incorporated user history and interaction sequences to tailor results, recognizing that users have individual preferences. These methods primarily model past behaviors and direct query-item relations.

  4. Conversational Search (e.g., CHIQ): Began to consider the context of ongoing user-system dialogue to refine queries, addressing ambiguity in conversational interactions.

  5. Multi-scenario Integration (e.g., SESRec, UnifiedSSR, UniSAR): Attempted to unify search and recommendation contexts, acknowledging that user intent can manifest in different interaction types.

    MAPS represents a further evolution by recognizing a previously unaddressed data source: user consultations. It moves beyond merely observing past interactions or immediate conversational context to actively infer underlying user motivation before the search even begins. This is a significant step towards a more proactive and deeply personalized understanding of user intent.

3.4. Differentiation Analysis

Compared to the main methods in related work, MAPS introduces several core differences and innovations:

  • Novel Data Source: Consultations: The most significant innovation is the explicit utilization of consultation history to extract search motivation. Previous personalized search methods focus on direct query-item interactions or search/recommendation sequences. Conversational retrieval uses dialogue history but primarily for query rewriting. MAPS uniquely posits that consultations reveal a deeper, more refined user motivation that precedes and influences the formal search query.

  • Motivation Modeling: MAPS is the first to explicitly define and model "search motivation" as a distinct concept influencing personalized search, rather than just inferring intent from historical searches or clicks.

  • LLM-Driven Semantic Understanding: While some modern methods use embeddings (like BGE-M3), MAPS leverages the sophisticated Natural Language Understanding (NLU) and world knowledge capabilities of Large Language Models (LLMs) to process the complex, natural language text of consultations and queries. This allows for a richer and more nuanced semantic representation compared to models relying on simpler token embeddings or ID embeddings.

  • Mixture of Attention Experts (MoAE): MAPS introduces MoAE to dynamically select and combine different attention mechanisms, enabling the model to adaptively focus on critical semantics within various types of textual inputs. This is more sophisticated than a single, static attention mechanism typically found in models like AEM or TEM.

  • Dual Alignment Strategy:

    • General Alignment: Addresses the category-text gap and aligns raw text with item IDs and features using contrastive learning based on collected item-related texts (queries, consultations, titles, reviews). This is crucial for grounding LLM representations in the specific e-commerce domain.
    • Personalized Alignment: Integrates motivation from consultation history and query history with the current query and user preferences using bidirectional attention. This dynamic and context-aware integration of long-term and short-term user intent is more comprehensive than methods that only consider search history.
  • Comprehensive Problem Addressing: MAPS directly tackles the three critical challenges identified in its motivation: aligning complex consultations with concise queries, bridging the category-text gap, and filtering noise within historical sequences, which are often overlooked or partially addressed by existing models.

    In essence, MAPS innovates by tapping into a richer, previously ignored source of user intent (consultations), leveraging powerful LLMs for deeper semantic understanding, and designing a multi-faceted alignment strategy to integrate this motivation into a personalized search framework.

4. Methodology

The Motivation-Aware Personalized Search (MAPS) model is designed to enhance personalized product search by explicitly incorporating user consultations to understand their underlying motivations. The overview of MAPS is illustrated in Figure 3 of the original paper. The methodology is structured into three main modules: (1) ID-text representation fusion with LLM, (2) Mapping-based general alignment, and (3) Sequence-based personalized alignment.

The following figure (Figure 3 from the original paper) shows the overall framework of MAPS.

Figure 3: Overview of MAPS. \(\textcircled{1}\) denotes ID-text representation fusion with LLM. \(\textcircled{2}\) denotes the general alignment. \(\textcircled{3}\) denotes the personalized alignment. (Figure alt text, translated: the schematic shows the overall MAPS framework, including ID-text representation fusion via LLM (①), the general alignment module (②), and the personalized alignment module (③), and clearly depicts the mixture-of-attention-experts mechanism and the dual alignment strategy.)

4.1. Principles

The core idea behind MAPS is to leverage the rich, contextual information present in user consultations to infer their true search motivation. This motivation often precedes and informs their concise search queries. By capturing this deeper intent using Large Language Models (LLMs) and carefully aligning it with various data sources (queries, product features, user history), MAPS aims to provide more relevant and personalized search results. The theoretical basis is that LLMs can understand complex natural language, and attention mechanisms can focus on critical parts of this information, while contrastive learning and bidirectional attention can effectively bridge semantic gaps and integrate diverse signals for a holistic user understanding.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. ID-Text Representation Fusion with LLM

This module focuses on representing users, items, queries, and consultations as embeddings (numerical vectors) in a unified space, crucial for the model to understand interactions. It combines both categorical ID features and rich textual features.

Text Representation

Natural language texts from consultations and queries require robust Natural Language Understanding (NLU). Existing personalized product search methods often rely on simple token embeddings or average pooling, which lack world knowledge and the ability to focus on critical semantics. MAPS addresses this by employing pre-trained LLM embeddings and a Mixture of Attention Experts (MoAE) pooling network.

  1. LLM Embedding Initialization: The raw text (e.g., query, consultation message) is first fed into a frozen pre-trained LLM. The LLM generates token embeddings for each word or sub-word unit in the input sequence. These token embeddings capture deep semantic meaning and world knowledge. Unlike standard approaches, no average pooling is performed at this stage; instead, the individual token embeddings $\mathbf{h}_i$ are retained.
  2. Dimension Mapping: To ensure compatibility with various LLMs and to map the token embeddings to a consistent dimension, trainable Feed-Forward Network (FFN) layers are used. These FFNs project the LLM token embeddings into a unified dimension $d_{\mathrm{t}}$.
  3. Mixture of Attention Experts (MoAE) Pooling Network: This framework adaptively assigns weights to tokens to derive a comprehensive text embedding. It includes three types of attention pooling experts:
    • Parameterized Attention Pooling Expert: This expert maintains a learnable parameterized embedding $\mathbf{q}$. It uses $\mathbf{q}$ as a query to compute attention scores over the input token embeddings $\mathbf{h}_i$, which act as keys. The resulting attention scores are used to compute a weighted average of the token embeddings, creating a pooled representation. $ \mathbf{e}_{\mathrm{param}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left( \frac{\mathbf{q}^{\top} (\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}} \right) \mathbf{h}_i $ Where:

      • $\mathbf{e}_{\mathrm{param}}^{\mathrm{pool}}$: The pooled embedding from the parameterized expert.
      • $L$: The length of the input token sequence.
      • $\mathrm{softmax}(\cdot)$: The softmax function, which converts raw scores into a probability distribution (attention weights).
      • $\mathbf{q}$: The parameterized query vector of this expert, which is a learnable embedding.
      • $\mathbf{h}_i$: The embedding of the $i$-th token in the input sequence.
      • $\mathbf{W}^k \in \mathbb{R}^{d_{\mathrm{t}} \times d_{\mathrm{t}}}$: A learnable weight matrix for transforming the token embeddings into key representations.
      • $d_{\mathrm{t}}$: The dimension of the token embeddings and the key vectors.
      • $\mathbf{q}^{\top} (\mathbf{h}_i \mathbf{W}^k)$: The dot-product similarity between the parameterized query and the transformed token key.
      • $\sqrt{d_{\mathrm{t}}}$: A scaling factor used to prevent dot products from becoming too large, stabilizing gradients during training (common in attention mechanisms).
    • Self-Attention Pooling Expert: This expert computes self-attention scores directly from the input token embeddings themselves. Each token's embedding acts as both a query and a key to determine its own importance relative to other tokens in the sequence. $ \mathbf{e}_{\mathrm{self}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left( \frac{(\mathbf{h}_i \mathbf{W}^q)^{\top} (\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}} \right) \mathbf{h}_i $ Where:

      • $\mathbf{e}_{\mathrm{self}}^{\mathrm{pool}}$: The pooled embedding from the self-attention expert.
      • $\mathbf{W}^q \in \mathbb{R}^{d_{\mathrm{t}} \times d_{\mathrm{t}}}$: A learnable weight matrix for transforming token embeddings into query representations.
      • Other symbols are as defined for the parameterized expert. The key difference is that the query here is derived from the token embedding itself, $(\mathbf{h}_i \mathbf{W}^q)$.
    • Search-Centered Cross-Attention Pooling Expert: To ensure that the text embeddings of users, items, and consultations are particularly relevant to the current search task, this expert uses the embedding of the current search query text as the attention query. This forces the expert to focus on tokens that are semantically aligned with the current search intent. $ \mathbf{e}_{\mathrm{cross}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left( \frac{(\mathbf{q}' \mathbf{W}^q)^{\top} (\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}} \right) \mathbf{h}_i $ Where:

      • $\mathbf{e}_{\mathrm{cross}}^{\mathrm{pool}}$: The pooled embedding from the cross-attention expert.
      • $\mathbf{q}'$: The embedding of the current search query text, $\mathbf{e}_s^{\mathrm{text}}$.
      • Other symbols are as defined previously. This expert's query is fixed by the current search query, providing a strong contextual bias.
    • Combining Experts: Each type of expert (parameterized, self-attention, cross-attention) has $N_E$ members (individual experts). A gating network computes scores for all $3N_E$ experts based on the input embeddings. The top $K$ experts with the highest gating scores are activated. The final text embedding is a weighted sum of the pooled embeddings from these $K$ activated experts, where the weights are their gating scores (normalized via softmax). $ \mathbf{e}^{\mathrm{text}} = \sum_{j=1}^{K} \mathrm{gate}_j \, \mathbf{e}_j^{\mathrm{pool}} $ Where:

      • $\mathbf{e}^{\mathrm{text}}$: The final combined text embedding for the input sequence.
      • $\mathrm{gate}_j$: The normalized gating score for the $j$-th activated expert.
      • $\mathbf{e}_j^{\mathrm{pool}}$: The pooled embedding from the $j$-th activated expert.
      • If a user or item has multiple textual features (e.g., item title, description, reviews), their embeddings are concatenated: $\mathbf{e}^{\mathrm{text}} = \mathrm{concat}(\mathbf{e}_{f_1}^{\mathrm{text}}, \ldots, \mathbf{e}_{f_m}^{\mathrm{text}})$, where $m$ is the number of textual features.
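
As an illustration of how the three expert types and the gating network could fit together, the following PyTorch sketch implements a stripped-down MoAE pooling layer with one member per expert type and top-K gating. The class and variable names (MoAEPooling, W_q, W_k, gate), and the choice to condition the gate on the mean token embedding, are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoAEPooling(nn.Module):
    """Simplified Mixture-of-Attention-Experts pooling (illustrative sketch)."""
    def __init__(self, d_t: int, top_k: int = 2):
        super().__init__()
        self.q_param = nn.Parameter(torch.randn(d_t))   # learnable query of the parameterized expert
        self.W_q = nn.Linear(d_t, d_t, bias=False)
        self.W_k = nn.Linear(d_t, d_t, bias=False)
        self.gate = nn.Linear(d_t, 3)                   # one gating score per expert
        self.top_k = top_k
        self.scale = d_t ** 0.5

    def _pool(self, queries, H):
        # queries: (d_t,) or (L, d_t); H: (L, d_t) token embeddings from the frozen LLM.
        scores = (queries @ self.W_k(H).T) / self.scale     # attention logits over tokens
        weights = F.softmax(scores, dim=-1)
        pooled = weights @ H                                # weighted sum of token embeddings
        return pooled.mean(dim=0) if pooled.dim() > 1 else pooled

    def forward(self, H, query_text_emb):
        e_param = self._pool(self.q_param, H)                 # parameterized attention expert
        e_self = self._pool(self.W_q(H), H)                   # self-attention expert
        e_cross = self._pool(self.W_q(query_text_emb), H)     # search-centered cross-attention expert
        experts = torch.stack([e_param, e_self, e_cross])     # (3, d_t)
        gate_logits = self.gate(H.mean(dim=0))                # gate conditioned on the input text
        top_val, top_idx = gate_logits.topk(self.top_k)
        gate_w = F.softmax(top_val, dim=-1)                   # normalize scores of activated experts
        return (gate_w.unsqueeze(-1) * experts[top_idx]).sum(dim=0)

H = torch.randn(12, 32)      # 12 tokens, d_t = 32
q_text = torch.randn(32)     # current search-query text embedding
print(MoAEPooling(32)(H, q_text).shape)  # torch.Size([32])
```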

Categorical ID Representation

For discrete categorical features (e.g., brand ID, category ID), embeddings are obtained by a simple lookup operation. Each unique ID is mapped to a dense vector. $ \mathbf{e}_{g_{id}}^{\mathrm{ID}} = \mathrm{lookup}_g^{\mathrm{ID}}(id) $ Where:

  • $\mathbf{e}_{g_{id}}^{\mathrm{ID}}$: The ID embedding for a specific $id$ within category $g$.
  • $\mathrm{lookup}_g^{\mathrm{ID}}$: The embedding lookup table for category $g$.
  • $id$: The specific categorical identifier. These individual categorical ID embeddings are then concatenated to form an overall ID embedding for a user or item: $\mathbf{e}^{\mathrm{ID}} = \mathrm{concat}(\mathbf{e}_{g_1}^{\mathrm{ID}}, \ldots, \mathbf{e}_{g_n}^{\mathrm{ID}})$, where $n$ is the number of categorical features.

Overall Representations

Finally, the ID embeddings ($\mathbf{e}^{\mathrm{ID}}$) and text embeddings ($\mathbf{e}^{\mathrm{text}}$) are combined to form a comprehensive representation for users, items, queries, and consultations. This is achieved by concatenating them, passing through an FFN (to map to a unified dimension $d_{\mathrm{uni}}$), and applying an activation function. $ \begin{array}{l} \mathbf{e}_u = \mathrm{act}(\mathrm{FFN}_{\mathrm{u}}(\mathrm{concat}(\mathbf{e}_u^{\mathrm{ID}}, \mathbf{e}_u^{\mathrm{text}}))), \\ \mathbf{e}_v = \mathrm{act}(\mathrm{FFN}_{\mathrm{v}}(\mathrm{concat}(\mathbf{e}_v^{\mathrm{ID}}, \mathbf{e}_v^{\mathrm{text}}))), \\ \mathbf{e}_s = \mathrm{act}(\mathrm{FFN}_{\mathrm{s}}(\mathbf{e}_s^{\mathrm{text}})), \\ \mathbf{e}_c = \mathrm{act}(\mathrm{FFN}_{\mathrm{c}}(\mathbf{e}_c^{\mathrm{text}})). \end{array} $ Where:

  • $\mathbf{e}_u, \mathbf{e}_v, \mathbf{e}_s, \mathbf{e}_c$: The overall embeddings for user $u$, item $v$, query $s$, and consultation $c$, respectively.
  • $\mathrm{concat}(\cdot)$: Concatenation operation.
  • $\mathrm{FFN}_k(\cdot)$: A Feed-Forward Network specific to the entity type $k \in \{u, v, s, c\}$. These FFNs project the concatenated features into a unified embedding dimension $d_{\mathrm{uni}}$.
  • $\mathrm{act}(\cdot)$: An activation function (e.g., tanh, ReLU, GELU) applied after the FFN.
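
A minimal sketch of this fusion step for an item, assuming two categorical features (brand and category IDs) and a single pooled text embedding; the module name ItemRepresentation and all dimensions are hypothetical, and tanh follows the default activation reported in the implementation details.

```python
import torch
import torch.nn as nn

class ItemRepresentation(nn.Module):
    """Fuses categorical ID embeddings and a text embedding into one vector (sketch)."""
    def __init__(self, n_brands, n_categories, d_id=16, d_text=32, d_uni=64):
        super().__init__()
        self.brand_emb = nn.Embedding(n_brands, d_id)       # lookup table for brand IDs
        self.cate_emb = nn.Embedding(n_categories, d_id)    # lookup table for category IDs
        self.ffn = nn.Linear(2 * d_id + d_text, d_uni)      # maps concat(e_ID, e_text) to d_uni
        self.act = nn.Tanh()

    def forward(self, brand_id, cate_id, e_text):
        e_id = torch.cat([self.brand_emb(brand_id), self.cate_emb(cate_id)], dim=-1)
        return self.act(self.ffn(torch.cat([e_id, e_text], dim=-1)))

item_repr = ItemRepresentation(n_brands=100, n_categories=50)
e_v = item_repr(torch.tensor([3]), torch.tensor([7]), torch.randn(1, 32))
print(e_v.shape)  # torch.Size([1, 64])
```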

4.2.2. Mapping-Based General Alignment

This module aims to align tokens (words/phrases from text) with items in a unified semantic space, addressing the category-text gap. This "general alignment" allows the model to understand which features, IDs, and items correspond to various natural language texts.

  1. Full-Text Collection for Items: For each item $v$, all relevant textual data (e.g., related queries, consultations, item titles, descriptive texts, advertisement texts) across multiple scenarios are gathered to construct a comprehensive full-text collection $\mathcal{A}_v$. This collection essentially aggregates all linguistic contexts in which an item appears.
  2. Keyword Filtering: To reduce noise and focus on critical terms, the full-text collection is refined. A threshold $t$ is applied to filter out noise texts (words $w$) that appear infrequently in search-related scenarios. Only words with a frequency greater than $t$ are retained. $ \mathcal{A}_v^S = \mathrm{filter}^S(\mathcal{A}_v) = \{ w \in \mathcal{A}_v \mid \mathrm{freq}^S(w) > t \} $ Where:
    • $\mathcal{A}_v^S$: The filtered keyword collection for item $v$.
    • $\mathrm{filter}^S(\cdot)$: The filtering function.
    • $\mathcal{A}_v$: The original full-text collection for item $v$.
    • $w$: A word or token.
    • $\mathrm{freq}^S(w)$: The frequency of word $w$ in search-related scenarios.
    • $t$: A predefined frequency threshold. This curated collection $\mathcal{A}_v^S$ creates a mapping $M$ where each token $t \in \mathcal{A}_v^S$ is associated with item $v$ due to their shared presence or thematic relevance.
  3. Bidirectional Contrastive Loss ($\mathcal{L}_{\mathrm{GA}}$): To learn effective alignments between tokens and items, a bidirectional contrastive loss is employed. This loss function pushes embeddings of related token-item pairs (t, v) closer while separating them from negative samples (unrelated tokens or items). $ \mathcal{L}_{\mathrm{GA}} = - \lambda_1 \sum_{(t,v)} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_v)/\tau_1)}{\sum_{t^- \in T_{\mathrm{neg}}} \exp(\mathrm{sim}(\mathbf{e}_{t^-}, \mathbf{e}_v)/\tau_1)} - \lambda_2 \sum_{(t,v)} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_v)/\tau_2)}{\sum_{v^- \in I_{\mathrm{neg}}} \exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_{v^-})/\tau_2)} $ Where:
    • $\mathcal{L}_{\mathrm{GA}}$: The general alignment loss.
    • $\lambda_1, \lambda_2$: Weights for the two terms (tuned as hyperparameters; see Section 5.4), balancing the importance of token-to-item and item-to-token alignment.
    • $\sum_{(t,v)}$: Summation over all positive token-item pairs.
    • $\mathrm{sim}(\cdot, \cdot)$: The dot-product similarity function between two embeddings.
    • $\mathbf{e}_t, \mathbf{e}_v$: The embeddings of token $t$ and item $v$, respectively.
    • $\tau_1, \tau_2$: Temperature parameters that control the sharpness of the softmax distribution. Higher values make the distribution smoother, while lower values make it peak more sharply around the most similar example.
    • $T_{\mathrm{neg}}$: A set of randomly sampled negative tokens (unrelated to $v$). The term $\sum_{t^- \in T_{\mathrm{neg}}}$ encourages $\mathbf{e}_v$ to be more similar to $\mathbf{e}_t$ than to any negative token.
    • $I_{\mathrm{neg}}$: A set of randomly sampled negative items (unrelated to $t$). The term $\sum_{v^- \in I_{\mathrm{neg}}}$ encourages $\mathbf{e}_t$ to be more similar to $\mathbf{e}_v$ than to any negative item. This bidirectional contrastive loss ensures that embeddings for positive token-item pairs are close and embeddings for negative pairs are far apart, establishing a robust general alignment.
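
The bidirectional contrastive loss can be sketched with in-batch negatives (the sampling strategy mentioned in the implementation details). The function name and the use of cross-entropy over a similarity matrix are illustrative assumptions; the paper's exact construction of T_neg and I_neg may differ.

```python
import torch
import torch.nn.functional as F

def general_alignment_loss(e_tokens, e_items, lambda1=0.5, lambda2=0.5, tau1=0.1, tau2=0.1):
    """Bidirectional contrastive loss over aligned (token, item) pairs (illustrative sketch).

    e_tokens, e_items: (B, d) embeddings where row i of each forms a positive pair;
    the other rows of the batch act as in-batch negatives.
    """
    sim = e_tokens @ e_items.T                      # (B, B) dot-product similarities
    labels = torch.arange(sim.size(0))
    # item -> token direction: each item should match its own token against negative tokens
    loss_i2t = F.cross_entropy(sim.T / tau1, labels)
    # token -> item direction: each token should match its own item against negative items
    loss_t2i = F.cross_entropy(sim / tau2, labels)
    return lambda1 * loss_i2t + lambda2 * loss_t2i

e_t = F.normalize(torch.randn(8, 64), dim=-1)
e_v = F.normalize(torch.randn(8, 64), dim=-1)
print(general_alignment_loss(e_t, e_v))
```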

4.2.3. Sequence-Based Personalized Alignment

This module focuses on extracting search motivations from a user's consultation and search histories and aligning them with the current query to enhance personalized search.

Motivation-Aware Query Embedding

The core idea is to enrich the current query embedding $\mathbf{e}_{s_{N+1}}$ by incorporating historical context.

  1. Consultation History Integration: The current query embedding $\mathbf{e}_{s_{N+1}}$ acts as an anchor. It is concatenated with the user's consultation history (a sequence of consultation embeddings $[\mathbf{e}_{c_1}, \dots, \mathbf{e}_{c_M}]$). This combined sequence is then fed into a Transformer encoder. The Transformer encoder, with its multi-head bidirectional attention and FFN layers, processes this sequence to extract a motivation embedding from the consultation history that is relevant to the current query. The first vector of the encoder's output, corresponding to the anchored current query, is chosen as the motivation-aware embedding from consultations. $ \mathbf{e}_{s_{N+1}}^{\mathcal{C}} = \mathrm{Encoder}_{\mathrm{c}}(\mathbf{e}_{s_{N+1}}, \mathbf{e}_{c_1}, \dots, \mathbf{e}_{c_M})[0,:] $ Where:

    • $\mathbf{e}_{s_{N+1}}^{\mathcal{C}}$: The search motivation embedding derived from the consultation history for the current query $s_{N+1}$.
    • $\mathrm{Encoder}_{\mathrm{c}}(\cdot)$: A Transformer encoder specifically designed for processing consultation sequences.
    • $\mathbf{e}_{s_{N+1}}$: The embedding of the current query.
    • $\mathbf{e}_{c_1}, \dots, \mathbf{e}_{c_M}$: The embeddings of the $M$ consultation sessions in the user's history.
    • [0, :]: An operation that selects the first vector from the Transformer encoder's output sequence. This vector summarizes the context, anchored by the initial input (here, the current query).
  2. Query History Integration: Similarly, the user's historical query sequence $[\mathbf{e}_{s_1}, \dots, \mathbf{e}_{s_M}]$ is also processed together with the current query embedding through another Transformer encoder. This captures search motivation derived from past search behaviors. $ \mathbf{e}_{s_{N+1}}^{S} = \mathrm{Encoder}_{\mathrm{s}}(\mathbf{e}_{s_{N+1}}, \mathbf{e}_{s_1}, \ldots, \mathbf{e}_{s_M})[0,:] $ Where:

    • $\mathbf{e}_{s_{N+1}}^{S}$: The search motivation embedding derived from the query history for the current query $s_{N+1}$.
    • $\mathrm{Encoder}_{\mathrm{s}}(\cdot)$: A Transformer encoder for processing search query sequences.
    • $\mathbf{e}_{s_1}, \ldots, \mathbf{e}_{s_M}$: The embeddings of the $M$ historical search queries.
  3. Combined Motivation-Aware Query Embedding: The three relevant embeddings, namely the consultation-derived motivation ($\mathbf{e}_{s_{N+1}}^{\mathcal{C}}$), the query-history-derived motivation ($\mathbf{e}_{s_{N+1}}^{S}$), and the current query embedding ($\mathbf{e}_{s_{N+1}}$), are combined using learnable weights. $ \mathbf{e}_{s_{N+1}}' = \alpha_1 \mathbf{e}_{s_{N+1}}^{\mathcal{C}} + \alpha_2 \mathbf{e}_{s_{N+1}}^{S} + \alpha_3 \mathbf{e}_{s_{N+1}} $ Where:

    • $\mathbf{e}_{s_{N+1}}'$: The combined motivation-aware query embedding.
    • $\alpha_1, \alpha_2, \alpha_3$: Learnable scalar weights that determine the contribution of each component to the final query embedding. These weights are learned during training.
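
A simplified PyTorch sketch of the motivation-aware query embedding: the current query is prepended as an anchor to each history sequence, each sequence passes through a small Transformer encoder, the first output position is taken as the motivation embedding, and the three vectors are mixed with learnable weights. The use of nn.TransformerEncoder, the dimensions, and the class name are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MotivationAwareQuery(nn.Module):
    """Builds the motivation-aware query embedding e'_{s_{N+1}} (illustrative sketch)."""
    def __init__(self, d_uni=64, n_heads=4):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_uni, nhead=n_heads, batch_first=True),
            num_layers=1)
        self.enc_consult = make_encoder()              # Encoder_c over [query; consultations]
        self.enc_search = make_encoder()               # Encoder_s over [query; past queries]
        self.alpha = nn.Parameter(torch.ones(3) / 3)   # learnable combination weights

    def forward(self, e_query, e_consults, e_past_queries):
        # e_query: (B, d); histories: (B, M, d). The query is prepended as an anchor token.
        seq_c = torch.cat([e_query.unsqueeze(1), e_consults], dim=1)
        seq_s = torch.cat([e_query.unsqueeze(1), e_past_queries], dim=1)
        e_c = self.enc_consult(seq_c)[:, 0, :]         # first position = motivation from consultations
        e_s = self.enc_search(seq_s)[:, 0, :]          # first position = motivation from past queries
        a = self.alpha
        return a[0] * e_c + a[1] * e_s + a[2] * e_query

model = MotivationAwareQuery()
out = model(torch.randn(2, 64), torch.randn(2, 5, 64), torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 64])
```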

Personalized Search with Item History

This final step refines the motivation-aware query embedding and integrates it with the user's past interacted items to form the ultimate personalized query representation for ranking.

  1. Final Query Embedding: The combined motivation-aware query embedding $\mathbf{e}_{s_{N+1}}'$ and the embeddings of the user's past interacted items $\mathbf{E}_{\mathrm{items}}$ are fed into a final Transformer encoder. This captures complex interactions between the current motivated query and the user's overall item interaction history. The first output vector of this encoder is then added element-wise with the user's embedding $\mathbf{e}_u$ to obtain the final personalized query embedding. $ \mathbf{e}_{s_{N+1}}'' = \mathrm{Encoder}_{\mathrm{final}}(\mathbf{e}_{s_{N+1}}', \mathbf{E}_{\mathrm{items}})[0,:] \oplus \mathbf{e}_u $ Where:

    • $\mathbf{e}_{s_{N+1}}''$: The final, fully personalized and motivation-aware query embedding.
    • $\mathrm{Encoder}_{\mathrm{final}}(\cdot)$: The Transformer encoder that processes the motivation-aware query embedding and historical item embeddings.
    • $\mathbf{E}_{\mathrm{items}}$: A matrix or sequence of embeddings representing the items with which the user has previously interacted.
    • $\oplus$: Denotes in-place (element-wise) addition of the user embedding $\mathbf{e}_u$ to the first output vector of the Transformer encoder.
  2. Ranking Probability: For inference, candidate items are ranked based on their similarity to this final personalized query embedding. The similarity is calculated with the dot product. $ p(v \mid s_{N+1}, H, u) = \mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_v) $ Where:

    • $p(v \mid s_{N+1}, H, u)$: The relevance score of item $v$ given the current query $s_{N+1}$, user history $H$, and user profile $u$.
    • $\mathrm{sim}(\cdot, \cdot)$: The dot-product similarity function.
    • $\mathbf{e}_v$: The embedding of candidate item $v$.
  3. Personalized Alignment Loss ($\mathcal{L}_{\mathrm{PA}}$): The model is optimized to increase the relevance scores of ground-truth (correctly interacted) items. This is formulated as a softmax cross-entropy loss over candidate items, where the goal is to maximize the similarity between the personalized query embedding and the actually interacted item. $ \mathcal{L}_{\mathrm{PA}} = - \sum_{(u, v, s_{N+1})} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_v))}{\sum_{v' \in V_{N+1}} \exp(\mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_{v'}))} $ Where:

    • $\mathcal{L}_{\mathrm{PA}}$: The personalized alignment loss.
    • $\sum_{(u,v,s_{N+1})}$: Summation over training instances, each consisting of a user $u$, a ground-truth interacted item $v$, and a current query $s_{N+1}$.
    • $V_{N+1}$: The set of candidate items for ranking, which includes the ground-truth item and negative samples (unrelated items).
    • The numerator is the similarity score for the correct item $v$, and the denominator sums similarities over all candidate items $v'$, normalizing the score into a probability. Negative sampling is typically employed to make the denominator computation efficient.
  4. Overall Loss ($\mathcal{L}_{\mathrm{overall}}$): The total loss function combines the personalized alignment loss and the general alignment loss, along with L2 regularization to prevent overfitting. $ \mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{PA}} + \lambda_3 \mathcal{L}_{\mathrm{GA}} + \lambda_4 \|\Theta\|_2 $ Where:

    • $\mathcal{L}_{\mathrm{overall}}$: The total objective function to be minimized.
    • $\lambda_3$: A hyper-parameter balancing the contribution of the general alignment loss.
    • $\lambda_4$: A hyper-parameter controlling the strength of L2 regularization.
    • $\|\Theta\|_2$: The L2 norm of all learnable parameters $\Theta$ in the MAPS model, which penalizes large parameter values and helps prevent overfitting.
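
The final scoring and training objective can be sketched as follows: the motivation-aware query is encoded together with the item history, the user embedding is added to the anchor output, candidates are scored by dot product, and a sampled-softmax cross-entropy ($\mathcal{L}_{\mathrm{PA}}$) is combined with $\mathcal{L}_{\mathrm{GA}}$ and an L2 term. Function names, toy shapes, and the off-the-shelf nn.TransformerEncoder are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def personalized_scores(e_query_prime, e_hist_items, e_user, e_candidates, encoder_final):
    """Final query embedding e''_{s_{N+1}} and dot-product scores over candidate items (sketch)."""
    seq = torch.cat([e_query_prime.unsqueeze(1), e_hist_items], dim=1)   # (B, 1+H, d), query as anchor
    e_final = encoder_final(seq)[:, 0, :] + e_user                       # first output vector + user embedding
    return torch.einsum('bd,bcd->bc', e_final, e_candidates)             # (B, C) dot-product similarities

def overall_loss(scores, ga_loss, params, lambda3=0.1, lambda4=1e-5):
    """L_overall = L_PA + lambda3 * L_GA + lambda4 * ||Theta||_2 (sketch).

    scores: (B, 1 + n_neg); column 0 holds the ground-truth item, the rest are sampled negatives.
    Minimizing the cross-entropy maximizes the softmax probability of the ground-truth item.
    """
    targets = torch.zeros(scores.size(0), dtype=torch.long)
    loss_pa = F.cross_entropy(scores, targets)
    l2 = sum(p.pow(2).sum() for p in params)
    return loss_pa + lambda3 * ga_loss + lambda4 * l2

# Toy shapes: batch of 2, history of 6 items, 1 positive + 10 negative candidates, d = 64.
B, H, C, d = 2, 6, 11, 64
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=1)
scores = personalized_scores(torch.randn(B, d), torch.randn(B, H, d),
                             torch.randn(B, d), torch.randn(B, C, d), enc)
print(overall_loss(scores, ga_loss=torch.tensor(0.5), params=enc.parameters()))
```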

5. Experimental Setup

5.1. Datasets

To evaluate the effectiveness of MAPS, experiments are conducted on two datasets: a commercial dataset and a synthetic Amazon dataset.

  • Commercial Dataset:

    • Source: A real user interaction dataset from an internet e-commerce shopping platform that provides AI consulting services.
    • Characteristics: Contains user interactions over 31 days.
    • Preprocessing: Users and items with fewer than 5 interactions are filtered out. To prevent sequence data leakage (where future information inadvertently influences past predictions), the first 29 days are used for training, and the remaining two days are split for validation and testing, respectively. This split ensures that the model is evaluated on unseen, future data.
    • Domain: E-commerce platform with AI consulting services.
  • Amazon Dataset:

    • Source: Based on the widely used Amazon Reviews dataset (Ni et al., 2019), specifically the version processed by PersonalWAB (Cai et al., 2024).

    • Characteristics: Includes user profiles and various types of user interaction data, such as searches and reviews.

    • Consultation Simulation: To mimic the real-world e-commerce environment with AI consultation services (which the original Amazon dataset lacks), GPT-4o (an OpenAI GPT-4 series model) was utilized to generate synthetic user consultation texts. These texts are generated based on existing user profiles and interaction behaviors, aiming to simulate realistic consultation scenarios.

    • Preprocessing: Processing and splitting are consistent with Shi et al. (2024).

    • Domain: E-commerce (Amazon products and reviews).

      The following are the statistics from Table 1 of the original paper:

      Dataset      #Users   #Items   #Inters (search / consult.)   #Sparsity (search / consult.)
      Commercial   2096     2691     24662 / 18774                 99.56% / 99.66%
      Amazon       967      35772    7263 / 40567                  99.98% / 99.88%

Where:

  • #Users: Number of unique users in the dataset.

  • #Items: Number of unique items (products) in the dataset.

  • #Inters: Total number of interactions. The numbers outside parentheses indicate search interactions, while the numbers in parentheses indicate consultation interactions.

  • #Sparsity: A measure of how few interactions there are compared to the total possible interactions (#Users * #Items). Higher sparsity indicates fewer observed interactions relative to the potential, making it harder to learn user preferences. The numbers outside parentheses indicate search interaction sparsity, while those in parentheses indicate consultation interaction sparsity.

    The datasets were chosen because they provide real token text and multiple types of user interaction data, which are essential for validating MAPS's ability to model search motivations and leverage LLM knowledge. The commercial dataset offers a direct real-world application, while the Amazon dataset provides a widely used benchmark with simulated consultations to test the core ideas.

The following figure (Figure 5 from the original paper) shows examples of consultations on the Amazon dataset.

Figure 5: Examples of consultations on the Amazon dataset. (Figure alt text, translated: the image shows consultation examples; the left side presents a user's specific questions about electronics accessories together with the system's answers, and the right side shows a question-and-answer exchange about privacy-related item needs, illustrating the system's understanding of user motivation and its ability to make targeted recommendations.)

This figure illustrates examples of consultations from the Amazon dataset. The left panel shows user inquiries about electronic accessories ("Need an adapter that can convert USB-C to HDMI..." and "Looking for durable, waterproof headphones for running..."). The right panel shows a discussion about privacy-related projects ("How can I ensure data privacy in my AI project..."). For each user question, a system response (presumably generated or simulated) is provided, demonstrating how consultations can reveal detailed user needs and motivations that would be hard to capture in a short search query.

5.2. Evaluation Metrics

The experiments employ standard metrics for both ranking and retrieval tasks.

Ranking Metrics

For evaluating ranking performance, two common metrics are used: Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k). These metrics are calculated by pairing the ground-truth item with 99 randomly sampled negative items as candidates.

  1. Hit Ratio (HR@k):

    • Conceptual Definition: Hit Ratio measures the proportion of queries for which the ground-truth (correct) item appears within the top kk ranked results. It's a simple, intuitive metric that indicates how often the desired item is found among the top recommendations.
    • Mathematical Formula: $ \mathrm{HR@k} = \frac{\text{Number of users for whom the ground-truth item is in top } k}{\text{Total number of users}} $ This can also be seen as: $ \mathrm{HR@k} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{ground-truth item for } u \text{ is in top } k \text{ list}) $
    • Symbol Explanation:
      • $|U|$: The total number of users (or queries) being evaluated.
      • $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
      • $k$: The cut-off rank (e.g., 5, 10, 20, 50), meaning we check whether the item is within the top $k$ positions.
  2. Normalized Discounted Cumulative Gain (NDCG@k):

    • Conceptual Definition: NDCG is a metric that evaluates the quality of a ranked list by considering both the relevance of items and their position in the list. It assigns higher scores to relevant items that appear earlier in the ranking. The "discounted" part means that relevant items further down the list contribute less to the overall score. "Normalized" means the score is compared against an ideal ranking (where all relevant items are at the top) to get a value between 0 and 1.
    • Mathematical Formula: First, Cumulative Gain (CG@k): $ \mathrm{CG@k} = \sum_{i=1}^{k} \mathrm{rel}_i $ Then, Discounted Cumulative Gain (DCG@k): $ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Finally, Normalized Discounted Cumulative Gain (NDCG@k): $ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $ Where IDCG@k is the DCG of the ideal ranking: $ \mathrm{IDCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_{i,\text{ideal}}} - 1}{\log_2(i+1)} $
    • Symbol Explanation:
      • $k$: The cut-off rank.
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the ranked list. In a binary relevance setting (item is either relevant or not), $\mathrm{rel}_i$ is typically 1 if the item is the ground-truth, and 0 otherwise.
      • $\log_2(i+1)$: The logarithmic discount factor.
      • $\mathrm{DCG@k}$: The Discounted Cumulative Gain at rank $k$.
      • $\mathrm{IDCG@k}$: The Ideal Discounted Cumulative Gain at rank $k$, obtained by sorting all relevant items by their relevance.
      • $\mathrm{rel}_{i,\text{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfectly sorted) ranked list.
      • The final NDCG@k for a system is typically the average NDCG@k across all queries.

Retrieval Metric

For evaluating retrieval performance, Mean Reciprocal Rank (MRR@k) is used. For retrieval tasks, all items are considered as candidates.

  1. Mean Reciprocal Rank (MRR@k):
    • Conceptual Definition: MRR is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. It measures the average of the reciprocal ranks of the first relevant item across a set of queries. If the first relevant item is found at rank $r$, its reciprocal rank is $1/r$. If no relevant item is found within the top $k$ results, the reciprocal rank is 0.
    • Mathematical Formula: $ \mathrm{MRR@k} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\mathrm{rank}_q} \quad \text{if } \mathrm{rank}_q \le k \text{, else 0} $
    • Symbol Explanation:
      • $|Q|$: The total number of queries.
      • $\mathrm{rank}_q$: The rank position of the first relevant item for query $q$.
      • $k$: The cut-off rank, beyond which relevant items are not considered.
      • $\frac{1}{\mathrm{rank}_q}$: The reciprocal rank for query $q$.
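
Given the 1-indexed rank of the ground-truth item for each test query, the three metrics above reduce to short formulas. The following NumPy sketch (with hypothetical function names) computes HR@k, binary-relevance NDCG@k with a single relevant item per query, and MRR@k.

```python
import numpy as np

def hr_at_k(ranks, k):
    """Fraction of queries whose ground-truth item is ranked within the top k."""
    ranks = np.asarray(ranks)          # 1-indexed rank of the ground-truth item per query
    return float(np.mean(ranks <= k))

def ndcg_at_k(ranks, k):
    """Binary-relevance NDCG@k with one ground-truth item: IDCG = 1, DCG = 1/log2(rank+1)."""
    ranks = np.asarray(ranks)
    gains = np.where(ranks <= k, 1.0 / np.log2(ranks + 1), 0.0)
    return float(np.mean(gains))

def mrr_at_k(ranks, k):
    """Mean reciprocal rank of the first relevant item, cut off at k."""
    ranks = np.asarray(ranks)
    rr = np.where(ranks <= k, 1.0 / ranks, 0.0)
    return float(np.mean(rr))

# Example: ground-truth items ranked 1st, 3rd, and 12th for three test queries.
ranks = [1, 3, 12]
print(hr_at_k(ranks, 10), ndcg_at_k(ranks, 10), mrr_at_k(ranks, 10))
```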

5.3. Baselines

The MAPS method is compared against a comprehensive set of baselines across different categories:

  1. Personalized Search Baselines: These models primarily focus on integrating user information and search history for personalized ranking.

    • AEM (Ai et al., 2019a): Attention-Embedding Model, combines user's past interacted items with the current query using attention.
    • QEM (Ai et al., 2019a): Query-Embedding Model, focuses on matching scores between items and queries only.
    • HEM (Ai et al., 2017): Hierarchical Embedding Model, a personalized model based on latent vectors.
    • ZAM (Ai et al., 2019a): Zero Attention Model, an extension of AEM with a zero vector for item lists.
    • TEM (Bi et al., 2020): Transformer-based Embedding Model, improves AEM by using a Transformer encoder.
    • CoPPS (Dai et al., 2023): Contrastive learning for user sequence representation in Personalized Product Search, leverages contrastive learning.
  2. Multi-Scenario Baselines (Integrating Search & Recommendation): These models aim to combine insights from both search and recommendation interactions.

    • SESRec (Si et al., 2023): Uses contrastive learning to learn disentangled search representations for recommendation.
    • UnifiedSSR (Xie et al., 2023): Unified Sequential Search and Recommendation, jointly learns user behavior history across both scenarios.
    • UniSAR (Shi et al., 2024): Unified Search and Recommendation, models fine-grained behavior transitions using Transformers and cross-attention.
  3. Retrieval Baselines: These models are specifically designed for efficient item retrieval.

    • BM25 (Robertson et al., 2009): A traditional term-frequency based statistical retrieval algorithm.

    • BGE-M3 (Chen et al., 2024): A modern dense retrieval algorithm using embeddings for multi-lingual, multi-functionality, multi-granularity text.

    • CHIQ (Mo et al., 2024): Contextual History Enhancement for Improving Query Rewriting in Conversational Search, a conversational retrieval method incorporating world knowledge from LLMs.

      These baselines are representative as they cover a wide spectrum of approaches, from traditional statistical methods to modern deep learning models, including those focused on personalization, multi-scenario learning, and conversational context, allowing for a comprehensive evaluation of MAPS's innovations.

5.4. Implementation details

  • Embedding Dimensions: The latent embedding dimension $d$ is set to 64. The unified dimension $d_{\mathrm{t}}$ for LLM token embeddings (after FFN mapping) is set to 32.
  • User History Length: The maximum length of the user history sequence (for both consultations and queries) is set to 30.
  • User Filtering: Users with fewer than 5 interactions are filtered out, a common practice to ensure sufficient data for learning user preferences.
  • Activation Function: tanh is used as the default activation function in the FFN layers (act in Eq. 1).
  • Transformer Layers: The number of layers in the Transformer encoder modules is initially set to 1.
  • Batch Size: The batch size for training is 72. For $\mathcal{L}_{\mathrm{GA}}$, an in-batch negative sampling strategy is adopted, and its batch size is searched among {128, 256, 512, 1024}.
  • Negative Samples: For $\mathcal{L}_{\mathrm{PA}}$, 10 negative samples are used for each positive sample.
  • Hyperparameter Tuning:
    • Weights for the general alignment loss $(\lambda_1, \lambda_2)$ are tuned in $\{(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0)\}$.
    • The overall-loss weight $\lambda_3$ (balancing $\mathcal{L}_{\mathrm{PA}}$ and $\mathcal{L}_{\mathrm{GA}}$) is tuned in $\{0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5\}$.
    • Temperature parameters $\tau_1, \tau_2$ (for contrastive learning in $\mathcal{L}_{\mathrm{GA}}$) are tuned in the interval [0.0, 1.0] with a step size of 0.1.
  • Training: All models are trained for 100 epochs. Early stopping is employed to prevent overfitting, meaning training stops if performance on the validation set does not improve for a certain number of epochs.
  • Optimizer: Adam (Kingma and Ba, 2014) is used for optimization.
  • Learning Rate: The learning rate is tuned among {1e-3, 5e-4, 1e-4, 5e-5, 1e-5}.
  • Hardware: All experiments were conducted on an A800 GPU (80GB).
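As a minimal sketch (not the authors' released code), the settings above can be collected into a small configuration object, and the in-batch-negative contrastive term can be written as a standard InfoNCE loss. Names such as `MAPSConfig`, `in_batch_info_nce`, and `total_loss` are hypothetical, and the exact way $\lambda_3$ combines $\mathcal{L}_{\mathrm{PA}}$ and $\mathcal{L}_{\mathrm{GA}}$ is an assumption inferred from the description above.

```python
# Illustrative configuration and loss sketch; values mirror the reported settings.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class MAPSConfig:
    d: int = 64                # latent ID-embedding dimension
    d_t: int = 32              # unified dimension for mapped LLM token embeddings
    max_history: int = 30      # maximum consultation/query history length
    batch_size: int = 72       # training batch size
    ga_batch_size: int = 256   # in-batch-negative batch size for L_GA (searched in {128, ..., 1024})
    num_negatives: int = 10    # negatives per positive for L_PA
    lambda3: float = 0.1       # balance between L_PA and L_GA (tuned in {0.0, ..., 0.5})
    tau1: float = 0.1          # contrastive temperature for L_GA (tuned in [0.0, 1.0])
    lr: float = 1e-4           # searched in {1e-3, 5e-4, 1e-4, 5e-5, 1e-5}
    epochs: int = 100


def in_batch_info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float) -> torch.Tensor:
    """In-batch-negative contrastive loss: row i of `positives` is the positive
    for row i of `anchors`; every other row in the batch serves as a negative."""
    logits = anchors @ positives.t() / tau
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)


def total_loss(l_pa: torch.Tensor, l_ga: torch.Tensor, cfg: MAPSConfig) -> torch.Tensor:
    """One plausible reading of the overall objective: the personalized-alignment
    loss plus a lambda_3-weighted general-alignment term."""
    return l_pa + cfg.lambda3 * l_ga
```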

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the superior performance of MAPS across both ranking and retrieval tasks on both datasets.

Ranking Performance

The following results are from the table "Search ranking performance compared with personalized search baselines" in the original paper:

| Model | HR@5 | HR@10 | HR@20 | HR@50 | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|---|---|
| **Commercial** | | | | | | | | |
| AEM | 0.3886 | 0.5376 | 0.6733 | 0.8249 | 0.2656 | 0.3135 | 0.3478 | 0.3781 |
| QEM | 0.3996 | 0.5473 | 0.6733 | 0.8439 | 0.2671 | 0.3144 | 0.3463 | 0.3805 |
| HEM | 0.3484 | 0.4907 | 0.6366 | 0.8037 | 0.2360 | 0.2817 | 0.3185 | 0.3519 |
| ZAM | 0.3674 | 0.5248 | 0.6808 | 0.8205 | 0.2490 | 0.2994 | 0.3389 | 0.3669 |
| TEM | 0.4041 | 0.5685 | 0.7078 | 0.8528 | 0.2871 | 0.3402 | 0.3756 | 0.4049 |
| CoPPS | 0.4050 | 0.5637 | 0.7171 | 0.8660 | 0.2831 | 0.3445 | 0.3805 | 0.4103 |
| MAPS | 0.5281† | 0.7071† | 0.8330† | 0.9308† | 0.3780† | 0.4359† | 0.4680† | 0.4877† |
| **Amazon** | | | | | | | | |
| AEM | 0.3180 | 0.4550 | 0.5372 | 0.7239 | 0.1860 | 0.2132 | 0.2475 | 0.2768 |
| QEM | 0.2831 | 0.3888 | 0.5285 | 0.7663 | 0.1914 | 0.1805 | 0.2277 | 0.2913 |
| HEM | 0.2735 | 0.4198 | 0.5400 | 0.7446 | 0.1983 | 0.2172 | 0.2598 | 0.2961 |
| ZAM | 0.3103 | 0.4488 | 0.5429 | 0.7301 | 0.1833 | 0.2114 | 0.2494 | 0.2787 |
| TEM | 0.4026 | 0.4814 | 0.7197 | 0.7301 | 0.2968 | 0.3124 | 0.3415 | 0.3535 |
| CoPPS | 0.3870 | 0.4854 | 0.7286 | 0.8004 | 0.2788 | 0.3298 | 0.3439 | 0.3699 |
| MAPS | 0.5832† | 0.7735† | 0.8987† | 0.9741† | 0.4059† | 0.4676† | 0.4995† | 0.5147† |

  • Comparison with Personalized Search Baselines: MAPS consistently and significantly outperforms all other personalized product search methods across all HR@k and NDCG@k metrics on both the Commercial and Amazon datasets.
    • On the Commercial dataset, MAPS achieves an HR@10 of 0.7071 and NDCG@10 of 0.4359, representing an approximate 20% improvement over the best baselines (e.g., CoPPS with HR@10 of 0.5637, TEM with NDCG@10 of 0.3402).
    • On the Amazon dataset, the performance gains are even more substantial, with MAPS reaching HR@10 of 0.7735 and NDCG@10 of 0.4676, marking an approximately 35% improvement over baselines (CoPPS with HR@10 of 0.4854, TEM with NDCG@10 of 0.3124).
    • The '†' symbol indicates that these improvements are statistically significant (at the $p < 0.05$ level with paired t-tests), confirming that the enhanced semantic understanding and motivation integration in MAPS provide a robust advantage.
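As a rough illustration of how such '†' significance marks could be checked, the sketch below runs a paired t-test over per-user metric values with `scipy.stats.ttest_rel`. The per-user HR@10 arrays here are synthetic placeholders, not the paper's actual per-user results.

```python
# Hedged sketch of the paired t-test behind the significance marks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-user HR@10 scores, purely for illustration (not real results).
per_user_hr_maps = rng.uniform(0.5, 0.9, size=1000)
per_user_hr_baseline = rng.uniform(0.4, 0.8, size=1000)

t_stat, p_value = stats.ttest_rel(per_user_hr_maps, per_user_hr_baseline)
print(f"t={t_stat:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```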

Retrieval Performance

The following are the results from Table 3 of the original paper:

| Method | MRR@10 | MRR@20 | MRR@50 |
|---|---|---|---|
| BM25 | 0.2529 | 0.2577 | 0.2625 |
| AEM | 0.2445 | 0.2539 | 0.2588 |
| QEM | 0.2427 | 0.2516 | 0.2572 |
| HEM | 0.2176 | 0.2277 | 0.2331 |
| ZAM | 0.2304 | 0.2413 | 0.2459 |
| TEM | 0.2705 | 0.2803 | 0.2852 |
| CoPPS | 0.2642 | 0.2750 | 0.2799 |
| BGE-M3 | 0.2976 | 0.3110 | 0.3168 |
| CHIQ | 0.3192 | 0.3392 | 0.3412 |
| MAPS | 0.3805 | 0.3889 | 0.3922 |

  • Comparison with Retrieval Baselines: MAPS also significantly outperforms traditional, dense, and conversational retrieval methods on the Commercial dataset.
    • MAPS achieves an MRR@10 of 0.3805, which is substantially higher than the best baseline (CHIQ at 0.3192), a relative improvement of roughly 19% (an MRR@k computation is sketched after this list).
    • Even powerful dense retrieval models like BGE-M3 and conversational retrieval models like CHIQ are surpassed, indicating that MAPS's ability to incorporate consultation-derived motivation is highly effective for retrieval as well, not just ranking. This suggests that MAPS retrieves more relevant items at higher ranks.
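For reference, MRR@k (the metric reported above) can be computed as sketched below. The function name `mrr_at_k` and the inputs `ranked_lists` and `relevant_items` are illustrative, assuming one ground-truth item per query.

```python
# Minimal MRR@k sketch: reciprocal rank of the first relevant item within
# the top-k results, averaged over queries (0 if it does not appear).
from typing import Hashable, Sequence


def mrr_at_k(ranked_lists: Sequence[Sequence[Hashable]],
             relevant_items: Sequence[Hashable],
             k: int) -> float:
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_items):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == rel:
                total += 1.0 / rank
                break
    return total / max(len(relevant_items), 1)


# Example: the relevant item 'b' sits at rank 2, so MRR@10 = 0.5.
print(mrr_at_k([["a", "b", "c"]], ["b"], k=10))
```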

Multi-Scenario Ranking Performance

The following are the results from Table 4 of the original paper:

| Method | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| SESRec | 0.5622 | 0.7191 | 0.3465 | 0.3797 |
| UnifiedSSR | 0.5706 | 0.7074 | 0.3590 | 0.3743 |
| UniSAR | 0.5838 | 0.7294 | 0.3577 | 0.3894 |
| MAPS | 0.7071 | 0.8330 | 0.4359 | 0.4680 |

  • Comparison with Multi-Scenario Baselines: MAPS maintains its lead when compared to models that integrate both search and recommendation interactions.
    • On the Commercial dataset, MAPS's HR@10 (0.7071) and NDCG@10 (0.4359) are significantly higher than those of UniSAR (0.5838 and 0.3577, respectively), which is the strongest baseline in this category.

    • This further validates that the specific approach of leveraging consultation motivations provides a distinct advantage, even over methods that attempt a broader understanding of user behavior across different interaction types.

      Overall Conclusion: The consistent and substantial performance gains across various metrics, tasks, and datasets confirm the effectiveness and superiority of MAPS. It demonstrates that explicitly modeling and integrating search motivation derived from user consultations via LLM-driven alignment is a highly impactful strategy for enhancing personalized search in e-commerce platforms.

6.2. Ablation Studies / Parameter Analysis

Ablation Study of MAPS Modules

To understand the contribution of each component within MAPS, an ablation study was performed on the Commercial dataset. "w/o" denotes removing a component.

The following are the results from Table 5 of the original paper:

| Ablation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| MAPS | 0.7071 | 0.8330 | 0.4359 | 0.4680 |
| w/o LLM | 0.6527 | 0.7839 | 0.3968 | 0.4309 |
| w/o MoAE | 0.6781 | 0.7844 | 0.4096 | 0.4494 |
| w/o general align | 0.6198 | 0.7424 | 0.3669 | 0.4006 |
| w/o filter() in Eq. 2 | 0.6201 | 0.7426 | 0.3597 | 0.3951 |
| w/o personal align | 0.6334 | 0.7518 | 0.3732 | 0.4105 |
| w/o $\mathbf{e}_c$ | 0.6565 | 0.7730 | 0.3863 | 0.4246 |
| w/o $\mathbf{e}_s$ | 0.6448 | 0.7615 | 0.3803 | 0.4170 |

Analysis:

  • Overall Impact: Removing any module from MAPS consistently leads to a significant drop in performance across all HR and NDCG metrics, highlighting that all proposed components are crucial for the model's effectiveness.
  • Impact of LLM and MoAE: Removing LLM (w/o LLM) causes a notable drop (e.g., HR@10 from 0.7071 to 0.6527), indicating the importance of LLM's world knowledge and semantic understanding. Similarly, removing MoAE (w/o MoAE) also degrades performance (e.g., HR@10 to 0.6781), confirming that adaptively prioritizing critical semantics through attention experts is beneficial.
  • Critical Role of General Alignment: The most pronounced performance drop occurs when the general alignment module is removed (w/o general align, HR@10 to 0.6198; NDCG@10 to 0.3669). This is attributed to the crucial role of general alignment in aligning text with IDs and integrating domain-specific knowledge with LLM's general world knowledge. Without it, the model struggles with the semantic mismatch between general LLM representations and precise domain contexts. For instance, "Cool" might be interpreted as an adjective by an LLM without domain alignment, rather than a product feature, causing significant semantic shift.
  • Effect of Filtering: Removing the keyword filtering in general alignment (w/o filter () in Eq. 2) also leads to a substantial decrease (HR@10 to 0.6201), indicating that filtering noise texts is vital for effective alignment.
  • Impact of Personalized Alignment: Removing the entire personalized alignment module (w/o personal align, i.e., setting $\mathbf{e}_{q_{N+1}}' = \mathbf{e}_{q_{N+1}}$) also causes a significant performance reduction (HR@10 to 0.6334; NDCG@10 to 0.3732). This confirms the importance of incorporating motivation from consultation and query histories.
  • Individual Contributions of Consultation and Query History: Separately ablating consultation history (w/o $\mathbf{e}_c$) or query history (w/o $\mathbf{e}_s$) also degrades performance (e.g., HR@10 to 0.6565 and 0.6448, respectively), demonstrating that both sources contribute positively to the motivation-aware query embedding.

ID-Text Representation Fusion Analysis

This study investigates how fusing ID and LLM embeddings with MoAE pooling enhances personalization.

The following are the results from Table 6 of the original paper:

| Ablation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| MAPS-Default | 0.7071 | 0.8330 | 0.4359 | 0.4680 |
| MAPS-ID | 0.6870 | 0.7953 | 0.4226 | 0.4500 |
| MAPS-LLM | 0.6794 | 0.7896 | 0.4196 | 0.4427 |
| MAPS-Mean | 0.6950 | 0.8249 | 0.4337 | 0.4566 |

Analysis:

  • Fusion of ID and LLM is Best: MAPS-Default (which fuses both categorical ID and LLM text embeddings) outperforms models using only ID embeddings (MAPS-ID) or only LLM embeddings (MAPS-LLM). This confirms that a holistic representation combining both structural ID information and rich semantic text information is superior for representing users and items. ID embeddings capture explicit categorical properties, while LLM embeddings capture nuanced textual semantics; their fusion provides a more complete picture.
  • MoAE vs. Simple Mean Pooling: MAPS-Default (with MoAE) also outperforms MAPS-Mean (which uses simple mean pooling for text representation). This indicates that MoAE, by adaptively selecting and weighting different attention experts, is more effective at capturing critical semantic information and aligning it with the search task than a simple average of token embeddings. MoAE's dynamic nature allows it to better handle varying textual content and focus on the most relevant parts.
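To make the two ideas in this comparison concrete, the sketch below shows one simplified way to pool token embeddings with a small set of attention "experts" and to fuse the result with a categorical ID embedding. This is a minimal illustration under assumed shapes ($d = 64$, $d_{\mathrm{t}} = 32$), not the paper's MoAE implementation; `AttentionExpertPooling` and `IDTextFusion` are hypothetical names.

```python
# Simplified illustration of MoAE-style pooling and ID-text fusion.
import torch
import torch.nn as nn


class AttentionExpertPooling(nn.Module):
    """Each expert scores the token embeddings with its own learned query vector;
    a gate over the mean token embedding mixes the experts' pooled outputs."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.expert_queries = nn.Parameter(torch.randn(num_experts, dim))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq_len, dim] mapped LLM token embeddings
        attn = torch.softmax(tokens @ self.expert_queries.t(), dim=1)   # [B, L, E]
        pooled = torch.einsum("ble,bld->bed", attn, tokens)             # [B, E, D]
        gate = torch.softmax(self.gate(tokens.mean(dim=1)), dim=-1)     # [B, E]
        return torch.einsum("be,bed->bd", gate, pooled)                 # [B, D]


class IDTextFusion(nn.Module):
    """Concatenates a categorical ID embedding with the pooled text embedding,
    then projects back to the ID dimension (one simple fusion choice)."""

    def __init__(self, id_dim: int = 64, text_dim: int = 32):
        super().__init__()
        self.pool = AttentionExpertPooling(text_dim)
        self.proj = nn.Linear(id_dim + text_dim, id_dim)

    def forward(self, id_emb: torch.Tensor, token_embs: torch.Tensor) -> torch.Tensor:
        text_emb = self.pool(token_embs)          # attention-expert pooling (MAPS-Default analogue)
        # text_emb = token_embs.mean(dim=1)       # simple mean pooling (MAPS-Mean analogue)
        return torch.tanh(self.proj(torch.cat([id_emb, text_emb], dim=-1)))


# Toy usage: 8 items, 16 text tokens each.
fusion = IDTextFusion()
print(fusion(torch.randn(8, 64), torch.randn(8, 16, 32)).shape)  # torch.Size([8, 64])
```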

Scalability Study

The scalability study explores the impact of various configuration choices on MAPS performance on the Commercial dataset.

The following are the results from Table 7 of the original paper:

| Aspect | Config | N@5 | N@10 | N@20 |
|---|---|---|---|---|
| Sequence Length | 10 | 0.3674 | 0.4200 | 0.4481 |
| Sequence Length | 30 | 0.3780 | 0.4359 | 0.4680 |
| Sequence Length | 40 | 0.3739 | 0.4303 | 0.4627 |
| LLM Scale | Qwen2.5-0.5B | 0.3394 | 0.3892 | 0.4237 |
| LLM Scale | Qwen2.5-1.5B | 0.3534 | 0.4026 | 0.4357 |
| LLM Scale | Qwen2-7B | 0.3593 | 0.4090 | 0.4412 |
| LLM Scale | Qwen2.5-7B | 0.3780 | 0.4359 | 0.4680 |
| Transformer Scale | 1 Layer | 0.3780 | 0.4359 | 0.4680 |
| Transformer Scale | 2 Layers | 0.3881 | 0.4470 | 0.4724 |
| Transformer Scale | 4 Layers | 0.3909 | 0.4561 | 0.4838 |

Analysis:

  • Sequence Length: The optimal sequence length for user history is 30. Using a shorter sequence (10) results in lower performance, likely due to insufficient historical context. However, increasing the sequence length beyond 30 (to 40) also leads to a slight performance drop. This suggests that excessively long sequences might introduce more noise or irrelevant interactions, diminishing the signal-to-noise ratio.
  • LLM Scale: The performance of MAPS generally improves with the scale and capability of the underlying LLM. Using larger and presumably more powerful LLMs (from Qwen2.5-0.5B to Qwen2.5-7B) consistently leads to better NDCG scores. This underscores the importance of the world knowledge and NLU capabilities provided by LLMs in understanding complex consultation and query texts.
  • Transformer Layers: Increasing the number of Transformer layers generally improves the ranking effect. Going from 1 layer to 2 layers, and then to 4 layers, shows a steady increase in NDCG scores. This implies that deeper Transformer architectures are more effective at capturing complex interactions and aligning LLM embeddings with the specific domain and personalized search task.
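For concreteness, varying encoder depth as in the "Transformer Scale" rows amounts to a one-line configuration change with a standard PyTorch encoder. The sketch below is a generic illustration; the head count and activation are assumptions, not the authors' exact architecture.

```python
# Generic sketch of varying Transformer encoder depth (d_model follows d = 64).
import torch.nn as nn


def build_encoder(d_model: int = 64, num_layers: int = 1, nhead: int = 4) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       batch_first=True, activation="gelu")
    return nn.TransformerEncoder(layer, num_layers=num_layers)


encoder_1l = build_encoder(num_layers=1)   # default configuration
encoder_4l = build_encoder(num_layers=4)   # deeper variant from Table 7
```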

Configuration Analysis

Mapping Threshold $t$

The following figure (Figure 4 from the original paper) shows the ranking performance on Amazon with different values of the threshold $t$ in Eq. 2.

Figure 4: Ranking performance on Amazon with different threshold $t$ in Eq. 2 (default: 2).

Analysis:

  • The figure shows the HR@k and NDCG@k performance on the Amazon dataset for different values of the mapping threshold $t$ (from Eq. 2), which is used to filter keywords in the general alignment module. The default threshold is 2.
  • The results indicate that MAPS achieves optimal performance when $t = 3$.
  • Too Low a Threshold ($t < 3$): If $t$ is too small (e.g., $t = 0$ or $t = 1$), keywords are filtered only loosely, which admits noise from texts in other scenarios that are not truly relevant to the item and degrades performance.
  • Too High a Threshold ($t > 3$): If $t$ is excessively high, the filtering condition becomes too stringent and discards too much data. This limits the useful semantic information available for general alignment, constraining the model's ability to learn robust token-item mappings and reducing performance.
  • This analysis highlights the need for an appropriately tuned threshold $t$ that balances capturing sufficient relevant data against filtering out irrelevant noise.
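A minimal sketch of threshold-based keyword filtering in this spirit is shown below. It simply keeps tokens that occur at least $t$ times across an item's associated texts; this is an assumption about the intent of Eq. 2, not its exact formulation, and `filter_keywords` is an illustrative name.

```python
# Hedged sketch of frequency-threshold keyword filtering.
from collections import Counter
from typing import Iterable


def filter_keywords(texts: Iterable[str], t: int = 2) -> set:
    """Keep tokens that appear at least t times across the item's texts;
    rarer tokens are treated as noise and dropped."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token for token, c in counts.items() if c >= t}


print(filter_keywords(["cool breathable running shoes",
                       "breathable shoes for summer",
                       "running shoes lightweight"], t=2))
```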

Activation Function

The following are the results from Table 8 of the original paper:

| Activation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| tanh | 0.7585 | 0.8787 | 0.4676 | 0.4995 |
| SiLU | 0.7823 | 0.8953 | 0.4697 | 0.5010 |
| PReLU | 0.7813 | 0.9067 | 0.4763 | 0.5097 |
| GELU | 0.7978 | 0.9036 | 0.4734 | 0.5015 |
| ReLU | 0.4390 | 0.6740 | 0.2165 | 0.2768 |

Analysis:

  • The default activation function used in MAPS (in Eq. 1 for overall representations) is tanh.
  • The results show that ReLU performs significantly worse than all other activation functions (e.g., HR@10 of 0.4390 vs. tanh's 0.7585). This poor performance is attributed to the "dying ReLU" problem (Lu et al., 2019), where neurons can become inactive during training if their input consistently falls into the negative range, leading to zero gradients and preventing further weight updates.
  • Other activation functions like SiLU, PReLU, and GELU actually outperform tanh. GELU achieves the highest performance (e.g., HR@10 of 0.7978, NDCG@10 of 0.4734), suggesting that it might be a more suitable choice for MAPS. This indicates that the choice of activation function can have a significant impact on model performance, and GELU (or PReLU) provides better non-linearity and gradient flow for this specific task than tanh.
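Since the activation is a drop-in choice, swapping it amounts to a one-line change in the FFN block, as sketched below; `ffn_block` and the activation dictionary are illustrative names, not the paper's code. GELU, SiLU, and PReLU all pass gradient for negative inputs, which is why they avoid the dying-ReLU failure mode noted above.

```python
# Illustrative FFN block with a configurable activation function.
import torch.nn as nn

ACTIVATIONS = {"tanh": nn.Tanh, "relu": nn.ReLU, "gelu": nn.GELU,
               "silu": nn.SiLU, "prelu": nn.PReLU}


def ffn_block(d_in: int, d_out: int, act_name: str = "gelu") -> nn.Sequential:
    # The activation choice is the only thing that changes between Table 8 rows.
    return nn.Sequential(nn.Linear(d_in, d_out), ACTIVATIONS[act_name]())
```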

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Motivation-Aware Personalized Search (MAPS), a novel method that significantly enhances personalized product search by explicitly leveraging user consultation histories. MAPS moves beyond traditional approaches that assume queries fully capture user intent, recognizing that consultations reveal crucial underlying motivations. The model employs Large Language Models (LLMs) to embed natural language queries and consultations into a unified semantic space, providing deep Natural Language Understanding (NLU) and world knowledge. A Mixture of Attention Experts (MoAE) adaptively prioritizes critical semantic information within these texts. Furthermore, MAPS features a dual alignment strategy: (1) Mapping-based General Alignment uses contrastive learning to align tokens from consultations, reviews, and product features with item IDs, bridging the category-text gap and grounding LLM representations in the domain; (2) Sequence-based Personalized Alignment utilizes bidirectional attention within Transformer encoders to integrate motivation-aware embeddings derived from both consultation and search histories with individual user preferences. Extensive experiments on a real-world commercial dataset and a synthetic Amazon dataset demonstrate that MAPS consistently and significantly outperforms existing retrieval and ranking methods, providing a more accurate and context-aware solution.

7.2. Limitations & Future Work

The authors acknowledge several limitations of MAPS and suggest future research directions:

  • Computational Efficiency and Scalability: While MAPS improves semantic understanding, the paper notes that it may not fully address computational efficiency or scalability challenges, especially in real-time applications. The use of LLMs and complex attention mechanisms can be computationally intensive.

  • Dynamic User Behavior: The current framework primarily focuses on integrating past consultations and search histories but may not explicitly account for dynamic user behavior and evolving preferences over longer periods. User intents and motivations can shift, requiring models to adapt.

  • Domain-Specific Knowledge Integration: MAPS leverages LLM's world knowledge and general alignment for domain-specific grounding, but it does not incorporate explicit, pre-curated domain-specific knowledge (e.g., ontologies, expert rules) that could further enhance understanding in highly specialized contexts. This limits its generalizability across diverse industries without additional domain adaptation.

    Future work could focus on:

  • Optimizing Real-time Adaptability: Developing strategies to make MAPS more efficient for real-time personalization, potentially through model distillation or more optimized inference techniques.

  • Addressing Scalability Issues: Investigating methods to handle larger volumes of consultation data and user histories more effectively, perhaps by exploring more efficient Transformer architectures or approximation techniques.

  • Integrating External Domain Knowledge: Incorporating explicit domain-specific knowledge bases or knowledge graphs to further refine semantic understanding and alignment in specialized domains, making the system more robust and versatile across industries.

  • Further Consultation Modeling: The authors also indicate an interest in exploring further nuances of consultation modeling within e-commerce platforms.

7.3. Personal Insights & Critique

MAPS introduces a compelling and intuitive idea: that users don't just search, they often consult first to clarify their needs. This pre-search consultation data is a goldmine of explicit motivation that has been largely overlooked. The paper's strength lies in its rigorous approach to capturing this motivation, utilizing state-of-the-art LLMs for NLU and designing a multi-layered alignment mechanism to integrate this rich information. The Mixture of Attention Experts (MoAE) is a clever way to dynamically adapt to varying textual contexts, and the dual general and personalized alignment modules effectively tackle the ID-text gap and historical noise. The significant performance gains strongly validate the hypothesis that search motivation is a critical enhancing factor.

One potential area for improvement or critique is the reliance on GPT-4o for generating consultation texts on the Amazon dataset. While necessary due to the lack of real consultation data, synthetic data, no matter how good, might not fully capture the nuanced linguistic patterns, ambiguities, and emotional content of genuine user consultations. This could lead to a slight overestimation of performance on real-world consultation data, where users might express themselves differently. Further validation with more diverse real consultation datasets would strengthen the findings.

The computational cost of employing LLMs and complex Transformer architectures (especially with multiple encoders and MoAE) could be substantial, as acknowledged by the authors in limitations. For real-time search systems handling millions of queries per second, this could pose a practical challenge. Future work in model compression, distillation, or more efficient inference strategies would be crucial for industrial deployment.

The methods in MAPS could be highly transferable to other domains beyond e-commerce. Any system where users interact in natural language to express complex needs before formulating a concise query could benefit. Examples include:

  • Healthcare: Patients describing symptoms to a chatbot before a doctor's visit, leading to more personalized medical information retrieval.

  • Customer Support: Users explaining complex issues to a virtual assistant before searching for solutions in a knowledge base.

  • Legal Research: Lawyers outlining a case to an AI assistant before searching for relevant precedents.

    The paper makes a strong case for shifting focus from purely observed behavior to understanding underlying motivations, demonstrating how LLMs can enable this deeper level of personalized understanding. It sets a new direction for personalized information retrieval that goes beyond superficial interaction patterns.
