MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment
TL;DR Summary
MAPS leverages LLMs to unify query and consultation embeddings, using Mixture of Attention Experts and dual alignment techniques to enhance motivation-aware personalized search, outperforming existing methods in e-commerce retrieval and ranking tasks.
Abstract
Personalized product search aims to retrieve and rank items that match users' preferences and search intent. Despite their effectiveness, existing approaches typically assume that users' query fully captures their real motivation. However, our analysis of a real-world e-commerce platform reveals that users often engage in relevant consultations before searching, indicating they refine intents through consultations based on motivation and need. The implied motivation in consultations is a key enhancing factor for personalized search. This unexplored area comes with new challenges including aligning contextual motivations with concise queries, bridging the category-text gap, and filtering noise within sequence history. To address these, we propose a Motivation-Aware Personalized Search (MAPS) method. It embeds queries and consultations into a unified semantic space via LLMs, utilizes a Mixture of Attention Experts (MoAE) to prioritize critical semantics, and introduces dual alignment: (1) contrastive learning aligns consultations, reviews, and product features; (2) bidirectional attention integrates motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic data show MAPS outperforms existing methods in both retrieval and ranking tasks.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment
1.2. Authors
Weicong Qin, Yi Xu, Weijie Yu, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu. Affiliations include Gaoling School of Artificial Intelligence, Renmin University of China, University of International Business and Economics, and AI Lab at Lenovo Research, Lenovo Group Limited, China.
1.3. Journal/Conference
This paper is currently a preprint, published on arXiv. The arXiv platform serves as a repository for preprints of scientific papers, typically prior to peer review or formal publication in a journal or conference. While it provides early access to research, its reputation is that of a preprint server rather than a peer-reviewed venue itself.
1.4. Publication Year
Published at (UTC): 2025-03-03T16:24:36.000Z. The publication year is 2025.
1.5. Abstract
Personalized product search aims to retrieve and rank items matching users' preferences and search intent. Existing methods often assume that user queries fully capture their true motivation. However, the authors' analysis of a real e-commerce platform reveals that users frequently engage in consultations before searching, indicating a refinement of their intentions based on underlying motivations and needs. These implied motivations in consultations are identified as a critical, yet unexplored, factor for enhancing personalized search. The paper addresses challenges associated with this, including aligning contextual motivations with concise queries, bridging the gap between product categories and natural language text, and filtering noise from historical sequences. To overcome these, the authors propose Motivation-Aware Personalized Search (MAPS). MAPS utilizes Large Language Models (LLMs) to embed queries and consultations into a unified semantic space. It employs a Mixture of Attention Experts (MoAE) to prioritize critical semantic information and introduces a dual alignment strategy: (1) contrastive learning to align consultations, reviews, and product features, and (2) bidirectional attention to integrate motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic datasets demonstrate that MAPS significantly outperforms existing methods in both retrieval and ranking tasks.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2503.01711
- PDF Link: https://arxiv.org/pdf/2503.01711v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the inherent limitation of current personalized product search systems: they typically assume that a user's search query fully articulates their underlying needs and motivations. This assumption often falls short in real-world scenarios. In e-commerce, users frequently engage in preliminary consultations (e.g., with AI assistants or customer service) to clarify their needs or gather information before formulating a concise search query. These consultations contain rich, contextual motivation that drives the user's eventual search, but this motivation remains largely uncaptured and unutilized by existing search algorithms.
This problem is important because understanding a user's search motivation (their intrinsic goal or problem they want to solve) leads to more satisfactory and relevant search results than merely matching keywords. By addressing this gap, personalized search systems can move beyond superficial keyword matching to truly anticipate and fulfill user needs.
The paper's innovative idea is to explicitly model this "search motivation" embedded in consultation histories. This unexplored area presents specific challenges:
- Alignment with Queries: Consultations are often lengthy, complex natural language descriptions of needs, while queries are concise keywords. Bridging this semantic gap is crucial.
- Alignment with Product Features: Products have structured, categorical attributes, but motivations are expressed in free-form text. Aligning these disparate data types is challenging.
- Alignment with User History: Not all past consultations are relevant to the current search; filtering noise and identifying pertinent information within a sequence is necessary.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Explicit Modeling of Search Motivation: It is the first work to explicitly define and model "search motivation" by leveraging consultation data within personalized search systems on e-commerce platforms. This highlights the critical role of pre-search consultations in understanding user intent.
- Novel MAPS Framework: The paper proposes Motivation-Aware Personalized Search (MAPS), a comprehensive model framework designed to integrate LLM knowledge and bridge the gap between categorical ID (identifier) and natural language text embeddings. LLMs are used to embed queries and consultation texts into a unified semantic space.
  - A Mixture of Attention Experts (MoAE) network is introduced to adaptively prioritize critical tokens and extract accurate semantic embeddings from texts of varying lengths and complexities.
  - Dual alignment mechanisms are employed:
    - Mapping-based General Alignment: Uses contrastive learning to align consultations, reviews, and product features by establishing keyword-item relationships.
    - Sequence-based Personalized Alignment: Employs bidirectional attention within a Transformer encoder to integrate motivation-aware embeddings derived from user consultation and search histories with individual user preferences.
- Superior Performance: Extensive experiments conducted on both a real-world commercial dataset and a synthetic Amazon dataset demonstrate that MAPS significantly outperforms existing traditional retrieval methods, personalized search methods, and conversational retrieval methods in both retrieval and ranking tasks. This empirical evidence validates the effectiveness and superiority of the proposed approach.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Personalized Product Search: This refers to the task of retrieving and ranking products that are highly relevant not only to a user's explicit query but also to their implicit preferences, historical interactions, and context. It aims to tailor search results to individual users, improving satisfaction and conversion rates in e-commerce.
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are proficient in tasks like text summarization, translation, question answering, and generating embeddings (numerical representations) of text that capture semantic meaning. In this paper, LLMs are used to obtain high-quality token embeddings for natural language inputs, providing rich world knowledge and natural language understanding (NLU) capabilities.
- Embeddings: In machine learning, an embedding is a low-dimensional, continuous vector representation of discrete data (like words, items, or users). The idea is that semantically similar items have embeddings that are close to each other in this space. For example, word embeddings represent words as vectors, where words with similar meanings have similar vector representations; item embeddings represent products, and user embeddings represent users.
- Attention Mechanism: A core component of modern deep learning architectures, particularly Transformers. The attention mechanism allows a model to weigh the importance of different parts of the input when processing a sequence: instead of treating all parts equally, it focuses on the most relevant ones.
  - Scaled Dot-Product Attention: This is a common form of attention. Given Query (Q), Key (K), and Value (V) matrices, the output is calculated as (a short code sketch follows this list):
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$: A matrix of query vectors; the query is what the attention mechanism is looking for.
    - $K$: A matrix of key vectors; the keys are what the attention mechanism compares the query against.
    - $V$: A matrix of value vectors; the values are what the attention mechanism outputs, weighted by the attention scores.
    - $QK^T$: Dot products between query and key vectors, measuring their similarity.
    - $\sqrt{d_k}$: A scaling factor, where $d_k$ is the dimension of the key vectors. This prevents the dot products from becoming too large, which could push the softmax function into regions with very small gradients.
    - $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution, ensuring the weights sum to 1.
    - The output is a weighted sum of the value vectors, where the weights are the attention scores (the softmax output).
- Contrastive Learning: A self-supervised learning paradigm in which a model learns representations by comparing similar and dissimilar pairs of data points. The goal is to learn an embedding space where positive pairs (e.g., different views of the same item, or an item and its associated text) are pulled closer together, while negative pairs (e.g., an item and an unrelated text) are pushed further apart. This helps the model distinguish relevant from irrelevant information.
- Transformers: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). They rely primarily on self-attention mechanisms to process sequential data, making them highly effective for Natural Language Processing (NLP) tasks. Unlike Recurrent Neural Networks (RNNs), Transformers can process all parts of a sequence in parallel, making them faster to train on large datasets and better at capturing long-range dependencies. A typical Transformer consists of an encoder stack and a decoder stack: the encoder maps an input sequence to a sequence of continuous representations, and the decoder generates an output sequence based on the encoder's output.
- Dot Product Similarity: A common way to measure the similarity between two vectors. Given two vectors $\mathbf{a}$ and $\mathbf{b}$, their dot product is $\mathbf{a} \cdot \mathbf{b} = \lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert \cos\theta$, where $\theta$ is the angle between them. If the vectors are normalized (unit vectors), the dot product directly gives the cosine of the angle, which ranges from -1 (opposite) to 1 (identical). A higher dot product indicates higher similarity.
- Negative Sampling: A technique used in machine learning, particularly for training word embeddings (like Word2Vec) and recommender systems. Instead of computing the loss over all possible negative examples (which can number in the millions for a large vocabulary or item set), negative sampling randomly selects a small number of negative examples (items or words that are not relevant or co-occurring) to update the model's weights. This makes training much more efficient while still providing a good learning signal.
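To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is illustrative only; the function and variable names are ours, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between queries and keys
    weights = softmax(scores, axis=-1)   # attention distribution over keys
    return weights @ V                   # weighted sum of values

# Toy example: 2 query vectors attending over 3 key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)
```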
3.2. Previous Works
The paper discusses several categories of related work:
-
Traditional Retrieval Algorithms:
BM25 (Robertson et al., 2009): A statistical bag-of-words retrieval function that ranks documents based on the presence of query terms, their frequency within documents (Term Frequency), and their rarity across the entire collection (Inverse Document Frequency), along with document length normalization. It primarily focuses on keyword matching and does not inherently support personalization or semantic understanding beyond term overlaps.
-
Dense Retrieval Algorithms:
BGE-M3 (Chen et al., 2024): This represents a newer generation of retrieval methods that use deep learning to embed queries and documents into a shared vector space. Retrieval is performed by finding documents whose embeddings are close to the query embedding (e.g., using cosine similarity). BGE-M3 specifically focuses on multi-lingual, multi-functionality, and multi-granularity text embeddings, enhancing retrieval and ranking by capturing deeper semantic relationships than BM25.
-
Conversational Retrieval Methods:
CHIQ (Mo et al., 2024): These methods aim to improve retrieval by incorporating contextual history from conversational interactions. They recognize that queries in a conversation are often ambiguous or incomplete on their own and require understanding the preceding turns to formulate effective search requests. CHIQ specifically enhances query rewriting in conversational search by leveraging context. However, such methods typically focus on query rewriting for better retrieval and do not explicitly model underlying user motivation or deep personalization beyond the immediate conversation.
-
Personalized Search Methods (Focusing on User History & Intent):
- QEM (Ai et al., 2019a): Query-Embedding Model. Primarily focuses on the direct similarity between the query and items, often using simple embedding matching; it considers basic query-item relevance.
- DREM (Ai et al., 2019b): Dynamic Relation Embedding Model. Extends basic query-item matching by incorporating dynamic relationships, often for explainable product search.
- HEM (Ai et al., 2017): Hierarchical Embedding Model. Incorporates user information and search history into a separate user embedding; it learns user preferences from past interactions and combines them with query embeddings to personalize search.
- AEM (Ai et al., 2019a): Attention-Embedding Model. An attention-based personalized model that combines the user's previously interacted items with the current query, allowing the model to focus on relevant past interactions.
- ZAM (Ai et al., 2019a): Zero Attention Model. An enhancement to AEM that adds a "zero vector" to the item list, designed to handle cases where no relevant historical items are found.
- TEM (Bi et al., 2020): Transformer-based Embedding Model. Improves AEM by replacing its attention layer with a Transformer encoder, enabling more sophisticated modeling of user interaction sequences.
- CoPPS (Dai et al., 2023): Contrastive learning for user sequence representation in Personalized Product Search. Leverages contrastive learning to learn robust user sequence representations, improving personalization by distinguishing between relevant and irrelevant user behaviors.
-
Multi-Scenario Methods (Integrating Search and Recommendation):
- SESRec (Si et al., 2023): Search and Recommendation System. Uses contrastive learning to learn disentangled search representations specifically for recommendation tasks, aiming to bridge the gap between user search behavior and recommendation.
- UnifiedSSR (Xie et al., 2023): Unified Sequential Search and Recommendation. A dual-branch network that jointly learns user behavior history across both search and recommendation scenarios, aiming for a unified understanding of user preferences.
- UniSAR (Shi et al., 2024): Unified Search and Recommendation. Employs Transformers and cross-attention to model different types of fine-grained behavior transitions between search and recommendation, aiming for a more holistic user understanding.
3.3. Technological Evolution
The evolution of personalized search can be traced through several stages:
- Early Keyword Matching (e.g., BM25): Focused solely on lexical overlap between queries and documents. No personalization.
- Embedding-based Retrieval (e.g., BGE-M3): Moved to dense vector representations for semantic matching, improving relevance beyond exact keywords. Still largely query-centric.
- Personalized Search (e.g., HEM, AEM, TEM, CoPPS): Incorporated user history and interaction sequences to tailor results, recognizing that users have individual preferences. These methods primarily model past behaviors and direct query-item relations.
- Conversational Search (e.g., CHIQ): Began to consider the context of ongoing user-system dialogue to refine queries, addressing ambiguity in conversational interactions.
- Multi-scenario Integration (e.g., SESRec, UnifiedSSR, UniSAR): Attempted to unify search and recommendation contexts, acknowledging that user intent can manifest in different interaction types.

MAPS represents a further evolution by recognizing a previously unaddressed data source: user consultations. It moves beyond merely observing past interactions or immediate conversational context to actively infer underlying user motivation before the search even begins. This is a significant step towards a more proactive and deeply personalized understanding of user intent.
3.4. Differentiation Analysis
Compared to the main methods in related work, MAPS introduces several core differences and innovations:
- Novel Data Source (Consultations): The most significant innovation is the explicit utilization of consultation history to extract search motivation. Previous personalized search methods focus on direct query-item interactions or search/recommendation sequences, and conversational retrieval uses dialogue history primarily for query rewriting. MAPS uniquely posits that consultations reveal a deeper, more refined user motivation that precedes and influences the formal search query.
- Motivation Modeling: MAPS is the first to explicitly define and model "search motivation" as a distinct concept influencing personalized search, rather than merely inferring intent from historical searches or clicks.
- LLM-Driven Semantic Understanding: While some modern methods use embeddings (like BGE-M3), MAPS leverages the sophisticated natural language understanding (NLU) and world knowledge of Large Language Models (LLMs) to process the complex, natural language text of consultations and queries. This yields a richer and more nuanced semantic representation than models relying on simpler token embeddings or ID embeddings.
- Mixture of Attention Experts (MoAE): MAPS introduces MoAE to dynamically select and combine different attention mechanisms, enabling the model to adaptively focus on critical semantics within various types of textual inputs. This is more sophisticated than the single, static attention mechanism typically found in models like AEM or TEM.
- Dual Alignment Strategy:
  - General Alignment: Addresses the category-text gap and aligns raw text with item IDs and features using contrastive learning over collected item-related texts (queries, consultations, titles, reviews). This is crucial for grounding LLM representations in the specific e-commerce domain.
  - Personalized Alignment: Integrates motivation from consultation history and query history with the current query and user preferences using bidirectional attention. This dynamic, context-aware integration of long-term and short-term user intent is more comprehensive than methods that only consider search history.
- Comprehensive Problem Addressing: MAPS directly tackles the three critical challenges identified in its motivation: aligning complex consultations with concise queries, bridging the category-text gap, and filtering noise within historical sequences, which are often overlooked or only partially addressed by existing models.

In essence, MAPS innovates by tapping into a richer, previously ignored source of user intent (consultations), leveraging powerful LLMs for deeper semantic understanding, and designing a multi-faceted alignment strategy to integrate this motivation into a personalized search framework.
4. Methodology
The Motivation-Aware Personalized Search (MAPS) model is designed to enhance personalized product search by explicitly incorporating user consultations to understand their underlying motivations. The overview of MAPS is illustrated in Figure 3 of the original paper. The methodology is structured into three main modules: (1) ID-text representation fusion with LLM, (2) Mapping-based general alignment, and (3) Sequence-based personalized alignment.
The following figure (Figure 3 from the original paper) shows the overall framework of MAPS.
This image is Figure 3 of the paper, showing the overall framework of the MAPS model: ID-text representation fusion via LLM (①), the general alignment module (②), and the personalized alignment module (③). The architecture clearly depicts the Mixture of Attention Experts mechanism and the dual alignment strategy.
4.1. Principles
The core idea behind MAPS is to leverage the rich, contextual information present in user consultations to infer their true search motivation. This motivation often precedes and informs their concise search queries. By capturing this deeper intent using Large Language Models (LLMs) and carefully aligning it with various data sources (queries, product features, user history), MAPS aims to provide more relevant and personalized search results. The theoretical basis is that LLMs can understand complex natural language, and attention mechanisms can focus on critical parts of this information, while contrastive learning and bidirectional attention can effectively bridge semantic gaps and integrate diverse signals for a holistic user understanding.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. ID-Text Representation Fusion with LLM
This module focuses on representing users, items, queries, and consultations as embeddings (numerical vectors) in a unified space, crucial for the model to understand interactions. It combines both categorical ID features and rich textual features.
Text Representation
Natural language texts from consultations and queries require robust Natural Language Understanding (NLU). Existing personalized product search methods often rely on simple token embeddings or average pooling, which lack world knowledge and the ability to focus on critical semantics. MAPS addresses this by employing pre-trained LLM embeddings and a Mixture of Attention Experts (MoAE) pooling network.
- LLM Embedding Initialization: The raw text (e.g., a query or consultation message) is first fed into a frozen pre-trained LLM. The LLM generates token embeddings for each word or sub-word unit in the input sequence; these capture deep semantic meaning and world knowledge. Unlike standard approaches, no average pooling is performed at this stage; the individual token embeddings are retained.
- Dimension Mapping: To ensure compatibility with various LLMs and to map the token embeddings to a consistent dimension, trainable Feed-Forward Network (FFN) layers project the LLM token embeddings into a unified dimension $d_{\mathrm{t}}$.
- Mixture of Attention Experts (MoAE) Pooling Network: This framework adaptively assigns weights to tokens to derive a comprehensive text embedding (a code sketch follows this list). It includes three types of attention pooling experts:
  - Parameterized Attention Pooling Expert: This expert maintains a learnable parameterized embedding $\mathbf{q}$, used as a query to compute attention scores over the input token embeddings, which act as keys. The resulting attention scores weight the token embeddings to produce a pooled representation:
    $ \mathbf{e}_{\mathrm{param}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left(\frac{\mathbf{q}^{\top}(\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}}\right) \mathbf{h}_i $
    Where:
    - $\mathbf{e}_{\mathrm{param}}^{\mathrm{pool}}$: The pooled embedding from the parameterized expert.
    - $L$: The length of the input token sequence.
    - $\mathrm{softmax}$: The function that converts raw scores into a probability distribution (attention weights).
    - $\mathbf{q}$: The parameterized query vector of this expert, a learnable embedding.
    - $\mathbf{h}_i$: The embedding of the $i$-th token in the input sequence.
    - $\mathbf{W}^k$: A learnable weight matrix that transforms the token embeddings into key representations.
    - $d_{\mathrm{t}}$: The dimension of the token embeddings and the key vectors.
    - $\mathbf{q}^{\top}(\mathbf{h}_i \mathbf{W}^k)$: The dot-product similarity between the parameterized query and the transformed token key.
    - $\sqrt{d_{\mathrm{t}}}$: A scaling factor that prevents dot products from becoming too large, stabilizing gradients during training (common in attention mechanisms).
  - Self-Attention Pooling Expert: This expert computes self-attention scores directly from the input token embeddings themselves. Each token's embedding acts as both a query and a key to determine its own importance relative to other tokens in the sequence:
    $ \mathbf{e}_{\mathrm{self}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left(\frac{(\mathbf{h}_i \mathbf{W}^q)^{\top}(\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}}\right) \mathbf{h}_i $
    Where:
    - $\mathbf{e}_{\mathrm{self}}^{\mathrm{pool}}$: The pooled embedding from the self-attention expert.
    - $\mathbf{W}^q$: A learnable weight matrix that transforms token embeddings into query representations.
    - Other symbols are as defined for the parameterized expert. The key difference is that the query here is derived from the token embedding itself, $\mathbf{h}_i \mathbf{W}^q$.
  - Search-Centered Cross-Attention Pooling Expert: To ensure that the text embeddings of users, items, and consultations are particularly relevant to the current search task, this expert uses the embedding of the current search query text as the attention query, forcing it to focus on tokens that are semantically aligned with the current search intent:
    $ \mathbf{e}_{\mathrm{cross}}^{\mathrm{pool}} = \frac{1}{L} \sum_{i=1}^{L} \mathrm{softmax}\left(\frac{(\mathbf{q}' \mathbf{W}^q)^{\top}(\mathbf{h}_i \mathbf{W}^k)}{\sqrt{d_{\mathrm{t}}}}\right) \mathbf{h}_i $
    Where:
    - $\mathbf{e}_{\mathrm{cross}}^{\mathrm{pool}}$: The pooled embedding from the cross-attention expert.
    - $\mathbf{q}'$: The embedding of the current search query text.
    - Other symbols are as defined previously. This expert's query is fixed by the current search query, providing a strong contextual bias.
  - Combining Experts: Each type of expert (parameterized, self-attention, cross-attention) has multiple members. A gating network computes scores for all experts based on the input embeddings, and the top $K$ experts with the highest gating scores are activated. The final text embedding is a weighted sum of the pooled embeddings from these activated experts, with weights given by their gating scores (normalized via softmax):
    $ \mathbf{e}^{\mathrm{text}} = \sum_{j=1}^{K} gate_j \, \mathbf{e}_j^{\mathrm{pool}} $
    Where:
    - $\mathbf{e}^{\mathrm{text}}$: The final combined text embedding for the input sequence.
    - $gate_j$: The normalized gating score of the $j$-th activated expert.
    - $\mathbf{e}_j^{\mathrm{pool}}$: The pooled embedding from the $j$-th activated expert.
    - If a user or item has multiple textual features (e.g., item title, description, reviews), the resulting text embeddings are concatenated, one per textual feature.
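To make the MoAE description concrete, here is a minimal PyTorch sketch of the three pooling experts and top-$K$ gating, based on our reading of the equations above. The class names, the per-type expert count, and the gating input (the mean of the token embeddings) are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolingExpert(nn.Module):
    """One pooling expert: scores each token against a query vector and averages."""
    def __init__(self, dim, mode="param"):
        super().__init__()
        self.mode = mode
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_q = nn.Linear(dim, dim, bias=False)
        if mode == "param":
            self.q = nn.Parameter(torch.randn(dim))  # learnable query embedding

    def forward(self, h, search_query_emb=None):
        # h: (L, dim) token embeddings from the frozen LLM (after dimension mapping)
        keys = self.W_k(h)                                   # (L, dim)
        if self.mode == "param":
            q = self.q                                       # learnable query
        elif self.mode == "self":
            q = self.W_q(h)                                  # each token queries itself
        else:  # "cross": the current search query steers the pooling
            q = self.W_q(search_query_emb)                   # (dim,)
        scores = (q * keys).sum(-1) / h.size(-1) ** 0.5      # (L,)
        weights = F.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * h).mean(0)           # pooled text embedding

class MoAEPooling(nn.Module):
    """Mixture of Attention Experts: gate over all experts, keep the top-K."""
    def __init__(self, dim, n_per_type=2, top_k=2):
        super().__init__()
        modes = ["param", "self", "cross"]
        self.experts = nn.ModuleList(
            AttentionPoolingExpert(dim, m) for m in modes for _ in range(n_per_type)
        )
        self.gate = nn.Linear(dim, len(self.experts))
        self.top_k = top_k

    def forward(self, h, search_query_emb):
        pooled = torch.stack([e(h, search_query_emb) for e in self.experts])  # (E, dim)
        gate_scores = self.gate(h.mean(0))                                    # (E,)
        top_val, top_idx = gate_scores.topk(self.top_k)
        weights = F.softmax(top_val, dim=-1)                                  # normalize gates
        return (weights.unsqueeze(-1) * pooled[top_idx]).sum(0)               # e^text

# Toy usage: 12 LLM token embeddings of size 32 and a 32-d current-query embedding.
h = torch.randn(12, 32)
q = torch.randn(32)
print(MoAEPooling(dim=32)(h, q).shape)  # torch.Size([32])
```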
Categorical ID Representation
For discrete categorical features (e.g., brand ID, category ID), embeddings are obtained by a simple lookup operation. Each unique ID is mapped to a dense vector.
$
\mathbf{e}_{g_{id}}^{\mathrm{ID}} = \mathrm{lookup}_{g}^{\mathrm{ID}}(id)
$
Where:
- $\mathbf{e}_{g_{id}}^{\mathrm{ID}}$: The ID embedding for a specific id within category $g$.
- $\mathrm{lookup}_{g}^{\mathrm{ID}}$: The embedding lookup table for category $g$.
- id: The specific categorical identifier.

These individual categorical ID embeddings are then concatenated to form an overall ID embedding for a user or item, with one component per categorical feature.
Overall Representations
Finally, the ID embeddings ($\mathbf{e}^{\mathrm{ID}}$) and text embeddings ($\mathbf{e}^{\mathrm{text}}$) are combined to form a comprehensive representation for users, items, queries, and consultations. This is achieved by concatenating them, passing the result through an FFN (to map it to a unified dimension), and applying an activation function.
$
\begin{array}{rl}
& \mathbf{e}_u = \mathrm{act}(\mathrm{FFN}_{\mathrm{u}}(\mathrm{concat}(\mathbf{e}_u^{\mathrm{ID}}, \mathbf{e}_u^{\mathrm{text}}))), \\
& \mathbf{e}_v = \mathrm{act}(\mathrm{FFN}_{\mathrm{v}}(\mathrm{concat}(\mathbf{e}_v^{\mathrm{ID}}, \mathbf{e}_v^{\mathrm{text}}))), \\
& \mathbf{e}_s = \mathrm{act}(\mathrm{FFN}_{\mathrm{s}}(\mathbf{e}_s^{\mathrm{text}})), \\
& \mathbf{e}_c = \mathrm{act}(\mathrm{FFN}_{\mathrm{c}}(\mathbf{e}_c^{\mathrm{text}})).
\end{array}
$
Where:
- $\mathbf{e}_u, \mathbf{e}_v, \mathbf{e}_s, \mathbf{e}_c$: The overall embeddings for user $u$, item $v$, query $s$, and consultation $c$, respectively.
- $\mathrm{concat}$: Concatenation operation.
- $\mathrm{FFN}_{\mathrm{u}}, \mathrm{FFN}_{\mathrm{v}}, \mathrm{FFN}_{\mathrm{s}}, \mathrm{FFN}_{\mathrm{c}}$: Feed-Forward Networks specific to each entity type, projecting the concatenated features into a unified embedding dimension.
- $\mathrm{act}$: An activation function (e.g., tanh, ReLU, GELU) applied after the FFN.

A small code sketch of this fusion step follows.
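The sketch below assumes per-field ID embedding tables, concatenation with the pooled text embedding, and an FFN with tanh as stated above; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class IDTextFusion(nn.Module):
    """Concatenate categorical ID embeddings with a pooled text embedding, then FFN + tanh."""
    def __init__(self, id_vocab_sizes, id_dim, text_dim, out_dim):
        super().__init__()
        # One lookup table per categorical field (e.g., brand, category).
        self.id_tables = nn.ModuleList(nn.Embedding(v, id_dim) for v in id_vocab_sizes)
        in_dim = len(id_vocab_sizes) * id_dim + text_dim
        self.ffn = nn.Linear(in_dim, out_dim)

    def forward(self, id_values, text_emb):
        # id_values: (num_fields,) long tensor; text_emb: (text_dim,) from MoAE pooling.
        id_emb = torch.cat([tab(idx) for tab, idx in zip(self.id_tables, id_values)], dim=-1)
        return torch.tanh(self.ffn(torch.cat([id_emb, text_emb], dim=-1)))

# Toy item with two categorical fields (brand id 3, category id 7) and a 32-d text embedding.
fusion = IDTextFusion(id_vocab_sizes=[100, 50], id_dim=16, text_dim=32, out_dim=64)
print(fusion(torch.tensor([3, 7]), torch.randn(32)).shape)  # torch.Size([64])
```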
4.2.2. Mapping-Based General Alignment
This module aims to align tokens (words/phrases from text) with items in a unified semantic space, addressing the category-text gap. This "general alignment" allows the model to understand which features, IDs, and items correspond to various natural language texts.
- Full-Text Collection for Items: For each item $v$, all relevant textual data (e.g., related queries, consultations, item titles, descriptive texts, advertisement texts) across multiple scenarios are gathered to construct a comprehensive full-text collection $\mathcal{A}_v$. This collection aggregates all linguistic contexts in which an item appears.
- Keyword Filtering: To reduce noise and focus on critical terms, the full-text collection is refined. A threshold $t$ is applied to filter out noise texts (words $w$) that appear infrequently in search-related scenarios; only words with a frequency greater than $t$ are retained:
  $ \mathcal{A}_v^S = \mathrm{filter}^S(\mathcal{A}_v) = \{ w \in \mathcal{A}_v \mid \mathrm{freq}^S(w) > t \} $
  Where:
  - $\mathcal{A}_v^S$: The filtered keyword collection for item $v$.
  - $\mathrm{filter}^S$: The filtering function.
  - $\mathcal{A}_v$: The original full-text collection for item $v$.
  - $w$: A word or token.
  - $\mathrm{freq}^S(w)$: The frequency of word $w$ in search-related scenarios.
  - $t$: A predefined frequency threshold.
  This curated collection creates a mapping in which each retained token is associated with item $v$ through their shared presence or thematic relevance.
- Bidirectional Contrastive Loss ($\mathcal{L}_{\mathrm{GA}}$): To learn effective alignments between tokens and items, a bidirectional contrastive loss is employed. It pushes embeddings of related token-item pairs $(t, v)$ closer while separating them from negative samples (unrelated tokens or items); a small code sketch follows this list.
  $ \mathcal{L}_{\mathrm{GA}} = - \lambda_1 \sum_{(t,v)} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_v)/\tau_1)}{\sum_{t^- \in T_{\mathrm{neg}}} \exp(\mathrm{sim}(\mathbf{e}_{t^-}, \mathbf{e}_v)/\tau_1)} - \lambda_2 \sum_{(t,v)} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_v)/\tau_2)}{\sum_{v^- \in I_{\mathrm{neg}}} \exp(\mathrm{sim}(\mathbf{e}_t, \mathbf{e}_{v^-})/\tau_2)} $
  Where:
  - $\mathcal{L}_{\mathrm{GA}}$: The general alignment loss.
  - $\lambda_1, \lambda_2$: Learnable weights for the two terms, balancing the importance of token-to-item and item-to-token alignment.
  - $\sum_{(t,v)}$: Summation over all positive token-item pairs.
  - $\mathrm{sim}(\cdot,\cdot)$: The dot-product similarity function between two embeddings.
  - $\mathbf{e}_t, \mathbf{e}_v$: The embeddings of token $t$ and item $v$, respectively.
  - $\tau_1, \tau_2$: Temperature parameters that control the sharpness of the softmax distribution. Higher values make the distribution smoother, while lower values make it peak more sharply around the most similar example.
  - $T_{\mathrm{neg}}$: A set of randomly sampled negative tokens (unrelated to $v$); the first term ensures that $\mathbf{e}_v$ is more similar to $\mathbf{e}_t$ than to any negative token.
  - $I_{\mathrm{neg}}$: A set of randomly sampled negative items (unrelated to $t$); the second term ensures that $\mathbf{e}_t$ is more similar to $\mathbf{e}_v$ than to any negative item.
  This bidirectional contrastive loss keeps embeddings of positive token-item pairs close and pushes negative pairs apart, establishing a robust general alignment.
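The sketch below shows a simplified, in-batch-negative variant of the bidirectional contrastive objective above (the paper samples dedicated negative tokens and items; the λ and τ names mirror the formula, everything else is our own).

```python
import torch
import torch.nn.functional as F

def general_alignment_loss(token_emb, item_emb, tau1=0.1, tau2=0.1, lam1=1.0, lam2=1.0):
    """token_emb, item_emb: (B, d) embeddings of positive (token, item) pairs.
    Row i of each tensor is a positive pair; other rows in the batch act as negatives."""
    sim = token_emb @ item_emb.T                      # (B, B) dot-product similarities
    labels = torch.arange(sim.size(0))                # positives lie on the diagonal
    loss_t2v = F.cross_entropy(sim / tau1, labels)    # token -> item direction
    loss_v2t = F.cross_entropy(sim.T / tau2, labels)  # item -> token direction
    return lam1 * loss_t2v + lam2 * loss_v2t

# Toy usage with 8 positive token-item pairs in a batch.
print(general_alignment_loss(torch.randn(8, 64), torch.randn(8, 64)))
```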
4.2.3. Sequence-Based Personalized Alignment
This module focuses on extracting search motivations from a user's consultation and search histories and aligning them with the current query to enhance personalized search.
Motivation-Aware Query Embedding
The core idea is to enrich the current query embedding by incorporating historical context.
- Consultation History Integration: The current query embedding acts as an anchor. It is concatenated with the user's consultation history (a sequence of consultation embeddings), and the combined sequence is fed into a Transformer encoder. The encoder, with its multi-head bidirectional attention and FFN layers, extracts a motivation embedding relevant to the current query from the consultation history. The first vector of the encoder's output, corresponding to the anchored current query, is taken as the motivation-aware embedding from consultations:
  $ \mathbf{e}_{s_{N+1}}^{\mathcal{C}} = \mathrm{Encoder}_{\mathrm{c}}(\mathbf{e}_{s_{N+1}}, \mathbf{e}_{c_1}, \dots, \mathbf{e}_{c_M})[0,:] $
  Where:
  - $\mathbf{e}_{s_{N+1}}^{\mathcal{C}}$: The search motivation embedding derived from the consultation history for the current query $s_{N+1}$.
  - $\mathrm{Encoder}_{\mathrm{c}}$: A Transformer encoder specifically designed for processing consultation sequences.
  - $\mathbf{e}_{s_{N+1}}$: The embedding of the current query.
  - $\mathbf{e}_{c_1}, \dots, \mathbf{e}_{c_M}$: The embeddings of the consultation sessions in the user's history.
  - $[0,:]$: Selects the first vector from the Transformer encoder's output sequence; this vector summarizes the context, anchored by the initial input (here, the current query).
- Query History Integration: Similarly, the user's historical query sequence is processed together with the current query embedding through another Transformer encoder, capturing search motivation derived from past search behaviors:
  $ \mathbf{e}_{s_{N+1}}^{S} = \mathrm{Encoder}_{\mathrm{s}}(\mathbf{e}_{s_{N+1}}, \mathbf{e}_{s_1}, \dots, \mathbf{e}_{s_N})[0,:] $
  Where:
  - $\mathbf{e}_{s_{N+1}}^{S}$: The search motivation embedding derived from the query history for the current query $s_{N+1}$.
  - $\mathrm{Encoder}_{\mathrm{s}}$: A Transformer encoder for processing search query sequences.
  - $\mathbf{e}_{s_1}, \dots, \mathbf{e}_{s_N}$: The embeddings of the historical search queries.
- Combined Motivation-Aware Query Embedding: The three relevant embeddings (the consultation-derived motivation $\mathbf{e}_{s_{N+1}}^{\mathcal{C}}$, the query-history-derived motivation $\mathbf{e}_{s_{N+1}}^{S}$, and the current query embedding $\mathbf{e}_{s_{N+1}}$) are combined using learnable weights (a code sketch follows this list):
  $ \mathbf{e}_{s_{N+1}}' = \alpha_1 \mathbf{e}_{s_{N+1}}^{\mathcal{C}} + \alpha_2 \mathbf{e}_{s_{N+1}}^{S} + \alpha_3 \mathbf{e}_{s_{N+1}} $
  Where:
  - $\mathbf{e}_{s_{N+1}}'$: The combined motivation-aware query embedding.
  - $\alpha_1, \alpha_2, \alpha_3$: Learnable scalar weights that determine the contribution of each component to the final query embedding; they are learned during training.
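A simplified PyTorch sketch of the motivation-aware query embedding, assuming standard nn.TransformerEncoder modules for the two history encoders and a learnable three-way weight vector; this is our reading of the equations, not the authors' code.

```python
import torch
import torch.nn as nn

class MotivationAwareQuery(nn.Module):
    """Anchor the current query, encode it with consultation/query histories,
    and combine the two motivation embeddings with the raw query embedding."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=1,
        )
        self.consult_encoder = make_encoder()
        self.search_encoder = make_encoder()
        self.alpha = nn.Parameter(torch.ones(3) / 3)  # learnable combination weights

    def forward(self, e_query, consult_hist, query_hist):
        # e_query: (d,), consult_hist: (M, d), query_hist: (N, d)
        c_seq = torch.cat([e_query.unsqueeze(0), consult_hist]).unsqueeze(0)  # (1, M+1, d)
        s_seq = torch.cat([e_query.unsqueeze(0), query_hist]).unsqueeze(0)    # (1, N+1, d)
        e_c = self.consult_encoder(c_seq)[0, 0]   # first position = anchored query
        e_s = self.search_encoder(s_seq)[0, 0]
        a = self.alpha
        return a[0] * e_c + a[1] * e_s + a[2] * e_query

# Toy usage: 64-d embeddings, 5 past consultations, 7 past queries.
m = MotivationAwareQuery(dim=64)
print(m(torch.randn(64), torch.randn(5, 64), torch.randn(7, 64)).shape)  # torch.Size([64])
```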
Personalized Search with Item History
This final step refines the motivation-aware query embedding and integrates it with the user's past interacted items to form the ultimate personalized query representation for ranking.
- Final Query Embedding: The combined motivation-aware query embedding and the embeddings of the user's past interacted items are fed into a final Transformer encoder, which captures complex interactions between the current motivated query and the user's item interaction history. The first output vector of this encoder is then added element-wise to the user embedding to obtain the final personalized query embedding:
  $ \mathbf{e}_{s_{N+1}}'' = \mathrm{Encoder}_{\mathrm{final}}(\mathbf{e}_{s_{N+1}}', \mathbf{E}_{\mathrm{items}})[0,:] \oplus \mathbf{e}_u $
  Where:
  - $\mathbf{e}_{s_{N+1}}''$: The final, fully personalized and motivation-aware query embedding.
  - $\mathrm{Encoder}_{\mathrm{final}}$: The Transformer encoder that processes the motivation-aware query embedding and the historical item embeddings.
  - $\mathbf{E}_{\mathrm{items}}$: A matrix (sequence) of embeddings representing the items with which the user has previously interacted.
  - $\oplus$: In-place (element-wise) addition of the user embedding $\mathbf{e}_u$ to the first output vector of the Transformer encoder.
- Ranking Probability: For inference, candidate items are ranked by their similarity to this final personalized query embedding, computed with the dot product:
  $ p(v \mid s_{N+1}, H, u) = \mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_v) $
  Where:
  - $p(v \mid s_{N+1}, H, u)$: The relevance score of item $v$ given the current query $s_{N+1}$, user history $H$, and user $u$.
  - $\mathrm{sim}(\cdot,\cdot)$: The dot-product similarity function.
  - $\mathbf{e}_v$: The embedding of candidate item $v$.
- Personalized Alignment Loss ($\mathcal{L}_{\mathrm{PA}}$): The model is optimized to increase the relevance scores of ground-truth (actually interacted) items. This is formulated as a softmax cross-entropy loss over candidate items, maximizing the similarity between the personalized query embedding and the interacted item:
  $ \mathcal{L}_{\mathrm{PA}} = - \sum_{(u, v, s_{N+1})} \log \frac{\exp(\mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_v))}{\sum_{v' \in V_{N+1}} \exp(\mathrm{sim}(\mathbf{e}_{s_{N+1}}'', \mathbf{e}_{v'}))} $
  Where:
  - $\mathcal{L}_{\mathrm{PA}}$: The personalized alignment loss.
  - $\sum_{(u, v, s_{N+1})}$: Summation over training instances, each consisting of a user $u$, a ground-truth interacted item $v$, and a current query $s_{N+1}$.
  - $V_{N+1}$: The set of candidate items for ranking, which includes the ground-truth item and negative samples (unrelated items).
  - The numerator is the similarity score for the correct item $v$, and the denominator sums similarities over all candidate items $v'$, normalizing the score into a probability. Negative sampling is typically employed to make the denominator computation efficient.
- Overall Loss ($\mathcal{L}_{\mathrm{overall}}$): The total objective combines the personalized alignment loss and the general alignment loss, along with L2 regularization to prevent overfitting (a code sketch follows this list):
  $ \mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{PA}} + \lambda_3 \mathcal{L}_{\mathrm{GA}} + \lambda_4 \lVert \Theta \rVert_2 $
  Where:
  - $\mathcal{L}_{\mathrm{overall}}$: The total objective function to be minimized.
  - $\lambda_3$: A hyper-parameter balancing the contribution of the general alignment loss.
  - $\lambda_4$: A hyper-parameter controlling the strength of the L2 regularization.
  - $\lVert \Theta \rVert_2$: The L2 norm of all learnable parameters in the MAPS model, which penalizes large parameter values and helps prevent overfitting.
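A minimal sketch of the scoring and training objective described above, assuming sampled negatives for the personalized alignment loss and an explicit L2 term; in practice the regularization is often delegated to the optimizer's weight decay. The function names are ours.

```python
import torch
import torch.nn.functional as F

def personalized_alignment_loss(e_query_final, e_pos_item, e_neg_items):
    """Softmax cross-entropy over the ground-truth item and sampled negatives.
    e_query_final: (d,), e_pos_item: (d,), e_neg_items: (n_neg, d)."""
    candidates = torch.cat([e_pos_item.unsqueeze(0), e_neg_items])   # (1 + n_neg, d)
    scores = candidates @ e_query_final                              # dot-product similarities
    target = torch.zeros(1, dtype=torch.long)                        # index 0 = positive item
    return F.cross_entropy(scores.unsqueeze(0), target)

def overall_loss(loss_pa, loss_ga, params, lam3=0.5, lam4=1e-5):
    """L_overall = L_PA + lam3 * L_GA + lam4 * ||Theta||_2."""
    l2 = sum(p.pow(2).sum() for p in params).sqrt()
    return loss_pa + lam3 * loss_ga + lam4 * l2

# Toy usage: one query against 1 positive and 10 negative items.
loss_pa = personalized_alignment_loss(torch.randn(64), torch.randn(64), torch.randn(10, 64))
print(overall_loss(loss_pa, torch.tensor(0.7), [torch.randn(3, 3, requires_grad=True)]))
```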
5. Experimental Setup
5.1. Datasets
To evaluate the effectiveness of MAPS, experiments are conducted on two datasets: a commercial dataset and a synthetic Amazon dataset.
-
Commercial Dataset:
- Source: A real user interaction dataset from an internet e-commerce shopping platform that provides AI consulting services.
- Characteristics: Contains user interactions over 31 days.
- Preprocessing: Users and items with fewer than 5 interactions are filtered out. To prevent sequence data leakage (where future information inadvertently influences past predictions), the first 29 days are used for training, and the remaining two days are split for validation and testing, respectively. This split ensures that the model is evaluated on unseen, future data.
- Domain: E-commerce platform with AI consulting services.
-
Amazon Dataset:
- Source: Based on the widely used Amazon Reviews dataset (Ni et al., 2019), specifically the version processed by PersonalWAB (Cai et al., 2024).
- Characteristics: Includes user profiles and various types of user interaction data, such as searches and reviews.
- Consultation Simulation: To mimic the real-world e-commerce environment with AI consultation services (which the original Amazon dataset lacks), GPT-4o (a specific version of GPT-4) was utilized to generate synthetic user consultation texts. These texts are generated based on existing user profiles and interaction behaviors, aiming to simulate realistic consultation scenarios.
- Preprocessing: Processing and splitting are consistent with Shi et al. (2024).
- Domain: E-commerce (Amazon products and reviews).
The following are the statistics from Table 1 of the original paper:
| Dataset | #Users | #Items | #Inters | #Sparsity |
|---|---|---|---|---|
| Commercial | 2096 | 2691 | 24662 (18774) | 99.56% (99.66%) |
| Amazon | 967 | 35772 | 7263 (40567) | 99.98% (99.88%) |

Where:
- #Users: Number of unique users in the dataset.
- #Items: Number of unique items (products) in the dataset.
- #Inters: Total number of interactions. The numbers outside parentheses count search interactions, while the numbers in parentheses count consultation interactions.
- #Sparsity: A measure of how few interactions are observed relative to the total possible interactions (#Users × #Items). Higher sparsity means fewer observed interactions relative to the potential, making it harder to learn user preferences. The numbers outside parentheses give search interaction sparsity, while those in parentheses give consultation interaction sparsity (a worked example follows below).

The datasets were chosen because they provide real token text and multiple types of user interaction data, which are essential for validating MAPS's ability to model search motivations and leverage LLM knowledge. The commercial dataset offers a direct real-world application, while the Amazon dataset provides a widely used benchmark with simulated consultations to test the core ideas.
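As a quick sanity check on the reported sparsity, using the Commercial search interactions:

$ \mathrm{Sparsity} = 1 - \frac{\#\mathrm{Inters}}{\#\mathrm{Users} \times \#\mathrm{Items}} = 1 - \frac{24662}{2096 \times 2691} \approx 1 - 0.0044 = 99.56\% $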
The following figure (Figure 5 from the original paper) shows examples of consultations on the Amazon dataset.
This image is Figure 6 in the paper, showing consultation examples on the commercial platform. The left side presents a user's specific questions about electronics accessories together with the system's answers; the right side presents a Q&A about privacy-related item needs, illustrating the system's understanding of user motivation and its ability to make targeted recommendations.
This figure illustrates examples of consultations from the Amazon dataset. The left panel shows user inquiries about electronic accessories ("Need an adapter that can convert USB-C to HDMI..." and "Looking for durable, waterproof headphones for running..."). The right panel shows a discussion about privacy-related projects ("How can I ensure data privacy in my AI project..."). For each user question, a system response (presumably generated or simulated) is provided, demonstrating how consultations can reveal detailed user needs and motivations that would be hard to capture in a short search query.
5.2. Evaluation Metrics
The experiments employ standard metrics for both ranking and retrieval tasks.
Ranking Metrics
For evaluating ranking performance, two common metrics are used: Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k). These metrics are calculated by pairing the ground-truth item with 99 randomly sampled negative items as candidates.
- Hit Ratio (HR@k):
  - Conceptual Definition: Hit Ratio measures the proportion of queries for which the ground-truth (correct) item appears within the top $k$ ranked results. It is a simple, intuitive metric that indicates how often the desired item is found among the top recommendations.
  - Mathematical Formula:
    $ \mathrm{HR@k} = \frac{\text{Number of users for whom the ground-truth item is in the top } k}{\text{Total number of users}} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{ground-truth item for } u \text{ is in the top } k \text{ list}) $
  - Symbol Explanation:
    - $|U|$: The total number of users (or queries) being evaluated.
    - $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
    - $k$: The cut-off rank (e.g., 5, 10, 20, 50), meaning we check whether the item is within the top $k$ positions.
- Normalized Discounted Cumulative Gain (NDCG@k):
  - Conceptual Definition: NDCG evaluates the quality of a ranked list by considering both the relevance of items and their position in the list. It assigns higher scores to relevant items that appear earlier in the ranking. The "discounted" part means that relevant items further down the list contribute less to the overall score; "normalized" means the score is divided by that of an ideal ranking (all relevant items at the top), yielding a value between 0 and 1.
  - Mathematical Formula:
    First, Cumulative Gain: $ \mathrm{CG@k} = \sum_{i=1}^{k} \mathrm{rel}_i $
    Then, Discounted Cumulative Gain: $ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $
    Finally, Normalized Discounted Cumulative Gain: $ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $, where IDCG@k is the DCG of the ideal ranking: $ \mathrm{IDCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_{i,\text{ideal}}} - 1}{\log_2(i+1)} $
  - Symbol Explanation:
    - $k$: The cut-off rank.
    - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the ranked list. In a binary relevance setting (item is either relevant or not), $\mathrm{rel}_i$ is typically 1 if the item is the ground-truth and 0 otherwise.
    - $\log_2(i+1)$: The logarithmic discount factor.
    - $\mathrm{DCG@k}$: The Discounted Cumulative Gain at rank $k$.
    - $\mathrm{IDCG@k}$: The Ideal Discounted Cumulative Gain at rank $k$, obtained by sorting all relevant items by their relevance.
    - $\mathrm{rel}_{i,\text{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfectly sorted) ranked list.
    - The final NDCG@k for a system is typically the average NDCG@k across all queries.
Retrieval Metric
For evaluating retrieval performance, Mean Reciprocal Rank (MRR@k) is used. For retrieval tasks, all items are considered as candidates.
- Mean Reciprocal Rank (MRR@k):
  - Conceptual Definition: MRR is a statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. It measures the average of the reciprocal ranks of the first relevant item across a set of queries. If the first relevant item is found at rank $\mathrm{rank}_q$, its reciprocal rank is $1/\mathrm{rank}_q$; if no relevant item is found within the top $k$ results, the reciprocal rank is 0.
  - Mathematical Formula:
    $ \mathrm{MRR@k} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{\mathbb{I}(\mathrm{rank}_q \le k)}{\mathrm{rank}_q} $
  - Symbol Explanation:
    - $|Q|$: The total number of queries.
    - $\mathrm{rank}_q$: The rank position of the first relevant item for query $q$.
    - $k$: The cut-off rank, beyond which relevant items are not counted.
    - $1/\mathrm{rank}_q$: The reciprocal rank for query $q$.
  A short code sketch covering HR@k, NDCG@k, and MRR@k follows below.
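For concreteness, here is a small NumPy sketch of HR@k, NDCG@k (binary relevance), and MRR@k computed from the 1-based rank of the ground-truth item for each query. This is our own utility code, assuming a single relevant item per query as in this evaluation setup.

```python
import numpy as np

def hr_at_k(ranks, k):
    """ranks: 1-based rank of the ground-truth item for each query."""
    return np.mean(np.asarray(ranks) <= k)

def ndcg_at_k(ranks, k):
    """With binary relevance and a single ground-truth item, IDCG@k = 1, so
    NDCG@k reduces to 1/log2(rank + 1) if the item is in the top k, else 0."""
    ranks = np.asarray(ranks, dtype=float)
    gains = np.where(ranks <= k, 1.0 / np.log2(ranks + 1), 0.0)
    return gains.mean()

def mrr_at_k(ranks, k):
    ranks = np.asarray(ranks, dtype=float)
    rr = np.where(ranks <= k, 1.0 / ranks, 0.0)
    return rr.mean()

# Toy example: ground-truth items ranked 1st, 3rd, 12th, and 55th for four queries.
ranks = [1, 3, 12, 55]
print(hr_at_k(ranks, 10), ndcg_at_k(ranks, 10), mrr_at_k(ranks, 10))
```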
5.3. Baselines
The MAPS method is compared against a comprehensive set of baselines across different categories:
-
Personalized Search Baselines: These models primarily focus on integrating user information and search history for personalized ranking.
- AEM (Ai et al., 2019a): Attention-Embedding Model; combines the user's past interacted items with the current query using attention.
- QEM (Ai et al., 2019a): Query-Embedding Model; focuses on matching scores between items and queries only.
- HEM (Ai et al., 2017): Hierarchical Embedding Model; a personalized model based on latent vectors.
- ZAM (Ai et al., 2019a): Zero Attention Model; an extension of AEM with a zero vector for item lists.
- TEM (Bi et al., 2020): Transformer-based Embedding Model; improves AEM by using a Transformer encoder.
- CoPPS (Dai et al., 2023): Contrastive learning for user sequence representation in Personalized Product Search; leverages contrastive learning.
-
Multi-Scenario Baselines (Integrating Search & Recommendation): These models aim to combine insights from both search and recommendation interactions.
- SESRec (Si et al., 2023): Uses contrastive learning to learn disentangled search representations for recommendation.
- UnifiedSSR (Xie et al., 2023): Unified Sequential Search and Recommendation; jointly learns user behavior history across both scenarios.
- UniSAR (Shi et al., 2024): Unified Search and Recommendation; models fine-grained behavior transitions using Transformers and cross-attention.
-
Retrieval Baselines: These models are specifically designed for efficient item retrieval.
-
BM25 (Robertson et al., 2009): A traditional term-frequency based statistical retrieval algorithm.
BGE-M3(Chen et al., 2024): A moderndense retrievalalgorithm usingembeddingsformulti-lingual,multi-functionality,multi-granularitytext. -
CHIQ(Mo et al., 2024):Contextual History Enhancement for Improving Query Rewriting in Conversational Search, a conversational retrieval method incorporatingworld knowledgefromLLMs.These baselines are representative as they cover a wide spectrum of approaches, from traditional statistical methods to modern deep learning models, including those focused on personalization, multi-scenario learning, and conversational context, allowing for a comprehensive evaluation of
MAPS's innovations.
-
5.4. Implementation details
- Embedding Dimensions: The latent embedding dimension is set to 64. The unified dimension for LLM token embeddings (after FFN mapping) is set to 32.
- User History Length: The maximum length of the user history sequence (for both consultations and queries) is set to 30.
- User Filtering: Users with fewer than 5 interactions are filtered out, a common practice to ensure sufficient data for learning user preferences.
- Activation Function: tanh is used as the default activation function in the FFN layers (act in Eq. 1).
- Transformer Layers: The number of layers in the Transformer encoder modules is initially set to 1.
- Batch Size: The batch size for training is 72. For the general alignment loss, the in-batch negative sampling strategy is adopted, and its batch size is searched among {128, 256, 512, 1024}.
- Negative Samples: For the personalized alignment loss, 10 negative samples are used for each positive sample.
- Hyperparameter Tuning: The weights of the general alignment loss and the weight balancing the general and personalized alignment losses in the overall loss are tuned over predefined grids; the temperature parameters for contrastive learning are tuned in the interval [0.0, 1.0] with a step size of 0.1.
- Training: All models are trained for 100 epochs. Early stopping is employed to prevent overfitting, meaning training stops if performance on the validation set does not improve for a certain number of epochs.
- Optimizer: Adam (Kingma and Ba, 2014) is used for optimization.
- Learning Rate: The learning rate is adjusted among a set of candidate values.
- Hardware: All experiments were conducted on an A800 GPU (80GB).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the superior performance of MAPS across both ranking and retrieval tasks on both datasets.
Ranking Performance
The following are the results from the Table labeled "Table:Search ranking performance compared with personalized search baselines" of the original paper:
| Model | HR@5 | HR@10 | HR@20 | HR@50 | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50 |
|---|---|---|---|---|---|---|---|---|
| Commercial |||||||||
| AEM | 0.3886 | 0.5376 | 0.6733 | 0.8249 | 0.2656 | 0.3135 | 0.3478 | 0.3781 |
| QEM | 0.3996 | 0.5473 | 0.6733 | 0.8439 | 0.2671 | 0.3144 | 0.3463 | 0.3805 |
| HEM | 0.3484 | 0.4907 | 0.6366 | 0.8037 | 0.2360 | 0.2817 | 0.3185 | 0.3519 |
| ZAM | 0.3674 | 0.5248 | 0.6808 | 0.8205 | 0.2490 | 0.2994 | 0.3389 | 0.3669 |
| TEM | 0.4041 | 0.5685 | 0.7078 | 0.8528 | 0.2871 | 0.3402 | 0.3756 | 0.4049 |
| CoPPS | 0.4050 | 0.5637 | 0.7171 | 0.8660 | 0.2831 | 0.3445 | 0.3805 | 0.4103 |
| MAPS | 0.5281† | 0.7071† | 0.8330† | 0.9308† | 0.3780† | 0.4359† | 0.4680† | 0.4877† |
| Amazon | ||||||||
| AEM | 0.3180 | 0.4550 | 0.5372 | 0.7239 | 0.1860 | 0.2132 | 0.2475 | 0.2768 |
| QEM | 0.2831 | 0.3888 | 0.5285 | 0.7663 | 0.1914 | 0.1805 | 0.2277 | 0.2913 |
| HEM | 0.2735 | 0.4198 | 0.5400 | 0.7446 | 0.1983 | 0.2172 | 0.2598 | 0.2961 |
| ZAM | 0.3103 | 0.4488 | 0.5429 | 0.7301 | 0.1833 | 0.2114 | 0.2494 | 0.2787 |
| TEM | 0.4026 | 0.4814 | 0.7197 | 0.7301 | 0.2968 | 0.3124 | 0.3415 | 0.3535 |
| CoPPS | 0.3870 | 0.4854 | 0.7286 | 0.8004 | 0.2788 | 0.3298 | 0.3439 | 0.3699 |
| MAPS | 0.5832† | 0.7735† | 0.8987† | 0.9741† | 0.4059† | 0.4676† | 0.4995† | 0.5147† |
- Comparison with Personalized Search Baselines: MAPS consistently and significantly outperforms all other personalized product search methods across all HR@k and NDCG@k metrics on both the Commercial and Amazon datasets.
  - On the Commercial dataset, MAPS achieves an HR@10 of 0.7071 and NDCG@10 of 0.4359, representing an approximate 20% improvement over the best baselines (e.g., CoPPS with HR@10 of 0.5637, TEM with NDCG@10 of 0.3402).
  - On the Amazon dataset, the performance gains are even more substantial, with MAPS reaching HR@10 of 0.7735 and NDCG@10 of 0.4676, marking an approximately 35% improvement over the baselines (CoPPS with HR@10 of 0.4854, TEM with NDCG@10 of 0.3124).
  - The '†' symbol indicates that these improvements are statistically significant under paired t-tests, confirming that the enhanced semantic understanding and motivation integration in MAPS provide a robust advantage.
Retrieval Performance
The following are the results from Table 3 of the original paper:
| Method | MRR@10 | MRR@20 | MRR@50 |
|---|---|---|---|
| BM25 | 0.2529 | 0.2577 | 0.2625 |
| AEM | 0.2445 | 0.2539 | 0.2588 |
| QEM | 0.2427 | 0.2516 | 0.2572 |
| HEM | 0.2176 | 0.2277 | 0.2331 |
| ZAM | 0.2304 | 0.2413 | 0.2459 |
| TEM | 0.2705 | 0.2803 | 0.2852 |
| CoPPS | 0.2642 | 0.2750 | 0.2799 |
| BGE-M3 | 0.2976 | 0.3110 | 0.3168 |
| CHIQ | 0.3192 | 0.3392 | 0.3412 |
| MAPS | 0.3805 | 0.3889 | 0.3922 |
- Comparison with Retrieval Baselines: MAPS also significantly outperforms traditional, dense, and conversational retrieval methods on the Commercial dataset.
  - MAPS achieves an MRR@10 of 0.3805, substantially higher than the best baseline (CHIQ at 0.3192), an improvement of over 15%.
  - Even powerful dense retrieval models like BGE-M3 and conversational retrieval models like CHIQ are surpassed, indicating that MAPS's ability to incorporate consultation-derived motivation is highly effective for retrieval as well, not just ranking. This suggests that MAPS retrieves more relevant items at higher ranks.
Multi-Scenario Ranking Performance
The following are the results from Table 4 of the original paper:
| Method | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| SESRec | 0.5622 | 0.7191 | 0.3465 | 0.3797 |
| UnifiedSSR | 0.5706 | 0.7074 | 0.3590 | 0.3743 |
| UniSAR | 0.5838 | 0.7294 | 0.3577 | 0.3894 |
| MAPS | 0.7071 | 0.8330 | 0.4359 | 0.4680 |
- Comparison with Multi-Scenario Baselines: MAPS maintains its lead when compared to models that integrate both search and recommendation interactions.
  - On the Commercial dataset, MAPS's HR@10 (0.7071) and NDCG@10 (0.4359) are significantly higher than those of UniSAR (0.5838 and 0.3577, respectively), the strongest baseline in this category.
  - This further validates that the specific approach of leveraging consultation motivations provides a distinct advantage, even over methods that attempt a broader understanding of user behavior across different interaction types.

Overall Conclusion: The consistent and substantial performance gains across metrics, tasks, and datasets confirm the effectiveness and superiority of MAPS. Explicitly modeling and integrating search motivation derived from user consultations via LLM-driven alignment is a highly impactful strategy for enhancing personalized search on e-commerce platforms.
6.2. Ablation Studies / Parameter Analysis
Ablation Study of MAPS Modules
To understand the contribution of each component within MAPS, an ablation study was performed on the Commercial dataset. "w/o" denotes removing a component.
The following are the results from Table 5 of the original paper:
| Ablation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| MAPS | 0.7071 | 0.8330 | 0.4359 | 0.4680 |
| w/o LLM | 0.6527 | 0.7839 | 0.3968 | 0.4309 |
| w/o MoAE | 0.6781 | 0.7844 | 0.4096 | 0.4494 |
| w/o general align | 0.6198 | 0.7424 | 0.3669 | 0.4006 |
| w/o filter in Eq. 2 | 0.6201 | 0.7426 | 0.3597 | 0.3951 |
| w/o personal align | 0.6334 | 0.7518 | 0.3732 | 0.4105 |
| w/o e_c | 0.6565 | 0.7730 | 0.3863 | 0.4246 |
| w/o e_s | 0.6448 | 0.7615 | 0.3803 | 0.4170 |
Analysis:
- Overall Impact: Removing any module from MAPS consistently leads to a significant drop in performance across all HR and NDCG metrics, highlighting that all proposed components are crucial for the model's effectiveness.
- Impact of LLM and MoAE: Removing the LLM (w/o LLM) causes a notable drop (e.g., HR@10 from 0.7071 to 0.6527), indicating the importance of the LLM's world knowledge and semantic understanding. Similarly, removing MoAE (w/o MoAE) also degrades performance (e.g., HR@10 to 0.6781), confirming that adaptively prioritizing critical semantics through attention experts is beneficial.
- Critical Role of General Alignment: The most pronounced performance drop occurs when the general alignment module is removed (w/o general align, HR@10 to 0.6198; NDCG@10 to 0.3669). This is attributed to the crucial role of general alignment in aligning text with IDs and integrating domain-specific knowledge with the LLM's general world knowledge. Without it, the model struggles with the semantic mismatch between general LLM representations and precise domain contexts. For instance, "Cool" might be interpreted as an adjective by an LLM without domain alignment, rather than as a product feature, causing significant semantic shift.
- Effect of Filtering: Removing the keyword filtering in general alignment (w/o filter in Eq. 2) also leads to a substantial decrease (HR@10 to 0.6201), indicating that filtering noise texts is vital for effective alignment.
- Impact of Personalized Alignment: Removing the entire personalized alignment module (w/o personal align) also causes a significant performance reduction (HR@10 to 0.6334; NDCG@10 to 0.3732). This confirms the importance of incorporating motivation from consultation and query histories.
- Individual Contributions of Consultation and Query History: Separately ablating the consultation history (w/o e_c) or the query history (w/o e_s) also degrades performance (HR@10 drops to 0.6565 and 0.6448, respectively), demonstrating that both sources contribute positively to the motivation-aware query embedding. (A hedged sketch of how such an embedding could combine e_c and e_s appears below.)
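To make the ablated inputs e_c (consultation history) and e_s (query history) concrete, here is a hedged sketch of one plausible way a motivation-aware query embedding could fuse them with the current query via attention. The module name, dimensions, and fusion scheme are illustrative assumptions for this analysis, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MotivationAwareQuery(nn.Module):
    """Illustrative fusion of a query with consultation-history (e_c) and
    search-history (e_s) embeddings via attention over the histories."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, q, e_c, e_s):
        # q: (B, D); e_c, e_s: (B, L, D) sequences of history embeddings.
        history = torch.cat([e_c, e_s], dim=1)                 # (B, 2L, D)
        ctx, _ = self.attn(q.unsqueeze(1), history, history)   # (B, 1, D)
        return self.proj(torch.cat([q, ctx.squeeze(1)], dim=-1))

# Dropping e_c or e_s (the "w/o e_c" / "w/o e_s" rows) amounts to removing
# the corresponding slice from `history` before attention.
q, e_c, e_s = torch.randn(2, 64), torch.randn(2, 30, 64), torch.randn(2, 30, 64)
print(MotivationAwareQuery(64)(q, e_c, e_s).shape)  # torch.Size([2, 64])
```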
ID-Text Representation Fusion Analysis
This study investigates how fusing ID and LLM embeddings with MoAE pooling enhances personalization.
The following are the results from Table 6 of the original paper:
| Ablation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| MAPS-Default | 0.7071 | 0.8330 | 0.4359 | 0.4680 |
| MAPS-ID | 0.6870 | 0.7953 | 0.4226 | 0.4500 |
| MAPS-LLM | 0.6794 | 0.7896 | 0.4196 | 0.4427 |
| MAPS-Mean | 0.6950 | 0.8249 | 0.4337 | 0.4566 |
Analysis:
- Fusion of ID and LLM is Best: MAPS-Default (which fuses both categorical ID and LLM text embeddings) outperforms models using only ID embeddings (MAPS-ID) or only LLM embeddings (MAPS-LLM). This confirms that a holistic representation combining structural ID information with rich textual semantics is superior for representing users and items: ID embeddings capture explicit categorical properties, LLM embeddings capture nuanced textual semantics, and their fusion provides a more complete picture.
- MoAE vs. Simple Mean Pooling: MAPS-Default (with MoAE) also outperforms MAPS-Mean (which uses simple mean pooling for text representation). This indicates that MoAE, by adaptively selecting and weighting different attention experts, is more effective at capturing critical semantic information and aligning it with the search task than a simple average of token embeddings. Its dynamic nature lets it handle varying textual content and focus on the most relevant parts. (A simplified sketch of such a pooler is given below.)
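As a rough illustration of the MoAE-versus-mean-pooling comparison, the sketch below implements a simplified mixture-of-attention-experts pooler over token embeddings; it follows the general gating-plus-experts idea rather than the paper's exact MoAE architecture.

```python
import torch
import torch.nn as nn

class MoAEPooling(nn.Module):
    """Pools token embeddings with several attention 'experts', each producing
    its own weighted summary, then mixes the summaries with a learned gate."""

    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, tokens):                         # tokens: (B, L, D)
        summaries = []
        for scorer in self.experts:
            w = torch.softmax(scorer(tokens), dim=1)   # (B, L, 1) per-expert attention
            summaries.append((w * tokens).sum(dim=1))  # (B, D)
        summaries = torch.stack(summaries, dim=1)      # (B, E, D)
        g = torch.softmax(self.gate(tokens.mean(dim=1)), dim=-1)  # (B, E)
        return (g.unsqueeze(-1) * summaries).sum(dim=1)           # (B, D)

tokens = torch.randn(2, 16, 64)
print(MoAEPooling(64)(tokens).shape)   # torch.Size([2, 64])
print(tokens.mean(dim=1).shape)        # mean-pooling baseline (MAPS-Mean analogue)
```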
Scalability Study
The scalability study explores the impact of various configuration choices on MAPS performance on the Commercial dataset.
The following are the results from Table 7 of the original paper:
| Aspect | Config | N@5 | N@10 | N@20 |
|---|---|---|---|---|
| Sequence Length | 10 | 0.3674 | 0.4200 | 0.4481 |
| Sequence Length | 30 | 0.3780 | 0.4359 | 0.4680 |
| Sequence Length | 40 | 0.3739 | 0.4303 | 0.4627 |
| LLM Scale | Qwen2.5-0.5B | 0.3394 | 0.3892 | 0.4237 |
| LLM Scale | Qwen2.5-1.5B | 0.3534 | 0.4026 | 0.4357 |
| LLM Scale | Qwen2-7B | 0.3593 | 0.4090 | 0.4412 |
| LLM Scale | Qwen2.5-7B | 0.3780 | 0.4359 | 0.4680 |
| Transformer Scale | 1 Layer | 0.3780 | 0.4359 | 0.4680 |
| Transformer Scale | 2 Layers | 0.3881 | 0.4470 | 0.4724 |
| Transformer Scale | 4 Layers | 0.3909 | 0.4561 | 0.4838 |
Analysis:
- Sequence Length: The optimal sequence length for user history is 30. A shorter sequence (10) results in lower performance, likely due to insufficient historical context, while increasing the length beyond 30 (to 40) leads to a slight drop, suggesting that excessively long sequences introduce noise or irrelevant interactions and diminish the signal-to-noise ratio.
- LLM Scale: Performance generally improves with the scale and capability of the underlying LLM. Larger, presumably more powerful LLMs (from Qwen2.5-0.5B to Qwen2.5-7B) consistently yield better NDCG scores, underscoring the importance of the world knowledge and NLU capabilities LLMs bring to understanding complex consultation and query texts.
- Transformer Layers: Increasing the number of Transformer layers generally improves ranking. Going from 1 to 2 and then 4 layers shows a steady increase in NDCG, implying that deeper Transformer stacks are more effective at capturing complex interactions and aligning LLM embeddings with the specific domain and the personalized search task. (A minimal sketch of varying the encoder depth is given below.)
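Structurally, the Transformer-depth experiment amounts to changing the number of encoder layers. Here is a minimal sketch using PyTorch's built-in encoder; the actual implementation details of MAPS's alignment encoder are an assumption.

```python
import torch
import torch.nn as nn

def build_alignment_encoder(dim: int = 64, n_layers: int = 2) -> nn.TransformerEncoder:
    """Stack of bidirectional self-attention layers; deeper stacks correspond to
    the 2-layer / 4-layer rows in the scalability table."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

seq = torch.randn(2, 30, 64)                 # batch of length-30 history sequences
for depth in (1, 2, 4):
    out = build_alignment_encoder(n_layers=depth)(seq)
    print(depth, out.shape)                  # each depth yields torch.Size([2, 30, 64])
```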
Configuration Analysis
Mapping Threshold
The following figure (Figure 4 from the original paper) shows the ranking performance on Amazon for different values of the mapping threshold in Eq. 2.

Analysis:
- The figure shows the HR@k and NDCG@k performance on the Amazon dataset for different values of the mapping threshold (from Eq. 2), which is used to filter keywords in the general alignment module. The default threshold is 2.
- The results indicate that MAPS achieves optimal performance at the default threshold of 2.
- Too Low Threshold: A threshold that is too small places fewer restrictions on keywords, which introduces noise from texts in other scenarios that are not truly relevant to the item and degrades performance.
- Too High Threshold: An excessively high threshold imposes stringent conditions and filters out too much data. This limits the amount of useful semantic information available for general alignment, constraining the model's ability to learn robust token-item mappings and consequently reducing performance.
- This analysis highlights the necessity of an appropriately tuned threshold to balance capturing sufficient relevant data against filtering out irrelevant noise. (A hedged sketch of this kind of frequency-based keyword filter follows below.)
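The sketch below illustrates the kind of frequency-threshold keyword filtering described above, assuming the threshold counts how many of an item's associated texts contain a token; the precise criterion in Eq. 2 of the paper may differ.

```python
from collections import Counter

def filter_keywords(item_texts: list[str], threshold: int = 2) -> set[str]:
    """Keep only tokens that appear in at least `threshold` of the item's
    associated texts (consultations, reviews, features); rarer tokens are
    treated as noise and excluded from token-item alignment."""
    counts = Counter()
    for text in item_texts:
        counts.update(set(text.lower().split()))   # count each token once per text
    return {tok for tok, c in counts.items() if c >= threshold}

texts = ["cool breathable running shoes",
         "shoes feel cool and light",
         "great for running"]
print(filter_keywords(texts, threshold=2))   # {'cool', 'shoes', 'running'}
```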
Activation Function
The following are the results from Table 8 of the original paper:
| Activation | HR@10 | HR@20 | N@10 | N@20 |
|---|---|---|---|---|
| tanh | 0.7585 | 0.8787 | 0.4676 | 0.4995 |
| SiLU | 0.7823 | 0.8953 | 0.4697 | 0.5010 |
| PReLU | 0.7813 | 0.9067 | 0.4763 | 0.5097 |
| GELU | 0.7978 | 0.9036 | 0.4734 | 0.5015 |
| ReLU | 0.4390 | 0.6740 | 0.2165 | 0.2768 |
Analysis:
- The default activation function used in MAPS (in Eq. 1 for overall representations) is tanh.
- ReLU performs significantly worse than all other activation functions (e.g., HR@10 of 0.4390 vs. tanh's 0.7585). This is attributed to the "dying ReLU" problem (Lu et al., 2019), where neurons become inactive during training if their inputs consistently fall into the negative range, yielding zero gradients and preventing further weight updates.
- Other activation functions like SiLU, PReLU, and GELU actually outperform tanh. GELU achieves the highest HR@10 (0.7978) and PReLU the highest NDCG@10 (0.4763), suggesting they may be more suitable choices for MAPS. The choice of activation function thus has a significant impact: GELU (or PReLU) provides better non-linearity and gradient flow for this task than tanh. (A minimal drop-in comparison sketch is shown below.)
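Because the activation is a drop-in choice inside the fusion layer, the comparison in Table 8 can be illustrated with a small sketch; the surrounding linear layer here is illustrative rather than the paper's exact Eq. 1.

```python
import torch
import torch.nn as nn

ACTIVATIONS = {"tanh": nn.Tanh(), "SiLU": nn.SiLU(), "PReLU": nn.PReLU(),
               "GELU": nn.GELU(), "ReLU": nn.ReLU()}

def fusion_layer(dim: int, activation: str) -> nn.Sequential:
    """Linear projection followed by the chosen non-linearity; ReLU zeroes all
    negative pre-activations, which is what makes 'dying ReLU' possible."""
    return nn.Sequential(nn.Linear(dim, dim), ACTIVATIONS[activation])

x = torch.randn(4, 64)
for name in ACTIVATIONS:
    print(name, fusion_layer(64, name)(x).shape)  # each yields torch.Size([4, 64])
```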
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Motivation-Aware Personalized Search (MAPS), a novel method that significantly enhances personalized product search by explicitly leveraging user consultation histories. MAPS moves beyond traditional approaches that assume queries fully capture user intent, recognizing that consultations reveal crucial underlying motivations. The model employs Large Language Models (LLMs) to embed natural language queries and consultations into a unified semantic space, providing deep Natural Language Understanding (NLU) and world knowledge. A Mixture of Attention Experts (MoAE) adaptively prioritizes critical semantic information within these texts. Furthermore, MAPS features a dual alignment strategy: (1) Mapping-based General Alignment uses contrastive learning to align tokens from consultations, reviews, and product features with item IDs, bridging the category-text gap and grounding LLM representations in the domain; (2) Sequence-based Personalized Alignment utilizes bidirectional attention within Transformer encoders to integrate motivation-aware embeddings derived from both consultation and search histories with individual user preferences. Extensive experiments on a real-world commercial dataset and a synthetic Amazon dataset demonstrate that MAPS consistently and significantly outperforms existing retrieval and ranking methods, providing a more accurate and context-aware solution.
7.2. Limitations & Future Work
The authors acknowledge several limitations of MAPS and suggest future research directions:
- Computational Efficiency and Scalability: While MAPS improves semantic understanding, the paper notes that it may not fully address computational efficiency or scalability challenges, especially in real-time applications. The use of LLMs and complex attention mechanisms can be computationally intensive.
- Dynamic User Behavior: The current framework primarily integrates past consultations and search histories but may not explicitly account for dynamic user behavior and preferences that evolve over longer periods. User intents and motivations can shift, requiring models to adapt.
- Domain-Specific Knowledge Integration: MAPS leverages the LLM's world knowledge and general alignment for domain-specific grounding, but it does not incorporate explicit, pre-curated domain-specific knowledge (e.g., ontologies, expert rules) that could further enhance understanding in highly specialized contexts. This limits its generalizability across diverse industries without additional domain adaptation.

Future work could focus on:

- Optimizing Real-time Adaptability: Developing strategies to make MAPS more efficient for real-time personalization, potentially through model distillation or more optimized inference techniques.
- Addressing Scalability Issues: Investigating methods to handle larger volumes of consultation data and user histories more effectively, perhaps via more efficient Transformer architectures or approximation techniques.
- Integrating External Domain Knowledge: Incorporating explicit domain-specific knowledge bases or knowledge graphs to further refine semantic understanding and alignment in specialized domains, making the system more robust and versatile across industries.
- Further Consultation Modeling: The authors also express interest in exploring further nuances of consultation modeling within e-commerce platforms.
7.3. Personal Insights & Critique
MAPS introduces a compelling and intuitive idea: that users don't just search, they often consult first to clarify their needs. This pre-search consultation data is a goldmine of explicit motivation that has been largely overlooked. The paper's strength lies in its rigorous approach to capturing this motivation, utilizing state-of-the-art LLMs for NLU and designing a multi-layered alignment mechanism to integrate this rich information. The Mixture of Attention Experts (MoAE) is a clever way to dynamically adapt to varying textual contexts, and the dual general and personalized alignment modules effectively tackle the ID-text gap and historical noise. The significant performance gains strongly validate the hypothesis that search motivation is a critical enhancing factor.
One potential area for critique is the reliance on GPT-4o for generating consultation texts on the Amazon dataset. While necessary given the lack of real consultation data, synthetic data, however good, might not fully capture the nuanced linguistic patterns, ambiguities, and emotional content of genuine user consultations. This could lead to a slight overestimation of performance on real-world consultation data where users express themselves differently. Further validation with more diverse real consultation datasets would strengthen the findings.
The computational cost of employing LLMs and complex Transformer architectures (especially with multiple encoders and MoAE) could be substantial, as acknowledged by the authors in limitations. For real-time search systems handling millions of queries per second, this could pose a practical challenge. Future work in model compression, distillation, or more efficient inference strategies would be crucial for industrial deployment.
The methods in MAPS could be highly transferable to other domains beyond e-commerce. Any system where users interact in natural language to express complex needs before formulating a concise query could benefit. Examples include:
- Healthcare: Patients describing symptoms to a chatbot before a doctor's visit, leading to more personalized medical information retrieval.
- Customer Support: Users explaining complex issues to a virtual assistant before searching for solutions in a knowledge base.
- Legal Research: Lawyers outlining a case to an AI assistant before searching for relevant precedents.

The paper makes a strong case for shifting focus from purely observed behavior to understanding underlying motivations, demonstrating how LLMs can enable this deeper level of personalized understanding. It sets a new direction for personalized information retrieval that goes beyond superficial interaction patterns.