Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction
TL;DR Summary
This work proposes Decoupled Multimodal Fusion (DMF) to enable fine-grained interaction between multimodal and ID embeddings using target-aware features, improving user interest modeling in CTR prediction with inference-optimized attention and validated performance gains in industrial deployment.
Abstract
Modern industrial recommendation systems improve recommendation performance by integrating multimodal representations from pre-trained models into ID-based Click-Through Rate (CTR) prediction frameworks. However, existing approaches typically adopt modality-centric modeling strategies that process ID-based and multimodal embeddings independently, failing to capture fine-grained interactions between content semantics and behavioral signals. In this paper, we propose Decoupled Multimodal Fusion (DMF), which introduces a modality-enriched modeling strategy to enable fine-grained interactions between ID-based collaborative representations and multimodal representations for user interest modeling. Specifically, we construct target-aware features to bridge the semantic gap across different embedding spaces and leverage them as side information to enhance the effectiveness of user interest modeling. Furthermore, we design an inference-optimized attention mechanism that decouples the computation of target-aware features and ID-based embeddings before the attention layer, thereby alleviating the computational bottleneck introduced by incorporating target-aware features. To achieve comprehensive multimodal integration, DMF combines user interest representations learned under the modality-centric and modality-enriched modeling strategies. Offline experiments on public and industrial datasets demonstrate the effectiveness of DMF. Moreover, DMF has been deployed on the product recommendation system of the international e-commerce platform Lazada, achieving relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.
In-depth Reading
1. Bibliographic Information
- Title: Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction
- Authors: Alin Fan, Hanqing Li, Sihan Lu, Jingsong Yuan, Jiandong Zhang
- Affiliations: The authors are from Alibaba International Digital Commerce Group and Renmin University of China. This affiliation suggests a strong connection between academic research and industrial application, which is reflected in the paper's focus on both performance and deployment efficiency.
- Journal/Conference: The paper provides a placeholder for the conference (Conference acronym 'XX), indicating it is likely a preprint submitted for publication.
- Publication Year: The arXiv identifier (2510.11066) corresponds to an October 2025 submission, consistent with a very recent preprint.
- Abstract: The paper addresses the challenge of integrating multimodal information (like images and text) into traditional ID-based Click-Through Rate (CTR) prediction models. Current methods often model multimodal and ID-based features separately (modality-centric), missing fine-grained interactions. The authors propose Decoupled Multimodal Fusion (DMF), a framework that introduces a modality-enriched strategy to capture these interactions. The core of DMF is a novel Decoupled Target Attention (DTA) mechanism, which efficiently incorporates target-aware multimodal features without the high computational cost typically associated with such features during online inference. DMF combines the modality-centric and modality-enriched approaches for a comprehensive user interest representation. Experiments on public and industrial datasets show significant improvements, and the model's successful deployment on the Lazada e-commerce platform resulted in a 5.30% increase in CTCVR and a 7.43% increase in GMV.
- Original Source Link:
  - Original Source: https://arxiv.org/abs/2510.11066
  - PDF: https://arxiv.org/pdf/2510.11066v1.pdf
- Publication Status: The paper is presented as a preprint, not yet formally published in a peer-reviewed venue.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: In industrial recommender systems, CTR prediction models primarily use item IDs to represent user interaction history. While effective for capturing collaborative patterns, these ID-based features lack rich semantic information. Integrating multimodal features (from images and text) can enrich item representations, but this fusion is challenging.
- Gaps in Prior Work: Existing methods typically follow a modality-centric approach, where ID-based embeddings and multimodal embeddings are processed in separate streams and combined late in the process. This prevents the model from capturing fine-grained interactions between a user's collaborative behavior (from IDs) and the content semantics (from multimodal features). Attempting a more fine-grained early fusion by treating multimodal similarity as target-aware side information leads to a massive computational bottleneck during online inference, as calculations must be repeated for every candidate item.
- Fresh Angle: The paper introduces a "best of both worlds" solution. It proposes a modality-enriched modeling strategy that enables fine-grained interaction but designs it in a computationally efficient way. The key innovation is an attention mechanism that decouples the processing of target-agnostic (reusable) ID features from target-aware (non-reusable) multimodal features before the main attention computation, avoiding redundant calculations.
- Main Contributions / Findings (What):
- A Novel Modality-Enriched Modeling Paradigm: The paper introduces a new way to fuse multimodal information by treating multimodal similarity scores as target-aware side information. This allows the model to learn fine-grained interactions between a user's behavioral history and the specific content of a candidate item.
- Decoupled Target Attention (DTA): To solve the inference inefficiency of using target-aware features, the authors propose DTA. This new attention architecture decouples the computation for ID features (which are target-agnostic and can be pre-computed) from the multimodal similarity features (which are target-aware). This design achieves the expressive power of early fusion with the efficiency of late fusion, making it scalable for industrial systems with thousands of candidate items.
- Complementary Modality Modeling (CMM): The final DMF framework combines the proposed modality-enriched strategy (via DTA) with a traditional modality-centric strategy (a histogram-based approach). This complementary fusion allows the model to benefit from both broad semantic generalization and fine-grained behavioral personalization.
- Significant Real-World Impact: The model was successfully deployed in Lazada's product recommendation system, leading to substantial improvements in key business metrics: +5.30% in CTCVR (Click-Through and Conversion Rate) and +7.43% in GMV (Gross Merchandise Volume) with negligible computational overhead.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Click-Through Rate (CTR) Prediction: A core task in computational advertising and recommender systems. It aims to predict the probability that a user will click on a specific item. These probabilities are used to rank items, with higher-ranked items shown more prominently.
- User Interest Modeling: The process of creating a representation of a user's preferences, typically by analyzing their sequence of historical interactions (e.g., clicks, purchases). Modern models use this representation to personalize recommendations.
- ID-based vs. Multimodal Features:
  - ID-based Features: These are unique, sparse identifiers for users and items (e.g., user_id:123, item_id:456). They are learned via embeddings and are powerful at capturing collaborative filtering signals (i.e., users who interacted with similar items in the past may have similar tastes). However, they carry no inherent semantic meaning.
  - Multimodal Features: These are dense vector representations derived from an item's content, such as its image (processed by a Vision Transformer, ViT) and text description (processed by a language model, RoBERTa). They encode rich semantic information and world knowledge.
- Semantic Gap: The challenge that arises when trying to combine embeddings from fundamentally different sources. ID embeddings are learned in a "collaborative space" based on user behavior, while multimodal embeddings are learned in a "semantic space" based on content. Directly combining them (e.g., by concatenation) is often ineffective because the spaces are not aligned.
- Two-Stage Framework: An industrial standard for using multimodal information efficiently. In Stage 1, powerful but slow models (like ViT) extract multimodal embeddings offline. These embeddings are then frozen and stored. In Stage 2, a lightweight CTR model uses these frozen embeddings as features for online prediction, avoiding the high cost of running the large models in real-time.
- Target-Aware vs. Target-Agnostic Attention:
  - Target-Agnostic: Models like SASRec process the user's history sequence once to generate a single "user interest" vector. This vector is then used to score all candidate items.
  - Target-Aware: Models like DIN and MHTA dynamically model user interest based on the specific target item being considered. They use an attention mechanism where the target item is the "query" and the user's historical items are the "keys" and "values." This allows the model to focus on the most relevant parts of a user's history for each candidate item.
- Side Information Fusion: Techniques for incorporating additional features (side information) into a model. The paper discusses three main strategies in the context of attention mechanisms:
  - Early Fusion: Combines item IDs and side information at the input layer. This allows for deep interactions but can be computationally expensive if the side information is target-aware.
  - Late Fusion: Models item IDs and side information in separate streams and only combines their outputs at the final prediction layer. This is efficient but misses out on fine-grained interactions.
  - Hybrid Fusion: A middle ground where interactions are allowed at intermediate layers. DTA is a form of hybrid fusion.
- Previous Works & Differentiation:
- ID-only Models (DIN, TA/MHTA): These models established the effectiveness of target-aware attention for CTR prediction but are limited by their reliance on sparse ID features.
- Multimodal Alignment (MAKE, DMAE): These works address the semantic gap. MAKE uses a multi-stage pre-training framework to align modalities. DMAE encodes similarity scores into embeddings to bridge the gap. The current paper builds on this idea of using similarity scores as a proxy for alignment.
- Multimodal Fusion (SIMTIER, BFS_MF): These methods integrate multimodal information but use a modality-centric approach. SIMTIER converts similarity scores into a histogram, losing positional information. BFS_MF (an adaptation of TWIN) splits features but still processes ID and multimodal signals separately.
- Differentiation: The key innovation of DMF is its modality-enriched paradigm, enabled by the DTA module. Unlike SIMTIER or BFS_MF, which keep ID and multimodal processing separate, DTA allows for fine-grained, position-wise interaction between ID-based collaborative signals and target-aware multimodal semantics. Crucially, it achieves this with the same computational efficiency as late fusion methods, solving the scalability problem that plagues early fusion.
4. Methodology (Core Technology & Implementation)
The proposed Decoupled Multimodal Fusion (DMF) framework is designed to effectively and efficiently integrate multimodal information into a target-aware user interest model.
Figure 1: The overall architecture of the DMF framework. It shows user interaction sequences being processed through two parallel paths: a modality-enriched path using Decoupled Target Attention (DTA) and a modality-centric path using a Similarity Histogram. The outputs are fused by the CMM module for the final CTR prediction.
As shown in Figure 1, the framework consists of three main components:
- Target-aware Multimodal Similarity Feature Construction.
- Decoupled Target Attention (DTA) for modality-enriched modeling.
- Complementary Modality Modeling (CMM) to fuse different modeling strategies.
4.1 Target-aware Multimodal Similarity Feature
To bridge the semantic gap between ID and multimodal embeddings, the authors use similarity as a proxy.
- First, multimodal embeddings for all items are pre-computed using frozen encoders (RoBERTa for text, ViT for images) and stored.
- For a given user with a historical interaction sequence $\{i_1, \dots, i_L\}$ and a candidate (target) item $t$, the model computes the cosine similarity between the target item's multimodal embedding ($\mathbf{m}_t$) and each historical item's embedding ($\mathbf{m}_{i_k}$).
- This produces a sequence of similarity scores $S = [s_1, \dots, s_L]$, where each score is calculated as:
  $$s_k = \frac{\mathbf{m}_t \cdot \mathbf{m}_{i_k}}{\|\mathbf{m}_t\|\,\|\mathbf{m}_{i_k}\|}$$
- Key Property: This similarity score sequence is target-aware, because its values change depending on the candidate item $t$. This feature directly encodes the semantic relevance of the candidate item to each item in the user's history.
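As a concrete illustration, here is a minimal sketch of this feature construction in PyTorch; the tensor shapes and the function name are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def similarity_features(target_emb: torch.Tensor, hist_embs: torch.Tensor) -> torch.Tensor:
    # target_emb: [d] frozen multimodal embedding of the candidate item
    # hist_embs:  [L, d] frozen multimodal embeddings of the user's history
    # returns S:  [L] cosine similarities in [-1, 1]
    return F.cosine_similarity(hist_embs, target_emb.unsqueeze(0), dim=-1)

# Example: L = 5 historical items, d = 8. S must be recomputed for every
# candidate item, which is exactly what makes it target-aware.
S = similarity_features(torch.randn(8), torch.randn(5, 8))
print(S.shape)  # torch.Size([5])
```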
4.2 Target-Aware and Target-Agnostic Nodes
This concept is crucial for understanding the efficiency of DTA.
- Target-Agnostic Node: A computation in the model whose output depends only on user-level features (like historical interactions) and is independent of the candidate item. Its result can be computed once per user and reused for all candidate items in a request. This leads to high efficiency. Example: Applying a linear projection to the user's historical item ID embeddings.
- Target-Aware Node: A computation whose output depends on the candidate item. It must be re-computed for every candidate item. This is computationally expensive, especially with hundreds or thousands of candidates. Example: An attention mechanism where the target item is the query.
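To make the distinction concrete, the toy sketch below (with hypothetical sizes B, L, and d, not taken from the paper) shows which computation can be cached once per user and which must be repeated for every candidate item.

```python
import torch

L, B, d = 100, 1000, 64            # history length, candidate count, embedding dim
hist_emb = torch.randn(L, d)       # the user's historical item ID embeddings
W_k = torch.randn(d, d)            # a linear projection (e.g., producing Keys)

# Target-agnostic node: depends only on the user, so it is computed once per
# request and reused for all B candidates -> cost O(L * d^2).
K_id = hist_emb @ W_k

# Target-aware node: depends on each candidate item, so it is evaluated B times
# -> cost O(B * L * d) even in its cheapest (dot-product) form.
cand_emb = torch.randn(B, d)
attn_logits = cand_emb @ K_id.T    # [B, L]
```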
4.3 Decoupled Target Attention (DTA)
DTA is the core innovation for enabling fine-grained interaction efficiently. It is designed to overcome the limitations of simple early and late fusion.
Figure 2: Comparison of fusion strategies. (a) Early Fusion is expressive but slow. (b) Late Fusion is fast but not expressive. (c) Decoupled Fusion (DTA) achieves both expressiveness and efficiency.
- Problem with Early Fusion (Figure 2a): Concatenating ID embeddings and similarity scores before the linear projections for Key (K) and Value (V) makes both K and V target-aware. This requires re-computing them for every candidate item, leading to a high inference complexity of $O(BLd^2)$, where $B$ is the number of candidates, $L$ the length of the user's history, and $d$ the embedding dimension.
- Problem with Late Fusion (Figure 2b): Processing ID and multimodal features in parallel and fusing them at the end is efficient (complexity $O(Ld^2 + BLd)$) because the ID-based K and V are target-agnostic and can be reused. However, it fails to model fine-grained interactions.
- DTA Solution (Figure 2c): DTA decouples the computation. The expensive linear projections are only applied to the target-agnostic ID embeddings. The target-aware similarity information is incorporated through a computationally cheap lookup-and-add operation. The formal steps are as follows. Let $E \in \mathbb{R}^{L \times d}$ be the ID embeddings of the user's history, and $S \in \mathbb{R}^{L}$ be the target-aware similarity scores.
  - Query (Q): The query is derived from the target item embedding $\mathbf{e}_t$: $Q = \mathbf{e}_t W^Q$.
  - Key (K) and Value (V) Decoupling:
    - ID-based part (Target-Agnostic): Linear projections are applied to the historical ID embeddings, $K_{id} = E W^K$ and $V_{id} = E W^V$. Because they do not depend on the candidate item, they can be pre-computed once per user and reused.
    - Similarity-based part (Target-Aware but Cheap): The raw similarity scores $S$ are processed by the Multimodal Similarity Encoding (MSE) module:
      - Bucket(S): The continuous similarity scores (range [-1.0, 1.0]) are discretized into a fixed number of bins.
      - Lookup(·): An embedding lookup is performed on these discrete bin IDs to get dense vectors $K_{sim}$ and $V_{sim}$. This is a very fast operation.
  - Fusion: The two parts are combined via element-wise addition to form the final Key and Value: $K = K_{id} + K_{sim}$ and $V = V_{id} + V_{sim}$.
  - Final Attention: Standard scaled dot-product attention is then applied using the constructed $Q$, $K$, and $V$: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(QK^\top/\sqrt{d}\big)V$.

This design cleverly maintains a low inference complexity of $O(Ld^2 + BLd)$ while allowing the target-aware similarity information to influence the attention calculation at a fine-grained level (by modifying $K$ and $V$ for each historical item).
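The following single-head sketch puts the steps above together; the layer sizes, the number of similarity bins, and the absence of multi-head logic are simplifying assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DecoupledTargetAttention(nn.Module):
    """K/V projections touch only the target-agnostic ID embeddings; the
    target-aware similarities enter through a cheap bucketize + lookup + add."""
    def __init__(self, d_model: int = 64, num_bins: int = 20):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # ID embeddings only
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # ID embeddings only
        # Multimodal Similarity Encoding (MSE): bin cosine similarities in [-1, 1].
        self.register_buffer("boundaries", torch.linspace(-1.0, 1.0, num_bins - 1))
        self.k_sim = nn.Embedding(num_bins, d_model)
        self.v_sim = nn.Embedding(num_bins, d_model)
        self.scale = d_model ** -0.5

    def precompute_user_side(self, hist_id_emb: torch.Tensor):
        # hist_id_emb: [L, d]; target-agnostic, computed once per user request.
        return self.w_k(hist_id_emb), self.w_v(hist_id_emb)  # K_id, V_id: [L, d]

    def forward(self, target_emb, k_id, v_id, sims):
        # target_emb: [B, d] candidate ID embeddings; sims: [B, L] cosine similarities.
        q = self.w_q(target_emb)                          # [B, d]
        bins = torch.bucketize(sims, self.boundaries)     # [B, L] discrete bin ids
        k = k_id.unsqueeze(0) + self.k_sim(bins)          # [B, L, d] element-wise add
        v = v_id.unsqueeze(0) + self.v_sim(bins)          # [B, L, d]
        attn = torch.softmax((q.unsqueeze(1) @ k.transpose(1, 2)) * self.scale, dim=-1)
        return (attn @ v).squeeze(1)                      # [B, d] interest per candidate

# K_id/V_id are projected once per user, then reused for all 1000 candidates.
dta = DecoupledTargetAttention()
k_id, v_id = dta.precompute_user_side(torch.randn(100, 64))
out = dta(torch.randn(1000, 64), k_id, v_id, torch.rand(1000, 100) * 2 - 1)
print(out.shape)  # torch.Size([1000, 64])
```

For a rough sense of scale, with B = 1000, L = 100, and d = 64, re-projecting K and V per candidate (O(BLd²)) costs about 4×10⁸ multiply-adds, whereas the decoupled form (O(Ld² + BLd)) needs roughly 7×10⁶, about 60× less.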
4.4 Complementary Modality Modeling (CMM)
DMF does not rely solely on DTA. It recognizes that modality-centric and modality-enriched strategies are complementary.
- Modality-Enriched Representation ($\mathbf{r}_{me}$): This is the output from the DTA module, capturing fine-grained, personalized behavioral patterns.
- Modality-Centric Representation ($\mathbf{r}_{mc}$): This is obtained using a histogram-based approach similar to SIMTIER. The similarity scores are binned, and a histogram of counts is created. This vector is fed through an MLP. This representation captures a user's general semantic preferences, offering robust generalization.

The CMM module combines these two representations using a weighted sum controlled by a hyperparameter $\alpha$:
$$\mathbf{r}_{CMM} = \alpha\,\mathbf{r}_{me} + (1 - \alpha)\,\mathbf{r}_{mc}$$
This final representation is then concatenated with other features (e.g., user profile) and fed into a final MLP for CTR prediction.
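A compact sketch of the two branches and their fusion is given below; the histogram bin count, the small MLP, and the convention that α weights the modality-enriched term are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class ModalityCentricBranch(nn.Module):
    """SIMTIER-style branch: bin the similarity scores into a count histogram
    and pass it through an MLP to get a coarse, well-generalizing representation."""
    def __init__(self, num_bins: int = 20, d_out: int = 64):
        super().__init__()
        self.register_buffer("boundaries", torch.linspace(-1.0, 1.0, num_bins - 1))
        self.mlp = nn.Sequential(nn.Linear(num_bins, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))
        self.num_bins = num_bins

    def forward(self, sims: torch.Tensor) -> torch.Tensor:
        # sims: [B, L] -> histogram of bin counts: [B, num_bins] -> r_mc: [B, d_out]
        bins = torch.bucketize(sims, self.boundaries)
        hist = torch.zeros(sims.size(0), self.num_bins, device=sims.device)
        hist.scatter_add_(1, bins, torch.ones_like(sims))
        return self.mlp(hist)

def cmm_fuse(r_me: torch.Tensor, r_mc: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Weighted sum of the modality-enriched (DTA) and modality-centric representations.
    return alpha * r_me + (1.0 - alpha) * r_mc
```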
5. Experimental Setup
- Datasets:
  - Amazon (Electronics): A public benchmark dataset where product reviews serve as user interaction sequences. It is preprocessed with 5-core filtering.
  - Industry: A large-scale, real-world dataset from the Lazada e-commerce platform (Thailand), consisting of 19 days of training data and 1 day for testing.
The following table, transcribed from Table 1 in the paper, provides statistics for the datasets.
| Dataset | #Users | #Items | #Samples |
| --- | --- | --- | --- |
| Amazon (Electronics) | 192k | 63k | 1.7M |
| Industry | 8.7M | 20.9M | 469M |
- Evaluation Metrics:
- AUC (Area Under the ROC Curve):
- Conceptual Definition: AUC measures the model's ability to distinguish between positive and negative classes. It represents the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample. An AUC of 1.0 is a perfect classifier, while 0.5 is equivalent to random guessing.
- Mathematical Formula: For a set of predictions on $M$ positive and $N$ negative samples, where $\mathrm{rank}_i$ is the rank of the $i$-th positive sample among all samples sorted by score:
  $$\mathrm{AUC} = \frac{\sum_{i=1}^{M} \mathrm{rank}_i - \frac{M(M+1)}{2}}{M \times N}$$
- Symbol Explanation: $M$ is the number of positive samples, $N$ is the number of negative samples, and $\mathrm{rank}_i$ is the rank of the $i$-th positive sample in the sorted list of all samples.
- GAUC (Group Area Under Curve):
- Conceptual Definition: GAUC is more relevant for recommendation performance as it measures ranking quality within each user's recommendations. It is the average of AUC scores calculated for each user, weighted by the number of impressions or clicks for that user. This prevents users with many interactions from dominating the metric.
- Mathematical Formula:
  $$\mathrm{GAUC} = \frac{\sum_{u=1}^{U} w_u \cdot \mathrm{AUC}_u}{\sum_{u=1}^{U} w_u}$$
- Symbol Explanation: $U$ is the total number of users, $\mathrm{AUC}_u$ is the AUC calculated only on samples for user $u$, and $w_u$ is the weight for user $u$ (e.g., number of clicks or impressions). A minimal computation sketch is given after this metric list.
- CTCVR (Post-view Click-Through and Conversion Rate): An online business metric measuring the rate of conversions (e.g., purchases) after a click.
- GMV (Gross Merchandise Volume): A key online business metric representing the total value of merchandise sold over a given period.
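As a worked illustration of the GAUC definition above, here is a minimal sketch that weights each user's AUC by their number of impressions; it assumes scikit-learn is available and skips users whose labels are all of one class, for whom AUC is undefined.

```python
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def gauc(user_ids, labels, scores):
    # Group samples by user, compute per-user AUC, and average with
    # impression-count weights.
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    num, den = 0.0, 0.0
    for ys, ss in groups.values():
        if len(set(ys)) < 2:          # AUC is undefined for single-class users
            continue
        w = len(ys)                   # weight by number of impressions
        num += w * roc_auc_score(ys, ss)
        den += w
    return num / den if den else 0.0

print(gauc([1, 1, 1, 2, 2], [1, 0, 0, 1, 0], [0.9, 0.2, 0.4, 0.7, 0.8]))  # 0.6
```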
- Baselines:
  - ID-only models: SASRec (target-agnostic), DIN (target-aware), TA (target-aware with standard multi-head attention).
  - Multimodal models: BFS_MF (adapts TWIN for multimodal features), SIMTIER (histogram-based), MAKE (multi-stage pre-training).
6. Results & Analysis
- Core Results: The following table, transcribed from Table 2 in the paper, shows the main performance comparison. ΔAUC and ΔGAUC represent the relative improvement over the SASRec baseline.

| Model | Amazon (Electronics) AUC (mean ± std) | ΔAUC ↑ | Industry AUC (mean ± std) | ΔAUC ↑ | Industry GAUC (mean ± std) | ΔGAUC ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| SASRec | 0.7776 ± 0.00292 | - | 0.6491 ± 0.00206 | - | 0.6048 ± 0.00084 | - |
| DIN | 0.7806 ± 0.00118 | +0.30% | 0.6508 ± 0.00094 | +0.17% | 0.6058 ± 0.00064 | +0.10% |
| TA | 0.7798 ± 0.00129 | +0.22% | 0.6538 ± 0.00046 | +0.47% | 0.6080 ± 0.00074 | +0.32% |
| BFS_MF | 0.7823 ± 0.00050 | +0.47% | 0.6579 ± 0.00083 | +0.88% | 0.6109 ± 0.00124 | +0.61% |
| SIMTIER | 0.8090 ± 0.00233 | +3.14% | 0.6629 ± 0.00068 | +1.38% | 0.6135 ± 0.00099 | +0.87% |
| MAKE | 0.8145 ± 0.00264 | +3.69% | 0.6623 ± 0.00075 | +1.32% | 0.6154 ± 0.00047 | +1.06% |
| DTA | 0.8214 ± 0.00184 | +4.38% | 0.6645 ± 0.00035 | +1.54% | 0.6158 ± 0.00043 | +1.10% |
| DMF | 0.8251 ± 0.00105 | +4.75% | 0.6663 ± 0.00049 | +1.72% | 0.6177 ± 0.00060 | +1.29% |
| DMF+MAKE | 0.8299 ± 0.00262 | +5.23% | 0.6678 ± 0.00060 | +1.87% | 0.6195 ± 0.00082 | +1.47% |

  - Analysis:
    - Target-aware models (DIN, TA) outperform the target-agnostic SASRec.
    - All multimodal models (BFS_MF, SIMTIER, MAKE) consistently outperform the ID-only baselines, confirming the value of multimodal information.
    - The proposed DTA module alone already outperforms all other baselines, demonstrating the power of its fine-grained, modality-enriched fusion.
    - The full DMF model (which combines DTA with the histogram-based method) achieves even better results, confirming the benefit of the Complementary Modality Modeling (CMM) module.
    - DMF+MAKE shows that DMF's architectural innovation is complementary to advanced training strategies like MAKE's multi-stage pre-training, leading to state-of-the-art performance.
- Ablation Study: This study (transcribed from Table 3) analyzes different fusion strategies on the industrial dataset to justify the design of DTA. C_inf denotes the online inference complexity; ΔAUC and ΔGAUC are relative to DTA.

| Model | AUC (mean ± std) | ΔAUC ↑ | GAUC (mean ± std) | ΔGAUC ↑ | C_inf |
| --- | --- | --- | --- | --- | --- |
| TA_early | 0.6644 ± 0.00212 | -0.01% | 0.6159 ± 0.00091 | +0.01% | O(BLd²) |
| TA_late | 0.6615 ± 0.00101 | -0.30% | 0.6145 ± 0.00052 | -0.13% | O(Ld² + BLd) |
| DTA_non-invasive | 0.6624 ± 0.00079 | -0.21% | 0.6129 ± 0.00063 | -0.29% | O(Ld² + BLd) |
| DTA | 0.6645 ± 0.00035 | - | 0.6158 ± 0.00043 | - | O(Ld² + BLd) |

  - Analysis:
    - TA_early achieves performance comparable to DTA, confirming the value of fine-grained interaction. However, its computational complexity is prohibitively expensive for online inference.
    - TA_late is efficient but performs significantly worse, demonstrating that delaying fusion limits expressive power.
    - DTA_non-invasive (where similarity only affects the Key K but not the Value V) performs worse than the full DTA. This shows that enriching both the attention scores (via K) and the aggregated content (via V) with multimodal signals is crucial.
    - DTA successfully achieves the high performance of TA_early with the low inference complexity of TA_late, striking an optimal balance between effectiveness and efficiency.
- Hyper-parameter Study:
  Figure 3: Performance variation with the representation-aggregating hyperparameter $\alpha$.
  - Analysis: This figure shows the performance of DMF as the weighting hyperparameter $\alpha$ varies between the modality-centric representation ($\mathbf{r}_{mc}$, at $\alpha = 0$) and the modality-enriched representation ($\mathbf{r}_{me}$, at $\alpha = 1$).
  - The worst performance occurs at $\alpha = 0$, where the model only uses the histogram-based representation, highlighting the insufficiency of coarse-grained semantic generalization alone.
  - As $\alpha$ increases, performance improves, showing the benefit of incorporating the fine-grained behavioral signals from DTA.
  - The optimal performance is achieved at an intermediate value of $\alpha$ on both datasets, not at $\alpha = 1$. This demonstrates that combining both strategies is better than using either one in isolation. The modality-centric component provides robust generalization, while the modality-enriched component provides sharp personalization.
- Online Experiments: The model was deployed in an online A/B test on Lazada for 12 days. The experimental group used DMF to replace the baseline TA+SIMTIER modules. The results were highly positive:
  - CTCVR: +5.30% relative improvement.
  - GMV: +7.43% relative improvement.

  These significant gains in key business metrics, achieved with negligible added latency, prove the practical value and successful industrial deployment of the DMF framework.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces Decoupled Multimodal Fusion (DMF), a novel framework for CTR prediction that effectively integrates multimodal and ID-based information. Its core component, Decoupled Target Attention (DTA), enables fine-grained interaction between content semantics and collaborative signals while maintaining high inference efficiency through a clever computational decoupling. By combining this modality-enriched strategy with a modality-centric one, DMF achieves a comprehensive and powerful user interest representation. Strong offline results and significant online business gains on a large-scale e-commerce platform validate the effectiveness and practicality of the proposed method.
- Limitations & Future Work:
  - The paper itself does not explicitly list limitations. However, potential areas for future work could include:
    - Exploring more sophisticated fusion operators than element-wise addition within the DTA module, while still maintaining efficiency.
    - Applying the DTA concept to other types of target-aware side information beyond multimodal similarity (e.g., price, brand, or other item attributes).
    - Extending the framework to handle more than two modalities (e.g., audio, user-generated video).
- Personal Insights & Critique:
  - Novelty and Impact: The primary contribution of this paper is not just a new model, but a practical and elegant engineering solution to a critical problem in industrial recommender systems: balancing model expressiveness with inference cost. The concept of Decoupled Target Attention is a powerful pattern for integrating any form of target-aware side information efficiently, and its applicability extends beyond just multimodal fusion.
  - Strengths: The paper is well-grounded in real-world constraints. The focus on inference complexity and the successful A/B test results make its contributions highly credible and impactful. The ablation study provides a clear, convincing argument for the specific design choices in DTA.
  - Critique: The CMM fusion using a simple scalar hyperparameter is straightforward but might be suboptimal. A more dynamic, context-aware gating mechanism could potentially learn the optimal balance between the two strategies for different users or items. Additionally, the discretization of similarity scores (Bucket(S)) is a heuristic; exploring learnable or soft discretization methods could be a direction for improvement. Overall, however, the paper presents a strong, practical contribution to the field of recommender systems.