Long-Sequence Recommendation Models Need Decoupled Embeddings

Published: 10/03/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study highlights the limitations of existing recommendation models using a single embedding for attention and representation. The authors propose the DARE model, which employs decoupled embeddings, improving behavior search accuracy and overall performance on various datasets

Abstract

Lifelong user behavior sequences are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a subset of relevant behaviors is first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue with some common methods (e.g., linear projections -- a technique borrowed from language processing) proved ineffective, shedding light on the unique challenges of recommendation models. To overcome this, we propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are initialized and learned separately to fully decouple attention and representation. Extensive experiments and analysis demonstrate that DARE provides more accurate searches of correlated behaviors and outperforms baselines with AUC gains up to 0.9% on public datasets and notable improvements on Tencent's advertising platform. Furthermore, decoupling embedding spaces allows us to reduce the attention embedding dimension and accelerate the search procedure by 50% without significant performance impact, enabling more efficient, high-performance online serving. Code in PyTorch for experiments, including model analysis, is available at https://github.com/thuml/DARE.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Long-Sequence Recommendation Models Need Decoupled Embeddings

1.2. Authors

The authors are Ningya Feng, Jialong Wu, Baixu Chen, Mingsheng Long (from Tsinghua University) and Junwei Pan, Ximei Wang, Qian Li, Xian Hu, Jie Jiang (from Tencent Inc.). This represents a strong collaboration between a leading academic institution in AI and a major industrial technology company, which is common for research that has both theoretical depth and large-scale practical applications.

1.3. Journal/Conference

The paper was submitted to arXiv as a preprint. The specific conference or journal of publication is not mentioned in this version. Given its topic and quality, it is a strong candidate for top-tier conferences in data mining (KDD), information retrieval (SIGIR), or general AI (NeurIPS, ICML).

1.4. Publication Year

The paper was published on arXiv on October 3, 2024.

1.5. Abstract

The abstract introduces the problem of handling lifelong user behavior sequences in recommender systems. It describes the standard two-stage paradigm: a search stage using an attention mechanism to find relevant historical behaviors, followed by an aggregation stage to create a user representation for prediction. The authors identify a key deficiency: a single set of embeddings struggles to learn both the attention scores (for search) and the item representations (for prediction), leading to interference. They show that simple solutions like linear projections, common in natural language processing, are ineffective for recommendation models. Their proposed solution is the Decoupled Attention and Representation Embeddings (DARE) model, which uses two separate embedding tables for these two tasks. Experiments show that DARE improves search accuracy and outperforms baselines, achieving up to a 0.9% (9‰) AUC gain on public datasets and a significant GMV lift on Tencent's advertising platform. A key practical benefit is that this decoupling allows for reducing the attention embedding dimension, accelerating the search process by 50% without significant performance loss.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Modern recommender systems leverage very long sequences of user behavior (e.g., clicks, purchases over years) to model user interests accurately. For efficiency, these systems typically use a two-stage process: first, they search for a small subset of relevant items from the long history, and second, they model this shorter sequence to predict user behavior (e.g., Click-Through Rate or CTR). Both stages rely on item embeddings. The search stage uses them to calculate attention scores (measuring correlation between a historical item and a target item), while the modeling stage uses them to build a final representation of the user's interest. The core problem identified in this paper is that using a single, shared embedding table for both of these distinct tasks creates a conflict. The learning objectives for good attention (accurate correlation) and good representation (high discriminability for classification) interfere with each other.

  • Importance & Gaps: The problem is important because the performance of the entire recommendation model hinges on the quality of both the search and the final representation. If the search stage fails to retrieve key historical items, the model will have incomplete information. The paper fills a critical gap in existing research, as this "embedding interference" was previously a neglected issue. Furthermore, the paper demonstrates that a seemingly obvious solution borrowed from Natural Language Processing (NLP)—using linear projection matrices to create separate spaces for attention and representation—fails in the recommendation context. The authors hypothesize this failure is due to the unique constraints of recommender systems, specifically the relatively low embedding dimensions used to avoid the "interaction collapse" phenomenon.

  • Innovative Idea: The paper's central idea is to resolve this conflict with a simple yet powerful architectural change: complete decoupling at the embedding level. Instead of trying to transform a shared embedding into different spaces, they propose using two entirely separate and independently learned embedding tables: one dedicated to the attention task (Attention Embedding) and another to the representation task (Representation Embedding). This allows each set of embeddings to be optimized for its specific function without interference.

2.2. Main Contributions / Findings

  • Primary Contributions:
    1. Problem Identification: The paper is the first to formally identify and analyze the interference between attention and representation learning in long-sequence recommendation models. It uses gradient analysis from Multi-Task Learning to show that representation gradients dominate and conflict with attention gradients.
    2. Failure Analysis of Standard Methods: It demonstrates that common decoupling techniques from NLP, such as linear projections, are ineffective in recommendation systems and provides a compelling hypothesis (limited model capacity due to small embedding dimensions) to explain this failure.
    3. Proposal of DARE Model: It introduces the DARE model, which uses two separate embedding tables to fully decouple the two tasks, thereby resolving the learning conflict.
  • Key Findings:
    1. Improved Performance: DARE significantly outperforms state-of-the-art models on public datasets (up to 9‰ AUC gain) and achieves a 1.47% Gross Merchandise Value (GMV) lift in a large-scale online A/B test at Tencent, demonstrating substantial real-world business impact.
    2. Enhanced Model Interpretability: The analysis shows DARE learns more accurate attention scores that better reflect the true semantic and temporal correlations between items, leading to better retrieval in the search stage. It also produces more discriminative final representations for the prediction task.
    3. Inference Acceleration: The decoupled architecture offers a practical advantage. The dimension of the attention embeddings can be significantly reduced (e.g., by 50-75%) to speed up the computationally expensive search stage with minimal to acceptable impact on overall model accuracy.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Click-Through Rate (CTR) Prediction: This is a core task in online advertising and recommendation. It involves predicting the probability that a user will click on a specific item (e.g., an ad, a product, a video). It is typically modeled as a binary classification problem, where the model outputs a probability score between 0 and 1.

  • Long-Sequence Recommendation: This refers to recommendation models that utilize a user's entire or a very long history of interactions (e.g., thousands of clicks over several years) rather than just recent ones. Longer sequences provide a richer signal of a user's diverse and evolving interests but pose significant computational challenges for real-time systems.

  • Two-Stage Paradigm (Search & Model): To handle long sequences efficiently, a common paradigm is to split the process into two stages:

    1. Search Stage (General Search Unit - GSU): A fast but approximate method is used to retrieve a small subset of the most relevant items from the user's long history. This is often done using a simple similarity metric like dot product.
    2. Modeling Stage (Exact Search Unit - ESU): A more complex and computationally expensive model (like a Transformer or attention network) is then applied only to this shorter, retrieved subsequence to make the final prediction.
  • Attention Mechanism: Originally proposed for machine translation, the attention mechanism allows a model to weigh the importance of different parts of an input sequence when making a prediction. In recommendation, it is used to assign higher weights to historical items that are more relevant to the target item. The most common form is Scaled Dot-Product Attention, introduced in the Transformer paper ("Attention Is All You Need" by Vaswani et al., 2017): $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

    • Q (Query): A representation of the current context. In recommendation, this is typically the embedding of the target item.
    • K (Key): A representation of an item in the sequence used for scoring. The embeddings of the historical items serve as Keys. The dot product between a Query and a Key ($QK^\top$) measures their compatibility or relevance.
    • V (Value): A representation of an item in the sequence used for aggregation. The embeddings of the historical items also serve as Values. The final output is a weighted sum of the Values.
    • In standard Transformers, Q, K, and V are generated by applying separate linear projection matrices to the same input embeddings. The core conflict this paper identifies is that in recommendation models, the same raw embedding is used for both K (attention) and V (representation); a minimal code sketch of this attention computation follows this list.
  • Multi-Task Learning (MTL): A machine learning paradigm where a single model is trained to perform multiple tasks simultaneously. A common challenge in MTL is negative transfer or task conflict, where improving performance on one task degrades performance on another. This often happens when tasks have conflicting optimization objectives, which can be diagnosed by analyzing the gradients they produce. If gradients for different tasks point in opposite directions, their objectives are in conflict.

  • Embeddings: In machine learning, an embedding is a low-dimensional, dense vector representation of a high-dimensional, sparse feature (like an item ID). An embedding table is essentially a lookup matrix where each row corresponds to the vector for a specific item.
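
To make these roles concrete, here is a minimal PyTorch sketch of target-aware scaled dot-product attention as used in recommendation models: the target item's embedding acts as the query, and the same history embeddings act as both keys and values, which is exactly the coupling the paper questions. Tensor names and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def target_aware_attention(target_emb, history_emb):
    """Scaled dot-product attention with a single shared embedding table.

    target_emb:  (batch, dim)      -- the target item acts as the query
    history_emb: (batch, seq, dim) -- history items act as keys AND values
    Returns an attention-weighted summary of the history, shape (batch, dim).
    """
    dim = target_emb.size(-1)
    # Score each history item against the target (query-key dot product), scaled.
    scores = torch.einsum("bd,bsd->bs", target_emb, history_emb) / dim ** 0.5
    weights = F.softmax(scores, dim=-1)                      # attention weights
    return torch.einsum("bs,bsd->bd", weights, history_emb)  # weighted sum of values

# Illustrative usage with random embeddings.
target = torch.randn(4, 64)
history = torch.randn(4, 100, 64)
print(target_aware_attention(target, history).shape)  # torch.Size([4, 64])
```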

3.2. Previous Works

  • DIN (Deep Interest Network): A pioneering work in modeling user interest for CTR prediction. Its key innovation was target-aware attention, where the representation of user history is not static but dynamically calculated based on its relevance to a given target item. However, DIN was designed for short sequences.

  • SIM (Search-based User Interest Modeling): This work introduced the two-stage GSU/ESU paradigm to make long-sequence modeling feasible in industrial systems. The GSU performs a fast search, and the ESU performs detailed modeling on the retrieved items.

  • TWIN (Two-stage Interest Network): A state-of-the-art model that serves as the primary baseline for this paper. TWIN improves upon SIM by unifying the attention mechanism used in both the search (GSU) and modeling (ESU) stages, which improves the quality of the initial search. However, like its predecessors, TWIN uses a single shared embedding table for both attention calculation and representation aggregation, making it susceptible to the interference problem identified in this paper.

  • Interaction Collapse Theory (Guo et al., 2024): This recent theory provides a crucial piece of context for the paper's argument. It finds that in recommendation models (which often rely on dot-product interactions), increasing the embedding dimension beyond a certain point can actually hurt performance, a phenomenon termed "interaction collapse." This forces practitioners to use relatively small embedding dimensions (e.g., under 200), in stark contrast to modern NLP models where dimensions can be in the thousands.

3.3. Technological Evolution

The field has evolved from handling short sequences to long sequences.

  1. Short-Sequence Models (DIN): Focused on capturing target-specific interests from recent behaviors. Inefficient for long histories.
  2. Long-Sequence Models (SIM): Introduced the two-stage search-then-model paradigm to make long sequences computationally tractable.
  3. Refined Two-Stage Models (TWIN): Improved the search quality by using a consistent attention mechanism across both stages.
  4. DARE (This Paper): Addresses a fundamental flaw in the embedding layer of all previous two-stage models by decoupling the conflicting tasks of attention and representation. It is a structural refinement that builds upon the TWIN framework.

3.4. Differentiation Analysis

  • vs. TWIN/SIM: While previous models also use a two-stage approach, they all rely on a single, shared embedding table. DARE's core innovation is the introduction of two separate embedding tables, one for attention and one for representation. This fundamentally resolves the learning conflict that other models suffer from.
  • vs. NLP Transformers: Standard Transformers also decouple attention (Query/Key) from representation (Value) but do so by applying linear projection matrices to a shared input embedding. This paper shows this approach fails in recommendation systems due to the limited capacity of these matrices (a consequence of small embedding dimensions). DARE's method of decoupling at the embedding table level is a more powerful and suitable solution for the recommendation domain.

4. Methodology

4.1. Principles

The core principle of DARE is that the function of attention and the function of representation in long-sequence recommendation are fundamentally different and impose conflicting requirements on the item embeddings.

  • Attention's Goal: To learn a score that accurately reflects the correlation or relevance between two items (a historical item and a target item). This is used for searching and weighting.

  • Representation's Goal: To learn a feature vector that is highly discriminative for the downstream prediction task (e.g., CTR classification). The final aggregated user representation vector needs to be easily separable by a classifier.

    When a single embedding vector is used for both, its training is pulled in two different directions. For example, to make two items have a high dot product for attention, their vectors might need to align. However, to make the final representation discriminative, the vectors might need to be spread out in the feature space. DARE resolves this by creating two separate, specialized embedding spaces, allowing each to be optimized for its single purpose without interference.

4.2. Core Methodology In-depth (Layer by Layer)

The authors first diagnose the problem in existing models and then present DARE as the solution.

4.2.1. Diagnosis: Gradient Domination and Conflict

The authors frame the learning of attention and representation from a shared embedding as a Multi-Task Learning (MTL) problem. They analyze the gradients that flow back to the shared embedding table from two sources: the attention calculation and the final representation aggregation.

  • Gradient Domination: As shown in Figure 2 from the paper, the authors empirically find that the magnitude (L2 norm) of the gradients originating from the representation part of the model is about five times larger than that from the attention part. This means the updates to the embedding vectors are overwhelmingly dictated by the representation task, effectively drowning out the learning signal for the attention task.

    Figure 2: The magnitude of embedding gradients from the attention and representation modules. The bar chart compares the average gradient magnitudes reaching the shared embeddings: the attention module's gradients are small (about $0.5 \times 10^{-3}$) while the representation module's are markedly larger (about $2.5 \times 10^{-3}$), showing how differently the two modules drive embedding learning.

  • Gradient Conflict: As shown in Figure 3, they calculate the cosine similarity between the gradient vectors from the two tasks. They find that in nearly two-thirds of cases (64.31%), the cosine similarity is negative. This indicates that the gradients are pointing in opposite directions, meaning an update that improves the representation objective would worsen the attention objective, and vice-versa.

    Figure 3: Cosine angles of gradients. The histogram shows the distribution of cosine similarities between the two gradient sources, split into a conflict region (negative similarity, shown in red) and a consistent region (positive similarity, shown in green); conflicting pairs account for roughly two-thirds of the cases (about 64%), with the remaining third being consistent.

These two phenomena—domination and conflict—demonstrate a severe interference problem that prevents the model from learning either task optimally.
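
This diagnosis can be reproduced in a few lines of PyTorch: route the CTR loss through the attention path and the representation path separately (by detaching the other path) and compare the gradients each sends back to the shared embedding table. The sketch below uses random data and a deliberately tiny toy model; it is not the authors' analysis code.

```python
import torch
import torch.nn.functional as F

# Toy shared embedding table and a hypothetical batch (all sizes are illustrative).
emb = torch.nn.Embedding(1000, 64)
mlp = torch.nn.Linear(64, 1)
hist_ids = torch.randint(0, 1000, (8, 50))
tgt_ids = torch.randint(0, 1000, (8,))
labels = torch.randint(0, 2, (8,)).float()

e_hist, e_tgt = emb(hist_ids), emb(tgt_ids)   # one table serves both roles

# Attention path: scores and softmax weights computed from the shared embeddings.
scores = torch.einsum("bsd,bd->bs", e_hist, e_tgt) / 64 ** 0.5
weights = F.softmax(scores, dim=-1)

def ctr_loss(w, values):
    user_repr = torch.einsum("bs,bsd->bd", w, values)   # weighted sum of "values"
    return F.binary_cross_entropy_with_logits(mlp(user_repr).squeeze(-1), labels)

# Gradient reaching the table through the attention path only (representation
# inputs detached), and through the representation path only (weights detached).
g_att = torch.autograd.grad(ctr_loss(weights, e_hist.detach()),
                            emb.weight, retain_graph=True)[0]
g_repr = torch.autograd.grad(ctr_loss(weights.detach(), e_hist), emb.weight)[0]

print("attention-path grad norm:     ", g_att.norm().item())
print("representation-path grad norm:", g_repr.norm().item())
print("cosine between the two paths: ",
      F.cosine_similarity(g_att.flatten(), g_repr.flatten(), dim=0).item())
```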

4.2.2. Failed Attempt: Decoupling with Linear Projections

Inspired by standard Transformer architectures in NLP, a natural first attempt is to use separate linear projection matrices to map the shared embedding into two distinct spaces for attention and representation. The architecture is shown in Figure 4.

Figure 4: Illustration and evaluation of adopting linear projections. (a-b) The attention module in the original TWIN and after adopting linear projections. (c) AUC of the TWIN variants on Taobao and Tmall: adopting linear projections causes an AUC drop of nearly 2% on Taobao.

However, this approach failed to improve and even hurt performance. The authors hypothesize this is due to the limited capacity of the projection matrices in recommendation models. In NLP, a model like LLaMA 3.1 might have an embedding dimension of 4096, so a projection matrix has over 16 million parameters ($4096 \times 4096$). In recommendation, due to the interaction collapse theory, dimensions are much smaller (e.g., 128), leading to a projection matrix with only ~16,000 parameters ($128 \times 128$). This is insufficient to learn a meaningful transformation for millions of unique item IDs. Their synthetic experiment in NLP (Figure 5) supports this, showing that projections only help when the embedding dimension (and thus matrix capacity) is sufficiently large.

Figure 5: The influence of linear projections with different embedding dimensions in NLP. The plot compares loss with linear projections (orange curve) and without them (green curve) as the embedding dimension increases, with the loss gradually decreasing at larger dimensions.
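
For concreteness, a minimal sketch of the projection-based variant discussed above (not the authors' implementation): the shared embedding is passed through small learned matrices before the attention dot product, analogous to the Q/K projections in a Transformer. The table size here is a toy value; the point is that each projection holds only about $128 \times 128 \approx 16\text{k}$ parameters no matter how many item IDs the table contains.

```python
import torch

dim = 128
num_items = 100_000                    # toy size; industrial tables hold millions of IDs
emb = torch.nn.Embedding(num_items, dim)
W_q = torch.nn.Linear(dim, dim, bias=False)  # query projection (target item), ~16k params
W_k = torch.nn.Linear(dim, dim, bias=False)  # key projection (history items), ~16k params

hist_ids = torch.randint(0, num_items, (2, 200))
tgt_ids = torch.randint(0, num_items, (2,))

# The raw embedding stays shared with the representation path; decoupling rests
# entirely on these two small projection matrices.
q = W_q(emb(tgt_ids))                        # (batch, dim)
k = W_k(emb(hist_ids))                       # (batch, seq, dim)
scores = torch.einsum("bd,bsd->bs", q, k) / dim ** 0.5
print(scores.shape)                          # torch.Size([2, 200])
```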

4.2.3. The DARE Architecture

To overcome the capacity limitations of projections, DARE proposes a complete decoupling at the embedding table level. This provides maximum capacity for each task. The architecture is illustrated in Figure 6.

Figure 6: Architecture of the proposed DARE model. One embedding table is responsible for attention, learning the correlation between the target and history behaviors (left: relevant items are retrieved via scaled dot products and softmax), while the other is responsible for representation, learning discriminative features that are aggregated by weighted sum for the final prediction (right). Decoupling these two embeddings resolves the conflict between the two modules.

The model consists of two parallel embedding pathways:

1. Attention Embedding Pathway ($E^{\mathrm{Att}}$). This pathway is solely responsible for searching for relevant items and calculating their importance weights.

  • Embedding Lookup: For each historical item $i$ and the target item $t$, their IDs are used to look up the corresponding vectors from the attention embedding table $E^{\mathrm{Att}}$, yielding $e_i^{\mathrm{Att}}$ and $v_t^{\mathrm{Att}}$.
  • Search and Weight Calculation: The relevance score between item $i$ and target $t$ is their dot product, $\langle e_i^{\mathrm{Att}}, v_t^{\mathrm{Att}} \rangle$. The top-$K$ historical items with the highest scores are selected. For this retrieved subsequence of length $K$, the final attention weights $w_i$ are obtained by a softmax over the scaled scores: $$w_i = \frac{\exp\left(\langle e_i^{\mathrm{Att}}, v_t^{\mathrm{Att}} \rangle / \sqrt{|\mathcal{E}^{\mathrm{Att}}|}\right)}{\sum_{j=1}^{K} \exp\left(\langle e_j^{\mathrm{Att}}, v_t^{\mathrm{Att}} \rangle / \sqrt{|\mathcal{E}^{\mathrm{Att}}|}\right)}$$
    • $w_i$: The final attention weight for the $i$-th retrieved historical behavior.
    • $e_i^{\mathrm{Att}}$: The attention embedding of the $i$-th historical behavior.
    • $v_t^{\mathrm{Att}}$: The attention embedding of the target item.
    • $\langle \cdot, \cdot \rangle$: The dot product operator.
    • $|\mathcal{E}^{\mathrm{Att}}|$: The dimension of the attention embeddings, used as a scaling factor.

2. Representation Embedding Pathway ($E^{\mathrm{Repr}}$). This pathway is solely responsible for providing rich, discriminative feature vectors for the final prediction.

  • Embedding Lookup: The same IDs for the retrieved historical items ($i = 1 \ldots K$) and the target item $t$ are used to look up vectors from a different embedding table, $E^{\mathrm{Repr}}$, yielding $e_i^{\mathrm{Repr}}$ and $v_t^{\mathrm{Repr}}$.
  • Target-Aware Representation and Aggregation: To enhance discriminability, the model first creates a "target-aware" representation for each historical item by taking the element-wise product (Hadamard product) of its representation embedding and the target's representation embedding: $e_i^{\mathrm{Repr}} \odot v_t^{\mathrm{Repr}}$. These target-aware vectors are then aggregated using the attention weights $w_i$ computed in the other pathway. The final user history representation $h$ is: $$h = \sum_{i=1}^{K} w_i \cdot \left( e_i^{\mathrm{Repr}} \odot v_t^{\mathrm{Repr}} \right)$$
    • $h$: The final aggregated vector representing the user's interest in the context of the target item.
    • $w_i$: The attention weight computed from the attention embeddings.
    • $e_i^{\mathrm{Repr}}$: The representation embedding of the $i$-th historical behavior.
    • $v_t^{\mathrm{Repr}}$: The representation embedding of the target item.
    • $\odot$: The element-wise product operator.

3. Final Prediction. The aggregated history vector $h$ is then concatenated with other features (such as the target representation $v_t^{\mathrm{Repr}}$ itself) and fed into a Multi-Layer Perceptron (MLP) to produce the final CTR prediction. Because the gradients from the attention calculation flow back only to $E^{\mathrm{Att}}$ and the gradients from the representation aggregation flow back only to $E^{\mathrm{Repr}}$, the two learning processes are completely independent and the conflict is resolved.
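
Putting the two pathways together, below is a condensed PyTorch sketch of a DARE-style forward pass following the equations above. It is a simplified reconstruction from the paper's description rather than the released code; the top-K value, layer sizes, and MLP are illustrative.

```python
import torch
import torch.nn.functional as F

class DARESketch(torch.nn.Module):
    def __init__(self, num_items, att_dim=32, repr_dim=128, top_k=100):
        super().__init__()
        self.att_emb = torch.nn.Embedding(num_items, att_dim)    # attention table E^Att
        self.repr_emb = torch.nn.Embedding(num_items, repr_dim)  # representation table E^Repr
        self.top_k = top_k
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * repr_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

    def forward(self, hist_ids, tgt_ids):
        # --- Search and attention weights: attention embeddings only ---
        e_att = self.att_emb(hist_ids)                # (B, N, d_att)
        v_att = self.att_emb(tgt_ids)                 # (B, d_att)
        scores = torch.einsum("bnd,bd->bn", e_att, v_att)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # retrieve top-K behaviors
        w = F.softmax(top_scores / e_att.size(-1) ** 0.5, dim=-1)

        # --- Aggregation: representation embeddings only ---
        top_ids = hist_ids.gather(1, top_idx)
        e_repr = self.repr_emb(top_ids)               # (B, K, d_repr)
        v_repr = self.repr_emb(tgt_ids)               # (B, d_repr)
        target_aware = e_repr * v_repr.unsqueeze(1)   # element-wise product e_i ⊙ v_t
        h = torch.einsum("bk,bkd->bd", w, target_aware)

        # --- Prediction ---
        return torch.sigmoid(self.mlp(torch.cat([h, v_repr], dim=-1))).squeeze(-1)

model = DARESketch(num_items=10_000)
p_click = model(torch.randint(0, 10_000, (4, 500)), torch.randint(0, 10_000, (4,)))
print(p_click.shape)  # torch.Size([4])
```

Because the attention weights depend only on `att_emb` and the aggregated representation only on `repr_emb`, gradients from the two objectives never meet in a shared table, and the two dimensions can be chosen independently, which is exactly what the next subsection exploits.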

4.2.4. Inference Acceleration

A key practical benefit of this decoupling is the ability to use different dimensions for the two embedding tables. The search stage, which computes dot products over the entire long sequence of length $N$, has complexity $O(K_A N)$, where $K_A$ is the dimension of the attention embeddings; the modeling stage operates only on the much shorter retrieved sequence of length $K$. The search stage is therefore the main bottleneck. With DARE, the attention dimension $K_A$ can be set much smaller than the representation dimension $K_R$ (e.g., $K_A = 32$, $K_R = 128$). This significantly reduces the cost of the search step without compromising the expressive power of the final representation, leading to faster online inference.
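
A rough way to see the effect on the search bottleneck (a toy timing loop, not the paper's serving benchmark): the dot-product scoring over the full history scales linearly with the attention dimension, so halving it roughly halves that cost.

```python
import time
import torch

N, batch = 6000, 64                       # long-sequence length and an illustrative batch
for d_att in (128, 64, 32):
    hist = torch.randn(batch, N, d_att)
    tgt = torch.randn(batch, d_att)
    start = time.perf_counter()
    for _ in range(20):
        scores = torch.einsum("bnd,bd->bn", hist, tgt)  # search-stage scoring, O(d_att * N)
        scores.topk(100, dim=-1)                        # keep the top-100 behaviors
    print(f"attention dim {d_att}: {time.perf_counter() - start:.3f}s")
```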

5. Experimental Setup

5.1. Datasets

  • Taobao & Tmall: These are large-scale, publicly available datasets from two of China's largest e-commerce platforms, Alibaba and Tmall. They contain anonymized user interaction data, where each interaction consists of an item ID and its corresponding category ID. These are standard benchmarks for long-sequence recommendation. The paper provides statistics in Table 2: Taobao is more complex with ~9.4k categories and ~4M items, while Tmall has ~1.5k categories and ~1M items.

    | Dataset | #Category | #Item | #User | #Active User |
    | --- | --- | --- | --- | --- |
    | Taobao | 9,407 | 4,068,790 | 984,114 | 603,176 |
    | Tmall | 1,492 | 1,080,666 | 423,862 | 246,477 |
  • Tencent Advertising Platform: A proprietary, industrial-scale dataset used for the online A/B test. This dataset is much larger and sparser, involving user behaviors from article and micro-video recommendations over two years, with sequences up to 6000 items long. This validates the model's effectiveness in a real-world, high-stakes environment.

5.2. Evaluation Metrics

  • AUC (Area Under the ROC Curve):

    1. Conceptual Definition: AUC measures the ability of a binary classifier to rank a randomly chosen positive sample higher than a randomly chosen negative sample. It provides a single score that summarizes the model's performance across all possible classification thresholds. An AUC of 1.0 represents a perfect classifier, while 0.5 represents a random guess.
    2. Mathematical Formula: It can be calculated as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. A common formula is based on the Wilcoxon-Mann-Whitney statistic: $$\text{AUC} = \frac{\sum_{i \in \text{positive class}} \sum_{j \in \text{negative class}} \mathbf{1}(\hat{y}_i > \hat{y}_j)}{|\text{positive class}| \cdot |\text{negative class}|}$$
    3. Symbol Explanation: $\hat{y}_i$ and $\hat{y}_j$ are the predicted scores for a positive instance $i$ and a negative instance $j$, respectively. $\mathbf{1}(\cdot)$ is an indicator function that equals 1 if the condition holds and 0 otherwise.
  • GMV (Gross Merchandise Value):

    1. Conceptual Definition: A key business metric in e-commerce and advertising that measures the total monetary value of goods sold through a platform over a specific period. A "GMV lift" in an A/B test means the new model (treatment) generated more sales revenue than the old model (control), providing direct evidence of business impact.
    2. Mathematical Formula: There is no standard formula, but it is computed as: $$\text{GMV} = \sum_{\text{all sold items}} \text{price of item}$$
    3. Symbol Explanation: N/A.
  • NDCG (Normalized Discounted Cumulative Gain):

    1. Conceptual Definition: A metric used to evaluate the quality of a ranked list of items. It gives higher scores if more relevant items are placed at the top of the list. "Discounted" refers to the fact that relevant items found lower down the list are penalized, and "Normalized" means the score is scaled to a range of 0 to 1 by dividing by the score of an ideal ranking.
    2. Mathematical Formula: $$\text{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \quad \text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k}$$
    3. Symbol Explanation: $k$ is the number of items in the ranked list. $rel_i$ is the graded relevance of the item at position $i$. $\text{IDCG}_k$ (Ideal DCG) is the DCG score of the perfect ranking up to position $k$.
  • GAUC (Group Area Under Curve):

    1. Conceptual Definition: An adaptation of AUC for recommendation systems that addresses user-level evaluation bias. It calculates AUC separately for each user (or another group, like item category) and then computes a weighted average of these individual AUCs, typically weighted by the number of impressions or clicks per user. This prevents users with a large number of interactions from dominating the overall metric.
    2. Mathematical Formula: $$\text{GAUC} = \frac{\sum_{u=1}^{U} w_u \cdot \text{AUC}_u}{\sum_{u=1}^{U} w_u}$$
    3. Symbol Explanation: $U$ is the total number of users (or groups). $\text{AUC}_u$ is the AUC computed only on the samples of user $u$. $w_u$ is the weight for user $u$ (e.g., the number of impressions).
  • Logloss (Logarithmic Loss):

    1. Conceptual Definition: Also known as binary cross-entropy, Logloss measures the performance of a classifier that outputs probabilities. It heavily penalizes predictions that are confident but wrong. Lower values are better.
    2. Mathematical Formula: $$\text{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
    3. Symbol Explanation: $N$ is the number of samples. $y_i$ is the true binary label (0 or 1). $\hat{y}_i$ is the model's predicted probability that the sample belongs to class 1.
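
To make the definitions above concrete, here is a small NumPy/scikit-learn sketch computing AUC, Logloss, and GAUC on made-up predictions (NDCG is illustrated later in the retrieval analysis):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

# Hypothetical predictions, labels, and per-sample user ids (made-up numbers).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.5, 0.8])
users = np.array([0, 0, 0, 1, 1, 1, 1, 1])

auc = roc_auc_score(y_true, y_pred)   # ranking quality over all samples
logloss = log_loss(y_true, y_pred)    # binary cross-entropy of the probabilities

# GAUC: impression-weighted average of per-user AUCs
# (users whose samples are all one class are skipped).
num, den = 0.0, 0
for u in np.unique(users):
    mask = users == u
    if len(np.unique(y_true[mask])) == 2:
        num += mask.sum() * roc_auc_score(y_true[mask], y_pred[mask])
        den += mask.sum()
gauc = num / den

print(f"AUC={auc:.4f}  GAUC={gauc:.4f}  Logloss={logloss:.4f}")
```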

5.3. Baselines

The paper compares DARE against a comprehensive set of baselines:

  • Standard Long-Sequence Models: ETA and SDIM, which are variants of the two-stage paradigm focused on efficiency.
  • Classic Short-Sequence Model: DIN, adapted for a long-sequence setting by adding a search stage.
  • State-of-the-Art Baseline: TWIN, the primary model to beat.
  • Ablation/Variant Baselines:
    • TWIN (hard): A simplified search that only retrieves items from the same category as the target.
    • TWIN (w/ proj.): The failed attempt to use linear projections for decoupling.
    • TWIN (w/o TR): The original TWIN without the target-aware representation ($e_i \odot v_t$), to show the importance of this feature.
    • TWIN-V2: A follow-up to TWIN specialized for video recommendation, adapted to the current task.
    • TWIN-4E: A variant that further decouples embeddings for history and target items, resulting in 4 separate embedding tables. This tests whether over-decoupling hurts by ignoring valuable prior knowledge (e.g., an item should be similar to itself whether it's in the history or the target).

6. Results & Analysis

6.1. Core Results Analysis

The main performance comparison is presented in Table 1. The following are the results from Table 1 of the original paper:

AUC (standard deviation in parentheses) under embedding dimensions 16, 64, and 128:

| Model | Taobao (dim 16) | Tmall (dim 16) | Taobao (dim 64) | Tmall (dim 64) | Taobao (dim 128) | Tmall (dim 128) |
| --- | --- | --- | --- | --- | --- | --- |
| ETA (2021) | 0.91326 (0.00338) | 0.95744 | 0.92300 (0.00079) | 0.96658 (0.00042) | 0.92480 (0.00032) | 0.96956 (0.00039) |
| SDIM (2022) | 0.90430 (0.00108) | 0.93516 | 0.90854 | 0.94110 | 0.91108 | 0.94298 |
| DIN (2018) | 0.90442 (0.0103) | 0.95894 (0.00069) | 0.90912 (0.00085) | 0.96194 (0.00093) | 0.91078 (0.00119) | 0.96428 (0.00081) |
| TWIN (2023) | 0.91688 (0.00060) | 0.95812 (0.0037) | 0.92636 (0.00092) | 0.96684 (0.00033) | 0.93116 (0.00054) | 0.97060 (0.00013) |
| TWIN (hard) | 0.91002 (0.00211) | 0.96026 (0.00073) | 0.91984 (0.00052) | 0.96448 (0.00039) | 0.91446 (0.00056) | 0.96712 (0.00005) |
| TWIN (w/ proj.) | 0.89642 (0.00053) | 0.96152 (0.00024) | 0.87176 (0.00048) | 0.95570 (0.00042) | 0.87990 (0.00055) | 0.95724 (0.00019) |
| TWIN (w/o TR) | 0.90732 (0.00351) | 0.96170 (0.00088) | 0.91590 (0.00437) | 0.96320 (0.00403) | 0.92060 (0.02022) | 0.96366 (0.00194) |
| TWIN-V2 (2024) | 0.89434 (0.00063) | 0.94714 (0.00057) | 0.90170 (0.00083) | 0.95378 (0.00032) | 0.90586 (0.00084) | 0.95732 (0.00103) |
| TWIN-4E | 0.90414 (0.00077) | 0.96124 (0.00110) | 0.90356 (0.00063) | 0.96372 (0.00037) | 0.90946 (0.00059) | 0.96016 (0.00045) |
| DARE (Ours) | 0.92568 (0.00025) | 0.96800 (0.00024) | 0.92992 (0.00046) | 0.97074 (0.00012) | 0.93242 (0.00045) | 0.97254 (0.00016) |
  • DARE's Superiority: DARE consistently outperforms all baselines across both datasets and all embedding dimensions. The improvements are significant in the context of recommendation systems, where a 1-2‰ (0.1-0.2%) AUC gain is considered substantial. The largest gain is 9‰ (0.9%) on Taobao with embedding dimension 16, which strongly supports the hypothesis that decoupling is most critical when model capacity is low.
  • Failure of Projections: TWIN (w/ proj.) performs much worse than the standard TWIN, confirming that linear projections are not an effective decoupling strategy in this domain.
  • Importance of Target-Aware Representation: TWIN generally outperforms TWIN (w/o TR), highlighting the benefit of the $e_i \odot v_t$ interaction.
  • Failure of Over-decoupling: TWIN-4E performs poorly, suggesting that completely separating history and target embeddings violates useful inductive biases in recommendation (i.e., an item should have a consistent identity).

6.2. Attention Accuracy

The paper provides a compelling qualitative and quantitative analysis of how well the learned attention scores match ground-truth correlations, measured by mutual information.

  • Qualitative Analysis (Figure 8): The authors visualize the ground-truth correlation (mutual information) between a target item and historical items based on their category and recency. The ground truth shows a strong pattern: items in the same category are highly relevant, and relevance decays over time. TWIN's learned attention (b) captures the temporal decay but overestimates the importance of recent items from irrelevant categories. In contrast, DARE's attention (c) much more closely resembles the ground truth (a), correctly capturing both the semantic (category) and temporal patterns.

    Figure 8: The ground truth (GT) and learned correlation between history behaviors of the top-10 most frequent categories (y-axis) at various positions (x-axis), with category 15 as the target. The three panels show the GT mutual information, TWIN's learned correlation, and DARE's learned correlation; DARE's correlation scores are noticeably closer to the ground truth.

  • Quantitative Analysis (Figure 9): The quality of retrieval in the search stage is measured using NDCG. DARE achieves a significantly higher NDCG score (0.8124) compared to TWIN (0.5545) and DIN (0.6382), representing a 46.5% relative improvement over TWIN. This provides strong quantitative evidence that DARE is much better at identifying and retrieving truly relevant items from the long user history. Case studies in (b) and (c) provide concrete examples where TWIN mistakenly retrieves recent but irrelevant items, while DARE correctly identifies older but more relevant items.

    Figure 9: Retrieval in the search stage. (a) Retrieval quality (NDCG) on Taobao, where DARE retrieves more correlated behaviors than the baselines. (b-c) Two showcases, with target categories 87 and 19, where the x-axis shows the categories of the ten most recent behaviors and their correlations.
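
As an illustration of how retrieval quality can be scored this way (a sketch assuming each history behavior comes with a ground-truth relevance value, e.g. the category-level mutual information used above; this is not the paper's exact evaluation code):

```python
import numpy as np

def retrieval_ndcg(attention_scores, gt_relevance, k=10):
    """Rank history behaviors by the model's attention scores and evaluate that
    ranking against ground-truth relevance with NDCG@k."""
    order = np.argsort(-np.asarray(attention_scores))[:k]
    gains = np.asarray(gt_relevance, dtype=float)[order]
    dcg = (gains / np.log2(np.arange(2, len(gains) + 2))).sum()
    ideal = np.sort(np.asarray(gt_relevance, dtype=float))[::-1][:k]
    idcg = (ideal / np.log2(np.arange(2, len(ideal) + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: relevance of five history behaviors (e.g., mutual-information-based).
gt = [0.9, 0.1, 0.8, 0.05, 0.7]
print(retrieval_ndcg([2.1, 0.3, 1.8, 0.2, 1.5], gt))  # attention aligned with relevance -> 1.0
print(retrieval_ndcg([0.2, 2.0, 0.1, 1.9, 0.3], gt))  # misaligned attention -> lower score
```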

6.3. Representation Discriminability

The authors measure how well the final aggregated user history vector $h$ can distinguish between positive and negative samples. They do this by clustering the vectors with K-means and then calculating the mutual information (MI) between the cluster assignments and the true labels. Higher MI means better discriminability.

  • Results (Figure 10): Across all cluster sizes, DARE consistently achieves higher MI than TWIN, indicating that its learned representations are more discriminative. This demonstrates that decoupling not only improves attention but also enhances the quality of the final representation used for prediction. The figure also visually separates the contributions: the target-aware representation ($e_i \odot v_t$) provides a large boost (orange gap), and decoupling (DARE vs. TWIN) provides a further significant boost (blue/green gaps).

    Figure 10: Representation discriminability of different models, measured by the mutual information between the quantized representations and the labels. The curves show how this mutual information changes with the number of clusters on Taobao and Tmall, and how the two decoupling choices affect discriminability.
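
A minimal sketch of this discriminability measurement (using scikit-learn; the representation matrix and labels below are random placeholders, so the resulting MI is near zero): quantize the aggregated history vectors $h$ with K-means and compute the mutual information between cluster assignments and click labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def representation_mi(h, labels, n_clusters=32, seed=0):
    """Quantize representations with K-means and measure MI(cluster id, label)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(h)
    return mutual_info_score(labels, clusters)

# Placeholder data: 5000 aggregated history vectors h with binary click labels.
rng = np.random.default_rng(0)
h = rng.normal(size=(5000, 128))
labels = rng.integers(0, 2, size=5000)
print(representation_mi(h, labels))  # near zero here; higher for discriminative representations
```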

6.4. Ablation Studies / Parameter Analysis

  • Convergence and Efficiency (Figure 11):

    • Faster Convergence: DARE converges much faster during training. On the Tmall dataset, it reaches a high accuracy level in about 450 iterations, while TWIN requires over 1300 iterations. This indicates that by resolving the conflicting gradients, the optimization process becomes much more direct and efficient.

    • Inference Speed-up: The paper shows that the attention embedding dimension in DARE can be reduced by 50% (from 128 to 64) with almost no drop in AUC. A 75% reduction (to 32) results in only a small, acceptable performance loss. This directly translates to a 50-75% speed-up in the most time-consuming search phase of online inference. In contrast, reducing the single embedding dimension in TWIN causes a severe performance drop.

      Figure 11: Efficiency during training and inference. (a-b) Validation accuracy over training iterations on Taobao and Tmall, where DARE clearly outperforms TWIN with less training data. (c-d) AUC under reduced search embedding dimensions (normalized search complexity), a key factor in online inference speed: the reduction barely affects DARE, while TWIN suffers an obvious performance loss.

  • Online A/B Testing: The deployment on Tencent's advertising platform yielded a 0.57% increase in cost-effectiveness and a 1.47% lift in GMV. This is an extremely strong result, as even fractional percentage point improvements in such large-scale systems translate to millions of dollars in annual revenue. This validates the practical value of the DARE model beyond academic metrics.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully identifies a critical and previously overlooked problem in long-sequence recommendation models: the interference caused by using a single embedding table for the distinct tasks of attention and representation. Through rigorous gradient analysis, the authors demonstrate that this leads to a learning conflict where representation objectives dominate and oppose attention objectives. They propose an elegant and effective solution, the DARE model, which uses two separate embedding tables to fully decouple these tasks. The comprehensive experimental results, both offline on public benchmarks and online in a massive industrial system, strongly validate DARE's superiority. It achieves state-of-the-art accuracy, learns better attention and representations, converges faster, and enables significant inference acceleration, making it a highly impactful contribution to the field.

7.2. Limitations & Future Work

The authors acknowledge a few limitations and areas for future research:

  • Theoretical Understanding: While they empirically show that linear projections fail at small embedding dimensions and hypothesize it's due to limited capacity, a deeper theoretical explanation for this phenomenon is still needed.
  • Anomalous Results: They note an unexplained result where the TWIN model without target-aware representation unexpectedly outperformed the standard TWIN on the Tmall dataset with a small embedding dimension. This suggests dataset-specific characteristics that are not fully understood.
  • Scope: The work focuses on the dominant two-stage paradigm. The applicability and potential benefits of embedding decoupling in emerging one-stage long-sequence models remain an open question for future exploration.

7.3. Personal Insights & Critique

  • Strengths:

    • Problem Finding: The paper's greatest strength is its identification of a simple, fundamental, yet impactful problem that was hiding in plain sight. This is a hallmark of excellent research.
    • Rigorous Analysis: The use of gradient analysis to diagnose the problem, the synthetic NLP experiment to support their hypothesis about projections, and the multi-faceted evaluation (accuracy, discriminability, efficiency, online A/B test) make the paper's claims extremely credible and well-supported.
    • Simplicity and Practicality: The proposed DARE model is a simple architectural change, yet it is highly effective. The added benefit of inference acceleration makes it very attractive for practical deployment in industrial systems where latency and computational cost are critical constraints. The reported 1.47% GMV lift is a testament to its real-world value.
  • Potential for Broader Application: The core idea of functional decoupling of embeddings is highly generalizable. It could be applied in any system where a single learned representation is "overloaded" with multiple, potentially conflicting tasks. For example:

    • Multi-modal models: An image embedding might be used for both image-text similarity matching and object classification. Decoupling these could improve performance on both tasks.

    • Graph Neural Networks: Node embeddings are often used for both link prediction and node classification. Separate embeddings could be beneficial here as well.

    • General Retrieval-then-Rank Systems: The embeddings used for fast, large-scale retrieval (first stage) have different requirements than those used for detailed, feature-rich re-ranking (second stage). DARE's principle could be directly applied.

      In conclusion, this paper is an exemplary piece of applied research that combines deep technical analysis with a practical and impactful solution, setting a new standard for how to design embedding architectures in long-sequence recommendation systems.
