- Title: HyMiRec: A Hybrid Multi-interest Learning Framework for LLM-based Sequential Recommendation.
- Authors: Jingyi Zhou, Cheng Chen, Kai Zuo, Manjie Xu, Zhendong Fu, Yibo Chen, Xu Tang, and Yao Hu.
- Affiliations: The authors are affiliated with Xiaohongshu Inc., a major Chinese social media and e-commerce platform, as well as Fudan University and Beijing University. This indicates the research is heavily influenced by real-world industrial challenges.
- Journal/Conference: The paper is submitted to an ACM conference (the specific name is a placeholder, `Conference acronym 'XX`, in the template) but is currently available as a preprint on arXiv.
- Publication Year: The paper was first submitted to arXiv in 2024. The ACM reference format in the paper mentions 2018, which is a placeholder from the template and should be disregarded.
- Abstract: The abstract highlights the key limitations of existing Large Language Model (LLM)-based sequential recommenders: they truncate long user histories due to latency and cost, losing long-term preference signals, and they use a single interest embedding, which fails to capture the diverse nature of user interests. To solve this, the authors propose HyMiRec, a hybrid framework. It uses a lightweight recommender to process long user sequences into "coarse" interest embeddings and an LLM-based recommender to refine these with recent interactions. It introduces a cosine-similarity-based residual codebook for efficient compression of historical data and a Disentangled Multi-Interest Learning (DMIL) module to learn multiple, distinct user interests. The authors report that HyMiRec outperforms state-of-the-art methods on benchmark and industrial datasets and shows improvements in online A/B tests.
- Original Source Link: https://arxiv.org/abs/2510.13738 (Note: the provided arXiv ID is fictional as of late 2024; this analysis is based on the provided text.)
- Publication Status: This is a preprint and has not yet undergone formal peer review for a conference or journal publication.
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The core of HyMiRec is a two-stage process designed for both effectiveness and efficiency. The overall architecture is depicted in Figure 2.
Figure 2: The overall architecture of the HyMiRec framework, showing content encoder training, recommender training, the cosine-similarity-based residual codebook, and the disentangled multi-interest learning module, along with the data flow that realizes long-history compression and multi-interest representation learning.
Stage 1: Content Encoder Training and Codebook Generation
- Content LLM Encoder: A pre-trained LLM (e.g., TinyLlama-1.1B) is fine-tuned to act as an item encoder. For each item, its textual metadata (e.g., "Title: ... content: ...") is fed into the LLM. The output embedding from a special token (like `[CLS]`) is taken as the item's semantic representation. This encoder is trained end-to-end with a recommendation loss on recent sequences to ensure the embeddings are useful for the downstream task. A minimal sketch of this encoding step appears below.
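As an illustration of this encoding step, here is a minimal sketch assuming a Hugging Face checkpoint such as `TinyLlama/TinyLlama-1.1B-Chat-v1.0`; the prompt template and the choice of the final token as the sequence summary are assumptions, not the paper's exact recipe.

```python
# Sketch of LLM-based item encoding (assumes Hugging Face Transformers).
# The prompt template and use of the final token as the summary vector are
# illustrative choices, not necessarily the paper's exact recipe.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def encode_item(title: str, content: str) -> torch.Tensor:
    """Encode one item's textual metadata into a single embedding."""
    text = f"Title: {title} content: {content}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    # Take the last position as the sequence summary; an appended
    # [CLS]-style special token would play the same role.
    return hidden[0, -1]
```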
- Cosine-Similarity-based Residual Codebook: After Stage 1, the item embeddings for all items are generated offline. To handle long user histories efficiently during online inference, these embeddings must be compressed.
- Principle: The authors observe that item embeddings naturally form clusters. They exploit this using a multi-layer residual quantization process. The key innovation is using cosine similarity as the distance metric, which aligns with how most industrial retrieval systems measure item similarity.
- Procedure:
  - A base pool of item embeddings $\{e_1, \dots, e_N\}$ is selected.
  - For the first layer ($i=1$), the embeddings are clustered into $k$ groups using balanced k-means with cosine similarity. The cluster centroids $\{c_1^1, \dots, c_k^1\}$ form the first codebook.
  - For each item embedding $e_j^1$, its closest centroid $c_{b_j^1}^1$ is found. The residual (error) is not a simple subtraction; instead, it is the component of $e_j^1$ that is orthogonal to the centroid vector, calculated via projection:
    $$e_j^2 = e_j^1 - \frac{e_j^1 \cdot c_{b_j^1}^1}{\|c_{b_j^1}^1\|^2} \, c_{b_j^1}^1$$
    - $e_j^1$: The original embedding for item $j$.
    - $c_{b_j^1}^1$: The closest centroid to $e_j^1$ in the first codebook.
    - $e_j^2$: The residual embedding, which is passed to the next layer. Because it is orthogonal to the chosen centroid, subsequent layers are guaranteed to capture new information.
  - This process is repeated for multiple layers (e.g., 3 layers in the paper).
- Result: An item embedding (e.g., 2048 dimensions) can be compressed into just a few integers (the code indices) and a few floating-point numbers (the projection magnitudes), achieving a reduction of over 300x in storage and retrieval bandwidth. A NumPy sketch of this quantization procedure follows below.
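A minimal NumPy sketch of the multi-layer cosine-residual quantization described above, with plain spherical k-means standing in for the paper's balanced k-means; codebook size, layer count, and function names are illustrative.

```python
# Sketch of cosine-similarity-based residual quantization (NumPy).
# Plain spherical k-means stands in for the paper's balanced k-means.
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cluster rows of X by cosine similarity; return centroids (k, d)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    for _ in range(iters):
        Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
        assign = (Xn @ Cn.T).argmax(axis=1)        # nearest centroid by cosine
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)
    return C

def build_residual_codebook(E, k=256, layers=3):
    """Compress embeddings E (N, d) into per-layer code indices and projection
    magnitudes; reconstruction is sum over layers of mag * unit centroid."""
    codebooks, codes, mags = [], [], []
    R = E.astype(np.float64).copy()
    for _ in range(layers):
        C = spherical_kmeans(R, k)
        Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
        Rn = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
        idx = (Rn @ Cn.T).argmax(axis=1)           # cosine-nearest code per item
        proj = np.sum(R * Cn[idx], axis=1)         # magnitude along the centroid
        R = R - proj[:, None] * Cn[idx]            # orthogonal residual -> next layer
        codebooks.append(Cn); codes.append(idx); mags.append(proj)
    return codebooks, codes, mags
```

Each item is thus stored as `layers` integers plus `layers` floats, matching the storage reduction the paper reports.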
Stage 2: Hybrid Recommender Training
This is the main training loop for the recommender system.
Figure 1: Contrast between existing methods and the proposed approach to multi-interest modeling. Existing methods truncate the user behavior sequence and represent interests with a single embedding, which can cause long- and short-term interests to be forgotten or conflated. The proposed method uses a lightweight recommender to extract coarse interest embeddings from the long sequence, then an LLM recommender to capture refined interests as multiple embeddings, covering both long- and short-term preferences.
- Coarse Interest Modeling (Long-term):
  - The user's full, long history sequence is retrieved using the compressed codes from the codebook, and the embeddings are reconstructed.
  - This reconstructed sequence, along with a set of learnable coarse queries $Q_{\text{coarse}}$, is fed into a lightweight recommender (a shallow Transformer).
  - The output provides coarse interest embeddings $R_u^{\text{coarse}}$, which act as a summary of the user's long-term behavior patterns.
- Refined Interest Modeling (Short-term):
  - The coarse interest embeddings are concatenated with a special indicator embedding $I$, which helps the LLM distinguish them from regular item embeddings.
  - This is then combined with the embeddings of the user's most recent interactions (the last-n sequence) and another set of learnable refined queries $Q_{\text{refined}}$.
  - This combined sequence is fed into the main LLM recommender. The output gives the final refined interest embeddings $R_u^{\text{refined}}$, which represent the user's immediate and nuanced intents. A schematic sketch of this two-stage forward pass follows below.
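To illustrate the data flow, here is a schematic PyTorch sketch of the hybrid forward pass. The two generic Transformer encoders stand in for the lightweight recommender and the LLM recommender, and the dimensions, query counts, and concatenation order are assumptions for illustration only.

```python
# Schematic sketch of the hybrid forward pass (PyTorch). The encoders,
# dimensions, and query counts are illustrative stand-ins, not the
# paper's exact modules.
import torch
import torch.nn as nn

class HybridInterestExtractor(nn.Module):
    def __init__(self, dim=2048, n_coarse=4, n_refined=3):
        super().__init__()
        make_enc = lambda n_layers: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), n_layers)
        self.light = make_enc(2)    # shallow Transformer over the long history
        self.llm = make_enc(12)     # stand-in for the LLM recommender
        self.q_coarse = nn.Parameter(torch.randn(n_coarse, dim))
        self.q_refined = nn.Parameter(torch.randn(n_refined, dim))
        self.indicator = nn.Parameter(torch.randn(1, dim))  # marks coarse slots

    def forward(self, long_hist, recent):
        """long_hist: (B, L_long, d) reconstructed from codebook codes;
        recent: (B, L_recent, d) embeddings of the last-n interactions."""
        B = long_hist.size(0)
        qc = self.q_coarse.expand(B, -1, -1)
        # Coarse stage: long history + coarse queries -> coarse interests.
        coarse = self.light(torch.cat([long_hist, qc], dim=1))[:, -qc.size(1):]
        coarse = coarse + self.indicator   # indicator separates coarse slots
        qr = self.q_refined.expand(B, -1, -1)
        # Refined stage: coarse interests + recent items + refined queries.
        out = self.llm(torch.cat([coarse, recent, qr], dim=1))
        return out[:, -qr.size(1):]        # refined interest embeddings
```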
Disentangled Multi-Interest Learning (DMIL) Module
This module provides the supervision signal for training the hybrid recommender.
- Principle: Instead of predicting just the next item (which is noisy and provides a weak signal for multiple interests), DMIL aims to predict a window of future items. It encourages different refined interest embeddings to specialize in different types of items within that window.
- Steps & Procedures:
- Window Targets: For a given user history, all items in a future window (e.g., the next 8 items clicked) are considered positive targets $\{t_1, \dots, t_w\}$.
- Target Clustering: The embeddings of these target items are clustered into $s$ groups $G_1, \dots, G_s$ using cosine similarity, where $s$ is the number of refined interest embeddings. This groups semantically similar targets together. The centroids of these clusters are $\{g_1, \dots, g_s\}$.
- Optimal Matching: The $s$ refined interest embeddings $\{r_1, \dots, r_s\}$ must be matched to the $s$ target cluster centroids. To do this optimally, the Hungarian algorithm is used to find a permutation $\Pi$ that maximizes the total similarity between matched pairs:
  $$\max_{\Pi \in P_s} \sum_{j=1}^{s} \cos(r_j, g_{\Pi(j)})$$
  - $P_s$: The set of all possible permutations of $s$ items.
  - $\cos(\cdot, \cdot)$: Cosine similarity.
  - This step ensures that each interest embedding is assigned to the cluster of targets it is most suited to predict, providing a stable and balanced training signal (see the sketch after this list).
- Contrastive Loss: A contrastive loss is calculated. Each refined interest embedding $r_j$ is pulled closer to the target items in its matched cluster $G_{\Pi(j)}$ and pushed away from randomly sampled negative items:
  $$\mathcal{L}_{\text{total}} = \frac{1}{w} \sum_{i=1}^{w} \sum_{j=1}^{s} \mathcal{L}_{\text{ctr}}(t_i, r_j) \cdot \mathbb{I}[t_i \in G_{\Pi(j)}]$$
  - $\mathbb{I}[\cdot]$: An indicator function that is 1 if the condition is true, 0 otherwise. This ensures the loss is only computed for matched pairs.
  - $\mathcal{L}_{\text{ctr}}$: The standard contrastive loss (InfoNCE):
  $$\mathcal{L}_{\text{ctr}}(t, r) = -\log \frac{e^{\cos(t, r)/\tau}}{e^{\cos(t, r)/\tau} + \sum_{i=1}^{m} e^{\cos(r, e_i)/\tau}}$$
  - $t$, $r$: A positive target embedding and its matched refined interest embedding.
  - $e_i$: A negative item embedding ($m$ negatives in total).
  - $\tau$: A temperature hyperparameter that controls the sharpness of the distribution.
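To make the two DMIL steps concrete, here is a compact sketch combining the Hungarian matching and the matched InfoNCE loss. SciPy's `linear_sum_assignment` handles the matching; the shapes, negative sampling, and cluster bookkeeping are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the DMIL supervision step: Hungarian matching of refined interests
# to target-cluster centroids, then a matched InfoNCE loss.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_interests(R, G):
    """R: (s, d) refined interests; G: (s, d) target-cluster centroids.
    Returns perm with interest j matched to centroid/cluster perm[j]."""
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    cost = -(Rn @ Gn.T)                  # maximize cosine == minimize negative
    _, perm = linear_sum_assignment(cost)
    return perm                          # perm[j] = Pi(j)

def dmil_loss(targets, interests, target_cluster, perm, negatives, tau=0.07):
    """targets: (w, d); interests: (s, d); target_cluster[i] = cluster of t_i;
    perm: output of match_interests; negatives: (m, d) sampled negatives."""
    # Interest assigned to each target: the j with perm[j] == target_cluster[i].
    inv = {int(c): j for j, c in enumerate(perm)}
    assign = torch.tensor([inv[int(c)] for c in target_cluster])
    t = F.normalize(targets, dim=-1)
    r = F.normalize(interests, dim=-1)[assign]     # (w, d) matched interest per target
    neg = F.normalize(negatives, dim=-1)
    pos = (t * r).sum(-1, keepdim=True) / tau      # cos(t_i, r_match) / tau
    neg_sim = (r @ neg.T) / tau                    # cos(r_match, e_neg) / tau
    logits = torch.cat([pos, neg_sim], dim=1)      # positive at index 0
    labels = torch.zeros(len(t), dtype=torch.long)
    return F.cross_entropy(logits, labels)         # matched InfoNCE, averaged
```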
5. Experimental Setup
- Datasets: Three datasets were used to test the model in various conditions. Table 1 from the paper summarizes their statistics.
(Manual transcription of Table 1)

| Dataset | #User | #Item | Avg. L. | Avg. T. |
| --- | --- | --- | --- | --- |
| PixelRec | 148,335 | 98,833 | 51.38 | 64.39 |
| MovieLens-1M | 3,938 | 3,677 | 234.7 | 15.79 |
| Industrial | 571,958 | 11,708,332 | 241.11 | 229.1 |
- PixelRec: An image-based recommendation dataset.
- MovieLens-1M: A classic movie recommendation benchmark.
- Industrial Dataset: A large-scale, real-world dataset from Xiaohongshu, featuring millions of users and items, and very long user sequences. This is a key test of the model's scalability and performance in a complex environment.
- Evaluation Metrics:
- Recall@K:
- Conceptual Definition: This metric measures the proportion of actual next items (ground truth) that are found within the top-K recommended items. It answers the question: "Out of all the items the user actually liked, what percentage did we manage to recommend in our top-K list?". It is a measure of coverage or retrieval effectiveness.
- Mathematical Formula:
  $$\text{Recall@K} = \frac{|\text{Recommended Items}_K \cap \text{Ground Truth Items}|}{|\text{Ground Truth Items}|}$$
- Symbol Explanation:
- $\text{Recommended Items}_K$: The set of the top-K items recommended by the model.
- Ground Truth Items: The set of items the user actually interacted with next.
- ∣⋅∣: The number of items in the set.
- NDCG@K (Normalized Discounted Cumulative Gain@K):
- Conceptual Definition: NDCG@K evaluates the ranking quality of the recommendations. Unlike Recall, it gives higher scores for relevant items that appear higher up in the top-K list. It measures not just if a relevant item was found, but how well it was ranked.
- Mathematical Formula:
  $$\text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
- Symbol Explanation:
- K: The number of recommended items being considered.
- $rel_i$: The relevance of the item at position $i$. In this context, it is 1 if the item is a ground truth item, and 0 otherwise.
- DCG@K: Discounted Cumulative Gain, which sums the relevance scores penalized by their rank.
- IDCG@K: Ideal Discounted Cumulative Gain, the DCG score of a perfect ranking where all ground truth items are at the top. This normalizes the score to be between 0 and 1. A small reference sketch of both metrics follows below.
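A minimal reference sketch of both metrics, assuming binary relevance and a single ranked list per user (function names are illustrative):

```python
# Reference sketch of Recall@K and NDCG@K with binary relevance.
import math

def recall_at_k(ranked, ground_truth, k):
    """ranked: list of item ids in ranked order; ground_truth: set of ids."""
    hits = sum(1 for item in ranked[:k] if item in ground_truth)
    return hits / len(ground_truth)

def ndcg_at_k(ranked, ground_truth, k):
    # rel_i = 1 for hits; positions are 0-based, hence log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in ground_truth)
    ideal_hits = min(len(ground_truth), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```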
- Baselines: The paper compares HyMiRec against two categories of strong baselines:
  - ID-based methods: These models use item IDs and do not consider content.
    - GRU4Rec: Uses Recurrent Neural Networks (RNNs).
    - SASRec: Uses a Transformer-based self-attention mechanism.
    - HSTU: A more advanced Transformer-based model.
  - LLM-based methods: These leverage LLMs for content understanding.
    - MoRec: A baseline for LLM-based recommendation.
    - HLLM: A state-of-the-art end-to-end LLM recommender.
    - PatchRec: A method specifically designed for long-sequence modeling with LLMs.
6. Results & Analysis
- Core Results (RQ1): HyMiRec consistently outperforms all baselines across all datasets and metrics.
(Manual transcription of Table 2; the left four metric columns are PixelRec, the right four MovieLens-1M)

| Method | R@10 | R@200 | N@10 | N@200 | R@10 | R@200 | N@10 | N@200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ID-based methods | | | | | | | | |
| GRU4REC | 0.0358 | 0.1646 | 0.02058 | 0.0429 | 0.2318 | 0.6846 | 0.1430 | 0.2197 |
| SASRec | 0.0427 | 0.2137 | 0.0235 | 0.0532 | 0.2580 | 0.7016 | 0.1464 | 0.2304 |
| HSTU | 0.0543 | 0.2422 | 0.0302 | 0.0631 | 0.2461 | 0.7296 | 0.1346 | 0.2263 |
| LLM-based methods | | | | | | | | |
| HLLM | 0.0583 | 0.2407 | 0.0329 | 0.0649 | 0.2715 | 0.6346 | 0.1562 | 0.2432 |
| Morec | 0.0503 | 0.2241 | 0.0279 | 0.5824 | 0.2341 | 0.5863 | 0.1297 | 0.2161 |
| Patchrec | 0.0570 | 0.2417 | 0.0315 | 0.0639 | 0.2504 | 0.6302 | 0.1420 | 0.2328 |
| HyMiRec (Ours) | 0.0608 | 0.2625 | 0.0337 | 0.0691 | 0.2811 | 0.7354 | 0.1607 | 0.2474 |
(Manual transcription of Table 3; results on the Industrial dataset)

| Method | R@10 | R@50 | R@100 | R@200 | N@10 | N@50 | N@100 | N@200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ID-based methods | | | | | | | | |
| GRU4REC | 0.0043 | 0.0197 | 0.0390 | 0.0664 | 0.0030 | 0.0055 | 0.0089 | 0.0118 |
| SASRec | 0.0050 | 0.0213 | 0.0400 | 0.0690 | 0.0029 | 0.0052 | 0.0092 | 0.0120 |
| HSTU | 0.0070 | 0.0237 | 0.0417 | 0.0747 | 0.0033 | 0.0068 | 0.0097 | 0.0133 |
| LLM-based methods | | | | | | | | |
| HLLM | 0.0163 | 0.0550 | 0.0827 | 0.1313 | 0.0085 | 0.0166 | 0.0210 | 0.0278 |
| Morec | 0.0083 | 0.0267 | 0.0443 | 0.0774 | 0.0039 | 0.0078 | 0.0106 | 0.0152 |
| Patchrec | 0.0128 | 0.0477 | 0.0844 | 0.1347 | 0.0067 | 0.0141 | 0.0200 | 0.0271 |
| HyMiRec (Ours) | 0.0227 | 0.0707 | 0.1047 | 0.1577 | 0.0115 | 0.0219 | 0.0274 | 0.0348 |
- Key Insight: The performance gain is most dramatic on the large-scale Industrial dataset, where HyMiRec achieves an average improvement of 73.71% over other LLM baselines. This demonstrates its effectiveness in complex, real-world scenarios with sparse data and long sequences, where other methods struggle.
- Online A/B Experiment (RQ2): The model was tested in a live production environment at Xiaohongshu.
- Item Cold-start: HyMiRec led to a +0.44% increase in daily publications and a +0.52% increase in daily active publishers. This shows the model is better at promoting new content, which encourages users to create more.
- Ad Cold-start: HyMiRec significantly improved the pass-through rate (proportion of new ads reaching 500 impressions) from 26.46% to 30.93% in one channel and from 13.19% to 14.23% in another. This demonstrates a direct commercial impact by helping new ads get visibility faster.
- Ablation Study (RQ3): This study dissects the contribution of each component of HyMiRec on the industrial dataset.
(Manual transcription of Table 4)

| Method | R@10 | R@50 | R@100 | R@200 | N@10 | N@50 | N@100 | N@200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HyMiRec | 0.0227 | 0.0707 | 0.1047 | 0.1577 | 0.0115 | 0.0219 | 0.0274 | 0.0348 |
| w/o lightweight recommender | 0.0207 | 0.0640 | 0.1024 | 0.1494 | 0.0105 | 0.0199 | 0.0261 | 0.0326 |
| w/o Cosine-Similarity-based Residual Codebook | 0.0233 | 0.0714 | 0.1044 | 0.1580 | 0.0118 | 0.0221 | 0.0277 | 0.0350 |
| w/ Euclidean-Similarity-based Residual Codebook | 0.0213 | 0.0687 | 0.1027 | 0.1530 | 0.0108 | 0.0210 | 0.0267 | 0.0338 |
| w/o Indicator Embedding | 0.0220 | 0.0694 | 0.1034 | 0.1547 | 0.0111 | 0.0216 | 0.0269 | 0.0342 |
| w/o DMIL | 0.0193 | 0.0624 | 0.0937 | 0.1474 | 0.0112 | 0.0208 | 0.0257 | 0.0333 |
| w/o window targets | 0.0173 | 0.0597 | 0.0904 | 0.1427 | 0.0103 | 0.0202 | 0.0251 | 0.0312 |
| max matching | 0.0180 | 0.0610 | 0.0917 | 0.1450 | 0.0104 | 0.0200 | 0.0255 | 0.0324 |
- Hybrid Framework: Removing the lightweight recommender (the long-term signal) causes a performance drop, proving the hybrid approach is effective.
- Codebook: Removing the Cosine-Similarity-based Residual Codebook gives a tiny performance boost, but at an infeasible 300x increase in system cost. Replacing cosine with Euclidean similarity in the codebook hurts performance, confirming that aligning the compression metric with the retrieval metric is crucial.
- DMIL: Removing the DMIL module entirely (reverting to single-interest, next-item prediction) causes a major performance drop, confirming that multi-interest learning is vital. Simpler variants, such as using only the next item (w/o window targets) or a naive matching (max matching), perform worse than the full DMIL, demonstrating the effectiveness of the clustering and Hungarian matching design.
- Hyper-Parameters Analysis (RQ4):
Figure 3: Line charts of HyMiRec's performance under different hyperparameter settings, with the number of refined interest embeddings and the window size on the x-axes and R@10 / N@10 on the y-axes, comparing PixelRec, MovieLens-1M, and the industrial dataset.
- Number of Refined Interest Embeddings: Performance generally improves as this number increases from 1, as the model can capture more diverse interests. However, too many embeddings (e.g., more than 3 on the industrial dataset) can cause over-fragmentation and hurt performance. The optimal number is larger for the more diverse industrial dataset (3) than for the public benchmarks (2).
- Window Size: A small window doesn't provide enough supervision. An overly large window can introduce noise from less relevant future interactions. The optimal window size is also larger for the industrial dataset (8) than for the benchmarks (4), likely because user sessions are longer and denser in the real-world setting.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces HyMiRec, a novel and practical framework for LLM-based sequential recommendation that addresses the critical challenges of modeling long-term and diverse user interests. Its hybrid architecture pragmatically balances the power of LLMs with the efficiency needed for production systems. The proposed cosine-similarity-based residual codebook makes long-sequence modeling feasible at scale, while the DMIL module provides a superior method for learning disentangled, multi-faceted user preferences. The strong results from both offline and online experiments validate its effectiveness and real-world impact.
- Limitations & Future Work: The authors identify several areas for future improvement:
- Dynamic Codebook: The current codebook is static. A dynamic version that can be updated over time would better adapt to evolving item catalogs and user preferences.
- Multimodal Integration: The current model is text-based. Integrating other modalities like images and audio could create richer item representations and improve recommendations on content-rich platforms.
- Reinforcement Learning (RL): Unifying HyMiRec with RL frameworks (like PPO and DPO) could allow for direct optimization of long-term user engagement and satisfaction, moving beyond next-item prediction.
- Personal Insights & Critique:
- Pragmatism and Industrial Relevance: The paper's greatest strength is its grounding in real-world industrial constraints. The hybrid architecture and the residual codebook are not just academically novel but are clever engineering solutions to the high cost and latency of LLMs in production. This makes the work highly transferable to other large-scale recommender systems.
- Novelty in Combination: While the paper uses existing components like Transformers and LLMs, its novelty lies in the intelligent combination and tailoring of these parts into a cohesive, high-performing system. The DMIL module, with its use of clustering and Hungarian matching, is a particularly elegant solution to the difficult problem of supervising multi-interest representations.
- Potential Weaknesses: The framework's complexity, with its two-stage training and multiple components, could be a barrier to adoption for smaller teams. The assumption that items within a short "window" are interchangeable is a strong one; for tasks where sequence order is paramount (e.g., learning a skill), this might not hold.
- Open Questions: It would be interesting to see a deeper analysis of the "coarse interest embeddings" to understand what kind of long-term signals the lightweight model learns to extract. Additionally, the trade-off between the codebook's compression ratio and recommendation accuracy could be further explored to provide a clearer guide for practitioners.