HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation
TL;DR Summary
The HPSERec framework addresses the long-tail problem in sequential recommendation systems by hierarchically partitioning items and employing stepwise enhancement strategies to improve tail item representation, outperforming existing methods through expert networks and optimal transport.
Abstract
The long-tail problem in sequential recommender systems stems from imbalanced interaction data, resulting in suboptimal model performance for tail users and items. Recent studies have leveraged head data to enhance tail data and diminish the impact of the long-tail problem. However, these methods often adopt ad-hoc strategies to distinguish between head and tail data, which fails to capture the underlying distributional characteristics and structural properties of each category. Moreover, because a substantial representational gap exists between head and tail data, head-to-tail enhancement strategies are susceptible to negative transfer, often leading to a decline in overall model performance. To address these issues, we propose a hierarchical partitioning and stepwise enhancement framework, called HPSERec, for long-tailed sequential recommendation. HPSERec partitions the item set into subsets based on a data imbalance metric, assigning an expert network to each subset to capture user-specific local features. Subsequently, we apply knowledge distillation to progressively improve long-tail interest representation, followed by a Sinkhorn optimal transport-based feedback module, which aligns user representations across expert levels through a globally optimal and softly matched mapping. Extensive experiments on three real-world datasets demonstrate that HPSERec consistently outperforms all baseline methods. The implementation code is available at https://github.com/bolunxier123/HPSERec.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation
1.2. Authors
Xiaolong Xu, Xudong Zhao, Haolong Xiang, Xuyun Zhang, Wei Shen, Hongsheng Hu, Lianyong Qi
- School of Software, Nanjing University of Information Science and Technology, China
- School of Computing, Macquarie University, Australia
- Independent Researcher, China
- School of Information and Physical Sciences, University of Newcastle, Australia
- CoCu ScinTecoly (Institution name appears truncated/typoed in source), China
1.3. Journal/Conference
NeurIPS 2024 (The Thirty-eighth Annual Conference on Neural Information Processing Systems)
- Note: The paper includes a "NeurIPS Paper Checklist" and references the "NeurIPS Code of Ethics," indicating it is a submission to or accepted paper at this prestigious top-tier AI conference.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the long-tail problem in Sequential Recommender Systems (SRS), where a few popular items (head) dominate interactions, leaving the majority of items (tail) poorly recommended. The authors criticize existing methods for using arbitrary "head vs. tail" binary splits and for causing "negative transfer" (noise) when simply augmenting tail data with head data. They propose HPSERec, a framework that:
- Hierarchically partitions items into multiple subsets based on a novel imbalance metric.
- Assigns an expert network to each subset.
- Uses Stepwise Enhancement via a feedforward module (Knowledge Distillation) to transfer tail knowledge to global experts.
- Uses a feedback module based on Sinkhorn Optimal Transport to refine tail experts using global context. Experiments show HPSERec outperforms state-of-the-art baselines on three real-world datasets.
1.6. Original Source Link
2. Executive Summary
2.1. Background & Motivation
- The Problem: In recommender systems, user interactions follow a power-law distribution. A small fraction of "Head" items receive most interactions, while the "Tail" items (the vast majority) are rarely consumed. This long-tail problem leads to poor recommendation performance for tail items and tail users, reducing platform diversity and user retention.
- Existing Limitations:
- Arbitrary Partitioning: Current methods typically split data into "Head" (e.g., top 20%) and "Tail" (remaining 80%) based on fixed ratios. This ignores the specific distribution shape of different datasets (e.g., MovieLens is much more skewed than Yelp).
- Flawed Augmentation & Negative Transfer: Techniques often try to "borrow" data from head items to train tail items. However, since head and tail items have vastly different interaction densities and patterns, this direct transfer often introduces noise (negative transfer), confusing the model rather than helping it.
2.2. Main Contributions / Findings
- Imbalance-Aware Partitioning: The paper introduces a new metric to quantify data imbalance and a dynamic programming algorithm to optimally partition items into multiple balanced subsets, rather than a simple binary split.
- HPSERec Framework: A novel architecture employing a Mixture of Experts (MoE) strategy. It assigns specific experts to different subsets (ranging from tail-focused to global-focused) and trains them sequentially.
- Feedforward: Uses Knowledge Distillation to transfer detailed local preferences (from tail experts) to global experts.
- Feedback: Uses Optimal Transport (Sinkhorn Algorithm) to feed global context back to tail experts, aligning their representations without enforcing rigid one-to-one matching.
- Superior Performance: HPSERec consistently outperforms existing baselines, including Large Language Model (LLM) based recommenders, on datasets like Yelp, Amazon Beauty, and Amazon Music. It significantly boosts tail item performance without degrading head item accuracy.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Sequential Recommendation (SRS): A type of recommender system that predicts the next item a user will interact with based on their chronological sequence of past interactions. Models like SASRec (Self-Attentive Sequential Recommendation) and BERT4Rec are standard backbones, using Transformer architectures to learn sequence dependencies.
- Long-Tail Distribution: A statistical pattern where a small number of events (head) occur frequently, while a large number of events (tail) occur rarely. In recommendations, this means the model learns popular items well but fails on niche items.
- Mixture of Experts (MoE): A machine learning technique where a model is composed of multiple sub-networks ("experts"), each specializing in a different part of the input space. A gating mechanism usually decides which expert to use. Here, experts are assigned to different item subsets.
- Knowledge Distillation (KD): A training method where a small "student" model is trained to reproduce the output (logits) of a larger or more specialized "teacher" model. It transfers "knowledge" by matching the probability distributions.
- Optimal Transport (OT) & Sinkhorn Algorithm: OT is a mathematical framework for calculating the most efficient way to transform one probability distribution into another. It provides a robust distance metric (Wasserstein distance). The Sinkhorn algorithm is a fast, iterative method to approximate this transport plan, often used to align feature spaces (embeddings) between different domains or models.
3.2. Previous Works
- SRS Baselines:
- SASRec (2018): Uses self-attention (like Transformer) to capture long-term dependencies in user sequences.
- BERT4Rec (2019): Uses a bidirectional Transformer (like BERT) with a Cloze (fill-in-the-blank) objective.
- Long-Tail Solutions:
- CITIES (2020): Enhances tail item embeddings using context from head items via attention.
- MELT (2023): "Mutual Enhancement of Long-Tailed user and item." It addresses both user and item tails simultaneously.
- LLM-based Approaches:
- RLMRec (2024), LLMInit (2023): Use Large Language Models (like GPT) to generate rich text-based profiles or initialize embeddings to compensate for sparse interaction data.
3.3. Differentiation Analysis
- Beyond Binary Splits: Unlike CITIES or MELT which typically use a static 20/80 split, HPSERec uses an adaptive, metric-driven partitioning to create multiple hierarchical subsets.
- Stepwise vs. Direct Transfer: Instead of directly using head data to augment tail (which causes negative transfer), HPSERec uses a stepwise chain of experts. Tail experts teach global experts (Feedforward), and global experts refine tail experts (Feedback).
- Distribution-Level Alignment: By using Optimal Transport (Sinkhorn) for feedback, HPSERec aligns user representations at a distributional level, which is more flexible and robust than the rigid point-to-point alignment used in previous methods.
4. Methodology
4.1. Principles
The core philosophy of HPSERec is "Divide, Specialize, and Align."
- Divide: The dataset is too imbalanced to be treated as a whole. It should be partitioned based on interaction density.
- Specialize: Different neural networks (experts) should specialize in different partitions (e.g., one focuses on tail patterns, another on global patterns).
- Align: These experts shouldn't work in isolation. They should exchange knowledge: "Tail" experts provide specific interest signals to "Global" experts, and "Global" experts provide broad context back to "Tail" experts to regularize them.
The following figure (Figure 2 from the original paper) illustrates this architecture:
This figure shows the overall architecture of HPSERec. The distribution balancing module partitions the item set into subsets to mitigate the long-tail effect and improve training. The feedforward module enhances long-tail item representations and refines global expert performance through knowledge distillation. The feedback module uses the Sinkhorn optimal transport algorithm to transfer head-item information to the long-tail experts, broadening their receptive field.
4.2. Distribution Balancing Module
Goal: Partition the sorted item set (sorted by popularity) into subsets such that the "imbalance" within each subset is minimized.
Step 1: Metric Definition. First, the authors define a normalized interaction probability for an item $i$ within a subset $V_k$: $ p_i^{(k)} = \frac{c_i}{\sum_{j \in V_k} c_j} $ where $c_i$ is the interaction count of item $i$.
To measure how "skewed" or concentrated the interactions are in this subset, they define a non-linear aggregation functional $\Phi_k$: $ \Phi_k = \left( \sum_{i \in V_k} \left( p_i^{(k)} \right)^\alpha \right)^{\frac{1}{1-\alpha}} $
- Symbol Explanation:
- $\Phi_k$: The dispersion/concentration score. Lower is better (more uniform).
- $\alpha$: A hyperparameter controlling sensitivity; smaller values of $\alpha$ make the score more sensitive to tail items, while larger values make it more sensitive to head items.
Step 2: Size Regularization. To prevent the algorithm from creating tiny or huge subsets, a regularization term $B_k$ penalizes deviation from the average subset size: $ B_k = \left( \frac{S_k - \mu}{\mu} \right)^2 $
- Symbol Explanation:
- $S_k$: The number of items in subset $V_k$.
- $\mu$: The ideal average size of a subset.
Step 3: Optimization Objective. The overall imbalance score for a subset combines the skewness and the size penalty: $ \mathcal{I}(V_k) = \log \Phi_k + \gamma B_k $
- $\gamma$: Weight controlling the strength of the size penalty.
The final goal is to find partition points (thresholds) $\Theta$ that minimize the sum of scores across all $L$ subsets: $ \min_{\Theta} \sum_{k=1}^{L} \mathcal{I}(V_k) $ The authors solve this using a Dynamic Programming algorithm (Algorithm 1 in the paper).
Step 4: Expert Assignment. Crucially, the subsets assigned to experts are cumulative: $ V_k^{expert} = \bigcup_{i=1}^{k} V_i' $
- Expert 1 sees only the first subset (most tail-focused).
- Expert $L$ sees the union of all subsets (the global dataset). This creates a hierarchy where Expert 1 is the most specialized on local/tail features, and Expert $L$ is the most general.
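To make the partitioning concrete, here is a minimal Python sketch of Steps 1-4: it scores a candidate subset with $\mathcal{I}(V_k)$ and splits the popularity-sorted items into contiguous subsets with a direct interval dynamic program, then forms the cumulative expert subsets. This is an illustrative reconstruction under stated assumptions, not the paper's Algorithm 1; the function names, default hyperparameter values, and the $O(L \cdot n^2)$ DP formulation are assumptions.

```python
import numpy as np

def imbalance_score(counts, mu, alpha=0.5, gamma=1.0):
    """I(V_k) = log(Phi_k) + gamma * B_k for one candidate subset.

    counts: interaction counts of the items in the subset.
    mu: target (average) subset size. Default alpha/gamma are illustrative.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                            # p_i^{(k)}
    phi = np.sum(p ** alpha) ** (1.0 / (1.0 - alpha))    # skewness functional Phi_k
    b = ((len(counts) - mu) / mu) ** 2                   # size-deviation penalty B_k
    return np.log(phi) + gamma * b

def partition_items(sorted_counts, num_subsets, alpha=0.5, gamma=1.0):
    """Split popularity-sorted items (tail first, so subset 1 is the most
    tail-focused) into contiguous subsets minimizing the summed imbalance
    scores, via a direct O(L * n^2) interval dynamic program."""
    n = len(sorted_counts)
    mu = n / num_subsets
    INF = float("inf")
    # dp[k][i]: best total score covering the first i items with k subsets
    dp = [[INF] * (n + 1) for _ in range(num_subsets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_subsets + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_subsets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):          # last subset covers items j..i-1
                if dp[k - 1][j] == INF:
                    continue
                score = dp[k - 1][j] + imbalance_score(
                    sorted_counts[j:i], mu, alpha, gamma)
                if score < dp[k][i]:
                    dp[k][i], cut[k][i] = score, j
    # backtrack to recover the partition boundaries
    bounds, i = [], n
    for k in range(num_subsets, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    subsets = [list(range(a, b)) for a, b in reversed(bounds)]
    # cumulative assignment: expert k trains on the union of subsets 1..k
    expert_sets = [sum(subsets[: k + 1], []) for k in range(num_subsets)]
    return subsets, expert_sets
```

For example, `partition_items(counts_sorted_ascending, num_subsets=4)` would return four contiguous subsets of item indices plus the four cumulative training sets, with the last set covering the whole catalogue.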
4.3. Feedforward Module (Tail → Global)
Goal: Train the experts sequentially from specialized (tail) to general (global), transferring knowledge to ensure the global expert doesn't ignore the tail.
Mechanism: Knowledge Distillation (KD). The model treats the upstream expert (more tail-focused) as a "Teacher" and the downstream expert (more global) as a "Student". The student tries to match the teacher's output distribution on the overlapping items.
The distillation loss minimizes the KL-divergence between their softened predictions: $ \mathcal{L}_{KD} = \frac{1}{|\mathcal{V}_i|} \sum_{v \in \mathcal{V}_i} \mathrm{KL} \Big( \mathrm{Softmax}(z_v^{i-1} / \tau) \ \| \ \mathrm{Softmax}(z_v^{i} / \tau) \Big) $
- Symbol Explanation:
- $z_v^{i-1}$, $z_v^{i}$: The output logits (prediction scores) from expert $i-1$ and expert $i$ for item $v$.
- $\tau$: Temperature parameter. A higher $\tau$ creates softer probability distributions, revealing more of the "dark knowledge" (relationships between classes) than hard labels.
- $\mathrm{KL}$: Kullback-Leibler divergence, measuring the difference between two probability distributions.
Total Feedforward Loss: The expert is trained on a combined loss: standard recommendation loss (e.g., cross-entropy) plus the distillation loss. $ \mathcal{L}_{\mathrm{forw}} = \mathcal{L}_{\mathrm{rec}} + \beta \mathcal{L}_{KD} $
- $\beta$: Weight balancing recommendation accuracy and knowledge transfer.
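A minimal PyTorch-style sketch of how this combined feedforward loss could be computed for one batch is shown below. The function name and the default values of `beta` and `tau` are illustrative assumptions, and the common $\tau^2$ rescaling of the KD term is omitted to match the formula as written.

```python
import torch
import torch.nn.functional as F

def feedforward_loss(student_logits, teacher_logits, targets, beta=0.5, tau=2.0):
    """Sketch of L_forw = L_rec + beta * L_KD for one batch of next-item logits.

    teacher_logits come from the upstream (more tail-focused) expert i-1,
    student_logits from the downstream expert i, over the shared item set.
    """
    # standard next-item recommendation loss (cross-entropy over items)
    rec_loss = F.cross_entropy(student_logits, targets)

    # temperature-softened distillation term: KL(teacher || student)
    teacher_probs = F.softmax(teacher_logits.detach() / tau, dim=-1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    return rec_loss + beta * kd_loss
```

Here `targets` would hold the index of each user's ground-truth next item; the teacher logits are detached so that gradients only update the downstream (student) expert.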
4.4. Feedback Module (Global → Tail)
Goal: After the global experts are trained, their broader context (which captures head item patterns well) is sent back to refine the tail experts. This prevents tail experts from overfitting to sparse data.
Mechanism: Sinkhorn Optimal Transport. Instead of forcing a user's representation in expert $t$ to be exactly the same as in expert $t+1$ (which is too rigid), the model aligns the distributions of user embeddings.
Step 1: Cost Matrix. Define the cost (distance) between user embedding $u_i^t$ (from expert $t$) and $u_j^{t+1}$ (from expert $t+1$) using cosine distance: $ C_{ij} = 1 - \cos(u_i^t, u_j^{t+1}) $
Step 2: Optimal Transport Objective. Find a transport plan $\gamma$ (a matrix describing how much mass to move from $u_i^t$ to $u_j^{t+1}$) that minimizes the total cost, regularized by the entropy $H(\gamma) = -\sum_{i,j} \gamma_{ij} \log \gamma_{ij}$: $ \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i,j} \gamma_{ij} C_{ij} - \varepsilon H(\gamma) $
- Symbol Explanation:
- $\gamma_{ij}$: The amount of "probability mass" mapped from user $i$ in the expert-$t$ space to user $j$ in the expert-$(t+1)$ space; $\Pi(\mu, \nu)$ is the set of plans with marginals $\mu$ and $\nu$.
- $\varepsilon$: Regularization parameter. The entropy term makes the problem solvable with fast Sinkhorn iterations and makes the mapping "soft" (one-to-many alignment is allowed).
Step 3: Feedback Loss. The loss is simply the expected transport cost under the optimal plan: $ \mathcal{L}_{\mathrm{back}} = \sum_{i,j} \gamma_{ij} C_{ij} $ Minimizing this pulls the upstream (tail) user embeddings structurally closer to the downstream (global) embeddings, improving consistency.
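The sketch below illustrates how such a feedback loss could be computed with log-domain Sinkhorn iterations, assuming uniform marginals over the two sets of user embeddings. The function name, the choice of `eps`, the number of iterations, and the decision to freeze the transport plan before computing the loss are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def sinkhorn_feedback_loss(u_tail, u_global, eps=0.05, n_iters=50):
    """Sketch of L_back = sum_ij gamma_ij * C_ij with gamma from entropic OT.

    u_tail:   user embeddings from the upstream (tail) expert, shape (n, d)
    u_global: user embeddings from the downstream (global) expert, shape (m, d)
    """
    # cosine-distance cost matrix C_ij = 1 - cos(u_i^t, u_j^{t+1})
    cost = 1.0 - F.normalize(u_tail, dim=-1) @ F.normalize(u_global, dim=-1).T
    n, m = cost.shape
    log_mu = torch.log(cost.new_full((n,), 1.0 / n))   # uniform source marginal
    log_nu = torch.log(cost.new_full((m,), 1.0 / m))   # uniform target marginal

    # compute the soft transport plan without tracking gradients
    with torch.no_grad():
        C = cost.detach()
        f = C.new_zeros(n)                              # dual potentials
        g = C.new_zeros(m)
        for _ in range(n_iters):
            # log-domain Sinkhorn updates enforcing the row/column marginals
            f = eps * (log_mu - torch.logsumexp((g[None, :] - C) / eps, dim=1))
            g = eps * (log_nu - torch.logsumexp((f[:, None] - C) / eps, dim=0))
        # gamma_ij = exp((f_i + g_j - C_ij) / eps)
        gamma = torch.exp((f[:, None] + g[None, :] - C) / eps)

    # feedback loss: expected transport cost under the fixed plan, so
    # gradients flow through the cost matrix into the embeddings
    return torch.sum(gamma * cost)
```

Because the plan is soft, each tail-expert user embedding can be pulled toward a weighted mixture of global-expert embeddings rather than being pinned to a single counterpart.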
5. Experimental Setup
5.1. Datasets
The experiments use three real-world datasets, pre-processed to exclude users with fewer than 5 interactions.
- Amazon Beauty: E-commerce data. 57k items, 52k users. High sparsity.
- Yelp: Business reviews. 15k items, 4k users.
- Amazon Music: Digital music. 20k items, 20k users.
5.2. Evaluation Metrics
The paper uses Top-K ranking metrics (with $K = 10$ in the reported tables).
- Hit Rate (HR@K):
- Concept: Measures the proportion of test cases where the ground-truth item appears in the top-K recommended list. It answers "Did the correct item appear?"
- Formula: $ \mathrm{HR}@K = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}(\mathrm{rank}_u \leq K) $
- Symbols: $|U|$ is the number of users, $\mathbb{1}(\cdot)$ is the indicator function (1 if true, 0 otherwise), and $\mathrm{rank}_u$ is the rank of the ground-truth item.
- Normalized Discounted Cumulative Gain (NDCG@K):
- Concept: Measures the quality of the ranking. It rewards the model more if the correct item is ranked higher (e.g., rank 1 is much better than rank 10).
- Formula: $ \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K} $, where $ \mathrm{DCG}@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)} $
- Symbols: $rel_i$ is the relevance of the item at rank $i$ (1 for the ground truth, 0 for others). IDCG@K is the maximum possible DCG@K (perfect ranking).
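When each user has a single held-out ground-truth item (the usual leave-one-out protocol, under which IDCG@K = 1), both metrics reduce to simple per-user computations. The short illustrative implementation below assumes that protocol; the function name and input format are assumptions, not the paper's evaluation code.

```python
import numpy as np

def hr_ndcg_at_k(ranked_items, ground_truth, k=10):
    """Compute HR@K and NDCG@K, assuming one relevant item per user.

    ranked_items: list of recommendation lists (item ids, best first)
    ground_truth: the single held-out next item for each user
    """
    hits, ndcgs = [], []
    for recs, target in zip(ranked_items, ground_truth):
        top_k = list(recs[:k])
        if target in top_k:
            rank = top_k.index(target) + 1        # 1-based rank of the hit
            hits.append(1.0)
            ndcgs.append(1.0 / np.log2(rank + 1))  # IDCG = 1 for a single relevant item
        else:
            hits.append(0.0)
            ndcgs.append(0.0)
    return float(np.mean(hits)), float(np.mean(ndcgs))
```

Under this protocol, a ground-truth item ranked third contributes HR@10 = 1 and NDCG@10 = 1 / log2(4) = 0.5 for that user.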
5.3. Baselines
- Standard SRS: SASRec, BERT4Rec.
- Long-Tail SRS:
- CITIES: Uses head items to infer tail embeddings.
- MELT: Mutual enhancement of tail users and items.
- LLM-Enhanced SRS:
- RLMRec: LLM-generated profiles for representation learning.
- LLMInit: LLM embeddings as initialization.
- LLM-ESR: Retrieval-enhanced self-distillation with LLMs.
6. Results & Analysis
6.1. Core Results Analysis
HPSERec demonstrates superior performance across all datasets.
- Overall Improvement: It beats the best baseline (often the LLM-based LLM-ESR) by significant margins (e.g., +10.6% HR@10 on Music).
- Tail Item Boost: The most dramatic improvements are seen in Tail Items. For example, on the Yelp dataset, HPSERec achieves an HR@10 of 0.3252 for tail items, compared to 0.1584 for LLM-ESR (effectively doubling the performance).
- No "Seesaw" Effect: Unlike CITIES, which improves tail performance but hurts head performance (a "seesaw" trade-off), HPSERec improves or maintains Head Item performance while boosting the tail. This confirms the effectiveness of the feedback module and stepwise enhancement.
The following are the results from Table 1 of the original paper:
| Dataset | Model | Overall HR@10 | Overall ND@10 | Tail Item HR@10 | Tail Item ND@10 | Head Item HR@10 | Head Item ND@10 | Tail User HR@10 | Tail User ND@10 | Head User HR@10 | Head User ND@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Bert4Rec | 0.3945 | 0.2453 | 0.0342 | 0.0085 | 0.4917 | 0.2922 | 0.3845 | 0.2384 | 0.4593 | 0.2941 |
| Beauty | SASRec | 0.4488 | 0.2861 | 0.1593 | 0.0856 | 0.7187 | 0.4815 | 0.4236 | 0.2690 | 0.5261 | 0.3611 |
| Beauty | CITIES | 0.3894 | 0.2462 | 0.1127 | 0.0013 | 0.4974 | 0.2745 | 0.3852 | 0.2249 | 0.4554 | 0.2594 |
| Beauty | MELT | 0.4869 | 0.3144 | 0.1598 | 0.0628 | 0.8055 | 0.5595 | 0.4719 | 0.3021 | 0.5523 | 0.3679 |
| Beauty | RLMRec | 0.4077 | 0.2565 | 0.1924 | 0.1660 | 0.6302 | 0.4657 | 0.4356 | 0.3016 | 0.4892 | 0.3345 |
| Beauty | LLMInit | 0.4351 | 0.2914 | 0.2714 | 0.1708 | 0.6984 | 0.5198 | 0.4919 | 0.3117 | 0.5430 | 0.3632 |
| Beauty | LLM-ESR | 0.4945 | 0.3275 | 0.2986 | 0.1713 | 0.7270 | 0.5232 | 0.4821 | 0.3103 | 0.5501 | 0.3425 |
| Beauty | HPSERec | 0.5281 | 0.3665 | 0.3203 | 0.2060 | 0.7306 | 0.5229 | 0.5163 | 0.3557 | 0.5799 | 0.4148 |
| Yelp | Bert4Rec | 0.5307 | 0.3025 | 0.0131 | 0.0045 | 0.6834 | 0.3913 | 0.5319 | 0.3036 | 0.5251 | 0.2978 |
| Yelp | SASRec | 0.5866 | 0.3536 | 0.0890 | 0.0386 | 0.8002 | 0.4888 | 0.5848 | 0.3525 | 0.5945 | 0.3585 |
| Yelp | CITIES | 0.5745 | 0.3404 | 0.0776 | 0.0341 | 0.7648 | 0.4573 | 0.5751 | 0.3416 | 0.5891 | 0.3419 |
| Yelp | MELT | 0.6038 | 0.3687 | 0.0697 | 0.0263 | 0.8245 | 0.5041 | 0.6037 | 0.3688 | 0.6042 | 0.3681 |
| Yelp | RLMRec | 0.5306 | 0.3909 | 0.0104 | 0.0140 | 0.7683 | 0.4568 | 0.5351 | 0.3065 | 0.5137 | 0.2936 |
| Yelp | LLMInit | 0.6099 | 0.3781 | 0.0874 | 0.0330 | 0.7766 | 0.4797 | 0.6204 | 0.3795 | 0.6187 | 0.3823 |
| Yelp | LLM-ESR | 0.6190 | 0.3784 | 0.1584 | 0.0670 | 0.8045 | 0.5055 | 0.6138 | 0.3761 | 0.6331 | 0.3844 |
| Yelp | HPSERec | 0.6827 | 0.4231 | 0.3252 | 0.1832 | 0.8361 | 0.5261 | 0.6884 | 0.4280 | 0.6583 | 0.4027 |
| Music | Bert4Rec | 0.4721 | 0.3056 | 0.1222 | 0.0494 | 0.8299 | 0.5929 | 0.4475 | 0.2870 | 0.5638 | 0.3752 |
| Music | SASRec | 0.5431 | 0.3714 | 0.2473 | 0.1405 | 0.8511 | 0.6256 | 0.5149 | 0.3591 | 0.6843 | 0.4746 |
| Music | CITIES | 0.4421 | 0.3345 | 0.2243 | 0.0832 | 0.8328 | 0.6124 | 0.4835 | 0.3237 | 0.6317 | 0.4364 |
| Music | MELT | 0.5442 | 0.2710 | 0.0824 | 0.0312 | 0.8347 | 0.5391 | 0.4192 | 0.2609 | 0.5082 | 0.3085 |
| Music | RLMRec | 0.3832 | 0.3271 | 0.1539 | 0.1171 | 0.8531 | 0.6292 | 0.5070 | 0.3374 | 0.6677 | 0.4722 |
| Music | LLMInit | 0.5537 | 0.3877 | 0.3024 | 0.1574 | 0.8312 | 0.6426 | 0.5145 | 0.3426 | 0.6604 | 0.4631 |
| Music | LLM-ESR | 0.5958 | 0.4035 | 0.3318 | 0.1548 | 0.8961 | 0.6835 | 0.5672 | 0.3824 | 0.7069 | 0.4846 |
| Music | HPSERec | 0.6592 | 0.4786 | 0.4425 | 0.2959 | 0.8989 | 0.6806 | 0.6428 | 0.4701 | 0.7144 | 0.5069 |
6.2. Ablation Studies
The authors performed an ablation study (Table 2 in the paper) to verify the necessity of each module:
- Removing Distribution Balancing (DB): Replacing the optimal partitioning with a standard top-20% split causes a performance drop, proving that the new partitioning algorithm is more effective.
- Removing Feedforward (FF): Without stepwise enhancement (i.e., training the experts independently, without distillation), performance drops significantly, showing the value of KD.
- Removing Feedback (FB): Removing the Sinkhorn-based feedback leads to a decrease in metrics, confirming that global context helps refine tail experts.
- Conclusion: The full HPSERec model (DB + FF + FB) performs best, indicating all components are complementary.
7. Conclusion & Reflections
7.1. Conclusion Summary
HPSERec effectively tackles the long-tail problem in sequential recommendation by abandoning the "one-size-fits-all" or binary "head/tail" approaches. Instead, it views the data as a spectrum, partitioning it hierarchically and training a chain of experts. The combination of Knowledge Distillation (for transferring tail signals up) and Sinkhorn Optimal Transport (for passing global context down) creates a balanced system that significantly improves recommendations for rare items without sacrificing popular ones.
7.2. Limitations & Future Work
- Hyperparameters: The framework introduces several new hyperparameters (the number of subsets $L$, the skewness parameter $\alpha$, the size-penalty weight $\gamma$, the distillation weight $\beta$, the entropy regularization $\varepsilon$, etc.), which might require extensive tuning for different datasets.
- Computational Complexity: The subset partitioning algorithm (Dynamic Programming) and the Sinkhorn iterations add computational overhead compared to simpler models. The authors note this might be a bottleneck for extremely large-scale item sets and suggest future work on incremental learning strategies.
7.3. Personal Insights & Critique
- Innovation: The use of Optimal Transport for feedback is a brilliant touch. Most "distillation" works only one way (Teacher -> Student). Here, the bidirectional flow (Tail ↔ Global) allows the model to find a "middle ground" where tail items are distinct but not isolated from the global latent space.
- Applicability: The idea of hierarchical partitioning based on an imbalance metric is highly transferable. It could be applied to other long-tail domains like object detection (rare classes) or medical diagnosis (rare diseases), where binary splits are also insufficient.
- Critique: The paper relies on the assumption that "adjacent" experts are the best for distillation. It would be interesting to see if "skip connections" (e.g., Expert 1 distilling directly to Expert 3) could help, or if the strictly sequential chain is indeed optimal. Furthermore, while it beats LLM-based methods, the cost of training expert networks (even if they share some structure, which isn't fully detailed) might be higher than fine-tuning a single adapter for an LLM.