Paper status: completed

HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation

Original Link
Price: 0.100000
3 readers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The HPSERec framework addresses the long-tail problem in sequential recommendation systems by hierarchically partitioning items and employing stepwise enhancement strategies to improve tail item representation, outperforming existing methods through expert networks and optimal tr

Abstract

The long-tail problem in sequential recommender systems stems from imbalanced interaction data, resulting in suboptimal model performance for tail users and items. Recent studies have leveraged head data to enhance tail data for diminish the impact of the long-tail problem. However, these methods often adopt ad-hoc strategies to distinguish between head and tail data, which fails to capture the underlying distributional characteristics and structural properties of each category. Moreover, due to a substantial representational gap exists between head and tail data, head-to-tail enhancement strategies are susceptible to negative transfer, often leading to a decline in overall model performance. To address these issues, we propose a hierarchical partitioning and stepwise enhancement framework, called HPSERec, for long-tailed sequential recommendation. HPSERec partitions the item set into subsets based on a data imbalance metric, assigning an expert network to each subset to capture user-specific local features. Subsequently, we apply knowledge distillation to progressively improve long-tail interest representation, followed by a Sinkhorn optimal transport-based feedback module, which aligns user representations across expert levels through a globally optimal and softly matched mapping. Extensive experiments on three real-world datasets demonstrate that HPSERec consistently outperforms all baseline methods. The implementation code is available at https://github.com/bolunxier123/HPSERec.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation

1.2. Authors

Xiaolong Xu1^{1}, Xudong Zhao1^{1}, Haolong Xiang1^{1}, Xuyun Zhang2^{2}, Wei Shen3^{3}, Hongsheng Hu4^{4}, Lianyong Qi5^{5}

  • 1^{1} School of Software, Nanjing University of Information Science and Technology, China
  • 2^{2} School of Computing, Macquarie University, Australia
  • 3^{3} Independent Researcher, China
  • 4^{4} School of Information and Physical Sciences, University of Newcastle, Australia
  • 5^{5} CoCu ScinTecoly (Institution name appears truncated/typoed in source), China

1.3. Journal/Conference

NeurIPS 2024 (The Thirty-eighth Annual Conference on Neural Information Processing Systems)

  • Note: The paper includes a "NeurIPS Paper Checklist" and references the "NeurIPS Code of Ethics," indicating it is a submission to or accepted paper at this prestigious top-tier AI conference.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses the long-tail problem in Sequential Recommender Systems (SRS), where a few popular items (head) dominate interactions, leaving the majority of items (tail) poorly recommended. The authors criticize existing methods for using arbitrary "head vs. tail" binary splits and for causing "negative transfer" (noise) when simply augmenting tail data with head data. They propose HPSERec, a framework that:

  1. Hierarchically partitions items into multiple subsets based on a novel imbalance metric.
  2. Assigns an expert network to each subset.
  3. Uses Stepwise Enhancement via a feedforward module (Knowledge Distillation) to transfer tail knowledge to global experts.
  4. Uses a feedback module based on Sinkhorn Optimal Transport to refine tail experts using global context. Experiments show HPSERec outperforms state-of-the-art baselines on three real-world datasets.

Paper PDF

2. Executive Summary

2.1. Background & Motivation

  • The Problem: In recommender systems, user interactions follow a power-law distribution. A small fraction of "Head" items receive most interactions, while the "Tail" items (the vast majority) are rarely consumed. This long-tail problem leads to poor recommendation performance for tail items and tail users, reducing platform diversity and user retention.
  • Existing Limitations:
    1. Arbitrary Partitioning: Current methods typically split data into "Head" (e.g., top 20%) and "Tail" (remaining 80%) based on fixed ratios. This ignores the specific distribution shape of different datasets (e.g., MovieLens is much more skewed than Yelp).
    2. Flawed Augmentation & Negative Transfer: Techniques often try to "borrow" data from head items to train tail items. However, since head and tail items have vastly different interaction densities and patterns, this direct transfer often introduces noise (negative transfer), confusing the model rather than helping it.

2.2. Main Contributions / Findings

  • Imbalance-Aware Partitioning: The paper introduces a new metric to quantify data imbalance and a dynamic programming algorithm to optimally partition items into multiple balanced subsets, rather than a simple binary split.
  • HPSERec Framework: A novel architecture employing a Mixture of Experts (MoE) strategy. It assigns specific experts to different subsets (ranging from tail-focused to global-focused) and trains them sequentially.
    • Feedforward: Uses Knowledge Distillation to transfer detailed local preferences (from tail experts) to global experts.
    • Feedback: Uses Optimal Transport (Sinkhorn Algorithm) to feed global context back to tail experts, aligning their representations without enforcing rigid one-to-one matching.
  • Superior Performance: HPSERec consistently outperforms existing baselines, including Large Language Model (LLM) based recommenders, on datasets like Yelp, Amazon Beauty, and Amazon Music. It significantly boosts tail item performance without degrading head item accuracy.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Sequential Recommendation (SRS): A type of recommender system that predicts the next item a user will interact with based on their chronological sequence of past interactions. Models like SASRec (Self-Attentive Sequential Recommendation) and BERT4Rec are standard backbones, using Transformer architectures to learn sequence dependencies.
  • Long-Tail Distribution: A statistical pattern where a small number of events (head) occur frequently, while a large number of events (tail) occur rarely. In recommendations, this means the model learns popular items well but fails on niche items.
  • Mixture of Experts (MoE): A machine learning technique where a model is composed of multiple sub-networks ("experts"), each specializing in a different part of the input space. A gating mechanism usually decides which expert to use. Here, experts are assigned to different item subsets.
  • Knowledge Distillation (KD): A training method where a small "student" model is trained to reproduce the output (logits) of a larger or more specialized "teacher" model. It transfers "knowledge" by matching the probability distributions.
  • Optimal Transport (OT) & Sinkhorn Algorithm: OT is a mathematical framework for calculating the most efficient way to transform one probability distribution into another. It provides a robust distance metric (Wasserstein distance). The Sinkhorn algorithm is a fast, iterative method to approximate this transport plan, often used to align feature spaces (embeddings) between different domains or models.

3.2. Previous Works

  • SRS Baselines:
    • SASRec (2018): Uses self-attention (like Transformer) to capture long-term dependencies in user sequences.
    • BERT4Rec (2019): Uses a bidirectional Transformer (like BERT) with a Cloze (fill-in-the-blank) objective.
  • Long-Tail Solutions:
    • CITIES (2020): Enhances tail item embeddings using context from head items via attention.
    • MELT (2023): "Mutual Enhancement of Long-Tailed user and item." It addresses both user and item tails simultaneously.
  • LLM-based Approaches:
    • RLMRec (2024), LLMInit (2023): Use Large Language Models (like GPT) to generate rich text-based profiles or initialize embeddings to compensate for sparse interaction data.

3.3. Differentiation Analysis

  • Beyond Binary Splits: Unlike CITIES or MELT which typically use a static 20/80 split, HPSERec uses an adaptive, metric-driven partitioning to create multiple hierarchical subsets.
  • Stepwise vs. Direct Transfer: Instead of directly using head data to augment tail (which causes negative transfer), HPSERec uses a stepwise chain of experts. Tail experts teach global experts (Feedforward), and global experts refine tail experts (Feedback).
  • Distribution-Level Alignment: By using Optimal Transport (Sinkhorn) for feedback, HPSERec aligns user representations at a distributional level, which is more flexible and robust than the rigid point-to-point alignment used in previous methods.

4. Methodology

4.1. Principles

The core philosophy of HPSERec is "Divide, Specialize, and Align."

  1. Divide: The dataset is too imbalanced to be treated as a whole. It should be partitioned based on interaction density.

  2. Specialize: Different neural networks (experts) should specialize in different partitions (e.g., one focuses on tail patterns, another on global patterns).

  3. Align: These experts shouldn't work in isolation. They should exchange knowledge: "Tail" experts provide specific interest signals to "Global" experts, and "Global" experts provide broad context back to "Tail" experts to regularize them.

    The following figure (Figure 2 from the original paper) illustrates this architecture:

    Figure 2: The overall architecture of HPSERec. Distribution balancing module splits the item set into subsets to reduce long-tail effects and improve training. Feedforward module boosts long-tail item representations with contrastive learning and refines global expert performance via knowledge distillation. Feedback module transfers head item information to long-tail experts using an annealing algorithm, expanding their receptive field. 该图像是HPSERec的总体架构示意图。分布平衡模块将项目集划分为子集,以减少长尾效应并改善训练。前馈模块通过对比学习增强长尾项表示,并通过知识蒸馏精炼全局专家性能。反馈模块利用Sinkhorn最优传输算法,将头项信息转移到长尾专家,扩大其感受野。

4.2. Distribution Balancing Module

Goal: Partition the sorted item set VV (sorted by popularity) into LL subsets {V1,V2,,VL}\{V_1, V_2, \dots, V_L\} such that the "imbalance" within each subset is minimized.

Step 1: Metric Definition. First, the authors define a normalized interaction probability pi(k)p_i^{(k)} for an item ii within a subset VkV_k: $ p_i^{(k)} = \frac{c_i}{\sum_{j \in V_k} c_j} $ where cic_i is the interaction count of item ii.

To measure how "skewed" or concentrated the interactions are in this subset, they define a non-linear aggregation functional Φk\Phi_k: $ \Phi_k = \left( \sum_{i \in V_k} \left( p_i^{(k)} \right)^\alpha \right)^{\frac{1}{1-\alpha}} $

  • Symbol Explanation:
    • Φk\Phi_k: The dispersion/concentration score. Lower is better (more uniform).
    • α\alpha: A hyperparameter. If α<1\alpha < 1, it is sensitive to tail items; if α>1\alpha > 1, it is sensitive to head items.

Step 2: Size Regularization. To prevent the algorithm from creating tiny or huge subsets, a regularization term BkB_k penalizes deviation from the average subset size: $ B_k = \left( \frac{S_k - \mu}{\mu} \right)^2 $

  • Symbol Explanation:
    • SkS_k: The number of items in subset VkV_k.
    • μ=VL\mu = \frac{|V|}{L}: The ideal average size of a subset.

Step 3: Optimization Objective. The overall imbalance score I(Vk)\mathcal{I}(V_k) for a subset combines the skewness and the size penalty: $ \mathcal{I}(V_k) = \log \Phi_k + \gamma B_k $

  • γ\gamma: Weight controlling the strength of the size penalty.

    The final goal is to find partition points (thresholds) Θ\Theta that minimize the sum of scores across all LL subsets: $ \min_{\Theta} \sum_{k=1}^{L} \mathcal{I}(V_k) $ The authors solve this using a Dynamic Programming algorithm (Algorithm 1 in the paper).

Step 4: Expert Assignment. Crucially, the subsets assigned to experts are cumulative. $ V_k^{expert} = \bigcup_{i=1}^{k} V_i' $

  • Expert E1E_1 sees only the first subset (most tail-focused).
  • Expert ELE_L sees the union of all subsets (the global dataset). This creates a hierarchy where E1E_1 is the most specialized on local/tail features, and ELE_L is the most general.

4.3. Feedforward Module (Tail \to Global)

Goal: Train the experts sequentially from specialized (tail) to general (global), transferring knowledge to ensure the global expert doesn't ignore the tail.

Mechanism: Knowledge Distillation (KD). The model treats the upstream expert Ei1E_{i-1} (more tail-focused) as a "Teacher" and the downstream expert EiE_i (more global) as a "Student". The student tries to match the teacher's output distribution on the overlapping items.

The distillation loss LKD\mathcal{L}_{KD} minimizes the KL-divergence between their softened predictions: $ \mathcal{L}_{KD} = \frac{1}{|\mathcal{V}i|} \sum{v \in \mathcal{V}_i} \mathrm{KL} \Big( \mathrm{Softmax}(z_v^{i-1} / \tau) \ || \ \mathrm{Softmax}(z_v^{i} / \tau) \Big) $

  • Symbol Explanation:
    • zvi1,zviz_v^{i-1}, z_v^{i}: The output logits (prediction scores) from expert i-1 and expert ii for item vv.
    • τ\tau: Temperature parameter. Higher τ\tau creates softer probability distributions, revealing more information about the "dark knowledge" (relationships between classes) than hard labels.
    • KL\mathrm{KL}: Kullback-Leibler divergence, measuring the difference between two probability distributions.

Total Feedforward Loss: The expert EiE_i is trained on a combined loss: standard recommendation loss (e.g., cross-entropy) plus the distillation loss. $ \mathcal{L}{\mathrm{forw}} = \mathcal{L}{\mathrm{rec}} + \beta \mathcal{L}_{KD} $

  • β\beta: Weight balancing recommendation accuracy and knowledge transfer.

4.4. Feedback Module (Global \to Tail)

Goal: After the global experts are trained, their broader context (which captures head item patterns well) is sent back to refine the tail experts. This prevents tail experts from overfitting to sparse data.

Mechanism: Sinkhorn Optimal Transport. Instead of forcing a user's representation in expert EtE_t to be exactly the same as in Et+1E_{t+1} (which is too rigid), the model aligns the distributions of user embeddings.

Step 1: Cost Matrix. Define the cost (distance) between user embedding uitu_i^t (from expert tt) and ujt+1u_j^{t+1} (from expert t+1t+1) using cosine distance: $ C_{ij} = 1 - \cos(u_i^t, u_j^{t+1}) $

Step 2: Optimal Transport Objective. Find a transport plan γ\gamma (a matrix describing how much mass to move from ii to jj) that minimizes the total cost, regularized by entropy H(γ)H(\gamma): $ \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i,j} \gamma_{ij} C_{ij} - \varepsilon \sum_{i,j} \gamma_{ij} \log \gamma_{ij} $

  • Symbol Explanation:
    • γij\gamma_{ij}: The amount of "probability mass" mapped from user ii in space tt to user jj in space t+1t+1.
    • ε\varepsilon: Regularization parameter. This term makes the problem solvable using the fast Sinkhorn iterations and makes the mapping "soft" (one-to-many alignment is allowed).

Step 3: Feedback Loss. The loss is simply the expected transport cost under the optimal plan: $ \mathcal{L}{\mathrm{back}} = \sum{i,j} \gamma_{ij} C_{ij} $ Minimizing this pulls the upstream (tail) user embeddings structurally closer to the downstream (global) embeddings, improving consistency.

5. Experimental Setup

5.1. Datasets

The experiments use three real-world datasets, pre-processed to exclude users with fewer than 5 interactions.

  1. Amazon Beauty: E-commerce data. 57k items, 52k users. High sparsity.
  2. Yelp: Business reviews. 15k items, 4k users.
  3. Amazon Music: Digital music. 20k items, 20k users.

5.2. Evaluation Metrics

The paper uses Top-K ranking metrics (K=10K=10).

  1. Hit Rate (HR@K):

    • Concept: Measures the proportion of test cases where the ground-truth item appears in the top-K recommended list. It answers "Did the correct item appear?"
    • Formula: HR@K=1UuUI(ranku,gtK)\mathrm{HR}@K = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{rank}_{u, gt} \le K)
    • Symbols: U|U| is number of users, I\mathbb{I} is indicator function (1 if true, 0 otherwise), ranku,gt\text{rank}_{u, gt} is the rank of the ground truth item.
  2. Normalized Discounted Cumulative Gain (NDCG@K):

    • Concept: Measures the quality of the ranking. It rewards the model more if the correct item is ranked higher (e.g., rank 1 is much better than rank 10).
    • Formula: NDCG@K=DCG@KIDCG@K\mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K}, where \mathrm{DCG}@K = \sum_{i=1}^K \frac{rel_i}{\log_2(i+1)}
    • Symbols: relirel_i is the relevance of item at rank ii (1 for ground truth, 0 for others). IDCG is the max possible DCG (perfect ranking).

5.3. Baselines

  • Standard SRS: SASRec, BERT4Rec.
  • Long-Tail SRS:
    • CITIES: Uses head items to infer tail embeddings.
    • MELT: Mutual enhancement of tail users and items.
  • LLM-Enhanced SRS:
    • RLMRec: LLM-generated profiles for representation learning.
    • LLMInit: LLM embeddings as initialization.
    • LLM-ESR: Retrieval-enhanced self-distillation with LLMs.

6. Results & Analysis

6.1. Core Results Analysis

HPSERec demonstrates superior performance across all datasets.

  • Overall Improvement: It beats the best baseline (often the LLM-based LLM-ESR) by significant margins (e.g., +10.6% HR@10 on Music).

  • Tail Item Boost: The most dramatic improvements are seen in Tail Items. For example, on the Yelp dataset, HPSERec achieves an HR@10 of 0.3252 for tail items, compared to 0.1584 for LLM-ESR (effectively doubling the performance).

  • No "Seesaw" Effect: Unlike CITIES, which improves tail performance but hurts head performance (a "seesaw" trade-off), HPSERec improves or maintains Head Item performance while boosting the tail. This confirms the effectiveness of the feedback module and stepwise enhancement.

    The following are the results from Table 1 of the original paper:

    Dataset Model Overall Tail Item Head Item Tail User Head User
    HR@10 ND@10 HR@10 ND@10 HR@10 ND@10 HR@10 ND@10 HR@10 ND@10
    Beauty Bert4Rec 0.3945 0.2453 0.0342 0.0085 0.4917 0.2922 0.3845 0.2384 0.4593 0.2941
    SASRec 0.4488 0.2861 0.1593 0.0856 0.7187 0.4815 0.4236 0.2690 0.5261 0.3611
    CITIES 0.3894 0.2462 0.1127 0.0013 0.4974 0.2745 0.3852 0.2249 0.4554 0.2594
    MELT 0.4869 0.3144 0.1598 0.0628 0.8055 0.5595 0.4719 0.3021 0.5523 0.3679
    RLMRec 0.4077 0.2565 0.1924 0.1660 0.6302 0.4657 0.4356 0.3016 0.4892 0.3345
    LLMInit 0.4351 0.2914 0.2714 0.1708 0.6984 0.5198 0.4919 0.3117 0.5430 0.3632
    LLM-ESR 0.4945 0.3275 0.2986 0.1713 0.7270 0.5232 0.4821 0.3103 0.5501 0.3425
    HPSERec 0.5281 0.3665 0.3203 0.2060 0.7306 0.5229 0.5163 0.3557 0.5799 0.4148
    Yelp Bert4Rec 0.5307 0.3025 0.0131 0.0045 0.6834 0.3913 0.5319 0.3036 0.5251 0.2978
    SASRec 0.5866 0.3536 0.0890 0.0386 0.8002 0.4888 0.5848 0.3525 0.5945 0.3585
    CITIES 0.5745 0.3404 0.0776 0.0341 0.7648 0.4573 0.5751 0.3416 0.5891 0.3419
    MELT 0.6038 0.3687 0.0697 0.0263 0.8245 0.5041 0.6037 0.3688 0.6042 0.3681
    RLMRec 0.5306 0.3909 0.0104 0.0140 0.7683 0.4568 0.5351 0.3065 0.5137 0.2936
    LLMInit 0.6099 0.3781 0.0874 0.0330 0.7766 0.4797 0.6204 0.3795 0.6187 0.3823
    LLM-ESR 0.6190 0.3784 0.1584 0.0670 0.8045 0.5055 0.6138 0.3761 0.6331 0.3844
    HPSERec 0.6827 0.4231 0.3252 0.1832 0.8361 0.5261 0.6884 0.4280 0.6583 0.4027
    Music Bert4Rec 0.4721 0.3056 0.1222 0.0494 0.8299 0.5929 0.4475 0.2870 0.5638 0.3752
    SASRec 0.5431 0.3714 0.2473 0.1405 0.8511 0.6256 0.5149 0.3591 0.6843 0.4746
    CITIES 0.4421 0.3345 0.2243 0.0832 0.8328 0.6124 0.4835 0.3237 0.6317 0.4364
    MELT 0.5442 0.2710 0.0824 0.0312 0.8347 0.5391 0.4192 0.2609 0.5082 0.3085
    RLMRec 0.3832 0.3271 0.1539 0.1171 0.8531 0.6292 0.5070 0.3374 0.6677 0.4722
    LLMInit 0.5537 0.3877 0.3024 0.1574 0.8312 0.6426 0.5145 0.3426 0.6604 0.4631
    LLM-ESR 0.5958 0.4035 0.3318 0.1548 0.8961 0.6835 0.5672 0.3824 0.7069 0.4846
    HPSERec 0.6592 0.4786 0.4425 0.2959 0.8989 0.6806 0.6428 0.4701 0.7144 0.5069

6.2. Ablation Studies

The authors performed an ablation study (Table 2 in the paper) to verify the necessity of each module:

  • Removing Distribution Balancing (DB): Replacing the optimal partitioning with a standard top-20% split causes a performance drop, proving that the new partitioning algorithm is more effective.
  • Removing Feedforward (FF): Without stepwise enhancement (just training experts independently?), performance drops significantly, showing the value of KD.
  • Removing Feedback (FB): Removing the Sinkhorn-based feedback leads to a decrease in metrics, confirming that global context helps refine tail experts.
  • Conclusion: The full HPSERec model (DB + FF + FB) performs best, indicating all components are complementary.

7. Conclusion & Reflections

7.1. Conclusion Summary

HPSERec effectively tackles the long-tail problem in sequential recommendation by abandoning the "one-size-fits-all" or binary "head/tail" approaches. Instead, it views the data as a spectrum, partitioning it hierarchically and training a chain of experts. The combination of Knowledge Distillation (for transferring tail signals up) and Sinkhorn Optimal Transport (for passing global context down) creates a balanced system that significantly improves recommendations for rare items without sacrificing popular ones.

7.2. Limitations & Future Work

  • Hyperparameters: The framework introduces several new hyperparameters (number of subsets LL, imbalance weight α\alpha, regularization γ\gamma, etc.) which might require extensive tuning for different datasets.
  • Computational Complexity: The subset partitioning algorithm (Dynamic Programming) and the Sinkhorn iterations add computational overhead compared to simpler models. The authors note this might be a bottleneck for extremely large-scale item sets and suggest future work on incremental learning strategies.

7.3. Personal Insights & Critique

  • Innovation: The use of Optimal Transport for feedback is a brilliant touch. Most "distillation" works only one way (Teacher -> Student). Here, the bidirectional flow (Tail \rightleftarrows Global) allows the model to find a "middle ground" where tail items are distinct but not isolated from the global latent space.
  • Applicability: The idea of hierarchical partitioning based on an imbalance metric is highly transferable. It could be applied to other long-tail domains like object detection (rare classes) or medical diagnosis (rare diseases), where binary splits are also insufficient.
  • Critique: The paper relies on the assumption that "adjacent" experts are the best for distillation. It would be interesting to see if "skip connections" (e.g., Expert 1 distilling directly to Expert 3) could help, or if the strictly sequential chain is indeed optimal. Furthermore, while it beats LLM-based methods, the cost of training LL expert networks (even if they share some structure, which isn't fully detailed) might be higher than fine-tuning a single adapter for an LLM.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.