HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation
TL;DR Summary
The HPSERec framework addresses the long-tail problem in sequential recommendation systems by hierarchically partitioning items and employing stepwise enhancement strategies to improve tail item representation, outperforming existing methods through expert networks and optimal transport.
Abstract
The long-tail problem in sequential recommender systems stems from imbalanced interaction data, resulting in suboptimal model performance for tail users and items. Recent studies have leveraged head data to enhance tail data and diminish the impact of the long-tail problem. However, these methods often adopt ad-hoc strategies to distinguish between head and tail data, which fails to capture the underlying distributional characteristics and structural properties of each category. Moreover, because a substantial representational gap exists between head and tail data, head-to-tail enhancement strategies are susceptible to negative transfer, often leading to a decline in overall model performance. To address these issues, we propose a hierarchical partitioning and stepwise enhancement framework, called HPSERec, for long-tailed sequential recommendation. HPSERec partitions the item set into subsets based on a data imbalance metric, assigning an expert network to each subset to capture user-specific local features. Subsequently, we apply knowledge distillation to progressively improve long-tail interest representation, followed by a Sinkhorn optimal transport-based feedback module, which aligns user representations across expert levels through a globally optimal and softly matched mapping. Extensive experiments on three real-world datasets demonstrate that HPSERec consistently outperforms all baseline methods. The implementation code is available at https://github.com/bolunxier123/HPSERec.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
HPSERec: A Hierarchical Partitioning and Stepwise Enhancement Framework for Long-tailed Sequential Recommendation
1.2. Authors
Xiaolong Xu, Xudong Zhao, Haolong Xiang, Xuyun Zhang, Wei Shen, Hongsheng Hu, Lianyong Qi
- School of Software, Nanjing University of Information Science and Technology, China
- School of Computing, Macquarie University, Australia
- Independent Researcher, China
- School of Information and Physical Sciences, University of Newcastle, Australia
- CoCu ScinTecoly (Institution name appears truncated/typoed in source), China
1.3. Journal/Conference
NeurIPS 2024 (The Thirty-eighth Annual Conference on Neural Information Processing Systems)
- Note: The paper includes a "NeurIPS Paper Checklist" and references the "NeurIPS Code of Ethics," indicating it is a submission to or accepted paper at this prestigious top-tier AI conference.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses the long-tail problem in Sequential Recommender Systems (SRS), where a few popular items (head) dominate interactions, leaving the majority of items (tail) poorly recommended. The authors criticize existing methods for using arbitrary "head vs. tail" binary splits and for causing "negative transfer" (noise) when simply augmenting tail data with head data. They propose HPSERec, a framework that:
- Hierarchically partitions items into multiple subsets based on a novel imbalance metric.
- Assigns an expert network to each subset.
- Uses Stepwise Enhancement via a feedforward module (Knowledge Distillation) to transfer tail knowledge to global experts.
- Uses a feedback module based on Sinkhorn Optimal Transport to refine tail experts using global context. Experiments show HPSERec outperforms state-of-the-art baselines on three real-world datasets.
1.6. Original Source Link
2. Executive Summary
2.1. Background & Motivation
- The Problem: In recommender systems, user interactions follow a power-law distribution. A small fraction of "Head" items receive most interactions, while the "Tail" items (the vast majority) are rarely consumed. This long-tail problem leads to poor recommendation performance for tail items and tail users, reducing platform diversity and user retention.
- Existing Limitations:
- Arbitrary Partitioning: Current methods typically split data into "Head" (e.g., top 20%) and "Tail" (remaining 80%) based on fixed ratios. This ignores the specific distribution shape of different datasets (e.g., MovieLens is much more skewed than Yelp).
- Flawed Augmentation & Negative Transfer: Techniques often try to "borrow" data from head items to train tail items. However, since head and tail items have vastly different interaction densities and patterns, this direct transfer often introduces noise (negative transfer), confusing the model rather than helping it.
2.2. Main Contributions / Findings
- Imbalance-Aware Partitioning: The paper introduces a new metric to quantify data imbalance and a dynamic programming algorithm to optimally partition items into multiple balanced subsets, rather than a simple binary split.
- HPSERec Framework: A novel architecture employing a Mixture of Experts (MoE) strategy. It assigns specific experts to different subsets (ranging from tail-focused to global-focused) and trains them sequentially.
- Feedforward: Uses Knowledge Distillation to transfer detailed local preferences (from tail experts) to global experts.
- Feedback: Uses Optimal Transport (Sinkhorn Algorithm) to feed global context back to tail experts, aligning their representations without enforcing rigid one-to-one matching.
- Superior Performance: HPSERec consistently outperforms existing baselines, including Large Language Model (LLM) based recommenders, on datasets like Yelp, Amazon Beauty, and Amazon Music. It significantly boosts tail item performance without degrading head item accuracy.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Sequential Recommendation (SRS): A type of recommender system that predicts the next item a user will interact with based on their chronological sequence of past interactions. Models like SASRec (Self-Attentive Sequential Recommendation) and BERT4Rec are standard backbones, using Transformer architectures to learn sequence dependencies.
- Long-Tail Distribution: A statistical pattern where a small number of events (head) occur frequently, while a large number of events (tail) occur rarely. In recommendations, this means the model learns popular items well but fails on niche items.
- Mixture of Experts (MoE): A machine learning technique where a model is composed of multiple sub-networks ("experts"), each specializing in a different part of the input space. A gating mechanism usually decides which expert to use. Here, experts are assigned to different item subsets.
- Knowledge Distillation (KD): A training method where a small "student" model is trained to reproduce the output (logits) of a larger or more specialized "teacher" model. It transfers "knowledge" by matching the probability distributions.
- Optimal Transport (OT) & Sinkhorn Algorithm: OT is a mathematical framework for calculating the most efficient way to transform one probability distribution into another. It provides a robust distance metric (Wasserstein distance). The Sinkhorn algorithm is a fast, iterative method to approximate this transport plan, often used to align feature spaces (embeddings) between different domains or models.
3.2. Previous Works
- SRS Baselines:
- SASRec (2018): Uses self-attention (like Transformer) to capture long-term dependencies in user sequences.
- BERT4Rec (2019): Uses a bidirectional Transformer (like BERT) with a Cloze (fill-in-the-blank) objective.
- Long-Tail Solutions:
- CITIES (2020): Enhances tail item embeddings using context from head items via attention.
- MELT (2023): "Mutual Enhancement of Long-Tailed user and item." It addresses both user and item tails simultaneously.
- LLM-based Approaches:
- RLMRec (2024), LLMInit (2023): Use Large Language Models (like GPT) to generate rich text-based profiles or initialize embeddings to compensate for sparse interaction data.
3.3. Differentiation Analysis
- Beyond Binary Splits: Unlike CITIES or MELT which typically use a static 20/80 split, HPSERec uses an adaptive, metric-driven partitioning to create multiple hierarchical subsets.
- Stepwise vs. Direct Transfer: Instead of directly using head data to augment tail (which causes negative transfer), HPSERec uses a stepwise chain of experts. Tail experts teach global experts (Feedforward), and global experts refine tail experts (Feedback).
- Distribution-Level Alignment: By using Optimal Transport (Sinkhorn) for feedback, HPSERec aligns user representations at a distributional level, which is more flexible and robust than the rigid point-to-point alignment used in previous methods.
4. Methodology
4.1. Principles
The core philosophy of HPSERec is "Divide, Specialize, and Align."
- Divide: The dataset is too imbalanced to be treated as a whole. It should be partitioned based on interaction density.
- Specialize: Different neural networks (experts) should specialize in different partitions (e.g., one focuses on tail patterns, another on global patterns).
- Align: These experts shouldn't work in isolation. They should exchange knowledge: "Tail" experts provide specific interest signals to "Global" experts, and "Global" experts provide broad context back to "Tail" experts to regularize them.
The following figure (Figure 2 from the original paper) illustrates this architecture:
This figure shows the overall architecture of HPSERec. The distribution balancing module partitions the item set into subsets to mitigate the long-tail effect and improve training. The feedforward module enhances long-tail item representations and refines global expert performance through knowledge distillation. The feedback module uses the Sinkhorn optimal transport algorithm to transfer head-item information to the long-tail experts, broadening their receptive field.
4.2. Distribution Balancing Module
Goal: Partition the sorted item set (sorted by popularity) into subsets such that the "imbalance" within each subset is minimized.
Step 1: Metric Definition. First, the authors define a normalized interaction probability for an item $i$ within a subset $V_k$: $ p_i^{(k)} = \frac{c_i}{\sum_{j \in V_k} c_j} $ where $c_i$ is the interaction count of item $i$.
To measure how "skewed" or concentrated the interactions are in this subset, they define a non-linear aggregation functional $\Phi_k$: $ \Phi_k = \left( \sum_{i \in V_k} \left( p_i^{(k)} \right)^\alpha \right)^{\frac{1}{1-\alpha}} $
- Symbol Explanation:
- $\Phi_k$: The dispersion/concentration score. Lower is better (more uniform).
- $\alpha$: A hyperparameter controlling sensitivity; smaller values of $\alpha$ make the score more sensitive to tail items, while larger values make it more sensitive to head items.
Step 2: Size Regularization. To prevent the algorithm from creating tiny or huge subsets, a regularization term $B_k$ penalizes deviation from the average subset size: $ B_k = \left( \frac{S_k - \mu}{\mu} \right)^2 $
- Symbol Explanation:
- $S_k$: The number of items in subset $V_k$.
- $\mu$: The ideal average size of a subset.
Step 3: Optimization Objective. The overall imbalance score for a subset combines the skewness and the size penalty: $ \mathcal{I}(V_k) = \log \Phi_k + \gamma B_k $
- $\gamma$: Weight controlling the strength of the size penalty.
The final goal is to find partition points (thresholds) $\Theta$ that minimize the sum of scores across all $L$ subsets: $ \min_{\Theta} \sum_{k=1}^{L} \mathcal{I}(V_k) $ The authors solve this using a Dynamic Programming algorithm (Algorithm 1 in the paper).
Step 4: Expert Assignment. Crucially, the subsets assigned to experts are cumulative: $ V_k^{expert} = \bigcup_{i=1}^{k} V_i' $
- Expert 1 sees only the first subset (most tail-focused).
- Expert $L$ sees the union of all subsets (the global dataset). This creates a hierarchy where Expert 1 is the most specialized on local/tail features, and Expert $L$ is the most general.
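To make the partitioning concrete, here is a minimal Python sketch of Steps 1-4: it scores a candidate subset with $\mathcal{I}(V_k)$ and splits the popularity-sorted items into contiguous subsets with a direct interval dynamic program, then forms the cumulative expert subsets. This is an illustrative reconstruction under stated assumptions, not the paper's Algorithm 1; the function names, default hyperparameter values, and the $O(L \cdot n^2)$ DP formulation are assumptions.

```python
import numpy as np

def imbalance_score(counts, mu, alpha=0.5, gamma=1.0):
    """I(V_k) = log(Phi_k) + gamma * B_k for one candidate subset.

    counts: interaction counts of the items in the subset.
    mu: target (average) subset size. Default alpha/gamma are illustrative.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                            # p_i^{(k)}
    phi = np.sum(p ** alpha) ** (1.0 / (1.0 - alpha))    # skewness functional Phi_k
    b = ((len(counts) - mu) / mu) ** 2                   # size-deviation penalty B_k
    return np.log(phi) + gamma * b

def partition_items(sorted_counts, num_subsets, alpha=0.5, gamma=1.0):
    """Split popularity-sorted items (tail first, so subset 1 is the most
    tail-focused) into contiguous subsets minimizing the summed imbalance
    scores, via a direct O(L * n^2) interval dynamic program."""
    n = len(sorted_counts)
    mu = n / num_subsets
    INF = float("inf")
    # dp[k][i]: best total score covering the first i items with k subsets
    dp = [[INF] * (n + 1) for _ in range(num_subsets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_subsets + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_subsets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):          # last subset covers items j..i-1
                if dp[k - 1][j] == INF:
                    continue
                score = dp[k - 1][j] + imbalance_score(
                    sorted_counts[j:i], mu, alpha, gamma)
                if score < dp[k][i]:
                    dp[k][i], cut[k][i] = score, j
    # backtrack to recover the partition boundaries
    bounds, i = [], n
    for k in range(num_subsets, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    subsets = [list(range(a, b)) for a, b in reversed(bounds)]
    # cumulative assignment: expert k trains on the union of subsets 1..k
    expert_sets = [sum(subsets[: k + 1], []) for k in range(num_subsets)]
    return subsets, expert_sets
```

For example, `partition_items(counts_sorted_ascending, num_subsets=4)` would return four contiguous subsets of item indices plus the four cumulative training sets, with the last set covering the whole catalogue.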
4.3. Feedforward Module (Tail → Global)
Goal: Train the experts sequentially from specialized (tail) to general (global), transferring knowledge to ensure the global expert doesn't ignore the tail.
Mechanism: Knowledge Distillation (KD). The model treats the upstream expert (more tail-focused) as a "Teacher" and the downstream expert (more global) as a "Student". The student tries to match the teacher's output distribution on the overlapping items.
The distillation loss minimizes the KL-divergence between their softened predictions: $ \mathcal{L}_{KD} = \frac{1}{|\mathcal{V}_i|} \sum_{v \in \mathcal{V}_i} \mathrm{KL} \Big( \mathrm{Softmax}(z_v^{i-1} / \tau) \ \| \ \mathrm{Softmax}(z_v^{i} / \tau) \Big) $
- Symbol Explanation:
- $z_v^{i-1}$, $z_v^{i}$: The output logits (prediction scores) from expert $i-1$ and expert $i$ for item $v$.
- $\tau$: Temperature parameter. A higher $\tau$ creates softer probability distributions, revealing more of the "dark knowledge" (relationships between classes) than hard labels.
- $\mathrm{KL}$: Kullback-Leibler divergence, measuring the difference between two probability distributions.
Total Feedforward Loss: The expert is trained on a combined loss: standard recommendation loss (e.g., cross-entropy) plus the distillation loss. $ \mathcal{L}_{\mathrm{forw}} = \mathcal{L}_{\mathrm{rec}} + \beta \mathcal{L}_{KD} $
- $\beta$: Weight balancing recommendation accuracy and knowledge transfer.
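A minimal PyTorch-style sketch of how this combined feedforward loss could be computed for one batch is shown below. The function name and the default values of `beta` and `tau` are illustrative assumptions, and the common $\tau^2$ rescaling of the KD term is omitted to match the formula as written.

```python
import torch
import torch.nn.functional as F

def feedforward_loss(student_logits, teacher_logits, targets, beta=0.5, tau=2.0):
    """Sketch of L_forw = L_rec + beta * L_KD for one batch of next-item logits.

    teacher_logits come from the upstream (more tail-focused) expert i-1,
    student_logits from the downstream expert i, over the shared item set.
    """
    # standard next-item recommendation loss (cross-entropy over items)
    rec_loss = F.cross_entropy(student_logits, targets)

    # temperature-softened distillation term: KL(teacher || student)
    teacher_probs = F.softmax(teacher_logits.detach() / tau, dim=-1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    return rec_loss + beta * kd_loss
```

Here `targets` would hold the index of each user's ground-truth next item; the teacher logits are detached so that gradients only update the downstream (student) expert.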
4.4. Feedback Module (Global → Tail)
Goal: After the global experts are trained, their broader context (which captures head item patterns well) is sent back to refine the tail experts. This prevents tail experts from overfitting to sparse data.
Mechanism: Sinkhorn Optimal Transport. Instead of forcing a user's representation in expert $t$ to be exactly the same as in expert $t+1$ (which is too rigid), the model aligns the distributions of user embeddings.
Step 1: Cost Matrix. Define the cost (distance) between user embedding $u_i^t$ (from expert $t$) and $u_j^{t+1}$ (from expert $t+1$) using cosine distance: $ C_{ij} = 1 - \cos(u_i^t, u_j^{t+1}) $
Step 2: Optimal Transport Objective. Find a transport plan $\gamma$ (a matrix describing how much mass to move from $u_i^t$ to $u_j^{t+1}$) that minimizes the total cost, regularized by the entropy $H(\gamma) = -\sum_{i,j} \gamma_{ij} \log \gamma_{ij}$: $ \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i,j} \gamma_{ij} C_{ij} - \varepsilon H(\gamma) $
- Symbol Explanation:
- $\gamma_{ij}$: The amount of "probability mass" mapped from user $i$ in the expert-$t$ space to user $j$ in the expert-$(t+1)$ space; $\Pi(\mu, \nu)$ is the set of plans with marginals $\mu$ and $\nu$.
- $\varepsilon$: Regularization parameter. The entropy term makes the problem solvable with fast Sinkhorn iterations and makes the mapping "soft" (one-to-many alignment is allowed).
Step 3: Feedback Loss. The loss is simply the expected transport cost under the optimal plan: $ \mathcal{L}_{\mathrm{back}} = \sum_{i,j} \gamma_{ij} C_{ij} $ Minimizing this pulls the upstream (tail) user embeddings structurally closer to the downstream (global) embeddings, improving consistency.
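The sketch below illustrates how such a feedback loss could be computed with log-domain Sinkhorn iterations, assuming uniform marginals over the two sets of user embeddings. The function name, the choice of `eps`, the number of iterations, and the decision to freeze the transport plan before computing the loss are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def sinkhorn_feedback_loss(u_tail, u_global, eps=0.05, n_iters=50):
    """Sketch of L_back = sum_ij gamma_ij * C_ij with gamma from entropic OT.

    u_tail:   user embeddings from the upstream (tail) expert, shape (n, d)
    u_global: user embeddings from the downstream (global) expert, shape (m, d)
    """
    # cosine-distance cost matrix C_ij = 1 - cos(u_i^t, u_j^{t+1})
    cost = 1.0 - F.normalize(u_tail, dim=-1) @ F.normalize(u_global, dim=-1).T
    n, m = cost.shape
    log_mu = torch.log(cost.new_full((n,), 1.0 / n))   # uniform source marginal
    log_nu = torch.log(cost.new_full((m,), 1.0 / m))   # uniform target marginal

    # compute the soft transport plan without tracking gradients
    with torch.no_grad():
        C = cost.detach()
        f = C.new_zeros(n)                              # dual potentials
        g = C.new_zeros(m)
        for _ in range(n_iters):
            # log-domain Sinkhorn updates enforcing the row/column marginals
            f = eps * (log_mu - torch.logsumexp((g[None, :] - C) / eps, dim=1))
            g = eps * (log_nu - torch.logsumexp((f[:, None] - C) / eps, dim=0))
        # gamma_ij = exp((f_i + g_j - C_ij) / eps)
        gamma = torch.exp((f[:, None] + g[None, :] - C) / eps)

    # feedback loss: expected transport cost under the fixed plan, so
    # gradients flow through the cost matrix into the embeddings
    return torch.sum(gamma * cost)
```

Because the plan is soft, each tail-expert user embedding can be pulled toward a weighted mixture of global-expert embeddings rather than being pinned to a single counterpart.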
5. Experimental Setup
5.1. Datasets
The experiments use three real-world datasets, pre-processed to exclude users with fewer than 5 interactions.
- Amazon Beauty: E-commerce data. 57k items, 52k users. High sparsity.
- Yelp: Business reviews. 15k items, 4k users.
- Amazon Music: Digital music. 20k items, 20k users.
5.2. Evaluation Metrics
The paper uses Top-K ranking metrics (with $K = 10$ in the reported tables).
- Hit Rate (HR@K):
- Concept: Measures the proportion of test cases where the ground-truth item appears in the top-K recommended list. It answers "Did the correct item appear?"
- Formula: $ \mathrm{HR}@K = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}(\mathrm{rank}_u \leq K) $
- Symbols: $|U|$ is the number of users, $\mathbb{1}(\cdot)$ is the indicator function (1 if true, 0 otherwise), and $\mathrm{rank}_u$ is the rank of the ground-truth item.
- Normalized Discounted Cumulative Gain (NDCG@K):
- Concept: Measures the quality of the ranking. It rewards the model more if the correct item is ranked higher (e.g., rank 1 is much better than rank 10).
- Formula: $ \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K} $, where $ \mathrm{DCG}@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)} $
- Symbols: $rel_i$ is the relevance of the item at rank $i$ (1 for the ground truth, 0 for others). IDCG@K is the maximum possible DCG@K (perfect ranking).
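When each user has a single held-out ground-truth item (the usual leave-one-out protocol, under which IDCG@K = 1), both metrics reduce to simple per-user computations. The short illustrative implementation below assumes that protocol; the function name and input format are assumptions, not the paper's evaluation code.

```python
import numpy as np

def hr_ndcg_at_k(ranked_items, ground_truth, k=10):
    """Compute HR@K and NDCG@K, assuming one relevant item per user.

    ranked_items: list of recommendation lists (item ids, best first)
    ground_truth: the single held-out next item for each user
    """
    hits, ndcgs = [], []
    for recs, target in zip(ranked_items, ground_truth):
        top_k = list(recs[:k])
        if target in top_k:
            rank = top_k.index(target) + 1        # 1-based rank of the hit
            hits.append(1.0)
            ndcgs.append(1.0 / np.log2(rank + 1))  # IDCG = 1 for a single relevant item
        else:
            hits.append(0.0)
            ndcgs.append(0.0)
    return float(np.mean(hits)), float(np.mean(ndcgs))
```

Under this protocol, a ground-truth item ranked third contributes HR@10 = 1 and NDCG@10 = 1 / log2(4) = 0.5 for that user.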
5.3. Baselines
- Standard SRS: SASRec, BERT4Rec.
- Long-Tail SRS:
- CITIES: Uses head items to infer tail embeddings.
- MELT: Mutual enhancement of tail users and items.
- LLM-Enhanced SRS:
- RLMRec: LLM-generated profiles for representation learning.
- LLMInit: LLM embeddings as initialization.
- LLM-ESR: Retrieval-enhanced self-distillation with LLMs.
6. Results & Analysis
6.1. Core Results Analysis
HPSERec demonstrates superior performance across all datasets.
- Overall Improvement: It beats the best baseline (often the LLM-based LLM-ESR) by significant margins (e.g., +10.6% HR@10 on Music).
- Tail Item Boost: The most dramatic improvements are seen in Tail Items. For example, on the Yelp dataset, HPSERec achieves an HR@10 of 0.3252 for tail items, compared to 0.1584 for LLM-ESR (effectively doubling the performance).
- No "Seesaw" Effect: Unlike CITIES, which improves tail performance but hurts head performance (a "seesaw" trade-off), HPSERec improves or maintains Head Item performance while boosting the tail. This confirms the effectiveness of the feedback module and stepwise enhancement.
The following are the results from Table 1 of the original paper:
| Dataset | Model | Overall HR@10 | Overall ND@10 | Tail Item HR@10 | Tail Item ND@10 | Head Item HR@10 | Head Item ND@10 | Tail User HR@10 | Tail User ND@10 | Head User HR@10 | Head User ND@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Bert4Rec | 0.3945 | 0.2453 | 0.0342 | 0.0085 | 0.4917 | 0.2922 | 0.3845 | 0.2384 | 0.4593 | 0.2941 |
| Beauty | SASRec | 0.4488 | 0.2861 | 0.1593 | 0.0856 | 0.7187 | 0.4815 | 0.4236 | 0.2690 | 0.5261 | 0.3611 |
| Beauty | CITIES | 0.3894 | 0.2462 | 0.1127 | 0.0013 | 0.4974 | 0.2745 | 0.3852 | 0.2249 | 0.4554 | 0.2594 |
| Beauty | MELT | 0.4869 | 0.3144 | 0.1598 | 0.0628 | 0.8055 | 0.5595 | 0.4719 | 0.3021 | 0.5523 | 0.3679 |
| Beauty | RLMRec | 0.4077 | 0.2565 | 0.1924 | 0.1660 | 0.6302 | 0.4657 | 0.4356 | 0.3016 | 0.4892 | 0.3345 |
| Beauty | LLMInit | 0.4351 | 0.2914 | 0.2714 | 0.1708 | 0.6984 | 0.5198 | 0.4919 | 0.3117 | 0.5430 | 0.3632 |
| Beauty | LLM-ESR | 0.4945 | 0.3275 | 0.2986 | 0.1713 | 0.7270 | 0.5232 | 0.4821 | 0.3103 | 0.5501 | 0.3425 |
| Beauty | HPSERec | 0.5281 | 0.3665 | 0.3203 | 0.2060 | 0.7306 | 0.5229 | 0.5163 | 0.3557 | 0.5799 | 0.4148 |
| Yelp | Bert4Rec | 0.5307 | 0.3025 | 0.0131 | 0.0045 | 0.6834 | 0.3913 | 0.5319 | 0.3036 | 0.5251 | 0.2978 |
| Yelp | SASRec | 0.5866 | 0.3536 | 0.0890 | 0.0386 | 0.8002 | 0.4888 | 0.5848 | 0.3525 | 0.5945 | 0.3585 |
| Yelp | CITIES | 0.5745 | 0.3404 | 0.0776 | 0.0341 | 0.7648 | 0.4573 | 0.5751 | 0.3416 | 0.5891 | 0.3419 |
| Yelp | MELT | 0.6038 | 0.3687 | 0.0697 | 0.0263 | 0.8245 | 0.5041 | 0.6037 | 0.3688 | 0.6042 | 0.3681 |
| Yelp | RLMRec | 0.5306 | 0.3909 | 0.0104 | 0.0140 | 0.7683 | 0.4568 | 0.5351 | 0.3065 | 0.5137 | 0.2936 |
| Yelp | LLMInit | 0.6099 | 0.3781 | 0.0874 | 0.0330 | 0.7766 | 0.4797 | 0.6204 | 0.3795 | 0.6187 | 0.3823 |
| Yelp | LLM-ESR | 0.6190 | 0.3784 | 0.1584 | 0.0670 | 0.8045 | 0.5055 | 0.6138 | 0.3761 | 0.6331 | 0.3844 |
| Yelp | HPSERec | 0.6827 | 0.4231 | 0.3252 | 0.1832 | 0.8361 | 0.5261 | 0.6884 | 0.4280 | 0.6583 | 0.4027 |
| Music | Bert4Rec | 0.4721 | 0.3056 | 0.1222 | 0.0494 | 0.8299 | 0.5929 | 0.4475 | 0.2870 | 0.5638 | 0.3752 |
| Music | SASRec | 0.5431 | 0.3714 | 0.2473 | 0.1405 | 0.8511 | 0.6256 | 0.5149 | 0.3591 | 0.6843 | 0.4746 |
| Music | CITIES | 0.4421 | 0.3345 | 0.2243 | 0.0832 | 0.8328 | 0.6124 | 0.4835 | 0.3237 | 0.6317 | 0.4364 |
| Music | MELT | 0.5442 | 0.2710 | 0.0824 | 0.0312 | 0.8347 | 0.5391 | 0.4192 | 0.2609 | 0.5082 | 0.3085 |
| Music | RLMRec | 0.3832 | 0.3271 | 0.1539 | 0.1171 | 0.8531 | 0.6292 | 0.5070 | 0.3374 | 0.6677 | 0.4722 |
| Music | LLMInit | 0.5537 | 0.3877 | 0.3024 | 0.1574 | 0.8312 | 0.6426 | 0.5145 | 0.3426 | 0.6604 | 0.4631 |
| Music | LLM-ESR | 0.5958 | 0.4035 | 0.3318 | 0.1548 | 0.8961 | 0.6835 | 0.5672 | 0.3824 | 0.7069 | 0.4846 |
| Music | HPSERec | 0.6592 | 0.4786 | 0.4425 | 0.2959 | 0.8989 | 0.6806 | 0.6428 | 0.4701 | 0.7144 | 0.5069 |
6.2. Ablation Studies
The authors performed an ablation study (Table 2 in the paper) to verify the necessity of each module:
- Removing Distribution Balancing (DB): Replacing the optimal partitioning with a standard top-20% split causes a performance drop, proving that the new partitioning algorithm is more effective.
- Removing Feedforward (FF): Without stepwise enhancement (i.e., training the experts independently, without distillation), performance drops significantly, showing the value of KD.
- Removing Feedback (FB): Removing the Sinkhorn-based feedback leads to a decrease in metrics, confirming that global context helps refine tail experts.
- Conclusion: The full HPSERec model (DB + FF + FB) performs best, indicating all components are complementary.
7. Conclusion & Reflections
7.1. Conclusion Summary
HPSERec effectively tackles the long-tail problem in sequential recommendation by abandoning the "one-size-fits-all" or binary "head/tail" approaches. Instead, it views the data as a spectrum, partitioning it hierarchically and training a chain of experts. The combination of Knowledge Distillation (for transferring tail signals up) and Sinkhorn Optimal Transport (for passing global context down) creates a balanced system that significantly improves recommendations for rare items without sacrificing popular ones.
7.2. Limitations & Future Work
- Hyperparameters: The framework introduces several new hyperparameters (the number of subsets $L$, the skewness parameter $\alpha$, the size-penalty weight $\gamma$, the distillation weight $\beta$, the entropy regularization $\varepsilon$, etc.), which might require extensive tuning for different datasets.
- Computational Complexity: The subset partitioning algorithm (Dynamic Programming) and the Sinkhorn iterations add computational overhead compared to simpler models. The authors note this might be a bottleneck for extremely large-scale item sets and suggest future work on incremental learning strategies.
7.3. Personal Insights & Critique
- Innovation: The use of Optimal Transport for feedback is a brilliant touch. Most "distillation" works only one way (Teacher -> Student). Here, the bidirectional flow (Tail ↔ Global) allows the model to find a "middle ground" where tail items are distinct but not isolated from the global latent space.
- Applicability: The idea of hierarchical partitioning based on an imbalance metric is highly transferable. It could be applied to other long-tail domains like object detection (rare classes) or medical diagnosis (rare diseases), where binary splits are also insufficient.
- Critique: The paper relies on the assumption that "adjacent" experts are the best for distillation. It would be interesting to see if "skip connections" (e.g., Expert 1 distilling directly to Expert 3) could help, or if the strictly sequential chain is indeed optimal. Furthermore, while it beats LLM-based methods, the cost of training expert networks (even if they share some structure, which isn't fully detailed) might be higher than fine-tuning a single adapter for an LLM.