
Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation

Published: 11/10/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces the Wavelet Enhanced Adaptive Frequency Filter (WEARec) for sequential recommendation, enhancing traditional frequency-domain methods with dynamic filtering and wavelet feature enhancement, effectively improving performance and efficiency in capturing user preferences.

Abstract

Sequential recommendation has garnered significant attention for its ability to capture dynamic preferences by mining users' historical interaction data. Given that users' complex and intertwined periodic preferences are difficult to disentangle in the time domain, recent research is exploring frequency domain analysis to identify these hidden patterns. However, current frequency-domain-based methods suffer from two key limitations: (i) They primarily employ static filters with fixed characteristics, overlooking the personalized nature of behavioral patterns; (ii) While the global discrete Fourier transform excels at modeling long-range dependencies, it can blur non-stationary signals and short-term fluctuations. To overcome these limitations, we propose a novel method called Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation. Specifically, it consists of two vital modules: dynamic frequency-domain filtering and wavelet feature enhancement. The former is used to dynamically adjust filtering operations based on behavioral sequences to extract personalized global information, and the latter integrates wavelet transform to reconstruct sequences, enhancing blurred non-stationary signals and short-term fluctuations. Finally, these two modules work to achieve comprehensive performance and efficiency optimization in long sequential recommendation scenarios. Extensive experiments on four widely-used benchmark datasets demonstrate the superiority of our work.


In-depth Reading


1. Bibliographic Information

1.1. Title

The title is Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation. The central topic of the paper is a novel approach to sequential recommendation that enhances traditional frequency-domain filtering with adaptive and wavelet-based mechanisms.

1.2. Authors

The authors are Huayang Xu, Huanhuan Yuan, Guanfeng Liu, Junhua Fang, Lei Zhao, and Pengpeng Zhao. Their affiliations are primarily with the School of Computer Science and Technology, Soochow University, China, with Guanfeng Liu also affiliated with Macquarie University. The first two authors, Huayang Xu and Huanhuan Yuan, are marked with an asterisk, indicating they contributed equally or are co-first authors.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. The publication date (UTC) is 2025-11-10, suggesting it might be submitted to or accepted by a conference or journal in late 2025 or 2026. arXiv is a reputable open-access archive for preprints of scientific papers in various disciplines, including computer science, allowing early dissemination of research.

1.4. Publication Year

The publication year is 2025, based on the arXiv preprint date.

1.5. Abstract

Sequential recommendation (SR) aims to capture dynamic user preferences from historical interaction data. Recent research has explored frequency-domain analysis to identify hidden periodic patterns that are difficult to disentangle in the time domain. However, existing frequency-domain methods have two main limitations: (i) they use static filters that ignore the personalized nature of user behavior, and (ii) the global discrete Fourier transform (DFT), while good for long-range dependencies, can blur non-stationary signals and short-term fluctuations.

To address these issues, the paper proposes Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation (WEARec). This novel method comprises two key modules:

  • Dynamic Frequency-domain Filtering (DFF): This module dynamically adjusts filtering operations based on individual behavioral sequences to extract personalized global information.

  • Wavelet Feature Enhancement (WFE): This module integrates the wavelet transform to reconstruct sequences, specifically enhancing blurred non-stationary signals and short-term fluctuations.

    These two modules work synergistically to improve performance and efficiency, particularly in long sequential recommendation scenarios. Extensive experiments conducted on four benchmark datasets demonstrate the superiority of WEARec.

The official source link is https://arxiv.org/abs/2511.07028v2, and the PDF link is https://arxiv.org/pdf/2511.07028.pdf. It is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is accurately capturing users' dynamic and often complex periodic preferences in sequential recommendation systems. In real-world e-commerce and other applications, user interests are not static but evolve over time, making sequential recommendation a crucial task.

This problem is important because accurately predicting the next item a user will interact with is fundamental to personalized services, driving user engagement and satisfaction. Traditional methods often analyze user interaction sequences directly in the time domain. However, the authors argue that users' complex and intertwined periodic preferences are difficult to disentangle from noisy, chronologically entangled raw sequences in this domain.

Prior research has begun exploring frequency-domain analysis to identify these hidden periodic patterns by decomposing user sequences into different frequency components (e.g., high-frequency for short-term trends, low-frequency for long-term habits) using tools like the Fourier Transform. While promising, existing frequency-domain methods suffer from two key challenges or gaps:

  1. Static Filters: Most current frequency-domain methods apply uniform, static filters to all user sequences. This approach overlooks the personalized nature of behavioral patterns, as different users may be driven by different frequency components (e.g., some by long-term preferences, others by short-term fluctuations). A single, fixed filter cannot adequately capture this diversity.

  2. Global DFT Limitations: The Discrete Fourier Transform (DFT), while excellent at modeling long-range dependencies and extracting global frequency information, tends to blur non-stationary signals and short-term fluctuations. It acts as a low-pass filter, potentially missing fine-grained, transient patterns that are crucial for accurate next-item prediction.

    The paper's entry point and innovative idea is to address these two limitations by introducing adaptive and wavelet-enhanced mechanisms into frequency-domain filtering.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Model WEARec: Proposing Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation (WEARec), a new model designed to efficiently handle long sequences and capture diverse user behavioral patterns by fusing personalized global information with enhanced local information.

  • Dynamic Frequency-domain Filtering (DFF) Module: Introducing a module that dynamically adjusts filtering operations based on individual user behavioral sequences. This addresses the limitation of static filters by enabling personalized extraction of global frequency-domain information, tailored to each user's unique preferences.

  • Wavelet Feature Enhancement (WFE) Module: Developing a module that integrates wavelet transform to reconstruct sequences. This specifically targets the blurring issue of global DFT by amplifying obscure non-stationary signals and short-term fluctuations, thus capturing fine-grained local temporal patterns.

  • Synergistic Integration: Demonstrating how these two modules (DFF for personalized global context and WFE for enhanced local details) work together synergistically to achieve comprehensive performance and efficiency optimization, particularly in long sequential recommendation scenarios.

  • Empirical Validation: Conducting extensive experiments on four widely-used benchmark datasets, which consistently show that WEARec achieves superior recommendation performance compared to state-of-the-art baselines. The paper also highlights WEARec's lower computational overhead, especially in long-sequence settings.

    The key conclusions and findings are that WEARec effectively overcomes the limitations of previous frequency-domain methods by providing personalized filtering and enhancing local, non-stationary signals. This leads to more accurate and efficient sequential recommendations, particularly when dealing with longer user interaction histories.

3. Prerequisite Knowledge & Related Work

This section aims to provide readers with the foundational knowledge and context necessary to understand the WEARec model, its motivations, and its advancements over prior art.

3.1. Foundational Concepts

3.1.1. Sequential Recommendation (SR)

Sequential Recommendation (SR) is a subfield of recommender systems that focuses on predicting the next item a user will interact with based on their historical sequence of interactions. Unlike traditional recommender systems that might only consider static preferences or item similarity, SR models acknowledge that user interests are dynamic and evolve over time. The goal is to capture these dynamic interest shifts and sequential dependencies. For example, if a user buys a camera, they might next be interested in camera lenses, then tripods, and so on.

3.1.2. Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)

The Discrete Fourier Transform (DFT) is a mathematical technique used to convert a finite sequence of discrete-time samples into a finite sequence of samples of frequencies present in the signal. In simpler terms, it decomposes a signal (like a user's interaction sequence) into its constituent frequencies, revealing which periodic patterns are strong and at what frequencies they occur. DFT analyzes global signal components, meaning it considers the entire sequence to find periodic patterns.

  • Mathematical Formula for DFT: Given a discrete sequence $\{x_m\}_{m=0}^{n-1}$ of length $n$, its DFT is defined as: $X_k = \sum_{m=0}^{n-1} x_m e^{-2\pi i mk/n}, \quad 0 \leq k \leq n-1$

    • Symbol Explanation:
      • $x_m$: The $m$-th sample of the input discrete signal in the time domain.
      • $n$: The total number of samples (length of the sequence).
      • $X_k$: The $k$-th frequency component (a complex value) in the frequency domain.
      • $i$: The imaginary unit, where $i^2 = -1$.
      • $e$: Euler's number, the base of the natural logarithm.
      • $k$: The frequency index, ranging from 0 to $n-1$.
  • Inverse Discrete Fourier Transform (IDFT): The IDFT converts the frequency-domain representation back into the time domain: $x_m = \frac{1}{n} \sum_{k=0}^{n-1} X_k e^{2\pi i mk/n}$

    • Symbol Explanation:
      • $x_m$: The $m$-th sample of the reconstructed signal in the time domain.
      • $n$: The total number of samples.
      • $X_k$: The $k$-th frequency component.
      • $i$: The imaginary unit.
      • $e$: Euler's number.
      • $k$: The frequency index.
  • Fast Fourier Transform (FFT) and Inverse FFT (IFFT): FFT is an efficient algorithm for computing the DFT and its inverse (IFFT). While the DFT defines the transformation, the FFT is the practical computational method, reducing the complexity from $O(n^2)$ to $O(n \log n)$ (see the sketch after this list).

  • Parseval's Theorem (from Appendix A.1): This theorem states that the total energy of a signal is conserved during the DFT process. It ensures that adaptive filtering in the frequency domain won't accidentally distort the intrinsic information of the input signal. $\sum_{m=0}^{n-1} |x[m]|^2 = \frac{1}{n} \sum_{k=0}^{n-1} |X[k]|^2$

    • Symbol Explanation:
      • $x[m]$: The $m$-th sample of the time-domain signal.
      • $X[k]$: The $k$-th frequency component of the signal.
      • $|\cdot|^2$: The squared magnitude (energy) of a complex number.
      • $n$: The length of the sequence.
  • Convolution Theorem (from Appendix A.1): This theorem is crucial for frequency-domain filtering. It states that convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. This property directly enables the design of filters that capture specific frequency components. Given an input sequence $\{x_m\}_{m=0}^{n-1}$ and convolution parameters $\{h_m\}_{m=0}^{n-1}$, the convolution $h[m] * x[m]$ can be transformed into: $h[m] * x[m] = \mathcal{F}^{-1}(\mathcal{F}(h[m]) \odot X[k])$ This can be written compactly as: $\mathcal{F}(h * x) = \mathbf{W} \odot \mathbf{X}$

    • Symbol Explanation:
      • $h[m]$: The $m$-th sample of the filter (kernel) in the time domain.
      • $x[m]$: The $m$-th sample of the input signal in the time domain.
      • $*$: The circular convolution operator.
      • $\mathcal{F}(\cdot)$: The Fourier Transform operator.
      • $\mathcal{F}^{-1}(\cdot)$: The Inverse Fourier Transform operator.
      • $\mathbf{W}$: The learnable complex-valued filtering matrix (the Fourier transform of the filter $h[m]$).
      • $\mathbf{X}$: The frequency-domain representation of the input signal.
      • $\odot$: The Hadamard product (element-wise multiplication).
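As a concrete illustration (not from the paper), the following minimal NumPy sketch numerically verifies the DFT round trip, Parseval's theorem, and the convolution theorem on a toy signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)          # a toy "user sequence" signal
h = rng.standard_normal(n)          # a toy filter kernel

# DFT and IDFT round trip: x == IDFT(DFT(x))
X = np.fft.fft(x)
assert np.allclose(np.fft.ifft(X).real, x)

# Parseval's theorem: time-domain energy equals (1/n) * frequency-domain energy
assert np.isclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2) / n)

# Convolution theorem: circular convolution in time equals
# element-wise (Hadamard) multiplication in frequency
circ_conv = np.array([sum(h[m] * x[(t - m) % n] for m in range(n)) for t in range(n)])
via_fft = np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)).real
assert np.allclose(circ_conv, via_fft)
print("DFT round trip, Parseval, and convolution theorem all verified")
```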

3.1.3. Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT)

The Discrete Wavelet Transform (DWT) is a signal processing tool that provides a multiresolution analysis of signals. Unlike DFT, which provides only frequency information, DWT offers both time and frequency localization. This means it can identify not only which frequencies are present but also when they occur in the signal. This property makes DWT particularly effective for analyzing non-stationary signals (signals whose statistical properties change over time) and capturing transient components or short-term fluctuations.

DWT decomposes a signal into approximation coefficients (low-frequency components, representing the overall trend) and detail coefficients (high-frequency components, representing fine-grained details or noise). This is typically done through hierarchical decomposition using low-pass and high-pass filters. The Haar wavelet transform is a simple and computationally efficient type of DWT.

  • Mathematical Formulas for DWT (from Preliminaries): The $j$-th level decomposition is defined as: $A_{j+1}[m] = \sum_{k=0}^{K-1} L[k] A_j[2m-k]$ and $D_{j+1}[m] = \sum_{k=0}^{K-1} H[k] A_j[2m-k]$

    • Symbol Explanation:
      • $A_j[m]$: Approximation coefficients at decomposition level $j$. When $j=0$, $A_0[m] = x[m]$ (the original signal). These capture low-frequency components after low-pass filtering.
      • $D_j[m]$: Detail coefficients at decomposition level $j$. These capture high-frequency components after high-pass filtering.
      • $L[k]$: The low-pass filter coefficients.
      • $H[k]$: The high-pass filter coefficients.
      • $K$: The length of the filter.
      • $2m-k$: Indexing with a stride of 2 (downsampling), which halves the length of the output signal at each decomposition level.
  • Inverse Discrete Wavelet Transform (IDWT): The IDWT reconstructs the original signal from its approximation and detail coefficients stage-by-stage via iterative upsampling and filtering operations: $A_j[m] = \sum_{k=0}^{K-1} \tilde{L}[k] A_{j+1}[2m-k] + \sum_{k=0}^{K-1} \tilde{H}[k] D_{j+1}[2m-k]$

    • Symbol Explanation:
      • $A_j[m]$: The reconstructed approximation coefficients at level $j$.
      • $A_{j+1}[m]$: Approximation coefficients from the next higher decomposition level.
      • $D_{j+1}[m]$: Detail coefficients from the next higher decomposition level.
      • $\tilde{L}[k]$: Reconstruction low-pass filter coefficients.
      • $\tilde{H}[k]$: Reconstruction high-pass filter coefficients.
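For intuition, here is a minimal NumPy sketch of a 1-level Haar DWT and its inverse (an illustrative implementation, not the paper's code). Note how the detail coefficients isolate local differences while the approximation keeps the trend:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns approximation (low-pass) and detail (high-pass)
    coefficients, each half the length of the even-length input."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation: scaled local averages
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail: scaled local differences
    return a, d

def haar_idwt(a, d):
    """Inverse one-level Haar DWT: perfectly reconstructs the original signal."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt(x)
assert np.allclose(haar_idwt(a, d), x)   # perfect reconstruction property
print("approximation:", a)               # smooth trend at half resolution
print("detail:", d)                      # short-term fluctuations
```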

3.1.4. Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) is a class of feedforward artificial neural networks. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) in one layer connects with a certain weight to every node in the next layer. MLPs are capable of learning non-linear relationships and are used for various tasks, including classification, regression, and, as in this paper, generating adaptive parameters.

3.1.5. Self-Attention and Transformer

Self-attention is a mechanism that allows a model to weigh the importance of different words/items in an input sequence when encoding a particular word/item. It has been a cornerstone of the Transformer architecture, which revolutionized Natural Language Processing (NLP) and Computer Vision (CV). In sequential recommendation, Self-attention (e.g., in SASRec) is used to model item-item relationships and capture long-range dependencies in user sequences. However, the paper points out that items in user interactions are often chronologically entangled and inherently noisy, making it challenging for models to directly discern changes in behavioral preferences from raw sequences within the time domain using self-attention.

3.1.6. Cross-Entropy Loss

Cross-entropy loss is a common loss function used in classification tasks, particularly when the output is a probability distribution. It measures the difference between two probability distributions: the true distribution (ground truth) and the predicted distribution from the model. A lower cross-entropy loss indicates that the model's predictions are closer to the true labels.

3.1.7. Layer Normalization, Dropout, and Skip Connection

These are common techniques used in deep learning models to improve stability, prevent overfitting, and facilitate training of deeper networks:

  • Layer Normalization: Normalizes the inputs across the features for each sample independently. This helps stabilize the hidden state activations, making training faster and more stable.
  • Dropout: A regularization technique where randomly selected neurons are ignored (dropped out) during training. This prevents neurons from co-adapting too much and reduces overfitting.
  • Skip Connection (Residual Connection): Directly adds the input of a layer to its output, allowing gradients to flow more easily through the network. This helps mitigate the vanishing gradient problem in deeper networks and allows the model to learn residual functions.

3.2. Previous Works

The paper categorizes related work into Time-domain SR models and Frequency-domain SR models, highlighting the progression of research in sequential recommendation.

3.2.1. Time-domain SR Models

These models process user interaction sequences directly in their chronological order.

  • Early Markov Chain Models: Focused on immediate transitions between items (e.g., Factorizing Personalized Markov Chains (FPMC) by Rendle et al., 2010).
  • Deep Learning Models:
    • GRU4Rec (Hidasi et al., 2015): The first to apply Gated Recurrent Unit (GRU) networks, a type of Recurrent Neural Network (RNN), to model user behavior sequences. GRUs are designed to capture dependencies in sequential data.

    • Caser (Tang and Wang, 2018): Utilizes Convolutional Neural Networks (CNNs) with horizontal and vertical filters to capture local sequential patterns.

    • SASRec (Kang and McAuley, 2018): A prominent model leveraging the self-attention mechanism to capture item-item relationships within a sequence. It treats the sequence as a set of items and learns their importance relative to each other.

    • Contrastive Learning Enhanced Models: More recent works like CL4SRec (Xie et al., 2022) and DuoRec (Qiu et al., 2022) enhance sequential embedding representations by using contrastive learning. This involves creating augmented versions of user sequences and training the model to make representations of similar sequences closer and dissimilar ones farther apart. DuoRec specifically uses unsupervised model-level augmentation and supervised semantic positive samples.

      The general limitation of these time-domain models, as highlighted by the authors, is their struggle to effectively capture users' underlying periodic behavioral patterns due to the chronologically entangled and noisy nature of raw sequences.

3.2.2. Frequency-domain SR Models

These models transform user sequences into the frequency domain to identify periodic patterns that are less apparent in the time domain.

  • FMLPRec (Zhou et al., 2022): A pioneering work that processes sequential data in the frequency domain. It replaces the self-attention mechanism with learnable filters implemented via an MLP structure to capture periodic patterns and attenuate noise.

  • SLIME4Rec (Du et al., 2023a) and FEARec (Du et al., 2023b): These models advance the frequency-domain approach by proposing a layered frequency ramp structure that considers different frequency bands for each layer. SLIME4Rec further integrates contrastive learning with dynamic and static selection modules. FEARec utilizes frequency-domain information in attention computation and integrates both time and frequency domains.

  • BSARec (Shin et al., 2024): A more recent approach that seeks to uncover fine-grained sequential patterns and inject them as inductive biases into the self-attention mechanism. It adjusts the influence on the high-frequency region to be learnable and alleviates over-smoothing through a frequency recalibrator.

  • FamouSRec (Zhang et al., 2025): Employs a Mixture-of-Experts (MoE) architecture, allowing specialized expert models to be selected based on users' specific frequency-based behavioral patterns.

    The paper identifies two key limitations in these existing frequency-domain models:

  1. Lack of User-Specific Adaptivity: They primarily employ static filters or fixed-pattern filters, processing all users uniformly and ignoring the personalized nature of behavioral patterns.
  2. Global DFT Blurring: While DFT excels at long-range dependencies, it struggles to capture local temporal features of high-frequency interactions and short-term points of interest, acting essentially as a low-pass filter (e.g., FMLPRec tends to learn low-frequency components).

3.3. Technological Evolution

The evolution of sequential recommendation has seen several paradigm shifts:

  1. Early Statistical Models (e.g., Markov Chains): Focused on direct transitions between items, often limited by modeling only short-term dependencies.
  2. Traditional Machine Learning: Employed features engineered from sequences (e.g., matrix factorization with temporal dynamics).
  3. Deep Learning (RNNs, CNNs): GRU4Rec and Caser brought the power of deep learning to capture more complex sequential patterns. RNNs like GRU are naturally suited for sequences, while CNNs can detect local patterns.
  4. Attention Mechanisms (Transformers): SASRec and BERT4Rec (Sun et al., 2019) leveraged the Transformer architecture with self-attention, enabling the capture of arbitrary long-range dependencies across a sequence, becoming the dominant paradigm.
  5. Contrastive Learning: Integrated with deep learning models to improve representation quality by learning robust embeddings.
  6. Frequency-Domain Analysis: A more recent shift, pioneered by FMLPRec, moving beyond the time domain to exploit periodic patterns using Fourier Transform. This aims to disentangle complex user preferences that are hard to see in raw chronological data.
  7. Adaptive and Multi-Resolution Frequency Analysis (This Paper): WEARec represents the current frontier, moving beyond static frequency filters and global DFT limitations by introducing dynamic, personalized filtering and wavelet-based enhancement for local, non-stationary signals.

3.4. Differentiation Analysis

Compared to the main methods in related work, WEARec introduces several core differences and innovations:

  • Against Time-domain Models (e.g., SASRec, DuoRec): WEARec operates primarily in the frequency domain, directly addressing the challenge of chronologically entangled and noisy sequences that make periodic pattern identification difficult for time-domain models. While time-domain models focus on item-item relationships, WEARec decomposes these into fundamental frequencies.

  • Against Existing Frequency-domain Models (e.g., FMLPRec, SLIME4Rec, BSARec, FamouSRec):

    • Personalization through Dynamic Filters: A primary innovation of WEARec is its Dynamic Frequency-domain Filtering (DFF) module. Unlike FMLPRec's static MLP filters or other models that apply uniform filtering, DFF uses a Multi-Layer Perceptron (MLP) to generate adaptive scaling factors and bias terms based on the context signal of each user's sequence. This allows the filter to dynamically adjust its characteristics to match the personalized behavioral patterns of individual users (e.g., emphasizing low frequencies for long-term preferences or higher frequencies for short-term trends), a limitation explicitly identified in previous frequency-domain approaches.
    • Enhanced Local Information with Wavelets: The Wavelet Feature Enhancement (WFE) module is another key differentiator. Existing DFT-based methods (like FMLPRec) struggle with non-stationary signals and short-term fluctuations because DFT provides a global view, effectively acting as a low-pass filter. WFE addresses this by leveraging the Discrete Wavelet Transform (DWT) (specifically Haar wavelet), which provides time-frequency localization. This allows WEARec to capture fine-grained temporal patterns and explicitly enhance obscured high-frequency details that global DFT might blur, offering a more complete representation of user behavior.
    • Combined Strengths: WEARec uniquely combines the global pattern capturing capability of adaptive frequency-domain filtering with the local, fine-grained detail capturing capability of wavelets. This dual approach aims for a more comprehensive understanding of user preferences.
    • Efficiency: The paper also claims WEARec achieves better performance with lower computational costs, especially in long-sequence scenarios, by avoiding the quadratic complexity of self-attention and potentially expensive contrastive learning mechanisms while utilizing efficient FFT and Haar wavelets.

4. Methodology

The proposed WEARec model is designed to overcome the limitations of existing frequency-domain sequential recommendation methods by dynamically adjusting filters for personalized global information and enhancing non-stationary signals with wavelet transforms. The overall framework is depicted in Figure 2 of the original paper.

4.1. Problem Statement

The goal of Sequential Recommendation (SR) is to predict the next item a user will interact with, given their past interactions. Let $\mathcal{U}$ be the set of users and $\mathcal{V}$ be the set of items. Each user $u \in \mathcal{U}$ has a chronological sequence of item interactions $s_u = [v_1^{(u)}, v_2^{(u)}, \dots, v_t^{(u)}, \dots, v_n^{(u)}]$, where $v_t^{(u)} \in \mathcal{V}$ is the item user $u$ interacted with at step $t$, and $n$ is the sequence length. For model input, sequences are typically truncated or padded to a maximum length $N$, resulting in $s_u = [v_1^{(u)}, v_2^{(u)}, \dots, v_N^{(u)}]$. The SR task is to use this sequence $s_u$ to predict the top-$K$ items the user will interact with at time step $N+1$.

4.2. Embedding Layer

Given a user behavior sequence $s_u$, the first step is to convert the categorical item IDs into dense vector representations called embeddings. An item embedding matrix $\mathbf{M} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is used, where $d$ is the embedding size and $\mathbf{M}_{s_i}$ is the embedding for item $s_i$. To incorporate the sequential order, positional embeddings $\mathbf{P} \in \mathbb{R}^{N \times d}$ are added to the item embeddings. Finally, Layer Normalization and Dropout operations are applied to stabilize training and prevent overfitting. This process generates the sequence representation $\mathbf{E}^u \in \mathbb{R}^{N \times d}$ as follows: $\mathbf{E}^u = \operatorname{Dropout}(\operatorname{LayerNorm}(\mathbf{E}^u + \mathbf{P}))$

  • Symbol Explanation:
    • $\mathbf{E}^u \in \mathbb{R}^{N \times d}$: The embedded sequence representation for user $u$, where $N$ is the maximum sequence length and $d$ is the embedding dimension. On the right-hand side, it refers to the raw item embeddings.
    • $\mathbf{P} \in \mathbb{R}^{N \times d}$: The positional embedding matrix, which encodes the position of each item in the sequence.
    • $\operatorname{LayerNorm}(\cdot)$: The Layer Normalization function, which normalizes the activations across features for each input sample.
    • $\operatorname{Dropout}(\cdot)$: The Dropout function, which randomly sets a fraction of input units to zero during training to prevent overfitting.
    • $+$: Element-wise addition of the item embeddings and positional embeddings.
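A minimal PyTorch sketch of such an embedding layer is shown below; the class name and default sizes are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Item + positional embeddings with LayerNorm and Dropout,
    a hedged sketch of E^u = Dropout(LayerNorm(E^u + P))."""
    def __init__(self, num_items, max_len=50, d=64, dropout=0.5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)  # id 0 = padding
        self.pos_emb = nn.Embedding(max_len, d)
        self.norm = nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, seq):                                # seq: (batch, N) item ids
        positions = torch.arange(seq.size(1), device=seq.device)
        e = self.item_emb(seq) + self.pos_emb(positions)   # (batch, N, d)
        return self.drop(self.norm(e))

emb = EmbeddingLayer(num_items=12101, max_len=50, d=64)    # e.g., Beauty-sized
out = emb(torch.randint(1, 12102, (256, 50)))
print(out.shape)                                           # torch.Size([256, 50, 64])
```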

4.3. Dynamic Frequency-domain Filtering (DFF)

The Dynamic Frequency-domain Filtering (DFF) module is designed to dynamically adjust filtering operations based on behavioral sequences, extracting personalized global information.

4.3.1. Multi-Head Projection

Inspired by the multi-head attention mechanism, the input item embedding $\mathbf{E}^u$ (or the output of the previous layer $\mathbf{H}^l$) is first projected into $k$ parallel feature subspaces. This allows different parts of the embedding dimension to be processed independently, each potentially with its own adaptive filter. $\mathbf{H}^0 = \mathbf{E}^u$

  • Symbol Explanation:
    • $\mathbf{H}^0$: The initial input to the first layer, which is the output from the embedding layer.

    • $\mathbf{H}^l \in \mathbb{R}^{N \times d}$: The time-domain feature representation at the $l$-th layer.

    • $\mathbf{B}_i \in \mathbb{R}^{N \times d/k}$: The $i$-th subspace created by decomposing $\mathbf{H}^l$ along the embedding dimension, where $k$ is the number of heads.

      For each subspace $\mathbf{B}_i^l$, a Fast Fourier Transform (FFT) is performed along the item dimension to convert it into the frequency domain: $\mathbf{F}_i^l = \mathcal{F}(\mathbf{B}_i^l)$

  • Symbol Explanation:
    • $\mathbf{B}_i^l \in \mathbb{R}^{N \times d/k}$: The $i$-th time-domain subspace feature of the $l$-th layer.
    • $\mathcal{F}(\cdot)$: The 1D FFT (Fast Fourier Transform) operator.
    • $\mathbf{F}_i^l \in \mathbb{C}^{M \times d/k}$: The $i$-th frequency-domain subspace feature of the $l$-th layer, a complex-valued tensor.
    • $M$: The length of the frequency-domain representation, calculated as $M = \lceil N/2 \rceil + 1$, where $N$ is the sequence length.

4.3.2. Context Signal Extraction

To enable dynamic adaptation, the model extracts an overall representation of the user's historical interaction sequence. This is done by averaging the input features in the time domain along the item dimension: $\mathbf{c}^l = \frac{1}{N} \sum_{i=1}^{N} \mathbf{H}_i^l$

  • Symbol Explanation:
    • $\mathbf{c}^l \in \mathbb{R}^{1 \times d}$: The overall representation (context) of the user's historical interaction sequence at the $l$-th layer.
    • $\mathbf{H}_i^l \in \mathbb{R}^{1 \times d}$: The $i$-th row of $\mathbf{H}^l$, corresponding to the embedding of the $i$-th item in the sequence.
    • $N$: The sequence length.
    • $\sum_{i=1}^{N}$: Summation over all items in the sequence.

4.3.3. Adaptive Filter Generation

Two three-layer MLP networks are used to generate dynamic scaling factors and bias terms from the extracted user contextual features $\mathbf{c}^l$. These terms will modulate the personalized frequency-domain filters: $\Delta \mathbf{s}^l = \mathrm{MLP}_1(\mathbf{c}^l), \quad \Delta \mathbf{b}^l = \mathrm{MLP}_2(\mathbf{c}^l)$

  • Symbol Explanation:
    • $\Delta \mathbf{s}^l \in \mathbb{R}^{k \times M}$: The scaling factor for dynamically adjusting the filter at the $l$-th layer, personalized for each sequence.
    • $\Delta \mathbf{b}^l \in \mathbb{R}^{k \times M}$: The bias term for dynamically adjusting the filter at the $l$-th layer, also personalized for each sequence.
    • $\mathrm{MLP}_1(\cdot)$, $\mathrm{MLP}_2(\cdot)$: Two distinct Multi-Layer Perceptron networks.
    • $\mathbf{c}^l$: The user's overall context representation.

4.3.4. Personalized Filter Modulation

The base filter weights $\mathbf{W}^l \in \mathbb{R}^{k \times M}$ and bias $\mathbf{b}^l \in \mathbb{R}^{k \times M}$ are modulated using the personalization-generated scaling factor $\Delta \mathbf{s}^l$ and bias term $\Delta \mathbf{b}^l$: $\hat{\mathbf{W}}^l = \mathbf{W}^l \odot (1 + \Delta \mathbf{s}^l)$ and $\hat{\mathbf{b}}^l = \mathbf{b}^l + \Delta \mathbf{b}^l$

  • Symbol Explanation:
    • $\mathbf{W}^l \in \mathbb{R}^{k \times M}$: The base (learnable) filter weights at the $l$-th layer.
    • $\mathbf{b}^l \in \mathbb{R}^{k \times M}$: The base (learnable) bias at the $l$-th layer.
    • $\hat{\mathbf{W}}^l \in \mathbb{R}^{k \times M}$: The linearly modulated weights of the dynamic filter at the $l$-th layer.
    • $\hat{\mathbf{b}}^l \in \mathbb{R}^{k \times M}$: The linearly modulated bias of the dynamic filter at the $l$-th layer.
    • $\odot$: Hadamard product (element-wise multiplication).
    • $1$: A scalar one, added element-wise to $\Delta \mathbf{s}^l$.

4.3.5. Multiple Learnable Filters

The personalized filtered frequency-domain information is obtained by applying the modulated filter (weights $\hat{\mathbf{W}}^l$ and bias $\hat{\mathbf{b}}^l$) to each frequency-domain feature subspace $\mathbf{F}_i^l$. This is an element-wise multiplication and addition in the frequency domain, consistent with the Convolution Theorem: $\tilde{\mathbf{F}}_i^l = \mathbf{F}_i^l \odot \hat{\mathbf{W}}^l + \hat{\mathbf{b}}^l$

  • Symbol Explanation:
    • $\tilde{\mathbf{F}}_i^l$: The filtered frequency-domain feature for the $i$-th subspace at the $l$-th layer.

    • $\mathbf{F}_i^l$: The original frequency-domain feature for the $i$-th subspace.

    • $\hat{\mathbf{W}}^l$: The modulated filter weights.

    • $\hat{\mathbf{b}}^l$: The modulated filter bias.

    • $\odot$: Hadamard product (element-wise multiplication).

      Finally, the processed frequency-domain signal $\tilde{\mathbf{F}}_i^l$ is mapped back to the time domain using the Inverse Discrete Fourier Transform (IDFT): $\mathbf{X}_i^l = \mathcal{F}^{-1}(\tilde{\mathbf{F}}_i^l)$

  • Symbol Explanation:
    • $\mathbf{X}_i^l$: The time-domain feature for the $i$-th subspace after DFF.

    • $\mathcal{F}^{-1}(\cdot)$: The IDFT (Inverse Discrete Fourier Transform) operator.

    • $\tilde{\mathbf{F}}_i^l$: The filtered frequency-domain feature.

      The outputs from all $k$ subspaces are then concatenated to form the complete time-domain representation for the current layer: $\mathbf{X}^l = [\mathbf{X}_1^l, \mathbf{X}_2^l, \ldots, \mathbf{X}_k^l]$

  • Symbol Explanation:
    • $\mathbf{X}^l$: The complete time-domain feature after DFF at layer $l$, formed by concatenating the outputs from all $k$ subspaces.

      The overall framework is visually represented in Figure 2 of the original paper, showing how the Multi-Head Projection feeds into the FFT, which then passes through the Multiple Learnable Filters (modulated by MLP-generated factors) and the IFFT back to the time domain.
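The following PyTorch sketch assembles these steps (multi-head projection, context extraction, MLP-generated modulation, frequency-domain filtering, inverse FFT) into one module. It is an illustrative reconstruction under stated assumptions, not the released implementation; in particular, it uses real-valued base filters and `rfft`, whose output length is $\lfloor N/2 \rfloor + 1$:

```python
import torch
import torch.nn as nn

class DynamicFrequencyFilter(nn.Module):
    """Hedged sketch of the DFF module: rfft per head, a base filter modulated
    by MLP-generated per-sequence scale/bias terms, then irfft."""
    def __init__(self, N=50, d=64, k=2):
        super().__init__()
        self.N, self.d, self.k = N, d, k
        self.M = N // 2 + 1                          # rfft length (paper: ceil(N/2)+1)
        # base filter weights/bias, one per head and frequency bin (real-valued here)
        self.W = nn.Parameter(torch.randn(k, self.M, 1) * 0.02)
        self.b = nn.Parameter(torch.zeros(k, self.M, 1))
        def mlp():  # three-layer MLP mapping the context to k*M modulation terms
            return nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, k * self.M))
        self.mlp_s, self.mlp_b = mlp(), mlp()

    def forward(self, H):                            # H: (batch, N, d)
        B, N, d = H.shape
        c = H.mean(dim=1)                            # context signal c^l, (batch, d)
        ds = self.mlp_s(c).view(B, self.k, self.M, 1)    # adaptive scale Δs
        db = self.mlp_b(c).view(B, self.k, self.M, 1)    # adaptive bias Δb
        # multi-head projection: split the embedding dim into k subspaces
        Bh = H.view(B, N, self.k, d // self.k).permute(0, 2, 1, 3)  # (B, k, N, d/k)
        F = torch.fft.rfft(Bh, dim=2)                # (B, k, M, d/k), complex
        W_hat = self.W * (1 + ds)                    # personalized modulation
        b_hat = self.b + db
        F_tilde = F * W_hat + b_hat                  # filtering in frequency domain
        X = torch.fft.irfft(F_tilde, n=N, dim=2)     # back to the time domain
        return X.permute(0, 2, 1, 3).reshape(B, N, d)    # concatenate heads

x = torch.randn(4, 50, 64)
print(DynamicFrequencyFilter()(x).shape)             # torch.Size([4, 50, 64])
```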

4.4. Wavelet Feature Enhancement (WFE)

The Wavelet Feature Enhancement (WFE) module is designed to capture fine-grained temporal patterns and enhance non-stationary signals and short-term fluctuations, which global DFT might blur. The Haar wavelet transform is chosen for its simplicity, efficiency, and perfect reconstruction property.

4.4.1. Multi-Head Projection

Similar to the DFF module, the input features $\mathbf{H}^l$ are first projected into $k$ parallel feature subspaces $\mathbf{B}_i^l$ to align with the processing of the DFF module.

4.4.2. Wavelet Decomposition

A 1D Haar wavelet transform is applied along the item dimension of each time-domain subspace feature $\mathbf{B}_i^l$. This decomposes the signal into low-frequency (approximation) coefficients $\mathbf{A}_i^l$ and high-frequency (detail) coefficients $\mathbf{D}_i^l$: $\mathbf{A}_i^l, \mathbf{D}_i^l = \mathcal{W}(\mathbf{B}_i^l)$

  • Symbol Explanation:
    • $\mathbf{A}_i^l \in \mathbb{R}^{K \times d/k}$: The $i$-th subspace's approximation coefficients at the $l$-th layer, representing low-frequency components.
    • $\mathbf{D}_i^l \in \mathbb{R}^{\bar{K} \times d/k}$: The $i$-th subspace's detail coefficients at the $l$-th layer, capturing high-frequency components. (Note: the paper uses $K$ for the filter length in the DWT formulas; here $K$ and $\bar{K}$ denote the coefficient lengths, and for a 1-level Haar decomposition both equal $N/2$.)
    • $\mathbf{B}_i^l \in \mathbb{R}^{N \times d/k}$: The $i$-th time-domain subspace feature of the $l$-th layer.
    • $\mathcal{W}(\cdot)$: The 1D Haar wavelet transform operator.

4.4.3. Data Rescale

To enhance or suppress specific high-frequency signals, the detail coefficients $\mathbf{D}_i^l$ are multiplied by an adaptive learnable matrix $\mathbf{T}^l$. The approximation coefficients $\mathbf{A}_i^l$ (low-frequency information) are left unmodified to preserve the original primary components of the sequence: $\tilde{\mathbf{D}}_i^l = \mathbf{D}_i^l \odot \mathbf{T}^l$

  • Symbol Explanation:
    • $\tilde{\mathbf{D}}_i^l \in \mathbb{R}^{K \times d/k}$: The enhanced detail coefficients of the $i$-th subspace at the $l$-th layer.
    • $\mathbf{D}_i^l$: The original detail coefficients.
    • $\mathbf{T}^l \in \mathbb{R}^{K \times d/k}$: The adaptive high-frequency enhancer (a learnable matrix) at the $l$-th layer.
    • $\odot$: Hadamard product (element-wise multiplication).

4.4.4. Wavelet Reconstruction

The high-frequency-enhanced time-domain signal is reconstructed by applying the Inverse Haar Wavelet Transform (IDWT) to the unmodified approximation coefficients $\mathbf{A}_i^l$ and the enhanced detail coefficients $\tilde{\mathbf{D}}_i^l$: $\mathbf{Y}_i^l = \mathcal{W}^{-1}(\mathbf{A}_i^l, \tilde{\mathbf{D}}_i^l)$

  • Symbol Explanation:
    • $\mathbf{Y}_i^l$: The reconstructed time-domain feature for the $i$-th subspace after WFE.

    • $\mathcal{W}^{-1}(\cdot)$: The Inverse Haar Wavelet Transform operator.

    • $\mathbf{A}_i^l$: The approximation coefficients.

    • $\tilde{\mathbf{D}}_i^l$: The enhanced detail coefficients.

      The outputs from all $k$ subspaces are then concatenated to form the complete time-domain representation for the current layer: $\mathbf{Y}^l = [\mathbf{Y}_1^l, \mathbf{Y}_2^l, \ldots, \mathbf{Y}_k^l]$

  • Symbol Explanation:
    • $\mathbf{Y}^l$: The complete time-domain feature after WFE at layer $l$, formed by concatenating the outputs from all $k$ subspaces.
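A hedged PyTorch sketch of the WFE pipeline follows, using an explicit 1-level Haar transform so no wavelet library is required (illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class WaveletFeatureEnhancement(nn.Module):
    """Hedged sketch of the WFE module: 1-level Haar decomposition along the
    item dimension, learnable rescaling of detail coefficients, reconstruction."""
    def __init__(self, N=50, d=64, k=2):
        super().__init__()
        self.k = k
        # adaptive high-frequency enhancer T^l, shared across the batch
        self.T = nn.Parameter(torch.ones(N // 2, d // k))

    def forward(self, H):                            # H: (batch, N, d)
        B, N, d = H.shape
        Bh = H.view(B, N, self.k, d // self.k).permute(0, 2, 1, 3)  # (B, k, N, d/k)
        s2 = 2.0 ** 0.5
        A = (Bh[:, :, 0::2] + Bh[:, :, 1::2]) / s2  # approximation (low-frequency)
        D = (Bh[:, :, 0::2] - Bh[:, :, 1::2]) / s2  # detail (high-frequency)
        D = D * self.T                               # enhance/suppress details
        Y = torch.empty_like(Bh)                     # inverse Haar transform
        Y[:, :, 0::2] = (A + D) / s2
        Y[:, :, 1::2] = (A - D) / s2
        return Y.permute(0, 2, 1, 3).reshape(B, N, d)    # concatenate heads

x = torch.randn(4, 50, 64)
print(WaveletFeatureEnhancement()(x).shape)          # torch.Size([4, 50, 64])
```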

4.5. Feature Integrate

This step combines the global features extracted by the DFF module with the fine-grained local features derived from the WFE module. A hyperparameter $\alpha$ is used to balance the emphasis between these two types of features: $\widehat{\mathbf{H}}^l = \alpha \odot \mathbf{X}^l + (1 - \alpha) \odot \mathbf{Y}^l$

  • Symbol Explanation:
    • $\widehat{\mathbf{H}}^l$: The integrated feature representation at layer $l$.

    • $\alpha$: A hyperparameter that controls the weighting between the DFF and WFE outputs (tuned per dataset; see the implementation details in Section 5.4).

    • $\mathbf{X}^l$: The global features from the DFF module.

    • $\mathbf{Y}^l$: The fine-grained local features from the WFE module.

    • $\odot$: Hadamard product (element-wise multiplication).

      To ensure a stable training process and better generalization, especially in deeper models, standard techniques like skip connections, dropout, and layer normalization are applied: $\widehat{\mathbf{H}}^l = \mathrm{LayerNorm}(\mathbf{H}^l + \mathrm{Dropout}(\widehat{\mathbf{H}}^l))$

  • Symbol Explanation:
    • $\mathbf{H}^l$: The input to the current layer (residual connection).
    • $\widehat{\mathbf{H}}^l$: On the right-hand side, the output of the feature integration (before normalization and dropout); on the left, the normalized result.
    • $\mathrm{LayerNorm}(\cdot)$, $\mathrm{Dropout}(\cdot)$: As defined before.
    • $+$: Element-wise addition for the skip connection.

4.6. Point-wise Feed Forward Network (FFN)

After feature integration, a Point-wise Feed-Forward Network (FFN) is applied. This FFN consists of an MLP with GELU activation and introduces non-linearity between different dimensions in the time domain: $\tilde{\mathbf{H}}^l = \mathrm{FFN}(\widehat{\mathbf{H}}^l) = (\mathrm{GELU}(\widehat{\mathbf{H}}^l \mathbf{W}_1 + \mathbf{b}_1)) \mathbf{W}_2 + \mathbf{b}_2$

  • Symbol Explanation:
    • $\tilde{\mathbf{H}}^l$: The output of the FFN at layer $l$.

    • $\widehat{\mathbf{H}}^l$: The input to the FFN (the output of feature integration and normalization).

    • $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{d \times d}$: Learnable weight matrices for the two linear transformations in the FFN.

    • $\mathbf{b}_1, \mathbf{b}_2 \in \mathbb{R}^{1 \times d}$: Learnable bias vectors for the two linear transformations.

    • $\mathrm{GELU}(\cdot)$: The Gaussian Error Linear Unit activation function, which introduces non-linearity.

      To prevent overfitting and ensure stability, another dropout layer and layer normalization are applied in a residual connection structure: $\mathbf{H}^{l+1} = \mathrm{LayerNorm}(\widehat{\mathbf{H}}^l + \mathrm{Dropout}(\tilde{\mathbf{H}}^l))$

  • Symbol Explanation:
    • $\mathbf{H}^{l+1}$: The final output of the current WEARec block, which serves as the input to the next block or the prediction layer.
    • $\widehat{\mathbf{H}}^l$: The input to this residual connection (the output of feature integration and the first normalization).
    • $\tilde{\mathbf{H}}^l$: The output from the FFN.
    • $\mathrm{LayerNorm}(\cdot)$, $\mathrm{Dropout}(\cdot)$, $+$: As defined before.
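A compact PyTorch sketch of the integration and FFN steps (illustrative; the class name and defaults are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntegrateAndFFN(nn.Module):
    """Hedged sketch of feature integration (alpha-weighted fusion of DFF and WFE
    outputs) followed by the point-wise FFN, with residuals, dropout, LayerNorm."""
    def __init__(self, d=64, alpha=0.3, dropout=0.5):
        super().__init__()
        self.alpha = alpha
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)
        self.W1, self.W2 = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, H, X, Y):      # H: layer input, X: DFF output, Y: WFE output
        fused = self.alpha * X + (1 - self.alpha) * Y
        H_hat = self.norm1(H + self.drop(fused))          # skip connection
        ffn = self.W2(F.gelu(self.W1(H_hat)))             # point-wise FFN with GELU
        return self.norm2(H_hat + self.drop(ffn))         # next block's input H^{l+1}

H = X = Y = torch.randn(4, 50, 64)
print(IntegrateAndFFN()(H, X, Y).shape)                   # torch.Size([4, 50, 64])
```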

4.7. Prediction Layer

In the final layer, the model computes a recommendation probability score for each candidate item. The output of the last WEARec block at the final position, $\mathbf{h}^L$ (where $L$ is the number of blocks), is multiplied by the transposed item embedding matrix $\mathbf{M}^\top$ to produce prediction scores, and a softmax function converts these scores into probabilities: $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{h}^L \mathbf{M}^\top)$

  • Symbol Explanation:
    • $\hat{\mathbf{y}} \in \mathbb{R}^{|\mathcal{V}|}$: The predicted probability score for each item in the entire item set.

    • $\mathbf{h}^L \in \mathbb{R}^{1 \times d}$: The output representation of the user's last interaction from the $L$-th WEARec block.

    • $\mathbf{M}^\top$: The transposed item embedding matrix, used to project the user's final representation into the item space.

    • $\mathrm{softmax}(\cdot)$: The softmax function, which converts a vector of raw scores into a probability distribution.

      The model parameters are optimized using the cross-entropy loss function: $\mathcal{L}_{Rec} = - \sum_{i=1}^{|\mathcal{V}|} y_i \log(\hat{y}_i)$

  • Symbol Explanation:
    • $\mathcal{L}_{Rec}$: The cross-entropy loss for sequential recommendation.

    • $|\mathcal{V}|$: The total number of items.

    • $y_i$: The ground-truth label for item $v_i$ (1 if $v_i$ is the actual next item, 0 otherwise).

    • $\hat{y}_i$: The predicted preference score (probability) for item $v_i$.

    • $\log(\cdot)$: Natural logarithm.

    • $\sum$: Summation over all items.

      The overall architecture combines dynamic frequency domain processing with wavelet-based local feature enhancement, integrated through residual connections and feed-forward networks, and optimized for next-item prediction.
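A minimal sketch of the prediction and loss computation in PyTorch, assuming the common weight-tying setup in which scores are computed against the input item embedding matrix (all tensors below are illustrative stand-ins):

```python
import torch
import torch.nn.functional as F

num_items, d = 12101, 64
h_L = torch.randn(256, d)                      # (batch, d): last-position outputs
M = torch.randn(num_items, d)                  # item embedding matrix
targets = torch.randint(0, num_items, (256,))  # ground-truth next items

logits = h_L @ M.T                             # scores over the whole item set
loss = F.cross_entropy(logits, targets)        # softmax + cross-entropy in one call
top_k = logits.topk(10, dim=-1).indices        # top-10 recommendations per user
print(loss.item(), top_k.shape)                # scalar loss, torch.Size([256, 10])
```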

5. Experimental Setup

This section details the datasets, evaluation metrics, baseline models, and implementation specifics used to evaluate WEARec.

5.1. Datasets

The experiments were conducted on four public datasets, covering different domains, scales, and sparsity levels, commonly used in sequential recommendation research. Following standard practice (Zhou et al. 2020, 2022), a 5-core setting was adopted, filtering out users with fewer than 5 interactions.

The following are the results from Table 3 of the original paper:

Specs. LastFM ML-1M Beauty Sports
# Users 1,090 6,041 22,363 25,598
# Items 3,646 3,417 12,101 18,357
# Interactions 52,551 999,611 198,502 296,337
# Avg. Length 48.2 165.5 8.9 8.3
Sparsity 98.68% 95.16% 99.93% 99.95%

5.1.1. Description of Datasets

  • LastFM: This dataset contains user interactions with music, specifically artist listening records. It is used for recommending musicians and is characterized by relatively long sequence lengths (average length 48.2) and high sparsity (98.68%).

  • MovieLens-1M (ML-1M): (Harper and Konstan 2015) This dataset is based on movie reviews collected from the non-commercial MovieLens website. It is a dense dataset with about 1 million interactions and the longest average sequence length (165.5), making it suitable for evaluating models in long-sequence scenarios.

  • Amazon Beauty: (McAuley et al. 2015) Part of the Amazon review dataset, this dataset contains user-item interactions related to beauty products. It is a sparse dataset (99.93%) with relatively short average sequence lengths (8.9).

  • Amazon Sports: (McAuley et al. 2015) Also from the Amazon review dataset, focusing on sports-related products. Similar to Beauty, it is very sparse (99.95%) with short average sequence lengths (8.3).

    These datasets were chosen for their diversity in terms of domain, size, and interaction density, allowing for a comprehensive evaluation of WEARec's performance across various real-world scenarios.

5.2. Evaluation Metrics

The leave-one-out strategy is used for partitioning each user's item sequence, where the last item is held out for testing and the preceding items are used for training. Predictions are ranked over the entire item set without negative sampling. Performance is evaluated using Hit Ratio at K (HR@K) and Normalized Discounted Cumulative Gain at K (NDCG@K), with $K$ set to 10 and 20.

5.2.1. Hit Ratio at K (HR@K)

  • Conceptual Definition: Hit Ratio at K (HR@K) measures how often the target item (the next item the user actually interacted with) appears within the top $K$ items recommended by the model. It is a recall-based metric, focusing on whether the relevant item is present in the recommendation list, regardless of its position.
  • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top-K list}}{\text{Total number of users}} $
  • Symbol Explanation:
    • Number of users for whom the target item is in top-K list: The count of unique users for whom the true next item was found among the top $K$ predicted items.
    • Total number of users: The total number of users in the test set.

5.2.2. Normalized Discounted Cumulative Gain at K (NDCG@K)

  • Conceptual Definition: Normalized Discounted Cumulative Gain at K (NDCG@K) is a ranking quality metric that accounts for the position of the relevant item in the recommendation list. It assigns higher scores to relevant items that appear at higher ranks (earlier in the list). The "discounted" part means that relevant items further down the list contribute less to the total score. The "normalized" part ensures scores are comparable across different query results.
  • Mathematical Formula: First, the Discounted Cumulative Gain (DCG@K) is calculated: $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$ Then, NDCG@K normalizes DCG@K by dividing it by the Ideal DCG (IDCG@K), which is the DCG of the perfect ranking (all relevant items at the top): $\mathrm{IDCG@K} = \sum_{i=1}^{|\mathrm{REL}|} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$ where $|\mathrm{REL}|$ is the number of relevant items in the ideal ranking, typically 1 for next-item recommendation. Finally: $\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$
  • Symbol Explanation:
    • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommended list. For sequential recommendation, it is typically 1 if the item is the true next item, and 0 otherwise.
    • $i$: The rank (position) of the item in the recommended list, from 1 to $K$.
    • $K$: The number of top recommendations considered.
    • $\log_2(i+1)$: The discount factor, which reduces the contribution of items at lower ranks.
    • $|\mathrm{REL}|$: The number of relevant items in the ideally sorted list (usually 1, for the single ground-truth next item).
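Since each test user has exactly one ground-truth item, both metrics reduce to simple functions of that item's rank in the full ranking. A minimal sketch (illustrative, not the paper's evaluation code):

```python
import numpy as np

def hr_ndcg_at_k(ranks, k=10):
    """Compute HR@K and NDCG@K for next-item prediction, given each user's
    1-based rank of the single ground-truth item in the full ranking."""
    ranks = np.asarray(ranks)
    hits = ranks <= k
    hr = hits.mean()                                  # fraction of users with a hit
    # one relevant item => IDCG = 1, and DCG = 1/log2(rank+1) when within top-K
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 1), 0.0).mean()
    return hr, ndcg

# e.g., five test users whose target items ranked 1st, 3rd, 12th, 7th, 25th
print(hr_ndcg_at_k([1, 3, 12, 7, 25], k=10))          # HR@10 = 0.6
```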

5.3. Baselines

To demonstrate the effectiveness of WEARec, it was compared against a selection of state-of-the-art models from two categories:

5.3.1. Time-domain SR models

These models process sequences directly in the chronological order.

  • GRU4Rec (Hidasi et al. 2015): A foundational model that uses Gated Recurrent Units (GRU) to capture sequential dependencies.
  • Caser (Tang and Wang 2018): A CNN-based method that captures local patterns using horizontal and vertical convolutional filters.
  • SASRec (Kang and McAuley 2018): A highly influential model that employs the self-attention mechanism to model item-item relationships.
  • DuoRec (Qiu et al. 2022): A more recent model that enhances sequence representations through contrastive learning using both unsupervised model-level augmentation and supervised semantic positive samples.

5.3.2. Frequency-domain SR models

These models leverage frequency-domain analysis.

  • FMLPRec (Zhou et al. 2022): An all-MLP model that uses learnable filters in the frequency domain to capture periodic patterns and reduce noise.
  • FamouSRec (Zhang et al. 2025): Utilizes a Mixture-of-Experts (MoE) framework to select specialized expert models tailored to users' specific frequency-based behavioral patterns.
  • FEARec (Du et al. 2023b): Integrates frequency domain information into attention computation and combines insights from both time and frequency domains.
  • SLIME4Rec (Du et al. 2023a): Employs a frequency ramp structure to consider different frequency bands across layers, combined with dynamic and static selection modules and contrastive learning.
  • BSARec (Shin et al. 2024): A very recent and strong baseline that enhances self-attention with an attentive inductive bias from the frequency domain and uses a frequency recalibrator to alleviate over-smoothing.

5.4. Implementation Details

The WEARec model was implemented in PyTorch. For baseline models, the authors refer to optimal hyper-parameter setups from original papers or report results from consistent reimplementations.

  • Embedding Size ($d$): Both the dimension of the feed-forward network and the item embedding size are set to 64.

  • Number of WEARec Blocks ($L$): Set to 2.

  • Maximum Sequence Length ($N$): Set to 50 for the primary experiments.

  • Batch Size: 256.

  • Optimizer: Adam.

  • Learning Rate: Chosen from $\{0.0005, 0.001\}$.

  • Wavelet Decomposition Level: Set to 1.

  • Hyperparameter $\alpha$: Explored over $\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$.

  • Number of Filters $k$ (Heads): Chosen from $\{1, 2, 4, 8\}$.

  • Dropout Rate: 0.5 for the Amazon datasets and LastFM, 0.1 for MovieLens-1M due to its lower sparsity.

    The following are the results from Table 4 of the original paper:

    Specs. LastFM ML-1M Beauty Sports
    $\alpha$ 0.3 0.3 0.2 0.3
    k 2 2 8 4
    lr 0.001 0.0005 0.0005 0.001

The table above shows the optimal hyperparameter settings for WEARec on each dataset for reproducibility.

6. Results & Analysis

This section analyzes the experimental results, comparing WEARec against baselines, evaluating its performance in long-sequence scenarios, examining its complexity, and conducting ablation studies and hyper-parameter analysis.

6.1. Core Results Analysis (RQ1)

The primary experimental results comparing WEARec with various baseline models across four datasets (Beauty, Sports, LastFM, ML-1M) are presented in Table 1. Performance is measured using HR@10, HR@20, NDCG@10, and NDCG@20.

The following are the results from Table 1 of the original paper:

Datasets Metric Caser GRU4Rec SASRec DuoRec FMLPRec FamouSRec FEARec SLIME4Rec BSARec WEARec Improv.
Beauty HR@10 0.0225 0.0304 0.0531 0.0965 0.0559 0.0838 0.0982 0.1006 0.1008 0.1041 3.27%
HR@20 0.0403 0.0527 0.0823 0.1313 0.0869 0.1146 0.1352 0.1381 0.1373 0.1391 1.31%
NG@10 0.0108 0.0147 0.0283 0.0584 0.0291 0.0497 0.0601 0.0601 0.0611 0.0614 0.49%
NG@20 0.0153 0.0203 0.0356 0.0671 0.0369 0.0575 0.0694 0.0696 0.0703 0.0703 0.00%
Sports HR@10 0.0163 0.0187 0.0298 0.0569 0.0336 0.0424 0.0589 0.0611 0.0612 0.0631 3.10%
HR@20 0.0260 0.0303 0.0459 0.0791 0.0525 0.0632 0.0836 0.0869 0.0858 0.0895 2.99%
NG@10 0.0080 0.0101 0.0159 0.0331 0.0183 0.0244 0.0343 0.0357 0.0360 0.0367 1.94%
NG@20 0.0104 0.0131 0.0200 0.0387 0.0231 0.0297 0.0405 0.0421 0.0422 0.0433 2.60%
LastFM HR@10 0.0431 0.0404 0.0633 0.0624 0.0560 0.0569 0.0587 0.0633 0.0807 0.0899 11.40%
HR@20 0.0642 0.0541 0.0927 0.0963 0.0826 0.0954 0.0826 0.0927 0.1174 0.1202 2.38%
NG@10 0.0268 0.0245 0.0355 0.0361 0.0306 0.0318 0.0354 0.0359 0.0435 0.0465 6.89%
NG@20 0.0321 0.0280 0.0429 0.0446 0.0372 0.0415 0.0414 0.0433 0.0526 0.0543 3.23%
ML-1M HR@10 0.1556 0.1657 0.2137 0.2704 0.2065 0.2639 0.2705 0.2891 0.2757 0.2952 2.10%
HR@20 0.2488 0.2664 0.3245 0.3738 0.3137 0.3717 0.3714 0.3950 0.3884 0.4031 3.29%
NG@10 0.0950 0.0828 0.1116 0.1530 0.1087 0.1455 0.1516 0.1673 0.1568 0.1696 1.37%
NG@20 0.1028 0.1081 0.1395 0.1790 0.1356 0.1727 0.1771 0.1939 0.1851 0.1968 1.49%

Observations and Conclusions:

  • Time-domain models vs. Frequency-domain models: Traditional time-domain models (Caser, GRU4Rec, SASRec) generally exhibit suboptimal performance. This is attributed to their difficulty in identifying intertwined user periodic patterns from raw temporal sequences. DuoRec, which combines contrastive learning, shows better performance than other time-domain methods, validating the effectiveness of learning robust representations.
  • Superiority of Frequency-domain models: Models leveraging the frequency domain (e.g., FMLPRec, FamouSRec, FEARec, SLIME4Rec, BSARec) generally achieve superior performance compared to time-domain baselines. FMLPRec, by using an MLP structure for frequency-domain filtering, performs comparably or even better than SASRec. Further advancements with contrastive learning (FamouSRec, FEARec, SLIME4Rec) and inductive bias (BSARec) lead to improved results among frequency-domain methods.
  • WEARec's Outperformance: WEARec consistently achieves the best performance across all four datasets and all evaluation metrics. This validates the effectiveness of combining the Dynamic Frequency-domain Filtering (DFF) module (for personalized global information) with the Wavelet Feature Enhancement (WFE) module (for enhanced local information). The improvement (Improv.) over the best baseline ranges from 0.00% (Beauty NG@20, tied with BSARec at 0.0703) up to 11.40% (LastFM HR@10), with the most noticeable gains on the LastFM dataset.
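As a reference for the numbers above, both metrics follow the standard leave-one-out protocol with one held-out item per user. The following is a minimal sketch with our own function and variable names, not the paper's evaluation code:

```python
import numpy as np

def hr_and_ndcg_at_k(ranks: np.ndarray, k: int) -> tuple[float, float]:
    """HR@k and NDCG@k with a single ground-truth item per user.

    `ranks` holds the 1-based rank of each user's held-out item in the
    model's candidate ranking.
    """
    hits = ranks <= k
    hr = hits.mean()
    # With one relevant item, DCG reduces to 1 / log2(rank + 1), and the
    # ideal DCG is 1, so NDCG is the mean of that term over hits.
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 1), 0.0).mean()
    return float(hr), float(ndcg)

# Example: three users whose targets are ranked 3rd, 15th, and 8th.
print(hr_and_ndcg_at_k(np.array([3, 15, 8]), k=10))  # HR@10 = 2/3
```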

6.2. Model in Long Sequence Scenarios (RQ2)

The study investigates WEARec's performance and computational overhead in long-sequence scenarios by varying the maximum sequence length N. Experiments were conducted on the ML-1M and LastFM datasets, which have longer average sequence lengths.

6.2.1. Model Performance

The following figure (Figure 3 from the original paper) shows the HR@20 performance comparison of WEARec with FMLPRec, SLIME4Rec, and BSARec at different sequence lengths NN on ML-1M and LastFM.

Figure 3: HR@20 performance comparison of WEARec with FMLPRec, SLIME4Rec, and BSARec at different sequence lengths N on ML-1M and LastFM.

The following are the results from Table 5 of the original paper:

| N | Method | ML-1M HR@10 | ML-1M NG@10 | ML-1M HR@20 | ML-1M NG@20 | LastFM HR@10 | LastFM NG@10 | LastFM HR@20 | LastFM NG@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N = 50 | BSARec | 0.2757 | 0.1568 | 0.3884 | 0.1851 | 0.0807 | 0.0435 | 0.1174 | 0.0526 |
| N = 50 | SLIME4Rec | 0.2894 | 0.1675 | 0.3934 | 0.1937 | 0.0633 | 0.0376 | 0.0936 | 0.0453 |
| N = 50 | Ours | 0.2952 | 0.1696 | 0.4031 | 0.1968 | 0.0899 | 0.0465 | 0.1202 | 0.0547 |
| N = 100 | BSARec | 0.3073 | 0.1815 | 0.4089 | 0.2024 | 0.0798 | 0.0455 | 0.1202 | 0.0545 |
| N = 100 | SLIME4Rec | 0.3147 | 0.1815 | 0.4126 | 0.2062 | 0.0679 | 0.0382 | 0.0991 | 0.0463 |
| N = 100 | Ours | 0.3180 | 0.1819 | 0.4175 | 0.2069 | 0.0890 | 0.0494 | 0.1266 | 0.0589 |
| N = 150 | BSARec | 0.3171 | 0.1826 | 0.4300 | 0.2111 | 0.0826 | 0.0476 | 0.1174 | 0.0564 |
| N = 150 | SLIME4Rec | 0.3166 | 0.1820 | 0.4298 | 0.2127 | 0.0688 | 0.0387 | 0.1055 | 0.0479 |
| N = 150 | Ours | 0.3215 | 0.1848 | 0.4338 | 0.2131 | 0.0927 | 0.0522 | 0.1312 | 0.0617 |
| N = 200 | BSARec | 0.3161 | 0.1837 | 0.4311 | 0.2127 | 0.0862 | 0.0476 | 0.1257 | 0.0594 |
| N = 200 | SLIME4Rec | 0.3166 | 0.1850 | 0.4343 | 0.2173 | 0.0679 | 0.0391 | 0.1064 | 0.0488 |
| N = 200 | Ours | 0.3334 | 0.1904 | 0.4421 | 0.2179 | 0.0972 | 0.0556 | 0.1477 | 0.0682 |

Observations:

  • Impact of Sequence Length: Across most models, performance generally improves with longer sequence lengths (up to N = 200), indicating that more historical information helps in representing user behavior patterns more comprehensively.
  • Baselines' Convergence and Overfitting: While baseline models show initial performance improvements in long-sequence scenarios, they tend to exhibit performance convergence or are prone to overfitting with very long sequences, suggesting limitations in handling the increased complexity or noise.
  • WEARec's Consistent Superiority: WEARec consistently outperforms the baselines across all maximum sequence length settings (N = 50, 100, 150, 200) on both ML-1M and LastFM. Furthermore, its improvement over baseline models is often more significant in long-sequence scenarios (e.g., at N = 200, WEARec shows a stronger lead than at N = 50). This suggests WEARec is particularly well-suited for scenarios with rich historical data.

6.2.2. Model Complexity and Runtime Analyses

To evaluate the overhead of WEARec, the authors assessed the number of parameters and the runtime per epoch during training, specifically at N = 200.

The following are the results from Table 6 of the original paper:

| Methods | ML-1M #params | ML-1M s/epoch | Beauty #params | Beauty s/epoch | Sports #params | Sports s/epoch | LastFM #params | LastFM s/epoch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WEARec | 426,082 | 66.46 | 981,922 | 15.12 | 1,382,306 | 26.12 | 440,802 | 5.23 |
| FMLPRec | 324,160 | 36.93 | 880,000 | 10.11 | 1,280,384 | 22.78 | 338,880 | 4.91 |
| BSARec | 331,968 | 109.26 | 887,808 | 25.87 | 1,288,192 | 50.59 | 346,688 | 10.84 |
| SLIME4Rec | 375,872 | 120.43 | 931,712 | 31.44 | 1,332,096 | 68.74 | 390,592 | 13.77 |

Observations:

  • Total Parameters: WEARec generally has a higher number of total parameters compared to FMLPRec, BSARec, and SLIME4Rec. This is primarily due to the introduction of a simple MLP for dynamic filter generation and the learnable matrix T^l in the WFE module.
  • Training Time (Runtime per Epoch): Despite having more parameters, WEARec exhibits a shorter training time (runtime per epoch) than BSARec and SLIME4Rec across all datasets. For example, on ML-1M, WEARec trains in 66.46 s/epoch, significantly faster than BSARec (109.26 s/epoch) and SLIME4Rec (120.43 s/epoch). However, WEARec is slower than FMLPRec, which has the fewest parameters and lowest runtime, as FMLPRec is a simpler, attention-free MLP model.
  • Theoretical Complexity: WEARec's computational cost is dominated by the FFT and the Haar wavelet transform, giving a time complexity of O(nd log n + nd + 3nd²), which simplifies to O(nd log n) for large n (treating d as a constant much smaller than n). This is a significant improvement over self-attention mechanisms (used in BSARec) with their O(n²) overhead. By not employing contrastive learning (used in SLIME4Rec), WEARec also avoids the additional training cost associated with it. Together these choices explain WEARec's superior efficiency in long-sequence scenarios where n is large; the two dominant transforms are sketched below.
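To make the complexity argument concrete, here is a minimal PyTorch sketch of the two transforms that dominate WEARec's cost: a global FFT-based filter in O(nd log n) and a one-level Haar transform in O(nd). The tensor shapes and function names are our own assumptions, not the authors' implementation:

```python
import torch

def frequency_filter(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Global filtering: FFT -> elementwise filter -> inverse FFT.

    x: (batch, n, d) sequence of item embeddings;
    w: (n//2 + 1, d) complex filter (a static stand-in for a learned one).
    """
    spec = torch.fft.rfft(x, dim=1)                 # O(n d log n)
    spec = spec * w                                 # O(n d)
    return torch.fft.irfft(spec, n=x.size(1), dim=1)

def haar_dwt(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """One-level Haar transform in O(n d): pairwise averages/differences.

    Assumes an even sequence length n.
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5                # low-frequency component
    detail = (even - odd) / 2 ** 0.5                # high-frequency component
    return approx, detail
```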

6.3. In-depth Model Analysis (RQ3-RQ4)

6.3.1. Ablation Study (RQ3)

An ablation study was conducted to understand the contribution of each designed module in WEARec. Variants were created by removing specific components:

  • WEARec: The full model.

  • w/o W: WEARec without the Wavelet Feature Enhancement (WFE) module.

  • w/o F: WEARec without the Dynamic Frequency-domain Filtering (DFF) module.

  • w/o M: WEARec without the multi-head projection component.

    The following figure (Figure 4 from the original paper) summarizes the HR@20 and NG@20 performance of WEARec and its variants across four datasets.

    Figure 4: HR@20 and NG@20 performance achieved by WEARec variants on four datasets.

    Observations:

  • WEARec (the full model) consistently outperforms all its variants across all four datasets and metrics. This indicates that all proposed components (WFE module, DFF module, and multi-head projection) are effective and contribute positively to the overall performance.

  • Removing either the WFE module (w/o W) or the DFF module (w/o F) leads to a notable drop in performance. This confirms the necessity of both modules: DFF for personalized global frequency patterns and WFE for enhancing local, non-stationary signals.

  • Removing the multi-head projection (w/o M) also results in a performance decrease, suggesting that decomposing the input into parallel feature subspaces with tailored per-subspace processing is beneficial (see the sketch below).
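A minimal sketch of the multi-head idea, assuming it is realized by splitting the embedding dimension into k subspaces and filtering each with its own frequency response; the exact projection in WEARec may differ, and all names here are our own:

```python
import torch

def multihead_filter(x: torch.Tensor, filters: torch.Tensor, k: int) -> torch.Tensor:
    """Filter k embedding subspaces independently, then recombine.

    x: (batch, n, d); filters: (k, n//2 + 1, d // k) complex, one per head.
    """
    b, n, d = x.shape
    heads = x.reshape(b, n, k, d // k).permute(2, 0, 1, 3)  # (k, b, n, d/k)
    spec = torch.fft.rfft(heads, dim=2)                     # per-head spectrum
    out = torch.fft.irfft(spec * filters[:, None], n=n, dim=2)
    return out.permute(1, 2, 0, 3).reshape(b, n, d)         # concat heads
```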

6.3.2. Hyper-parameter Analysis (RQ4)

The study also analyzed the sensitivity of WEARec to two key hyperparameters: k (the number of filters/heads) and α (the weighting factor between the DFF and WFE outputs).

The following figure (Figure 5 from the original paper) shows the performance of WEARec on HR@20 with varying hyperparameters (k and α).

Figure 5: Performance of WEARec on HR@20 with varying hyperparameters.

Observations:

  • Sensitivity to k (Number of Filters/Heads):
    • Performance varies with k, and an optimal value exists for each dataset: per Table 4, k = 2 is optimal on ML-1M and LastFM, while k = 8 and k = 4 work best on Beauty and Sports, respectively.
    • Neither a very small k (e.g., k = 1, which effectively disables the multi-head projection) nor a very large k (not fully explored, but implied by the trend) is ideal. An appropriate k is critical for learning user interest preferences, as it lets the model process different feature subspaces in parallel, potentially capturing different aspects of user behavior.
  • Sensitivity to α (Weighting Factor):
    • Performance also changes with α, which balances the contributions of the DFF (global features) and WFE (local features) outputs.
    • Optimal performance is typically observed around α ≈ 0.3. Given the fusion rule H^l = α ⊙ X^l + (1 − α) ⊙ Y^l, this weights the DFF output X^l by 0.3 and the WFE output Y^l by 0.7, so the wavelet-enhanced local features receive the higher emphasis. This is consistent with the paper's motivation of counteracting the blurring of non-stationary signals and short-term fluctuations; the fusion step is sketched below.
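A minimal sketch of this fusion step (function and argument names are our own; the weighting follows the formula above):

```python
import torch

def fuse(x_global: torch.Tensor, y_local: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """H^l = alpha * X^l + (1 - alpha) * Y^l, elementwise.

    With the reported optimum alpha ~= 0.3, the wavelet-enhanced local
    features y_local receive the larger weight, 1 - alpha = 0.7.
    """
    return alpha * x_global + (1.0 - alpha) * y_local
```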

6.3.3. Visualization of the Filters

The authors visualize the frequency and amplitude features learned by different types of filtering models to highlight WEARec's capabilities.

The following figure (Figure 6 from the original paper) presents the spectral responses for different types of filter models across layers in Beauty.

Figure 6: Visualization of spectral responses for different types of filter models across layers in Beauty. More in-depth model analysis appears in Appendix C (Xu et al. 2025).

Observations:

  • Static Filters' Limitation: FMLPRec and SLIME4Rec (with their static filter designs) tend to learn predominantly low-frequency components within their respective frequency bands, especially noticeable in Layer 1. This confirms the hypothesis that existing DFT-based methods often act as low-pass filters, struggling to capture a full spectrum of frequencies.

  • WEARec's Dynamic and Comprehensive Coverage: WEARec, with its dynamic frequency-domain filtering design, is capable of encompassing all frequency components. Its spectral response is more flexible and covers a wider range of frequencies, demonstrating its ability to adapt to diverse user behavioral patterns.

    Additionally, Figure 7 visualizes the learned wavelet feature enhancer T from Equation 19.

The following figure (Figure 7 from the original paper) shows the visualization of learned TT in Beauty.

Figure 7: Visualization of the learned T in Beauty.

Observations:

  • The wavelet feature enhancement module learns to identify non-stationary signals at specific time points and adaptively adjust their weights. In the first layer, multiple time points of non-stationary signals are assigned higher weights for enhancement, while potentially noisy ones receive negative weights for suppression. In the second layer, it focuses on enhancing nearest non-stationary signal points. This indicates the module's capability to capture and modify local details effectively.
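To illustrate how such a learned enhancer could act on wavelet coefficients, here is a minimal sketch assuming a one-level Haar decomposition with an elementwise weight applied to the stacked coefficients; the paper's Equation 19 may place T differently, and all names here are our own:

```python
import torch

def wavelet_enhance(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Reweight Haar coefficients with an enhancer and reconstruct.

    x: (batch, n, d) with even n; t: (n, d) weights standing in for the
    learnable T^l (in practice an nn.Parameter). Positive entries boost a
    coefficient, negative entries suppress it.
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5
    detail = (even - odd) / 2 ** 0.5
    coeffs = torch.cat([approx, detail], dim=1) * t   # adaptive reweighting
    a, dt = coeffs.chunk(2, dim=1)
    # Inverse Haar: recover even/odd positions, then interleave in time.
    even_r = (a + dt) / 2 ** 0.5
    odd_r = (a - dt) / 2 ** 0.5
    return torch.stack([even_r, odd_r], dim=2).flatten(1, 2)
```

With t set to all ones, this round-trips x exactly, which is a quick sanity check that the reconstruction is correct.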

6.3.4. Case Study

A case study further evaluates whether the dynamic filter and wavelet enhancer can better capture a wider range of frequency domain features by visualizing the number of users correctly captured by different models across frequency bands.

The following figure (Figure 8 from the original paper) shows the number of users uniquely driven by each frequency component in the Sports and Beauty datasets for different models.

Figure 8: Case Study in Sports and Beauty.

Observations:

  • FMLPRec's Limited Capture: FMLPRec captured the fewest users across multiple frequency bands. This is consistent with its static filter design, which limits its ability to adapt to diverse user behavior patterns.
  • Baselines' Improved Capture: BSARec and SLIME4Rec show better results, attributed to their use of inductive bias (for BSARec) or frequency ramp structures and contrastive learning (for SLIME4Rec) to enhance user embedding representations.
  • WEARec's Best Coverage: WEARec achieves the best performance by capturing the highest number of users across various frequency regions. This is due to the synergistic combination of its dynamic filter's global information capture capability and the wavelet enhancer's local detail extraction. It performs best in both low-frequency regions (capturing long-term patterns) and certain high-frequency regions (capturing short-term or transient patterns).

7. Conclusion & Reflections

7.1. Conclusion Summary

In this paper, the authors introduced WEARec, a novel and efficient model specifically designed for sequential recommendation tasks, particularly adept at handling long sequences and capturing diverse user behavioral patterns. The core of WEARec lies in its two vital modules: Dynamic Frequency-domain Filtering (DFF) and Wavelet Feature Enhancement (WFE). The DFF module allows for personalized adjustments of filtering operations based on individual user sequences, thereby extracting tailored global frequency distributions. Concurrently, the WFE module leverages wavelet transforms to reconstruct sequences, effectively enhancing non-stationary signals and short-term fluctuations that global Fourier Transforms might otherwise blur. Through extensive experiments on four public datasets, WEARec demonstrated superior recommendation performance and competitive computational efficiency compared to state-of-the-art baselines.

7.2. Limitations & Future Work

The paper explicitly states that WEARec "achieves better performance with lower computational costs, especially in long-sequence scenarios," and its time complexity is O(nd log n), an improvement over self-attention's O(n²). This suggests the authors view efficiency as a strength rather than a limitation.

However, a closer look at Table 6 shows that while WEARec is faster than BSARec and SLIME4Rec, it is slower and has more parameters than FMLPRec. The paper's conclusion section does not explicitly list future work or limitations.

Based on the analysis, potential limitations and future work could include:

  • Increased Parameter Count: While computationally efficient, WEARec does introduce more parameters than simpler MLP-based frequency models like FMLPRec. This could be a consideration for extremely resource-constrained environments. Future work could explore parameter-efficient designs for the MLP networks in DFF or the enhancer matrix T^l in WFE.
  • Wavelet Choice: The Haar wavelet transform is chosen for its simplicity and efficiency. However, other wavelet families (e.g., Daubechies, Symlets) offer different trade-offs in smoothness, vanishing moments, and time-frequency localization. Exploring the impact of different wavelet bases on performance and computational cost could be a direction for future research (see the sketch after this list).
  • Interpretability of Dynamic Filters: While dynamic filters offer personalization, a deeper analysis into what specific frequency bands are emphasized for which types of users could enhance interpretability and provide insights into user behavior.
  • Adaptive α and k: The parameters α and k are currently hyperparameters. Future work could investigate methods to learn them adaptively, perhaps through a meta-learning approach, allowing the model to dynamically determine the optimal balance between global and local features or the ideal number of frequency subspaces.
  • Generalization to other modalities: While the current work focuses on item sequences, the principles of dynamic frequency filtering and wavelet enhancement could be extended to other sequential data types or modalities in recommendation (e.g., user reviews as text sequences, multimedia feature sequences).
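On the wavelet-choice point above, swapping bases is cheap to prototype with PyWavelets. A minimal sketch; the toy signal and printed comparison are illustrative only:

```python
import numpy as np
import pywt  # PyWavelets

x = np.random.randn(200)          # toy stand-in for an interaction signal

# Haar: shortest filter (length 2), cheapest, but blocky reconstructions.
a_haar, d_haar = pywt.dwt(x, "haar")

# Daubechies-4: longer filter (length 8), smoother basis with more
# vanishing moments, at a modest increase in cost.
a_db4, d_db4 = pywt.dwt(x, "db4")

print(len(a_haar), len(a_db4))    # db4 pads at the borders, so its
                                  # subbands come out slightly longer
```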

7.3. Personal Insights & Critique

This paper presents a compelling and well-motivated advancement in sequential recommendation by effectively addressing two critical shortcomings of existing frequency-domain methods.

Inspirations and Applications:

  • Hybrid Approach Strength: The core idea of combining global frequency analysis (via DFT-based adaptive filtering) with local time-frequency analysis (via DWT) is highly insightful. Many real-world signals, including user behavior, exhibit both long-term periodicities and short-term transient events. A hybrid approach like WEARec is better equipped to capture this complexity than methods relying solely on one domain or scale. This hybrid thinking could be transferred to other domains dealing with complex time series data, such as time series forecasting (e.g., energy consumption, stock prices) or even natural language processing for analyzing sentiment shifts or topic changes within longer documents.
  • Personalization Beyond Features: The Dynamic Frequency-domain Filtering module's ability to create user-specific filters based on context is a powerful concept. This goes beyond simply learning personalized embeddings; it is about learning personalized processing logic (a minimal sketch follows this list). This idea could be applied to other areas where personalized model architectures or adaptive processing pipelines are beneficial, such as personalized medicine or adaptive control systems.
  • Efficiency in Long Sequences: The focus on long-sequence scenarios and the architectural choice to maintain log-linear complexity through FFT and DWT is highly practical. As user interaction histories grow, quadratic complexity models become intractable. WEARec offers a blueprint for how to build high-performing yet efficient sequence models.
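As a concrete illustration of "personalized processing logic", below is a minimal sketch of a context-conditioned frequency filter in the spirit of the DFF module. The module name, the mean-pooled context, and the scale/bias parameterization are our assumptions, not the authors' design:

```python
import torch
import torch.nn as nn

class DynamicFreqFilter(nn.Module):
    """Sketch: an MLP maps a sequence summary to per-frequency scale and
    bias terms that modulate a shared base filter (names are hypothetical)."""

    def __init__(self, n: int, d: int, hidden: int = 64):
        super().__init__()
        self.freq_bins = n // 2 + 1
        # Shared complex base filter over (frequency bins, embedding dims).
        self.base = nn.Parameter(
            torch.randn(self.freq_bins, d, dtype=torch.cfloat) * 0.02
        )
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.GELU(),
            nn.Linear(hidden, 2 * self.freq_bins),   # scale and bias per bin
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, d)
        ctx = x.mean(dim=1)                               # simple sequence summary
        scale, bias = self.mlp(ctx).chunk(2, dim=-1)      # (batch, freq_bins) each
        # Personalize the base filter per sequence, then filter in frequency.
        w = self.base * scale.unsqueeze(-1) + bias.unsqueeze(-1)
        spec = torch.fft.rfft(x, dim=1)
        return torch.fft.irfft(spec * w, n=x.size(1), dim=1)
```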

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Haar Wavelet Simplicity: While efficient, the Haar wavelet is the simplest orthogonal wavelet. Its blocky nature might not be ideal for all types of smooth signals. The paper assumes its efficiency outweighs any potential loss in representation capability compared to smoother wavelets. This assumption could be further tested.

  • "All Frequency Components" Claim: Figure 6 suggests WEARec "encompasses all frequency components" compared to baselines. While its response is broader and more flexible, the visualization doesn't show a perfectly flat response across all frequencies, indicating some implicit filtering still occurs. The "adaptive" nature is key here, meaning it can choose which components to emphasize, not necessarily equally treat all.

  • Interpretability of MLP for Dynamics: The MLPs generating scaling factors and bias terms for the dynamic filters are powerful but somewhat opaque. While they enable personalization, understanding how these MLPs interpret the context c^l to modify frequency responses could be challenging. Further analysis or more interpretable adaptive mechanisms might be valuable.

  • Hyperparameter α Interpretation: The optimal α ≈ 0.3 implies that WFE's locally enhanced features receive a higher weighting (0.7) than DFF's globally filtered features (0.3). This suggests that WFE plays a more dominant role in WEARec's success than the name Wavelet Enhanced Adaptive Frequency Filter might initially imply, underscoring the importance of capturing those subtle, non-stationary local details. This finding reinforces the paper's second motivation point.

    Overall, WEARec is a strong contribution that moves beyond monolithic frequency-domain processing, demonstrating the power of tailored, multi-resolution signal processing techniques for understanding complex human behavior in recommender systems.
