Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation
TL;DR Summary
This paper introduces the Wavelet Enhanced Adaptive Frequency Filter (WEARec) for sequential recommendation, enhancing traditional frequency-domain methods with dynamic filtering and wavelet feature enhancement, effectively improving both performance and efficiency in capturing users' complex periodic preferences, particularly in long-sequence scenarios.
Abstract
Sequential recommendation has garnered significant attention for its ability to capture dynamic preferences by mining users' historical interaction data. Given that users' complex and intertwined periodic preferences are difficult to disentangle in the time domain, recent research is exploring frequency domain analysis to identify these hidden patterns. However, current frequency-domain-based methods suffer from two key limitations: (i) They primarily employ static filters with fixed characteristics, overlooking the personalized nature of behavioral patterns; (ii) While the global discrete Fourier transform excels at modeling long-range dependencies, it can blur non-stationary signals and short-term fluctuations. To overcome these limitations, we propose a novel method called Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation. Specifically, it consists of two vital modules: dynamic frequency-domain filtering and wavelet feature enhancement. The former is used to dynamically adjust filtering operations based on behavioral sequences to extract personalized global information, and the latter integrates wavelet transform to reconstruct sequences, enhancing blurred non-stationary signals and short-term fluctuations. Finally, these two modules work to achieve comprehensive performance and efficiency optimization in long sequential recommendation scenarios. Extensive experiments on four widely-used benchmark datasets demonstrate the superiority of our work.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is a novel approach to sequential recommendation that enhances traditional frequency-domain filtering with adaptive and wavelet-based mechanisms. The title is Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation.
1.2. Authors
The authors are Huayang Xu, Huanhuan Yuan, Guanfeng Liu, Junhua Fang, Lei Zhao, and Pengpeng Zhao. Their affiliations are primarily with the School of Computer Science and Technology, Soochow University, China, with Guanfeng Liu also affiliated with Macquarie University. The first two authors, Huayang Xu and Huanhuan Yuan, are marked with an asterisk, indicating they contributed equally or are co-first authors.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. The publication date (UTC) is 2025-11-10T00:00:00.000Z, suggesting it might be submitted to or accepted by a conference or journal in late 2025 or 2026. arXiv is a reputable open-access archive for preprints of scientific papers in various disciplines, including computer science, allowing early dissemination of research.
1.4. Publication Year
The publication year is 2025, based on the arXiv preprint date.
1.5. Abstract
Sequential recommendation (SR) aims to capture dynamic user preferences from historical interaction data. Recent research has explored frequency-domain analysis to identify hidden periodic patterns that are difficult to disentangle in the time domain. However, existing frequency-domain methods have two main limitations: (i) they use static filters that ignore the personalized nature of user behavior, and (ii) the global discrete Fourier transform (DFT), while good for long-range dependencies, can blur non-stationary signals and short-term fluctuations.
To address these issues, the paper proposes Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation (WEARec). This novel method comprises two key modules:
- Dynamic Frequency-domain Filtering (DFF): This module dynamically adjusts filtering operations based on individual behavioral sequences to extract personalized global information.
- Wavelet Feature Enhancement (WFE): This module integrates the wavelet transform to reconstruct sequences, specifically enhancing blurred non-stationary signals and short-term fluctuations.

These two modules work synergistically to improve performance and efficiency, particularly in long sequential recommendation scenarios. Extensive experiments conducted on four benchmark datasets demonstrate the superiority of WEARec.
1.6. Original Source Link
The official source link is https://arxiv.org/abs/2511.07028v2, and the PDF link is https://arxiv.org/pdf/2511.07028.pdf. It is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is accurately capturing users' dynamic and often complex periodic preferences in sequential recommendation systems. In real-world e-commerce and other applications, user interests are not static but evolve over time, making sequential recommendation a crucial task.
This problem is important because accurately predicting the next item a user will interact with is fundamental to personalized services, driving user engagement and satisfaction. Traditional methods often analyze user interaction sequences directly in the time domain. However, the authors argue that users' complex and intertwined periodic preferences are difficult to disentangle from noisy, chronologically entangled raw sequences in this domain.
Prior research has begun exploring frequency-domain analysis to identify these hidden periodic patterns by decomposing user sequences into different frequency components (e.g., high-frequency for short-term trends, low-frequency for long-term habits) using tools like the Fourier Transform. While promising, existing frequency-domain methods suffer from two key challenges or gaps:
- Static Filters: Most current frequency-domain methods apply uniform, static filters to all user sequences. This approach overlooks the personalized nature of behavioral patterns, as different users may be driven by different frequency components (e.g., some by long-term preferences, others by short-term fluctuations). A single, fixed filter cannot adequately capture this diversity.
- Global DFT Limitations: The Discrete Fourier Transform (DFT), while excellent at modeling long-range dependencies and extracting global frequency information, tends to blur non-stationary signals and short-term fluctuations. It acts essentially as a low-pass filter, potentially missing fine-grained, transient patterns that are crucial for accurate next-item prediction.

The paper's entry point and innovative idea is to address these two limitations by introducing adaptive and wavelet-enhanced mechanisms into frequency-domain filtering.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Model WEARec: Proposing the Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation (WEARec), a new model designed to efficiently handle long sequences and capture diverse user behavioral patterns by fusing personalized global information with enhanced local information.
- Dynamic Frequency-domain Filtering (DFF) Module: Introducing a module that dynamically adjusts filtering operations based on individual user behavioral sequences. This addresses the limitation of static filters by enabling personalized extraction of global frequency-domain information, tailored to each user's unique preferences.
- Wavelet Feature Enhancement (WFE) Module: Developing a module that integrates the wavelet transform to reconstruct sequences. This specifically targets the blurring issue of global DFT by amplifying obscured non-stationary signals and short-term fluctuations, thus capturing fine-grained local temporal patterns.
- Synergistic Integration: Demonstrating how these two modules (DFF for personalized global context and WFE for enhanced local details) work together to achieve comprehensive performance and efficiency optimization, particularly in long sequential recommendation scenarios.
- Empirical Validation: Conducting extensive experiments on four widely-used benchmark datasets, which consistently show that WEARec achieves superior recommendation performance compared to state-of-the-art baselines. The paper also highlights WEARec's lower computational overhead, especially in long-sequence settings.

The key conclusion is that WEARec effectively overcomes the limitations of previous frequency-domain methods by providing personalized filtering and enhancing local, non-stationary signals. This leads to more accurate and efficient sequential recommendations, particularly when dealing with longer user interaction histories.
3. Prerequisite Knowledge & Related Work
This section aims to provide readers with the foundational knowledge and context necessary to understand the WEARec model, its motivations, and its advancements over prior art.
3.1. Foundational Concepts
3.1.1. Sequential Recommendation (SR)
Sequential Recommendation (SR) is a subfield of recommender systems that focuses on predicting the next item a user will interact with based on their historical sequence of interactions. Unlike traditional recommender systems that might only consider static preferences or item similarity, SR models acknowledge that user interests are dynamic and evolve over time. The goal is to capture these dynamic interest shifts and sequential dependencies. For example, if a user buys a camera, they might next be interested in camera lenses, then tripods, and so on.
3.1.2. Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)
The Discrete Fourier Transform (DFT) is a mathematical technique used to convert a finite sequence of discrete-time samples into a finite sequence of samples of frequencies present in the signal. In simpler terms, it decomposes a signal (like a user's interaction sequence) into its constituent frequencies, revealing which periodic patterns are strong and at what frequencies they occur. DFT analyzes global signal components, meaning it considers the entire sequence to find periodic patterns.
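As a quick, self-contained illustration (not from the paper), the snippet below builds a signal with a period-7 rhythm, shows that the DFT concentrates energy at the matching frequency bin, and recovers the signal with the inverse transform:

```python
import numpy as np

# A signal of length 28 with a period of 7 plus noise (e.g., a weekly habit).
n = 28
t = np.arange(n)
signal = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.default_rng(0).standard_normal(n)

spectrum = np.fft.fft(signal)                 # X_k, complex frequency components
dominant = np.argmax(np.abs(spectrum[1:n // 2])) + 1
print(dominant, n / dominant)                 # bin 4 -> period n / k = 7 samples

reconstructed = np.fft.ifft(spectrum).real    # IDFT recovers the original signal
assert np.allclose(reconstructed, signal)
```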
- Mathematical Formula for DFT: Given a discrete sequence $\{x_m\}$ of length $n$, its DFT is defined as:
  $ X_k = \sum_{m=0}^{n-1} x_m e^{-2\pi i m k / n}, \quad 0 \leq k \leq n-1 $
  - Symbol Explanation:
    - $x_m$: The $m$-th sample of the input discrete signal in the time domain.
    - $n$: The total number of samples (length of the sequence).
    - $X_k$: The $k$-th frequency component (a complex value) in the frequency domain.
    - $i$: The imaginary unit, where $i^2 = -1$.
    - $e$: Euler's number, the base of the natural logarithm.
    - $k$: The frequency index, ranging from 0 to n-1.
- Inverse Discrete Fourier Transform (IDFT): The IDFT converts the frequency-domain representation back into the time domain:
  $ x_m = \frac{1}{n} \sum_{k=0}^{n-1} X_k e^{2\pi i m k / n} $
  - Symbol Explanation:
    - $x_m$: The $m$-th sample of the reconstructed signal in the time domain.
    - $n$: The total number of samples.
    - $X_k$: The $k$-th frequency component.
    - $i$: The imaginary unit.
    - $e$: Euler's number.
    - $k$: The frequency index.
- Fast Fourier Transform (FFT) and Inverse FFT (IFFT): The FFT is an efficient algorithm to compute the DFT and its inverse (IFFT). While the DFT defines the transformation, the FFT is the practical computational method, reducing the computational complexity from $O(n^2)$ to $O(n \log n)$.
- Parseval's Theorem (from Appendix A.1): This theorem states that the total energy of a signal is conserved during the DFT process. It ensures that adaptive filtering in the frequency domain won't accidentally distort the intrinsic information of the input signal.
  $ \sum_{m=0}^{n-1} |x[m]|^2 = \frac{1}{n} \sum_{k=0}^{n-1} |X[k]|^2 $
  - Symbol Explanation:
    - x[m]: The $m$-th sample of the time-domain signal.
    - X[k]: The $k$-th frequency component of the signal.
    - $|\cdot|^2$: The squared magnitude (energy) of a complex number.
    - $n$: The length of the sequence.
- Convolution Theorem (from Appendix A.1): This theorem is crucial for frequency-domain filtering. It states that convolution in the time domain is equivalent to element-wise multiplication in the frequency domain, which directly enables the design of filters that capture specific frequency components. Given an input sequence $x[m]$ and convolution parameters $h[m]$, the convolution can be transformed into:
  $ h[m] * x[m] = \mathcal{F}^{-1}(\mathcal{F}(h[m]) \odot X[k]) $
  This can be simplified to:
  $ \mathcal{F}(h * x) = \mathbf{W} \odot \mathbf{X} $
  - Symbol Explanation:
    - h[m]: The $m$-th sample of the filter (kernel) in the time domain.
    - x[m]: The $m$-th sample of the input signal in the time domain.
    - * (asterisk): The circular convolution operator.
    - $\mathcal{F}$: The Fourier Transform operator.
    - $\mathcal{F}^{-1}$: The Inverse Fourier Transform operator.
    - $\mathbf{W}$: The learnable complex-valued filtering matrix (the Fourier transform of the filter h[m]).
    - $\mathbf{X}$: The frequency-domain representation of the input signal.
    - $\odot$: The Hadamard product (element-wise multiplication).
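A short numerical check of the convolution theorem (illustrative, not from the paper): circular convolution in the time domain matches element-wise multiplication in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)          # input signal
h = rng.standard_normal(n)          # filter kernel

# Circular convolution computed directly in the time domain.
time_domain = np.array([sum(h[k] * x[(m - k) % n] for k in range(n)) for m in range(n)])

# Equivalent computation in the frequency domain: F^{-1}(F(h) ⊙ F(x)).
freq_domain = np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)).real

assert np.allclose(time_domain, freq_domain)   # both paths agree
```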
3.1.3. Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT)
The Discrete Wavelet Transform (DWT) is a signal processing tool that provides a multiresolution analysis of signals. Unlike DFT, which provides only frequency information, DWT offers both time and frequency localization. This means it can identify not only which frequencies are present but also when they occur in the signal. This property makes DWT particularly effective for analyzing non-stationary signals (signals whose statistical properties change over time) and capturing transient components or short-term fluctuations.
DWT decomposes a signal into approximation coefficients (low-frequency components, representing the overall trend) and detail coefficients (high-frequency components, representing fine-grained details or noise). This is typically done through hierarchical decomposition using low-pass and high-pass filters. The Haar wavelet transform is a simple and computationally efficient type of DWT.
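As a small numerical illustration (not from the paper), a 1-level Haar DWT and its inverse can be written in a few lines; note the perfect-reconstruction property that the WFE module later relies on:

```python
import numpy as np

# Minimal 1-level Haar DWT/IDWT sketch.
# Haar analysis filters: L = [1/sqrt(2), 1/sqrt(2)], H = [1/sqrt(2), -1/sqrt(2)].
def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation (low-frequency trend)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail (high-frequency fluctuations)
    return a, d

def haar_idwt(a, d):
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt(x)
assert np.allclose(haar_idwt(a, d), x)       # perfect reconstruction
```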
- Mathematical Formulas for DWT (from Preliminaries): The $(j+1)$-th level decomposition is defined as:
  $ A_{j+1}[m] = \sum_{k=0}^{K-1} L[k] A_j[2m-k] $
  $ D_{j+1}[m] = \sum_{k=0}^{K-1} H[k] A_j[2m-k] $
  - Symbol Explanation:
    - $A_j$: Approximation coefficients at decomposition level $j$. When $j = 0$, $A_0$ is the original signal. These capture low-frequency components after low-pass filtering.
    - $D_j$: Detail coefficients at decomposition level $j$. These capture high-frequency components after high-pass filtering.
    - L[k]: The low-pass filter coefficients.
    - H[k]: The high-pass filter coefficients.
    - $K$: The length of the filter.
    - 2m-k: Represents downsampling with a stride of 2, which halves the length of the output signal at each decomposition level.
- Inverse Discrete Wavelet Transform (IDWT): The IDWT reconstructs the original signal from its approximation and detail coefficients stage-by-stage via iterative upsampling and filtering operations:
  $ A_j[m] = \sum_{k=0}^{K-1} \tilde{L}[k] A_{j+1}[2m-k] + \sum_{k=0}^{K-1} \tilde{H}[k] D_{j+1}[2m-k] $
  - Symbol Explanation:
    - $A_j[m]$: The reconstructed approximation coefficients at level $j$.
    - $A_{j+1}$: Approximation coefficients from the next higher decomposition level.
    - $D_{j+1}$: Detail coefficients from the next higher decomposition level.
    - $\tilde{L}[k]$: Reconstruction low-pass filter coefficients.
    - $\tilde{H}[k]$: Reconstruction high-pass filter coefficients.
3.1.4. Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a class of feedforward artificial neural networks. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) in one layer connects with a certain weight to every node in the next layer. MLPs are capable of learning non-linear relationships and are used for various tasks, including classification, regression, and, as in this paper, generating adaptive parameters.
3.1.5. Self-Attention and Transformer
Self-attention is a mechanism that allows a model to weigh the importance of different words/items in an input sequence when encoding a particular word/item. It has been a cornerstone of the Transformer architecture, which revolutionized Natural Language Processing (NLP) and Computer Vision (CV). In sequential recommendation, Self-attention (e.g., in SASRec) is used to model item-item relationships and capture long-range dependencies in user sequences. However, the paper points out that items in user interactions are often chronologically entangled and inherently noisy, making it challenging for models to directly discern changes in behavioral preferences from raw sequences within the time domain using self-attention.
3.1.6. Cross-Entropy Loss
Cross-entropy loss is a common loss function used in classification tasks, particularly when the output is a probability distribution. It measures the difference between two probability distributions: the true distribution (ground truth) and the predicted distribution from the model. A lower cross-entropy loss indicates that the model's predictions are closer to the true labels.
3.1.7. Layer Normalization, Dropout, and Skip Connection
These are common techniques used in deep learning models to improve stability, prevent overfitting, and facilitate training of deeper networks:
- Layer Normalization: Normalizes the inputs across the features for each sample independently. This helps stabilize the hidden state activations, making training faster and more stable.
- Dropout: A regularization technique where randomly selected neurons are ignored (dropped out) during training. This prevents neurons from co-adapting too much and reduces overfitting.
- Skip Connection (Residual Connection): Directly adds the input of a layer to its output, allowing gradients to flow more easily through the network. This helps mitigate the vanishing gradient problem in deeper networks and allows the model to learn residual functions.
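These three techniques are combined repeatedly in the model as the pattern LayerNorm(x + Dropout(sublayer(x))); a minimal PyTorch sketch with illustrative names:

```python
import torch
import torch.nn as nn

dim = 64
norm, drop = nn.LayerNorm(dim), nn.Dropout(p=0.5)
sublayer = nn.Linear(dim, dim)          # stand-in for any sub-module (DFF, WFE, FFN, ...)

x = torch.randn(8, 50, dim)             # (batch, sequence length, embedding dim)
out = norm(x + drop(sublayer(x)))       # residual output: LayerNorm(x + Dropout(sublayer(x)))
```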
3.2. Previous Works
The paper categorizes related work into Time-domain SR models and Frequency-domain SR models, highlighting the progression of research in sequential recommendation.
3.2.1. Time-domain SR Models
These models process user interaction sequences directly in their chronological order.
- Early Markov Chain Models: Focused on immediate transitions between items (e.g., Factorizing Personalized Markov Chains (FPMC) by Rendle et al., 2010).
- Deep Learning Models:
  - GRU4Rec (Hidasi et al., 2015): The first to apply Gated Recurrent Unit (GRU) networks, a type of Recurrent Neural Network (RNN), to model user behavior sequences. GRUs are designed to capture dependencies in sequential data.
  - Caser (Tang and Wang, 2018): Utilizes Convolutional Neural Networks (CNNs) with horizontal and vertical filters to capture local sequential patterns.
  - SASRec (Kang and McAuley, 2018): A prominent model leveraging the self-attention mechanism to capture item-item relationships within a sequence. It treats the sequence as a set of items and learns their importance relative to each other.
- Contrastive Learning Enhanced Models: More recent works like CL4SRec (Xie et al., 2022) and DuoRec (Qiu et al., 2022) enhance sequential embedding representations by using contrastive learning. This involves creating augmented versions of user sequences and training the model to pull representations of similar sequences closer and push dissimilar ones farther apart. DuoRec specifically uses unsupervised model-level augmentation and supervised semantic positive samples.

The general limitation of these time-domain models, as highlighted by the authors, is their struggle to effectively capture users' underlying periodic behavioral patterns due to the chronologically entangled and noisy nature of raw sequences.
3.2.2. Frequency-domain SR Models
These models transform user sequences into the frequency domain to identify periodic patterns that are less apparent in the time domain.
- FMLPRec (Zhou et al., 2022): A pioneering work that processes sequential data in the frequency domain. It replaces the self-attention mechanism with learnable filters implemented via an MLP structure to capture periodic patterns and attenuate noise.
- SLIME4Rec (Du et al., 2023a) and FEARec (Du et al., 2023b): These models advance the frequency-domain approach by proposing a layered frequency ramp structure that considers different frequency bands for each layer. SLIME4Rec further integrates contrastive learning with dynamic and static selection modules. FEARec utilizes frequency-domain information in attention computation and integrates both time and frequency domains.
- BSARec (Shin et al., 2024): A more recent approach that seeks to uncover fine-grained sequential patterns and inject them as inductive biases into the self-attention mechanism. It makes the influence on the high-frequency region learnable and alleviates over-smoothing through a frequency recalibrator.
- FamouSRec (Zhang et al., 2025): Employs a Mixture-of-Experts (MoE) architecture, allowing specialized expert models to be selected based on users' specific frequency-based behavioral patterns.

The paper identifies two key limitations in these existing frequency-domain models:
- Lack of User-Specific Adaptivity: They primarily employ static or fixed-pattern filters, processing all users uniformly and ignoring the personalized nature of behavioral patterns.
- Global DFT Blurring: While the DFT excels at long-range dependencies, it struggles to capture local temporal features of high-frequency interactions and short-term points of interest, acting essentially as a low-pass filter (e.g., FMLPRec tends to learn low-frequency components).
3.3. Technological Evolution
The evolution of sequential recommendation has seen several paradigm shifts:
- Early Statistical Models (e.g., Markov Chains): Focused on direct transitions between items, often limited to modeling short-term dependencies.
- Traditional Machine Learning: Employed features engineered from sequences (e.g., matrix factorization with temporal dynamics).
- Deep Learning (RNNs, CNNs): GRU4Rec and Caser brought the power of deep learning to capture more complex sequential patterns. RNNs like the GRU are naturally suited for sequences, while CNNs can detect local patterns.
- Attention Mechanisms (Transformers): SASRec and BERT4Rec (Sun et al., 2019) leveraged the Transformer architecture with self-attention, enabling the capture of arbitrary long-range dependencies across a sequence and becoming the dominant paradigm.
- Contrastive Learning: Integrated with deep learning models to improve representation quality by learning robust embeddings.
- Frequency-Domain Analysis: A more recent shift, pioneered by FMLPRec, moving beyond the time domain to exploit periodic patterns using the Fourier Transform. This aims to disentangle complex user preferences that are hard to see in raw chronological data.
- Adaptive and Multi-Resolution Frequency Analysis (This Paper): WEARec represents the current frontier, moving beyond static frequency filters and global DFT limitations by introducing dynamic, personalized filtering and wavelet-based enhancement of local, non-stationary signals.
3.4. Differentiation Analysis
Compared to the main methods in related work, WEARec introduces several core differences and innovations:
- Against Time-domain Models (e.g., SASRec, DuoRec): WEARec operates primarily in the frequency domain, directly addressing the challenge of chronologically entangled and noisy sequences that make periodic pattern identification difficult for time-domain models. While time-domain models focus on item-item relationships, WEARec decomposes these into fundamental frequencies.
- Against Existing Frequency-domain Models (e.g., FMLPRec, SLIME4Rec, BSARec, FamouSRec):
  - Personalization through Dynamic Filters: A primary innovation of WEARec is its Dynamic Frequency-domain Filtering (DFF) module. Unlike FMLPRec's static MLP filters or other models that apply uniform filtering, DFF uses a Multi-Layer Perceptron (MLP) to generate adaptive scaling factors and bias terms based on the context signal of each user's sequence. This allows the filter to dynamically adjust its characteristics to match the personalized behavioral patterns of individual users (e.g., emphasizing low frequencies for long-term preferences or higher frequencies for short-term trends), a limitation explicitly identified in previous frequency-domain approaches.
  - Enhanced Local Information with Wavelets: The Wavelet Feature Enhancement (WFE) module is another key differentiator. Existing DFT-based methods (like FMLPRec) struggle with non-stationary signals and short-term fluctuations because the DFT provides a global view, effectively acting as a low-pass filter. WFE addresses this by leveraging the Discrete Wavelet Transform (DWT) (specifically the Haar wavelet), which provides time-frequency localization. This allows WEARec to capture fine-grained temporal patterns and explicitly enhance obscured high-frequency details that global DFT might blur, offering a more complete representation of user behavior.
  - Combined Strengths: WEARec uniquely combines the global pattern-capturing capability of adaptive frequency-domain filtering with the local, fine-grained detail-capturing capability of wavelets. This dual approach aims for a more comprehensive understanding of user preferences.
  - Efficiency: The paper also claims WEARec achieves better performance with lower computational costs, especially in long-sequence scenarios, by avoiding the quadratic complexity of self-attention and potentially expensive contrastive learning mechanisms while utilizing efficient FFT and Haar wavelet operations.
4. Methodology
The proposed WEARec model is designed to overcome the limitations of existing frequency-domain sequential recommendation methods by dynamically adjusting filters for personalized global information and enhancing non-stationary signals with wavelet transforms. The overall framework is depicted in Figure 2 of the original paper.
4.1. Problem Statement
The goal of Sequential Recommendation (SR) is to predict the next item a user will interact with, given their past interactions.
Let $\mathcal{U}$ denote the set of users and $\mathcal{V}$ the set of items. Each user $u$ has a chronological sequence of item interactions $S^u = [v_1^u, v_2^u, \ldots, v_{|S^u|}^u]$, where $v_t^u$ is the item user $u$ interacted with at step $t$, and $|S^u|$ is the sequence length.
For model input, sequences are typically truncated or padded to a maximum length $N$, resulting in $S^u = [v_1^u, v_2^u, \ldots, v_N^u]$. The SR task is to use this sequence to predict the top-$K$ items the user is most likely to interact with at time step $N+1$.
4.2. Embedding Layer
Given a user behavior sequence $S^u$, the first step is to convert the categorical item IDs into dense vector representations called embeddings.
An item embedding matrix $\mathbf{M} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is used, where $d$ is the embedding size and $\mathbf{M}_v$ is the embedding for item $v$.
To incorporate the sequential order, positional embeddings $\mathbf{P}$ are added to the item embeddings.
Finally, Layer Normalization and Dropout operations are applied to stabilize training and prevent overfitting. This process generates the sequence representation as follows:
$ \mathbf{E}^u = \mathrm{Dropout}(\mathrm{LayerNorm}(\mathbf{E}^u + \mathbf{P})) $
- Symbol Explanation:
  - $\mathbf{E}^u \in \mathbb{R}^{N \times d}$: The embedded sequence representation for user $u$, where $N$ is the maximum sequence length and $d$ is the embedding dimension. Initially, it refers to the raw item embeddings.
  - $\mathbf{P}$: The positional embedding matrix, which encodes the position of each item in the sequence.
  - $\mathrm{LayerNorm}$: The Layer Normalization function, which normalizes the activations across each feature for each input sample.
  - $\mathrm{Dropout}$: The Dropout function, which randomly sets a fraction of input units to zero during training to prevent overfitting.
  - +: Element-wise addition of the item embeddings and positional embeddings.
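A minimal PyTorch sketch of this embedding layer, under assumed shapes and a padding index (illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class SequenceEmbedding(nn.Module):
    def __init__(self, num_items: int, max_len: int, dim: int, dropout: float = 0.5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=0)  # item embedding matrix M
        self.pos_emb = nn.Embedding(max_len, dim)                        # positional embedding P
        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, item_ids):                  # item_ids: (batch, max_len) item indices
        positions = torch.arange(item_ids.size(1), device=item_ids.device)
        e = self.item_emb(item_ids) + self.pos_emb(positions)            # E^u + P
        return self.dropout(self.norm(e))         # Dropout(LayerNorm(E^u + P))
```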
4.3. Dynamic Frequency-domain Filtering (DFF)
The Dynamic Frequency-domain Filtering (DFF) module is designed to dynamically adjust filtering operations based on behavioral sequences, extracting personalized global information.
4.3.1. Multi-Head Projection
Inspired by the multi-head attention mechanism, the input item embedding $\mathbf{E}^u$ (or the output of the previous layer $\mathbf{H}^l$) is first projected into $k$ parallel feature subspaces. This allows different parts of the embedding dimension to be processed independently, each potentially with its own adaptive filter.
$ \mathbf{H}^0 = \mathbf{E}^u $
- Symbol Explanation:
  - $\mathbf{H}^0$: The initial input to the first layer, which is the output from the embedding layer.
  - $\mathbf{H}^l$: The time-domain feature representation at the $l$-th layer.
  - $\mathbf{B}_i^l$: The $i$-th subspace created by decomposing $\mathbf{H}^l$ along the embedding dimension, where $k$ is the number of heads.

For each subspace $\mathbf{B}_i^l$, a Fast Fourier Transform (FFT) is performed along the item dimension to convert it into the frequency domain:
$ \mathbf{F}_i^l = \mathcal{F}(\mathbf{B}_i^l) $
- Symbol Explanation:
  - $\mathbf{B}_i^l$: The $i$-th time-domain subspace feature of the $l$-th layer.
  - $\mathcal{F}$: The 1D FFT (Fast Fourier Transform) operator.
  - $\mathbf{F}_i^l$: The $i$-th frequency-domain subspace feature of the $l$-th layer. It is a complex tensor of length $M$ along the item dimension.
  - $M$: The length of the frequency-domain representation, calculated as $M = \lceil N / 2 \rceil + 1$, where $N$ is the sequence length.
4.3.2. Context Signal Extraction
To enable dynamic adaptation, the model extracts an overall representation of the user's historical interaction sequence. This is done by performing a mean operation on the input features in the time domain along the item dimension:
$ \mathbf{c}^l = \frac{1}{N} \sum_{i=1}^{N} \mathbf{H}_i^l $
- Symbol Explanation:
  - $\mathbf{c}^l$: The overall representation of the user's historical interaction sequence at the $l$-th layer, representing the user's context.
  - $\mathbf{H}_i^l$: The $i$-th row of $\mathbf{H}^l$, corresponding to the embedding of an item in the sequence.
  - $N$: The sequence length.
  - $\sum_{i=1}^{N}$: Summation over all items in the sequence.
4.3.3. Adaptive Filter Generation
Two three-layer MLP networks are used to generate dynamic scaling factors and bias terms from the extracted user contextual feature $\mathbf{c}^l$. These terms will modulate the personalized frequency-domain filters.
$ \Delta\mathbf{s}^l = \mathrm{MLP}_1(\mathbf{c}^l), \qquad \Delta\mathbf{b}^l = \mathrm{MLP}_2(\mathbf{c}^l) $
- Symbol Explanation:
  - $\Delta\mathbf{s}^l$: The scaling factor for dynamically adjusting the filter at the $l$-th layer, personalized for each sequence.
  - $\Delta\mathbf{b}^l$: The bias term for dynamically adjusting the filter at the $l$-th layer, also personalized for each sequence.
  - $\mathrm{MLP}_1$, $\mathrm{MLP}_2$: Two distinct Multi-Layer Perceptron networks.
  - $\mathbf{c}^l$: The user's overall context representation.
4.3.4. Personalized Filter Modulation
The base filter weights $\mathbf{W}^l$ and bias $\mathbf{b}^l$ are modulated using the personalization-generated scaling factor $\Delta\mathbf{s}^l$ and bias term $\Delta\mathbf{b}^l$.
$ \hat{\mathbf{W}}^l = \mathbf{W}^l \odot (1 + \Delta\mathbf{s}^l) $
$ \hat{\mathbf{b}}^l = \mathbf{b}^l + \Delta\mathbf{b}^l $
- Symbol Explanation:
  - $\mathbf{W}^l$: The base (learnable) filter weights at the $l$-th layer.
  - $\mathbf{b}^l$: The base (learnable) bias at the $l$-th layer.
  - $\hat{\mathbf{W}}^l$: The linearly modulated weights of the dynamic filter at the $l$-th layer.
  - $\hat{\mathbf{b}}^l$: The linearly modulated bias of the dynamic filter at the $l$-th layer.
  - $\odot$: Hadamard product (element-wise multiplication).
  - 1: A scalar 1, added element-wise to $\Delta\mathbf{s}^l$.
4.3.5. Multiple Learnable Filters
The personalized filtered frequency-domain information is obtained by applying the modulated filter (weights $\hat{\mathbf{W}}^l$ and bias $\hat{\mathbf{b}}^l$) to each frequency-domain feature subspace $\mathbf{F}_i^l$. This is an element-wise multiplication and addition in the frequency domain, consistent with the Convolution Theorem.
$ \tilde{\mathbf{F}}_i^l = \mathbf{F}_i^l \odot \hat{\mathbf{W}}^l + \hat{\mathbf{b}}^l $
- Symbol Explanation:
  - $\tilde{\mathbf{F}}_i^l$: The filtered frequency-domain feature for the $i$-th subspace at the $l$-th layer.
  - $\mathbf{F}_i^l$: The original frequency-domain feature for the $i$-th subspace.
  - $\hat{\mathbf{W}}^l$: The modulated filter weights.
  - $\hat{\mathbf{b}}^l$: The modulated filter bias.
  - $\odot$: Hadamard product (element-wise multiplication).

Finally, the processed frequency-domain signal is mapped back to the time domain using the Inverse Discrete Fourier Transform (IDFT):
$ \mathbf{X}_i^l = \mathcal{F}^{-1}(\tilde{\mathbf{F}}_i^l) $
- Symbol Explanation:
  - $\mathbf{X}_i^l$: The time-domain feature for the $i$-th subspace after DFF.
  - $\mathcal{F}^{-1}$: The IDFT (Inverse Discrete Fourier Transform) operator.
  - $\tilde{\mathbf{F}}_i^l$: The filtered frequency-domain feature.

The outputs from all $k$ subspaces are then concatenated to form the complete time-domain representation for the current layer:
$ \mathbf{X}^l = [\mathbf{X}_1^l, \mathbf{X}_2^l, \ldots, \mathbf{X}_k^l] $
- Symbol Explanation:
  - $\mathbf{X}^l$: The complete time-domain feature after DFF at layer $l$, formed by concatenating the outputs from all subspaces.

The overall framework is visually represented in Figure 2 of the original paper, showing how the Multi-Head Projection feeds into the FFT, which then passes through the Multiple Learnable Filters (modulated by MLP-generated factors) and finally the IFFT to return to the time domain.
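To make the data flow concrete, below is a minimal PyTorch sketch of the DFF idea (multi-head split, FFT, context-conditioned modulation of a base filter, inverse FFT). All module and variable names, the MLP depths, and the parameter shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicFrequencyFilter(nn.Module):
    def __init__(self, seq_len: int, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        freq_len = seq_len // 2 + 1                     # length of torch.fft.rfft output
        head_dim = dim // num_heads
        # Base filter weights W^l and bias b^l stored as (real, imag) pairs.
        self.weight = nn.Parameter(torch.randn(freq_len, head_dim, 2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(freq_len, head_dim, 2))
        # MLPs mapping the context vector c^l to modulation terms Δs^l and Δb^l.
        self.mlp_scale = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2 * freq_len * head_dim))
        self.mlp_bias = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2 * freq_len * head_dim))

    def forward(self, h):                               # h: (batch, seq_len, dim)
        b, n, d = h.shape
        c = h.mean(dim=1)                               # context signal c^l
        freq_len, head_dim = self.weight.shape[0], d // self.num_heads
        d_scale = self.mlp_scale(c).view(b, freq_len, head_dim, 2)
        d_bias = self.mlp_bias(c).view(b, freq_len, head_dim, 2)
        w_hat = torch.view_as_complex(self.weight) * (1 + torch.view_as_complex(d_scale))
        b_hat = torch.view_as_complex(self.bias) + torch.view_as_complex(d_bias)
        # Split into heads, filter each subspace in the frequency domain, then invert.
        heads = h.view(b, n, self.num_heads, head_dim).permute(0, 2, 1, 3)   # (b, k, n, d/k)
        f = torch.fft.rfft(heads, dim=2)                                     # (b, k, freq_len, d/k)
        f_tilde = f * w_hat.unsqueeze(1) + b_hat.unsqueeze(1)                # personalized filtering
        x = torch.fft.irfft(f_tilde, n=n, dim=2)                             # back to the time domain
        return x.permute(0, 2, 1, 3).reshape(b, n, d)                        # concatenate the heads
```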
4.4. Wavelet Feature Enhancement (WFE)
The Wavelet Feature Enhancement (WFE) module is designed to capture fine-grained temporal patterns and enhance non-stationary signals and short-term fluctuations, which global DFT might blur. The Haar wavelet transform is chosen for its simplicity, efficiency, and perfect reconstruction property.
4.4.1. Multi-Head Projection
Similar to the DFF module, the input features are first projected into parallel feature subspaces to align with the processing of the DFF module.
4.4.2. Wavelet Decomposition
A 1D Haar wavelet transform is applied along the item dimension of each time-domain subspace feature $\mathbf{B}_i^l$. This decomposes the signal into low-frequency (approximation) coefficients $\mathbf{A}_i^l$ and high-frequency (detail) coefficients $\mathbf{D}_i^l$.
$ \mathbf{A}_i^l, \mathbf{D}_i^l = \mathcal{W}(\mathbf{B}_i^l) $
- Symbol Explanation:
  - $\mathbf{A}_i^l$: The $i$-th subspace's approximation coefficients at the $l$-th layer, representing low-frequency components.
  - $\mathbf{D}_i^l$: The $i$-th subspace's detail coefficients at the $l$-th layer, capturing high-frequency components. (Note: for a 1-level Haar decomposition of a length-$N$ sequence, both $\mathbf{A}_i^l$ and $\mathbf{D}_i^l$ have length $N/2$ along the item dimension.)
  - $\mathbf{B}_i^l$: The $i$-th time-domain subspace feature of the $l$-th layer.
  - $\mathcal{W}$: The 1D Haar wavelet transform operator.
4.4.3. Data Rescale
To enhance or suppress specific high-frequency signals, the detail coefficients are multiplied by an adaptive learnable matrix $\mathbf{T}^l$. The approximation coefficients (low-frequency information) are left unmodified to preserve the original primary components of the sequence.
$ \tilde{\mathbf{D}}_i^l = \mathbf{D}_i^l \odot \mathbf{T}^l $
- Symbol Explanation:
  - $\tilde{\mathbf{D}}_i^l$: The enhanced detail coefficients of the $i$-th subspace at the $l$-th layer.
  - $\mathbf{D}_i^l$: The original detail coefficients.
  - $\mathbf{T}^l$: The adaptive high-frequency enhancer (a learnable matrix) at the $l$-th layer.
  - $\odot$: Hadamard product (element-wise multiplication).
4.4.4. Wavelet Reconstruction
The high-frequency enhanced time-domain signal is reconstructed by applying the Inverse Haar Wavelet Transform (IDWT) to the (unmodified) approximation coefficients $\mathbf{A}_i^l$ and the enhanced detail coefficients $\tilde{\mathbf{D}}_i^l$.
$ \mathbf{Y}_i^l = \mathcal{W}^{-1}(\mathbf{A}_i^l, \tilde{\mathbf{D}}_i^l) $
- Symbol Explanation:
  - $\mathbf{Y}_i^l$: The reconstructed time-domain feature for the $i$-th subspace after WFE.
  - $\mathcal{W}^{-1}$: The Inverse Haar Wavelet Transform operator.
  - $\mathbf{A}_i^l$: The approximation coefficients.
  - $\tilde{\mathbf{D}}_i^l$: The enhanced detail coefficients.

The outputs from all subspaces are then concatenated to form the complete time-domain representation for the current layer:
$ \mathbf{Y}^l = [\mathbf{Y}_1^l, \mathbf{Y}_2^l, \ldots, \mathbf{Y}_k^l] $
- Symbol Explanation:
  - $\mathbf{Y}^l$: The complete time-domain feature after WFE at layer $l$, formed by concatenating the outputs from all subspaces.
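Below is a minimal PyTorch sketch of the WFE idea for a single feature subspace (1-level Haar decomposition, learnable rescaling of the detail coefficients, inverse transform). Shapes, names, and the omission of the multi-head split are simplifying assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class WaveletFeatureEnhancement(nn.Module):
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        # Learnable high-frequency enhancer T^l applied to the detail coefficients.
        self.enhancer = nn.Parameter(torch.ones(seq_len // 2, dim))

    def forward(self, h):                              # h: (batch, seq_len, dim), seq_len even
        even, odd = h[:, 0::2, :], h[:, 1::2, :]
        a = (even + odd) / 2 ** 0.5                    # approximation coefficients A (kept as-is)
        d = (even - odd) / 2 ** 0.5                    # detail coefficients D
        d_tilde = d * self.enhancer                    # rescale the high-frequency details
        # Inverse Haar transform: interleave the reconstructed even/odd positions.
        y = torch.empty_like(h)
        y[:, 0::2, :] = (a + d_tilde) / 2 ** 0.5
        y[:, 1::2, :] = (a - d_tilde) / 2 ** 0.5
        return y

wfe = WaveletFeatureEnhancement(seq_len=50, dim=64)
print(wfe(torch.randn(2, 50, 64)).shape)               # torch.Size([2, 50, 64])
```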
4.5. Feature Integrate
This step combines the global features extracted by the DFF module with the fine-grained local features derived from the WFE module. A weighting hyperparameter $\alpha$ is used to balance the emphasis between these two types of features.
$ \widehat{\mathbf{H}}^l = \alpha \odot \mathbf{X}^l + (1 - \alpha) \odot \mathbf{Y}^l $
- Symbol Explanation:
  - $\widehat{\mathbf{H}}^l$: The integrated feature representation at layer $l$.
  - $\alpha$: A hyperparameter (a scalar or a learnable tensor) that controls the weighting between the DFF output and the WFE output.
  - $\mathbf{X}^l$: The global features from the DFF module.
  - $\mathbf{Y}^l$: The fine-grained local features from the WFE module.
  - $\odot$: Hadamard product (element-wise multiplication).

To ensure a stable training process and better generalization, especially in deeper models, standard techniques like skip connections, dropout, and layer normalization are applied:
$ \widehat{\mathbf{H}}^l = \mathrm{LayerNorm}(\mathbf{H}^l + \mathrm{Dropout}(\widehat{\mathbf{H}}^l)) $
- Symbol Explanation:
  - $\mathbf{H}^l$: The input to the current layer (residual connection).
  - $\widehat{\mathbf{H}}^l$: The output of the feature integration (before normalization and dropout).
  - $\mathrm{LayerNorm}$, $\mathrm{Dropout}$: As defined before.
  - +: Element-wise addition for the skip connection.
4.6. Point-wise Feed Forward Network (FFN)
After feature integration, a Point-wise Feed-Forward Network (FFN) is applied. This FFN consists of an MLP with GELU activation and introduces non-linearity between different dimensions in the time domain.
$ \tilde{\mathbf{H}}^l = \mathrm{FFN}(\widehat{\mathbf{H}}^l) = (\mathrm{GELU}(\widehat{\mathbf{H}}^l \mathbf{W}_1 + \mathbf{b}_1)) \mathbf{W}_2 + \mathbf{b}_2 $
- Symbol Explanation:
  - $\tilde{\mathbf{H}}^l$: The output of the FFN at layer $l$.
  - $\widehat{\mathbf{H}}^l$: The input to the FFN (output from feature integration and normalization).
  - $\mathbf{W}_1$, $\mathbf{W}_2$: Learnable weight matrices for the two linear transformations in the FFN.
  - $\mathbf{b}_1$, $\mathbf{b}_2$: Learnable bias vectors for the two linear transformations.
  - $\mathrm{GELU}$: The Gaussian Error Linear Unit activation function, which introduces non-linearity.

To prevent overfitting and ensure stability, another dropout layer and layer normalization are applied using a residual connection structure:
$ \mathbf{H}^{l+1} = \mathrm{LayerNorm}(\widehat{\mathbf{H}}^l + \mathrm{Dropout}(\tilde{\mathbf{H}}^l)) $
- Symbol Explanation:
  - $\mathbf{H}^{l+1}$: The final output of the current WEARec block, which serves as the input to the next block or the prediction layer.
  - $\widehat{\mathbf{H}}^l$: The input to this residual connection (output from feature integration and first normalization).
  - $\tilde{\mathbf{H}}^l$: The output from the FFN.
  - $\mathrm{LayerNorm}$, $\mathrm{Dropout}$, +: As defined before.
4.7. Prediction Layer
In the final layer, the model computes a recommendation probability score for each candidate item. The output of the last WEARec block at the final position, $\mathbf{h}^L$ (where $L$ is the number of blocks), is used to calculate prediction scores by multiplying it with the transposed item embedding matrix $\mathbf{M}^\top$. A softmax function is then applied to convert these scores into probabilities.
$ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{h}^L (\mathbf{M})^\top) $
- Symbol Explanation:
  - $\hat{\mathbf{y}}$: The predicted probability score for each item in the entire item set.
  - $\mathbf{h}^L$: The output representation of the user's last interaction from the $L$-th WEARec block.
  - $(\mathbf{M})^\top$: The transposed item embedding matrix, used to project the user's final representation into the item space.
  - $\mathrm{softmax}$: The softmax function, which converts a vector of raw scores into a probability distribution.

The model parameters are optimized using the cross-entropy loss function:
$ \mathcal{L}_{Rec} = -\sum_{i=1}^{|\mathcal{V}|} y_i \log(\hat{y}_i) $
- Symbol Explanation:
  - $\mathcal{L}_{Rec}$: The cross-entropy loss for sequential recommendation.
  - $|\mathcal{V}|$: The total number of items.
  - $y_i$: The ground-truth label for item $i$ (1 if $i$ is the actual next item, 0 otherwise).
  - $\hat{y}_i$: The predicted preference score (probability) for item $i$.
  - $\log$: Natural logarithm.
  - $\sum_{i=1}^{|\mathcal{V}|}$: Summation over all items.
The overall architecture combines dynamic frequency domain processing with wavelet-based local feature enhancement, integrated through residual connections and feed-forward networks, and optimized for next-item prediction.
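For orientation, the following schematic shows how the pieces fit together in one training step, assuming hypothetical DFF/WFE modules with interfaces like the sketches above (helper names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def wearec_block(h, dff, wfe, ffn, norm1, norm2, drop, alpha=0.3):
    x, y = dff(h), wfe(h)                          # global (DFF) and local (WFE) features
    fused = alpha * x + (1 - alpha) * y            # feature integration
    h = norm1(h + drop(fused))                     # skip connection + LayerNorm
    return norm2(h + drop(ffn(h)))                 # point-wise FFN with residual connection

def training_step(item_ids, target, embed, blocks, item_emb_weight):
    h = embed(item_ids)                            # (batch, N, d) sequence representation
    for blk in blocks:                             # stack of L WEARec blocks
        h = wearec_block(h, *blk)
    logits = h[:, -1, :] @ item_emb_weight.T       # score every item with the last position
    return F.cross_entropy(logits, target)         # cross-entropy over the full item set
```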
5. Experimental Setup
This section details the datasets, evaluation metrics, baseline models, and implementation specifics used to evaluate WEARec.
5.1. Datasets
The experiments were conducted on four public datasets, covering different domains, scales, and sparsity levels, commonly used in sequential recommendation research. Following standard practice (Zhou et al. 2020, 2022), a 5-core setting was adopted, filtering out users with fewer than 5 interactions.
The following are the results from Table 3 of the original paper:
| Specs. | LastFM | ML-1M | Beauty | Sports |
|---|---|---|---|---|
| # User | 1,090 | 6,041 | 22,363 | 25,598 |
| # Items | 3,646 | 3,417 | 12,101 | 18,357 |
| # Interactions | 52,551 | 999,611 | 198,502 | 296,337 |
| # Avg.Length | 48.2 | 165.5 | 8.9 | 8.3 |
| Sparsity | 98.68% | 95.16% | 99.93% | 99.95% |
5.1.1. Description of Datasets
- LastFM: This dataset contains user interactions with music, specifically artist listening records. It is used for recommending musicians and is characterized by relatively long sequence lengths (average length 48.2) and high sparsity (98.68%).
- MovieLens-1M (ML-1M): (Harper and Konstan 2015) This dataset is based on movie reviews collected from the non-commercial MovieLens website. It is a dense dataset with about 1 million interactions and the longest average sequence length (165.5), making it suitable for evaluating models in long-sequence scenarios.
- Amazon Beauty: (McAuley et al. 2015) Part of the Amazon review dataset, this dataset contains user-item interactions related to beauty products. It is a sparse dataset (99.93%) with relatively short average sequence lengths (8.9).
- Amazon Sports: (McAuley et al. 2015) Also from the Amazon review dataset, focusing on sports-related products. Similar to Beauty, it is very sparse (99.95%) with short average sequence lengths (8.3).

These datasets were chosen for their diversity in domain, size, and interaction density, allowing for a comprehensive evaluation of WEARec's performance across various real-world scenarios.
5.2. Evaluation Metrics
The leave-one-out strategy is used for partitioning each user's item sequence: the last item is held out for testing and the preceding items are used for training. Predictions are ranked over the entire item set without negative sampling. Performance is evaluated using Hit Ratio at K (HR@K) and Normalized Discounted Cumulative Gain at K (NDCG@K), with $K$ set to 10 and 20.
5.2.1. Hit Ratio at K (HR@K)
- Conceptual Definition: Hit Ratio at K (HR@K) measures how often the target item (the next item the user actually interacted with) appears within the top-$K$ items recommended by the model. It is a recall-based metric, focusing on whether the relevant item is present in the recommendation list, regardless of its position.
- Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in the top-}K\text{ list}}{\text{Total number of users}} $
- Symbol Explanation:
  - Number of users for whom the target item is in the top-K list: The count of unique users for whom the true next item was found among the top-$K$ predicted items.
  - Total number of users: The total number of users in the test set.
5.2.2. Normalized Discounted Cumulative Gain at K (NDCG@K)
- Conceptual Definition: Normalized Discounted Cumulative Gain at K (NDCG@K) is a ranking quality metric that accounts for the position of the relevant item in the recommendation list. It assigns higher scores to relevant items that appear at higher ranks (earlier in the list). The "discounted" part means that relevant items further down the list contribute less to the total score. The "normalized" part ensures scores are comparable across different query results.
- Mathematical Formula: First, the Discounted Cumulative Gain (DCG@K) is calculated:
  $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $
  Then, NDCG@K normalizes DCG@K by dividing it by the Ideal DCG (IDCG@K), which is the DCG of the perfect ranking (where all relevant items are at the top):
  $ \mathrm{IDCG@K} = \sum_{i=1}^{|\mathrm{REL}|} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $
  where $|\mathrm{REL}|$ is the number of relevant items in the ideal ranking, which is typically 1 for next-item recommendation. Finally:
  $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
- Symbol Explanation:
  - $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommended list. For sequential recommendation, it is typically 1 if the item is the true next item, and 0 otherwise.
  - $i$: The rank (position) of the item in the recommended list, from 1 to $K$.
  - $K$: The number of top recommendations considered.
  - $\log_2(i+1)$: The discount factor, which reduces the contribution of items at lower ranks.
  - $|\mathrm{REL}|$: The number of relevant items in the ideally sorted list (usually 1 for the single ground-truth next item).
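A minimal sketch of how these two metrics are typically computed in the leave-one-out, full-ranking setting described above (illustrative code, not from the paper):

```python
import numpy as np

# HR@K / NDCG@K for next-item recommendation (one relevant item per user).
def hr_ndcg_at_k(ranked_item_lists, target_items, k=10):
    hits, ndcg = 0.0, 0.0
    for ranked, target in zip(ranked_item_lists, target_items):
        top_k = list(ranked[:k])
        if target in top_k:
            rank = top_k.index(target) + 1        # 1-based position of the true next item
            hits += 1.0
            ndcg += 1.0 / np.log2(rank + 1)       # IDCG = 1 when there is a single relevant item
    n_users = len(target_items)
    return hits / n_users, ndcg / n_users

# Toy usage: two users, each with a ranked prediction list over the item set.
print(hr_ndcg_at_k([[3, 7, 9], [5, 1, 2]], [7, 8], k=3))   # (0.5, ~0.315)
```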
5.3. Baselines
To demonstrate the effectiveness of WEARec, it was compared against a selection of state-of-the-art models from two categories:
5.3.1. Time-domain SR models
These models process sequences directly in the chronological order.
- GRU4Rec (Hidasi et al. 2015): A foundational model that uses Gated Recurrent Units (GRU) to capture sequential dependencies.
- Caser (Tang and Wang 2018): A CNN-based method that captures local patterns using horizontal and vertical convolutional filters.
- SASRec (Kang and McAuley 2018): A highly influential model that employs the self-attention mechanism to model item-item relationships.
- DuoRec (Qiu et al. 2022): A more recent model that enhances sequence representations through contrastive learning, using both unsupervised model-level augmentation and supervised semantic positive samples.
5.3.2. Frequency-domain SR models
These models leverage frequency-domain analysis.
- FMLPRec (Zhou et al. 2022): An all-MLP model that uses learnable filters in the frequency domain to capture periodic patterns and reduce noise.
- FamouSRec (Zhang et al. 2025): Utilizes a Mixture-of-Experts (MoE) framework to select specialized expert models tailored to users' specific frequency-based behavioral patterns.
- FEARec (Du et al. 2023b): Integrates frequency-domain information into attention computation and combines insights from both time and frequency domains.
- SLIME4Rec (Du et al. 2023a): Employs a frequency ramp structure to consider different frequency bands across layers, combined with dynamic and static selection modules and contrastive learning.
- BSARec (Shin et al. 2024): A very recent and strong baseline that enhances self-attention with an attentive inductive bias from the frequency domain and uses a frequency recalibrator to alleviate over-smoothing.
5.4. Implementation Details
The WEARec model was implemented in PyTorch. For baseline models, the authors refer to optimal hyper-parameter setups from original papers or report results from consistent reimplementations.
- Embedding Size ($d$): Both the dimension of the feed-forward network and the item embedding size are set to 64.
- Number of WEARec Blocks ($L$): Set to 2.
- Maximum Sequence Length ($N$): Set to 50 for the primary experiments.
- Batch Size: 256.
- Optimizer: Adam.
- Learning Rate: Tuned per dataset; the selected values are listed in Table 4 below.
- Wavelet Decomposition Level: Set to 1.
- Hyperparameter $\alpha$: Explored over a range of values; the selected values are listed in Table 4 below.
- Number of Filters (Heads) $k$: Tuned per dataset; the selected values are listed in Table 4 below.
- Dropout Rate: 0.5 for the Amazon datasets and LastFM, and 0.1 for MovieLens-1M due to its lower sparsity.

The following are the results from Table 4 of the original paper:

| Specs. | LastFM | ML-1M | Beauty | Sports |
|---|---|---|---|---|
| $\alpha$ | 0.3 | 0.3 | 0.2 | 0.3 |
| k | 2 | 2 | 8 | 4 |
| lr | 0.001 | 0.0005 | 0.0005 | 0.001 |
The table above shows the optimal hyperparameter settings for WEARec on each dataset for reproducibility.
6. Results & Analysis
This section analyzes the experimental results, comparing WEARec against baselines, evaluating its performance in long-sequence scenarios, examining its complexity, and conducting ablation studies and hyper-parameter analysis.
6.1. Core Results Analysis (RQ1)
The primary experimental results comparing WEARec with various baseline models across four datasets (Beauty, Sports, LastFM, ML-1M) are presented in Table 1. Performance is measured using HR@10, HR@20, NDCG@10, and NDCG@20.
The following are the results from Table 1 of the original paper:
| Datasets | Metric | Caser | GRU4Rec | SASRec | DuoRec | FMLPRec | FamouSRec | FEARec | SLIME4Rec | BSARec | WEARec | Improv. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | HR@10 | 0.0225 | 0.0304 | 0.0531 | 0.0965 | 0.0559 | 0.0838 | 0.0982 | 0.1006 | 0.1008 | 0.1041 | 3.27% |
| HR@20 | 0.0403 | 0.0527 | 0.0823 | 0.1313 | 0.0869 | 0.1146 | 0.1352 | 0.1381 | 0.1373 | 0.1391 | 1.31% | |
| NG@10 | 0.0108 | 0.0147 | 0.0283 | 0.0584 | 0.0291 | 0.0497 | 0.0601 | 0.0601 | 0.0611 | 0.0614 | 0.49% | |
| NG@20 | 0.0153 | 0.0203 | 0.0356 | 0.0671 | 0.0369 | 0.0575 | 0.0694 | 0.0696 | 0.0703 | 0.0703 | 0.00% | |
| Sports | HR@10 | 0.0163 | 0.0187 | 0.0298 | 0.0569 | 0.0336 | 0.0424 | 0.0589 | 0.0611 | 0.0612 | 0.0631 | 3.10% |
| HR@20 | 0.0260 | 0.0303 | 0.0459 | 0.0791 | 0.0525 | 0.0632 | 0.0836 | 0.0869 | 0.0858 | 0.0895 | 2.99% | |
| NG@10 | 0.0080 | 0.0101 | 0.0159 | 0.0331 | 0.0183 | 0.0244 | 0.0343 | 0.0357 | 0.0360 | 0.0367 | 1.94% | |
| NG@20 | 0.0104 | 0.0131 | 0.0200 | 0.0387 | 0.0231 | 0.0297 | 0.0405 | 0.0421 | 0.0422 | 0.0433 | 2.60% | |
| LastFM | HR@10 | 0.0431 | 0.0404 | 0.0633 | 0.0624 | 0.0560 | 0.0569 | 0.0587 | 0.0633 | 0.0807 | 0.0899 | 11.40% |
| HR@20 | 0.0642 | 0.0541 | 0.0927 | 0.0963 | 0.0826 | 0.0954 | 0.0826 | 0.0927 | 0.1174 | 0.1202 | 2.38% | |
| NG@10 | 0.0268 | 0.0245 | 0.0355 | 0.0361 | 0.0306 | 0.0318 | 0.0354 | 0.0359 | 0.0435 | 0.0465 | 6.89% | |
| NG@20 | 0.0321 | 0.0280 | 0.0429 | 0.0446 | 0.0372 | 0.0415 | 0.0414 | 0.0433 | 0.0526 | 0.0543 | 3.23% | |
| ML-1M | HR@10 | 0.1556 | 0.1657 | 0.2137 | 0.2704 | 0.2065 | 0.2639 | 0.2705 | 0.2891 | 0.2757 | 0.2952 | 2.10% |
| HR@20 | 0.2488 | 0.2664 | 0.3245 | 0.3738 | 0.3137 | 0.3717 | 0.3714 | 0.3950 | 0.3884 | 0.4031 | 3.29% | |
| NG@10 | 0.0950 | 0.0828 | 0.1116 | 0.1530 | 0.1087 | 0.1455 | 0.1516 | 0.1673 | 0.1568 | 0.1696 | 1.37% | |
| NG@20 | 0.1028 | 0.1081 | 0.1395 | 0.1790 | 0.1356 | 0.1727 | 0.1771 | 0.1939 | 0.1851 | 0.1968 | 1.49% |
Observations and Conclusions:
- Time-domain models vs. Frequency-domain models: Traditional time-domain models (Caser, GRU4Rec, SASRec) generally exhibit suboptimal performance. This is attributed to their difficulty in identifying intertwined user periodic patterns from raw temporal sequences. DuoRec, which incorporates contrastive learning, performs better than the other time-domain methods, validating the effectiveness of learning robust representations.
- Superiority of Frequency-domain models: Models leveraging the frequency domain (e.g., FMLPRec, FamouSRec, FEARec, SLIME4Rec, BSARec) generally achieve superior performance compared to time-domain baselines. FMLPRec, by using an MLP structure for frequency-domain filtering, performs comparably or even better than SASRec. Further advancements with contrastive learning (FamouSRec, FEARec, SLIME4Rec) and inductive bias (BSARec) lead to improved results among frequency-domain methods.
- WEARec's Outperformance: WEARec consistently achieves the best performance across all four datasets and all evaluation metrics. This validates the effectiveness of combining the Dynamic Frequency-domain Filtering (DFF) module (for personalized global information) with the Wavelet Feature Enhancement (WFE) module (for enhanced local information). The improvement percentage (Improv.) over the best baseline ranges from 0.00% (Beauty NG@20, where WEARec ties BSARec at 0.0703) up to 11.40% (LastFM HR@10), with particularly noticeable gains on the LastFM dataset.
6.2. Model in Long Sequence Scenarios (RQ2)
The study investigates WEARec's performance and computational overhead in long-sequence scenarios by varying the maximum sequence length . Experiments were conducted on ML-1M and LastFM datasets, which have longer average sequence lengths.
6.2.1. Model Performance
The following figure (Figure 3 from the original paper) shows the HR@20 performance comparison of WEARec with FMLPRec, SLIME4Rec, and BSARec at different sequence lengths on ML-1M and LastFM.
(Figure 3 caption, translated) The chart compares the HR@20 performance of WEARec against FMLPRec, SLIME4Rec, and BSARec under different maximum sequence lengths $N$, on the ML-1M and LastFM datasets.
The following are the results from Table 5 of the original paper:
| Method | ML-1M | LastFM | |||||||
|---|---|---|---|---|---|---|---|---|---|
| HR@10 | NG@10 | HR@20 | NG@20 | HR@10 | NG@10 | HR@20 | NG@20 | ||
| N = 50 | BSARec | 0.2757 | 0.1568 | 0.3884 | 0.1851 | 0.0807 | 0.0435 | 0.1174 | 0.0526 |
| SLIME4Rec | 0.2894 | 0.1675 | 0.3934 | 0.1937 | 0.0633 | 0.0376 | 0.0936 | 0.0453 | |
| Ours | 0.2952 | 0.1696 | 0.4031 | 0.1968 | 0.0899 | 0.0465 | 0.1202 | 0.0547 | |
| N = 100 | BSARec | 0.3073 | 0.1815 | 0.4089 | 0.2024 | 0.0798 | 0.0455 | 0.1202 | 0.0545 |
| SLIME4Rec | 0.3147 | 0.1815 | 0.4126 | 0.2062 | 0.0679 | 0.0382 | 0.0991 | 0.0463 | |
| Ours | 0.3180 | 0.1819 | 0.4175 | 0.2069 | 0.0890 | 0.0494 | 0.1266 | 0.0589 | |
| N = 150 | BSARec | 0.3171 | 0.1826 | 0.4300 | 0.2111 | 0.0826 | 0.0476 | 0.1174 | 0.0564 |
| SLIME4Rec | 0.3166 | 0.1820 | 0.4298 | 0.2127 | 0.0688 | 0.0387 | 0.1055 | 0.0479 | |
| Ours | 0.3215 | 0.1848 | 0.4338 | 0.2131 | 0.0927 | 0.0522 | 0.1312 | 0.0617 | |
| N = 200 | BSARec | 0.3161 | 0.1837 | 0.4311 | 0.2127 | 0.0862 | 0.0476 | 0.1257 | 0.0594 |
| SLIME4Rec | 0.3166 | 0.1850 | 0.4343 | 0.2173 | 0.0679 | 0.0391 | 0.1064 | 0.0488 | |
| Ours | 0.3334 | 0.1904 | 0.4421 | 0.2179 | 0.0972 | 0.0556 | 0.1477 | 0.0682 | |
Observations:
- Impact of Sequence Length: Across most models, performance generally improves as the maximum sequence length grows, indicating that more historical information helps represent user behavior patterns more comprehensively.
- Baselines' Convergence and Overfitting: While baseline models show initial performance improvements in long-sequence scenarios, they tend to exhibit performance convergence or are prone to overfitting with very long sequences, suggesting limitations in handling the increased complexity or noise.
- WEARec's Consistent Superiority: WEARec consistently outperforms the baselines across all maximum sequence length settings (N = 50, 100, 150, 200) on both ML-1M and LastFM. Furthermore, its improvement over baseline models is often more pronounced in long-sequence scenarios (e.g., the margins at N = 200 in Table 5 are larger than those at N = 50). This suggests WEARec is particularly well-suited for scenarios with rich historical data.
6.2.2. Model Complexity and Runtime Analyses
To evaluate the overhead of WEARec, the authors assessed the number of parameters and the runtime per epoch during training at a fixed maximum sequence length.
The following are the results from Table 6 of the original paper:
| Methods | ML-1M | Beauty | Sports | LastFM | ||||
|---|---|---|---|---|---|---|---|---|
| # params | s/epoch | # params | s/epoch | # params | s/epoch | # params | s/epoch | |
| WEARec | 426,082 | 66.46 | 981,922 | 15.12 | 1,382,306 | 26.12 | 440,802 | 5.23 |
| FMLPRec | 324,160 | 36.93 | 880,000 | 10.11 | 1,280,384 | 22.78 | 338,880 | 4.91 |
| BSARec | 331,968 | 109.26 | 887,808 | 25.87 | 1,288,192 | 50.59 | 346,688 | 10.84 |
| SLIME4Rec | 375,872 | 120.43 | 931,712 | 31.44 | 1,332,096 | 68.74 | 390,592 | 13.77 |
Observations:
- Total Parameters: WEARec generally has a higher number of total parameters than FMLPRec, BSARec, and SLIME4Rec. This is primarily due to the simple MLP introduced for dynamic filter generation and the learnable matrix in the WFE module.
- Training Time (Runtime per Epoch): Despite having more parameters, WEARec exhibits a shorter training time per epoch than BSARec and SLIME4Rec across all datasets. For example, on ML-1M, WEARec trains in 66.46 s/epoch, significantly faster than BSARec (109.26 s/epoch) and SLIME4Rec (120.43 s/epoch). However, WEARec is slower than FMLPRec, which has the fewest parameters and lowest runtime, as FMLPRec is a simpler, attention-free MLP model.
- Theoretical Complexity: The authors explain that WEARec's computational complexity is dominated by the FFT and the Haar wavelet transform, giving a time complexity on the order of O(D · L log L), which simplifies to O(L log L) for large L (assuming the hidden dimension D is constant or much smaller than L). This is a significant improvement over the O(L^2) overhead of the self-attention mechanism used in BSARec. By not employing contrastive learning (used in SLIME4Rec), WEARec also avoids the additional computational cost associated with it. This demonstrates WEARec's superior efficiency in long-sequence scenarios where L is large.
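To make the complexity argument concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two transforms that dominate WEARec's cost: an FFT-based filtering step, whose rfft/irfft calls cost O(D · L log L), and a single-level Haar decomposition, which costs only O(D · L). Function names such as `frequency_filter_step` and `haar_dwt_step` are illustrative.

```python
import torch

def frequency_filter_step(x: torch.Tensor, filt: torch.Tensor) -> torch.Tensor:
    """Apply a (possibly learned) complex filter in the frequency domain.

    x:    (batch, L, D) real-valued sequence embeddings
    filt: (L // 2 + 1, D) complex filter coefficients
    Cost is dominated by rfft/irfft, i.e. O(D * L log L).
    """
    spec = torch.fft.rfft(x, dim=1)           # (batch, L//2+1, D), complex
    spec = spec * filt                        # element-wise filtering
    return torch.fft.irfft(spec, n=x.size(1), dim=1)

def haar_dwt_step(x: torch.Tensor):
    """Single-level Haar DWT along the sequence axis (L assumed even).

    Returns approximation and detail coefficients, each (batch, L//2, D).
    Cost is O(L * D), cheaper than the FFT term.
    """
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5          # low-frequency trend
    detail = (even - odd) / 2 ** 0.5          # high-frequency fluctuation
    return approx, detail

if __name__ == "__main__":
    B, L, D = 4, 200, 64
    x = torch.randn(B, L, D)
    filt = torch.randn(L // 2 + 1, D, dtype=torch.cfloat)
    y = frequency_filter_step(x, filt)
    a, d = haar_dwt_step(x)
    print(y.shape, a.shape, d.shape)  # (4, 200, 64) (4, 100, 64) (4, 100, 64)
```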
6.3. In-depth Model Analysis (RQ3-RQ4)
6.3.1. Ablation Study (RQ3)
An ablation study was conducted to understand the contribution of each designed module in WEARec. Variants were created by removing specific components:
- WEARec: the full model.
- w/o W: WEARec without the Wavelet Feature Enhancement (WFE) module.
- w/o F: WEARec without the Dynamic Frequency-domain Filtering (DFF) module.
- w/o M: WEARec without the multi-head projection component.

A configuration-style sketch of how such variants might be toggled is shown after the observations below. The following figure (Figure 4 from the original paper) summarizes the HR@20 and NG@20 performance of WEARec and its variants across the four datasets.
Observations:
- WEARec (the full model) consistently outperforms all its variants across all four datasets and metrics. This indicates that all proposed components (the WFE module, the DFF module, and the multi-head projection) are effective and contribute positively to the overall performance.
- Removing either the WFE module (w/o W) or the DFF module (w/o F) leads to a notable drop in performance. This confirms the necessity of both modules: DFF for personalized global frequency patterns and WFE for enhancing local, non-stationary signals.
- Removing the multi-head projection (w/o M) also results in a performance decrease, suggesting that decomposing the input into parallel feature subspaces with tailored processing (even if implicitly, as in FFT/DWT across different embedding dimensions) is beneficial.
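As referenced above, here is a hedged sketch of how such ablation variants are typically wired up as configuration switches; the flag names are invented for illustration and the authors' implementation may differ.

```python
from dataclasses import dataclass

@dataclass
class WEARecConfig:
    use_wfe: bool = True          # wavelet feature enhancement (w/o W disables this)
    use_dff: bool = True          # dynamic frequency-domain filtering (w/o F disables this)
    use_multi_head: bool = True   # multi-head projection (w/o M disables this)
    num_heads: int = 4            # K; ignored when use_multi_head is False

# the four variants from the ablation study
full_model = WEARecConfig()
wo_w = WEARecConfig(use_wfe=False)
wo_f = WEARecConfig(use_dff=False)
wo_m = WEARecConfig(use_multi_head=False)
```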
6.3.2. Hyper-parameter Analysis (RQ4)
The study also analyzed the sensitivity of WEARec to two key hyperparameters: K (the number of filters/heads) and β (the weighting factor between the DFF and WFE outputs).
The following figure (Figure 5 from the original paper) shows the performance of WEARec on HR@20 with varying hyperparameters (K and β).

![Figure 5: HR@20 and NDCG@20 of WEARec under different values of K and the weighting factor β on the ML-1M and LastFM datasets.](/images/4e6e16021a90c02ba237ae3a4b14f60772590da03137d4ed42b2bae969e7a2ba.jpg)
Observations:
- Sensitivity to K (Number of Filters/Heads):
  - The performance varies with K, and an optimal value appears to exist for each dataset; the best setting on ML-1M and LastFM differs from that on Beauty and Sports.
  - Neither a very small K (e.g., K = 1, which amounts to no multi-head projection) nor a very large K (not fully explored, but implied by the trend) is ideal. An appropriate K is critical for learning user interest preferences, as it allows the model to process different feature subspaces and thereby capture different aspects of user behavior.
- Sensitivity to β (Weighting Factor):
  - The performance also changes with β, which balances the contributions of DFF (global features) and WFE (local features).
  - Optimal performance is often observed when β is approximately 0.3. At first glance this suggests a higher emphasis on the global features from Dynamic Frequency-domain Filtering, but the interpretation is inverted by how the fusion is written: if the combined output is β · (DFF output) + (1 − β) · (WFE output) and β = 0.3, then the DFF output is weighted 0.3 while the WFE output is weighted 0.7. In other words, WFE (local, high-frequency enhancement) receives the higher emphasis, which aligns with the paper's motivation of addressing the blurring of non-stationary signals.
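A minimal sketch of this convex fusion, assuming a plain weighted sum of the two branch outputs (the function name `fuse_outputs` and the exact fusion form are illustrative, not the authors' code):

```python
import torch

def fuse_outputs(dff_out: torch.Tensor, wfe_out: torch.Tensor, beta: float = 0.3) -> torch.Tensor:
    """Convex combination of the global (DFF) and local (WFE) branch outputs.

    With beta = 0.3, the WFE branch receives weight 1 - beta = 0.7, i.e. the
    local, non-stationary enhancement dominates the fused representation.
    """
    return beta * dff_out + (1.0 - beta) * wfe_out

# toy usage with (batch, L, D) tensors standing in for the two branches
dff_out = torch.randn(2, 50, 64)
wfe_out = torch.randn(2, 50, 64)
fused = fuse_outputs(dff_out, wfe_out, beta=0.3)
```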
6.3.3. Visualization of the Filters
The authors visualize the frequency and amplitude features learned by different types of filtering models to highlight WEARec's capabilities.
The following figure (Figure 6 from the original paper) presents the spectral responses for different types of filter models across layers in Beauty.

Observations:
- Static Filters' Limitation: FMLPRec and SLIME4Rec (with their static filter designs) tend to learn predominantly low-frequency components within their respective frequency bands, especially noticeable in Layer 1. This confirms the hypothesis that existing DFT-based methods often act as low-pass filters, struggling to capture the full spectrum of frequencies.
- WEARec's Dynamic and Comprehensive Coverage: WEARec, with its dynamic frequency-domain filtering design, is capable of encompassing all frequency components. Its spectral response is more flexible and covers a wider range of frequencies, demonstrating its ability to adapt to diverse user behavioral patterns.

Additionally, Figure 7 visualizes the learned wavelet feature enhancer from Equation 19. The following figure (Figure 7 from the original paper) shows the visualization of the learned W_enhance in Beauty.

Observations:
- The wavelet feature enhancement module learns to identify non-stationary signals at specific time points and to adaptively adjust their weights. In the first layer, multiple non-stationary time points are assigned higher weights for enhancement, while potentially noisy ones receive negative weights for suppression. In the second layer, the module focuses on enhancing the nearest non-stationary signal points. This indicates its capability to capture and modify local details effectively.
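As a rough illustration of what such a learned enhancer could look like, the sketch below reweights Haar detail coefficients with learnable per-position weights before reconstruction. This is an interpretation under stated assumptions (single-level Haar decomposition, one weight per detail position), not the paper's Equation 19, and the class name `HaarDetailEnhancer` is hypothetical.

```python
import torch
import torch.nn as nn

class HaarDetailEnhancer(nn.Module):
    """Reweights Haar detail (high-frequency) coefficients with learnable,
    per-position weights, then reconstructs the sequence.

    Positive weights amplify non-stationary fluctuations; negative weights
    suppress them, mirroring the behavior described for W_enhance.
    """

    def __init__(self, seq_len: int):
        super().__init__()
        assert seq_len % 2 == 0, "Haar step assumes an even sequence length"
        # one weight per detail-coefficient position, initialized to identity
        self.detail_weight = nn.Parameter(torch.ones(seq_len // 2, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, D)
        even, odd = x[:, 0::2, :], x[:, 1::2, :]
        approx = (even + odd) / 2 ** 0.5            # low-frequency trend
        detail = (even - odd) / 2 ** 0.5            # short-term fluctuation
        detail = detail * self.detail_weight        # learned enhancement/suppression
        # inverse Haar transform: interleave the reconstructed even/odd samples
        rec_even = (approx + detail) / 2 ** 0.5
        rec_odd = (approx - detail) / 2 ** 0.5
        out = torch.stack((rec_even, rec_odd), dim=2)  # (batch, L//2, 2, D)
        return out.reshape(x.size(0), x.size(1), x.size(2))

# toy usage
enhancer = HaarDetailEnhancer(seq_len=50)
reconstructed = enhancer(torch.randn(8, 50, 64))       # (8, 50, 64)
```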
6.3.4. Case Study
A case study further evaluates whether the dynamic filter and wavelet enhancer can better capture a wider range of frequency domain features by visualizing the number of users correctly captured by different models across frequency bands.
The following figure (Figure 8 from the original paper) shows the number of users uniquely driven by each frequency component in the Sports and Beauty datasets for different models.

Observations:
- FMLPRec's Limited Capture: FMLPRec captured the fewest users across multiple frequency bands, consistent with its static filter design, which limits its ability to adapt to diverse user behavior patterns.
- Baselines' Improved Capture: BSARec and SLIME4Rec show better results, attributed to their use of an inductive bias (BSARec) or frequency ramp structures and contrastive learning (SLIME4Rec) to enhance user embedding representations.
- WEARec's Best Coverage: WEARec achieves the best performance, capturing the highest number of users across the various frequency regions. This is due to the synergistic combination of the dynamic filter's global information capture and the wavelet enhancer's local detail extraction. It performs best both in low-frequency regions (capturing long-term patterns) and in certain high-frequency regions (capturing short-term or transient patterns).
7. Conclusion & Reflections
7.1. Conclusion Summary
In this paper, the authors introduced WEARec, a novel and efficient model specifically designed for sequential recommendation tasks, particularly adept at handling long sequences and capturing diverse user behavioral patterns. The core of WEARec lies in its two vital modules: Dynamic Frequency-domain Filtering (DFF) and Wavelet Feature Enhancement (WFE). The DFF module allows for personalized adjustments of filtering operations based on individual user sequences, thereby extracting tailored global frequency distributions. Concurrently, the WFE module leverages wavelet transforms to reconstruct sequences, effectively enhancing non-stationary signals and short-term fluctuations that global Fourier Transforms might otherwise blur. Through extensive experiments on four public datasets, WEARec demonstrated superior recommendation performance and competitive computational efficiency compared to state-of-the-art baselines.
7.2. Limitations & Future Work
The paper explicitly states that WEARec "achieves better performance with lower computational costs, especially in long-sequence scenarios," and its time complexity is O(L log L), an improvement over self-attention's O(L^2). This suggests the authors view efficiency as a strength rather than a limitation.
However, a closer look at Table 6 shows that while WEARec is faster than BSARec and SLIME4Rec, it is slower and has more parameters than FMLPRec. The paper's conclusion section does not explicitly list future work or limitations.
Based on the analysis, potential limitations and future work could include:
- Increased Parameter Count: While computationally efficient, WEARec does introduce more parameters than simpler MLP-based frequency models like FMLPRec. This could be a consideration for extremely resource-constrained environments. Future work could explore parameter-efficient designs for the MLP networks in DFF or the enhancer matrix in WFE.
- Wavelet Choice: The Haar wavelet transform is chosen for its simplicity and efficiency. However, other wavelet families (e.g., Daubechies, Symlets) offer different properties regarding smoothness, vanishing moments, and time-frequency localization. Exploring the impact of different wavelet bases on performance and computational cost could be a direction for future research.
- Interpretability of Dynamic Filters: While dynamic filters offer personalization, a deeper analysis of which frequency bands are emphasized for which types of users could enhance interpretability and provide insights into user behavior.
- Adaptive K and β: The parameters K and β are currently hyperparameters. Future work could investigate learning them adaptively, perhaps through a meta-learning approach, allowing the model to dynamically determine the optimal balance between global and local features or the ideal number of frequency subspaces.
- Generalization to Other Modalities: While the current work focuses on item sequences, the principles of dynamic frequency filtering and wavelet enhancement could be extended to other sequential data types or modalities in recommendation (e.g., user reviews as text sequences, multimedia feature sequences).
7.3. Personal Insights & Critique
This paper presents a compelling and well-motivated advancement in sequential recommendation by effectively addressing two critical shortcomings of existing frequency-domain methods.
Inspirations and Applications:
- Hybrid Approach Strength: The core idea of combining global frequency analysis (via DFT-based adaptive filtering) with local time-frequency analysis (via DWT) is highly insightful. Many real-world signals, including user behavior, exhibit both long-term periodicities and short-term transient events. A hybrid approach like WEARec is better equipped to capture this complexity than methods relying solely on one domain or scale. This hybrid thinking could transfer to other domains dealing with complex time series, such as time series forecasting (e.g., energy consumption, stock prices) or even natural language processing for analyzing sentiment shifts or topic changes within longer documents.
- Personalization Beyond Features: The Dynamic Frequency-domain Filtering module's ability to create user-specific filters based on context is a powerful concept. It goes beyond simply learning personalized embeddings; it is about learning personalized processing logic. This idea could be applied to other areas where personalized model architectures or adaptive processing pipelines are beneficial, such as personalized medicine or adaptive control systems.
- Efficiency in Long Sequences: The focus on long-sequence scenarios and the architectural choice to maintain log-linear complexity through FFT and DWT is highly practical. As user interaction histories grow, quadratic-complexity models become intractable. WEARec offers a blueprint for building high-performing yet efficient sequence models.
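As a loose illustration of "personalized processing logic", the sketch below conditions a frequency-domain filter on a summary of the user's own sequence via a small MLP that outputs per-frequency scaling and bias terms. This is one plausible reading of the DFF idea; the class name `DynamicFrequencyFilter` and the exact parameterization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DynamicFrequencyFilter(nn.Module):
    """Generates a per-user frequency filter from the sequence itself.

    A shared base filter is modulated by sequence-conditioned scaling and
    bias terms, so each user's history is filtered differently.
    """

    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.n_freq = seq_len // 2 + 1
        # shared (static) base filter, stored as real/imaginary parts
        self.base_filter = nn.Parameter(torch.randn(self.n_freq, dim, 2) * 0.02)
        # small MLP mapping a sequence summary to per-frequency scale and bias
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, 2 * self.n_freq),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, D)
        summary = x.mean(dim=1)                              # (batch, D) sequence context
        scale, bias = self.mlp(summary).chunk(2, dim=-1)     # each (batch, n_freq)
        base = torch.view_as_complex(self.base_filter)       # (n_freq, D)
        # personalize the shared filter: scale and shift each frequency bin
        filt = base * (1 + scale).unsqueeze(-1) + bias.unsqueeze(-1)
        spec = torch.fft.rfft(x, dim=1)                      # (batch, n_freq, D)
        return torch.fft.irfft(spec * filt, n=x.size(1), dim=1)

# toy usage
layer = DynamicFrequencyFilter(seq_len=50, dim=64)
out = layer(torch.randn(8, 50, 64))                          # (8, 50, 64)
```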
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Haar Wavelet Simplicity: While efficient, the Haar wavelet is the simplest orthogonal wavelet, and its blocky nature might not be ideal for all types of smooth signals. The paper assumes its efficiency outweighs any potential loss in representation capability compared to smoother wavelets; this assumption could be further tested.
- "All Frequency Components" Claim: Figure 6 suggests WEARec "encompasses all frequency components" compared to the baselines. While its response is broader and more flexible, the visualization does not show a perfectly flat response across all frequencies, indicating some implicit filtering still occurs. The "adaptive" nature is key here: the model can choose which components to emphasize, not necessarily treat all of them equally.
- Interpretability of the MLP for Dynamics: The MLPs generating scaling factors and bias terms for the dynamic filters are powerful but somewhat opaque. While they enable personalization, understanding how these MLPs interpret the context to modify frequency responses could be challenging. Further analysis or more interpretable adaptive mechanisms might be valuable.
- Hyperparameter Interpretation: The optimal β ≈ 0.3 implies that WFE's locally enhanced features receive a higher weighting (0.7) than DFF's globally filtered features (0.3). This suggests that WFE plays a more dominant role in WEARec's success than the name Wavelet Enhanced Adaptive Frequency Filter might initially imply, underscoring the importance of capturing those subtle, non-stationary local details. This finding reinforces the paper's second motivation point.

Overall, WEARec is a strong contribution that moves beyond monolithic frequency-domain processing, demonstrating the power of tailored, multi-resolution signal processing techniques for understanding complex human behavior in recommender systems.