TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation
TL;DR Summary
TV-Rec introduces time-variant convolutional filters inspired by graph signal processing, replacing fixed filters and self-attention to model temporal user behavior variations, improving expressiveness, reducing computation, and boosting accuracy by 7.49% on six benchmarks.
Abstract
Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is a novel approach to sequential recommendation using time-variant convolutional filters. The title is TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation.
1.2. Authors
The authors are:
-
Yehjin Shin
-
Jeongwhan Choi
-
Seojin Kim
-
Noseong Park*
All authors are affiliated with KAIST (Korea Advanced Institute of Science and Technology). Their research background appears to be in recommender systems, machine learning, and potentially graph signal processing, given the paper's core methodology.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server. The Published at (UTC): 2025-10-29T08:14:03.000Z indicates it is a recent submission, likely awaiting peer review or subsequent publication in a conference or journal. While arXiv itself is not a peer-reviewed venue, it is a widely recognized platform for disseminating early research in fields like computer science, mathematics, and physics. The authors' affiliation with KAIST suggests a strong academic background.
1.4. Publication Year
The publication year, based on the provided Published at (UTC) timestamp, is 2025.
1.5. Abstract
The paper addresses a limitation in sequential recommendation (SR) models that use convolutional filters: their fixed nature struggles to capture global interactions, often requiring supplementary self-attention mechanisms. The authors propose TV-Rec (Time-Variant Convolutional Filters for Sequential Recommendation), a model inspired by graph signal processing (GSP). TV-Rec utilizes time-variant graph filters to capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention, TV-Rec achieves higher expressive power, better modeling of complex user behavior, reduced computation, and accelerated inference. Extensive experiments on six public benchmarks demonstrate that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.
1.6. Original Source Link
The official source is https://arxiv.org/abs/2510.25259. This is a link to the arXiv preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies in the limitations of existing sequential recommendation (SR) models, particularly concerning their ability to capture dynamic user preferences and complex interaction patterns efficiently.
- Problem: While
convolutional filtersare effective at capturing local sequential patterns in SR, theirfixednature prevents them from adapting to diverse temporal variations across a user's interaction history. This often necessitates combining them withself-attention mechanisms(e.g., in Transformer-based models). However,self-attentionhas its own limitations: it lacks aninductive biastoward sequential structure, treating all positions pairwise without inherent local proximity, and is computationally expensive. This leads to a trade-off between expressiveness, locality, and efficiency. - Importance: Sequential recommendation is crucial in modern recommender systems because user preferences are dynamic and evolve over time. Accurately modeling these evolving preferences is essential for providing timely and relevant recommendations, which directly impacts user engagement and satisfaction.
- Challenges/Gaps:
Fixed convolutional filtersstruggle to captureglobal interactionsandposition-specific semanticsas user interests shift rapidly.Self-attentionis expressive butcomputationally inefficient(especially for long sequences) andinsensitive to locality, making it difficult to modelfine-grained, localized user behavior.- Hybrid models that combine both often inherit the computational burden of self-attention.
- Innovative Idea: The paper's entry point is to bridge this gap by proposing
Time-Variant Convolutional Filters, inspired bygraph signal processing (GSP). The core innovation is to replace both fixed kernels and self-attention with filters that canadapt their weights across the sequence(i.e.,time-variant) to emphasize the most relevant elements at each time step.
2.2. Main Contributions / Findings
The paper makes several significant contributions and presents key findings:
- Novel Architecture (
TV-Rec): The authors proposeTV-Rec, a new sequential recommendation model that utilizestime-variant convolutional filters. These filters are designed to effectively capture temporal dynamics and user behavior patterns by applying different filtering operations at each position in a sequence, eliminating the need forself-attention. This design is inspired bynode-variant graph filtersfromgraph signal processing (GSP). - Enhanced Expressiveness and Generalization: TV-Rec demonstrates superior expressive generalization. By allowing filters to be functions of time (position), the model can effectively capture both
long-term preferences(patterns at the beginning of a sequence) andrecent interests(patterns at the end of a sequence), which fixed filters often miss. - Improved Performance and Efficiency: Extensive experiments on six public benchmark datasets show that TV-Rec consistently outperforms state-of-the-art baseline methods, achieving an average accuracy improvement of 7.49%. Crucially, this performance gain comes with
reduced computationandaccelerated inferencecompared to self-attention-based and hybrid models, as the time-variant filters act as linear operators.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand TV-Rec, a grasp of several fundamental concepts from sequential recommendation, deep learning architectures, and graph signal processing is essential.
-
Sequential Recommendation (SR):
- Conceptual Definition: SR is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., item purchase, movie watch) based on their historical sequence of interactions. Unlike traditional collaborative filtering, SR explicitly considers the order and temporal dynamics of user behavior, recognizing that preferences evolve over time.
- Goal: Given a sequence of items a user has interacted with, , the goal is to predict the next item or recommend a top- list of potential next items.
-
Convolutional Filters (in Deep Learning):
- Conceptual Definition: In deep learning, especially in
Convolutional Neural Networks (CNNs), a convolutional filter (or kernel) is a small matrix of learnable weights that slides over an input (e.g., an image, a sequence) to detect local patterns. For sequential data,1D convolutional filtersare commonly used. - Mechanism: When applied to a sequence, a 1D filter performs a dot product with a local window of the input sequence. This operation is repeated across the entire sequence, generating a feature map that highlights where the learned local pattern occurs.
- Fixed Nature: Traditionally, these filters have
fixedweights that are applied uniformly across all positions in the input sequence. This makes them excellent at capturing local patterns but limits their adaptability to position-specific or temporally evolving semantics.
- Conceptual Definition: In deep learning, especially in
-
Self-Attention / Transformers:
- Conceptual Definition:
Self-attentionis a mechanism that allows a model to weigh the importance of different parts of an input sequence relative to a specific element in the sequence. It's a core component of theTransformerarchitecture, widely used inNatural Language Processing (NLP)and increasingly in sequential recommendation. - Mechanism (Simplified): For each element in a sequence, self-attention computes three vectors: a
Query (Q),Key (K), andValue (V). TheQueryof an element is matched against theKeysof all other elements (including itself) to computeattention scores. These scores are then used to weight theValuesof all elements, producing a new representation for the original element that incorporates information from the entire sequence. - Attention Calculation: The standard
scaled dot-product attentionis calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $- : Query matrix. Each row corresponds to a query vector for an element in the sequence.
- : Key matrix. Each row corresponds to a key vector for an element in the sequence.
- : Value matrix. Each row corresponds to a value vector for an element in the sequence.
- : Dimension of the key vectors, used for scaling to prevent vanishing gradients.
- : Normalizes the attention scores.
- Strengths: Excellent at capturing
long-range dependenciesand complex relationships between arbitrary positions in a sequence. - Limitations:
Lack of inductive bias: It doesn't inherently prioritize local proximity; all pairwise interactions are considered. Positional encodings are needed to introduce sequential order.Computational cost: The attention mechanism typically has a quadratic complexity with respect to sequence length (), making it expensive for very long sequences.
- Conceptual Definition:
-
Graph Signal Processing (GSP):
- Conceptual Definition: GSP extends traditional signal processing concepts (like Fourier analysis and filtering) to data defined on irregular domains, i.e., graphs. It treats data associated with nodes as signals and the graph structure as an operator that defines relationships between these signals.
- Graph Signal (): A vector where each component corresponds to a value associated with a node in the graph. For
TV-Rec, the sequence of item embeddings is treated as a graph signal, with each item (at a specific temporal position) being a node. - Shift Operator (): A matrix (often the
adjacency matrixorLaplacian matrixof the graph) that captures the underlying graph topology. It "shifts" the signal around the graph. In GSP, filters are often defined as polynomials of this shift operator. - Graph Filter (): An operation that processes a graph signal based on the graph structure. A
graph filteris defined as a polynomial of theshift operator S: $ \mathbf{y} = \mathbf{G x} = \sum_{k=0}^K h_k \mathbf{S}^k \mathbf{x} $- : The filtered graph signal.
- : The input graph signal.
- : Scalar
filter taps(coefficients) that determine the characteristics of the filter. - : The
shift operator(e.g., adjacency matrix). - : The
orderof the filter, determining how far information propagates.
- Graph Fourier Transform (GFT): Analogous to the Fourier Transform for Euclidean signals, the GFT decomposes a graph signal into different
frequency componentsdefined by the eigenvectors of theshift operator S. $ \mathbf{S} = \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}) \mathbf{U}^\top $- : Matrix of eigenvectors of . These eigenvectors form the
graph Fourier basis. - : Diagonal matrix of eigenvalues of .
- : The GFT of signal (signal in the frequency domain).
- : Inverse GFT.
- : Matrix of eigenvectors of . These eigenvectors form the
- Spectral Filtering: In the frequency domain, a graph filter acts as a diagonal matrix on the transformed signal:
$
\tilde{\mathbf{y}} = \tilde{\mathbf{G}} \tilde{\mathbf{x}} = \left( \sum_{k=0}^K h_k \mathrm{diag}(\boldsymbol{\lambda})^k \right) \tilde{\mathbf{x}}
$
- : Filtered signal in the frequency domain.
- : Frequency response of the filter, a diagonal matrix. This diagonal operation is computationally efficient.
-
Node-Variant Graph Filter:
- Conceptual Definition: Unlike conventional graph filters where a single set of
filter taps() is applied uniformly across all nodes, anode-variant graph filterappliesdistinct filter tapsat each node position. This means the filter characteristics can change depending on which node the signal is at. - Mechanism: Instead of scalar , it uses a vector for filter taps, where each component of corresponds to a specific node. This is represented by a diagonal matrix .
$
\mathbf{y} = \mathbf{G}{nv} \mathbf{x} = \left( \sum{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{S}^k \right) \mathbf{x}
$
- : A diagonal matrix where the diagonal entries are the elements of the vector . This allows each node to have its own value.
- Connection to TV-Rec: In
TV-Rec, where sequence positions are treated as nodes in a graph, anode-variant graph filterbecomes equivalent to atime-variant convolutional filter. This allows the filter to adapt to the specific temporal position within the sequence.
- Conceptual Definition: Unlike conventional graph filters where a single set of
-
Line Graph and Directed Cyclic Graph (DCG):
- Line Graph (for sequences): A natural way to represent a sequence as a graph is a
line graphwhere each item is a node and there's a directed edge from to . - Problem with Line Graph for GSP: A line graph's adjacency matrix (or shift operator) is typically not
diagonalizable(i.e., cannot be decomposed into for spectral filtering) because the first node has no incoming edges, leading to a row of zeros and a rank deficiency. - Directed Cyclic Graph (DCG): To enable spectral filtering,
TV-Recmodels the sequence as aDCG. This is essentially a line graph with an added edge connecting the last node () back to the first node (), making the adjacency matrixcirculantand thus diagonalizable. - Zero-Padding for Causality: To prevent information leakage (backward flow) due to the cyclic edge,
TV-Recuses azero-padding strategy. The input sequence is padded with zeros, and the filter order is limited such that the cyclic connection doesn't affect the original sequence's output, maintainingcausality.
- Line Graph (for sequences): A natural way to represent a sequence as a graph is a
-
Positional Encoding:
- Conceptual Definition: In Transformer models,
positional encodingsare added to item embeddings to inject information about the absolute or relative position of items in a sequence. Without them, self-attention would be permutation-invariant, losing sequence order. - Types: Common types include
sinusoidal positional encodings(fixed, based on sine/cosine functions) orlearnable positional embeddings. - TV-Rec's Approach:
TV-Recargues that explicit positional encodings are unnecessary because theGFT basis(derived from the DCG) inherently captures frequency-based positional variations, and thetime-variant filtersdynamically modulate these components based on position.
- Conceptual Definition: In Transformer models,
3.2. Previous Works
The paper contextualizes TV-Rec by discussing various existing sequential recommendation (SR) approaches.
-
Traditional SR Models:
Markov chains[24]: Simple models that assume the next item depends only on the previous one or a few recent ones.RNNs (Recurrent Neural Networks)[13]: Models likeGRU4Recuse recurrent units (e.g., GRU) to capture temporal dependencies by maintaining a hidden state that processes sequences step-by-step. They are good for short-term patterns but struggle with long-term dependencies.GNNs (Graph Neural Networks)[34]: Some SR models (e.g.,SR-GNN[33],GC-SAN[35],GCL4SR[43]) use GNNs to model item-transition graphs, where items are nodes and transitions are edges, to capture complex item relationships.
-
Transformer-based Models:
SASRec[17]: A seminal work that adapted the Transformer architecture for SR, usingself-attentionto capture long-range dependencies in user sequences. It treats each user's sequence as a sentence.BERT4Rec[29]: Inspired by BERT in NLP, it uses a bidirectional Transformer and a masked item prediction task for training, allowing it to learn contextual embeddings for items.
-
Convolution-only Models:
Caser[30]: One of the first to apply CNNs to SR, treating user-item interactions as images and using horizontal and vertical convolutions to capture sequential patterns.NextItNet[40]: Utilizesdilated 1D convolutionsand residual connections to capture both short- and long-range dependencies more efficiently than standard convolutions.FMLPRec[45]: EmploysFourier Transformsand learnable filters within anall-MLParchitecture to filter noise and enhance sequence representations in the frequency domain.
-
Hybrid (Convolution + Self-Attention) Models:
AdaMCT[16]: Combines1D convolutionswithTransformer attentionto capture both short-term (convolution) and long-term (attention) user preferences.BSARec[28]: Addresses limitations of self-attention by applying convolution using theFourier Transform, aiming to introduce locality bias while retaining the benefits of attention.
-
Other Recent Advanced Methods:
DuoRec[23]: Employscontrastive learningwith model-level augmentation to learn better sequence representations.LRURec[41]: Exploreslinear recurrent unitsto balance efficiency and performance, especially for long sequences, by using principles fromState Space Models.
3.3. Technological Evolution
The field of sequential recommendation has evolved from simpler statistical models to complex deep learning architectures:
- Early Models (Statistical/Rule-based): Initially, methods like
Markov chainswere dominant, focusing on direct transitions between items. These were simple but often limited in capturing complex, long-term patterns. - Recurrent Neural Networks (RNNs): The introduction of
RNNs(and their variants like GRUs and LSTMs) marked a shift towards models capable of handling variable-length sequences and learning more intricate temporal dependencies.GRU4Recis a prime example. - Convolutional Neural Networks (CNNs): CNNs, though initially for computer vision, were adapted for sequences (
Caser,NextItNet) to capture local patterns efficiently. Their fixed filters, however, limited their global understanding. - Transformers and Self-Attention: The
Transformerarchitecture revolutionized sequence modeling due to its ability to model long-range dependencies effectively throughself-attention(SASRec,BERT4Rec). This became the new state-of-the-art, but introduced computational overhead and a lack of inherent locality bias. - Hybrid Models: To mitigate the shortcomings of pure Transformer or pure CNN models,
hybrid approachesemerged (AdaMCT,BSARec) combining the strengths of both: convolutions for locality and attention for global context. However, these often retain the computational cost of self-attention. - Efficiency-Focused and GSP-inspired Models: More recently, research has focused on improving efficiency for long sequences (
LRURec) or finding alternative mechanisms to attention.TV-Recfalls into this category by reinterpreting sequential recommendation through the lens ofGraph Signal Processing (GSP), specifically usingtime-variant convolutional filtersto simultaneously address locality, long-range dependencies, and computational efficiency without relying onself-attention. This marks a significant departure from attention-centric designs towards more flexible and adaptive filtering mechanisms.
3.4. Differentiation Analysis
TV-Rec differentiates itself from existing methods by fundamentally altering how sequential patterns are captured, moving away from both fixed filters and self-attention.
-
Compared to Fixed Convolutional Filters (e.g., Caser, NextItNet, FMLPRec):
- Core Difference: Traditional convolutional filters (
fixed filters) apply the same weights across all positions in a sequence.TV-Recusestime-variant convolutional filters(node-variant in GSP terms), meaning the filter coefficientsadaptfor each position in the sequence. - Innovation: This adaptability allows
TV-Recto captureposition-specific semanticsandtemporally evolving preferences. While fixed filters might excel at recent items,TV-Reccan learn patterns relevant to both early and late stages of a sequence, providing a more comprehensive understanding of user behavior. The paper explicitly states that 1D CNN inAdaMCTcorresponds to a fixed graph convolutional filter, andFMLPRecapplies a discrete Fourier transform which can be seen as a fixed filter in the spectral domain.TV-Recgeneralizes these by making the filter time-variant.
- Core Difference: Traditional convolutional filters (
-
Compared to Self-Attention / Transformer-based Models (e.g., SASRec, BERT4Rec, DuoRec):
- Core Difference:
Transformer-based modelsrely onself-attentionto capture long-range dependencies, which has quadratic computational complexity () and lacks an inherent inductive bias towards sequential order, requiringpositional embeddings.TV-Reccompletely replacesself-attention. - Innovation:
TV-Recachieves high expressive power and captures complex interaction patternswithout self-attention, leading toreduced computationandaccelerated inference( for training, but after precomputation, which can be more efficient for many scenarios than dependency). It also inherently encodes positional information through itsGraph Fourier Transform (GFT)on aDirected Cyclic Graph (DCG), negating the need for explicitpositional embeddings.
- Core Difference:
-
Compared to Hybrid Models (e.g., AdaMCT, BSARec):
- Core Difference: Hybrid models attempt to combine the strengths of convolutions and self-attention, but often still carry the computational burden and inductive bias limitations of
self-attention. - Innovation:
TV-Recoffers a unified approach where thetime-variant filteralone is powerful enough to capture both local and global patterns, makingself-attentionredundant. This simplifies the architecture and improves efficiency.BSARec, another hybrid, applies convolution using Fourier Transform, butTV-Rec's filters are dynamic per position, making them more general.
- Core Difference: Hybrid models attempt to combine the strengths of convolutions and self-attention, but often still carry the computational burden and inductive bias limitations of
-
Compared to GNN-based Methods (e.g., SR-GNN, GC-SAN, GCL4SR):
-
Core Difference (1 - Graph Definition):
GNN-based modelstypically define graphs whereitemsare nodes, and edges represent transitions or relationships between items, often merging identical items.TV-Recreinterprets the sequence as aline graph(orDCG) wheresequence positionsare nodes, allowing repeated items at different temporal positions to be treated distinctly. -
Innovation (1): This distinction allows
TV-Recto capturefine-grained temporal and positional dependenciesthat are lost when identical items are merged in traditional item-transition graphs. -
Core Difference (2 - Information Propagation):
GNN-based modelsrely on iterativemessage passingto update node embeddings, which can be computationally intensive.TV-Recemploystime-variant graph convolutional filtersthat operate directly in thespectral domain. -
Innovation (2): This
spectral filteringapproach removes the need for heavymessage passing, yielding a more efficient yet expressive representation of temporal dynamics.TV-Recpresents itself as a "non-GNN graph-based paradigm" for SR.In essence,
TV-Recoffers a more unified, flexible, and efficient solution by deeply integratinggraph signal processingconcepts to create context-aware, adaptive convolutional filters that transcend the limitations of both fixed convolutions and self-attention.
-
4. Methodology
The proposed TV-Rec model introduces time-variant convolutional filters for sequential recommendation, drawing inspiration from graph signal processing (GSP). The core idea is to apply position-dependent filters to user sequences, allowing the model to adapt to varying temporal dynamics and capture both local and global patterns without relying on self-attention.
4.1. Principles
The fundamental principle behind TV-Rec is to treat a user's interaction sequence as a graph signal on a line graph, which is then processed by node-variant graph filters. These filters are "time-variant" because each position (node) in the sequence receives a distinct filter. This enables the model to:
-
Adapt to Temporal Variations: Different stages of a sequence (e.g., early interactions vs. recent ones) often carry different semantic information. A time-variant filter can apply specific weights at each position, rather than a uniform one, to emphasize relevant information for that specific time point.
-
Capture both Local and Global Patterns: By leveraging the frequency domain via the
Graph Fourier Transform (GFT), the filters can capture both short-term dependencies (high frequencies) and long-term preferences (low frequencies). The time-variant nature allows this capture to be position-aware. -
Efficiency and Expressiveness: By operating as linear operators in the spectral domain, these filters aim to achieve the expressiveness typically associated with self-attention, but with improved computational efficiency, especially during inference.
The methodology reinterprets the sequential nature of user behavior as a
Directed Cyclic Graph (DCG)to enable spectral filtering, overcoming the non-diagonalizability issue of a simple line graph while carefully preventing backward information flow through zero-padding.
4.2. Core Methodology In-depth (Layer by Layer)
The TV-Rec architecture consists of three main modules: an embedding layer, a time-variant encoder, and a prediction layer.
The overall architecture of the proposed TV-Rec model is illustrated in Figure 2. It shows the flow from the input sequence through the embedding layer, multiple Time-Variant Encoding Blocks, and finally to the prediction layer. Each encoding block comprises a Filter Layer and a Feed Forward Network (FFN), both followed by residual connections and layer normalization. The critical part is the Filter Layer where the input is transformed to the frequency domain using GFT and processed by the time-variant convolutional filter.
该图像是图表,展示了论文中图3关于基向量数量对模型性能的敏感性分析,横轴为基向量数量,左纵轴对应NDCG@20,右纵轴对应HR@20,柱状图表示NDCG@20,折线图表示HR@20。
The following figure (Figure 2 from the original paper) shows the architecture of our proposed TV-Rec.
4.2.1. Embedding Layer
First, a user's historical interaction sequence is prepared for processing:
-
Sequence Truncation/Padding: The raw user sequence is processed to a fixed length .
- If , the sequence is truncated to keep the most recent items.
- If , the sequence is padded with zeros at the beginning.
- For simplicity, the user superscript
(u)is omitted, so the processed sequence is denoted as .
-
Item Embedding: Each item in the sequence is converted into a continuous vector representation using an
item embedding matrix, where is the total number of unique items and is thelatent dimension size. Alook-up operationretrieves the embedding for each item . -
Normalization and Dropout: The sequence of item embeddings is then passed through
Layer NormalizationandDropout. This produces the initial embedding representation of the user sequence, , which serves as the input to the time-variant encoder.The process is formally defined as: $ \mathbf{X}^0 = \operatorname{Dropout}(\operatorname{LayerNorm}([\mathbf{E}{s_1}, \mathbf{E}{s_2}, \ldots, \mathbf{E}_{s_N}]^\top)) $
-
: The initial sequence embedding, where is the sequence length and is the embedding dimension. Each row corresponds to the embedding of item .
-
: The embedding vector for item looked up from the
item embedding matrix. -
: Applies
layer normalization, which normalizes the inputs across the features for each sample independently, helping to stabilize training. -
: Applies
dropout, a regularization technique that randomly sets a fraction of input units to zero at each update during training, preventing overfitting.Note on Positional Embedding: A key design choice is that
positional embeddingis not necessary inTV-Rec. This is because thetime-variant convolutional filtersinherently capture position-specific information through thespectral propertiesof the graph, as justified in Appendix C of the paper.
4.2.2. Time-Variant Encoder
The time-variant encoder is built by stacking identical time-variant encoding blocks. Each block processes the sequence embedding through a filter layer and a feed-forward network, incorporating residual connections and layer normalization after each sub-layer.
Filter Layer
In the -th filter layer, the input is . This layer performs the core time-variant filtering operation.
-
Transform to Frequency Domain: The input is first transformed into the
frequency domainusing theGraph Fourier Transform (GFT). $ \widetilde{\mathbf{X}}^\ell = \mathbf{U}^\top \mathbf{X}^\ell $- : The input sequence embedding in the frequency domain. It's complex-valued because the GFT basis (eigenvectors) can be complex.
- : The
GFT matrix. This matrix is derived from apadded Directed Cyclic Graph (DCG), which is used instead of a simpleline graphto ensure diagonalizability and enable spectral filtering. Its columns are the eigenvectors of theshift operatorfor the DCG. is its conjugate transpose (or Hermitian transpose). - This transformation decomposes the signal into its frequency components, allowing the filter to operate spectrally. The
DCGandpadding strategyare crucial for this step, as explained in Appendix B.
-
Apply Time-Variant Convolutional Filter: The core filtering operation happens in the frequency domain. It uses a
time-variant convolutional filter, which is based on thenode-variant graph filterconcept. $ \widehat{\mathbf{X}}^\ell = \mathbf{G}_{nv} \widetilde{\mathbf{X}} = \left( \mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top) \right) \widetilde{\mathbf{X}}^\ell = \left( \mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top) \right) \mathbf{U}^\top \mathbf{X}^\ell $- : The filtered sequence embedding.
- : The
node-variant graph filteroperator. - : The
GFT matrix(as described above). It's used here for the inverse GFT effectively. - : Denotes element-wise multiplication (Hadamard product) between matrices.
- : The
filter tap matrix. Each row of corresponds to a specific node (position) in the sequence, and its columns contain thefilter tapsfor different orders . This is the "time-variant" aspect, as each position has its own filter taps. - : The
Vandermonde matrixformed from the eigenvalues of the shift operator. Its elements are , where are the eigenvalues of the shift operator. This matrix captures the frequency response of each power of the shift operator. - The term effectively computes the frequency response of the
node-variant filterfor each node. The element-wise product with and multiplication with perform the spectral filtering and then implicitly transform back to the spatial domain (or a filtered representation in the frequency domain that will be used for further processing, effectively). - The term is the GFT, taking to the frequency domain.
-
Constructing the Filter Matrix : To enhance expressive power, the
filter matrixis constructed as a product of two matrices: $ \mathbf{H} = \mathbf{C} \bar{\mathbf{B}} = \mathbf{C} \left( \frac{\mathbf{B}}{|\mathbf{B}|_2} \right) $- : The
coefficient matrix. This matrix generatesposition-specific filters. Since each node corresponds to a position in the sequence, can be interpreted as a function of time, determining how the basic filter components are combined for each position. It's a learnable component. - : A
normalized basis matrix. This matrix containsbasis vectors, where determines the number of base filter patterns. These basis vectors are common across all positions, and learns to combine them differently for each position. - : The unnormalized
basis matrix. - : The L2 norm, used to
normalizealong each row for numerical stability. - By forming this way, the model learns a set of basic filter patterns () and then learns how to combine them differently at each sequence position (), achieving the "time-variant" effect.
- : The
-
Residual Connection and Normalization: After the filtering operation, a
residual connection(adding the input ) is applied, followed byDropoutandLayer Normalization. This helps in preventingoverfittingand stabilizing the training of deep networks. $ \mathbf{F}^\ell = \mathrm{LayerNorm}(\mathbf{X}^\ell + \mathrm{Dropout}(\widehat{\mathbf{X}}^\ell)) $- : The output of the filter layer.
Feed Forward Layer
Following the filter layer, a feed-forward network (FFN) is applied to introduce non-linearity and allow the model to learn more complex transformations.
$
\widehat{\mathbf{F}}^\ell = \mathrm{FFN}(\mathbf{F}^\ell) = (\mathrm{GELU}(\mathbf{F}^\ell \mathbf{W}_1^\ell + \mathbf{b}_1^\ell)) \mathbf{W}_2^\ell + \mathbf{b}_2^\ell
$
-
: Output of the FFN.
-
: Input to the FFN (output of the filter layer).
-
:
Gaussian Error Linear Unitactivation function, which introduces non-linearity. -
: Learnable weight matrices for the linear transformations.
-
: Learnable bias vectors.
Similar to the filter layer, a
dropout layer,residual connection, andlayer normalizationare applied to the FFN output to produce the final output of the -th block: $ \mathbf{X}^{\ell+1} = \mathrm{LayerNorm}(\mathbf{F}^\ell + \mathrm{Dropout}(\widehat{\mathbf{F}}^\ell)) $ -
: The output of the -th time-variant encoding block, which becomes the input for the next block or the final representation if it's the last block.
4.2.3. Prediction Layer and Training
Prediction Layer
After passing through time-variant encoding blocks, the final sequence representation is . The last item's embedding in this sequence, (the embedding corresponding to the most recent interaction), is typically used to predict the next item. The model computes a preference score for each item in the entire item set .
$
\hat{y}v = p(v{|S|+1} = v | S) = \mathbf{E}_v^\top \mathbf{X}_N^L
$
- : The predicted score (or probability) that item is the next item in the sequence.
- : The embedding of item from the shared
item embedding matrix. - : The embedding of the -th (most recent) item in the final processed sequence . (Note: The paper implies this is the last item in the sequence, usually if is ).
Model Training
The model is optimized using a combination of cross-entropy loss and an orthogonal regularization term.
$ \mathcal{L} = \underbrace{-\log \frac{\exp(\hat{y}g)}{\sum{v \in \mathcal{V}} \exp(\hat{y}v)}}{\mathcal{L}{\mathrm{ce}}} + \alpha \cdot \underbrace{\left( \left|\mathbf{B}{\mathrm{real}} \mathbf{B}{\mathrm{real}}^\top - \mathbf{I}\right|F^2 + \left|\mathbf{B}{\mathrm{imag}} \mathbf{B}{\mathrm{imag}}^\top - \mathbf{I}\right|F^2 \right)}{\mathcal{L}_{\mathrm{ortho}}} $
- : The total loss function to be minimized.
- : The
cross-entropy loss, a common loss function for classification tasks. It measures the discrepancy between the predicted probability distribution and the true distribution (one-hot encoding for the ground truth item).- : The predicted score for the
ground-truth item. - : The sum of exponentiated scores over all items in the vocabulary, used for normalization in the
softmaxfunction.
- : The predicted score for the
- : An
orthogonal regularization termapplied to thebasis matrix(used in constructing ). This term encourages the basis vectors to beorthogonal, promoting diversity and stability in the learned filter patterns.- : A scalar
hyperparameterthat controls theregularization strength. - : The real component of the complex-valued
basis matrix. - : The imaginary component of the complex-valued
basis matrix. - : The
identity matrix. - : The squared
Frobenius norm, which measures the "size" of a matrix. The term evaluates how close is to being orthogonal (i.e., ).
- : A scalar
- The model is optimized using the
Adam optimizer.
4.2.4. Appendix A: Proof of Frequency Response of Node-Variant Graph Filters
The supplementary material provides a detailed proof for the frequency response of the node-variant graph filter . This is crucial for understanding how the filter operates in the spectral domain.
Proof Derivation:
-
Definition of Node-Variant Graph Filter: The output graph signal from an input signal using a node-variant graph filter of order is defined as: $ \mathbf{y} = \mathbf{G}{nv} \mathbf{x} = \left( \sum{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{S}^k \right) \mathbf{x} $
- : Input graph signal.
- : Shift operator.
- : A diagonal matrix where is a vector of filter taps, allowing each node to have its own tap.
- : Filter order.
-
GFT and Eigen-Decomposition: The
shift operatoris eigen-decomposed as . TheGFTof a signal is . Similarly, . -
Applying GFT to Filtered Output: Substitute and into the equation for : $ \tilde{\mathbf{y}} = \mathbf{U}^\top \mathbf{y} = \mathbf{U}^\top \left( \sum_{k=0}^K \mathrm{diag}(\mathbf{h}k) \mathbf{S}^k \right) \mathbf{x} $ $ = \mathbf{U}^\top \sum{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \tilde{\mathbf{x}} $
- : Vector where each component is the -th power of the corresponding eigenvalue, i.e., .
- The
filter tapsfor all orders can be grouped into a matrix , where the -th column is . - A
Vandermonde matrixis defined where . (Note: The paper text uses but the derivation uses , suggesting in the definition, or a slight mismatch in index. Following the derivation, it implicitly uses for the -th power of eigenvalues).
-
Term-by-Term Analysis of Summation: Consider the
ij-th element of the summation term : $ \left[ \sum_{k=0}^K \mathrm{diag}(\mathbf{h}k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \right]{ij} = \sum_{k=0}^K h_k^{(i)} \lambda_j^k u_{ij} $ $ = u_{ij} \sum_{k=0}^K h_k^{(i)} \lambda_j^k $- : The -th component of the vector , representing the -th row and -th column of .
- : The -th eigenvalue raised to the power .
- : The element at row , column of the
GFT matrix.
-
Relating to : The
ij-th element of the product is given by: $ [\mathbf{H} \pmb{\Lambda}^\top]{ij} = \sum{k=0}^K h_k^{(i)} \lambda_j^k $ (This assumes for the column , which aligns with the derivation and is a standard way to define the matrix of eigenvalue powers.) -
Substitution and Final Form: Substituting this back into the term-by-term analysis: $ \left[ \sum_{k=0}^K \mathrm{diag}(\mathbf{h}k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \right]{ij} = [\mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top)]{ij} $ Therefore, the frequency response of the
node-variant graph filteris: $ \tilde{\mathbf{y}} = \tilde{\mathbf{G}}{nv} \tilde{\mathbf{x}} = \mathbf{U}^\top (\mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top)) \tilde{\mathbf{x}} $ This formula shows how the filter operates on the signal in the frequency domain, effectively modulating the frequency components based on the position-specific filter taps and the graph's eigenvalues .
4.2.5. Appendix B: Theoretical Justification for DCG-Based Filtering
This section justifies the use of a zero-padded Directed Cyclic Graph (DCG) for spectral filtering instead of a simple line graph, ensuring no backward information flow.
-
Problem with Line Graph:
- For
spectral filtering, theGraph Fourier Transform (GFT)requires theshift operator(e.g., adjacency matrix) to bediagonalizable. - A
line graph(representing a sequence ) has an adjacency matrix with rankN-1. The first node has no incoming edges, making its corresponding row in the adjacency matrix entirely zero. This makes the matrixdefectiveandnon-diagonalizable, hindering spectral analysis.
- For
-
Solution: Directed Cyclic Graph (DCG):
- To overcome this, the sequence is modeled as a
DCG. This is achieved by adding a single edge from the last node () to the first node () of the line graph. - This addition makes the adjacency matrix
circulant(each row is a cyclic shift of the previous row).Circulant matricesare alwaysdiagonalizable, enabling spectral filtering in the Fourier domain.
- To overcome this, the sequence is modeled as a
-
Preventing Backward Information Flow (Causality):
- The cyclic edge in a
DCGinherently introduces a backward connection (from future to past), which is undesirable in sequential recommendation as it would cause information leakage. - To prevent this, a
padding strategyis employed. The input sequence iszero-paddedwith zeros to form an extended vector : $ \tilde{\mathbf{x}} = \left[ \begin{array}{c} \mathbf{x} \ \mathbf{0} \end{array} \right] \in \mathbb{R}^{N+K} $- : The original input sequence (vector of length ).
- : A zero vector of length .
- : The padded input sequence (vector of length ).
- A
circulant shift operatoris defined for this extended DCG. - A
spectral filterof order is then applied: $ g(\mathbf{S}) = \sum_{k=0}^K h_k \mathbf{S}^k $- : Filter coefficients.
- The filtered output is computed by applying the filter to the padded input: $ \tilde{\mathbf{y}} = g(\mathbf{S}) \tilde{\mathbf{x}} $
- Extracting Causal Output: The final causal output (equivalent to filtering on the original line graph) is obtained by taking only the first elements of :
$
\mathbf{y} = [\tilde{\mathbf{y}}]_{1:N}
$
- The equivalence holds because the last entries of are zeros. Since the filter involves powers of up to order , these zero entries ensure that the cyclic shifts (which would bring "future" information to the "past" if the filter order was higher than the padding) do not influence the first output values. This effectively blocks any
cyclic information flowthat would lead tobackward leakage.
- The equivalence holds because the last entries of are zeros. Since the filter involves powers of up to order , these zero entries ensure that the cyclic shifts (which would bring "future" information to the "past" if the filter order was higher than the padding) do not influence the first output values. This effectively blocks any
- In summary,
zero-padded DCGallows forspectral filtering(due to diagonalizability) while preserving thecausal structureof sequential data, enabling efficient and accurate modeling of user interactions.
- The cyclic edge in a
4.2.6. Appendix C: Equivalence of Positional Encoding and Graph Fourier Basis
This section explains why TV-Rec does not require explicit positional encoding, unlike Transformer models, due to the inherent positional information captured by its Graph Fourier Basis.
-
Sinusoidal Positional Encoding in Transformer:
Transformermodels usesinusoidal functionsto encode absolute positionposwithin a sequence.- For a given position
posand embedding dimension index2ior : $ \mathrm{PE}{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $- : Positional Encoding.
pos: The absolute position in the sequence.- : Index for the dimension of the embedding.
- : Total dimension of the embedding.
- This generates a set of periodic signals with varying frequencies, forming a basis for encoding positional variations.
-
Graph Fourier Basis in TV-Rec:
- In
TV-Rec, theshift operatorfor theDCG(which is a circulant matrix) is used. The eigenvectors of a circulant matrix are precisely theDiscrete Fourier Transform (DFT) basisvectors. - The elements of the
GFT matrixfor a circulant shift operator are: $ \mathbf{U}_{kn} = \frac{1}{\sqrt{N}} e^{-i2\pi kn/N} = \frac{1}{\sqrt{N}} \left[ \cos\left(\frac{2\pi kn}{N}\right) - i \sin\left(\frac{2\pi kn}{N}\right) \right] $- : Index of the eigenvector (frequency component).
- : Index of the node (position).
- : Sequence length.
- This formula shows that the
GFT basisis composed oftrigonometric functions(sines and cosines) that inherently capture periodic variations across positions.
- In
-
Equivalence in Representational Capacity:
- Both
Transformer's sinusoidal positional encodingsandTV-Rec's GFT basis(which is the DFT basis) are sets oftrigonometric functions. - While the frequencies might be sampled differently (logarithmic scale for Transformer vs. linear for GFT), both sets of basis functions effectively
span the same space of length N periodic signals. - The
real-valued DFT basisis directly composed of and for various . - This means
TV-Recimplicitly encodes positional information through its spectral decomposition.
- Both
-
Implication for Positional Encoding:
TV-Recprojects the input sequence into theGFT domain() and processes it. This process inherently capturesfrequency-based positional variations.- Furthermore, the
time-variant filteracts as aposition-specific operation, dynamically modulating these frequency components per position. - Consequently, the combination of
GFTandtime-variant filtersnaturally handles positional information, rendering explicitpositional embeddingsunnecessary. Ablation studies confirm this, showing no performance benefit from adding them.
5. Experimental Setup
5.1. Datasets
The authors evaluate TV-Rec on six publicly available benchmark datasets, selected to cover varying sparsity and domains. The preprocessing procedures from previous works [44, 45] are followed, treating all reviews and ratings as implicit feedback.
The following are the results from Table 6 of the original paper:
| # Users | # Items | # Interactions | Avg. Length | Sparsity | |
|---|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 8.9 | 99.93% |
| Sports | 25,598 | 18,357 | 296,337 | 8.3 | 99.95% |
| Yelp | 30,431 | 20,033 | 316,354 | 10.4 | 99.95% |
| LastFM | 1,090 | 3,646 | 52,551 | 48.2 | 98.68% |
| ML-1M | 6,041 | 3,417 | 999,611 | 165.5 | 95.16% |
| Foursquare | 1,083 | 9,989 | 179,468 | 165.7 | 98.34% |
Here's a description of each dataset:
-
Beauty: An Amazon product review dataset (from [22]) belonging to the "Beauty" category. It has a moderate number of users and items, with relatively short average sequence lengths (8.9).
-
Sports: Another Amazon product review dataset (from [22]), specifically the "Sports and Outdoors" category. Similar to Beauty in terms of user/item count and average length (8.3), indicating sparse interactions.
-
Yelp: A large dataset from Yelp, containing business recommendations. Interactions after 2019/01/01 are used due to its size. It has a higher number of users, items, and interactions, but still a short average sequence length (10.4).
-
LastFM: Contains artist listening records, used for recommending musicians. This dataset has fewer users and items but a significantly longer average interaction length (48.2) compared to the Amazon and Yelp datasets, and lower sparsity.
-
ML-1M (MovieLens-1M): A classic movie recommendation dataset [11] known for its detailed user interaction data. It features a moderate number of users and items but a very long average interaction length (165.5), and the lowest sparsity among the datasets.
-
Foursquare: Provides user check-ins across New York City over 10 months. Similar to ML-1M, it has a long average interaction length (165.7), but with fewer users and more items.
These datasets were chosen because they represent diverse domains (e-commerce, social media, entertainment) and varying interaction characteristics (short vs. long sequences, different sparsity levels). This diversity allows for a comprehensive evaluation of the model's robustness and generalization capabilities across different sequential recommendation scenarios.
5.2. Evaluation Metrics
To evaluate the recommendation performance, the paper uses two widely adopted Top- metrics: HR@r (Hit Rate) and NDCG@r (Normalized Discounted Cumulative Gain), with set to 5, 10, and 20. For fair comparison, ranking results are examined across the entire item set without negative sampling [19].
-
Hit Rate (HR@r):
- Conceptual Definition: Hit Rate at (HR@r) measures the proportion of test cases (user sequences) for which the ground truth item (the next item the user actually interacted with) is present within the top- recommended items. It essentially answers: "How often was the correct item among the top recommendations?" HR is a recall-oriented metric.
- Mathematical Formula: $ \mathrm{HR@r} = \frac{\text{Number of hits @r}}{\text{Total number of test cases}} $
- Symbol Explanation:
Number of hits @r: The count of user test sequences where the ground truth item is found in the top- recommended list generated by the model.Total number of test cases: The total number of user test sequences for which a recommendation was attempted.
-
Normalized Discounted Cumulative Gain (NDCG@r):
- Conceptual Definition: Normalized Discounted Cumulative Gain at (NDCG@r) is a metric that evaluates the quality of a ranked list of recommendations, taking into account both the relevance of items and their position in the list. It assigns higher scores to more relevant items that appear earlier in the ranked list. It's a measure of ranking quality.
- Mathematical Formula: $ \mathrm{NDCG@r} = \frac{\mathrm{DCG@r}}{\mathrm{IDCG@r}} $ where $ \mathrm{DCG@r} = \sum_{i=1}^{r} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $ \mathrm{IDCG@r} = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i+1)} $
- Symbol Explanation:
- (Discounted Cumulative Gain): Measures the relevance of items in the top- positions, penalizing relevant items that appear lower in the list.
- (Ideal Discounted Cumulative Gain): The maximum possible DCG@r, achieved by an ideal ranking where all relevant items are placed at the top of the list in decreasing order of relevance. This normalizes DCG to a value between 0 and 1.
- : The rank position of an item in the recommendation list (from 1 to ).
- : The relevance score of the item at position . In sequential recommendation, for a single ground truth item, is typically 1 if the item at rank is the ground truth item, and 0 otherwise. In this scenario, if .
- : The number of relevant items in the ideal ranking (for a single ground truth item, this is 1).
- : The logarithmic discount factor, which reduces the importance of items at lower ranks.
5.3. Baselines
TV-Rec is compared against a comprehensive set of ten state-of-the-art sequential recommendation baselines, including conventional and recent hybrid models, and three GNN-based models. These baselines are chosen for their representativeness of different architectural paradigms in SR.
Sequential Recommendation Baselines:
- Caser [30]: A
CNN-basedmodel that uses horizontal and vertical convolutions to capture sequential patterns, treating interactions like images. - GRU4Rec [13]: An
RNN-basedmodel that employs Gated Recurrent Units (GRUs) to capture temporal dynamics in user interaction sequences. - SASRec [17]: A
Transformer-basedmodel that uses a multi-headself-attentionmechanism to capture long-range dependencies. - BERT4Rec [29]: A
bidirectional Transformer-basedmodel that uses a masked item prediction task for training, similar to BERT in NLP. - NextItNet [40]: A
CNN-basedmodel that utilizesdilated convolutionsand residual connections to efficiently capture both short- and long-range dependencies. - FMLPRec [45]: An
all-MLParchitecture that incorporatesFourier Transformand learnable filters to reduce noise and enhance sequence representations in the frequency domain. - DuoRec [23]: A
Transformer-basedmodel that usescontrastive learningwith model-level augmentation and semantic positive samples to learn robust sequence representations. - LRURec [41]: A model based on
linear recurrent unitsdesigned for rapid inference and parallelization, particularly effective for long sequences. - AdaMCT [16]: A
hybrid modelthat combinesTransformer attentionwith local1D convolutional filtersto capture both long-term and short-term user preferences. - BSARec [28]: A
hybrid modelthat combinesTransformer self-attentionwith theFourier Transformto address theoversmoothing problemand introduce locality.
GNN-based Sequential Recommendation Baselines:
- SR-GNN [33]: A
GNN-basedmodel that converts user sessions into graphs and applies GNNs to capture item transition relationships. - GC-SAN [35]: A
GNN-basedmodel that dynamically builds a graph for each sequence and combines GNNs withself-attentionto model both local and long-range item dependencies. - GCL4SR [43]: A
GNN-basedmodel that constructs a global item transition graph across all users and usesgraph contrastive learningto integrate global and local context.
5.4. Implementation Details
All experiments were conducted on a consistent hardware and software setup:
- Hardware: Dual
INTEL XEONCPUs and anNVIDIA RTX A6000 GPU. - Software:
UBUNTU 20.04.6 LTS,PYthon 3.9.7,PYTORCH 2.2.2,CUDA 11.1.74, andNVIDIA Driver 550.54.14.
Hyperparameters for Standard Sequential Recommendation:
-
Maximum Sequence Length (): Set to 50 for standard experiments. For long-range dependencies (ML-1M and Foursquare), was set to 200.
-
Learning Rates: Tuned from .
-
Orthogonal Regularization Coefficient (): Tuned from .
-
Dropout Rates (): Tuned from .
-
Number of Basis Vectors (): Tuned from .
-
Order of Time-Variant Convolutional Filter (): Set equal to the maximum sequence length (i.e., 50 for standard).
-
Batch Size: 256.
-
Latent Dimension Size (): 64.
-
Number of Time-Variant Blocks (): 2.
-
Optimizer:
Adam.The following are the results from Table 7 of the original paper:
Dataset Learning Rate α p m L = 50 Beauty 5 × 10−4 0 0.5 32 Sports 5 × 10−4 0 0.5 16 Yelp 5 × 10−4 1 × 10-3 0.1 16 LastFM 1 × 10−3 1 × 10-3 0.4 8 ML-1M 1 × 10−3 1× 10-5 0.3 8 Foursquare 5 × 10−4 1 × 10-5 0.2 8 L = 200 ML-1M 1 × 10−3 0 0.1 16 Foursquare 5 × 10−4 1 × 10−5 0.1 8
6. Results & Analysis
6.1. Core Results Analysis
Overall Performance
The paper demonstrates that TV-Rec consistently outperforms all baseline methods across six diverse datasets.
The following are the results from Table 2 of the original paper:
| Datasets | Metric | Caser | GRU4Rec | SASRec | BERT4Rec | NextItNet | FMLPRec | DuoRec | LRURec | AdaMCT | BSARec | TV-Rec | Improv. |
| Beauty | HR@5 | 0.0149 | 0.0170 | 0.0368 | 0.0491 | 0.0549 | 0.0423 | 0.0680 | 0.0648 | 0.0675 | 0.0714 | 0.0721 | 0.98% |
| HR@10 | 0.0253 | 0.0307 | 0.0574 | 0.0742 | 0.0779 | 0.0639 | 0.0944 | 0.0889 | 0.0925 | 0.1017 | 0.1045 | 2.73% | |
| HR@20 | 0.0416 | 0.0499 | 0.0860 | 0.1079 | 0.1100 | 0.0949 | 0.1279 | 0.1197 | 0.1299 | 0.1393 | 0.1403 | 0.72% | |
| NDCG@5 | 0.0089 | 0.0105 | 0.0241 | 0.0318 | 0.0392 | 0.0272 | 0.0485 | 0.0472 | 0.0489 | 0.0501 | 0.0513 | 2.40% | |
| NDCG@10 | 0.0122 | 0.0149 | 0.0307 | 0.0399 | 0.0467 | 0.0341 | 0.0570 | 0.0549 | 0.0569 | 0.0590 | 0.0608 | 3.05% | |
| NDCG@20 | 0.0164 | 0.0198 | 0.0379 | 0.0484 | 0.0547 | 0.0419 | 0.0654 | 0.0627 | 0.0664 | 0.0691 | 0.0705 | 2.03% | |
| Sports | HR@5 | 0.0091 | 0.0131 | 0.0215 | 0.0279 | 0.0311 | 0.0222 | 0.0390 | 0.0351 | 0.0386 | 0.0390 | 0.0431 | 2.13% |
| HR@10 | 0.0147 | 0.0211 | 0.0319 | 0.0434 | 0.0458 | 0.0358 | 0.0549 | 0.0502 | 0.0544 | 0.0623 | 0.0635 | 1.93% | |
| HR@20 | 0.0253 | 0.0347 | 0.0485 | 0.0658 | 0.0682 | 0.0549 | 0.0779 | 0.0698 | 0.0769 | 0.0865 | 0.0880 | 1.73% | |
| NDCG@5 | 0.0064 | 0.0084 | 0.0142 | 0.0182 | 0.0212 | 0.0148 | 0.0276 | 0.0242 | 0.0272 | 0.0296 | 0.0298 | 0.68% | |
| NDCG@10 | 0.0082 | 0.0110 | 0.0175 | 0.0217 | 0.0260 | 0.0191 | 0.0328 | 0.0291 | 0.0322 | 0.0361 | 0.0363 | 0.55% | |
| NDCG@20 | 0.0109 | 0.0144 | 0.0239 | 0.0288 | 0.0316 | 0.0239 | 0.0385 | 0.0340 | 0.0379 | 0.0425 | 0.0428 | 0.71% | |
| Yelp | HR@5 | 0.0131 | 0.0137 | 0.0165 | 0.0243 | 0.0247 | 0.0195 | 0.0277 | 0.0240 | 0.0239 | 0.0260 | 0.0290 | 4.69% |
| HR@10 | 0.0230 | 0.0240 | 0.0267 | 0.0411 | 0.0423 | 0.0313 | 0.0450 | 0.0396 | 0.0404 | 0.0446 | 0.0474 | 5.33% | |
| HR@20 | 0.0388 | 0.0412 | 0.0445 | 0.0681 | 0.0694 | 0.0518 | 0.0730 | 0.0656 | 0.0670 | 0.0718 | 0.0777 | 6.44% | |
| NDCG@5 | 0.0080 | 0.0086 | 0.0103 | 0.0154 | 0.0151 | 0.0122 | 0.0179 | 0.0151 | 0.0153 | 0.0162 | 0.0186 | 3.91% | |
| NDCG@10 | 0.0112 | 0.0119 | 0.0135 | 0.0208 | 0.0208 | 0.0160 | 0.0234 | 0.0201 | 0.0206 | 0.0222 | 0.0245 | 4.70% | |
| NDCG@20 | 0.0151 | 0.0162 | 0.0180 | 0.0275 | 0.0276 | 0.0211 | 0.0304 | 0.0266 | 0.0272 | 0.0290 | 0.0321 | 5.59% | |
| LastFM | HR@5 | 0.0303 | 0.0339 | 0.0422 | 0.0358 | 0.0431 | 0.0450 | 0.0404 | 0.0358 | 0.0468 | 0.0505 | 0.0596 | 18.02% |
| HR@10 | 0.0459 | 0.0394 | 0.0670 | 0.0606 | 0.0624 | 0.0670 | 0.0587 | 0.0532 | 0.0716 | 0.0679 | 0.0853 | 19.13% | |
| HR@20 | 0.0606 | 0.0550 | 0.0972 | 0.0908 | 0.0936 | 0.1000 | 0.0872 | 0.0807 | 0.1018 | 0.1119 | 0.1202 | 7.42% | |
| NDCG@5 | 0.0222 | 0.0231 | 0.0301 | 0.0213 | 0.0264 | 0.0321 | 0.0276 | 0.0257 | 0.0330 | 0.0348 | 0.0402 | 15.52% | |
| NDCG@10 | 0.0269 | 0.0249 | 0.0382 | 0.0291 | 0.0325 | 0.0392 | 0.0336 | 0.0312 | 0.0409 | 0.0405 | 0.0484 | 18.34% | |
| NDCG@20 | 0.0306 | 0.0288 | 0.0458 | 0.0366 | 0.0402 | 0.0475 | 0.0407 | 0.0380 | 0.0485 | 0.0514 | 0.0572 | 11.28% | |
| ML-1M | HR@5 | 0.1033 | 0.1225 | 0.1406 | 0.1651 | 0.1858 | 0.1329 | 0.1916 | 0.2013 | 0.1909 | 0.2013 | 0.2115 | 5.06% |
| HR@10 | 0.1671 | 0.2199 | 0.2442 | 0.2724 | 0.2848 | 0.1821 | 0.2798 | 0.2904 | 0.2798 | 0.2904 | 0.2961 | 1.97% | |
| HR@20 | 0.2598 | 0.1925 | 0.3250 | 0.3459 | 0.3853 | 0.2089 | 0.3757 | 0.3886 | 0.3647 | 0.3844 | 0.4079 | 4.97% | |
| NDCG@5 | 0.0663 | 0.0779 | 0.0920 | 0.1077 | 0.1264 | 0.0861 | 0.1226 | 0.1339 | 0.1185 | 0.1286 | 0.1371 | 2.39% | |
| NDCG@10 | 0.0868 | 0.1006 | 0.1174 | 0.1332 | 0.1543 | 0.1105 | 0.1507 | 0.1640 | 0.1438 | 0.1573 | 0.1658 | 1.10% | |
| NDCG@20 | 0.1101 | 0.1253 | 0.1438 | 0.1588 | 0.1829 | 0.1388 | 0.1776 | 0.1901 | 0.1711 | 0.1836 | 0.1955 | 2.84% | |
| Foursquare | HR@5 | 0.0139 | 0.0148 | 0.0139 | 0.0139 | 0.0129 | 0.0120 | 0.0139 | 0.0148 | 0.0157 | 0.0148 | 0.0175 | 11.46% |
| HR@10 | 0.0175 | 0.0157 | 0.0185 | 0.0157 | 0.0166 | 0.0148 | 0.0194 | 0.0203 | 0.0203 | 0.0203 | 0.0248 | 22.17% | |
| HR@20 | 0.0231 | 0.0194 | 0.0295 | 0.0240 | 0.0259 | 0.0194 | 0.0305 | 0.0286 | 0.0305 | 0.0305 | 0.0314 | 2.95% | |
| NDCG@5 | 0.0105 | 0.0099 | 0.0085 | 0.0078 | 0.0068 | 0.0087 | 0.0078 | 0.0099 | 0.0094 | 0.0089 | 0.0105 | 6.06% | |
| NDCG@10 | 0.0123 | 0.0111 | 0.0106 | 0.0096 | 0.0095 | 0.0096 | 0.0102 | 0.0106 | 0.0106 | 0.0103 | 0.0131 | 23.85% | |
| NDCG@20 | 0.0133 | 0.0120 | 0.0136 | 0.0117 | 0.0118 | 0.0108 | 0.0126 | 0.0114 | 0.0142 | 0.0135 | 0.0176 | 23.95% |
- Significant Improvements:
TV-Recachieves an average accuracy improvement of 7.49% over the strongest baselines.- Particularly strong gains are observed on
LastFM(e.g., 19.13% onHR@10and 18.34% onNDCG@10) andFoursquare(e.g., 22.17% onHR@10and 23.85% onNDCG@10). These are datasets with longer average sequence lengths or more complex temporal dynamics. - On larger datasets like
ML-1M,TV-Recstill shows notable gains (e.g., 5.06% onHR@5and 2.39% onNDCG@5). - For e-commerce datasets
BeautyandSports, improvements are more subtle but consistent (e.g., 0.98% and 2.13% onHR@5, respectively).
- Particularly strong gains are observed on
- Strong Baselines: Recent hybrid models like
AdaMCTandBSARecshow strong performance, withBSARecoften being the second-best.LRURecperforms competitively onML-1Mdue to its design for long sequences, andDuoRecis strong onYelp. Despite these strong competitors,TV-Recconsistently surpasses them.
Long-Range Sequential Recommendation Results
To assess TV-Rec's ability to handle long-range dependencies, experiments were conducted on ML-1M and Foursquare with the maximum sequence length set to 200.
The following are the results from Table 3 of the original paper:
| Datasets | Metric | Caser | GRU4Rec | SASRec | BERT4Rec | NextItNet | FMLPRec | DuoRec | LRURec | AdaMCT | BSARec | TV-Rec | Improv. |
| ML-1M | HR@5 | 0.1109 | 0.1518 | 0.1558 | 0.1730 | 0.1978 | 0.1397 | 0.1930 | 0.2233 | 0.1760 | 0.1949 | 0.2255 | 0.99% |
| HR@10 | 0.1869 | 0.2374 | 0.2399 | 0.2573 | 0.2882 | 0.2296 | 0.2795 | 0.3005 | 0.2619 | 0.2917 | 0.3232 | 1.80% | |
| HR@20 | 0.2942 | 0.3455 | 0.3551 | 0.3695 | 0.3970 | 0.3462 | 0.3854 | 0.4205 | 0.3695 | 0.4005 | 0.4306 | 2.40% | |
| NDCG@5 | 0.0696 | 0.0981 | 0.1014 | 0.1147 | 0.1334 | 0.0885 | 0.1292 | 0.1516 | 0.1167 | 0.1327 | 0.1572 | 3.69% | |
| NDCG@10 | 0.0939 | 0.1256 | 0.1285 | 0.1418 | 0.1627 | 0.1175 | 0.1571 | 0.1820 | 0.1443 | 0.1639 | 0.1886 | 3.63% | |
| NDCG@20 | 0.1209 | 0.1528 | 0.1576 | 0.1701 | 0.1901 | 0.1468 | 0.1838 | 0.2079 | 0.1715 | 0.1913 | 0.2157 | 3.75% | |
| Foursquare | HR@5 | 0.0139 | 0.0120 | 0.0111 | 0.0102 | 0.0083 | 0.0120 | 0.0120 | 0.0129 | 0.0120 | 0.0129 | 0.0148 | 6.47% |
| HR@10 | 0.0194 | 0.0157 | 0.0175 | 0.0157 | 0.0166 | 0.0148 | 0.0194 | 0.0203 | 0.0157 | 0.0175 | 0.0212 | 9.28% | |
| HR@20 | 0.0231 | 0.0194 | 0.0295 | 0.0240 | 0.0259 | 0.0194 | 0.0286 | 0.0305 | 0.0305 | 0.0305 | 0.0323 | 5.90% | |
| NDCG@5 | 0.0105 | 0.0099 | 0.0085 | 0.0078 | 0.0068 | 0.0087 | 0.0078 | 0.0099 | 0.0094 | 0.0089 | 0.0105 | 6.06% | |
| NDCG@10 | 0.0123 | 0.0111 | 0.0106 | 0.0096 | 0.0095 | 0.0096 | 0.0102 | 0.0106 | 0.0106 | 0.0103 | 0.0129 | 21.70% | |
| NDCG@20 | 0.0133 | 0.0120 | 0.0136 | 0.0117 | 0.0118 | 0.0108 | 0.0126 | 0.0142 | 0.0142 | 0.0135 | 0.0158 | 11.27% |
TV-Recconsistentlyoutperforms all baseline modelsin long-range SR tasks, achieving an average improvement of 4.74% across all metrics compared to the top baseline performances.- The improvements are particularly significant in
NDCGmetrics, withTV-Recshowing up to 11.27% gain inNDCG@20forFoursquare. LRURecperforms strongly among baselines forML-1Mdue to itslinear recurrent unitdesign for long sequences, butTV-Recstill surpasses it.- This demonstrates the
effectiveness of time-variant filtersin modeling extended user interaction histories, maintaining accuracy even with hundreds of interactions.
Comparison to GNN-based Methods
TV-Rec was also compared against three representative GNN-based sequential recommendation models: SR-GNN, GC-SAN, and GCL4SR.
The following are the results from Table 4 of the original paper:
| Methods | Beauty | Sports | Yelp | LastFM | ML-1M | Foursquare | ||||||
| H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | |
| TV-Rec | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
| SR-GNN | 0.0847 | 0.0374 | 0.0517 | 0.0224 | 0.0609 | 0.0252 | 0.0872 | 0.0379 | 0.2940 | 0.1390 | 0.0222 | 0.0137 |
| GC-SAN | 0.1059 | 0.0546 | 0.0608 | 0.0289 | 0.0635 | 0.0260 | 0.0807 | 0.0394 | 0.3255 | 0.1611 | 0.0212 | 0.0131 |
| GCL4SR | 0.1206 | 0.0601 | 0.0744 | 0.0356 | 0.0684 | 0.0276 | 0.0908 | 0.0398 | 0.3381 | 0.1607 | 0.0185 | 0.0123 |
TV-Recconsistentlyoutperforms all GNN-based models.- This reinforces the advantage of
TV-Rec's design, which operates directly onsequence positionsrather than propagating messages overitem-transition graphs, leading to a more efficient representation oftemporal dependencies.
6.2. Ablation Studies / Parameter Analysis
Ablation Studies
Ablation studies were conducted to validate the contribution of key design choices in TV-Rec.
The following are the results from Table 5 of the original paper:
| Methods | Beauty | Sports | Yelp | LastFM | ML-1M | Foursquare | ||||||
| H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | |
| TV-Rec | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
| (1) Positional Embedding | 0.1408 | 0.0702 | 0.0842 | 0.0396 | 0.0763 | 0.0320 | 0.1018 | 0.0496 | 0.4017 | 0.1936 | 0.0313 | 0.0160 |
| (2) Basic Graph Filter | 0.1402 | 0.0692 | 0.0857 | 0.0408 | 0.0747 | 0.0307 | 0.1165 | 0.0543 | 0.3974 | 0.1933 | 0.0212 | 0.0113 |
| (3) Identity Basis | 0.1400 | 0.0698 | 0.0851 | 0.0410 | 0.0765 | 0.0317 | 0.1138 | 0.0539 | 0.4015 | 0.1930 | 0.0277 | 0.0141 |
| (4) Basis Normalization | 0.1336 | 0.0689 | 0.0841 | 0.0412 | 0.0634 | 0.0264 | 0.0963 | 0.0418 | 0.3985 | 0.1912 | 0.0305 | 0.0144 |
-
TV-Rec (1) Positional Embedding: Adding explicitlearnable positional embeddingstoTV-Recgenerally results ininconsistent or slightly degraded performance. For example, on LastFM, bothHR@20andNDCG@20decrease significantly. This validates the claim thatTV-Recinherently captures positional information through itsspectral propertiesandtime-variant filters, making explicit positional encoding redundant. -
TV-Rec (2) Basic Graph Filter: Replacing thetime-variant filterwith abasic graph filter(fixed filter taps) consistentlydegrades performanceacross all datasets. For instance, on Foursquare,HR@20drops from 0.0314 to 0.0212, andNDCG@20from 0.0176 to 0.0113. This confirms the critical role and effectiveness of thetime-variantnature of the filters. -
TV-Rec (3) Identity Basis: Setting thebasis matrixto anidentity matrix(which makes , effectively removing the basis expansion) alsodegrades resultscompared to the full model. This indicates that thelearnable basisis important for constructing expressive and diverse filter patterns. -
TV-Rec (4) Basis Normalization: Removing theL2 normalizationof the basis matrix (i.e., using directly instead of in Eq. 8) leads to adegradation in performance, especially noticeable on Yelp and LastFM. This highlights the importance ofnormalizationfor numerical stability and effective learning of the basis vectors.Overall, the ablation studies confirm that each component of
TV-Rec's design—the time-variant nature of filters, the learned basis matrix, and its normalization—contributes to the model's superior performance.
Parameter Sensitivity
The paper investigates the sensitivity of TV-Rec to key hyperparameters: (number of basis vectors), (dropout rate), (orthogonal regularization coefficient), and (filter order).
Sensitivity to (Number of Basis Vectors)
The following figure (Figure 3 from the original paper) shows the sensitivity to the number of basis vectors .
该图像是图表,展示了图4中在两个数据集(Sports和Foursquare)上不同dropout率对NDCG@20和HR@20指标的敏感性影响。图中柱状图表示NDCG@20,折线图表示HR@20,反映随变化指标的变化趋势。
The following figure (Figure 8 from the original paper) shows the sensitivity to the number of basis vectors .
该图像是图表,展示了TV-Rec模型对参数p(dropout率)的灵敏度分析,包含六个数据集(Beauty、Sports、Yelp、LastFM、ML-1M、Foursquare)的NDCG@20和HR@20随p变化的趋势。
- The parameter in determines the number of
basis vectorsused to construct the position-specific filters, influencing both expressiveness and computational cost. - For
Beauty, performance anddrops sharply for smaller values, suggesting that a richer representation is beneficial for this dataset. - Conversely, for
LastFM, the best results are achieved at , and performancedegrades as m increases. This indicates that a more compact representation (fewer basis vectors) is sufficient and possibly better forLastFM, possibly due to its specific characteristics (e.g., music recommendation, less complex item relations or more distinct user tastes). - For
YelpandML-1M, yields optimal results, while forSportsandFoursquare, or is also suitable. - This analysis underscores that the optimal number of basis vectors is
dataset-dependent, reflecting varying complexities of user behavior patterns.
Sensitivity to (Dropout Rate)
The following figure (Figure 4 from the original paper) shows the sensitivity to dropout rate .
该图像是图表,展示了LastFM数据集上学习到的图滤波器对比结果。横轴表示图卷积中的移位次数,纵轴表示节点,数值越大表明时间越接近当前时刻。(a)为基本图滤波器,(b)为时变滤波器,后者在捕捉节点和时间依赖性上表现更丰富。
The following figure (Figure 9 from the original paper) shows the sensitivity to dropout rate .
该图像是图表,展示了不同正交正则化系数 α 对六个数据集(Beauty, Sports, Yelp, LastFM, ML-1M, Foursquare)中NDCG@20和HR@20指标的敏感性变化,反映了模型性能随 α 变化的趋势。
Dropout rate pis a regularization parameter.- For
Sports,larger p values (e.g., 0.5)improve performance. This suggests thatSportsmight be more prone tooverfitting, and higher dropout helps in learning more general representations. - For
Foursquare,smaller p values (e.g., 0.1 or 0.2)work better. - Generally, datasets with
fewer interactionsorlower data diversitytend to benefit fromhigher dropout ratesto preventoverfitting, while denser datasets might require less regularization.
Sensitivity to (Orthogonal Regularization Coefficient)
The following figure (Figure 10 from the original paper) shows the sensitivity to orthogonal regularization coefficient .
该图像是图2,展示了论文中提出的TV-Rec模型的整体架构,包括嵌入层、滤波层、前馈层及规范化层等模块。同时配有图傅里叶变换(GFT)和滤波公式,体现时变卷积滤波的数学原理。
- controls the strength of the
orthogonal regularization termapplied to the basis matrix . - For
LastFM, the best accuracy is achieved with . Performance , indicating that sufficient orthogonal regularization is crucial for diverse filters and preventing overfitting in this dataset. - In contrast, for
SportsandYelp, , and performance deteriorates as . This implies thattoo strong orthogonal regularizationmight overly constrain the filter parameters, limiting the model's ability to adapt to the data distributions in these domains. - This highlights the importance of tuning based on dataset characteristics to balance regularization strength.
Sensitivity to (Filter Order)
The following are the results from Table 8 of the original paper:
| K | Beauty | Sports | Yelp | LastFM | ML-1M | Foursquare | ||||||
| H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | H@20 | N@20 | |
| 3 | 0.1380 | 0.0690 | 0.0839 | 0.0405 | 0.0731 | 0.0295 | 0.1046 | 0.0443 | 0.3901 | 0.1843 | 0.0240 | 0.0116 |
| 5 | 0.1364 | 0.0680 | 0.0849 | 0.0407 | 0.0752 | 0.0308 | 0.1147 | 0.0490 | 0.3823 | 0.1814 | 0.0268 | 0.0116 |
| 10 | 0.1378 | 0.0688 | 0.0853 | 0.0410 | 0.0741 | 0.0303 | 0.1165 | 0.0532 | 0.3922 | 0.1898 | 0.0268 | 0.0139 |
| 25 | 0.1371 | 0.0684 | 0.0854 | 0.0409 | 0.0774 | 0.0323 | 0.1248 | 0.0546 | 0.3957 | 0.1891 | 0.0295 | 0.0163 |
| 50 | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
- The
filter order Kis analogous to thekernel sizeinCNN-basedmodels. - Except for
YelpandLastFM, setting (equal to the maximum sequence length ) consistently yields thebest performance. - This supports the hypothesis that aligning
filter orderwithsequence lengthallows the model to capture abroader context, enhancing recommendation quality. It implies that a larger receptive field (higher ) is beneficial for integrating more long-range information.
Analyzing Filter Behavior and Case Study
To understand the mechanics of the time-variant filter, a visualization of learned filters and a case study were performed.
The following figure (Figure 5 from the original paper) shows the visualization of learned graph filters on LastFM.
该图像是图表,展示了固定卷积滤波器与时间变卷积滤波器在电影推荐中的差异。通过不同时间阶段的用户偏好变化对比,时间变卷积滤波器更准确地捕捉了用户兴趣的动态变化,使推荐结果更加多样和贴合真实偏好。
-
Filter Visualization (Figure 5): The visualization compares a
basic graph filter(fixed) with thetime-variant convolutional filteron theLastFMdataset.-
Both types of filters generally
assign higher weights to lower shifts, meaning they prioritizerecent items. This is intuitive for sequential recommendation. -
The crucial difference is that the
basic graph filterapplies thesame filter uniformlyacross all nodes (positions). -
In contrast, the
time-variant convolutional filterappliesdifferent filters to each node(position).- For
early-stage nodes(earlier in the sequence), the time-variant filter showssimilar weights, suggesting an initial focus on understandingoverall patterns. - As the sequence
progresses towards recent items, the filter's weightsshift to emphasize more recent items. This dynamic adaptation allows the filter to capturetemporal shiftsmore accurately.
- For
-
This demonstrates
TV-Rec's ability to adapt its focus, capturing bothlong-term preferences(from early patterns) andrecent interests(from later patterns), unlike fixed filters that predominantly focus on recent items.The following figure (Figure 6 from the original paper) shows the case study on ML-1M.
该图像是图表,展示了在Beauty数据集上不同模型的推理时间与NDCG@20性能对比。图中用气泡大小表示模型参数量。TV-Rec在推理时间较短的同时表现出较高的NDCG@20,优于其他模型。
-
-
Case Study (ML-1M, Figure 6): A real-world example from
ML-1Mhighlights the practical advantage.- A
basic graph filter(fixed) model focuses solely on a user's recent interactions withWestern films. TV-Rec(with time-variant filters), however, successfully captures both the user'srecent Western interestand theirbroader Comedy preferencefrom earlier interactions.- This confirms that
applying different filtering operations at different temporal positionsis crucial for effective sequential recommendation, allowingTV-Recto form a more nuanced and accurate understanding of user preferences.
- A
Model Complexity and Runtime Analyses
The efficiency of TV-Rec is evaluated by analyzing the number of parameters and inference time.
The following figure (Figure 7 from the original paper) shows the comparison of model inference time and NDCG@20 on Beauty.
该图像是论文中图8的图表,展示了基向量数量对六个数据集(Beauty, Sports, Yelp, LastFM, ML-1M, Foursquare)中NDCG@20和HR@20指标的敏感性分析,柱状图代表NDCG@20,折线图代表HR@20。
-
Performance-Efficiency Trade-off (Figure 7, Beauty dataset):
-
TV-Recachieves thebest balance between performance (NDCG@20) and computational efficiency (inference time). -
Among top-performing methods,
TV-Rechas thefastest inference timeand thesmallest number of parameterscompared to hybrid models likeAdaMCTandBSARec. -
Compared to
SASRecandBERT4Rec(pure self-attention),TV-Recoffersfaster inferenceandsuperior recommendation accuracy. -
While
FMLPRecruns slightly faster due to its simpler architecture, itdegrades recommendation qualitybecause it uses a basic (fixed) graph filter.TV-Rec's marginally higher inference time thanFMLPRecis justified by its significantly improved performance.The following are the results from Table 10 of the original paper:
Dataset Metrics TV-Rec AdaMCT BSARec FMLPRec NextItNet BERT4Rec SASRec Caser Beauty # Parameters 854,208 878,208 880,318 851,200 981,696 877,888 877,824 2,909,532 Training Cost (s/epoch) 13.20 12.30 11.35 11.85 18.9184 21.98 11.21 65.54 Inference Cost (s/epoch) 0.5697 0.6647 0.7299 0.5427 1.0267 0.5821 0.6316 1.0641 NDCG@20 0.0704 0.0691 0.0664 0.0419 0.0547 0.0484 0.0379 0.0164 Sports # Parameters 1,248,160 1,278,592 1,264,318 1,251,584 1,644,224 1,278,272 1,278,208 4,835,756 Training Cost (s/epoch) 19.46 17.63 18.44 17.76 30.1034 31.60 14.26 98.21 Inference Cost (s/epoch) 0.7411 0.9049 0.9679 0.7049 1.2724 0.8016 0.8460 1.5715 NDCG@20 0.0428 0.0422 0.0379 0.0239 0.0316 0.0288 0.0217 0.0109 Yelp # Parameters 1,355,424 1,385,856 1,365,522 1,358,848 1,751,488 1,385,536 1,385,472 3,925,238 Training Cost (s/epoch) 21.89 20.87 21.83 20.20 32.88 37.57 16.74 111.62 Inference Cost (s/epoch) 0.6388 0.8527 0.8890 0.6223 1.2348 0.7225 0.7322 1.3220 NDCG@20 0.0338 0.0290 0.0272 0.0211 0.0276 0.0275 0.0180 0.0151 LastFM # Parameters 303,440 337,088 322,814 310,080 981,696 336,768 336,704 998,646 Training Cost (s/epoch) 2.41 2.51 2.66 2.51 19.1253 4.29 2.30 13.22 Inference Cost (s/epoch) 0.2649 0.2807 0.3228 0.2678 1.068 0.3018 0.3061 0.3445 NDCG@20 0.0582 0.0514 0.0485 0.0475 0.0402 0.0366 0.0458 0.0306 ML-1M # Parameters 298,368 322,368 308,094 295,360 556,928 322,048 321,984 961,326 Training Cost (s/epoch) 26.60 27.68 31.40 24.67 40.7731 46.51 22.50 102.99 Inference Cost (s/epoch) 0.4129 0.4180 0.4474 0.3594 0.7269 0.4202 0.4181 0.5817 NDCG@20 0.1951 0.1836 0.1711 0.1388 0.1829 0.1588 0.1438 0.1101 Foursquare # Parameters 719,040 743,040 728,766 716,032 1,108,672 742,720 742,656 1,073,044 Training Cost (s/epoch) 6.01 5.79 5.81 5.11 8.9109 9.06 5.27 21.01 Inference Cost (s/epoch) 0.2900 0.2913 0.3562 0.2521 0.4963 0.2871 0.3178 0.3703 NDCG@20 0.0170 0.0147 0.0131 0.0110 0.0115 0.0132 0.0137 0.0133
-
-
Parameter Count:
TV-Recconsistently showscompetitive (often the lowest) parameter efficiencyacross all datasets, maintaining a comparable or slightly lower number of parameters than most advanced baselines. -
Training Cost:
TV-Recdemonstratescompetitive training costin most datasets, often comparable to or slightly higher thanSASRecandFMLPRec. The overhead from GFT/IGFT operations is manageable. -
Inference Cost:
TV-Recexhibitsstrong inference efficiency, frequently having thelowest or near-lowest inference costsamong all models, especially onBeauty,Sports, andYelp. This is a significant advantage, as the time-variant convolutional filter is a linear operator whose components can be precomputed after training, leading to efficient inference with a complexity of (compared to for training).
Additional Results on XLong
To further test scalability for extremely long sequences, TV-Rec was evaluated on the XLong dataset (average length 958.8, 20 times longer than main experiments).
The following are the results from Table 9 of the original paper:
| Metric | SASRec | LRURec | TV-Rec | Improv. |
| HR@5 | 0.3612 | 0.4266 | 0.4844 | 13.55% |
| HR@10 | 0.4680 | 0.5137 | 0.5353 | 4.20% |
| HR@20 | 0.5612 | 0.5874 | 0.5774 | -1.70% |
| NDCG@5 | 0.2656 | 0.3227 | 0.3905 | 21.00% |
| NDCG@10 | 0.2979 | 0.3510 | 0.4071 | 15.98% |
| NDCG@20 | 0.3232 | 0.3697 | 0.4178 | 13.01% |
- On the
XLongdataset,TV-Recshowsstrong performance, significantly outperformingSASRecandLRURecin most metrics (HR@5,HR@10,NDCG@5,NDCG@10,NDCG@20). - While
TV-Recis slightly lower thanLRURecinHR@20(-1.70%), its overall superiority, especially inNDCGmetrics (e.g., 21.00% gain inNDCG@5), highlights itseffectiveness in handling extremely long sequences. This confirms its strength inlong-range modeling tasks.
Statistical Significance of Experimental Results
To ensure reliability, all experiments were run 10 times with different random seeds for TV-Rec and the second-best baseline model for each dataset. Mean and standard deviation are reported.
The following are the results from Table 11 of the original paper:
| Datasets | Methods | HR@5 | HR@10 | HR@20 | NDCG@5 | NDCG@10 | NDCG@20 |
| Beauty | BSARec | 0.0694±0.001 | 0.0978±0.002 | 0.1352±0.002 | 0.0496±0.001 | 0.0587±0.001 | 0.0681±0.001 |
| TV-Rec | 0.0706±0.001 | 0.0997±0.001 | 0.1375±0.002 | 0.0500±0.001 | 0.0594±0.001 | 0.0689±0.001 | |
| Sports | BSARec | 0.0417±0.001 | 0.0600±0.001 | 0.0844±0.001 | 0.0288±0.001 | 0.0349±0.001 | 0.0411±0.001 |
| TV-Rec | 0.0420±0.001 | 0.0610±0.002 | 0.0863±0.002 | 0.0290±0.001 | 0.0351±0.001 | 0.0415±0.001 | |
| Yelp | DuoRec | 0.0268±0.001 | 0.0453±0.001 | 0.0733±0.001 | 0.0170±0.000 | 0.0230±0.000 | 0.0300±0.000 |
| TV-Rec | 0.0284±0.001 | 0.0472±0.001 | 0.0759±0.001 | 0.0179±0.000 | 0.0240±0.001 | 0.0312±0.001 | |
| LastFM | BSARec | 0.0501±0.004 | 0.0707±0.006 | 0.1051±0.008 | 0.0342±0.002 | 0.0420±0.003 | 0.0506±0.003 |
| TV-Rec | 0.0508±0.005 | 0.0750±0.006 | 0.1090±0.007 | 0.0343±0.003 | 0.0420±0.003 | 0.0506±0.003 | |
| ML-1M | LRURec | 0.1955±0.002 | 0.2818±0.002 | 0.3871±0.002 | 0.1326±0.002 | 0.1604±0.002 | 0.1869±0.002 |
| TV-Rec | 0.2024±0.005 | 0.2901±0.005 | 0.3972±0.005 | 0.1365±0.004 | 0.1647±0.003 | 0.1918±0.003 | |
| Foursquare | BSARec | 0.0133±0.003 | 0.0175±0.003 | 0.0250±0.003 | 0.0098±0.002 | 0.0111±0.002 | 0.0130±0.002 |
| TV-Rec | 0.0151±0.002 | 0.0214±0.003 | 0.0289±0.002 | 0.0105±0.002 | 0.0126±0.002 | 0.0145±0.002 |
- The results show that
TV-Recconsistently achieveshigher mean performancecompared to the second-best baselines across all datasets and metrics. - The
standard deviationsare relatively small, indicatingstabilityin performance across different runs and reinforcing thestatistical significanceofTV-Rec's improvements.
6.3. Conclusion from Results
The extensive experimental results demonstrate that TV-Rec is a highly effective and efficient model for sequential recommendation. Its time-variant convolutional filters successfully address the limitations of fixed filters and self-attention, leading to superior accuracy, particularly in capturing long-range dependencies and dynamic user preferences. The model also offers a favorable trade-off between performance and computational cost, making it suitable for practical applications. The ablation studies confirm the importance of each novel component in its design.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces TV-Rec (Time-Variant Convolutional Filters for Sequential Recommendation), a novel model that effectively addresses the inherent limitations of conventional (fixed) convolutional filters and the computational overhead of self-attention in sequential recommendation. Inspired by graph signal processing (GSP), TV-Rec employs time-variant convolutional filters that dynamically adapt their weights based on the position within a user's interaction sequence. This allows the model to capture complex, position-dependent temporal variations and integrate both long-term preferences and recent interests without the need for self-attention. The key findings are TV-Rec's consistent superiority in recommendation accuracy (averaging 7.49% improvement over state-of-the-art baselines), its enhanced expressive generalization, and its computational efficiency (reduced complexity and faster inference, especially after training).
7.2. Limitations & Future Work
The authors acknowledge several limitations of TV-Rec:
-
Training Computational Complexity: While
TV-Recboasts efficient inference, its training process involvesadditional computational complexity. This stems from the repeatedGraph Fourier Transform (GFT)andinverse GFToperations, as well as the generation ofposition-specific filters. -
Dimensionality Increase from DCG: To enable
spectral filtering,TV-Recreplaces the originalline graphwith aDirected Cyclic Graph (DCG). Thisadds extra nodes through padding(zero-padding), which increases the dimensionality of the spectral domain. -
Manageable Overhead: Despite these factors, the authors state that the overhead in training time and memory usage remains
manageableand does not significantly impact training scalability in practice, as evidenced in theirModel Complexity and Runtime Analyses. -
Data-Independent Filter Generation: The current filter generation in
TV-Recisdata-independent, relying solely ontemporal positionrather than the specific content or semantics of the sequence. This design choice simplifies the model and aids generalization but might limit its adaptability to highlyinstance-specific behaviorsorirregular patternsthat are not purely position-dependent. The authors argue that theinference-time efficiencyandarchitectural simplicityjustify these trade-offs for many practical applications.For future work, the authors plan to:
-
Explore both the
theoretical and practical relationshipsbetween theirtime-variant filterandrecent advances in state space models, which have demonstrated strong connections with convolutional filters. This could potentially lead to further optimizations or novel architectural designs.
7.3. Personal Insights & Critique
The TV-Rec paper presents an elegant and effective solution to a long-standing challenge in sequential recommendation. The shift from fixed convolutional filters or computationally heavy self-attention to time-variant filters inspired by Graph Signal Processing is a compelling conceptual leap.
-
Inspirations:
- GSP for Sequence Modeling: The reinterpretation of a sequence as a graph signal on a
DCGis particularly insightful. It allows leveraging the powerful mathematical tools ofGSP(like Fourier analysis) in a domain traditionally dominated by RNNs and Transformers. This cross-domain application is inspiring and could open doors for GSP in other sequential data tasks. - Efficiency through Spectral Methods: The ability to operate largely in the spectral domain, converting complex convolutions into simpler multiplications, is a clever way to achieve efficiency. The claim of faster inference by precomputing filter components is a strong practical advantage for deployment.
- Implicit Positional Encoding: The inherent positional encoding through the
GFT basisis a neat solution that simplifies the model architecture and avoids the need for explicit positional embeddings, which can sometimes be tricky to design or optimize.
- GSP for Sequence Modeling: The reinterpretation of a sequence as a graph signal on a
-
Potential Issues / Areas for Improvement:
-
Interpretability of
time-variant filters: While the paper provides visualizations (Figure 5) and a case study (Figure 6), the learnedfilter tap matrix(constructed from and ) can still be complex. Further research into how individual basis vectors contribute to specific patterns and how the coefficient matrix dynamically blends them could enhance interpretability. -
Data-independent filter generation: The current
data-independentfilter generation (relying only on temporal position) might be a trade-off. While it aids generalization, future work could explore hybrid approaches where filter generation is also subtly influenced by semantic content or user attributes, allowing for even more adaptive filtering without compromising too much on efficiency. For example, some elements of or could be conditioned on sparse user features or global item popularity. -
Scalability for extremely high (filter order): While works well, for sequences with (like in XLong), where could be very large, the dependency in the filter matrix definition and the training complexity might still become a bottleneck if significantly increases. More sophisticated sparse approximations or low-rank representations for or might be needed for truly massive sequence lengths.
-
Generality beyond SR: The core idea of
time-variant convolutional filterscould potentially be applied to other sequence modeling tasks where position-dependent pattern recognition is crucial, such as time series forecasting, genomic sequence analysis, or even certain aspects of natural language processing, offering an alternative to Transformer-based models.Overall,
TV-Recrepresents a fresh perspective in sequential recommendation, successfully integratingGSPto create a powerful, efficient, and expressive model that challenges the dominance of self-attention.
-
Similar papers
Recommended via semantic vector search.