Paper status: completed

TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation

Published: 2025-10-29 (arXiv preprint)

TL;DR Summary

TV-Rec introduces time-variant convolutional filters inspired by graph signal processing, replacing fixed filters and self-attention to model temporal variations in user behavior, improving expressiveness, reducing computation, and boosting accuracy by an average of 7.49% across six benchmarks.

Abstract

Recently, convolutional filters have been increasingly adopted in sequential recommendation for their ability to capture local sequential patterns. However, most of these models complement convolutional filters with self-attention. This is because convolutional filters alone, generally fixed filters, struggle to capture global interactions necessary for accurate recommendation. We propose Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a model inspired by graph signal processing, where time-variant graph filters capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention with time-variant filters, TV-Rec achieves higher expressive power and better captures complex interaction patterns in user behavior. This design not only eliminates the need for self-attention but also reduces computation while accelerating inference. Extensive experiments on six public benchmarks show that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.


1. Bibliographic Information

1.1. Title

The central topic of the paper is a novel approach to sequential recommendation using time-variant convolutional filters. The title is TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation.

1.2. Authors

The authors are:

  • Yehjin Shin

  • Jeongwhan Choi

  • Seojin Kim

  • Noseong Park*

    All authors are affiliated with KAIST (Korea Advanced Institute of Science and Technology). Their research background appears to be in recommender systems, machine learning, and potentially graph signal processing, given the paper's core methodology.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server. The submission timestamp (2025-10-29 08:14 UTC) indicates it is a recent submission, likely awaiting peer review or subsequent publication in a conference or journal. While arXiv itself is not a peer-reviewed venue, it is a widely recognized platform for disseminating early research in fields such as computer science, mathematics, and physics. The authors' affiliation with KAIST suggests a strong academic background.

1.4. Publication Year

The publication year, based on the provided Published at (UTC) timestamp, is 2025.

1.5. Abstract

The paper addresses a limitation in sequential recommendation (SR) models that use convolutional filters: their fixed nature struggles to capture global interactions, often requiring supplementary self-attention mechanisms. The authors propose TV-Rec (Time-Variant Convolutional Filters for Sequential Recommendation), a model inspired by graph signal processing (GSP). TV-Rec utilizes time-variant graph filters to capture position-dependent temporal variations in user sequences. By replacing both fixed kernels and self-attention, TV-Rec achieves higher expressive power, better modeling of complex user behavior, reduced computation, and accelerated inference. Extensive experiments on six public benchmarks demonstrate that TV-Rec outperforms state-of-the-art baselines by an average of 7.49%.

The official source is https://arxiv.org/abs/2510.25259. This is a link to the arXiv preprint.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies in the limitations of existing sequential recommendation (SR) models, particularly concerning their ability to capture dynamic user preferences and complex interaction patterns efficiently.

  • Problem: While convolutional filters are effective at capturing local sequential patterns in SR, their fixed nature prevents them from adapting to diverse temporal variations across a user's interaction history. This often necessitates combining them with self-attention mechanisms (e.g., in Transformer-based models). However, self-attention has its own limitations: it lacks an inductive bias toward sequential structure, treating all positions pairwise without inherent local proximity, and is computationally expensive. This leads to a trade-off between expressiveness, locality, and efficiency.
  • Importance: Sequential recommendation is crucial in modern recommender systems because user preferences are dynamic and evolve over time. Accurately modeling these evolving preferences is essential for providing timely and relevant recommendations, which directly impacts user engagement and satisfaction.
  • Challenges/Gaps:
    • Fixed convolutional filters struggle to capture global interactions and position-specific semantics as user interests shift rapidly.
    • Self-attention is expressive but computationally inefficient (especially for long sequences) and insensitive to locality, making it difficult to model fine-grained, localized user behavior.
    • Hybrid models that combine both often inherit the computational burden of self-attention.
  • Innovative Idea: The paper's entry point is to bridge this gap by proposing Time-Variant Convolutional Filters, inspired by graph signal processing (GSP). The core innovation is to replace both fixed kernels and self-attention with filters that can adapt their weights across the sequence (i.e., time-variant) to emphasize the most relevant elements at each time step.

2.2. Main Contributions / Findings

The paper makes several significant contributions and presents key findings:

  • Novel Architecture (TV-Rec): The authors propose TV-Rec, a new sequential recommendation model that utilizes time-variant convolutional filters. These filters are designed to effectively capture temporal dynamics and user behavior patterns by applying different filtering operations at each position in a sequence, eliminating the need for self-attention. This design is inspired by node-variant graph filters from graph signal processing (GSP).
  • Enhanced Expressiveness and Generalization: TV-Rec demonstrates superior expressive generalization. By allowing filters to be functions of time (position), the model can effectively capture both long-term preferences (patterns at the beginning of a sequence) and recent interests (patterns at the end of a sequence), which fixed filters often miss.
  • Improved Performance and Efficiency: Extensive experiments on six public benchmark datasets show that TV-Rec consistently outperforms state-of-the-art baseline methods, achieving an average accuracy improvement of 7.49%. Crucially, this performance gain comes with reduced computation and accelerated inference compared to self-attention-based and hybrid models, as the time-variant filters act as linear operators.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand TV-Rec, a grasp of several fundamental concepts from sequential recommendation, deep learning architectures, and graph signal processing is essential.

  • Sequential Recommendation (SR):

    • Conceptual Definition: SR is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., item purchase, movie watch) based on their historical sequence of interactions. Unlike traditional collaborative filtering, SR explicitly considers the order and temporal dynamics of user behavior, recognizing that preferences evolve over time.
    • Goal: Given a sequence of items a user has interacted with, $S = [v_1, v_2, \ldots, v_{|S|}]$, the goal is to predict the next item $v_{|S|+1}$ or recommend a top-$r$ list of potential next items.
  • Convolutional Filters (in Deep Learning):

    • Conceptual Definition: In deep learning, especially in Convolutional Neural Networks (CNNs), a convolutional filter (or kernel) is a small matrix of learnable weights that slides over an input (e.g., an image, a sequence) to detect local patterns. For sequential data, 1D convolutional filters are commonly used.
    • Mechanism: When applied to a sequence, a 1D filter performs a dot product with a local window of the input sequence. This operation is repeated across the entire sequence, generating a feature map that highlights where the learned local pattern occurs.
    • Fixed Nature: Traditionally, these filters have fixed weights that are applied uniformly across all positions in the input sequence. This makes them excellent at capturing local patterns but limits their adaptability to position-specific or temporally evolving semantics.
  • Self-Attention / Transformers:

    • Conceptual Definition: Self-attention is a mechanism that allows a model to weigh the importance of different parts of an input sequence relative to a specific element in the sequence. It's a core component of the Transformer architecture, widely used in Natural Language Processing (NLP) and increasingly in sequential recommendation.
    • Mechanism (Simplified): For each element in a sequence, self-attention computes three vectors: a Query (Q), Key (K), and Value (V). The Query of an element is matched against the Keys of all other elements (including itself) to compute attention scores. These scores are then used to weight the Values of all elements, producing a new representation for the original element that incorporates information from the entire sequence.
    • Attention Calculation: The standard scaled dot-product attention is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
      • $Q$: Query matrix. Each row corresponds to the query vector of an element in the sequence.
      • $K$: Key matrix. Each row corresponds to the key vector of an element in the sequence.
      • $V$: Value matrix. Each row corresponds to the value vector of an element in the sequence.
      • $d_k$: Dimension of the key vectors, used for scaling to keep the softmax gradients well-behaved.
      • $\mathrm{softmax}(\cdot)$: Normalizes the attention scores.
    • Strengths: Excellent at capturing long-range dependencies and complex relationships between arbitrary positions in a sequence.
    • Limitations:
      • Lack of inductive bias: It does not inherently prioritize local proximity; all pairwise interactions are treated equally, so positional encodings are needed to introduce sequential order.
      • Computational cost: The attention mechanism has quadratic complexity with respect to sequence length ($O(N^2)$), making it expensive for very long sequences.
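To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only, not code from the paper); the explicit $(N, N)$ score matrix is the source of the $O(N^2)$ complexity noted above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(QK^T / sqrt(d_k)) V.

    Q, K, V: (N, d_k) arrays; the (N, N) score matrix is what
    makes self-attention O(N^2) in the sequence length N.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d_k)

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 4)
```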
  • Graph Signal Processing (GSP):

    • Conceptual Definition: GSP extends traditional signal processing concepts (like Fourier analysis and filtering) to data defined on irregular domains, i.e., graphs. It treats data associated with nodes as signals and the graph structure as an operator that defines relationships between these signals.
    • Graph Signal ($\mathbf{x}$): A vector where each component corresponds to a value associated with a node in the graph. For TV-Rec, the sequence of item embeddings is treated as a graph signal, with each item (at a specific temporal position) being a node.
    • Shift Operator ($\mathbf{S}$): A matrix (often the adjacency or Laplacian matrix of the graph) that captures the underlying graph topology and "shifts" a signal across the graph. In GSP, filters are typically defined as polynomials of this shift operator.
    • Graph Filter ($\mathbf{G}$): An operation that processes a graph signal based on the graph structure, defined as a polynomial of the shift operator $\mathbf{S}$: $ \mathbf{y} = \mathbf{G}\mathbf{x} = \sum_{k=0}^K h_k \mathbf{S}^k \mathbf{x} $
      • $\mathbf{y}$: The filtered graph signal.
      • $\mathbf{x}$: The input graph signal.
      • $h_k$: Scalar filter taps (coefficients) that determine the characteristics of the filter.
      • $\mathbf{S}$: The shift operator (e.g., adjacency matrix).
      • $K$: The order of the filter, determining how far information propagates.
    • Graph Fourier Transform (GFT): Analogous to the Fourier Transform for Euclidean signals, the GFT decomposes a graph signal into frequency components defined by the eigenvectors of the shift operator $\mathbf{S}$: $ \mathbf{S} = \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}) \mathbf{U}^\top $
      • $\mathbf{U}$: Matrix of eigenvectors of $\mathbf{S}$. These eigenvectors form the graph Fourier basis.
      • $\mathrm{diag}(\boldsymbol{\lambda})$: Diagonal matrix of eigenvalues of $\mathbf{S}$.
      • $\tilde{\mathbf{x}} = \mathbf{U}^\top \mathbf{x}$: The GFT of signal $\mathbf{x}$ (the signal in the frequency domain).
      • $\mathbf{x} = \mathbf{U} \tilde{\mathbf{x}}$: The inverse GFT.
    • Spectral Filtering: In the frequency domain, a graph filter acts as a diagonal matrix on the transformed signal: $ \tilde{\mathbf{y}} = \tilde{\mathbf{G}} \tilde{\mathbf{x}} = \left( \sum_{k=0}^K h_k \mathrm{diag}(\boldsymbol{\lambda})^k \right) \tilde{\mathbf{x}} $
      • $\tilde{\mathbf{y}}$: The filtered signal in the frequency domain.
      • $\tilde{\mathbf{G}}$: The frequency response of the filter, a diagonal matrix. This diagonal operation is computationally efficient.
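The following NumPy sketch (not from the paper) illustrates these definitions on a small directed cycle: it builds a shift operator, eigen-decomposes it, and verifies that vertex-domain filtering $\sum_k h_k \mathbf{S}^k \mathbf{x}$ matches spectral filtering with the diagonal frequency response.

```python
import numpy as np

N, K = 8, 3
# Cyclic shift operator of a directed cycle graph: (Sx)[i] = x[(i-1) mod N]
S = np.roll(np.eye(N), 1, axis=0)

# Eigendecomposition S = U diag(lam) U^H (U is the DFT basis here)
lam, U = np.linalg.eig(S)

h = np.random.default_rng(1).normal(size=K + 1)   # filter taps h_0..h_K
x = np.random.default_rng(2).normal(size=N)       # graph signal

# Vertex-domain filtering: y = sum_k h_k S^k x
y_vertex = sum(h[k] * np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Spectral filtering: y = U diag(sum_k h_k lam^k) U^H x
freq_response = sum(h[k] * lam**k for k in range(K + 1))
y_spectral = (U @ np.diag(freq_response) @ U.conj().T @ x).real

print(np.allclose(y_vertex, y_spectral))  # True
```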
  • Node-Variant Graph Filter:

    • Conceptual Definition: Unlike conventional graph filters, where a single set of filter taps ($h_k$) is applied uniformly across all nodes, a node-variant graph filter applies distinct filter taps at each node, so the filter characteristics can change depending on which node the signal is at.
    • Mechanism: Instead of a scalar $h_k$, it uses a vector $\mathbf{h}_k$ of filter taps, where each component of $\mathbf{h}_k$ corresponds to a specific node. This is represented by a diagonal matrix $\mathrm{diag}(\mathbf{h}_k)$: $ \mathbf{y} = \mathbf{G}_{nv} \mathbf{x} = \left( \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{S}^k \right) \mathbf{x} $
      • $\mathrm{diag}(\mathbf{h}_k)$: A diagonal matrix whose diagonal entries are the elements of the vector $\mathbf{h}_k$, allowing each node to have its own tap value.
    • Connection to TV-Rec: In TV-Rec, where sequence positions are treated as nodes in a graph, a node-variant graph filter becomes equivalent to a time-variant convolutional filter. This lets the filter adapt to the specific temporal position within the sequence.
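A minimal sketch of the node-variant case (illustrative; sizes and taps are arbitrary): each node $i$ now carries its own tap $h_k^{(i)}$, and the conventional filter above is recovered when all rows of $\mathbf{H}$ are identical.

```python
import numpy as np

N, K = 8, 3
S = np.roll(np.eye(N), 1, axis=0)     # cyclic shift operator
rng = np.random.default_rng(0)
H = rng.normal(size=(N, K + 1))       # H[i, k]: tap h_k at node i
x = rng.normal(size=N)

# Node-variant filter: y = ( sum_k diag(h_k) S^k ) x
y = np.zeros(N)
for k in range(K + 1):
    y += np.diag(H[:, k]) @ np.linalg.matrix_power(S, k) @ x

# A conventional graph filter is the special case where every row of H is equal.
print(y.shape)  # (8,)
```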
  • Line Graph and Directed Cyclic Graph (DCG):

    • Line Graph (for sequences): A natural way to represent a sequence $v_1 \to v_2 \to \ldots \to v_N$ as a graph is a line graph, where each item $v_i$ is a node and there is a directed edge from $v_i$ to $v_{i+1}$.
    • Problem with Line Graphs for GSP: A line graph's adjacency matrix (shift operator) is typically not diagonalizable (i.e., it cannot be decomposed as $\mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}) \mathbf{U}^\top$ for spectral filtering), because the first node has no incoming edges, leaving a row of zeros and a rank deficiency.
    • Directed Cyclic Graph (DCG): To enable spectral filtering, TV-Rec models the sequence as a DCG. This is a line graph with an added edge connecting the last node ($v_N$) back to the first node ($v_1$), which makes the adjacency matrix circulant and thus diagonalizable.
    • Zero-Padding for Causality: To prevent information leakage (backward flow) through the cyclic edge, TV-Rec uses a zero-padding strategy: the input sequence is padded with zeros, and the filter order $K$ is limited so that the cyclic connection cannot affect the output for the original sequence, preserving causality.
  • Positional Encoding:

    • Conceptual Definition: In Transformer models, positional encodings are added to item embeddings to inject information about the absolute or relative position of items in a sequence. Without them, self-attention would be permutation-invariant, losing sequence order.
    • Types: Common types include sinusoidal positional encodings (fixed, based on sine/cosine functions) or learnable positional embeddings.
    • TV-Rec's Approach: TV-Rec argues that explicit positional encodings are unnecessary because the GFT basis (derived from the DCG) inherently captures frequency-based positional variations, and the time-variant filters dynamically modulate these components based on position.

3.2. Previous Works

The paper contextualizes TV-Rec by discussing various existing sequential recommendation (SR) approaches.

  • Traditional SR Models:

    • Markov chains [24]: Simple models that assume the next item depends only on the previous one or a few recent ones.
    • RNNs (Recurrent Neural Networks) [13]: Models like GRU4Rec use recurrent units (e.g., GRU) to capture temporal dependencies by maintaining a hidden state that processes sequences step-by-step. They are good for short-term patterns but struggle with long-term dependencies.
    • GNNs (Graph Neural Networks) [34]: Some SR models (e.g., SR-GNN [33], GC-SAN [35], GCL4SR [43]) use GNNs to model item-transition graphs, where items are nodes and transitions are edges, to capture complex item relationships.
  • Transformer-based Models:

    • SASRec [17]: A seminal work that adapted the Transformer architecture for SR, using self-attention to capture long-range dependencies in user sequences. It treats each user's sequence as a sentence.
    • BERT4Rec [29]: Inspired by BERT in NLP, it uses a bidirectional Transformer and a masked item prediction task for training, allowing it to learn contextual embeddings for items.
  • Convolution-only Models:

    • Caser [30]: One of the first to apply CNNs to SR, treating user-item interactions as images and using horizontal and vertical convolutions to capture sequential patterns.
    • NextItNet [40]: Utilizes dilated 1D convolutions and residual connections to capture both short- and long-range dependencies more efficiently than standard convolutions.
    • FMLPRec [45]: Employs Fourier Transforms and learnable filters within an all-MLP architecture to filter noise and enhance sequence representations in the frequency domain.
  • Hybrid (Convolution + Self-Attention) Models:

    • AdaMCT [16]: Combines 1D convolutions with Transformer attention to capture both short-term (convolution) and long-term (attention) user preferences.
    • BSARec [28]: Addresses limitations of self-attention by applying convolution using the Fourier Transform, aiming to introduce locality bias while retaining the benefits of attention.
  • Other Recent Advanced Methods:

    • DuoRec [23]: Employs contrastive learning with model-level augmentation to learn better sequence representations.
    • LRURec [41]: Explores linear recurrent units to balance efficiency and performance, especially for long sequences, by using principles from State Space Models.

3.3. Technological Evolution

The field of sequential recommendation has evolved from simpler statistical models to complex deep learning architectures:

  1. Early Models (Statistical/Rule-based): Initially, methods like Markov chains were dominant, focusing on direct transitions between items. These were simple but often limited in capturing complex, long-term patterns.
  2. Recurrent Neural Networks (RNNs): The introduction of RNNs (and their variants like GRUs and LSTMs) marked a shift towards models capable of handling variable-length sequences and learning more intricate temporal dependencies. GRU4Rec is a prime example.
  3. Convolutional Neural Networks (CNNs): CNNs, though initially for computer vision, were adapted for sequences (Caser, NextItNet) to capture local patterns efficiently. Their fixed filters, however, limited their global understanding.
  4. Transformers and Self-Attention: The Transformer architecture revolutionized sequence modeling due to its ability to model long-range dependencies effectively through self-attention (SASRec, BERT4Rec). This became the new state-of-the-art, but introduced computational overhead and a lack of inherent locality bias.
  5. Hybrid Models: To mitigate the shortcomings of pure Transformer or pure CNN models, hybrid approaches emerged (AdaMCT, BSARec) combining the strengths of both: convolutions for locality and attention for global context. However, these often retain the computational cost of self-attention.
  6. Efficiency-Focused and GSP-inspired Models: More recently, research has focused on improving efficiency for long sequences (LRURec) or finding alternative mechanisms to attention. TV-Rec falls into this category by reinterpreting sequential recommendation through the lens of Graph Signal Processing (GSP), specifically using time-variant convolutional filters to simultaneously address locality, long-range dependencies, and computational efficiency without relying on self-attention. This marks a significant departure from attention-centric designs towards more flexible and adaptive filtering mechanisms.

3.4. Differentiation Analysis

TV-Rec differentiates itself from existing methods by fundamentally altering how sequential patterns are captured, moving away from both fixed filters and self-attention.

  • Compared to Fixed Convolutional Filters (e.g., Caser, NextItNet, FMLPRec):

    • Core Difference: Traditional convolutional filters (fixed filters) apply the same weights across all positions in a sequence. TV-Rec uses time-variant convolutional filters (node-variant in GSP terms), meaning the filter coefficients adapt for each position in the sequence.
    • Innovation: This adaptability allows TV-Rec to capture position-specific semantics and temporally evolving preferences. While fixed filters might excel at recent items, TV-Rec can learn patterns relevant to both early and late stages of a sequence, providing a more comprehensive understanding of user behavior. The paper explicitly states that 1D CNN in AdaMCT corresponds to a fixed graph convolutional filter, and FMLPRec applies a discrete Fourier transform which can be seen as a fixed filter in the spectral domain. TV-Rec generalizes these by making the filter time-variant.
  • Compared to Self-Attention / Transformer-based Models (e.g., SASRec, BERT4Rec, DuoRec):

    • Core Difference: Transformer-based models rely on self-attention to capture long-range dependencies, which has quadratic computational complexity ($O(N^2 d)$) and lacks an inherent inductive bias toward sequential order, requiring positional embeddings. TV-Rec replaces self-attention entirely.
    • Innovation: TV-Rec achieves high expressive power and captures complex interaction patterns without self-attention, leading to reduced computation and accelerated inference ($O(N^3 + N^2 d)$ during training, but $O(N^2 d)$ at inference once the filter matrix is precomputed, since the trained time-variant filter is a fixed linear operator). It also inherently encodes positional information through its Graph Fourier Transform (GFT) on a Directed Cyclic Graph (DCG), removing the need for explicit positional embeddings.
  • Compared to Hybrid Models (e.g., AdaMCT, BSARec):

    • Core Difference: Hybrid models attempt to combine the strengths of convolutions and self-attention, but often still carry the computational burden and inductive bias limitations of self-attention.
    • Innovation: TV-Rec offers a unified approach where the time-variant filter alone is powerful enough to capture both local and global patterns, making self-attention redundant. This simplifies the architecture and improves efficiency. BSARec, another hybrid, applies convolution using Fourier Transform, but TV-Rec's filters are dynamic per position, making them more general.
  • Compared to GNN-based Methods (e.g., SR-GNN, GC-SAN, GCL4SR):

    • Core Difference (1 - Graph Definition): GNN-based models typically define graphs where items are nodes, and edges represent transitions or relationships between items, often merging identical items. TV-Rec reinterprets the sequence as a line graph (or DCG) where sequence positions are nodes, allowing repeated items at different temporal positions to be treated distinctly.

    • Innovation (1): This distinction allows TV-Rec to capture fine-grained temporal and positional dependencies that are lost when identical items are merged in traditional item-transition graphs.

    • Core Difference (2 - Information Propagation): GNN-based models rely on iterative message passing to update node embeddings, which can be computationally intensive. TV-Rec employs time-variant graph convolutional filters that operate directly in the spectral domain.

    • Innovation (2): This spectral filtering approach removes the need for heavy message passing, yielding a more efficient yet expressive representation of temporal dynamics. TV-Rec presents itself as a "non-GNN graph-based paradigm" for SR.

      In essence, TV-Rec offers a more unified, flexible, and efficient solution by deeply integrating graph signal processing concepts to create context-aware, adaptive convolutional filters that transcend the limitations of both fixed convolutions and self-attention.

4. Methodology

The proposed TV-Rec model introduces time-variant convolutional filters for sequential recommendation, drawing inspiration from graph signal processing (GSP). The core idea is to apply position-dependent filters to user sequences, allowing the model to adapt to varying temporal dynamics and capture both local and global patterns without relying on self-attention.

4.1. Principles

The fundamental principle behind TV-Rec is to treat a user's interaction sequence as a graph signal on a line graph, which is then processed by node-variant graph filters. These filters are "time-variant" because each position (node) in the sequence receives a distinct filter. This enables the model to:

  1. Adapt to Temporal Variations: Different stages of a sequence (e.g., early interactions vs. recent ones) often carry different semantic information. A time-variant filter can apply specific weights at each position, rather than a uniform one, to emphasize relevant information for that specific time point.

  2. Capture both Local and Global Patterns: By leveraging the frequency domain via the Graph Fourier Transform (GFT), the filters can capture both short-term dependencies (high frequencies) and long-term preferences (low frequencies). The time-variant nature allows this capture to be position-aware.

  3. Efficiency and Expressiveness: By operating as linear operators in the spectral domain, these filters aim to achieve the expressiveness typically associated with self-attention, but with improved computational efficiency, especially during inference.

    The methodology reinterprets the sequential nature of user behavior as a Directed Cyclic Graph (DCG) to enable spectral filtering, overcoming the non-diagonalizability issue of a simple line graph while carefully preventing backward information flow through zero-padding.

4.2. Core Methodology In-depth (Layer by Layer)

The TV-Rec architecture consists of three main modules: an embedding layer, a time-variant encoder, and a prediction layer.

The overall architecture of the proposed TV-Rec model is illustrated in Figure 2 of the original paper. It shows the flow from the input sequence through the embedding layer, multiple Time-Variant Encoding Blocks, and finally the prediction layer. Each encoding block comprises a Filter Layer and a Feed-Forward Network (FFN), each followed by a residual connection and layer normalization. The critical component is the Filter Layer, where the input $\mathbf{X}$ is transformed to the frequency domain using the GFT and processed by the time-variant convolutional filter.

(Figure 2 of the original paper, described above, depicts the overall TV-Rec architecture. Figure 3 of the original paper plots sensitivity to the number of basis vectors $m$, with bars for NDCG@20 and a line for HR@20.)

4.2.1. Embedding Layer

First, a user's historical interaction sequence is prepared for processing:

  1. Sequence Truncation/Padding: The raw user sequence $S^{(u)} = [v_1^{(u)}, v_2^{(u)}, \ldots, v_{|S^{(u)}|}^{(u)}]$ is processed to a fixed length $N$.

    • If $|S^{(u)}| \geq N$, the sequence is truncated to keep the most recent $N$ items.
    • If $|S^{(u)}| < N$, the sequence is padded with zeros at the beginning.
    • For simplicity, the user superscript $(u)$ is omitted, so the processed sequence is denoted as $s = (s_1, s_2, \ldots, s_N)$.
  2. Item Embedding: Each item $v$ in the sequence is converted into a continuous vector representation using an item embedding matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{V}| \times D}$, where $|\mathcal{V}|$ is the total number of unique items and $D$ is the latent dimension size. A look-up operation retrieves the embedding $\mathbf{E}_v$ for each item $v$.

  3. Normalization and Dropout: The sequence of item embeddings is then passed through Layer Normalization and Dropout. This produces the initial embedding representation of the user sequence, X0\mathbf{X}^0, which serves as the input to the time-variant encoder.

    The process is formally defined as: $ \mathbf{X}^0 = \operatorname{Dropout}(\operatorname{LayerNorm}([\mathbf{E}_{s_1}, \mathbf{E}_{s_2}, \ldots, \mathbf{E}_{s_N}]^\top)) $

  • $\mathbf{X}^0 \in \mathbb{R}^{N \times D}$: The initial sequence embedding, where $N$ is the sequence length and $D$ is the embedding dimension. Each row $\mathbf{X}^0_i$ corresponds to the embedding of item $s_i$.

  • $\mathbf{E}_{s_i}$: The embedding vector for item $s_i$, looked up from the item embedding matrix $\mathbf{E}$.

  • $\operatorname{LayerNorm}(\cdot)$: Applies layer normalization, which normalizes the inputs across the features of each sample independently, helping to stabilize training.

  • $\operatorname{Dropout}(\cdot)$: Applies dropout, a regularization technique that randomly sets a fraction of input units to zero at each update during training, preventing overfitting.

    Note on Positional Embedding: A key design choice is that positional embedding is not necessary in TV-Rec. This is because the time-variant convolutional filters inherently capture position-specific information through the spectral properties of the graph, as justified in Appendix C of the paper.
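A minimal PyTorch sketch of this embedding layer under the paper's setting ($D = 64$); the dropout rate, the use of index 0 for padding, and the item count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Sketch of X^0 = Dropout(LayerNorm([E_{s_1}, ..., E_{s_N}]))."""

    def __init__(self, num_items: int, dim: int = 64, dropout: float = 0.5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=0)  # 0 = pad
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, N) item indices, left-padded with 0
        return self.drop(self.norm(self.item_emb(seq)))  # (batch, N, D)

layer = EmbeddingLayer(num_items=12101)        # e.g., the Beauty item count
x0 = layer(torch.randint(1, 12102, (4, 50)))   # batch of 4 length-50 sequences
print(x0.shape)  # torch.Size([4, 50, 64])
```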

4.2.2. Time-Variant Encoder

The time-variant encoder is built by stacking LL identical time-variant encoding blocks. Each block processes the sequence embedding through a filter layer and a feed-forward network, incorporating residual connections and layer normalization after each sub-layer.

Filter Layer

In the $\ell$-th filter layer, the input is $\mathbf{X}^\ell$. This layer performs the core time-variant filtering operation.

  1. Transform to Frequency Domain: The input X\mathbf{X}^\ell is first transformed into the frequency domain using the Graph Fourier Transform (GFT). $ \widetilde{\mathbf{X}}^\ell = \mathbf{U}^\top \mathbf{X}^\ell $

    • $\widetilde{\mathbf{X}}^\ell \in \mathbb{C}^{N \times D}$: The input sequence embedding in the frequency domain. It is complex-valued because the GFT basis (the eigenvectors) can be complex.
    • $\mathbf{U} \in \mathbb{C}^{N \times N}$: The GFT matrix, derived from a padded Directed Cyclic Graph (DCG), which is used instead of a simple line graph to ensure diagonalizability and enable spectral filtering. Its columns are the eigenvectors of the DCG's shift operator; $\mathbf{U}^\top$ denotes its (conjugate) transpose.
    • This transformation decomposes the signal into its frequency components, allowing the filter to operate spectrally. The DCG and padding strategy are crucial for this step, as explained in Appendix B.
  2. Apply Time-Variant Convolutional Filter: The core filtering operation happens in the frequency domain, using a time-variant convolutional filter based on the node-variant graph filter concept: $ \widehat{\mathbf{X}}^\ell = \mathbf{G}_{nv} \mathbf{X}^\ell = \left( \mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top) \right) \widetilde{\mathbf{X}}^\ell = \left( \mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top) \right) \mathbf{U}^\top \mathbf{X}^\ell $

    • $\widehat{\mathbf{X}}^\ell \in \mathbb{C}^{N \times D}$: The filtered sequence embedding.
    • $\mathbf{G}_{nv}$: The node-variant graph filter operator.
    • $\mathbf{U}$: The GFT matrix (as described above), which here effectively performs the inverse GFT.
    • $\circ$: Element-wise (Hadamard) multiplication between matrices.
    • $\mathbf{H} \in \mathbb{C}^{N \times (K+1)}$: The filter tap matrix. Each row of $\mathbf{H}$ corresponds to a specific node (position) in the sequence, and its columns contain the filter taps for the different orders $k$. This is the "time-variant" aspect: each position has its own filter taps.
    • $\pmb{\Lambda} \in \mathbb{C}^{N \times (K+1)}$: The Vandermonde matrix formed from the eigenvalues of the shift operator, with elements $\Lambda_{ik} = \lambda_i^{k-1}$, where the $\lambda_i$ are the eigenvalues. This matrix captures the frequency response of each power of the shift operator.
    • The term $\mathbf{H} \pmb{\Lambda}^\top$ computes the frequency response of the node-variant filter at each node. The element-wise product with $\mathbf{U}$, applied to $\widetilde{\mathbf{X}}^\ell$, performs the spectral filtering and implicitly transforms the result back to the spatial domain.
    • The term $\mathbf{U}^\top \mathbf{X}^\ell$ is the GFT, taking $\mathbf{X}^\ell$ to the frequency domain.
  3. Constructing the Filter Matrix $\mathbf{H}$: To enhance expressive power, the filter matrix $\mathbf{H}$ is constructed as the product of two matrices: $ \mathbf{H} = \mathbf{C} \bar{\mathbf{B}} = \mathbf{C} \left( \frac{\mathbf{B}}{\|\mathbf{B}\|_2} \right) $

    • $\mathbf{C} \in \mathbb{R}^{N \times m}$: The coefficient matrix, which generates position-specific filters. Since each node corresponds to a position in the sequence, $\mathbf{C}$ can be interpreted as a function of time, determining how the basic filter components are combined at each position. It is a learnable component.
    • $\bar{\mathbf{B}} \in \mathbb{C}^{m \times (K+1)}$: The normalized basis matrix, containing $m$ basis vectors, where $m$ determines the number of base filter patterns. These basis vectors are shared across all positions, and $\mathbf{C}$ learns to combine them differently for each position.
    • $\mathbf{B} \in \mathbb{C}^{m \times (K+1)}$: The unnormalized basis matrix.
    • $\|\mathbf{B}\|_2$: The L2 norm, used to normalize each row of $\mathbf{B}$ for numerical stability.
    • By forming $\mathbf{H}$ this way, the model learns a set of basic filter patterns ($\mathbf{B}$) and then learns how to combine them differently at each sequence position ($\mathbf{C}$), achieving the "time-variant" effect.
  4. Residual Connection and Normalization: After the filtering operation, a residual connection (adding the input $\mathbf{X}^\ell$) is applied, followed by Dropout and Layer Normalization. This helps prevent overfitting and stabilizes the training of deep networks: $ \mathbf{F}^\ell = \mathrm{LayerNorm}(\mathbf{X}^\ell + \mathrm{Dropout}(\widehat{\mathbf{X}}^\ell)) $

    • $\mathbf{F}^\ell \in \mathbb{C}^{N \times D}$: The output of the filter layer.
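Putting the pieces together, here is a hedged PyTorch sketch of one filter layer (the class name and initialization scales are assumptions). For brevity it builds $\mathbf{U}$ and the eigenvalues from an unpadded cyclic shift operator, whereas the paper uses the zero-padded DCG of Appendix B, and taking the real part of the filtered output is a simplifying choice.

```python
import torch
import torch.nn as nn

class TimeVariantFilterLayer(nn.Module):
    """Sketch: X_hat = (U ∘ (H Λ^T)) U^H X, with H = C B̄ (learnable C, B)."""

    def __init__(self, N: int, K: int, m: int, dim: int, dropout: float = 0.2):
        super().__init__()
        # Cyclic shift operator of the DCG and its eigendecomposition (DFT basis)
        S = torch.roll(torch.eye(N), 1, dims=0).to(torch.complex64)
        lam, U = torch.linalg.eig(S)
        self.register_buffer("U", U)                            # (N, N)
        self.register_buffer("Lam", torch.stack(
            [lam**k for k in range(K + 1)], dim=1))             # Vandermonde (N, K+1)
        self.C = nn.Parameter(torch.randn(N, m) * 0.02)         # coefficient matrix
        self.B = nn.Parameter(torch.randn(m, K + 1, 2) * 0.02)  # basis (real/imag)
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)

    def forward(self, X: torch.Tensor) -> torch.Tensor:         # X: (batch, N, D) real
        B = torch.view_as_complex(self.B)
        B = B / B.norm(dim=1, keepdim=True)                     # row-normalized basis
        H = self.C.to(B.dtype) @ B                              # (N, K+1) filter taps
        G = self.U * (H @ self.Lam.transpose(0, 1))             # U ∘ (H Λ^T), (N, N)
        Xf = self.U.conj().transpose(0, 1) @ X.to(G.dtype)      # GFT: U^H X
        out = (G @ Xf).real                                     # filtered signal
        return self.norm(X + self.drop(out))                    # residual + LayerNorm
```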

Feed Forward Layer

Following the filter layer, a feed-forward network (FFN) is applied to introduce non-linearity and allow the model to learn more complex transformations: $ \widehat{\mathbf{F}}^\ell = \mathrm{FFN}(\mathbf{F}^\ell) = (\mathrm{GELU}(\mathbf{F}^\ell \mathbf{W}_1^\ell + \mathbf{b}_1^\ell)) \mathbf{W}_2^\ell + \mathbf{b}_2^\ell $

  • $\widehat{\mathbf{F}}^\ell \in \mathbb{C}^{N \times D}$: Output of the FFN.

  • $\mathbf{F}^\ell$: Input to the FFN (the output of the filter layer).

  • $\mathrm{GELU}(\cdot)$: The Gaussian Error Linear Unit activation function, which introduces non-linearity.

  • $\mathbf{W}_1^\ell, \mathbf{W}_2^\ell \in \mathbb{R}^{D \times D}$: Learnable weight matrices for the linear transformations.

  • $\mathbf{b}_1^\ell, \mathbf{b}_2^\ell \in \mathbb{R}^D$: Learnable bias vectors.

    As in the filter layer, dropout, a residual connection, and layer normalization are applied to the FFN output to produce the final output of the $\ell$-th block: $ \mathbf{X}^{\ell+1} = \mathrm{LayerNorm}(\mathbf{F}^\ell + \mathrm{Dropout}(\widehat{\mathbf{F}}^\ell)) $

  • $\mathbf{X}^{\ell+1} \in \mathbb{C}^{N \times D}$: The output of the $\ell$-th time-variant encoding block, which becomes the input to the next block, or the final representation if it is the last block.
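For completeness, a matching sketch of the feed-forward sub-layer with its residual connection and normalization (dimensions per the paper; the dropout rate is illustrative).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN of one encoding block: W_2 GELU(W_1 F + b_1) + b_2."""

    def __init__(self, dim: int = 64, dropout: float = 0.2):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)   # W_1, b_1
        self.w2 = nn.Linear(dim, dim)   # W_2, b_2
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        out = self.w2(self.act(self.w1(F)))   # FFN(F)
        return self.norm(F + self.drop(out))  # residual + LayerNorm
```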

4.2.3. Prediction Layer and Training

Prediction Layer

After passing through $L$ time-variant encoding blocks, the final sequence representation is $\mathbf{X}^L$. The embedding of the last item in this sequence, $\mathbf{X}_N^L$ (corresponding to the most recent interaction), is used to predict the next item. The model computes a preference score for each item in the entire item set $\mathcal{V}$: $ \hat{y}_v = p(v_{|S|+1} = v \mid S) = \mathbf{E}_v^\top \mathbf{X}_N^L $

  • $\hat{y}_v$: The predicted score (or probability) that item $v$ is the next item in the sequence.
  • $\mathbf{E}_v$: The embedding of item $v$ from the shared item embedding matrix $\mathbf{E}$.
  • $\mathbf{X}_N^L \in \mathbb{C}^D$: The embedding at the $N$-th (most recent) position of the final sequence representation $\mathbf{X}^L$, i.e., the row $\mathbf{X}^L_{N,:}$ of the $N \times D$ matrix $\mathbf{X}^L$.

Model Training

The model is optimized using a combination of cross-entropy loss and an orthogonal regularization term.

$ \mathcal{L} = \underbrace{-\log \frac{\exp(\hat{y}_g)}{\sum_{v \in \mathcal{V}} \exp(\hat{y}_v)}}_{\mathcal{L}_{\mathrm{ce}}} + \alpha \cdot \underbrace{\left( \left\|\mathbf{B}_{\mathrm{real}} \mathbf{B}_{\mathrm{real}}^\top - \mathbf{I}\right\|_F^2 + \left\|\mathbf{B}_{\mathrm{imag}} \mathbf{B}_{\mathrm{imag}}^\top - \mathbf{I}\right\|_F^2 \right)}_{\mathcal{L}_{\mathrm{ortho}}} $

  • $\mathcal{L}$: The total loss function to be minimized.
  • $\mathcal{L}_{\mathrm{ce}}$: The cross-entropy loss, a common loss function for classification tasks. It measures the discrepancy between the predicted probability distribution and the true distribution (a one-hot encoding of the ground-truth item).
    • $\hat{y}_g$: The predicted score for the ground-truth item $g$.
    • $\sum_{v \in \mathcal{V}} \exp(\hat{y}_v)$: The sum of exponentiated scores over all items in the vocabulary, used for normalization in the softmax function.
  • $\mathcal{L}_{\mathrm{ortho}}$: An orthogonal regularization term applied to the basis matrix $\mathbf{B}$ (used in constructing $\mathbf{H}$). This term encourages the basis vectors to be orthogonal, promoting diversity and stability in the learned filter patterns.
    • $\alpha$: A scalar hyperparameter that controls the regularization strength.
    • $\mathbf{B}_{\mathrm{real}}$: The real component of the complex-valued basis matrix $\mathbf{B}$.
    • $\mathbf{B}_{\mathrm{imag}}$: The imaginary component of the complex-valued basis matrix $\mathbf{B}$.
    • $\mathbf{I}$: The identity matrix.
    • $\|\cdot\|_F^2$: The squared Frobenius norm, which measures the "size" of a matrix. The term $\mathbf{M}\mathbf{M}^\top - \mathbf{I}$ evaluates how close $\mathbf{M}$ is to being orthogonal (i.e., $\mathbf{M}\mathbf{M}^\top = \mathbf{I}$).
  • The model is optimized using the Adam optimizer.
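A hedged sketch of this objective (the function name and tensor shapes are assumptions; it mirrors $\mathcal{L}_{\mathrm{ce}} + \alpha \mathcal{L}_{\mathrm{ortho}}$ as written above):

```python
import torch
import torch.nn.functional as F

def tv_rec_loss(scores, target, B_real, B_imag, alpha=1e-3):
    """Cross-entropy over all items plus orthogonal regularization on B.

    scores: (batch, |V|) preference scores E^T X_N^L; target: (batch,) ground truth.
    B_real, B_imag: (m, K+1) real/imaginary parts of the basis matrix.
    """
    ce = F.cross_entropy(scores, target)               # softmax + negative log-lik.
    I = torch.eye(B_real.size(0), device=B_real.device)
    ortho = ((B_real @ B_real.T - I) ** 2).sum() + \
            ((B_imag @ B_imag.T - I) ** 2).sum()       # squared Frobenius norms
    return ce + alpha * ortho
```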

4.2.4. Appendix A: Proof of Frequency Response of Node-Variant Graph Filters

The supplementary material provides a detailed proof for the frequency response of the node-variant graph filter Gnv\mathbf{G}_{nv}. This is crucial for understanding how the filter operates in the spectral domain.

Proof Derivation:

  1. Definition of Node-Variant Graph Filter: The output graph signal $\mathbf{y}$ produced from an input signal $\mathbf{x}$ by a node-variant graph filter $\mathbf{G}_{nv}$ of order $K$ is defined as: $ \mathbf{y} = \mathbf{G}_{nv} \mathbf{x} = \left( \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{S}^k \right) \mathbf{x} $

    • $\mathbf{x} \in \mathbb{R}^N$: Input graph signal.
    • $\mathbf{S} \in \mathbb{R}^{N \times N}$: Shift operator.
    • $\mathrm{diag}(\mathbf{h}_k)$: A diagonal matrix, where $\mathbf{h}_k$ is a vector of filter taps, allowing each node to have its own tap.
    • $K$: Filter order.
  2. GFT and Eigen-Decomposition: The shift operator $\mathbf{S}$ is eigen-decomposed as $\mathbf{S} = \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}) \mathbf{U}^\top$. The GFT of a signal $\mathbf{x}$ is $\tilde{\mathbf{x}} = \mathbf{U}^\top \mathbf{x}$; similarly, $\tilde{\mathbf{y}} = \mathbf{U}^\top \mathbf{y}$.

  3. Applying the GFT to the Filtered Output: Substituting $\mathbf{y}$ and $\mathbf{S}^k = \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda})^k \mathbf{U}^\top$ into the expression for $\tilde{\mathbf{y}}$: $ \tilde{\mathbf{y}} = \mathbf{U}^\top \mathbf{y} = \mathbf{U}^\top \left( \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{S}^k \right) \mathbf{x} = \mathbf{U}^\top \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \tilde{\mathbf{x}} $

    • $\boldsymbol{\lambda}^k \in \mathbb{C}^N$: The vector whose components are the $k$-th powers of the corresponding eigenvalues, i.e., $[\boldsymbol{\lambda}^k]_i = \lambda_i^k$.
    • The filter taps $\mathbf{h}_k$ for all orders $k$ are grouped into a matrix $\mathbf{H} \in \mathbb{C}^{N \times (K+1)}$, whose $k$-th column is $\mathbf{h}_k$.
    • A Vandermonde matrix $\pmb{\Lambda} \in \mathbb{C}^{N \times (K+1)}$ is defined with $\Lambda_{ik} = \lambda_i^{k-1}$. (Note: the paper's definition writes $\Lambda_{ik} = \lambda_i^{k-1}$ with 1-based column indexing, while the derivation below writes $\lambda_j^k$ for the $k$-th eigenvalue power; both describe the same matrix of eigenvalue powers.)
  4. Term-by-Term Analysis of the Summation: Consider the $ij$-th element of the summation term $\sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k)$: $ \left[ \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \right]_{ij} = \sum_{k=0}^K h_k^{(i)} \lambda_j^k u_{ij} = u_{ij} \sum_{k=0}^K h_k^{(i)} \lambda_j^k $

    • $h_k^{(i)}$: The $i$-th component of the vector $\mathbf{h}_k$, i.e., the entry in row $i$ and column $k$ of $\mathbf{H}$.
    • $\lambda_j^k$: The $j$-th eigenvalue raised to the power $k$.
    • $u_{ij}$: The element at row $i$, column $j$ of the GFT matrix $\mathbf{U}$.
  5. Relating to $\mathbf{H} \pmb{\Lambda}^\top$: The $ij$-th element of the product $\mathbf{H} \pmb{\Lambda}^\top$ is: $ [\mathbf{H} \pmb{\Lambda}^\top]_{ij} = \sum_{k=0}^K h_k^{(i)} \lambda_j^k $ (This takes $\Lambda_{jk} = \lambda_j^k$ for column $k$, which aligns with the derivation and is the standard way to define the matrix of eigenvalue powers.)

  6. Substitution and Final Form: Substituting this back into the term-by-term analysis: $ \left[ \sum_{k=0}^K \mathrm{diag}(\mathbf{h}_k) \mathbf{U} \mathrm{diag}(\boldsymbol{\lambda}^k) \right]_{ij} = [\mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top)]_{ij} $ Therefore, the frequency response of the node-variant graph filter is: $ \tilde{\mathbf{y}} = \tilde{\mathbf{G}}_{nv} \tilde{\mathbf{x}} = \mathbf{U}^\top (\mathbf{U} \circ (\mathbf{H} \pmb{\Lambda}^\top)) \tilde{\mathbf{x}} $ This shows how the filter operates on the signal in the frequency domain, modulating the frequency components according to the position-specific filter taps $\mathbf{H}$ and the graph's eigenvalues $\pmb{\Lambda}$.
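The identity can be checked numerically. The sketch below (not from the paper) builds a random node-variant filter on a directed cycle and confirms that the two sides of the final equation agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 3
S = np.roll(np.eye(N), 1, axis=0)           # circulant shift operator
lam, U = np.linalg.eig(S)                   # S = U diag(lam) U^H

H = rng.normal(size=(N, K + 1))             # taps: H[i, k] = h_k^{(i)}
Lam = np.stack([lam**k for k in range(K + 1)], axis=1)  # Λ_{jk} = λ_j^k
x = rng.normal(size=N)

# Left-hand side: y~ = U^H ( Σ_k diag(h_k) S^k ) x
y_lhs = U.conj().T @ sum(
    np.diag(H[:, k]) @ np.linalg.matrix_power(S, k) @ x for k in range(K + 1))

# Right-hand side: y~ = U^H ( U ∘ (H Λ^T) ) x~,  with x~ = U^H x
y_rhs = U.conj().T @ ((U * (H @ Lam.T)) @ (U.conj().T @ x))

print(np.allclose(y_lhs, y_rhs))  # True
```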

4.2.5. Appendix B: Theoretical Justification for DCG-Based Filtering

This section justifies the use of a zero-padded Directed Cyclic Graph (DCG) for spectral filtering instead of a simple line graph, ensuring no backward information flow.

  1. Problem with Line Graph:

    • For spectral filtering, the Graph Fourier Transform (GFT) requires the shift operator (e.g., the adjacency matrix) to be diagonalizable.
    • A line graph (representing a sequence $v_1 \to v_2 \to \ldots \to v_N$) has an adjacency matrix of rank $N-1$: the first node $v_1$ has no incoming edges, so its corresponding row is entirely zero. This makes the matrix defective and non-diagonalizable, hindering spectral analysis.
  2. Solution: Directed Cyclic Graph (DCG):

    • To overcome this, the sequence is modeled as a DCG, obtained by adding a single edge from the last node ($v_N$) back to the first node ($v_1$) of the line graph.
    • This addition makes the adjacency matrix circulant (each row is a cyclic shift of the previous one). Circulant matrices are always diagonalizable, enabling spectral filtering in the Fourier domain.
  3. Preventing Backward Information Flow (Causality):

    • The cyclic edge in a DCG inherently introduces a backward connection (from future to past), which is undesirable in sequential recommendation as it would cause information leakage.
    • To prevent this, a padding strategy is employed. The input sequence $\mathbf{x} \in \mathbb{R}^N$ is zero-padded with $K$ zeros to form an extended vector: $ \tilde{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ \mathbf{0} \end{bmatrix} \in \mathbb{R}^{N+K} $
      • $\mathbf{x}$: The original input sequence (a vector of length $N$).
      • $\mathbf{0} \in \mathbb{R}^K$: A zero vector of length $K$.
      • $\tilde{\mathbf{x}}$: The padded input sequence (a vector of length $N+K$).
    • A circulant shift operator $\mathbf{S} \in \mathbb{R}^{(N+K) \times (N+K)}$ is defined for this extended DCG.
    • A spectral filter of order $K$ is then applied: $ g(\mathbf{S}) = \sum_{k=0}^K h_k \mathbf{S}^k $
      • $h_k$: Filter coefficients.
    • The filtered output $\tilde{\mathbf{y}}$ is computed by applying the filter to the padded input: $ \tilde{\mathbf{y}} = g(\mathbf{S}) \tilde{\mathbf{x}} $
    • Extracting the Causal Output: The final causal output $\mathbf{y}$ (equivalent to filtering on the original line graph) is obtained by taking only the first $N$ elements of $\tilde{\mathbf{y}}$: $ \mathbf{y} = [\tilde{\mathbf{y}}]_{1:N} $
      • The equivalence holds because the last $K$ entries of $\tilde{\mathbf{x}}$ are zeros. Since the filter involves powers of $\mathbf{S}$ only up to order $K$, these zero entries ensure that the cyclic shifts (which would otherwise carry "future" information to the "past") never influence the first $N$ output values, blocking any backward leakage.
    • In summary, the zero-padded DCG allows spectral filtering (thanks to diagonalizability) while preserving the causal structure of the sequential data, enabling efficient and accurate modeling of user interactions.
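A small NumPy check of this causality argument (illustrative): perturbing "future" positions of the input leaves earlier outputs of the zero-padded circulant filter unchanged.

```python
import numpy as np

N, K = 6, 3
rng = np.random.default_rng(0)
h = rng.normal(size=K + 1)                     # filter taps h_0..h_K
S = np.roll(np.eye(N + K), 1, axis=0)          # circulant shift, size N+K

def causal_filter(x):
    x_pad = np.concatenate([x, np.zeros(K)])   # zero-pad with K zeros
    y_pad = sum(h[k] * np.linalg.matrix_power(S, k) @ x_pad
                for k in range(K + 1))
    return y_pad[:N]                           # keep only the first N outputs

x = rng.normal(size=N)
x_future = x.copy()
x_future[4:] += 10.0                           # perturb the "future" items

# Outputs before the perturbation are unchanged: no backward leakage.
print(np.allclose(causal_filter(x)[:4], causal_filter(x_future)[:4]))  # True
```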

4.2.6. Appendix C: Equivalence of Positional Encoding and Graph Fourier Basis

This section explains why TV-Rec does not require explicit positional encoding, unlike Transformer models, due to the inherent positional information captured by its Graph Fourier Basis.

  1. Sinusoidal Positional Encoding in Transformer:

    • Transformer models use sinusoidal functions to encode absolute position pos within a sequence.
    • For a given position $pos$ and embedding dimension index $2i$ or $2i+1$: $ \mathrm{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $
      • $\mathrm{PE}$: Positional Encoding.
      • $pos$: The absolute position in the sequence.
      • $i$: Index over the dimensions of the embedding.
      • $d$: Total dimension of the embedding.
    • This generates a set of periodic signals with varying frequencies, forming a basis for encoding positional variations.
  2. Graph Fourier Basis in TV-Rec:

    • In TV-Rec, the shift operator of the DCG is a circulant matrix, and the eigenvectors of a circulant matrix are precisely the Discrete Fourier Transform (DFT) basis vectors.
    • The elements of the GFT matrix $\mathbf{U}$ for a circulant shift operator are: $ \mathbf{U}_{kn} = \frac{1}{\sqrt{N}} e^{-i 2\pi kn/N} = \frac{1}{\sqrt{N}} \left[ \cos\left(\frac{2\pi kn}{N}\right) - i \sin\left(\frac{2\pi kn}{N}\right) \right] $
      • $k$: Index of the eigenvector (frequency component).
      • $n$: Index of the node (position).
      • $N$: Sequence length.
      • This shows that the GFT basis is composed of trigonometric functions (sines and cosines) that inherently capture periodic variations across positions.
  3. Equivalence in Representational Capacity:

    • Both the Transformer's sinusoidal positional encodings and TV-Rec's GFT basis (the DFT basis) are sets of trigonometric functions.
    • While the frequencies are sampled differently (on a logarithmic scale for the Transformer vs. linearly for the GFT), both sets of basis functions span the same space of length-$N$ periodic signals.
    • The real-valued DFT basis consists directly of $\sin\left(\frac{2\pi kn}{N}\right)$ and $\cos\left(\frac{2\pi kn}{N}\right)$ for the various $k$.
    • TV-Rec therefore implicitly encodes positional information through its spectral decomposition.
  4. Implication for Positional Encoding:

    • TV-Rec projects the input sequence $\mathbf{x}$ into the GFT domain ($\tilde{\mathbf{x}} = \mathbf{U}^\top \mathbf{x}$) and processes it there, inherently capturing frequency-based positional variations.
    • Furthermore, the time-variant filter acts as a position-specific operation, dynamically modulating these frequency components per position.
    • Consequently, the combination of GFT and time-variant filters naturally handles positional information, rendering explicit positional embeddings unnecessary. Ablation studies confirm this, showing no performance benefit from adding them.
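A quick numerical illustration of the claimed equivalence (not from the paper): each GFT basis vector of the circulant shift operator is exactly a cosine/sine pair over positions, the same family of signals used by sinusoidal positional encodings.

```python
import numpy as np

N = 8
n = np.arange(N)
k = 2                                             # one frequency component
u_k = (1 / np.sqrt(N)) * np.exp(-1j * 2 * np.pi * k * n / N)

# The GFT basis vector is a cosine/sine pair over positions n.
print(np.allclose(u_k.real, np.cos(2 * np.pi * k * n / N) / np.sqrt(N)))   # True
print(np.allclose(u_k.imag, -np.sin(2 * np.pi * k * n / N) / np.sqrt(N)))  # True
```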

5. Experimental Setup

5.1. Datasets

The authors evaluate TV-Rec on six publicly available benchmark datasets, selected to cover varying sparsity and domains. The preprocessing procedures from previous works [44, 45] are followed, treating all reviews and ratings as implicit feedback.

The following are the results from Table 6 of the original paper:

| Dataset | # Users | # Items | # Interactions | Avg. Length | Sparsity |
|---|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 8.9 | 99.93% |
| Sports | 25,598 | 18,357 | 296,337 | 8.3 | 99.95% |
| Yelp | 30,431 | 20,033 | 316,354 | 10.4 | 99.95% |
| LastFM | 1,090 | 3,646 | 52,551 | 48.2 | 98.68% |
| ML-1M | 6,041 | 3,417 | 999,611 | 165.5 | 95.16% |
| Foursquare | 1,083 | 9,989 | 179,468 | 165.7 | 98.34% |

Here's a description of each dataset:

  • Beauty: An Amazon product review dataset (from [22]) belonging to the "Beauty" category. It has a moderate number of users and items, with relatively short average sequence lengths (8.9).

  • Sports: Another Amazon product review dataset (from [22]), specifically the "Sports and Outdoors" category. Similar to Beauty in terms of user/item count and average length (8.3), indicating sparse interactions.

  • Yelp: A large dataset from Yelp, containing business recommendations. Interactions after 2019/01/01 are used due to its size. It has a higher number of users, items, and interactions, but still a short average sequence length (10.4).

  • LastFM: Contains artist listening records, used for recommending musicians. This dataset has fewer users and items but a significantly longer average interaction length (48.2) compared to the Amazon and Yelp datasets, and lower sparsity.

  • ML-1M (MovieLens-1M): A classic movie recommendation dataset [11] known for its detailed user interaction data. It features a moderate number of users and items but a very long average interaction length (165.5), and the lowest sparsity among the datasets.

  • Foursquare: Provides user check-ins across New York City over 10 months. Similar to ML-1M, it has a long average interaction length (165.7), but with fewer users and more items.

    These datasets were chosen because they represent diverse domains (e-commerce, social media, entertainment) and varying interaction characteristics (short vs. long sequences, different sparsity levels). This diversity allows for a comprehensive evaluation of the model's robustness and generalization capabilities across different sequential recommendation scenarios.

5.2. Evaluation Metrics

To evaluate recommendation performance, the paper uses two widely adopted top-$r$ metrics: HR@r (Hit Rate) and NDCG@r (Normalized Discounted Cumulative Gain), with $r$ set to 5, 10, and 20. For fair comparison, ranking is performed over the entire item set without negative sampling [19].

  1. Hit Rate (HR@r):

    • Conceptual Definition: Hit Rate at $r$ (HR@r) measures the proportion of test cases (user sequences) for which the ground-truth item (the next item the user actually interacted with) appears among the top-$r$ recommended items. It answers: "How often was the correct item among the top $r$ recommendations?" HR is a recall-oriented metric.
    • Mathematical Formula: $ \mathrm{HR@r} = \frac{\text{Number of hits @r}}{\text{Total number of test cases}} $
    • Symbol Explanation:
      • Number of hits @r: The count of user test sequences where the ground-truth item is found in the model's top-$r$ recommended list.
      • Total number of test cases: The total number of user test sequences for which a recommendation was attempted.
  2. Normalized Discounted Cumulative Gain (NDCG@r):

    • Conceptual Definition: Normalized Discounted Cumulative Gain at $r$ (NDCG@r) evaluates the quality of a ranked list of recommendations, accounting for both the relevance of items and their positions in the list: more relevant items appearing earlier receive higher scores. It is a measure of ranking quality.
    • Mathematical Formula: $ \mathrm{NDCG@r} = \frac{\mathrm{DCG@r}}{\mathrm{IDCG@r}} $ where $ \mathrm{DCG@r} = \sum_{i=1}^{r} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $ \mathrm{IDCG@r} = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i+1)} $
    • Symbol Explanation:
      • $\mathrm{DCG@r}$ (Discounted Cumulative Gain): Measures the relevance of items in the top-$r$ positions, penalizing relevant items that appear lower in the list.
      • $\mathrm{IDCG@r}$ (Ideal Discounted Cumulative Gain): The maximum possible DCG@r, achieved by an ideal ranking with all relevant items placed at the top in decreasing order of relevance. It normalizes DCG to a value between 0 and 1.
      • $i$: The rank position of an item in the recommendation list (from 1 to $r$).
      • $rel_i$: The relevance score of the item at position $i$. In sequential recommendation with a single ground-truth item, $rel_i$ is 1 if the item at rank $i$ is the ground-truth item and 0 otherwise; in that case, $\mathrm{IDCG@r} = \frac{2^1 - 1}{\log_2(1+1)} = 1$ for any $r \geq 1$.
      • $|REL|$: The number of relevant items in the ideal ranking (1 for a single ground-truth item).
      • $\log_2(i+1)$: The logarithmic discount factor, which reduces the weight of items at lower ranks.
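For concreteness, a small NumPy sketch (not from the paper; the function name is an assumption) computing HR@r and NDCG@r for the single-ground-truth setting described above, ranking over the full item set:

```python
import numpy as np

def hr_ndcg_at_r(scores, target, r=10):
    """HR@r and NDCG@r with one ground-truth item per user.

    scores: (num_users, num_items) predicted scores over the full item set;
    target: (num_users,) index of each user's ground-truth next item.
    """
    # 0-based rank of the target item among all items, per user
    target_scores = scores[np.arange(len(target)), target]
    ranks = (scores > target_scores[:, None]).sum(axis=1)
    hits = ranks < r
    hr = hits.mean()
    # With a single relevant item, IDCG@r = 1, so NDCG is just the discount
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 2), 0.0).mean()
    return hr, ndcg

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 500))
target = rng.integers(0, 500, size=100)
print(hr_ndcg_at_r(scores, target, r=10))
```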

5.3. Baselines

TV-Rec is compared against a comprehensive set of ten state-of-the-art sequential recommendation baselines, including conventional and recent hybrid models, and three GNN-based models. These baselines are chosen for their representativeness of different architectural paradigms in SR.

Sequential Recommendation Baselines:

  1. Caser [30]: A CNN-based model that uses horizontal and vertical convolutions to capture sequential patterns, treating interactions like images.
  2. GRU4Rec [13]: An RNN-based model that employs Gated Recurrent Units (GRUs) to capture temporal dynamics in user interaction sequences.
  3. SASRec [17]: A Transformer-based model that uses a multi-head self-attention mechanism to capture long-range dependencies.
  4. BERT4Rec [29]: A bidirectional Transformer-based model that uses a masked item prediction task for training, similar to BERT in NLP.
  5. NextItNet [40]: A CNN-based model that utilizes dilated convolutions and residual connections to efficiently capture both short- and long-range dependencies.
  6. FMLPRec [45]: An all-MLP architecture that incorporates Fourier Transform and learnable filters to reduce noise and enhance sequence representations in the frequency domain.
  7. DuoRec [23]: A Transformer-based model that uses contrastive learning with model-level augmentation and semantic positive samples to learn robust sequence representations.
  8. LRURec [41]: A model based on linear recurrent units designed for rapid inference and parallelization, particularly effective for long sequences.
  9. AdaMCT [16]: A hybrid model that combines Transformer attention with local 1D convolutional filters to capture both long-term and short-term user preferences.
  10. BSARec [28]: A hybrid model that combines Transformer self-attention with the Fourier Transform to address the oversmoothing problem and introduce locality.

GNN-based Sequential Recommendation Baselines:

  1. SR-GNN [33]: A GNN-based model that converts user sessions into graphs and applies GNNs to capture item transition relationships.
  2. GC-SAN [35]: A GNN-based model that dynamically builds a graph for each sequence and combines GNNs with self-attention to model both local and long-range item dependencies.
  3. GCL4SR [43]: A GNN-based model that constructs a global item transition graph across all users and uses graph contrastive learning to integrate global and local context.

5.4. Implementation Details

All experiments were conducted on a consistent hardware and software setup:

  • Hardware: Dual Intel Xeon CPUs and an NVIDIA RTX A6000 GPU.
  • Software: Ubuntu 20.04.6 LTS, Python 3.9.7, PyTorch 2.2.2, CUDA 11.1.74, and NVIDIA Driver 550.54.14.

Hyperparameters for Standard Sequential Recommendation:

  • Maximum Sequence Length ($N$): Set to 50 for standard experiments. For the long-range experiments (ML-1M and Foursquare), $N$ was set to 200.

  • Learning Rates: Tuned from $\{5 \times 10^{-4}, 1 \times 10^{-3}\}$.

  • Orthogonal Regularization Coefficient ($\alpha$): Tuned from $\{0, 1 \times 10^{-3}, 1 \times 10^{-5}\}$.

  • Dropout Rates ($p$): Tuned from $\{0.1, 0.2, 0.3, 0.4, 0.5\}$.

  • Number of Basis Vectors ($m$): Tuned from $\{8, 16, 32\}$.

  • Order of Time-Variant Convolutional Filter ($K$): Set equal to the maximum sequence length $N$ (i.e., 50 for standard experiments).

  • Batch Size: 256.

  • Latent Dimension Size ($D$): 64.

  • Number of Time-Variant Blocks ($L$): 2.

  • Optimizer: Adam.

    The following are the results from Table 7 of the original paper:

| Dataset | Learning Rate | $\alpha$ | $p$ | $m$ |
| --- | --- | --- | --- | --- |
| L = 50 | | | | |
| Beauty | $5 \times 10^{-4}$ | 0 | 0.5 | 32 |
| Sports | $5 \times 10^{-4}$ | 0 | 0.5 | 16 |
| Yelp | $5 \times 10^{-4}$ | $1 \times 10^{-3}$ | 0.1 | 16 |
| LastFM | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ | 0.4 | 8 |
| ML-1M | $1 \times 10^{-3}$ | $1 \times 10^{-5}$ | 0.3 | 8 |
| Foursquare | $5 \times 10^{-4}$ | $1 \times 10^{-5}$ | 0.2 | 8 |
| L = 200 | | | | |
| ML-1M | $1 \times 10^{-3}$ | 0 | 0.1 | 16 |
| Foursquare | $5 \times 10^{-4}$ | $1 \times 10^{-5}$ | 0.1 | 8 |
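
For concreteness, the tuned Beauty configuration from Table 7 combined with the fixed settings above could be collected into a single configuration object; the sketch below is purely illustrative (the key names are hypothetical, not the authors' code):

```python
# Hypothetical hyperparameter bundle for Beauty (the "L = 50" block of Table 7,
# plus the fixed settings listed above).
beauty_config = {
    "max_seq_len": 50,        # N; the filter order K is set equal to N
    "learning_rate": 5e-4,
    "ortho_reg_alpha": 0.0,   # orthogonal regularization coefficient (alpha)
    "dropout_rate": 0.5,      # p
    "num_basis_vectors": 32,  # m
    "batch_size": 256,
    "latent_dim": 64,         # D
    "num_tv_blocks": 2,       # number of time-variant blocks
    "optimizer": "Adam",
}
```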

6. Results & Analysis

6.1. Core Results Analysis

Overall Performance

The paper demonstrates that TV-Rec consistently outperforms all baseline methods across six diverse datasets.

The following are the results from Table 2 of the original paper:

| Datasets | Metric | Caser | GRU4Rec | SASRec | BERT4Rec | NextItNet | FMLPRec | DuoRec | LRURec | AdaMCT | BSARec | TV-Rec | Improv. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | HR@5 | 0.0149 | 0.0170 | 0.0368 | 0.0491 | 0.0549 | 0.0423 | 0.0680 | 0.0648 | 0.0675 | 0.0714 | 0.0721 | 0.98% |
| | HR@10 | 0.0253 | 0.0307 | 0.0574 | 0.0742 | 0.0779 | 0.0639 | 0.0944 | 0.0889 | 0.0925 | 0.1017 | 0.1045 | 2.73% |
| | HR@20 | 0.0416 | 0.0499 | 0.0860 | 0.1079 | 0.1100 | 0.0949 | 0.1279 | 0.1197 | 0.1299 | 0.1393 | 0.1403 | 0.72% |
| | NDCG@5 | 0.0089 | 0.0105 | 0.0241 | 0.0318 | 0.0392 | 0.0272 | 0.0485 | 0.0472 | 0.0489 | 0.0501 | 0.0513 | 2.40% |
| | NDCG@10 | 0.0122 | 0.0149 | 0.0307 | 0.0399 | 0.0467 | 0.0341 | 0.0570 | 0.0549 | 0.0569 | 0.0590 | 0.0608 | 3.05% |
| | NDCG@20 | 0.0164 | 0.0198 | 0.0379 | 0.0484 | 0.0547 | 0.0419 | 0.0654 | 0.0627 | 0.0664 | 0.0691 | 0.0705 | 2.03% |
| Sports | HR@5 | 0.0091 | 0.0131 | 0.0215 | 0.0279 | 0.0311 | 0.0222 | 0.0390 | 0.0351 | 0.0386 | 0.0390 | 0.0431 | 2.13% |
| | HR@10 | 0.0147 | 0.0211 | 0.0319 | 0.0434 | 0.0458 | 0.0358 | 0.0549 | 0.0502 | 0.0544 | 0.0623 | 0.0635 | 1.93% |
| | HR@20 | 0.0253 | 0.0347 | 0.0485 | 0.0658 | 0.0682 | 0.0549 | 0.0779 | 0.0698 | 0.0769 | 0.0865 | 0.0880 | 1.73% |
| | NDCG@5 | 0.0064 | 0.0084 | 0.0142 | 0.0182 | 0.0212 | 0.0148 | 0.0276 | 0.0242 | 0.0272 | 0.0296 | 0.0298 | 0.68% |
| | NDCG@10 | 0.0082 | 0.0110 | 0.0175 | 0.0217 | 0.0260 | 0.0191 | 0.0328 | 0.0291 | 0.0322 | 0.0361 | 0.0363 | 0.55% |
| | NDCG@20 | 0.0109 | 0.0144 | 0.0239 | 0.0288 | 0.0316 | 0.0239 | 0.0385 | 0.0340 | 0.0379 | 0.0425 | 0.0428 | 0.71% |
| Yelp | HR@5 | 0.0131 | 0.0137 | 0.0165 | 0.0243 | 0.0247 | 0.0195 | 0.0277 | 0.0240 | 0.0239 | 0.0260 | 0.0290 | 4.69% |
| | HR@10 | 0.0230 | 0.0240 | 0.0267 | 0.0411 | 0.0423 | 0.0313 | 0.0450 | 0.0396 | 0.0404 | 0.0446 | 0.0474 | 5.33% |
| | HR@20 | 0.0388 | 0.0412 | 0.0445 | 0.0681 | 0.0694 | 0.0518 | 0.0730 | 0.0656 | 0.0670 | 0.0718 | 0.0777 | 6.44% |
| | NDCG@5 | 0.0080 | 0.0086 | 0.0103 | 0.0154 | 0.0151 | 0.0122 | 0.0179 | 0.0151 | 0.0153 | 0.0162 | 0.0186 | 3.91% |
| | NDCG@10 | 0.0112 | 0.0119 | 0.0135 | 0.0208 | 0.0208 | 0.0160 | 0.0234 | 0.0201 | 0.0206 | 0.0222 | 0.0245 | 4.70% |
| | NDCG@20 | 0.0151 | 0.0162 | 0.0180 | 0.0275 | 0.0276 | 0.0211 | 0.0304 | 0.0266 | 0.0272 | 0.0290 | 0.0321 | 5.59% |
| LastFM | HR@5 | 0.0303 | 0.0339 | 0.0422 | 0.0358 | 0.0431 | 0.0450 | 0.0404 | 0.0358 | 0.0468 | 0.0505 | 0.0596 | 18.02% |
| | HR@10 | 0.0459 | 0.0394 | 0.0670 | 0.0606 | 0.0624 | 0.0670 | 0.0587 | 0.0532 | 0.0716 | 0.0679 | 0.0853 | 19.13% |
| | HR@20 | 0.0606 | 0.0550 | 0.0972 | 0.0908 | 0.0936 | 0.1000 | 0.0872 | 0.0807 | 0.1018 | 0.1119 | 0.1202 | 7.42% |
| | NDCG@5 | 0.0222 | 0.0231 | 0.0301 | 0.0213 | 0.0264 | 0.0321 | 0.0276 | 0.0257 | 0.0330 | 0.0348 | 0.0402 | 15.52% |
| | NDCG@10 | 0.0269 | 0.0249 | 0.0382 | 0.0291 | 0.0325 | 0.0392 | 0.0336 | 0.0312 | 0.0409 | 0.0405 | 0.0484 | 18.34% |
| | NDCG@20 | 0.0306 | 0.0288 | 0.0458 | 0.0366 | 0.0402 | 0.0475 | 0.0407 | 0.0380 | 0.0485 | 0.0514 | 0.0572 | 11.28% |
| ML-1M | HR@5 | 0.1033 | 0.1225 | 0.1406 | 0.1651 | 0.1858 | 0.1329 | 0.1916 | 0.2013 | 0.1909 | 0.2013 | 0.2115 | 5.06% |
| | HR@10 | 0.1671 | 0.2199 | 0.2442 | 0.2724 | 0.2848 | 0.1821 | 0.2798 | 0.2904 | 0.2798 | 0.2904 | 0.2961 | 1.97% |
| | HR@20 | 0.2598 | 0.1925 | 0.3250 | 0.3459 | 0.3853 | 0.2089 | 0.3757 | 0.3886 | 0.3647 | 0.3844 | 0.4079 | 4.97% |
| | NDCG@5 | 0.0663 | 0.0779 | 0.0920 | 0.1077 | 0.1264 | 0.0861 | 0.1226 | 0.1339 | 0.1185 | 0.1286 | 0.1371 | 2.39% |
| | NDCG@10 | 0.0868 | 0.1006 | 0.1174 | 0.1332 | 0.1543 | 0.1105 | 0.1507 | 0.1640 | 0.1438 | 0.1573 | 0.1658 | 1.10% |
| | NDCG@20 | 0.1101 | 0.1253 | 0.1438 | 0.1588 | 0.1829 | 0.1388 | 0.1776 | 0.1901 | 0.1711 | 0.1836 | 0.1955 | 2.84% |
| Foursquare | HR@5 | 0.0139 | 0.0148 | 0.0139 | 0.0139 | 0.0129 | 0.0120 | 0.0139 | 0.0148 | 0.0157 | 0.0148 | 0.0175 | 11.46% |
| | HR@10 | 0.0175 | 0.0157 | 0.0185 | 0.0157 | 0.0166 | 0.0148 | 0.0194 | 0.0203 | 0.0203 | 0.0203 | 0.0248 | 22.17% |
| | HR@20 | 0.0231 | 0.0194 | 0.0295 | 0.0240 | 0.0259 | 0.0194 | 0.0305 | 0.0286 | 0.0305 | 0.0305 | 0.0314 | 2.95% |
| | NDCG@5 | 0.0105 | 0.0099 | 0.0085 | 0.0078 | 0.0068 | 0.0087 | 0.0078 | 0.0099 | 0.0094 | 0.0089 | 0.0105 | 6.06% |
| | NDCG@10 | 0.0123 | 0.0111 | 0.0106 | 0.0096 | 0.0095 | 0.0096 | 0.0102 | 0.0106 | 0.0106 | 0.0103 | 0.0131 | 23.85% |
| | NDCG@20 | 0.0133 | 0.0120 | 0.0136 | 0.0117 | 0.0118 | 0.0108 | 0.0126 | 0.0114 | 0.0142 | 0.0135 | 0.0176 | 23.95% |
  • Significant Improvements: TV-Rec achieves an average accuracy improvement of 7.49% over the strongest baselines.
    • Particularly strong gains are observed on LastFM (e.g., 19.13% on HR@10 and 18.34% on NDCG@10) and Foursquare (e.g., 22.17% on HR@10 and 23.85% on NDCG@10). These are datasets with longer average sequence lengths or more complex temporal dynamics.
    • On larger datasets like ML-1M, TV-Rec still shows notable gains (e.g., 5.06% on HR@5 and 2.39% on NDCG@5).
    • For e-commerce datasets Beauty and Sports, improvements are more subtle but consistent (e.g., 0.98% and 2.13% on HR@5, respectively).
  • Strong Baselines: Recent hybrid models like AdaMCT and BSARec show strong performance, with BSARec often being the second-best. LRURec performs competitively on ML-1M due to its design for long sequences, and DuoRec is strong on Yelp. Despite these strong competitors, TV-Rec consistently surpasses them.

Long-Range Sequential Recommendation Results

To assess TV-Rec's ability to handle long-range dependencies, experiments were conducted on ML-1M and Foursquare with the maximum sequence length $N$ set to 200.

The following are the results from Table 3 of the original paper:

| Datasets | Metric | Caser | GRU4Rec | SASRec | BERT4Rec | NextItNet | FMLPRec | DuoRec | LRURec | AdaMCT | BSARec | TV-Rec | Improv. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ML-1M | HR@5 | 0.1109 | 0.1518 | 0.1558 | 0.1730 | 0.1978 | 0.1397 | 0.1930 | 0.2233 | 0.1760 | 0.1949 | 0.2255 | 0.99% |
| | HR@10 | 0.1869 | 0.2374 | 0.2399 | 0.2573 | 0.2882 | 0.2296 | 0.2795 | 0.3005 | 0.2619 | 0.2917 | 0.3232 | 1.80% |
| | HR@20 | 0.2942 | 0.3455 | 0.3551 | 0.3695 | 0.3970 | 0.3462 | 0.3854 | 0.4205 | 0.3695 | 0.4005 | 0.4306 | 2.40% |
| | NDCG@5 | 0.0696 | 0.0981 | 0.1014 | 0.1147 | 0.1334 | 0.0885 | 0.1292 | 0.1516 | 0.1167 | 0.1327 | 0.1572 | 3.69% |
| | NDCG@10 | 0.0939 | 0.1256 | 0.1285 | 0.1418 | 0.1627 | 0.1175 | 0.1571 | 0.1820 | 0.1443 | 0.1639 | 0.1886 | 3.63% |
| | NDCG@20 | 0.1209 | 0.1528 | 0.1576 | 0.1701 | 0.1901 | 0.1468 | 0.1838 | 0.2079 | 0.1715 | 0.1913 | 0.2157 | 3.75% |
| Foursquare | HR@5 | 0.0139 | 0.0120 | 0.0111 | 0.0102 | 0.0083 | 0.0120 | 0.0120 | 0.0129 | 0.0120 | 0.0129 | 0.0148 | 6.47% |
| | HR@10 | 0.0194 | 0.0157 | 0.0175 | 0.0157 | 0.0166 | 0.0148 | 0.0194 | 0.0203 | 0.0157 | 0.0175 | 0.0212 | 9.28% |
| | HR@20 | 0.0231 | 0.0194 | 0.0295 | 0.0240 | 0.0259 | 0.0194 | 0.0286 | 0.0305 | 0.0305 | 0.0305 | 0.0323 | 5.90% |
| | NDCG@5 | 0.0105 | 0.0099 | 0.0085 | 0.0078 | 0.0068 | 0.0087 | 0.0078 | 0.0099 | 0.0094 | 0.0089 | 0.0105 | 6.06% |
| | NDCG@10 | 0.0123 | 0.0111 | 0.0106 | 0.0096 | 0.0095 | 0.0096 | 0.0102 | 0.0106 | 0.0106 | 0.0103 | 0.0129 | 21.70% |
| | NDCG@20 | 0.0133 | 0.0120 | 0.0136 | 0.0117 | 0.0118 | 0.0108 | 0.0126 | 0.0142 | 0.0142 | 0.0135 | 0.0158 | 11.27% |
  • TV-Rec consistently outperforms all baseline models in long-range SR tasks, achieving an average improvement of 4.74% across all metrics compared to the top baseline performances.
  • The improvements are particularly significant in NDCG metrics, with TV-Rec showing up to 11.27% gain in NDCG@20 for Foursquare.
  • LRURec performs strongly among baselines for ML-1M due to its linear recurrent unit design for long sequences, but TV-Rec still surpasses it.
  • This demonstrates the effectiveness of time-variant filters in modeling extended user interaction histories, maintaining accuracy even with hundreds of interactions.

Comparison to GNN-based Methods

TV-Rec was also compared against three representative GNN-based sequential recommendation models: SR-GNN, GC-SAN, and GCL4SR.

The following are the results from Table 4 of the original paper:

| Methods | Beauty H@20 | Beauty N@20 | Sports H@20 | Sports N@20 | Yelp H@20 | Yelp N@20 | LastFM H@20 | LastFM N@20 | ML-1M H@20 | ML-1M N@20 | Foursquare H@20 | Foursquare N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TV-Rec | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
| SR-GNN | 0.0847 | 0.0374 | 0.0517 | 0.0224 | 0.0609 | 0.0252 | 0.0872 | 0.0379 | 0.2940 | 0.1390 | 0.0222 | 0.0137 |
| GC-SAN | 0.1059 | 0.0546 | 0.0608 | 0.0289 | 0.0635 | 0.0260 | 0.0807 | 0.0394 | 0.3255 | 0.1611 | 0.0212 | 0.0131 |
| GCL4SR | 0.1206 | 0.0601 | 0.0744 | 0.0356 | 0.0684 | 0.0276 | 0.0908 | 0.0398 | 0.3381 | 0.1607 | 0.0185 | 0.0123 |
  • TV-Rec consistently outperforms all GNN-based models.
  • This reinforces the advantage of TV-Rec's design, which operates directly on sequence positions rather than propagating messages over item-transition graphs, leading to a more efficient representation of temporal dependencies.

6.2. Ablation Studies / Parameter Analysis

Ablation Studies

Ablation studies were conducted to validate the contribution of key design choices in TV-Rec.

The following are the results from Table 5 of the original paper:

| Methods | Beauty H@20 | Beauty N@20 | Sports H@20 | Sports N@20 | Yelp H@20 | Yelp N@20 | LastFM H@20 | LastFM N@20 | ML-1M H@20 | ML-1M N@20 | Foursquare H@20 | Foursquare N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TV-Rec | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
| (1) Positional Embedding | 0.1408 | 0.0702 | 0.0842 | 0.0396 | 0.0763 | 0.0320 | 0.1018 | 0.0496 | 0.4017 | 0.1936 | 0.0313 | 0.0160 |
| (2) Basic Graph Filter | 0.1402 | 0.0692 | 0.0857 | 0.0408 | 0.0747 | 0.0307 | 0.1165 | 0.0543 | 0.3974 | 0.1933 | 0.0212 | 0.0113 |
| (3) Identity Basis | 0.1400 | 0.0698 | 0.0851 | 0.0410 | 0.0765 | 0.0317 | 0.1138 | 0.0539 | 0.4015 | 0.1930 | 0.0277 | 0.0141 |
| (4) Basis Normalization | 0.1336 | 0.0689 | 0.0841 | 0.0412 | 0.0634 | 0.0264 | 0.0963 | 0.0418 | 0.3985 | 0.1912 | 0.0305 | 0.0144 |
  1. TV-Rec (1) Positional Embedding: Adding explicit learnable positional embeddings to TV-Rec generally results in inconsistent or slightly degraded performance. For example, on LastFM, both HR@20 and NDCG@20 decrease significantly. This validates the claim that TV-Rec inherently captures positional information through its spectral properties and time-variant filters, making explicit positional encoding redundant.

  2. TV-Rec (2) Basic Graph Filter: Replacing the time-variant filter with a basic graph filter (fixed filter taps) consistently degrades performance across all datasets. For instance, on Foursquare, HR@20 drops from 0.0314 to 0.0212, and NDCG@20 from 0.0176 to 0.0113. This confirms the critical role and effectiveness of the time-variant nature of the filters.

  3. TV-Rec (3) Identity Basis: Setting the basis matrix $\mathbf{B}$ to an identity matrix (which makes $\mathbf{H} = \mathbf{C}$, effectively removing the basis expansion) also degrades results compared to the full model. This indicates that the learnable basis $\mathbf{B}$ is important for constructing expressive and diverse filter patterns.

  4. TV-Rec (4) Basis Normalization: Removing the L2 normalization of the basis matrix $\mathbf{B}$ (i.e., using $\mathbf{B}$ directly instead of $\bar{\mathbf{B}}$ in Eq. 8) leads to a degradation in performance, especially noticeable on Yelp and LastFM. This highlights the importance of normalization for numerical stability and effective learning of the basis vectors (ablations (3) and (4) are contrasted in the sketch after this list).

    Overall, the ablation studies confirm that each component of TV-Rec's design—the time-variant nature of filters, the learned basis matrix, and its normalization—contributes to the model's superior performance.
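
To make ablations (3) and (4) concrete, the following PyTorch sketch (shapes assumed, not taken from released code) contrasts the full construction $\mathbf{H} = \mathbf{C}\bar{\mathbf{B}}$ with the two ablated variants:

```python
import torch
import torch.nn.functional as F

N, m, K = 50, 16, 50                 # sequence positions, basis vectors, filter order
C = torch.randn(N, m)                # learnable per-position mixing coefficients
B = torch.randn(m, K + 1)            # learnable basis of filter-tap patterns

B_bar = F.normalize(B, p=2, dim=-1)  # ablation (4) removes this L2 normalization
H = C @ B_bar                        # N position-specific filters, each of order K
# Ablation (3) instead fixes B to the identity (which requires m = K + 1),
# collapsing H to C and removing the shared basis expansion.
```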

Parameter Sensitivity

The paper investigates the sensitivity of TV-Rec to key hyperparameters: $m$ (number of basis vectors), $p$ (dropout rate), $\alpha$ (orthogonal regularization coefficient), and $K$ (filter order).

Sensitivity to $m$ (Number of Basis Vectors)

The following figure (Figure 3 from the original paper) shows the sensitivity to the number of basis vectors $m$.

[Figure 3: Sensitivity to the number of basis vectors $m$.]

The following figure (Figure 8 from the original paper) shows the same sensitivity to $m$ across all six datasets.

[Figure 8: Sensitivity to the number of basis vectors $m$ on Beauty, Sports, Yelp, LastFM, ML-1M, and Foursquare (bars: NDCG@20, lines: HR@20).]

  • The parameter $m$ in $\mathbf{H} = \mathbf{C}\bar{\mathbf{B}}$ determines the number of basis vectors used to construct the position-specific filters, influencing both expressiveness and computational cost.
  • For Beauty, performance peaks at $m = 32$ and drops sharply for smaller values, suggesting that a richer basis is beneficial for this dataset.
  • Conversely, for LastFM, the best results are achieved at $m = 8$, and performance degrades as $m$ increases, indicating that a more compact basis suffices there, possibly because of domain characteristics such as less complex item relations or more distinct user tastes.
  • For Yelp and ML-1M, $m = 16$ yields optimal results, while for Sports and Foursquare, $m = 8$ or $m = 16$ is also suitable.
  • This analysis underscores that the optimal number of basis vectors is dataset-dependent, reflecting varying complexities of user behavior patterns.

Sensitivity to $p$ (Dropout Rate)

The following figure (Figure 4 from the original paper) shows the sensitivity to the dropout rate $p$.

[Figure 4: Sensitivity to dropout rate $p$ on Sports and Foursquare (bars: NDCG@20, lines: HR@20).]

The following figure (Figure 9 from the original paper) shows the same sensitivity to $p$ across all six datasets.

[Figure 9: Sensitivity to dropout rate $p$ on Beauty, Sports, Yelp, LastFM, ML-1M, and Foursquare.]

  • The dropout rate $p$ controls the strength of regularization.
  • For Sports, larger p values (e.g., 0.5) improve performance. This suggests that Sports might be more prone to overfitting, and higher dropout helps in learning more general representations.
  • For Foursquare, smaller p values (e.g., 0.1 or 0.2) work better.
  • Generally, datasets with fewer interactions or lower data diversity tend to benefit from higher dropout rates to prevent overfitting, while denser datasets might require less regularization.

Sensitivity to $\alpha$ (Orthogonal Regularization Coefficient)

The following figure (Figure 10 from the original paper) shows the sensitivity to the orthogonal regularization coefficient $\alpha$.

[Figure 10: Sensitivity to $\alpha$ on the six datasets (NDCG@20 and HR@20).]

  • $\alpha$ controls the strength of the orthogonal regularization term applied to the basis matrix $\mathbf{B}$.
  • For LastFM, the best accuracy is achieved at $\alpha = 10^{-3}$, and performance drops significantly as $\alpha$ decreases, indicating that sufficient orthogonal regularization is crucial for learning diverse filters and preventing overfitting on this dataset.
  • In contrast, for Sports and Yelp, the highest accuracy occurs at lower $\alpha$ values (e.g., 0 or $10^{-5}$), and performance deteriorates as $\alpha$ increases. This implies that overly strong orthogonal regularization can constrain the filter parameters too much, limiting the model's ability to adapt to the data distributions in these domains.
  • This highlights the importance of tuning $\alpha$ to the dataset at hand, balancing regularization strength against flexibility; one plausible form of the regularizer is sketched below.
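
The exact penalty is defined in the paper; a common form consistent with the description, shown here purely as an assumption, penalizes the deviation of the basis Gram matrix from the identity:

```python
import torch

def ortho_reg(B: torch.Tensor, alpha: float) -> torch.Tensor:
    """Hypothetical orthogonal regularizer on the basis matrix B of shape
    (m, K + 1): alpha * ||B @ B.T - I||_F^2, which pushes the m basis vectors
    toward mutual orthogonality and hence toward diverse filter patterns."""
    m = B.shape[0]
    gram = B @ B.T
    eye = torch.eye(m, device=B.device, dtype=B.dtype)
    return alpha * torch.norm(gram - eye, p="fro") ** 2
```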

Sensitivity to $K$ (Filter Order)

The following are the results from Table 8 of the original paper:

| $K$ | Beauty H@20 | Beauty N@20 | Sports H@20 | Sports N@20 | Yelp H@20 | Yelp N@20 | LastFM H@20 | LastFM N@20 | ML-1M H@20 | ML-1M N@20 | Foursquare H@20 | Foursquare N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 0.1380 | 0.0690 | 0.0839 | 0.0405 | 0.0731 | 0.0295 | 0.1046 | 0.0443 | 0.3901 | 0.1843 | 0.0240 | 0.0116 |
| 5 | 0.1364 | 0.0680 | 0.0849 | 0.0407 | 0.0752 | 0.0308 | 0.1147 | 0.0490 | 0.3823 | 0.1814 | 0.0268 | 0.0116 |
| 10 | 0.1378 | 0.0688 | 0.0853 | 0.0410 | 0.0741 | 0.0303 | 0.1165 | 0.0532 | 0.3922 | 0.1898 | 0.0268 | 0.0139 |
| 25 | 0.1371 | 0.0684 | 0.0854 | 0.0409 | 0.0774 | 0.0323 | 0.1248 | 0.0546 | 0.3957 | 0.1891 | 0.0295 | 0.0163 |
| 50 | 0.1403 | 0.0705 | 0.0880 | 0.0425 | 0.0777 | 0.0321 | 0.1202 | 0.0572 | 0.4079 | 0.1955 | 0.0314 | 0.0176 |
  • The filter order $K$ is analogous to the kernel size in CNN-based models.
  • Except for Yelp and LastFM, setting $K = 50$ (equal to the maximum sequence length $N$) consistently yields the best performance.
  • This supports the hypothesis that aligning the filter order with the sequence length allows the model to capture a broader context, enhancing recommendation quality; a larger receptive field (higher $K$) helps integrate more long-range information.

Analyzing Filter Behavior and Case Study

To understand the mechanics of the time-variant filter, a visualization of learned filters and a case study were performed.

The following figure (Figure 5 from the original paper) shows the visualization of learned graph filters on LastFM.

[Figure 5: Learned graph filters on LastFM. The x-axis denotes the number of shifts in the graph convolution and the y-axis individual nodes, with higher values closer to the current time step; (a) shows a basic graph filter, (b) the time-variant filter.]

  • Filter Visualization (Figure 5): The visualization compares a basic graph filter (fixed) with the time-variant convolutional filter on the LastFM dataset.

    • Both types of filters generally assign higher weights to lower shifts, meaning they prioritize recent items. This is intuitive for sequential recommendation.

    • The crucial difference is that the basic graph filter applies the same filter uniformly across all nodes (positions).

    • In contrast, the time-variant convolutional filter applies different filters to each node (position).

      • For early-stage nodes (earlier in the sequence), the time-variant filter shows similar weights, suggesting an initial focus on understanding overall patterns.
      • As the sequence progresses towards recent items, the filter's weights shift to emphasize more recent items. This dynamic adaptation allows the filter to capture temporal shifts more accurately.
    • This demonstrates TV-Rec's ability to adapt its focus, capturing both long-term preferences (from early patterns) and recent interests (from later patterns), unlike fixed filters that focus predominantly on recent items (a toy numerical sketch of this difference follows the case study below).

      The following figure (Figure 6 from the original paper) shows the case study on ML-1M.

      [Figure 6: Case study on ML-1M. A fixed convolutional filter tracks only the user's recent preferences, while the time-variant filter captures the dynamic evolution of interests, yielding more diverse recommendations that better match true preferences.]

  • Case Study (ML-1M, Figure 6): A real-world example from ML-1M highlights the practical advantage.

    • A basic graph filter (fixed) model focuses solely on a user's recent interactions with Western films.
    • TV-Rec (with time-variant filters), however, successfully captures both the user's recent Western interest and their broader Comedy preference from earlier interactions.
    • This confirms that applying different filtering operations at different temporal positions is crucial for effective sequential recommendation, allowing TV-Rec to form a more nuanced and accurate understanding of user preferences.
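
The mechanism behind this case study can be summarized in a toy NumPy sketch (sizes and values are illustrative, and the real model operates on a zero-padded DCG with embedding vectors rather than a causal scalar signal): a fixed filter reuses one tap vector at every position, while a time-variant filter looks up a different tap vector per position.

```python
import numpy as np

N, K = 6, 3                       # toy sequence length and filter order
x = np.random.randn(N)            # scalar signal standing in for item embeddings

h_fixed = np.random.randn(K + 1)  # one tap vector shared by every position
H_tv = np.random.randn(N, K + 1)  # one tap vector per position (time-variant)

y_fixed, y_tv = np.zeros(N), np.zeros(N)
for n in range(N):
    for k in range(min(n, K) + 1):            # shifts available at position n
        y_fixed[n] += h_fixed[k] * x[n - k]   # same weights everywhere
        y_tv[n] += H_tv[n, k] * x[n - k]      # weights depend on position n
```

Rows of H_tv play the role of the per-node filters visualized in Figure 5: early rows can spread weight across many shifts, while later rows can concentrate on small shifts (recent items).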

Model Complexity and Runtime Analyses

The efficiency of TV-Rec is evaluated by analyzing the number of parameters and inference time.

The following figure (Figure 7 from the original paper) shows the comparison of model inference time and NDCG@20 on Beauty.

[Figure 7: Model inference time versus NDCG@20 on Beauty; circle size corresponds to the number of parameters.]

  • Performance-Efficiency Trade-off (Figure 7, Beauty dataset):

    • TV-Rec achieves the best balance between performance (NDCG@20) and computational efficiency (inference time).

    • Among top-performing methods, TV-Rec has the fastest inference time and the smallest number of parameters compared to hybrid models like AdaMCT and BSARec.

    • Compared to SASRec and BERT4Rec (pure self-attention), TV-Rec offers faster inference and superior recommendation accuracy.

    • While FMLPRec runs slightly faster due to its simpler architecture, it degrades recommendation quality because it uses a basic (fixed) graph filter. TV-Rec's marginally higher inference time than FMLPRec is justified by its significantly improved performance.

      The following are the results from Table 10 of the original paper:

| Dataset | Metrics | TV-Rec | AdaMCT | BSARec | FMLPRec | NextItNet | BERT4Rec | SASRec | Caser |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | # Parameters | 854,208 | 878,208 | 880,318 | 851,200 | 981,696 | 877,888 | 877,824 | 2,909,532 |
| | Training Cost (s/epoch) | 13.20 | 12.30 | 11.35 | 11.85 | 18.9184 | 21.98 | 11.21 | 65.54 |
| | Inference Cost (s/epoch) | 0.5697 | 0.6647 | 0.7299 | 0.5427 | 1.0267 | 0.5821 | 0.6316 | 1.0641 |
| | NDCG@20 | 0.0704 | 0.0691 | 0.0664 | 0.0419 | 0.0547 | 0.0484 | 0.0379 | 0.0164 |
| Sports | # Parameters | 1,248,160 | 1,278,592 | 1,264,318 | 1,251,584 | 1,644,224 | 1,278,272 | 1,278,208 | 4,835,756 |
| | Training Cost (s/epoch) | 19.46 | 17.63 | 18.44 | 17.76 | 30.1034 | 31.60 | 14.26 | 98.21 |
| | Inference Cost (s/epoch) | 0.7411 | 0.9049 | 0.9679 | 0.7049 | 1.2724 | 0.8016 | 0.8460 | 1.5715 |
| | NDCG@20 | 0.0428 | 0.0422 | 0.0379 | 0.0239 | 0.0316 | 0.0288 | 0.0217 | 0.0109 |
| Yelp | # Parameters | 1,355,424 | 1,385,856 | 1,365,522 | 1,358,848 | 1,751,488 | 1,385,536 | 1,385,472 | 3,925,238 |
| | Training Cost (s/epoch) | 21.89 | 20.87 | 21.83 | 20.20 | 32.88 | 37.57 | 16.74 | 111.62 |
| | Inference Cost (s/epoch) | 0.6388 | 0.8527 | 0.8890 | 0.6223 | 1.2348 | 0.7225 | 0.7322 | 1.3220 |
| | NDCG@20 | 0.0338 | 0.0290 | 0.0272 | 0.0211 | 0.0276 | 0.0275 | 0.0180 | 0.0151 |
| LastFM | # Parameters | 303,440 | 337,088 | 322,814 | 310,080 | 981,696 | 336,768 | 336,704 | 998,646 |
| | Training Cost (s/epoch) | 2.41 | 2.51 | 2.66 | 2.51 | 19.1253 | 4.29 | 2.30 | 13.22 |
| | Inference Cost (s/epoch) | 0.2649 | 0.2807 | 0.3228 | 0.2678 | 1.068 | 0.3018 | 0.3061 | 0.3445 |
| | NDCG@20 | 0.0582 | 0.0514 | 0.0485 | 0.0475 | 0.0402 | 0.0366 | 0.0458 | 0.0306 |
| ML-1M | # Parameters | 298,368 | 322,368 | 308,094 | 295,360 | 556,928 | 322,048 | 321,984 | 961,326 |
| | Training Cost (s/epoch) | 26.60 | 27.68 | 31.40 | 24.67 | 40.7731 | 46.51 | 22.50 | 102.99 |
| | Inference Cost (s/epoch) | 0.4129 | 0.4180 | 0.4474 | 0.3594 | 0.7269 | 0.4202 | 0.4181 | 0.5817 |
| | NDCG@20 | 0.1951 | 0.1836 | 0.1711 | 0.1388 | 0.1829 | 0.1588 | 0.1438 | 0.1101 |
| Foursquare | # Parameters | 719,040 | 743,040 | 728,766 | 716,032 | 1,108,672 | 742,720 | 742,656 | 1,073,044 |
| | Training Cost (s/epoch) | 6.01 | 5.79 | 5.81 | 5.11 | 8.9109 | 9.06 | 5.27 | 21.01 |
| | Inference Cost (s/epoch) | 0.2900 | 0.2913 | 0.3562 | 0.2521 | 0.4963 | 0.2871 | 0.3178 | 0.3703 |
| | NDCG@20 | 0.0170 | 0.0147 | 0.0131 | 0.0110 | 0.0115 | 0.0132 | 0.0137 | 0.0133 |
  • Parameter Count: TV-Rec consistently shows competitive (often the lowest) parameter efficiency across all datasets, maintaining a comparable or slightly lower number of parameters than most advanced baselines.

  • Training Cost: TV-Rec demonstrates competitive training cost in most datasets, often comparable to or slightly higher than SASRec and FMLPRec. The overhead from GFT/IGFT operations is manageable.

  • Inference Cost: TV-Rec exhibits strong inference efficiency, frequently having the lowest or near-lowest inference cost among all models, especially on Beauty, Sports, and Yelp. This is a significant advantage: the time-variant convolutional filter is a linear operator whose components can be precomputed after training, yielding inference complexity $O(N^2 d)$, compared with $O(N^3 + N^2 d)$ during training (sketched below).
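
Concretely, following the paper's spectral filtering rule $\hat{X}^{\ell} = \big(U \circ (H\Lambda^{\mathsf{T}})\big)U^{\mathsf{T}}X^{\ell}$, everything except the input can be folded into one $N \times N$ matrix once training ends. The sketch below assumes complex-valued $U$, $H$, and $\Lambda$ of the stated shapes; it is an illustration, not the released implementation:

```python
import torch

N, K, d = 50, 50, 64                             # length, filter order, latent dim
U = torch.randn(N, N, dtype=torch.cfloat)        # GFT basis of the DCG (assumed)
H = torch.randn(N, K + 1, dtype=torch.cfloat)    # learned position-specific taps
Lam = torch.randn(N, K + 1, dtype=torch.cfloat)  # eigenvalue powers (assumed)

# Precompute once after training: U ∘ (H Λᵀ) is elementwise, then one matmul.
W = (U * (H @ Lam.T)) @ U.T                      # a single N x N linear operator

def filter_sequence(X: torch.Tensor) -> torch.Tensor:
    """Apply the precomputed operator to an (N, d) sequence: O(N^2 * d)."""
    return (W @ X.to(torch.cfloat)).real
```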

Additional Results on XLong

To further test scalability on extremely long sequences, TV-Rec was evaluated on the XLong dataset, whose average sequence length (958.8) is roughly 20 times that of the main experiments.

The following are the results from Table 9 of the original paper:

| Metric | SASRec | LRURec | TV-Rec | Improv. |
| --- | --- | --- | --- | --- |
| HR@5 | 0.3612 | 0.4266 | 0.4844 | 13.55% |
| HR@10 | 0.4680 | 0.5137 | 0.5353 | 4.20% |
| HR@20 | 0.5612 | 0.5874 | 0.5774 | -1.70% |
| NDCG@5 | 0.2656 | 0.3227 | 0.3905 | 21.00% |
| NDCG@10 | 0.2979 | 0.3510 | 0.4071 | 15.98% |
| NDCG@20 | 0.3232 | 0.3697 | 0.4178 | 13.01% |
  • On the XLong dataset, TV-Rec shows strong performance, significantly outperforming SASRec and LRURec in most metrics (HR@5, HR@10, NDCG@5, NDCG@10, NDCG@20).
  • While TV-Rec is slightly lower than LRURec in HR@20 (-1.70%), its overall superiority, especially in NDCG metrics (e.g., 21.00% gain in NDCG@5), highlights its effectiveness in handling extremely long sequences. This confirms its strength in long-range modeling tasks.

Statistical Significance of Experimental Results

To ensure reliability, all experiments were run 10 times with different random seeds for TV-Rec and the second-best baseline model for each dataset. Mean and standard deviation are reported.

The following are the results from Table 11 of the original paper:

| Datasets | Methods | HR@5 | HR@10 | HR@20 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | BSARec | 0.0694±0.001 | 0.0978±0.002 | 0.1352±0.002 | 0.0496±0.001 | 0.0587±0.001 | 0.0681±0.001 |
| | TV-Rec | 0.0706±0.001 | 0.0997±0.001 | 0.1375±0.002 | 0.0500±0.001 | 0.0594±0.001 | 0.0689±0.001 |
| Sports | BSARec | 0.0417±0.001 | 0.0600±0.001 | 0.0844±0.001 | 0.0288±0.001 | 0.0349±0.001 | 0.0411±0.001 |
| | TV-Rec | 0.0420±0.001 | 0.0610±0.002 | 0.0863±0.002 | 0.0290±0.001 | 0.0351±0.001 | 0.0415±0.001 |
| Yelp | DuoRec | 0.0268±0.001 | 0.0453±0.001 | 0.0733±0.001 | 0.0170±0.000 | 0.0230±0.000 | 0.0300±0.000 |
| | TV-Rec | 0.0284±0.001 | 0.0472±0.001 | 0.0759±0.001 | 0.0179±0.000 | 0.0240±0.001 | 0.0312±0.001 |
| LastFM | BSARec | 0.0501±0.004 | 0.0707±0.006 | 0.1051±0.008 | 0.0342±0.002 | 0.0420±0.003 | 0.0506±0.003 |
| | TV-Rec | 0.0508±0.005 | 0.0750±0.006 | 0.1090±0.007 | 0.0343±0.003 | 0.0420±0.003 | 0.0506±0.003 |
| ML-1M | LRURec | 0.1955±0.002 | 0.2818±0.002 | 0.3871±0.002 | 0.1326±0.002 | 0.1604±0.002 | 0.1869±0.002 |
| | TV-Rec | 0.2024±0.005 | 0.2901±0.005 | 0.3972±0.005 | 0.1365±0.004 | 0.1647±0.003 | 0.1918±0.003 |
| Foursquare | BSARec | 0.0133±0.003 | 0.0175±0.003 | 0.0250±0.003 | 0.0098±0.002 | 0.0111±0.002 | 0.0130±0.002 |
| | TV-Rec | 0.0151±0.002 | 0.0214±0.003 | 0.0289±0.002 | 0.0105±0.002 | 0.0126±0.002 | 0.0145±0.002 |
  • The results show that TV-Rec consistently achieves higher mean performance compared to the second-best baselines across all datasets and metrics.
  • The standard deviations are relatively small, indicating stability in performance across different runs and reinforcing the statistical significance of TV-Rec's improvements (the aggregation protocol is sketched below).
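
The reporting protocol itself is straightforward; a minimal sketch (the per-seed scores below are made up purely for illustration):

```python
import statistics

# Hypothetical NDCG@20 scores from 10 runs with different random seeds.
ndcg_runs = [0.0706, 0.0689, 0.0701, 0.0695, 0.0710,
             0.0688, 0.0692, 0.0703, 0.0697, 0.0699]
mean = statistics.mean(ndcg_runs)
std = statistics.stdev(ndcg_runs)        # sample standard deviation
print(f"NDCG@20: {mean:.4f}±{std:.3f}")  # the mean ± std format of Table 11
```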

6.3. Conclusion from Results

The extensive experimental results demonstrate that TV-Rec is a highly effective and efficient model for sequential recommendation. Its time-variant convolutional filters successfully address the limitations of fixed filters and self-attention, leading to superior accuracy, particularly in capturing long-range dependencies and dynamic user preferences. The model also offers a favorable trade-off between performance and computational cost, making it suitable for practical applications. The ablation studies confirm the importance of each novel component in its design.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces TV-Rec (Time-Variant Convolutional Filters for Sequential Recommendation), a novel model that addresses both the inherent limitations of conventional (fixed) convolutional filters and the computational overhead of self-attention in sequential recommendation. Inspired by graph signal processing (GSP), TV-Rec employs time-variant convolutional filters that adapt their weights to the position within a user's interaction sequence. This allows the model to capture complex, position-dependent temporal variations and to integrate both long-term preferences and recent interests without self-attention. The key findings are TV-Rec's consistent superiority in recommendation accuracy (averaging a 7.49% improvement over state-of-the-art baselines), its higher expressive power, and its computational efficiency, with reduced complexity and fast inference once the filter components are precomputed after training.

7.2. Limitations & Future Work

The authors acknowledge several limitations of TV-Rec:

  • Training Computational Complexity: While TV-Rec boasts efficient inference, its training process involves additional computational complexity. This stems from the repeated Graph Fourier Transform (GFT) and inverse GFT operations, as well as the generation of position-specific filters.

  • Dimensionality Increase from the DCG: To enable spectral filtering, TV-Rec replaces the original line graph with a Directed Cyclic Graph (DCG). The zero-padding this requires adds extra nodes, which increases the dimensionality of the spectral domain (the DCG's spectral structure is illustrated in the sketch after this list).

  • Manageable Overhead: Despite these factors, the authors state that the overhead in training time and memory usage remains manageable and does not significantly impact training scalability in practice, as evidenced in their Model Complexity and Runtime Analyses.

  • Data-Independent Filter Generation: The current filter generation in TV-Rec is data-independent, relying solely on temporal position rather than the specific content or semantics of the sequence. This design choice simplifies the model and aids generalization but might limit its adaptability to highly instance-specific behaviors or irregular patterns that are not purely position-dependent. The authors argue that the inference-time efficiency and architectural simplicity justify these trade-offs for many practical applications.
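
The DCG point is easy to make concrete: the adjacency matrix of a directed cycle is circulant, so its graph Fourier basis coincides with the classical DFT, and zero-padding the sequence into the cycle directly enlarges this spectral dimension. A small NumPy check (toy size, not the paper's code):

```python
import numpy as np

N = 8                                    # toy number of DCG nodes (incl. padding)
A = np.roll(np.eye(N), 1, axis=1)        # adjacency of a directed N-cycle
F = np.fft.fft(np.eye(N)) / np.sqrt(N)   # normalized DFT matrix

# A circulant matrix is diagonalized by the DFT, so the DCG's GFT is the DFT.
D = F @ A @ F.conj().T
assert np.allclose(D, np.diag(np.diag(D)))  # off-diagonal entries vanish
```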

    For future work, the authors plan to:

  • Explore both the theoretical and practical relationships between their time-variant filter and recent advances in state space models, which have demonstrated strong connections with convolutional filters. This could potentially lead to further optimizations or novel architectural designs.

7.3. Personal Insights & Critique

The TV-Rec paper presents an elegant and effective solution to a long-standing challenge in sequential recommendation. The shift from fixed convolutional filters or computationally heavy self-attention to time-variant filters inspired by Graph Signal Processing is a compelling conceptual leap.

  • Inspirations:

    • GSP for Sequence Modeling: The reinterpretation of a sequence as a graph signal on a DCG is particularly insightful. It allows leveraging the powerful mathematical tools of GSP (like Fourier analysis) in a domain traditionally dominated by RNNs and Transformers. This cross-domain application is inspiring and could open doors for GSP in other sequential data tasks.
    • Efficiency through Spectral Methods: The ability to operate largely in the spectral domain, converting complex convolutions into simpler multiplications, is a clever way to achieve efficiency. The claim of faster inference by precomputing filter components is a strong practical advantage for deployment.
    • Implicit Positional Encoding: The inherent positional encoding through the GFT basis is a neat solution that simplifies the model architecture and avoids the need for explicit positional embeddings, which can sometimes be tricky to design or optimize.
  • Potential Issues / Areas for Improvement:

    • Interpretability of time-variant filters: While the paper provides visualizations (Figure 5) and a case study (Figure 6), the learned filter-tap matrix $\mathbf{H}$ (constructed from $\mathbf{C}$ and $\mathbf{B}$) can still be complex. Further research into how individual basis vectors in $\mathbf{B}$ contribute to specific patterns and how the coefficient matrix $\mathbf{C}$ dynamically blends them could enhance interpretability.

    • Data-independent filter generation: The current data-independent filter generation (relying only on temporal position) is a trade-off. While it aids generalization, future work could explore hybrid approaches in which filter generation is also subtly influenced by semantic content or user attributes, allowing even more adaptive filtering without sacrificing much efficiency; for example, some elements of $\mathbf{C}$ or $\mathbf{B}$ could be conditioned on sparse user features or global item popularity.

    • Scalability for extremely high $K$ (filter order): While $K = N$ works well, for sequences with $N > 200$ (as in XLong), $K$ can become very large; the $O(K+1)$ dependency in the filter-matrix definition and the $O(N^3)$ training complexity might then become a bottleneck as $N$ grows further. More sophisticated sparse approximations or low-rank representations of $\mathbf{H}$ or $\pmb{\Lambda}$ might be needed for truly massive sequence lengths.

    • Generality beyond SR: The core idea of time-variant convolutional filters could potentially be applied to other sequence modeling tasks where position-dependent pattern recognition is crucial, such as time series forecasting, genomic sequence analysis, or even certain aspects of natural language processing, offering an alternative to Transformer-based models.

      Overall, TV-Rec represents a fresh perspective in sequential recommendation, successfully integrating GSP to create a powerful, efficient, and expressive model that challenges the dominance of self-attention.
