FuXi-α: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
TL;DR Summary
The FuXi-α model introduces an Adaptive Multi-channel Self-attention mechanism to improve the modeling of temporal, positional, and semantic features, along with a Multi-stage Feed-Forward Network to enhance implicit feature interactions, outperforming existing models in offline experiments and delivering significant gains in an online A/B test.
Abstract
Inspired by scaling laws and large language models, research on large-scale recommendation models has gained significant attention. Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy. Current state-of-the-art sequential recommendation models primarily use self-attention mechanisms for explicit feature interactions among items, while implicit interactions are managed through Feed-Forward Networks (FFNs). However, these models often inadequately integrate temporal and positional information, either by adding them to attention weights or by blending them with latent representations, which limits their expressive power. A recent model, HSTU, further reduces the focus on implicit feature interactions, constraining its performance. We propose a new model called FuXi-α to address these issues. This model introduces an Adaptive Multi-channel Self-attention mechanism that distinctly models temporal, positional, and semantic features, along with a Multi-stage FFN to enhance implicit feature interactions. Our offline experiments demonstrate that our model outperforms existing models, with its performance continuously improving as the model size increases. Additionally, we conducted an online A/B test within the Huawei Music app, which showed a 4.76% increase in the average number of songs played per user and a 5.10% increase in the average listening duration per user. Our code has been released at https://github.com/USTC-StarTeam/FuXi-alpha.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
FuXi-α: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
1.2. Authors
- Yufei Ye (aboluo2003@mail.ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Wei Guo (guowei67@huawei.com) and Jin Yao Chin (chin.jin.yao@huawei.com) from Huawei Noah's Ark Lab, Singapore, Singapore.
- Hao Wang (wanghao3@ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Hong Zhu (zhuhong8@huawei.com) and Xi Lin (linxi16@huawei.com) from Consumer Business Group, Huawei, Shenzhen, China.
- Yuyang Ye (yeyuyang@mail.ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Yong Liu (liu.yong6@huawei.com) from Huawei Noah's Ark Lab, Singapore, Singapore.
- Ruiming Tang (tangruiming@huawei.com) from Huawei Noah's Ark Lab, Shenzhen, China.
- Defu Lian (liandefu@ustc.edu.cn) and Enhong Chen (cheneh@ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. While it does not specify a particular journal or conference, the authors' affiliations with prominent academic institutions (University of Science and Technology of China) and industrial research labs (Huawei Noah's Ark Lab, Consumer Business Group) suggest a strong research background in artificial intelligence, machine learning, and recommender systems.
1.4. Publication Year
2025 (Published at UTC: 2025-02-05T09:46:54.000Z)
1.5. Abstract
The paper addresses the growing interest in large-scale recommendation models, inspired by the success of scaling laws in large language models. It identifies a limitation in current state-of-the-art sequential recommendation models, which often inadequately integrate temporal and positional information and underemphasize implicit feature interactions. To overcome these issues, the authors propose FuXi-α, a new model that introduces an Adaptive Multi-channel Self-attention (AMS) mechanism to distinctly model temporal, positional, and semantic features, and a Multi-stage FFN (MFFN) to enhance implicit feature interactions. Offline experiments demonstrate that FuXi-α outperforms existing models and shows continuous performance improvement with increasing model size. An online A/B test within the Huawei Music app further validates its effectiveness, showing a 4.76% increase in the average number of songs played per user and a 5.10% increase in average listening duration per user.
1.6. Original Source Link
https://arxiv.org/abs/2502.03036v1 (Preprint) PDF Link: https://arxiv.org/pdf/2502.03036v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inadequate integration of temporal and positional information and the underemphasis of implicit feature interactions in current sequential recommendation models, particularly within the context of scaling laws observed in large models.
In recent years, the concept of "scaling laws" has emerged from the field of large language models (LLMs), demonstrating that model performance can systematically improve with increased parameters, training data, and computational resources. This discovery has driven research into applying similar scaling principles to other domains, including recommendation systems. Sequential recommendation models, which predict user interests based on historical interaction sequences, have shown promise in adopting these scaling laws. State-of-the-art sequential recommendation models, such as those based on Transformer architectures (e.g., SASRec, BERT4Rec, HSTU), primarily use self-attention mechanisms for explicit feature interactions among items. Implicit interactions are typically handled by Feed-Forward Networks (FFNs).
However, existing models face several challenges:
- Inadequate Integration of Temporal and Positional Information: Current methods often integrate temporal and positional data in a simplified manner, either by adding embeddings to input sequences, incorporating them into attention weights, or blending them with latent representations. This limits their expressive power and ability to fully capture the nuances of user behavior over time. For example, as illustrated in Figure 1, different temporal intervals or orders between items can lead to varied subsequent interactions, a subtlety that existing models struggle to capture effectively.

Figure 1: Different temporal intervals or orders between items may lead to varying subsequent interacted items. (The figure shows three item sequences whose items are connected by arrows, illustrating how timing influences which items are interacted with next.)

- Underemphasis of Implicit Feature Interactions: Models like HSTU primarily focus on explicit interactions, which can neglect the complex, non-linear relationships that implicit interactions (often handled by FFNs) can uncover. This underemphasis can constrain the model's overall performance and capacity for nuanced learning.

The paper's entry point is to explicitly address these two limitations by designing a new Transformer-based architecture that:
- More effectively disentangles and models temporal, positional, and semantic information in self-attention.
- Enhances implicit feature interactions through a specialized FFN.

By doing so, the authors aim to develop a recommendation model that not only adheres to scaling laws but also significantly improves performance by better capturing the rich dynamics of user behavior.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Proposed a Novel Model (FuXi-α): The authors introduce FuXi-α, a new Transformer-based sequential recommendation model specifically designed to leverage scaling laws while addressing the shortcomings of previous models regarding feature interactions.
- Adaptive Multi-channel Self-attention (AMS) Layer: FuXi-α incorporates an Adaptive Multi-channel Self-attention (AMS) layer. This mechanism disentangles the modeling of temporal, positional, and semantic information into distinct channels, allowing for a more expressive representation of these crucial aspects. This explicitly overcomes the limitation of prior work that inadequately integrated temporal and positional cues.
- Multi-stage Feedforward Network (MFFN): A Multi-stage Feedforward Network (MFFN) is introduced to enhance implicit feature interactions. This component aims to capture more complex, high-order relationships among features that are often overlooked or under-modeled in other architectures, particularly those that prioritize explicit interactions.
- Adherence to Scaling Laws: The research demonstrates that FuXi-α adheres to scaling laws, showing continuous performance improvement as the model size (e.g., number of layers) increases. This highlights its potential for building large-scale, highly performant recommendation systems.
- Extensive Experimental Validation:
  - Offline Performance: FuXi-α consistently outperforms state-of-the-art sequential recommendation models across multiple public datasets (MovieLens-1M, MovieLens-20M, KuaiRand) and a large-scale industrial dataset.
  - Online A/B Test Success: An online A/B test conducted within the Huawei Music app showed significant real-world impact, with a 4.76% increase in the average number of songs played per user and a 5.10% increase in the average listening duration per user. These findings solve the practical problem of improving user engagement and platform usage.
- Code Release: The authors have open-sourced their code at https://github.com/USTC-StarTeam/FuXi-alpha, facilitating reproducibility and further research.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand FuXi-α, a reader should be familiar with the following foundational concepts in recommender systems and deep learning:
- Recommender Systems: Systems that suggest items (e.g., movies, products, music) to users based on their preferences and past behavior.
- Sequential Recommendation: A sub-field of recommender systems where the order of user interactions matters. The goal is to predict the next item a user will interact with, given their historical sequence of interactions. For example, if a user watches movie A, then B, then C, a sequential recommender tries to predict movie D.
- Feature Interaction: The interplay between different features (e.g., item characteristics, user attributes, contextual information) that collectively influence a user's preference for an item.
- Explicit Feature Interaction: Directly modeling the relationships between specific features using predefined operations (e.g., dot product, bilinear function, attention). For instance, multiplying a user embedding by an item embedding to capture their compatibility.
- Implicit Feature Interaction: Learning complex, non-linear relationships among all features through deep neural networks (DNNs) without explicitly predefining the interaction type. This often leads to more flexible and powerful models but can be less interpretable.
- Embeddings: Low-dimensional, dense vector representations of discrete entities (e.g., users, items). These vectors capture semantic similarities, where similar items or users have similar embeddings.
- Neural Networks (NNs): A series of interconnected layers of artificial neurons that can learn complex patterns from data.
- Feed-Forward Networks (FFNs) / Multi-Layer Perceptrons (MLPs): Simple neural networks consisting of an input layer, one or more hidden layers, and an output layer, where information flows in one direction. They are universal function approximators and are often used to capture implicit feature interactions.
- Recurrent Neural Networks (RNNs) & Gated Recurrent Units (GRUs): Neural networks designed to process sequential data by maintaining a hidden state that carries information from previous steps. GRU4Rec is a well-known example in sequential recommendation.
- Convolutional Neural Networks (CNNs): Neural networks that use convolutional layers to extract local patterns from data, often used in image processing but also adapted for sequential data (e.g., Caser).
- Graph Neural Networks (GNNs): Neural networks that operate on graph-structured data, allowing them to model relationships between entities (e.g., items in a sequence). SR-GNN is an example.
- Transformer Architecture: A neural network architecture primarily based on the self-attention mechanism, originally developed for natural language processing (NLP) tasks.
- Self-Attention Mechanism: A core component of Transformers that allows each element in a sequence to weigh the importance of all other elements in the same sequence. It calculates attention weights based on the similarity between queries and keys, and then uses these weights to combine values (a minimal sketch of this computation, together with the SiLU and SwiGLU activations below, follows this list).
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot product between queries and keys, representing similarity.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients, where $d_k$ is the dimension of the key vectors.
  - $\mathrm{softmax}$ normalizes the attention scores into probabilities.
  - The result is a weighted sum of the value vectors, where the weights are determined by the attention scores.
- Multi-Head Self-Attention: An extension of self-attention where the attention mechanism is run multiple times in parallel with different learned linear projections of the queries, keys, and values. The outputs from multiple "heads" are then concatenated and linearly transformed. This allows the model to attend to different parts of the sequence from different representation subspaces.
- Positional Encoding: In Transformers, since self-attention does not inherently understand the order of elements in a sequence, positional encodings are added to the input embeddings to inject information about the relative or absolute position of each element.
- Layer Normalization: A technique used in neural networks to normalize the activations of a layer across the features, which helps stabilize training and speed up convergence. Root Mean Square (RMS) Layer Normalization is a variant.
- Activation Functions: Non-linear functions applied to the output of neurons to introduce non-linearity into the network, allowing it to learn complex patterns.
  - SiLU (Sigmoid Linear Unit): $\mathrm{SiLU}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. It is a smooth, non-monotonic activation function often used in deep learning.
  - SwiGLU (Swish-Gated Linear Unit): An activation function used in some Transformer architectures, defined as $\mathrm{SwiGLU}(\mathbf{x}) = \mathrm{SiLU}(\mathbf{x}\mathbf{W}_1) \otimes (\mathbf{x}\mathbf{W}_2)$. It combines the SiLU function with a gating mechanism and element-wise multiplication ($\otimes$).
- Scaling Laws: Empirical observations, predominantly in LLMs, indicating that the performance of a model scales predictably (e.g., logarithmically) with increases in model parameters, dataset size, and computational budget. The goal is to find models that adhere to these laws, as they promise continuous performance improvements with more resources.
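To make these building blocks concrete, here is a minimal, self-contained PyTorch sketch of scaled dot-product self-attention and the SiLU/SwiGLU activations described above. The function names, tensor shapes, and random weights are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n, n) query-key similarity
    weights = F.softmax(scores, dim=-1)             # normalize each row into probabilities
    return weights @ V                              # weighted sum of the value vectors

def swiglu(x, W1, W2):
    # SwiGLU(x) = SiLU(x W1) ⊗ (x W2), where SiLU(z) = z * sigmoid(z)
    return F.silu(x @ W1) * (x @ W2)

# toy usage on a length-5 sequence of 8-dimensional embeddings
x = torch.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)                   # self-attention: Q = K = V = x
gated = swiglu(x, torch.randn(8, 16), torch.randn(8, 16))
print(out.shape, gated.shape)                                 # torch.Size([5, 8]) torch.Size([5, 16])
```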
3.2. Previous Works
The paper extensively discusses several categories of previous works:
-
Traditional Recommendation Models:
- Pooling Operations: Early methods for managing interaction sequences that simply aggregate item features (e.g., YouTube recommendations [8]). These have limited expressive capabilities as they lose sequential information.
- CNN-based Models (e.g., Caser [52]): Use convolutional filters to capture local patterns within a fixed window of the sequence. They are limited in capturing long-range dependencies.
- RNN-based Models (e.g., GRU4Rec [20]): Employ recurrent neural networks to process sequences, directly interacting with the previous timestep's hidden state. While capturing sequentiality, they can struggle with very long sequences and complex interactions.
- GNN-based Models (e.g., SR-GNN [61]): Model item interactions as a graph, where interactions are limited to directly connected items, restricting their scope.
- Markov Chain Models: Early approaches that assume future interactions depend only on the immediate past.
-
Transformer-based Sequential Recommendation Models: These models leverage the self-attention mechanism to capture complex dependencies across entire sequences.
- SASRec [23]: A pioneering work that applies the self-attention mechanism to sequential recommendation, allowing each item in a user's history to attend to all previous items. This marked a significant improvement over RNNs and CNNs.
- BERT4Rec [51]: Adapts the bidirectional Transformer encoder from NLP to recommendation, predicting masked items in a sequence.
- TiSASRec [29]: Improves SASRec by incorporating time intervals and relative position information, typically by adding them to the input embeddings or modifying attention weights.
- LLaMa [10] (in the context of recommendation): While primarily an LLM, its architecture, especially its use of RMSNorm and SwiGLU, has influenced recommendation models. In this paper's context, LLaMa serves as a baseline representing a scaled-up Transformer-based model.
- HSTU [72]: An advanced Transformer-based model that explicitly utilizes positional and temporal information alongside element-wise multiplication to model explicit interactions between items. It attempts to scale sequential recommendation models according to scaling laws and is a strong baseline.
  - Differentiation from FuXi-α: HSTU integrates relative temporal and positional data by adding these features directly to attention weights, which FuXi-α argues can dilute their impact. Furthermore, HSTU notably lacks an FFN layer, relying heavily on self-attention and explicit feature interactions, which FuXi-α identifies as a limitation for capturing implicit relationships.
-
Deep Learning Recommendation Models (DLRMs): Traditional DLRMs focus on feature interactions in a broader sense, not just sequential data.
- DCN [58], xDeepFM [33], DeepFM [13], PNN [40]: These models use various techniques (dot product, bilinear functions, attention) for explicit feature interactions and often incorporate DNNs for implicit interactions. The paper notes that DLRMs have not shown significant performance improvements with increased model size in some studies [2, 15].
3.3. Technological Evolution
The evolution of recommendation technologies can be seen as a progression from simpler statistical models to complex deep learning architectures:
- Early Stages (Pre-Deep Learning): Markov chains, matrix factorization (MF) (e.g., BPRMF [44]), and simple pooling operations (e.g., YouTube recommendations [8]) formed the bedrock. These methods were limited in capturing complex sequential patterns or higher-order feature interactions.
- RNN/CNN Era: The advent of deep learning brought RNNs (GRU4Rec [20]) and CNNs (Caser [52]) to sequential recommendation, allowing for better capture of temporal dependencies and local patterns. Memory network-based methods also emerged to handle long-term preferences.
- Transformer Era: Inspired by NLP, the Transformer architecture, particularly the self-attention mechanism, revolutionized sequential recommendation with models like SASRec [23] and BERT4Rec [51]. These models can capture long-range dependencies and complex item-item interactions more effectively. Further refinements like TiSASRec [29] started incorporating temporal and positional information more explicitly.
- Scaling Law Exploration: Recent work, notably HSTU [72], began investigating the applicability of scaling laws from LLMs to sequential recommendation, pushing towards very large models. This era also sees the influence of LLM architectures (like LLaMa's RMSNorm and SwiGLU) directly impacting recommendation model design.

FuXi-α positions itself at the forefront of this evolution by building upon the Transformer architecture, explicitly addressing the limitations of how temporal and positional information are handled in existing models (like TiSASRec and HSTU), and enhancing the implicit feature interaction capabilities often underplayed in purely attention-focused sequential models. It aims to develop a model that not only benefits from scaling but also leverages a more sophisticated handling of feature interactions for superior performance.
3.4. Differentiation Analysis
Compared to the main methods in related work, FuXi-α introduces core differences and innovations:
- Handling of Temporal and Positional Information:
  - Conventional Transformer-based models (e.g., SASRec, BERT4Rec, LLaMa): These models typically integrate positional encodings by simply adding them to input embeddings. Some, like TiSASRec, incorporate time intervals and relative position information by modifying query/key matrices or attention weights.
  - HSTU: HSTU improves on this by utilizing positional and temporal information alongside element-wise multiplication for explicit interactions, but integrates them by adjusting attention weights. The paper argues this can "dilute their impact" due to simple addition.
  - FuXi-α Innovation: FuXi-α introduces an Adaptive Multi-channel Self-attention (AMS) layer that distinctly models temporal, positional, and semantic features in separate channels. This disentanglement allows each channel to specialize in capturing different aspects of sequence data, providing a more expressive representation and avoiding dilution.
- Implicit Feature Interactions:
  - DLRMs (e.g., DCN, xDeepFM): These models heavily rely on DNNs to capture implicit, high-order feature interactions, often combined with explicit interaction components.
  - Transformer-based Sequential Models (e.g., SASRec, LLaMa): These models typically use standard FFNs (Multi-Layer Perceptrons) after self-attention layers to process information, which can capture some implicit interactions but might not be optimized for this specific purpose in a multi-channel context.
  - HSTU: A key differentiator is that HSTU lacks an FFN layer entirely, relying solely on self-attention and explicit feature interactions. This, according to the paper, limits its ability to capture complex item relationships and nuances from implicit learning processes.
  - FuXi-α Innovation: FuXi-α introduces a Multi-stage Feedforward Network (MFFN). The MFFN combines the multi-channel outputs from the AMS layer in its first stage and then performs enhanced implicit feature interactions in its second stage (using SwiGLU and residual connections). This design explicitly addresses the underemphasis of implicit interactions in models like HSTU and aims to boost the model's overall expressiveness.
- Adherence to Scaling Laws: While HSTU also aims to adhere to scaling laws, FuXi-α's design, particularly its enhanced feature interaction mechanisms, is posited to lead to better scaling behavior and continuous performance improvement with increasing model size. The paper's empirical results support this claim, showing that FuXi-α's performance continuously improves with more layers.

In summary, FuXi-α differentiates itself by offering a more granular and sophisticated approach to handling both explicit (through AMS) and implicit (through MFFN) feature interactions within a Transformer-based architecture, which are crucial for sequential recommendation and effective scaling.
4. Methodology
The FuXi-α model is designed to address the limitations of existing sequential recommendation models by enhancing feature interactions, particularly temporal, positional, and semantic information, and improving implicit interaction modeling. The overall architecture is depicted in Figure 2 and consists of stacked FuXi Blocks.

Figure 2: The overall architecture of the proposed FuXi-α, comprising multiple stacked FuXi Blocks, each containing an Adaptive Multi-channel Self-attention layer and a Multi-stage FFN to enhance implicit feature interactions and model temporal, positional, and semantic features.
4.1. Embedding Layer
The first step in FuXi-α is to transform user interaction sequences into numerical representations suitable for the model.
- Sequence Preparation: Each user's interaction sequence, $S_u$, is converted into a fixed-length sequence of length $n$. This involves either truncating longer sequences or padding shorter ones with a special "padding item" to reach the length $n$.
- Item Embeddings: Each item $i \in \mathcal{I}$ (where $\mathcal{I}$ is the set of all items) is mapped to a $d$-dimensional vector. This is done using a learnable embedding matrix $\mathbf{E}$, where $d$ is the latent vector dimensionality. So, for an item $i$, its embedding is $\mathbf{e}_i$.
- Positional Encodings: To incorporate the order of items in the sequence, learnable positional encodings are used. For the $k$-th position in the sequence, there is a positional embedding $\pmb{P}_k$.
- Initial Input Representation: The input for the first FuXi Block is constructed by summing the item embeddings and their corresponding positional encodings (a minimal PyTorch sketch of this layer follows this list). For a user $u$ with a sequence of length $n_u$, the initial input is given by:
$
\mathbf{x}^0 = [\mathbf{e}_1^{(u)} + \pmb{P}_1, ..., \mathbf{e}_{n_u}^{(u)} + \pmb{P}_{n_u}, \mathbf{0}, ..., \mathbf{0}]
$
Where:
  - $\mathbf{e}_k^{(u)}$ is the $d$-dimensional embedding for the $k$-th item in user $u$'s sequence.
  - $\pmb{P}_k$ is the positional embedding for the $k$-th position.
  - The zero vectors ($\mathbf{0}$) denote padding items for positions beyond $n_u$ up to $n$.
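As a concrete illustration of the embedding layer described above, here is a minimal PyTorch sketch. The class name, padding id, and dimensions are illustrative assumptions; padded positions are zeroed out, mirroring the zero vectors in the formula.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Maps a truncated/right-padded interaction sequence to x^0 = item embedding + positional encoding."""
    def __init__(self, num_items: int, max_len: int, dim: int, pad_id: int = 0):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=pad_id)  # id 0 reserved for padding
        self.pos_emb = nn.Embedding(max_len, dim)  # learnable positional encodings P_k
        self.pad_id = pad_id

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, n) item ids, already truncated or right-padded with pad_id
        positions = torch.arange(seq.size(1), device=seq.device)
        x0 = self.item_emb(seq) + self.pos_emb(positions)        # (batch, n, dim)
        return x0 * (seq != self.pad_id).unsqueeze(-1)           # zero vectors at padded positions

# usage: a batch of two sequences of length n = 6, padded with 0
layer = EmbeddingLayer(num_items=1000, max_len=6, dim=32)
x0 = layer(torch.tensor([[5, 9, 2, 0, 0, 0], [7, 7, 3, 4, 1, 0]]))
print(x0.shape)  # torch.Size([2, 6, 32])
```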
4.2. FuXi Block
The FuXi Block is the core computational unit of the model, similar in structure to a Transformer decoder block. It consists of two main components: an Adaptive Multi-channel Self-attention (AMS) layer and a Multi-stage Feed-Forward Network (MFFN). The model stacks $b$ such FuXi Blocks.
Let $\mathbf{x}^{l-1}$ be the input to the $l$-th layer and $\mathbf{x}^l$ be its output. The initial input for the first layer is $\mathbf{x}^0$.
4.2.1. Adaptive Multi-channel Self-attention (AMS)
The AMS layer is designed to capture user interest patterns by separating the processing of hidden states (semantics), positional information, and temporal signals into distinct attention channels. This allows each channel to specialize.
As depicted in Figure 3, there are three types of channels: semantic, temporal, and positional.
Figure 3: Illustration of Adaptive Multi-channel Selfattention (AMS). In contrast to the conventional multi-head self-attention, AMS decouples the modeling of temporal and positional information from semantics information.
The computation begins by applying Root Mean Square (RMS) Layer Normalization to the input of the current layer, $\mathbf{x}^{l-1}$, to stabilize training.
$
\tilde{\mathbf{x}}^l = \mathrm{RMSN}(\mathbf{x}^{l-1})
$
Where $\mathrm{RMSN}(\cdot)$ denotes the Root Mean Square (RMS) Layer Normalization operation [75].
Next, the query, key, and value matrices for the semantic channel are computed using linear transformations followed by a non-linear activation function (SiLU [11]).
$
\mathbf{q}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_q^l), \quad \mathbf{k}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_k^l), \quad \mathbf{v}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_v^l)
$
Where:
  - $\mathbf{W}_q^l$, $\mathbf{W}_k^l$, $\mathbf{W}_v^l$ are learnable weight matrices for the query, key, and value transformations, respectively, in the $l$-th layer.
  - $d_h$ represents the size of each attention head.
  - $\phi$ is the SiLU activation function.

The attention weights for each of the three channels are calculated separately:
- Semantic Channel Attention Weights ($\mathbf{a}_h^l$): These are computed similarly to standard self-attention, using the query and key matrices, followed by a non-linear activation function.
$
\mathbf{a}_h^l = \frac{1}{n} \phi(\mathbf{q}^l (\mathbf{k}^l)^T)
$
Where:
  - $\phi$ is again the SiLU activation function. The scaling factor $\frac{1}{n}$ is applied, where $n$ is the sequence length.
  - The use of SiLU instead of softmax for attention weights in sequential recommendation has been shown to be effective [72].
- Temporal Channel Attention Weights ($\mathbf{a}_t^l$): These attention weights depend only on the difference in relative timestamps between items.
$
(\mathbf{a}_t^l)_{i,j} = \alpha(t_j - t_i)
$
Where:
  - $t_j - t_i$ is the time difference between item $j$ and item $i$.
  - $\alpha(\cdot)$ represents a mapping function that converts the time difference into buckets, with each bucket associated with a learnable parameter [42]. This means the attention weight for a temporal difference is looked up from a set of learned parameters corresponding to time-difference buckets.
- Positional Channel Attention Weights ($\mathbf{a}_p^l$): These attention weights depend only on the relative positions between items.
$
(\mathbf{a}_p^l)_{i,j} = \beta_{j-i}
$
Where:
  - $j-i$ is the relative positional difference between item $j$ and item $i$.
  - $\pmb{\beta}$ is a vector of learnable parameters, where $\beta_{j-i}$ is a learnable parameter specifically for the relative position $j-i$.

Crucially, for the temporal and positional channels, there is no further need to calculate separate query and key matrices. Instead, they share the value matrices with the semantic channel: while the formula for $\mathbf{v}^l$ is shown once, it serves all channels. For simplicity and clarity, the detailed formulation here assumes a single head per channel, but the approach can be extended to multiple heads, similar to multi-head self-attention [54].
After computing the attention weights for each channel, the outputs of these channels are concatenated. The final output of the AMS layer, denoted $\mathbf{h}^l$, is obtained by:
$
\mathbf{h}^l = \mathrm{RMSN}(\mathrm{concat}(\mathbf{a}_h^l \mathbf{v}_h^l, \mathbf{a}_p^l \mathbf{v}_p^l, \mathbf{a}_t^l \mathbf{v}_t^l)) \otimes \phi(\mathbf{x}^l \mathbf{W}_u^l)
$
Where:
- $\mathbf{v}_h^l$, $\mathbf{v}_p^l$, $\mathbf{v}_t^l$ are the value matrices for the semantic, positional, and temporal channels, respectively. As stated, for simplicity these value matrices are shared across channels, so in practice they are all derived from $\mathbf{v}^l$.
- $\mathrm{concat}(\cdot)$ denotes concatenation of the outputs from the different channels. The concatenated output has dimension $3 d_h$ if each channel contributes $d_h$.
- $\mathrm{RMSN}$ is again the Root Mean Square Layer Normalization.
- $\mathbf{W}_u^l$ is a learnable parameter matrix.
- $\phi$ is the SiLU activation function.
- $\otimes$ denotes element-wise multiplication (Hadamard product). The term $\phi(\mathbf{x}^l \mathbf{W}_u^l)$ (which the paper, following HSTU [72], refers to as the matrix $\mathbf{u}^l$) introduces explicit 2nd-order interactions. Note that the formula writes $\mathbf{x}^l$ as the input to $\mathbf{W}_u^l$; based on the standard Transformer structure and the RMSN applied to the input used for $\mathbf{q}$, $\mathbf{k}$, $\mathbf{v}$, this is more likely $\mathbf{x}^{l-1}$ or $\tilde{\mathbf{x}}^l$. However, adhering strictly to the paper's text, it is written as $\mathbf{x}^l$.
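The following is a minimal, single-head PyTorch sketch of the AMS computation described above. It is an illustration rather than the authors' implementation: the class name, the log-spaced time-difference bucketing, and the omission of causal masking and multi-head splitting are all assumptions, and `nn.RMSNorm` requires a recent PyTorch release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMS(nn.Module):
    """Single-head sketch of Adaptive Multi-channel Self-attention (semantic, positional, temporal channels)."""
    def __init__(self, dim: int, head_dim: int, max_len: int, n_buckets: int):
        super().__init__()
        self.rmsn = nn.RMSNorm(dim)
        self.Wq = nn.Linear(dim, head_dim, bias=False)
        self.Wk = nn.Linear(dim, head_dim, bias=False)
        self.Wv = nn.Linear(dim, head_dim, bias=False)          # value shared by all three channels
        self.Wu = nn.Linear(dim, 3 * head_dim, bias=False)      # gate u for the Hadamard product
        self.beta = nn.Parameter(torch.zeros(2 * max_len - 1))  # one weight per relative position j - i
        self.alpha = nn.Parameter(torch.zeros(n_buckets))       # one weight per time-difference bucket
        self.out_norm = nn.RMSNorm(3 * head_dim)
        self.max_len, self.n_buckets = max_len, n_buckets

    def forward(self, x, timestamps):
        # x: (batch, n, dim); timestamps: (batch, n)
        n = x.size(1)
        x_tilde = self.rmsn(x)
        q, k, v = F.silu(self.Wq(x_tilde)), F.silu(self.Wk(x_tilde)), F.silu(self.Wv(x_tilde))
        a_h = F.silu(q @ k.transpose(-2, -1)) / n                          # semantic channel weights
        rel = torch.arange(n, device=x.device)
        a_p = self.beta[rel[None, :] - rel[:, None] + self.max_len - 1]    # positional channel weights
        dt = (timestamps[:, None, :] - timestamps[:, :, None]).abs().float()
        buckets = torch.log1p(dt).long().clamp(max=self.n_buckets - 1)     # assumed bucketing scheme
        a_t = self.alpha[buckets]                                          # temporal channel weights
        concat = torch.cat([a_h @ v, a_p @ v, a_t @ v], dim=-1)            # channels share the value matrix
        return self.out_norm(concat) * F.silu(self.Wu(x))                  # h^l = RMSN(concat) ⊗ SiLU(x W_u)
```

The concatenated output has dimension 3·head_dim; the MFFN's first stage (next subsection) projects it back to the model dimension.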
4.2.2. Multi-stage Feed-Forward Network (MFFN)
The MFFN processes the output of the AMS layer and consists of two stages (Figure 4).
Figure 4: Diagram of MFFN: Stage 1 fuses outputs from different channels; Stage 2 facilitates implicit feature interactions.
- Stage 1: Channel Fusion: The first stage receives the concatenated output $\mathbf{h}^l$ from the AMS layer and applies a linear projection, then combines it with the original input of the current layer (for a residual connection).
$
\mathbf{o}^l = \mathbf{h}^l \mathbf{W}_o^l + \mathbf{x}^{l-1}
$
Where:
  - $\mathbf{W}_o^l$ is a learnable projection matrix. Its purpose is to project the concatenated channel outputs (with dimension $3 d_h$) back to the model's latent dimension $d$.
  - $\mathbf{x}^{l-1}$ is the input to the current FuXi block, serving as a residual connection.
- Stage 2: Implicit Feature Interactions: The second stage performs enhanced implicit feature interactions. It first applies RMS Layer Normalization to the output of the first stage, $\mathbf{o}^l$, and then passes it through a SwiGLU activation function [46], followed by another residual connection.
$
\mathbf{x}^l = \mathrm{FFN}_l(\mathrm{RMSN}(\mathbf{o}^l)) + \mathbf{o}^l
$
Where:
  - $\mathbf{x}^l$ is the final output of the $l$-th FuXi Block.
  - $\mathrm{RMSN}(\mathbf{o}^l)$ is the RMS Layer Normalization applied to the output of the first stage.
  - $\mathrm{FFN}_l$ denotes the SwiGLU-based feed-forward network, defined as:
$
\mathrm{FFN}_l(\mathbf{x}) = \mathrm{SwiGLU}(\mathbf{x}) \mathbf{W}_3^l = (\phi(\mathbf{x} \mathbf{W}_1^l) \otimes (\mathbf{x} \mathbf{W}_2^l)) \mathbf{W}_3^l
$
Where:
  - $\phi$ represents the SiLU activation function.
  - $\otimes$ denotes element-wise multiplication.
  - $\mathbf{W}_1^l$, $\mathbf{W}_2^l$, $\mathbf{W}_3^l$ are learnable parameter matrices for the FFN in the $l$-th layer.
  - $d_f$ (the output dimension of $\mathbf{W}_1^l$ and $\mathbf{W}_2^l$) is the hidden dimension of the FFN. This structure (similar to LLaMa [53]) allows for complex non-linear transformations and implicit feature interactions.
4.3. Prediction Layer & Optimization Objective
After passing through $b$ layers of FuXi Blocks, the final output $\mathbf{x}^b$ contains rich information about the sequence.
- Prediction: To predict the next item, the final latent representation of each position is multiplied with the transpose of the item embedding matrix $\mathbf{E}$. This projects the latent representations back into the item embedding space. A softmax function is then applied to obtain a probability distribution over all possible items. (The paper does not provide an explicit formula for the prediction layer, but describes it as "multiplication with the transpose of the input embedding matrix, followed by a softmax function". The general formulation is often $P(\text{next item} = j \mid S_u) \propto \exp(\mathbf{x}^b_k \cdot \mathbf{e}_j)$, where $\mathbf{x}^b_k$ is the representation at the position where the prediction is made and $\mathbf{e}_j$ is the embedding of item $j$. A linear projection might be applied before the dot product with the item embeddings. For strict adherence, the textual description is used.)
- Optimization Objective (Loss Function): To accelerate training, the model uses a sampled softmax loss with randomly sampled negative samples [25]. This is a common technique in large-vocabulary prediction tasks, where calculating softmax over all items is computationally expensive. Instead, the loss is computed over the true next item and a subset of negative (uninteracted) items.
$
\mathcal{L} = -\sum_{k=1}^{n} \log \left( \frac{\exp(\mathbf{s}_k \cdot \mathbf{e}_{target_k})}{\sum_{j \in \{target_k\} \cup \mathrm{NegativeSamples}} \exp(\mathbf{s}_k \cdot \mathbf{e}_j)} \right)
$
Where:
  - $\mathbf{s}_k$ is the output latent representation from the FuXi block for the $k$-th position (after $b$ layers).
  - $\mathbf{e}_{target_k}$ is the embedding of the true next item following the item at position $k-1$.
  - $\mathrm{NegativeSamples}$ are randomly chosen items that were not interacted with.
  - The sum is taken over all positions in the sequence for which a prediction is made during training.
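To illustrate the training objective, here is a simplified sampled softmax loss in PyTorch; it assumes the negatives are shared across the batch and omits the masking of padded positions, so it is a sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(s, target_emb, neg_emb):
    """Sampled softmax over the true next item plus a handful of random negatives.

    s:          (batch, n, d)  final latent states from the stacked FuXi blocks
    target_emb: (batch, n, d)  embeddings of the true next items
    neg_emb:    (num_neg, d)   embeddings of randomly sampled negative items
    """
    pos_logit = (s * target_emb).sum(-1, keepdim=True)    # (batch, n, 1)
    neg_logits = s @ neg_emb.T                            # (batch, n, num_neg)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    # the positive item sits at index 0 of every logit row
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```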
4.4. Space and Time Complexity
4.4.1. Space Complexity
The space complexity is determined by the number of parameters in the model.
- FuXi Block: Each FuXi block contains an AMS layer and an MFFN.
  - AMS layer: Four projection matrices ($\mathbf{W}_q^l$, $\mathbf{W}_k^l$, $\mathbf{W}_v^l$ for the semantic channel, and $\mathbf{W}_u^l$ for the 2nd-order interaction) with a total of $O(d \cdot d_h)$ parameters. Positional and temporal embeddings contribute $O(n + n_B)$ parameters, where $n$ is the sequence length and $n_B$ is the number of buckets for temporal differences.
  - MFFN: Four projection matrices (one for fusion and three for SwiGLU) with $O(d \cdot d_h + d \cdot d_f)$ parameters.
- Item Embeddings: The total number of item embedding parameters is $O(|\mathcal{I}| d)$.

Assuming $d_h = O(d)$, $d_f = O(d)$, and $n_B = O(n)$ (i.e., hidden dimensions and bucket counts are proportional to the latent dimension and sequence length, respectively), and stacking $b$ FuXi layers, the total space complexity is given by:
$
O(b(d^2 + n) + |\mathcal{I}|d)
$
Where:
- $b$ is the number of FuXi blocks.
- $d$ is the latent dimension.
- $n$ is the sequence length.
- $|\mathcal{I}|$ is the total number of items.
4.4.2. Time Complexity
The time complexity is determined by the computational cost of operations within each FuXi block and the prediction layer.
- AMS Layer:
  - Computing attention weights for the semantic channel: $O(n^2 d_h)$ (due to the matrix multiplication $\mathbf{q}^l (\mathbf{k}^l)^T$, where $\mathbf{q}^l$ and $\mathbf{k}^l$ are $n \times d_h$).
  - Computing attention weights for the other channels (temporal/positional): $O(n^2)$ (as these involve lookups or fixed calculations for the attention matrices).
  - Calculating the QKV matrices: $O(n d d_h)$ (linear projections for queries, keys, and values).
- MFFN:
  - Requires $O(n d^2)$ operations for its linear transformations and SwiGLU calculations.
- Prediction Layer:
  - The cost for generating predictions over all items is $O(n |\mathcal{I}| d)$ (multiplying final hidden states with item embeddings); with sampled softmax this cost is reduced during training.

Thus, the overall time complexity for training is:
$
O(b n^2 d + n (b d^2 + |\mathcal{I}| d))
$
Where:
- $b$ is the number of FuXi blocks.
- $n$ is the sequence length.
- $d$ is the latent dimension.
- $|\mathcal{I}|$ is the total number of items.

The term $b n^2 d$ comes from the self-attention mechanism, $b n d^2$ from the FFNs, and $n |\mathcal{I}| d$ from the prediction layer (which is simplified in practice by negative sampling).
4.5. Polynomial Approximation of Feature Interactions
The paper analyzes the explicit inter-item interactions implemented by FuXi-α by simplifying the model. It considers the $l$-th layer of the FuXi Block, treating attention weights as constants, and omitting the second stage of the MFFN, activation functions, and most projection transformations.
In this simplified context, the output of an item after one block can be approximated as:
$
f_{block}^{(l)}(x_i; x_1, ..., x_n) = x_i \circ \left( \sum_{j=1}^{n} a_{i,j}^{(l)} x_j \right) + x_i
$
Where:
- $x_1, ..., x_n$ are the latent representations (embeddings) of items input to the $l$-th layer.
- $x_i$ is the latent representation of the $i$-th item.
- $\circ$ denotes an interaction operator, such as element-wise multiplication.
- $a_{i,j}^{(l)}$ are the attention weights in the $l$-th layer, treated as constants.

Let $x_{b,i}$ denote the output latent representation of the $i$-th item after the $b$-th layer. The paper shows, using mathematical induction, that $x_{b,i}$ is a polynomial of degree up to $2^b$ in the initial item embeddings $x_{0,1}, ..., x_{0,n}$. Specifically, $x_{b,i}$ can be expressed in the form $x_{0,i} \circ F_{2^b-1}$, where $F_{2^b-1}$ is a polynomial of degree $2^b - 1$.

Base Case (b=0):
For the input layer ($b = 0$), $x_{0,i}$ is simply the initial embedding of item $i$, a polynomial of degree 1. This fits the pattern $x_{0,i} = x_{0,i} \circ F_0$ if $F_0$ is interpreted as a constant factor of degree 0.

Inductive Step (b=l+1): Assume the property holds for some integer $l \geq 0$, meaning $x_{l,i}$ incorporates interactions up to degree $2^l$. Now consider the $(l+1)$-th layer:
$
x_{l+1,i} = x_{l,i} \circ \sum_{j=1}^{n} a_{i,j}^{(l+1)} x_{l,j} + x_{l,i}
$
Substituting $x_{l,i} = x_{0,i} \circ F_{2^l-1}$ (where $F_{2^l-1}$ represents a polynomial of degree $2^l - 1$ in the initial item embeddings and constant factors):
$
x_{l+1,i} = x_{0,i} \circ F_{2^l-1} \circ \left( \sum_{j=1}^{n} a_{i,j}^{(l+1)} x_{0,j} \circ F_{2^l-1} + 1 \right)
$
Each term in the summation is a polynomial of degree $2^l$, so the summation itself is a polynomial of degree $2^l$, and adding 1 does not change its highest degree. Since $\circ$ is element-wise multiplication, the product of a polynomial of degree $2^l$ and a polynomial of degree $2^l$ yields a polynomial of degree $2^{l+1}$, and the bracketed factor combined with $F_{2^l-1}$ has degree $2^{l+1} - 1$, so $x_{l+1,i} = x_{0,i} \circ F_{2^{l+1}-1}$. This means that after $b$ layers, $x_{b,i}$ incorporates feature interactions among the initial item embeddings up to a degree of $2^b$. This analysis highlights the model's capacity to capture high-order feature interactions, which is crucial for its expressive power.
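The degree-doubling argument can be summarized with a small bookkeeping function; this is only an illustration of the induction above, under the stated simplifications (constant attention weights, element-wise product).

```python
# Degree bookkeeping for the simplified block x_{l+1,i} = x_{l,i} ∘ (Σ_j a_ij x_{l,j}) + x_{l,i}:
# if deg(x_l) = D, the product term has degree D + D, so deg(x_{l+1}) = 2D and deg(x_b) = 2^b.
def max_interaction_degree(num_layers: int) -> int:
    degree = 1                      # raw item embeddings are degree-1 terms
    for _ in range(num_layers):
        degree += degree            # element-wise product adds the degrees of its two operands
    return degree

print([max_interaction_degree(b) for b in range(5)])  # [1, 2, 4, 8, 16] -> degree 2^b after b layers
```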
4.6. Analysis of AMS
The paper contrasts the AMS layer's handling of positional information with that of the T5 architecture [42], which uses relative positional embeddings by adding a bias term to the attention weights.
In T5, the attention weights are computed as:
$
\mathbf{A} = \phi\left( (\mathbf{x} \mathbf{W}_q)(\mathbf{x} \mathbf{W}_k)^T + \mathbf{B} \right)
$
Where:
- $\phi$ is a non-linear function (e.g., softmax or SiLU).
- $\mathbf{W}_q$ and $\mathbf{W}_k$ generate queries and keys.
- $\mathbf{B}$ is the matrix of relative positional bias terms $b_{i,j}$.

The output for the $i$-th item in multi-head self-attention, with this positional bias, can be approximately decomposed into two terms:
$
o_i \approx W_o \left( \left( \sum_j \phi_1(q_i k_j^T) V_j \right) \otimes u_i \right) + W_o \left( \left( \sum_j \phi_2(b_{i,j}) V_j \right) \otimes u_i \right)
$
Where:
- $q_i$, $k_j$, $V_j$ are query, key, and value vectors.
- $u_i$ is a vector used for the Hadamard product (element-wise multiplication).
- $\phi_1$, $\phi_2$ represent the non-linear function.
- $W_o$ is an output projection matrix.

In contrast, the AMS layer calculates its output as:
$
o_i = W_{o1} \left( \left( \sum_j \phi(q_i k_j^T) V_j \right) \otimes u_i^{(1)} \right) + W_{o2} \left( \left( \sum_j b_{i,j} V_j \right) \otimes u_i^{(2)} \right)
$
(Note: The paper's formula for $o_i$ in the AMS analysis only explicitly shows two terms, one for semantic attention and one for the positional bias $b_{i,j}$. Given that AMS actually has three channels (semantic, temporal, positional) and the AMS output combines three terms, this simplified $o_i$ may be abstracting or combining the temporal and positional terms, or comparing only one aspect; for strict adherence it is represented as written. The $W_{o1}$ and $W_{o2}$ here correspond to parts of $\mathbf{W}_o^l$ from the MFFN's first stage, and $u_i^{(1)}$, $u_i^{(2)}$ correspond to components of $\phi(\mathbf{x}^l \mathbf{W}_u^l)$ for the semantic and positional channels, respectively.)

The key difference highlighted is that in AMS, $W_{o1}$ and $W_{o2}$ are distinct parameters (or parts of $\mathbf{W}_o^l$), and $u_i^{(1)}$ and $u_i^{(2)}$ are also distinct components. This allows the model to learn separate projections and interactions for the semantic and positional (and temporal) information, rather than simply adding a bias term to a single attention weight calculation. This explicit separation into distinct processing channels enables AMS to learn a more expressive representation of positional and temporal information.
4.7. Relationship with Existing Models
FuXi-α builds upon and differentiates itself from existing models:
4.7.1. SASRec and LLaMa
- SASRec [23]: A foundational Transformer-based sequential recommender.
- LLaMa [10]: A state-of-the-art Large Language Model architecture, used as a powerful Transformer baseline in recommendation.
- Differences from FuXi-α:
  - Self-Attention Layer: SASRec and LLaMa use standard multi-head self-attention. FuXi-α replaces this with the Adaptive Multi-channel Self-attention (AMS) layer to independently model temporal, positional, and semantic information, leading to better feature utilization.
  - Feed-Forward Network: Both SASRec and LLaMa use standard FFNs (MLPs). FuXi-α introduces the Multi-stage Feedforward Network (MFFN), which specifically processes the multi-channel information from AMS and enhances implicit feature interactions.
4.7.2. HSTU
- HSTU [72]: A recent advanced Transformer-based model that aims to scale sequential recommendation models and uses positional and temporal information.
- Differences from FuXi-α:
  - Integration of Temporal and Positional Data: HSTU incorporates relative temporal and positional data by adding these features directly to attention weights; FuXi-α argues this can dilute their impact. FuXi-α's AMS layer explicitly decouples temporal, positional, and semantic information into distinct channels within the self-attention mechanism, allowing for more granular and effective modeling.
  - Implicit Feature Interactions: A significant difference is that HSTU lacks an FFN layer. It relies primarily on self-attention and explicit feature interactions. FuXi-α addresses this by introducing the MFFN to facilitate robust implicit interactions, thereby capturing more complex item relationships that HSTU might miss.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on four datasets: three public datasets and one private large-scale industrial dataset.
-
MovieLens-1M:
- Source: Publicly available movie recommendation dataset.
- Characteristics: Contains user ratings and tagging activities. It is a relatively smaller dataset.
- Statistics:
- Users: 6,041
- Items: 3,706
- Interactions: 1,000,209
- Average Length of Interaction Sequence: 165.60
- Why chosen: A common benchmark for sequential recommendation, allowing comparison with a wide range of existing methods.
-
MovieLens-20M:
- Source: Publicly available movie recommendation dataset.
- Characteristics: A larger subset of the MovieLens dataset, also containing user ratings and tagging activities.
- Statistics:
- Users: 138,493
- Items: 26,744
- Interactions: 20,000,263
- Average Length of Interaction Sequence: 144.41
- Why chosen: Provides a larger-scale benchmark to evaluate model performance and scaling capabilities compared to MovieLens-1M.
-
KuaiRand:
- Source: Collected from user logs of a video-sharing app (Kuaishou).
- Characteristics: Users are highly active, with over 200 interactions on average.
- Statistics:
- Users: 25,634
- Items: 7,550
- Interactions: 6,945,823
- Average Length of Interaction Sequence: 270.96
- Why chosen: Represents a real-world social media/short video interaction scenario, characterized by high user activity and longer sequences.
-
Industrial:
-
Source: Private dataset constructed from user records of a mainstream music listening app (Huawei Music).
-
Characteristics: Very large-scale, with tens of millions of active users monthly. User behavior sequences are constructed from over a month of positive behaviors (collect, like, play, etc.).
-
Statistics:
- Users: 19,252,028
- Items: 234,488
- Interactions: 1,023,711,774
- Average Length of Interaction Sequence: 53.17
-
Why chosen: Crucial for evaluating the model's performance on a massive, real-world industrial scale, especially for assessing adherence to scaling laws and practical deployment impact.
The following are the statistics from Table 1 of the original paper:

| Dataset | Users | Items | Interactions | Avg. Len. |
|---|---|---|---|---|
| MovieLens-1M | 6,041 | 3,706 | 1,000,209 | 165.60 |
| MovieLens-20M | 138,493 | 26,744 | 20,000,263 | 144.41 |
| KuaiRand | 25,634 | 7,550 | 6,945,823 | 270.96 |
| Industrial | 19,252,028 | 234,488 | 1,023,711,774 | 53.17 |
For MovieLens-1M and MovieLens-20M, pre-processed train/validation/test sets from HSTU [72] were used. For KuaiRand and Industrial datasets, similar preprocessing was applied by the authors.
5.2. Evaluation Metrics
The paper employs standard evaluation metrics for recall performance in recommendation systems. Higher values indicate better performance for all metrics.
- Hit Ratio at K (HR@K):
  - Conceptual Definition: HR@K measures whether the ground-truth item is present in the top-K recommended items. It indicates the proportion of users for whom the relevant item was successfully recommended within the top K positions. It is a measure of recall.
  - Mathematical Formula:
$
\mathrm{HR}@K = \frac{\text{Number of users for whom the ground-truth item is in the top-K list}}{\text{Total number of users}}
$
  - Symbol Explanation:
    - Number of users for whom the ground-truth item is in the top-K list: the count of users for whom the next actual item they interacted with (ground truth) is found within the first K items of the model's recommendation list.
    - Total number of users: the total number of users in the evaluation set.
- Normalized Discounted Cumulative Gain at K (NDCG@K):
  - Conceptual Definition: NDCG@K is a measure of ranking quality. It considers not only whether the relevant item is in the top-K list (like HR@K) but also its position; relevant items ranked higher contribute more to the score. It is normalized to be between 0 and 1, where 1 means a perfect ranking.
  - Mathematical Formula:
$
\mathrm{NDCG}@K = \frac{1}{|\mathcal{U}|} \sum_{u=1}^{|\mathcal{U}|} \frac{\mathrm{DCG}@K_u}{\mathrm{IDCG}@K_u}
$
Where:
$
\mathrm{DCG}@K_u = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)}, \qquad \mathrm{IDCG}@K_u = \sum_{j=1}^{K} \frac{2^{rel_j^{ideal}} - 1}{\log_2(j+1)}
$
  - Symbol Explanation:
    - $|\mathcal{U}|$: total number of users.
    - $\mathrm{DCG}@K_u$: Discounted Cumulative Gain for user $u$ at rank $K$.
    - $\mathrm{IDCG}@K_u$: Ideal Discounted Cumulative Gain for user $u$ at rank $K$, representing the maximum possible value if all relevant items were perfectly ranked.
    - $j$: rank position.
    - $rel_j$: relevance score of the item at position $j$ in the recommended list. For implicit feedback, this is typically 1 if the item is the ground-truth item, and 0 otherwise.
    - $rel_j^{ideal}$: relevance score of the item at position $j$ in the ideal ranking (i.e., if all ground-truth items were ranked at the top).
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank 1, the reciprocal rank is 1; if at rank 2, it is 1/2, and so on. It is particularly useful for tasks where only one correct answer is expected, or where finding any relevant item quickly is important.
  - Mathematical Formula:
$
\mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u=1}^{|\mathcal{U}|} \frac{1}{\mathrm{rank}_u}
$
  - Symbol Explanation:
    - $|\mathcal{U}|$: total number of users.
    - $\mathrm{rank}_u$: the rank of the first relevant item for user $u$. If no relevant item is found, the reciprocal rank is 0.

The paper reports HR@K and NDCG@K for K = 10 and 50, along with MRR. A minimal sketch of how these metrics can be computed follows.
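For concreteness, the sketch below computes HR@K, NDCG@K, and MRR from the rank of each user's ground-truth item (a single relevant item per user, as in this next-item evaluation setup); function and variable names are illustrative.

```python
import numpy as np

def hr_ndcg_mrr(ranks, k=10):
    """Compute HR@K, NDCG@K and MRR from the 1-based rank of each user's ground-truth item."""
    ranks = np.asarray(ranks, dtype=float)
    hit = ranks <= k
    hr = hit.mean()                                              # HR@K: fraction of users with a top-K hit
    ndcg = np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean()   # single relevant item -> IDCG@K = 1
    mrr = (1.0 / ranks).mean()                                   # reciprocal rank of the ground-truth item
    return hr, ndcg, mrr

# usage: ground-truth items ranked 1st, 3rd, 12th and 60th for four users
print(hr_ndcg_mrr([1, 3, 12, 60], k=10))
```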
5.3. Baselines
For a comprehensive comparison, FuXi-α is compared against two categories of representative baselines:
-
Conventional Models:
- BPRMF [44]: Bayesian Personalized Ranking Matrix Factorization. A classical collaborative filtering method based on pairwise ranking optimization.
- GRU4Rec [20]: A Recurrent Neural Network (RNN) based model using Gated Recurrent Units (GRUs) for session-based recommendation, capturing sequential patterns.
- NARM [28]: Neural Attentive Recommendation Machine. An RNN-based model that incorporates an attention mechanism to capture users' main purpose and sequential behavior.
-
Autoregressive Generative Models (Transformer-based):
-
SASRec [23]: Self-Attentive Sequential Recommendation. A pioneering model that applies the Transformer's self-attention mechanism to sequential recommendation, allowing it to capture long-range dependencies.
-
LLaMa [10]: A Large Language Model architecture used as a strong Transformer-based baseline, often adapted for recommendation tasks due to its powerful sequential modeling capabilities.
-
- HSTU [72]: Hierarchical Sequential Transduction Unit. An advanced Transformer-based model that incorporates positional and temporal information with element-wise multiplication for explicit interactions, specifically designed to scale with model size.

The paper also includes "Large" versions of SASRec, LLaMa, and HSTU (e.g., SASRec-Large), which are scaled up to 8 layers to analyze scaling effects, compared to the default 2 layers used for the basic capacity comparison.
5.4. Parameter Settings
- Implementation: FuXi-α is implemented using PyTorch [38].
- Large-scale Training: Multi-machine and multi-card parallelism are used with the Accelerate library [26] to enable training of large models.
- Fair Comparison: For MovieLens-1M and MovieLens-20M, model parameters (except the number of layers) are kept consistent with HSTU [72] to ensure a fair comparison.
- KuaiRand-specific Settings:
  - Hidden dimension: 50
  - Number of negative samples: 128 (default)
- Common Parameters: The optimizer, learning rate, and weight decay are consistent with HSTU [72] across all datasets.
- Embedding and Self-attention Dimensions: Embedding dimensions and self-attention hidden vector dimensions are identical across all three public datasets.
- Number of Layers:
  - Basic Modeling Capacity: set to 2 layers for the initial comparisons.
  - Scaling Analysis: extended to 8 layers (denoted "-Large" in the tables) for the generative models to observe scaling effects.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate FuXi-α's superior performance across various datasets and its adherence to scaling laws.
6.1.1. Public Dataset Performance (RQ1)
The following are the results from Table 2 of the original paper:
| Dataset | MovieLens-1M | MovieLens-20M | KuaiRand | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| BPRMF | 0.0607 | 0.1027 | 0.1185 | 0.3127 | 0.0556 | 0.0629 | 0.1074 | 0.1241 | 0.3300 | 0.0572 | 0.0248 | 0.0468 | 0.0520 | 0.1560 | 0.0235 |
| GRU4Rec | 0.1015 | 0.1460 | 0.1816 | 0.3864 | 0.0895 | 0.0768 | 0.1155 | 0.1394 | 0.3177 | 0.0689 | 0.0289 | 0.0531 | 0.0597 | 0.1726 | 0.0275 |
| NARM | 0.1350 | 0.1894 | 0.2445 | 0.4915 | 0.1165 | 0.1037 | 0.1552 | 0.1926 | 0.4281 | 0.0910 | 0.0411 | 0.0747 | 0.0836 | 0.2399 | 0.0387 |
| SASRec | 0.1594 | 0.2187 | 0.2824 | 0.5500 | 0.1375 | 0.1553 | 0.2119 | 0.2781 | 0.5353 | 0.1330 | 0.0486 | 0.0877 | 0.0978 | 0.2801 | 0.0454 |
| LLaMa | 0.1620 | 0.2207 | 0.2926 | 0.5591 | 0.1373 | 0.1640 | 0.2206 | 0.2915 | 0.5476 | 0.1402 | 0.0495 | 0.0878 | 0.0973 | 0.2752 | 0.0466 |
| HSTU | 0.1639 | 0.2238 | 0.2969 | 0.5672 | 0.1390 | 0.1642 | 0.2225 | 0.2909 | 0.5553 | 0.1410 | 0.0491 | 0.0861 | 0.0992 | 0.2718 | 0.0451 |
| FuXi-α | 0.1835 | 0.2429 | 0.3254 | 0.5941 | 0.1557 | 0.1954 | 0.2533 | 0.3353 | 0.5969 | 0.1677 | 0.0537 | 0.0942 | 0.1067 | 0.2951 | 0.0497 |
| SASRec-Large | 0.1186 | 0.1733 | 0.2183 | 0.4671 | 0.0186 | 0.0206 | 0.0379 | 0.0412 | 0.1209 | 0.0207 | 0.0285 | 0.0428 | 0.0544 | 0.1227 | 0.0258 |
| LLaMa-Large | 0.1659 | 0.2257 | 0.2990 | 0.5692 | 0.1408 | 0.1842 | 0.2412 | 0.3202 | 0.5776 | 0.1576 | 0.0494 | 0.0878 | 0.0970 | 0.2754 | 0.0466 |
| HSTU-Large | 0.1844 | 0.2437 | 0.3255 | 0.5929 | 0.1568 | 0.1995 | 0.2572 | 0.3407 | 0.6012 | 0.1714 | 0.0494 | 0.0883 | 0.0990 | 0.2799 | 0.0460 |
| FuXi-α-Large | 0.1934 | 0.2518 | 0.3359 | 0.5983 | 0.1651 | 0.2086 | 0.2658 | 0.3530 | 0.6113 | 0.1792 | 0.0555 | 0.0963 | 0.1105 | 0.2995 | 0.0510 |
Observations:
- Generative Models Outperform Conventional Models: SASRec, LLaMa, HSTU, and FuXi-α (generative models) consistently outperform BPRMF, GRU4Rec, and NARM (conventional models) across all datasets and metrics, even with only two layers. This highlights the superior ability of Transformer-based architectures in capturing complex item relationships and dynamic user preferences.
- SASRec's Scaling Issues: SASRec-Large (8 layers) shows a significant performance drop compared to SASRec (2 layers) across all three datasets. This suggests that simply stacking more layers on SASRec does not inherently lead to better performance, indicating it does not adhere well to scaling laws in its original form.
- LLaMa and HSTU Scaling: LLaMa-Large and HSTU-Large show substantial improvements over their 2-layer counterparts on MovieLens-1M and MovieLens-20M, indicating that these architectures are more amenable to scaling.
- FuXi-α's Consistent Superiority: FuXi-α consistently achieves the best results across all three datasets and evaluation metrics, regardless of whether it is a shallow (2-layer) or deep (8-layer) network.
  - Shallow Network (FuXi-α vs. HSTU): FuXi-α outperforms HSTU (the strongest baseline) by an average of 13.24% in NDCG@10, 10.59% in NDCG@50, 10.81% in HR@10, 6.94% in HR@50, and 13.72% in MRR across the three datasets.
  - Deep Network (FuXi-α-Large vs. HSTU-Large): FuXi-α-Large outperforms HSTU-Large by an average of 7.26% in NDCG@10, 5.24% in NDCG@50, 6.14% in HR@10, 3.19% in HR@50, and 6.90% in MRR.
- Interpretation: This demonstrates the effectiveness of FuXi-α's proposed Adaptive Multi-channel Self-attention (AMS) and Multi-stage Feedforward Network (MFFN) in enhancing explicit and implicit feature interactions, leading to better user behavior modeling.
6.1.2. Industrial Dataset Performance
The following are the results from Table 3 of the original paper:
| Dataset | Industrial | ||||
|---|---|---|---|---|---|
| Model | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| SASRec | 0.1009 | 0.1580 | 0.1970 | 0.4581 | 0.0868 |
| LLaMa | 0.1681 | 0.2238 | 0.2985 | 0.5498 | 0.1426 |
| HSTU | 0.1733 | 0.2289 | 0.3057 | 0.5565 | 0.1472 |
| FuXi-α | 0.1875 | 0.2424 | 0.3230 | 0.5702 | 0.1601 |
Observations:
- Significant Gains over SASRec: LLaMa and HSTU achieve substantial improvements over SASRec in the music recommendation scenario on the Industrial dataset (64.82% and 71.75% gains in NDCG@10, respectively). This indicates the strong performance of more advanced Transformer architectures on large-scale, real-world data.
- FuXi-α's Dominance: FuXi-α further outperforms both LLaMa and HSTU by 11.54% and 8.19% in NDCG@10, respectively. This reconfirms its superiority on a massive industrial dataset, demonstrating its practical value.
- Potential of Scaling Laws: These results collectively highlight the potential of scaling laws in recommendation systems and underscore FuXi-α's effectiveness in harnessing this potential through its enhanced feature interaction mechanisms.
6.1.3. Scaling of FuXi-α on Industrial Dataset
The following figure (Figure 5 from the original paper) shows the scaling performance of FuXi- on the Industrial Dataset:
该图像是一个图表,展示了FuXi-模型在不同层数下的性能指标NDCG@10和HR@10的变化趋势。随着层数的增加,NDCG@10和HR@10均表现出持续的提升,表明模型在扩展时有效增强了推荐效果。
Figure 5: Scaling of FuXi- on Industrial Dataset.
Observations:
- The figure plots NDCG@10 and HR@10 against the number of layers (from 2 to 32) for FuXi-α on the Industrial dataset.
- Adherence to Scaling Law: The results clearly show a positive relationship between the model's performance and its size (number of layers). Both NDCG@10 and HR@10 continuously improve as the number of layers increases from 2 to 32.
- High Potential: This indicates that FuXi-α adheres to the scaling law, implying that its performance can be further improved by increasing model size, training data, and computational resources. This is a highly desirable property for large-scale recommendation systems.
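Readers who want to quantify such a trend on their own measurements can fit a simple power law to (depth, metric) pairs; a minimal sketch follows. The numeric values below are hypothetical placeholders, not figures from the paper.

```python
# Minimal sketch of checking scaling-law behaviour from measurements like those in Figure 5.
# The (layers, NDCG@10) pairs are hypothetical placeholders; substitute your own numbers.
import numpy as np

layers = np.array([2, 4, 8, 16, 32])
ndcg10 = np.array([0.170, 0.178, 0.184, 0.189, 0.193])  # hypothetical measurements

# Fit NDCG@10 ~ a * layers^b in log-log space; b > 0 indicates continued gains with depth.
b, log_a = np.polyfit(np.log(layers), np.log(ndcg10), deg=1)
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3f}")
```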
6.2. Efficiency Comparison (RQ2)
The following are the results from Table 4 of the original paper:
| Model (KuaiRand) | TPS@200 | TPS@400 | TPS@600 | TPS@800 |
|---|---|---|---|---|
| SASRec | 2481 | 2024 | 1672 | 1398 |
| LLaMa | 2330 | 1972 | 1602 | 1326 |
| HSTU | 2078 | 1183 | 680 | 394 |
| FuXi-α | 1971 | 1053 | 615 | 373 |
Observations:
- Throughput Per Second (TPS): This metric measures the number of training samples processed per second at a given sequence length (e.g., TPS@200 is throughput with sequences of length 200), indicating training efficiency.
- Impact of Sequence Length: As the sequence length increases (from 200 to 800), the TPS for all models decreases. This is expected, as longer sequences require more computation per sample, particularly for attention mechanisms that scale quadratically with sequence length.
- SASRec and LLaMa's Higher TPS: SASRec and LLaMa generally exhibit higher TPS compared to HSTU and FuXi-α. The paper attributes this to their omission (or simpler handling) of temporal information encoding; such encoding benefits accuracy but adds computational overhead.
- FuXi-α vs. HSTU Efficiency: FuXi-α shows a TPS similar to HSTU. For example, at sequence length 800, FuXi-α has a TPS of 373, while HSTU has 394. This indicates that the added complexity of AMS and MFFN in FuXi-α (to better capture feature interactions) does not significantly degrade training efficiency compared to HSTU, which already incorporates some temporal/positional modeling.
- Performance-Efficiency Trade-off: While FuXi-α and HSTU are less efficient in terms of raw TPS than SASRec and LLaMa, their superior overall performance (as shown in Tables 2 and 3) justifies the increased computational cost, especially where predictive accuracy is paramount. FuXi-α achieves significantly better overall performance while maintaining efficiency comparable to HSTU. A minimal throughput-measurement sketch follows these observations.
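For readers who want to reproduce Table 4-style numbers for their own models, here is a minimal sketch of measuring training throughput in PyTorch. `model`, `dataloader`, `optimizer`, and `loss_fn` are placeholders for your own setup; this is not the authors' benchmarking harness.

```python
# Minimal sketch of measuring training throughput (samples/sec), i.e., the TPS metric in Table 4.
import time
import torch

def measure_tps(model, dataloader, optimizer, loss_fn, device="cuda",
                warmup_steps=10, timed_steps=50):
    model.to(device).train()
    batches = iter(dataloader)
    n_samples, start = 0, None
    for step in range(warmup_steps + timed_steps):
        inputs, labels = next(batches)              # assumes (inputs, labels) batches
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()                # keep GPU timing honest
        if step == warmup_steps - 1:
            start = time.perf_counter()             # start timing after warmup
        elif step >= warmup_steps:
            n_samples += inputs.size(0)
    return n_samples / (time.perf_counter() - start)
```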
6.3. Ablation Study (RQ3)
The following are the results from Table 5 of the original paper:
| Model | ML-1M NG@10 | ML-1M HR@10 | ML-20M NG@10 | ML-20M HR@10 | KuaiRand NG@10 | KuaiRand HR@10 |
|---|---|---|---|---|---|---|
| Base | 0.1454 | 0.2676 | 0.1452 | 0.2647 | 0.0476 | 0.0928 |
| w/o AMS | 0.1563 | 0.2847 | 0.1612 | 0.2888 | 0.0470 | 0.0921 |
| w/o MFFN | 0.1878 | 0.3304 | 0.2056 | 0.3488 | 0.0534 | 0.0947 |
| FuXi-α | 0.1934 | 0.3359 | 0.2086 | 0.3530 | 0.0555 | 0.1105 |
Note: The table in the original PDF is misaligned for the MovieLens-20M and KuaiRand columns. The table above reconstructs it assuming the datasets are MovieLens-1M, MovieLens-20M, and KuaiRand, each reporting NG@10 and HR@10. The assignment of the KuaiRand values to the w/o AMS and w/o MFFN rows is inferred from the row ordering and may be imprecise.
To assess the effectiveness of each sub-module, the authors conducted an ablation study comparing the full model against three variants (a hypothetical configuration sketch follows the list):
- Base Model: This variant replaces the AMS module with the vanilla self-attention layer from SASRec and substitutes the MFFN module with a single-stage MLP from HSTU. It serves as a strong baseline reflecting a simpler Transformer architecture with HSTU's FFN approach.
- w/o AMS: This variant replaces the AMS module with the vanilla self-attention layer (similar to SASRec) but retains the MFFN module.
- w/o MFFN: This variant substitutes the MFFN module with a single-stage MLP (similar to HSTU's FFN approach) but retains the AMS module.
- FuXi-α (Full Model): The complete proposed model.
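To make the variant definitions concrete, here is a hypothetical sketch (not the authors' code) of how the four ablation settings could be expressed as two flags toggling the attention and FFN sub-modules of each block.

```python
# Hypothetical configuration flags for the four ablation settings described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    use_ams: bool   # True: Adaptive Multi-channel Self-attention; False: vanilla self-attention
    use_mffn: bool  # True: Multi-stage FFN; False: single-stage MLP (HSTU-style)

VARIANTS = {
    "Base":       BlockConfig(use_ams=False, use_mffn=False),
    "w/o AMS":    BlockConfig(use_ams=False, use_mffn=True),
    "w/o MFFN":   BlockConfig(use_ams=True,  use_mffn=False),
    "FuXi-alpha": BlockConfig(use_ams=True,  use_mffn=True),
}

for name, cfg in VARIANTS.items():
    print(f"{name:>10}: AMS={cfg.use_ams}, MFFN={cfg.use_mffn}")
```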
Observations:
- Impact of MFFN (w/o MFFN vs. FuXi-α): Removing the MFFN (the w/o MFFN variant) leads to a clear performance drop compared to the full FuXi-α model. For instance, on MovieLens-1M, NG@10 drops from 0.1934 to 0.1878 and HR@10 drops from 0.3359 to 0.3304. This emphasizes the critical role of the MFFN in facilitating thorough implicit feature interactions and boosting model expressiveness. Despite the drop, w/o MFFN still outperforms HSTU (NG@10 of 0.1639 on MovieLens-1M), suggesting that the AMS itself is highly effective.
- Impact of AMS (w/o AMS vs. FuXi-α): Replacing the AMS with a vanilla self-attention layer (the w/o AMS variant) also results in a marked performance decline. For example, on MovieLens-1M, NG@10 drops to 0.1563 and HR@10 to 0.2847. This highlights the necessity of the AMS for explicit feature interactions and its effective utilization of temporal and positional data, which is superior to generic self-attention.
- Base Model Performance: The Base model, which lacks both AMS and MFFN, shows the lowest performance among the variants, reinforcing the individual and combined importance of both proposed modules. Its NG@10 on MovieLens-1M (0.1454) is far below that of the full FuXi-α model.
- Overall Conclusion: The ablation study confirms that both the Adaptive Multi-channel Self-attention (AMS) layer and the Multi-stage Feedforward Network (MFFN) are essential components, each contributing significantly to the superior predictive capability of FuXi-α. The AMS provides a more effective way to handle temporal and positional information, while the MFFN ensures robust implicit feature interactions.
6.4. Hyperparameter Study (RQ4)
The paper investigates the effects of three key hyperparameters: number of layers, hidden dimension, and number of negative samples. Results are primarily shown for NDCG@10 and HR@10 on the MovieLens-1M and KuaiRand datasets.
6.4.1. The Number of Layers
The following figure (Figure 6 from the original paper) illustrates the performance with different numbers of layers:
(Figure description: the chart shows NDCG@10 and HR@10 on MovieLens-1M and KuaiRand for different numbers of layers; MovieLens-1M peaks at 8 layers, while KuaiRand improves steadily as depth increases.)
Figure 6: Performance with different numbers of layers.
Observations:
- MovieLens-1M: Performance (both NDCG@10 and HR@10) improves from 2 to 8 layers but then declines at 16 layers. This suggests that for smaller datasets like MovieLens-1M, increasing model depth beyond a certain point leads to overfitting or diminishing returns, as the dataset may not have enough complexity or data volume to support a very deep model.
- KuaiRand: In contrast, performance on KuaiRand consistently increases as the number of layers goes from 2 to 16. This indicates that the larger and more complex KuaiRand dataset benefits from deeper models, allowing FuXi-α to capture more intricate patterns and dependencies. This observation further supports FuXi-α's adherence to the scaling law on sufficiently large datasets.
6.4.2. The Hidden Dimension
The following figure (Figure 7 from the original paper) illustrates the performance with different hidden dimensions:
(Figure description: the chart shows NDCG@10 and HR@10 on MovieLens-1M (left) and KuaiRand (right) for different hidden dimensions; both metrics trend upward as the hidden dimension increases.)
Figure 7: Performance with different hidden dimensions.
Observations:
- MovieLens-1M: Performance on MovieLens-1M saturates at a hidden dimension of 32, with only minimal gains observed when increasing it further to 64. This is consistent with the findings for the number of layers; smaller datasets have less capacity to benefit from very high-dimensional representations.
- KuaiRand: On KuaiRand, performance steadily improves across all tested hidden dimensions (from 8 to 64). This implies that a larger hidden dimension allows the model to learn richer item representations and more accurate self-attention similarities, which is beneficial for complex, real-world datasets.
6.4.3. Negative Samples
The following figure (Figure 8 from the original paper) illustrates the performance with diverse negative sample counts:
(Figure description: the chart shows how NDCG@10 and HR@10 on MovieLens-1M and KuaiRand change with the number of negative samples; both metrics rise as the count increases, indicating that more negatives improve recommendation quality.)
Figure 8: Performance with different numbers of negative samples.
Observations:
- Consistent Improvement: Performance (both NDCG@10 and HR@10) consistently improves on both MovieLens-1M and KuaiRand as the number of negative samples increases from 32 to 256.
- Importance of Negative Sampling: The gains from increasing negative samples are substantial, in some cases even surpassing those from increasing the number of layers. This underscores the critical role of negative sampling in recommendation systems: effective negative sampling helps the model distinguish relevant items from irrelevant ones more robustly, which is crucial when the item space is vast (a minimal sampled-softmax sketch follows these observations).
- Relevance to Scaling Laws: This finding is particularly important because the influence of negative sampling is often overlooked in studies focusing on scaling laws in LLMs [21, 24]. The paper highlights that for recommendation models, optimizing negative sampling is as vital as scaling the model architecture.
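To make the role of the negative-sample count concrete, here is a minimal sketch of a sampled-softmax loss with uniformly drawn negatives. It illustrates the general technique, not the authors' training objective; `user_repr`, `item_emb`, and `pos_items` are placeholder names, and uniform sampling can occasionally draw the positive item, which a production implementation would typically mask or correct for.

```python
# Minimal sketch of a sampled-softmax loss with a configurable number of negatives.
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_repr, item_emb, pos_items, num_negatives=256):
    """user_repr: (B, d); item_emb: nn.Embedding over the item vocabulary; pos_items: (B,)."""
    B = user_repr.size(0)
    # Uniformly sample negative item ids (may rarely collide with the positive; ignored here).
    neg_items = torch.randint(0, item_emb.num_embeddings, (B, num_negatives),
                              device=user_repr.device)
    pos_logits = (user_repr * item_emb(pos_items)).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", user_repr, item_emb(neg_items))   # (B, N)
    logits = torch.cat([pos_logits, neg_logits], dim=-1)                      # (B, 1+N)
    targets = torch.zeros(B, dtype=torch.long, device=user_repr.device)       # positive at index 0
    return F.cross_entropy(logits, targets)
```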
6.5. Online A/B Test
To validate the real-world impact, an online A/B test was conducted within the Huawei Music app.
- Scenario: Main scenario of Huawei Music.
- Duration: 7 days.
- Traffic Allocation: 30% of user traffic was directed to FuXi-α.
- Baseline: A well-optimized multi-channel retrieval system, refined over several years, served as the control group.
- Results: FuXi-α demonstrated significant improvements:
  - Average number of songs played per user: increased by 4.76%.
  - Average listening duration per user: increased by 5.10%.
- Conclusion: These online results strongly indicate that FuXi-α effectively enhances user interaction and engagement, improving user experience and increasing platform usage time. Following these positive evaluations, FuXi-α was integrated as an inherent channel in this scenario, serving most of the online traffic. This is a crucial validation of the model's practical utility and its ability to deliver tangible business value in a production environment.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully proposes FuXi-α, a novel Transformer-based model for sequential recommendation that significantly advances the state-of-the-art. Its core innovations include an Adaptive Multi-channel Self-attention (AMS) mechanism, which disentangles the modeling of temporal, positional, and semantic features, and a Multi-stage Feedforward Network (MFFN) to enhance implicit feature interactions. Extensive offline experiments on both public and industrial datasets consistently demonstrate FuXi-α's superior performance compared to existing baselines, including powerful Transformer-based models like LLaMa and HSTU. Furthermore, FuXi-α exhibits continuous performance improvement with increasing model size, confirming its adherence to scaling laws. The model's real-world efficacy was validated through an online A/B test within the Huawei Music app, showing notable increases in user engagement metrics (songs played and listening duration). The ablation studies confirmed the critical contributions of both AMS and MFFN components to the model's overall effectiveness.
7.2. Limitations & Future Work
The authors acknowledge several directions for future research:
- More Complex Recommendation Problems: They plan to extend FuXi-α to tackle more intricate recommendation scenarios, such as multi-behavior recommendation (e.g., likes, clicks, and purchases within one sequence) and multi-modal recommendation (e.g., incorporating text, image, and audio features).
- Long Sequences: The model's performance on very long sequences could be further optimized. The quadratic complexity of self-attention with respect to sequence length remains a challenge for extremely long sequences, suggesting a need for more efficient attention mechanisms or hierarchical processing.
7.3. Personal Insights & Critique
The FuXi-α paper presents a compelling case for carefully designed feature interactions in large-scale sequential recommendation models. My personal insights and critique are as follows:
- Innovation in Feature Interaction: The Adaptive Multi-channel Self-attention (AMS) and Multi-stage FFN (MFFN) are genuinely innovative contributions. The idea of explicitly decoupling temporal, positional, and semantic information within self-attention, rather than simply adding them as biases or embeddings, is a nuanced and powerful approach. This targeted modeling allows each aspect to contribute optimally to the user's interest representation. The MFFN further reinforces this by providing a dedicated mechanism for implicit interactions after the multi-channel attention, addressing a gap in models like HSTU. This granular control over different types of feature interactions is a significant step forward (a conceptual sketch of the multi-channel idea appears after this list).
- Strong Empirical Evidence: The comprehensive experimental validation, including both extensive offline benchmarks and a crucial online A/B test on a large industrial platform, lends strong credibility to the model's claims. The online A/B test results, showing tangible improvements in user engagement, are particularly impressive and demonstrate real-world impact, which is often challenging to achieve with research prototypes.
- Scaling Laws for Recommendation: The paper's explicit focus on scaling laws for recommendation models, mirroring trends in LLMs, is a timely and important research direction. Demonstrating that FuXi-α can continuously improve with increased model size provides a clear path for future development and resource allocation in building even larger and more capable recommendation systems. This could motivate further exploration into the theoretical underpinnings of scaling in this domain.
- Emphasis on Negative Sampling: The hyperparameter study's finding that negative sampling significantly impacts performance, sometimes more than increasing layers, is a valuable insight. It highlights that while architectural scaling is important, fundamental training techniques like negative sampling remain crucial and should not be overlooked, especially in the context of large-vocabulary prediction tasks typical in recommendation. This is a practical takeaway for practitioners.
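Below is a conceptual PyTorch sketch of the multi-channel idea as described in this analysis: three attention-weight matrices (semantic from query-key similarity, positional from relative positions, temporal from bucketized time gaps) share one value projection, and their outputs are concatenated and projected back. Everything here, including the class name `MultiChannelSelfAttention` and the log-based time bucketing, is illustrative and is not the authors' exact AMS formulation (which uses its own alpha/beta parameterisation and fusion).

```python
# Conceptual sketch of a three-channel (semantic / positional / temporal) self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChannelSelfAttention(nn.Module):
    def __init__(self, dim, max_len=512, num_time_buckets=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.pos_bias = nn.Parameter(torch.zeros(2 * max_len - 1))    # relative-position logits
        self.time_bias = nn.Parameter(torch.zeros(num_time_buckets))  # time-gap-bucket logits
        self.out = nn.Linear(3 * dim, dim)                            # fuse the three channels
        self.max_len, self.num_time_buckets = max_len, num_time_buckets

    def forward(self, x, timestamps):
        # x: (B, L, d); timestamps: (B, L) interaction times in seconds
        B, L, d = x.shape
        causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=x.device))
        q, k, v = self.q(x), self.k(x), self.v(x)

        # 1) Semantic channel: standard scaled dot-product logits.
        sem_logits = (q @ k.transpose(-2, -1)) / d ** 0.5                        # (B, L, L)

        # 2) Positional channel: logits from relative positions only.
        rel = torch.arange(L, device=x.device)[:, None] - torch.arange(L, device=x.device)[None, :]
        idx = (rel + self.max_len - 1).clamp(0, 2 * self.max_len - 2)
        pos_logits = self.pos_bias[idx].expand(B, L, L)                          # (B, L, L)

        # 3) Temporal channel: logits from log-bucketized time gaps (bucketing is an assumption).
        gaps = (timestamps[:, :, None] - timestamps[:, None, :]).clamp(min=0).float()
        buckets = torch.log1p(gaps).long().clamp(max=self.num_time_buckets - 1)
        time_logits = self.time_bias[buckets]                                    # (B, L, L)

        outs = []
        for logits in (sem_logits, pos_logits, time_logits):
            w = F.softmax(logits.masked_fill(~causal, float("-inf")), dim=-1)
            outs.append(w @ v)                                                   # shared value matrix
        return self.out(torch.cat(outs, dim=-1))                                 # (B, L, d)
```

Treat this purely as a reading aid for the decoupling idea; the actual AMS additionally applies its learnable parameters and normalization choices described in the paper.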
Potential Areas for Improvement/Critique:
- Computational Cost of AMS: While AMS offers greater expressiveness, the explicit calculation of three separate attention weight matrices (even if sharing value matrices) and their subsequent concatenation might increase computational overhead compared to simpler attention mechanisms, especially for very long sequences. The efficiency comparison in Table 4 confirms that FuXi-α (and HSTU) are slower than SASRec and LLaMa. While justified by performance gains, explicitly discussing the trade-offs and exploring ways to optimize this (e.g., sparse attention for specific channels) could be beneficial.
- Complexity of Temporal/Positional Embeddings: The paper refers to alpha and beta parameters for temporal and positional attention without fully detailing their structure or how the bucket mapping for temporal differences is implemented. A more in-depth explanation of these learnable parameters and their behavior would enhance understanding for beginners.
- Specifics of RMSN: While RMSN is mentioned, a brief explanation of why RMSN was chosen over other normalization layers (e.g., LayerNorm) would add value, especially given its use in LLaMa.
- Ablation for the Multi-Channel Aspect of AMS: While the ablation study shows AMS is better than vanilla self-attention, it does not break down the contribution of each channel within AMS. For example, what if only semantic + temporal channels were used, or semantic + positional? This could further illuminate the benefits of the multi-channel design.
- Long Sequence Handling: While future work mentions long sequences, the current method, with O(n²) time complexity for attention, will face challenges with extremely long user histories. Exploring mechanisms like sparse attention, block-wise attention, or recurrent layers (similar to Reformer or Longformer) could be a natural extension for truly "long" sequences.
Applicability and Transferability:
The principles of FuXi-α can be applied to various domains beyond music recommendation. Any sequential recommendation task where temporal and positional information are critical, and where complex implicit interactions are expected, could benefit. This includes:
- E-commerce: Recommending products based on a user's browsing and purchase history, where the order and time between interactions are crucial (e.g., buying a phone, then a case a few days later, then a charger the next week).
- News/Content Feeds: Personalizing news articles or content streams, where recent interactions and the timing of engagement are key.
- Online Education: Recommending learning materials or courses based on a student's learning path and progress over time.
The model's strong performance and adherence to scaling laws make it a promising candidate for large-scale industrial deployment in many settings, potentially ushering in a new generation of more context-aware and personalized recommendation systems.