FuXi-α: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
TL;DR Summary
The FuXi-α model introduces an Adaptive Multi-channel Self-attention mechanism to improve the modeling of temporal, positional, and semantic features, along with a Multi-stage Feed-Forward Network to enhance implicit feature interactions, outperforming existing models in offline experiments and delivering significant gains in an online A/B test.
Abstract
Inspired by scaling laws and large language models, research on large-scale recommendation models has gained significant attention. Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy. Current state-of-the-art sequential recommendation models primarily use self-attention mechanisms for explicit feature interactions among items, while implicit interactions are managed through Feed-Forward Networks (FFNs). However, these models often inadequately integrate temporal and positional information, either by adding them to attention weights or by blending them with latent representations, which limits their expressive power. A recent model, HSTU, further reduces the focus on implicit feature interactions, constraining its performance. We propose a new model called FuXi-α to address these issues. This model introduces an Adaptive Multi-channel Self-attention mechanism that distinctly models temporal, positional, and semantic features, along with a Multi-stage FFN to enhance implicit feature interactions. Our offline experiments demonstrate that our model outperforms existing models, with its performance continuously improving as the model size increases. Additionally, we conducted an online A/B test within the Huawei Music app, which showed a 4.76% increase in the average number of songs played per user and a 5.10% increase in the average listening duration per user. Our code has been released at https://github.com/USTC-StarTeam/FuXi-alpha.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
FuXi-α: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
1.2. Authors
- Yufei Ye (aboluo2003@mail.ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Wei Guo (guowei67@huawei.com) and Jin Yao Chin (chin.jin.yao@huawei.com) from Huawei Noah's Ark Lab, Singapore, Singapore.
- Hao Wang (wanghao3@ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Hong Zhu (zhuhong8@huawei.com) and Xi Lin (linxi16@huawei.com) from Consumer Business Group, Huawei, Shenzhen, China.
- Yuyang Ye (yeyuyang@mail.ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
- Yong Liu (liu.yong6@huawei.com) from Huawei Noah's Ark Lab, Singapore, Singapore.
- Ruiming Tang (tangruiming@huawei.com) from Huawei Noah's Ark Lab, Shenzhen, China.
- Defu Lian (liandefu@ustc.edu.cn) and Enhong Chen (cheneh@ustc.edu.cn) from the University of Science and Technology of China, Hefei, China.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. While it does not specify a particular journal or conference, the authors' affiliations with prominent academic institutions (University of Science and Technology of China) and industrial research labs (Huawei Noah's Ark Lab, Consumer Business Group) suggest a strong research background in artificial intelligence, machine learning, and recommender systems.
1.4. Publication Year
2025 (Published at UTC: 2025-02-05T09:46:54.000Z)
1.5. Abstract
The paper addresses the growing interest in large-scale recommendation models, inspired by the success of scaling laws in large language models. It identifies a limitation in current state-of-the-art sequential recommendation models, which often inadequately integrate temporal and positional information and underemphasize implicit feature interactions. To overcome these issues, the authors propose FuXi-α, a new model that introduces an Adaptive Multi-channel Self-attention (AMS) mechanism to distinctly model temporal, positional, and semantic features, and a Multi-stage FFN (MFFN) to enhance implicit feature interactions. Offline experiments demonstrate that FuXi-α outperforms existing models and shows continuous performance improvement with increasing model size. An online A/B test within the Huawei Music app further validates its effectiveness, showing a 4.76% increase in the average number of songs played per user and a 5.10% increase in average listening duration per user.
1.6. Original Source Link
https://arxiv.org/abs/2502.03036v1 (Preprint) PDF Link: https://arxiv.org/pdf/2502.03036v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inadequate integration of temporal and positional information and the underemphasis of implicit feature interactions in current sequential recommendation models, particularly within the context of scaling laws observed in large models.
In recent years, the concept of "scaling laws" has emerged from the field of large language models (LLMs), demonstrating that model performance can systematically improve with increased parameters, training data, and computational resources. This discovery has driven research into applying similar scaling principles to other domains, including recommendation systems. Sequential recommendation models, which predict user interests based on historical interaction sequences, have shown promise in adopting these scaling laws. State-of-the-art sequential recommendation models, such as those based on Transformer architectures (e.g., SASRec, BERT4Rec, HSTU), primarily use self-attention mechanisms for explicit feature interactions among items. Implicit interactions are typically handled by Feed-Forward Networks (FFNs).
However, existing models face several challenges:
- Inadequate Integration of Temporal and Positional Information: Current methods often integrate temporal and positional data in a simplified manner, either by adding embeddings to input sequences, incorporating them into attention weights, or blending them with latent representations. This limits their expressive power and ability to fully capture the nuances of user behavior over time. For example, as illustrated in Figure 1, different temporal intervals or orders between items can lead to varied subsequent interactions, a subtlety that existing models struggle to capture effectively.

Figure 1: Different temporal intervals or orders between items may lead to varying subsequent interacted items. (The figure shows three item sequences whose items are connected by arrows, illustrating how timing influences which items are interacted with next.)

- Underemphasis of Implicit Feature Interactions: Models like HSTU primarily focus on explicit interactions, which can neglect the complex, non-linear relationships that implicit interactions (often handled by FFNs) can uncover. This underemphasis can constrain the model's overall performance and capacity for nuanced learning.

The paper's entry point is to explicitly address these two limitations by designing a new Transformer-based architecture that:
- More effectively disentangles and models temporal, positional, and semantic information in self-attention.
- Enhances implicit feature interactions through a specialized FFN.

By doing so, the authors aim to develop a recommendation model that not only adheres to scaling laws but also significantly improves performance by better capturing the rich dynamics of user behavior.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Proposed a Novel Model (FuXi-α): The authors introduce FuXi-α, a new Transformer-based sequential recommendation model specifically designed to leverage scaling laws while addressing the shortcomings of previous models regarding feature interactions.
- Adaptive Multi-channel Self-attention (AMS) Layer: FuXi-α incorporates an Adaptive Multi-channel Self-attention (AMS) layer. This mechanism disentangles the modeling of temporal, positional, and semantic information into distinct channels, allowing for a more expressive representation of these crucial aspects. This explicitly overcomes the limitation of prior work that inadequately integrated temporal and positional cues.
- Multi-stage Feedforward Network (MFFN): A Multi-stage Feedforward Network (MFFN) is introduced to enhance implicit feature interactions. This component aims to capture more complex, high-order relationships among features that are often overlooked or under-modeled in other architectures, particularly those that prioritize explicit interactions.
- Adherence to Scaling Laws: The research demonstrates that FuXi-α adheres to scaling laws, showing continuous performance improvement as the model size (e.g., number of layers) increases. This highlights its potential for building large-scale, highly performant recommendation systems.
- Extensive Experimental Validation:
  - Offline Performance: FuXi-α consistently outperforms state-of-the-art sequential recommendation models across multiple public datasets (MovieLens-1M, MovieLens-20M, KuaiRand) and a large-scale industrial dataset.
  - Online A/B Test Success: An online A/B test conducted within the Huawei Music app showed significant real-world impact, with a 4.76% increase in the average number of songs played per user and a 5.10% increase in the average listening duration per user. These findings solve the practical problem of improving user engagement and platform usage.
- Code Release: The authors have open-sourced their code at https://github.com/USTC-StarTeam/FuXi-alpha, facilitating reproducibility and further research.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand FuXi-α, a reader should be familiar with the following foundational concepts in recommender systems and deep learning:
- Recommender Systems: Systems that suggest items (e.g., movies, products, music) to users based on their preferences and past behavior.
- Sequential Recommendation: A sub-field of recommender systems where the order of user interactions matters. The goal is to predict the next item a user will interact with, given their historical sequence of interactions. For example, if a user watches movie A, then B, then C, a sequential recommender tries to predict movie D.
- Feature Interaction: The interplay between different features (e.g., item characteristics, user attributes, contextual information) that collectively influence a user's preference for an item.
- Explicit Feature Interaction: Directly modeling the relationships between specific features using predefined operations (e.g., dot product, bilinear function, attention). For instance, multiplying a user embedding by an item embedding to capture their compatibility.
- Implicit Feature Interaction: Learning complex, non-linear relationships among all features through deep neural networks (DNNs) without explicitly predefining the interaction type. This often leads to more flexible and powerful models but can be less interpretable.
- Embeddings: Low-dimensional, dense vector representations of discrete entities (e.g., users, items). These vectors capture semantic similarities, where similar items or users have similar embeddings.
- Neural Networks (NNs): A series of interconnected layers of artificial neurons that can learn complex patterns from data.
- Feed-Forward Networks (FFNs) / Multi-Layer Perceptrons (MLPs): Simple neural networks consisting of an input layer, one or more hidden layers, and an output layer, where information flows in one direction. They are universal function approximators and are often used to capture implicit feature interactions.
- Recurrent Neural Networks (RNNs) & Gated Recurrent Units (GRUs): Neural networks designed to process sequential data by maintaining a hidden state that carries information from previous steps. GRU4Rec is a well-known example in sequential recommendation.
- Convolutional Neural Networks (CNNs): Neural networks that use convolutional layers to extract local patterns from data, often used in image processing but also adapted for sequential data (e.g., Caser).
- Graph Neural Networks (GNNs): Neural networks that operate on graph-structured data, allowing them to model relationships between entities (e.g., items in a sequence). SR-GNN is an example.
- Transformer Architecture: A neural network architecture primarily based on the self-attention mechanism, originally developed for natural language processing (NLP) tasks.
- Self-Attention Mechanism: A core component of Transformers that allows each element in a sequence to weigh the importance of all other elements in the same sequence. It calculates attention weights based on the similarity between queries and keys, and then uses these weights to combine values (a minimal sketch of this computation, together with the SiLU and SwiGLU activations below, follows this list).
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot product between queries and keys, representing similarity.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients, where $d_k$ is the dimension of the key vectors.
  - $\mathrm{softmax}$ normalizes the attention scores into probabilities.
  - The result is a weighted sum of the value vectors, where the weights are determined by the attention scores.
- Multi-Head Self-Attention: An extension of self-attention where the attention mechanism is run multiple times in parallel with different learned linear projections of the queries, keys, and values. The outputs from multiple "heads" are then concatenated and linearly transformed. This allows the model to attend to different parts of the sequence from different representation subspaces.
- Positional Encoding: In Transformers, since self-attention does not inherently understand the order of elements in a sequence, positional encodings are added to the input embeddings to inject information about the relative or absolute position of each element.
- Layer Normalization: A technique used in neural networks to normalize the activations of a layer across the features, which helps stabilize training and speed up convergence. Root Mean Square (RMS) Layer Normalization is a variant.
- Activation Functions: Non-linear functions applied to the output of neurons to introduce non-linearity into the network, allowing it to learn complex patterns.
  - SiLU (Sigmoid Linear Unit): $\mathrm{SiLU}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. It is a smooth, non-monotonic activation function often used in deep learning.
  - SwiGLU (Swish-Gated Linear Unit): An activation function used in some Transformer architectures, defined as $\mathrm{SwiGLU}(\mathbf{x}) = \mathrm{SiLU}(\mathbf{x}\mathbf{W}_1) \otimes (\mathbf{x}\mathbf{W}_2)$. It combines the SiLU function with a gating mechanism and element-wise multiplication ($\otimes$).
- Scaling Laws: Empirical observations, predominantly in LLMs, indicating that the performance of a model scales predictably (e.g., logarithmically) with increases in model parameters, dataset size, and computational budget. The goal is to find models that adhere to these laws, as they promise continuous performance improvements with more resources.
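To make these building blocks concrete, here is a minimal, self-contained PyTorch sketch of scaled dot-product self-attention and the SiLU/SwiGLU activations described above. The function names, tensor shapes, and random weights are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n, n) query-key similarity
    weights = F.softmax(scores, dim=-1)             # normalize each row into probabilities
    return weights @ V                              # weighted sum of the value vectors

def swiglu(x, W1, W2):
    # SwiGLU(x) = SiLU(x W1) ⊗ (x W2), where SiLU(z) = z * sigmoid(z)
    return F.silu(x @ W1) * (x @ W2)

# toy usage on a length-5 sequence of 8-dimensional embeddings
x = torch.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)                   # self-attention: Q = K = V = x
gated = swiglu(x, torch.randn(8, 16), torch.randn(8, 16))
print(out.shape, gated.shape)                                 # torch.Size([5, 8]) torch.Size([5, 16])
```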
3.2. Previous Works
The paper extensively discusses several categories of previous works:
-
Traditional Recommendation Models:
- Pooling Operations: Early methods for managing interaction sequences that simply aggregate item features (e.g., YouTube recommendations [8]). These have limited expressive capabilities as they lose sequential information.
- CNN-based Models (e.g., Caser [52]): Use convolutional filters to capture local patterns within a fixed window of the sequence. They are limited in capturing long-range dependencies.
- RNN-based Models (e.g., GRU4Rec [20]): Employ recurrent neural networks to process sequences, directly interacting with the previous timestep's hidden state. While capturing sequentiality, they can struggle with very long sequences and complex interactions.
- GNN-based Models (e.g., SR-GNN [61]): Model item interactions as a graph, where interactions are limited to directly connected items, restricting their scope.
- Markov Chain Models: Early approaches that assume future interactions depend only on the immediate past.
-
Transformer-based Sequential Recommendation Models: These models leverage the self-attention mechanism to capture complex dependencies across entire sequences.
- SASRec [23]: A pioneering work that applies the self-attention mechanism to sequential recommendation, allowing each item in a user's history to attend to all previous items. This marked a significant improvement over RNNs and CNNs.
- BERT4Rec [51]: Adapts the bidirectional Transformer encoder from NLP to recommendation, predicting masked items in a sequence.
- TiSASRec [29]: Improves SASRec by incorporating time intervals and relative position information, typically by adding them to the input embeddings or modifying attention weights.
- LLaMa [10] (in the context of recommendation): While primarily an LLM, its architecture, especially its use of RMSNorm and SwiGLU, has influenced recommendation models. In this paper's context, LLaMa serves as a baseline representing a scaled-up Transformer-based model.
- HSTU [72]: An advanced Transformer-based model that explicitly utilizes positional and temporal information alongside element-wise multiplication to model explicit interactions between items. It attempts to scale sequential recommendation models according to scaling laws and is a strong baseline.
  - Differentiation from FuXi-α: HSTU integrates relative temporal and positional data by adding these features directly to attention weights, which FuXi-α argues can dilute their impact. Furthermore, HSTU notably lacks an FFN layer, relying heavily on self-attention and explicit feature interactions, which FuXi-α identifies as a limitation for capturing implicit relationships.
-
Deep Learning Recommendation Models (DLRMs): Traditional DLRMs focus on feature interactions in a broader sense, not just sequential data.
- DCN [58], xDeepFM [33], DeepFM [13], PNN [40]: These models use various techniques (dot product, bilinear functions, attention) for explicit feature interactions and often incorporate DNNs for implicit interactions. The paper notes that DLRMs have not shown significant performance improvements with increased model size in some studies [2, 15].
3.3. Technological Evolution
The evolution of recommendation technologies can be seen as a progression from simpler statistical models to complex deep learning architectures:
- Early Stages (Pre-Deep Learning): Markov chains, matrix factorization (MF) (e.g., BPRMF [44]), and simple pooling operations (e.g., YouTube recommendations [8]) formed the bedrock. These methods were limited in capturing complex sequential patterns or higher-order feature interactions.
- RNN/CNN Era: The advent of deep learning brought RNNs (GRU4Rec [20]) and CNNs (Caser [52]) to sequential recommendation, allowing for better capture of temporal dependencies and local patterns. Memory network-based methods also emerged to handle long-term preferences.
- Transformer Era: Inspired by NLP, the Transformer architecture, particularly the self-attention mechanism, revolutionized sequential recommendation with models like SASRec [23] and BERT4Rec [51]. These models can capture long-range dependencies and complex item-item interactions more effectively. Further refinements like TiSASRec [29] started incorporating temporal and positional information more explicitly.
- Scaling Law Exploration: Recent work, notably HSTU [72], began investigating the applicability of scaling laws from LLMs to sequential recommendation, pushing towards very large models. This era also sees the influence of LLM architectures (like LLaMa's RMSNorm and SwiGLU) directly impacting recommendation model design.

FuXi-α positions itself at the forefront of this evolution by building upon the Transformer architecture, explicitly addressing the limitations of how temporal and positional information are handled in existing models (like TiSASRec and HSTU), and enhancing the implicit feature interaction capabilities often underplayed in purely attention-focused sequential models. It aims to develop a model that not only benefits from scaling but also leverages a more sophisticated handling of feature interactions for superior performance.
3.4. Differentiation Analysis
Compared to the main methods in related work, FuXi-α introduces core differences and innovations:
- Handling of Temporal and Positional Information:
  - Conventional Transformer-based models (e.g., SASRec, BERT4Rec, LLaMa): These models typically integrate positional encodings by simply adding them to input embeddings. Some, like TiSASRec, incorporate time intervals and relative position information by modifying query/key matrices or attention weights.
  - HSTU: HSTU improves on this by utilizing positional and temporal information alongside element-wise multiplication for explicit interactions, but integrates them by adjusting attention weights. The paper argues this can "dilute their impact" due to simple addition.
  - FuXi-α Innovation: FuXi-α introduces an Adaptive Multi-channel Self-attention (AMS) layer that distinctly models temporal, positional, and semantic features in separate channels. This disentanglement allows each channel to specialize in capturing different aspects of sequence data, providing a more expressive representation and avoiding dilution.
- Implicit Feature Interactions:
  - DLRMs (e.g., DCN, xDeepFM): These models heavily rely on DNNs to capture implicit, high-order feature interactions, often combined with explicit interaction components.
  - Transformer-based Sequential Models (e.g., SASRec, LLaMa): These models typically use standard FFNs (Multi-Layer Perceptrons) after self-attention layers to process information, which can capture some implicit interactions but might not be optimized for this specific purpose in a multi-channel context.
  - HSTU: A key differentiator is that HSTU lacks an FFN layer entirely, relying solely on self-attention and explicit feature interactions. This, according to the paper, limits its ability to capture complex item relationships and nuances from implicit learning processes.
  - FuXi-α Innovation: FuXi-α introduces a Multi-stage Feedforward Network (MFFN). The MFFN combines the multi-channel outputs from the AMS layer in its first stage and then performs enhanced implicit feature interactions in its second stage (using SwiGLU and residual connections). This design explicitly addresses the underemphasis of implicit interactions in models like HSTU and aims to boost the model's overall expressiveness.
- Adherence to Scaling Laws: While HSTU also aims to adhere to scaling laws, FuXi-α's design, particularly its enhanced feature interaction mechanisms, is posited to lead to better scaling behavior and continuous performance improvement with increasing model size. The paper's empirical results support this claim, showing that FuXi-α's performance continuously improves with more layers.

In summary, FuXi-α differentiates itself by offering a more granular and sophisticated approach to handling both explicit (through AMS) and implicit (through MFFN) feature interactions within a Transformer-based architecture, which are crucial for sequential recommendation and effective scaling.
4. Methodology
The FuXi-α model is designed to address the limitations of existing sequential recommendation models by enhancing feature interactions, particularly temporal, positional, and semantic information, and improving implicit interaction modeling. The overall architecture is depicted in Figure 2 and consists of stacked FuXi Blocks.

Figure 2: The overall architecture of the proposed FuXi-α, comprising multiple stacked FuXi Blocks, each containing an Adaptive Multi-channel Self-attention layer and a Multi-stage FFN to enhance implicit feature interactions and model temporal, positional, and semantic features.
4.1. Embedding Layer
The first step in FuXi-α is to transform user interaction sequences into numerical representations suitable for the model.
- Sequence Preparation: Each user's interaction sequence, $S_u$, is converted into a fixed-length sequence of length $n$. This involves either truncating longer sequences or padding shorter ones with a special "padding item" to reach the length $n$.
- Item Embeddings: Each item $i \in \mathcal{I}$ (where $\mathcal{I}$ is the set of all items) is mapped to a $d$-dimensional vector. This is done using a learnable embedding matrix $\mathbf{E}$, where $d$ is the latent vector dimensionality. So, for an item $i$, its embedding is $\mathbf{e}_i$.
- Positional Encodings: To incorporate the order of items in the sequence, learnable positional encodings are used. For the $k$-th position in the sequence, there is a positional embedding $\pmb{P}_k$.
- Initial Input Representation: The input for the first FuXi Block is constructed by summing the item embeddings and their corresponding positional encodings (a minimal PyTorch sketch of this layer follows this list). For a user $u$ with a sequence of length $n_u$, the initial input is given by:
$
\mathbf{x}^0 = [\mathbf{e}_1^{(u)} + \pmb{P}_1, ..., \mathbf{e}_{n_u}^{(u)} + \pmb{P}_{n_u}, \mathbf{0}, ..., \mathbf{0}]
$
Where:
  - $\mathbf{e}_k^{(u)}$ is the $d$-dimensional embedding for the $k$-th item in user $u$'s sequence.
  - $\pmb{P}_k$ is the positional embedding for the $k$-th position.
  - The zero vectors ($\mathbf{0}$) denote padding items for positions beyond $n_u$ up to $n$.
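As a concrete illustration of the embedding layer described above, here is a minimal PyTorch sketch. The class name, padding id, and dimensions are illustrative assumptions; padded positions are zeroed out, mirroring the zero vectors in the formula.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Maps a truncated/right-padded interaction sequence to x^0 = item embedding + positional encoding."""
    def __init__(self, num_items: int, max_len: int, dim: int, pad_id: int = 0):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=pad_id)  # id 0 reserved for padding
        self.pos_emb = nn.Embedding(max_len, dim)  # learnable positional encodings P_k
        self.pad_id = pad_id

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, n) item ids, already truncated or right-padded with pad_id
        positions = torch.arange(seq.size(1), device=seq.device)
        x0 = self.item_emb(seq) + self.pos_emb(positions)        # (batch, n, dim)
        return x0 * (seq != self.pad_id).unsqueeze(-1)           # zero vectors at padded positions

# usage: a batch of two sequences of length n = 6, padded with 0
layer = EmbeddingLayer(num_items=1000, max_len=6, dim=32)
x0 = layer(torch.tensor([[5, 9, 2, 0, 0, 0], [7, 7, 3, 4, 1, 0]]))
print(x0.shape)  # torch.Size([2, 6, 32])
```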
4.2. FuXi Block
The FuXi Block is the core computational unit of the model, similar in structure to a Transformer decoder block. It consists of two main components: an Adaptive Multi-channel Self-attention (AMS) layer and a Multi-stage Feed-Forward Network (MFFN). The model stacks $b$ such FuXi Blocks.
Let $\mathbf{x}^{l-1}$ be the input to the $l$-th layer and $\mathbf{x}^l$ be its output. The initial input for the first layer is $\mathbf{x}^0$.
4.2.1. Adaptive Multi-channel Self-attention (AMS)
The AMS layer is designed to capture user interest patterns by separating the processing of hidden states (semantics), positional information, and temporal signals into distinct attention channels. This allows each channel to specialize.
As depicted in Figure 3, there are three types of channels: semantic, temporal, and positional.
Figure 3: Illustration of Adaptive Multi-channel Selfattention (AMS). In contrast to the conventional multi-head self-attention, AMS decouples the modeling of temporal and positional information from semantics information.
The computation begins by applying Root Mean Square (RMS) Layer Normalization to the input of the current layer, $\mathbf{x}^{l-1}$, to stabilize training.
$
\tilde{\mathbf{x}}^l = \mathrm{RMSN}(\mathbf{x}^{l-1})
$
Where $\mathrm{RMSN}(\cdot)$ denotes the Root Mean Square (RMS) Layer Normalization operation [75].
Next, the query, key, and value matrices for the semantic channel are computed using linear transformations followed by a non-linear activation function (SiLU [11]).
$
\mathbf{q}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_q^l), \quad \mathbf{k}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_k^l), \quad \mathbf{v}^l = \phi(\tilde{\mathbf{x}}^l \mathbf{W}_v^l)
$
Where:
  - $\mathbf{W}_q^l$, $\mathbf{W}_k^l$, $\mathbf{W}_v^l$ are learnable weight matrices for the query, key, and value transformations, respectively, in the $l$-th layer.
  - $d_h$ represents the size of each attention head.
  - $\phi$ is the SiLU activation function.

The attention weights for each of the three channels are calculated separately:
- Semantic Channel Attention Weights ($\mathbf{a}_h^l$): These are computed similarly to standard self-attention, using the query and key matrices, followed by a non-linear activation function.
$
\mathbf{a}_h^l = \frac{1}{n} \phi(\mathbf{q}^l (\mathbf{k}^l)^T)
$
Where:
  - $\phi$ is again the SiLU activation function. The scaling factor $\frac{1}{n}$ is applied, where $n$ is the sequence length.
  - The use of SiLU instead of softmax for attention weights in sequential recommendation has been shown to be effective [72].
- Temporal Channel Attention Weights ($\mathbf{a}_t^l$): These attention weights depend only on the difference in relative timestamps between items.
$
(\mathbf{a}_t^l)_{i,j} = \alpha(t_j - t_i)
$
Where:
  - $t_j - t_i$ is the time difference between item $j$ and item $i$.
  - $\alpha(\cdot)$ represents a mapping function that converts the time difference into buckets, with each bucket associated with a learnable parameter [42]. This means the attention weight for a temporal difference is looked up from a set of learned parameters corresponding to time-difference buckets.
- Positional Channel Attention Weights ($\mathbf{a}_p^l$): These attention weights depend only on the relative positions between items.
$
(\mathbf{a}_p^l)_{i,j} = \beta_{j-i}
$
Where:
  - $j-i$ is the relative positional difference between item $j$ and item $i$.
  - $\pmb{\beta}$ is a vector of learnable parameters, where $\beta_{j-i}$ is a learnable parameter specifically for the relative position $j-i$.

Crucially, for the temporal and positional channels, there is no further need to calculate separate query and key matrices. Instead, they share the value matrices with the semantic channel: while the formula for $\mathbf{v}^l$ is shown once, it serves all channels. For simplicity and clarity, the detailed formulation here assumes a single head per channel, but the approach can be extended to multiple heads, similar to multi-head self-attention [54].
After computing the attention weights for each channel, the outputs of these channels are concatenated. The final output of the AMS layer, denoted $\mathbf{h}^l$, is obtained by:
$
\mathbf{h}^l = \mathrm{RMSN}(\mathrm{concat}(\mathbf{a}_h^l \mathbf{v}_h^l, \mathbf{a}_p^l \mathbf{v}_p^l, \mathbf{a}_t^l \mathbf{v}_t^l)) \otimes \phi(\mathbf{x}^l \mathbf{W}_u^l)
$
Where:
- $\mathbf{v}_h^l$, $\mathbf{v}_p^l$, $\mathbf{v}_t^l$ are the value matrices for the semantic, positional, and temporal channels, respectively. As stated, for simplicity these value matrices are shared across channels, so in practice they are all derived from $\mathbf{v}^l$.
- $\mathrm{concat}(\cdot)$ denotes concatenation of the outputs from the different channels. The concatenated output has dimension $3 d_h$ if each channel contributes $d_h$.
- $\mathrm{RMSN}$ is again the Root Mean Square Layer Normalization.
- $\mathbf{W}_u^l$ is a learnable parameter matrix.
- $\phi$ is the SiLU activation function.
- $\otimes$ denotes element-wise multiplication (Hadamard product). The term $\phi(\mathbf{x}^l \mathbf{W}_u^l)$ (which the paper, following HSTU [72], refers to as the matrix $\mathbf{u}^l$) introduces explicit 2nd-order interactions. Note that the formula writes $\mathbf{x}^l$ as the input to $\mathbf{W}_u^l$; based on the standard Transformer structure and the RMSN applied to the input used for $\mathbf{q}$, $\mathbf{k}$, $\mathbf{v}$, this is more likely $\mathbf{x}^{l-1}$ or $\tilde{\mathbf{x}}^l$. However, adhering strictly to the paper's text, it is written as $\mathbf{x}^l$.
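The following is a minimal, single-head PyTorch sketch of the AMS computation described above. It is an illustration rather than the authors' implementation: the class name, the log-spaced time-difference bucketing, and the omission of causal masking and multi-head splitting are all assumptions, and `nn.RMSNorm` requires a recent PyTorch release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMS(nn.Module):
    """Single-head sketch of Adaptive Multi-channel Self-attention (semantic, positional, temporal channels)."""
    def __init__(self, dim: int, head_dim: int, max_len: int, n_buckets: int):
        super().__init__()
        self.rmsn = nn.RMSNorm(dim)
        self.Wq = nn.Linear(dim, head_dim, bias=False)
        self.Wk = nn.Linear(dim, head_dim, bias=False)
        self.Wv = nn.Linear(dim, head_dim, bias=False)          # value shared by all three channels
        self.Wu = nn.Linear(dim, 3 * head_dim, bias=False)      # gate u for the Hadamard product
        self.beta = nn.Parameter(torch.zeros(2 * max_len - 1))  # one weight per relative position j - i
        self.alpha = nn.Parameter(torch.zeros(n_buckets))       # one weight per time-difference bucket
        self.out_norm = nn.RMSNorm(3 * head_dim)
        self.max_len, self.n_buckets = max_len, n_buckets

    def forward(self, x, timestamps):
        # x: (batch, n, dim); timestamps: (batch, n)
        n = x.size(1)
        x_tilde = self.rmsn(x)
        q, k, v = F.silu(self.Wq(x_tilde)), F.silu(self.Wk(x_tilde)), F.silu(self.Wv(x_tilde))
        a_h = F.silu(q @ k.transpose(-2, -1)) / n                          # semantic channel weights
        rel = torch.arange(n, device=x.device)
        a_p = self.beta[rel[None, :] - rel[:, None] + self.max_len - 1]    # positional channel weights
        dt = (timestamps[:, None, :] - timestamps[:, :, None]).abs().float()
        buckets = torch.log1p(dt).long().clamp(max=self.n_buckets - 1)     # assumed bucketing scheme
        a_t = self.alpha[buckets]                                          # temporal channel weights
        concat = torch.cat([a_h @ v, a_p @ v, a_t @ v], dim=-1)            # channels share the value matrix
        return self.out_norm(concat) * F.silu(self.Wu(x))                  # h^l = RMSN(concat) ⊗ SiLU(x W_u)
```

The concatenated output has dimension 3·head_dim; the MFFN's first stage (next subsection) projects it back to the model dimension.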
4.2.2. Multi-stage Feed-Forward Network (MFFN)
The MFFN processes the output of the AMS layer and consists of two stages (Figure 4).
Figure 4: Diagram of MFFN: Stage 1 fuses outputs from different channels; Stage 2 facilitates implicit feature interactions.
- Stage 1: Channel Fusion: The first stage receives the concatenated output $\mathbf{h}^l$ from the AMS layer and applies a linear projection, then combines it with the original input of the current layer (for a residual connection).
$
\mathbf{o}^l = \mathbf{h}^l \mathbf{W}_o^l + \mathbf{x}^{l-1}
$
Where:
  - $\mathbf{W}_o^l$ is a learnable projection matrix. Its purpose is to project the concatenated channel outputs (with dimension $3 d_h$) back to the model's latent dimension $d$.
  - $\mathbf{x}^{l-1}$ is the input to the current FuXi block, serving as a residual connection.
- Stage 2: Implicit Feature Interactions: The second stage performs enhanced implicit feature interactions. It first applies RMS Layer Normalization to the output of the first stage, $\mathbf{o}^l$, and then passes it through a SwiGLU activation function [46], followed by another residual connection.
$
\mathbf{x}^l = \mathrm{FFN}_l(\mathrm{RMSN}(\mathbf{o}^l)) + \mathbf{o}^l
$
Where:
  - $\mathbf{x}^l$ is the final output of the $l$-th FuXi Block.
  - $\mathrm{RMSN}(\mathbf{o}^l)$ is the RMS Layer Normalization applied to the output of the first stage.
  - $\mathrm{FFN}_l$ denotes the SwiGLU-based feed-forward network, defined as:
$
\mathrm{FFN}_l(\mathbf{x}) = \mathrm{SwiGLU}(\mathbf{x}) \mathbf{W}_3^l = (\phi(\mathbf{x} \mathbf{W}_1^l) \otimes (\mathbf{x} \mathbf{W}_2^l)) \mathbf{W}_3^l
$
Where:
  - $\phi$ represents the SiLU activation function.
  - $\otimes$ denotes element-wise multiplication.
  - $\mathbf{W}_1^l$, $\mathbf{W}_2^l$, $\mathbf{W}_3^l$ are learnable parameter matrices for the FFN in the $l$-th layer.
  - $d_f$ (the output dimension of $\mathbf{W}_1^l$ and $\mathbf{W}_2^l$) is the hidden dimension of the FFN. This structure (similar to LLaMa [53]) allows for complex non-linear transformations and implicit feature interactions.
4.3. Prediction Layer & Optimization Objective
After passing through $b$ layers of FuXi Blocks, the final output $\mathbf{x}^b$ contains rich information about the sequence.
- Prediction: To predict the next item, the final latent representation of each position is multiplied with the transpose of the item embedding matrix $\mathbf{E}$. This projects the latent representations back into the item embedding space. A softmax function is then applied to obtain a probability distribution over all possible items. (The paper does not provide an explicit formula for the prediction layer, but describes it as "multiplication with the transpose of the input embedding matrix, followed by a softmax function". The general formulation is often $P(\text{next item} = j \mid S_u) \propto \exp(\mathbf{x}^b_k \cdot \mathbf{e}_j)$, where $\mathbf{x}^b_k$ is the representation at the position where the prediction is made and $\mathbf{e}_j$ is the embedding of item $j$. A linear projection might be applied before the dot product with the item embeddings. For strict adherence, the textual description is used.)
- Optimization Objective (Loss Function): To accelerate training, the model uses a sampled softmax loss with randomly sampled negative samples [25]. This is a common technique in large-vocabulary prediction tasks, where calculating softmax over all items is computationally expensive. Instead, the loss is computed over the true next item and a subset of negative (uninteracted) items.
$
\mathcal{L} = -\sum_{k=1}^{n} \log \left( \frac{\exp(\mathbf{s}_k \cdot \mathbf{e}_{target_k})}{\sum_{j \in \{target_k\} \cup \mathrm{NegativeSamples}} \exp(\mathbf{s}_k \cdot \mathbf{e}_j)} \right)
$
Where:
  - $\mathbf{s}_k$ is the output latent representation from the FuXi block for the $k$-th position (after $b$ layers).
  - $\mathbf{e}_{target_k}$ is the embedding of the true next item following the item at position $k-1$.
  - $\mathrm{NegativeSamples}$ are randomly chosen items that were not interacted with.
  - The sum is taken over all positions in the sequence for which a prediction is made during training.
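To illustrate the training objective, here is a simplified sampled softmax loss in PyTorch; it assumes the negatives are shared across the batch and omits the masking of padded positions, so it is a sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(s, target_emb, neg_emb):
    """Sampled softmax over the true next item plus a handful of random negatives.

    s:          (batch, n, d)  final latent states from the stacked FuXi blocks
    target_emb: (batch, n, d)  embeddings of the true next items
    neg_emb:    (num_neg, d)   embeddings of randomly sampled negative items
    """
    pos_logit = (s * target_emb).sum(-1, keepdim=True)    # (batch, n, 1)
    neg_logits = s @ neg_emb.T                            # (batch, n, num_neg)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    # the positive item sits at index 0 of every logit row
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```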
4.4. Space and Time Complexity
4.4.1. Space Complexity
The space complexity is determined by the number of parameters in the model.
- FuXi Block: Each FuXi block contains an AMS layer and an MFFN.
  - AMS layer: Four projection matrices ($\mathbf{W}_q^l$, $\mathbf{W}_k^l$, $\mathbf{W}_v^l$ for the semantic channel, and $\mathbf{W}_u^l$ for the 2nd-order interaction) with a total of $O(d \cdot d_h)$ parameters. Positional and temporal embeddings contribute $O(n + n_B)$ parameters, where $n$ is the sequence length and $n_B$ is the number of buckets for temporal differences.
  - MFFN: Four projection matrices (one for fusion and three for SwiGLU) with $O(d \cdot d_h + d \cdot d_f)$ parameters.
- Item Embeddings: The total number of item embedding parameters is $O(|\mathcal{I}| d)$.

Assuming $d_h = O(d)$, $d_f = O(d)$, and $n_B = O(n)$ (i.e., hidden dimensions and bucket counts are proportional to the latent dimension and sequence length, respectively), and stacking $b$ FuXi layers, the total space complexity is given by:
$
O(b(d^2 + n) + |\mathcal{I}|d)
$
Where:
- $b$ is the number of FuXi blocks.
- $d$ is the latent dimension.
- $n$ is the sequence length.
- $|\mathcal{I}|$ is the total number of items.
4.4.2. Time Complexity
The time complexity is determined by the computational cost of operations within each FuXi block and the prediction layer.
- AMS Layer:
  - Computing attention weights for the semantic channel: $O(n^2 d_h)$ (due to the matrix multiplication $\mathbf{q}^l (\mathbf{k}^l)^T$, where $\mathbf{q}^l$ and $\mathbf{k}^l$ are $n \times d_h$).
  - Computing attention weights for the other channels (temporal/positional): $O(n^2)$ (as these involve lookups or fixed calculations for the attention matrices).
  - Calculating the QKV matrices: $O(n d d_h)$ (linear projections for queries, keys, and values).
- MFFN:
  - Requires $O(n d^2)$ operations for its linear transformations and SwiGLU calculations.
- Prediction Layer:
  - The cost for generating predictions over all items is $O(n |\mathcal{I}| d)$ (multiplying final hidden states with item embeddings); with sampled softmax this cost is reduced during training.

Thus, the overall time complexity for training is:
$
O(b n^2 d + n (b d^2 + |\mathcal{I}| d))
$
Where:
- $b$ is the number of FuXi blocks.
- $n$ is the sequence length.
- $d$ is the latent dimension.
- $|\mathcal{I}|$ is the total number of items.

The term $b n^2 d$ comes from the self-attention mechanism, $b n d^2$ from the FFNs, and $n |\mathcal{I}| d$ from the prediction layer (which is simplified in practice by negative sampling).
4.5. Polynomial Approximation of Feature Interactions
The paper analyzes the explicit inter-item interactions implemented by FuXi-α by simplifying the model. It considers the $l$-th layer of the FuXi Block, treating attention weights as constants, and omitting the second stage of the MFFN, activation functions, and most projection transformations.
In this simplified context, the output of an item after one block can be approximated as:
$
f_{block}^{(l)}(x_i; x_1, ..., x_n) = x_i \circ \left( \sum_{j=1}^{n} a_{i,j}^{(l)} x_j \right) + x_i
$
Where:
- $x_1, ..., x_n$ are the latent representations (embeddings) of items input to the $l$-th layer.
- $x_i$ is the latent representation of the $i$-th item.
- $\circ$ denotes an interaction operator, such as element-wise multiplication.
- $a_{i,j}^{(l)}$ are the attention weights in the $l$-th layer, treated as constants.

Let $x_{b,i}$ denote the output latent representation of the $i$-th item after the $b$-th layer. The paper shows, using mathematical induction, that $x_{b,i}$ is a polynomial of degree up to $2^b$ in the initial item embeddings $x_{0,1}, ..., x_{0,n}$. Specifically, $x_{b,i}$ can be expressed in the form $x_{0,i} \circ F_{2^b-1}$, where $F_{2^b-1}$ is a polynomial of degree $2^b - 1$.

Base Case (b=0):
For the input layer ($b = 0$), $x_{0,i}$ is simply the initial embedding of item $i$, a polynomial of degree 1. This fits the pattern $x_{0,i} = x_{0,i} \circ F_0$ if $F_0$ is interpreted as a constant factor of degree 0.

Inductive Step (b=l+1): Assume the property holds for some integer $l \geq 0$, meaning $x_{l,i}$ incorporates interactions up to degree $2^l$. Now consider the $(l+1)$-th layer:
$
x_{l+1,i} = x_{l,i} \circ \sum_{j=1}^{n} a_{i,j}^{(l+1)} x_{l,j} + x_{l,i}
$
Substituting $x_{l,i} = x_{0,i} \circ F_{2^l-1}$ (where $F_{2^l-1}$ represents a polynomial of degree $2^l - 1$ in the initial item embeddings and constant factors):
$
x_{l+1,i} = x_{0,i} \circ F_{2^l-1} \circ \left( \sum_{j=1}^{n} a_{i,j}^{(l+1)} x_{0,j} \circ F_{2^l-1} + 1 \right)
$
Each term in the summation is a polynomial of degree $2^l$, so the summation itself is a polynomial of degree $2^l$, and adding 1 does not change its highest degree. Since $\circ$ is element-wise multiplication, the product of a polynomial of degree $2^l$ and a polynomial of degree $2^l$ yields a polynomial of degree $2^{l+1}$, and the bracketed factor combined with $F_{2^l-1}$ has degree $2^{l+1} - 1$, so $x_{l+1,i} = x_{0,i} \circ F_{2^{l+1}-1}$. This means that after $b$ layers, $x_{b,i}$ incorporates feature interactions among the initial item embeddings up to a degree of $2^b$. This analysis highlights the model's capacity to capture high-order feature interactions, which is crucial for its expressive power.
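The degree-doubling argument can be summarized with a small bookkeeping function; this is only an illustration of the induction above, under the stated simplifications (constant attention weights, element-wise product).

```python
# Degree bookkeeping for the simplified block x_{l+1,i} = x_{l,i} ∘ (Σ_j a_ij x_{l,j}) + x_{l,i}:
# if deg(x_l) = D, the product term has degree D + D, so deg(x_{l+1}) = 2D and deg(x_b) = 2^b.
def max_interaction_degree(num_layers: int) -> int:
    degree = 1                      # raw item embeddings are degree-1 terms
    for _ in range(num_layers):
        degree += degree            # element-wise product adds the degrees of its two operands
    return degree

print([max_interaction_degree(b) for b in range(5)])  # [1, 2, 4, 8, 16] -> degree 2^b after b layers
```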
4.6. Analysis of AMS
The paper contrasts the AMS layer's handling of positional information with that of the T5 architecture [42], which uses relative positional embeddings by adding a bias term to the attention weights.
In T5, the attention weights are computed as:
$
\mathbf{A} = \phi\left( (\mathbf{x} \mathbf{W}_q)(\mathbf{x} \mathbf{W}_k)^T + \mathbf{B} \right)
$
Where:
- $\phi$ is a non-linear function (e.g., softmax or SiLU).
- $\mathbf{W}_q$ and $\mathbf{W}_k$ generate queries and keys.
- $\mathbf{B}$ is the matrix of relative positional bias terms $b_{i,j}$.

The output for the $i$-th item in multi-head self-attention, with this positional bias, can be approximately decomposed into two terms:
$
o_i \approx W_o \left( \left( \sum_j \phi_1(q_i k_j^T) V_j \right) \otimes u_i \right) + W_o \left( \left( \sum_j \phi_2(b_{i,j}) V_j \right) \otimes u_i \right)
$
Where:
- $q_i$, $k_j$, $V_j$ are query, key, and value vectors.
- $u_i$ is a vector used for the Hadamard product (element-wise multiplication).
- $\phi_1$, $\phi_2$ represent the non-linear function.
- $W_o$ is an output projection matrix.

In contrast, the AMS layer calculates its output as:
$
o_i = W_{o1} \left( \left( \sum_j \phi(q_i k_j^T) V_j \right) \otimes u_i^{(1)} \right) + W_{o2} \left( \left( \sum_j b_{i,j} V_j \right) \otimes u_i^{(2)} \right)
$
(Note: The paper's formula for $o_i$ in the AMS analysis only explicitly shows two terms, one for semantic attention and one for the positional bias $b_{i,j}$. Given that AMS actually has three channels (semantic, temporal, positional) and the AMS output combines three terms, this simplified $o_i$ may be abstracting or combining the temporal and positional terms, or comparing only one aspect; for strict adherence it is represented as written. The $W_{o1}$ and $W_{o2}$ here correspond to parts of $\mathbf{W}_o^l$ from the MFFN's first stage, and $u_i^{(1)}$, $u_i^{(2)}$ correspond to components of $\phi(\mathbf{x}^l \mathbf{W}_u^l)$ for the semantic and positional channels, respectively.)

The key difference highlighted is that in AMS, $W_{o1}$ and $W_{o2}$ are distinct parameters (or parts of $\mathbf{W}_o^l$), and $u_i^{(1)}$ and $u_i^{(2)}$ are also distinct components. This allows the model to learn separate projections and interactions for the semantic and positional (and temporal) information, rather than simply adding a bias term to a single attention weight calculation. This explicit separation into distinct processing channels enables AMS to learn a more expressive representation of positional and temporal information.
4.7. Relationship with Existing Models
FuXi-α builds upon and differentiates itself from existing models:
4.7.1. SASRec and LLaMa
- SASRec [23]: A foundational Transformer-based sequential recommender.
- LLaMa [10]: A state-of-the-art Large Language Model architecture, used as a powerful Transformer baseline in recommendation.
- Differences from FuXi-α:
  - Self-Attention Layer: SASRec and LLaMa use standard multi-head self-attention. FuXi-α replaces this with the Adaptive Multi-channel Self-attention (AMS) layer to independently model temporal, positional, and semantic information, leading to better feature utilization.
  - Feed-Forward Network: Both SASRec and LLaMa use standard FFNs (MLPs). FuXi-α introduces the Multi-stage Feedforward Network (MFFN), which specifically processes the multi-channel information from AMS and enhances implicit feature interactions.
4.7.2. HSTU
- HSTU [72]: A recent advanced Transformer-based model that aims to scale sequential recommendation models and uses positional and temporal information.
- Differences from FuXi-α:
  - Integration of Temporal and Positional Data: HSTU incorporates relative temporal and positional data by adding these features directly to attention weights; FuXi-α argues this can dilute their impact. FuXi-α's AMS layer explicitly decouples temporal, positional, and semantic information into distinct channels within the self-attention mechanism, allowing for more granular and effective modeling.
  - Implicit Feature Interactions: A significant difference is that HSTU lacks an FFN layer. It relies primarily on self-attention and explicit feature interactions. FuXi-α addresses this by introducing the MFFN to facilitate robust implicit interactions, thereby capturing more complex item relationships that HSTU might miss.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on four datasets: three public datasets and one private large-scale industrial dataset.
-
MovieLens-1M:
- Source: Publicly available movie recommendation dataset.
- Characteristics: Contains user ratings and tagging activities. It is a relatively smaller dataset.
- Statistics:
- Users: 6,041
- Items: 3,706
- Interactions: 1,000,209
- Average Length of Interaction Sequence: 165.60
- Why chosen: A common benchmark for sequential recommendation, allowing comparison with a wide range of existing methods.
-
MovieLens-20M:
- Source: Publicly available movie recommendation dataset.
- Characteristics: A larger subset of the MovieLens dataset, also containing user ratings and tagging activities.
- Statistics:
- Users: 138,493
- Items: 26,744
- Interactions: 20,000,263
- Average Length of Interaction Sequence: 144.41
- Why chosen: Provides a larger-scale benchmark to evaluate model performance and scaling capabilities compared to MovieLens-1M.
-
KuaiRand:
- Source: Collected from user logs of a video-sharing app (Kuaishou).
- Characteristics: Users are highly active, with over 200 interactions on average.
- Statistics:
- Users: 25,634
- Items: 7,550
- Interactions: 6,945,823
- Average Length of Interaction Sequence: 270.96
- Why chosen: Represents a real-world social media/short video interaction scenario, characterized by high user activity and longer sequences.
-
Industrial:
-
Source: Private dataset constructed from user records of a mainstream music listening app (Huawei Music).
-
Characteristics: Very large-scale, with tens of millions of active users monthly. User behavior sequences are constructed from over a month of positive behaviors (collect, like, play, etc.).
-
Statistics:
- Users: 19,252,028
- Items: 234,488
- Interactions: 1,023,711,774
- Average Length of Interaction Sequence: 53.17
-
Why chosen: Crucial for evaluating the model's performance on a massive, real-world industrial scale, especially for assessing adherence to scaling laws and practical deployment impact.
The following are the statistics from Table 1 of the original paper:

| Dataset | Users | Items | Interactions | Avg. Len. |
|---|---|---|---|---|
| MovieLens-1M | 6,041 | 3,706 | 1,000,209 | 165.60 |
| MovieLens-20M | 138,493 | 26,744 | 20,000,263 | 144.41 |
| KuaiRand | 25,634 | 7,550 | 6,945,823 | 270.96 |
| Industrial | 19,252,028 | 234,488 | 1,023,711,774 | 53.17 |
For MovieLens-1M and MovieLens-20M, pre-processed train/validation/test sets from HSTU [72] were used. For KuaiRand and Industrial datasets, similar preprocessing was applied by the authors.
5.2. Evaluation Metrics
The paper employs standard evaluation metrics for recall performance in recommendation systems. Higher values indicate better performance for all metrics.
- Hit Ratio at K (HR@K):
  - Conceptual Definition: HR@K measures whether the ground-truth item is present in the top-K recommended items. It indicates the proportion of users for whom the relevant item was successfully recommended within the top K positions. It is a measure of recall.
  - Mathematical Formula:
$
\mathrm{HR}@K = \frac{\text{Number of users for whom the ground-truth item is in the top-K list}}{\text{Total number of users}}
$
  - Symbol Explanation:
    - Number of users for whom the ground-truth item is in the top-K list: the count of users for whom the next actual item they interacted with (ground truth) is found within the first K items of the model's recommendation list.
    - Total number of users: the total number of users in the evaluation set.
- Normalized Discounted Cumulative Gain at K (NDCG@K):
  - Conceptual Definition: NDCG@K is a measure of ranking quality. It considers not only whether the relevant item is in the top-K list (like HR@K) but also its position; relevant items ranked higher contribute more to the score. It is normalized to be between 0 and 1, where 1 means a perfect ranking.
  - Mathematical Formula:
$
\mathrm{NDCG}@K = \frac{1}{|\mathcal{U}|} \sum_{u=1}^{|\mathcal{U}|} \frac{\mathrm{DCG}@K_u}{\mathrm{IDCG}@K_u}
$
Where:
$
\mathrm{DCG}@K_u = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)}, \qquad \mathrm{IDCG}@K_u = \sum_{j=1}^{K} \frac{2^{rel_j^{ideal}} - 1}{\log_2(j+1)}
$
  - Symbol Explanation:
    - $|\mathcal{U}|$: total number of users.
    - $\mathrm{DCG}@K_u$: Discounted Cumulative Gain for user $u$ at rank $K$.
    - $\mathrm{IDCG}@K_u$: Ideal Discounted Cumulative Gain for user $u$ at rank $K$, representing the maximum possible value if all relevant items were perfectly ranked.
    - $j$: rank position.
    - $rel_j$: relevance score of the item at position $j$ in the recommended list. For implicit feedback, this is typically 1 if the item is the ground-truth item, and 0 otherwise.
    - $rel_j^{ideal}$: relevance score of the item at position $j$ in the ideal ranking (i.e., if all ground-truth items were ranked at the top).
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank 1, the reciprocal rank is 1; if at rank 2, it is 1/2, and so on. It is particularly useful for tasks where only one correct answer is expected, or where finding any relevant item quickly is important.
  - Mathematical Formula:
$
\mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u=1}^{|\mathcal{U}|} \frac{1}{\mathrm{rank}_u}
$
  - Symbol Explanation:
    - $|\mathcal{U}|$: total number of users.
    - $\mathrm{rank}_u$: the rank of the first relevant item for user $u$. If no relevant item is found, the reciprocal rank is 0.

The paper reports HR@K and NDCG@K for K = 10 and 50, along with MRR. A minimal sketch of how these metrics can be computed follows.
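For concreteness, the sketch below computes HR@K, NDCG@K, and MRR from the rank of each user's ground-truth item (a single relevant item per user, as in this next-item evaluation setup); function and variable names are illustrative.

```python
import numpy as np

def hr_ndcg_mrr(ranks, k=10):
    """Compute HR@K, NDCG@K and MRR from the 1-based rank of each user's ground-truth item."""
    ranks = np.asarray(ranks, dtype=float)
    hit = ranks <= k
    hr = hit.mean()                                              # HR@K: fraction of users with a top-K hit
    ndcg = np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean()   # single relevant item -> IDCG@K = 1
    mrr = (1.0 / ranks).mean()                                   # reciprocal rank of the ground-truth item
    return hr, ndcg, mrr

# usage: ground-truth items ranked 1st, 3rd, 12th and 60th for four users
print(hr_ndcg_mrr([1, 3, 12, 60], k=10))
```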
5.3. Baselines
For a comprehensive comparison, FuXi-α is compared against two categories of representative baselines:
-
Conventional Models:
- BPRMF [44]: Bayesian Personalized Ranking Matrix Factorization. A classical collaborative filtering method based on pairwise ranking optimization.
- GRU4Rec [20]: A Recurrent Neural Network (RNN) based model using Gated Recurrent Units (GRUs) for session-based recommendation, capturing sequential patterns.
- NARM [28]: Neural Attentive Recommendation Machine. An RNN-based model that incorporates an attention mechanism to capture users' main purpose and sequential behavior.
-
Autoregressive Generative Models (Transformer-based):
-
SASRec [23]: Self-Attentive Sequential Recommendation. A pioneering model that applies the Transformer's self-attention mechanism to sequential recommendation, allowing it to capture long-range dependencies.
-
LLaMa [10]: A Large Language Model architecture used as a strong Transformer-based baseline, often adapted for recommendation tasks due to its powerful sequential modeling capabilities.
-
- HSTU [72]: Hierarchical Sequential Transduction Unit. An advanced Transformer-based model that incorporates positional and temporal information with element-wise multiplication for explicit interactions, specifically designed to scale with model size.

The paper also includes "Large" versions of SASRec, LLaMa, and HSTU (e.g., SASRec-Large), which are scaled up to 8 layers to analyze scaling effects, compared to the default 2 layers used for the basic capacity comparison.
5.4. Parameter Settings
- Implementation: FuXi-α is implemented using PyTorch [38].
- Large-scale Training: Multi-machine and multi-card parallelism are used with the Accelerate library [26] to enable training of large models.
- Fair Comparison: For MovieLens-1M and MovieLens-20M, model parameters (except the number of layers) are kept consistent with HSTU [72] to ensure a fair comparison.
- KuaiRand-specific Settings:
  - Hidden dimension: 50
  - Number of negative samples: 128 (default)
- Common Parameters: The optimizer, learning rate, and weight decay are consistent with HSTU [72] across all datasets.
- Embedding and Self-attention Dimensions: Embedding dimensions and self-attention hidden vector dimensions are identical across all three public datasets.
- Number of Layers:
  - Basic Modeling Capacity: set to 2 layers for the initial comparisons.
  - Scaling Analysis: extended to 8 layers (denoted "-Large" in the tables) for the generative models to observe scaling effects.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate FuXi-α's superior performance across various datasets and its adherence to scaling laws.
6.1.1. Public Dataset Performance (RQ1)
The following are the results from Table 2 of the original paper:
| Dataset | MovieLens-1M | MovieLens-20M | KuaiRand | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| BPRMF | 0.0607 | 0.1027 | 0.1185 | 0.3127 | 0.0556 | 0.0629 | 0.1074 | 0.1241 | 0.3300 | 0.0572 | 0.0248 | 0.0468 | 0.0520 | 0.1560 | 0.0235 |
| GRU4Rec | 0.1015 | 0.1460 | 0.1816 | 0.3864 | 0.0895 | 0.0768 | 0.1155 | 0.1394 | 0.3177 | 0.0689 | 0.0289 | 0.0531 | 0.0597 | 0.1726 | 0.0275 |
| NARM | 0.1350 | 0.1894 | 0.2445 | 0.4915 | 0.1165 | 0.1037 | 0.1552 | 0.1926 | 0.4281 | 0.0910 | 0.0411 | 0.0747 | 0.0836 | 0.2399 | 0.0387 |
| SASRec | 0.1594 | 0.2187 | 0.2824 | 0.5500 | 0.1375 | 0.1553 | 0.2119 | 0.2781 | 0.5353 | 0.1330 | 0.0486 | 0.0877 | 0.0978 | 0.2801 | 0.0454 |
| LLaMa | 0.1620 | 0.2207 | 0.2926 | 0.5591 | 0.1373 | 0.1640 | 0.2206 | 0.2915 | 0.5476 | 0.1402 | 0.0495 | 0.0878 | 0.0973 | 0.2752 | 0.0466 |
| HSTU | 0.1639 | 0.2238 | 0.2969 | 0.5672 | 0.1390 | 0.1642 | 0.2225 | 0.2909 | 0.5553 | 0.1410 | 0.0491 | 0.0861 | 0.0992 | 0.2718 | 0.0451 |
| FuXi-α | 0.1835 | 0.2429 | 0.3254 | 0.5941 | 0.1557 | 0.1954 | 0.2533 | 0.3353 | 0.5969 | 0.1677 | 0.0537 | 0.0942 | 0.1067 | 0.2951 | 0.0497 |
| SASRec-Large | 0.1186 | 0.1733 | 0.2183 | 0.4671 | 0.0186 | 0.0206 | 0.0379 | 0.0412 | 0.1209 | 0.0207 | 0.0285 | 0.0428 | 0.0544 | 0.1227 | 0.0258 |
| LLaMa-Large | 0.1659 | 0.2257 | 0.2990 | 0.5692 | 0.1408 | 0.1842 | 0.2412 | 0.3202 | 0.5776 | 0.1576 | 0.0494 | 0.0878 | 0.0970 | 0.2754 | 0.0466 |
| HSTU-Large | 0.1844 | 0.2437 | 0.3255 | 0.5929 | 0.1568 | 0.1995 | 0.2572 | 0.3407 | 0.6012 | 0.1714 | 0.0494 | 0.0883 | 0.0990 | 0.2799 | 0.0460 |
| FuXi-α-Large | 0.1934 | 0.2518 | 0.3359 | 0.5983 | 0.1651 | 0.2086 | 0.2658 | 0.3530 | 0.6113 | 0.1792 | 0.0555 | 0.0963 | 0.1105 | 0.2995 | 0.0510 |
Observations:
- Generative Models Outperform Conventional Models: SASRec, LLaMa, HSTU, and FuXi-α (generative models) consistently outperform BPRMF, GRU4Rec, and NARM (conventional models) across all datasets and metrics, even with only two layers. This highlights the superior ability of Transformer-based architectures in capturing complex item relationships and dynamic user preferences.
- SASRec's Scaling Issues: SASRec-Large (8 layers) shows a significant performance drop compared to SASRec (2 layers) across all three datasets. This suggests that simply stacking more layers on SASRec does not inherently lead to better performance, indicating it does not adhere well to scaling laws in its original form.
- LLaMa and HSTU Scaling: LLaMa-Large and HSTU-Large show substantial improvements over their 2-layer counterparts on MovieLens-1M and MovieLens-20M, indicating that these architectures are more amenable to scaling.
- FuXi-α's Consistent Superiority: FuXi-α consistently achieves the best results across all three datasets and evaluation metrics, regardless of whether it is a shallow (2-layer) or deep (8-layer) network.
  - Shallow Network (FuXi-α vs. HSTU): FuXi-α outperforms HSTU (the strongest baseline) by an average of 13.24% in NDCG@10, 10.59% in NDCG@50, 10.81% in HR@10, 6.94% in HR@50, and 13.72% in MRR across the three datasets.
  - Deep Network (FuXi-α-Large vs. HSTU-Large): FuXi-α-Large outperforms HSTU-Large by an average of 7.26% in NDCG@10, 5.24% in NDCG@50, 6.14% in HR@10, 3.19% in HR@50, and 6.90% in MRR.
- Interpretation: This demonstrates the effectiveness of FuXi-α's proposed Adaptive Multi-channel Self-attention (AMS) and Multi-stage Feedforward Network (MFFN) in enhancing explicit and implicit feature interactions, leading to better user behavior modeling.
6.1.2. Industrial Dataset Performance
The following are the results from Table 3 of the original paper:
| Dataset | Industrial | ||||
|---|---|---|---|---|---|
| Model | NG@10 | NG@50 | HR@10 | HR@50 | MRR |
| SASRec | 0.1009 | 0.1580 | 0.1970 | 0.4581 | 0.0868 |
| LLaMa | 0.1681 | 0.2238 | 0.2985 | 0.5498 | 0.1426 |
| HSTU | 0.1733 | 0.2289 | 0.3057 | 0.5565 | 0.1472 |
| FuXi-α | 0.1875 | 0.2424 | 0.3230 | 0.5702 | 0.1601 |
Observations:
- Significant Gains over SASRec: LLaMa and HSTU achieve substantial improvements over SASRec in the music recommendation scenario on the Industrial dataset (64.82% and 71.75% gains in NDCG@10, respectively). This indicates the strong performance of more advanced Transformer architectures on large-scale, real-world data.
- FuXi-α's Dominance: FuXi-α further outperforms both LLaMa and HSTU by 11.54% and 8.19% in NDCG@10, respectively. This reconfirms its superiority on a massive industrial dataset, demonstrating its practical value.
- Potential of Scaling Laws: These results collectively highlight the potential of scaling laws in recommendation systems and underscore FuXi-α's effectiveness in harnessing this potential through its enhanced feature interaction mechanisms.
6.1.3. Scaling of FuXi-α on Industrial Dataset
The following figure (Figure 5 from the original paper) shows the scaling performance of FuXi- on the Industrial Dataset:
该图像是一个图表,展示了FuXi-模型在不同层数下的性能指标NDCG@10和HR@10的变化趋势。随着层数的增加,NDCG@10和HR@10均表现出持续的提升,表明模型在扩展时有效增强了推荐效果。
Figure 5: Scaling of FuXi- on Industrial Dataset.
Observations:
- The figure plots NDCG@10 and HR@10 against the number of layers (from 2 to 32) for FuXi-α on the Industrial dataset.
- Adherence to Scaling Law: The results clearly show a positive relationship between the model's performance and its size (number of layers). Both NDCG@10 and HR@10 continuously improve as the number of layers increases from 2 to 32.
- High Potential: This indicates that FuXi-α adheres to the scaling law, implying that its performance can be further improved by increasing model size, training data, and computational resources. This is a highly desirable property for large-scale recommendation systems.
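Readers who want to quantify such a trend on their own measurements can fit a simple power law to (depth, metric) pairs; a minimal sketch follows. The numeric values below are hypothetical placeholders, not figures from the paper.

```python
# Minimal sketch of checking scaling-law behaviour from measurements like those in Figure 5.
# The (layers, NDCG@10) pairs are hypothetical placeholders; substitute your own numbers.
import numpy as np

layers = np.array([2, 4, 8, 16, 32])
ndcg10 = np.array([0.170, 0.178, 0.184, 0.189, 0.193])  # hypothetical measurements

# Fit NDCG@10 ~ a * layers^b in log-log space; b > 0 indicates continued gains with depth.
b, log_a = np.polyfit(np.log(layers), np.log(ndcg10), deg=1)
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3f}")
```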
6.2. Efficiency Comparison (RQ2)
The following are the results from Table 4 of the original paper:
| Model (KuaiRand) | TPS@200 | TPS@400 | TPS@600 | TPS@800 |
|---|---|---|---|---|
| SASRec | 2481 | 2024 | 1672 | 1398 |
| LLaMa | 2330 | 1972 | 1602 | 1326 |
| HSTU | 2078 | 1183 | 680 | 394 |
| FuXi-α | 1971 | 1053 | 615 | 373 |
Observations:
- Throughput Per Second (TPS): This metric measures the number of training samples processed per second at a given sequence length (e.g., TPS@200 is throughput with sequences of length 200), indicating training efficiency.
- Impact of Sequence Length: As the sequence length increases (from 200 to 800), the TPS for all models decreases. This is expected, as longer sequences require more computation per sample, particularly for attention mechanisms that scale quadratically with sequence length.
- SASRec and LLaMa's Higher TPS: SASRec and LLaMa generally exhibit higher TPS compared to HSTU and FuXi-α. The paper attributes this to their omission (or simpler handling) of temporal information encoding; such encoding benefits accuracy but adds computational overhead.
- FuXi-α vs. HSTU Efficiency: FuXi-α shows a TPS similar to HSTU. For example, at sequence length 800, FuXi-α has a TPS of 373, while HSTU has 394. This indicates that the added complexity of AMS and MFFN in FuXi-α (to better capture feature interactions) does not significantly degrade training efficiency compared to HSTU, which already incorporates some temporal/positional modeling.
- Performance-Efficiency Trade-off: While FuXi-α and HSTU are less efficient in terms of raw TPS than SASRec and LLaMa, their superior overall performance (as shown in Tables 2 and 3) justifies the increased computational cost, especially where predictive accuracy is paramount. FuXi-α achieves significantly better overall performance while maintaining efficiency comparable to HSTU. A minimal throughput-measurement sketch follows these observations.
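For readers who want to reproduce Table 4-style numbers for their own models, here is a minimal sketch of measuring training throughput in PyTorch. `model`, `dataloader`, `optimizer`, and `loss_fn` are placeholders for your own setup; this is not the authors' benchmarking harness.

```python
# Minimal sketch of measuring training throughput (samples/sec), i.e., the TPS metric in Table 4.
import time
import torch

def measure_tps(model, dataloader, optimizer, loss_fn, device="cuda",
                warmup_steps=10, timed_steps=50):
    model.to(device).train()
    batches = iter(dataloader)
    n_samples, start = 0, None
    for step in range(warmup_steps + timed_steps):
        inputs, labels = next(batches)              # assumes (inputs, labels) batches
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()                # keep GPU timing honest
        if step == warmup_steps - 1:
            start = time.perf_counter()             # start timing after warmup
        elif step >= warmup_steps:
            n_samples += inputs.size(0)
    return n_samples / (time.perf_counter() - start)
```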
6.3. Ablation Study (RQ3)
The following are the results from Table 5 of the original paper:
| Model | ML-1M NG@10 | ML-1M HR@10 | ML-20M NG@10 | ML-20M HR@10 | KuaiRand NG@10 | KuaiRand HR@10 |
|---|---|---|---|---|---|---|
| Base | 0.1454 | 0.2676 | 0.1452 | 0.2647 | 0.0476 | 0.0928 |
| w/o AMS | 0.1563 | 0.2847 | 0.1612 | 0.2888 | 0.0470 | 0.0921 |
| w/o MFFN | 0.1878 | 0.3304 | 0.2056 | 0.3488 | 0.0534 | 0.0947 |
| FuXi-α | 0.1934 | 0.3359 | 0.2086 | 0.3530 | 0.0555 | 0.1105 |
Note: The table in the original PDF is misaligned for the MovieLens-20M and KuaiRand columns. The table above reconstructs it assuming the datasets are MovieLens-1M, MovieLens-20M, and KuaiRand, each reporting NG@10 and HR@10. The assignment of the KuaiRand values to the w/o AMS and w/o MFFN rows is inferred from the row ordering and may be imprecise.
To assess the effectiveness of each sub-module, the authors conducted an ablation study comparing the full model against three variants (a hypothetical configuration sketch follows the list):
- Base Model: This variant replaces the AMS module with the vanilla self-attention layer from SASRec and substitutes the MFFN module with a single-stage MLP from HSTU. It serves as a strong baseline reflecting a simpler Transformer architecture with HSTU's FFN approach.
- w/o AMS: This variant replaces the AMS module with the vanilla self-attention layer (similar to SASRec) but retains the MFFN module.
- w/o MFFN: This variant substitutes the MFFN module with a single-stage MLP (similar to HSTU's FFN approach) but retains the AMS module.
- FuXi-α (Full Model): The complete proposed model.
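To make the variant definitions concrete, here is a hypothetical sketch (not the authors' code) of how the four ablation settings could be expressed as two flags toggling the attention and FFN sub-modules of each block.

```python
# Hypothetical configuration flags for the four ablation settings described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    use_ams: bool   # True: Adaptive Multi-channel Self-attention; False: vanilla self-attention
    use_mffn: bool  # True: Multi-stage FFN; False: single-stage MLP (HSTU-style)

VARIANTS = {
    "Base":       BlockConfig(use_ams=False, use_mffn=False),
    "w/o AMS":    BlockConfig(use_ams=False, use_mffn=True),
    "w/o MFFN":   BlockConfig(use_ams=True,  use_mffn=False),
    "FuXi-alpha": BlockConfig(use_ams=True,  use_mffn=True),
}

for name, cfg in VARIANTS.items():
    print(f"{name:>10}: AMS={cfg.use_ams}, MFFN={cfg.use_mffn}")
```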
Observations:
- Impact of MFFN (w/o MFFN vs. FuXi-α): Removing the MFFN (the w/o MFFN variant) leads to a clear performance drop compared to the full FuXi-α model. For instance, on MovieLens-1M, NG@10 drops from 0.1934 to 0.1878 and HR@10 drops from 0.3359 to 0.3304. This emphasizes the critical role of the MFFN in facilitating thorough implicit feature interactions and boosting model expressiveness. Despite the drop, w/o MFFN still outperforms HSTU (NG@10 of 0.1639 on MovieLens-1M), suggesting that the AMS itself is highly effective.
- Impact of AMS (w/o AMS vs. FuXi-α): Replacing the AMS with a vanilla self-attention layer (the w/o AMS variant) also results in a marked performance decline. For example, on MovieLens-1M, NG@10 drops to 0.1563 and HR@10 to 0.2847. This highlights the necessity of the AMS for explicit feature interactions and its effective utilization of temporal and positional data, which is superior to generic self-attention.
- Base Model Performance: The Base model, which lacks both AMS and MFFN, shows the lowest performance among the variants, reinforcing the individual and combined importance of both proposed modules. Its NG@10 on MovieLens-1M (0.1454) is far below that of the full FuXi-α model.
- Overall Conclusion: The ablation study confirms that both the Adaptive Multi-channel Self-attention (AMS) layer and the Multi-stage Feedforward Network (MFFN) are essential components, each contributing significantly to the superior predictive capability of FuXi-α. The AMS provides a more effective way to handle temporal and positional information, while the MFFN ensures robust implicit feature interactions.
6.4. Hyperparameter Study (RQ4)
The paper investigates the effects of three key hyperparameters: number of layers, hidden dimension, and number of negative samples. Results are primarily shown for NDCG@10 and HR@10 on the MovieLens-1M and KuaiRand datasets.
6.4.1. The Number of Layers
The following figure (Figure 6 from the original paper) illustrates the performance with different numbers of layers:
(Figure description: the chart shows NDCG@10 and HR@10 on MovieLens-1M and KuaiRand for different numbers of layers; MovieLens-1M peaks at 8 layers, while KuaiRand improves steadily as depth increases.)
Figure 6: Performance with different numbers of layers.
Observations:
- MovieLens-1M: Performance (both NDCG@10 and HR@10) improves from 2 to 8 layers but then declines at 16 layers. This suggests that for smaller datasets like MovieLens-1M, increasing model depth beyond a certain point leads to overfitting or diminishing returns, as the dataset may not have enough complexity or data volume to support a very deep model.
- KuaiRand: In contrast, performance on KuaiRand consistently increases as the number of layers goes from 2 to 16. This indicates that the larger and more complex KuaiRand dataset benefits from deeper models, allowing FuXi-α to capture more intricate patterns and dependencies. This observation further supports FuXi-α's adherence to the scaling law on sufficiently large datasets.
6.4.2. The Hidden Dimension
The following figure (Figure 7 from the original paper) illustrates the performance with different hidden dimensions:
(Figure description: the chart shows NDCG@10 and HR@10 on MovieLens-1M (left) and KuaiRand (right) for different hidden dimensions; both metrics trend upward as the hidden dimension increases.)
Figure 7: Performance with different hidden dimensions.
Observations:
- MovieLens-1M: Performance on MovieLens-1M saturates at a hidden dimension of 32, with only minimal gains observed when increasing it further to 64. This is consistent with the findings for the number of layers; smaller datasets have less capacity to benefit from very high-dimensional representations.
- KuaiRand: On KuaiRand, performance steadily improves across all tested hidden dimensions (from 8 to 64). This implies that a larger hidden dimension allows the model to learn richer item representations and more accurate self-attention similarities, which is beneficial for complex, real-world datasets.
6.4.3. Negative Samples
The following figure (Figure 8 from the original paper) illustrates the performance with diverse negative sample counts:
(Figure description: the chart shows how NDCG@10 and HR@10 on MovieLens-1M and KuaiRand change with the number of negative samples; both metrics rise as the count increases, indicating that more negatives improve recommendation quality.)
Figure 8: Performance with different numbers of negative samples.
Observations:
- Consistent Improvement: Performance (both NDCG@10 and HR@10) consistently improves on both MovieLens-1M and KuaiRand as the number of negative samples increases from 32 to 256.
- Importance of Negative Sampling: The gains from increasing negative samples are substantial, in some cases even surpassing those from increasing the number of layers. This underscores the critical role of negative sampling in recommendation systems: effective negative sampling helps the model distinguish relevant items from irrelevant ones more robustly, which is crucial when the item space is vast (a minimal sampled-softmax sketch follows these observations).
- Relevance to Scaling Laws: This finding is particularly important because the influence of negative sampling is often overlooked in studies focusing on scaling laws in LLMs [21, 24]. The paper highlights that for recommendation models, optimizing negative sampling is as vital as scaling the model architecture.
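To make the role of the negative-sample count concrete, here is a minimal sketch of a sampled-softmax loss with uniformly drawn negatives. It illustrates the general technique, not the authors' training objective; `user_repr`, `item_emb`, and `pos_items` are placeholder names, and uniform sampling can occasionally draw the positive item, which a production implementation would typically mask or correct for.

```python
# Minimal sketch of a sampled-softmax loss with a configurable number of negatives.
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_repr, item_emb, pos_items, num_negatives=256):
    """user_repr: (B, d); item_emb: nn.Embedding over the item vocabulary; pos_items: (B,)."""
    B = user_repr.size(0)
    # Uniformly sample negative item ids (may rarely collide with the positive; ignored here).
    neg_items = torch.randint(0, item_emb.num_embeddings, (B, num_negatives),
                              device=user_repr.device)
    pos_logits = (user_repr * item_emb(pos_items)).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", user_repr, item_emb(neg_items))   # (B, N)
    logits = torch.cat([pos_logits, neg_logits], dim=-1)                      # (B, 1+N)
    targets = torch.zeros(B, dtype=torch.long, device=user_repr.device)       # positive at index 0
    return F.cross_entropy(logits, targets)
```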
6.5. Online A/B Test
To validate the real-world impact, an online A/B test was conducted within the Huawei Music app.
- Scenario: Main scenario of Huawei Music.
- Duration: 7 days.
- Traffic Allocation: 30% of user traffic was directed to FuXi-α.
- Baseline: A well-optimized multi-channel retrieval system, refined over several years, served as the control group.
- Results: FuXi-α demonstrated significant improvements:
  - Average number of songs played per user: increased by 4.76%.
  - Average listening duration per user: increased by 5.10%.
- Conclusion: These online results strongly indicate that FuXi-α effectively enhances user interaction and engagement, improving user experience and increasing platform usage time. Following these positive evaluations, FuXi-α was integrated as an inherent channel in this scenario, serving most of the online traffic. This is a crucial validation of the model's practical utility and its ability to deliver tangible business value in a production environment.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully proposes FuXi-α, a novel Transformer-based model for sequential recommendation that significantly advances the state-of-the-art. Its core innovations include an Adaptive Multi-channel Self-attention (AMS) mechanism, which disentangles the modeling of temporal, positional, and semantic features, and a Multi-stage Feedforward Network (MFFN) to enhance implicit feature interactions. Extensive offline experiments on both public and industrial datasets consistently demonstrate FuXi-α's superior performance compared to existing baselines, including powerful Transformer-based models like LLaMa and HSTU. Furthermore, FuXi-α exhibits continuous performance improvement with increasing model size, confirming its adherence to scaling laws. The model's real-world efficacy was validated through an online A/B test within the Huawei Music app, showing notable increases in user engagement metrics (songs played and listening duration). The ablation studies confirmed the critical contributions of both AMS and MFFN components to the model's overall effectiveness.
7.2. Limitations & Future Work
The authors acknowledge several directions for future research:
- More Complex Recommendation Problems: They plan to extend FuXi-α to tackle more intricate recommendation scenarios, such as multi-behavior recommendation (e.g., likes, clicks, and purchases within one sequence) and multi-modal recommendation (e.g., incorporating text, image, and audio features).
- Long Sequences: The model's performance on very long sequences could be further optimized. The quadratic complexity of self-attention with respect to sequence length remains a challenge for extremely long sequences, suggesting a need for more efficient attention mechanisms or hierarchical processing.
7.3. Personal Insights & Critique
The FuXi-α paper presents a compelling case for carefully designed feature interactions in large-scale sequential recommendation models. My personal insights and critique are as follows:
- Innovation in Feature Interaction: The Adaptive Multi-channel Self-attention (AMS) and Multi-stage FFN (MFFN) are genuinely innovative contributions. The idea of explicitly decoupling temporal, positional, and semantic information within self-attention, rather than simply adding them as biases or embeddings, is a nuanced and powerful approach. This targeted modeling allows each aspect to contribute optimally to the user's interest representation. The MFFN further reinforces this by providing a dedicated mechanism for implicit interactions after the multi-channel attention, addressing a gap in models like HSTU. This granular control over different types of feature interactions is a significant step forward (a conceptual sketch of the multi-channel idea appears after this list).
- Strong Empirical Evidence: The comprehensive experimental validation, including both extensive offline benchmarks and a crucial online A/B test on a large industrial platform, lends strong credibility to the model's claims. The online A/B test results, showing tangible improvements in user engagement, are particularly impressive and demonstrate real-world impact, which is often challenging to achieve with research prototypes.
- Scaling Laws for Recommendation: The paper's explicit focus on scaling laws for recommendation models, mirroring trends in LLMs, is a timely and important research direction. Demonstrating that FuXi-α can continuously improve with increased model size provides a clear path for future development and resource allocation in building even larger and more capable recommendation systems. This could motivate further exploration into the theoretical underpinnings of scaling in this domain.
- Emphasis on Negative Sampling: The hyperparameter study's finding that negative sampling significantly impacts performance, sometimes more than increasing layers, is a valuable insight. It highlights that while architectural scaling is important, fundamental training techniques like negative sampling remain crucial and should not be overlooked, especially in the context of large-vocabulary prediction tasks typical in recommendation. This is a practical takeaway for practitioners.
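Below is a conceptual PyTorch sketch of the multi-channel idea as described in this analysis: three attention-weight matrices (semantic from query-key similarity, positional from relative positions, temporal from bucketized time gaps) share one value projection, and their outputs are concatenated and projected back. Everything here, including the class name `MultiChannelSelfAttention` and the log-based time bucketing, is illustrative and is not the authors' exact AMS formulation (which uses its own alpha/beta parameterisation and fusion).

```python
# Conceptual sketch of a three-channel (semantic / positional / temporal) self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChannelSelfAttention(nn.Module):
    def __init__(self, dim, max_len=512, num_time_buckets=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.pos_bias = nn.Parameter(torch.zeros(2 * max_len - 1))    # relative-position logits
        self.time_bias = nn.Parameter(torch.zeros(num_time_buckets))  # time-gap-bucket logits
        self.out = nn.Linear(3 * dim, dim)                            # fuse the three channels
        self.max_len, self.num_time_buckets = max_len, num_time_buckets

    def forward(self, x, timestamps):
        # x: (B, L, d); timestamps: (B, L) interaction times in seconds
        B, L, d = x.shape
        causal = torch.tril(torch.ones(L, L, dtype=torch.bool, device=x.device))
        q, k, v = self.q(x), self.k(x), self.v(x)

        # 1) Semantic channel: standard scaled dot-product logits.
        sem_logits = (q @ k.transpose(-2, -1)) / d ** 0.5                        # (B, L, L)

        # 2) Positional channel: logits from relative positions only.
        rel = torch.arange(L, device=x.device)[:, None] - torch.arange(L, device=x.device)[None, :]
        idx = (rel + self.max_len - 1).clamp(0, 2 * self.max_len - 2)
        pos_logits = self.pos_bias[idx].expand(B, L, L)                          # (B, L, L)

        # 3) Temporal channel: logits from log-bucketized time gaps (bucketing is an assumption).
        gaps = (timestamps[:, :, None] - timestamps[:, None, :]).clamp(min=0).float()
        buckets = torch.log1p(gaps).long().clamp(max=self.num_time_buckets - 1)
        time_logits = self.time_bias[buckets]                                    # (B, L, L)

        outs = []
        for logits in (sem_logits, pos_logits, time_logits):
            w = F.softmax(logits.masked_fill(~causal, float("-inf")), dim=-1)
            outs.append(w @ v)                                                   # shared value matrix
        return self.out(torch.cat(outs, dim=-1))                                 # (B, L, d)
```

Treat this purely as a reading aid for the decoupling idea; the actual AMS additionally applies its learnable parameters and normalization choices described in the paper.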
Potential Areas for Improvement/Critique:
- Computational Cost of AMS: While AMS offers greater expressiveness, the explicit calculation of three separate attention weight matrices (even if sharing value matrices) and their subsequent concatenation might increase computational overhead compared to simpler attention mechanisms, especially for very long sequences. The efficiency comparison in Table 4 confirms that FuXi-α (and HSTU) are slower than SASRec and LLaMa. While justified by performance gains, explicitly discussing the trade-offs and exploring ways to optimize this (e.g., sparse attention for specific channels) could be beneficial.
- Complexity of Temporal/Positional Embeddings: The paper refers to alpha and beta parameters for temporal and positional attention without fully detailing their structure or how the bucket mapping for temporal differences is implemented. A more in-depth explanation of these learnable parameters and their behavior would enhance understanding for beginners.
- Specifics of RMSN: While RMSN is mentioned, a brief explanation of why RMSN was chosen over other normalization layers (e.g., LayerNorm) would add value, especially given its use in LLaMa.
- Ablation for the Multi-Channel Aspect of AMS: While the ablation study shows AMS is better than vanilla self-attention, it does not break down the contribution of each channel within AMS. For example, what if only semantic + temporal channels were used, or semantic + positional? This could further illuminate the benefits of the multi-channel design.
- Long Sequence Handling: While future work mentions long sequences, the current method, with O(n²) time complexity for attention, will face challenges with extremely long user histories. Exploring mechanisms like sparse attention, block-wise attention, or recurrent layers (similar to Reformer or Longformer) could be a natural extension for truly "long" sequences.
Applicability and Transferability:
The principles of FuXi-α can be applied to various domains beyond music recommendation. Any sequential recommendation task where temporal and positional information are critical, and where complex implicit interactions are expected, could benefit. This includes:
- E-commerce: Recommending products based on a user's browsing and purchase history, where the order and time between interactions are crucial (e.g., buying a phone, then a case a few days later, then a charger the next week).
- News/Content Feeds: Personalizing news articles or content streams, where recent interactions and the timing of engagement are key.
- Online Education: Recommending learning materials or courses based on a student's learning path and progress over time.
The model's strong performance and adherence to scaling laws make it a promising candidate for large-scale industrial deployment in many settings, potentially ushering in a new generation of more context-aware and personalized recommendation systems.