RankMixer: Scaling Up Ranking Models in Industrial Recommenders
TL;DR Summary
RankMixer is a hardware-aware ranking model for industrial recommenders that overcomes cost and latency challenges. It employs a multi-head token mixing module for efficiency and scalability, achieving significant user-engagement gains with a billion-parameter Sparse-MoE variant.
Abstract
Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. First, training and serving cost on industrial Recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with multi-head token mixing module for higher efficiency. Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI. A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training. Experiments show RankMixer's superior scaling abilities on a trillion-scale production dataset. By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost the model MFU from 4.5% to 45%, and scale our ranking model parameters by 100x while maintaining roughly the same inference latency. We verify RankMixer's universality with online A/B tests across two core application scenarios (Recommendation and Advertisement). Finally, we launch 1B Dense-Parameters RankMixer for full traffic serving without increasing the serving cost, which improves user active days by 0.3% and total in-app usage duration by 1.08%.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "RankMixer: Scaling Up Ranking Models in Industrial Recommenders," focusing on developing a new model architecture to overcome efficiency and scalability challenges in large-scale recommendation systems.
1.2. Authors
The paper lists multiple authors from ByteDance: Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. Their affiliations indicate a strong background in industry-scale machine learning and recommender systems, particularly from a prominent technology company known for its large-scale applications like Douyin (TikTok).
1.3. Journal/Conference
This paper is published as a preprint on arXiv (arXiv:2507.15551) and has not yet appeared in a peer-reviewed venue. Its topic and the authors' industrial affiliations place it squarely in the scope of top-tier data-mining venues such as KDD, and the related industrial scaling work it cites (e.g., a KDD 2024 paper by Fedor Borisyuk, Mingzhou Zhou, and others) was published there. The presence on arXiv indicates active research and dissemination within the academic and industrial communities.
1.4. Publication Year
The paper was submitted to arXiv at 2025-07-21T12:28:55 UTC, i.e., it was published in 2025.
1.5. Abstract
The paper addresses two key practical obstacles in scaling up recommendation systems, inspired by the success of Large Language Models (LLMs):
- Cost and Latency: Industrial recommenders face strict latency bounds and high queries-per-second (QPS) demands, making efficient training and serving crucial.
- Hardware Inefficiency: Many traditional feature-crossing modules, designed for CPU-era systems, fail to exploit modern GPUs effectively, leading to low Model Flops Utilization (MFU) and poor scalability.

To tackle these issues, the authors introduce RankMixer, a hardware-aware model designed for a unified and scalable feature-interaction architecture. RankMixer maintains the high parallelism of transformer models but replaces the quadratically complex self-attention with a more efficient multi-head token mixing module. It also uses Per-token FFNs to model distinct feature subspaces and cross-feature interactions, ensuring better alignment with recommendation data patterns.
For further scalability, RankMixer is extended with a Sparse-MoE (Mixture-of-Experts) variant, reaching one billion parameters. A dynamic routing strategy is adapted to address common MoE issues like expert inadequacy and imbalance.
Experiments on a trillion-scale production dataset demonstrate RankMixer's superior scaling abilities. By replacing diverse handcrafted modules, RankMixer boosts MFU from 4.5% to 45% and scales model parameters by 100x while maintaining inference latency. Online A/B tests across recommendation and advertisement scenarios verify its universality. A 1B Dense-Parameters RankMixer has been launched for full traffic serving, improving user active days by 0.3% and total in-app usage duration by 1.08% without increasing serving cost.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2507.15551. The PDF link is https://arxiv.org/pdf/2507.15551v3.pdf. This paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The success of Large Language Models (LLMs) has demonstrated the immense potential of scaling up model parameters and data volume to achieve superior performance across various tasks. This success has naturally spurred interest in applying similar scaling laws to Deep Learning Recommendation Models (DLRMs). However, industrial Recommender Systems (RS) operate under unique and stringent constraints that make direct application of LLM scaling challenging.
The core problem the paper aims to solve is how to effectively scale ranking models in industrial recommenders to billions of parameters, akin to LLMs, without compromising the strict latency bounds and high Queries Per Second (QPS) demands of online serving environments. This problem is critical because existing DLRMs face two major practical obstacles:
- Computational Efficiency: Training and serving large models in industrial RS must respect strict latency budgets (e.g., milliseconds) and handle massive QPS (thousands to millions per second), making computational efficiency paramount. Scaling up naive DLRMs often leads to prohibitive latency.
- Hardware Utilization: Many feature-crossing modules in traditional ranking models were designed in the CPU era and perform poorly on modern GPUs. These modules often result in low Model Flops Utilization (MFU), meaning GPUs are underutilized, and thus models are not scalable. The computational cost of these older models was often proportional to the number of parameters, making the Return on Investment (ROI) of aggressive scaling difficult to realize.

The paper's entry point is to design a new, hardware-aware model architecture that can unify diverse feature interactions and scale efficiently on modern GPU hardware, moving beyond the limitations of CPU-era designs and inefficient self-attention mechanisms for heterogeneous recommendation data.
2.2. Main Contributions / Findings
The primary contributions and key findings of the RankMixer paper are:
- Novel Hardware-Aware Architecture (RankMixer): The paper introduces RankMixer, a new model architecture specifically designed to be hardware-aware and scalable for industrial DLRMs. It unifies feature interactions into a highly parallelizable structure.
  - Multi-head Token Mixing: This component replaces the quadratic self-attention mechanism with a more efficient, parameter-free operator for cross-token feature interactions. It is shown to outperform self-attention on heterogeneous recommendation data, avoiding the combinatorial explosion and memory-bound issues of attention weights.
  - Per-token Feed-Forward Networks (PFFNs): RankMixer employs PFFNs to model distinct feature subspaces independently. This increases model capacity significantly by dedicating parameters to each token, preventing dominant features from overshadowing others and aligning better with the diverse nature of recommendation data.
- Scalable Sparse Mixture-of-Experts (MoE) Variant: To further enhance ROI at massive scales, RankMixer is extended with a Sparse-MoE variant. This variant uses a dynamic routing strategy with ReLU Routing and Dense-Training / Sparse-Inference (DTSI-MoE) to address common MoE problems like expert under-training and imbalance, enabling capacity growth with minimal computational cost increase.
- Demonstrated Scaling Abilities and Efficiency:
  - RankMixer achieved a 10x boost in Model Flops Utilization (MFU), from 4.5% to 45%, by replacing diverse handcrafted low-MFU modules with its unified architecture. This shifts the model from a memory-bound to a compute-bound regime.
  - The model successfully scaled parameters by 100x (from 16M to 1B dense parameters) while maintaining roughly the same inference latency. This was achieved by decoupling parameter growth from FLOPs, and FLOPs growth from actual serving cost, through high MFU and quantization (e.g., fp16).
- Universality and Online Performance Validation:
  - RankMixer's universality was verified with online A/B tests across two core application scenarios: Recommendation and Advertisement.
  - The 1B Dense-Parameters RankMixer was deployed for full traffic serving on Douyin, leading to significant business-metric improvements: a 0.3% increase in user active days and a 1.08% increase in total in-app usage duration. It also showed larger improvements for low-active users, demonstrating strong generalization capability.

These findings collectively demonstrate that RankMixer provides a practical and effective solution for scaling DLRMs to LLM-like parameter counts in industrial settings, overcoming critical efficiency and hardware-utilization challenges.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the RankMixer paper, a reader should be familiar with several fundamental concepts in machine learning, particularly within the domain of recommender systems and deep learning architectures.
- Recommender Systems (RS): Systems that predict user preferences for items and suggest items to users. They are crucial for platforms like e-commerce, social media, and content streaming. The paper focuses on ranking models, the component of RS responsible for ordering a retrieved set of candidate items based on predicted user engagement or preference.
- Deep Learning Recommendation Models (DLRMs): A class of recommender systems that use deep neural networks to learn complex patterns and interactions from user and item features. They typically consist of an embedding layer for sparse categorical features and dense interaction layers.
- Feature Interaction: The process of combining multiple features to create new, more expressive features. For example, combining a user's age with an item's category might reveal a preference for certain categories among specific age groups.
  - Low-order interactions: Simple combinations, like pairwise interactions.
  - High-order interactions: More complex combinations involving three or more features.
  - Explicit feature interaction: Manually designed or explicitly modeled by specific architectural components (e.g., DCN, DeepFM).
  - Implicit feature interaction: Learned automatically by deep neural networks without explicit design (e.g., MLP).
- Scaling Laws: Empirical observations in machine learning, especially prominent in Natural Language Processing (NLP) and Computer Vision (CV), showing that model performance (e.g., accuracy, loss) often improves predictably as model size (number of parameters), data volume, and computational capacity (FLOPs) increase. This paper aims to establish and exploit similar scaling laws for DLRMs.
- Model Flops Utilization (MFU): A metric that measures how efficiently a model uses the theoretical peak floating-point-operations (FLOPs) capacity of the underlying hardware (e.g., GPU). It is calculated as:

  $ \text{MFU} = \frac{\text{Actual FLOPs Consumption}}{\text{Theoretical Hardware FLOPs Capacity}} \times 100\% $

  where Actual FLOPs Consumption refers to the floating-point operations performed by the model, and Theoretical Hardware FLOPs Capacity is the maximum number of floating-point operations the hardware can perform per second. A low MFU indicates that the hardware is underutilized, often due to memory-bound rather than compute-bound operations. The paper aims to increase MFU to make DLRMs more compute-bound on GPUs.
- Transformers: A neural network architecture introduced in 2017, primarily for NLP tasks, which revolutionized sequence modeling. Key components include:
  - Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence (tokens) when processing each token. It computes attention weights based on query ($Q$), key ($K$), and value ($V$) matrices derived from the input. The standard formula for Scaled Dot-Product Attention is:

    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

    where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$; $n$ is the number of queries, $m$ the number of keys/values, $d_k$ the dimension of keys/queries, and $d_v$ the dimension of values. $QK^T$ calculates the similarity scores between queries and keys; $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax into regions with tiny gradients; the softmax normalizes these scores into attention weights. Self-attention has quadratic complexity ($O(n^2 d_k)$ for the dot products, $O(n^2 d_v)$ for the weighted sum) with respect to the sequence length $n$, which can be a bottleneck for very long sequences or many tokens.
  - Feed-Forward Networks (FFNs): Position-wise FFNs are applied independently to each position (token) in the sequence. A standard FFN consists of two linear transformations with a ReLU activation in between: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. In standard Transformers, the parameters ($W_1, b_1, W_2, b_2$) are shared across all token positions.
- Mixture-of-Experts (MoE): An architecture that increases model capacity without proportionally increasing computational cost during inference. It consists of multiple "expert" neural networks and a "gate" (or "router") network that learns to activate a subset of these experts for each input.
  - Sparse-MoE: A variant where only a small number $k$ of experts are selected and activated for each input, leading to sparse activation.
  - Routing Strategy: The mechanism by which the gate network determines which experts to activate for a given input.
- Latency: The time delay between a request and a response in a system. In online recommenders, low latency is critical for a smooth user experience.
- QPS (Queries Per Second): The number of requests a system can handle per second. High QPS demands require efficient model serving infrastructure.
- AUC (Area Under the Curve): A common evaluation metric for binary classification models. It represents the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.
- UAUC (User-level AUC): A variation of AUC in which AUC is calculated for each user individually and then averaged across all users. This is particularly relevant in recommender systems to evaluate personalized performance and avoid bias towards popular items or active users. (A small sketch of both metrics follows this list.)
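Because AUC and UAUC anchor most of the offline results later in this analysis, here is a minimal pure-Python sketch of both metrics, written directly from the definitions above (quadratic-time, for clarity rather than speed):

```python
# Minimal AUC/UAUC from the definitions above: AUC counts pairwise wins of
# positives over negatives; UAUC averages per-user AUCs.
from collections import defaultdict

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None                      # undefined without both classes
    wins = sum(p > n for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def uauc(users, scores, labels):
    by_user = defaultdict(lambda: ([], []))
    for u, s, y in zip(users, scores, labels):
        by_user[u][0].append(s)
        by_user[u][1].append(y)
    per_user = [auc(s, y) for s, y in by_user.values()]
    per_user = [a for a in per_user if a is not None]
    return sum(per_user) / len(per_user)

print(uauc(["u1", "u1", "u1", "u2", "u2"],
           [0.9, 0.2, 0.6, 0.3, 0.8],
           [1,   0,   0,   0,   1]))     # 1.0: both users are ranked perfectly
```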
3.2. Previous Works
The paper contextualizes RankMixer by discussing the evolution of DLRMs and attempts to scale them.
- Early DLRMs for Feature Interaction:
  - Wide&Deep [5]: One of the earliest efforts to combine logistic regression (for low-order, memorized interactions) with a DNN (for high-order, generalized interactions).
  - DeepFM [11]: Integrated Factorization Machines (FM) with DNNs to capture both low-order (pairwise) and high-order feature interactions.
  - DeepCross [26]: Extended residual networks to learn automatic feature interactions implicitly.
  - PNN [22], DCN [32], DCNv2 [33], xDeepFM [18], FGCNN [20], FiGNN [17]: Proposed various explicit cross-methods to capture high-order interactions, often designing specific operators for feature crossings. DCN and DCNv2 specifically use a cross network to explicitly learn bounded-degree cross features.
- Attention-based Models for Feature Interaction:
  - AutoInt [28] and HiFormer [10]: Employed attention mechanisms with residual connections to learn complex interactions, inspired by Transformers. HiFormer particularly focuses on heterogeneous feature interactions.
- Ensemble and Scaled DLRMs:
  - DHEN [39]: Proposed combining multiple interaction operators (e.g., DCN, self-attention, FM, LR) and stacking them.
  - Wukong [38]: Investigated scaling laws for feature-interaction models by stacking Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).
  - Other scaling efforts [2, 6, 9, 12, 27, 34, 36, 40]: Explored scaling strategies for pre-training on user activity sequences, general-purpose user representations, and online retrieval, with varying degrees of success.
3.3. Technological Evolution
The evolution of DLRMs has progressed from simple MLPs to models explicitly designed to capture feature interactions. Initially, models like Wide&Deep and DeepFM combined linear models with DNNs or Factorization Machines. Subsequent work focused on designing specialized cross-network architectures (e.g., DCN, xDeepFM) to explicitly model complex interactions. More recently, inspired by the success of Transformers in NLP, researchers adapted attention mechanisms (e.g., AutoInt, HiFormer) to DLRMs to allow for dynamic weighting of feature interactions.
However, many of these CPU-era designs and even some attention-based approaches faced challenges when scaling to industrial levels: they often increased latency and memory consumption disproportionately to model size, resulting in low MFU on GPUs. The advent of scaling laws in LLMs highlighted the potential for massive performance gains with larger models, pushing DLRM research towards architectures that are inherently more scalable and hardware-efficient.
RankMixer fits into this timeline by addressing the limitations of prior DLRMs that struggle with GPU utilization and heterogeneous feature spaces. It aims to bring Transformer-like parallelism and scalability to DLRMs while mitigating the specific inefficiencies of self-attention (like quadratic complexity and difficulty with heterogeneous features) and adapting MoE for better performance and training stability in the RS context.
3.4. Differentiation Analysis
RankMixer differentiates itself from previous works through several core innovations tailored for industrial recommender systems:
- Hardware-Aware Design: Unlike many CPU-era models (e.g., DCN, DeepFM) or even general DNNs that suffer from low MFU on GPUs, RankMixer is explicitly designed for high GPU parallelism and MFU. This is a fundamental shift from feature-centric to hardware-centric model design.
- Efficient Cross-Token Interaction:
  - Vs. Self-Attention (AutoInt, HiFormer): RankMixer replaces self-attention with a multi-head token mixing module. While self-attention works well in NLP, where tokens share a unified embedding space, it is suboptimal for the inherently heterogeneous feature spaces in RS (e.g., user ID vs. item ID). Self-attention incurs quadratic complexity and high memory/compute cost for calculating attention weights; RankMixer's token mixing is parameter-free, more computationally efficient, and avoids the difficulty of computing inner-product similarity across disparate feature types.
  - Vs. Handcrafted/Explicit Cross-feature Modules (DCN, DCNv2, xDeepFM, DHEN): These models often rely on diverse handcrafted modules, which can be effective but lead to complex architectures, low MFU, and poor scalability. RankMixer provides a unified, Transformer-like backbone that intrinsically learns these interactions without explicit manual design.
- Per-token FFNs (PFFNs) for Heterogeneity:
  - Vs. Shared FFNs (Vanilla Transformers): In standard Transformers, FFN parameters are shared across all tokens. RankMixer uses PFFNs in which each token has dedicated FFN parameters, isolating the modeling of distinct feature subspaces, preventing high-frequency fields from dominating, and enhancing modeling capacity for diverse recommendation data, a problem common in DLRMs that mix features indiscriminately.
  - Vs. MMoE (Multi-gate Mixture-of-Experts): While MMoE uses multiple experts, they typically share the same input. RankMixer's PFFNs ensure each FFN (or MoE expert block) sees a distinct token input, aligning with the idea of learning diversity in different feature subspaces.
- Robust Sparse-MoE Implementation: RankMixer addresses critical Sparse-MoE challenges (expert under-training, imbalanced routing) by introducing ReLU Routing and Dense-Training / Sparse-Inference (DTSI-MoE). This allows significant parameter scaling without a proportional inference-cost increase, a problem vanilla MoE models often struggle with (as noted in the paper's ablation studies).
- Focus on Online Serving Metrics: The paper explicitly targets strict industrial requirements such as latency, QPS, and MFU, demonstrating impressive real-world improvements in these areas, including a 100x parameter increase with no latency penalty and a significant MFU boost, a major differentiator from academic research that often focuses solely on AUC.
4. Methodology
4.1. Principles
The core idea behind RankMixer is to create a hardware-aware model design that provides a unified and scalable feature-interaction architecture for industrial recommender systems. This design is guided by two main principles:
- Exploiting GPU Parallelism and Maximizing MFU: Moving away from CPU-era designs that lead to low Model Flops Utilization (MFU) on modern GPUs, RankMixer leverages highly parallel operations to shift from a memory-bound to a compute-bound regime.
- Effective Modeling of Heterogeneous Feature Interactions: Recognizing the unique challenges of recommendation data (diverse feature types, distinct semantic subspaces, and the need for personalized cross-feature interactions), RankMixer aims to capture these complexities efficiently and scalably.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Overall Architecture
The RankMixer model processes input tokens through successive RankMixer blocks. After these blocks, an output pooling operator consolidates the representations for prediction tasks.
Each RankMixer block consists of two main components:

1. A Multi-Head Token Mixing layer.
2. A Per-Token Feed-Forward Network (PFFN) layer.

The process begins by taking an input vector and tokenizing it into $T$ feature tokens $x_1, \ldots, x_T$, which are then iteratively refined through the layers.
The computation within each RankMixer block (for layer $n$) can be expressed by the following formulas:

$ \mathbf{S}_{n-1} = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{X}_{n-1}\right) + \mathbf{X}_{n-1}\right) $

$ \mathbf{X}_{n} = \mathrm{LN}\left(\mathrm{PFFN}\left(\mathbf{S}_{n-1}\right) + \mathbf{S}_{n-1}\right) $

Symbol Explanation:

- $\mathbf{S}_{n-1}$: The intermediate representation after the TokenMixing module and Layer Normalization in layer $n-1$.
- $\mathrm{LN}$: Layer Normalization, which normalizes the activations across features for each sample independently within a layer, helping to stabilize training.
- $\mathrm{TokenMixing}$: The Multi-Head Token Mixing module, which facilitates information exchange across tokens.
- $\mathbf{X}_{n-1}$: The input to the RankMixer block at layer $n$, representing the collection of tokens from the previous layer.
- $\mathrm{PFFN}$: The Per-Token Feed-Forward Network module, which processes each token with dedicated parameters.
- $\mathbf{X}_n$: The output of the $n$-th RankMixer block, which becomes the input for the next block or the final pooling.
- $\mathbf{X}_0 \in \mathbb{R}^{T \times D}$: The initial input, formed by stacking the feature tokens $x_1, \ldots, x_T$.
- $T$: The total number of feature tokens.
- $D$: The hidden dimension of each token (and thus the model).

The residual connections (the $+\,\mathbf{X}_{n-1}$ and $+\,\mathbf{S}_{n-1}$ terms) are crucial for enabling deeper networks and improving training stability by allowing gradients to flow directly through the network. Finally, the output representation is derived from mean pooling of the final layer's token representations, which is then used to compute predictions for various tasks. A compact sketch of one block follows.
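To make the two equations concrete, here is a minimal PyTorch sketch of one RankMixer block. This is an illustration under assumed shapes, not the authors' production code; the token-mixing step anticipates Section 4.2.3.1 and the per-token weights anticipate Section 4.2.3.2.

```python
# A minimal sketch of one RankMixer block: parameter-free multi-head token
# mixing, then per-token FFNs, each wrapped in residual + LayerNorm.
import torch
import torch.nn as nn

class RankMixerBlock(nn.Module):
    def __init__(self, T: int, D: int, k: int = 2):
        super().__init__()
        self.T, self.H = T, T  # heads H set equal to token count T (per the paper)
        self.ln1, self.ln2 = nn.LayerNorm(D), nn.LayerNorm(D)
        # Per-token FFN weights: one (D -> kD -> D) MLP per token.
        self.W1 = nn.Parameter(torch.randn(T, D, k * D) * D ** -0.5)
        self.W2 = nn.Parameter(torch.randn(T, k * D, D) * (k * D) ** -0.5)

    def token_mixing(self, x):        # x: (B, T, D)
        B, T, D = x.shape
        # Split each token into H heads, then regroup head h of every token.
        return x.reshape(B, T, self.H, D // self.H).transpose(1, 2).reshape(B, T, D)

    def forward(self, x):             # x: (B, T, D)
        s = self.ln1(self.token_mixing(x) + x)
        v = torch.einsum('btd,tdh->bth', s, self.W1)
        v = torch.einsum('bth,thd->btd', torch.nn.functional.gelu(v), self.W2)
        return self.ln2(v + s)

x = torch.randn(4, 8, 64)                   # batch of 4, T=8 tokens, D=64
print(RankMixerBlock(T=8, D=64)(x).shape)   # torch.Size([4, 8, 64])
```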
4.2.2. Input Layer and Feature Tokenization
The first step in RankMixer is to prepare the input features and convert them into a structured format suitable for the Transformer-like architecture. Industrial recommender systems typically deal with a vast array of features, including:
- User features: such as user ID and demographic information.
- Candidate features: such as video ID and author ID.
- Sequence features: processed by sequence modules to capture temporal interests, yielding sequence embeddings.
- Cross features: interactions between user and candidate features.

All these diverse features are initially transformed into embeddings, which may have varying dimensions. To enable efficient parallel computation and structured interaction, these embeddings are aligned into fixed-dimension vectors, referred to as feature-tokens; this process is called tokenization.

A naive approach of one embedding per feature would produce hundreds of tokens, fragmenting parameters and computation per token, while too few tokens would collapse the model into a simple DNN, losing the distinct representation of diverse feature spaces.
To overcome these challenges, RankMixer proposes a semantic-based tokenization approach:
- Feature Grouping: Domain knowledge is used to group features into several semantically coherent clusters. For example, all user-related features might form one cluster, and all item-related features another.
- Concatenation: These grouped features are sequentially concatenated into a single embedding vector:

  $ e_{\mathrm{input}} = \left[ e_1; e_2; \ldots; e_N \right] $

  where $e_{\mathrm{input}}$ is the concatenated embedding vector, $e_i$ is the embedding of the $i$-th feature group, and $N$ is the number of feature groups.
- Partitioning and Projection: The concatenated vector is then partitioned into an appropriate number of tokens $T$, each from a segment of fixed length $d$, and a Proj function maps each segment to the model's hidden dimension $D$:

  $ x_i = \mathrm{Proj}\left(e_{\mathrm{input}}\left[ d \cdot (i-1) : d \cdot i \right]\right), \quad i = 1, \ldots, T $

  where $x_i \in \mathbb{R}^D$ is the $i$-th feature token, $\mathrm{Proj}$ is a projection function (e.g., a linear layer), $e_{\mathrm{input}}[d \cdot (i-1) : d \cdot i]$ is the $i$-th length-$d$ segment of the concatenated vector, and $T$ is the resulting number of tokens. Each token captures a set of feature embeddings representing a similar semantic aspect. A small sketch of this tokenization follows.
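The following sketch walks through the tokenization with hypothetical feature groups and dimensions (the group names and sizes are made up for illustration); a single shared linear layer stands in for Proj:

```python
# Semantic-based tokenization sketch: concatenate grouped feature embeddings,
# split into T fixed-length segments, project each segment to dimension D.
import torch
import torch.nn as nn

# Hypothetical grouped feature embeddings for one sample (varying dims).
groups = {
    "user":      torch.randn(48),   # user ID + demographics
    "candidate": torch.randn(32),   # video ID, author ID, ...
    "sequence":  torch.randn(64),   # pooled behavior-sequence embedding
    "cross":     torch.randn(16),
}
e_input = torch.cat(list(groups.values()))   # concatenated vector, length 160

T, D = 8, 64
d = e_input.numel() // T                     # segment length per token (20)
proj = nn.Linear(d, D)                       # Proj (shared here for simplicity)

tokens = torch.stack([proj(e_input[d * i: d * (i + 1)]) for i in range(T)])
print(tokens.shape)                          # torch.Size([8, 64])
```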
4.2.3. RankMixer Block
4.2.3.1. Multi-head Token Mixing
The Multi-head Token Mixing module is designed to facilitate cross-token feature interactions, which are crucial for feature crosses and global information modeling. It operates as follows:
- Head Splitting: Each token $\mathbf{x}_t$ is evenly divided into $H$ smaller parts, or "heads"; the $h$-th head of token $t$ is denoted $\mathbf{x}_t^{(h)}$. These heads can be viewed as projections of the token into lower-dimensional feature subspaces, allowing the model to consider different perspectives of the feature:

  $ \left[\mathbf{x}_t^{(1)} \parallel \mathbf{x}_t^{(2)} \parallel \ldots \parallel \mathbf{x}_t^{(H)}\right] = \mathrm{SplitHead}\left(\mathbf{x}_t\right) $

  where $\mathbf{x}_t$ is the $t$-th input token, $\parallel$ denotes concatenation, and $\mathrm{SplitHead}$ divides a token into $H$ equal-sized parts.
- Cross-Token Fusion: For each head index $h$, the corresponding heads from all tokens are concatenated to form a new vector $\mathbf{s}^h$, effectively shuffling the dimensions so that the $h$-th heads of all tokens are grouped together:

  $ \mathbf{s}^h = \mathrm{Concat}\left(\mathbf{x}_1^{(h)}, \mathbf{x}_2^{(h)}, \ldots, \mathbf{x}_T^{(h)}\right) $

  Stacking the shuffled vectors $\mathbf{s}^h$ yields a matrix of the same shape as the input. In this work, the number of heads $H$ is set equal to the number of tokens $T$ so that the token count is preserved for the subsequent residual connection.
- Output and Normalization: The shuffled tokens undergo a residual connection and Layer Normalization to produce the refined token representations:

  $ \mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\right) + \left(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\right)\right) $

The paper argues that while self-attention is effective in LLMs, where all tokens share a unified embedding space, it is suboptimal for recommendation systems because of their inherently heterogeneous feature spaces. Computing inner-product similarity between disparate semantic spaces (e.g., user ID and item ID) is difficult and often less effective than the parameter-free multi-head Token Mixing approach, which also consumes less computation, memory I/O, and GPU memory. The tiny example below shows that the whole operation is just a reshape-transpose-reshape.
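As a sanity check, the snippet below demonstrates that multi-head token mixing is a pure, parameter-free permutation of dimensions (sizes are made up; head count equals token count as in the paper):

```python
# Multi-head token mixing as a permutation: split each token into H heads,
# then concatenate head h of every token to form the h-th output token.
import torch

T, D = 4, 8          # 4 tokens of dimension 8
H = T                # heads = tokens, as in the paper
x = torch.arange(T * D, dtype=torch.float32).reshape(T, D)

mixed = x.reshape(T, H, D // H).transpose(0, 1).reshape(T, D)

# Output token 0 is the concatenation of head 0 of tokens 0..3:
assert torch.equal(mixed[0], torch.cat([x[t, : D // H] for t in range(T)]))
print(mixed.shape)   # torch.Size([4, 8])
```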
4.2.3.2. Per-token FFN (PFFN)
Traditional DLRMs often mix features from many disparate semantic spaces within a single interaction module, which can lead to high-frequency fields dominating and obscuring lower-frequency or long-tail signals. To counter this, RankMixer introduces a parameter-isolated feed-forward network architecture called per-token FFN (PFFN).
Unlike traditional FFNs, where parameters are shared across all tokens, the PFFN processes each token with its own dedicated set of parameters. For the $t$-th token $\mathbf{s}_t$, the per-token FFN can be expressed as:

$ \mathbf{v}_t = f_{\mathrm{pffn}}^{t,2}\left(\mathrm{Gelu}\left(f_{\mathrm{pffn}}^{t,1}\left(\mathbf{s}_t\right)\right)\right) $

where the linear transformations are defined as:

$ f_{\mathrm{pffn}}^{t,i}(\mathbf{x}) = \mathbf{x}\mathbf{W}_{\mathrm{pffn}}^{t,i} + \mathbf{b}_{\mathrm{pffn}}^{t,i} $

Symbol Explanation:

- $\mathbf{v}_t$: The output of the Per-token FFN for the $t$-th token.
- $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$: The two linear transformations (layers) specific to the $t$-th token.
- $\mathrm{Gelu}$: The GELU activation function, a common non-linearity in deep learning that provides a smoother approximation to ReLU.
- $\mathbf{s}_t$: The $t$-th token input to the PFFN.
- $\mathbf{W}_{\mathrm{pffn}}^{t,1} \in \mathbb{R}^{D \times kD}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,1}$: The weight matrix and bias vector of the first linear layer specific to token $t$.
- $\mathbf{W}_{\mathrm{pffn}}^{t,2} \in \mathbb{R}^{kD \times D}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,2}$: The weight matrix and bias vector of the second linear layer specific to token $t$.
- $k$: A hyperparameter that scales the hidden dimension of the per-token FFN (e.g., with $k = 4$ and $D = 768$, the hidden dimension is 3072).

The Per-token FFN module for all tokens collectively is summarized as:

$ \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T = \mathrm{PFFN}\left(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\right) $

where the PFFN applies $f_{\mathrm{pffn}}^{t,2}\left(\mathrm{Gelu}\left(f_{\mathrm{pffn}}^{t,1}(\cdot)\right)\right)$ independently to each token $\mathbf{s}_t$ to produce $\mathbf{v}_t$, with the specific layers $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$ differing for each $t$.

The key advantage of the Per-token FFN is that it introduces more parameters, and therefore greater modeling ability, than a parameter-shared FFN, without changing the computational complexity per token. It differs from MMoE in that each Per-token FFN receives a distinct token input, allowing independent processing of feature subspaces, which is beneficial for learning diverse patterns in recommendation data. A batched-einsum sketch follows.
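A minimal sketch of the PFFN as two batched matrix multiplies, with illustrative sizes: each token owns its own two-layer weights, so parameters grow with $T$ while the per-token FLOPs match a shared FFN of the same width.

```python
# Per-token FFN via batched einsum: W1/W2 carry a leading T axis, giving each
# token an isolated (D -> kD -> D) MLP.
import torch

B, T, D, k = 4, 8, 64, 2
s  = torch.randn(B, T, D)
W1 = torch.randn(T, D, k * D) * D ** -0.5        # per-token first-layer weights
b1 = torch.zeros(T, k * D)
W2 = torch.randn(T, k * D, D) * (k * D) ** -0.5  # per-token second-layer weights
b2 = torch.zeros(T, D)

h = torch.nn.functional.gelu(torch.einsum('btd,tdh->bth', s, W1) + b1)
v = torch.einsum('bth,thd->btd', h, W2) + b2
print(v.shape)  # torch.Size([4, 8, 64]); one isolated FFN per token
```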
4.2.4. Sparse MoE in RankMixer
To further enhance the scaling ROI (Return on Investment) of RankMixer, the dense PFFNs can be replaced with Sparse Mixture-of-Experts (MoE) blocks. This allows the model's capacity to grow substantially while keeping the computational cost (FLOPs) roughly constant during inference. However, applying vanilla Sparse-MoE directly to RankMixer presents two challenges:

- Uniform k-expert Routing: Traditional Top-k expert selection treats all feature tokens equally. This can waste computational budget on low-information tokens and starve high-information ones, hindering the model's ability to capture subtle differences between tokens.
- Expert Under-training: Per-token FFNs already multiply parameters by the number of tokens. Adding non-shared experts further increases the total expert count, which can lead to highly unbalanced routing and poorly trained experts (known as "dying experts").

To address these issues, RankMixer adopts two complementary strategies:
4.2.4.1. ReLU Routing
ReLU Routing is introduced to provide tokens with flexible expert counts while maintaining differentiability. It replaces the common Top-k + softmax routing with a ReLU gate combined with an adaptive $\ell_1$ penalty.

Given the $j$-th expert $e_{i,j}$ for token $i$ and a router $h$, the gating mechanism is defined as:

$ G_{i,j} = \mathrm{ReLU}\big(h_j(\mathbf{s}_i)\big) $

The output for token $i$ is then a weighted sum of expert outputs:

$ \mathbf{v}_i = \sum_{j=1}^{N_e} G_{i,j} \cdot e_{i,j}(\mathbf{s}_i) $

Symbol Explanation:

- $G_{i,j}$: The activation gate value of the $j$-th expert for the $i$-th token.
- $\mathrm{ReLU}$: The Rectified Linear Unit, which outputs the input directly if positive and zero otherwise.
- $h_j(\mathbf{s}_i)$: The $j$-th output of the router network for token $i$.
- $\mathbf{v}_i$: The output representation of the $i$-th token after the Sparse-MoE block.
- $N_e$: The total number of experts available for each token.
- $e_{i,j}$: The $j$-th expert network for the $i$-th token.
- $\mathbf{s}_i$: The $i$-th input token to the Sparse-MoE layer.

ReLU Routing allows more experts to be activated for high-information tokens, since $G_{i,j}$ can be non-zero for multiple experts, thereby improving parameter efficiency. Sparsity is controlled by a regularization loss with coefficient $\lambda$, which encourages the average active-expert ratio to remain near a predefined budget:

$ \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{reg}}, \qquad \mathcal{L}_{\mathrm{reg}} = \sum_{i=1}^{N_t} \sum_{j=1}^{N_e} G_{i,j} $

- $\mathcal{L}$: The total loss function.
- $\mathcal{L}_{\mathrm{task}}$: The primary task-specific loss (e.g., binary cross-entropy for CTR prediction).
- $\lambda$: A hyperparameter controlling the strength of the regularization.
- $\mathcal{L}_{\mathrm{reg}}$: The regularization term penalizing the sum of all gate activations, thereby promoting sparsity.
- $N_t$: The total number of tokens.

A short sketch of this gating follows.
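The gating can be sketched in a few lines (illustrative sizes; experts are shared across tokens here for brevity, whereas the paper keeps per-token experts $e_{i,j}$):

```python
# ReLU routing sketch: each token gets a non-negative gate per expert, so the
# number of active experts varies by token; an l1 penalty on the gates
# (scaled by lambda in the total loss) encourages sparsity.
import torch
import torch.nn as nn

T, D, Ne = 4, 16, 8
s = torch.randn(T, D)                       # one token per row
router = nn.Linear(D, Ne)                   # h: token -> per-expert logits
experts = nn.ModuleList(nn.Linear(D, D) for _ in range(Ne))

G = torch.relu(router(s))                   # (T, Ne) gates, many exactly zero
expert_out = torch.stack([e(s) for e in experts], dim=1)   # (T, Ne, D)
v = (G.unsqueeze(-1) * expert_out).sum(dim=1)              # weighted sum

l_reg = G.sum()                             # l1 sparsity penalty
print(v.shape, (G > 0).float().mean().item())  # output, active-expert ratio
```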
4.2.4.2. Dense-training / Sparse-inference (DTSI-MoE)
Inspired by [21], RankMixer employs Dense-training / Sparse-inference (DTSI-MoE). This strategy uses two distinct routers, one for training and one for inference:

- Both routers are updated during the training phase.
- The regularization term is applied only to the inference router. During training, the training-side router can therefore activate experts densely, without being constrained by the sparsity penalty, ensuring that experts receive sufficient training data and gradients.
- During inference, only the inference router is used; it has been optimized for sparsity by the applied regularization.

This DTSI-MoE approach ensures that experts do not suffer from under-training (a common Sparse-MoE failure mode in which some experts receive very few examples) while simultaneously reducing inference cost through sparse activation. A sketch of the dual-router logic follows.
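A sketch of the dual-router idea is below. The router names and the way the two gate vectors are combined during training are assumptions for illustration; the document only specifies that both routers are trained and that the sparsity penalty applies to the inference router alone.

```python
# DTSI-MoE gating sketch: dense gates keep all experts learning during
# training; serving evaluates only the l1-regularized sparse router.
import torch
import torch.nn as nn

class DTSIGate(nn.Module):
    def __init__(self, D: int, Ne: int):
        super().__init__()
        self.router_train = nn.Linear(D, Ne)   # dense gates, no sparsity penalty
        self.router_infer = nn.Linear(D, Ne)   # sparse gates, l1-regularized

    def forward(self, s):
        g_infer = torch.relu(self.router_infer(s))
        if self.training:
            g_train = torch.relu(self.router_train(s))
            l_reg = g_infer.sum()               # penalty on inference gates only
            # How the two routers combine during training is an assumption here.
            return 0.5 * (g_train + g_infer), l_reg
        return g_infer, None                    # serving uses sparse gates only

gate = DTSIGate(D=16, Ne=8)
g, l_reg = gate(torch.randn(4, 16))
print(g.shape)  # torch.Size([4, 8])
```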
4.2.5. Scaling Up Directions
RankMixer is inherently a highly parallel and extensible architecture, allowing its parameter count and computational cost to be expanded along four orthogonal axes:

- Token count ($T$): increasing the number of feature tokens.
- Model width ($D$): increasing the hidden dimension of the model.
- Layers ($L$): stacking more RankMixer blocks.
- Expert number ($N_e$): increasing the number of experts in the Sparse-MoE variant.

For the full-dense-activated version (i.e., PFFN without MoE), the approximate number of parameters and FLOPs for one sample can be computed as:

$ \#\mathrm{Param} \approx 2kLTD^2, \qquad \mathrm{FLOPs} \approx 4kLTD^2 $

Symbol Explanation:

- $\#\mathrm{Param}$: The total number of parameters in the dense part of the model.
- $\mathrm{FLOPs}$: The total number of floating-point operations for one sample.
- $k$: The scale ratio (usually an integer, e.g., 2 or 4) of the FFN hidden dimension (the PFFN inner dimension is $kD$).
- $L$: The number of RankMixer blocks (layers).
- $T$: The number of tokens.
- $D$: The hidden dimension of the model.
In the Sparse-MoE version, the effective parameters and compute per sample are reduced by the active-expert ratio:

$ s = \frac{\#\mathrm{Activated\ Param}}{\#\mathrm{Total\ Param}} $

where $s$ is the ratio of activated parameters to total parameters, i.e., the average fraction of experts activated per token.

The paper notes an observation shared with LLM scaling laws: model quality correlates primarily with the total number of parameters, and different scaling directions (depth $L$, width $D$, tokens $T$) yield almost identical performance for the same total parameter count. From a compute-efficiency standpoint, however, larger hidden dimensions $D$ generate larger matrix-multiply shapes, which typically achieve higher MFU than stacking more layers, so the final 100M and 1B configurations emphasize width over depth. The short calculation below illustrates the parameter formula.
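A quick check of the dense parameter-count estimate, using assumed configurations (the paper's exact $T$/$D$/$L$ choices are not reproduced here):

```python
# #Param ~ 2kLTD^2 with hypothetical settings landing near the 100M and 1B
# scales discussed above.
def rankmixer_params(k: int, L: int, T: int, D: int) -> float:
    return 2 * k * L * T * D ** 2

for name, (k, L, T, D) in {
    "~100M (assumed)": (2, 2, 16, 896),
    "~1B   (assumed)": (2, 4, 16, 2048),
}.items():
    print(f"{name}: {rankmixer_params(k, L, T, D) / 1e6:.0f}M params")
# ~100M (assumed): 103M params
# ~1B   (assumed): 1074M params
```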
5. Experimental Setup
5.1. Datasets
The offline experiments for RankMixer were conducted using a massive, real-world dataset from the Douyin recommendation system.
- Source: The data is derived from Douyin's online logs and user feedback labels, reflecting actual user interactions.
- Characteristics: It includes over 300 features, categorized as:
  - Numerical features: diverse statistics (e.g., click counts, duration).
  - ID features: user IDs and video IDs (billions of user IDs and hundreds of millions of video IDs).
  - Cross features: interactions between user and item features.
  - Sequential features: user interaction sequences, processed to capture temporal interests. All these features are converted into embeddings.
- Scale: The dataset is described as a "trillion-scale production dataset," with trillions of records per day. The experiments were conducted on data collected over a two-week period.
- Domain: Online short-video recommendation and advertisement.
While the paper doesn't provide a concrete example of a data sample, one can imagine a record looking like this:
{
"user_id": "user_12345",
"video_id": "video_abcde",
"author_id": "author_xyz",
"user_watch_history_last_5": ["video_fgh", "video_ijk", ...],
"video_category": "comedy",
"user_age_bucket": "25-34",
"watch_duration_ratio": 0.85,
"engagement_label": "finish" // Label for `finish/skip` prediction
}
This dataset is highly effective for validating the method's performance because it represents the actual, complex, and massive data distribution encountered in a leading industrial recommender system. Its scale and diversity are crucial for testing RankMixer's scaling abilities and hardware efficiency in a real-world scenario.
5.2. Evaluation Metrics
The paper uses a comprehensive set of evaluation metrics, covering both model performance and computational efficiency.
5.2.1. Performance Metrics
- AUC (Area Under the Curve):
  - Conceptual Definition: AUC is a standard metric for binary classification. It represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. A higher AUC indicates better separation of the positive and negative classes. In the context of recommendation systems, an increase of 0.0001 in AUC is often considered a statistically significant improvement.
  - Mathematical Formula:

    $ \text{AUC} = \frac{\sum_{i \in \text{positive}} \sum_{j \in \text{negative}} \mathbb{I}(P(i) > P(j))}{N_P \cdot N_N} $

  - Symbol Explanation:
    - $P(i)$: the prediction score (e.g., click-through probability) assigned by the model to a positive instance $i$.
    - $P(j)$: the prediction score assigned by the model to a negative instance $j$.
    - $\mathbb{I}(\cdot)$: the indicator function, which returns 1 if the condition inside is true and 0 otherwise.
    - $N_P$: the total number of positive instances; $N_N$: the total number of negative instances.
- UAUC (User-level AUC):
  - Conceptual Definition: UAUC is a variation of AUC that is particularly relevant for recommender systems. Instead of calculating a single AUC over the entire dataset, UAUC calculates the AUC for each individual user and then averages these per-user scores. This gives a better view of personalized performance, accounting for varying user behaviors and ensuring that improvements are not driven solely by a few highly active users or popular items.
  - Mathematical Formula:

    $ \text{UAUC} = \frac{1}{N_U} \sum_{u=1}^{N_U} \text{AUC}_u $

  - Symbol Explanation:
    - $N_U$: the total number of unique users in the dataset.
    - $\text{AUC}_u$: the AUC calculated specifically for user $u$ based on their interactions.
- Online Business Metrics (for A/B tests):
  - Active Days (Active Day↑): the average number of days a user is active within a given period; a proxy for Daily Active Users (DAU) growth and overall user retention.
  - Duration (Duration↑): the total cumulative time users spend in the application; a crucial measure of user engagement and satisfaction.
  - Like (Like↑): the total number of likes (or similar positive feedback actions) users give to content.
  - Finish (Finish↑): the total number of times users complete watching a video (or similar content-consumption actions).
  - Comment (Comment↑): the total number of comments users make.
  - ADVV (Advertiser Value↑): the value generated for advertisers, typically related to ad impressions, clicks, or conversions; specific to advertisement scenarios.
5.2.2. Efficiency Metrics
- Dense-Param (Parameters):
  - Conceptual Definition: The count of parameters within the dense layers of the model. This specifically excludes sparse embedding parameters, which can be very large in DLRMs but are often handled differently (e.g., by parameter servers) and do not directly contribute to the dense computation graph's size. It indicates the complexity and capacity of the main neural-network components.
- Training Flops/Batch:
  - Conceptual Definition: The total number of floating-point operations (FLOPs) required to process a single batch of data during training, quantifying the computational cost of training. The batch size used for this measurement is 512.
- MFU (Model FLOPs Utilization):
  - Conceptual Definition: MFU measures how effectively the model utilizes the theoretical peak FLOPs capacity of the underlying hardware (e.g., GPU), expressed as a percentage. A higher MFU indicates that the model's computations are well-aligned with the hardware's capabilities; low MFU often suggests memory-bound operations.
  - Mathematical Formula:

    $ \text{MFU} = \frac{\text{Actual FLOPs Consumption}}{\text{Theoretical Hardware FLOPs Capacity}} \times 100\% $

  - Symbol Explanation:
    - Actual FLOPs Consumption: the measured floating-point operations performed by the model.
    - Theoretical Hardware FLOPs Capacity: the maximum possible floating-point operations the specific hardware can perform per unit of time; this depends on the GPU architecture and its precision (e.g., fp32 vs. fp16).
5.3. Baselines
The authors compared RankMixer against several widely recognized state-of-the-art DLRMs and feature interaction models:
- DLRM-MLP (base): A vanilla Deep Neural Network (DNN) / Multi-Layer Perceptron (MLP) used for feature crossing; the basic DNN approach to feature interaction.
- DLRM-MLP-100M: A scaled-up version of the DLRM-MLP baseline with approximately 100 million parameters, demonstrating the limited gains from simply widening or deepening basic MLP structures without architectural innovation.
- DCNv2 [33]: An improved Deep & Cross Network that explicitly learns bounded-degree cross features through a dedicated cross network while also using a deep network for implicit interactions.
- RDCN [3]: A variant of DCN, likely an industrial adaptation or further improvement.
- MoE: A generic Mixture-of-Experts model, scaled up by simply using multiple parallel experts, without the specific routing or training strategies of RankMixer's Sparse-MoE.
- AutoInt [28]: A model that uses multi-head self-attention to automatically learn explicit high-order feature interactions.
- HiFormer [10]: A Transformer-based model for recommender systems focusing on heterogeneous feature interactions and low-rank approximation for matrix computations.
- DHEN [39]: Deep and Hierarchical Ensemble Network, which combines different types of feature-cross blocks (e.g., DCN, self-attention, FM, LR) and stacks multiple layers.
- Wukong [38]: A model that investigates scaling laws for feature interaction by stacking Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).

All experiments were performed on hundreds of GPUs using a hybrid distributed training framework: the sparse part (embeddings) was updated asynchronously (common in industrial settings with large embedding tables) while the dense part (the main DNN) was updated synchronously. The dense part used the RMSProp optimizer with a learning rate of 0.01, and the sparse part used Adagrad. These optimizers and their settings were kept consistent across all models for fair comparison.
6. Results & Analysis
6.1. Core Results Analysis
The experiments aimed to evaluate RankMixer's performance, efficiency, and scaling capabilities against various state-of-the-art baselines on a massive industrial dataset.
The following are the results from Table 1 of the original paper:
| Model | Finish AUC↑ | Finish UAUC↑ | Skip AUC↑ | Skip UAUC↑ | Params | FLOPs/Batch |
| --- | --- | --- | --- | --- | --- | --- |
| DLRM-MLP (base) | 0.8554 | 0.8270 | 0.8124 | 0.7294 | 8.7 M | 52 G |
| DLRM-MLP-100M | +0.15% | − | +0.15% | − | 95 M | 185 G |
| DCNv2 | +0.13% | +0.13% | +0.15% | +0.26% | 22 M | 170 G |
| RDCN | +0.09% | +0.12% | +0.10% | +0.22% | 22.6 M | 172 G |
| MoE | +0.09% | +0.12% | +0.08% | +0.21% | 47.6 M | 158 G |
| AutoInt | +0.10% | +0.14% | +0.12% | +0.23% | 19.2 M | 307 G |
| DHEN | +0.18% | +0.26% | +0.36% | +0.52% | 22 M | 158 G |
| HiFormer | +0.48% | − | − | − | 116 M | 326 G |
| Wukong | +0.29% | +0.29% | +0.49% | +0.65% | 122 M | 442 G |
| RankMixer-100M | +0.64% | +0.72% | +0.86% | +1.33% | 107 M | 233 G |
| RankMixer-1B | +0.95% | +1.22% | +1.25% | +1.82% | 1.1 B | 2.1 T |
Comparison with SOTA methods (100M parameters):
- Limited Gains from Simple Scaling: Simply scaling DLRM-MLP to 100M parameters (DLRM-MLP-100M) yields only a modest AUC gain (+0.15%) despite a significant increase in FLOPs/Batch (from 52G to 185G). This highlights the inefficiency of basic DNN scaling without architectural improvements.
- Inefficiencies of Classic Cross-Structure Designs: Models like DCNv2, RDCN, AutoInt, and DHEN show AUC improvements but often at high FLOPs relative to their parameter counts. For instance, AutoInt with 19.2M parameters already requires 307G FLOPs/Batch, indicating design shortcomings in computational efficiency. DHEN achieves better AUC gains than DCNv2 at a moderate FLOPs cost.
- RankMixer's Superiority: RankMixer-100M consistently achieves the best performance across all AUC and UAUC metrics (e.g., +0.64% Finish AUC, +0.86% Skip AUC) among models with similar parameter counts. Crucially, its FLOPs/Batch (233G) is relatively moderate, lower than AutoInt (307G), HiFormer (326G), and Wukong (442G), yet it outperforms all of them, demonstrating a balanced trade-off between model capacity and computational load.
- Scaling Up with RankMixer-1B: The 1.1-billion-parameter RankMixer-1B further boosts performance across the board (e.g., +0.95% Finish AUC, +1.25% Skip AUC), demonstrating strong scaling capabilities. Although its FLOPs/Batch is 2.1T, the accompanying efficiency improvements (discussed in Section 6.3) make this scale practical to serve.
Scaling Laws of Different Models:
The following figure (Figure 2 from the original paper) illustrates the scaling law curves:
Figure 2 (from the original paper): AUC gain (%) versus dense parameters (M, left) and FLOPs (G, right) for different models; RankMixer markedly outperforms the others.
- As shown in Figure 2, RankMixer exhibits the steepest scaling curve both in parameters and in FLOPs (the x-axes are logarithmic): as RankMixer grows, its AUC gain rises faster than that of the other models.
- Wukong shows a relatively steep parameter curve, suggesting it benefits from added parameters, but its computational cost (FLOPs) increases even faster, widening the gap to RankMixer and HiFormer on the AUC-vs-FLOPs curve; Wukong is therefore less compute-efficient.
- HiFormer performs slightly worse than RankMixer, which the authors attribute to its reliance on feature-level token segmentation and self-attention, both of which hurt efficiency.
- DHEN shows limited scalability in its cross structure; its AUC gain flattens out more quickly.
- The generic MoE strategy struggles with expert balance, leading to suboptimal scaling performance.

The authors also note that RankMixer adheres to LLM-style scaling laws: model quality correlates primarily with the total number of parameters, and depth ($L$), width ($D$), and token ($T$) scaling yield similar performance at equal parameter counts. For compute efficiency, increasing model width $D$ is preferred, since larger matrix-multiply shapes achieve higher MFU; this guided the 100M and 1B configurations toward width rather than depth.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation on Components of RankMixer-100M
The following are the results from Table 2 of the original paper:
| Setting | ΔAUC |
| --- | --- |
| w/o skip connections | -0.07% |
| w/o multi-head token mixing | -0.50% |
| w/o layer normalization | -0.05% |
| Per-token FFN → shared FFN | -0.31% |
The ablation study on RankMixer-100M components (Table 2) reveals the critical importance of each design choice:
- w/o skip connections (-0.07% ΔAUC): Removing skip (residual) connections leads to a performance drop. This is expected, as skip connections are vital for stabilizing training, mitigating vanishing/exploding gradients, and enabling deeper networks in Transformer-like architectures.
- w/o multi-head token mixing (-0.50% ΔAUC): The most significant drop. Removing multi-head token mixing eliminates global information exchange between tokens; each Per-token FFN would then model only partial features without interactions, severely limiting the model's ability to capture complex cross-feature interactions.
- w/o layer normalization (-0.05% ΔAUC): Layer Normalization stabilizes activations and gradients, leading to more stable and faster training; removing it degrades performance.
- Per-token FFN → shared FFN (-0.31% ΔAUC): Replacing Per-token FFNs with a single shared FFN across all tokens causes a substantial decrease. This validates the hypothesis that modeling distinct feature subspaces with dedicated parameters is crucial for heterogeneous recommendation data; a shared FFN forces all tokens into a common representation, potentially losing specialized information and allowing dominant features to overshadow others.
6.2.2. Token2FFN Routing-strategy comparison
The following are the results from Table 3 of the original paper:
| Routing strategy | ΔAUC | ΔParams | ΔFLOPs |
| --- | --- | --- | --- |
| All-Concat-MLP | -0.18% | 0.0% | 0.0% |
| All-Share | -0.25% | 0.0% | 0.0% |
| Self-Attention | -0.03% | +16% | +71.8% |
This ablation focuses on different strategies for routing information from feature tokens to FFNs (or processing feature interactions). The baseline for comparison here is RankMixer's Multi-Head Token Mixing within its block.
- All-Concat-MLP (-0.18% ΔAUC): Concatenates all tokens into a single vector, processes it through a large MLP, and splits it back. The performance decrease suggests that learning effective interactions from a single, very long concatenated vector is challenging for an MLP, and that it weakens the local information within specific feature subspaces that RankMixer's token-wise processing preserves.
- All-Share (-0.25% ΔAUC): No splitting or specific token mixing; the entire input vector is shared and fed to each Per-token FFN (similar to how MMoE experts share input). The significant decline highlights the importance of feature-subspace splitting and independent modeling, confirming that simply sharing inputs reduces learning diversity.
- Self-Attention (-0.03% ΔAUC, +16% ΔParams, +71.8% ΔFLOPs): Applying self-attention between tokens for routing performs slightly worse than Multi-Head Token Mixing (a -0.03% AUC drop is meaningful in RS) while costing substantially more parameters (+16%) and compute (+71.8% FLOPs). This validates the claim that self-attention is inefficient and suboptimal for the heterogeneous feature spaces of RS, where similarity must be learned across hundreds of distinct feature subspaces; Multi-Head Token Mixing is both more effective and much more efficient.
6.2.3. Sparse-MoE Scalability and Expert Balance
The following figure (Figure 3 from the original paper) plots offline AUC gains versus Sparsity of SMoE:
Figure 3 (from the original paper): offline AUC of different RankMixer variants versus the sparse-activation ratio of the SMoE; the densely trained, ReLU-routed sMoE retains almost all of the 1B dense model's accuracy.
Figure 3 demonstrates the scalability of RankMixer's Sparse-MoE variant.

- DTSI + ReLU-routed sMoE (Dense-Training / Sparse-Inference with ReLU routing) is crucial for maintaining accuracy under aggressive sparsity: it scales parameter capacity by more than 8x with almost no AUC loss while providing substantial inference-time savings (e.g., 50% throughput improvement). This highlights the effectiveness of the proposed DTSI and ReLU routing for MoE.
- Vanilla sMoE: Performance drops monotonically as fewer experts are activated, indicating expert imbalance and under-training.
- sMoE + load-balancing loss: Adding a load-balancing loss reduces degradation compared to vanilla sMoE, but it still falls short of DTSI + ReLU because the core issue lies more in the expert-training process itself than in router balancing alone.

The following figure (Figure 4 from the original paper) shows the activated expert ratio for different tokens in RankMixer:
Figure 4 (from the original paper): activated-expert ratio for different tokens across two RankMixer layers; each subplot plots the ratio against training iterations.
Figure 4 visualizes expert balance and diversity.

- It shows the activated-expert ratio for different tokens in RankMixer across two layers.
- The results indicate that combining DTSI (Dense-Training, Sparse-Inference) with ReLU routing effectively addresses expert imbalance. Dense training ensures that most experts receive sufficient gradient updates, preventing "dying experts" (experts that are rarely or never activated). ReLU routing makes the activation ratio dynamic across tokens: the varying activation proportions per token show that RankMixer adapts the number of activated experts to each token's information content, which suits the diverse and dynamic nature of recommendation data.
6.3. Online Serving Cost
The following are the results from Table 6 of the original paper:
| Metric | OnlineBase-16M | RankMixer-1B | Change |
| --- | --- | --- | --- |
| #Param | 15.8M | 1.1B | ↑ 70x |
| FLOPs | 107G | 2106G | ↑ 20.7x |
| FLOPs/Param (G/M) | 6.8 | 1.9 | ↓ 3.6x |
| MFU | 4.47% | 44.57% | ↑ 10x |
| Hardware FLOPs | fp32 | fp16 | ↑ 2x |
| Latency | 14.5ms | 14.3ms | - |
A key achievement of RankMixer is its ability to scale model parameters by two orders of magnitude (70x increase) without increasing inference latency. The paper explains this by breaking down the latency formula:
$
\mathrm{Latency} = \frac{\#\mathrm{Param} \times (\mathrm{FLOPs}/\mathrm{Param}\ \mathrm{ratio})}{\mathrm{MFU} \times \mathrm{Theoretical\ Hardware\ FLOPs}}
$
Symbol Explanation:

- $\mathrm{Latency}$: the time delay for a single inference request.
- $\#\mathrm{Param}$: the total number of parameters in the model.
- $\mathrm{FLOPs}/\mathrm{Param}\ \mathrm{ratio}$: the number of floating-point operations required per parameter.
- $\mathrm{MFU}$: Model FLOPs Utilization, the fraction of theoretical hardware FLOPs the model actually achieves.
- $\mathrm{Theoretical\ Hardware\ FLOPs}$: the maximum theoretical floating-point operations capacity of the hardware.
The table clearly illustrates how RankMixer-1B maintains latency despite a massive parameter increase:

- FLOPs/Param ratio (3.6x decrease): RankMixer achieves a 70-fold increase in parameters with only a 20.7-fold increase in FLOPs (from 107G to 2106G), so its FLOPs/Param ratio drops from 6.8 G/M to 1.9 G/M, an efficiency gain of 3.6x. This allows RankMixer to accommodate significantly more parameters within the same FLOPs budget.
- Model FLOPs Utilization (10x increase): The MFU for RankMixer-1B surged from 4.47% (baseline OnlineBase-16M) to 44.57%, a near 10x improvement. RankMixer's architecture (large GEMM shapes, fused parallel Per-token FFNs, reduced memory-bandwidth overhead) shifts the model from a memory-bound to a compute-bound regime on GPUs, utilizing the hardware far more effectively.
- Quantization (fp16) (2x theoretical hardware FLOPs): By leveraging half-precision (fp16) inference, RankMixer doubles the theoretical peak hardware FLOPs of GPUs. Modern GPUs (e.g., with Nvidia Tensor Cores) are highly optimized for fp16 operations, effectively doubling FLOPs capacity compared to fp32.

These three factors collectively counteract the massive parameter increase, allowing RankMixer-1B to achieve an inference latency of 14.3ms, slightly lower than the 14.5ms of the much smaller OnlineBase-16M model.
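As a back-of-the-envelope check (our arithmetic, using the table's factors), plugging the three factors into the latency formula above reproduces the observed latency parity:

$
\frac{\mathrm{Latency}_{\mathrm{1B}}}{\mathrm{Latency}_{\mathrm{16M}}} \approx \frac{70 \times (1/3.6)}{10 \times 2} \approx \frac{19.4}{20} \approx 0.97
$

which is consistent with the measured ratio of 14.3ms / 14.5ms ≈ 0.99.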
6.4. Online Performance
To demonstrate RankMixer's universality and practical impact, online A/B tests were conducted across two core application scenarios: feed recommendation and advertising. The baseline in these scenarios was a 16M parameter model combining DLRM and DCN. RankMixer-1B replaced the dense part of this baseline.
The following are the results from Table 4 of the original paper:
| Douyin app | Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
| --- | --- | --- | --- | --- | --- |
| Overall | +0.2908% | +1.0836% | +2.3852% | +1.9874% | +0.7886% |
| Low-active | +1.7412% | +3.6434% | +8.1641% | +4.5393% | +2.9368% |
| Middle-active | +0.7081% | +1.5269% | +2.5823% | +2.5062% | +1.2266% |
| High-active | +0.1445% | +0.6259% | +1.828% | +1.4939% | +0.4151% |

| Douyin lite | Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
| --- | --- | --- | --- | --- | --- |
| Overall | +0.1968% | +0.9869% | +1.1318% | +2.0744% | +1.1338% |
| Low-active | +0.5389% | +2.1831% | +4.4486% | +3.3958% | +0.9595% |
| Middle-active | +0.4235% | +1.9011% | +1.7491% | +2.6568% | +0.6782% |
| High-active | +0.0929% | +1.1356% | +1.8212% | +1.7683% | +2.3683% |
Feed Recommendation (Douyin App and Douyin Lite):

- Overall Improvements: RankMixer delivered statistically significant uplifts across all business-critical metrics. For the main Douyin app, it achieved +0.2908% Active Days and +1.0836% Duration, along with substantial increases in Like, Finish, and Comment rates. Similar positive trends were observed for Douyin Lite.
- Impact on User Activeness Levels: RankMixer showed particularly strong generalization by significantly improving metrics for low-active users. For low-active users on the Douyin app, Active Days increased by an impressive +1.7412%, Duration by +3.6434%, and Like by +8.1641%. This suggests the model better understands and engages users who are typically harder to activate, potentially due to its enhanced ability to model diverse feature subspaces and capture long-tail signals. While still positive, the improvements for middle-active and high-active users were comparatively smaller, indicating RankMixer has a broad and inclusive impact.

The following are the results from Table 5 of the original paper:
| Metric (Advertising) | ΔAUC↑ | ADVV↑ |
| --- | --- | --- |
| Lift | +0.73% | +3.90% |
Advertising Scenario:

- RankMixer also demonstrated strong performance in the advertising scenario, showing a +0.73% increase in AUC and a notable +3.90% increase in ADVV (Advertiser Value). This confirms its versatility as a unified backbone for different personalized ranking applications.

These online A/B test results, observed over 8 months for Feed Recommendation, provide compelling evidence of RankMixer's practical effectiveness, universality, and ability to generate significant business value without increasing serving costs.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces RankMixer, a novel, hardware-aware model design tailored for scaling up ranking models in industrial recommender systems. RankMixer addresses the critical challenges of latency, QPS, and low MFU on modern GPUs that plague traditional DLRMs. Its core innovations include a multi-head token mixing module that replaces self-attention for more efficient and effective cross-token feature interactions in heterogeneous data, and Per-token FFNs that enhance modeling capacity for distinct feature subspaces. The model is further scaled to one billion parameters using a Sparse-MoE variant, which incorporates ReLU routing and Dense-Training / Sparse-Inference (DTSI-MoE) to ensure expert balance and prevent under-training.
Offline experiments demonstrate RankMixer's superior performance and a steeper scaling law compared to SOTA baselines. Crucially, engineering optimizations and the hardware-aligned architecture allowed RankMixer to boost MFU from 4.5% to 45% and scale dense parameters by 100x (from 16M to 1B) while maintaining inference latency. The successful deployment of 1B Dense-Parameters RankMixer on Douyin's feed ranking for full traffic serving resulted in significant online A/B test improvements: a 0.3% increase in user active days and a 1.08% increase in total in-app usage duration. Its universality was also validated in advertising scenarios, showing robust lifts.
7.2. Limitations & Future Work
The paper does not explicitly delineate a "Limitations" or "Future Work" section. However, based on the discussion, some implicit limitations and potential future directions can be inferred:
- Complexity of Semantic-based Tokenization: The semantic-based tokenization relies on domain knowledge to group features. While effective, the process may be manual or heuristic-driven, which could limit its generalizability to new domains or demand significant effort to define optimal semantic groups. Future work could explore automating or learning this tokenization process.
- Hyperparameter Tuning for MoE: While DTSI-MoE addresses expert balance, Sparse-MoE generally introduces more hyperparameters (e.g., the number of experts and regularization weights) that require careful tuning, and the optimal balance between sparsity and performance can be dataset-dependent.
- Further Scaling of Sparse-MoE: The paper mentions scaling to 10B parameters as a future possibility. While the Sparse-MoE approach is effective up to 1B, scaling beyond this might introduce new challenges in expert communication, memory management, or even subtler expert under-training issues not fully apparent at the 1B scale.
- Full Hardware Abstraction: While hardware-aware, the model still requires specific optimizations such as fp16 quantization and GEMM fusions. Further research could aim for more hardware-agnostic optimizations or explore novel hardware architectures designed for such models.
- Interpretability: As DLRMs become more complex and black-box with billions of parameters, interpretability becomes harder. Understanding why RankMixer makes certain recommendations, especially with Sparse-MoE and dynamic routing, could be an area for future investigation.
7.3. Personal Insights & Critique
RankMixer represents a significant stride in bridging the gap between academic DLRM research and industrial-scale deployment, particularly by drawing lessons from the LLM scaling paradigm.
Key Strengths and Inspirations:
- Hardware-Awareness as a First-Class Citizen: The explicit focus on hardware-aware design from the outset, rather than as an afterthought optimization, is highly commendable. This mindset is crucial for industrial applications where latency and cost are non-negotiable; the dramatic 10x increase in MFU is a testament to this principle.
- Intelligent Adaptation of Transformer Concepts: Instead of blindly porting Transformers to RS, RankMixer intelligently adapts the parallel architecture while replacing suboptimal components like self-attention. The multi-head token mixing is a clever, parameter-free alternative that respects the heterogeneity of RS features, offering a compelling solution to a long-standing challenge.
- Effective MoE Implementation: DTSI-MoE with ReLU routing is a practical and robust solution to common Sparse-MoE problems. This makes MoE a more viable option for DLRMs, which often struggle with data sparsity and feature diversity leading to expert under-training.
- Real-world Impact: The successful full-traffic deployment on Douyin and the tangible improvements in key business metrics (active days, duration) provide strong validation. The observation that low-active users benefit most is particularly impactful, indicating the model's ability to drive growth from potentially underserved segments.
Potential Areas for Further Inquiry/Critique:

- Transparency of the Proj Function: The Proj function in tokenization is mentioned but not detailed. Is it a simple linear projection, or something more complex? Its design could significantly influence how raw embeddings are transformed into tokens and how much information is preserved or compressed.
- Generalizability of Semantic Grouping: While domain knowledge is used for semantic grouping, the extent to which this process is manual, or how robust it is to evolving feature sets, remains an open question. Could an unsupervised or learned method enhance this step?
- Detailed Breakdown of Hardware Optimization: While MFU, fp16, and the FLOPs/Param ratio are discussed, a deeper dive into the specific GPU architectures used and the detailed kernel fusions or other low-level optimizations could offer more insight for others to replicate or build upon.
- Long-Term Maintenance of 1B-Parameter Models: Managing and updating such large models in production presents its own set of engineering challenges beyond initial deployment (e.g., faster iteration, continuous learning, debugging). While the paper focuses on initial scaling and deployment, future research could explore these operational aspects.
Overall, RankMixer provides a powerful framework for developing highly scalable and efficient ranking models in industrial settings. Its innovations in feature interaction and MoE strategies, combined with a strong hardware-aware philosophy, offer valuable lessons for the broader recommender-systems community. The paper clearly demonstrates that scaling laws can indeed be harnessed for DLRMs, provided the architectural design is meticulously aligned with both data characteristics and hardware capabilities.