
RankMixer: Scaling Up Ranking Models in Industrial Recommenders

Published: 07/21/2025

TL;DR Summary

RankMixer is a hardware-aware ranking model for industrial recommenders that overcomes cost and latency challenges. It employs a multi-head token mixing module for enhanced efficiency and scalability, and achieves significant user engagement gains with a billion-parameter Sparse-MoE variant.

Abstract

Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. First, training and serving cost on industrial Recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with multi-head token mixing module for higher efficiency. Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI. A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training. Experiments show RankMixer's superior scaling abilities on a trillion-scale production dataset. By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost the model MFU from 4.5% to 45%, and scale our ranking model parameters by 100x while maintaining roughly the same inference latency. We verify RankMixer's universality with online A/B tests across two core application scenarios (Recommendation and Advertisement). Finally, we launch 1B Dense-Parameters RankMixer for full traffic serving without increasing the serving cost, which improves user active days by 0.3% and total in-app usage duration by 1.08%.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is "RankMixer: Scaling Up Ranking Models in Industrial Recommenders," focusing on developing a new model architecture to overcome efficiency and scalability challenges in large-scale recommendation systems.

1.2. Authors

The paper lists multiple authors from ByteDance: Jie Zhu, Zhifang Fan, Xiaoxie Zhu, Yuchen Jiang, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. Their affiliations indicate a strong background in industry-scale machine learning and recommender systems, particularly from a prominent technology company known for its large-scale applications like Douyin (TikTok).

1.3. Journal/Conference

This paper is published as a preprint on arXiv (arXiv:2507.15551) and has not yet been assigned to a specific journal or conference. Related industrial work cited in the paper (e.g., by Fedor Borisyuk, Mingzhou Zhou, and colleagues, who are not authors of this paper) appeared at KDD 2024, one of the premier conferences for data mining, data science, and big data; the content and author affiliations of RankMixer suggest it is intended for a venue of that tier. Its presence on arXiv indicates active research and dissemination within the academic and industrial communities.

1.4. Publication Year

The paper was posted to arXiv at 2025-07-21T12:28:55 UTC, i.e., it was published in 2025.

1.5. Abstract

The paper addresses two key practical obstacles in scaling up recommendation systems, inspired by the success of Large Language Models (LLMs):

  1. Cost and Latency: Industrial recommenders face strict latency bounds and high queries per second (QPS) demands, making efficient training and serving crucial.

  2. Hardware Inefficiency: Many traditional feature-crossing modules, designed for CPU-era systems, fail to exploit modern GPUs effectively, leading to low Model Flops Utilization (MFU) and poor scalability.

    To tackle these issues, the authors introduce RankMixer, a hardware-aware model designed for a unified and scalable feature-interaction architecture. RankMixer maintains the high parallelism of transformer models but replaces the quadratically complex self-attention with a more efficient multi-head token mixing module. It also uses Per-token FFNs to model distinct feature subspaces and cross-feature interactions, ensuring better alignment with recommendation data patterns.

For further scalability, RankMixer is extended with a Sparse-MoE (Mixture-of-Experts) variant, reaching one billion parameters. A dynamic routing strategy is adapted to address common MoE issues like expert inadequacy and imbalance.

Experiments on a trillion-scale production dataset demonstrate RankMixer's superior scaling abilities. By replacing diverse handcrafted modules, RankMixer boosts MFU from 4.5% to 45% and scales model parameters by 100x while maintaining inference latency. Online A/B tests across recommendation and advertisement scenarios verify its universality. A 1B Dense-Parameters RankMixer has been launched for full traffic serving, improving user active days by 0.3% and total in-app usage duration by 1.08% without increasing serving cost.

The original source link is https://arxiv.org/abs/2507.15551. The PDF link is https://arxiv.org/pdf/2507.15551v3.pdf. This paper is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The success of Large Language Models (LLMs) has demonstrated the immense potential of scaling up model parameters and data volume to achieve superior performance across various tasks. This success has naturally spurred interest in applying similar scaling laws to Deep Learning Recommendation Models (DLRMs). However, industrial Recommender Systems (RS) operate under unique and stringent constraints that make direct application of LLM scaling challenging.

The core problem the paper aims to solve is how to effectively scale ranking models in industrial recommenders to billions of parameters, akin to LLMs, without compromising the strict latency bounds and high Queries Per Second (QPS) demands of online serving environments. This problem is critical because existing DLRMs face two major practical obstacles:

  1. Computational Efficiency: Training and serving large models in industrial RS must respect strict latency budgets (e.g., milliseconds) and handle massive QPS (thousands to millions per second), making computational efficiency paramount. Scaling up naive DLRMs often leads to prohibitive latency.

  2. Hardware Utilization: Many feature-crossing modules in traditional ranking models were designed in the CPU era and perform poorly on modern GPUs. These modules often result in low Model Flops Utilization (MFU), meaning GPUs are underutilized, and thus models are not scalable. The computational cost of these older models was often proportional to the number of parameters, making the Return on Investment (ROI) from aggressive scaling difficult to realize.

    The paper's entry point is to design a new, hardware-aware model architecture that can unify diverse feature interactions and scale efficiently on modern GPU hardware, moving beyond the limitations of CPU-era designs and inefficient self-attention mechanisms for heterogeneous recommendation data.

2.2. Main Contributions / Findings

The primary contributions and key findings of the RankMixer paper are:

  • Novel Hardware-Aware Architecture (RankMixer): The paper introduces RankMixer, a new model architecture specifically designed to be hardware-aware and scalable for industrial DLRMs. It unifies feature interactions into a highly parallelizable structure.
    • Multi-head Token Mixing: This innovative component replaces the quadratic self-attention mechanism with a more efficient, parameter-free operator for cross-token feature interactions. It is shown to outperform self-attention for heterogeneous recommendation data, avoiding the combinatorial explosion and memory-bound issues of attention weights.
    • Per-token Feed-Forward Networks (PFFNs): RankMixer employs PFFNs to model distinct feature subspaces independently. This approach increases model capacity significantly by dedicating parameters to each token, preventing dominant features from overshadowing others and aligning better with the diverse nature of recommendation data.
  • Scalable Sparse Mixture-of-Experts (MoE) Variant: To further enhance ROI at massive scales, RankMixer is extended with a Sparse-MoE variant. This variant uses a dynamic routing strategy with ReLU Routing and Dense-Training / Sparse-Inference (DTSI-MoE) to address common MoE problems like expert under-training and imbalance, enabling capacity growth with minimal computational cost increase.
  • Demonstrated Scaling Abilities and Efficiency:
    • RankMixer achieved a 10x boost in Model Flops Utilization (MFU), from 4.5% to 45%, by replacing diverse handcrafted low-MFU modules with its unified architecture. This shifts the model from a memory-bound to a compute-bound regime.
    • The model successfully scaled parameters by 100x (from 16M to 1B dense parameters) while maintaining roughly the same inference latency. This was achieved by decoupling parameter growth from FLOPs and optimizing FLOPs growth from actual cost through high MFU and Quantization (e.g., fp16).
  • Universality and Online Performance Validation:
    • RankMixer's universality was verified with online A/B tests across two core application scenarios: Recommendation and Advertisement.

    • The 1B Dense-Parameters RankMixer was deployed for full traffic serving on Douyin, leading to significant business metric improvements: 0.3% increase in user active days and 1.08% increase in total in-app usage duration. It also showed larger improvements for low-active users, demonstrating strong generalization capability.

      These findings collectively demonstrate that RankMixer provides a practical and effective solution for scaling DLRMs to LLM-like parameter counts in industrial settings, overcoming critical efficiency and hardware utilization challenges.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the RankMixer paper, a reader should be familiar with several fundamental concepts in machine learning, particularly within the domain of recommender systems and deep learning architectures.

  • Recommender Systems (RS): Systems that predict user preferences for items and suggest items to users. They are crucial for platforms like e-commerce, social media, and content streaming. The paper focuses on ranking models, which are a component of RS responsible for ordering a retrieved set of candidate items based on predicted user engagement or preference.
  • Deep Learning Recommendation Models (DLRMs): A class of recommender systems that use deep neural networks to learn complex patterns and interactions from user and item features. They typically consist of an embedding layer for sparse categorical features and dense interaction layers.
  • Feature Interaction: The process of combining multiple features to create new, more expressive features. For example, combining a user's age with an item's category might reveal a preference for certain categories among specific age groups.
    • Low-order interactions: Simple combinations, like pairwise interactions.
    • High-order interactions: More complex combinations involving three or more features.
    • Explicit feature interaction: Manually designed or explicitly modeled by specific architectural components (e.g., DCN, DeepFM).
    • Implicit feature interaction: Learned automatically by deep neural networks without explicit design (e.g., MLP).
  • Scaling Laws: Empirical observations in machine learning, especially prominent in Natural Language Processing (NLP) and Computer Vision (CV), showing that model performance (e.g., accuracy, loss) often improves predictably as model size (number of parameters), data volume, and computational capacity (FLOPs) increase. This paper aims to establish and exploit similar scaling laws for DLRMs.
  • Model Flops Utilization (MFU): A metric that measures how efficiently a model uses the theoretical peak floating-point operations (FLOPs) capacity of the underlying hardware (e.g., GPU). It is calculated as: $ \text{MFU} = \frac{\text{Actual FLOPs Consumption}}{\text{Theoretical Hardware FLOPs Capacity}} \times 100\% $ where Actual FLOPs Consumption refers to the floating-point operations performed by the model, and Theoretical Hardware FLOPs Capacity is the maximum number of floating-point operations the hardware can perform per second. A low MFU indicates that the hardware is underutilized, often due to memory-bound operations rather than compute-bound ones. The paper aims to increase MFU to make DLRMs more compute-bound on GPUs.
  • Transformers: A neural network architecture introduced in 2017, primarily for NLP tasks, which revolutionized sequence modeling. Key components include:
    • Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence (tokens) when processing each token. It computes attention weights from query ($Q$), key ($K$), and value ($V$) matrices derived from the input. The standard formula for Scaled Dot-Product Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$, $K$, and $V$ are the query, key, and value matrices with shapes $(N, d_k)$, $(M, d_k)$, and $(M, d_v)$, respectively; $N$ is the number of queries, $M$ the number of keys/values, $d_k$ the dimension of queries/keys, and $d_v$ the dimension of values. $QK^T$ computes the similarity scores between queries and keys, $\sqrt{d_k}$ is a scaling factor that keeps large dot products from pushing the softmax into regions with tiny gradients, and the softmax normalizes the scores into attention weights. Self-attention has quadratic complexity in the number of tokens $N$ ($O(N^2 \cdot d_k)$ for the dot products and $O(N^2 \cdot d_v)$ for the weighted sum), which can be a bottleneck for very long sequences or many tokens.
    • Feed-Forward Networks (FFNs): Position-wise FFNs are applied independently to each position (token) in the sequence. A standard FFN consists of two linear transformations with a ReLU activation in between: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. In standard Transformers, the parameters ($W_1$, $b_1$, $W_2$, $b_2$) are shared across all token positions. A minimal numerical sketch of both of these operations appears after this list.
  • Mixture-of-Experts (MoE): An architecture that increases model capacity without proportionally increasing computational cost during inference. It consists of multiple "expert" neural networks and a "gate" (or "router") network that learns to activate a subset of these experts for each input.
    • Sparse-MoE: A variant where only a small number ($k$) of experts are selected and activated for each input, leading to sparse activation.
    • Routing Strategy: The mechanism by which the gate network determines which experts to activate for a given input.
  • Latency: The time delay between a request and a response in a system. In online recommenders, low latency is critical for a smooth user experience.
  • QPS (Queries Per Second): The number of requests a system can handle per second. High QPS demands require efficient model serving infrastructure.
  • AUC (Area Under the Curve): A common evaluation metric for binary classification models. It represents the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.
  • UAUC (User-level AUC): A variation of AUC where AUC is calculated for each user individually and then averaged across all users. This is particularly relevant in recommender systems to evaluate personalized performance and avoid bias towards popular items or active users.
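
To ground the two Transformer components described above, the following is a minimal NumPy sketch of scaled dot-product attention and a position-wise FFN. The shapes and variable names simply mirror the symbols in the bullets; this is an illustrative sketch, not the paper's implementation.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, d_k), K: (M, d_k), V: (M, d_v) -> output: (N, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (N, M) similarity scores
    weights = softmax(scores, axis=-1)  # attention weights per query
    return weights @ V

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN with parameters shared across token positions.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2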

3.2. Previous Works

The paper contextualizes RankMixer by discussing the evolution of DLRMs and attempts to scale them.

  • Early DLRMs for Feature Interaction:
    • Wide&Deep [5]: One of the earliest efforts to combine logistic regression (for low-order, memorized interactions) and DNN (for high-order, generalized interactions).
    • DeepFM [11]: Integrated Factorization Machines (FM) with DNNs to capture both low-order (pairwise) and high-order feature interactions.
    • DeepCross [26]: Extended residual networks to learn automatic feature interactions implicitly.
    • PNN [22], DCN [32], DCNv2 [33], xDeepFM [18], FGCNN [20], FiGNN [17]: These models proposed various explicit cross-methods to capture high-order interactions, often designing specific operators for feature crossings. DCN and DCNv2 specifically used a cross-network to explicitly learn bounded-degree cross features.
  • Attention-based Models for Feature Interaction:
    • AutoInt [28] and HiFormer [10]: Employed attention mechanisms with residual connections to learn complex interactions, inspired by Transformers. HiFormer particularly focuses on heterogeneous feature interactions.
  • Ensemble and Scaled DLRMs:
    • DHEN [39]: Proposed combining multiple interaction operators (e.g., DCN, self-attention, FM, LR) and stacking them.
    • Wukong [38]: Investigated scaling laws for feature interaction models by stacking Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).
    • Other scaling efforts [2, 6, 9, 12, 27, 34, 36, 40]: Explored scaling strategies for pre-training user activity sequences, general-purpose user representations, and online retrieval, showing varying degrees of success.

3.3. Technological Evolution

The evolution of DLRMs has progressed from simple MLPs to models explicitly designed to capture feature interactions. Initially, models like Wide&Deep and DeepFM combined linear models with DNNs or Factorization Machines. Subsequent work focused on designing specialized cross-network architectures (e.g., DCN, xDeepFM) to explicitly model complex interactions. More recently, inspired by the success of Transformers in NLP, researchers adapted attention mechanisms (e.g., AutoInt, HiFormer) to DLRMs to allow for dynamic weighting of feature interactions.

However, many of these CPU-era designs and even some attention-based approaches faced challenges when scaling to industrial levels: they often increased latency and memory consumption disproportionately to model size, resulting in low MFU on GPUs. The advent of scaling laws in LLMs highlighted the potential for massive performance gains with larger models, pushing DLRM research towards architectures that are inherently more scalable and hardware-efficient.

RankMixer fits into this timeline by addressing the limitations of prior DLRMs that struggle with GPU utilization and heterogeneous feature spaces. It aims to bring Transformer-like parallelism and scalability to DLRMs while mitigating the specific inefficiencies of self-attention (like quadratic complexity and difficulty with heterogeneous features) and adapting MoE for better performance and training stability in the RS context.

3.4. Differentiation Analysis

RankMixer differentiates itself from previous works through several core innovations tailored for industrial recommender systems:

  • Hardware-Aware Design: Unlike many CPU-era models (e.g., DCN, DeepFM) or even general DNNs that suffer from low MFU on GPUs, RankMixer is explicitly designed for high GPU parallelism and MFU. This is a fundamental shift from feature-centric to hardware-centric model design.
  • Efficient Cross-Token Interaction:
    • Vs. Self-Attention (AutoInt, HiFormer): RankMixer replaces self-attention with a multi-head token mixing module. While self-attention works well in NLP where tokens share a unified embedding space, it is suboptimal for the inherently heterogeneous feature spaces in RS (e.g., user ID vs. item ID). Self-attention incurs quadratic complexity and high memory/compute cost for calculating attention weights. RankMixer's token mixing is parameter-free, more computationally efficient, and avoids the difficulties of computing inner-product similarity across disparate feature types.
    • Vs. Handcrafted/Explicit Cross-feature Modules (DCN, DCNv2, xDeepFM, DHEN): These models often rely on diverse, handcrafted modules which can be effective but lead to a complex architecture, low MFU, and poor scalability. RankMixer provides a unified, Transformer-like backbone that intrinsically learns these interactions without explicit manual design.
  • Per-token FFNs (PFFNs) for Heterogeneity:
    • Vs. Shared FFNs (Vanilla Transformers): In standard Transformers, FFN parameters are shared across all tokens. RankMixer uses PFFNs where each token has dedicated FFN parameters. This allows for isolated modeling of distinct feature subspaces, preventing high-frequency fields from dominating and enhancing modeling capacity for diverse recommendation data, a problem often present in DLRMs that mix features indiscriminately.
    • Vs. MMoE (Multi-gate Mixture-of-Experts): While MMoE uses multiple experts, they typically share the same input. RankMixer's PFFNs ensure each FFN (or MoE expert block) sees a distinct token input, aligning with the idea of learning diversity in different feature subspaces.
  • Robust Sparse-MoE Implementation: RankMixer addresses critical challenges in Sparse-MoE (expert under-training, imbalanced routing) by introducing ReLU Routing and Dense-Training / Sparse-Inference (DTSI-MoE). This allows for significant parameter scaling without proportional inference cost increase, a problem that vanilla MoE models often struggle with (as noted in the paper's ablation studies).
  • Focus on Online Serving Metrics: The paper explicitly targets strict industrial requirements like latency, QPS, and MFU, demonstrating impressive real-world improvements in these areas, including a 100x parameter increase with no latency penalty and significant MFU boost, which is a major differentiator compared to academic research often focusing solely on AUC.

4. Methodology

4.1. Principles

The core idea behind RankMixer is to create a hardware-aware model design that provides a unified and scalable feature-interaction architecture for industrial recommender systems. This design is guided by two main principles:

  1. Exploiting GPU Parallelism and Maximizing MFU: Moving away from CPU-era designs that lead to low Model Flops Utilization (MFU) on modern GPUs, RankMixer leverages highly parallel operations to shift from a memory-bound to a compute-bound regime.
  2. Effective Modeling of Heterogeneous Feature Interactions: Recognizing the unique challenges of recommendation data (diverse feature types, distinct semantic subspaces, and the need for personalized cross-feature interactions), RankMixer aims to capture these complexities efficiently and scalably.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Overall Architecture

The RankMixer model processes $T$ input tokens through $L$ successive RankMixer blocks. After these blocks, an output pooling operator consolidates the representations for prediction tasks.

Each RankMixer block consists of two main components:

  1. A Multi-Head Token Mixing layer.

  2. A Per-Token Feed-Forward Network (PFFN) layer.

    The process begins by taking an input vector $e_{\mathrm{input}}$ and tokenizing it into $T$ feature tokens $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$. These tokens are then iteratively refined through the $L$ layers.

The computation within each RankMixer block (for layer $n$) can be expressed by the following formulas: $ \mathsf{S}_{n-1} = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{X}_{n-1}\right) + \mathbf{X}_{n-1}\right) $ $ \mathbf{X}_n = \mathrm{LN}\left(\mathrm{PFFN}\left(\mathsf{S}_{n-1}\right) + \mathsf{S}_{n-1}\right) $ Symbol Explanation:

  • $\mathsf{S}_{n-1}$: The intermediate representation after the TokenMixing module and Layer Normalization, computed from $\mathbf{X}_{n-1}$.

  • $\mathrm{LN}(\cdot)$: A Layer Normalization function, which normalizes the activations across the features for each sample independently within a layer, helping to stabilize training.

  • $\mathrm{TokenMixing}(\cdot)$: The Multi-Head Token Mixing module, which facilitates information exchange across tokens.

  • $\mathbf{X}_{n-1}$: The input to the RankMixer block at layer $n$, representing the collection of tokens from the previous layer.

  • $\mathrm{PFFN}(\cdot)$: The Per-Token Feed-Forward Network module, which processes each token with dedicated parameters.

  • $\mathbf{X}_n$: The output of the $n$-th RankMixer block, which becomes the input for the next block or the final pooling.

  • $\mathbf{X}_0 \in \mathbb{R}^{T \times D}$: The initial input, formed by stacking the $T$ feature tokens ($\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$).

  • $T$: The total number of feature tokens.

  • $D$: The hidden dimension of each token (and thus the model).

    The residual connections (the $+\,\mathbf{X}_{n-1}$ and $+\,\mathsf{S}_{n-1}$ terms) are crucial for enabling deeper networks and improving training stability by allowing gradients to flow directly through the network. Finally, the output representation $\mathbf{O}_{\mathrm{output}}$ is derived from mean pooling of the final layer's token representations $\mathbf{X}_L$, which is then used to compute predictions for various tasks.
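
The block computation above can be summarized in a short sketch. The following NumPy code is a minimal, illustrative rendering of one RankMixer block, assuming `token_mixing` and `pffn` behave as defined in the next subsections; the layer normalization is simplified (no learned gain or bias).

import numpy as np

def layer_norm(X, eps=1e-6):
    # Simplified layer normalization over each token vector (no learned gain/bias).
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def rankmixer_block(X, token_mixing, pffn):
    # X: (T, D) stacked feature tokens for one sample.
    S = layer_norm(token_mixing(X) + X)  # cross-token interaction + residual
    return layer_norm(pffn(S) + S)       # per-token transformation + residual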

4.2.2. Input Layer and Feature Tokenization

The first step in RankMixer is to prepare the input features and convert them into a structured format suitable for the Transformer-like architecture. Industrial recommender systems typically deal with a vast array of features, including:

  • User features: Such as user ID and demographic information.

  • Candidate features: Like video ID, author ID, etc.

  • Sequence Features: Processed by Sequence Modules to capture temporal interests, yielding embeddings $\mathbf{e}_s$.

  • Cross Features: Interactions between user and candidate features.

    All these diverse features are initially transformed into embeddings, which may have varying dimensions.

To enable efficient parallel computation and structured interaction, these embeddings are aligned into fixed-dimension vectors, referred to as feature-tokens. This process is called tokenization. A naive approach of one embedding per feature would lead to hundreds of tokens, fragmenting parameters and computation per token, while too few tokens would collapse the model into a simple DNN, losing the distinct representation of diverse feature spaces.

To overcome these challenges, RankMixer proposes a semantic-based tokenization approach:

  1. Feature Grouping: Domain knowledge is used to group features into several semantically coherent clusters. For example, all user-related features might form one cluster, and all item-related features another.
  2. Concatenation: These grouped features are sequentially concatenated into a single embedding vector $e_{\mathrm{input}}$: $ e_{\mathrm{input}} = \left[\, e_1; e_2; \ldots; e_N \right] $ Symbol Explanation:
    • $e_{\mathrm{input}}$: The concatenated embedding vector.
    • $e_i$: An embedding corresponding to a feature or a grouped set of features.
    • $N$: The number of feature groups.
  3. Partitioning and Projection: The concatenated vector is then partitioned into an appropriate number of tokens $T$, each with a fixed dimension $d$. A Proj function maps these split embeddings to the model's hidden dimension $D$: $ x_i = \mathrm{Proj}(e_{\mathrm{input}}\left[ d \cdot (i-1) : d \cdot i \right]), \quad i = 1, \ldots, T $ Symbol Explanation:
    • $x_i$: The $i$-th feature token, where $x_i \in \mathbb{R}^D$.
    • $\mathrm{Proj}(\cdot)$: A projection function (e.g., a linear layer) that transforms the split embedding into the model's hidden dimension $D$.
    • $e_{\mathrm{input}}\left[ d \cdot (i-1) : d \cdot i \right]$: The segment of the concatenated embedding vector $e_{\mathrm{input}}$ corresponding to the $i$-th token, with length $d$.
    • $d$: The fixed dimension chosen for each segment before projection.
    • $T$: The resulting number of tokens. Each $x_i$ captures a set of feature embeddings representing a similar semantic aspect.
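
A minimal sketch of this tokenization step is given below, assuming the semantic grouping and concatenation have already produced `e_input` and that each segment gets its own linear projection (the paper does not state whether the projection is shared or per-segment, so the per-segment matrices here are an assumption).

import numpy as np

def tokenize(e_input, T, D, proj_W):
    # e_input: concatenated embedding vector of length T * d (semantic groups
    # already concatenated in order); proj_W: (T, d, D) per-segment projections.
    d = e_input.shape[0] // T            # assumes the length divides evenly
    tokens = []
    for i in range(T):
        segment = e_input[d * i: d * (i + 1)]  # i-th segment of length d
        tokens.append(segment @ proj_W[i])     # project to hidden dimension D
    return np.stack(tokens)              # (T, D) feature tokens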

4.2.3. RankMixer Block

4.2.3.1. Multi-head Token Mixing

The Multi-head Token Mixing module is designed to facilitate cross-token feature interactions, which are crucial for feature crosses and global information modeling. It operates as follows:

  1. Head Splitting: Each token $\mathbf{x}_t \in \mathbb{R}^D$ is evenly divided into $H$ smaller parts, or "heads." The $h$-th head of token $\mathbf{x}_t$ is denoted $\mathbf{x}_t^{(h)}$. These heads can be viewed as projections of the token into lower-dimensional feature subspaces, allowing the model to consider different perspectives of the feature: $ \left[\mathbf{x}_t^{(1)} \parallel \mathbf{x}_t^{(2)} \parallel \ldots \parallel \mathbf{x}_t^{(H)}\right] = \mathrm{SplitHead}\left(\mathbf{x}_t\right) $ Symbol Explanation:
    • $\mathbf{x}_t$: The $t$-th input token.
    • $\mathbf{x}_t^{(h)}$: The $h$-th head of the $t$-th token.
    • $\parallel$: Denotes concatenation.
    • $\mathrm{SplitHead}(\cdot)$: An operation that divides a token into $H$ equal-sized parts.
  2. Cross-Token Fusion: For each head $h$, the corresponding heads from all $T$ tokens are concatenated to form a new vector $\mathbf{s}^h$. This effectively shuffles the dimensions, grouping all $h$-th heads across tokens: $ \mathbf{s}^h = \mathrm{Concat}\left(\mathbf{x}_1^h, \mathbf{x}_2^h, \ldots, \mathbf{x}_T^h\right) $ Symbol Explanation:
    • $\mathbf{s}^h$: The vector formed by concatenating the $h$-th head from all tokens.
    • $\mathbf{x}_t^h$: The $h$-th head of the $t$-th token. Stacking all such shuffled vectors $\mathbf{s}^1, \mathbf{s}^2, \ldots, \mathbf{s}^H$ creates a matrix $\mathbf{S} \in \mathbb{R}^{H \times \frac{TD}{H}}$. In this work, the number of heads $H$ is set equal to the number of tokens $T$ so that the number of tokens is preserved for the subsequent residual connection.
  3. Output and Normalization: The shuffled tokens undergo a residual connection and Layer Normalization to produce the refined token representations: $ \mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\right) + \left(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\right)\right) $ Symbol Explanation:
    • $\mathbf{s}_1, \ldots, \mathbf{s}_T$: The output tokens after Multi-head Token Mixing, residual connection, and Layer Normalization.

    • $\mathrm{LN}(\cdot)$: Layer Normalization.

    • $\mathrm{TokenMixing}(\cdot)$: The operation described above.

    • $\mathbf{x}_1, \ldots, \mathbf{x}_T$: The input tokens to the TokenMixing module.

      The paper argues that while self-attention is effective in LLMs, where all tokens share a unified embedding space, it is suboptimal for recommendation systems because of their inherently heterogeneous feature spaces. Computing inner-product similarity between disparate semantic spaces (e.g., user ID and item ID) is difficult and often less effective than the parameter-free multi-head Token Mixing approach, which also consumes less computation, memory I/O, and GPU memory.
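
The head-splitting and cross-token fusion steps amount to a reshape and transpose, as in the following NumPy sketch (illustrative only). With $H$ set equal to $T$, as in the paper, the output shape matches the input, so the residual connection in the block sketch above applies directly.

import numpy as np

def multi_head_token_mixing(X, H):
    # X: (T, D) tokens. Split each token into H heads, then regroup so that the
    # h-th output vector concatenates the h-th head of every input token.
    T, D = X.shape
    heads = X.reshape(T, H, D // H)      # SplitHead: (T, H, D/H)
    mixed = heads.transpose(1, 0, 2)     # group by head index: (H, T, D/H)
    return mixed.reshape(H, T * D // H)  # (H, T*D/H); equals (T, D) when H == T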

4.2.3.2. Per-token FFN (PFFN)

Traditional DLRMs often mix features from many disparate semantic spaces within a single interaction module, which can lead to high-frequency fields dominating and obscuring lower-frequency or long-tail signals. To counter this, RankMixer introduces a parameter-isolated feed-forward network architecture called per-token FFN (PFFN).

Unlike traditional FFNs where parameters are shared across all tokens, PFFN processes each token with its own dedicated set of parameters. For the $t$-th token $\mathbf{s}_t$, the per-token FFN can be expressed as: $ \mathbf{v}_t = f_{\mathrm{pffn}}^{t,2} \left(\mathrm{Gelu}\left(f_{\mathrm{pffn}}^{t,1}\left(\mathbf{s}_t\right)\right)\right) $ where the linear transformations $f_{\mathrm{pffn}}^{t,i}(\mathbf{x})$ are defined as: $ f_{\mathrm{pffn}}^{t,i}(\mathbf{x}) = \mathbf{x}\mathbf{W}_{\mathrm{pffn}}^{t,i} + \mathbf{b}_{\mathrm{pffn}}^{t,i} $ Symbol Explanation:

  • $\mathbf{v}_t$: The output of the Per-token FFN for the $t$-th token.

  • $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$: The two linear transformations (layers) specific to the $t$-th token.

  • $\mathrm{Gelu}(\cdot)$: The GELU activation function, a common non-linearity in deep learning that provides a smoother approximation to ReLU.

  • $\mathbf{s}_t \in \mathbb{R}^D$: The $t$-th token input to the PFFN.

  • $\mathbf{x}$: Input to the linear transformation.

  • $\mathbf{W}_{\mathrm{pffn}}^{t,1} \in \mathbb{R}^{D \times kD}$: The weight matrix for the first linear layer of the PFFN specific to token $t$.

  • $\mathbf{b}_{\mathrm{pffn}}^{t,1} \in \mathbb{R}^{kD}$: The bias vector for the first linear layer of the PFFN specific to token $t$.

  • $\mathbf{W}_{\mathrm{pffn}}^{t,2} \in \mathbb{R}^{kD \times D}$: The weight matrix for the second linear layer of the PFFN specific to token $t$.

  • $\mathbf{b}_{\mathrm{pffn}}^{t,2} \in \mathbb{R}^D$: The bias vector for the second linear layer of the PFFN specific to token $t$.

  • $k$: A hyperparameter that scales the hidden dimension of the per-token FFN (e.g., if $D=768$ and $k=4$, the hidden dimension is 3072).

    The Per-token FFN module for all tokens collectively is summarized as: $ \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T = \mathrm{PFFN}\left(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\right) $ where $ \mathrm{PFFN}\left(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\right) = f_{\mathrm{pffn}}^{t,2} \left(\mathrm{Gelu}\left(f_{\mathrm{pffn}}^{t,1}\left(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\right)\right)\right) $ Symbol Explanation:

  • This notation means that the PFFN operation is applied independently to each token $\mathbf{s}_t$ to produce $\mathbf{v}_t$, with the token-specific transformations $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$ changing for each $\mathbf{s}_t$.

    The key advantage of the Per-token FFN is that it introduces more parameters, and therefore greater modeling capacity, than a parameter-shared FFN without changing the computational complexity per token. It differs from MMoE in that each Per-token FFN receives a distinct token input, allowing independent processing of feature subspaces, which is beneficial for learning diverse patterns in recommendation data.
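
A minimal sketch of the Per-token FFN follows. The per-token weight tensors `W1`, `b1`, `W2`, `b2` are hypothetical names, and the loop over tokens is written for clarity rather than efficiency (in practice this would be a batched or grouped matrix multiplication to keep MFU high).

import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def pffn(S, W1, b1, W2, b2):
    # S: (T, D) tokens. Unlike a shared FFN, every token t has its own weights:
    # W1: (T, D, k*D), b1: (T, k*D), W2: (T, k*D, D), b2: (T, D).
    V = []
    for t in range(S.shape[0]):
        h = gelu(S[t] @ W1[t] + b1[t])   # first per-token linear layer + GELU
        V.append(h @ W2[t] + b2[t])      # second per-token linear layer
    return np.stack(V)                   # (T, D)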

4.2.4. Sparse MoE in RankMixer

To further enhance the scaling ROI (Return on Investment) of RankMixer, the dense PFFNs can be replaced with Sparse Mixture-of-Experts (MoE) blocks. This allows the model's capacity to grow substantially while keeping the computational cost (FLOPs) roughly constant during inference. However, applying Vanilla Sparse-MoE directly to RankMixer presents two challenges:

  1. Uniform $k$-expert Routing: Traditional Top-$k$ expert selection treats all feature tokens equally. This can waste computational budget on low-information tokens and starve high-information ones, hindering the model's ability to capture subtle differences between tokens.

  2. Expert Under-training: Per-token FFNs already multiply parameters by the number of tokens. Adding non-shared experts further increases the total expert count, which can lead to highly unbalanced routing and poorly trained experts (known as "dying experts").

    To address these issues, RankMixer adopts two complementary strategies:

4.2.4.1. ReLU Routing

ReLU Routing is introduced to give tokens flexible expert counts while maintaining differentiability. It replaces the common Top-$k$ + softmax routing with a ReLU gate combined with an adaptive $\ell_1$ penalty. Given the $j$-th expert $e_{i,j}(\cdot)$ for token $\mathbf{s}_i \in \mathbb{R}^{d_h}$ and a router $h(\cdot)$, the gating mechanism is defined as: $ G_{i,j} = \mathrm{ReLU}\big(h(\mathbf{s}_i)\big) $ The output $\mathbf{v}_i$ for token $\mathbf{s}_i$ is then a weighted sum of expert outputs: $ \mathbf{v}_i = \sum_{j=1}^{N_e} G_{i,j} \cdot e_{i,j}(\mathbf{s}_i) $ Symbol Explanation:

  • $G_{i,j}$: The activation gate value for the $j$-th expert for the $i$-th token.

  • $\mathrm{ReLU}(\cdot)$: The Rectified Linear Unit activation function, which outputs the input directly if positive and zero otherwise.

  • $h(\mathbf{s}_i)$: The output of the router network for token $\mathbf{s}_i$.

  • $\mathbf{v}_i$: The output representation for the $i$-th token after processing by the Sparse-MoE block.

  • $N_e$: The total number of experts available for each token.

  • $e_{i,j}(\cdot)$: The $j$-th expert network for the $i$-th token.

  • $\mathbf{s}_i$: The $i$-th input token to the Sparse-MoE layer.

    ReLU Routing allows more experts to be activated for high-information tokens by letting $G_{i,j}$ be non-zero for multiple experts, thereby improving parameter efficiency. Sparsity is controlled by a regularization loss $\mathcal{L}_{\mathrm{reg}}$ with a coefficient $\lambda$, which encourages the average active-expert ratio to remain near a predefined budget: $ \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{reg}} $ $ \mathcal{L}_{\mathrm{reg}} = \sum_{i=1}^{N_t} \sum_{j=1}^{N_e} G_{i,j} $ Symbol Explanation:

  • $\mathcal{L}$: The total loss function.

  • $\mathcal{L}_{\mathrm{task}}$: The primary task-specific loss (e.g., binary cross-entropy for CTR prediction).

  • $\lambda$: A hyperparameter that controls the strength of the regularization.

  • $\mathcal{L}_{\mathrm{reg}}$: The regularization term that penalizes the sum of all gate activations, thereby promoting sparsity.

  • $N_t$: The total number of tokens.
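
The gating and regularization above can be sketched as follows. `router_W` and `experts` are hypothetical containers for per-token routers and expert networks; only the ReLU gate, the weighted sum over active experts, and the gate-sum term feeding $\mathcal{L}_{\mathrm{reg}}$ are meant to follow the formulas.

import numpy as np

def relu_routed_moe(S, router_W, experts, lambda_reg=0.0):
    # S: (T, D) tokens; router_W: (T, D, N_e) per-token router weights;
    # experts[t][j]: the j-th expert for token t, mapping (D,) -> (D,).
    outputs, gate_sum = [], 0.0
    for t, s in enumerate(S):
        G = np.maximum(0.0, s @ router_W[t])  # ReLU gate, shape (N_e,)
        gate_sum += G.sum()                   # accumulates the L1-style penalty
        v = np.zeros_like(s)
        for j, g in enumerate(G):
            if g > 0.0:                       # only active experts contribute
                v = v + g * experts[t][j](s)
        outputs.append(v)
    reg_loss = lambda_reg * gate_sum          # added to the task loss
    return np.stack(outputs), reg_loss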

4.2.4.2. Dense-training / Sparse-inference (DTSI-MoE)

Inspired by [21], RankMixer employs Dense-training / Sparse-inference (DTSI-MoE). This strategy uses two distinct routers, $h_{\mathrm{train}}$ and $h_{\mathrm{infer}}$, during training.

  • Both $h_{\mathrm{train}}$ and $h_{\mathrm{infer}}$ are updated during the training phase.

  • The regularization term $\mathcal{L}_{\mathrm{reg}}$ is applied only to $h_{\mathrm{infer}}$. During training, the $h_{\mathrm{train}}$ router can therefore activate more experts (or activate them more densely) without being constrained by the sparsity penalty, ensuring that experts receive sufficient training data and gradients.

  • During inference, only $h_{\mathrm{infer}}$ is used, which has been optimized for sparsity by the applied regularization.

    This DTSI-MoE approach ensures that experts do not suffer from under-training (a common problem in Sparse-MoE, where some experts receive very few examples) while reducing inference cost through sparse activation.

4.2.5. Scaling Up Directions

RankMixer is inherently a highly parallel and extensible architecture, allowing its parameter count and computational cost to be expanded along four orthogonal axes:

  • Token count ($T$): Increasing the number of feature tokens.

  • Model width ($D$): Increasing the hidden dimension of the model.

  • Layers ($L$): Stacking more RankMixer blocks.

  • Expert Numbers ($E$): Increasing the number of experts in the Sparse-MoE variant.

    For the full-dense-activated version (i.e., PFFN without MoE), the approximate number of parameters and FLOPs for one sample can be computed as: $ \#\mathrm{Param} \approx 2kLTD^2 $ and $ \mathrm{FLOPs} \approx 4kLTD^2 $ Symbol Explanation:

  • $\#\mathrm{Param}$: The total number of parameters in the dense part of the model.

  • $\mathrm{FLOPs}$: The total number of floating-point operations for one sample.

  • $k$: The scale ratio (usually an integer, e.g., 2 or 4) adjusting the hidden dimension of the FFN (as in the PFFN, whose inner dimension is $kD$).

  • $L$: The number of RankMixer blocks (layers).

  • $T$: The number of tokens.

  • $D$: The hidden dimension of the model.

    In the Sparse-MoE version, the effective parameters and compute per sample are reduced by the active-expert ratio $s$: $ s = \frac{\#\mathrm{Activated\ Param}}{\#\mathrm{Total\ Param}} $ Symbol Explanation:

  • $s$: The ratio of activated parameters to total parameters, i.e., the average fraction of experts activated for each token.

    The paper notes an observation shared with LLM scaling laws: model quality correlates primarily with the total number of parameters, and different scaling directions (depth $L$, width $D$, tokens $T$) yield almost identical performance for the same total parameter count. From a compute-efficiency standpoint, however, larger hidden dimensions ($D$) generate larger matrix-multiply shapes, which typically achieve higher MFU than stacking more layers. The final configurations for the 100M and 1B parameter models were therefore chosen as ($D=768, T=16, L=2$) and ($D=1536, T=32, L=2$) respectively, emphasizing width over depth for MFU.
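
As a rough sanity check of the parameter formula, the following sketch evaluates $2kLTD^2$. The value of $k$ for the 100M configuration is not reported, so $k=3$ below is only an illustrative guess that happens to land near 100M.

def dense_param_count(k, L, T, D):
    # #Param ≈ 2 k L T D^2 (dense PFFN parameters only; embeddings excluded)
    return 2 * k * L * T * D ** 2

def dense_flops_per_sample(k, L, T, D):
    # FLOPs ≈ 4 k L T D^2 (roughly two FLOPs per parameter per sample)
    return 4 * k * L * T * D ** 2

# dense_param_count(3, 2, 16, 768) == 113_246_208, i.e. on the order of the
# reported ~100M dense parameters (k = 3 is only an illustrative guess).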

5. Experimental Setup

5.1. Datasets

The offline experiments for RankMixer were conducted using a massive, real-world dataset from the Douyin recommendation system.

  • Source: The data is derived from Douyin's online logs and user feedback labels, reflecting actual user interactions.

  • Characteristics: It includes over 300 features, categorized as:

    • Numerical features: Diverse statistics (e.g., click counts, duration).
    • ID features: User IDs, video IDs (billions of user IDs and hundreds of millions of video IDs).
    • Cross features: Interactions between user and item features.
    • Sequential features: User interaction sequences, processed to capture temporal interests. All these features are converted into embeddings.
  • Scale: The dataset is described as a "trillion-scale production dataset," with "trillions of records per day." The experiments were conducted on data collected over a two-week period.

  • Domain: Online short-video recommendation and advertisement.

    While the paper doesn't provide a concrete example of a data sample, one can imagine a record looking like this:

{
  "user_id": "user_12345",
  "video_id": "video_abcde",
  "author_id": "author_xyz",
  "user_watch_history_last_5": ["video_fgh", "video_ijk", ...],
  "video_category": "comedy",
  "user_age_bucket": "25-34",
  "watch_duration_ratio": 0.85,
  "engagement_label": "finish" // Label for `finish/skip` prediction
}

This dataset is highly effective for validating the method's performance because it represents the actual, complex, and massive data distribution encountered in a leading industrial recommender system. Its scale and diversity are crucial for testing RankMixer's scaling abilities and hardware efficiency in a real-world scenario.

5.2. Evaluation Metrics

The paper uses a comprehensive set of evaluation metrics, covering both model performance and computational efficiency.

5.2.1. Performance Metrics

  • AUC (Area Under the Curve):
    • Conceptual Definition: AUC is a standard metric used for binary classification problems. It represents the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example by the classifier. A higher AUC value indicates better model performance in distinguishing between positive and negative classes. In the context of recommendation systems, an increase of 0.0001 in AUC is often considered a statistically significant improvement.
    • Mathematical Formula: $ \text{AUC} = \frac{\sum_{i \in \text{positive}} \sum_{j \in \text{negative}} \mathbb{I}(P(i) > P(j))}{N_P \cdot N_N} $
    • Symbol Explanation:
      • $P(i)$: The prediction score (e.g., click-through probability) assigned by the model to a positive instance $i$.
      • $P(j)$: The prediction score assigned by the model to a negative instance $j$.
      • $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the condition inside is true, and 0 otherwise.
      • $N_P$: The total number of positive instances in the dataset.
      • $N_N$: The total number of negative instances in the dataset.
  • UAUC (User-level AUC):
    • Conceptual Definition: UAUC is a variation of AUC that is particularly relevant for recommender systems. Instead of calculating a single AUC over the entire dataset, UAUC calculates the AUC for each individual user and then averages these per-user AUC scores. This metric provides a better evaluation of personalized performance, as it accounts for varying user behaviors and ensures that improvements are not solely driven by a few highly active users or popular items.
    • Mathematical Formula: $ \text{UAUC} = \frac{1}{N_U} \sum_{u=1}^{N_U} \text{AUC}_u $
    • Symbol Explanation:
      • $N_U$: The total number of unique users in the dataset.
      • $\text{AUC}_u$: The AUC calculated specifically for user $u$ based on their interactions. A minimal computation sketch of AUC and UAUC appears after this metrics list.
  • Online Business Metrics (for A/B tests):
    • Active Days (Active Day↑): The average number of days a user is active within a given period. This serves as a proxy for Daily Active Users (DAU) growth and overall user retention.
    • Duration (Duration↑): The total cumulative time users spend in the application. This is a crucial measure of user engagement and satisfaction.
    • Like (Like↑): The total number of likes (or similar positive feedback actions) users give to content.
    • Finish (Finish↑): The total number of times users complete watching a video (or similar content consumption action).
    • Comment (Comment↑): The total number of comments users make.
    • ADVV (Advertiser Value↑): A metric representing the value generated for advertisers, typically related to ad impressions, clicks, or conversions. This is specific to advertisement scenarios.
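
The following is a minimal sketch of how AUC and UAUC could be computed from per-impression logs (illustrative only; ties are not counted as correct orderings, matching the strict inequality in the formula above, and users with only positive or only negative labels are skipped).

import numpy as np
from collections import defaultdict

def auc(labels, scores):
    # Probability that a random positive is scored above a random negative.
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None                       # AUC is undefined for this user
    return (pos[:, None] > neg[None, :]).mean()

def uauc(user_ids, labels, scores):
    # Average per-user AUC over users that have both positives and negatives.
    per_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        per_user[u][0].append(y)
        per_user[u][1].append(s)
    aucs = [a for ys, ss in per_user.values() if (a := auc(ys, ss)) is not None]
    return float(np.mean(aucs))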

5.2.2. Efficiency Metrics

  • Dense-Param (Parameters):
    • Conceptual Definition: The count of parameters within the dense layers of the model. This specifically excludes sparse embedding parameters, which can be very large in DLRMs but are often handled differently (e.g., by parameter servers) and don't directly contribute to the dense computation graph's size. It indicates the complexity and capacity of the main neural network components.
  • Training Flops/Batch:
    • Conceptual Definition: The total number of floating-point operations (FLOPs) required to process a single batch of data during training. This metric quantifies the computational cost of training the model. The batch size used for this measurement is 512.
  • MFU (Model FLOPs Utilization):
    • Conceptual Definition: MFU measures how effectively the model utilizes the theoretical peak FLOPs capacity of the underlying hardware (e.g., GPU). It is expressed as a percentage. A higher MFU indicates that the model's computations are well-aligned with the hardware's capabilities, leading to more efficient use of resources and faster processing. Low MFU often suggests memory-bound operations.
    • Mathematical Formula: $ \text{MFU} = \frac{\text{Actual FLOPs Consumption}}{\text{Theoretical Hardware FLOPs Capacity}} \times 100\% $
    • Symbol Explanation:
      • Actual FLOPs Consumption: The measured floating-point operations performed by the model.
      • Theoretical Hardware FLOPs Capacity: The maximum possible floating-point operations that the specific hardware (e.g., a GPU model) can perform per unit of time (e.g., per second). This value depends on the GPU architecture and its precision (e.g., fp32 vs. fp16).
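
As a back-of-the-envelope illustration of the MFU definition (the peak-throughput figure below is a generic example, not the hardware used in the paper):

def mfu(achieved_flops_per_s, peak_flops_per_s):
    # Fraction of the hardware's theoretical peak throughput the model sustains.
    return achieved_flops_per_s / peak_flops_per_s * 100.0

# Illustrative only: a GPU with a 300 TFLOP/s half-precision peak sustaining
# 135 TFLOP/s of model compute would sit at mfu(135e12, 300e12) == 45.0 (%).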

5.3. Baselines

The authors compared RankMixer against several widely recognized state-of-the-art DLRMs and feature interaction models:

  • DLRM-MLP (base): A vanilla Deep Neural Network (DNN) or Multi-Layer Perceptron (MLP) used for feature crossing. This represents a basic DNN approach for feature interactions.

  • DLRM-MLP-100M: A scaled-up version of the DLRM-MLP baseline with approximately 100 million parameters, demonstrating the limited gains from simply widening or deepening basic MLP structures without architectural innovations.

  • DCNv2 [33]: An improved version of Deep & Cross Network, which explicitly learns bounded-degree cross features through a dedicated cross network while also using a deep network for implicit interactions.

  • RDCN [3]: A variant of DCN, likely an industrial adaptation or further improvement.

  • MoE: A generic Mixture-of-Experts model, scaled up by simply using multiple parallel experts, without the specific routing or training strategies of RankMixer's Sparse-MoE.

  • AutoInt [28]: A model that uses multi-head self-attention to automatically learn explicit high-order feature interactions.

  • HiFormer [10]: A Transformer-based model for recommender systems focusing on heterogeneous feature interactions and low-rank approximation for matrix computations.

  • DHEN [39]: Deep and Hierarchical Ensemble Network, which combines different types of feature-cross blocks (e.g., DCN, self-attention, FM, LR) and stacks multiple layers.

  • Wukong [38]: A model that investigates scaling laws for feature interaction by stacking Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).

    All experiments were performed on hundreds of GPUs using a hybrid distributed training framework, where the sparse part (embeddings) was updated asynchronously (common in industrial settings to handle large embedding tables) and the dense part (main DNN) was updated synchronously. RMSProp optimizer was used for the dense part with a learning rate of 0.01, and Adagrad for the sparse part. These optimizers and their settings were kept consistent across all models for fair comparison.

6. Results & Analysis

6.1. Core Results Analysis

The experiments aimed to evaluate RankMixer's performance, efficiency, and scaling capabilities against various state-of-the-art baselines on a massive industrial dataset.

The following are the results from Table 1 of the original paper:

Model | Finish AUC↑ | Finish UAUC↑ | Skip AUC↑ | Skip UAUC↑ | Params | FLOPs/Batch
DLRM-MLP (base) 0.8554 0.8270 0.8124 0.7294 8.7 M 52 G
DLRM-MLP-100M +0.15% +0.15% 95 M 185 G
DCNv2 +0.13% +0.13% +0.15% +0.26% 22 M 170 G
RDCN +0.09% +0.12% +0.10% +0.22% 22.6 M 172 G
MoE +0.09% +0.12% +0.08% +0.21% 47.6 M 158 G
AutoInt +0.10% +0.14% +0.12% +0.23% 19.2 M 307 G
DHEN +0.18% +0.26% +0.36% +0.52% 22 M 158 G
HiFormer +0.48% 116 M 326 G
Wukong +0.29% +0.29% +0.49% +0.65% 122 M 442 G
RankMixer-100M +0.64% +0.72% +0.86% +1.33% 107 M 233 G
RankMixer-1B +0.95% +1.22% +1.25% +1.82% 1.1 B 2.1T

Comparison with SOTA methods (~100M parameters):

  • Limited Gains from Simple Scaling: Simply scaling DLRM-MLP to 100M parameters (DLRM-MLP-100M) yields only modest AUC gains (+0.15%), despite a significant increase in FLOPs/Batch (from 52G to 185G). This highlights the inefficiency of basic DNN scaling without architectural improvements.
  • Inefficiencies of Classic Cross-Structure Designs: Models like DCNv2, RDCN, AutoInt, and DHEN show AUC improvements but often come with high FLOPs relative to their parameter count. For instance, AutoInt with 19.2M parameters already requires 307G FLOPs/Batch, indicating potential design shortcomings in terms of computational efficiency. DHEN shows better AUC gains than DCNv2 but still has a moderate FLOPs cost.
  • RankMixer's Superiority:
    • RankMixer-100M consistently achieves the best performance across all AUC and UAUC metrics (e.g., +0.64% Finish AUC, +0.86% Skip AUC) among models with similar parameter counts.
    • Crucially, its FLOPs/Batch (233G) is relatively moderate, being lower than AutoInt (307G), HiFormer (326G), and Wukong (442G), yet it outperforms all of them. This demonstrates RankMixer's balanced approach to model capacity and computational load.
  • Scaling Up with RankMixer-1B: The RankMixer-1B model, with 1.1 billion parameters, further significantly boosts performance across the board (e.g., +0.95% Finish AUC, +1.25% Skip AUC), demonstrating RankMixer's strong scaling capabilities. Although its FLOPs/Batch is 2.1T, the accompanying efficiency improvements (discussed in Section 4.6) allow this scale to be practical.

Scaling Laws of Different Models:

The following figure (Figure 2 from the original paper) illustrates the scaling law curves:

Figure 2: Scaling laws between finish AUC gain and Params/FLOPs of different models. The x-axis uses a logarithmic scale. (The left panel plots dense parameters (M) against AUC gain (%); the right panel plots FLOPs (G) against AUC gain (%). RankMixer clearly outperforms the other models.)

  • As shown in Figure 2, RankMixer exhibits the steepest scaling law both in terms of parameters (Params) and FLOPs (x-axis is logarithmic). This indicates that as RankMixer grows in size, its AUC gain increases at a faster rate compared to other models.

  • Wukong shows a relatively steep parameter curve, suggesting it benefits from increased parameters. However, its computational cost (FLOPs) increases even more rapidly, causing a larger gap between Wukong and RankMixer/HiFormer on the AUC vs. FLOPs curve. This implies Wukong is less compute-efficient.

  • HiFormer performs slightly worse than RankMixer, which the authors attribute to its reliance on feature-level token segmentation and self-attention, which can affect efficiency.

  • DHEN shows limited scalability in its cross structure, as its AUC gain flattens out more quickly.

  • The generic MoE strategy struggles with expert balance, leading to suboptimal scaling performance.

    The authors also note that RankMixer adheres to LLM-style scaling laws, where model quality primarily correlates with the total number of parameters. While different scaling directions (depth $L$, width $D$, tokens $T$) yield similar performance for identical total parameters, increasing model width ($D$) is preferred for compute efficiency because larger matrix-multiply shapes achieve higher MFU. This guided the configuration choice of ($D=768, T=16, L=2$) for 100M parameters and ($D=1536, T=32, L=2$) for 1B parameters, emphasizing width over depth.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation on Components of RankMixer-100M

The following are the results from Table 2 of the original paper:

Setting ΔAUC
w/o skip connections -0.07%
w/o multi-head token mixing -0.50%
w/o layer normalization -0.05%
Per-token FFN → shared FFN -0.31%

The ablation study on RankMixer-100M components (Table 2) reveals the critical importance of each design choice:

  • w/o skip connections (-0.07% ΔAUC): Removing skip connections (residual connections) leads to a performance drop. This is expected, as skip connections are vital for stabilizing training, mitigating vanishing/exploding gradients, and enabling deeper networks in Transformer-like architectures.
  • w/o multi-head token mixing (-0.50% ΔAUC): This is the most significant drop. Removing multi-head token mixing means that global information exchange between tokens is lost. Each Per-token FFN would then only model partial features without interactions, severely limiting the model's ability to capture complex cross-feature interactions.
  • w/o layer normalization (-0.05% ΔAUC): Removing Layer Normalization also negatively impacts performance. Layer Normalization helps to stabilize activations and gradients, leading to more stable and faster training.
  • Per-token FFN → shared FFN (-0.31% ΔAUC): Replacing Per-token FFNs with a single, shared FFN across all tokens results in a substantial performance decrease. This validates RankMixer's hypothesis that modeling distinct feature subspaces with dedicated parameters is crucial for heterogeneous recommendation data. A shared FFN would force all tokens into a common representation, potentially losing specialized information and allowing dominant features to overshadow others.

6.2.2. Token2FFN Routing-strategy comparison

The following are the results from Table 3 of the original paper:

Routing strategy ΔAUC ΔParams ΔFLOPs
All-Concat-MLP -0.18% 0.0% 0.0%
All-Share -0.25% 0.0% 0.0%
Self-Attention -0.03% +16% +71.8%

This ablation focuses on different strategies for routing information from feature tokens to FFNs (or processing feature interactions). The baseline for comparison here is RankMixer's Multi-Head Token Mixing within its block.

  • All-Concat-MLP (-0.18% ΔAUC): This strategy concatenates all tokens into a single vector and processes them through a large MLP before splitting them back. The performance decrease suggests that:
    1. Learning effective interactions from a single, very long concatenated vector is challenging for an MLP.
    2. It likely weakens the learning of local information within specific feature subspaces that RankMixer's token-wise processing preserves.
  • All-Share (-0.25% ΔAUC): This strategy implies no splitting or specific token mixing; the entire input vector is shared and fed to each Per-token FFN (similar to how MMoE experts might share input). The significant performance decline highlights the importance of feature subspace splitting and independent modeling, confirming that simply sharing inputs leads to reduced learning diversity.
  • Self-Attention (-0.03% ΔAUC, +16% ΔParams, +71.8% ΔFLOPs): Applying self-attention between tokens for routing performs slightly worse than Multi-Head Token Mixing (even a -0.03% AUC change is considered meaningful at this scale in RS), while adding +16% parameters and +71.8% FLOPs. This supports RankMixer's claim that self-attention is inefficient and suboptimal for the heterogeneous feature spaces of RS, where learning similarity across hundreds of distinct feature subspaces is both difficult and expensive. Multi-Head Token Mixing is more effective and far cheaper (a rough FLOPs comparison follows below).
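
As a rough illustration of why attention-based routing is so much more expensive than parameter-free token mixing, the sketch below counts matrix-multiply FLOPs for one layer under assumed sizes ($T=16$ tokens, $D=768$, FFN expansion 4). These are illustrative numbers and will not reproduce the exact +71.8% from Table 3, which depends on the full production configuration; the qualitative point is that self-attention adds four projection matmuls plus the attention matmuls, while token mixing adds essentially none.

```python
# Back-of-the-envelope FLOPs comparison with assumed sizes (not the paper's exact config).
T, D, expansion = 16, 768, 4          # tokens, token width, assumed FFN expansion

ffn_flops = T * 2 * (2 * D * expansion * D)             # two matmuls per Per-token FFN
mixing_flops = 0                                        # token mixing is only a reshape/permute
attn_flops = 4 * (2 * T * D * D) + 2 * (2 * T * T * D)  # Q/K/V/O projections + score/value matmuls

print(f"Per-token FFNs:       {ffn_flops / 1e6:.1f} MFLOPs")
print(f"Token mixing:         {mixing_flops} matmul FLOPs")
print(f"Self-attention extra: {attn_flops / 1e6:.1f} MFLOPs "
      f"(~{100 * attn_flops / ffn_flops:.0f}% of the FFN compute)")
```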

6.2.3. Sparse-MoE Scalability and Expert Balance

The following figure (Figure 3 from the original paper) plots offline AUC against the sparse activation ratio of the sMoE variant:

Figure 3: AUC performance of RankMixer variants under a decreasing activation ratio ($1$, $1/2$, $1/4$, $1/8$ of experts): dense-training + ReLU-routed sMoE preserves almost all of the accuracy of the 1B dense model.

Figure 3 demonstrates the scalability of RankMixer's Sparse-MoE variant.

  • DTSI + ReLU-routed sMoE (Dense-Training-Sparse-Inference with ReLU routing) is crucial for maintaining accuracy under aggressive sparsity. It can scale parameter capacity by more than 8 times with almost no AUC loss, while providing substantial inference-time savings (e.g., 50% throughput improvement). This highlights the effectiveness of the proposed DTSI and ReLU routing for MoE.

  • Vanilla sMoE: Performance drops monotonically as fewer experts are activated, indicating expert imbalance and under-training.

  • sMoE + Load-balancing loss: While adding a load-balancing loss reduces degradation compared to Vanilla sMoE, it still falls short of DTSI + ReLU because the core issue lies more in the expert training process itself than just router balancing.

    The following figure (Figure 4 from the original paper) shows the activated expert ratio for different tokens in RankMixer:

    Figure 4: Activated expert ratio for different tokens in RankMixer, shown separately for layer 1 and layer 2; each subplot plots the activated expert ratio against training iterations.

Figure 4 visualizes expert balance and diversity.

  • It shows the activated expert ratio for different tokens in RankMixer across two layers.
  • The results indicate that the combination of DTSI (Dense-Training, Sparse-Inference) with ReLU routing effectively addresses expert imbalance.
  • Dense-training ensures that most experts receive sufficient gradient updates, preventing "dying experts" (experts that are rarely or never activated).
  • ReLU routing makes the activation ratio dynamic across tokens. The varying activation proportions per token (as shown in the figure) demonstrate that RankMixer adapts the number of activated experts based on the information content of each token, which is well-suited for the diverse and dynamic nature of recommendation data. A simplified sketch of this routing behaviour is shown below.
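
The sketch below illustrates one plausible reading of ReLU routing with dense training and sparse inference; it is a simplified interpretation, not the paper's implementation. Routing is shown for a single token vector, all experts are evaluated in training mode (one reading of dense training), and at inference experts whose ReLU gate is zero are skipped; the authors' exact gating, thresholding, and regularization details may differ.

```python
# Simplified sketch of ReLU-routed experts with dense-training / sparse-inference.
# All names and sizes here are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class ReLURoutedExperts(nn.Module):
    def __init__(self, dim: int, num_experts: int, expansion: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, expansion * dim), nn.ReLU(),
                          nn.Linear(expansion * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU routing: non-negative gates; a zero gate means the expert is inactive,
        # so the number of active experts varies per input (and per token).
        gates = torch.relu(self.router(x))                       # (batch, num_experts)
        if self.training:
            # Dense training: every expert is evaluated and weighted by its (possibly zero) gate.
            outputs = torch.stack([e(x) for e in self.experts], dim=1)
            return (gates.unsqueeze(-1) * outputs).sum(dim=1)
        # Sparse inference: skip experts whose gate is zero for the whole batch.
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            if (gates[:, i] > 0).any():
                out = out + gates[:, i, None] * expert(x)
        return out


layer = ReLURoutedExperts(dim=64, num_experts=8).eval()
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```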

6.3. Online Serving Cost

The following are the results from Table 6 of the original paper:

| Metric | OnlineBase-16M | RankMixer-1B | Change |
|---|---|---|---|
| #Param | 15.8M | 1.1B | ↑ 70X |
| FLOPs | 107G | 2106G | ↑ 20.7X |
| FLOPs/Param (G/M) | 6.8 | 1.9 | ↓ 3.6X |
| MFU | 4.47% | 44.57% | ↑ 10X |
| Hardware FLOPs | fp32 | fp16 | ↑ 2X |
| Latency | 14.5ms | 14.3ms | – |

A key achievement of RankMixer is its ability to scale model parameters by two orders of magnitude (70x increase) without increasing inference latency. The paper explains this by breaking down the latency formula:

$$ \mathrm{Latency} = \frac{\#\mathrm{Param} \times (\mathrm{FLOPs/Param\ ratio})}{\mathrm{MFU} \times \mathrm{Theoretical\ Hardware\ FLOPs}} $$

Symbol Explanation:

  • $\mathrm{Latency}$: The time delay for a single inference request.

  • $\#\mathrm{Param}$: The total number of parameters in the model.

  • $\mathrm{FLOPs/Param}$ ratio: The number of floating-point operations required per parameter.

  • $\mathrm{MFU}$: Model FLOPs Utilization.

  • Theoretical Hardware FLOPs: The maximum theoretical floating-point operation capacity of the hardware.

    The table clearly illustrates how RankMixer-1B maintains latency despite a massive parameter increase:

  1. FLOPs/Param ratio (3.6x decrease): RankMixer achieves a 70-fold increase in parameters with only a 20.7-fold increase in FLOPs (from 107G to 2106G). This means its FLOPs/Param ratio drops from 6.8 G/M to 1.9 G/M, an efficiency gain of 3.6x. This efficiency allows RankMixer to accommodate significantly more parameters within the same FLOPs budget.

  2. Model FLOPs Utilization (MFU) (10x increase): The MFU for RankMixer-1B surged from 4.47% (baseline OnlineBase-16M) to 44.57%, a near 10x improvement. This indicates that RankMixer's architecture (using large GEMM shapes, fused parallel Per-token FFNs, reduced memory bandwidth overhead) successfully shifts the model from a memory-bound to a compute-bound regime on GPUs, utilizing the hardware much more effectively.

  3. Quantization (fp16) (2x Theoretical Hardware FLOPs): By leveraging half-precision (fp16) inference, RankMixer doubles the theoretical peak Hardware FLOPs of GPUs. Modern GPUs (like Nvidia Tensor Cores) are highly optimized for fp16 operations, effectively doubling the FLOPs capacity compared to fp32.

    These three factors collectively counteract the massive parameter increase, allowing RankMixer-1B to achieve an inference latency of 14.3ms, which is even slightly lower than the 14.5ms of the much smaller OnlineBase-16M model.
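
Plugging the ratios from Table 6 into this formula gives a quick back-of-the-envelope consistency check (using only numbers reported above; the absolute hardware peak cancels out):

```python
# Latency ratio ≈ FLOPs ratio / (MFU ratio × hardware-FLOPs ratio), using Table 6 values.
flops_ratio = 2106 / 107        # ~20x more compute for RankMixer-1B
mfu_ratio = 44.57 / 4.47        # ~10x better hardware utilisation
hw_ratio = 2.0                  # fp16 roughly doubles peak FLOPs vs fp32

predicted = flops_ratio / (mfu_ratio * hw_ratio)
print(f"predicted latency ratio ≈ {predicted:.2f}")      # ≈ 0.99
print(f"observed latency ratio  ≈ {14.3 / 14.5:.2f}")    # ≈ 0.99
```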

6.4. Online Performance

To demonstrate RankMixer's universality and practical impact, online A/B tests were conducted across two core application scenarios: feed recommendation and advertising. The baseline in these scenarios was a 16M parameter model combining DLRM and DCN. RankMixer-1B replaced the dense part of this baseline.

The following are the results from Table 4 of the original paper:

Douyin app:

| User group | Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
|---|---|---|---|---|---|
| Overall | +0.2908% | +1.0836% | +2.3852% | +1.9874% | +0.7886% |
| Low-active | +1.7412% | +3.6434% | +8.1641% | +4.5393% | +2.9368% |
| Middle-active | +0.7081% | +1.5269% | +2.5823% | +2.5062% | +1.2266% |
| High-active | +0.1445% | +0.6259% | +1.828% | +1.4939% | +0.4151% |

Douyin Lite:

| User group | Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
|---|---|---|---|---|---|
| Overall | +0.1968% | +0.9869% | +1.1318% | +2.0744% | +1.1338% |
| Low-active | +0.5389% | +2.1831% | +4.4486% | +3.3958% | +0.9595% |
| Middle-active | +0.4235% | +1.9011% | +1.7491% | +2.6568% | +0.6782% |
| High-active | +0.0929% | +1.1356% | +1.8212% | +1.7683% | +2.3683% |

Feed Recommendation (Douyin App and Douyin Lite):

  • Overall Improvements: RankMixer delivered statistically significant uplifts across all business-critical metrics. For the main Douyin app, it achieved +0.2908% Active Days and +1.0836% Duration, along with substantial increases in Like, Finish, and Comment rates. Similar positive trends were observed for Douyin Lite.

  • Impact on User Activeness Levels: RankMixer showed particularly strong generalization capabilities by significantly improving metrics for low-active users. For low-active users on the Douyin app, Active Days increased by an impressive +1.7412%, Duration by +3.6434%, and Like by +8.1641%. This suggests the model is better at understanding and engaging users who are typically harder to activate, potentially due to its enhanced ability to model diverse feature subspaces and capture long-tail signals. While still positive, the improvements for middle-active and high-active users were comparatively smaller, indicating RankMixer has a broad and inclusive impact.

    The following are the results from Table 5 of the original paper:

Advertising:

| Metric | ΔAUC↑ | ADVV↑ |
|---|---|---|
| Lift | +0.73% | +3.90% |

Advertising Scenario:

  • RankMixer also demonstrated strong performance in the advertising scenario, showing a +0.73% increase in AUC and a notable +3.90% increase in ADVV (Advertiser Value). This confirms its versatility as a unified backbone for different personalized ranking applications.

    These online A/B test results, observed over 8 months for Feed Recommendation, provide compelling evidence for RankMixer's practical effectiveness, universality, and ability to generate significant business value without increasing serving costs.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces RankMixer, a novel, hardware-aware model design tailored for scaling up ranking models in industrial recommender systems. RankMixer addresses the critical challenges of latency, QPS, and low MFU on modern GPUs that plague traditional DLRMs. Its core innovations include a multi-head token mixing module that replaces self-attention for more efficient and effective cross-token feature interactions in heterogeneous data, and Per-token FFNs that enhance modeling capacity for distinct feature subspaces. The model is further scaled to one billion parameters using a Sparse-MoE variant, which incorporates ReLU routing and Dense-Training / Sparse-Inference (DTSI-MoE) to ensure expert balance and prevent under-training.

Offline experiments demonstrate RankMixer's superior performance and a steeper scaling law compared to SOTA baselines. Crucially, engineering optimizations and the hardware-aligned architecture allowed RankMixer to boost MFU from 4.5% to 45% and scale dense parameters by 100x (from 16M to 1B) while maintaining inference latency. The successful deployment of 1B Dense-Parameters RankMixer on Douyin's feed ranking for full traffic serving resulted in significant online A/B test improvements: a 0.3% increase in user active days and a 1.08% increase in total in-app usage duration. Its universality was also validated in advertising scenarios, showing robust lifts.

7.2. Limitations & Future Work

The paper does not explicitly delineate a "Limitations" or "Future Work" section. However, based on the discussion, some implicit limitations and potential future directions can be inferred:

  • Complexity of Semantic-based Tokenization: The semantic-based tokenization relies on domain knowledge to group features. While effective, the process might be manual or heuristic-driven. A potential limitation could be its generalizability to new domains or the effort required to define optimal semantic groups. Future work could explore automating or learning this tokenization process.
  • Hyperparameter Tuning for MoE: While DTSI-MoE addresses expert balance, Sparse-MoE generally introduces more hyperparameters (e.g., number of experts, $\lambda$ for regularization) that require careful tuning. The optimal balance between sparsity and performance can be dataset-dependent.
  • Further Scaling of Sparse-MoE: The paper mentions scaling to 10B parameters as a future possibility. While DTSI + ReLU routing is effective up to 1B, scaling beyond this might introduce new challenges in expert communication, memory management, or even more subtle expert under-training issues not fully apparent at 1B scale.
  • Full Hardware Abstraction: While hardware-aware, the model still requires specific optimizations like fp16 quantization and GEMM fusions. Further research could aim for even more general hardware-agnostic optimizations or explore novel hardware architectures specifically designed for such models.
  • Interpretability: As DLRMs become more complex and black-box with billions of parameters, interpretability becomes harder. Understanding why RankMixer makes certain recommendations, especially with Sparse-MoE and dynamic routing, could be an area for future investigation.

7.3. Personal Insights & Critique

RankMixer represents a significant stride in bridging the gap between academic DLRM research and industrial-scale deployment, particularly by drawing lessons from the LLM scaling paradigm.

Key Strengths and Inspirations:

  • Hardware-Awareness as a First-Class Citizen: The explicit focus on hardware-aware design from the outset, rather than as an afterthought optimization, is highly commendable. This mindset is crucial for industrial applications where latency and cost are non-negotiable. The dramatic 10x increase in MFU is a testament to this principle.
  • Intelligent Adaptation of Transformer Concepts: Instead of blindly porting Transformers to RS, RankMixer intelligently adapts the parallel architecture while replacing suboptimal components like self-attention. The multi-head token mixing is a clever, parameter-free alternative that respects the heterogeneity of RS features, offering a compelling solution to a long-standing challenge.
  • Effective MoE Implementation: The DTSI-MoE with ReLU routing is a practical and robust solution to common Sparse-MoE problems. This makes MoE a more viable option for DLRMs, which often struggle with data sparsity and feature diversity leading to expert under-training.
  • Real-world Impact: The successful full-traffic deployment on Douyin and the tangible improvements in key business metrics (active days, duration) provide strong validation. The observation that low-active users benefit most is particularly impactful, indicating the model's ability to drive growth from potentially underserved segments.

Potential Areas for Further Inquiry/Critique:

  • Transparency of Proj Function: The Proj function in tokenization is mentioned but not detailed. Is it a simple linear projection, or something more complex? Its design could significantly influence how raw embeddings are transformed into tokens and how much information is preserved or compressed.

  • Generalizability of Semantic Grouping: While domain knowledge is used for semantic grouping, the extent to which this process is manual or how robust it is to evolving feature sets remains an open question. Could an unsupervised or learned method enhance this step?

  • Detailed Breakdown of Hardware Optimization: While MFU, fp16, and FLOPs/Param ratio are discussed, a deeper dive into the specific GPU architectures used and the detailed kernel fusions or other low-level optimizations could offer more insights for others to replicate or build upon.

  • Long-Term Maintenance of 1B Parameter Models: Managing and updating such large models in production presents its own set of engineering challenges beyond initial deployment (e.g., faster iteration, continuous learning, debugging). While the paper focuses on the initial scaling and deployment, future research could explore these operational aspects.

    Overall, RankMixer provides a powerful framework for developing highly scalable and efficient ranking models in industrial settings. Its innovations in feature interaction and MoE strategies, combined with a strong hardware-aware philosophy, offer valuable lessons for the broader recommender systems community. The paper clearly demonstrates that scaling laws can indeed be harnessed for DLRMs, provided the architectural design is meticulously aligned with both data characteristics and hardware capabilities.
