RankMixer: Scaling Up Ranking Models in Industrial Recommenders
TL;DR Summary
RankMixer is introduced to enhance the efficiency and scalability of ranking models in industrial recommenders. Utilizing a hardware-aware design, it replaces self-attention with multi-head token mixing, achieving improved model performance at one billion parameters via a Sparse-MoE variant.
Abstract
Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems; yet two practical obstacles remain. First, training and serving cost on industrial recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with a multi-head token mixing module for higher efficiency. Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI. A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training. Experiments show RankMixer’s...
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
RankMixer: Scaling Up Ranking Models in Industrial Recommenders
1.2. Authors
Jie Zhu*, Zhifang Fan*, Xiaoxie Zhu*, Yuchen Jiang*, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. The authors are primarily affiliated with ByteDance, with Zhifang Fan and Qiwei Chen listed as independent researchers. ByteDance is a leading technology company known for products like TikTok (Douyin in China), which heavily relies on recommendation systems. This indicates a strong industry-led research focus on practical, large-scale deployment challenges.
1.3. Journal/Conference
Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). CIKM is a highly reputable and influential international conference in the fields of information retrieval, knowledge management, and database systems. Its reputation ensures that accepted papers undergo rigorous peer review and are considered significant contributions to the field.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses two main challenges in scaling up recommendation systems inspired by Large Language Models (LLMs): high training/serving costs with strict latency/QPS demands, and inefficient GPU utilization by legacy CPU-era feature-crossing modules. It introduces RankMixer, a hardware-aware model designed for a unified and scalable feature-interaction architecture. RankMixer leverages the Transformer's parallelism but replaces quadratic self-attention with a more efficient multi-head token mixing module. It also uses Per-token FFNs to model distinct feature subspaces and cross-feature-space interactions, which enhances capacity. For higher Return on Investment (ROI), a Sparse-MoE variant extends the model to one billion parameters, employing a dynamic routing strategy to tackle expert training inadequacy and imbalance. Experiments on a trillion-scale production dataset demonstrate RankMixer's superior scaling, boosting Model Flops Utilization (MFU) from 4.5% to 45% and scaling parameters by two orders of magnitude while maintaining inference latency. Online A/B tests across recommendation and advertisement scenarios show universal applicability, with a 1B Dense-Parameters RankMixer improving user active days by 0.3% and total in-app usage duration by 1.08% without increasing serving cost.
1.6. Original Source Link
/files/papers/691b1d0680efabfa5c6ee4e9/paper.pdf (local copy of the full paper). Publication status: published in CIKM '25.
2. Executive Summary
2.1. Background & Motivation
The paper is motivated by the success of Large Language Models (LLMs) in scaling performance with increasing parameters, prompting interest in applying similar scaling principles to Deep Learning Recommendation Models (DLRMs). However, scaling DLRMs in industrial recommender systems faces two critical practical obstacles:
- Cost and Latency Constraints: Industrial recommender systems operate under strict latency bounds (how quickly a response must be generated) and demand extremely high QPS (Queries Per Second, the number of requests a system can handle per second). Increasing model size typically raises computational cost, making it challenging to meet these real-time operational requirements.
- Inefficient Hardware Utilization: Many traditional feature-crossing modules (components that combine different features to learn complex interactions) in ranking models were designed in the CPU era. These designs often fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU), meaning a significant portion of the GPU's theoretical processing power goes unused. This leads to poor scalability, as the computational cost per parameter is high, limiting the practical ROI (Return on Investment) of simply adding more parameters.

The core problem the paper aims to solve is how to effectively scale DLRMs to billions of parameters in a production environment without violating strict latency and QPS constraints, by designing hardware-aware architectures that maximize GPU utilization and computational efficiency.
2.2. Main Contributions / Findings
The paper introduces RankMixer, a novel, hardware-aware model design with the following primary contributions and findings:
- Novel Architecture (RankMixer): Proposes a new model architecture featuring Multi-head Token Mixing for efficient cross-token feature interactions (replacing quadratic self-attention with a parameter-free operator for higher efficiency) and Per-token FFNs for isolated modeling of distinct feature subspaces plus cross-feature-space interactions. The design is hardware-aligned and highly parallelizable.
- Sparse-MoE Extension with Dynamic Routing: Extends RankMixer with a Sparse Mixture-of-Experts (MoE) variant to one billion parameters, enhancing ROI. It introduces a dynamic routing strategy (ReLU Routing and Dense-training / Sparse-inference, DTSI-MoE) to address expert under-training and imbalance, which are common issues in Sparse-MoE deployments.
- Significant MFU Improvement: By replacing diverse, handcrafted, low-MFU modules with RankMixer, model MFU was boosted from 4.5% to 45% (a 10x improvement). This shift from a memory-bound to a compute-bound regime is crucial for efficient GPU utilization.
- Massive Parameter Scaling with Stable Latency: Achieved a 70x increase in online ranking model parameters (from 16M to 1.1B) while maintaining roughly the same inference latency. This was made possible by decoupling parameter growth from FLOPs, and FLOPs growth from actual cost, through high MFU and quantization (fp16 inference).
- Universal Applicability: Verified RankMixer's universality through online A/B tests in two core application scenarios, Feed Recommendation and Advertising, demonstrating its generalizability as a unified backbone.
- Tangible Business Impact: Full-traffic deployment of the 1B dense-parameter RankMixer on Douyin's recommendation system improved user active days by 0.3% and total in-app usage duration by 1.08% without increasing serving cost. The model also showed strong generalization, particularly benefiting low-active users.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the RankMixer paper, a reader should be familiar with the following fundamental concepts:
-
Recommender System (RS): A subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. In industrial settings, RS are crucial for platforms like social media (e.g., Douyin/TikTok), e-commerce, and streaming services to suggest relevant content, products, or services to users, thereby enhancing user engagement and platform revenue. They typically analyze user behavior (past interactions), item attributes, and context to make predictions.
-
Deep Learning Recommendation Models (DLRMs): These are recommendation models that utilize deep neural networks to learn complex patterns and interactions within user-item data. A critical component of DLRMs is the
dense interaction layer, which processes the concatenated or pooled embeddings of various features to capture high-order feature interactions. -
Large Language Models (LLMs): Advanced deep learning models, typically based on the Transformer architecture, that have achieved remarkable performance in natural language processing tasks by scaling up their parameter counts to billions or even trillions. Their success in leveraging scale to improve performance has inspired similar scaling efforts in other domains, including recommendation systems.
-
Model Flops Utilization (MFU): A metric that measures how efficiently a model utilizes the theoretical floating-point operations per second (FLOPs) capability of the underlying hardware (e.g., GPU). It is calculated as the ratio of the actual FLOPs consumed by the model to the theoretical peak FLOPs capacity of the hardware. A low MFU indicates that the hardware is underutilized, often due to memory-bound operations rather than compute-bound operations. Maximizing MFU is critical for efficient large-scale model training and inference.
-
Queries Per Second (QPS): A metric representing the number of requests (queries) a system can process in one second. In industrial recommender systems, extremely high QPS demands are common, requiring models to be highly efficient and capable of quick inference to serve a vast number of users in real-time.
-
Feature Engineering (Feature-crossing): The process of creating new features from existing ones to enhance the predictive power of a model. Feature-crossing specifically refers to combining two or more features to create an interaction feature, which can capture non-linear relationships. For instance, combining a user's age group with an item's category might reveal preferences not apparent from either feature alone. In recommendation systems, features can be diverse:
  - Numerical features: e.g., video watch duration, user activity frequency.
  - Categorical features: e.g., user ID, item ID, author ID, video tags. These are often converted into dense embeddings (low-dimensional vector representations) for use in neural networks.
  - User behavior features: e.g., sequences of previously watched videos, liked items.
  - Content features: e.g., textual descriptions or visual attributes of items.
-
Transformer Architecture: A neural network architecture introduced in 2017, which revolutionized NLP and is now widely used in various domains. Its key components include
self-attention mechanisms and Feed-Forward Networks (FFNs). Transformers are known for their high parallelism during training, which makes them efficient on modern accelerators like GPUs. -
Self-attention: A mechanism in Transformer models that allows the model to weigh the importance of different parts of the input sequence when processing a specific part. For a given token, it computes attention scores by comparing it to all other tokens in the sequence; these scores determine how much "attention" the model should pay to each other token. The output for a token is a weighted sum of all tokens, where the weights are the attention scores. The core self-attention calculation for queries $Q$, keys $K$, and values $V$ is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$, $K$, $V$ are matrices derived from the input embeddings, and $d_k$ is the dimension of the keys, used as a scaling factor to prevent large dot products from pushing the softmax into regions with tiny gradients. This mechanism has quadratic complexity with respect to the sequence length (number of tokens), which can be a bottleneck for very long sequences or many tokens (a small code sketch appears at the end of this list). -
Feed-Forward Network (FFN): A simple neural network layer, typically consisting of two linear transformations with a non-linear activation function (e.g., ReLU or GeLU) in between. In Transformers, an FFN is applied independently and identically to each position (token) in the sequence after the self-attention layer.
-
Mixture-of-Experts (MoE): An architecture that increases model capacity by routing inputs to a subset of specialized "expert" neural networks. Instead of using all parameters for every input (dense model), MoE models only activate a few experts per input, thus increasing the total number of parameters while keeping the computational cost per inference relatively constant. This makes it possible to build models with trillions of parameters while maintaining manageable inference costs. A
Sparse MoE specifically refers to activating only a small, sparse subset of experts. -
Layer Normalization (LN): A normalization technique applied across the features of a single input sample (token) within a layer. It helps stabilize training, especially in deep neural networks, by re-centering and re-scaling the inputs to the next layer.
-
Residual Connections (Skip Connections): A technique where the input of a layer is added to its output, allowing gradients to flow more directly through the network. This helps mitigate the vanishing gradient problem and enables the training of very deep networks.
-
Gelu activation function: Gaussian Error Linear Unit, a non-linear activation function commonly used in Transformer models, defined as $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution.
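To make the self-attention formula and the GELU definition above concrete, here is a minimal PyTorch sketch; the shapes and values are illustrative only and not taken from the paper.

```python
# Minimal sketch: scaled dot-product attention and the exact GELU definition.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [num_tokens, d_k]; weights = softmax(Q K^T / sqrt(d_k)).
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [num_tokens, num_tokens]
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                   # weighted sum of values

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

tokens = torch.randn(6, 16)          # 6 tokens with dimension d_k = 16
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, gelu(torch.tensor([-1.0, 0.0, 1.0])))
```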
3.2. Previous Works
The paper contextualizes RankMixer within the landscape of modern recommendation systems, highlighting the evolution of DLRMs and feature interaction modeling.
-
Wide&Deep [5] and DeepFM [10]: These are early and influential efforts that combined wide (linear) components to capture low-order (simple) feature interactions with deep (DNN) components to capture high-order (complex) feature interactions. They showed that simply relying on DNNs for high-order interactions could be challenging. -
Explicit Cross Methods: Researchers developed various models that explicitly design operators to capture high-order feature interactions, rather than relying on deep DNNs implicitly:
  - PNN [20] (Product-based Neural Networks): introduces product layers to capture explicit feature interactions at various granularities.
  - DCN [25] (Deep & Cross Network) and its successor DCNv2 [26]: use a cross network to explicitly learn bounded-degree cross features; the cross network iteratively applies feature crossing by multiplying the input with the output of the previous layer.
  - xDeepFM [16]: proposes a Compressed Interaction Network (CIN) to generate feature interactions in an explicit, vector-wise fashion.
  - FGCNN [18] (Feature Generation by Convolutional Neural Network): uses CNNs to automatically generate new features from raw input features.
  - FiGNN [15] (Feature Interaction Graph Neural Network): models feature interactions with graph neural networks.
  - AutoInt [23] (Automatic Feature Interaction Learning via Self-Attentive Neural Networks): employs multi-head self-attention to explicitly model feature interactions, treating features as queries, keys, and values.
  - Hiformer [9]: combines heterogeneous self-attention layers with low-rank approximation matrix computations, addressing some limitations of AutoInt by providing more flexible interaction patterns.
  - DHEN [31] (Deep and Hierarchical Ensemble Network): combines different types of feature-crossing blocks (e.g., DCN, self-attention, FM, LR) and stacks multiple layers to learn hierarchical feature interactions.
  - Wukong [30]: investigates the scaling law of feature-interaction modeling, following an architecture similar to DHEN but using Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).
- Scaling Laws in Recommendation Systems: Research interest has grown in understanding how performance improves with increasing model size, data, or computation, similar to LLMs.
  - Studies explored scaling strategies for pretraining user activity sequences [6], general-purpose user representations [22, 32], and online retrieval [8, 27].
  - HSTU [29] enhances scaling for Generative Recommenders, focusing on sequential aspects.
- Limitations of Self-Attention in RS: The paper explicitly points out that while self-attention is effective in LLMs, where tokens share a unified embedding space, it is suboptimal for recommendation systems. In RS, features are inherently heterogeneous (e.g., a user ID embedding vs. a video category embedding). Computing an inner-product similarity (the core of self-attention) between such diverse, often high-dimensional, and semantically distinct feature spaces is difficult and can be inefficient. As a result, self-attention consumes more computation, memory I/O, and GPU memory without necessarily outperforming simpler, parameter-free approaches like multi-head Token Mixing.
3.3. Technological Evolution
The evolution of recommendation models has moved from simpler matrix factorization and collaborative filtering methods to deep learning-based models (DLRMs). Early DLRMs like Wide&Deep and DeepFM introduced the idea of combining different mechanisms for low- and high-order feature interactions. Subsequent works focused on designing increasingly sophisticated explicit cross methods (e.g., DCN, xDeepFM, AutoInt, DHEN, Wukong) to better capture these interactions.
The current trend, inspired by LLMs, is to explore scaling laws in DLRMs to improve performance by increasing model capacity. However, simply applying LLM architectures like vanilla Transformers directly to RS data has proven inefficient due to the unique characteristics of recommendation features (heterogeneity, strict latency constraints).
RankMixer positions itself at this frontier, aiming to bridge the gap between LLM-inspired scaling and the practical demands of industrial RS. It takes inspiration from the parallel architecture of Transformers but critically re-engineers the self-attention and FFN components to be hardware-aware and specifically suited for heterogeneous recommendation features. By proposing Multi-head Token Mixing and Per-token FFNs, and integrating Sparse-MoE with novel routing, RankMixer represents an evolution towards more efficient, scalable, and adaptable DLRM architectures for real-world deployment.
3.4. Differentiation Analysis
Compared to the main methods in related work, RankMixer offers several core differences and innovations:
- Hardware-Aware Design for GPU Efficiency: Unlike many CPU-era feature-crossing modules (e.g., early DCN variants) or even some self-attention-based models (AutoInt, Hiformer) that suffer from low MFU on GPUs, RankMixer is explicitly designed to be hardware-aligned. Its components (Multi-head Token Mixing, Per-token FFNs, and their parallel execution) are chosen to maximize GPU parallelism and shift from memory-bound to compute-bound operations, yielding a 10x MFU improvement.
- Efficient Cross-Token Interaction via Multi-head Token Mixing:
  - Vs. Self-Attention (AutoInt, Hiformer): RankMixer replaces quadratic self-attention with a parameter-free multi-head token mixing module. This is crucial because self-attention's reliance on inner-product similarity is suboptimal for the heterogeneous feature spaces in RS (e.g., user ID vs. item ID), where semantic similarity is hard to capture via dot products and incurs high computational/memory costs. Token Mixing provides efficient, highly parallel global information modeling without these drawbacks.
  - Vs. Explicit Cross Networks (DCN, DCNv2): While DCN variants explicitly model interactions, RankMixer offers a more unified and scalable Transformer-inspired architecture, capable of efficiently learning deeper and more complex interactions across tokens that represent semantic feature groups.
- Specialized Feature Subspace Modeling with Per-token FFNs:
  - Vs. Shared FFNs (Vanilla Transformer): Traditional Transformers share FFN parameters across all tokens. RankMixer employs Per-token FFNs, giving each feature token a dedicated parameter transformation. This prevents high-frequency fields from dominating and drowning out lower-frequency/long-tail signals, a common problem when disparate semantic spaces are mixed in a single interaction module, and is better aligned with the heterogeneous nature of recommendation data.
  - Vs. MMoE Experts: While MMoE also uses multiple networks, its experts typically process the same input. Per-token FFNs in RankMixer are distinct in that each sees a different token input, allowing specialized modeling of different feature subspaces.
- Scalable Sparse-MoE with Dynamic and Balanced Routing:
  - Vs. Vanilla Sparse-MoE: RankMixer addresses key limitations of vanilla Sparse-MoE (uniform top-k expert routing that wastes budget, and expert under-training due to imbalance). It introduces ReLU Routing for flexible expert activation based on a token's information content and Dense-training / Sparse-inference (DTSI-MoE) to ensure all experts are sufficiently trained while keeping inference sparse. This yields significantly higher ROI when scaling to 1B+ parameters.
- Holistic Optimization for Industrial Deployment: RankMixer integrates architectural innovations with engineering optimizations (high MFU, quantization) to achieve two orders of magnitude of parameter scaling without increasing serving latency, a critical practical achievement for industrial recommender systems. This holistic balance of effectiveness and efficiency distinguishes it from models that achieve high AUC but are too costly to deploy.
4. Methodology
4.1. Principles
The core idea behind RankMixer is a hardware-aware model design tailored for the unique characteristics of industrial recommendation systems. It aims to create a unified and scalable feature-interaction architecture that can effectively model heterogeneous feature spaces while maximizing GPU utilization and maintaining strict real-time serving constraints. The theoretical basis and intuition are rooted in:
- Transformer-like Parallelism for Scalability: leveraging the highly parallel nature of Transformer architectures to enable efficient computation on GPUs.
- Addressing Heterogeneity with Specialized Interactions: recognizing that recommendation features are fundamentally diverse (e.g., user ID, item category, query text) and that generic interaction mechanisms such as self-attention (which relies on inner-product similarity) are suboptimal. Instead, RankMixer uses parameter-free token mixing for cross-token interaction and per-token FFNs for specialized, isolated processing of feature subspaces.
- Decoupling Capacity from Cost: employing Sparse Mixture-of-Experts (MoE) to significantly increase model capacity (number of parameters) while keeping the computational cost per inference (FLOPs) roughly constant, thereby improving the Return on Investment (ROI) of scaling.
- Hardware-Alignment for Efficiency: designing components and optimizing their execution (e.g., large GEMM shapes, kernel fusion, fp16 quantization) to achieve high Model Flops Utilization (MFU) and shift computation from memory-bound to compute-bound, which is crucial for meeting latency and QPS demands in production.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Overall Architecture
The RankMixer model processes $T$ input tokens through successive RankMixer blocks. After these blocks, an output pooling operator aggregates the final representations to generate predictions for various tasks. Each RankMixer block is composed of two main components:

- A Multi-Head Token Mixing layer: responsible for facilitating information exchange across different tokens.
- A Per-Token Feed-Forward Network (PFFN) layer: designed to expand the model's capacity and model interactions within distinct feature subspaces.

The iterative refinement of token representations across layers is described by the following equations:
$ \mathbf{S}_{n-1} = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{X}_{n-1}\right) + \mathbf{X}_{n-1}\right) $
$ \mathbf{X}_{n} = \mathrm{LN}\left(\mathrm{PFFN}\left(\mathbf{S}_{n-1}\right) + \mathbf{S}_{n-1}\right) $
Where:
- $\mathrm{LN}(\cdot)$ is a Layer Normalization function, which normalizes the activations across the features of each sample and helps stabilize training.
- $\mathrm{TokenMixing}(\cdot)$ is the Multi-head Token Mixing module, which performs cross-token feature interactions.
- $\mathrm{PFFN}(\cdot)$ is the Per-Token Feed-Forward Network module, which processes each token independently.
- $\mathbf{X}_{n-1} \in \mathbb{R}^{T \times D}$ is the input to the $n$-th RankMixer block, representing $T$ tokens each with hidden dimension $D$; $\mathbf{X}_0$ is the initial stack of input tokens $\mathbf{x}_1, ..., \mathbf{x}_T$.
- $\mathbf{S}_{n-1}$ is the output after the TokenMixing layer with its residual connection and normalization.
- $\mathbf{X}_n$ is the output of the $n$-th RankMixer block, serving as the input to the next block.
- $T$ is the total number of feature tokens and $D$ is the hidden dimension of each token.

The residual connections (adding $\mathbf{X}_{n-1}$ back in the first equation and $\mathbf{S}_{n-1}$ in the second) help train very deep networks by allowing gradients to flow more easily. Finally, the output representation is obtained by mean pooling of the final layer's token representations; this aggregated representation is then used to compute predictions for various downstream tasks (e.g., click-through rate, watch duration).
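Below is a minimal PyTorch sketch of one RankMixer block following the two equations above (token mixing, then per-token FFN, each with a residual connection and LayerNorm). It assumes H = T heads and uses illustrative sizes; it is a sketch, not the authors' implementation.

```python
# Minimal sketch (illustrative, assuming H = T): one RankMixer block
# X -> S = LN(TokenMixing(X) + X) -> X' = LN(PFFN(S) + S)
import torch
import torch.nn as nn

class RankMixerBlock(nn.Module):
    def __init__(self, num_tokens: int, dim: int, k: int = 4):
        super().__init__()
        self.T, self.D = num_tokens, dim
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # Per-token FFN: a dedicated two-layer FFN for every token (no parameter sharing).
        self.pffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, k * dim), nn.GELU(), nn.Linear(k * dim, dim))
             for _ in range(num_tokens)]
        )

    def token_mixing(self, x):
        # x: [batch, T, D]. Split every token into H = T heads of size D // T,
        # then regroup head h from every token into shuffled token s_h.
        b = x.shape[0]
        heads = x.view(b, self.T, self.T, self.D // self.T)   # [b, tokens, heads, D/H]
        return heads.transpose(1, 2).reshape(b, self.T, self.D)

    def forward(self, x):
        s = self.ln1(self.token_mixing(x) + x)
        v = torch.stack([ffn(s[:, t]) for t, ffn in enumerate(self.pffn)], dim=1)
        return self.ln2(v + s)

block = RankMixerBlock(num_tokens=8, dim=64)
x = torch.randn(2, 8, 64)        # [batch, tokens, hidden]
print(block(x).shape)            # torch.Size([2, 8, 64])
```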
4.2.2. Input Layer and Feature Tokenization
To effectively process the rich information present in large-scale recommendation systems, a crucial first step is to prepare the input features. These features are diverse and typically include:
- User features: such as user ID and other demographic or behavioral information.
- Candidate features: pertaining to the item being recommended, e.g., video ID, author ID, category.
- Sequence Features: derived from user interaction sequences (e.g., previously watched videos), often processed through a Sequence Module to capture temporal interests and yield a sequence embedding.
- Cross Features: explicit or learned interactions between user and candidate features.

All these raw features are initially transformed into embeddings of various dimensions. To ensure efficient parallel computation in the subsequent RankMixer blocks, these embeddings, regardless of their original dimension, must be aligned to a fixed dimension. This process is called tokenization.
The paper highlights a challenge with simple tokenization strategies:

- Too many tokens: assigning one token per feature (which could be hundreds) leaves few parameters and little computation per token, potentially under-modeling important features and under-utilizing GPU cores.
- Too few tokens: conversely, using very few tokens (e.g., a single concatenated token) degrades the model into a plain Deep Neural Network (DNN), losing the ability to distinctly represent diverse feature spaces and risking dominant features overshadowing others.

To overcome these issues, RankMixer proposes a semantic-based tokenization approach that leverages domain knowledge. Features are grouped into several semantically coherent clusters (e.g., all user-related features, all item-related features, all context features). The grouped feature embeddings are then concatenated into a single embedding vector $e_{\mathrm{input}}$:
$ e_{\mathrm{input}} = \left[ e_1; e_2; ...; e_N \right] $
Where:
- $e_{\mathrm{input}}$ is the concatenated embedding vector, formed by joining the feature-group embeddings.
- $e_i$ is the embedding vector of the $i$-th feature group.

This concatenated vector is then partitioned into an appropriate number of tokens, each with a fixed dimension. Each resulting feature token is designed to capture a set of feature embeddings that represent a similar semantic aspect. The tokenization process is formally defined as:
$ \mathbf{x}_i = \mathrm{Proj}\left(e_{\mathrm{input}} \left[ d \cdot (i - 1) : d \cdot i \right] \right), \quad i = 1, ..., T $
Where:
- $\mathbf{x}_i$ is the $i$-th feature token, intended to capture one semantic aspect of the input.
- $e_{\mathrm{input}}\left[ d \cdot (i-1) : d \cdot i \right]$ denotes the segment of the concatenated embedding vector from index $d \cdot (i-1)$ to $d \cdot i$.
- $d$ is the fixed dimension assigned to each segment before projection.
- $T$ is the total number of resulting feature tokens.
- $\mathrm{Proj}(\cdot)$ is a projection function that maps each embedding segment into the model's hidden dimension $D$, so that all tokens share a uniform dimension for subsequent processing.
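Below is a small illustrative sketch of this tokenization step. The feature groups, dimensions, and the single shared projection are assumptions made for the example (for brevity), not details taken from the paper.

```python
# Illustrative sketch of semantic tokenization (feature groups and sizes are made up).
import torch
import torch.nn as nn

# Hypothetical per-group embeddings for one request (user / item / context / behavior).
groups = {
    "user":     torch.randn(1, 48),
    "item":     torch.randn(1, 80),
    "context":  torch.randn(1, 32),
    "behavior": torch.randn(1, 96),
}
e_input = torch.cat(list(groups.values()), dim=-1)   # concatenated embedding, width 256

T, D = 8, 64                                          # number of tokens, model hidden dim
d = e_input.shape[-1] // T                            # fixed segment width per token (32)
proj = nn.Linear(d, D)                                # Proj(.): segment -> hidden dim D

segments = e_input.view(-1, T, d)                     # split into T contiguous segments
tokens = proj(segments)                               # [batch, T, D] feature tokens
print(tokens.shape)                                   # torch.Size([1, 8, 64])
```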
4.2.3. RankMixer Block
The core of the RankMixer model lies within its block structure, which iteratively refines the feature token representations.
4.2.3.1. Multi-head Token Mixing
The Multi-head Token Mixing module is introduced to enable effective information exchange across tokens, which is crucial for capturing feature crosses and modeling global information. This module is an alternative to self-attention, designed to be more efficient and suitable for heterogeneous recommendation data.
The process begins by splitting each token into $H$ heads:
$ \left[ \mathbf{x}_t^{(1)} \parallel \mathbf{x}_t^{(2)} \parallel \ldots \parallel \mathbf{x}_t^{(H)} \right] = \mathrm{SplitHead} \left( \mathbf{x}_t \right) $
Where:
- $\mathbf{x}_t$ is the $t$-th input token.
- $H$ is the number of heads.
- $\mathbf{x}_t^{(h)}$ is the $h$-th head of token $\mathbf{x}_t$. The symbol $\parallel$ indicates concatenation along a dimension, meaning the original token is divided into smaller sub-vectors. These heads can be viewed as projections of the token into lower-dimensional feature subspaces, allowing the model to consider different perspectives of the feature.

After splitting, the Token Mixing operation shuffles these heads. For each head index $h$, it concatenates the $h$-th head from all $T$ tokens:
$ \mathbf{s}^h = \operatorname{Concat} \left( \mathbf{x}_1^h, \mathbf{x}_2^h, ..., \mathbf{x}_T^h \right) $
Where:
- $\mathbf{s}^h$ is the resulting vector for the $h$-th "shuffled token" (mixed head). This operation collects the $h$-th perspective of every original input token into a single vector.

The output of the Multi-head Token Mixing module is a set of $H$ shuffled tokens, which can be stacked into a matrix. In this work, the authors set $H = T$ so that the number of tokens remains unchanged after Token Mixing, which makes a residual connection straightforward. After mixing, a residual connection and Layer Normalization are applied to produce the refined tokens:
$ \mathbf{s}_1, \mathbf{s}_2, ..., \mathbf{s}_T = \mathrm{LN} \left( \mathrm{TokenMixing} \left( \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T \right) + \left( \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T \right) \right) $
Here $\mathrm{TokenMixing}(\cdot)$ conceptually represents the entire process of splitting heads and concatenating them as described above, resulting in a new set of $T$ tokens (when $H = T$).
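To make the head-shuffling concrete, here is a tiny sketch (with made-up shapes, assuming H = T) showing that Multi-head Token Mixing is a parameter-free regrouping: the h-th output token is simply the concatenation of the h-th head of every input token.

```python
# Parameter-free token mixing: split tokens into heads, regroup by head index.
import torch

T, D = 4, 8                         # 4 tokens of dim 8, H = T = 4 heads of dim 2
x = torch.arange(T * D, dtype=torch.float32).reshape(T, D)

heads = x.reshape(T, T, D // T)                 # heads[t, h] = h-th head of token t
mixed = heads.transpose(0, 1).reshape(T, D)     # mixed[h] = concat over t of heads[t, h]

# The same result written as the explicit concatenation from the formula above.
manual = torch.stack([torch.cat([heads[t, h] for t in range(T)]) for h in range(T)])
print(torch.equal(mixed, manual))   # True
```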
Critique of Self-Attention for Recommendation Systems:
The paper explicitly states that while self-attention is highly effective in LLMs where tokens share a unified embedding space, it is suboptimal for recommendation systems. The primary reasons are:
- Heterogeneous Feature Spaces: In RS, features (and thus tokens) are inherently diverse (e.g., user ID, item category, query text).
- Difficulty of Inner-Product Similarity: Self-attention computes attention weights using the inner product of tokens. This works well for homogeneous data (like words in NLP). However, computing an inner-product similarity between two heterogeneous semantic spaces (e.g., a user embedding and an item embedding) is notoriously difficult and often yields unreliable or uninformative results. This is particularly problematic in RS, where ID spaces can contain hundreds of millions of elements.
- Computational and Memory Cost: Applying self-attention to such diverse inputs consumes more computation, memory I/O, and GPU memory than the parameter-free multi-head Token Mixing approach, without offering superior performance in this context.
4.2.3.2. Per-token FFN
Traditional Deep Learning Recommendation Models (DLRMs) and models like DHEN often mix features from many disparate semantic spaces within a single interaction module. This can lead to a problem where high-frequency fields (e.g., popular item IDs or common user behaviors) dominate the learning process, effectively drowning out lower-frequency or long-tail signals, which ultimately harms overall recommendation quality.
To address this, RankMixer introduces a parameter-isolated feed-forward network architecture called Per-token FFN. Unlike traditional designs where the parameters of the FFN are shared across all tokens, Per-token FFN processes each token with its own dedicated set of transformations, thus isolating the parameters for each token.
For the $t$-th token $\mathbf{s}_t$, the Per-token FFN is expressed as:
$ \mathbf{v}_t = f_{\mathrm{pffn}}^{t,2} \left( \operatorname{Gelu} \left( f_{\mathrm{pffn}}^{t,1} \left( \mathbf{s}_t \right) \right) \right) $
Where the linear transformation is defined as:
$ f_{\mathrm{pffn}}^{t,i} ( \mathbf{x} ) = \mathbf{x} \mathbf{W}_{\mathrm{pffn}}^{t,i} + \mathbf{b}_{\mathrm{pffn}}^{t,i} $
And:
- $\mathbf{s}_t$ is the $t$-th input token to the PFFN module.
- $\mathbf{v}_t$ is the output vector for the $t$-th token after processing by its dedicated PFFN.
- $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$ are the first and second linear layers of the FFN dedicated to the $t$-th token.
- $\mathbf{W}_{\mathrm{pffn}}^{t,1}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,1}$ are the weight matrix and bias vector of the first linear layer of the $t$-th token's FFN.
- $\mathbf{W}_{\mathrm{pffn}}^{t,2}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,2}$ are the weight matrix and bias vector of the second linear layer of the $t$-th token's FFN.
- $k$ is a hyperparameter that scales the hidden dimension of the Per-token FFN (e.g., if $k = 4$, the hidden dimension is $4D$).
- $\operatorname{Gelu}$ is the Gaussian Error Linear Unit activation function, applied element-wise after the first linear layer.

The Per-token FFN module can be summarized as a parallel application across all tokens:
$ \mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_T = \mathrm{PFFN} \left( \mathbf{s}_1, \mathbf{s}_2, ..., \mathbf{s}_T \right) $
which is equivalent to applying each token's individual FFN to that token:
$ \mathrm{PFFN} \left( \mathbf{s}_1, ..., \mathbf{s}_T \right) = \left( f_{\mathrm{pffn}}^{1,2} \left( \operatorname{Gelu} \left( f_{\mathrm{pffn}}^{1,1} ( \mathbf{s}_1 ) \right) \right), ..., f_{\mathrm{pffn}}^{T,2} \left( \operatorname{Gelu} \left( f_{\mathrm{pffn}}^{T,1} ( \mathbf{s}_T ) \right) \right) \right) $
That is, $f_{\mathrm{pffn}}^{t,1}$ and $f_{\mathrm{pffn}}^{t,2}$ are applied to each $\mathbf{s}_t$ independently.
Key distinctions:
- Vs. Parameter-All-Shared FFN: Per-token FFN significantly enhances modeling ability by introducing more parameters (each token has its own FFN) while keeping the computational complexity per token unchanged (each token still passes through one FFN sequence).
- Vs. MMoE Experts: In MMoE, many experts typically process the same input, with a gating network deciding which experts to activate. In Per-token FFN, each FFN sees a distinct token input and is dedicated to transforming that specific token's representation.
- Vs. Transformer FFN: In a vanilla Transformer, different input tokens share one FFN (parameters are shared across token positions). RankMixer simultaneously splits the inputs (into tokens) and the parameters (into Per-token FFNs), which is beneficial for learning diversity in different feature subspaces.
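Despite the per-token parameter isolation, the whole PFFN layer can still be executed as a few large batched matrix multiplies, which is part of what makes the design GPU-friendly. A short sketch with illustrative dimensions (not the paper's implementation):

```python
# Per-token FFN as batched matmuls: every token t has its own W1[t], W2[t].
import torch
import torch.nn.functional as F

B, T, D, k = 2, 8, 64, 4                 # batch, tokens, hidden dim, expansion ratio
s = torch.randn(B, T, D)

W1 = torch.randn(T, D, k * D) * 0.02     # per-token first-layer weights
b1 = torch.zeros(T, k * D)
W2 = torch.randn(T, k * D, D) * 0.02     # per-token second-layer weights
b2 = torch.zeros(T, D)

h = torch.einsum("btd,tdh->bth", s, W1) + b1        # token-specific linear layer 1
h = F.gelu(h)
v = torch.einsum("bth,thd->btd", h, W2) + b2        # token-specific linear layer 2
print(v.shape)                                       # torch.Size([2, 8, 64])
```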
4.2.4. Sparse MoE in RankMixer
To further boost the ROI of large-scale models, the dense FFNs within each Per-token FFN module can be replaced with Sparse Mixture-of-Experts (MoE) blocks. This allows the model's total capacity (number of parameters) to grow substantially while the computational cost per inference remains roughly constant because only a small subset of experts are activated for any given input.
However, vanilla Sparse-MoE typically degrades when applied directly to RankMixer, for two main reasons:

- Uniform top-k expert routing: Traditional top-k selection in MoE treats all feature tokens equally. This can lead to inefficient allocation, where budget is wasted on low-information tokens while high-information tokens may not receive sufficient expert capacity, hindering the model's ability to capture nuanced differences.
- Expert under-training: Since Per-token FFNs already multiply the parameter count by the number of tokens (T), adding non-shared experts further explodes the potential expert count. If routing is highly imbalanced, many experts may rarely be activated, leading to under-trained or "dying" experts that contribute little to the model's performance.

To address these challenges, RankMixer combines two complementary training strategies:

- ReLU Routing: To grant tokens flexible expert counts (rather than a fixed k) while keeping the routing differentiable during training, RankMixer replaces the common top-k routing with a ReLU gate combined with an adaptive $\ell_1$ penalty. Given the $j$-th expert $e_{i,j}$ for token $\mathbf{s}_i$ and a router $h$:
  $ G_{i,j} = \mathrm{ReLU} \big( h ( \mathbf{s}_i ) \big)_j, \quad \mathbf{v}_i = \sum_{j=1}^{N_e} G_{i,j} \cdot e_{i,j} ( \mathbf{s}_i ) $
  Where:
  - $G_{i,j}$ is the activation gate value of the $j$-th expert for the $i$-th token. The ReLU ensures that only experts with a positive router score are activated.
  - $h(\mathbf{s}_i)$ is the output of the router network (typically a linear layer) for token $\mathbf{s}_i$, producing a score for each expert.
  - $\mathbf{v}_i$ is the final output of the MoE layer for token $i$, calculated as a weighted sum of the outputs of the activated experts.
  - $e_{i,j}(\mathbf{s}_i)$ is the output of the $j$-th expert when processing token $\mathbf{s}_i$.
  - $N_e$ is the number of experts available per token.

  This ReLU Routing mechanism lets more experts be activated for high-information tokens (tokens that require more complex modeling) and fewer for simpler ones, improving parameter efficiency. The sparsity (how many experts are activated on average) is controlled by an adaptive $\ell_1$ penalty added to the overall loss function:
  $ \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{reg}}, \quad \mathcal{L}_{\mathrm{reg}} = \sum_{i=1}^{N_t} \sum_{j=1}^{N_e} G_{i,j} $
  Where:
  - $\mathcal{L}$ is the total loss minimized during training.
  - $\mathcal{L}_{\mathrm{task}}$ is the primary task-specific loss (e.g., cross-entropy for classification).
  - $\lambda$ is a coefficient controlling the strength of the regularization.
  - $\mathcal{L}_{\mathrm{reg}}$ is the $\ell_1$ regularization loss, which penalizes the sum of all gate values $G_{i,j}$. By adjusting $\lambda$, the average active-expert ratio can be steered toward a desired budget, encouraging sparsity.
  - $N_t$ is the number of tokens.

  A minimal sketch of the ReLU-routing idea follows below.

- Dense-training / Sparse-inference (DTSI-MoE): Inspired by [19], this strategy uses two separate routers during training: a dense-training router and a sparse-inference router.
  - Both routers are updated throughout the training process.
  - Crucially, the $\ell_1$ regularization loss (which encourages sparsity) is applied only to the sparse-inference router. The dense-training router can therefore activate experts more densely during training without direct sparsity pressure, so all experts receive sufficient gradient updates and under-training is prevented.
  - During inference, only the sparse-inference router is used. Since it was trained with the sparsity regularization, it selects a sparse set of experts, reducing inference cost.

  The DTSI-MoE approach ensures that experts do not suffer from under-training (a common problem in Sparse-MoE due to imbalanced routing or aggressive sparsity) while still enabling significant reductions in inference cost by activating only a subset of experts.
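Here is a minimal sketch of ReLU routing with an L1 penalty on the gates. The shared (rather than per-token) experts, the placeholder loss, and the penalty weight are illustrative assumptions; the DTSI dense router is only indicated in a comment.

```python
# Illustrative ReLU routing over N_e experts for T tokens (shapes/values are made up).
import torch
import torch.nn as nn

T, D, N_e = 8, 64, 4
s = torch.randn(2, T, D)                                  # token representations

router = nn.Linear(D, N_e)                                # h(s_i): one score per expert
experts = nn.ModuleList([nn.Linear(D, D) for _ in range(N_e)])  # e_{i,j}, simplified/shared

G = torch.relu(router(s))                                 # [B, T, N_e] gate values, >= 0
expert_out = torch.stack([e(s) for e in experts], dim=-1) # [B, T, D, N_e]
v = (expert_out * G.unsqueeze(2)).sum(-1)                 # sum_j G_{i,j} * e_{i,j}(s_i)

task_loss = v.pow(2).mean()                               # placeholder task loss
l1_reg = G.sum(dim=(1, 2)).mean()                         # adaptive L1 penalty on gates
loss = task_loss + 1e-3 * l1_reg                          # lambda steers the sparsity budget
# DTSI-MoE (sketch): keep a second, dense router without the L1 term during training,
# and serve only the sparse (regularized) router at inference.
print(G.shape, v.shape, float(loss))
```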
4.2.5. Scaling Up Directions
RankMixer is inherently designed as a highly parallel and extensible architecture. Its parameter count and computational cost can be expanded along four orthogonal axes, allowing flexible scaling strategies:

- Token count ($T$): increasing the number of feature tokens allows a more granular representation of the input features.
- Model width ($D$): increasing the hidden dimension of each token allows richer representations.
- Layers ($L$): stacking more RankMixer blocks lets the model learn deeper and more complex feature interactions.
- Expert numbers ($N_e$): in the Sparse-MoE variant, increasing the number of experts per Per-token FFN significantly boosts model capacity.

For a full-dense-activated version of RankMixer (i.e., without Sparse-MoE), the approximate parameter count and forward FLOPs for one sample are:
$ \#\mathrm{Param} \approx 2kLTD^2 $
$ \mathrm{FLOPs} \approx 4kLTD^2 $
Where:
- $\#\mathrm{Param}$ is the approximate total number of trainable parameters in the dense part of the model.
- $\mathrm{FLOPs}$ is the approximate number of floating-point operations for one forward pass.
- $k$ is the scale ratio of the FFN hidden dimension (the inner dimension of the FFN is $kD$).
- $L$ is the number of RankMixer layers.
- $T$ is the number of tokens.
- $D$ is the hidden dimension of each token.

In the Sparse-MoE version, the effective parameters and computation per sample are determined by the sparsity ratio $s$:
$ s = \frac{\#\mathrm{Activated\ Param}}{\#\mathrm{Total\ Param}} $
Where:
- $\#\mathrm{Activated\ Param}$ is the number of parameters actively used during a single forward pass (i.e., the parameters of the activated experts).
- $\#\mathrm{Total\ Param}$ is the total number of parameters in the MoE layer (across all experts).

This allows a large total parameter count for high capacity while keeping the inference cost controlled, since only a fraction of the experts is activated.
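As a quick sanity check of these formulas, the snippet below evaluates the two estimates for hypothetical (k, L, T, D) settings; these values are placeholders for illustration, not the configurations used in the paper.

```python
# Back-of-the-envelope check of the dense scaling formulas (illustrative values only).
def rankmixer_dense_cost(k: int, L: int, T: int, D: int):
    params = 2 * k * L * T * D ** 2             # ~ #Param of the per-token FFNs
    flops_per_sample = 4 * k * L * T * D ** 2   # ~ forward FLOPs for one sample
    return params, flops_per_sample

for k, L, T, D in [(4, 2, 16, 512), (4, 3, 32, 1024)]:   # hypothetical configs
    p, f = rankmixer_dense_cost(k, L, T, D)
    print(f"k={k} L={L} T={T} D={D}: ~{p/1e6:.0f}M params, ~{f/1e9:.2f} GFLOPs/sample")
```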
5. Experimental Setup
5.1. Datasets
The offline experiments for RankMixer were conducted on a large-scale, proprietary dataset sourced from the Douyin recommendation system (the Chinese version of TikTok).

- Source: The data is derived directly from Douyin's online logs, capturing real user interactions and feedback labels. This ensures the experiments are highly relevant to a production environment.
- Scale: The training data is massive, covering trillions of records per day; experiments were conducted on data collected over a two-week period, indicating a dataset of immense volume.
- Characteristics: The dataset includes a rich variety of features, totaling over 300 features. These encompass:
  - Numerical features: e.g., view duration, number of likes.
  - ID features: e.g., billions of user IDs and hundreds of millions of video IDs. These categorical IDs are converted into embeddings.
  - Cross features: interactions between different raw features.
  - Sequential features: user behavior sequences over time.
- Domain: short-video content recommendation, a highly dynamic and personalized environment.

The choice of this dataset is highly effective for validating the method's performance because:

- Real-world Relevance: Using production data from a massive platform like Douyin ensures that the model's performance and efficiency gains are directly applicable and meaningful in a real-world industrial setting.
- Scale for Stress Testing: The trillion-scale data and billions of IDs provide a robust environment to test RankMixer's scaling laws and efficiency under extreme load, confirming its ability to handle immense data volume and high-cardinality features.
- Feature Richness: The 300+ diverse features allow comprehensive evaluation of RankMixer's ability to model heterogeneous feature interactions effectively, which is a core design goal.
5.2. Evaluation Metrics
The paper uses a combination of performance metrics to assess model quality and efficiency metrics to evaluate computational cost and hardware utilization.
5.2.1. Performance Metrics

- AUC (Area Under the Curve):
  - Conceptual Definition: AUC is a widely used metric for binary classification problems. It represents the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. In recommendation, a higher AUC indicates better discriminatory power in distinguishing items a user will interact with (positive) from those they won't (negative). It summarizes the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across all classification thresholds. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates performance equivalent to random guessing.
  - Mathematical Formula: AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR at various threshold settings:
    $ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(t)\right) dt $
    (In practice it is computed non-parametrically, e.g., via the Mann-Whitney U statistic or by summing trapezoids under the ROC curve.)
  - Symbol Explanation:
    - $\mathrm{TPR}$ (True Positive Rate), also known as Recall or Sensitivity: $\mathrm{TPR} = \frac{TP}{TP + FN}$, where $TP$ is the number of true positives (correctly predicted positives) and $FN$ the number of false negatives (actual positives predicted negative).
    - $\mathrm{FPR}$ (False Positive Rate): $\mathrm{FPR} = \frac{FP}{FP + TN}$, where $FP$ is the number of false positives (actual negatives predicted positive) and $TN$ the number of true negatives (correctly predicted negatives).
    - $\mathrm{FPR}^{-1}(t)$ denotes the threshold at which an FPR of $t$ is achieved.
- UAUC (User-level AUC):
  - Conceptual Definition: UAUC addresses a common issue with AUC in recommendation systems: users with many interactions (or many positive labels) can disproportionately influence the overall AUC score. UAUC calculates the AUC independently for each user and then averages these per-user AUC scores. This gives fair weight to individual users, especially those with fewer interactions, providing a more robust measure of personalized recommendation quality (a computation sketch follows after this list).
  - Mathematical Formula: $ \mathrm{UAUC} = \frac{1}{N_{\mathrm{users}}} \sum_{u=1}^{N_{\mathrm{users}}} \mathrm{AUC}_u $
  - Symbol Explanation:
    - $N_{\mathrm{users}}$: the total number of unique users in the evaluation set.
    - $\mathrm{AUC}_u$: the AUC score calculated specifically for user $u$, considering only that user's interactions.
- Finish/Skip AUC/UAUC: These metrics evaluate the model's ability to predict two distinct user behaviors in video consumption:
  - Finish: a label indicating whether a user completed watching a video.
  - Skip: a label indicating whether a user quickly slid to the next video (i.e., did not watch it for a meaningful duration).
  - The paper notes that an AUC increase of 0.0001 (or 0.01%) is considered a confidently significant improvement in industrial RS settings, highlighting the sensitivity and competitive nature of these systems.
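For concreteness, here is a small sketch of computing UAUC with scikit-learn on synthetic data; skipping users whose labels are all one class is an assumption of this example (per-user AUC is undefined for them), not a detail stated in the paper.

```python
# UAUC: average of per-user AUCs (synthetic example data).
import numpy as np
from collections import defaultdict
from sklearn.metrics import roc_auc_score

user_ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
labels   = np.array([1, 0, 1, 0, 1, 0, 1, 1])            # e.g. "finish" labels
scores   = np.array([0.9, 0.2, 0.7, 0.3, 0.8, 0.4, 0.6, 0.5])

by_user = defaultdict(list)
for u, y, p in zip(user_ids, labels, scores):
    by_user[u].append((y, p))

aucs = []
for u, rows in by_user.items():
    y = np.array([r[0] for r in rows])
    p = np.array([r[1] for r in rows])
    if y.min() == y.max():        # skip users with only one label class
        continue
    aucs.append(roc_auc_score(y, p))

print(f"UAUC over {len(aucs)} users: {np.mean(aucs):.4f}")
```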
5.2.2. Efficiency Metrics

- Dense-Param:
  - Conceptual Definition: The total number of parameters in the "dense" part of the model. In DLRMs, parameters are typically divided into sparse embedding parameters (e.g., for user IDs and item IDs, which can be very numerous but are looked up sparsely) and dense network parameters (e.g., weights and biases in MLPs, FFNs, attention layers). Dense-Param counts only the parameters in the dense computational layers, excluding the potentially much larger sparse embedding tables; it is a key indicator of the model's complexity and memory footprint during active computation.
  - Mathematical Formula: No universal formula, as it depends on the architecture; it is the sum of all weights and biases in the MLP, FFN, attention, or other dense layers.
  - Symbol Explanation: Dense-Param is a scalar count.
- Training FLOPs/Batch:
  - Conceptual Definition: The number of floating-point operations (FLOPs) required to run a single batch of data through the model during training; it directly represents the computational cost of training. A batch size of 512 is used in the paper. Lower FLOPs/Batch implies faster training or less computational resource consumption per step.
  - Mathematical Formula: No single formula; it is an architecture-dependent sum.
  - Symbol Explanation: FLOPs (floating-point operations) is a unit of computational work.
- MFU (Model FLOPs Utilization):
  - Conceptual Definition: MFU measures how efficiently a model's implementation uses specific hardware (e.g., a GPU): the ratio of the model's actual computational workload (FLOPs consumption) to the hardware's theoretical peak FLOPs capacity. A high MFU indicates that the model's operations suit the hardware's architecture (exploiting parallelism, compute-bound operations) and waste few cycles; a low MFU usually points to bottlenecks such as memory access (memory-bound operations). A small numeric example follows after this list.
  - Mathematical Formula: $ \mathrm{MFU} = \frac{\text{Actual FLOPs consumption}}{\text{Theoretical hardware FLOPs capacity}} $
  - Symbol Explanation:
    - Actual FLOPs consumption: the floating-point operations actually performed by the model per unit time for a given workload (e.g., forward/backward passes).
    - Theoretical hardware FLOPs capacity: the maximum number of floating-point operations the hardware (e.g., GPU) can theoretically perform per second under ideal conditions.
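A tiny arithmetic sketch of the MFU definition; the throughput and peak-FLOPs numbers below are made up for illustration and are not the paper's measurements.

```python
# MFU = achieved FLOP/s divided by the hardware's theoretical peak FLOP/s.
flops_per_batch = 2.1e12          # model work per batch (illustrative)
batches_per_second = 40           # measured throughput (illustrative)
peak_flops = 312e12               # e.g. a GPU with ~312 TFLOP/s fp16 peak (illustrative)

achieved = flops_per_batch * batches_per_second
mfu = achieved / peak_flops
print(f"MFU = {mfu:.1%}")         # -> 26.9%
```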
5.3. Baselines
To thoroughly evaluate RankMixer, the authors compare it against several widely recognized state-of-the-art DLRM baselines, categorized by their feature-interaction modeling strategies:

- DLRM-MLP (base): The fundamental baseline, a vanilla Deep Learning Recommendation Model in which a simple Multi-Layer Perceptron (MLP) handles feature crossing and dense interactions.
- DLRM-MLP-100M: A scaled-up version of the basic DLRM-MLP with approximately 100 million parameters, demonstrating the effect of simply increasing MLP capacity.
- DCNv2 [26]: A successor to the original Deep & Cross Network and a state-of-the-art model for explicit feature crossing, using a cross network to efficiently learn bounded-degree feature interactions.
- RDCN [3]: Another advanced feature-cross model cited as state of the art for feature crossing (a DCN variant; the abstract context does not spell out its full name).
- MoE: A general Mixture-of-Experts model scaled up with multiple parallel experts, used to evaluate the general effectiveness of MoE for increasing DLRM capacity.
- AutoInt [23]: Combines heterogeneous self-attention layers to automatically learn feature interactions, treating features as elements in an attention mechanism.
- Hiformer [9]: An advanced self-attention-based model that combines heterogeneous self-attention layers with low-rank approximation matrix computations, aiming for efficient and flexible interaction patterns.
- DHEN [31] (Deep and Hierarchical Ensemble Network): Combines different types of feature-cross blocks (e.g., DCN, self-attention, Factorization Machine (FM), Logistic Regression (LR)) and stacks them in multiple layers to learn diverse, hierarchical feature interactions.
- Wukong [30]: Specifically investigates the scaling laws of feature-interaction mechanisms. It follows an architecture similar to DHEN but emphasizes Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB) for learning interactions.

Training Environment: All experiments were conducted on hundreds of GPUs in a hybrid distributed training framework.

- Sparse part: updated asynchronously, so updates to the large embedding tables (sparse parameters) do not block dense-model training, allowing efficient handling of high-cardinality features.
- Dense part: updated synchronously, ensuring the dense network parameters are updated consistently across all GPUs in parallel.
- Optimizer hyperparameters: consistent across all models.
  - Dense part: RMSProp optimizer with a learning rate of 0.01.
  - Sparse part: Adagrad optimizer.

These baselines are representative because they cover a wide spectrum of modern DLRM architectures, from basic MLPs to advanced explicit feature crossing, self-attention-based models, and MoE approaches, all of which are relevant for learning complex feature interactions and exploring scaling strategies in recommendation systems.
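A minimal sketch of the optimizer assignment described above (dense parameters with RMSprop at lr 0.01, sparse embedding parameters with Adagrad); the toy model is hypothetical, the sparse learning rate is left at its default since the paper does not state it, and the asynchronous/synchronous distributed update logic is omitted.

```python
# Dense parameters -> RMSprop (lr=0.01); sparse embedding parameters -> Adagrad.
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 16, sparse=True)   # stand-in for the sparse ID tables
dense_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

sparse_opt = torch.optim.Adagrad(embedding.parameters())        # handles sparse grads
dense_opt = torch.optim.RMSprop(dense_net.parameters(), lr=0.01)

ids = torch.randint(0, 10_000, (32,))
loss = dense_net(embedding(ids)).pow(2).mean()       # placeholder objective
loss.backward()
sparse_opt.step()
dense_opt.step()
```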
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate RankMixer's superior performance, scaling capabilities, and efficiency compared to existing state-of-the-art models in industrial recommendation systems.
6.1.1. Comparison with SOTA Methods
The initial comparison focuses on models with similar parameter sizes (around 100 million) to evaluate structural performance under comparable computational budgets.
The following are the results from Table 1 of the original paper:
| Model | Finish AUC↑ | Finish UAUC↑ | Skip AUC↑ | Skip UAUC↑ | Params | FLOPs/Batch |
| DLRM-MLP (base) | 0.8554 | 0.8270 | 0.8124 | 0.7294 | 8.7 M | 52 G |
| DLRM-MLP-100M | +0.15% | − | +0.15% | − | 95 M | 185 G |
| DCNv2 | +0.13% | +0.13% | +0.15% | +0.26% | 22 M | 170 G |
| RDCN | +0.09% | +0.12% | +0.10% | +0.22% | 22.6 M | 172 G |
| MoE | +0.09% | +0.12% | +0.08% | +0.21% | 47.6 M | 158 G |
| AutoInt | +0.10% | +0.14% | +0.12% | +0.23% | 19.2 M | 307 G |
| DHEN | +0.18% | +0.26% | +0.36% | +0.52% | 22 M | 158 G |
| HiFormer | +0.48% | — | — | — | 116 M | 326 G |
| Wukong | +0.29% | +0.29% | +0.49% | +0.65% | 122 M | 442 G |
| RankMixer-100M | +0.64% | +0.72% | +0.86% | +1.33% | 107 M | 233 G |
| RankMixer-1B | +0.95% | +1.22% | +1.25% | +1.82% | 1.1B | 2.1T |
Analysis of Table 1:

- DLRM-MLP baseline: The base DLRM-MLP model has 8.7M parameters and 52G FLOPs/Batch. Scaling it to 95M parameters (DLRM-MLP-100M) yields only a limited AUC gain (+0.15%), confirming that simply increasing MLP capacity is insufficient without architectural improvements tailored to recommendation data.
- Traditional cross-structure models: DCNv2, RDCN, AutoInt, and DHEN show some improvement over the base DLRM-MLP. However, models like AutoInt (19.2M params, 307G FLOPs) and HiFormer (116M params, 326G FLOPs) exhibit a high ratio of FLOPs to parameters, an inefficiency that constrains their ability to scale; their designs incur large computational costs even at moderate parameter sizes. DHEN and MoE are more FLOPs-efficient for their parameter counts, but their AUC gains remain moderate.
- RankMixer-100M superiority: RankMixer-100M (107M parameters, 233G FLOPs) achieves the best performance among models in the 100M-parameter range, with +0.64% AUC on finish and +0.86% AUC on skip. Its FLOPs are moderate compared to AutoInt and HiFormer, reflecting a better balance between model capacity and computational load.
- RankMixer-1B scaling: When scaled to 1.1 billion parameters, RankMixer-1B further boosts performance, reaching +0.95% AUC on finish and +1.25% AUC on skip. This demonstrates RankMixer's strong scaling capability, albeit with a substantially larger FLOPs/Batch (2.1T).
- Comparison with SOTA scaling models: Wukong (122M parameters, 442G FLOPs) shows decent performance, but RankMixer-100M still outperforms it with lower FLOPs, highlighting RankMixer's efficiency in achieving higher performance with less computation than other advanced scaling models.
6.1.2. Scaling Laws of Different Models
The paper investigates the scaling laws of various models, observing how performance (AUC gain) changes with increasing parameters and FLOPs.
The following figure (Figure 2 from the original paper) shows the scaling law curves observed from both parameter size and Flops:
Figure description: the figure plots AUC gain against dense parameters and against GFLOPs. The left panel shows dense parameters (M) versus AUC gain (%); the right panel shows GFLOPs per batch versus AUC gain (%). Several models (e.g., RankMixer, DLRM, Hiformer) are marked and compared under different settings.
Analysis of Figure 2:
- The x-axis (parameters on the left, FLOPs on the right) uses a logarithmic scale, emphasizing behavior over orders of magnitude.
- RankMixer's Steepest Scaling: RankMixer consistently shows the steepest scaling-law curve for both parameters (left plot) and FLOPs (right plot). For a given increase in parameters or FLOPs, RankMixer achieves a greater AUC gain than all other models, validating its design principles for efficient scaling.
- Wukong vs. RankMixer: While Wukong exhibits a relatively steep parameter curve, its computational cost (as shown in the AUC-vs-FLOPs curve) grows even more rapidly. Consequently, the performance gap between Wukong and RankMixer is larger when compared against FLOPs, indicating RankMixer's superior compute-efficiency.
- HiFormer Performance: HiFormer's performance is slightly inferior to RankMixer's. This is attributed to its reliance on feature-level token segmentation and on attention, which, as discussed in the methodology, can be less efficient for heterogeneous recommendation data.
- DHEN and MoE Scalability: DHEN's scaling is not ideal, reflecting the limited scalability of its specific cross-structure. MoE's strategy of scaling by adding experts faces challenges in maintaining expert balance, leading to suboptimal scaling when those issues are not properly addressed (which RankMixer aims to do).
- Scaling Directions: The paper also explored scaling RankMixer by increasing width (D), feature tokens (T), and layers (L). A conclusion shared with LLM scaling laws was observed: model quality primarily correlates with the total number of parameters, and the different scaling directions (depth L, width D, tokens T) yield almost identical performance for a given parameter count. From a compute-efficiency standpoint, a larger hidden dimension (increasing D) generates larger matrix-multiply shapes and therefore a higher MFU than simply stacking more layers, as the short benchmark sketch below illustrates. Based on these insights, the final configurations fix the width D, token count T, and layer count L for RankMixer-100M and RankMixer-1B (the specific values are listed in the original paper).
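To make this compute-efficiency point concrete, the short PyTorch sketch below (not from the paper; the shapes are hypothetical and a CUDA GPU is assumed) measures the throughput actually achieved by a wide versus a narrow GEMM. The wide shape typically reaches a much larger fraction of the device's peak, i.e., a higher MFU.

```python
import time
import torch

def measured_tflops(m: int, k: int, n: int, iters: int = 50) -> float:
    """Achieved TFLOP/s for an (m, k) x (k, n) fp16 GEMM on the current GPU."""
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    for _ in range(5):                       # warm-up
        _ = a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return 2 * m * k * n / elapsed / 1e12    # 2*m*k*n FLOPs per GEMM

# Same batch of tokens, but wide vs. narrow hidden dimensions (hypothetical sizes):
print("wide  (D=2048):", measured_tflops(4096, 2048, 2048), "TFLOP/s")
print("narrow (D=256):", measured_tflops(4096, 256, 256), "TFLOP/s")
```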
6.2. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| Setting | ΔAUC |
| w/o skip connections | -0.07% |
| w/o multi-head token mixing | -0.50% |
| w/o layer normalization | -0.05% |
| Per-token FFN → shared FFN | -0.31% |
The following are the results from Table 3 of the original paper:
| Routing strategy | ΔAUC | ΔParams | ΔFLOPs |
| All-Concat-MLP | -0.18% | ~ 0.0% | ~ 0.0% |
| All-Share | -0.25% | ~ 0.0% | ~ 0.0% |
| Self-Attention | -0.03% | +16% | +71.8% |
The following are the results from Table 4 of the original paper:
| User group | Douyin app: Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ | Douyin lite: Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
| Overall | +0.2908% | +1.0836% | +2.3852% | +1.9874% | +0.7886% | +0.1968% | +0.9869% | +1.1318% | +2.0744% | +1.1338% |
| Low-active | +1.7412% | +3.6434% | +8.1641% | +4.5393% | +2.9368% | +0.5389% | +2.1831% | +4.4486% | +3.3958% | +0.9595% |
| Middle-active | +0.7081% | +1.5269% | +2.5823% | +2.5062% | +1.2266% | +0.4235% | +1.9011% | +1.7491% | +2.6568% | +0.6782% |
| High-active | +0.1445% | +0.6259% | +1.828% | +1.4939% | +0.4151% | +0.0929% | +1.1356% | +1.8212% | +1.7683% | +2.3683% |
The following are the results from Table 5 of the original paper:
| Advertising metric | ΔAUC↑ | ADVV↑ |
| Lift | +0.73% | +3.90% |
The following are the results from Table 6 of the original paper:
| Metric | OnlineBase-16M | RankMixer-1B | Change |
| #Param | 15.8M | 1.1B | ↑ 70X |
| FLOPs | 107G | 2106G | ↑ 20.7X |
| Flops/Param(G/M) | 6.8 | 1.9 | ↓3.6X |
| MFU | 4.47% | 44.57% | ↑ 10X |
| Hardware FLOPs | fp32 | fp16 | ↑2X |
| Latency | 14.5ms | 14.3ms | - |
6.3. Ablation Studies / Parameter Analysis
6.3.1. Ablation on Components of RankMixer
The paper conducted ablation studies on the RankMixer-100M model to understand the contribution of its core architectural components.
Analysis of Table 2:
- w/o skip connections: Removing skip connections led to a 0.07% AUC decrease, indicating that residual connections matter for RankMixer, likely by contributing to training stability and mitigating vanishing or exploding gradients in deep networks.
- w/o multi-head token mixing: The most significant drop (-0.50% AUC) occurred when removing Multi-Head Token Mixing, highlighting its critical role. Without it, each Per-token FFN would only model partial features with no interaction or information exchange across tokens, losing global context and the ability to learn cross-feature interactions.
- w/o layer normalization: Removing Layer Normalization resulted in a 0.05% AUC decrease. Like skip connections, LayerNorm contributes to training stability and prevents performance degradation.
- Per-token FFN → shared FFN: Replacing Per-token FFNs with a shared FFN (all tokens processed by the same FFN parameters) led to a 0.31% AUC decrease. This strongly validates the importance of parameter isolation for different feature subspaces: a shared FFN forces diverse semantic features through the same parameters, risking feature domination and suboptimal modeling of heterogeneous interactions.
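To make the roles of these components concrete, here is a minimal PyTorch sketch of one RankMixer-style block (skip connection, LayerNorm, parameter-free multi-head token mixing, and per-token FFNs). It is an illustration based on the descriptions above, assuming the simplest setting where the number of heads equals the number of tokens; names such as RankMixerBlock and ffn_mult are invented for the example, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class RankMixerBlock(nn.Module):
    """Sketch of one block: token mixing + per-token FFNs with a residual path."""

    def __init__(self, num_tokens: int, dim: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        assert dim % num_heads == 0 and num_tokens == num_heads, \
            "simplest variant: one regrouped token per head"
        self.num_heads = num_heads
        self.norm = nn.LayerNorm(dim)
        # One dedicated FFN per token: parameter isolation across feature subspaces.
        self.per_token_ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_mult * dim),
                          nn.ReLU(),
                          nn.Linear(ffn_mult * dim, dim))
            for _ in range(num_tokens))

    def token_mix(self, x: torch.Tensor) -> torch.Tensor:
        # Parameter-free mixing: split each token into H head chunks, then regroup
        # head h of every token into new token h, so each FFN sees all input tokens.
        B, T, D = x.shape
        heads = x.reshape(B, T, self.num_heads, D // self.num_heads)
        return heads.transpose(1, 2).reshape(B, self.num_heads, -1)  # (B, H, T*D/H)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = self.token_mix(self.norm(x))          # equals (B, T, D) when H == T
        out = torch.stack([ffn(mixed[:, t]) for t, ffn in enumerate(self.per_token_ffns)],
                          dim=1)
        return x + out                                # skip (residual) connection

tokens = torch.randn(32, 8, 256)                      # batch 32, T=8 tokens, D=256
print(RankMixerBlock(num_tokens=8, dim=256, num_heads=8)(tokens).shape)
```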
6.3.2. Token2FFN Routing-strategy Comparison
The paper further analyzed different strategies for routing feature tokens to FFNs, comparing them against the proposed Multi-Head Token Mixing.
Analysis of Table 3:
- All-Concat-MLP: This strategy concatenates all tokens into a single large vector, processes it through one large MLP, and then splits the output back into the same number of tokens. It resulted in a 0.18% AUC drop, showing the difficulty of learning effectively from one extremely large matrix (a single MLP over all concatenated features) and the weakening of local information learning when everything is merged too early.
- All-Share: Here the entire input vector is shared and fed to each Per-token FFN (similar to how experts in an MoE might share an input). This led to a significant 0.25% AUC decline, emphasizing the importance of feature-subspace splitting and independent modeling of distinct tokens; an all-shared input fails to capture the diversity of feature semantics.
- Self-Attention: Applying self-attention between tokens for routing yielded only a 0.03% AUC drop, but it came with a substantial increase in parameters (+16%) and FLOPs (+71.8%). This confirms the paper's argument that self-attention is computationally expensive and inefficient for heterogeneous feature subspaces in RS, failing to meaningfully outperform Multi-Head Token Mixing despite its higher cost, and it highlights the difficulty of learning meaningful similarity across hundreds of different feature subspaces via inner products. A rough cost comparison follows.
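As a rough back-of-the-envelope comparison (the token count T and width D below are hypothetical, and the cost model is a standard approximation rather than a figure from the paper), the snippet illustrates why routing via self-attention adds substantial compute while multi-head token mixing, being a pure reshape, adds essentially none:

```python
# Approximate extra multiply-accumulates per sample for self-attention routing.
T, D = 16, 512                                   # hypothetical token count and width
attn_macs = 4 * T * D * D + 2 * T * T * D        # Q/K/V/output projections + score/value ops
mixing_macs = 0                                  # multi-head token mixing is a reshape only
print(f"self-attention: ~{attn_macs / 1e6:.1f}M MACs per sample; token mixing: {mixing_macs}")
```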
6.3.3. Sparse-MoE Scalability and Expert Balance
The paper investigates the scalability of the Sparse-MoE variant of RankMixer and the effectiveness of its expert balancing strategies.
The following figure (Figure 3 from the original paper) plots offline AUC gains versus Sparsity of SMoE:
The figure shows model performance (AUC) at different sparsity ratios, comparing several RankMixer variants, including Dense-RankMixer-1B and Vanilla-SMoE-RankMixer. Performance degrades as sparsity increases.
Analysis of Figure 3 (Scalability):
- The x-axis represents the activation ratio of experts, starting from 1 (the dense model) and decreasing toward highly sparse settings.
- Dense-training + ReLU-routed SMoE (RankMixer's approach): This variant demonstrates remarkable robustness. Even under aggressive sparsity, with only a small fraction of experts activated, it preserves almost all the accuracy of the 1B dense model. This is crucial for practical deployment, as it enables substantial parameter scaling with minimal AUC loss and significant inference-time savings (a reported 450% throughput improvement), validating Sparse-MoE as a viable path for scaling RankMixer to even larger parameter counts.
- Vanilla SMoE: Its performance drops monotonically and significantly as fewer experts are activated, illustrating the fundamental problems of expert imbalance and under-training when Vanilla SMoE is applied without specialized routing and training strategies.
- Vanilla SMoE + Load-Balancing Loss: Adding a load-balancing loss reduces the degradation compared to pure Vanilla SMoE, but it still falls short of the DTSI + ReLU version. This suggests the core problem lies not just in distributing load but, more deeply, in ensuring experts are adequately trained and effectively utilized during sparse inference.

The following figure (Figure 4 from the original paper) shows the activated expert ratio for various tokens:
The image is a schematic showing trajectories for different RankMixer layers during training (the upper half for layer 1, the lower half for layer 2), with separate curves per layer that help illustrate the model's behavior and training stability.

Analysis of Figure 4 (Expert Balance and Diversity):
- This figure (labeled as "Activated expert ratio for various tokens", though it visually resembles a loss curve or per-layer activation distribution) indirectly supports the claim that DTSI + ReLU routing effectively resolves expert imbalance.
- The text describes that Dense-training ensures most experts receive sufficient gradient updates, preventing expert starvation (i.e., "dying experts" that are never activated).
- ReLU routing makes the activation ratio dynamic across tokens. The varying activation proportions shown for different layers (though the figure itself is somewhat ambiguous for directly reading off an "activated expert ratio") are meant to illustrate that some tokens (likely high-information ones) activate more experts while others (low-information) activate fewer. This adaptive allocation aligns well with the diverse and highly dynamic distribution of recommendation data, leading to more efficient utilization of expert capacity.
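To illustrate the idea of ReLU routing, below is a minimal sketch in which a per-token router with a ReLU gate activates a data-dependent subset of experts, so the number of active experts varies by token instead of being a fixed top-k. It only sketches the routing behavior described above (the dense-training / DTSI schedule is omitted), and module and parameter names are invented for the example; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReLURoutedMoE(nn.Module):
    """Sketch: a ReLU on the router logits zeroes out low-scoring experts per token."""

    def __init__(self, dim: int, num_experts: int, ffn_mult: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_mult * dim),
                          nn.ReLU(),
                          nn.Linear(ffn_mult * dim, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D). Gates are non-negative and sparse; a zero gate skips that expert.
        gates = torch.relu(self.router(x))                 # (B, num_experts)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            gate = gates[:, e:e + 1]
            active = gate.squeeze(-1) > 0                  # tokens routed to expert e
            if active.any():
                out[active] += gate[active] * expert(x[active])
        return out

tok = torch.randn(16, 128)
moe = ReLURoutedMoE(dim=128, num_experts=8)
avg_active = (torch.relu(moe.router(tok)) > 0).float().sum(dim=1).mean()
print(moe(tok).shape, "average active experts per token:", avg_active.item())
```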
6.3.4. Online Serving Cost
A critical achievement of RankMixer is its ability to scale model parameters by approximately 70x (from a previous 16M-parameter baseline to 1.1B parameters) while maintaining stable inference latency in a production environment.
The following are the results from Table 6 of the original paper:
| Metric | OnlineBase-16M | RankMixer-1B | Change |
| #Param | 15.8M | 1.1B | ↑ 70X |
| FLOPs | 107G | 2106G | ↑ 20.7X |
| Flops/Param(G/M) | 6.8 | 1.9 | ↓3.6X |
| MFU | 4.47% | 44.57% | ↑ 10X |
| Hardware FLOPs | fp32 | fp16 | ↑2X |
| Latency | 14.5ms | 14.3ms | - |
Analysis of Table 6 (Online Deployment Metrics):
The stability in latency despite the massive parameter increase is explained by the following formula and the three-pronged optimization strategy:
$$
\mathrm{Latency} = \frac{\#\mathrm{Param} \times (\mathrm{FLOPs}/\mathrm{Param}\ \mathrm{ratio})}{\mathrm{MFU} \times \mathrm{Theoretical\ Hardware\ FLOPs}}
$$
- FLOPs/Param ratio (↓3.6X):
  - The baseline model requires 6.8 GFLOPs per million parameters, whereas RankMixer-1B requires only 1.9 GFLOPs per million parameters.
  - RankMixer increased parameters by 70x while total FLOPs grew by only 20.7x, so its FLOPs/Param ratio dropped by roughly 3.6x (6.8 / 1.9 ≈ 3.6). This efficiency gain means RankMixer can accommodate significantly more parameters (over 3 times more) for the same FLOPs budget, primarily due to the Sparse-MoE architecture, in which not all parameters are activated per inference.
- Model FLOPs Utilization (MFU) (↑10X):
  - The baseline had a very low MFU of 4.47%, indicating it was largely memory-bound. RankMixer significantly improved MFU to 44.57% (a 10x increase). This was achieved by:
    - using large GEMM (General Matrix Multiply) shapes, which are highly efficient on GPUs;
    - employing a good parallel topology (e.g., fusing the parallel Per-token FFNs into a single GPU kernel);
    - reducing memory-bandwidth cost and overhead.
  - This shift from a memory-bound to a compute-bound regime is critical for GPU efficiency, meaning the GPU spends more time on actual computation rather than waiting for data from memory.
- Quantization (↑2X Theoretical Hardware FLOPs):
  - The primary computations in RankMixer (large matrix multiplications) are well-suited to half-precision (fp16) inference.
  - Moving from fp32 (single precision) to fp16 (half precision) inference effectively doubles the theoretical peak Hardware FLOPs capacity of GPUs, a hardware-level acceleration that RankMixer can leverage.

By synergistically combining a lower FLOPs/Param ratio, a much higher MFU, and the benefits of fp16 quantization, RankMixer was able to drastically increase its parameter count without increasing inference latency; a quick numerical check of this follows.
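Plugging the Table 6 numbers into the latency formula above (a back-of-the-envelope check, not a calculation from the paper) shows why latency stays essentially flat: the roughly 20x increase in FLOPs is almost exactly offset by the ~10x MFU gain and the 2x fp16 speed-up.

```python
# Relative latency = FLOPs ratio / (MFU ratio * hardware-peak ratio), per the formula above.
flops_ratio = 2106 / 107      # ~20x more FLOPs per batch (Table 6)
mfu_ratio = 44.57 / 4.47      # ~10x better utilization (Table 6)
hw_ratio = 2.0                # fp32 -> fp16 doubles theoretical peak FLOPs

print(round(flops_ratio / (mfu_ratio * hw_ratio), 2))  # ~0.99, consistent with 14.3 ms vs 14.5 ms
```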
6.4. Online Performance
To verify its universality, RankMixer was deployed and evaluated in online A/B tests across two core application scenarios of personalized ranking: Feed Recommendation and Advertising.
6.4.1. Feed Recommendation
The RankMixer-1B model was fully deployed on the Douyin Feed Recommendation ranking system; long-term A/B test results observed over 8 months are reported below.
Analysis of Table 4 (Online lift of RankMixer in Feed Recommendation):
- Overall Metrics: RankMixer-1B achieved statistically significant uplifts across all key business metrics for both Douyin app and Douyin lite.
  - Douyin app: Active Day↑ +0.2908%, Duration↑ +1.0836%, Like↑ +2.3852%, Finish↑ +1.9874%, Comment↑ +0.7886%.
  - Douyin lite: Similar positive trends, with Active Day↑ +0.1968%, Duration↑ +0.9869%, etc.
- Impact on User Activeness Levels: The improvements are not uniform across user activeness groups:
  - Low-active users: Showed the largest improvements, with +1.7412% in Active Days and +3.6434% in Duration on the main Douyin app. This demonstrates RankMixer's strong generalization capability and its effectiveness in engaging users who are typically harder to activate, potentially by better capturing their subtle preferences or by recommending more relevant long-tail content.
  - Middle-active users: Also saw significant gains, e.g., +0.7081% in Active Days.
  - High-active users: Still benefited, but with smaller percentage uplifts (e.g., +0.1445% in Active Days), likely because these users are already highly engaged.

  These results underscore RankMixer's ability to drive overall user engagement and retention.
6.4.2. Advertising
RankMixer was also tested in an advertising scenario, replacing the dense part of the previous 16M parameter baseline model.
Analysis of Table 5 (Online lift of RankMixer in Advertising):
- Advertising Performance: RankMixer delivered positive lifts:
  - ΔAUC↑ +0.73%: A significant improvement in the advertising model's quality.
  - ADVV↑ +3.90%: An increase in Advertiser Value, indicating higher revenue for the platform.

  This confirms RankMixer's versatility and effectiveness across different personalized ranking tasks, serving as a unified backbone for diverse applications.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces RankMixer, a novel, hardware-aware model architecture specifically designed to scale up ranking models in industrial recommender systems. RankMixer addresses the dual challenges of high computational cost and inefficient GPU utilization by legacy feature-crossing modules. Its core innovations include a Multi-head Token Mixing module for efficient cross-token interactions, Per-token FFNs for isolated and specialized modeling of heterogeneous feature subspaces, and a Sparse Mixture-of-Experts (MoE) variant with ReLU Routing and DTSI-MoE to enable massive parameter scaling without proportional increases in computational cost.
Experiments on a trillion-scale production dataset demonstrated RankMixer's superior scaling capabilities, boosting Model Flops Utilization (MFU) from 4.5% to 45% and achieving a 70x increase in model parameters (to 1.1 billion) while maintaining inference latency. Online A/B tests on Douyin's Feed Recommendation and Advertising platforms confirmed its universality and significant business impact, leading to 0.3% increase in user active days and 1.08% increase in total in-app usage duration, with notable gains among low-active users, all without increasing serving costs.
7.2. Limitations & Future Work
The paper explicitly points out that RankMixer provides a path to scale from current 1B parameters to future 10B-scale deployments without breaching cost budgets. While the paper doesn't explicitly list "Limitations" as a separate section, some implicit challenges or areas for future work can be inferred:
- Complexity of Tokenization: The semantic-based tokenization relies on domain knowledge. While effective, manually grouping features can be labor-intensive; future work could explore automated or learned semantic tokenization methods.
- Hyperparameter Tuning for MoE: While ReLU Routing and DTSI-MoE address key MoE challenges, tuning MoE-specific hyperparameters (e.g., the number of experts or the sparsity-regularization strength) can still be complex, especially at millions or billions of parameters.
- Generalizability to Other RS Paradigms: While demonstrated for feed recommendation and advertising, RankMixer's applicability to other RS paradigms (e.g., sequential recommendation beyond basic sequence features, cold-start scenarios, fairness-aware recommendation) could be further explored.
- Dynamic Adaptation to Feature Changes: Industrial RS features are constantly evolving. The model's ability to dynamically adapt its tokenization or Per-token FFN structure to new features or to changes in feature importance could be an area for enhancement.
7.3. Personal Insights & Critique
RankMixer presents a highly compelling and practical solution to the persistent challenge of scaling DLRMs in industrial settings.
- Innovative Hardware-Aware Design: The most significant insight is the explicit focus on hardware-aware design. Many academic papers optimize for AUC but overlook deployment realities. RankMixer's strategy of re-engineering Transformer-like components (Multi-head Token Mixing for efficient cross-token interaction, Per-token FFNs for isolated feature-subspace modeling) specifically to maximize MFU on GPUs is a critical differentiator. This pragmatic approach is essential for bridging the gap between research and production.
- Addressing Heterogeneity Effectively: The core critique of self-attention for heterogeneous feature spaces is astute. Inner-product similarity simply isn't well suited to disparate features like user IDs and video categories. RankMixer's Multi-head Token Mixing offers a clever, parameter-free alternative that achieves global information exchange without the pitfalls of self-attention in this context. The Per-token FFNs are equally important, providing dedicated "expert" networks for semantically distinct feature groups, which intuitively helps prevent feature domination.
- Robust MoE Implementation: The paper's solution to Sparse-MoE challenges (ReLU Routing and DTSI-MoE) is a valuable contribution. MoE has immense potential for scaling, but expert under-training and imbalance are notorious hurdles; RankMixer provides a robust framework that can truly unlock the ROI of MoE in practice.
- Holistic Optimization for Deployment: The detailed breakdown of how a 70x parameter increase was achieved with stable latency (via FLOPs/Param-ratio reduction, MFU increase, and fp16 quantization) serves as an excellent blueprint for others facing similar industrial deployment challenges. It underscores that architectural innovation must be coupled with meticulous engineering optimization.
- Applicability Beyond Douyin: While developed for Douyin, the principles of RankMixer are highly transferable. Any recommendation system, search ranking engine, or personalized content platform dealing with heterogeneous features, high QPS demands, and the desire to scale model capacity could benefit from adopting or adapting RankMixer's architecture. Its success in both recommendation and advertising scenarios further supports its universality.

A potential area for further exploration, or a critique, is a deeper theoretical analysis of why Multi-head Token Mixing, despite being parameter-free, is superior to self-attention for heterogeneous features beyond the empirical observation. Is there a formal way to characterize the "semantic distance" or "interactability" of different feature types that makes inner products irrelevant or detrimental, and how Token Mixing inherently bypasses this? Additionally, while semantic-based tokenization works, future research could investigate graph-based or unsupervised methods for token grouping to further reduce reliance on domain experts.