RankMixer: Scaling Up Ranking Models in Industrial Recommenders


TL;DR Summary

RankMixer is introduced to enhance the efficiency and scalability of ranking models in industrial recommenders. Using a hardware-aware design, it replaces self-attention with multi-head token mixing and scales to one billion parameters with a Sparse-MoE variant, improving model performance without increasing serving latency.

Abstract

Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems; yet two practical obstacles remain. First, training and serving cost on industrial recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture. RankMixer retains the transformer's high parallelism while replacing quadratic self-attention with a multi-head token mixing module for higher efficiency. Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI. A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training. Experiments show RankMixer’s...

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

RankMixer: Scaling Up Ranking Models in Industrial Recommenders

1.2. Authors

Jie Zhu*, Zhifang Fan*, Xiaoxie Zhu*, Yuchen Jiang*, Hangyu Wang, Xintian Han, Haoran Ding, Xinmin Wang, Wenlin Zhao, Zhen Gong, Huizhi Yang, Zheng Chai, Zhe Chen, Yuchao Zheng, Qiwei Chen, Feng Zhang, Xun Zhou, Peng Xu, Xiao Yang, Di Wu, and Zuotao Liu. The authors are primarily affiliated with ByteDance, with Zhifang Fan and Qiwei Chen listed as independent researchers. ByteDance is a leading technology company known for products like TikTok (Douyin in China), which heavily relies on recommendation systems. This indicates a strong industry-led research focus on practical, large-scale deployment challenges.

1.3. Journal/Conference

Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). CIKM is a highly reputable and influential international conference in the fields of information retrieval, knowledge management, and database systems. Its reputation ensures that accepted papers undergo rigorous peer review and are considered significant contributions to the field.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses two main challenges in scaling up recommendation systems inspired by Large Language Models (LLMs): high training/serving costs with strict latency/QPS demands, and inefficient GPU utilization by legacy CPU-era feature-crossing modules. It introduces RankMixer, a hardware-aware model designed for a unified and scalable feature-interaction architecture. RankMixer leverages the Transformer's parallelism but replaces quadratic self-attention with a more efficient multi-head token mixing module. It also uses Per-token FFNs to model distinct feature subspaces and cross-feature-space interactions, which enhances capacity. For higher Return on Investment (ROI), a Sparse-MoE variant extends the model to one billion parameters, employing a dynamic routing strategy to tackle expert training inadequacy and imbalance. Experiments on a trillion-scale production dataset demonstrate RankMixer's superior scaling, boosting Model Flops Utilization (MFU) from 4.5% to 45% and scaling parameters by two orders of magnitude while maintaining inference latency. Online A/B tests across recommendation and advertisement scenarios show universal applicability, with a 1B Dense-Parameters RankMixer improving user active days by 0.3% and total in-app usage duration by 1.08% without increasing serving cost.

Original source: /files/papers/691b1d0680efabfa5c6ee4e9/paper.pdf (local copy of the full paper). Publication status: published in CIKM '25.

2. Executive Summary

2.1. Background & Motivation

The paper is motivated by the success of Large Language Models (LLMs) in scaling performance with increasing parameters, prompting interest in applying similar scaling principles to Deep Learning Recommendation Models (DLRMs). However, scaling DLRMs in industrial recommender systems faces two critical practical obstacles:

  1. Cost and Latency Constraints: Industrial recommender systems operate under strict latency bounds (how quickly a response must be generated) and demand extremely high QPS (Queries Per Second, the number of requests a system can handle per second). Increasing model size typically leads to higher computational costs, making it challenging to meet these real-time operational requirements.

  2. Inefficient Hardware Utilization: Many traditional feature-crossing modules (components that combine different features to learn complex interactions) in ranking models were designed in the CPU era. These designs often fail to effectively utilize modern GPUs, resulting in low Model Flops Utilization (MFU) – meaning a significant portion of the GPU's theoretical processing power goes unused. This leads to poor scalability, as the computational cost per parameter is high, limiting the practical ROI (Return on Investment) of simply adding more parameters.

    The core problem the paper aims to solve is how to effectively scale DLRMs to billions of parameters in a production environment without violating strict latency and QPS constraints, by designing hardware-aware architectures that maximize GPU utilization and computational efficiency.

2.2. Main Contributions / Findings

The paper introduces RankMixer, a novel, hardware-aware model design with the following primary contributions and findings:

  • Novel Architecture (RankMixer): Proposes a new model architecture featuring Multi-head Token Mixing for efficient cross-token feature interactions (replacing quadratic self-attention with a parameter-free operator for higher efficiency) and Per-token FFNs for isolated, distinct feature subspace modeling and cross-feature-space interactions. This design is hardware-aligned and highly parallelizable.
  • Sparse-MoE Extension with Dynamic Routing: Extends RankMixer with a Sparse Mixture-of-Experts (MoE) variant for one billion parameters, enhancing ROI. It introduces a dynamic routing strategy (specifically, ReLU Routing and Dense-training / Sparse-inference (DTSI-MoE)) to address the challenges of expert under-training and imbalance, which are common issues in Sparse-MoE deployments.
  • Significant MFU Improvement: By replacing diverse, handcrafted, low-MFU modules with RankMixer, the model MFU was dramatically boosted from 4.5% to 45% (a 10x improvement). This shift from a memory-bound to a compute-bound regime is crucial for efficient GPU utilization.
  • Massive Parameter Scaling with Stable Latency: Achieved a 70x increase in online ranking model parameters (from 16M to 1.1B) while maintaining roughly the same inference latency. This was made possible by decoupling parameter growth from FLOPs, and FLOPs growth from actual cost through high MFU and Quantization (specifically, fp16 inference).
  • Universal Applicability: Verified RankMixer's universality through online A/B tests across two core application scenarios: Feed Recommendation and Advertising, demonstrating its generalizability as a unified backbone.
  • Tangible Business Impact: Successful deployment of the 1B Dense-Parameters RankMixer for full traffic serving on Douyin's recommendation system resulted in significant improvements: 0.3% increase in user active days and 1.08% increase in total in-app usage duration, without increasing serving cost. The model also showed strong generalization capabilities, particularly benefiting low-active users.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the RankMixer paper, a reader should be familiar with the following fundamental concepts:

  • Recommender System (RS): A subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. In industrial settings, RS are crucial for platforms like social media (e.g., Douyin/TikTok), e-commerce, and streaming services to suggest relevant content, products, or services to users, thereby enhancing user engagement and platform revenue. They typically analyze user behavior (past interactions), item attributes, and context to make predictions.

  • Deep Learning Recommendation Models (DLRMs): These are recommendation models that utilize deep neural networks to learn complex patterns and interactions within user-item data. A critical component of DLRMs is the dense interaction layer, which processes the concatenated or pooled embeddings of various features to capture high-order feature interactions.

  • Large Language Models (LLMs): Advanced deep learning models, typically based on the Transformer architecture, that have achieved remarkable performance in natural language processing tasks by scaling up their parameter counts to billions or even trillions. Their success in leveraging scale to improve performance has inspired similar scaling efforts in other domains, including recommendation systems.

  • Model Flops Utilization (MFU): A metric that measures how efficiently a model utilizes the theoretical floating-point operations per second (FLOPs) capability of the underlying hardware (e.g., GPU). It is calculated as the ratio of the actual FLOPs consumed by the model to the theoretical peak FLOPs capacity of the hardware. A low MFU indicates that the hardware is underutilized, often due to memory-bound operations rather than compute-bound operations. Maximizing MFU is critical for efficient large-scale model training and inference.

  • Queries Per Second (QPS): A metric representing the number of requests (queries) a system can process in one second. In industrial recommender systems, extremely high QPS demands are common, requiring models to be highly efficient and capable of quick inference to serve a vast number of users in real-time.

  • Feature Engineering (Feature-crossing): The process of creating new features from existing ones to enhance the predictive power of a model. Feature-crossing specifically refers to combining two or more features to create an interaction feature, which can capture non-linear relationships. For instance, combining a user's age group with an item's category might reveal preferences not apparent from either feature alone. In recommendation systems, features can be diverse:

    • Numerical features: e.g., video watch duration, user activity frequency.
    • Categorical features: e.g., user ID, item ID, author ID, video tags. These are often converted into dense embeddings (low-dimensional vector representations) for use in neural networks.
    • User behavior features: e.g., sequences of previously watched videos, liked items.
    • Content features: e.g., textual descriptions or visual attributes of items.
  • Transformer Architecture: A neural network architecture introduced in 2017, which revolutionized NLP and is now widely used in various domains. Its key components include self-attention mechanisms and Feed-Forward Networks (FFNs). Transformers are known for their high parallelism during training, which makes them efficient on modern accelerators like GPUs.

  • Self-attention: A mechanism in Transformer models that allows the model to weigh the importance of different parts of the input sequence when processing a specific part. For a given token, it computes attention scores by comparing it to all other tokens in the sequence. These scores determine how much "attention" the model should pay to each other token. The output for a token is a weighted sum of all tokens, where the weights are the attention scores. The core self-attention calculation for queries $Q$, keys $K$, and values $V$ is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$, $K$, $V$ are matrices derived from input embeddings, and $d_k$ is the dimension of the keys, used as a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients. This mechanism has quadratic complexity with respect to the sequence length (number of tokens), which can be a bottleneck for very long sequences or many tokens. (A short code sketch of this computation appears after this list.)

  • Feed-Forward Network (FFN): A simple neural network layer, typically consisting of two linear transformations with a non-linear activation function (e.g., ReLU or GeLU) in between. In Transformers, an FFN is applied independently and identically to each position (token) in the sequence after the self-attention layer.

  • Mixture-of-Experts (MoE): An architecture that increases model capacity by routing inputs to a subset of specialized "expert" neural networks. Instead of using all parameters for every input (dense model), MoE models only activate a few experts per input, thus increasing the total number of parameters while keeping the computational cost per inference relatively constant. This makes it possible to build models with trillions of parameters while maintaining manageable inference costs. A Sparse MoE specifically refers to activating only a small, sparse subset of experts.

  • Layer Normalization (LN): A normalization technique applied across the features of a single input sample (token) within a layer. It helps stabilize training, especially in deep neural networks, by re-centering and re-scaling the inputs to the next layer.

  • Residual Connections (Skip Connections): A technique where the input of a layer is added to its output, allowing gradients to flow more directly through the network. This helps mitigate the vanishing gradient problem and enables the training of very deep networks.

  • Gelu activation function: Gaussian Error Linear Unit, a non-linear activation function commonly used in Transformer models, defined as $\mathrm{Gelu}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution.
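
The scaled dot-product attention mentioned above can be written in a few lines; the following PyTorch sketch is purely illustrative (tensor shapes and names are our own, not from the paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (num_tokens, d_k); V: (num_tokens, d_v).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise similarities, (T, T)
    weights = F.softmax(scores, dim=-1)            # attention weights per query token
    return weights @ V                             # weighted sum of value vectors

# Toy usage: 8 tokens of dimension 16, self-attention (Q = K = V = x)
x = torch.randn(8, 16)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([8, 16])
```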

3.2. Previous Works

The paper contextualizes RankMixer within the landscape of modern recommendation systems, highlighting the evolution of DLRMs and feature interaction modeling.

  • Wide&Deep [5] and DeepFM [10]: These are early and influential efforts that combined wide (linear) components to capture low-order (simple) feature interactions and deep (DNN) components to capture high-order (complex) feature interactions. They showed that simply relying on DNNs for high-order interactions could be challenging.

  • Explicit Cross Methods: Researchers developed various models that explicitly design operators to capture high-order feature interactions, rather than implicitly relying on deep DNNs:

    • PNN [20] (Product-based Neural Networks): Introduces product layers to capture explicit feature interactions at various granularities.
    • DCN [25] (Deep & Cross Network) and its successor DCNv2 [26]: Use a cross network to explicitly learn bounded-degree cross features. The cross network iteratively applies feature crossing by multiplying the input with the output of the previous layer.
    • xDeepFM [16]: Proposes Compressed Interaction Network (CIN) to generate feature interactions in an explicit, bit-wise fashion.
    • FGCNN [18] (Feature Generation by Convolutional Neural Network): Uses CNNs to automatically generate new features from raw input features.
    • FiGNN [15] (Feature Interaction Graph Neural Network): Models feature interactions using graph neural networks.
    • AutoInt [23] (Automatic Feature Interaction Learning via Self-Attentive Neural Networks): Employs a multi-head self-attention mechanism to explicitly model feature interactions, treating features as queries, keys, and values.
    • Hiformer [9]: Combines heterogeneous self-attention layers and low-rank approximation matrix computations, addressing some limitations of AutoInt by providing more flexible interaction patterns.
    • DHEN [31] (Deep and Hierarchical Ensemble Network): Combines different types of feature-crossing blocks (e.g., DCN, self-attention, FM, LR) and stacks multiple layers to learn hierarchical feature interactions.
    • Wukong [30]: Investigates the scaling law of feature interaction modeling, following an architecture similar to DHEN but using Factorization Machine Blocks (FMB) and Linear Compress Blocks (LCB).
  • Scaling Laws in Recommendation Systems: Research interest has grown in understanding how performance improves with increasing model size, data, or computation, similar to LLMs.

    • Studies explored scaling strategies for pretraining user activity sequences [6], general-purpose user representations [22, 32], and online retrieval [8, 27].
    • HSTU [29] enhances scaling for Generative Recommenders, focusing on sequential aspects.
  • Limitations of Self-Attention in RS: The paper explicitly points out that while self-attention is effective in LLMs where tokens share a unified embedding space, it is suboptimal for recommendation systems. In RS, features are inherently heterogeneous (e.g., a user ID embedding vs. a video category embedding). Computing an inner-product similarity (the core of self-attention) between such diverse, often high-dimensional, and semantically distinct feature spaces is difficult and can be inefficient. This leads to self-attention consuming more computation, memory I/O, and GPU memory without necessarily outperforming simpler, parameter-free approaches like multi-head Token Mixing.

3.3. Technological Evolution

The evolution of recommendation models has moved from simpler matrix factorization and collaborative filtering methods to deep learning-based models (DLRMs). Early DLRMs like Wide&Deep and DeepFM introduced the idea of combining different mechanisms for low- and high-order feature interactions. Subsequent works focused on designing increasingly sophisticated explicit cross methods (e.g., DCN, xDeepFM, AutoInt, DHEN, Wukong) to better capture these interactions.

The current trend, inspired by LLMs, is to explore scaling laws in DLRMs to improve performance by increasing model capacity. However, simply applying LLM architectures like vanilla Transformers directly to RS data has proven inefficient due to the unique characteristics of recommendation features (heterogeneity, strict latency constraints).

RankMixer positions itself at this frontier, aiming to bridge the gap between LLM-inspired scaling and the practical demands of industrial RS. It takes inspiration from the parallel architecture of Transformers but critically re-engineers the self-attention and FFN components to be hardware-aware and specifically suited for heterogeneous recommendation features. By proposing Multi-head Token Mixing and Per-token FFNs, and integrating Sparse-MoE with novel routing, RankMixer represents an evolution towards more efficient, scalable, and adaptable DLRM architectures for real-world deployment.

3.4. Differentiation Analysis

Compared to the main methods in related work, RankMixer offers several core differences and innovations:

  • Hardware-Aware Design for GPU Efficiency: Unlike many CPU-era feature-crossing modules (e.g., early DCN variants) or even some self-attention based models (AutoInt, Hiformer) that can suffer from low MFU on GPUs, RankMixer is explicitly designed to be hardware-aligned. Its components (Multi-head Token Mixing, Per-token FFNs, and their parallel execution) are chosen to maximize GPU parallelism and shift from memory-bound to compute-bound operations, leading to a 10x MFU improvement.

  • Efficient Cross-Token Interaction via Multi-head Token Mixing:

    • Vs. Self-Attention (AutoInt, Hiformer): RankMixer replaces quadratic self-attention with a parameter-free multi-head token mixing module. This is crucial because self-attention's reliance on inner-product similarity is suboptimal for heterogeneous feature spaces in RS (e.g., user ID vs. item ID), where semantic similarity is hard to capture via dot products and leads to high computational/memory costs. Token Mixing provides efficient, high-parallelism global information modeling without these drawbacks.
    • Vs. Explicit Cross Networks (DCN, DCNv2): While DCN variants explicitly model interactions, RankMixer offers a more unified and scalable architecture inspired by Transformers, capable of learning deeper and more complex interactions efficiently across tokens representing semantic feature groups.
  • Specialized Feature Subspace Modeling with Per-token FFNs:

    • Vs. Shared FFNs (Vanilla Transformer): Traditional Transformers use shared FFNs across all tokens. RankMixer employs Per-token FFNs, which means each feature token receives dedicated parameter transformations. This prevents high-frequency fields from dominating and drowning out lower-frequency/long-tail signals, which is a common problem when mixing disparate semantic spaces in a single interaction module. This approach is better aligned with the heterogeneous nature of recommendation data.
    • Vs. MMoE Experts: While MMoE also uses multiple networks, its experts typically process the same input. Per-token FFNs in RankMixer are distinct in that each sees a distinct token input, allowing for specialized modeling of different feature subspaces.
  • Scalable Sparse-MoE with Dynamic and Balanced Routing:

    • Vs. Vanilla Sparse-MoE: RankMixer addresses key limitations of Vanilla Sparse-MoE (uniform $k$-expert routing leading to budget waste, and expert under-training due to imbalance). It introduces ReLU Routing for flexible expert activation based on token information content and Dense-training / Sparse-inference (DTSI-MoE) to ensure all experts are sufficiently trained while maintaining sparse inference cost. This allows for significantly higher ROI for model scaling (e.g., 1B+ parameters).
  • Holistic Optimization for Industrial Deployment: RankMixer integrates architectural innovations with engineering optimizations (high MFU, quantization) to achieve two orders of magnitude parameter scaling without increasing serving latency, which is a critical practical achievement for industrial recommender systems. This holistic approach to balancing effectiveness and efficiency distinguishes it from models that might achieve high AUC but are too costly to deploy.

4. Methodology

4.1. Principles

The core idea behind RankMixer is a hardware-aware model design tailored for the unique characteristics of industrial recommendation systems. It aims to create a unified and scalable feature-interaction architecture that can effectively model heterogeneous feature spaces while maximizing GPU utilization and maintaining strict real-time serving constraints. The theoretical basis and intuition are rooted in:

  1. Transformer-like Parallelism for Scalability: Leveraging the highly parallel nature of Transformer architectures to enable efficient computation on GPUs.
  2. Addressing Heterogeneity with Specialized Interactions: Recognizing that recommendation features are fundamentally diverse (e.g., user ID, item category, query text) and that generic interaction mechanisms like self-attention (which relies on inner product similarity) are suboptimal. Instead, it proposes parameter-free token mixing for cross-token interaction and per-token FFNs for specialized, isolated processing of feature subspaces.
  3. Decoupling Capacity from Cost: Employing Sparse Mixture-of-Experts (MoE) to significantly increase model capacity (number of parameters) while keeping the computational cost per inference (FLOPs) relatively constant, thereby improving the Return on Investment (ROI) of scaling.
  4. Hardware-Alignment for Efficiency: Designing components and optimizing their execution (e.g., large GEMM shapes, kernel fusion, fp16 quantization) to achieve high Model Flops Utilization (MFU) and shift computation from memory-bound to compute-bound, which is crucial for meeting latency and QPS demands in production.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Overall Architecture

The RankMixer model processes $T$ input tokens through $L$ successive RankMixer blocks. After these blocks, an output pooling operator aggregates the final representations to generate predictions for various tasks. Each RankMixer block is composed of two main components:

  1. A Multi-Head Token Mixing layer: Responsible for facilitating information exchange across different tokens.

  2. A Per-Token Feed-Forward Network (PFFN) layer: Designed to expand the model's capacity and model interactions within distinct feature subspaces.

    The iterative refinement of token representations across $L$ layers is described by the following equations: $ \mathbf{S}_{n-1} = \mathrm{LN}\left(\mathrm{TokenMixing}\left(\mathbf{X}_{n-1}\right) + \mathbf{X}_{n-1}\right) $ $ \mathbf{X}_n = \mathrm{LN}\left(\mathrm{PFFN}\left(\mathbf{S}_{n-1}\right) + \mathbf{S}_{n-1}\right) $ Where:

  • $\mathrm{LN}(\cdot)$ is a Layer Normalization function, which normalizes the activations of a layer across the features for each sample. This helps stabilize training.

  • $\mathrm{TokenMixing}(\cdot)$ represents the Multi-head Token Mixing module, which performs cross-token feature interactions.

  • $\mathrm{PFFN}(\cdot)$ represents the Per-Token Feed-Forward Network module, which processes each token independently.

  • $\mathbf{X}_{n-1} \in \mathbb{R}^{T \times D}$ is the input to the $n$-th RankMixer block, representing $T$ tokens, each with a hidden dimension $D$.

  • $\mathbf{S}_{n-1} \in \mathbb{R}^{T \times D}$ is the output after the TokenMixing layer and the subsequent normalization and residual connection.

  • $\mathbf{X}_n \in \mathbb{R}^{T \times D}$ is the output of the $n$-th RankMixer block, serving as the input for the next block.

  • $\mathbf{X}_0 \in \mathbb{R}^{T \times D}$ is the initial stack of input tokens $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T$.

  • $T$ is the total number of feature tokens.

  • $D$ is the hidden dimension of the model for each token.

    The residual connections ($\mathbf{X}_{n-1}$ added back in the first equation, and $\mathbf{S}_{n-1}$ in the second) help in training very deep networks by allowing gradients to flow more easily. Finally, the output representation $\mathbf{O}_{\mathrm{output}}$ is derived from the mean pooling of the representations from the final layer $\mathbf{X}_L$. This aggregated representation is then used to compute predictions for various downstream tasks (e.g., click-through rate, watch duration).
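
To make the block recurrence concrete, here is a minimal PyTorch sketch of one RankMixer block, assuming generic token_mixing and pffn callables that map a (T, D) token matrix to the same shape (the concrete versions are described in the following subsections; all names here are ours, not the authors'):

```python
import torch
import torch.nn as nn

class RankMixerBlock(nn.Module):
    """One block: S = LN(TokenMixing(X) + X); X_next = LN(PFFN(S) + S)."""

    def __init__(self, dim, token_mixing, pffn):
        super().__init__()
        self.token_mixing = token_mixing  # cross-token interaction, (T, D) -> (T, D)
        self.pffn = pffn                  # per-token transformation, (T, D) -> (T, D)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (T, D) tokens for one sample
        s = self.ln1(self.token_mixing(x) + x)    # residual + LayerNorm
        return self.ln2(self.pffn(s) + s)         # residual + LayerNorm

# Stacking L blocks and mean-pooling the final tokens (O_output = mean over the T tokens):
# for block in blocks: x = block(x)
# o_output = x.mean(dim=0)
```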

4.2.2. Input Layer and Feature Tokenization

To effectively process the rich information present in large-scale recommendation systems, a crucial first step is to prepare the input features. These features are diverse and typically include:

  • User features: Such as user ID and other demographic or behavioral information.

  • Candidate features: Pertaining to the item being recommended, e.g., video ID, author ID, category.

  • Sequence Features: Derived from user interaction sequences (e.g., previously watched videos), often processed through a Sequence Module to capture temporal interests, yielding a sequence embedding $\mathbf{e}_s$.

  • Cross Features: Explicit or learned interactions between user and candidate features.

    All these raw features are initially transformed into embeddings of various dimensions. To ensure efficient parallel computation in the subsequent RankMixer blocks, these embeddings, regardless of their original dimension, must be aligned to a fixed dimension. This process is called tokenization.

The paper highlights a challenge with simple tokenization strategies:

  • Too many tokens: Assigning one embedding per feature (which could be hundreds) leads to many tokens, reducing parameters and computation per token, potentially insufficient modeling for important features, and under-utilization of GPU cores.

  • Too few tokens: Conversely, using very few tokens (e.g., a single concatenated token) degrades the model into a simple Deep Neural Network (DNN), losing the ability to distinctly represent diverse feature spaces and risking dominant features overshadowing others.

    To overcome these issues, RankMixer proposes a semantic-based tokenization approach that leverages domain knowledge. Features are grouped into several semantically coherent clusters (e.g., all user-related features, all item-related features, all context features). These grouped features are then concatenated into a single embedding vector $e_{\mathrm{input}}$: $ e_{\mathrm{input}} = \left[ e_1; e_2; ...; e_N \right] $ Where:

  • $e_{\mathrm{input}}$ is the concatenated embedding vector, formed by joining $N$ feature-group embeddings.

  • $e_j$ represents the embedding vector for the $j$-th feature group.

    This concatenated vector is then partitioned into an appropriate number of tokens, each with a fixed dimension size. Each resulting feature token $\mathbf{x}_i \in \mathbb{R}^D$ is designed to capture a set of feature embeddings that represent a similar semantic aspect. The tokenization process is formally defined as: $ \mathbf{x}_i = \mathrm{Proj}\left(e_{\mathrm{input}} \left[ \boldsymbol{d} \cdot (i - 1) : \boldsymbol{d} \cdot i \right] \right), \quad i = 1, ..., T $ Where:

  • $\mathbf{x}_i$ is the $i$-th feature token, intended to capture a semantic aspect of the input.

  • $e_{\mathrm{input}} \left[ \boldsymbol{d} \cdot (i-1) : \boldsymbol{d} \cdot i \right]$ denotes a segment of the concatenated embedding vector $e_{\mathrm{input}}$, starting at index $\boldsymbol{d} \cdot (i-1)$ and ending at index $\boldsymbol{d} \cdot i$.

  • $\boldsymbol{d}$ is the fixed dimension assigned to each segment before projection.

  • $T$ is the total number of resulting feature tokens.

  • $\mathrm{Proj}(\cdot)$ is a projection function that maps each embedding segment into the model's hidden dimension $D$. This ensures all tokens have a uniform dimension $D$ for subsequent processing.
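
A minimal sketch of this tokenization step, assuming the feature-group embeddings have already been computed; whether the projection is shared across tokens is not spelled out in the excerpt, so a separate linear projection per token is used here as one reasonable reading:

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """e_input = [e_1; ...; e_N] is split into T segments of width d,
    and each segment is projected to the model dimension D."""

    def __init__(self, total_dim, num_tokens, model_dim):
        super().__init__()
        assert total_dim % num_tokens == 0, "pad the concatenated embedding if needed"
        self.seg_dim = total_dim // num_tokens            # d in the paper's notation
        self.proj = nn.ModuleList(
            [nn.Linear(self.seg_dim, model_dim) for _ in range(num_tokens)]
        )

    def forward(self, group_embeddings):                  # list of (batch, dim_j) tensors
        e_input = torch.cat(group_embeddings, dim=-1)     # (batch, total_dim)
        segments = e_input.split(self.seg_dim, dim=-1)    # T segments of width d
        tokens = [p(seg) for p, seg in zip(self.proj, segments)]
        return torch.stack(tokens, dim=1)                 # (batch, T, D)

# Toy usage: three feature groups of dims 24, 40, 64 -> 128 total -> T=4 tokens of D=32
tokenizer = FeatureTokenizer(total_dim=128, num_tokens=4, model_dim=32)
groups = [torch.randn(2, 24), torch.randn(2, 40), torch.randn(2, 64)]
print(tokenizer(groups).shape)  # torch.Size([2, 4, 32])
```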

4.2.3. RankMixer Block

The core of the RankMixer model lies within its block structure, which iteratively refines the feature token representations.

4.2.3.1. Multi-head Token Mixing

The Multi-head Token Mixing module is introduced to enable effective information exchange across tokens, which is crucial for capturing feature crosses and modeling global information. This module is an alternative to self-attention, designed to be more efficient and suitable for heterogeneous recommendation data.

The process begins by splitting each token $\mathbf{x}_t$ into $H$ heads: $ \left[ \mathbf{x}_t^{(1)} \parallel \mathbf{x}_t^{(2)} \parallel \ldots \parallel \mathbf{x}_t^{(H)} \right] = \mathrm{SplitHead} \left( \mathbf{x}_t \right) $ Where:

  • $\mathbf{x}_t \in \mathbb{R}^D$ is the $t$-th input token.

  • $H$ is the number of heads.

  • $\mathbf{x}_t^{(h)} \in \mathbb{R}^{D/H}$ is the $h$-th head of the token $\mathbf{x}_t$. The $\parallel$ symbol indicates concatenation along a dimension, meaning the original token $\mathbf{x}_t$ is divided into $H$ smaller sub-vectors. These heads can be viewed as projections of the token into lower-dimensional feature subspaces, allowing the model to consider different perspectives of the feature.

    After splitting, the Token Mixing operation shuffles these heads. For each head index $h$, it concatenates the $h$-th head from all tokens $\mathbf{x}_1, \ldots, \mathbf{x}_T$: $ \mathbf{s}^h = \operatorname{Concat} \left( \mathbf{x}_1^h, \mathbf{x}_2^h, ..., \mathbf{x}_T^h \right) $ Where:

  • $\mathbf{s}^h \in \mathbb{R}^{T \cdot (D/H)}$ is the resulting vector for the $h$-th "shuffled token" or mixed head. This operation effectively collects the $h$-th perspective from all original input tokens into a single vector.

    The output of the Multi-head Token Mixing module is a set of $H$ shuffled tokens, which can be stacked into $\mathbf{S} \in \mathbb{R}^{H \times (T \cdot D/H)}$. In this work, the authors set $H=T$ to ensure that the number of tokens remains the same after Token Mixing, facilitating the use of a residual connection. After this mixing, a residual connection and Layer Normalization are applied to produce the refined tokens: $ \mathbf{s}_1, \mathbf{s}_2, ..., \mathbf{s}_T = \mathrm{LN} \left( \mathrm{TokenMixing} \left( \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T \right) + \left( \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T \right) \right) $ Here, $\mathrm{TokenMixing}(\mathbf{x}_1, \ldots, \mathbf{x}_T)$ conceptually represents the entire process of splitting heads and concatenating them as described above, resulting in a new set of $T$ tokens (when $H=T$).
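
Because the operator is parameter-free, Multi-head Token Mixing reduces to a reshape and transpose; a small sketch follows (a batch dimension is added here for clarity; names are ours):

```python
import torch

def multi_head_token_mixing(x, num_heads):
    """x: (batch, T, D). Splits every token into H heads of width D/H and
    regroups the h-th head of all tokens into the h-th output token,
    returning (batch, H, T*D/H). With H == T the output shape matches the
    input, so a residual connection can be added directly."""
    B, T, D = x.shape
    assert D % num_heads == 0
    heads = x.view(B, T, num_heads, D // num_heads)   # split each token into heads
    grouped = heads.transpose(1, 2)                   # (B, H, T, D/H): group by head index
    return grouped.reshape(B, num_heads, T * (D // num_heads))

# Toy check with H = T = 4 and D = 8: output shape equals input shape
x = torch.randn(2, 4, 8)
print(multi_head_token_mixing(x, num_heads=4).shape)  # torch.Size([2, 4, 8])
```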

Critique of Self-Attention for Recommendation Systems: The paper explicitly states that while self-attention is highly effective in LLMs where tokens share a unified embedding space, it is suboptimal for recommendation systems. The primary reasons are:

  • Heterogeneous Feature Spaces: In RS, features (and thus tokens) are inherently diverse (e.g., user ID, item category, query text).
  • Difficulty of Inner-Product Similarity: Self-attention computes attention weights using the inner product of tokens. This method works well for homogeneous data (like words in NLP). However, computing an inner-product similarity between two heterogeneous semantic spaces (e.g., a user embedding and an item embedding) is notoriously difficult and often yields unreliable or uninformative results. This is particularly problematic in RS where ID spaces can contain hundreds of millions of elements.
  • Computational and Memory Cost: Applying self-attention to such diverse inputs consumes more computation, Memory IO operations, and GPU memory compared to the parameter-free multi-head Token Mixing approach, without offering superior performance in this context.

4.2.3.2. Per-token FFN

Traditional Deep Learning Recommendation Models (DLRMs) and models like DHEN often mix features from many disparate semantic spaces within a single interaction module. This can lead to a problem where high-frequency fields (e.g., popular item IDs or common user behaviors) dominate the learning process, effectively drowning out lower-frequency or long-tail signals, which ultimately harms overall recommendation quality.

To address this, RankMixer introduces a parameter-isolated feed-forward network architecture called Per-token FFN. Unlike traditional designs where the parameters of the FFN are shared across all tokens, Per-token FFN processes each token with its own dedicated set of transformations, thus isolating the parameters for each token.

For the $t$-th token $\mathbf{s}_t$, the Per-token FFN is expressed as: $ \mathbf{v}_t = f_{\mathrm{pffn}}^{t,2} \left( \operatorname{Gelu} \left( f_{\mathrm{pffn}}^{t,1} \left( \mathbf{s}_t \right) \right) \right) $ Where the linear transformation $f_{\mathrm{pffn}}^{t,i}(\mathbf{x})$ is defined as: $ f_{\mathrm{pffn}}^{t,i} ( \mathbf{x} ) = \mathbf{x}\mathbf{W}_{\mathrm{pffn}}^{t,i} + \mathbf{b}_{\mathrm{pffn}}^{t,i} $ And:

  • $\mathbf{s}_t \in \mathbb{R}^D$ is the $t$-th input token to the PFFN module.

  • $\mathbf{v}_t \in \mathbb{R}^D$ is the output vector for the $t$-th token after processing by its dedicated PFFN.

  • $f_{\mathrm{pffn}}^{t,1}(\cdot)$ and $f_{\mathrm{pffn}}^{t,2}(\cdot)$ represent the first and second linear layers of the FFN dedicated to the $t$-th token.

  • $\mathbf{W}_{\mathrm{pffn}}^{t,1} \in \mathbb{R}^{D \times kD}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,1} \in \mathbb{R}^{kD}$ are the weight matrix and bias vector for the first linear layer of the $t$-th token's FFN.

  • $\mathbf{W}_{\mathrm{pffn}}^{t,2} \in \mathbb{R}^{kD \times D}$ and $\mathbf{b}_{\mathrm{pffn}}^{t,2} \in \mathbb{R}^{D}$ are the weight matrix and bias vector for the second linear layer of the $t$-th token's FFN.

  • $k$ is a hyperparameter that scales the hidden dimension of the Per-token FFN (e.g., if $k=4$, the hidden dimension is $4D$).

  • $\mathrm{Gelu}(\cdot)$ is the Gaussian Error Linear Unit activation function, applied element-wise after the first linear layer.

    The Per-token FFN module can be summarized as a parallel application across all tokens: $ \mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_T = \mathrm{PFFN} \left( \mathbf{s}_1, \mathbf{s}_2, ..., \mathbf{s}_T \right) $ which is equivalent to applying each token's dedicated FFN independently: $ \mathbf{v}_t = f_{\mathrm{pffn}}^{t,2} \left( \mathrm{Gelu} \left( f_{\mathrm{pffn}}^{t,1} \left( \mathbf{s}_t \right) \right) \right), \quad t = 1, ..., T. $
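
A compact PyTorch sketch of the Per-token FFN above, written as a batched einsum so that token t is multiplied by its own weight matrices; the initialization scale and the default k=4 are our illustrative choices, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class PerTokenFFN(nn.Module):
    """Two-layer FFN with a separate set of weights for each of the T tokens."""

    def __init__(self, num_tokens, dim, k=4):
        super().__init__()
        hidden = k * dim
        self.w1 = nn.Parameter(torch.randn(num_tokens, dim, hidden) * dim ** -0.5)
        self.b1 = nn.Parameter(torch.zeros(num_tokens, hidden))
        self.w2 = nn.Parameter(torch.randn(num_tokens, hidden, dim) * hidden ** -0.5)
        self.b2 = nn.Parameter(torch.zeros(num_tokens, dim))
        self.act = nn.GELU()

    def forward(self, s):                                   # s: (batch, T, D)
        # one weight matrix per token position, applied in parallel
        h = torch.einsum('btd,tdh->bth', s, self.w1) + self.b1
        return torch.einsum('bth,thd->btd', self.act(h), self.w2) + self.b2

# Toy usage: T = 4 tokens of dimension D = 32
pffn = PerTokenFFN(num_tokens=4, dim=32)
print(pffn(torch.randn(2, 4, 32)).shape)  # torch.Size([2, 4, 32])
```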

Key distinctions:

  • Vs. Parameter-All-Shared FFN: Per-token FFN significantly enhances modeling ability by introducing more parameters (each token has its own FFN) while keeping the computational complexity per token unchanged (each token still goes through one FFN sequence).
  • Vs. MMoE Experts: In MMoE, typically many experts process the same input to make a decision, with a gating network deciding which experts to activate. In Per-token FFN, each FFN sees a distinct token input and is dedicated to transforming that specific token's representation.
  • Vs. Transformer FFN: In a vanilla Transformer, different input tokens share one FFN (parameters are shared across token positions). RankMixer simultaneously splits the inputs (into tokens) and the parameters (into Per-token FFNs), which is beneficial for learning diversity in different feature subspaces.

4.2.4. Sparse MoE in RankMixer

To further boost the ROI of large-scale models, the dense FFNs within each Per-token FFN module can be replaced with Sparse Mixture-of-Experts (MoE) blocks. This allows the model's total capacity (number of parameters) to grow substantially while the computational cost per inference remains roughly constant because only a small subset of experts are activated for any given input.

However, Vanilla Sparse-MoE typically degrades when directly applied to RankMixer for two main reasons:

  1. Uniform $k$-expert routing: Traditional Top-$k$ selection in MoE treats all feature tokens equally. This can lead to inefficient allocation, where budget is wasted on low-information tokens while high-information tokens might not receive sufficient expert capacity, hindering the model's ability to capture nuanced differences.

  2. Expert under-training: Since Per-token FFNs already multiply parameters by the number of tokens ($T$), adding non-shared experts further explodes the potential expert count. If routing is highly imbalanced, many experts may rarely be activated, leading to under-trained or "dying" experts that contribute little to the model's performance.

    To address these challenges, RankMixer combines two complementary training strategies:

  1. ReLU Routing: To grant tokens flexible expert counts (i.e., not a fixed $k$) and maintain differentiability during training, RankMixer replaces the common Top-$k$ + softmax routing with a ReLU gate combined with an adaptive $\ell_1$ penalty (a brief code sketch of this routing appears after this list). Given the $j$-th expert $e_{i,j}(\cdot)$ for token $\mathbf{s}_i \in \mathbb{R}^{\hat{d_h}}$ (where $\hat{d_h}$ is the hidden dimension of the expert input) and a router $h(\cdot)$: $ G_{i,j} = \mathrm{ReLU} \big( h ( \mathbf{s}_i ) \big), \quad \mathbf{v}_i = \sum_{j=1}^{N_e} G_{i,j} \cdot e_{i,j} ( \mathbf{s}_i ) $ Where:

    • $G_{i,j}$ is the activation gate value for the $j$-th expert for the $i$-th token. The ReLU function ensures that only experts with a positive router score are activated.

    • $h(\mathbf{s}_i)$ is the output of the router network (typically a linear layer) for token $\mathbf{s}_i$, which produces scores for each expert.

    • $\mathbf{v}_i$ is the final output of the MoE layer for token $\mathbf{s}_i$, calculated as a weighted sum of the outputs of activated experts.

    • $e_{i,j}(\mathbf{s}_i)$ is the output of the $j$-th expert when processing token $\mathbf{s}_i$.

    • $N_e$ is the total number of experts available for each token.

      This ReLU Routing mechanism allows more experts to be activated for high-information tokens (tokens that require more complex modeling) while activating fewer for simpler ones, improving parameter efficiency. The sparsity (i.e., how many experts are activated on average) is controlled by an adaptive $\ell_1$ penalty added to the overall loss function: $ \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{reg}}, \quad \mathcal{L}_{\mathrm{reg}} = \sum_{i=1}^{N_t} \sum_{j=1}^{N_e} G_{i,j} $ Where:

    • $\mathcal{L}$ is the total loss to be minimized during training.

    • $\mathcal{L}_{\mathrm{task}}$ is the primary task-specific loss (e.g., cross-entropy for classification).

    • $\lambda$ is a coefficient that controls the strength of the regularization.

    • $\mathcal{L}_{\mathrm{reg}}$ is the regularization loss, which penalizes the sum of all gate values $G_{i,j}$. By adjusting $\lambda$, the average active-expert ratio can be steered towards a desired budget, encouraging sparsity.

    • $N_t$ is the number of tokens.

  2. Dense-training / Sparse-inference (DTSI-MoE): Inspired by [19], this strategy uses two separate routers during training: $h_{\mathrm{train}}$ and $h_{\mathrm{infer}}$.

    • Both $h_{\mathrm{train}}$ and $h_{\mathrm{infer}}$ are updated throughout the training process.
    • Crucially, the regularization loss $\mathcal{L}_{\mathrm{reg}}$ (which encourages sparsity) is applied only to $h_{\mathrm{infer}}$. This means $h_{\mathrm{train}}$ can learn to activate experts more densely during training without direct sparsity pressure, allowing experts to receive sufficient gradient updates and preventing under-training.
    • During inference, only $h_{\mathrm{infer}}$ is used. Since $h_{\mathrm{infer}}$ was trained with the sparsity regularization, it selects a sparse set of experts, thus reducing inference cost. The DTSI-MoE approach ensures that experts do not suffer from under-training (a common problem in Sparse-MoE due to imbalanced routing or aggressive sparsity) while still enabling significant reductions in inference cost by activating only a subset of experts.
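
The following is a minimal, dense sketch of the ReLU routing described above; every expert is evaluated for clarity (a production sparse kernel would skip zero-gated experts), and DTSI-MoE would add a second, sparsity-regularized router used only at inference. All sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class ReLURoutedMoE(nn.Module):
    """For one token position: gate_j = ReLU(router(s))_j and
    v = sum_j gate_j * expert_j(s). The sum of gates is returned so the
    caller can add lambda * gate_sum to the task loss as the L1 penalty."""

    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, s):                              # s: (batch, D) input of one token
        gates = torch.relu(self.router(s))             # (batch, num_experts); zeros switch experts off
        out = torch.zeros_like(s)
        for j, expert in enumerate(self.experts):
            out = out + gates[:, j:j + 1] * expert(s)  # weighted sum of expert outputs
        return out, gates.sum()                        # output and the L_reg contribution

# Toy usage: the returned penalty would be scaled by lambda and added to the task loss
moe = ReLURoutedMoE(dim=32, num_experts=8, hidden=64)
v, l1 = moe(torch.randn(4, 32))
print(v.shape, float(l1))
```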

4.2.5. Scaling Up Directions

RankMixer is inherently designed as a highly parallel and extensible architecture. Its parameter count and computational cost can be expanded along four orthogonal axes, allowing for flexible scaling strategies:

  1. Token count ($T$): Increasing the number of feature tokens can allow for more granular representation of input features.

  2. Model width ($D$): Increasing the hidden dimension of each token allows for richer representations.

  3. Layers ($L$): Stacking more RankMixer blocks allows the model to learn deeper and more complex feature interactions.

  4. Expert Numbers ($E$): In the Sparse-MoE variant, increasing the number of experts per Per-token FFN significantly boosts model capacity.

    For a full-dense-activated version of RankMixer (i.e., without Sparse-MoE), the approximate parameter count and forward FLOPs for one sample can be computed as: $ \#\mathrm{Param} \approx 2kLTD^2 $ $ \mathrm{FLOPs} \approx 4kLTD^2 $ Where:

  • $\#\mathrm{Param}$ is the approximate total number of trainable parameters in the dense part of the model.

  • $\mathrm{FLOPs}$ is the approximate number of floating-point operations for one forward pass.

  • $k$ is the scale ratio adjusting the hidden dimension of the FFN (e.g., the inner dimension of the FFN is $kD$).

  • $L$ is the number of RankMixer layers.

  • $T$ is the number of tokens.

  • $D$ is the hidden dimension of each token.

    In the Sparse-MoE version, the effective parameters and computation per sample are determined by the sparsity ratio $s$: $ s = \frac{\#\mathrm{Activated\_Param}}{\#\mathrm{Total\_Param}} $ Where:

  • $\#\mathrm{Activated\_Param}$ is the number of parameters actively used during a single forward pass (i.e., parameters of the activated experts).

  • $\#\mathrm{Total\_Param}$ is the total number of parameters in the MoE layer (across all experts). This allows for a large total parameter count to achieve high capacity, while maintaining a controlled computational cost during inference by only activating a fraction $s$ of the experts.
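
A quick back-of-the-envelope check of these formulas (the configuration values below are illustrative and not the paper's exact settings):

```python
def rankmixer_dense_cost(k, L, T, D):
    """#Param ≈ 2kLTD^2 trainable dense parameters; forward FLOPs ≈ 4kLTD^2 per sample."""
    params = 2 * k * L * T * D ** 2
    flops = 4 * k * L * T * D ** 2
    return params, flops

params, flops = rankmixer_dense_cost(k=4, L=2, T=16, D=768)
print(f"params ≈ {params / 1e6:.0f} M, forward FLOPs/sample ≈ {flops / 1e9:.1f} G")

# Sparse-MoE variant: only a fraction s of the parameters is activated per sample
s = 0.25
print(f"activated params ≈ {s * params / 1e6:.0f} M of {params / 1e6:.0f} M total")
```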

5. Experimental Setup

5.1. Datasets

The offline experiments for RankMixer were conducted using a large-scale, proprietary dataset sourced from the Douyin recommendation system (the Chinese version of TikTok).

  • Source: The data is derived directly from Douyin's online logs, capturing real user interactions and feedback labels. This ensures the experiments are highly relevant to a production environment.

  • Scale: The training dataset involves a massive scale, covering trillions of records per day. Experiments were conducted on data collected over a two-week period, indicating a dataset of immense volume.

  • Characteristics: The dataset includes a rich variety of features, totaling over 300 features. These encompass:

    • Numerical features: e.g., view duration, number of likes.
    • ID features: e.g., billions of user IDs and hundreds of millions of video IDs. These categorical IDs are converted into embeddings.
    • Cross features: Interactions between different raw features.
    • Sequential features: User behavior sequences over time.
  • Domain: The domain is short-video content recommendation, a highly dynamic and personalized environment.

    The choice of this dataset is highly effective for validating the method's performance because:

  • Real-world Relevance: Using production data from a massive platform like Douyin ensures that the model's performance and efficiency gains are directly applicable and meaningful in a real-world industrial setting.

  • Scale for Stress Testing: The trillion-scale data and billions of IDs provide a robust environment to test the scaling laws and efficiency of RankMixer under extreme load, confirming its ability to handle immense data volume and high-cardinality features.

  • Feature Richness: The 300+ diverse features allow for comprehensive evaluation of RankMixer's ability to model heterogeneous feature interactions effectively, which is a core design goal.

5.2. Evaluation Metrics

The paper uses a combination of performance metrics to assess model quality and efficiency metrics to evaluate computational cost and hardware utilization.

5.2.1. Performance Metrics

  • AUC (Area Under the Curve):

    1. Conceptual Definition: AUC is a widely used metric for binary classification problems. It represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In the context of recommendation, a higher AUC indicates better discriminatory power of the model in distinguishing between items a user will interact with (positive) and those they won't (negative). It provides a single score that summarizes the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across all possible classification thresholds. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates performance equivalent to random guessing.
    2. Mathematical Formula: The AUC is typically calculated as the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots TPR against FPR at various threshold settings. $ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(t)\right) \, dt $ (Note: in practice, AUC is usually computed non-parametrically, e.g., via the Mann-Whitney U statistic or by summing trapezoids under the ROC curve.)
    3. Symbol Explanation:
      • $\mathrm{TPR}$ (True Positive Rate), also known as Recall or Sensitivity: $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$, where $\mathrm{TP}$ is True Positives (correctly predicted positive) and $\mathrm{FN}$ is False Negatives (actual positive but predicted negative).
      • $\mathrm{FPR}$ (False Positive Rate): $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$, where $\mathrm{FP}$ is False Positives (actual negative but predicted positive) and $\mathrm{TN}$ is True Negatives (correctly predicted negative).
      • $\mathrm{FPR}^{-1}(t)$ denotes the decision threshold at which the FPR equals $t$.
  • UAUC (User-level AUC):

    1. Conceptual Definition: UAUC addresses a common issue in AUC calculation for recommendation systems: users with many interactions (or many positive labels) can disproportionately influence the overall AUC score. UAUC calculates the AUC independently for each user and then averages these per-user AUC scores. This ensures that the model's performance for individual users, especially those with fewer interactions, is given fair consideration, providing a more robust measure of personalized recommendation quality.
    2. Mathematical Formula: $ \mathrm{UAUC} = \frac{1}{N_{\mathrm{users}}} \sum_{u=1}^{N_{\mathrm{users}}} \mathrm{AUC}_u $
    3. Symbol Explanation:
      • $N_{\mathrm{users}}$: The total number of unique users in the evaluation set.
      • $\mathrm{AUC}_u$: The AUC score calculated specifically for user $u$, considering only their interactions. (A short code sketch of this computation follows the metrics list below.)
  • Finish/Skip AUC/UAUC: These metrics specifically evaluate the model's ability to predict two distinct user behaviors related to video consumption:

    • Finish: A label indicating whether a user completed watching a video.
    • Skip: A label indicating whether a user quickly slid to the next video (i.e., did not watch it for a meaningful duration).
    • The paper notes that an AUC increase of 0.0001 (or 0.01%) is considered a confidently significant improvement in industrial RS settings, highlighting the sensitivity and competitive nature of these systems.
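
For concreteness, a small sketch of the per-user AUC averaging (UAUC) defined above; skipping users whose labels are all one class is our assumption, since AUC is undefined for them:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def uauc(user_ids, labels, scores):
    """Mean of per-user AUCs; users with only positive or only negative labels are skipped."""
    user_ids, labels, scores = map(np.asarray, (user_ids, labels, scores))
    per_user = []
    for u in np.unique(user_ids):
        mask = user_ids == u
        y = labels[mask]
        if y.min() == y.max():          # AUC undefined for a single-class user
            continue
        per_user.append(roc_auc_score(y, scores[mask]))
    return float(np.mean(per_user))

# Toy example with two users
print(uauc(user_ids=[1, 1, 1, 2, 2, 2],
           labels=[1, 0, 1, 0, 1, 0],
           scores=[0.9, 0.2, 0.7, 0.3, 0.8, 0.1]))  # 1.0: both users rank positives above negatives
```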

5.2.2. Efficiency Metrics

  • Dense-Param:

    1. Conceptual Definition: This metric refers to the total number of parameters in the "dense" part of the model. In DLRMs, parameters are typically divided into sparse embedding parameters (e.g., for user IDs, item IDs, which can be very numerous but often looked up sparsely) and dense network parameters (e.g., weights and biases in MLPs, FFNs, attention layers). Dense-Param specifically counts the parameters in the dense computational layers, excluding the potentially much larger sparse embedding tables. This is a key indicator of the model's complexity and memory footprint during active computation.
    2. Mathematical Formula: No universal formula as it depends on the specific architecture. It's the sum of all weights and biases in the MLP, FFN, attention, or other dense layers.
    3. Symbol Explanation: Dense-Param is a count (scalar).
  • Training Flops/Batch:

    1. Conceptual Definition: This metric quantifies the number of floating-point operations (FLOPs) required to run a single batch of data through the model during training. It directly represents the computational cost of training the model. A batch size of 512 is used in the paper. Lower FLOPs/Batch implies faster training or less computational resource consumption per step.
    2. Mathematical Formula: No single formula; it's an architectural sum.
    3. Symbol Explanation: FLOPs (Floating Point Operations) is a unit of computational work.
  • MFU (Model FLOPs Utilization):

    1. Conceptual Definition: MFU is a crucial metric for evaluating the efficiency of a model's implementation on specific hardware (e.g., GPU). It measures how well the model's actual computational workload (FLOPs consumption) utilizes the hardware's theoretical peak FLOPs capacity. A high MFU indicates that the model's operations are well-suited for the hardware's architecture (e.g., exploiting parallelism, using compute-bound operations), minimizing wasted computational cycles. A low MFU often points to bottlenecks like memory access (memory-bound operations).
    2. Mathematical Formula: $ \mathrm{MFU} = \frac{\text{Actual FLOPs Consumption}}{\text{Theoretical Hardware FLOPs Capacity}} $
    3. Symbol Explanation:
      • Actual FLOPs Consumption: The actual number of floating-point operations performed by the model for a given task (e.g., one forward/backward pass).
      • Theoretical Hardware FLOPs Capacity: The maximum number of floating-point operations the hardware (e.g., GPU) can theoretically perform per second under ideal conditions.
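
As a quick illustration of the MFU definition (the throughput numbers below are made up for the example and are not taken from the paper):

```python
def model_flops_utilization(achieved_flops_per_s, peak_flops_per_s):
    """MFU = achieved FLOPs throughput / theoretical peak FLOPs of the hardware."""
    return achieved_flops_per_s / peak_flops_per_s

# A model sustaining 140 TFLOP/s on hardware with a 312 TFLOP/s peak runs at ~45% MFU.
print(f"MFU = {model_flops_utilization(140e12, 312e12):.1%}")  # MFU = 44.9%
```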

5.3. Baselines

To thoroughly evaluate RankMixer, the authors compare it against several widely recognized state-of-the-art DLRM baselines, categorized by their feature interaction modeling strategies:

  • DLRM-MLP (base): This serves as the fundamental baseline. It represents a vanilla Deep Learning Recommendation Model where a simple Multi-Layer Perceptron (MLP) is used for feature crossing and dense interactions.

  • DLRM-MLP-100M: A scaled-up version of the basic DLRM-MLP with approximately 100 million parameters, demonstrating the effects of simply increasing MLP capacity.

  • DCNv2 [26]: A successor to the original Deep & Cross Network. It is a state-of-the-art model for explicit feature crossing, using a cross network to efficiently learn bounded-degree feature interactions.

  • RDCN [3]: (Likely "Recurrent DCN" or a variant, though not explicitly cited in the paper's bibliography within the abstract context, it's mentioned as SOTA for feature cross). Another advanced feature cross model.

  • MoE: A general Mixture-of-Experts model, scaled up by using multiple parallel experts. This baseline helps evaluate the general effectiveness of MoE for increasing capacity in DLRMs.

  • AutoInt [23]: This model combines heterogeneous self-attention layers to automatically learn feature interactions. It treats features as elements in an attention mechanism.

  • Hiformer [9]: An advanced self-attention based model that combines heterogeneous self-attention layers with low-rank approximation matrix computations, aiming for efficient and flexible interaction patterns.

  • DHEN [31] (Deep and Hierarchical Ensemble Network): A model that combines different types of feature-cross blocks (e.g., DCN, self-attention, Factorization Machine (FM), Logistic Regression (LR)) and stacks them in multiple layers to learn diverse and hierarchical feature interactions.

  • Wukong [30]: This model specifically investigates the scaling laws of feature interaction mechanisms. It follows an architecture similar to DHEN but emphasizes Factorization Machine Block (FMB) and Linear Compress Block (LCB) for learning interactions.

    Training Environment: All experiments were conducted on hundreds of GPUs in a hybrid distributed training framework.

  • Sparse part: Updated asynchronously. This means that updates to the large embedding tables (sparse parameters) don't block the dense model training, allowing for more efficient handling of high-cardinality features.

  • Dense part: Updated synchronously. This ensures that the dense network parameters are consistently updated across all GPUs in parallel.

  • Optimizer hyperparameters: Consistent across all models.

    • Dense part: RMSProp optimizer with a learning rate of 0.01.

    • Sparse part: Adagrad optimizer.

      These baselines are representative because they cover a wide spectrum of modern DLRM architectures, from basic MLP to advanced explicit feature crossing, self-attention based models, and MoE approaches, all of which are relevant for learning complex feature interactions and exploring scaling strategies in recommendation systems.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate RankMixer's superior performance, scaling capabilities, and efficiency compared to existing state-of-the-art models in industrial recommendation systems.

6.1.1. Comparison with SOTA Methods

The initial comparison focuses on models with similar parameter sizes (around 100 million) to evaluate structural performance under comparable computational budgets.

The following are the results from Table 1 of the original paper:

| Model | Finish AUC↑ | Finish UAUC↑ | Skip AUC↑ | Skip UAUC↑ | Params | FLOPs/Batch |
|---|---|---|---|---|---|---|
| DLRM-MLP (base) | 0.8554 | 0.8270 | 0.8124 | 0.7294 | 8.7 M | 52 G |
| DLRM-MLP-100M | +0.15% | +0.15% | – | – | 95 M | 185 G |
| DCNv2 | +0.13% | +0.13% | +0.15% | +0.26% | 22 M | 170 G |
| RDCN | +0.09% | +0.12% | +0.10% | +0.22% | 22.6 M | 172 G |
| MoE | +0.09% | +0.12% | +0.08% | +0.21% | 47.6 M | 158 G |
| AutoInt | +0.10% | +0.14% | +0.12% | +0.23% | 19.2 M | 307 G |
| DHEN | +0.18% | +0.26% | +0.36% | +0.52% | 22 M | 158 G |
| HiFormer | +0.48% | – | – | – | 116 M | 326 G |
| Wukong | +0.29% | +0.29% | +0.49% | +0.65% | 122 M | 442 G |
| RankMixer-100M | +0.64% | +0.72% | +0.86% | +1.33% | 107 M | 233 G |
| RankMixer-1B | +0.95% | +1.22% | +1.25% | +1.82% | 1.1 B | 2.1 T |

Analysis of Table 1:

  • DLRM-MLP Baseline: The base DLRM-MLP model has 8.7M parameters and 52G FLOPs/Batch. Scaling it to 95M parameters (DLRM-MLP-100M) yields only limited AUC gains (+0.15%), confirming that simply increasing MLP capacity is insufficient without architectural improvements tailored for recommendation data.
  • Traditional Cross-Structure Models: DCNv2, RDCN, AutoInt, DHEN show some improvements over the base DLRM-MLP. However, models like AutoInt (19.2M params, 307G FLOPs) and HiFormer (116M params, 326G FLOPs) exhibit a high ratio of FLOPs to parameters, indicating inefficiencies that constrain their ability to scale effectively. Their designs often lead to large computational costs even for moderate parameter sizes. DHEN and MoE are more FLOPs efficient for their parameter count, but their AUC gains are still moderate.
  • RankMixer-100M Superiority: RankMixer-100M (107M parameters, 233G FLOPs) achieves the best performance among models around the 100M parameter range, with +0.64% AUC for finish and +0.86% AUC for skip. Its FLOPs are moderate compared to AutoInt and HiFormer, reflecting a better balance between model capacity and computational load.
  • RankMixer-1B Scaling: When scaled to 1.1 billion parameters, RankMixer-1B further boosts performance significantly, achieving +0.95% AUC for finish and +1.25% AUC for skip. This demonstrates RankMixer's strong scaling capability, although with a substantially increased FLOPs/Batch (2.1T).
  • Comparison with SOTA Scaling Models: Wukong (122M parameters, 442G FLOPs) shows decent performance, but RankMixer-100M still outperforms it with lower FLOPs. This highlights RankMixer's efficiency in achieving higher performance with less computation relative to other advanced scaling models.

6.1.2. Scaling Laws of Different Models

The paper investigates the scaling laws of various models, observing how performance (AUC gain) changes with increasing parameters and FLOPs.

The following figure (Figure 2 from the original paper) shows the scaling law curves observed from both parameter size and Flops:

Figure 2: Scaling laws between AUC gain and Params/FLOPs of various models. The x-axis uses a logarithmic scale. (Left panel: AUC gain (%) versus dense parameters (M); right panel: AUC gain (%) versus GFLOPs per batch; models such as RankMixer, DLRM, and HiFormer are marked for comparison.)

Analysis of Figure 2:

  • The x-axis uses a logarithmic scale, emphasizing the behavior over orders of magnitude.

  • RankMixer's Steepest Scaling: RankMixer consistently demonstrates the steepest scaling law curve for both parameters (left plot) and FLOPs (right plot). This means that for a given increase in parameters or FLOPs, RankMixer achieves a greater AUC gain compared to all other models. This validates its design principles for efficient scaling.

  • Wukong vs. RankMixer: While Wukong exhibits a relatively steep parameter curve, its computational cost (as shown in the AUC vs FLOPs curve) increases even more rapidly. Consequently, the performance gap between Wukong and RankMixer is larger when comparing against FLOPs, indicating RankMixer's superior compute-efficiency.

  • HiFormer Performance: HiFormer's performance is slightly inferior to RankMixer's. This is attributed to its reliance on token segmentation at a feature level and the use of Attention, which, as discussed in the methodology, can be less efficient for heterogeneous recommendation data.

  • DHEN and MoE Scalability: The scaling of DHEN is not ideal, reflecting the limited scalability of its specific cross-structure. MoE's strategy of scaling by adding experts faces challenges in maintaining expert balance, leading to suboptimal scaling performance when not properly addressed (which RankMixer aims to do).

    Scaling Directions: The paper also explored scaling RankMixer by increasing width (D), feature tokens (T), and layers (L). A conclusion shared with LLM scaling laws was observed: model quality primarily correlates with the total number of parameters, and different scaling directions (depth L, width D, tokens T) yield almost identical performance for a given parameter count. From a compute-efficiency standpoint, a larger hidden dimension (increasing D) generates larger matrix-multiply shapes, leading to higher MFU compared to simply stacking more layers. Based on these insights, the final configurations were set as:

  • RankMixer-100M: D = 768, T = 16, L = 2

  • RankMixer-1B: D = 1536, T = 32, L = 2

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

| Setting | ΔAUC |
|---|---|
| w/o skip connections | -0.07% |
| w/o multi-head token mixing | -0.50% |
| w/o layer normalization | -0.05% |
| Per-token FFN → shared FFN | -0.31% |

The following are the results from Table 3 of the original paper:

| Routing strategy | ΔAUC | ΔParams | ΔFLOPs |
|---|---|---|---|
| All-Concat-MLP | -0.18% | ~0.0% | ~0.0% |
| All-Share | -0.25% | ~0.0% | ~0.0% |
| Self-Attention | -0.03% | +16% | +71.8% |

The following are the results from Table 4 of the original paper:

| User group | Douyin app: Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ | Douyin lite: Active Day↑ | Duration↑ | Like↑ | Finish↑ | Comment↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Overall | +0.2908% | +1.0836% | +2.3852% | +1.9874% | +0.7886% | +0.1968% | +0.9869% | +1.1318% | +2.0744% | +1.1338% |
| Low-active | +1.7412% | +3.6434% | +8.1641% | +4.5393% | +2.9368% | +0.5389% | +2.1831% | +4.4486% | +3.3958% | +0.9595% |
| Middle-active | +0.7081% | +1.5269% | +2.5823% | +2.5062% | +1.2266% | +0.4235% | +1.9011% | +1.7491% | +2.6568% | +0.6782% |
| High-active | +0.1445% | +0.6259% | +1.828% | +1.4939% | +0.4151% | +0.0929% | +1.1356% | +1.8212% | +1.7683% | +2.3683% |

The following are the results from Table 5 of the original paper:

| Metric | ΔAUC↑ | ADVV↑ |
|---|---|---|
| Lift | +0.73% | +3.90% |

6.3. Ablation Studies / Parameter Analysis

6.3.1. Ablation on Components of RankMixer

The paper conducted ablation studies on the RankMixer-100M model to understand the contribution of its core architectural components.

Analysis of Table 2:

  • w/o skip connections: Removing skip connections led to a 0.07% decrease in ΔAUC. This indicates that residual connections are important for RankMixer, likely contributing to training stability and mitigating issues like vanishing or exploding gradients in deep networks.
  • w/o multi-head token mixing: The most significant performance drop (-0.50% ΔAUC) was observed when removing Multi-Head Token Mixing. This highlights its critical role: without it, each Per-token FFN would only model partial features without any interaction or information exchange across tokens, losing global contextual information and the ability to learn cross-feature interactions.
  • w/o layer normalization: Removing Layer Normalization resulted in a 0.05% decrease in ΔAUC. Similar to skip connections, LayerNorm contributes to training stability, preventing performance degradation.
  • Per-token FFN → shared FFN: Replacing Per-token FFNs with a shared FFN (where all tokens are processed by the same FFN parameters) led to a 0.31% decrease in ΔAUC. This strongly validates the importance of parameter isolation for different feature subspaces: a shared FFN would force diverse semantic features through the same parameters, leading to potential feature domination and suboptimal modeling of heterogeneous interactions. (A minimal sketch of the token-mixing and Per-token-FFN components appears after this list.)
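To make the two most impactful ablations concrete, the following is a minimal PyTorch-style sketch of one block combining parameter-free multi-head token mixing, Per-token FFNs, a skip connection, and LayerNorm. The hidden width, the choice of ReLU, and the simplifying assumption that the number of heads equals the number of tokens are illustrative guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RankMixerBlockSketch(nn.Module):
    """Minimal sketch (not the authors' implementation) of one block:
    parameter-free multi-head token mixing, then Per-token FFNs,
    with the skip connection and LayerNorm probed in the ablation."""

    def __init__(self, num_tokens: int, dim: int, hidden_mult: int = 4):
        super().__init__()
        # Simplifying assumption: #heads == #tokens, so mixed tokens keep width `dim`.
        assert dim % num_tokens == 0
        self.num_tokens = num_tokens
        # One independent FFN per token: parameter isolation across feature subspaces.
        self.per_token_ffns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, hidden_mult * dim),
                nn.ReLU(),
                nn.Linear(hidden_mult * dim, dim),
            )
            for _ in range(num_tokens)
        )
        self.norm = nn.LayerNorm(dim)

    def token_mix(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]. Split every token into T heads of size D/T, then rebuild
        # mixed token t from the t-th head of every token -- a parameter-free
        # exchange that hands each Per-token FFN a slice of every feature subspace.
        b, t, d = x.shape
        heads = x.view(b, t, t, d // t)                 # [B, tokens, heads, D/T]
        return heads.transpose(1, 2).reshape(b, t, d)   # [B, mixed tokens, D]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = self.token_mix(x)
        out = torch.stack(
            [ffn(mixed[:, i]) for i, ffn in enumerate(self.per_token_ffns)], dim=1
        )
        return self.norm(out + x)  # skip connection + LayerNorm


# Toy usage: 16 tokens of width 768, echoing the RankMixer-100M configuration.
block = RankMixerBlockSketch(num_tokens=16, dim=768)
print(block(torch.randn(4, 16, 768)).shape)  # torch.Size([4, 16, 768])
```

In this sketch, skipping `token_mix` corresponds to the "w/o multi-head token mixing" row (each FFN then only sees its own slice), while collapsing the `ModuleList` into a single shared FFN corresponds to the "Per-token FFN → shared FFN" row.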

6.3.2. Token2FFN Routing-strategy Comparison

The paper further analyzed different strategies for routing feature tokens to FFNs, comparing them against the proposed Multi-Head Token Mixing.

Analysis of Table 3:

  • All-Concat-MLP: This strategy concatenates all tokens into a single large vector, processes it through one large MLP, and then splits it back into the same number of tokens. It resulted in a 0.18% ΔAUC drop. This shows the difficulty of learning effectively from one extremely large matrix (a single MLP over all concatenated features) and how merging everything too early weakens local information learning.
  • All-Share: In this strategy, the entire input vector is shared and fed to each Per-token FFN (similar to how experts in an MoE might share input). This led to a significant 0.25% ΔAUC decline. This result emphasizes the importance of feature-subspace splitting and independent modeling for distinct tokens, in contrast to an all-shared input that fails to capture the diversity of feature semantics.
  • Self-Attention: Applying self-attention between tokens for routing yielded a 0.03% ΔAUC drop. While the performance decrease is small, it came with a substantial increase in ΔParams (+16%) and ΔFLOPs (+71.8%). This confirms the paper's argument that self-attention is computationally expensive and inefficient for heterogeneous feature subspaces in RS, failing to outperform Multi-Head Token Mixing despite its higher cost. It highlights the difficulty of learning meaningful similarity across hundreds of different feature subspaces via inner products.

6.3.3. Sparse-MoE Scalability and Expert Balance

The paper investigates the scalability of the Sparse-MoE variant of RankMixer and the effectiveness of its expert balancing strategies.

The following figure (Figure 3 from the original paper) plots offline AUC gains versus Sparsity of SMoE:

Figure 3: AUC performance of RankMixer variants under a decreasing activation ratio (1, 1/2, 1/4, 1/8 of experts): dense-training + ReLU-routed SMoE preserves almost all accuracy of the 1B dense model. (The plot compares variants including Dense-RankMixer-1B and Vanilla-SMoE-RankMixer; AUC degrades as the activation ratio shrinks.)

Analysis of Figure 3 (Scalability):

  • The x-axis represents the activation ratio of experts, starting from 1 (dense model) and decreasing to 1/8 (highly sparse).

  • Dense-training + ReLU-routed SMoE (RankMixer's approach): This variant demonstrates remarkable robustness. Even under aggressive sparsity (down to 1/8 of experts activated), it preserves almost all the accuracy of the 1B dense model. This is crucial for practical deployment, as it enables substantial parameter scaling (>8x) with minimal AUC loss and significant inference-time savings (a roughly 450% throughput improvement). This validates Sparse-MoE as a viable path to scale RankMixer to 10B+ parameters.

  • Vanilla SMoE: The performance of Vanilla SMoE drops monotonically and significantly as fewer experts are activated. This illustrates the fundamental problems of expert imbalance and under-training when Vanilla SMoE is applied without specialized routing and training strategies.

  • Vanilla SMoE + Load Balancing Loss: Adding a load-balancing loss to Vanilla SMoE reduces the performance degradation compared to pure Vanilla SMoE. However, it still falls short of the DTSI + ReLU version. This suggests that the core problems lie not just in distributing load but more deeply in ensuring experts are adequately trained and effectively utilized during sparse inference.

    The following figure (Figure 4 from the original paper) shows Activated expert ratio for various tokens:

    Figure 4: Activated expert ratio for various tokens. (The rendered plot shows two panels of curves, the upper for layer 1 and the lower for layer 2.)

    Analysis of Figure 4 (Expert Balance and Diversity):

  • This figure (labeled as "Activated expert ratio for various tokens" but visually resembling a loss curve or activation distribution per layer) indirectly supports the claim that DTSI + ReLU routing effectively resolves expert imbalance.

  • The text describes that Dense-training ensures most experts receive sufficient gradient updates, preventing expert starvation (i.e., "dying experts" that are never activated).

  • ReLU routing makes the activation ratio dynamic across tokens. The varying activation proportions shown for the two layers in the figure (though the figure itself is somewhat ambiguous as a direct depiction of "activated expert ratio") are meant to illustrate that some tokens (likely high-information ones) activate more experts, while others (low-information ones) activate fewer. This adaptive allocation aligns well with the diverse and highly dynamic distribution of recommendation data, leading to more efficient utilization of expert capacity. A minimal sketch of such ReLU-gated routing follows below.
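The sketch below illustrates one plausible reading of ReLU routing under dense training: the gate is a ReLU over router logits, every expert runs during training so all of them receive gradients, and at inference experts with a zero gate could be skipped, so the number of active experts naturally varies per token. The router shape, expert sizes, and any sparsity penalty (e.g., an L1 term on the gates) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class ReLUGatedExpertsSketch(nn.Module):
    """Illustrative ReLU-routed expert bank: any expert with a positive gate is
    active, so the activation ratio differs from token to token (unlike top-k)."""

    def __init__(self, dim: int, num_experts: int, hidden: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: [B, D]; gates: [B, E], zeroed by the ReLU for "inactive" experts.
        gates = torch.relu(self.router(x))
        # Dense training: run every expert so each keeps receiving gradients;
        # sparse inference could skip experts whose gate is exactly zero.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [B, E, D]
        y = (gates.unsqueeze(-1) * expert_out).sum(dim=1)              # [B, D]
        active_ratio = (gates > 0).float().mean(dim=1)                 # per-token ratio
        return y, active_ratio


moe = ReLUGatedExpertsSketch(dim=64, num_experts=8, hidden=128)
out, ratio = moe(torch.randn(4, 64))
print(out.shape, ratio)  # different samples may activate different numbers of experts
```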

6.3.4. Online Serving Cost

A critical achievement of RankMixer is its ability to scale model parameters by approximately 70x (from a previous 16M-parameter baseline to 1.1B parameters) while maintaining stable inference latency in a production environment.

The following are the results from Table 6 of the original paper:

| Metric | OnlineBase-16M | RankMixer-1B | Change |
|---|---|---|---|
| #Param | 15.8 M | 1.1 B | ↑ 70X |
| FLOPs | 107 G | 2106 G | ↑ 20.7X |
| FLOPs/Param (G/M) | 6.8 | 1.9 | ↓ 3.6X |
| MFU | 4.47% | 44.57% | ↑ 10X |
| Hardware FLOPs | fp32 | fp16 | ↑ 2X |
| Latency | 14.5 ms | 14.3 ms | – |

Analysis of Table 6 (Online Deployment Metrics): The stability in latency despite the massive parameter increase is explained by the following formula and the three-pronged optimization strategy: $ \mathrm{Latency} = \frac{\#\mathrm{Param} \times (\mathrm{FLOPs/Param\ ratio})}{\mathrm{MFU} \times \mathrm{Theoretical\ Hardware\ FLOPs}} $

  1. FLOPs/Param ratio (↓3.6X):
    • The baseline model had 6.8 GFLOPs per million parameters, while RankMixer-1B achieved 1.9 GFLOPs per million parameters.
    • RankMixer increased parameters by 70x but only saw a 20.7x increase in total FLOPs. This means its FLOPs/Param ratio decreased by roughly 3.6x (6.8 / 1.9 ≈ 3.6; the looser estimate 70 / 20.7 ≈ 3.4 is consistent). This efficiency gain means RankMixer can accommodate significantly more parameters (over 3 times more) for the same FLOPs budget, primarily due to the Sparse-MoE architecture where not all parameters are activated per inference.
  2. Model FLOPs Utilization (MFU) (↑10X):
    • The baseline had a very low MFU of 4.47%, indicating it was largely memory-bound.
    • RankMixer significantly improved MFU to 44.57% (a 10x increase). This was achieved by:
      • Using large GEMM (General Matrix Multiply) shapes, which are highly efficient on GPUs.
      • Employing a good parallel topology (e.g., fusing parallel Per-token FFNs into a single GPU kernel).
      • Reducing memory bandwidth cost and overhead.
    • This shift from a memory-bound to a compute-bound regime is critical for GPU efficiency, meaning the GPU spends more time on actual computation rather than waiting for data from memory.
  3. Quantization (↑2X Theoretical Hardware FLOPs):
    • The primary computations in RankMixer (large matrix multiplications) are well-suited for half-precision (fp16) inference.

    • Moving from fp32 (single-precision) to fp16 (half-precision) inference effectively doubles the theoretical peak Hardware FLOPs capacity of GPUs. This is a hardware-level acceleration that RankMixer can leverage.

      By synergistically combining a lower FLOPs/Param ratio, a much higher MFU, and the benefits of fp16 quantization, RankMixer was able to drastically increase its parameter count without increasing inference latency; a back-of-the-envelope check of this arithmetic is sketched below.
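As a sanity check on how these three levers combine, the script below plugs Table 6's FLOPs and MFU figures into the latency relation above. The peak-FLOPs values for the hardware are assumptions (the paper does not state the GPU model); they are chosen only to show that the reported latencies are consistent with the formula.

```python
# Back-of-the-envelope latency check using Table 6's numbers.
def latency_ms(flops_per_request, mfu, peak_flops_per_s):
    """Latency = FLOPs / (MFU * theoretical hardware FLOPs capacity)."""
    return flops_per_request / (mfu * peak_flops_per_s) * 1e3

PEAK_FP32 = 165e12  # assumed fp32 peak, FLOP/s (hypothetical hardware figure)
PEAK_FP16 = 330e12  # assumed fp16 peak, ~2x the fp32 figure

# OnlineBase-16M: 107 GFLOPs per request, MFU 4.47%, fp32 serving.
print(f"{latency_ms(107e9, 0.0447, PEAK_FP32):.1f} ms")   # ~14.5 ms
# RankMixer-1B: 2106 GFLOPs per request, MFU 44.57%, fp16 serving.
print(f"{latency_ms(2106e9, 0.4457, PEAK_FP16):.1f} ms")  # ~14.3 ms
```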

6.4. Online Performance

To verify its universality, RankMixer was deployed and evaluated in online A/B tests across two core application scenarios of personalized ranking: Feed Recommendation and Advertising.

6.4.1. Feed Recommendation

The RankMixer-1B model was fully deployed on the Douyin Feed Recommendation Ranking system. The long-term observation from A/B test results over 8 months is presented.

Analysis of Table 4 (Online lift of RankMixer in Feed Recommendation):

  • Overall Metrics: RankMixer-1B achieved statistically significant uplifts across all key business metrics for both Douyin app and Douyin lite.
    • Douyin app: Active Day↑ +0.2908%, Duration↑ +1.0836%, Like↑ +2.3852%, Finish↑ +1.9874%, Comment↑ +0.7886%.
    • Douyin lite: Similar positive trends with Active Day↑ +0.1968%, Duration↑ +0.9869%, etc.
  • Impact on User Activeness Levels: The improvements are not uniform across different user activeness groups:
    • Low-active users: Showed the largest improvements, with +1.7412% in Active Days and +3.6434% in Duration on the main Douyin app. This demonstrates RankMixer's strong generalization capability and its effectiveness in engaging users who are typically harder to activate, potentially by better capturing their subtle preferences or by recommending more relevant long-tail content.
    • Middle-active users: Also saw significant gains, e.g., +0.7081% in Active Days.
    • High-active users: Still benefited, but with smaller percentage uplifts (e.g., +0.1445% in Active Days), likely because these users are already highly engaged. These results underscore RankMixer's ability to drive overall user engagement and retention.

6.4.2. Advertising

RankMixer was also tested in an advertising scenario, replacing the dense part of the previous 16M parameter baseline model.

Analysis of Table 5 (Online lift of RankMixer in Advertising):

  • Advertising Performance: RankMixer delivered positive lifts:
    • ΔAUC↑ +0.73%: A significant improvement in the advertising model's quality.
    • ADVV↑ +3.90%: An increase in Advertiser Value, indicating higher revenue for the platform. This confirms RankMixer's versatility and effectiveness across different personalized ranking tasks, serving as a unified backbone for diverse applications.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces RankMixer, a novel, hardware-aware model architecture specifically designed to scale up ranking models in industrial recommender systems. RankMixer addresses the dual challenges of high computational cost and inefficient GPU utilization by legacy feature-crossing modules. Its core innovations include a Multi-head Token Mixing module for efficient cross-token interactions, Per-token FFNs for isolated and specialized modeling of heterogeneous feature subspaces, and a Sparse Mixture-of-Experts (MoE) variant with ReLU Routing and DTSI-MoE to enable massive parameter scaling without proportional increases in computational cost.

Experiments on a trillion-scale production dataset demonstrated RankMixer's superior scaling capabilities, boosting Model Flops Utilization (MFU) from 4.5% to 45% and achieving a 70x increase in model parameters (to 1.1 billion) while maintaining inference latency. Online A/B tests on Douyin's Feed Recommendation and Advertising platforms confirmed its universality and significant business impact, leading to a 0.3% increase in user active days and a 1.08% increase in total in-app usage duration, with notable gains among low-active users, all without increasing serving costs.

7.2. Limitations & Future Work

The paper explicitly points out that RankMixer provides a path to scale from current 1B parameters to future 10B-scale deployments without breaching cost budgets. While the paper doesn't explicitly list "Limitations" as a separate section, some implicit challenges or areas for future work can be inferred:

  • Complexity of Tokenization: The semantic-based tokenization relies on domain knowledge. While effective, manually grouping features can be labor-intensive. Future work could explore automated or learned semantic tokenization methods.
  • Hyperparameter Tuning for MoE: While ReLU Routing and DTSI-MoE address key MoE challenges, tuning MoE specific hyperparameters (e.g., number of experts, λ\lambda for sparsity regularization) can still be complex, especially with millions or billions of parameters.
  • Generalizability to Other RS Paradigms: While demonstrated for feed recommendation and advertising, RankMixer's applicability to other RS paradigms (e.g., sequential recommendation beyond basic sequence features, cold-start scenarios, fairness-aware recommendations) could be further explored.
  • Dynamic Adaptation to Feature Changes: Industrial RS features are constantly evolving. The model's ability to dynamically adapt its tokenization or Per-token FFN structure to new features or changes in feature importance could be an area for enhancement.

7.3. Personal Insights & Critique

RankMixer presents a highly compelling and practical solution to the persistent challenge of scaling DLRMs in industrial settings.

  • Innovative Hardware-Aware Design: The most significant insight is the explicit focus on hardware-aware design. Many academic papers optimize for AUC but often overlook deployment realities. RankMixer's strategy of re-engineering Transformer-like components (Multi-head Token Mixing for efficient cross-token interaction and Per-token FFNs for isolated feature subspace modeling) specifically to maximize MFU on GPUs is a critical differentiator. This pragmatic approach is essential for bridging the gap between research and production.

  • Addressing Heterogeneity Effectively: The core critique of self-attention for heterogeneous feature spaces is astute. The inner-product similarity simply isn't well-suited for disparate features like user IDs and video categories. RankMixer's Multi-head Token Mixing offers a clever, parameter-free alternative that achieves global information exchange without the pitfalls of self-attention in this context. The Per-token FFNs are equally important, providing dedicated "expert" networks for semantically distinct feature groups, which intuitively makes sense for preventing feature domination.

  • Robust MoE Implementation: The paper's solution to Sparse-MoE challenges (ReLU Routing and DTSI-MoE) is a valuable contribution. MoE has immense potential for scaling, but expert under-training and imbalance are notorious hurdles. RankMixer provides a robust framework that can truly unlock the ROI of MoE in practice.

  • Holistic Optimization for Deployment: The detailed breakdown of how 70x parameter increase was achieved with stable latency (via FLOPs/Param ratio reduction, MFU increase, and fp16 quantization) serves as an excellent blueprint for others facing similar industrial deployment challenges. It underscores that architectural innovation must be coupled with meticulous engineering optimization.

  • Applicability Beyond Douyin: While developed for Douyin, the principles of RankMixer are highly transferable. Any recommendation system, search ranking engine, or personalized content platform dealing with heterogeneous features, high QPS demands, and the desire to scale model capacity could benefit from adopting or adapting RankMixer's architecture. Its success in both recommendation and advertising scenarios further supports its universality.

    A potential area for further exploration or a critique could be a deeper theoretical analysis of why Multi-head Token Mixing, despite being parameter-free, is superior to self-attention for heterogeneous features beyond the empirical observation. Is there a formal way to characterize the "semantic distance" or "interactability" of different feature types that makes inner-product irrelevant or detrimental, and how Token Mixing inherently bypasses this? Additionally, while semantic-based tokenization works, future research could investigate graph-based or unsupervised methods for token grouping to further reduce reliance on domain experts.
