
STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models

Published: 11/24/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces STORE, a scalable ranking framework addressing representation and computational bottlenecks in personalized recommendation systems through semantic tokenization, efficient attention, and orthogonal rotation.

Abstract

Ranking models have become an important part of modern personalized recommendation systems. However, significant challenges persist in handling high-cardinality, heterogeneous, and sparse feature spaces, particularly regarding model scalability and efficiency. We identify two key bottlenecks: (i) Representation Bottleneck: Driven by the high cardinality and dynamic nature of features, model capacity is forced into sparse-activated embedding layers, leading to low-rank representations. This, in turn, triggers phenomena like "One-Epoch" and "Interaction-Collapse," ultimately hindering model scalability. (ii) Computational Bottleneck: Integrating all heterogeneous features into a unified model triggers an explosion in the number of feature tokens, rendering traditional attention mechanisms computationally demanding and susceptible to attention dispersion. To dismantle these barriers, we introduce STORE, a unified and scalable token-based ranking framework built upon three core innovations: (1) Semantic Tokenization fundamentally tackles feature heterogeneity and sparsity by decomposing high-cardinality sparse features into a compact set of stable semantic tokens; (2) Orthogonal Rotation Transformation rotates the subspace spanned by low-cardinality static features, which facilitates more efficient and effective feature interactions; and (3) Efficient Attention filters low-contributing tokens to improve computational efficiency while preserving model accuracy. Across extensive offline experiments and online A/B tests, our framework consistently improves prediction accuracy (online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84× throughput).


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models." This title suggests a novel framework designed to improve the scalability and efficiency of ranking models, particularly in recommendation systems, by addressing issues related to feature representation and computational costs through semantic tokenization, orthogonal transformations, and efficient attention mechanisms.

1.2. Authors

The authors are:

  • Yi Xu (Alibaba Group, Beijing, China)

  • Chaofan Fan (Alibaba Group, Beijing, China)

  • Jinxin Hu (Alibaba Group, Beijing, China)

  • Yu Zhang (Alibaba Group, Beijing, China)

  • Xiaoyi Zeng (Alibaba Group, Beijing, China)

  • Jing Zhang (Wuhan University, Wuhan, China)

    The majority of the authors are affiliated with Alibaba Group, indicating a strong industry presence and focus on practical applications in large-scale recommender systems. Jing Zhang from Wuhan University suggests an academic collaboration component.

1.3. Journal/Conference

The paper was published on 2025-11-24 (UTC). The ACM Reference Format section still carries the template placeholder "Proceedings of Make sure to enter the correct conference title from your rights confirmation email" (conference acronym "XX"). This implies it is a conference paper, though the specific conference name is a placeholder in the provided text. ACM conferences are generally highly reputable venues in computer science, especially for topics like information systems and recommender systems.

1.4. Publication Year

The publication year, based on the ACM reference format, is 2018. However, the provided metadata states "Published at (UTC): 2025-11-24T00:00:00.000Z" and the arXiv link also points to a 2025 publication date (arxiv.org/abs/2511.18805). Given the content and references (e.g., MoBA 2025, RankMixer 2025), it is highly likely that the publication year is 2025, and the 2018 in the ACM reference format is a placeholder or an error in the provided text. For this analysis, we will assume the intended publication year is 2025.

1.5. Abstract

The paper addresses significant challenges in modern personalized recommendation systems, specifically in ranking models that handle high-cardinality, heterogeneous, and sparse feature spaces. The authors identify two primary bottlenecks:

  1. Representation Bottleneck: This arises from high-cardinality and dynamic features forcing model capacity into sparse-activated embedding layers, leading to low-rank representations. This triggers issues like "One-Epoch" and "Interaction-Collapse," which limit model scalability.

  2. Computational Bottleneck: The integration of numerous heterogeneous features leads to an explosion of feature tokens, making traditional attention mechanisms computationally expensive (due to $O(L^2)$ complexity) and prone to attention dispersion.

    To overcome these, the paper introduces STORE, a unified and scalable token-based ranking framework with three core innovations:

  1. Semantic Tokenization: Decomposes high-cardinality sparse features into a compact, stable set of semantic tokens (SIDs), addressing feature heterogeneity and sparsity.

  2. Orthogonal Rotation Transformation: Rotates the subspace of low-cardinality static features to facilitate more efficient and effective feature interactions.

  3. Efficient Attention: Filters low-contributing tokens to enhance computational efficiency and prevent attention dispersion while maintaining accuracy.

    The STORE framework is validated through extensive offline experiments and online A/B tests, demonstrating consistent improvements in prediction accuracy (e.g., online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84× throughput).

2. Executive Summary

2.1. Background & Motivation

Modern personalized recommendation systems heavily rely on ranking models to model complex user behavior by processing a vast and heterogeneous collection of features. However, current ranking models face significant challenges, preventing them from scaling effectively, unlike Large Language Models (LLMs) which benefit from Scaling Laws. The paper identifies two fundamental bottlenecks:

  1. Representation Bottleneck:

    • Problem: High-cardinality features (features with a very large number of unique values, like item IDs or user IDs) are typically handled by sparse-activated embedding layers. When these embeddings are fed into deep neural networks, they often result in low-rank representations. This means the embedding vectors occupy a limited subspace, failing to capture the full complexity of the data.
    • Consequences: This issue leads to phenomena such as "One-Epoch" (where model performance peaks quickly and then plateaus or degrades, suggesting limited learning capacity beyond initial exposure) and "Interaction-Collapse" (where high-order feature interactions, which are crucial for fine-grained recommendations, are lost).
    • Impact: These problems severely hinder the scalability of ranking models, meaning that simply increasing model depth or training epochs yields diminishing returns, thus undermining capacity utilization and predictable scaling.
  2. Computational Bottleneck:

    • Problem: Integrating a vast number of heterogeneous features into a unified model naturally leads to an explosion in the number of feature tokens. Each feature (or group of features) can be considered a token.

    • Consequences: Traditional attention mechanisms (like self-attention in Transformers) have a computational complexity of $O(L^2)$, where $L$ is the sequence length (number of tokens). With an explosion of feature tokens, this quadratic complexity becomes computationally prohibitive. Furthermore, a large number of tokens can lead to attention dispersion, where the attention mechanism struggles to focus on truly vital signals amidst a "sea of irrelevant tokens," diluting the impact of important interactions.

    • Impact: This makes it difficult to scale ranking models to incorporate richer feature sets and more complex interactions efficiently.

      The paper's entry point is to directly address these two fundamental bottlenecks by proposing a unified, token-based framework that re-thinks how features are represented and interact.

2.2. Main Contributions / Findings

The paper's primary contributions are the introduction of STORE (Semantic Tokenization, Orthogonal Rotation, and Efficient Attention), a unified and scalable token-based ranking framework built upon three synergistic components designed to dismantle the identified bottlenecks:

  1. Semantic Tokenization:

    • Contribution: Proposes a novel method to decompose high-cardinality sparse features (like item IDs) into a compact set of stable semantic tokens (SIDs). This is achieved using an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ).
    • Problem Solved: Fundamentally mitigates feature heterogeneity and sparsity, which are root causes of the representation bottleneck and low-rank embeddings. By mapping sparse IDs to dense, structured SIDs, it allows for more efficient and effective model scaling.
  2. Orthogonal Rotation Transformation:

    • Contribution: Introduces a mechanism to rotate the subspace spanned by low-cardinality static features (features with fewer unique values, like age, gender, category ID). This transformation generates diverse instance-wise feature blocks.
    • Problem Solved: Facilitates more efficient and effective feature interactions in high-dimensional spaces for these static features, complementing the handling of high-cardinality features and contributing to resolving the representation bottleneck. Diversity regularization ensures distinct rotations.
  3. Efficient Attention for Unified Feature Interaction:

    • Contribution: Integrates an efficient attention mechanism (specifically MOBA) that adaptively prunes low-contributing tokens based on the target item and context. This allows for unified feature interaction across all processed tokens.
    • Problem Solved: Drastically reduces the computational complexity from quadratic $O(L^2)$ to a more manageable level, directly addressing the computational bottleneck and alleviating attention dispersion by focusing on relevant tokens.

Key Conclusions / Findings:

  • STORE consistently improves prediction accuracy: The framework achieved an online CTR increase of 2.71% and an AUC improvement of 1.195% in online A/B tests and offline experiments, respectively.
  • STORE significantly boosts training efficiency: It demonstrated a 1.84× higher training throughput, making it more scalable for large industrial applications.
  • The Semantic Tokenization effectively combats the "One-Epoch" and "Interaction-Collapse" problems, allowing models to benefit from more training epochs and deeper layers.
  • The Orthogonal Rotation Transformation creates diverse feature representations, enhancing interaction quality.
  • The Efficient Attention mechanism maintains accuracy while substantially reducing computational cost, enabling a unified interaction framework for a large number of tokens.
  • The synergistic combination of these three components provides a unified and scalable solution for token-based ranking models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand STORE, a beginner should be familiar with the following concepts in recommender systems and deep learning:

  • Recommender Systems: Systems that predict user preferences for items (e.g., movies, products, articles) and suggest relevant ones. Ranking models are a core component, ordering items based on predicted relevance or click-through probability.
  • Click-Through Rate (CTR) Prediction: A common task in recommender systems where the goal is to predict the probability that a user will click on a given item. Higher CTR often indicates more relevant recommendations.
  • Features in Recommender Systems: Information used by models to make predictions. These can be:
    • User Features: Age, gender, location, past behavior (e.g., items clicked, purchased).
    • Item Features: Category, brand, price, description.
    • Context Features: Time of day, device type.
  • High-Cardinality Features: Features that can take a very large number of unique values. Examples include user IDs, item IDs, shop IDs. These are common in recommender systems and pose challenges due to their sparsity.
  • Low-Cardinality Features: Features that have a relatively small, fixed number of unique values. Examples include gender (male/female), day of week, product category.
  • Sparse Features / Sparse-Activated Embedding Layers: When features are high-cardinality, they are often represented as one-hot vectors (a vector with all zeros except for a single 1 at the index corresponding to the feature's value). These vectors are very long and mostly zeros, hence "sparse." To handle them in deep learning, an embedding layer is used. This layer maps each unique ID to a dense, lower-dimensional vector called an embedding. Only the embeddings corresponding to the active (non-zero) features in an input are activated and used, making these layers "sparse-activated."
  • Low-Rank Representations: In the context of embeddings, low-rank representations mean that the high-dimensional embedding vectors effectively reside within a much lower-dimensional subspace. This implies that the embeddings might not be diverse enough to capture all the nuanced relationships between items or users, limiting the model's expressive power.
  • "One-Epoch" Phenomenon: A problem where a deep learning model for CTR prediction achieves its best performance within the first few epochs (often even just one) of training, and further training provides diminishing or even negative returns. This suggests the model quickly exhausts its capacity to learn from the data, possibly due to low-rank representations or other issues limiting effective learning.
  • "Interaction-Collapse": A phenomenon where high-order feature interactions (complex relationships between three or more features) are lost or poorly captured by the model, especially when low-rank representations are present. These complex interactions are crucial for personalized recommendations.
  • Attention Mechanism: A core component in many modern neural networks (especially Transformers). It allows a model to weigh the importance of different parts of the input when processing a specific part. For example, when predicting a user's preference for an item, attention might focus on specific past interactions or item attributes that are most relevant.
    • Self-Attention: A specific type of attention where a sequence attends to itself. Each element in the sequence (e.g., a feature token) computes its relevance to every other element in the same sequence.
    • Query (Q), Key (K), Value (V): In attention, the Query represents the element seeking information, the Key represents elements that can provide information, and the Value represents the information itself. Attention is computed by taking the dot product of $Q$ with $K$ to get attention scores, which are then used to weight the Value vectors.
    • Computational Complexity $O(L^2)$: For self-attention, if there are $L$ tokens in a sequence, each token computes its attention with $L$ other tokens. This results in a quadratic growth of computation with respect to sequence length $L$.
  • Attention Dispersion: When there are too many tokens in the input sequence, the attention mechanism might distribute its attention too broadly across many irrelevant or less important tokens. This can dilute the signal from genuinely important tokens, making the model less effective at identifying crucial interactions.
  • Tokens / Feature Tokens: In STORE, features are transformed into discrete units called tokens. A feature token represents an embedding of a specific feature or a semantic concept derived from features.
  • Orthogonal Transformation: A linear transformation (represented by an orthogonal matrix) that preserves lengths and angles between vectors. In geometric terms, it is a rotation, a reflection, or a combination of both. Orthogonal matrices satisfy $\mathbf{R}^T \mathbf{R} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Using orthogonal transformations can help maintain information integrity and prevent collapse into low-rank representations.
  • Quantization: The process of mapping continuous values to a finite set of discrete values. In machine learning, vector quantization (VQ) is common, where continuous embedding vectors are mapped to a finite set of codewords in a codebook. This can compress representations and introduce discreteness.
  • Scaling Laws: Empirical observations in Large Language Models (LLMs) demonstrating that model performance (e.g., accuracy, loss) tends to improve predictably and consistently as the model size, dataset size, and computational budget increase. This allows for reliable scaling of LLMs to achieve better performance. The paper states that ranking models currently lack similar predictable scaling behavior.
  • Layer Normalization (LN): A technique used in neural networks to normalize the activations of a layer. It helps stabilize training, especially in deep networks, by ensuring that the inputs to subsequent layers have a consistent distribution.

3.2. Previous Works

The paper references several existing CTR prediction models as baselines, which represent different evolutionary stages and architectural paradigms in the field:

  • Factorization Machines (FM) [8]: A classic model for CTR prediction that can capture second-order feature interactions. It generalizes linear regression and matrix factorization by modeling interactions between all pairs of features using shared embedding vectors.
    • Core Idea: For input features $\mathbf{x} = [x_1, \dots, x_n]$, FM predicts: $ \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j $ where $w_0$ is the global bias, $w_i$ is the weight for the $i$-th feature, and $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ is the dot product of the $i$-th and $j$-th feature embeddings, capturing their interaction.
  • Deep Neural Networks (DNN): General multi-layer perceptrons used for CTR prediction by learning complex, non-linear relationships between concatenated feature embeddings.
  • Wide & Deep Learning [2]: Introduced by Google, it combines a "Wide" linear model (for memorization of sparse features and cross-product feature transformations) with a "Deep" DNN model (for generalization and learning dense embeddings).
  • DeepFM [4]: Combines the power of Factorization Machines (for low-order feature interactions) with a Deep Neural Network (for high-order feature interactions) in a single model.
  • DCN (Deep & Cross Network) [11]: Addresses high-order feature interactions through a specialized "cross network" that explicitly applies feature crosses at each layer, alongside a parallel DNN.
  • AutoInt [9]: Uses self-attentive neural networks to automatically learn explicit high-order feature interactions in an adaptive manner, treating each feature field as a query to interact with other fields.
  • GDCN (Gated Deep Cross Network) [10]: An evolution of DCN that aims for deeper and more interpretable cross networks.
  • MaskNet [12]: Introduces feature-wise multiplication to CTR ranking models using an instance-guided mask, aiming to improve the expressiveness of feature interactions.
  • PEPNet (Parameter and Embedding Personalized Network) [1]: A more recent approach that infuses personalized prior information into the model using parameter and embedding personalization.
  • RankMixer [17]: Proposed to scale up ranking models in industrial recommenders, likely employing mixing or MLP-based architectures, possibly similar to MLP-Mixer concepts for vision.
  • OneTrans [15]: Aims for unified feature interaction and sequence modeling with a single Transformer architecture in industrial recommenders. This is particularly relevant as STORE also uses an efficient attention mechanism.

Regarding the core Attention Mechanism: Since STORE employs an Efficient Attention mechanism, understanding the standard Scaled Dot-Product Attention from the Transformer model [Vaswani et al., 2017] is crucial. This is the foundation that STORE optimizes.

  • Formula for Scaled Dot-Product Attention: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where:

    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequences.
    • $d_k$ is the dimension of the key vectors. The division by $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax function into regions with tiny gradients.
    • $QK^T$ computes the similarity (attention scores) between each query and all keys.
    • softmax normalizes these scores, turning them into probabilities.
    • The softmax output is then multiplied by $V$ to get a weighted sum of value vectors, where the weights indicate relevance.
  • Complexity: If Q, K, V are sequences of length $L$, the matrix multiplication $QK^T$ costs $O(L^2 \cdot d_k)$, so the overall complexity is dominated by $O(L^2)$ operations (a minimal sketch follows below).

    Scaling Laws for Neural Language Models [6]: This work demonstrated that large language models exhibit predictable performance improvements with increased model size, dataset size, and computation. The authors of STORE specifically mention that ranking models lack this property, highlighting a key motivation for their work.
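As a concrete reference for the attention formula above, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative and not taken from the paper or any specific library implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V for one sequence of L tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L, L) pairwise similarities -> O(L^2)
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage: 8 feature tokens with 16-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (8, 16)
```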

MoBA: Mixture of Block Attention for Long-Context LLMs [7]: This paper is cited as the source for the Efficient Attention mechanism (MoBA) used in STORE. MoBA is designed to handle long sequences more efficiently than standard self-attention by employing a routing strategy for queries to attend to only a subset of key-value pairs, thus reducing the $O(L^2)$ complexity.

3.3. Technological Evolution

The evolution of ranking models has moved from simpler statistical methods to complex deep learning architectures:

  1. Early Models (e.g., Logistic Regression): Simple, interpretable, but limited in capturing complex interactions.

  2. Factorization-based Models (e.g., FM): Introduced explicit modeling of feature interactions, improving recommendation quality.

  3. Deep Learning Models (e.g., DNN, Wide&Deep, DeepFM): Leveraged DNNs to learn non-linear high-order feature interactions and dense embeddings from sparse features. This significantly boosted accuracy but also introduced challenges like representation bottleneck due to sparse-activated embedding layers.

  4. Attention-based Models (e.g., AutoInt, OneTrans): Adopted attention mechanisms from Transformers to dynamically weigh feature importance and learn complex interactions. While powerful, they inherited the computational bottleneck of $O(L^2)$ complexity with increasing feature tokens and attention dispersion.

  5. Scaling Challenges: Despite these advancements, ranking models have struggled to achieve the predictable scaling laws observed in LLMs. This is attributed to the representation bottleneck (leading to "One-Epoch" and "Interaction-Collapse") and the computational bottleneck (due to attention dispersion and $O(L^2)$ complexity with many feature tokens).

    STORE fits into this timeline as a next-generation framework that explicitly addresses these scaling challenges. It attempts to bridge the gap between powerful deep learning techniques (like attention) and the unique constraints of recommender systems (high-cardinality, sparsity, heterogeneity) by re-thinking feature representation and interaction mechanisms.

3.4. Differentiation Analysis

Compared to the main methods in related work, STORE offers several core differences and innovations:

  • Unified Bottleneck Addressing: Unlike many prior works that might focus on one aspect (e.g., better feature interaction learning in AutoInt or DCN, or sequence modeling in OneTrans), STORE explicitly targets both the representation bottleneck and the computational bottleneck holistically.

  • Fundamental Feature Re-representation:

    • Semantic Tokenization: While models like RankMixer and OneTrans aggregate feature groups, STORE introduces a more fundamental transformation for high-cardinality sparse features by converting them into compact, stable Semantic IDs (SIDs) using Orthogonal, Parallel, Multi-expert Quantization (OPMQ). This is a proactive step to create better, low-rank-free representations at the input stage, directly combating the "One-Epoch" and "Interaction-Collapse" problems that plague models relying solely on sparse-activated embedding layers.
    • Orthogonal Rotation: For low-cardinality static features, STORE doesn't just concatenate or MLP-transform them; it applies orthogonal rotation with diversity regularization to create multiple, diverse feature blocks. This aims to enhance interaction potential in high-dimensional spaces more effectively than simple concatenation or shallow fusion.
  • Efficient Attention Tailored for Ranking: While Transformer-based models like OneTrans use self-attention for unified interaction, they typically face the $O(L^2)$ computational cost and attention dispersion when $L$ (the number of features/tokens) is large. STORE directly integrates an efficient attention mechanism (MoBA) that actively prunes low-contributing tokens. This makes the attention mechanism practical and scalable for the vast feature token sets in industrial recommender systems, a critical distinction for efficiency.

  • Synergistic Components: The power of STORE lies in the synergistic combination of its three components. Semantic Tokenization and Orthogonal Rotation create robust and diverse tokens that are then efficiently processed by the Efficient Attention. This contrasts with approaches that might treat feature representation and interaction learning as separate, less integrated steps.

  • Scaling Law Alignment: STORE explicitly aims to enable ranking models to exhibit scaling law-like behavior, which is a major motivation missing from previous works. By tackling the root causes of non-scaling, it paves the way for more predictable performance gains with increased resources.

    In essence, STORE differentiates itself by proposing a more fundamental re-structuring of feature handling (via tokenization and rotation) and interaction (via efficient attention) specifically tailored to the unique challenges of scaling ranking models in real-world scenarios.

4. Methodology

4.1. Principles

The core idea behind STORE is to dismantle the identified representation bottleneck and computational bottleneck in ranking models by systematically transforming how features are represented and how they interact. The theoretical basis and intuition are as follows:

  1. Decoupling High-Cardinality Feature Representation from Sparsity: Instead of relying on sparse-activated embedding layers for high-cardinality features that lead to low-rank representations and issues like "One-Epoch" and "Interaction-Collapse," STORE proposes Semantic Tokenization. The intuition is that if sparse IDs can be mapped to a compact, stable set of semantic tokens, the model can learn richer, more diverse representations from these tokens, rather than being constrained by the inherent sparsity of the raw IDs. This essentially creates a more robust input signal for the deeper layers.

  2. Enhancing Interaction Diversity for Static Features: For low-cardinality static features, the principle is to leverage their manageable size to explicitly create diverse perspectives. Orthogonal Rotation Transformation aims to project these features into different high-dimensional subspaces. The intuition is that different rotations can highlight different aspects of these features, leading to richer and more effective feature interactions when combined with other tokens, without forcing them into a low-rank space.

  3. Scaling Attention through Sparsity: The principle for addressing the computational bottleneck and attention dispersion is to make attention mechanisms efficient by being selective. Instead of every token attending to every other token (leading to $O(L^2)$ complexity), the intuition is that only a subset of tokens are truly relevant for interaction at any given moment. Efficient Attention with token filtering allows the model to focus computational resources only on high-contributing tokens, thereby reducing complexity while preserving accuracy and preventing attention dispersion.

    By combining these three principles, STORE aims to provide a unified framework where feature heterogeneity and sparsity are fundamentally handled at the representation level, and feature interactions are learned efficiently and effectively, enabling ranking models to scale more predictably.

4.2. Core Methodology In-depth (Layer by Layer)

The STORE framework consists of three main components: Semantic Tokenizer, Orthogonal Rotation Transformation, and Efficient Attention for Unified Feature Interaction. These components work synergistically as illustrated in Figure 1.

Figure 1 (image description): A schematic of the core components of the STORE framework, including the Semantic Tokenizer, the Orthogonal Rotation Transformation, and the Efficient Attention mechanism, which work together to process high-cardinality sparse features efficiently and scalably.

Figure 1: Overview of the proposed STORE.

The STORE framework processes features by categorizing them into high-cardinality sparse features (e.g., item identifiers) and low-cardinality static features (e.g., category ID, age, gender). Each type of feature undergoes distinct processing strategies before being fed into a unified Efficient Attention module.

4.2.1 Semantic Tokenizer

This component addresses the challenge of high-cardinality sparse features. Instead of using raw item IDs (which are high-cardinality and sparse), STORE maps them into a more stable and structured semantic space via Semantic IDs (SIDs).

The process begins with powerful pre-trained item embeddings, such as those obtained from a SASRec model. These continuous embeddings are then quantized into a sequence of SIDs.

The transformation from a pre-trained item embedding $\mathbf{e}_p$ to a sequence of $K$ SIDs is formalized as: $ (SID_{1}, SID_{2}, \dots, SID_{K}) = \mathcal{F}_{\mathrm{item}}(\mathbf{e}_p \in \mathbb{R}^d) \quad (1) $ where:

  • $\mathbf{e}_p \in \mathbb{R}^d$ denotes the pre-trained item embedding, a dense vector of dimension $d$.

  • $\mathcal{F}_{\mathrm{item}}$ is the item semantic tokenization function.

  • $(SID_{1}, SID_{2}, \dots, SID_{K})$ is the resulting sequence of $K$ Semantic IDs for a given item.

  • The paper specifies that $K = H$, where $H$ is likely the number of tokens or heads in the attention mechanism.

    To perform this efficient encoding of high-cardinality IDs into compact and parallel SIDs, STORE proposes an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ). For each item, this network utilizes $K$ distinct "experts" to encode its pre-trained embedding into $K$ latent representations.

The encoding by each expert is given by: $ \mathbf{z}_i = \mathbf{E}_i(\mathbf{e}_p), \quad i \in \{1, \dots, K\} \quad (2) $ where:

  • $\mathbf{z}_i$ is the $i$-th latent representation, a vector produced by the $i$-th expert.

  • $\mathbf{E}_i$ is the $i$-th expert network, which takes the pre-trained item embedding $\mathbf{e}_p$ as input. The paper notes that $\mathbf{z}_i$ is referred to as $\alpha_i$ in some internal discussions.

    Following the generation of latent representations, each latent vector $\mathbf{z}_i$ is assigned to the index of its nearest-neighbor codeword $c_i$, which corresponds to a codeword vector $\mathbf{s}_i$. This is a standard vector quantization step where the latent representations are mapped to discrete SIDs from a predefined codebook.

The entire OPMQ network is trained end-to-end to minimize the reconstruction error between the original pre-trained embedding $\mathbf{e}_p$ and the output of a decoder that aggregates the quantized vectors. This ensures that the generated SIDs retain as much information as possible from the original embedding.

The reconstruction loss is formulated as: $ \mathcal{L}_{recon} = \left\| \mathbf{e}_p - \mathrm{Decoder}\Big[\sum_{i}^{K}\big(\mathbf{z}_i + sg(\mathbf{s}_i - \mathbf{z}_i)\big)\Big] \right\|^2 \quad (4) $ where:

  • $\mathcal{L}_{recon}$ is the reconstruction loss.

  • $\|\cdot\|^2$ denotes the squared Euclidean ($L_2$) norm.

  • $\mathrm{Decoder}[\cdot]$ is a decoder function that aggregates the $K$ quantized vectors to reconstruct the original embedding.

  • $\mathbf{z}_i$ is the latent representation from the $i$-th expert.

  • $\mathbf{s}_i$ is the codeword vector that $\mathbf{z}_i$ is quantized to (i.e., its nearest neighbor in the codebook).

  • $sg(\cdot)$ is the stop-gradient operator. During backpropagation, gradients flow through $\mathbf{z}_i$ but not through $\mathbf{s}_i - \mathbf{z}_i$. This is a common technique in vector quantization that allows gradients from the decoder to update the experts ($\mathbf{E}_i$) and the codebook, while treating the non-differentiable nearest-neighbor quantization step as an identity during backpropagation. The term $\mathbf{z}_i + sg(\mathbf{s}_i - \mathbf{z}_i)$ effectively uses $\mathbf{s}_i$ in the forward pass but routes the gradient to $\mathbf{z}_i$ in the backward pass (a sketch follows at the end of this subsection).

    To ensure that the SIDs capture diverse and non-redundant aspects of the original item, orthogonal regularization is applied to the parameters of the multiple experts within OPMQ. For the $i$-th expert, its parameter vector $\mathbf{V}_i$ is defined as the $L_2$-normalized version of its flattened parameter matrix $\mathbf{W}_i \in \mathbb{R}^{d_1 \times d_2}$.

The orthogonal regularization term is then applied to the set of $K$ parameter vectors: $ \mathcal{L}_{\mathrm{orth}} = \left\| \mathbf{V}\mathbf{V}^{\top} - \mathbf{I}\right\|_F^2 \quad (5) $ where:

  • $\mathcal{L}_{\mathrm{orth}}$ is the orthogonal regularization loss.
  • $\mathbf{V}$ is a matrix formed by stacking the $K$ normalized parameter vectors $\mathbf{V}_i$ (i.e., $\mathbf{V} = [\mathbf{V}_1, \dots, \mathbf{V}_K]$).
  • $\mathbf{V}^{\top}$ is the transpose of $\mathbf{V}$.
  • $\mathbf{I}$ is the identity matrix.
  • $\|\cdot\|_F^2$ denotes the squared Frobenius norm; for a matrix $A$, $\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2}$. Minimizing $\|\mathbf{V}\mathbf{V}^{\top} - \mathbf{I}\|_F^2$ encourages the rows (or columns, depending on how $\mathbf{V}$ is constructed from $\mathbf{V}_i$) of $\mathbf{V}$ to be orthogonal to each other and to have unit length, thus making the experts learn diverse and independent transformations.
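To make the straight-through term in Eq. (4) and the regularizer in Eq. (5) concrete, the sketch below (PyTorch) implements a simplified multi-expert quantizer with a single shared codebook plus the orthogonality penalty on flattened expert weights. The class and parameter names are hypothetical, and this is a simplification under stated assumptions, not the paper's actual OPMQ implementation.

```python
import torch
import torch.nn as nn

class ToyMultiExpertQuantizer(nn.Module):
    """Simplified OPMQ-style sketch: K expert encoders plus one shared codebook."""
    def __init__(self, d_emb, d_code, num_experts, codebook_size):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_emb, d_code) for _ in range(num_experts))
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_code))
        self.decoder = nn.Linear(d_code, d_emb)   # aggregates quantized vectors back toward e_p

    def forward(self, e_p):
        quantized, sids = [], []
        for expert in self.experts:
            z = expert(e_p)                                  # latent z_i, shape (batch, d_code)
            idx = torch.cdist(z, self.codebook).argmin(-1)   # SID_i: index of nearest codeword
            s = self.codebook[idx]                           # codeword vector s_i
            quantized.append(z + (s - z).detach())           # straight-through: s forward, grad to z backward
            sids.append(idx)
        recon = self.decoder(torch.stack(quantized).sum(0))
        loss_recon = ((e_p - recon) ** 2).sum(-1).mean()     # Eq. (4)-style reconstruction loss
        return sids, loss_recon

    def orthogonality_loss(self):
        # Eq. (5): stack L2-normalized, flattened expert weights V_i and penalize ||V V^T - I||_F^2.
        V = torch.stack([e.weight.flatten() / e.weight.flatten().norm() for e in self.experts])
        gram = V @ V.T
        return ((gram - torch.eye(V.shape[0])) ** 2).sum()

# Toy usage: a batch of 4 pre-trained 64-dim item embeddings, K = 3 experts, codebook of 16.
model = ToyMultiExpertQuantizer(d_emb=64, d_code=32, num_experts=3, codebook_size=16)
sids, loss_recon = model(torch.randn(4, 64))
total_loss = loss_recon + model.orthogonality_loss()         # regularizer weight omitted for brevity
```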

4.2.2 Orthogonal Rotation Transformation

This component handles low-cardinality static features, which typically have controllable sizes and are less prone to the extreme sparsity of item IDs. These features are directly used through their original embeddings.

To simplify and improve efficiency, these static features are manually grouped based on their semantic meanings and domain knowledge. Each group contains several features. For each such feature group, a shallow MLP (Multi-Layer Perceptron) is used to perform intra-group feature fusion, allowing for simple interactions within the group.

After fusion, all the semantically fused feature groups are concatenated to form an instance-wise feature block, denoted as $\mathbf{C}$: $ \mathbf{C} = [MLP_1(g_1), \dots, MLP_K(g_K)] \quad (6) $ where:

  • $\mathbf{C}$ is the concatenated instance-wise feature block.

  • $MLP_j$ is a shallow MLP applied to the $j$-th feature group $g_j$.

  • $g_j$ represents the $j$-th semantically grouped features.

  • $K$ is the number of semantic groups, which is also the number of SIDs from the Semantic Tokenizer.

    To facilitate efficient and effective feature interactions in high-dimensional spaces, the orthogonal rotation transformation is applied to this instance-wise feature block $\mathbf{C}$. The goal is to obtain $K$ diverse instance-wise feature blocks by rotating $\mathbf{C}$ with $K$ groups of orthogonal matrices.

For the $i$-th rotation, the transformed block $\mathbf{O}_i$ is obtained by: $ \mathbf{O}_i = \mathbf{C}\mathbf{R}_i \quad (7) $ where:

  • $\mathbf{O}_i$ is the $i$-th rotated instance-wise feature block.

  • $\mathbf{C}$ is the original instance-wise feature block from Equation (6).

  • $\mathbf{R}_i$ is the $i$-th orthogonal matrix. Being orthogonal, it satisfies $\mathbf{R}_i^T \mathbf{R}_i = \mathbf{I}$, so it preserves vector norms and acts as a rotation/reflection.

    To prevent these rotation matrices from becoming too similar or collapsing during training, a diversity regularization term is introduced. This term, in conjunction with the orthogonality constraint ($\mathbf{R}_i^T \mathbf{R}_i = \mathbf{I}$), encourages a diverse set of learned transformations.

The diversity regularization is formulated as an optimization problem: $ \min_{\mathbf{R}_1, \ldots, \mathbf{R}_K} \; -\lambda \sum_{i=1}^{K}\sum_{j=i+1}^{K}\left\|\mathbf{R}_i - \mathbf{R}_j\right\|_F^2 \quad (8) $ $ \mathrm{s.t.} \quad \mathbf{R}_i^{T}\mathbf{R}_i = \mathbf{I}, \quad \forall i \in \{1, \ldots, K\} \quad (9) $ where:

  • $\lambda$ is a hyperparameter controlling the strength of the diversity regularization (set to 0.1 in the paper).

  • $\|\cdot\|_F$ is the Frobenius norm.

  • The term $-\lambda \sum_{i=1}^{K}\sum_{j=i+1}^{K}\|\mathbf{R}_i - \mathbf{R}_j\|_F^2$ encourages the Frobenius distance between distinct rotation matrices $\mathbf{R}_i$ and $\mathbf{R}_j$ to be large, thus promoting diversity; minimizing the negative of this sum is equivalent to maximizing the sum of distances.

  • The constraint $\mathbf{R}_i^{T}\mathbf{R}_i = \mathbf{I}$ ensures that each matrix $\mathbf{R}_i$ remains orthogonal.

    The rotation matrices $\mathbf{R}_i$ and the parameters of the main network are optimized alternately, meaning they are updated in separate steps during training. A sketch of the grouped fusion (Eq. 6), the rotations, and this diversity penalty is given below.
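To make Eqs. (6)-(9) concrete, here is a hedged PyTorch sketch: grouped features are fused and concatenated into C, each rotation is kept orthogonal via PyTorch's orthogonal parametrization (a stand-in for the paper's alternating constrained optimization), and the diversity penalty follows Eq. (8). Class names, group widths, and dimensions are our own assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class GroupedFeatureFusion(nn.Module):
    """Eq. (6): fuse each semantic feature group with a shallow MLP, then concatenate into C."""
    def __init__(self, group_dims, d_out):
        super().__init__()
        self.group_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU()) for d_in in group_dims
        )

    def forward(self, groups):
        # groups: list of tensors, groups[j] has shape (batch, group_dims[j])
        fused = [mlp(g) for mlp, g in zip(self.group_mlps, groups)]
        return torch.cat(fused, dim=-1)                    # C: (batch, K * d_out)

class OrthogonalRotations(nn.Module):
    """Eqs. (7)-(9): K learnable orthogonal matrices R_i applied to the feature block C."""
    def __init__(self, dim, num_rotations):
        super().__init__()
        # The orthogonal parametrization keeps each weight orthogonal (R_i^T R_i = I).
        self.rotations = nn.ModuleList(
            orthogonal(nn.Linear(dim, dim, bias=False)) for _ in range(num_rotations)
        )

    def forward(self, C):
        # nn.Linear computes C @ W^T; W^T is also orthogonal, so this matches O_i = C R_i up to naming.
        return [rot(C) for rot in self.rotations]

    def diversity_penalty(self):
        # Eq. (8): negative pairwise Frobenius distance, so minimizing it pushes the R_i apart.
        Rs = [rot.weight for rot in self.rotations]
        loss = torch.zeros(())
        for i in range(len(Rs)):
            for j in range(i + 1, len(Rs)):
                loss = loss - ((Rs[i] - Rs[j]) ** 2).sum()
        return loss

# Toy usage: three feature groups (e.g., user profile, item attributes, context) of widths 8/12/4.
fusion = GroupedFeatureFusion(group_dims=[8, 12, 4], d_out=16)
C = fusion([torch.randn(2, 8), torch.randn(2, 12), torch.randn(2, 4)])   # (2, 48)
rotate = OrthogonalRotations(dim=48, num_rotations=4)
O_blocks = rotate(C)                                       # list of 4 rotated blocks, each (2, 48)
reg = 0.1 * rotate.diversity_penalty()                     # lambda = 0.1 as reported in the paper
```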

4.2.3 Efficient Attention for Unified Feature Interaction

After distinct processing of high-cardinality sparse features (via Semantic Tokenizer) and low-cardinality static features (via Orthogonal Rotation Transformation), the framework unifies them for interaction using an Efficient Attention mechanism.

In the first layer of the attention module, the embeddings of the SIDs are concatenated with the rotated feature blocks. For each item or instance, this forms an input sequence for attention. The input to the $l$-th attention layer is denoted $\mathbf{X}_{l-1}$. The combination of SIDs and rotated blocks forms $\mathbf{X}_0^l = [\mathbf{s}_i, \mathbf{O}_l]$ (where $\mathbf{s}_i$ likely represents the semantic tokens for item $i$, and $\mathbf{O}_l$ represents one of the $K$ rotated static feature blocks). The full input sequence for the attention module is constructed as $\mathbf{X}_0 = [\mathbf{X}_0^1, \mathbf{X}_0^2, \dots, \mathbf{X}_0^H]$, which effectively provides the Query (Q), Key (K), and Value (V) for the attention mechanism.

The iterative unified efficient attention for feature interaction is formulated as: $ \mathbf{X}_{l} = LN(\mathrm{EfficientAttention}(\mathbf{X}_{l-1}) + \mathbf{X}_{l-1}) \quad (10) $ where:

  • $\mathbf{X}_{l}$ is the output of the $l$-th attention layer.

  • $\mathbf{X}_{l-1}$ is the input to the $l$-th layer.

  • $LN(\cdot)$ denotes Layer Normalization, a technique to stabilize training by normalizing the inputs to each layer.

  • $\mathrm{EfficientAttention}(\cdot)$ is the efficient attention module itself.

  • The $+\,\mathbf{X}_{l-1}$ term is a residual connection, a common practice in deep networks to facilitate gradient flow and prevent degradation.

    Traditional vanilla self-attention has a computational complexity of $O(H^2)$, where $H$ is the number of instance-wise tokens. This quadratic growth makes it prohibitive when $H$ is large, which is often the case when integrating many heterogeneous features.

To overcome this computational bottleneck, STORE incorporates an efficient attention mechanism called MOBA (Mixture of Block Attention) [7]. MOBA reduces complexity by employing a routing strategy where each query attends to only a small subset of key-value pairs, rather than all of them.

The MoBA mechanism is formulated as: $ \mathrm{MoBA}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Softmax}\left(\mathbf{Q}\mathbf{K}[Ind]^T\right)\mathbf{V}[Ind] \quad (11) $ $ Ind_{i} = \big[(i - 1)\times B + 1,\; i\times B\big] \quad (12) $ where:

  • $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are the Query, Key, and Value matrices, respectively, derived from the input token sequence $\mathbf{X}_{l-1}$.

  • $Ind_i \subseteq \{1, \dots, H\}$ is the dynamically selected set of indices of key-value pairs that the $i$-th query will attend to.

  • $\mathbf{K}[Ind]$ and $\mathbf{V}[Ind]$ denote taking only the key and value vectors corresponding to the indices in $Ind$.

  • $B$ is the size of the selective block, meaning each query attends to a block of $B$ key-value pairs.

  • The softmax is applied to the scaled dot product of the query with this selected subset of keys.

    This approach significantly reduces the complexity from quadratic $O(H^2)$ to a more linear, block-wise complexity, making attention feasible for a large number of tokens. The authors state that this efficiency is made possible by their framework's effective mitigation of feature heterogeneity and sparsity at the representation level, ensuring that even with fewer key-value pairs to attend to, the underlying signals are strong and meaningful. A simplified sketch of this block-sparse attention is given below.
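The following simplified sketch conveys the spirit of MoBA and the layer update in Eq. (10) rather than reproducing the actual implementation: queries are routed to their top-scoring blocks of keys and values, attention is computed only within those blocks, and the result is combined with a residual connection and LayerNorm. The block size, top-k value, and all names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(Q, K, V, block_size, top_k):
    """Each query attends only to its top_k highest-scoring key blocks (MoBA-style sketch).

    Q, K, V: (L, d). Keys/values are split into blocks of `block_size` tokens.
    """
    L, d = Q.shape
    n_blocks = L // block_size
    K_blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    V_blocks = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Route queries to blocks via similarity with mean-pooled block keys.
    block_keys = K_blocks.mean(dim=1)                       # (n_blocks, d)
    gate = Q @ block_keys.T                                 # (L, n_blocks)
    chosen = gate.topk(top_k, dim=-1).indices               # (L, top_k)

    out = torch.zeros_like(Q)
    for i in range(L):                                      # loop kept for clarity, not speed
        k_sel = K_blocks[chosen[i]].reshape(-1, d)          # (top_k * block_size, d)
        v_sel = V_blocks[chosen[i]].reshape(-1, d)
        attn = F.softmax(Q[i] @ k_sel.T / d ** 0.5, dim=-1)
        out[i] = attn @ v_sel
    return out

# One attention layer in the style of Eq. (10): residual connection followed by LayerNorm.
X = torch.randn(16, 32)                                     # 16 instance-wise tokens, 32-dim
X_next = F.layer_norm(block_sparse_attention(X, X, X, block_size=4, top_k=2) + X, (32,))
```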

5. Experimental Setup

5.1. Datasets

The experiments were conducted on both a public dataset and a large-scale industrial dataset to validate the STORE framework's effectiveness.

  • Avazu:

    • Source: A widely-used public benchmark for CTR prediction.
    • Characteristics: Consists of 9 million chronologically ordered ad click logs. It includes 23 feature fields and 3437 unique site IDs. This dataset is known for its high-cardinality categorical features.
    • Purpose: To demonstrate the framework's performance on a publicly accessible and well-established benchmark for CTR prediction.
  • Industrial Dataset:

    • Source: An international e-commerce advertising system.

    • Characteristics: Contains 7 billion user interaction records. Features diverse item features and user behavior sequences, representing a real-world, large-scale industrial scenario with immense data volume and complexity.

    • Purpose: To validate STORE's scalability and effectiveness in a production environment with extremely high-cardinality and heterogeneous features, which is the primary target scenario for the proposed solution.

      The choice of these datasets allows for evaluating the model on both a standard academic benchmark and a challenging real-world industrial setting, covering different scales and complexities of recommendation tasks.

5.2. Evaluation Metrics

The effectiveness and efficiency of STORE are evaluated using a combination of prediction accuracy metrics and training efficiency metrics.

5.2.1 Prediction Accuracy Metrics

  1. Area Under the Receiver Operating Characteristic Curve (AUC)

    • Conceptual Definition: AUC measures the overall performance of a binary classifier. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates a random classifier. It is robust to class imbalance.
    • Mathematical Formula: $ AUC = \frac{\sum_{i \in P} \sum_{j \in N} I(score_i > score_j)}{|P| \cdot |N|} $
    • Symbol Explanation:
      • $P$: Set of positive instances (e.g., actual clicks).
      • $N$: Set of negative instances (e.g., non-clicks).
      • $|P|$: Number of positive instances.
      • $|N|$: Number of negative instances.
      • $score_i$: The predicted score (e.g., click probability) for instance $i$.
      • $I(\cdot)$: Indicator function, which returns 1 if the condition is true and 0 otherwise.
  2. Group AUC (GAUC)

    • Conceptual Definition: GAUC is an extension of AUC that calculates AUC for each user (or group) individually and then averages these scores, often weighted by the number of impressions or positive samples per user. This metric is particularly relevant in recommender systems as it reflects individual user experience more accurately, accounting for the fact that a user's CTR predictions are compared only against other items shown to that specific user.
    • Mathematical Formula: $ GAUC = \frac{\sum_{u=1}^{U} w_u \cdot AUC_u}{\sum_{u=1}^{U} w_u} $
    • Symbol Explanation:
      • $U$: Total number of users (or groups).
      • $AUC_u$: The AUC score calculated for user $u$'s recommendations.
      • $w_u$: Weight for user $u$, often the number of positive samples or impressions for user $u$.
  3. LogLoss (Binary Cross-Entropy Loss)

    • Conceptual Definition: LogLoss quantifies the performance of a classification model where the prediction output is a probability value between 0 and 1. It measures the prediction error, penalizing incorrect classifications more heavily when the model is confident in its wrong prediction. Lower LogLoss values indicate better prediction accuracy.
    • Mathematical Formula: $ LogLoss = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] $
    • Symbol Explanation:
      • $N$: Total number of instances.
      • $y_i$: The true label for instance $i$ (0 or 1).
      • $p_i$: The predicted probability that instance $i$ is positive.
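As a small illustration of how these three metrics are commonly computed (scikit-learn for AUC and LogLoss, an impression-weighted per-user loop for GAUC), the sketch below is a generic recipe, not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

def gauc(y_true, y_pred, user_ids):
    """Impression-weighted average of per-user AUC; users with only one class are skipped."""
    total, weight_sum = 0.0, 0.0
    for uid in np.unique(user_ids):
        mask = user_ids == uid
        if len(np.unique(y_true[mask])) < 2:      # AUC is undefined without both classes
            continue
        w = mask.sum()                            # weight by impressions for this user
        total += w * roc_auc_score(y_true[mask], y_pred[mask])
        weight_sum += w
    return total / weight_sum

y_true = np.array([1, 0, 1, 0, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6])
users = np.array([1, 1, 1, 2, 2, 2])
print(roc_auc_score(y_true, y_pred), gauc(y_true, y_pred, users), log_loss(y_true, y_pred))
```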

5.2.2 Training Efficiency Metric

  1. Training TFlops/Batch (Batch size = 1024)
    • Conceptual Definition: TFlops/Batch measures the number of Tera Floating Point Operations per batch during training. This metric directly quantifies the computational cost of processing a single batch of data. A lower value indicates higher training efficiency for a given batch size.
    • Symbol Explanation:
      • TFlops: Tera floating-point operations ($10^{12}$ floating-point operations).
      • Batch: Refers to a single batch of data processed during training.
      • Batch size = 1024: The number of samples in one training batch.

5.3. Baselines

To demonstrate the effectiveness of STORE, its performance is compared against several state-of-the-art CTR prediction models, representing a comprehensive set of baselines:

  • FM [8]: Factorization Machines capture second-order feature interactions. It's a foundational baseline for feature interaction learning.

  • DNN: A standard Deep Neural Network without explicit feature interaction components, serving as a general deep learning baseline.

  • Wide&Deep [2]: Combines a linear model with a DNN for memorization and generalization, a widely adopted industry model.

  • DeepFM [4]: Integrates FM with DNN to capture both low-order and high-order feature interactions.

  • DCN [11]: Deep & Cross Network explicitly learns high-order feature crosses through its cross network component.

  • AutoInt [9]: Uses self-attentive neural networks to automatically learn feature interactions. This is a relevant baseline as STORE also employs attention.

  • GDCN [10]: Gated Deep Cross Network, an advancement over DCN with potential for deeper cross networks.

  • MaskNet [12]: Incorporates feature-wise multiplication using an instance-guided mask to enhance CTR ranking models.

  • PEPNet [1]: Parameter and Embedding Personalized Network, a more recent model focusing on personalized prior information.

  • RankMixer [17]: A model designed for scaling up ranking models in industrial settings, potentially similar to STORE in its goal.

  • OneTrans [15]: A Transformer-based model aiming for unified feature interaction and sequence modeling, representing attention-centric baselines.

    These baselines cover a spectrum from traditional factorization models to deep learning models with various feature interaction mechanisms, including attention-based approaches. This diverse set allows for a robust evaluation of STORE's improvements over existing methods.

5.4. Implementation Details

  • Pre-trained Embeddings: The paper utilizes pre-trained item embeddings obtained from a SASRec model. SASRec (Self-Attentive Sequential Recommendation) is a Transformer-based sequential recommender that learns item representations from user interaction sequences. This provides high-quality initial item embeddings for the Semantic Tokenizer.

  • Semantic Tokenizer (OPMQ) Configuration:

    • Number of SIDs ($K$): Set to 3 for the Avazu public dataset and 32 for the industrial dataset. This suggests that a larger number of semantic tokens is used for more complex, larger-scale industrial data.
    • Codebook Size: Set to 16 for the Avazu public dataset and 300 for the industrial dataset. The codebook size determines the number of discrete codeword vectors available for quantization, indicating a richer semantic space for the industrial data.
  • Orthogonal Rotation Transformation: The hyperparameter $\lambda$ for diversity regularization (Equation 8) is set to 0.1.

  • Efficient Attention (MoBA) Configuration: The sparsity of attention in online deployment is set to 1/2, meaning approximately half of the key-value pairs are filtered out.

    These details highlight the practical configurations used for STORE in both research and deployment settings.
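For reference, the reported settings can be collected into a small configuration object; the field names below are our own, and only the values come from the details above.

```python
from dataclasses import dataclass

@dataclass
class StoreConfig:
    num_sids: int                    # K, number of Semantic IDs per item
    codebook_size: int               # number of codewords in the OPMQ codebook
    diversity_lambda: float = 0.1    # weight of the rotation-diversity regularizer (Eq. 8)
    attention_sparsity: float = 0.5  # attention sparsity level (1/2 in online deployment)

avazu_cfg = StoreConfig(num_sids=3, codebook_size=16)
industrial_cfg = StoreConfig(num_sids=32, codebook_size=300)
```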

6. Results & Analysis

6.1. Core Results Analysis

6.1.1 Overall Performance (RQ1)

The overall performance of STORE compared to various baselines on both the Avazu (public) and Industrial datasets is presented in Table 1. The metrics include AUC, GAUC, and LogLoss, with Improv. indicating the relative improvement of STORE over the best baseline.

The following are the results from Table 1 of the original paper:

| Model | Avazu AUC | Avazu GAUC | Avazu Logloss | Industrial AUC | Industrial GAUC | Industrial Logloss |
| --- | --- | --- | --- | --- | --- | --- |
| FM | 0.7291 | 0.7248 | 0.4052 | 0.6711 | 0.6011 | 0.1144 |
| DNN | 0.7231 | 0.7211 | 0.4052 | 0.6721 | 0.6005 | 0.1148 |
| Wide&Deep | 0.7356 | 0.7329 | 0.3988 | 0.6720 | 0.6018 | 0.1144 |
| DeepFM | 0.7404 | 0.7375 | 0.3965 | 0.6707 | 0.5907 | 0.1152 |
| DCN | 0.7344 | 0.7310 | 0.4042 | 0.6734 | 0.6029 | 0.1141 |
| AutoInt | 0.7439 | 0.7408 | 0.3948 | 0.6728 | 0.6021 | 0.1142 |
| GDCN | 0.7370 | 0.7344 | 0.3989 | 0.6726 | 0.6022 | 0.1142 |
| MaskNet | 0.7426 | 0.7383 | 0.3942 | 0.6753 | 0.6054 | 0.1140 |
| PEPNet | 0.7411 | 0.7380 | 0.5961 | 0.6741 | 0.6039 | 0.1148 |
| RankMixer | 0.7450 | 0.7412 | 0.3951 | 0.6774 | 0.6053 | 0.1140 |
| OneTrans | 0.7461 | 0.7432 | 0.3943 | 0.6771 | 0.6058 | 0.1141 |
| STORE | 0.7479 | 0.7451 | 0.3912 | 0.6804 | 0.6064 | 0.1139 |
| STORE-4 Epoch | 0.7488 | 0.7463 | 0.3900 | 0.6855 | 0.6086 | 0.1134 |
| Improv. | +0.362% | +0.417% | +0.913% | +1.195% | +0.462% | +0.526% |

Analysis:

  • Superiority of STORE:

    • On both Avazu and Industrial datasets, STORE consistently achieves the highest AUC and GAUC scores, and the lowest LogLoss, indicating superior prediction accuracy.
    • For example, on the Industrial dataset, STORE achieves an AUC of 0.6804 and GAUC of 0.6064, outperforming the best baselines (RankMixer AUC 0.6774, OneTrans GAUC 0.6058).
    • The STORE-4 Epoch variant further improves accuracy, particularly on the Industrial dataset, with an AUC of 0.6855 and GAUC of 0.6086. This variant suggests that STORE can benefit from more training epochs, directly addressing the "One-Epoch" problem.
  • Relative Improvements:

    • The Improvement row highlights the substantial gains. For the Industrial dataset, STORE-4 Epoch improves AUC by +1.195% (relative to RankMixer's 0.6774) and LogLoss by +0.526% (relative to MaskNet's/RankMixer's 0.1140).
    • On Avazu, STORE-4 Epoch also shows notable improvements, with +0.362% AUC and +0.913% LogLoss reduction over OneTrans. These are significant gains in CTR prediction, where even small percentage increases can translate to large business impact.
  • Comparison with Attention-based Baselines: Models like AutoInt and OneTrans, which utilize attention mechanisms, generally perform better than older models like FM or DNN. However, STORE surpasses them. This suggests that STORE's approach to feature representation (Semantic Tokenization, Orthogonal Rotation) and its efficient attention mechanism overcome the limitations (like attention dispersion or $O(L^2)$ complexity) faced by these baselines.

  • Addressing Bottlenecks: The strong performance of STORE (especially STORE-4 Epoch) in accuracy, coupled with the ability to benefit from more training epochs, strongly validates its claim of mitigating the representation bottleneck (reducing "One-Epoch" and "Interaction-Collapse"). The overall efficiency gains (discussed later) further support the alleviation of the computational bottleneck.

  • Beyond Aggregation: The paper explicitly states that while models like RankMixer and OneTrans project or aggregate feature groups to mitigate feature heterogeneity, STORE's fundamental approach of SIDs and orthogonal rotation provides a deeper, more effective solution, leading to substantial accuracy improvements. This differentiation is clearly supported by the results.

6.1.2 Online A/B Test Results

The paper also conducted a 15-day online A/B test on a large-scale e-commerce platform.

  • STORE achieved a relative CTR increase of 2.71% compared to the production baseline.

  • In deployment, the OPMQ (Semantic Tokenizer) was configured with $K = 32$ SIDs and a codebook size of 300.

  • The sparsity of attention was set to 1/2, which means roughly half of the tokens were filtered, leading to increased inference efficiency and response speed while maintaining performance.

    This online result is crucial as it demonstrates STORE's effectiveness and efficiency in a real-world, high-stakes production environment, translating offline gains into tangible business impact.

6.2. Ablation Studies / Parameter Analysis

6.2.1 Ablation Study (RQ2)

Table 2 presents an ablation study evaluating the impact of individual components and design choices within STORE on the Industrial dataset.

The following are the results from Table 2 of the original paper:

| Variants | AUC | GAUC | Logloss | TFlops/Batch |
| --- | --- | --- | --- | --- |
| STORE-4 Epoch | 0.6855 | 0.6086 | 0.1134 | 1.764 |
| STORE | 0.6804 | 0.6064 | 0.1139 | 1.763 |
| w OPQ | 0.6787 | 0.6045 | 0.1140 | 1.763 |
| w RQ-VAE | 0.6768 | 0.6047 | 0.1141 | 1.762 |
| w/o Orthogonal Rotation | 0.6780 | 0.6050 | 0.1140 | 1.760 |
| w Vanilla-Attention | 0.6812 | 0.6068 | 0.1137 | 3.240 |

Analysis:

  • STORE vs. STORE-4 Epoch: STORE-4 Epoch (0.6855 AUC) significantly outperforms the standard STORE (0.6804 AUC). This confirms that STORE effectively mitigates the "One-Epoch" phenomenon and can benefit from extended training, leading to better model capacity utilization.

  • Semantic Tokenizer Impact (OPMQ vs. Alternatives):

    • STORE uses OPMQ (Orthogonal, Parallel, Multi-expert Quantization).
    • w OPQ (likely a variant or simplified OPQ, without OPMQ's orthogonal multi-expert design) results in an AUC of 0.6787, which is lower than STORE's 0.6804.
    • w RQ-VAE (using Residual Quantization VAE [13] as the tokenizer) yields an even lower AUC of 0.6768.
    • Conclusion: This demonstrates that the specific design of OPMQ with its orthogonal and multi-expert approach is crucial for STORE's performance, outperforming other quantization methods in capturing semantic tokens.
  • Orthogonal Rotation Transformation Impact:

    • w/o Orthogonal Rotation (meaning without this component) leads to an AUC of 0.6780, notably lower than STORE's 0.6804.
    • Conclusion: This confirms the effectiveness of the Orthogonal Rotation Transformation in facilitating more efficient and effective feature interactions for low-cardinality static features, contributing positively to overall accuracy.
  • Efficient Attention Impact:

    • Comparing STORE (AUC 0.6804, TFlops/Batch 1.763) with w Vanilla-Attention (AUC 0.6812, TFlops/Batch 3.240), we observe that Efficient Attention achieves comparable prediction accuracy while drastically improving training efficiency.

    • While w Vanilla-Attention shows a slightly higher AUC (0.6812 vs 0.6804), the TFlops/Batch nearly doubles (3.240 vs 1.763).

    • Conclusion: This highlights the trade-off inherent in efficient mechanisms: Efficient Attention preserves model accuracy (or incurs only a negligible drop) while providing significant computational gains (almost 2× the throughput, since 3.240/1.763 ≈ 1.84). This directly addresses the computational bottleneck without sacrificing predictive power.

      The ablation study clearly validates that each of STORE's proposed components (Semantic Tokenizer with OPMQ, Orthogonal Rotation Transformation, and Efficient Attention) contributes meaningfully to its overall superior performance and efficiency.
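As a quick sanity check of the efficiency trade-off discussed above, the snippet below recomputes the compute ratio and the AUC gap directly from the Table 2 values; no new data is introduced.

```python
# Values copied from Table 2 above (Industrial dataset).
store_auc, store_tflops = 0.6804, 1.763        # STORE with Efficient Attention
vanilla_auc, vanilla_tflops = 0.6812, 3.240    # w Vanilla-Attention

speedup = vanilla_tflops / store_tflops        # compute saved per batch
auc_gap = vanilla_auc - store_auc              # accuracy given up for that saving

print(f"compute ratio: {speedup:.2f}x")        # ~1.84x, matching the reported throughput gain
print(f"AUC gap:       {auc_gap:.4f}")         # ~0.0008 absolute AUC
```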

6.2.2 Scaling Laws Study with Different Hyperparameters (RQ3)

Figure 2 illustrates how STORE's performance and efficiency scale with different hyperparameters, addressing RQ3.

Figure 2: Scaling Laws Study of (a) Epoch Number, (b) SID Number, (c) Layer Number, (d) Sparsity. The four subplots show how AUC changes with each of these settings, including the relationship between AUC, computational cost (TFLOPs/Batch), and sparsity.

Analysis of Figure 2:

  • a) Epoch Number:

    • The graph shows that STORE (represented by the green line) continues to improve in AUC as the Epoch Number increases, especially beyond the initial epochs.
    • In contrast, models using raw ItemIDs (blue line) show limited gains or even degraded performance after a few epochs (the One-Epoch phenomenon, i.e., overfitting beyond the first epoch).
    • Conclusion: This plot strongly supports the claim that Semantic Tokenization effectively combats the One-Epoch phenomenon, allowing STORE to benefit from longer training and achieve higher AUC with more epochs, thus enabling better model scaling over training time.
  • b) SID Number:

    • The AUC generally increases as the SID Number (K, the number of Semantic IDs) increases.
    • Conclusion: This suggests that a richer set of semantic tokens allows the model to capture more nuanced information from high-cardinality features, leading to better representations and improved prediction accuracy. There may be a point of diminishing returns, but within the tested range, more SIDs consistently yield higher AUC.
  • c) Layer Number:

    • The AUC generally improves as the Layer Number (depth of the model) increases.
    • Conclusion: This indicates that STORE effectively handles deeper architectures without suffering from representation collapse or diminishing returns often seen in traditional ranking models. This demonstrates that STORE mitigates the representation bottleneck sufficiently to benefit from increased model capacity, enabling scaling in terms of model depth.
  • d) Sparsity:

    • This plot shows the relationship between attention sparsity, training efficiency (TFLOPs/Batch), and model accuracy (AUC).

    • As Sparsity increases (meaning more tokens are filtered, reducing computation), TFLOPs/Batch decreases significantly, indicating improved training efficiency.

    • Crucially, the AUC remains relatively stable even at high sparsity (e.g., 1/2 or 1/4), with only a minimal impact on performance.

    • Conclusion: This plot directly demonstrates STORE's ability to reduce computational cost with minimal impact on performance. It validates the effectiveness of the Efficient Attention mechanism in addressing the computational bottleneck by filtering low-contributing tokens without significantly sacrificing accuracy. The online deployment used 1/2 sparsity, which aligns with this finding (a minimal token-filtering sketch follows below).

      Overall, the scaling laws study provides compelling evidence that STORE is designed to be scalable along multiple dimensions (training epochs, semantic token complexity, model depth, and computational efficiency), fulfilling its core mission of scaling up ranking models.
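To make the token-filtering idea behind these results concrete, here is a minimal PyTorch sketch. It is not the paper's MoBA-based mechanism: it scores each key token with a cheap proxy (its dot product with the mean query), keeps a `keep_ratio` fraction of tokens, and runs standard attention over the survivors. All function and variable names are assumptions for illustration only.

```python
import torch

def filtered_attention(q, k, v, keep_ratio=0.5):
    """Sketch of attention over a filtered token set (not the paper's exact method):
    keep the top `keep_ratio` fraction of key/value tokens and attend only to them.
    Shapes: q is (B, Lq, D); k and v are (B, Lk, D)."""
    _, Lk, D = k.shape
    n_keep = max(1, int(Lk * keep_ratio))

    # Cheap proxy for each key token's contribution: dot product with the mean query
    # (a production system could use block-level scores instead).
    q_mean = q.mean(dim=1)                                # (B, D)
    token_scores = torch.einsum("bkd,bd->bk", k, q_mean)  # (B, Lk)
    keep_idx = token_scores.topk(n_keep, dim=-1).indices  # (B, n_keep)

    # Gather the surviving keys/values and run scaled dot-product attention over them.
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    k_kept, v_kept = k.gather(1, idx), v.gather(1, idx)
    attn = torch.softmax(torch.einsum("bqd,bkd->bqk", q, k_kept) / D ** 0.5, dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v_kept)

# Example: with keep_ratio=0.5, attention runs over 64 of the 128 key/value tokens.
q, k, v = torch.randn(2, 16, 64), torch.randn(2, 128, 64), torch.randn(2, 128, 64)
print(filtered_attention(q, k, v).shape)  # torch.Size([2, 16, 64])
```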

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces STORE, a novel and unified token-based framework designed to enhance the scalability and efficiency of ranking models in modern personalized recommendation systems. STORE effectively addresses two critical bottlenecks: the representation bottleneck (caused by high-cardinality, sparse features leading to low-rank embeddings, "One-Epoch," and "Interaction-Collapse") and the computational bottleneck (stemming from an explosion of feature tokens making traditional attention mechanisms prohibitively expensive and prone to attention dispersion).

The framework's success is attributed to three core innovations:

  1. Semantic Tokenization: Decomposes high-cardinality sparse features into a compact set of stable semantic tokens using an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ). This fundamentally tackles feature heterogeneity and sparsity.

  2. Orthogonal Rotation Transformation: Rotates the subspace of low-cardinality static features with diversity regularization to facilitate more efficient and effective feature interactions (a minimal sketch of such a rotation appears below).

  3. Efficient Attention: Incorporates a sparse attention mechanism (specifically MOBA) that filters low-contributing tokens, significantly improving computational efficiency and alleviating attention dispersion while preserving accuracy.

    Extensive offline experiments on Avazu and a large-scale Industrial dataset, along with online A/B tests, confirm STORE's superiority. It consistently achieved higher prediction accuracy (e.g., +2.71% online CTR, +1.195% AUC offline) and significantly boosted training efficiency (1.84× throughput). The ablation studies and scaling law analyses further validated the individual contributions of each component and STORE's ability to scale with more epochs, SIDs, and layers, while maintaining efficiency through attention sparsity.
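As a rough illustration of the second innovation, the sketch below applies a learnable orthogonal rotation to a block of static-feature embeddings, parameterizing the orthogonal matrix via QR decomposition. This is one common way to maintain orthogonality and is not claimed to match the paper's exact parameterization or its diversity regularization; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class OrthogonalRotation(nn.Module):
    """Sketch (not the authors' implementation): rotate the subspace spanned by
    low-cardinality static-feature embeddings with a learnable orthogonal matrix,
    obtained here by QR-decomposing an unconstrained parameter matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # QR yields an orthogonal Q for a full-rank square matrix, so the transform
        # mixes feature directions while preserving norms.
        q, _ = torch.linalg.qr(self.weight)
        return x @ q

# Example: rotate a batch of 8 static-feature embeddings of dimension 64.
static_tokens = torch.randn(4, 8, 64)  # (batch, num_static_features, dim)
print(OrthogonalRotation(64)(static_tokens).shape)  # torch.Size([4, 8, 64])
```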

7.2. Limitations & Future Work

The paper presents STORE as resolving both representation and computational bottlenecks and as a practical, effective path toward building more powerful large-scale ranking models. However, it does not explicitly discuss STORE's limitations or propose concrete future research directions.

Nevertheless, some implicit areas for potential future work or considerations could be inferred:

  • Optimal Hyperparameter Search: While the paper explores SID Number, Layer Number, and Sparsity, finding the optimal configuration for diverse datasets and tasks might require advanced autoML or neural architecture search techniques.
  • Generalizability of OPMQ: The OPMQ relies on pre-trained embeddings (e.g., from SASRec). Future work could explore the impact of different pre-training methods or end-to-end learning of the initial embeddings within STORE.
  • Theoretical Guarantees: While empirical results are strong, more theoretical analysis regarding the guarantees of orthogonal rotation for diversity and the bounds of efficient attention's performance under various sparsity levels could be beneficial.
  • Dynamic Sparsity: The attention sparsity is set to a fixed 1/2 in deployment. More dynamic or adaptive sparsity mechanisms that adjust based on specific query-item pairs or context might further enhance efficiency or accuracy.
  • Beyond CTR Prediction: While STORE is evaluated for CTR prediction, its token-based architecture could potentially be extended to other ranking objectives (e.g., conversion rate, long-term user satisfaction) or multimodal recommendation scenarios.

7.3. Personal Insights & Critique

STORE presents a compelling and well-engineered solution to long-standing problems in ranking models. The explicit framing around "bottlenecks" and the aspiration to achieve Scaling Laws akin to LLMs is a particularly insightful starting point for research in this domain.

Key Strengths:

  • Unified and Holistic Approach: The strength of STORE lies in its holistic treatment of both representation and computational bottlenecks. Many models address one but not necessarily the other in a truly integrated fashion. The synergistic interaction of Semantic Tokenization, Orthogonal Rotation, and Efficient Attention is a notable architectural contribution.
  • Practical Relevance: The results from the Industrial dataset and the online A/B test underscore the practical applicability and significant business value of STORE. A 2.71% relative CTR increase in production is a substantial gain.
  • Addressing Fundamental Problems: The explicit focus on "One-Epoch" and "Interaction-Collapse" is commendable, as these are critical empirical observations that limit the true scalability of deep ranking models. STORE provides a principled way to overcome these.
  • Leveraging Existing Innovations: The use of pre-trained SASRec embeddings and MoBA for efficient attention demonstrates a smart approach to building upon state-of-the-art components, rather than reinventing every wheel.

Potential Issues/Areas for Improvement:

  • Complexity of Implementation: The framework, while effective, appears to be relatively complex, involving specialized quantization networks, orthogonal transformations with regularization, and a specific efficient attention mechanism. This might pose challenges for adoption in environments without significant engineering resources.
  • Dependency on Pre-trained Embeddings: The Semantic Tokenizer relies on the quality of pre-trained item embeddings. If these embeddings are suboptimal or from a different domain, the performance of STORE could be affected. The process of pre-training SASRec itself can be computationally intensive.
  • Interpretability: While not explicitly discussed, the introduction of multiple semantic tokens and rotated feature blocks, processed through efficient attention, might reduce the overall interpretability of the model's decisions compared to simpler factorization models or even some cross-network models. This is a common trade-off in complex deep learning models but is particularly relevant in recommendation where explainability can be desired.
  • Hyperparameter Sensitivity: Given the multiple components and regularization terms (λ), STORE might have a relatively large number of hyperparameters that need careful tuning. The paper provides specific values for K, the codebook size, λ, and sparsity, but these might vary significantly across different datasets or domains.

Transferability and Applicability: The methods introduced in STORE are highly transferable to other domains facing similar challenges with high-cardinality, sparse, and heterogeneous features, particularly in large-scale machine learning systems beyond recommender systems. For instance:

  • Ad Targeting: Similar to CTR prediction, ad targeting systems deal with vast numbers of user, ad, and context features.

  • Search Ranking: Ranking search results also involves numerous diverse features where scalability and efficiency are paramount.

  • Fraud Detection: Detecting fraud often involves sparse categorical features (e.g., IP addresses, account IDs) and requires efficient processing of many signals.

  • Any large-scale tabular data problem: The semantic tokenization and orthogonal rotation components could be adapted to create more robust representations for tabular data with high-cardinality categorical features before feeding them into other deep learning models.

    In conclusion, STORE is an impactful work that provides a robust and scalable solution for ranking models in industry. It highlights the importance of addressing foundational representation and computational issues, rather than just incrementally improving feature interaction techniques, paving the way for truly scalable deep learning in recommendation.
