STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models
TL;DR Summary
The paper introduces STORE, a scalable ranking framework addressing representation and computational bottlenecks in personalized recommendation systems through semantic tokenization, efficient attention, and orthogonal rotation.
Abstract
Ranking models have become an important part of modern personalized recommendation systems. However, significant challenges persist in handling high-cardinality, heterogeneous, and sparse feature spaces, particularly regarding model scalability and efficiency. We identify two key bottlenecks: (i) Representation Bottleneck: Driven by the high cardinality and dynamic nature of features, model capacity is forced into sparse-activated embedding layers, leading to low-rank representations. This, in turn, triggers phenomena like "One-Epoch" and "Interaction-Collapse," ultimately hindering model scalability. (ii) Computational Bottleneck: Integrating all heterogeneous features into a unified model triggers an explosion in the number of feature tokens, rendering traditional attention mechanisms computationally demanding and susceptible to attention dispersion. To dismantle these barriers, we introduce STORE, a unified and scalable token-based ranking framework built upon three core innovations: (1) Semantic Tokenization fundamentally tackles feature heterogeneity and sparsity by decomposing high-cardinality sparse features into a compact set of stable semantic tokens; (2) Orthogonal Rotation Transformation rotates the subspace spanned by low-cardinality static features, which facilitates more efficient and effective feature interactions; and (3) Efficient Attention filters low-contributing tokens to improve computational efficiency while preserving model accuracy. Across extensive offline experiments and online A/B tests, our framework consistently improves prediction accuracy (online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84× throughput).
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models." This title suggests a novel framework designed to improve the scalability and efficiency of ranking models, particularly in recommendation systems, by addressing issues related to feature representation and computational costs through semantic tokenization, orthogonal transformations, and efficient attention mechanisms.
1.2. Authors
The authors are:
- Yi Xu (Alibaba Group, Beijing, China)
- Chaofan Fan (Alibaba Group, Beijing, China)
- Jinxin Hu (Alibaba Group, Beijing, China)
- Yu Zhang (Alibaba Group, Beijing, China)
- Xiaoyi Zeng (Alibaba Group, Beijing, China)
- Jing Zhang (Wuhan University, Wuhan, China)
The majority of the authors are affiliated with Alibaba Group, indicating a strong industry presence and focus on practical applications in large-scale recommender systems. Jing Zhang from Wuhan University suggests an academic collaboration component.
1.3. Journal/Conference
The paper was published on 2025-11-24 (UTC). The ACM Reference Format section still contains the template placeholder "Proceedings of Make sure to enter the correct conference title from your rights confirmation email" (conference acronym "XX"). This implies it is formatted as a conference paper, though the specific venue had not been filled in at the time of the preprint. ACM conferences are generally highly reputable venues in computer science, especially for topics like information systems and recommender systems.
1.4. Publication Year
The ACM reference format in the paper lists 2018, but the provided metadata states a publication date of 2025-11-24 (UTC), and the arXiv listing also points to 2025. Given the content and references (e.g., MoBA 2025, RankMixer 2025), the publication year is almost certainly 2025, and the 2018 in the ACM reference format is a leftover template value. For this analysis, we assume the intended publication year is 2025.
1.5. Abstract
The paper addresses significant challenges in modern personalized recommendation systems, specifically in ranking models that handle high-cardinality, heterogeneous, and sparse feature spaces. The authors identify two primary bottlenecks:
- Representation Bottleneck: This arises from high-cardinality and dynamic features forcing model capacity into sparse-activated embedding layers, leading to low-rank representations. This triggers issues like "One-Epoch" and "Interaction-Collapse," which limit model scalability.
- Computational Bottleneck: The integration of numerous heterogeneous features leads to an explosion of feature tokens, making traditional attention mechanisms computationally expensive (due to $O(N^2)$ complexity) and prone to attention dispersion.

To overcome these, the paper introduces STORE, a unified and scalable token-based ranking framework with three core innovations:

- Semantic Tokenization: Decomposes high-cardinality sparse features into a compact, stable set of semantic tokens (SIDs), addressing feature heterogeneity and sparsity.
- Orthogonal Rotation Transformation: Rotates the subspace of low-cardinality static features to facilitate more efficient and effective feature interactions.
- Efficient Attention: Filters low-contributing tokens to enhance computational efficiency and prevent attention dispersion while maintaining accuracy.

The STORE framework is validated through extensive offline experiments and online A/B tests, demonstrating consistent improvements in prediction accuracy (e.g., online CTR by 2.71%, offline AUC by 1.195%) and training efficiency (1.84× throughput).
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.18805
- PDF Link: https://arxiv.org/pdf/2511.18805.pdf
- Publication Status: The paper is available as a preprint on arXiv, dated 2025-11-24.
2. Executive Summary
2.1. Background & Motivation
Modern personalized recommendation systems heavily rely on ranking models to model complex user behavior by processing a vast and heterogeneous collection of features. However, current ranking models face significant challenges, preventing them from scaling effectively, unlike Large Language Models (LLMs) which benefit from Scaling Laws. The paper identifies two fundamental bottlenecks:
- Representation Bottleneck:
  - Problem: High-cardinality features (features with a very large number of unique values, like item IDs or user IDs) are typically handled by sparse-activated embedding layers. When these embeddings are fed into deep neural networks, they often result in low-rank representations, meaning the embedding vectors occupy a limited subspace and fail to capture the full complexity of the data.
  - Consequences: This issue leads to phenomena such as "One-Epoch" (where model performance peaks quickly and then plateaus or degrades, suggesting limited learning capacity beyond initial exposure) and "Interaction-Collapse" (where high-order feature interactions, which are crucial for fine-grained recommendations, are lost).
  - Impact: These problems severely hinder the scalability of ranking models, meaning that simply increasing model depth or training epochs yields diminishing returns, thus undermining capacity utilization and predictable scaling.
- Computational Bottleneck:
  - Problem: Integrating a vast number of heterogeneous features into a unified model naturally leads to an explosion in the number of feature tokens, where each feature (or group of features) can be considered a token.
  - Consequences: Traditional attention mechanisms (like self-attention in Transformers) have a computational complexity of $O(N^2)$, where $N$ is the sequence length (number of tokens). With an explosion of feature tokens, this quadratic complexity becomes computationally prohibitive. Furthermore, a large number of tokens can lead to attention dispersion, where the attention mechanism struggles to focus on truly vital signals amidst a "sea of irrelevant tokens," diluting the impact of important interactions.
  - Impact: This makes it difficult to scale ranking models to incorporate richer feature sets and more complex interactions efficiently.

The paper's entry point is to directly address these two fundamental bottlenecks by proposing a unified, token-based framework that re-thinks how features are represented and interact.
2.2. Main Contributions / Findings
The paper's primary contributions are the introduction of STORE (Semantic Tokenization, Orthogonal Rotation, and Efficient Attention), a unified and scalable token-based ranking framework built upon three synergistic components designed to dismantle the identified bottlenecks:
- Semantic Tokenization:
  - Contribution: Proposes a novel method to decompose high-cardinality sparse features (like item IDs) into a compact set of stable semantic tokens (SIDs). This is achieved using an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ).
  - Problem Solved: Fundamentally mitigates feature heterogeneity and sparsity, which are root causes of the representation bottleneck and low-rank embeddings. By mapping sparse IDs to dense, structured SIDs, it allows for more efficient and effective model scaling.
- Orthogonal Rotation Transformation:
  - Contribution: Introduces a mechanism to rotate the subspace spanned by low-cardinality static features (features with fewer unique values, like age, gender, category ID). This transformation generates diverse instance-wise feature blocks.
  - Problem Solved: Facilitates more efficient and effective feature interactions in high-dimensional spaces for these static features, complementing the handling of high-cardinality features and contributing to resolving the representation bottleneck. Diversity regularization ensures distinct rotations.
- Efficient Attention for Unified Feature Interaction:
  - Contribution: Integrates an efficient attention mechanism (specifically MoBA) that adaptively prunes low-contributing tokens based on the target item and context. This allows for unified feature interaction across all processed tokens.
  - Problem Solved: Drastically reduces the computational complexity from quadratic to a more manageable level, directly addressing the computational bottleneck and alleviating attention dispersion by focusing on relevant tokens.

Key Conclusions / Findings:

- STORE consistently improves prediction accuracy: The framework achieved an online CTR increase of 2.71% and an offline AUC improvement of 1.195% in online A/B tests and offline experiments, respectively.
- STORE significantly boosts training efficiency: It demonstrated a 1.84× higher training throughput, making it more scalable for large industrial applications.
- Semantic Tokenization effectively combats the "One-Epoch" and "Interaction-Collapse" problems, allowing models to benefit from more training epochs and deeper layers.
- Orthogonal Rotation Transformation creates diverse feature representations, enhancing interaction quality.
- Efficient Attention maintains accuracy while substantially reducing computational cost, enabling a unified interaction framework for a large number of tokens.
- The synergistic combination of these three components provides a unified and scalable solution for token-based ranking models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand STORE, a beginner should be familiar with the following concepts in recommender systems and deep learning:
- Recommender Systems: Systems that predict user preferences for items (e.g., movies, products, articles) and suggest relevant ones. Ranking models are a core component, ordering items based on predicted relevance or click-through probability.
- Click-Through Rate (CTR) Prediction: A common task in recommender systems where the goal is to predict the probability that a user will click on a given item. Higher CTR often indicates more relevant recommendations.
- Features in Recommender Systems: Information used by models to make predictions. These can be:
  - User Features: Age, gender, location, past behavior (e.g., items clicked, purchased).
  - Item Features: Category, brand, price, description.
  - Context Features: Time of day, device type.
- High-Cardinality Features: Features that can take a very large number of unique values. Examples include user IDs, item IDs, and shop IDs. These are common in recommender systems and pose challenges due to their sparsity.
- Low-Cardinality Features: Features that have a relatively small, fixed number of unique values. Examples include gender (male/female), day of week, and product category.
- Sparse Features / Sparse-Activated Embedding Layers: When features are high-cardinality, they are often represented as one-hot vectors (a vector with all zeros except for a single 1 at the index corresponding to the feature's value). These vectors are very long and mostly zeros, hence "sparse." To handle them in deep learning, an embedding layer is used. This layer maps each unique ID to a dense, lower-dimensional vector called an embedding. Only the embeddings corresponding to the active (non-zero) features in an input are activated and used, making these layers "sparse-activated" (see the sketch after this list).
- Low-Rank Representations: In the context of embeddings, low-rank representations mean that the high-dimensional embedding vectors effectively reside within a much lower-dimensional subspace. This implies that the embeddings might not be diverse enough to capture all the nuanced relationships between items or users, limiting the model's expressive power.
- "One-Epoch" Phenomenon: A problem where a deep learning model for CTR prediction achieves its best performance within the first few epochs (often even just one) of training, and further training provides diminishing or even negative returns. This suggests the model quickly exhausts its capacity to learn from the data, possibly due to low-rank representations or other issues limiting effective learning.
- "Interaction-Collapse": A phenomenon where high-order feature interactions (complex relationships between three or more features) are lost or poorly captured by the model, especially when low-rank representations are present. These complex interactions are crucial for personalized recommendations.
- Attention Mechanism: A core component in many modern neural networks (especially Transformers). It allows a model to weigh the importance of different parts of the input when processing a specific part. For example, when predicting a user's preference for an item, attention might focus on specific past interactions or item attributes that are most relevant.
- Self-Attention: A specific type of attention where a sequence attends to itself. Each element in the sequence (e.g., a feature token) computes its relevance to every other element in the same sequence.
- Query (Q), Key (K), Value (V): In attention, the Query represents the element seeking information, the Key represents elements that can provide information, and the Value represents the information itself. Attention is computed by taking the dot product of $Q$ with $K$ to get attention scores, which are then used to weight the Value vectors.
- Computational Complexity $O(N^2)$: For self-attention, if there are $N$ tokens in a sequence, each token computes its attention with all other tokens. This results in a quadratic growth of computation with respect to sequence length $N$.
- Attention Dispersion: When there are too many tokens in the input sequence, the attention mechanism might distribute its attention too broadly across many irrelevant or less important tokens. This can dilute the signal from genuinely important tokens, making the model less effective at identifying crucial interactions.
- Tokens / Feature Tokens: In STORE, features are transformed into discrete units called tokens. A feature token represents an embedding of a specific feature or a semantic concept derived from features.
- Orthogonal Transformation: A linear transformation (represented by an orthogonal matrix) that preserves lengths and angles between vectors. In geometric terms, it is a rotation, reflection, or a combination of both. Orthogonal matrices satisfy the property $\mathbf{R}^{\top}\mathbf{R} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Using orthogonal transformations can help maintain information integrity and prevent collapse into low-rank representations.
- Quantization: The process of mapping continuous values to a finite set of discrete values. In machine learning, vector quantization (VQ) is common, where continuous embedding vectors are mapped to a finite set of codewords in a codebook. This can compress representations and introduce discreteness.
- Scaling Laws: Empirical observations in Large Language Models (LLMs) demonstrating that model performance (e.g., accuracy, loss) tends to improve predictably and consistently as the model size, dataset size, and computational budget increase. This allows for reliable scaling of LLMs to achieve better performance. The paper states that ranking models currently lack similar predictable scaling behavior.
- Layer Normalization (LN): A technique used in neural networks to normalize the activations of a layer. It helps stabilize training, especially in deep networks, by ensuring that the inputs to subsequent layers have a consistent distribution.
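To make the "sparse-activated embedding layer" idea concrete, here is a minimal PyTorch sketch (not from the paper; table sizes and variable names are illustrative). Only the rows indexed by the batch are touched, which is what "sparse-activated" refers to.

```python
import torch
import torch.nn as nn

# Hypothetical cardinality; real industrial tables hold billions of IDs.
NUM_ITEM_IDS = 1_000_000   # high-cardinality sparse feature
EMBED_DIM = 32

# The embedding table replaces an explicit one-hot multiplication:
# only the rows indexed by the batch are read ("sparse-activated").
item_embedding = nn.Embedding(NUM_ITEM_IDS, EMBED_DIM)

batch_item_ids = torch.tensor([42, 7, 999_983])   # raw item IDs in a batch
item_vectors = item_embedding(batch_item_ids)     # shape: (3, EMBED_DIM)
print(item_vectors.shape)                          # torch.Size([3, 32])
```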
3.2. Previous Works
The paper references several existing CTR prediction models as baselines, which represent different evolutionary stages and architectural paradigms in the field:
- Factorization Machines (FM) [8]: A classic model for CTR prediction that can capture second-order feature interactions. It generalizes linear regression and matrix factorization by modeling interactions between all pairs of features using shared embedding vectors.
  - Core Idea: For input features $\mathbf{x} = (x_1, \dots, x_n)$, FM predicts:
    $ \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j $
    where $w_0$ is the global bias, $w_i$ is the weight for the $i$-th feature, and $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ is the dot product of the $i$-th and $j$-th feature embeddings, capturing their interaction. A small numerical sketch appears after this list.
- Deep Neural Networks (DNN): General multi-layer perceptrons used for CTR prediction by learning complex, non-linear relationships between concatenated feature embeddings.
- Wide & Deep Learning [2]: Introduced by Google, it combines a "Wide" linear model (for memorization of sparse features and cross-product feature transformations) with a "Deep" DNN model (for generalization and learning dense embeddings).
- DeepFM [4]: Combines the power of Factorization Machines (for low-order feature interactions) with a Deep Neural Network (for high-order feature interactions) in a single model.
- DCN (Deep & Cross Network) [11]: Addresses high-order feature interactions through a specialized "cross network" that explicitly applies feature crosses at each layer, alongside a parallel DNN.
- AutoInt [9]: Uses self-attentive neural networks to automatically learn explicit high-order feature interactions in an adaptive manner, treating each feature field as a query to interact with other fields.
- GDCN (Global-Deep-Cross Network) [10]: An evolution of DCN that aims for deeper and more interpretable cross networks.
- MaskNet [12]: Introduces feature-wise multiplication to CTR ranking models using an instance-guided mask, aiming to improve the expressiveness of feature interactions.
- PEPNet (Parameter and Embedding Personalized Network) [1]: A more recent approach that infuses personalized prior information into the model using parameter and embedding personalization.
- RankMixer [17]: Proposed to scale up ranking models in industrial recommenders, likely employing mixing or MLP-based architectures, possibly similar to MLP-Mixer concepts for vision.
- OneTrans [15]: Aims for unified feature interaction and sequence modeling with a single Transformer architecture in industrial recommenders. This is particularly relevant as STORE also uses an efficient attention mechanism.
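The sketch below is a minimal PyTorch illustration of the FM prediction above (a generic reference implementation, not any specific library's API); it uses the standard $O(nk)$ identity for the pairwise term.

```python
import torch

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for one dense input vector x.

    x  : (n,)    feature values
    w0 : scalar  global bias
    w  : (n,)    first-order weights
    V  : (n, k)  per-feature embedding vectors v_i
    Pairwise term uses: sum_{i<j} <v_i,v_j> x_i x_j
                      = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    """
    linear = w0 + torch.dot(w, x)
    xv = x.unsqueeze(1) * V                                   # (n, k): x_i * v_i
    pairwise = 0.5 * ((xv.sum(0) ** 2) - (xv ** 2).sum(0)).sum()
    return linear + pairwise

# Toy example with n=4 features and k=3 embedding dimensions.
torch.manual_seed(0)
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
w0, w, V = torch.tensor(0.1), torch.randn(4), torch.randn(4, 3)
print(fm_predict(x, w0, w, V))
```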
Regarding the core Attention Mechanism:
Since STORE employs an Efficient Attention mechanism, understanding the standard Scaled Dot-Product Attention from the Transformer model [Vaswani et al., 2017] is crucial. This is the foundation that STORE optimizes.
- Formula for Scaled Dot-Product Attention:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequences.
  - $d_k$ is the dimension of the key vectors. The division by $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients.
  - $QK^T$ computes the similarity (attention scores) between each query and all keys.
  - softmax normalizes these scores, turning them into probabilities.
  - The softmax output is then multiplied by $V$ to get a weighted sum of value vectors, where the weights indicate relevance.
- Complexity: If $Q$, $K$, $V$ are sequences of length $N$, the $QK^T$ matrix multiplication alone costs $O(N^2 d_k)$, and the overall complexity is dominated by $O(N^2)$ operations.

Scaling Laws for Neural Language Models [6]: This work demonstrated that large language models exhibit predictable performance improvements with increased model size, dataset size, and computation. The authors of STORE specifically mention that ranking models lack this property, highlighting a key motivation for their work.
MoBA: Mixture of Block Attention for Long-Context LLMs [7]: This paper is cited as the source for the Efficient Attention mechanism (MoBA) used in STORE. MoBA is designed to handle long sequences more efficiently than standard self-attention by employing a routing strategy for queries to attend to only a subset of key-value pairs, thus reducing complexity.
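As a reference point for the standard mechanism that MoBA-style attention sparsifies, here is a minimal PyTorch sketch of scaled dot-product attention (an illustration, not the paper's or MoBA's code).

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Standard O(N^2) attention over N feature tokens."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (N, N) pairwise scores
    weights = F.softmax(scores, dim=-1)                  # each query's distribution over keys
    return weights @ V                                   # weighted sum of value vectors

# Toy example: N = 8 feature tokens of dimension 16; self-attention uses x as Q, K, and V.
N, d = 8, 16
x = torch.randn(N, d)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)   # torch.Size([8, 16])
```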
3.3. Technological Evolution
The evolution of ranking models has moved from simpler statistical methods to complex deep learning architectures:
- Early Models (e.g., Logistic Regression): Simple, interpretable, but limited in capturing complex interactions.
- Factorization-based Models (e.g., FM): Introduced explicit modeling of feature interactions, improving recommendation quality.
- Deep Learning Models (e.g., DNN, Wide&Deep, DeepFM): Leveraged DNNs to learn non-linear high-order feature interactions and dense embeddings from sparse features. This significantly boosted accuracy but also introduced challenges like the representation bottleneck due to sparse-activated embedding layers.
- Attention-based Models (e.g., AutoInt, OneTrans): Adopted attention mechanisms from Transformers to dynamically weigh feature importance and learn complex interactions. While powerful, they inherited the computational bottleneck of $O(N^2)$ complexity with increasing feature tokens, as well as attention dispersion.
- Scaling Challenges: Despite these advancements, ranking models have struggled to achieve the predictable scaling laws observed in LLMs. This is attributed to the representation bottleneck (leading to "One-Epoch" and "Interaction-Collapse") and the computational bottleneck (due to attention dispersion and $O(N^2)$ complexity with many feature tokens).

STORE fits into this timeline as a next-generation framework that explicitly addresses these scaling challenges. It attempts to bridge the gap between powerful deep learning techniques (like attention) and the unique constraints of recommender systems (high cardinality, sparsity, heterogeneity) by re-thinking feature representation and interaction mechanisms.
3.4. Differentiation Analysis
Compared to the main methods in related work, STORE offers several core differences and innovations:
- Unified Bottleneck Addressing: Unlike many prior works that focus on one aspect (e.g., better feature interaction learning in AutoInt or DCN, or sequence modeling in OneTrans), STORE explicitly targets both the representation bottleneck and the computational bottleneck holistically.
- Fundamental Feature Re-representation:
  - Semantic Tokenization: While models like RankMixer and OneTrans aggregate feature groups, STORE introduces a more fundamental transformation for high-cardinality sparse features by converting them into compact, stable Semantic IDs (SIDs) using Orthogonal, Parallel, Multi-expert Quantization (OPMQ). This is a proactive step to create better, low-rank-free representations at the input stage, directly combating the "One-Epoch" and "Interaction-Collapse" problems that plague models relying solely on sparse-activated embedding layers.
  - Orthogonal Rotation: For low-cardinality static features, STORE doesn't just concatenate or MLP-transform them; it applies orthogonal rotation with diversity regularization to create multiple, diverse feature blocks. This aims to enhance interaction potential in high-dimensional spaces more effectively than simple concatenation or shallow fusion.
- Efficient Attention Tailored for Ranking: Transformer-based models like OneTrans use self-attention for unified interaction, but they typically face the $O(N^2)$ computational cost and attention dispersion when $N$ (the number of features/tokens) is large. STORE directly integrates an efficient attention mechanism (MoBA) that actively prunes low-contributing tokens. This makes the attention mechanism practical and scalable for the vast feature token sets in industrial recommender systems, a critical distinction for efficiency.
- Synergistic Components: The power of STORE lies in the synergistic combination of its three components. Semantic Tokenization and Orthogonal Rotation create robust and diverse tokens that are then efficiently processed by the Efficient Attention. This contrasts with approaches that treat feature representation and interaction learning as separate, less integrated steps.
- Scaling Law Alignment: STORE explicitly aims to enable ranking models to exhibit scaling-law-like behavior, a property missing from previous works and a major motivation here. By tackling the root causes of non-scaling, it paves the way for more predictable performance gains with increased resources.

In essence, STORE differentiates itself by proposing a more fundamental re-structuring of feature handling (via tokenization and rotation) and interaction (via efficient attention), specifically tailored to the unique challenges of scaling ranking models in real-world scenarios.
4. Methodology
4.1. Principles
The core idea behind STORE is to dismantle the identified representation bottleneck and computational bottleneck in ranking models by systematically transforming how features are represented and how they interact. The theoretical basis and intuition are as follows:
- Decoupling High-Cardinality Feature Representation from Sparsity: Instead of relying on sparse-activated embedding layers for high-cardinality features, which lead to low-rank representations and issues like "One-Epoch" and "Interaction-Collapse," STORE proposes Semantic Tokenization. The intuition is that if sparse IDs can be mapped to a compact, stable set of semantic tokens, the model can learn richer, more diverse representations from these tokens, rather than being constrained by the inherent sparsity of the raw IDs. This essentially creates a more robust input signal for the deeper layers.
- Enhancing Interaction Diversity for Static Features: For low-cardinality static features, the principle is to leverage their manageable size to explicitly create diverse perspectives. Orthogonal Rotation Transformation aims to project these features into different high-dimensional subspaces. The intuition is that different rotations can highlight different aspects of these features, leading to richer and more effective feature interactions when combined with other tokens, without forcing them into a low-rank space.
- Scaling Attention through Sparsity: The principle for addressing the computational bottleneck and attention dispersion is to make attention mechanisms efficient by being selective. Instead of every token attending to every other token (leading to $O(N^2)$ complexity), the intuition is that only a subset of tokens is truly relevant for interaction at any given moment. Efficient Attention with token filtering allows the model to focus computational resources only on high-contributing tokens, thereby reducing complexity while preserving accuracy and preventing attention dispersion.

By combining these three principles, STORE aims to provide a unified framework where feature heterogeneity and sparsity are fundamentally handled at the representation level, and feature interactions are learned efficiently and effectively, enabling ranking models to scale more predictably.
4.2. Core Methodology In-depth (Layer by Layer)
The STORE framework consists of three main components: Semantic Tokenizer, Orthogonal Rotation Transformation, and Efficient Attention for Unified Feature Interaction. These components work synergistically as illustrated in Figure 1.
This figure is a schematic showing the core components of the STORE framework, including the Semantic Tokenizer, Orthogonal Rotation Transformation, and Efficient Attention mechanism. These modules work together to process high-cardinality sparse features, achieving model efficiency and scalability.
Figure 1: Overview of the proposed STORE.
The STORE framework processes features by categorizing them into high-cardinality sparse features (e.g., item identifiers) and low-cardinality static features (e.g., category ID, age, gender). Each type of feature undergoes distinct processing strategies before being fed into a unified Efficient Attention module.
4.2.1 Semantic Tokenizer
This component addresses the challenge of high-cardinality sparse features. Instead of using raw item IDs (which are high-cardinality and sparse), STORE maps them into a more stable and structured semantic space via Semantic IDs (SIDs).
The process begins with powerful pre-trained item embeddings, such as those obtained from a SASRec model. These continuous embeddings are then quantized into a sequence of SIDs.
The transformation from a pre-trained item embedding to a sequence of SIDs is formalized as:
$
(SID_{1},SID_{2},\dots,SID_{K}) = \mathcal{F}_{\mathrm{item}}(\mathbf{e}_p\in \mathbb{R}^d) \quad (1)
$
where:

- $\mathbf{e}_p \in \mathbb{R}^d$ denotes the pre-trained item embedding, a dense vector of dimension $d$.
- $\mathcal{F}_{\mathrm{item}}$ is the item semantic tokenization function.
- $(SID_1, SID_2, \dots, SID_K)$ is the resulting sequence of Semantic IDs for a given item.
- The paper specifies a constraint on $K$; the analysis suggests it matches the number of tokens or heads in the attention mechanism.

To perform this efficient encoding of high-cardinality IDs into compact and parallel SIDs, STORE proposes an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ). For each item, this network utilizes $K$ distinct "experts" to encode its pre-trained embedding into $K$ latent representations.

The encoding by each expert is given by:
$
\mathbf{z}_i = E_i(\mathbf{e}_p), \quad i \in \{1,\dots,K\} \quad (2)
$
where:

- $\mathbf{z}_i$ is the $i$-th latent representation, a vector produced by the $i$-th expert.
- $E_i$ is the $i$-th expert network, which takes the pre-trained item embedding $\mathbf{e}_p$ as input.

Following the generation of latent representations, each latent vector $\mathbf{z}_i$ is assigned to the index of its nearest-neighbor codeword, which corresponds to a codeword vector $\mathbf{s}_i$. This is a standard vector quantization step where the latent representations are mapped to discrete SIDs from a predefined codebook.

The entire OPMQ network is trained end-to-end to minimize the reconstruction error between the original pre-trained embedding and the output of a decoder that aggregates the quantized vectors. This ensures that the generated SIDs retain as much information as possible from the original embedding.
The reconstruction loss is formulated as:
$
\mathcal{L}_{\mathrm{recon}} = \left\| \mathbf{e}_p - \mathrm{decoder}\Big[\sum_{i=1}^{K}\big(\mathbf{z}_i + \mathrm{sg}(\mathbf{s}_i - \mathbf{z}_i)\big)\Big] \right\|^2 \quad (4)
$
where:

- $\mathcal{L}_{\mathrm{recon}}$ is the reconstruction loss.
- $\|\cdot\|^2$ denotes the squared Euclidean ($\ell_2$) norm.
- $\mathrm{decoder}[\cdot]$ is a decoder function that aggregates the quantized vectors to reconstruct the original embedding.
- $\mathbf{z}_i$ is the latent representation from the $i$-th expert.
- $\mathbf{s}_i$ is the codeword vector that $\mathbf{z}_i$ is quantized to (i.e., its nearest neighbor in the codebook).
- $\mathrm{sg}(\cdot)$ is the stop-gradient operator. During backpropagation, gradients flow through $\mathbf{z}_i$ but not through $\mathbf{s}_i - \mathbf{z}_i$. This is a common technique in vector quantization: gradients from the decoder update the experts ($E_i$), while the quantization step itself (finding the nearest neighbor) is treated as non-differentiable. The term $\mathbf{z}_i + \mathrm{sg}(\mathbf{s}_i - \mathbf{z}_i)$ effectively uses $\mathbf{s}_i$ in the forward pass but $\mathbf{z}_i$'s gradient in the backward pass (a straight-through estimator).

To ensure that the SIDs capture diverse and non-redundant aspects of the original item, orthogonal regularization is applied to the parameters of the multi-experts within OPMQ. For each $i$-th expert, its parameter vector $\mathbf{v}_i$ is defined as the $\ell_2$-normalized version of its flattened parameter matrix.
The orthogonal regularization term is then applied to the set of parameter vectors:
$
\mathcal{L}_{\mathrm{orth}} = \left\| \mathbf{V}\mathbf{V}^{\top} - \mathbf{I}\right\|_F^2, \quad (5)
$
where:
- $\mathcal{L}_{\mathrm{orth}}$ is the orthogonal regularization loss.
- $\mathbf{V}$ is a matrix formed by stacking the normalized parameter vectors (i.e., $\mathbf{V} = [\mathbf{v}_1; \dots; \mathbf{v}_K]$).
- $\mathbf{V}^{\top}$ is the transpose of $\mathbf{V}$.
- $\mathbf{I}$ is the identity matrix.
- $\|\cdot\|_F^2$ denotes the squared Frobenius norm. The Frobenius norm of a matrix $\mathbf{A}$ is defined as $\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. Minimizing $\mathcal{L}_{\mathrm{orth}}$ encourages the rows of $\mathbf{V}$ (one per expert) to be orthogonal to each other and to have unit length, thus making the experts learn diverse and independent transformations.
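To make the OPMQ training signal concrete, below is a minimal PyTorch sketch of $K$ parallel experts, nearest-neighbor codeword assignment with a straight-through estimator, and the reconstruction and orthogonality losses of Equations (4)-(5). This is an illustration under stated assumptions, not the paper's released code; module and variable names are invented, and the decoder aggregation (sum then linear) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OPMQSketch(nn.Module):
    """Illustrative multi-expert quantizer: K linear experts, one codebook per expert."""
    def __init__(self, d_in=64, d_latent=32, num_experts=4, codebook_size=300):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_latent) for _ in range(num_experts))
        self.codebooks = nn.Parameter(torch.randn(num_experts, codebook_size, d_latent))
        self.decoder = nn.Linear(d_latent, d_in)   # aggregates quantized vectors (assumed: sum then decode)

    def forward(self, e_p):
        sids, quantized = [], []
        for i, expert in enumerate(self.experts):
            z_i = expert(e_p)                                      # Eq. (2): z_i = E_i(e_p)
            dist = torch.cdist(z_i, self.codebooks[i])             # distance to every codeword
            sid_i = dist.argmin(dim=-1)                            # discrete Semantic ID
            s_i = self.codebooks[i][sid_i]                         # nearest codeword vector
            quantized.append(z_i + (s_i - z_i).detach())           # straight-through: z_i + sg(s_i - z_i)
            sids.append(sid_i)
        recon = self.decoder(torch.stack(quantized).sum(dim=0))
        loss_recon = F.mse_loss(recon, e_p)                        # Eq. (4); a full VQ setup would add
                                                                   # codebook/commitment losses (omitted here)
        # Eq. (5): orthogonality across experts' flattened, L2-normalized parameters.
        V = torch.stack([F.normalize(exp.weight.flatten(), dim=0) for exp in self.experts])
        loss_orth = ((V @ V.T - torch.eye(len(self.experts))) ** 2).sum()
        return torch.stack(sids, dim=-1), loss_recon + loss_orth

# Toy usage: a batch of 2 pre-trained item embeddings of dimension 64.
model = OPMQSketch()
sids, loss = model(torch.randn(2, 64))
print(sids.shape, float(loss))   # torch.Size([2, 4]) and a scalar loss
```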
4.2.2 Orthogonal Rotation Transformation
This component handles low-cardinality static features, which typically have controllable sizes and are less prone to the extreme sparsity of item IDs. These features are directly used through their original embeddings.
To simplify and improve efficiency, these static features are manually grouped based on their semantic meanings and domain knowledge. Each group contains several features. For each such feature group, a shallow MLP (Multi-Layer Perceptron) is used to perform intra-group feature fusion, allowing for simple interactions within the group.
After fusion, all the semantically fused feature groups are concatenated to form an instance-wise feature block, denoted as $\mathbf{C}$.
$
\mathbf{C} = [MLP_1(g_1),\dots ,MLP_K(g_K)] \quad (6)
$
where:

- $\mathbf{C}$ is the concatenated instance-wise feature block.
- $MLP_i$ is a shallow MLP applied to the $i$-th feature group.
- $g_i$ represents the $i$-th semantically grouped features.
- $K$ is the number of semantic groups, which is also the number of SIDs from the Semantic Tokenizer.

To facilitate efficient and effective feature interactions in high-dimensional spaces, the orthogonal rotation transformation is applied to this instance-wise feature block. The goal is to obtain diverse instance-wise feature blocks by rotating $\mathbf{C}$ with $K$ groups of orthogonal matrices.
For the $i$-th rotation, the transformed block is obtained by: $ \mathbf{O}_{i} = \mathbf{C}\mathbf{R}_{i} \quad (7) $ where:
- $\mathbf{O}_i$ is the $i$-th rotated instance-wise feature block.
- $\mathbf{C}$ is the original instance-wise feature block from Equation (6).
- $\mathbf{R}_i$ is the $i$-th orthogonal matrix. Since it is an orthogonal matrix, it satisfies $\mathbf{R}_i^{\top}\mathbf{R}_i = \mathbf{I}$, preserving the norm of vectors and acting as a rotation/reflection.

To prevent these rotation matrices from becoming too similar or collapsing during training, a diversity regularization term is introduced. This term, in conjunction with the orthogonality constraint ($\mathbf{R}_i^{\top}\mathbf{R}_i = \mathbf{I}$), encourages a diverse set of learned transformations.
The diversity regularization is formulated as an optimization problem:
$
\underset{\mathbf{R}_1,\ldots,\mathbf{R}_K}{\min}\; -\lambda \sum_{i=1}^{K}\sum_{j=i+1}^{K}\left\|\mathbf{R}_i - \mathbf{R}_j\right\|_F^2 \quad (8)
$
$
\mathrm{s.t.}\quad \mathbf{R}_i^{\top}\mathbf{R}_i = \mathbf{I},\quad \forall i\in \{1,\ldots,K\} \quad (9)
$
where:

- $\lambda$ is a hyperparameter controlling the strength of the diversity regularization (set to 0.1 in the paper).
- $\|\cdot\|_F$ is the Frobenius norm.
- The term $\sum_{i=1}^{K}\sum_{j=i+1}^{K}\|\mathbf{R}_i - \mathbf{R}_j\|_F^2$ encourages the Frobenius distance between distinct rotation matrices $\mathbf{R}_i$ and $\mathbf{R}_j$ to be large, thus promoting diversity; minimizing the negative of this sum is equivalent to maximizing the sum of distances.
- The constraint in Equation (9) ensures that each matrix $\mathbf{R}_i$ remains orthogonal.

The rotation matrices and the parameters of the main network are optimized alternately, meaning they are updated in separate steps during training.
4.2.3 Efficient Attention for Unified Feature Interaction
After distinct processing of high-cardinality sparse features (via Semantic Tokenizer) and low-cardinality static features (via Orthogonal Rotation Transformation), the framework unifies them for interaction using an Efficient Attention mechanism.
In the first layer of the attention module, the embedding of SIDs is concatenated with the rotated feature blocks. For each item or instance, this forms an input sequence for attention.
The input for the $l$-th layer of attention is denoted as $\mathbf{X}_{l-1}$. The initial token sequence $\mathbf{X}_0$ combines the SID embeddings of the item with the rotated static feature blocks $\mathbf{O}_i$. This full input sequence to the attention module effectively provides the Query (Q), Key (K), and Value (V) for the attention mechanism.
The iterative unified efficient attention for feature interaction is formulated as:
$
\mathbf{X}_{l} = \mathrm{LN}\big(\mathrm{EfficientAttention}(\mathbf{X}_{l-1}) + \mathbf{X}_{l-1}\big) \quad (10)
$
where:

- $\mathbf{X}_l$ is the output of the $l$-th attention layer.
- $\mathbf{X}_{l-1}$ is the input to the $l$-th layer.
- $\mathrm{LN}$ denotes Layer Normalization, a technique to stabilize training by normalizing the inputs to each layer.
- $\mathrm{EfficientAttention}$ is the efficient attention module itself.
- The term $+\,\mathbf{X}_{l-1}$ represents a residual connection, a common practice in deep networks to facilitate gradient flow and prevent degradation.

The traditional vanilla self-attention mechanism has a computational complexity of $O(N^2)$, where $N$ is the number of instance-wise tokens. This quadratic growth makes it prohibitive when $N$ is large, which is often the case when integrating many heterogeneous features.
To overcome this computational bottleneck, STORE incorporates an efficient attention mechanism called MOBA (Mixture of Block Attention) [7]. MOBA reduces complexity by employing a routing strategy where each query attends to only a small subset of key-value pairs, rather than all of them.
The MoBA mechanism is formulated as:
$
\mathrm{MoBA}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Softmax}\left(\mathbf{Q}\,\mathbf{K}[Ind]^{T}\right)\mathbf{V}[Ind], \quad (11)
$
$
Ind_{i} = \big[\big(i - 1\big)\times B + 1,i\times B\big] \quad (12)
$
where:

- $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are the Query, Key, and Value matrices, respectively, derived from the input token sequence $\mathbf{X}_{l-1}$.
- $Ind_i$ is the dynamically selected set of indices of key-value pairs that the $i$-th query will attend to.
- $\mathbf{K}[Ind]$ and $\mathbf{V}[Ind]$ denote taking only the key and value vectors corresponding to the indices in $Ind$.
- $B$ is the size of the selective block, meaning each query attends to a block of $B$ key-value pairs.
- The softmax function is applied to the scaled dot product of the query with this selected subset of keys.

This approach significantly reduces the complexity from quadratic to a more linear or block-wise level, making attention feasible for a large number of tokens. The authors state that this efficiency is made possible by their framework's effective mitigation of feature heterogeneity and sparsity at the representation level, ensuring that even with fewer key-value pairs to attend to, the underlying signals are strong and meaningful.
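The following is a minimal PyTorch sketch of a block-restricted attention layer in the spirit of Equations (10)-(12): each query attends only to a fixed-size block of keys/values, wrapped with a residual connection and LayerNorm. It is an illustration, not MoBA's actual routing (MoBA selects blocks dynamically per query; here the block assignment follows the simple index rule of Equation (12)), and all names are placeholders.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockAttentionLayerSketch(nn.Module):
    """One layer of X_l = LN(BlockAttention(X_{l-1}) + X_{l-1})."""
    def __init__(self, d_model=32, block_size=4):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.block_size = block_size
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, N, d_model), N divisible by block_size
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        nb = N // self.block_size
        # Reshape so each query only sees the keys/values in its own block (Eq. 12 index rule).
        q = q.view(B, nb, self.block_size, D)
        k = k.view(B, nb, self.block_size, D)
        v = v.view(B, nb, self.block_size, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(D)    # (B, nb, block, block) instead of (B, N, N)
        out = (F.softmax(scores, dim=-1) @ v).reshape(B, N, D)
        return self.norm(out + x)                          # residual + LayerNorm, Eq. (10)

# Toy usage: 16 feature tokens per instance, block size 4 -> 4x fewer score entries than full attention.
layer = BlockAttentionLayerSketch()
x = torch.randn(2, 16, 32)
print(layer(x).shape)   # torch.Size([2, 16, 32])
```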
5. Experimental Setup
5.1. Datasets
The experiments were conducted on both a public dataset and a large-scale industrial dataset to validate the STORE framework's effectiveness.
- Avazu:
  - Source: A widely-used public benchmark for CTR prediction.
  - Characteristics: Consists of 9 million chronologically ordered ad click logs. It includes 23 feature fields and 3437 unique site IDs. This dataset is known for its high-cardinality categorical features.
  - Purpose: To demonstrate the framework's performance on a publicly accessible and well-established benchmark for CTR prediction.
- Industrial Dataset:
  - Source: An international e-commerce advertising system.
  - Characteristics: Contains 7 billion user interaction records. Features diverse item features and user behavior sequences, representing a real-world, large-scale industrial scenario with immense data volume and complexity.
  - Purpose: To validate STORE's scalability and effectiveness in a production environment with extremely high-cardinality and heterogeneous features, which is the primary target scenario for the proposed solution.

The choice of these datasets allows for evaluating the model on both a standard academic benchmark and a challenging real-world industrial setting, covering different scales and complexities of recommendation tasks.
5.2. Evaluation Metrics
The effectiveness and efficiency of STORE are evaluated using a combination of prediction accuracy metrics and training efficiency metrics.
5.2.1 Prediction Accuracy Metrics
- Area Under the Receiver Operating Characteristic Curve (AUC)
  - Conceptual Definition: AUC measures the overall performance of a binary classifier. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1.0 indicates a perfect classifier, while 0.5 indicates a random classifier. It is robust to class imbalance.
  - Mathematical Formula:
    $ AUC = \frac{\sum_{i \in P} \sum_{j \in N} I(score_i > score_j)}{|P| \cdot |N|} $
  - Symbol Explanation:
    - $P$: Set of positive instances (e.g., actual clicks).
    - $N$: Set of negative instances (e.g., non-clicks).
    - $|P|$: Number of positive instances.
    - $|N|$: Number of negative instances.
    - $score_i$: The predicted score (e.g., click probability) for instance $i$.
    - $I(\cdot)$: Indicator function, which returns 1 if the condition is true, and 0 otherwise.
- Group AUC (GAUC)
  - Conceptual Definition: GAUC is an extension of AUC that calculates AUC for each user (or group) individually and then averages these scores, often weighted by the number of impressions or positive samples per user. This metric is particularly relevant in recommender systems as it reflects individual user experience more accurately, accounting for the fact that a user's CTR predictions are compared only against other items shown to that specific user.
  - Mathematical Formula:
    $ GAUC = \frac{\sum_{u=1}^{U} w_u \cdot AUC_u}{\sum_{u=1}^{U} w_u} $
  - Symbol Explanation:
    - $U$: Total number of users (or groups).
    - $AUC_u$: The AUC score calculated for user $u$'s recommendations.
    - $w_u$: Weight for user $u$, often the number of positive samples or impressions for user $u$.
- LogLoss (Binary Cross-Entropy Loss)
  - Conceptual Definition: LogLoss quantifies the performance of a classification model whose prediction output is a probability value between 0 and 1. It measures the prediction error, penalizing incorrect classifications more heavily when the model is confident in its wrong prediction. Lower LogLoss values indicate better prediction accuracy.
  - Mathematical Formula:
    $ LogLoss = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] $
  - Symbol Explanation:
    - $N$: Total number of instances.
    - $y_i$: The true label for instance $i$ (0 or 1).
    - $p_i$: The predicted probability that instance $i$ is positive.
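For completeness, here is a small NumPy sketch of these three metrics. It is a straightforward reference implementation, not the paper's evaluation code; ties count as 0.5 in AUC and single-class users are skipped in GAUC, which are common but not universal conventions.

```python
import numpy as np
from collections import defaultdict

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def gauc(y_true, scores, user_ids):
    """Impression-weighted average of per-user AUC; users with a single class are skipped."""
    by_user = defaultdict(list)
    for y, s, u in zip(y_true, scores, user_ids):
        by_user[u].append((y, s))
    num, den = 0.0, 0.0
    for pairs in by_user.values():
        ys, ss = zip(*pairs)
        if len(set(ys)) < 2:
            continue
        w = len(pairs)                    # weight = number of impressions for this user
        num += w * auc(ys, ss)
        den += w
    return num / den

def logloss(y_true, probs, eps=1e-12):
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(probs, float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy check on a handful of predictions from two users.
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.7, 0.4, 0.6, 0.3]
u = ["a", "a", "a", "b", "b", "b"]
print(round(auc(y, p), 3), round(gauc(y, p, u), 3), round(logloss(y, p), 3))
```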
5.2.2 Training Efficiency Metric
- Training TFlops/Batch (Batch size = 1024)
  - Conceptual Definition: TFlops/Batch measures the number of tera floating-point operations required to process one batch of data during training. This metric directly quantifies the computational cost of a single training step; a lower value indicates higher training efficiency for a given batch size.
  - Symbol Explanation:
    - TFlops: Tera floating-point operations ($10^{12}$ floating-point operations).
    - Batch: A single batch of data processed during training.
    - Batch size = 1024: The number of samples in one training batch.
5.3. Baselines
To demonstrate the effectiveness of STORE, its performance is compared against several state-of-the-art CTR prediction models, representing a comprehensive set of baselines:
- FM [8]: Factorization Machines capture second-order feature interactions; a foundational baseline for feature interaction learning.
- DNN: A standard Deep Neural Network without explicit feature interaction components, serving as a general deep learning baseline.
- Wide&Deep [2]: Combines a linear model with a DNN for memorization and generalization, a widely adopted industry model.
- DeepFM [4]: Integrates FM with a DNN to capture both low-order and high-order feature interactions.
- DCN [11]: Deep & Cross Network explicitly learns high-order feature crosses through its cross network component.
- AutoInt [9]: Uses self-attentive neural networks to automatically learn feature interactions. This is a relevant baseline as STORE also employs attention.
- GDCN [10]: Global-Deep-Cross Network, an advancement over DCN with potential for deeper cross networks.
- MaskNet [12]: Incorporates feature-wise multiplication using an instance-guided mask to enhance CTR ranking models.
- PEPNet [1]: Parameter and Embedding Personalized Network, a more recent model focusing on personalized prior information.
- RankMixer [17]: A model designed for scaling up ranking models in industrial settings, similar to STORE in its goal.
- OneTrans [15]: A Transformer-based model aiming for unified feature interaction and sequence modeling, representing attention-centric baselines.

These baselines cover a spectrum from traditional factorization models to deep learning models with various feature interaction mechanisms, including attention-based approaches. This diverse set allows for a robust evaluation of STORE's improvements over existing methods.
5.4. Implementation Details
- Pre-trained Embeddings: The paper utilizes pre-trained item embeddings obtained from a SASRec model. SASRec (Self-Attentive Sequential Recommendation) is a Transformer-based sequential recommender that learns item representations from user interaction sequences. This provides high-quality initial item embeddings for the Semantic Tokenizer.
- Semantic Tokenizer (OPMQ) Configuration:
  - Number of SIDs ($K$): Set to 3 for the Avazu public dataset and 32 for the industrial dataset. This suggests that a larger number of semantic tokens is used for more complex, larger-scale industrial data.
  - Codebook Size: Set to 16 for the Avazu public dataset and 300 for the industrial dataset. The codebook size determines the number of discrete codeword vectors available for quantization, indicating a richer semantic space for the industrial data.
- Orthogonal Rotation Transformation: The hyperparameter $\lambda$ for diversity regularization (Equation 8) is set to 0.1.
- Efficient Attention (MoBA) Configuration: The sparsity of attention in online deployment is set to roughly 50%, meaning approximately half of the key-value pairs are filtered out.

These details highlight the practical configurations used for STORE in both research and deployment settings (gathered into a configuration sketch below).
6. Results & Analysis
6.1. Core Results Analysis
6.1.1 Overall Performance (RQ1)
The overall performance of STORE compared to various baselines on both the Avazu (public) and Industrial datasets is presented in Table 1. The metrics include AUC, GAUC, and LogLoss, with Improv. indicating the relative improvement of STORE over the best baseline.
The following are the results from Table 1 of the original paper:
| Model | Avazu AUC | Avazu GAUC | Avazu Logloss | Industrial AUC | Industrial GAUC | Industrial Logloss |
|---|---|---|---|---|---|---|
| FM | 0.7291 | 0.7248 | 0.4052 | 0.6711 | 0.6011 | 0.1144 |
| DNN | 0.7231 | 0.7211 | 0.4052 | 0.6721 | 0.6005 | 0.1148 |
| Wide&Deep | 0.7356 | 0.7329 | 0.3988 | 0.6720 | 0.6018 | 0.1144 |
| DeepFM | 0.7404 | 0.7375 | 0.3965 | 0.6707 | 0.5907 | 0.1152 |
| DCN | 0.7344 | 0.7310 | 0.4042 | 0.6734 | 0.6029 | 0.1141 |
| AutoInt | 0.7439 | 0.7408 | 0.3948 | 0.6728 | 0.6021 | 0.1142 |
| GDCN | 0.7370 | 0.7344 | 0.3989 | 0.6726 | 0.6022 | 0.1142 |
| MaskNet | 0.7426 | 0.7383 | 0.3942 | 0.6753 | 0.6054 | 0.1140 |
| PEPNet | 0.7411 | 0.7380 | 0.5961 | 0.6741 | 0.6039 | 0.1148 |
| RankMixer | 0.7450 | 0.7412 | 0.3951 | 0.6774 | 0.6053 | 0.1140 |
| OneTrans | 0.7461 | 0.7432 | 0.3943 | 0.6771 | 0.6058 | 0.1141 |
| STORE | 0.7479 | 0.7451 | 0.3912 | 0.6804 | 0.6064 | 0.1139 |
| STORE-4 Epoch | 0.7488 | 0.7463 | 0.3900 | 0.6855 | 0.6086 | 0.1134 |
| Improv. | +0.362% | +0.417% | +0.913% | +1.195% | +0.462% | +0.526% |
Analysis:
- Superiority of STORE:
  - On both the Avazu and Industrial datasets, STORE consistently achieves the highest AUC and GAUC scores and the lowest LogLoss, indicating superior prediction accuracy.
  - For example, on the Industrial dataset, STORE achieves an AUC of 0.6804 and a GAUC of 0.6064, outperforming the best baselines (RankMixer, AUC 0.6774; OneTrans, GAUC 0.6058).
  - The STORE-4 Epoch variant further improves accuracy, particularly on the Industrial dataset, with an AUC of 0.6855 and GAUC of 0.6086. This shows that STORE can benefit from more training epochs, directly addressing the "One-Epoch" problem.
- Relative Improvements:
  - The Improv. row highlights the substantial gains. For the Industrial dataset, STORE-4 Epoch improves AUC by +1.195% (relative to RankMixer's 0.6774) and LogLoss by +0.526% (relative to MaskNet's/RankMixer's 0.1140).
  - On Avazu, STORE-4 Epoch also shows notable improvements, with a +0.362% AUC gain and a +0.913% LogLoss reduction over OneTrans. These are significant gains in CTR prediction, where even small percentage increases can translate to large business impact.
- Comparison with Attention-based Baselines: Models like AutoInt and OneTrans, which utilize attention mechanisms, generally perform better than older models like FM or DNN. However, STORE surpasses them. This suggests that STORE's approach to feature representation (Semantic Tokenization, Orthogonal Rotation) and its efficient attention mechanism overcome the limitations (like attention dispersion or $O(N^2)$ complexity) faced by these baselines.
- Addressing Bottlenecks: The strong performance of STORE (especially STORE-4 Epoch) in accuracy, coupled with the ability to benefit from more training epochs, strongly validates its claim of mitigating the representation bottleneck (reducing "One-Epoch" and "Interaction-Collapse"). The overall efficiency gains (discussed later) further support the alleviation of the computational bottleneck.
- Beyond Aggregation: The paper explicitly states that while models like RankMixer and OneTrans project or aggregate feature groups to mitigate feature heterogeneity, STORE's fundamental approach of SIDs and orthogonal rotation provides a deeper, more effective solution, leading to substantial accuracy improvements. This differentiation is clearly supported by the results.
6.1.2 Online A/B Test Results
The paper also conducted a 15-day online A/B test on a large-scale e-commerce platform.
- STORE achieved a relative CTR increase of 2.71% compared to the production baseline.
- In deployment, the OPMQ (Semantic Tokenizer) was configured with the SID tokenization described in Section 5.4 and a codebook size of 300.
- The sparsity of attention was set to roughly 50%, which means about half of the tokens were filtered, leading to increased inference efficiency and response speed while maintaining performance.

This online result is crucial as it demonstrates STORE's effectiveness and efficiency in a real-world, high-stakes production environment, translating offline gains into tangible business impact.
6.2. Ablation Studies / Parameter Analysis
6.2.1 Ablation Study (RQ2)
Table 2 presents an ablation study evaluating the impact of individual components and design choices within STORE on the Industrial dataset.
The following are the results from Table 2 of the original paper:
| Variants | AUC | GAUC | Logloss | TFlops/Batch |
|---|---|---|---|---|
| STORE-4 Epoch | 0.6855 | 0.6086 | 0.1134 | 1.764 |
| STORE | 0.6804 | 0.6064 | 0.1139 | 1.763 |
| w OPQ | 0.6787 | 0.6045 | 0.1140 | 1.763 |
| w RQ-VAE | 0.6768 | 0.6047 | 0.1141 | 1.762 |
| w/o Orthogonal Rotation | 0.6780 | 0.6050 | 0.1140 | 1.760 |
| w Vanilla-Attention | 0.6812 | 0.6068 | 0.1137 | 3.240 |
Analysis:
- STORE vs. STORE-4 Epoch: STORE-4 Epoch (0.6855 AUC) significantly outperforms the standard STORE (0.6804 AUC). This confirms that STORE effectively mitigates the "One-Epoch" phenomenon and can benefit from extended training, leading to better model capacity utilization.
- Semantic Tokenizer Impact (OPMQ vs. Alternatives): STORE uses OPMQ (Orthogonal, Parallel, Multi-expert Quantization). The w OPQ variant (a simpler OPQ-style quantizer) results in an AUC of 0.6787, lower than STORE's 0.6804. The w RQ-VAE variant (using Residual Quantization VAE [13] as the tokenizer) yields an even lower AUC of 0.6768.
  - Conclusion: This demonstrates that the specific design of OPMQ, with its orthogonal and multi-expert approach, is crucial for STORE's performance, outperforming other quantization methods in capturing semantic tokens.
- Orthogonal Rotation Transformation Impact: w/o Orthogonal Rotation (i.e., removing this component) leads to an AUC of 0.6780, notably lower than STORE's 0.6804.
  - Conclusion: This confirms the effectiveness of the Orthogonal Rotation Transformation in facilitating more efficient and effective feature interactions for low-cardinality static features, contributing positively to overall accuracy.
- Efficient Attention Impact:
  - Comparing STORE (AUC 0.6804, TFlops/Batch 1.763) with w Vanilla-Attention (AUC 0.6812, TFlops/Batch 3.240), Efficient Attention achieves comparable prediction accuracy while drastically improving training efficiency.
  - While w Vanilla-Attention shows a slightly higher AUC (0.6812 vs. 0.6804), its TFlops/Batch nearly doubles (3.240 vs. 1.763).
  - Conclusion: This highlights the trade-off inherent in efficient mechanisms: Efficient Attention preserves model accuracy (or incurs only a negligible drop) while providing significant computational gains (almost 2× throughput, as 3.240 / 1.763 ≈ 1.84). This directly addresses the computational bottleneck without sacrificing predictive power.

The ablation study clearly validates that each of STORE's proposed components (Semantic Tokenizer with OPMQ, Orthogonal Rotation Transformation, and Efficient Attention) contributes meaningfully to its overall superior performance and efficiency.
6.2.2 Scaling Laws Study with Different Hyperparameters (RQ3)
Figure 2 illustrates how STORE's performance and efficiency scale with different hyperparameters, addressing RQ3.
This figure shows how different parameter settings affect AUC. It contains four subplots, showing how AUC varies with the number of training epochs, the number of SIDs, the number of layers, and the sparsity level. In particular, the relationship between AUC, computational cost (TFLOPs/Batch), and Sparsity is also illustrated.
Figure 2: Scaling Laws Study of (a) Epoch Number, (b) SID Number, (c) Layer Number, and (d) Sparsity.
Analysis of Figure 2:
-
a) Epoch Number:
- The graph shows that
STORE(represented by the green line) continues to improve inAUCas theEpoch Numberincreases, especially beyond the initial epochs. - In contrast, models using raw
ItemIDs(blue line) show limited gains or even decreased performance (phenomenon ofOne-Epochoroverfitting) after a few epochs. - Conclusion: This plot strongly supports the claim that
Semantic Tokenizationeffectivelycombats the One-Epoch phenomenon, allowingSTOREto benefit from longer training and achieve higherAUCwith more epochs, thus enabling better model scaling over training time.
- The graph shows that
-
b) SID Number:
- The
AUCgenerally increases as theSID Number(K, the number of Semantic IDs) increases. - Conclusion: This suggests that a richer set of
semantic tokensallows the model to capture more nuanced information fromhigh-cardinality features, leading to better representations and improved prediction accuracy. There might be a point of diminishing returns, but within the tested range, moreSIDstranslate to better effects.
- The
-
c) Layer Number:
- The
AUCgenerally improves as theLayer Number(depth of the model) increases. - Conclusion: This indicates that
STOREeffectively handles deeper architectures without suffering fromrepresentation collapseordiminishing returnsoften seen in traditionalranking models. This demonstrates thatSTOREmitigates therepresentation bottlenecksufficiently to benefit from increased model capacity, enabling scaling in terms of model depth.
- The
- d) Sparsity:
  - This plot shows the relationship between attention sparsity, training efficiency (TFLOPs/Batch), and model accuracy (AUC).
  - As sparsity increases (more tokens are filtered, reducing computation), TFLOPs/Batch decreases significantly, indicating improved training efficiency.
  - Crucially, the AUC remains relatively stable even at high sparsity levels, with only a minimal impact on performance.
  - Conclusion: This plot directly demonstrates STORE's ability to reduce computational cost with minimal impact on performance. It validates the effectiveness of the Efficient Attention mechanism in addressing the computational bottleneck by filtering low-contributing tokens without significantly sacrificing accuracy. The sparsity level used in online deployment is consistent with this finding.

Overall, the scaling laws study provides compelling evidence that STORE is designed to be scalable along multiple dimensions (training epochs, semantic token complexity, model depth, and computational efficiency), fulfilling its core mission of scaling up ranking models.
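To make the SID Number axis more concrete, below is a hedged sketch of one way a pre-trained item embedding could be mapped to K semantic IDs via successive codebooks (residual-quantization style). It is an illustrative approximation, not the paper's OPMQ network; `to_semantic_ids`, the codebook size of 256, and K = 3 are assumptions for this example.

```python
# Illustrative sketch: assign each residual of a pre-trained item embedding
# to its nearest codeword in K successive codebooks, yielding K semantic IDs.
import torch

def to_semantic_ids(item_emb: torch.Tensor, codebooks: list) -> list:
    """item_emb: [dim]; each codebook: [codebook_size, dim]."""
    residual = item_emb.clone()
    sids = []
    for codebook in codebooks:
        dists = ((residual.unsqueeze(0) - codebook) ** 2).sum(dim=-1)
        idx = int(dists.argmin())
        sids.append(idx)                      # one semantic ID per codebook
        residual = residual - codebook[idx]   # quantize the remaining residual
    return sids

# Usage: K = 3 semantic IDs, codebook size 256, 64-dim pre-trained embedding.
torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(3)]
item_emb = torch.randn(64)
print(to_semantic_ids(item_emb, codebooks))   # three codeword indices
```

Increasing K (the SID Number in Figure 2b) gives the model a richer, more fine-grained token vocabulary per item, which is consistent with the observed AUC gains.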
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces STORE, a novel and unified token-based framework designed to enhance the scalability and efficiency of ranking models in modern personalized recommendation systems. STORE effectively addresses two critical bottlenecks: the representation bottleneck (caused by high-cardinality, sparse features leading to low-rank embeddings, "One-Epoch," and "Interaction-Collapse") and the computational bottleneck (stemming from an explosion of feature tokens making traditional attention mechanisms prohibitively expensive and prone to attention dispersion).
The framework's success is attributed to three core innovations:
- Semantic Tokenization: Decomposes high-cardinality sparse features into a compact set of stable semantic tokens using an Orthogonal, Parallel, Multi-expert Quantization network (OPMQ). This fundamentally tackles feature heterogeneity and sparsity.
- Orthogonal Rotation Transformation: Rotates the subspace of low-cardinality static features with diversity regularization to facilitate more efficient and effective feature interactions (a simplified sketch follows this summary).
- Efficient Attention: Incorporates a sparse attention mechanism (specifically MoBA) that filters low-contributing tokens, significantly improving computational efficiency and alleviating attention dispersion while preserving accuracy.

Extensive offline experiments on Avazu and a large-scale Industrial dataset, along with online A/B tests, confirm STORE's superiority. It consistently achieved higher prediction accuracy (e.g., +2.71% online CTR, +1.195% AUC offline) and significantly boosted training efficiency (1.84× throughput). The ablation studies and scaling law analyses further validated the individual contributions of each component and STORE's ability to scale with more epochs, SIDs, and layers, while maintaining efficiency through attention sparsity.
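As a rough illustration of the second component, the sketch below rotates low-cardinality static feature embeddings with a learned orthogonal matrix and adds a simple diversity penalty. The module name `OrthogonalRotation` and the cosine-similarity regularizer are assumptions for this example, not the paper's exact formulation.

```python
# Illustrative sketch: a learned orthogonal (rotation) transform on static
# feature embeddings, plus a diversity penalty that discourages rotated
# feature vectors from collapsing onto the same direction.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal

class OrthogonalRotation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # The parametrization keeps the weight orthogonal during training,
        # so the transform is a pure rotation/reflection of the subspace.
        self.rotate = orthogonal(nn.Linear(dim, dim, bias=False))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, num_static_features, dim]
        return self.rotate(feats)

def diversity_regularizer(feats: torch.Tensor) -> torch.Tensor:
    # Penalize pairwise cosine similarity between rotated feature vectors
    # (one plausible reading of "diversity regularization").
    normed = F.normalize(feats, dim=-1)
    sim = normed @ normed.transpose(-1, -2)                         # [b, n, n]
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    return off_diag.pow(2).mean()

# Usage: 16 static features with 64-dim embeddings.
x = torch.randn(8, 16, 64)
rot = OrthogonalRotation(64)
y = rot(x)
print(y.shape, diversity_regularizer(y).item())
```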
7.2. Limitations & Future Work
The paper states that STORE resolves both representation and computational bottlenecks and offers a practical, effective path towards building more powerful large-scale ranking models. However, it does not explicitly detail specific limitations of STORE or suggest concrete future research directions.
Nevertheless, some implicit areas for potential future work or considerations could be inferred:
- Optimal Hyperparameter Search: While the paper explores SID number, layer number, and sparsity, finding the optimal configuration for diverse datasets and tasks might require advanced AutoML or neural architecture search techniques.
- Generalizability of OPMQ: The OPMQ relies on pre-trained embeddings (e.g., from SASRec). Future work could explore the impact of different pre-training methods or end-to-end learning of the initial embeddings within STORE.
- Theoretical Guarantees: While the empirical results are strong, more theoretical analysis of the guarantees of orthogonal rotation for diversity, and of bounds on efficient attention's performance under various sparsity levels, would be beneficial.
- Dynamic Sparsity: The attention sparsity is set to a fixed value in deployment. More dynamic or adaptive sparsity mechanisms that adjust based on specific query-item pairs or context might further enhance efficiency or accuracy.
- Beyond CTR Prediction: While STORE is evaluated for CTR prediction, its token-based architecture could potentially be extended to other ranking objectives (e.g., conversion rate, long-term user satisfaction) or multimodal recommendation scenarios.
7.3. Personal Insights & Critique
STORE presents a compelling and well-engineered solution to long-standing problems in ranking models. The explicit framing around "bottlenecks" and the aspiration to achieve Scaling Laws akin to LLMs is a particularly insightful starting point for research in this domain.
Key Strengths:
- Unified and Holistic Approach: STORE's strength lies in its holistic treatment of both representation and computational bottlenecks. Many models address one but not the other in a truly integrated fashion. The synergistic interaction of Semantic Tokenization, Orthogonal Rotation, and Efficient Attention is a notable architectural contribution.
- Practical Relevance: The results on the Industrial dataset and the online A/B test underscore the practical applicability and significant business value of STORE. A 2.71% relative CTR increase in production is a substantial gain.
- Addressing Fundamental Problems: The explicit focus on "One-Epoch" and "Interaction-Collapse" is commendable, as these are critical empirical observations that limit the true scalability of deep ranking models. STORE provides a principled way to overcome them.
- Leveraging Existing Innovations: The use of pre-trained SASRec embeddings and MoBA for efficient attention demonstrates a smart approach of building on state-of-the-art components rather than reinventing every wheel.
Potential Issues/Areas for Improvement:
- Complexity of Implementation: The framework, while effective, is relatively complex, involving specialized quantization networks, orthogonal transformations with regularization, and a specific efficient attention mechanism. This might pose challenges for adoption in environments without significant engineering resources.
- Dependency on Pre-trained Embeddings: The Semantic Tokenizer relies on the quality of pre-trained item embeddings. If these embeddings are suboptimal or come from a different domain, STORE's performance could suffer. Pre-training SASRec itself can also be computationally intensive.
- Interpretability: While not explicitly discussed, the introduction of multiple semantic tokens and rotated feature blocks, processed through efficient attention, might reduce the interpretability of the model's decisions compared to simpler factorization models or even some cross-network models. This is a common trade-off in complex deep learning models, but it is particularly relevant in recommendation, where explainability is often desired.
- Hyperparameter Sensitivity: Given the multiple components and regularization terms, STORE may have a relatively large number of hyperparameters requiring careful tuning. The paper provides specific values (e.g., for the codebook size and sparsity), but these might vary significantly across datasets or domains.
Transferability and Applicability:
The methods introduced in STORE are highly transferable to other domains facing similar challenges with high-cardinality, sparse, and heterogeneous features, particularly in large-scale machine learning systems beyond recommender systems. For instance:
- Ad Targeting: Similar to CTR prediction, ad targeting systems deal with vast numbers of user, ad, and context features.
- Search Ranking: Ranking search results also involves numerous diverse features where scalability and efficiency are paramount.
- Fraud Detection: Detecting fraud often involves sparse categorical features (e.g., IP addresses, account IDs) and requires efficient processing of many signals.
- Any large-scale tabular data problem: The semantic tokenization and orthogonal rotation components could be adapted to create more robust representations for tabular data with high-cardinality categorical features before feeding them into other deep learning models.

In conclusion, STORE is an impactful work that provides a robust and scalable solution for ranking models in industry. It highlights the importance of addressing foundational representation and computational issues, rather than just incrementally improving feature interaction techniques, paving the way for truly scalable deep learning in recommendation.