LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders
TL;DR Summary
LONGER is a Transformer model designed for industrial recommender systems that captures ultra-long user behavior sequences. Key innovations include a global token mechanism, a token merge module to reduce complexity, and engineering optimizations for GPU-efficient training and serving, which together deliver strong offline and online performance.
Abstract
Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed-precision and activation recomputation, KV cache serving, and the fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing in both advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed at more than 10 influential scenarios at ByteDance, serving billions of users.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is scaling up long sequence modeling, specifically using a Transformer-based approach, for industrial recommender systems. The title LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders reflects this focus.
1.2. Authors
The paper lists numerous authors, all affiliated with ByteDance:
- Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, Xionghang Xie, Shiru Ren, Xiang Sun, Yaocheng Tan, Peng Xu, Yuchao Zheng, and Di Wu. The authors appear to be researchers and engineers from ByteDance, a major technology company, indicating a strong industry-focused research background. The presence of many authors suggests a large collaborative effort, typical for industrial research aiming for large-scale deployment.
1.3. Journal/Conference
The paper is listed as being published at The Nineteenth ACM Conference on Recommender Systems (RecSys '25), September 22-26, 2025, Prague, Czech Republic. RecSys is a top-tier conference in the field of recommender systems, known for publishing high-quality research, both academic and industrial, with significant impact.
1.4. Publication Year
The paper was first posted on arXiv on May 7, 2025 (timestamp 2025-05-07T13:54:26.000Z).
1.5. Abstract
The abstract highlights that modeling ultra-long user behavior sequences is crucial for capturing both long-term and short-term preferences in industrial recommender systems. It points out that existing solutions often suffer from upstream-downstream inconsistency and computational inefficiency due to reliance on two-stage retrieval or indirect modeling.
The paper introduces LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER's core contributions include:
- A global token mechanism to stabilize attention over long contexts.
- A token merge module with lightweight InnerTransformers and a hybrid attention strategy to reduce quadratic complexity.
- A suite of engineering optimizations, such as mixed-precision training, activation recomputation, KV cache serving, and a fully synchronous GPU-based training/serving framework.

LONGER demonstrates superior performance over strong baselines in both offline metrics and online A/B tests across advertising and e-commerce services at ByteDance. Its consistent effectiveness and industrial-level scaling laws are validated, leading to full deployment in over 10 influential scenarios at ByteDance, serving billions of users.
1.6. Original Source Link
The official source link is: https://arxiv.org/abs/2505.04421
The PDF link is: https://arxiv.org/pdf/2505.04421v2.pdf
The paper is currently available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the effective and efficient modeling of ultra-long user behavior sequences in industrial recommender systems. User behavior sequences, which can contain hundreds or thousands of interactions, are critical for understanding both a user's evolving short-term interests and their stable long-term preferences.
This problem is important because accurately capturing these preferences leads to better recommendation accuracy, diversity, and can help mitigate the information cocoon phenomenon (where users are only exposed to information that confirms their existing beliefs).
However, significant challenges exist:
- Computational Constraints: Traditional Transformer models, which are powerful for sequence modeling, suffer from quadratic complexity $O(L^2)$ (where $L$ is the sequence length) in their self-attention mechanism. This makes them computationally prohibitive for ultra-long sequences (lengths on the order of thousands).
- Existing Solutions' Limitations: Current industrial practices often resort to compromises:
  - Two-stage retrieval: Selecting a subset of relevant items from the long sequence, which sacrifices full information.
  - Pre-trained User Embeddings: Condensing the long sequence into a single embedding, losing fine-grained details.
  - Memory-augmented Models: Relying on external memory, which can be complex and require extensive training.
  These approaches introduce upstream-downstream inconsistency (the model trained on a subset of data might not align with the full data distribution) and indirect perception of the original sequence, inherently limiting their ability to fully leverage all available user behavior.
The paper's entry point is inspired by the success of scaling laws in large language models (like GPT) and rapid advancements in computing infrastructure (e.g., GPU capabilities). This allows for pioneering an end-to-end ultra-long sequence modeling paradigm directly within industrial-scale recommendation systems, moving beyond the limitations of indirect methods.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- LONGER Framework: Proposing LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders, an industrial GPU-efficient Transformer structure specifically designed to scale user sequence modeling length to 10,000 in an end-to-end manner.
- Architectural Innovations for Efficiency:
  - Global Token Mechanism: Introducing global tokens to stabilize attention over long contexts and serve as centralized information anchors.
  - Token Merge Module with InnerTrans: Implementing a token merge strategy, which groups adjacent tokens to reduce sequence length and thus quadratic complexity, along with lightweight InnerTrans (InnerTransformers) to preserve intra-group interactions and fine-grained details. This module reduces FLOPs by approximately 50% with almost no performance loss.
  - Hybrid Attention Strategy: Employing a hybrid attention design combining cross-causal attention and self-causal attention to efficiently capture both global context and local dependencies.
- Industrial Engineering Optimizations: Devising a series of system-level optimizations for large-scale deployment:
  - Fully Synchronous Training and Serving Framework: A unified GPU-based framework for dense and sparse parameter updates, optimizing throughput and memory efficiency.
  - Mixed Precision Training and Activation Recomputation: Techniques to alleviate GPU memory pressure and improve computational speed by trading computation for memory savings and using lower precision (BF16/FP16).
  - KV Cache Serving: A mechanism to improve inference efficiency by precomputing and caching user sequence representations, avoiding redundant computations for multiple candidate items.
The key findings demonstrate LONGER's effectiveness:
- Superior Performance: LONGER consistently outperforms strong baselines in offline metrics (AUC, LogLoss) on billion-scale industrial datasets. For instance, it achieved a relative AUC improvement of 1.57% and a LogLoss decrease of 3.39% compared to a base model, and a 0.21% AUC improvement over the most competitive Transformer baseline.
- Online Validation: Successful deployment and significant gains in online A/B tests across influential business scenarios at ByteDance, including Douyin Ads (e.g., +2.097% ADSS, +2.151% ADVV in the Short Video format) and Douyin E-Commerce (+7.9222% Order/U, +6.5404% GMV/U in Live Streaming).
- Scalability and Efficiency: The architectural and engineering optimizations enable LONGER to handle sequence lengths up to 10,000 tokens efficiently, making it suitable for real-world industrial deployment with billions of users.
- Scaling Laws Validation: The paper validates scaling laws for performance with respect to sequence length, parameters, and FLOPs, showing consistent improvements with increased scale.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand LONGER, a grasp of several fundamental concepts in machine learning, particularly in deep learning and recommender systems, is essential:
- Recommender Systems: Systems that predict user preferences for items (e.g., products, movies, news) and suggest relevant ones. They are crucial for platforms like e-commerce, social media, and content streaming.
- User Behavior Sequences: A chronological record of a user's interactions with items (e.g., clicks, views, purchases). These sequences are vital for understanding user preferences and predicting future actions.
Ultra-long user behavior sequences refer to sequences that can contain thousands of interactions, spanning a long period.
- Transformers: A neural network architecture introduced in 2017, initially for natural language processing (NLP), which has since become foundational in many sequence-to-sequence tasks. Unlike recurrent neural networks (RNNs), Transformers process sequences in parallel, making them highly efficient, especially on modern hardware like GPUs.
- Attention Mechanism: The core component of a Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing a specific element. Instead of processing a sequence step-by-step, attention can directly establish relationships between any two positions in the sequence.
  - Self-Attention: A mechanism within a Transformer layer where the model computes attention over the input sequence itself. This means each element in the sequence attends to all other elements (including itself) to compute its new representation. It helps capture long-range dependencies within a single sequence. The computational complexity of self-attention is $O(L^2 d)$, where $L$ is the sequence length and $d$ is the embedding dimension. This quadratic dependency on $L$ is the main challenge for ultra-long sequences.
  - Cross-Attention: A mechanism where elements in one sequence (the query sequence) attend to elements in a different sequence (the key and value sequence). This is commonly used in encoder-decoder architectures where the decoder attends to the encoder's output. In LONGER, it is used to allow global tokens and sampled sequence tokens (queries) to attend to the full input sequence (keys and values).
- Positional Encoding: Since Transformers process sequences in parallel, they lose information about the order of elements. Positional encoding is added to the input embeddings to inject information about the relative or absolute position of each element in the sequence.
- Multi-Layer Perceptron (MLP): A type of feedforward artificial neural network consisting of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (except for the input nodes) is a neuron that uses a nonlinear activation function. MLPs are used for various tasks, including feature transformation and classification.
- Computational Complexity (FLOPs): Floating point operations (FLOPs) are a common measure of computational cost. A higher FLOPs count generally means more computation and thus slower processing or higher resource consumption. Quadratic complexity $O(L^2)$ means that if the sequence length doubles, the computation increases by a factor of four.
- Mixed Precision Training: A technique that combines different numerical precisions (e.g., FP32 for full precision, FP16 or BF16 for half precision) during model training. It can significantly reduce memory usage and speed up computations, especially on GPUs with specialized hardware for lower-precision arithmetic, while maintaining model accuracy.
- Activation Recomputation: A memory optimization technique used during neural network training, particularly for deep models. Instead of storing all intermediate activations from the forward pass (which are needed for the backward pass to compute gradients), some activations are discarded and recomputed during the backward pass. This trades computation for memory savings.
- KV Cache (Key-Value Cache): A technique used in Transformer inference to speed up processing, especially when generating sequences or scoring multiple candidates against a fixed input. It involves caching the Key (K) and Value (V) representations of previously processed tokens (e.g., the user behavior sequence) so they don't need to be recomputed for subsequent steps or candidate items.
- GPU Clusters: A group of Graphics Processing Units (GPUs) working together, often across multiple machines, to accelerate large-scale computations, such as training very large deep learning models.
- Sparse Parameters / Embedding Tables: In recommendation systems, categorical features (like user IDs, item IDs) are often represented by high-dimensional, sparse embedding vectors. These embeddings are stored in large embedding tables. Sparse parameters refers to these embeddings, which are accessed and updated sparsely (only a few are active for any given input).
- Dense Parameters: Refers to the weights and biases in the dense layers (e.g., MLPs, linear layers) of a neural network, which are typically updated more frequently and are smaller in number compared to sparse parameters.
3.2. Previous Works
The paper discusses previous approaches to sequential modeling and long-sequence modeling, categorizing them and highlighting their limitations.
Traditional Short-Sequence Modeling
Early and widely adopted models for user sequence modeling typically focus on short sequences (on the order of tens of items).
- DIN (Deep Interest Network) [30]: A pioneering model that introduced an attention mechanism to capture diverse user interests by adaptively calculating the relevance of historical behaviors to a candidate item.
- DIEN (Deep Interest Evolution Network) [29]: Extends DIN by modeling the evolution of user interests over time, often using Gated Recurrent Units (GRUs) combined with an attention mechanism.
- CAN (Co-Action Network) [28]: Focuses on modeling feature co-action, which refers to how different features interact with each other to influence user behavior.
- Other approaches include multi-domain [2, 4], multi-interest [1, 11], and sequence denoising methods [5, 20].

Limitation: These sophisticated architectures were primarily designed for and confined to short-sequence scenarios, making them unsuitable for ultra-long sequences.
Long-Sequence Modeling
Existing methods for handling longer sequences generally fall into these categories:
- Two-stage retrieval:
  - Principle: Instead of processing the entire ultra-long sequence, a retrieval stage first selects a small subset (e.g., the top-k most relevant items) from the original sequence with respect to the current candidate item. This "shortened" sequence is then passed to a downstream model for end-to-end modeling.
  - Examples: SIM (Search-based user Interest Modeling) [18] and TWIN (TWo-stage Interest Network) [3, 21].
  - Limitation: This approach inevitably sacrifices raw full-sequence information and can lead to upstream-downstream inconsistency because the retrieval stage operates separately from the final ranking model.
- Pre-trained User Embeddings:
  - Principle: The entire ultra-long user sequence is pre-trained in a source model to derive a condensed user embedding (UE). This single, fixed-size embedding then serves as input to downstream recommendation models.
  - Examples: Works like [9, 13, 31] explore this paradigm.
  - Limitation: While this leverages high-performance GPUs for pre-training, condensing a long sequence into a single embedding can lead to information loss and indirect perception of the original sequence, making it difficult to capture fine-grained, dynamic preferences.
- Principle: The entire ultra-long user sequence is pre-trained in a source model to derive a
- Memory-augmented Models:
- Principle: These models use external memory components to store and retrieve user interest representations or intermediate computation results for long sequences.
  - Examples: MIMN (Multi-channel user Interest Memory Network) [17] uses a neural Turing machine and a memory induction unit. LMN (Large Memory Network) [14] proposes a lightweight structure with product-quantization-based decomposition. MARM (Memory Augmented Recommendation Model) [15] caches intermediate results from computationally intensive modules.
  - Limitation: Memory-augmented models generally require long-term training periods to effectively accumulate hit rates within their memory slots, and their design can be complex.
- Direct Long Sequence Modeling: More recent efforts aim to directly model long sequences using Transformer variants.
  - HSTU (Hierarchical Sequential Transduction Units) [25]: Uses a stack of identical self-attention layers with residual connections to model long sequences, showing better performance than vanilla Transformers.
  - Wukong [26]: Develops a stacked factorization machine and linear compression block-based architecture, validating scaling laws in recommendation.
  - Other works like [27, 32] also explore scaling Transformer models for sequential recommendation.
  - Limitation: The paper argues that a GPU-efficient long-sequence modeling solution remains underexplored for large-scale industrial recommender systems.
- HSTU (Hierarchical Self-attention for Transformer Units) [25]: Uses a stack of identical
Core Formula: The Attention Mechanism
Since LONGER is a Transformer-based model, understanding the Attention mechanism is crucial. The fundamental Scaled Dot-Product Attention is defined as:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
- $Q$: Query matrix. This represents what you are looking for in the sequence. Each row corresponds to a query vector.
- $K$: Key matrix. This represents what is available to be matched against the query. Each row corresponds to a key vector.
- $V$: Value matrix. This contains the actual information associated with each key. Each row corresponds to a value vector.
- $K^T$: Transpose of the key matrix.
- $d_k$: Dimension of the key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
- $\mathrm{softmax}$: A function that converts a vector of arbitrary real values into a vector of probabilities, where the values sum to 1. It highlights the most relevant keys.

The product $QK^T$ computes attention scores indicating the similarity between each query and each key. The softmax turns these scores into weights that are applied to the values, effectively weighing how much each value contributes to the output for each query.

This formula describes how self-attention (where $Q$, $K$, $V$ are derived from the same input sequence) and cross-attention (where $Q$ comes from one sequence, and $K$, $V$ from another) fundamentally operate.
3.3. Technological Evolution
The evolution of sequence modeling in recommenders has progressed from simple statistical methods to sophisticated deep learning architectures.
- Early Methods (e.g., Matrix Factorization, Collaborative Filtering): Focused on static preferences and item similarity, largely ignoring sequential patterns.
- RNN/GRU-based Sequential Models: Introduced the ability to model sequences by capturing temporal dependencies, but struggled with very long sequences due to vanishing/exploding gradients and computational bottlenecks.
- Attention-based Models (e.g., DIN): Revolutionized sequential recommendation by allowing models to dynamically weigh the importance of past interactions, leading to better capture of diverse interests. However, these were often still applied to relatively short, active sequences.
- Transformer-based Models: Ushered in a new era with parallel processing and direct modeling of long-range dependencies, overcoming some limitations of RNNs. Initially applied to NLP, their success quickly spread to recommendation.
- Long-Sequence Transformers (Current Frontier): The challenge with Transformers in recommendation is the quadratic complexity for ultra-long sequences. Current research, including LONGER, focuses on optimizing these models for industrial scale, addressing efficiency, memory usage, and deployment challenges. This involves architectural innovations (like sparse attention and token merging) and system-level optimizations (like mixed precision and KV caching).

LONGER fits into this timeline by pushing the frontier of Transformer-based models for ultra-long sequences, explicitly addressing GPU efficiency and industrial deployment, and moving toward a truly end-to-end modeling paradigm for sequences of unprecedented length in real-world systems.
3.4. Differentiation Analysis
Compared to the main methods in related work, LONGER differentiates itself primarily through its end-to-end, GPU-efficient, directly optimized Transformer architecture for ultra-long user behavior sequences, specifically designed for industrial scale.
- Vs. Two-stage retrieval (e.g., SIM, TWIN): LONGER avoids the upstream-downstream inconsistency and information loss inherent in selecting a subset of interactions. It directly processes the full ultra-long sequence, preserving all available information.
- Vs. Pre-trained User Embeddings: LONGER doesn't condense the entire sequence into a single, static embedding. Instead, it maintains a dynamic, detailed representation through its optimized Transformer layers, allowing for a richer and more context-aware understanding of user preferences. This avoids indirect perception of the original sequence.
- Vs. Memory-augmented Models (e.g., MIMN, LMN, MARM): While memory networks address long-term dependencies, LONGER integrates the "memory" function directly into the Transformer's attention mechanism through global tokens and efficient processing of the full sequence, potentially reducing the training complexity and explicit memory management required by external memory modules.
- Vs. Vanilla Transformers and initial direct long-sequence models (e.g., HSTU, Wukong):
  - LONGER specifically targets GPU efficiency in an industrial context, which is crucial for deployment at ByteDance's scale. It tackles the quadratic complexity head-on with novel architectural designs like token merge and hybrid attention, explicitly reducing FLOPs by ~50% with minimal performance loss.
  - It also incorporates a comprehensive suite of engineering optimizations (synchronous training/serving, mixed precision, recomputation, KV cache) that are critical for billion-user-scale deployment, aspects often less emphasized in purely academic proposals.
  - The global token mechanism is a specific innovation aimed at stabilizing attention over very long contexts, addressing a known issue (the attention sink) in long Transformers.

In essence, LONGER combines architectural ingenuity with robust system-level engineering to deliver a truly scalable and efficient end-to-end ultra-long sequence modeling solution, filling a critical gap in industrial recommender systems.
4. Methodology
4.1. Principles
The core principle behind LONGER is to enable end-to-end ultra-long sequence modeling in industrial recommender systems by strategically balancing computational efficiency and representational fidelity. It aims to overcome the quadratic complexity $O(L^2)$ of vanilla Transformers, which makes them prohibitive for sequence lengths in the thousands, while avoiding the information loss and upstream-downstream inconsistency of existing two-stage or indirect modeling approaches. This is achieved through a combination of novel architectural designs and comprehensive engineering optimizations, all tailored for GPU-efficient operation at an industrial scale.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1 Problem Statement
The recommendation task is defined as predicting the probability of a user interacting (e.g., clicking or converting) with a target item $v$, given the user's raw behavior sequence $S_u$, user basic features $u_d$, and cross features.

Let $\mathcal{U}$ be the set of users and $\mathcal{V}$ be the set of items. For a user $u \in \mathcal{U}$, the raw behavior sequence is $S_u = [v_1, v_2, \ldots, v_T]$, where $v_t \in \mathcal{V}$ is the item interacted with at time step $t$. The goal is to predict the probability:
$ P(y = 1 \mid S_u, u_d, v) \in [0, 1] $
Where:
- $y \in \{0, 1\}$ indicates whether user $u$ will interact with item $v$.
- $S_u$: The user's historical behavior sequence.
- $u_d$: User demographic and other basic features.
- $v$: The candidate target item.

The model learns this mapping by optimizing the binary cross-entropy loss ($\mathcal{L}$) over historical interaction data $\mathcal{D}$:
$ \mathcal{L} = - \frac{1}{|\mathcal{D}|} \sum_{(S_u, u_d, v, y) \in \mathcal{D}} [y \log \hat{y} + (1 - y) \log (1 - \hat{y})] $
Where:
- $|\mathcal{D}|$: The total number of samples in the dataset.
- $y$: The true interaction label (0 or 1).
- $\hat{y}$: The predicted probability produced by the recommendation model $f_\theta$, parameterized by $\theta$.
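As a quick illustration of this objective, the following minimal PyTorch sketch computes the binary cross-entropy over a toy batch of labels and predicted probabilities; the tensors are invented for the example and are unrelated to the paper's data.

```python
import torch

def bce_loss(y_true, y_pred, eps=1e-7):
    # Clamp predictions away from 0/1 to avoid log(0)
    y_pred = y_pred.clamp(eps, 1 - eps)
    return -(y_true * y_pred.log() + (1 - y_true) * (1 - y_pred).log()).mean()

y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_pred = torch.tensor([0.9, 0.2, 0.6, 0.1])   # model outputs P(y = 1 | S_u, u_d, v)
print(bce_loss(y_true, y_pred))
```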
4.2.2 Overall Framework
The LONGER framework is designed to handle long and complex user behavior sequences efficiently. As illustrated in Figure 1, the model integrates several key components:
(Figure description: a schematic of the LONGER model architecture, showing key components such as the global token mechanism, the self-attention and cross-attention modules, the shared embedding layer for long user sequences, and the lightweight InnerTransformers, designed to process diverse features efficiently and accurately.)
Figure 1: LONGER Model Architecture.
- Input Generation: The process starts by generating raw input tokens, which include global tokens and sequence tokens.
- Token Merge: The sequence tokens undergo a Token Merge operation, potentially incorporating InnerTrans for local interactions, to compress the sequence and reduce computational complexity.
- LONGER Model Structure: The core model architecture processes these tokens using a hybrid attention mechanism, combining cross-causal attention in the first layer and self-causal attention in subsequent layers.
- Prediction: The refined representations are then used for the downstream prediction task (e.g., CVR prediction).
- Training and Serving Optimizations: Underlying this architecture are several system-level optimizations, including a fully synchronous training and serving framework, mixed precision training, activation recomputation, and KV cache serving, all designed for GPU-efficient operation at scale.
4.2.3 Global Tokens
LONGER introduces Global Tokens as auxiliary representations appended to the input sequence. These tokens serve as aggregated anchor representations, meaning they are designed to consolidate and represent global contextual information. Examples of information that can be encoded in global tokens include:
- Target item representation: The embedding of the item for which the recommendation probability is being predicted.
- Learnable CLS tokens: Special tokens, similar to the [CLS] token in BERT, that can learn to encapsulate overall sequence information.
- UID embeddings: The embedding representing the user's identity.
- High-order compressed user-item interaction features: Aggregations of more complex interaction signals.
Purpose:
- Centralized Information Anchors: Global tokens facilitate enhanced feature interactions by providing a central point where user history, contextual attributes, and candidate item features can interact.
- Stabilize Attention Dynamics: They help stabilize attention in long sequences, especially with sparse attention configurations. Similar to StreamLLM [23], a small number of global tokens can alleviate the "attention sink" effect, where deeper attention layers disproportionately focus on early tokens. These tokens act as anchor points, maintaining attention diversity and preserving long-range dependency modeling.

Global tokens are designed to have a full attention receptive field, meaning they can attend to every other token in the sequence and be attended to by every other token, enabling global information fusion.
4.2.4 Token Merge
The Token Merge strategy addresses the quadratic complexity of vanilla Transformers, which is a major bottleneck for processing long sequences. For a vanilla Transformer encoder layer, the FLOPs and parameter count can be expressed as [16]:
$ \mathrm{FLOPs}_{\mathrm{vanilla\,trans}} = 24Ld^2 + 4L^2d, \qquad \mathrm{Params}_{\mathrm{vanilla\,trans}} = 12d^2 + 13d $
Where:
- $L$: The sequence length.
- $d$: The embedding dimension.

The Token Merge strategy groups adjacent tokens and compresses them into shorter sequences, effectively reducing the sequence length by a factor of $K$. This performs spatial compression.
Computational Complexity: The ratio of attention complexity before and after token merge is:
$ \frac{\mathrm{FLOPs}_{\mathrm{Merge\,Token}}}{\mathrm{FLOPs}_{\mathrm{vanilla}}} = \frac{24Ld^2K + \frac{4L^2d}{K}}{24Ld^2 + 4L^2d} = \frac{6dK + \frac{L}{K}}{6d + L} $
Where:
- $K$: The merge factor, indicating how many tokens are grouped into one. The effective sequence length becomes $L/K$. For typical values of $L$, $d$, and $K$ (where $L \gg d$), this strategy significantly reduces FLOPs.
Parameter Expansion: Token merging also impacts the number of parameters. While it reduces computational complexity, it can simultaneously increase the number of parameters in certain parts (e.g., if InnerTrans blocks are used). The paper states:
$ \Theta_{\mathrm{merge}} = 12K^2d^2 + 13Kd $
This suggests a trade-off where efficiency gains from reduced sequence length are balanced against potential parameter expansion, ultimately aiming for improved model expressiveness and overall performance.
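The small Python helper below evaluates the two FLOPs expressions above so the reduction can be checked numerically; the values of L, d, and K are example settings, not the paper's exact configuration.

```python
def vanilla_flops(L, d):
    # FLOPs of one vanilla Transformer encoder layer
    return 24 * L * d**2 + 4 * L**2 * d

def merged_flops(L, d, K):
    # After merging, the effective length is L/K and the effective width is K*d
    return 24 * L * d**2 * K + 4 * L**2 * d / K

L, d, K = 2000, 32, 4                       # example settings only
ratio = merged_flops(L, d, K) / vanilla_flops(L, d)
print(f"FLOPs after merge / before merge = {ratio:.2f}")   # < 1 when L >> d
```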
InnerTrans
To ensure that merging adjacent tokens doesn't lead to a loss of fine-grained details or insufficient interaction within a group, LONGER introduces InnerTrans.
- Principle: InnerTrans applies a lightweight Transformer block within each token group before merging. This allows for local interactions and preserves semantics among the grouped tokens.
- Mechanism: For the $i$-th group of item embeddings $[\mathbf{e}_i^1, \ldots, \mathbf{e}_i^K]$, an InnerTrans block processes them to produce a group representation $\mathbf{M}_i$:

  $ \mathbf{M}_i = \mathrm{TransformerBlock}\left( [\mathbf{e}_i^1, \ldots, \mathbf{e}_i^K] \right) $

  Where:
  - $\mathbf{M}_i$: The merged representation of the $i$-th group.
  - $\mathbf{e}_i^k$: The $k$-th item embedding in the $i$-th group.
  - $\mathrm{TransformerBlock}$: A standard Transformer block (including self-attention and an FFN).

The computation budget of InnerTrans is kept small because the dimension and sequence length within each group are very limited ($K$ is small).
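Below is a minimal PyTorch sketch of the token-merge-with-InnerTrans idea, using the built-in TransformerEncoderLayer as a stand-in for the paper's lightweight InnerTrans block and a concat-plus-projection merge; the layer sizes and the projection back to dimension d are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenMergeInnerTrans(nn.Module):
    def __init__(self, d_model=32, K=4, n_heads=2):
        super().__init__()
        self.K = K
        # Lightweight "inner" Transformer applied within each group of K tokens
        self.inner = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2 * d_model,
            batch_first=True)
        # Concat the K token embeddings and project back to d_model (assumed)
        self.merge = nn.Linear(K * d_model, d_model)

    def forward(self, seq):                # seq: (B, L, d), L divisible by K
        B, L, d = seq.shape
        groups = seq.reshape(B * L // self.K, self.K, d)
        groups = self.inner(groups)        # local interactions inside each group
        merged = self.merge(groups.reshape(B, L // self.K, self.K * d))
        return merged                      # (B, L/K, d)

x = torch.randn(2, 2000, 32)
print(TokenMergeInnerTrans()(x).shape)     # torch.Size([2, 500, 32])
```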
4.2.5 LONGER Model Structure
The core model architecture employs a hybrid attention mechanism combining cross-attention and self-attention layers.
Input Generation
The input to the model consists of two main parts:
- Global Tokens: Contextual information (e.g., target item features, user identifiers), as discussed in Section 3.3.
- Sequence Tokens: The user's historical behavior sequence.

To capture temporal dynamics, positional side information is added to the sequence tokens:
- (1) Absolute Time-Difference Feature: This feature quantifies the temporal distance between each user interaction and the target item. It is concatenated to each item embedding.
- (2) Learnable Absolute Positional Embedding: A standard learnable embedding that encodes the position of each token within the sequence, which is added to the item embedding.

After positional encoding, the tokens are passed through a Multi-Layer Perceptron (MLP) to generate their input representations $\mathbf{R}$. This representation is structured as:
$ \mathbf{R} \in \mathbb{R}^{(m+L) \times d} = [\mathbf{G} \in \mathbb{R}^{m \times d} ; \mathbf{H} \in \mathbb{R}^{L \times d}] $
Where:
- $\mathbf{R}$: The combined input representation matrix.
- $m$: Number of global tokens.
- $L$: Length of the (potentially merged) sequence tokens.
- $d$: Embedding dimension.
- $\mathbf{G}$: Global token representations.
- $\mathbf{H}$: Sequence token representations.

A query matrix is then constructed by concatenating the global tokens with sampled sequence tokens $\mathbf{H}_{\mathsf{S}}$ drawn from the full sequence tokens $\mathbf{H}$. The paper notes that sampling the recent $k$ (most recent) items provides the best results among the sampling strategies considered (recent $k$, uniform $k$, learnable tokens). The composite query is:
$ \mathbf{O} = [\mathbf{G} ; \mathbf{H}_{\mathsf{S}}] $
This hybrid design selectively focuses attention on critical local behaviors (via sampled sequence tokens) and global contextual signals (via global tokens), enabling efficient capture of both specific sequence dependencies and broader context.
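A minimal sketch of the input assembly described above, assuming m global tokens (e.g., target item, CLS, and UID embeddings) and recent-k query sampling; the tensor shapes and variable names are illustrative rather than the paper's implementation.

```python
import torch

B, L, d, m, k = 2, 500, 32, 3, 100          # batch, merged seq length, dim, globals, queries

G = torch.randn(B, m, d)                    # global tokens (target item, CLS, UID, ...)
H = torch.randn(B, L, d)                    # (merged) sequence tokens with positional info

R = torch.cat([G, H], dim=1)                # full key/value input: (B, m + L, d)
H_s = H[:, -k:, :]                          # recent-k sampled sequence tokens
O = torch.cat([G, H_s], dim=1)              # composite query: (B, m + k, d)
print(R.shape, O.shape)                     # (2, 503, 32) (2, 103, 32)
```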
Cross-Causal Attention (First Layer)
In the first attention layer, cross-causal attention is applied. The query matrix is $\mathbf{O}$ (global tokens + sampled sequence tokens), while the key and value matrices are derived from the full input tokens $\mathbf{R}$ (global tokens + all sequence tokens).
The query, key, and value matrices are projected using learned weight matrices:
$ \mathbf{Q} = \mathbf{O} \mathbf{W}_{\mathbf{Q}}, \quad \mathbf{K} = \mathbf{R} \mathbf{W}_{\mathbf{K}}, \quad \mathbf{V} = \mathbf{R} \mathbf{W}_{\mathbf{V}} $
Where:
- $\mathbf{W}_{\mathbf{Q}}, \mathbf{W}_{\mathbf{K}}, \mathbf{W}_{\mathbf{V}}$: Learnable projection matrices, each with shape $d \times d$.
The attention mechanism is then computed as:
$ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}} + \mathbf{M} \right) \mathbf{V} $
Where:
- $\mathbf{Q}, \mathbf{K}, \mathbf{V}$: The projected query, key, and value matrices.
- $d$: The embedding dimension.
- $\mathbf{M}$: A mask matrix applied to enforce causality. This mask ensures that attention is only computed over preceding tokens (or global tokens), preventing future information leakage. It is defined as:

$ \mathbf{M}_{i,j} = \begin{cases} 0, & \text{if } j \leq i, \ \text{where } i, j \in [1, m+L] \\ -\infty, & \text{otherwise} \end{cases} $
The causal mask has two main benefits:
- Temporal Relevance: It maintains temporal relevance between sequence items, ensuring that predictions for a given time step only depend on past information.
- KV Cache Serving: It enables the KV Cache Serving mechanism (discussed in Section 3.6.3), as the attention from the sequence to the candidate item is invisible, allowing precomputation of user sequence representations.

After the attention computation, the result is passed through a feed-forward network (FFN) for further processing.
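The sketch below illustrates first-layer cross-causal attention with toy, untrained random projections. The treatment of global tokens in the mask is an assumption here (global queries see everything, and each sampled sequence query sees only sequence positions at or before its original index); the paper's exact mask layout may differ.

```python
import torch
import torch.nn.functional as F

B, L, d, m, k = 2, 500, 32, 3, 100
R = torch.randn(B, m + L, d)      # [global tokens ; full (merged) sequence tokens]
O = torch.randn(B, m + k, d)      # [global tokens ; recent-k sampled sequence tokens]

W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
Q, K, V = O @ W_q, R @ W_k, R @ W_v

mask = torch.zeros(m + k, m + L)                        # 0 = visible, -inf = masked out
orig_pos = torch.arange(L - k, L)                       # original position of each sampled query
future = torch.arange(L)[None, :] > orig_pos[:, None]   # sequence keys in each query's future
mask[m:, m:] = mask[m:, m:].masked_fill(future, float("-inf"))

scores = Q @ K.transpose(-2, -1) / d ** 0.5 + mask      # (B, m+k, m+L)
out = F.softmax(scores, dim=-1) @ V                     # (B, m+k, d); then fed to an FFN
print(out.shape)                                        # torch.Size([2, 103, 32])
```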
Self-Causal Attention (Subsequent Layers)
Following the initial cross-causal attention layer, the model employs several self-causal attention blocks in subsequent layers.
- Principle: These layers focus on learning internal relationships within the sampled token sequence (the output from the previous layer), allowing the model to capture deeper dependencies and patterns.
- Mechanism: The self-causal attention mechanism is computed similarly:

  $ \mathrm{SelfAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}} + \mathbf{M} \right) \mathbf{V} $

  Here:
  - $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are obtained by applying separate linear projections ($\mathbf{W}_{\mathbf{Q}}, \mathbf{W}_{\mathbf{K}}, \mathbf{W}_{\mathbf{V}}$) to the output of the previous layer. This means the queries, keys, and values all originate from the same sequence (the output of the previous layer), hence "self-attention."
  - The causal mask $\mathbf{M}$ continues to be applied. Each self-causal attention layer is followed by an FFN.
Stacking and Compression
The self-causal attention layers are stacked $N$ times. This iterative refinement allows the model to capture increasingly complex and higher-order dependencies within the input sequence. The overall attention flow can be summarized as:
$ \mathrm{CrossAttn}( {\bf O}, {\bf R} ) \longrightarrow \mathrm{SelfAttn}( \cdot ) \times N $
The final output of this stacked attention mechanism is a compressed output representation, which is then used for the downstream prediction task (e.g., feeding into a final prediction head). This hybrid approach efficiently handles long sequences by leveraging both global context (via cross-attention to the full input) and internal dependencies (via self-attention within the sampled/global tokens).
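Structurally, this flow can be sketched as one cross-attention layer followed by N self-attention layers, as in the minimal PyTorch stand-in below (causal masks and FFNs are omitted for brevity; this is an illustration, not the production model).

```python
import torch
import torch.nn as nn

class LongerStack(nn.Module):
    def __init__(self, d=32, n_heads=2, n_self_layers=2):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.selfs = nn.ModuleList(
            nn.MultiheadAttention(d, n_heads, batch_first=True)
            for _ in range(n_self_layers))

    def forward(self, O, R):
        x, _ = self.cross(query=O, key=R, value=R)    # CrossAttn(O, R)
        for layer in self.selfs:                      # SelfAttn(.) x N
            x, _ = layer(x, x, x)
        return x                                      # compressed output for the prediction head

O, R = torch.randn(2, 103, 32), torch.randn(2, 503, 32)
print(LongerStack()(O, R).shape)                      # torch.Size([2, 103, 32])
```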
4.2.6 Training and Deployment Optimization
Training Framework
LONGER utilizes a fully synchronous training system tailored for large-scale sparse models on GPU clusters.
- Hardware-Software Co-design: The framework is built on a philosophy that optimizes both hardware and software aspects to maximize computational throughput and memory efficiency in distributed training.
- Pipeline:
  - Data Ingestion: Training data enters as batches or streams.
  - Preprocessing: Data is processed by the Fountain module.
  - Dispatch to GPUs: Processed data is sent to multiple GPU runners (e.g., GPU0, GPU1, GPU2 in Figure 2).
  - Synchronous Updates: Both dense and sparse parameters are updated synchronously across all GPUs.
(Figure description: a schematic of LONGER's training framework, showing the flow from training data (batch or streaming) through the Fountain data-processing module to multiple GPUs (GPU0, GPU1, GPU2, etc.) for fully synchronous parameter updates, illustrating an efficient training pipeline.)
Figure 2: Training Framework
- Unified Parameter Storage: A key innovation is that both dense and sparse parameters are stored and updated synchronously directly on GPU machines, eliminating the need for external Parameter Server components typically used for sparse parameters. This reduces communication overhead and memory transfer latency.
- Hierarchical Memory System for Sparse Embeddings: To manage the vast size of embedding tables common in recommenders, a hierarchical memory system is adopted (see the sketch after this list):
  - High-frequency features: Stored in high-bandwidth GPU memory (HBM).
  - Mid-frequency features: Reside in CPU main memory (MEM).
  - Low-frequency features: Offloaded to local solid-state drives (SSD).
  This stratified layout optimizes access characteristics, providing a practical balance of latency, throughput, and capacity.
- Benefits: Improved training throughput, reduced staleness (outdated parameter copies), and enhanced convergence stability.
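A hypothetical sketch of such frequency-based tier selection is shown below; the thresholds, tier names, and helper function are invented for illustration and are not ByteDance's actual placement policy.

```python
def embedding_tier(access_count, hot_threshold=100_000, warm_threshold=1_000):
    # Hypothetical policy: place embedding rows by how often they are accessed.
    if access_count >= hot_threshold:
        return "GPU_HBM"        # high-frequency features stay in GPU memory
    if access_count >= warm_threshold:
        return "CPU_MEM"        # mid-frequency features live in host memory
    return "LOCAL_SSD"          # long-tail features are offloaded to SSD

for feature_id, count in [("uid:42", 5_000_000), ("item:9", 12_000), ("item:7", 3)]:
    print(feature_id, "->", embedding_tier(count))
```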
Mixed Precision Training and Recompute
To address GPU memory pressure during training, especially with large models and long sequences:
- Activation Recomputation:
  - Problem: Reverse-mode automatic differentiation (used for gradient computation) requires storing all intermediate activations from the forward pass, which can be a major memory bottleneck.
  - Solution: LONGER supports recompute declarations at the model definition level. Selected activations are discarded during the forward pass and recomputed during the backward pass. This is a computation-for-memory trade-off.
  - Implementation: Since native TensorFlow doesn't officially support recomputation, it is implemented using the custom_gradient mechanism, allowing fine-grained control via code-level annotations.
- Problem:
- Mixed Precision Training:
- Problem: Large dense models increase compute overhead.
  - Solution: Uses BF16/FP16-based mixed precision training. Users configure precision at the model level, applying higher precision (e.g., FP32) to critical components and lower precision (BF16/FP16) elsewhere.
  - Benefits: Demonstrated substantial gains in production: +18% throughput, -16% training time, and -18% memory usage on average (up to -28% in dense layers).
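The paper implements these techniques in TensorFlow via the custom_gradient mechanism; as an analogous illustration only, the PyTorch sketch below combines gradient checkpointing (activation recomputation) with BF16 autocast on a toy block.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
opt = torch.optim.Adam(block.parameters())
x = torch.randn(8, 512, requires_grad=True)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Activations inside `block` are discarded and recomputed during backward
    y = checkpoint(block, x, use_reentrant=False)
    loss = y.square().mean()

loss.backward()
opt.step()
```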
KV Cache Serving
To improve inference efficiency when scoring multiple candidate items against a single user's sequence, inspired by M-FALCON [25], LONGER uses a KV caching mechanism.
- Principle: Decouples attention computation between user behavior tokens and candidate-specific global tokens. Since the user sequence remains constant for a given user across multiple candidate items, its key (K) and value (V) representations can be computed once and reused.
(Figure description: a schematic comparing standard Transformer attention with KV cache serving. The left side shows a standard Transformer computing full attention for every candidate; the right side shows the user-sequence KV being cached once and a per-candidate KV query, improving computational efficiency for large-scale recommender systems.)
Figure 3: KV Cache Serving
- Two-stage inference process:
  - Precompute and Cache: The key-value tensors of the user sequence are computed once and cached. This corresponds to the "Cached User Sequence KV" part in Figure 3.
  - Per-Candidate Computation: For each candidate item, only the attention involving its global token (as query) and the cached user sequence (as key/value) is computed. This corresponds to the "Per-candidate KV Query" part in Figure 3.
- Benefits: This optimization avoids redundant computation for each candidate, significantly reducing serving latency. In practice, it reduced throughput degradation from 40% to only 6.8%, making online serving much more efficient.
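A minimal sketch of the two-stage idea follows: the user-sequence key/value tensors are computed once and cached, and each candidate then issues only its own query against the cache. The single-layer setup, shapes, and scoring function are illustrative assumptions, not the serving system's code.

```python
import torch
import torch.nn.functional as F

d, L, n_candidates = 32, 503, 200
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

# Stage 1: precompute and cache K/V for the candidate-independent user sequence.
user_tokens = torch.randn(L, d)
K_cache, V_cache = user_tokens @ W_k, user_tokens @ W_v

# Stage 2: per candidate, compute only the candidate's own query against the cache.
def score_candidate(cand_emb):
    q = cand_emb @ W_q                                   # (1, d)
    attn = F.softmax(q @ K_cache.T / d ** 0.5, dim=-1)   # attend over cached sequence
    return (attn @ V_cache).sum()                        # toy scalar "score"

candidates = torch.randn(n_candidates, 1, d)
scores = torch.stack([score_candidate(c) for c in candidates])
print(scores.shape)                                      # torch.Size([200])
```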
5. Experimental Setup
5.1. Datasets
The model was evaluated on the Conversion Rate (CVR) prediction task within the Douyin Ads system, which is described as a real-world, large-scale industrial advertising recommendation scenario.
- Source: A subset of online user interaction logs collected from October 16th, 2024, to February 23rd, 2025.
- Scale: 5.2 billion samples over 130 consecutive days.
- Characteristics: Each sample includes:
  - User demographic features: e.g., user ID (UID), gender.
  - Ultra-long user behavior sequence: Contains various interaction types like page views, clicks, and conversions.
  - Candidate ad item: The item for which CVR is being predicted.
  - Item-side features: Ad content, display context, and associated metadata.
- Temporal Split:
  - Training Data: The first 123 days of data.
  - Offline Evaluation Data: The remaining 7 days.
  This temporally consistent data split aligns with real-world deployment practices and prevents future data leakage, ensuring a realistic evaluation.
- Domain: Advertising, specifically within ByteDance's Douyin platform.
The choice of this dataset is effective for validating the method's performance because it is:
- Real-world and large-scale: Reflects the complexity and volume of industrial recommendation problems.
- Ultra-long sequences: Provides the necessary data to test the model's ability to handle extensive user histories.
- Temporally split: Ensures that the evaluation is robust and generalizable to future scenarios.
5.2. Evaluation Metrics
The paper uses several evaluation metrics, both offline and online, to assess the performance of LONGER.
AUC (Area Under the ROC Curve)
- Conceptual Definition: AUC is a commonly used metric for binary classification problems. It quantifies the overall ability of a model to distinguish between positive and negative classes. An AUC value of 1.0 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It is robust to imbalanced datasets.
- Mathematical Formula: The AUC is the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
$ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}^{-1}(x)) \, dx $
Where:
- $\mathrm{TPR} = \frac{TP}{TP + FN}$ (True Positive Rate, also known as Recall or Sensitivity)
- $\mathrm{FPR} = \frac{FP}{FP + TN}$ (False Positive Rate)
- $TP$: Number of true positives (correctly predicted positive instances).
- $FN$: Number of false negatives (positive instances incorrectly predicted as negative).
- $FP$: Number of false positives (negative instances incorrectly predicted as positive).
- $TN$: Number of true negatives (correctly predicted negative instances).
LogLoss (Binary Cross-Entropy Loss)
- Conceptual Definition: LogLoss is a measure of the accuracy of a probability prediction. It quantifies the penalty for incorrectly predicting probabilities. A lower LogLoss value indicates better prediction accuracy. It heavily penalizes confident wrong predictions. The formula is identical to the objective function used during training (binary cross-entropy).
- Mathematical Formula:
$ \mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)] $
Where:
- $N$: The total number of samples.
- $y_i$: The true label for sample $i$ (0 or 1).
- $\hat{y}_i$: The predicted probability that $y_i = 1$ for sample $i$.
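For reference, both offline metrics can be computed with scikit-learn as in the small sketch below (toy labels and scores, not the paper's data).

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [1, 0, 1, 0, 1, 0]
y_pred = [0.9, 0.3, 0.7, 0.4, 0.6, 0.1]   # predicted P(y = 1)

print("AUC    :", roc_auc_score(y_true, y_pred))
print("LogLoss:", log_loss(y_true, y_pred))
```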
ADSS (Advertiser Score)
- Conceptual Definition: ADSS (Advertiser Score) is an internal, proprietary metric used in advertising systems, specifically mentioned as one of the "most important indicators in industrial advertising systems" at ByteDance. It likely reflects the value or satisfaction of advertisers, potentially aggregating factors like return on investment, click-through rates, conversion rates, or other engagement metrics weighted by advertiser spend.
- Mathematical Formula: Not provided in the paper, as it's an internal proprietary metric.
- Symbol Explanation: Specific symbols are not defined without a formula.
ADVV (Advertiser Value)
- Conceptual Definition: ADVV (Advertiser Value) is another internal, proprietary metric in advertising systems, also cited as a key indicator. It is likely related to the overall revenue generated for advertisers or the platform itself, reflecting the economic impact of the recommendations.
- Mathematical Formula: Not provided in the paper, as it's an internal proprietary metric.
- Symbol Explanation: Specific symbols are not defined without a formula.
Order/U (Orders per User)
- Conceptual Definition: Orders per User (Order/U) is an e-commerce metric that measures the average number of orders placed by a single user. It indicates user engagement and purchasing frequency. A higher Order/U suggests more active and valuable users.
- Mathematical Formula: Not provided in the paper. It is typically calculated as:
$
\mathrm{Order/U} = \frac{\text{Total Number of Orders}}{\text{Total Number of Unique Users}}
$
Where:
- Total Number of Orders: Sum of all orders within a specific period.
- Total Number of Unique Users: Count of distinct users who placed at least one order in the same period.
GMV/U (Gross Merchandise Volume per User)
- Conceptual Definition: Gross Merchandise Volume per User (GMV/U) is an e-commerce metric that calculates the average total value of sales (before deducting returns or cancellations) generated by each unique user. It reflects the economic value generated per user. A higher GMV/U indicates that users are making more valuable purchases.
- Mathematical Formula: Not provided in the paper. It is typically calculated as:
$
\mathrm{GMV/U} = \frac{\text{Total Gross Merchandise Volume}}{\text{Total Number of Unique Users}}
$
Where:
- Total Gross Merchandise Volume: The sum of the value of all merchandise sold within a specific period.
- Total Number of Unique Users: Count of distinct users who generated GMV in the same period.
5.3. Baselines
The paper compares LONGER against several strong baselines, categorized by their ability to model short- or long-range user behavior. All models were trained with the same preprocessing pipeline and hyperparameter tuning on a GPU cluster.
Short-Sequence Methods:
- DIN (Recent50) [30]: Deep Interest Network (DIN) applied to only the 50 most recent interactions. This represents a strong attention-based baseline for short sequences.
- TWIN [3]: TWo-stage Interest Network, a two-stage retrieval approach that selects the top-k relevant items (effectively reducing the input to a short sequence) for modeling.
Long-Sequence Methods:
- SumPooling: A simple baseline that aggregates (e.g., sums or averages) all item embeddings in the long sequence to form a single user representation. It captures long-term general preferences but loses sequential and fine-grained information.
- DIN [30]: Deep Interest Network applied to an extended behavior history. While DIN is primarily designed for short sequences, applying it directly to longer sequences without length-specific optimizations typically suffers from scalability issues.
- HSTU [25]: Hierarchical Sequential Transduction Units, a Transformer-based model designed for modeling long sequences, consisting of a stack of self-attention layers.
- Transformer [6]: A vanilla Transformer architecture applied to long sequences. This is a crucial baseline that highlights the computational challenges motivating LONGER's optimizations, and it serves as the most competitive reference point for relative improvements.

These baselines are representative because they cover a spectrum of approaches: simple aggregation, established attention models (both short and extended), and recent Transformer-based solutions for long sequences. This allows LONGER to demonstrate its advantages in both accuracy and efficiency over different paradigms.
6. Results & Analysis
6.1. Core Results Analysis
The experiments validate LONGER's effectiveness in both offline metrics and online A/B tests.
6.1.1 Comparison of Existing Methods
The following are the results from Table 1 of the original paper:
| Metric | Base | SumPooling | TWIN | DIN (Recent50) | DIN | HSTU | Transformer | LONGER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC↑ | 0.83968 | 0.84201 | 0.84472 | 0.84698 | 0.84982 | 0.84994 | 0.85111 | 0.85290 |
| LogLoss↓ | 0.48758 | 0.48538 | 0.48168 | 0.47830 | 0.47452 | 0.47490 | 0.47293 | 0.47103 |
| ΔAUC (%) | – | +0.28 | +0.60 | +0.87 | +1.21 | +1.22 | +1.36 | +1.57 |
| ΔLogLoss (%) | – | -0.45 | -1.21 | -1.90 | -2.68 | -2.60 | -3.00 | -3.39 |
Table 1: Evaluation of methods on industrial datasets
Analysis:
- Overall Superiority: LONGER consistently outperforms all baselines in both AUC (higher is better) and LogLoss (lower is better). It achieves the highest AUC of 0.85290 and the lowest LogLoss of 0.47103.
- Relative Improvements:
  - Compared to the Base model, LONGER shows a relative improvement of +1.57% in AUC and -3.39% in LogLoss.
  - Crucially, even against the Transformer baseline (a vanilla Transformer on long sequences), LONGER achieves a +0.21% higher AUC (0.85290 vs. 0.85111). The paper emphasizes that a 0.1% AUC improvement is considered significant in industrial online A/B tests. This highlights LONGER's ability to extract more valuable information from long sequences while being more efficient.
- Baseline Performance: SumPooling performs the worst among non-base models, indicating that simple aggregation is insufficient for capturing complex user interests. TWIN and DIN (Recent50) show improvements but are limited by their short-sequence or two-stage nature. DIN and HSTU (Transformer-based) show better performance, demonstrating the value of modeling longer sequences, but are still surpassed by LONGER and even the vanilla Transformer baseline. Interestingly, HSTU's LogLoss is slightly higher than DIN's despite its higher AUC.
- Efficiency (Implicit): While Table 1 focuses on performance, the abstract states that LONGER achieves this while being GPU-efficient, implying a better trade-off than a vanilla Transformer, which would likely incur higher computational costs for comparable performance. The following ablation study clarifies this.
6.1.2 Ablation Study
The following are the results from Table 2 of the original paper:
| Configuration | FLOPs (×10) | AUC↑ | LogLoss↓ | ΔAUC | ΔLogLoss |
| --- | --- | --- | --- | --- | --- |
| LONGER (w/o Merge, 2000) | 3.73 | 0.85111 | 0.47293 | +1.36% | -3.00% |
| +TokenMerge4 (Concat, 500) | 2.13 | 0.85232 | 0.47145 | +1.51% | -3.31% |
| +TokenMerge8 (Concat, 250) | 3.03 | 0.85291 | 0.47062 | +1.58% | -3.48% |
| Based on LONGER with TokenMerge8 | | | | | |
| + InnerTrans | 3.52 | 0.85332 | 0.47052 | +1.63% | -3.50% |
| Varying Query Number (Sampling Recent k items) | | | | | |
| Query number = 50 | 1.27 | 0.85235 | 0.47162 | +1.51% | -3.27% |
| Query number = 80 | 1.59 | 0.85248 | 0.47157 | +1.52% | -3.28% |
| Query number = 100 | 1.91 | 0.85290 | 0.47103 | +1.57% | -3.39% |
| Query number = 150 | 2.36 | 0.85290 | 0.47101 | +1.57% | -3.40% |
| Query number = 200 | 2.93 | 0.85331 | 0.47077 | +1.62% | -3.45% |
| Query number = 250 | 3.52 | 0.85332 | 0.47052 | +1.63% | -3.50% |
| Query Selection Strategies | | | | | |
| Learnable 100 | 1.91 | 0.84946 | 0.47523 | +1.17% | -2.53% |
| Recent 100 | 1.91 | 0.85290 | 0.47103 | +1.57% | -3.39% |
| Uniform 100 | 1.91 | 0.85183 | 0.47215 | +1.45% | -3.16% |
| Recent50 + Rest Unif50 | 1.91 | 0.85255 | 0.47129 | +1.53% | -3.34% |
Table 2: Ablation Study on Query Quantity and Key Components of LONGER.
Analysis:
- Impact of TokenMerge and InnerTrans:
  - The baseline LONGER (w/o Merge, 2000) (equivalent to a vanilla Transformer with a sequence length of 2000) has 3.73 FLOPs (in the table's units), AUC 0.85111, and LogLoss 0.47293.
  - Applying +TokenMerge4 (Concat, 500) (merging 4 tokens into 1, reducing the length to 500) drops FLOPs to 2.13 (a ~43% reduction) while improving AUC to 0.85232 and LogLoss to 0.47145. This shows TokenMerge is highly effective in reducing computation while improving performance, not just maintaining it.
  - +TokenMerge8 (Concat, 250) further reduces the sequence length but increases FLOPs to 3.03. This counter-intuitive increase when merging more aggressively (from 500 to 250 tokens) follows from the FLOPs expression above: the dense term $24Ld^2K$ grows linearly with the merge factor $K$ and eventually outweighs the shrinking attention term $4L^2d/K$. Despite the FLOPs increase over TokenMerge4, it still significantly improves AUC (0.85291) and LogLoss (0.47062) over the w/o Merge baseline.
  - Adding + InnerTrans on top of TokenMerge8 further improves performance to AUC 0.85332 and LogLoss 0.47052, with a FLOPs count of 3.52. This confirms that InnerTrans effectively captures intra-group interactions, leading to better representations without prohibitive computational cost. The best overall performance (AUC 0.85332, LogLoss 0.47052) is achieved with TokenMerge8 + InnerTrans.
- Varying Query Number (Sampling Recent k items):
  - This part explores the trade-off between the number of sampled query items ($k$) and performance/FLOPs. Performance generally improves as $k$ increases, but with diminishing returns.
  - Using $k = 100$ achieves an AUC of 0.85290 and a LogLoss of 0.47103 with 1.91 FLOPs. This matches the performance of $k = 150$ but with significantly fewer FLOPs.
  - The best performance in this part (AUC 0.85332, LogLoss 0.47052) is achieved at $k = 250$ with 3.52 FLOPs. This matches the TokenMerge8 + InnerTrans configuration, suggesting this is the full LONGER setup.
  - The paper highlights that $k = 100$ offers a strong trade-off: it achieves performance very close to using all 250 queries (the full merged sequence length) with only 54% of the FLOPs (1.91 vs. 3.52). This makes it highly practical for real-world deployment where computational budgets are tight.
- Query Selection Strategies:
  - This compares how the query items are selected.
  - Learnable 100 (randomly initialized learnable tokens) performs the worst (AUC 0.84946), indicating that relying solely on learnable tokens without direct historical context is less effective.
  - Recent 100 (selecting the 100 most recent behaviors) achieves the best performance (AUC 0.85290, LogLoss 0.47103), demonstrating the critical importance of recency in user behavior modeling.
  - Uniform 100 (uniformly sampled tokens) is better than learnable but worse than recent, suggesting a balanced view of history is less impactful than recency.
  - Recent50 + Rest Unif50 (a mix) performs slightly worse than Recent 100.
  - These findings confirm that informative behaviors, especially recent ones, are crucial for effective query construction in long-sequence modeling.

Summary of Ablation: The ablation study confirms that TokenMerge (especially when combined with InnerTrans) is highly effective at reducing computational cost while improving or maintaining accuracy. The choice of query number and selection strategy is also critical, with sampling recent items offering the best balance of efficiency and performance. This detailed analysis underscores the careful design choices in LONGER.
6.2. Scaling Analysis
This section examines how LONGER's performance scales with key factors: sequence length, parameters, and FLOPs. The scaling behavior is generally described by a power-law trend.
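For reference, such power-law fits typically take the generic form below; the specific fitted coefficients and R² values are reported in Figures 4 and 5 and are not reproduced here.

$ \mathrm{AUC}(x) \approx a \cdot x^{b} + c $

where $x$ denotes the scaled quantity (sequence length, parameter count, or FLOPs) and $a$, $b$, $c$ are fitted constants.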
6.2.1 Sequence Length
The following is the image from Figure 4 of the original paper:
(Figure description: two charts showing how LONGER's AUC (left) and LogLoss (right) vary with sequence length in tokens. AUC increases and LogLoss decreases as sequence length grows, with fitted equations and R² values listed for models with one to five layers, indicating LONGER's effectiveness on long sequences.)
Figure 4: Scaling up sequence length in LONGER.
Analysis:
- Figure 4 shows the relationship between sequence length (number of tokens) and model performance (AUC and LogLoss) across different model depths (number of layers).
- AUC Trend: For all model depths, increasing the sequence length consistently improves AUC. This confirms the hypothesis that modeling longer user histories provides more information and leads to better predictive accuracy. The improvement generally follows a power-law trend, indicating that early gains from increased length are more substantial than later ones (a small fitting sketch follows this list).
- LogLoss Trend: Correspondingly, LogLoss decreases as sequence length increases, signifying better-calibrated probability predictions.
- Impact of Model Depth: Deeper models (e.g., 5-layer) benefit more from longer sequences, achieving higher AUC. However, the AUC improvement slows with depth, indicating diminishing returns beyond a certain point. This suggests an optimal depth that balances model capacity with computational constraints.
- Conclusion: Longer sequences enhance performance, particularly when paired with an appropriately chosen model depth. Beyond a certain depth, further gains from sequence length become marginal.
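The power-law trend described above can be illustrated with a small curve-fitting sketch. The AUC values below are synthetic placeholders (the paper's raw numbers are not reproduced here), and the fitted form AUC ≈ a·L^b + c is an assumption consistent with the reported behavior.

```python
# Fit a saturating power law to (synthetic) AUC-vs-sequence-length points and
# report R^2, mirroring the per-depth fits shown in Figure 4.
import numpy as np
from scipy.optimize import curve_fit

def power_law(length, a, b, c):
    # a * L^b + c with a < 0, b < 0 gives diminishing gains toward an asymptote c.
    return a * np.power(length, b) + c

lengths = np.array([250, 500, 1000, 2000, 4000], dtype=float)  # tokens (illustrative)
auc = np.array([0.8480, 0.8500, 0.8515, 0.8526, 0.8533])       # synthetic, not from the paper

params, _ = curve_fit(power_law, lengths, auc, p0=[-0.1, -0.5, 0.86], maxfev=10000)
pred = power_law(lengths, *params)
r2 = 1.0 - np.sum((auc - pred) ** 2) / np.sum((auc - auc.mean()) ** 2)
print("fitted (a, b, c):", np.round(params, 4), " R^2:", round(r2, 4))
```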
6.2.2 Parameters and FLOPs
The following is the image from Figure 5 of the original paper:
(Figure description) The charts show AUC as a function of the number of parameters (panel (a)) and of FLOPs (panel (b)). The x-axes are parameter count and FLOPs, respectively, the y-axis is AUC, and each panel includes a fitted trend curve, indicating a positive relationship between model complexity and performance.
Figure 5: Scaling performance with respect to FLOPs and model parameters.
Analysis:
- Figure 5(a) - Scaling with Parameters:
  - This graph shows AUC as a function of the number of parameters, obtained by scaling the hidden dimension size while fixing the number of layers (2) and the input sequence length (2000). AUC increases steadily with parameter count, exhibiting a strong power-law trend.
  - Implication: Increasing the model's width (hidden dimension) effectively enhances performance under a fixed architecture, with no sign of saturation within the tested parameter range. This suggests that LONGER can continue to benefit from larger model capacities.
- Figure 5(b) - Scaling with FLOPs:
  - This graph shows AUC as a function of FLOPs, obtained by varying the number of layers and the sequence length while keeping the model dimensionality fixed at 32. AUC increases steadily with FLOPs, also following a strong power-law trend.
  - Implication: Increasing computational resources (more layers or longer effective sequence lengths) allows the model to capture more complex, higher-order dependencies in user behavior and improves accuracy, even with a fixed model width.
- Overall Conclusion: Both increasing model capacity (parameters) and computational resources (FLOPs) are effective ways to improve LONGER's performance. However, these gains must be balanced against the practical computational and memory constraints of real-world systems (a rough per-layer FLOPs estimate follows this list).
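As a rough illustration of how layers, sequence length, and width drive the FLOPs budget, the sketch below applies a standard back-of-the-envelope count for one Transformer layer (Q/K/V/output projections ≈ 8·n·d², attention matmuls ≈ 4·n²·d, feed-forward ≈ 16·n·d² at a 4× expansion). These formulas and the example dimensions are assumptions for illustration, not the paper's FLOPs accounting.

```python
# Rough per-layer FLOPs for a standard Transformer block; each matmul of shape
# (m x k) @ (k x p) is counted as 2*m*k*p FLOPs.
def layer_flops(n_tokens: int, d_model: int, ffn_mult: int = 4) -> int:
    projections = 4 * (2 * n_tokens * d_model * d_model)     # Q, K, V, output projections
    attention = 2 * (2 * n_tokens * n_tokens * d_model)      # QK^T and attn @ V
    ffn = 2 * (2 * n_tokens * d_model * ffn_mult * d_model)  # two feed-forward matmuls
    return projections + attention + ffn

raw_len, merged_len, d = 2000, 250, 32                        # TokenMerge8: 2000 -> 250 tokens
print("per-layer FLOPs, raw sequence:   ", layer_flops(raw_len, d))
print("per-layer FLOPs, merged sequence:", layer_flops(merged_len, d))
```

At this small width, the quadratic attention term dominates for the raw 2,000-token sequence, which is exactly the cost that token merging amortizes.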
6.3. Online A/B Tests
The online A/B tests were conducted on real-world scenarios within both Douyin Ads and Douyin E-Commerce Platforms, serving as critical validation for LONGER's practical impact. The improvements are particularly significant given that the baseline models are already strong.
6.3.1 Douyin Ads Platform
The following are the results from Table 3 of the original paper:
| Advertise Type | ADSS | ADVV |
| --- | --- | --- |
| Live Streaming | +1.063% | +1.168% |
| Short Video | +2.097% | +2.151% |
| Mall | +1.816% | +1.407% |
Table 3: Douyin Ads A/B Test Results
Analysis:
- LONGER consistently improves ADSS (Advertiser Score) and ADVV (Advertiser Value) across all three advertisement formats.
- Short Video shows the most substantial gains, with +2.097% ADSS and +2.151% ADVV, indicating LONGER is particularly effective at improving advertiser outcomes for short video ads.
- Mall also sees strong improvements, with +1.816% ADSS and +1.407% ADVV.
- Live Streaming shows positive, albeit slightly smaller, improvements of +1.063% ADSS and +1.168% ADVV.
- Conclusion: The consistent positive results across diverse ad formats validate LONGER's ability to enhance advertiser performance within a major advertising platform.
6.3.2 Douyin E-Commerce Service
The following are the results from Table 4 of the original paper:
| E-commerce Type | Order/U | GMV/U |
| --- | --- | --- |
| Live Streaming | +7.9222% | +6.5404% |
| Short Video | +4.6125% | +5.2771% |
Table 4: Douyin E-commerce A/B Test Results
Analysis:
- LONGER also demonstrates significant improvements in key e-commerce metrics: Order/U (orders per user) and GMV/U (gross merchandise volume per user).
- Live Streaming content shows very strong positive impacts: +7.9222% in Order/U and +6.5404% in GMV/U. This suggests that for live streaming e-commerce, LONGER significantly boosts both user purchasing frequency and the value of those purchases.
- Short Video content also yields considerable gains: +4.6125% in Order/U and +5.2771% in GMV/U.
- Conclusion: Both content formats benefit substantially, with Live Streaming exhibiting particularly large improvements. This highlights LONGER's strong positive impact on user engagement and economic value generation in e-commerce settings.

The combination of strong offline metrics, positive scaling analysis, and significant online A/B test gains across multiple influential scenarios firmly establishes LONGER's effectiveness and robustness for industrial-scale deployment (a small sketch of how such relative lifts are computed follows below).
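Because the paper does not define these metrics formally, the following is a minimal sketch of how relative lifts such as "+4.6125% in Order/U" are conventionally computed from treatment and control buckets in an A/B test. The per-user aggregation and the numbers are illustrative assumptions, not ByteDance's metric definitions.

```python
# Illustrative relative-lift computation for a per-user A/B metric (numbers made up).
def per_user_metric(total_value: float, n_users: int) -> float:
    # e.g. total orders (or GMV) in the bucket divided by exposed users in that bucket
    return total_value / n_users

def relative_lift(treatment: float, control: float) -> float:
    # percentage improvement of treatment over control
    return (treatment - control) / control * 100.0

control_order_u   = per_user_metric(total_value=1_000_000, n_users=5_000_000)
treatment_order_u = per_user_metric(total_value=1_046_125, n_users=5_000_000)
print(f"Order/U lift: {relative_lift(treatment_order_u, control_order_u):+.4f}%")
```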
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LONGER, a novel Transformer-based framework specifically optimized for efficient and scalable modeling of ultra-long user behavior sequences in industrial recommender systems. It effectively addresses the challenges of quadratic complexity and information loss faced by previous methods by integrating several key innovations:
- Architectural Enhancements: Global tokens for attention stabilization and information anchoring; a token merge module with lightweight InnerTrans to reduce computational complexity while preserving local interactions; and a hybrid causal attention mechanism.
- System-Level Optimizations: A fully synchronous GPU-based training and serving framework, mixed-precision training coupled with activation recomputation for memory efficiency, and a KV cache serving strategy for faster inference (a toy KV-cache sketch follows below).

Extensive experiments on a billion-scale industrial dataset and successful online A/B tests across both advertising (Douyin Ads) and e-commerce (Douyin E-Commerce) domains at ByteDance validate LONGER's superior performance, robustness, and generalizability. It achieves significant improvements in key offline metrics (AUC, LogLoss) and crucial online business indicators (ADSS, ADVV, Order/U, GMV/U), demonstrating its capability for end-to-end ultra-long sequence modeling under real-world industrial constraints. Currently, LONGER has been successfully deployed across numerous scenarios at ByteDance, serving billions of users.
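As a toy illustration of the KV cache serving idea, the sketch below computes keys and values for the (long) user-history tokens once and reuses them while scoring many candidate items, so only the candidate is projected per request. Names, shapes, and the single-head formulation are assumptions, not the deployed system.

```python
# Minimal KV-cache sketch: cache history K/V once, reuse across many candidates.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    def __init__(self, history, w_k, w_v):
        # Keys/values of the user-history tokens are computed once per user/request.
        self.k = history @ w_k           # (n_hist, d)
        self.v = history @ w_v           # (n_hist, d)

    def attend(self, candidate, w_q):
        # Only the candidate token is projected at scoring time.
        q = candidate @ w_q              # (d,)
        scores = softmax(q @ self.k.T / np.sqrt(self.k.shape[-1]))
        return scores @ self.v           # (d,) attended representation

rng = np.random.default_rng(0)
d, n_hist, n_candidates = 32, 250, 500
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
history = rng.normal(size=(n_hist, d))

cache = KVCache(history, w_k, w_v)                        # built once
reprs = [cache.attend(rng.normal(size=d), w_q) for _ in range(n_candidates)]
print(len(reprs), reprs[0].shape)                         # 500 candidate representations
```

In this sketch, caching turns the per-candidate cost from recomputing all history projections into a single query projection plus one attention read over the cached keys and values, which is what makes long-history scoring affordable at serving time.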
7.2. Limitations & Future Work
The authors briefly mention future work directions:
- Investigating more efficient sequence modeling techniques. This implies that while LONGER makes significant strides, there is still room for further optimization in handling extremely long sequences or higher computational demands.
- Improving cross-domain behavior modeling in the industry. This suggests extending the model's capabilities to integrate user behaviors across different platforms or service domains, which is a complex but valuable challenge.
7.3. Personal Insights & Critique
LONGER stands out as a highly practical and impactful contribution to the field of industrial recommender systems.
Strengths:
- Industry Relevance: The paper tackles a critical, real-world problem (ultra-long sequence modeling) that current industrial solutions often compromise on. The reported online A/B test results from ByteDance's platforms (Douyin Ads, Douyin E-Commerce) provide strong evidence of its practical value and significant business impact. The scale of deployment ("serving billions of users") is particularly impressive.
- Balanced Innovation: LONGER doesn't just propose a new architecture; it integrates architectural innovations (global tokens, token merge, hybrid attention) with crucial engineering optimizations (synchronous framework, mixed precision, recomputation, KV cache). This holistic approach is essential for deployment at such a massive scale, where efficiency and resource management are paramount.
- Clear Problem Framing: The paper clearly articulates the limitations of existing solutions (two-stage retrieval, indirect embeddings) and positions LONGER as a superior end-to-end alternative that preserves more information from the full user history.
- Comprehensive Evaluation: The combination of offline metrics, detailed ablation studies, scaling analysis, and online A/B tests provides a thorough validation of the proposed method's effectiveness and efficiency.
Potential Issues & Areas for Improvement:
- Dataset Dates: The dataset collection period is specified as October 16th, 2024 to February 23rd, 2025. These dates fall before the paper's listed date (May 7, 2025), so they are plausible, but the window ends only a few weeks before publication, leaving a tight turnaround for the reported offline experiments and online A/B tests. This does not affect the technical merit, but a brief timeline of data collection, training, and deployment would remove any ambiguity for readers.
- Proprietary Metrics: While understandable for industrial settings, the lack of mathematical definitions for ADSS, ADVV, Order/U, and GMV/U makes it harder for external researchers to fully grasp or replicate the exact performance impact. More generalizable proxy metrics, or a high-level explanation of how these are calculated, would be beneficial.
- InnerTrans Parameter Count: In the ablation study, adding `InnerTrans` on top of `TokenMerge8 (Concat, 250)` increases the FLOPs noticeably. While `InnerTrans` is described as "lightweight," a more detailed breakdown of its FLOPs and parameter-count contribution relative to the overall model would further substantiate that claim.
- Generalizability of Hyperparameters: The optimal hybrid attention configuration is validated in the authors' specific industrial context. While this provides valuable insight, generalizing it to other domains or datasets might require re-tuning.
Transferability and Application:
The methods and conclusions of LONGER are highly transferable to other domains involving sequential data, especially where long sequences and real-time inference are critical. This includes:
- News Recommendation: Modeling long reading histories to personalize news feeds.
- Video/Content Platforms: Capturing extended viewing behavior for tailored content suggestions.
- Search Engines: Understanding long-term search queries and click histories to refine search results.
- Healthcare: Analyzing long patient health records for personalized treatment recommendations or disease prediction.
The token merge strategy, global tokens, and hybrid attention are general architectural patterns that can be adapted to other Transformer-based models facing similar long-sequence challenges. The engineering optimizations (mixed precision, recomputation, KV cache, synchronous training) are broadly applicable best practices for deploying large deep learning models in production environments across various industries.
In summary, LONGER is a testament to the power of combining cutting-edge deep learning architectures with practical engineering solutions to unlock the potential of rich, long-sequence data in highly demanding industrial applications.