
LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders


TL;DR Summary

LONGER is a Transformer model designed for industrial recommender systems that captures ultra-long user behavior sequences. Key innovations include a global token mechanism, a token merge module to reduce complexity, and engineering optimizations, demonstrating strong performance in both offline metrics and online A/B tests across ByteDance's advertising and e-commerce services.

Abstract

Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed-precision and activation recomputation, KV cache serving, and the fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing in both advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed in more than 10 influential scenarios at ByteDance, serving billions of users.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is scaling up long sequence modeling, specifically using a Transformer-based approach, for industrial recommender systems. The title LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders reflects this focus.

1.2. Authors

The paper lists numerous authors, all affiliated with ByteDance:

  • Zheng Chai, Qin Ren, Xijun Xiao, Huizhi Yang, Bo Han, Sijun Zhang, Di Chen, Hui Lu, Wenlin Zhao, Lele Yu, Xionghang Xie, Shiru Ren, Xiang Sun, Yaocheng Tan, Peng Xu, Yuchao Zheng, and Di Wu. The authors appear to be researchers and engineers from ByteDance, a major technology company, indicating a strong industry-focused research background. The presence of many authors suggests a large collaborative effort, typical for industrial research aiming for large-scale deployment.

1.3. Journal/Conference

The paper is listed as being published at The Nineteenth ACM Conference on Recommender Systems (RecSys '25), September 22-26, 2025, Prague, Czech Republic. RecSys is a top-tier conference in the field of recommender systems, known for publishing high-quality research, both academic and industrial, with significant impact.

1.4. Publication Year

The paper was published on May 7, 2025 (arXiv timestamp 2025-05-07T13:54:26 UTC).

1.5. Abstract

The abstract highlights that modeling ultra-long user behavior sequences is crucial for capturing both long-term and short-term preferences in industrial recommender systems. It points out that existing solutions often suffer from upstream-downstream inconsistency and computational inefficiency due to reliance on two-stage retrieval or indirect modeling.

The paper introduces LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER's core contributions include:

  1. A global token mechanism to stabilize attention over long contexts.

  2. A token merge module with lightweight InnerTransformers and a hybrid attention strategy to reduce quadratic complexity.

  3. A suite of engineering optimizations, such as mixed-precision training, activation recomputation, KV cache serving, and a fully synchronous GPU-based training/serving framework.

    LONGER demonstrates superior performance over strong baselines in both offline metrics and online A/B tests across advertising and e-commerce services at ByteDance. Its consistent effectiveness and industrial-level scaling laws are validated, leading to full deployment in over 10 influential scenarios at ByteDance, serving billions of users.

The official source link is https://arxiv.org/abs/2505.04421, and the PDF is available at https://arxiv.org/pdf/2505.04421v2.pdf. The paper is currently available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the effective and efficient modeling of ultra-long user behavior sequences in industrial recommender systems. User behavior sequences, which can contain hundreds or thousands of interactions, are critical for understanding both a user's evolving short-term interests and their stable long-term preferences.

This problem is important because accurately capturing these preferences leads to better recommendation accuracy, diversity, and can help mitigate the information cocoon phenomenon (where users are only exposed to information that confirms their existing beliefs).

However, significant challenges exist:

  • Computational Constraints: Traditional Transformer models, which are powerful for sequence modeling, suffer from quadratic complexity ($O(L^2)$, where $L$ is the sequence length) in their self-attention mechanism. This makes them computationally prohibitive for ultra-long sequences (lengths exceeding $10^3$).
  • Existing Solutions' Limitations: Current industrial practices often resort to compromises:
    • Two-stage retrieval: Selecting a subset of relevant items from the long sequence, which sacrifices full information.
    • Pre-trained User Embeddings: Condensing the long sequence into a single embedding, losing fine-grained details.
    • Memory-augmented Models: Relying on external memory, which can be complex and require extensive training.

    These approaches introduce upstream-downstream inconsistency (a model trained on a subset of the data might not align with the full data distribution) and only an indirect perception of the original sequence, inherently limiting their ability to fully leverage all available user behavior.

The paper's entry point is inspired by the success of scaling laws in large language models (like GPT) and rapid advancements in computing infrastructure (e.g., GPU capabilities). This allows for pioneering an end-to-end ultra-long sequence modeling paradigm directly within industrial-scale recommendation systems, moving beyond the limitations of indirect methods.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. LONGER Framework: Proposing LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders, which is an industrial GPU-efficient Transformer structure specifically designed to scale user sequence modeling length to 10,000 in an end-to-end manner.
  2. Architectural Innovations for Efficiency:
    • Global Token Mechanism: Introducing global tokens to stabilize attention over long contexts and serve as centralized information anchors.
    • Token Merge Module with InnerTrans: Implementing a token merge strategy, which groups adjacent tokens to reduce sequence length and thus quadratic complexity, along with lightweight InnerTrans (InnerTransformers) to preserve intra-group interactions and fine-grained details. This module reduces FLOPs by approximately 50% with almost no performance loss.
    • Hybrid Attention Strategy: Employing a hybrid attention design combining cross-causal attention and self-causal attention to efficiently capture both global context and local dependencies.
  3. Industrial Engineering Optimizations: Devising a series of system-level optimizations for large-scale deployment:
    • Fully Synchronous Training and Serving Framework: A unified GPU-based framework for dense and sparse parameter updates, optimizing throughput and memory efficiency.

    • Mixed Precision Training and Activation Recomputation: Techniques to alleviate GPU memory pressure and improve computational speed by trading computation for memory savings and using lower precision (BF16/FP16).

    • KV Cache Serving: A mechanism to improve inference efficiency by precomputing and caching user sequence representations, avoiding redundant computations for multiple candidate items.

      The key findings demonstrate LONGER's effectiveness:

  • Superior Performance: LONGER consistently outperforms strong baselines in offline metrics (AUC, LogLoss) on billion-scale industrial datasets. For instance, it achieved a relative AUC improvement of 1.57% and LogLoss decrease of 3.39% compared to a base model, and 0.21% AUC improvement over the most competitive Transformer baseline.
  • Online Validation: Successful deployment and significant gains in online A/B tests across influential business scenarios at ByteDance, including Douyin Ads (e.g., +2.097% ADSS, +2.151% ADVV in Short Video format) and Douyin E-Commerce (+7.9222% Order/U, +6.5404% GMV/U in Live Streaming).
  • Scalability and Efficiency: The architectural and engineering optimizations enable LONGER to handle sequence lengths up to 10,000 tokens efficiently, making it suitable for real-world industrial deployment with billions of users.
  • Scaling Laws Validation: The paper validates scaling laws for performance concerning sequence length, parameters, and FLOPs, showing consistent improvements with increased scale.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand LONGER, a grasp of several fundamental concepts in machine learning, particularly in deep learning and recommender systems, is essential:

  • Recommender Systems: Systems that predict user preferences for items (e.g., products, movies, news) and suggest relevant ones. They are crucial for platforms like e-commerce, social media, and content streaming.
  • User Behavior Sequences: A chronological record of a user's interactions with items (e.g., clicks, views, purchases). These sequences are vital for understanding user preferences and predicting future actions. Ultra-long user behavior sequences refer to sequences that can contain thousands of interactions, spanning a long period.
  • Transformers: A neural network architecture introduced in 2017, initially for natural language processing (NLP), which has since become foundational in many sequence-to-sequence tasks. Unlike recurrent neural networks (RNNs), Transformers process sequences in parallel, making them highly efficient, especially on modern hardware like GPUs.
  • Attention Mechanism: The core component of a Transformer. It allows the model to weigh the importance of different parts of the input sequence when processing a specific element. Instead of processing a sequence step-by-step, attention can directly establish relationships between any two positions in the sequence.
    • Self-Attention: A mechanism within a Transformer layer where the model computes attention over the input sequence itself. This means each element in the sequence attends to all other elements (including itself) to compute its new representation. It helps capture long-range dependencies within a single sequence. The computational complexity of self-attention is $O(L^2 \cdot d)$, where $L$ is the sequence length and $d$ is the embedding dimension. This quadratic dependency on $L$ is the main challenge for ultra-long sequences.
    • Cross-Attention: A mechanism where elements in one sequence (the query sequence) attend to elements in a different sequence (the key and value sequence). This is commonly used in encoder-decoder architectures where the decoder attends to the encoder's output. In LONGER, it's used to allow global tokens and sampled sequence tokens (queries) to attend to the full input sequence (keys and values).
  • Positional Encoding: Since Transformers process sequences in parallel, they lose information about the order of elements. Positional encoding is added to the input embeddings to inject information about the relative or absolute position of each element in the sequence.
  • Multi-Layer Perceptron (MLP): A type of feedforward artificial neural network consisting of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (except for the input nodes) is a neuron that uses a nonlinear activation function. MLPs are used for various tasks, including feature transformation and classification.
  • Computational Complexity (FLOPs): FLOPs (floating-point operations) count the arithmetic work a model performs and are a common measure of computational cost. A higher FLOPs count generally means more computation and thus slower processing or higher resource consumption. Quadratic complexity ($O(L^2)$) means that if the sequence length $L$ doubles, the computation increases by a factor of four.
  • Mixed Precision Training: A technique that combines different numerical precisions (e.g., FP32 for full precision, FP16 or BF16 for half precision) during model training. It can significantly reduce memory usage and speed up computations, especially on GPUs with specialized hardware for lower precision arithmetic, while maintaining model accuracy.
  • Activation Recomputation: A memory optimization technique used during neural network training, particularly for deep models. Instead of storing all intermediate activations from the forward pass (which are needed for the backward pass to compute gradients), some activations are discarded and recomputed during the backward pass. This trades computation for memory savings.
  • KV Cache (Key-Value Cache): A technique used in Transformer inference to speed up processing, especially when generating sequences or scoring multiple candidates against a fixed input. It involves caching the Key (K) and Value (V) representations of previously processed tokens (e.g., the user behavior sequence) so they don't need to be recomputed for subsequent steps or candidate items.
  • GPU Clusters: A group of Graphics Processing Units (GPUs) working together, often across multiple machines, to accelerate large-scale computations, such as training very large deep learning models.
  • Sparse Parameters / Embedding Tables: In recommendation systems, categorical features (like user IDs, item IDs) are often represented by high-dimensional, sparse embedding vectors. These embeddings are stored in large embedding tables. Sparse parameters refers to these embeddings, which are accessed and updated sparsely (only a few are active for any given input).
  • Dense Parameters: Refers to the weights and biases in the dense layers (e.g., MLPs, linear layers) of a neural network, which are typically updated more frequently and are smaller in number compared to sparse parameters.

3.2. Previous Works

The paper discusses previous approaches to sequential modeling and long-sequence modeling, categorizing them and highlighting their limitations.

Traditional Short-Sequence Modeling

Early and widely adopted models for user sequence modeling typically focus on short sequences (on the order of $10^2$ to $10^3$ items).

  • DIN (Deep Interest Network) [30]: A pioneering model that introduced an attention mechanism to capture diverse user interests by adaptively calculating the relevance of historical behaviors to a candidate item.

  • DIEN (Deep Interest Evolution Network) [29]: Extends DIN by modeling the evolution of user interests over time, often using Gated Recurrent Units (GRUs) combined with an attention mechanism.

  • CAN (Co-Action Network) [28]: Focuses on modeling feature co-action, which refers to how different features interact with each other to influence user behavior.

  • Other approaches include multi-domain [2, 4], multi-interest [1, 11], and sequence denoising methods [5, 20].

    Limitation: These sophisticated architectures were primarily designed for and confined to short-sequence scenarios, making them unsuitable for ultra-long sequences.

Long-Sequence Modeling

Existing methods for handling longer sequences generally fall into these categories:

  • Two-stage retrieval:
    • Principle: Instead of processing the entire ultra-long sequence, a retrieval stage first selects a small subset (e.g., top-$k$, typically $k \approx 10^2$) of items from the original sequence that are most relevant to the current candidate item. This "shortened" sequence is then passed to a downstream model for end-to-end modeling.
    • Examples: SIM (Search-based user Interest Modeling) [18] and TWIN (TWo-stage Interest Network) [3, 21].
    • Limitation: This approach inevitably sacrifices raw full-sequence information and can lead to upstream-downstream inconsistency because the retrieval stage operates separately from the final ranking model.
  • Pre-trained User Embeddings:
    • Principle: The entire ultra-long user sequence is pre-trained in a source model to derive a condensed user embedding (UE). This single, fixed-size embedding then serves as input to downstream recommendation models.
    • Examples: Works like [9, 13, 31] explore this paradigm.
    • Limitation: While this leverages high-performance GPUs for pre-training, condensing a long sequence into a single embedding can lead to information loss and indirect perception of the original sequence, making it difficult to capture fine-grained, dynamic preferences.
  • Memory-augmented Models:
    • Principle: These models use external memory components to store and retrieve user interest representations or intermediate computation results for long sequences.
    • Examples: MIMN (Multi-channel user Interest Memory Network) [17] uses a neural Turing machine and memory induction unit. LMN (Large Memory Network) [14] proposes a lightweight structure with product quantization-based decomposition. MARM (Memory Augmented Recommendation Model) [15] caches intermediate results from computationally intensive modules.
    • Limitation: Memory-augmented models generally require long-term training periods to effectively accumulate hit rates within their memory slots, and their design can be complex.
  • Direct Long Sequence Modeling: More recent efforts aim to directly model long sequences using Transformer variants.
    • HSTU (Hierarchical Sequential Transduction Units) [25]: Uses a stack of identical self-attention layers with residual connections to model long sequences, showing better performance than vanilla Transformers.
    • Wukong [26]: Develops a stacked factorization machine and linear compression block-based architecture, validating scaling laws in recommendation.
    • Other works like [27, 32] also explore scaling Transformer models for sequential recommendation.
    • Limitation: The paper argues that a GPU-efficient long sequence modeling solution remains underexplored for large-scale industrial recommender systems.

Core Formula: The Attention Mechanism

Since LONGER is a Transformer-based model, understanding the Attention mechanism is crucial. The fundamental Scaled Dot-Product Attention is defined as:

$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

Where:

  • $Q$: Query matrix. This represents what you are looking for in the sequence. Each row corresponds to a query vector.

  • $K$: Key matrix. This represents what is available to be matched against the query. Each row corresponds to a key vector.

  • $V$: Value matrix. This contains the actual information associated with each key. Each row corresponds to a value vector.

  • $K^T$: Transpose of the key matrix.

  • $d_k$: Dimension of the key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.

  • $\mathrm{softmax}(\cdot)$: A function that converts a vector of arbitrary real values into a vector of probabilities, where the values sum to 1. It highlights the most relevant keys.

  • The product $QK^T$ computes attention scores indicating the similarity between each query and each key.

  • The softmax applies these scores to the values, effectively weighing how much each value contributes to the output for each query.

    This formula describes how self-attention (where Q, K, V are derived from the same input sequence) and cross-attention (where Q comes from one sequence, and K, V from another) fundamentally operate.
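
For readers who prefer code, the following is a minimal NumPy sketch of scaled dot-product attention (illustrative only, not code from the paper); the shapes and random inputs are assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (Lq, d_k), K: (Lk, d_k), V: (Lk, d_v) -> output: (Lq, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of values

L, d = 8, 16
X = np.random.randn(L, d)
self_attn_out = scaled_dot_product_attention(X, X, X)     # self-attention: Q, K, V from one sequence
```

Passing the same matrix as Q, K, and V gives self-attention; passing a different matrix for Q while keeping K and V from another sequence gives cross-attention.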

3.3. Technological Evolution

The evolution of sequence modeling in recommenders has progressed from simple statistical methods to sophisticated deep learning architectures.

  1. Early Methods (e.g., Matrix Factorization, Collaborative Filtering): Focused on static preferences and item similarity, largely ignoring sequential patterns.

  2. RNN/GRU-based Sequential Models: Introduced the ability to model sequences by capturing temporal dependencies, but struggled with very long sequences due to vanishing/exploding gradients and computational bottlenecks.

  3. Attention-based Models (e.g., DIN): Revolutionized sequential recommendation by allowing models to dynamically weigh the importance of past interactions, leading to better capture of diverse interests. However, these were often still applied to relatively short, active sequences.

  4. Transformer-based Models: Ushered in a new era with parallel processing and direct modeling of long-range dependencies, overcoming some limitations of RNNs. Initially applied to NLP, their success quickly spread to recommendation.

  5. Long-Sequence Transformers (Current Frontier): The challenge with Transformers in recommendation is the quadratic complexity for ultra-long sequences. Current research, including LONGER, focuses on optimizing these models for industrial scale, addressing efficiency, memory usage, and deployment challenges. This involves architectural innovations (like sparse attention, token merging) and system-level optimizations (like mixed precision, KV caching).

    LONGER fits into this timeline by pushing the frontier of Transformer-based models for ultra-long sequences by explicitly addressing the GPU-efficiency and industrial deployment aspects, moving towards a truly end-to-end modeling paradigm for sequences of unprecedented length in real-world systems.

3.4. Differentiation Analysis

Compared to the main methods in related work, LONGER differentiates itself primarily through its end-to-end, GPU-efficient, directly optimized Transformer architecture for ultra-long user behavior sequences, specifically designed for industrial scale.

  • Vs. Two-stage retrieval (e.g., SIM, TWIN): LONGER avoids the upstream-downstream inconsistency and information loss inherent in selecting a subset of interactions. It directly processes the full ultra-long sequence, preserving all available information.
  • Vs. Pre-trained User Embeddings: LONGER doesn't condense the entire sequence into a single, static embedding. Instead, it maintains a dynamic, detailed representation through its optimized Transformer layers, allowing for a richer and more context-aware understanding of user preferences. This avoids indirect perception of the original sequence.
  • Vs. Memory-augmented Models (e.g., MIMN, LMN, MARM): While memory networks address long-term dependencies, LONGER integrates the "memory" function directly into the Transformer's attention mechanism through global tokens and efficient processing of the full sequence, potentially reducing the training complexity and explicit memory management required by external memory modules.
  • Vs. Vanilla Transformers and initial direct long-sequence models (e.g., HSTU, Wukong):
    • LONGER specifically targets GPU-efficiency in an industrial context, which is crucial for deployment at ByteDance's scale. It tackles the quadratic complexity head-on with novel architectural designs like token merge and hybrid attention, explicitly reducing FLOPs by ~50% with minimal performance loss.

    • It also incorporates a comprehensive suite of engineering optimizations (synchronous training/serving, mixed precision, recomputation, KV cache) that are critical for billion-user scale deployment, aspects often less emphasized in purely academic proposals.

    • The global token mechanism is a specific innovation aimed at stabilizing attention over very long contexts, addressing a known issue (attention sink) in long Transformers.

      In essence, LONGER combines architectural ingenuity with robust system-level engineering to deliver a truly scalable and efficient end-to-end ultra-long sequence modeling solution, filling a critical gap in industrial recommender systems.

4. Methodology

4.1. Principles

The core principle behind LONGER is to enable end-to-end ultra-long sequence modeling in industrial recommender systems by strategically balancing computational efficiency and representational fidelity. It aims to overcome the quadratic complexity ($O(L^2)$) of vanilla Transformers, which makes them prohibitive for sequence lengths greater than $10^3$, while avoiding the information loss and upstream-downstream inconsistency of existing two-stage or indirect modeling approaches. This is achieved through a combination of novel architectural designs and comprehensive engineering optimizations, all tailored for GPU-efficient operation at an industrial scale.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1 Problem Statement

The recommendation task is defined as predicting the probability of a user $u$ interacting (e.g., clicking or converting) with a target item $v$, given the user's raw behavior sequence $S_u$, user basic features $u_d$, and cross features.

Let $\mathcal{U}$ be the set of users and $\mathcal{I}$ the set of items. For a user $u \in \mathcal{U}$, the raw behavior sequence is $S_u = [i_1^{(u)}, ..., i_L^{(u)}]$, where $i_t^{(u)} \in \mathcal{I}$ is the item interacted with at time $t$. The goal is to predict the probability:

$ P(y = 1 \mid S_u, u_d, v) \in [0, 1] $

Where:

  • $y \in \{0, 1\}$ indicates whether user $u$ will interact with item $v$.

  • $S_u$: The user's historical behavior sequence.

  • $u_d$: User demographic and other basic features.

  • $v$: The candidate target item.

    The model learns this mapping by optimizing the binary cross-entropy loss ($\mathcal{L}$) over historical interaction data $\mathcal{D} = \{ (S_u, u_d, v, y) \}$:

$ \mathcal{L} = - \frac{1}{|\mathcal{D}|} \sum_{(S_u, u_d, v, y) \in \mathcal{D}} [y \log \hat{y} + (1 - y) \log (1 - \hat{y})] $

Where:

  • $|\mathcal{D}|$: The total number of samples in the dataset.
  • $y$: The true interaction label (0 or 1).
  • $\hat{y} = f_\theta(S_u, v)$: The predicted probability by the recommendation model $f_\theta$, parameterized by $\theta$.
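
As a concrete illustration of this objective (not the paper's code), here is a minimal NumPy sketch of the binary cross-entropy computation; the labels and predicted probabilities are made-up stand-ins for the model output $f_\theta(S_u, v)$:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_pred are predicted probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)                 # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 0, 1])                            # observed interactions
y_pred = np.array([0.9, 0.2, 0.4, 0.7])                    # stand-in for f_theta(S_u, v)
print(binary_cross_entropy(y_true, y_pred))                # ~0.30
```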

4.2.2 Overall Framework

The LONGER framework is designed to handle long and complex user behavior sequences efficiently. As illustrated in Figure 1, the model integrates several key components:

Figure 1: LONGER Model Architecture. The diagram illustrates the key components of LONGER, including the global token mechanism, the self-attention and cross-attention modules, the shared embedding layer, and the lightweight InnerTransformers used to process long user sequences; the overall architecture is designed to handle diverse features efficiently and accurately.

  1. Input Generation: The process starts by generating raw input tokens, which include global tokens and sequence tokens.
  2. Token Merge: The sequence tokens undergo a Token Merge operation, potentially incorporating InnerTrans for local interactions, to compress the sequence and reduce computational complexity.
  3. LONGER Model Structure: The core model architecture processes these tokens using a hybrid attention mechanism, combining cross-causal attention in the first layer and self-causal attention in subsequent layers.
  4. Prediction: The refined representations are then used for the downstream prediction task (e.g., CVR prediction).
  5. Training and Serving Optimizations: Underlying this architecture are several system-level optimizations, including a fully synchronous training and serving framework, mixed precision training, activation recomputation, and KV cache serving, all designed for GPU-efficient operation at scale.

4.2.3 Global Tokens

LONGER introduces Global Tokens as auxiliary representations appended to the input sequence. These tokens serve as aggregated anchor representations, meaning they are designed to consolidate and represent global contextual information. Examples of information that can be encoded in global tokens include:

  • Target item representation: The embedding of the item for which the recommendation probability is being predicted.
  • Learnable CLS tokens: Special tokens, similar to the [CLS] token in BERT, that can learn to encapsulate overall sequence information.
  • UID embeddings: The embedding representing the user's identity.
  • High-order compressed user-item interaction features: Aggregations of more complex interaction signals.

Purpose:

  1. Centralized Information Anchors: Global tokens facilitate enhanced feature interactions by providing a central point where user history, contextual attributes, and candidate item features can interact.
  2. Stabilize Attention Dynamics: They help stabilize attention in long sequences, especially with sparse attention configurations. Similar to StreamLLM [23], a small number of global tokens can alleviate the "attention sink" effect, where deeper attention layers disproportionately focus on early tokens. These tokens act as anchor points, maintaining attention diversity and preserving long-range dependency modeling. Global tokens are designed to have a full attention receptive field, meaning they can attend to every other token in the sequence and be attended to by every other token, enabling global information fusion.

4.2.4 Token Merge

The Token Merge strategy addresses the quadratic complexity of vanilla Transformers, which is a major bottleneck for processing long sequences ($L \ge 2000$). For a vanilla Transformer encoder layer, the FLOPs and Parameters can be expressed as [16]:

$ \mathrm{FLOPs}_{\mathrm{vanilla}} = 24Ld^2 + 4L^2d, \qquad \mathrm{Params}_{\mathrm{vanilla}} = 12d^2 + 13d $

Where:

  • $L$: The sequence length.

  • $d$: The embedding dimension.

    The Token Merge strategy groups adjacent tokens and compresses them into shorter sequences, effectively reducing the sequence length by a factor of $K$. This performs spatial compression.

Computational Complexity: The ratio of attention complexity before and after token merge is:

$ \frac{\mathrm{FLOPs}_{\mathrm{Merge~Token}}}{\mathrm{FLOPs}_{\mathrm{vanilla}}} = \frac{24Ld^2K + \frac{4L^2d}{K}}{24Ld^2 + 4L^2d} = \frac{6dK + \frac{L}{K}}{6d + L} $

Where:

  • $K$: The merge factor, indicating how many tokens are grouped into one; the effective sequence length becomes $L/K$. For typical values like $L = 2048$ and $d = 32$, this strategy significantly reduces FLOPs.
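
As a sanity check on this ratio, here is a short Python calculation using the paper's typical values $L = 2048$ and $d = 32$; the set of merge factors looped over is purely illustrative:

```python
# FLOPs ratio of a token-merged layer vs. a vanilla Transformer layer:
# ratio(K) = (6*d*K + L/K) / (6*d + L)
L, d = 2048, 32

def merge_flops_ratio(K: int) -> float:
    return (6 * d * K + L / K) / (6 * d + L)

for K in (1, 2, 4, 8):
    print(f"K={K}: ratio = {merge_flops_ratio(K):.2f}")
# K=1: 1.00, K=2: 0.63, K=4: 0.57, K=8: 0.80
```

Note that the ratio is not monotone in $K$: because merging by concatenation grows the token dimension to $Kd$, the projection term $6dK$ eventually outweighs the savings from the shorter sequence, which matches the ablation observation later in the paper that TokenMerge8 costs more FLOPs than TokenMerge4.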

Parameter Expansion: Token merging also impacts the number of parameters. While it reduces computational complexity, it can simultaneously increase the number of parameters $\Theta_{\mathrm{merge}}$ in certain parts (e.g., if InnerTrans blocks are used). The paper states:

$ \Theta_{\mathrm{merge}} = 12K^2d^2 + 13Kd $

This suggests a trade-off where efficiency gains from reduced sequence length are balanced against potential parameter expansion, ultimately aiming for improved model expressiveness and overall performance.

InnerTrans

To ensure that merging adjacent tokens doesn't lead to a loss of fine-grained details or insufficient interaction within a group, LONGER introduces InnerTrans.

  • Principle: InnerTrans applies a lightweight Transformer block within each token group before merging. This allows for local interactions and preserves semantics among the grouped tokens.

  • Mechanism: For the $i$-th group of $K$ item embeddings $[\mathbf{e}_i^1, ..., \mathbf{e}_i^K]$, an InnerTrans block processes them to produce a group representation $\mathbf{M}_i$:

    $ \mathbf{M}_i = \mathrm{TransformerBlock}\left( [\mathbf{e}_i^1, ..., \mathbf{e}_i^K] \right) $

Where:

  • $\mathbf{M}_i$: The merged representation of the $i$-th group.
  • $\mathbf{e}_i^k$: The $k$-th item embedding in the $i$-th group.
  • $\mathrm{TransformerBlock}(\cdot)$: A standard Transformer block (including self-attention and FFN). The computation budget of InnerTrans is kept small due to the very limited dimension and sequence length ($K$ is small) within each group.
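
The following PyTorch sketch illustrates the idea of grouping every $K$ adjacent tokens and letting a lightweight Transformer block interact within each group before merging. It is an assumption-laden simplification: the group summary here is a mean-pool, whereas the paper's ablation merges by concatenation (growing the merged dimension to $Kd$); module names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TokenMergeWithInnerTrans(nn.Module):
    """Illustrative sketch: group every K adjacent tokens and summarize each group
    with a lightweight Transformer block, reducing the length from L to L/K."""
    def __init__(self, d_model: int, K: int, n_heads: int = 2):
        super().__init__()
        self.K = K
        self.inner = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2 * d_model,
            batch_first=True)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        B, L, d = H.shape
        assert L % self.K == 0, "sequence length must be divisible by K"
        groups = H.reshape(B * L // self.K, self.K, d)    # (B*L/K, K, d) local groups
        mixed = self.inner(groups)                        # intra-group interaction (InnerTrans)
        merged = mixed.mean(dim=1)                        # one summary token per group
        return merged.reshape(B, L // self.K, d)

H = torch.randn(2, 2000, 32)                              # (batch, L, d) sequence tokens
merged = TokenMergeWithInnerTrans(d_model=32, K=8)(H)     # (2, 250, 32) merged tokens
```

Replacing the mean-pool with a concatenation of the $K$ mixed embeddings (followed by a projection) would more closely mirror the "Concat" variants in the ablation table.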

4.2.5 LONGER Model Structure

The core model architecture employs a hybrid attention mechanism combining cross-attention and self-attention layers.

Input Generation

The input to the model consists of two main parts:

  1. Global Tokens: Contextual information (e.g., target item features, user identifiers), as discussed in the Global Tokens subsection above.

  2. Sequence Tokens: The user's historical behavior sequence.

    To capture temporal dynamics, positional side information is added to the sequence tokens:

  • (1) Absolute Time-Difference Feature: This feature quantifies the temporal distance between each user interaction and the target item. It is concatenated to each item embedding.

  • (2) Learnable Absolute Positional Embedding: A standard learnable embedding that encodes the position of each token within the sequence, which is added to the item embedding.

After positional encoding, the tokens are passed through a Multi-Layer Perceptron (MLP) to generate their input representations $\mathbf{R}$. This representation is structured as:

$ \mathbf{R} = [\mathbf{G} ; \mathbf{H}] \in \mathbb{R}^{(m+L) \times d}, \quad \mathbf{G} \in \mathbb{R}^{m \times d}, \ \mathbf{H} \in \mathbb{R}^{L \times d} $

Where:

  • $\mathbf{R}$: The combined input representation matrix.

  • $m$: Number of global tokens.

  • $L$: Length of the (potentially merged) sequence tokens.

  • $d$: Embedding dimension.

  • $\mathbf{G}$: Global token representations.

  • $\mathbf{H}$: Sequence token representations.

    A query matrix $\mathbf{O}$ is then constructed by concatenating the $m$ global tokens $\mathbf{G}$ with $k$ sampled sequence tokens $\mathbf{H}_{\mathsf{S}}$ from the full sequence tokens $\mathbf{H}$. The paper notes that recent $k$ (the most recent $k$ items) provides the best results among various sampling strategies (recent $k$, uniform $k$, learnable tokens). The composite query is:

$ \mathbf{O} = [\mathbf{G} ; \mathbf{H}_{\mathsf{S}}] $

This hybrid design selectively focuses attention on critical local behaviors (via sampled sequence tokens) and global contextual signals (via global tokens), enabling efficient capture of both specific sequence dependencies and broader context.
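
A minimal sketch of this input construction (assuming PyTorch tensors and the ordering $[\mathbf{G} ; \mathbf{H}]$ used in the formulas above; the shapes and global-token choices are illustrative):

```python
import torch

def build_inputs(G: torch.Tensor, H: torch.Tensor, k: int):
    """G: (B, m, d) global tokens; H: (B, L, d) merged sequence tokens.
    Returns R = [G ; H] (keys/values of the first layer) and O = [G ; recent-k of H] (queries)."""
    R = torch.cat([G, H], dim=1)              # (B, m + L, d)
    H_s = H[:, -k:, :]                        # recent-k sampling, the best strategy in the ablation
    O = torch.cat([G, H_s], dim=1)            # (B, m + k, d)
    return R, O

G = torch.randn(4, 3, 32)                     # e.g. target item, learnable CLS, UID tokens
H = torch.randn(4, 250, 32)                   # merged sequence tokens (L = 250 after TokenMerge8)
R, O = build_inputs(G, H, k=100)              # R: (4, 253, 32), O: (4, 103, 32)
```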

Cross-Causal Attention (First Layer)

In the first attention layer, cross-causal attention is applied. The query matrix is $\mathbf{O}$ (global tokens + sampled sequence tokens), while the key and value matrices are derived from the full input tokens $\mathbf{R}$ (global tokens + all sequence tokens).

The query, key, and value matrices are projected using learned weight matrices: $ \mathbf{Q} = \mathbf{O} \mathbf{W}_{\mathbf{Q}}, \quad \mathbf{K} = \mathbf{R} \mathbf{W}_{\mathbf{K}}, \quad \mathbf{V} = \mathbf{R} \mathbf{W}_{\mathbf{V}} $

Where:

  • $\mathbf{W}_{\mathbf{Q}}, \mathbf{W}_{\mathbf{K}}, \mathbf{W}_{\mathbf{V}}$: Learnable projection matrices, each with shape $\mathbb{R}^{d \times d}$.

    The attention mechanism is then computed as:

$ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}} + \mathbf{M} \right) \mathbf{V} $

Where:

  • $\mathbf{Q}, \mathbf{K}, \mathbf{V}$: The projected query, key, and value matrices.

  • $d$: The embedding dimension.

  • $\mathbf{M}$: A mask matrix applied to enforce causality. This mask ensures that attention is only computed over preceding tokens (or global tokens), preventing future information leakage. It is defined as:

    $ \mathbf{M}_{i,j} = \left\{ \begin{array}{ll} 0, & \mathrm{if~} j \leq i, \mathrm{~where~} i,j \in [1, m+L] \\ -\infty, & \mathrm{otherwise} \end{array} \right. $

The causal mask has two main benefits:

  1. Temporal Relevance: It maintains temporal relevance between sequence items, ensuring that predictions for a given time step only depend on past information.

  2. KV Cache Serving: It enables the KV Cache Serving mechanism (discussed later in this section), as attention from the sequence to the candidate item is invisible, allowing precomputation of user sequence representations.

    After the attention computation, the result is passed through a feed-forward network (FFN) for further processing.

Self-Causal Attention (Subsequent Layers)

Following the initial cross-causal attention layer, the model employs several self-causal attention blocks in subsequent layers.

  • Principle: These layers focus on learning internal relationships within the sampled tokens sequence (the output from the previous layer), allowing the model to capture deeper dependencies and patterns.

  • Mechanism: The self-causal attention mechanism is computed similarly:

    $ \mathrm{SelfAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}} + \mathbf{M} \right) \mathbf{V} $

Here:

  • $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are obtained by applying separate linear projections ($\mathbf{W}_{\mathbf{Q}}, \mathbf{W}_{\mathbf{K}}, \mathbf{W}_{\mathbf{V}}$) to the output of the previous layer. This means the queries, keys, and values all originate from the same sequence (the output of the previous layer), hence "self-attention."
  • The causal mask $\mathbf{M}$ continues to be applied. Each self-causal attention layer is followed by an FFN.
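
The sketch below illustrates both attention types with a single-head module implementing $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d} + \mathbf{M})\mathbf{V}$. For brevity it omits global tokens and multi-head projections, and it assumes the queries are the most recent positions of the key sequence, so the causal mask lets a query see a key only if that key is not in its future; this is an illustrative reading, not the paper's exact masking over global tokens.

```python
import math
import torch
import torch.nn as nn

def suffix_causal_mask(Lq: int, Lk: int) -> torch.Tensor:
    """Additive mask when the Lq queries are the last Lq positions of the Lk keys:
    query i may attend to key j iff j <= Lk - Lq + i. With Lq == Lk this reduces to
    the standard lower-triangular causal mask."""
    q_pos = torch.arange(Lk - Lq, Lk).unsqueeze(1)        # (Lq, 1) original positions of queries
    k_pos = torch.arange(Lk).unsqueeze(0)                 # (1, Lk) key positions
    mask = torch.full((Lq, Lk), float("-inf"))
    return mask.masked_fill(k_pos <= q_pos, 0.0)

class CausalAttention(nn.Module):
    """Single-head softmax(Q K^T / sqrt(d) + M) V with learned projections."""
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, query_in: torch.Tensor, kv_in: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.W_q(query_in), self.W_k(kv_in), self.W_v(kv_in)
        M = suffix_causal_mask(Q.size(0), K.size(0))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)) + M
        return torch.softmax(scores, dim=-1) @ V

d = 32
R = torch.randn(250, d)                       # full (merged) sequence tokens
O = R[-100:]                                  # recent-100 tokens as queries
x = CausalAttention(d)(O, R)                  # first layer: cross-causal attention -> (100, 32)
x = CausalAttention(d)(x, x)                  # later layers: self-causal attention -> (100, 32)
```

Stacking further such layers (each followed by an FFN, omitted here) on the first-layer output mirrors the CrossAttn-then-SelfAttn flow described next.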

Stacking and Compression

The self-causal attention layers are stacked $N$ times. This iterative refinement allows the model to capture increasingly complex and higher-order dependencies within the input sequence. The overall attention flow can be summarized as:

$ \mathrm{CrossAttn}( {\bf O}, {\bf R} ) \longrightarrow \mathrm{SelfAttn}( \cdot ) \times N $

The final output of this stacked attention mechanism is a compressed output representation, which is then used for the downstream prediction task (e.g., feeding into a final prediction head). This hybrid approach efficiently handles long sequences by leveraging both global context (via cross-attention to the full input) and internal dependencies (via self-attention within the sampled/global tokens).

4.2.6 Training and Deployment Optimization

Training Framework

LONGER utilizes a fully synchronous training system tailored for large-scale sparse models on GPU clusters.

  • Hardware-Software Co-design: The framework is built on a philosophy that optimizes both hardware and software aspects to maximize computational throughput and memory efficiency in distributed training.
  • Pipeline:
    1. Data Ingestion: Training data enters as batches or streams.

    2. Preprocessing: Data is processed by the Fountain module.

    3. Dispatch to GPUs: Processed data is sent to multiple GPU runners (e.g., GPU0, GPU1, GPU2 in Figure 2).

    4. Synchronous Updates: Both dense and sparse parameters are updated synchronously across all GPUs.

Figure 2: Training Framework. The diagram shows the training pipeline: training data (batch or streaming) passes through the Fountain preprocessing module and is dispatched to multiple GPUs (GPU0, GPU1, GPU2, ...), where dense and sparse parameters are updated fully synchronously.

  • Unified Parameter Storage: A key innovation is that both dense and sparse parameters are stored and updated synchronously directly on GPU machines, eliminating the need for external Parameter Server components typically used for sparse parameters. This reduces communication overhead and memory transfer latency.
  • Hierarchical Memory System for Sparse Embeddings: To manage the vast size of embedding tables common in recommenders, a hierarchical memory system is adopted:
    • High-frequency features: Stored in high-bandwidth GPU memory (HBM).
    • Mid-frequency features: Reside in CPU main memory (MEM).
    • Low-frequency features: Offloaded to local solid-state drives (SSD). This stratified layout optimizes access characteristics, providing a practical balance of latency, throughput, and capacity.
  • Benefits: Improved training throughput, reduced staleness (outdated parameter copies), and enhanced convergence stability.

Mixed Precision Training and Recompute

To address GPU memory pressure during training, especially with large models and long sequences:

  • Activation Recomputation:
    • Problem: Reverse-mode automatic differentiation (used for gradient computation) requires storing all intermediate activations from the forward pass, which can be a major memory bottleneck.
    • Solution: LONGER supports recomputing declarations at the model definition level. Selected activations are discarded during the forward pass and recomputed during the backward pass. This is a computation-for-memory trade-off.
    • Implementation: Since native TensorFlow doesn't officially support recomputation, it's implemented using the custom_gradient mechanism, allowing fine-grained control via code-level annotations.
  • Mixed Precision Training:
    • Problem: Large dense models increase compute overhead.
    • Solution: Uses BF16/FP16-based mixed precision training. Users configure precision at the model level, applying higher precision (e.g., FP32) to critical components and lower precision (BF16/FP16) elsewhere.
    • Benefits: Demonstrated substantial gains in production: +18% throughput, -16% training time, and -18% memory usage on average (up to -28% in dense layers).
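
The paper implements these optimizations in TensorFlow (recomputation via the custom_gradient mechanism); since that code is not shown, here is a minimal PyTorch sketch of the same two ideas, activation checkpointing plus BF16 autocast, assuming a CUDA device is available:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to keep in memory.
block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda()
x = torch.randn(1024, 512, device="cuda", requires_grad=True)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # BF16 mixed precision
    # Activation recomputation: the block's forward pass is re-run during backward
    # instead of storing its intermediate activations (compute-for-memory trade-off).
    y = checkpoint(block, x, use_reentrant=False)
    loss = y.float().square().mean()                              # keep the reduction in FP32

loss.backward()
```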

KV Cache Serving

To improve inference efficiency when scoring multiple candidate items against a single user's sequence, inspired by M-FALCON [25], LONGER uses a KV caching mechanism.

  • Principle: Decouples attention computation between user behavior tokens and candidate-specific global tokens. Since the user sequence remains constant for a given user across multiple candidate items, its key (K) and value (V) representations can be computed once and reused.

Figure 3: KV Cache Serving. The diagram contrasts a standard Transformer, which recomputes full attention for every candidate item, with KV cache serving, which caches the user-sequence keys/values once and performs only a per-candidate query against the cache, improving computational efficiency for large-scale recommenders.

  • Two-stage inference process:

    1. Precompute and Cache: The key-value tensors of the user sequence are computed once and cached. This corresponds to the "Cached User Sequence KV" part in Figure 3.
    2. Per-Candidate Computation: For each candidate item, only the attention involving its global token (as query) and the cached user sequence (as key/value) is computed. This corresponds to the "Per-candidate KV Query" part in Figure 3.
  • Benefits: This optimization avoids redundant computation for each candidate, significantly reducing serving latency. In practice, it reduced throughput degradation from 40% to only 6.8%, making online serving much more efficient.
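
A minimal single-head sketch of the two-stage idea (illustrative PyTorch, not the production implementation; the shapes and the 500-candidate batch are assumptions):

```python
import math
import torch
import torch.nn as nn

d = 32
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

user_seq = torch.randn(250, d)                 # user-side tokens, fixed across candidates
candidates = torch.randn(500, d)               # candidate items to score for this user

# Stage 1: precompute and cache the user-sequence keys/values once per request.
with torch.no_grad():
    K_cache, V_cache = W_k(user_seq), W_v(user_seq)

# Stage 2: for each candidate, only its query attends to the cached keys/values.
def candidate_user_repr(cand_emb: torch.Tensor) -> torch.Tensor:
    q = W_q(cand_emb.unsqueeze(0))                                 # (1, d) candidate query
    attn = torch.softmax(q @ K_cache.T / math.sqrt(d), dim=-1)     # (1, 250) attention weights
    return attn @ V_cache                                          # (1, d) pooled user interest

reprs = torch.stack([candidate_user_repr(c) for c in candidates])  # (500, 1, d)
```

Because the user-side keys/values are computed once rather than once per candidate, the per-candidate work scales only with the candidate query, which is what restores most of the serving throughput reported above.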

5. Experimental Setup

5.1. Datasets

The model was evaluated on the Conversion Rate (CVR) prediction task within the Douyin Ads system, which is described as a real-world, large-scale industrial advertising recommendation scenario.

  • Source: Subset of online user interaction logs collected from October 16th, 2024, to February 23rd, 2025.

  • Scale: 5.2 billion samples over 130 consecutive days.

  • Characteristics: Each sample includes:

    • User demographic features: e.g., user ID (UID), gender.
    • Ultra-long user behavior sequence: Contains various interaction types like page views, clicks, and conversions.
    • Candidate ad item: The item for which CVR is being predicted.
    • Item-side features: Ad content, display context, and associated metadata.
  • Temporal Split:

    • Training Data: The first 123 days of data.
    • Offline Evaluation Data: The remaining 7 days. This temporally consistent data split aligns with real-world deployment practices and prevents future data leakage, ensuring a realistic evaluation.
  • Domain: Advertising, specifically within ByteDance's Douyin platform.

    The choice of this dataset is effective for validating the method's performance because it is:

  • Real-world and Large-scale: Reflects the complexity and volume of industrial recommendation problems.

  • Ultra-long sequences: Provides the necessary data to test the model's ability to handle extensive user histories.

  • Temporally split: Ensures that the evaluation is robust and generalizable to future scenarios.

5.2. Evaluation Metrics

The paper uses several evaluation metrics, both offline and online, to assess the performance of LONGER.

AUC (Area Under the ROC Curve)

  • Conceptual Definition: AUC is a commonly used metric for binary classification problems. It quantifies the overall ability of a model to distinguish between positive and negative classes. An AUC value of 1.0 indicates a perfect classifier, while 0.5 indicates a classifier no better than random guessing. It is robust to imbalanced datasets.
  • Mathematical Formula: The AUC is the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. $ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}^{-1}(x)) dx $ Where:
    • $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ (True Positive Rate, also known as Recall or Sensitivity)
    • $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$ (False Positive Rate)
    • $\mathrm{TP}$: Number of true positives (correctly predicted positive instances).
    • $\mathrm{FN}$: Number of false negatives (positive instances incorrectly predicted as negative).
    • $\mathrm{FP}$: Number of false positives (negative instances incorrectly predicted as positive).
    • $\mathrm{TN}$: Number of true negatives (correctly predicted negative instances).

LogLoss (Binary Cross-Entropy Loss)

  • Conceptual Definition: LogLoss is a measure of the accuracy of a probability prediction. It quantifies the penalty for incorrectly predicting probabilities. A lower LogLoss value indicates better prediction accuracy. It heavily penalizes confident wrong predictions. The formula is identical to the objective function used during training (binary cross-entropy).
  • Mathematical Formula: $ \mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)] $ Where:
    • $N$: The total number of samples.
    • $y_i$: The true label for sample $i$ (0 or 1).
    • $\hat{y}_i$: The predicted probability that $y_i = 1$ for sample $i$.
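
For reference, both offline metrics can be computed with scikit-learn; the labels and predicted probabilities below are made-up examples, not data from the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 0, 1, 0, 1])                      # illustrative labels
y_pred = np.array([0.81, 0.35, 0.10, 0.62, 0.48, 0.90])    # illustrative predicted probabilities

print("AUC:    ", roc_auc_score(y_true, y_pred))
print("LogLoss:", log_loss(y_true, y_pred))
```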

ADSS (Advertiser Score)

  • Conceptual Definition: ADSS (Advertiser Score) is an internal, proprietary metric used in advertising systems, specifically mentioned as one of the "most important indicators in industrial advertising systems" at ByteDance. It likely reflects the value or satisfaction of advertisers, potentially aggregating factors like return on investment, click-through rates, conversion rates, or other engagement metrics weighted by advertiser spend.
  • Mathematical Formula: Not provided in the paper, as it's an internal proprietary metric.
  • Symbol Explanation: Not applicable, since no formula is provided.

ADVV (Advertiser Value)

  • Conceptual Definition: ADVV (Advertiser Value) is another internal, proprietary metric in advertising systems, also cited as a key indicator. It is likely related to the overall revenue generated for advertisers or the platform itself, reflecting the economic impact of the recommendations.
  • Mathematical Formula: Not provided in the paper, as it's an internal proprietary metric.
  • Symbol Explanation: Not applicable, since no formula is provided.

Order/U (Orders per User)

  • Conceptual Definition: Orders per User (Order/U) is an e-commerce metric that measures the average number of orders placed by a single user. It indicates user engagement and purchasing frequency. A higher Order/U suggests more active and valuable users.
  • Mathematical Formula: Not provided in the paper. It is typically calculated as: $ \mathrm{Order/U} = \frac{\text{Total Number of Orders}}{\text{Total Number of Unique Users}} $ Where:
    • Total Number of Orders: Sum of all orders within a specific period.
    • Total Number of Unique Users: Count of distinct users who placed at least one order in the same period.

GMV/U (Gross Merchandise Volume per User)

  • Conceptual Definition: Gross Merchandise Volume per User (GMV/U) is an e-commerce metric that calculates the average total value of sales (before deducting returns or cancellations) generated by each unique user. It reflects the economic value generated per user. A higher GMV/U indicates that users are making more valuable purchases.
  • Mathematical Formula: Not provided in the paper. It is typically calculated as: $ \mathrm{GMV/U} = \frac{\text{Total Gross Merchandise Volume}}{\text{Total Number of Unique Users}} $ Where:
    • Total Gross Merchandise Volume: The sum of the value of all merchandise sold within a specific period.
    • Total Number of Unique Users: Count of distinct users who generated GMV in the same period.

5.3. Baselines

The paper compares LONGER against several strong baselines, categorized by their ability to model short- or long-range user behavior. All models were trained with the same preprocessing pipeline and hyperparameter tuning on a cluster of 48 A100 GPUs.

Short-Sequence Methods:

  • DIN (Recent50) [30]: Deep Interest Network (DIN) applied to only the 50 most recent interactions. This represents a strong attention-based baseline for short sequences.
  • TWIN [3]: TWo-stage Interest Network, a two-stage retrieval approach that first selects the top-$k$ most relevant items from the user history, so the downstream model effectively operates on a short retrieved sequence.

Long-Sequence Methods:

  • SumPooling: A simple baseline that aggregates (e.g., sums or averages) all item embeddings in the long sequence to form a single user representation. It captures long-term general preferences but loses sequential and fine-grained information.

  • DIN [30]: Deep Interest Network applied to an extended behavior history. While DIN is primarily for short sequences, applying it to longer sequences directly without specific optimizations for length would typically suffer from scalability issues.

  • HSTU [25]: Hierarchical Sequential Transduction Units, a Transformer-based model designed for modeling long sequences, consisting of a stack of self-attention layers.

  • Transformer [6]: A vanilla Transformer architecture applied to long sequences. This is a crucial baseline to highlight the computational challenges and the need for LONGER's optimizations. It's often used as the "base" model to show the relative improvements.

    These baselines are representative because they cover a spectrum of approaches: simple aggregation, established attention models (both short and extended), and recent Transformer-based solutions for long sequences. This allows LONGER to demonstrate its advantages in both accuracy and efficiency over different paradigms.

6. Results & Analysis

6.1. Core Results Analysis

The experiments validate LONGER's effectiveness in both offline metrics and online A/B tests.

6.1.1 Comparison of Existing Methods

The following are the results from Table 1 of the original paper:

| Metric | Base | SumPooling | TWIN | DIN (Recent50) | DIN | HSTU | Transformer | LONGER |
|---|---|---|---|---|---|---|---|---|
| AUC ↑ | 0.83968 | 0.84201 | 0.84472 | 0.84698 | 0.84982 | 0.84994 | 0.85111 | 0.85290 |
| LogLoss ↓ | 0.48758 | 0.48538 | 0.48168 | 0.47830 | 0.47452 | 0.47490 | 0.47293 | 0.47103 |
| ΔAUC (%) | – | +0.28 | +0.60 | +0.87 | +1.21 | +1.22 | +1.36 | +1.57 |
| ΔLogLoss (%) | – | -0.45 | -1.21 | -1.90 | -2.68 | -2.60 | -3.00 | -3.39 |

Table 1: Evaluation of methods on industrial datasets

Analysis:

  • Overall Superiority: LONGER consistently outperforms all baselines in both AUC (higher is better) and LogLoss (lower is better). It achieves the highest AUC of 0.85290 and the lowest LogLoss of 0.47103.
  • Relative Improvements:
    • Compared to the Base model, LONGER shows a relative improvement of +1.57% in AUC and -3.39% in LogLoss.
    • Crucially, even against the Transformer baseline (which represents a vanilla Transformer on long sequences), LONGER achieves a +0.21% higher AUC (0.85290 vs. 0.85111). The paper emphasizes that a 0.1% AUC improvement is considered significant in industrial online A/B tests. This highlights LONGER's ability to extract more valuable information from long sequences while being more efficient.
  • Baseline Performance:
    • SumPooling performs the worst among non-base models, indicating that simple aggregation is insufficient for capturing complex user interests.
    • TWIN and DIN (Recent50) show improvements but are limited by their short-sequence or two-stage nature.
    • DIN and HSTU (Transformer-based) show better performance, demonstrating the value of modeling longer sequences, but are still surpassed by LONGER and even the vanilla Transformer baseline. Notably, HSTU's LogLoss is slightly higher than DIN's despite its higher AUC, suggesting its probability estimates are slightly less well calibrated.
  • Efficiency (Implicit): While Table 1 focuses on performance, the abstract states that LONGER achieves this while being GPU-efficient, implying a better trade-off compared to a vanilla Transformer which would likely incur higher computational costs for comparable performance. The following ablation study will clarify this.

6.1.2 Ablation Study

The following are the results from Table 2 of the original paper:

| Configuration | FLOPs (×10⁹) | AUC ↑ | LogLoss ↓ | ΔAUC | ΔLogLoss |
|---|---|---|---|---|---|
| LONGER (w/o Merge, 2000) | 3.73 | 0.85111 | 0.47293 | +1.36% | -3.00% |
| + TokenMerge4 (Concat, 500) | 2.13 | 0.85232 | 0.47145 | +1.51% | -3.31% |
| + TokenMerge8 (Concat, 250) | 3.03 | 0.85291 | 0.47062 | +1.58% | -3.48% |
| Based on LONGER with TokenMerge8 | | | | | |
| + InnerTrans | 3.52 | 0.85332 | 0.47052 | +1.63% | -3.50% |
| Varying Query Number (Sampling Recent k items) | | | | | |
| Query number = 50 | 1.27 | 0.85235 | 0.47162 | +1.51% | -3.27% |
| Query number = 80 | 1.59 | 0.85248 | 0.47157 | +1.52% | -3.28% |
| Query number = 100 | 1.91 | 0.85290 | 0.47103 | +1.57% | -3.39% |
| Query number = 150 | 2.36 | 0.85290 | 0.47101 | +1.57% | -3.40% |
| Query number = 200 | 2.93 | 0.85331 | 0.47077 | +1.62% | -3.45% |
| Query number = 250 | 3.52 | 0.85332 | 0.47052 | +1.63% | -3.50% |
| Query Selection Strategies | | | | | |
| Learnable 100 | 1.91 | 0.84946 | 0.47523 | +1.17% | -2.53% |
| Recent 100 | 1.91 | 0.85290 | 0.47103 | +1.57% | -3.39% |
| Uniform 100 | 1.91 | 0.85183 | 0.47215 | +1.45% | -3.16% |
| Recent50 + Rest Unif50 | 1.91 | 0.85255 | 0.47129 | +1.53% | -3.34% |

Table 2: Ablation Study on Query Quantity and Key Components of LONGER.

Analysis:

  1. Impact of TokenMerge and InnerTrans:

    • The baseline LONGER (w/o Merge, 2000) (equivalent to a vanilla Transformer with a sequence length of 2000) has $3.73 \times 10^9$ FLOPs, AUC 0.85111, and LogLoss 0.47293.
    • Applying +TokenMerge4 (Concat, 500) (merging 4 tokens into 1, reducing the length to 500) significantly drops FLOPs to $2.13 \times 10^9$ (a ~43% reduction) while improving AUC to 0.85232 and LogLoss to 0.47145. This shows TokenMerge is highly effective in reducing computation while improving performance, not just maintaining it.
    • +TokenMerge8 (Concat, 250) further reduces the sequence length but increases FLOPs to $3.03 \times 10^9$. This counter-intuitive increase after more aggressive merging (from 500 to 250 tokens) is consistent with the ratio formula in Section 4.2.4: with concatenation the merged token dimension grows to $Kd$, so the projection term $6dK$ eventually outweighs the savings from the shorter sequence, and the parameter count grows as $\Theta_{\mathrm{merge}} = 12K^2d^2 + 13Kd$. Despite the FLOPs increase over TokenMerge4, it still significantly improves AUC (0.85291) and LogLoss (0.47062) over the w/o Merge baseline.
    • Adding + InnerTrans on top of TokenMerge8 (the "Based on LONGER with TokenMerge8" row) further improves performance to AUC 0.85332 and LogLoss 0.47052, with a FLOPs count of $3.52 \times 10^9$. This confirms that InnerTrans effectively captures intra-group interactions, leading to better representations without prohibitive computational cost. The best overall performance (AUC 0.85332, LogLoss 0.47052) is achieved with TokenMerge8 + InnerTrans.
  2. Varying Query Number (Sampling Recent k items):

    • This section explores the trade-off between the number of sampled query items ($k$) and performance/FLOPs. Performance generally improves as $k$ increases, but with diminishing returns.
    • Using Query number = 100 achieves an AUC of 0.85290 and a LogLoss of 0.47103 with $1.91 \times 10^9$ FLOPs, the same performance as Query number = 150 but with significantly fewer FLOPs.
    • The best performance in this part (AUC 0.85332, LogLoss 0.47052) is achieved at Query number = 250 with $3.52 \times 10^9$ FLOPs. This matches the TokenMerge8 + InnerTrans configuration, suggesting it is the full LONGER setup.
    • The paper highlights that Query number = 100 offers a strong trade-off: it achieves performance very close to using all 250 queries (which represents the full merged sequence length) but with only 54% of the FLOPs ($1.91 \times 10^9$ vs. $3.52 \times 10^9$). This makes it highly practical for real-world deployment where computational budgets are tight.
  3. Query Selection Strategies:

    • This compares how the $k = 100$ query items are selected.

    • Learnable 100 (randomly initialized learnable tokens) performs the worst (AUC 0.84946), indicating that relying solely on learnable tokens without direct historical context is less effective.

    • Recent 100 (selecting the 100 most recent behaviors) achieves the best performance (AUC 0.85290, LogLoss 0.47103), demonstrating the critical importance of recency in user behavior modeling.

    • Uniform 100 (uniformly sampled tokens) is better than learnable but worse than recent, suggesting a balanced view of history is less impactful than recency.

    • Recent50 + Rest Unif50 (a mix) performs slightly worse than Recent 100.

    • These findings confirm that informative behaviors, especially recent ones, are crucial for effective query construction in long-sequence modeling.

      Summary of Ablation: TokenMerge, especially when combined with InnerTrans, is highly effective at reducing computational cost while improving or maintaining accuracy, and the query number and selection strategy are equally important, with sampling recent items offering the best balance of efficiency and performance. These results underscore the deliberate design choices behind LONGER; a minimal illustrative sketch of the components follows.
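To make these components concrete, below is a minimal PyTorch sketch of the token-merge idea with a lightweight inner Transformer and recent-k query selection for the hybrid attention. The module names, dimensions, and exact wiring are illustrative assumptions made for this analysis, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of three LONGER-style ideas:
# (1) merge K consecutive behavior tokens into one merged token,
# (2) run a lightweight inner Transformer within each group before merging,
# (3) use only the most recent q merged tokens as attention queries, while all
#     merged tokens serve as keys/values (hybrid attention).
import torch
import torch.nn as nn


class TokenMergeBlock(nn.Module):
    """Merge every `merge_size` consecutive tokens into one, after a small
    intra-group ("inner") Transformer layer."""

    def __init__(self, d_model: int, merge_size: int, n_heads: int = 2):
        super().__init__()
        self.merge_size = merge_size
        # Lightweight inner Transformer applied within each group of K tokens.
        self.inner = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2 * d_model,
            batch_first=True,
        )
        # Concat-style merge: K token embeddings -> one merged embedding.
        self.merge_proj = nn.Linear(merge_size * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by merge_size.
        b, l, d = x.shape
        k = self.merge_size
        groups = x.reshape(b * (l // k), k, d)        # groups of K consecutive tokens
        groups = self.inner(groups)                   # intra-group interactions
        merged = self.merge_proj(groups.reshape(b, l // k, k * d))
        return merged                                 # (batch, seq_len // K, d_model)


class RecentQueryAttention(nn.Module):
    """Cross-attention where only the most recent `num_queries` merged tokens act
    as queries, while the full merged sequence provides keys and values."""

    def __init__(self, d_model: int, num_queries: int, n_heads: int = 2):
        super().__init__()
        self.num_queries = num_queries
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, merged: torch.Tensor) -> torch.Tensor:
        queries = merged[:, -self.num_queries:, :]    # recent-k query selection
        out, _ = self.attn(queries, merged, merged)
        return out                                    # (batch, num_queries, d_model)


if __name__ == "__main__":
    batch, seq_len, d_model = 4, 2000, 32
    behaviors = torch.randn(batch, seq_len, d_model)  # toy user-behavior embeddings
    merge = TokenMergeBlock(d_model, merge_size=8)    # 2000 tokens -> 250 merged tokens
    hybrid = RecentQueryAttention(d_model, num_queries=100)
    merged = merge(behaviors)
    print(merged.shape)                               # torch.Size([4, 250, 32])
    print(hybrid(merged).shape)                       # torch.Size([4, 100, 32])
```

In this toy configuration, merging 2000 behavior tokens with a group size of 8 yields 250 merged tokens, and restricting queries to the most recent 100 merged tokens further reduces the attention cost, mirroring the trade-offs explored in Table 2.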

6.2. Scaling Analysis

This section examines how LONGER's performance scales with key factors: sequence length, parameters, and FLOPs. The scaling behavior is generally well described by a power-law trend, $y = \alpha x^\beta + \gamma$.
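As a concrete illustration of this kind of fit, the sketch below fits $y = \alpha x^\beta + \gamma$ to a few (sequence length, AUC) pairs with SciPy's curve_fit; the data points and initial guesses are placeholders chosen for this analysis, not values taken from the paper.

```python
# Sketch of fitting the power-law trend y = alpha * x**beta + gamma used in the
# scaling analysis. The (x, y) values below are illustrative placeholders only.
import numpy as np
from scipy.optimize import curve_fit


def power_law(x, alpha, beta, gamma):
    return alpha * np.power(x, beta) + gamma


seq_lens = np.array([250, 500, 1000, 2000, 4000], dtype=float)  # tokens (placeholder)
auc = np.array([0.846, 0.848, 0.850, 0.851, 0.852])             # placeholder AUC values

# Initial guesses keep the optimizer in a sensible region for AUC-scale targets.
params, _ = curve_fit(power_law, seq_lens, auc, p0=[-0.2, -0.6, 0.853], maxfev=10000)
alpha, beta, gamma = params

pred = power_law(seq_lens, *params)
ss_res = np.sum((auc - pred) ** 2)
ss_tot = np.sum((auc - auc.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"alpha={alpha:.4f}, beta={beta:.4f}, gamma={gamma:.4f}, R^2={r2:.3f}")
```

The same three-parameter form is reused in the figures below; the reported $R^2$ values indicate how closely the observed AUC and LogLoss curves follow it.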

6.2.1 Sequence Length

The following is the image from Figure 4 of the original paper:

Figure 4: Scaling up sequence length in LONGER. The left panel plots AUC and the right panel LogLoss against sequence length (in tokens) for models of different depths, with a fitted power-law curve and its $R^2$ value for each depth.

Analysis:

  • Figure 4 shows the relationship between sequence length (number of tokens) and model performance (AUC and LogLoss) across different model depths (number of layers).
  • AUC Trend: For all model depths, increasing the sequence length consistently improves AUC. This confirms the hypothesis that modeling longer user histories provides more information and leads to better predictive accuracy. The improvement generally follows a power-law trend, indicating that early gains from increased length are more substantial than later ones.
  • LogLoss Trend: Correspondingly, LogLoss decreases as sequence length increases, signifying better calibrated probability predictions.
  • Impact of Model Depth: Deeper models (e.g., 5-layer) benefit more from longer sequences, achieving higher AUC. However, the AUC improvement slows with depth, indicating diminishing returns beyond a certain point. This suggests an optimal depth that balances model capacity with computational constraints.
  • Conclusion: Longer sequences enhance performance, particularly when paired with an appropriately chosen model depth. Beyond a certain depth, further gains from sequence length become marginal.

6.2.2 Parameters and FLOPs

The following is the image from Figure 5 of the original paper:

Figure 5: Scaling performance with respect to FLOPs and model parameters. Panel (a) plots AUC against the number of parameters and panel (b) against FLOPs, each with a fitted power-law curve ($R^2 = 0.987$ and $R^2 = 0.967$, respectively).

Analysis:

  • Figure 5(a) - Scaling with Parameters:
    • This graph shows AUC as a function of the number of parameters, achieved by scaling the hidden dimension size while fixing the number of layers (2) and input sequence length (2000).
    • AUC increases steadily with parameter count, exhibiting a strong power-law trend ($R^2 = 0.987$).
    • Implication: Increasing the model's width (hidden dimension) effectively enhances performance under a fixed architecture, with no sign of saturation within the tested parameter range. This suggests that LONGER can continue to benefit from larger model capacities.
  • Figure 5(b) - Scaling with FLOPs:
    • This graph shows AUC as a function of FLOPs, achieved by varying the number of layers and sequence length while keeping the model dimensionality fixed at 32.
    • AUC increases steadily with FLOPs, also following a strong power-law trend ($R^2 = 0.967$).
    • Implication: Increasing computational resources (by having more layers or longer effective sequence lengths) allows the model to process more complex user behavior, capturing higher-order dependencies and improving accuracy, even with a fixed model width.
  • Overall Conclusion: Both increasing model capacity (parameters) and computational resources (FLOPs) are effective ways to improve LONGER's performance, but these gains must be balanced against the practical computational and memory constraints of real-world systems; the rough estimator sketched below shows how quickly attention cost grows with sequence length.
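To give a rough feel for how compute grows with sequence length and depth, below is a back-of-the-envelope FLOPs estimator for a stack of standard Transformer encoder layers (QKV/output projections, attention scores, and a 4x feed-forward block). It is a textbook-style approximation under assumed dimensions, not the FLOPs accounting used in the paper.

```python
# Rough per-example FLOPs for a stack of standard Transformer encoder layers.
# Approximation only; the paper's exact FLOPs accounting is not reproduced here.
def transformer_flops(seq_len: int, d_model: int, n_layers: int, d_ff_mult: int = 4) -> float:
    proj = 8 * seq_len * d_model ** 2                 # Q, K, V and output projections
    attn = 4 * seq_len ** 2 * d_model                 # QK^T scores + weighted sum with V
    ffn = 2 * 2 * d_ff_mult * seq_len * d_model ** 2  # two feed-forward matmuls
    return n_layers * (proj + attn + ffn)


if __name__ == "__main__":
    d_model, n_layers = 32, 2
    for seq_len in (250, 500, 1000, 2000):
        gflops = transformer_flops(seq_len, d_model, n_layers) / 1e9
        print(f"seq_len={seq_len:>5}: ~{gflops:.2f} GFLOPs")
```

Because the $4L^2d$ attention term dominates for long sequences, shrinking the effective length (via token merging) or restricting the number of queries cuts cost much faster than reducing the hidden dimension, which is directionally consistent with the FLOPs pattern in Table 2.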

6.3. Online A/B Tests

The online A/B tests were conducted on real-world scenarios within both Douyin Ads and Douyin E-Commerce Platforms, serving as critical validation for LONGER's practical impact. The improvements are particularly significant given that the baseline models are already strong.

6.3.1 Douyin Ads Platform

The following are the results from Table 3 of the original paper:

| Advertise Type | ADSS | ADVV |
| --- | --- | --- |
| Live Streaming | +1.063% | +1.168% |
| Short Video | +2.097% | +2.151% |
| Mall | +1.816% | +1.407% |

Table 3: Douyin Ads A/B Test Results

Analysis:

  • LONGER consistently improves ADSS (Advertiser Score) and ADVV (Advertiser Value) across all three advertisement formats.
  • Short Video format shows the most substantial gains, with +2.097% ADSS and +2.151% ADVV. This indicates LONGER is particularly effective in improving advertiser outcomes for short video ads.
  • Mall format also sees strong improvements with +1.816% ADSS and +1.407% ADVV.
  • Live Streaming shows positive, albeit slightly smaller, improvements of +1.063% ADSS and +1.168% ADVV.
  • Conclusion: The consistent positive results across diverse ad formats validate LONGER's ability to enhance advertiser performance within a major advertising platform.

6.3.2 Douyin E-Commerce Service

The following are the results from Table 4 of the original paper:

| E-commerce Type | Order/U | GMV/U |
| --- | --- | --- |
| Live Streaming | +7.9222% | +6.5404% |
| Short Video | +4.6125% | +5.2771% |

Table 4: Douyin E-commerce A/B Test Results

Analysis:

  • LONGER also demonstrates significant improvements in key e-commerce metrics: Order/U (orders per user) and GMV/U (gross merchandise volume per user).

  • Live Streaming content shows very strong positive impacts: +7.9222% in Order/U and +6.5404% in GMV/U. This suggests that for live-streaming e-commerce, LONGER significantly boosts both purchase frequency and purchase value per user.

  • Short Video content also yields considerable gains: +4.6125% in Order/U and +5.2771% in GMV/U.

  • Conclusion: Both content formats benefit substantially, with Live Streaming exhibiting particularly larger improvements. This highlights LONGER's strong positive impact on user engagement and economic value generation in e-commerce settings.

    The combination of strong offline metrics, positive scaling analysis, and significant online A/B test gains across multiple influential scenarios firmly establishes LONGER's effectiveness and robustness for industrial-scale deployment.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LONGER, a novel Transformer-based framework specifically optimized for efficient and scalable modeling of ultra-long user behavior sequences in industrial recommender systems. It effectively addresses the challenges of quadratic complexity and information loss faced by previous methods by integrating several key innovations:

  • Architectural Enhancements: Global tokens for attention stabilization and information anchoring; a token merge module with lightweight InnerTrans to reduce computational complexity while preserving local interactions; and a hybrid causal attention mechanism.
  • System-Level Optimizations: A fully synchronous GPU-based training and serving framework, mixed-precision training coupled with activation recomputation for memory efficiency, and a KV cache serving strategy for faster inference (a minimal illustrative sketch of the first two appears below).

Extensive experiments on a billion-scale industrial dataset and successful online A/B tests across both advertising (Douyin Ads) and e-commerce (Douyin E-Commerce) domains at ByteDance validate LONGER's superior performance, robustness, and generalizability. It achieves significant improvements in key offline metrics (AUC, LogLoss) and crucial online business indicators (ADSS, ADVV, Order/U, GMV/U), demonstrating its capability for end-to-end ultra-long sequence modeling under real-world industrial constraints. Currently, LONGER is deployed across numerous scenarios at ByteDance, serving billions of users.
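As an illustration of how two of these memory optimizations can be combined in practice, the following is a minimal PyTorch sketch pairing automatic mixed precision with activation recomputation (gradient checkpointing) over a small Transformer stack. The toy model, dimensions, and training step are assumptions made for this analysis, not ByteDance's production training framework.

```python
# Minimal sketch (illustrative assumption, not the production system) combining
# mixed-precision training with activation recomputation (gradient checkpointing).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinySeqModel(nn.Module):
    def __init__(self, d_model: int = 32, n_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=2, dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            # Recompute this layer's activations during backward instead of storing them.
            x = checkpoint(layer, x, use_reentrant=False)
        return self.head(x[:, -1])                    # predict from the last token


use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = TinySeqModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # loss scaling for fp16 stability

x = torch.randn(8, 2000, 32, device=device)           # toy batch of behavior embeddings
y = torch.rand(8, 1, device=device)                   # toy click labels

amp_dtype = torch.float16 if use_cuda else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```

In production, recomputation is usually applied selectively to the most memory-hungry blocks and the precision policy tuned per operator; the sketch only shows the basic mechanics of the two techniques.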

7.2. Limitations & Future Work

The authors briefly mention future work directions:

  • Investigating more efficient sequence modeling techniques. This implies that while LONGER makes significant strides, there's always room for further optimization in handling extremely long sequences or higher computational demands.
  • Improving cross-domain behavior modeling in the industry. This suggests extending the model's capabilities to integrate user behaviors across different platforms or service domains, which is a complex but valuable challenge.

7.3. Personal Insights & Critique

LONGER stands out as a highly practical and impactful contribution to the field of industrial recommender systems.

Strengths:

  • Industry Relevance: The paper tackles a critical, real-world problem (ultra-long sequence modeling) that current industrial solutions often compromise on. The reported online A/B test results from ByteDance's platforms (Douyin Ads, Douyin E-Commerce) provide strong evidence of its practical value and significant business impact. The scale of deployment ("serving billions of users") is particularly impressive.
  • Balanced Innovation: LONGER doesn't just propose a new architecture; it integrates architectural innovations (global tokens, token merge, hybrid attention) with crucial engineering optimizations (synchronous framework, mixed precision, recomputation, KV cache). This holistic approach is essential for deployment at such a massive scale, where efficiency and resource management are paramount.
  • Clear Problem Framing: The paper clearly articulates the limitations of existing solutions (two-stage retrieval, indirect embeddings) and positions LONGER as a superior end-to-end alternative, preserving more information from the full user history.
  • Comprehensive Evaluation: The combination of offline metrics, detailed ablation studies, scaling analysis, and online A/B tests provides a thorough validation of the proposed method's effectiveness and efficiency.

Potential Issues & Areas for Improvement:

  • Dataset Dates: The dataset collection period is given as "October 16th, 2024 and February 23rd, 2025", which ends less than three months before the paper's listed publication date (May 7, 2025). Such a recent collection window is plausible for an industrial study but unusually tight, so the dates are worth verifying against the original paper; any inaccuracy here would not affect the technical merit but could confuse readers.
  • Proprietary Metrics: While understandable for industrial settings, the lack of mathematical definitions for ADSS, ADVV, Order/U, and GMV/U makes it harder for external researchers to fully grasp or replicate the exact performance impact. More generalizable proxy metrics or a high-level conceptual explanation of how these are calculated would be beneficial.
  • InnerTrans Parameter Count: The ablation study shows TokenMerge8 (Concat, 250) at $3.03 \times 10^9$ FLOPs, rising to $3.52 \times 10^9$ FLOPs once + InnerTrans is added. While InnerTrans is described as "lightweight," it still adds roughly $0.5 \times 10^9$ FLOPs (about 16% on top of TokenMerge8). A more detailed breakdown of the FLOPs and parameter contribution of InnerTrans relative to the overall model would further substantiate the "lightweight" claim.
  • Generalizability of Hyperparameters: The optimal query number of 100 for the hybrid attention is validated only in ByteDance's specific industrial context; while it offers valuable guidance, other domains or datasets would likely require re-tuning.

Transferability and Application: The methods and conclusions of LONGER are highly transferable to other domains involving sequential data, especially where long sequences and real-time inference are critical. This includes:

  • News Recommendation: Modeling long reading histories to personalize news feeds.

  • Video/Content Platforms: Capturing extended viewing behavior for tailored content suggestions.

  • Search Engines: Understanding long-term search queries and click histories to refine search results.

  • Healthcare: Analyzing long patient health records for personalized treatment recommendations or disease prediction.

    The token merge strategy, global tokens, and hybrid attention are general architectural patterns that can be adapted to other Transformer-based models facing similar long-sequence challenges. The engineering optimizations (mixed precision, recomputation, KV cache, synchronous training) are broadly applicable best practices for deploying large deep learning models in production environments across various industries.

In summary, LONGER is a testament to the power of combining cutting-edge deep learning architectures with practical engineering solutions to unlock the potential of rich, long-sequence data in highly demanding industrial applications.
