AiPaper
Paper status: completed

OneRec-V2 Technical Report

Published: 08/28/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

OneRec-V2 replaces OneRec-V1's encoder-decoder with a lazy decoder-only design, cutting computation by 94% and scaling to 8B parameters; it aligns recommendations with real user feedback via duration-aware reward shaping and adaptive ratio clipping, improving App Stay Time in large-scale A/B tests.

Abstract

Recent breakthroughs in generative AI have transformed recommender systems through end-to-end generation. OneRec reformulates recommendation as an autoregressive generation task, achieving high Model FLOPs Utilization. While OneRec-V1 has shown significant empirical success in real-world deployment, two critical challenges hinder its scalability and performance: (1) inefficient computational allocation where 97.66% of resources are consumed by sequence encoding rather than generation, and (2) limitations in reinforcement learning relying solely on reward models. To address these challenges, we propose OneRec-V2, featuring: (1) Lazy Decoder-Only Architecture: Eliminates encoder bottlenecks, reducing total computation by 94% and training resources by 90%, enabling successful scaling to 8B parameters. (2) Preference Alignment with Real-World User Interactions: Incorporates Duration-Aware Reward Shaping and Adaptive Ratio Clipping to better align with user preferences using real-world feedback. Extensive A/B tests on Kuaishou demonstrate OneRec-V2's effectiveness, improving App Stay Time by 0.467%/0.741% while balancing multi-objective recommendations. This work advances generative recommendation scalability and alignment with real-world feedback, representing a step forward in the development of end-to-end recommender systems.


In-depth Reading


1. Bibliographic Information

  • Title: OneRec-V2 Technical Report
  • Authors: OneRec Team. The paper is authored by an industrial team, suggesting a focus on practical, large-scale applications.
  • Journal/Conference: The paper is presented as a technical report on arXiv, a common platform for rapidly disseminating research, especially from industrial labs.
  • Publication Year: 2025. The report is dated August 28, 2025, consistent with its arXiv identifier 2508.20900 (an August 2025 submission) and with the experiments described, which were run in August 2025.
  • Abstract: The paper introduces OneRec-V2, an advancement over the OneRec-V1 generative recommender system. It addresses two key issues in V1: (1) computational inefficiency in the encoder-decoder architecture, where 97.66% of compute is spent on encoding user history rather than generating recommendations, and (2) limitations of reinforcement learning (RL) that depends solely on proxy reward models. OneRec-V2 proposes two solutions: (1) a Lazy Decoder-Only Architecture that eliminates the encoder, reducing computation by 94% and enabling scaling to 8 billion parameters, and (2) a Preference Alignment framework using real user feedback, featuring Duration-Aware Reward Shaping and Adaptive Ratio Clipping (a new RL algorithm). A/B tests on the Kuaishou app show significant improvements in user App Stay Time.
  • Original Source Link: https://arxiv.org/abs/2508.20900

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Traditional recommender systems use a multi-stage "cascaded" architecture, which is inefficient and involves optimizing different objectives at each stage. Generative recommender systems like OneRec-V1 unify this into a single end-to-end sequence generation task. However, OneRec-V1 itself suffers from major bottlenecks.
    • Key Gaps in Prior Work (OneRec-V1):
      1. Computational Inefficiency: The standard encoder-decoder architecture used in OneRec-V1 is highly inefficient for recommendation. The encoder processes long user interaction histories, consuming the vast majority of computational resources (97.66%), while the decoder, which actually generates the recommendation, uses very little. This imbalance severely limits model scalability.
      2. Suboptimal Reinforcement Learning: OneRec-V1's RL phase relies on a separate "reward model" to provide feedback. This approach is inefficient (requires extra computation for scoring) and prone to "reward hacking," where the model learns to exploit flaws in the proxy reward model instead of genuinely improving user experience.
    • Innovation: OneRec-V2 introduces a novel, highly efficient architecture and a more direct RL framework. It shifts the paradigm from encoding-and-generating to a "lazy" generation process that only computes what is necessary for the next recommendation, and it aligns the model directly with real user satisfaction signals instead of a proxy model.
  • Main Contributions / Findings (What):

    1. Lazy Decoder-Only Architecture: A new transformer design that removes the encoder entirely. User history (context) is pre-processed and fed directly into the decoder's cross-attention layers as static key-value pairs. This design dramatically reduces computational load by 94% and training resources by 90%, enabling the model to scale to 8 billion parameters under the same budget.
    2. Preference Alignment with Real-World User Interactions: A post-training framework that uses direct user feedback for RL. It includes:
      • Duration-Aware Reward Shaping: A method to create a more accurate user satisfaction signal by normalizing video watch time against the video's total duration, mitigating inherent duration bias.
      • Gradient-Bounded Policy Optimization (GBPO): A novel RL optimization algorithm that improves upon PPO/ECPO by using all training samples (no clipping) and bounding gradients to prevent instability, especially from negative samples.
    3. Significant Real-World Impact: Extensive A/B tests on the Kuaishou app (400 million daily active users) demonstrated that OneRec-V2 improved App Stay Time by 0.467% to 0.741% over OneRec-V1, without the common "seesaw effect" (improving one metric at the expense of another).

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Recommender Systems: Systems that predict user preferences and suggest relevant items (e.g., videos, products). Cascaded architectures are multi-stage pipelines (e.g., retrieval -> ranking -> re-ranking), where each stage filters or scores items.
    • Generative Recommendation: A new paradigm that frames recommendation as a sequence generation task, similar to how a language model generates text. The system "generates" a sequence of recommended item IDs directly.
    • Transformer Architecture: A neural network architecture based on the attention mechanism. It consists of:
      • Encoder: Reads and processes an input sequence to create a rich contextual representation.
      • Decoder: Generates an output sequence one token at a time, using its own previously generated tokens (self-attention) and the encoder's output (cross-attention).
    • Autoregressive Generation: A process where each element in a sequence is generated based on the elements that came before it. This is the core mechanism of models like GPT.
    • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In this context, the recommender system is the policy, a recommendation is an action, and user satisfaction (e.g., watch time) is the reward.
    • Model FLOPs Utilization (MFU): The ratio of the FLOPs throughput a model actually achieves during training to the hardware's theoretical peak. High MFU means the hardware is used efficiently.
    • Mixture-of-Experts (MoE): A technique to increase model size without a proportional increase in computation. An MoE layer consists of many "expert" sub-networks, and a routing mechanism selects only a few experts to process each input token.
    • Grouped Query Attention (GQA): An optimization of the attention mechanism where multiple query heads share a single key-value head, reducing memory usage and access costs.
  • Previous Works:

    • OneRec-V1 (Zhou et al., 2025): The direct predecessor. It established the viability of generative recommendation at an industrial scale using an encoder-decoder architecture and RL with a reward model. This paper explicitly positions OneRec-V2 as the solution to V1's identified scalability and alignment problems.
    • Other Generative Models (GPT, LLaMA): The paper draws heavy inspiration from the success and scalability of large language models, adopting their autoregressive, decoder-centric design philosophy.
  • Technological Evolution: The paper charts an evolution from inefficient, multi-stage recommenders to a more unified generative approach (OneRec-V1) and finally to a hyper-efficient, scalable, and better-aligned generative system (OneRec-V2).

  • Differentiation:

    • vs. Encoder-Decoder (OneRec-V1): The Lazy Decoder-Only architecture eliminates the encoder, which is the primary computational bottleneck in V1.
    • vs. Naive Decoder-Only: A naive decoder-only model would process the entire user history and target item together autoregressively, leading to massive, redundant computations on the static user history. The Lazy Decoder-Only design avoids this by treating the user history as a fixed context, computing its representation only once.
    • vs. Reward Model RL (OneRec-V1): OneRec-V2 uses direct user feedback (watch time) for its reward signal, avoiding the need for a separate, potentially flawed reward model. Its GBPO algorithm is also custom-designed for better stability.

4. Methodology (Core Technology & Implementation)

The technical core of OneRec-V2 is divided into its architecture and its post-training alignment framework.

1. Lazy Decoder-Only Architecture

The central idea is to concentrate all computational effort on generating the target item's representation, not on re-encoding the user's history.

  • Design Principles: The paper first analyzes the computational cost of existing architectures.

    • In a standard Encoder-Decoder model with a context length of 512, 97.66% of FLOPs are spent on Context Encoding (processing user history in the encoder and cross-attention). Only 2.34% is spent on Target Decoding (processing the item to be recommended), which is the only part that directly contributes to the training loss.

    • The paper also introduces different data organization methods to motivate its architectural choice, as shown in Image 8. It settles on New Impression Only Organization, where the loss is only computed for the newest item, which naturally fits an architecture that separates context and target; a minimal data-organization sketch follows the figure below.

      Figure 3 | (a) Naive Impression Organization: the pattern A → B is redundantly trained across multiple impressions. (b) User-Centric Organization: when training on User 2's data at time $t_3$, the model has already learned the pattern B → C from User 1's future interaction at $t_4$. (c) New Impression Only Organization: only the newest impression is trained.
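
      To make the data organization concrete, here is a minimal sketch of the New Impression Only scheme. The record layout and field names are illustrative assumptions, not the paper's actual data format.

```python
# Minimal sketch of "New Impression Only" organization (assumed record layout:
# each request carries the user's prior impressions plus one newest impression).
def new_impression_only(impressions: list) -> dict:
    """All earlier impressions become static context; only the newest one
    is used as the training target, so the loss is computed once per request."""
    *context, newest = impressions
    return {"context": context, "target": newest}

# Example: the loss would be computed only for item "D".
sample = new_impression_only(["A", "B", "C", "D"])
print(sample)  # {'context': ['A', 'B', 'C'], 'target': 'D'}
```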

  • Overall Architecture: The proposed architecture is illustrated in Image 9.

    Figure 4 | Architecture of the proposed lazy decoder-only generative recommender. The Context Processor transforms heterogeneous user feature pathways into unified context representations, which are consumed by stacked Lazy Decoder blocks (lazy cross-attention followed by causal self-attention) to produce the next-item prediction.

    It consists of two main parts:

    1. Context Processor:

      • This module takes all user context (profile, short-term and long-term behavior sequences) and processes it into a single, static representation.
      • This representation is then split into key-value (K-V) pairs for the decoder's cross-attention layers. Crucially, these K-V pairs are computed only once and shared across decoder layers, which is the "lazy" part of the design. This avoids the massive cost of an encoder.
      • The paper introduces parameters for controlling this sharing: $L_{\text{kv}}$ (the number of distinct K-V sets) and $S_{\text{kv}}$ (whether keys and values share the same representation).
    2. Lazy Decoder Block:

      • Input: The decoder takes as input the tokenized semantic IDs of the target item (e.g., [BOS, s¹, s²]).
      • Structure: Each block contains three components in sequence:
        1. Lazy Cross-Attention: The decoder's query (from the target item representation) attends to the static K-V pairs provided by the Context Processor. This is how the model incorporates user context. To be even more efficient, this layer has no K-V projection weights.
        2. Causal Self-Attention: The target item's tokens attend to each other to build a coherent representation of the item itself.
        3. Feed-Forward Network (FFN): A standard FFN for further processing. For larger models, this is replaced with a more efficient Mixture-of-Experts (MoE) layer.
      • KV-Sharing: Multiple decoder layers can share the same K-V pair from the context processor, further reducing memory. The index of the K-V pair for layer $l$ is $l_{\mathrm{kv}} = \left\lfloor \frac{l \cdot L_{\mathrm{kv}}}{N_{\mathrm{layer}}} \right\rfloor$, where $N_{\mathrm{layer}}$ is the total number of layers and $L_{\mathrm{kv}}$ is the number of distinct K-V sets (see the sketch after this list).
      • Grouped Query Attention (GQA): This is used in the cross-attention module to reduce the memory footprint of the context K-V cache.
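
    The following is a minimal, self-contained sketch of the lazy decoder idea: a static context K-V set is computed once and reused by every block, and each block applies lazy cross-attention (query projection only), causal self-attention, and an FFN. Shapes, module names, and hyperparameters are illustrative assumptions, not the paper's exact implementation (no GQA or MoE shown).

```python
# Minimal sketch of a Lazy Decoder block (assumptions: single shared K-V set,
# simplified projections, no GQA/MoE; not the paper's exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def kv_layer_index(layer: int, n_kv_sets: int, n_layers: int) -> int:
    """l_kv = floor(l * L_kv / N_layer): which shared context K-V set layer l reads."""
    return (layer * n_kv_sets) // n_layers

class LazyDecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Lazy cross-attention: only a query projection; the context K-V pairs are
        # pre-computed once by the Context Processor (no K/V projection weights here).
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.n_heads = n_heads

    def forward(self, x, ctx_k, ctx_v):
        # x: (B, T, D) target-item tokens; ctx_k, ctx_v: (B, C, D) static context K-V.
        B, T, D = x.shape
        h = D // self.n_heads
        q = self.q_proj(x).view(B, T, self.n_heads, h).transpose(1, 2)
        k = ctx_k.view(B, -1, self.n_heads, h).transpose(1, 2)
        v = ctx_v.view(B, -1, self.n_heads, h).transpose(1, 2)
        ctx = F.scaled_dot_product_attention(q, k, v)             # lazy cross-attention
        x = x + self.o_proj(ctx.transpose(1, 2).reshape(B, T, D))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), 1)
        sa, _ = self.self_attn(x, x, x, attn_mask=causal)         # causal self-attention
        x = x + sa
        return x + self.ffn(x)

# Context K-V computed once, then shared by all blocks (L_kv = 1 here).
B, C, T, D, H = 2, 512, 3, 64, 4
ctx_k, ctx_v = torch.randn(B, C, D), torch.randn(B, C, D)
target = torch.randn(B, T, D)
blocks = nn.ModuleList(LazyDecoderBlock(D, H) for _ in range(4))
for block in blocks:
    target = block(target, ctx_k, ctx_v)   # same static K-V reused at every layer
```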

2. Preference Alignment with Real-World User Interactions

This is the post-training phase to fine-tune the model using RL on real user feedback.

  • Duration-Aware Reward Shaping:

    • Problem: Using raw video watch time as a reward is biased. A user watching 30 seconds of a 60-second video is more satisfied than a user watching 30 seconds of a 10-minute video.
    • Solution: As shown in Image 12, the method normalizes watch time based on user-specific history for videos of similar duration.
      1. Bucketing: Videos in a user's history are grouped into "buckets" based on their duration using a logarithmic scale: $\mathcal{F}(d) = \lfloor \log_{\beta}(d + \epsilon) \rfloor$, where $d$ is the video duration and $\beta$ controls the bucket size.

      2. Percentile Score: For a new recommended video, its watch time $p_i$ is compared to the historical watch times of other videos in the same duration bucket, $P_{u,b}$. The reward score $q_i$ is its percentile rank: $q_i = \frac{|\{p_j \in P_{u,b} \mid p_j \le p_i\}|}{|P_{u,b}|}$

      3. Advantage Definition: This score is used to define a sparse but high-quality advantage signal $A_i$ for RL (a code sketch follows the figure below): $A_i = \begin{cases} +1, & q_i > \tau_B \text{ and } neg_i = 0, \\ -1, & neg_i = 1, \\ 0, & \text{otherwise}, \end{cases}$ where $\tau_B$ is the 75th percentile (top 25%) of scores in a batch, and $neg_i = 1$ indicates an explicit "dislike" action.

        Figure | Duration bucketing and reward computation from a user's historical videos and watch times, showing the bucket index $\lfloor \log_\beta(d_i + \epsilon) \rfloor = b$ and the target video's reward $q_i = \frac{|\{p_j \in P_{u,b} \mid p_j \le p_i\}|}{|P_{u,b}|}$.
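
        Below is a minimal sketch of the duration-aware reward computation described above. Function names, the dislike flag, and the default β and ε values are illustrative assumptions; τ_B is the batch 75th percentile per the text.

```python
# Minimal sketch of Duration-Aware Reward Shaping (function names, dislike flag,
# and default beta/eps are illustrative, not the paper's exact values).
import math
from bisect import bisect_right

def duration_bucket(duration_s: float, beta: float = 2.0, eps: float = 1e-6) -> int:
    """Bucket index F(d) = floor(log_beta(d + eps))."""
    return math.floor(math.log(duration_s + eps, beta))

def percentile_score(watch_time: float, bucket_history: list) -> float:
    """q_i = |{p_j in P_{u,b} : p_j <= p_i}| / |P_{u,b}| over the user's history
    of watch times for videos in the same duration bucket."""
    if not bucket_history:
        return 0.0
    return bisect_right(sorted(bucket_history), watch_time) / len(bucket_history)

def advantage(q_i: float, tau_b: float, disliked: bool) -> int:
    """Sparse advantage: +1 for top-quartile engagement without a dislike,
    -1 for an explicit dislike, 0 otherwise."""
    if disliked:
        return -1
    return 1 if q_i > tau_b else 0

# Example: a 45 s video watched for 30 s, compared against the user's history
# of watch times for similar-duration videos.
b = duration_bucket(45.0)
q = percentile_score(30.0, [5.0, 12.0, 28.0, 40.0])
print(b, q, advantage(q, tau_b=0.75, disliked=False))
```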

  • Gradient-Bounded Policy Optimization (GBPO):

    • Problem: Standard RL algorithms like PPO use clipping to prevent destructively large policy updates, but this discards useful training signals. For negative samples, the gradient can explode when the model's prediction probability is already low, leading to instability.
    • Solution: GBPO is a new RL objective that aims for both stability and full sample utilization.
      • Objective Function: $\mathcal{T}_{GBPO}(\theta) = -\mathbb{E}_{u \sim P(U),\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}} \left[ \frac{1}{G} \sum_{i=1}^G \frac{\pi_{\theta}(o_i|u)}{\pi'_{\theta_{old}}(o_i|u)} \cdot A_i \right]$

      • The key innovation is the dynamic denominator $\pi'_{\theta_{old}}$, which is defined differently for positive and negative rewards: $\pi'_{\theta_{old}}(o_i|u) = \begin{cases} \max(\pi_{\theta_{old}}, sg(\pi_{\theta})), & A_i \geq 0, \\ \max(\pi_{\theta_{old}}, 1 - sg(\pi_{\theta})), & A_i < 0, \end{cases}$ where $sg(\cdot)$ is the stop-gradient operator.

      • Intuition: For negative samples ($A_i < 0$), this formulation bounds the gradient magnitude, behaving similarly to the stable binary cross-entropy (BCE) loss and preventing gradient explosion when $\pi_{\theta}$ is small. For all samples, it avoids PPO's hard clipping, allowing the model to learn from every experience. The difference is visualized in Image 2, and a minimal sketch of the per-sample loss follows the figure below.

        Figure | Convergence loss versus training FLOPs per sample for different model types, annotated with model scale and the performance gap Δ.
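
        For concreteness, here is a minimal per-sample sketch of the GBPO surrogate with the stop-gradient denominator. Tensor shapes and the function signature are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the GBPO surrogate loss (shapes and signature are illustrative).
import torch

def gbpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              adv: torch.Tensor) -> torch.Tensor:
    """logp_new, logp_old: log pi_theta(o_i|u) and log pi_theta_old(o_i|u);
    adv: the sparse advantage A_i in {-1, 0, +1}."""
    pi_new = logp_new.exp()
    pi_old = logp_old.exp()
    pi_new_sg = pi_new.detach()                                   # sg(pi_theta)
    denom = torch.where(adv >= 0,
                        torch.maximum(pi_old, pi_new_sg),         # A_i >= 0
                        torch.maximum(pi_old, 1.0 - pi_new_sg))   # A_i < 0
    ratio = pi_new / denom                # gradient flows only through the numerator
    return -(ratio * adv).mean()

# Toy usage with random log-probabilities and sparse advantages.
logp_new = torch.log(torch.rand(8)).requires_grad_(True)
logp_old = torch.log(torch.rand(8))
adv = torch.tensor([1., -1., 0., 1., -1., 0., 1., -1.])
loss = gbpo_loss(logp_new, logp_old, adv)
loss.backward()
```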

5. Experimental Setup

  • Datasets: Streaming impression data from the Kuaishou app, collected between August 10-14, 2025.
  • Evaluation Metrics:
    • Pre-training Loss (Convergence Loss): The average cross-entropy loss over the three semantic tokens predicted for the target item; lower loss indicates better next-item prediction: $\mathcal{L}_{\mathrm{Gen}} = -\frac{1}{3} \sum_{i=1}^{3} \log p(s^i \mid \mathrm{BOS}, s^{<i}, \mathrm{Context})$ (a short code sketch appears after the Baselines list below).
    • Online Business Metrics:
      • App Stay Time: Total time a user spends in the app per day. This is a primary metric for user engagement and satisfaction.
      • Video View: The number of videos a user watches.
      • LT7 (Lifetime over 7 days): A measure of long-term user retention and value.
    • Efficiency Metrics:
      • GFLOPs: Billions of floating-point operations; a measure of computational cost.
      • Activations: The size of intermediate outputs stored in memory during the forward/backward pass. A measure of memory consumption.
  • Baselines:
    • Encoder-Decoder (Enc:Dec): The architecture of OneRec-V1, tested with 1:1 and 1:2 encoder-to-decoder parameter ratios.
    • Naive Decoder-Only: A standard decoder-only model that processes the full context and target sequence autoregressively.
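
    The sketch below computes the convergence loss defined above for a batch; the logits layout over the three semantic-ID positions and the vocabulary size are assumptions for illustration.

```python
# Minimal sketch of the convergence (generation) loss over 3 semantic-ID tokens.
import torch
import torch.nn.functional as F

def generation_loss(logits: torch.Tensor, target_sids: torch.Tensor) -> torch.Tensor:
    """L_Gen = -(1/3) * sum_i log p(s^i | BOS, s^<i, Context).
    logits: (B, 3, V) decoder outputs at the three target positions;
    target_sids: (B, 3) ground-truth semantic IDs. cross_entropy averages
    over all B*3 positions, matching the per-sample 1/3 average."""
    B, T, V = logits.shape
    return F.cross_entropy(logits.reshape(B * T, V), target_sids.reshape(B * T))

# Toy usage: batch of 4 items, vocabulary of 8192 semantic IDs per level (assumed).
logits = torch.randn(4, 3, 8192)
targets = torch.randint(0, 8192, (4, 3))
print(generation_loss(logits, targets))
```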

6. Results & Analysis

  • Core Results: Architecture Comparison

    • The results are shown in Table 2 (transcribed below) and Figure 5 (Image 10).

    • Finding: The Lazy Decoder-Only architecture achieves a Convergence Loss that is nearly identical to the much more computationally expensive Encoder-Decoder and Naive Decoder-Only models. For example, at the 1B parameter scale, the Lazy Decoder-Only model uses only 18.89 GFLOPs to achieve a loss of 3.27, while the Encoder-Decoder (1:2) model uses 204.21 GFLOPs for a similar loss of 3.26. This is a >10x reduction in computation for the same performance.

    • Manual Transcription of Table 2:

      Architecture Total Parameters GFLOPs Activations Convergence Loss
      Enc:Dec=1:1 0.1B 25.64 4.21B 3.59
      Enc:Dec=1:2 0.1B 17.72 2.92B 3.55
      Naive Dec-Only 0.1B 63.78 7.52B 3.54
      Lazy Dec-Only 0.1B 1.98 0.31B 3.57
      Enc:Dec=1:1 0.5B 142.73 10.79B 3.35
      Enc:Dec=1:2 0.5B 104.73 7.94B 3.32
      Naive Dec-Only 0.5B 317.68 19.28B *
      Lazy Dec-Only 0.5B 9.55 0.77B 3.33
      Enc:Dec=1:1 1B 296.36 17.63B 3.28
      Enc:Dec=1:2 1B 204.21 12.20B 3.26
      Naive Dec-Only 1B 634.83 31.53B *
      Lazy Dec-Only 1B 18.89 1.24B 3.27

      Figure 5 | Training curves for different architectures across three model scales. Despite achieving similar loss, the Lazy Decoder-Only architecture requires roughly 10× fewer FLOPs than classic architectures; E1D1 and E1D2 denote encoder-to-decoder parameter ratios of 1:1 and 1:2.

  • Ablations / Parameter Sensitivity:

    • Key-Value Sharing (Table 3): This experiment shows that aggressively sharing K-V pairs across many layers ($L_{\text{kv}} = 1$) and even using the same representation for keys and values ($S_{\text{kv}} = 1$) has no negative impact on the final loss, while being the most efficient configuration.

      • Manual Transcription of Table 3:

        Lkv Skv GFLOPs Activations Convergence Loss
        1 1 18.89 1.24B 3.27
        1 2 19.19 1.33B 3.27
        3 1 19.49 1.42B 3.27
        9 1 21.27 1.99B 3.27
        18 1 23.95 2.83B 3.27
    • Grouped Query Attention (Table 4): This study demonstrates that reducing the number of K-V head groups ($G_{\text{kv}}$) from 14 (full attention) down to 1 drastically reduces the K-V cache size (from 94M to 7M) with a negligible effect on convergence loss. This validates GQA as a key efficiency enabler.

      • Manual Transcription of Table 4:

        Gkv GFLOPs Activations KV Size Convergence Loss
        14 18.89 1.24B 94M 3.27
        7 18.74 1.19B 47M 3.28
        2 18.64 1.16B 13M 3.28
        1 18.62 1.15B 7M 3.27
  • Model Scaling

    • The scaling experiments, shown in Figure 6 (Image 11) and Table 5, confirm that the Lazy Decoder-Only architecture scales effectively.

    • Dense Models: Increasing model size from 0.1B to 8B parameters consistently lowers the convergence loss (from 3.57 to 3.19). However, gains diminish after the 2B parameter mark, suggesting a point of diminishing returns for dense scaling.

    • Sparse MoE Model: A 4B parameter MoE model (with only 0.5B active parameters per token) achieves a lower loss (3.22) than a 2B dense model (3.23), while having the same computational cost as a 0.5B dense model. This highlights the superior efficiency of sparse architectures for scaling generative recommenders.

    • Manual Transcription of Table 5:

      Model Parameters d_model n_layers n_heads embed_dim Learning Rate Convergence Loss
      Dense 0.1B 640 12 10 32 5.00e-4 3.57
      0.2B 896 12 14 45 3.54e-4 3.46
      0.5B 1408 14 11 70 2.24e-4 3.33
      1B 1792 18 14 90 1.58e-4 3.27
      2B 2304 22 18 115 1.12e-4 3.23
      4B 2944 26 23 147 7.91e-5 3.20
      8B 3584 34 28 179 5.59e-5 3.19
      MoE 4B (0.5B active) 1408 14 11 70 2.24e-4 3.22

      Figure 6 | Training dynamics of lazy decoder architectures across model scales. Convergence loss decreases from 3.57 (0.1B) to 3.19 (8B). The 4B MoE variant (0.5B activated), denoted 4BA0.5B, delivers competitive performance while preserving computational efficiency.

  • Preference Alignment with RL

    • The paper investigates training with samples from the traditional pipeline (w/o OneRec Samples) versus including samples generated by OneRec itself (w/ OneRec Samples, i.e., on-policy data).
    • Finding:
      1. Using user feedback (Duration-Aware Reward) successfully improves duration-related metrics like App Stay Time but can hurt other metrics like Video View.
      2. Crucially, when the model is trained on its own generated data (w/ OneRec Samples), it achieves significant improvements across all metrics. This demonstrates a powerful self-improvement loop where the model gets better at generating recommendations that align with its own optimization objective, leading to better overall performance.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents OneRec-V2, a major leap forward for industrial-scale generative recommendation. By identifying and solving the core bottlenecks in OneRec-V1, it introduces two key innovations: the Lazy Decoder-Only Architecture for unprecedented computational efficiency and scalability, and a Preference Alignment framework with GBPO that directly optimizes for real user satisfaction. The impressive online A/B test results on a massive user base validate the effectiveness of this new approach, marking a significant step towards more powerful and efficient end-to-end recommender systems.

  • Limitations & Future Work: (Note: The provided text cuts off before this section. The following is a reasoned inference.)

    • Authors' Acknowledged Limitations: The authors would likely mention that the Duration-Aware Reward is still a proxy for true satisfaction and could be improved. They might also note challenges in balancing multiple objectives beyond just watch time.
    • Future Work: Directions could include exploring even more sophisticated reward shaping techniques, applying the Lazy Decoder architecture to other domains (e.g., e-commerce), further research into scaling laws for sparse models, and investigating methods to reduce the inference latency of large generative models.
  • Personal Insights & Critique:

    • Novelty and Significance: The Lazy Decoder-Only architecture is a simple yet brilliant insight. It correctly identifies that for recommendation, the user context is static for the duration of a single prediction, and thus the expensive encoding process is largely redundant. By architecturally enforcing this separation, the authors achieve massive efficiency gains. This is a highly practical and impactful contribution for industrial systems where computational cost is a primary constraint.
    • Practicality: The GBPO algorithm is another strong practical contribution. It directly addresses the well-known instability of policy gradient methods, providing a more robust alternative tailored to the high-throughput, noisy data environment of a large-scale recommender system.
    • Critical View: The "94% reduction in computation" is a headline figure that depends on a long context length where the encoder's cost dominates. While true in that scenario, the relative gain would be smaller for shorter contexts. The incomplete nature of the provided text also prevents a full analysis of the final results tables and conclusion.
