
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Published: 08/22/2025

TL;DR Summary

Jet-Nemotron uses PostNAS to freeze pre-trained MLP weights and optimize attention blocks, creating efficient hybrid-architecture language models that match or surpass accuracy of leading models while boosting generation throughput by up to 53.6×.

Abstract

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

1.2. Authors

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai. All authors are affiliated with NVIDIA, specifically from NVlabs. Their research background primarily lies in efficient AI, neural architecture search, and large language models.

1.3. Journal/Conference

Published at (UTC): 2025-08-21T17:59:08.000Z. The paper is listed as an arXiv preprint. arXiv is a widely recognized and influential open-access preprint server for research papers in fields such as physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work rapidly before formal peer review and publication, making it a crucial platform for disseminating cutting-edge research in AI and machine learning.

1.4. Publication Year

2025

1.5. Abstract

The paper introduces Jet-Nemotron, a new family of hybrid-architecture language models that achieve or surpass the accuracy of leading full-attention models while significantly boosting generation throughput. This is accomplished through Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline. Unlike previous approaches, PostNAS begins with a pre-trained full-attention model, freezes its Multi-Layer Perceptron (MLP) weights, and then efficiently explores attention block designs. The pipeline comprises four key stages: (1) learning optimal full-attention layer placement and elimination, (2) selecting linear attention blocks, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. The Jet-Nemotron-2B model demonstrates comparable or superior accuracy to advanced models like Qwen3, Qwen2.5, Gemma3, and Llama3.2 across various benchmarks. Simultaneously, it achieves up to a 53.6× increase in generation throughput and a 6.1× speedup in prefilling. Notably, it also achieves higher accuracy on MMLU and MMLU-Pro than larger MoE full-attention models such as DeepSeek-V3-Small and Moonlight, despite their larger parameter counts.

Official Source Link: https://arxiv.org/abs/2508.15884 PDF Link: https://arxiv.org/pdf/2508.15884v3.pdf Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The rapid rise of Large Language Models (LLMs) has demonstrated exceptional accuracy across a wide range of tasks, marking a transformative era in artificial intelligence. However, their computational and memory demands pose significant challenges, particularly in long-context generation and reasoning. The self-attention mechanism, a core component of Transformer models, incurs a computational complexity of $O(n^2)$ with respect to sequence length $n$, and generates a large Key-Value (KV) cache. This quadratic complexity makes LLMs inefficient for long sequences and resource-constrained environments.

To address this, prior research has focused on two main directions:

  1. Developing efficient attention mechanisms: These aim to reduce the computational complexity to $O(n)$ and are often referred to as linear attention.

  2. Constructing hybrid models: These combine full attention and linear attention to balance accuracy and efficiency.

    However, a significant gap remains: while these efficient models improve throughput, their accuracy often lags behind state-of-the-art full-attention models, especially on demanding benchmarks like MMLU(-Pro), mathematical reasoning, retrieval, coding, and long-context tasks. Designing new LLM architectures from scratch is also prohibitively expensive due to high pre-training costs and data requirements, limiting innovation for researchers outside large organizations.

The paper's entry point or innovative idea is to tackle this challenge by introducing Post Neural Architecture Search (PostNAS). Instead of pre-training models from scratch, PostNAS begins with an already pre-trained full-attention model, reuses and freezes its MLP weights, and then efficiently explores novel attention block designs. This strategy drastically reduces training costs and accelerates the development of efficient yet accurate LLM architectures.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Introduction of PostNAS: A novel model architecture exploration paradigm for LLMs. By reusing pre-trained LLMs and freezing MLP weights, PostNAS significantly reduces the cost and risk associated with LLM architecture exploration, fostering faster and more efficient innovation.
  • Novel Insights into Efficient LLM Architecture Design: PostNAS provides insights such as the task-specific importance of attention layers and the finding that KV cache size is a more critical factor than parameter count for generation throughput.
  • Development of JetBlock: A novel linear attention block that integrates dynamic convolution with linear attention and hardware-aware architecture search. JetBlock consistently delivers significant accuracy improvements over previous linear attention blocks while maintaining comparable generation throughput.
  • Introduction of Jet-Nemotron Family: A new family of hybrid-architecture LLMs that achieves superior accuracy across a wide range of tasks (including MMLU(-Pro), mathematical reasoning, retrieval, coding, and long-context tasks) while offering significantly higher generation throughput than prior state-of-the-art full-attention models (e.g., Qwen2.5, Qwen3, Gemma3, and Llama3.2).
  • Exceptional Performance and Efficiency: The Jet-Nemotron-2B model matches or exceeds the accuracy of leading full-attention models while delivering up to a 53.6× generation throughput speedup and a 6.1× prefilling speedup on the NVIDIA H100 GPU under long-context settings (256K tokens). It also outperforms larger MoE models in accuracy on MMLU and MMLU-Pro. These findings demonstrate that Jet-Nemotron offers practical benefits for applications requiring efficient LLMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Language Models (LMs) / Large Language Models (LLMs)

Language Models are computational models designed to understand and generate human language. They learn patterns, grammar, and semantic relationships from vast amounts of text data. Large Language Models (LLMs) are a specific type of LM characterized by their massive scale (billions of parameters) and the ability to perform a wide variety of Natural Language Processing (NLP) tasks, often demonstrating emergent capabilities. They are typically based on the Transformer architecture.

Transformer Architecture

The Transformer is a neural network architecture introduced in 2017, which revolutionized NLP. Unlike previous recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers rely entirely on attention mechanisms to draw global dependencies between input and output. A Transformer typically consists of an encoder and a decoder (though LLMs often use only the decoder part). Each encoder and decoder block is composed of Multi-Head Self-Attention layers and feed-forward neural networks (often called Multi-Layer Perceptrons or MLPs), connected by residual connections and layer normalization.

Self-Attention Mechanism

The self-attention mechanism is the core innovation of the Transformer. It allows the model to weigh the importance of different words in the input sequence when processing each word. For each token in a sequence, self-attention computes three vectors: Query ($Q$), Key ($K$), and Value ($V$). These vectors are derived from the input embeddings through linear transformations. The attention scores are calculated by taking the dot product of the Query vector with all Key vectors, scaling by the square root of the Key dimension ($\sqrt{d_k}$), and applying a softmax function. The resulting weights are then used to form a weighted sum of the Value vectors, producing the output.

The formula for Self-Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the query, key, and value vectors, respectively. Each row corresponds to a token in the sequence.
  • $K^T$ is the transpose of the Key matrix.
  • $d_k$ is the dimension of the Key vectors. Scaling by $\sqrt{d_k}$ helps stabilize gradients during training.
  • softmax is an activation function that converts a vector of numbers into a probability distribution, ensuring that the attention weights sum to 1.
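To make the formula concrete, here is a minimal, illustrative PyTorch implementation of single-head scaled dot-product attention (a sketch for this analysis, not code from the paper). It materializes the full n × n score matrix, which is exactly the cost that linear attention later avoids.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, without masking."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (n, n) score matrix: O(n^2)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of the Value vectors

# Toy usage: a sequence of 8 tokens with head dimension 16.
n, d = 8, 16
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([8, 16])
```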

Computational Complexity of Self-Attention

The main bottleneck of self-attention is its computational complexity with respect to the sequence length $n$. Calculating $QK^T$ involves multiplying an $n \times d_k$ matrix by a $d_k \times n$ matrix, which results in an $n \times n$ matrix. This operation has a complexity of $O(n^2 \cdot d_k)$, or simply $O(n^2)$ if $d_k$ is considered constant. This quadratic dependency means that as the sequence length doubles, the computation required for attention quadruples, making it very expensive for long-context scenarios.

Key-Value (KV) Cache

During autoregressive generation (where the model generates one token at a time), Transformer decoders need access to the Keys and Values computed from previously generated tokens. To avoid recomputing these for every new token, they are stored in a Key-Value (KV) cache. This cache grows linearly with the sequence length. However, for very long contexts, the KV cache can consume a significant amount of memory, becoming another major bottleneck for LLM inference.
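As a rough illustration (an estimate for this analysis, not a formula from the paper), the cache footprint can be computed from the model configuration. The hypothetical helper below assumes 16-bit K/V storage and grouped-query attention with `num_kv_heads` KV heads per layer.

```python
def kv_cache_mb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache memory (MB) for one sequence: K and V per token, layer, and KV head."""
    num_elems = 2 * num_layers * num_kv_heads * head_dim * seq_len  # factor 2 = Keys and Values
    return num_elems * bytes_per_elem / (1024 ** 2)

# Example: 28 layers, 2 KV heads, head dimension 128, 64K tokens, FP16 storage.
# These values match Qwen2.5-1.5B's published configuration, and the result is
# consistent with the 1,792 MB cache size reported for it in Table 3 at a 64K context.
print(kv_cache_mb(num_layers=28, num_kv_heads=2, head_dim=128, seq_len=64 * 1024))  # 1792.0
```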

Linear Attention

Linear attention mechanisms are designed to overcome the $O(n^2)$ computational and memory complexity of self-attention by reducing it to $O(n)$. They achieve this by reordering operations or using different mathematical formulations that avoid the explicit computation of the $n \times n$ attention matrix. Instead, they often rely on associative properties or state-space models to aggregate information sequentially, making them more efficient for processing long sequences. Examples include RWKV, RetNet, Mamba, GLA, and Gated DeltaNet.
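Below is a minimal causal linear-attention recurrence in the spirit of kernelized linear attention, added here for intuition only; the blocks named above use more elaborate, gated state updates.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention with feature map phi(x) = elu(x) + 1; O(n) in sequence length."""
    phi_q, phi_k = F.elu(Q) + 1, F.elu(K) + 1
    S = torch.zeros(Q.shape[-1], V.shape[-1])  # running (d_k, d_v) key-value state
    z = torch.zeros(Q.shape[-1])               # running normalizer
    outputs = []
    for t in range(Q.shape[0]):
        S = S + torch.outer(phi_k[t], V[t])    # accumulate key-value associations
        z = z + phi_k[t]
        outputs.append((phi_q[t] @ S) / (phi_q[t] @ z + eps))
    return torch.stack(outputs)                # the n x n attention matrix is never formed

print(linear_attention(torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)).shape)
```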

Hybrid-Architecture Language Models

Hybrid-architecture language models combine full attention layers with linear attention layers within the same model. The goal is to leverage the strengths of both: full attention for its strong performance on complex tasks requiring global interactions (especially in early layers or specific critical layers), and linear attention for its efficiency in handling long contexts and reducing memory footprint. The challenge lies in finding the optimal balance and placement of these different attention types.

Neural Architecture Search (NAS)

Neural Architecture Search (NAS) is an automated technique for designing neural network architectures. Instead of manually designing network structures, NAS algorithms explore a predefined search space of architectures and evaluate their performance on a target task. This typically involves a search algorithm (e.g., reinforcement learning, evolutionary algorithms, gradient-based methods) and a performance evaluation strategy. Hardware-aware NAS specifically incorporates hardware efficiency metrics (like latency, throughput, or energy consumption) into the search objective to find architectures optimized for specific deployment environments. The traditional NAS approach for LLMs is prohibitively expensive due to the cost of pre-training and evaluating many different large architectures.

Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP), also known as a feed-forward neural network, is a fundamental component of many neural networks, including Transformers. In a Transformer block, after the attention mechanism, the data passes through an MLP. This MLP typically consists of two or more linear layers with an activation function (like GELU) in between. Its purpose is to process the information aggregated by the attention mechanism independently for each token position.

Distillation Loss

Knowledge distillation is a technique where a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). Distillation loss is a component of the student's training objective that encourages its outputs (e.g., probability distributions over tokens) to match those of the teacher model. This allows the student to learn a compressed version of the teacher's knowledge, often achieving better performance than if it were trained from scratch without the teacher's guidance. The distillation loss usually involves a Kullback-Leibler (KL) divergence or cross-entropy term between the student's and teacher's outputs.
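A common form of this objective, shown here as an illustrative sketch (the paper does not spell out its exact loss), is the temperature-scaled KL divergence between the teacher's and student's token distributions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary, averaged over all token positions."""
    t_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_log_probs.flatten(0, -2), t_probs.flatten(0, -2), reduction="batchmean")
    return kl * temperature ** 2  # conventional T^2 scaling keeps gradient magnitudes comparable
```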

3.2. Previous Works

The paper contextualizes Jet-Nemotron by contrasting it with several categories of existing LLMs:

  • Full-Attention Models: These are the state-of-the-art models that Jet-Nemotron aims to match or surpass in accuracy. Examples mentioned include:

    • Qwen2.5 [4], Qwen3 [5] (developed by Alibaba Cloud)
    • Gemma3 [41], Rec.Gemma [62], Gemma3n-E2B [42, 73] (developed by Google)
    • Llama3.2 [2] (developed by Meta)
    • MiniCPM-2B-128K [58]
    • MobileLLM-1.5B [59]
    • Smollm2-1.7B [60]
    • DeepSeek-V3-Small [6] and Moonlight [61]: These are Mixture-of-Experts (MoE) models, which also use full attention but have sparse activation, allowing them to scale up total parameters while keeping activated parameters relatively small. The paper highlights that Jet-Nemotron can achieve higher accuracy than these larger MoE models on certain benchmarks.
  • Linear Attention Models: These models replace full attention entirely with more efficient linear attention mechanisms, aiming for $O(n)$ complexity. Examples include:

    • RWKV7 [10]: A linear attention model that uses a recurrent structure.
    • RetNet [12]: Introduces retention mechanism for parallel training and recurrent inference.
    • Mamba2 [50]: Builds on Structured State Space Models (SSMs) to achieve linear complexity.
    • GLA (Gated Linear Attention) [11]: Designed for hardware-efficient training.
    • Deltanet [51] and Gated DeltaNet [32]: Based on the delta rule and data-dependent gating.
  • Hybrid Models: These models combine full attention and linear attention to strike a balance between accuracy and efficiency. Examples mentioned include:

    • Zamba2 [16]
    • Hymba [44]
    • Jamba [87]: A concurrent hybrid model that combines Transformer and Mamba.
    • Falcon-H1 [106]: Another concurrent hybrid model incorporating Mamba2 and full attention.

3.3. Technological Evolution

The evolution of efficient LLMs can be seen as a progression:

  1. Full-Attention Dominance: Initial Transformers and LLMs relied entirely on the self-attention mechanism, achieving state-of-the-art accuracy but suffering from $O(n^2)$ complexity, limiting context length and increasing inference costs.
  2. Rise of Linear Attention: To address efficiency concerns, various linear attention mechanisms emerged, aiming to reduce complexity to $O(n)$. While these offered significant speedups and memory savings, they often came with a noticeable drop in accuracy, especially on complex reasoning tasks, as they sometimes struggled to capture global dependencies as effectively as full attention.
  3. Emergence of Hybrid Architectures: Recognizing the trade-off, researchers started combining full attention and linear attention within the same model. The idea is to strategically place full attention layers where they are most critical for accuracy and use linear attention elsewhere for efficiency. Early hybrid models showed promise but still struggled to fully close the accuracy gap with pure full-attention SOTA models.
  4. PostNAS and Jet-Nemotron's Place: Jet-Nemotron represents a significant step in this evolution by introducing PostNAS. Instead of heuristic design or scratch training for hybrid models, PostNAS provides a systematic, data-driven, and hardware-aware approach to design these hybrids efficiently. By starting with a pre-trained full-attention model and strategically adapting its attention blocks, Jet-Nemotron aims to achieve the best of both worlds: SOTA accuracy with dramatically improved efficiency, effectively pushing the efficiency-accuracy trade-off frontier.

3.4. Differentiation Analysis

Compared to the main methods in related work, Jet-Nemotron's core differences and innovations stem from its PostNAS pipeline:

  • Starting Point:

    • Traditional NAS for LLMs / From-Scratch Training: Most previous work on efficient LLM architectures (whether pure linear attention or hybrid) involved pre-training models from scratch or designing architectures prior to pre-training. This is extremely costly and risky, limiting the exploration of the architectural design space.
    • PostNAS Innovation: PostNAS fundamentally changes this by starting with a pre-trained full-attention model. This dramatically reduces training costs and data requirements, making architectural exploration feasible and rapid.
  • Freezing MLP Weights:

    • Traditional Approach: Typically, all parameters are trained or fine-tuned.
    • PostNAS Innovation: By freezing the MLP weights of the pre-trained model, PostNAS focuses the architectural search specifically on the attention blocks. This simplifies the search space and leverages the rich representations already learned by the MLP in the pre-trained model, ensuring that the core knowledge is retained.
  • Systematic Coarse-to-Fine Search (Four Stages):

    • Heuristic or Limited Search: Previous hybrid models often used uniform placement of full attention or limited manual exploration.
    • PostNAS Innovation: PostNAS provides a comprehensive, multi-stage search:
      1. Optimal Full Attention Layer Placement: Instead of uniform placement, PostNAS learns where full-attention layers are most critical for specific tasks, leading to more effective use of these expensive layers.
      2. Linear Attention Block Selection: It systematically evaluates existing linear attention blocks for accuracy and efficiency, ensuring the best fit for the specific hybrid architecture.
      3. New Attention Block Design (JetBlock): It innovates by proposing JetBlock, a new linear attention block that incorporates dynamic convolution, improving expressiveness while maintaining efficiency.
      4. Hardware-Aware Hyperparameter Search: Crucially, it optimizes hyperparameters based on actual hardware generation throughput rather than just parameter count, leading to truly efficient deployment.
  • Accuracy-Efficiency Trade-off:

    • Previous Efficient Models: Often sacrificed significant accuracy for efficiency.

    • Jet-Nemotron Innovation: Through PostNAS, Jet-Nemotron demonstrably matches or exceeds the accuracy of state-of-the-art full-attention models while achieving orders of magnitude higher generation throughput, effectively breaking the previous accuracy-efficiency trade-off.

      In essence, PostNAS offers a pragmatic and powerful methodology to adapt existing, high-performing LLMs into highly efficient hybrid architectures without the immense costs of pre-training from scratch, making advanced LLM capabilities more accessible and deployable.

4. Methodology

4.1. Principles

The core idea behind Post Neural Architecture Search (PostNAS) is to efficiently design new Language Model (LM) architectures by building upon existing pre-trained full-attention models rather than starting from scratch. This strategy dramatically reduces training costs and data requirements. The theoretical basis is that the Multi-Layer Perceptron (MLP) weights of pre-trained LLMs already encode a vast amount of knowledge, which can be reused and frozen. By focusing the architecture search solely on the attention block designs, PostNAS can rapidly explore efficient model variations without retraining the entire model, thereby allowing for comprehensive architectural innovation at a lower cost. The intuition is that while MLPs learn the core features and knowledge, the attention mechanism primarily controls how information is aggregated across the sequence. By optimizing this aggregation, one can significantly impact efficiency without necessarily sacrificing the learned knowledge in the MLPs.

The process is structured as a coarse-to-fine search, systematically optimizing different aspects of the attention block design to achieve both high accuracy and high hardware efficiency.

4.2. Core Methodology In-depth (Layer by Layer)

The PostNAS pipeline is composed of four key steps, as illustrated in Figure 2 (Image 2 from the original paper). It begins with a pre-trained full-attention model where the MLP weights are frozen.

The following figure (Figure 2 from the original paper) illustrates the overall PostNAS pipeline:

Figure 2 | PostNAS Roadmap. Our pipeline starts from a pre-trained full-attention model and keeps the MLP frozen. It then performs a coarse-to-fine search for efficient attention block designs: first determining the optimal placement of full-attention layers, then selecting the best linear attention block or designing a new linear attention block, and finally searching for optimal architectural hyperparameters.

4.2.1. Full Attention Placement and Elimination

The first step in PostNAS is to strategically determine where to keep full-attention layers and where to replace them with more efficient linear attention or sliding window attention (SWA) layers. While incorporating a few full-attention layers is known to improve accuracy, their optimal placement is not trivial.

The following figure (Figure 4 from the original paper) shows the process of learning to place full attention with PostNAS:

Figure 4 | Learning to Place Full Attention with PostNAS. We train a once-for-all super network and perform beam search to identify the optimal placement of full attention layers.

To achieve this, PostNAS employs an automated method:

  1. Once-for-all Super Network Training: A once-for-all super network [45, 31] is constructed as an augmented version of the pre-trained full-attention model: each layer that could potentially remain a full-attention layer is augmented with an alternative linear attention path, so that every layer can operate either as a full-attention layer or as a linear attention layer. During training, a subnetwork (i.e., a specific combination of full and linear attention layers) is randomly sampled at each step and trained with a feature distillation loss [46, 47, 48], which encourages the subnetwork's internal representations (features) to match those of the original pre-trained full-attention model. The MLP weights are kept frozen throughout this process; only the attention block parameters and any newly introduced layers (e.g., for linear attention) are updated. A minimal sketch of this training step is shown after Figure 5 below.

  2. Beam Search for Optimal Placement: After the super network is trained, beam search [49] is performed to identify the optimal placement of full-attention layers under a given constraint (e.g., a maximum number of full-attention layers). The search objective is task-dependent. For example:

    • For MMLU, the objective is to select the configuration that yields the lowest loss on the correct answer.

    • For mathematical and retrieval tasks, the objective is to choose the configuration with the highest accuracy.

      The findings from this stage reveal:

  • Key Finding 1: Not all attention layers in a pre-trained full-attention model contribute equally to performance. Only a few layers are critically important for specific tasks (e.g., two for MMLU, two to three for retrieval).

  • Key Finding 2: Different attention layers contribute to different capabilities. Layers critical for MMLU might not be important for retrieval tasks, indicating specialization.

  • Key Finding 3: Complex tasks like mathematical reasoning show intricate patterns of attention importance, but the key layers identified for MMLU and retrieval often encompass those needed for math.

    The following figure (Figure 5 from the original paper) shows the layer placement search results and compares PostNAS with uniform placement:

Figure 5 | (a) Layer Placement Search Results on Qwen2.5-1.5B. Each grid cell represents the search objective value of the corresponding attention layer; higher values indicate greater importance. (b) Comparison Between PostNAS and Uniform Placement.

As Figure 5(b) illustrates, PostNAS significantly outperforms uniform placement in terms of accuracy, validating the effectiveness of learning layer placement.
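To make the supernet stage concrete, the following is a minimal, hypothetical sketch of one training step. The `set_active_paths` and `num_layers` helpers, the per-layer feature interface, and the MSE feature loss are assumptions made for illustration, not the authors' implementation.

```python
import random
import torch
import torch.nn.functional as F

def supernet_training_step(supernet, teacher, batch, optimizer):
    """Sample a subnetwork, run it with frozen MLPs, and distill the teacher's features."""
    # 1) Sample a subnetwork: each layer independently uses its full- or linear-attention path.
    layer_choices = [random.choice(["full", "linear"]) for _ in range(supernet.num_layers)]
    supernet.set_active_paths(layer_choices)  # hypothetical helper on the supernet

    # 2) Teacher features come from the original full-attention model (no gradients).
    with torch.no_grad():
        teacher_features = teacher(batch)     # assumed to return per-layer hidden states

    # 3) Student forward pass; MLP weights were frozen beforehand via requires_grad_(False).
    student_features = supernet(batch)

    # 4) Feature distillation: match the teacher's hidden states layer by layer.
    loss = sum(F.mse_loss(s, t) for s, t in zip(student_features, teacher_features))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After this supernet training, beam search scores candidate placements (i.e., which layers remain full attention) against the task-specific objectives described above.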

4.2.2. Linear Attention Block Selection

Once the placement of full-attention layers is determined, the next step is to select the most suitable linear attention block to replace the eliminated full-attention layers. This involves systematically evaluating various state-of-the-art linear attention blocks.

In their experiments, the authors evaluated six SOTA linear attention blocks: RWKV7 [10], RetNet [12], Mamba2 [50], GLA [11], Deltanet [51], and Gated DeltaNet [32]. The selection process involves:

  1. Efficiency Profiling: Initial profiling for training throughput is performed. RWKV7 was excluded from training experiments due to lower training throughput (possibly suboptimal kernel implementation).

  2. Accuracy Evaluation: The remaining linear attention blocks are evaluated for accuracy across diverse tasks.

  3. Inference Speed Evaluation: Their inference speed is also considered.

    This stage leverages the low training cost of the PostNAS framework, allowing for comprehensive evaluation without relying on small proxy tasks. The results indicated that Gated DeltaNet achieved the best overall accuracy. Its superior performance is attributed to two key mechanisms:

  • Data-Dependent Gating Mechanism [52]: This mechanism dynamically controls whether the model should prioritize the current token's information or the accumulated history state, allowing for adaptive information flow.

  • Delta Rule [53]: This rule updates the history state by focusing on the increment of information from the current token, efficiently managing the limited memory of the state.

    Based on these findings, Gated DeltaNet was chosen for subsequent experiments; a schematic recurrence combining the gating mechanism and the delta rule is sketched below.
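The two mechanisms can be summarized by a single recurrent state update. The sketch below is a simplified, per-token, scalar-gated form for intuition only; the actual Gated DeltaNet block uses chunked, hardware-efficient kernels and per-head parameterizations.

```python
import torch

def gated_delta_rule_step(S, q, k, v, alpha, beta):
    """One recurrent step: decay the state (gating), then write only the new increment (delta rule).

    S: (d_v, d_k) state mapping keys to values; q, k: (d_k,); v: (d_v,);
    alpha, beta: data-dependent scalars in (0, 1) produced from the current token.
    """
    S = alpha * S                                    # gate: forget part of the accumulated history
    prediction = S @ k                               # what the state currently recalls for this key
    S = S + beta * torch.outer(v - prediction, k)    # delta rule: store only the correction
    return S, S @ q                                  # updated state and the output for the query
```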

4.2.3. New Attention Block Design (JetBlock)

The PostNAS framework also facilitates the design of entirely new attention blocks. While convolution has been shown to enhance linear attention capacity [32], prior methods typically use static convolution kernels that cannot adapt to the input.

To address this, the authors introduce JetBlock. JetBlock enhances linear attention by incorporating dynamic convolution [54, 55]. The overall structure of JetBlock is shown in Figure 2 (#3).

The key innovation of JetBlock is the use of a kernel generator:

  1. Kernel Generator Module: This module dynamically produces convolution kernels based on the input features.

    • Input: It shares the same input as the Query (Q), Key (K), and Value (V) projection layers.
    • Efficiency: It starts with a linear reduction layer (reduction ratio of 8) for efficiency.
    • Activation: A GELU activation function [57] is applied.
    • Output: A final linear layer outputs the convolution kernel weights.
  2. Application of Dynamic Kernels: These dynamically generated convolution kernels are applied specifically to the Value (V) tokens. The authors found that applying them to QQ or KK tokens offered little benefit.

  3. Streamlined Computation: Redundant static convolutions on QQ and KK are removed with negligible impact on accuracy once dynamic convolution is applied to VV, further streamlining computation and improving efficiency.

  4. Time-Mixing: Gated DeltaNet (selected in the previous step) is adopted for the time-mixing component of JetBlock.

    The combination of dynamic convolution with Gated DeltaNet's data-dependent gating and delta rule allows JetBlock to achieve improved accuracy, particularly on math reasoning and retrieval tasks, while maintaining efficiency comparable to other linear attention blocks (Table 1). A structural sketch of the dynamic convolution path is given below.
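The following is a hypothetical sketch of the dynamic-convolution path. The module name, the per-sample mean pooling, and the tensor shapes are assumptions made for illustration; the released JetBlock may generate and apply its kernels differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvOnV(nn.Module):
    """Sketch of JetBlock-style dynamic convolution applied to the value tokens."""
    def __init__(self, hidden_size, v_dim, kernel_size=4, reduction=8):
        super().__init__()
        self.kernel_size = kernel_size
        self.v_dim = v_dim
        self.generator = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // reduction),  # linear reduction (ratio 8)
            nn.GELU(),
            nn.Linear(hidden_size // reduction, v_dim * kernel_size),
        )

    def forward(self, x, v):
        # x: (batch, seq, hidden) shared input of the Q/K/V projections; v: (batch, seq, v_dim).
        b, n, _ = v.shape
        # One kernel set per sample, generated from the mean-pooled input (an assumption;
        # finer-grained, per-token kernels are also possible).
        kernels = self.generator(x.mean(dim=1)).view(b, self.v_dim, self.kernel_size)
        v = F.pad(v.transpose(1, 2), (self.kernel_size - 1, 0))   # causal left-padding
        # Depthwise convolution with a different kernel per sample via grouped conv1d.
        v = v.reshape(1, b * self.v_dim, -1)
        out = F.conv1d(v, kernels.reshape(b * self.v_dim, 1, self.kernel_size),
                       groups=b * self.v_dim)
        return out.reshape(b, self.v_dim, n).transpose(1, 2)      # (batch, seq, v_dim)
```

The important property illustrated here is that the convolution weights are a function of the input rather than fixed parameters, which is what distinguishes dynamic from static convolution.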

4.2.4. Hardware-Aware Hyperparameter Search

The final step focuses on optimizing core architectural hyperparameters for actual hardware efficiency. Traditionally, the number of parameters has been a proxy for LM efficiency. However, PostNAS acknowledges that parameter count does not directly correlate with generation throughput on real hardware.

The hardware-aware architecture search aims to identify optimal architectural hyperparameters (e.g., key/value dimension, number of attention heads) by directly targeting generation throughput.

Key Finding 4: KV cache size is the most critical factor influencing long-context and long-generation throughput. Models with different parameter counts can exhibit similar generation throughput if their KV cache size is constant (Table 2). This is because the decoding stage is typically memory-bandwidth-bound rather than compute-bound. In long-context scenarios, the KV cache can consume more memory than the model weights, and reducing its size decreases memory transfer time per decoding step, enabling larger batch sizes and improving generation throughput.

Based on this finding, the search process involves:

  1. Fixing KV Cache Size: The KV cache size is fixed to match the original design.

  2. Grid Search: A small-scale grid search is performed over key dimension, value dimension, and number of attention heads.

  3. Optimization Target: The objective is to achieve a generation throughput comparable to the original configuration while potentially allowing for more parameters to achieve better accuracy.

    This step refines the JetBlock configuration, boosting its accuracy while maintaining training and inference throughput, as shown in Table 1 (row "+ Hardware-Aware Search"). A sketch of the search loop is given below.
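The simplified loop below illustrates such a search. The helper callables (`build_model`, `measure_throughput`, `evaluate_accuracy`) and the 0.98 throughput tolerance are assumptions for illustration; the grid values echo those explored in Table 2.

```python
import itertools

def hardware_aware_search(build_model, measure_throughput, evaluate_accuracy,
                          baseline_throughput, kv_cache_mb):
    """Grid-search dK/dV/heads under a fixed KV cache size, keeping the most accurate
    configuration whose measured generation throughput stays close to the baseline."""
    grid = itertools.product([64, 96, 128, 192, 256],        # key dimension dK
                             [144, 192, 256, 288, 384, 576],  # value dimension dV
                             [4, 8, 12])                      # number of attention heads
    best = None
    for d_k, d_v, n_heads in grid:
        model = build_model(d_k=d_k, d_v=d_v, n_heads=n_heads,
                            kv_cache_mb=kv_cache_mb)          # KV cache size held constant
        throughput = measure_throughput(model)                # tokens/s on the target GPU
        if throughput < 0.98 * baseline_throughput:           # hardware constraint
            continue
        accuracy = evaluate_accuracy(model)                   # e.g., math + retrieval score
        if best is None or accuracy > best[0]:
            best = (accuracy, d_k, d_v, n_heads)
    return best
```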

The overall result of these four steps is the Jet-Nemotron family, a hybrid-architecture LM that balances full attention for critical tasks, sliding window attention for specific patterns (like in MMLU), and JetBlock for efficient linear attention, all optimized for hardware performance.

5. Experimental Setup

5.1. Datasets

The training of Jet-Nemotron models involves a two-stage process using a combination of various datasets.

  • Stage 1 (PostNAS and Initial Training):
    • Nemotron-CC [63]: A large pre-training corpus. While not explicitly detailed in the paper, Nemotron-CC is likely derived from Common Crawl and refined for long-horizon pre-training.
    • Redstone-QA [64]: A QA dataset also used as part of the pre-training corpus.
    • Purpose: These datasets are used for the distillation loss training in the first stage, where MLPs are frozen and attention blocks are adapted via PostNAS.
  • Stage 2 (Full-Model Training):
    • The data mixture for this stage includes the datasets from Stage 1, augmented with more high-quality data from specific domains:

      • Math-related datasets: [65]
      • Coding-related datasets: [66, 67]
    • Purpose: This stage involves full-model training (all parameters unfrozen) to further refine the model's capabilities on these important domains.

      For evaluation, a comprehensive suite of benchmarks is used, covering diverse tasks:

  • Massive Multitask Language Understanding (MMLU) [18] & MMLU-Pro [19]: General knowledge and reasoning across 57 subjects (MMLU) and a more robust/challenging version (MMLU-Pro).
  • Mathematical Reasoning:
    • GSM8K [22]: Grade school math word problems.
    • MATH [18]: Diverse challenging math problems.
    • MathQA [21]: Math word problems with interpretable formalisms.
    • MMLU-Stem: A subset of MMLU focused on science, technology, engineering, and mathematics.
    • GPQA [20]: Graduate-level Google-Proof Question Answering.
  • Commonsense Reasoning:
    • ARC-c, ARC-e [34]: AI2 Reasoning Challenge (Challenge and Easy sets).
    • PIQA [35]: Physical Commonsense Reasoning.
    • Wino. (Winograd Schema Challenge) [36]: Ambiguous pronoun resolution.
    • OBQA [38]: Open Book Question Answering.
    • BoolQ [33]: Boolean Questions from natural text.
    • TruthQA [37]: Truthfulness of statements.
  • Retrieval:
    • FDA (Financial Domain QA) [23]
    • SWDE (Semi-structured Web Data Extraction) [24]
    • Squad (Stanford Question Answering Dataset) [25]: Reading comprehension.
  • Coding:
    • EvalPlus [40]: Rigorous evaluation for code generation.
    • CRUXEval [28]: Code Reasoning, Understanding, and Execution benchmark.
  • Long-Context Tasks:
    • LongBench [29]: A bilingual, multitask benchmark for long context understanding.

      These datasets are chosen to provide a comprehensive and challenging evaluation across a wide spectrum of LLM capabilities, from general knowledge and reasoning to specialized domains like math, coding, and long-context understanding.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, a complete explanation is provided below:

Accuracy

Conceptual Definition: Accuracy is a fundamental metric that measures the proportion of correctly predicted instances out of the total number of instances. In classification and reasoning tasks, it indicates how often the model's output matches the ground truth. Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $ Symbol Explanation:

  • Number of Correct Predictions: The count of instances where the model's output exactly matches the true label or answer.
  • Total Number of Predictions: The total count of instances for which the model made a prediction.

Generation Throughput

Conceptual Definition: Generation Throughput measures the efficiency of a language model during text generation (decoding). It quantifies the rate at which the model can produce tokens, typically expressed in tokens per second. Higher throughput indicates faster generation, which is crucial for real-time applications and serving many users. Mathematical Formula: The paper does not provide a specific formula, but it is generally calculated as: $ \text{Generation Throughput (tokens/s)} = \frac{\text{Total Tokens Generated}}{\text{Total Time Taken for Generation (seconds)}} $ Symbol Explanation:

  • Total Tokens Generated: The sum of all tokens produced by the model during the generation phase.
  • Total Time Taken for Generation (seconds): The total duration required for the model to generate those tokens.

Prefilling Speedup

Conceptual Definition: Prefilling Speedup measures how much faster a new method (Jet-Nemotron) can process the initial input prompt (prefilling phase) compared to a baseline model. The prefilling phase involves processing the entire input context to compute the initial KV cache states before autoregressive decoding begins. Mathematical Formula: $ \text{Prefilling Speedup} = \frac{\text{Prefilling Time of Baseline Model}}{\text{Prefilling Time of Jet-Nemotron}} $ Symbol Explanation:

  • Prefilling Time of Baseline Model: The time taken by the baseline model to process the initial input prompt.
  • Prefilling Time of Jet-Nemotron: The time taken by the Jet-Nemotron model to process the initial input prompt.

Decoding Speedup

Conceptual Definition: Decoding Speedup measures how much faster a new method (Jet-Nemotron) can generate tokens during the autoregressive decoding phase compared to a baseline model. The decoding phase involves generating one token at a time based on the previous tokens and the KV cache. Mathematical Formula: $ \text{Decoding Speedup} = \frac{\text{Decoding Time of Baseline Model}}{\text{Decoding Time of Jet-Nemotron}} $ Symbol Explanation:

  • Decoding Time of Baseline Model: The time taken by the baseline model to generate a set of tokens during autoregressive decoding.
  • Decoding Time of Jet-Nemotron: The time taken by the Jet-Nemotron model to generate the same set of tokens during autoregressive decoding.

KV Cache Size (in MB)

Conceptual Definition: KV Cache Size refers to the amount of memory consumed by storing the Key and Value vectors from previous tokens in the Transformer's attention mechanism. This cache is essential for efficient autoregressive decoding. The size of the KV cache grows with the context length and the number of attention layers and heads, directly impacting memory usage and generation throughput. The paper reports this in Megabytes (MB).
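Mathematical Formula (illustrative): The paper reports measured cache sizes rather than a formula; a reasonable back-of-the-envelope estimate for a grouped-query-attention Transformer with 16-bit K/V storage is $ \text{KV Cache Size (bytes)} \approx 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times n_{\text{tokens}} \times b_{\text{bytes}} $, where the factor 2 accounts for storing both Keys and Values and $b_{\text{bytes}}$ is the number of bytes per element (2 for FP16). Replacing full-attention layers with linear-attention blocks removes their contribution to $n_{\text{layers}}$, which is why Jet-Nemotron's reported cache sizes are far smaller than those of the full-attention baselines.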

5.3. Baselines

The paper compares Jet-Nemotron against a wide array of state-of-the-art models, categorized by their attention mechanism type:

  • $O(n^2)$ (Full Attention) Models: These are standard Transformer models with quadratic attention complexity, representing the performance ceiling Jet-Nemotron aims to match.

    • Qwen2.5-1.5B [4]
    • Qwen3-1.7B-Base [5]
    • Llama3.2-3B [2]
    • MiniCPM-2B-128K [58]
    • MobileLLM-1.5B [59]
    • Smollm2-1.7B [60]
    • DeepSeek-V3-Small@1.3T [6]: An MoE model with 2.2B activated and 15B total parameters.
    • Moonlight@1.2T [61]: Another MoE model with 2.2B activated and 15B total parameters.
    • Mamba2-2.7B [50]: Although Mamba2 is an $O(n)$ model, the paper's Table 3 lists it alongside the full-attention baselines for the MMLU(-Pro) and BBH comparison; in later tables (e.g., Table 4) it appears under $O(n)$. The paper's table structure is followed here.
  • $O(n)$ (Linear Attention) Models: These models prioritize efficiency with linear attention mechanisms.

    • RWKV7-1.5B [10]
    • Rec.Gemma-2B [62]
    • Gemma3n-E2B [42]
    • Hymba-1.5B [44]
    • Zamba2-1.2B [16]
    • Mamba2-2.7B [50]: Correctly categorized as $O(n)$ in Tables 4, 5, 6, 7, and 8.
  • Hybrid Models: Models combining full and linear attention.

    • Gemma2-2.6B [73]

    • Gemma3n-E2B [73]

    • Hymba-1.5B [44]

    • Zamba2-1.2B [16]

    • Falcon-H1-1.5B [106] and Falcon-H1-1.5B-deep [106]: Concurrent hybrid models that combine Mamba2 and full attention.

      These baselines are representative because they cover the spectrum of current LLM architectures (pure full attention, pure linear attention, and various hybrid approaches), including leading models from major research labs and highly efficient smaller models. This comprehensive comparison allows Jet-Nemotron to demonstrate its superior accuracy-efficiency trade-off against the best existing solutions.

5.4. Jet-Nemotron Model Family Details

The Jet-Nemotron family consists of two main models: Jet-Nemotron-2B and Jet-Nemotron-4B.

Final Model Architecture

The final Jet-Nemotron models are constructed from a stack of blocks, each containing a Multi-Layer Perceptron (MLP) layer and an attention layer. The attention layer can be one of three types: full attention, sliding window attention (SWA), or JetBlock (the new linear attention block).

The following are the results from Table 9 of the original paper:

Jet-Nemotron-2B Jet-Nemotron-4B
Total blocks 28 36
Full Attention Layers No. 15, 20 No. 18, 21, 22, 28, 33
Sliding Window Attention Layers No. 21, 22 No. 17, 20, 23, 24, 26
Vocabulary Size 151,643 151,643
Hidden Size 1,536 2,048
MLP Intermediate Size 8,960 11,008

Table 9 | The overall model architectures of Jet-Nemotron families.

  • Jet-Nemotron-2B:
    • Built upon Qwen2.5-1.5B.
    • Uses two full-attention layers (No. 15 and 20), guided by the Retrieval task.
    • Includes two sliding window attention (SWA) layers (No. 21 and 22), guided by MMLU (for pattern matching in multiple-choice tasks). The window size for SWA is 1,152.
    • All remaining attention layers are replaced with JetBlock.
  • Jet-Nemotron-4B:
    • Built upon Qwen2.5-3B.

    • Incorporates five full-attention layers (No. 18, 21, 22, 28, 33).

    • Includes five SWA layers (No. 17, 20, 23, 24, 26). The window size for SWA is 2,048.

    • The rest are JetBlock layers.

      The full attention and sliding window attention layers use grouped-query attention [105]. A small reconstruction of the Jet-Nemotron-2B layer layout is sketched below.
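For illustration, the Table 9 layout for Jet-Nemotron-2B can be reconstructed as a simple list of layer types (a sketch using the paper's 1-indexed layer numbers; not taken from released code).

```python
# Jet-Nemotron-2B layer layout per Table 9 (1-indexed layer numbers).
TOTAL_BLOCKS = 28
FULL_ATTENTION = {15, 20}   # placement guided by the retrieval task
SLIDING_WINDOW = {21, 22}   # placement guided by MMLU; window size 1,152

layer_types = [
    "full" if i in FULL_ATTENTION
    else "swa" if i in SLIDING_WINDOW
    else "jetblock"
    for i in range(1, TOTAL_BLOCKS + 1)
]
print(layer_types.count("jetblock"))  # 24 of the 28 attention layers are JetBlock
```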

The following are the results from Table 10 of the original paper:

Full Attention / SWA Jet-Nemotron-2B Jet-Nemotron-4B
Attention Head Number 12 16
Dimensions of Q/K/V 128 128
K/V Head Number 2 2
Position Embedding RoPE RoPE

Table 10 | The configurations of full-attention layers in Jet-Nemotron models.

The configurations for JetBlock are distinct from the full attention/SWA layers:

The following are the results from Table 11 of the original paper:

JetBlock Jet-Nemotron-2B Jet-Nemotron-4B
Q/K Dimension 96 128
V Dimension 256 256
Head Number 12 16
Convolution Kernel Size 4 4
DConv Generator Hidden Size 32 32

Table 11 | The configurations of JetBlock.

Training Details

The training proceeds in two stages:

  1. Stage 1: MLPs are frozen. The model is trained using a distillation loss on a combination of Nemotron-CC and Redstone-QA for 50B tokens. This stage is where PostNAS operates.
  2. Stage 2: Full-model training (all parameters trainable) is performed. Additional high-quality data from math and coding domains are added to the data mixture. Models are trained on 350B tokens in this stage.

Evaluation Details

  • Shots:
    • GSM8K and MATH: 4-shot evaluation.
    • GPQA and MMLU-Pro: 5-shot evaluation.
    • All other tasks: Zero-shot setting.
  • Implementations:
    • Coding tasks: Official implementations of EvalPlus [40] and CRUXEval [28].
    • Other tasks: LM-Evaluation-Harness [68].

Throughput Testbed

  • Hardware: DGX H100 server (8 NVIDIA H100 GPUs, 2 Intel Xeon Platinum 8480C CPUs, 2TB RAM).

  • Software: Pytorch 2.7.0, Triton 3.3.0.

  • Optimized Kernels:

    • Full-attention block: FlashAttention 2.7.4 [69].
    • Linear-attention blocks: Flash-Linear-Attention 0.2.1 [70].
  • Model Inference: Based on Transformers 4.52.0 [71].

  • Context Length: 64K tokens (unless specified).

  • GPU Usage: Single H100 GPU per test.

  • Optimization: Chunk-prefilling [72] is used to maximize decoding batch size without sacrificing prefilling throughput by adjusting chunk sizes. The highest achievable decoding throughput is reported.

    The following are the results from Table 13 of the original paper:

    Model Batch Size Chunk Size
    Qwen2.5-1.5B 32 8,192
    Qwen3-1.7B 8 16,384
    Llama3.2-1B 32 4,096
    MiniCPM-2B-128K 2 2,048
    Pythia-2.8B 2 16,384
    Smollm2-1.7B 4 16,384
    Mamba2-2.7B 128 1,024
    RWKV7-1.5B 256 2,048
    Rec.Gemma-2B 128 512
    Gemma3n-E2B 64 4,096
    Gemma2-2.6B 16 2,048
    Hymba-1.5B 64 512
    Zamba2-1.2B 8 8,192
    Jet-Nemotron-2B 128 2,048
    Jet-Nemotron-4B 64 1,024

Table 13 | Hyper-Parameters in Efficiency Measurement. We adjust the chunk size to maximize decoding batch size without compromising prefilling throughput.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Jet-Nemotron models significantly advance the efficiency-accuracy trade-off frontier for Language Models. Across a comprehensive suite of benchmarks, Jet-Nemotron matches or exceeds the accuracy of state-of-the-art full-attention models while delivering vastly superior generation throughput.

The following figure (Figure 1 from the original paper) shows the comparison between Jet-Nemotron and State-of-the-Art Efficient Language Models:

Figure 1 | Comparison Between Jet-Nemotron and State-of-the-Art Efficient Language Models. The generation throughput is measured on the NVIDIA H100 GPU under a context length of 64K tokens. Jet-Nemotron-2B delivers higher accuracy than Qwen3-1.7B-Base on MMLU-Pro while achieving 47× higher generation throughput. Jet-Nemotron-4B, despite its larger model size, still achieves higher generation throughput than all full-attention models with fewer than 2B parameters.

As Figure 1 clearly illustrates, Jet-Nemotron-2B achieves higher MMLU-Pro accuracy than Qwen3-1.7B-Base while providing approximately 47× higher generation throughput. Even the larger Jet-Nemotron-4B maintains a higher generation throughput than all full-attention models under 2B parameters, indicating exceptional efficiency.

Results on MMLU(-Pro) and BBH

The following are the results from Table 3 of the original paper:

Type Model Params (B) Cache Size (MB) Throughput (token/s) ↑ MMLU Acc. ↑ MMLU-Pro Acc. ↑ BBH Acc. ↑
O(n²) Qwen2.5-1.5B [4] 1.5 1,792 241 59.5 28.9 44.1
Qwen3-1.7B-Base [5] 1.7 7,168 61 60.3 37.8 54.2
Llama3.2-3B [2] 3.0 7,168 60 54.9 25.0 47.1
MiniCPM-2B-128K [58] 2.8 23,040 18 46.0 18.0 36.5
MobileLLM-1.5B [59] 1.5 4,320 101 26.0 9.4 27.2
Smollm2-1.7B [60] 1.7 12,288 32 48.5 18.3 35.1
DeepSeek-V3-Small@1.3T [6] 2.2/15 - - 53.3 - -
Moonlight@1.2T [61] 2.2/15 - - 60.4 28.1 43.2
Mamba2-2.7B [50] 2.7 80 2,507 25.1 8.6 25.7
O(n) RWKV7-1.5B [10] 1.5 24 3,050 41.0 13.4 15.9
Rec.Gemma-2B [62] 2.0 16 2,355 28.6 12.8 33.3
Gemma3n-E2B [42] 2.0 768 701 53.9 24.3 45.1
Hymba-1.5B [44] 1.5 240 180 49.7 17.4 29.8
Zamba2-1.2B [16] 1.2 6,114 71 43.1 14.2 19.6
Hybrid Jet-Nemotron-2B 2.0 154 2,885 60.8 39.0 58.3
Jet-Nemotron-4B 4.0 258 1,271 65.2 44.2 65.0

Table 3 | Results on MMLU(-Pro) and BBH. DeepSeek-V3-Small@1.3T and Moonlight@1.2T are MoE models with 2.2B activated and 15B total parameters, trained on 1.3T and 1.2T tokens, respectively.

Jet-Nemotron-2B excels, achieving a higher MMLU (60.8), MMLU-Pro (39.0), and BBH (58.3) accuracy than Qwen3-1.7B-Base, while offering a dramatic 47× higher throughput (2,885 vs 61 tokens/s) and a 47× smaller KV cache size (154 MB vs 7,168 MB). Notably, Jet-Nemotron-2B even surpasses MoE models like DeepSeek-V3-Small and Moonlight in MMLU and MMLU-Pro, despite their larger total and activated parameters. Jet-Nemotron-4B further boosts accuracy across these benchmarks, maintaining a significant 21× throughput advantage over Qwen3-1.7B-Base. Compared to other linear attention and hybrid models, Jet-Nemotron demonstrates substantially higher accuracy.

Results on Math Tasks

The following are the results from Table 4 of the original paper:

Type Model Throughput (token/s) ↑ Accuracy ↑: Avg. GSM8K MATH MathQA MMLU-Stem GPQA
O(n²) Qwen2.5-1.5B [4] 241 38.4 62.4 13.1 34.4 52.7 29.4
Qwen3-1.7B-Base [5] 61 42.3 62.8 16.7 46.0 50.8 27.9
Llama3.2-3B [2] 60 28.8 25.8 8.6 34.2 45.3 30.1
MiniCPM-2B-128K [58] 18 27.6 39.2 5.9 28.5 36.3 28.1
Smollm2-1.7B [60] 32 28.9 30.3 9.2 33.7 41.3 30.1
O(n) Mamba2-2.7B [50] 2,507 16.6 3.0 3.9 24.3 26.6 25.3
RWKV7-1.5B [10] 2,669 18.3 5.6 0.8 27.2 34.9 23.0
Rec.Gemma-2B [62] 2,355 20.8 13.9 7.6 25.3 28.5 28.6
Gemma3n-E2B [42] 701 28.3 24.9 10.1 31.1 45.7 31.8
Hymba-1.5B [44] 180 23.1 17.9 0.8 28.0 40.9 27.9
Hybrid Zamba2-1.2B [16] 71 24.8 28.1 5.9 26.0 36.5 27.7
Jet-Nemotron-2B 2,885 49.6 76.2 23.3 53.8 62.7 32.1
Jet-Nemotron-4B 1,271 51.3 78.7 25.2 52.5 65.6 34.6

Table 4 | Results on Math Tasks.

For math tasks, Jet-Nemotron-2B achieves an impressive average accuracy of 49.6, outperforming Qwen3-1.7B-Base by 6.3 points while being 47× faster. Traditional linear attention and hybrid models significantly lag behind Qwen3 on these tasks, highlighting Jet-Nemotron's unique ability to combine efficiency with strong mathematical reasoning.

Results on Commonsense Reasoning Tasks

The following are the results from Table 5 of the original paper:

Model Throughput (token/s) ↑ Accuracy ↑: Avg. ARC-c ARC-e PIQA Wino. OBQA BoolQ TruthQA
Qwen2.5-1.5B [4] 241 59.4 45.4 71.2 75.8 63.8 40.2 72.8 46.6
Qwen3-1.7B-Base [5] 61 60.0 44.9 68.6 75.5 63.8 39.0 79.0 48.8
Llama3.2-3B [2] 60 59.9 46.6 72.0 78.0 69.3 40.4 73.9 39.3
MiniCPM-2B-128K [58] 18 57.6 41.0 69.4 75.5 63.8 40.6 74.7 38.3
Smollm2-1.7B [60] 32 59.7 47.0 73.3 77.7 66.2 44.6 72.5 36.7
Mamba2-2.7B [50] 2,507 57.2 42.1 70.5 76.1 62.7 41.4 71.5 36.1
RWKV7-1.5B [10] 3,050 59.7 46.3 75.7 77.4 67.6 45.4 70.5 34.7
Rec.Gemma-2B [62] 2,355 46.5 29.4 41.5 66.6 54.1 27.0 72.0 34.7
Gemma3n-E2B [42] 701 58.6 43.2 73.1 77.0 60.8 40.8 76.0 39.1
Hymba-1.5B [44] 180 61.2 46.9 76.9 77.7 66.2 41.0 80.8 39.0
Zamba2-1.2B [16] 71 58.0 44.4 66.8 77.4 65.6 42.8 70.8 38.5
Jet-Nemotron-2B 2,885 62.0 48.6 74.8 75.4 65.8 40.6 81.2 47.8
Jet-Nemotron-4B 1,271 64.7 51.7 79.2 78.1 70.5 43.6 83.0 46.6

Table 5 | Results on Commonsense Tasks.

Jet-Nemotron-2B (avg. 62.0) outperforms all baseline models, including Qwen2.5 and Qwen3 which are relatively weaker in this domain. Jet-Nemotron-4B achieves even higher accuracy (avg. 64.7).

Results on Retrieval Tasks

The following are the results from Table 6 of the original paper:

Type Model Throughput (token/s) ↑ Accuracy ↑: Avg. FDA SWDE Squad
O(n²) Qwen2.5-1.5B [4] 241 72.4 82.8 86.3 48.1
Qwen3-1.7B-Base [5] 61 76.1 81.8 89.2 57.2
Llama3.2-3B [2] 60 71.3 82.3 89.6 56.4
MiniCPM-2B-128K [58] 18 72.6 72.3 86.4 59.1
Smollm2-1.7B [60] 32 68.9 78.1 82.4 46.3
O(n) Mamba2-2.7B [50] 2,507 57.0 51.7 74.3 45.1
RWKV7-1.5B [10] 3,050 58.6 54.5 73.3 48.0
Rec.Gemma-2.6B [62] 2,355 68.8 62.3 86.4 57.8
Hybrid Gemma3n-E2B [73] 701 74.0 77.3 86.4 58.2
Hymba-1.5B [44] 180 57.1 46.6 74.4 50.2
Zamba2-1.2B [16] 71 66.4 73.8 80.7 44.8
Jet-Nemotron-2B 2,885 74.2 80.4 85.7 56.6
Jet-Nemotron-4B 1,271 76.2 82.5 89.7 56.4

Table 6 | Results on Retrieval Tasks.

Jet-Nemotron-2B performs strongly, outperforming all baselines except Qwen3-1.7B-Base. The Jet-Nemotron-4B achieves the highest average accuracy (76.2) among all models, while still maintaining a 21× speedup compared to Qwen3.

Results on Coding Tasks

The following are the results from Table 7 of the original paper:

Type Model Throughput (token/s) ↑ Accuracy ↑: Avg. EvalPlus CRUXEval-I-cot CRUXEval-O-cot
O(n²) Qwen2.5-1.5B [4] 241 52.0 54.3 56.0 45.8
Qwen3-1.7B-Base [5] 61 58.9 62.8 60.4 53.4
Llama3.2-3B [2] 60 44.0 35.5 54.7 41.7
MiniCPM-2B-128K [58] 18 34.2 40.7 29.9 31.9
Smollm2-1.7B [60] 32 36.2 20.6 49.5 38.6
O(n) Mamba2-2.7B [50] 2,507 14.0 12.0 9.3 20.7
RWKV7-1.5B [10] 3,050 13.2 16.8 8.0 14.7
Rec.Gemma-2.6B [62] 2,355 36.8 29.5 46.7 34.2
Hybrid Gemma3n-E2B [73] 701 40.4 29.6 49.9 41.6
Hymba-1.5B [44] 180 30.3 31.3 32.2 27.5
Zamba2-1.2B [16] 71 20.1 12.7 21.1 26.4
Jet-Nemotron-2B 2,885 59.5 60.8 61.1 56.7
Jet-Nemotron-4B 1,271 63.5 65.6 65.9 59.0

Table 7 | Results on Coding Tasks.

Jet-Nemotron-2B performs comparably to Qwen3-1.7B-Base, and Jet-Nemotron-4B achieves higher accuracy across all coding tasks while maintaining a large generation throughput advantage.

Results on Long-Context Tasks

The following are the results from Table 8 of the original paper:

Type Model Throughput (token/s) ↑ Accuracy ↑
Avg. Few-Shot Code Sum. Single-Doc Multi-Doc
O(n²) Qwen2.5-1.5B [4] 241 39.1 63.9 57.2 26.3 28.3 19.9
Qwen3-1.7B-Base [5] 61 42.2 68.8 48.1 26.8 36.6 30.6
Llama3.2-3B [2] 60 39.9 65.2 58.0 24.3 27.6 24.6
MiniCPM-2B-128K [58] 18 41.1 57.3 59.6 25.7 33.4 29.6
Smollm2-1.7B [60] 32 21.3 38.9 28.6 16.0 13.2 9.8
O(n) Mamba2-2.7B [50] 2,507 10.3 6.4 30.2 9.1 3.5 2.5
RWKV7-1.5B [10] 3,050 14.2 10.6 21.1 18.1 12.8 8.7
Rec.Gemma-2.6B [62] 2,355 24.1 31.8 56.7 12.9 9.2 9.6
Hybrid Gemma2-2.6B [73] 388 22.9 28.7 52.0 12.6 13.9 7.3
Gemma3n-E2B [73] 701 40.4 56.4 67.2 25.6 29.3 28.6
Hymba-1.5B [44] 180 28.0 36.1 53.5 51.8 14.0 19.8
Zamba2-1.2B [16] 71 9.2 10.0 20.1 10.2 3.8 1.7
Jet-Nemotron-2B 2,885 41.1 68.7 58.1 26.0 30.8 21.9
Jet-Nemotron-4B 1,271 43.9 69.7 63.2 26.4 32.5 27.5

Table 8 | Results on Long-Context Tasks.

On LongBench up to 64K context length, Jet-Nemotron-2B (with only two full-attention layers) achieves performance comparable to models like Qwen2.5-1.5B and Gemma3n-E2B (which have considerably more full-attention layers). Jet-Nemotron-4B surpasses Qwen3-1.7B-Base while delivering a 21× speedup, demonstrating a substantial advancement in the efficiency-accuracy trade-off for long-context tasks.

Efficiency Benchmark Results

The following figure (Figure 6 from the original paper) shows the efficiency comparison across different context lengths:

Figure 6 | Efficiency Comparison Across Different Context Lengths. Jet-Nemotron-2B achieves up to a 6.14× speedup in prefilling and a 53.6× speedup in decoding compared to Qwen3-1.7B-Base.

Figure 6 details the throughput comparison between Qwen3-1.7B-Base and Jet-Nemotron-2B across various context lengths.

  • Prefilling Stage: At shorter context lengths (4K and 8K), Jet-Nemotron-2B is 1.14× and 1.15× faster than Qwen3-1.7B-Base. As the context length increases, the benefits of linear attention become more pronounced, reaching a 6.14× speedup at a 256K context length.
  • Decoding Stage: Jet-Nemotron-2B consistently and substantially outperforms Qwen3-1.7B-Base. With 2 full-attention layers and 2 groups of key-value states (versus Qwen3's 28 full-attention layers and 8 groups), its KV cache is roughly 14 × 4 = 56 times smaller, setting a theoretical maximum decoding speedup of about 56×. The model achieves a 15.6× speedup at 4K context length and nearly reaches this upper bound with a 53.6× speedup at 256K context length; a back-of-the-envelope version of this calculation is sketched below.
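
To make the arithmetic above concrete, the following is a minimal sketch (not from the paper) that estimates per-model KV-cache size and the resulting theoretical decoding speedup from the layer and KV-group counts quoted above; the head dimension and FP16 storage are assumed values chosen only for illustration.

```python
def kv_cache_bytes(full_attn_layers: int, kv_groups: int,
                   head_dim: int = 128, context_len: int = 256_000,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: 2 (K and V) x layers x KV groups x head_dim x tokens x bytes."""
    return 2 * full_attn_layers * kv_groups * head_dim * context_len * bytes_per_elem

# Layer and KV-group counts quoted above; head_dim=128 and FP16 are assumptions.
qwen3 = kv_cache_bytes(full_attn_layers=28, kv_groups=8)
jet   = kv_cache_bytes(full_attn_layers=2,  kv_groups=2)

print(f"Qwen3-1.7B-Base KV cache at 256K: {qwen3 / 2**30:.1f} GiB")
print(f"Jet-Nemotron-2B KV cache at 256K: {jet / 2**30:.2f} GiB")
print(f"Theoretical decoding speedup: {qwen3 / jet:.0f}x")  # (28/2) * (8/2) = 56
```

Under these assumptions the ratio reproduces the 56× upper bound; the measured 53.6× at 256K context sits just below it because KV-cache traffic is not the only cost during decoding.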

6.2. Ablation Studies / Parameter Analysis

PostNAS Accuracy Improvement Breakdown

The following figure (Figure 3 from the original paper) shows the PostNAS accuracy improvement breakdown:

Figure 3 | PostNAS Accuracy Improvement Breakdown. By applying PostNAS to the baseline model, we achieve significant accuracy improvements across all benchmarks.

Figure 3 illustrates the incremental accuracy gains achieved by each component of the PostNAS pipeline. Starting from a baseline model, PostNAS progressively adds accuracy:

  • +5.3 on MMLU
  • +8.4 on Math
  • +7.8 on Retrieval
  • +3.2 on Commonsense Reasoning

This breakdown confirms that each PostNAS stage contributes incremental accuracy gains, with the cumulative effect visible across all four task categories.

Hardware-Aware Architecture Search Impact

The following are the results from Table 1 of the original paper:

| Attention Block | Training Throughput ↑ | Inference Throughput ↑ | MMLU ↑ | Math ↑ | Retrieval ↑ | Commonsense ↑ |
|---|---|---|---|---|---|---|
| RWKV7 [10] | 123 | 2,542 | – | – | – | – |
| RetNet [12] | 269 | 2,535 | 53.6 | 29.9 | 63.7 | 58.1 |
| Mamba2 [50] | 273 | 3,220 | 51.5 | 26.0 | 68.9 | 57.5 |
| GLA [11] | 265 | 3,079 | 55.8 | 31.2 | 66.6 | 58.5 |
| DeltaNet [51] | 254 | 2,955 | 48.9 | 27.4 | 67.9 | 56.6 |
| Gated DeltaNet [32] | 247 | 2,980 | 55.6 | 32.3 | 69.3 | 58.7 |
| JetBlock | 233 | 2,885 | 56.3 | 32.8 | 69.9 | 58.5 |
| JetBlock + Hardware-Aware Search | 227 | 2,883 | 58.1 | 34.9 | 70.4 | 59.5 |

Table 1 | Accuracy and Efficiency of JetBlock. JetBlock is designed through Linear Attention Block Selection, New Attention Block Design, and Hardware-Aware Search in PostNAS. It achieves higher accuracy than previous linear attention blocks while maintaining comparable training and inference efficiency.

Table 1 shows the impact of the hardware-aware architecture search on JetBlock. After applying the search, JetBlock (row "+ Hardware-Aware Search") sees further accuracy boosts (e.g., MMLU from 56.3 to 58.1, Math from 32.8 to 34.9) while maintaining comparable training and inference efficiency. This demonstrates the effectiveness of optimizing hyperparameters directly for hardware performance.

The following are the results from Table 2 of the original paper:

| dK | dV | nhead | Params (B) | Cache Size (MB) | Throughput (token/s) ↑ | Retrieval Accuracy ↑ | Math Accuracy ↑ | Design |
|---|---|---|---|---|---|---|---|---|
| 256 | 288 | 4 | 1.62 | 154 | 2,969 | 67.6 | 31.3 | Original [32] |
| 192 | 384 | 4 | 1.64 | 154 | 2,961 | 69.3 | 32.3 | |
| 128 | 576 | 4 | 1.70 | 154 | 2,979 | 69.5 | 32.5 | |
| 256 | 144 | 8 | 1.66 | 154 | 2,986 | 68.3 | 32.1 | |
| 192 | 192 | 8 | 1.70 | 154 | 2,970 | 70.6 | 32.8 | |
| 128 | 288 | 8 | 1.74 | 154 | 2,971 | 69.6 | 33.2 | |
| 128 | 192 | 12 | 1.78 | 154 | 2,959 | 68.8 | 32.9 | |
| 96 | 256 | 12 | 1.84 | 154 | 2,955 | 69.6 | 34.8 | Selected |
| 64 | 384 | 12 | 1.98 | 154 | 2,952 | 70.1 | 34.2 | |

Table 2 | Detailed Results of Hardware-Aware Architecture Search. The first row is the original design [32], while the row marked "Selected" shows the new design produced by the hardware-aware architecture search.

Table 2 further details the hardware-aware architecture search applied to Gated DeltaNet. With the KV cache size fixed at 154 MB, the search varies the key dimension (dK), value dimension (dV), and number of attention heads (nhead), finding configurations that keep generation throughput essentially unchanged while improving accuracy (e.g., Retrieval and Math). The selected configuration (dK=96, dV=256, nhead=12) has a slightly higher parameter count than the original (1.84B vs. 1.62B) and similar throughput (2,955 vs. 2,969 tokens/s), but markedly better accuracy (e.g., Math 34.8 vs. 31.3). This supports Key Finding 4: KV cache size, not parameter count, dominates long-context throughput, so within a fixed cache budget PostNAS can spend slightly more parameters to buy accuracy.
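
As an illustration of how such a search can be set up, the sketch below (an assumed workflow, not the paper's code) enumerates (dK, dV, nhead) combinations and keeps only those whose recurrent state size matches the budget of the original design; note that all nine configurations in Table 2 satisfy nhead × dK × dV = 4 × 256 × 288, which is why the cache size stays at 154 MB throughout. In the real pipeline, each surviving candidate would then be trained briefly and ranked by measured throughput and accuracy.

```python
from itertools import product

# Hypothetical search-space values; the budget equals the recurrent state size
# of the original Gated DeltaNet design (n_head * d_K * d_V).
STATE_BUDGET = 4 * 256 * 288
D_K_OPTIONS = [64, 96, 128, 192, 256]
D_V_OPTIONS = [144, 192, 256, 288, 384, 576]
N_HEAD_OPTIONS = [4, 8, 12]

def candidate_configs():
    """Yield (d_k, d_v, n_head) combinations whose state size matches the fixed budget."""
    for d_k, d_v, n_head in product(D_K_OPTIONS, D_V_OPTIONS, N_HEAD_OPTIONS):
        if n_head * d_k * d_v == STATE_BUDGET:
            yield d_k, d_v, n_head

for d_k, d_v, n_head in candidate_configs():
    print(f"d_K={d_k:3d}  d_V={d_v:3d}  n_head={n_head:2d}")
```

Because the cache budget, rather than the parameter count, pins the dominant decoding cost, the search is free to pick the candidate that spends slightly more parameters for better accuracy, which is exactly the trade made by the selected (96, 256, 12) configuration.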

Controlled Study on Training Data

The following are the results from Table 14 of the original paper:

| Model | MMLU | Math | Commonsense | Retrieval |
|---|---|---|---|---|
| Qwen2.5-1.5B-continual | 56.7 | 37.6 | 59.8 | 71.5 |
| Mamba2-2.7B-continual | 41.0 | 22.5 | 56.9 | 55.9 |
| RWKV7-1.5B-continual | 49.8 | 25.2 | 59.3 | 57.2 |
| Jet-Nemotron-2B | 59.6 | 40.2 | 61.7 | 73.6 |

Table 14 | Controlled Study on Training Data. All models are pre-trained or continually pre-trained on the Jet-Nemotron stage-2 training corpus discussed in Section 3.1.

To ensure that Jet-Nemotron's superior performance stems from its architecture and PostNAS rather than from the training data alone, a controlled study was conducted. Baseline models (Qwen2.5, Mamba2, RWKV7) were continually pre-trained on the same Jet-Nemotron stage-2 training corpus. As shown in Table 14, even when trained on the same data, Jet-Nemotron-2B still significantly outperforms all of these continually pre-trained baselines across MMLU, Math, Commonsense, and Retrieval. This strengthens the claim that PostNAS and the resulting architecture are indeed the source of the improvement.

Throughput Results on Lower-End Hardware

The following are the results from Table 15 of the original paper:

| Hardware | Qwen2.5-1.5B (tokens/s) | Jet-Nemotron-2B (tokens/s) | Speedup |
|---|---|---|---|
| Jetson Orin (32GB) | 6.22 | 55.00 | 8.84× |
| RTX 3090 | 105.18 | 684.01 | 6.50× |

Table 15 | Throughput Results on Jetson Orin (32GB) and NVIDIA RTX 3090 GPUs.

The efficiency benefits of Jet-Nemotron-2B extend beyond high-end H100 GPUs. Table 15 shows substantial speedups on lower-end hardware: 8.84× on the NVIDIA Jetson Orin (32GB) and 6.50× on the NVIDIA RTX 3090 compared to Qwen2.5-1.5B. This demonstrates the broad applicability and practical value of Jet-Nemotron for various deployment scenarios, including edge devices.

Comparison to Falcon-H1

The following are the results from Table 16 of the original paper:

| Model | Throughput (token/s) ↑ | MMLU | Math | Common. | Retrieval | Code | Long-Context |
|---|---|---|---|---|---|---|---|
| Falcon-H1-1.5B [106] | 223 | 60.5 | 40.1 | 59.9 | 73.5 | 56.0 | 40.7 |
| Falcon-H1-1.5B-deep [106] | 66 | 63.5 | 46.8 | 60.6 | 74.6 | 60.3 | 33.4 |
| Jet-Nemotron-2B | 2,885 | 60.8 | 49.6 | 62.0 | 74.2 | 59.5 | 41.1 |
| Jet-Nemotron-4B | 1,271 | 65.2 | 51.3 | 64.7 | 76.2 | 63.5 | 43.9 |

Table 16 | Comparison with Falcon-H1.

Compared with the concurrent Falcon-H1 (a hybrid model combining Mamba2 and full attention), Jet-Nemotron-2B offers comparable or superior accuracy (e.g., better Math, Commonsense, and Long-Context scores than Falcon-H1-1.5B-deep) while achieving far higher generation throughput (2,885 vs. 66 tokens/s for Falcon-H1-1.5B-deep). The efficiency gap is attributed to Falcon-H1's head-wise hybrid strategy, which requires computing both Mamba2 and full attention within a single layer and thereby limits parallelism, whereas Jet-Nemotron's layer-wise alternation keeps each layer a single token mixer and is inherently more parallelizable (a schematic sketch follows). Jet-Nemotron-4B further demonstrates this advantage, outperforming both Falcon-H1 variants in accuracy while still maintaining high throughput.
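
For intuition, here is a minimal schematic sketch of the two hybridization styles (stand-in modules with assumed names and sizes; not the implementation of either model): in a layer-wise hybrid each layer runs a single token mixer, while a head-wise hybrid layer must evaluate both mixers on the same input and merge their outputs.

```python
import torch
import torch.nn as nn

class LinearAttnStub(nn.Module):
    """Stand-in for an O(n) token mixer (e.g., a JetBlock-style linear-attention block)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return self.proj(x)

class FullAttnStub(nn.Module):
    """Stand-in for an O(n^2) softmax-attention block."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class LayerWiseHybrid(nn.Module):
    """Jet-Nemotron-style alternation: each layer is either linear or full attention."""
    def __init__(self, dim, depth, full_attn_layers=(1,)):
        super().__init__()
        self.layers = nn.ModuleList(
            FullAttnStub(dim) if i in full_attn_layers else LinearAttnStub(dim)
            for i in range(depth)
        )
    def forward(self, x):
        for layer in self.layers:          # exactly one token mixer per layer
            x = x + layer(x)
        return x

class HeadWiseHybridLayer(nn.Module):
    """Falcon-H1-style layer (schematic): both mixers see the same input, outputs merged."""
    def __init__(self, dim):
        super().__init__()
        self.linear_branch = LinearAttnStub(dim)
        self.full_branch = FullAttnStub(dim)
        self.merge = nn.Linear(2 * dim, dim)
    def forward(self, x):
        both = torch.cat([self.linear_branch(x), self.full_branch(x)], dim=-1)
        return x + self.merge(both)

x = torch.randn(1, 16, 64)
print(LayerWiseHybrid(dim=64, depth=4)(x).shape)   # torch.Size([1, 16, 64])
print(HeadWiseHybridLayer(dim=64)(x).shape)        # torch.Size([1, 16, 64])
```

The layer-wise structure also makes it straightforward to control exactly how many full-attention layers remain, which is what keeps Jet-Nemotron's KV cache small.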

6.3. Summary

The experimental results demonstrate that the Jet-Nemotron models (2B and 4B) set a new state of the art in the accuracy-efficiency trade-off. They match or surpass the accuracy of leading full-attention models (Qwen3, Qwen2.5, Gemma3, Llama3.2) across diverse benchmarks (MMLU(-Pro), Math, Commonsense, Retrieval, Coding, Long-Context). Crucially, they deliver generation throughput speedups of up to 53.6× and prefilling speedups of up to 6.1× on H100 GPUs, with the benefits extending to lower-end hardware. This performance is a direct result of the PostNAS pipeline, which enables efficient, hardware-aware architectural design without the prohibitive cost of pre-training from scratch. The much smaller KV cache is a key factor behind these efficiency gains.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Jet-Nemotron, a new family of hybrid-architecture language models that sets a new benchmark for the accuracy-efficiency trade-off in LLMs. These models achieve comparable or superior accuracy to leading full-attention models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2, while delivering substantial efficiency gains, including up to 53.6× higher generation throughput and a 6.1× prefilling speedup on H100 GPUs for long contexts.

The core innovations enabling Jet-Nemotron are:

  1. Post Neural Architecture Search (PostNAS): A novel and highly efficient post-training architecture adaptation pipeline. By starting with pre-trained full-attention models and freezing their MLP weights, PostNAS drastically reduces the cost and risk of LLM architectural exploration. It systematically explores attention block designs through optimal full-attention layer placement, linear attention block selection, new attention block design, and hardware-aware hyperparameter search.

  2. JetBlock: A novel linear attention block that integrates dynamic convolution with Gated DeltaNet's data-dependent gating and delta rule. JetBlock significantly outperforms prior linear attention designs (e.g., Mamba2, GLA, Gated DeltaNet) in accuracy while maintaining comparable efficiency; a simplified sketch of these ingredients appears after this list.

    Extensive empirical results validate Jet-Nemotron's strong accuracy and exceptional inference efficiency across a broad range of benchmarks and hardware platforms.
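
For intuition only, the following is a heavily simplified, single-head sketch of the two ingredients named above: a delta-rule state update with data-dependent gating in the style of Gated DeltaNet, and a convolution whose kernels are generated from the input (a "dynamic" convolution). All names, shapes, and the placement of the dynamic convolution are assumptions made for illustration, not the paper's implementation of JetBlock.

```python
import torch
import torch.nn.functional as F

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule update (schematic, single head).

    S:     (d_k, d_v) recurrent state
    q, k:  (d_k,) query / key (k assumed L2-normalized)
    v:     (d_v,) value
    alpha: data-dependent decay (gating), scalar in (0, 1)
    beta:  data-dependent learning rate (delta rule), scalar in (0, 1)
    """
    S = alpha * (S - beta * torch.outer(k, k @ S)) + beta * torch.outer(k, v)
    return S, q @ S                                    # output is a (d_v,) vector

def dynamic_conv(v_seq, kernel_gen):
    """Depthwise causal convolution whose kernels are generated from the input itself."""
    # v_seq: (T, d_v); kernels derived from a pooled summary of the sequence (a simplification)
    kernels = kernel_gen(v_seq.mean(dim=0))            # (d_v * K,)
    K = kernels.numel() // v_seq.shape[1]
    kernels = kernels.view(v_seq.shape[1], 1, K)
    x = v_seq.t().unsqueeze(0)                         # (1, d_v, T)
    x = F.pad(x, (K - 1, 0))                           # causal left padding
    out = F.conv1d(x, kernels, groups=v_seq.shape[1])
    return out.squeeze(0).t()                          # (T, d_v)

# Tiny smoke test with hypothetical sizes: sequence length 16, d_k = d_v = 8, kernel size 4.
T, d_k, d_v, K = 16, 8, 8, 4
kernel_gen = torch.nn.Linear(d_v, d_v * K)
v_seq = dynamic_conv(torch.randn(T, d_v), kernel_gen)  # (T, d_v) filtered values

S = torch.zeros(d_k, d_v)
q = k = F.normalize(torch.randn(d_k), dim=0)
S, out = gated_delta_step(S, q, k, v_seq[0], alpha=0.95, beta=0.5)
print(out.shape)                                       # torch.Size([8])
```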

7.2. Limitations & Future Work

The authors highlight several key contributions but do not explicitly detail specific limitations or future work within the conclusion section. However, implicit limitations and future directions can be inferred:

  • Generalizability of PostNAS to other base models: While Jet-Nemotron is built on Qwen2.5, it is implied that PostNAS could be applied to any pre-trained Transformer model. Future work could rigorously test this claim across a broader range of base LLMs from different families and sizes to confirm universal applicability.
  • Optimal trade-offs in PostNAS: The paper shows the effectiveness of PostNAS's stages. Future work might explore even more sophisticated search algorithms or different search spaces within each stage (e.g., more diverse linear attention candidates, alternative dynamic convolution designs, more fine-grained hardware metrics).
  • Hardware-Awareness Beyond Throughput: While KV cache size and throughput are key, other hardware considerations like energy consumption, memory footprint on smaller devices, or specialized accelerator features could be integrated into the hardware-aware search for even broader optimization.
  • Dynamic MLP Adaptation: PostNAS freezes MLP weights for efficiency. While effective, there might be scenarios where some degree of MLP adaptation or fine-tuning, perhaps with specialized low-rank adaptation techniques, could yield further accuracy gains without completely sacrificing the efficiency of the PostNAS pipeline.
  • Theoretical Understanding of Hybrid Models: While empirical results are strong, deeper theoretical analysis into why specific full attention placements work best for certain tasks, or how JetBlock's dynamic convolution precisely enhances linear attention's representational power, could lead to more principled architectural designs.
  • Efficiency of PostNAS itself: While PostNAS makes architectural search feasible for LLMs, optimizing the search process itself (e.g., faster super network training, more efficient beam search) remains an area for improvement.

7.3. Personal Insights & Critique

This paper presents a highly impactful and practical approach to making LLMs more efficient without sacrificing their accuracy, which is a critical bottleneck for real-world deployment. The PostNAS pipeline is a particularly insightful innovation. By recognizing the immense cost of LLM pre-training and leveraging existing pre-trained models as a foundation, it democratizes LLM architecture research to some extent, making it accessible even to entities without "Google-scale" computational resources for full NAS.

One of the most profound insights is Key Finding 4: the dominance of KV cache size over parameter count for long-context generation throughput. This challenges conventional wisdom where parameter count is often seen as the primary indicator of model "size" and efficiency. This finding provides a clear, actionable target for future LLM optimization efforts.

The JetBlock itself, with its dynamic convolution and integration of Gated DeltaNet mechanisms, is a strong contribution to the linear attention literature. It tackles a known limitation of static convolutions in linear attention models, demonstrating that more expressive attention blocks can be designed within the O(n) complexity constraint.

A potential area for future exploration or a point of critique could be the degree to which PostNAS is truly "post-training." While it uses a pre-trained model as a starting point and freezes MLPs, it still involves a two-stage training process (distillation and full-model training) on potentially large datasets (50B + 350B tokens). While significantly cheaper than pre-training from scratch, it's not a zero-cost architecture adaptation. Investigating methods for purely "post-training" architectural adaptation (e.g., without extensive retraining even of attention blocks, or using much smaller adaptation datasets) could be a fascinating next step.

The paper's clear and comprehensive evaluation across numerous benchmarks and hardware types adds significant credibility to its claims. The consistently high accuracy combined with remarkable throughput gains suggests that Jet-Nemotron and the PostNAS methodology will be highly influential in the development and deployment of next-generation efficient LLMs. Its methods and conclusions could be particularly applicable in domains requiring on-device LLM inference or large-scale LLM serving where computational resources and latency are critical constraints. The rigorous, methodical approach of PostNAS could also inspire similar structured architectural search pipelines in other complex deep learning domains beyond LLMs.
