Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
TL;DR Summary
Jet-Nemotron uses PostNAS to freeze pre-trained MLP weights and optimize attention blocks, creating efficient hybrid-architecture language models that match or surpass accuracy of leading models while boosting generation throughput by up to 53.6×.
Abstract
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
1.2. Authors
Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai. All authors are affiliated with NVIDIA, specifically from NVlabs. Their research background primarily lies in efficient AI, neural architecture search, and large language models.
1.3. Journal/Conference
Published at (UTC): 2025-08-21T17:59:08.000Z. The paper is listed as an arXiv preprint. arXiv is a widely recognized and influential open-access preprint server for research papers in fields such as physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work rapidly before formal peer review and publication, making it a crucial platform for disseminating cutting-edge research in AI and machine learning.
1.4. Publication Year
2025
1.5. Abstract
The paper introduces Jet-Nemotron, a new family of hybrid-architecture language models that achieve or surpass the accuracy of leading full-attention models while significantly boosting generation throughput. This is accomplished through Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline. Unlike previous approaches, PostNAS begins with a pre-trained full-attention model, freezes its Multi-Layer Perceptron (MLP) weights, and then efficiently explores attention block designs. The pipeline comprises four key stages: (1) learning optimal full-attention layer placement and elimination, (2) selecting linear attention blocks, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. The Jet-Nemotron-2B model demonstrates comparable or superior accuracy to advanced models like Qwen3, Qwen2.5, Gemma3, and Llama3.2 across various benchmarks. Simultaneously, it achieves up to a 53.6× increase in generation throughput and a 6.1× speedup in prefilling. Notably, it also achieves higher accuracy on MMLU and MMLU-Pro than larger MoE full-attention models such as DeepSeek-V3-Small and Moonlight, despite their larger parameter counts.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2508.15884 PDF Link: https://arxiv.org/pdf/2508.15884v3.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The rapid rise of Large Language Models (LLMs) has demonstrated exceptional accuracy across a wide range of tasks, marking a transformative era in artificial intelligence. However, their computational and memory demands pose significant challenges, particularly in long-context generation and reasoning. The self-attention mechanism, a core component of Transformer models, incurs a computational complexity of $O(n^2)$ with respect to the sequence length $n$ and generates a large Key-Value (KV) cache. This quadratic complexity makes LLMs inefficient for long sequences and resource-constrained environments.
To address this, prior research has focused on two main directions:

- Developing efficient attention mechanisms: these aim to reduce the computational complexity to $O(n)$ and are often referred to as linear attention.
- Constructing hybrid models: these combine full attention and linear attention to balance accuracy and efficiency.

However, a significant gap remains: while these efficient models improve throughput, their accuracy often lags behind state-of-the-art full-attention models, especially on demanding benchmarks such as MMLU(-Pro), mathematical reasoning, retrieval, coding, and long-context tasks. Designing new LLM architectures from scratch is also prohibitively expensive due to high pre-training costs and data requirements, limiting innovation for researchers outside large organizations.
The paper's entry point or innovative idea is to tackle this challenge by introducing Post Neural Architecture Search (PostNAS). Instead of pre-training models from scratch, PostNAS begins with an already pre-trained full-attention model, reuses and freezes its MLP weights, and then efficiently explores novel attention block designs. This strategy drastically reduces training costs and accelerates the development of efficient yet accurate LLM architectures.
2.2. Main Contributions / Findings
The paper makes several primary contributions:

- Introduction of PostNAS: a novel model architecture exploration paradigm for LLMs. By reusing pre-trained LLMs and freezing MLP weights, PostNAS significantly reduces the cost and risk associated with LLM architecture exploration, fostering faster and more efficient innovation.
- Novel insights into efficient LLM architecture design: PostNAS provides insights such as the task-specific importance of attention layers and the finding that KV cache size is a more critical factor than parameter count for generation throughput.
- Development of JetBlock: a novel linear attention block that integrates dynamic convolution with linear attention and hardware-aware architecture search. JetBlock consistently delivers significant accuracy improvements over previous linear attention blocks while maintaining comparable generation throughput.
- Introduction of the Jet-Nemotron family: a new family of hybrid-architecture LLMs that achieves superior accuracy across a wide range of tasks (including MMLU(-Pro), mathematical reasoning, retrieval, coding, and long-context tasks) while offering significantly higher generation throughput than prior state-of-the-art full-attention models (e.g., Qwen2.5, Qwen3, Gemma3, and Llama3.2).
- Exceptional performance and efficiency: the Jet-Nemotron-2B model matches or exceeds the accuracy of leading full-attention models while delivering up to a 53.6× generation throughput speedup and a 6.1× prefilling speedup on an NVIDIA H100 GPU under long-context settings (256K tokens). It also outperforms larger MoE models in accuracy on MMLU and MMLU-Pro. These findings demonstrate that Jet-Nemotron offers practical benefits for applications requiring efficient LLMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Language Models (LMs) / Large Language Models (LLMs)
Language Models are computational models designed to understand and generate human language. They learn patterns, grammar, and semantic relationships from vast amounts of text data. Large Language Models (LLMs) are a specific type of LM characterized by their massive scale (billions of parameters) and the ability to perform a wide variety of Natural Language Processing (NLP) tasks, often demonstrating emergent capabilities. They are typically based on the Transformer architecture.
Transformer Architecture
The Transformer is a neural network architecture introduced in 2017, which revolutionized NLP. Unlike previous recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers rely entirely on attention mechanisms to draw global dependencies between input and output. A Transformer typically consists of an encoder and a decoder (though LLMs often use only the decoder part). Each encoder and decoder block is composed of Multi-Head Self-Attention layers and feed-forward neural networks (often called Multi-Layer Perceptrons or MLPs), connected by residual connections and layer normalization.
Self-Attention Mechanism
The self-attention mechanism is the core innovation of the Transformer. It allows the model to weigh the importance of different words in the input sequence when processing each word. For each token in a sequence, self-attention computes three vectors: Query ($Q$), Key ($K$), and Value ($V$). These vectors are derived from the input embeddings through linear transformations. The attention score is calculated by taking the dot product of the Query vector with all Key vectors, scaling by the square root of the Key dimension ($\sqrt{d_k}$), and applying a softmax function. The resulting weights are then multiplied by the Value vectors to produce the output.
The formula for Self-Attention is:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the query, key, and value vectors, respectively. Each row corresponds to a token in the sequence.
- $K^T$ is the transpose of the Key matrix.
- $d_k$ is the dimension of the Key vectors. Scaling by $\sqrt{d_k}$ helps stabilize gradients during training.
- softmax is an activation function that converts a vector of numbers into a probability distribution, ensuring that the attention weights sum to 1.
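To make the formula concrete, here is a minimal PyTorch sketch of single-head scaled dot-product attention. It is a didactic illustration, not code from the paper; the tensor shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)          # (seq_len, seq_len) attention logits
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ V                         # (seq_len, d_v) weighted values

# Toy usage: 5 tokens, d_k = d_v = 8.
Q, K, V = torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)    # shape (5, 8)
```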
Computational Complexity of Self-Attention
The main bottleneck of self-attention is its $O(n^2)$ computational complexity with respect to the sequence length $n$. Calculating $QK^T$ involves multiplying an $n \times d_k$ matrix by a $d_k \times n$ matrix, which results in an $n \times n$ matrix. This operation has a complexity of $O(n^2 d_k)$, or simply $O(n^2)$ if $d_k$ is considered constant. This quadratic dependency means that as the sequence length doubles, the computation required for attention quadruples, making it very expensive for long-context scenarios.
Key-Value (KV) Cache
During autoregressive generation (where the model generates one token at a time), Transformer decoders need access to the Keys and Values computed from previously generated tokens. To avoid recomputing these for every new token, they are stored in a Key-Value (KV) cache. This cache grows linearly with the sequence length. However, for very long contexts, the KV cache can consume a significant amount of memory, becoming another major bottleneck for LLM inference.
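The following is a minimal sketch (not any specific library's API) of why the cache grows linearly: at every decoding step, the new token's key and value are appended and reused by all later steps.

```python
import torch

class KVCache:
    """Toy per-layer KV cache: appends one token's K/V per decoding step."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k_t: torch.Tensor, v_t: torch.Tensor):
        self.keys.append(k_t)      # k_t, v_t: (num_kv_heads, head_dim)
        self.values.append(v_t)
        # The stacked tensors grow linearly with the number of cached tokens.
        return torch.stack(self.keys), torch.stack(self.values)

cache = KVCache()
for _ in range(4):                                  # four decoding steps
    K, V = cache.append(torch.randn(2, 64), torch.randn(2, 64))
print(K.shape, V.shape)                             # torch.Size([4, 2, 64]) twice
```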
Linear Attention
Linear attention mechanisms are designed to overcome the $O(n^2)$ computational and memory complexity of self-attention by reducing it to $O(n)$. They achieve this by reordering operations or using different mathematical formulations that avoid the explicit computation of the $n \times n$ attention matrix. Instead, they often rely on associative properties or state-space models to aggregate information sequentially, making them more efficient for processing long sequences. Examples include RWKV, RetNet, Mamba, GLA, and Gated DeltaNet.
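As a simplified sketch of the shared idea behind many linear attention variants (feature maps, gating, and normalization used by real methods are omitted here): instead of materializing the $n \times n$ attention matrix, a fixed-size state is updated once per token, giving $O(n)$ time and constant state memory.

```python
import torch

def linear_attention(Q, K, V):
    """Causal linear attention without the n x n matrix.
    Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k, d_v = K.shape[-1], V.shape[-1]
    S = torch.zeros(d_k, d_v)                # fixed-size state, independent of seq_len
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        S = S + torch.outer(k_t, v_t)        # accumulate key-value associations
        outputs.append(q_t @ S)              # read out with the current query
    return torch.stack(outputs)

out = linear_attention(torch.randn(6, 4), torch.randn(6, 4), torch.randn(6, 8))
```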
Hybrid-Architecture Language Models
Hybrid-architecture language models combine full attention layers with linear attention layers within the same model. The goal is to leverage the strengths of both: full attention for its strong performance on complex tasks requiring global interactions (especially in early layers or specific critical layers), and linear attention for its efficiency in handling long contexts and reducing memory footprint. The challenge lies in finding the optimal balance and placement of these different attention types.
Neural Architecture Search (NAS)
Neural Architecture Search (NAS) is an automated technique for designing neural network architectures. Instead of manually designing network structures, NAS algorithms explore a predefined search space of architectures and evaluate their performance on a target task. This typically involves a search algorithm (e.g., reinforcement learning, evolutionary algorithms, gradient-based methods) and a performance evaluation strategy. Hardware-aware NAS specifically incorporates hardware efficiency metrics (like latency, throughput, or energy consumption) into the search objective to find architectures optimized for specific deployment environments. The traditional NAS approach for LLMs is prohibitively expensive due to the cost of pre-training and evaluating many different large architectures.
Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP), also known as a feed-forward neural network, is a fundamental component of many neural networks, including Transformers. In a Transformer block, after the attention mechanism, the data passes through an MLP. This MLP typically consists of two or more linear layers with an activation function (like GELU) in between. Its purpose is to process the information aggregated by the attention mechanism independently for each token position.
Distillation Loss
Knowledge distillation is a technique where a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). Distillation loss is a component of the student's training objective that encourages its outputs (e.g., probability distributions over tokens) to match those of the teacher model. This allows the student to learn a compressed version of the teacher's knowledge, often achieving better performance than if it were trained from scratch without the teacher's guidance. The distillation loss usually involves a Kullback-Leibler (KL) divergence or cross-entropy term between the student's and teacher's outputs.
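A minimal sketch of a logit-level distillation loss of the kind described here (KL divergence between teacher and student token distributions, with an assumed temperature); the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) over the vocabulary, averaged over the batch.
    Both logits: (batch, seq_len, vocab_size)."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # temperature**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(2, 5, 100), torch.randn(2, 5, 100))
```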
3.2. Previous Works
The paper contextualizes Jet-Nemotron by contrasting it with several categories of existing LLMs:

- Full-Attention Models: These are the state-of-the-art models that Jet-Nemotron aims to match or surpass in accuracy. Examples mentioned include:
  - Qwen2.5 [4] and Qwen3 [5] (developed by Alibaba Cloud)
  - Gemma3 [41], Rec.Gemma [62], and Gemma3n-E2B [42, 73] (developed by Google)
  - Llama3.2 [2] (developed by Meta)
  - MiniCPM-2B-128K [58]
  - MobileLLM-1.5B [59]
  - Smollm2-1.7B [60]
  - DeepSeek-V3-Small [6] and Moonlight [61]: These are Mixture-of-Experts (MoE) models, which also use full attention but have sparse activation, allowing them to scale up total parameters while keeping activated parameters relatively small. The paper highlights that Jet-Nemotron can achieve higher accuracy than these larger MoE models on certain benchmarks.
- Linear Attention Models: These models replace full attention entirely with more efficient linear attention mechanisms, aiming for $O(n)$ complexity. Examples include:
  - RWKV7 [10]: a linear attention model that uses a recurrent structure.
  - RetNet [12]: introduces a retention mechanism for parallel training and recurrent inference.
  - Mamba2 [50]: builds on Structured State Space Models (SSMs) to achieve linear complexity.
  - GLA (Gated Linear Attention) [11]: designed for hardware-efficient training.
  - DeltaNet [51] and Gated DeltaNet [32]: based on the delta rule and data-dependent gating.
- Hybrid Models: These models combine full attention and linear attention to strike a balance between accuracy and efficiency. Examples mentioned include:
  - Zamba2 [16]
  - Hymba [44]
  - Jamba [87]: a concurrent hybrid model that combines Transformer and Mamba.
  - Falcon-H1 [106]: another concurrent hybrid model incorporating Mamba2 and full attention.
3.3. Technological Evolution
The evolution of efficient LLMs can be seen as a progression:

- Full-Attention Dominance: Initial Transformers and LLMs relied entirely on the self-attention mechanism, achieving state-of-the-art accuracy but suffering from $O(n^2)$ complexity, which limits context length and increases inference costs.
- Rise of Linear Attention: To address efficiency concerns, various linear attention mechanisms emerged, aiming to reduce complexity to $O(n)$. While these offered significant speedups and memory savings, they often came with a noticeable drop in accuracy, especially on complex reasoning tasks, as they sometimes struggled to capture global dependencies as effectively as full attention.
- Emergence of Hybrid Architectures: Recognizing the trade-off, researchers started combining full attention and linear attention within the same model. The idea is to strategically place full-attention layers where they are most critical for accuracy and use linear attention elsewhere for efficiency. Early hybrid models showed promise but still struggled to fully close the accuracy gap with pure full-attention SOTA models.
- PostNAS and Jet-Nemotron's Place: Jet-Nemotron represents a significant step in this evolution by introducing PostNAS. Instead of heuristic design or training hybrid models from scratch, PostNAS provides a systematic, data-driven, and hardware-aware approach to designing these hybrids efficiently. By starting from a pre-trained full-attention model and strategically adapting its attention blocks, Jet-Nemotron aims to achieve the best of both worlds: SOTA accuracy with dramatically improved efficiency, effectively pushing the efficiency-accuracy trade-off frontier.
3.4. Differentiation Analysis
Compared to the main methods in related work, Jet-Nemotron's core differences and innovations stem from its PostNAS pipeline:

- Starting Point:
  - Traditional NAS for LLMs / from-scratch training: Most previous work on efficient LLM architectures (whether pure linear attention or hybrid) involved pre-training models from scratch or designing architectures prior to pre-training. This is extremely costly and risky, limiting exploration of the architectural design space.
  - PostNAS innovation: PostNAS fundamentally changes this by starting from a pre-trained full-attention model. This dramatically reduces training costs and data requirements, making architectural exploration feasible and rapid.
- Freezing MLP Weights:
  - Traditional approach: Typically, all parameters are trained or fine-tuned.
  - PostNAS innovation: By freezing the MLP weights of the pre-trained model, PostNAS focuses the architectural search specifically on the attention blocks. This simplifies the search space and leverages the rich representations already learned by the MLPs, ensuring that the core knowledge is retained.
- Systematic Coarse-to-Fine Search (Four Stages):
  - Heuristic or limited search: Previous hybrid models often used uniform placement of full attention or limited manual exploration.
  - PostNAS innovation: PostNAS provides a comprehensive, multi-stage search:
    - Optimal full-attention layer placement: Instead of uniform placement, PostNAS learns where full-attention layers are most critical for specific tasks, leading to more effective use of these expensive layers.
    - Linear attention block selection: It systematically evaluates existing linear attention blocks for accuracy and efficiency, ensuring the best fit for the specific hybrid architecture.
    - New attention block design (JetBlock): It proposes JetBlock, a new linear attention block that incorporates dynamic convolution, improving expressiveness while maintaining efficiency.
    - Hardware-aware hyperparameter search: Crucially, it optimizes hyperparameters based on actual hardware generation throughput rather than just parameter count, leading to truly efficient deployment.
- Accuracy-Efficiency Trade-off:
  - Previous efficient models: Often sacrificed significant accuracy for efficiency.
  - Jet-Nemotron innovation: Through PostNAS, Jet-Nemotron demonstrably matches or exceeds the accuracy of state-of-the-art full-attention models while achieving orders-of-magnitude higher generation throughput, effectively breaking the previous accuracy-efficiency trade-off.

In essence, PostNAS offers a pragmatic and powerful methodology to adapt existing, high-performing LLMs into highly efficient hybrid architectures without the immense costs of pre-training from scratch, making advanced LLM capabilities more accessible and deployable.
4. Methodology
4.1. Principles
The core idea behind Post Neural Architecture Search (PostNAS) is to efficiently design new Language Model (LM) architectures by building upon existing pre-trained full-attention models rather than starting from scratch. This strategy dramatically reduces training costs and data requirements. The theoretical basis is that the Multi-Layer Perceptron (MLP) weights of pre-trained LLMs already encode a vast amount of knowledge, which can be reused and frozen. By focusing the architecture search solely on the attention block designs, PostNAS can rapidly explore efficient model variations without retraining the entire model, thereby allowing for comprehensive architectural innovation at a lower cost. The intuition is that while MLPs learn the core features and knowledge, the attention mechanism primarily controls how information is aggregated across the sequence. By optimizing this aggregation, one can significantly impact efficiency without necessarily sacrificing the learned knowledge in the MLPs.
The process is structured as a coarse-to-fine search, systematically optimizing different aspects of the attention block design to achieve both high accuracy and high hardware efficiency.
4.2. Core Methodology In-depth (Layer by Layer)
The PostNAS pipeline is composed of four key steps, as illustrated in Figure 2 (Image 2 from the original paper). It begins with a pre-trained full-attention model where the MLP weights are frozen.
The following figure (Figure 2 from the original paper) illustrates the overall PostNAS pipeline:
This image is the schematic for Figure 2, showing the overall PostNAS pipeline: starting from a pre-trained full-attention model with frozen MLP weights, it proceeds through four stages (full-attention layer placement optimization, linear attention block selection, new attention block design, and hardware-aware hyperparameter search), reflecting Jet-Nemotron's efficient design path.
Figure 2 | PostNAS Roadmap. Our pipeline starts from a pre-trained full-attention model and keeps the MLP frozen. It then performs a coarse-to-fine search for efficient attention block designs, first determining the optimal placement of full-attention layers, then selecting the best linear attention block or using a new linear attention block, and finally searching for optimal architectural hyperparameters.
4.2.1. Full Attention Placement and Elimination
The first step in PostNAS is to strategically determine where to keep full-attention layers and where to replace them with more efficient linear attention or sliding window attention (SWA) layers. While incorporating a few full-attention layers is known to improve accuracy, their optimal placement is not trivial.
The following figure (Figure 4 from the original paper) shows the process of learning to place full attention with PostNAS:
This image is the schematic for Figure 4, showing how PostNAS trains a once-for-all super network and uses beam search to find the optimal placement of full-attention layers.
Figure 4 | Learning to Place Full Attention with PostNAS. We train a once-for-all super network and perform beam search to identify the optimal placement of full attention layers.
To achieve this, PostNAS employs an automated method:

- Once-for-all super network training: A once-for-all super network [45, 31] is constructed as an augmented version of the pre-trained full-attention model. Each layer that could potentially be a full-attention layer is augmented with an alternative linear attention path, so every layer can operate either as a full-attention layer or as a linear attention layer. During training, a subnetwork (i.e., a specific combination of full and linear attention layers) is randomly sampled at each step and trained with a feature distillation loss [46, 47, 48], which encourages the subnetwork's internal representations (features) to match those of the original pre-trained full-attention model. The MLP weights are kept frozen throughout this process; only attention block parameters and any newly introduced layers (e.g., for linear attention) are updated.
- Beam search for optimal placement: After the super network is trained, beam search [49] is performed to identify the optimal placement of full-attention layers under a given constraint (e.g., a maximum number of full-attention layers). The search objective is task-dependent (a minimal sketch of this search appears at the end of this subsection). For example:
  - For MMLU, the objective is to select the configuration that yields the lowest loss on the correct answer.
  - For mathematical and retrieval tasks, the objective is to choose the configuration with the highest accuracy.

The findings from this stage reveal:

- Key Finding 1: Not all attention layers in a pre-trained full-attention model contribute equally to performance. Only a few layers are critically important for specific tasks (e.g., two for MMLU, two to three for retrieval).
- Key Finding 2: Different attention layers contribute to different capabilities. Layers critical for MMLU might not be important for retrieval tasks, indicating specialization.
- Key Finding 3: Complex tasks like mathematical reasoning show intricate patterns of attention importance, but the key layers identified for MMLU and retrieval often encompass those needed for math.

The following figure (Figure 5 from the original paper) shows the layer placement search results and compares PostNAS with uniform placement:
This image is a chart showing Figure 5: the layer placement search results for Jet-Nemotron on Qwen2.5-1.5B and a performance comparison between PostNAS and uniform placement. Panel (a) shows a heatmap of attention-layer importance across different tasks; panel (b) compares MMLU accuracy for different numbers of full-attention layers, showing that PostNAS clearly outperforms the Uniform strategy.
Figure 5 | (a) Layer Placement Search Results on Qwen2.5-1.5B. Each grid cell represents the search objective value of the corresponding attention layer; higher values indicate greater importance. (b) Comparison Between PostNAS and Uniform Placement.
As Figure 5(b) illustrates, PostNAS significantly outperforms uniform placement in terms of accuracy, validating the effectiveness of learning layer placement.
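To make the placement search concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a beam search over full-attention layer positions. The `score_placement` callback stands in for configuring the once-for-all super network with the chosen full-attention layers and returning the task-specific objective described above.

```python
from typing import Callable, List, Tuple

def beam_search_placement(
    num_layers: int,
    num_full_attn: int,          # constraint: how many full-attention layers to keep
    score_placement: Callable[[Tuple[int, ...]], float],
    beam_width: int = 4,
) -> Tuple[int, ...]:
    beam: List[Tuple[Tuple[int, ...], float]] = [((), 0.0)]
    for _ in range(num_full_attn):
        candidates = []
        for placement, _ in beam:
            start = placement[-1] + 1 if placement else 0
            for layer in range(start, num_layers):
                new_placement = placement + (layer,)
                candidates.append((new_placement, score_placement(new_placement)))
        # Keep only the top-`beam_width` partial placements by objective value.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beam, key=lambda c: c[1])[0]

# Toy usage: a scorer that prefers layers 15 and 20, mimicking the final
# Jet-Nemotron-2B placement.
toy_score = lambda p: sum(1.0 for layer in p if layer in (15, 20))
print(beam_search_placement(num_layers=28, num_full_attn=2, score_placement=toy_score))
```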
4.2.2. Linear Attention Block Selection
Once the placement of full-attention layers is determined, the next step is to select the most suitable linear attention block to replace the eliminated full-attention layers. This involves systematically evaluating various state-of-the-art linear attention blocks.
In their experiments, the authors evaluated six SOTA linear attention blocks: RWKV7 [10], RetNet [12], Mamba2 [50], GLA [11], Deltanet [51], and Gated DeltaNet [32].
The selection process involves:

- Efficiency profiling: Initial profiling of training throughput is performed. RWKV7 was excluded from the training experiments due to its lower training throughput (possibly a suboptimal kernel implementation).
- Accuracy evaluation: The remaining linear attention blocks are evaluated for accuracy across diverse tasks.
- Inference speed evaluation: Their inference speed is also considered.

This stage leverages the low training cost of the PostNAS framework, allowing for comprehensive evaluation without relying on small proxy tasks. The results indicated that Gated DeltaNet achieved the best overall accuracy. Its superior performance is attributed to two key mechanisms:

- Data-dependent gating mechanism [52]: This mechanism dynamically controls whether the model should prioritize the current token's information or the accumulated history state, allowing for adaptive information flow.
- Delta rule [53]: This rule updates the history state by focusing on the increment of information contributed by the current token, efficiently managing the limited memory of the state.

Based on these findings, Gated DeltaNet was chosen for subsequent experiments. A simplified sketch of the gated delta-rule update is shown below.
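The following is a simplified, assumption-laden sketch of a gated delta-rule recurrence of the kind Gated DeltaNet uses; the exact parameterization and the order of gating versus the delta update in the real block may differ.

```python
import torch

def gated_delta_step(S, q_t, k_t, v_t, alpha_t, beta_t):
    """One recurrent step over a fixed-size state.
    S: (d_v, d_k) history state; q_t, k_t: (d_k,); v_t: (d_v,)
    alpha_t in (0, 1): data-dependent gate on the accumulated history
    beta_t  in (0, 1): delta-rule writing strength."""
    prediction = S @ k_t                               # what the state currently stores for k_t
    # Delta rule: write only the increment between the new value and the stored
    # prediction, then decay the history with the data-dependent gate.
    S = alpha_t * (S + beta_t * torch.outer(v_t - prediction, k_t))
    o_t = S @ q_t                                      # read out with the query
    return S, o_t

# Toy usage.
d_k, d_v = 4, 8
S = torch.zeros(d_v, d_k)
S, o = gated_delta_step(S, torch.randn(d_k), torch.randn(d_k), torch.randn(d_v),
                        alpha_t=0.9, beta_t=0.5)
```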
4.2.3. New Attention Block Design (JetBlock)
The PostNAS framework also facilitates the design of entirely new attention blocks. While convolution has been shown to enhance linear attention capacity [32], prior methods typically use static convolution kernels that cannot adapt to the input.
To address this, the authors introduce JetBlock. JetBlock enhances linear attention by incorporating dynamic convolution [54, 55]. The overall structure of JetBlock is shown in Figure 2 (#3).
The key innovation of JetBlock is the use of a kernel generator:

- Kernel generator module: This module dynamically produces convolution kernels based on the input features.
  - Input: It shares the same input as the Query (Q), Key (K), and Value (V) projection layers.
  - Efficiency: It starts with a linear reduction layer (reduction ratio of 8) for efficiency.
  - Activation: A GELU activation function [57] is applied.
  - Output: A final linear layer outputs the convolution kernel weights.
- Application of dynamic kernels: These dynamically generated convolution kernels are applied specifically to the Value (V) tokens. The authors found that applying them to the Q or K tokens offered little benefit.
- Streamlined computation: Redundant static convolutions on Q and K are removed with negligible impact on accuracy once dynamic convolution is applied to V, further streamlining computation and improving efficiency.
- Time-mixing: Gated DeltaNet (selected in the previous step) is adopted for the time-mixing component of JetBlock.

The combination of dynamic convolution and Gated DeltaNet's data-dependent gating and delta rule allows JetBlock to achieve improved accuracy, particularly on math reasoning and retrieval tasks, while maintaining efficiency comparable to other linear attention blocks (Table 1). A sketch of the dynamic-convolution kernel generator is given below.
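The following is a hypothetical, simplified sketch of the idea (not the released JetBlock implementation): a small generator network (linear reduction with ratio 8, GELU, linear output) produces per-head convolution kernels from the block input, and these kernels are applied as a causal depthwise convolution over the value tokens. Conditioning the kernels on a pooled summary of the input and normalizing them with softmax are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvOnV(nn.Module):
    """Toy dynamic convolution applied to the value (V) projection."""
    def __init__(self, hidden_size: int, num_heads: int, v_dim: int,
                 kernel_size: int = 4, reduction: int = 8):
        super().__init__()
        self.num_heads, self.v_dim, self.kernel_size = num_heads, v_dim, kernel_size
        self.v_proj = nn.Linear(hidden_size, num_heads * v_dim, bias=False)
        # Kernel generator: linear reduction (ratio 8) -> GELU -> linear output.
        self.kernel_gen = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // reduction),
            nn.GELU(),
            nn.Linear(hidden_size // reduction, num_heads * kernel_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        v = self.v_proj(x).view(B, T, self.num_heads, self.v_dim)
        # One kernel per head, generated from a pooled summary of the input.
        kernels = torch.softmax(
            self.kernel_gen(x.mean(dim=1)).view(B, self.num_heads, self.kernel_size),
            dim=-1)
        # Apply each head's kernel as a causal depthwise convolution over time.
        v = v.permute(0, 2, 3, 1).reshape(B * self.num_heads, self.v_dim, T)
        w = kernels.reshape(B * self.num_heads, 1, self.kernel_size)
        w = w.expand(-1, self.v_dim, -1).reshape(-1, 1, self.kernel_size)
        v = F.pad(v, (self.kernel_size - 1, 0)).reshape(1, -1, T + self.kernel_size - 1)
        v = F.conv1d(v, w, groups=B * self.num_heads * self.v_dim)
        return v.view(B, self.num_heads, self.v_dim, T).permute(0, 3, 1, 2)

# Toy usage: batch of 2, 16 tokens, hidden size 64, 4 heads of value dim 32.
out = DynamicConvOnV(64, 4, 32)(torch.randn(2, 16, 64))   # (2, 16, 4, 32)
```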
4.2.4. Hardware-Aware Architecture Search
The final step focuses on optimizing core architectural hyperparameters for actual hardware efficiency. Traditionally, the number of parameters has been a proxy for LM efficiency. However, PostNAS acknowledges that parameter count does not directly correlate with generation throughput on real hardware.
The hardware-aware architecture search aims to identify optimal architectural hyperparameters (e.g., key/value dimension, number of attention heads) by directly targeting generation throughput.
Key Finding 4: KV cache size is the most critical factor influencing long-context and long-generation throughput. Models with different parameter counts can exhibit similar generation throughput if their KV cache size is constant (Table 2). This is because the decoding stage is typically memory-bandwidth-bound rather than compute-bound. In long-context scenarios, the KV cache can consume more memory than the model weights, and reducing its size decreases memory transfer time per decoding step, enabling larger batch sizes and improving generation throughput.
Based on this finding, the search process involves:

- Fixing the KV cache size: The KV cache size is fixed to match the original design.
- Grid search: A small-scale grid search is performed over the key dimension, value dimension, and number of attention heads.
- Optimization target: The objective is to achieve a generation throughput comparable to the original configuration while potentially allowing more parameters in exchange for better accuracy.

This step refines the JetBlock configuration, boosting its accuracy while maintaining training and inference throughput, as shown in Table 1 (row "+ Hardware-Aware Search"). A minimal sketch of such a search loop is shown below.
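The following is a minimal, hypothetical sketch of the hardware-aware hyperparameter search: enumerate (key dimension, value dimension, head count) combinations and keep the most accurate configuration whose measured generation throughput stays close to the original design. `measure_throughput` and `evaluate_accuracy` stand in for real benchmarking and evaluation code; in the paper the KV cache size is held fixed, since it is determined by the few remaining full-attention/SWA layers rather than by these JetBlock hyperparameters.

```python
from itertools import product

def hardware_aware_search(candidates, baseline_tps, measure_throughput,
                          evaluate_accuracy, tol=0.05):
    best = None
    for d_k, d_v, n_head in candidates:
        cfg = {"d_k": d_k, "d_v": d_v, "n_head": n_head}
        tps = measure_throughput(cfg)
        if tps < (1 - tol) * baseline_tps:   # reject configs that are too slow on hardware
            continue
        acc = evaluate_accuracy(cfg)
        if best is None or acc > best[0]:
            best = (acc, cfg)
    return best

# Example candidate grid, loosely mirroring the axes explored in Table 2;
# pass it as `candidates` together with real measurement callbacks.
grid = list(product([64, 96, 128, 192, 256],    # key dimension
                    [144, 192, 256, 288, 384],  # value dimension
                    [4, 8, 12]))                # number of heads
```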
The overall result of these four steps is the Jet-Nemotron family, a hybrid-architecture LM that balances full attention for critical tasks, sliding window attention for specific patterns (like in MMLU), and JetBlock for efficient linear attention, all optimized for hardware performance.
5. Experimental Setup
5.1. Datasets
The training of Jet-Nemotron models involves a two-stage process using a combination of various datasets.
- Stage 1 (PostNAS and Initial Training):
  - Nemotron-CC [63]: A large pre-training corpus. While not explicitly detailed in the paper, Nemotron-CC is likely derived from Common Crawl and refined for long-horizon pre-training.
  - Redstone-QA [64]: A QA dataset also used as part of the pre-training corpus.
  - Purpose: These datasets are used for the distillation-loss training in the first stage, where MLPs are frozen and attention blocks are adapted via PostNAS.
- Stage 2 (Full-Model Training):
  - The data mixture for this stage includes the datasets from Stage 1, augmented with more high-quality data from specific domains:
    - Math-related datasets [65]
    - Coding-related datasets [66, 67]
  - Purpose: This stage involves full-model training (all parameters unfrozen) to further refine the model's capabilities on these important domains.
For evaluation, a comprehensive suite of benchmarks is used, covering diverse tasks:

- Massive Multitask Language Understanding (MMLU) [18] & MMLU-Pro [19]: General knowledge and reasoning across 57 subjects (MMLU) and a more robust, more challenging version (MMLU-Pro).
- Mathematical Reasoning:
  - GSM8K [22]: Grade-school math word problems.
  - MATH [18]: Diverse, challenging math problems.
  - MathQA [21]: Math word problems with interpretable formalisms.
  - MMLU-Stem: A subset of MMLU focused on science, technology, engineering, and mathematics.
  - GPQA [20]: Graduate-level Google-Proof Question Answering.
- Commonsense Reasoning:
  - ARC-c, ARC-e [34]: AI2 Reasoning Challenge (Challenge and Easy sets).
  - PIQA [35]: Physical commonsense reasoning.
  - Wino. (Winograd Schema Challenge) [36]: Ambiguous pronoun resolution.
  - OBQA [38]: Open Book Question Answering.
  - BoolQ [33]: Boolean questions over natural text.
  - TruthQA [37]: Truthfulness of statements.
- Retrieval:
  - FDA (Financial Domain QA) [23]
  - SWDE (Semi-structured Web Data Extraction) [24]
  - Squad (Stanford Question Answering Dataset) [25]: Reading comprehension.
- Coding:
  - EvalPlus [40]: Rigorous evaluation of code generation.
  - CRUXEval [28]: Code Reasoning, Understanding, and Execution benchmark.
- Long-Context Tasks:
  - LongBench [29]: A bilingual, multitask benchmark for long-context understanding.

These datasets are chosen to provide a comprehensive and challenging evaluation across a wide spectrum of LLM capabilities, from general knowledge and reasoning to specialized domains like math, coding, and long-context understanding.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below:
Accuracy
Conceptual Definition: Accuracy is a fundamental metric that measures the proportion of correctly predicted instances out of the total number of instances. In classification and reasoning tasks, it indicates how often the model's output matches the ground truth. Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $ Symbol Explanation:
- Number of Correct Predictions: The count of instances where the model's output exactly matches the true label or answer.
- Total Number of Predictions: The total count of instances for which the model made a prediction.
Generation Throughput
Conceptual Definition: Generation Throughput measures the efficiency of a language model during text generation (decoding). It quantifies the rate at which the model can produce tokens, typically expressed in tokens per second. Higher throughput indicates faster generation, which is crucial for real-time applications and serving many users.
Mathematical Formula: The paper does not provide a specific formula, but it is generally calculated as:
$
\text{Generation Throughput (tokens/s)} = \frac{\text{Total Tokens Generated}}{\text{Total Time Taken for Generation (seconds)}}
$
Symbol Explanation:
- Total Tokens Generated: The sum of all tokens produced by the model during the generation phase.
- Total Time Taken for Generation: The total duration, in seconds, required for the model to generate those tokens.
Prefilling Speedup
Conceptual Definition: Prefilling Speedup measures how much faster a new method (Jet-Nemotron) can process the initial input prompt (prefilling phase) compared to a baseline model. The prefilling phase involves processing the entire input context to compute the initial KV cache states before autoregressive decoding begins.
Mathematical Formula:
$
\text{Prefilling Speedup} = \frac{\text{Prefilling Time of Baseline Model}}{\text{Prefilling Time of Jet-Nemotron}}
$
Symbol Explanation:
- Prefilling Time of Baseline Model: The time taken by the baseline model to process the initial input prompt.
- Prefilling Time of Jet-Nemotron: The time taken by the Jet-Nemotron model to process the initial input prompt.
Decoding Speedup
Conceptual Definition: Decoding Speedup measures how much faster a new method (Jet-Nemotron) can generate tokens during the autoregressive decoding phase compared to a baseline model. The decoding phase involves generating one token at a time based on the previous tokens and the KV cache.
Mathematical Formula:
$
\text{Decoding Speedup} = \frac{\text{Decoding Time of Baseline Model}}{\text{Decoding Time of Jet-Nemotron}}
$
Symbol Explanation:
- Decoding Time of Baseline Model: The time taken by the baseline model to generate a given set of tokens during autoregressive decoding.
- Decoding Time of Jet-Nemotron: The time taken by the Jet-Nemotron model to generate the same set of tokens during autoregressive decoding.
KV Cache Size (in MB)
Conceptual Definition: KV Cache Size refers to the amount of memory consumed by storing the Key and Value vectors from previous tokens in the Transformer's attention mechanism. This cache is essential for efficient autoregressive decoding. The size of the KV cache grows with the context length and the number of attention layers and heads, directly impacting memory usage and generation throughput. The paper reports this in Megabytes (MB).
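For intuition, here is a back-of-the-envelope sketch of how a full-attention KV cache footprint in MB can be estimated. The layer/head counts and bf16 storage below are illustrative assumptions; the paper's reported figures also include contributions from sliding-window layers, so exact numbers will differ.

```python
def kv_cache_mb(num_full_attn_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # Keys and values are both stored for every cached token, hence the 2x.
    return (2 * num_full_attn_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem) / 2**20

# A full-attention model caches K/V in every layer; a hybrid like Jet-Nemotron
# only does so in its few full-attention (and windowed) layers, which is why its
# reported cache size is an order of magnitude smaller.
print(kv_cache_mb(num_full_attn_layers=28, num_kv_heads=8, head_dim=128, context_len=65536))
print(kv_cache_mb(num_full_attn_layers=2,  num_kv_heads=2, head_dim=128, context_len=65536))
```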
5.3. Baselines
The paper compares Jet-Nemotron against a wide array of state-of-the-art models, categorized by their attention mechanism type:

- O(n²) (Full-Attention) Models: These are standard Transformer models with quadratic attention complexity, representing the performance ceiling Jet-Nemotron aims to match.
  - Qwen2.5-1.5B [4]
  - Qwen3-1.7B-Base [5]
  - Llama3.2-3B [2]
  - MiniCPM-2B-128K [58]
  - MobileLLM-1.5B [59]
  - Smollm2-1.7B [60]
  - DeepSeek-V3-Small@1.3T [6]: An MoE model with 2.2B activated and 15B total parameters.
  - Moonlight@1.2T [61]: Another MoE model with 2.2B activated and 15B total parameters.
  - Mamba2-2.7B [50]: Although Mamba2 is an O(n) model, it is listed under the O(n²) section in Table 3 for the MMLU(-Pro) results; this is a quirk of the paper's table layout, and in later tables (e.g., Table 4) it is correctly placed under O(n). The analysis below follows the paper's table structure.
- O(n) (Linear Attention) Models: These models prioritize efficiency with linear attention mechanisms.
  - RWKV7-1.5B [10]
  - Rec.Gemma-2B [62]
  - Gemma3n-E2B [42]
  - Hymba-1.5B [44]
  - Zamba2-1.2B [16]
  - Mamba2-2.7B [50]: Correctly categorized as O(n) in Tables 4, 5, 6, 7, and 8.
- Hybrid Models: Models combining full and linear attention.
  - Gemma2-2.6B [73]
  - Gemma3n-E2B [73]
  - Hymba-1.5B [44]
  - Zamba2-1.2B [16]
  - Falcon-H1-1.5B [106] and Falcon-H1-1.5B-deep [106]: Concurrent hybrid models that combine Mamba2 and full attention.

These baselines are representative because they cover the spectrum of current LLM architectures (pure full attention, pure linear attention, and various hybrid approaches), including leading models from major research labs and highly efficient smaller models. This comprehensive comparison allows Jet-Nemotron to demonstrate its superior accuracy-efficiency trade-off against the best existing solutions.
5.4. Jet-Nemotron Model Family Details
The Jet-Nemotron family consists of two main models: Jet-Nemotron-2B and Jet-Nemotron-4B.
Final Model Architecture
The final Jet-Nemotron models are constructed from a stack of blocks, each containing a Multi-Layer Perceptron (MLP) layer and an attention layer. The attention layer can be one of three types: full attention, sliding window attention (SWA), or JetBlock (the new linear attention block).
The following are the results from Table 9 of the original paper:
| | Jet-Nemotron-2B | Jet-Nemotron-4B |
|---|---|---|
| Total blocks | 28 | 36 |
| Full Attention Layers | No. 15, 20 | No. 18, 21, 22, 28, 33 |
| Sliding Window Attention Layers | No. 21, 22 | No. 17, 20, 23, 24, 26 |
| Vocabulary Size | 151,643 | 151,643 |
| Hidden Size | 1,536 | 2,048 |
| MLP Intermediate Size | 8,960 | 11,008 |
Table 9 | The overall model architectures of Jet-Nemotron families.
- Jet-Nemotron-2B:
  - Built upon Qwen2.5-1.5B.
  - Uses two full-attention layers (No. 15 and 20), guided by the retrieval task.
  - Includes two sliding window attention (SWA) layers (No. 21 and 22), guided by MMLU (for pattern matching in multiple-choice tasks). The window size for SWA is 1,152.
  - All remaining attention layers are replaced with JetBlock.
- Jet-Nemotron-4B:
  - Built upon Qwen2.5-3B.
  - Incorporates five full-attention layers (No. 18, 21, 22, 28, 33).
  - Includes five SWA layers (No. 17, 20, 23, 24, 26). The window size for SWA is 2,048.
  - The rest are JetBlock layers.

The full-attention and sliding window attention layers use grouped-query attention [105].
The following are the results from Table 10 of the original paper:
| Full Attention / SWA | Jet-Nemotron-2B | Jet-Nemotron-4B |
|---|---|---|
| Attention Head Number | 12 | 16 |
| Dimensions of Q/K/V | 128 | 128 |
| K/V Head Number | 2 | 2 |
| Position Embedding | RoPE | RoPE |
Table 10 | The configurations of full-attention layers in Jet-Nemotron models.
The configurations for JetBlock are distinct from the full attention/SWA layers:
The following are the results from Table 11 of the original paper:
| JetBlock | Jet-Nemotron-2B | Jet-Nemotron-4B |
|---|---|---|
| Q/K Dimension | 96 | 128 |
| V Dimension | 256 | 256 |
| Head Number | 12 | 16 |
| Convolution Kernel Size | 4 | 4 |
| DConv Generator Hidden Size | 32 | 32 |
Table 11 | The configurations of JetBlock.
Training Details
The training proceeds in two stages:

- Stage 1: MLPs are frozen. The model is trained using a distillation loss on a combination of Nemotron-CC and Redstone-QA for 50B tokens. This stage is where PostNAS operates.
- Stage 2: Full-model training (all parameters trainable) is performed. Additional high-quality data from the math and coding domains is added to the data mixture. Models are trained on 350B tokens in this stage.
Evaluation Details

- Shots:
  - GSM8K and MATH: 4-shot evaluation.
  - GPQA and MMLU-Pro: 5-shot evaluation.
  - All other tasks: zero-shot setting.
- Implementations:
  - Coding tasks: official implementations of EvalPlus [40] and CRUXEval [28].
  - Other tasks: LM-Evaluation-Harness [68].
Throughput Testbed

- Hardware: DGX H100 server (8× NVIDIA H100 GPUs, 2× Intel Xeon Platinum 8480C CPUs, 2 TB RAM).
- Software: PyTorch 2.7.0, Triton 3.3.0.
- Optimized kernels:
  - Full-attention block: FlashAttention 2.7.4 [69].
  - Linear-attention blocks: Flash-Linear-Attention 0.2.1 [70].
- Model inference: based on Transformers 4.52.0 [71].
- Context length: 64K tokens (unless specified otherwise).
- GPU usage: a single H100 GPU per test.
- Optimization: Chunk-prefilling [72] is used to maximize the decoding batch size without sacrificing prefilling throughput by adjusting chunk sizes. The highest achievable decoding throughput is reported.

The following are the results from Table 13 of the original paper:
| Model | Batch Size | Chunk Size |
|---|---|---|
| Qwen2.5-1.5B | 32 | 8,192 |
| Qwen3-1.7B | 8 | 16,384 |
| Llama3.2-1B | 32 | 4,096 |
| MiniCPM-2B-128K | 2 | 2,048 |
| Pythia-2.8B | 2 | 16,384 |
| Smollm2-1.7B | 4 | 16,384 |
| Mamba2-2.7B | 128 | 1,024 |
| RWKV7-1.5B | 256 | 2,048 |
| Rec.Gemma-2B | 128 | 512 |
| Gemma3n-E2B | 64 | 4,096 |
| Gemma2-2.6B | 16 | 2,048 |
| Hymba-1.5B | 64 | 512 |
| Zamba2-1.2B | 8 | 8,192 |
| Jet-Nemotron-2B | 128 | 2,048 |
| Jet-Nemotron-4B | 64 | 1,024 |
Table 13 | Hyper-Parameters in Efficiency Measurement. We adjust the chunk size to maximize decoding batch size without compromising prefilling throughput.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Jet-Nemotron models significantly advance the efficiency-accuracy trade-off frontier for Language Models. Across a comprehensive suite of benchmarks, Jet-Nemotron matches or exceeds the accuracy of state-of-the-art full-attention models while delivering vastly superior generation throughput.
The following figure (Figure 1 from the original paper) shows the comparison between Jet-Nemotron and State-of-the-Art Efficient Language Models:
This image is a performance-comparison chart of the Jet-Nemotron family against several efficient language models on MMLU-Pro 5-shot accuracy versus generation throughput. It shows that Jet-Nemotron-2B surpasses Qwen3-1.7B-Base in accuracy while achieving roughly a 47× generation speedup, and that Jet-Nemotron-4B, despite its larger size, still leads in throughput.
Figure 1 | Comparison Between Jet-Nemotron and State-of-the-Art Efficient Language Models. The generation throughput is measured on an NVIDIA H100 GPU under a context length of 64K tokens. Jet-Nemotron-2B delivers higher accuracy than Qwen3-1.7B-Base on MMLU-Pro while achieving roughly 47× higher generation throughput. Jet-Nemotron-4B, despite its larger model size, still achieves higher generation throughput than all full-attention models with fewer than 2B parameters.
As Figure 1 clearly illustrates, Jet-Nemotron-2B achieves higher MMLU-Pro accuracy than Qwen3-1.7B-Base while providing approximately 47× higher generation throughput. Even the larger Jet-Nemotron-4B maintains higher generation throughput than all full-attention models under 2B parameters, indicating exceptional efficiency.
Results on MMLU(-Pro) and BBH
The following are the results from Table 3 of the original paper:
| Type | Model | Params (B) | Cache Size (MB) | Throughput (token/s) ↑ | MMLU Acc. ↑ | MMLU-Pro Acc. ↑ | BBH Acc. ↑ |
|---|---|---|---|---|---|---|---|
| O(n²) | Qwen2.5-1.5B [4] | 1.5 | 1,792 | 241 | 59.5 | 28.9 | 44.1 |
| | Qwen3-1.7B-Base [5] | 1.7 | 7,168 | 61 | 60.3 | 37.8 | 54.2 |
| | Llama3.2-3B [2] | 3.0 | 7,168 | 60 | 54.9 | 25.0 | 47.1 |
| | MiniCPM-2B-128K [58] | 2.8 | 23,040 | 18 | 46.0 | 18.0 | 36.5 |
| | MobileLLM-1.5B [59] | 1.5 | 4,320 | 101 | 26.0 | 9.4 | 27.2 |
| | Smollm2-1.7B [60] | 1.7 | 12,288 | 32 | 48.5 | 18.3 | 35.1 |
| | DeepSeek-V3-Small@1.3T [6] | 2.2/15 | - | - | 53.3 | - | - |
| | Moonlight@1.2T [61] | 2.2/15 | - | - | 60.4 | 28.1 | 43.2 |
| | Mamba2-2.7B [50] | 2.7 | 80 | 2,507 | 25.1 | 8.6 | 25.7 |
| O(n) | RWKV7-1.5B [10] | 1.5 | 24 | 3,050 | 41.0 | 13.4 | 15.9 |
| | Rec.Gemma-2B [62] | 2.0 | 16 | 2,355 | 28.6 | 12.8 | 33.3 |
| | Gemma3n-E2B [42] | 2.0 | 768 | 701 | 53.9 | 24.3 | 45.1 |
| | Hymba-1.5B [44] | 1.5 | 240 | 180 | 49.7 | 17.4 | 29.8 |
| | Zamba2-1.2B [16] | 1.2 | 6,114 | 71 | 43.1 | 14.2 | 19.6 |
| Hybrid | Jet-Nemotron-2B | 2.0 | 154 | 2,885 | 60.8 | 39.0 | 58.3 |
| | Jet-Nemotron-4B | 4.0 | 258 | 1,271 | 65.2 | 44.2 | 65.0 |
Table 3 | Results on MMLU(-Pro) and BBH. DeepSeek-V3-Small and Moonlight are MoE models with 2.2B activated and 15B total parameters, trained on 1.3T and 1.2T tokens, respectively.
Jet-Nemotron-2B excels, achieving higher MMLU (60.8), MMLU-Pro (39.0), and BBH (58.3) accuracy than Qwen3-1.7B-Base while offering roughly 47× higher throughput (2,885 vs. 61 tokens/s) and a far smaller KV cache (154 MB vs. 7,168 MB). Notably, Jet-Nemotron-2B even surpasses MoE models such as DeepSeek-V3-Small and Moonlight on MMLU and MMLU-Pro, despite their larger total and activated parameter counts. Jet-Nemotron-4B further boosts accuracy across these benchmarks while maintaining a significant throughput advantage over Qwen3-1.7B-Base. Compared to other linear attention and hybrid models, Jet-Nemotron demonstrates substantially higher accuracy.
Results on Math Tasks
The following are the results from Table 4 of the original paper:
| Type | Model | Throughput (token/s) ↑ | Avg. Accuracy ↑ | GSM8K | MATH | MathQA | MMLU-Stem | GPQA |
|---|---|---|---|---|---|---|---|---|
| O(n²) | Qwen2.5-1.5B [4] | 241 | 38.4 | 62.4 | 13.1 | 34.4 | 52.7 | 29.4 |
| | Qwen3-1.7B-Base [5] | 61 | 42.3 | 62.8 | 16.7 | 46.0 | 50.8 | 27.9 |
| | Llama3.2-3B [2] | 60 | 28.8 | 25.8 | 8.6 | 34.2 | 45.3 | 30.1 |
| | MiniCPM-2B-128K [58] | 18 | 27.6 | 39.2 | 5.9 | 28.5 | 36.3 | 28.1 |
| | Smollm2-1.7B [60] | 32 | 28.9 | 30.3 | 9.2 | 33.7 | 41.3 | 30.1 |
| O(n) | Mamba2-2.7B [50] | 2,507 | 16.6 | 3.0 | 3.9 | 24.3 | 26.6 | 25.3 |
| | RWKV7-1.5B [10] | 2,669 | 18.3 | 5.6 | 0.8 | 27.2 | 34.9 | 23.0 |
| | Rec.Gemma-2B [62] | 2,355 | 20.8 | 13.9 | 7.6 | 25.3 | 28.5 | 28.6 |
| | Gemma3n-E2B [42] | 701 | 28.3 | 24.9 | 10.1 | 31.1 | 45.7 | 31.8 |
| | Hymba-1.5B [44] | 180 | 23.1 | 17.9 | 0.8 | 28.0 | 40.9 | 27.9 |
| Hybrid | Zamba2-1.2B [16] | 71 | 24.8 | 28.1 | 5.9 | 26.0 | 36.5 | 27.7 |
| | Jet-Nemotron-2B | 2,885 | 49.6 | 76.2 | 23.3 | 53.8 | 62.7 | 32.1 |
| | Jet-Nemotron-4B | 1,271 | 51.3 | 78.7 | 25.2 | 52.5 | 65.6 | 34.6 |
Table 4 | Results on Math Tasks.
For math tasks, Jet-Nemotron-2B achieves an impressive average accuracy of 49.6, outperforming Qwen3-1.7B-Base by 6.3 points while being roughly 47× faster. Traditional linear attention and hybrid models lag significantly behind Qwen3 on these tasks, highlighting Jet-Nemotron's unique ability to combine efficiency with strong mathematical reasoning.
Results on Commonsense Reasoning Tasks
The following are the results from Table 5 of the original paper:
| Model | Throughput (token/s) ↑ | Avg. Accuracy ↑ | ARC-c | ARC-e | PIQA | Wino. | OBQA | BoolQ | TruthQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B [4] | 241 | 59.4 | 45.4 | 71.2 | 75.8 | 63.8 | 40.2 | 72.8 | 46.6 |
| Qwen3-1.7B-Base [5] | 61 | 60.0 | 44.9 | 68.6 | 75.5 | 63.8 | 39.0 | 79.0 | 48.8 |
| Llama3.2-3B [2] | 60 | 59.9 | 46.6 | 72.0 | 78.0 | 69.3 | 40.4 | 73.9 | 39.3 |
| MiniCPM-2B-128K [58] | 18 | 57.6 | 41.0 | 69.4 | 75.5 | 63.8 | 40.6 | 74.7 | 38.3 |
| Smollm2-1.7B [60] | 32 | 59.7 | 47.0 | 73.3 | 77.7 | 66.2 | 44.6 | 72.5 | 36.7 |
| Mamba2-2.7B [50] | 2,507 | 57.2 | 42.1 | 70.5 | 76.1 | 62.7 | 41.4 | 71.5 | 36.1 |
| RWKV7-1.5B [10] | 3,050 | 59.7 | 46.3 | 75.7 | 77.4 | 67.6 | 45.4 | 70.5 | 34.7 |
| Rec.Gemma-2B [62] | 2,355 | 46.5 | 29.4 | 41.5 | 66.6 | 54.1 | 27.0 | 72.0 | 34.7 |
| Gemma3n-E2B [42] | 701 | 58.6 | 43.2 | 73.1 | 77.0 | 60.8 | 40.8 | 76.0 | 39.1 |
| Hymba-1.5B [44] | 180 | 61.2 | 46.9 | 76.9 | 77.7 | 66.2 | 41.0 | 80.8 | 39.0 |
| Zamba2-1.2B [16] | 71 | 58.0 | 44.4 | 66.8 | 77.4 | 65.6 | 42.8 | 70.8 | 38.5 |
| Jet-Nemotron-2B | 2,885 | 62.0 | 48.6 | 74.8 | 75.4 | 65.8 | 40.6 | 81.2 | 47.8 |
| Jet-Nemotron-4B | 1,271 | 64.7 | 51.7 | 79.2 | 78.1 | 70.5 | 43.6 | 83.0 | 46.6 |
Table 5 | Results on Commonsense Tasks.
Jet-Nemotron-2B (avg. 62.0) outperforms all baseline models, including Qwen2.5 and Qwen3 which are relatively weaker in this domain. Jet-Nemotron-4B achieves even higher accuracy (avg. 64.7).
Results on Retrieval Tasks
The following are the results from Table 6 of the original paper:
| Type | Model | Throughput (token/s) ↑ | Avg. Accuracy ↑ | FDA | SWDE | Squad |
|---|---|---|---|---|---|---|
| O(n²) | Qwen2.5-1.5B [4] | 241 | 72.4 | 82.8 | 86.3 | 48.1 |
| | Qwen3-1.7B-Base [5] | 61 | 76.1 | 81.8 | 89.2 | 57.2 |
| | Llama3.2-3B [2] | 60 | 71.3 | 82.3 | 89.6 | 56.4 |
| | MiniCPM-2B-128K [58] | 18 | 72.6 | 72.3 | 86.4 | 59.1 |
| | Smollm2-1.7B [60] | 32 | 68.9 | 78.1 | 82.4 | 46.3 |
| O(n) | Mamba2-2.7B [50] | 2,507 | 57.0 | 51.7 | 74.3 | 45.1 |
| | RWKV7-1.5B [10] | 3,050 | 58.6 | 54.5 | 73.3 | 48.0 |
| | Rec.Gemma-2.6B [62] | 2,355 | 68.8 | 62.3 | 86.4 | 57.8 |
| Hybrid | Gemma3n-E2B [73] | 701 | 74.0 | 77.3 | 86.4 | 58.2 |
| | Hymba-1.5B [44] | 180 | 57.1 | 46.6 | 74.4 | 50.2 |
| | Zamba2-1.2B [16] | 71 | 66.4 | 73.8 | 80.7 | 44.8 |
| | Jet-Nemotron-2B | 2,885 | 74.2 | 80.4 | 85.7 | 56.6 |
| | Jet-Nemotron-4B | 1,271 | 76.2 | 82.5 | 89.7 | 56.4 |
Table 6 | Results on Retrieval Tasks.
Jet-Nemotron-2B performs strongly, outperforming all baselines except Qwen3-1.7B-Base. Jet-Nemotron-4B achieves the highest average accuracy (76.2) among all models while still maintaining a roughly 21× speedup over Qwen3 (1,271 vs. 61 tokens/s).
Results on Coding Tasks
The following are the results from Table 7 of the original paper:
| Type | Model | Throughput (token/s) ↑ | Avg. Accuracy ↑ | EvalPlus | CRUXEval-I-cot | CRUXEval-O-cot |
|---|---|---|---|---|---|---|
| O(n²) | Qwen2.5-1.5B [4] | 241 | 52.0 | 54.3 | 56.0 | 45.8 |
| | Qwen3-1.7B-Base [5] | 61 | 58.9 | 62.8 | 60.4 | 53.4 |
| | Llama3.2-3B [2] | 60 | 44.0 | 35.5 | 54.7 | 41.7 |
| | MiniCPM-2B-128K [58] | 18 | 34.2 | 40.7 | 29.9 | 31.9 |
| | Smollm2-1.7B [60] | 32 | 36.2 | 20.6 | 49.5 | 38.6 |
| O(n) | Mamba2-2.7B [50] | 2,507 | 14.0 | 12.0 | 9.3 | 20.7 |
| | RWKV7-1.5B [10] | 3,050 | 13.2 | 16.8 | 8.0 | 14.7 |
| | Rec.Gemma-2.6B [62] | 2,355 | 36.8 | 29.5 | 46.7 | 34.2 |
| Hybrid | Gemma3n-E2B [73] | 701 | 40.4 | 29.6 | 49.9 | 41.6 |
| | Hymba-1.5B [44] | 180 | 30.3 | 31.3 | 32.2 | 27.5 |
| | Zamba2-1.2B [16] | 71 | 20.1 | 12.7 | 21.1 | 26.4 |
| | Jet-Nemotron-2B | 2,885 | 59.5 | 60.8 | 61.1 | 56.7 |
| | Jet-Nemotron-4B | 1,271 | 63.5 | 65.6 | 65.9 | 59.0 |
Table 7 | Results on Coding Tasks.
Jet-Nemotron-2B performs comparably to Qwen3-1.7B-Base, and Jet-Nemotron-4B achieves higher accuracy across all coding tasks while maintaining a large generation throughput advantage.
Results on Long-Context Tasks
The following are the results from Table 8 of the original paper:
| Type | Model | Throughput (token/s) ↑ | Avg. Accuracy ↑ | Few-Shot | Code | Sum. | Single-Doc | Multi-Doc |
|---|---|---|---|---|---|---|---|---|
| O(n²) | Qwen2.5-1.5B [4] | 241 | 39.1 | 63.9 | 57.2 | 26.3 | 28.3 | 19.9 |
| | Qwen3-1.7B-Base [5] | 61 | 42.2 | 68.8 | 48.1 | 26.8 | 36.6 | 30.6 |
| | Llama3.2-3B [2] | 60 | 39.9 | 65.2 | 58.0 | 24.3 | 27.6 | 24.6 |
| | MiniCPM-2B-128K [58] | 18 | 41.1 | 57.3 | 59.6 | 25.7 | 33.4 | 29.6 |
| | Smollm2-1.7B [60] | 32 | 21.3 | 38.9 | 28.6 | 16.0 | 13.2 | 9.8 |
| O(n) | Mamba2-2.7B [50] | 2,507 | 10.3 | 6.4 | 30.2 | 9.1 | 3.5 | 2.5 |
| | RWKV7-1.5B [10] | 3,050 | 14.2 | 10.6 | 21.1 | 18.1 | 12.8 | 8.7 |
| | Rec.Gemma-2.6B [62] | 2,355 | 24.1 | 31.8 | 56.7 | 12.9 | 9.2 | 9.6 |
| Hybrid | Gemma2-2.6B [73] | 388 | 22.9 | 28.7 | 52.0 | 12.6 | 13.9 | 7.3 |
| | Gemma3n-E2B [73] | 701 | 40.4 | 56.4 | 67.2 | 25.6 | 29.3 | 28.6 |
| | Hymba-1.5B [44] | 180 | 28.0 | 36.1 | 53.5 | 51.8 | 14.0 | 19.8 |
| | Zamba2-1.2B [16] | 71 | 9.2 | 10.0 | 20.1 | 10.2 | 3.8 | 1.7 |
| | Jet-Nemotron-2B | 2,885 | 41.1 | 68.7 | 58.1 | 26.0 | 30.8 | 21.9 |
| | Jet-Nemotron-4B | 1,271 | 43.9 | 69.7 | 63.2 | 26.4 | 32.5 | 27.5 |
Table 8 | Results on Long-Context Tasks.
On LongBench with up to a 64K context length, Jet-Nemotron-2B (with only two full-attention layers) achieves performance comparable to models like Qwen2.5-1.5B and Gemma3n-E2B (which have considerably more full-attention layers). Jet-Nemotron-4B surpasses Qwen3-1.7B-Base while delivering a roughly 21× speedup, demonstrating a substantial advance in the efficiency-accuracy trade-off for long-context tasks.
Efficiency Benchmark Results
The following figure (Figure 6 from the original paper) shows the efficiency comparison across different context lengths:
This image is a chart showing the relative prefilling and decoding speedups of Jet-Nemotron-2B over Qwen3-1.7B at different context lengths; Jet-Nemotron-2B's advantage grows with context length, significantly outperforming Qwen3-1.7B.
Figure 6 | Efficiency Comparison Across Different Context Lengths. Jet-Nemotron-2B achieves up to a 6.1× speedup in prefilling and up to a 53.6× speedup in decoding compared to Qwen3-1.7B-Base.
Figure 6 details the throughput comparison between Qwen3-1.7B-Base and Jet-Nemotron-2B across various context lengths.
- Prefilling Stage: At shorter context lengths (4K and 8K), Jet-Nemotron-2B is initially 1.14× and 1.15× faster than Qwen3-1.7B-Base. As the context length increases, the benefits of linear attention become more pronounced, leading to a 6.1× speedup at a 256K context length.
- Decoding Stage: Jet-Nemotron-2B consistently and substantially outperforms Qwen3-1.7B-Base. With 2 full-attention layers and 2 groups of key-value states (vs. Qwen3's 28 full-attention layers and 8 groups of key-value states), Jet-Nemotron-2B's theoretical maximum speedup is (28 × 8) / (2 × 2) = 56×. The model already achieves a large speedup at a 4K context length and nearly reaches this theoretical upper bound with a 53.6× speedup at a 256K context length.
6.2. Ablation Studies / Parameter Analysis
PostNAS Accuracy Improvement Breakdown
The following figure (Figure 3 from the original paper) shows the PostNAS accuracy improvement breakdown:
This image is a chart showing the Figure 3 breakdown of PostNAS's accuracy improvements over the baseline model. By progressively applying the different optimization steps, significant gains are achieved on all four metrics, reaching accuracies of up to 58.1, 34.9, 70.4, and 59.3.
Figure 3 | PostNAS Accuracy Improvement Breakdown. By applying PostNAS to the baseline model, we achieve significant accuracy improvements across all benchmarks.
Figure 3 illustrates the incremental accuracy gains achieved by each component of the PostNAS pipeline. Starting from a baseline model, PostNAS progressively adds accuracy on:

- MMLU
- Math
- Retrieval
- Commonsense Reasoning

reaching 58.1, 34.9, 70.4, and 59.3 on these four metrics with the full pipeline. This breakdown confirms the individual effectiveness of each PostNAS stage in enhancing model performance across various tasks.
Hardware-Aware Architecture Search Impact
The following are the results from Table 1 of the original paper:
| Attention Block | Data-Dependent Gating | Delta Rule | Training Throughput ↑ | Inference Throughput ↑ | MMLU ↑ | Math ↑ | Retrieval ↑ | Common. ↑ |
|---|---|---|---|---|---|---|---|---|
| RWKV7 [10] | ✓ | ✓ | 123 | 2,542 | − | − | − | − |
| RetNet [12] | | | 269 | 2,535 | 53.6 | 29.9 | 63.7 | 58.1 |
| Mamba2 [50] | | | 273 | 3,220 | 51.5 | 26.0 | 68.9 | 57.5 |
| GLA [11] | ✓ | | 265 | 3,079 | 55.8 | 31.2 | 66.6 | 58.5 |
| DeltaNet [51] | | ✓ | 254 | 2,955 | 48.9 | 27.4 | 67.9 | 56.6 |
| Gated DeltaNet [32] | ✓ | ✓ | 247 | 2,980 | 55.6 | 32.3 | 69.3 | 58.7 |
| JetBlock | ✓ | ✓ | 233 | 2,885 | 56.3 | 32.8 | 69.9 | 58.5 |
| + Hardware-Aware Search | ✓ | ✓ | 227 | 2,883 | 58.1 | 34.9 | 70.4 | 59.5 |
Table 1 | Accuracy and Efficiency of JetBlock. JetBlock is designed through Linear Attention Block Selection, New Attention Block Design, and Hardware-Aware Search in PostNAS. It achieves higher accuracy than previous linear attention blocks while maintaining comparable training and inference effiency.
Table 1 shows the impact of the hardware-aware architecture search on JetBlock. After applying the search, JetBlock (row "+ Hardware-Aware Search") sees further accuracy boosts (e.g., MMLU from 56.3 to 58.1, Math from 32.8 to 34.9) while maintaining comparable training and inference efficiency. This demonstrates the effectiveness of optimizing hyperparameters directly for hardware performance.
The following are the results from Table 2 of the original paper:
| dK | dV | nhead | Params (B) | Cache Size (MB) | Throughput ↑ (token/s) | Retrieval Accuracy ↑ | Math Accuracy ↑ |
| 256 | 288 | 4 | 1.62 | 154 | 2,969 | 67.6 | 31.3 |
| 192 | 384 | 4 | 1.64 | 154 | 2,961 | 69.3 | 32.3 |
| 128 | 576 | 4 | 1.70 | 154 | 2,979 | 69.5 | 32.5 |
| 256 | 144 | 8 | 1.66 | 154 | 2,986 | 68.3 | 32.1 |
| 192 | 192 | 8 | 1.70 | 154 | 2,970 | 70.6 | 32.8 |
| 128 | 288 | 8 | 1.74 | 154 | 2,971 | 69.6 | 33.2 |
| 128 | 192 | 12 | 1.78 | 154 | 2,959 | 68.8 | 32.9 |
| 96 | 256 | 12 | 1.84 | 154 | 2,955 | 69.6 | 34.8 |
| 64 | 384 | 12 | 1.98 | 154 | 2,952 | 70.1 | 34.2 |
Table 2 | Detailed Results of Hardware-Aware Architecture Search. The first row (dK=256, dV=288, nhead=4) is the original Gated DeltaNet design [32] (shown in gray in the original paper), while the row with dK=96, dV=256, nhead=12 (shown in blue) is the new design produced by our hardware-aware architecture search.
Table 2 further details the hardware-aware architecture search results for Gated DeltaNet. By fixing the KV cache size (154MB) and varying key dimension (dK), value dimension (dV), and number of attention heads (nhead), the search found configurations that achieve similar generation throughput but with higher accuracy (e.g., Retrieval and Math). The row highlighted in blue (dK=96, dV=256, nhead=12) shows the final design which has a slightly higher parameter count (1.84B vs 1.62B for the original), similar throughput (2,955 vs 2,969 token/s), but significantly improved accuracy (e.g., Math Accuracy 34.8 vs 31.3). This confirms Key Finding 4 that KV cache size is dominant for throughput, and within that constraint, PostNAS can find configurations that use slightly more parameters for better accuracy.
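To make the search procedure concrete, here is a minimal, hedged sketch of the idea behind Table 2: hold the KV-cache budget fixed, enumerate (dK, dV, nhead) candidates for the linear-attention blocks, and only then rank candidates by measured throughput and accuracy. The constants (fp16, 64K context, 2 full-attention layers with 2 KV groups, head dimension 128) are illustrative assumptions, not the paper's exact search configuration:

```python
# Illustrative sketch of a hardware-aware hyperparameter search under a fixed
# KV-cache budget. All constants here are assumptions for illustration only.
from itertools import product

BYTES_FP16 = 2
FULL_ATTN_LAYERS = 2      # assumed: remaining full-attention layers
KV_GROUPS = 2             # assumed: key-value groups per full-attention layer
HEAD_DIM = 128            # assumed head dimension
CONTEXT_LEN = 64 * 1024   # assumed context length

def full_attn_kv_cache_mb(context_len: int = CONTEXT_LEN) -> float:
    """Size of the K and V caches held by the full-attention layers, in MB."""
    elems = 2 * FULL_ATTN_LAYERS * KV_GROUPS * context_len * HEAD_DIM  # K + V
    return elems * BYTES_FP16 / 2**20

def candidate_configs():
    """Grid over the linear-attention (d_k, d_v, n_head) axes, echoing Table 2."""
    for d_k, d_v, n_head in product((64, 96, 128, 192, 256),
                                    (144, 192, 256, 288, 384, 576),
                                    (4, 8, 12)):
        yield {"d_k": d_k, "d_v": d_v, "n_head": n_head}

print(f"full-attention KV cache: {full_attn_kv_cache_mb():.0f} MB")
# In the real pipeline each surviving candidate would be trained briefly and
# profiled on the target GPU; here we only enumerate the grid to be profiled.
print(sum(1 for _ in candidate_configs()), "candidate (d_k, d_v, n_head) settings")
```

Because every candidate sees the same KV-cache budget, throughput stays roughly constant across the grid (as Table 2 shows), so the search can simply pick the configuration with the best accuracy.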
Controlled Study on Training Data
The following are the results from Table 14 of the original paper:
| Model | MMLU | Math | Commonsense | Retrieval |
| Qwen2.5-1.5B-continual | 56.7 | 37.6 | 59.8 | 71.5 |
| Mamba2-2.7B-continual | 41.0 | 22.5 | 56.9 | 55.9 |
| RWKV7-1.5B-continual | 49.8 | 25.2 | 59.3 | 57.2 |
| Jet-Nemotron-2B | 59.6 | 40.2 | 61.7 | 73.6 |
Table 14 | Controlled Study on Training Data. All models are pre-trained or continually pre-trained on the Jet-Nemotron stage-2 training corpus discussed in Section 3.1.
To ensure that Jet-Nemotron's superior performance stems from its architecture and PostNAS rather than from the training data alone, a controlled study was conducted. Baseline models (Qwen2.5, Mamba2, RWKV7) were continually pre-trained on the same Jet-Nemotron stage-2 training corpus. As shown in Table 14, even when trained on the same data, Jet-Nemotron-2B still significantly outperforms all continually pre-trained baselines across MMLU, Math, Commonsense, and Retrieval, strengthening the claim that PostNAS and the resulting architecture are the source of the gains.
Throughput Results on Lower-End Hardware
The following are the results from Table 15 of the original paper:
| Hardware | Qwen2.5-1.5B (Tokens/s) | Jet-Nemotron-2B (Tokens/s) | SpeedUp |
| Orin | 6.22 | 55.00 | 8.84 |
| 3090 | 105.18 | 684.01 | 6.50 |
Table 15 | Throughput Results on Jetson Orin (32GB) and NVIDIA RTX 3090 GPUs.
The efficiency benefits of Jet-Nemotron-2B extend beyond high-end H100 GPUs. Table 15 shows substantial speedups on lower-end hardware compared to Qwen2.5-1.5B: 8.84× on NVIDIA Jetson Orin (32GB) and 6.50× on NVIDIA RTX 3090. This demonstrates the broad applicability and practical value of Jet-Nemotron across deployment scenarios, including edge devices.
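For readers who want to reproduce this kind of tokens-per-second figure, the sketch below shows one plausible way to time decoding with Hugging Face Transformers; the model ID is a placeholder baseline, and a rigorous benchmark would additionally control batch size, context length, and the decoding backend.

```python
# Rough decoding-throughput measurement sketch (not the paper's benchmark
# harness). The model ID below is a placeholder; swap in the model under test.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

inputs = tok("The quick brown fox", return_tensors="pt").to(device)
new_tokens = 256

model.generate(**inputs, max_new_tokens=8, do_sample=False)  # warm-up pass
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, min_new_tokens=new_tokens, max_new_tokens=new_tokens, do_sample=False)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{new_tokens / elapsed:.1f} tokens/s")
```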
Comparison to Falcon-H1
The following are the results from Table 16 of the original paper:
| Model | Throughput (token/s) ↑ | MMLU ↑ | MATH ↑ | Common. ↑ | Retrieval ↑ | Code ↑ | Long-Context ↑ |
| Falcon-H1-1.5B [106] | 223 | 60.5 | 40.1 | 59.9 | 73.5 | 56.0 | 40.7 |
| Falcon-H1-1.5B-deep [106] | 66 | 63.5 | 46.8 | 60.6 | 74.6 | 60.3 | 33.4 |
| Jet-Nemotron-2B | 2,885 | 60.8 | 49.6 | 62.0 | 74.2 | 59.5 | 41.1 |
| Jet-Nemotron-4B | 1,271 | 65.2 | 51.3 | 64.7 | 76.2 | 63.5 | 43.9 |
Table 16 | Comparison with Falcon-H1.
Comparing with the concurrent Falcon-H1 (a hybrid model using Mamba2 and full attention), Jet-Nemotron-2B offers comparable or superior accuracy (e.g., better Math, Commonsense, Long-Context than Falcon-H1-1.5B-deep) while achieving significantly higher generation throughput (2,885 vs 66 tokens/s for Falcon-H1-1.5B-deep). This efficiency gap is attributed to Falcon-H1's head-wise hybrid strategy, which requires sequential computation of Mamba2 and full attention within a single layer, limiting parallelism. Jet-Nemotron's layer-wise alternation strategy is inherently more parallelizable. Jet-Nemotron-4B further demonstrates this advantage, outperforming both Falcon-H1 variants in accuracy and still maintaining a high throughput.
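The structural difference is easy to picture in code. Below is a conceptual sketch (not the released implementation) of the layer-wise hybrid strategy: full attention is kept only at a few layer indices, while every other layer uses a constant-state linear-attention block; the layer count and indices are placeholders, and the block classes are simplified stand-ins.

```python
# Conceptual sketch of a layer-wise hybrid stack: full (softmax) attention at a
# few indices, linear attention everywhere else. Blocks are placeholders.
import torch
import torch.nn as nn

class FullAttention(nn.Module):
    """Placeholder softmax-attention layer (O(n^2), needs a growing KV cache)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class LinearAttention(nn.Module):
    """Placeholder linear-attention layer (O(n), constant-size state)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # Stand-in for a real linear-attention recurrence (e.g., JetBlock).
        return self.proj(x)

class HybridStack(nn.Module):
    """Layer-wise hybrid: full attention only at the `full_attn_layers` indices."""
    def __init__(self, dim=512, n_heads=8, n_layers=28, full_attn_layers=(14, 21)):
        super().__init__()
        self.layers = nn.ModuleList(
            FullAttention(dim, n_heads) if i in full_attn_layers else LinearAttention(dim)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection; MLPs omitted for brevity
        return x

x = torch.randn(1, 128, 512)
print(HybridStack()(x).shape)  # torch.Size([1, 128, 512])
```

Because each layer is either fully softmax attention or fully linear attention, the two block types never have to run back-to-back inside a single layer, which is the parallelism advantage over a head-wise hybrid discussed above.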
6.3. Summary
The experimental results conclusively demonstrate that Jet-Nemotron models (2B and 4B) set a new state of the art in the accuracy-efficiency trade-off. They match or surpass the accuracy of leading full-attention models (Qwen3, Qwen2.5, Gemma3, Llama3.2) across diverse benchmarks (MMLU(-Pro), Math, Commonsense, Retrieval, Coding, Long-Context). Crucially, they deliver generation throughput speedups of up to 53.6× and prefilling speedups of up to 6.1× on H100 GPUs, with benefits extending to lower-end hardware. This performance is a direct result of the PostNAS pipeline, which enables efficient, hardware-aware architectural design without the prohibitive cost of pre-training from scratch. The much smaller KV cache is a key factor in these efficiency gains.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Jet-Nemotron, a new family of hybrid-architecture language models that sets a new benchmark for the accuracy-efficiency trade-off in LLMs. These models achieve comparable or superior accuracy to leading full-attention models like Qwen3, Qwen2.5, Gemma3, and Llama3.2, while simultaneously delivering substantial efficiency gains, including up to 53.6× higher generation throughput and up to 6.1× faster prefilling on H100 GPUs at long context lengths.
The core innovations enabling Jet-Nemotron are:
- Post Neural Architecture Search (PostNAS): a novel and highly efficient post-training architecture adaptation pipeline. By starting with pre-trained full-attention models and freezing their MLP weights, PostNAS drastically reduces the cost and risk of LLM architectural exploration. It systematically explores attention block designs through optimal full-attention layer placement, linear attention block selection, new attention block design, and hardware-aware hyperparameter search.
- JetBlock: a novel linear attention block that integrates dynamic convolution with Gated DeltaNet's data-dependent gating and delta rule (a minimal sketch of this recurrence appears after this list). JetBlock significantly outperforms prior linear attention designs (e.g., Mamba2, GLA, Gated DeltaNet) in accuracy while maintaining comparable efficiency.

Extensive empirical results validate Jet-Nemotron's strong accuracy and exceptional inference efficiency across a broad range of benchmarks and hardware platforms.
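As referenced above, here is a minimal, unbatched sketch of the gated delta-rule recurrence that Gated DeltaNet uses and JetBlock builds on, assuming the commonly cited formulation S_t = α_t S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with output o_t = S_t q_t; JetBlock's dynamic convolution on the value path is omitted here.

```python
# Minimal per-token sketch of the gated delta-rule recurrence (assumed
# formulation, simplified and unbatched); not JetBlock's actual implementation.
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)          # constant-size recurrent state
    I = torch.eye(d_k)
    outputs = []
    for t in range(T):
        kt, vt, qt = k[t], v[t], q[t]
        # data-dependent decay (gating) plus delta-rule correction
        S = alpha[t] * S @ (I - beta[t] * torch.outer(kt, kt)) \
            + beta[t] * torch.outer(vt, kt)
        outputs.append(S @ qt)
    return torch.stack(outputs)        # (T, d_v)

T, d_k, d_v = 16, 8, 12
q, k = torch.randn(T, d_k), torch.randn(T, d_k)
v = torch.randn(T, d_v)
alpha, beta = torch.rand(T), torch.rand(T)
print(gated_delta_rule(q, k, v, alpha, beta).shape)  # torch.Size([16, 12])
```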
7.2. Limitations & Future Work
The authors highlight several key contributions but do not explicitly detail specific limitations or future work in the conclusion. However, implicit limitations and future directions can be inferred:
- Generalizability of PostNAS to other base models: While Jet-Nemotron is built on Qwen2.5, it is implied that PostNAS could be applied to any pre-trained Transformer model. Future work could rigorously test this claim across a broader range of base LLMs from different families and sizes to confirm universal applicability.
- Optimal trade-offs in PostNAS: The paper shows the effectiveness of PostNAS's stages. Future work might explore more sophisticated search algorithms or different search spaces within each stage (e.g., more diverse linear attention candidates, alternative dynamic convolution designs, more fine-grained hardware metrics).
- Hardware awareness beyond throughput: While KV cache size and throughput are key, other hardware considerations such as energy consumption, memory footprint on smaller devices, or specialized accelerator features could be integrated into the hardware-aware search for broader optimization.
- Dynamic MLP adaptation: PostNAS freezes MLP weights for efficiency. There may be scenarios where some degree of MLP adaptation or fine-tuning, perhaps via specialized low-rank adaptation techniques, could yield further accuracy gains without sacrificing the efficiency of the PostNAS pipeline.
- Theoretical understanding of hybrid models: While the empirical results are strong, deeper analysis of why specific full-attention placements work best for certain tasks, or how JetBlock's dynamic convolution enhances linear attention's representational power, could lead to more principled architectural designs.
- Efficiency of PostNAS itself: While PostNAS makes architectural search feasible for LLMs, optimizing the search process itself (e.g., faster super-network training, more efficient beam search) remains an area for improvement.
7.3. Personal Insights & Critique
This paper presents a highly impactful and practical approach to making LLMs more efficient without sacrificing their accuracy, which is a critical bottleneck for real-world deployment. The PostNAS pipeline is a particularly insightful innovation. By recognizing the immense cost of LLM pre-training and leveraging existing pre-trained models as a foundation, it democratizes LLM architecture research to some extent, making it accessible even to entities without "Google-scale" computational resources for full NAS.
One of the most profound insights is Key Finding 4: the dominance of KV cache size over parameter count for long-context generation throughput. This challenges conventional wisdom where parameter count is often seen as the primary indicator of model "size" and efficiency. This finding provides a clear, actionable target for future LLM optimization efforts.
The JetBlock itself, with its dynamic convolution and integration of Gated DeltaNet mechanisms, is a strong contribution to the linear attention literature. It tackles a known limitation of static convolutions in linear attention models, demonstrating that more expressive attention blocks can be designed while staying within linear attention's efficiency constraints (a conceptual sketch of the static-vs-dynamic distinction follows below).
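To make that distinction concrete, here is a conceptual sketch (not JetBlock's actual implementation) of a depthwise causal convolution whose kernels are predicted from the input rather than fixed; the kernel-generator design and the pooling choice are assumptions for illustration only.

```python
# Conceptual contrast with a static depthwise convolution: here the per-channel
# kernels are generated from the input, so the convolution is "dynamic".
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthwiseConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # small kernel generator: maps a pooled summary of x to per-channel kernels
        self.to_kernel = nn.Linear(dim, dim * kernel_size)

    def forward(self, x):                                    # x: (batch, seq, dim)
        b, t, d = x.shape
        kernels = self.to_kernel(x.mean(dim=1))              # (b, d*k), input-dependent
        kernels = kernels.view(b * d, 1, self.kernel_size)
        x = x.transpose(1, 2).reshape(1, b * d, t)           # grouped-conv trick over batch
        x = F.pad(x, (self.kernel_size - 1, 0))              # causal (left) padding
        out = F.conv1d(x, kernels, groups=b * d)             # (1, b*d, t)
        return out.view(b, d, t).transpose(1, 2)             # back to (b, t, d)

x = torch.randn(2, 16, 32)
print(DynamicDepthwiseConv(32)(x).shape)  # torch.Size([2, 16, 32])
```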
A potential area for future exploration or a point of critique could be the degree to which PostNAS is truly "post-training." While it uses a pre-trained model as a starting point and freezes MLPs, it still involves a two-stage training process (distillation and full-model training) on potentially large datasets (50B + 350B tokens). While significantly cheaper than pre-training from scratch, it's not a zero-cost architecture adaptation. Investigating methods for purely "post-training" architectural adaptation (e.g., without extensive retraining even of attention blocks, or using much smaller adaptation datasets) could be a fascinating next step.
The paper's clear and comprehensive evaluation across numerous benchmarks and hardware types adds significant credibility to its claims. The consistently high accuracy combined with remarkable throughput gains suggests that Jet-Nemotron and the PostNAS methodology will be highly influential in the development and deployment of next-generation efficient LLMs. Its methods and conclusions could be particularly applicable in domains requiring on-device LLM inference or large-scale LLM serving where computational resources and latency are critical constraints. The rigorous, methodical approach of PostNAS could also inspire similar structured architectural search pipelines in other complex deep learning domains beyond LLMs.