Qwen3 Technical Report
TL;DR Summary
Qwen3 introduces a unified framework integrating thinking and non-thinking modes with dynamic switching, enhancing performance and multilingual support. It also features a thinking budget mechanism for adaptive resource allocation and expands language coverage from 29 to 119 languages and dialects.
Abstract
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "Qwen3 Technical Report". The central topic is the introduction and detailed technical description of Qwen3, the latest generation of the Qwen large language model (LLM) family.
1.2. Authors
The paper lists "Qwen Team" as the authors, followed by "Core Contributors" and "Contributors" sections detailing a large number of individuals involved. While specific affiliations are not explicitly stated in the provided abstract or initial pages, the Qwen model series is developed by Alibaba Cloud. The extensive list of contributors indicates a large-scale, collaborative research and development effort.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, with the link https://arxiv.org/abs/2505.09388. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference. However, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like artificial intelligence, and technical reports from major industry labs (like Alibaba's Qwen team) often serve as primary sources for new model introductions.
1.4. Publication Year
The paper was published on arXiv on 2025-05-14, indicating a publication year of 2025.
1.5. Abstract
This work introduces Qwen3, the newest iteration in the Qwen model family, featuring a series of large language models (LLMs) engineered for enhanced performance, efficiency, and multilingual capabilities. The Qwen3 series encompasses both dense and Mixture-of-Expert (MoE) architectures, with parameter counts spanning from 0.6 billion to 235 billion. A core innovation is the unified integration of a thinking mode for complex, multi-step reasoning and a non-thinking mode for swift, context-driven responses, obviating the need for model switching. This framework allows for dynamic mode selection based on user input or templates. Qwen3 also introduces a thinking budget mechanism, enabling adaptive allocation of computational resources during inference to balance latency and performance according to task complexity. Furthermore, by distilling knowledge from larger flagship models, the computational cost for developing smaller models is significantly reduced while maintaining competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including code generation, mathematical reasoning, and agent tasks, rivaling larger MoE and proprietary models. Compared to its predecessor, Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, improving global accessibility through advanced cross-lingual understanding and generation. All Qwen3 models are openly released under the Apache 2.0 license to foster reproducibility and community-driven research.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2505.09388, and the PDF link is https://arxiv.org/pdf/2505.09388v1.pdf. It is published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of artificial intelligence is relentlessly pursuing artificial general intelligence (AGI) and artificial super intelligence (ASI). Recent advancements in large foundation models (LLMs), such as GPT-4o, Claude 3.7, and Llama-4, have shown remarkable progress in distilling human knowledge and capabilities by training on vast datasets. However, a significant challenge remains: most state-of-the-art models are proprietary, limiting broader research and innovation. While open-source models like DeepSeek-V3 and Qwen2.5 have narrowed the performance gap, there's still a need for open-weight models that can compete at the highest levels, especially in complex reasoning tasks.
A key problem LLMs face is balancing the need for rapid, general-purpose responses with the demand for deep, multi-step reasoning. Currently, users often switch between different types of models—e.g., chat-optimized models for quick replies and dedicated reasoning models for complex problems—leading to inefficiencies and a fragmented user experience. Moreover, training and deploying large models, particularly for specialized reasoning, incurs substantial computational costs. There's a gap in providing a unified, adaptable, and efficient solution that caters to both thinking and non-thinking modes while also being accessible and capable across a wide array of languages.
The paper's entry point is to address these challenges by introducing Qwen3, a comprehensive series of open-weight LLMs. Its innovative idea centers on integrating thinking and non-thinking modes within a single model, coupled with a thinking budget mechanism, to offer unprecedented flexibility and resource control. This aims to eliminate the need for model switching, optimize inference costs, and extend state-of-the-art performance to an expanded multilingual user base.
2.2. Main Contributions / Findings
The Qwen3 technical report highlights several primary contributions and key findings:
- Unified Thinking and Non-Thinking Modes: Qwen3 integrates two distinct operational modes into a single, unified model: thinking mode for complex, multi-step reasoning and non-thinking mode for rapid, context-driven responses. This eliminates the need for users to switch between different models for varying task complexities, offering dynamic mode switching based on user queries or chat templates.
- Dynamic Thinking Budget Mechanism: A novel thinking budget mechanism is introduced, allowing users to adaptively allocate computational resources during inference. This provides fine-grained control over the model's reasoning effort, balancing latency and performance based on task complexity.
- Broad Model Series and Architectures: Qwen3 comprises a diverse series of LLMs, including both dense and Mixture-of-Experts (MoE) architectures. These models span a wide range of parameter scales, from 0.6 billion to 235 billion, catering to various downstream applications and deployment environments. The flagship Qwen3-235B-A22B is an MoE model with 235 billion total parameters and 22 billion activated parameters per token, balancing performance and efficiency.
- Enhanced Multilingual Capabilities: Qwen3 significantly expands its multilingual support, covering 119 languages and dialects, a substantial increase from Qwen2.5's 29 languages. This enhancement improves cross-lingual understanding and generation, making the models globally accessible.
- Efficient Strong-to-Weak Distillation: A strong-to-weak distillation pipeline is developed to efficiently train smaller-scale models. By leveraging knowledge transfer from larger, more capable flagship models, this approach drastically reduces the computational resources and development effort required for lightweight models while ensuring highly competitive performance. This distillation process is shown to be significantly more efficient than reinforcement learning for smaller models.
- State-of-the-Art Performance: Empirical evaluations demonstrate that Qwen3 models achieve state-of-the-art results across a diverse set of benchmarks. This includes tasks in code generation, mathematical reasoning, and agent tasks, with the flagship models (Qwen3-235B-A22B and Qwen3-32B) performing competitively against larger MoE models and proprietary models like OpenAI-o1, Gemini 2.5-Pro, and GPT-4o.
- Open-Source Release: All Qwen3 models are publicly accessible under the Apache 2.0 license, facilitating reproducibility and fostering community-driven research and development.
These contributions collectively address the fragmentation in model usage, resource inefficiency, and limitations in multilingual support, pushing the boundaries of open-source LLM capabilities.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the Qwen3 technical report, several foundational concepts in large language models (LLMs) and deep learning are essential:
- Large Language Models (LLMs): These are artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They typically use transformer architectures and can perform a wide range of natural language processing tasks, from translation to question answering.
- Dense Models vs. Mixture-of-Experts (MoE) Models:
- Dense Models: These are traditional neural networks where all parameters are activated for every input token. They are computationally intensive as their size scales up.
- Mixture-of-Experts (MoE) Models: These models consist of multiple "expert" sub-networks. For each input token, only a small subset of experts (e.g., 2 or 4 out of 128) are activated by a "router" or "gate" network. This allows MoE models to have a very large total number of parameters (for higher capacity) while maintaining a manageable number of activated parameters per token during inference, leading to more efficient computation for a given performance level.
- Pre-training and Fine-tuning:
- Pre-training: The initial, computationally expensive phase where a large model learns general language patterns, facts, and reasoning abilities from massive, diverse datasets using unsupervised or self-supervised learning objectives (e.g., predicting the next word).
- Fine-tuning: A subsequent phase where the pre-trained model is further trained on smaller, task-specific datasets to adapt its learned knowledge to specific downstream applications or human preferences. This can involve Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL).
- Chain-of-Thought (CoT) Reasoning: A prompting technique where the LLM is instructed to verbalize its intermediate reasoning steps before providing a final answer. This encourages the model to break down complex problems into manageable steps, often leading to more accurate and verifiable solutions, especially in mathematical or logical tasks.
- Reinforcement Learning (RL) and Reinforcement Learning from Human Feedback (RLHF):
- RL: A paradigm where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. The goal is to maximize cumulative reward.
- RLHF: A common technique for aligning LLMs with human preferences. It involves: 1) training a reward model (RM) on human judgments of LLM outputs, and 2) using this RM to provide rewards to the LLM during an RL phase (e.g., using algorithms like PPO or GRPO) to fine-tune its behavior.
- Transformer Architecture: The foundational neural network architecture for LLMs, characterized by self-attention mechanisms and feed-forward layers. It allows the model to weigh the importance of different parts of the input sequence when processing each token.
  - Attention Mechanism: A core component of transformers that allows the model to focus on different parts of the input sequence when generating an output. The Scaled Dot-Product Attention is defined as:
    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
    Where:
    - $Q$ is the Query matrix.
    - $K$ is the Key matrix.
    - $V$ is the Value matrix.
    - $d_k$ is the dimension of the keys.
    - $\mathrm{softmax}$ is the softmax function, which normalizes the attention scores.
- Grouped Query Attention (GQA): An optimization for multi-head attention where multiple query heads share the same key and value heads. This reduces the computational cost and memory footprint, especially during inference, while maintaining much of the performance of full multi-head attention.
- SwiGLU (Swish Gated Linear Unit): An activation function used in feed-forward networks within transformers, often replacing the traditional ReLU. It is defined as:
  $$\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW_1) \odot (xW_2)$$
  Where:
  - $x$ is the input.
  - $W_1, W_2$ are weight matrices.
  - $\mathrm{Swish}(x) = x \cdot \sigma(x)$, with $\sigma$ the sigmoid function.
  - $\odot$ denotes element-wise product.
- Rotary Positional Embeddings (RoPE): A type of positional encoding that encodes absolute position information with a rotation matrix and naturally incorporates relative position information. It is applied directly within the attention mechanism.
- RMSNorm (Root Mean Square Normalization): A normalization technique applied before or after layers in neural networks. It normalizes inputs by their root mean square, which can offer computational advantages and stability over LayerNorm. For an input vector $x$, RMSNorm is defined as (see the sketch after this concept list):
  $$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$
  Where:
  - $d$ is the dimension of $x$.
  - $\epsilon$ is a small constant for numerical stability.
  - $\gamma$ is a learnable scaling factor.
- QK-Norm: A normalization technique applied to the query and key vectors before their dot product in the attention mechanism, used to ensure stable training, particularly for very deep or large transformer models.
- Byte-level Byte-Pair Encoding (BBPE): A tokenization algorithm that learns a vocabulary of common character sequences (subwords) by iteratively merging the most frequent adjacent pairs of bytes. Byte-level ensures that all input text, regardless of character set, can be tokenized.
- Knowledge Distillation: A technique where a smaller "student" model learns from a larger, more powerful "teacher" model. The student tries to mimic the teacher's outputs, often by minimizing the Kullback-Leibler (KL) divergence between their output distributions. This allows the student to achieve better performance than it would if trained from scratch, with less data or computation.
- Kullback-Leibler (KL) Divergence: A measure of how one probability distribution $P$ diverges from a second, expected probability distribution $Q$. In distillation, it quantifies the difference between the teacher's soft probabilities and the student's probabilities. For discrete probability distributions $P$ and $Q$, it is defined as:
  $$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
- Adaptive Base Frequency (ABF): A technique used with RoPE to extend the context window of LLMs by adjusting the base frequency.
- Yet Another RoPE extensioN (YARN): A method to effectively extend the context length of RoPE-based models with minimal fine-tuning.
- Dual Chunk Attention (DCA): A technique designed to achieve efficient inference for long context lengths by processing input in chunks.
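To make the RMSNorm and SwiGLU definitions above concrete, here is a minimal PyTorch sketch of a pre-normalized feed-forward block. The layer sizes, module names, and the overall wiring are illustrative assumptions for this analysis, not Qwen3's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization as defined above."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block using the SwiGLU gated activation."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w3 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) element-wise-times (x W2), then project back to the model dimension.
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

if __name__ == "__main__":
    x = torch.randn(2, 8, 64)                                  # (batch, sequence, hidden)
    block = nn.Sequential(RMSNorm(64), SwiGLUFeedForward(64, 256))
    print(block(x).shape)                                      # torch.Size([2, 8, 64])
```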
3.2. Previous Works
The paper contextualizes Qwen3 by referring to several prominent LLMs and related research efforts:
- Foundation Models (General Progress):
  - GPT-4o (OpenAI, 2024): OpenAI's multimodal flagship model, noted for its strong performance across diverse tasks, often serving as a benchmark for general LLM capabilities.
  - Claude 3.7 (Anthropic, 2025): Anthropic's competitive LLM series, known for its strong reasoning and safety features.
  - Gemini 2.5 (DeepMind, 2025): Google DeepMind's multimodal model, also a major player in the SOTA LLM landscape.
  - DeepSeek-V3 (Liu et al., 2024a): An open-source model notable for its scale and performance, particularly as an MoE model. Qwen3 often benchmarks against it.
  - Llama-4 (Meta-AI, 2025): Meta AI's next-generation open-source LLM, representing a strong competitor in the open-weight community.
  - Qwen2.5 (Yang et al., 2024b): The immediate predecessor to Qwen3, serving as a baseline for improvements in architecture, data, and multilingual support.
- Reasoning Models (Optimized via RL):
  - OpenAI-o3 (OpenAI, 2025): A model from OpenAI specifically optimized for reasoning tasks, often through advanced RL techniques.
  - DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model, likely developed using RL, against which Qwen3's reasoning capabilities are compared.
  - QwQ-32B (Qwen Team, 2024, 2025): An earlier reasoning-focused model from the Qwen team, used as a strong baseline for Qwen3's thinking mode development and also as a teacher model in cold-start fine-tuning.
- Open-Source Baselines (for comparison across scales):
  - Llama-3 (Dubey et al., 2024): An earlier open-source model from Meta, serving as a benchmark for various parameter scales.
  - Gemma-3 (Team et al., 2025): Google's open-source model series, providing benchmarks across different sizes.
  - Phi-4 (Abdin et al., 2024): A smaller-scale model from Microsoft, often used for benchmarking lightweight LLMs.
Background on Qwen2.5's Role: Qwen2.5 is particularly relevant as it forms the direct lineage for Qwen3. The paper mentions several Qwen2.5 variants for data generation:
- Qwen2.5-VL (Bai et al., 2025): A vision-language model used to extract text from PDF documents for Qwen3's pre-training data.
- Qwen2.5-Math (Yang et al., 2024c): A specialized model for mathematical content generation, contributing to synthetic data.
- Qwen2.5-Coder (Hui et al., 2024): A specialized model for code-related data generation, also contributing to synthetic data.

The Qwen2.5-MoE architecture (Yang et al., 2024b) also serves as a direct predecessor for Qwen3's MoE models, with Qwen3 building upon its concepts like fine-grained expert segmentation.
3.3. Technological Evolution
The evolution of LLMs has been marked by several key trends:
- Scaling Laws: Initial models demonstrated that increasing parameters and data led to improved performance. This motivated the development of ever-larger models.
- Architectural Innovations: The Transformer architecture revolutionized NLP, leading to models with self-attention and positional embeddings. Subsequent innovations focused on improving efficiency (e.g., GQA), stability (RMSNorm, QK-Norm), and context handling (RoPE, YARN, DCA).
- Data Curation & Augmentation: Beyond simply scaling data, researchers focused on data quality, diversity (e.g., code, scientific texts), and synthetic data generation using existing LLMs (as seen with Qwen3 leveraging Qwen2.5 variants).
- Multilingualism: Early models were predominantly English-centric. There is a growing trend towards truly multilingual models, expanding language coverage and improving cross-lingual transfer.
- Alignment & Reasoning: The introduction of Chain-of-Thought (CoT) prompting and Reinforcement Learning from Human Feedback (RLHF) significantly enhanced models' reasoning abilities and their alignment with human preferences and instructions. Dedicated reasoning models emerged from this focus.
- Efficiency through Sparsity: Mixture-of-Experts (MoE) models represent a major step in combining vast capacity (total parameters) with efficient inference (activated parameters), addressing the computational bottlenecks of dense models.
- Unified Capabilities: The latest frontier involves integrating diverse capabilities (e.g., multimodal inputs, reasoning, quick responses) into a single, adaptable model, minimizing the need for specialized models.

Qwen3's work fits squarely within these latest trends, pushing the boundaries in scale, multilingual support, efficiency through MoE and distillation, and crucially, the unification of thinking and non-thinking modes.
3.4. Differentiation Analysis
Compared to the main methods and models in related work, Qwen3 introduces several core differences and innovations:
- Unified Thinking and Non-Thinking Modes with Dynamic Switching: This is Qwen3's most prominent differentiator. Prior approaches often required users to select a specific model optimized for chat (e.g., GPT-4o) or reasoning (e.g., QwQ-32B). Qwen3 integrates both within a single model, enabling dynamic switching via chat templates or user prompts (/think and /no_think flags). This simplifies deployment and usage, offering unparalleled flexibility.
- Adaptive Thinking Budget: Building on the unified modes, Qwen3 introduces a thinking budget mechanism. This allows users to control the computational resources and time allocated for reasoning during inference, dynamically balancing latency and performance based on the complexity of the query. This fine-grained control is a novel optimization for practical LLM deployment.
- Sophisticated Strong-to-Weak Distillation: While knowledge distillation is not new, Qwen3's application of strong-to-weak distillation for lightweight models is highly effective. It involves both off-policy and on-policy phases, leveraging flagship models as teachers. The paper demonstrates that this approach significantly outperforms reinforcement learning for smaller models in both performance and training efficiency (about 1/10 of the GPU hours), and also enhances exploration ability (Pass@64 scores).
- Massive Multilingual Expansion: Qwen3 drastically expands its language support from 29 to 119 languages and dialects. This is a significant leap compared to many models that focus primarily on high-resource languages, making Qwen3 exceptionally broad in its global accessibility. This was achieved through a dedicated multilingual data annotation system and a massive 36-trillion-token pre-training dataset.
- Advanced MoE Architecture: Qwen3's MoE models (e.g., Qwen3-235B-A22B) build upon Qwen2.5-MoE but introduce improvements such as excluding shared experts and adopting a global-batch load balancing loss (Qiu et al., 2025) to encourage expert specialization. This contributes to better performance and efficiency compared to previous MoE designs.
- Architectural Refinements for Stability: Beyond standard Transformer components, Qwen3 incorporates subtle but important changes like removing QKV-bias and introducing QK-Norm to ensure stable training, particularly for its large-scale models.
- Comprehensive Post-training Pipeline: The multi-stage post-training approach, combining Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, ensures robust reasoning, alignment, and general capabilities for flagship models, which then serve as teachers for smaller models. This structured approach is designed to systematically instill and refine diverse skills.

In essence, Qwen3 differentiates itself by offering a more holistic, flexible, and efficient LLM solution that explicitly addresses the trade-offs between rapid response and deep reasoning within a single, highly multilingual, and open-source framework.
4. Methodology
4.1. Principles
The core idea behind Qwen3's methodology is to develop a highly versatile and efficient family of large language models (LLMs) that can dynamically adapt to different task requirements, particularly balancing rapid response with complex reasoning. This is achieved through a unified architecture that supports both thinking and non-thinking modes, complemented by a thinking budget mechanism. The theoretical basis rests on the observation that not all tasks require extensive reasoning, and therefore, dynamically adjusting the computational effort can lead to significant efficiency gains without compromising performance when reasoning is truly needed.
The methodology can be broken down into three main pillars:
- Flexible Architecture Design: Incorporating both dense and Mixture-of-Experts (MoE) models, with architectural refinements for stability and efficiency.
- Massive and Diverse Pre-training: Training on an enormous, linguistically diverse, and domain-rich dataset using a multi-stage approach to build a strong foundation of general knowledge, reasoning, and long-context understanding.
- Sophisticated Multi-Stage Post-training: A comprehensive fine-tuning strategy that explicitly develops thinking and non-thinking capabilities, aligns the models with human preferences, and efficiently transfers knowledge to smaller models through strong-to-weak distillation.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Architecture
The Qwen3 series includes 6 dense models and 2 MoE models, ranging from 0.6 billion to 235 billion parameters.
Dense Model Architecture:
The architecture of Qwen3 dense models builds upon Qwen2.5 with several enhancements. Key components include:
- Grouped Query Attention (GQA): An optimization for the multi-head attention mechanism where multiple query heads share a single key and value head. This reduces the memory footprint and increases inference speed compared to standard multi-head attention, especially for larger models.
- SwiGLU: A gated activation function used in the feed-forward network blocks of the Transformer. It often provides performance improvements over ReLU or GELU.
- Rotary Positional Embeddings (RoPE): A method for incorporating positional information into the self-attention mechanism by rotating query and key vectors based on their absolute positions, which implicitly captures relative position information.
- RMSNorm with Pre-normalization: Root Mean Square Normalization applied before the attention and feed-forward layers (pre-normalization), which can improve training stability and performance.
- QKV-bias Removal: Unlike Qwen2, QKV-bias (bias terms added to the query, key, and value projections) is removed in Qwen3. This is a simplification that can sometimes improve stability or performance.
- QK-Norm Introduction: A new normalization technique, QK-Norm (Dehghani et al., 2023), is introduced to the attention mechanism. It is applied to the query and key vectors before their dot product and the softmax function, aiming to ensure stable training for large models.

The specific architectural details for dense models are provided in Table 1.
The following are the results from Table 1 of the original paper:
| Models | Layers | Heads (Q / KV) | Tie Embedding | Context Length |
| Qwen3-0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 36 | 32 / 8 | Yes | 128K |
| Qwen3-8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 64 | 64 / 8 | No | 128K |
MoE Model Architecture:
The Qwen3 MoE models share the fundamental architecture with the dense models (i.e., GQA, SwiGLU, RoPE, RMSNorm, QK-Norm). Key specifics for MoE are:
- Total Experts: 128 total experts.
- Activated Experts: 8 experts activated per token. This means that for each token, the router network selects 8 out of the 128 experts to process it.
- Fine-grained Expert Segmentation: Similar to Qwen2.5-MoE, this approach likely refers to how experts are structured and utilized, potentially at a sub-layer or fine-grained level within the feed-forward block.
- Exclusion of Shared Experts: Unlike Qwen2.5-MoE, Qwen3's MoE design excludes shared experts, so all experts are distinct, fostering better specialization. Shared experts are common experts that all tokens access, while the remaining experts are conditionally activated; removing them implies purely conditional computation.
- Global-batch Load Balancing Loss: A training objective (Qiu et al., 2025) applied to the router network of the MoE model. Its purpose is to encourage expert specialization and ensure that the workload is evenly distributed across all experts within a training global batch. This prevents a few experts from becoming overloaded while others remain underutilized, which is crucial for efficient MoE training (a toy sketch of top-k routing with such an auxiliary loss follows Table 2).

The specific architectural details for MoE models are provided in Table 2.
The following are the results from Table 2 of the original paper:
| Models | Layers | Heads (Q / KV) | # Experts (Total / Activated) | Context Length |
| Qwen3-30B-A3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 94 | 64 / 4 | 128 / 8 | 128K |
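The load-balancing idea described above can be illustrated with a toy top-k router. This sketch uses the generic Switch-Transformer-style auxiliary loss computed over a single batch; the cross-rank aggregation that makes it "global-batch" (Qiu et al., 2025), as well as all tensor sizes and function names, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=8):
    """Toy top-k router: pick top_k experts per token and renormalize their gates."""
    logits = hidden @ router_weight                       # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_ids = probs.topk(top_k, dim=-1)     # chosen experts per token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)
    return probs, gate_vals, expert_ids

def load_balancing_loss(probs, expert_ids, num_experts=128):
    """Auxiliary loss pushing routing toward a uniform expert load; computed here
    over one batch (a global-batch variant would aggregate these statistics
    across all data-parallel ranks before taking the product)."""
    one_hot = F.one_hot(expert_ids, num_experts).float().sum(dim=1)  # (tokens, experts)
    tokens_per_expert = one_hot.mean(dim=0)     # fraction of assignments per expert
    prob_per_expert = probs.mean(dim=0)         # mean router probability per expert
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

if __name__ == "__main__":
    hidden = torch.randn(16, 64)                # 16 tokens, hidden size 64
    router_w = torch.randn(64, 128)             # 128 experts
    probs, gates, ids = route_tokens(hidden, router_w)
    print(ids.shape, load_balancing_loss(probs, ids).item())
```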
Tokenizer:
Qwen3 models use Qwen's tokenizer (Bai et al., 2023), which implements byte-level byte-pair encoding (BBPE). It has a vocabulary size of 151,669. BBPE tokenizers are robust to unseen characters and can tokenize any input string.
4.2.2. Pre-training
Qwen3's pre-training involves a large, diverse dataset and a three-stage process.
4.2.2.1. Pre-training Data
The pre-training dataset for Qwen3 is significantly expanded compared to Qwen2.5, featuring:
- Scale: 36 trillion tokens (twice as many as Qwen2.5).
- Languages: 119 languages and dialects (three times more than Qwen2.5's 29).
- Diversity: High-quality content across various domains including coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data.
Data Expansion Methods:
- Multi-modal Extraction: Qwen2.5-VL (Bai et al., 2025), a vision-language model, is used to extract text from a large volume of PDF-like documents. The extracted text is then refined using Qwen2.5 to improve its quality, yielding trillions of additional high-quality tokens.
- Synthetic Data Generation: Domain-specific Qwen2.5 models are employed to synthesize trillions of text tokens in various formats:
  - Qwen2.5 (Yang et al., 2024b) for general text.
  - Qwen2.5-Math (Yang et al., 2024c) for mathematical content (textbooks, Q&A).
  - Qwen2.5-Coder (Hui et al., 2024) for code-related data (code snippets).
- Multilingual Data Annotation System: A sophisticated system was developed to annotate over 30 trillion tokens across multiple dimensions (educational value, fields, domains, safety). This supports finer-grained data filtering and combination. The method optimizes the data mixture at the instance level through ablation experiments on small proxy models with these fine-grained labels, a more granular approach than previous efforts that typically optimize at the corpus or domain level.
4.2.2.2. Pre-training Stages
The pre-training process is divided into three distinct stages:
- General Stage (S1):
- Objective: Build a strong foundation of general knowledge and language proficiency.
- Data: Over 30 trillion tokens.
- Context Length: 4,096 tokens.
- Coverage: 119 languages and dialects.
- Reasoning Stage (S2):
- Objective: Enhance the model's reasoning abilities.
  - Data: Approximately 5 trillion higher-quality tokens. The pre-training corpus is optimized by increasing the proportion of STEM, coding, reasoning, and synthetic data.
  - Context Length: Not explicitly stated but implied to be similar to S1. The original text states "a sequence length of Wa", which appears to be a typo, most plausibly "a sequence length of 4,096 tokens". Given the subsequent long-context stage, it likely refers to the same standard sequence length as S1.
- Long Context Stage:
- Objective: Extend the maximum context length of Qwen3 models.
- Data: Hundreds of billions of tokens from a high-quality long context corpus.
- Context Length Extension: Increases from 4,096 to 32,768 tokens.
  - Corpus Composition: A combination of text between 16,384 and 32,768 tokens in length and text between 4,096 and 16,384 tokens in length.
- Techniques for Long Context:
    - Adaptive Base Frequency (ABF) with RoPE: The base frequency of RoPE is increased from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). This modifies the way positional information is embedded to support longer sequences (a toy numerical sketch of this change appears at the end of this subsection).
RoPE-based models. - Dual Chunk Attention (DCA, An et al., 2024): A technique to achieve a four-fold increase in sequence length capacity during inference by processing long contexts in a chunk-wise manner.
- Adaptive Base Frequency (ABF) with RoPE: The base frequency of
Scaling Laws for Hyperparameters: Similar to Qwen2.5, scaling laws are developed to predict optimal hyperparameters (e.g., learning rate scheduler, batch size) for each stage. Extensive experiments study the relationship between model architecture, training data, training stage, and optimal training hyperparameters to set the predicted values for each dense or MoE model.
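The ABF change described in the long-context stage (raising the RoPE base frequency from 10,000 to 1,000,000) amounts to recomputing the per-dimension rotation frequencies. Below is a hedged, self-contained sketch of RoPE with an adjustable base; the pairing of dimensions and the helper names are illustrative assumptions, not the exact Qwen3 kernel.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 1_000_000.0) -> torch.Tensor:
    """Per-dimension rotation frequencies for RoPE. A larger `base` slows the
    low-frequency rotations so positional phases remain distinguishable over
    much longer sequences."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def apply_rope(x: torch.Tensor, base: float = 1_000_000.0) -> torch.Tensor:
    """Rotate query/key vectors x of shape (seq_len, head_dim) by position."""
    seq_len, head_dim = x.shape
    freqs = rope_frequencies(head_dim, base)                    # (head_dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * freqs     # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                             # split into pairs
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

if __name__ == "__main__":
    q = torch.randn(32768, 128)        # a 32K-token sequence of 128-dim heads
    print(apply_rope(q, base=1_000_000.0).shape)
```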
4.2.3. Post-training
The post-training pipeline for Qwen3 is designed with two main objectives:
- Thinking Control: To integrate thinking and non-thinking modes and allow fine-grained control over reasoning depth via a thinking budget.
- Strong-to-Weak Distillation: To optimize the post-training process for lightweight models by leveraging knowledge from larger flagship models.

The flagship models follow a sophisticated four-stage post-training process, while smaller models utilize strong-to-weak distillation.
The post-training pipeline is visually represented in Figure 1 (from the original paper).
This image is a schematic of the post-training pipeline of the Qwen3 series models. It shows the different training stages from base models to flagship and lightweight models, including key steps such as Long-CoT cold start, reasoning reinforcement learning, thinking mode fusion, and general reinforcement learning.
Figure 1: Post-training pipeline of the Qwen3 series models.
4.2.3.1. Long-CoT Cold Start
This is the first stage of post-training for flagship models, focusing on developing foundational thinking abilities.
- Dataset Curation: A comprehensive dataset covering math, code, logical reasoning, and general STEM problems is created. Each problem is paired with verified reference answers or code-based test cases.
- Two-Phase Filtering:
  - Query Filtering: Qwen2.5-72B-Instruct is used to:
    - Remove non-verifiable queries (e.g., multiple sub-questions, general text generation).
    - Exclude queries that Qwen2.5-72B-Instruct can answer without CoT reasoning (to ensure only complex problems requiring deeper reasoning are included).
    - Annotate each query's domain to maintain balanced representation.
  - Response Filtering: For queries with positive Pass@N (meaning a correct solution was found within N attempts), QwQ-32B (Qwen Team, 2025) generates candidate responses. These responses undergo stringent filtering to remove:
    - Incorrect final answers.
    - Substantial repetition.
    - Guesswork without adequate reasoning.
    - Inconsistencies between thinking and summary content.
    - Inappropriate language mixing or stylistic shifts.
    - Responses overly similar to potential validation items.
- Objective: Instill foundational reasoning patterns without over-emphasizing immediate reasoning performance, preparing the model for subsequent Reinforcement Learning (RL). This phase aims to minimize training samples and steps.
4.2.3.2. Reasoning RL
The second stage focuses on further enhancing reasoning through Reinforcement Learning.
- Query-Verifier Pairs: A dataset of 3,995 query-verifier pairs is collected, satisfying four criteria:
  - Not used in the cold-start phase.
  - Learnable for the cold-start model.
  - As challenging as possible.
  - Covering a broad range of sub-domains.
- RL Algorithm: GRPO (Group Relative Policy Optimization; Shao et al., 2024) is employed to update model parameters. GRPO is a policy gradient method designed for stable and efficient reinforcement learning (a toy sketch of its group-relative advantage computation follows this list).
- Training Benefits: Large batch sizes and a high number of rollouts per query, along with off-policy training, are found beneficial for improving sample efficiency.
- Exploration-Exploitation Balance: Balancing exploration and exploitation is addressed by controlling the model's entropy so that it increases steadily or remains stable. This helps maintain stable training and prevents premature convergence.
- Results: Consistent improvements in both training reward and validation performance are observed without manual hyperparameter intervention. For example, the AIME'24 score for Qwen3-235B-A22B increased from 70.1 to 85.1 over 170 RL training steps.
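GRPO's defining ingredient is a group-relative baseline: each rollout's verifier reward is normalized against the other rollouts sampled for the same query, removing the need for a separate value network. The sketch below shows only that advantage computation under the assumption of scalar verifier rewards; the clipped policy-gradient update and any KL regularization are omitted.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and standard deviation of its own rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0    # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

if __name__ == "__main__":
    # one query, eight rollouts scored by a verifier (1 = correct, 0 = wrong)
    print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```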
4.2.3.3. Thinking Mode Fusion
The third stage integrates non-thinking capabilities into the thinking-capable model developed in previous stages.
- Objective: Allow developers to manage reasoning behaviors and reduce the cost/complexity of deploying separate models.
- Approach: Continual supervised fine-tuning (SFT) is performed on the Reasoning RL model.
- SFT Data Construction:
  - Thinking data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself, ensuring performance is not compromised.
  - Non-thinking data is curated to cover diverse tasks, including conceptual knowledge, summarization, and role-playing. Automatically generated checklists are used to assess response quality.
  - The proportion of translation tasks is increased to enhance performance on low-resource languages.
- Chat Template Design: A chat template is designed for dynamic mode switching, as shown in Table 9.
  - /think and /no_think flags are introduced in user queries or system messages.
  - For non-thinking mode samples, an empty thinking block (<think></think>) is retained in the assistant's response to ensure internal format consistency and allow developers to explicitly prevent thinking.
  - By default, the model operates in thinking mode; thus, some thinking mode samples without /think flags are included in training.
  - For multi-turn dialogs, multiple /think and /no_think flags can be randomly inserted, with the model adhering to the last encountered flag.

The following are the results from Table 9 of the original paper:
```
Thinking Mode                         Non-Thinking Mode
<|im_start|>user                      <|im_start|>user
{query} /think<|im_end|>              {query} /no_think<|im_end|>
<|im_start|>assistant                 <|im_start|>assistant
<think>                               <think>
{thinking_content}
</think>                              </think>
{response}<|im_end|>                  {response}<|im_end|>
```

- Thinking Budget Mechanism: The model's ability to handle intermediate cases (generating responses based on incomplete thinking) emerges naturally from Thinking Mode Fusion. This forms the basis for budget control.
  - If the model's thinking length reaches a user-defined threshold, the thinking process is manually halted.
  - A stop-thinking instruction is inserted, telling the model to produce its solution directly from the reasoning so far, followed by the closing </think> tag.
  - The model then generates a final response based on its accumulated reasoning up to that point.
4.2.3.4. General RL
The final stage of post-training aims for broad enhancement of capabilities and stability across diverse scenarios.
- Sophisticated Reward System: Covers over 20 distinct tasks, each with customized scoring criteria, targeting core capabilities:
  - Instruction Following: Ensures accurate interpretation and execution of user instructions (content, format, length, structured output).
  - Format Following: Adherence to specific formatting conventions, e.g., responding to /think and /no_think flags and using the <think> and </think> tokens appropriately.
  - Preference Alignment: For open-ended queries, improves helpfulness, engagement, and style for a natural user experience.
  - Agent Ability: Training the model to correctly invoke tools via designated interfaces. During RL rollout, the model performs multi-turn interaction cycles with real environment execution feedback to improve long-horizon decision-making.
  - Abilities for Specialized Scenarios: Tasks tailored to specific contexts, e.g., Retrieval-Augmented Generation (RAG) tasks incorporate reward signals to guide accurate and contextually appropriate responses, minimizing hallucination.
- Three Types of Rewards:
  - Rule-based Reward: Widely used in reasoning RL and for general tasks like instruction following and format adherence. Provides high precision in assessing correctness and prevents reward hacking.
  - Model-based Reward with Reference Answer: A reference answer is provided for each query, and Qwen2.5-72B-Instruct scores the model's response against it. Offers flexibility for diverse tasks without strict formatting requirements.
  - Model-based Reward without Reference Answer: A reward model is trained using human preference data to assign scalar scores to responses. Handles a broader range of queries and enhances engagement and helpfulness.
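As a concrete illustration of the rule-based reward category above, the toy function below scores a response for (a) matching the expected thinking/non-thinking format and (b) containing the verified reference answer. The specific rules, weights, and regular expression are hypothetical and far simpler than the report's actual scoring criteria.

```python
import re

def rule_based_reward(response: str, reference_answer: str, expect_thinking: bool) -> float:
    """Toy rule-based reward: format compliance plus exact-match correctness."""
    reward = 0.0
    has_think_block = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    if has_think_block == expect_thinking:
        reward += 0.2            # followed the /think vs /no_think convention
    final = response.split("</think>")[-1].strip()
    if reference_answer.strip() and reference_answer.strip() in final:
        reward += 0.8            # final answer matches the verified reference
    return reward

if __name__ == "__main__":
    resp = "<think>2+2 is 4</think>\nThe answer is 4."
    print(rule_based_reward(resp, "4", expect_thinking=True))   # 1.0
```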
4.2.3.5. Strong-to-Weak Distillation
This pipeline is specifically for optimizing lightweight models (5 dense models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B, and one MoE model: Qwen3-30B-A3B).
- Objective: Enhance performance and impart robust mode-switching capabilities efficiently, requiring significantly fewer computational resources than the full four-stage process.
- Two Primary Phases:
  - Off-policy Distillation:
    - Combines outputs from teacher models (generated in both /think and /no_think modes) for response distillation.
    - Helps student models develop basic reasoning skills and mode-switching ability, establishing a foundation for the next phase.
  - On-policy Distillation:
    - The student model generates on-policy sequences for fine-tuning.
    - Prompts are sampled, and the student model produces responses in either /think or /no_think mode.
    - The student model is fine-tuned by aligning its logits with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B). This alignment is achieved by minimizing the Kullback-Leibler (KL) divergence between the teacher's and the student's soft probabilities (logits converted to probabilities via softmax):
      $$\mathcal{L}_{\mathrm{KD}} = \sum_{x \in \mathcal{V}} P_T(x) \log \frac{P_T(x)}{P_S(x)}$$
      Where:
      - $P_T(x)$ represents the probability of token $x$ predicted by the teacher model.
      - $P_S(x)$ represents the probability of token $x$ predicted by the student model.
      - $\mathcal{V}$ is the vocabulary.
      - The sum is the Kullback-Leibler divergence $D_{\mathrm{KL}}(P_T \parallel P_S)$ at each sequence position.
    - This process enables better Pass@1 scores (immediate performance) and improved exploration ability (Pass@64), while requiring only 1/10 of the GPU hours compared to the four-stage training (a minimal sketch of this objective appears below).
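A minimal sketch of the on-policy distillation objective described above: per-token KL divergence from the teacher's distribution to the student's, computed on sequences the student itself generated. Tensor shapes, the averaging scheme, and the use of raw logits are illustrative assumptions rather than the report's exact recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level KL(teacher || student), averaged over all positions.
    Both tensors have shape (batch, seq_len, vocab)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    kl = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(dim=-1)
    return kl.mean()

if __name__ == "__main__":
    s = torch.randn(2, 5, 151_669, requires_grad=True)   # student logits (Qwen vocab size)
    t = torch.randn(2, 5, 151_669)                        # teacher logits (frozen)
    loss = kd_loss(s, t)
    loss.backward()
    print(float(loss))
```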
5. Experimental Setup
The experimental setup for Qwen3 involved comprehensive evaluations of both pre-trained (base) and post-trained (instruction-tuned) models across a wide array of benchmarks.
5.1. Datasets
The datasets used in the experiments cover general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.
For Pre-trained Base Models (Section 3.3):
- General Tasks:
  - MMLU (Hendrycks et al., 2021a): Massive Multitask Language Understanding (5-shot). A benchmark covering 57 academic subjects.
  - MMLU-Pro (Wang et al., 2024): An extended version of MMLU (5-shot, CoT).
  - MMLU-redux (Gema et al., 2024): Another variant of MMLU (5-shot).
  - BBH (Suzgun et al., 2023): BIG-Bench Hard (3-shot, CoT). A subset of BIG-Bench tasks designed to be challenging for large language models, often requiring complex reasoning.
  - SuperGPQA (Du et al., 2025): Super Graduate-level Google-proof Q&A (5-shot, CoT). A very challenging question-answering benchmark.
- Math & STEM Tasks:
  - GPQA (Rein et al., 2023): Graduate-level Google-proof Q&A (5-shot, CoT). Similar to SuperGPQA, requiring deep scientific and mathematical knowledge.
  - GSM8K (Cobbe et al., 2021): Grade School Math 8K (4-shot, CoT). A dataset of elementary school math word problems.
  - MATH (Hendrycks et al., 2021b): A dataset of challenging mathematical problems from high school competitions (4-shot, CoT).
- Coding Tasks:
  - EvalPlus (Liu et al., 2023a): (0-shot) An evaluation suite for code generation, averaging performance on HumanEval, MBPP, HumanEval+, and MBPP+.
  - MultiPL-E (Cassano et al., 2023): (0-shot) A polyglot benchmark for code generation across multiple programming languages (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript).
  - MBPP-3shot (Austin et al., 2021): Mostly Basic Python Problems (3-shot). A dataset of Python programming problems.
  - CRUX-O of CRUXEval (Gu et al., 2024): (1-shot) A benchmark for code reasoning, understanding, and execution.
- Multilingual Tasks:
  - MGSM (Shi et al., 2023): Multilingual Grade School Math (8-shot, CoT). Multilingual math word problems.
  - MMMLU (OpenAI, 2024): Multilingual Massive Multitask Language Understanding (5-shot). A multilingual version of MMLU.
  - INCLUDE (Romanou et al., 2024): Evaluating multilingual language understanding with regional knowledge (5-shot).
For Post-trained Models (Section 4.6):
- General Tasks:
  - MMLU-Redux (Gema et al., 2024).
  - GPQA-Diamond (Rein et al., 2023): A subset of GPQA with very challenging questions. For this benchmark, 10 samples are taken for each query, and the averaged accuracy is reported.
  - C-Eval (Huang et al., 2023): A multilevel, multi-discipline Chinese evaluation suite for foundation models.
  - LiveBench (2024-11-25) (White et al., 2024): A challenging, contamination-free LLM benchmark.
- Alignment Tasks:
  - IFEval (Zhou et al., 2023): Instruction-following evaluation for large language models, reporting strict prompt accuracy.
  - Arena-Hard (Li et al., 2024): A benchmark for evaluating human preferences, derived from crowdsourced data.
  - AlignBench v1.1 (Liu et al., 2023b): Benchmarking Chinese alignment of large language models.
  - Creative Writing V3 (Paech, 2024): Evaluates creative writing proficiency.
  - WritingBench (Wu et al., 2025): A comprehensive benchmark for generative writing.
- Math & Text Reasoning:
  - MATH-500 (Lightman et al., 2023): A high-level math benchmark.
  - AIME'24 and AIME'25 (AIME, 2025): Problems from the American Invitational Mathematics Examination. For each question, 64 samples are taken, and the average accuracy is reported.
  - ZebraLogic (Lin et al., 2025): A benchmark for logical reasoning, particularly Zebra Puzzles.
  - AutoLogi (Zu et al., 2025): Automated generation of logic puzzles.
- Agent & Coding:
  - BFCL v3 (Yan et al., 2024): Berkeley Function Calling Leaderboard. Models are evaluated using the FC format, and YaRN is used to extend the context length to 64k for the Multi-Turn evaluation.
  - LiveCodeBench (v5, 2024.10-2025.02) (Jain et al., 2024): A holistic and contamination-free evaluation for code generation. For non-thinking mode, the official prompt is used; for thinking mode, prompt templates are adjusted to allow more free thinking.
  - CodeForces Ratings from CodeElo (Quan et al., 2025): Calculates Elo ratings to compare model performance against competitive programming experts. Each problem is solved by generating up to eight independent reasoning attempts.
- Multilingual Tasks:
  - Multi-IF (He et al., 2024): Multilingual instruction following (8 key languages).
  - INCLUDE (Romanou et al., 2024): Regional knowledge (44 languages). Only 10% of the original data is sampled for efficiency.
  - MMMLU (OpenAI, 2024): General knowledge (14 languages, excluding unoptimized Yoruba). Only 10% of the original data is sampled for efficiency.
  - MT-AIME2024 (Son et al., 2025): Multilingual AIME (55 languages).
  - PolyMath (Wang et al., 2025): Multilingual mathematical reasoning (18 languages).
  - MLogiQA (Zhang et al., 2024): Multilingual logical reasoning (10 languages).
  - Belebele (Bandarkar et al., 2023): A benchmark for natural language understanding in 122 language variants. Evaluated on 80 supported languages.
In-house Benchmarks (for Ablation Studies):
- CounterFactQA: Contains counterfactual questions where the model needs to identify their non-factual nature and avoid hallucination.
- LengthCtrl: Creative writing tasks with length requirements; the score is based on the difference between generated and target length.
- ThinkFollow: Multi-turn dialogues with randomly inserted /think and /no_think flags to test mode switching.
- Tool use: Evaluates tool-calling proficiency (intent, format, parameter accuracy) with multi-turn interactions and real environment feedback.
The following are the results from Table 10 of the original paper:
| Benchmark | # Langs | Languages |
| Multi-IF | 8 | en, es, fr, hi, it, pt, ru, zh |
| INCLUDE | 44 | ar, az, be, bg, bn, de, el, es, et, eu, fa, fi, fr, he, hi, hr, hu, hy, id, it, ja, ka, kk, ko, lt, mk, ml, ms, ne, nl, pl, pt, ru, sq, sr, ta, te, tl, tr, uk, ur, uz, vi, zh |
| MMMLU | 14 | ar, bn, de, en, es, fr, hi, id, it, ja, ko, pt, sw, zh |
| MT-AIME2024 | 55 | af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-Hans, zh-Hant |
| PolyMath | 18 | ar, bn, de, en, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi, zh |
| MLogiQA | 10 | ar, en, es, fr, ja, ko, pt, th, vi, zh |
The following are the results from Table 36 of the original paper:
| Language family | # Langs | Language code (ISO 639-3 / ISO 15924) |
| Indo-European | 40 | por_Latn, deu_Latn, tgk_Cyrl, ces_Latn, nob_Latn, dan_Latn, snd_Arab, spa_Latn, isl_Latn, slv_Latn, eng_Latn, ory_Orya, hrv_Latn, ell_Grek, ukr_Cyrl, pan_Guru, srp_Cyrl, npi_Deva, mkd_Cyrl, guj_Gujr, nld_Latn, swe_Latn, hin_Deva, rus_Cyrl, asm_Beng, cat_Latn, als_Latn, sin_Sinh, urd_Arab, mar_Deva, lit_Latn, slk_Latn, ita_Latn, pol_Latn, bul_Cyrl, afr_Latn, ron_Latn, fra_Latn, ben_Beng, hye_Armn |
| Sino-Tibetan | 3 | zho_Hans, mya_Mymr, zho_Hant |
| Afro-Asiatic | 8 | heb_Hebr, apc_Arab, acm_Arab, ary_Arab, ars_Arab, arb_Arab, mlt_Latn, erz_Arab |
| Austronesian | 7 | ilo_Latn, ceb_Latn, tgl_Latn, sun_Latn, jav_Latn, war_Latn, ind_Latn |
| Dravidian | 4 | mal_Mlym, kan_Knda, tel_Telu, tam_Taml |
| Turkic | 4 | kaz_Cyrl, azj_Latn, tur_Latn, uzn_Latn |
| Tai-Kadai | 2 | tha_Thai, lao_Laoo |
| Uralic | 3 | fin_Latn, hun_Latn, est_Latn |
| Austroasiatic | 2 | vie_Latn, khm_Khmr |
| Other | 7 | eus_Latn, kor_Hang, hat_Latn, swh_Latn, kea_Latn, jpn_Jpan, kat_Geor |
5.2. Evaluation Metrics
For every evaluation metric, the following explanations are provided:
- Accuracy (Acc):
- Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a common metric for classification tasks.
  - Mathematical Formula:
    $$\mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
  - Symbol Explanation: "Number of Correct Predictions" refers to the count of instances where the model's output matches the ground truth. "Total Number of Predictions" is the total count of instances evaluated.
- Pass@k:
- Conceptual Definition: A metric used primarily in code generation benchmarks. It evaluates the probability that at least one out of generated solutions for a problem passes all given test cases. It accounts for the stochastic nature of LLM generation.
  - Mathematical Formula: If $n_i$ candidates are generated for problem $i$, of which $c_i$ pass, Pass@k is calculated by averaging over problems:
    $$\mathrm{Pass@}k = \frac{1}{P}\sum_{i=1}^{P}\left[1 - \frac{\binom{n_i - c_i}{k}}{\binom{n_i}{k}}\right]$$
    Where:
    - $P$ is the number of problems.
    - $n_i$ is the number of code candidates generated for problem $i$.
    - $c_i$ is the number of passing candidates for problem $i$.
    - $\binom{n}{r}$ is the binomial coefficient, representing "n choose r".
  - Symbol Explanation: $P$ is the total number of coding problems. $n_i$ is the number of attempts made by the model for problem $i$. $c_i$ is the number of those attempts that correctly solve problem $i$. $k$ is the number of solutions considered. The term $1 - \binom{n_i - c_i}{k}/\binom{n_i}{k}$ is the probability that at least one of $k$ randomly chosen solutions out of the $n_i$ attempts (of which $c_i$ are successful) is correct (a small numerical implementation appears after this metrics list).
- Elo Rating (CodeForces):
- Conceptual Definition: A method for calculating the relative skill levels of players (or in this case, AI models) in zero-sum games, such as competitive programming. It's a dynamic rating system where a player's rating changes based on the outcome of matches against other players. A higher Elo rating indicates a stronger performance.
  - Mathematical Formula: The change in Elo rating ($\Delta R$) after a match is calculated as:
    $$\Delta R = K \cdot (S - E)$$
    Where:
    - $K$ is the K-factor, a constant that determines the maximum possible adjustment for a single game.
    - $S$ is the actual score (1 for win, 0.5 for draw, 0 for loss).
    - $E$ is the expected score, calculated from the opponent's rating ($R_o$) and the player's own rating ($R_p$):
      $$E = \frac{1}{1 + 10^{(R_o - R_p)/400}}$$
  - Symbol Explanation: $K$ controls the sensitivity of the rating change. $S$ is the actual outcome of the competition (e.g., passing a coding problem). $E$ is the expected outcome based on the current ratings. $R_o$ and $R_p$ are the Elo ratings of the opponent and the player, respectively.
- Strict Prompt Accuracy (IFEval):
- Conceptual Definition: Measures how precisely a model adheres to the explicit instructions given in a prompt. It's often a binary metric (pass/fail) for each instruction, focusing on exact compliance rather than general quality.
  - Mathematical Formula: There is no single universal formula; in practice it is the percentage of prompts where every instruction was followed correctly:
    $$\text{Strict Prompt Accuracy} = \frac{\text{Number of Prompts with All Instructions Followed}}{\text{Total Number of Prompts}}$$
  - Symbol Explanation: "Number of Prompts with All Instructions Followed" is the count of prompts where the model's response correctly fulfilled every specified instruction. "Total Number of Prompts" is the number of instruction sets evaluated.
- Average (Avg.):
- Conceptual Definition: The arithmetic mean of scores across multiple benchmarks or tasks, used to provide a single aggregate performance indicator.
  - Mathematical Formula:
    $$\mathrm{Avg.} = \frac{1}{N}\sum_{i=1}^{N} s_i$$
  - Symbol Explanation: $N$ is the number of individual scores or benchmarks being averaged. $s_i$ is the score for the $i$-th benchmark.
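For reference, the unbiased Pass@k estimator described above can be computed directly from per-problem sample counts. The helper names below are ad hoc; the formula itself is the standard estimator.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems, given (n_i, c_i) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

if __name__ == "__main__":
    # e.g., three problems, 8 samples each, with 0, 2 and 5 correct samples
    print(mean_pass_at_k([(8, 0), (8, 2), (8, 5)], k=1))
```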
5.3. Baselines
The Qwen3 models are compared against a comprehensive set of baselines, including both open-source and proprietary models, to demonstrate their competitive standing across various scales and capabilities.
Pre-trained Base Model Baselines:
- Qwen2.5 Base Models: Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-72B-Base, Qwen2.5-Plus-Base (MoE). These are direct predecessors, showing generational improvements.
- DeepSeek-V3 Base (Liu et al., 2024a): A large open-source MoE model, representing a strong competitor in terms of scale and architecture.
- Gemma-3 Base Models (Team et al., 2025): Gemma-3-1B, Gemma-3-4B, Gemma-3-12B, Gemma-3-27B. Google's open-source series.
- Llama-3 Base Models (Dubey et al., 2024): Llama-3-8B. Meta's popular open-source series.
- Llama-4 Base Models (Meta-AI, 2025): Llama-4-Maverick, Llama-4-Scout. Next-generation open-source models from Meta, often larger and more capable.
Post-trained (Instruction-tuned) Model Baselines:
- Proprietary Models:
  - OpenAI-o1 (OpenAI, 2024): A reasoning-focused model from OpenAI.
  - GPT-4o-2024-11-20 (OpenAI, 2024): OpenAI's flagship multimodal model.
  - Gemini2.5-Pro (DeepMind, 2025): Google DeepMind's powerful model.
  - Grok-3-Beta (Think) (xAI, 2025): A reasoning-focused model from xAI.
  - OpenAI-o3-mini (medium) (OpenAI, 2025): A smaller, reasoning-focused model from OpenAI.
  - GPT-4o-mini-2024-07-18: A smaller version of GPT-4o.
- Open-Source and Qwen Predecessors:
  - DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model.
  - DeepSeek-V3 (Liu et al., 2024a).
  - Qwen2.5-72B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-1.5B-Instruct: Instruction-tuned versions of Qwen2.5 models.
  - QwQ-32B (Qwen Team, 2025): Qwen's previous strongest reasoning model.
  - LLaMA-4-Maverick (Meta-AI, 2025), LLaMA-4-Scout (Meta-AI, 2025).
  - LLaMA-3.1-8B-Instruct (Dubey et al., 2024).
  - Gemma-3-27B-IT, Gemma-3-12B-IT, Gemma-3-4B-IT, Gemma-3-1B-IT: Instruction-tuned versions of Gemma-3 models.
  - Phi-4 (Abdin et al., 2024), Phi-4-mini.
  - DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Llama-8B: Distilled versions of DeepSeek-R1, used as baselines for strong-to-weak distillation.
5.4. Hyperparameters
The following hyperparameters are used for evaluation:
- For Thinking Mode (Qwen3 models): Sampling Temperature 0.6, Top-p 0.95, Top-k 20, Presence Penalty 1.5 (the presence penalty is applied specifically for Creative Writing v3 and WritingBench to encourage diverse content).
- For Non-Thinking Mode (Qwen3 models): Sampling Temperature 0.7, Top-p 0.8, Top-k 20, Presence Penalty 1.5.
- Maximum Output Length: 32,768 tokens, except for AIME'24 and AIME'25, where it is extended to 38,912 tokens to provide sufficient thinking space for complex math problems.
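A small sketch of how the reported decoding settings might be bundled per mode, with the soft-switch flags selecting between them. The dictionary keys and the helper function are generic assumptions, not a specific inference framework's API.

```python
# Decoding settings as reported above; presence_penalty 1.5 is additionally
# used for the creative-writing benchmarks (see the hyperparameter list).
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}

def decoding_config(user_query: str, default_thinking: bool = True) -> dict:
    """Pick a sampling configuration from the soft-switch flags in the query."""
    thinking = default_thinking
    if "/no_think" in user_query:
        thinking = False
    elif "/think" in user_query:
        thinking = True
    cfg = dict(THINKING if thinking else NON_THINKING)
    cfg["max_tokens"] = 32_768   # 38,912 for AIME'24 / AIME'25 per the report
    return cfg

if __name__ == "__main__":
    print(decoding_config("Prove the inequality. /think"))
    print(decoding_config("What's the capital of France? /no_think"))
```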
6. Results & Analysis
The experimental results are presented for both pre-trained (base) and post-trained (instruction-tuned) models, covering a wide range of benchmarks and comparing Qwen3 against various state-of-the-art baselines.
6.1. Pre-training Evaluation
The evaluation of base models focuses on general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.
Summary of Evaluation Results for Qwen3 Base Models:
- Flagship Model Performance: Qwen3-235B-A22B-Base outperforms most previously open-sourced SOTA dense and MoE base models (e.g., DeepSeek-V3 Base, Llama-4-Maverick Base, Qwen2.5-72B-Base) across most tasks, often with significantly fewer total or activated parameters.
- MoE Efficiency: Qwen3 MoE base models (Qwen3-30B-A3B-Base, Qwen3-235B-A22B-Base) achieve performance similar to Qwen3 dense base models with only 1/5 of the activated parameters, demonstrating high efficiency. They also outperform Qwen2.5 MoE models with fewer activated parameters. Notably, a Qwen3 MoE base model can achieve performance comparable to a Qwen2.5 dense base model with 1/10 of the activated parameters, indicating significant inference and training cost advantages.
- Dense Model Improvements: Qwen3 dense base models show performance comparable to Qwen2.5 base models of higher parameter scales. Specifically, smaller Qwen3 dense models (1.7B, 4B, 8B, 14B, 32B) often surpass or match larger Qwen2.5 counterparts (3B, 7B, 14B, 32B, 72B), especially on STEM, coding, and reasoning benchmarks.
6.1.1. Qwen3-235B-A22B-Base
The following are the results from Table 3 of the original paper:
| | Qwen2.5-72B Base | Qwen2.5-Plus Base | Llama-4-Maverick Base | DeepSeek-V3 Base | Qwen3-235B-A22B Base |
| Architecture | Dense | MoE | MoE | MoE | MoE |
| # Total Params | 72B | 271B | 402B | 671B | 235B |
| # Activated Params | 72B | 37B | 17B | 37B | 22B |
| General Tasks | |||||
| MMLU | 86.06 | 85.02 | 85.16 | 87.19 | 87.81 |
| MMLU-Redux | 83.91 | 82.69 | 84.05 | 86.14 | 87.40 |
| MMLU-Pro | 58.07 | 63.52 | 63.91 | 59.84 | 68.18 |
| SuperGPQA | 36.20 | 37.18 | 40.85 | 41.53 | 44.06 |
| BBH | 86.30 | 85.60 | 83.62 | 86.22 | 88.87 |
| Math & STEM Tasks | |||||
| GPQA | 45.88 | 41.92 | 43.94 | 41.92 | 47.47 |
| GSM8K | 91.50 | 91.89 | 87.72 | 87.57 | 94.39 |
| MATH | 62.12 | 62.78 | 63.32 | 62.62 | 71.84 |
| Coding Tasks | |||||
| EvalPlus | 65.93 | 61.43 | 68.38 | 63.75 | 77.60 |
| MultiPL-E | 58.70 | 62.16 | 57.28 | 62.26 | 65.94 |
| MBPP | 76.00 | 74.60 | 75.40 | 74.20 | 81.40 |
| CRUX-O | 66.20 | 68.50 | 77.00 | 76.60 | 79.00 |
| Multilingual Tasks | |||||
| MGSM | 82.40 | 82.21 | 79.69 | 82.68 | 83.53 |
| MMMLU | 84.40 | 83.49 | 83.09 | 85.88 | 86.70 |
| INCLUDE | 69.05 | 66.97 | 73.47 | 75.17 | 73.46 |
Analysis of Qwen3-235B-A22B-Base:
- Overall Dominance: Qwen3-235B-A22B-Base achieves the highest scores in most benchmarks, significantly outperforming competitors.
- Vs. Llama-4-Maverick: Despite Llama-4-Maverick having roughly twice the total parameters, Qwen3-235B-A22B-Base performs better on most benchmarks, indicating superior architectural or training efficiency.
- Vs. DeepSeek-V3: Qwen3-235B-A22B-Base outperforms DeepSeek-V3-Base on 14 out of 15 benchmarks with only about 1/3 of its total parameters and roughly 2/3 of its activated parameters, demonstrating impressive power and cost-effectiveness.
- Vs. Qwen2.5-Plus (MoE): Qwen3-235B-A22B-Base significantly outperforms its previous MoE counterpart, Qwen2.5-Plus, with fewer total and activated parameters, highlighting advancements in pre-training data, strategy, and architecture.
- Vs. Qwen2.5-72B (Dense): The Qwen3 MoE flagship surpasses Qwen2.5-72B-Base (dense) on all benchmarks while activating less than 1/3 of the parameters, translating to much cheaper inference and training costs per trillion tokens.
6.1.2. Qwen3-32B-Base
The following are the results from Table 4 of the original paper:
| | Qwen2.5-32B Base | Qwen2.5-72B Base | Gemma-3-27B Base | Llama-4-Scout Base | Qwen3-32B Base |
| Architecture | Dense | Dense | Dense | MoE | Dense |
| # Total Params | 32B | 72B | 27B | 109B | 32B |
| # Activated Params | 32B | 72B | 27B | 17B | 32B |
| General Tasks | |||||
| MMLU | 83.32 | 86.06 | 78.69 | 78.27 | 83.61 |
| MMLU-Redux | 81.97 | 83.91 | 76.53 | 71.09 | 83.41 |
| MMLU-Pro | 55.10 | 58.07 | 52.88 | 56.13 | 65.54 |
| SuperGPQA | 33.55 | 36.20 | 29.87 | 26.51 | 39.78 |
| BBH | 84.48 | 86.30 | 79.95 | 82.40 | 87.38 |
| Math & STEM Tasks | |||||
| GPQA | 47.97 | 45.88 | 26.26 | 40.40 | 49.49 |
| GSM8K | 92.87 | 91.50 | 81.20 | 85.37 | 93.40 |
| MATH | 57.70 | 62.12 | 51.78 | 51.66 | 61.62 |
| Coding Tasks | |||||
| EvalPlus | 66.25 | 65.93 | 55.78 | 59.90 | 72.05 |
| MultiPL-E | 58.30 | 58.70 | 45.03 | 47.38 | 67.06 |
| MBPP | 73.60 | 76.00 | 68.40 | 68.60 | 78.20 |
| CRUX-O | 67.80 | 66.20 | 60.00 | 61.90 | 72.50 |
| Multilingual Tasks | |||||
| MGSM | 78.12 | 82.40 | 73.74 | 79.93 | 83.06 |
| MMMLU | 82.40 | 84.40 | 77.62 | 74.83 | 83.83 |
| INCLUDE | 64.35 | 69.05 | 68.94 | 68.09 | 67.87 |
Analysis of Qwen3-32B-Base:
- Vs. Similar-sized Models: Qwen3-32B-Base outperforms Qwen2.5-32B-Base and Gemma-3-27B-Base on most benchmarks, with particularly large leads on MMLU-Pro, SuperGPQA, and the coding benchmarks.
- Vs. Larger Predecessor: Surprisingly, Qwen3-32B-Base, with less than half the parameters of Qwen2.5-72B-Base, outperforms it on 10 out of 15 benchmarks, especially in coding, mathematics, and reasoning, indicating substantial gains in capability per parameter.
- Vs. MoE Baseline: Qwen3-32B-Base outperforms Llama-4-Scout-Base on all 15 benchmarks, despite the latter having more than three times the total parameters (109B vs. 32B); note, however, that Llama-4-Scout activates only 17B parameters per token, fewer than Qwen3-32B-Base's 32B.
6.1.3. Qwen3-14B-Base & Qwen3-30B-A3B-Base
The following are the results from Table 5 of the original paper:
| | Gemma-3-12B Base | Qwen2.5-14B Base | Qwen2.5-32B Base | Qwen2.5-Turbo Base | Qwen3-14B Base | Qwen3-30B-A3B Base |
| Architecture | Dense | Dense | Dense | MoE | Dense | MoE | |
| # Total Params | 12B | 14B | 32B | 42B | 14B | 30B | |
| # Activated Params | 12B | 14B | 32B | 6B | 14B | 3B | |
| General Tasks | |||||||
| MMLU | 73.87 | 79.66 | 83.32 | 79.50 | 81.05 | 81.38 | |
| MMLU-Redux | 70.70 | 76.64 | 81.97 | 77.11 | 79.88 | 81.17 | |
| MMLU-Pro | 44.91 | 51.16 | 55.10 | 55.60 | 61.03 | 61.49 | |
| SuperGPQA | 24.61 | 30.68 | 33.55 | 31.19 | 34.27 | 35.72 | |
| BBH | 74.28 | 78.18 | 84.48 | 76.10 | 81.07 | 81.54 | |
| Math & STEM Tasks | |||||||
| GPQA | 31.31 | 32.83 | 47.97 | 41.41 | 39.90 | 43.94 | |
| GSM8K | 78.01 | 90.22 | 92.87 | 88.32 | 92.49 | 91.81 | |
| MATH | 44.43 | 55.64 | 57.70 | 55.60 | 62.02 | 59.04 | |
| Coding Tasks | |||||||
| EvalPlus | 52.65 | 60.70 | 66.25 | 61.23 | 72.23 | 71.45 | |
| MultiPL-E | 43.03 | 54.79 | 58.30 | 53.24 | 61.69 | 66.53 | |
| MBPP | 60.60 | 69.00 | 73.60 | 67.60 | 73.40 | 74.40 | |
| CRUX-O | 52.00 | 61.10 | 67.80 | 60.20 | 68.60 | 67.20 | |
| Multilingual Tasks | |||||||
| MGSM | 64.35 | 74.68 | 78.12 | 70.45 | 79.20 | 79.11 | |
| MMMLU | 72.50 | 78.34 | 82.40 | 79.76 | 79.69 | 81.46 | |
| INCLUDE | 63.34 | 60.26 | 64.35 | 59.25 | 64.55 | 67.00 | |
Analysis of Qwen3-14B-Base & Qwen3-30B-A3B-Base:
- Qwen3-14B-Base Superiority: Qwen3-14B-Base significantly outperforms Gemma-3-12B-Base and Qwen2.5-14B-Base on all 15 benchmarks, and remains highly competitive with the much larger Qwen2.5-32B-Base despite having less than half its parameters.
- Qwen3-30B-A3B-Base Efficiency: Qwen3-30B-A3B-Base, an MoE model activating only 3 billion parameters (roughly 1/5 of Qwen2.5-14B-Base's activated parameters), significantly outperforms Qwen2.5-14B-Base on all tasks. It also achieves performance comparable to the larger Qwen3-14B-Base and Qwen2.5-32B-Base, highlighting the substantial inference and training cost advantages of its efficient MoE architecture. A rough illustration of the activated-parameter arithmetic is sketched below.
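To make the activated-versus-total parameter distinction concrete, here is a rough back-of-the-envelope sketch; the expert counts and per-expert sizes are hypothetical placeholders chosen only to reproduce the roughly 3B-activated / 30B-total ratio, not Qwen3's real configuration:

```python
# Illustrative sketch of why an MoE model is cheap to run per token:
# it touches the shared (non-expert) weights plus only the routed experts.

def activated_params(shared_params: float, experts_per_token: int,
                     params_per_expert: float) -> float:
    """Parameters used per token: shared weights plus the routed experts only."""
    return shared_params + experts_per_token * params_per_expert

def total_params(shared_params: float, num_experts: int,
                 params_per_expert: float) -> float:
    """Total stored parameters include every expert, activated or not."""
    return shared_params + num_experts * params_per_expert

if __name__ == "__main__":
    # Hypothetical numbers chosen so total is about 30B and activated about 3B.
    shared, n_experts, top_k, per_expert = 1.0e9, 128, 8, 0.23e9
    act = activated_params(shared, top_k, per_expert)
    tot = total_params(shared, n_experts, per_expert)
    print(f"activated ~ {act/1e9:.1f}B, total ~ {tot/1e9:.1f}B, ratio ~ {act/tot:.2f}")
```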
6.1.4. Qwen3-8B / 4B / 1.7B / 0.6B-Base
The following are the results from Table 6 of the original paper:
| | Llama-3-8B Base | Qwen2.5-7B Base | Qwen2.5-14B Base | Qwen3-8B Base |
| Architecture | Dense | Dense | Dense | Dense |
| # Total Params | 8B | 7B | 14B | 8B |
| # Activated Params | 8B | 7B | 14B | 8B |
| General Tasks | ||||
| MMLU | 66.60 | 74.16 | 79.66 | 76.89 |
| MMLU-Redux | 61.59 | 71.06 | 76.64 | 76.17 |
| MMLU-Pro | 35.36 | 45.00 | 51.16 | 56.73 |
| SuperGPQA | 20.54 | 26.34 | 30.68 | 31.64 |
| BBH | 57.70 | 70.40 | 78.18 | 78.40 |
| Math & STEM Tasks | ||||
| GPQA | 25.80 | 36.36 | 32.83 | 44.44 |
| GSM8K | 55.30 | 85.36 | 90.22 | 89.84 |
| MATH | 20.50 | 49.80 | 55.64 | 60.80 |
| Coding Tasks | ||||
| EvalPlus | 44.13 | 62.18 | 60.70 | 67.65 |
| MultiPL-E | 31.45 | 50.73 | 54.79 | 58.75 |
| MBPP | 48.40 | 63.40 | 69.00 | 69.80 |
| CRUX-O | 36.80 | 48.50 | 61.10 | 62.00 |
| Multilingual Tasks | ||||
| MGSM | 38.92 | 63.60 | 74.68 | 76.02 |
| MMMLU | 59.65 | 71.34 | 78.34 | 75.72 |
| INCLUDE | 44.94 | 53.98 | 60.26 | 59.40 |
The following are the results from Table 7 of the original paper:
| | Gemma-3-4B Base | Qwen2.5-3B Base | Qwen2.5-7B Base | Qwen3-4B Base |
| Architecture | Dense | Dense | Dense | Dense |
| # Total Params | 4B | 3B | 7B | 4B |
| # Activated Params | 4B | 3B | 7B | 4B |
| General Tasks | ||||
| MMLU | 59.51 | 65.62 | 74.16 | 72.99 |
| MMLU-Redux | 56.91 | 63.68 | 71.06 | 72.79 |
| MMLU-Pro | 29.23 | 34.61 | 45.00 | 50.58 |
| SuperGPQA | 17.68 | 20.31 | 26.34 | 28.43 |
| BBH | 51.70 | 56.30 | 70.40 | 72.59 |
| Math & STEM Tasks | ||||
| GPQA | 24.24 | 26.26 | 36.36 | 36.87 |
| GSM8K | 43.97 | 79.08 | 85.36 | 87.79 |
| MATH | 26.10 | 42.64 | 49.80 | 54.10 |
| Coding Tasks | ||||
| EvalPlus | 43.23 | 46.28 | 62.18 | 63.53 |
| MultiPL-E | 28.06 | 39.65 | 50.73 | 53.13 |
| MBPP | 46.40 | 54.60 | 63.40 | 67.00 |
| CRUX-O | 34.00 | 36.50 | 48.50 | 55.00 |
| Multilingual Tasks | ||||
| MGSM | 33.11 | 47.53 | 63.60 | 67.74 |
| MMMLU | 59.62 | 65.55 | 71.34 | 71.42 |
| INCLUDE | 49.06 | 45.90 | 53.98 | 56.29 |
The following are the results from Table 8 of the original paper:
| | Qwen2.5-0.5B Base | Qwen3-0.6B Base | Gemma-3-1B Base | Qwen2.5-1.5B Base | Qwen3-1.7B Base |
| Architecture | Dense | Dense | Dense | Dense | Dense |
| # Total Params | 0.5B | 0.6B | 1B | 1.5B | 1.7B |
| # Activated Params | 0.5B | 0.6B | 1B | 1.5B | 1.7B |
| General Tasks | |||||
| MMLU | 47.50 | 52.81 | 26.26 | 60.90 | 62.63 |
| MMLU-Redux | 45.10 | 51.26 | 25.99 | 58.46 | 61.66 |
| MMLU-Pro | 15.69 | 24.74 | 9.72 | 28.53 | 36.76 |
| SuperGPQA | 11.30 | 15.03 | 7.19 | 17.64 | 20.92 |
| BBH | 20.30 | 41.47 | 28.13 | 45.10 | 54.47 |
| Math & STEM Tasks | |||||
| GPQA | 24.75 | 26.77 | 24.75 | 24.24 | 28.28 |
| GSM8K | 41.62 | 59.59 | 2.20 | 68.54 | 75.44 |
| MATH | 19.48 | 32.44 | 3.66 | 35.00 | 43.50 |
| Coding Tasks | |||||
| EvalPlus | 31.85 | 36.23 | 8.98 | 44.80 | 52.70 |
| MultiPL-E | 18.70 | 24.58 | 5.15 | 33.10 | 42.71 |
| MBPP | 29.80 | 36.60 | 9.20 | 43.60 | 55.40 |
| CRUX-O | 12.10 | 27.00 | 3.80 | 29.60 | 36.40 |
| Multilingual Tasks | |||||
| MGSM | 12.07 | 30.99 | 1.74 | 32.82 | 50.71 |
| MMMLU | 31.53 | 50.16 | 26.57 | 60.27 | 63.27 |
| INCLUDE | 24.74 | 34.26 | 25.62 | 39.55 | 45.57 |
Analysis of Smaller Qwen3 Base Models:
- Consistent Strong Performance: The Qwen3-8B, 4B, 1.7B, and 0.6B base models consistently deliver strong performance across nearly all benchmarks relative to their size.
- Outperforming Larger Qwen2.5 Models: Notably, the Qwen3-8B, 4B, and 1.7B base models even outperform the larger Qwen2.5-14B, 7B, and 3B base models, respectively, on more than half of the benchmarks, most visibly on STEM-related and coding benchmarks, reflecting a significant generational improvement.
6.2. Post-training Evaluation
The post-trained models are evaluated for their instruction-following, alignment, reasoning, agent, coding, and multilingual abilities under both thinking and non-thinking modes.
Summary of Evaluation Results for Finalized Qwen3 Models:
- Flagship SOTA: Qwen3-235B-A22B achieves state-of-the-art overall performance among open-source models in both thinking and non-thinking modes, surpassing strong baselines like DeepSeek-R1 and DeepSeek-V3. It is also strongly competitive with closed-source leaders such as OpenAI-o1, Gemini2.5-Pro, and GPT-4o.
- Flagship Dense Model (32B) Excellence: Qwen3-32B outperforms the previous strongest reasoning model, QwQ-32B, on most benchmarks, setting a new SOTA for its size. It competes comparably with the closed-source OpenAI-o3-mini and excels in non-thinking mode, surpassing Qwen2.5-72B-Instruct.
- Lightweight Model Success: The lightweight models (Qwen3-30B-A3B, Qwen3-14B, and the smaller dense models) consistently outperform open-source models with similar or larger parameter counts, validating the effectiveness of the Strong-to-Weak Distillation approach.
6.2.1. Qwen3-235B-A22B
The following are the results from Table 11 of the original paper:
| OpenAI-o1 | DeepSeek-R1 | Grok-3-Beta (Think) | Gemini2.5-Pro | Qwen3-235B-A22B | ||
| Architecture | MoE | MoE | ||||
| # Activated Params | 37B | - | 22B | |||
| # Total Params | - | 671B | - | 235B | ||
| MMLU-Redux | 92.8 | 92.9 | 93.7 | 92.7 | ||
| General Tasks | GPQA-Diamond | 78.0 | 71.5 | 80.2 | 84.0 | 71.1 |
| C-Eval | 85.5 | 91.8 | 82.9 | 89.6 | ||
| LiveBench 2024-11-25 | 75.7 | 71.6 | - | 82.4 | 77.1 | |
| IFEval strict prompt | 92.6 | 83.3 | - | 89.5 | 83.4 | |
| Alignment Tasks | Arena-Hard | 92.1 | 92.3 | 96.4 | 95.6 | |
| AlignBench v1.1 | 8.86 | 8.76 | 9.03 | 8.94 | ||
| Creative Writing v3 | 81.7 | 85.5 | 86.0 | 84.6 | ||
| WritingBench | 7.69 | 7.71 | 8.09 | 8.03 | ||
| MATH-500 | 96.4 | 97.3 | 98.8 | 98.0 | ||
| Math & Text Reasoning | AIME'24 | 74.3 | 79.8 | 83.9 | 92.0 | 85.7 |
| AIME'25 | 79.2 | 70.0 | 77.3 | 86.7 | 81.5 | |
| ZebraLogic | 81.0 | 78.7 | - | 87.4 | 80.3 | |
| AutoLogi | 79.8 | 86.1 | - | 85.4 | 89.0 | |
| BFCL v3 | 67.8 | 56.9 | - | 62.9 | 70.8 | |
| Agent & Coding | LiveCodeBench v5 | 63.9 | 64.3 | 70.6 | 70.4 | 70.7 |
| CodeForces (Rating / Percentile) | 1891 / 96.7% | 2029 / 98.1% | - | 2001 / 97.9% | 2056 / 98.2% | |
| Multi-IF | 48.8 | 67.7 | 77.8 | 71.9 | ||
| INCLUDE | 84.6 | 82.7 | 85.1 | 78.7 | ||
| Multilingual Tasks | MMMLU 14 languages | 88.4 | 86.4 | 86.9 | 84.3 | |
| MT-AIME2024 | 67.4 | 73.5 | 76.9 | 80.8 | ||
| PolyMath | 38.9 | 47.1 | 52.2 | 54.7 | ||
| MLogiQA | 75.5 | 73.8 | 75.6 | 77.1 | ||
Analysis of Qwen3-235B-A22B (Thinking Mode):
- Open-Source Leader: Qwen3-235B-A22B (Thinking) outperforms DeepSeek-R1 on 17/23 benchmarks despite having fewer activated parameters (22B vs. 37B) and far fewer total parameters (235B vs. 671B). Its performance is particularly strong on reasoning-heavy tasks such as mathematics, agent, and coding benchmarks.
- Competitiveness with Proprietary Models: It is highly competitive with closed-source models such as OpenAI-o1, Grok-3-Beta (Think), and Gemini2.5-Pro, substantially narrowing the performance gap in reasoning capabilities. For instance, it achieves the highest CodeForces rating (2056 / 98.2%).

The following are the results from Table 12 of the original paper:
| | GPT-4o-2024-11-20 | DeepSeek-V3 | Qwen2.5-72B-Instruct | LLaMA-4-Maverick | Qwen3-235B-A22B |
| Architecture | - | MoE | Dense | MoE | MoE |
| # Activated Params | - | 37B | 72B | 17B | 22B |
| # Total Params | - | 671B | 72B | 402B | 235B |
| General Tasks ||||||
| MMLU-Redux | 87.0 | 89.1 | 86.8 | 91.8 | 89.2 |
| GPQA-Diamond | 46.0 | 59.1 | 49.0 | 69.8 | 62.9 |
| C-Eval | 75.5 | 86.5 | 84.7 | 83.5 | 86.1 |
| LiveBench 2024-11-25 | 52.2 | 60.5 | 51.4 | 59.5 | 62.5 |
| Alignment Tasks ||||||
| IFEval strict prompt | 86.5 | 86.1 | 84.1 | 86.7 | 83.2 |
| Arena-Hard | 85.3 | 85.5 | 81.2 | 82.7 | 96.1 |
| AlignBench v1.1 | 8.42 | 8.64 | 7.89 | 7.97 | 8.91 |
| Creative Writing v3 | 81.1 | 74.0 | 61.8 | 61.3 | 80.4 |
| WritingBench | 7.11 | 6.49 | 7.06 | 5.46 | 7.70 |
| Math & Text Reasoning ||||||
| MATH-500 | 77.2 | 90.2 | 83.6 | 90.6 | 91.2 |
| AIME'24 | 11.1 | 39.2 | 18.9 | 38.5 | 40.1 |
| AIME'25 | 7.6 | 28.8 | 15.0 | 15.9 | 24.7 |
| ZebraLogic | 27.4 | 42.1 | 26.6 | 40.0 | 37.7 |
| AutoLogi | 65.9 | 76.1 | 66.1 | 75.2 | 83.3 |
| Agent & Coding ||||||
| BFCL v3 | 72.5 | 57.6 | 63.4 | 52.9 | 68.0 |
| LiveCodeBench v5 | 32.7 | 33.1 | 30.7 | 37.2 | 35.3 |
| CodeForces (Rating / Percentile) | 864 / 35.4% | 1134 / 54.1% | 859 / 35.0% | 712 / 24.3% | 1387 / 75.7% |
| Multilingual Tasks ||||||
| Multi-IF | 65.6 | 55.6 | 65.3 | 75.5 | 70.2 |
| INCLUDE | 78.8 | 76.7 | 69.6 | 80.9 | 75.6 |
| MMMLU 14 languages | 80.3 | 81.1 | 76.9 | 82.5 | 79.8 |
| MT-AIME2024 | 9.2 | 20.9 | 12.7 | 27.0 | 32.4 |
| PolyMath | 13.7 | 20.4 | 16.9 | 26.1 | 27.0 |
| MLogiQA | 57.4 | 58.9 | 59.3 | 59.9 | 67.6 |
Analysis of Qwen3-235B-A22B (Non-thinking Mode):
- Superiority over Open-Source: Qwen3-235B-A22B (Non-thinking) exceeds other leading open-source models such as DeepSeek-V3, LLaMA-4-Maverick, and Qwen2.5-72B-Instruct.
- Outperforms GPT-4o-2024-11-20: It surpasses the closed-source GPT-4o-2024-11-20 on 18/23 benchmarks, indicating strong inherent capabilities even without explicit reasoning steps and robust general performance for rapid responses.
6.2.2. Qwen3-32B
The following are the results from Table 13 of the original paper:
| | DeepSeek-R1-Distill-Llama-70B | QwQ-32B | OpenAI-o3-mini (medium) | Qwen3-32B |
| Architecture | Dense | Dense | - | Dense |
| # Activated Params | 70B | 32B | - | 32B |
| # Total Params | 70B | 32B | - | 32B |
| General Tasks |||||
| MMLU-Redux | 89.3 | 90.0 | 90.0 | 90.9 |
| GPQA-Diamond | 65.2 | 65.6 | 76.8 | 68.4 |
| C-Eval | 71.8 | 88.4 | 75.1 | 87.3 |
| LiveBench 2024-11-25 | 54.5 | 72.0 | 70.0 | 74.9 |
| Alignment Tasks |||||
| IFEval strict prompt | 79.3 | 83.9 | 91.5 | 85.0 |
| Arena-Hard | 60.6 | 89.5 | 89.0 | 93.8 |
| AlignBench v1.1 | 6.74 | 8.70 | 8.38 | 8.72 |
| Creative Writing v3 | 62.1 | 82.4 | 74.8 | 81.0 |
| WritingBench | 6.08 | 7.86 | 7.52 | 7.90 |
| Math & Text Reasoning |||||
| MATH-500 | 94.5 | 98.0 | 98.0 | 97.2 |
| AIME'24 | 70.0 | 79.5 | 79.6 | 81.4 |
| AIME'25 | 56.3 | 69.5 | 74.8 | 72.9 |
| ZebraLogic | 71.3 | 76.8 | 88.9 | 88.8 |
| AutoLogi | 83.5 | 88.1 | 86.3 | 87.3 |
| Agent & Coding |||||
| BFCL v3 | 49.3 | 66.4 | 64.6 | 70.3 |
| LiveCodeBench v5 | 54.5 | 62.7 | 66.3 | 65.7 |
| CodeForces (Rating / Percentile) | 1633 / 91.4% | 1982 / 97.7% | 2036 / 98.1% | 1977 / 97.7% |
| Multilingual Tasks |||||
| Multi-IF | 57.6 | 68.3 | 48.4 | 73.0 |
| INCLUDE | 62.1 | 69.7 | 73.1 | 73.7 |
| MMMLU 14 languages | 69.6 | 80.9 | 79.3 | 80.6 |
| MT-AIME2024 | 29.3 | 68.0 | 73.9 | 75.0 |
| PolyMath | 29.4 | 45.9 | 38.6 | 47.4 |
| MLogiQA | 60.3 | 75.5 | 71.1 | 76.3 |
Analysis of Qwen3-32B (Thinking Mode):
- New SOTA at 32B: Qwen3-32B (Thinking) outperforms QwQ-32B on 17/23 benchmarks, establishing it as the new state-of-the-art reasoning model at this size.
- Competitiveness with Proprietary Models: It competes well with the closed-source OpenAI-o3-mini (medium), particularly excelling in alignment and multilingual performance.

The following are the results from Table 14 of the original paper:
| | GPT-4o-mini-2024-07-18 | LLaMA-4-Scout | Qwen2.5-72B-Instruct | Qwen3-32B |
| Architecture | - | MoE | Dense | Dense |
| # Activated Params | - | 17B | 72B | 32B |
| # Total Params | - | 109B | 72B | 32B |
| General Tasks |||||
| MMLU-Redux | 81.5 | 86.3 | 86.8 | 85.7 |
| GPQA-Diamond | 40.2 | 57.2 | 49.0 | 54.6 |
| C-Eval | 66.3 | 78.2 | 84.7 | 83.3 |
| LiveBench 2024-11-25 | 41.3 | 47.6 | 51.4 | 59.8 |
| Alignment Tasks |||||
| IFEval strict prompt | 80.4 | 84.7 | 84.1 | 83.2 |
| Arena-Hard | 74.9 | 70.5 | 81.2 | 92.8 |
| AlignBench v1.1 | 7.81 | 7.49 | 7.89 | 8.58 |
| Creative Writing v3 | 70.3 | 55.0 | 61.8 | 78.3 |
| WritingBench | 5.98 | 5.49 | 7.06 | 7.54 |
| Math & Text Reasoning |||||
| MATH-500 | 78.2 | 82.6 | 83.6 | 88.6 |
| AIME'24 | 8.1 | 28.6 | 18.9 | 31.0 |
| AIME'25 | 8.8 | 10.0 | 15.0 | 20.2 |
| ZebraLogic | 20.1 | 24.2 | 26.6 | 29.2 |
| AutoLogi | 52.6 | 56.8 | 66.1 | 78.5 |
| Agent & Coding |||||
| BFCL v3 | 64.0 | 45.4 | 63.4 | 63.0 |
| LiveCodeBench v5 | 27.9 | 29.8 | 30.7 | 31.3 |
| CodeForces (Rating / Percentile) | 1113 / 52.6% | 981 / 43.7% | 859 / 35.0% | 1353 / 71.0% |
| Multilingual Tasks |||||
| Multi-IF | 62.4 | 64.2 | 65.3 | 70.7 |
| INCLUDE | 66.0 | 74.1 | 69.6 | 70.9 |
| MMMLU 14 languages | 72.1 | 77.5 | 76.9 | 76.5 |
| MT-AIME2024 | 6.0 | 19.1 | 12.7 | 24.1 |
| PolyMath | 12.0 | 20.9 | 16.9 | 22.5 |
| MLogiQA | 42.6 | 53.9 | 59.3 | 62.9 |
Analysis of Qwen3-32B (Non-thinking Mode):
- Superiority: Qwen3-32B (Non-thinking) exhibits superior performance to almost all baselines.
- Vs. Larger Predecessor: It performs on par with Qwen2.5-72B-Instruct on general tasks, with clear advantages on alignment, multilingual, and reasoning-related tasks, showcasing fundamental improvements over the Qwen2.5 series.
6.2.3. Qwen3-30B-A3B & Qwen3-14B
The following are the results from Table 15 of the original paper:
| | DeepSeek-R1-Distill-Qwen-32B | QwQ-32B | Qwen3-14B | Qwen3-30B-A3B |
| Architecture | Dense | Dense | Dense | MoE | |
| # Activated Params | 32B | 32B | 14B | 3B | |
| # Total Params | 32B | 32B | 14B | 30B | |
| General Tasks | MMLU-Redux | 88.2 | 90.0 | 88.6 | 89.5 |
| GPQA-Diamond | 62.1 | 65.6 | 64.0 | 65.8 | |
| C-Eval | 82.2 | 88.4 | 86.2 | 86.6 | |
| LiveBench 2024-11-25 | 45.6 | 72.0 | 71.3 | 74.3 | |
| Alignment Tasks | IFEval strict prompt | 72.5 | 83.9 | 85.4 | 86.5 |
| Arena-Hard | 60.8 | 89.5 | 91.7 | 91.0 | |
| AlignBench v1.1 | 7.25 | 8.70 | 8.56 | 8.70 | |
| Creative Writing v3 | 55.0 | 82.4 | 80.3 | 79.1 | |
| WritingBench | 6.13 | 7.86 | 7.80 | 7.70 | |
| Math & Text Reasoning | MATH-500 | 94.3 | 98.0 | 96.8 | 98.0 |
| AIME'24 | 72.6 | 79.5 | 79.3 | 80.4 | |
| AIME'25 | 49.6 | 69.5 | 70.4 | 70.9 | |
| ZebraLogic | 69.6 | 76.8 | 88.5 | 89.5 | |
| AutoLogi | 74.6 | 88.1 | 89.2 | 88.7 | |
| Agent & Coding | BFCL v3 | 53.5 | 66.4 | 70.4 | 69.1 |
| LiveCodeBench v5 | 54.5 | 62.7 | 63.5 | 62.6 | |
| CodeForces (Rating / Percentile) | 1691 / 93.4% | 1982 / 97.7% | 1766 / 95.3% | 1974 / 97.7% | |
| Multilingual Tasks | Multi-IF | 31.3 | 68.3 | 74.8 | 72.2 |
| INCLUDE | 68.0 | 69.7 | 71.7 | 71.9 | |
| MMMLU 14 languages | 78.6 | 80.9 | 77.9 | 78.4 | |
| MT-AIME2024 | 44.6 | 68.0 | 73.3 | 73.9 | |
| PolyMath | 35.1 | 45.9 | 45.8 | 46.1 | |
| MLogiQA | 63.3 | 75.5 | 71.1 | 70.1 | |
Analysis of Qwen3-30B-A3B / Qwen3-14B (Thinking Mode):
- Strong-to-Weak Distillation Success: Both Qwen3-30B-A3B and Qwen3-14B (Thinking) are highly competitive with QwQ-32B, especially on reasoning benchmarks.
- Efficiency of MoE Distillation: Qwen3-30B-A3B achieves performance comparable to QwQ-32B despite a smaller total size (30B vs. 32B) and far fewer activated parameters (3B vs. 32B), demonstrating the effectiveness of Strong-to-Weak Distillation in endowing lightweight MoE models with strong reasoning.

The following are the results from Table 16 of the original paper:
| | Phi-4 | Gemma-3-27B-IT | Qwen2.5-32B-Instruct | Qwen3-14B | Qwen3-30B-A3B |
| Architecture | Dense | Dense | Dense | Dense | MoE |
| # Activated Params | 14B | 27B | 32B | 14B | 3B |
| # Total Params | 14B | 27B | 32B | 14B | 30B |
| General Tasks ||||||
| MMLU-Redux | 85.3 | 82.6 | 83.9 | 82.0 | 84.1 |
| GPQA-Diamond | 56.1 | 42.4 | 49.5 | 54.8 | 54.8 |
| C-Eval | 66.9 | 66.6 | 80.6 | 81.0 | 82.9 |
| LiveBench 2024-11-25 | 41.6 | 49.2 | 50.0 | 59.6 | 59.4 |
| Alignment Tasks ||||||
| IFEval strict prompt | 62.1 | 80.6 | 79.5 | 84.8 | 83.7 |
| Arena-Hard | 75.4 | 86.8 | 74.5 | 86.3 | 88.0 |
| AlignBench v1.1 | 7.61 | 7.80 | 7.71 | 8.52 | 8.55 |
| Creative Writing v3 | 51.2 | 82.0 | 54.6 | 73.1 | 68.1 |
| WritingBench | 5.73 | 7.22 | 5.90 | 7.24 | 7.22 |
| Math & Text Reasoning ||||||
| MATH-500 | 80.8 | 90.0 | 84.6 | 90.0 | 89.8 |
| AIME'24 | 22.9 | 32.6 | 18.8 | 31.7 | 32.8 |
| AIME'25 | 17.3 | 24.0 | 12.8 | 23.3 | 21.6 |
| ZebraLogic | 32.3 | 24.6 | 26.1 | 33.0 | 33.2 |
| AutoLogi | 66.2 | 64.2 | 65.5 | 82.0 | 81.5 |
| Agent & Coding ||||||
| BFCL v3 | 47.0 | 59.1 | 62.8 | 61.5 | 58.6 |
| LiveCodeBench v5 | 25.2 | 26.9 | 26.4 | 29.0 | 29.8 |
| CodeForces (Rating / Percentile) | 1280 / 65.3% | 1063 / 49.3% | 903 / 38.2% | 1200 / 58.6% | 1267 / 64.1% |
| Multilingual Tasks ||||||
| Multi-IF | 49.5 | 69.8 | 63.2 | 72.9 | 70.8 |
| INCLUDE | 65.3 | 71.4 | 67.5 | 67.8 | 67.8 |
| MMMLU 14 languages | 74.7 | 76.1 | 74.2 | 72.6 | 73.8 |
| MT-AIME2024 | 13.1 | 23.0 | 15.3 | 23.2 | 24.6 |
| PolyMath | 17.4 | 20.3 | 18.3 | 22.0 | 23.3 |
| MLogiQA | 53.1 | 58.5 | 58.0 | 58.9 | 53.3 |
Analysis of Qwen3-30B-A3B / Qwen3-14B (Non-thinking Mode):
- Outperforming Baselines: Both models surpass non-reasoning baselines in most benchmarks.
- Efficiency: They exceed Qwen2.5-32B-Instruct with significantly fewer activated and total parameters, enabling more efficient and cost-effective deployment.
6.2.4. Qwen3-8B / 4B / 1.7B / 0.6B
The following are the results from Table 17 of the original paper:
| | DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-Distill-Qwen-32B | Qwen3-4B | Qwen3-8B |
| Architecture | Dense | Dense | Dense | Dense | |
| # Activated Params | 14B | 32B | 4B | 8B | |
| # Total Params | 14B | 32B | 4B | 8B | |
| General Tasks | MMLU-Redux | 84.1 | 88.2 | 83.7 | 87.5 |
| GPQA-Diamond | 59.1 | 62.1 | 55.9 | 62.0 | |
| C-Eval | 78.1 | 82.2 | 77.5 | 83.4 | |
| LiveBench 2024-11-25 | 52.3 | 45.6 | 63.6 | 67.1 | |
| Alignment Tasks | IFEval strict prompt | 72.6 | 72.5 | 81.9 | 85.0 |
| Arena-Hard | 48.0 | 60.8 | 76.6 | 85.8 | |
| AlignBench v1.1 | 7.43 | 7.25 | 8.30 | 8.46 | |
| Creative Writing v3 | 54.2 | 55.0 | 61.1 | 75.0 | |
| WritingBench | 6.03 | 6.13 | 7.35 | 7.59 | |
| Math & Text Reasoning | MATH-500 | 93.9 | 94.3 | 97.0 | 97.4 |
| AIME'24 | 69.7 | 72.6 | 73.8 | 76.0 | |
| AIME'25 | 44.5 | 49.6 | 65.6 | 67.3 | |
| ZebraLogic | 59.1 | 69.6 | 81.0 | 84.8 | |
| AutoLogi | 78.6 | 74.6 | 87.9 | 89.1 | |
| Agent & Coding | BFCL v3 | 49.5 | 53.5 | 65.9 | 68.1 |
| LiveCodeBench v5 | 45.5 | 54.5 | 54.2 | 57.5 | |
| CodeForces (Rating / Percentile) | 1574 / 89.1% | 1691 / 93.4% | 1671 / 92.8% | 1785 / 95.6% | |
| Multilingual Tasks | Multi-IF | 29.8 | 31.3 | 66.3 | 71.2 |
| INCLUDE | 59.7 | 68.0 | 61.8 | 67.8 | |
| MMMLU 14 languages | 73.8 | 78.6 | 69.8 | 74.4 | |
| MT-AIME2024 | 33.7 | 44.6 | 60.7 | 65.4 | |
| PolyMath | 28.6 | 35.1 | 40.0 | 42.7 | |
| MLogiQA | 53.6 | 63.3 | 65.9 | 69.0 | |
The following are the results from Table 18 of the original paper:
| | LLaMA-3.1-8B-Instruct | Gemma-3-12B-IT | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Qwen3-4B | Qwen3-8B |
| Architecture | Dense | Dense | Dense | Dense | Dense | Dense | |
| # Activated Params | 8B | 12B | 7B | 14B | 4B | 8B | |
| # Total Params | 8B | 12B | 7B | 14B | 4B | 8B | |
| General Tasks | MMLU-Redux | 61.7 | 77.8 | 75.4 | 80.0 | 77.3 | 79.5 |
| GPQA-Diamond | 32.8 | 40.9 | 36.4 | 45.5 | 41.7 | 39.3 | |
| C-Eval | 52.0 | 61.1 | 76.2 | 78.0 | 72.2 | 77.9 | |
| LiveBench 2024-11-25 | 26.0 | 43.7 | 34.9 | 42.2 | 48.4 | 53.5 | |
| Alignment Tasks | IFEval strict prompt | 75.0 | 80.2 | 71.2 | 81.0 | 81.2 | 83.0 |
| Arena-Hard | 30.1 | 82.6 | 52.0 | 68.3 | 66.2 | 79.6 | |
| AlignBench v1.1 | 6.01 | 7.77 | 7.27 | 7.67 | 8.10 | 8.38 | |
| Creative Writing v3 | 52.8 | 79.9 | 49.8 | 55.8 | 53.6 | 64.5 | |
| WritingBench | 4.57 | 7.05 | 5.82 | 5.93 | 6.85 | 7.15 | |
| Math & Text Reasoning | MATH-500 | 54.8 | 85.6 | 77.6 | 83.4 | 84.8 | 87.4 |
| AIME'24 | 6.3 | 22.4 | 9.1 | 15.2 | 25.0 | 29.1 | |
| AIME'25 | 2.7 | 18.8 | 12.1 | 13.6 | 19.1 | 20.9 | |
| ZebraLogic | 12.8 | 58.9 | 12.0 | 19.7 | 35.2 | 26.7 | |
| AutoLogi | 30.9 | 76.3 | 42.9 | 57.4 | 76.3 | 76.5 | |
| Agent & Coding | BFCL v3 | 49.6 | 50.6 | 55.8 | 58.7 | 57.6 | 60.2 |
| LiveCodeBench v5 | 10.8 | 25.7 | 14.4 | 21.9 | 21.3 | 22.8 | |
| CodeForces (Rating / Percentile) | 473 / 14.9% | 462 / 14.7% | 191 / 0.0% | 904 / 38.3% | 842 / 33.7% | 1110 / 52.4% | |
| Multilingual Tasks | Multi-IF | 52.1 | 65.6 | 47.7 | 55.5 | 61.3 | 69.2 |
| INCLUDE | 34.0 | 65.3 | 53.6 | 63.5 | 53.8 | 62.5 | |
| MMMLU 14 languages | 44.4 | 70.0 | 61.4 | 70.3 | 61.7 | 66.9 | |
| MT-AIME2024 | 0.4 | 16.7 | 5.5 | 8.5 | 13.9 | 16.6 | |
| PolyMath | 5.8 | 17.6 | 11.9 | 15.0 | 16.6 | 18.8 | |
| MLogiQA | 41.9 | 54.5 | 49.5 | 51.3 | 49.9 | 51.4 | |
The following are the results from Table 19 of the original paper:
| | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | Qwen3-0.6B | Qwen3-1.7B |
| Architecture | Dense | Dense | Dense | Dense | |
| # Activated Params | 1.5B | 8B | 0.6B | 1.7B | |
| # Total Params | 1.5B | 8B | 0.6B | 1.7B | |
| General Tasks | MMLU-Redux | 45.4 | 66.4 | 55.6 | 73.9 |
| GPQA-Diamond | 33.8 | 49.0 | 27.9 | 40.1 | |
| C-Eval | 27.1 | 50.4 | 50.4 | 68.1 | |
| LiveBench 2024-11-25 | 24.9 | 40.6 | 30.3 | 51.1 | |
| Alignment Tasks | IFEval strict prompt | 39.9 | 59.0 | 59.2 | 72.5 |
| Arena-Hard | 4.5 | 17.6 | 8.5 | 43.1 | |
| AlignBench v1.1 | 5.00 | 6.24 | 6.10 | 7.60 | |
| Creative Writing v3 | 16.4 | 51.1 | 30.6 | 48.0 | |
| WritingBench | 4.03 | 5.42 | 5.61 | 7.02 | |
| Math & Text Reasoning | MATH-500 | 83.9 | 89.1 | 77.6 | 93.4 |
| AIME'24 | 28.9 | 50.4 | 10.7 | 48.3 | |
| AIME'25 | 22.8 | 27.8 | 15.1 | 36.8 | |
| ZebraLogic | 4.9 | 37.1 | 30.3 | 63.2 | |
| AutoLogi | 19.1 | 63.4 | 61.6 | 83.2 | |
| Agent & Coding | BFCL v3 | 14.0 | 21.5 | 46.4 | 56.6 |
| LiveCodeBench v5 | 13.2 | 42.5 | 12.3 | 33.2 | |
| CodeForces (Rating / Percentile) | 36.1 | 51.2 | |||
| Multilingual Tasks | Multi-IF | 13.3 | 27.0 | 35.9 | 51.8 |
| INCLUDE | 21.9 | 34.5 | 43.1 | 59.1 | |
| MMMLU 14 languages | 27.3 | 40.1 | 7.8 | 36.1 | |
| MT-AIME2024 | 12.4 | 13.2 | 11.4 | 25.2 | |
| PolyMath | 14.5 | 10.8 | 40.9 | 56.0 | |
| MLogiQA | 29.0 | 32.8 | |||
The following are the results from Table 20 of the original paper:
| | Gemma-3-1B-IT | Phi-4-mini | Qwen2.5-1.5B-Instruct | Qwen2.5-3B-Instruct | Qwen3-0.6B | Qwen3-1.7B |
| Architecture | Dense | Dense | Dense | Dense | Dense | Dense | |
| # Activated Params | 1.0B | 3.8B | 1.5B | 3.1B | 0.6B | 1.7B | |
| # Total Params | 1.0B | 3.8B | 1.5B | 3.1B | 0.6B | 1.7B | |
| General Tasks | MMLU-Redux | 33.3 | 67.9 | 50.7 | 64.4 | 44.6 | 64.4 |
| GPQA-Diamond | 19.2 | 25.2 | 29.8 | 30.3 | 22.9 | 28.6 | |
| C-Eval | 28.5 | 40.0 | 53.3 | 68.2 | 42.6 | 61.0 | |
| LiveBench 2024-11-25 | 14.4 | 25.3 | 18.0 | 23.8 | 21.8 | 35.6 | |
| Alignment Tasks | IFEval strict prompt | 54.5 | 68.6 | 42.5 | 58.2 | 54.5 | 68.2 |
| Arena-Hard | 17.8 | 32.8 | 9.0 | 23.7 | 6.5 | 36.9 | |
| AlignBench v1.1 | 5.3 | 6.00 | 5.60 | 6.49 | 5.60 | 7.20 | |
| Creative Writing v3 | 52.8 | 10.3 | 31.5 | 42.8 | 28.4 | 43.6 | |
| WritingBench | 5.18 | 4.05 | 4.67 | 5.55 | 5.13 | 6.54 | |
| Math & Text Reasoning | MATH-500 | 46.4 | 67.6 | 55.0 | 67.2 | 55.2 | 73.0 |
| AIME'24 | 0.9 | 8.1 | 0.9 | 6.7 | 3.4 | 13.4 | |
| AIME'25 | 0.8 | 5.3 | 0.4 | 4.2 | 2.6 | 9.8 | |
| ZebraLogic | 1.9 | 2.7 | 3.4 | 4.8 | 4.2 | 12.8 | |
| AutoLogi | 16.4 | 28.8 | 22.5 | 29.9 | 37.4 | 59.8 | |
| Agent & Coding | BFCL v3 | 16.3 | 31.3 | 47.8 | 50.4 | 44.1 | 52.2 |
| LiveCodeBench v5 | 1.8 | 10.4 | 5.3 | 9.2 | 3.6 | 11.6 | |
| Multilingual Tasks | Multi-IF | 32.8 | 40.5 | 20.2 | 32.3 | 33.3 | 44.7 |
| INCLUDE | 32.7 | 43.8 | 33.1 | 43.8 | 34.4 | 42.6 | |
| MMMLU 14 languages | 32.5 | 51.4 | 40.4 | 51.8 | 37.1 | 48.3 | |
| MT-AIME2024 | 0.2 | 0.9 | 0.7 | 1.6 | 1.5 | 4.9 | |
| PolyMath | 3.5 | 6.7 | 5.0 | 7.3 | 4.6 | 10.3 | |
| MLogiQA | 31.8 | 39.5 | 40.9 | 39.5 | 37.3 | 41.1 |
Analysis of Smaller Qwen3 Models (Thinking and Non-thinking Modes):
- Edge-side Performance: The smaller Qwen3 models (Qwen3-8B, 4B, 1.7B, 0.6B) exhibit impressive performance, often outperforming baselines with more parameters, including previous Qwen2.5 models, in both thinking and non-thinking modes.
- Distillation Efficacy: These results further reinforce the efficacy of the Strong-to-Weak Distillation approach, enabling the creation of lightweight Qwen3 models at a fraction of the cost and effort while maintaining strong capabilities.
6.3. Discussion
6.3.1. The Effectiveness of Thinking Budget
The ability of Qwen3 to enhance its intelligence by leveraging an increased thinking budget is a key innovation.
The following figure (Figure 2 from the original paper) shows the performance of Qwen3-235B-A22B with respect to the thinking budget:
The figure is a chart showing the performance of Qwen3-235B-A22B under different thinking budgets on four tasks: AIME'24, AIME'25, LiveCodeBench (v5), and GPQA Diamond. Results are shown for both thinking and non-thinking modes, with performance improving markedly as the thinking budget increases.
Figure 2: Performance of Qwen3-235B-A22B with respect to the thinking budget.
Analysis:
- Scalable Performance Improvement: As observed in Figure 2, Qwen3-235B-A22B shows a clear and consistent improvement across benchmarks (AIME'24, AIME'25, LiveCodeBench v5, GPQA Diamond) as the allocated thinking budget (measured in tokens) increases. This validates the design principle that more computation dedicated to thinking translates directly into better reasoning outcomes.
- Smooth Scaling: The scaling curves are smooth, suggesting that the thinking budget mechanism provides a continuous knob for trading off latency (more thinking tokens) against performance based on task complexity.
- Future Potential: The authors hypothesize that extending the thinking output length beyond 32K tokens could yield further gains, suggesting future work on pushing the limits of the thinking budget. A rough sketch of how a budget could be enforced at inference time follows.
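The report does not spell out the serving-side mechanics here, but one plausible way to enforce such a budget is to cap the reasoning phase and then force the model into its answer phase. A minimal sketch, assuming a hypothetical single-token generation interface and a `</think>` delimiter (none of these names come from the report):

```python
# Sketch only: `generate_one_token`, `decode`, `encode`, and `eos_token_id`
# are an assumed model interface, not an actual Qwen3 serving API.

def generate_with_thinking_budget(model, prompt_ids, thinking_budget: int,
                                  max_answer_tokens: int = 2048):
    think_ids, answer_ids = [], []

    # Phase 1: reasoning tokens, capped by the thinking budget.
    while len(think_ids) < thinking_budget:
        tok = model.generate_one_token(prompt_ids + think_ids)
        think_ids.append(tok)
        if model.decode(think_ids).endswith("</think>"):
            break  # model closed its reasoning block on its own
    else:
        # Budget exhausted: close the think block so the model must answer now.
        think_ids += model.encode("\n</think>\n")

    # Phase 2: the final, user-visible answer.
    while len(answer_ids) < max_answer_tokens:
        tok = model.generate_one_token(prompt_ids + think_ids + answer_ids)
        if tok == model.eos_token_id:
            break
        answer_ids.append(tok)
    return think_ids, answer_ids
```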
6.3.2. The Effectiveness and Efficiency of On-Policy Distillation
The on-policy distillation approach, part of the strong-to-weak distillation pipeline, proves to be both effective and highly efficient.
The following are the results from Table 21 of the original paper:
| Method | AIME'24 | AIME'25 | MATH500 | LiveCodeBench v5 | MMLU -Redux | GPQA -Diamond | GPU Hours |
| Off-policy Distillation | 55.0 (90.0) | 42.8 (83.3) | 92.4 | 42.0 | 86.4 | 55.6 | - |
| + Reinforcement Learning | 67.6 (90.0) | 55.5 (83.3) | 94.8 | 52.9 | 86.9 | 61.3 | 17,920 |
| + On-policy Distillation | 74.4 (93.3) | 65.5 (86.7) | 97.0 | 60.3 | 88.3 | 63.3 | 1,800 |
Analysis:
- Superior Performance over RL: On-policy Distillation achieves significantly better performance on all listed benchmarks (AIME'24, AIME'25, MATH500, LiveCodeBench v5, MMLU-Redux, GPQA-Diamond) than direct Reinforcement Learning when both start from the same off-policy distilled 8B checkpoint. For example, AIME'24 improves from 67.6 (RL) to 74.4 (distillation).
- Dramatic Efficiency Gains: This performance gain comes with a remarkable reduction in computational cost. On-policy Distillation requires only 1,800 GPU hours for the Qwen3-8B model, roughly 1/10th of the 17,920 GPU hours needed for Reinforcement Learning, a massive efficiency advantage when training smaller models.
- Enhanced Exploration: Distillation from teacher logits not only improves direct performance (Pass@1) but also expands the student's exploration space and reasoning potential, as evidenced by improved Pass@64 scores on AIME'24 (93.3 vs. 90.0) and AIME'25 (86.7 vs. 83.3). In contrast, Reinforcement Learning alone did not improve Pass@64 over the initial off-policy checkpoint. This suggests that mimicking a strong teacher's thought process, including its uncertainties and alternative paths (captured by soft probabilities), is more beneficial for robustness and exploration than direct reward optimization. A minimal sketch of the on-policy distillation objective appears below.
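Conceptually, on-policy distillation has the student generate its own responses and then pulls its token-level distributions toward the frozen teacher's on those same sequences. A minimal PyTorch-style sketch (the `student.sample` helper, HF-style `.logits` outputs, and the forward-KL choice are illustrative assumptions, not the report's training code):

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompt_ids, temperature=1.0):
    """One on-policy distillation step (sketch): the student samples a response,
    and its per-token distribution is aligned with the frozen teacher's on that
    same self-generated sequence."""
    # 1) On-policy rollout: the student generates its own response.
    with torch.no_grad():
        response_ids = student.sample(prompt_ids)            # assumed helper
        full_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        teacher_logits = teacher(full_ids).logits            # frozen teacher

    # 2) Student logits on the same tokens (gradients flow only here).
    student_logits = student(full_ids).logits

    # 3) Align distributions over the response positions only
    #    (logits at position i predict token i+1).
    resp = slice(prompt_ids.size(-1) - 1, full_ids.size(-1) - 1)
    s_logprobs = F.log_softmax(student_logits[:, resp] / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits[:, resp] / temperature, dim=-1)

    # Forward KL(teacher || student); 'batchmean' normalizes by the batch size.
    loss = F.kl_div(s_logprobs, t_probs, reduction="batchmean")
    return loss
```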
6.3.3. The Effects of Thinking Mode Fusion and General RL
The stages of Thinking Mode Fusion and General RL are crucial for integrating non-thinking capabilities, refining instruction following, and enhancing overall model robustness.
The following are the results from Table 22 of the original paper:
| Stage 2 Reasoning RL | Stage 3 Thinking Mode Fusion | Stage 4 General RL | ||||
| Benchmark | Thinking | Thinking | Non-Thinking | Thinking | Non-Thinking | |
| General Tasks | LiveBench 2024-11-25 | 68.6 | 70.9 (+2.3) | 57.1 | 74.9 (+4.0) | 59.8 (+2.8) |
| Arena-Hard | 86.8 | 89.4 (+2.6) | 88.5 | 93.8 (+4.4) | 92.8 (+4.3) | |
| CounterFactQA* | 50.4 | 61.3 (+10.9) | 64.3 | 68.1 (+6.8) | 66.4 (+2.1) | |
| Instruction & Format Following | IFEval strict prompt | 73.0 | 78.4 (+5.4) | 78.4 | 85.0 (+6.6) | 83.2 (+4.8) |
| Multi-IF | 61.4 | 64.6 (+3.2) | 65.2 | 73.0 (+8.4) | 70.7 (+5.5) | |
| LengthCtrl* | 62.6 | 70.6 (+8.0) | 84.9 | 73.5 (+2.9) | 87.3 (+2.4) | |
| ThinkFollow* | - | 88.7 | 98.9 (+10.2) | |||
| Agent | BFCL v3 | 69.0 | 68.4 (-0.6) | 61.5 | 70.3 (+1.9) | 63.0 (+1.5) |
| ToolUse* | 63.3 | 70.4 (+7.1) | 73.2 | 85.5 (+15.1) | 86.5 (+13.3) | |
| Knowledge & STEM | MMLU-Redux | 91.4 | 91.0 (-0.4) | 86.7 | 90.9 (-0.1) | 85.7 (-1.0) |
| GPQA-Diamond | 68.8 | 69.0 (+0.2) | 50.4 | 68.4 (-0.6) | 54.6 (+4.3) | |
| Math & Coding | AIME'24 | 83.8 | 81.9 (-1.9) | 28.5 | 81.4 (-0.5) | 31.0 (+2.5) |
| LiveCodeBench v5 | 68.4 | 67.2 (-1.2) | 31.1 | 65.7 (-1.5) | 31.3 (+0.2) | |
Analysis of Qwen3-32B at Different Stages:
- Stage 3 (Thinking Mode Fusion):
  - Initial Mode Switching: The ThinkFollow benchmark score of 88.7 indicates that the model gains an initial, though imperfect, ability to switch between thinking modes.
  - General and Instruction Following Improvements (Thinking Mode): The model shows significant gains on CounterFactQA (+10.9 points) and LengthCtrl (+8.0 points) in thinking mode, demonstrating improved general and instruction-following capabilities.
- Stage 4 (General RL):
  - Robust Mode Switching: The ThinkFollow score improves dramatically to 98.9 (+10.2 points), confirming that General RL ensures highly accurate mode switching (a rough sketch of template-based mode switching is given after this list).
  - Broad Capability Enhancement: General RL further strengthens general (LiveBench, Arena-Hard, CounterFactQA), instruction-following (IFEval, Multi-IF, LengthCtrl), and agent capabilities (BFCL v3, ToolUse) in both thinking and non-thinking modes. ToolUse sees a substantial boost (+15.1 in thinking, +13.3 in non-thinking).
- Performance Trade-offs for Specialized Tasks:
  - For Knowledge, STEM, Math (AIME'24), and Coding (LiveCodeBench v5) tasks, Thinking Mode Fusion and General RL do not yield significant improvements in thinking mode. In fact, some challenging tasks like AIME'24 and LiveCodeBench show a slight decrease in thinking-mode performance after these stages.
  - Conjecture: The authors hypothesize this degradation stems from training on a broader range of general tasks, which may compromise specialized capabilities. This is a conscious trade-off: improving general robustness may dilute peak performance on highly specialized, complex reasoning tasks.
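For intuition on what ThinkFollow measures, mode switching is ultimately a chat-template behavior. A minimal sketch, assuming the released Qwen3 convention in which /think and /no_think soft switches in the user turn toggle the mode and the non-thinking path pre-fills an empty <think></think> block; the helper below is a simplified stand-in, not the official template code:

```python
def apply_chat_template(user_message: str, enable_thinking: bool = True) -> str:
    """Build a prompt that either allows reasoning or pre-fills an empty think block."""
    # Soft switches inside the message override the default mode.
    if "/no_think" in user_message:
        enable_thinking = False
    elif "/think" in user_message:
        enable_thinking = True

    prompt = f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"
    if not enable_thinking:
        # An empty think block steers the model to answer directly.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

# Example: the same model serves quick answers and deep reasoning.
print(apply_chat_template("What is 17 * 24? /no_think"))
print(apply_chat_template("Prove that sqrt(2) is irrational. /think"))
```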
6.3.4. Long-Context Ability
The long-context processing capabilities are evaluated using the RULER benchmark.
The following are the results from Table 23 of the original paper:
| Model | RULER | |||||||
| Avg. | 4K | 8K | 16K | 32K | 64K | 128K | ||
| Qwen2.5-7B-Instruct | 85.4 | 96.7 | 95.1 | 93.7 | 89.4 | 82.3 | 55.1 | |
| Qwen2.5-14B-Instruct | 91.4 | 97.7 | 96.8 | 95.9 | 93.4 | 86.7 | 78.1 | |
| Qwen2.5-32B-Instruct | 92.9 | 96.9 | 97.1 | 95.5 | 95.5 | 90.3 | 82.0 | |
| Qwen2.5-72B-Instruct | 95.1 | 97.7 | 97.2 | 97.7 | 96.5 | 93.0 | 88.4 | |
| Non-thinking Mode | Qwen3-4B | 85.2 | 95.1 | 93.6 | 91.0 | 87.8 | 77.8 | 66.0 |
| Qwen3-8B | 89.1 | 96.3 | 96.0 | 91.8 | 91.2 | 82.1 | 77.4 | |
| Qwen3-14B | 94.6 | 98.0 | 97.8 | 96.4 | 96.1 | 94.0 | 85.1 | |
| Qwen3-32B | 93.7 | 98.4 | 96.0 | 96.2 | 94.4 | 91.8 | 85.6 | |
| Qwen3-30B-A3B | 91.6 | 96.5 | 97.0 | 95.3 | 92.4 | 89.1 | 79.2 | |
| Qwen3-235B-A22B | 95.0 | 97.7 | 97.2 | 96.4 | 95.1 | 93.3 | 90.6 | |
| Thinking Mode | Qwen3-4B | 83.5 | 92.7 | 88.7 | 86.5 | 83.2 | 83.0 | 67.2 |
| Qwen3-8B | 84.4 | 94.7 | 94.4 | 86.1 | 80.8 | 78.3 | 72.0 | |
| Qwen3-14B | 90.1 | 95.4 | 93.6 | 89.8 | 91.9 | 90.6 | 79.0 | |
| Qwen3-32B | 91.0 | 94.7 | 93.7 | 91.6 | 92.5 | 90.0 | 83.5 | |
| Qwen3-30B-A3B | 86.6 | 94.1 | 92.7 | 89.0 | 86.6 | 82.1 | 75.0 | |
| Qwen3-235B-A22B | 92.2 | 95.1 | 94.8 | 93.0 | 92.3 | 92.0 | 86.0 | |
Analysis:
- Non-thinking Mode Improvement: In non-thinking mode, Qwen3 models generally outperform Qwen2.5 models of similar size on long-context processing, especially at longer context lengths (e.g., Qwen3-14B vs. Qwen2.5-14B and Qwen3-32B vs. Qwen2.5-32B show clear improvements in average and 128K scores). The flagship Qwen3-235B-A22B achieves a strong 90.6 at 128K.
- Thinking Mode Degradation: In thinking mode, performance on RULER degrades slightly compared to non-thinking mode. The authors hypothesize that for pure retrieval tasks like RULER, which do not rely on complex reasoning, the generated thinking content offers little benefit and may even interfere with retrieval. This marks an area for future improvement so that thinking mode provides benefits across all task types.
6.3.5. Multilingual Ability
The multilingual capabilities are showcased through various benchmarks and detailed language-specific tables (Tables 24-35).
Summary: Tables 24-35 present detailed benchmark scores across various languages, including Spanish, French, Portuguese, Italian, Arabic, Japanese, Korean, Indonesian, Russian, Vietnamese, German, and Thai. The results demonstrate that the Qwen3 series models achieve competitive performance across all evaluated benchmarks for these specific languages, showcasing their strong multilingual capabilities.
For a broader assessment, the Belebele benchmark (Bandarkar et al., 2023) is used, covering 80 of Qwen3's supported languages (the 42 Belebele languages that Qwen3 does not optimize for are excluded).
The following are the results from Table 37 of the original paper:
| Model | Indo-European | Sino-Tibetan | Afro-Asiatic | Austronesian | Dravidian | Turkic | Tai-Kadai | Uralic | Austroasiatic | Other |
| Gemma-3-27B-IT | 89.2 | 86.3 | 85.9 | 84.1 | 83.5 | 86.8 | 81.0 | 91.0 | 86.5 | 87.0 |
| Qwen2.5-32B-Instruct | 85.5 | 82.3 | 80.4 | 70.6 | 67.8 | 80.8 | 74.5 | 87.0 | 79.0 | 72.6 |
| QwQ-32B | 86.1 | 83.7 | 81.9 | 71.3 | 69.3 | 80.3 | 77.0 | 88.0 | 83.0 | 74.0 |
| Qwen3-32B (Thinking) | 90.7 | 89.7 | 84.8 | 86.7 | 84.5 | 89.3 | 83.5 | 91.3 | 88.0 | 83.1 |
| Qwen3-32B (Non-thinking) | 89.1 | 88.0 | 82.3 | 83.7 | 84.0 | 85.0 | 85.0 | 88.7 | 88.0 | 81.3 |
| Gemma-3-12B-IT | 85.8 | 83.3 | 83.4 | 79.3 | 79.0 | 82.8 | 77.5 | 89.0 | 83.0 | 81.6 |
| Qwen2.5-14B-Instruct | 82.7 | 78.9 | 80.4 | 69.1 | 66.2 | 74.2 | 72.2 | 83.9 | 77.9 | 70.4 |
| Qwen3-14B (Thinking) | 88.6 | 87.3 | 82.4 | 82.4 | 81.0 | 83.8 | 83.5 | 91.0 | 82.5 | 81.7 |
| Qwen3-14B (Non-thinking) | 87.4 | 82.7 | 80.1 | 80.7 | 78.0 | 81.8 | 80.5 | 87.7 | 81.5 | 77.0 |
| Gemma-3-4B-IT | 71.8 | 72.0 | 63.5 | 61.7 | 64.8 | 64.0 | 61.5 | 70.7 | 71.0 | 62.6 |
| Qwen2.5-3B-Instruct | 58.0 | 62.3 | 57.2 | 47.9 | 36.9 | 45.1 | 49.8 | 50.6 | 56.8 | 48.4 |
| Qwen3-4B (Thinking) | 82.2 | 77.7 | 74.1 | 73.0 | 74.3 | 76.3 | 68.5 | 83.0 | 74.5 | 67.9 |
| Qwen3-4B (Non-thinking) | 76.0 | 77.0 | 65.6 | 65.6 | 65.5 | 64.0 | 60.5 | 74.0 | 74.0 | 61.0 |
| Gemma-3-1B-IT | 36.5 | 36.0 | 30.0 | 29.1 | 28.8 | 27.3 | 28.0 | 32.7 | 33.0 | 30.9 |
| Qwen2.5-1.5B-Instruct | 41.5 | 43.0 | 39.6 | 34.8 | 28.6 | 29.7 | 39.4 | 33.8 | 42.0 | 36.0 |
| Qwen3-1.7B (Thinking) | 69.7 | 66.0 | 59.4 | 58.6 | 52.8 | 57.8 | 53.5 | 70.3 | 63.5 | 53.4 |
| Qwen3-1.7B (Non-thinking) | 58.8 | 62.7 | 50.8 | 53.0 | 43.3 | 48.0 | 46.0 | 54.3 | 54.0 | 43.9 |
Analysis of Belebele Benchmark:
- Superiority over Qwen2.5: Qwen3 models significantly outperform their Qwen2.5 counterparts across all language families, highlighting the impact of the expanded multilingual pre-training data (119 languages) and improved training strategies.
- Competitiveness with Gemma-3: Qwen3 achieves comparable or superior performance to similarly sized Gemma models (e.g., Qwen3-32B vs. Gemma-3-27B, Qwen3-14B vs. Gemma-3-12B).
- Thinking Mode Advantage: For Qwen3 models, thinking mode consistently yields higher scores than the non-thinking counterparts across almost all language families, demonstrating the benefit of reasoning for cross-lingual understanding. This contrasts with the RULER benchmark, where thinking mode was less beneficial, indicating the task-specific utility of the thinking mechanism.
7. Conclusion & Reflections
7.1. Conclusion Summary
This technical report introduces Qwen3, a significant advancement in the Qwen family of large language models. The key contributions include the novel integration of thinking and non-thinking modes into a unified framework, complemented by a dynamic thinking budget mechanism. This design empowers users with flexible control over computational resources and reasoning depth, eliminating the need to switch between specialized models. Qwen3 comprises a diverse series of dense and Mixture-of-Expert (MoE) models, ranging from 0.6B to 235B parameters, featuring architectural refinements like QK-Norm and global-batch load balancing loss. The models were pre-trained on an unprecedented 36 trillion tokens across 119 languages and dialects, vastly expanding multilingual capabilities. A sophisticated multi-stage post-training pipeline, including Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, coupled with an efficient Strong-to-Weak Distillation approach for smaller models, ensures state-of-the-art performance. Empirical evaluations consistently demonstrate Qwen3's competitive results against proprietary and leading open-source models across diverse benchmarks, particularly excelling in code generation, mathematical reasoning, and agent tasks. The open-source release under Apache 2.0 further promotes community engagement and research.
7.2. Limitations & Future Work
The authors acknowledge certain limitations and outline future research directions:
- Thinking Mode for Retrieval Tasks: While thinking mode generally enhances reasoning, performance on certain retrieval-based long-context tasks (like RULER) slightly degrades. The authors hypothesize that the generated thinking content may interfere with retrieval in these scenarios, suggesting a need to refine thinking mode for such tasks in future versions.
- Specialized vs. General Performance Trade-off: The Thinking Mode Fusion and General RL stages, while enhancing overall versatility and instruction following, sometimes lead to a slight decrease in thinking-mode performance on highly challenging, specialized tasks like AIME'24 and LiveCodeBench. This indicates a trade-off between broad generalization and peak performance in niche, complex problem-solving.

Future research will focus on:

- Scaling Pre-training: Continuing to scale up pre-training with even higher-quality and more diverse data.
- Architectural and Training Method Improvements: Enhancing model architecture and training methods for effective compression and scaling to extremely long contexts, including addressing the observed performance dip of thinking mode in retrieval tasks.
- Increased RL Resources and Agent-based RL Systems: Allocating more computational resources for Reinforcement Learning, with particular emphasis on agent-based RL systems that learn from environmental feedback. The goal is to build agents capable of tackling complex tasks requiring inference-time scaling, marking a move towards more autonomous and interactive LLM agents.
7.3. Personal Insights & Critique
Qwen3 represents a compelling stride towards more adaptable and efficient LLMs. The unified thinking and non-thinking modes, coupled with the thinking budget, are genuinely innovative concepts that address a critical practical challenge in LLM deployment: how to balance responsiveness and deep reasoning without maintaining separate models. This dynamic control over inference-time computation is a sophisticated solution for optimizing user experience and resource utilization.
The Strong-to-Weak Distillation approach for smaller models is particularly insightful. Demonstrating that distillation can be significantly more effective and efficient than direct RL, especially for exploration ability (Pass@64), offers a powerful blueprint for developing competitive lightweight models. This has direct implications for edge deployment and broader accessibility.
However, the acknowledged trade-off between generalized capabilities (achieved through General RL) and peak performance in specialized thinking mode tasks is an important point of critique. While understandable for overall versatility, it highlights a fundamental tension in LLM training: optimizing for average performance across a vast array of tasks might dilute the model's ability to achieve extreme performance in highly focused, difficult domains. Future work in multi-objective optimization or more sophisticated curriculum learning could potentially mitigate this.
The massive expansion to 119 languages is commendable, pushing the boundaries of global accessibility. However, the report doesn't extensively detail the performance across all these low-resource languages or the specific challenges encountered during their integration. While Belebele provides a broad overview, deeper dives into specific language families or challenges unique to low-resource settings would further enhance understanding.
Overall, Qwen3's contributions, particularly in its novel mode-switching and resource allocation mechanisms, alongside its commitment to open-source, position it as a significant player in the evolving LLM landscape. Its methods could be transferable to other AI domains requiring dynamic computational allocation based on task complexity, beyond just language generation.