Qwen3 Technical Report

Published: 05/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Qwen3 introduces a unified framework integrating thinking and non-thinking modes for dynamic switching, enhancing performance and multilingual support. It also features a thinking budget mechanism for adaptive resource allocation, and expands language coverage from 29 to 119 languages and dialects.

Abstract

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "Qwen3 Technical Report". The central topic is the introduction and detailed technical description of Qwen3, the latest generation of the Qwen large language model (LLM) family.

1.2. Authors

The paper lists "Qwen Team" as the authors, followed by "Core Contributors" and "Contributors" sections detailing a large number of individuals involved. While specific affiliations are not explicitly stated in the provided abstract or initial pages, the Qwen model series is developed by Alibaba Cloud. The extensive list of contributors indicates a large-scale, collaborative research and development effort.

1.3. Journal/Conference

The paper is published as a preprint on arXiv, with the link https://arxiv.org/abs/2505.09388. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference. However, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like artificial intelligence, and technical reports from major industry labs (like Alibaba's Qwen team) often serve as primary sources for new model introductions.

1.4. Publication Year

The paper was published on arXiv on May 14, 2025, indicating a publication year of 2025.

1.5. Abstract

This work introduces Qwen3, the newest iteration in the Qwen model family, featuring a series of large language models (LLMs) engineered for enhanced performance, efficiency, and multilingual capabilities. The Qwen3 series encompasses both dense and Mixture-of-Expert (MoE) architectures, with parameter counts spanning from 0.6 billion to 235 billion. A core innovation is the unified integration of a thinking mode for complex, multi-step reasoning and a non-thinking mode for swift, context-driven responses, obviating the need for model switching. This framework allows for dynamic mode selection based on user input or templates. Qwen3 also introduces a thinking budget mechanism, enabling adaptive allocation of computational resources during inference to balance latency and performance according to task complexity. Furthermore, by distilling knowledge from larger flagship models, the computational cost for developing smaller models is significantly reduced while maintaining competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including code generation, mathematical reasoning, and agent tasks, rivaling larger MoE and proprietary models. Compared to its predecessor, Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, improving global accessibility through advanced cross-lingual understanding and generation. All Qwen3 models are openly released under the Apache 2.0 license to foster reproducibility and community-driven research.

The original source link is https://arxiv.org/abs/2505.09388, and the PDF link is https://arxiv.org/pdf/2505.09388v1.pdf. It is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of artificial intelligence is relentlessly pursuing artificial general intelligence (AGI) and artificial super intelligence (ASI). Recent advancements in large foundation models, such as GPT-4o, Claude 3.7, and Llama-4, have shown remarkable progress in distilling human knowledge and capabilities by training on vast datasets. However, a significant challenge remains: most state-of-the-art models are proprietary, limiting broader research and innovation. While open-source models like DeepSeek-V3 and Qwen2.5 have narrowed the performance gap, there is still a need for open-weight models that can compete at the highest levels, especially in complex reasoning tasks.

A key problem LLMs face is balancing the need for rapid, general-purpose responses with the demand for deep, multi-step reasoning. Currently, users often switch between different types of models—e.g., chat-optimized models for quick replies and dedicated reasoning models for complex problems—leading to inefficiencies and a fragmented user experience. Moreover, training and deploying large models, particularly for specialized reasoning, incurs substantial computational costs. There's a gap in providing a unified, adaptable, and efficient solution that caters to both thinking and non-thinking modes while also being accessible and capable across a wide array of languages.

The paper's entry point is to address these challenges by introducing Qwen3, a comprehensive series of open-weight LLMs. Its innovative idea centers on integrating thinking and non-thinking modes within a single model, coupled with a thinking budget mechanism, to offer unprecedented flexibility and resource control. This aims to eliminate the need for model switching, optimize inference costs, and extend state-of-the-art performance to an expanded multilingual user base.

2.2. Main Contributions / Findings

The Qwen3 technical report highlights several primary contributions and key findings:

  • Unified Thinking and Non-Thinking Modes: Qwen3 integrates two distinct operational modes—thinking mode for complex, multi-step reasoning and non-thinking mode for rapid, context-driven responses—into a single, unified model. This eliminates the need for users to switch between different models for varying task complexities, offering dynamic mode switching based on user queries or chat templates.

  • Dynamic Thinking Budget Mechanism: A novel thinking budget mechanism is introduced, allowing users to adaptively allocate computational resources during inference. This provides fine-grained control over the model's reasoning effort, balancing latency and performance based on task complexity.

  • Broad Model Series and Architectures: Qwen3 comprises a diverse series of LLMs, including both dense and Mixture-of-Expert (MoE) architectures. These models span a wide range of parameter scales, from 0.6 billion to 235 billion, catering to various downstream applications and deployment environments. The flagship Qwen3-235B-A22B is an MoE model with 235 billion total parameters and 22 billion activated parameters per token, balancing performance and efficiency.

  • Enhanced Multilingual Capabilities: Qwen3 significantly expands its multilingual support, covering 119 languages and dialects, a substantial increase from Qwen2.5's 29 languages. This enhancement improves cross-lingual understanding and generation, making the models globally accessible.

  • Efficient Strong-to-Weak Distillation: A strong-to-weak distillation pipeline is developed to efficiently train smaller-scale models. By leveraging knowledge transfer from larger, more capable flagship models, this approach drastically reduces the computational resources and development effort required for lightweight models while ensuring highly competitive performance. This distillation process is shown to be significantly more efficient than reinforcement learning for smaller models.

  • State-of-the-Art Performance: Empirical evaluations demonstrate that Qwen3 models achieve state-of-the-art results across a diverse set of benchmarks. This includes tasks in code generation, mathematical reasoning, and agent tasks, with the flagship models (Qwen3-235B-A22B and Qwen3-32B) performing competitively against larger MoE models and proprietary models like OpenAI-o1, Gemini2.5-Pro, and GPT-4o.

  • Open-Source Release: All Qwen3 models are publicly accessible under the Apache 2.0 license, facilitating reproducibility and fostering community-driven research and development.

    These contributions collectively address the fragmentation in model usage, resource inefficiency, and limitations in multilingual support, pushing the boundaries of open-source LLM capabilities.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Qwen3 technical report, several foundational concepts in large language models (LLMs) and deep learning are essential:

  • Large Language Models (LLMs): These are artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They typically use transformer architectures and can perform a wide range of natural language processing tasks, from translation to question answering.
  • Dense Models vs. Mixture-of-Experts (MoE) Models:
    • Dense Models: These are traditional neural networks where all parameters are activated for every input token. They are computationally intensive as their size scales up.
    • Mixture-of-Experts (MoE) Models: These models consist of multiple "expert" sub-networks. For each input token, only a small subset of experts (e.g., 2 or 4 out of 128) are activated by a "router" or "gate" network. This allows MoE models to have a very large total number of parameters (for higher capacity) while maintaining a manageable number of activated parameters per token during inference, leading to more efficient computation for a given performance level.
  • Pre-training and Fine-tuning:
    • Pre-training: The initial, computationally expensive phase where a large model learns general language patterns, facts, and reasoning abilities from massive, diverse datasets using unsupervised or self-supervised learning objectives (e.g., predicting the next word).
    • Fine-tuning: A subsequent phase where the pre-trained model is further trained on smaller, task-specific datasets to adapt its learned knowledge to specific downstream applications or human preferences. This can involve Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL).
  • Chain-of-Thought (CoT) Reasoning: A prompting technique where the LLM is instructed to verbalize its intermediate reasoning steps before providing a final answer. This encourages the model to break down complex problems into manageable steps, often leading to more accurate and verifiable solutions, especially in mathematical or logical tasks.
  • Reinforcement Learning (RL) and Reinforcement Learning from Human Feedback (RLHF):
    • RL: A paradigm where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. The goal is to maximize cumulative reward.
    • RLHF: A common technique for aligning LLMs with human preferences. It involves: 1) training a reward model (RM) on human judgments of LLM outputs, and 2) using this RM to provide rewards to the LLM during an RL phase (e.g., using algorithms like PPO or GRPO) to fine-tune its behavior.
  • Transformer Architecture: The foundational neural network architecture for LLMs, characterized by self-attention mechanisms and feed-forward layers. It allows the model to weigh the importance of different parts of the input sequence when processing each token.
    • Attention Mechanism: A core component of transformers that allows the model to focus on different parts of the input sequence when generating an output. The Scaled Dot-Product Attention is defined as: \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V Where:
      • Q is the Query matrix.
      • K is the Key matrix.
      • V is the Value matrix.
      • d_k is the dimension of the keys.
      • \mathrm{softmax} is the softmax function, which normalizes the attention scores.
    • Grouped Query Attention (GQA): An optimization for multi-head attention where multiple query heads share the same key and value heads. This reduces the computational cost and memory footprint, especially during inference, while maintaining much of the performance of full multi-head attention.
    • SwiGLU (Swish Gated Linear Unit): An activation function used in feed-forward networks within transformers, often replacing the traditional ReLU. It is defined as (see the code sketch after this list): \mathrm{SwiGLU}(x, W_1, W_2, V) = (\mathrm{Swish}(xW_1) \odot (xV))W_2 Where:
      • x is the input.
      • W_1, W_2, V are weight matrices.
      • \mathrm{Swish}(y) = y \cdot \mathrm{sigmoid}(y).
      • \odot denotes the element-wise product.
    • Rotary Positional Embeddings (RoPE): A type of positional encoding that encodes absolute position information with a rotation matrix and naturally incorporates relative position information. It is applied directly within the attention mechanism.
    • RMSNorm (Root Mean Square Normalization): A normalization technique applied before or after layers in neural networks. It normalizes inputs by their root mean square, which can offer computational advantages and stability over LayerNorm. For an input vector x, RMSNorm is defined as: \mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2 + \epsilon}} \cdot g Where:
      • N is the dimension of x.
      • \epsilon is a small constant for numerical stability.
      • g is a learnable scaling factor.
    • QK-Norm: A normalization applied to the query and key vectors before their dot product in the attention mechanism, helping to ensure stable training, particularly for very deep or large transformer models.
  • Byte-level Byte-Pair Encoding (BBPE): A tokenization algorithm that learns a vocabulary of common character sequences (subwords) by iteratively merging the most frequent adjacent pairs of bytes. Byte-level ensures that all input text, regardless of character set, can be tokenized.
  • Knowledge Distillation: A technique where a smaller "student" model learns from a larger, more powerful "teacher" model. The student tries to mimic the teacher's outputs, often by minimizing the Kullback-Leibler (KL) divergence between their logits. This allows the student to achieve better performance than it would if trained from scratch, with less data or computation.
  • Kullback-Leibler (KL) Divergence: A measure of how one probability distribution P diverges from a second, expected probability distribution Q. In distillation, it quantifies the difference between the teacher's soft probabilities and the student's probabilities. For discrete probability distributions P and Q, it is defined as: D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
  • Adaptive Base Frequency (ABF): A technique used with RoPE to extend the context window of LLMs by adjusting the base frequency.
  • Yet Another RoPE extensioN (YARN): A method to effectively extend the context length of RoPE-based models with minimal fine-tuning.
  • Dual Chunk Attention (DCA): A technique designed to achieve efficient inference for long context lengths by processing input in chunks.
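
To make the formulas above concrete, here is a minimal PyTorch sketch of scaled dot-product attention, RMSNorm, and SwiGLU. Tensor shapes and variable names are illustrative and not taken from the Qwen3 codebase.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

def rms_norm(x, g, eps=1e-6):
    # Normalize by the root mean square over the last dimension, then scale by learnable g.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * g

def swiglu(x, w1, v, w2):
    # SwiGLU(x, W1, W2, V) = (Swish(x W1) ⊙ (x V)) W2, where Swish(y) = y * sigmoid(y) == SiLU.
    return (F.silu(x @ w1) * (x @ v)) @ w2
```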

3.2. Previous Works

The paper contextualizes Qwen3 by referring to several prominent LLMs and related research efforts:

  • Foundation Models (General Progress):
    • GPT-4o (OpenAI, 2024): OpenAI's multimodal flagship model, noted for its strong performance across diverse tasks, often serving as a benchmark for general LLM capabilities.
    • Claude 3.7 (Anthropic, 2025): Anthropic's competitive LLM series, known for its strong reasoning and safety features.
    • Gemini 2.5 (DeepMind, 2025): Google DeepMind's multimodal model, also a major player in the SOTA LLM landscape.
    • DeepSeek-V3 (Liu et al., 2024a): An open-source model notable for its scale and performance, particularly as an MoE model. Qwen3 often benchmarks against it.
    • Llama-4 (Meta-AI, 2025): Meta AI's next-generation open-source LLM, representing a strong competitor in the open-weight community.
    • Qwen2.5 (Yang et al., 2024b): The immediate predecessor to Qwen3, serving as a baseline for improvements in architecture, data, and multilingual support.
  • Reasoning Models (Optimized via RL):
    • OpenAI-o3 (OpenAI, 2025): A model from OpenAI specifically optimized for reasoning tasks, often through advanced RL techniques.
    • DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model, likely developed using RL, against which Qwen3's reasoning capabilities are compared.
    • QwQ-32B (Qwen Team, 2024, 2025): An earlier reasoning-focused model from the Qwen team, used as a strong baseline for Qwen3's thinking mode development and also as a teacher model in cold-start fine-tuning.
  • Open-Source Baselines (for comparison across scales):
    • Llama-3 (Dubey et al., 2024): An earlier open-source model from Meta, serving as a benchmark for various parameter scales.
    • Gemma-3 (Team et al., 2025): Google's open-source model series, providing benchmarks across different sizes.
    • Phi-4 (Abdin et al., 2024): A smaller-scale model from Microsoft, often used for benchmarking lightweight LLMs.

Background on Qwen2.5's Role: Qwen2.5 is particularly relevant as it forms the direct lineage for Qwen3. The paper mentions several Qwen2.5 variants for data generation:

  • Qwen2.5-VL (Bai et al., 2025): A vision-language model used to extract text from PDF documents for Qwen3's pre-training data.

  • Qwen2.5-Math (Yang et al., 2024c): A specialized model for mathematical content generation, contributing to synthetic data.

  • Qwen2.5-Coder (Hui et al., 2024): A specialized model for code-related data generation, also contributing to synthetic data.

    The Qwen2.5-MoE architecture (Yang et al., 2024b) also serves as a direct predecessor for Qwen3's MoE models, with Qwen3 building upon its concepts like fine-grained expert segmentation.

3.3. Technological Evolution

The evolution of LLMs has been marked by several key trends:

  1. Scaling Laws: Initial models demonstrated that increasing parameters and data led to improved performance. This motivated the development of ever-larger models.

  2. Architectural Innovations: The Transformer architecture revolutionized NLP, leading to models with self-attention and positional embeddings. Subsequent innovations focused on improving efficiency (e.g., GQA), stability (RMSNorm, QK-Norm), and context handling (RoPE, YARN, DCA).

  3. Data Curation & Augmentation: Beyond simply scaling data, researchers focused on data quality, diversity (e.g., code, scientific texts), and synthetic data generation using existing LLMs (as seen with Qwen3 leveraging Qwen2.5 variants).

  4. Multilingualism: Early models were predominantly English-centric. There's a growing trend towards truly multilingual models, expanding language coverage and improving cross-lingual transfer.

  5. Alignment & Reasoning: The introduction of Chain-of-Thought (CoT) prompting and Reinforcement Learning from Human Feedback (RLHF) significantly enhanced models' reasoning abilities and their alignment with human preferences and instructions. Dedicated reasoning models emerged from this focus.

  6. Efficiency through Sparsity: Mixture-of-Experts (MoE) models represent a major step in combining vast capacity (total parameters) with efficient inference (activated parameters), addressing the computational bottlenecks of dense models.

  7. Unified Capabilities: The latest frontier involves integrating diverse capabilities (e.g., multimodal inputs, reasoning, quick responses) into a single, adaptable model, minimizing the need for specialized models.

    Qwen3's work fits squarely within these latest trends, pushing the boundaries in scale, multilingual support, efficiency through MoE and distillation, and crucially, the unification of thinking and non-thinking modes.

3.4. Differentiation Analysis

Compared to the main methods and models in related work, Qwen3 introduces several core differences and innovations:

  • Unified Thinking and Non-Thinking Modes with Dynamic Switching: This is Qwen3's most prominent differentiator. Prior approaches often required users to select a specific model optimized for chat (e.g., GPT-4o) or reasoning (e.g., QwQ-32B). Qwen3 integrates both within a single model, enabling dynamic switching via chat templates or user prompts (/think, /no_think flags). This simplifies deployment and usage, offering unparalleled flexibility.

  • Adaptive Thinking Budget: Building on the unified modes, Qwen3 introduces a thinking budget mechanism. This allows users to control the computational resources and time allocated for reasoning during inference, dynamically balancing latency and performance based on the complexity of the query. This fine-grained control is a novel optimization for practical LLM deployment.

  • Sophisticated Strong-to-Weak Distillation: While knowledge distillation is not new, Qwen3's application of strong-to-weak distillation for lightweight models is highly effective. It involves both off-policy and on-policy phases, leveraging flagship models as teachers. The paper demonstrates that this approach significantly outperforms reinforcement learning for smaller models in terms of performance and training efficiency (1/10th of GPU hours), and also enhances exploration ability (Pass@64 scores).

  • Massive Multilingual Expansion: Qwen3 drastically expands its language support from 29 to 119 languages and dialects. This is a significant leap compared to many models that focus primarily on high-resource languages, making Qwen3 exceptionally broad in its global accessibility. This was achieved through a dedicated multilingual data annotation system and a massive 36 trillion token pre-training dataset.

  • Advanced MoE Architecture: Qwen3's MoE models (e.g., Qwen3-235B-A22B) build upon Qwen2.5-MoE but introduce improvements such as excluding shared experts and adopting a global-batch load balancing loss (Qiu et al., 2025) to encourage expert specialization. This contributes to better performance and efficiency compared to previous MoE designs.

  • Architectural Refinements for Stability: Beyond standard Transformer components, Qwen3 incorporates subtle but important changes like removing QKV-bias and introducing QK-Norm to ensure stable training, particularly for its large-scale models.

  • Comprehensive Post-training Pipeline: The multi-stage post-training approach, combining Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, ensures robust reasoning, alignment, and general capabilities for flagship models, which then serve as teachers for smaller models. This structured approach is designed to systematically instill and refine diverse skills.

    In essence, Qwen3 differentiates itself by offering a more holistic, flexible, and efficient LLM solution that explicitly addresses the trade-offs between rapid response and deep reasoning within a single, highly multilingual, and open-source framework.

4. Methodology

4.1. Principles

The core idea behind Qwen3's methodology is to develop a highly versatile and efficient family of large language models (LLMs) that can dynamically adapt to different task requirements, particularly balancing rapid response with complex reasoning. This is achieved through a unified architecture that supports both thinking and non-thinking modes, complemented by a thinking budget mechanism. The theoretical basis rests on the observation that not all tasks require extensive reasoning, and therefore, dynamically adjusting the computational effort can lead to significant efficiency gains without compromising performance when reasoning is truly needed.

The methodology can be broken down into three main pillars:

  1. Flexible Architecture Design: Incorporating both dense and Mixture-of-Experts (MoE) models, with architectural refinements for stability and efficiency.
  2. Massive and Diverse Pre-training: Training on an enormous, linguistically diverse, and domain-rich dataset using a multi-stage approach to build a strong foundation of general knowledge, reasoning, and long-context understanding.
  3. Sophisticated Multi-Stage Post-training: A comprehensive fine-tuning strategy that explicitly develops thinking and non-thinking capabilities, aligns the models with human preferences, and efficiently transfers knowledge to smaller models through strong-to-weak distillation.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Architecture

The Qwen3 series includes 6 dense models and 2 MoE models, ranging from 0.6 billion to 235 billion parameters.

Dense Model Architecture: The architecture of Qwen3 dense models builds upon Qwen2.5 with several enhancements. Key components include:

  • Grouped Query Attention (GQA): An optimization of multi-head attention in which groups of query heads share a single key and value head. This reduces the memory footprint and increases inference speed compared to standard multi-head attention, especially for larger models.

  • SwiGLU: A gated activation function used in the feed-forward network blocks of the Transformer. It often provides performance improvements over ReLU or GELU.

  • Rotary Positional Embeddings (RoPE): A method for incorporating positional information into the self-attention mechanism by rotating query and key vectors based on their absolute positions, which implicitly captures relative position information.

  • RMSNorm with Pre-normalization: Root Mean Square Normalization applied before the attention and feed-forward layers (pre-normalization), which can improve training stability and performance.

  • QKV-bias Removal: Unlike Qwen2, QKV-bias (bias terms added to the query, key, and value projections) is removed in Qwen3. This is a simplification that can sometimes improve stability or performance.

  • QK-Norm Introduction: A new normalization technique, QK-Norm (Dehghani et al., 2023), is introduced into the attention mechanism. It normalizes the query and key vectors before their dot product and the softmax, aiming to ensure stable training for large models (see the code sketch after Table 1).

    The specific architectural details for dense models are provided in Table 1.

The following are the results from Table 1 of the original paper:

Models Layers Heads (Q / KV) Tie Embedding Context Length
Qwen3-0.6B 28 16 / 8 Yes 32K
Qwen3-1.7B 28 16 / 8 Yes 32K
Qwen3-4B 36 32 / 8 Yes 128K
Qwen3-8B 36 32 / 8 No 128K
Qwen3-14B 40 40 / 8 No 128K
Qwen3-32B 64 64 / 8 No 128K
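
To make the attention-side changes concrete, the following is an illustrative PyTorch sketch combining GQA head sharing with per-head RMS normalization of queries and keys (QK-Norm). The head counts mirror the 32/8 configuration in Table 1, but the function omits RoPE, learnable norm scales, causal masking, and the output projection; it is a sketch, not the actual Qwen3 implementation.

```python
import torch
import torch.nn.functional as F

def gqa_with_qk_norm(x, wq, wk, wv, n_q_heads=32, n_kv_heads=8, eps=1e-6):
    """Grouped Query Attention with RMS-normalized queries and keys (QK-Norm)."""
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_q_heads

    q = (x @ wq).view(B, T, n_q_heads, d_head).transpose(1, 2)   # (B, Hq, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)  # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)

    # QK-Norm: RMS-normalize query and key vectors before the attention dot product.
    q = q / torch.sqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k / torch.sqrt(k.pow(2).mean(-1, keepdim=True) + eps)

    # GQA: each KV head is shared by n_q_heads // n_kv_heads query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
```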

MoE Model Architecture: The Qwen3 MoE models share the fundamental architecture with the dense models (i.e., GQA, SwiGLU, RoPE, RMSNorm, QK-Norm). Key specifics for MoE are:

  • Total Experts: 128 total experts.

  • Activated Experts: 8 experts activated per token. This means for each token, the router network selects 8 out of the 128 experts to process it.

  • Fine-grained Expert Segmentation: Similar to Qwen2.5-MoE, this approach likely refers to how experts are structured and utilized, potentially at a sub-layer or fine-grained level within the feed-forward block.

  • Exclusion of Shared Experts: Unlike Qwen2.5-MoE, Qwen3's MoE design excludes shared experts. This means all experts are distinct, fostering better specialization. Shared experts are usually common experts that all tokens access, while other experts are conditionally activated. Removing them implies a more purely conditional computation.

  • Global-batch Load Balancing Loss: This is a training objective (Qiu et al., 2025) applied to the router network of the MoE model. Its purpose is to encourage expert specialization and ensure that the workload is evenly distributed across all experts within a training global batch. This prevents a few experts from becoming overloaded while others remain underutilized, which is crucial for efficient MoE training.

    The specific architectural details for MoE models are provided in Table 2.

The following are the results from Table 2 of the original paper:

Models Layers Heads (Q / KV) # Experts (Total / Activated) Context Length
Qwen3-30B-A3B 48 32 / 4 128 / 8 128K
Qwen3-235B-A22B 94 64 / 4 128 / 8 128K
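
A minimal sketch of top-8-of-128 routing with a Switch-style load-balancing auxiliary loss may help make this concrete. The exact form of Qwen3's global-batch load balancing loss (Qiu et al., 2025) is not reproduced here; the sketch computes the standard per-batch statistics and only notes where the global-batch aggregation would differ.

```python
import torch
import torch.nn.functional as F

def moe_route(hidden, router_weight, n_experts=128, top_k=8):
    """Top-k expert routing with a Switch-style load-balancing auxiliary loss.

    hidden: (num_tokens, d_model); router_weight: (d_model, n_experts).
    Qwen3 reportedly aggregates the balancing statistics over the global batch;
    here they are computed over whatever batch of tokens is passed in.
    """
    logits = hidden @ router_weight
    probs = F.softmax(logits, dim=-1)                 # router probabilities, (tokens, experts)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # 8 experts activated per token

    # f: fraction of routing slots assigned to each expert; p: mean router probability.
    dispatch = F.one_hot(topk_idx, n_experts).float().sum(dim=1)  # (tokens, experts), 0/1
    f = dispatch.mean(dim=0) / top_k
    p = probs.mean(dim=0)
    load_balancing_loss = n_experts * torch.sum(f * p)

    return topk_idx, topk_probs, load_balancing_loss
```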

Tokenizer: Qwen3 models use Qwen's tokenizer (Bai et al., 2023), which implements byte-level byte-pair encoding (BBPE). It has a vocabulary size of 151,669. BBPE tokenizers are robust to unseen characters and can tokenize any input string.
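
Assuming the released checkpoints follow the Qwen/Qwen3-* naming on the Hugging Face Hub, the tokenizer can be inspected with the standard transformers API (the model id below is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # illustrative model id
print(tok.vocab_size)  # BBPE vocabulary (the report cites 151,669 tokens including specials)
print(tok.tokenize("Qwen3 supports 119 languages 你好 مرحبا"))
print(tok("Hello, world!")["input_ids"])
```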

4.2.2. Pre-training

Qwen3's pre-training involves a large, diverse dataset and a three-stage process.

4.2.2.1. Pre-training Data

The pre-training dataset for Qwen3 is significantly expanded compared to Qwen2.5, featuring:

  • Scale: 36 trillion tokens (twice as many as Qwen2.5).
  • Languages: 119 languages and dialects (three times more than Qwen2.5's 29).
  • Diversity: High-quality content across various domains including coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data.

Data Expansion Methods:

  • Multi-modal Extraction: Qwen2.5-VL (Bai et al., 2025), a vision-language model, is used to extract text from a large volume of PDF-like documents. The extracted text is then refined using Qwen2.5 to improve its quality, yielding trillions of additional high-quality tokens.
  • Synthetic Data Generation: Domain-specific Qwen2.5 models are employed to synthesize trillions of text tokens in various formats:
    • Qwen2.5 (Yang et al., 2024b) for general text.
    • Qwen2.5-Math (Yang et al., 2024c) for mathematical content (textbooks, Q&A).
    • Qwen2.5-Coder (Hui et al., 2024) for code-related data (code snippets).
  • Multilingual Data Annotation System: A sophisticated system was developed to annotate over 30 trillion tokens across multiple dimensions (educational value, fields, domains, safety). This supports finer-grained data filtering and combination. The method optimizes data mixture at the instance-level through ablation experiments on small proxy models with these fine-grained labels, a more granular approach than previous efforts that typically optimize at the corpus or domain level.

4.2.2.2. Pre-training Stages

The pre-training process is divided into three distinct stages:

  1. General Stage (S1):

    • Objective: Build a strong foundation of general knowledge and language proficiency.
    • Data: Over 30 trillion tokens.
    • Context Length: 4,096 tokens.
    • Coverage: 119 languages and dialects.
  2. Reasoning Stage (S2):

    • Objective: Enhance the model's reasoning abilities.
    • Data: Approximately 5 trillion higher-quality tokens. The pre-training corpus is optimized by increasing the proportion of STEM, coding, reasoning, and synthetic data.
    • Context Length: Not explicitly stated; the source text reads "a sequence length of Wa", which appears to be a typo. Given that the subsequent long-context stage extends from 4,096 tokens, it most likely refers to the same standard 4,096-token sequence length as S1.
  3. Long Context Stage:

    • Objective: Extend the maximum context length of Qwen3 models.
    • Data: Hundreds of billions of tokens from a high-quality long context corpus.
    • Context Length Extension: Increases from 4,096 to 32,768 tokens.
    • Corpus Composition: 75% of text between 16,384 and 32,768 tokens in length, and 25% of text between 4,096 and 16,384 tokens in length.
    • Techniques for Long Context:
      • Adaptive Base Frequency (ABF) with RoPE: The base frequency of RoPE is increased from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). This modifies the way positional information is embedded to support longer sequences.
      • YARN (Peng et al., 2023): Yet Another RoPE extensioN, a method to effectively extend the context length of RoPE-based models.
      • Dual Chunk Attention (DCA, An et al., 2024): A technique to achieve a four-fold increase in sequence length capacity during inference by processing long contexts in a chunk-wise manner.
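
The ABF adjustment above amounts to recomputing RoPE's inverse frequencies with a larger base, which slows the rotation of low-frequency dimensions so that longer sequences remain distinguishable. A minimal sketch (the head dimension is illustrative):

```python
import torch

def rope_inv_freq(head_dim=128, base=10_000.0):
    # Inverse frequencies used to rotate query/key dimension pairs in RoPE.
    return base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

short_ctx = rope_inv_freq(base=10_000.0)     # base used in the earlier pre-training stages
long_ctx = rope_inv_freq(base=1_000_000.0)   # ABF: raised base for the long-context stage
print(short_ctx[:4], long_ctx[:4])
```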

Scaling Laws for Hyperparameters: Similar to Qwen2.5, scaling laws are developed to predict optimal hyperparameters (e.g., learning rate scheduler, batch size) for each stage. Extensive experiments study the relationship between model architecture, training data, training stage, and optimal training hyperparameters to set the predicted values for each dense or MoE model.

4.2.3. Post-training

The post-training pipeline for Qwen3 is designed with two main objectives:

  1. Thinking Control: To integrate thinking and non-thinking modes and allow fine-grained control over reasoning depth via a thinking budget.

  2. Strong-to-Weak Distillation: To optimize the post-training process for lightweight models by leveraging knowledge from larger flagship models.

    The flagship models follow a sophisticated four-stage post-training process, while smaller models utilize strong-to-weak distillation.

The post-training pipeline is visually represented in Figure 1 (from the original paper).

Figure 1: Post-training pipeline of the Qwen3 series models. The figure illustrates the training stages from base models to the flagship and lightweight models, including Long-CoT cold start, reasoning RL, thinking mode fusion, and general RL.

4.2.3.1. Long-CoT Cold Start

This is the first stage of post-training for flagship models, focusing on developing foundational thinking abilities.

  • Dataset Curation: A comprehensive dataset covering math, code, logical reasoning, and general STEM problems is created. Each problem is paired with verified reference answers or code-based test cases.
  • Two-Phase Filtering:
    1. Query Filtering: Qwen2.5-72B-Instruct is used to:
      • Remove non-verifiable queries (e.g., multiple sub-questions, general text generation).
      • Exclude queries that Qwen2.5-72B-Instruct can answer without CoT reasoning (to ensure only complex problems requiring deeper reasoning are included).
      • Annotate each query's domain to maintain balanced representation.
    2. Response Filtering: For queries with positive Pass@N (meaning a correct solution was found within N attempts), QwQ-32B (Qwen Team, 2025) generates N candidate responses. These responses undergo stringent filtering to remove:
      • Incorrect final answers.
      • Substantial repetition.
      • Guesswork without adequate reasoning.
      • Inconsistencies between thinking and summary content.
      • Inappropriate language mixing or stylistic shifts.
      • Responses overly similar to potential validation items.
  • Objective: Instill foundational reasoning patterns without over-emphasizing immediate reasoning performance, preparing the model for subsequent Reinforcement Learning (RL). This phase aims to minimize training samples and steps.

4.2.3.2. Reasoning RL

The second stage focuses on further enhancing reasoning through Reinforcement Learning.

  • Query-Verifier Pairs: A dataset of 3,995 query-verifier pairs is collected, satisfying four criteria:
    1. Not used in the cold-start phase.
    2. Learnable for the cold-start model.
    3. As challenging as possible.
    4. Covering a broad range of sub-domains.
  • RL Algorithm: GRPO (Shao et al., 2024) is employed to update model parameters. GRPO (Group Relative Policy Optimization) is a policy gradient method that estimates advantages from groups of sampled responses per query, enabling stable and efficient reinforcement learning without a separate value model.
  • Training Benefits: Large batch sizes and a high number of rollouts per query, along with off-policy training, are found beneficial for improving sample efficiency.
  • Exploration-Exploitation Balance: Strategies to balance exploration and exploitation are addressed by controlling the model's entropy to increase steadily or remain stable. This helps maintain stable training and prevents premature convergence.
  • Results: Consistent improvements in both training reward and validation performance are observed without manual hyperparameter intervention. For example, AIME'24 score for Qwen3-235B-A22B increased from 70.1 to 85.1 over 170 RL training steps.
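
The core of GRPO is a group-relative advantage: several responses are sampled per query and each response's reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation (the clipped policy update itself is omitted, and names are illustrative):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: (num_queries, rollouts_per_query) verifier scores for the sampled responses
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # each rollout's advantage relative to its group

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # query 1: two of four rollouts pass the verifier
                        [0.0, 0.0, 1.0, 0.0]])  # query 2: one of four rollouts passes
print(group_relative_advantages(rewards))
```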

4.2.3.3. Thinking Mode Fusion

The third stage integrates non-thinking capabilities into the thinking-capable model developed in previous stages.

  • Objective: Allow developers to manage reasoning behaviors and reduce the cost/complexity of deploying separate models.

  • Approach: Continual supervised fine-tuning (SFT) is performed on the Reasoning RL model.

  • SFT Data Construction:

    • Thinking data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself, ensuring performance is not compromised.
    • Non-thinking data is curated to cover diverse tasks, including conceptual knowledge, summarization, and role-playing. Automatically generated checklists are used to assess response quality.
    • The proportion of translation tasks is increased to enhance performance on low-resource languages.
  • Chat Template Design: A chat template is designed for dynamic mode switching, as shown in Table 9.

    • /think and /no_think flags are introduced in user queries or system messages.

    • For non-thinking mode samples, an empty thinking block (<think></think>) is retained in the assistant's response to ensure internal format consistency and allow developers to explicitly prevent thinking.

    • By default, the model operates in thinking mode; thus, some thinking mode samples without /think flags are included in training.

    • For multi-turn dialogs, multiple /think and /no_think flags can be randomly inserted, with the model adhering to the last encountered flag.

      The following are the results from Table 9 of the original paper:

      Thinking Mode:
      <|im_start|>user
      {query}/think<|im_end|>
      <|im_start|>assistant
      <think>
      {thinking_content}
      </think>
      {response}<|im_end|>

      Non-Thinking Mode:
      <|im_start|>user
      {query}/no_think<|im_end|>
      <|im_start|>assistant
      <think>
      </think>
      {response}<|im_end|>
  • Thinking Budget Mechanism: The model's ability to handle intermediate cases (generating responses based on incomplete thinking) emerges naturally from Thinking Mode Fusion. This forms the basis for budget control.

    • If the model's thinking length reaches a user-defined threshold, the process is manually halted.
    • A stop-thinking instruction is inserted: "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n".
    • The model then generates a final response based on its accumulated reasoning up to that point.
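
A minimal sketch of how such a budget could be enforced at decode time is shown below. The helpers next_token and count_tokens are hypothetical stand-ins for an incremental decoder and a tokenizer, not part of the Qwen3 release; the stop-thinking string is the one quoted above.

```python
STOP_THINKING = ("Considering the limited time by the user, I have to give the "
                 "solution based on the thinking directly now.\n</think>.\n\n")

def generate_with_budget(prompt, next_token, count_tokens,
                         thinking_budget=1024, max_steps=8192):
    """Decode in thinking mode, force-closing the <think> block once the budget is hit."""
    text = prompt + "<think>\n"
    thinking_done = False
    for _ in range(max_steps):
        piece = next_token(text)           # hypothetical single-step decoder: str -> str
        text += piece
        if not thinking_done:
            if "</think>" in piece:
                thinking_done = True       # the model closed the thinking block on its own
            elif count_tokens(text) - count_tokens(prompt) >= thinking_budget:
                text += STOP_THINKING      # halt thinking with the stop-thinking instruction
                thinking_done = True       # the model now answers from its partial reasoning
        if piece.endswith("<|im_end|>"):
            break
    return text
```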

4.2.3.4. General RL

The final stage of post-training aims for broad enhancement of capabilities and stability across diverse scenarios.

  • Sophisticated Reward System: Covers over 20 distinct tasks, each with customized scoring criteria, targeting core capabilities:
    • Instruction Following: Ensures accurate interpretation and execution of user instructions (content, format, length, structured output).
    • Format Following: Adherence to specific formatting conventions, e.g., responding to the /think and /no_think flags and using the <think> and </think> tokens correctly.
    • Preference Alignment: For open-ended queries, improves helpfulness, engagement, and style for a natural user experience.
    • Agent Ability: Training the model to correctly invoke tools via designated interfaces. During RL rollout, the model performs multi-turn interaction cycles with real environment execution feedback to improve long-horizon decision-making.
    • Abilities for Specialized Scenarios: Tasks tailored to specific contexts, e.g., Retrieval-Augmented Generation (RAG) tasks incorporate reward signals to guide accurate and contextually appropriate responses, minimizing hallucination.
  • Three Types of Rewards:
    1. Rule-based Reward: Widely used in reasoning RL and for general tasks like instruction following and format adherence. Provides high precision in assessing correctness and prevents reward hacking.
    2. Model-based Reward with Reference Answer: A reference answer is provided for each query, and Qwen2.5-72B-Instruct scores the model's response against it. Offers flexibility for diverse tasks without strict formatting.
    3. Model-based Reward without Reference Answer: A reward model is trained using human preference data to assign scalar scores to responses. Handles a broader range of queries and enhances engagement and helpfulness.

4.2.3.5. Strong-to-Weak Distillation

This pipeline is specifically for optimizing lightweight models (5 dense models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B, and one MoE model: Qwen3-30B-A3B).

  • Objective: Enhance performance and impart robust mode-switching capabilities efficiently, requiring significantly fewer computational resources than the full four-stage process.
  • Two Primary Phases:
    1. Off-policy Distillation:
      • Combines outputs from teacher models (generated with both /think and /no_think modes) for response distillation.
      • Helps student models develop basic reasoning skills and mode-switching ability, establishing a foundation for the next phase.
    2. On-policy Distillation:
      • Student model generates on-policy sequences for fine-tuning.

      • Prompts are sampled, and the student model produces responses in either /think or /no_think mode.

      • The student model is fine-tuned by aligning its logits with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B). This alignment is typically achieved by minimizing the Kullback-Leibler (KL) divergence between the student's and teacher's soft probabilities (logits converted to probabilities via softmax): \mathcal{L}_{distill} = D_{KL}(P_T \| P_S) = \sum_{x \in \mathcal{V}} P_T(x) \log\left(\frac{P_T(x)}{P_S(x)}\right) Where:

      • P_T(x) is the probability of token x predicted by the teacher model.

      • P_S(x) is the probability of token x predicted by the student model.

      • \mathcal{V} is the vocabulary.

      • D_{KL} denotes the Kullback-Leibler divergence.

        This process enables better Pass@1 scores (immediate performance) and improved exploration ability (Pass@64), while requiring only 1/10 of the GPU hours compared to the four-stage training.
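
A minimal PyTorch sketch of this distillation loss, computed per token over the vocabulary; tensor shapes and names are illustrative, and the real pipeline (on-policy sampling, mode flags, sequence packing) is omitted.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    # logits: (batch, seq_len, vocab_size); loss = KL(P_teacher || P_student)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Usage: the student generates on-policy responses, the teacher re-scores the same tokens,
# and the student is updated to match the teacher's token distribution.
student_logits = torch.randn(2, 16, 151_669, requires_grad=True)
teacher_logits = torch.randn(2, 16, 151_669)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
```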

5. Experimental Setup

The experimental setup for Qwen3 involved comprehensive evaluations of both pre-trained (base) and post-trained (instruction-tuned) models across a wide array of benchmarks.

5.1. Datasets

The datasets used in the experiments cover general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.

For Pre-trained Base Models (Section 3.3):

  • General Tasks:
    • MMLU (Hendrycks et al., 2021a): Massive Multitask Language Understanding (5-shot). A benchmark covering 57 academic subjects.
    • MMLU-Pro (Wang et al., 2024): An extended version of MMLU (5-shot, CoT).
    • MMLU-redux (Gema et al., 2024): Another variant of MMLU (5-shot).
    • BBH (Suzgun et al., 2023): BIG-Bench Hard (3-shot, CoT). A subset of BIG-Bench tasks designed to be challenging for large language models, often requiring complex reasoning.
    • SuperGPQA (Du et al., 2025): Super Graduate-level Google-proof Q&A (5-shot, CoT). A very challenging question-answering benchmark.
  • Math & STEM Tasks:
    • GPQA (Rein et al., 2023): Graduate-level Google-proof Q&A (5-shot, CoT). Similar to SuperGPQA, requiring deep scientific and mathematical knowledge.
    • GSM8K (Cobbe et al., 2021): Grade School Math 8K (4-shot, CoT). A dataset of elementary school math word problems.
    • MATH (Hendrycks et al., 2021b): A dataset of challenging mathematical problems from high school competitions (4-shot, CoT).
  • Coding Tasks:
    • EvalPlus (Liu et al., 2023a): (0-shot) An evaluation suite for code generation, averaging performance on HumanEval, MBPP, HumanEval+, and MBPP+.
    • MultiPL-E (Cassano et al., 2023): (0-shot) A polyglot benchmark for code generation across multiple programming languages (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript).
    • MBPP-3shot (Austin et al., 2021): Mostly Basic Python Problems (3-shot). A dataset of Python programming problems.
    • CRUX-O of CRUXEval (Gu et al., 2024): (1-shot) A benchmark for code reasoning, understanding, and execution.
  • Multilingual Tasks:
    • MGSM (Shi et al., 2023): Multilingual Grade School Math (8-shot, CoT). Multilingual math word problems.
    • MMMLU (OpenAI, 2024): Multilingual Massive Multitask Language Understanding (5-shot). A multilingual version of MMLU.
    • INCLUDE (Romanou et al., 2024): Evaluating multilingual language understanding with regional knowledge (5-shot).

For Post-trained Models (Section 4.6):

  • General Tasks:
    • MMLU-Redux (Gema et al., 2024).
    • GPQA-Diamond (Rein et al., 2023): A subset of GPQA with very challenging questions. For this benchmark, 10 samples are taken for each query, and the averaged accuracy is reported.
    • C-Eval (Huang et al., 2023): A multilevel multi-discipline Chinese evaluation suite for foundation models.
    • LiveBench (2024-11-25) (White et al., 2024): A challenging, contamination-free LLM benchmark.
  • Alignment Tasks:
    • IFEval (Zhou et al., 2023): Instruction-following evaluation for large language models, reporting strict prompt accuracy.
    • Arena-Hard (Li et al., 2024): A benchmark for evaluating human preferences, derived from crowdsourced data.
    • AlignBench v1.1 (Liu et al., 2023b): Benchmarking Chinese alignment of large language models.
    • Creative Writing V3 (Paech, 2024): Evaluates creative writing proficiency.
    • WritingBench (Wu et al., 2025): A comprehensive benchmark for generative writing.
  • Math & Text Reasoning:
    • MATH-500 (Lightman et al., 2023): A high-level math benchmark.
    • AIME'24 and AIME'25 (AIME, 2025): Problems from the American Invitational Mathematics Examination. For each question, 64 samples are taken, and the average accuracy is reported.
    • ZebraLogic (Lin et al., 2025): A benchmark for logical reasoning, particularly Zebra Puzzles.
    • AutoLogi (Zu et al., 2025): Automated generation of logic puzzles.
  • Agent & Coding:
    • BFCL v3 (Yan et al., 2024): Berkeley Function Calling Leaderboard. Models are evaluated using the FC format, and YaRN is used to extend the context length to 64K for the Multi-Turn evaluation.
    • LiveCodeBench (v5, 2024.10-2025.02) (Jain et al., 2024): A holistic and contamination-free evaluation for code generation. For non-thinking mode, the official prompt is used; for thinking mode, prompt templates are adjusted to allow more free thinking.
    • CodeForces Ratings from CodeElo (Quan et al., 2025): Calculates Elo ratings to compare model performance against competitive programming experts. Each problem is solved by generating up to eight independent reasoning attempts.
  • Multilingual Tasks:
    • Multi-IF (He et al., 2024): Multilingual instruction following (8 key languages).
    • INCLUDE (Romanou et al., 2024): Regional knowledge (44 languages). Only 10% of original data sampled for efficiency.
    • MMMLU (OpenAI, 2024): General knowledge (14 languages, excluding unoptimized Yoruba). Only 10% of original data sampled for efficiency.
    • MT-AIME2024 (Son et al., 2025): Multilingual AIME (55 languages).
    • PolyMath (Wang et al., 2025): Multilingual mathematical reasoning (18 languages).
    • MLogiQA (Zhang et al., 2024): Multilingual logical reasoning (10 languages).
    • Belebele (Bandarkar et al., 2023): A benchmark for natural language understanding in 122 language variants. Evaluated on 80 supported languages.

In-house Benchmarks (for Ablation Studies):

  • CounterFactQA: Contains counterfactual questions where the model needs to identify non-factual nature and avoid hallucination.

  • LengthCtrl: Creative writing tasks with length requirements; score based on difference between generated and target length.

  • ThinkFollow: Multi-turn dialogues with random /think and /no_think flags to test mode switching.

  • ToolUse: Evaluates tool-calling proficiency (intent, format, parameter accuracy) with multi-turn interactions and real environment feedback.

    The following are the results from Table 10 of the original paper:

    Benchmark # Langs Languages
    Multi-IF 8 en, es, fr, hi, it, pt, ru, zh
    INCLUDE 44 ar, az, be, bg, bn, de, el, es, et, eu, fa, fi, fr, he, hi, hr, hu, hy, id, it, ja, ka, kk, ko, lt, mk, ml, ms, ne, nl, pl, pt, ru, sq, sr, ta, te, tl, tr, uk, ur, uz, vi, zh
    MMMLU 14 ar, bn, de, en, es, fr, hi, id, it, ja, ko, pt, sw, zh
    MT-AIME2024 55 af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-Hans, zh-Hant
    PolyMath 18 ar, bn, de, en, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi, zh
    MLogiQA 10 ar, en, es, fr, ja, ko, pt, th, vi, zh

The following are the results from Table 36 of the original paper:

Language family # Langs Language code (ISO 639-3 ISO 15924)
Indo-European 40 por_Latn, deu_Latn, tgk_Cyrl, ces_Latn, nob_Latn, dan_Latn, snd_Arab, spa_Latn, isl_Latn, slv_Latn, eng_Latn, ory_Orya, hrv_Latn, ell_Grek, ukr_Cyrl, pan_Guru, srp_Cyrl, npi_Deva, mkd_Cyrl, guj_Gujr, nld_Latn, swe_Latn, hin_Deva, rus_Cyrl, asm_Beng, cat_Latn, als_Latn, sin_Sinh, urd_Arab, mar_Deva, lit_Latn, slk_Latn, ita_Latn, pol_Latn, bul_Cyrl, afr_Latn, ron_Latn, fra_Latn, ben_Beng, hye_Armn
Sino-Tibetan 3 zho_Hans, mya_Mymr, zho_Hant
Afro-Asiatic 8 heb_Hebr, apc_Arab, acm_Arab, ary_Arab, ars_Arab, arb_Arab, mlt_Latn, arz_Arab
Austronesian 7 ilo_Latn, ceb_Latn, tgl_Latn, sun_Latn, jav_Latn, war_Latn, ind_Latn
Dravidian 4 mal_Mlym, kan_Knda, tel_Telu, tam_Taml
Turkic 4 kaz_Cyrl, azj_Latn, tur_Latn, uzn_Latn
Tai-Kadai 2 tha_Thai, lao_Laoo
Uralic 3 fin_Latn, hun_Latn, est_Latn
Austroasiatic 2 vie_Latn, khm_Khmr
Other 7 eus_Latn, kor_Hang, hat_Latn, swh_Latn, kea_Latn, jpn_Jpan, kat_Geor

5.2. Evaluation Metrics

For every evaluation metric, the following explanations are provided:

  • Accuracy (Acc):

    1. Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a common metric for classification tasks.
    2. Mathematical Formula: \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}}
    3. Symbol Explanation: Number of Correct Predictions refers to the count of instances where the model's output matches the ground truth. Total Number of Predictions is the total count of instances evaluated.
  • Pass@k:

    1. Conceptual Definition: A metric used primarily in code generation benchmarks. It evaluates the probability that at least one out of k generated solutions for a problem passes all given test cases. It accounts for the stochastic nature of LLM generation.
    2. Mathematical Formula: The unbiased estimator averages over N problems; if C_i candidates are generated for problem i and p_i of them pass all tests: \mathrm{Pass@k} = \frac{1}{N} \sum_{i=1}^N \left[ 1 - \frac{\binom{C_i - p_i}{k}}{\binom{C_i}{k}} \right] Where:
      • N is the number of problems.
      • C_i is the number of code candidates generated for problem i.
      • p_i is the number of passing candidates for problem i.
      • \binom{n}{r} is the binomial coefficient, representing "n choose r".
    3. Symbol Explanation: k is the number of sampled solutions considered per problem. The term 1 - \binom{C_i - p_i}{k} / \binom{C_i}{k} is the probability that at least one of k solutions drawn without replacement from the C_i attempts (of which p_i are correct) passes. A code sketch of this estimator follows this list.
  • Elo Rating (CodeForces):

    1. Conceptual Definition: A method for calculating the relative skill levels of players (or in this case, AI models) in zero-sum games, such as competitive programming. It's a dynamic rating system where a player's rating changes based on the outcome of matches against other players. A higher Elo rating indicates a stronger performance.
    2. Mathematical Formula: The change in Elo rating (\Delta R) after a match is calculated as: \Delta R = K \cdot (S - E) Where:
      • K is the K-factor, a constant that determines the maximum possible adjustment for a single game.
      • S is the actual score (1 for win, 0.5 for draw, 0 for loss).
      • E is the expected score, calculated from the opponent's rating R_O and the player's own rating R_P: E = \frac{1}{1 + 10^{(R_O - R_P)/400}}
    3. Symbol Explanation: K is the sensitivity of the rating change. S is the actual outcome of the competition (e.g., passing a coding problem). E is the expected outcome based on the current ratings. R_O and R_P are the Elo ratings of the opponent and the player, respectively.
  • Strict Prompt Accuracy (IFEval):

    1. Conceptual Definition: Measures how precisely a model adheres to the explicit instructions given in a prompt. It's often a binary metric (pass/fail) for each instruction, focusing on exact compliance rather than general quality.
    2. Mathematical Formula: Not a single universal formula, but rather the percentage of prompts for which all instructions were followed correctly: \mathrm{Strict~Prompt~Accuracy} = \frac{\mathrm{Number~of~Prompts~with~All~Instructions~Followed}}{\mathrm{Total~Number~of~Prompts}}
    3. Symbol Explanation: Number of Prompts with All Instructions Followed is the count of prompts where the model's response correctly fulfilled every specified instruction. Total Number of Prompts is the number of instruction sets evaluated.
  • Average (Avg.):

    1. Conceptual Definition: The arithmetic mean of scores across multiple benchmarks or tasks, used to provide a single aggregate performance indicator.
    2. Mathematical Formula: \mathrm{Average} = \frac{1}{N} \sum_{i=1}^{N} S_i
    3. Symbol Explanation: N is the number of individual scores or benchmarks being averaged. S_i is the score for the i-th benchmark.
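
A small sketch of the Pass@k estimator above, using the numerically stable product form commonly used to evaluate it:

```python
import numpy as np

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples drawn from num_samples candidates passes."""
    if num_samples - num_correct < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(num_samples - num_correct + 1, num_samples + 1))

# Example: 8 candidates per problem, 2 of them correct; report Pass@1 and Pass@4.
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 4))
```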

5.3. Baselines

The Qwen3 models are compared against a comprehensive set of baselines, including both open-source and proprietary models, to demonstrate their competitive standing across various scales and capabilities.

Pre-trained Base Model Baselines:

  • Qwen2.5 Base Models: Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-72B-Base, Qwen2.5-Plus-Base (MoE). These are direct predecessors, showing generational improvements.
  • DeepSeek-V3 Base (Liu et al., 2024a): A large open-source MoE model, representing a strong competitor in terms of scale and architecture.
  • Gemma-3 Base Models (Team et al., 2025): Gemma-3-1B, Gemma-3-4B, Gemma-3-12B, Gemma-3-27B. Google's open-source series.
  • Llama-3 Base Models (Dubey et al., 2024): Llama-3-8B. Meta's popular open-source series.
  • Llama-4 Base Models (Meta-AI, 2025): Llama-4-Maverick, Llama-4-Scout. Next-generation open-source models from Meta, often larger and more capable.

Post-trained (Instruction-tuned) Model Baselines:

  • Proprietary Models:
    • OpenAI-o1 (OpenAI, 2024): A reasoning-focused model from OpenAI.
    • GPT-4o-2024-11-20 (OpenAI, 2024): OpenAI's flagship multimodal model.
    • Gemini2.5-Pro (DeepMind, 2025): Google DeepMind's powerful model.
    • Grok-3-Beta (Think) (xAI, 2025): A reasoning-focused model from xAI.
    • OpenAI-o3-mini (medium) (OpenAI, 2025): A smaller, reasoning-focused model from OpenAI.
    • GPT-4o-mini-2024-07-18: A smaller version of GPT-4o.
  • Open-Source and Qwen Predecessors:
    • DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model.
    • DeepSeek-V3 (Liu et al., 2024a).
    • Qwen2.5-72B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-1.5B-Instruct. Instruction-tuned versions of Qwen2.5 models.
    • QwQ-32B (Qwen Team, 2025): Qwen's previous strongest reasoning model.
    • LLaMA-4-Maverick (Meta-AI, 2025), LLaMA-4-Scout (Meta-AI, 2025).
    • LLaMA-3.1-8B-Instruct (Dubey et al., 2024).
    • Gemma-3-27B-IT, Gemma-3-12B-IT, Gemma-3-4B-IT, Gemma-3-1B-IT. Instruction-tuned versions of Gemma-3 models.
    • Phi-4 (Abdin et al., 2024), Phi-4-mini.
    • DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Llama-8B: Distilled versions of DeepSeek-R1, used as baselines for strong-to-weak distillation.

5.4. Hyperparameters

The following hyperparameters are used for evaluation:

  • For Thinking Mode (Qwen3 models):
    • Sampling Temperature: 0.6
    • Top-p: 0.95
    • Top-k: 20
    • Presence Penalty: 1.5 (specifically for Creative Writing v3 and WritingBench to encourage diverse content).
  • For Non-Thinking Mode (Qwen3 models):
    • Sampling Temperature: 0.7
    • Top-p: 0.8
    • Top-k: 20
    • Presence Penalty: 1.5
  • Maximum Output Length: 32,768 tokens, except for AIME'24 and AIME'25 where it is extended to 38,912 tokens to provide sufficient thinking space for complex math problems.
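
These settings map directly onto standard decoding parameters. Below is a minimal sketch of how they could be passed to an inference engine, here using vLLM's SamplingParams; the serving stack and the checkpoint name are illustrative assumptions rather than something the report prescribes.

```python
from vllm import LLM, SamplingParams  # assumes a vLLM-based serving setup

# Thinking-mode decoding settings reported for the Qwen3 evaluations.
thinking_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=20,
    max_tokens=32768,          # extended to 38912 for AIME'24/'25 per the report
    presence_penalty=1.5,      # reported as used only for the creative-writing benchmarks
)

# Non-thinking-mode decoding settings.
non_thinking_params = SamplingParams(
    temperature=0.7, top_p=0.8, top_k=20,
    presence_penalty=1.5, max_tokens=32768,
)

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint name
outputs = llm.generate(["Explain why the sum of two even numbers is even."],
                       non_thinking_params)
print(outputs[0].outputs[0].text)
```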

6. Results & Analysis

The experimental results are presented for both pre-trained (base) and post-trained (instruction-tuned) models, covering a wide range of benchmarks and comparing Qwen3 against various state-of-the-art baselines.

6.1. Pre-training Evaluation

The evaluation of base models focuses on general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.

Summary of Evaluation Results for Qwen3 Base Models:

  1. Flagship Model Performance: Qwen3-235B-A22B-Base outperforms most previously open-source SOTA dense and MoE base models (e.g., DeepSeek-V3 Base, Llama-4-Maverick Base, Qwen2.5-72B-Base) across most tasks, often with significantly fewer total or activated parameters.
  2. MoE Efficiency: Qwen3 MoE base models (Qwen3-30B-A3B-Base, Qwen3-235B-A22B-Base) achieve performance similar to Qwen3 dense base models with only 1/5 activated parameters, demonstrating high efficiency. They also outperform Qwen2.5 MoE models with fewer activated parameters. Notably, a Qwen3 MoE model can achieve comparable performance to a Qwen2.5 dense base model with 1/10 of the activated parameters, indicating significant inference and training cost advantages.
  3. Dense Model Improvements: Qwen3 dense base models show comparable performance to Qwen2.5 base models at higher parameter scales. Specifically, smaller Qwen3 dense models (1.7B, 4B, 8B, 14B, 32B) often surpass or match larger Qwen2.5 counterparts (3B, 7B, 14B, 32B, 72B), especially in STEM, coding, and reasoning benchmarks.

6.1.1. Qwen3-235B-A22B-Base

The following are the results from Table 3 of the original paper:

Qwen2.5-72B Base Qwen2.5-Plus Base Llama-4-Maverick Base DeepSeek-V3 Base Qwen3-235B-A22B Base
Architecture Dense MoE MoE MoE MoE
# Total Params 72B 271B 402B 671B 235B
# Activated Params 72B 37B 17B 37B 22B
General Tasks
MMLU 86.06 85.02 85.16 87.19 87.81
MMLU-Redux 83.91 82.69 84.05 86.14 87.40
MMLU-Pro 58.07 63.52 63.91 59.84 68.18
SuperGPQA 36.20 37.18 40.85 41.53 44.06
BBH 86.30 85.60 83.62 86.22 88.87
Math & STEM Tasks
GPQA 45.88 41.92 43.94 41.92 47.47
GSM8K 91.50 91.89 87.72 87.57 94.39
MATH 62.12 62.78 63.32 62.62 71.84
Coding Tasks
EvalPlus 65.93 61.43 68.38 63.75 77.60
MultiPL-E 58.70 62.16 57.28 62.26 65.94
MBPP 76.00 74.60 75.40 74.20 81.40
CRUX-O 66.20 68.50 77.00 76.60 79.00
Multilingual Tasks
MGSM 82.40 82.21 79.69 82.68 83.53
MMMLU 84.40 83.49 83.09 85.88 86.70
INCLUDE 69.05 66.97 73.47 75.17 73.46

Analysis of Qwen3-235B-A22B-Base:

  • Overall Dominance: Qwen3-235B-A22B-Base achieves the highest scores in most benchmarks, significantly outperforming competitors.
  • Vs. Llama-4-Maverick: Despite Llama-4-Maverick having roughly twice the total parameters, Qwen3-235B-A22B-Base performs better on most benchmarks, indicating superior architectural or training efficiency.
  • Vs. DeepSeek-V3: Qwen3-235B-A22B-Base outperforms DeepSeek-V3-Base on 14 out of 15 benchmarks with only about 1/3 of its total parameters and 2/3 of its activated parameters, demonstrating impressive power and cost-effectiveness.
  • Vs. Qwen2.5-Plus (MoE): Qwen3-235B-A22B-Base significantly outperforms its previous MoE counterpart, Qwen2.5-Plus, with fewer parameters and activated parameters, highlighting advancements in pre-training data, strategy, and architecture.
  • Vs. Qwen2.5-72B (Dense): The Qwen3 MoE flagship model surpasses the Qwen2.5-72B-Base (dense) in all benchmarks while using less than 1/3 of its activated parameters. This translates to much cheaper inference and training costs per trillion tokens.

6.1.2. Qwen3-32B-Base

The following are the results from Table 4 of the original paper:

Qwen2.5-32B Base Qwen2.5-72B Base Gemma-3-27B Base Llama-4-Scout Base Qwen3-32B Base
Architecture Dense Dense Dense MoE Dense
# Total Params 32B 72B 27B 109B 32B
# Activated Params 32B 72B 27B 17B 32B
General Tasks
MMLU 83.32 86.06 78.69 78.27 83.61
MMLU-Redux 81.97 83.91 76.53 71.09 83.41
MMLU-Pro 55.10 58.07 52.88 56.13 65.54
SuperGPQA 33.55 36.20 29.87 26.51 39.78
BBH 84.48 86.30 79.95 82.40 87.38
Math & STEM Tasks
GPQA 47.97 45.88 26.26 40.40 49.49
GSM8K 92.87 91.50 81.20 85.37 93.40
MATH 57.70 62.12 51.78 51.66 61.62
Coding Tasks
EvalPlus 66.25 65.93 55.78 59.90 72.05
MultiPL-E 58.30 58.70 45.03 47.38 67.06
MBPP 73.60 76.00 68.40 68.60 78.20
CRUX-O 67.80 66.20 60.00 61.90 72.50
Multilingual Tasks
MGSM 78.12 82.40 73.74 79.93 83.06
MMMLU 82.40 84.40 77.62 74.83 83.83
INCLUDE 64.35 69.05 68.94 68.09 67.87

Analysis of Qwen3-32B-Base:

  • Vs. Similar-sized Models: Qwen3-32B-Base outperforms Qwen2.5-32B-Base and Gemma-3-27B-Base on most benchmarks, particularly showing significant leads in MMLU-Pro, SuperGPQA, and coding benchmarks.
  • Vs. Larger Predecessor: Surprisingly, Qwen3-32B-Base, with less than half the parameters of Qwen2.5-72B-Base, outperforms it in 10 out of 15 benchmarks, especially in coding, mathematics, and reasoning. This indicates substantial improvements in efficiency and capability for its size.
  • Vs. MoE Baseline: Qwen3-32B-Base significantly outperforms Llama-4-Scout-Base on all 15 benchmarks, even though Llama-4-Scout has more than three times the total parameters (109B vs. 32B). Note, however, that Llama-4-Scout activates only 17B parameters per token, roughly half of Qwen3-32B-Base's 32B.

6.1.3. Qwen3-14B-Base & Qwen3-30B-A3B-Base

The following are the results from Table 5 of the original paper:

Gemma-3-12B Base Qwen2.5-14B Base Qwen2.5-32B Base Qwen2.5-Turbo Base Qwen3-14B Base Qwen3-30B-A3B Base
Architecture Dense Dense Dense MoE Dense MoE
# Total Params 12B 14B 32B 42B 14B 30B
# Activated Params 12B 14B 32B 6B 14B 3B
General Tasks
MMLU 73.87 79.66 83.32 79.50 81.05 81.38
MMLU-Redux 70.70 76.64 81.97 77.11 79.88 81.17
MMLU-Pro 44.91 51.16 55.10 55.60 61.03 61.49
SuperGPQA 24.61 30.68 33.55 31.19 34.27 35.72
BBH 74.28 78.18 84.48 76.10 81.07 81.54
Math & STEM Tasks
GPQA 31.31 32.83 47.97 41.41 39.90 43.94
GSM8K 78.01 90.22 92.87 88.32 92.49 91.81
MATH 44.43 55.64 57.70 55.60 62.02 59.04
Coding Tasks
EvalPlus 52.65 60.70 66.25 61.23 72.23 71.45
MultiPL-E 43.03 54.79 58.30 53.24 61.69 66.53
MBPP 60.60 69.00 73.60 67.60 73.40 74.40
CRUX-O 52.00 61.10 67.80 60.20 68.60 67.20
Multilingual Tasks
MGSM 64.35 74.68 78.12 70.45 79.20 79.11
MMMLU 72.50 78.34 82.40 79.76 79.69 81.46
INCLUDE 63.34 60.26 64.35 59.25 64.55 67.00

Analysis of Qwen3-14B-Base & Qwen3-30B-A3B-Base:

  • Qwen3-14B-Base Superiority: Qwen3-14B-Base significantly outperforms Gemma-3-12B-Base and Qwen2.5-14B-Base on all 15 benchmarks. It also achieves very competitive results against the much larger Qwen2.5-32B-Base with less than half the parameters.
  • Qwen3-30B-A3B-Base Efficiency: Qwen3-30B-A3B-Base, an MoE model, with only 3 billion activated parameters (1/5 of Qwen2.5-14B-Base's activated parameters), significantly outperforms Qwen2.5-14B-Base on all tasks. It also achieves comparable performance to the larger Qwen3-14B-Base and Qwen2.5-32B-Base, highlighting substantial advantages in inference and training costs due to its efficient MoE architecture.

6.1.4. Qwen3-8B / 4B / 1.7B / 0.6B-Base

The following are the results from Table 6 of the original paper:

Llama-3-8B Base Qwen2.5-7B Base Qwen2.5-14B Base Qwen3-8B Base
Architecture Dense Dense Dense Dense
# Total Params 8B 7B 14B 8B
# Activated Params 8B 7B 14B 8B
General Tasks
MMLU 66.60 74.16 79.66 76.89
MMLU-Redux 61.59 71.06 76.64 76.17
MMLU-Pro 35.36 45.00 51.16 56.73
SuperGPQA 20.54 26.34 30.68 31.64
BBH 57.70 70.40 78.18 78.40
Math & STEM Tasks
GPQA 25.80 36.36 32.83 44.44
GSM8K 55.30 85.36 90.22 89.84
MATH 20.50 49.80 55.64 60.80
Coding Tasks
EvalPlus 44.13 62.18 60.70 67.65
MultiPL-E 31.45 50.73 54.79 58.75
MBPP 48.40 63.40 69.00 69.80
CRUX-O 36.80 48.50 61.10 62.00
Multilingual Tasks
MGSM 38.92 63.60 74.68 76.02
MMMLU 59.65 71.34 78.34 75.72
INCLUDE 44.94 53.98 60.26 59.40

The following are the results from Table 7 of the original paper:

Gemma-3-4B Base Qwen2.5-3B Base Qwen2.5-7B Base Qwen3-4B Base
Architecture Dense Dense Dense Dense
# Total Params 4B 3B 7B 4B
# Activated Params 4B 3B 7B 4B
General Tasks
MMLU 59.51 65.62 74.16 72.99
MMLU-Redux 56.91 63.68 71.06 72.79
MMLU-Pro 29.23 34.61 45.00 50.58
SuperGPQA 17.68 20.31 26.34 28.43
BBH 51.70 56.30 70.40 72.59
Math & STEM Tasks
GPQA 24.24 26.26 36.36 36.87
GSM8K 43.97 79.08 85.36 87.79
MATH 26.10 42.64 49.80 54.10
Coding Tasks
EvalPlus 43.23 46.28 62.18 63.53
MultiPL-E 28.06 39.65 50.73 53.13
MBPP 46.40 54.60 63.40 67.00
CRUX-O 34.00 36.50 48.50 55.00
Multilingual Tasks
MGSM 33.11 47.53 63.60 67.74
MMMLU 59.62 65.55 71.34 71.42
INCLUDE 49.06 45.90 53.98 56.29

The following are the results from Table 8 of the original paper:

Qwen2.5-0.5B Base Qwen3-0.6B Base Gemma-3-1B Base Qwen2.5-1.5B Base Qwen3-1.7B Base
Architecture Dense Dense Dense Dense Dense
# Total Params 0.5B 0.6B 1B 1.5B 1.7B
# Activated Params 0.5B 0.6B 1B 1.5B 1.7B
General Tasks
MMLU 47.50 52.81 26.26 60.90 62.63
MMLU-Redux 45.10 51.26 25.99 58.46 61.66
MMLU-Pro 15.69 24.74 9.72 28.53 36.76
SuperGPQA 11.30 15.03 7.19 17.64 20.92
BBH 20.30 41.47 28.13 45.10 54.47
Math & STEM Tasks
GPQA 24.75 26.77 24.75 24.24 28.28
GSM8K 41.62 59.59 2.20 68.54 75.44
MATH 19.48 32.44 3.66 35.00 43.50
Coding Tasks
EvalPlus 31.85 36.23 8.98 44.80 52.70
MultiPL-E 18.70 24.58 5.15 33.10 42.71
MBPP 29.80 36.60 9.20 43.60 55.40
CRUX-O 12.10 27.00 3.80 29.60 36.40
Multilingual Tasks
MGSM 12.07 30.99 1.74 32.82 50.71
MMMLU 31.53 50.16 26.57 60.27 63.27
INCLUDE 24.74 34.26 25.62 39.55 45.57

Analysis of Smaller Qwen3 Base Models:

  • Consistent Strong Performance: Qwen3-8B, 4B, 1.7B, and 0.6B-Base models consistently maintain strong performance across nearly all benchmarks relative to their size.
  • Outperforming Larger Qwen2.5 Models: Notably, Qwen3-8B, 4B, and 1.7B-Base models even outperform larger Qwen2.5-14B, 7B, and 3B-Base models, respectively, on over half of the benchmarks. This is particularly evident in STEM-related and coding benchmarks, reflecting a significant generational improvement.

6.2. Post-training Evaluation

The post-trained models are evaluated for their instruction-following, alignment, reasoning, agent, coding, and multilingual abilities under both thinking and non-thinking modes.

Summary of Evaluation Results for Finalized Qwen3 Models:

  1. Flagship SOTA: Qwen3-235B-A22B achieves state-of-the-art overall performance among open-source models in both thinking and non-thinking modes, surpassing strong baselines like DeepSeek-R1 and DeepSeek-V3. It also demonstrates strong competitiveness against closed-source leaders such as OpenAI-o1, Gemini2.5-Pro, and GPT-4o.
  2. Flagship Dense Model (32B) Excellence: Qwen3-32B outperforms the previous strongest reasoning model, QwQ-32B, on most benchmarks, setting a new SOTA for its size. It competes comparably with OpenAI-o3-mini (closed-source) and excels in non-thinking mode, surpassing Qwen2.5-72B-Instruct.
  3. Lightweight Model Success: Lightweight models (Qwen3-30B-A3B, Qwen3-14B, and smaller dense models) consistently show superior performance compared to open-source models with similar or larger parameter counts. This validates the effectiveness of the Strong-to-Weak Distillation approach.

6.2.1. Qwen3-235B-A22B

The following are the results from Table 11 of the original paper:

OpenAI-o1 DeepSeek-R1 Grok-3-Beta (Think) Gemini2.5-Pro Qwen3-235B-A22B
Architecture - MoE - - MoE
# Activated Params - 37B - - 22B
# Total Params - 671B - - 235B
MMLU-Redux 92.8 92.9 93.7 92.7
General Tasks GPQA-Diamond 78.0 71.5 80.2 84.0 71.1
C-Eval 85.5 91.8 82.9 89.6
LiveBench 2024-11-25 75.7 71.6 - 82.4 77.1
IFEval strict prompt 92.6 83.3 - 89.5 83.4
Alignment Tasks Arena-Hard 92.1 92.3 96.4 95.6
AlignBench v1.1 8.86 8.76 9.03 8.94
Creative Writing v3 81.7 85.5 86.0 84.6
WritingBench 7.69 7.71 8.09 8.03
MATH-500 96.4 97.3 98.8 98.0
Math & Text Reasoning AIME'24 74.3 79.8 83.9 92.0 85.7
AIME'25 79.2 70.0 77.3 86.7 81.5
ZebraLogic 81.0 78.7 - 87.4 80.3
AutoLogi 79.8 86.1 - 85.4 89.0
BFCL v3 67.8 56.9 - 62.9 70.8
Agent & Coding LiveCodeBench v5 63.9 64.3 70.6 70.4 70.7
CodeForces (Rating / Percentile) 1891 / 96.7% 2029 / 98.1% - 2001 / 97.9% 2056 / 98.2%
Multi-IF 48.8 67.7 77.8 71.9
INCLUDE 84.6 82.7 85.1 78.7
Multilingual Tasks MMMLU 14 languages 88.4 86.4 86.9 84.3
MT-AIME2024 67.4 73.5 76.9 80.8
PolyMath 38.9 47.1 52.2 54.7
MLogiQA 75.5 73.8 75.6 77.1

Analysis of Qwen3-235B-A22B (Thinking Mode):

  • Open-Source Leader: Qwen3-235B-A22B (Thinking) outperforms DeepSeek-R1 on 17/23 benchmarks despite having fewer activated parameters (22B vs. 37B) and significantly fewer total parameters (235B vs. 671B). Its performance is particularly strong in reasoning-demanded tasks like mathematics, agent, and coding.

  • Competitiveness with Proprietary Models: It is highly competitive with closed-source models such as OpenAI-o1, Grok-3-Beta (Think), and Gemini2.5-Pro, substantially narrowing the performance gap in reasoning capabilities. For instance, it achieves the highest CodeForces rating (2056 / 98.2%).

    The following are the results from Table 12 of the original paper:

    GPT-4o-2024-11-20 DeepSeek-V3 Qwen2.5-72B-Instruct LLaMA-4-Maverick Qwen3-235B-A22B
    Architecture - MoE Dense MoE MoE
    # Activated Params - 37B 72B 17B 22B
    # Total Params - 671B 72B 402B 235B
    General Tasks MMLU-Redux 87.0 89.1 86.8 91.8 89.2
    GPQA-Diamond 46.0 59.1 49.0 69.8 62.9
    C-Eval 75.5 86.5 84.7 83.5 86.1
    LiveBench 2024-11-25 52.2 60.5 51.4 59.5 62.5
    Alignment Tasks IFEval strict prompt 86.5 86.1 84.1 86.7 83.2
    Arena-Hard 85.3 85.5 81.2 82.7 96.1
    AlignBench v1.1 8.42 8.64 7.89 7.97 8.91
    Creative Writing v3 81.1 74.0 61.8 61.3 80.4
    WritingBench 7.11 6.49 7.06 5.46 7.70
    Math & Text Reasoning MATH-500 77.2 90.2 83.6 90.6 91.2
    AIME'24 11.1 39.2 18.9 38.5 40.1
    AIME'25 7.6 28.8 15.0 15.9 24.7
    ZebraLogic 27.4 42.1 26.6 40.0 37.7
    AutoLogi 65.9 76.1 66.1 75.2 83.3
    Agent & Coding BFCL v3 72.5 57.6 63.4 52.9 68.0
    LiveCodeBench v5 32.7 33.1 30.7 37.2 35.3
    CodeForces (Rating / Percentile) 864 / 35.4% 1134 / 54.1% 859 / 35.0% 712 / 24.3% 1387 / 75.7%
    Multilingual Tasks Multi-IF 65.6 55.6 65.3 75.5 70.2
    INCLUDE 78.8 76.7 69.6 80.9 75.6
    MMMLU 14 languages 80.3 81.1 76.9 82.5 79.8
    MT-AIME2024 9.2 20.9 12.7 27.0 32.4
    PolyMath 13.7 20.4 16.9 26.1 27.0
    MLogiQA 57.4 58.9 59.3 59.9 67.6

Analysis of Qwen3-235B-A22B (Non-thinking Mode):

  • Superiority over Open-Source: Qwen3-235B-A22B (Non-thinking) exceeds other leading open-source models like DeepSeek-V3, LLaMA-4-Maverick, and Qwen2.5-72B-Instruct.
  • Outperforms GPT-4o-2024-11-20: It surpasses the closed-source GPT-4o-2024-11-20 in 18/23 benchmarks, indicating strong inherent capabilities even without explicit reasoning steps. This highlights its robust general performance for rapid responses.

6.2.2. Qwen3-32B

The following are the results from Table 13 of the original paper:

DeepSeek-R1-Distill-Llama-70B QwQ-32B OpenAI-o3-mini (medium) Qwen3-32B
Architecture Dense Dense - Dense
# Activated Params 70B 32B - 32B
# Total Params 70B 32B - 32B
General Tasks MMLU-Redux 89.3 90.0 90.0 90.9
GPQA-Diamond 65.2 65.6 76.8 68.4
C-Eval 71.8 88.4 75.1 87.3
LiveBench 2024-11-25 54.5 72.0 70.0 74.9
Alignment Tasks IFEval strict prompt 79.3 83.9 91.5 85.0
Arena-Hard 60.6 89.5 89.0 93.8
AlignBench v1.1 6.74 8.70 8.38 8.72
Creative Writing v3 62.1 82.4 74.8 81.0
WritingBench 6.08 7.86 7.52 7.90
Math & Text Reasoning MATH-500 94.5 98.0 98.0 97.2
AIME'24 70.0 79.5 79.6 81.4
AIME'25 56.3 69.5 74.8 72.9
ZebraLogic 71.3 76.8 88.9 88.8
AutoLogi 83.5 88.1 86.3 87.3
Agent & Coding BFCL v3 49.3 66.4 64.6 70.3
LiveCodeBench v5 54.5 62.7 66.3 65.7
CodeForces (Rating / Percentile) 1633 / 91.4% 1982 / 97.7% 2036 / 98.1% 1977 / 97.7%
Multilingual Tasks Multi-IF 57.6 68.3 48.4 73.0
INCLUDE 62.1 69.7 73.1 73.7
MMMLU 14 languages 69.6 80.9 79.3 80.6
MT-AIME2024 29.3 68.0 73.9 75.0
PolyMath 29.4 45.9 38.6 47.4
MLogiQA 60.3 75.5 71.1 76.3

Analysis of Qwen3-32B (Thinking Mode):

  • New SOTA at 32B: Qwen3-32B (Thinking) outperforms QwQ-32B on 17/23 benchmarks, establishing it as the new state-of-the-art reasoning model for its size.

  • Competitiveness with Proprietary Models: It competes well with the closed-source OpenAI-o3-mini (medium), particularly excelling in alignment and multilingual performance.

    The following are the results from Table 14 of the original paper:

    GPT-4o-mini-2024-07-18 LLaMA-4-Scout Qwen2.5-72B-Instruct Qwen3-32B
    Architecture - MoE Dense Dense
    # Activated Params - 17B 72B 32B
    # Total Params - 109B 72B 32B
    General Tasks MMLU-Redux 81.5 86.3 86.8 85.7
    GPQA-Diamond 40.2 57.2 49.0 54.6
    C-Eval 66.3 78.2 84.7 83.3
    LiveBench 2024-11-25 41.3 47.6 51.4 59.8
    Alignment Tasks IFEval strict prompt 80.4 84.7 84.1 83.2
    Arena-Hard 74.9 70.5 81.2 92.8
    AlignBench v1.1 7.81 7.49 7.89 8.58
    Creative Writing v3 70.3 55.0 61.8 78.3
    WritingBench 5.98 5.49 7.06 7.54
    Math & Text Reasoning MATH-500 78.2 82.6 83.6 88.6
    AIME'24 8.1 28.6 18.9 31.0
    AIME'25 8.8 10.0 15.0 20.2
    ZebraLogic 20.1 24.2 26.6 29.2
    AutoLogi 52.6 56.8 66.1 78.5
    Agent & Coding BFCL v3 64.0 45.4 63.4 63.0
    LiveCodeBench v5 27.9 29.8 30.7 31.3
    CodeForces (Rating / Percentile) 1113 / 52.6% 981 / 43.7% 859 / 35.0% 1353 / 71.0%
    Multilingual Tasks Multi-IF 62.4 64.2 65.3 70.7
    INCLUDE 66.0 74.1 69.6 70.9
    MMMLU 14 languages 72.1 77.5 76.9 76.5
    MT-AIME2024 6.0 19.1 12.7 24.1
    PolyMath 12.0 20.9 16.9 22.5
    MLogiQA 42.6 53.9 59.3 62.9

Analysis of Qwen3-32B (Non-thinking Mode):

  • Superiority: Qwen3-32B (Non-thinking) exhibits superior performance to almost all baselines.
  • Vs. Larger Predecessor: It performs on par with Qwen2.5-72B-Instruct on general tasks, but with significant advantages in alignment, multilingual, and reasoning-related tasks, showcasing fundamental improvements over the Qwen2.5 series.

6.2.3. Qwen3-30B-A3B & Qwen3-14B

The following are the results from Table 15 of the original paper:

DeepSeek-R1-Distill-Qwen-32B QwQ-32B Qwen3-14B Qwen3-30B-A3B
Architecture Dense Dense Dense MoE
# Activated Params 32B 32B 14B 3B
# Total Params 32B 32B 14B 30B
General Tasks MMLU-Redux 88.2 90.0 88.6 89.5
GPQA-Diamond 62.1 65.6 64.0 65.8
C-Eval 82.2 88.4 86.2 86.6
LiveBench 2024-11-25 45.6 72.0 71.3 74.3
Alignment Tasks IFEval strict prompt 72.5 83.9 85.4 86.5
Arena-Hard 60.8 89.5 91.7 91.0
AlignBench v1.1 7.25 8.70 8.56 8.70
Creative Writing v3 55.0 82.4 80.3 79.1
WritingBench 6.13 7.86 7.80 7.70
Math & Text Reasoning MATH-500 94.3 98.0 96.8 98.0
AIME'24 72.6 79.5 79.3 80.4
AIME'25 49.6 69.5 70.4 70.9
ZebraLogic 69.6 76.8 88.5 89.5
AutoLogi 74.6 88.1 89.2 88.7
Agent & Coding BFCL v3 53.5 66.4 70.4 69.1
LiveCodeBench v5 54.5 62.7 63.5 62.6
CodeForces (Rating / Percentile) 1691 / 93.4% 1982 / 97.7% 1766 / 95.3% 1974 / 97.7%
Multilingual Tasks Multi-IF 31.3 68.3 74.8 72.2
INCLUDE 68.0 69.7 71.7 71.9
MMMLU 14 languages 78.6 80.9 77.9 78.4
MT-AIME2024 44.6 68.0 73.3 73.9
PolyMath 35.1 45.9 45.8 46.1
MLogiQA 63.3 75.5 71.1 70.1

Analysis of Qwen3-30B-A3B / Qwen3-14B (Thinking Mode):

  • Strong-to-Weak Distillation Success: Both Qwen3-30B-A3B and Qwen3-14B (Thinking) are highly competitive with QwQ-32B, especially in reasoning benchmarks.

  • Efficiency of MoE Distillation: Qwen3-30B-A3B achieves comparable performance to QwQ-32B despite having a smaller total size (30B vs. 32B) and significantly fewer activated parameters (3B vs. 32B), demonstrating the effectiveness of the Strong-to-Weak Distillation in endowing lightweight MoE models with profound reasoning.

    The following are the results from Table 16 of the original paper:

    Phi-4 Gemma-3-27B-IT Qwen2.5-32B-Instruct Qwen3-14B Qwen3-30B-A3B
    Architecture Dense Dense Dense Dense MoE
    # Activated Params 14B 27B 32B 14B 3B
    # Total Params 14B 27B 32B 14B 30B
    MMLU-Redux 85.3 82.6 83.9 82.0 84.1
    General Tasks GPQA-Diamond 56.1 42.4 49.5 54.8 54.8
    C-Eval 66.9 66.6 80.6 81.0 82.9
    LiveBench 2024-11-25 41.6 49.2 50.0 59.6 59.4
    IFEval strict prompt 62.1 80.6 79.5 84.8 83.7
    Alignment Tasks Arena-Hard 75.4 86.8 74.5 86.3 88.0
    AlignBench v1.1 7.61 7.80 7.71 8.52 8.55
    Creative Writing v3 51.2 82.0 54.6 73.1 68.1
    WritingBench 5.73 7.22 5.90 7.24 7.22
    Math & Text Reasoning MATH-500 80.8 90.0 84.6 90.0 89.8
    AIME'24 22.9 32.6 18.8 31.7 32.8
    AIME'25 17.3 24.0 12.8 23.3 21.6
    ZebraLogic 32.3 24.6 26.1 33.0 33.2
    AutoLogi 66.2 64.2 65.5 82.0 81.5
    Agent & Coding BFCL v3 47.0 59.1 62.8 61.5 58.6
    LiveCodeBench v5 25.2 26.9 26.4 29.0 29.8
    CodeForces (Rating / Percentile) 1280 / 65.3% 1063 / 49.3% 903 / 38.2% 1200 / 58.6% 1267 / 64.1%
    Multilingual Tasks Multi-IF 49.5 69.8 63.2 72.9 70.8
    INCLUDE 65.3 71.4 67.5 67.8 67.8
    MMMLU 14 languages 74.7 76.1 74.2 72.6 73.8
    MT-AIME2024 13.1 23.0 15.3 23.2 24.6
    PolyMath 17.4 20.3 18.3 22.0 23.3
    MLogiQA 53.1 58.5 58.0 58.9 53.3

Analysis of Qwen3-30B-A3B / Qwen3-14B (Non-thinking Mode):

  • Outperforming Baselines: Both models surpass non-reasoning baselines in most benchmarks.
  • Efficiency: They exceed Qwen2.5-32B-Instruct with significantly fewer activated and total parameters, enabling more efficient and cost-effective performance.

6.2.4. Qwen3-8B / 4B / 1.7B / 0.6B

The following are the results from Table 17 of the original paper:

DeepSeek-R1-Distill-Qwen-14B DeepSeek-R1-Distill-Qwen-32B Qwen3-4B Qwen3-8B
Architecture Dense Dense Dense Dense
# Activated Params 14B 32B 4B 8B
# Total Params 14B 32B 4B 8B
General Tasks MMLU-Redux 84.1 88.2 83.7 87.5
GPQA-Diamond 59.1 62.1 55.9 62.0
C-Eval 78.1 82.2 77.5 83.4
LiveBench 2024-11-25 52.3 45.6 63.6 67.1
Alignment Tasks IFEval strict prompt 72.6 72.5 81.9 85.0
Arena-Hard 48.0 60.8 76.6 85.8
AlignBench v1.1 7.43 7.25 8.30 8.46
Creative Writing v3 54.2 55.0 61.1 75.0
WritingBench 6.03 6.13 7.35 7.59
Math & Text Reasoning MATH-500 93.9 94.3 97.0 97.4
AIME'24 69.7 72.6 73.8 76.0
AIME'25 44.5 49.6 65.6 67.3
ZebraLogic 59.1 69.6 81.0 84.8
AutoLogi 78.6 74.6 87.9 89.1
Agent & Coding BFCL v3 49.5 53.5 65.9 68.1
LiveCodeBench v5 45.5 54.5 54.2 57.5
CodeForces (Rating / Percentile) 1574 / 89.1% 1691 / 93.4% 1671 / 92.8% 1785 / 95.6%
Multilingual Tasks Multi-IF 29.8 31.3 66.3 71.2
INCLUDE 59.7 68.0 61.8 67.8
MMMLU 14 languages 73.8 78.6 69.8 74.4
MT-AIME2024 33.7 44.6 60.7 65.4
PolyMath 28.6 35.1 40.0 42.7
MLogiQA 53.6 63.3 65.9 69.0

The following are the results from Table 18 of the original paper:

    LLaMA-3.1-8B-Instruct Gemma-3-12B-IT Qwen2.5-7B-Instruct Qwen2.5-14B-Instruct Qwen3-4B Qwen3-8B
Architecture Dense Dense Dense Dense Dense Dense
# Activated Params 8B 12B 7B 14B 4B 8B
# Total Params 8B 12B 7B 14B 4B 8B
General Tasks MMLU-Redux 61.7 77.8 75.4 80.0 77.3 79.5
    GPQA-Diamond 32.8 40.9 36.4 45.5 41.7 39.3
    C-Eval 52.0 61.1 76.2 78.0 72.2 77.9
LiveBench 2024-11-25 26.0 43.7 34.9 42.2 48.4 53.5
Alignment Tasks IFEval strict prompt 75.0 80.2 71.2 81.0 81.2 83.0
Arena-Hard 30.1 82.6 52.0 68.3 66.2 79.6
AlignBench v1.1 6.01 7.77 7.27 7.67 8.10 8.38
    Creative Writing v3 52.8 79.9 49.8 55.8 53.6 64.5
    WritingBench 4.57 7.05 5.82 5.93 6.85 7.15
Math & Text Reasoning MATH-500 54.8 85.6 77.6 83.4 84.8 87.4
AIME'24 6.3 22.4 9.1 15.2 25.0 29.1
AIME'25 2.7 18.8 12.1 13.6 19.1 20.9
ZebraLogic 12.8 58.9 12.0 19.7 35.2 26.7
AutoLogi 30.9 76.3 42.9 57.4 76.3 76.5
Agent & Coding BFCL v3 49.6 50.6 55.8 58.7 57.6 60.2
    LiveCodeBench v5 10.8 25.7 14.4 21.9 21.3 22.8
    CodeForces (Rating / Percentile) 473 / 14.9% 462 / 14.7% 191 / 0.0% 904 / 38.3% 842 / 33.7% 1110 / 52.4%
Multilingual Tasks Multi-IF 52.1 65.6 47.7 55.5 61.3 69.2
INCLUDE 34.0 65.3 53.6 63.5 53.8 62.5
MMMLU 14 languages 44.4 70.0 61.4 70.3 61.7 66.9
MT-AIME2024 0.4 16.7 5.5 8.5 13.9 16.6
PolyMath 5.8 17.6 11.9 15.0 16.6 18.8
MLogiQA 41.9 54.5 49.5 51.3 49.9 51.4

The following are the results from Table 19 of the original paper:

DeepSeek-R1-Distill-Qwen-1.5B DeepSeek-R1-Distill-Llama-8B Qwen3-0.6B Qwen3-1.7B
Architecture Dense Dense Dense Dense
# Activated Params 1.5B 8B 0.6B 1.7B
# Total Params 1.5B 8B 0.6B 1.7B
General Tasks MMLU-Redux 45.4 66.4 55.6 73.9
GPQA-Diamond 33.8 49.0 27.9 40.1
C-Eval 27.1 50.4 50.4 68.1
LiveBench 2024-11-25 24.9 40.6 30.3 51.1
Alignment Tasks IFEval strict prompt 39.9 59.0 59.2 72.5
Arena-Hard 4.5 17.6 8.5 43.1
AlignBench v1.1 5.00 6.24 6.10 7.60
Creative Writing v3 16.4 51.1 30.6 48.0
WritingBench 4.03 5.42 5.61 7.02
Math & Text Reasoning MATH-500 83.9 89.1 77.6 93.4
AIME'24 28.9 50.4 10.7 48.3
AIME'25 22.8 27.8 15.1 36.8
ZebraLogic 4.9 37.1 30.3 63.2
AutoLogi 19.1 63.4 61.6 83.2
Agent & Coding BFCL v3 14.0 21.5 46.4 56.6
LiveCodeBench v5 13.2 42.5 12.3 33.2
CodeForces (Rating / Percentile) 36.1 51.2
Multilingual Tasks Multi-IF 13.3 27.0 35.9 51.8
INCLUDE 21.9 34.5 43.1 59.1
MMMLU 14 languages 27.3 40.1 7.8 36.1
MT-AIME2024 12.4 13.2 11.4 25.2
PolyMath 14.5 10.8 40.9 56.0
MLogiQA 29.0 32.8

The following are the results from Table 20 of the original paper:

Gemma-3-1B-IT Phi-4-mini Qwen2.5-1.5B-Instruct Qwen2.5-3B-Instruct Qwen3-0.6B Qwen3-1.7B
Architecture Dense Dense Dense Dense Dense Dense
# Activated Params 1.0B 3.8B 1.5B 3.1B 0.6B 1.7B
# Total Params 1.0B 3.8B 1.5B 3.1B 0.6B 1.7B
General Tasks MMLU-Redux 33.3 67.9 50.7 64.4 44.6 64.4
GPQA-Diamond 19.2 25.2 29.8 30.3 22.9 28.6
C-Eval 28.5 40.0 53.3 68.2 42.6 61.0
LiveBench 2024-11-25 14.4 25.3 18.0 23.8 21.8 35.6
IFEval strict prompt 54.5 68.6 42.5 58.2 54.5 68.2
Arena-Hard 17.8 32.8 9.0 23.7 6.5 36.9
Alignment Tasks AlignBench v1.1 5.3 6.00 5.60 6.49 5.60 7.20
Creative Writing v3 52.8 10.3 31.5 42.8 28.4 43.6
WritingBench 5.18 4.05 4.67 5.55 5.13 6.54
MATH-500 46.4 67.6 55.0 67.2 55.2 73.0
AIME'24 0.9 8.1 0.9 6.7 3.4 13.4
Math & Text Reasoning AIME'25 0.8 5.3 0.4 4.2 2.6 9.8
ZebraLogic 1.9 2.7 3.4 4.8 4.2 12.8
AutoLogi 16.4 28.8 22.5 29.9 37.4 59.8
BFCL v3 16.3 31.3 47.8 50.4 44.1 52.2
Coding LiveCodeBench v5 1.8 10.4 5.3 9.2 3.6 11.6
Multi-IF 32.8 40.5 20.2 32.3 33.3 44.7
INCLUDE 32.7 43.8 33.1 43.8 34.4 42.6
MMMLU 14 languages 32.5 51.4 40.4 51.8 37.1 48.3
MT-AIME2024 0.2 0.9 0.7 1.6 1.5 4.9
PolyMath 3.5 6.7 5.0 7.3 4.6 10.3
MLogiQA 31.8 39.5 40.9 39.5 37.3 41.1

Analysis of Smaller Qwen3 Models (Thinking and Non-thinking Modes):

  • Edge-side Performance: The smaller Qwen3 models (Qwen3-8B, 4B, 1.7B, 0.6B) exhibit impressive performance, often outperforming baselines with more parameters, including previous Qwen2.5 models, in both thinking and non-thinking modes.
  • Distillation Efficacy: These results further reinforce the efficacy of the Strong-to-Weak Distillation approach, enabling the creation of lightweight Qwen3 models with remarkably reduced costs and efforts while maintaining high capabilities.

6.3. Discussion

6.3.1. The Effectiveness of Thinking Budget

The ability of Qwen3 to enhance its intelligence by leveraging an increased thinking budget is a key innovation.

The following figure (Figure 2 from the original paper) shows the performance of Qwen3-235B-A22B with respect to the thinking budget:

Figure 2: Performance of Qwen3-235B-A22B with respect to the thinking budget. The figure plots four tasks (AIME'24, AIME'25, LiveCodeBench v5, and GPQA Diamond) in both thinking and non-thinking modes; performance rises markedly as the thinking budget increases.

Analysis:

  • Scalable Performance Improvement: As observed in Figure 2, Qwen3-235B-A22B demonstrates a clear and consistent improvement in performance across various benchmarks (AIME'24, AIME'25, LiveCodeBench v5, GPQA Diamond) as the allocated thinking budget (measured in tokens) increases. This validates the design principle that more computational resources dedicated to thinking directly translate to better reasoning outcomes.
  • Smooth Scaling: The scaling curves are smooth, suggesting that the thinking budget mechanism provides a continuous knob for users to trade off latency (due to more thinking tokens) and performance based on task complexity.
  • Future Potential: The authors hypothesize that extending the output length beyond 32K tokens for thinking could yield further performance improvements, suggesting potential for future work in pushing the limits of the thinking budget.
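
One plausible way to realize the thinking-budget knob at inference time is sketched below: cap the number of tokens generated inside the reasoning block, close the block once the cap is reached, and then generate the final answer conditioned on the truncated reasoning. The <think>/</think> tags, the two-pass control flow, and the generate callable are assumptions for illustration; the report describes the budget mechanism only at a high level.

```python
def generate_with_thinking_budget(generate, prompt: str, budget: int) -> str:
    """Sketch of a thinking budget.

    `generate(text, max_new_tokens, stop)` stands for any completion function
    that returns newly generated text for the given context.
    """
    context = prompt + "<think>\n"
    # Pass 1: let the model reason, but for at most `budget` tokens.
    thinking = generate(context, max_new_tokens=budget, stop=["</think>"])
    context += thinking
    if not thinking.rstrip().endswith("</think>"):
        # Budget exhausted before the model closed its reasoning: close it ourselves.
        context += "\n</think>\n"
    # Pass 2: produce the user-facing answer conditioned on the capped reasoning.
    return generate(context, max_new_tokens=2048, stop=None)
```

Under this kind of scheme the budget acts as a direct latency/quality dial: larger caps permit longer reasoning traces, consistent with the smooth scaling curves in Figure 2.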

6.3.2. The Effectiveness and Efficiency of On-Policy Distillation

The on-policy distillation approach, part of the strong-to-weak distillation pipeline, proves to be both effective and highly efficient.

The following are the results from Table 21 of the original paper:

Method AIME'24 AIME'25 MATH500 LiveCodeBench v5 MMLU -Redux GPQA -Diamond GPU Hours
Off-policy Distillation 55.0 (90.0) 42.8 (83.3) 92.4 42.0 86.4 55.6 -
+ Reinforcement Learning 67.6 (90.0) 55.5 (83.3) 94.8 52.9 86.9 61.3 17,920
+ On-policy Distillation 74.4 (93.3) 65.5 (86.7) 97.0 60.3 88.3 63.3 1,800

Analysis:

  • Superior Performance over RL: On-policy Distillation achieves significantly better performance across all listed benchmarks (AIME'24, AIME'25, MATH500, LiveCodeBench v5, MMLU-Redux, GPQA-Diamond) compared to direct Reinforcement Learning when starting from the same off-policy distilled 8B checkpoint. For example, AIME'24 improves from 67.6 (RL) to 74.4 (Distillation).
  • Dramatic Efficiency Gains: This performance gain comes with a remarkable reduction in computational cost. On-policy Distillation requires only 1,800 GPU hours for the Qwen3-8B model, approximately 1/10th of the 17,920 GPU hours needed for Reinforcement Learning. This demonstrates a massive efficiency advantage for training smaller models.
  • Enhanced Exploration: Distillation from teacher logits not only improves direct performance (Pass@1 scores) but also expands the student model's exploration space and reasoning potential, as evidenced by improved Pass@64 scores on AIME'24 (93.3 vs. 90.0) and AIME'25 (86.7 vs. 83.3). In contrast, Reinforcement Learning alone did not lead to any improvement in Pass@64 scores from the initial off-policy checkpoint. This suggests that mimicking a strong teacher's thought process, including its uncertainties and alternative paths (captured by soft probabilities), is more beneficial for learning robustness and exploration than direct reward optimization for exploration.
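
To make the mechanism concrete, here is a heavily simplified sketch of one on-policy distillation step consistent with the description above: the student samples its own responses, and is then updated to match the teacher's token-level distribution (logits) on those samples. The model handles, the omitted prompt-token masking, and the single optimizer step are illustrative assumptions, not the report's implementation.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, tokenizer, prompts, optimizer,
                           max_new_tokens=256):
    """One simplified on-policy distillation step.

    The student generates its own (on-policy) responses, then minimizes the KL
    divergence from the teacher's next-token distribution over those sequences.
    """
    # 1) Sample responses from the current student policy.
    student.eval()
    with torch.no_grad():
        batch = tokenizer(prompts, return_tensors="pt", padding=True)  # assumes pad_token is set
        sequences = student.generate(**batch, do_sample=True,
                                     max_new_tokens=max_new_tokens)

    # 2) Score the sampled sequences under both models.
    student.train()
    student_logits = student(sequences).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, :-1]

    # 3) Token-level KL(teacher || student) as the distillation loss
    #    (prompt tokens are included here for brevity; a real loss would mask them).
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training against the teacher's full distribution rather than a single sampled label is what lets the student inherit the teacher's alternative reasoning paths, which is consistent with the Pass@64 gains noted above.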

6.3.3. The Effects of Thinking Mode Fusion and General RL

The stages of Thinking Mode Fusion and General RL are crucial for integrating non-thinking capabilities, refining instruction following, and enhancing overall model robustness.

The following are the results from Table 22 of the original paper:

Stage 2 Reasoning RL Stage 3 Thinking Mode Fusion Stage 4 General RL
Benchmark Thinking Thinking Non-Thinking Thinking Non-Thinking
General Tasks LiveBench 2024-11-25 68.6 70.9 (+2.3) 57.1 74.9 (+4.0) 59.8 (+2.8)
Arena-Hard 86.8 89.4 (+2.6) 88.5 93.8 (+4.4) 92.8 (+4.3)
CounterFactQA* 50.4 61.3 (+10.9) 64.3 68.1 (+6.8) 66.4 (+2.1)
Instruction & Format Following IFEval strict prompt 73.0 78.4 (+5.4) 78.4 85.0 (+6.6) 83.2 (+4.8)
Multi-IF 61.4 64.6 (+3.2) 65.2 73.0 (+8.4) 70.7 (+5.5)
LengthCtrl* 62.6 70.6 (+8.0) 84.9 73.5 (+2.9) 87.3 (+2.4)
ThinkFollow* - 88.7 98.9 (+10.2)
Agent BFCL v3 69.0 68.4 (-0.6) 61.5 70.3 (+1.9) 63.0 (+1.5)
ToolUse* 63.3 70.4 (+7.1) 73.2 85.5 (+15.1) 86.5 (+13.3)
Knowledge & STEM MMLU-Redux 91.4 91.0 (-0.4) 86.7 90.9 (-0.1) 85.7 (-1.0)
GPQA-Diamond 68.8 69.0 (+0.2) 50.4 68.4 (-0.6) 54.6 (+4.3)
Math & Coding AIME'24 83.8 81.9 (-1.9) 28.5 81.4 (-0.5) 31.0 (+2.5)
LiveCodeBench v5 68.4 67.2 (-1.2) 31.1 65.7 (-1.5) 31.3 (+0.2)

Analysis of Qwen3-32B at Different Stages:

  1. Stage 3 (Thinking Mode Fusion):
    • Initial Mode Switching: The ThinkFollow benchmark score of 88.7 indicates that the model gains an initial, though imperfect, ability to switch between thinking modes.
    • General and Instruction Following Improvements (Thinking Mode): The model shows significant gains in CounterFactQA (+10.9 points) and LengthCtrl (+8.0 points) in thinking mode, demonstrating improved general and instruction-following capabilities.
  2. Stage 4 (General RL):
    • Robust Mode Switching: The ThinkFollow score dramatically improves to 98.9 (+10.2 points), confirming that General RL ensures highly accurate mode switching.
    • Broad Capability Enhancement: General RL further strengthens general (LiveBench, Arena-Hard, CounterFactQA), instruction-following (IFEval, Multi-IF, LengthCtrl), and agent capabilities (BFCL v3, ToolUse*) in both thinking and non-thinking modes. ToolUse* sees a substantial boost (+15.1 in thinking, +13.3 in non-thinking).
  3. Performance Trade-offs for Specialized Tasks:
    • For Knowledge, STEM, Math (AIME'24), and Coding (LiveCodeBench v5) tasks, Thinking Mode Fusion and General RL do not yield significant improvements in thinking mode. In fact, some challenging tasks like AIME'24 and LiveCodeBench show a slight decrease in thinking mode performance after these stages.
    • Conjecture: The authors hypothesize this degradation is due to training on a broader range of general tasks, which might compromise specialized capabilities. This represents a conscious trade-off to enhance the model's overall versatility, acknowledging that improving general robustness might dilute peak performance in highly specialized, complex reasoning tasks.
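
For context on how the mode switching measured by ThinkFollow is typically exercised, the sketch below selects thinking or non-thinking mode through the chat template of the open-source Qwen3 checkpoints. The `enable_thinking` argument and the checkpoint name follow the public release convention but should be treated as assumptions here rather than something this report specifies.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # placeholder checkpoint
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Thinking mode: the chat template leaves room for a <think>...</think> block.
thinking_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Non-thinking mode: the reasoning block is suppressed for a direct answer.
direct_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

print(thinking_prompt != direct_prompt)  # the two prompts differ only in the thinking scaffold
```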

6.3.4. Long-Context Ability

The long-context processing capabilities are evaluated using the RULER benchmark.

The following are the results from Table 23 of the original paper:

Model RULER
Avg. 4K 8K 16K 32K 64K 128K
Qwen2.5-7B-Instruct 85.4 96.7 95.1 93.7 89.4 82.3 55.1
Qwen2.5-14B-Instruct 91.4 97.7 96.8 95.9 93.4 86.7 78.1
Qwen2.5-32B-Instruct 92.9 96.9 97.1 95.5 95.5 90.3 82.0
Qwen2.5-72B-Instruct 95.1 97.7 97.2 97.7 96.5 93.0 88.4
Qwen3-4B 85.2 95.1 93.6 91.0 87.8 77.8 66.0
Non-thinking Mode Qwen3-8B 89.1 96.3 96.0 91.8 91.2 82.1 77.4
Qwen3-14B 94.6 98.0 97.8 96.4 96.1 94.0 85.1
Qwen3-32B 93.7 98.4 96.0 96.2 94.4 91.8 85.6
Qwen3-30B-A3B 91.6 96.5 97.0 95.3 92.4 89.1 79.2
Qwen3-235B-A22B 95.0 97.7 97.2 96.4 95.1 93.3 90.6
Thinking Mode Qwen3-4B 83.5 92.7 88.7 86.5 83.2 83.0 67.2
Qwen3-8B 84.4 94.7 94.4 86.1 80.8 78.3 72.0
Qwen3-14B 90.1 95.4 93.6 89.8 91.9 90.6 79.0
Qwen3-32B 91.0 94.7 93.7 91.6 92.5 90.0 83.5
Qwen3-30B-A3B 86.6 94.1 92.7 89.0 86.6 82.1 75.0
Qwen3-235B-A22B 92.2 95.1 94.8 93.0 92.3 92.0 86.0

Analysis:

  1. Non-thinking Mode Improvement: In non-thinking mode, Qwen3 models generally outperform Qwen2.5 models of similar size in long-context processing tasks, especially at longer context lengths (e.g., Qwen3-14B vs Qwen2.5-14B, Qwen3-32B vs Qwen2.5-32B show clear improvements in average and 128K scores). The flagship Qwen3-235B-A22B achieves a strong 90.6 at 128K.
  2. Thinking Mode Degradation: In thinking mode, the model's performance on RULER slightly degrades compared to non-thinking mode. The authors hypothesize that for pure retrieval tasks (like RULER) which do not rely on complex reasoning, the thinking content generated by the model might not offer significant benefits and could even interfere with the retrieval process. This indicates an area for future improvement to ensure thinking mode provides benefits across all task types.
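
As an illustration of the kind of retrieval probe behind these numbers, below is a toy passkey-retrieval check in the spirit of RULER's needle-style tasks. It is a hedged sketch, not the actual RULER harness; the `generate` callable stands in for any model interface, and the filler text and token-budget heuristic are made up.

```python
import random

def passkey_prompt(context_tokens: int, passkey: str) -> str:
    """Build a long filler context with a passkey hidden at a random position."""
    filler = "The grass is green. The sky is blue. The sun is bright. "
    needle = f"The secret passkey is {passkey}. Remember it. "
    n_chunks = max(1, context_tokens // 12)   # rough tokens-per-chunk heuristic
    chunks = [filler] * n_chunks
    chunks.insert(random.randrange(len(chunks)), needle)
    return "".join(chunks) + "\nWhat is the secret passkey? Answer with the passkey only."

def passkey_accuracy(generate, lengths=(4_000, 32_000, 128_000), trials=20) -> dict:
    """Fraction of trials where the model's answer contains the passkey, per length."""
    scores = {}
    for n in lengths:
        hits = 0
        for _ in range(trials):
            key = str(random.randint(10_000, 99_999))
            if key in generate(passkey_prompt(n, key)):
                hits += 1
        scores[n] = hits / trials
    return scores
```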

6.3.5. Multilingual Ability

The multilingual capabilities are showcased through various benchmarks and detailed language-specific tables (Tables 24-35).

Summary: Tables 24-35 present detailed benchmark scores across various languages, including Spanish, French, Portuguese, Italian, Arabic, Japanese, Korean, Indonesian, Russian, Vietnamese, German, and Thai. The results demonstrate that the Qwen3 series models achieve competitive performance across all evaluated benchmarks for these specific languages, showcasing their strong multilingual capabilities.

For a broader assessment, the Belebele benchmark (Bandarkar et al., 2023) is used, covering 80 supported languages (excluding 42 unoptimized ones). The following are the results from Table 37 of the original paper:

Model Indo-European Sino-Tibetan Afro-Asiatic Austronesian Dravidian Turkic Tai-Kadai Uralic Austroasiatic Other
Gemma-3-27B-IT 89.2 86.3 85.9 84.1 83.5 86.8 81.0 91.0 86.5 87.0
Qwen2.5-32B-Instruct 85.5 82.3 80.4 70.6 67.8 80.8 74.5 87.0 79.0 72.6
QwQ-32B 86.1 83.7 81.9 71.3 69.3 80.3 77.0 88.0 83.0 74.0
Qwen3-32B (Thinking) 90.7 89.7 84.8 86.7 84.5 89.3 83.5 91.3 88.0 83.1
Qwen3-32B (Non-thinking) 89.1 88.0 82.3 83.7 84.0 85.0 85.0 88.7 88.0 81.3
Gemma-3-12B-IT 85.8 83.3 83.4 79.3 79.0 82.8 77.5 89.0 83.0 81.6
Qwen2.5-14B-Instruct 82.7 78.9 80.4 69.1 66.2 74.2 72.2 83.9 77.9 70.4
Qwen3-14B (Thinking) 88.6 87.3 82.4 82.4 81.0 83.8 83.5 91.0 82.5 81.7
Qwen3-14B (Non-thinking) 87.4 82.7 80.1 80.7 78.0 81.8 80.5 87.7 81.5 77.0
Gemma-3-4B-IT 71.8 72.0 63.5 61.7 64.8 64.0 61.5 70.7 71.0 62.6
Qwen2.5-3B-Instruct 58.0 62.3 57.2 47.9 36.9 45.1 49.8 50.6 56.8 48.4
Qwen3-4B (Thinking) 82.2 77.7 74.1 73.0 74.3 76.3 68.5 83.0 74.5 67.9
Qwen3-4B (Non-thinking) 76.0 77.0 65.6 65.6 65.5 64.0 60.5 74.0 74.0 61.0
Gemma-3-1B-IT 36.5 36.0 30.0 29.1 28.8 27.3 28.0 32.7 33.0 30.9
Qwen2.5-1.5B-Instruct 41.5 43.0 39.6 34.8 28.6 29.7 39.4 33.8 42.0 36.0
Qwen3-1.7B (Thinking) 69.7 66.0 59.4 58.6 52.8 57.8 53.5 70.3 63.5 53.4
Qwen3-1.7B (Non-thinking) 58.8 62.7 50.8 53.0 43.3 48.0 46.0 54.3 54.0 43.9

Analysis of Belebele Benchmark:

  • Superiority over Qwen2.5: Qwen3 models significantly outperform their Qwen2.5 counterparts across all language families, highlighting the impact of the expanded multilingual pre-training data (119 languages) and improved training strategies.
  • Competitiveness with Gemma-3: Qwen3 achieves comparable or superior performance to similarly-sized Gemma models (e.g., Qwen3-32B vs. Gemma-3-27B, Qwen3-14B vs. Gemma-3-12B).
  • Thinking Mode Advantage: For Qwen3 models, the thinking mode consistently yields higher scores across almost all language families compared to their non-thinking counterparts, demonstrating the benefit of reasoning in cross-lingual understanding tasks. This is in contrast to the RULER benchmark, where thinking mode was less beneficial, indicating task-specific utility of the thinking mechanism.

7. Conclusion & Reflections

7.1. Conclusion Summary

This technical report introduces Qwen3, a significant advancement in the Qwen family of large language models. The key contributions include the novel integration of thinking and non-thinking modes into a unified framework, complemented by a dynamic thinking budget mechanism. This design empowers users with flexible control over computational resources and reasoning depth, eliminating the need to switch between specialized models. Qwen3 comprises a diverse series of dense and Mixture-of-Expert (MoE) models, ranging from 0.6B to 235B parameters, featuring architectural refinements like QK-Norm and global-batch load balancing loss. The models were pre-trained on an unprecedented 36 trillion tokens across 119 languages and dialects, vastly expanding multilingual capabilities. A sophisticated multi-stage post-training pipeline, including Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, coupled with an efficient Strong-to-Weak Distillation approach for smaller models, ensures state-of-the-art performance. Empirical evaluations consistently demonstrate Qwen3's competitive results against proprietary and leading open-source models across diverse benchmarks, particularly excelling in code generation, mathematical reasoning, and agent tasks. The open-source release under Apache 2.0 further promotes community engagement and research.

7.2. Limitations & Future Work

The authors acknowledge certain limitations and outline future research directions:

  • Thinking Mode for Retrieval Tasks: While thinking mode generally enhances reasoning, performance on certain retrieval-based long-context tasks (like RULER) slightly degrades. The authors hypothesize that the generated thinking content might interfere with retrieval in these specific scenarios, suggesting a need to refine thinking mode for such tasks in future versions.

  • Specialized vs. General Performance Trade-off: The Thinking Mode Fusion and General RL stages, while enhancing overall versatility and instruction following, sometimes lead to a slight decrease in thinking mode performance on highly challenging, specialized tasks like AIME'24 and LiveCodeBench. This indicates a trade-off between broad generalization and peak performance in niche, complex problem-solving.

    Future research will focus on:

  • Scaling Pre-training: Continuing to scale up pre-training with even higher quality and more diverse data.

  • Architectural and Training Method Improvements: Enhancing model architecture and training methods for effective compression and scaling to extremely long contexts. This includes addressing the observed performance dip of thinking mode in retrieval tasks.

  • Increased RL Resources and Agent-based RL Systems: Allocating more computational resources for Reinforcement Learning, with a particular emphasis on agent-based RL systems that learn from environmental feedback. The goal is to build agents capable of tackling complex tasks requiring inference time scaling, indicating a move towards more autonomous and interactive LLM agents.

7.3. Personal Insights & Critique

Qwen3 represents a compelling stride towards more adaptable and efficient LLMs. The unified thinking and non-thinking modes, coupled with the thinking budget, are genuinely innovative concepts that address a critical practical challenge in LLM deployment: how to balance responsiveness and deep reasoning without maintaining separate models. This dynamic control over inference-time computation is a sophisticated solution for optimizing user experience and resource utilization.

The Strong-to-Weak Distillation approach for smaller models is particularly insightful. Demonstrating that distillation can be significantly more effective and efficient than direct RL, especially for exploration ability (Pass@64), offers a powerful blueprint for developing competitive lightweight models. This has direct implications for edge deployment and broader accessibility.

However, the acknowledged trade-off between generalized capabilities (achieved through General RL) and peak performance in specialized thinking mode tasks is an important point of critique. While understandable for overall versatility, it highlights a fundamental tension in LLM training: optimizing for average performance across a vast array of tasks might dilute the model's ability to achieve extreme performance in highly focused, difficult domains. Future work in multi-objective optimization or more sophisticated curriculum learning could potentially mitigate this.

The massive expansion to 119 languages is commendable, pushing the boundaries of global accessibility. However, the report doesn't extensively detail the performance across all these low-resource languages or the specific challenges encountered during their integration. While Belebele provides a broad overview, deeper dives into specific language families or challenges unique to low-resource settings would further enhance understanding.

Overall, Qwen3's contributions, particularly in its novel mode-switching and resource allocation mechanisms, alongside its commitment to open-source, position it as a significant player in the evolving LLM landscape. Its methods could be transferable to other AI domains requiring dynamic computational allocation based on task complexity, beyond just language generation.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.