Qwen3 Technical Report

Published: 05/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Qwen3 introduces a unified framework integrating thinking and non-thinking modes for dynamic switching, enhancing performance and multilingual support. It also features a thinking budget mechanism for adaptive resource allocation, and expands language coverage from 29 to 119 languages and dialects.

Abstract

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "Qwen3 Technical Report". The central topic is the introduction and detailed technical description of Qwen3, the latest generation of the Qwen large language model (LLM) family.

1.2. Authors

The paper lists "Qwen Team" as the authors, followed by "Core Contributors" and "Contributors" sections detailing a large number of individuals involved. While specific affiliations are not explicitly stated in the provided abstract or initial pages, the Qwen model series is developed by Alibaba Cloud. The extensive list of contributors indicates a large-scale, collaborative research and development effort.

1.3. Journal/Conference

The paper is published as a preprint on arXiv, with the link https://arxiv.org/abs/2505.09388. As a preprint, it has not yet undergone formal peer review for publication in a journal or conference. However, arXiv is a highly reputable platform for disseminating cutting-edge research in fields like artificial intelligence, and technical reports from major industry labs (like Alibaba's Qwen team) often serve as primary sources for new model introductions.

1.4. Publication Year

The paper was published on arXiv on May 14, 2025, indicating a publication year of 2025.

1.5. Abstract

This work introduces Qwen3, the newest iteration in the Qwen model family, featuring a series of large language models (LLMs) engineered for enhanced performance, efficiency, and multilingual capabilities. The Qwen3 series encompasses both dense and Mixture-of-Expert (MoE) architectures, with parameter counts spanning from 0.6 billion to 235 billion. A core innovation is the unified integration of a thinking mode for complex, multi-step reasoning and a non-thinking mode for swift, context-driven responses, obviating the need for model switching. This framework allows for dynamic mode selection based on user input or templates. Qwen3 also introduces a thinking budget mechanism, enabling adaptive allocation of computational resources during inference to balance latency and performance according to task complexity. Furthermore, by distilling knowledge from larger flagship models, the computational cost for developing smaller models is significantly reduced while maintaining competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including code generation, mathematical reasoning, and agent tasks, rivaling larger MoE and proprietary models. Compared to its predecessor, Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, improving global accessibility through advanced cross-lingual understanding and generation. All Qwen3 models are openly released under the Apache 2.0 license to foster reproducibility and community-driven research.

The original source link is https://arxiv.org/abs/2505.09388, and the PDF link is https://arxiv.org/pdf/2505.09388v1.pdf. It is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of artificial intelligence is relentlessly pursuing artificial general intelligence (AGI) and artificial super intelligence (ASI). Recent advancements in large foundation models, such as GPT-4o, Claude 3.7, and Llama-4, have shown remarkable progress in distilling human knowledge and capabilities by training on vast datasets. However, a significant challenge remains: most state-of-the-art models are proprietary, limiting broader research and innovation. While open-source models like DeepSeek-V3 and Qwen2.5 have narrowed the performance gap, there is still a need for open-weight models that can compete at the highest levels, especially in complex reasoning tasks.

A key problem LLMs face is balancing the need for rapid, general-purpose responses with the demand for deep, multi-step reasoning. Currently, users often switch between different types of models—e.g., chat-optimized models for quick replies and dedicated reasoning models for complex problems—leading to inefficiencies and a fragmented user experience. Moreover, training and deploying large models, particularly for specialized reasoning, incurs substantial computational costs. There's a gap in providing a unified, adaptable, and efficient solution that caters to both thinking and non-thinking modes while also being accessible and capable across a wide array of languages.

The paper's entry point is to address these challenges by introducing Qwen3, a comprehensive series of open-weight LLMs. Its innovative idea centers on integrating thinking and non-thinking modes within a single model, coupled with a thinking budget mechanism, to offer unprecedented flexibility and resource control. This aims to eliminate the need for model switching, optimize inference costs, and extend state-of-the-art performance to an expanded multilingual user base.

2.2. Main Contributions / Findings

The Qwen3 technical report highlights several primary contributions and key findings:

  • Unified Thinking and Non-Thinking Modes: Qwen3 integrates two distinct operational modes—thinking mode for complex, multi-step reasoning and non-thinking mode for rapid, context-driven responses—into a single, unified model. This eliminates the need for users to switch between different models for varying task complexities, offering dynamic mode switching based on user queries or chat templates.

  • Dynamic Thinking Budget Mechanism: A novel thinking budget mechanism is introduced, allowing users to adaptively allocate computational resources during inference. This provides fine-grained control over the model's reasoning effort, balancing latency and performance based on task complexity.

  • Broad Model Series and Architectures: Qwen3 comprises a diverse series of LLMs, including both dense and Mixture-of-Expert (MoE) architectures. These models span a wide range of parameter scales, from 0.6 billion to 235 billion, catering to various downstream applications and deployment environments. The flagship Qwen3-235B-A22B is an MoE model with 235 billion total parameters and 22 billion activated parameters per token, balancing performance and efficiency.

  • Enhanced Multilingual Capabilities: Qwen3 significantly expands its multilingual support, covering 119 languages and dialects, a substantial increase from Qwen2.5's 29 languages. This enhancement improves cross-lingual understanding and generation, making the models globally accessible.

  • Efficient Strong-to-Weak Distillation: A strong-to-weak distillation pipeline is developed to efficiently train smaller-scale models. By leveraging knowledge transfer from larger, more capable flagship models, this approach drastically reduces the computational resources and development effort required for lightweight models while ensuring highly competitive performance. This distillation process is shown to be significantly more efficient than reinforcement learning for smaller models.

  • State-of-the-Art Performance: Empirical evaluations demonstrate that Qwen3 models achieve state-of-the-art results across a diverse set of benchmarks. This includes tasks in code generation, mathematical reasoning, and agent tasks, with the flagship models (Qwen3-235B-A22B and Qwen3-32B) performing competitively against larger MoE models and proprietary models like OpenAI-o1, Gemini2.5-Pro, and GPT-4o.

  • Open-Source Release: All Qwen3 models are publicly accessible under the Apache 2.0 license, facilitating reproducibility and fostering community-driven research and development.

    These contributions collectively address the fragmentation in model usage, resource inefficiency, and limitations in multilingual support, pushing the boundaries of open-source LLM capabilities.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the Qwen3 technical report, several foundational concepts in large language models (LLMs) and deep learning are essential:

  • Large Language Models (LLMs): These are artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. They typically use transformer architectures and can perform a wide range of natural language processing tasks, from translation to question answering.
  • Dense Models vs. Mixture-of-Experts (MoE) Models:
    • Dense Models: These are traditional neural networks where all parameters are activated for every input token. They are computationally intensive as their size scales up.
    • Mixture-of-Experts (MoE) Models: These models consist of multiple "expert" sub-networks. For each input token, only a small subset of experts (e.g., 2 or 4 out of 128) are activated by a "router" or "gate" network. This allows MoE models to have a very large total number of parameters (for higher capacity) while maintaining a manageable number of activated parameters per token during inference, leading to more efficient computation for a given performance level.
  • Pre-training and Fine-tuning:
    • Pre-training: The initial, computationally expensive phase where a large model learns general language patterns, facts, and reasoning abilities from massive, diverse datasets using unsupervised or self-supervised learning objectives (e.g., predicting the next word).
    • Fine-tuning: A subsequent phase where the pre-trained model is further trained on smaller, task-specific datasets to adapt its learned knowledge to specific downstream applications or human preferences. This can involve Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL).
  • Chain-of-Thought (CoT) Reasoning: A prompting technique where the LLM is instructed to verbalize its intermediate reasoning steps before providing a final answer. This encourages the model to break down complex problems into manageable steps, often leading to more accurate and verifiable solutions, especially in mathematical or logical tasks.
  • Reinforcement Learning (RL) and Reinforcement Learning from Human Feedback (RLHF):
    • RL: A paradigm where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. The goal is to maximize cumulative reward.
    • RLHF: A common technique for aligning LLMs with human preferences. It involves: 1) training a reward model (RM) on human judgments of LLM outputs, and 2) using this RM to provide rewards to the LLM during an RL phase (e.g., using algorithms like PPO or GRPO) to fine-tune its behavior.
  • Transformer Architecture: The foundational neural network architecture for LLMs, characterized by self-attention mechanisms and feed-forward layers. It allows the model to weigh the importance of different parts of the input sequence when processing each token.
    • Attention Mechanism: A core component of transformers that allows the model to focus on different parts of the input sequence when generating an output. The Scaled Dot-Product Attention is defined as: \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V Where:
      • Q is the Query matrix.
      • K is the Key matrix.
      • V is the Value matrix.
      • d_k is the dimension of the keys.
      • \mathrm{softmax} is the softmax function, which normalizes the attention scores.
    • Grouped Query Attention (GQA): An optimization for multi-head attention where multiple query heads share the same key and value heads. This reduces the computational cost and memory footprint, especially during inference, while maintaining much of the performance of full multi-head attention.
    • SwiGLU (Swish Gated Linear Unit): An activation function used in feed-forward networks within transformers, often replacing the traditional ReLU. It is defined as (see the code sketch after this list): \mathrm{SwiGLU}(x, W_1, W_2, V) = (\mathrm{Swish}(xW_1) \odot (xV))W_2 Where:
      • x is the input.
      • W_1, W_2, V are weight matrices.
      • \mathrm{Swish}(y) = y \cdot \mathrm{sigmoid}(y).
      • \odot denotes the element-wise product.
    • Rotary Positional Embeddings (RoPE): A type of positional encoding that encodes absolute position information with a rotation matrix and naturally incorporates relative position information. It is applied directly within the attention mechanism.
    • RMSNorm (Root Mean Square Normalization): A normalization technique applied before or after layers in neural networks. It normalizes inputs by their root mean square, which can offer computational advantages and stability over LayerNorm. For an input vector x, RMSNorm is defined as: \mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2 + \epsilon}} \cdot g Where:
      • N is the dimension of x.
      • \epsilon is a small constant for numerical stability.
      • g is a learnable scaling factor.
    • QK-Norm: A normalization applied to the query and key vectors before their dot product in the attention mechanism, helping to ensure stable training, particularly for very deep or large transformer models.
  • Byte-level Byte-Pair Encoding (BBPE): A tokenization algorithm that learns a vocabulary of common character sequences (subwords) by iteratively merging the most frequent adjacent pairs of bytes. Byte-level ensures that all input text, regardless of character set, can be tokenized.
  • Knowledge Distillation: A technique where a smaller "student" model learns from a larger, more powerful "teacher" model. The student tries to mimic the teacher's outputs, often by minimizing the Kullback-Leibler (KL) divergence between their logits. This allows the student to achieve better performance than it would if trained from scratch, with less data or computation.
  • Kullback-Leibler (KL) Divergence: A measure of how one probability distribution P diverges from a second, expected probability distribution Q. In distillation, it quantifies the difference between the teacher's soft probabilities and the student's probabilities. For discrete probability distributions P and Q, it is defined as: D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
  • Adaptive Base Frequency (ABF): A technique used with RoPE to extend the context window of LLMs by adjusting the base frequency.
  • Yet Another RoPE extensioN (YARN): A method to effectively extend the context length of RoPE-based models with minimal fine-tuning.
  • Dual Chunk Attention (DCA): A technique designed to achieve efficient inference for long context lengths by processing input in chunks.
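
To make the formulas above concrete, here is a minimal PyTorch sketch of scaled dot-product attention, RMSNorm, and SwiGLU. Tensor shapes and variable names are illustrative and not taken from the Qwen3 codebase.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v

def rms_norm(x, g, eps=1e-6):
    # Normalize by the root mean square over the last dimension, then scale by learnable g.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * g

def swiglu(x, w1, v, w2):
    # SwiGLU(x, W1, W2, V) = (Swish(x W1) ⊙ (x V)) W2, where Swish(y) = y * sigmoid(y) == SiLU.
    return (F.silu(x @ w1) * (x @ v)) @ w2
```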

3.2. Previous Works

The paper contextualizes Qwen3 by referring to several prominent LLMs and related research efforts:

  • Foundation Models (General Progress):
    • GPT-4o (OpenAI, 2024): OpenAI's multimodal flagship model, noted for its strong performance across diverse tasks, often serving as a benchmark for general LLM capabilities.
    • Claude 3.7 (Anthropic, 2025): Anthropic's competitive LLM series, known for its strong reasoning and safety features.
    • Gemini 2.5 (DeepMind, 2025): Google DeepMind's multimodal model, also a major player in the SOTA LLM landscape.
    • DeepSeek-V3 (Liu et al., 2024a): An open-source model notable for its scale and performance, particularly as an MoE model. Qwen3 often benchmarks against it.
    • Llama-4 (Meta-AI, 2025): Meta AI's next-generation open-source LLM, representing a strong competitor in the open-weight community.
    • Qwen2.5 (Yang et al., 2024b): The immediate predecessor to Qwen3, serving as a baseline for improvements in architecture, data, and multilingual support.
  • Reasoning Models (Optimized via RL):
    • OpenAI-o3 (OpenAI, 2025): A model from OpenAI specifically optimized for reasoning tasks, often through advanced RL techniques.
    • DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model, likely developed using RL, against which Qwen3's reasoning capabilities are compared.
    • QwQ-32B (Qwen Team, 2024, 2025): An earlier reasoning-focused model from the Qwen team, used as a strong baseline for Qwen3's thinking mode development and also as a teacher model in cold-start fine-tuning.
  • Open-Source Baselines (for comparison across scales):
    • Llama-3 (Dubey et al., 2024): An earlier open-source model from Meta, serving as a benchmark for various parameter scales.
    • Gemma-3 (Team et al., 2025): Google's open-source model series, providing benchmarks across different sizes.
    • Phi-4 (Abdin et al., 2024): A smaller-scale model from Microsoft, often used for benchmarking lightweight LLMs.

Background on Qwen2.5's Role: Qwen2.5 is particularly relevant as it forms the direct lineage for Qwen3. The paper mentions several Qwen2.5 variants for data generation:

  • Qwen2.5-VL (Bai et al., 2025): A vision-language model used to extract text from PDF documents for Qwen3's pre-training data.

  • Qwen2.5-Math (Yang et al., 2024c): A specialized model for mathematical content generation, contributing to synthetic data.

  • Qwen2.5-Coder (Hui et al., 2024): A specialized model for code-related data generation, also contributing to synthetic data.

    The Qwen2.5-MoE architecture (Yang et al., 2024b) also serves as a direct predecessor for Qwen3's MoE models, with Qwen3 building upon its concepts like fine-grained expert segmentation.

3.3. Technological Evolution

The evolution of LLMs has been marked by several key trends:

  1. Scaling Laws: Initial models demonstrated that increasing parameters and data led to improved performance. This motivated the development of ever-larger models.

  2. Architectural Innovations: The Transformer architecture revolutionized NLP, leading to models with self-attention and positional embeddings. Subsequent innovations focused on improving efficiency (e.g., GQA), stability (RMSNorm, QK-Norm), and context handling (RoPE, YARN, DCA).

  3. Data Curation & Augmentation: Beyond simply scaling data, researchers focused on data quality, diversity (e.g., code, scientific texts), and synthetic data generation using existing LLMs (as seen with Qwen3 leveraging Qwen2.5 variants).

  4. Multilingualism: Early models were predominantly English-centric. There's a growing trend towards truly multilingual models, expanding language coverage and improving cross-lingual transfer.

  5. Alignment & Reasoning: The introduction of Chain-of-Thought (CoT) prompting and Reinforcement Learning from Human Feedback (RLHF) significantly enhanced models' reasoning abilities and their alignment with human preferences and instructions. Dedicated reasoning models emerged from this focus.

  6. Efficiency through Sparsity: Mixture-of-Experts (MoE) models represent a major step in combining vast capacity (total parameters) with efficient inference (activated parameters), addressing the computational bottlenecks of dense models.

  7. Unified Capabilities: The latest frontier involves integrating diverse capabilities (e.g., multimodal inputs, reasoning, quick responses) into a single, adaptable model, minimizing the need for specialized models.

    Qwen3's work fits squarely within these latest trends, pushing the boundaries in scale, multilingual support, efficiency through MoE and distillation, and crucially, the unification of thinking and non-thinking modes.

3.4. Differentiation Analysis

Compared to the main methods and models in related work, Qwen3 introduces several core differences and innovations:

  • Unified Thinking and Non-Thinking Modes with Dynamic Switching: This is Qwen3's most prominent differentiator. Prior approaches often required users to select a specific model optimized for chat (e.g., GPT-4o) or reasoning (e.g., QwQ-32B). Qwen3 integrates both within a single model, enabling dynamic switching via chat templates or user prompts (/think, /no_think flags). This simplifies deployment and usage, offering unparalleled flexibility.

  • Adaptive Thinking Budget: Building on the unified modes, Qwen3 introduces a thinking budget mechanism. This allows users to control the computational resources and time allocated for reasoning during inference, dynamically balancing latency and performance based on the complexity of the query. This fine-grained control is a novel optimization for practical LLM deployment.

  • Sophisticated Strong-to-Weak Distillation: While knowledge distillation is not new, Qwen3's application of strong-to-weak distillation for lightweight models is highly effective. It involves both off-policy and on-policy phases, leveraging flagship models as teachers. The paper demonstrates that this approach significantly outperforms reinforcement learning for smaller models in terms of performance and training efficiency (1/10th of GPU hours), and also enhances exploration ability (Pass@64 scores).

  • Massive Multilingual Expansion: Qwen3 drastically expands its language support from 29 to 119 languages and dialects. This is a significant leap compared to many models that focus primarily on high-resource languages, making Qwen3 exceptionally broad in its global accessibility. This was achieved through a dedicated multilingual data annotation system and a massive 36 trillion token pre-training dataset.

  • Advanced MoE Architecture: Qwen3's MoE models (e.g., Qwen3-235B-A22B) build upon Qwen2.5-MoE but introduce improvements such as excluding shared experts and adopting a global-batch load balancing loss (Qiu et al., 2025) to encourage expert specialization. This contributes to better performance and efficiency compared to previous MoE designs.

  • Architectural Refinements for Stability: Beyond standard Transformer components, Qwen3 incorporates subtle but important changes like removing QKV-bias and introducing QK-Norm to ensure stable training, particularly for its large-scale models.

  • Comprehensive Post-training Pipeline: The multi-stage post-training approach, combining Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, ensures robust reasoning, alignment, and general capabilities for flagship models, which then serve as teachers for smaller models. This structured approach is designed to systematically instill and refine diverse skills.

    In essence, Qwen3 differentiates itself by offering a more holistic, flexible, and efficient LLM solution that explicitly addresses the trade-offs between rapid response and deep reasoning within a single, highly multilingual, and open-source framework.

4. Methodology

4.1. Principles

The core idea behind Qwen3's methodology is to develop a highly versatile and efficient family of large language models (LLMs) that can dynamically adapt to different task requirements, particularly balancing rapid response with complex reasoning. This is achieved through a unified architecture that supports both thinking and non-thinking modes, complemented by a thinking budget mechanism. The theoretical basis rests on the observation that not all tasks require extensive reasoning, and therefore, dynamically adjusting the computational effort can lead to significant efficiency gains without compromising performance when reasoning is truly needed.

The methodology can be broken down into three main pillars:

  1. Flexible Architecture Design: Incorporating both dense and Mixture-of-Experts (MoE) models, with architectural refinements for stability and efficiency.
  2. Massive and Diverse Pre-training: Training on an enormous, linguistically diverse, and domain-rich dataset using a multi-stage approach to build a strong foundation of general knowledge, reasoning, and long-context understanding.
  3. Sophisticated Multi-Stage Post-training: A comprehensive fine-tuning strategy that explicitly develops thinking and non-thinking capabilities, aligns the models with human preferences, and efficiently transfers knowledge to smaller models through strong-to-weak distillation.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Architecture

The Qwen3 series includes 6 dense models and 2 MoE models, ranging from 0.6 billion to 235 billion parameters.

Dense Model Architecture: The architecture of Qwen3 dense models builds upon Qwen2.5 with several enhancements. Key components include:

  • Grouped Query Attention (GQA): An optimization of multi-head attention in which groups of query heads share a single key and value head. This reduces the memory footprint and increases inference speed compared to standard multi-head attention, especially for larger models.

  • SwiGLU: A gated activation function used in the feed-forward network blocks of the Transformer. It often provides performance improvements over ReLU or GELU.

  • Rotary Positional Embeddings (RoPE): A method for incorporating positional information into the self-attention mechanism by rotating query and key vectors based on their absolute positions, which implicitly captures relative position information.

  • RMSNorm with Pre-normalization: Root Mean Square Normalization applied before the attention and feed-forward layers (pre-normalization), which can improve training stability and performance.

  • QKV-bias Removal: Unlike Qwen2, QKV-bias (bias terms added to the query, key, and value projections) is removed in Qwen3. This is a simplification that can sometimes improve stability or performance.

  • QK-Norm Introduction: A new normalization technique, QK-Norm (Dehghani et al., 2023), is introduced into the attention mechanism. It normalizes the query and key vectors before their dot product and the softmax, aiming to ensure stable training for large models (see the code sketch after Table 1).

    The specific architectural details for dense models are provided in Table 1.

The following are the results from Table 1 of the original paper:

Models Layers Heads (Q / KV) Tie Embedding Context Length
Qwen3-0.6B 28 16 / 8 Yes 32K
Qwen3-1.7B 28 16 / 8 Yes 32K
Qwen3-4B 36 32 / 8 Yes 128K
Qwen3-8B 36 32 / 8 No 128K
Qwen3-14B 40 40 / 8 No 128K
Qwen3-32B 64 64 / 8 No 128K
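
To make the attention-side changes concrete, the following is an illustrative PyTorch sketch combining GQA head sharing with per-head RMS normalization of queries and keys (QK-Norm). The head counts mirror the 32/8 configuration in Table 1, but the function omits RoPE, learnable norm scales, causal masking, and the output projection; it is a sketch, not the actual Qwen3 implementation.

```python
import torch
import torch.nn.functional as F

def gqa_with_qk_norm(x, wq, wk, wv, n_q_heads=32, n_kv_heads=8, eps=1e-6):
    """Grouped Query Attention with RMS-normalized queries and keys (QK-Norm)."""
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_q_heads

    q = (x @ wq).view(B, T, n_q_heads, d_head).transpose(1, 2)   # (B, Hq, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)  # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)

    # QK-Norm: RMS-normalize query and key vectors before the attention dot product.
    q = q / torch.sqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k / torch.sqrt(k.pow(2).mean(-1, keepdim=True) + eps)

    # GQA: each KV head is shared by n_q_heads // n_kv_heads query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
```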

MoE Model Architecture: The Qwen3 MoE models share the fundamental architecture with the dense models (i.e., GQA, SwiGLU, RoPE, RMSNorm, QK-Norm). Key specifics for MoE are:

  • Total Experts: 128 total experts.

  • Activated Experts: 8 experts activated per token. This means for each token, the router network selects 8 out of the 128 experts to process it.

  • Fine-grained Expert Segmentation: Similar to Qwen2.5-MoE, this approach likely refers to how experts are structured and utilized, potentially at a sub-layer or fine-grained level within the feed-forward block.

  • Exclusion of Shared Experts: Unlike Qwen2.5-MoE, Qwen3's MoE design excludes shared experts. This means all experts are distinct, fostering better specialization. Shared experts are usually common experts that all tokens access, while other experts are conditionally activated. Removing them implies a more purely conditional computation.

  • Global-batch Load Balancing Loss: This is a training objective (Qiu et al., 2025) applied to the router network of the MoE model. Its purpose is to encourage expert specialization and ensure that the workload is evenly distributed across all experts within a training global batch. This prevents a few experts from becoming overloaded while others remain underutilized, which is crucial for efficient MoE training.

    The specific architectural details for MoE models are provided in Table 2.

The following are the results from Table 2 of the original paper:

Models Layers Heads (Q / KV) # Experts (Total / Activated) Context Length
Qwen3-30B-A3B 48 32 / 4 128 / 8 128K
Qwen3-235B-A22B 94 64 / 4 128 / 8 128K
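
A minimal sketch of top-8-of-128 routing with a Switch-style load-balancing auxiliary loss may help make this concrete. The exact form of Qwen3's global-batch load balancing loss (Qiu et al., 2025) is not reproduced here; the sketch computes the standard per-batch statistics and only notes where the global-batch aggregation would differ.

```python
import torch
import torch.nn.functional as F

def moe_route(hidden, router_weight, n_experts=128, top_k=8):
    """Top-k expert routing with a Switch-style load-balancing auxiliary loss.

    hidden: (num_tokens, d_model); router_weight: (d_model, n_experts).
    Qwen3 reportedly aggregates the balancing statistics over the global batch;
    here they are computed over whatever batch of tokens is passed in.
    """
    logits = hidden @ router_weight
    probs = F.softmax(logits, dim=-1)                 # router probabilities, (tokens, experts)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # 8 experts activated per token

    # f: fraction of routing slots assigned to each expert; p: mean router probability.
    dispatch = F.one_hot(topk_idx, n_experts).float().sum(dim=1)  # (tokens, experts), 0/1
    f = dispatch.mean(dim=0) / top_k
    p = probs.mean(dim=0)
    load_balancing_loss = n_experts * torch.sum(f * p)

    return topk_idx, topk_probs, load_balancing_loss
```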

Tokenizer: Qwen3 models use Qwen's tokenizer (Bai et al., 2023), which implements byte-level byte-pair encoding (BBPE). It has a vocabulary size of 151,669. BBPE tokenizers are robust to unseen characters and can tokenize any input string.
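
Assuming the released checkpoints follow the Qwen/Qwen3-* naming on the Hugging Face Hub, the tokenizer can be inspected with the standard transformers API (the model id below is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # illustrative model id
print(tok.vocab_size)  # BBPE vocabulary (the report cites 151,669 tokens including specials)
print(tok.tokenize("Qwen3 supports 119 languages 你好 مرحبا"))
print(tok("Hello, world!")["input_ids"])
```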

4.2.2. Pre-training

Qwen3's pre-training involves a large, diverse dataset and a three-stage process.

4.2.2.1. Pre-training Data

The pre-training dataset for Qwen3 is significantly expanded compared to Qwen2.5, featuring:

  • Scale: 36 trillion tokens (twice as many as Qwen2.5).
  • Languages: 119 languages and dialects (three times more than Qwen2.5's 29).
  • Diversity: High-quality content across various domains including coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data.

Data Expansion Methods:

  • Multi-modal Extraction: Qwen2.5-VL (Bai et al., 2025), a vision-language model, is used to extract text from a large volume of PDF-like documents. The extracted text is then refined using Qwen2.5 to improve its quality, yielding trillions of additional high-quality tokens.
  • Synthetic Data Generation: Domain-specific Qwen2.5 models are employed to synthesize trillions of text tokens in various formats:
    • Qwen2.5 (Yang et al., 2024b) for general text.
    • Qwen2.5-Math (Yang et al., 2024c) for mathematical content (textbooks, Q&A).
    • Qwen2.5-Coder (Hui et al., 2024) for code-related data (code snippets).
  • Multilingual Data Annotation System: A sophisticated system was developed to annotate over 30 trillion tokens across multiple dimensions (educational value, fields, domains, safety). This supports finer-grained data filtering and combination. The method optimizes data mixture at the instance-level through ablation experiments on small proxy models with these fine-grained labels, a more granular approach than previous efforts that typically optimize at the corpus or domain level.

4.2.2.2. Pre-training Stages

The pre-training process is divided into three distinct stages:

  1. General Stage (S1):

    • Objective: Build a strong foundation of general knowledge and language proficiency.
    • Data: Over 30 trillion tokens.
    • Context Length: 4,096 tokens.
    • Coverage: 119 languages and dialects.
  2. Reasoning Stage (S2):

    • Objective: Enhance the model's reasoning abilities.
    • Data: Approximately 5 trillion higher-quality tokens. The pre-training corpus is optimized by increasing the proportion of STEM, coding, reasoning, and synthetic data.
    • Context Length: Not explicitly stated; the source text reads "a sequence length of Wa", which appears to be a typo. Given that the subsequent long-context stage extends from 4,096 tokens, it most likely refers to the same standard 4,096-token sequence length as S1.
  3. Long Context Stage:

    • Objective: Extend the maximum context length of Qwen3 models.
    • Data: Hundreds of billions of tokens from a high-quality long context corpus.
    • Context Length Extension: Increases from 4,096 to 32,768 tokens.
    • Corpus Composition: 75% of text between 16,384 and 32,768 tokens in length, and 25% of text between 4,096 and 16,384 tokens in length.
    • Techniques for Long Context:
      • Adaptive Base Frequency (ABF) with RoPE: The base frequency of RoPE is increased from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). This modifies the way positional information is embedded to support longer sequences.
      • YARN (Peng et al., 2023): Yet Another RoPE extensioN, a method to effectively extend the context length of RoPE-based models.
      • Dual Chunk Attention (DCA, An et al., 2024): A technique to achieve a four-fold increase in sequence length capacity during inference by processing long contexts in a chunk-wise manner.
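
The ABF adjustment above amounts to recomputing RoPE's inverse frequencies with a larger base, which slows the rotation of low-frequency dimensions so that longer sequences remain distinguishable. A minimal sketch (the head dimension is illustrative):

```python
import torch

def rope_inv_freq(head_dim=128, base=10_000.0):
    # Inverse frequencies used to rotate query/key dimension pairs in RoPE.
    return base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

short_ctx = rope_inv_freq(base=10_000.0)     # base used in the earlier pre-training stages
long_ctx = rope_inv_freq(base=1_000_000.0)   # ABF: raised base for the long-context stage
print(short_ctx[:4], long_ctx[:4])
```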

Scaling Laws for Hyperparameters: Similar to Qwen2.5, scaling laws are developed to predict optimal hyperparameters (e.g., learning rate scheduler, batch size) for each stage. Extensive experiments study the relationship between model architecture, training data, training stage, and optimal training hyperparameters to set the predicted values for each dense or MoE model.

4.2.3. Post-training

The post-training pipeline for Qwen3 is designed with two main objectives:

  1. Thinking Control: To integrate thinking and non-thinking modes and allow fine-grained control over reasoning depth via a thinking budget.

  2. Strong-to-Weak Distillation: To optimize the post-training process for lightweight models by leveraging knowledge from larger flagship models.

    The flagship models follow a sophisticated four-stage post-training process, while smaller models utilize strong-to-weak distillation.

The post-training pipeline is visually represented in Figure 1 (from the original paper).

Figure 1: Post-training pipeline of the Qwen3 series models. The figure illustrates the training stages from base models to the flagship and lightweight models, including Long-CoT cold start, reasoning RL, thinking mode fusion, and general RL.

4.2.3.1. Long-CoT Cold Start

This is the first stage of post-training for flagship models, focusing on developing foundational thinking abilities.

  • Dataset Curation: A comprehensive dataset covering math, code, logical reasoning, and general STEM problems is created. Each problem is paired with verified reference answers or code-based test cases.
  • Two-Phase Filtering:
    1. Query Filtering: Qwen2.5-72B-Instruct is used to:
      • Remove non-verifiable queries (e.g., multiple sub-questions, general text generation).
      • Exclude queries that Qwen2.5-72B-Instruct can answer without CoT reasoning (to ensure only complex problems requiring deeper reasoning are included).
      • Annotate each query's domain to maintain balanced representation.
    2. Response Filtering: For queries with positive Pass@N (meaning a correct solution was found within N attempts), QwQ-32B (Qwen Team, 2025) generates N candidate responses. These responses undergo stringent filtering to remove:
      • Incorrect final answers.
      • Substantial repetition.
      • Guesswork without adequate reasoning.
      • Inconsistencies between thinking and summary content.
      • Inappropriate language mixing or stylistic shifts.
      • Responses overly similar to potential validation items.
  • Objective: Instill foundational reasoning patterns without over-emphasizing immediate reasoning performance, preparing the model for subsequent Reinforcement Learning (RL). This phase aims to minimize training samples and steps.

4.2.3.2. Reasoning RL

The second stage focuses on further enhancing reasoning through Reinforcement Learning.

  • Query-Verifier Pairs: A dataset of 3,995 query-verifier pairs is collected, satisfying four criteria:
    1. Not used in the cold-start phase.
    2. Learnable for the cold-start model.
    3. As challenging as possible.
    4. Covering a broad range of sub-domains.
  • RL Algorithm: GRPO (Shao et al., 2024) is employed to update model parameters. GRPO (Group Relative Policy Optimization) is a policy gradient method that estimates advantages from groups of sampled responses per query, enabling stable and efficient reinforcement learning without a separate value model.
  • Training Benefits: Large batch sizes and a high number of rollouts per query, along with off-policy training, are found beneficial for improving sample efficiency.
  • Exploration-Exploitation Balance: Strategies to balance exploration and exploitation are addressed by controlling the model's entropy to increase steadily or remain stable. This helps maintain stable training and prevents premature convergence.
  • Results: Consistent improvements in both training reward and validation performance are observed without manual hyperparameter intervention. For example, AIME'24 score for Qwen3-235B-A22B increased from 70.1 to 85.1 over 170 RL training steps.
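
The core of GRPO is a group-relative advantage: several responses are sampled per query and each response's reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation (the clipped policy update itself is omitted, and names are illustrative):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: (num_queries, rollouts_per_query) verifier scores for the sampled responses
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # each rollout's advantage relative to its group

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # query 1: two of four rollouts pass the verifier
                        [0.0, 0.0, 1.0, 0.0]])  # query 2: one of four rollouts passes
print(group_relative_advantages(rewards))
```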

4.2.3.3. Thinking Mode Fusion

The third stage integrates non-thinking capabilities into the thinking-capable model developed in previous stages.

  • Objective: Allow developers to manage reasoning behaviors and reduce the cost/complexity of deploying separate models.

  • Approach: Continual supervised fine-tuning (SFT) is performed on the Reasoning RL model.

  • SFT Data Construction:

    • Thinking data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself, ensuring performance is not compromised.
    • Non-thinking data is curated to cover diverse tasks, including conceptual knowledge, summarization, and role-playing. Automatically generated checklists are used to assess response quality.
    • The proportion of translation tasks is increased to enhance performance on low-resource languages.
  • Chat Template Design: A chat template is designed for dynamic mode switching, as shown in Table 9.

    • /think and /no_think flags are introduced in user queries or system messages.

    • For non-thinking mode samples, an empty thinking block (<think></think>) is retained in the assistant's response to ensure internal format consistency and allow developers to explicitly prevent thinking.

    • By default, the model operates in thinking mode; thus, some thinking mode samples without /think flags are included in training.

    • For multi-turn dialogs, multiple /think and /no_think flags can be randomly inserted, with the model adhering to the last encountered flag.

      The following are the results from Table 9 of the original paper:

      Thinking Mode:
      <|im_start|>user
      {query}/think<|im_end|>
      <|im_start|>assistant
      <think>
      {thinking_content}
      </think>
      {response}<|im_end|>

      Non-Thinking Mode:
      <|im_start|>user
      {query}/no_think<|im_end|>
      <|im_start|>assistant
      <think>
      </think>
      {response}<|im_end|>
  • Thinking Budget Mechanism: The model's ability to handle intermediate cases (generating responses based on incomplete thinking) emerges naturally from Thinking Mode Fusion. This forms the basis for budget control.

    • If the model's thinking length reaches a user-defined threshold, the process is manually halted.
    • A stop-thinking instruction is inserted: "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n".
    • The model then generates a final response based on its accumulated reasoning up to that point.
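
A minimal sketch of how such a budget could be enforced at decode time is shown below. The helpers next_token and count_tokens are hypothetical stand-ins for an incremental decoder and a tokenizer, not part of the Qwen3 release; the stop-thinking string is the one quoted above.

```python
STOP_THINKING = ("Considering the limited time by the user, I have to give the "
                 "solution based on the thinking directly now.\n</think>.\n\n")

def generate_with_budget(prompt, next_token, count_tokens,
                         thinking_budget=1024, max_steps=8192):
    """Decode in thinking mode, force-closing the <think> block once the budget is hit."""
    text = prompt + "<think>\n"
    thinking_done = False
    for _ in range(max_steps):
        piece = next_token(text)           # hypothetical single-step decoder: str -> str
        text += piece
        if not thinking_done:
            if "</think>" in piece:
                thinking_done = True       # the model closed the thinking block on its own
            elif count_tokens(text) - count_tokens(prompt) >= thinking_budget:
                text += STOP_THINKING      # halt thinking with the stop-thinking instruction
                thinking_done = True       # the model now answers from its partial reasoning
        if piece.endswith("<|im_end|>"):
            break
    return text
```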

4.2.3.4. General RL

The final stage of post-training aims for broad enhancement of capabilities and stability across diverse scenarios.

  • Sophisticated Reward System: Covers over 20 distinct tasks, each with customized scoring criteria, targeting core capabilities:
    • Instruction Following: Ensures accurate interpretation and execution of user instructions (content, format, length, structured output).
    • Format Following: Adherence to specific formatting conventions, e.g., responding to the /think and /no_think flags and using the <think> and </think> tokens correctly.
    • Preference Alignment: For open-ended queries, improves helpfulness, engagement, and style for a natural user experience.
    • Agent Ability: Training the model to correctly invoke tools via designated interfaces. During RL rollout, the model performs multi-turn interaction cycles with real environment execution feedback to improve long-horizon decision-making.
    • Abilities for Specialized Scenarios: Tasks tailored to specific contexts, e.g., Retrieval-Augmented Generation (RAG) tasks incorporate reward signals to guide accurate and contextually appropriate responses, minimizing hallucination.
  • Three Types of Rewards:
    1. Rule-based Reward: Widely used in reasoning RL and for general tasks like instruction following and format adherence. Provides high precision in assessing correctness and prevents reward hacking.
    2. Model-based Reward with Reference Answer: A reference answer is provided for each query, and Qwen2.5-72B-Instruct scores the model's response against it. Offers flexibility for diverse tasks without strict formatting.
    3. Model-based Reward without Reference Answer: A reward model is trained using human preference data to assign scalar scores to responses. Handles a broader range of queries and enhances engagement and helpfulness.

4.2.3.5. Strong-to-Weak Distillation

This pipeline is specifically for optimizing lightweight models (5 dense models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B, and one MoE model: Qwen3-30B-A3B).

  • Objective: Enhance performance and impart robust mode-switching capabilities efficiently, requiring significantly fewer computational resources than the full four-stage process.
  • Two Primary Phases:
    1. Off-policy Distillation:
      • Combines outputs from teacher models (generated with both /think and /no_think modes) for response distillation.
      • Helps student models develop basic reasoning skills and mode-switching ability, establishing a foundation for the next phase.
    2. On-policy Distillation:
      • Student model generates on-policy sequences for fine-tuning.

      • Prompts are sampled, and the student model produces responses in either /think or /no_think mode.

      • The student model is fine-tuned by aligning its logits with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B). This alignment is typically achieved by minimizing the Kullback-Leibler (KL) divergence between the student's and teacher's soft probabilities (logits converted to probabilities via softmax): \mathcal{L}_{distill} = D_{KL}(P_T \| P_S) = \sum_{x \in \mathcal{V}} P_T(x) \log\left(\frac{P_T(x)}{P_S(x)}\right) Where:

      • P_T(x) is the probability of token x predicted by the teacher model.

      • P_S(x) is the probability of token x predicted by the student model.

      • \mathcal{V} is the vocabulary.

      • D_{KL} denotes the Kullback-Leibler divergence.

        This process enables better Pass@1 scores (immediate performance) and improved exploration ability (Pass@64), while requiring only 1/10 of the GPU hours compared to the four-stage training.
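
A minimal PyTorch sketch of this distillation loss, computed per token over the vocabulary; tensor shapes and names are illustrative, and the real pipeline (on-policy sampling, mode flags, sequence packing) is omitted.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    # logits: (batch, seq_len, vocab_size); loss = KL(P_teacher || P_student)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Usage: the student generates on-policy responses, the teacher re-scores the same tokens,
# and the student is updated to match the teacher's token distribution.
student_logits = torch.randn(2, 16, 151_669, requires_grad=True)
teacher_logits = torch.randn(2, 16, 151_669)
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
```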

5. Experimental Setup

The experimental setup for Qwen3 involved comprehensive evaluations of both pre-trained (base) and post-trained (instruction-tuned) models across a wide array of benchmarks.

5.1. Datasets

The datasets used in the experiments cover general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.

For Pre-trained Base Models (Section 3.3):

  • General Tasks:
    • MMLU (Hendrycks et al., 2021a): Massive Multitask Language Understanding (5-shot). A benchmark covering 57 academic subjects.
    • MMLU-Pro (Wang et al., 2024): An extended version of MMLU (5-shot, CoT).
    • MMLU-redux (Gema et al., 2024): Another variant of MMLU (5-shot).
    • BBH (Suzgun et al., 2023): BIG-Bench Hard (3-shot, CoT). A subset of BIG-Bench tasks designed to be challenging for large language models, often requiring complex reasoning.
    • SuperGPQA (Du et al., 2025): Super Graduate-level Google-proof Q&A (5-shot, CoT). A very challenging question-answering benchmark.
  • Math & STEM Tasks:
    • GPQA (Rein et al., 2023): Graduate-level Google-proof Q&A (5-shot, CoT). Similar to SuperGPQA, requiring deep scientific and mathematical knowledge.
    • GSM8K (Cobbe et al., 2021): Grade School Math 8K (4-shot, CoT). A dataset of elementary school math word problems.
    • MATH (Hendrycks et al., 2021b): A dataset of challenging mathematical problems from high school competitions (4-shot, CoT).
  • Coding Tasks:
    • EvalPlus (Liu et al., 2023a): (0-shot) An evaluation suite for code generation, averaging performance on HumanEval, MBPP, HumanEval+, and MBPP+.
    • MultiPL-E (Cassano et al., 2023): (0-shot) A polyglot benchmark for code generation across multiple programming languages (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript).
    • MBPP-3shot (Austin et al., 2021): Mostly Basic Python Problems (3-shot). A dataset of Python programming problems.
    • CRUX-O of CRUXEval (Gu et al., 2024): (1-shot) A benchmark for code reasoning, understanding, and execution.
  • Multilingual Tasks:
    • MGSM (Shi et al., 2023): Multilingual Grade School Math (8-shot, CoT). Multilingual math word problems.
    • MMMLU (OpenAI, 2024): Multilingual Massive Multitask Language Understanding (5-shot). A multilingual version of MMLU.
    • INCLUDE (Romanou et al., 2024): Evaluating multilingual language understanding with regional knowledge (5-shot).

For Post-trained Models (Section 4.6):

  • General Tasks:
    • MMLU-Redux (Gema et al., 2024).
    • GPQA-Diamond (Rein et al., 2023): A subset of GPQA with very challenging questions. For this benchmark, 10 samples are taken for each query, and the averaged accuracy is reported.
    • C-Eval (Huang et al., 2023): A multilevel multi-discipline Chinese evaluation suite for foundation models.
    • LiveBench (2024-11-25) (White et al., 2024): A challenging, contamination-free LLM benchmark.
  • Alignment Tasks:
    • IFEval (Zhou et al., 2023): Instruction-following evaluation for large language models, reporting strict prompt accuracy.
    • Arena-Hard (Li et al., 2024): A benchmark for evaluating human preferences, derived from crowdsourced data.
    • AlignBench v1.1 (Liu et al., 2023b): Benchmarking Chinese alignment of large language models.
    • Creative Writing V3 (Paech, 2024): Evaluates creative writing proficiency.
    • WritingBench (Wu et al., 2025): A comprehensive benchmark for generative writing.
  • Math & Text Reasoning:
    • MATH-500 (Lightman et al., 2023): A high-level math benchmark.
    • AIME'24 and AIME'25 (AIME, 2025): Problems from the American Invitational Mathematics Examination. For each question, 64 samples are taken, and the average accuracy is reported.
    • ZebraLogic (Lin et al., 2025): A benchmark for logical reasoning, particularly Zebra Puzzles.
    • AutoLogi (Zu et al., 2025): Automated generation of logic puzzles.
  • Agent & Coding:
    • BFCL v3 (Yan et al., 2024): Berkeley Function Calling Leaderboard. Models are evaluated using the FC format, and YaRN is used to extend the context length to 64K for the Multi-Turn evaluation.
    • LiveCodeBench (v5, 2024.10-2025.02) (Jain et al., 2024): A holistic and contamination-free evaluation for code generation. For non-thinking mode, the official prompt is used; for thinking mode, prompt templates are adjusted to allow more free thinking.
    • CodeForces Ratings from CodeElo (Quan et al., 2025): Calculates Elo ratings to compare model performance against competitive programming experts. Each problem is solved by generating up to eight independent reasoning attempts.
  • Multilingual Tasks:
    • Multi-IF (He et al., 2024): Multilingual instruction following (8 key languages).
    • INCLUDE (Romanou et al., 2024): Regional knowledge (44 languages). Only 10% of original data sampled for efficiency.
    • MMMLU (OpenAI, 2024): General knowledge (14 languages, excluding unoptimized Yoruba). Only 10% of original data sampled for efficiency.
    • MT-AIME2024 (Son et al., 2025): Multilingual AIME (55 languages).
    • PolyMath (Wang et al., 2025): Multilingual mathematical reasoning (18 languages).
    • MLogiQA (Zhang et al., 2024): Multilingual logical reasoning (10 languages).
    • Belebele (Bandarkar et al., 2023): A benchmark for natural language understanding in 122 language variants. Evaluated on 80 supported languages.

In-house Benchmarks (for Ablation Studies):

  • CounterFactQA: Contains counterfactual questions where the model needs to identify non-factual nature and avoid hallucination.

  • LengthCtrl: Creative writing tasks with length requirements; score based on difference between generated and target length.

  • ThinkFollow: Multi-turn dialogues with random /think and /no_think flags to test mode switching.

  • ToolUse: Evaluates tool-calling proficiency (intent, format, parameter accuracy) with multi-turn interactions and real environment feedback.

    The following are the results from Table 10 of the original paper:

    Benchmark # Langs Languages
    Multi-IF 8 en, es, fr, hi, it, pt, ru, zh
    INCLUDE 44 ar, az, be, bg, bn, de, el, es, et, eu, fa, fi, fr, he, hi, hr, hu, hy, id, it, ja, ka, kk, ko, lt, mk, ml, ms, ne, nl, pl, pt, ru, sq, sr, ta, te, tl, tr, uk, ur, uz, vi, zh
    MMMLU 14 ar, bn, de, en, es, fr, hi, id, it, ja, ko, pt, sw, zh
    MT-AIME2024 55 af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-Hans, zh-Hant
    PolyMath 18 ar, bn, de, en, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi, zh
    MLogiQA 10 ar, en, es, fr, ja, ko, pt, th, vi, zh

The following are the results from Table 36 of the original paper:

Language family # Langs Language code (ISO 639-3 ISO 15924)
Indo-European 40 por_Latn, deu_Latn, tgk_Cyrl, ces_Latn, nob_Latn, dan_Latn, snd_Arab, spa_Latn, isl_Latn, slv_Latn, eng_Latn, ory_Orya, hrv_Latn, ell_Grek, ukr_Cyrl, pan_Guru, srp_Cyrl, npi_Deva, mkd_Cyrl, guj_Gujr, nld_Latn, swe_Latn, hin_Deva, rus_Cyrl, asm_Beng, cat_Latn, als_Latn, sin_Sinh, urd_Arab, mar_Deva, lit_Latn, slk_Latn, ita_Latn, pol_Latn, bul_Cyrl, afr_Latn, ron_Latn, fra_Latn, ben_Beng, hye_Armn
Sino-Tibetan 3 zho_Hans, mya_Mymr, zho_Hant
Afro-Asiatic 8 heb_Hebr, apc_Arab, acm_Arab, ary_Arab, ars_Arab, arb_Arab, mlt_Latn, arz_Arab
Austronesian 7 ilo_Latn, ceb_Latn, tgl_Latn, sun_Latn, jav_Latn, war_Latn, ind_Latn
Dravidian 4 mal_Mlym, kan_Knda, tel_Telu, tam_Taml
Turkic 4 kaz_Cyrl, azj_Latn, tur_Latn, uzn_Latn
Tai-Kadai 2 tha_Thai, lao_Laoo
Uralic 3 fin_Latn, hun_Latn, est_Latn
Austroasiatic 2 vie_Latn, khm_Khmr
Other 7 eus_Latn, kor_Hang, hat_Latn, swh_Latn, kea_Latn, jpn_Jpan, kat_Geor

5.2. Evaluation Metrics

For every evaluation metric, the following explanations are provided:

  • Accuracy (Acc):

    1. Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a common metric for classification tasks.
    2. Mathematical Formula: \mathrm{Accuracy} = \frac{\mathrm{Number~of~Correct~Predictions}}{\mathrm{Total~Number~of~Predictions}}
    3. Symbol Explanation: Number of Correct Predictions refers to the count of instances where the model's output matches the ground truth. Total Number of Predictions is the total count of instances evaluated.
  • Pass@k:

    1. Conceptual Definition: A metric used primarily in code generation benchmarks. It evaluates the probability that at least one out of k generated solutions for a problem passes all given test cases. It accounts for the stochastic nature of LLM generation.
    2. Mathematical Formula: The unbiased estimator averages over N problems; if C_i candidates are generated for problem i and p_i of them pass all tests: \mathrm{Pass@k} = \frac{1}{N} \sum_{i=1}^N \left[ 1 - \frac{\binom{C_i - p_i}{k}}{\binom{C_i}{k}} \right] Where:
      • N is the number of problems.
      • C_i is the number of code candidates generated for problem i.
      • p_i is the number of passing candidates for problem i.
      • \binom{n}{r} is the binomial coefficient, representing "n choose r".
    3. Symbol Explanation: k is the number of sampled solutions considered per problem. The term 1 - \binom{C_i - p_i}{k} / \binom{C_i}{k} is the probability that at least one of k solutions drawn without replacement from the C_i attempts (of which p_i are correct) passes. A code sketch of this estimator follows this list.
  • Elo Rating (CodeForces):

    1. Conceptual Definition: A method for calculating the relative skill levels of players (or in this case, AI models) in zero-sum games, such as competitive programming. It's a dynamic rating system where a player's rating changes based on the outcome of matches against other players. A higher Elo rating indicates a stronger performance.
    2. Mathematical Formula: The change in Elo rating (\Delta R) after a match is calculated as: \Delta R = K \cdot (S - E) Where:
      • K is the K-factor, a constant that determines the maximum possible adjustment for a single game.
      • S is the actual score (1 for win, 0.5 for draw, 0 for loss).
      • E is the expected score, calculated from the opponent's rating R_O and the player's own rating R_P: E = \frac{1}{1 + 10^{(R_O - R_P)/400}}
    3. Symbol Explanation: K is the sensitivity of the rating change. S is the actual outcome of the competition (e.g., passing a coding problem). E is the expected outcome based on the current ratings. R_O and R_P are the Elo ratings of the opponent and the player, respectively.
  • Strict Prompt Accuracy (IFEval):

    1. Conceptual Definition: Measures how precisely a model adheres to the explicit instructions given in a prompt. It's often a binary metric (pass/fail) for each instruction, focusing on exact compliance rather than general quality.
    2. Mathematical Formula: Not a single universal formula, but rather the percentage of prompts for which all instructions were followed correctly: \mathrm{Strict~Prompt~Accuracy} = \frac{\mathrm{Number~of~Prompts~with~All~Instructions~Followed}}{\mathrm{Total~Number~of~Prompts}}
    3. Symbol Explanation: Number of Prompts with All Instructions Followed is the count of prompts where the model's response correctly fulfilled every specified instruction. Total Number of Prompts is the number of instruction sets evaluated.
  • Average (Avg.):

    1. Conceptual Definition: The arithmetic mean of scores across multiple benchmarks or tasks, used to provide a single aggregate performance indicator.
    2. Mathematical Formula: \mathrm{Average} = \frac{1}{N} \sum_{i=1}^{N} S_i
    3. Symbol Explanation: N is the number of individual scores or benchmarks being averaged. S_i is the score for the i-th benchmark.
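
A small sketch of the Pass@k estimator above, using the numerically stable product form commonly used to evaluate it:

```python
import numpy as np

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples drawn from num_samples candidates passes."""
    if num_samples - num_correct < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(num_samples - num_correct + 1, num_samples + 1))

# Example: 8 candidates per problem, 2 of them correct; report Pass@1 and Pass@4.
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 4))
```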

5.3. Baselines

The Qwen3 models are compared against a comprehensive set of baselines, including both open-source and proprietary models, to demonstrate their competitive standing across various scales and capabilities.

Pre-trained Base Model Baselines:

  • Qwen2.5 Base Models: Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-72B-Base, Qwen2.5-Plus-Base (MoE). These are direct predecessors, showing generational improvements.
  • DeepSeek-V3 Base (Liu et al., 2024a): A large open-source MoE model, representing a strong competitor in terms of scale and architecture.
  • Gemma-3 Base Models (Team et al., 2025): Gemma-3-1B, Gemma-3-4B, Gemma-3-12B, Gemma-3-27B. Google's open-source series.
  • Llama-3 Base Models (Dubey et al., 2024): Llama-3-8B. Meta's popular open-source series.
  • Llama-4 Base Models (Meta-AI, 2025): Llama-4-Maverick, Llama-4-Scout. Next-generation open-source models from Meta, often larger and more capable.

Post-trained (Instruction-tuned) Model Baselines:

  • Proprietary Models:
    • OpenAI-o1 (OpenAI, 2024): A reasoning-focused model from OpenAI.
    • GPT-4o-2024-11-20 (OpenAI, 2024): OpenAI's flagship multimodal model.
    • Gemini2.5-Pro (DeepMind, 2025): Google DeepMind's powerful model.
    • Grok-3-Beta (Think) (xAI, 2025): A reasoning-focused model from xAI.
    • OpenAI-o3-mini (medium) (OpenAI, 2025): A smaller, reasoning-focused model from OpenAI.
    • GPT-4o-mini-2024-07-18: A smaller version of GPT-4o.
  • Open-Source and Qwen Predecessors:
    • DeepSeek-R1 (Guo et al., 2025): A dedicated reasoning model.
    • DeepSeek-V3 (Liu et al., 2024a).
    • Qwen2.5-72B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-1.5B-Instruct. Instruction-tuned versions of Qwen2.5 models.
    • QwQ-32B (Qwen Team, 2025): Qwen's previous strongest reasoning model.
    • LLaMA-4-Maverick (Meta-AI, 2025), LLaMA-4-Scout (Meta-AI, 2025).
    • LLaMA-3.1-8B-Instruct (Dubey et al., 2024).
    • Gemma-3-27B-IT, Gemma-3-12B-IT, Gemma-3-4B-IT, Gemma-3-1B-IT. Instruction-tuned versions of Gemma-3 models.
    • Phi-4 (Abdin et al., 2024), Phi-4-mini.
    • DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Llama-8B: Distilled versions of DeepSeek-R1, used as baselines for strong-to-weak distillation.

5.4. Hyperparameters

The following hyperparameters are used for evaluation:

  • For Thinking Mode (Qwen3 models):
    • Sampling Temperature: 0.6
    • Top-p: 0.95
    • Top-k: 20
    • Presence Penalty: 1.5 (specifically for Creative Writing v3 and WritingBench to encourage diverse content).
  • For Non-Thinking Mode (Qwen3 models):
    • Sampling Temperature: 0.7
    • Top-p: 0.8
    • Top-k: 20
    • Presence Penalty: 1.5
  • Maximum Output Length: 32,768 tokens, except for AIME'24 and AIME'25 where it is extended to 38,912 tokens to provide sufficient thinking space for complex math problems.
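
These settings map directly onto standard decoding parameters. Below is a minimal sketch of how they could be passed to an inference engine, here using vLLM's SamplingParams; the serving stack and the checkpoint name are illustrative assumptions rather than something the report prescribes.

```python
from vllm import LLM, SamplingParams  # assumes a vLLM-based serving setup

# Thinking-mode decoding settings reported for the Qwen3 evaluations.
thinking_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=20,
    max_tokens=32768,          # extended to 38912 for AIME'24/'25 per the report
    presence_penalty=1.5,      # reported as used only for the creative-writing benchmarks
)

# Non-thinking-mode decoding settings.
non_thinking_params = SamplingParams(
    temperature=0.7, top_p=0.8, top_k=20,
    presence_penalty=1.5, max_tokens=32768,
)

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint name
outputs = llm.generate(["Explain why the sum of two even numbers is even."],
                       non_thinking_params)
print(outputs[0].outputs[0].text)
```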

6. Results & Analysis

The experimental results are presented for both pre-trained (base) and post-trained (instruction-tuned) models, covering a wide range of benchmarks and comparing Qwen3 against various state-of-the-art baselines.

6.1. Pre-training Evaluation

The evaluation of base models focuses on general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities.

Summary of Evaluation Results for Qwen3 Base Models:

  1. Flagship Model Performance: Qwen3-235B-A22B-Base outperforms most previously open-source SOTA dense and MoE base models (e.g., DeepSeek-V3 Base, Llama-4-Maverick Base, Qwen2.5-72B-Base) across most tasks, often with significantly fewer total or activated parameters.
  2. MoE Efficiency: Qwen3 MoE base models (Qwen3-30B-A3B-Base, Qwen3-235B-A22B-Base) achieve performance similar to Qwen3 dense base models with only 1/5 activated parameters, demonstrating high efficiency. They also outperform Qwen2.5 MoE models with fewer activated parameters. Notably, a Qwen3 MoE model can achieve comparable performance to a Qwen2.5 dense base model with 1/10 of the activated parameters, indicating significant inference and training cost advantages.
  3. Dense Model Improvements: Qwen3 dense base models show comparable performance to Qwen2.5 base models at higher parameter scales. Specifically, smaller Qwen3 dense models (1.7B, 4B, 8B, 14B, 32B) often surpass or match larger Qwen2.5 counterparts (3B, 7B, 14B, 32B, 72B), especially in STEM, coding, and reasoning benchmarks.

6.1.1. Qwen3-235B-A22B-Base

The following are the results from Table 3 of the original paper:

Qwen2.5-72B Base Qwen2.5-Plus Base Llama-4-Maverick Base DeepSeek-V3 Base Qwen3-235B-A22B Base
Architecture Dense MoE MoE MoE MoE
# Total Params 72B 271B 402B 671B 235B
# Activated Params 72B 37B 17B 37B 22B
General Tasks
MMLU 86.06 85.02 85.16 87.19 87.81
MMLU-Redux 83.91 82.69 84.05 86.14 87.40
MMLU-Pro 58.07 63.52 63.91 59.84 68.18
SuperGPQA 36.20 37.18 40.85 41.53 44.06
BBH 86.30 85.60 83.62 86.22 88.87
Math & STEM Tasks
GPQA 45.88 41.92 43.94 41.92 47.47
GSM8K 91.50 91.89 87.72 87.57 94.39
MATH 62.12 62.78 63.32 62.62 71.84
Coding Tasks
EvalPlus 65.93 61.43 68.38 63.75 77.60
MultiPL-E 58.70 62.16 57.28 62.26 65.94
MBPP 76.00 74.60 75.40 74.20 81.40
CRUX-O 66.20 68.50 77.00 76.60 79.00
Multilingual Tasks
MGSM 82.40 82.21 79.69 82.68 83.53
MMMLU 84.40 83.49 83.09 85.88 86.70
INCLUDE 69.05 66.97 73.47 75.17 73.46

Analysis of Qwen3-235B-A22B-Base:

  • Overall Dominance: Qwen3-235B-A22B-Base achieves the highest scores in most benchmarks, significantly outperforming competitors.
  • Vs. Llama-4-Maverick: Despite Llama-4-Maverick having roughly twice the total parameters, Qwen3-235B-A22B-Base performs better on most benchmarks, indicating superior architectural or training efficiency.
  • Vs. DeepSeek-V3: Qwen3-235B-A22B-Base outperforms DeepSeek-V3-Base on 14 out of 15 benchmarks with only about 1/3 of its total parameters and 2/3 of its activated parameters, demonstrating impressive power and cost-effectiveness.
  • Vs. Qwen2.5-Plus (MoE): Qwen3-235B-A22B-Base significantly outperforms its previous MoE counterpart, Qwen2.5-Plus, with fewer parameters and activated parameters, highlighting advancements in pre-training data, strategy, and architecture.
  • Vs. Qwen2.5-72B (Dense): The Qwen3 MoE flagship model surpasses the Qwen2.5-72B-Base (dense) in all benchmarks while using less than 1/3 of its activated parameters. This translates to much cheaper inference and training costs per trillion tokens.

6.1.2. Qwen3-32B-Base

The following are the results from Table 4 of the original paper:

Qwen2.5-32B Base Qwen2.5-72B Base Gemma-3-27B Base Llama-4-Scout Base Qwen3-32B Base
Architecture Dense Dense Dense MoE Dense
# Total Params 32B 72B 27B 109B 32B
# Activated Params 32B 72B 27B 17B 32B
General Tasks
MMLU 83.32 86.06 78.69 78.27 83.61
MMLU-Redux 81.97 83.91 76.53 71.09 83.41
MMLU-Pro 55.10 58.07 52.88 56.13 65.54
SuperGPQA 33.55 36.20 29.87 26.51 39.78
BBH 84.48 86.30 79.95 82.40 87.38
Math & STEM Tasks
GPQA 47.97 45.88 26.26 40.40 49.49
GSM8K 92.87 91.50 81.20 85.37 93.40
MATH 57.70 62.12 51.78 51.66 61.62
Coding Tasks
EvalPlus 66.25 65.93 55.78 59.90 72.05
MultiPL-E 58.30 58.70 45.03 47.38 67.06
MBPP 73.60 76.00 68.40 68.60 78.20
CRUX-O 67.80 66.20 60.00 61.90 72.50
Multilingual Tasks
MGSM 78.12 82.40 73.74 79.93 83.06
MMMLU 82.40 84.40 77.62 74.83 83.83
INCLUDE 64.35 69.05 68.94 68.09 67.87

Analysis of Qwen3-32B-Base:

  • Vs. Similar-sized Models: Qwen3-32B-Base outperforms Qwen2.5-32B-Base and Gemma-3-27B-Base on most benchmarks, particularly showing significant leads in MMLU-Pro, SuperGPQA, and coding benchmarks.
  • Vs. Larger Predecessor: Surprisingly, Qwen3-32B-Base, with less than half the parameters of Qwen2.5-72B-Base, outperforms it in 10 out of 15 benchmarks, especially in coding, mathematics, and reasoning. This indicates substantial improvements in efficiency and capability for its size.
  • Vs. MoE Baseline: Qwen3-32B-Base significantly outperforms Llama-4-Scout-Base on all 15 benchmarks, even though Llama-4-Scout has more than three times the total parameters (109B vs. 32B). Note, however, that Llama-4-Scout activates only 17B parameters per token, roughly half of Qwen3-32B-Base's 32B.

6.1.3. Qwen3-14B-Base & Qwen3-30B-A3B-Base

The following are the results from Table 5 of the original paper:

Gemma-3-12B Base Qwen2.5-14B Base Qwen2.5-32B Base Qwen2.5-Turbo Base Qwen3-14B Base Qwen3-30B-A3B Base
Architecture Dense Dense Dense MoE Dense MoE
# Total Params 12B 14B 32B 42B 14B 30B
# Activated Params 12B 14B 32B 6B 14B 3B
General Tasks
MMLU 73.87 79.66 83.32 79.50 81.05 81.38
MMLU-Redux 70.70 76.64 81.97 77.11 79.88 81.17
MMLU-Pro 44.91 51.16 55.10 55.60 61.03 61.49
SuperGPQA 24.61 30.68 33.55 31.19 34.27 35.72
BBH 74.28 78.18 84.48 76.10 81.07 81.54
Math & STEM Tasks
GPQA 31.31 32.83 47.97 41.41 39.90 43.94
GSM8K 78.01 90.22 92.87 88.32 92.49 91.81
MATH 44.43 55.64 57.70 55.60 62.02 59.04
Coding Tasks
EvalPlus 52.65 60.70 66.25 61.23 72.23 71.45
MultiPL-E 43.03 54.79 58.30 53.24 61.69 66.53
MBPP 60.60 69.00 73.60 67.60 73.40 74.40
CRUX-O 52.00 61.10 67.80 60.20 68.60 67.20
Multilingual Tasks
MGSM 64.35 74.68 78.12 70.45 79.20 79.11
MMMLU 72.50 78.34 82.40 79.76 79.69 81.46
INCLUDE 63.34 60.26 64.35 59.25 64.55 67.00

Analysis of Qwen3-14B-Base & Qwen3-30B-A3B-Base:

  • Qwen3-14B-Base Superiority: Qwen3-14B-Base significantly outperforms Gemma-3-12B-Base and Qwen2.5-14B-Base on all 15 benchmarks. It also achieves very competitive results against the much larger Qwen2.5-32B-Base with less than half the parameters.
  • Qwen3-30B-A3B-Base Efficiency: Qwen3-30B-A3B-Base, an MoE model, with only 3 billion activated parameters (1/5 of Qwen2.5-14B-Base's activated parameters), significantly outperforms Qwen2.5-14B-Base on all tasks. It also achieves comparable performance to the larger Qwen3-14B-Base and Qwen2.5-32B-Base, highlighting substantial advantages in inference and training costs due to its efficient MoE architecture.

6.1.4. Qwen3-8B / 4B / 1.7B / 0.6B-Base

The following are the results from Table 6 of the original paper:

Llama-3-8B Base Qwen2.5-7B Base Qwen2.5-14B Base Qwen3-8B Base
Architecture Dense Dense Dense Dense
# Total Params 8B 7B 14B 8B
# Activated Params 8B 7B 14B 8B
General Tasks
MMLU 66.60 74.16 79.66 76.89
MMLU-Redux 61.59 71.06 76.64 76.17
MMLU-Pro 35.36 45.00 51.16 56.73
SuperGPQA 20.54 26.34 30.68 31.64
BBH 57.70 70.40 78.18 78.40
Math & STEM Tasks
GPQA 25.80 36.36 32.83 44.44
GSM8K 55.30 85.36 90.22 89.84
MATH 20.50 49.80 55.64 60.80
Coding Tasks
EvalPlus 44.13 62.18 60.70 67.65
MultiPL-E 31.45 50.73 54.79 58.75
MBPP 48.40 63.40 69.00 69.80
CRUX-O 36.80 48.50 61.10 62.00
Multilingual Tasks
MGSM 38.92 63.60 74.68 76.02
MMMLU 59.65 71.34 78.34 75.72
INCLUDE 44.94 53.98 60.26 59.40

The following are the results from Table 7 of the original paper:

Gemma-3-4B Base Qwen2.5-3B Base Qwen2.5-7B Base Qwen3-4B Base
Architecture Dense Dense Dense Dense
# Total Params 4B 3B 7B 4B
# Activated Params 4B 3B 7B 4B
General Tasks
MMLU 59.51 65.62 74.16 72.99
MMLU-Redux 56.91 63.68 71.06 72.79
MMLU-Pro 29.23 34.61 45.00 50.58
SuperGPQA 17.68 20.31 26.34 28.43
BBH 51.70 56.30 70.40 72.59
Math & STEM Tasks
GPQA 24.24 26.26 36.36 36.87
GSM8K 43.97 79.08 85.36 87.79
MATH 26.10 42.64 49.80 54.10
Coding Tasks
EvalPlus 43.23 46.28 62.18 63.53
MultiPL-E 28.06 39.65 50.73 53.13
MBPP 46.40 54.60 63.40 67.00
CRUX-O 34.00 36.50 48.50 55.00
Multilingual Tasks
MGSM 33.11 47.53 63.60 67.74
MMMLU 59.62 65.55 71.34 71.42
INCLUDE 49.06 45.90 53.98 56.29

The following are the results from Table 8 of the original paper:

Qwen2.5-0.5B Base Qwen3-0.6B Base Gemma-3-1B Base Qwen2.5-1.5B Base Qwen3-1.7B Base
Architecture Dense Dense Dense Dense Dense
# Total Params 0.5B 0.6B 1B 1.5B 1.7B
# Activated Params 0.5B 0.6B 1B 1.5B 1.7B
General Tasks
MMLU 47.50 52.81 26.26 60.90 62.63
MMLU-Redux 45.10 51.26 25.99 58.46 61.66
MMLU-Pro 15.69 24.74 9.72 28.53 36.76
SuperGPQA 11.30 15.03 7.19 17.64 20.92
BBH 20.30 41.47 28.13 45.10 54.47
Math & STEM Tasks
GPQA 24.75 26.77 24.75 24.24 28.28
GSM8K 41.62 59.59 2.20 68.54 75.44
MATH 19.48 32.44 3.66 35.00 43.50
Coding Tasks
EvalPlus 31.85 36.23 8.98 44.80 52.70
MultiPL-E 18.70 24.58 5.15 33.10 42.71
MBPP 29.80 36.60 9.20 43.60 55.40
CRUX-O 12.10 27.00 3.80 29.60 36.40
Multilingual Tasks
MGSM 12.07 30.99 1.74 32.82 50.71
MMMLU 31.53 50.16 26.57 60.27 63.27
INCLUDE 24.74 34.26 25.62 39.55 45.57

Analysis of Smaller Qwen3 Base Models:

  • Consistent Strong Performance: Qwen3-8B, 4B, 1.7B, and 0.6B-Base models consistently maintain strong performance across nearly all benchmarks relative to their size.
  • Outperforming Larger Qwen2.5 Models: Notably, Qwen3-8B, 4B, and 1.7B-Base models even outperform larger Qwen2.5-14B, 7B, and 3B-Base models, respectively, on over half of the benchmarks. This is particularly evident in STEM-related and coding benchmarks, reflecting a significant generational improvement.

6.2. Post-training Evaluation

The post-trained models are evaluated for their instruction-following, alignment, reasoning, agent, coding, and multilingual abilities under both thinking and non-thinking modes.

Summary of Evaluation Results for Finalized Qwen3 Models:

  1. Flagship SOTA: Qwen3-235B-A22B achieves state-of-the-art overall performance among open-source models in both thinking and non-thinking modes, surpassing strong baselines like DeepSeek-R1 and DeepSeek-V3. It also demonstrates strong competitiveness against closed-source leaders such as OpenAI-o1, Gemini2.5-Pro, and GPT-4o.
  2. Flagship Dense Model (32B) Excellence: Qwen3-32B outperforms the previous strongest reasoning model, QwQ-32B, on most benchmarks, setting a new SOTA for its size. It competes comparably with OpenAI-o3-mini (closed-source) and excels in non-thinking mode, surpassing Qwen2.5-72B-Instruct.
  3. Lightweight Model Success: Lightweight models (Qwen3-30B-A3B, Qwen3-14B, and smaller dense models) consistently show superior performance compared to open-source models with similar or larger parameter counts. This validates the effectiveness of the Strong-to-Weak Distillation approach.

6.2.1. Qwen3-235B-A22B

The following are the results from Table 11 of the original paper:

OpenAI-o1 DeepSeek-R1 Grok-3-Beta (Think) Gemini2.5-Pro Qwen3-235B-A22B
Architecture - MoE - - MoE
# Activated Params - 37B - - 22B
# Total Params - 671B - - 235B
MMLU-Redux 92.8 92.9 93.7 92.7
General Tasks GPQA-Diamond 78.0 71.5 80.2 84.0 71.1
C-Eval 85.5 91.8 82.9 89.6
LiveBench 2024-11-25 75.7 71.6 - 82.4 77.1
IFEval strict prompt 92.6 83.3 - 89.5 83.4
Alignment Tasks Arena-Hard 92.1 92.3 96.4 95.6
AlignBench v1.1 8.86 8.76 9.03 8.94
Creative Writing v3 81.7 85.5 86.0 84.6
WritingBench 7.69 7.71 8.09 8.03
MATH-500 96.4 97.3 98.8 98.0
Math & Text Reasoning AIME'24 74.3 79.8 83.9 92.0 85.7
AIME'25 79.2 70.0 77.3 86.7 81.5
ZebraLogic 81.0 78.7 - 87.4 80.3
AutoLogi 79.8 86.1 - 85.4 89.0
BFCL v3 67.8 56.9 - 62.9 70.8
Agent & Coding LiveCodeBench v5 63.9 64.3 70.6 70.4 70.7
CodeForces (Rating / Percentile) 1891 / 96.7% 2029 / 98.1% - 2001 / 97.9% 2056 / 98.2%
Multi-IF 48.8 67.7 77.8 71.9
INCLUDE 84.6 82.7 85.1 78.7
Multilingual Tasks MMMLU 14 languages 88.4 86.4 86.9 84.3
MT-AIME2024 67.4 73.5 76.9 80.8
PolyMath 38.9 47.1 52.2 54.7
MLogiQA 75.5 73.8 75.6 77.1

Analysis of Qwen3-235B-A22B (Thinking Mode):

  • Open-Source Leader: Qwen3-235B-A22B (Thinking) outperforms DeepSeek-R1 on 17/23 benchmarks despite having fewer activated parameters (22B vs. 37B) and significantly fewer total parameters (235B vs. 671B). Its performance is particularly strong in reasoning-demanded tasks like mathematics, agent, and coding.

  • Competitiveness with Proprietary Models: It is highly competitive with closed-source models such as OpenAI-o1, Grok-3-Beta (Think), and Gemini2.5-Pro, substantially narrowing the performance gap in reasoning capabilities. For instance, it achieves the highest CodeForces rating (2056 / 98.2%).

    The following are the results from Table 12 of the original paper:

    GPT-4o-2024-11-20 DeepSeek-V3 Qwen2.5-72B-Instruct LLaMA-4-Maverick Qwen3-235B-A22B
    Architecture - MoE Dense MoE MoE
    # Activated Params - 37B 72B 17B 22B
    # Total Params - 671B 72B 402B 235B
    General Tasks MMLU-Redux 87.0 89.1 86.8 91.8 89.2
    GPQA-Diamond 46.0 59.1 49.0 69.8 62.9
    C-Eval 75.5 86.5 84.7 83.5 86.1
    LiveBench 2024-11-25 52.2 60.5 51.4 59.5 62.5
    Alignment Tasks IFEval strict prompt 86.5 86.1 84.1 86.7 83.2
    Arena-Hard 85.3 85.5 81.2 82.7 96.1
    AlignBench v1.1 8.42 8.64 7.89 7.97 8.91
    Creative Writing v3 81.1 74.0 61.8 61.3 80.4
    WritingBench 7.11 6.49 7.06 5.46 7.70
    Math & Text Reasoning MATH-500 77.2 90.2 83.6 90.6 91.2
    AIME'24 11.1 39.2 18.9 38.5 40.1
    AIME'25 7.6 28.8 15.0 15.9 24.7
    ZebraLogic 27.4 42.1 26.6 40.0 37.7
    AutoLogi 65.9 76.1 66.1 75.2 83.3
    Agent & Coding BFCL v3 72.5 57.6 63.4 52.9 68.0
    LiveCodeBench v5 32.7 33.1 30.7 37.2 35.3
    CodeForces (Rating / Percentile) 864 / 35.4% 1134 / 54.1% 859 / 35.0% 712 / 24.3% 1387 / 75.7%
    Multilingual Tasks Multi-IF 65.6 55.6 65.3 75.5 70.2
    INCLUDE 78.8 76.7 69.6 80.9 75.6
    MMMLU 14 languages 80.3 81.1 76.9 82.5 79.8
    MT-AIME2024 9.2 20.9 12.7 27.0 32.4
    PolyMath 13.7 20.4 16.9 26.1 27.0
    MLogiQA 57.4 58.9 59.3 59.9 67.6

Analysis of Qwen3-235B-A22B (Non-thinking Mode):

  • Superiority over Open-Source: Qwen3-235B-A22B (Non-thinking) exceeds other leading open-source models like DeepSeek-V3, LLaMA-4-Maverick, and Qwen2.5-72B-Instruct.
  • Outperforms GPT-4o-2024-11-20: It surpasses the closed-source GPT-4o-2024-11-20 in 18/23 benchmarks, indicating strong inherent capabilities even without explicit reasoning steps. This highlights its robust general performance for rapid responses.

6.2.2. Qwen3-32B

The following are the results from Table 13 of the original paper:

DeepSeek-R1-Distill-Llama-70B QwQ-32B OpenAI-o3-mini (medium) Qwen3-32B
Architecture Dense Dense - Dense
# Activated Params 70B 32B - 32B
# Total Params 70B 32B - 32B
General Tasks MMLU-Redux 89.3 90.0 90.0 90.9
GPQA-Diamond 65.2 65.6 76.8 68.4
C-Eval 71.8 88.4 75.1 87.3
LiveBench 2024-11-25 54.5 72.0 70.0 74.9
Alignment Tasks IFEval strict prompt 79.3 83.9 91.5 85.0
Arena-Hard 60.6 89.5 89.0 93.8
AlignBench v1.1 6.74 8.70 8.38 8.72
Creative Writing v3 62.1 82.4 74.8 81.0
WritingBench 6.08 7.86 7.52 7.90
Math & Text Reasoning MATH-500 94.5 98.0 98.0 97.2
AIME'24 70.0 79.5 79.6 81.4
AIME'25 56.3 69.5 74.8 72.9
ZebraLogic 71.3 76.8 88.9 88.8
AutoLogi 83.5 88.1 86.3 87.3
Agent & Coding BFCL v3 49.3 66.4 64.6 70.3
LiveCodeBench v5 54.5 62.7 66.3 65.7
CodeForces (Rating / Percentile) 1633 / 91.4% 1982 / 97.7% 2036 / 98.1% 1977 / 97.7%
Multilingual Tasks Multi-IF 57.6 68.3 48.4 73.0
INCLUDE 62.1 69.7 73.1 73.7
MMMLU 14 languages 69.6 80.9 79.3 80.6
MT-AIME2024 29.3 68.0 73.9 75.0
PolyMath 29.4 45.9 38.6 47.4
MLogiQA 60.3 75.5 71.1 76.3

Analysis of Qwen3-32B (Thinking Mode):

  • New SOTA at 32B: Qwen3-32B (Thinking) outperforms QwQ-32B on 17/23 benchmarks, establishing it as the new state-of-the-art reasoning model for its size.

  • Competitiveness with Proprietary Models: It competes well with the closed-source OpenAI-o3-mini (medium), particularly excelling in alignment and multilingual performance.

    The following are the results from Table 14 of the original paper:

    GPT-4o-mini-2024-07-18 LLaMA-4-Scout Qwen2.5-72B-Instruct Qwen3-32B
    Architecture - MoE Dense Dense
    # Activated Params - 17B 72B 32B
    # Total Params - 109B 72B 32B
    General Tasks MMLU-Redux 81.5 86.3 86.8 85.7
    GPQA-Diamond 40.2 57.2 49.0 54.6
    C-Eval 66.3 78.2 84.7 83.3
    LiveBench 2024-11-25 41.3 47.6 51.4 59.8
    Alignment Tasks IFEval strict prompt 80.4 84.7 84.1 83.2
    Arena-Hard 74.9 70.5 81.2 92.8
    AlignBench v1.1 7.81 7.49 7.89 8.58
    Creative Writing v3 70.3 55.0 61.8 78.3
    WritingBench 5.98 5.49 7.06 7.54
    Math & Text Reasoning MATH-500 78.2 82.6 83.6 88.6
    AIME'24 8.1 28.6 18.9 31.0
    AIME'25 8.8 10.0 15.0 20.2
    ZebraLogic 20.1 24.2 26.6 29.2
    AutoLogi 52.6 56.8 66.1 78.5
    Agent & Coding BFCL v3 64.0 45.4 63.4 63.0
    LiveCodeBench v5 27.9 29.8 30.7 31.3
    CodeForces (Rating / Percentile) 1113 / 52.6% 981 / 43.7% 859 / 35.0% 1353 / 71.0%
    Multilingual Tasks Multi-IF 62.4 64.2 65.3 70.7
    INCLUDE 66.0 74.1 69.6 70.9
    MMMLU 14 languages 72.1 77.5 76.9 76.5
    MT-AIME2024 6.0 19.1 12.7 24.1
    PolyMath 12.0 20.9 16.9 22.5
    MLogiQA 42.6 53.9 59.3 62.9

Analysis of Qwen3-32B (Non-thinking Mode):

  • Superiority: Qwen3-32B (Non-thinking) exhibits superior performance to almost all baselines.
  • Vs. Larger Predecessor: It performs on par with Qwen2.5-72B-Instruct on general tasks, but with significant advantages in alignment, multilingual, and reasoning-related tasks, showcasing fundamental improvements over the Qwen2.5 series.

6.2.3. Qwen3-30B-A3B & Qwen3-14B

The following are the results from Table 15 of the original paper:

DeepSeek-R1-Distill-Qwen-32B QwQ-32B Qwen3-14B Qwen3-30B-A3B
Architecture Dense Dense Dense MoE
# Activated Params 32B 32B 14B 3B
# Total Params 32B 32B 14B 30B
General Tasks MMLU-Redux 88.2 90.0 88.6 89.5
GPQA-Diamond 62.1 65.6 64.0 65.8
C-Eval 82.2 88.4 86.2 86.6
LiveBench 2024-11-25 45.6 72.0 71.3 74.3
Alignment Tasks IFEval strict prompt 72.5 83.9 85.4 86.5
Arena-Hard 60.8 89.5 91.7 91.0
AlignBench v1.1 7.25 8.70 8.56 8.70
Creative Writing v3 55.0 82.4 80.3 79.1
WritingBench 6.13 7.86 7.80 7.70
Math & Text Reasoning MATH-500 94.3 98.0 96.8 98.0
AIME'24 72.6 79.5 79.3 80.4
AIME'25 49.6 69.5 70.4 70.9
ZebraLogic 69.6 76.8 88.5 89.5
AutoLogi 74.6 88.1 89.2 88.7
Agent & Coding BFCL v3 53.5 66.4 70.4 69.1
LiveCodeBench v5 54.5 62.7 63.5 62.6
CodeForces (Rating / Percentile) 1691 / 93.4% 1982 / 97.7% 1766 / 95.3% 1974 / 97.7%
Multilingual Tasks Multi-IF 31.3 68.3 74.8 72.2
INCLUDE 68.0 69.7 71.7 71.9
MMMLU 14 languages 78.6 80.9 77.9 78.4
MT-AIME2024 44.6 68.0 73.3 73.9
PolyMath 35.1 45.9 45.8 46.1
MLogiQA 63.3 75.5 71.1 70.1

Analysis of Qwen3-30B-A3B / Qwen3-14B (Thinking Mode):

  • Strong-to-Weak Distillation Success: Both Qwen3-30B-A3B and Qwen3-14B (Thinking) are highly competitive with QwQ-32B, especially in reasoning benchmarks.

  • Efficiency of MoE Distillation: Qwen3-30B-A3B achieves comparable performance to QwQ-32B despite having a smaller total size (30B vs. 32B) and significantly fewer activated parameters (3B vs. 32B), demonstrating the effectiveness of the Strong-to-Weak Distillation in endowing lightweight MoE models with profound reasoning.

    The following are the results from Table 16 of the original paper:

    Phi-4 Gemma-3-27B-IT Qwen2.5-32B-Instruct Qwen3-14B Qwen3-30B-A3B
    Architecture Dense Dense Dense Dense MoE
    # Activated Params 14B 27B 32B 14B 3B
    # Total Params 14B 27B 32B 14B 30B
    MMLU-Redux 85.3 82.6 83.9 82.0 84.1
    General Tasks GPQA-Diamond 56.1 42.4 49.5 54.8 54.8
    C-Eval 66.9 66.6 80.6 81.0 82.9
    LiveBench 2024-11-25 41.6 49.2 50.0 59.6 59.4
    IFEval strict prompt 62.1 80.6 79.5 84.8 83.7
    Alignment Tasks Arena-Hard 75.4 86.8 74.5 86.3 88.0
    AlignBench v1.1 7.61 7.80 7.71 8.52 8.55
    Creative Writing v3 51.2 82.0 54.6 73.1 68.1
    WritingBench 5.73 7.22 5.90 7.24 7.22
    Math & Text Reasoning MATH-500 80.8 90.0 84.6 90.0 89.8
    AIME'24 22.9 32.6 18.8 31.7 32.8
    AIME'25 17.3 24.0 12.8 23.3 21.6
    ZebraLogic 32.3 24.6 26.1 33.0 33.2
    AutoLogi 66.2 64.2 65.5 82.0 81.5
    Agent & Coding BFCL v3 47.0 59.1 62.8 61.5 58.6
    LiveCodeBench v5 25.2 26.9 26.4 29.0 29.8
    CodeForces (Rating / Percentile) 1280 / 65.3% 1063 / 49.3% 903 / 38.2% 1200 / 58.6% 1267 / 64.1%
    Multilingual Tasks Multi-IF 49.5 69.8 63.2 72.9 70.8
    INCLUDE 65.3 71.4 67.5 67.8 67.8
    MMMLU 14 languages 74.7 76.1 74.2 72.6 73.8
    MT-AIME2024 13.1 23.0 15.3 23.2 24.6
    PolyMath 17.4 20.3 18.3 22.0 23.3
    MLogiQA 53.1 58.5 58.0 58.9 53.3

Analysis of Qwen3-30B-A3B / Qwen3-14B (Non-thinking Mode):

  • Outperforming Baselines: Both models surpass non-reasoning baselines in most benchmarks.
  • Efficiency: They exceed Qwen2.5-32B-Instruct with significantly fewer activated and total parameters, enabling more efficient and cost-effective performance.

6.2.4. Qwen3-8B / 4B / 1.7B / 0.6B

The following are the results from Table 17 of the original paper:

DeepSeek-R1-Distill-Qwen-14B DeepSeek-R1-Distill-Qwen-32B Qwen3-4B Qwen3-8B
Architecture Dense Dense Dense Dense
# Activated Params 14B 32B 4B 8B
# Total Params 14B 32B 4B 8B
General Tasks MMLU-Redux 84.1 88.2 83.7 87.5
GPQA-Diamond 59.1 62.1 55.9 62.0
C-Eval 78.1 82.2 77.5 83.4
LiveBench 2024-11-25 52.3 45.6 63.6 67.1
Alignment Tasks IFEval strict prompt 72.6 72.5 81.9 85.0
Arena-Hard 48.0 60.8 76.6 85.8
AlignBench v1.1 7.43 7.25 8.30 8.46
Creative Writing v3 54.2 55.0 61.1 75.0
WritingBench 6.03 6.13 7.35 7.59
Math & Text Reasoning MATH-500 93.9 94.3 97.0 97.4
AIME'24 69.7 72.6 73.8 76.0
AIME'25 44.5 49.6 65.6 67.3
ZebraLogic 59.1 69.6 81.0 84.8
AutoLogi 78.6 74.6 87.9 89.1
Agent & Coding BFCL v3 49.5 53.5 65.9 68.1
LiveCodeBench v5 45.5 54.5 54.2 57.5
CodeForces (Rating / Percentile) 1574 / 89.1% 1691 / 93.4% 1671 / 92.8% 1785 / 95.6%
Multilingual Tasks Multi-IF 29.8 31.3 66.3 71.2
INCLUDE 59.7 68.0 61.8 67.8
MMMLU 14 languages 73.8 78.6 69.8 74.4
MT-AIME2024 33.7 44.6 60.7 65.4
PolyMath 28.6 35.1 40.0 42.7
MLogiQA 53.6 63.3 65.9 69.0

The following are the results from Table 18 of the original paper:

    LLaMA-3.1-8B-Instruct Gemma-3-12B-IT Qwen2.5-7B-Instruct Qwen2.5-14B-Instruct Qwen3-4B Qwen3-8B
Architecture Dense Dense Dense Dense Dense Dense
# Activated Params 8B 12B 7B 14B 4B 8B
# Total Params 8B 12B 7B 14B 4B 8B
General Tasks MMLU-Redux 61.7 77.8 75.4 80.0 77.3 79.5
    GPQA-Diamond 32.8 40.9 36.4 45.5 41.7 39.3
    C-Eval 52.0 61.1 76.2 78.0 72.2 77.9
LiveBench 2024-11-25 26.0 43.7 34.9 42.2 48.4 53.5
Alignment Tasks IFEval strict prompt 75.0 80.2 71.2 81.0 81.2 83.0
Arena-Hard 30.1 82.6 52.0 68.3 66.2 79.6
AlignBench v1.1 6.01 7.77 7.27 7.67 8.10 8.38
    Creative Writing v3 52.8 79.9 49.8 55.8 53.6 64.5
    WritingBench 4.57 7.05 5.82 5.93 6.85 7.15
Math & Text Reasoning MATH-500 54.8 85.6 77.6 83.4 84.8 87.4
AIME'24 6.3 22.4 9.1 15.2 25.0 29.1
AIME'25 2.7 18.8 12.1 13.6 19.1 20.9
ZebraLogic 12.8 58.9 12.0 19.7 35.2 26.7
AutoLogi 30.9 76.3 42.9 57.4 76.3 76.5
Agent & Coding BFCL v3 49.6 50.6 55.8 58.7 57.6 60.2
    LiveCodeBench v5 10.8 25.7 14.4 21.9 21.3 22.8
    CodeForces (Rating / Percentile) 473 / 14.9% 462 / 14.7% 191 / 0.0% 904 / 38.3% 842 / 33.7% 1110 / 52.4%
Multilingual Tasks Multi-IF 52.1 65.6 47.7 55.5 61.3 69.2
INCLUDE 34.0 65.3 53.6 63.5 53.8 62.5
MMMLU 14 languages 44.4 70.0 61.4 70.3 61.7 66.9
MT-AIME2024 0.4 16.7 5.5 8.5 13.9 16.6
PolyMath 5.8 17.6 11.9 15.0 16.6 18.8
MLogiQA 41.9 54.5 49.5 51.3 49.9 51.4

The following are the results from Table 19 of the original paper:

DeepSeek-R1-Distill-Qwen-1.5B DeepSeek-R1-Distill-Llama-8B Qwen3-0.6B Qwen3-1.7B
Architecture Dense Dense Dense Dense
# Activated Params 1.5B 8B 0.6B 1.7B
# Total Params 1.5B 8B 0.6B 1.7B
General Tasks MMLU-Redux 45.4 66.4 55.6 73.9
GPQA-Diamond 33.8 49.0 27.9 40.1
C-Eval 27.1 50.4 50.4 68.1
LiveBench 2024-11-25 24.9 40.6 30.3 51.1
Alignment Tasks IFEval strict prompt 39.9 59.0 59.2 72.5
Arena-Hard 4.5 17.6 8.5 43.1
AlignBench v1.1 5.00 6.24 6.10 7.60
Creative Writing v3 16.4 51.1 30.6 48.0
WritingBench 4.03 5.42 5.61 7.02
Math & Text Reasoning MATH-500 83.9 89.1 77.6 93.4
AIME'24 28.9 50.4 10.7 48.3
AIME'25 22.8 27.8 15.1 36.8
ZebraLogic 4.9 37.1 30.3 63.2
AutoLogi 19.1 63.4 61.6 83.2
Agent & Coding BFCL v3 14.0 21.5 46.4 56.6
LiveCodeBench v5 13.2 42.5 12.3 33.2
CodeForces (Rating / Percentile) 36.1 51.2
Multilingual Tasks Multi-IF 13.3 27.0 35.9 51.8
INCLUDE 21.9 34.5 43.1 59.1
MMMLU 14 languages 27.3 40.1 7.8 36.1
MT-AIME2024 12.4 13.2 11.4 25.2
PolyMath 14.5 10.8 40.9 56.0
MLogiQA 29.0 32.8

The following are the results from Table 20 of the original paper:

Gemma-3-1B-IT Phi-4-mini Qwen2.5-1.5B-Instruct Qwen2.5-3B-Instruct Qwen3-0.6B Qwen3-1.7B
Architecture Dense Dense Dense Dense Dense Dense
# Activated Params 1.0B 3.8B 1.5B 3.1B 0.6B 1.7B
# Total Params 1.0B 3.8B 1.5B 3.1B 0.6B 1.7B
General Tasks MMLU-Redux 33.3 67.9 50.7 64.4 44.6 64.4
GPQA-Diamond 19.2 25.2 29.8 30.3 22.9 28.6
C-Eval 28.5 40.0 53.3 68.2 42.6 61.0
LiveBench 2024-11-25 14.4 25.3 18.0 23.8 21.8 35.6
IFEval strict prompt 54.5 68.6 42.5 58.2 54.5 68.2
Arena-Hard 17.8 32.8 9.0 23.7 6.5 36.9
Alignment Tasks AlignBench v1.1 5.3 6.00 5.60 6.49 5.60 7.20
Creative Writing v3 52.8 10.3 31.5 42.8 28.4 43.6
WritingBench 5.18 4.05 4.67 5.55 5.13 6.54
MATH-500 46.4 67.6 55.0 67.2 55.2 73.0
AIME'24 0.9 8.1 0.9 6.7 3.4 13.4
Math & Text Reasoning AIME'25 0.8 5.3 0.4 4.2 2.6 9.8
ZebraLogic 1.9 2.7 3.4 4.8 4.2 12.8
AutoLogi 16.4 28.8 22.5 29.9 37.4 59.8
BFCL v3 16.3 31.3 47.8 50.4 44.1 52.2
Coding LiveCodeBench v5 1.8 10.4 5.3 9.2 3.6 11.6
Multi-IF 32.8 40.5 20.2 32.3 33.3 44.7
INCLUDE 32.7 43.8 33.1 43.8 34.4 42.6
MMMLU 14 languages 32.5 51.4 40.4 51.8 37.1 48.3
MT-AIME2024 0.2 0.9 0.7 1.6 1.5 4.9
PolyMath 3.5 6.7 5.0 7.3 4.6 10.3
MLogiQA 31.8 39.5 40.9 39.5 37.3 41.1

Analysis of Smaller Qwen3 Models (Thinking and Non-thinking Modes):

  • Edge-side Performance: The smaller Qwen3 models (Qwen3-8B, 4B, 1.7B, 0.6B) exhibit impressive performance, often outperforming baselines with more parameters, including previous Qwen2.5 models, in both thinking and non-thinking modes.
  • Distillation Efficacy: These results further reinforce the efficacy of the Strong-to-Weak Distillation approach, enabling the creation of lightweight Qwen3 models with remarkably reduced costs and efforts while maintaining high capabilities.

6.3. Discussion

6.3.1. The Effectiveness of Thinking Budget

The ability of Qwen3 to enhance its intelligence by leveraging an increased thinking budget is a key innovation.

The following figure (Figure 2 from the original paper) shows the performance of Qwen3-235B-A22B with respect to the thinking budget:

Figure 2: Performance of Qwen3-235B-A22B with respect to the thinking budget. The figure plots four tasks (AIME'24, AIME'25, LiveCodeBench v5, and GPQA Diamond) in both thinking and non-thinking modes; performance rises markedly as the thinking budget increases.

Analysis:

  • Scalable Performance Improvement: As observed in Figure 2, Qwen3-235B-A22B demonstrates a clear and consistent improvement in performance across various benchmarks (AIME'24, AIME'25, LiveCodeBench v5, GPQA Diamond) as the allocated thinking budget (measured in tokens) increases. This validates the design principle that more computational resources dedicated to thinking directly translate to better reasoning outcomes.
  • Smooth Scaling: The scaling curves are smooth, suggesting that the thinking budget mechanism provides a continuous knob for users to trade off latency (due to more thinking tokens) and performance based on task complexity.
  • Future Potential: The authors hypothesize that extending the output length beyond 32K tokens for thinking could yield further performance improvements, suggesting potential for future work in pushing the limits of the thinking budget.
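
One plausible way to realize the thinking-budget knob at inference time is sketched below: cap the number of tokens generated inside the reasoning block, close the block once the cap is reached, and then generate the final answer conditioned on the truncated reasoning. The <think>/</think> tags, the two-pass control flow, and the generate callable are assumptions for illustration; the report describes the budget mechanism only at a high level.

```python
def generate_with_thinking_budget(generate, prompt: str, budget: int) -> str:
    """Sketch of a thinking budget.

    `generate(text, max_new_tokens, stop)` stands for any completion function
    that returns newly generated text for the given context.
    """
    context = prompt + "<think>\n"
    # Pass 1: let the model reason, but for at most `budget` tokens.
    thinking = generate(context, max_new_tokens=budget, stop=["</think>"])
    context += thinking
    if not thinking.rstrip().endswith("</think>"):
        # Budget exhausted before the model closed its reasoning: close it ourselves.
        context += "\n</think>\n"
    # Pass 2: produce the user-facing answer conditioned on the capped reasoning.
    return generate(context, max_new_tokens=2048, stop=None)
```

Under this kind of scheme the budget acts as a direct latency/quality dial: larger caps permit longer reasoning traces, consistent with the smooth scaling curves in Figure 2.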

6.3.2. The Effectiveness and Efficiency of On-Policy Distillation

The on-policy distillation approach, part of the strong-to-weak distillation pipeline, proves to be both effective and highly efficient.

The following are the results from Table 21 of the original paper:

Method AIME'24 AIME'25 MATH500 LiveCodeBench v5 MMLU -Redux GPQA -Diamond GPU Hours
Off-policy Distillation 55.0 (90.0) 42.8 (83.3) 92.4 42.0 86.4 55.6 -
+ Reinforcement Learning 67.6 (90.0) 55.5 (83.3) 94.8 52.9 86.9 61.3 17,920
+ On-policy Distillation 74.4 (93.3) 65.5 (86.7) 97.0 60.3 88.3 63.3 1,800

Analysis:

  • Superior Performance over RL: On-policy Distillation achieves significantly better performance across all listed benchmarks (AIME'24, AIME'25, MATH500, LiveCodeBench v5, MMLU-Redux, GPQA-Diamond) compared to direct Reinforcement Learning when starting from the same off-policy distilled 8B checkpoint. For example, AIME'24 improves from 67.6 (RL) to 74.4 (Distillation).
  • Dramatic Efficiency Gains: This performance gain comes with a remarkable reduction in computational cost. On-policy Distillation requires only 1,800 GPU hours for the Qwen3-8B model, approximately 1/10th of the 17,920 GPU hours needed for Reinforcement Learning. This demonstrates a massive efficiency advantage for training smaller models.
  • Enhanced Exploration: Distillation from teacher logits not only improves direct performance (Pass@1 scores) but also expands the student model's exploration space and reasoning potential, as evidenced by improved Pass@64 scores on AIME'24 (93.3 vs. 90.0) and AIME'25 (86.7 vs. 83.3). In contrast, Reinforcement Learning alone did not lead to any improvement in Pass@64 scores from the initial off-policy checkpoint. This suggests that mimicking a strong teacher's thought process, including its uncertainties and alternative paths (captured by soft probabilities), is more beneficial for learning robustness and exploration than direct reward optimization for exploration.
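
To make the mechanism concrete, here is a heavily simplified sketch of one on-policy distillation step consistent with the description above: the student samples its own responses, and is then updated to match the teacher's token-level distribution (logits) on those samples. The model handles, the omitted prompt-token masking, and the single optimizer step are illustrative assumptions, not the report's implementation.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, tokenizer, prompts, optimizer,
                           max_new_tokens=256):
    """One simplified on-policy distillation step.

    The student generates its own (on-policy) responses, then minimizes the KL
    divergence from the teacher's next-token distribution over those sequences.
    """
    # 1) Sample responses from the current student policy.
    student.eval()
    with torch.no_grad():
        batch = tokenizer(prompts, return_tensors="pt", padding=True)  # assumes pad_token is set
        sequences = student.generate(**batch, do_sample=True,
                                     max_new_tokens=max_new_tokens)

    # 2) Score the sampled sequences under both models.
    student.train()
    student_logits = student(sequences).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, :-1]

    # 3) Token-level KL(teacher || student) as the distillation loss
    #    (prompt tokens are included here for brevity; a real loss would mask them).
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training against the teacher's full distribution rather than a single sampled label is what lets the student inherit the teacher's alternative reasoning paths, which is consistent with the Pass@64 gains noted above.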

6.3.3. The Effects of Thinking Mode Fusion and General RL

The stages of Thinking Mode Fusion and General RL are crucial for integrating non-thinking capabilities, refining instruction following, and enhancing overall model robustness.

The following are the results from Table 22 of the original paper:

Stage 2 Reasoning RL Stage 3 Thinking Mode Fusion Stage 4 General RL
Benchmark Thinking Thinking Non-Thinking Thinking Non-Thinking
General Tasks LiveBench 2024-11-25 68.6 70.9 (+2.3) 57.1 74.9 (+4.0) 59.8 (+2.8)
Arena-Hard 86.8 89.4 (+2.6) 88.5 93.8 (+4.4) 92.8 (+4.3)
CounterFactQA* 50.4 61.3 (+10.9) 64.3 68.1 (+6.8) 66.4 (+2.1)
Instruction & Format Following IFEval strict prompt 73.0 78.4 (+5.4) 78.4 85.0 (+6.6) 83.2 (+4.8)
Multi-IF 61.4 64.6 (+3.2) 65.2 73.0 (+8.4) 70.7 (+5.5)
LengthCtrl* 62.6 70.6 (+8.0) 84.9 73.5 (+2.9) 87.3 (+2.4)
ThinkFollow* - 88.7 98.9 (+10.2)
Agent BFCL v3 69.0 68.4 (-0.6) 61.5 70.3 (+1.9) 63.0 (+1.5)
ToolUse* 63.3 70.4 (+7.1) 73.2 85.5 (+15.1) 86.5 (+13.3)
Knowledge & STEM MMLU-Redux 91.4 91.0 (-0.4) 86.7 90.9 (-0.1) 85.7 (-1.0)
GPQA-Diamond 68.8 69.0 (+0.2) 50.4 68.4 (-0.6) 54.6 (+4.3)
Math & Coding AIME'24 83.8 81.9 (-1.9) 28.5 81.4 (-0.5) 31.0 (+2.5)
LiveCodeBench v5 68.4 67.2 (-1.2) 31.1 65.7 (-1.5) 31.3 (+0.2)

Analysis of Qwen3-32B at Different Stages:

  1. Stage 3 (Thinking Mode Fusion):
    • Initial Mode Switching: The ThinkFollow benchmark score of 88.7 indicates that the model gains an initial, though imperfect, ability to switch between thinking modes.
    • General and Instruction Following Improvements (Thinking Mode): The model shows significant gains in CounterFactQA (+10.9 points) and LengthCtrl (+8.0 points) in thinking mode, demonstrating improved general and instruction-following capabilities.
  2. Stage 4 (General RL):
    • Robust Mode Switching: The ThinkFollow score dramatically improves to 98.9 (+10.2 points), confirming that General RL ensures highly accurate mode switching.
    • Broad Capability Enhancement: General RL further strengthens general (LiveBench, Arena-Hard, CounterFactQA), instruction-following (IFEval, Multi-IF, LengthCtrl), and agent capabilities (BFCL v3, ToolUse*) in both thinking and non-thinking modes. ToolUse* sees a substantial boost (+15.1 in thinking, +13.3 in non-thinking).
  3. Performance Trade-offs for Specialized Tasks:
    • For Knowledge, STEM, Math (AIME'24), and Coding (LiveCodeBench v5) tasks, Thinking Mode Fusion and General RL do not yield significant improvements in thinking mode. In fact, some challenging tasks like AIME'24 and LiveCodeBench show a slight decrease in thinking mode performance after these stages.
    • Conjecture: The authors hypothesize this degradation is due to training on a broader range of general tasks, which might compromise specialized capabilities. This represents a conscious trade-off to enhance the model's overall versatility, acknowledging that improving general robustness might dilute peak performance in highly specialized, complex reasoning tasks.
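
For context on how the mode switching measured by ThinkFollow is typically exercised, the sketch below selects thinking or non-thinking mode through the chat template of the open-source Qwen3 checkpoints. The `enable_thinking` argument and the checkpoint name follow the public release convention but should be treated as assumptions here rather than something this report specifies.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # placeholder checkpoint
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Thinking mode: the chat template leaves room for a <think>...</think> block.
thinking_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

# Non-thinking mode: the reasoning block is suppressed for a direct answer.
direct_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

print(thinking_prompt != direct_prompt)  # the two prompts differ only in the thinking scaffold
```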

6.3.4. Long-Context Ability

The long-context processing capabilities are evaluated using the RULER benchmark.

The following are the results from Table 23 of the original paper:

Model RULER
Avg. 4K 8K 16K 32K 64K 128K
Qwen2.5-7B-Instruct 85.4 96.7 95.1 93.7 89.4 82.3 55.1
Qwen2.5-14B-Instruct 91.4 97.7 96.8 95.9 93.4 86.7 78.1
Qwen2.5-32B-Instruct 92.9 96.9 97.1 95.5 95.5 90.3 82.0
Qwen2.5-72B-Instruct 95.1 97.7 97.2 97.7 96.5 93.0 88.4
Qwen3-4B 85.2 95.1 93.6 91.0 87.8 77.8 66.0
Non-thinking Mode Qwen3-8B 89.1 96.3 96.0 91.8 91.2 82.1 77.4
Qwen3-14B 94.6 98.0 97.8 96.4 96.1 94.0 85.1
Qwen3-32B 93.7 98.4 96.0 96.2 94.4 91.8 85.6
Qwen3-30B-A3B 91.6 96.5 97.0 95.3 92.4 89.1 79.2
Qwen3-235B-A22B 95.0 97.7 97.2 96.4 95.1 93.3 90.6
Thinking Mode Qwen3-4B 83.5 92.7 88.7 86.5 83.2 83.0 67.2
Qwen3-8B 84.4 94.7 94.4 86.1 80.8 78.3 72.0
Qwen3-14B 90.1 95.4 93.6 89.8 91.9 90.6 79.0
Qwen3-32B 91.0 94.7 93.7 91.6 92.5 90.0 83.5
Qwen3-30B-A3B 86.6 94.1 92.7 89.0 86.6 82.1 75.0
Qwen3-235B-A22B 92.2 95.1 94.8 93.0 92.3 92.0 86.0

Analysis:

  1. Non-thinking Mode Improvement: In non-thinking mode, Qwen3 models generally outperform Qwen2.5 models of similar size in long-context processing tasks, especially at longer context lengths (e.g., Qwen3-14B vs Qwen2.5-14B, Qwen3-32B vs Qwen2.5-32B show clear improvements in average and 128K scores). The flagship Qwen3-235B-A22B achieves a strong 90.6 at 128K.
  2. Thinking Mode Degradation: In thinking mode, the model's performance on RULER slightly degrades compared to non-thinking mode. The authors hypothesize that for pure retrieval tasks (like RULER) which do not rely on complex reasoning, the thinking content generated by the model might not offer significant benefits and could even interfere with the retrieval process. This indicates an area for future improvement to ensure thinking mode provides benefits across all task types.
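
As an illustration of the kind of retrieval probe behind these numbers, below is a toy passkey-retrieval check in the spirit of RULER's needle-style tasks. It is a hedged sketch, not the actual RULER harness; the `generate` callable stands in for any model interface, and the filler text and token-budget heuristic are made up.

```python
import random

def passkey_prompt(context_tokens: int, passkey: str) -> str:
    """Build a long filler context with a passkey hidden at a random position."""
    filler = "The grass is green. The sky is blue. The sun is bright. "
    needle = f"The secret passkey is {passkey}. Remember it. "
    n_chunks = max(1, context_tokens // 12)   # rough tokens-per-chunk heuristic
    chunks = [filler] * n_chunks
    chunks.insert(random.randrange(len(chunks)), needle)
    return "".join(chunks) + "\nWhat is the secret passkey? Answer with the passkey only."

def passkey_accuracy(generate, lengths=(4_000, 32_000, 128_000), trials=20) -> dict:
    """Fraction of trials where the model's answer contains the passkey, per length."""
    scores = {}
    for n in lengths:
        hits = 0
        for _ in range(trials):
            key = str(random.randint(10_000, 99_999))
            if key in generate(passkey_prompt(n, key)):
                hits += 1
        scores[n] = hits / trials
    return scores
```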

6.3.5. Multilingual Ability

The multilingual capabilities are showcased through various benchmarks and detailed language-specific tables (Tables 24-35).

Summary: Tables 24-35 present detailed benchmark scores across various languages, including Spanish, French, Portuguese, Italian, Arabic, Japanese, Korean, Indonesian, Russian, Vietnamese, German, and Thai. The results demonstrate that the Qwen3 series models achieve competitive performance across all evaluated benchmarks for these specific languages, showcasing their strong multilingual capabilities.

For a broader assessment, the Belebele benchmark (Bandarkar et al., 2023) is used, covering 80 supported languages (excluding 42 unoptimized ones). The following are the results from Table 37 of the original paper:

Model Indo-European Sino-Tibetan Afro-Asiatic Austronesian Dravidian Turkic Tai-Kadai Uralic Austroasiatic Other
Gemma-3-27B-IT 89.2 86.3 85.9 84.1 83.5 86.8 81.0 91.0 86.5 87.0
Qwen2.5-32B-Instruct 85.5 82.3 80.4 70.6 67.8 80.8 74.5 87.0 79.0 72.6
QwQ-32B 86.1 83.7 81.9 71.3 69.3 80.3 77.0 88.0 83.0 74.0
Qwen3-32B (Thinking) 90.7 89.7 84.8 86.7 84.5 89.3 83.5 91.3 88.0 83.1
Qwen3-32B (Non-thinking) 89.1 88.0 82.3 83.7 84.0 85.0 85.0 88.7 88.0 81.3
Gemma-3-12B-IT 85.8 83.3 83.4 79.3 79.0 82.8 77.5 89.0 83.0 81.6
Qwen2.5-14B-Instruct 82.7 78.9 80.4 69.1 66.2 74.2 72.2 83.9 77.9 70.4
Qwen3-14B (Thinking) 88.6 87.3 82.4 82.4 81.0 83.8 83.5 91.0 82.5 81.7
Qwen3-14B (Non-thinking) 87.4 82.7 80.1 80.7 78.0 81.8 80.5 87.7 81.5 77.0
Gemma-3-4B-IT 71.8 72.0 63.5 61.7 64.8 64.0 61.5 70.7 71.0 62.6
Qwen2.5-3B-Instruct 58.0 62.3 57.2 47.9 36.9 45.1 49.8 50.6 56.8 48.4
Qwen3-4B (Thinking) 82.2 77.7 74.1 73.0 74.3 76.3 68.5 83.0 74.5 67.9
Qwen3-4B (Non-thinking) 76.0 77.0 65.6 65.6 65.5 64.0 60.5 74.0 74.0 61.0
Gemma-3-1B-IT 36.5 36.0 30.0 29.1 28.8 27.3 28.0 32.7 33.0 30.9
Qwen2.5-1.5B-Instruct 41.5 43.0 39.6 34.8 28.6 29.7 39.4 33.8 42.0 36.0
Qwen3-1.7B (Thinking) 69.7 66.0 59.4 58.6 52.8 57.8 53.5 70.3 63.5 53.4
Qwen3-1.7B (Non-thinking) 58.8 62.7 50.8 53.0 43.3 48.0 46.0 54.3 54.0 43.9

Analysis of Belebele Benchmark:

  • Superiority over Qwen2.5: Qwen3 models significantly outperform their Qwen2.5 counterparts across all language families, highlighting the impact of the expanded multilingual pre-training data (119 languages) and improved training strategies.
  • Competitiveness with Gemma-3: Qwen3 achieves comparable or superior performance to similarly-sized Gemma models (e.g., Qwen3-32B vs. Gemma-3-27B, Qwen3-14B vs. Gemma-3-12B).
  • Thinking Mode Advantage: For Qwen3 models, the thinking mode consistently yields higher scores across almost all language families compared to their non-thinking counterparts, demonstrating the benefit of reasoning in cross-lingual understanding tasks. This is in contrast to the RULER benchmark, where thinking mode was less beneficial, indicating task-specific utility of the thinking mechanism.

7. Conclusion & Reflections

7.1. Conclusion Summary

This technical report introduces Qwen3, a significant advancement in the Qwen family of large language models. The key contributions include the novel integration of thinking and non-thinking modes into a unified framework, complemented by a dynamic thinking budget mechanism. This design empowers users with flexible control over computational resources and reasoning depth, eliminating the need to switch between specialized models. Qwen3 comprises a diverse series of dense and Mixture-of-Expert (MoE) models, ranging from 0.6B to 235B parameters, featuring architectural refinements like QK-Norm and global-batch load balancing loss. The models were pre-trained on an unprecedented 36 trillion tokens across 119 languages and dialects, vastly expanding multilingual capabilities. A sophisticated multi-stage post-training pipeline, including Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL, coupled with an efficient Strong-to-Weak Distillation approach for smaller models, ensures state-of-the-art performance. Empirical evaluations consistently demonstrate Qwen3's competitive results against proprietary and leading open-source models across diverse benchmarks, particularly excelling in code generation, mathematical reasoning, and agent tasks. The open-source release under Apache 2.0 further promotes community engagement and research.

7.2. Limitations & Future Work

The authors acknowledge certain limitations and outline future research directions:

  • Thinking Mode for Retrieval Tasks: While thinking mode generally enhances reasoning, performance on certain retrieval-based long-context tasks (like RULER) slightly degrades. The authors hypothesize that the generated thinking content might interfere with retrieval in these specific scenarios, suggesting a need to refine thinking mode for such tasks in future versions.

  • Specialized vs. General Performance Trade-off: The Thinking Mode Fusion and General RL stages, while enhancing overall versatility and instruction following, sometimes lead to a slight decrease in thinking mode performance on highly challenging, specialized tasks like AIME'24 and LiveCodeBench. This indicates a trade-off between broad generalization and peak performance in niche, complex problem-solving.

    Future research will focus on:

  • Scaling Pre-training: Continuing to scale up pre-training with even higher quality and more diverse data.

  • Architectural and Training Method Improvements: Enhancing model architecture and training methods for effective compression and scaling to extremely long contexts. This includes addressing the observed performance dip of thinking mode in retrieval tasks.

  • Increased RL Resources and Agent-based RL Systems: Allocating more computational resources for Reinforcement Learning, with a particular emphasis on agent-based RL systems that learn from environmental feedback. The goal is to build agents capable of tackling complex tasks requiring inference time scaling, indicating a move towards more autonomous and interactive LLM agents.

7.3. Personal Insights & Critique

Qwen3 represents a compelling stride towards more adaptable and efficient LLMs. The unified thinking and non-thinking modes, coupled with the thinking budget, are genuinely innovative concepts that address a critical practical challenge in LLM deployment: how to balance responsiveness and deep reasoning without maintaining separate models. This dynamic control over inference-time computation is a sophisticated solution for optimizing user experience and resource utilization.

The Strong-to-Weak Distillation approach for smaller models is particularly insightful. Demonstrating that distillation can be significantly more effective and efficient than direct RL, especially for exploration ability (Pass@64), offers a powerful blueprint for developing competitive lightweight models. This has direct implications for edge deployment and broader accessibility.

However, the acknowledged trade-off between generalized capabilities (achieved through General RL) and peak performance in specialized thinking mode tasks is an important point of critique. While understandable for overall versatility, it highlights a fundamental tension in LLM training: optimizing for average performance across a vast array of tasks might dilute the model's ability to achieve extreme performance in highly focused, difficult domains. Future work in multi-objective optimization or more sophisticated curriculum learning could potentially mitigate this.

The massive expansion to 119 languages is commendable, pushing the boundaries of global accessibility. However, the report doesn't extensively detail the performance across all these low-resource languages or the specific challenges encountered during their integration. While Belebele provides a broad overview, deeper dives into specific language families or challenges unique to low-resource settings would further enhance understanding.

Overall, Qwen3's contributions, particularly in its novel mode-switching and resource allocation mechanisms, alongside its commitment to open-source, position it as a significant player in the evolving LLM landscape. Its methods could be transferable to other AI domains requiring dynamic computational allocation based on task complexity, beyond just language generation.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.