
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DeepSeek-V3.2 balances computational efficiency with strong reasoning and agent performance through three innovations: a sparse attention mechanism that reduces attention complexity, a scalable reinforcement learning framework that lifts performance to the level of GPT-5, and an agentic task synthesis pipeline that improves generalization.

Abstract

We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

1.2. Authors

The paper is authored by the research and engineering teams at DeepSeek-AI.

1.3. Journal/Conference

The paper is presented as a technical report and made available through the ModelScope platform. This indicates it is likely a preprint, released to disseminate research findings quickly, rather than a paper that has undergone a formal peer-review process for a specific conference or journal.

1.4. Publication Year

The paper references benchmarks and events dated in 2025 (e.g., IMO 2025, HMMT Feb 2025), and its content implies a publication date in late 2025.

1.5. Abstract

The paper introduces DeepSeek-V3.2, a large language model designed to balance high computational efficiency with strong performance in reasoning and agent tasks. The authors highlight three key technical innovations:

  1. DeepSeek Sparse Attention (DSA): A novel attention mechanism that reduces computational complexity, especially in long-context scenarios, without degrading performance.
  2. Scalable Reinforcement Learning Framework: A robust reinforcement learning (RL) protocol that, when scaled with significant post-training compute, enables DeepSeek-V3.2 to perform on par with a hypothetical GPT-5. A high-compute variant, DeepSeek-V3.2-Speciale, is shown to surpass GPT-5 and match Gemini-3.0-Pro, achieving gold medals in international Olympiads for mathematics and informatics.
  3. Large-Scale Agentic Task Synthesis Pipeline: A method for systematically generating vast amounts of training data for AI agents, which improves their ability to use tools, follow complex instructions, and generalize in interactive environments.
  • Link: https://modelscope.cn/models/deepseek-ai/DeepSeek-V3.2/resolve/master/assets/paper.pdf
  • Publication Status: Preprint / Technical Report.

2. Executive Summary

2.1. Background & Motivation

The paper addresses a critical challenge in the field of Large Language Models (LLMs): the widening performance gap between closed-source, proprietary models and their open-source counterparts. While the open-source community continues to advance, the rate of improvement in proprietary models has been significantly faster, particularly in complex reasoning and agentic tasks.

The authors identify three primary deficiencies holding back open-source models:

  1. Architectural Inefficiency: The reliance on the standard vanilla attention mechanism is computationally expensive, especially for long sequences. This creates a bottleneck for both training and deployment.

  2. Insufficient Post-Training Compute: Open-source models typically receive a much smaller computational budget for post-training (like reinforcement learning) compared to pre-training. This limits their ability to master difficult, nuanced tasks.

  3. Lagging Agent Capabilities: Open-source models struggle with generalization and robust instruction-following when required to act as agents that use tools in complex environments.

    The paper's innovative entry point is to tackle these three issues simultaneously with a new model architecture, a scaled-up post-training methodology, and a novel data synthesis pipeline for agent training.

2.2. Main Contributions / Findings

The paper's main contributions are threefold, directly corresponding to the identified deficiencies:

  1. A Highly Efficient Sparse Attention Architecture (DSA): The authors propose DeepSeek Sparse Attention (DSA), an attention mechanism that reduces the computational complexity from quadratic ($O(L^2)$) to near-linear ($O(Lk)$) with respect to the sequence length $L$. This is achieved by using a lightweight "lightning indexer" to select a small, relevant subset of key-value pairs for each query, making long-context processing much more efficient.

  2. Demonstration of Scalable Reinforcement Learning: The paper develops and implements a stable and scalable Reinforcement Learning (RL) framework based on Group Relative Policy Optimization (GRPO). By allocating a post-training compute budget exceeding 10% of the pre-training cost, they significantly enhance the model's reasoning abilities, pushing its performance into the same tier as top proprietary models.

  3. A Scalable Pipeline for Synthesizing Agentic Tasks: To improve the model's ability to use tools, the authors created a pipeline that automatically generates over 1,800 distinct environments and 85,000 complex prompts. This diverse, synthetic data is used in the RL phase to teach the model robust instruction-following and generalization in interactive scenarios.

    The key finding is that by addressing these deficiencies, an open model like DeepSeek-V3.2 can close the performance gap with leading proprietary models like GPT-5 and Kimi-k2-thinking. Furthermore, its high-compute variant, DeepSeek-V3.2-Speciale, demonstrates that open models can reach the absolute frontier of reasoning, achieving performance on par with Gemini-3.0-Pro and winning gold medals in prestigious competitions.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Large Language Models (LLMs): These are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and reasoning about human language.
  • Transformer Architecture: The foundational architecture for most modern LLMs. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in the input sequence when processing a specific word.
  • Attention Mechanism: A mechanism that enables a model to focus on the most relevant parts of the input sequence when producing an output. In the context of Transformers, self-attention allows each token in a sequence to "attend" to all other tokens, capturing complex dependencies regardless of their distance.
  • Sparse Attention: A class of attention mechanisms designed to reduce the computational and memory complexity of standard self-attention, which is quadratic in the sequence length ($O(L^2)$). Instead of allowing every token to attend to every other token, sparse attention methods restrict connections to a smaller, more relevant subset of tokens. DeepSeek Sparse Attention (DSA) is a novel implementation of this concept.
  • Reinforcement Learning (RL): A paradigm of machine learning where an "agent" learns to make decisions by performing actions in an "environment" to maximize a cumulative "reward." In LLMs, RL, particularly Reinforcement Learning from Human Feedback (RLHF), is used to fine-tune the model to better align with human preferences and improve its ability to follow instructions and solve complex problems.
  • Mixture-of-Experts (MoE): A model architecture where multiple "expert" sub-networks are used. For any given input, a "gating" network selects a small subset of these experts to perform the computation. This allows for a massive increase in the total number of model parameters without a proportional increase in the computational cost for inference, as only a fraction of the model is active at any time.

3.2. Previous Works

  • Vanilla Attention (Vaswani et al., 2017): This refers to the original self-attention mechanism proposed in the paper "Attention Is All You Need." It is the cornerstone of the Transformer architecture. For a given set of queries $Q$, keys $K$, and values $V$, the output is computed as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, every query in $Q$ is compared against every key in $K$, resulting in a complexity of $O(L^2 d)$, where $L$ is the sequence length and $d$ is the model dimension. This quadratic scaling is the primary bottleneck for long sequences (a minimal code sketch of this operation appears after this list).

  • Multi-Query Attention (MQA) (Shazeer, 2019): An optimization of the standard Multi-Head Attention (MHA). In MHA, each attention head has its own set of query, key, and value projection matrices. In MQA, all heads share a single set of key and value projection matrices while retaining separate query projections. This significantly reduces the memory bandwidth required for loading keys and values during inference, making decoding faster without a substantial drop in quality. DeepSeek-V3.2's DSA is implemented based on an MQA variant.

  • MLA (DeepSeek-AI, 2024): Multi-head Latent Attention, the attention mechanism introduced in DeepSeek-V2 and used by subsequent DeepSeek models, including the predecessor DeepSeek-V3.1-Terminus. MLA compresses key-value information into a smaller set of "latent vectors" to reduce the attention cost. The paper instantiates DSA under MLA, making it an evolution of this prior work. The appendix (Figure 7) illustrates the MHA and MQA modes of MLA.

  • Group Relative Policy Optimization (GRPO) (DeepSeek-AI, 2025; Shao et al., 2024): The RL algorithm used for post-training. GRPO is a variant of policy optimization algorithms like PPO (Proximal Policy Optimization). Its key idea is to sample a group of responses for a given prompt and normalize the rewards within that group. This turns the absolute reward score into a relative ranking, which can stabilize training and make the model more robust to noisy reward signals. The paper builds upon this algorithm by introducing several modifications to scale it effectively.
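
For readers who prefer code, here is a minimal sketch of the vanilla scaled dot-product attention defined above (single head, no causal mask, no batching); shapes and naming are illustrative only, not taken from the paper.

```python
# Minimal single-head scaled dot-product attention (illustrative only).
# Every query attends to every key, hence the O(L^2 d) cost.
import math
import torch

def vanilla_attention(Q, K, V):
    """Q, K: (L, d_k); V: (L, d_v); returns (L, d_v)."""
    scores = Q @ K.T / math.sqrt(Q.shape[-1])   # (L, L) pairwise similarities
    weights = torch.softmax(scores, dim=-1)     # normalize over keys
    return weights @ V
```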

3.3. Technological Evolution

The paper's work sits at the confluence of several key technological trends in LLMs:

  1. The Long-Context Revolution: Early LLMs were limited to short context windows (e.g., 512 or 2048 tokens). As models were applied to more complex tasks, the need for longer contexts grew. This spurred research into more efficient attention mechanisms, moving from the dense vanilla attention to various forms of sparse attention to make context windows of 128K tokens and beyond computationally feasible. DSA is the latest step in this evolution.
  2. The Rise of Post-Training: Initially, the performance of LLMs was thought to be primarily determined by the scale of pre-training (data and compute). However, recent work has shown that post-training, especially sophisticated RL techniques, is crucial for unlocking advanced capabilities like complex reasoning and reliable tool use. This paper pushes this trend further by advocating for and demonstrating the effectiveness of a much larger computational budget for RL (over 10% of pre-training).
  3. The Emergence of AI Agents: The frontier of AI has shifted from passive language generation to active problem-solving via AI agents that can interact with environments and use tools (e.g., code interpreters, web browsers). A major bottleneck for training such agents is the scarcity of high-quality, diverse training data. This has led to a focus on synthetic data generation, and this paper's large-scale agentic task synthesis pipeline is a significant contribution to this area.

3.4. Differentiation Analysis

Compared to previous work, this paper's approach is distinguished by:

  • A Custom-Built Sparse Attention: While many sparse attention methods exist, DSA is specifically designed with a lightning indexer that is lightweight and efficient (can be run in FP8), and is tightly integrated with the existing MLA architecture of the DeepSeek model family.
  • Unprecedented RL Scaling for Open Models: The most significant differentiator is the sheer scale of the RL compute budget. While proprietary models are known to use extensive post-training, this paper makes a strong case that open models have been severely under-resourced in this phase. Their framework and results demonstrate that scaling RL is a viable path for open models to achieve top-tier performance.
  • Systematic and Scalable Agent Data Synthesis: Instead of relying on existing datasets or simple heuristics, the authors built an entire multi-agent pipeline to synthesize a vast and diverse set of agentic tasks. This "environment-synthesis agent" approach, which creates tasks that are hard to solve but easy to verify, is a novel and powerful way to bootstrap agent capabilities.

4. Methodology

4.1. Principles

The methodology of DeepSeek-V3.2 is built on three core principles designed to overcome the limitations of existing open-source LLMs:

  1. Efficiency through Sparsity: The standard attention mechanism is a computational bottleneck. The core idea is to replace it with a sparse alternative, DeepSeek Sparse Attention (DSA), that intelligently selects a small subset of relevant information from the context for each token. This preserves performance while dramatically reducing computational cost, especially for long sequences.
  2. Capability through Scaled-Up Post-Training: Advanced reasoning and problem-solving skills are not fully unlocked by pre-training alone. The principle here is that investing a significant amount of computational resources into post-training, specifically Reinforcement Learning (RL), can refine the model's abilities to a level comparable with frontier proprietary models. The authors developed a stable RL recipe to enable this scaling.
  3. Generalization through Diverse Synthetic Data: To build competent AI agents that can use tools effectively, the model must be trained on a wide variety of complex, interactive tasks. Since real-world agentic data is scarce, the principle is to create a systematic, automated pipeline to synthesize these tasks at a massive scale, thereby fostering robust generalization and instruction-following.

4.2. Core Methodology In-depth

4.2.1. DeepSeek-V3.2 Architecture: DeepSeek Sparse Attention (DSA)

The primary architectural innovation in DeepSeek-V3.2 is the introduction of DeepSeek Sparse Attention (DSA). This mechanism is designed to replace the dense attention of its predecessor, DeepSeek-V3.1-Terminus, through a process of continued training.

Prototype of DSA

The DSA mechanism has two main components:

  1. Lightning Indexer: This is a lightweight module responsible for quickly scoring the relevance of all preceding tokens for the current query token. For a query token's hidden state $\mathbf{h}_t \in \mathbb{R}^d$ and a preceding token's hidden state $\mathbf{h}_s \in \mathbb{R}^d$, the indexer computes an index score $I_{t,s}$ that determines which tokens should be selected: $ I_{t,s} = \sum_{j=1}^{H^I} w_{t,j}^I \cdot \mathrm{ReLU}\left(\mathbf{q}_{t,j}^I \cdot \mathbf{k}_s^I\right), $

    • Symbol Explanation:
      • $I_{t,s}$: The index score indicating the relevance of token $s$ to query token $t$.
      • $H^I$: The number of indexer heads (kept small for efficiency).
      • $\mathbf{q}_{t,j}^I \in \mathbb{R}^{d^I}$: The $j$-th indexer query vector, derived from the query token's hidden state $\mathbf{h}_t$.
      • $w_{t,j}^I \in \mathbb{R}$: A scalar weight for the $j$-th indexer head, also derived from $\mathbf{h}_t$.
      • $\mathbf{k}_s^I \in \mathbb{R}^{d^I}$: An indexer key vector, derived from the preceding token's hidden state $\mathbf{h}_s$.
      • ReLU: The Rectified Linear Unit activation function, chosen for its computational efficiency. The authors note that the indexer is highly efficient as it has few heads and can be implemented in FP8 precision.
  2. Fine-Grained Token Selection: After the indexer computes scores $\{I_{t,s}\}$ for a query token $t$ against all preceding tokens $s$, this mechanism selects only the key-value entries corresponding to the tokens with the highest scores, and attention is then applied only to this sparse set (a code sketch of DSA follows this list). The attention output $\mathbf{u}_t$ is computed as: $ \mathbf{u}_t = \mathrm{Attn}\big(\mathbf{h}_t, \{\mathbf{c}_s \mid I_{t,s} \in \mathrm{Top\text{-}k}(I_{t,:})\}\big). $

    • Symbol Explanation:
      • $\mathbf{u}_t$: The final output vector for token $t$.
      • $\mathrm{Attn}(\cdot)$: The standard attention function.
      • $\mathbf{h}_t$: The query from the current token $t$.
      • $\{\mathbf{c}_s\}$: The set of key-value entries from preceding tokens.
      • $\mathrm{Top\text{-}k}(I_{t,:})$: The set of the $k$ highest index scores for query $t$. This operation is the core of the sparsification.
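
To make the two components concrete, here is a minimal sketch of indexer scoring followed by top-k selection. The projection shapes, the shared indexer key projection, and the final attention call are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of DSA: a lightning indexer scores all preceding tokens for one
# query, then attention is applied only to the top-k key-value entries.
# All shapes and projections below are assumptions for illustration.
import torch
import torch.nn.functional as F

def lightning_index_scores(h_t, h_prev, W_q, W_k, W_w):
    """h_t: (d,) query-token hidden state; h_prev: (S, d) preceding hidden states.
    W_q: (H_I, d, d_I) per-head indexer query projections (assumed);
    W_k: (d, d_I) shared indexer key projection (assumed);
    W_w: (H_I, d) projection producing per-head scalar weights w_{t,j} (assumed)."""
    q = torch.einsum("d,hde->he", h_t, W_q)        # (H_I, d_I) indexer queries
    k = h_prev @ W_k                               # (S, d_I)  indexer keys
    w = W_w @ h_t                                  # (H_I,)    per-head weights
    # I_{t,s} = sum_j w_{t,j} * ReLU(q_{t,j} . k_s)
    return torch.einsum("h,hs->s", w, F.relu(q @ k.T))

def dsa_output(h_t, h_prev, kv_entries, indexer_weights, attn_fn, top_k=2048):
    """Attend only to the key-value entries of the top-k scored tokens."""
    scores = lightning_index_scores(h_t, h_prev, *indexer_weights)
    k = min(top_k, scores.numel())
    selected = torch.topk(scores, k).indices       # Top-k(I_{t,:})
    return attn_fn(h_t, kv_entries[selected])      # u_t = Attn(h_t, {c_s})
```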

Instantiation of DSA under MLA

For practical implementation and to enable continued training from DeepSeek-V3.1-Terminus, DSA is built upon the existing MLA (Multi-head Latent Attention) framework, specifically in its MQA (Multi-Query Attention) mode. In this mode, a single key-value entry (a latent vector in MLA) is shared across all query heads for a given token, which is highly efficient.

The following figure from the paper illustrates this architecture.

Figure 2 | Attention architecture of DeepSeek-V3.2, where DSA is instantiated under MLA (in its Multi-Query Attention mode). The green part illustrates how DSA selects the top-$k$ key-value entries according to the indexer.

4.2.2. Continued Pre-Training for DSA

The model transitions from dense to sparse attention via a two-stage continued pre-training process, starting from a 128K context DeepSeek-V3.1-Terminus checkpoint.

  1. Dense Warm-up Stage:

    • Goal: To initialize the lightning indexer so that its relevance scores align with the original model's attention patterns.
    • Process: The main model parameters are frozen, and only the indexer is trained. The standard dense attention mechanism is still active.
    • Objective Function: A KL-divergence loss is used to make the indexer's output distribution match the main attention distribution. The target distribution $p_{t,:}$ is created by summing the attention scores across all heads of the main attention mechanism and L1-normalizing them. The loss is: $ \mathcal{L}^I = \sum_t \mathbb{D}_{\mathrm{KL}}\big(p_{t,:} \,\|\, \mathrm{Softmax}(I_{t,:})\big). $
      • Symbol Explanation:
        • $\mathcal{L}^I$: The loss for the indexer.
        • $\mathbb{D}_{\mathrm{KL}}(\cdot\|\cdot)$: The Kullback-Leibler divergence, a measure of how one probability distribution diverges from a second, reference distribution.
        • $p_{t,:}$: The target probability distribution over all preceding tokens for query $t$, derived from the main attention scores.
        • $\mathrm{Softmax}(I_{t,:})$: The probability distribution produced by the indexer's scores for query $t$.
    • Training Details: This stage is very short, lasting only 1000 steps (2.1B tokens).
  2. Sparse Training Stage:

    • Goal: To adapt the entire model to the sparse attention patterns introduced by DSA.
    • Process: All model parameters are unfrozen. The fine-grained token selection mechanism is activated, meaning attention is now sparse. The indexer continues to be aligned with the main attention distribution, but now only over the set of selected tokens.
    • Objective Function: The KL-divergence loss for the indexer is modified to only consider the selected tokens: $ \mathcal{L}^I = \sum_t \mathbb{D}_{\mathrm{KL}}\big(p_{t,S_t} \,\|\, \mathrm{Softmax}(I_{t,S_t})\big). $
      • Symbol Explanation:
        • $S_t = \{ s \mid I_{t,s} \in \mathrm{Top\text{-}k}(I_{t,:}) \}$: The set of indices of the top-$k$ tokens selected by the indexer for query $t$.
        • $p_{t,S_t}$ and $I_{t,S_t}$: The distributions restricted to the tokens in the set $S_t$.
    • Training Details: The main model is trained with the standard language modeling loss, while the indexer is trained with $\mathcal{L}^I$. This stage is much longer, lasting 15,000 steps (943.7B tokens) with $k = 2048$ selected tokens. A minimal code sketch of the indexer alignment loss used in both stages follows this list.
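
Below is a minimal sketch of the indexer alignment loss, assuming per-query attention scores (summed over heads) and indexer scores are already available as dense matrices; this is not the authors' training code.

```python
# KL(p_t || Softmax(I_t)) between the L1-normalized main-attention distribution
# and the indexer distribution, optionally restricted to the indexer's top-k
# tokens (sparse training stage). Shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def indexer_alignment_loss(attn_scores, index_scores, top_k=None):
    """attn_scores, index_scores: (T, S) for T queries over S preceding tokens."""
    if top_k is not None:                                     # sparse training stage
        sel = index_scores.topk(top_k, dim=-1).indices        # S_t for each query
        attn_scores = attn_scores.gather(-1, sel)
        index_scores = index_scores.gather(-1, sel)
    p = attn_scores / attn_scores.sum(dim=-1, keepdim=True)   # target p_{t,:}
    log_q = F.log_softmax(index_scores, dim=-1)               # log Softmax(I_{t,:})
    return F.kl_div(log_q, p, reduction="sum")                # sum_t D_KL(p || q)
```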

4.2.3. Post-Training: A Scalable RL Framework

After continued pre-training, the model undergoes extensive post-training using a scaled-up and stabilized version of Group Relative Policy Optimization (GRPO).

Review of GRPO

The objective of GRPO is to optimize the policy model $\pi_\theta$ by maximizing the expected advantage of its generated responses. For each prompt $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled from a previous version of the policy ($\pi_{\mathrm{old}}$). The objective function is: $ \mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^G \sim \pi_{\mathrm{old}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta) \hat{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta(o_{i,t}) \,\|\, \pi_{\mathrm{ref}}(o_{i,t})\right) \right) \right]. $

  • Symbol Explanation:
    • $\pi_\theta$: The current policy (the model being trained).
    • $\pi_{\mathrm{old}}$: The policy used to generate the sampled data.
    • $\pi_{\mathrm{ref}}$: A reference policy (often the initial SFT model) used for KL regularization to prevent the model from deviating too far.
    • $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})}$: The importance sampling ratio.
    • $\hat{A}_{i,t}$: The advantage of taking action $o_{i,t}$. In GRPO, this is calculated as $R_i - \mathrm{mean}(\mathbf{R})$, where $R_i$ is the reward for the entire sequence $o_i$ and $\mathrm{mean}(\mathbf{R})$ is the average reward across the group of $G$ responses (see the code sketch after this list).
    • $\mathrm{clip}(\cdot)$: A function that clips the importance ratio to the range $[1-\varepsilon, 1+\varepsilon]$ to stabilize updates.
    • $\beta$: A hyperparameter controlling the strength of the KL divergence penalty.
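
As a concrete illustration of the group-relative advantage referenced above, here is a tiny sketch; the binary rewards in the usage example are hypothetical.

```python
# Group-relative advantage used by GRPO: each response's advantage is its reward
# minus the mean reward of the G responses sampled for the same prompt.
import torch

def grpo_advantages(rewards):
    """rewards: (G,) scalar rewards for one prompt's sampled group."""
    return rewards - rewards.mean()

# Hypothetical example: 4 sampled responses, two judged correct (reward 1.0)
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
# tensor([ 0.5000, -0.5000, -0.5000,  0.5000])
```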

Modifications for Scaling GRPO

The authors introduce four key modifications to make this algorithm stable at a large scale:

  1. Unbiased KL Estimate: The standard KL estimator can be biased. The authors use an importance-sampling corrected estimator to obtain an unbiased estimate of the KL divergence, which provides a more stable gradient: $ \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta(o_{i,t}) \,\|\, \pi_{\mathrm{ref}}(o_{i,t})\big) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})} \left( \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1 \right). $ This correction prevents unstable, high-magnitude gradients that can occur when the current policy assigns a much lower probability to a token than the reference policy.

  2. Off-Policy Sequence Masking: To handle the "off-policy" nature of training (where the training data comes from an older policy $\pi_{\mathrm{old}}$), they introduce a mask $M_{i,t}$ that selectively ignores certain training steps. Specifically, they mask out updates for sequences that have a negative advantage and have diverged significantly from the sampling policy (a code sketch appears after this list): $ M_{i,t} = \begin{cases} 0 & \hat{A}_{i,t} < 0 \ \text{and} \ \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \log \frac{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} > \delta \\ 1 & \text{otherwise}, \end{cases} $ where $\delta$ is a divergence threshold. This prevents the model from being destabilized by learning from its "mistakes" when those mistakes were generated by a very different policy.

  3. Keep Routing (for MoE models): In Mixture-of-Experts models, the path of activated experts can change between data generation (inference) and training. This inconsistency can destabilize training. The Keep Routing technique enforces that the exact same expert routing path used during data sampling is used again during the training update for that data, ensuring parameter updates are consistent.

  4. Keep Sampling Mask: Sampling strategies like top-p and top-k are used during generation to improve quality. However, this truncates the action space, which violates the assumptions of importance sampling. The Keep Sampling Mask technique saves the mask of allowed tokens from the sampling step ($\pi_{\mathrm{old}}$) and applies the same mask to the current policy ($\pi_\theta$) during training, ensuring both policies operate on the same action space.
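
Below is a minimal sketch of the first two modifications (the unbiased KL estimate and the off-policy sequence mask), assuming per-token log-probabilities from the current, sampling, and reference policies are available; the threshold value is a placeholder, not a published hyperparameter.

```python
# Sketch of modification (1): importance-sampling-corrected KL estimate, and
# modification (2): masking a sequence whose advantage is negative and whose
# average log-ratio to the sampling policy exceeds a threshold delta.
import torch

def unbiased_kl(logp_theta, logp_old, logp_ref):
    """Per-token estimate of D_KL(pi_theta || pi_ref); all inputs are log-probs."""
    ratio_old = torch.exp(logp_theta - logp_old)   # pi_theta / pi_old
    log_r = logp_ref - logp_theta                  # log(pi_ref / pi_theta)
    return ratio_old * (torch.exp(log_r) - log_r - 1.0)

def off_policy_sequence_mask(advantage, logp_theta, logp_old, delta=0.5):
    """Return 0 to drop the whole sequence from the update, 1 to keep it."""
    drift = (logp_old - logp_theta).mean()         # (1/|o_i|) sum_t log(pi_old/pi_theta)
    return 0.0 if (advantage < 0 and drift > delta) else 1.0
```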

4.2.4. Thinking in Tool-Use: Agentic Capabilities

The paper details a methodology for integrating long-chain reasoning ("thinking") with the ability to use tools ("agentic tasks").

Thinking Context Management

To make the reasoning process more efficient during multi-turn tool use, they implement a specific context management strategy:

  • Reasoning history (the <think> block) is retained across multiple tool calls within the same user turn.

  • Reasoning history is discarded only when a new user message is added to the conversation.

  • The history of tool calls and their outputs is always preserved.

    The following diagram illustrates this mechanism.

    Figure 4 | Thinking retention mechanism in tool-calling scenarios. The diagram shows several turns of inputs and outputs, including user messages, tool calls, and thinking blocks, illustrating how reasoning is carried through a complex task.
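
To make the retention rule above concrete, here is a minimal sketch; the message schema (role strings, list-of-dicts history) is an assumption for illustration, not the model's actual chat format.

```python
# Thinking-retention rule: keep reasoning blocks across tool calls within a turn,
# drop them once a new user message arrives, always keep tool calls and outputs.
def update_context(history, new_message):
    """history: list of {"role": ..., "content": ...} dicts; returns new history."""
    if new_message["role"] == "user":
        # New user turn: discard earlier reasoning, keep everything else
        history = [m for m in history if m["role"] != "thinking"]
    return history + [new_message]
```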

Cold-Start

This is the initial phase to get the model to generate trajectories that mix reasoning and tool use. The approach is to simply combine prompts from existing reasoning datasets and agentic datasets with a carefully designed system prompt. The new prompt explicitly instructs the model to use tools within its reasoning process. While the initial quality may be low, it produces enough valid examples to serve as a starting point for the subsequent RL stage.

Large-Scale Agentic Task Synthesis

This is the core of their agent training strategy. A diverse set of RL tasks is crucial for robustness. They create tasks for several domains:

  • Search Agent: A multi-agent pipeline generates high-quality question-answer pairs. One agent finds long-tail entities, another generates questions by searching about them, and a verification agent validates the answers.
  • Code Agent: They mine millions of issue-pull request pairs from GitHub to create realistic software engineering environments. An automated agent sets up these environments (installing dependencies, running tests) to create reproducible tasks where the goal is to fix a bug.
  • Code Interpreter Agent: Uses a Jupyter Notebook environment for tasks requiring code execution to find a solution (e.g., in math, logic, data science).
  • General Agent: An "environment-synthesis agent" automatically creates new tasks. The process is as follows:
    1. Given a category (e.g., trip planning), the agent uses bash and search tools to populate a sandbox database with relevant information.
    2. It then synthesizes a set of task-specific tools (e.g., get_all_hotels_by_city).
    3. It proposes a simple task, a solution that uses the tools, and a Python verifier function. It iteratively refines these until the solution passes verification.
    4. It then makes the task progressively harder, adding more constraints and augmenting the toolset if necessary. This process generated 1,827 environments and 4,417 tasks.

5. Experimental Setup

5.1. Datasets

The paper evaluates DeepSeek-V3.2 on a wide array of benchmarks covering general reasoning, coding, mathematics, and various agentic capabilities.

  • General Reasoning & Knowledge:

    • MMLU-Pro: A more challenging version of the popular MMLU benchmark for multitask language understanding.
    • GPQA Diamond: A graduate-level, Google-proof Q&A benchmark designed to be difficult for search engines.
    • Humanity's Last Exam (HLE, text-only): A highly challenging benchmark of expert-written questions spanning many academic disciplines, intended to probe the frontier of model knowledge and reasoning.
  • Coding & Code Generation:

    • LiveCodeBench: A benchmark with programming problems sourced from real-time competitive programming contests.
    • Codeforces: A platform for competitive programming; model performance is measured by an Elo-style rating based on solving problems.
  • Mathematics:

    • AIME: American Invitational Mathematics Examination, a high-school level math competition.
    • HMMT: Harvard-MIT Mathematics Tournament, another prestigious high-school competition.
    • IMOAnswerBench: A benchmark focused on problems from the International Mathematical Olympiad (IMO), requiring only the final answer.
  • Code Agent Tasks:

    • Terminal Bench 2.0: A benchmark where the model must complete tasks by issuing commands in a terminal environment.
    • SWE-Verified: A subset of the SWE-bench dataset for software engineering tasks (bug fixing), where solutions have been human-verified.
    • SWE Multilingual: An extension of SWE-bench covering multiple programming languages.
  • Search & Web Agent Tasks:

    • BrowseComp: A benchmark for web browsing agents.
    • BrowseCompZh: The Chinese version of BrowseComp.
  • General Tool-Use / Agent Tasks:

    • τ²-bench: Evaluates conversational agents in environments requiring dual control (e.g., booking flights for a user).

    • MCP-Universe & MCP-Mark: Benchmarks for evaluating agents that interact with real-world services and APIs.

    • Tool-Decathlon: A benchmark covering a diverse set of 10 tool-use tasks.

      To provide an intuitive feel for the data, here is an example of a complex synthesized task prompt from the "General Agent" section (a sketch of a matching verifier follows the prompt):

I'm planning a three-day trip starting from Hangzhou, and I need help creating an itinerary from October 1st to October 3rd, 2025. A few important requirements: I don't want to repeat any cities, hotels, attractions, or restaurants during the entire trip. Also, please make sure that every hotel, restaurant, and attraction you recommend is actually located in the city where I'll be staying that day. One more thing about the second day - I'm trying to be smart about my budget. If I end up booking a luxury hotel that costs 800 CNY or more per night, then I need to be more careful with other expenses: my total spending on both restaurants (lunch and dinner) should stay under 350 CNY, both restaurants should be rated at least 4.0 stars, and the afternoon attraction ticket needs to be less than 120 CNY. If the hotel on day 2 is in the mid-to-high range (500-800 CNY), then I have a bit more flexibility - I just need to make sure at least one of my restaurant choices is rated 4.0 or higher, and the attraction ticket should be below 180 CNY. For more affordable hotels (200-500 CNY range), I only need to ensure that at least one restaurant has a rating of 3.2 or above. Can you help me put together this itinerary?
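
To illustrate the "hard to solve, easy to verify" design, here is a minimal sketch of the kind of Python verifier such a task could be paired with, covering only the day-2 budget rules from the prompt above; the itinerary schema (dict keys) is an assumption for illustration.

```python
# Checks only the day-2 conditional budget rules from the example prompt.
def verify_day2(day2):
    hotel, lunch, dinner, attraction = (day2["hotel"], day2["lunch"],
                                        day2["dinner"], day2["attraction"])
    meals_cost = lunch["price"] + dinner["price"]
    best_meal_rating = max(lunch["rating"], dinner["rating"])
    if hotel["price"] >= 800:        # luxury hotel: strictest constraints
        return (meals_cost < 350
                and min(lunch["rating"], dinner["rating"]) >= 4.0
                and attraction["ticket"] < 120)
    if 500 <= hotel["price"] < 800:  # mid-to-high range
        return best_meal_rating >= 4.0 and attraction["ticket"] < 180
    if 200 <= hotel["price"] < 500:  # affordable range
        return best_meal_rating >= 3.2
    return False                     # hotel outside the stated price ranges
```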

5.2. Evaluation Metrics

The paper uses several metrics standard to LLM evaluation.

  • Pass@k:

    1. Conceptual Definition: This metric evaluates a model's ability to generate a correct solution within $k$ attempts. For a single problem, the model generates $k$ candidate solutions; if at least one of them is correct, the problem is considered solved. The final Pass@k score is the fraction of problems in the test set that are solved. It is commonly used in coding and math benchmarks (a code sketch of this estimator follows the metrics list below).
    2. Mathematical Formula: An unbiased estimator for Pass@k is often calculated as: $ \text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] $
    3. Symbol Explanation:
      • $n$: The total number of samples generated per problem.
      • $c$: The number of correct samples among the $n$ generated samples.
      • $k$: The number of samples considered for a successful pass (e.g., Pass@1, Pass@100).
  • Accuracy (Acc) / Exact Match (EM):

    1. Conceptual Definition: This metric measures the percentage of model predictions that exactly match the ground-truth answer. It is a strict metric used in tasks where there is a single, unambiguous correct answer (e.g., multiple-choice questions, short answer questions).
    2. Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    3. Symbol Explanation: Self-explanatory.
  • Success Rate:

    1. Conceptual Definition: Used in agentic and tool-use tasks, this metric measures the percentage of tasks that the agent successfully completes. A task is deemed successful if the agent achieves the final goal state or produces the correct final output, as determined by a verifier or environment feedback.
    2. Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks}} $
    3. Symbol Explanation: Self-explanatory.
  • Codeforces Rating:

    1. Conceptual Definition: This is an Elo-style rating system used on the Codeforces competitive programming platform. A higher rating indicates a stronger ability to solve programming problems. The rating is dynamically adjusted based on performance in contests against other participants (or in this case, other models). It reflects not just correctness but also the difficulty of the problems solved.
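
A short sketch of the unbiased Pass@k estimator referenced above (standard combinatorial form, not tied to any specific evaluation harness):

```python
# Unbiased Pass@k estimator for a single problem: probability that a random
# subset of k samples (out of n generated, c correct) contains a correct one.
from math import comb

def pass_at_k(n, c, k):
    """n: samples generated per problem, c: number that are correct, k: budget."""
    if n - c < k:          # any k-subset is guaranteed to contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 correct out of 10 samples, evaluated at k = 5
print(round(pass_at_k(10, 2, 5), 3))   # 0.778
```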

5.3. Baselines

The paper compares DeepSeek-V3.2 against a range of both state-of-the-art closed-source and open-source models. These baselines are representative of the current LLM landscape.

  • Closed-Source Models:

    • GPT-5 High: A hypothetical high-performance variant of OpenAI's next-generation model.
    • Gemini-3.0-Pro: A hypothetical next-generation model from Google DeepMind, positioned as a top-tier reasoning model.
    • Claude-4.5-Sonnet: A hypothetical model from Anthropic, likely representing a highly capable but more cost-effective model in their lineup.
  • Open-Source Models: (The paper specifically compares against models that support thinking in tool-use)

    • Kimi-K2 Thinking: A model from Moonshot AI, known for its long-context capabilities and reasoning features.
    • MiniMax-M2: An open-source model from MiniMax.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DeepSeek-V3.2 successfully narrows the gap with leading closed-source models and significantly outperforms other open-source models, especially in agentic tasks.

The following are the results from Table 2 of the original paper:

| Category | Benchmark (Metric) | Claude-4.5-Sonnet | GPT-5 High | Gemini-3.0 Pro | Kimi-K2 Thinking | MiniMax-M2 | DeepSeek-V3.2 Thinking |
|---|---|---|---|---|---|---|---|
| English | MMLU-Pro (EM) | 88.2 | 87.5 | 90.1 | 84.6 | 82.0 | 85.0 |
| English | GPQA Diamond (Pass@1) | 83.4 | 85.7 | 91.9 | 84.5 | 77.7 | 82.4 |
| English | HLE (Pass@1) | 13.7 | 26.3 | 37.7 | 23.9 | 12.5 | 25.1 |
| Code | LiveCodeBench (Pass@1-COT) | 64.0 | 84.5 | 90.7 | 82.6 | 83.0 | 83.3 |
| Code | Codeforces (Rating) | 1480 | 2537 | 2708 | - | - | 2386 |
| Math | AIME 2025 (Pass@1) | 87.0 | 94.6 | 95.0 | 94.5 | 78.3 | 93.1 |
| Math | HMMT Feb 2025 (Pass@1) | 79.2 | 88.3 | 97.5 | 89.4 | - | 92.5 |
| Math | HMMT Nov 2025 (Pass@1) | 81.7 | 89.2 | 93.3 | 89.2 | - | 90.2 |
| Math | IMOAnswerBench (Pass@1) | - | 76.0 | 83.3 | 78.6 | - | 78.3 |
| Code Agent | Terminal Bench 2.0 (Acc) | 42.8 | 35.2 | 54.2 | 35.7 | 30.0 | 46.4 |
| Code Agent | SWE Verified (Resolved) | 77.2 | 74.9 | 76.2 | 71.3 | 69.4 | 73.1 |
| Code Agent | SWE Multilingual (Resolved) | 68.0 | 55.3 | - | 61.1 | 56.5 | 70.2 |
| Search Agent | BrowseComp (Pass@1) | 24.1 | 54.9 | - | -/60.2* | 44.0 | 51.4/67.6* |
| Search Agent | BrowseCompZh (Pass@1) | 42.4 | 63.0 | - | 62.3 | 48.5 | 65.0 |
| Search Agent | HLE (Pass@1) | 32.0 | 35.2 | 45.8 | 44.9 | 31.8 | 40.8 |
| Tool Use | τ²-Bench (Pass@1) | 84.7 | 80.2 | 85.4 | 74.3 | 76.9 | 80.3 |
| Tool Use | MCP-Universe (Success Rate) | 46.5 | 47.9 | 50.7 | 35.6 | 29.4 | 45.9 |
| Tool Use | MCP-Mark (Pass@1) | 33.3 | 50.9 | 43.1 | 20.4 | 24.4 | 38.0 |
| Tool Use | Tool-Decathlon (Pass@1) | 38.6 | 29.0 | 36.4 | 17.6 | 16.0 | 35.2 |

(Claude-4.5-Sonnet, GPT-5 High, and Gemini-3.0 Pro are closed-source; Kimi-K2 Thinking, MiniMax-M2, and DeepSeek-V3.2 Thinking are open-source.)
  • Reasoning and Knowledge (English, Code, Math): DeepSeek-V3.2 achieves performance that is competitive with GPT-5 High and Kimi-K2 Thinking, and generally surpasses other open models. For instance, on HLE (25.1 vs. 26.3) and LiveCodeBench (83.3 vs. 84.5), it is very close to GPT-5. Gemini-3.0-Pro remains the clear leader in this category. The authors attribute these gains to the scaled-up RL training.
  • Agentic Capabilities (Code Agent, Search Agent, ToolUse): This is where DeepSeek-V3.2 shows its most significant improvements over other open models. On Tool-Decathlon, it scores 35.2, more than double its nearest open-source competitors (17.6 and 16.0). On MCP-Universe, it achieves 45.9, substantially closing the gap with closed-source models (46.5-50.7). This validates the effectiveness of the large-scale agentic task synthesis pipeline.
  • Context Management Impact: The results for BrowseComp (marked with *) show a dramatic improvement when context management is used (from 51.4 to 67.6). This highlights that agentic performance is not just about the model's intrinsic ability but also about the test-time strategies used to manage long interactions.

6.2. DeepSeek-V3.2-Speciale: Pushing the Limits

The paper introduces DeepSeek-V3.2-Speciale, a variant trained with fewer token constraints during RL to explore the upper limits of reasoning.

The following are the results from Table 3 of the original paper:

| Benchmark | GPT-5 High | Gemini-3.0 Pro | Kimi-K2 Thinking | DeepSeek-V3.2 Thinking | DeepSeek-V3.2-Speciale |
|---|---|---|---|---|---|
| AIME 2025 (Pass@1) | 94.6 (13k) | 95.0 (15k) | 94.5 (24k) | 93.1 (16k) | 96.0 (23k) |
| HMMT Feb 2025 (Pass@1) | 88.3 (16k) | 97.5 (16k) | 89.4 (31k) | 92.5 (19k) | 99.2 (27k) |
| HMMT Nov 2025 (Pass@1) | 89.2 (20k) | 93.3 (15k) | 89.2 (29k) | 90.2 (18k) | 94.4 (25k) |
| IMOAnswerBench (Pass@1) | 76.0 (31k) | 83.3 (18k) | 78.6 (37k) | 78.3 (27k) | 84.5 (45k) |
| LiveCodeBench (Pass@1-COT) | 84.5 (13k) | 90.7 (13k) | 82.6 (29k) | 83.3 (16k) | 88.7 (27k) |
| Codeforces (Rating) | 2537 (29k) | 2708 (22k) | - | 2386 (42k) | 2701 (77k) |
| GPQA Diamond (Pass@1) | 85.7 (8k) | 91.9 (8k) | 84.5 (12k) | 82.4 (7k) | 85.7 (16k) |
| HLE (Pass@1) | 26.3 (15k) | 37.7 (15k) | 23.9 (24k) | 25.1 (21k) | 30.6 (35k) |

(Values in parentheses are the numbers of tokens generated to reach each result.)
  • Performance: DeepSeek-V3.2-Speciale surpasses all other models, including Gemini-3.0-Pro, on several math benchmarks (AIME, HMMT, IMOAnswerBench). It also achieves a near-identical Codeforces rating to Gemini-3.0-Pro (2701 vs. 2708).

  • Token Efficiency: A critical finding is that this superior performance comes at the cost of token efficiency. Speciale consistently uses significantly more tokens (e.g., 45k on IMOAnswerBench vs. 18k for Gemini-3.0-Pro) to achieve its results. This highlights a trade-off between performance and cost/latency.

    The following are the results from Table 4 of the original paper:

    | Competition | P1 | P2 | P3 | P4 | P5 | P6 | Overall | Medal |
    |---|---|---|---|---|---|---|---|---|
    | IMO 2025 | 7 | 7 | 7 | 7 | 7 | 0 | 35/42 | Gold |
    | CMO 2025 | 18 | 18 | 9 | 21 | 18 | 18 | 102/126 | Gold |
    | IOI 2025 | 100 | 82 | 72 | 100 | 55 | 83 | 492/600 | Gold |

    | Competition | Per-problem results (A-L) | Overall | Medal |
    |---|---|---|---|
    | ICPC WF 2025 | 3, 1, 1, 2, 2, -, 1, 1, 1, 1, 1 | 10/12 | Gold |

This table shows that DeepSeek-V3.2-Speciale achieves gold medal-level performance in the world's most prestigious mathematics (IMO, CMO) and programming (IOI, ICPC WF) competitions, a landmark achievement for a general-purpose language model.

6.3. Ablation Studies / Parameter Analysis

Analysis of Synthetic Agentic Tasks

The paper conducts experiments to validate its agentic task synthesis pipeline.

  1. Are the tasks challenging? They test frontier models on a sample of 50 synthesized tasks.

    The following are the results from Table 5 of the original paper:

    | Pass@k | DeepSeek-V3.2-Exp | Sonnet-4.5 | Gemini-3.0 Pro | GPT-5-Thinking |
    |---|---|---|---|---|
    | 1 | 12% | 34% | 51% | 62% |
    | 2 | 18% | 47% | 65% | 75% |
    | 4 | 26% | 62% | 74% | 82% |

    The results show that even the most powerful models (GPT-5) only achieve 62% Pass@1, indicating the tasks are indeed difficult and provide a strong training signal.

  2. Do the tasks generalize? They fine-tune a model (DeepSeek-V3.2-SFT) using RL only on the synthetic general agent data.

    Figure 5 | RL training of DeepSeek-V3.2-SFT using exclusively synthetic general agent data. The sub-plots track performance on benchmarks across domains such as airline, retail, and telecom as a function of RL training steps.

    As shown in Figure 5, this training leads to substantial improvements on unseen, real-world benchmarks like Tau2Bench and MCP-Mark/Universe. This confirms that the skills learned on the synthetic data successfully transfer to other agentic scenarios.

Analysis of Context Management

The paper analyzes different strategies for handling context length limits during long agentic tasks on the BrowseComp benchmark.

Figure 6 | Accuracy on BrowseComp with different test-time compute expansion strategies. The four curves show Summary, Discard-75%, Discard-all, and Parallel-fewest-step, each plotted against the number of real steps.

Figure 6 shows that all context management strategies improve performance over the baseline by allowing the agent to take more steps.

  • The Summary strategy improves the score but is inefficient.
  • The Discard-all strategy, which simply resets the context, is surprisingly effective and scalable, achieving a score of 67.6.
  • This performance is comparable to Parallel-fewest-step (running multiple trajectories in parallel), but Discard-all is more compute-efficient (uses fewer total steps). This analysis shows the importance of test-time algorithms in addition to the core model capabilities.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces DeepSeek-V3.2, a powerful open-source LLM that significantly closes the performance gap with leading proprietary models. The work makes three primary contributions:

  1. It proposes DeepSeek Sparse Attention (DSA), an efficient attention mechanism that enables effective long-context processing without performance degradation.

  2. It demonstrates the immense potential of scaling up RL compute, showing that a substantial post-training budget is a key ingredient for unlocking top-tier reasoning capabilities.

  3. It develops a large-scale agentic task synthesis pipeline that effectively trains the model for robust and generalizable tool use, a critical area where open models have previously lagged.

    The result is a model that is competitive in reasoning and state-of-the-art among open models in agentic tasks. The high-compute variant, DeepSeek-V3.2-Speciale, sets a new milestone by achieving gold-medal performance in elite STEM Olympiads, proving that open models can reach the absolute frontier of AI.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations when comparing their model to the best closed-source systems like Gemini-3.0-Pro:

  • Knowledge Breadth: Due to a lower total pre-training compute (FLOPs), DeepSeek-V3.2 has a less comprehensive knowledge base than its proprietary counterparts. The authors plan to address this by scaling up pre-training in future versions.
  • Token Efficiency: The model, especially the high-performance Speciale variant, often requires generating much longer reasoning chains (more tokens) to achieve the same quality of results. Future work will focus on improving the "intelligence density" to make reasoning more efficient.
  • Complex Task Performance: While strong, the model's ability to solve the most complex tasks still falls short of the absolute frontier. This motivates further refinements to both the foundation model and the post-training recipes.

7.3. Personal Insights & Critique

This paper presents a compelling and well-executed research effort that offers a clear roadmap for advancing open-source LLMs.

  • Inspirations:

    • The Primacy of Post-Training: The most impactful insight is the empirical validation that post-training is not just a final alignment step but a critical phase for capability development. The idea of allocating over 10% of the pre-training compute budget to RL is a paradigm shift for the open-source community, which has often been resource-constrained in this area.
    • Systematic Data Synthesis: The agentic data synthesis pipeline is a brilliant piece of engineering. Instead of waiting for high-quality human data to emerge, they have created a scalable, automated way to generate challenging and verifiable tasks. This "teach the model to create its own curriculum" approach is a powerful strategy for overcoming data bottlenecks in emerging AI domains.
    • Pragmatic Architectural Evolution: The introduction of DSA via continued pre-training is a practical and effective way to upgrade a model's architecture without starting from scratch. This iterative improvement approach is valuable for research labs that need to balance innovation with computational costs.
  • Potential Issues & Critique:

    • Hypothetical Baselines: A significant point of critique is the use of hypothetical, future-dated models like GPT-5 and Gemini-3.0-Pro as primary baselines. While this sets an ambitious target, it makes the results difficult to ground and verify against the current state of the art. The paper's entire timeline is set in 2025, which can be confusing and makes it read partly as a "future vision" paper rather than a report on existing achievements.

    • Token Efficiency is a Real-World Bottleneck: The paper acknowledges the token efficiency problem, but its importance cannot be overstated. In practical applications, a model that is 2-3x less token-efficient can be prohibitively slow and expensive, even if its final accuracy is higher. The trade-off between "length of thought" and "cost of thought" remains a critical research challenge.

    • Reproducibility of RL Scaling: While the paper outlines its modifications to GRPO, successfully scaling RL is notoriously difficult and sensitive to implementation details. The stability and success of their framework may be difficult for others in the open-source community to reproduce without access to their specific infrastructure and codebase.

      Overall, "DeepSeek-V3.2" is a landmark paper for the open-source AI community. It not only delivers a highly capable model but also provides a clear and actionable blueprint for how to compete at the highest level: innovate in architecture for efficiency, invest heavily in post-training for capability, and systematically synthesize data for new frontiers like agentic AI.
