Mathematical Formulas & Key Details:
1. Multiple Resolution Support:
To enable experiments with different compression ratios, DeepEncoder
is trained to support multiple input resolutions, as shown in Figure 4 and detailed in Table 1.

This is a manual transcription of Table 1 from the paper.

| Mode | Resolution Type | Resolution | Tokens | Process |
|---|---|---|---|---|
| Tiny | Native | 512 | 64 | resize |
| Small | Native | 640 | 100 | resize |
| Base | Native | 1024 | 256 | padding |
| Large | Native | 1280 | 400 | padding |
| Gundam | Dynamic | n×640 + 1024 | n×100 + 256 | resize + padding |
| Gundam-M | Dynamic | n×1024 + 1280 | n×256 + 400 | resize + padding |
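For quick reference, the table can be expressed as a small Python mapping. This is only a sketch: the reading of the Gundam token counts as "n local tiles plus one global view" is my interpretation of the n×100 + 256 and n×256 + 400 formulas, not something the table states explicitly.

```python
# Table 1 as a Python mapping (values transcribed above; structure is illustrative).
RESOLUTION_MODES = {
    # mode:    (resolution, vision_tokens, process)
    "Tiny":    (512,  64,  "resize"),
    "Small":   (640,  100, "resize"),
    "Base":    (1024, 256, "padding"),
    "Large":   (1280, 400, "padding"),
}

def gundam_tokens(n_tiles: int, tile_tokens: int = 100, global_tokens: int = 256) -> int:
    """Token count for a dynamic-resolution (Gundam) page: n local tiles plus one
    global view, i.e. n*100 + 256 for Gundam and n*256 + 400 for Gundam-M."""
    return n_tiles * tile_tokens + global_tokens

print(gundam_tokens(4))                                        # Gundam, 4 tiles -> 656 tokens
print(gundam_tokens(3, tile_tokens=256, global_tokens=400))   # Gundam-M, 3 tiles -> 1168 tokens
```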
For modes with padding (Base, Large), not all vision tokens correspond to actual image content. The number of valid tokens is calculated using the following formula:

$$N_{\text{valid}} = \left\lceil N_{\text{actual}} \times \left[ 1 - \frac{\max(w,h) - \min(w,h)}{\max(w,h)} \right] \right\rceil$$
- Symbol Explanation:
  - $N_{\text{valid}}$: The number of vision tokens that correspond to the original image content (excluding padding).
  - $N_{\text{actual}}$: The total number of vision tokens for the padded square image (e.g., 256 for Base mode).
  - $w, h$: The width and height of the original, unpadded input image.
  - The bracketed term simplifies to $\min(w,h)/\max(w,h)$, i.e., the fraction of the padded square that the original image actually occupies.
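A minimal sketch of this calculation in Python; the function name and the example page size are mine, not the paper's.

```python
import math

def valid_tokens(n_actual: int, w: int, h: int) -> int:
    """N_valid = ceil(N_actual * [1 - (max(w,h) - min(w,h)) / max(w,h)]).

    Equivalent to scaling N_actual by min(w,h)/max(w,h), the fraction of the
    padded square actually occupied by the original image.
    """
    long_side, short_side = max(w, h), min(w, h)
    return math.ceil(n_actual * (1 - (long_side - short_side) / long_side))

# A 1000x700 page in Base mode (padded to a 1024x1024 square, 256 total tokens):
print(valid_tokens(256, 1000, 700))   # -> 180 of the 256 tokens cover image content
```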
2. Decoder Reconstruction:
The decoder's job is to learn a mapping $f_{\text{dec}}$ from the compressed vision tokens back to the original text representation.

$$\hat{X} = f_{\text{dec}}(Z), \quad \text{where } n \le N$$

- Symbol Explanation:
  - $Z$: The set of $n$ compressed vision tokens from the DeepEncoder.
  - $\hat{X}$: The reconstructed sequence of $N$ text tokens.
  - $f_{\text{dec}}$: The non-linear transformation learned by the LLM decoder.
  - $n \le N$: The number of vision tokens is less than or equal to the number of original text tokens, indicating compression.
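The compression claim is simply the ratio $N/n$; a trivial helper makes the bookkeeping explicit (the function and example numbers are illustrative, not from the paper).

```python
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """Text-to-vision compression ratio N / n; n <= N means the page is compressed."""
    if n_vision_tokens > n_text_tokens:
        raise ValueError("expected n <= N (more text tokens than vision tokens)")
    return n_text_tokens / n_vision_tokens

# A page whose text tokenizes to ~1,000 tokens, encoded in Base mode (256 vision tokens):
print(f"{compression_ratio(1000, 256):.1f}x")   # ~3.9x compression
```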
3. Data Engine:
The model is trained on a diverse mix of data to ensure robust capabilities.
- OCR 1.0 Data: 30M pages of multilingual documents and 20M scene text images. Annotations range from "coarse" (raw text extraction) to "fine" (detailed layout and text information), as shown in Figure 5.
  (Figure: a diagram showing DeepSeek-OCR's structured processing of a geometry proof problem from an eighth-grade mathematics textbook, covering the input image, the resulting text, and the parsing and redrawing of the figure, illustrating its ability to parse complex geometric structures.)
- OCR 2.0 Data: Specialized data for parsing complex structures. This includes 10M charts (converted to HTML tables), 5M chemical formulas (from SMILES strings), and 1M plane geometry figures. Figure 6 shows examples of the ground truth format for charts and geometry. (An illustrative SMILES-rendering sketch follows this list.)
  (Figure: a screenshot of a two-page document containing Arabic paragraph text and a three-column table about the role of supporting small and micro enterprises and employment; no formulas appear in the text.)
  (Figure: a diagram of multilingual OCR text detection, layout segmentation, and structured recognition, with the source image on the left, color-coded segmentation results in the middle, and the corresponding structured text output on the right.)
- General Vision Data: Data for tasks like image captioning and object detection, making up 20% of the training mix to retain general VLM capabilities.
- Text-only Data: 10% of the training data is text-only to maintain the model's core language abilities.
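The paper states that the chemical-formula data is rendered from SMILES strings but does not name the rendering toolchain. As one hedged illustration, RDKit can produce such (image, label) pairs:

```python
# Illustrative only: building an (image, label) pair for OCR 2.0-style chemical data.
# RDKit is an assumption here; the paper does not specify its rendering tool.
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"            # aspirin, used purely as an example
mol = Chem.MolFromSmiles(smiles)               # parse the SMILES string into a molecule
img = Draw.MolToImage(mol, size=(512, 512))    # render the structural formula as an image
img.save("aspirin.png")                        # image is the input; the SMILES string is the target text
```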
4. Training Pipelines:
Training is a two-stage process:
- Train DeepEncoder: The encoder is first trained independently with a compact language model to learn how to produce useful visual representations from the OCR and general vision data.
- Train DeepSeek-OCR: The pre-trained DeepEncoder is connected to the DeepSeek-3B-MoE decoder. The entire model is then trained on the full data mix. During this stage, the early parts of the encoder (SAM) are frozen, and the rest of the model (CLIP, Decoder) is fine-tuned. The training is done at scale using 20 nodes (160 A100 GPUs) with both pipeline and data parallelism.
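A hedged PyTorch sketch of the stage-2 recipe described above. The attribute names (sam_encoder, clip_encoder, decoder) are hypothetical, since the paper does not expose the model's module layout:

```python
import torch

def configure_stage2(model: torch.nn.Module) -> list:
    """Freeze the SAM part of DeepEncoder; fine-tune CLIP and the MoE decoder.

    Module names are assumptions for illustration, not the real attribute names.
    """
    for p in model.sam_encoder.parameters():
        p.requires_grad = False                     # SAM stays frozen in stage 2
    trainable = []
    for name in ("clip_encoder", "decoder"):
        for p in getattr(model, name).parameters():
            p.requires_grad = True                  # CLIP + decoder are fine-tuned
            trainable.append(p)
    return trainable

# optimizer = torch.optim.AdamW(configure_stage2(model), lr=3e-5)  # lr is illustrative
```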