DeepSeek-OCR: Contexts Optical Compression
TL;DR Summary
DeepSeek-OCR uses 2D optical mapping to compress long texts efficiently, achieving 97% OCR accuracy under 10× compression and 60% at 20×, outperforming existing OCR models and enabling large-scale training data generation for LLMs.
Abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
English Analysis
1. Bibliographic Information
- Title: DeepSeek-OCR: Contexts Optical Compression
- Authors: Haoran Wei, Yaofeng Sun, Yukun Li
- Affiliations: DeepSeek-AI
- Journal/Conference: This paper is presented as a technical report. As of the analysis date, it appears to be a preprint (e.g., posted on a service like arXiv) rather than a formally peer-reviewed conference or journal publication.
- Publication Year: The paper references works from 2025, suggesting it was written in 2025.
- Abstract: The paper introduces DeepSeek-OCR, a model designed to explore the compression of long text contexts by mapping them into a 2D optical format (an image). The system consists of a DeepEncoder and a DeepSeek3B-MoE-A570M decoder. The DeepEncoder is engineered to handle high-resolution images while producing a small, manageable number of vision tokens, thus achieving a high compression ratio. Experiments demonstrate that the model can achieve 97% OCR precision at compression ratios under 10× and about 60% accuracy at a 20× ratio. The authors highlight the model's promise for long-context compression research and its practical value, outperforming models like GOT-OCR2.0 and MinerU2.0 on the OmniDocBench benchmark while using significantly fewer vision tokens. The code and models are made publicly available.
- Original Source Link: https://raw.githubusercontent.com/deepseek-ai/DeepSeek-OCR/refs/heads/main/DeepSeek_OCR_paper.pdf (a direct link to the PDF file).
2. Executive Summary
Background & Motivation (Why):
The primary challenge this paper addresses is the computational inefficiency of Large Language Models (LLMs) when processing long sequences of text. The attention mechanism, central to modern LLMs, has a computational cost that scales quadratically with the length of the input sequence, i.e., $O(n^2)$ for a sequence of $n$ tokens. This makes handling very long documents, conversations, or codebases prohibitively expensive in terms of both memory and processing time.
Prior work has focused on algorithmic improvements to the attention mechanism or architectural changes within the text domain. This paper introduces a completely different and novel approach: "contexts optical compression". The core idea is to leverage the visual modality as a highly efficient compression medium. Instead of feeding a long string of text tokens to an LLM, the text is first rendered as an image. A Vision-Language Model (VLM) then "reads" this image. A single image can represent thousands of words using a much smaller number of "vision tokens," potentially breaking the quadratic scaling bottleneck. The paper uses the task of Optical Character Recognition (OCR) as a perfect testbed for this concept, as it provides a direct, quantifiable mapping between a compressed visual input (the image) and a decompressed textual output (the recognized text).
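As a rough, back-of-the-envelope illustration (ours, not a figure from the paper), assume self-attention cost proportional to the square of the sequence length and a 10× optical compression ratio for a 10,000-token document:

$$\frac{\text{attention cost at } 10{,}000 \text{ text tokens}}{\text{attention cost at } 1{,}000 \text{ vision tokens}} \approx \frac{10{,}000^2}{1{,}000^2} = 100$$

That is, a 10× reduction in token count translates into roughly a 100× reduction in attention compute for that sequence.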
Main Contributions / Findings (What):
The paper presents three main contributions:
- Quantitative Proof-of-Concept for Optical Compression: It provides the first comprehensive analysis of the trade-off between vision-token compression ratios and text decoding accuracy. It shows that near-lossless decoding is possible at a 10× compression ratio (e.g., decoding 1000 text tokens from just 100 vision tokens), and meaningful information is still retained even at a 20× compression ratio.
- A Novel Vision Encoder (DeepEncoder): To enable this high compression, the authors designed a new vision encoder architecture. DeepEncoder can process high-resolution images while keeping both its internal memory usage (activations) low and the final number of output vision tokens minimal. It cleverly combines a local-feature extractor (SAM) with a global-feature extractor (CLIP), bridged by a convolutional compressor.
- A State-of-the-Art Practical OCR Model (DeepSeek-OCR): The complete model, DeepSeek-OCR, which pairs the DeepEncoder with a DeepSeek3B-MoE decoder, achieves state-of-the-art performance among end-to-end models on the OmniDocBench benchmark. Critically, it does so while using an order of magnitude fewer vision tokens than comparable high-performance models, demonstrating the practical efficiency of its design.
3. Prerequisite Knowledge & Related Work
Foundational Concepts
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT series) trained on vast amounts of text data to understand, generate, and reason about human language. Their performance is often limited by the length of the text they can process at once, known as the "context window."
- Vision-Language Models (VLMs): VLMs are models that can process both visual data (images) and text. They can perform tasks like describing an image, answering questions about it, or, in this case, reading text within an image.
- Optical Character Recognition (OCR): The technology used to convert images containing typed, handwritten, or printed text into machine-encoded text. Traditional OCR involves pipelines of detection and recognition, whereas modern end-to-end models do it in one step.
- Tokens: In machine learning, tokens are the fundamental units of data. For LLMs, a token is typically a word or sub-word. For VLMs, an image is broken down into a grid of patches, and each patch is converted into a vision token. The core idea of this paper is to represent many text tokens with few vision tokens.
- Mixture of Experts (MoE): An efficient neural network architecture. Instead of using one giant network for all tasks, an MoE model contains many smaller "expert" sub-networks. For any given input, only a few relevant experts are activated, saving significant computational cost during inference.
- Attention Mechanisms: The technology that allows models to weigh the importance of different tokens when processing a sequence. Global attention considers all tokens in the sequence, which is powerful but computationally expensive. Window attention only considers tokens within a small local window, which is much more efficient but misses long-range dependencies (see the sketch below).
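To make the difference concrete, here is a minimal NumPy sketch (our own illustration, not from the paper) that builds the attention masks for global versus window attention; the window size of 4 is an arbitrary assumption:

```python
import numpy as np

def global_attention_mask(n: int) -> np.ndarray:
    """Every token may attend to every other token: O(n^2) attended pairs."""
    return np.ones((n, n), dtype=bool)

def window_attention_mask(n: int, window: int = 4) -> np.ndarray:
    """Each token only attends to tokens within +/- `window` positions:
    O(n * window) attended pairs, which stays cheap as n grows."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n = 16
print(global_attention_mask(n).sum())   # 256 attended pairs (16 * 16)
print(window_attention_mask(n).sum())   # 124 attended pairs (clipped at the edges)
```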
Previous Works & Technological Evolution
The paper positions its work against existing VLM encoder architectures, highlighting their respective weaknesses.
- Dual-Tower Architectures (e.g., Vary): These use two parallel encoders (e.g., one for general features, one for high-resolution details). The paper notes this approach complicates deployment and training due to the need for dual image preprocessing pipelines.
- Tile-Based Methods (e.g., InternVL2.0): These methods slice a high-resolution image into smaller tiles and process each one separately. While this handles large images, it can lead to excessive fragmentation and a very large number of vision tokens, which slows down the language model decoder.
- Adaptive Resolution Encoders (e.g., Qwen2-VL, based on NaViT): These encoders can flexibly handle images of any aspect ratio. However, for very large images, their memory consumption (activations) can become unmanageable, and packing sequences of different lengths during training is complex.

Figure 2 illustrates these three common but flawed approaches.
(Figure 2 caption, translated): A diagram showing the structures of three typical vision encoders used in current mainstream open-source VLMs and their respective drawbacks, covering issues of resolution, number of vision tokens, and inference speed.
In the domain of end-to-end OCR, models like Nougat and GOT-OCR2.0 have simplified traditional OCR pipelines. However, the authors argue that these works have not addressed a fundamental question: "for a document containing 1000 words, how many vision tokens are at least needed for decoding?" DeepSeek-OCR is designed specifically to answer this.
Differentiation
DeepSeek-OCR's key innovation is the DeepEncoder, which synthesizes the strengths of previous approaches while mitigating their weaknesses. It serially connects a low-memory window-attention component (for initial high-resolution processing) with a powerful global-attention component (for holistic understanding). The critical element is a convolutional compressor placed between them, which drastically reduces the number of tokens before they reach the computationally expensive global attention layers. This achieves the goals of handling high resolutions with low memory usage and producing a minimal number of vision tokens.
4. Methodology (Core Technology & Implementation)
The DeepSeek-OCR model has a standard VLM architecture: a vision encoder that processes an image and a language decoder that generates text based on the visual information.
(Figure 3 caption, translated): A schematic of the DeepSeek-OCR architecture, including a SAM-based tokenizer with local (window) attention, a 16× convolutional downsampler that produces the vision tokens, a CLIP component with global attention, and the DeepSeek-3B decoder that generates the output.
Principles & Steps
As shown in Figure 3, the pipeline works as follows:
- Input: An image containing text is fed into the system.
- DeepEncoder Processing:
  - A SAM-base encoder, which uses efficient window attention, processes the high-resolution image first. For a 1024×1024 image, it generates a large number of initial patch tokens (e.g., 4096). Because it is a smaller model (80M parameters) using window attention, the memory activation is manageable.
  - These tokens then pass through a Token Compressor: a simple 2-layer convolutional module that downsamples the tokens, reducing their count by a factor of 16 (e.g., from 4096 to 256); a shape-level sketch of this stage is given after this list.
  - The reduced set of tokens is then fed into a CLIP-large encoder, which uses powerful but expensive dense global attention. Since it now operates on far fewer tokens, the computational cost is kept low.
- DeepSeek-3B-MoE Decoder: The final, compressed vision tokens from the DeepEncoder are passed to the language model decoder. The decoder, an efficient Mixture-of-Experts model, then reconstructs the original text, effectively "decompressing" the visual information.
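The token-count bookkeeping is the heart of the design. The following PyTorch-style sketch mirrors the description above at the shape level only; module names, channel widths, and the exact compressor layout are our assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Illustrative 2-layer conv compressor: two stride-2 convolutions give a
    4x reduction per spatial dimension, i.e., 16x fewer tokens overall."""
    def __init__(self, dim_in: int = 768, dim_out: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens on a square grid -> reshape to (B, C, H, W)
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.net(x)                      # (B, C', H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, N/16, C')

# Shape check for a 1024x1024 input with 16x16 patches:
tokens = torch.randn(1, 64 * 64, 768)   # 4096 patch tokens from the window-attention stage
compressed = TokenCompressor()(tokens)
print(compressed.shape)                 # torch.Size([1, 256, 1024]) -> fed to global attention
```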
Mathematical Formulas & Key Details
The core function of the decoder is to perform this "decompression", mapping a few vision tokens back to many text tokens:

$$\hat{X} = f_{\text{dec}}(Z), \qquad Z = (z_1, \dots, z_n),\quad \hat{X} = (\hat{x}_1, \dots, \hat{x}_N),\quad n \ll N$$

- $Z$ represents the compressed vision tokens from the DeepEncoder, where $n$ is the small number of vision tokens.
- $\hat{X}$ represents the reconstructed text representation, where $N$ is the large number of original text tokens.
- $f_{\text{dec}}$ is the decoder function (the LLM) that learns the non-linear mapping from the compressed visual space back to the textual space.
Multiple Resolution Support
To systematically study compression ratios, the model must support various input resolutions, which produce different numbers of vision tokens. The paper details several modes, achieved through dynamic interpolation of positional encodings.
(Figure caption, translated): A schematic showing how DeepSeek-OCR realizes different compression settings across resolution modes by adjusting the number of vision tokens, covering the Resize, Padding, and Gundam modes, and illustrating the relationship between compression ratio and vision-token configuration in practice.
This is summarized in the manually transcribed table below, based on Table 1 from the paper.
Table 1: Multi-resolution support of DeepEncoder.

| Mode | Tiny | Small | Base | Large | Gundam | Gundam-M |
|---|---|---|---|---|---|---|
| Resolution type | Native | Native | Native | Native | Dynamic | Dynamic |
| Resolution | 512 | 640 | 1024 | 1280 | 640 + 1024 | 1024 + 1280 |
| Vision tokens | 64 | 100 | 256 | 400 | n×100 + 256 | n×256 + 400 |
| Process | resize | resize | padding | padding | resize + padding | resize + padding |
- Native Resolution: Modes like Tiny (64 tokens) and Small (100 tokens) resize the image, which is efficient but may distort it. Base (256 tokens) and Large (400 tokens) pad the image to preserve its aspect ratio.
- When padding, not all vision tokens correspond to the actual image. The number of valid tokens scales with the aspect ratio (see the sketch after this list):

  $$N_{\text{valid}} = \left\lceil N_{\text{actual}} \times \frac{\min(w, h)}{\max(w, h)} \right\rceil$$

  - $N_{\text{valid}}$: The number of vision tokens corresponding to the original image content.
  - $N_{\text{actual}}$: The total number of vision tokens for the padded square image (e.g., 256 for Base mode).
  - $w, h$: The width and height of the original image.
- Dynamic Resolution: Modes like Gundam are for very high-resolution images (e.g., newspapers). They combine a tiled approach (local views) with a down-scaled global view, balancing detail and context.
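As a concrete illustration of the padding bookkeeping, here is a small Python helper (our own sketch; the mode-to-token table comes from Table 1, the formula from the discussion above, and the example page dimensions are arbitrary):

```python
import math

# Vision-token budgets for the native-resolution modes (Table 1).
MODE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}

def valid_vision_tokens(width: int, height: int, mode: str = "Base") -> int:
    """Number of vision tokens that actually cover image content when the
    image is padded to a square before encoding."""
    n_actual = MODE_TOKENS[mode]
    return math.ceil(n_actual * min(width, height) / max(width, height))

# A typical A4-like page with roughly a 1:1.41 aspect ratio in Base mode:
print(valid_vision_tokens(1240, 1754, mode="Base"))  # ~181 of the 256 tokens cover real content
```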
Data Engine & Training
The model was trained on a diverse mix of data:
- OCR 1.0 Data: 30M pages of multilingual PDFs and 3M Word documents. This data includes both "coarse" annotations (text extracted directly) and "fine" annotations with detailed layout information (bounding boxes), as shown in Figure 5.

  (Figure 5 caption, translated): Fine-grained annotation for OCR 1.0 data. The content uses an interleaved layout-and-text format, where each text segment is preceded by its bounding-box coordinates and label, with coordinates normalized to a 1000-unit grid.
- OCR 2.0 Data: Synthetically generated data for complex parsing tasks like charts, chemical formulas, and geometric figures.

  (Figure caption, translated): A rendered chart combining a bar chart and a table, showing the value distributions and percentages of four synthetically generated categories across several items, with the table on the right listing the corresponding values.

  (Figure caption, translated): A geometric diagram showing a polyhedral structure with points labeled E, S, X, and R, depicting spatial relationships and line-segment connections.
- General Vision Data (20%): Standard vision-language data for tasks like captioning and detection to retain general visual understanding.
- Text-only Data (10%): Pure text data to maintain the LLM's language capabilities.

The training was done in two stages on 20 nodes of A100-40G GPUs, first training the DeepEncoder and then the full DeepSeek-OCR model.
5. Experimental Setup
- Datasets:
  - Fox [21] benchmark: Used for the core vision-text compression study. The authors selected 100 English documents from this benchmark.
  - OmniDocBench [27]: A comprehensive benchmark for real-world document parsing, used to evaluate the model's practical OCR performance against other state-of-the-art models.
- Evaluation Metrics:
  - Edit Distance: This is the primary metric for OCR quality.
    - Conceptual Definition: Edit distance (specifically, Levenshtein distance) measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the model's output string into the ground-truth string. A lower edit distance indicates a better, more accurate transcription. It is a very strict metric.
    - Mathematical Formula: The Levenshtein distance $\mathrm{lev}(a, b)$ between two strings $a$ (of length $|a|$) and $b$ (of length $|b|$) is given by:

      $$\mathrm{lev}(a,b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \mathrm{lev}\big(\mathrm{tail}(a), b\big) \\ \mathrm{lev}\big(a, \mathrm{tail}(b)\big) \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) \end{cases} & \text{otherwise} \end{cases}$$

    - Symbol Explanation:
      - $a, b$: The two strings being compared (ground truth and model prediction).
      - $|a|, |b|$: The lengths of the strings.
      - $\mathrm{tail}(x)$: A function that returns the string $x$ without its first character.
      - $a[0], b[0]$: The first character of each string.
      The values reported in the paper are typically normalized by the length of the ground-truth text (a reference implementation sketch follows this list).
  - Precision: In this context, it refers to the accuracy of the decoded OCR text, likely calculated at the character or word level. It is presented as a percentage.
  - Compression Ratio: The ratio of the number of original text tokens to the number of vision tokens used to represent them.
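For reference, a minimal (unoptimized) Python implementation of the normalized edit distance used as an OCR metric might look like the following; this is our own sketch, not the paper's evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(prediction: str, ground_truth: str) -> float:
    """Edit distance divided by the ground-truth length (lower is better)."""
    if not ground_truth:
        return float(bool(prediction))
    return levenshtein(prediction, ground_truth) / len(ground_truth)

print(normalized_edit_distance("DeepSeek-0CR", "DeepSeek-OCR"))  # 1 substitution / 12 chars ≈ 0.083
```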
- Baselines: A wide range of models were used for comparison on OmniDocBench, including:
  - Pipeline Models: Traditional systems that separate layout analysis and OCR, such as Marker, Mathpix, and MinerU-2.1.1.
  - End-to-end Models: Unified VLMs like Nougat, InternVL2-76B, Qwen2.5-VL-7B, GOT-OCR2.0, and proprietary models like GPT4o and Gemini2.5-Pro.
6. Results & Analysis
Core Results: Vision-text Compression Study
The results from the Fox benchmark experiment (Table 2) provide strong evidence for the feasibility of optical compression.
Manual Transcription of Table 2: Vision-text compression ratio on the Fox benchmark.

| Text Tokens | Precision (64 vision tokens) | Compression (64) | Precision (100 vision tokens) | Compression (100) | Pages |
|---|---|---|---|---|---|
| 600-700 | 96.5% | 10.5× | 98.5% | 6.7× | 7 |
| 700-800 | 93.8% | 11.8× | 97.3% | 7.5× | 28 |
| 800-900 | 83.8% | 13.2× | 96.8% | 8.5× | 28 |
| 900-1000 | 85.9% | 15.1× | 96.8% | 9.7× | 14 |
| 1000-1100 | 79.3% | 16.5× | 91.5% | 10.6× | 11 |
| 1100-1200 | 76.4% | 17.7× | 89.8% | 11.3× | 8 |
| 1200-1300 | 59.1% | 19.7× | 87.1% | 12.6× | 4 |
Analysis:
- With 100 vision tokens (Small mode), the model maintains ~97% precision up to a 9.7× compression ratio; the decoding is nearly lossless.
- Performance degrades gracefully as the compression ratio increases. Even at a very high compression ratio of 19.7× (decoding ~1250 text tokens from only 64 vision tokens), the model still achieves ~60% precision, indicating that significant information is preserved.
- These results are visualized in Figure 1(a).

(Figure 1 caption, translated): A two-part chart. Figure 1(a) shows compression ratio versus precision for different text-token counts on the Fox benchmark; Figure 1(b) compares models on OmniDocBench by number of vision tokens versus overall performance (edit distance).
Core Results: OCR Practical Performance
Table 3 shows that DeepSeek-OCR is not just a theoretical model but a highly competitive practical OCR system.
Manual Transcription of Table 3: Performance on OmniDocBench (edit distance, lower is better). Only selected models are transcribed for brevity, focusing on key comparisons; sub-scores are reported separately for English (EN) and Chinese (ZH).

| Model | Tokens | EN overall | EN text | EN formula | EN table | EN order | ZH overall | ZH text | ZH formula | ZH table | ZH order |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GOT-OCR2.0 [38] | 256 | 0.287 | 0.189 | 0.360 | 0.459 | 0.141 | 0.411 | 0.315 | 0.528 | 0.520 | 0.280 |
| MinerU2.0 [34] | 6790 | 0.133 | 0.045 | 0.273 | 0.150 | 0.066 | 0.238 | 0.115 | 0.506 | 0.209 | 0.122 |
| DeepSeek-OCR (end2end), Small | 100 | 0.221 | 0.142 | 0.373 | 0.242 | 0.125 | 0.284 | 0.240 | 0.530 | 0.159 | 0.205 |
| DeepSeek-OCR (end2end), Base | 256 (182) | 0.137 | 0.054 | 0.267 | 0.163 | 0.064 | 0.240 | 0.205 | 0.474 | 0.100 | 0.181 |
| DeepSeek-OCR (end2end), Large | 400 (285) | 0.138 | 0.054 | 0.277 | 0.152 | 0.067 | 0.208 | 0.143 | 0.461 | 0.104 | 0.123 |
| DeepSeek-OCR (end2end), Gundam | 795 | 0.127 | 0.043 | 0.269 | 0.134 | 0.062 | 0.181 | 0.097 | 0.432 | 0.089 | 0.103 |
Analysis:
- Efficiency: DeepSeek-OCR in Small mode (100 tokens) achieves a better overall score (0.221) than GOT-OCR2.0 (0.287), which uses 256 tokens.
- Performance: DeepSeek-OCR in Gundam mode, using only 795 tokens, outperforms MinerU2.0 (0.127 vs. 0.133), which requires nearly 7,000 vision tokens.
- This demonstrates that the DeepEncoder design is highly effective, achieving top-tier performance with a fraction of the vision tokens, which directly translates to faster inference and lower computational costs.
Ablations / Document Type Analysis
Table 4 breaks down performance by document type, revealing which kinds of documents require more visual information.
Manual Transcription of Table 4: Edit distances for different document categories.

| Mode | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.940 | 0.320 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.080 | 0.027 | 0.100 | 0.130 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.060 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam-M | 0.052 | 0.090 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.100 | 0.099 | 0.077 |
Analysis:
- Documents with simple layouts, such as Slides and Financial Reports, achieve excellent performance with very few tokens (Small or Base mode is sufficient).
- Text-dense, complex-layout documents like Newspapers require the high-resolution Gundam or Gundam-M modes to achieve an acceptable edit distance. This is because they contain 4,000-5,000 text tokens, which would exceed the ~10× compression boundary for the smaller modes. This empirically validates the compression limits found in the Fox benchmark study.
Qualitative Study
The paper provides visual examples of DeepSeek-OCR's advanced capabilities.
- Deep Parsing: The model can perform multi-level analysis. After an initial OCR pass, it can be prompted to "deep parse" specific regions, such as extracting structured data from charts (Figure 7), generating captions for images within a document (Figure 8), recognizing chemical formulas (Figure 9), or interpreting geometric diagrams (Figure 10).

  (Figure captions: Figure 7: Deep parsing a chart; Figure 8: Captioning an image in a document; Figure 9: Parsing a chemical formula; Figure 10: Parsing geometry. Translated caption: The figure shows DeepSeek-OCR's deep-parsing mode producing structured output for a chart in a financial research report, comparing the extracted chart with its re-rendered result and highlighting the importance of structured chart extraction for future OCR models.)
- Multilingual Recognition: The model is not limited to English and Chinese. It can process documents in nearly 100 languages, as shown with Arabic and Sinhala examples in Figure 11.

  (Figure caption, translated): A three-page document example showing the original version, a fine-grained block segmentation with color annotations, and the corresponding structured text output, illustrating multi-level document parsing and text-block detection.
- General Vision Understanding: By including general vision data in its training, the model retains capabilities like object detection and grounding, making it a versatile VLM (Figure 12).

  (Figure caption, translated): A collection of six scenes: a math problem on a classroom blackboard, a green plastic container of bean paste, a teacher in a black-and-white comic, an outdoor kite-flying scene, a photo of a fire hydrant with a bow tie, and a white mug labeled "Bountiful Potential", covering education, daily life, and text recognition.
7. Conclusion & Reflections
Conclusion Summary
The authors successfully demonstrate that "contexts optical compression" is a viable and highly promising direction for tackling the long-context problem in LLMs. Their model, DeepSeek-OCR, validates this by achieving near-lossless text decompression at a ~10× compression ratio. The novel DeepEncoder architecture proves to be exceptionally efficient, enabling state-of-the-art OCR performance with significantly fewer computational resources than competing models. The paper concludes that this approach has the potential to facilitate future developments in both VLMs and LLMs and serves as a powerful tool for large-scale data generation.
Limitations & Future Work
The authors acknowledge that OCR is just a first step. To fully validate the concept of context compression for general reasoning, they plan to conduct further research, including:
- Digital-optical text interleaved pretraining: Training a model on a mix of regular text and "optical text" to see if it can seamlessly use both.
- Needle-in-a-haystack testing: A standard test to see if a model can recall a small, specific piece of information from a very long context, which would test the fidelity of the optical compression.
Personal Insights & Critique
This paper presents a genuinely innovative and "out-of-the-box" idea. The most compelling aspect is its simplicity and elegance. Instead of complex algorithmic wizardry, it repurposes an existing modality (vision) to solve a core scaling problem in another (language).
The analogy to human memory and forgetting is particularly insightful. As discussed in the paper and illustrated in Figure 13, the method naturally mimics how memory works: recent, important information is kept at high fidelity (high-resolution image), while older, less critical context can be progressively compressed (by resizing the image), losing detail but retaining the gist, much like how distant memories fade.
(Figure 13 caption, translated): A schematic showing how the clarity of memory, vision, and text degrades with time, distance, and resolution respectively, using light-bulb, eye, and text icons to mark the different information types.
This could be a game-changer for building agents or conversational AI with theoretically unlimited context, as the computational cost of the "memory" would grow sub-linearly.
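To make the forgetting analogy concrete, here is a purely hypothetical Python sketch (ours, not from the paper) of how an agent might budget vision tokens by context age, reusing the mode names and token counts from Tables 1 and 3:

```python
# Hypothetical policy: older context pages are re-rendered at lower resolution,
# so the total "memory" token count grows much more slowly than the raw text.
MODE_TOKENS = {"Gundam": 795, "Base": 256, "Small": 100, "Tiny": 64}

def tokens_for_age(age_in_turns: int) -> int:
    """Recent pages keep high fidelity; distant ones are progressively compressed."""
    if age_in_turns < 2:
        return MODE_TOKENS["Gundam"]
    if age_in_turns < 5:
        return MODE_TOKENS["Base"]
    if age_in_turns < 20:
        return MODE_TOKENS["Small"]
    return MODE_TOKENS["Tiny"]

history_pages = 50  # one rendered page per past turn
budget = sum(tokens_for_age(age) for age in range(history_pages))
print(budget)  # 5778 vision tokens vs. ~50,000 raw text tokens (assuming ~1,000 text tokens per page)
```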
Potential questions and critiques:
- Rendering Overhead: The process requires rendering text to an image. What is the computational cost and latency of this step? For real-time applications, this could be a bottleneck.
- Information beyond Text: The current study focuses on text. How does this approach handle non-textual information like tables, charts, or diagrams that are already visual? The "deep parsing" capability hints at this, but a deeper investigation is needed.
- Error Propagation: If the OCR makes an error during the "decompression" of an old memory, that error could be propagated or even amplified in subsequent reasoning steps. The 60% precision at 20× compression might be too low for tasks requiring high fidelity.
Overall, this is a landmark paper that opens a new and exciting research avenue. It shifts the perspective on long-context processing from a purely linguistic challenge to a multi-modal one, with profound implications for the future architecture of AI systems.