
DeepSeek-OCR: Contexts Optical Compression

Keywords: Long-Context Compression, Optical 2D Mapping, Vision Encoder, Text Optical Character Recognition (OCR), Large-Scale Document Training Data Generation
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DeepSeek-OCR uses optical 2D mapping for efficient long-context compression, achieving about 97% OCR precision at compression ratios below 10x and roughly 60% at 20x. It surpasses existing OCR models while using far fewer vision tokens and enables large-scale training data generation.

Abstract

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.

English Analysis

1. Bibliographic Information

  • Title: DeepSeek-OCR: Contexts Optical Compression
  • Authors: Haoran Wei, Yaofeng Sun, Yukun Li
  • Affiliations: All authors are affiliated with DeepSeek-AI.
  • Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for publication in a journal or conference.
  • Publication Year: The arXiv ID 2510.18234v1 indicates a submission in October 2025. The paper reflects recent advancements in the field.
  • Abstract: The paper introduces DeepSeek-OCR, a model designed to explore the compression of long text contexts by mapping them into a 2D optical format (an image). The model consists of a novel DeepEncoder and a DeepSeek3B-MoE decoder. The core innovation is the DeepEncoder, which can process high-resolution images while producing a small, manageable number of vision tokens, thereby achieving a high compression ratio. Experiments show that the model can decode text with 97% precision at a compression ratio of up to 10x (meaning the original text has 10 times more tokens than the resulting image). Even at a 20x compression ratio, it maintains about 60% accuracy. The paper highlights the model's practical value, outperforming existing models like GOT-OCR2.0 and MinerU2.0 on the OmniDocBench benchmark with significantly fewer vision tokens. The system is efficient enough for production, capable of processing over 200,000 pages per day on a single GPU for generating training data.
  • Original Source Link: The paper is available as a preprint at https://arxiv.org/abs/2510.18234v1. The code and models are available at http://github.com/deepseek-ai/DeepSeek-OCR.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Modern Large Language Models (LLMs) struggle with extremely long text sequences. The computational cost, particularly memory, grows quadratically with the length of the input text ($O(n^2)$), making it inefficient and expensive to process long documents, books, or extensive conversation histories.
    • Gap in Prior Work: While many models focus on improving visual question answering or general vision tasks, less attention has been paid to using the visual modality as a tool to make LLMs more efficient at their primary task: text processing. The fundamental question of "how many visual tokens are needed to represent a thousand words?" has been largely unexplored.
    • Fresh Angle: The paper proposes a novel paradigm: Contexts Optical Compression. Instead of feeding a long string of text tokens to an LLM, the text is first rendered as an image of a document. A Vision-Language Model (VLM) then "reads" this image. Since an image can be represented by a much smaller number of "vision tokens" compared to the "text tokens" of the original text, this process acts as a form of compression, potentially sidestepping the quadratic complexity bottleneck of LLMs.
  • Main Contributions / Findings (What):

    1. Quantitative Analysis of Vision-Text Compression: The paper provides the first comprehensive study on the compression ratio achievable by mapping text to an image. It demonstrates that near-lossless (97% accuracy) decompression is possible at a 10x compression ratio, and a usable level of accuracy (60%) is retained even at a 20x ratio.
    2. A Novel Vision Encoder (DeepEncoder): A new encoder architecture is introduced that is specifically designed for this compression task. It can handle high-resolution images while keeping both memory usage (activation) and the number of output vision tokens low. This is achieved by serially connecting a window attention module (SAM) and a global attention module (CLIP) with a convolutional compressor in between.
    3. A State-of-the-Art OCR Model (DeepSeek-OCR): The complete model, DeepSeek-OCR, combines the DeepEncoder with a DeepSeek3B-MoE decoder. It achieves top-tier performance on the OmniDocBench document understanding benchmark while using drastically fewer vision tokens than its competitors. This demonstrates the practical viability and efficiency of the proposed approach.
    4. Practical Utility and Open-Sourcing: The model is highly efficient, capable of generating large-scale training data for other models. The authors have publicly released the code and model weights, encouraging further research in this direction.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Large Language Model (LLM): An AI model trained on vast amounts of text data to understand and generate human-like language. Their performance often depends on the length of the text they can process, known as the "context window."
    • Vision-Language Model (VLM): A model that can process both images and text. It typically consists of a vision encoder (to understand images) and a language model (to process text and generate responses).
    • Optical Character Recognition (OCR): The process of converting images of typed, handwritten, or printed text into machine-encoded text. This paper uses OCR as the "decompression" step in its proposed pipeline.
    • Tokens (Text vs. Vision): In AI, text and images are broken down into small units called tokens. For text, a token might be a word or part of a word. For an image, a vision token is typically a small patch of the image. The number of tokens directly impacts computational cost.
    • Attention Mechanisms: The core component of modern Transformer-based models.
      • Global Attention: Each token can "look at" every other token in the sequence. This is powerful but computationally expensive ($O(n^2)$); see the cost sketch after this list.
      • Window Attention: Each token only "looks at" other tokens within a small local window. This is much more efficient but can miss long-range dependencies.
    • Mixture-of-Experts (MoE): A type of neural network architecture where instead of using one large model, there are many smaller "expert" sub-models. For any given input, only a few relevant experts are activated, making inference much faster and more efficient while maintaining the expressive power of a much larger model.
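To make the cost contrast concrete, here is a small back-of-envelope sketch (not from the paper): it counts token-pair interactions under global versus window attention. The 4096-token figure matches the 1024×1024 / 16×16-patch example quoted later in the methodology; the 256-token local window is an assumption chosen only for illustration.

```python
# Illustrative back-of-envelope comparison of attention cost.
# n = 4096 matches the SAM patch count quoted later (1024x1024 image, 16x16 patches);
# the 256-token local window is an assumed value, used only to make the contrast concrete.

def global_attention_pairs(n_tokens: int) -> int:
    """Every token attends to every other token: O(n^2) interactions."""
    return n_tokens * n_tokens

def window_attention_pairs(n_tokens: int, window: int) -> int:
    """Each token attends only to tokens in its local window: O(n * w) interactions."""
    return n_tokens * window

n = 4096   # patch tokens for a 1024x1024 image with 16x16 patches
w = 256    # tokens per local window (assumption)

print(f"global attention : {global_attention_pairs(n):,} token pairs")    # 16,777,216
print(f"window attention : {window_attention_pairs(n, w):,} token pairs") # 1,048,576
```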
  • Previous Works: The paper positions itself against existing vision encoders for VLMs, categorizing them into three types as shown in Figure 2.

    Figure 7 | In the field of financial research reports, the deep parsing mode of DeepSeek-OCR can be used to obtain structured results of charts within documents. Charts are a crucial form of data representation…

    1. Dual-Tower Architecture (e.g., Vary): These models use two parallel encoders (e.g., a standard one and a high-resolution one like SAM's encoder) to process images. Limitation: This approach is complex to deploy and train, as it requires dual image processing pipelines.
    2. Tile-Based Method (e.g., InternVL2.0): These models handle high-resolution images by cutting them into smaller tiles and processing each tile separately. Limitation: This can create an excessively large number of vision tokens, which slows down the language model decoder.
    3. Adaptive Resolution Encoding (e.g., Qwen2-VL, NaViT): These models can process images of any resolution directly without tiling. Limitation: For very large images, this method consumes a massive amount of GPU memory (activation memory), which can lead to out-of-memory errors.
  • Differentiation: DeepSeek-OCR proposes DeepEncoder, a novel architecture that synthesizes the strengths of these approaches while mitigating their weaknesses. It uses window attention (like SAM) on the initial high-resolution patches to keep memory low, then uses a convolutional compressor to drastically reduce the number of tokens, and finally feeds these few, information-rich tokens into a powerful global attention module (CLIP). This hybrid design achieves high-resolution processing, low memory usage, and a minimal number of final vision tokens.

4. Methodology (Core Technology & Implementation)

  • Principles: The core idea is to treat a document image as a compressed representation of its text. The DeepEncoder acts as a compressor, converting a high-dimensional image into a low-dimensional set of vision tokens. The MoE Decoder acts as a decompressor, reconstructing the original text from these vision tokens.

  • Steps & Procedures: The overall architecture is shown in Figure 3. The process is as follows:

    1. An input image (e.g., a document page) is fed into the DeepEncoder.

    2. DeepEncoder Stage 1 (Perception): A SAM-base encoder, which is dominated by efficient window attention, processes the high-resolution image. For a 1024×1024 image, it creates 4096 initial patch tokens. This stage is responsible for low-level feature extraction.

    3. DeepEncoder Stage 2 (Compression): A 2-layer convolutional module downsamples the tokens by a factor of 16x. The 4096 tokens from the previous step are compressed down to just 256 tokens. This is the key step for managing token count.

    4. DeepEncoder Stage 3 (Knowledge): The 256 compressed tokens are fed into a CLIP-large encoder, which uses expensive but powerful global attention. Since it only operates on a small number of tokens, it remains computationally feasible. This stage extracts high-level semantic knowledge.

    5. Decoding: The final vision tokens from DeepEncoder are passed to the DeepSeek-3B-MoE decoder, which autoregressively generates the text content of the document (a shape-level walkthrough of this token flow appears below).

      Figure 8 | For books and articles, the deep parsing mode can output dense captions for natural images in the documents. With just a prompt, the model can automatically identify what type of image it…
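A minimal, shape-level sketch of the token flow through the steps above, using only the numbers quoted there (16×16 patches, 16× convolutional downsampling); the helper names are ours and no model weights are involved.

```python
# Shape-level walkthrough of the DeepEncoder token flow described above.
# Pure arithmetic, no model weights; helper names are ours, while the numbers
# (16x16 patches, 16x conv downsampling, 1024x1024 input) come from the text.

def patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Stage 1 (SAM, window attention): one token per 16x16 image patch."""
    return (height // patch) * (width // patch)

def compressed_tokens(n_patch_tokens: int, ratio: int = 16) -> int:
    """Stage 2: the 2-layer convolutional compressor downsamples tokens 16x."""
    return n_patch_tokens // ratio

h = w = 1024                        # Base-mode resolution
stage1 = patch_tokens(h, w)         # 4096 tokens enter SAM's window attention
stage2 = compressed_tokens(stage1)  # 256 tokens enter CLIP's global attention
print(stage1, "->", stage2)         # 4096 -> 256; the decoder sees only 256 vision tokens
```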

  • Mathematical Formulas & Key Details:

    1. Multiple Resolution Support: To enable experiments with different compression ratios, DeepEncoder is trained to support multiple input resolutions, as shown in Figure 4 and detailed in Table 1.

     Figure 9 | DeepSeek-OCR in deep parsing mode can also recognize chemical formulas within documents; OCR 1.0+2.0 technology may play a significant role in the development of VLM/LLM in STEM fields.

    This is a manual transcription of Table 1 from the paper.

     | Mode | Tiny | Small | Base | Large | Gundam | Gundam-M |
     | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
     | Resolution type | Native | Native | Native | Native | Dynamic | Dynamic |
     | Resolution | 512 | 640 | 1024 | 1280 | n×640 + 1024 | n×1024 + 1280 |
     | Tokens | 64 | 100 | 256 | 400 | n×100 + 256 | n×256 + 400 |
     | Process | resize | resize | padding | padding | resize + padding | resize + padding |

     For modes with padding (Base, Large), not all vision tokens correspond to actual image content. The number of valid tokens is calculated using the following formula (a worked example in code follows the symbol list below): $N_{valid} = \left\lceil N_{actual} \times \left[ 1 - \frac{\max(w,h) - \min(w,h)}{\max(w,h)} \right] \right\rceil$

    • Symbol Explanation:
      • $N_{valid}$: The number of vision tokens that correspond to the original image content (excluding padding).
      • $N_{actual}$: The total number of vision tokens for the padded square image (e.g., 256 for Base mode).
      • $w$, $h$: The width and height of the original, unpadded input image.
      • The term inside the square brackets calculates the aspect-ratio-based proportion of the padded square covered by the image.
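A small sketch of the valid-token formula above, with an illustrative worked example; the 800×1000 page size is an assumption, not a figure from the paper.

```python
# The valid-token formula above in code; the 800x1000 page size is illustrative.
import math

def valid_tokens(n_actual: int, w: int, h: int) -> int:
    """N_valid = ceil(N_actual * [1 - (max(w,h) - min(w,h)) / max(w,h)])."""
    long_side, short_side = max(w, h), min(w, h)
    return math.ceil(n_actual * (1 - (long_side - short_side) / long_side))

# Worked example: Base mode pads to a 1024x1024 square (256 total tokens), but for an
# 800x1000 page only ~80% of the square is real content.
print(valid_tokens(256, w=800, h=1000))  # 205 of the 256 tokens are valid
```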

     2. Decoder Reconstruction: The decoder's job is to learn a mapping $f_{dec}$ from the compressed vision tokens back to the original text representation: $\hat{\mathbf{X}} = f_{\mathrm{dec}}(\mathbf{Z}) \quad \text{where } n \le N$

    • Symbol Explanation:
      • $\mathbf{Z}$: The set of $n$ compressed vision tokens from the DeepEncoder.
      • $\hat{\mathbf{X}}$: The reconstructed sequence of $N$ text tokens.
      • $f_{dec}$: The non-linear transformation learned by the LLM decoder.
      • $n \le N$: The number of vision tokens is at most the number of original text tokens, indicating compression.

    3. Data Engine: The model is trained on a diverse mix of data to ensure robust capabilities.

    • OCR 1.0 Data: 30M pages of multilingual documents and 20M scene text images. Annotations range from "coarse" (raw text extraction) to "fine" (detailed layout and text information), as shown in Figure 5.

      Figure 10 | DeepSeek-OCR also possesses the capability to copy (structure) simple planar geometric figures. Due to the intricate interdependencies among line segments in geometric shapes, parsing geo…

     • OCR 2.0 Data: Specialized data for parsing complex structures. This includes 10M charts (converted to HTML tables), 5M chemical formulas (from SMILES strings), and 1M plane geometry figures. Figure 6 shows examples of the ground truth format for charts and geometry (an illustrative sketch of such ground-truth pairs appears after this list).


      Figure 11 | To endow the capability of processing widely crawled PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages. Minority language documents can also supp… The figure shows multilingual text detection, layout segmentation, and structured recognition: the original image, the color-coded layout segmentation, and the structured text output.

    • General Vision Data: Data for tasks like image captioning and object detection, making up 20% of the training mix to retain general VLM capabilities.

    • Text-only Data: 10% of the training data is text-only to maintain the model's core language abilities.
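The paper does not reproduce the exact ground-truth schema, so the snippet below is only an illustrative guess at what OCR 2.0 image/label pairs could look like: a chart labeled with an HTML table and a chemical structure labeled with a SMILES string (aspirin here). The file names and table contents are invented.

```python
# Illustrative only (not the paper's actual schema): plausible OCR 2.0 ground-truth pairs.
# Charts are labeled with plain HTML tables; chemical structures with SMILES strings.

chart_sample = {
    "image": "charts/revenue_example.png",   # hypothetical rendered chart image
    "label": (
        "<table>"
        "<tr><th>Quarter</th><th>Revenue</th></tr>"
        "<tr><td>Q1</td><td>120</td></tr>"
        "<tr><td>Q2</td><td>135</td></tr>"
        "<tr><td>Q3</td><td>150</td></tr>"
        "</table>"
    ),
}

chem_sample = {
    "image": "chem/aspirin.png",             # hypothetical rendered structure image
    "label": "CC(=O)OC1=CC=CC=C1C(=O)O",     # SMILES string for aspirin
}

print(chart_sample["label"])
print(chem_sample["label"])
```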

    4. Training Pipelines: Training is a two-stage process:

    1. Train DeepEncoder: The encoder is first trained independently with a compact language model to learn how to produce useful visual representations from the OCR and general vision data.
    2. Train DeepSeek-OCR: The pre-trained DeepEncoder is connected to the DeepSeek-3B-MoE decoder. The entire model is then trained on the full data mix. During this stage, the early parts of the encoder (SAM) are frozen, and the rest of the model (CLIP, Decoder) is fine-tuned. The training is done at scale using 20 nodes (160 A100 GPUs) with both pipeline and data parallelism.
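A minimal PyTorch-style sketch of the stage-2 recipe just described (SAM frozen, CLIP and decoder fine-tuned). The module names and sizes are placeholders; this is not the released DeepSeek-OCR training code.

```python
# Minimal PyTorch sketch of stage 2: freeze the SAM part of DeepEncoder,
# fine-tune CLIP and the MoE decoder. Module names/sizes are placeholders,
# not the released DeepSeek-OCR implementation; the learning rate is illustrative.
import torch
import torch.nn as nn

class ToyDeepSeekOCR(nn.Module):
    def __init__(self):
        super().__init__()
        self.sam_encoder = nn.Linear(768, 768)    # stand-in for SAM (window attention)
        self.clip_encoder = nn.Linear(768, 1024)  # stand-in for CLIP (global attention)
        self.decoder = nn.Linear(1024, 32000)     # stand-in for the DeepSeek-3B-MoE decoder

model = ToyDeepSeekOCR()

# Freeze the early (SAM) part of the encoder, as in stage 2.
for p in model.sam_encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```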

5. Experimental Setup

  • Datasets:

    • Fox Benchmark: Used for the vision-text compression study. It contains English documents with diverse layouts. The authors selected 100 pages with 600-1300 text tokens for their analysis.
    • OmniDocBench: A comprehensive benchmark for evaluating document parsing performance across multiple languages (Chinese, English) and content types (text, formulas, tables, layout order).
  • Evaluation Metrics:

    1. Precision: Used in the compression study.
      • Conceptual Definition: Precision measures the accuracy of the recognized characters. It answers the question: "Of all the characters the model output, what fraction was correct?" A higher precision indicates fewer "hallucinated" or incorrect characters.
      • Mathematical Formula: The paper does not provide a formula, but the standard definition of character-level precision is $\text{Precision} = \frac{\text{Number of Correctly Recognized Characters}}{\text{Total Number of Recognized Characters}}$ (a reference sketch of both metrics appears in code after this list).
      • Symbol Explanation: This is a simplified view. In practice, OCR evaluation often uses metrics that account for insertions, deletions, and substitutions.
    2. Edit Distance (Normalized): Used in OmniDocBench.
      • Conceptual Definition: Edit distance (specifically Levenshtein distance) measures the difference between the model's output text and the ground-truth text. It is the minimum number of single-character insertions, deletions, or substitutions required to change the output into the ground truth. In benchmarks like OmniDocBench, this distance is typically normalized by the length of the ground-truth text. A lower value is better.
      • Mathematical Formula: The canonical Levenshtein distance between two strings $a$ and $b$ (with lengths $|a|$ and $|b|$) is defined recursively: $\operatorname{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0, \\ |b| & \text{if } |a| = 0, \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\ 1 + \min \begin{cases} \operatorname{lev}(\operatorname{tail}(a), b) \\ \operatorname{lev}(a, \operatorname{tail}(b)) \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) \end{cases} & \text{otherwise.} \end{cases}$
      • Symbol Explanation: tail() refers to the rest of the string after the first character. The three cases in the min function correspond to deletion, insertion, and substitution, respectively. The benchmark score is this value divided by the ground truth length.
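For reference, a minimal implementation of both metrics described above. The precision function follows the simplified character-level formula given in the text; real OCR evaluators typically align the prediction and ground truth before scoring.

```python
# Reference implementations of the two metrics above. The precision definition
# follows the simplified character-level formula in the text; production OCR
# evaluation pipelines usually align the strings first.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """OmniDocBench-style score: edit distance / ground-truth length (lower is better)."""
    return levenshtein(pred, gt) / max(len(gt), 1)

def char_precision(pred: str, gt: str) -> float:
    """Simplified character precision: approximated as 1 - errors/|pred|."""
    if not pred:
        return 0.0
    return max(0.0, 1.0 - levenshtein(pred, gt) / len(pred))

print(normalized_edit_distance("DeepSeek-0CR", "DeepSeek-OCR"))  # 1 edit / 12 chars ≈ 0.083
print(char_precision("DeepSeek-0CR", "DeepSeek-OCR"))            # ≈ 0.917
```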
  • Baselines: The model is compared against a wide range of baselines on OmniDocBench, including:

    • Pipeline Models: Traditional systems that use separate models for different sub-tasks (e.g., Marker, MinerU-2.1.1, PPstructure-v3).
    • End-to-end Models: Unified VLM models that perform OCR directly (e.g., Nougat, InternVL2-76B, Qwen2.5-VL-7B, GOT-OCR2.0).
    • Proprietary Models: Powerful closed-source models (e.g., GPT4o, Gemini2.5-Pro).

6. Results & Analysis

  • Core Results:

    1. Vision-text Compression Study: Table 2 shows the core results of the compression experiment on the Fox benchmark.

    This is a manual transcription of Table 2 from the paper.

     | Text Tokens | Precision (64 vision tokens) | Compression (64) | Precision (100 vision tokens) | Compression (100) | Pages |
     | :--- | :---: | :---: | :---: | :---: | :---: |
     | 600-700 | 96.5% | 10.5x | 98.5% | 6.7x | 7 |
     | 700-800 | 93.8% | 11.8x | 97.3% | 7.5x | 28 |
     | 800-900 | 83.8% | 13.2x | 96.8% | 8.5x | 28 |
     | 900-1000 | 85.9% | 15.1x | 96.8% | 9.7x | 14 |
     | 1000-1100 | 79.3% | 16.5x | 91.5% | 10.6x | 11 |
     | 1100-1200 | 76.4% | 17.7x | 89.8% | 11.3x | 8 |
     | 1200-1300 | 59.1% | 19.7x | 87.1% | 12.6x | 4 |

    • Analysis: The results are remarkable. With 100 vision tokens (Small mode), the model maintains over 96% precision up to a ~10x compression ratio. Performance degrades gracefully as the compression ratio increases. Even at a nearly 20x compression ratio (1200+ text tokens vs. 64 vision tokens), the model retains ~60% precision. This provides strong empirical evidence for the viability of optical context compression. This is also visualized in Figure 1(a).

      Figure 1 | Figure (a) shows the compression ratio (number of text tokens in ground truth / number of vision tokens the model used) on the Fox [21] benchmark; Figure (b) shows performance comparisons on…

      2. OCR Practical Performance on OmniDocBench: Table 3 presents the performance against other state-of-the-art models.

    This is a manual transcription of Table 3 from the paper. Scores are error rates (lower is better), reported separately for English and Chinese documents.

    | Model | Tokens | Overall (EN) | Text (EN) | Formula (EN) | Table (EN) | Order (EN) | Overall (ZH) | Text (ZH) | Formula (ZH) | Table (ZH) | Order (ZH) |
    | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
    | Pipeline Models | | | | | | | | | | | |
    | Dolphin [11] | - | 0.356 | 0.352 | 0.465 | 0.258 | 0.35 | 0.44 | 0.44 | 0.604 | 0.367 | 0.351 |
    | Marker [1] | - | 0.296 | 0.085 | 0.374 | 0.609 | 0.116 | 0.497 | 0.293 | 0.688 | 0.678 | 0.329 |
    | Mathpix [2] | - | 0.191 | 0.105 | 0.306 | 0.243 | 0.108 | 0.364 | 0.381 | 0.454 | 0.32 | 0.30 |
    | MinerU-2.1.1 [34] | - | 0.162 | 0.072 | 0.313 | 0.166 | 0.097 | 0.244 | 0.111 | 0.581 | 0.15 | 0.136 |
    | MonkeyOCR-1.2B [18] | - | 0.154 | 0.062 | 0.295 | 0.164 | 0.094 | 0.263 | 0.179 | 0.464 | 0.168 | 0.243 |
    | PPstructure-v3 [9] | - | 0.152 | 0.073 | 0.295 | 0.162 | 0.077 | 0.223 | 0.136 | 0.535 | 0.111 | 0.11 |
    | End-to-end Models | | | | | | | | | | | |
    | Nougat [6] | 2352 | 0.452 | 0.365 | 0.488 | 0.572 | 0.382 | 0.973 | 0.998 | 0.941 | 1.00 | 0.954 |
    | SmolDocling [25] | 392 | 0.493 | 0.262 | 0.753 | 0.729 | 0.227 | 0.816 | 0.838 | 0.997 | 0.907 | 0.522 |
    | InternVL2-76B [8] | 6790 | 0.44 | 0.353 | 0.543 | 0.547 | 0.317 | 0.443 | 0.29 | 0.701 | 0.555 | 0.228 |
    | Qwen2.5-VL-7B [5] | 3949 | 0.316 | 0.151 | 0.376 | 0.598 | 0.138 | 0.399 | 0.243 | 0.5 | 0.627 | 0.226 |
    | OLMOCR [28] | 3949 | 0.326 | 0.097 | 0.455 | 0.608 | 0.145 | 0.469 | 0.293 | 0.655 | 0.652 | 0.277 |
    | GOT-OCR2.0 [38] | 256 | 0.287 | 0.189 | 0.360 | 0.459 | 0.141 | 0.411 | 0.315 | 0.528 | 0.52 | 0.28 |
    | OCRFlux-3B [3] | 3949 | 0.238 | 0.112 | 0.447 | 0.269 | 0.126 | 0.349 | 0.256 | 0.716 | 0.162 | 0.263 |
    | GPT4o [26] | - | 0.233 | 0.144 | 0.425 | 0.234 | 0.128 | 0.399 | 0.409 | 0.606 | 0.329 | 0.251 |
    | InternVL3-78B [42] | 6790 | 0.218 | 0.117 | 0.38 | 0.279 | 0.095 | 0.296 | 0.21 | 0.533 | 0.282 | 0.161 |
    | Qwen2.5-VL-72B [5] | 3949 | 0.214 | 0.092 | 0.315 | 0.341 | 0.106 | 0.261 | 0.18 | 0.434 | 0.262 | 0.168 |
    | dots.ocr [30] | 3949 | 0.182 | 0.137 | 0.320 | 0.166 | 0.182 | 0.261 | 0.229 | 0.468 | 0.160 | 0.261 |
    | Gemini2.5-Pro [4] | - | 0.148 | 0.055 | 0.356 | 0.13 | 0.049 | 0.212 | 0.168 | 0.439 | 0.119 | 0.121 |
    | MinerU2.0 [34] | 6790 | 0.133 | 0.045 | 0.273 | 0.15 | 0.066 | 0.238 | 0.115 | 0.506 | 0.209 | 0.122 |
    | dots.ocr†200dpi [30] | 5545 | 0.125 | 0.032 | 0.329 | 0.099 | 0.04 | 0.16 | 0.066 | 0.416 | 0.092 | 0.067 |
    | DeepSeek-OCR (end2end) | | | | | | | | | | | |
    | Tiny | 64 | 0.386 | 0.373 | 0.469 | 0.422 | 0.283 | 0.361 | 0.307 | 0.635 | 0.266 | 0.236 |
    | Small | 100 | 0.221 | 0.142 | 0.373 | 0.242 | 0.125 | 0.284 | 0.24 | 0.53 | 0.159 | 0.205 |
    | Base | 256 (182) | 0.137 | 0.054 | 0.267 | 0.163 | 0.064 | 0.24 | 0.205 | 0.474 | 0.1 | 0.181 |
    | Large | 400 (285) | 0.138 | 0.054 | 0.277 | 0.152 | 0.067 | 0.208 | 0.143 | 0.461 | 0.104 | 0.123 |
    | Gundam | 795 | 0.127 | 0.043 | 0.269 | 0.134 | 0.062 | 0.181 | 0.097 | 0.432 | 0.089 | 0.103 |
    | Gundam-M†200dpi | 1853 | 0.123 | 0.049 | 0.242 | 0.147 | 0.056 | 0.157 | 0.087 | 0.377 | 0.08 | 0.085 |
    • Analysis: DeepSeek-OCR demonstrates exceptional performance-per-token.
      • The Small mode (100 tokens) outperforms GOT-OCR2.0 (256 tokens).
      • The Gundam mode (795 tokens) outperforms MinerU2.0 (6790 tokens) and is competitive with the best proprietary models like Gemini2.5-Pro and dots.ocr.
      • This shows that the DeepEncoder is highly effective at creating information-dense vision tokens, validating the architecture's design.

    3. Performance by Document Type: Table 4 breaks down performance by document category, showing which modes are suitable for which tasks.

    This is a manual transcription of Table 4 from the paper.

     | Mode / Type | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
     | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
     | Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.94 | 0.32 |
     | Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
     | Base | 0.037 | 0.08 | 0.027 | 0.1 | 0.13 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
     | Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.06 | 0.053 | 0.155 | 0.353 | 0.117 |
     | Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
     | Gundam-M | 0.052 | 0.09 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.1 | 0.099 | 0.077 |
    • Analysis: Simpler documents like Slides, Books, and Financial Reports achieve excellent performance with low-token modes (Base or even Small). This aligns with the compression study, as these documents likely fall within the <10x compression ratio sweet spot. In contrast, text-dense and complex-layout documents like Newspapers require high-resolution modes (Gundam or Gundam-M) to achieve good results, as the compression ratio would otherwise be too high.
  • Qualitative Study:

    • Deep Parsing: The model can perform "deep parsing," where it first performs layout OCR and then recursively calls itself to parse complex content within the document, such as charts, formulas, or figures (a hypothetical sketch of this two-pass loop follows the figure captions below).

      • Charts: Figure 7 shows the model extracting structured data from a chart in a financial report.

      • Natural Images: Figure 8 demonstrates the model providing captions for natural images found within an article.

      • Chemical Formulas: Figure 9 shows successful recognition of complex chemical structures.

      • Geometry: Figure 10 shows the model can parse and represent simple geometric figures.

        Figure 12 | We retain DeepSeek-OCR's capabilities in general visual understanding, mainly including image description, object detection, grounding, etc. Meanwhile, due to the inclusion of text-only d…

        Figure 13 | Forgetting mechanisms constitute one of the most fundamental characteristics of human memory. The contexts optical compression approach can simulate this mechanism by rendering previous r… The figure relates memory, visual, and textual clarity to time, distance, and rendering resolution, modeling the gradual blurring of older information.

        Figure 2 | Typical vision encoders in popular VLMs. Here are three types of encoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies, such as low input resolution, excessive vision tokens, or heavy computational and activation cost.

        Figure 3 | The architecture of DeepSeek-OCR. DeepSeek-OCR consists of a DeepEncoder and a DeepSeek-3B-MoE decoder. DeepEncoder is the core of DeepSeek-OCR, comprising three components: a SAM [17] for…
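The paper describes deep parsing only at a high level (layout OCR first, then a second pass over embedded elements), so the following is a hypothetical sketch of such a two-pass loop. The `run_ocr` callable, the region schema, and the prompts are invented for illustration and are not the released DeepSeek-OCR API.

```python
# Hypothetical sketch of the two-pass "deep parsing" loop described above:
# pass 1 produces page text plus detected regions, pass 2 re-runs the model on each
# embedded element with a task-specific prompt. `run_ocr`, the region schema, and the
# prompts are invented here for illustration; this is not the released DeepSeek-OCR API.
from typing import Callable

PROMPTS = {
    "chart": "Convert the chart to an HTML table.",
    "formula": "Transcribe the chemical or mathematical formula.",
    "figure": "Write a dense caption for the image.",
    "geometry": "Copy the geometric figure as structured drawing commands.",
}

def deep_parse(page_image, run_ocr: Callable) -> dict:
    # Pass 1: layout OCR over the whole page, e.g. returning
    # {"text": ..., "regions": [{"type": "chart", "crop": <image>}, ...]}
    layout = run_ocr(page_image, prompt="OCR the page with layout.")
    results = {"text": layout["text"], "elements": []}

    # Pass 2: call the model again on each embedded element.
    for region in layout.get("regions", []):
        prompt = PROMPTS.get(region["type"], "Describe this region.")
        parsed = run_ocr(region["crop"], prompt=prompt)
        results["elements"].append({"type": region["type"], "content": parsed["text"]})
    return results
```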

    • Multilingual Recognition: The model is not limited to English. Figure 11 shows its capability to process documents in other languages, like Arabic and Sinhala, with or without layout information.

      Figure 4 | To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple reso… The figure illustrates the Resize, Padding, and Gundam modes and their corresponding vision token counts.

      Figure 5 | OCR 1.0 fine annotations display. We format the ground truth into an interleaved layout and text format, where each paragraph of text is preceded by the coordinates and label of it in the… All coordinates are normalized to a 1000-unit grid.

    • General Vision Understanding: Figure 12 shows that the model retains general vision capabilities, such as image description, object detection, and grounding, thanks to the inclusion of general vision data during training.


7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents DeepSeek-OCR as a proof-of-concept for "contexts optical compression." It establishes that representing long text as an image can achieve significant compression (10x-20x) while retaining high-to-moderate fidelity. The proposed DeepEncoder architecture is highly efficient and practical, enabling state-of-the-art OCR performance with a fraction of the vision tokens used by competing models. The work opens a promising new research direction for tackling the long-context problem in LLMs.

  • Limitations & Future Work:

    • The authors acknowledge that OCR is just a first step. True "context compression" needs further validation with tasks like needle-in-a-haystack tests on optically compressed text.

    • The paper proposes a futuristic application: simulating a memory forgetting mechanism. As conversation history gets older, the rendered images of that text could be progressively downsized. This would reduce their token cost and naturally blur the information, mimicking how human memory for distant events fades. This idea, visualized in Figure 13, is a key conceptual takeaway for future research (a toy sketch appears at the end of this list).


    • Future work will involve interleaved pretraining on digital and optical text to better integrate this capability into foundational LLMs.
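A toy sketch of the progressive-downscaling idea above: older conversation turns are rendered at a cheaper (blurrier) resolution mode, so they cost fewer vision tokens. The age thresholds are arbitrary assumptions; only the mode/token pairs echo Table 1 of this analysis.

```python
# Toy sketch of the "optical forgetting" idea: older context is rendered at a
# lower-resolution mode and therefore costs fewer vision tokens. The age thresholds
# are arbitrary; the mode/token pairs echo Table 1 of this analysis.

MODES = [  # (mode name, rendering resolution, vision tokens per page)
    ("Large", 1280, 400),
    ("Base", 1024, 256),
    ("Small", 640, 100),
    ("Tiny", 512, 64),
]

def mode_for_age(turns_old: int):
    """Pick a progressively cheaper (blurrier) rendering for older context."""
    if turns_old < 5:
        return MODES[0]
    if turns_old < 20:
        return MODES[1]
    if turns_old < 50:
        return MODES[2]
    return MODES[3]

for age in (1, 10, 30, 100):
    name, res, tokens = mode_for_age(age)
    print(f"{age:>3} turns old -> render at {res}px ({name}), {tokens} vision tokens")
```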

  • Personal Insights & Critique:

    • Novelty: The most significant contribution is the reframing of a classic VLM task (OCR) as a solution to a core LLM problem (long-context efficiency). This cross-domain thinking is innovative.
    • Potential Weaknesses: The optical compression approach is inherently lossy. While it works well for standard text, it might struggle with content where exact formatting, subtle visual cues, or non-textual information is critical. The degradation at a 20x ratio, while impressive, still represents a 40% loss of information, which might be unacceptable for many applications.
    • Future Impact: The concept of an "optical context cache" is powerful. It suggests a hybrid architecture for future LLMs where recent, high-priority context is kept in digital text form, while older, less critical context is moved to a highly compressed optical representation. This could lead to models with theoretically unlimited, yet computationally bounded, context windows. The paper provides a strong empirical foundation and an open-source toolkit for the community to build upon this exciting idea.
