DeepSeek-OCR: Contexts Optical Compression
TL;DR Summary
DeepSeek-OCR uses 2D optical mapping to compress long texts efficiently, achieving 97% OCR accuracy under 10× compression and 60% at 20×, outperforming existing OCR models and enabling large-scale training data generation for LLMs.
Abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
English Analysis
1. Bibliographic Information
- Title: DeepSeek-OCR: Contexts Optical Compression
- Authors: Haoran Wei, Yaofeng Sun, Yukun Li
- Affiliations: DeepSeek-AI
- Journal/Conference: This paper is presented as a technical report. As of the analysis date, it appears to be a preprint (e.g., posted on a service like arXiv) rather than a formally peer-reviewed conference or journal publication.
- Publication Year: The paper references works from 2025, suggesting it was written in 2025.
- Abstract: The paper introduces DeepSeek-OCR, a model designed to explore the compression of long text contexts by mapping them into a 2D optical format (an image). The system consists of a DeepEncoder and a DeepSeek3B-MoE-A570M decoder. The DeepEncoder is engineered to handle high-resolution images while producing a small, manageable number of vision tokens, thus achieving a high compression ratio. Experiments demonstrate that the model can achieve 97% OCR precision at compression ratios under 10× and about 60% accuracy at a 20× ratio. The authors highlight the model's promise for long-context compression research and its practical value, outperforming models like GOT-OCR2.0 and MinerU2.0 on the OmniDocBench benchmark while using significantly fewer vision tokens. The code and models are made publicly available.
- Original Source Link: https://raw.githubusercontent.com/deepseek-ai/DeepSeek-OCR/refs/heads/main/DeepSeek_OCR_paper.pdf (a direct link to the PDF file).
2. Executive Summary
Background & Motivation (Why):
The primary challenge this paper addresses is the computational inefficiency of Large Language Models (LLMs) when processing long sequences of text. The attention mechanism, central to modern LLMs, has a computational cost that scales quadratically with the length of the input sequence, i.e., $O(n^2)$ for a sequence of $n$ tokens. This makes handling very long documents, conversations, or codebases prohibitively expensive in terms of both memory and processing time.
Prior work has focused on algorithmic improvements to the attention mechanism or architectural changes within the text domain. This paper introduces a completely different and novel approach: "contexts optical compression". The core idea is to leverage the visual modality as a highly efficient compression medium. Instead of feeding a long string of text tokens to an LLM, the text is first rendered as an image. A Vision-Language Model (VLM) then "reads" this image. A single image can represent thousands of words using a much smaller number of "vision tokens," potentially breaking the quadratic scaling bottleneck. The paper uses the task of Optical Character Recognition (OCR) as a perfect testbed for this concept, as it provides a direct, quantifiable mapping between a compressed visual input (the image) and a decompressed textual output (the recognized text).
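As a rough, back-of-the-envelope illustration (ours, not a figure from the paper), assume self-attention cost proportional to the square of the sequence length and a 10× optical compression ratio for a 10,000-token document:

$$\frac{\text{attention cost at } 10{,}000 \text{ text tokens}}{\text{attention cost at } 1{,}000 \text{ vision tokens}} \approx \frac{10{,}000^2}{1{,}000^2} = 100$$

That is, a 10× reduction in token count translates into roughly a 100× reduction in attention compute for that sequence.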
Main Contributions / Findings (What):
The paper presents three main contributions:
- Quantitative Proof-of-Concept for Optical Compression: It provides the first comprehensive analysis of the trade-off between vision-token compression ratios and text decoding accuracy. It shows that near-lossless decoding is possible at a 10× compression ratio (e.g., decoding 1000 text tokens from just 100 vision tokens), and meaningful information is still retained even at a 20× compression ratio.
- A Novel Vision Encoder (DeepEncoder): To enable this high compression, the authors designed a new vision encoder architecture. DeepEncoder can process high-resolution images while keeping both its internal memory usage (activations) low and the final number of output vision tokens minimal. It cleverly combines a local-feature extractor (SAM) with a global-feature extractor (CLIP), bridged by a convolutional compressor.
- A State-of-the-Art Practical OCR Model (DeepSeek-OCR): The complete model, DeepSeek-OCR, which pairs the DeepEncoder with a DeepSeek3B-MoE decoder, achieves state-of-the-art performance among end-to-end models on the OmniDocBench benchmark. Critically, it does so while using an order of magnitude fewer vision tokens than comparable high-performance models, demonstrating the practical efficiency of its design.
3. Prerequisite Knowledge & Related Work
Foundational Concepts
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT series) trained on vast amounts of text data to understand, generate, and reason about human language. Their performance is often limited by the length of the text they can process at once, known as the "context window."
- Vision-Language Models (VLMs): VLMs are models that can process both visual data (images) and text. They can perform tasks like describing an image, answering questions about it, or, in this case, reading text within an image.
- Optical Character Recognition (OCR): The technology used to convert images containing typed, handwritten, or printed text into machine-encoded text. Traditional OCR involves pipelines of detection and recognition, whereas modern end-to-end models do it in one step.
- Tokens: In machine learning, tokens are the fundamental units of data. For LLMs, a token is typically a word or sub-word. For VLMs, an image is broken down into a grid of patches, and each patch is converted into a vision token. The core idea of this paper is to represent many text tokens with few vision tokens.
- Mixture of Experts (MoE): An efficient neural network architecture. Instead of using one giant network for all tasks, an MoE model contains many smaller "expert" sub-networks. For any given input, only a few relevant experts are activated, saving significant computational cost during inference.
- Attention Mechanisms: The technology that allows models to weigh the importance of different tokens when processing a sequence. Global attention considers all tokens in the sequence, which is powerful but computationally expensive. Window attention only considers tokens within a small local window, which is much more efficient but misses long-range dependencies (see the sketch below).
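To make the difference concrete, here is a minimal NumPy sketch (our own illustration, not from the paper) that builds the attention masks for global versus window attention; the window size of 4 is an arbitrary assumption:

```python
import numpy as np

def global_attention_mask(n: int) -> np.ndarray:
    """Every token may attend to every other token: O(n^2) attended pairs."""
    return np.ones((n, n), dtype=bool)

def window_attention_mask(n: int, window: int = 4) -> np.ndarray:
    """Each token only attends to tokens within +/- `window` positions:
    O(n * window) attended pairs, which stays cheap as n grows."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n = 16
print(global_attention_mask(n).sum())   # 256 attended pairs (16 * 16)
print(window_attention_mask(n).sum())   # 124 attended pairs (clipped at the edges)
```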
Previous Works & Technological Evolution
The paper positions its work against existing VLM encoder architectures, highlighting their respective weaknesses.
- Dual-Tower Architectures (e.g., Vary): These use two parallel encoders (e.g., one for general features, one for high-resolution details). The paper notes this approach complicates deployment and training due to the need for dual image preprocessing pipelines.
- Tile-Based Methods (e.g., InternVL2.0): These methods slice a high-resolution image into smaller tiles and process each one separately. While this handles large images, it can lead to excessive fragmentation and a very large number of vision tokens, which slows down the language model decoder.
- Adaptive Resolution Encoders (e.g., Qwen2-VL, based on NaViT): These encoders can flexibly handle images of any aspect ratio. However, for very large images, their memory consumption (activations) can become unmanageable, and packing sequences of different lengths during training is complex.

Figure 2 illustrates these three common but flawed approaches.
(Figure 2 caption, translated): A diagram showing the structures of three typical vision encoders used in current mainstream open-source VLMs and their respective drawbacks, covering issues of resolution, number of vision tokens, and inference speed.
In the domain of end-to-end OCR, models like Nougat and GOT-OCR2.0 have simplified traditional OCR pipelines. However, the authors argue that these works have not addressed a fundamental question: "for a document containing 1000 words, how many vision tokens are at least needed for decoding?" DeepSeek-OCR is designed specifically to answer this.
Differentiation
DeepSeek-OCR's key innovation is the DeepEncoder, which synthesizes the strengths of previous approaches while mitigating their weaknesses. It serially connects a low-memory window-attention component (for initial high-resolution processing) with a powerful global-attention component (for holistic understanding). The critical element is a convolutional compressor placed between them, which drastically reduces the number of tokens before they reach the computationally expensive global attention layers. This achieves the goals of handling high resolutions with low memory usage and producing a minimal number of vision tokens.
4. Methodology (Core Technology & Implementation)
The DeepSeek-OCR model has a standard VLM architecture: a vision encoder that processes an image and a language decoder that generates text based on the visual information.
(Figure 3 caption, translated): A schematic of the DeepSeek-OCR architecture, including a SAM-based tokenizer with local (window) attention, a 16× convolutional downsampler that produces the vision tokens, a CLIP component with global attention, and the DeepSeek-3B decoder that generates the output.
Principles & Steps
As shown in Figure 3, the pipeline works as follows:
- Input: An image containing text is fed into the system.
- DeepEncoder Processing:
  - A SAM-base encoder, which uses efficient window attention, processes the high-resolution image first. For a 1024×1024 image, it generates a large number of initial patch tokens (e.g., 4096). Because it is a smaller model (80M parameters) using window attention, the memory activation is manageable.
  - These tokens then pass through a Token Compressor: a simple 2-layer convolutional module that downsamples the tokens, reducing their count by a factor of 16 (e.g., from 4096 to 256); a shape-level sketch of this stage is given after this list.
  - The reduced set of tokens is then fed into a CLIP-large encoder, which uses powerful but expensive dense global attention. Since it now operates on far fewer tokens, the computational cost is kept low.
- DeepSeek-3B-MoE Decoder: The final, compressed vision tokens from the DeepEncoder are passed to the language model decoder. The decoder, an efficient Mixture-of-Experts model, then reconstructs the original text, effectively "decompressing" the visual information.
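The token-count bookkeeping is the heart of the design. The following PyTorch-style sketch mirrors the description above at the shape level only; module names, channel widths, and the exact compressor layout are our assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Illustrative 2-layer conv compressor: two stride-2 convolutions give a
    4x reduction per spatial dimension, i.e., 16x fewer tokens overall."""
    def __init__(self, dim_in: int = 768, dim_out: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens on a square grid -> reshape to (B, C, H, W)
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.net(x)                      # (B, C', H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, N/16, C')

# Shape check for a 1024x1024 input with 16x16 patches:
tokens = torch.randn(1, 64 * 64, 768)   # 4096 patch tokens from the window-attention stage
compressed = TokenCompressor()(tokens)
print(compressed.shape)                 # torch.Size([1, 256, 1024]) -> fed to global attention
```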
Mathematical Formulas & Key Details
The core function of the decoder is to perform this "decompression", mapping a few vision tokens back to many text tokens:

$$\hat{X} = f_{\text{dec}}(Z), \qquad Z = (z_1, \dots, z_n),\quad \hat{X} = (\hat{x}_1, \dots, \hat{x}_N),\quad n \ll N$$

- $Z$ represents the compressed vision tokens from the DeepEncoder, where $n$ is the small number of vision tokens.
- $\hat{X}$ represents the reconstructed text representation, where $N$ is the large number of original text tokens.
- $f_{\text{dec}}$ is the decoder function (the LLM) that learns the non-linear mapping from the compressed visual space back to the textual space.
Multiple Resolution Support
To systematically study compression ratios, the model must support various input resolutions, which produce different numbers of vision tokens. The paper details several modes, achieved through dynamic interpolation of positional encodings.
(Figure caption, translated): A schematic showing how DeepSeek-OCR realizes different compression settings across resolution modes by adjusting the number of vision tokens, covering the Resize, Padding, and Gundam modes, and illustrating the relationship between compression ratio and vision-token configuration in practice.
This is summarized in the manually transcribed table below, based on Table 1 from the paper.
Table 1: Multi-resolution support of DeepEncoder.

| Mode | Tiny | Small | Base | Large | Gundam | Gundam-M |
|---|---|---|---|---|---|---|
| Resolution type | Native | Native | Native | Native | Dynamic | Dynamic |
| Resolution | 512 | 640 | 1024 | 1280 | 640 + 1024 | 1024 + 1280 |
| Vision tokens | 64 | 100 | 256 | 400 | n×100 + 256 | n×256 + 400 |
| Process | resize | resize | padding | padding | resize + padding | resize + padding |
- Native Resolution: Modes like Tiny (64 tokens) and Small (100 tokens) resize the image, which is efficient but may distort it. Base (256 tokens) and Large (400 tokens) pad the image to preserve its aspect ratio.
- When padding, not all vision tokens correspond to the actual image. The number of valid tokens scales with the aspect ratio (see the sketch after this list):

  $$N_{\text{valid}} = \left\lceil N_{\text{actual}} \times \frac{\min(w, h)}{\max(w, h)} \right\rceil$$

  - $N_{\text{valid}}$: The number of vision tokens corresponding to the original image content.
  - $N_{\text{actual}}$: The total number of vision tokens for the padded square image (e.g., 256 for Base mode).
  - $w, h$: The width and height of the original image.
- Dynamic Resolution: Modes like Gundam are for very high-resolution images (e.g., newspapers). They combine a tiled approach (local views) with a down-scaled global view, balancing detail and context.
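As a concrete illustration of the padding bookkeeping, here is a small Python helper (our own sketch; the mode-to-token table comes from Table 1, the formula from the discussion above, and the example page dimensions are arbitrary):

```python
import math

# Vision-token budgets for the native-resolution modes (Table 1).
MODE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}

def valid_vision_tokens(width: int, height: int, mode: str = "Base") -> int:
    """Number of vision tokens that actually cover image content when the
    image is padded to a square before encoding."""
    n_actual = MODE_TOKENS[mode]
    return math.ceil(n_actual * min(width, height) / max(width, height))

# A typical A4-like page with roughly a 1:1.41 aspect ratio in Base mode:
print(valid_vision_tokens(1240, 1754, mode="Base"))  # ~181 of the 256 tokens cover real content
```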
Data Engine & Training
The model was trained on a diverse mix of data:
- OCR 1.0 Data: 30M pages of multilingual PDFs and 3M Word documents. This data includes both "coarse" annotations (text extracted directly) and "fine" annotations with detailed layout information (bounding boxes), as shown in Figure 5.

  (Figure 5 caption, translated): Fine-grained annotation for OCR 1.0 data. The content uses an interleaved layout-and-text format, where each text segment is preceded by its bounding-box coordinates and label, with coordinates normalized to a 1000-unit grid.
- OCR 2.0 Data: Synthetically generated data for complex parsing tasks like charts, chemical formulas, and geometric figures.

  (Figure caption, translated): A rendered chart combining a bar chart and a table, showing the value distributions and percentages of four synthetically generated categories across several items, with the table on the right listing the corresponding values.

  (Figure caption, translated): A geometric diagram showing a polyhedral structure with points labeled E, S, X, and R, depicting spatial relationships and line-segment connections.
- General Vision Data (20%): Standard vision-language data for tasks like captioning and detection to retain general visual understanding.
- Text-only Data (10%): Pure text data to maintain the LLM's language capabilities.

The training was done in two stages on 20 nodes of A100-40G GPUs, first training the DeepEncoder and then the full DeepSeek-OCR model.
5. Experimental Setup
- Datasets:
  - Fox [21] benchmark: Used for the core vision-text compression study. The authors selected 100 English documents from this benchmark.
  - OmniDocBench [27]: A comprehensive benchmark for real-world document parsing, used to evaluate the model's practical OCR performance against other state-of-the-art models.
- Evaluation Metrics:
  - Edit Distance: This is the primary metric for OCR quality.
    - Conceptual Definition: Edit distance (specifically, Levenshtein distance) measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the model's output string into the ground-truth string. A lower edit distance indicates a better, more accurate transcription. It is a very strict metric.
    - Mathematical Formula: The Levenshtein distance $\mathrm{lev}(a, b)$ between two strings $a$ (of length $|a|$) and $b$ (of length $|b|$) is given by:

      $$\mathrm{lev}(a,b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \mathrm{lev}\big(\mathrm{tail}(a), b\big) \\ \mathrm{lev}\big(a, \mathrm{tail}(b)\big) \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) \end{cases} & \text{otherwise} \end{cases}$$

    - Symbol Explanation:
      - $a, b$: The two strings being compared (ground truth and model prediction).
      - $|a|, |b|$: The lengths of the strings.
      - $\mathrm{tail}(x)$: A function that returns the string $x$ without its first character.
      - $a[0], b[0]$: The first character of each string.
      The values reported in the paper are typically normalized by the length of the ground-truth text (a reference implementation sketch follows this list).
  - Precision: In this context, it refers to the accuracy of the decoded OCR text, likely calculated at the character or word level. It is presented as a percentage.
  - Compression Ratio: The ratio of the number of original text tokens to the number of vision tokens used to represent them.
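For reference, a minimal (unoptimized) Python implementation of the normalized edit distance used as an OCR metric might look like the following; this is our own sketch, not the paper's evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(prediction: str, ground_truth: str) -> float:
    """Edit distance divided by the ground-truth length (lower is better)."""
    if not ground_truth:
        return float(bool(prediction))
    return levenshtein(prediction, ground_truth) / len(ground_truth)

print(normalized_edit_distance("DeepSeek-0CR", "DeepSeek-OCR"))  # 1 substitution / 12 chars ≈ 0.083
```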
- Baselines: A wide range of models were used for comparison on OmniDocBench, including:
  - Pipeline Models: Traditional systems that separate layout analysis and OCR, such as Marker, Mathpix, and MinerU-2.1.1.
  - End-to-end Models: Unified VLMs like Nougat, InternVL2-76B, Qwen2.5-VL-7B, GOT-OCR2.0, and proprietary models like GPT4o and Gemini2.5-Pro.
6. Results & Analysis
Core Results: Vision-text Compression Study
The results from the Fox benchmark experiment (Table 2) provide strong evidence for the feasibility of optical compression.
Manual Transcription of Table 2: Vision-text compression ratio on the Fox benchmark.

| Text Tokens | Precision (64 vision tokens) | Compression (64) | Precision (100 vision tokens) | Compression (100) | Pages |
|---|---|---|---|---|---|
| 600-700 | 96.5% | 10.5× | 98.5% | 6.7× | 7 |
| 700-800 | 93.8% | 11.8× | 97.3% | 7.5× | 28 |
| 800-900 | 83.8% | 13.2× | 96.8% | 8.5× | 28 |
| 900-1000 | 85.9% | 15.1× | 96.8% | 9.7× | 14 |
| 1000-1100 | 79.3% | 16.5× | 91.5% | 10.6× | 11 |
| 1100-1200 | 76.4% | 17.7× | 89.8% | 11.3× | 8 |
| 1200-1300 | 59.1% | 19.7× | 87.1% | 12.6× | 4 |
Analysis:
- With 100 vision tokens (Small mode), the model maintains ~97% precision up to a 9.7× compression ratio; the decoding is nearly lossless.
- Performance degrades gracefully as the compression ratio increases. Even at a very high compression ratio of 19.7× (decoding ~1250 text tokens from only 64 vision tokens), the model still achieves ~60% precision, indicating that significant information is preserved.
- These results are visualized in Figure 1(a).

(Figure 1 caption, translated): A two-part chart. Figure 1(a) shows compression ratio versus precision for different text-token counts on the Fox benchmark; Figure 1(b) compares models on OmniDocBench by number of vision tokens versus overall performance (edit distance).
Core Results: OCR Practical Performance
Table 3 shows that DeepSeek-OCR is not just a theoretical model but a highly competitive practical OCR system.
Manual Transcription of Table 3: Performance on OmniDocBench (edit distance, lower is better). Only selected models are transcribed for brevity, focusing on key comparisons; sub-scores are reported separately for English (EN) and Chinese (ZH).

| Model | Tokens | EN overall | EN text | EN formula | EN table | EN order | ZH overall | ZH text | ZH formula | ZH table | ZH order |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GOT-OCR2.0 [38] | 256 | 0.287 | 0.189 | 0.360 | 0.459 | 0.141 | 0.411 | 0.315 | 0.528 | 0.520 | 0.280 |
| MinerU2.0 [34] | 6790 | 0.133 | 0.045 | 0.273 | 0.150 | 0.066 | 0.238 | 0.115 | 0.506 | 0.209 | 0.122 |
| DeepSeek-OCR (end2end), Small | 100 | 0.221 | 0.142 | 0.373 | 0.242 | 0.125 | 0.284 | 0.240 | 0.530 | 0.159 | 0.205 |
| DeepSeek-OCR (end2end), Base | 256 (182) | 0.137 | 0.054 | 0.267 | 0.163 | 0.064 | 0.240 | 0.205 | 0.474 | 0.100 | 0.181 |
| DeepSeek-OCR (end2end), Large | 400 (285) | 0.138 | 0.054 | 0.277 | 0.152 | 0.067 | 0.208 | 0.143 | 0.461 | 0.104 | 0.123 |
| DeepSeek-OCR (end2end), Gundam | 795 | 0.127 | 0.043 | 0.269 | 0.134 | 0.062 | 0.181 | 0.097 | 0.432 | 0.089 | 0.103 |
Analysis:
- Efficiency: DeepSeek-OCR in Small mode (100 tokens) achieves a better overall score (0.221) than GOT-OCR2.0 (0.287), which uses 256 tokens.
- Performance: DeepSeek-OCR in Gundam mode, using only 795 tokens, outperforms MinerU2.0 (0.127 vs. 0.133), which requires nearly 7,000 vision tokens.
- This demonstrates that the DeepEncoder design is highly effective, achieving top-tier performance with a fraction of the vision tokens, which directly translates to faster inference and lower computational costs.
Ablations / Document Type Analysis
Table 4 breaks down performance by document type, revealing which kinds of documents require more visual information.
Manual Transcription of Table 4: Edit distances for different document categories.

| Mode | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.940 | 0.320 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.080 | 0.027 | 0.100 | 0.130 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.060 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam-M | 0.052 | 0.090 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.100 | 0.099 | 0.077 |
Analysis:
- Documents with simple layouts, such as Slides and Financial Reports, achieve excellent performance with very few tokens (Small or Base mode is sufficient).
- Text-dense, complex-layout documents like Newspapers require the high-resolution Gundam or Gundam-M modes to achieve an acceptable edit distance. This is because they contain 4,000-5,000 text tokens, which would exceed the ~10× compression boundary for the smaller modes. This empirically validates the compression limits found in the Fox benchmark study.
Qualitative Study
The paper provides visual examples of DeepSeek-OCR's advanced capabilities.
- Deep Parsing: The model can perform multi-level analysis. After an initial OCR pass, it can be prompted to "deep parse" specific regions, such as extracting structured data from charts (Figure 7), generating captions for images within a document (Figure 8), recognizing chemical formulas (Figure 9), or interpreting geometric diagrams (Figure 10).

  (Figure captions: Figure 7: Deep parsing a chart; Figure 8: Captioning an image in a document; Figure 9: Parsing a chemical formula; Figure 10: Parsing geometry. Translated caption: The figure shows DeepSeek-OCR's deep-parsing mode producing structured output for a chart in a financial research report, comparing the extracted chart with its re-rendered result and highlighting the importance of structured chart extraction for future OCR models.)
- Multilingual Recognition: The model is not limited to English and Chinese. It can process documents in nearly 100 languages, as shown with Arabic and Sinhala examples in Figure 11.

  (Figure caption, translated): A three-page document example showing the original version, a fine-grained block segmentation with color annotations, and the corresponding structured text output, illustrating multi-level document parsing and text-block detection.
- General Vision Understanding: By including general vision data in its training, the model retains capabilities like object detection and grounding, making it a versatile VLM (Figure 12).

  (Figure caption, translated): A collection of six scenes: a math problem on a classroom blackboard, a green plastic container of bean paste, a teacher in a black-and-white comic, an outdoor kite-flying scene, a photo of a fire hydrant with a bow tie, and a white mug labeled "Bountiful Potential", covering education, daily life, and text recognition.
7. Conclusion & Reflections
Conclusion Summary
The authors successfully demonstrate that "contexts optical compression" is a viable and highly promising direction for tackling the long-context problem in LLMs. Their model, DeepSeek-OCR, validates this by achieving near-lossless text decompression at a ~10× compression ratio. The novel DeepEncoder architecture proves to be exceptionally efficient, enabling state-of-the-art OCR performance with significantly fewer computational resources than competing models. The paper concludes that this approach has the potential to facilitate future developments in both VLMs and LLMs and serves as a powerful tool for large-scale data generation.
Limitations & Future Work
The authors acknowledge that OCR is just a first step. To fully validate the concept of context compression for general reasoning, they plan to conduct further research, including:
- Digital-optical text interleaved pretraining: Training a model on a mix of regular text and "optical text" to see if it can seamlessly use both.
- Needle-in-a-haystack testing: A standard test to see if a model can recall a small, specific piece of information from a very long context, which would test the fidelity of the optical compression.
Personal Insights & Critique
This paper presents a genuinely innovative and "out-of-the-box" idea. The most compelling aspect is its simplicity and elegance. Instead of complex algorithmic wizardry, it repurposes an existing modality (vision) to solve a core scaling problem in another (language).
The analogy to human memory and forgetting is particularly insightful. As discussed in the paper and illustrated in Figure 13, the method naturally mimics how memory works: recent, important information is kept at high fidelity (high-resolution image), while older, less critical context can be progressively compressed (by resizing the image), losing detail but retaining the gist, much like how distant memories fade.
(Figure 13 caption, translated): A schematic showing how the clarity of memory, vision, and text degrades with time, distance, and resolution respectively, using light-bulb, eye, and text icons to mark the different information types.
This could be a game-changer for building agents or conversational AI with theoretically unlimited context, as the computational cost of the "memory" would grow sub-linearly.
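To make the forgetting analogy concrete, here is a purely hypothetical Python sketch (ours, not from the paper) of how an agent might budget vision tokens by context age, reusing the mode names and token counts from Tables 1 and 3:

```python
# Hypothetical policy: older context pages are re-rendered at lower resolution,
# so the total "memory" token count grows much more slowly than the raw text.
MODE_TOKENS = {"Gundam": 795, "Base": 256, "Small": 100, "Tiny": 64}

def tokens_for_age(age_in_turns: int) -> int:
    """Recent pages keep high fidelity; distant ones are progressively compressed."""
    if age_in_turns < 2:
        return MODE_TOKENS["Gundam"]
    if age_in_turns < 5:
        return MODE_TOKENS["Base"]
    if age_in_turns < 20:
        return MODE_TOKENS["Small"]
    return MODE_TOKENS["Tiny"]

history_pages = 50  # one rendered page per past turn
budget = sum(tokens_for_age(age) for age in range(history_pages))
print(budget)  # 5778 vision tokens vs. ~50,000 raw text tokens (assuming ~1,000 text tokens per page)
```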
Potential questions and critiques:
- Rendering Overhead: The process requires rendering text to an image. What is the computational cost and latency of this step? For real-time applications, this could be a bottleneck.
- Information beyond Text: The current study focuses on text. How does this approach handle non-textual information like tables, charts, or diagrams that are already visual? The "deep parsing" capability hints at this, but a deeper investigation is needed.
- Error Propagation: If the OCR makes an error during the "decompression" of an old memory, that error could be propagated or even amplified in subsequent reasoning steps. The 60% precision at 20× compression might be too low for tasks requiring high fidelity.
Overall, this is a landmark paper that opens a new and exciting research avenue. It shifts the perspective on long-context processing from a purely linguistic challenge to a multi-modal one, with profound implications for the future architecture of AI systems.