Wei, H., Sun, Y., & Li, Y. (2025, October 21). DeepSeek-OCR: Contexts Optical Compression. arXiv.
Abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at https://github.com/deepseek-ai/DeepSeek-OCR.
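
The headline numbers are just the ratio of text tokens to vision tokens per page. A minimal back-of-the-envelope sketch in Python makes the trade-off concrete (the page size is invented for illustration; the precision figures are the ones reported in the abstract):

```python
# Back-of-the-envelope for the compression ratios quoted in the abstract.
# The 1000-token page is a hypothetical; the precision notes are the paper's.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens

page_text_tokens = 1000  # hypothetical text-dense page
for vision_tokens, reported in [(100, "~97% OCR precision"),
                                (50, "~60% OCR precision")]:
    r = compression_ratio(page_text_tokens, vision_tokens)
    print(f"{vision_tokens:3d} vision tokens -> {r:4.1f}x compression "
          f"(paper reports {reported})")
```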
Here are some thoughts:
This paper presents a paradigm-shifting perspective by reframing the visual modality in Vision-Language Models (VLMs) not merely as a source of understanding but as a highly efficient compression medium for textual information. The core innovation is the DeepEncoder, a novel architecture that serially combines a window-attention model (SAM) for high-resolution perception with an aggressive convolutional compressor and a global-attention model (CLIP), enabling it to process high-resolution document images while emitting an exceptionally small number of vision tokens (see the first sketch below).

The study provides crucial quantitative evidence for this "optical compression" thesis, demonstrating that DeepSeek-OCR can achieve near-lossless text reconstruction (97% accuracy) at a ~10x compression ratio and still retain about 60% accuracy at a ~20x ratio. Beyond its state-of-the-art practical performance in document parsing, the work provocatively suggests that this mechanism can simulate a computational "forgetting curve" for Large Language Models (LLMs), in which older context is progressively stored as more heavily compressed (lower-resolution) images, mirroring human memory decay (see the second sketch below). This positions the paper as a foundational exploration that opens new avenues for efficient long-context handling and memory management in AI systems.
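
As a first sketch, here is a minimal PyTorch illustration of the token-budget trick described above: the expensive global-attention stage only ever sees tokens that have already passed through a 16x convolutional compressor. This is not the authors' code; the module names, channel width, and input size are hypothetical stand-ins.

```python
# Illustrative only: in the real DeepEncoder, a SAM-style window-attention
# stage would precede this compressor and a CLIP-style global-attention
# stage would follow it. Here we just show the 16x token reduction.
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Two stride-2 convolutions halve each spatial dim twice: 16x fewer tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, dim, H, W)
        return self.net(x)

# Token accounting for a 1024x1024 page with 16x16 patches (hypothetical sizes):
tokens_in = (1024 // 16) ** 2            # 4096 tokens enter the window-attention stage
x = torch.randn(1, 256, 64, 64)          # feature map holding those 4096 tokens
y = ConvCompressor(256)(x)               # -> (1, 256, 16, 16)
tokens_out = y.shape[-2] * y.shape[-1]   # 256 tokens reach the global-attention stage
print(tokens_in, "->", tokens_out)       # 4096 -> 256, a 16x reduction
```

The point of the serial ordering is that window attention (cheap, local) handles the 4096-token activation load, while global attention (quadratic, expensive) only pays for 256 tokens.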
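The second sketch illustrates the forgetting-curve idea, assuming a hypothetical policy that halves the vision-token budget of a stored context image every few dialogue turns. The paper proposes the mechanism; the specific schedule and all numbers below are invented for illustration.

```python
# Hypothetical "forgetting curve": render older turns as images at
# progressively lower resolution, so their vision-token cost (and
# reconstruction fidelity) decays with age, down to some floor.
def token_budget(age_in_turns: int, base_tokens: int = 400, half_life: int = 4) -> int:
    """Halve the vision-token budget every `half_life` turns, floored at 25."""
    return max(25, base_tokens >> (age_in_turns // half_life))

for age in range(0, 17, 4):
    print(f"turn age {age:2d}: ~{token_budget(age):3d} vision tokens")
# turn age  0: ~400, age 4: ~200, age 8: ~100, age 12: ~50, age 16: ~25
```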
