- Title: Glyph: Scaling Context Windows via Visual-Text Compression
- Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, and Minlie Huang.
- Affiliations: The authors are affiliated with Tsinghua University (The Conversational Artificial Intelligence (CoAI) Group and The Knowledge Engineering Group (KEG)) and Zhipu AI. These institutions are renowned for their significant contributions to AI and large language models (e.g., the GLM series).
- Journal/Conference: The paper is available on arXiv as a preprint (ID 2510.17800), meaning it has not yet undergone formal peer review for a conference or journal.
- Publication Year: The arXiv ID indicates an October 2025 submission, and several cited works are dated 2025, so this is a very recent preprint.
- Abstract: The paper addresses the prohibitive computational and memory costs of scaling Large Language Models (LLMs) to million-token context windows. Instead of extending token-based sequences, the authors propose Glyph, a framework that renders long texts into images and processes them with a Vision-Language Model (VLM). This visual-text compression approach significantly reduces the number of input tokens while preserving semantic information. The framework includes an LLM-driven genetic search to find the optimal rendering settings. Experiments show that Glyph achieves 3-4x token compression, leading to ~4x faster inference and ~2x faster training, while maintaining accuracy comparable to leading LLMs such as Qwen3-8B. The method can scale a 128K-context VLM to handle 1M-token text tasks and also benefits multimodal tasks like document understanding.
- Original Source Link: https://arxiv.org/abs/2510.17800
2. Executive Summary
- Foundational Concepts:
- Large Language Model (LLM): A deep learning model trained on vast amounts of text data to understand, generate, and reason about human language. Examples include GPT-4 and LLaMA.
- Vision-Language Model (VLM): An extension of an LLM that can process both text and visual inputs (images, videos). VLMs typically have a visual encoder that converts an image into a sequence of "visual tokens," which are then processed by the language model part.
- Context Window: The maximum number of tokens (pieces of words) that an LLM can take as input at one time. A larger context window allows the model to "remember" and reason over longer documents or conversations.
- Self-Attention: The core mechanism in the Transformer architecture (used by most LLMs) that lets the model weigh the importance of every other token in the input when processing a given token. Its computational cost is quadratic, O(n²), in the sequence length n, which makes it the main bottleneck for long contexts (a short worked example follows this list).
- Tokenization: The process of breaking down raw text into smaller units called tokens. In LLMs, these are typically sub-words. In VLMs, an image is divided into patches, and each patch is converted into a "visual token."
- Optical Character Recognition (OCR): The technology to recognize and extract text from images. VLMs inherently develop OCR-like capabilities to read text present in images.
- Retrieval-Augmented Generation (RAG): A technique where an LLM's knowledge is supplemented by retrieving relevant information from an external database (like a document collection) before generating a response. This avoids feeding the entire document into the context window.
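To see why quadratic self-attention becomes prohibitive at the scales this paper targets, here is a quick back-of-the-envelope calculation (illustrative numbers only):

```latex
% Pairwise attention scores per head grow as n^2:
\[
n = 1.28\times 10^{5}\ \text{(128K tokens)} \;\Rightarrow\; n^{2} \approx 1.6\times 10^{10},
\qquad
n = 10^{6}\ \text{(1M tokens)} \;\Rightarrow\; n^{2} = 10^{12}.
\]
% Roughly 8x more input, but about 60x more attention computation, which is why
% shrinking the token count (as Glyph does) pays off quadratically.
```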
- Previous Works: The paper positions Glyph against three dominant approaches to long-context modeling:
- Positional Encoding Extension: Methods such as YaRN extend rotary position embeddings (RoPE) so that a model trained on a short context (e.g., 4K tokens) can accept much longer inputs (e.g., 128K tokens) with little or no retraining. Limitation: this "stretches" the context but does not reduce the computational cost of processing the longer sequence, and it can degrade performance at extreme lengths.
- Efficient Attention Mechanisms: Techniques like Longformer (sparse attention) and Gated Linear Attention replace quadratic self-attention with linear or near-linear alternatives. Limitation: while cheaper per token, the total cost still scales with the number of tokens, which remains very large for million-token contexts.
- Retrieval-Augmented Approaches: These methods use a retriever to find the most relevant chunks of a long document and feed only those to the LLM. Limitation: the retriever may miss necessary information (and even retrieved context can suffer from the "lost in the middle" effect), and the retrieval step itself adds latency.
- Differentiation: Glyph is fundamentally different. It does not modify the model's architecture or retrieve snippets of text; instead, it changes the input data representation. By rendering text as an image, it leverages the high information density of visual tokens: a single visual token can cover an image patch containing several words, effectively compressing the sequence length before it even enters the model. This makes it an orthogonal approach that could be combined with efficient attention mechanisms for even greater scalability.
4. Methodology (Core Technology & Implementation)
The Glyph framework is a multi-stage process designed to teach a VLM to efficiently understand long texts rendered as images.
Figure 2 (image description): The diagram shows the three stages of the Glyph method: continual pre-training on rendered long-text data, LLM-driven rendering parameter search, and post-training with the best rendering configuration, enabling visual-text compression and efficient long-context modeling.
As shown in Figure 2, the pipeline consists of three stages:
Stage 1: Continual Pre-Training
The goal of this stage is to adapt a base VLM to understand text presented in a visual format. It transfers the model's existing long-context abilities from the text modality to the visual modality.
- Data Construction: Large-scale long-text documents are rendered into images in a wide variety of visual styles. To ensure robustness, the rendering configurations are diverse; the paper defines several style themes such as document_style, web_style, dark_mode, code_style, and artistic_pixel to mimic real-world text appearances (see the rendering sketch after this stage).
- Pre-training Tasks: Three types of tasks are used to train the model:
- OCR Tasks: The model is asked to reconstruct the full text from one or more rendered image pages. This forces it to develop strong, fine-grained text recognition skills.
- Interleaved Language Modeling: The input is a mix of text and rendered images. For example, a paragraph might be shown as an image, followed by a paragraph of regular text. This teaches the model to seamlessly process information across both modalities.
- Generation Tasks: The model is given a portion of a rendered document (e.g., the first few pages) and must generate the content of the missing parts. This develops its ability to understand context and generate coherent continuations in the visual domain.
- Loss Function: The model is trained with a standard cross-entropy loss to maximize the probability of generating the correct target text R:

$$\mathcal{L}_{\mathrm{CPT}} = -\,\mathbb{E}_{(T^{*},\,V,\,R)} \sum_{t} \log P_{\phi}\!\left(r_t \mid T^{*}, V, r_{<t}\right)$$

  - $\phi$: the parameters of the VLM.
  - $T^{*}$: an optional instruction (e.g., "reconstruct the text").
  - $V$: the sequence of rendered images (visual context).
  - $R$: the target response text, which the model generates token by token ($r_t$).
The output of this stage is Glyph-Base, a VLM capable of understanding visually rendered text.
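To ground the idea of rendering, here is a minimal sketch of how long text might be paginated into fixed-size page images with Pillow. The page size, margins, and the crude character-count line wrapping are illustrative assumptions, not the paper's actual renderer, which supports many more style parameters.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_pages(text, dpi=96, page_inches=(8.5, 11), font_size=14,
                 margin=48, line_spacing=4, font_path=None):
    """Paginate `text` into page images; a rough illustrative renderer."""
    page_px = (int(page_inches[0] * dpi), int(page_inches[1] * dpi))
    font = (ImageFont.truetype(font_path, font_size)
            if font_path else ImageFont.load_default())
    line_height = font_size + line_spacing
    # Crude character budget per line; a real renderer measures glyph widths.
    chars_per_line = max(1, (page_px[0] - 2 * margin) // max(1, font_size // 2))

    lines = []
    for para in text.splitlines():
        lines.extend(textwrap.wrap(para, width=chars_per_line) or [""])

    lines_per_page = max(1, (page_px[1] - 2 * margin) // line_height)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        img = Image.new("RGB", page_px, "white")
        draw = ImageDraw.Draw(img)
        y = margin
        for line in lines[start:start + lines_per_page]:
            draw.text((margin, y), line, font=font, fill="black")
            y += line_height
        pages.append(img)
    return pages
```

Lowering the dpi or font size packs more text onto each page, which is exactly the lever the Stage 2 rendering search tunes against accuracy.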
Stage 2: LLM-Driven Rendering Search
The way text is rendered into an image is crucial. Densely packed text offers high compression but may be hard for the VLM to read, hurting accuracy. This stage automates finding the best balance.
- Rendering Configuration: A rendering is defined by a parameter vector θ:

$$\theta = (\mathrm{dpi},\ \mathrm{page\_size},\ \mathrm{font\_family},\ \mathrm{font\_size},\ \mathrm{line\_height},\ \mathrm{alignment},\ \mathrm{indent},\ \mathrm{spacing},\ \mathrm{h\_scale},\ \mathrm{colors},\ \mathrm{borders},\ \dots)$$

  This vector controls everything from resolution (dpi) and font size to layout and color.
- Compression Ratio: The effectiveness of a configuration θ is measured by the compression ratio ρ(θ):

$$\rho(\theta) = \frac{|C|}{\sum_{i=1}^{n} \tau(v_i)}$$

  - $|C|$: the number of tokens in the original text.
  - $\tau(v_i)$: the number of visual tokens produced by the VLM's vision encoder for the $i$-th rendered page $v_i$.
  - A higher ρ means each visual token represents more text tokens, i.e., better compression.
- Genetic Algorithm: A genetic search process is used to find the optimal configuration θ∗:
- Initialization: Start with a population of diverse rendering configurations.
- Evaluation: For each configuration, render a validation dataset and evaluate the Glyph-Base model's accuracy and the resulting compression ratio.
- LLM Analysis & Critique: An external powerful LLM is prompted to act as an analyst. It receives the performance data of the current configurations and suggests promising "mutations" (e.g., "try increasing font size by 2 points") and "crossovers" (e.g., "combine the layout of config A with the font of config B"). This steers the search intelligently.
- Selection: Promising new configurations are generated based on the LLM's suggestions and added to the population for the next iteration.
- Termination: The process repeats until performance plateaus, yielding the best-found configuration θ* (a minimal code sketch of this stage follows below).
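The sketch below ties the pieces of this stage together: a configuration record standing in for θ, the compression ratio ρ(θ), and the skeleton of the LLM-guided search loop. The `evaluate` and `propose_with_llm` callables are placeholders for components the paper only describes at a high level (validation-set scoring and the external LLM critic), so this is an assumed structure rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class RenderConfig:
    dpi: int = 96
    page_size: str = "A4"
    font_family: str = "serif"
    font_size: int = 14
    line_height: float = 1.2
    alignment: str = "left"
    # further fields (indent, spacing, h_scale, colors, borders, ...) omitted

def compression_ratio(num_text_tokens, visual_token_counts):
    """rho(theta) = |C| / sum_i tau(v_i): original text tokens per visual token."""
    return num_text_tokens / sum(visual_token_counts)

def llm_driven_search(init_population, evaluate, propose_with_llm,
                      rounds=5, population_size=8):
    """Skeleton of the genetic search steered by an external LLM critic.

    evaluate(cfg)             -> (accuracy, compression) on a rendered validation set.
    propose_with_llm(history) -> new RenderConfig candidates (mutations / crossovers).
    """
    history = [(cfg, *evaluate(cfg)) for cfg in init_population]
    for _ in range(rounds):
        candidates = propose_with_llm(history)            # LLM analysis & critique
        history += [(cfg, *evaluate(cfg)) for cfg in candidates]
        # Selection: keep configurations with the best accuracy, then compression.
        history.sort(key=lambda rec: (rec[1], rec[2]), reverse=True)
        history = history[:population_size]
    return history[0][0]                                  # best-found theta*
```

In practice the stopping rule would watch for an accuracy plateau rather than run a fixed number of rounds.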
Stage 3: Post-Training
Using the optimal rendering configuration θ* found in the previous stage, the Glyph-Base model is further fine-tuned for high performance on downstream tasks.
- Supervised Fine-Tuning (SFT): The model is trained on a high-quality instruction-following dataset where the long-text contexts are rendered using θ∗. The responses are structured in a "thinking-style" format (e.g., <think>...</think>), encouraging the model to perform explicit step-by-step reasoning.
- Reinforcement Learning (RL): After SFT, the model is refined using Group Relative Policy Optimization (GRPO), a variant of PPO.
- For a given input, multiple responses are sampled from the model.
- A reward model (an LLM-as-a-judge) scores each response based on accuracy and format correctness.
- The GRPO objective increases the probability of high-reward responses:

$$\mathcal{J}_{\mathrm{GRPO}}(\phi) = \mathbb{E}_{(T,V)\sim P,\ \{r_i\}_{i=1}^{G}\sim \pi_{\phi_{\mathrm{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\big(w_i A_i,\ \mathrm{clip}(w_i,\,1-\epsilon_l,\,1+\epsilon_h)\,A_i\big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_\phi \,\|\, \pi_{\mathrm{SFT}}\big)\right)\right]$$
- $w_i$: importance sampling weight, the ratio of a response's likelihood under the new versus the old policy.
- $A_i$: the advantage of a response, indicating how much better it is than the group average.
- The min and clip functions form the core PPO-style clipped objective, which prevents the policy from changing too drastically in a single update.
- $\beta D_{\mathrm{KL}}$: a penalty term that keeps the new policy from deviating too far from the SFT model, ensuring stability (a code sketch of this objective follows after this list).
- Auxiliary OCR Alignment: Throughout both SFT and RL, an auxiliary OCR task is included. This task forces the model to maintain its low-level ability to read text accurately from the images, preventing performance degradation on fine-grained details.
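As a concrete reading of the objective above, here is a small PyTorch-style sketch of the per-group GRPO loss. The clipping thresholds, the β value, the group-standardized advantage, and the single-sample KL estimate are common GRPO conventions assumed here for illustration; the paper does not spell out its exact implementation.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_sft, rewards,
                   eps_low=0.2, eps_high=0.2, beta=0.01):
    """GRPO-style objective for G sampled responses to one prompt (a sketch).

    logp_new / logp_old / logp_sft: summed log-probabilities of each response
    under the current policy, the rollout (old) policy, and the SFT reference;
    all tensors of shape [G]. rewards: LLM-judge scores of shape [G].
    """
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    w = torch.exp(logp_new - logp_old)                    # importance weights w_i
    clipped = torch.clamp(w, 1.0 - eps_low, 1.0 + eps_high)
    ppo_term = torch.minimum(w * adv, clipped * adv)      # clipped surrogate
    kl_term = logp_new - logp_sft                         # crude single-sample KL estimate
    return (ppo_term - beta * kl_term).mean()             # quantity to maximize
```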
5. Experimental Setup
6. Results & Analysis
Core Results
The experiments demonstrate that Glyph achieves its goal of compressing context while maintaining strong performance.
Figure 1 (image description): The chart contrasts conventional text-token long-context processing with Glyph's visual-text compression. The top panel sketches the compression pipeline from long text to images; the bottom bar charts show Glyph's performance versus other models on LongBench and MRCR, along with its compression ratio and inference speedup at 128K-token inputs.
Figure 1 provides a high-level summary. The top panel illustrates the core idea: compressing long text into images. The bottom panel shows that Glyph achieves competitive accuracy on LongBench and MRCR while offering a ~3.3x compression ratio and ~4.4x faster decoding.
- LongBench and MRCR Performance:
This is a manual transcription of Table 1 from the paper.
| Model | QP | NQA | HQA | 2QA | QSUM | GovRep | TREC | TriQA | PR Zh | PR En | RB | LCC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 51.60 | 35.73 | 69.10 | 74.15 | 23.50 | 33.36 | 77.00 | 93.36 | 100.00 | 100.00 | 67.94 | 68.43 | 56.03 |
| LLaMA-3.1-8B-Instruct | 44.56 | 26.34 | 56.88 | 46.67 | 23.28 | 32.36 | 19.25 | 89.12 | 62.20 | 99.50 | 42.81 | 46.35 | 41.34 |
| Qwen2.5-7B-Instruct-1M | 45.29 | 25.61 | 60.70 | 40.51 | 22.95 | 29.97 | 59.37 | 86.93 | 98.5 | 100.00 | 29.80 | 21.72 | 42.42 |
| Qwen3-8B | 44.67 | 26.13 | 65.83 | 73.92 | 19.60 | 26.85 | 70.50 | 87.98 | 100.00 | 97.26 | 40.89 | 44.87 | 47.46 |
| GLM-4-9B-Chat-1M | 43.75 | 26.72 | 58.98 | 50.89 | 22.84 | 27.60 | 61.50 | 90.07 | 100.00 | 99.50 | 55.64 | 59.54 | 49.27 |
| Glyph | 40.64 | 28.45 | 66.42 | 72.98 | 19.78 | 25.53 | 82.62 | 88.54 | 89.03 | 99.50 | 60.80 | 48.85 | 50.56 |

Column groups: Single-Doc QA (QP, NQA), Multi-Doc QA (HQA, 2QA), Summarization (QSUM, GovRep), Few-shot (TREC, TriQA), Synthetic (PR Zh, PR En), Code (RB, LCC).
This is a manual transcription of Table 2 from the paper.
| Model | 4N 0k-8k | 4N 8k-16k | 4N 16k-32k | 4N 32k-64k | 4N 64k-128k | 4N Avg | 8N 0k-8k | 8N 8k-16k | 8N 16k-32k | 8N 32k-64k | 8N 64k-128k | 8N Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 50 | 38 | 29 | 42 | 38 | 39.4 | 33 | 26 | 17 | 22 | 19 | 23.4 |
| LLaMA-3.1-8B-Instruct | 33.42 | 25.97 | 22.73 | 26.97 | 12.68 | 24.35 | 23.80 | 17.69 | 19.85 | 17.72 | 11.79 | 18.17 |
| Qwen2.5-7B-Instruct-1M | 25.96 | 20.13 | 19.93 | 24.25 | 17.29 | 21.51 | 17.64 | 19.48 | 12.41 | 14.80 | 14.24 | 15.71 |
| Qwen3-8B | 29.34 | 22.67 | 20.34 | 23.63 | 19.11 | 23.02 | 18.75 | 19.69 | 16.81 | 17.86 | 15.00 | 17.62 |
| GLM-4-9B-Chat-1M | 15.17 | 13.78 | 9.18 | 20.27 | 15.05 | 14.69 | 14.55 | 9.65 | 9.34 | 9.47 | 8.97 | 10.40 |
| Glyph | 35.44 | 26.82 | 24.15 | 25.69 | 16.37 | 25.81 | 25.12 | 21.22 | 16.43 | 13.91 | 13.51 | 18.14 |

4N = 4-Needle setting, 8N = 8-Needle setting; column labels give the context-length bucket.
In Table 1, Glyph (50.56 average) outperforms strong baselines such as Qwen3-8B (47.46) and its own text-based backbone GLM-4-9B-Chat-1M (49.27). This is a remarkable result: visual compression does not cause a significant loss in understanding and can even improve performance in some cases. In Table 2 (MRCR), Glyph is highly competitive, ranking second only to GPT-4.1 in many length buckets.
- Ruler Benchmark and Test-Time Scaling:
This is a manual transcription of Table 3 from the paper.
| Model | Niah-S1 | Niah-S2 | Niah-M1 | Niah-M2 | Niah-V | Niah-Q | VT | CWE | FWE | QA-1 | QA-2 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 100.0 | 98.85 | 100.0 | 100.0 | 99.67 | 100.0 | 100.0 | 97.87 | 98.66 | 86.82 | 77.47 | 96.30 |
| LLaMA-3.1-8B-Instruct | 99.33 | 99.33 | 99.33 | 99.00 | 98.17 | 99.67 | 87.07 | 57.30 | 81.85 | 84.00 | 58.00 | 87.55 |
| Qwen2.5-7B-Instruct-1M | 100.00 | 99.67 | 99.67 | 99.00 | 93.83 | 98.75 | 85.40 | 72.10 | 85.67 | 80.00 | 60.67 | 88.61 |
| Qwen3-8B | 100.00 | 100.00 | 95.33 | 84.67 | 97.42 | 99.33 | 98.47 | 74.67 | 86.67 | 70.33 | 53.33 | 87.29 |
| GLM-4-9B-Chat-1M | 100.00 | 100.00 | 92.67 | 99.00 | 95.00 | 100.00 | 98.20 | 49.50 | 83.22 | 72.67 | 56.67 | 86.08 |
| Glyph (DPI 72; compression avg 4.0, up to 7.7) | 73.33 | 64.67 | 67.33 | 56.00 | 73.42 | 71.42 | 77.93 | 94.40 | 92.67 | 59.33 | 63.33 | 72.17 |
| Glyph (DPI 96; compression avg 2.2, up to 4.4) | 98.00 | 95.33 | 95.67 | 85.00 | 96.33 | 95.83 | 94.93 | 94.80 | 98.00 | 79.00 | 70.67 | 91.23 |
| Glyph (DPI 120; compression avg 1.2, up to 2.8) | 99.67 | 99.00 | 100.00 | 93.67 | 99.00 | 99.58 | 99.33 | 98.97 | 99.11 | 79.00 | 74.00 | 94.67 |
Table 3 is particularly insightful. It shows the performance of Glyph under different rendering resolutions (DPI). At low DPI (72), compression is very high (4.0x on average) but performance suffers (72.17 avg). As DPI increases to 120, the rendered text becomes clearer, compression drops to 1.2x, and performance climbs to 94.67, surpassing most text-only baselines. This demonstrates a clear and controllable trade-off between efficiency and accuracy; a small illustrative helper follows.
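To make the trade-off concrete, here is a tiny helper (purely illustrative) that uses the Table 3 averages to pick the highest-compression DPI setting that still clears a target accuracy:

```python
# (dpi, average compression, Ruler average) taken from Table 3.
SETTINGS = [(72, 4.0, 72.17), (96, 2.2, 91.23), (120, 1.2, 94.67)]

def pick_dpi(min_accuracy):
    """Return the highest-compression setting whose Ruler average clears the floor."""
    feasible = [s for s in SETTINGS if s[2] >= min_accuracy]
    if not feasible:
        raise ValueError("no rendering setting meets the requested accuracy")
    return max(feasible, key=lambda s: s[1])

print(pick_dpi(90))   # -> (96, 2.2, 91.23): ~2.2x compression at >90 average accuracy
```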
- Performance Degradation with Length:
Figure 5 (image description): Accuracy versus sequence length on the Ruler benchmark for several models. All models degrade as sequences grow, but Glyph's curve is flatter and it outperforms the other models at long lengths.
Figure 5 shows that while every model's performance degrades as context length increases, Glyph's curve is flatter: it degrades more slowly than text-only models such as LLaMA-3.1-8B-Instruct. The reason is simple arithmetic: going from 32K to 64K tokens is a 32K-token jump for a text model, but with roughly 3x compression Glyph sees only about a 10.7K-visual-token jump, so the effective sequence grows far more slowly.
Efficiency Evaluation
Figure 4 (image description): Three line charts compare Glyph with its text backbone on relative prefill speed, decoding throughput, and training throughput across sequence lengths, showing a clear speed advantage for Glyph on long sequences.
Figure 4 quantifies the speed benefits. Glyph achieves significant speedups in:
- Prefill Speed: Up to 4.8x faster at 128K tokens. This is the initial processing of the long context.
- Decoding Throughput: Up to 4.4x faster. This is the speed of generating the response.
- SFT Training Throughput: Around 2x faster, which significantly cuts down on training costs.
Cross-Modal Generalization
This is a manual transcription of Table 4 from the paper.
| Model | SP | CP | UA | Acc | F1 |
| --- | --- | --- | --- | --- | --- |
| GLM-4.1V-9B-Base | 36.76 | 23.41 | 21.52 | 29.18 | 28.78 |
| Glyph-Base | 47.91 | 22.24 | 14.80 | 32.48 | 34.44 |
| Glyph | 57.73 | 39.75 | 27.80 | 45.57 | 46.32 |
Table 4 shows that on MMLongBench-Doc, which contains real-world PDFs, the final Glyph model significantly outperforms its base VLM (GLM-4.1V-9B-Base). This indicates that training on rendered plain text generalizes positively to understanding complex, structured documents with mixed text and layouts.
Ablation Study & Analysis
- Configuration Search:
This is a manual transcription of Table 5 from the paper.
| Configuration | LongBench | MRCR | Ruler | Avg. |
| --- | --- | --- | --- | --- |
| Random Config | 41.78 | 15.82 | 65.13 | 40.91 |
| Manual Config | 43.45 | 19.33 | 68.09 | 43.62 |
| Search-based Config | 43.45 | 22.10 | 71.24 | 45.60 |
Table 5 confirms the value of the LLM-driven genetic search: the configuration discovered by the search (Search-based Config) achieves the best average performance, outperforming both randomly sampled and manually designed configurations.
- Auxiliary OCR Tasks:
This is a manual transcription of Table 6 from the paper.
| Model | LongBench | MRCR | Ruler |
| --- | --- | --- | --- |
| Glyph | 50.56 | 26.27 | 72.17 |
| - w/o OCR (in RL) | -1.40 | -2.00 | -0.35 |
| - w/o RL | -7.11 | -4.17 | -0.93 |
| - w/o OCR (in SFT) | -8.12 | -8.42 | -1.23 |
Table 6 shows the performance drop when components are removed. Removing the auxiliary OCR task during SFT (w/o OCR (in SFT)) causes the largest hit (e.g., -8.12 points on LongBench). This highlights that explicitly reinforcing the model's ability to read fine-grained text is crucial for high-level understanding.
- Extreme Compression Exploration:
This is a manual transcription of Table 7 from the paper.
| Model | 2 Needle | 4 Needle | 8 Needle |
| --- | --- | --- | --- |
| GLM-4-9B-Chat-1M | 10.08 | 6.19 | 2.26 |
| Qwen2.5-7B-Instruct-1M | 11.36 | 7.34 | 7.77 |
| Glyph | 9.36 | 7.62 | 7.64 |
Table 7 shows the results of an experiment at roughly 8x compression, which extends the effective context to about 1M tokens. Glyph's performance remains on par with dedicated 1M-context models such as Qwen2.5-7B-Instruct-1M, demonstrating the potential of this approach for scaling to contexts far beyond current limits.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces and validates Glyph, a novel framework for scaling LLM context windows through visual-text compression. By rendering long texts into images and processing them with a VLM, Glyph achieves 3-4x token compression, which translates into significant training and inference speedups. The framework maintains competitive accuracy against leading text-only models, demonstrates effective test-time scaling, and shows positive generalization to real-world multimodal document understanding. The authors argue convincingly that increasing the information density of tokens is a viable and promising alternative to architectural modifications for long-context modeling.
- Limitations & Future Work: The authors acknowledge several limitations:
- Sensitivity to Rendering: Performance is sensitive to rendering parameters like DPI and font. While the search finds a good configuration, making the model robust to any rendering style is an open problem.
- OCR Challenges: The model struggles with rare, complex alphanumeric sequences (like UUIDs), which can be misread. Improving the VLM's core OCR fidelity would raise the performance ceiling.
- Task Diversity: The evaluation focuses on long-context understanding. The model's generalization to other domains, such as complex reasoning or agentic tasks, needs further investigation.
- Future Directions: The paper suggests several exciting avenues: adaptive rendering models that tailor the visualization to the task, improving the visual encoder, better aligning visual-text and text-only models, and applying the framework to agent memory systems.
- Personal Insights & Critique:
- Novelty and Impact: Glyph presents a truly "out-of-the-box" solution to the long-context problem. It cleverly reframes the issue from one of algorithmic complexity to one of data representation. Its orthogonality is a major strength: it can be combined with future advances in attention mechanisms or model architectures to compound efficiency gains.
- The Compression-Fidelity Trade-off: The core of the method is a lossy compression scheme. The key insight is that for many tasks, perfect character-for-character recall matters less than semantic understanding. The adjustable DPI in the Ruler benchmark experiments (Table 3) illustrates this trade-off well: tasks requiring high fidelity can use a lower compression ratio (higher DPI), while tasks where the gist suffices can use a higher compression ratio for maximum speed.
- Practical Implications: This approach could make million-token context models accessible without top-tier, specialized hardware. A model with a native 128K or 256K context window could, through Glyph, handle inputs that would otherwise require a much larger and more expensive model.
- Potential Weaknesses: The reliance on OCR makes the system vulnerable to tasks that depend on perfect character-level accuracy, such as processing code with specific syntax, legal documents where punctuation is critical, or data containing unique identifiers (as noted by the authors). The pre-training and search pipeline, while effective, adds complexity to the model development lifecycle compared to a standard text-only LLM.
- Overall: Glyph is a creative and highly promising contribution to the field. It opens up a new dimension for long-context research, shifting the focus from merely making attention cheaper to making each token "smarter" by packing more information into it.