Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
TL;DR Summary
This work introduces a Vision-to-Language Tokenizer to convert images into discrete LLM vocabulary tokens, enabling frozen LLMs to perform visual understanding and image restoration without fine-tuning, validated across diverse tasks.
Abstract
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion, crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
- Authors: Lei Zhu (Peking University), Fangyun Wei (Microsoft Research Asia), Yanye Lu (Peking University).
- Journal/Conference: This paper is an arXiv preprint, a common platform for sharing research before or during the formal peer-review process. It is not yet published in a peer-reviewed conference or journal.
- Publication Year: 2024 (v1 submitted in March 2024).
- Abstract: The paper explores enabling a pre-trained, frozen Large Language Model (LLM) to understand visual data without any fine-tuning. The core idea is to treat an image as a "foreign language" and translate it into discrete tokens from the LLM's own vocabulary. They introduce the Vision-to-Language (V2L) Tokenizer to perform this translation, using an encoder-decoder structure, the LLM's vocabulary, and a CLIP model. This method allows the frozen LLM to perform not only visual understanding tasks (like captioning and VQA) but also image restoration tasks (like inpainting and deblurring) in an auto-regressive manner. The authors validate their approach through extensive experiments.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2403.07874
- PDF Link: http://arxiv.org/pdf/2403.07874v1
- Publication Status: arXiv Preprint.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Large Language Models (LLMs) are exceptionally powerful but inherently operate on text. To make them "multimodal" (able to understand images), the standard approach is to connect them to a vision module and then perform expensive fine-tuning on massive image-text datasets. This process is resource-intensive and creates a new, specialized model.
- Identified Gaps: Previous attempts to bypass fine-tuning by mapping images directly to an LLM's token space struggled to assign semantically meaningful tokens. This "language" was too abstract for the LLM to truly comprehend the image content, limiting its performance.
- Fresh Angle: This paper proposes a novel way to "translate" an image into a more semantically rich "foreign language" that a frozen, off-the-shelf LLM can understand. Instead of aligning representations in the high-dimensional feature space (which requires fine-tuning), they align them in the discrete token space.
-
Main Contributions / Findings (What):
- Vision-to-Language (V2L) Tokenizer: They introduce a novel tokenizer that converts an image into a sequence of discrete tokens from an LLM's vocabulary. This tokenizer is trained while the LLM and its vocabulary remain completely frozen.
- Dual Token Representation: The V2L Tokenizer uniquely generates two types of tokens for each image:
  - Global Tokens: A small set of highly semantic tokens that capture the overall meaning of the image, used for understanding tasks like classification and captioning.
  - Local Tokens: A grid of tokens that represent fine-grained, patch-level details, used for reconstruction and denoising tasks like inpainting.
- Vocabulary Expansion Technique: To improve the semantic quality of global tokens, they propose a method to create meaningful bigrams and trigrams from the LLM's subword vocabulary and filter them using CLIP to form a richer "global codebook."
- Zero Fine-Tuning LLM Performance: The key finding is that by using this sophisticated image-to-token translation, a completely frozen LLM can achieve strong performance on a wide range of multimodal tasks, including comprehension and image generation/restoration, often outperforming prior fine-tuning-free methods.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT, LLaMA) trained on vast amounts of text data. They excel at understanding and generating human-like text. A key characteristic is their auto-regressive nature, meaning they predict the next word (or token) based on the sequence of words they have seen so far. The models used in this paper, like LLaMA-2, are decoder-only Transformers.
- Fine-tuning: A process where a pre-trained model's weights (parameters) are further trained (updated) on a smaller, task-specific dataset. This adapts the general model to a specialized task. This paper's main goal is to avoid this step for the LLM.
- In-context Learning (Few-shot Learning): A remarkable ability of modern LLMs where they can learn a new task simply by being given a few examples (shots) in the prompt, without any weight updates. The LLM recognizes the pattern and applies it to a new query.
- Image Quantization: The process of converting a continuous image (represented by pixel values) into a sequence of discrete integer indices, called tokens. Each token corresponds to an entry in a learned "codebook."
- VQ-VAE (Vector Quantized Variational Autoencoder): An early model that uses an encoder to compress an image into a grid of features and then "quantizes" each feature by replacing it with the closest vector from a learned codebook. A decoder then reconstructs the image from these quantized vectors.
- VQ-GAN (Vector Quantized Generative Adversarial Network): An improvement on VQ-VAE that uses a GAN's discriminator to make the reconstructed images look more realistic and detailed. This paper's tokenizer builds on the VQ-GAN architecture.
- CLIP (Contrastive Language-Image Pre-training): A model from OpenAI trained on millions of image-text pairs. It learns a shared embedding space where a text description and its corresponding image are located close to each other. This allows for measuring the semantic similarity between any image and any piece of text.
- Encoder-Decoder Architecture: A common neural network design pattern. The encoder compresses the input data into a compact representation (a feature vector), and the decoder reconstructs the desired output from this representation.
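To make the CLIP concept concrete, here is a minimal sketch (not code from the paper) that scores an image against a few candidate labels using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the image path and label list are placeholders.

```python
# Minimal CLIP image-text similarity sketch (illustrative; not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder path to any RGB image
labels = ["a French bulldog", "a Boston terrier", "an ice cream"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher scores mean the text is semantically closer to the image in CLIP's shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```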
-
Previous Works & Differentiation: The paper positions its work against two main paradigms for making LLMs see:
-
Feature-Space Alignment (Requires Fine-tuning):
- Examples: Flamingo, BLIP-2, LLaVA, MiniGPT-4.
- How they work: These models keep the vision encoder and the LLM mostly frozen but introduce a small, trainable "bridge" module (e.g., a linear layer or cross-attention layers) between them. This bridge is trained to translate visual features from the vision encoder into a format the LLM can process.
- Limitation: They require expensive fine-tuning on large image-text datasets to train this bridge.
-
Token-Space Alignment (Fine-tuning Free):
- Examples: LQAE, SPAE.
- How they work: These methods are conceptually closer to the current paper. They train a VQ-VAE-style tokenizer to map an image directly into tokens from a frozen LLM's vocabulary. The image is thus treated as a sequence of "foreign words."
- Limitation: They struggle to find semantically meaningful tokens. For example, a picture of a "dog" might be mapped to abstract tokens like "the", "an", or "is" because their vector embeddings happen to be close in the embedding space. This weak semantic link prevents the LLM from truly understanding the visual content.
- This Paper's Innovation (Differentiation):
- The V2L Tokenizer overcomes the limitations of previous token-space methods by using CLIP to guide the quantization. This ensures that the chosen language tokens are semantically relevant to the image content.
- The creation of a separate global codebook with an expanded vocabulary (bigrams/trigrams like "French bulldog") provides highly descriptive tokens for understanding tasks.
- The separation of global tokens (for semantics) and local tokens (for detail) allows the model to excel at both understanding and restoration tasks, a capability prior methods lacked.
-
4. Methodology (Core Technology & Implementation)
The core of the paper is the Vision-to-Language (V2L) Tokenizer. Its goal is to translate an image into a sequence of tokens that a frozen LLM can process.
This figure (the paper's Figure 10) is a schematic of the local encoder and decoder in the V2L Tokenizer, comprising convolutional blocks, residual blocks, non-local blocks, and normalization layers that together encode and reconstruct image features.
-
Principles: The central idea is to view an image as a linguistic entity from a "foreign language" and translate it into the LLM's "native language" (its vocabulary tokens). This is achieved by training a tokenizer that maps visual features to the LLM's token embedding space.
-
Steps & Procedures: The V2L Tokenizer follows an encoder-quantizer-decoder structure, as shown in the diagram above (a more detailed version of Figure 2 from the paper).
-
Codebook Preparation (Pre-computation):
- Local Codebook: This is simply the original vocabulary of the LLM (e.g., LLaMA-2's 32,000 tokens). It's used for capturing fine-grained details.
- Global Codebook: To get more meaningful tokens, the authors create an expanded vocabulary.
- They form bigrams (e.g., "ice cream") and trigrams (e.g., "a red car") from the base vocabulary.
- This creates a massive, noisy set. To clean it, they use a filtering strategy: for each image in ImageNet, they find the top-5 most similar text tokens (unigrams, bigrams, trigrams) using CLIP similarity.
- The final global codebook is the collection of all these top-5 tokens from across the entire ImageNet dataset. This results in a smaller, semantically rich vocabulary of 11,908 items.
- Codebook Embeddings: The text tokens in both codebooks are converted into vector embeddings using a frozen CLIP text encoder. The global codebook embeddings are called E-LLM embeddings, and the local ones are LLM embeddings. The local LLM embeddings are passed through a small trainable projector (a linear layer) to create P-LLM embeddings for better alignment with visual features.
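A minimal sketch of this vocabulary-expansion and CLIP-filtering step, assuming hypothetical helper names (expand_vocabulary, filter_vocabulary, encode_text) and precomputed CLIP image features; the authors' actual pipeline lives in the linked V2L-Tokenizer repository.

```python
# Sketch of the global-codebook construction (hypothetical helpers, not the paper's code).
import torch

def expand_vocabulary(vocab, ngram_tuples):
    """Join LLM subwords into bigram/trigram candidate strings (tuples chosen upstream)."""
    return list(vocab) + [" ".join(t) for t in ngram_tuples]

@torch.no_grad()
def filter_vocabulary(candidates, image_feats, encode_text, top_k=5):
    """candidates: unigram/bigram/trigram strings.
    image_feats: (N, D) CLIP image features of the filtering set (e.g., ImageNet).
    encode_text: callable mapping a list of strings to (M, D) CLIP text features.
    Keeps the union of every image's top-k most similar candidates (the global codebook)."""
    text_feats = encode_text(candidates)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    keep = set()
    for feat in image_feats:                     # one image at a time
        sims = feat @ text_feats.T               # cosine similarities to all candidates
        keep.update(sims.topk(top_k).indices.tolist())
    return [candidates[i] for i in sorted(keep)]
```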
-
Encoding:
- An input image is fed into two parallel encoders:
- A trainable CNN encoder (from VQ-GAN) extracts a grid of local feature vectors.
- A frozen CLIP vision encoder extracts a single global feature vector for the entire image.
-
Quantization:
- Local Quantizer: For each local feature vector in the grid, it finds the nearest P-LLM embedding from the local codebook using Euclidean distance. This produces a grid of quantized embeddings and a corresponding grid of local tokens.
- Global Quantizer: It takes the single global feature vector and finds the nearest E-LLM embeddings from the global codebook. This produces a set of global embeddings and their corresponding global tokens.
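Both quantizers reduce to nearest-neighbor lookups. Below is a minimal sketch with assumed tensor shapes; cosine similarity on the global side is an illustrative choice, not a detail confirmed by the paper.

```python
# Sketch of the local/global quantizers (illustrative; names and shapes are assumptions).
import torch

def quantize_local(features, p_llm_embeddings):
    """features: (B, H, W, D) encoder outputs; p_llm_embeddings: (V, D) projected LLM vocab embeddings.
    Returns quantized embeddings and the index grid of local tokens."""
    flat = features.reshape(-1, features.shape[-1])               # (B*H*W, D)
    dists = torch.cdist(flat, p_llm_embeddings)                   # Euclidean distance to every vocab entry
    tokens = dists.argmin(dim=-1)                                 # nearest LLM token per patch
    quantized = p_llm_embeddings[tokens].reshape(features.shape)  # replace features by codebook vectors
    return quantized, tokens.reshape(features.shape[:-1])

def quantize_global(global_feature, e_llm_embeddings, num_tokens=5):
    """global_feature: (D,) CLIP image embedding; e_llm_embeddings: (G, D) global-codebook embeddings.
    Picks the num_tokens most similar global-codebook entries as the image's global tokens."""
    sims = torch.nn.functional.cosine_similarity(global_feature[None, :], e_llm_embeddings, dim=-1)
    tokens = sims.topk(num_tokens).indices
    return e_llm_embeddings[tokens], tokens
```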
-
Decoding:
- The decoder's job is to reconstruct the original image from the quantized embeddings to ensure the tokens capture enough information.
- It takes the grid of local embeddings as primary input.
- Crucially, it injects the global semantic information from the quantized global embeddings using a cross-attention layer, where the local embeddings provide the queries and the global embeddings provide the keys and values.
- The rest of the decoder architecture follows VQ-GAN, using transposed convolutions to upsample the features back to the original image size.
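A sketch of how such a cross-attention injection could look in PyTorch, assuming local embeddings act as queries and global embeddings as keys/values (consistent with the description above; the module details are illustrative, not the paper's exact architecture).

```python
# Sketch of injecting global semantics into the decoder via cross-attention (illustrative).
import torch
import torch.nn as nn

class GlobalInjection(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_embeddings, global_embeddings):
        """local_embeddings: (B, H*W, D) quantized patch embeddings (queries).
        global_embeddings: (B, K, D) quantized global embeddings (keys/values)."""
        attended, _ = self.attn(query=local_embeddings,
                                key=global_embeddings,
                                value=global_embeddings)
        return self.norm(local_embeddings + attended)  # residual connection

# Usage: the (B, H*W, D) output is reshaped to (B, D, H, W) and fed to the
# VQ-GAN-style upsampling decoder.
```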
-
-
Mathematical Formulas & Key Details: The training process optimizes the encoder, decoder, and projector using the VQ-GAN loss function, $\mathcal{L} = \mathcal{L}_{\mathrm{VQ}} + \lambda_1 \mathcal{L}_{\mathrm{perceptual}} + \lambda_2 \mathcal{L}_{\mathrm{GAN}}$. The codebooks and CLIP models are frozen.
- $\mathcal{L}_{\mathrm{VQ}}$: The vector quantization loss. It encourages the encoder's output features to be close to the chosen codebook embeddings.
- $\mathcal{L}_{\mathrm{perceptual}}$: A perceptual loss that ensures the reconstructed image is perceptually similar to the original, not just mathematically close in pixel values.
- $\mathcal{L}_{\mathrm{GAN}}$: An adversarial loss where a discriminator tries to distinguish between real images and reconstructed ones. This pushes the decoder to generate more realistic images.
- $\lambda_1, \lambda_2$: Hyperparameters to balance the different loss terms.
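A hedged sketch of how these terms might be combined during training; the L1 reconstruction term, the commitment coefficient, and the weights are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative combination of the VQ-GAN-style loss terms (weights are placeholders).
import torch.nn.functional as F

def tokenizer_loss(image, recon, features, quantized, disc_logits_fake,
                   perceptual_fn, lambda_per=1.0, lambda_gan=0.1, beta=0.25):
    # Reconstruction + commitment terms: since the codebook embeddings are frozen,
    # only the encoder is pulled toward its chosen entries via the commitment term.
    l_rec = F.l1_loss(recon, image)
    l_commit = F.mse_loss(features, quantized.detach())
    l_vq = l_rec + beta * l_commit
    l_per = perceptual_fn(recon, image)     # e.g., an LPIPS-style perceptual loss
    l_gan = -disc_logits_fake.mean()        # generator side of the adversarial loss
    return l_vq + lambda_per * l_per + lambda_gan * l_gan
```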
-
Visual Signal Comprehension with a Frozen LLM: Once the V2L Tokenizer is trained, it's used to "translate" images. The resulting tokens are then embedded in prompts for a frozen LLM.
The accompanying illustration shows two elephants standing on a grassland, with distant hills and a clear sky in the background, reflecting a natural wild environment.
- Image Classification: The prompt contains instructions, a few in-context examples, and the test image's global tokens (Input: <global_tokens_test>, output:). The LLM then generates the label text.
- Image Captioning: Similar to classification, but the output is a full sentence (Input: <global_tokens_test>, output:).
- VQA: The prompt includes the image's global tokens as "Condition" and the text question (Condition: <global_tokens_test>. Question: <question_text>. Answer:).
- Image Denoising: These tasks use the local tokens. For tasks like inpainting, the prompt includes examples of corrupted token sequences and their corrected versions, guiding the LLM to fill in the masked/corrupted tokens of the test image.
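As an illustration of this prompting scheme, here is a minimal sketch that assembles a few-shot classification prompt from global-token strings; the template wording and the token strings are placeholders, not the paper's exact prompts.

```python
# Sketch of assembling a few-shot classification prompt for the frozen LLM
# (placeholder template and token strings; the real prompts are in the paper/repo).
def classification_prompt(support_examples, test_global_tokens):
    """support_examples: list of (global_token_string, label) pairs.
    test_global_tokens: the global-token string of the query image."""
    lines = ["Answer with the object category of the input."]
    for tokens, label in support_examples:
        lines.append(f"Input: {tokens}, output: {label}")
    lines.append(f"Input: {test_global_tokens}, output:")
    return "\n".join(lines)

prompt = classification_prompt(
    support_examples=[("French bulldog puppy dog", "bulldog"),
                      ("tabby kitten cat", "cat")],
    test_global_tokens="Boston terrier small dog",
)
print(prompt)  # this string is fed to a frozen LLaMA-2 model for decoding
```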
5. Experimental Setup
-
Datasets:
- Tokenizer Training: ImageNet-1K, a large-scale dataset with 1.28 million images across 1000 categories.
- Few-Shot Classification: Mini-ImageNet, a subset of ImageNet used for benchmarking few-shot learning.
- Captioning/VQA: COCO Caption and VQA v2 datasets.
- Denoising Evaluation: A random subset of 5,000 images from the ImageNet-1K validation set.
-
Evaluation Metrics:
- FID (Fréchet Inception Distance):
- Conceptual Definition: Measures the quality and diversity of generated images compared to a set of real images. It calculates the distance between the feature distributions of real and generated images, as extracted by a pre-trained InceptionV3 network. Lower is better.
- Mathematical Formula: $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$
- Symbol Explanation: $\mu_r$ and $\mu_g$ are the mean feature vectors for real and generated images, respectively. $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of these features. $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
- LPIPS (Learned Perceptual Image Patch Similarity):
- Conceptual Definition: Measures how perceptually similar two images are, mimicking human judgment better than traditional metrics like PSNR. It computes the distance between deep features extracted from different layers of a pre-trained network. Lower is better.
- Mathematical Formula: $d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \lVert w_l \odot (\hat{y}^l_{hw} - \hat{y}^l_{0hw}) \rVert_2^2$
- Symbol Explanation: $d(x, x_0)$ is the distance between images $x$ and $x_0$. $\hat{y}^l$ and $\hat{y}^l_0$ are the feature activations from layer $l$ of a deep network, with spatial size $H_l \times W_l$. $w_l$ are channel-wise weights to scale the importance of different channels. The difference is computed over all spatial locations $(h, w)$.
- PSNR (Peak Signal-to-Noise Ratio):
- Conceptual Definition: A classic metric that measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects its fidelity. In images, it measures reconstruction quality based on pixel-wise differences. Higher is better.
- Mathematical Formula: $\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{\mathrm{MSE}}\right)$
- Symbol Explanation: $MAX_I$ is the maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images). $\mathrm{MSE}$ is the Mean Squared Error between the original and reconstructed images.
- CLIP Score:
- Conceptual Definition: Measures the semantic similarity between an image and a text caption by computing the cosine similarity of their embeddings in CLIP's shared feature space. Higher is better.
- Mathematical Formula: $\mathrm{CLIP}(I, C) = \cos(E_I, E_C) = \frac{E_I \cdot E_C}{\lVert E_I \rVert \, \lVert E_C \rVert}$
- Symbol Explanation: $E_I$ is the image embedding for image $I$, and $E_C$ is the text embedding for caption $C$.
- CLIP-R (Relative CLIP Score):
- Conceptual Definition: The paper uses this to compare the alignment quality between an image and its generated tokens. It likely represents the CLIP score of the image-tokens pair, possibly normalized or compared against a baseline, but the paper does not provide a formula. It's used to show the semantic quality of the generated tokens. Higher is better.
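For concreteness, a minimal sketch of the simpler metrics (PSNR and the CLIP cosine score) consistent with the formulas above; FID and LPIPS are usually computed with existing packages (e.g., torchmetrics, lpips) rather than reimplemented.

```python
# Minimal PSNR and CLIP-score computations matching the formulas above (illustrative).
import torch
import torch.nn.functional as F

def psnr(original, reconstructed, max_val=255.0):
    """original, reconstructed: tensors with pixel values in [0, max_val]."""
    mse = F.mse_loss(reconstructed.float(), original.float())
    return 10.0 * torch.log10(max_val ** 2 / mse)

def clip_score(image_embedding, text_embedding):
    """Cosine similarity between CLIP image and text embeddings (1-D tensors)."""
    return F.cosine_similarity(image_embedding[None, :], text_embedding[None, :]).item()

# Quick usage example with synthetic data.
x = torch.randint(0, 256, (3, 128, 128)).float()
noisy = (x + torch.randn_like(x) * 5).clamp(0, 255)
print(psnr(x, noisy))  # higher is better
```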
-
Baselines:
- VQ-GAN: The foundational image quantization model.
- LQAE & SPAE: The most direct competitors, as they also aim for fine-tuning-free visual comprehension by mapping images to an LLM's token space.
- Frozen: A baseline from a prior work that uses a frozen LLM for multimodal learning.
6. Results & Analysis
-
Core Results:
-
Few-Shot Classification (Table 1): Manually Transcribed Table 1: Few-shot classification on 2-way and 5-way Mini-ImageNet benchmarks.

2-way:

| Method | #Tokens | LLM | 1-shot | 3-shot | 5-shot | Avg | 1-rep | 3-rep | 5-rep | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Frozen [47] | - | - | 33.7 | 66.0 | - | - | 63.0 | 65.0 | 63.7 | 51.3 |
| LQAE [25] | 256 | GPT-3.5 | 35.2 | 68.2 | 69.8 | - | 68.5 | 68.7 | 65.9 | 54.0 |
| SPAE [54] | 5 | PaLM-2 (340B) | 84.0 | 88.5 | 88.4 | 85.1 | 83.6 | 82.4 | - | 77.7 |
| Ours | 5 | LLaMA-2 (7B) | 73.1 | 89.0 | 93.4 | 79.6 | 80.6 | 79.1 | - | 75.6 |
| Ours | 5 | LLaMA-2 (13B) | 77.9 | 91.9 | 94.4 | 81.5 | 82.8 | 82.0 | - | 79.3 |
| Ours | 5 | LLaMA-2 (70B) | 87.1 | 94.8 | 96.1 | 88.9 | 89.2 | 89.1 | - | 83.9 |
| SPAE [54] | 21 | PaLM-2 (340B) | 84.8 | 92.5 | 92.6 | 84.8 | 85.2 | 85.4 | - | 79.0 |
| Ours | 21 | LLaMA-2 (7B) | 76.3 | 91.2 | 95.3 | 84.0 | 84.4 | 83.7 | - | 78.8 |
| Ours | 21 | LLaMA-2 (13B) | 73.1 | 92.4 | 95.7 | 80.9 | 83.8 | 82.0 | - | 79.5 |
| Ours | 21 | LLaMA-2 (70B) | 89.1 | 96.9 | 97.8 | 91.4 | 92.7 | 92.9 | - | 86.7 |

5-way:

| Method | #Tokens | LLM | 1-shot | 3-shot | 5-shot | Avg | 1-rep | 3-rep | 5-rep | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Frozen [47] | - | - | 14.5 | 34.7 | - | - | 33.8 | 33.3 | 32.8 | 26.3 |
| LQAE [25] | 256 | GPT-3.5 | 15.7 | 35.9 | 36.5 | - | 31.9 | 36.4 | 45.9 | 29.0 |
| SPAE [54] | 5 | PaLM-2 (340B) | 64.2 | 68.0 | 69.9 | - | 63.4 | 62.0 | 60.2 | 58.8 |
| Ours | 5 | LLaMA-2 (7B) | 54.6 | 88.6 | 91.1 | - | 70.7 | 72.8 | 74.4 | 69.8 |
| Ours | 5 | LLaMA-2 (13B) | 69.6 | 89.9 | 91.3 | - | 75.8 | 75.7 | 77.2 | 75.0 |
| Ours | 5 | LLaMA-2 (70B) | 81.5 | 92.3 | 93.0 | - | 85.7 | 86.1 | 86.3 | 81.5 |
| SPAE [54] | 21 | PaLM-2 (340B) | 65.1 | 73.7 | 74.3 | - | 66.4 | 67.0 | 66.3 | 61.9 |
| Ours | 21 | LLaMA-2 (7B) | 44.8 | 91.8 | 94.0 | - | 73.9 | 82.2 | 85.3 | 72.7 |
| Ours | 21 | LLaMA-2 (13B) | 62.7 | 93.0 | 94.5 | - | 72.8 | 79.6 | 82.0 | 75.2 |
| Ours | 21 | LLaMA-2 (70B) | 79.7 | 94.9 | 95.6 | - | 89.3 | 90.7 | 90.2 | 83.5 |

Analysis: The authors' method (Ours) consistently outperforms the previous state-of-the-art, SPAE. Notably, their LLaMA-2 70B model surpasses SPAE's much larger PaLM-2 340B model (e.g., 86.7% vs 79.0% on 2-way average, and 83.5% vs 61.9% on 5-way average, with 21 tokens). This demonstrates the superior quality of the generated semantic tokens. Performance also scales with the LLM size and the number of global tokens used.
Semantic Quality (Table 2): Manually Transcribed Table 2: Semantic quality evaluation on ImageNet-1K val set.
| Method | Codebook | #Tokens | CLIP↑ | CLIP-R↑ |
|---|---|---|---|---|
| SPAE [54] | PaLM-2 | 5 | 0.1868 | 0.7147 |
| Ours | E-LLaMA-2 | 5 | 0.2576 | 0.9165 |
| SPAE [54] | PaLM-2 | 21 | 0.1815 | 0.6901 |
| Ours | E-LLaMA-2 | 21 | 0.2427 | 0.8520 |

Analysis: The CLIP and CLIP-R scores are significantly higher for the authors' method. This quantitatively proves that their global tokens, derived from the expanded and filtered vocabulary (E-LLaMA-2), have a much stronger semantic alignment with the image content than those produced by SPAE.
Image Reconstruction (Table 3): Manually Transcribed Table 3: Reconstruction evaluation on ImageNet-1K val set.
| Method | Codebook | #Tokens | FID↓ | LPIPS↓ | PSNR↑ |
|---|---|---|---|---|---|
| VQ-GAN [12] | Learnable | 256 | 5.48 | 0.13 | - |
| VQ-GAN [12] | PaLM-2 | 256 | 7.44 | 0.17 | - |
| VQ-GAN* [12] | LLaMA-2 | 256 | 9.51 | 0.17 | 21.48 |
| SPAE [54] | PaLM-2 | 341 | 9.49 | 0.17 | - |
| SPAE [54] | PaLM-2 | 597 | 4.41 | 0.12 | - |
| SPAE [54] | PaLM-2 | 1109 | 3.89 | 0.11 | - |
| Ours | LLaMA-2 | 256 | 3.41 | 0.08 | 23.56 |
| Ours | Hybrid | 277 | 2.88 | 0.08 | 23.25 |

Analysis: The V2L Tokenizer achieves state-of-the-art reconstruction quality. The standard Ours model (with 256 local tokens) already achieves a strong FID of 3.41. The Hybrid model, which adds 21 global tokens to guide the decoder via cross-attention, further improves the FID to 2.88, demonstrating the benefit of injecting global semantic context into the reconstruction process. It outperforms SPAE even when SPAE uses far more tokens (1109).
Image Denoising (Table 4): Manually Transcribed Table 4: Quantitative evaluation across five denoising restoration tasks.
| Tokenizer | LLM | Inpainting FID↓ | Inpainting LPIPS↓ | Outpainting FID↓ | Outpainting LPIPS↓ | Deblurring FID↓ | Deblurring LPIPS↓ | Rotation FID↓ | Rotation LPIPS↓ | Shift FID↓ | Shift LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours | LLaMA-2 7B | 13.13 | 0.1219 | 15.28 | 0.1442 | 10.09 | 0.1033 | 10.64 | 0.1064 | 10.53 | 0.1058 |
| Ours | LLaMA-2 13B | 11.70 | 0.1134 | 12.56 | 0.1275 | 10.60 | 0.1085 | 11.36 | 0.1128 | 11.84 | 0.1176 |
| Ours | LLaMA-2 70B | 10.11 | 0.1021 | 10.73 | 0.1128 | 10.42 | 0.1058 | 10.48 | 0.1058 | 10.79 | 0.1093 |

(Note: Only the "Ours" results are transcribed for brevity, as they are consistently superior to all baselines (VQ-GAN, LQAE*, SPAE*) across all tasks and LLM sizes shown in the full table.)

Analysis: Across all five denoising tasks (inpainting, outpainting, deblurring, rotation, shift), the combination of the V2L Tokenizer and a frozen LLaMA-2 model achieves the best results (lowest FID and LPIPS). This shows that the local tokens effectively capture fine-grained image structure, enabling the LLM to reason about and restore corrupted visual patterns via in-context learning.
-
-
Qualitative Results:
The image is a photograph of a pizza topped with melted cheese and fresh basil leaves, clearly showing its crisp crust and rich toppings.
Captioning and VQA (Figure 5): The above image (part of Fig 5) shows a VQA example. The model correctly identifies the food as "Pizza," its origin as "Italy," and the topping as "Basil," demonstrating a deep contextual understanding derived from the global tokens. Other visualizations in the paper (e.g., Figures 12, 13, 14 in the appendix) further confirm the model's strong comprehension abilities.
This figure illustrates the paper's semantic interpretation results: several object images are shown with their corresponding label words arranged around them, reflecting the model's different semantic readings of the image content and the link between images and language.
Semantic Interpretation (Figure 6): This figure shows the top global tokens assigned to various images. For an image of a "French bulldog," the tokenizer produces tokens like "French bulldog", "Boston terrier", and "bulldog", which are semantically precise. This confirms the effectiveness of the vocabulary expansion and CLIP-based filtering.
This figure (the paper's Figure 8) shows masked-image restoration with the V2L Tokenizer: the left side displays the masked input images and the right side the model's predicted restorations, covering detail recovery across different scenes and objects.
Masked Image Restoration (Figure 8): This visualization shows the model's ability to perform masked image modeling. Given an image with 30% of its local tokens masked, a LoRA-tuned LLM can predict the missing tokens, leading to high-fidelity image reconstruction. This highlights the potential for using this framework for self-supervised pre-training.
-
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully demonstrates a powerful method to enable frozen, text-only LLMs to perform complex visual comprehension and restoration tasks without any multimodal fine-tuning. By conceptualizing images as a "foreign language" and designing a sophisticated V2L Tokenizer to translate them into semantically rich LLM tokens, the authors bridge the modality gap in the input space. The dual representation of global (semantic) and local (detailed) tokens proves effective for a wide array of tasks, outperforming prior fine-tuning-free approaches.
Limitations & Future Work (Inferred):
- Computational Cost of Tokenization: While the LLM is frozen, the V2L Tokenizer itself is a complex model that requires significant resources to train (100 epochs on ImageNet using 32 V100 GPUs).
- Fixed Resolution: The experiments are conducted on low-resolution images. Scaling this approach to high-resolution images would significantly increase the number of local tokens, potentially exceeding the context length limits of current LLMs.
- In-context Learning Efficiency: While effective, relying on in-context learning with many examples can be computationally more expensive at inference time compared to a single forward pass in a fine-tuned model.
- Dependence on CLIP: The quality of the semantic tokens is heavily dependent on the quality of the pre-trained CLIP model. Biases or weaknesses in CLIP could be inherited by the tokenizer.
- Computational Cost of Tokenization: While the LLM is frozen, the
-
Personal Insights & Critique:
- Novelty and Elegance: The core idea of "aligning in the token space, not the feature space" is highly innovative and elegant. It provides a practical and effective alternative to the dominant paradigm of multimodal fine-tuning, significantly lowering the barrier to entry for developing multimodal systems.
- Unifying Framework: This approach offers a promising path toward a unified model for multiple modalities. One could imagine creating similar "tokenizers" for audio, video, or other sensor data, translating everything into the "lingua franca" of an LLM's vocabulary. This could enable an LLM to reason seamlessly across different types of information.
- Critique: The vocabulary expansion and filtering process, while effective, feels somewhat heuristic. The final global codebook is static and dependent on the content of the filtering dataset (ImageNet). A more dynamic or end-to-end method for discovering semantic tokens could be a direction for future work.
- Future Impact: This work could inspire a new class of multimodal models that prioritize modularity and efficiency. Instead of creating monolithic, retrained models, the community might shift towards developing powerful, specialized "modal-to-language" tokenizers that can plug into any off-the-shelf LLM. This would make advanced multimodal AI more accessible and adaptable.