Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
TL;DR Summary
LQAE uses pretrained language model vocabularies to quantize image embeddings into text tokens, enabling unsupervised text-image alignment and facilitating few-shot image classification with large language models without paired data.
Abstract
Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and solve vision tasks, such as in visual-question answering and robotics. Prior works have largely connected image to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly and expensive process. In order to resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from BERT predicted text token embeddings. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, our work is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
In-depth Reading
1. Bibliographic Information
- Title: Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
- Authors: Hao Liu, Wilson Yan, Pieter Abbeel. The authors are affiliated with the University of California, Berkeley. Pieter Abbeel is a prominent figure in the fields of robotics and deep learning.
- Journal/Conference: The paper was submitted to arXiv, a preprint server for academic papers. This means it has not yet undergone formal peer review or been published in a conference or journal.
- Publication Year: 2023
- Abstract: The authors address the limitation that large language models (LLMs) lack visual perception. Current solutions typically require large, expensive, aligned image-text datasets. To overcome this, they propose the Language-Quantized AutoEncoder (LQAE), which aligns images and text in an unsupervised manner. The core idea is to treat an image as a sequence of text tokens by quantizing image features using the vocabulary (codebook) of a pretrained language model (like BERT). The model is trained to reconstruct the original image from these tokens after they have been masked and "denoised" by a frozen BERT model. This process forces the model to learn a mapping where similar images are represented by similar clusters of text tokens, achieving alignment without paired data. The paper demonstrates that this allows for few-shot image classification with LLMs like GPT-3 and effective linear classification using BERT features. The authors claim this is the first work to use unaligned images for multimodal tasks by leveraging pretrained language models.
- Original Source Link: https://arxiv.org/abs/2302.00902 (Preprint)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Large language models (LLMs) like GPT-3 have shown incredible few-shot learning abilities on text tasks but are "blind" to the visual world. Integrating vision is crucial for real-world applications like robotics and visual question answering.
- Existing Challenge: The dominant approach to connect vision and language (e.g., CLIP, Flamingo) is to pretrain models on massive, curated datasets of image-text pairs. This data collection is costly, time-consuming, and difficult to scale to other modalities.
- Proposed Innovation: This paper introduces a novel method to achieve text-image alignment without using any aligned data. It cleverly repurposes powerful, independently pretrained models from the language and vision domains, bridging the gap in a fully unsupervised way.
- Main Contributions / Findings (What):
- The paper introduces the Language-Quantized AutoEncoder (LQAE), a new architecture that learns to "translate" images into sequences of text tokens drawn from a pretrained language model's vocabulary.
- It demonstrates that these learned image representations, while not human-readable, can be fed directly into an LLM like GPT-3 to perform few-shot image classification via standard prompting, achieving results competitive with or even superior to methods trained on millions of aligned pairs.
- It shows that the intermediate features produced by a BERT model when processing these image-tokens are highly effective for standard linear classification of images, outperforming other unsupervised representations.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT-3, PaLM) trained on vast amounts of text data. Their key feature is few-shot learning or in-context learning, where they can perform new tasks by seeing just a few examples in a prompt, without needing to be retrained.
- Autoencoder (AE): A type of neural network with an encoder-decoder architecture. The encoder compresses the input (e.g., an image) into a low-dimensional latent representation, and the decoder tries to reconstruct the original input from this representation. The goal is to learn a meaningful compression.
- Vector Quantized Variational Autoencoder (VQ-VAE): A specific type of autoencoder where the latent representation is not continuous but discrete. The encoder's output is "quantized" by mapping it to the closest vector in a learned dictionary called a codebook. This forces the model to represent complex data using a finite set of discrete codes.
- BERT (Bidirectional Encoder Representations from Transformers): A powerful language model pretrained on a "denoising" task called Masked Language Modeling (MLM). In MLM, some words in a sentence are hidden (masked), and the model is trained to predict them based on the surrounding context. This makes BERT excellent at understanding and filling in missing information in a sequence. RoBERTa is an optimized version of BERT.
- Unsupervised Learning: A machine learning paradigm where models learn patterns from data without explicit labels or guidance. In this paper's context, it means using a collection of images and a pretrained language model (trained on text) without any (image, caption) pairs linking them.
- Previous Works:
- CLIP (Radford et al., 2021): A landmark model that learned rich joint representations of images and text by training on 400 million image-text pairs from the internet. It excels at zero-shot image classification. LQAE aims to achieve similar goals without this massive aligned dataset.
- Frozen (Tsimpoukelli et al., 2021): This model connects vision to a frozen (non-trainable) LLM. It trains a visual encoder to produce representations that the LLM can understand. However, this training step still requires a significant amount of aligned image-text data (like the Conceptual Captions dataset). LQAE is directly compared against Frozen and shows superior performance.
- Vision-Language Models (e.g., ViLBERT, Flamingo): These are large multimodal models pretrained on aligned data to perform tasks like VQA. They are typically task-specific or require fine-tuning, unlike the few-shot approach enabled by LQAE.
- Technological Evolution: The field has progressed from training separate models for vision and language to creating unified models. The main bottleneck has always been the need for aligned data. This paper proposes a paradigm shift: instead of collecting aligned data, it suggests we can create the alignment by forcing one modality (vision) to be expressed in the "language" of another (text) using off-the-shelf pretrained models.
- Differentiation: The key differentiator is unsupervised alignment. While CLIP and Frozen rely on supervised signals from (image, text) pairs, LQAE learns a mapping purely through a reconstruction objective on the image modality, using a frozen language model as a structural prior. This makes the approach far more data-efficient and generalizable.
4. Methodology (Core Technology & Implementation)
The proposed method, Language-Quantized AutoEncoder (LQAE), is a modification of the VQ-VAE framework. Its architecture is shown in the figure below.
(Figure caption) The figure is a schematic of the core pipeline of the Language-Quantized AutoEncoder (LQAE) proposed in the paper: image features are quantized using the RoBERTa codebook, a large fraction of the resulting tokens is masked, and a frozen RoBERTa model then reconstructs the image representation, achieving unsupervised text-image alignment.
- Principles: The central intuition is to force an image to be represented by discrete tokens from a pretrained language model's vocabulary. To ensure these token sequences are not random but have a structure that other language models can process, a frozen BERT-like model is inserted as a "denoising" bottleneck. The encoder and decoder are trained to work around this bottleneck, indirectly learning to produce "language-like" representations for images.
- Steps & Procedures (a code sketch of one training step follows this list):
- Image Encoding: An input image $x$ is passed through a vision encoder (a ViT-Base model) to produce a grid of patch embeddings $E$. For a $256 \times 256$ image divided into $16 \times 16$ patches, this results in 256 embedding vectors.
- Language Quantization: A fixed, pretrained codebook is used. This codebook is the word embedding matrix from a RoBERTa model. Each image patch embedding in $E$ is replaced by its nearest-neighbor vector from this codebook. This step "quantizes" the image into a sequence of 256 text token embeddings (equivalently, 256 RoBERTa vocabulary tokens). This is the core "Language-Quantized" step.
- Masking & Denoising: A large portion of the token sequence is randomly masked (e.g., 50%). This masked sequence is fed into a frozen RoBERTa model. The RoBERTa model acts as a denoiser, using its pretrained knowledge of language structure to predict the embeddings for the masked positions.
- Image Decoding: The output from the RoBERTa model (the complete, "denoised" sequence of embeddings) is passed to a vision decoder (also a ViT) to produce a reconstruction $\hat{x}$ of the original image.
- Training: Only the image encoder and decoder are trained. The RoBERTa codebook and the RoBERTa model itself remain frozen throughout.
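A minimal PyTorch-style sketch of one LQAE training step is given below. This is an illustrative reconstruction under stated assumptions rather than the authors' code: `vit_encoder` and `vit_decoder` are placeholder trainable modules, images are assumed to yield 256 patch embeddings of dimension 768, and the codebook is RoBERTa's frozen input word-embedding matrix as described above.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

# Frozen language components: the codebook is RoBERTa's input word-embedding matrix.
roberta = RobertaModel.from_pretrained("roberta-base").eval()
for p in roberta.parameters():
    p.requires_grad = False
codebook = roberta.get_input_embeddings().weight              # (vocab_size, 768), frozen
mask_id = RobertaTokenizer.from_pretrained("roberta-base").mask_token_id

def lqae_forward(images, vit_encoder, vit_decoder, mask_ratio=0.5):
    """Encode -> quantize to RoBERTa token embeddings -> mask -> denoise -> decode."""
    # 1. Image encoding: one embedding per patch (assumed shape (B, 256, 768)).
    z_e = vit_encoder(images)

    # 2. Language quantization: nearest word-embedding vector for each patch.
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))
    token_ids = dists.argmin(dim=-1)                          # (B, 256) discrete token ids
    z_q = codebook[token_ids]                                 # quantized embeddings
    z_q_st = z_e + (z_q - z_e).detach()                       # straight-through estimator

    # 3. Masking & denoising: hide a large fraction of tokens, let frozen RoBERTa fill them in.
    masked = torch.rand(token_ids.shape, device=z_e.device) < mask_ratio
    inputs = torch.where(masked.unsqueeze(-1), codebook[mask_id].expand_as(z_q_st), z_q_st)
    denoised = roberta(inputs_embeds=inputs).last_hidden_state

    # 4. Image decoding: reconstruct the image from the denoised embedding sequence.
    recon = vit_decoder(denoised)
    return recon, z_e, z_q, token_ids, masked, inputs
```

The straight-through estimator lets gradients bypass the non-differentiable nearest-neighbor lookup, so only the image encoder and decoder receive updates while RoBERTa and its embedding matrix stay fixed.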
- Mathematical Formulas & Key Details: The model is optimized using a composite loss function. First, consider the standard VQ-VAE loss for context:

  $$\mathcal{L}_{\text{VQ-VAE}} = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}} + \underbrace{\lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2}_{\text{codebook}} + \underbrace{\beta \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}$$

  - Reconstruction loss: measures the difference between the original image $x$ and the reconstructed image $\hat{x}$.
  - Codebook loss: pushes the codebook vectors $e$ to be closer to the encoder outputs $z_e(x)$. Here $\mathrm{sg}$ stands for stop-gradient, meaning the encoder is not updated by this term.
  - Commitment loss: encourages the encoder's output to commit to a codebook vector; $\beta$ is a hyperparameter.

  The LQAE loss function modifies this as follows:

  $$\mathcal{L}_{\text{LQAE}} = \lVert x - \hat{x} \rVert_2^2 + \beta \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2 + \alpha \, \mathcal{L}_{\text{BERT}}$$

  - Reconstruction loss: the first term is the same, but crucially, $\hat{x}$ is now reconstructed from the output of the frozen RoBERTa denoiser.
  - Codebook loss: this term is removed because the RoBERTa codebook is frozen and not learned.
  - Commitment loss: the second term remains, encouraging the image encoder's outputs to stay close to the chosen language token embeddings.
  - BERT loss: the new third term, $\mathcal{L}_{\text{BERT}}$, is the (negative) log-likelihood of the original tokens given the masked sequence, as predicted by the frozen RoBERTa model. This loss encourages the encoder to generate token sequences that are "natural" or "predictable" for RoBERTa, thereby imparting a language-like structure. $\alpha$ is a small weighting hyperparameter.
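As a companion to the forward-pass sketch above, the composite loss could be assembled as follows. The commitment weight `beta` is an illustrative value, `alpha = 0.001` mirrors the default BERT-loss weight in the ablation table, and scoring the chosen tokens with RoBERTa's MLM head is an assumption consistent with the description rather than the exact implementation.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaForMaskedLM

# Frozen MLM head, used only to score how "predictable" the chosen tokens are.
mlm = RobertaForMaskedLM.from_pretrained("roberta-base").eval()
for p in mlm.parameters():
    p.requires_grad = False

def lqae_loss(images, recon, z_e, z_q, token_ids, masked, inputs, beta=0.25, alpha=0.001):
    # Reconstruction loss: unlike standard VQ-VAE, recon is produced *through*
    # the frozen RoBERTa denoiser.
    recon_loss = F.mse_loss(recon, images)

    # Commitment loss: keep encoder outputs close to their chosen (frozen) token embeddings.
    commit_loss = F.mse_loss(z_e, z_q.detach())

    # BERT loss: negative log-likelihood of the selected token ids at masked positions.
    logits = mlm(inputs_embeds=inputs).logits                 # (B, 256, vocab_size)
    bert_loss = F.cross_entropy(logits[masked], token_ids[masked])

    return recon_loss + beta * commit_loss + alpha * bert_loss
```

Note that there is no codebook loss term: the RoBERTa embedding matrix is never updated.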
5. Experimental Setup
- Datasets:
- Training Data: The model was trained on the ImageNet dataset, using only the images without any associated labels or text.
- Evaluation Data:
- ImageNet: Used for the linear classification task.
- Real-Name Open-Ended miniImageNet: A variant of the miniImageNet benchmark used for few-shot classification, introduced by Tsimpoukelli et al. (2021). It uses real class names (e.g., "goldfish") as labels.
- Evaluation Metrics:
- Accuracy: Used for both linear and few-shot classification.
  - Conceptual Definition: It measures the fraction of predictions that are correct; for classification, this is the most straightforward metric of performance.
  - Mathematical Formula:
    $$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
  - Symbol Explanation:
    - Number of Correct Predictions: the count of test samples where the model's predicted label matches the true label.
    - Total Number of Predictions: the total number of samples in the test set.
- Baselines:
  - ASCII: A baseline that converts images into ASCII art to create a text representation. This tests whether any textual representation is sufficient.
  - MAE + Linear: A strong image-only pretraining baseline. It uses a Masked Autoencoder (MAE), a state-of-the-art self-supervised vision model, and trains a linear classifier on its features for each few-shot task.
  - Frozen: The primary competitor, which uses a visual encoder trained on aligned image-text data to interface with a frozen language model.
  - untrained LQAE: An ablation baseline to verify that the training process, not just the architecture, is responsible for the performance.
- Few-shot Terminology: The paper uses specific terms for its few-shot evaluation, illustrated in Figure 3 (which shows, under two task-induction settings, example support and query images together with their labels and questions). A sketch of how these settings map onto a GPT-3 prompt follows this list.
  - Task induction: A natural language instruction at the start of the prompt (e.g., "Please answer the question").
  - Number of ways: The number of classes in the classification task (e.g., 2-way for dog vs. cat).
  - Number of inner-shots: The number of unique example images provided for each class in the prompt.
  - Number of repeats: The number of times each unique example is repeated in the prompt.
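The sketch below shows one plausible way these settings could be turned into a GPT-3 prompt. The template, the "Question:"/"Answer:" phrasing, and the interpretation of `repeats` as extra copies of each example are illustrative assumptions; in LQAE, the "image text" placed in the prompt is the non-human-readable string of RoBERTa tokens produced by the encoder.

```python
def build_fewshot_prompt(support, query_tokens, task_induction=True, repeats=0):
    """Build an n-way few-shot prompt for a frozen LLM.

    support: list of (lqae_token_string, class_name) pairs, one entry per
             inner-shot per class; the token string is the image's LQAE "text".
    query_tokens: LQAE token string for the query image.
    repeats: extra copies of each unique support example (assumed meaning).
    """
    lines = []
    if task_induction:
        lines.append("Please answer the question.")          # task induction
    for tokens, label in support:
        for _ in range(repeats + 1):                          # 1 original + `repeats` copies
            lines.append(f"Question: What is this? {tokens}")
            lines.append(f"Answer: {label}")
    lines.append(f"Question: What is this? {query_tokens}")
    lines.append("Answer:")                                    # GPT-3 completes the class name
    return "\n".join(lines)

# Example: 2-way, 1 inner-shot, 0 repeats (hypothetical token strings).
prompt = build_fewshot_prompt(
    support=[("<LQAE tokens for a dog image>", "dog"),
             ("<LQAE tokens for a cat image>", "cat")],
    query_tokens="<LQAE tokens for the query image>",
)
```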
6. Results & Analysis
- Core Results:
1. Linear Classification with BERT: Figure 4 shows the linear classification accuracy on ImageNet. Features extracted from the intermediate layers of the RoBERTa model after processing LQAE tokens achieve 39.7% accuracy, significantly outperforming features from a standard VQ-VAE encoder (14.2%) and features from LQAE's own vision encoder (35.6%). This shows that forcing the representation through the frozen language denoiser produces features that are more semantically meaningful and linearly separable (a minimal linear-probe sketch follows the result tables below).

2. Few-shot Classification with GPT-3: The main results are presented in Tables 1 and 2, which show few-shot classification accuracy on mini-ImageNet. The following are manually transcribed versions of these tables.
Table 1: 2-way few-shot classification using GPT-3 on mini-ImageNet. A random guess on a 2-way task would yield 50% accuracy.

| Few-shot Setting | | | | | | | | Avg |
|---|---|---|---|---|---|---|---|---|
| Task Induction | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Inner Shots | 1 | 1 | 3 | 5 | 1 | 1 | 1 | |
| Repeats | 0 | 0 | 0 | 0 | 1 | 3 | 5 | |
| ASCII, 64×64 img (no image or text) | 0 | 5.2 | 5.9 | 6.5 | 4.5 | 4.8 | 5.2 | 4.59 |
| MAE + Linear (image pretrain + image-text finetune) | 0 | 8.9 | 11.4 | 13.5 | 12.8 | 15.6 | 19.8 | 11.71 |
| Frozen (image-text pretrain) | 1.7 | 33.7 | 66 | 66 | 63 | 65 | 63.7 | 51.3 |
| untrained LQAE (image pretrain) | 0 | 8.2 | 13.8 | 14.5 | 10.4 | 12.7 | 15.6 | 10.74 |
| LQAE (ours) (image pretrain) | 1.5 | 35.2 | 68.2 | 69.8 | 68.5 | 68.7 | 65.9 | 53.97 |

Table 2: 5-way few-shot classification using GPT-3 on mini-ImageNet. A random guess on a 5-way task would yield 20% accuracy.

| Few-shot Setting | | | | | | | | Avg |
|---|---|---|---|---|---|---|---|---|
| Task Induction | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Inner Shots | 1 | 1 | 3 | 5 | 1 | 1 | 1 | |
| Repeats | 0 | 0 | 0 | 0 | 1 | 3 | 5 | |
| ASCII, 64×64 img (no image or text) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MAE + Linear (image pretrain + image-text finetune) | 0.3 | 2 | 2.5 | 3.2 | 3.1 | 3.5 | 3.6 | 2.6 |
| Frozen (image-text pretrain) | 0.9 | 14.5 | 34.7 | 33.8 | 33.8 | 33.3 | 32.8 | 26.26 |
| untrained LQAE (image pretrain) | 0 | 1.2 | 1.6 | 2.3 | 2.1 | 1.9 | 2.3 | 1.63 |
| LQAE (image pretrain) | 1 | 15.7 | 35.9 | 36.5 | 31.9 | 36.4 | 45.9 | 29.04 |

- Analysis: LQAE substantially outperforms all baselines that do not use aligned data (ASCII, MAE + Linear). More surprisingly, it also outperforms Frozen, which was pretrained on 3 million aligned image-text pairs. The authors suggest this may be partly because they use a much larger LLM (GPT-3 Davinci, 175B, vs. Frozen's 6B language model), but highlight that LQAE's approach makes using such large models feasible, as no expensive finetuning is required. The key takeaway is that GPT-3 can successfully perform few-shot learning on the non-human-interpretable token sequences generated by LQAE, suggesting it recognizes underlying patterns and structures beyond surface-level semantics.
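To make the linear-classification protocol referenced under Core Results concrete, here is a minimal sketch of a linear probe on frozen intermediate RoBERTa features, reusing `vit_encoder`, `roberta`, and `codebook` from the earlier methodology sketch. The choice of layer, mean pooling, and a scikit-learn logistic-regression probe are assumptions; the paper's exact probe configuration may differ.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_lqae_features(images, vit_encoder, roberta, codebook, layer=6):
    """Quantize images to RoBERTa tokens, run frozen RoBERTa, mean-pool a hidden layer."""
    z_e = vit_encoder(images)                                               # (B, 256, 768)
    ids = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1)).argmin(-1)
    out = roberta(inputs_embeds=codebook[ids], output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).cpu().numpy()               # (B, 768)

def linear_probe(X_train, y_train, X_val, y_val):
    """Fit a linear classifier on frozen features and report validation accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_val, y_val)
```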
- Ablations / Parameter Sensitivity: The paper performs extensive ablations, summarized in Table 3 (transcribed below) and Figure 5.
Table 3: Ablation studies on ImageNet linear accuracy and mini-ImageNet few-shot accuracy.
Variation Entropy Trained BERT L2 BERT loss Decoder STE % Code GPT-3 size Linear Acc 2-way Avg 5-way Avg Default 0.0 true true 0.001 false 100 Davinci (175B) 35.60 53.97 29.04 (A) No L2 Norm false 30.30 52.45 27.42 (B) Random BERT false 11.80 50.45 26.54 (D) Entropy Reg. (0.5) 0.5 30.70 52.45 28.51 (E) BERT Loss Weight 0.00 34.80 54.53 30.01 1.00 36.90 40.45 20.93 (F) Decoder STE true 34.80 54.53 30.01 (G) % Code Used 25 N/A 15.45 1.45 50 N/A 21.00 5.56 75 N/A 50.56 20.85 (H) GPT-3 Size Curie(6.7B) N/A 46.55 22.80 Babbage(1.3B) N/A 23.85 14.70 (I) VQ to RoBERTa VQ to Roberta N/A 3.24 0.00 - Key Ablation Insights:
- (B) Trained BERT: Using a pretrained RoBERTa is crucial. A randomly initialized model leads to a massive drop in linear accuracy (35.6 -> 11.8), proving that the language prior is essential.
- (E) BERT Loss: Removing the auxiliary BERT loss ($\mathcal{L}_{\text{BERT}}$, i.e., setting its weight to 0) has a minimal negative impact on linear accuracy and even slightly improves few-shot results. This suggests the image reconstruction objective through the BERT denoiser is the primary mechanism for learning the alignment.
- (H) GPT-3 Size: Performance scales strongly with the size of the LLM used for few-shot evaluation.
- (I) VQ to RoBERTa: This is a critical ablation. If you first train a standard VQ-VAE and then simply map its learned codes to random RoBERTa tokens, the performance is terrible. This proves that the LQAE training process creates a non-arbitrary language structure that GPT can leverage, not just a set of correlated but meaningless codes.
Masking Ratio: Figure 5 shows that performance is highest with a 50% masking ratio, which is much higher than the 15% typically used for training BERT on text. This high ratio forces the model to learn global, long-range dependencies within the image-token sequence, rather than just local patterns, which is vital for holistic image understanding.
(Figure 5 caption) The chart shows the effect of different masking ratios on LQAE accuracy for linear classification (ImageNet) and few-shot classification (mini-ImageNet). The results indicate that a higher masking ratio (50%) is critical for model performance.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents LQAE, a novel VQ-VAE-style model that learns to align images to text tokens without requiring any paired data. By leveraging a frozen, pretrained language denoiser (RoBERTa) as a bottleneck, LQAE forces an image encoder to generate structured, "language-like" representations for images. This enables powerful downstream applications, including few-shot image classification by prompting LLMs and strong linear classification performance. The work demonstrates a new, data-efficient path towards multimodal AI, reducing the reliance on expensive curated datasets.
- Limitations & Future Work: The authors acknowledge several limitations:
- Interpretability: The "text" generated by LQAE is not human-readable. While effective for machines, this limits applications in domains requiring transparency, like healthcare.
- Scaling: Due to compute constraints, the authors could not explore scaling to larger image encoders or larger BERT-like models during training, which they hypothesize would significantly improve results.
- Generalization: The method is presented for text-image alignment but is theoretically generalizable to any two modalities where a pretrained denoiser exists for the target modality (e.g., audio-to-text, video-to-text). This is noted as a promising direction for future work.
- Personal Insights & Critique:
- Paradigm Shift: LQAE represents a significant conceptual leap. Instead of the brute-force approach of collecting massive aligned datasets, it shows how to cleverly compose existing unimodal expert models to create multimodal capabilities. This is a far more elegant and scalable philosophy.
- The Nature of "Language": The success of prompting GPT-3 with non-semantic, machine-generated "text" is profound. It suggests that LLMs' in-context learning ability might be more sensitive to statistical patterns, format, and structure than to human-level semantic meaning. LQAE essentially learns a "machine language" for images that happens to share statistical properties with human language.
- Practical Implications: This approach could democratize multimodal research by lowering the barrier to entry. Researchers without access to massive, proprietary datasets could still build powerful vision-language systems.
- Open Questions: Could this technique be used to "re-align" modalities? For instance, could providing a few true (image, text) pairs during inference help GPT-3 map the LQAE tokens to human-readable concepts? The generalization to other modalities (e.g., fMRI, audio spectra) is also an extremely exciting avenue for future research.