m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt
TL;DR Summary
m3P leverages multimodal prompts and visual context as a language-independent representation to align 102 languages semantically, significantly improving translation quality, especially in low-resource settings.
Abstract
Multilingual translation supports multiple translation directions by projecting all languages in a shared space, but the translation quality is undermined by the difference between languages in the text-only modality, especially when the number of languages is large. To bridge this gap, we introduce visual context as the universal language-independent representation to facilitate multilingual translation. In this paper, we propose a framework to leverage the multimodal prompt to guide the Multimodal Multilingual neural Machine Translation (m3P), which aligns the representations of different languages with the same meaning and generates the conditional vision-language memory for translation. We construct a multilingual multimodal instruction dataset (InstrMulti102) to support 102 languages. Our method aims to minimize the representation distance of different languages by regarding the image as a central language. Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin. Furthermore, the probing experiments validate the effectiveness of our method in enhancing translation under the low-resource and massively multilingual scenario.
In-depth Reading
Bibliographic Information
- Title: m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt
- Authors: Jian Yang, Hongcheng Guo, Yuwei Yin, Jiaqi Bai, Bing Wang, Jiaheng Liu, Xinnian Liang, Linzheng Chai, Liqun Yang, Zhoujun Li.
- Affiliations: State Key Lab of Software Development Environment, Beihang University; Department of Computer Science, University of British Columbia.
- Journal/Conference: This paper is a preprint published on arXiv, a renowned open-access archive for scholarly articles, particularly in fields like computer science and physics. Preprints are typically submitted to conferences or journals later, but arXiv itself is a highly respected platform for disseminating early research.
- Publication Year: 2024
- Abstract: Multilingual translation (MNMT) faces challenges due to linguistic differences, which degrade translation quality, especially with a large number of languages. This paper addresses this by introducing visual context as a universal, language-independent representation. The authors propose m3P, a framework that leverages a multimodal prompt to guide Multimodal Multilingual Neural Machine Translation. m3P aligns representations of different languages with the same meaning and generates conditional vision-language memory for translation. To support this, they construct InstrMulti102, a multilingual multimodal instruction dataset covering 102 languages. The method aims to minimize the representation distance across languages by treating the image as a central language. Experimental results demonstrate that m3P significantly outperforms previous text-only and multilingual multimodal baselines. Probing experiments further validate its effectiveness in low-resource and massively multilingual scenarios.
- Original Source Link: https://arxiv.org/abs/2403.17556
- PDF Link: https://arxiv.org/pdf/2403.17556v1.pdf
Executive Summary
Background & Motivation (Why)
The core problem addressed by this paper is the inherent limitation of Multilingual Neural Machine Translation (MNMT) models. While MNMT models support multiple translation directions within a single model by relying solely on text data, their translation quality often suffers due to the significant differences between languages, especially as the number of supported languages grows. This issue is exacerbated in massively multilingual settings where linguistic resources might be scarce for certain language pairs.
Traditional MNMT approaches implicitly bring languages together by sharing model parameters, but they often fail to explicitly bridge the semantic gap between diverse languages. Multimodal NMT (MMT) has emerged as a promising direction by incorporating visual context from relevant images. Images are considered a language-agnostic semantic representation or a "universal language" because they can convey ideas and concepts across linguistic and cultural barriers without being tied to specific grammar or vocabulary.
The paper identifies a crucial gap: while images are intuitively promising as a universal router in multilingual translation, previous MMT works predominantly focused on bilingual translation. These approaches lacked a robust mechanism to explicitly align the representations of multiple languages using visual context as a common anchor, and they struggled to scale to a large number of languages.
The paper's novel approach is to leverage this universal language (images) to explicitly bridge the gap among multiple languages in a massively multilingual translation setting. It proposes a framework that uses a multimodal prompt to guide the translation process, aligning language representations to a shared semantic space via visual cues.
Main Contributions / Findings (What)
The primary contributions of this paper are:
- Proposed m3P Framework: The introduction of m3P (Multimodal Multilingual neural Machine Translation with Multimodal Prompt), a novel framework that integrates visual features as a central, language-independent representation to enhance multilingual translation. It utilizes a multimodal prompt to guide the translation process, aligning different language representations with the same underlying meaning and generating conditional vision-language memory (CVLM).
- Multilingual Multimodal Contrastive Learning (MMCL): The development and integration of MMCL combined with masked language/image augmentation to explicitly align text and vision modalities into a common semantic space, thereby minimizing the representation distance across diverse languages.
- Construction of InstrMulti102 Dataset: The creation of a new, large-scale multilingual multimodal instruction dataset, InstrMulti102, which supports 102 languages. This dataset extends the Multi30k benchmark and addresses the data scarcity challenge for massively multilingual multimodal tasks.
- Significant Performance Improvement: m3P demonstrates substantial BLEU gains over previous text-only baselines and existing multilingual multimodal methods on the standard Multi30k benchmark and the newly created InstrMulti102 dataset.
- Validation in Low-Resource and Massively Multilingual Scenarios: Extensive experiments, including probing analyses, validate the effectiveness of m3P in enhancing translation quality, particularly in low-resource settings and scenarios involving a large number of languages (up to 102). The model's ability to utilize visual context to compensate for masked text input further highlights its robustness.
Prerequisite Knowledge & Related Work
To fully understand the m3P framework, a beginner should be familiar with several foundational concepts and the evolution of machine translation research.
Foundational Concepts
- Neural Machine Translation (NMT): NMT is a modern approach to machine translation that uses deep neural networks to directly map a source sequence to a target sequence. Unlike older statistical methods, NMT models learn complex relationships between words and phrases, typically generating translations end-to-end.
- Transformer Architecture: The Transformer is a neural network architecture introduced in 2017, which revolutionized NMT and other sequence-to-sequence tasks. It relies entirely on attention mechanisms (specifically self-attention and cross-attention) to draw global dependencies between input and output. It consists of an encoder that processes the input sequence and a decoder that generates the output sequence. The paper explicitly mentions using the Transformer as the backbone model for language and vision encoding.
- Multilingual NMT (MNMT): An extension of NMT where a single model is trained to translate between multiple languages, often in many-to-many directions. This allows for parameter sharing across languages, potentially improving performance for low-resource languages by leveraging data from high-resource ones. The target language is typically indicated by prepending a target language symbol (e.g., [En] for English) to the source sentence.
- Multimodal Machine Translation (MMT): MMT enhances NMT by incorporating information from multiple modalities, most commonly text and images. The goal is to provide additional context (e.g., visual cues) to disambiguate ambiguous words or phrases in the text, leading to more accurate translations.
- Vision-Language Pre-trained Models: These models are trained on massive datasets of image-text pairs to learn joint representations of visual and linguistic information.
  - CLIP (Contrastive Language-Image Pre-training): A model trained to efficiently learn visual concepts from natural language supervision. It learns to associate images with their corresponding text descriptions by maximizing the similarity of correct image-text pairs and minimizing it for incorrect ones in an embedding space. CLIP is used in m3P to initialize the vision encoder.
- Cross-lingual Pre-trained Language Models: These models are pre-trained on text from many different languages, learning universal language representations that can be transferred across languages.
  - XLM-R (Cross-lingual Language Model - RoBERTa): A powerful cross-lingual language model trained on a massive multilingual corpus. It enables a single model to handle multiple languages effectively and is used in m3P to initialize the language encoder.
- Contrastive Learning: A machine learning paradigm where the model learns representations by pushing "similar" samples closer together and "dissimilar" samples further apart in an embedding space.
  - InfoNCE Loss (Info Noise-Contrastive Estimation): A commonly used loss function in contrastive learning. It measures how well a model can distinguish between positive sample pairs (e.g., an image and its correct caption) and negative sample pairs (e.g., an image and other incorrect captions in the same batch). The goal is to maximize the mutual information between representations:

    $$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\big(\mathrm{sim}(q, k^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(q, k^{+}) / \tau\big) + \sum_{k^{-}} \exp\big(\mathrm{sim}(q, k^{-}) / \tau\big)}$$

    Where:
    - $q$: the query embedding (e.g., an image representation).
    - $k^{+}$: the positive key embedding (e.g., the correct text description for the image).
    - $k^{-}$: negative key embeddings (e.g., incorrect text descriptions for the image, usually sampled from the same batch).
    - $\mathrm{sim}(\cdot, \cdot)$: a similarity function, often cosine similarity or dot product.
    - $\tau$: a temperature hyper-parameter that scales the logits before the softmax, influencing the sharpness of the distribution and the difficulty of distinguishing samples.
- Large Language Models (LLMs): Very large neural networks (often Transformer-based) trained on vast amounts of text data, capable of understanding and generating human-like text, performing various language tasks, and often exhibiting emergent abilities. The paper mentions Llama2 as an example of an LLM.
- Instruction Tuning: A technique used to fine-tune LLMs on a diverse set of tasks framed as instructions (e.g., "Translate this English sentence to French: ..."). This helps LLMs follow instructions and align their output with specific task requirements, bridging the gap between next-word prediction and downstream tasks.
Previous Works
The paper discusses prior research in both MNMT and MMT, highlighting their limitations.
- Text-only MNMT:
  - Arivazhagan et al. (2019), Yang et al. (2021a): These works established MNMT models that rely solely on text data, supporting multiple translation directions by sharing parameters. While efficient, they suffer from quality degradation for linguistically distant languages or in massively multilingual setups.
  - Fan et al. (2021): The MNMT baseline in the results refers to this line of work, representing state-of-the-art text-only multilingual translation.
  - Pan et al. (2021), Yang et al. (2021b), Winata et al. (2021), Gong et al. (2021): These works explored aligned augmentation and contrastive learning to bridge gaps between languages, but primarily within the text modality. mRASP2 (Pan et al., 2021) is specifically mentioned as using text-only contrastive learning.
- Multimodal NMT (MMT):
  - Zhang et al. (2020b), Li et al. (2021a, 2022), Fang and Feng (2022), Guo et al. (2022b): These works introduced visual context to enhance translation. However, they mainly focused on bilingual translation, requiring separate models for each language pair (Figure 1a).
  - Li et al. (2021a): Introduced MNMT (Gated Fusion) and MNMT (Concatenation), which incorporate visual context using different fusion mechanisms. These are evaluated as baselines.
  - Li et al. (2022): Selective Attn used a single-head attention network to correlate words with image patches.
  - Guo et al. (2022b): LVP-M3 used language-aware visual prompts.
Technological Evolution
The field has evolved from:
- Bilingual NMT: One model for one language pair.
- Text-only Multilingual NMT (MNMT): A single model for many language pairs, sharing parameters but suffering from linguistic divergence.
- Bilingual Multimodal NMT (MMT): Incorporating images for a single language pair, often struggling to scale to multilingualism.
- Multilingual Multimodal NMT: Attempting to combine the benefits of both, but often implicitly or with limited explicit cross-modal, cross-lingual alignment.
Differentiation
m3P differentiates itself from previous works by:
- Explicitly Bridging Gaps via Visual Context: Unlike text-only MNMT, which implicitly shares parameters, m3P explicitly uses images as a central language to directly align representations across multiple languages.
- Multilingual Multimodal Contrastive Learning (MMCL): While some prior MNMT works used text-only contrastive learning, m3P introduces MMCL to enforce alignment between text (across multiple languages) and images in a shared semantic space.
- Conditional Vision-Language Memory (CVLM): m3P designs a specific mechanism to generate the CVLM by allowing language features to query visual features, effectively integrating multimodal information into the encoder states for the decoder.
- Massive Multilingual Support: Most MMT work focused on bilingual scenarios or a few languages. m3P is designed for and validated on a massively multilingual scale, supporting up to 102 languages through its novel InstrMulti102 dataset and architectural design.
- Multimodal Prompting: m3P leverages multimodal prompts to guide the translation process, making it adaptable to both encoder-decoder and decoder-only (LLM) architectures.
Methodology (Core Technology & Implementation Details)
The m3P framework aims to enhance multilingual machine translation by incorporating visual context as a universal, language-independent representation. It achieves this by aligning multimodal and multilingual representations in a shared semantic space and generating a conditional memory for the decoder.
Principles
The core idea behind m3P is to treat the image as a central language that can explicitly bridge the semantic gaps between various natural languages. This is motivated by the intuition that images convey meaning universally, independent of linguistic structures. The framework operates on three main principles:
- Universal Visual Features: Extracting robust, language-agnostic visual features from images.
- Shared Semantic Space: Aligning multilingual text representations and visual representations into a common embedding space using contrastive learning.
- Conditional Memory Generation: Fusing the aligned vision-language information into a conditional vision-language memory (CVLM) that guides the language decoder for improved translation.
Overview
This figure is a schematic of the m3P model architecture, showing the image patch inputs, the vision encoder, the cross-lingual language encoder, the Conditional Vision-Language Memory (CVLM) module, and the multilingual language decoder, illustrating how aligning the representations of different languages improves translation performance.
Figure 2: m3P model architecture.
As depicted in Figure 2, m3P consists of three main components:
- Cross-lingual Language Encoder: Encodes the source sentence (prefixed with a target language symbol) into language representations.
- Vision Transformer Encoder: Extracts visual context from the input image.
- Multilingual Language Decoder: Generates the target translation based on the conditional vision-language memory.

The process for a given $i$-th sentence pair $(x_i, y_i)$ and corresponding image $z_i$ is as follows:
- The source sentence $x_i$, prefixed with the target language symbol (e.g., [En]), is fed into the cross-lingual language encoder to obtain language representations $H^{L}$.
- The image $z_i$ is reshaped into a sequence of flattened patches and processed by the vision Transformer encoder to extract vision representations $H^{V}$.
- Multilingual Multimodal Contrastive Learning (MMCL) is applied to minimize the distance between $H^{L}$ (from different languages) and $H^{V}$, fostering a shared semantic space.
- The Conditional Vision-Language Memory (CVLM) $H^{C}$ is generated. This involves using $H^{V}$ (image tokens) as key and value, and $H^{L}$ (language features) as query in a multi-head cross-attention mechanism.
- Finally, $H^{C}$ is fed into the multilingual language decoder to predict the target translation $y_i$.
Multilingual Multimodal Translation
The model is trained on bilingual corpora with images, denoted as $\mathcal{D} = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $x_i$ represents source sentences, $y_i$ target sentences, and $z_i$ corresponding images. The training objective for multilingual multimodal translation is to minimize the negative log-likelihood of the target sentences:

$$\mathcal{L}_{\mathrm{MT}} = -\sum_{(x_i, y_i, z_i) \in \mathcal{D}} \log P\big(y_i \mid x_i, z_i; \theta\big) \tag{1}$$

Here, $\theta$ represents the complete shared parameters of the multimodal multilingual model. The Transformer architecture serves as the backbone for both language and vision encoding. The language encoder is initialized with XLM-R (Conneau et al., 2020) and the vision encoder with CLIP (Radford et al., 2021). A target language symbol (e.g., [En]) is prefixed to the source sentence to specify the translation direction.
Multimodal Multilingual Prompt
This figure is a schematic of the multimodal prompt for large language models (LLMs), showing how the text encoder, vision encoder, and decoder work together in the translation task, with color-coded examples of instructions, inputs, and responses.
Figure 3: Multimodal prompt for LLM.
The concept of a multimodal prompt is crucial, especially for decoder-only models like Llama2 (Liu et al., 2023b).
- Decoder-only Setting (Figure 3a): The visual tokens extracted from the image $z$ (denoted as $H^{V}$) are integrated directly into a prompt template alongside the source sentence $x$, the target language symbol $l_t$, and the source language $l_s$. This entire prompt is then fed into the LLM (a small sketch of such a prompt follows this list).
- Encoder-Decoder Setting (Figure 3b): The source text and image are fed separately into the text and vision encoders, respectively, as part of the standard encoder-decoder pipeline. The prompt mechanism here is more about how the input is structured for the specific encoders.
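A minimal sketch of how such a prompt might be assembled for the decoder-only setting; the template wording and the `<image>` placeholder tokens are illustrative assumptions, not the paper's exact format:

```python
def build_multimodal_prompt(visual_tokens, src_lang, tgt_lang, src_sentence):
    """Assemble a decoder-only multimodal prompt.

    `visual_tokens` stands in for the image features produced by a vision
    extractor (e.g., CLIP); their positions are marked here with placeholder
    tags that the model would replace with the actual visual embeddings.
    """
    image_slot = "<image>" * len(visual_tokens)  # hypothetical placeholder tokens
    instruction = (
        f"Translate the following {src_lang} sentence into {tgt_lang}, "
        f"using the image as additional context."
    )
    return (
        f"### Instruction:\n{instruction}\n"
        f"### Image:\n{image_slot}\n"
        f"### Input:\n{src_sentence}\n"
        f"### Response:\n"
    )

# Example usage with dummy visual tokens:
prompt = build_multimodal_prompt(
    visual_tokens=[0] * 49, src_lang="English", tgt_lang="German",
    src_sentence="A young child is standing alone on some jagged rocks.",
)
print(prompt)
```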
Multimodal Encoding
Language Encoding
For the encoder-decoder setting, given a text prompt (e.g., [target_language_symbol] source_sentence), the cross-lingual language encoder (denoted as $\mathcal{E}_{\text{text}}$) processes the concatenation of tokens to produce language representations:

$$H^{L} = \mathcal{E}_{\text{text}}\big([l_t; x]\big)$$

Where:
- $H^{L}$: the output language representations (a sequence of tokens).
- $\mathcal{E}_{\text{text}}$: the language encoder (initialized by XLM-R).
- $l_t$: the target language symbol for language $t$.
- $x$: the source sentence tokens.
Vision Encoding
To encode the image $z \in \mathbb{R}^{H \times W \times C}$ (with height $H$, width $W$, and $C$ channels), it is first reshaped into a sequence of flattened patches. These patches are then fed into the vision Transformer encoder (denoted as $\mathcal{E}_{\text{vis}}$) to extract vision representations:

$$H^{V} = \mathcal{E}_{\text{vis}}(z)$$

Where:
- $H^{V}$: the output vision representations (a sequence of tokens).
- $\mathcal{E}_{\text{vis}}$: the vision Transformer encoder (initialized by CLIP).
- $z$: the input image.
- $P$: the patch size, such that $N = HW / P^{2}$ is the number of patches.

For the decoder-only setting, a vision extractor obtains visual tokens, which are then filled into the multimodal prompt (Figure 3a). All tokens (visual and textual) are concatenated and fed into the LLM to obtain combined representations. Subsequently, $H^{L}$ and $H^{V}$ can be derived from these combined representations for further processing.
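A minimal sketch of the two encoders using Hugging Face checkpoints; the specific model names and the use of `last_hidden_state` as the token-level representations are assumptions about one reasonable instantiation, not the paper's exact setup:

```python
import torch
from transformers import (AutoTokenizer, XLMRobertaModel,
                          CLIPImageProcessor, CLIPVisionModel)
from PIL import Image

# Cross-lingual language encoder (initialized from XLM-R).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# Vision Transformer encoder (initialized from a CLIP ViT-B/32 checkpoint).
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_pair(src_sentence: str, tgt_lang_symbol: str, image: Image.Image):
    # Prefix the target language symbol, e.g. "[De] A young child ..."
    text = f"{tgt_lang_symbol} {src_sentence}"
    text_inputs = tokenizer(text, return_tensors="pt")
    vision_inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        h_lang = text_encoder(**text_inputs).last_hidden_state      # (1, T, 768)
        h_vis = vision_encoder(**vision_inputs).last_hidden_state   # (1, N+1, 768)
    return h_lang, h_vis
```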
Multilingual Multimodal Alignment
To effectively fuse multilingual text and vision features, m3P regards the image as a universal language. It introduces Multilingual Multimodal Contrastive Learning (MMCL) to improve text-image alignment and, implicitly through the shared visual context, multilingual text-text alignment. MMCL uses the InfoNCE objective (van den Oord et al., 2018). The total contrastive loss is the sum of the image-to-text and text-to-image similarity losses:

$$\mathcal{L}_{\mathrm{CL}} = \sum_{(x_i, z_i) \in \mathcal{B}} \big( \mathcal{L}_{z \to x} + \mathcal{L}_{x \to z} \big) \tag{2}$$

Where $\mathcal{B}$ is the multilingual set of sampled image-text pairs.
Image-to-Text Contrastive Loss
The loss $\mathcal{L}_{z \to x}$ measures how well the image $z_i$ aligns with its corresponding text $x_i$:

$$\mathcal{L}_{z \to x} = -\log \frac{\exp\big(\langle z_i, x_i \rangle / \tau\big)}{\sum_{x_j \in \mathcal{B}} \exp\big(\langle z_i, x_j \rangle / \tau\big)}$$

Where:
- $\tau$: a temperature hyper-parameter.
- $\langle z_i, x_i \rangle$: the dot product (similarity) between the image embedding $z_i$ and its positive text embedding $x_i$.
- $x_j$ (for $j \neq i$): negative text embeddings, implicitly formed by the other texts within the same training batch $\mathcal{B}$. The sum in the denominator includes the positive pair and all negative pairs.
Text-to-Image Contrastive Loss
Symmetrically, the loss $\mathcal{L}_{x \to z}$ measures text-to-image alignment:

$$\mathcal{L}_{x \to z} = -\log \frac{\exp\big(\langle x_i, z_i \rangle / \tau\big)}{\sum_{z_j \in \mathcal{B}} \exp\big(\langle x_i, z_j \rangle / \tau\big)}$$

Where $z_j$ (for $j \neq i$) represents negative image embeddings from the batch $\mathcal{B}$.
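A minimal sketch of this symmetric InfoNCE objective over a batch of pooled text and image embeddings; the pooling choice, normalization, and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def mmcl_loss(text_emb: torch.Tensor, image_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric image-text InfoNCE loss over a batch.

    text_emb, image_emb: (B, D) pooled sentence / image embeddings for B
    aligned pairs; row i of each tensor describes the same example.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = image_emb @ text_emb.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # positives on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return loss_i2t + loss_t2i

# Example with random embeddings for a batch of 8 pairs:
loss = mmcl_loss(torch.randn(8, 768), torch.randn(8, 768))
```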
Temperature-based Sampling
To construct multilingual text batches and balance multiple bilingual corpora, a temperature-based sampling method is used. This involves computing a sampling probability $p_i$ for each corpus $D_i$:

$$p_i = \frac{\big( |D_i| / |D| \big)^{1/T}}{\sum_{k} \big( |D_k| / |D| \big)^{1/T}}$$

Where $|D_i|$ is the size of dataset $D_i$, and $|D| = \sum_k |D_k|$ is the total size of all datasets. The temperature $T$ used for sampling gradually increases over epochs:

$$T_e = \min\!\Big( T_{\max},\; T_0 + \frac{e}{E_w}\,\big(T_{\max} - T_0\big) \Big)$$

Where $T_0$ is the initial temperature, $T_{\max}$ is the peak temperature, $e$ is the current epoch, and $E_w$ is the number of warming-up epochs. As the temperature rises, the sampling distribution flattens, staying close to proportional sampling early in training and progressively giving more weight to smaller, low-resource corpora.
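A minimal sketch of this sampling schedule; the linear warm-up form and the default values for the initial temperature, peak temperature, and warm-up epochs are assumptions:

```python
import numpy as np

def sampling_probs(sizes, temperature):
    """Temperature-based sampling probabilities over corpora of given sizes."""
    sizes = np.asarray(sizes, dtype=float)
    ratios = sizes / sizes.sum()
    scaled = ratios ** (1.0 / temperature)
    return scaled / scaled.sum()

def temperature_at_epoch(epoch, t_init=1.0, t_peak=5.0, warmup_epochs=5):
    """Linearly warm the sampling temperature up to its peak value."""
    return min(t_peak, t_init + (t_peak - t_init) * epoch / warmup_epochs)

# Example: three corpora with very different sizes.
sizes = [290_000, 29_000, 2_900]
for epoch in [0, 2, 5]:
    t = temperature_at_epoch(epoch)
    print(epoch, t, sampling_probs(sizes, t).round(3))
```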
Multilingual Multimodal Augmentation
To create challenging examples for MMCL and learn robust representations, m3P employs multilingual multimodal augmentation (MMA). This involves augmenting both images and text.
- Image Augmentation: An augmentation function applies various transformations to the original image $z_i$, including cropping, resizing, rotation, cutout, color distortion, Gaussian blur, and Sobel filtering. Additionally, masked image modeling is performed by masking chosen image patches sampled from a uniform distribution.
- Multilingual Text Augmentation: For text, masked language modeling is applied by randomly masking contiguous spans of tokens in the source sentence. The augmented source sentence and augmented image are then used in MMCL to learn specific representational invariances, making the model more robust to variations in input (a small sketch of both augmentations follows this list).
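A minimal sketch of both augmentations; the span length, mask ratio, and the particular image transforms chosen here are assumptions rather than the paper's exact settings:

```python
import random
from torchvision import transforms

MASK_TOKEN = "<mask>"  # XLM-R's mask token string, used here illustratively

def mask_text_spans(tokens, span_len=3, mask_ratio=0.15):
    """Randomly mask contiguous spans of tokens (masked language modeling)."""
    tokens = list(tokens)
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = 0
    while masked < n_to_mask:
        start = random.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            tokens[i] = MASK_TOKEN
        masked += span_len
    return tokens

# Image-side augmentation: crop/resize, color distortion, blur; masked image
# modeling would additionally zero out randomly chosen patches.
image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=5),
    transforms.ToTensor(),
])

print(mask_text_spans("a young child is standing alone on some jagged rocks".split()))
```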
Conditional Vision-Language Memory (CVLM)
The CVLM is generated to integrate the visual context into the language representations. It uses a multi-head cross-attention mechanism in which the language representations $H^{L}$ are regarded as the main input and act as the query, while the auxiliary vision representations $H^{V}$ act as the key and value (consistent with the overview above, where the language features query the visual tokens):

$$\mathrm{head}_{i} = \mathrm{softmax}\!\left( \frac{\big(H^{L} W_{i}^{Q}\big) \big(H^{V} W_{i}^{K}\big)^{\top}}{\sqrt{d_{k}}} \right) H^{V} W_{i}^{V}, \qquad H^{C} = \mathrm{Concat}\big(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}\big)$$

Where:
- $H^{C}$: the resulting encoder representations, which constitute the CVLM.
- $\mathrm{Concat}(\cdot)$: the concatenation operator across attention heads.
- $\mathrm{softmax}(\cdot)$: the softmax operation.
- $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$: linear projection matrices for the query, key, and value of the $i$-th attention head.
- $H^{L}$: language representations from the language encoder (used as the query).
- $H^{V}$: vision representations from the vision encoder (used as the key and value).
- $d_{k}$: the number of feature channels, used for scaling the dot product.
These representations, rich in both linguistic and visual information, are then fed into the decoder.
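A minimal sketch of this fusion step using PyTorch's built-in multi-head attention, treating the language states as the query and the vision states as key and value; the residual connection, layer normalization, and head count are assumptions:

```python
import torch
import torch.nn as nn

class CVLM(nn.Module):
    """Conditional Vision-Language Memory via multi-head cross-attention."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_lang: torch.Tensor, h_vis: torch.Tensor) -> torch.Tensor:
        # Language tokens query the visual tokens; the output keeps the text
        # length, so it can serve directly as the encoder memory for the decoder.
        fused, _ = self.cross_attn(query=h_lang, key=h_vis, value=h_vis)
        return self.norm(h_lang + fused)  # residual connection (an assumption)

# Example: 20 text tokens attending over 50 visual tokens.
memory = CVLM()(torch.randn(1, 20, 768), torch.randn(1, 50, 768))
```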
Multilingual Generation
m3P employs a specialized training strategy called Multimodal DropNet (MDropNet) to handle multilingual generation. This strategy involves training the model on three different objectives, emphasizing various aspects of modality integration. The multilingual language decoder is a standard Transformer decoder.
- Text-only Translation: The model is trained to predict target words based solely on the source language representation $H^{L}$:

  $$P\big(y_t \mid y_{<t}, H^{L}\big) \tag{9}$$

  Where $y_t$ is the $t$-th target word conditioned on the previous $t-1$ tokens and the source language context $H^{L}$. This objective trains the text encoder effectively.
- Image Captioning: To align the image and target language, the model is also trained to generate captions based only on the image context $H^{V}$:

  $$P\big(y_t \mid y_{<t}, H^{V}\big) \tag{10}$$

  Here, $y_t$ is the $t$-th target word conditioned on the previous $t-1$ tokens and the image context $H^{V}$.
- Full Multimodal Translation (CVLM-based): The primary translation objective uses the CVLM $H^{C}$, which contains both language and vision information:

  $$P\big(y_t \mid y_{<t}, H^{C}\big) \tag{11}$$

  Here, $y_t$ is the $t$-th target word conditioned on the previous $t-1$ tokens and the combined CVLM $H^{C}$.
During training, the model alternates among these objectives, spending a fixed fraction of the time on Eq. 9 (text-only translation), a fraction on Eq. 10 (image captioning), and the remaining fraction on Eq. 11 (full multimodal translation). This MDropNet strategy ensures that the model learns to perform robustly even when one modality might be less informative or partially missing, and it explicitly promotes image-language alignment and text-only translation capabilities.
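A minimal sketch of one such training step; the equal 1/3 split shown here and the model interface (`encode_text`, `encode_image`, `fuse_cvlm`, `decode`) are illustrative assumptions, not the paper's exact proportions or API:

```python
import random
import torch
import torch.nn.functional as F

def mdropnet_step(batch, model, probs=(1/3, 1/3, 1/3)):
    """One MDropNet training step: randomly pick which context the decoder sees."""
    h_lang = model.encode_text(batch["src_tokens"])   # hypothetical helpers
    h_vis = model.encode_image(batch["images"])
    mode = random.choices(["text", "image", "full"], weights=probs)[0]
    if mode == "text":          # Eq. 9: text-only translation
        memory = h_lang
    elif mode == "image":       # Eq. 10: image captioning
        memory = h_vis
    else:                       # Eq. 11: full multimodal translation via CVLM
        memory = model.fuse_cvlm(h_lang, h_vis)
    logits = model.decode(batch["tgt_tokens"], memory)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), batch["tgt_labels"].view(-1)
    )
```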
Training Objective
The overall training objective for m3P combines the multilingual multimodal translation objective (Equation 1) and the multilingual multimodal contrastive objective (Equation 2) with a balancing coefficient $\lambda$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MT}} + \lambda \, \mathcal{L}_{\mathrm{CL}}$$

Where $\lambda$ is a coefficient that balances the contribution of the translation loss and the contrastive learning loss. This joint optimization allows the model to simultaneously learn effective multimodal translation and strong cross-modal, cross-lingual alignments.
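Putting the pieces together, one training step under this joint objective might look like the following sketch; it reuses the hypothetical `mdropnet_step` and `mmcl_loss` helpers from the sketches above, and the pooling helpers and λ value are assumptions:

```python
def total_loss(batch, model, lam: float = 1.0, tau: float = 0.07):
    """Joint objective: translation loss plus weighted MMCL contrastive loss."""
    mt_loss = mdropnet_step(batch, model)                 # Eq. 9/10/11
    text_emb = model.pool_text(batch["src_tokens"])       # hypothetical pooled sentence embeddings
    image_emb = model.pool_image(batch["images"])         # hypothetical pooled image embeddings
    cl_loss = mmcl_loss(text_emb, image_emb, tau)         # Eq. 2
    return mt_loss + lam * cl_loss
```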
Experimental Setup
Datasets
The experiments in this paper are conducted on two main datasets: Multi30k and a newly constructed dataset, InstrMulti102.
- Multi30k (Elliott et al., 2016):
  - Origin & Characteristics: This is a widely used benchmark dataset for multimodal machine translation and image captioning. It consists of English sentences describing images from Flickr, along with their translations into other languages. Each sentence pair is associated with a corresponding image.
  - Languages: The dataset originally contains four languages: English (En), German (De), French (Fr), and Czech (Cs).
  - Size: The training set has 29K sentences, and the validation set has 1K sentences.
  - Test Sets: Results are reported on Flickr2016, Flickr2017, Flickr2018, and MSCOCO (Lin et al., 2014).
  - MSCOCO: The MSCOCO test set is considered out-of-domain for Multi30k, as it features different image characteristics and often contains ambiguous verbs, making it more challenging and highlighting the need for visual context.
  - Data Sample (Conceptual): A data sample from Multi30k would be an image (e.g., a photo of a child on rocks) paired with multiple textual descriptions in different languages that describe that image (e.g., "A young child is standing alone on some jagged rocks." (En), "Ein kleines Kind steht allein auf einem zerklüfteten Felsen." (De), etc.). Figure 6 in the paper shows such an example, where an image of a child on rocks is associated with sentences in English, German, French, and Czech.
- InstrMulti102:
  - Origin & Construction: This dataset is newly constructed by the authors to support massively multilingual multimodal machine translation. It originates from the Multi30k English data: the English sentences from Multi30k are translated into 101 other languages using the text-only multilingual Microsoft translator (Yang et al., 2021a). This effectively extends the multimodal dataset (images paired with English descriptions) to 101 additional languages, enabling experiments with a total of 102 languages.
  - Purpose: To break the limits of existing multimodal machine translation benchmarks, which are typically restricted to only a few languages, and allow for evaluation in a truly massively multilingual setting.
  - Languages: 102 languages.
Evaluation Metrics
The primary evaluation metric used in this paper is sacreBLEU, a standard metric for machine translation quality.
- Conceptual Definition: BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. It measures the n-gram precision between the machine-generated translation (candidate) and one or more human-created reference translations; a higher BLEU score indicates better translation quality. sacreBLEU is a standardized version of BLEU that ensures consistent and reproducible scores across different research implementations, addressing common issues with tokenization and normalization.
- Mathematical Formula: The BLEU score is calculated as a geometric mean of modified n-gram precisions, multiplied by a brevity penalty (BP):

  $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

  Where:
  - $\mathrm{BP}$: the brevity penalty, which penalizes candidate translations that are shorter than the reference translations:
    $$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
    where $c$ is the length of the candidate translation and $r$ is the effective reference corpus length.
  - $N$: the maximum n-gram order considered (typically 4 for BLEU-4).
  - $w_n$: weights for each n-gram precision (typically uniform, i.e., $w_n = 1/N$).
  - $p_n$: the modified n-gram precision, i.e., the count of n-grams in the candidate that also appear in a reference, clipped by the maximum count of that n-gram in any single reference translation, divided by the total number of n-grams in the candidate translation.
- Importance: BLEU is a widely accepted and interpretable metric in machine translation, providing a numerical value that correlates reasonably well with human judgment of translation quality. sacreBLEU further enhances its reliability and comparability across research (a usage sketch follows below).
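For reference, a minimal sketch of computing sacreBLEU with the `sacrebleu` Python package; the example sentences are purely illustrative:

```python
import sacrebleu

hypotheses = ["A young child stands alone on jagged rocks."]
references = [["A young child is standing alone on some jagged rocks."]]

# corpus_bleu expects a list of hypotheses and a list of reference streams.
score = sacrebleu.corpus_bleu(hypotheses, references)
print(score.score)  # case-sensitive, detokenized BLEU
```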
Baselines
The m3P model is compared against several text-only and multimodal baselines to demonstrate its effectiveness. All baselines for fair comparison use XLM-R for language encoders and CLIP for vision encoders where applicable.
Text-only Methods
- BiNMT (Vaswani et al., 2017): A standard Transformer model initialized by XLM-R and trained only on a single translation direction (i.e., a separate model for each language pair). This serves as a strong bilingual baseline.
- MNMT (Fan et al., 2021): A multilingual Transformer model, jointly trained on all multilingual text data. It uses a target language symbol prefixed to the input sentence to indicate the desired translation direction. This represents a strong multilingual text-only baseline.
Multimodal Methods
- BiNMT (Vaswani et al., 2017) + Concatenation: A bilingual Transformer model where language and visual features are simply concatenated before being fed into the decoder or subsequent layers. This is a basic approach to incorporating visual context in a bilingual setting.
- MNMT (Gated Fusion) (Li et al., 2021a): A multilingual NMT model that incorporates visual context using a gated fusion unit. This unit selectively combines text and visual features based on learned gates, allowing the model to dynamically decide how much visual information to use.
- MNMT (Concatenation) (Li et al., 2021a): Similar to the bilingual concatenation approach, but applied in a multilingual setting. Visual context is integrated by concatenating visual features with language features.
- mRASP2 (Pan et al., 2021): This method primarily focuses on text-only contrastive learning for multilingual translation. The paper applies mRASP2's contrastive learning scheme to a multimodal context for comparison.
- Selective Attn (Li et al., 2022): This approach uses a single-head attention network to correlate words with image patches, aiming for more granular and relevant visual context integration.
- LVP-M3 (Guo et al., 2022b): Language-aware Visual Prompt for Multilingual Multimodal Machine Translation. This method uses visual prompts that are informed by the source language to guide multimodal translation.
Training and Evaluation Details
- Encoder-Decoder Setting:
  - Language Encoder: Initialized by XLM-R (Conneau et al., 2020).
  - Vision Encoder: Initialized by CLIP (Radford et al., 2021).
- Decoder-only Setting:
  - Text Generation: Uses Llama2 (Liu et al., 2023b).
  - Vision Extractor: Uses CLIP.
- Optimizer: Adam (Kingma and Ba, 2015).
- Learning Rate: scheduled with 4,000 warm-up steps.
- Label Smoothing: Cross-entropy loss with a smoothing ratio of 0.1.
- Model Architecture: The vision encoder, language encoder, and language decoder each consist of 12 layers with a hidden size of 768. They share the same embedding matrix.
- Multilingual Training:
  - Batch Size: 2048 tokens.
  - Hardware: 8 Tesla V100 GPUs.
- Evaluation Metric: Case-sensitive detokenized sacreBLEU (a small configuration sketch follows below).
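A minimal sketch of an optimizer and warm-up schedule consistent with these settings; the peak learning rate, β values, and inverse-square-root decay are assumptions, since the exact values are not restated here:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the full m3P model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))  # values assumed

def lr_lambda(step, warmup=4000):
    """Inverse-sqrt schedule with a linear warm-up (a common Transformer default)."""
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing ratio from the paper
```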
Results & Analysis
The experimental results demonstrate the superior performance of m3P across various settings, including standard multilingual translation, out-of-domain scenarios, massively multilingual setups, and low-resource conditions.
Flickr Test Set
Table 1 presents the sacreBLEU scores on the Flickr2016 test set for the En→{Fr, Cs, De} and {Fr, Cs, De}→En translation directions.
Table 1: En→X and X→En evaluation results for bilingual and many-to-many models on the Flickr2016 test set.
| Setting | Model | En→Fr | En→Cs | En→De | Fr→En | Cs→En | De→En | Avg6 |
|---|---|---|---|---|---|---|---|---|
| Only Trained on Text Data | ||||||||
| 1→1 | BiNMT (Vaswani et al., 2017) | 63.3 | 33.4 | 39.9 | 54.0 | 41.1 | 43.8 | 45.9 |
| N→N | MNMT (Fan et al., 2021) | 63.8 | 34.0 | 40.2 | 52.0 | 41.3 | 42.5 | 45.6 |
| Trained on Text and Vision Data | ||||||||
| 1→1 | BiNMT (Vaswani et al., 2017) | 63.5 | 33.0 | 40.3 | 55.1 | 41.8 | 44.1 | 46.3 |
| N→N | MNMT (Gated Fusion) (Li et al., 2021a) | 63.8 | 34.4 | 41.0 | 51.5 | 41.1 | 43.3 | 45.8 |
| MNMT (Concatenation) (Li et al., 2021a) | 63.0 | 33.8 | 38.8 | 53.3 | 43.6 | 44.0 | 46.1 | |
| mRASP2 (Pan et al., 2021) | 63.8 | 34.4 | 41.3 | 53.2 | 44.0 | 44.5 | 46.9 | |
| Selective Attn (Li et al., 2022) | 63.5 | 34.4 | 41.3 | 53.2 | 44.0 | 44.5 | 46.8 | |
| LVP-M³ (Guo et al., 2022b) | 63.4 | 34.1 | 41.4 | 53.2 | 44.0 | 44.5 | 46.8 | |
| M3P (Encoder-Decoder) | 64.8 | 35.2 | 41.8 | 53.8 | 44.8 | 45.0 | 47.6 | |
| M3P (Decoder-only) | 66.4 | 38.1 | 43.5 | 56.7 | 46.9 | 48.1 | 49.9 |
Table 2: En→X and X→En evaluation results for bilingual and many-to-many models on the Flickr2017 and MSCOCO test sets.
| Setting | Model | En→Fr | En→De | De→En | Fr→En | Avg4 | En→Fr | En→De | Fr→En | De→En | Avg4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr2017 | MSCOCO | ||||||||||
| Only Trained on Text Data | |||||||||||
| 1→1 | BiNMT (Vaswani et al., 2017) | 55.4 | 34.1 | 39.2 | 43.4 | 43.0 | 45.8 | 32.1 | 40.6 | 34.3 | 38.2 |
| N→N | MNMT (Fan et al., 2021) | 56.8 | * | 40.3 | 44.6 | 44.2 | 45.9 | 31.9 | 41.6 | 34.6 | 38.5 |
| Trained on Text and Vision Data | |||||||||||
| 1→1 | BiNMT (Vaswani et al., 2017) | 55.8 | * | 39.6 | 43.6 | 43.4 | 45.8 | 32.3 | 41.6 | 34.4 | 38.5 |
| MNMT (Gated Fusion) (Li et al., 2021a) | * | 34.6 | * | * | * | 46.8 | * | * | 34.5 | 39.0 | |
| MNMT (Concatenation) (Li et a., 2021a) | 56.8 | 34.3 | 40.3 | 44.2 | 43.9 | 46.4 | 32.5 | 42.2 | 34.1 | 38.9 | |
| mRASP2 (Pan et al., 2021) | 56.4 57.0 | 34.0 | 39.4 | 43.8 | 43.4 | * | 32.6 | 42.4 42.3 | 34.8 | 39.2 | |
| Selective Attn Li et al. 2022) | 56.6 | 35.1 | 39.6 40.3 | 44.1 44.4 | 43.9 43.9 | 47.1 46.8 | 32.7 32.5 | 42.5 | 34.3 | 39.0 | |
| LVP-M³ (Guo et al., 2022) | * | 34.2 | 40.4 | 44.7 | 44.2 | 46.8 | 32.5 | 42.6 | 34.5 | 39.1 | |
| * | 57.4 | 34.4 | 41.0 | 45.6 | 44.8 | * | * | 43.2 | 35.2 | 39.6 | |
| M3P (Encoder-Decoder) | 57.4 | 35.3 | 42.2 | 46.5 | 46.1 | 46.8 47.4 | 33.1 34.2 | * | * | * | |
| M3P (Decoder-only) | 58.3 | 37.2 | * | * | * | * | * | 44.5 | 36.2 | 40.6 |
(Note: Some entries in the original Table 2 were incomplete or had formatting issues, marked with * or transcribed as they appeared if ambiguous multiple values were present, such as "42.4 42.3".)
Analysis:
- m3P (Encoder-Decoder and Decoder-only variants) consistently outperforms all baselines across all 6 translation directions on Flickr2016 (Table 1) and on Flickr2017 (Table 2).
- On Flickr2016, m3P (Encoder-Decoder) achieves an average BLEU of 47.6, significantly higher than the best previous multimodal baseline (mRASP2 with 46.9) and text-only MNMT (45.6).
- The Decoder-only variant of m3P demonstrates even more substantial gains, reaching an average BLEU of 49.9 on Flickr2016, showcasing the potential of integrating multimodal prompts with LLMs. This represents an improvement of roughly 3 BLEU points over the best previous multimodal method (mRASP2 at 46.9) and over 4 BLEU points over text-only MNMT (45.6).
- Previously, text-only MNMT models often underperformed their bilingual counterparts on average (e.g., 45.6 vs. 45.9 for text-only models on Flickr2016). m3P reverses this trend, showing that properly integrated visual context can push MNMT beyond bilingual performance.
MSCOCO Test Set
Table 2 also presents results on the MSCOCO test set.
Analysis:
- MSCOCO is a more challenging out-of-domain dataset with ambiguous verbs, requiring stronger reliance on visual context for disambiguation.
- m3P (Decoder-only) achieves the highest average BLEU of 40.6 on MSCOCO, outperforming the best previous multimodal baseline (LVP-M3 at 39.1) and text-only MNMT (38.5).
- This strong performance on MSCOCO highlights m3P's ability to effectively leverage visual information to resolve ambiguities, confirming the intuition that visual context is crucial when textual information alone is insufficient.
Massively Multilingual Translation
Table 3 presents the average sacreBLEU results on the InstrMulti102 dataset, which includes 101 translation directions, focusing on translations to English from Chinese (Zh), Hindi (Hi), and Thai (Th).
Table 3: Massively multilingual translation average results (101 translation directions) on InstrMulti102.
| Model | Zh→En | Hi→En | Th→En | Avg101 |
|---|---|---|---|---|
| Text-only MNMT | 14.3 | 13.5 | 11.1 | 14.3 |
| MNMT (Gated Fusion) | 15.2 | 14.3 | 12.1 | 15.4 |
| MNMT (Concatenation) | 15.1 | 14.6 | 13.1 | 15.8 |
| M3P (Encoder-Decoder) | 16.8 | 15.2 | 14.8 | 18.2 |
| M3P (Decoder-only) | 18.2 | 16.4 | 16.5 | 21.2 |
Analysis:
- All multilingual models with visual context (including MNMT (Gated Fusion) and MNMT (Concatenation)) perform better than the text-only MNMT baseline. This confirms the general benefit of visual information in massively multilingual setups.
- m3P (Encoder-Decoder and Decoder-only) further extends this improvement. m3P (Decoder-only) achieves an Avg101 BLEU of 21.2, a gain of 6.9 BLEU points over text-only MNMT (14.3) and 5.4 points over the best previous multimodal baseline (MNMT (Concatenation) at 15.8).
- This demonstrates that m3P effectively scales to a large number of languages (102), successfully projecting visual features into a shared semantic space and leading to substantial quality enhancements even in challenging low-resource and massively multilingual contexts.
Ablation Study
Performance on Different Backbones
Table 4 compares m3P's performance using different vision backbones (CNN-based ResNet and Transformer-based ViT).
Table 4: Comparison of different vision backbones (e.g., CNN and Transformer backbones) on the Flickr2016 test set.
| Model | En→Fr | En→De | Fr→En | De→En |
|---|---|---|---|---|
| Text-only MNMT | 63.8 | 40.2 | 52.0 | 42.5 |
| ResNet50 | 64.2 | 40.6 | 52.3 | 43.1 |
| ResNet101 | 64.4 | 40.8 | 52.4 | 43.4 |
| ViT-B/32 | 64.8 | 41.6 | 53.8 | 45.0 |
| ViT-B/16 | 65.1 | 41.8 | 53.6 | 44.8 |
| ViT-B/14 | 65.2 | 41.9 | 53.4 | 45.2 |
Analysis:
- All multimodal variants (ResNet and ViT backbones) outperform the text-only MNMT, reinforcing the value of visual context.
- m3P with Transformer-based ViT backbones consistently outperforms CNN-based ResNet backbones. This suggests that the Transformer architecture's ability to capture global dependencies is beneficial for visual feature extraction in this multimodal context.
- Among ViT models, ViT-B/14 (smaller patch size) generally achieves slightly better performance than ViT-B/16 and ViT-B/32. However, ViT-B/14 generates longer visual token sequences, increasing computational cost. The authors recommend ViT-B/32 for efficiency or ViT-B/16 for a good balance of performance and efficiency.
Effect of Different Modules
Table 5 summarizes an ablation study on Flickr2016, showing the contribution of each proposed module. The final model is denoted as m3P.
Table 5: Ablation study of the different modules on Flickr2016. ① is the final model of our method.
| ID | Model | En→De | De→En |
|---|---|---|---|
| ① | m3P (our method) | 41.6 | 45.0 |
| ② | ① - MMCL | 41.2 | 44.6 |
| ③ | ② - CVLM | 40.8 | 44.0 |
| ④ | ③ - MDropNet | 40.5 | 43.8 |
| ⑤ | ④ - Multilingual Training | 40.1 | 43.2 |
Analysis:
The ablation study demonstrates a progressive learning approach, where each module contributes positively to the overall performance:
- ⑤ - Multilingual Training: The baseline starts with a multilingual model trained on multilingual data (without other proposed modules), achieving 40.1 (En→De) and 43.2 (De→En).
- ④ - MDropNet: Adding the Multimodal DropNet training strategy (randomly training with visual-only, text-only, or full multimodal objectives) improves performance to 40.5 (En→De) and 43.8 (De→En). This shows the benefit of the alternating training strategy for robust multimodal learning.
- ③ - CVLM: Incorporating the Conditional Vision-Language Memory (where language features query visual tokens) further boosts scores to 40.8 (En→De) and 44.0 (De→En). This validates the effectiveness of explicitly fusing visual context into the encoder representations.
- ② - MMCL: Introducing Multilingual Multimodal Contrastive Learning to explicitly narrow the gap between different languages and modalities yields 41.2 (En→De) and 44.6 (De→En). This indicates that explicit alignment through contrastive learning is crucial.
- ① - m3P (full model): The full m3P model, combining all modules, achieves the highest scores of 41.6 (En→De) and 45.0 (De→En). This validates the cumulative effectiveness of the proposed components and the progressive learning strategy.
Analysis
Distance of Different Languages
Figure 4: Visualization of the sentence average encoder representations of all languages from the multilingual baseline (a) and our multilingual model supervised by the image context (b). Each color denotes one language.
Analysis:
- Figure 4 visualizes t-SNE projections of sentence encoder representations for English, German, French, and Czech.
- Figure 4a (Text-only MNMT): The representations for different languages are clearly separated and form distinct clusters. This indicates that the text-only MNMT model, despite being multilingual, struggles to align the semantic spaces of different languages, leading to significant inter-language distances.
- Figure 4b (m3P with image context): In contrast, m3P draws the representations across the four languages much closer together. The clusters for different languages are significantly more intertwined and overlapping. This empirically confirms that treating the image as a universal language and applying MMCL effectively minimizes the representation distance among multiple languages, projecting them into a more unified semantic space.
Low-resource Setting
Figure 5: BLEU comparison between our method and text-only MNMT on Flickr2016 for (a) En→Fr, (b) En→De, (c) Fr→En, and (d) De→En with different sizes of training data.
Analysis:
- Figure 5 illustrates the BLEU scores of m3P and the text-only MNMT baseline across varying fractions of the training data on Flickr2016.
- For all four translation directions (En→Fr, En→De, Fr→En, De→En), m3P consistently outperforms the text-only MNMT baseline, especially in low-resource scenarios (smaller training fractions).
- When the parallel data size is small (e.g., the smallest training fractions), the baseline model produces unsatisfactory results and its performance is significantly lower, while m3P maintains a much higher BLEU score, demonstrating its robustness and efficiency in learning from limited data.
- In some cases (e.g., Figure 5a, En→Fr), m3P fine-tuned on approximately 90% of the data can even surpass the baseline trained on 100% of the data. This highlights the effectiveness of m3P's pre-trained components (XLM-R, CLIP) and its ability to leverage visual context to compensate for data scarcity, making it highly suitable for low-resource language pairs.
Vision-Language Alignment
Figure 6: Representative examples of vision-language alignment from the CVLM of four languages over image patches. Brighter colors represent a higher attention value.
Analysis:
- Figure 6 visualizes the vision-language alignment learned by the CVLM, showing attention heatmaps between source sentences (in English, German, French, and Czech) and image patches. Brighter colors indicate higher attention values.
- For the example sentence "A young child is standing alone on some jagged rocks," the heatmaps for English (Figure 6b), German (Figure 6c: "Ein kleines Kind steht allein auf einem zerklüfteten Felsen."), French (Figure 6d), and Czech (Figure 6e) all show similar attention patterns.
- Specifically, all four languages tend to focus their attention on the jagged rocks area of the image. This indicates that despite linguistic differences, m3P effectively learns to associate semantically equivalent phrases (e.g., "jagged rocks" and "zerklüfteten Felsen") with the same relevant visual regions.
- This visualization provides strong evidence that MMCL and CVLM successfully force the model to learn similar vision-language attention patterns across different languages, projecting them into a shared semantic space anchored by the image.
Sanity Check on Visual Context
Figure 7: Comparison between the text-only MNMT and m3P when the source sentence is masked with different ratios.
Analysis:
- Figure 7 compares m3P with the text-only MNMT when varying ratios of the source sentence are masked. This experiment highlights the necessity and compensatory role of visual context.
- As the mask ratio increases, the performance of the text-only MNMT model drops sharply, as it loses its primary source of information. When the source sentence is 100% masked, the text-only MNMT cannot perform any translation, resulting in 0 BLEU points.
- In contrast, m3P maintains significantly higher BLEU scores even with substantial masking. When the mask ratio is 100%, m3P still achieves nearly 15 BLEU points. This remarkable resilience is attributed to its ability to leverage the visual context provided by the image.
- Under this extreme scenario (a 100% masked source sentence), m3P essentially performs image captioning, demonstrating that the visual representations from the vision encoder can compensate for the masked words and provide sufficient information for meaningful output, thanks to the MDropNet training strategy. This strongly validates the importance and effectiveness of integrating visual context.
Conclusion & Personal Thoughts
Conclusion Summary
This work introduces m3P, a pioneering model for Multimodal Multilingual Machine Translation that significantly advances the field by supporting 102 languages. The central innovation lies in leveraging images as a universal language to explicitly bridge semantic gaps between diverse linguistic representations. This is achieved through Multilingual Multimodal Contrastive Learning (MMCL), which aligns text and vision modalities, and the generation of Conditional Vision-Language Memory (CVLM), which effectively integrates visual context into the language generation process. The creation of the InstrMulti102 dataset further enables research in massively multilingual multimodal scenarios. Experimental results consistently show that m3P substantially outperforms existing text-only and multimodal baselines across various benchmarks, with notable improvements in low-resource and out-of-domain settings. The comprehensive ablation studies and probing experiments validate the contribution of each component and the critical role of visual signals in enhancing multilingual translation robustness.
Limitations & Future Work
The paper does not explicitly detail a "Limitations" section, but some potential limitations and avenues for future work can be inferred:
- Reliance on Synthetic Data for InstrMulti102: The InstrMulti102 dataset is constructed by translating English Multi30k data into 101 other languages using a text-only multilingual Microsoft translator. While practical for scale, translations from automated tools may not always capture nuances and cultural context as accurately as human translations. This could introduce noise or biases that might limit the m3P model's performance on truly organic, human-translated data for all 102 languages.
- Computational Cost for Massively Multilingual Models: Training and deploying models for 102 languages, especially with multimodal inputs, is inherently computationally intensive. While the paper mentions using 8 Tesla V100 GPUs, scaling this further or deploying it in resource-constrained environments could be challenging.
- Generalizability Beyond Images: The current framework focuses on images as the universal modality. Future work could explore integrating other modalities like video, audio, or sensory data, which might offer even richer context for translation.
- Handling Abstract Concepts: While images are good for concrete objects and actions, translating highly abstract concepts or complex philosophical texts might still pose challenges, even with visual cues, if the visual context cannot adequately represent the abstract meaning.
- Bias in Pre-trained Models: m3P relies on pre-trained XLM-R and CLIP. Any inherent biases present in these foundational models (e.g., cultural, gender, or racial biases in image-text associations) could be propagated to the m3P model. Addressing such biases in multimodal, multilingual contexts is a significant challenge.
- Specific Error Analysis: The paper presents aggregate BLEU scores. A detailed error analysis for different language pairs, especially low-resource ones, and types of translation errors could provide deeper insights into the model's shortcomings.
Personal Insights & Critique
The m3P paper presents a compelling and significant step forward in multimodal multilingual machine translation. The idea of using an image as a "central language" is intuitively powerful, and the MMCL and CVLM mechanisms provide concrete methods to realize this intuition within a Transformer-based architecture.
The most impressive aspects are:
- Scalability: The successful demonstration on 102 languages with the InstrMulti102 dataset is a major achievement, pushing the boundaries of MMT beyond the typical handful of languages. This opens up new possibilities for truly global communication.
- Robustness to Low-Resource Settings: The performance gains in low-resource scenarios are particularly impactful. Many languages suffer from a scarcity of parallel text data, and m3P offers a promising avenue to improve translation quality for these languages by leveraging readily available visual information.
- Multimodal Prompting with LLMs: The exploration of m3P with decoder-only LLMs (Llama2) and the substantial performance boost observed for the Decoder-only variant suggest a powerful synergy. This aligns with current trends in AI research towards large, general-purpose models.
- Clear Ablation and Analysis: The paper provides thorough ablation studies and insightful visualizations (t-SNE, attention heatmaps, masked input experiments) that clearly validate the contribution of each module and the underlying principles. The "sanity check on visual context" with masked inputs is particularly effective in demonstrating the compensatory power of visual information.
A minor critique could be the reliance on machine-translated data for
InstrMulti102. While necessary for scale, it means the model is learning from potentially imperfect or less natural translations, which could cap its ultimate performance compared to what might be achieved with human-quality data across all languages. However, given the immense difficulty and cost of collecting such a dataset, this is a pragmatic and understandable approach for initial exploration.
Overall, m3P provides a robust and well-justified framework that contributes significantly to making machine translation more accurate and accessible across a wider spectrum of languages, particularly by effectively harnessing the universal power of visual context. It lays strong groundwork for future research into more sophisticated multimodal and multilingual AI systems.