
Computational Meme Understanding: A Survey

Published: 01/01/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper surveys Computational Meme Understanding (CMU), introducing a comprehensive meme taxonomy and analyzing key tasks such as classification, interpretation, and explanation, while reviewing existing datasets and models, addressing their limitations and key challenges, and suggesting directions for future research.

Abstract

Computational Meme Understanding, which concerns the automated comprehension of memes, has garnered interest over the last four years and is facing both substantial opportunities and challenges. We survey this emerging area of research by first introducing a comprehensive taxonomy for memes along three dimensions – forms, functions, and topics. Next, we present three key tasks in Computational Meme Understanding, namely, classification, interpretation, and explanation, and conduct a comprehensive review of existing datasets and models, discussing their limitations. Finally, we highlight the key challenges and recommend avenues for future work.


In-depth Reading


1. Bibliographic Information

1.1. Title

The paper is titled "Computational Meme Understanding: A Survey." Its central topic is a comprehensive survey of Computational Meme Understanding (CMU).

1.2. Authors

The authors are:

  • Khoi P. N. Nguyen

  • Vincent Ng

    Both authors are affiliated with the Human Language Technology Research Institute, University of Texas at Dallas.

1.3. Journal/Conference

The source does not specify a publication venue. As a survey, the paper reviews and synthesizes existing research in an emerging field; work of this kind typically appears in specialized review journals or in the survey tracks of major conferences.

1.4. Publication Year

The paper was published on January 1, 2024.

1.5. Abstract

This paper surveys Computational Meme Understanding (CMU), an emerging research area focused on the automated comprehension of memes, which has gained significant interest in the last four years. The survey begins by introducing a comprehensive taxonomy for memes, categorizing them along three dimensions: forms, functions, and topics. It then delineates three key tasks within CMU: classification, interpretation, and explanation. For each task, the paper conducts a detailed review of existing datasets and models, critically discussing their limitations. Finally, it identifies major challenges facing the field and proposes several promising avenues for future research.

The original source link is /files/papers/69174c3c110b75dcc59ae048/paper.pdf, which points directly to the PDF of the paper.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the growing importance of Computational Meme Understanding (CMU) in the digital age. Memes, defined here as user-created images overlaid with text, have become a pervasive and novel form of online communication, valued for the amusement they provide and how quickly they can be consumed.

The core problem the paper aims to solve is the automated comprehension of these complex multimodal entities. This problem is crucial for several reasons:

  • Detecting Malicious Content: Memes can be used for harmful purposes, such as spreading hate speech, disinformation, or political manipulation (e.g., influencing elections). Given the sheer volume of online content, human moderation is impractical, necessitating automated detection systems.

  • Improving Human Communication: Memes reflect thoughts and opinions. CMU technologies could help bridge communication gaps, for instance, by allowing teachers to better understand students' perspectives or aiding foreign students in adapting to new cultures by comprehending local jokes and expressions.

    However, CMU presents significant challenges:

  • Multimodal Integration: Systems must seamlessly recognize and combine both textual and visual elements within a meme.

  • Broad and Deep Knowledge: Fully understanding a meme's message often requires extensive knowledge of current events, internet subcultures, meme cultures, and real-world context.

  • Figurative Language: Memes frequently employ figurative language, requiring systems to "read between the lines" to decode their true meaning.

    The paper identifies a gap in existing literature, noting that while related surveys exist (e.g., on computational propaganda, multimodal disinformation, hate speech, humor generation), none offer a comprehensive review specifically dedicated to CMU that covers the breadth of meme types and technical tasks. This lack of a unified overview is the paper's entry point.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Comprehensive Taxonomy for Memes: It introduces a new, comprehensive taxonomy for memes, categorizing them along three crucial dimensions: forms (e.g., Macros, Shops, Screenshots), functions (e.g., persuading, mocking, praising, based on speech act theory), and topics (e.g., politics, COVID-19, misogyny). This taxonomy provides a unified language for classifying memes, which was previously lacking in NLP research.

  • Delineation of Key CMU Tasks: The survey clearly defines and categorizes the core tasks in Computational Meme Understanding into classification, interpretation, and explanation.

  • Review of Datasets and Models: It provides an extensive review of existing datasets for CMU tasks, discussing their coverage of meme types, annotation quality, and temporal context. It also reviews state-of-the-art models for these tasks, outlining their approaches, performance, and common limitations (e.g., issues with visual information, lack of background knowledge, hallucinations).

  • Identification of Challenges and Future Work: The paper highlights critical challenges specific to CMU, such as the need for meme-specific and temporal context knowledge, the inherent subjectivity in meme interpretation, and the demand for interpretable models. It then recommends concrete avenues for future research, including richer annotations, leveraging Vision-Language Models (VLMs) for annotation, advanced visual reasoning, active knowledge acquisition, incorporating pragmatics, and extending CMU to animated/video memes and meme generation.

    These findings collectively address the need for a structured understanding of the CMU landscape, offering a foundational reference for researchers entering or working within this field, and guiding future research efforts toward overcoming current limitations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contents of this survey, a reader should be familiar with the following foundational concepts:

  • Memes: In the context of this paper, memes are user-created images overlaid with text. They serve as a prevalent form of online communication, often conveying humor, satire, or opinions. Their multimodal nature (combining visual and textual elements) is a core characteristic.
  • Multimodal Understanding: This refers to the ability of computational systems to process and integrate information from multiple modalities, typically vision (images/videos) and language (text). In CMU, this means understanding how the visual content of a meme interacts with its overlaid text to create a cohesive message.
  • Natural Language Processing (NLP): A field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP techniques are crucial for processing the text components of memes, including sentiment analysis, hate speech detection, and understanding figurative language.
  • Computer Vision (CV): A field of artificial intelligence that enables computers to "see" and interpret visual data from the world, such as images and videos. CV techniques are essential for analyzing the visual components of memes, including object recognition, scene understanding, and extracting visual cues.
  • Figurative Language: Language that deviates from conventional literal meaning to achieve a special effect, such as metaphor, simile, irony, sarcasm, and allusion. Memes frequently employ figurative language, making its comprehension a significant challenge for CMU systems.
  • Hate Speech: Abusive or threatening language, gestures, or conduct that expresses prejudice against a particular group, especially on the basis of ethnicity, religion, sexual orientation, or disability. Detecting hate speech in memes is a major application area of CMU.
  • Classification: A supervised machine learning task where an algorithm learns to assign a category or label to new observations from a set of predefined categories. In CMU, this often involves classifying memes as hateful or non-hateful, offensive or not offensive, or by emotion or genre.
  • Text Generation: An NLP task where a model produces human-like text. In CMU, this is applied to meme interpretation (generating a message for the meme) and meme explanation (generating a reason for a specific label).
  • Speech Act Theory: A theory in linguistics and philosophy of language that analyzes language in terms of the actions or speech acts it performs. These acts go beyond the literal meaning of words, encompassing intentions like stating, questioning, commanding, or promising. The paper references illocutionary acts (the speaker's intention in making an utterance) from this theory to classify meme functions.
  • Vision-Language Models (VLMs): Advanced deep learning models that are trained on vast amounts of multimodal data (images and corresponding text) to understand and generate content across both modalities. They combine visual encoders (like those from CV) and language models (like those from NLP) and are increasingly used in CMU. Examples include CLIP, Llava, Flamingo.
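To make the role of VLMs concrete, here is a minimal, hedged sketch (not from the paper; the model checkpoint, candidate descriptions, and image path are illustrative assumptions) of zero-shot meme-text matching with CLIP via the Hugging Face transformers library, a common starting point for CMU classification experiments:

```python
# Illustrative only: score a meme image against candidate textual descriptions with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("meme.png")  # hypothetical local meme image
candidates = [
    "a hateful meme attacking a group of people",
    "a harmless joke about everyday life",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=-1)             # distribution over the candidates
print(dict(zip(candidates, probs[0].tolist())))
```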

3.2. Previous Works

The paper explicitly states that there have been no comprehensive surveys specifically on Computational Meme Understanding. However, it acknowledges several tangentially related surveys:

  • Computational Propaganda: Martino et al. (2020) and Ng and Li (2023) focus on propaganda detection, which may involve memes as a medium but is not solely dedicated to meme comprehension. Propaganda often aims to manipulate public opinion through various media.

  • Multimodal Disinformation and Fact-checking: Alam et al. (2022) and Akhtar et al. (2023) survey the detection of false or misleading information across multiple modalities. Memes can be a vehicle for disinformation, making these areas relevant, but their scope is broader than just memes.

  • Hate Speech: Schmidt and Wiegand (2017) provide a survey on hate speech detection. While detecting hateful memes is a significant task within CMU, this survey primarily focuses on text-based hate speech.

  • Humor Generation: Amin and Burghardt (2020) cover humor generation, which is related to the function of many memes. However, generating humor is distinct from understanding the full spectrum of meme meanings and purposes.

  • Harmful Memes (Sharma et al., 2022a): This is identified as the closest existing review of memes, published two years prior to this survey. However, the current paper differentiates itself by noting that Sharma et al. (2022a) only concerns harmful memes and associated classification tasks. This implies a much narrower scope compared to the current survey, which aims to cover many more types of memes and more technical tasks.

    The current paper's proactive approach to unifying the understanding of memes across forms, functions, and topics, and defining the core tasks comprehensively, addresses a significant gap left by these previous, more specialized reviews.

3.3. Technological Evolution

The evolution of technologies relevant to CMU can be traced through several key developments:

  • Rise of Social Media and User-Generated Content: The proliferation of platforms like Reddit, Twitter, Facebook, and Instagram has led to an explosion of user-created content, including memes, making them a significant mode of communication and cultural expression.

  • Advancements in Deep Learning: The development of deep neural networks, particularly Convolutional Neural Networks (CNNs) for image processing and Recurrent Neural Networks (RNNs)/Transformers for natural language processing, laid the groundwork for robust multimodal systems.

  • Emergence of Multimodal Learning: Early efforts focused on fusing features from separate vision and language models. This evolved into more sophisticated multimodal fusion techniques, such as cross-attention mechanisms, that allow for richer interaction between modalities.

  • Pre-trained Vision-Language Models (VLMs): A major breakthrough has been the development of large pre-trained models: language models (e.g., BERT, RoBERTa), vision models (e.g., ResNet, ViT), and VLMs (e.g., CLIP, Llava, Flamingo) that integrate vision and language. Trained on massive datasets, these models provide powerful general-purpose representations that can be fine-tuned for specific CMU tasks, often outperforming earlier specialized models.

    This paper's work fits into the current state of technological evolution by surveying how these advanced multimodal AI models are being applied to the unique challenges of meme understanding, moving beyond simple classification to more complex interpretation and explanation tasks.

3.4. Differentiation Analysis

Compared to the related work, the core differences and innovations of this paper's approach lie in its comprehensiveness and structured definition of the field:

  • Broader Scope: Unlike previous reviews that focus on specific aspects like harmful memes (Sharma et al., 2022a) or general propaganda, this survey covers the full spectrum of meme types (via a novel taxonomy) and Computational Meme Understanding tasks.

  • Unified Taxonomy: The introduction of a three-dimensional taxonomy (forms, functions, topics) provides a foundational, unified language for researchers to classify and discuss memes, addressing the lack of a standardized framework in NLP research. This structured view allows for a more holistic understanding of the diverse nature of memes.

  • Task Categorization: By clearly delineating classification, interpretation, and explanation as key CMU tasks, the paper helps structure the research landscape, allowing for better comparison and focused development within each area.

  • Critical Review and Future Outlook: The survey not only synthesizes existing work but also critically assesses the limitations of current datasets and models, and proactively identifies key challenges and promising future research directions (e.g., meme-specific knowledge, temporal context, interpretability, pragmatics, video memes). This forward-looking perspective is crucial for an emerging field.

    In essence, this paper serves as a foundational "state-of-the-art" document for CMU, providing a much-needed organizational framework and roadmap for future research, which existing, narrower surveys could not achieve.

4. Methodology

As a survey paper, this work does not propose a novel computational methodology or algorithm in the traditional sense. Instead, its "methodology" lies in its structured approach to synthesizing and analyzing the existing body of research in Computational Meme Understanding (CMU). The paper systematically organizes the field as described in the subsections below.

4.1. Principles

The core idea of the method used in this survey is to provide a comprehensive and structured overview of an emerging research area. The theoretical basis is that a nascent field benefits from a unified framework to guide future work, identify gaps, and consolidate disparate research efforts. The intuition is that by categorizing the subject matter (memes), the problems (tasks), and the existing solutions (datasets and models), researchers can gain a clearer understanding of the landscape, facilitating targeted innovation and collaboration.

4.2. Core Methodology In-depth (Layer by Layer)

The paper employs a multi-layered approach to survey Computational Meme Understanding, dissecting the field into foundational elements, core tasks, and current technical approaches.

4.2.1. Introduction of a Comprehensive Taxonomy for Memes

The survey begins by establishing a unified language for describing memes, crucial given the diverse and evolving nature of these multimodal entities. This taxonomy is organized along three dimensions:

  • Forms: This dimension categorizes memes based on their visual structure and how meaning is created.

    • Remixed Images: Memes created through image manipulation.
      • Macros: A base template with text at the top (premise) and bottom (punchline).
      • Shops: Images manipulated by adding elements or graphical edits (e.g., "Photoshop").
      • Annotated Stills, Demotivationals, Quotes, Text.
      • Stacked Images: Multiple remixed images combined.
    • Stable Images: Images used as memes without editing.
      • Screenshots: E.g., conversations on social media.
      • Photos: Including Memes IRL (in real life).
      • Drawings and Graphs.

    This categorization, adapted from Milner (2012), helps in recognizing the diverse visual structures that contribute to a meme's meaning. As shown in Figure 1 from the original paper, this taxonomy visually organizes these forms:

    Figure 1: Taxonomy of forms for memes, adapted from Milner (2012).

  • Functions: This dimension classifies memes based on their illocutionary acts, i.e., "what the meme does" beyond its literal content.

    • Adapted from Grundlingh (2018), who applied speech act theory to memes.

    • Examples include stating, predicting, stereotyping, disputing, persuading, mocking, praising.

    • This is particularly relevant for harmful memes, where functions like inciting hate or trolling are critical to detect. Appendix A further details this, showing a taxonomy of illocutionary acts for memes (Figure 3 in the paper).

    Figure 3: Taxonomy of illocutionary acts of memes, adapted from Grundlingh (2018). The grayed-out entries (Commissives and Acknowledgements) are illocutionary acts from speech act theory that do not apply to memes.

  • Topics: This dimension categorizes memes by their semantic themes or subjects.

    • Topics can be evergreen (e.g., misogyny, antisemitism) or time-sensitive (e.g., US presidential elections, COVID-19, Russia-Ukraine crisis).
    • The topic dictates the background knowledge and reasoning ability required for understanding.
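To illustrate how the three dimensions could be used in practice, the following minimal sketch (our illustration, not code from the paper; field names and example values are assumptions) represents a meme annotated with its form, functions, and topics:

```python
# Illustrative data structure for the survey's three-dimensional meme taxonomy.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemeAnnotation:
    image_path: str
    overlaid_text: str
    form: str                                            # e.g., "Macro", "Screenshot"
    functions: List[str] = field(default_factory=list)   # illocutionary acts, e.g., "mocking"
    topics: List[str] = field(default_factory=list)      # e.g., "COVID-19", "misogyny"

example = MemeAnnotation(
    image_path="meme.png",                               # hypothetical path
    overlaid_text="ME: I'll start the diet tomorrow",
    form="Macro",
    functions=["mocking"],
    topics=["everyday life"],
)
```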

4.2.2. Delineation of Key CMU Tasks

The survey structures the research landscape by identifying three primary tasks in CMU:

  • Classification:

    • The most prevalent task, typically involving labeling memes with predefined categories.
    • Focus areas include detecting malicious memes (e.g., offensive, trolling, hateful, antisemitic, harmful, misogynous). These are often binary classification tasks.
    • Other classification tasks involve predicting persuasion techniques, targets (e.g., religion, race), emotion types (sarcastic, humorous), figurative language types, roles of people (hero, villain), and meme genres. These are typically multi-class classification problems.
    • Evaluation metrics typically include accuracy, F1-macro score, and Area Under the ROC Curve (ROC AUC).
  • Interpretation:

    • A newer task focused on generating text that captures the final message or meaning of a meme.
    • Currently, only one dataset, MemeCap (Hwang and Shwartz, 2023), specifically addresses this task, referring to it as meme captioning.
    • As a text generation task, evaluation can be manual (human evaluation) or automatic using n-gram-based metrics (e.g., BLEU, ROUGE, METEOR) or semantics-based metrics (e.g., BERTScore).
  • Explanation:

    • Involves generating textual explanations for a specific label assigned to a meme.
    • Two main variants are discussed:
      • Explaining why an entity in a harmful meme plays a given role (e.g., hero, villain, victim), as defined by Sharma (2023).
      • Explaining the reason a meme is hateful by identifying the specific target group and describing how hateful feelings are expressed, as defined by Hee et al. (2023). These explanations typically follow a structured pattern that names the target group and then states how the hate is conveyed.
    • These tasks are distinct from interpretation as they involve constrained generation based on provided labels or targets.
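The contrast between the three tasks can be summarized with illustrative type signatures (these are assumptions for exposition, not an API defined by the paper):

```python
# Classification maps a meme to a label; interpretation generates its message;
# explanation generates text conditioned on a given label.
from typing import Dict, List

Meme = Dict[str, str]  # e.g., {"image_path": "meme.png", "overlaid_text": "..."}

def classify(meme: Meme, label_set: List[str]) -> str:
    """Return one label from label_set, e.g., 'hateful' vs. 'not hateful'."""
    ...

def interpret(meme: Meme) -> str:
    """Return free-form text capturing the meme's intended message."""
    ...

def explain(meme: Meme, label: str) -> str:
    """Return text justifying why the meme received `label`, e.g., naming the
    targeted group and describing how the hateful feeling is expressed."""
    ...
```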

4.2.3. Review of Existing Datasets

The survey meticulously reviews 24 commonly used datasets, categorizing them by the CMU tasks they support (classification, interpretation, explanation). It discusses their objectives, number of memes, languages, collection methods, and licenses. A critical analysis of these datasets highlights:

  • Forms Overlooked: Many datasets focus on specific forms (e.g., Macros), potentially limiting model generalizability.
  • Annotation Quality: Concerns regarding inter-annotator agreement (often not reported or low) and the lack of annotation review in some datasets (e.g., MemeCap). It praises approaches like COLLECT-AND-JUDGE (Wiegreffe and Marasovic, 2021) with multiple rounds of training and judging for explanation datasets.
  • Temporal Context: A significant gap is the lack of posted timestamps for memes, hindering the development of models that can handle time-sensitive content.

4.2.4. Overview of Existing Models

The paper then surveys the state-of-the-art models for each CMU task:

  • Classification Models:

    • Approaches: Most systems extract visual and textual features (e.g., using OCR, Google Cloud Vision API, FairFace for entity properties). These features are then encoded into embedding spaces using specialized encoders (ResNet, ViT for vision; BERT, RoBERTa, T5, Llama 2 for language). Modalities are aligned (concatenation, Cross-Attention) and fed into a classification head (e.g., Feedforward Neural Network). Some approaches reduce multimodal problems to text classification by generating textual descriptions of images. A minimal sketch of this encode-fuse-classify recipe is given after this list.
    • Vision-Language Models (VLMs): Recently, pre-trained VLMs (Flamingo, PaLI, GPT4, Llava, OpenFlamingo) have shown strong performance after fine-tuning.
    • Performances: Varies widely, from over 90% accuracy on some benchmarks (e.g., HatefulMemes, WOAH5) to challenging F1 scores as low as 0.58 on others (e.g., SemEval-2021-T6).
    • Common Errors (Appendix C): Misclassification due to lack of context, biased data (leading to biased models), failure to perform complex reasoning, or failure to attend to important visual information.
  • Explanation Models:

    • Approaches: Extend classification models by replacing the classification head with a language decoder to generate text. An example is LUMEN, which uses joint learning for classification and explanation.
    • Performances: Generally low in human evaluation, with correctness scores under 70% for HatReD. Challenges include unreliable visual information extractors and hallucinations. Retrieval augmentation is suggested to incorporate explicit knowledge.
  • Interpretation Models:

    • Approaches: For MemeCap, open-source VLMs have been experimented with.
    • Performances: Models still struggle, similar to explanation tasks. Errors stem from failure to attend to important visual elements and lack of sufficient background knowledge.
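The encode-fuse-classify recipe referenced above can be sketched as follows (a simplified illustration under our own assumptions about encoders and dimensions, not a specific published system):

```python
# Minimal concatenation-fusion meme classifier: BERT for the overlaid text,
# ResNet-50 for the image, and a feedforward head over the concatenated features.
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoModel

class ConcatFusionMemeClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")  # 768-d [CLS]
        vision = models.resnet50(weights=None)   # pretrained weights could be loaded instead
        vision.fc = nn.Identity()                # keep the 2048-d pooled image features
        self.vision_encoder = vision
        self.head = nn.Sequential(
            nn.Linear(768 + 2048, 512), nn.ReLU(), nn.Linear(512, num_labels)
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                # [CLS] representation of the meme text
        image_feat = self.vision_encoder(pixel_values)
        return self.head(torch.cat([text_feat, image_feat], dim=-1))
```

Cross-attention fusion or a pre-trained VLM backbone would replace the simple concatenation step; for explanation and interpretation, the classification head would be swapped for a language decoder.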

4.2.5. Identification of Key Challenges and Future Work

The survey concludes by outlining pressing challenges and proposing future research directions, derived from the comprehensive analysis:

  • Meme-specific Knowledge: Need for systems to acquire "insider's knowledge" of meme cultures and of how meme forms contribute to meaning (e.g., Macros, Stacked Images).

  • Temporal Context: Models must account for the post date of a meme and acquire up-to-date knowledge while also being able to "think in the past" for historical memes.

  • Subjectivity in Interpretation: Memes can have multiple valid interpretations, posing challenges for annotation and model output. The goal is to output the most popular interpretations.

  • Interpretable Models: The need for models to explain their outputs, potentially mimicking human multi-step reasoning processes, especially for sensitive tasks like flagging harmful memes.

  • Richer Annotations: Collecting training data that represents human reasoning processes.

  • Improving Annotation Procedures with VLMs: Using VLMs to draft initial annotations for human review.

  • Next Level of Visual Reasoning: Teaching models to attend to the "right" visual details by incorporating explicit visual cues in annotations.

  • Active Knowledge Acquisition: Developing systems that can continuously learn meme cultures (e.g., from Know Your Meme) and topic-specific background knowledge (e.g., via Retrieval Augmentation); a toy retrieval sketch appears after this list.

  • Connection to Pragmatics: Leveraging concepts like presuppositions, deixis, and social-context grounding to enrich model understanding.

  • Towards Processing Animated and Video Memes: Extending CMU to GIFs and short videos, which present complex temporal and multimodal challenges.

  • Meme Generation: Exploring meme generation as a measure of understanding, for humanizing interfaces, and for digital marketing.

    This structured approach forms the "methodology" of the survey, systematically moving from defining the domain to analyzing existing work and charting future paths.
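As a concrete illustration of the Retrieval Augmentation direction above (a toy sketch with an invented in-memory knowledge base, not the paper's method), background knowledge about a meme template can be retrieved and prepended to a VLM prompt:

```python
# Toy retrieval augmentation: find the most relevant knowledge-base entry for a meme's
# text (standing in for a resource like Know Your Meme) and build an enriched prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Distracted Boyfriend: a man turns to look at another woman while his girlfriend "
    "disapproves; used to depict abandoning one option for a tempting alternative.",
    "Drakeposting: Drake rejects the top panel and approves the bottom panel; "
    "used to contrast a disliked and a preferred option.",
]

def retrieve_context(meme_text: str, k: int = 1) -> list:
    vectorizer = TfidfVectorizer().fit(knowledge_base + [meme_text])
    kb_vectors = vectorizer.transform(knowledge_base)
    query_vector = vectorizer.transform([meme_text])
    scores = cosine_similarity(query_vector, kb_vectors)[0]
    best = scores.argsort()[::-1][:k]
    return [knowledge_base[i] for i in best]

meme_text = "ME looking at a NEW FRAMEWORK while MY UNFINISHED PROJECT stares at me"
prompt = (
    "Background knowledge: " + " ".join(retrieve_context(meme_text)) + "\n"
    "Meme text: " + meme_text + "\n"
    "What message does this meme convey?"
)
print(prompt)  # this prompt would then be passed, with the image, to a vision-language model
```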

5. Experimental Setup

This section details the experimental setups discussed and reviewed within the survey paper, specifically focusing on the datasets, evaluation metrics, and general baseline approaches used in the Computational Meme Understanding (CMU) literature. The survey itself does not conduct new experiments but rather analyzes those of existing research.

5.1. Datasets

The survey provides a comprehensive overview of 24 commonly used datasets, primarily for classification tasks, but also including interpretation and explanation datasets.

The following are the results from Table 1 of the original paper:

| Dataset and/or Publication | Task | Objective | # Memes | Lang. | Method | License |
|---|---|---|---|---|---|---|
| HatefulMemes (Kiela et al., 2020) | 2C | Hate | 10,000 | E | Synthesis | Custom |
| MUTE (Hossain et al., 2022b) | 2C | Hate | 4,158 | E+Be | Scrape | MIT |
| MMHS150K (Gomez et al., 2019) | 2C | Hate | 150,000 | E | Scrape | Custom |
| Sabat et al. (2019) | 2C | Hate | 5,020 | E | Scrape | CC0 |
| CrisisHateMM (Thapa et al., 2024) | NC | Hate & Target | 4,486 | E | Scrape | MIT |
| WOAH-5 (Mathias et al., 2021) | NC | Hate Type & Target | 10,000 | E | Inherit | Apache-2.0 |
| HarMeme (Pramanick et al., 2021a) | 2C, NC | Harm & Target | 3,544 | E | Scrape | BSD |
| HARM-C&P (Pramanick et al., 2021b) | 2C, NC | Harm & Target | 7,096 | E | Inherit | MIT |
| Giri et al. (2021) | NC | Offensiveness | 6,992 | E | Scrape | Unavailable |
| Shang et al. (2021b) | 2C | Offensiveness | 3,059 | E | Scrape | Unavailable |
| MultiOFF (Suryawanshi et al., 2020a) | 2C | Offensiveness | 743 | E | Scrape | None |
| TamilMemes (Suryawanshi et al., 2020b) | 2C | Trolling | 2,969 | T | Scrape | GNU-3.0 |
| BanglaAbuse (Das and Mukherjee, 2023) | 2C | Abuse | 4,043 | Be | Scrape | MIT |
| Jewtocracy (Chandra et al., 2021a) | 2C, NC | Antisemitism | 6,611 | E | Scrape | Unavailable |
| MAMI (Fersini et al., 2022) | 2C, NC | Misogyny | 11,000 | E | Scrape | Apache-2.0 |
| MIMOSA (Ahsan et al., 2024) | NC | Aggression Target | 4,848 | Be | Scrape | MIT |
| Memotion (Sharma et al., 2020) | NC | Emotion | 10,000 | E | Scrape | MIT |
| FigMemes (Liu et al., 2022) | NC | Figurative Lang. | 5,141 | E | Scrape | None |
| HVVMemes (Sharma et al., 2022b) | NC | Role of Entities | 7,000 | E | Inherit | None |
| MemoSen (Hossain et al., 2022a) | NC | Sentiment | 4,417 | Be | Scrape | Custom |
| SemEval-2021-T6 (Dimitrov et al., 2021) | NC | Persuasion Tech. | 950 | E | Scrape | None |
| HatReD (Hee et al., 2023) | E | Hate | 3,304 | E | Inherit | Custom |
| ExHVV (Sharma et al., 2023) | E | Role of Entities | 4,680 | E | Inherit | CC0-1.0 |
| MemeCap (Hwang and Shwartz, 2023) | I | Meme Captioning | 6,387 | E | Scrape | GPL-3.0 |

Description of Dataset Characteristics:

  • Task: 2C denotes Binary Classification, NC denotes Multi-class Classification, E denotes Explanation, and I denotes Interpretation.
  • Objective: Specifies the target of classification (e.g., Hate, Harm, Offensiveness, Emotion, Role of Entities, Meme Captioning).
  • # Memes: Indicates the size of the dataset.
  • Lang. (Language): E for English, Be for Bengali, T for Tamil. This shows the multilingual nature of some CMU research.
  • Method: Synthesis (created programmatically or with specific guidelines), Scrape (collected from online sources), Inherit (derived from another existing dataset with new annotations).
  • License: Specifies usage rights.

Discussion on Dataset Issues (as per the survey):

  • Forms Overlooked: Many datasets, like HatefulMemes, primarily focus on the Macros form of memes (image with overlaid text). This narrow focus means models trained on these datasets may perform poorly on other meme forms found "in the wild" (e.g., Screenshots, plain text memes). The survey emphasizes that a deliberate control over meme forms is often missing, leaving it unclear if datasets truly cover the diversity of meme structures.
  • Annotation Quality:
    • For classification datasets, inter-annotator agreement is frequently not reported or, when reported (e.g., MAMI with Kappa 0.33), indicates only "fair" agreement, which is problematic given the subjective nature of meme understanding.
    • For the interpretation dataset (MemeCap), annotations were collected via crowdsourcing but without review, raising questions about data quality.
    • For explanation datasets (HatReD, ExHVV), authors used rigorous COLLECT-AND-JUDGE methods (Wiegreffe and Marasovic, 2021) with multiple training rounds and human judges, and even reported inter-judge agreement for HatReD, aiming for higher quality control. However, the survey notes that this method might still suffer from shared biases among judges.
  • Temporal Context: None of the reviewed datasets record the posted timestamps of memes, which is critical for understanding memes within their historical context. While some specify collection date ranges, this doesn't provide the fine-grained temporal information needed for models to "think in the past" or stay updated with rapidly evolving internet trends and real-world events. An example given is the Still your president meme (Figure 2c), whose meaning changes drastically depending on its post-date relative to Trump's presidency.
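Since inter-annotator agreement is a recurring concern above, a hedged toy example (labels invented for illustration) of computing Cohen's kappa, the statistic reported for MAMI, might look like this:

```python
# Cohen's kappa between two annotators labeling the same memes (1 = misogynous, 0 = not).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values in the 0.21-0.40 range are conventionally read as "fair"
```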

5.2. Evaluation Metrics

The survey outlines various evaluation metrics used across the three CMU tasks:

5.2.1. Classification Tasks

For classification, the primary metrics are:

  • Accuracy:

    • Conceptual Definition: Measures the proportion of total predictions that were correct. It's a straightforward measure of overall correctness, but can be misleading in imbalanced datasets.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's predicted label matches the true label.
      • Total Number of Predictions: The total number of instances evaluated.
  • F1-macro Score:

    • Conceptual Definition: The harmonic mean of precision and recall, calculated independently for each class and then averaged. It provides a balanced measure for multi-class or imbalanced classification problems, giving equal weight to each class.
    • Mathematical Formula: $\mathrm{F1}_{\text{macro}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{F1}_{i}$ where $\mathrm{F1}_{i} = 2 \times \frac{\mathrm{Precision}_{i} \times \mathrm{Recall}_{i}}{\mathrm{Precision}_{i} + \mathrm{Recall}_{i}}$, $\mathrm{Precision}_{i} = \frac{\mathrm{TP}_{i}}{\mathrm{TP}_{i} + \mathrm{FP}_{i}}$, and $\mathrm{Recall}_{i} = \frac{\mathrm{TP}_{i}}{\mathrm{TP}_{i} + \mathrm{FN}_{i}}$
    • Symbol Explanation:
      • $N$: The total number of classes.
      • $\mathrm{F1}_{i}$: The F1 score for class $i$.
      • $\mathrm{Precision}_{i}$: The precision for class $i$.
      • $\mathrm{Recall}_{i}$: The recall for class $i$.
      • $\mathrm{TP}_{i}$: True Positives for class $i$ (correctly predicted as class $i$).
      • $\mathrm{FP}_{i}$: False Positives for class $i$ (incorrectly predicted as class $i$).
      • $\mathrm{FN}_{i}$: False Negatives for class $i$ (incorrectly not predicted as class $i$).
  • Area Under the ROC Curve (ROC AUC):

    • Conceptual Definition: Measures the ability of a classifier to distinguish between classes. It's the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. A higher AUC indicates better model performance, regardless of the classification threshold chosen.
    • Mathematical Formula: While there's no single closed-form formula for AUC that directly computes it from a set of TP, FP, TN, FN values (it's typically computed by integrating or summing areas under the ROC curve), the conceptual basis involves TPR and FPR: $ \mathrm{TPR} = \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $ $ \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} $ AUC is the area under the curve formed by plotting (FPR, TPR) pairs for all possible classification thresholds.
    • Symbol Explanation:
      • $\mathrm{TPR}$: True Positive Rate, also known as recall or sensitivity.
      • $\mathrm{FPR}$: False Positive Rate.
      • $\mathrm{TP}$: True Positives.
      • $\mathrm{FP}$: False Positives.
      • $\mathrm{FN}$: False Negatives.
      • $\mathrm{TN}$: True Negatives.
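A hedged, minimal example (toy labels, not results from the paper) of computing these three classification metrics with scikit-learn:

```python
# Accuracy, macro-F1, and ROC AUC for a toy binary hateful-meme classification run.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                     # gold labels (1 = hateful)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                     # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted probability of class 1

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-macro:", f1_score(y_true, y_pred, average="macro"))
print("ROC AUC :", roc_auc_score(y_true, y_score))
```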

5.2.2. Interpretation and Explanation Tasks

For text generation tasks, metrics typically include both human evaluation and automated metrics:

  • BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002):

    • Conceptual Definition: Measures the similarity between a machine-generated text and a set of human-generated reference texts. It primarily focuses on n-gram precision, counting how many n-grams in the candidate text appear in the reference text, with a penalty for short sentences.
    • Mathematical Formula: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$ where $\mathrm{BP} = \min\left(1, \exp\left(1 - \frac{\mathrm{length}_{\text{ref}}}{\mathrm{length}_{\text{cand}}}\right)\right)$ and $p_n = \frac{\sum_{\text{sentence} \in \text{cand}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{sentence} \in \text{cand}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}(\text{n-gram})}$
    • Symbol Explanation:
      • $\mathrm{BP}$: Brevity Penalty, penalizes candidate sentences that are too short compared to the reference.
      • $\mathrm{length}_{\text{ref}}$: Effective reference corpus length.
      • $\mathrm{length}_{\text{cand}}$: Length of the candidate (generated) corpus.
      • $N$: Maximum n-gram order (typically 4).
      • $w_n$: Weight for the $n$-gram precision (often $1/N$).
      • $p_n$: Modified n-gram precision for n-grams of length $n$.
      • $\mathrm{Count}_{\text{clip}}(\text{n-gram})$: Count of n-grams in the candidate that are also present in the reference, clipped to the maximum count in any single reference sentence.
      • $\mathrm{Count}(\text{n-gram})$: Count of n-grams in the candidate.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004):

    • Conceptual Definition: A set of metrics for evaluating summarization and machine translation, primarily focusing on recall. ROUGE-L (L for Longest Common Subsequence) measures the longest common subsequence between the generated and reference texts.
    • Mathematical Formula (for ROUGE-L): $\mathrm{ROUGE\text{-}L} = \frac{(1+\beta^2) \times \mathrm{LCS}_{\text{recall}} \times \mathrm{LCS}_{\text{precision}}}{\mathrm{LCS}_{\text{recall}} + \beta^2 \times \mathrm{LCS}_{\text{precision}}}$ where $\mathrm{LCS}_{\text{recall}} = \frac{\mathrm{LCS}(X, Y)}{\mathrm{length}(Y)}$ and $\mathrm{LCS}_{\text{precision}} = \frac{\mathrm{LCS}(X, Y)}{\mathrm{length}(X)}$
    • Symbol Explanation:
      • $\mathrm{LCS}(X, Y)$: Length of the longest common subsequence between candidate text $X$ and reference text $Y$.
      • $\mathrm{length}(X)$: Length of candidate text $X$.
      • $\mathrm{length}(Y)$: Length of reference text $Y$.
      • $\beta$: A parameter that adjusts the relative importance of precision and recall (usually set to a large value so that recall dominates, or to 1 to weight them equally).
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering) (Banerjee and Lavie, 2005):

    • Conceptual Definition: Measures the similarity between machine-generated and reference texts by aligning words based on exact, stem, synonym, and paraphrase matches, then computing a harmonic mean of precision and recall based on these alignments, with a penalty for fragmentation.
    • Mathematical Formula: $\mathrm{METEOR} = F_{\text{mean}} \times (1 - \mathrm{Penalty})$ where $F_{\text{mean}} = \frac{10 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Recall} + 9 \times \mathrm{Precision}}$ and $\mathrm{Penalty} = 0.5 \times \left(\frac{\text{num\_chunks}}{\text{num\_matches}}\right)^3$
    • Symbol Explanation:
      • $\mathrm{Precision}$: Precision based on matched words.
      • $\mathrm{Recall}$: Recall based on matched words.
      • $\text{num\_chunks}$: Number of "chunks" or contiguous sequences of matched words.
      • $\text{num\_matches}$: Total number of matched words.
  • BERTScore (Zhang et al., 2020):

    • Conceptual Definition: A semantics-based metric that leverages pre-trained BERT embeddings to compute similarity between generated and reference sentences. Instead of discrete n-gram matching, it measures cosine similarity between contextualized token embeddings, offering a more robust assessment of semantic similarity.
    • Mathematical Formula (simplified conceptual overview): For each token $x_i$ in candidate sentence $X$ and each token $y_j$ in reference sentence $Y$: $\mathrm{BERTScore}_{\text{precision}} = \frac{1}{|X|} \sum_{x_i \in X} \max_{y_j \in Y} \cos(\mathrm{E}(x_i), \mathrm{E}(y_j))$ and $\mathrm{BERTScore}_{\text{recall}} = \frac{1}{|Y|} \sum_{y_j \in Y} \max_{x_i \in X} \cos(\mathrm{E}(x_i), \mathrm{E}(y_j))$; $\mathrm{BERTScore}_{\text{F1}}$ is the harmonic mean of precision and recall.
    • Symbol Explanation:
      • $\mathrm{E}(\cdot)$: Embedding function (e.g., from BERT) that maps a token to its contextualized vector representation.
      • $\cos(\cdot, \cdot)$: Cosine similarity between two vectors.
      • $|X|$: Number of tokens in candidate sentence $X$.
      • $|Y|$: Number of tokens in reference sentence $Y$.
  • Human Evaluation: For text generation, human judges assess outputs based on criteria like fluency, correctness, coherence, and adequacy. This is often considered the gold standard, especially given the nuances of meme understanding. The survey specifically mentions Fluent and Correct as metrics in human evaluation for MemeCap.
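To ground the automatic metrics above, here is a small, hedged example (toy sentences, not drawn from any dataset in the survey) that scores a generated interpretation with sentence-level BLEU via NLTK and a direct implementation of the LCS-based ROUGE-L formula (with $\beta$ set to 1); BERTScore would typically be computed with the bert-score package:

```python
# Sentence-level BLEU (NLTK) and a from-scratch ROUGE-L for one candidate/reference pair.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def lcs_length(x, y):
    # Dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

reference = "the meme mocks procrastination by showing a cat ignoring its chores"
candidate = "the meme mocks procrastination with a cat that ignores its chores"

bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print("BLEU   :", round(bleu, 3))
print("ROUGE-L:", round(rouge_l(candidate, reference), 3))
```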

5.3. Baselines

The survey does not define specific "baselines" in the sense of a new model being compared against other existing models. Instead, it reviews the performance of various state-of-the-art models from the literature, which serve as the comparison points within their respective publications. These models broadly fall into categories:

  • Feature-Engineering Approaches: Older methods that relied on manually extracted features from text and images.

  • Multimodal Deep Learning Models: Systems combining specialized vision encoders (e.g., ResNet, ViT) and language encoders (e.g., BERT, RoBERTa) with fusion mechanisms like concatenation or Cross-Attention.

  • Vision-Language Models (VLMs): More recent, large pre-trained models (CLIP, Flamingo, PaLI, Llava, GPT4, OpenFlamingo) that integrate vision and language capabilities from the start, often showing superior performance due to extensive pre-training.

    The comparison is typically performed against other published methods for the same task and dataset, with the objective of achieving state-of-the-art performance on specific benchmarks.

6. Results & Analysis

This section synthesizes the findings presented in the survey regarding the performance and limitations of existing models for Computational Meme Understanding (CMU) tasks.

6.1. Core Results Analysis

The survey reveals a varied landscape of model performance across CMU tasks:

  • Classification Tasks: Model performance for classification tasks shows a wide spectrum. While some benchmarks, particularly for binary classification of harmful content (e.g., HatefulMemes), have seen models achieve high accuracy levels (above 90%), other, more complex multi-class classification tasks remain challenging. For instance, SemEval-2021-T6 (detecting persuasion techniques across 22 classes) shows the best model achieving an F1 score of only 0.58. This indicates that while basic detection of obvious harmful content is progressing, nuanced classification requiring deeper understanding of context and fine-grained categories is still a significant hurdle.
  • Explanation and Interpretation Tasks: Models for meme explanation and interpretation consistently score low, especially in human evaluation. For HatReD, the best systems score under 70% for correctness. This highlights the inherent difficulty of generating free-form text that accurately captures the nuanced meaning or explains the rationale behind a meme's classification. The primary struggles identified are:
    • Failure to Attend to Visual Elements: Models often miss crucial visual cues that are essential for understanding a meme's meaning (Hwang and Shwartz, 2023).

    • Lack of Background Knowledge: Models struggle to leverage meme-specific, cultural, and real-world knowledge, leading to shallow or incorrect interpretations (Hwang and Shwartz, 2023; Hee et al., 2023).

    • Hallucinations: Models generate plausible but incorrect or non-existent information in their explanations or interpretations (Hee et al., 2023).

    • Unreliable Visual Information Extractors: Errors in upstream components (e.g., OCR, object recognition) can propagate and severely impact the quality of explanations.

      Overall, the results suggest that while deep learning models are effective for straightforward classification, tasks requiring generative capabilities, complex reasoning, and deep cultural/temporal understanding still have substantial room for improvement.

6.2. Data Presentation (Tables)

The survey provides two tables detailing the state-of-the-art models and their performance.

The following are the results from Table 2 of the original paper:

| Publication (state-of-the-art model) | Dataset | Task | Reported score(s) (F1 / Acc / AUC) |
|---|---|---|---|
| Hu et al. (2024) | Hateful Memes (Kiela et al., 2020) | B | 0.90, 0.81 |
| Zia et al. (2021) | WOAH5 (Mathias et al., 2021) | NC Target | 0.96 |
| Mathias et al. (2021) | | NC Target | 0.97 |
| Zia et al. (2021) | | NC Attack type | 0.91 |
| Mathias et al. (2021) | MAMI (Fersini et al., 2022) | NC Attack type | 0.91 |
| Cao et al. (2023) | | B | 0.74, 0.84 |
| Zhang and Wang (2022) | | B | 0.83 |
| Zhang and Wang (2022) | HarMeme (Pramanick et al., 2021a) | NC Target | 0.73 |
| Cao et al. (2023) | | B | 0.91 |
| Pramanick et al. (2021a) | | NC Level | 0.76, 0.54 |
| Pramanick et al. (2021a) | HARM-C (Pramanick et al., 2021b) | NC Target | 0.76, 0.66 |
| Lin et al. (2024) | | B | 0.87, 0.86 |
| Pramanick et al. (2021b) | | NC Level | 0.77, 0.55 |
| Pramanick et al. (2021b) | HARM-P (Pramanick et al., 2021b) | NC Target | 0.78, 0.70 |
| Lin et al. (2024) | | B | 0.91, 0.91 |
| Pramanick et al. (2021b) | | NC Level | 0.87, 0.67 |
| Pramanick et al. (2021b) | Jewtocracy (Chandra et al., 2021a) | NC Target | 0.79, 0.72 |
| Chandra et al. (2021b) | | B (Twitter) | 0.69 |
| Chandra et al. (2021b) | | B (Gab) | 0.91 |
| Chandra et al. (2021b) | | NC (Twitter) | 0.68, 0.67 |
| Lee et al. (2021) | MultiOFF (Suryawanshi et al., 2020a) | NC Gab | 0.65 |
| Suryawanshi et al. (2020b) | TamilMemes (Suryawanshi et al., 2020b) | B | 0.52 |
| Gomez et al. (2019) | MMHS150K (Gomez et al., 2019) | B | 0.68, 0.73 |
| Sabat et al. (2019) | Sabat et al. (2019) | B | 0.70 |
| Giri et al. (2021) | Giri et al. (2021) | B | 0.83 |
| Giri et al. (2021) | | B | 0.71 |
| Shang et al. (2021a) | Shang et al. (2021a) | NC BR | 0.99 |
| Shang et al. (2021a) | | NC BG | 0.73 |
| Feng et al. (2021) | SemEval-2021-T6 (Dimitrov et al., 2021) | NC Sentiment | 0.70, 0.49 |
| Sharma et al. (2020) | Memotion (Sharma et al., 2020) | NC 3 | 0.55 |
| Sharma et al. (2020) | | NC Sentiment | 0.58 |
| Sharma et al. (2020) | | NC Humor | 0.35 |
Note: In the Task column for Table 2, B denotes Binary Classification and NC denotes Multi-class Classification, followed by the specific classification target (e.g., Target, Attack type, Level, Sentiment, Humor, Gab, Twitter, BR for Binary/Reddit, BG for Binary/Gab, 3 for 3-class classification).

The following are the results from Table 3 of the original paper:

| Dataset | Model | BLEU | ROUGE-L | BERTScore | Fluent | Correct |
|---|---|---|---|---|---|---|
| HatReD | Text-only: RoBERTa-base | 0.177 | 0.389 | 0.480 | 0.975 | 0.544 |
| HatReD | Text-only: T5-Large | 0.190 | 0.392 | 0.479 | 0.926 | 0.622 |
| ExHVV | LUMEN | 0.313 | 0.294 | 0.902 | | |
| MemeCap | Open-Flamingo few-shot | 0.267 | 0.435 | 0.739 | 0.933 | 0.361 |
| MemeCap | Open-Flamingo few-shot | 0.270 | 0.435 | 0.743 | | |
| MemeCap | Llama few-shot | 0.266 | 0.434 | 0.747 | 0.967 | 0.361 |

BLEU, ROUGE-L, and BERTScore are automatic metrics; Fluent and Correct are from human evaluation.

6.3. Ablation Studies / Parameter Analysis

The survey, being a comprehensive review, does not present its own ablation studies or parameter analyses. However, it implicitly discusses the impact of various model components and design choices by highlighting the limitations and common errors observed in the literature. These observations function as an indirect form of analysis on what components or aspects are currently underperforming or need improvement.

For instance, the discussion on common errors in meme classifiers (Appendix C) points to:

  • Lack of context: Models misclassify when they lack the necessary background knowledge. This implies that components designed for context acquisition or reasoning are either insufficient or missing.

  • Biased data: Training on biased datasets leads to models exhibiting similar biases, suggesting that the data sampling and labeling parameters significantly influence model fairness and generalizability.

  • Failure to perform complex reasoning: This indicates that current model architectures or their training paradigms are not robust enough for the intricate logical inferences often required to understand memes.

  • Failure to attend to important visual information: This directly critiques the visual encoding and multimodal fusion components, suggesting they are not effectively highlighting or integrating critical visual cues.

    Similarly, for explanation and interpretation models, the reported issues like unreliable visual information extractors, hallucinations, and the lack of sufficient background knowledge (from the "Performances" sub-sections in "Models") serve as indicators of where current Vision-Language Models (VLMs) and their components (e.g., visual encoders, language decoders, knowledge integration mechanisms) are currently weak and need significant improvements. This collective analysis guides future research toward refining these critical components and addressing these performance bottlenecks.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a timely and comprehensive overview of Computational Meme Understanding (CMU), an emerging field focused on the automated comprehension of multimodal internet memes. The paper's key contributions include the introduction of a three-dimensional taxonomy for memes (forms, functions, and topics), a clear delineation of three core CMU tasks (classification, interpretation, and explanation), and a detailed review of existing datasets and models. It highlights that while CMU has made strides in basic classification, more complex tasks like interpretation and explanation still face significant challenges, particularly related to incorporating meme-specific knowledge, temporal context, and handling the inherent subjectivity and figurative language. The survey concludes by outlining critical future research directions necessary for robust and interpretable CMU systems.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose promising avenues for future research:

Current Limitations (as identified in the paper):

  • Meme-specific Knowledge: Current systems lack the "insider's knowledge" of meme cultures and how specific meme forms convey meaning.
  • Temporal Context: Models struggle to account for the time-sensitive nature of memes, requiring knowledge both up-to-date and able to "think in the past."
  • Subjectivity in Interpretation: Meme interpretation is inherently subjective, making annotation challenging and allowing for multiple plausible correct interpretations.
  • Interpretability: Models currently lack the ability to provide detailed, human-like explanations for their outputs, crucial for building user trust.
  • Dataset Deficiencies: Many datasets overlook the diversity of meme forms, lack consistent high-quality annotations (e.g., inter-annotator agreement often not reported or low), and crucially, lack posted timestamps.
  • Model Performance Gaps: While classification can be high for simple tasks, complex multi-class problems and especially generative tasks (interpretation, explanation) show low performance, suffering from unreliable visual information extractors, hallucinations, and lack of background knowledge.

Recommended Avenues for Future Work:

  • Richer Annotations for More Robust Models:
    • Develop annotation schemes that capture human reasoning processes (multi-step derivations combining textual/visual cues with background knowledge).
    • Collect training data specifically for this new task to facilitate supervised learning of interpretable models.
  • Improving Annotation Procedures with VLMs:
    • Leverage Vision-Language Models (VLMs) to generate initial drafts of annotations, reducing human effort, but investigate the trade-offs between editing VLM outputs vs. writing from scratch.
  • Next Level of Visual Reasoning:
    • Teach models to attend to crucial demographic information and to the decisive visual elements of a meme.
    • Construct datasets that explicitly include textual explanations of which visual details are important and why for meme understanding, guiding visual attention during model training.
  • Active Knowledge Acquisition:
    • Develop methods for models to continuously acquire knowledge about meme cultures (e.g., by leveraging internet databases like Know Your Meme).
    • Explore how Retrieval Augmentation can be used to acquire topic-specific background knowledge for real-time events and implicit associations.
  • Connection to Pragmatics:
    • Integrate pragmatic concepts like presuppositions, deixis, and social-context grounding as features to improve CMU systems, enriching their understanding of contextual meaning.
  • Towards Processing Animated and Video Memes:
    • Extend CMU research to GIFs and short videos, addressing the challenges posed by dynamic visual content and complex temporal relationships.
  • Meme Generation:
    • Explore meme generation as a means to measure model understanding, humanize computer interfaces, and develop technologies for captivating online content creation.

7.3. Personal Insights & Critique

This survey serves as an invaluable resource for anyone entering or working within the domain of Computational Meme Understanding. Its strength lies in its meticulous organization of a rapidly evolving and inherently chaotic subject matter. The proposed taxonomy for forms, functions, and topics is particularly insightful, providing a much-needed framework that moves beyond ad-hoc categorizations often found in individual research papers. The clear delineation of tasks (classification, interpretation, explanation) also helps to structure the research landscape effectively.

The critical analysis of datasets is a highlight, underscoring fundamental issues that often plague multimodal research: the lack of diverse data, annotation quality inconsistencies, and the critical absence of temporal metadata. These insights are crucial for guiding future data collection efforts toward creating more robust and generalizable CMU systems. The call for "richer annotations" that capture human reasoning is a particularly important and challenging future direction, as truly understanding a meme often involves a multi-step cognitive process that current annotations largely abstract away.

One potential area for further emphasis, although touched upon, could be the sociological and psychological underpinnings of meme consumption and creation. While the paper leverages linguistic theories (speech act theory), a deeper dive into the cognitive biases or social dynamics that make certain memes effective or harmful could further inform model design, especially for understanding persuasion and manipulation. The inherent subjectivity of meme interpretation, highlighted as a challenge, is precisely where human-in-the-loop systems or ensemble approaches combining multiple "perspectives" might prove more fruitful than seeking a single "correct" interpretation.

Finally, the ethical considerations section is highly relevant. The dual-use nature of CMU technologies (detecting harm vs. potentially generating it) necessitates careful thought about responsible AI development and deployment. The suggestions regarding annotator well-being and controlled release of models and datasets are commendable. The future direction of meme generation also presents significant ethical challenges, as the power to create highly engaging and potentially manipulative content could be misused. This survey lays a strong foundation, and its forward-looking perspective is vital for navigating the opportunities and pitfalls of this fascinating research area.
