Computational Meme Understanding: A Survey
TL;DR Summary
This paper surveys Computational Meme Understanding (CMU), introducing a comprehensive meme taxonomy and analyzing key tasks like classification and interpretation, while reviewing existing datasets and models, addressing limitations and key challenges, and suggesting future research directions.
Abstract
Computational Meme Understanding, which concerns the automated comprehension of memes, has garnered interest over the last four years and is facing both substantial opportunities and challenges. We survey this emerging area of research by first introducing a comprehensive taxonomy for memes along three dimensions – forms, functions, and topics. Next, we present three key tasks in Computational Meme Understanding, namely, classification, interpretation, and explanation, and conduct a comprehensive review of existing datasets and models, discussing their limitations. Finally, we highlight the key challenges and recommend avenues for future work.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is a comprehensive survey of Computational Meme Understanding (CMU).
1.2. Authors
The authors are:

- Khoi P. N. Nguyen
- Vincent Ng

Both authors are affiliated with the Human Language Technology Research Institute, University of Texas at Dallas.
1.3. Journal/Conference
The paper is a survey, indicating it reviews and synthesizes existing research in an emerging field. While it doesn't specify a particular journal or conference proceedings, surveys are often published in specialized review journals or presented at major conferences as survey tracks.
1.4. Publication Year
The paper was published on January 1, 2024.
1.5. Abstract
This paper surveys Computational Meme Understanding (CMU), an emerging research area focused on the automated comprehension of memes, which has gained significant interest in the last four years. The survey begins by introducing a comprehensive taxonomy for memes, categorizing them along three dimensions: forms, functions, and topics. It then delineates three key tasks within CMU: classification, interpretation, and explanation. For each task, the paper conducts a detailed review of existing datasets and models, critically discussing their limitations. Finally, it identifies major challenges facing the field and proposes several promising avenues for future research.
1.6. Original Source Link
The original source link is /files/papers/69174c3c110b75dcc59ae048/paper.pdf, which points directly to the PDF of the paper.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the growing importance of Computational Meme Understanding (CMU) in the digital age. Memes, defined as user-created combinations of images and overlaid text, have become a pervasive and novel form of online communication, valued for their amusement and quick consumption.
The core problem the paper aims to solve is the automated comprehension of these complex multimodal entities. This problem is crucial for several reasons:
- Detecting Malicious Content: Memes can be used for harmful purposes, such as spreading hate speech, disinformation, or political manipulation (e.g., influencing elections). Given the sheer volume of online content, human moderation is impractical, necessitating automated detection systems.
- Improving Human Communication: Memes reflect thoughts and opinions. CMU technologies could help bridge communication gaps, for instance, by allowing teachers to better understand students' perspectives or aiding foreign students in adapting to new cultures by comprehending local jokes and expressions.
However, CMU presents significant challenges:
- Multimodal Integration: Systems must seamlessly recognize and combine both textual and visual elements within a meme.
- Broad and Deep Knowledge: Fully understanding a meme's message often requires extensive knowledge of current events, internet subcultures, meme cultures, and real-world context.
- Figurative Language: Memes frequently employ figurative language, requiring systems to "read between the lines" to decode their true meaning.
The paper identifies a gap in existing literature, noting that while related surveys exist (e.g., on computational propaganda, multimodal disinformation, hate speech, humor generation), none offer a comprehensive review specifically dedicated to CMU that covers the breadth of meme types and technical tasks. This lack of a unified overview is the paper's entry point.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Comprehensive Taxonomy for Memes: It introduces a new, comprehensive taxonomy for memes, categorizing them along three crucial dimensions: forms (e.g., Macros, Shops, Screenshots), functions (e.g., persuading, mocking, praising, based on speech act theory), and topics (e.g., politics, COVID-19, misogyny). This taxonomy provides a unified language for classifying memes, which was previously lacking in NLP research.
- Delineation of Key CMU Tasks: The survey clearly defines and categorizes the core tasks in Computational Meme Understanding into classification, interpretation, and explanation.
- Review of Datasets and Models: It provides an extensive review of existing datasets for CMU tasks, discussing their coverage of meme types, annotation quality, and temporal context. It also reviews state-of-the-art models for these tasks, outlining their approaches, performance, and common limitations (e.g., issues with visual information, lack of background knowledge, hallucinations).
- Identification of Challenges and Future Work: The paper highlights critical challenges specific to CMU, such as the need for meme-specific and temporal-context knowledge, the inherent subjectivity in meme interpretation, and the demand for interpretable models. It then recommends concrete avenues for future research, including richer annotations, leveraging Vision-Language Models (VLMs) for annotation, advanced visual reasoning, active knowledge acquisition, incorporating pragmatics, and extending CMU to animated/video memes and meme generation.

These findings collectively address the need for a structured understanding of the CMU landscape, offering a foundational reference for researchers entering or working within this field, and guiding future research efforts toward overcoming current limitations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contents of this survey, a reader should be familiar with the following foundational concepts:
- Memes: In the context of this paper, memes are defined as user-created combinations of images and overlaid text. They serve as a prevalent form of online communication, often conveying humor, satire, or opinions. Their multimodal nature (combining visual and textual elements) is a core characteristic.
- Multimodal Understanding: This refers to the ability of computational systems to process and integrate information from multiple modalities, typically vision (images/videos) and language (text). In CMU, this means understanding how the visual content of a meme interacts with its overlaid text to create a cohesive message.
- Natural Language Processing (NLP): A field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP techniques are crucial for processing the text components of memes, including sentiment analysis, hate speech detection, and understanding figurative language.
- Computer Vision (CV): A field of artificial intelligence that enables computers to "see" and interpret visual data from the world, such as images and videos. CV techniques are essential for analyzing the visual components of memes, including object recognition, scene understanding, and extracting visual cues.
- Figurative Language: Language that deviates from conventional literal meaning to achieve a special effect, such as metaphor, simile, irony, sarcasm, and allusion. Memes frequently employ figurative language, making its comprehension a significant challenge for CMU systems.
- Hate Speech: Abusive or threatening language, gestures, or conduct that expresses prejudice against a particular group, especially on the basis of ethnicity, religion, sexual orientation, or disability. Detecting hate speech in memes is a major application area of CMU.
- Classification: A supervised machine learning task where an algorithm learns to assign a category or label to new observations from a set of predefined categories. In CMU, this often involves classifying memes as hateful or non-hateful, offensive or not offensive, or by emotion or genre.
- Text Generation: An NLP task where a model produces human-like text. In CMU, this is applied to meme interpretation (generating a message for the meme) and meme explanation (generating a reason for a specific label).
- Speech Act Theory: A theory in linguistics and the philosophy of language that analyzes language in terms of the actions, or speech acts, it performs. These acts go beyond the literal meaning of words, encompassing intentions like stating, questioning, commanding, or promising. The paper references illocutionary acts (the speaker's intention in making an utterance) from this theory to classify meme functions.
- Vision-Language Models (VLMs): Advanced deep learning models trained on vast amounts of multimodal data (images and corresponding text) to understand and generate content across both modalities. They combine visual encoders (like those from CV) and language models (like those from NLP) and are increasingly used in CMU. Examples include CLIP, Llava, and Flamingo.
3.2. Previous Works
The paper explicitly states that there have been no comprehensive surveys specifically on Computational Meme Understanding. However, it acknowledges several tangentially related surveys:

- Computational Propaganda: Martino et al. (2020) and Ng and Li (2023) focus on propaganda detection, which may involve memes as a medium but is not solely dedicated to meme comprehension. Propaganda often aims to manipulate public opinion through various media.
- Multimodal Disinformation and Fact-checking: Alam et al. (2022) and Akhtar et al. (2023) survey the detection of false or misleading information across multiple modalities. Memes can be a vehicle for disinformation, making these areas relevant, but their scope is broader than just memes.
- Hate Speech: Schmidt and Wiegand (2017) provide a survey on hate speech detection. While detecting hateful memes is a significant task within CMU, this survey primarily focuses on text-based hate speech.
- Humor Generation: Amin and Burghardt (2020) cover humor generation, which is related to the function of many memes. However, generating humor is distinct from understanding the full spectrum of meme meanings and purposes.
- Harmful Memes (Sharma et al., 2022a): This is identified as the closest existing review, published two years prior to this survey. However, the current paper differentiates itself by noting that Sharma et al. (2022a) only concerns harmful memes and associated classification tasks, a much narrower scope than the current survey, which aims to cover many more types of memes and more technical tasks.

The current paper's proactive approach to unifying the understanding of memes across forms, functions, and topics, and defining the core tasks comprehensively, addresses a significant gap left by these previous, more specialized reviews.
3.3. Technological Evolution
The evolution of technologies relevant to CMU can be traced through several key developments:
- Rise of Social Media and User-Generated Content: The proliferation of platforms like Reddit, Twitter, Facebook, and Instagram has led to an explosion of user-created content, including memes, making them a significant mode of communication and cultural expression.
- Advancements in Deep Learning: The development of deep neural networks, particularly Convolutional Neural Networks (CNNs) for image processing and Recurrent Neural Networks (RNNs)/Transformers for natural language processing, laid the groundwork for robust multimodal systems.
- Emergence of Multimodal Learning: Early efforts focused on fusing features from separate vision and language models. This evolved into more sophisticated multimodal fusion techniques, such as cross-attention mechanisms, that allow for richer interaction between modalities.
- Pre-trained Vision-Language Models (VLMs): A major breakthrough has been the development of large pre-trained models (e.g., BERT and RoBERTa for text; ResNet and ViT for vision; and CLIP, Llava, and Flamingo for vision-language integration). These models, trained on massive datasets, provide powerful general-purpose representations that can be fine-tuned for specific CMU tasks, often outperforming earlier specialized models.

This paper's work fits into the current state of technological evolution by surveying how these advanced multimodal AI models are being applied to the unique challenges of meme understanding, moving beyond simple classification to more complex interpretation and explanation tasks.
3.4. Differentiation Analysis
Compared to the related work, the core differences and innovations of this paper's approach lie in its comprehensiveness and structured definition of the field:
- Broader Scope: Unlike previous reviews that focus on specific aspects like harmful memes (Sharma et al., 2022a) or general propaganda, this survey covers the full spectrum of meme types (via a novel taxonomy) and Computational Meme Understanding tasks.
- Unified Taxonomy: The introduction of a three-dimensional taxonomy (forms, functions, topics) provides a foundational, unified language for researchers to classify and discuss memes, addressing the lack of a standardized framework in NLP research. This structured view allows for a more holistic understanding of the diverse nature of memes.
- Task Categorization: By clearly delineating classification, interpretation, and explanation as the key CMU tasks, the paper helps structure the research landscape, allowing for better comparison and focused development within each area.
- Critical Review and Future Outlook: The survey not only synthesizes existing work but also critically assesses the limitations of current datasets and models, and proactively identifies key challenges and promising future research directions (e.g., meme-specific knowledge, temporal context, interpretability, pragmatics, video memes). This forward-looking perspective is crucial for an emerging field.

In essence, this paper serves as a foundational "state-of-the-art" document for CMU, providing a much-needed organizational framework and roadmap for future research, which existing, narrower surveys could not achieve.
4. Methodology
As a survey paper, this work does not propose a novel computational methodology or algorithm in the traditional sense. Instead, its "methodology" lies in its structured approach to synthesizing and analyzing the existing body of research in Computational Meme Understanding (CMU), which it organizes as follows.
4.1. Principles
The core idea of the method used in this survey is to provide a comprehensive and structured overview of an emerging research area. The theoretical basis is that a nascent field benefits from a unified framework to guide future work, identify gaps, and consolidate disparate research efforts. The intuition is that by categorizing the subject matter (memes), the problems (tasks), and the existing solutions (datasets and models), researchers can gain a clearer understanding of the landscape, facilitating targeted innovation and collaboration.
4.2. Core Methodology In-depth (Layer by Layer)
The paper employs a multi-layered approach to survey Computational Meme Understanding, dissecting the field into foundational elements, core tasks, and current technical approaches.
4.2.1. Introduction of a Comprehensive Taxonomy for Memes
The survey begins by establishing a unified language for describing memes, crucial given the diverse and evolving nature of these multimodal entities. This taxonomy is organized along three dimensions:

- Forms: This dimension categorizes memes based on their visual structure and how meaning is created.
  - Remixed Images: Memes created through image manipulation.
    - Macros: A base template with text at the top (premise) and bottom (punchline).
    - Shops: Images manipulated by adding elements or graphical edits (e.g., "Photoshop").
    - Annotated Stills, Demotivationals, Quotes, Text.
    - Stacked Images: Multiple remixed images combined.
  - Stable Images: Images used as memes without editing.
    - Screenshots: e.g., conversations on social media.
    - Photos: including Memes IRL (in real life).
    - Drawings and Graphs.

  This categorization, adapted from Milner (2012), helps in recognizing the diverse visual structures that contribute to a meme's meaning. Figure 1 from the original paper visually organizes these forms (a diagram sorting memes into single, remixed, and stable images, with sub-types such as annotated stills, demotivationals, and macros).

  Figure 1: Taxonomy of forms for memes, adapted from Milner (2012).

- Functions: This dimension classifies memes based on their illocutionary acts, i.e., "what the meme does" beyond its literal content.
  - Adapted from Grundlingh (2018), who applied speech act theory to memes.
  - Examples include stating, predicting, stereotyping, disputing, persuading, mocking, and praising.
  - This is particularly relevant for harmful memes, where functions like inciting hate or trolling are critical to detect. Appendix A details this further with a taxonomy of illocutionary acts for memes (Figure 3 in the paper, a diagram dividing communicative illocutionary acts into constatives, directives, commissives, and acknowledgements, each subdivided into specific types).

  Figure: Taxonomy of illocutionary acts of memes, adapted from Grundlingh (2018). The grayed-out entries (Commissives and Acknowledgements) are illocutionary acts from speech act theory that do not apply to memes.

- Topics: This dimension categorizes memes by their semantic themes or subjects.
  - Topics can be evergreen (e.g., misogyny, antisemitism) or time-sensitive (e.g., US presidential elections, COVID-19, the Russia-Ukraine crisis).
  - The topic dictates the background knowledge and reasoning ability required for understanding.
4.2.2. Delineation of Key CMU Tasks
The survey structures the research landscape by identifying three primary tasks in CMU:

- Classification:
  - The most prevalent task, typically involving labeling memes with predefined categories.
  - Focus areas include detecting malicious memes (e.g., offensive, trolling, hateful, antisemitic, harmful, misogynous). These are often binary classification tasks.
  - Other classification tasks involve predicting persuasion techniques, targets (e.g., religion, race), emotion types (sarcastic, humorous), figurative language types, roles of people (hero, villain), and meme genres. These are typically multi-class classification problems.
  - Evaluation metrics typically include accuracy, F1-macro score, and Area Under the ROC Curve (ROC AUC).
- Interpretation:
  - A newer task focused on generating text that captures the final message or meaning of a meme.
  - Currently, only one dataset, MemeCap (Hwang and Shwartz, 2023), specifically addresses this task, referring to it as meme captioning.
  - As a text generation task, evaluation can be manual (human evaluation) or automatic, using n-gram-based metrics (e.g., BLEU, ROUGE, METEOR) or semantics-based metrics (e.g., BERTScore).
- Explanation:
  - Involves generating textual explanations for a specific label assigned to a meme.
  - Two main variants are discussed:
    - Explaining why an entity in a harmful meme plays a given role (e.g., hero, villain, victim), as defined by Sharma (2023).
    - Explaining the reason for hateful memes by identifying a specific target group and describing how hateful feelings are expressed, as defined by Hee et al. (2023). This often follows a structured template.
  - These tasks are distinct from interpretation as they involve constrained generation based on provided labels or targets.
4.2.3. Review of Existing Datasets
The survey meticulously reviews 24 commonly used datasets, categorizing them by the CMU tasks they support (classification, interpretation, explanation). It discusses their objectives, number of memes, languages, collection methods, and licenses. A critical analysis of these datasets highlights:

- Forms Overlooked: Many datasets focus on specific forms (e.g., Macros), potentially limiting model generalizability.
- Annotation Quality: Concerns regarding inter-annotator agreement (often not reported or low) and the lack of annotation review in some datasets (e.g., MemeCap). The survey praises approaches like COLLECT-AND-JUDGE (Wiegreffe and Marasovic, 2021), with multiple rounds of training and judging, for explanation datasets.
- Temporal Context: A significant gap is the lack of posted timestamps for memes, hindering the development of models that can handle time-sensitive content.
4.2.4. Overview of Existing Models
The paper then surveys the state-of-the-art models for each CMU task:

- Classification Models:
  - Approaches: Most systems extract visual and textual features (e.g., using OCR, the Google Cloud Vision API, or FairFace for entity properties). These features are encoded into embedding spaces using specialized encoders (ResNet and ViT for vision; BERT, RoBERTa, T5, and Llama 2 for language). Modalities are aligned (concatenation, cross-attention) and fed into a classification head (e.g., a feedforward neural network). Some approaches reduce the multimodal problem to text classification by generating textual descriptions of images.
  - Vision-Language Models (VLMs): Recently, pre-trained VLMs (Flamingo, PaLI, GPT-4, Llava, OpenFlamingo) have shown strong performance after fine-tuning.
  - Performances: Vary widely, from over 90% accuracy on some benchmarks (e.g., HatefulMemes, WOAH5) to F1 scores as low as 0.58 on others (e.g., SemEval-2021-T6).
  - Common Errors (Appendix C): Misclassification due to lack of context, biased data (leading to biased models), failure to perform complex reasoning, or failure to attend to important visual information.
- Explanation Models:
  - Approaches: Extend classification models by replacing the classification head with a language decoder to generate text. An example is LUMEN, which uses joint learning for classification and explanation.
  - Performances: Generally low in human evaluation, with correctness scores under 70% for HatReD. Challenges include unreliable visual information extractors and hallucinations. Retrieval augmentation is suggested as a way to incorporate explicit knowledge.
- Interpretation Models:
  - Approaches: For MemeCap, open-source VLMs have been experimented with.
  - Performances: Models still struggle, similar to explanation tasks. Errors stem from failure to attend to important visual elements and lack of sufficient background knowledge.
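The concatenation-based fusion pipeline described above can be sketched in a few lines. This is a toy illustration, not any published system: random vectors stand in for a pooled ResNet image embedding and a BERT [CLS] text embedding, and a single untrained linear layer plays the role of the classification head.

```python
import math
import random

random.seed(0)

def fusion_classifier(img_emb, txt_emb, weights, bias):
    """Late fusion: concatenate the two modality embeddings, apply a
    one-layer feedforward head, and squash the logit to a probability."""
    fused = img_emb + txt_emb                      # modality alignment by concatenation
    logit = sum(x * w for x, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))          # sigmoid -> e.g. P(hateful)

# Toy stand-ins for real encoder outputs (dimensions mirror a pooled
# ResNet feature and a BERT [CLS] vector).
img_emb = [random.gauss(0, 1) for _ in range(512)]
txt_emb = [random.gauss(0, 1) for _ in range(768)]
weights = [random.gauss(0, 0.01) for _ in range(512 + 768)]  # untrained head
bias = 0.0

p = fusion_classifier(img_emb, txt_emb, weights, bias)
print(f"P(hateful) = {p:.3f}")
```

In practice the head would be trained end-to-end, and cross-attention fusion would replace the simple concatenation, but the overall encode-align-classify shape is the same.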
4.2.5. Identification of Key Challenges and Future Work
The survey concludes by outlining pressing challenges and proposing future research directions, derived from the comprehensive analysis:

- Meme-specific Knowledge: Systems need to acquire "insider's knowledge" of meme cultures and of how meme forms contribute to meaning (e.g., Macros, Stacked Images).
- Temporal Context: Models must account for the post date of a meme and acquire up-to-date knowledge, while also being able to "think in the past" for historical memes.
- Subjectivity in Interpretation: Memes can have multiple valid interpretations, posing challenges for annotation and model output. The goal is to output the most popular messages.
- Interpretable Models: Models need to explain their outputs, potentially mimicking human multi-step reasoning processes, especially for sensitive tasks like flagging harmful memes.
- Richer Annotations: Collecting training data that represents human reasoning processes.
- Improving Annotation Procedures with VLMs: Using VLMs to draft initial annotations for human review.
- Next Level of Visual Reasoning: Teaching models to attend to the "right" visual details by incorporating explicit visual cues in annotations.
- Active Knowledge Acquisition: Developing systems that can continuously learn meme cultures (e.g., from Know Your Meme) and topic-specific background knowledge (e.g., via retrieval augmentation).
- Connection to Pragmatics: Leveraging concepts like presuppositions, deixis, and social-context grounding to enrich model understanding.
- Towards Processing Animated and Video Memes: Extending CMU to GIFs and short videos, which present complex temporal and multimodal challenges.
- Meme Generation: Exploring meme generation as a measure of understanding, for humanizing interfaces, and for digital marketing.

This structured approach forms the "methodology" of the survey, systematically moving from defining the domain to analyzing existing work and charting future paths.
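The "active knowledge acquisition" direction above boils down to retrieval: given a meme's OCR text, fetch the closest entry from a store of meme-culture knowledge. A minimal sketch under stated assumptions follows; the knowledge-base entries, the Jaccard scorer, and the `retrieve` helper are all invented for illustration (a real system would use dense embeddings over entries scraped from a site like Know Your Meme).

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Toy knowledge base standing in for scraped meme-culture entries.
kb = {
    "distracted boyfriend": "a man turns to look at another woman while "
                            "his girlfriend looks on disapprovingly",
    "drakeposting": "drake rejects the top panel option and approves "
                    "the bottom panel option",
    "woman yelling at a cat": "a woman yells while a confused white cat "
                              "sits at a dinner table",
}

def retrieve(query, kb, k=1):
    """Return the k entries whose descriptions best match the query text."""
    ranked = sorted(kb, key=lambda name: jaccard(query, kb[name]), reverse=True)
    return ranked[:k]

print(retrieve("drake approves of the bottom option", kb))
```

The retrieved entry would then be appended to the model's input, which is exactly the retrieval-augmentation pattern the survey recommends for injecting explicit background knowledge.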
5. Experimental Setup
This section details the experimental setups discussed and reviewed within the survey paper, specifically focusing on the datasets, evaluation metrics, and general baseline approaches used in the Computational Meme Understanding (CMU) literature. The survey itself does not conduct new experiments but rather analyzes those of existing research.
5.1. Datasets
The survey provides a comprehensive overview of 24 commonly used datasets, primarily for classification tasks, but also including interpretation and explanation datasets.
The following are the results from Table 1 of the original paper:
| Dataset and/or Publication | Task | Objective | # Memes | Lang. | Method | License |
| --- | --- | --- | --- | --- | --- | --- |
| HatefulMemes (Kiela et al., 2020) | 2C | Hate | 10,000 | E | Synthesis | Custom |
| MUTE (Hossain et al., 2022b) | 2C | Hate | 4,158 | E+Be | Scrape | MIT |
| MMHS150K (Gomez et al., 2019) | 2C | Hate | 150,000 | E | Scrape | Custom |
| Sabat et al. (2019) | 2C | Hate | 5,020 | E | Scrape | CC0 |
| CrisisHateMM (Thapa et al., 2024) | NC | Hate & Target | 4,486 | E | Scrape | MIT |
| WOAH-5 (Mathias et al., 2021) | NC | Hate Type & Target | 10,000 | E | Inherit | Apache-2.0 |
| HarMeme (Pramanick et al., 2021a) | 2C, NC | Harm & Target | 3,544 | E | Scrape | BSD |
| HARM-C&P (Pramanick et al., 2021b) | 2C, NC | Harm & Target | 7,096 | E | Inherit | MIT |
| Giri et al. (2021) | NC | Offensiveness | 6,992 | E | Scrape | Unavailable |
| Shang et al. (2021b) | 2C | Offensiveness | 3,059 | E | Scrape | Unavailable |
| MultiOFF (Suryawanshi et al., 2020a) | 2C | Offensiveness | 743 | E | Scrape | None |
| TamilMemes (Suryawanshi et al., 2020b) | 2C | Trolling | 2,969 | T | Scrape | GNU-3.0 |
| BanglaAbuse (Das and Mukherjee, 2023) | 2C | Abuse | 4,043 | Be | Scrape | MIT |
| Jewtocracy (Chandra et al., 2021a) | 2C, NC | Antisemitism | 6,611 | E | Scrape | Unavailable |
| MAMI (Fersini et al., 2022) | 2C, NC | Misogyny | 11,000 | E | Scrape | Apache-2.0 |
| MIMOSA (Ahsan et al., 2024) | NC | Aggression Target | 4,848 | Be | Scrape | MIT |
| Memotion (Sharma et al., 2020) | NC | Emotion | 10,000 | E | Scrape | MIT |
| FigMemes (Liu et al., 2022) | NC | Figurative Lang. | 5,141 | E | Scrape | None |
| HVVMemes (Sharma et al., 2022b) | NC | Role of Entities | 7,000 | E | Inherit | None |
| MemoSen (Hossain et al., 2022a) | NC | Sentiment | 4,417 | Be | Scrape | Custom |
| SemEval-2021-T6 (Dimitrov et al., 2021) | NC | Persuasion Tech. | 950 | E | Scrape | None |
| HatReD (Hee et al., 2023) | E | Hate | 3,304 | E | Inherit | Custom |
| ExHVV (Sharma et al., 2023) | E | Role of Entities | 4,680 | E | Inherit | CC0-1.0 |
| MemeCap (Hwang and Shwartz, 2023) | I | Meme Captioning | 6,387 | E | Scrape | GPL-3.0 |
Description of Dataset Characteristics:
- Task: 2C denotes Binary Classification, NC denotes Multi-class Classification, E denotes Explanation, and I denotes Interpretation.
- Objective: Specifies the target of classification (e.g., Hate, Harm, Offensiveness, Emotion, Role of Entities, Meme Captioning).
- # Memes: Indicates the size of the dataset.
- Lang. (Language): E for English, Be for Bengali, T for Tamil. This shows the multilingual nature of some CMU research.
- Method: Synthesis (created programmatically or with specific guidelines), Scrape (collected from online sources), Inherit (derived from another existing dataset with new annotations).
- License: Specifies usage rights.
Discussion on Dataset Issues (as per the survey):
- Forms Overlooked: Many datasets, like HatefulMemes, primarily focus on the Macros form (image with overlaid text). This narrow focus means models trained on them may perform poorly on other meme forms found "in the wild" (e.g., screenshots, plain-text memes). The survey emphasizes that deliberate control over meme forms is often missing, leaving it unclear whether datasets truly cover the diversity of meme structures.
- Annotation Quality:
  - For classification datasets, inter-annotator agreement is frequently not reported or, when reported (e.g., MAMI, with a Kappa of 0.33), indicates only "fair" agreement, which is problematic given the subjective nature of meme understanding.
  - For the interpretation dataset (MemeCap), annotations were collected via crowdsourcing but without review, raising questions about data quality.
  - For explanation datasets (HatReD, ExHVV), the authors used rigorous COLLECT-AND-JUDGE methods (Wiegreffe and Marasovic, 2021) with multiple training rounds and human judges, and even reported inter-judge agreement for HatReD, aiming for higher quality control. However, the survey notes that this method might still suffer from shared biases among judges.
- Temporal Context: None of the reviewed datasets record the posted timestamps of memes, which is critical for understanding memes within their historical context. While some specify collection date ranges, this does not provide the fine-grained temporal information needed for models to "think in the past" or stay updated with rapidly evolving internet trends and real-world events. An example given is the "Still your president" meme (Figure 2c), whose meaning changes drastically depending on its post date relative to Trump's presidency.
5.2. Evaluation Metrics
The survey outlines various evaluation metrics used across the three CMU tasks:
5.2.1. Classification Tasks
For classification, the primary metrics are:
- Accuracy:
  - Conceptual Definition: Measures the proportion of total predictions that were correct. It is a straightforward measure of overall correctness but can be misleading on imbalanced datasets.
  - Mathematical Formula:
    $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the count of instances where the model's predicted label matches the true label.
    - Total Number of Predictions: the total number of instances evaluated.
-
F1-macro Score:
- Conceptual Definition: The harmonic mean of
precisionandrecall, calculated independently for each class and then averaged. It provides a balanced measure for multi-class or imbalanced classification problems, giving equal weight to each class. - Mathematical Formula: $ \mathrm{F1}{\text{macro}} = \frac{1}{N} \sum{i=1}^{N} \mathrm{F1}{i} $ where, $ \mathrm{F1}{i} = 2 \times \frac{\mathrm{Precision}{i} \times \mathrm{Recall}{i}}{\mathrm{Precision}{i} + \mathrm{Recall}{i}} $ $ \mathrm{Precision}{i} = \frac{\mathrm{TP}{i}}{\mathrm{TP}{i} + \mathrm{FP}{i}} $ $ \mathrm{Recall}{i} = \frac{\mathrm{TP}{i}}{\mathrm{TP}{i} + \mathrm{FN}{i}} $
- Symbol Explanation:
- $N$: The total number of classes.
- $\mathrm{F1}_i$: The F1 score for class $i$.
- $\mathrm{Precision}_i$: The precision for class $i$.
- $\mathrm{Recall}_i$: The recall for class $i$.
- $\mathrm{TP}_i$: True Positives for class $i$ (instances correctly predicted as class $i$).
- $\mathrm{FP}_i$: False Positives for class $i$ (instances incorrectly predicted as class $i$).
- $\mathrm{FN}_i$: False Negatives for class $i$ (instances of class $i$ incorrectly predicted as another class).
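The per-class computation above can be sketched in a few lines of Python (toy labels; real pipelines typically call `sklearn.metrics.f1_score` with `average='macro'` instead):

```python
def f1_macro(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 from TP/FP/FN counts, equal class weight."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy example: class 0 gets F1 = 2/3, class 1 gets F1 = 0.8.
print(f1_macro([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1]))
```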
-
Area Under the ROC Curve (ROC AUC):
- Conceptual Definition: Measures the ability of a classifier to distinguish between classes. It's the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. A higher AUC indicates better model performance, regardless of the classification threshold chosen.
- Mathematical Formula: There is no single closed-form formula for AUC in terms of TP, FP, TN, and FN values; it is typically computed by integrating or summing areas under the ROC curve. The conceptual basis involves TPR and FPR: $ \mathrm{TPR} = \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $, $ \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} $. AUC is the area under the curve formed by plotting (FPR, TPR) pairs for all possible classification thresholds.
- Symbol Explanation:
- $\mathrm{TPR}$: True Positive Rate, also known as recall or sensitivity.
- $\mathrm{FPR}$: False Positive Rate.
- $\mathrm{TP}$: True Positives.
- $\mathrm{FP}$: False Positives.
- $\mathrm{FN}$: False Negatives.
- $\mathrm{TN}$: True Negatives.
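Equivalently, AUC can be computed without tracing the curve at all, as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (the Mann-Whitney U formulation). A short sketch with made-up scores:

```python
def roc_auc(y_true, scores):
    """AUC as P(random positive outranks random negative); ties count as 0.5.

    This rank-based formulation is mathematically equal to the area under
    the ROC curve traced over all thresholds.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```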
5.2.2. Interpretation and Explanation Tasks
For text generation tasks, metrics typically include both human evaluation and automated metrics:
-
BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002):
- Conceptual Definition: Measures the similarity between a machine-generated text and a set of human-generated reference texts. It primarily focuses on n-gram precision, counting how many n-grams in the candidate text appear in the reference text, with a penalty for short sentences.
- Mathematical Formula: $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $ \mathrm{BP} = \min\left(1, \exp\left(1 - \frac{\mathrm{length}_{\text{ref}}}{\mathrm{length}_{\text{cand}}}\right)\right) $ and $ p_n = \frac{\sum_{\text{sentence} \in \text{cand}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{sentence} \in \text{cand}} \sum_{\text{n-gram} \in \text{sentence}} \mathrm{Count}(\text{n-gram})} $
- Symbol Explanation:
- $\mathrm{BP}$: Brevity Penalty, penalizes candidate sentences that are too short compared to the reference.
- $\mathrm{length}_{\text{ref}}$: Effective reference corpus length.
- $\mathrm{length}_{\text{cand}}$: Length of the candidate (generated) corpus.
- $N$: Maximum n-gram order (typically 4).
- $w_n$: Weight for the $n$-gram precision (often $1/N$).
- $p_n$: Modified n-gram precision for n-grams of length $n$.
- $\mathrm{Count}_{\text{clip}}$: Count of n-grams in the candidate that are also present in the reference, clipped to the maximum count in any single reference sentence.
- $\mathrm{Count}$: Count of n-grams in the candidate.
-
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004):
- Conceptual Definition: A set of metrics for evaluating summarization and machine translation, primarily focusing on recall. ROUGE-L (L for Longest Common Subsequence) measures the longest common subsequence between the generated and reference texts.
- Mathematical Formula (for ROUGE-L): $ \mathrm{ROUGE\text{-}L} = \frac{(1+\beta^2) \times \mathrm{LCS}_{\text{recall}} \times \mathrm{LCS}_{\text{precision}}}{\mathrm{LCS}_{\text{recall}} + \beta^2 \times \mathrm{LCS}_{\text{precision}}} $ where $ \mathrm{LCS}_{\text{recall}} = \frac{\mathrm{LCS}(X, Y)}{\mathrm{length}(Y)} $ and $ \mathrm{LCS}_{\text{precision}} = \frac{\mathrm{LCS}(X, Y)}{\mathrm{length}(X)} $
- Symbol Explanation:
- $\mathrm{LCS}(X, Y)$: Length of the longest common subsequence between candidate text $X$ and reference text $Y$.
- $\mathrm{length}(X)$: Length of candidate text $X$.
- $\mathrm{length}(Y)$: Length of reference text $Y$.
- $\beta$: A parameter that adjusts the relative importance of precision and recall (often set to a high value to weight recall more heavily).
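A small Python sketch built directly from the LCS definition above ($\beta$ is a free parameter here; the `rouge-score` package is the usual reference implementation):

```python
def lcs_len(x, y):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure from LCS-based precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    prec = lcs / len(candidate)
    rec = lcs / len(reference)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)

# Identical texts score 1.0 regardless of beta.
sent = "the cat sat on the mat".split()
print(rouge_l(sent, sent))
```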
-
METEOR (Metric for Evaluation of Translation with Explicit Ordering) (Banerjee and Lavie, 2005):
- Conceptual Definition: Measures the similarity between machine-generated and reference texts by aligning words based on exact, stem, synonym, and paraphrase matches, then computing a harmonic mean of precision and recall based on these alignments, with a penalty for fragmentation.
- Mathematical Formula: $ \mathrm{METEOR} = \mathrm{F}_{\text{mean}} \times (1 - \mathrm{Penalty}) $ where $ \mathrm{F}_{\text{mean}} = \frac{10 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Recall} + 9 \times \mathrm{Precision}} $ and $ \mathrm{Penalty} = 0.5 \times \left(\frac{\mathrm{num\_chunks}}{\mathrm{num\_matches}}\right)^3 $
- Symbol Explanation:
- $\mathrm{Precision}$: Precision based on matched words.
- $\mathrm{Recall}$: Recall based on matched words.
- $\mathrm{num\_chunks}$: Number of "chunks", i.e., contiguous sequences of matched words.
- $\mathrm{num\_matches}$: Total number of matched words.
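A simplified sketch using exact-token matches only (the full metric, e.g. `nltk.translate.meteor_score`, also matches stems, synonyms, and paraphrases, so the numbers below are only indicative):

```python
def meteor_exact(candidate, reference):
    """Simplified METEOR: exact matches, recall-weighted harmonic mean,
    fragmentation penalty from the number of contiguous matched chunks."""
    used = set()
    alignment = []  # (candidate index, reference index) pairs
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used and tok == ref_tok:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    prec, rec = m / len(candidate), m / len(reference)
    f_mean = 10 * prec * rec / (rec + 9 * prec)
    # A new chunk starts whenever the alignment breaks its consecutive run.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(alignment, alignment[1:])
        if i2 != i1 + 1 or j2 != j1 + 1
    )
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Note that even identical sentences score slightly below 1.0, because one chunk still incurs a small (vanishing) penalty.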
-
BERTScore (Zhang et al., 2020):
- Conceptual Definition: A semantics-based metric that leverages pre-trained BERT embeddings to compute similarity between generated and reference sentences. Instead of discrete n-gram matching, it measures cosine similarity between contextualized token embeddings, offering a more robust assessment of semantic similarity.
- Mathematical Formula (simplified conceptual overview): For each token $x_i$ in candidate sentence $X$ and each token $y_j$ in reference sentence $Y$: $ \mathrm{BERTScore}_{\text{precision}} = \frac{1}{|X|} \sum_{x_i \in X} \max_{y_j \in Y} \cos(\mathrm{E}(x_i), \mathrm{E}(y_j)) $, $ \mathrm{BERTScore}_{\text{recall}} = \frac{1}{|Y|} \sum_{y_j \in Y} \max_{x_i \in X} \cos(\mathrm{E}(x_i), \mathrm{E}(y_j)) $. The final $\mathrm{BERTScore}_{\mathrm{F1}}$ is the harmonic mean of precision and recall.
- Symbol Explanation:
- $\mathrm{E}(\cdot)$: Embedding function (e.g., from BERT) that maps a token to its contextualized vector representation.
- $\cos(\cdot, \cdot)$: Cosine similarity between two vectors.
- $|X|$: Number of tokens in candidate sentence $X$.
- $|Y|$: Number of tokens in reference sentence $Y$.
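The greedy soft-matching at the heart of BERTScore can be sketched with toy static vectors standing in for BERT's contextual embeddings. The vocabulary and vectors below are invented purely for illustration; the real metric (the `bert-score` package) embeds each token in context with a pretrained model and adds optional IDF weighting:

```python
import math

# Toy static embeddings: "dog"/"puppy" point one way, "car"/"vehicle" another.
EMB = {
    "dog": (1.0, 0.1), "puppy": (0.9, 0.2),
    "car": (0.1, 1.0), "vehicle": (0.2, 0.95),
}

def cos(u, v):
    """Cosine similarity of two 2-D vectors."""
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def bert_score_f1(candidate, reference, emb=EMB):
    """Each token pairs with its most similar counterpart; F1 is the
    harmonic mean of the resulting soft precision and recall."""
    prec = sum(max(cos(emb[x], emb[y]) for y in reference) for x in candidate) / len(candidate)
    rec = sum(max(cos(emb[x], emb[y]) for x in candidate) for y in reference) / len(reference)
    return 2 * prec * rec / (prec + rec)
```

Unlike n-gram metrics, this rewards near-synonyms: under these toy vectors, ["dog"] vs. ["puppy"] scores far higher than ["dog"] vs. ["car"].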
-
Human Evaluation: For text generation, human judges assess outputs based on criteria like fluency, correctness, coherence, and adequacy. This is often considered the gold standard, especially given the nuances of meme understanding. The survey specifically mentions Fluent and Correct as metrics in human evaluation for MemeCap.
5.3. Baselines
The survey does not define specific "baselines" in the sense of a new model being compared against other existing models. Instead, it reviews the performance of various state-of-the-art models from the literature, which serve as the comparison points within their respective publications. These models broadly fall into categories:
-
Feature-Engineering Approaches: Older methods that relied on manually extracted features from text and images.
-
Multimodal Deep Learning Models: Systems combining specialized vision encoders (e.g., ResNet, ViT) and language encoders (e.g., BERT, RoBERTa) with fusion mechanisms like concatenation or Cross-Attention.
-
Vision-Language Models (VLMs): More recent, large pre-trained models (CLIP, Flamingo, PaLI, LLaVA, GPT-4, OpenFlamingo) that integrate vision and language capabilities from the start, often showing superior performance due to extensive pre-training.

The comparison is typically performed against other published methods for the same task and dataset, with the objective of achieving state-of-the-art performance on specific benchmarks.
6. Results & Analysis
This section synthesizes the findings presented in the survey regarding the performance and limitations of existing models for Computational Meme Understanding (CMU) tasks.
6.1. Core Results Analysis
The survey reveals a varied landscape of model performance across CMU tasks:
- Classification Tasks: Model performance for classification tasks shows a wide spectrum. While some benchmarks, particularly for binary classification of harmful content (e.g., HatefulMemes), have seen models achieve high accuracy levels (above 90%), other, more complex multi-class classification tasks remain challenging. For instance, SemEval-2021-T6 (detecting persuasion techniques across 22 classes) shows the best model achieving an F1 score of only 0.58. This indicates that while basic detection of obvious harmful content is progressing, nuanced classification requiring deeper understanding of context and fine-grained categories is still a significant hurdle.
- Explanation and Interpretation Tasks: Models for meme explanation and interpretation consistently score low, especially in human evaluation. For HatReD, the best systems score under 70% for correctness. This highlights the inherent difficulty of generating free-form text that accurately captures the nuanced meaning or explains the rationale behind a meme's classification. The primary struggles identified are:
-
Failure to Attend to Visual Elements: Models often miss crucial visual cues that are essential for understanding a meme's meaning (Hwang and Shwartz, 2023).
-
Lack of Background Knowledge: Models struggle to leverage meme-specific, cultural, and real-world knowledge, leading to shallow or incorrect interpretations (Hwang and Shwartz, 2023; Hee et al., 2023).
-
Hallucinations: Models generate plausible but incorrect or non-existent information in their explanations or interpretations (Hee et al., 2023).
-
Unreliable Visual Information Extractors: Errors in upstream components (e.g., OCR, object recognition) can propagate and severely impact the quality of explanations.

Overall, the results suggest that while deep learning models are effective for straightforward classification, tasks requiring generative capabilities, complex reasoning, and deep cultural/temporal understanding still have substantial room for improvement.
6.2. Data Presentation (Tables)
The survey provides two tables detailing the state-of-the-art models and their performance.
The following are the results from Table 2 of the original paper:
| Publication of state-of-the-art models | Dataset | Task | Acc | AUC | F1 |
|---|---|---|---|---|---|
| Hu et al. (2024) | Hateful Memes (Kiela et al., 2020) | B | 0.90 | 0.81 | |
| Zia et al. (2021) | WOAH5 (Mathias et al., 2021) | NC Target | 0.96 | ||
| Mathias et al. (2021) | NC Target | 0.97 | |||
| Zia et al. (2021) | NC Attack type | 0.91 | |||
| Mathias et al. (2021) | MAMI (Fersini et al., 2022) | NC Attack type | 0.91 | ||
| Cao et al. (2023) | B | 0.74 | 0.84 | ||
| Zhang and Wang (2022) | B | 0.83 | |||
| Zhang and Wang (2022) | HarMeme (Pramanick et al., 2021a) | NC Target | 0.73 | ||
| Cao et al. (2023) | B | 0.91 | |||
| Pramanick et al. (2021a) | NC Level | 0.76 | 0.54 | ||
| Pramanick et al. (2021a) | HARM-C (Pramanick et al., 2021b) | NC Target | 0.76 | 0.66 | |
| Lin et al. (2024) | B | 0.87 | 0.86 | ||
| Pramanick et al. (2021b) | NC Level | 0.77 | 0.55 | ||
| Pramanick et al. (2021b) | HARM-P (Pramanick et al., 2021b) | NC Target | 0.78 | 0.70 | |
| Lin et al. (2024) | B | 0.91 | 0.91 | ||
| Pramanick et al. (2021b) | NC Level | 0.87 | 0.67 | ||
| Pramanick et al. (2021b) | Jewtocracy (Chandra et al., 2021a) | NC Target | 0.79 | 0.72 | |
| Chandra et al. (2021b) | B (Twitter) | 0.69 | |||
| Chandra et al. (2021b) | B (Gab) | 0.91 | |||
| Chandra et al. (2021b) | NC (Twitter) | 0.68 | 0.67 | ||
| Lee et al. (2021) | MultiOFF (Suryawanshi et al., 2020a) | NC Gab | 0.65 | ||
| Suryawanshi et al. (2020b) | TamilMemes (Suryawanshi et al., 2020b) | B | 0.52 | ||
| Gomez et al. (2019) | MMHS150K (Gomez et al., 2019) | B | 0.68 | 0.73 | |
| Sabat et al. (2019) | Sabat et al. (2019) | B | 0.70 | ||
| Giri et al. (2021) | Giri et al. (2021) | B | 0.83 | ||
| Giri et al. (2021) | B | 0.71 | |||
| Shang et al. (2021a) | Shang et al. (2021a) | NC BR | 0.99 | ||
| Shang et al. (2021a) | NC BG | 0.73 | |||
| Feng et al. (2021) | SemEval-2021-T6 (Dimitrov et al., 2021) | NC Sentiment | 0.70 | 0.49 | |
| Sharma et al. (2020) | Memotion (Sharma et al., 2020) | NC 3 | 0.55 | ||
| Sharma et al. (2020) | NC Sentiment | 0.58 | |||
| Sharma et al. (2020) | NC Humor | 0.35 | |||
Note: In the Task column of Table 2, B denotes Binary Classification and NC denotes Multi-class Classification, followed by the specific classification target (e.g., Target, Attack type, Level, Sentiment, Humor, Gab, Twitter, BR for Binary/Reddit, BG for Binary/Gab, 3 for 3-class classification).
The following are the results from Table 3 of the original paper:
| Dataset | Model | BLEU | ROUGE-L | BERTScore | Fluent | Correct |
|---|---|---|---|---|---|---|
| HatReD | Text-only: RoBERTa-base | 0.177 | 0.389 | 0.480 | 0.975 | 0.544 |
| Text-only: T5-Large | 0.190 | 0.392 | 0.479 | 0.926 | 0.622 | |
| ExHVV | LUMEN | 0.313 | 0.294 | 0.902 | ||
| MemeCap | Open-Flamingo few-shot | 0.267 | 0.435 | 0.739 | 0.933 | 0.361 |
| Open-Flamingo few-shot | 0.270 | 0.435 | 0.743 | |||
| Llama fewshot | 0.266 | 0.434 | 0.747 | 0.967 | 0.361 | |
6.3. Ablation Studies / Parameter Analysis
The survey, being a comprehensive review, does not present its own ablation studies or parameter analyses. However, it implicitly discusses the impact of various model components and design choices by highlighting the limitations and common errors observed in the literature. These observations function as an indirect form of analysis on what components or aspects are currently underperforming or need improvement.
For instance, the discussion on common errors in meme classifiers (Appendix C) points to:
-
Lack of context: Models misclassify when they lack the necessary background knowledge. This implies that components designed for context acquisition or reasoning are either insufficient or missing.
-
Biased data: Training on biased datasets leads to models exhibiting similar biases, suggesting that the data sampling and labeling parameters significantly influence model fairness and generalizability.
Failure to perform complex reasoning: This indicates that current model architectures or their training paradigms are not robust enough for the intricate logical inferences often required to understand memes.
-
Failure to attend to important visual information: This directly critiques the visual encoding and multimodal fusion components, suggesting they are not effectively highlighting or integrating critical visual cues.

Similarly, for explanation and interpretation models, the reported issues like unreliable visual information extractors, hallucinations, and the lack of sufficient background knowledge (from the "Performances" sub-sections in "Models") serve as indicators of where current Vision-Language Models (VLMs) and their components (e.g., visual encoders, language decoders, knowledge integration mechanisms) are currently weak and need significant improvements. This collective analysis guides future research toward refining these critical components and addressing these performance bottlenecks.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a timely and comprehensive overview of Computational Meme Understanding (CMU), an emerging field focused on the automated comprehension of multimodal internet memes. The paper's key contributions include the introduction of a three-dimensional taxonomy for memes (forms, functions, and topics), a clear delineation of three core CMU tasks (classification, interpretation, and explanation), and a detailed review of existing datasets and models. It highlights that while CMU has made strides in basic classification, more complex tasks like interpretation and explanation still face significant challenges, particularly related to incorporating meme-specific knowledge, temporal context, and handling the inherent subjectivity and figurative language. The survey concludes by outlining critical future research directions necessary for robust and interpretable CMU systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose promising avenues for future research:
Current Limitations (as identified in the paper):
- Meme-specific Knowledge: Current systems lack the "insider's knowledge" of meme cultures and how specific meme forms convey meaning.
- Temporal Context: Models struggle to account for the time-sensitive nature of memes, which requires knowledge that is both up to date and able to "think in the past."
- Subjectivity in Interpretation: Meme interpretation is inherently subjective, making annotation challenging and allowing for multiple plausible correct interpretations.
- Interpretability: Models currently lack the ability to provide detailed, human-like explanations for their outputs, crucial for building user trust.
- Dataset Deficiencies: Many datasets overlook the diversity of meme forms, lack consistent high-quality annotations (e.g., inter-annotator agreement often not reported or low), and, crucially, lack posted timestamps.
- Model Performance Gaps: While classification accuracy can be high for simple tasks, complex multi-class problems and especially generative tasks (interpretation, explanation) show low performance, suffering from unreliable visual information extractors, hallucinations, and a lack of background knowledge.
Recommended Avenues for Future Work:
- Richer Annotations for More Robust Models:
- Develop annotation schemes that capture human reasoning processes (multi-step derivations combining textual/visual cues with background knowledge).
- Collect training data specifically for this new task to facilitate supervised learning of interpretable models.
- Improving Annotation Procedures with VLMs:
- Leverage Vision-Language Models (VLMs) to generate initial drafts of annotations, reducing human effort, but investigate the trade-offs between editing VLM outputs vs. writing from scratch.
- Next Level of Visual Reasoning:
- Teach models to attend to crucial demographic information and deciding visual elements.
- Construct datasets that explicitly include textual explanations of which visual details are important and why for meme understanding, guiding visual attention during model training.
- Active Knowledge Acquisition:
- Develop methods for models to continuously acquire knowledge about meme cultures (e.g., by leveraging internet databases like Know Your Meme).
- Explore how Retrieval Augmentation can be used to acquire topic-specific background knowledge for real-time events and implicit associations.
- Connection to Pragmatics:
- Integrate pragmatic concepts like presuppositions, deixis, and social-context grounding as features to improve CMU systems, enriching their understanding of contextual meaning.
- Towards Processing Animated and Video Memes:
- Extend CMU research to GIFs and short videos, addressing the challenges posed by dynamic visual content and complex temporal relationships.
- Meme Generation:
- Explore meme generation as a means to measure model understanding, humanize computer interfaces, and develop technologies for captivating online content creation.
7.3. Personal Insights & Critique
This survey serves as an invaluable resource for anyone entering or working within the domain of Computational Meme Understanding. Its strength lies in its meticulous organization of a rapidly evolving and inherently chaotic subject matter. The proposed taxonomy for forms, functions, and topics is particularly insightful, providing a much-needed framework that moves beyond ad-hoc categorizations often found in individual research papers. The clear delineation of tasks (classification, interpretation, explanation) also helps to structure the research landscape effectively.
The critical analysis of datasets is a highlight, underscoring fundamental issues that often plague multimodal research: the lack of diverse data, annotation quality inconsistencies, and the critical absence of temporal metadata. These insights are crucial for guiding future data collection efforts toward creating more robust and generalizable CMU systems. The call for "richer annotations" that capture human reasoning is a particularly important and challenging future direction, as truly understanding a meme often involves a multi-step cognitive process that current annotations largely abstract away.
One potential area for further emphasis, although touched upon, could be the sociological and psychological underpinnings of meme consumption and creation. While the paper leverages linguistic theories (speech act theory), a deeper dive into the cognitive biases or social dynamics that make certain memes effective or harmful could further inform model design, especially for understanding persuasion and manipulation. The inherent subjectivity of meme interpretation, highlighted as a challenge, is precisely where human-in-the-loop systems or ensemble approaches combining multiple "perspectives" might prove more fruitful than seeking a single "correct" interpretation.
Finally, the ethical considerations section is highly relevant. The dual-use nature of CMU technologies (detecting harm vs. potentially generating it) necessitates careful thought about responsible AI development and deployment. The suggestions regarding annotator well-being and controlled release of models and datasets are commendable. The future direction of meme generation also presents significant ethical challenges, as the power to create highly engaging and potentially manipulative content could be misused. This survey lays a strong foundation, and its forward-looking perspective is vital for navigating the opportunities and pitfalls of this fascinating research area.