Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
TL;DR Summary
Face-LLaVA is a multimodal large language model designed for facial expression and attribute recognition, generating natural language descriptions. Utilizing the FaceInstruct-1M dataset and a novel visual encoder, it outperforms existing open-source models across a range of face analysis tasks and achieves higher reasoning ratings from GPT under a zero-shot setting.
Abstract
The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning."
1.2. Authors
The authors are Ashutosh Chaubey, Xulang Guan, and Mohammad Soleymani. Their affiliations are the Institute for Creative Technologies, University of Southern California, Los Angeles, California, USA. Their research backgrounds appear to be in computer vision, multimodal machine learning, and affective computing, given the paper's focus.
1.3. Journal/Conference
The paper is published as a preprint on arXiv. arXiv is a well-known open-access repository for preprints of scientific papers in fields like mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference in itself, publishing on arXiv is a common practice for researchers to disseminate their work quickly, gather feedback, and establish priority before formal peer review and publication in a reputable conference or journal.
1.4. Publication Year
The paper was published on April 9, 2025.
1.5. Abstract
The paper introduces Face-LLaVA, a multimodal large language model (MLLM) designed for face-centered, in-context learning, specifically for tasks such as facial expression and attribute recognition. A key feature of Face-LLaVA is its ability to generate natural language descriptions, which can be used for reasoning about its predictions.
To achieve this, the authors first developed FaceInstruct-1M, a large-scale, face-centered dataset for instruction tuning MLLMs, leveraging existing visual databases. They also propose a novel face-specific visual encoder that integrates face geometry with local visual features through Face-Region Guided Cross-Attention.
Face-LLaVA was evaluated across nine different datasets and five distinct face processing tasks: facial expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection. The results demonstrate that Face-LLaVA achieves superior performance compared to existing open-source MLLMs in zero-shot settings and competitive performance against commercial solutions. Furthermore, its natural language outputs received higher reasoning ratings from GPT-4o in a zero-shot setting across all tasks. Both the dataset and model weights are planned for public release to support advancements in social AI and foundational vision-language research.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2504.07198v1. It is currently a preprint, meaning it has not yet undergone formal peer review or been officially published in a conference or journal. The PDF link is https://arxiv.org/pdf/2504.07198v1.pdf.
2. Executive Summary
2.1. Background & Motivation
The human face serves as a primary channel for social communication, making accurate and interpretable face analysis crucial for a wide range of human-centered applications. These applications span from human-computer interaction and social behavior assessment to e-learning and surveillance.
The core problem the paper aims to solve stems from two major limitations of existing face analysis approaches:
- Task Specificity and Limited Generalizability: Most current models are developed for specific tasks, such as facial expression recognition (FER) or facial attribute detection. This specialization limits their ability to generalize across different face analysis tasks, requiring separate models for each. While recent efforts have focused on developing generalist face models, they often still lack a unified interface for diverse tasks.
- Lack of Natural Language Reasoning: Traditional face analysis models primarily output categorical labels or numerical values without providing any natural language description or reasoning for their predictions. This lack of interpretability is a significant drawback, especially for critical applications like healthcare and surveillance, where understanding the "why" behind a prediction is as important as the prediction itself.
The paper's innovative idea is to leverage the power of Multimodal Large Language Models (MLLMs) to develop a generalist face analysis model that can perform various face processing tasks and, crucially, generate natural language descriptions and reasoning for its predictions. This bridges the gap between raw visual perception and human-understandable explanations, paving the way for more advanced and socially intelligent AI agents.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of multimodal AI and face analysis:
- FaceInstruct-1M Dataset: The authors introduce FaceInstruct-1M, a novel, large-scale (one million samples), face-centered instruction-tuning dataset. This dataset is unique in its breadth, covering five distinct face analysis tasks: facial expression recognition, action unit (AU) detection, facial attribute detection, age estimation, and deepfake detection. Crucially, it comprises both image and video data and includes task-specific instructions and detailed natural language descriptions, which are essential for training MLLMs. The dataset generation process involves an innovative GPT-4o-assisted evaluation and filtering pipeline to ensure high data quality.
- Face-LLaVA Model Architecture: The paper proposes Face-LLaVA, a multimodal large language model specifically designed for comprehensive face analysis. The model incorporates a novel face-specific visual encoder with two key components:
  - Face-Region Landmark Projector (FRLP): This module groups facial landmarks into anatomically meaningful face regions (e.g., eyes, brows, lips) and projects them into individual face-region tokens, preserving fine-grained local information crucial for nuanced face analysis.
  - Face-Region Guided Cross-Attention (FRGCA): This mechanism refines the general visual features by cross-attending them with the face-region tokens. It also employs a region-patch proximity mask to explicitly guide attention towards relevant facial areas, ensuring that the model focuses on salient face-specific information. This approach uses the LLM's context window more efficiently than simply concatenating landmark features.
- Superior Performance and Reasoning Capabilities: Through extensive experiments on nine diverse datasets across five face processing tasks, Face-LLaVA demonstrates:
  - State-of-the-art zero-shot performance: It significantly outperforms existing open-source MLLMs in a zero-shot setting, showcasing its strong generalization capabilities.
  - Competitive fine-tuned performance: It achieves performance comparable to specialized, supervised task-specific approaches, which typically have a performance advantage due to their narrow focus.
  - Enhanced reasoning: Face-LLaVA's generated natural language descriptions receive a higher reasoning rating from GPT-4o in a zero-shot setting, indicating its superior ability to provide coherent and context-aware explanations for its predictions.
These contributions address the limitations of prior work by offering a generalist, interpretable, and high-performing solution for complex face analysis, paving the way for more robust and socially aware AI systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Face-LLaVA, a grasp of several fundamental concepts in machine learning and computer vision is essential:
- Multimodal Large Language Models (MLLMs):
  - Conceptual Definition: MLLMs are advanced AI models that integrate capabilities from both Large Language Models (LLMs) and computer vision models (or other modalities like audio). They are designed to process and understand information from multiple modalities (e.g., text, images, videos) simultaneously and generate human-like text responses based on that combined understanding.
  - How they work: Typically, an MLLM uses a vision encoder to convert visual inputs (images/videos) into a sequence of numerical representations called visual tokens or embeddings. These visual tokens are then aligned with the language token space of an LLM through a projector module. The LLM then takes these visual embeddings along with textual instructions as input and generates a natural language response.
- Large Language Models (LLMs):
  - Conceptual Definition: LLMs are deep learning models, usually based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text generation, summarization, translation, and question-answering.
  - Key aspects: LLMs operate by predicting the next word or token in a sequence given the previous ones, a process known as autoregressive generation. They learn complex patterns and relationships in language, enabling them to produce coherent and contextually relevant text.
- Vision Encoders:
  - Conceptual Definition: Vision encoders are neural networks designed to extract meaningful features from images or videos. They transform raw pixel data into a higher-level, compressed numerical representation (embeddings or tokens) that captures the visual content.
  - Examples: Popular vision encoders include Convolutional Neural Networks (CNNs) and Vision Transformers. Models like CLIP (Contrastive Language-Image Pre-training) and LanguageBind are vision encoders pre-trained on large image-text paired datasets, learning robust visual representations that are semantically aligned with language. They often produce patch-based tokens, where an image is divided into small patches and each patch is encoded into a token.
- Instruction Tuning:
  - Conceptual Definition: Instruction tuning is a fine-tuning technique where MLLMs (or LLMs) are trained on a dataset consisting of diverse instruction-response pairs. The goal is to make the model better at following natural language instructions and generating appropriate responses, rather than just completing text or recognizing objects.
  - Importance: For MLLMs, instruction tuning enables them to perform specific tasks (like "describe the emotion in this face") and generate explanations, moving beyond simple classification to conversational capabilities.
- Facial Landmarks:
  - Conceptual Definition: Facial landmarks are key fiducial points on the human face, such as the corners of the eyes, the tip of the nose, the corners of the mouth, and points along the eyebrows or jawline. These points provide a precise geometric representation of the face's structure and configuration.
  - Importance in face analysis: Facial landmarks are crucial for fine-grained face analysis tasks because they capture subtle changes in facial expressions (e.g., eyebrow raise, lip corner pull), head pose, and overall facial geometry. They are less sensitive to lighting or background variations than raw pixels.
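To make "grouping landmarks into regions" concrete (the same idea is reused later by the FRLP module), here is a small illustrative snippet based on the widely used 68-point iBUG landmark convention; the index ranges are a common convention assumed here, not the paper's exact grouping.

```python
import numpy as np

# Common 68-point (iBUG) landmark layout grouped into coarse face regions.
# These index ranges are a widely used convention, assumed here for illustration.
FACE_REGIONS = {
    "jaw":        range(0, 17),
    "right_brow": range(17, 22),
    "left_brow":  range(22, 27),
    "nose":       range(27, 31),
    "nostrils":   range(31, 36),
    "right_eye":  range(36, 42),
    "left_eye":   range(42, 48),
    "outer_lips": range(48, 60),
    "inner_lips": range(60, 68),
}

def group_landmarks(landmarks):
    """Split a (68, 2) array of normalized (x, y) coordinates into per-region arrays."""
    return {name: landmarks[list(idx)] for name, idx in FACE_REGIONS.items()}

regions = group_landmarks(np.random.rand(68, 2))
print({name: coords.shape for name, coords in regions.items()})  # e.g. {'jaw': (17, 2), ...}
```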
- Cross-Attention:
  - Conceptual Definition: Cross-attention is a mechanism, typically found in Transformer architectures, that allows a model to weigh the importance of elements from one sequence (e.g., visual tokens) while processing elements from another sequence (e.g., landmark tokens). It is a way for different modalities or information sources to interact and inform each other.
  - How it works: In cross-attention, one sequence provides the Query (Q) vectors, and the other provides the Key (K) and Value (V) vectors. The Query vectors attend to the Key vectors to compute attention weights, which are then applied to the Value vectors to produce an output that is a weighted sum of the Values, modulated by the Query-Key similarities. This allows the model to selectively focus on relevant parts of the other sequence. The standard formula for Attention is:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$: Query matrix.
    - $K$: Key matrix.
    - $V$: Value matrix.
    - $d_k$: Dimensionality of the key vectors.
    - $\mathrm{softmax}$: Normalization function that converts raw scores into probabilities.
- Action Units (AUs):
  - Conceptual Definition: Action Units (AUs) are fundamental movements of individual or groups of facial muscles, as defined by the Facial Action Coding System (FACS). FACS is an anatomically based system for objectively describing all visually distinguishable facial movements.
  - Importance: AUs are a more granular and objective way to describe facial expressions than broad emotion categories (like "happy" or "sad"). Detecting AUs provides a detailed understanding of the underlying muscle movements contributing to an expression.
- Deepfake Detection:
  - Conceptual Definition: Deepfake detection is the task of identifying manipulated or synthetically generated media, particularly videos or images of human faces, that appear realistic but are fake. These manipulations are often created using deep learning techniques.
  - Challenges: Deepfakes can be very convincing, making their detection a challenging task that often requires identifying subtle inconsistencies or artifacts not present in real media.
3.2. Previous Works
The paper positions its work in the context of extensive prior research across traditional face analysis and the emerging field of MLLMs for face processing.
3.2.1. Traditional Face Analysis
- Task-Specific Models: Historically, research has focused on developing highly specialized models for individual tasks such as expression recognition (e.g., [7, 64, 70, 75, 79, 80, 84, 85]), action unit detection (e.g., [33, 43, 59, 65]), age estimation (e.g., [2, 13, 62]), attribute detection (e.g., [42, 45, 58]), and deepfake detection (e.g., [1, 9, 52]). These models achieve high performance on their specific tasks but lack generalizability.
- Generalist Face Models: More recently, the trend has shifted towards developing models capable of handling multiple face analysis tasks using a single architecture or learning robust facial representations.
  - Faceptor [54]: Builds upon FaRL [91] and SWINFace [53] by introducing a single encoder with a dual-decoder transformer, aiming to unify multiple face analysis tasks.
  - MARLIN [6] and PrefAce [19]: Utilize self-supervised learning on video data, often employing masked autoencoders (MAE) [16, 69], to learn versatile facial representations.
  - PCL [41]: Employs a pose-disentangled decoder for contrastive learning to generate robust features.
  The limitation of these generalist models, as highlighted by the authors, is their lack of a natural language interface for reasoning.
3.2.2. Face Analysis using Instruction Tuning
The advent of MLLMs has opened new avenues for face analysis, incorporating reasoning and diverse inputs.
- Early MLLM/VLM applications:
  - EmoVIT [76]: Uses instruction tuning to improve visual emotion understanding in LLMs.
  - Emotion-LLaMA [8]: An MLLM that takes audio, video, and text to describe emotional responses and reasoning in videos. (The current paper notes its reliance on background and audio context, which is a key differentiator for Face-LLaVA.)
  - Foteinopoulou et al. [12]: Explores deepfake detection and reasoning with LLMs.
  These approaches were often task-specific or relied on multimodal inputs beyond just facial cues.
- Face-Specific MLLMs:
  - FABA [34] and VL-FAU [15]: Attempt to bridge the gap by leveraging powerful visual encoders and LLM reasoning. However, their capabilities are often limited to static images and primarily focus on facial expression analysis. FABA-Instruct [34] is a face-specific dataset but is also limited to still images and affective behavior.
  - AU-LLaVA [18]: Uses an LLM for AU detection and intensity estimation but outputs only categorical/numerical predictions without deeper reasoning.
  - EmoLA [34]: Integrates landmark prior tokens with visual embeddings for facial affective behavior analysis with reasoning. (This is a more direct precursor to Face-LLaVA's approach to incorporating landmarks, but Face-LLaVA aims for greater efficiency and generalizability.)
  - EMO-LLaMA [77]: Employs a face-information mining module for enhanced facial feature encoding, demonstrating reasoning and conversational capabilities.
  - Face-MLLM [63]: Uses Gemini for automatic annotation of Laion-Face [90] for instruction tuning, but its focus remains on perception rather than reasoning over predictions, and its dataset generation relies heavily on an external commercial model's face analysis.
  - FaVChat [88]: Leverages existing face video datasets for instruction tuning, but faces issues with class imbalances in attributes.
3.3. Technological Evolution
The field has evolved from:
- Isolated, Task-Specific Models: Early computer vision focused on training separate models for each face analysis task (e.g., one CNN for FER, another for age estimation). These were often highly optimized but lacked broader applicability.
- Generalist Perception Models: The advent of large-scale pre-training and self-supervised learning (e.g., MAE, Vision Transformers) led to models capable of learning robust facial representations useful for multiple downstream tasks. These models could perform face recognition, expression recognition, and attribute estimation with a single encoder but typically still output numerical or categorical labels.
- Multimodal Language-Integrated Models (MLLMs): The recent surge in LLMs (like GPT, LLaMA) has prompted researchers to integrate visual inputs with language models. This allows for tasks that require complex reasoning and natural language outputs, moving beyond mere perception to perception with reasoning.
- Face-Specific MLLMs with Enhanced Visual Priors: This paper represents the next step, where MLLMs are not just applied to face data, but their architectures are specifically enhanced with face-expert knowledge (like facial landmarks) to improve the quality of visual encoding for fine-grained facial analysis, while still providing robust natural language reasoning.
3.4. Differentiation Analysis
Compared to the main methods in related work, Face-LLaVA offers several core differences and innovations:
- Comprehensive Multi-task Coverage: Unlike most prior MLLMs for face analysis, which focus on affective behavior or expression recognition (e.g., FABA [34], EmoLA [34]), Face-LLaVA is a truly generalist model, spanning five diverse and distinct tasks: expression recognition, AU detection, attribute detection, age estimation, and deepfake detection.
- Large-scale, Face-Centric Instruction-Tuning Dataset (FaceInstruct-1M):
  - It addresses the scarcity of large-scale, face-specific instruction-tuning data. Previous instruction-tuning datasets were often small, task-specific, or included background context/audio (e.g., Emotion-LLaMA [8], EmoVIT [76]), which can lead to hallucinations not strictly based on facial cues.
  - FaceInstruct-1M is specifically curated for faces, includes both images and videos (unlike FABA-Instruct [34], which is image-only), and employs a robust GPT-4o-assisted filtering process for quality control, reducing noise.
- Novel Face-Specific Visual Encoder (FRLP and FRGCA):
  - Existing MLLMs often rely on general-purpose vision encoders (like CLIP or LanguageBind) that are not optimized for fine-grained face features like facial landmarks. Face-LLaVA explicitly integrates these face-expert features.
  - Efficiency: Unlike approaches that append landmark features directly to visual tokens (EmoLA [34]), Face-LLaVA uses cross-attention. This means the landmark tokens (which provide geometric priors) inform the visual tokens without increasing the LLM's context window size, making the process more efficient.
  - Region-Guided Attention: The Face-Region Guided Cross-Attention with its region-patch proximity mask is innovative in forcing the model to explicitly focus its attention on anatomically relevant facial regions, which mimics human perception for complex face analysis tasks.
- Perception with Reasoning for Videos: While some prior MLLMs could reason (EmoLA, EMO-LLaMA), Face-LLaVA explicitly handles both images and videos with reasoning capabilities, providing a more versatile solution for dynamic facial analysis. It particularly avoids reliance on non-facial cues (audio, background) that some models might inadvertently use.
4. Methodology
4.1. Principles
The core idea behind Face-LLaVA is to augment a general-purpose Multimodal Large Language Model (MLLM) with specialized face-centric knowledge and mechanisms to enhance its understanding and reasoning capabilities for a diverse set of facial analysis tasks. The key principles are:
- Instruction Tuning for Generalization and Reasoning: By training on a large dataset of instruction-response pairs (FaceInstruct-1M), the model learns to understand diverse queries related to facial expressions, attributes, age, and deepfakes, and to generate natural language explanations for its predictions.
- Integration of Face-Expert Features: Recognizing that general vision encoders might not adequately capture fine-grained facial details, Face-LLaVA explicitly incorporates facial landmark features. These landmarks provide precise geometric information about the face, which is crucial for nuanced analysis.
- Efficient and Targeted Feature Fusion: Instead of simply concatenating landmark features with general visual tokens (which can exhaust the LLM's context window), Face-LLaVA uses a novel cross-attention mechanism. This mechanism allows landmark features to guide and refine the visual features by directing the model's attention to salient facial regions, ensuring that both local and holistic facial cues are effectively utilized.
4.2. Core Methodology In-depth (Layer by Layer)
Figure 3 from the original paper (VLM Description: The image is a diagram illustrating the processing flow of facial features in the Face-LLaVA model, including video input, patch encoding, facial landmark detector, and face-region guided cross-attention mechanism. The diagram includes the functionality of various modules such as the vision projector and large language model, depicting how facial-related information is processed and understood.) illustrates the architecture of Face-LLaVA.
The Face-LLaVA architecture, similar to other MLLMs like Video-LLaVA [37], consists of several main components:
- Patch-based Vision Encoder ($\mathbf{E_V}$): This module takes the raw visual input (image or video) and encodes it into a sequence of visual tokens.
- Vision Projector ($\mathbf{P_V}$): This component maps the visual tokens from the vision encoder's feature space into the language token space of the LLM.
- Tokenizer: This standard LLM component converts input text instructions into language tokens.
- Large Language Model Decoder: This is the core LLM that processes the combined visual and language tokens in an autoregressive manner to generate natural language responses.
- Face-Expert Model ($\mathbf{E_L}$): A specialized model (in this case, a facial landmark detector) that extracts fine-grained face-specific geometric features.
- Face-Region Landmark Projector (FRLP): A novel module that converts raw landmark coordinates into face-region tokens.
- Face-Region Guided Cross-Attention (FRGCA): A novel cross-attention mechanism that integrates the face-region tokens with the general visual tokens.
Let's detail the flow and the novel components:
4.2.1. Visual Input Processing
Given a visual input $x_v$ (which can be a single image or a sequence of video frames), the patch-based vision encoder $\mathbf{E_V}$ processes it. The output of $\mathbf{E_V}$ is then passed through the vision projector $\mathbf{P_V}$ to align it with the language token space. This yields the visual tokens $h_v$:
$ h_v = \mathbf{P}_{\mathbf{V}}^{\theta} ( \mathbf{E}_{\mathbf{V}} ( x_v ) ) $
Where:
- $x_v$: The input visual data (image or video).
- $\mathbf{E_V}$: The patch-based vision encoder. For video inputs, $x_v$ is processed frame-by-frame or as short clips.
- $\mathbf{P_V}$: The vision projector, parameterized by $\theta$. Its role is to map the visual features into a compatible space for the LLM.
- $h_v$: The resulting visual tokens, with shape (frames × tokens per frame × hidden dimension), where:
  - the number of video frames being processed is 1 for images,
  - the number of visual tokens per frame is fixed by the patch-based vision encoder (e.g., 256 for LanguageBind), and
  - the hidden dimension is the dimensionality of the LLM's hidden representation.
4.2.2. Face-Expert Feature Extraction
In parallel to the general visual processing, a face-expert model is used to extract face-specific features from the visual input $x_v$. In Face-LLaVA, this face-expert model is a landmark detector (specifically, FAN [5]). It extracts normalized 2D facial landmark coordinates:
$ z = \mathbf{E}_{\mathbf{L}} ( x_v ) $
Where:
- $z$: The extracted normalized 2D landmark coordinates, with shape (frames × number of landmarks × 2), where:
  - the number of video frames is 1 for images,
  - the number of facial landmarks detected is, e.g., 68, and
  - the final dimension of 2 holds the (x, y) coordinates of each landmark.
These face-expert features are then fed into the proposed novel Face-Region Landmark Projector (FRLP) and Face-Region Guided Cross-Attention (FRGCA) modules.
4.2.3. Face-Region Landmark Projector (FRLP)
The FRLP module is designed to project the facial landmark features into a more meaningful and tokenized space, preserving both local regional information and global facial structure. It achieves this using two sub-modules (a code sketch follows at the end of this subsection):
- Local Landmark Projector: This component groups facial landmarks into $M$ distinct face regions (e.g., face boundary, left/right eye, left/right brow, nose, nostril, lips, teeth). For the $i$-th face region, which contains its own subset of landmark points $z_i$, a separate Multi-Layer Perceptron (MLP) is used for projection. The resulting local landmark tokens are:
$ h_l^{local} = \{ \mathbf{MLP}_i^{local} ( z_i ) ~ \forall ~ i \in \{ 1 \ldots M \} \} $
Where:
  - $h_l^{local}$: A set of $M$ tokens per frame, each corresponding to a specific face region.
  - $\mathbf{MLP}_i^{local}$: A separate single-layer MLP for the $i$-th face region.
  - $z_i$: The 2D landmark points belonging to the $i$-th face region.
- Global Landmark Projector: To capture the overall facial structure and dynamics, all landmarks from each frame are projected to generate a single global landmark token per frame:
$ h_l^{global} = \mathbf{MLP}^{global} ( z ) $
Where:
  - $h_l^{global}$: A single token per frame, representing the global facial structure.
  - $\mathbf{MLP}^{global}$: A single-layer MLP for global landmark projection.
  - $z$: All 2D landmark points for a given frame.
The final landmark tokens $h_l$ are obtained by combining the global and local components:
$ h_l = h_l^{global} + h_l^{local} $
Since $h_l^{local}$ is a set of $M$ tokens per frame while $h_l^{global}$ is a single token per frame, the addition implies broadcasting the global token across the $M$ region tokens (the paper states that $h_l^{global}$ is broadcast in the token dimension). For simplicity and minimal computation, all MLPs used in FRLP are single-layer MLPs.
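Below is a minimal sketch of an FRLP-style module based on the description above. The region grouping (an iBUG-style 68-point split), hidden size, and single-layer MLPs are assumptions consistent with the text; this is not the authors' released code.

```python
import torch
import torch.nn as nn

# Assumed grouping of 68 landmarks into M = 9 regions (index ranges are a common convention).
REGIONS = {
    "jaw": range(0, 17), "right_brow": range(17, 22), "left_brow": range(22, 27),
    "nose": range(27, 31), "nostrils": range(31, 36), "right_eye": range(36, 42),
    "left_eye": range(42, 48), "outer_lips": range(48, 60), "inner_lips": range(60, 68),
}

class FRLP(nn.Module):
    def __init__(self, d_model=4096, n_landmarks=68):
        super().__init__()
        # One single-layer MLP per region: flattened (x, y) coords of that region -> one token.
        self.local = nn.ModuleDict(
            {name: nn.Linear(2 * len(idx), d_model) for name, idx in REGIONS.items()}
        )
        # One single-layer MLP over all landmarks -> one global token per frame.
        self.global_proj = nn.Linear(2 * n_landmarks, d_model)

    def forward(self, z):                                   # z: (T, 68, 2) normalized landmarks
        T = z.shape[0]
        local_tokens = []
        for name, idx in REGIONS.items():
            coords = z[:, list(idx), :].reshape(T, -1)      # flatten the region's (x, y) pairs
            local_tokens.append(self.local[name](coords))   # (T, d) token for this region
        h_local = torch.stack(local_tokens, dim=1)          # (T, M, d): one token per face region
        h_global = self.global_proj(z.reshape(T, -1)).unsqueeze(1)   # (T, 1, d)
        return h_local + h_global                           # broadcast-add the global token

h_l = FRLP()(torch.rand(8, 68, 2))
print(h_l.shape)   # torch.Size([8, 9, 4096])
```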
4.2.4. Face-Region Guided Cross-Attention (FRGCA)
The FRGCA module is designed to integrate the face-region landmark tokens $h_l$ with the general visual tokens $h_v$. This mechanism offers two main advantages:
- Weighing Salient Visual Tokens: Cross-attention allows the model to assign higher importance to visual tokens that are spatially close to or semantically relevant to the salient face regions (as indicated by the landmark tokens).
- Context Window Efficiency: Unlike appending landmark tokens to visual tokens (which increases the input sequence length for the LLM), cross-attention processes them separately, thus saving the LLM's context window.
Here is how FRGCA works (a simplified code sketch follows at the end of this subsection):
- Query, Key, Value Generation: Linear layers transform the visual tokens $h_v$ into Query (Q) vectors and the landmark tokens $h_l$ into Key (K) and Value (V) vectors. For simplicity, the time dimension is dropped, meaning these operations occur independently for each frame. Here, $Q$ has one row per visual token, while $K$ and $V$ have one row per face-region token.
- Face-Region Patch Proximity (RPP) Mask Generation: To further guide the attention mechanism towards relevant face regions, an RPP mask is computed. This mask is inversely proportional to the pairwise 2D Euclidean distances between the centroids of the visual patches (which correspond to the visual tokens) and the centroids of the different face regions (computed from the landmarks):
$ m_{ji}^{RPP} = - \| \mathrm{centroid}(z_i) - \mathrm{centroid}(h_{v,j}) \|_2 $
Where:
  - $m_{ji}^{RPP}$: The element of the RPP mask for the $j$-th visual token and the $i$-th face region.
  - $\mathrm{centroid}(z_i)$: The 2D centroid of the $i$-th face region (computed from its facial landmarks).
  - $\mathrm{centroid}(h_{v,j})$: The 2D centroid of the $j$-th visual patch associated with the visual token $h_{v,j}$. The negative sign ensures that smaller distances result in larger (less negative) values, and thus higher attention.
- Guided Cross-Attention: The RPP mask is added to the attention scores ($QK^T$) before the softmax layer. This guides the attention weights, forcing the model to prioritize visual patches that are geometrically closer to specific face regions. The output of the FRGCA module is then added back to the original visual tokens via a residual connection:
$ h_v^l = \mathrm{Linear}\left( \mathrm{Softmax}\left( \frac{QK^T}{\sqrt{d_{attn}}} + m^{RPP} \right) V \right) + h_v $
Where:
  - $h_v^l$: The visual tokens refined by the landmark information. This is the final visual input that goes into the LLM.
  - $\mathrm{Linear}$: A final linear projection layer.
  - $d_{attn}$: The dimensionality of the attention head, used for scaling.
  - The FRGCA module as a whole is parameterized by its own trainable weights, which are referenced in the training stages below.
4.2.5. Text Instruction Processing
The user's text instruction is processed by the tokenizer to generate language tokens $h_t$.
4.2.6. Language Model Decoder
Finally, the LLM takes the refined visual tokens $h_v^l$ and the language tokens $h_t$ as input. It then autoregressively generates the natural language response $x_r$ according to the instruction.
4.2.7. Training
The training of Face-LLaVA is conducted in multiple stages to ensure proper alignment and fine-tuning (an illustrative sketch of the stage-wise parameter freezing follows at the end of this subsection):
- Face-Region Pretraining:
  - Objective: To align the newly initialized FRLP and FRGCA modules with the existing visual and language token spaces.
  - Trainable Parameters: Only the parameters of the FRLP and FRGCA modules are trained. All other weights (vision encoder, vision projector, tokenizer, and LLM) are kept frozen.
  - Purpose: This stage ensures that the landmark tokens are effectively learned and can interact meaningfully with the visual tokens.
- Finetuning:
  - Objective: To jointly fine-tune the entire model, including the vision projector and the LLM, to improve its instruction-following capabilities and overall performance.
  - Trainable Parameters: The parameters of the FRLP, FRGCA, vision projector, and the LLM are all trained. The vision encoder and tokenizer remain frozen (pretrained).
  - Learning Rate: A lower learning rate is used in this stage compared to pretraining.
  - LLM Training: Since FaceInstruct-1M is a substantial dataset, all parameters of the LLM are fine-tuned, rather than using parameter-efficient methods like LoRA [17].
For both training stages, the model is trained in an autoregressive fashion, maximizing the likelihood of the generated response $x_r$:
$ P ( x_r | h_v^l, h_t ) = \prod_{i = 1}^{\mathcal{L}} P_{\Theta} ( x_r^i | h_v^l, h_t, x_r^{[1 : i - 1]} ) $
Where:
- $P_{\Theta}$: The probability distribution over tokens, parameterized by $\Theta$.
- $\Theta$: The set of trainable parameters (the FRLP and FRGCA parameters during pretraining; additionally the vision projector and LLM parameters during finetuning).
- $x_r$: The full natural language response generated by the model.
- $x_r^i$: The $i$-th token in the response sequence.
- $\mathcal{L}$: The total length (number of tokens) of the response $x_r$.
- $h_v^l$: The visual tokens refined by FRGCA.
- $h_t$: The language tokens from the instruction.
- $x_r^{[1:i-1]}$: The sequence of response tokens generated prior to the $i$-th token.
The learning rates for the pretraining and finetuning stages are 1e-4 and 2e-5, respectively, with a cosine learning rate schedule. Models are trained for one epoch per stage.
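A compact sketch of the stage-wise parameter freezing described above is given below; the module attribute names and the choice of optimizer are assumptions, and only the learning rates and frozen/trainable splits come from the text.

```python
import torch

def configure_stage(model, stage):
    # Stage "pretrain" (face-region pretraining): only FRLP + FRGCA learn; everything else frozen.
    # Stage "finetune": FRLP, FRGCA, vision projector, and the full LLM are trained.
    trainable_prefixes = {
        "pretrain": ["frlp", "frgca"],
        "finetune": ["frlp", "frgca", "vision_projector", "llm"],
    }[stage]
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(prefix) for prefix in trainable_prefixes)
    lr = 1e-4 if stage == "pretrain" else 2e-5            # learning rates reported in the paper
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)               # AdamW + cosine schedule assumed here

# optimizer = configure_stage(face_llava, "pretrain")     # then train one epoch per stage
```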
5. Experimental Setup
5.1. Datasets
Face-LLaVA's FaceInstruct-1M dataset is constructed from a variety of existing task-specific datasets. For evaluation, experiments are performed on nine benchmarks, which largely overlap with the source datasets used for FaceInstruct-1M.
The following are the details of the datasets used to construct FaceInstruct-1M and for evaluation:
The following are the results from Table 2 of the original paper:
| Task | Datasets | Label Acc. | Desc.-Vid. Consistency | Desc.-Lab. Consistency | Overall |
| --- | --- | --- | --- | --- | --- |
| FaceInstruct-1M (mean GPT rating, 1-10) | | | | | |
| Expression | DFEW [22], MAFW [40], FERV39k [72], Crema-D [24], AffectNet [47], RAF-DB [31] | 8.81 | 8.74 | 8.79 | 8.71 |
| AU | DISFA [46], BP4D [83] | 8.17 | 8.23 | 7.62 | 7.99 |
| Attribute | CelebA [42] | 8.79 | 8.75 | 8.52 | 8.69 |
| Age | MORPH II [56], UTKFace [87] | 8.63 | 7.92 | 7.98 | 7.99 |
| Deepfake* | FF++ [57], Fake AV Celeb [25] | 8.84 | 8.13 | 8.60 | 8.20 |
| FABAInstruct [34] (mean GPT rating, 1-10) | | | | | |
| Expression | AffectNet [47] | 8.79 | 6.44 | 6.32 | 6.22 |
*For deepfake detection, a set of real videos from DFEW [22], MAFW [40], FERV39k [72] are annotated as control samples.
This table highlights the high quality of the automatically generated annotations in FaceInstruct-1M, as rated by GPT-4o-mini, which are consistently high across tasks, especially when compared to a reference like FABAInstruct [34].
The following figure (Figure 2 from the original paper) shows samples from the FaceInstruct-1M dataset for different tasks:
The image is a schematic showing samples from the FaceInstruct-1M dataset for different tasks. It contains examples of tasks related to facial expression analysis and facial attribute recognition, covering facial emotions, action units, estimated age, and variations in facial attributes. These samples illustrate the different facial expressions and associated data used in the study, reflecting the project's applications and importance in the field of face analysis.
This figure visually demonstrates the diversity of tasks covered by FaceInstruct-1M and the type of descriptions generated. For video inputs, the descriptions capture subtle variations in facial attributes and provide justifications for the responses based on these observations.
The following figure (Figure 5 from the original paper) shows the data annotation pipeline used for creating FaceInstruct-1M dataset:
The image is a schematic of the data annotation pipeline used to create the FaceInstruct-1M dataset. The pipeline includes large-scale data input, data pre-processing and filtering, label-assisted data generation, and sample rating and filtering using GPT-4o-mini with a scoring instruction.
This pipeline illustrates the process of generating FaceInstruct-1M using large-scale data, pre-processing, label-assisted Gemini generation of descriptions, and GPT-4o-mini for rating and filtering.
5.1.1. Datasets for Facial Expression Recognition
- DFEW [22]: A large-scale database for Dynamic Facial Expressions in the Wild. It is a video dataset.
- Crema-D [24]: A multimodal (audio-visual) dataset for emotion recognition, used here for facial expression recognition.
- RAF-DB [31]: Real-world Affective Faces Database. A large dataset of static images annotated with basic and compound expressions.
- AffectNet [47]: A very large database for facial expression, valence, and arousal computing in the wild. Primarily images.
- FERV39k [72]: A large-scale multi-scene dataset for facial expression recognition in videos.
- MAFW [40]: A large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild.
5.1.2. Datasets for Action Unit (AU) Detection
- DISFA [46]: Database of Spontaneous Facial Action. Contains videos annotated for AU intensity on a 0-5 scale by FACS experts.
- BP4D [83]: Binghamton-Pittsburgh 4D Spontaneous Expression Database. Contains videos annotated with 12 AUs, along with tracked head pose and facial landmarks.
5.1.3. Datasets for Facial Attribute Detection
- CelebA [42]: CelebFaces Attributes Dataset. A large-scale dataset of celebrity faces with diverse poses and backgrounds, annotated with 40 binary facial attributes.
5.1.4. Datasets for Age Estimation
- MORPH II [56]: A longitudinal image database for age progression, containing mugshots annotated with age, gender, and race.
- UTKFace [87]: A large-scale dataset of faces with ages between 0 and 116 years, annotated with age, gender, and ethnicity.
5.1.5. Datasets for Deepfake Detection
- FaceForensics++ (FF++) [57]: A dataset consisting of 1000 original video sequences manipulated with four face manipulation methods (Deepfakes, Face2Face, FaceSwap, NeuralTextures). The paper uses the low-quality samples.
- Fake AV-Celeb [25]: A dataset including about 20K manipulated videos created with various deepfake methods, built on a base set of 500 real celebrity videos.
5.1.6. GPT-Assisted Dataset Construction for FaceInstruct-1M
FaceInstruct-1M is constructed by:
- Preprocessing: Cropping faces in videos and ensuring a single subject is present.
- Gemini Annotation: Gemini 1.5 Flash [66] is used to generate video descriptions for specific tasks. It receives the face video, its categorical label, and a carefully designed annotation instruction (e.g., "The given video has... emotion. Your task is to reason why the emotion is tagged as... Focus specifically on the facial expression...").
- Instruction Curation: 100 handcrafted instructions for each task are randomly paired with the generated descriptions for instruction tuning.
- GPT-4o Filtering: GPT-4o-mini [67] is used to rate the generated samples based on: (i) accuracy of the manual label, (ii) consistency of the generated description with the video, (iii) consistency of the description with the label, and (iv) overall quality. Samples whose overall rating falls below a quality threshold are removed (approx. 7% of the dataset).
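The filtering step can be pictured with the small sketch below; the rating function stands in for a GPT-4o-mini API call, and the field names and threshold are illustrative assumptions (the text only states that low-rated samples, about 7% of the data, are removed).

```python
def filter_samples(samples, rate_fn, min_overall):
    """Keep samples whose GPT-assigned overall rating (1-10) clears the quality threshold."""
    kept = []
    for s in samples:
        # rate_fn stands in for a GPT-4o-mini call returning the four rating aspects.
        scores = rate_fn(s["video"], s["label"], s["description"])
        if scores["overall"] >= min_overall:   # the actual threshold is not specified here
            kept.append({**s, "ratings": scores})
    return kept
```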
5.2. Evaluation Metrics
The paper uses both traditional quantitative metrics for face analysis tasks and GPT-based evaluation for reasoning capabilities.
5.2.1. Traditional Metrics
-
Unweighted Average Recall (UAR):
  - Conceptual Definition: UAR (also known as Average Per-Class Accuracy) calculates the average of the recall (or sensitivity) over all classes. It is particularly useful for evaluating performance on imbalanced datasets, as it gives equal weight to each class regardless of its sample size, and it reflects the model's ability to correctly identify instances of each class.
  - Mathematical Formula:
    $ \mathrm{UAR} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{Recall}_i, \qquad \mathrm{Recall}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} $
  - Symbol Explanation:
    - $C$: The total number of classes.
    - $\mathrm{Recall}_i$: The recall for class $i$.
    - $\mathrm{TP}_i$: Number of instances of class $i$ correctly predicted as class $i$.
    - $\mathrm{FN}_i$: Number of instances of class $i$ incorrectly predicted as other classes.
-
Weighted Average Recall (WAR):
  - Conceptual Definition: WAR (also known as Overall Accuracy) calculates the proportion of correctly predicted instances out of the total number of instances across all classes. It is sensitive to class imbalance, meaning classes with more samples have a larger impact on the WAR score, and it reflects the overall correctness of the model's predictions.
  - Mathematical Formula:
    $ \mathrm{WAR} = \frac{\sum_{i=1}^{C} \mathrm{TP}_i}{N} $
    i.e., the total number of correct predictions divided by the total number of samples.
  - Symbol Explanation:
    - $\mathrm{TP}_i$: Number of instances of class $i$ correctly predicted as class $i$.
    - $N$: The total number of evaluated samples.
-
F1-score:
  - Conceptual Definition: The F1-score is the harmonic mean of precision and recall. It provides a single score that balances precision (the proportion of positive identifications that were actually correct) and recall (the proportion of actual positives that were identified correctly). It is particularly useful for classification tasks where the class distribution is uneven. For AU detection, it is typically reported as an average F1-score across all AUs.
  - Mathematical Formula:
    $ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
  - Symbol Explanation:
    - $\mathrm{TP}$: Number of positive instances correctly identified.
    - $\mathrm{FP}$: Number of negative instances incorrectly identified as positive.
    - $\mathrm{FN}$: Number of positive instances incorrectly identified as negative.
-
Mean Absolute Error (MAE):
  - Conceptual Definition: MAE is a measure of errors between paired observations expressing the same phenomenon. It is the average of the absolute differences between predictions and actual observations, capturing the average magnitude of the errors without considering their direction. It is commonly used for regression tasks like age estimation.
  - Mathematical Formula:
    $ \mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} |y_j - \hat{y}_j| $
  - Symbol Explanation:
    - $N$: Total number of samples.
    - $y_j$: The ground truth age for sample $j$.
    - $\hat{y}_j$: The predicted age for sample $j$.
    - $|\cdot|$: Absolute value.
-
Mean Accuracy (mAcc):
  - Conceptual Definition: For facial attribute detection, this typically refers to the average accuracy across all attributes. If a model predicts multiple binary attributes for a face, the accuracy for each attribute is calculated, and then these accuracies are averaged.
  - Mathematical Formula:
    $ \mathrm{Accuracy}_k = \frac{\text{Correct Predictions}_k}{\text{Total Samples}_k}, \qquad \mathrm{mAcc} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Accuracy}_k $
  - Symbol Explanation:
    - $K$: The number of attributes.
    - $\text{Correct Predictions}_k$: Number of samples where attribute $k$ was correctly classified.
    - $\text{Total Samples}_k$: Total number of samples for which attribute $k$ was evaluated.
-
Accuracy:
  - Conceptual Definition: The proportion of total predictions that were correct. Commonly used for binary classification tasks like deepfake detection.
  - Mathematical Formula:
    $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: the sum of true positives and true negatives.
    - Total Number of Predictions: the sum of true positives, true negatives, false positives, and false negatives.
5.2.2. Reasoning Evaluation (GPT-based)
To assess the generative and reasoning capabilities of MLLMs, GPT-4o-mini [67] is used to rate the generated responses based on three aspects:
- Consistency or overlap of the given reasoning with the video: How well does the textual explanation align with the visual content of the input video/image?
- Consistency or overlap of the given reasoning with the ground truth label: How well does the textual explanation support or align with the known correct label (e.g., the actual emotion, age, or deepfake status)?
- Overall completeness of the reasoning to support the ground truth label with respect to the video: How comprehensive and convincing is the explanation in justifying the prediction, considering both the visual evidence and the correct label?
This evaluation is performed on a small, carefully crafted test set from FaceInstruct-1M consisting of 500 samples per task. Traditional text generation metrics like BLEU [49] or ROUGE [38] are deemed insufficient because they rely on n-gram matches rather than semantic analysis of the reasoning.
5.2.3. Evaluation Protocols
- Perception:
  - Zero-shot setting: The entire task-specific dataset (e.g., DFEW for expression recognition) is removed from FaceInstruct-1M when tuning Face-LLaVA. This tests generalization.
  - Fine-tuned setting: The zero-shot model is further fine-tuned using the official training splits of the benchmark dataset. This tests performance when task-specific data is available.
  - For MLLM baselines, official inference code is applied to face-cropped inputs.
  - For GPT-4o-mini and Gemini 1.5 Flash, official APIs are used for batch inference.
  - For expression recognition, attribute detection, and deepfake detection, synonym matching is used to extract categorical labels from generated text (illustrated in the sketch below). For AU detection and age estimation, numerical values are extracted directly.
- Reasoning:
  - Evaluated using GPT-4o-mini on the FaceInstruct-1M test set (500 samples per task), as described above.
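As an illustration of the synonym-matching step mentioned above, the toy sketch below maps a generated free-text answer to a categorical label; the synonym lists are invented for the example and are not the paper's.

```python
# Hypothetical synonym lists for a few emotion categories (illustrative only).
EMOTION_SYNONYMS = {
    "happiness": {"happy", "happiness", "joy", "joyful", "cheerful"},
    "sadness":   {"sad", "sadness", "sorrow", "unhappy"},
    "anger":     {"angry", "anger", "furious", "irritated"},
}

def extract_label(generated_text, synonyms=EMOTION_SYNONYMS):
    """Return the first category whose synonym set intersects the generated answer."""
    words = set(generated_text.lower().replace(",", " ").replace(".", " ").split())
    for label, syns in synonyms.items():
        if words & syns:
            return label
    return None

print(extract_label("The person appears cheerful, with raised lip corners."))  # happiness
```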
5.3. Baselines
The paper compares Face-LLaVA against a range of contemporary MLLMs and specialized supervised methods.
5.3.1. Closed-source Commercial Models
- GPT-4o-mini [67]
- Gemini-1.5F [66]
5.3.2. Open-source MLLMs (Zero-shot)
- Vid.LLaMA 3 [81]
- Qwen 2.5 VL [68]
- Vid.-LLaVA [37]
- LLaVA-Vid. [86] / LLaVA-OV [28]
- EmoLA [34] (often uses the middle video frame or image-only input)
- Emotion LLaMA [8] (multimodal, using audio and background context; also evaluated with that context removed)
5.3.3. Fine-tuned Supervised Task-Specific Baselines
These are models specifically designed and extensively trained for individual tasks, representing the state-of-the-art in their respective domains.
- Facial Expression Recognition:
EC-STFL [22], Former-DFER [89], GCA+IAL [29], M3DFEL [70], MAE-DFER [64], S2D [7], EMO-LLaMA [77], Emotion-LLaMA [8], Lei et al. [27], PTH-Net [30], MTL-ER* [82], MT-Former* [78], MTCAE-DFER [75], RUL [84], EAC [85], TransFER [79], Xue et al. [80], POSTERv2 [44], EmoLA [34].
- Action Unit Detection: ATCM [21], ReCoT [33], KS [32], ME-GraphAU [43], JÅA-Net [59], PIAP-DF [65], VL-FAU [15], AU-LLaVA [18], EmoLA [34].
- Age Estimation: PML [10], Berg et al. [4], DLDL-v2 [14], MWR [62], Faceptor [54].
- Facial Attribute Detection: Liu et al. [42], MOON [58], SwinFace [53], Faceptor [54], DMM-CNN [45].
- Deepfake Detection: MesoNet [1], Xception [9], MARLIN [6], M2TR [71], F3-Net [52].
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Facial Expression Recognition
The following are the results from Table 3 of the original paper:
| Method (DFEW [22]) | UAR ↑ | WAR ↑ | Method (Crema-D [24]) | UAR ↑ | WAR ↑ | Method (RAF-DB [31]) | Acc. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source models | | | | | | | |
| GPT4o-mini [67] | 0.426 | 0.518 | GPT4o-mini [67] | 0.410 | 0.486 | GPT4o-mini [67] | 0.758 | ||
| Gemini-1.5F [66] | 0.433 | 0.481 | Gemini-1.5F [66] | 0.465 | 0.635 | Gemini-1.5F [66] | 0.685 | ||
| Zero-shot | |||||||||
| Vid.LLaMA 3 [81] | 0.286 | 0.305 | Vid.LLaMA 3 [81] | 0.397 | 0.546 | Vid.LLaMA 3 [81] | 0.671 | ||
| Qwen 2.5 [68] | 0.293 | 0.399 | Qwen 2.5 [68] | 0.395 | 0.566 | Qwen 2.5 [68] | 0.526 | ||
| Vid.-LLaVA [37] | 0.220 | 0.326 | Vid.-LLaVA [37] | 0.367 | 0.557 | Vid.-LLaVA [37] | 0.545 | ||
| LLaVA-Vid. [86] | 0.375 | 0.498 | LLaVA-Vid. [86] | 0.478 | 0.618 | LLaVA-OV [28] | 0.700 | ||
| EmoLA* [34] | 0.346 | 0.449 | EmoLA* [34] | 0.431 | 0.618 | EmoLA [34] | 0.741 | ||
| Emotion LLaMA† [8] | 0.456 | 0.594 | Emotion LLaMA [8] | 0.225 | 0.308 | ||||
| Emotion LLaMA‡ [8] | 0.302 | 0.378 | |||||||
| Face-LLaVA (Ours) | 0.469 | 0.564 | Face-LLaVA (Ours) | 0.582 | 0.681 | Face-LLaVA (Ours) | 0.780 | ||
| Fine-tuned | |||||||||
| EC-STFL [22] | 0.454 | 0.565 | Lei et al. [27] | 0.645 | 0.648 | RUL [84] | 0.890 | ||
| Former-DFER [89] | 0.537 | 0.657 | PTH-Net [30] | 0.699 | 0.700 | EAC [85] | 0.909 | ||
| GCA+IAL [29] | 0.557 | 0.692 | MAE-DFER [64] | 0.773 | 0.774 | TransFER [79] | 0.909 | ||
| M3DFEL [70] | 0.561 | 0.693 | MTL-ER* [82] | 0.745 | 0.756 | Xue et al. [80] | 0.920 | ||
| MAE-DFER [64] | 0.634 | 0.744 | MT-Former* [78] | 0.793 | 0.807 | POSTERv2 [44] | 0.922 | ||
| S2D [7] | 0.618 | 0.760 | MTCAE-DFER [75] | 0.847 | 0.850 | EmoLA [34] | 0.921 | ||
| EMO-LLaMA [77] | 0.602 | 0.659 | |||||||
| Emotion-LLaMA [8] | 0.642 | 0.771 | |||||||
| Face-LLaVA (Ours) | 0.625 | 0.745 | Face-LLaVA (Ours) | 0.798 | 0.813 | Face-LLaVA (Ours) | 0.921 | ||
*Only using middle video frame, †: Results taken from the paper [8], ‡: Results computed after running inference code on face-cropped video.
In zero-shot settings, Face-LLaVA demonstrates superior performance across all benchmarks (DFEW, Crema-D, RAF-DB) for facial expression recognition compared to most open-source MLLMs. For instance, on DFEW, Face-LLaVA achieves 0.469 UAR and 0.564 WAR, significantly outperforming Vid.LLaMA 3 (0.286 UAR, 0.305 WAR) and Vid.-LLaVA (0.220 UAR, 0.326 WAR).
Emotion-LLaMA [8] initially shows a higher WAR on DFEW (0.594), but this model is multimodal, using audio and background context. When re-evaluated with only face-cropped video (denoted by ‡), its performance drops substantially (0.302 UAR, 0.378 WAR), highlighting Face-LLaVA's strength in relying solely on facial cues. On Crema-D, Face-LLaVA notably achieves 0.582 UAR and 0.681 WAR, which is considerably higher than LLaVA-Vid. and EmoLA.
In the fine-tuned setting, Face-LLaVA maintains competitive performance with specialized state-of-the-art models. For instance, on RAF-DB, it reaches 0.921 Accuracy, matching EmoLA [34] and being very close to the top-performing POSTERv2 [44] (0.922). This is notable given that Face-LLaVA is still trained to provide descriptions rather than just categorical labels, and it lacks external context (audio/background) used by some baselines.
6.1.2. Action Unit (AU) Detection
The following are the results from Table 4 of the original paper:
| Method | DISFA [46] Average F1 ↑ | BP4D [83] Average F1 ↑ |
| --- | --- | --- |
| Closed-source models | ||
| GPT4o-mini [67] | 0.429 | 0.496 |
| Gemini-1.5F [66] | 0.515 | 0.532 |
| Zero-shot | ||
| VideoLLaMA 3 [81] | 0.374 | 0.458 |
| Qwen 2.5 VL [68] | 0.431 | 0.467 |
| Video-LLaVA [37] | 0.442 | 0.445 |
| LLaVA-OneVision [28] | 0.280 | 0.439 |
| EmoLA [34] | 0.418 | 0.407 |
| Face-LLaVA (Ours) | 0.553 | 0.495 |
| Fine-tuned | ||
| ATCM [21] | 0.615 | 0.642 |
| ReCoT [33] | 0.626 | 0.648 |
| KS [32] | 0.628 | |
| ME-GraphAU [43] | 0.631 | 0.655 |
| JÅA-Net [59] | 0.635 | 0.624 |
| PIAP-DF [65] | 0.638 | 0.641 |
| VL-FAU [15] | 0.665 | 0.658 |
| AU-LLaVA [18] | 0.525 | 0.603 |
| EmoLA [34] | 0.651 | 0.642 |
| Face-LLaVA (Ours) | 0.729 | 0.658 |
For AU detection, Face-LLaVA again excels. In the zero-shot setting, it achieves 0.553 F1 on DISFA, significantly outperforming Gemini-1.5F (0.515) and all other open-source MLLMs. This demonstrates the effectiveness of injecting AU information from AffectNet samples during zero-shot training for AU detection.
In the fine-tuned setting on DISFA, Face-LLaVA achieves 0.729 F1, marking a substantial relative improvement of approximately 10% over the previous state-of-the-art (VL-FAU at 0.665 or EmoLA at 0.651). On BP4D, Face-LLaVA performs competitively (0.658 F1), matching VL-FAU [15] and surpassing other MLLMs. This strong performance underscores the model's ability to precisely identify subtle facial muscle movements.
6.1.3. Age Estimation, Attribute Detection, and Deepfake Detection
The following are the results from Table 5 of the original paper:
| Method | Age Est. MAE ↓ (M [56]) | Age Est. MAE ↓ (U [87]) | Method | Face Attr. mAcc. ↑ (CA [42]) | Method | DeepFake Det. Acc. ↑ (FF [57]) |
| --- | --- | --- | --- | --- | --- | --- |
| Closed-source models | ||||||||
| GPT4o-m [67] | 4.09 | 5.04 | GPT4o-m [67] | 0.780 | GPT4o-m [67] | 0.807 | ||
| Gem.-1.5F [66] | 4.78 | 6.13 | Gem.-1.5F [66] | 0.814 | Gem.-1.5F [66] | 0.770 | ||
| Zero-shot | ||||||||
| V-LLaMA3 [81] | 6.98 | 6.91 | V-LLaMA3 [81] | 0.813 | V-LLaMA3 [81] | 0.793 | ||
| Qwen 2.5 [68] | 6.09 | 5.25 | Qwen 2.5 [68] | 0.786 | Qwen 2.5 [68] | 0.653 | ||
| V-LLaVA [37] | 6.75 | 5.89 | V-LLaVA [37] | 0.795 | V-LLaVA [37] | 0.697 | ||
| LLaVA-OV [28] | 6.33 | 6.87 | LLaVA-OV [28] | 0.805 | LLaVA-V [86] | 0.751 | ||
| Face-LLaVA | 3.34 | 4.89 | Face-LLaVA | 0.868 | Face-LLaVA | 0.845 | ||
| Fine-tuned | ||||||||
| PML [10] | 2.15 | - | Liu et al. [42] | 0.873 | MesoNet [1] | 0.705 | ||
| Berg et al. [4] | - | 4.55 | MOON [58] | 0.909 | Xception [9] | 0.869 | ||
| DLDL-v2 [14] | 1.97 | 4.42 | SwinFace [53] | 0.913 | MARLIN [6] | 0.894 | ||
| MWR [62] | 2.00 | 4.37 | Faceptor [54] | 0.914 | M2TR [71] | 0.929 | ||
| Faceptor [54] | 1.96 | 4.10 | DMM-CNN [45] | 0.917 | F3-Net [52] | 0.930 | ||
| Face-LLaVA | 2.02 | 4.06 | Face-LLaVA | 0.901 | Face-LLaVA | 0.888 | ||
M: MORPH II [56], U: UTKFace [87], CA: CelebA [42], FF: FaceForensics++ [57].
Face-LLaVA continues to demonstrate strong performance in these diverse tasks:
- Age Estimation: In the zero-shot setting, Face-LLaVA achieves the lowest MAE on both MORPH II (3.34) and UTKFace (4.89), significantly outperforming all other MLLM baselines, including GPT-4o-mini and Gemini-1.5F. This is impressive for an MLLM attempting a regression task. In the fine-tuned setting, Face-LLaVA remains highly competitive, achieving 2.02 MAE on MORPH II and 4.06 MAE on UTKFace, which are very close to or better than specialized regression-based baselines.
- Facial Attribute Detection: For zero-shot attribute detection on CelebA, Face-LLaVA achieves 0.868 mAcc, outperforming all other MLLMs. In the fine-tuned setting, it reaches 0.901 mAcc, showing competitive performance against specialized models, which often achieve slightly higher scores (DMM-CNN [45] at 0.917).
- Deepfake Detection: This task proves challenging for most MLLMs. In the zero-shot setting on FF++ (low-quality videos), Face-LLaVA achieves 0.845 accuracy, significantly surpassing all other MLLM baselines (which are often close to or worse than the 80% random baseline accuracy for this imbalanced dataset). This highlights its ability to discern subtle forgery artifacts. In the fine-tuned setting, Face-LLaVA reaches 0.888 accuracy. While this is lower than the absolute SOTA (F3-Net [52] at 0.930), the paper attributes this gap to Face-LLaVA's limitation of processing only eight frames per video, whereas fine-tuned baselines often use a higher number of frames.
6.1.4. Evaluation of Reasoning
The following are the results from Table 10 of the original paper:
| Method | Video Consistency: Emo. | AU | Attr. | Age | DF. | All | GT Consistency: Emo. | AU | Attr. | Age | DF. | All | Completeness: Emo. | AU | Attr. | Age | DF. | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GT from FaceInstruct-1M | 9.47 | 8.52 | 9.80 | 9.27 | 8.85 | 9.18 | 9.70 | 8.84 | 9.88 | 9.55 | 9.56 | 9.51 | 9.21 | 8.26 | 9.75 | 9.02 | 8.41 | 8.93 |
| VideoLLaMA 3 [81] | 5.14 | 2.58 | 6.90 | 5.82 | 7.02 | 5.49 | 5.27 | 2.06 | 6.27 | 5.13 | 7.64 | 5.27 | 4.90 | 2.73 | 6.50 | 5.37 | 6.51 | 5.20 |
| Qwen 2.5 VL [68] | 5.82 | 3.02 | 5.48 | 7.36 | 5.48 | 5.43 | 5.96 | 2.54 | 4.86 | 7.02 | 5.76 | 5.23 | 5.57 | 3.34 | 5.21 | 6.89 | 5.30 | 5.26 |
| Video LLaVA [37] | 4.31 | 2.58 | 5.82 | 7.79 | 6.19 | 5.34 | 4.47 | 2.06 | 5.28 | 7.10 | 6.58 | 5.10 | 4.20 | 2.73 | 5.30 | 7.00 | 5.84 | 5.01 |
| LLaVA-OV [28] | 7.11 | 2.18 | 6.08 | 7.97 | 6.69 | 6.01 | 7.30 | 1.95 | 5.48 | 7.44 | 7.33 | 5.90 | 6.67 | 2.34 | 5.72 | 7.52 | 6.19 | 5.69 |
| EmoLA [34] | 7.33 | 5.17 | - | - | - | - | 7.58 | 5.04 | - | - | - | - | 6.81 | 5.32 | - | - | - | - |
| Emotion-LLaMA [8] | 6.77 | - | - | - | - | - | 6.90 | - | - | - | - | - | 6.50 | - | - | - | - | - |
| Face-LLaVA (Ours) | 7.95 | 8.34 | 7.68 | 8.56 | 7.89 | 8.09 | 8.14 | 9.20 | 7.94 | 8.11 | 7.79 | 8.24 | 7.79 | 8.11 | 7.59 | 8.11 | 7.60 | 7.84 |
The GPT-based reasoning evaluation reveals a significant advantage for Face-LLaVA. It consistently achieves superior ratings compared to other MLLM baselines across all three aspects (Reason-Video Consistency, Reason-GT Consistency, and Reasoning Completeness) and all five tasks.
- The mean rating for Reasoning Completeness for Face-LLaVA is 7.84, roughly 38% higher than the best baseline (LLaVA-OV at 5.69).
- For Reason-Video Consistency, Face-LLaVA scores 8.09, indicating strong alignment between its generated descriptions and the actual visual content.
- For Reason-GT Consistency, Face-LLaVA scores 8.24, showing that its reasoning effectively supports the ground-truth labels.
- These high scores highlight Face-LLaVA's ability not only to perceive facial information accurately but also to articulate coherent, context-aware, and faithful explanations, a critical feature for interpretable AI. (A minimal sketch of how such a GPT-based rating protocol can be implemented follows this list.)
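The rating protocol is straightforward to reproduce in spirit: an LLM judge is shown the ground-truth label and the model's explanation (and, for video-consistency ratings, the visual content as well, which this sketch omits) and asked for a 0-10 score, which is then averaged per task. The snippet below is a minimal, hypothetical sketch using the OpenAI Chat Completions client; the prompt wording, the default judge model, and the `rate_reasoning` helper are assumptions rather than the paper's actual evaluation code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are rating an explanation of a face analysis prediction.
Ground-truth label: {label}
Model explanation: {reasoning}
On a 0-10 scale, rate how completely the explanation justifies the label.
Reply with a single number."""

def rate_reasoning(label: str, reasoning: str, judge_model: str = "gpt-4o-mini") -> float:
    """Ask an LLM judge for a 0-10 completeness rating; hypothetical prompt wording.
    Assumes the judge complies with the number-only instruction."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(label=label, reasoning=reasoning)}],
        temperature=0.0,
    )
    return float(response.choices[0].message.content.strip())

# Per-task scores would then simply be averaged, e.g.:
# scores = [rate_reasoning(l, r) for l, r in zip(labels, explanations)]
# print(sum(scores) / len(scores))
```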
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 6 of the original paper:
| Model | Landmark Projector | Cross-Attention | Input tokens | DFEW [22] UAR | DFEW [22] WAR |
| Baseline | - | - | h_v | 0.391 | 0.479 |
| Baseline + Landmarks | only global | - | h_v + h_l^global | 0.402 | 0.483 |
| | only local | - | h_v + h_l^local | 0.409 | 0.491 |
| | FRLP | - | h_v + h_l | 0.410 | 0.491 |
| | only global | simple | h_l^global | 0.401 | 0.483 |
| | only local | simple | h_l^local | 0.409 | 0.494 |
| | FRLP | simple | h_v | 0.416 | 0.512 |
| | FRLP | FRGCA | h_v | 0.424 | 0.520 |
| Baseline + Face Parsing | - | simple | h_p | 0.413 | 0.498 |
This table presents an ablation study evaluating the contributions of different modules to the zero-shot performance on DFEW [22] (using MAFW [40], FERV39k [40], and Crema-D [24] for training). The baseline model is Video-LLaVA [37], which uses only visual tokens (h_v).
-   Impact of Landmark Tokens (without Cross-Attention):
    - Adding global landmark tokens (h_l^global) to the visual tokens improves UAR from 0.391 to 0.402 and WAR from 0.479 to 0.483.
    - Adding local landmark tokens (h_l^local) yields slightly better results: UAR 0.409, WAR 0.491.
    - Using the full FRLP (combining global and local) without cross-attention (h_v + h_l) results in UAR 0.410, WAR 0.491. These results indicate that simply appending landmark tokens as additional input tokens to the LLM provides a performance boost over the baseline, but the gains are modest.
-   Impact of Cross-Attention:
    - When simple cross-attention (without the RPP mask) is used to integrate landmark tokens (either global or local) with visual tokens (h_v), performance improves further. For example, using FRLP with simple cross-attention results in UAR 0.416 and WAR 0.512. This suggests that allowing landmark tokens to refine visual tokens through cross-attention is more effective than concatenating them as direct input to the LLM.
    - This also validates the design choice of cross-attention for efficiency, as it avoids increasing the LLM's context window.
-   Impact of Face-Region Guided Cross-Attention (FRGCA):
    - The full Face-LLaVA model, employing FRLP for landmark projection and FRGCA (which includes the region-patch proximity mask) for cross-attention, achieves the highest performance in this ablation: UAR 0.424 and WAR 0.520.
    - This demonstrates that the explicit guidance provided by the RPP mask in FRGCA leads to a larger performance gain than simple cross-attention, validating the effectiveness of focusing attention on specific facial regions.
-   Landmarks vs. Face Parsing:
    - An alternative experiment replacing landmarks with face parse heatmaps (encoding them through the visual encoder and using them in simple cross-attention) yields UAR 0.413 and WAR 0.498.
    - While competitive with some landmark-based configurations, this approach incurs extra computational and memory costs due to the pixel-wise nature of face parse maps (which produce tokens of the same size as the visual tokens). The authors conclude that landmarks offer a more efficient and equally effective way of injecting face-expert knowledge.

In summary, the ablation study clearly shows that:

- Adding face-specific landmark features is beneficial.
- Integrating these landmark features via cross-attention is more effective than direct concatenation.
- The novel Face-Region Guided Cross-Attention (with the RPP mask) provides the most significant performance boost by intelligently guiding the attention mechanism to salient facial regions. A minimal sketch of this masked cross-attention idea follows the list.
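To make the masked cross-attention idea concrete, the following PyTorch sketch has visual patch tokens (queries) attend to face-region tokens (keys/values) with an additive bias that suppresses region-patch pairs that are far apart spatially, which is the intuition behind the region-patch proximity mask. This is not the authors' implementation: the single attention head, the feature dimensions, the 0.25 distance threshold, and the `RegionGuidedCrossAttention` class are all illustrative assumptions.

```python
import torch
from torch import nn

class RegionGuidedCrossAttention(nn.Module):
    """Visual patch tokens (queries) attend to face-region tokens (keys/values).
    A proximity-based additive bias down-weights region tokens far from each patch.
    Sketch only; shapes and the masking strategy are assumptions, not the paper's code."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, visual_tokens, region_tokens, proximity_bias):
        # visual_tokens:  (B, Np, D)  patch features from the visual encoder
        # region_tokens:  (B, Nr, D)  projected landmark/region features (FRLP-like projector)
        # proximity_bias: (B, Np, Nr) large negative values for distant region-patch pairs
        q, k, v = self.q(visual_tokens), self.k(region_tokens), self.v(region_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale + proximity_bias, dim=-1)
        return visual_tokens + attn @ v  # residual: landmark-refined visual tokens

# Toy usage: 2 images, 576 patches, 5 face regions, feature dim 256.
B, Np, Nr, D = 2, 576, 5, 256
patch_xy = torch.rand(B, Np, 2)          # normalized patch-center coordinates (assumed)
region_xy = torch.rand(B, Nr, 2)         # normalized region-center coordinates (assumed)
dist = torch.cdist(patch_xy, region_xy)  # (B, Np, Nr) Euclidean region-patch distances
bias = torch.where(dist < 0.25, torch.zeros_like(dist), torch.full_like(dist, -1e4))

layer = RegionGuidedCrossAttention(D)
refined = layer(torch.randn(B, Np, D), torch.randn(B, Nr, D), bias)
print(refined.shape)  # torch.Size([2, 576, 256])
```

In the full model, the refined visual tokens would then be projected into the LLM's embedding space, so the landmark information reaches the LLM without adding any extra tokens to its context, consistent with the efficiency argument made above.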
7. Conclusion & Reflections
7.1. Conclusion Summary
This work marks a significant advancement in the application of Multimodal Large Language Models (MLLMs) to the nuanced field of facial analysis. The authors successfully address the limitations of previous approaches by introducing Face-LLaVA, a generalist MLLM capable of performing a diverse range of facial perception tasks with accompanying natural language reasoning.
The core contributions include:
-   FaceInstruct-1M: A large-scale, face-centric instruction-tuning dataset comprising over one million image and video samples, automatically annotated using Gemini and filtered by GPT-4o. This dataset is crucial for instruction tuning MLLMs for comprehensive face processing.
-   Face-LLaVA Architecture: A novel MLLM architecture that integrates face-specific geometric features (facial landmarks) into the visual encoding process through two innovative modules: the Face-Region Landmark Projector (FRLP) and Face-Region Guided Cross-Attention (FRGCA). These modules allow Face-LLaVA to leverage fine-grained facial cues without unnecessarily expanding the LLM's context window.
-   Demonstrated Superiority: Extensive evaluations across nine datasets and five distinct face analysis tasks (facial expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection) confirm Face-LLaVA's effectiveness. It achieves superior zero-shot performance compared to existing open-source MLLMs and competitive performance against specialized, supervised task-specific methods. Furthermore, its generative capabilities are validated by higher reasoning ratings from GPT-4o, confirming its ability to provide comprehensive and consistent explanations.

By offering a model that combines robust perception with interpretable reasoning, Face-LLaVA substantially contributes to the development of more sophisticated and socially aware AI systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations of the current Face-LLaVA model:
-   Single-Turn Interactions: The current model is limited to single-turn interactions, meaning it can respond to an instruction but cannot engage in multi-turn dialogue or follow-up questions.
-   Lack of Advanced Chain-of-Thought Reasoning: Face-LLaVA does not currently incorporate advanced chain-of-thought reasoning capabilities, which would allow it to break down complex problems into intermediate steps and provide more elaborate explanations.
-   Limited Task Scope: The model has not explored face identification (recognizing specific individuals) or dense prediction tasks (e.g., pixel-level facial parsing or reconstruction).
-   Inherited Biases: Because the FaceInstruct-1M dataset is constructed from existing public datasets, it may inherit biases present in those original sources (e.g., demographic biases in age, gender, or race representation). The authors state that addressing model bias is a direction for future work.

Based on these limitations, the authors suggest future research directions:

-   Integrating multi-turn dialogue capabilities.
-   Expanding to other facial tasks, such as face identification or dense prediction.
-   Addressing potential biases in the dataset and model.
7.3. Personal Insights & Critique
Face-LLaVA represents a compelling step forward in making facial AI more generalist and interpretable. The paper's strength lies in its holistic approach: not only proposing a novel MLLM architecture but also developing a custom, large-scale instruction-tuning dataset specifically tailored for face analysis.
Innovations and Strengths:
-   Dataset Quality and Scope: The FaceInstruct-1M dataset is a significant contribution. Its scale, multi-task nature, and inclusion of video data are vital. The GPT-assisted annotation and filtering pipeline is particularly innovative, demonstrating a robust way to generate high-quality instruction-tuning data automatically, a common bottleneck for MLLM development. The high GPT-4o ratings for FaceInstruct-1M validate this approach.
-   Efficient Landmark Integration: The FRLP and FRGCA modules are elegant solutions. They effectively inject face-expert knowledge (facial landmarks) into the MLLM without burdening the LLM's context window (a frequent challenge in multimodal models). The region-patch proximity mask is a clever mechanism for focusing attention on anatomically meaningful areas, mimicking how humans often analyze faces. This targeted attention likely contributes to the superior performance on fine-grained tasks such as AU detection.
-   Demonstrated Generalization and Reasoning: The impressive zero-shot performance across diverse tasks is a strong indicator of the model's generalization capabilities. More importantly, the high GPT-based reasoning ratings validate the model's ability to provide human-understandable explanations, which is critical for trust and for applications in sensitive domains such as healthcare or surveillance.
-   Video Processing: The ability to handle both images and videos is a practical advantage, especially for dynamic tasks such as expression recognition and deepfake detection in real-world scenarios.
Potential Issues and Areas for Improvement:
-   Reliance on Commercial LLMs: While the GPT-assisted data generation and evaluation are effective, they inherently rely on closed-source commercial models (Gemini, GPT-4o-mini). This introduces a dependency and a potential black-box aspect to the data quality and evaluation metrics, which could be less transparent than fully open-source pipelines. Future work could explore using open-source LLMs for these tasks, though they may not currently match the performance of commercial counterparts.
-   Computational Cost of FRGCA: While FRGCA is efficient in terms of the LLM context window, the cross-attention mechanism itself, especially the computation of the RPP mask and attention weights over potentially many visual patches and face regions, adds computational overhead compared to a pure Video-LLaVA baseline. The paper mentions "keeping extra compute to a minimum" for the MLPs, but the cross-attention still adds layers.
-   Bias Mitigation: The acknowledgment of inherited biases is important. A deeper analysis and proactive mitigation strategies for demographic biases (e.g., fairness metrics, debiasing techniques) would further strengthen the work, particularly given the sensitive nature of face analysis.
-   Context Window for Videos: While FRGCA spares the LLM's context window from landmark tokens, the number of video frames processed is still limited (e.g., eight frames for deepfake detection). This is a common MLLM limitation, but it explains the gap to SOTA methods that can leverage more temporal information. Future work could explore more advanced video summarization or long-context video processing techniques.
Transferability and Applications:
The methods and conclusions from Face-LLaVA can be transferred to other fine-grained multimodal tasks where specific geometric or structural priors are crucial, but general vision encoders might fall short. Examples include:
-   Medical Image Analysis: Integrating domain-specific anatomical landmarks (e.g., in X-rays or MRI scans) with MLLMs for diagnosis and explanation generation.
-   Action Recognition: Using pose-estimation landmarks to guide MLLMs toward detailed descriptions of human actions and activities.
-   Robotics: Allowing robots to interpret human social cues (via face analysis) and explain their interpretations in natural language, leading to more natural human-robot interaction.

Overall, Face-LLaVA provides a robust and interpretable framework for face analysis, significantly pushing the boundaries of what MLLMs can achieve in this critical domain.