Paper status: completed

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning

Published: 04/10/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Face-LLaVA is a multimodal large language model designed for facial expression and attribute recognition, generating natural language descriptions. Utilizing the FaceInstruct-1M dataset and a novel visual encoder, it outperforms existing open-source models on various tasks and achieves higher reasoning ratings from GPT in zero-shot evaluations.

Abstract

The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning."

1.2. Authors

The authors are Ashutosh Chaubey, Xulang Guan, and Mohammad Soleymani. Their affiliations are the Institute for Creative Technologies, University of Southern California, Los Angeles, California, USA. Their research backgrounds appear to be in computer vision, multimodal machine learning, and affective computing, given the paper's focus.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. arXiv is a well-known open-access repository for preprints of scientific papers in fields like mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference in itself, publishing on arXiv is a common practice for researchers to disseminate their work quickly, gather feedback, and establish priority before formal peer review and publication in a reputable conference or journal.

1.4. Publication Year

The paper was published on April 9, 2025.

1.5. Abstract

The paper introduces Face-LLaVA, a multimodal large language model (MLLM) designed for face-centered, in-context learning, specifically for tasks such as facial expression and attribute recognition. A key feature of Face-LLaVA is its ability to generate natural language descriptions, which can be used for reasoning about its predictions.

To achieve this, the authors first developed FaceInstruct-1M, a large-scale, face-centered dataset for instruction tuning MLLMs, leveraging existing visual databases. They also propose a novel face-specific visual encoder that integrates face geometry with local visual features through Face-Region Guided Cross-Attention.

Face-LLaVA was evaluated across nine different datasets and five distinct face processing tasks: facial expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection. The results demonstrate that Face-LLaVA achieves superior performance compared to existing open-source MLLMs in zero-shot settings and competitive performance against commercial solutions. Furthermore, its natural language outputs received higher reasoning ratings from GPT-4o in a zero-shot setting across all tasks. Both the dataset and model weights are planned for public release to support advancements in social AI and foundational vision-language research.

The original source link is https://arxiv.org/abs/2504.07198v1. It is currently a preprint, meaning it has not yet undergone formal peer review or been officially published in a conference or journal. The PDF link is https://arxiv.org/pdf/2504.07198v1.pdf.

2. Executive Summary

2.1. Background & Motivation

The human face serves as a primary channel for social communication, making accurate and interpretable face analysis crucial for a wide range of human-centered applications. These applications span from human-computer interaction and social behavior assessment to e-learning and surveillance.

The core problem the paper aims to solve stems from two major limitations of existing face analysis approaches:

  1. Task Specificity and Limited Generalizability: Most current models are developed for specific tasks, such as facial expression recognition (FER) or facial attribute detection. This specialization limits their ability to generalize across different face analysis tasks, requiring separate models for each. While recent efforts have focused on developing generalist face models, they often still lack a unified interface for diverse tasks.

  2. Lack of Natural Language Reasoning: Traditional face analysis models primarily output categorical labels or numerical values without providing any natural language description or reasoning for their predictions. This lack of interpretability is a significant drawback, especially for critical applications like healthcare and surveillance, where understanding the "why" behind a prediction is as important as the prediction itself.

    The paper's innovative idea is to leverage the power of Multimodal Large Language Models (MLLMs) to develop a generalist face analysis model that can perform various face processing tasks and, crucially, generate natural language descriptions and reasoning for its predictions. This bridges the gap between raw visual perception and human-understandable explanations, paving the way for more advanced and socially intelligent AI agents.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of multimodal AI and face analysis:

  1. FaceInstruct-1M Dataset: The authors introduce FaceInstruct-1M, a novel, large-scale (one million samples), face-centered instruction-tuning dataset. This dataset is unique in its breadth, covering five distinct face analysis tasks: facial expression recognition, action unit (AU) detection, facial attribute detection, age estimation, and deepfake detection. Crucially, it comprises both image and video data and includes task-specific instructions and detailed natural language descriptions, which are essential for training MLLMs. The dataset generation process involves an innovative GPT-4o-assisted evaluation and filtering pipeline to ensure high data quality.
  2. Face-LLaVA Model Architecture: The paper proposes Face-LLaVA, a multimodal large language model specifically designed for comprehensive face analysis. The model incorporates a novel face-specific visual encoder with two key components:
    • Face-Region Landmark Projector (FRLP): This module groups facial landmarks into different anatomically meaningful face regions (e.g., eyes, brows, lips) and projects them into individual face-region tokens. This preserves fine-grained local information crucial for nuanced face analysis.
    • Face-Region Guided Cross-Attention (FRGCA): This mechanism refines the general visual features by cross-attending them with the face-region tokens. It also employs a region-patch proximity mask to explicitly guide attention towards relevant facial areas, ensuring that the model focuses on salient face-specific information. This approach is more efficient in utilizing the LLM's context window compared to simply concatenating landmark features.
  3. Superior Performance and Reasoning Capabilities: Through extensive experiments on nine diverse datasets across five face processing tasks, Face-LLaVA demonstrates:
    • State-of-the-art zero-shot performance: It significantly outperforms existing open-source MLLMs in a zero-shot setting, showcasing its strong generalization capabilities.

    • Competitive fine-tuned performance: It achieves performance comparable to specialized, supervised task-specific approaches, which typically have a performance advantage due to their narrow focus.

    • Enhanced reasoning: Face-LLaVA's generated natural language descriptions receive a higher reasoning rating from GPT-4o in a zero-shot setting, indicating its superior ability to provide coherent and context-aware explanations for its predictions.

      These contributions address the limitations of prior work by offering a generalist, interpretable, and high-performing solution for complex face analysis, paving the way for more robust and socially aware AI systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Face-LLaVA, a grasp of several fundamental concepts in machine learning and computer vision is essential:

  • Multimodal Large Language Models (MLLMs):
    • Conceptual Definition: MLLMs are advanced AI models that integrate capabilities from both Large Language Models (LLMs) and computer vision models (or other modalities like audio). They are designed to process and understand information from multiple modalities (e.g., text, images, videos) simultaneously and generate human-like text responses based on that combined understanding.
    • How they work: Typically, an MLLM uses a vision encoder to convert visual inputs (images/videos) into a sequence of numerical representations called visual tokens or embeddings. These visual tokens are then aligned with the language token space of an LLM through a projector module. The LLM then takes these visual embeddings along with textual instructions as input and generates a natural language response.
  • Large Language Models (LLMs):
    • Conceptual Definition: LLMs are deep learning models, usually based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text generation, summarization, translation, and question-answering.
    • Key aspects: LLMs operate by predicting the next word or token in a sequence given the previous ones, a process known as autoregressive generation. They learn complex patterns and relationships in language, enabling them to produce coherent and contextually relevant text.
  • Vision Encoders:
    • Conceptual Definition: Vision encoders are neural networks designed to extract meaningful features from images or videos. They transform raw pixel data into a higher-level, compressed numerical representation (embeddings or tokens) that captures the visual content.
    • Examples: Popular vision encoders include Convolutional Neural Networks (CNNs) and Vision Transformers. Models like CLIP (Contrastive Language–Image Pre-training) and LanguageBind are examples of vision encoders that are pre-trained on large image-text paired datasets, learning robust visual representations that are semantically aligned with language. They often produce patch-based tokens, where an image is divided into small patches, and each patch is encoded into a token.
  • Instruction Tuning:
    • Conceptual Definition: Instruction tuning is a fine-tuning technique where MLLMs (or LLMs) are trained on a dataset consisting of diverse instruction-response pairs. The goal is to make the model better at following natural language instructions and generating appropriate responses, rather than just completing text or recognizing objects.
    • Importance: For MLLMs, instruction tuning enables them to perform specific tasks (like "describe the emotion in this face") and generate explanations, moving beyond simple classification to conversational capabilities.
  • Facial Landmarks:
    • Conceptual Definition: Facial landmarks are key fiducial points on the human face, such as the corners of the eyes, tip of the nose, corners of the mouth, and points along the eyebrows or jawline. These points provide a precise geometric representation of the face's structure and configuration.
    • Importance in face analysis: Facial landmarks are crucial for fine-grained face analysis tasks because they capture subtle changes in facial expressions (e.g., eyebrow raise, lip corner pull), head pose, and overall facial geometry. They are less sensitive to lighting or background variations than raw pixels.
  • Cross-Attention:
    • Conceptual Definition: Cross-attention is a mechanism, typically found in Transformer architectures, that allows a model to weigh the importance of elements from one sequence (e.g., visual tokens) while processing elements from another sequence (e.g., landmark tokens). It's a way for different modalities or information sources to interact and inform each other.
    • How it works: In cross-attention, one sequence provides the Query (Q) vectors, and the other provides Key (K) and Value (V) vectors. The queries attend to the keys to compute attention weights, which are then applied to the values to produce an output that is a weighted sum of the values, modulated by the queries. This allows the model to selectively focus on relevant parts of the other sequence (a minimal code sketch follows this list). The standard formula for attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$: Query matrix.
      • $K$: Key matrix.
      • $V$: Value matrix.
      • $d_k$: Dimensionality of the key vectors.
      • $\mathrm{softmax}$: Normalization function that converts raw scores into probabilities.
  • Action Units (AUs):
    • Conceptual Definition: Action Units (AUs) are fundamental movements of individual or groups of facial muscles, as defined by the Facial Action Coding System (FACS). FACS is an anatomically based system for objectively describing all visually distinguishable facial movements.
    • Importance: AUs are a more granular and objective way to describe facial expressions than broad emotion categories (like "happy" or "sad"). Detecting AUs provides a detailed understanding of the underlying muscle movements contributing to an expression.
  • Deepfake Detection:
    • Conceptual Definition: Deepfake detection is the task of identifying manipulated or synthetically generated media, particularly videos or images of human faces, that appear realistic but are fake. These manipulations are often created using deep learning techniques.
    • Challenges: Deepfakes can be very convincing, making their detection a challenging task that often requires identifying subtle inconsistencies or artifacts not present in real media.
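As referenced in the Cross-Attention entry above, the following is a minimal PyTorch sketch of single-head cross-attention between two token sequences. Names and dimensions are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries come from one sequence,
    keys/values from another (illustrative sketch)."""
    def __init__(self, dim: int, d_attn: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, d_attn)
        self.k_proj = nn.Linear(dim, d_attn)
        self.v_proj = nn.Linear(dim, d_attn)
        self.out_proj = nn.Linear(d_attn, dim)

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        # x_query:   (batch, N, dim)  e.g., visual tokens
        # x_context: (batch, M, dim)  e.g., landmark tokens
        Q = self.q_proj(x_query)                                 # (batch, N, d_attn)
        K = self.k_proj(x_context)                               # (batch, M, d_attn)
        V = self.v_proj(x_context)                               # (batch, M, d_attn)
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5    # (batch, N, M)
        attn = F.softmax(scores, dim=-1)                         # weights over context tokens
        return self.out_proj(attn @ V)                           # (batch, N, dim)

# Usage: 256 visual tokens attend to 10 context tokens.
vis = torch.randn(2, 256, 768)
ctx = torch.randn(2, 10, 768)
out = CrossAttention(dim=768, d_attn=64)(vis, ctx)               # -> (2, 256, 768)
```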

3.2. Previous Works

The paper positions its work in the context of extensive prior research across traditional face analysis and the emerging field of MLLMs for face processing.

3.2.1. Traditional Face Analysis

  • Task-Specific Models: Historically, research has focused on developing highly specialized models for individual tasks such as expression recognition (e.g., [7, 64, 70, 75, 79, 80, 84, 85]), action unit detection (e.g., [33, 43, 59, 65]), age estimation (e.g., [2, 13, 62]), attribute detection (e.g., [42, 45, 58]), and deepfake detection (e.g., [1, 9, 52]). These models achieve high performance on their specific tasks but lack generalizability.
  • Generalist Face Models: More recently, the trend has shifted towards developing models capable of handling multiple face analysis tasks using a single architecture or learning robust facial representations.
    • Faceptor [54]: Builds upon FaRL [91] and SWINFace [53] by introducing a single encoder with a dual-decoder transformer, aiming to unify multiple face analysis tasks.
    • MARLIN [6] and PrefAce [19]: Utilize self-supervised learning on video data, often employing masked autoencoders (MAE) [16, 69] to learn versatile facial representations.
    • PCL [41]: Employs a pose-disentangled decoder for contrastive learning to generate robust features.
  The limitation of these generalist models, as highlighted by the authors, is their lack of a natural language interface for reasoning.

3.2.2. Face Analysis using Instruction Tuning

The advent of MLLMs has opened new avenues for face analysis, incorporating reasoning and diverse inputs.

  • Early MLLM/VLM applications:
    • EmoVIT [76]: Uses instruction tuning to improve visual emotion understanding in LLMs.
    • Emotion-LLaMA [8]: An MLLM that takes audio, video, and text to describe emotional responses and reasoning in videos. (However, the current paper notes its reliance on background and audio context, which is a key differentiator for Face-LLaVA).
    • Foteinopoulou et al. [12]: Explores deepfake detection and reasoning with LLMs. These approaches were often task-specific or relied on multimodal inputs beyond just facial cues.
  • Face-Specific MLLMs:
    • FABA [34] and VL-FAU [15]: Attempt to bridge the gap by leveraging powerful visual encoders and LLM reasoning. However, their capabilities are often limited to static images and primarily focus on facial expression analysis. FABA-Instruct [34] is a face-specific dataset but also limited to still images and affective behavior.
    • AU-LLAVA [18]: Uses an LLM for AU detection and intensity estimation but outputs only categorical/numerical predictions without deeper reasoning.
    • EmoLA [34]: Integrates landmark prior tokens with visual embeddings for facial affective behavior analysis with reasoning. (This is a more direct precursor to Face-LLaVA's approach to incorporating landmarks, but Face-LLaVA aims for greater efficiency and generalizability).
    • EMO-LLaMA [77]: Employs a face-information mining module for enhanced facial feature encoding, demonstrating reasoning and conversational capabilities.
    • Face-MLLM [63]: Uses Gemini for automatic annotation of Laion-Face [90] for instruction tuning, but its focus remains on perception rather than reasoning over predictions, and its dataset generation relies heavily on an external commercial model's face analysis.
    • FaVChat [88]: Leverages existing face video datasets for instruction tuning, but faces issues with class imbalances in attributes.

3.3. Technological Evolution

The field has evolved from:

  1. Isolated, Task-Specific Models: Early computer vision focused on training separate models for each face analysis task (e.g., one CNN for FER, another for age estimation). These were often highly optimized but lacked broader applicability.
  2. Generalist Perception Models: The advent of large-scale pre-training and self-supervised learning (e.g., MAE, Vision Transformers) led to models capable of learning robust facial representations useful for multiple downstream tasks. These models could perform face recognition, expression recognition, and attribute estimation with a single encoder but typically still output numerical or categorical labels.
  3. Multimodal Language-Integrated Models (MLLMs): The recent surge in LLMs (like GPT, LLaMA) has prompted researchers to integrate visual inputs with language models. This allows for tasks that require complex reasoning and natural language outputs, moving beyond mere perception to perception with reasoning.
  4. Face-Specific MLLMs with Enhanced Visual Priors: This paper represents the next step, where MLLMs are not just applied to face data, but their architectures are specifically enhanced with face-expert knowledge (like facial landmarks) to improve the quality of visual encoding for fine-grained facial analysis, while still providing robust natural language reasoning.

3.4. Differentiation Analysis

Compared to the main methods in related work, Face-LLaVA offers several core differences and innovations:

  • Comprehensive Multi-task Coverage: Unlike most prior MLLMs for face analysis which focus on affective behavior or expression recognition (e.g., FABA [34], EmoLA [34]), Face-LLaVA is a truly generalist model, spanning five diverse and distinct tasks: expression recognition, AU detection, attribute detection, age estimation, and deepfake detection.
  • Large-scale, Face-Centric Instruction-Tuning Dataset (FaceInstruct-1M):
    • It addresses the scarcity of large-scale, face-specific instruction-tuning data. Previous instruction-tuning datasets were often small, task-specific, or included background context/audio (e.g., Emotion-LLaMA [8], EmoVIT [76]), which can lead to hallucinations not strictly based on facial cues.
    • FaceInstruct-1M is specifically curated for faces, includes both images and videos (unlike FABA-Instruct [34] which is image-only), and employs a robust GPT-4o-assisted filtering process for quality control, reducing noise.
  • Novel Face-Specific Visual Encoder (FRLP and FRGCA):
    • Existing MLLMs often rely on general-purpose vision encoders (like CLIP or LanguageBind) that are not optimized for fine-grained face features like facial landmarks. Face-LLaVA explicitly integrates these face-expert features.
    • Efficiency: Unlike approaches that append landmark features directly to visual tokens (EmoLA [34]), Face-LLaVA uses cross-attention. This means the landmark tokens (which provide geometric priors) inform the visual tokens without increasing the LLM's context window size, making the process more efficient.
    • Region-Guided Attention: The Face-Region Guided Cross-Attention with its region-patch proximity mask is innovative in forcing the model to explicitly focus its attention on anatomically relevant facial regions, which mimics human perception for complex face analysis tasks.
  • Perception with Reasoning for Videos: While some prior MLLMs could reason (EmoLA, EMO-LLaMA), Face-LLaVA explicitly handles both images and videos with reasoning capabilities, providing a more versatile solution for dynamic facial analysis. It particularly avoids reliance on non-facial cues (audio, background) that some models might inadvertently use.

4. Methodology

4.1. Principles

The core idea behind Face-LLaVA is to augment a general-purpose Multimodal Large Language Model (MLLM) with specialized face-centric knowledge and mechanisms to enhance its understanding and reasoning capabilities for a diverse set of facial analysis tasks. The key principles are:

  1. Instruction Tuning for Generalization and Reasoning: By training on a large dataset of instruction-response pairs (FaceInstruct-1M), the model learns to understand diverse queries related to facial expressions, attributes, age, and deepfakes, and to generate natural language explanations for its predictions.
  2. Integration of Face-Expert Features: Recognizing that general vision encoders might not adequately capture fine-grained facial details, Face-LLaVA explicitly incorporates facial landmark features. These landmarks provide precise geometric information about the face, which is crucial for nuanced analysis.
  3. Efficient and Targeted Feature Fusion: Instead of simply concatenating landmark features with general visual tokens (which can exhaust the LLM's context window), Face-LLaVA uses a novel cross-attention mechanism. This mechanism allows landmark features to guide and refine the visual features by directing the model's attention to salient facial regions, ensuring that both local and holistic facial cues are effectively utilized.

4.2. Core Methodology In-depth (Layer by Layer)

Figure 3 from the original paper (VLM Description: The image is a diagram illustrating the processing flow of facial features in the Face-LLaVA model, including video input, patch encoding, facial landmark detector, and face-region guided cross-attention mechanism. The diagram includes the functionality of various modules such as the vision projector and large language model, depicting how facial-related information is processed and understood.) illustrates the architecture of Face-LLaVA.

The Face-LLaVA architecture, similar to other MLLMs like Video-LLaVA [37], consists of several main components:

  1. Patch-based Vision Encoder ($\mathbf{E_V}$): This module takes the raw visual input (image or video) and encodes it into a sequence of visual tokens.

  2. Vision Projector ($\mathbf{P_V^\theta}$): This component maps the visual tokens from the vision encoder's feature space into the language token space of the LLM.

  3. Tokenizer ($\mathbf{E_T}$): This standard LLM component converts input text instructions into language tokens.

  4. Large Language Model Decoder ($\mathbf{\Phi}$): This is the core LLM that processes the combined visual and language tokens in an autoregressive manner to generate natural language responses.

  5. Face-Expert Model ($\mathbf{E_L}$): A specialized model (in this case, a facial landmark detector) that extracts fine-grained face-specific geometric features.

  6. Face-Region Landmark Projector (FRLP): A novel module to convert raw landmark coordinates into face-region tokens.

  7. Face-Region Guided Cross-Attention (FRGCA): A novel cross-attention mechanism that integrates the face-region tokens with the general visual tokens.

    Let's detail the flow and the novel components:

4.2.1. Visual Input Processing

Given a visual input $x_v$ (which can be a single image or a sequence of video frames), the patch-based vision encoder $\mathbf{E_V}$ processes it. The output of $\mathbf{E_V}$ is then passed through the vision projector $\mathbf{P_V^\theta}$ to align it with the language token space. This yields the visual tokens $h_v$:

$ h_v = \mathbf{P_V^{\theta}} ( \mathbf{E_V} ( x_v ) ) $

Where:

  • $x_v$: The input visual data (image or video).
  • $\mathbf{E_V}$: The patch-based vision encoder. For video inputs, $x_v$ is processed frame-by-frame or as short clips.
  • $\mathbf{P_V^\theta}$: The vision projector, parameterized by $\theta$. Its role is to map the visual features into a compatible space for the LLM.
  • $h_v$: The resulting visual tokens. These tokens have a shape of $\bar{T} \times \bar{N} \times d$, where:
    • $\bar{T}$: Number of video frames being processed (1 for images).
    • $\bar{N}$: Number of visual tokens extracted from the patch-based vision encoder (e.g., 256 for LanguageBind).
    • $d$: Dimensionality of the hidden representation of the LLM.
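To make this stage concrete, here is a minimal PyTorch sketch of the encoder-to-projector step. It assumes a generic patch-encoder feature size of 1024 and a two-layer MLP projector; both are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps patch-encoder features into the LLM token space (illustrative sketch)."""
    def __init__(self, d_enc: int = 1024, d_llm: int = 4096):
        super().__init__()
        # Two-layer MLP projector, a common LLaVA-style choice (assumption).
        self.mlp = nn.Sequential(nn.Linear(d_enc, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (T, N, d_enc) patch features from the frozen vision encoder E_V
        return self.mlp(enc_feats)              # h_v: (T, N, d_llm)

# Example: 8 frames, 256 patch tokens per frame (LanguageBind-style output shape).
enc_feats = torch.randn(8, 256, 1024)           # stand-in for E_V(x_v)
h_v = VisionProjector()(enc_feats)              # (8, 256, 4096)
```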

4.2.2. Face-Expert Feature Extraction

In parallel to the general visual processing, a face-expert model $\mathbf{E_L}$ is used to extract face-specific features from the visual input $x_v$. In Face-LLaVA, this face-expert model is a landmark detector (specifically, FAN [5]). It extracts normalized 2D facial landmark coordinates:

$ z = \mathbf{E_L} ( x_v ) $

Where:

  • $z$: The extracted normalized 2D landmark coordinates. Its shape is $T \times L \times 2$, where:
    • $T$: Number of video frames (1 for images).

    • $L$: Total number of facial landmarks detected (e.g., 68 landmarks).

    • 2: Represents the 2D (x, y) coordinates for each landmark.

      These face-expert features $z$ are then fed into the proposed novel Face-Region Landmark Projector (FRLP) and Face-Region Guided Cross-Attention (FRGCA) modules.

4.2.3. Face-Region Landmark Projector (FRLP)

The FRLP module is designed to project the facial landmark features $z$ into a more meaningful and tokenized space, preserving both local regional information and global facial structure. It achieves this using two sub-modules:

  1. Local Landmark Projector ($h_l^{local}$): This component groups the $L$ facial landmarks into $M$ distinct face regions (e.g., face boundary, left/right eye, left/right brow, nose, nostril, lips, teeth). For each $i$-th face region, which contains $L_i$ landmark points ($z_i \in \mathbb{R}^{T \times L_i \times 2}$), a separate Multi-Layer Perceptron (MLP) is used for projection. The resulting local landmark tokens are:

    $ h_l^{local} = \{ \mathbf{MLP}_i^{local} ( z_i ) ~ \forall ~ i \in \{ 1 \ldots M \} \} $

    Where:

    • $h_l^{local}$: A set of $M$ tokens, each corresponding to a specific face region. Each token has a shape $T \times d$.
    • $\mathbf{MLP}_i^{local}$: A separate single-layer MLP for the $i$-th face region.
    • $z_i$: The $L_i$ 2D landmark points belonging to the $i$-th face region.
  2. Global Landmark Projector ($h_l^{global}$): To capture the overall facial structure and dynamics, all $L$ landmarks from each frame are projected to generate a single global landmark token per frame.

    $ h_l^{global} = \mathbf{MLP}^{global} ( z ) $

    Where:

    • $h_l^{global}$: A single token per frame, representing the global facial structure. Its shape is $T \times d$.

    • $\mathbf{MLP}^{global}$: A single-layer MLP for global landmark projection.

    • $z$: All $L$ 2D landmark points for a given frame.

      The final landmark tokens $h_l$ are obtained by combining the global and local components:

$ h_l = h_l^{global} + h_l^{local} $

Here $h_l^{local}$ is a sequence of $M$ face-region tokens per frame, while $h_l^{global}$ is a single token per frame that is broadcast across the token dimension and added to each of the $M$ local tokens. For simplicity and minimal computation, all MLPs used in FRLP are single-layer MLPs. The final shape of $h_l$ is $T \times M \times d$.
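The following is a minimal PyTorch sketch of FRLP under the description above. The single-layer MLPs are implemented as linear layers, and the grouping of 68 landmarks into nine regions is a hypothetical split rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FaceRegionLandmarkProjector(nn.Module):
    """Sketch of FRLP: one linear layer per face region plus a global linear layer.
    Region grouping and sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, region_sizes: list[int], d: int):
        super().__init__()
        self.region_sizes = region_sizes            # landmarks per region (sums to L)
        self.local_mlps = nn.ModuleList(nn.Linear(2 * s, d) for s in region_sizes)
        self.global_mlp = nn.Linear(2 * sum(region_sizes), d)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (T, L, 2) normalized landmark coordinates for T frames
        T = z.shape[0]
        region_chunks = torch.split(z, self.region_sizes, dim=1)   # M chunks of (T, L_i, 2)
        h_local = torch.stack(
            [mlp(chunk.reshape(T, -1)) for mlp, chunk in zip(self.local_mlps, region_chunks)],
            dim=1)                                                 # (T, M, d)
        h_global = self.global_mlp(z.reshape(T, -1)).unsqueeze(1)  # (T, 1, d)
        return h_global + h_local                                  # broadcast over M -> (T, M, d)

# Example: 68 landmarks split into 9 hypothetical regions, d = 4096, 8 frames.
regions = [17, 5, 5, 4, 4, 4, 5, 12, 12]    # sums to 68; illustrative grouping
h_l = FaceRegionLandmarkProjector(regions, d=4096)(torch.rand(8, 68, 2))  # (8, 9, 4096)
```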

4.2.4. Face-Region Guided Cross-Attention (FRGCA)

The FRGCA module is designed to integrate the face-region landmark tokens $h_l$ with the general visual tokens $h_v$. This mechanism offers two main advantages:

  1. Weighing Salient Visual Tokens: Cross-attention allows the model to assign higher importance to visual tokens that are spatially close to or semantically relevant to the salient face regions (as indicated by the landmark tokens).

  2. Context Window Efficiency: Unlike appending landmark tokens to visual tokens (which increases the input sequence length for the LLM), cross-attention processes them separately, thus saving the LLM's context window.

    Here's how FRGCA works:

  • Query, Key, Value Generation: Linear layers are used to transform the visual tokens $h_v$ into Query (Q) vectors and the landmark tokens $h_l$ into Key (K) and Value (V) vectors. For simplicity, the time dimension $T$ is dropped, meaning these operations occur independently for each frame.

    • $Q \in \mathbb{R}^{N \times d_{attn}}$ (derived from $h_v$)
    • $K \in \mathbb{R}^{M \times d_{attn}}$ (derived from $h_l$)
    • $V \in \mathbb{R}^{M \times d_{attn}}$ (derived from $h_l$), where $N$ is the number of visual tokens and $M$ is the number of face-region tokens.
  • Face-Region Patch Proximity (RPP) Mask Generation: To further guide the attention mechanism towards relevant face regions, an RPP mask $m^{RPP}$ is computed. This mask is inversely proportional to the pairwise 2D Euclidean distances between the centroids of the visual patches (which correspond to the visual tokens $h_{v,j}$) and the centroids of the different face regions (which correspond to the face-region tokens derived from $z_i$).

    $ m_{ji}^{RPP} = - \| \mathrm{centroid}(z_i) - \mathrm{centroid}(h_{v,j}) \|_2 $

    Where:

    • $m_{ji}^{RPP}$: An element of the RPP mask $m^{RPP} \in \mathbb{R}^{N \times M}$, for the $j$-th visual token and $i$-th face region.
    • $\mathrm{centroid}(z_i)$: The 2D centroid of the $i$-th face region (computed from its facial landmarks).
    • $\mathrm{centroid}(h_{v,j})$: The 2D centroid of the $j$-th visual patch associated with the visual token $h_{v,j}$. The negative sign ensures that smaller distances result in larger (less negative) values, thus higher attention.
  • Guided Cross-Attention: The RPP mask is added to the attention scores ($QK^T / \sqrt{d_{attn}}$) before the softmax layer. This guides the attention weights, forcing the model to prioritize visual patches that are geometrically closer to specific face regions. The output of the FRGCA module, $\mathbf{A_{VL}^\alpha}$, is then added back to the original visual tokens $h_v$ via a residual connection.

    $ h_v^l = \mathrm{Linear}\left( \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_{attn}}} + m^{RPP} \right) V \right) + h_v $

    Where:

    • $h_v^l$: The visual tokens refined by the landmark information. This is the final visual input that goes into the LLM.
    • $\mathrm{Linear}$: A final linear projection layer.
    • $\mathbf{A_{VL}^\alpha}$: The entire FRGCA module, parameterized by $\alpha$.
    • $d_{attn}$: The dimensionality of the attention head, used for scaling.
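Below is a minimal single-head PyTorch sketch of FRGCA for one frame. The patch and region centroid inputs are hypothetical stand-ins, and multi-head attention, batching, and other implementation details of the actual model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceRegionGuidedCrossAttention(nn.Module):
    """Sketch of FRGCA: visual tokens (queries) attend to face-region tokens (keys/values),
    with an additive region-patch proximity mask; single head shown for clarity."""
    def __init__(self, d: int, d_attn: int):
        super().__init__()
        self.q = nn.Linear(d, d_attn)
        self.k = nn.Linear(d, d_attn)
        self.v = nn.Linear(d, d_attn)
        self.out = nn.Linear(d_attn, d)
        self.d_attn = d_attn

    def forward(self, h_v, h_l, patch_centroids, region_centroids):
        # h_v: (N, d) visual tokens; h_l: (M, d) face-region tokens (one frame)
        # patch_centroids: (N, 2), region_centroids: (M, 2) in shared normalized coordinates
        Q, K, V = self.q(h_v), self.k(h_l), self.v(h_l)
        # RPP mask: negative pairwise distance between patch and region centroids -> (N, M)
        m_rpp = -torch.cdist(patch_centroids, region_centroids, p=2)
        scores = Q @ K.T / self.d_attn ** 0.5 + m_rpp
        refined = self.out(F.softmax(scores, dim=-1) @ V)
        return refined + h_v                      # residual connection back to visual tokens

# Example: 256 visual patches on a 16x16 grid, 9 face regions, d = 4096.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 16), torch.linspace(0, 1, 16), indexing="ij")
patch_c = torch.stack([xs.flatten(), ys.flatten()], dim=-1)          # (256, 2)
region_c = torch.rand(9, 2)                                          # stand-in region centroids
out = FaceRegionGuidedCrossAttention(4096, 64)(
    torch.randn(256, 4096), torch.randn(9, 4096), patch_c, region_c)
```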

4.2.5. Text Instruction Processing

The user's text instruction $x_i$ is processed by the tokenizer $\mathbf{E_T}$ to generate language tokens $h_t$.

4.2.6. Language Model Decoder

Finally, the LLM $\mathbf{\Phi}$ takes the refined visual tokens $h_v^l$ and the language tokens $h_t$ as input. It then autoregressively generates the natural language response $x_r$ according to the instruction.

4.2.7. Training

The training of Face-LLaVA is conducted in multiple stages to ensure proper alignment and fine-tuning:

  1. Face-Region Pretraining:

    • Objective: To align the newly initialized FRLP and FRGCA modules with the existing visual and language token spaces.
    • Trainable Parameters: Only the parameters of the FRLP module ($\gamma$) and the FRGCA module ($\alpha$) are trained. All other weights (vision encoder, vision projector, tokenizer, and LLM) are kept frozen.
    • Purpose: This stage ensures that the landmark tokens are effectively learned and can interact meaningfully with the visual tokens.
  2. Finetuning:

    • Objective: To jointly fine-tune the entire model, including the vision projector and the LLM, to improve its instruction-following capabilities and overall performance.

    • Trainable Parameters: The parameters of the FRLP ($\gamma$), FRGCA ($\alpha$), vision projector ($\theta$), and the LLM ($\phi$) are all trained. The vision encoder and tokenizer remain frozen (pretrained).

    • Learning Rate: A lower learning rate is used in this stage compared to pretraining.

    • LLM Training: Since FaceInstruct-1M is a substantial dataset, all parameters ($\phi$) of the LLM are fine-tuned, rather than using parameter-efficient methods like LoRA [17].

      For both training stages, the model is trained in an autoregressive fashion, maximizing the likelihood of the generated response $x_r$:

$ P ( x_r | h_v^l, h_t ) = \prod_{i = 1}^{\mathcal{L}} P_{\Theta} ( x_r^i | h_v^l, h_t, x_r^{[1 : i - 1]} ) $

Where:

  • $P_{\Theta}$: The probability distribution over tokens, parameterized by $\Theta$.

  • $\Theta$: The set of trainable parameters ($\{\gamma, \alpha\}$ for pretraining; $\{\gamma, \alpha, \theta, \phi\}$ for finetuning).

  • $x_r$: The full natural language response generated by the model.

  • $x_r^i$: The $i$-th token in the response sequence.

  • $\mathcal{L}$: The total length (number of tokens) of the response $x_r$.

  • $h_v^l$: The visual tokens refined by FRGCA.

  • $h_t$: The language tokens from the instruction.

  • $x_r^{[1:i-1]}$: The sequence of response tokens generated prior to the $i$-th token.

    The learning rates for the pretraining and finetuning stages are 1e-4 and 2e-5, respectively, with a cosine learning rate schedule. Models are trained for one epoch.
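A minimal sketch of the two-stage recipe and the autoregressive objective is shown below. The module handles (frlp, frgca, vision_projector, llm) and the label-masking convention are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def set_trainable(model, stage: str):
    """Illustrative two-stage freezing scheme; attribute names are hypothetical handles."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = [model.frlp, model.frgca]                    # stage 1: face-region pretraining
    if stage == "finetune":
        trainable += [model.vision_projector, model.llm]     # stage 2: joint finetuning
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True

def autoregressive_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (B, S, vocab); labels: (B, S) with -100 on visual/instruction positions,
    # so the loss is computed only over the response tokens x_r (next-token prediction).
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten(), ignore_index=-100)
```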

5. Experimental Setup

5.1. Datasets

Face-LLaVA's FaceInstruct-1M dataset is constructed from a variety of existing task-specific datasets. For evaluation, experiments are performed on nine benchmarks, which largely overlap with the source datasets used for FaceInstruct-1M.

The following are the details of the datasets used to construct FaceInstruct-1M and for evaluation:

The following are the results from Table 2 of the original paper:

| Task | Datasets | Label Acc. | Desc.-Vid. Consistency | Desc.-Lab. Consistency | Overall |
| --- | --- | --- | --- | --- | --- |
| FaceInstruct-1M | | | | | |
| Expression | DFEW [22], MAFW [40], FERV39k [72], Crema-D [24], AffectNet [47], RAF-DB [31] | 8.81 | 8.74 | 8.79 | 8.71 |
| AU | DISFA [46], BP4D [83] | 8.17 | 8.23 | 7.62 | 7.99 |
| Attribute | CelebA [42] | 8.79 | 8.75 | 8.52 | 8.69 |
| Age | MORPH II [56], UTK Face [87] | 8.63 | 7.92 | 7.98 | 7.99 |
| Deepfake* | FF++ [57], Fake AV Celeb [25] | 8.84 | 8.13 | 8.60 | 8.20 |
| FABAInstruct [34] | | | | | |
| Expression | AffectNet [47] | 8.79 | 6.44 | 6.32 | 6.22 |

All rating columns report the mean GPT rating on a 1-10 scale.

*For deepfake detection, a set of real videos from DFEW [22], MAFW [40], FERV39k [72] are annotated as control samples.

This table highlights the high quality of the automatically generated annotations in FaceInstruct-1M, as rated by GPT-4o-mini, which are consistently high across tasks, especially when compared to a reference like FABAInstruct [34].

The following figure (Figure 2 from the original paper) shows samples from the FaceInstruct-1M dataset for different tasks:

Figure 2. FaceInstruct-1M dataset samples for different tasks. The figure is a schematic showing samples from the FaceInstruct-1M dataset across its different tasks, with examples related to facial expression analysis and facial attribute recognition covering facial emotion, action units, estimated age, and changes in facial attributes. These samples illustrate the different facial expressions and associated data used in the work and its application to face analysis.

This figure visually demonstrates the diversity of tasks covered by FaceInstruct-1M and the type of descriptions generated. For video inputs, the descriptions capture subtle variations in facial attributes and provide justifications for the responses based on these observations.

The following figure (Figure 5 from the original paper) shows the data annotation pipeline used for creating FaceInstruct-1M dataset:

Figure 5. Data annotation pipeline used for creating FaceInstruct-1M dataset. The figure is a schematic of the annotation pipeline used to create the FaceInstruct-1M dataset: large-scale data input, data preprocessing and filtering, label-assisted data generation, and sample rating and filtering using scoring instructions with GPT-4o-mini.

This pipeline illustrates the process of generating FaceInstruct-1M using large-scale data, pre-processing, label-assisted Gemini generation of descriptions, and GPT-4o-mini for rating and filtering.

5.1.1. Datasets for Facial Expression Recognition

  • DFEW [22]: A large-scale database for Dynamic Facial Expressions in the Wild. It is a video dataset.
  • Crema-D [24]: A multimodal dataset (audio-visual) for emotion recognition. Used for facial expression recognition.
  • RAF-DB [31]: Real-world Affective Faces Database. A large dataset of static images annotated with basic and compound expressions.
  • AffectNet [47]: A very large database for facial expression, valence, and arousal computing in the wild. Primarily images.
  • FERV39k [72]: A large-scale multi-scene dataset for facial expression recognition in videos.
  • MAFW [40]: A large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild.

5.1.2. Datasets for Action Unit (AU) Detection

  • DISFA [46]: Database of Spontaneous Facial Action. Contains videos annotated for AU intensity on a 0-5 scale by FACS experts.
  • BP4D [83]: Binghamton-Pittsburgh 4D Spontaneous Expression Database. Contains videos annotated with 12 AUs, along with tracked head pose and facial landmarks.

5.1.3. Datasets for Facial Attribute Detection

  • CelebA [42]: CelebFaces Attributes Dataset. A large-scale dataset of celebrity faces with diverse pose and background, annotated with 40 binary facial attributes.

5.1.4. Datasets for Age Estimation

  • MORPH II [56]: A longitudinal image database for age-progression, containing mugshots annotated with age, gender, and race.
  • UTKFace [87]: A large-scale dataset of faces with ages between 0 and 116 years, annotated with age, gender, and ethnicity.

5.1.5. Datasets for Deepfake Detection

  • FaceForensics++ (FF++) [57]: A dataset consisting of 1000 original video sequences manipulated with 4 face manipulation methods (Deepfakes, Face2Face, FaceSwap, NeuralTextures). The paper uses low-quality samples.
  • Fake AV-Celeb [25]: A dataset including about 20K manipulated videos using various deepfake methods, with a base set of 500 real celebrity videos.

5.1.6. GPT-Assisted Dataset Construction for FaceInstruct-1M

  • FaceInstruct-1M is constructed by:
    • Preprocessing: Cropping faces in videos and ensuring a single subject is present.
    • Gemini Annotation: Gemini 1.5 Flash [66] is used to generate video descriptions for specific tasks. It receives the face video, its categorical label, and a carefully designed annotation instruction (e.g., "The given video has... emotion. Your task is to reason why the emotion is tagged as... Focus specifically on the facial expression...").
    • Instruction Curation: 100 handcrafted instructions for each task are randomly paired with the generated descriptions for instruction tuning.
    • GPT-4o Filtering: GPT-4o-mini [67] is used to rate the generated samples based on: (i) accuracy of the manual label, (ii) consistency of the generated description with the video, (iii) consistency of the description with the label, and (iv) overall quality. Samples with an overall rating $\le 6$ are removed (approx. 7% of the dataset).
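A minimal sketch of the final rating-and-filtering step is shown below; `rate_fn` is a hypothetical wrapper around the GPT-4o-mini rating call, and the returned score keys are illustrative rather than the paper's exact schema.

```python
def filter_by_gpt_rating(samples, rate_fn, min_overall=7):
    """Keep only samples whose overall rating is above 6, mirroring the paper's rule.
    `rate_fn` is a hypothetical callable that returns the four rating scores."""
    kept = []
    for sample in samples:
        ratings = rate_fn(sample)  # e.g. {"label_acc": 8, "desc_vid": 9, "desc_lab": 8, "overall": 8}
        if ratings["overall"] >= min_overall:   # samples rated <= 6 overall are discarded
            kept.append({**sample, "ratings": ratings})
    return kept
```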

5.2. Evaluation Metrics

The paper uses both traditional quantitative metrics for face analysis tasks and GPT-based evaluation for reasoning capabilities.

5.2.1. Traditional Metrics

  • Unweighted Average Recall (UAR):

    • Conceptual Definition: UAR (also known as Average Per-Class Accuracy) calculates the average of the recall (or sensitivity) for each class. It is particularly useful for evaluating performance on imbalanced datasets, as it gives equal weight to each class, regardless of its sample size. It reflects the model's ability to correctly identify instances of each class.
    • Mathematical Formula: $ \mathrm{UAR} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{Recall}_i $ Where:
      • $C$: The total number of classes.
      • $\mathrm{Recall}_i$: The recall for class $i$.
      • $\mathrm{Recall}_i = \frac{\mathrm{TruePositive}_i}{\mathrm{TruePositive}_i + \mathrm{FalseNegative}_i}$
    • Symbol Explanation:
      • $\mathrm{TruePositive}_i$: Number of instances of class $i$ correctly predicted as class $i$.
      • $\mathrm{FalseNegative}_i$: Number of instances of class $i$ incorrectly predicted as other classes.
  • Weighted Average Recall (WAR):

    • Conceptual Definition: WAR (also known as Overall Accuracy) calculates the proportion of correctly predicted instances out of the total number of instances across all classes. It is sensitive to class imbalance, meaning classes with more samples will have a larger impact on the WAR score. It reflects the overall correctness of the model's predictions.
    • Mathematical Formula: $ \mathrm{WAR} = \frac{\sum_{i=1}^{C} \mathrm{TruePositive}_i}{N} $ That is, the total number of correct predictions divided by the total number of samples. Equivalently, $\mathrm{WAR} = \sum_{i=1}^{C} \frac{n_i}{N} \mathrm{Recall}_i$, a recall average weighted by class frequency.
    • Symbol Explanation:
      • $\mathrm{TruePositive}_i$: Number of instances of class $i$ correctly predicted as class $i$.
      • $C$: The total number of classes.
      • $N$: The total number of samples across all classes.
      • $n_i$: The number of samples belonging to class $i$.
  • F1-score:

    • Conceptual Definition: The F1-score is the harmonic mean of precision and recall. It provides a single score that balances both precision (the proportion of positive identifications that were actually correct) and recall (the proportion of actual positives that were identified correctly). It is particularly useful for classification tasks where class distribution is uneven. For AU detection, it's often reported as an average F1-score across all AUs.
    • Mathematical Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where:
      • $\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}}$
      • $\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}}$
    • Symbol Explanation:
      • $\mathrm{TruePositive}$: Number of positive instances correctly identified.
      • $\mathrm{FalsePositive}$: Number of negative instances incorrectly identified as positive.
      • $\mathrm{FalseNegative}$: Number of positive instances incorrectly identified as negative.
  • Mean Absolute Error (MAE):

    • Conceptual Definition: MAE is a measure of errors between paired observations expressing the same phenomenon. It is the average of the absolute differences between predictions and actual observations. It gives an idea of the average magnitude of the errors without considering their direction. It is commonly used for regression tasks like age estimation.
    • Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} |y_j - \hat{y}_j| $ Where:
      • $N$: Total number of samples.
      • $y_j$: The actual value for sample $j$.
      • $\hat{y}_j$: The predicted value for sample $j$.
      • $|\cdot|$: Absolute value.
    • Symbol Explanation:
      • $y_j$: Ground truth age for sample $j$.
      • $\hat{y}_j$: Predicted age for sample $j$.
  • Mean Accuracy (mAcc):

    • Conceptual Definition: For facial attribute detection, this typically refers to the average accuracy across all attributes. If a model predicts multiple binary attributes for a face, the accuracy for each attribute is calculated, and then these accuracies are averaged.
    • Mathematical Formula (for a single attribute $k$): $ \mathrm{Accuracy}_k = \frac{\mathrm{CorrectPredictions}_k}{\mathrm{TotalSamples}_k} $ Then, $\mathrm{mAcc} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Accuracy}_k$, where $K$ is the number of attributes.
    • Symbol Explanation:
      • $\mathrm{CorrectPredictions}_k$: Number of samples where attribute $k$ was correctly classified.
      • $\mathrm{TotalSamples}_k$: Total number of samples for which attribute $k$ was evaluated.
  • Accuracy:

    • Conceptual Definition: The proportion of total predictions that were correct. Commonly used for binary classification tasks like deepfake detection.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\mathrm{Number of Correct Predictions}}{\mathrm{Total Number of Predictions}} $
    • Symbol Explanation:
      • $\mathrm{NumberOfCorrectPredictions}$: Sum of true positives and true negatives.
      • $\mathrm{TotalNumberOfPredictions}$: Sum of true positives, true negatives, false positives, and false negatives.
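For reference, here is a minimal NumPy sketch of the core classification/regression metrics described above (UAR, WAR, and MAE); per-attribute mAcc and the F1-score follow the same pattern.

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def war(y_true, y_pred):
    """Weighted average recall, equal to overall accuracy."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mae(y_true, y_pred):
    """Mean absolute error, e.g., for age estimation."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy example: 3-class problem with an imbalanced class distribution.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
print(uar(y_true, y_pred))   # (1.0 + 0.0 + 1.0) / 3 ≈ 0.667
print(war(y_true, y_pred))   # 5/6 ≈ 0.833
```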

5.2.2. Reasoning Evaluation (GPT-based)

To assess the generative and reasoning capabilities of MLLMs, GPT-4o-mini [67] is used to rate the generated responses based on three aspects:

  1. Consistency or overlap of the given reasoning with video: How well does the textual explanation align with the visual content of the input video/image?

  2. Consistency or overlap of the given reasoning with ground truth label: How well does the textual explanation support or align with the known correct label (e.g., the actual emotion, age, or deepfake status)?

  3. Overall completeness of the reasoning to support the ground truth label with respect to the video: How comprehensive and convincing is the explanation in justifying the prediction, considering both the visual evidence and the correct label?

    This evaluation is performed on a small, carefully crafted test set from FaceInstruct-1M consisting of 500 samples per task. Traditional text generation metrics like BLEU [49] or ROUGE [38] are deemed insufficient as they rely on n-gram matches rather than semantic analysis of reasoning.

5.2.3. Evaluation Protocols

  • Perception:
    • Zero-shot setting: The entire task-specific dataset (e.g., DFEW for expression recognition) is removed from FaceInstruct-1M for tuning Face-LLaVA. This tests generalization.
    • Fine-tuned setting: The zero-shot model is further fine-tuned using the official training splits of the benchmark dataset. This tests performance when task-specific data is available.
    • For MLLM baselines, official inference code is applied to face-cropped inputs.
    • For GPT-4o-mini and Gemini 1.5 Flash, official APIs are used for batch inference.
    • For expression recognition, attribute detection, and deepfake detection, synonym matching is used to extract categorical labels from generated text (a minimal matching sketch follows this list). For AU detection and age estimation, numerical values are extracted directly.
  • Reasoning:
    • Evaluated using GPT-4o-mini on the FaceInstruct-1M test set (500 samples per task), as described above.
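As referenced above, a minimal sketch of synonym-based label extraction might look as follows; the synonym table is hypothetical and not the paper's exact list.

```python
import re

# Hypothetical synonym table for illustration only.
EMOTION_SYNONYMS = {
    "happiness": {"happy", "happiness", "joy", "joyful", "cheerful"},
    "sadness":   {"sad", "sadness", "sorrow", "unhappy"},
    "anger":     {"angry", "anger", "furious", "irritated"},
}

def extract_label(generated_text: str, synonyms=EMOTION_SYNONYMS):
    """Map a free-form model response to a categorical label via keyword matching."""
    words = set(re.findall(r"[a-z]+", generated_text.lower()))
    for label, keys in synonyms.items():
        if words & keys:
            return label
    return None  # no recognizable label in the response

print(extract_label("The person appears joyful, with raised lip corners."))  # -> "happiness"
```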

5.3. Baselines

The paper compares Face-LLaVA against a range of contemporary MLLMs and specialized supervised methods.

5.3.1. Closed-source Commercial Models

  • GPT-4o-mini [67]
  • Gemini-1.5F [66]

5.3.2. Open-source MLLMs (Zero-shot)

  • Vid.LLaMA 3 [81]
  • Qwen 2.5 VL [68]
  • Vid.-LLaVA [37]
  • LLaVA-Vid. [86] / LLaVA-OV [28]
  • EmoLA [34] (often uses middle video frame or image-only)
  • Emotion LLaMA [8] (multimodal, using audio and background context, also evaluated with context removed)

5.3.3. Fine-tuned Supervised Task-Specific Baselines

These are models specifically designed and extensively trained for individual tasks, representing the state-of-the-art in their respective domains.

  • Facial Expression Recognition: EC-STFL [22], Former-DFER [89], GCA+IAL [29], M3DFEL [70], MAE-DFER [64], S2D [7], EMO-LLaMA [77], Emotion-LLaMA [8], Lei et al. [27], PTH-Net [30], MTL-ER* [82], MT-Former* [78], MTCAE-DFER [75], RUL [84], EAC [85], TransFER [79], Xue et al. [80], POSTERv2 [44], EmoLA [34].
  • Action Unit Detection: ATCM [21], ReCoT [33], KS [32], ME-GraphAU [43], JÅA-Net [59], PIAP-DF [65], VL-FAU [15], AU-LLaVA [18], EmoLA [34].
  • Age Estimation: PML [10], Berg et al. [4], DLDL-v2 [14], MWR [62], Faceptor [54].
  • Facial Attribute Detection: Liu et al. [42], MOON [58], SwinFace [53], Faceptor [54], DMM-CNN [45].
  • Deepfake Detection: MesoNet [1], Xception [9], MARLIN [6], M2TR [71], F3-Net [52].

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Facial Expression Recognition

The following are the results from Table 3 of the original paper:

| Method (DFEW [22]) | UAR ↑ | WAR ↑ | Method (Crema-D [24]) | UAR ↑ | WAR ↑ | Method (RAF-DB [31]) | Acc. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source models | | | | | | | |
| GPT4o-mini [67] | 0.426 | 0.518 | GPT4o-mini [67] | 0.410 | 0.486 | GPT4o-mini [67] | 0.758 |
| Gemini-1.5F [66] | 0.433 | 0.481 | Gemini-1.5F [66] | 0.465 | 0.635 | Gemini-1.5F [66] | 0.685 |
| Zero-shot | | | | | | | |
| Vid.LLaMA 3 [81] | 0.286 | 0.305 | Vid.LLaMA 3 [81] | 0.397 | 0.546 | Vid.LLaMA 3 [81] | 0.671 |
| Qwen 2.5 [68] | 0.293 | 0.399 | Qwen 2.5 [68] | 0.395 | 0.566 | Qwen 2.5 [68] | 0.526 |
| Vid.-LLaVA [37] | 0.220 | 0.326 | Vid.-LLaVA [37] | 0.367 | 0.557 | Vid.-LLaVA [37] | 0.545 |
| LLaVA-Vid. [86] | 0.375 | 0.498 | LLaVA-Vid. [86] | 0.478 | 0.618 | LLaVA-OV [28] | 0.700 |
| EmoLA* [34] | 0.346 | 0.449 | EmoLA* [34] | 0.431 | 0.618 | EmoLA [34] | 0.741 |
| Emotion LLaMA† [8] | 0.456 | 0.594 | Emotion LLaMA [8] | 0.225 | 0.308 | | |
| Emotion LLaMA‡ [8] | 0.302 | 0.378 | | | | | |
| Face-LLaVA (Ours) | 0.469 | 0.564 | Face-LLaVA (Ours) | 0.582 | 0.681 | Face-LLaVA (Ours) | 0.780 |
| Fine-tuned | | | | | | | |
| EC-STFL [22] | 0.454 | 0.565 | Lei et al. [27] | 0.645 | 0.648 | RUL [84] | 0.890 |
| Former-DFER [89] | 0.537 | 0.657 | PTH-Net [30] | 0.699 | 0.700 | EAC [85] | 0.909 |
| GCA+IAL [29] | 0.557 | 0.692 | MAE-DFER [64] | 0.773 | 0.774 | TransFER [79] | 0.909 |
| M3DFEL [70] | 0.561 | 0.693 | MTL-ER* [82] | 0.745 | 0.756 | Xue et al. [80] | 0.920 |
| MAE-DFER [64] | 0.634 | 0.744 | MT-Former* [78] | 0.793 | 0.807 | POSTERv2 [44] | 0.922 |
| S2D [7] | 0.618 | 0.760 | MTCAE-DFER [75] | 0.847 | 0.850 | EmoLA [34] | 0.921 |
| EMO-LLaMA [77] | 0.602 | 0.659 | | | | | |
| Emotion-LLaMA [8] | 0.642 | 0.771 | | | | | |
| Face-LLaVA (Ours) | 0.625 | 0.745 | Face-LLaVA (Ours) | 0.798 | 0.813 | Face-LLaVA (Ours) | 0.921 |

*Only using middle video frame, †: Results taken from the paper [8], ‡: Results computed after running inference code on face-cropped video.

In zero-shot settings, Face-LLaVA demonstrates superior performance across all benchmarks (DFEW, Crema-D, RAF-DB) for facial expression recognition compared to most open-source MLLMs. For instance, on DFEW, Face-LLaVA achieves 0.469 UAR and 0.564 WAR, significantly outperforming Vid.LLaMA 3 (0.286 UAR, 0.305 WAR) and Vid.-LLaVA (0.220 UAR, 0.326 WAR).

Emotion-LLaMA [8] initially shows a higher WAR on DFEW (0.594), but this model is multimodal, using audio and background context. When re-evaluated with only face-cropped video (denoted by ‡), its performance drops substantially (0.302 UAR, 0.378 WAR), highlighting Face-LLaVA's strength in relying solely on facial cues. On Crema-D, Face-LLaVA notably achieves 0.582 UAR and 0.681 WAR, which is considerably higher than LLaVA-Vid. and EmoLA.

In the fine-tuned setting, Face-LLaVA maintains competitive performance with specialized state-of-the-art models. For instance, on RAF-DB, it reaches 0.921 Accuracy, matching EmoLA [34] and being very close to the top-performing POSTERv2 [44] (0.922). This is notable given that Face-LLaVA is still trained to provide descriptions rather than just categorical labels, and it lacks external context (audio/background) used by some baselines.

6.1.2. Action Unit (AU) Detection

The following are the results from Table 4 of the original paper:

| Method | DISFA [46] (Avg. F1 ↑) | BP4D [83] (Avg. F1 ↑) |
| --- | --- | --- |
| Closed-source models | | |
| GPT4o-mini [67] | 0.429 | 0.496 |
| Gemini-1.5F [66] | 0.515 | 0.532 |
| Zero-shot | | |
| VideoLLaMA 3 [81] | 0.374 | 0.458 |
| Qwen 2.5 VL [68] | 0.431 | 0.467 |
| Video-LLaVA [37] | 0.442 | 0.445 |
| LLaVA-OneVision [28] | 0.280 | 0.439 |
| EmoLA [34] | 0.418 | 0.407 |
| Face-LLaVA (Ours) | 0.553 | 0.495 |
| Fine-tuned | | |
| ATCM [21] | 0.615 | 0.642 |
| ReCoT [33] | 0.626 | 0.648 |
| KS [32] | 0.628 | |
| ME-GraphAU [43] | 0.631 | 0.655 |
| JÅA-Net [59] | 0.635 | 0.624 |
| PIAP-DF [65] | 0.638 | 0.641 |
| VL-FAU [15] | 0.665 | 0.658 |
| AU-LLaVA [18] | 0.525 | 0.603 |
| EmoLA [34] | 0.651 | 0.642 |
| Face-LLaVA (Ours) | 0.729 | 0.658 |

For AU detection, Face-LLaVA again excels. In the zero-shot setting, it achieves 0.553 F1 on DISFA, significantly outperforming Gemini-1.5F (0.515) and all other open-source MLLMs. This demonstrates the effectiveness of injecting AU information from AffectNet samples during zero-shot training for AU detection.

In the fine-tuned setting on DISFA, Face-LLaVA achieves 0.729 F1, marking a substantial relative improvement of approximately 10% over the previous state-of-the-art (VL-FAU at 0.665 or EmoLA at 0.651). On BP4D, Face-LLaVA performs competitively (0.658 F1), matching VL-FAU [15] and surpassing other MLLMs. This strong performance underscores the model's ability to precisely identify subtle facial muscle movements.

6.1.3. Age Estimation, Attribute Detection, and Deepfake Detection

The following are the results from Table 5 of the original paper:

| Method (Age Estimation) | M [56] MAE ↓ | U [87] MAE ↓ | Method (Face Attribute) | CA [42] mAcc. ↑ | Method (DeepFake Det.) | FF [57] Acc. ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Closed-source models | | | | | | |
| GPT4o-m [67] | 4.09 | 5.04 | GPT4o-m [67] | 0.780 | GPT4o-m [67] | 0.807 |
| Gem.-1.5F [66] | 4.78 | 6.13 | Gem.-1.5F [66] | 0.814 | Gem.-1.5F [66] | 0.770 |
| Zero-shot | | | | | | |
| V-LLaMA3 [81] | 6.98 | 6.91 | V-LLaMA3 [81] | 0.813 | V-LLaMA3 [81] | 0.793 |
| Qwen 2.5 [68] | 6.09 | 5.25 | Qwen 2.5 [68] | 0.786 | Qwen 2.5 [68] | 0.653 |
| V-LLaVA [37] | 6.75 | 5.89 | V-LLaVA [37] | 0.795 | V-LLaVA [37] | 0.697 |
| LLaVA-OV [28] | 6.33 | 6.87 | LLaVA-OV [28] | 0.805 | LLaVA-V [86] | 0.751 |
| Face-LLaVA | 3.34 | 4.89 | Face-LLaVA | 0.868 | Face-LLaVA | 0.845 |
| Fine-tuned | | | | | | |
| PML [10] | 2.15 | - | Liu et al. [42] | 0.873 | MesoNet [1] | 0.705 |
| Berg et al. [4] | - | 4.55 | MOON [58] | 0.909 | Xception [9] | 0.869 |
| DLDL-v2 [14] | 1.97 | 4.42 | SwinFace [53] | 0.913 | MARLIN [6] | 0.894 |
| MWR [62] | 2.00 | 4.37 | Faceptor [54] | 0.914 | M2TR [71] | 0.929 |
| Faceptor [54] | 1.96 | 4.10 | DMM-CNN [45] | 0.917 | F3-Net [52] | 0.930 |
| Face-LLaVA | 2.02 | 4.06 | Face-LLaVA | 0.901 | Face-LLaVA | 0.888 |

M: MORPH II [56], U: UTKFace [87], CA: CelebA [42], FF: FaceForensics++ [57].

Face-LLaVA continues to demonstrate strong performance in these diverse tasks:

  • Age Estimation: In the zero-shot setting, Face-LLaVA achieves the lowest MAE on both MORPH II (3.34) and UTKFace (4.89), significantly outperforming all other MLLM baselines, including GPT-4o-mini and Gemini-1.5F, which is impressive for an MLLM tackling a regression task. In the fine-tuned setting, Face-LLaVA remains highly competitive, achieving 2.02 MAE on MORPH II and 4.06 MAE on UTKFace, very close to or better than specialized regression-based baselines (a sketch of how MAE can be computed from generated answers follows this list).
  • Facial Attribute Detection: For zero-shot attribute detection on CelebA, Face-LLaVA achieves 0.868 mAcc, outperforming all other MLLMs. In the fine-tuned setting, it reaches 0.901 mAcc, showing competitive performance against specialized models, which often achieve slightly higher scores (DMM-CNN [45] at 0.917).
  • Deepfake Detection: This task proves challenging for most MLLMs. In the zero-shot setting on FaceForensics++ (low-quality videos), Face-LLaVA achieves 0.845 Accuracy, significantly surpassing all other MLLM baselines (which are often close to or worse than the roughly 80% accuracy that a trivial majority-class baseline achieves on this imbalanced dataset). This highlights its ability to discern subtle forgery artifacts. In the fine-tuned setting, Face-LLaVA reaches 0.888 Accuracy. While this is lower than the absolute SOTA (F3-Net [52] at 0.930), the paper attributes the gap to Face-LLaVA processing only eight frames per video, whereas fine-tuned baselines often use a higher number of frames.
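
Because the model answers in natural language, computing MAE for age estimation first requires pulling a numeric age out of the generated text. The paper's exact extraction rule is not described in this section, so the regular-expression parsing below is purely illustrative:

```python
import re
import numpy as np

def parse_age(answer: str):
    """Extract the first number from an answer such as
    'The person appears to be around 34 years old.'; None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

def age_mae(generated_answers, true_ages):
    """Mean absolute error over samples where a numeric age could be parsed."""
    pairs = [(parse_age(a), y) for a, y in zip(generated_answers, true_ages)]
    pairs = [(p, y) for p, y in pairs if p is not None]
    return float(np.mean([abs(p - y) for p, y in pairs]))
```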

6.1.4. Evaluation of Reasoning

The following are the results from Table 10 of the original paper:

Ratings are GPT-assigned scores out of 10 (higher is better); "–" marks tasks a method does not support.

Reason-Video Consistency

| Method | Emo. | AU | Attr. | Age | DF. | All |
| --- | --- | --- | --- | --- | --- | --- |
| GT from FaceInstruct-1M | 9.47 | 8.52 | 9.80 | 9.27 | 8.85 | 9.18 |
| VideoLLaMA 3 [81] | 5.14 | 2.58 | 6.90 | 5.82 | 7.02 | 5.49 |
| Qwen 2.5 VL [68] | 5.82 | 3.02 | 5.48 | 7.36 | 5.48 | 5.43 |
| Video LLaVA [37] | 4.31 | 2.58 | 5.82 | 7.79 | 6.19 | 5.34 |
| LLaVA-OV [28] | 7.11 | 2.18 | 6.08 | 7.97 | 6.69 | 6.01 |
| EmoLA [34] | 7.33 | 5.17 | – | – | – | – |
| Emotion-LLaMA [8] | 6.77 | – | – | – | – | – |
| Face-LLaVA (Ours) | 7.95 | 8.34 | 7.68 | 8.56 | 7.89 | 8.09 |

Reason-GT Consistency

| Method | Emo. | AU | Attr. | Age | DF. | All |
| --- | --- | --- | --- | --- | --- | --- |
| GT from FaceInstruct-1M | 9.70 | 8.84 | 9.88 | 9.55 | 9.56 | 9.51 |
| VideoLLaMA 3 [81] | 5.27 | 2.06 | 6.27 | 5.13 | 7.64 | 5.27 |
| Qwen 2.5 VL [68] | 5.96 | 2.54 | 4.86 | 7.02 | 5.76 | 5.23 |
| Video LLaVA [37] | 4.47 | 2.06 | 5.28 | 7.10 | 6.58 | 5.10 |
| LLaVA-OV [28] | 7.30 | 1.95 | 5.48 | 7.44 | 7.33 | 5.90 |
| EmoLA [34] | 7.58 | 5.04 | – | – | – | – |
| Emotion-LLaMA [8] | 6.90 | – | – | – | – | – |
| Face-LLaVA (Ours) | 8.14 | 9.20 | 7.94 | 8.11 | 7.79 | 8.24 |

Reasoning Completeness

| Method | Emo. | AU | Attr. | Age | DF. | All |
| --- | --- | --- | --- | --- | --- | --- |
| GT from FaceInstruct-1M | 9.21 | 8.26 | 9.75 | 9.02 | 8.41 | 8.93 |
| VideoLLaMA 3 [81] | 4.90 | 2.73 | 6.50 | 5.37 | 6.51 | 5.20 |
| Qwen 2.5 VL [68] | 5.57 | 3.34 | 5.21 | 6.89 | 5.30 | 5.26 |
| Video LLaVA [37] | 4.20 | 2.73 | 5.30 | 7.00 | 5.84 | 5.01 |
| LLaVA-OV [28] | 6.67 | 2.34 | 5.72 | 7.52 | 6.19 | 5.69 |
| EmoLA [34] | 6.81 | 5.32 | – | – | – | – |
| Emotion-LLaMA [8] | 6.50 | – | – | – | – | – |
| Face-LLaVA (Ours) | 7.79 | 8.11 | 7.59 | 8.11 | 7.60 | 7.84 |

The GPT-based reasoning evaluation reveals a significant advantage for Face-LLaVA. It consistently achieves superior ratings compared to other MLLM baselines across all three aspects (Reason-Video Consistency, Reason-GT Consistency, and Reasoning Completeness) and all five tasks.

  • The mean rating for reasoning completeness for Face-LLaVA is 7.84/10, roughly 38% higher than the best baseline (LLaVA-OV at 5.69/10).
  • For Reason-Video Consistency, Face-LLaVA scores 8.09, indicating excellent alignment between its generated descriptions and the actual visual content.
  • For Reason-GT Consistency, Face-LLaVA scores 8.24, showing that its reasoning effectively supports the ground truth labels.
  • The high scores highlight Face-LLaVA's ability to not only accurately perceive facial information but also to articulate coherent, context-aware, and faithful explanations, which is a critical feature for interpretable AI.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 6 of the original paper:

UAR and WAR are reported zero-shot on DFEW [22].

| Model | Landmark Projector | Cross-Attention | Input tokens | UAR | WAR |
| --- | --- | --- | --- | --- | --- |
| Baseline | – | – | $h_v$ | 0.391 | 0.479 |
| Baseline + Landmarks | only global | – | $h_v + h_{global}$ | 0.402 | 0.483 |
| Baseline + Landmarks | only local | – | $h_v + h_{local}$ | 0.409 | 0.491 |
| Baseline + Landmarks | FRLP | – | $h_v + h_l$ | 0.410 | 0.491 |
| Baseline + Landmarks | only global | simple | $h_{l_{global}}$ | 0.401 | 0.483 |
| Baseline + Landmarks | only local | simple | $h_{l_{local}}$ | 0.409 | 0.494 |
| Baseline + Landmarks | FRLP | simple | $h_v$ | 0.416 | 0.512 |
| Baseline + Landmarks | FRLP | FRGCA | $h_v$ | 0.424 | 0.520 |
| Baseline + Face Parsing | – | simple | $h_p$ | 0.413 | 0.498 |

This table presents an ablation study evaluating the contributions of different modules to the zero-shot performance on DFEW [22] (using MAFW [40], FERV39k [40], and Crema-D [24] for training). The baseline model is Video-LLaVA [37], which uses only visual tokens ($h_v$).

  • Impact of Landmark Tokens (without Cross-Attention):

    • Adding global landmark tokens ($h_{global}$) to $h_v$ improves UAR from 0.391 to 0.402 and WAR from 0.479 to 0.483.
    • Adding local landmark tokens ($h_{local}$) yields slightly better results: UAR 0.409, WAR 0.491.
    • Using the full FRLP (combining global and local) without cross-attention ($h_v + h_l$) results in UAR 0.410, WAR 0.491. These results indicate that simply appending landmark tokens as additional input tokens to the LLM provides a performance boost over the baseline, but the gains are modest.
  • Impact of Cross-Attention:

    • When simple cross-attention (without the RPP mask) is used to integrate landmark tokens (either global or local) with the visual tokens ($h_v$), performance improves further. For example, using FRLP with simple cross-attention results in UAR 0.416 and WAR 0.512. This suggests that allowing landmark tokens to refine visual tokens through cross-attention is more effective than just concatenating them as direct input to the LLM.
    • This also validates the design choice of cross-attention for efficiency, as it avoids increasing the LLM's context window.
  • Impact of Face-Region Guided Cross-Attention (FRGCA):

    • The full Face-LLaVA model, employing FRLP for landmark projection and FRGCA (which includes the region-patch proximity mask) for cross-attention, achieves the highest performance in this ablation: UAR 0.424 and WAR 0.520.
    • This demonstrates that the explicit guidance provided by the RPP mask in FRGCA leads to a larger performance gain compared to simple cross-attention, validating the effectiveness of focusing attention on specific facial regions.
  • Landmarks vs. Face Parsing:

    • An alternative experiment replacing landmarks with face parse heatmaps (encoding them through the visual encoder and using them in simple cross-attention) yields UAR 0.413 and WAR 0.498.

    • While competitive with some landmark-based configurations, this approach incurs extra computational and memory costs due to the pixel-wise nature of face parse maps (which result in tokens of the same size as visual tokens). The authors conclude that landmarks offer a more efficient and equally effective solution for injecting face-expert knowledge.

      In summary, the ablation study clearly shows that:

  1. Adding face-specific landmark features is beneficial.
  2. Integrating these landmark features via cross-attention is more effective than direct concatenation.
  3. The novel Face-Region Guided Cross-Attention (with the RPP mask) provides the most significant performance boost by intelligently guiding the attention mechanism to salient facial regions (see the sketch below).
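
For intuition, the sketch below illustrates one plausible reading of FRGCA consistent with the description above: visual patch tokens act as queries over face-region landmark tokens, and a region-patch proximity (RPP) term biases each patch toward nearby facial regions before the refined visual tokens are passed to the LLM. The class name, the Gaussian form of the proximity mask, and the residual update are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceRegionGuidedCrossAttention(nn.Module):
    """Illustrative sketch (not the authors' exact module): visual patch tokens
    attend to face-region landmark tokens, with attention biased toward regions
    that lie close to each patch (the RPP mask)."""

    def __init__(self, dim: int, sigma: float = 0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # queries from visual patch tokens
        self.k = nn.Linear(dim, dim)  # keys from region landmark tokens
        self.v = nn.Linear(dim, dim)  # values from region landmark tokens
        self.sigma = sigma            # assumed width of the proximity kernel

    def rpp_mask(self, patch_xy, region_xy):
        # patch_xy: (N, 2) normalized patch centers; region_xy: (M, 2)
        # normalized centroids of the landmarks belonging to each face region.
        dist = torch.cdist(patch_xy, region_xy)                 # (N, M)
        return torch.exp(-dist.pow(2) / (2 * self.sigma ** 2))  # closer -> larger weight

    def forward(self, h_v, h_l, patch_xy, region_xy):
        # h_v: (N, D) visual patch tokens; h_l: (M, D) face-region landmark tokens.
        q, k, v = self.q(h_v), self.k(h_l), self.v(h_l)
        scores = (q @ k.T) / (q.shape[-1] ** 0.5)               # (N, M) raw scores
        scores = scores + torch.log(self.rpp_mask(patch_xy, region_xy) + 1e-6)
        attn = F.softmax(scores, dim=-1)
        # Residual update: landmark information refines the visual tokens, so the
        # LLM still receives only the N visual tokens (no extra context length).
        return h_v + attn @ v
```

Because the output has the same shape as the visual tokens, this form of integration adds nothing to the LLM's context window, which matches the efficiency argument made in the ablation discussion.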

7. Conclusion & Reflections

7.1. Conclusion Summary

This work marks a significant advancement in the application of Multimodal Large Language Models (MLLMs) to the nuanced field of facial analysis. The authors successfully address the limitations of previous approaches by introducing Face-LLaVA, a generalist MLLM capable of performing a diverse range of facial perception tasks with accompanying natural language reasoning.

The core contributions include:

  1. FaceInstruct-1M: A large-scale, face-centric instruction-tuning dataset comprising over one million image and video samples, automatically annotated using Gemini and filtered by GPT-4o. This dataset is crucial for instruction tuning MLLMs for comprehensive face processing.

  2. Face-LLaVA Architecture: A novel MLLM architecture that integrates face-specific geometric features (facial landmarks) into the visual encoding process through two innovative modules: the Face-Region Landmark Projector (FRLP) and Face-Region Guided Cross-Attention (FRGCA). These modules ensure that Face-LLaVA effectively leverages fine-grained facial cues without unnecessarily expanding the LLM's context window.

  3. Demonstrated Superiority: Extensive evaluations across nine datasets and five distinct face analysis tasks (facial expression recognition, action unit detection, facial attribute detection, age estimation, and deepfake detection) confirm Face-LLaVA's effectiveness. It achieves superior zero-shot performance compared to existing open-source MLLMs and competitive performance against specialized, supervised task-specific methods. Furthermore, its generative capabilities are validated by higher reasoning ratings from GPT-4o, confirming its ability to provide comprehensive and consistent explanations.

    By offering a model that combines robust perception with interpretable reasoning, Face-LLaVA substantially contributes to the development of more sophisticated and socially aware AI systems.

7.2. Limitations & Future Work

The authors acknowledge several limitations of the current Face-LLaVA model:

  • Single-Turn Interactions: The current model is limited to single-turn interactions: it can respond to an instruction but cannot engage in multi-turn dialogue or handle follow-up questions.

  • Lack of Advanced Chain-of-Thought Reasoning: Face-LLaVA does not currently incorporate advanced chain-of-thought reasoning capabilities, which would allow it to break down complex problems into intermediate steps and provide more elaborate explanations.

  • Limited Task Scope: The model has not explored face identification (recognizing specific individuals) or dense prediction tasks (e.g., pixel-level facial parsing or reconstruction).

  • Inherited Biases: As the FaceInstruct-1M dataset is constructed from existing public datasets, it may inherit any biases present in those original sources (e.g., demographic biases in age, gender, or race representation). The authors state that addressing model bias is a future work direction.

    Based on these limitations, the authors suggest future research directions:

  • Integrating multi-turn dialogue capabilities.

  • Expanding to other facial tasks, such as face identification or dense prediction.

  • Addressing potential biases in the dataset and model.

7.3. Personal Insights & Critique

Face-LLaVA represents a compelling step forward in making facial AI more generalist and interpretable. The paper's strength lies in its holistic approach: not only proposing a novel MLLM architecture but also developing a custom, large-scale instruction-tuning dataset specifically tailored for face analysis.

Innovations and Strengths:

  • Dataset Quality and Scope: The FaceInstruct-1M dataset is a significant contribution. Its scale, multi-task nature, and inclusion of video data are vital. The GPT-assisted annotation and filtering pipeline is particularly innovative, demonstrating a robust way to generate high-quality instruction-tuning data automatically, a common bottleneck for MLLM development. The high GPT-4o ratings for FaceInstruct-1M validate this approach.
  • Efficient Landmark Integration: The FRLP and FRGCA modules are elegant solutions. They effectively inject face-expert knowledge (facial landmarks) into the MLLM without burdening the LLM's context window (a frequent challenge in multimodal models). The region-patch proximity mask is a clever mechanism to focus attention on anatomically meaningful areas, mimicking how humans often analyze faces. This targeted attention likely contributes to the superior performance in fine-grained tasks like AU detection.
  • Demonstrated Generalization and Reasoning: The impressive zero-shot performance across diverse tasks is a strong indicator of the model's generalization capabilities. More importantly, the high GPT-based reasoning ratings validate the model's ability to provide human-understandable explanations, which is critical for trust and application in sensitive domains like healthcare or surveillance.
  • Video Processing: The ability to handle both images and videos is a practical advantage, especially for dynamic tasks like expression recognition and deepfake detection in real-world scenarios.

Potential Issues and Areas for Improvement:

  • Reliance on Commercial LLMs: While the GPT-assisted data generation and evaluation are effective, they inherently rely on closed-source commercial models (Gemini, GPT-4o-mini). This introduces a dependency and potential black-box aspect to the data quality and evaluation metrics, which could be less transparent than fully open-source pipelines. Future work could explore using open-source LLMs for these tasks, though they may not currently match the performance of commercial counterparts.
  • Computational Cost of FRGCA: While FRGCA is efficient in terms of the LLM's context window, the cross-attention mechanism itself, especially the calculation of the RPP mask and attention weights over potentially many visual patches ($N$) and face regions ($M$), adds computational overhead compared to a pure Video-LLaVA baseline. The paper mentions "keeping extra compute to a minimum" for MLPs, but the cross-attention still adds layers.
  • Bias Mitigation: The acknowledgment of inherited biases is important. A deeper analysis and proactive mitigation strategies for demographic biases (e.g., fairness metrics, debiasing techniques) would further strengthen the work, particularly given the sensitive nature of face analysis.
  • Context Window for Videos: While FRGCA saves the LLM's context window from landmark tokens, the overall number of frames processed for videos is still limited (e.g., eight frames for deepfake detection). This is a common MLLM limitation, but it explains the gap to SOTA methods that can leverage more temporal information. Future work could explore more advanced video summarization or long-context video processing techniques.

Transferability and Applications: The methods and conclusions from Face-LLaVA can be transferred to other fine-grained multimodal tasks where specific geometric or structural priors are crucial, but general vision encoders might fall short. Examples include:

  • Medical Image Analysis: Integrating domain-specific anatomical landmarks (e.g., in X-rays or MRI scans) with MLLMs for diagnosis and explanation generation.

  • Action Recognition: Using pose estimation landmarks to guide MLLMs for detailed descriptions of human actions and activities.

  • Robotics: Allowing robots to interpret human social cues (via face analysis) and explain their interpretations in natural language, leading to more natural human-robot interaction.

    Overall, Face-LLaVA provides a robust and interpretable framework for face analysis, significantly pushing the boundaries of what MLLMs can achieve in this critical domain.
