Explainable AI for Image Aesthetic Evaluation Using Vision-Language Models
TL;DR Summary
This study enhances image aesthetic evaluation using vision-language models by proposing an interpretable method. It explores feature importance through SHAP analysis and predicts quality scores with LightGBM, demonstrating high correlation with human judgment and thus advancing objective, explainable aesthetic evaluation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Explainable AI for Image Aesthetic Evaluation Using Vision-Language Models
1.2. Authors
- Supatta Viriyavisuthisakul (Dept. of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi, Thailand)
- Shun Yoshida (Dep. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
- Kaede Shiohara (Dep. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
- Ling Xiao (Dep. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
- Toshihiko Yamasaki (Dep. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
1.3. Journal/Conference
The paper does not explicitly state the journal or conference in which it was published. The metadata indicates it was "Published at (UTC): 2025-02-03T00:00:00.000Z", suggesting a recent publication, most likely a conference paper or technical report given its structure.
1.4. Publication Year
2025
1.5. Abstract
Evaluating image aesthetics has traditionally been a subjective task relying on human experts. Vision-Language models (VLMs), such as Contrastive Language-Image Pre-Training (CLIP), offer a novel approach to assessing visual features and descriptions, leading to more interpretable aesthetic evaluations. Recently, CLIP-IQA (CLIP-based image quality assessment) emerged as a method to quantify image quality and abstract perception using an antonym prompt pairing strategy. While CLIP-IQA achieves high correlation with human aesthetic judgment, the relevance of its features to human perception remains a question. This study investigates the significance of image features derived from various paired prompts. Each prompt pair is encoded into feature vectors using a text encoder, and images are similarly encoded using an image encoder. Light Gradient Boosting Machine (LightGBM) is employed as a regressor to predict quality scores. After training, SHapley Additive exPlanations (SHAP) values are computed for each feature to evaluate the contribution of individual prompt elements. Furthermore, a multimodal large language model (MLLM), LLaVa, is applied to generate linguistic explanations of images. The proposed method yields Spearman's rank correlation coefficient (SROCC) and Pearson linear correlation coefficient (PLCC) scores of 0.762 and 0.785, respectively. The research also explores advanced prompting strategies to gain deeper insights into the IQA scoring mechanism.
1.6. Original Source Link
/files/papers/6911d810b150195a0db749a3/paper.pdf (Published as a PDF document)
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the inherent subjectivity and the traditional reliance on human expertise for Image Aesthetics Assessment (IAA). While Vision-Language models (VLMs) like CLIP have opened new avenues for evaluating images based on both visual features and textual descriptions, enabling potentially more interpretable assessments, existing methods face significant limitations. Specifically, CLIP-IQA (CLIP-based Image Quality Assessment), despite correlating well with human judgment, often operates as a "black-box" model, lacking transparency regarding which features contribute to its aesthetic predictions. Conversely, Multimodal Large Language Models (MLLMs) like LLaVa, while capable of generating linguistic explanations, tend to exhibit low recognition accuracy and inconsistent outputs when applied directly to aesthetic evaluation tasks. The challenge, therefore, lies in developing a framework that can achieve both high accuracy in aesthetic prediction and provide meaningful, interpretable explanations for those predictions, bridging the gap between quantitative scores and human-understandable reasoning.
2.2. Main Contributions / Findings
The paper makes several key contributions:
- Novel Integrated Framework: It proposes a new framework that combines CLIP-IQA for quantitative aesthetic scoring, SHAP values for explaining feature importance, and LLaVa for generating natural language explanations. This integration aims to enhance both the interpretability and reliability of IAA predictions.
- Improved Explainability and Accuracy: The framework addresses the limitations of existing methods. It moves beyond the black-box nature of CLIP-IQA by leveraging SHAP to reveal the contribution of specific aesthetic elements (derived from prompt pairs) to the overall score, and it overcomes the low accuracy typically seen in MLLM-based approaches to IAA.
- Quantitative Performance: The proposed method achieves SROCC and PLCC scores of 0.762 and 0.785, respectively. These results are notably higher than those obtained by LLaVa or CLIP-IQA used in isolation.
- Investigation of Prompt Strategies: The study explores the impact of using a larger number of pairwise concepts (prompts) generated by ChatGPT. It finds that increasing the number of prompt pairs (e.g., from 1 to 30) substantially improves accuracy, particularly for the machine learning component, suggesting that richer semantic input enhances the model's ability to leverage CLIP's internal knowledge.
- Linguistic Explanations: By feeding SHAP values and the original images to LLaVa, the framework can provide natural language explanations for the aesthetic scores, offering human-understandable reasoning behind the model's judgments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with several core concepts in artificial intelligence, computer vision, and natural language processing.
- Image Aesthetics Assessment (IAA): This is the task of automatically predicting the aesthetic quality or attractiveness of an image, similar to how a human might judge a photograph as "good" or "bad." It goes beyond simple image quality (e.g., blurriness) to encompass subjective elements like composition, lighting, color harmony, and overall appeal.
- Vision-Language Models (VLMs): These are AI models designed to understand and process information from both visual (images, videos) and textual (natural language) modalities. They learn to associate visual concepts with their linguistic descriptions, enabling tasks like image captioning, visual question answering, and, in this paper's context, aesthetic evaluation by interpreting image features in relation to descriptive text.
- Contrastive Language-Image Pre-Training (CLIP): CLIP is a vision-language model developed by OpenAI. It learns to associate images with text by training on a massive dataset of image-text pairs. The core idea is contrastive learning: the model learns to predict which captions go with which images in a batch, pulling matching image-text pairs closer together in an embedding space while pushing mismatched pairs apart.
  - Zero-shot learning: A key capability of CLIP. Once trained, CLIP can perform tasks (like image classification or aesthetic evaluation) on new datasets without any task-specific training. This is achieved by comparing image embeddings to text embeddings of descriptions for different classes or attributes (e.g., "a photo of a cat" vs. "a photo of a dog"); the image is assigned to the class whose text embedding is most similar.
- CLIP-IQA (CLIP-based Image Quality Assessment): An application of CLIP tailored to image quality and aesthetic assessment. It typically uses antonym prompt pairs (e.g., "Good photo" vs. "Bad photo") and computes the CLIP similarity between an image and these prompts to derive an aesthetic score; the image's affinity for "Good photo" relative to "Bad photo" indicates its perceived quality.
- SHapley Additive exPlanations (SHAP): SHAP is an Explainable AI (XAI) technique that provides model-agnostic explanations for the output of any machine learning model. It is based on the concept of Shapley values from cooperative game theory.
  - Shapley value: In game theory, the Shapley value fairly distributes the total gain among players in a coalition based on their marginal contributions to all possible sub-coalitions. In XAI, this translates to assigning an importance value to each feature of a prediction, indicating how much that feature contributed to the specific output while accounting for all possible combinations of features. SHAP values are known for being consistent and locally accurate.
- Multimodal Large Language Models (MLLMs): These models extend traditional Large Language Models (LLMs) by enabling them to process multiple data modalities, such as text, images, and sometimes audio. They integrate a vision encoder with an LLM to generate text responses that are visually grounded.
- LLaVa (Large Language and Vision Assistant): LLaVa is a specific MLLM that combines a vision encoder (such as CLIP's image encoder or a ViT) with an LLM (such as Vicuna). It is designed for visual instruction tuning, allowing it to follow instructions and answer questions about images, generating detailed descriptions and reasoning grounded in visual content.
3.2. Previous Works
The paper contextualizes its approach by referencing several key prior works in Image Aesthetics Assessment (IAA) and related AI techniques:
- Early IAA Models (Hand-crafted Features): Initially, IAA models relied on carefully designed, hand-crafted features that often emulated photographic principles. For example, color harmony [1], composition rules [2, 3], and even human aesthetic perception [4, 7] were encoded into features. These methods were foundational but limited by their reliance on human-defined rules.
- Deep Learning-based IAA: With advances in deep learning, IAA models began to integrate deep neural networks. These models learn relevant features directly from data, often combining global and local features to capture both overall image structure and fine-grained details [5, 6]. Deep learning significantly improved efficiency and performance over hand-crafted approaches.
- CLIP in IAA: The introduction of CLIP [8] marked a significant shift. Its ability to learn transferable visual models from natural language supervision enabled zero-shot IAA. CLIP's image encoder proved effective at extracting relevant features such as lighting, composition, and beauty attributes, offering a more robust foundation for IAA networks than models pre-trained solely on ImageNet, and capturing both content and style.
- CLIP-IQA: Building on CLIP, CLIP-IQA [9] was developed to quantify image quality and abstract perception. It employs an antonym prompt pairing strategy (e.g., "Good photo" vs. "Bad photo") to assess images. CLIP-IQA demonstrated strong correlation with human assessments across various image quality datasets, distinguishing visual attributes (e.g., brightness vs. contrast) and abstract perceptions (e.g., happy vs. sad). However, the paper highlights its limitation as a black-box model with respect to feature importance.
- Shapley Value and SHAP: The concept of the Shapley value originates from cooperative game theory [10], where it is used to fairly distribute payouts among players based on their contributions. SHAP [10] adapted this concept for Explainable AI (XAI), providing a unified approach to interpreting model predictions by quantifying each feature's contribution. SHAP values are model-agnostic and offer consistent insights into why a model makes specific predictions.
- LLaVa: LLaVa [11] represents a major step in multimodal AI. It combines a vision encoder with an LLM to process visual inputs and generate text-based responses. LLaVa can provide detailed image descriptions, answer questions based on visual context, and perform image-based reasoning. It has shown promising performance in aesthetic perception, sometimes approaching human-level assessments [12]. However, the paper notes its potential for low recognition accuracy and inconsistency in pure IAA tasks.
3.3. Technological Evolution
The evolution of Image Aesthetics Assessment has progressed from rule-based systems to data-driven deep learning, and now to multi-modal Vision-Language Models.
- Hand-crafted Features (Early 2010s): Initial IAA systems relied on human experts defining photographic rules (e.g., rule of thirds, color harmony). These systems were limited by their explicit knowledge base and lacked generalization.
- Deep Learning (Mid-2010s onwards): Convolutional Neural Networks (CNNs) revolutionized IAA by learning features directly from large datasets of aesthetically rated images. This significantly improved performance and reduced the need for manual feature engineering.
- Vision-Language Models (Early 2020s onwards): CLIP introduced the ability to leverage natural language for understanding visual concepts. This allowed zero-shot IAA and more flexible, prompt-driven evaluation, moving beyond simple classification toward semantic understanding. CLIP-IQA specifically applied this to aesthetic assessment.
- Multimodal Large Language Models (Recent): LLaVa and similar MLLMs further integrate vision and language, enabling conversational AI that can reason about images and generate explanatory text. This represents a step toward human-like understanding and communication about visual content.

This paper's work fits within the latest stage of this evolution, focusing on enhancing the interpretability of VLM-based IAA by combining quantitative CLIP-IQA with SHAP for feature importance and MLLMs for linguistic explanations, thereby addressing the "black-box" nature of previous VLM applications in IAA.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core innovations and differentiations:
- Addressing the Interpretability Gap of CLIP-IQA: While CLIP-IQA (e.g., [9]) provides good quantitative aesthetic scores, it largely remains a black-box model. The proposed framework directly addresses this by integrating SHAP values, explaining which specific features (derived from prompt pairs such as "Bright photo" vs. "Dark photo") contribute positively or negatively to an image's aesthetic score. This provides a level of transparency that CLIP-IQA alone lacks.
- Improving Accuracy and Consistency of MLLM-based IAA: Direct application of MLLMs like LLaVa to IAA (as hinted by [12]) often suffers from lower recognition accuracy and inconsistent aesthetic judgments. This paper's method uses LLaVa not for direct scoring but for generating linguistic explanations based on CLIP-IQA scores and SHAP values. It thus exploits LLaVa's strength in language generation while relying on CLIP-IQA and LightGBM for robust scoring, achieving higher overall accuracy (as shown in Table II).
- Synthesizing Quantitative and Qualitative Explanations: The framework is unique in providing both numerical evaluation (via CLIP-IQA scores) and feature importance (via SHAP values), culminating in human-readable linguistic explanations (via LLaVa). This comprehensive approach offers a richer understanding of image aesthetics than any single component could provide alone.
- Advanced Prompt Engineering: The paper investigates the impact of a larger number of diverse prompt pairs generated by ChatGPT, demonstrating that this prompt engineering strategy significantly boosts accuracy. This goes beyond the basic antonym pairs typically used in CLIP-IQA and explores a more nuanced, multi-faceted approach to feature extraction.

In essence, the paper differentiates itself by offering a unified framework that synergistically combines the strengths of CLIP-IQA (quantitative scoring), SHAP (XAI), and LLaVa (linguistic explanation) to overcome the individual limitations of each, resulting in both accurate and explainable Image Aesthetics Assessment.
4. Methodology
4.1. Principles
The core principle of the proposed method is to create an Explainable AI (XAI) framework for Image Aesthetics Assessment (IAA) by integrating the quantitative scoring capability of CLIP-IQA with the interpretability provided by SHAP values, and the natural language explanation generation of a Multimodal Large Language Model (MLLM) like LLaVa. The underlying intuition is that while Vision-Language Models (VLMs) can assess aesthetics, they often lack transparency. By breaking down the prediction into contributions from specific aesthetic attributes (derived from prompt pairs) using SHAP and then translating these contributions into human-understandable text, the framework aims to provide both accurate scores and clear reasoning.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed approach integrates several components as depicted in the overall framework (Figure 1 in the paper, represented as images/2.jpg in the provided assets). Here's a step-by-step breakdown:
The following figure (Figure 1 from the original paper, referred to as images/2.jpg) illustrates the overall framework of the proposed method:
This figure is a schematic of the proposed explainable AI method for image aesthetic evaluation. The input image is processed by an image encoder and paired with prompts produced by a text encoder. The system computes CLIP-IQA scores and SHAP scores, giving quantitative assessments of attributes such as shape, color, and high contrast. In the example shown, the SHAP score for the shape attribute is +14.51, indicating the image is relatively clear.
4.2.1. Prompt Generation
The process begins with defining antonym prompt pairs that represent various aesthetic attributes. Initially, the paper uses basic prompts such as those listed in Table I. For advanced strategies, a Large Language Model (LLM) like ChatGPT is used to generate a larger set of diverse prompt pairs, as exemplified in Table III.
The following are the results from Table I of the original paper:
| Positive prompt | Negative prompt |
| --- | --- |
| Bright photo | Dark photo |
| Clean photo | Noisy photo |
| Colorful photo | Dull photo |
| Sharp photo | Blurry photo |
| High contrast photo | Low contrast photo |
The following are the results from Table III of the original paper:
| Positive term | Negative term |
| --- | --- |
| Clear | Blurry |
| Sharp | Dull |
| Crisp | Fuzzy |
| Vibrant | Dull |
| Detailed | Pixelated |
| High-resolution | Low-resolution |
| Colorful | Monochrome |
| Well-exposed | Overexposed |
| Balanced | Imbalanced |
| High contrast | Low contrast |
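To make the prompt-pair representation concrete, here is a minimal sketch (not the authors' code) of how such adjective pairs could be expanded into the "... photo" prompt format used in Table I before being fed to the text encoder; the template string and variable names are assumptions.

```python
# Hypothetical expansion of ChatGPT-generated adjective pairs into antonym prompt pairs.
adjective_pairs = [
    ("Clear", "Blurry"),
    ("Sharp", "Dull"),
    ("Vibrant", "Dull"),
    ("High contrast", "Low contrast"),
]

def to_prompt_pairs(pairs, template="{} photo"):
    """Turn (positive, negative) adjectives into antonym prompt pairs."""
    return [(template.format(pos), template.format(neg)) for pos, neg in pairs]

prompt_pairs = to_prompt_pairs(adjective_pairs)
# [('Clear photo', 'Blurry photo'), ('Sharp photo', 'Dull photo'), ...]
```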
4.2.2. Feature Encoding using CLIP
An input image and each generated prompt (both positive and negative terms of a pair) are processed through CLIP's respective encoders.
- Text Encoder: Each prompt text (e.g., "Bright photo", "Dark photo") is encoded into a high-dimensional feature vector (embedding) using CLIP's text encoder. Let $T_p$ denote the embedding of the positive prompt and $T_n$ the embedding of its antonym (negative) prompt.
- Image Encoder: The input image is encoded into a high-dimensional feature vector (embedding) using CLIP's image encoder. Let $I$ denote the image embedding.
4.2.3. CLIP-IQA Score Calculation
For each prompt pair, the cosine similarity between the encoded image feature vector and each of the prompt's feature vectors is calculated. Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space, indicating how similar their directions are. It ranges from -1 (opposite) to 1 (identical).
For a given image embedding $I$ and a prompt embedding $T$, the cosine similarity is calculated as: $ \mathrm{Similarity}(I, T) = \frac{I \cdot T}{\|I\|\,\|T\|} $ where $I \cdot T$ is the dot product of vectors $I$ and $T$, and $\|I\|$ and $\|T\|$ are their L2-norms (magnitudes).
This yields a similarity $S_p$ for the positive prompt and $S_n$ for the negative prompt.
These similarity scores for a specific prompt pair are then normalized via the softmax function. The softmax function is used to convert a vector of raw scores (logits) into a vector of probabilities, where the probabilities sum to 1. In this context, it would indicate the relative "strength" of association with the positive versus negative prompt.
$
P_p = \frac{e^{S_p}}{e^{S_p} + e^{S_n}} \quad \text{and} \quad P_n = \frac{e^{S_n}}{e^{S_p} + e^{S_n}}
$
where $P_p$ is the "probability" of the image aligning with the positive prompt, and $P_n$ with the negative prompt.
The paper states these are "aggregated into a single value, represented as a CLIP-IQA score." While the exact aggregation formula for a single CLIP-IQA score from multiple prompt pairs isn't explicitly given, a common approach in CLIP-IQA is to take the difference or ratio of positive vs. negative probabilities, or to use the positive probability directly. For a single prompt pair, a simplified CLIP-IQA score for that pair could be:
$
\text{CLIP-IQA}_{\text{pair}} = P_p - P_n
$
or simply $P_p$. If multiple prompt pairs are used, these individual scores are then concatenated (or averaged) to form a feature vector representing the image's aesthetic attributes across all prompt pairs. This feature vector, comprising one score per prompt pair, captures the aesthetic attributes from both global and fine-grained perspectives.
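The following is a minimal sketch of this encoding and scoring pipeline, not the authors' implementation: it assumes the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`, a PIL image on disk, and uses the positive-prompt softmax probability $P_p$ as the per-pair feature; the similarity scaling factor is an assumption.

```python
# Sketch (assumptions noted above): one CLIP-IQA-style feature per antonym prompt pair.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt_pairs = [
    ("Bright photo", "Dark photo"),
    ("Sharp photo", "Blurry photo"),
    ("Colorful photo", "Dull photo"),
]

@torch.no_grad()
def clip_iqa_features(image, prompt_pairs):
    # Encode the image once and L2-normalize its embedding (I).
    img_inputs = processor(images=image, return_tensors="pt")
    I = F.normalize(model.get_image_features(**img_inputs), dim=-1)
    feats = []
    for pos, neg in prompt_pairs:
        # Encode the positive/negative prompts (T_p, T_n) and normalize.
        txt_inputs = processor(text=[pos, neg], return_tensors="pt", padding=True)
        T = F.normalize(model.get_text_features(**txt_inputs), dim=-1)
        sims = (I @ T.T).squeeze(0)                 # cosine similarities [S_p, S_n]
        probs = torch.softmax(sims * 100.0, dim=0)  # scaling ~ CLIP's logit scale (assumption)
        feats.append(probs[0].item())               # keep P_p as the feature for this pair
    return feats

features = clip_iqa_features(Image.open("example.jpg"), prompt_pairs)
```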
4.2.4. Regressor Training with LightGBM
The CLIP-IQA scores (as a feature vector derived from all prompt pairs for each image) are used as input to train a regressor. The paper specifies Light Gradient Boosting Machine (LightGBM) as the chosen regressor.
- Dataset: The KonIQ-10k dataset, which contains images with human-assigned quality scores (Mean Opinion Scores, MOS), is used for training.
- Training Objective: LightGBM is trained to predict the MOS of images from their CLIP-IQA feature vectors, learning a mapping from the prompt-derived aesthetic features to a human-perceived quality score (a minimal training sketch follows this list).
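A minimal sketch of this regression step, assuming `X` is an (N, P) array of per-pair CLIP-IQA features for N KonIQ-10k images and `y` holds their MOS labels; the file names, split, and hyper-parameters are illustrative, not the authors' settings.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hypothetical precomputed inputs: prompt-pair features and MOS labels.
X = np.load("clip_iqa_features.npy")   # shape (N, P): one column per prompt pair
y = np.load("koniq_mos.npy")           # shape (N,):  human MOS per image

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees regressing MOS from the prompt-derived features.
regressor = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
regressor.fit(X_train, y_train)
predicted_mos = regressor.predict(X_test)
```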
4.2.5. SHAP Value Calculation for Interpretability
After the LightGBM regressor is trained, SHapley Additive exPlanations (SHAP) values are computed for each input feature. Each "feature" here corresponds to the CLIP-IQA score derived from one of the antonym prompt pairs (e.g., the "Bright photo" vs. "Dark photo" score is one feature).
- Purpose: SHAP values quantify the contribution of each individual prompt-derived feature (aesthetic element) to the LightGBM prediction of the overall aesthetic score for a given image.
- Mechanism: SHAP values are calculated by considering all possible combinations (coalitions) of features and observing how the prediction changes when a particular feature is included or excluded. The average marginal contribution of a feature across all possible coalitions is its SHAP value.
- Interpretation: A positive SHAP value for a feature indicates that the feature pushed the prediction above the baseline average prediction, while a negative value pushed it below. The magnitude of the SHAP value indicates the strength of the contribution.

The SHAP value of feature $j$ for a model $f$ and input $\mathbf{x}$ is given by: $ \phi_j(f, \mathbf{x}) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f_S(\mathbf{x}_S \cup \{x_j\}) - f_S(\mathbf{x}_S) \right] $ where:
- $N$ is the set of all features.
- $S$ is a subset of features that does not include feature $j$.
- $|N|$ is the total number of features.
- $|S|$ is the number of features in subset $S$.
- $f_S(\mathbf{x}_S)$ is the prediction of the model using only the features in $S$ with their corresponding values $\mathbf{x}_S$.
- $f_S(\mathbf{x}_S \cup \{x_j\})$ is the prediction using the features in $S$ plus feature $j$.
The factorial term $\frac{|S|!\,(|N| - |S| - 1)!}{|N|!}$ is the weight assigned to each coalition. A code sketch of computing these attributions for the trained regressor follows.
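The sketch below shows how such attributions could be computed with the `shap` library's TreeExplainer for the LightGBM regressor from the previous sketch; the feature names mirror Table I and are illustrative.

```python
import shap

feature_names = ["bright_vs_dark", "clean_vs_noisy", "colorful_vs_dull",
                 "sharp_vs_blurry", "high_vs_low_contrast"]

# TreeExplainer computes SHAP values efficiently for tree ensembles such as LightGBM.
explainer = shap.TreeExplainer(regressor)
shap_values = explainer.shap_values(X_test)     # shape (N_test, P)

# Per-feature contributions for the first test image.
for name, value in zip(feature_names, shap_values[0]):
    print(f"{name}: {value:+.3f}")

# Sanity check: prediction = E[f(X)] + sum of that image's SHAP values.
reconstructed = explainer.expected_value + shap_values[0].sum()
```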
4.2.6. Linguistic Explanations with MLLM (LLaVa)
Finally, to provide human-understandable explanations, the calculated SHAP values along with the original input images are fed into the Multimodal Large Language Model, LLaVa.
- Input: The visual content of the image and the quantitative SHAP values (which indicate the importance of the different aesthetic attributes).
- Output: LLaVa processes this combined information to generate natural language text explaining why the image received its predicted aesthetic score, highlighting the features that most contributed to or detracted from its quality, as identified by the SHAP values. This bridges the gap between numerical assessment and human reasoning.

This integrated approach ensures that the model not only predicts aesthetic scores accurately but also provides actionable, interpretable insights into the underlying reasons for those scores. A sketch of how the SHAP values might be packaged into an instruction for the MLLM follows.
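The paper does not give the exact prompt format, so the following is only a hedged sketch of turning SHAP attributions into an instruction that could accompany the image when querying an MLLM such as LLaVa; the wording, attribute names, and values are assumptions.

```python
# Hypothetical construction of an explanation request from SHAP attributions.
def build_explanation_prompt(shap_by_attribute):
    """Format per-attribute SHAP contributions into an instruction string."""
    lines = [f"- {attr}: {value:+.2f}" for attr, value in shap_by_attribute.items()]
    return (
        "The following aesthetic attributes contributed to this image's predicted "
        "quality score (positive values raise the score, negative values lower it):\n"
        + "\n".join(lines)
        + "\nUsing the image and these contributions, explain in plain language "
        "why the image received its score."
    )

prompt = build_explanation_prompt({"color": 1.2, "sharpness": 0.4, "high contrast": -0.7})
# `prompt` plus the image would then be passed to the MLLM (e.g., LLaVa).
```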
5. Experimental Setup
5.1. Datasets
The primary dataset used in this study for training the regressor (LightGBM) is:
- KonIQ-10k [13]: An ecologically valid database specifically designed for deep learning-based blind Image Quality Assessment (IQA). It consists of 10,000 images, each associated with Mean Opinion Scores (MOS) collected from human evaluators. The MOS values represent the subjective quality or aesthetic ratings of the images.
  - Characteristics: KonIQ-10k is known for its diversity in content and distortion types, making it suitable for training models that generalize well to real-world image quality perception. The human-assigned MOS values provide a ground truth for aesthetic quality, which is crucial for training and evaluating IAA models.
  - Purpose: This dataset allows the LightGBM model to learn the relationship between the CLIP-IQA features (derived from prompt pairs) and human aesthetic judgment.
5.2. Evaluation Metrics
The paper uses two standard correlation coefficients to evaluate the performance of the aesthetic evaluation models, specifically measuring the agreement between the model's predicted scores and human-assigned ground truth scores (MOS). For all metrics, higher values indicate better performance (closer correlation).
- Spearman's Rank-Order Correlation Coefficient (SROCC):
  - Conceptual Definition: SROCC assesses the monotonic relationship between two ranked variables, i.e., how well the relationship between them can be described by a monotonic function. If the predicted aesthetic scores consistently rank images in the same order as the human MOS scores, SROCC is high, regardless of whether the absolute values match. It is robust to non-linear relationships and outliers in the score values, focusing purely on rank agreement.
  - Mathematical Formula: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
  - Symbol Explanation:
    - $\rho$: Spearman's rank correlation coefficient.
    - $d_i$: the difference between the ranks of the $i$-th observation under the two variables (e.g., predicted-score rank and MOS rank).
    - $n$: the number of observations (images).
- Pearson's Linear Correlation Coefficient (PLCC):
  - Conceptual Definition: PLCC measures the linear relationship between two variables, indicating the strength and direction of the linear association between the predicted aesthetic scores and the human MOS scores. A high PLCC implies that the predicted scores not only rank images correctly but also maintain a consistent linear scaling with the human scores. It is sensitive to both the order and the magnitude of the scores.
  - Mathematical Formula: $ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} $
  - Symbol Explanation:
    - $r$: Pearson's linear correlation coefficient.
    - $n$: the number of observations (images).
    - $x$: values of the first variable (e.g., the model's predicted aesthetic score).
    - $y$: values of the second variable (e.g., the human MOS score).
    - $\sum xy$: the sum of the products of corresponding $x$ and $y$ values.
    - $\sum x$, $\sum y$: the sums of the $x$ and $y$ values.
    - $\sum x^2$, $\sum y^2$: the sums of the squared $x$ and $y$ values.

Both metrics can be computed directly with SciPy, as in the short sketch below.
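A short, self-contained sketch of computing SROCC and PLCC between predicted scores and MOS with SciPy; the arrays are illustrative, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative predicted scores and human MOS values (not from the paper).
predicted = np.array([3.2, 4.1, 2.5, 3.8, 4.6])
mos       = np.array([3.0, 4.3, 2.2, 3.9, 4.4])

srocc, _ = spearmanr(predicted, mos)   # rank-order agreement
plcc, _  = pearsonr(predicted, mos)    # linear agreement
print(f"SROCC = {srocc:.3f}, PLCC = {plcc:.3f}")
```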
5.3. Baselines
The proposed method's performance is compared against two distinct baselines, representing different approaches to Image Aesthetics Assessment:
- LLaVa (Standalone): This baseline uses a Multimodal Large Language Model (LLaVa) directly for aesthetic evaluation. In this scenario, LLaVa is prompted to assess the aesthetic quality of an image and output a score or qualitative judgment. The paper highlights its typically lower recognition accuracy and inconsistent outputs when directly tasked with aesthetic judgment.
- CLIP-IQA (Standalone): This baseline represents the traditional CLIP-IQA approach, in which CLIP similarities with antonym prompts (e.g., "Good photo" vs. "Bad photo") are used to derive an aesthetic score. While effective at correlating with human perception, the paper characterizes it as a "black-box" model that does not inherently explain its aesthetic judgments.

These baselines allow the paper to demonstrate that its integrated framework not only achieves superior quantitative performance compared to standalone LLaVa or CLIP-IQA, but also overcomes the specific limitations of each: providing explainability where CLIP-IQA is opaque, and achieving higher accuracy where LLaVa falls short in direct aesthetic scoring.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the effectiveness of the proposed integrated framework in achieving both high accuracy and interpretability for Image Aesthetics Assessment (IAA).
6.1.1. Quantitative Evaluation of the Proposed Method
The paper first presents a direct comparison of its integrated method against the LLaVa and CLIP-IQA baselines in terms of explainability, SROCC, and PLCC.
The following are the results from Table II of the original paper:
| Method | Explainability | SROCC ↑ | PLCC ↑ |
| --- | --- | --- | --- |
| LLaVA | O | 0.195 | 0.171 |
| CLIP-IQA | × | 0.684 | 0.700 |
| Ours | O | 0.762 | 0.785 |
Analysis of Table II:
- LLaVa Baseline: Shows low SROCC (0.195) and PLCC (0.171). This confirms the paper's assertion that while LLaVa offers explainability (denoted 'O'), its direct application to quantitative aesthetic assessment yields poor accuracy, highlighting the challenge of using MLLMs directly for precise aesthetic scoring.
- CLIP-IQA Baseline: Achieves much higher correlation coefficients than LLaVa (SROCC 0.684, PLCC 0.700), demonstrating CLIP-IQA's effectiveness at quantitative aesthetic judgment, in line with previous research. However, it is marked '×' for explainability, confirming the black-box nature identified by the authors.
- Proposed Method ("Ours"): Outperforms both baselines. It achieves the highest SROCC (0.762) and PLCC (0.785), indicating superior accuracy in both rank-order and linear correlation with human aesthetic judgments. Crucially, it also retains explainability (marked 'O'), successfully combining high performance and interpretability. The improvement over CLIP-IQA is substantial, with SROCC increasing by approximately 11% (0.684 → 0.762) and PLCC by over 12% (0.700 → 0.785). This validates the core hypothesis that combining CLIP-IQA with SHAP and LLaVa yields a more effective and transparent IAA system.
6.1.2. Impact of Number of Prompts
The paper also investigates how the quantity of prompt pairs influences the accuracy, comparing a "Prompt learning model" (likely referring to the raw CLIP similarity scores aggregated in some way before LightGBM) and the "Machine learning model" (referring to the LightGBM regressor component of the proposed framework). ChatGPT was used to generate these prompt pairs.
The following are the results from Table IV of the original paper:
| No. of prompt pairs | Prompt learning model SROCC ↑ | Prompt learning model PLCC ↑ | Machine learning model SROCC ↑ | Machine learning model PLCC ↑ |
| --- | --- | --- | --- | --- |
| 1 | 0.873 | 0.892 | 0.684 | 0.700 |
| 10 | 0.879 | 0.897 | 0.768 | 0.790 |
| 20 | 0.880 | 0.898 | 0.840 | 0.867 |
| 30 | 0.889 | 0.902 | 0.849 | 0.874 |
Analysis of Table IV:
- General Trend: For both the "Prompt learning model" and the "Machine learning model," increasing the number of prompt pairs generally improves SROCC and PLCC, suggesting that a richer set of aesthetic descriptors lets the models capture more nuanced aspects of image quality.
- "Prompt learning model" (CLIP-IQA-based scores): This model shows very high initial correlations (SROCC 0.873 with 1 pair) and only slight improvements with more prompts (SROCC 0.889 with 30 pairs). This indicates that the raw CLIP feature space, when queried with relevant prompts, is already highly discriminative for aesthetics. The 1-pair row corresponds to using "good photo" and "bad photo", as indicated in the text.
- "Machine learning model" (LightGBM): This component shows a much larger accuracy gain with more prompts, from SROCC 0.684 with 1 prompt pair to 0.849 with 30 pairs, an improvement of approximately 24% (0.684 → 0.849). This substantial gain demonstrates that the regressor effectively exploits the additional, diverse aesthetic features extracted by CLIP with the expanded prompt sets.
- Insight: The disparity in initial performance (0.873 vs. 0.684 for 1 prompt pair) between the "Prompt learning model" and the "Machine learning model" on the same prompt set highlights the role of the LightGBM regressor. It processes the CLIP-IQA scores, potentially refining them into values more directly predictive of the final human MOS. The greater gain of the "Machine learning model" with more prompts suggests that LightGBM integrates and interprets the diverse signals from larger prompt sets better than the simpler aggregation in the "Prompt learning model," reinforcing the idea that increasing the number of decision-criteria axes (prompt pairs) allows better utilization of CLIP's internal knowledge.
6.2. Data Presentation (Tables)
The quantitative results are presented in tables within the paper. These tables are transcribed above within the analysis section (Table II and Table IV). Table III, showing examples of ChatGPT-generated prompts, is also included in the Methodology section.
6.3. Ablation Studies / Parameter Analysis
While the paper does not present a formal ablation study removing specific components of the proposed framework (e.g., removing SHAP or LLaVa), the comparison in Table II implicitly acts as an ablation against the baselines (LLaVa alone and CLIP-IQA alone). This demonstrates the combined value of the components.
The analysis in Table IV on the "Number of prompt pairs" serves as a parameter analysis. It shows how a key hyper-parameter (the richness of the prompt set) impacts the performance. The observation that more prompts lead to better accuracy, especially for the LightGBM component, is a crucial finding regarding the optimal input features for the regressor.
6.3.1. Visual Examples of Explanations
Figure 2 (referred to as images/3.jpg in the provided assets) provides concrete examples of the proposed method's output, showcasing SHAP values and their corresponding explanations.
The following figure (Figure 2 from the original paper) shows example results of the proposed method:
![Fig. 2. Example result of the proposed method. The Shapley value is denoted as f(X) and the average value of the predictions is represented as E\[f(X)\]. Features with positive contributions are shown in green and those with negative contributions in red.](/files/papers/6911d810b150195a0db749a3/images/3.jpg)
Analysis of Figure 2 (images/3.jpg):
The figure presents two example images with their aesthetic evaluations and explanations.
- Example 1 (White Bird by Water):
  - $E[f(X)]$: the average predicted aesthetic score (Mean Opinion Score, MOS) across the dataset.
  - $f(X)$: the final predicted score for this specific image.
  - Feature Contributions: The figure highlights features that push the $f(X)$ score above or below $E[f(X)]$.
    - color (positive contribution): the green bar indicates a positive impact of the color attribute, suggesting the image's coloration enhances its aesthetics.
    - shape (positive contribution): the shape attribute also contributes positively.
    - high contrast (negative contribution): the red bar indicates that high contrast slightly detracts from the aesthetic score in this case.
  - Linguistic Explanation: "The image is evaluated as a good photo, mainly because of its good color. The shape is fine. However, it seems like the contrast is not that good." This explanation directly translates the SHAP contributions into human-understandable language, validating the LLaVa component's role.
- Example 2 (Moon Photo):
  - $E[f(X)]$: the same dataset-average predicted MOS.
  - $f(X)$: this image receives a lower predicted score, indicating it is considered less aesthetically pleasing.
  - Feature Contributions:
    - brightness (negative contribution): a significant negative contribution from brightness indicates the image is perceived as too dark or poorly exposed, hurting its aesthetics.
    - high contrast (negative contribution): high contrast also has a negative impact.
    - colorful (positive contribution): despite being a moon photo, colorful makes a small positive contribution, which might reflect subtle hues or gradients within the image.
  - Linguistic Explanation: "The image is evaluated as a bad photo, mainly because of its low brightness. The contrast is not that good. However, the color is fine." This explanation again aligns with the SHAP breakdown, giving clear reasons for the lower score.

These examples illustrate how SHAP values identify the key aesthetic drivers and how LLaVa translates these insights into meaningful linguistic feedback, achieving the goal of explainable IAA. A sketch of producing a Figure-2-style per-image attribution plot follows.
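A hedged sketch of rendering a single image's attributions in a Figure-2-like form using the `shap` plotting API; the values and feature names are illustrative, and the exact visual style of the paper's figure is not reproduced.

```python
import numpy as np
import shap

# Illustrative single-image attribution: base value E[f(X)] plus per-feature SHAP values.
explanation = shap.Explanation(
    values=np.array([1.2, 0.6, -0.4]),      # SHAP contributions per prompt-pair feature
    base_values=3.1,                         # E[f(X)], the dataset-average prediction
    data=np.array([0.82, 0.74, 0.31]),       # the image's prompt-pair feature values
    feature_names=["color", "shape", "high contrast"],
)
shap.plots.waterfall(explanation)            # bar-style breakdown of f(X) - E[f(X)]
```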
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a novel and comprehensive framework for Explainable AI (XAI) in Image Aesthetics Assessment (IAA) by ingeniously integrating CLIP-IQA, SHAP, and MLLMs (LLaVa). The study successfully addresses significant limitations of prior methods: the black-box nature of CLIP-IQA and the low accuracy of standalone MLLM-based approaches for aesthetic evaluation. By leveraging CLIP-IQA for quantitative scoring, SHAP to pinpoint the contributions of individual aesthetic features (derived from antonym prompt pairs), and LLaVa to translate these insights into natural language explanations, the framework achieves both high accuracy (SROCC of 0.762 and PLCC of 0.785) and robust interpretability. Furthermore, the research demonstrates that utilizing a larger and more diverse set of LLM-generated prompt pairs significantly enhances the model's performance, particularly for the LightGBM regressor component, by better exploiting CLIP's internal knowledge. Ultimately, the proposed method provides not just an aesthetic score but also the underlying rationale, moving closer to human-like understanding and explanation of image appeal.
7.2. Limitations & Future Work
The paper implicitly highlights the limitations of existing methods that its framework aims to overcome:
- Black-box nature of CLIP-IQA: While CLIP-IQA excels in correlation with human judgment, it lacks transparency, making it difficult to understand why a particular aesthetic score was assigned. The proposed framework directly mitigates this with SHAP and LLaVa.
- Low accuracy of MLLM-based methods for direct aesthetic evaluation: MLLMs like LLaVa are powerful text generators, but when tasked directly with quantitative aesthetic scoring they tend to yield lower accuracy and inconsistent results. The paper addresses this by using LLaVa for explanation rather than direct scoring.

While the paper does not explicitly state future work directions in a dedicated section, several avenues can be inferred or suggested:

- Exploring advanced prompt engineering: The paper shows the benefit of more prompts. Future work could investigate how to generate optimal prompts (e.g., using active learning, human-in-the-loop validation, or more sophisticated LLM fine-tuning for prompt generation) to capture even more nuanced aesthetic attributes.
- Generalizability to diverse aesthetic domains: The KonIQ-10k dataset is general-purpose. Testing the framework on niche aesthetic domains (e.g., fashion photography, medical imaging aesthetics, specific artistic styles) could reveal its adaptability and any need for domain-specific prompt sets or fine-tuning.
- User studies on explanation quality: While LLaVa generates linguistic explanations, formal user studies could evaluate how useful, intuitive, and trustworthy these explanations are to human users, potentially leading to refinements of the explanation-generation process.
- Real-time applications: Investigating the computational efficiency of the framework for real-time IAA and explanation generation would be important for practical deployment.
- Incorporating user-specific aesthetics: Aesthetic judgment is highly subjective. Future work could explore personalizing the model to individual users' aesthetic preferences.
7.3. Personal Insights & Critique
This paper presents an elegant solution to a critical problem in AI: moving beyond mere prediction to meaningful explanation, especially in subjective domains like aesthetics. The core insight of combining CLIP-IQA's quantitative power with SHAP's interpretability and LLaVa's linguistic capabilities is very strong.
- Strength of Integration: The framework's modular design is a significant strength. By leveraging existing, powerful models for specific sub-tasks (CLIP for embeddings, LightGBM for regression, SHAP for explanation, LLaVa for language generation), the authors create a synergistic system that outperforms the individual components. This approach is practical and generalizable, as different VLMs, regressors, or MLLMs could be swapped in as they evolve.
- Importance of Explainability: The emphasis on explainability is crucial. For subjective tasks like IAA, knowing why a machine considers an image aesthetically pleasing (or not) builds trust and provides actionable feedback for creators. Linking SHAP values directly to human-understandable prompt pairs is particularly effective at making this transparent.
- Role of Prompt Engineering: The demonstration that increasing the number of prompt pairs significantly boosts performance highlights the continuing importance of prompt engineering in VLM and LLM applications, suggesting that the quality and diversity of semantic inputs are as vital as the model architecture itself.
Potential Issues or Areas for Improvement:
- Aggregation of CLIP-IQA Scores: The paper mentions that similarities are "normalized via the softmax function for specific prompt pairs and then aggregated into a single value, represented as a CLIP-IQA score." While the overall framework is clear, the exact mathematical formula or process for this aggregation step, especially when multiple prompt pairs are used, is not explicitly detailed. A clearer exposition of how the individual pair-wise scores are combined into the features fed to LightGBM would improve reproducibility and understanding.
- Computational Cost: Running multiple CLIP encodings for numerous prompt pairs, training a LightGBM model, calculating SHAP values, and then querying an MLLM can be computationally intensive. While not explicitly discussed, the efficiency of this pipeline, especially for real-time applications or very large datasets, could be a consideration.
- Subjectivity of "Good" Prompts: Although ChatGPT is used to generate prompts, the quality and representativeness of these prompts for aesthetic attributes remain tied to the biases and capabilities of the LLM used for generation. Further validation of prompt sets across diverse aesthetic dimensions could be explored.
- Beyond Correlation: While SROCC and PLCC are standard metrics, they do not fully capture the nuances of aesthetic judgment. Other evaluation methods, perhaps involving direct human feedback on the generated explanations, could provide deeper insight into the quality of the explainability.

Overall, this paper makes a significant contribution by bridging the gap between high-performance VLM-based aesthetic assessment and the critical need for transparent, human-understandable explanations. Its methodology is transferable and inspirational for developing more trustworthy AI systems in various domains.