
Explainable AI for Image Aesthetic Evaluation Using Vision-Language Models

Published: 2025-02-03
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study enhances image aesthetic evaluation using vision-language models by proposing an interpretable method. It explores feature importance through SHAP analysis and predicts quality scores with LightGBM, demonstrating high correlation with human judgment and advancing objective, explainable aesthetic assessment.

Abstract

The abstract was not included in the first-page extract of the PDF; the full abstract is reproduced in Section 1.5 below.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Explainable AI for Image Aesthetic Evaluation Using Vision-Language Models

1.2. Authors

  • Supatta Viriyavisuthisakul (Dept. of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi, Thailand)
  • Shun Yoshida (Dept. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
  • Kaede Shiohara (Dept. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
  • Ling Xiao (Dept. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)
  • Toshihiko Yamasaki (Dept. of Information and Communication Engineering, The University of Tokyo, Tokyo, Japan)

1.3. Journal/Conference

The paper does not explicitly state the journal or conference in which it was published. The metadata indicates "Published at (UTC): 2025-02-03T00:00:00.000Z", so it is a recent publication, most likely a conference paper or technical report given its structure.

1.4. Publication Year

2025

1.5. Abstract

Evaluating image aesthetics has traditionally been a subjective task relying on human experts. Vision-Language models (VLMs), such as Contrastive Language-Image Pre-Training (CLIP), offer a novel approach to assessing visual features and descriptions, leading to more interpretable aesthetic evaluations. Recently, CLIP-IQA (CLIP-based image quality assessment) emerged as a method to quantify image quality and abstract perception using an antonym prompt pairing strategy. While CLIP-IQA achieves high correlation with human aesthetic judgment, the relevance of its features to human perception remains a question. This study investigates the significance of image features derived from various paired prompts. Each prompt pair is encoded into feature vectors using a text encoder, and images are similarly encoded using an image encoder. Light Gradient Boosting Machine (LightGBM) is employed as a regressor to predict quality scores. After training, SHapley Additive exPlanations (SHAP) values are computed for each feature to evaluate the contribution of individual prompt elements. Furthermore, a multimodal large language model (MLLM), LLaVa, is applied to generate linguistic explanations of images. The proposed method yields Spearman's rank correlation coefficient (SROCC) and Pearson linear correlation coefficient (PLCC) scores of 0.762 and 0.785, respectively. The research also explores advanced prompting strategies to gain deeper insights into the IQA scoring mechanism.


2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the inherent subjectivity and the traditional reliance on human expertise for Image Aesthetics Assessment (IAA). While Vision-Language models (VLMs) like CLIP have opened new avenues for evaluating images based on both visual features and textual descriptions, enabling potentially more interpretable assessments, existing methods face significant limitations. Specifically, CLIP-IQA (CLIP-based Image Quality Assessment), despite correlating well with human judgment, often operates as a "black-box" model, lacking transparency regarding which features contribute to its aesthetic predictions. Conversely, Multimodal Large Language Models (MLLMs) like LLaVa, while capable of generating linguistic explanations, tend to exhibit low recognition accuracy and inconsistent outputs when applied directly to aesthetic evaluation tasks. The challenge, therefore, lies in developing a framework that can achieve both high accuracy in aesthetic prediction and provide meaningful, interpretable explanations for those predictions, bridging the gap between quantitative scores and human-understandable reasoning.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  • Novel Integrated Framework: It proposes a new framework that combines CLIP-IQA for quantitative aesthetic scoring, SHAP values for explaining the feature importance, and LLaVa for generating natural language explanations. This integration aims to enhance both the interpretability and reliability of IAA predictions.
  • Improved Explainability and Accuracy: The framework effectively addresses the limitations of existing methods. It moves beyond the black-box nature of CLIP-IQA by leveraging SHAP to reveal the contribution of specific aesthetic elements (derived from prompt pairs) to the overall score. It also overcomes the low accuracy typically seen in MLLM-based approaches for IAA.
  • Quantitative Performance: The proposed method demonstrates significant performance improvements, achieving SROCC and PLCC scores of 0.762 and 0.785, respectively. These results are notably higher than those obtained by LLaVa or CLIP-IQA when used in isolation.
  • Investigation of Prompt Strategies: The study explores the impact of using a larger number of pairwise concepts (prompts) generated by ChatGPT. It finds that increasing the number of prompts (e.g., from 1 to 30 pairs) substantially improves accuracy, particularly for the machine learning model component, suggesting that richer semantic input enhances the model's ability to leverage CLIP's internal knowledge.
  • Linguistic Explanations: By feeding SHAP values and original images to LLaVa, the framework can provide natural language explanations for the aesthetic scores, offering human-understandable reasoning behind the model's judgments.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with several core concepts in artificial intelligence, computer vision, and natural language processing.

  • Image Aesthetics Assessment (IAA): This is the task of automatically predicting the aesthetic quality or attractiveness of an image, similar to how a human might judge a photograph as "good" or "bad." It goes beyond simple image quality (e.g., blurriness) to encompass subjective elements like composition, lighting, color harmony, and overall appeal.
  • Vision-Language Models (VLMs): These are AI models designed to understand and process information from both visual (images, videos) and textual (natural language) modalities. They learn to associate visual concepts with their linguistic descriptions, enabling tasks like image captioning, visual question answering, and, in this paper's context, aesthetic evaluation by interpreting image features in relation to descriptive text.
  • Contrastive Language-Image Pre-Training (CLIP): CLIP is a specific type of VLM developed by OpenAI. It learns to associate images with text by training on a massive dataset of image-text pairs. The core idea is contrastive learning: it learns to predict which text captions go with which images from a batch, effectively pulling similar image-text pairs closer together in an embedding space while pushing dissimilar pairs apart.
    • Zero-shot learning: A key capability of CLIP. Once trained, CLIP can perform tasks (like image classification or aesthetic evaluation) on new datasets without any further training specific to that task. This is achieved by comparing image embeddings to text embeddings of descriptions for different classes or attributes (e.g., "a photo of a cat" vs. "a photo of a dog"). The image is then assigned to the class whose text embedding is most similar.
  • CLIP-IQA (CLIP-based Image Quality Assessment): This is an application of CLIP specifically tailored for image quality and aesthetic assessment. It typically involves using antonym prompt pairs (e.g., "Good photo" vs. "Bad photo") and calculating the CLIP similarity scores between an image and these prompts to derive an aesthetic score. The image's affinity towards "Good photo" relative to "Bad photo" can indicate its perceived quality.
  • SHapley Additive exPlanations (SHAP): SHAP is an Explainable AI (XAI) technique that provides model-agnostic explanations for the output of any machine learning model. It is based on the concept of Shapley values from cooperative game theory.
    • Shapley Value: In game theory, the Shapley value fairly distributes the total gain among players in a coalition, based on their marginal contributions to all possible sub-coalitions. In XAI, this translates to assigning an importance value to each feature in a prediction, indicating how much that feature contributed to the specific output, taking into account all possible combinations of features. SHAP values are known for being consistent and locally accurate.
  • Multimodal Large Language Models (MLLMs): These models extend the capabilities of traditional Large Language Models (LLMs) by enabling them to process and understand multiple types of data (modalities), such as text, images, and sometimes audio. They integrate a vision encoder with an LLM to generate text-based responses that are visually grounded.
  • LLaVa (Large Language and Vision Assistant): LLaVa is a specific MLLM that combines a vision encoder (like CLIP's image encoder or a ViT) with an LLM (like Vicuna). It's designed to perform visual instruction tuning, allowing it to follow instructions and answer questions about images, thus generating detailed descriptions and reasoning based on visual content.

3.2. Previous Works

The paper contextualizes its approach by referencing several key prior works in Image Aesthetics Assessment (IAA) and related AI techniques:

  • Early IAA Models (Hand-crafted Features): Initially, IAA models relied on carefully designed, hand-crafted features that often emulated photographic principles. For example, color harmony [1], composition rules [2, 3], and even human aesthetic perception [4, 7] were encoded into features. These methods were foundational but limited by their reliance on human-defined rules.
  • Deep Learning-based IAA: With advancements in deep learning, IAA models began to integrate deep neural networks. These models could automatically learn relevant features from data, often combining global and local features to capture both overall image structure and fine-grained details [5, 6]. Deep learning significantly improved efficiency and performance over hand-crafted approaches.
  • CLIP in IAA: The introduction of CLIP [8] marked a significant shift. Its ability to learn transferable visual models from natural language supervision enabled zero-shot IAA. CLIP's image encoder proved effective at extracting relevant features like lighting, composition, and beauty attributes, offering a more robust foundation for IAA networks compared to models pre-trained solely on ImageNet. It could capture both content and style.
  • CLIP-IQA: Building on CLIP, CLIP-IQA [9] was developed to quantify image quality and abstract perception. It employed an antonym prompt pairing strategy (e.g., "Good photo" vs. "Bad photo") to assess images. CLIP-IQA demonstrated strong correlations with human assessments across various image quality datasets, distinguishing visual attributes (e.g., brightness vs. contrast) and abstract perceptions (e.g., happy vs. sad). However, the paper highlights its limitation as a black-box model regarding feature importance.
  • Shapley Value and SHAP: The concept of Shapley value originates from cooperative game theory [10], where it is used to fairly distribute payouts among players based on their contributions. SHAP [10] adapted this concept for Explainable AI (XAI), providing a unified approach to interpreting model predictions by quantifying each feature's contribution. SHAP values are model-agnostic and offer consistent insights into why a model makes specific predictions.
  • LLaVa: LLaVa [11] represents a major step in multimodal AI. It combines a vision encoder with an LLM to process visual inputs and generate text-based responses. LLaVa can provide detailed image descriptions, answer questions based on visual context, and perform image-based reasoning. It has shown promising performance in aesthetic perception, sometimes approaching human-level assessments [12]. However, the paper notes its potential for low recognition accuracy and inconsistency in pure IAA tasks.

3.3. Technological Evolution

The evolution of Image Aesthetics Assessment has progressed from rule-based systems to data-driven deep learning, and now to multi-modal Vision-Language Models.

  1. Hand-crafted Features (Early 2010s): Initial IAA systems relied on human experts defining photographic rules (e.g., rule of thirds, color harmony). These systems were limited by their explicit knowledge base and lacked generalization.

  2. Deep Learning (Mid-2010s onwards): Convolutional Neural Networks (CNNs) revolutionized IAA by learning features directly from large datasets of aesthetically-rated images. This significantly improved performance and reduced the need for manual feature engineering.

  3. Vision-Language Models (Early 2020s onwards): CLIP introduced the ability to leverage natural language for understanding visual concepts. This allowed for zero-shot IAA and more flexible, prompt-driven evaluation, moving beyond simple classification to semantic understanding. CLIP-IQA specifically applied this to aesthetic assessment.

  4. Multimodal Large Language Models (Recent): LLaVa and similar MLLMs further integrate vision and language, enabling conversational AI that can reason about images and generate explanatory text. This represents a step towards human-like understanding and communication about visual content.

    This paper's work fits within the latest stage of this evolution, specifically focusing on enhancing the interpretability of VLM-based IAA by combining the strengths of quantitative CLIP-IQA with SHAP for feature importance and MLLMs for linguistic explanations, thereby addressing the "black-box" nature of previous VLM applications in IAA.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach offers several core innovations and differentiations:

  • Addressing the Interpretability Gap of CLIP-IQA: While CLIP-IQA (e.g., [9]) provides good quantitative aesthetic scores, it largely remains a black-box model. The proposed framework directly addresses this by integrating SHAP values, allowing it to explain which specific features (derived from prompt pairs like "Bright photo" vs. "Dark photo") contribute positively or negatively to an image's aesthetic score. This provides a level of transparency CLIP-IQA alone lacks.

  • Improving Accuracy and Consistency of MLLM-based IAA: Direct application of MLLMs like LLaVa for IAA (as hinted by [12]) often suffers from lower recognition accuracy and inconsistent outputs for aesthetic judgments. This paper's method leverages LLaVa not for direct scoring, but for generating linguistic explanations based on CLIP-IQA scores and SHAP values. This uses LLaVa's strength in language generation while relying on CLIP-IQA and LightGBM for robust scoring, thus achieving higher overall accuracy (as shown in Table II).

  • Synthesizing Quantitative and Qualitative Explanations: The framework is unique in its ability to provide both numerical evaluation (via CLIP-IQA scores) and feature importance (via SHAP values), culminating in human-readable linguistic explanations (via LLaVa). This comprehensive approach offers a richer understanding of image aesthetics than any single component could provide alone.

  • Advanced Prompt Engineering: The paper investigates the impact of a larger number of diverse prompt pairs generated by ChatGPT, demonstrating that this prompt engineering strategy significantly boosts accuracy. This goes beyond the basic antonym pairs typically used in CLIP-IQA and explores a more nuanced, multi-faceted approach to feature extraction.

    In essence, the paper differentiates itself by offering a unified framework that synergistically combines the strengths of CLIP-IQA (quantitative scoring), SHAP (XAI), and LLaVa (linguistic explanation) to overcome the individual limitations of each, resulting in both accurate and explainable Image Aesthetics Assessment.

4. Methodology

4.1. Principles

The core principle of the proposed method is to create an Explainable AI (XAI) framework for Image Aesthetics Assessment (IAA) by integrating the quantitative scoring capability of CLIP-IQA with the interpretability provided by SHAP values, and the natural language explanation generation of a Multimodal Large Language Model (MLLM) like LLaVa. The underlying intuition is that while Vision-Language Models (VLMs) can assess aesthetics, they often lack transparency. By breaking down the prediction into contributions from specific aesthetic attributes (derived from prompt pairs) using SHAP and then translating these contributions into human-understandable text, the framework aims to provide both accurate scores and clear reasoning.

4.2. Core Methodology In-depth (Layer by Layer)

The proposed approach integrates several components as depicted in the overall framework (Figure 1 in the paper, represented as images/2.jpg in the provided assets). Here's a step-by-step breakdown:

The following figure (Figure 1 from the original paper, referred to as images/2.jpg) illustrates the overall framework of the proposed method:

This figure is a schematic of the proposed explainable AI method for image aesthetic evaluation. The input image is processed by the image encoder and paired with prompts generated by the text encoder. The system computes a CLIP-IQA score and SHAP scores, providing a quantitative assessment of attributes such as shape, color, and high contrast. In the example shown, the SHAP score for shape is +14.51, indicating the image is relatively clear.

4.2.1. Prompt Generation

The process begins with defining antonym prompt pairs that represent various aesthetic attributes. Initially, the paper uses basic prompts such as those listed in Table I. For advanced strategies, a Large Language Model (LLM) like ChatGPT is used to generate a larger set of diverse prompt pairs, as exemplified in Table III.

The following are the results from Table I of the original paper:

Positive prompt | Negative prompt
Bright photo | Dark photo
Clean photo | Noisy photo
Colorful photo | Dull photo
Sharp photo | Blurry photo
High contrast photo | Low contrast photo

The following are the results from Table III of the original paper:

Positive prompt | Negative prompt
Clear | Blurry
Sharp | Dull
Crisp | Fuzzy
Vibrant | Dull
Detailed | Pixelated
High-resolution | Low-resolution
Colorful | Monochrome
Well-exposed | Overexposed
Balanced | Imbalanced
High contrast | Low contrast

4.2.2. Feature Encoding using CLIP

An input image and each generated prompt (both positive and negative terms of a pair) are processed through CLIP's respective encoders.

  • Text Encoder: Each prompt text (e.g., "Bright photo", "Dark photo") is encoded into a high-dimensional feature vector (embedding) using CLIP's text encoder. Let $T_p$ denote the embedding of a positive prompt and $T_n$ that of its antonym (negative) prompt.
  • Image Encoder: The input image is encoded into a high-dimensional feature vector (embedding) using CLIP's image encoder. Let $I$ denote the image embedding.

4.2.3. CLIP-IQA Score Calculation

For each prompt pair, the cosine similarity between the encoded image feature vector and each of the prompt's feature vectors is calculated. Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space, indicating how similar their directions are. It ranges from -1 (opposite) to 1 (identical).

For a given image embedding $I$ and a prompt embedding $T$, the cosine similarity is calculated as: $ \mathrm{Similarity}(I, T) = \frac{I \cdot T}{\|I\|\,\|T\|} $ where $I \cdot T$ is the dot product of the vectors, and $\|I\|$ and $\|T\|$ are their L2-norms (magnitudes).

This yields similarities $S_p = \mathrm{Similarity}(I, T_p)$ for the positive prompt and $S_n = \mathrm{Similarity}(I, T_n)$ for the negative prompt.

These similarity scores for a specific prompt pair are then normalized via the softmax function. The softmax function converts a vector of raw scores (logits) into a vector of probabilities that sum to 1; in this context, it indicates the relative "strength" of association with the positive versus the negative prompt. $ P_p = \frac{e^{S_p}}{e^{S_p} + e^{S_n}} \quad \text{and} \quad P_n = \frac{e^{S_n}}{e^{S_p} + e^{S_n}} $ where $P_p$ is the "probability" of the image aligning with the positive prompt, and $P_n$ with the negative prompt.
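
As a quick numerical illustration (the values below are invented for illustration and are not taken from the paper): if the similarities were $S_p = 0.32$ and $S_n = 0.20$, the softmax would give

$ P_p = \frac{e^{0.32}}{e^{0.32} + e^{0.20}} \approx 0.530, \qquad P_n = 1 - P_p \approx 0.470, $

so the image would lean slightly toward the positive prompt.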

The paper states these are "aggregated into a single value, represented as a CLIP-IQA score." While the exact aggregation formula for a single CLIP-IQA score from multiple prompt pairs isn't explicitly given, a common approach in CLIP-IQA is to take the difference or ratio of positive vs. negative probabilities, or to use the positive probability directly. For a single prompt pair, a simplified CLIP-IQA score for that pair could be: $ \text{CLIP-IQA}_{\text{pair}} = P_p - P_n $ or simply $P_p$. If multiple prompt pairs are used, these individual scores are then concatenated or averaged to form a feature vector representing the image's aesthetic attributes across all prompt pairs. This feature vector, comprising one score per prompt pair, captures the aesthetic attributes from both global and fine-grained perspectives.
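
To make the per-pair scoring concrete, the following is a minimal Python sketch of the procedure described above. It assumes the open-source `clip` package and the Table I prompt pairs; the paper does not specify its implementation, so the model choice (`ViT-B/32`), the function name `clip_iqa_features`, and the decision to keep only $P_p$ per pair are illustrative assumptions.

```python
import torch
import clip  # OpenAI's CLIP package (an assumed implementation choice; the paper does not name one)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Antonym prompt pairs from Table I (positive term first).
prompt_pairs = [
    ("Bright photo", "Dark photo"),
    ("Clean photo", "Noisy photo"),
    ("Colorful photo", "Dull photo"),
    ("Sharp photo", "Blurry photo"),
    ("High contrast photo", "Low contrast photo"),
]

def clip_iqa_features(image_path):
    """Return one score per prompt pair: the softmax probability P_p of the positive term."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    scores = []
    with torch.no_grad():
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        for pos, neg in prompt_pairs:
            txt_emb = model.encode_text(clip.tokenize([pos, neg]).to(device))
            txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
            sims = (img_emb @ txt_emb.T).squeeze(0)   # cosine similarities [S_p, S_n]
            probs = sims.softmax(dim=-1)              # softmax over the pair [P_p, P_n]
            scores.append(probs[0].item())            # keep P_p as this pair's feature
    return scores
```

Each image then yields one feature vector with an entry per prompt pair; these vectors are what the regressor in the next step consumes.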

4.2.4. Regressor Training with LightGBM

The CLIP-IQA scores (as a feature vector derived from all prompt pairs for each image) are used as input to train a regressor. The paper specifies Light Gradient Boosting Machine (LightGBM) as the chosen regressor.

  • Dataset: The KonIQ-10k dataset, which contains images with human-assigned quality scores (Mean Opinion Scores, MOS), is used for training.
  • Training Objective: LightGBM is trained to predict the MOS of images based on their CLIP-IQA feature vectors. It learns a mapping from the prompt-derived aesthetic features to a human-perceived quality score.
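
A minimal training sketch under the assumptions above; the variable names `features` (per-pair CLIP-IQA features, one row per image) and `mos_scores` (KonIQ-10k MOS labels) are hypothetical, and the hyper-parameters and data split are illustrative, since the paper does not report its LightGBM settings.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = np.asarray(features)      # shape (n_images, n_prompt_pairs): CLIP-IQA features per image
y = np.asarray(mos_scores)    # shape (n_images,): human MOS values from KonIQ-10k

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyper-parameters are placeholders, not the paper's settings.
regressor = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
regressor.fit(X_train, y_train)
pred = regressor.predict(X_test)   # predicted aesthetic scores on held-out images
```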

4.2.5. SHAP Value Calculation for Interpretability

After the LightGBM regressor is trained, SHapley Additive exPlanations (SHAP) values are computed for each input feature. Each "feature" here corresponds to the CLIP-IQA score derived from one of the antonym prompt pairs (e.g., the "Bright photo" vs. "Dark photo" score is one feature).

  • Purpose: SHAP values quantify the contribution of each individual prompt-derived feature (aesthetic element) to the LightGBM's prediction of the overall aesthetic score for a given image.

  • Mechanism: SHAP values are calculated by considering all possible combinations (coalitions) of features and observing how the prediction changes when a particular feature is included or excluded. The average marginal contribution of a feature across all possible coalitions is its SHAP value.

  • Interpretation: A positive SHAP value for a feature indicates that this feature pushed the prediction higher than the baseline average prediction, while a negative value pushed it lower. The magnitude of the SHAP value indicates the strength of this contribution.

    The SHAP value $\phi_j(f, \mathbf{x})$ of a feature $j$, for a model $f$ and input $\mathbf{x}$, is given by: $ \phi_j(f, \mathbf{x}) = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f_S(\mathbf{x}_S \cup \{x_j\}) - f_S(\mathbf{x}_S) \right] $ where:

  • $N$ is the set of all features.

  • $S$ is a subset of features that does not include feature $j$.

  • $|N|$ is the total number of features.

  • $|S|$ is the number of features in subset $S$.

  • $f_S(\mathbf{x}_S)$ is the prediction of the model $f$ using only the features in set $S$ with their corresponding values $\mathbf{x}_S$.

  • $f_S(\mathbf{x}_S \cup \{x_j\})$ is the prediction using the features in $S$ plus feature $j$. The term $\frac{|S|!\,(|N| - |S| - 1)!}{|N|!}$ is the weight of each feature ordering (permutation).
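
In practice, exact SHAP values for tree ensembles such as LightGBM can be obtained with the `shap` library's TreeExplainer. The sketch below continues from the hypothetical `regressor`, `X_test`, and `prompt_pairs` introduced earlier; it is an illustration, not the paper's code.

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(regressor)            # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X_test)          # shape (n_images, n_prompt_pairs)
baseline = np.ravel(explainer.expected_value)[0]     # E[f(X)], the average model prediction

feature_names = [f"{pos} / {neg}" for pos, neg in prompt_pairs]

# Per-image breakdown: which prompt-pair feature pushed the predicted MOS above or below E[f(X)].
for name, value in zip(feature_names, shap_values[0]):
    print(f"{name}: {value:+.3f}")
print(f"E[f(X)] = {baseline:.3f}, f(X) = {regressor.predict(X_test[:1])[0]:.3f}")
```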

4.2.6. Linguistic Explanations with MLLM (LLaVa)

Finally, to provide human-understandable explanations, the calculated SHAP values along with the original input images are fed into the Multimodal Large Language Model, LLaVa.

  • Input: The visual content of the image and the quantitative SHAP values (which indicate the importance of different aesthetic attributes).

  • Output: LLaVa processes this combined information to generate natural language text that explains why the image received its predicted aesthetic score, highlighting the features that most contributed to or detracted from its quality, as identified by the SHAP values. This bridges the gap between numerical assessment and human reasoning.

    This integrated approach ensures that the model not only predicts aesthetic scores accurately but also provides actionable and interpretable insights into the underlying reasons for those scores.
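
The paper does not disclose the exact instruction given to LLaVa. The purely illustrative helper below shows one way the SHAP breakdown could be serialized into text to accompany the image; the function name and prompt wording are hypothetical.

```python
def build_explanation_prompt(feature_names, shap_row, predicted_score):
    """Serialize SHAP contributions into a textual instruction for an MLLM (hypothetical wording)."""
    lines = [
        f"This image received a predicted aesthetic score of {predicted_score:.1f}.",
        "Each aesthetic attribute contributed to that score as follows:",
    ]
    # Sort attributes by absolute contribution so the most influential ones come first.
    for name, value in sorted(zip(feature_names, shap_row), key=lambda t: -abs(t[1])):
        lines.append(f"- {name}: {value:+.2f}")
    lines.append("In plain language, explain why the image received this score, "
                 "referring to the attributes above.")
    return "\n".join(lines)

# The resulting text, together with the original image, would then be passed to LLaVa
# through whatever multimodal interface is used (not shown here).
```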

5. Experimental Setup

5.1. Datasets

The primary dataset used in this study for training the regressor (LightGBM) is:

  • KonIQ-10k [13]: This is an ecologically valid database specifically designed for deep learning-based blind Image Quality Assessment (IQA). It consists of 10,000 images, each associated with Mean Opinion Scores (MOS) collected from human evaluators. The MOS scores represent the subjective quality or aesthetic ratings of the images.
    • Characteristics: KonIQ-10k is known for its diversity in content and distortion types, making it suitable for training models that generalize well to real-world image quality perception. The human-assigned MOS values provide a ground truth for aesthetic quality, which is crucial for training and evaluating IAA models.
    • Purpose: This dataset allows the LightGBM model to learn the relationship between the CLIP-IQA features (derived from prompt pairs) and human aesthetic judgment.

5.2. Evaluation Metrics

The paper uses two standard correlation coefficients to evaluate the performance of the aesthetic evaluation models, specifically measuring the agreement between the model's predicted scores and human-assigned ground truth scores (MOS). For all metrics, higher values indicate better performance (closer correlation).

  1. Spearman's Rank-Order Correlation Coefficient (SROCC):

    • Conceptual Definition: SROCC assesses the monotonic relationship between two ranked variables. It measures how well the relationship between two variables can be described using a monotonic function. If the predicted aesthetic scores consistently rank images in the same order as human MOS scores, SROCC will be high, regardless of whether the absolute values match. It is robust to non-linear relationships and outliers in the score values themselves, focusing purely on rank agreement.
    • Mathematical Formula: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
    • Symbol Explanation:
      • $\rho$: Spearman's rank correlation coefficient.
      • $d_i$: The difference between the ranks of the $i$-th observation for the two variables (e.g., predicted score rank and MOS rank).
      • $n$: The number of observations (images).
  2. Pearson's Linear Correlation Coefficient (PLCC):

    • Conceptual Definition: PLCC measures the linear relationship between two variables. It indicates the strength and direction of a linear association between the predicted aesthetic scores and the human MOS scores. A high PLCC implies that the predicted scores not only rank images correctly but also maintain a consistent linear scaling with the human scores. It is sensitive to both the order and the magnitude of the scores.
    • Mathematical Formula: $ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} $
    • Symbol Explanation:
      • $r$: Pearson's linear correlation coefficient.
      • $n$: The number of observations (images).
      • $x$: The individual values of the first variable (e.g., the model's predicted aesthetic score).
      • $y$: The individual values of the second variable (e.g., the human MOS score).
      • $\sum xy$: The sum of the products of corresponding $x$ and $y$ values.
      • $\sum x$: The sum of all $x$ values.
      • $\sum y$: The sum of all $y$ values.
      • $\sum x^2$: The sum of the squares of all $x$ values.
      • $\sum y^2$: The sum of the squares of all $y$ values.
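
Both metrics are available in SciPy. A minimal sketch, assuming `pred` and `y_test` are the predicted scores and ground-truth MOS from the earlier training sketch:

```python
from scipy.stats import spearmanr, pearsonr

srocc, _ = spearmanr(pred, y_test)   # rank-order agreement
plcc, _ = pearsonr(pred, y_test)     # linear agreement
print(f"SROCC = {srocc:.3f}, PLCC = {plcc:.3f}")
```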

5.3. Baselines

The proposed method's performance is compared against two distinct baselines, representing different approaches to Image Aesthetics Assessment:

  1. LLaVa (Standalone): This baseline represents directly using a Multimodal Large Language Model (LLaVa) for aesthetic evaluation. In this scenario, LLaVa would likely be prompted to directly assess the aesthetic quality of an image and perhaps output a score or a qualitative judgment. The paper highlights its limitation of typically lower recognition accuracy and inconsistent outputs when directly tasked with aesthetic judgment.

  2. CLIP-IQA (Standalone): This baseline represents the traditional CLIP-IQA approach, where CLIP similarities with antonym prompts (e.g., "Good photo" vs. "Bad photo") are used to derive an aesthetic score. While effective in achieving good correlations with human perception, the paper characterizes this as a "black-box" model, meaning it does not inherently provide explanations for its aesthetic judgments.

    These baselines allow the paper to demonstrate that its integrated framework not only achieves superior quantitative performance (compared to standalone LLaVa or CLIP-IQA) but also overcomes the specific limitations of each: providing explainability where CLIP-IQA is opaque, and achieving higher accuracy where LLaVa might fall short in direct aesthetic scoring.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the effectiveness of the proposed integrated framework in achieving both high accuracy and interpretability for Image Aesthetics Assessment (IAA).

6.1.1. Quantitative Evaluation of the Proposed Method

The paper first presents a direct comparison of its integrated method against the LLaVa and CLIP-IQA baselines in terms of explainability, SROCC, and PLCC.

The following are the results from Table II of the original paper:

Method | Explainability | SROCC ↑ | PLCC ↑
LLaVa | O | 0.195 | 0.171
CLIP-IQA | × | 0.684 | 0.700
Ours | O | 0.762 | 0.785

Analysis of Table II:

  • LLaVa Baseline: Shows low SROCC (0.195) and PLCC (0.171). This confirms the paper's assertion that while LLaVa offers explainability (denoted by 'O'), its direct application for quantitative aesthetic assessment yields poor accuracy. This highlights the challenge of using MLLMs directly for precise aesthetic scoring.
  • CLIP-IQA Baseline: Achieves significantly higher correlation coefficients than LLaVa (SROCC 0.684, PLCC 0.700). This demonstrates CLIP-IQA's effectiveness in quantitative aesthetic judgment, aligning with previous research. However, it is marked with '×' for Explainability, confirming its black-box nature as identified by the authors.
  • Proposed Method ("Ours"): Outperforms both baselines. It achieves the highest SROCC (0.762) and PLCC (0.785), indicating superior accuracy in both rank-order and linear correlation with human aesthetic judgments. Crucially, it also maintains Explainability (marked 'O'), successfully combining high performance with interpretability. The improvement over CLIP-IQA is substantial, with SROCC increasing by approximately 11% ($ (0.762 - 0.684)/0.684 \approx 0.114 $) and PLCC by over 12% ($ (0.785 - 0.700)/0.700 \approx 0.121 $). This validates the core hypothesis that combining CLIP-IQA with SHAP and LLaVa can yield a more effective and transparent IAA system.

6.1.2. Impact of Number of Prompts

The paper also investigates how the quantity of prompt pairs influences the accuracy, comparing a "Prompt learning model" (likely referring to the raw CLIP similarity scores aggregated in some way before LightGBM) and the "Machine learning model" (referring to the LightGBM regressor component of the proposed framework). ChatGPT was used to generate these prompt pairs.

The following are the results from Table IV of the original paper:

No. of prompt pairs | Prompt learning model (SROCC↑ / PLCC↑) | Machine learning model (SROCC↑ / PLCC↑)
1 | 0.873 / 0.892 | 0.684 / 0.700
10 | 0.879 / 0.897 | 0.768 / 0.790
20 | 0.880 / 0.898 | 0.840 / 0.867
30 | 0.889 / 0.902 | 0.849 / 0.874

Analysis of Table IV:

  • General Trend: For both "Prompt learning model" and "Machine learning model," increasing the number of prompt pairs generally leads to improved SROCC and PLCC scores. This suggests that a richer set of aesthetic descriptors allows the models to capture more nuanced aspects of image quality.
  • "Prompt learning model" (CLIP-IQA based scores): This model shows very high initial correlations (0.873 SROCC for 1 pair) and slight improvements with more prompts (0.889 SROCC for 30 pairs). This indicates that the raw CLIP feature space, when queried with relevant prompts, is already highly discriminative for aesthetics. The 1-prompt pair row corresponds to using "good photo" and "bad photo" as indicated in the text.
  • "Machine learning model" (LightGBM): This component shows a more significant accuracy gain with an increased number of prompts. Starting from 0.684 SROCC with 1 prompt pair, it reaches 0.849 SROCC with 30 prompt pairs. This represents an improvement of approximately 24% ((0.8490.684)/0.6840.241 (0.849 - 0.684) / 0.684 \approx 0.241 ). This substantial gain for the LightGBM model demonstrates that the regressor effectively leverages the additional, diverse aesthetic features extracted by CLIP using the expanded prompt sets.
  • Insight: The disparity in initial performance (e.g., 0.873 vs. 0.684 for 1 prompt pair) between the "Prompt learning model" and "Machine learning model" for the same initial prompt set highlights the role of the LightGBM regressor. It processes the CLIP-IQA scores, potentially refining them and making them more directly predictive of the final human MOS. The greater gain in the "Machine learning model" with more prompts suggests that LightGBM is better able to integrate and interpret the diverse signals provided by the larger prompt sets compared to a simpler aggregation within the "Prompt learning model." This reinforces the idea that increasing the number of decision criteria axes (prompt pairs) allows for better utilization of CLIP's internal knowledge.

6.2. Data Presentation (Tables)

The quantitative results are presented in tables within the paper. These tables are transcribed above within the analysis section (Table II and Table IV). Table III, showing examples of ChatGPT-generated prompts, is also included in the Methodology section.

6.3. Ablation Studies / Parameter Analysis

While the paper does not present a formal ablation study removing specific components of the proposed framework (e.g., removing SHAP or LLaVa), the comparison in Table II implicitly acts as an ablation against the baselines (LLaVa alone and CLIP-IQA alone). This demonstrates the combined value of the components.

The analysis in Table IV on the "Number of prompt pairs" serves as a parameter analysis. It shows how a key hyper-parameter (the richness of the prompt set) impacts the performance. The observation that more prompts lead to better accuracy, especially for the LightGBM component, is a crucial finding regarding the optimal input features for the regressor.

6.3.1. Visual Examples of Explanations

Figure 2 (referred to as images/3.jpg in the provided assets) provides concrete examples of the proposed method's output, showcasing SHAP values and their corresponding explanations.

The following figure (Figure 2 from the original paper) shows example results of the proposed method:

Fig. 2. Example results of the proposed method. The model prediction is denoted as $f(X)$ and the average value of the predictions is represented as $E[f(X)]$. Features with positive contributions are shown in green and those with negative contributions in red.

Analysis of Figure 2 (images/3.jpg): The figure presents two example images with their aesthetic evaluations and explanations.

  • Example 1 (White Bird by Water):
    • $E[f(X)] = 58.738$: This represents the average predicted aesthetic score (Mean Opinion Score, MOS) across the dataset.
    • $f(X) = 54.584$: This is the final predicted score for this specific image.
    • Feature Contributions: The image highlights features that positively or negatively contribute to the f(X) score relative to E[f(X)].
      • color (positive contribution): The green bar indicates a positive impact of the color attribute, suggesting the image's coloration enhances its aesthetics.
      • shape (positive contribution): The shape attribute also has a positive contribution.
      • high contrast (negative contribution): The red bar indicates that high contrast slightly detracts from the aesthetic score in this specific case.
    • Linguistic Explanation: The text "The image is evaluated as a good photo, mainly because of its good color. The shape is fine. However, it seems like the contrast is not that good." This linguistic explanation directly translates the SHAP value contributions into human-understandable language, validating the LLaVa component's role.
  • Example 2 (Moon Photo):
    • $E[f(X)] = 58.738$: Same average predicted MOS.

    • $f(X) = 36.155$: This image receives a lower predicted score, indicating it is considered less aesthetically pleasing.

    • Feature Contributions:

      • brightness (negative contribution): A significant negative contribution from brightness indicates the image is perceived as too dark or poorly exposed, negatively impacting its aesthetics.
      • high contrast (negative contribution): High contrast also has a negative impact.
      • colorful (positive contribution): Surprisingly, despite being a moon photo, colorful has a small positive contribution, which might be interpreted as subtle hues or gradients within the image.
    • Linguistic Explanation: "The image is evaluated as a bad photo, mainly because of its low brightness. The contrast is not that good. However, the color is fine." This explanation again aligns perfectly with the visual SHAP value breakdown, providing clear reasons for the lower score.

      These examples effectively illustrate how SHAP values enable the identification of key aesthetic drivers, and how LLaVa translates these insights into meaningful linguistic feedback, thus achieving the goal of explainable IAA.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces a novel and comprehensive framework for Explainable AI (XAI) in Image Aesthetics Assessment (IAA) by ingeniously integrating CLIP-IQA, SHAP, and MLLMs (LLaVa). The study successfully addresses significant limitations of prior methods: the black-box nature of CLIP-IQA and the low accuracy of standalone MLLM-based approaches for aesthetic evaluation. By leveraging CLIP-IQA for quantitative scoring, SHAP to pinpoint the contributions of individual aesthetic features (derived from antonym prompt pairs), and LLaVa to translate these insights into natural language explanations, the framework achieves both high accuracy (SROCC of 0.762 and PLCC of 0.785) and robust interpretability. Furthermore, the research demonstrates that utilizing a larger and more diverse set of LLM-generated prompt pairs significantly enhances the model's performance, particularly for the LightGBM regressor component, by better exploiting CLIP's internal knowledge. Ultimately, the proposed method provides not just an aesthetic score but also the underlying rationale, moving closer to human-like understanding and explanation of image appeal.

7.2. Limitations & Future Work

The paper implicitly highlights the limitations of existing methods that its framework aims to overcome:

  • Black-box nature of CLIP-IQA: While CLIP-IQA excels in correlation with human judgment, it lacks transparency, making it difficult to understand why a particular aesthetic score was assigned. The proposed framework directly mitigates this with SHAP and LLaVa.

  • Low accuracy of MLLM-based methods for direct aesthetic evaluation: MLLMs like LLaVa are powerful for generating text, but when tasked directly with quantitative aesthetic scoring, they tend to yield lower accuracy and inconsistent results. The paper addresses this by using LLaVa for explanation rather than direct scoring.

    While the paper does not explicitly state future work directions in a dedicated section, several avenues can be inferred or suggested:

  • Exploring advanced prompt engineering: The paper shows the benefit of more prompts. Future work could delve deeper into how to generate optimal prompts (e.g., using active learning, human-in-the-loop validation, or more sophisticated LLM fine-tuning for prompt generation) to capture even more nuanced aesthetic attributes.

  • Generalizability to diverse aesthetic domains: The KonIQ-10k dataset is general-purpose. Testing the framework on niche aesthetic domains (e.g., fashion photography, medical imaging aesthetics, specific artistic styles) could reveal its adaptability and potential need for domain-specific prompt sets or fine-tuning.

  • User studies for explanation quality: While LLaVa generates linguistic explanations, formal user studies could evaluate how useful, intuitive, and trustworthy these explanations are to human users, potentially leading to refinements in the explanation generation process.

  • Real-time applications: Investigating the computational efficiency of the framework for real-time IAA and explanation generation would be important for practical deployment.

  • Incorporating user-specific aesthetics: Aesthetic judgment is highly subjective. Future work could explore personalizing the model to individual users' aesthetic preferences.

7.3. Personal Insights & Critique

This paper presents an elegant solution to a critical problem in AI: moving beyond mere prediction to meaningful explanation, especially in subjective domains like aesthetics. The core insight of combining CLIP-IQA's quantitative power with SHAP's interpretability and LLaVa's linguistic capabilities is very strong.

  • Strength of Integration: The framework's modular design is a significant strength. By leveraging existing, powerful models for specific sub-tasks (CLIP for embeddings, LightGBM for regression, SHAP for explanation, LLaVa for language generation), the authors create a synergistic system that outperforms individual components. This approach is highly practical and generalizable, as different VLMs, regressors, or MLLMs could be swapped in as they evolve.
  • Importance of Explainability: The emphasis on explainability is crucial. For subjective tasks like IAA, knowing why a machine thinks an image is aesthetically pleasing (or not) builds trust and provides actionable feedback for creators. The use of SHAP values directly linked to human-understandable prompt pairs is particularly effective in making this transparent.
  • Role of Prompt Engineering: The demonstration that increasing the number of prompt pairs significantly boosts performance highlights the continuing importance of prompt engineering in VLM and LLM applications. This suggests that the quality and diversity of semantic inputs are as vital as the model architecture itself.

Potential Issues or Areas for Improvement:

  • Aggregation of CLIP-IQA Scores: The paper mentions that similarities are "normalized via the softmax function for specific prompt pairs and then aggregated into a single value, represented as a CLIP-IQA score." While the overall framework is clear, the exact mathematical formula or process for this aggregation step, especially when multiple prompt pairs are used, is not explicitly detailed. A clearer exposition of this aggregation, particularly how the individual pair-wise scores are combined into the features fed to LightGBM, would enhance reproducibility and understanding.

  • Computational Cost: Running multiple CLIP encodings for numerous prompt pairs, training a LightGBM, calculating SHAP values, and then querying an MLLM can be computationally intensive. While not explicitly discussed, the efficiency of this pipeline, especially for real-time applications or very large datasets, could be a consideration.

  • Subjectivity of "Good" Prompts: While ChatGPT is used to generate prompts, the quality and representativeness of these prompts for aesthetic attributes are still inherently tied to the biases and capabilities of the LLM used for generation. Further validation of prompt sets for diverse aesthetic dimensions could be explored.

  • Beyond Correlation: While SROCC and PLCC are standard metrics, they don't fully capture the nuances of aesthetic judgment. Exploring other evaluation methods, perhaps involving direct human feedback on the generated explanations, could provide deeper insights into the quality of the explainability.

    Overall, this paper makes a significant contribution by bridging the gap between high-performance VLM-based aesthetic assessment and the critical need for transparent, human-understandable explanations. Its methodology is transferable and inspirational for developing more trustworthy AI systems in various domains.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.