
LITA: LMM-Guided Image-Text Alignment for Art Assessment

Published: December 30, 2024

TL;DR Summary

To address the growing need for Artistic Image Aesthetics Assessment (AIAA), the authors propose LITA, an LMM-guided image-text alignment model that uses comments from the pre-trained LLaVA model for rich feature extraction. LITA effectively captures artistic style and aesthetic semantics, outperforming existing AIAA methods on the BAID benchmark.

Abstract

With an increasing number of artworks being shared on social media, Artistic Image Aesthetics Assessment (AIAA) models that can evaluate the aesthetics of these artworks are becoming increasingly essential. Existing methods primarily focus on devising pure vision models, often overlooking the nuanced and abstract elements that are crucial in artistic evaluation. To address the issue, we propose Large Multimodal Model (LMM)-guided Image-Text Alignment (LITA) for AIAA. LITA leverages comments from pre-trained LLaVA for rich image feature extraction and aesthetics prediction, considering that LLaVA is pre-trained on a wide variety of images and texts, and is capable of understanding abstract concepts such as artistic style and aesthetics. In our training, image features extracted by image encoders are aligned with text features of the comments generated by LLaVA. The alignment allows the image features to incorporate artistic style and aesthetic semantics. Experimental results show that our method outperforms the existing AIAA methods. Our code is available at https://github.com/Suna-D/LITA.


1. Bibliographic Information

1.1. Title

LITA: LMM-Guided Image-Text Alignment for Art Assessment

1.2. Authors

  • Tatsumi Sunada

  • Kaede Shiohara

  • Ling Xiao

  • Toshihiko Yamasaki

    All authors are affiliated with The University of Tokyo, Tokyo, Japan. Their research backgrounds appear to be in computer vision and potentially multimodal AI, given the subject matter of the paper.

1.3. Journal/Conference

The paper does not explicitly state a journal or conference name; only a publication date is available, suggesting it may be from a conference proceeding or an archived preprint. The research topic is highly relevant to computer vision, machine learning, and artificial intelligence conferences such as CVPR, ICCV, ECCV, or NeurIPS.

1.4. Publication Year

2024 (as indicated by the publication date of December 30, 2024).

1.5. Abstract

The paper addresses the increasing need for Artistic Image Aesthetics Assessment (AIAA) models, especially with the proliferation of artworks on social media. It identifies a limitation in existing AIAA methods, which primarily rely on pure vision models and often fail to capture the nuanced and abstract elements crucial for artistic evaluation. To overcome this, the authors propose Large Multimodal Model (LMM)-guided Image-Text Alignment (LITA). LITA utilizes comments generated by the pre-trained LLaVA model (a specific LMM) to extract rich image features and predict aesthetics. The core idea is that LLaVA, being pre-trained on diverse image-text pairs, can understand abstract concepts like artistic style and aesthetics. During training, image features from vision encoders are aligned with text features of these LLaVA-generated comments. This alignment allows the image features to incorporate artistic style and aesthetic semantics. The experimental results demonstrate that LITA outperforms existing AIAA methods, with the code made publicly available.


2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the accurate and nuanced assessment of artistic image aesthetics, particularly for the vast amount of artworks shared on social media. Artistic Image Aesthetics Assessment (AIAA) models are essential for various applications, including providing feedback to artists, curating high-quality content on platforms, and offering aesthetically pleasing recommendations to users.

However, AIAA is a challenging task due to several factors:

  • Multifaceted Nature of Aesthetics: Art aesthetics are influenced by diverse elements like lighting, composition, and color harmony.

  • Wide Range of Styles: Art encompasses numerous styles (e.g., impressionism, expressionism, realism), each with unique characteristics and aesthetic considerations.

  • Abstract Concepts: Artistic style and aesthetics are abstract concepts that are difficult for traditional pure vision models (models that only process image data) to capture accurately.

    Prior research in Image Aesthetics Assessment (IAA) (for photographs) has shown that incorporating textual information (like user comments) can effectively capture abstract elements. However, this approach has been under-explored in AIAA due to the limited availability of rich datasets that pair art images with descriptive text. This gap prevents AIAA models from fully understanding and evaluating the abstract dimensions of artistic works.

The paper's innovative idea is to leverage the capabilities of recently emerged Large Multimodal Models (LMMs) to bridge this data gap. LMMs, pre-trained on vast image-text pairs, can comprehend both high-level visual content and low-level visual features, and have shown proficiency in understanding and generating creative content, including art descriptions. This suggests LMMs can serve as "art critics" to generate textual descriptions of artistic style and aesthetics, thereby providing the necessary textual data to enhance AIAA models.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Proposed LITA Framework: The authors propose LMM-guided Image-Text Alignment (LITA), a novel framework for AIAA. LITA utilizes a pre-trained LMM (specifically LLaVA) to generate rich textual comments about an artwork's style and aesthetics. These comments are then used to guide the training of visual encoders.
  • Enhanced Image Feature Extraction: LITA employs an image-text alignment mechanism during training. This mechanism aligns image features extracted by dedicated style and aesthetic vision encoders with the corresponding textual features derived from LLaVA's comments. This alignment process enables the vision encoders to incorporate artistic style and aesthetic semantics, leading to richer image feature representations that can better capture abstract concepts.
  • Computational Efficiency during Inference: Crucially, the LLaVA model is only used during the training phase to generate comments. During inference, LITA only uses the trained vision encoders and a fully-connected layer, making it computationally lighter and more practical than methods that require text inputs during inference.
  • State-of-the-Art Performance: Experimental results on the Boldbrush Artistic Image Dataset (BAID) demonstrate that LITA significantly outperforms existing AIAA methods. It achieves a Pearson linear correlation coefficient (PLCC) of 0.573 and an accuracy of 78.91%, surpassing previous methods by 0.015 for PLCC and 1.19% for accuracy. These findings highlight the effectiveness of integrating LMM-generated textual guidance for improving artistic aesthetics assessment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the LITA paper, a foundational understanding of several key concepts is necessary, especially for a beginner in the field.

  • Artistic Image Aesthetics Assessment (AIAA) and Image Aesthetics Assessment (IAA):

    • Image Aesthetics Assessment (IAA) is the task of evaluating the visual appeal and quality of general photographs. It aims to predict how pleasing or "beautiful" an image is to human observers.
    • Artistic Image Aesthetics Assessment (AIAA) is a specialized sub-field of IAA that specifically focuses on evaluating the aesthetic appeal of artworks (e.g., paintings, digital art). AIAA is often more challenging than IAA because art aesthetics involve more abstract concepts like artistic style, composition, historical context, and emotional impact, which are harder to quantify and model.
    • The output of these tasks can be either a numerical aesthetic score (regression) or a binary classification (e.g., "high aesthetic" vs. "low aesthetic").
  • Large Multimodal Models (LMMs):

    • LMMs are advanced artificial intelligence models that can process and understand information from multiple modalities, typically text and images. They are trained on vast datasets containing paired images and text descriptions.
    • Their core capability lies in learning joint representations (embeddings) for both modalities, allowing them to understand the relationship between visual content and linguistic descriptions.
    • This enables them to perform tasks like image captioning (describing an image in text), visual question answering (answering questions about an image), and in the context of this paper, generating insightful comments about artistic style and aesthetics.
    • Examples include LLaVA, BLIP-2, GPT-4 (when combined with visual capabilities), and CLIP.
  • LLaVA (Large Language and Vision Assistant):

    • LLaVA is a specific LMM that combines a pre-trained vision encoder (to understand images) with a large language model (to understand and generate text). It's designed to follow human instructions and engage in multimodal dialogues, making it capable of describing images in detail, answering complex questions about them, and even understanding abstract concepts like artistic style and aesthetics, as leveraged in this paper.
  • Vision Transformer (ViT):

    • ViT is a type of neural network architecture that applies the Transformer model (originally designed for natural language processing) directly to image classification tasks.
    • Instead of using convolutional layers, ViT treats an image as a sequence of small patches (like words in a sentence). Each patch is linearly embedded, positional encodings are added, and then these patches are fed into a standard Transformer encoder.
    The output often includes a special [CLS] token (short for "classifier token"), which aggregates information from all patches and is used as the overall image representation or embedding for downstream tasks. In LITA, two ViT models are used as the style image encoder and the aesthetic image encoder to generate image features $\mathbf{v}_{\mathrm{style}}$ and $\mathbf{v}_{\mathrm{aes}}$.
  • Bidirectional Encoder Representations from Transformers (BERT):

    • BERT is a powerful Transformer-based language model pre-trained on a massive amount of text data. It's designed to understand the context of words in a sentence by looking at words before and after them (hence "bidirectional").
    • Similar to ViT, BERT also uses a special [CLS] token at the beginning of its input. The embedding corresponding to this [CLS] token after processing by BERT is often used as a fixed-dimensional representation of the entire input sentence or text. In LITA, BERT is used as a text encoder to generate text features $\mathbf{t}_{\mathrm{style}}$ and $\mathbf{t}_{\mathrm{aes}}$ from LLaVA's comments.
  • Contrastive Learning (CLIP-style):

    • Contrastive learning is a machine learning paradigm where a model learns to distinguish between similar and dissimilar pairs of data points. The goal is to bring representations of similar pairs closer together in an embedding space, while pushing dissimilar pairs apart.
    • CLIP (Contrastive Language-Image Pre-training) is a prominent example. It trains a vision encoder and a text encoder jointly to predict which text caption goes with which image, from a batch of randomly paired image-text examples. This is achieved by maximizing the cosine similarity between correctly matched image-text pairs and minimizing it for incorrectly matched pairs within a batch.
    • LITA adopts this contrastive loss mechanism to align visual features from its image encoders with textual features from its text encoder, effectively transferring LMM's knowledge into the vision-only prediction pipeline.
  • Evaluation Metrics:

    • Pearson Linear Correlation Coefficient (PLCC):
      • Conceptual Definition: PLCC measures the strength and direction of a linear relationship between two continuous variables. In AIAA, it quantifies how well the predicted aesthetic scores linearly correlate with the ground-truth scores. A higher PLCC value (closer to 1 or -1) indicates a stronger linear relationship, with 1 being a perfect positive linear correlation.
      • Mathematical Formula: $ \mathrm{PLCC} = \frac{\sum_{i=1}^{N}(P_i - \bar{P})(G_i - \bar{G})}{\sqrt{\sum_{i=1}^{N}(P_i - \bar{P})^2 \sum_{i=1}^{N}(G_i - \bar{G})^2}} $
      • Symbol Explanation:
        • $N$: The total number of data points (art images) in the dataset.
        • $P_i$: The predicted aesthetic score for the $i$-th image.
        • $\bar{P}$: The mean of all predicted aesthetic scores.
        • $G_i$: The ground-truth aesthetic score for the $i$-th image.
        • $\bar{G}$: The mean of all ground-truth aesthetic scores.
    • Spearman's Rank Correlation Coefficient (SRCC):
      • Conceptual Definition: SRCC assesses the monotonic relationship between two ranked variables. Unlike PLCC, which looks for linear relationships, SRCC evaluates how well the relationship between two variables can be described using a monotonic function (meaning as one variable increases, the other also increases, or vice-versa, but not necessarily at a constant rate). In AIAA, it measures the correlation between the ranks of predicted scores and the ranks of ground-truth scores. It's robust to non-linear relationships and outliers. A higher SRCC (closer to 1 or -1) indicates stronger monotonic correlation.
      • Mathematical Formula: $ \mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)} $
      • Symbol Explanation:
        • $N$: The total number of data points (art images).
        • $d_i$: The difference between the rank of $P_i$ (predicted score for image $i$) and the rank of $G_i$ (ground-truth score for image $i$).
    • Accuracy (Acc):
      • Conceptual Definition: For classification tasks, Accuracy is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. In AIAA binary classification, it measures how often the model correctly classifies an artwork as "attractive" or "unattractive" (e.g., above or below a certain aesthetic score threshold).
      • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
      • Symbol Explanation:
        • Number of Correct Predictions: The count of instances where the model's binary prediction matches the ground-truth binary label.
        • Total Number of Predictions: The total count of instances in the dataset.
  • Mean Squared Error (MSE) Loss:

    • Conceptual Definition: MSE is a common loss function used in regression tasks. It measures the average of the squares of the errors (the difference between predicted and actual values). It penalizes larger errors more severely than smaller ones. The goal during training is to minimize MSE.
    • Mathematical Formula: $ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (P_i - G_i)^2 $
    • Symbol Explanation:
      • $N$: The total number of data points.
      • $P_i$: The predicted aesthetic score for the $i$-th image.
      • $G_i$: The ground-truth aesthetic score for the $i$-th image.
  • Box-Cox Transformation:

    • Conceptual Definition: The Box-Cox transformation is a statistical technique used to transform non-normally distributed data to a normal distribution. It is particularly useful for stabilizing variance and making data more amenable to statistical models that assume normality. In AIAA, aesthetic score datasets often exhibit long-tailed distributions (e.g., many scores clustered around a mean, with fewer very low or very high scores), leading to imbalanced data. Applying Box-Cox transformation can help normalize these scores, making the training process more stable and preventing the model from ignoring minority aesthetic features.
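To make the evaluation metrics and loss above concrete, here is a minimal sketch (assuming NumPy and SciPy; variable names such as `preds` and `gts` are illustrative and not from the paper) that computes PLCC, SRCC, MSE, and binary accuracy with the threshold of 5 used later in the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(preds, gts, threshold=5.0):
    """Compute the AIAA metrics described above for predicted vs. ground-truth scores."""
    preds, gts = np.asarray(preds, dtype=float), np.asarray(gts, dtype=float)

    plcc, _ = pearsonr(preds, gts)      # linear correlation of raw scores
    srcc, _ = spearmanr(preds, gts)     # monotonic correlation of ranks
    mse = np.mean((preds - gts) ** 2)   # regression loss used during training

    # Binary classification: scores >= threshold count as "attractive".
    acc = np.mean((preds >= threshold) == (gts >= threshold))
    return {"PLCC": plcc, "SRCC": srcc, "MSE": mse, "Acc": acc}

# Example usage with toy scores on the 0-10 scale:
print(evaluate([6.1, 4.2, 7.8, 3.0], [5.9, 4.5, 8.1, 2.7]))
```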

3.2. Previous Works

The paper contextualizes LITA by discussing prior research in both Image Aesthetics Assessment (IAA) and Artistic Image Aesthetics Assessment (AIAA), as well as the emergence of Large Multimodal Models (LMMs).

3.2.1. Image Aesthetics Assessment (IAA)

  • Traditional Methods: Early IAA approaches relied on hand-crafted features (e.g., color histograms, lighting properties, composition rules) combined with machine learning techniques [18, 20, 29]. These methods struggled to capture the subjective and abstract nature of aesthetic values.
  • Deep Learning Methods: With the advent of deep learning and large-scale datasets like AVA [19] (which contains photograph images with voted aesthetic scores), models began using deep neural networks to extract image features and predict aesthetic score distributions [8, 9, 11, 25, 26, 28, 31, 36, 39]. These models significantly improved performance over traditional methods.
  • Text-Utilizing Methods: More recently, the advancement of large-scale vision-language models prompted IAA works to incorporate textual information.
    • MSCAN [37]: Proposed Multimodal Self-and-Collaborative Attention Network, which extracted both image and user comment features, using a co-attention mechanism to capture their correlation.
    • Other Text-based IAA [10, 38]: Similar works used images and user comments to evaluate aesthetics.
    • Limitation of Text-based IAA: A critical limitation of these methods is their requirement for both images and user comments during inference. This makes them impractical in real-world scenarios where user comments are often unavailable.
    • VILA [13]: Proposed VIsion-Language Aesthetics (VILA) learning framework to address the unpaired data problem during inference. VILA used image-text pre-training to embed image and user comment features into the same space, allowing the model to learn suitable features without needing comments at inference time. This concept of image-text alignment for inference-time efficiency is a direct inspiration for LITA.

3.2.2. Artistic Image Aesthetics Assessment (AIAA)

AIAA tasks are typically divided into aesthetic binary classification and aesthetic score regression, with ground-truth scores determined by mean opinion score (MOS) from annotators.

  • Early AIAA Methods: These were mainly handcrafted feature-based [2, 7] and focused on classifying art images into quality categories using features like color, complexity, and segmentation. These were limited by their reliance on small datasets.
  • Large-scale Dataset & Deep Learning: The field saw significant advancement with the creation of the Boldbrush Artistic Image Dataset (BAID) [34] in 2023, the first large-scale artistic image dataset.
    • SAAN [34]: Proposed Style-specific Art Assessment Network, which extracted style-specific aesthetic features and generic aesthetic features to predict aesthetic scores.
    • TSC-Net [32]: Proposed Theme-Style-Color guided Artistic Image Aesthetics Assessment Network, incorporating theme understanding, aesthetic feature extraction, and color distribution networks.
    • GCN-based [27]: Utilized Graph Convolutional Networks (GCNs) to predict aesthetics by embedding similar images into visual features and constructing a GCN for understanding semantic and style.
  • Limitation of Existing AIAA: The paper notes that existing AIAA methods predominantly focus on extracting style and aesthetic features using devised networks but do not use textual information, which LITA argues is crucial for understanding abstract elements.

3.2.3. Large Multimodal Models (LMMs)

  • Text-Image Paired Pre-training: Models like CLIP [21] and ALIGN [12] demonstrated strong capabilities by learning joint embeddings for text and images, enabling significant performance in various downstream tasks [22, 23].
  • Emergence of LMMs: This led to the development of more sophisticated LMMs like LLaVA [17], BLIP-2 [14], and GPT-4 [1]. These models can describe image information based on text prompts, understand visual aesthetics and quality [30, 33, 35], and even generate creative content.
  • LMMs for Aesthetics: These works indicate LMMs' ability to capture abstract concepts like image aesthetics. LITA leverages this specific capability of LMMs to generate art criticism for AIAA.

3.3. Technological Evolution

The evolution of aesthetic assessment models has progressed through several stages:

  1. Handcrafted Features (Early 2000s - early 2010s): Researchers manually designed features (e.g., rule of thirds, color vibrancy, texture) and fed them into traditional machine learning models (SVMs, random forests). These were limited in capturing high-level semantics and subjective human perception.
  2. Deep Learning (Mid-2010s - Present): With large datasets (like AVA for IAA, and later BAID for AIAA) and the rise of Convolutional Neural Networks (CNNs), models learned hierarchical features directly from pixels, significantly improving performance. These were primarily pure vision models.
  3. Multimodal Integration (Late 2010s - Present): Recognizing the importance of contextual information, IAA began incorporating textual data (user comments). However, the practical challenge of text unavailability during inference limited its widespread adoption.
  4. Large Multimodal Models (Early 2020s - Present): The latest advancement involves LMMs (CLIP, LLaVA, etc.). These models inherently understand both text and images and can even generate high-quality text descriptions. This capability offers a solution to the "text unavailability" problem: if text is needed, an LMM can generate it. LITA fits into this latest stage, using LMMs to generate text during training to enhance vision models, thereby addressing the abstract nature of art aesthetics without compromising inference efficiency.

3.4. Differentiation Analysis

Compared to the main methods in related work, LITA presents several core differences and innovations:

  • Addressing the "Abstract Elements" Gap in AIAA:

    • Previous AIAA methods (e.g., SAAN, TSC-Net, GCN-based): Primarily pure vision models that devise specific networks to extract style and aesthetic features. They do not use textual information, which LITA argues is crucial for understanding abstract concepts.
    • LITA's Innovation: Directly addresses this by incorporating LMM-generated textual descriptions of artistic style and aesthetics. It explicitly guides vision encoders to learn features imbued with these abstract semantics.
  • Overcoming Text Scarcity and Inference Challenges:

    • IAA methods using text (e.g., MSCAN): Showed performance gains by integrating user comments but suffered from the impracticality of requiring text during inference, as comments are not always available.
    • VILA [13]: Moved towards image-text pre-training to embed features into a joint space, allowing inference without explicit text. LITA builds on this idea but leverages the generative power of modern LMMs to create the textual guidance.
    • LITA's Innovation: Uses LLaVA to generate comments only during training. This LMM-guided image-text alignment allows the vision models to internalize textual knowledge. During inference, the LMM is not needed, making the model computationally efficient and practical, resolving the "text unavailability" issue without sacrificing the benefits of textual guidance.
  • Leveraging LMM's "Art Criticism" Ability:

    • General LMMs: While LMMs have been used for visual aesthetics and quality assessment [30, 33, 35], LITA specifically harnesses LLaVA's capacity to act as an art critic, generating structured descriptions of "artistic style" and "aesthetics" from different perspectives. This focused application extracts more relevant and tailored textual semantics for AIAA.

      In essence, LITA innovatively combines the strength of LMMs in understanding abstract concepts and generating text with the practical need for efficient, vision-only inference. It uses LMMs as an intelligent data augmentation and knowledge distillation tool during training, rather than as a core component of the inference pipeline, differentiating it from prior multimodal approaches.

4. Methodology

4.1. Principles

The core idea behind LITA is to enhance the capabilities of Artistic Image Aesthetics Assessment (AIAA) models by injecting the rich, abstract understanding of art that Large Multimodal Models (LMMs) possess. The fundamental principle is that artistic style and aesthetics are abstract concepts that are difficult for pure visual models to accurately capture. However, LMMs (like LLaVA), trained on vast image-text datasets, are adept at comprehending and describing such abstract notions.

LITA leverages this by:

  1. Generating "Art Criticism": Using a pre-trained LMM (LLaVA) to generate specific textual comments about an artwork's artistic style and aesthetics.

  2. Guiding Vision Encoders via Alignment: Aligning the visual features extracted by dedicated image encoders (one for style, one for aesthetics) with the textual features derived from these LMM-generated comments. This image-text alignment acts as a supervisory signal, forcing the image encoders to learn visual representations that are semantically consistent with the LMM's understanding of style and aesthetics.

  3. Efficient Inference: Critically, this LMM-guided process occurs only during training. At inference time, the LMM is no longer needed. The trained image encoders, having internalized the LMM's knowledge, can then predict aesthetic scores using only image input, making the system practical and computationally efficient.

    This approach allows LITA's vision models to capture nuanced and abstract elements of art that are often overlooked by models trained solely on visual data, thereby improving AIAA performance.

4.2. Core Methodology In-depth (Layer by Layer)

The LITA framework integrates LMM capabilities into a traditional AIAA pipeline through a sophisticated image-text alignment mechanism. Let's break down its components and data flow step-by-step.

4.2.1. Problem Definition

The paper defines the AIAA problem as two primary tasks:

  • Binary Classification: Classifying art images into high aesthetic and low aesthetic groups.
  • Score Regression: Predicting a continuous aesthetic score for an artwork. The Boldbrush Artistic Image Dataset (BAID) [34] is used, where each artwork has an associated aesthetic score (0 to 10) based on multiple votes.

4.2.2. Overview of LITA Pipeline

The overall architecture of LITA is depicted in Figure 1.

Fig. 1. Overview of our proposed LMM-guided Image-Text Alignment (LITA) for art assessment. The pre-trained LLaVA model first produces comments on an artistic image from the style and aesthetic perspectives; image and text encoders extract the corresponding features, which are aligned through style and aesthetic distance losses, and the concatenated image features are fed to a fully-connected layer to predict the aesthetic score under a score regression loss.

As shown in Figure 1, the LITA pipeline begins by processing an input artistic image.

  1. LMM Comment Generation (Training Only): For each art image in the training dataset, the pre-trained LLaVA model is used to generate two types of comments: one describing the artistic style and another describing the aesthetics of the image. This step provides the crucial textual guidance.
  2. Text Feature Extraction: A frozen text encoder (specifically, BERT) takes these LLaVA-generated comments as input and extracts textual feature representations. This results in a style textual feature ($\mathbf{t}_{\mathrm{style}}$) and an aesthetic textual feature ($\mathbf{t}_{\mathrm{aes}}$). The BERT parameters are frozen to ensure LLaVA's knowledge is directly transferred without being altered by the AIAA task.
  3. Image Feature Extraction: Simultaneously, two distinct vision encoders (both Vision Transformer (ViT) models) process the same art image. One is designated as the style image encoder and the other as the aesthetic image encoder. These encoders extract style visual features ($\mathbf{v}_{\mathrm{style}}$) and aesthetic visual features ($\mathbf{v}_{\mathrm{aes}}$), respectively.
  4. Image-Text Alignment: The core of LITA's learning framework involves aligning these visual features with their corresponding textual features. The style visual feature ($\mathbf{v}_{\mathrm{style}}$) is aligned with the style textual feature ($\mathbf{t}_{\mathrm{style}}$), and the aesthetic visual feature ($\mathbf{v}_{\mathrm{aes}}$) is aligned with the aesthetic textual feature ($\mathbf{t}_{\mathrm{aes}}$). This alignment uses contrastive loss functions, which encourage the image encoders to produce features that are semantically close to the LMM's text descriptions.
  5. Score Prediction: The extracted style visual feature ($\mathbf{v}_{\mathrm{style}}$) and aesthetic visual feature ($\mathbf{v}_{\mathrm{aes}}$) are concatenated. This combined feature vector is then fed into a fully-connected layer, which outputs the predicted aesthetic score ($p$).
  6. Loss Optimization (Training): During training, three types of loss functions are optimized:
    • A regression loss ($\mathcal{L}_{\mathrm{reg}}$) between the predicted score $p$ and the ground-truth score $g$.
    • Two distance loss functions ($\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$ and $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$) for the image-text alignment.
  7. Inference: During inference, only the trained image encoders and the fully-connected layer are used. The LLaVA model and text encoder are discarded, ensuring efficient prediction.

4.2.3. LMM Comments on Artworks

The motivation for using LMMs to generate comments is two-fold:

  1. Addressing Artistic Style Diversity: Art includes numerous styles (Realism, Pop Art, Cubism), each with unique characteristics. LMMs can recognize and describe these distinct aspects.

  2. Human-like Art Criticism: LMMs are trained on diverse datasets and can generate human-like text, suggesting they share perceptions of art similar to humans.

    The LLaVA-1.6 model [16] is employed for this task. It generates comments using prompts tailored to elicit descriptions of style and aesthetics: `<img> Describe an {artistic style / aesthetics} of an image.`

Qualitative examples of these generated descriptions are provided in Figures 2, 3, and 4 in the original paper, showing how LLaVA can articulate observations about the visual characteristics and mood of different artworks.

[Image: an artistic portrait of a woman in a striped dress with a gentle expression; the blurred background emphasizes her facial features and the delicate, emotive style.]

As seen in the figure, for a portrait of a woman, LLaVA describes the style as "realistic portrait" focusing on "detailed and lifelike representation" and a "warm color palette." For aesthetics, it mentions "thoughtful expression," "colorful headband," and a "textured background."

[Image: an abstract artwork using rich colors and shapes to convey strong visual emotion and artistic style.]

For an abstract painting, LLaVA describes the style as "abstract expressionism, characterized by its loose, gestural brushstrokes and the use of color to convey emotion and mood." The aesthetics are characterized by "a blend of abstract and impressionistic elements, with a focus on color and texture that creates a sense of depth and movement."

[Image: a seascape at sunset with shimmering water and richly layered colors.]

For a watercolor landscape, LLaVA identifies the style as a "watercolor painting" depicting "a rocky shore, a waterfall, and a lush green hillside." For aesthetics, it highlights the "soft and dreamy atmosphere" and the "skillful use of color and brushwork to capture the essence of the coastal landscape."

These generated textual comments (e.g., "warm color palette", "abstract expressionism", "soft and dreamy atmosphere") then serve as semantic anchors to guide the image feature extraction.
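As a rough illustration of this comment-generation step, the sketch below assumes the Hugging Face LLaVA-NeXT (llava-v1.6) checkpoints and API; the model ID, prompt template, and generation settings are assumptions for illustration, not details confirmed by the paper.

```python
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed checkpoint; the paper only states that LLaVA-1.6 is used.
MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def comment_on(image_path: str, aspect: str) -> str:
    """Ask LLaVA to describe either the 'artistic style' or 'aesthetics' of an artwork."""
    image = Image.open(image_path).convert("RGB")
    # Prompt follows the paper's template: "Describe an {artistic style / aesthetics} of an image."
    prompt = f"[INST] <image>\nDescribe an {aspect} of an image. [/INST]"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output[0], skip_special_tokens=True)

style_comment = comment_on("artwork.jpg", "artistic style")
aes_comment = comment_on("artwork.jpg", "aesthetics")
```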

4.2.4. LMM-Guided Image-Text Alignment for Art Assessment

This section details the specific components and loss functions that enable the LMM-guided learning.

Image Encoding

The model uses two separate Vision Transformer (ViT) models [6] as image encoders:

  • Style Image Encoder: Extracts style-attended features.
  • Aesthetic Image Encoder: Extracts aesthetic-aware image features. Both are ViT models pre-trained on ImageNet [4]. The [CLS] token output from each ViT is used as the respective image feature:
  • $\mathbf{v}_{\mathrm{style}}$: The [CLS] token embedding from the style image encoder.
  • $\mathbf{v}_{\mathrm{aes}}$: The [CLS] token embedding from the aesthetic image encoder.
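A minimal sketch of this two-encoder setup, assuming the `timm` library for ImageNet-pre-trained ViTs (the specific variant `vit_base_patch16_224` is an assumption; the paper only states that ViTs pre-trained on ImageNet are used):

```python
import timm
import torch

# Two independent ViT backbones; num_classes=0 makes timm return the pooled
# [CLS] representation instead of classification logits.
style_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
aes_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)

images = torch.randn(4, 3, 224, 224)   # a toy batch of art images
v_style = style_encoder(images)        # (4, 768) -> v_style in the paper's notation
v_aes = aes_encoder(images)            # (4, 768) -> v_aes
```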

Text Encoding

The textual descriptions generated by LLaVA are processed by a pre-trained BERT model [5].

  • The BERT model's parameters are frozen throughout training. This is crucial because it prevents the BERT model from adapting to the AIAA task and ensures that the LMM's pre-trained knowledge is directly used to guide the vision encoders.
  • The [CLS] token output from BERT serves as the textual embedding:
    • $\mathbf{t}_{\mathrm{style}}$: The [CLS] token embedding for the artistic style comment.
    • $\mathbf{t}_{\mathrm{aes}}$: The [CLS] token embedding for the aesthetic comment.
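A corresponding sketch for the frozen text encoder, assuming the Hugging Face `bert-base-uncased` checkpoint (the exact BERT variant is not specified in the paper):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Freeze BERT so LLaVA's knowledge guides the vision encoders without being altered.
for param in text_encoder.parameters():
    param.requires_grad = False
text_encoder.eval()

@torch.no_grad()
def encode_comment(comment: str) -> torch.Tensor:
    tokens = tokenizer(comment, return_tensors="pt", truncation=True, max_length=128)
    # Use the [CLS] token embedding (first position) as the sentence representation.
    return text_encoder(**tokens).last_hidden_state[:, 0]

t_style = encode_comment("A realistic portrait with a warm color palette ...")
t_aes = encode_comment("A thoughtful expression and textured background ...")
```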

Image-text Paired Learning

This is the core mechanism where visual features are aligned with textual features using contrastive loss, inspired by CLIP. The goal is to maximize the similarity between corresponding image and text embeddings while minimizing similarity with non-corresponding pairs within the same batch.

The distance loss functions $\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$ and $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$ are defined as: $ \begin{array}{rl} & \mathcal{L}_{\mathrm{d}}^{\mathrm{style}}(\mathbf{v}_{\mathrm{style}}, \mathbf{t}_{\mathrm{style}}) = \mathcal{L}_{\mathrm{Con}}(\mathbf{v}_{\mathrm{style}}, \mathbf{t}_{\mathrm{style}}) + \mathcal{L}_{\mathrm{Con}}(\mathbf{t}_{\mathrm{style}}, \mathbf{v}_{\mathrm{style}}), \\ & \mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}(\mathbf{v}_{\mathrm{aes}}, \mathbf{t}_{\mathrm{aes}}) = \mathcal{L}_{\mathrm{Con}}(\mathbf{v}_{\mathrm{aes}}, \mathbf{t}_{\mathrm{aes}}) + \mathcal{L}_{\mathrm{Con}}(\mathbf{t}_{\mathrm{aes}}, \mathbf{v}_{\mathrm{aes}}). \end{array} $ Symbol Explanation:

  • $\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$: The total distance loss for artistic style alignment.

  • $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$: The total distance loss for aesthetic alignment.

  • $\mathbf{v}_{\mathrm{style}}$: The visual feature vector for artistic style.

  • $\mathbf{t}_{\mathrm{style}}$: The textual feature vector for artistic style.

  • $\mathbf{v}_{\mathrm{aes}}$: The visual feature vector for aesthetics.

  • $\mathbf{t}_{\mathrm{aes}}$: The textual feature vector for aesthetics.

  • $\mathcal{L}_{\mathrm{Con}}(\mathbf{x}, \mathbf{y})$: A contrastive loss function that aligns features $\mathbf{x}$ and $\mathbf{y}$. The two terms in each sum (e.g., $\mathcal{L}_{\mathrm{Con}}(\mathbf{v}_{\mathrm{style}}, \mathbf{t}_{\mathrm{style}})$ and $\mathcal{L}_{\mathrm{Con}}(\mathbf{t}_{\mathrm{style}}, \mathbf{v}_{\mathrm{style}})$) mean the loss is computed in both directions, image-to-text and text-to-image, a common practice in contrastive learning to ensure a bidirectional mapping.

    The contrastive loss $\mathcal{L}_{\mathrm{Con}}(\mathbf{x}, \mathbf{y})$ is specifically defined as: $ \mathcal{L}_{\mathrm{Con}}(\mathbf{x}, \mathbf{y}) = - \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(x_i^\top y_i)}{\sum_{j=1}^N \exp(x_i^\top y_j)}, $ Symbol Explanation:

  • $N$: The batch size, i.e., the number of (image, text) pairs in the current training batch.

  • $x_i$: The $i$-th embedding from the first set of features (e.g., $\mathbf{v}_{\mathrm{style}}$ or $\mathbf{t}_{\mathrm{style}}$) within the batch.

  • $y_i$: The $i$-th embedding from the second set of features (e.g., $\mathbf{t}_{\mathrm{style}}$ or $\mathbf{v}_{\mathrm{style}}$) within the batch, corresponding to the $i$-th data point.

  • $x_i^\top y_i$: The dot product (or cosine similarity if the features are normalized) between the $i$-th $x$ and $i$-th $y$ embeddings, representing the similarity of the correct pair.

  • $\sum_{j=1}^N \exp(x_i^\top y_j)$: The sum of exponentiated dot products of $x_i$ with all $y_j$ embeddings in the batch. This serves as a normalization term, where the $y_j$ include the correct $y_i$ and $N-1$ incorrect (negative) pairs.

  • The log and negative sum are the standard cross-entropy form used in contrastive learning, which maximizes the similarity of positive pairs relative to negative pairs.
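A minimal PyTorch sketch of this symmetric contrastive alignment (assuming batched feature tensors `v` and `t`; the function name and the absence of a temperature term follow the formula above and are otherwise illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_distance_loss(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Bidirectional contrastive loss L_Con(v, t) + L_Con(t, v) over a batch.

    v: (N, D) image embeddings, t: (N, D) text embeddings; row i of v and t
    form the positive pair, all other rows in the batch act as negatives.
    """
    logits = v @ t.t()                                 # (N, N) pairwise similarities x_i^T y_j
    targets = torch.arange(v.size(0), device=v.device) # correct pair is on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)        # align image -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)    # align text -> image
    return loss_v2t + loss_t2v

# Example: style alignment term L_d^style(v_style, t_style)
v_style, t_style = torch.randn(8, 768), torch.randn(8, 768)
l_d_style = contrastive_distance_loss(v_style, t_style)
```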

Score Prediction

For aesthetic score prediction, the two visual features are first concatenated:

  • $\mathbf{v}_{\mathrm{combined}} = C(\mathbf{v}_{\mathrm{style}}, \mathbf{v}_{\mathrm{aes}})$, where $C(\mathbf{x}, \mathbf{y})$ denotes the concatenation of embeddings $\mathbf{x}$ and $\mathbf{y}$.

This combined vector is then passed through a fully-connected layer $F(\cdot)$ to predict the aesthetic score $p$: $ p = F(C(\mathbf{v}_{\mathrm{style}}, \mathbf{v}_{\mathrm{aes}})). $ Symbol Explanation:

  • $p$: The predicted aesthetic score for the input art image.

  • $F(\mathbf{x})$: A fully-connected layer (or multi-layer perceptron) that maps the input feature vector $\mathbf{x}$ to a single scalar output (the aesthetic score).

  • $C(\mathbf{v}_{\mathrm{style}}, \mathbf{v}_{\mathrm{aes}})$: The concatenated vector of the style visual feature and the aesthetic visual feature.

    The regression loss $\mathcal{L}_{\mathrm{reg}}$ is computed between the predicted score $p$ and the ground-truth score $g$ using Mean Squared Error (MSE).

Fused Model Learning

During the training process, all three loss components are optimized jointly. The total loss $\mathcal{L}$ is a weighted sum of the regression loss and the two distance losses: $ \mathcal{L} = \mathcal{L}_{\mathrm{reg}}(p, g) + \lambda \left\{ \mathcal{L}_{\mathrm{d}}^{\mathrm{style}}(\mathbf{v}_{\mathrm{style}}, \mathbf{t}_{\mathrm{style}}) + \mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}(\mathbf{v}_{\mathrm{aes}}, \mathbf{t}_{\mathrm{aes}}) \right\}. $ Symbol Explanation:

  • $\mathcal{L}$: The total loss function that the model minimizes during training.

  • $\mathcal{L}_{\mathrm{reg}}(p, g)$: The Mean Squared Error (MSE) loss between the predicted score $p$ and the ground-truth score $g$.

  • $\lambda$: A hyperparameter that controls the weight of the image-text alignment losses relative to the regression loss. A higher $\lambda$ places more emphasis on alignment, while a lower $\lambda$ prioritizes direct score prediction. In this paper, $\lambda$ is set to 0.35.

  • $\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}(\mathbf{v}_{\mathrm{style}}, \mathbf{t}_{\mathrm{style}})$: The contrastive distance loss for style features.

  • $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}(\mathbf{v}_{\mathrm{aes}}, \mathbf{t}_{\mathrm{aes}})$: The contrastive distance loss for aesthetic features.

    By optimizing this combined loss, LITA simultaneously learns to predict aesthetic scores accurately while ensuring that its visual features are semantically enriched by the LMM's understanding of artistic style and aesthetics.
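Putting the pieces together, here is a training-step sketch under the same assumptions as the snippets above (it reuses `style_encoder`, `aes_encoder`, and `contrastive_distance_loss` from those sketches; the 768-dimensional features and the single linear head are illustrative, since the paper only specifies concatenation, a fully-connected layer, MSE, and λ = 0.35):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LAMBDA = 0.35
score_head = nn.Linear(768 * 2, 1)   # F(.): maps concatenated [CLS] features to a score

def training_step(images, t_style, t_aes, gt_scores):
    """One LITA training step: score regression plus style/aesthetic alignment."""
    v_style = style_encoder(images)   # from the ViT sketch above
    v_aes = aes_encoder(images)

    # Score prediction: p = F(C(v_style, v_aes))
    pred = score_head(torch.cat([v_style, v_aes], dim=-1)).squeeze(-1)
    loss_reg = F.mse_loss(pred, gt_scores)

    # Image-text alignment losses (LLaVA/BERT comment features are used in training only)
    loss_align = (contrastive_distance_loss(v_style, t_style)
                  + contrastive_distance_loss(v_aes, t_aes))
    return loss_reg + LAMBDA * loss_align
```

At inference time, only the two encoders and `score_head` are evaluated; the alignment terms and the text branch are dropped entirely.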

5. Experimental Setup

5.1. Datasets

The experiments primarily use the Boldbrush Artistic Image Dataset (BAID) [34] for AIAA.

  • Source and Scale: The BAID dataset consists of 60,337 artistic images. These images are sourced from the Boldbrush website, where artists submit their work for monthly competitions and receive public votes.
  • Annotations: Each image in BAID is annotated with an aesthetic score, derived from more than 360,000 votes. The scores are scaled to a range from 0 (lowest aesthetic value) to 10 (highest aesthetic value).
  • Data Split: The dataset is split into training, validation, and testing sets following the conventions of previous works [27, 32, 34]:
    • Training: 50,737 images
    • Validation: 3,200 images
    • Testing: 6,400 images
  • Data Preprocessing: The BAID dataset exhibits long-tailed distributions, meaning that a large number of artworks have ground-truth scores clustered around a specific range (e.g., 3 to 4), while fewer artworks have very high or very low scores. This imbalance can cause models to bias predictions towards the mean score. To address this, the Box-Cox transformation [24] is applied to the ground-truth scores. This transformation helps to normalize the score distribution, making the dataset less imbalanced. After prediction, the inverse Box-Cox transformation is applied to the predicted values to revert them to the original 0-10 range for fair comparison with evaluation metrics.
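A small sketch of this score normalization step using SciPy (the constant shift is an implementation assumption, added because Box-Cox requires strictly positive inputs; the paper does not state how zeros are handled):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

EPS = 1e-3  # shift to keep scores strictly positive, since Box-Cox needs x > 0

def normalize_scores(scores):
    """Fit Box-Cox on training ground-truth scores (0-10 scale)."""
    transformed, lmbda = boxcox(np.asarray(scores, dtype=float) + EPS)
    return transformed, lmbda

def denormalize_scores(preds, lmbda):
    """Map predictions back to the original 0-10 scale for evaluation."""
    return inv_boxcox(np.asarray(preds, dtype=float), lmbda) - EPS

train_gt, lmbda = normalize_scores([3.2, 3.8, 4.1, 6.5, 2.9])
print(denormalize_scores(train_gt, lmbda))  # ≈ original scores
```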

5.2. Evaluation Metrics

Following previous AIAA works, LITA employs three metrics to evaluate performance: Spearman's rank correlation coefficient (SRCC), Pearson linear correlation coefficient (PLCC), and Accuracy (Acc).

5.2.1. Spearman's Rank Correlation Coefficient (SRCC)

  • Conceptual Definition: SRCC measures the monotonic relationship between the ranks of the predicted aesthetic scores and the ranks of the ground-truth aesthetic scores. It is suitable for assessing agreement in order or ranking, even if the relationship isn't strictly linear. A higher SRCC value (closer to 1) indicates better agreement in ranking.
  • Mathematical Formula: $ \mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)} $
  • Symbol Explanation:
    • $N$: The total number of images in the dataset (or subset being evaluated).
    • $d_i$: The difference between the rank of the predicted score ($P_i$) and the rank of the ground-truth score ($G_i$) for the $i$-th image, i.e., $d_i = \mathrm{rank}(P_i) - \mathrm{rank}(G_i)$.

5.2.2. Pearson Linear Correlation Coefficient (PLCC)

  • Conceptual Definition: PLCC quantifies the strength and direction of a linear relationship between the actual predicted scores and the ground-truth scores. It is sensitive to the magnitude of the predictions. A higher PLCC value (closer to 1) indicates a stronger positive linear relationship, meaning that as ground-truth scores increase, predicted scores also tend to increase proportionally.
  • Mathematical Formula: $ \mathrm{PLCC} = \frac{\sum_{i=1}^{N}(P_i - \bar{P})(G_i - \bar{G})}{\sqrt{\sum_{i=1}^{N}(P_i - \bar{P})^2 \sum_{i=1}^{N}(G_i - \bar{G})^2}} $
  • Symbol Explanation:
    • $N$: The total number of images.
    • $P_i$: The predicted aesthetic score for the $i$-th image.
    • $\bar{P}$: The mean of all predicted aesthetic scores.
    • $G_i$: The ground-truth aesthetic score for the $i$-th image.
    • $\bar{G}$: The mean of all ground-truth aesthetic scores.

5.2.3. Accuracy (Acc)

  • Conceptual Definition: Accuracy is used to evaluate the performance of binary classification. For AIAA, scores are converted into binary labels (e.g., "attractive" or "unattractive") using a threshold. Accuracy measures the proportion of correctly classified images (both attractive and unattractive) out of the total number of images.
  • Conversion to Binary Labels: The paper states that for binary classification, predicted and ground-truth scores are converted into binary labels using a threshold of 5, the central point of the 0-10 score scale. Scores $\geq 5$ are considered "attractive" and scores $< 5$ are "unattractive".
  • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  • Symbol Explanation:
    • Number of Correct Predictions: The count of images where the model's binary classification (based on the predicted score and threshold 5) matches the ground-truth binary label (based on the ground-truth score and threshold 5).
    • Total Number of Predictions: The total number of images in the test set.

5.3. Baselines

The LITA method is compared against a range of existing state-of-the-art models, including methods from both general Image Aesthetics Assessment (IAA) and specialized Artistic Image Aesthetics Assessment (AIAA). These baselines are chosen to represent diverse approaches in the field.

IAA Methods:

  • NIMA [28]: Neural Image Assessment, a widely recognized deep learning baseline for general image aesthetics.
  • MPada [26]: Multi-Path Attention-aware Deep Aesthetic model.
  • MLSP [11]: Multi-Level Spatially Pooled features.
  • UPF [36]: Unified Perception Framework.
  • BIAA [39]: Bidirectional Image Aesthetic Assessment.
  • HLA-GCN [25]: Hierarchical Layout-Aware Graph Convolutional Network.
  • TANet [9]: Theme-Aware Network.
  • EAT [8]: Efficient Attention Transformer.

AIAA Methods:

  • SAAN [34]: Style-specific Art Assessment Network, proposed with the BAID dataset.

  • TSC-Net [32]: Theme-Style-Color guided Artistic Image Aesthetics Assessment Network.

  • SSMR [27]: Style-Specific Multimodal Regression.

    These baselines are representative as they cover various deep learning architectures, attention mechanisms, and some specific multimodal approaches within IAA and AIAA. Comparing against them demonstrates LITA's performance relative to both general aesthetic models and those tailored for art.

5.4. Implementation Details

  • LMM for Comment Generation: LLaVA-1.6 [16] is used to generate the stylistic and aesthetic comments for artworks.
  • Image Encoders: Two Vision Transformer (ViT) models are used as the style and aesthetic image encoders. These ViT models are pre-trained on the ImageNet dataset [4].
  • Text Encoder: A BERT model [5] is used as the text encoder. Its parameters are frozen during training to preserve its pre-trained knowledge.
  • Image Preprocessing: Images are scaled to $224 \times 224$ pixels. Other image manipulations like cropping and rotation are avoided because previous IAA works [3, 11] have shown that such augmentations can alter aesthetic information and negatively impact training.
  • Training:
    • Epochs: 15
    • Batch Size: 64
    • Optimizer: Adam
    • Learning Rate: 0.0001
  • Loss Weight: The hyperparameter $\lambda$ in the total loss function (Equation 4) is set to 0.35, balancing the regression loss with the image-text alignment losses.
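Tying these implementation details together, a hedged sketch of the training loop (it reuses `style_encoder`, `aes_encoder`, `score_head`, and `training_step` from the sketches in Section 4; the synthetic dataset stands in for BAID images with precomputed, frozen BERT embeddings of the LLaVA comments):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

EPOCHS, BATCH_SIZE, LR = 15, 64, 1e-4

# Placeholder data for illustration; a real run would load BAID and cached comment features.
train_set = TensorDataset(
    torch.randn(256, 3, 224, 224),  # images
    torch.randn(256, 768),          # t_style (frozen BERT [CLS] of style comments)
    torch.randn(256, 768),          # t_aes   (frozen BERT [CLS] of aesthetic comments)
    torch.rand(256) * 10,           # ground-truth scores on the 0-10 scale
)
loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

params = (list(style_encoder.parameters())
          + list(aes_encoder.parameters())
          + list(score_head.parameters()))
optimizer = torch.optim.Adam(params, lr=LR)

for epoch in range(EPOCHS):
    for images, t_style, t_aes, scores in loader:
        loss = training_step(images, t_style, t_aes, scores)  # sketch from Sec. 4.2.4
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```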

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that LITA achieves state-of-the-art performance on the BAID dataset, particularly excelling in PLCC and Accuracy metrics, and showing competitive SRCC.

The following are the results from Table 1 of the original paper:

Methods SRCC↑ PLCC↑ Acc (%)↑
NIMA [28] 0.393 0.382 71.01
MPada [26] 0.437 0.425 74.33
MLSP [11] 0.441 0.430 74.92
UPF [36] 0.427 0.431 73.58
BIAA [39] 0.389 0.376 71.61
HLA-GCN [25] 0.405 0.412 72.57
TANet [9] 0.453 0.437 75.45
EAT [8] 0.486 0.495 77.23
SAAN [34] 0.473 0.467 76.80
TSC-Net [32] 0.480 0.479 76.97
SSMR [27] 0.508 0.558 77.72
Ours 0.490 0.573 78.91

As shown in Table 1, LITA demonstrates superior performance:

  • PLCC: LITA achieves the highest PLCC of 0.573. This is a significant improvement of 0.015 over the previous best (SSMR at 0.558), indicating a stronger linear correlation between LITA's predicted scores and the ground-truth scores. This suggests LITA provides more accurate magnitude predictions.

  • Accuracy: LITA also achieves the highest Accuracy of 78.91%, surpassing SSMR (77.72%) by 1.19%. This means LITA is better at correctly classifying artworks into "attractive" or "unattractive" categories.

  • SRCC: While SSMR holds the top SRCC score of 0.508, LITA's SRCC of 0.490 is highly competitive and ranks second. This indicates that LITA is very good at ranking artworks by their aesthetic appeal, even if not perfectly linear.

    The strong performance of LITA across these metrics validates the effectiveness of its LMM-guided image-text alignment approach. By leveraging LLaVA's understanding of artistic style and aesthetics, LITA's vision encoders learn richer features, leading to more accurate and nuanced aesthetic assessments compared to purely visual models and even previous multimodal AIAA methods.

6.2. Comparison of LMM Usage

The paper further analyzes the contribution of its specific LMM usage strategy by comparing LITA with alternative ways of incorporating textual information from LLaVA.

The following are the results from Table 2 of the original paper:

Method SRCC↑ PLCC↑ Acc(%)↑
ft 0.262 0.292 76.20
fi-t 0.440 0.556 78.69
LITA 0.490 0.573 78.91

Here, the table presents a comparison with two alternative models:

  • $f_{\mathrm{t}}$ (text-only): A model that uses only the text features ($\mathbf{t}_{\mathrm{style}}$ and $\mathbf{t}_{\mathrm{aes}}$) generated by LLaVA, concatenates them, and feeds them into a fully-connected layer for aesthetic score regression.

  • $f_{\mathrm{i-t}}$ (image-text concatenation): A model that extracts both image features ($\mathbf{v}_{\mathrm{style}}$ and $\mathbf{v}_{\mathrm{aes}}$) and text features ($\mathbf{t}_{\mathrm{style}}$ and $\mathbf{t}_{\mathrm{aes}}$), concatenates all of them, and then uses a fully-connected layer for prediction. In this model, unlike LITA, the BERT parameters (text encoder) are also optimized during training.

    Analysis:

  • Text-only ($f_{\mathrm{t}}$): This model performs poorly (SRCC 0.262, PLCC 0.292, Acc 76.20%), indicating that text comments alone, even from an LMM, are insufficient for robust aesthetic assessment; the visual information remains primary.

  • Image-text Concatenation ($f_{\mathrm{i-t}}$): This model performs significantly better than text-only, achieving SRCC 0.440, PLCC 0.556, and Acc 78.69%, which highlights the value of combining image and text information. However, this model would require LLaVA (or an equivalent) to generate comments during inference, adding computational overhead.

  • LITA: LITA (SRCC 0.490, PLCC 0.573, Acc 78.91%) outperforms both $f_{\mathrm{t}}$ and $f_{\mathrm{i-t}}$ across all metrics.

    • The outperformance over $f_{\mathrm{i-t}}$ is particularly telling. Even though $f_{\mathrm{i-t}}$ directly uses text features at inference (which LITA does not), LITA's alignment strategy, in which LLaVA's knowledge is implicitly transferred to the image encoders, proves more effective.
    • This comparison strongly validates LITA's approach: the image-text alignment effectively enhances the image feature extraction without needing the LMM at inference, offering both superior performance and practical efficiency.

6.3. Ablation Studies

The paper conducts ablation studies to understand the contribution of different loss functions and the impact of using multiple image encoders.

The following are the results from Table 3 of the original paper:

#Image encoders  L_reg  L_d^style  L_d^aes  SRCC↑  PLCC↑  Acc(%)↑
1  ✓  -  -  0.435  0.541  78.59
1  ✓  ✓  -  0.438  0.541  78.55
1  ✓  -  ✓  0.484  0.567  78.25
2  ✓  -  -  0.430  0.551  78.03
2  ✓  ✓  ✓  0.490  0.573  78.91

Analysis of the ablation study:

  • Baseline (only $\mathcal{L}_{\mathrm{reg}}$):
    • With 1 image encoder, optimizing only $\mathcal{L}_{\mathrm{reg}}$ yields SRCC 0.435, PLCC 0.541, Acc 78.59%.
    • With 2 image encoders, optimizing only $\mathcal{L}_{\mathrm{reg}}$ yields SRCC 0.430, PLCC 0.551, Acc 78.03%. Interestingly, using two encoders without alignment losses does not significantly improve performance over one encoder, and even slightly reduces SRCC and Acc. This implies that simply having two separate encoders, without specific guidance toward style and aesthetic features, is not inherently beneficial.
  • Effect of Alignment Losses ($\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$, $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$):
    • Single Image Encoder:
      • Adding $\mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$ (1 encoder, $\mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{d}}^{\mathrm{style}}$) slightly improves SRCC (0.438 vs. 0.435) but not PLCC or Acc.
      • Adding $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$ (1 encoder, $\mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$) brings a substantial gain: SRCC jumps to 0.484 (from 0.435) and PLCC to 0.567 (from 0.541). This highlights that the aesthetic distance loss is particularly effective in improving the model's predictive power.
    • Two Image Encoders (Full LITA):
      • The full LITA model (2 encoders, $\mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{d}}^{\mathrm{style}} + \mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$) achieves the best performance: SRCC 0.490, PLCC 0.573, Acc 78.91%.
      • Compared to the 2-encoder baseline (only $\mathcal{L}_{\mathrm{reg}}$), adding both image-text alignment losses leads to significant improvements: SRCC increases by 0.060 (from 0.430 to 0.490) and PLCC by 0.022 (from 0.551 to 0.573).
  • Conclusion from Ablation: The ablation study clearly demonstrates that the image-text alignment losses (especially $\mathcal{L}_{\mathrm{d}}^{\mathrm{aes}}$) are crucial for LITA's strong performance. They enable the image encoders to extract richer image features by incorporating the LMM's conceptual understanding of artistic style and aesthetics. The number of image encoders (one vs. two) matters less than the presence and combination of the alignment losses, suggesting that the loss functions are the primary drivers of the improvement by effectively guiding feature learning. The accuracy metric shows less variability due to its coarser binary nature.

6.4. Case Study

To gain qualitative insights into LITA's performance, the paper presents examples of successful and unsuccessful predictions.

Fig. 3. Successful and unsuccessful cases.
The figure contrasts successful cases (top), where predicted scores closely match the ground truth, with unsuccessful cases (bottom), where predictions deviate substantially from the ground truth.

As shown in Figure 3, the case study reveals:

  • Successful Cases (Figure 3a):
    • LITA performs well in assessing paintings of women, correctly predicting high aesthetic scores for them. The paper notes that such paintings often tend to have high aesthetic scores, and LITA captures this trend.
    • Crucially, LITA also accurately assesses abstract paintings, which are generally considered difficult to evaluate due to their non-representational nature. This suggests LITA successfully captures abstract artistic qualities.
  • Unsuccessful Cases (Figure 3b):
    • Failure cases often involve LITA predicting very high scores for artworks that have only moderate ground-truth scores. This could indicate instances where LITA overestimates the aesthetic appeal or misinterprets certain elements.

6.5. Visualization

The paper visualizes the attention maps of the image encoders to illustrate how LITA's image-text alignment influences which parts of an image the model focuses on. This is compared to a baseline ViT model trained only with $\mathcal{L}_{\mathrm{reg}}$ to predict aesthetic scores.

Fig. 4. Attention map visualization of the baseline and our model. The baseline only captures the main object while our model pays attention to both the main object and the background.
The figure shows, for several artworks (left: original image with ground-truth and predicted scores), the attention maps of the baseline model and of the proposed model's style and aesthetic encoders, illustrating the proposed model's broader and more detailed attention.

As seen in Figure 4, the attention map visualization provides compelling evidence for LITA's effectiveness:

  • Baseline Model: In examples (a) and (b), the baseline ViT model primarily pays attention to only the main object (e.g., the face of a woman). In examples (c) and (d), for landscapes or abstract art, the baseline model focuses on some isolated points but fails to capture the overall whole area or context. This behavior is typical for ViT models pre-trained on ImageNet for object recognition, where the primary goal is to identify prominent objects.
  • LITA's Image Encoders (Style and Aesthetic): In contrast, LITA's style image encoder and aesthetic image encoder demonstrate a broader and more conceptual understanding.
    • For examples (a) and (b), LITA's encoders attend not only to the main object but also to the background and shadows, indicating a more holistic visual analysis relevant to aesthetics and style.

    • For examples (c) and (d) (landscape/abstract art), LITA's encoders pay attention to various regions like the sky, water surface, ground, and boundaries. This suggests they are capturing compositional elements and overall visual flow, rather than just isolated objects.

      Insight from Visualization: This visualization confirms that the LMM-guided image-text alignment successfully incorporates LLaVA's conceptual comprehension of paintings into the image encoders. This enables LITA's vision models to move beyond mere object recognition and effectively capture abstract concepts crucial for AIAA, which is a significant limitation of pure vision models.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LMM-guided Image-Text Alignment (LITA), a novel framework for Artistic Image Aesthetics Assessment (AIAA). The core innovation lies in leveraging Large Multimodal Models (LMMs), specifically LLaVA, to generate rich textual comments about an artwork's style and aesthetics during the training phase. These LLaVA-generated texts serve as a guiding signal to align the features extracted by dedicated style and aesthetic vision encoders. This image-text alignment process enables the vision encoders to internalize and comprehend abstract concepts like artistic style and aesthetics, which are often overlooked by traditional pure vision models.

A significant practical advantage of LITA is that the LMM is only utilized during training, making the inference process computationally efficient as it relies solely on the trained vision encoders. Experimental results on the Boldbrush Artistic Image Dataset (BAID) unequivocally demonstrate LITA's effectiveness, outperforming existing AIAA methods in Pearson linear correlation coefficient (PLCC) and binary classification accuracy. The qualitative analysis through attention maps further validates that LITA's image encoders attend to broader, more abstract artistic elements beyond just main objects, thereby enriching the visual feature representation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Quality of LLaVA Descriptions: A primary limitation is that LLaVA's descriptions of aesthetics are "sometimes inappropriate." It is observed that some comments merely describe objects present in the image rather than offering insightful, abstract thoughts or emotions about the artwork's aesthetic qualities.
    • Future Work: The authors suggest the need for more insightful comments from LMMs that truly capture abstract thoughts or emotions, rather than just object explanations. This implies improving the LMM's "art criticism" capabilities or developing more sophisticated prompting strategies.
  • Simplicity of Image Encoder Architecture: The paper used a simple ViT model (pre-trained on ImageNet) to validate the image-text alignment concept. While effective, it might not be the optimal architecture for AIAA.
    • Future Work: It is preferable to construct specific networks tailored for better style and aesthetic feature extraction, potentially more complex or specialized vision architectures designed for artistic image analysis.

7.3. Personal Insights & Critique

This paper presents a very insightful and practical approach to a challenging problem. Here are some personal insights and critiques:

  • Strengths:

    • Novelty in LMM Application: The idea of using an LMM as an "art critic" to generate training data (textual guidance) for a vision-only inference model is highly innovative. It elegantly bypasses the inference-time text unavailability problem that plagued earlier multimodal IAA approaches.
    • Addressing Abstractness: LITA directly tackles the long-standing challenge of modeling abstract concepts in art aesthetics by grounding visual features in LMM's nuanced textual understanding. The attention map visualizations effectively demonstrate this.
    • Computational Efficiency: The training-only LMM usage ensures that the final deployed model is efficient, which is crucial for real-world applications on social media platforms.
    • Strong Performance: The significant performance gains on PLCC and Accuracy are compelling evidence of the method's effectiveness.
  • Potential Issues / Unverified Assumptions / Areas for Improvement:

    • LMM Bias and Hallucination: The quality of LITA's training signal is entirely dependent on LLaVA's "art criticism." LMMs are known to hallucinate or express biases. If LLaVA generates inaccurate or biased aesthetic judgments, this could be ingrained into LITA's vision encoders, potentially perpetuating undesirable aesthetic standards or missing genuinely novel artistic expressions. The "inappropriate comments" limitation mentioned by authors is a hint at this.

    • Domain Shift for LMM: While LLaVA is trained on diverse data, art (especially abstract art) can be highly subjective and culturally specific. How well LLaVA generalizes its "aesthetic sense" across different artistic periods, cultures, and styles is an unverified assumption.

    • Complexity of Two Encoders: The ablation study showed that two encoders without alignment only marginally (or negatively) impacted performance. While the alignment makes two encoders work, it's worth exploring if a single, more sophisticated vision encoder, guided by both style and aesthetic comments, could achieve similar (or better) results with less model complexity. The concatenation of two separate [CLS] tokens might not be the most optimal way to integrate style and aesthetic information.

    • Interpretability of LMM "Aesthetics": While LLaVA provides text, its internal reasoning for an aesthetic judgment is still a black box. This limits the interpretability of why LITA finds certain features aesthetically pleasing, beyond simply learning to correlate them with LLaVA's descriptions.

    • Transferability to Other Domains: The concept of using LMM-generated textual guidance for visual feature learning could be highly transferable. For instance, in medical image analysis, LMMs could generate descriptions of "pathological features" to guide diagnostic models, or in material science, "material properties" to guide material classification. The core idea of "knowledge distillation" from LMM to a specialized vision model is powerful.

      In conclusion, LITA represents a significant step forward in AIAA by intelligently harnessing the power of LMMs. While its dependence on the quality and objectivity of LMM's generated comments presents a future research avenue, the framework itself is robust and demonstrates the immense potential of multimodal AI in understanding complex human concepts like art aesthetics.
