A Challenging Benchmark of Anime Style Recognition
TL;DR Summary
This paper presents a challenging benchmark for Anime Style Recognition (ASR), aiming to determine if two images of different characters are from the same work. A large dataset (LSASRD) is introduced, alongside a cross-role evaluation protocol, revealing the need for deeper ASR research.
Abstract
Given two images of different anime roles, anime style recognition (ASR) aims to learn abstract painting style to determine whether the two images are from the same work, which is an interesting but challenging problem. Unlike biometric recognition, such as face recognition, iris recognition, and person re-identification, ASR suffers from a much larger semantic gap but receives less attention. In this paper, we propose a challenging ASR benchmark. Firstly, we collect a large-scale ASR dataset (LSASRD), which contains 20,937 images of 190 anime works and each work at least has ten different roles. In addition to the large-scale, LSASRD contains a list of challenging factors, such as complex illuminations, various poses, theatrical colors and exaggerated compositions. Secondly, we design a cross-role protocol to evaluate ASR performance, in which query and gallery images must come from different roles to validate an ASR model is to learn abstract painting style rather than learn discriminative features of roles. Finally, we apply two powerful person re-identification methods, namely, AGW and TransReID, to construct the baseline performance on LSASRD. Surprisingly, the recent transformer model (i.e., TransReID) only acquires a 42.24% mAP on LSASRD. Therefore, we believe that the ASR task of a huge semantic gap deserves deep and long-term research. We will open our dataset and code at https://github.com/nkjcqvcpi/ASR.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "A Challenging Benchmark of Anime Style Recognition".
1.2. Authors
The authors of the paper are:
- Haotang Li
- Shengtao Guo
- Kailin Lyu
- Xiao Yang
- Tianchen Chen
- Jianqing Zhu
- Huanqiang Zeng*
All authors are affiliated with the College of Engineering, Huaqiao University. The research background appears to be in computer vision, particularly in areas like image recognition, re-identification, and deep learning, applied to specific domains like anime.
1.3. Journal/Conference
The paper was published at the CVPR 2022 Workshops (Vision for Design and Art workshop track). CVPR (Conference on Computer Vision and Pattern Recognition) is one of the top-tier conferences in the field of computer vision, highly reputable and influential. Workshops associated with CVPR often focus on emerging or specialized topics within the broader field.
1.4. Publication Year
The paper was published in 2022.
1.5. Abstract
The paper introduces Anime Style Recognition (ASR), a task that aims to determine if two anime images, potentially of different roles, originate from the same anime work by learning abstract painting styles. Unlike biometric recognition, ASR faces a much larger semantic gap. To address this, the authors propose a challenging ASR benchmark. First, they release a large-scale ASR dataset (LSASRD) comprising 20,937 images from 190 anime works, with each work having at least ten different roles. This dataset incorporates challenging factors such as complex illuminations, various poses, theatrical colors, and exaggerated compositions. Second, they design a cross-role protocol for evaluation, ensuring that query and gallery images are from different roles, thereby forcing models to learn abstract style rather than role-specific features. Finally, they establish baseline performance on LSASRD using two powerful person re-identification methods, AGW and TransReID. Surprisingly, TransReID, a recent transformer-based model, achieves only a 42.24% mAP. This indicates that ASR, due to its significant semantic gap, warrants extensive and long-term research. The dataset and code are made publicly available.
1.6. Original Source Link
The official source link for the paper is: https://openaccess.thecvf.com/content/CVPR2022W/VDU/papers/Li_A_Challenging_Benchmark_of_Anime_Style_Recognition_CVPRW_2022_paper.pdf
It is published as part of the CVPR 2022 Workshops proceedings.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is Anime Style Recognition (ASR). This involves learning the abstract painting style of anime works to determine if two images, even of different characters (roles), come from the same anime series.
This problem is important because anime has become a global cultural phenomenon with a rapidly growing audience. Understanding and recognizing anime styles automatically is a crucial step towards advanced image understanding mechanisms for artificial intelligence, especially given the complex and diverse visual information present in anime works (e.g., complex illuminations, various poses, theatrical colors, exaggerated compositions).
Existing research largely focuses on biometric recognition (like face or person re-identification), which aims to identify specific individuals. These tasks typically deal with a semantic gap where variations are within a single identity (e.g., different angles of the same face). However, ASR presents a much larger semantic gap because it requires recognizing a holistic style across different characters from the same work, and distinguishing it from other works, which can have visually similar characters but distinct underlying styles. This problem has received less attention despite its complexity.
The paper's entry point is the recognition that existing datasets and methods are inadequate for this ASR task. Current anime datasets often focus on specific sub-tasks like face detection, text detection, or color rendering, and lack rich annotations for style or work-level identification across roles. The innovative idea is to create a challenging benchmark specifically designed to push models to learn abstract painting styles rather than just character identities.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Proposal of a Challenging ASR Benchmark: This is the overarching contribution, highlighting the difficulty and importance of ASR as a research problem.
- Collection of a Large-Scale ASR Dataset (LSASRD):
  - LSASRD is a novel dataset specifically designed for ASR, containing 20,937 images of 1,829 roles from 190 anime works.
  - It includes comprehensive metadata (year, region, staff, gender, race) and captures a wide range of challenging factors such as complex illuminations, various poses, theatrical colors, and exaggerated compositions.
  - Crucially, each work in LSASRD contains at least ten different roles, enforcing the need for style learning over identity learning.
- Design of a Cross-Role Evaluation Protocol:
  - This protocol strictly separates query and gallery images, ensuring they come from different roles within the same work. This is a critical design choice to validate whether a model learns abstract painting style rather than merely discriminative features of roles (i.e., character identities).
  - It employs standard re-identification metrics (mINP, mAP, CMC) adapted for the ASR context.
- Establishment of Strong Baselines using SOTA Person Re-ID Methods:
  - The paper applies two powerful person re-identification (Re-ID) methods, AGW (a CNN-based approach) and TransReID (a Transformer-based approach), to establish baseline performance on LSASRD.
- Key Findings:
  - The most significant finding is that even state-of-the-art person re-identification methods perform surprisingly poorly on LSASRD. For instance, TransReID achieved only 42.24% mAP.
  - This poor performance underscores the enormous semantic gap present in ASR and demonstrates that current methods are inadequate for extracting the abstract attributes necessary for recognizing anime styles across different characters.
  - The results suggest that ASR is a deeply challenging problem that requires dedicated, long-term research efforts beyond adaptations of existing Re-ID techniques.

These contributions collectively address the gap in ASR research by providing a robust dataset and evaluation framework, while simultaneously revealing the significant challenges that remain.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following concepts:
- Anime Style Recognition (ASR): The core task, defined in this paper, where the goal is to determine if two images of potentially different anime characters (roles) originate from the same anime work (e.g., a series or movie) by identifying common stylistic elements. This is distinct from recognizing specific characters.
- Semantic Gap: This refers to the difference between the information that can be extracted from raw data (like pixels in an image) and the high-level semantic meaning that humans can interpret. In ASR, the semantic gap is large because models must bridge the visual differences between distinct characters (even if from the same work) to recognize an abstract, underlying artistic style.
- Biometric Recognition: A field of computer vision focused on automatically recognizing individuals based on their unique biological and behavioral characteristics. Examples include face recognition (identifying a person from their face), iris recognition (identifying from iris patterns), and person re-identification (re-identifying the same person across different camera views or times). These tasks primarily focus on identity context.
- Deep Metric Learning: A machine learning paradigm where the goal is to learn an embedding space where similar samples are close together and dissimilar samples are far apart. This is often used in tasks like person re-identification to embed images into a feature space where distances reflect identity similarity. The paper adapts this approach for style similarity.
- Convolutional Neural Networks (CNNs): A class of deep neural networks widely used in image processing. CNNs are designed to automatically and adaptively learn spatial hierarchies of features through convolutional layers, pooling layers, and fully connected layers. They are very effective at extracting visual features from images. ResNet (Residual Network) is a common and powerful CNN architecture.
- Transformers: A type of neural network architecture that gained prominence in natural language processing (NLP) and has recently been adapted for computer vision (Vision Transformer, ViT). Transformers rely heavily on the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence (or image patches) when processing each part. They are known for capturing long-range dependencies.
- Person Re-identification (Re-ID): A specific biometric recognition task aiming to identify the same person across a network of non-overlapping cameras. It addresses challenges like changes in viewpoint, illumination, pose, and occlusions. The paper uses AGW and TransReID, which are state-of-the-art Re-ID methods, as baselines.
- Evaluation Metrics (mAP, mINP, CMC): These are standard metrics used in re-identification and retrieval tasks to quantify model performance. mAP (mean Average Precision) measures the average precision across all queries. mINP (mean Inverse Negative Penalty) assesses the ability to retrieve the hardest correct match. CMC (Cumulative Matching Characteristics) curves show the probability of finding a correct match within the top-k retrieved results. These will be defined mathematically in Section 5.
3.2. Previous Works
The paper contextualizes its contribution by discussing prior work in related fields:
- Biometric Recognition: The paper explicitly contrasts ASR with common biometric recognition tasks like face recognition [25], iris recognition [23], and person re-identification [37]. These tasks focus on learning identity context (i.e., who is this person?), where one category corresponds to a single individual with multiple images. ASR, in contrast, defines one category as an entire anime work, which includes multiple distinct characters (roles), each with multiple images. This makes ASR inherently more complex due to intra-class variations (variations within the same style/work).
- Art Understanding: Some research has focused on understanding visual arts.
  - Mao et al. [17] proposed DeepArt to learn joint representations of content and style in visual arts.
  - Garcia et al. [7] introduced Text2Art for multimodal art retrieval.
  - Shen et al. [28] developed a spatially-consistent feature learning method for art collections.
  - Other works focus on style descriptions [8] or classification [3] of single art pieces. The paper argues that anime cannot be simply treated as fine art because its free design incorporates extreme elements (complex illuminations, exaggerated compositions) that make it distinct and more challenging.
- Existing Anime Image Datasets: Several datasets related to anime images exist, but none directly address the ASR task as defined in this paper.
  - Manga109 [1] by Aizawa et al. contains scanned comic books with annotations for text boxes and comic panels, primarily for text detection and image splitting.
  - iCartoonFace [41] by Zheng et al. is for cartoon face detection and recognition. While large, it often contains consecutive frames with similar content and sometimes lacks faces, making it unsuitable for style recognition across diverse roles.
  - Nico-illust [13] by Ikuta et al. consists of painting images for anime color rendering.
  - Danbooru2021 [2] and COMICS [14] are other datasets with different applications like rendering or plot prediction, but they lack the specific work and role annotations needed for ASR.

The following table (Table 1 from the original paper) summarizes the comparison of these datasets:
| Dataset | iCartoonFace [41] | Danbooru2021 [2] | Manga109 [1] | COMICS [14] | nico-illust [13] | LSASRD |
| Images | 389,678 | 400k | 21,142 | 1.2m | 400k | 20,937 |
| Roles | 5,013 | Not annotated | 500k | Not annotated | Not annotated | 1,829 |
| Works | Not annotated | Not annotated | 109 | Not annotated | Not annotated | 190 |
| Applications | Face Detection and Recognition | Color Rendering | Text Detection | Plot Prediction | Color Rendering | Style Recognition |
As can be seen from the comparison, LSASRD is unique in its focus on Style recognition with 190 annotated Works and 1,829 Roles.
3.3. Technological Evolution
The field of image understanding has seen significant advancements, starting from traditional feature engineering methods to modern deep learning approaches.
- Early methods: Relied on hand-crafted features for recognition and retrieval tasks.
- CNN era: The advent of CNNs revolutionized computer vision by enabling automatic feature learning from raw pixel data. Architectures like ResNet [10] became standard for tasks like image classification, object detection, and person re-identification (e.g., AGW [37]).
- Transformer era: More recently, Transformers, initially successful in Natural Language Processing (NLP), have been adapted for vision tasks (Vision Transformers or ViT [5]). These models, like TransReID [11], leverage self-attention to model global dependencies within images, often achieving state-of-the-art results by capturing more holistic features compared to CNNs that might focus more on local patterns.

This paper's work fits into the current deep learning era, specifically exploring the capabilities of advanced CNN and Transformer architectures for a novel and challenging ASR task, pushing the boundaries of what these models can semantically understand.
3.4. Differentiation Analysis
The core differences and innovations of this paper's approach, compared to related work, are:
- Novel Task Definition: Unlike biometric recognition, which identifies individuals, or general art understanding, which might categorize broad art styles, ASR specifically focuses on identifying anime works by their unique painting style across different characters. This addresses a previously underexplored semantic gap.
- Purpose-Built Dataset (LSASRD): Existing anime datasets are either too specific (e.g., Manga109 for text, iCartoonFace for faces within a video, Nico-illust for coloring) or lack the crucial work and role annotations needed for ASR. LSASRD is the first large-scale dataset explicitly designed for ASR, featuring diverse works, numerous roles per work, and challenging visual factors.
- Rigorous Cross-Role Protocol: This evaluation strategy is a key innovation. By ensuring that query and gallery images are from different roles within the same work, the protocol directly forces models to learn abstract painting styles rather than simply matching character identities or memorizing specific facial features. This is a crucial distinction from person re-identification, where the goal is to match the same person.
- Revealing Fundamental Limitations: By applying state-of-the-art person re-identification methods (AGW, TransReID), the paper not only establishes baselines but, more importantly, reveals that these powerful models perform poorly. This highlights a fundamental limitation of current deep learning models in handling the huge semantic gap and abstract style recognition required for ASR, differentiating it from papers that only report improvements on existing benchmarks.

The paper's Figure 2, shown below, gives examples from different datasets, visually illustrating why existing datasets are not suitable for ASR.
This image is a schematic showing four different anime-style datasets, including iCartoonFace, Danbooru2021, Manga109, and LSASRD. Each dataset contains examples of various anime characters, presenting different artistic forms and stylistic characteristics.
Figure 2 from the original paper shows how datasets like iCartoonFace focus on faces, Danbooru2021 on general illustrations, and Manga109 on manga pages, while LSASRD provides diverse character images within specific anime works for style recognition.
4. Methodology
4.1. Principles
The core idea of the methodology is to establish a rigorous benchmark for Anime Style Recognition (ASR). This involves two main principles:
- Creating a Representative and Challenging Dataset: The Large-Scale Anime Style Recognition Dataset (LSASRD) is designed to capture the diversity and complexity of anime styles across different works, eras, and regions, including challenging visual factors. Crucially, it must contain multiple distinct characters (roles) within each work to prevent models from simply learning character identities.
- Designing a Robust Evaluation Protocol: The cross-role protocol ensures that models are evaluated on their ability to recognize abstract painting style rather than specific character features. This is achieved by strictly separating query and gallery images so that they belong to different roles, even if from the same anime work. The evaluation uses standard re-identification metrics to quantify performance in this style-matching context.

By adhering to these principles, the paper aims to provide a clear framework for advancing research into ASR, pushing deep learning models towards a deeper understanding of abstract visual semantics.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology encompasses the construction of the LSASRD dataset, the design of the cross-role protocol for evaluation, and the selection and adaptation of baseline models.
4.2.1. Large-Scale Anime Style Recognition Dataset (LSASRD)
Overview:
LSASRD is a well-labeled and challenging dataset created to facilitate ASR research. It comprises images collected from 190 anime and cartoon works spanning 93 years (1928 to 2021) and originating from 13 different countries and regions. It includes both 2D and 3D animation. For each work, up to ten roles were chosen. The dataset emphasizes context understanding and a wide variety of styles.
The distribution of the dataset regarding time, regions, and role characteristics is shown in Figure 3. For example, two-thirds of works are from China and Japan, and 39% were broadcast between 2010 and 2020. Role statistics show that 58.3% are male, 29.5% are female, 69.2% are human, 16.9% are humanoid (like furry characters), and 13.8% are inhuman (like animals). This diversity is intentional to make the dataset challenging.
The following figure (Figure 3 from the original paper) shows the distribution of time and regions in LSASRD:
This figure is a chart showing the distribution of anime works in LSASRD by decade and region (left), the gender ratio (top right), and the race distribution (bottom right). The left chart shows how the number of works from each region changes from 1920 to 2020, with works from Japan and China clearly dominating. The right charts are pie charts showing the gender ratio of roles (67.9% male) and their race categories (69.2% human).
Figure 3 visually represents the geographical and temporal spread of works in LSASRD, along with gender and human/non-human role distributions, highlighting its diverse nature.
Data Collection and Annotation:
- Work Selection: A list of anime works was fetched from Moegirlpedia and BiliBili. 190 works were selected by weighted random sampling, considering popularity, era, and regional distribution. Serial works or branched works were treated as the same work.
- Role Selection: Both main and supporting characters (roles) were chosen for each work. If a work had fewer than ten roles, all available roles were included. This resulted in 1,829 unique roles.
- Image Acquisition: Images were sourced from picture searching engines and online video websites in the public domain or under fair use. The initial raw data contained approximately 60,000 images of varying quality, including manga, comics, episodes, movies, and even photos of peripheral products and secondary creations.
- Manual Annotation: Five collaborators manually cleaned and annotated the data over about two months. 20,937 images were selected and labeled.
  - Focus on Faces: To control for scene and technical variations, only the face part of each role was used. All images in LSASRD are portraits, cropped and resized to 256px by 256px.
  - Labeling: Images were labeled by the title of the work, and bounding boxes were drawn around faces, with labels indicating which role the face belongs to.
  - Metadata: Handcrafted metadata for each work (year, region, staff) and role (gender, race) was also collected to provide rich contextual information.
The following figure (Figure 4 from the original paper) shows annotations of LSASRD:
This image is a schematic showing the different challenging factors in the ASR benchmark, including color, composition, illumination, low quality, occlusion, and pose. Each factor is illustrated with multiple anime character images, providing researchers with rich visual references.
-
Figure 4 displays examples of cropped and labeled face images from LSASRD, demonstrating the bounding box and role identity annotations for characters from the same work.
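To make the annotation scheme concrete, here is a hypothetical record in Python-dict form; the field names, paths, and values are illustrative assumptions rather than the dataset's actual schema, but they combine the face crop, work-level label, role label, and metadata described above.

```python
# Hypothetical LSASRD-style annotation record; field names and values are
# illustrative only and do not reflect the dataset's real file format.
example_annotation = {
    "image": "works/0042/role_03/000517.png",   # 256x256 face crop
    "work": {
        "title": "Example Work",                # work label used as the ASR "identity"
        "year": 2016,
        "region": "Japan",
        "staff": "Example Studio",
    },
    "role": {"id": 3, "gender": "female", "race": "human"},
    "bbox": [112, 64, 368, 320],                # face box in the original frame (x1, y1, x2, y2)
}
```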
Challenges in LSASRD: The dataset was designed to be challenging due to several factors:
- Complex Image Content Conditions: Images exhibit lower clarity, too bright or too dark conditions, large-area occlusion, exaggerated poses, and special compositions. There are also variations in resolutions, colors, illumination, and angles.
- Complex Image Style Conditions: The dataset includes images of roles from both original works and secondary creations, meaning a single role might appear in different styles. The inclusion of works from multiple regions and eras, and roles spanning human, humanoid, and inhuman categories, prevents models from relying on inherent facial feature patterns for recognition.
- More Complicated than Biometric Recognition Datasets: In biometric datasets, one subject (e.g., a person) has multiple images. In LSASRD, one subject is an anime work, which contains different roles, and each role, in turn, has different images. Models cannot merely learn identity; they must learn high-level common features that define the abstract style of a work across distinct characters.

The following figure (Figure 5 from the original paper) illustrates examples of LSASRD challenges:
This image is a schematic showing the structure of the AGW framework. It contains a ResNet50 backbone network with non-local attention modules, optimizes feature extraction through a weighted regularized triplet loss, and finally computes an ID loss for the recognition objective.
Figure 5 visually demonstrates various challenging factors within LSASRD, such as complex colors, compositions, illuminations, low quality, occlusion, and diverse poses.
4.2.2. Cross-Role Protocol
To rigorously evaluate a model's ability to understand style differentiation, the paper designs a cross-role protocol.
Training Set and Testing Set Division:
The LSASRD images are randomly divided into training and testing sets. The testing set is further split into a query set and a gallery set with a 6:4 ratio based on roles. A critical aspect of this division is that roles present in the query set are strictly not allowed to exist in the gallery set simultaneously. This cross-role division enforces that a deep learning model must holistically learn painting styles rather than identity contexts of specific roles.
The following are the results from Table 2 of the original paper:
| Subset | Train | Gallery | Query |
| Work | 114 | 76 | 76 |
| Role | 1097 | 439 | 293 |
| Image | 12562 | 5025 | 3350 |
Table 2 details the splitting of the LSASRD dataset into Train, Gallery, and Query subsets, showing the distribution of works, roles, and images for each.
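To illustrate the cross-role division, the sketch below splits the test images into query and gallery sets so that no role appears in both. It assumes each image carries (work, role) labels and is only a simplification of the protocol described above, not the authors' released code; the exact per-work ratios in the official split may differ.

```python
import random
from collections import defaultdict

def cross_role_split(samples, query_ratio=0.4, seed=0):
    """Split test images into query/gallery with no role shared between them.

    samples: list of (image_path, work_id, role_id) tuples from the test works.
    Roughly 4 query : 6 gallery by roles within each work, per the protocol.
    """
    rng = random.Random(seed)
    roles_per_work = defaultdict(set)
    for _, work, role in samples:
        roles_per_work[work].add(role)

    query_roles = set()
    for work, roles in roles_per_work.items():
        roles = sorted(roles)
        rng.shuffle(roles)
        k = max(1, round(len(roles) * query_ratio))
        query_roles.update((work, r) for r in roles[:k])  # query-only roles

    query = [s for s in samples if (s[1], s[2]) in query_roles]
    gallery = [s for s in samples if (s[1], s[2]) not in query_roles]
    return query, gallery
```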
5-Fold Cross-Validation:
To reduce bias from data distribution and prevent overfitting, 5-fold cross-validation is employed. The total dataset is split into five folds. In each fold, one part is used as the validation set, and the other four parts constitute the training set. This process is repeated five times, ensuring each fold serves as the validation set once. The mean value of the five evaluations determines the baseline performance.
The following are the results from Table 3 of the original paper:
| Fold-k | 1 | 2 | 3 | 4 | 5 |
| Work | 38 | 38 | 38 | 38 | 38 |
| Role | 364 | 367 | 367 | 366 | 365 |
| Image | 4187 | 4185 | 4188 | 4189 | 4188 |
Table 3 outlines the 5-Fold Cross-Validation Setting, specifying the number of works, roles, and images allocated to each fold.
Performance Metrics:
The performance is evaluated using re-identification-based metrics, specifically mean Inverse Negative Penalty (mINP), mean Average Precision (mAP), and Cumulative Matching Characteristics (CMC) curves. These metrics are designed to assess retrieval tasks where a candidate list is returned based on feature distances.
- Mean Inverse Negative Penalty (mINP):
  - Conceptual Definition: mINP evaluates a model's ability to retrieve the hardest correct match. It provides a complementary measure to standard Re-ID performance metrics by penalizing models that fail to retrieve correct matches efficiently, especially when those matches are difficult to find within the ranked list. A higher mINP indicates better performance.
  - Mathematical Formula:
    $$\mathrm{mINP} = \frac{1}{n} \sum_{q=1}^{n} \left( 1 - \mathrm{NP}_q \right) = \frac{1}{n} \sum_{q=1}^{n} \frac{|G_q|}{R_q^{\mathrm{hard}}}$$
    where $\mathrm{NP}_q$ measures the penalty to find the hardest correct match, and is calculated as $\mathrm{NP}_q = \frac{R_q^{\mathrm{hard}} - |G_q|}{R_q^{\mathrm{hard}}}$.
  - Symbol Explanation:
    - $n$: The total number of queries.
    - $q$: Index for a specific query.
    - $\mathrm{NP}_q$: The Negative Penalty for query $q$.
    - $|G_q|$: The total number of ground-truth correct matches for query $q$.
    - $R_q^{\mathrm{hard}}$: The rank position of the hardest correct match for query $q$. The "hardest" correct match is typically the one that appears latest in the ranked retrieval list.
- Mean Average Precision (mAP):
  - Conceptual Definition: mAP is a widely used metric for information retrieval and object detection tasks. It calculates the mean of the Average Precision (AP) scores for each query. AP itself is the area under the Precision-Recall curve. mAP provides a single-value metric that summarizes the overall ranking quality, considering both precision and recall across all queries. A higher mAP indicates that relevant items are generally ranked higher in the retrieval lists.
  - Mathematical Formula:
    $$\mathrm{mAP} = \frac{1}{n} \sum_{q=1}^{n} AP(q), \qquad AP(q) = \frac{1}{N_{gt}} \sum_{k=1}^{N} P(k)\,\mathrm{rel}(k)$$
  - Symbol Explanation:
    - $n$: The total number of queries.
    - $q$: Index for a specific query.
    - $AP(q)$: The Average Precision for query $q$.
    - $N$: The size of the recall list (the total number of retrieved items).
    - $k$: The rank position in the recall list.
    - $P(k)$: The precision at cutoff $k$ (i.e., the proportion of relevant items among the top $k$ retrieved items).
    - $\mathrm{rel}(k)$: An indicator function that is 1 if the item at rank $k$ is relevant, and 0 otherwise.
    - $N_{gt}$: The total number of relevant items for the current query.
- Cumulative Matching Characteristics (CMC) Curves:
  - Conceptual Definition: CMC curves show the probability of finding a correct match within the top-$k$ retrieved results. Specifically, CMC@Rank($k$) represents the percentage of queries for which a correct match is found among the top $k$ candidates. It is commonly used in person re-identification to evaluate how often the correct identity appears at various ranks in the sorted retrieval list.
  - Mathematical Formula:
    $$\mathrm{CMC@Rank}(k) = \frac{1}{n} \sum_{q=1}^{n} \mathrm{rel}(q, k)$$
  - Symbol Explanation:
    - $n$: The total number of queries.
    - $q$: Index for a specific query.
    - $k$: The rank position.
    - $\mathrm{rel}(q, k)$: An indicator function that equals 1 if a ground-truth correct match for query $q$ appears before or at rank $k$ in the gallery, and 0 otherwise.
4.2.3. Baseline Methods
Two state-of-the-art person re-identification methods are used as baselines: AGW and TransReID.
- AGW (All-around ReID with Global-local, weighted Triplet, and ID loss) [37]: AGW is a strong baseline method for person re-identification that combines robust feature extraction with effective loss functions.
  - Framework Structure: It includes a backbone network, a specific criterion, and an optimized hyper-parameter configuration.
  - Backbones:
    - Default: ResNet50 with a non-local block (ResNet50 NL).
    - Other ResNet variants: ResNet50, ResNet101, ResNet152.
    - IBN-Net (Instance-Batch Normalization): ResNet50 IBN A is a variant that improves generalization by combining instance normalization and batch normalization.
    - Squeeze-and-Excitation (SE) modules [12]: Integrated into ResNet (SE ResNet50, SE ResNet101, SE ResNet152) and ResNeXt (SE ResNeXt50, SE ResNeXt101) to enhance channel-wise feature recalibration.
  - Criterion (Loss Functions):
    - Label Smoothing Cross-Entropy [21]: This loss encourages the logits of the correct class to be distinct from incorrect classes but avoids over-confidence, improving generalization.
    - Weighted Regularized Triplet loss: This loss optimizes the relative distance between positive and negative pairs in the embedding space, without requiring additional margin parameters, making it robust. A sketch of these two loss terms is given after Figure 6 below.

The following figure (Figure 6 from the original paper) shows the AGW framework:
This image is a schematic of the TransReID framework, showing the patch-wise processing of the input, position embedding, linear projection, and the structure of multiple Transformer layers. It includes a global branch and a jigsaw patch branch module, using different loss functions to optimize model performance.
-
Figure 6 illustrates the AGW framework, showing its ResNet50 backbone, non-local attention module, and the use of weighted regularization triplet and ID loss for optimization.
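For readers who want to connect the two AGW loss terms to code, below is a rough PyTorch sketch of label-smoothing cross-entropy and a softmax-weighted, margin-free triplet term in the spirit of the weighted regularized triplet loss. It is a simplified illustration, not the authors' or the original AGW implementation.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy with smoothed one-hot targets (the 'ID loss')."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, eps / (num_classes - 1))
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return (-smooth * log_probs).sum(dim=1).mean()

def weighted_regularized_triplet(features, labels):
    """Softmax-weighted positive/negative distances, no hard margin."""
    dist = torch.cdist(features, features)                     # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = same.float() - torch.eye(len(labels), device=labels.device)
    neg_mask = (~same).float()

    # Harder positives (far) and harder negatives (close) receive larger weights.
    w_pos = torch.softmax(dist.masked_fill(pos_mask == 0, -1e9), dim=1)
    w_neg = torch.softmax((-dist).masked_fill(neg_mask == 0, -1e9), dim=1)
    d_pos = (w_pos * dist * pos_mask).sum(dim=1)
    d_neg = (w_neg * dist * neg_mask).sum(dim=1)
    return F.softplus(d_pos - d_neg).mean()                    # log(1 + exp(.))
```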
- TransReID (Transformer-based Object Re-identification) [11]: TransReID is a recent framework that adapts Transformer models for object re-identification.
  - Backbones:
    - Vision Transformer (ViT) [5] in base and small sizes (ViT-Base, ViT-Small).
    - Data-efficient image Transformers (DeiT) [31] in small size (DeiT-Small).
  - Modules:
    - Jigsaw Patch Module (JPM): This module rearranges patch embeddings through shift and patch shuffle operations. Its purpose is to generate more robust and discriminative features by ensuring broader coverage of the input image. ViT-JPM incorporates this; a sketch of the rearrangement appears after Figure 7 below.
    - Transformer Stride: Variations in the Transformer stride (default 16, also tested with 12) are explored to see their impact on performance (ViT-Stride, DeiT-Stride).
  - Loss Functions:
    - ID Loss: Standard cross-entropy loss for identity classification.
    - Triplet Loss: Used to enforce separation between features of different identities in the embedding space.
    - Specific Loss Calculation: For a batch of samples, the total loss is calculated as the sum of half the loss value of the first sample and half the sum of the loss values of the remaining samples.
The following figure (Figure 7 from the original paper) shows the TransReID framework:
This image is a chart showing the CMC curves of different network structures (such as SE ResNeXt50 and ResNet50) across multiple experiments. Different lines and markers represent each model's performance under different conditions, showing how they perform on the recognition task.
-
Figure 7 presents the TransReID framework, detailing its Transformer architecture with input block processing, position embedding, linear projection, and the integration of a global branch and a jigsaw branch module for feature learning.
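The Jigsaw Patch Module bullet above can be made more tangible with the following rough sketch of the shift and patch-shuffle rearrangement (class token omitted). The shift size, grouping, and padding choice here are illustrative assumptions; the official TransReID code may differ in these details.

```python
import torch

def jigsaw_patch_rearrange(patch_tokens, shift=5, num_groups=4):
    """Shift-and-shuffle rearrangement of patch embeddings, in the spirit of
    TransReID's Jigsaw Patch Module."""
    n = patch_tokens.size(1)
    # 1) Shift: roll the patch sequence so groups do not align with image rows.
    shifted = torch.roll(patch_tokens, shifts=-shift, dims=1)
    # 2) Patch shuffle: interleave the sequence into num_groups groups, so each
    #    group gathers patches spread over the whole image.
    pad = (-n) % num_groups
    if pad:  # pad so the sequence length divides evenly (illustrative choice)
        shifted = torch.cat([shifted, shifted[:, :pad]], dim=1)
    groups = shifted.reshape(shifted.size(0), -1, num_groups, shifted.size(-1))
    return groups.transpose(1, 2)  # (batch, num_groups, tokens_per_group, dim)

# Example: 128 patch tokens of dimension 768 split into 4 jigsaw groups.
tokens = torch.randn(2, 128, 768)
print(jigsaw_patch_rearrange(tokens).shape)  # torch.Size([2, 4, 32, 768])
```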
5. Experimental Setup
5.1. Datasets
The primary dataset used for all experiments is the Large-Scale Anime Style Recognition Dataset (LSASRD), proposed in this paper.
-
Source: Images are collected from publicly available online sources like
Moegirlpedia, BiliBili, picture searching engines, and online video websites.
Scale: Contains 20,937 images.
-
Characteristics:
- Works: 190 unique anime/cartoon works.
- Roles: 1,829 distinct roles across these works, with each work having at least ten different roles (or all available if fewer).
- Diversity: Spans 93 years (1928-2021), 13 countries/regions, and includes 2D and 3D animation.
- Challenging Factors: Images present complex illuminations, various poses, theatrical colors, exaggerated compositions, low clarity, occlusion, and varying resolutions.
- Content: All images are cropped to show only the face part of each role and resized to 256px by 256px.
- Annotations: Richly annotated with work titles, role labels, bounding boxes for faces, and metadata (year, region, staff for works; gender, race for roles).
- Data Example: As shown in Figure 4 in the methodology section,
LSASRD images are face portraits. For example, images show various anime characters' faces, cropped tightly to the head, with bounding boxes and role labels, demonstrating the dataset's specific focus. The diversity of styles and features (e.g., different eye shapes, hair colors, expressions) across these cropped faces from different works (or even different roles within the same work) is crucial.
-
Why LSASRD was chosen:
LSASRD was explicitly created to address the ASR task, filling a gap left by existing anime datasets that are not suitable for style recognition across multiple roles from the same work. Its large scale, diverse characteristics, and the cross-role protocol make it an effective tool for validating models' ability to learn abstract painting styles.
5.2. Evaluation Metrics
The performance of the models is evaluated using standard re-identification metrics: mean Inverse Negative Penalty (mINP), mean Average Precision (mAP), and Cumulative Matching Characteristics (CMC) curves. These metrics are well-suited for retrieval tasks, where the goal is to rank relevant items higher.
- Mean Inverse Negative Penalty (mINP):
  - Conceptual Definition: mINP measures how well a model can retrieve the hardest correct match among all candidates. It specifically penalizes models for pushing hard-to-find relevant items far down the ranked list. A higher value implies that even challenging correct matches are retrieved at relatively good ranks.
  - Mathematical Formula:
    $$\mathrm{mINP} = \frac{1}{n} \sum_{q=1}^{n} \left( 1 - \mathrm{NP}_q \right) = \frac{1}{n} \sum_{q=1}^{n} \frac{|G_q|}{R_q^{\mathrm{hard}}}$$
    where $\mathrm{NP}_q$ measures the penalty to find the hardest correct match and is calculated as $\mathrm{NP}_q = \frac{R_q^{\mathrm{hard}} - |G_q|}{R_q^{\mathrm{hard}}}$.
  - Symbol Explanation:
    - $n$: The total number of queries being evaluated.
    - $q$: A specific query index.
    - $\mathrm{NP}_q$: The negative penalty for query $q$.
    - $|G_q|$: The total count of ground-truth (correct) matches for query $q$ in the gallery.
    - $R_q^{\mathrm{hard}}$: The rank position of the hardest correct match for query $q$. The "hardest" correct match is typically the one that appears furthest down the ranked list of retrieved items.
- Mean Average Precision (mAP):
  - Conceptual Definition: mAP is a comprehensive metric for evaluating the overall quality of ranked retrieval results. It calculates the average of the Average Precision (AP) scores across all queries. AP quantifies the area under the precision-recall curve for a single query. mAP provides a single number that reflects both the precision (how many retrieved items are correct) and recall (how many correct items are retrieved) at various retrieval thresholds.
  - Mathematical Formula:
    $$\mathrm{mAP} = \frac{1}{n} \sum_{q=1}^{n} AP(q), \qquad AP(q) = \frac{1}{N_{gt}} \sum_{k=1}^{N} P(k)\,\mathrm{rel}(k)$$
  - Symbol Explanation:
    - $n$: The total number of queries in the evaluation set.
    - $q$: An index representing a specific query.
    - $AP(q)$: The Average Precision calculated for query $q$.
    - $N$: The total number of items retrieved for a given query (the size of the recall list).
    - $k$: A specific rank position within the retrieved list.
    - $P(k)$: The precision at cutoff $k$, defined as the ratio of relevant items to the total items retrieved up to rank $k$.
    - $\mathrm{rel}(k)$: An indicator function, which is 1 if the item at rank $k$ is relevant to the query, and 0 otherwise.
    - $N_{gt}$: The total number of relevant items (ground-truth matches) for the current query in the entire gallery.
- Cumulative Matching Characteristics (CMC) Curves:
  - Conceptual Definition: CMC curves plot the Rank-k accuracy (also known as Rank-k identification rate or top-k accuracy) against different values of $k$. CMC@Rank($k$) indicates the probability that a correct match for a query is found within the top $k$ retrieved results. These curves are particularly useful for understanding how effectively a model ranks correct matches near the top of its retrieval list.
  - Mathematical Formula:
    $$\mathrm{CMC@Rank}(k) = \frac{1}{n} \sum_{q=1}^{n} \mathrm{rel}(q, k)$$
  - Symbol Explanation:
    - $n$: The total number of queries in the evaluation set.
    - $q$: An index representing a specific query.
    - $k$: The maximum rank being considered (e.g., $k = 1, 5, 10$ for Rank-1, Rank-5, Rank-10).
    - $\mathrm{rel}(q, k)$: An indicator function that returns 1 if at least one ground-truth match for query $q$ is found within the top $k$ positions in the ranked gallery list, and 0 otherwise.
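The following is a minimal NumPy sketch, written as an illustration of the definitions above rather than the benchmark's official evaluation script. It computes AP, the inverse negative penalty, and CMC hits for a single query; the dataset-level mAP, mINP, and CMC curve are the means of these values over all queries.

```python
import numpy as np

def evaluate_query(ranked_match_flags, max_rank=5):
    """Compute AP, INP, and CMC@1..max_rank for one query.

    ranked_match_flags: 1-D boolean array; entry k is True if the k-th ranked
    gallery image comes from the same work as the query.
    """
    flags = np.asarray(ranked_match_flags, dtype=bool)
    num_gt = flags.sum()
    if num_gt == 0:
        raise ValueError("query has no correct match in the gallery")

    ranks = np.arange(1, len(flags) + 1)
    cum_hits = np.cumsum(flags)

    # Average Precision: mean of precision values at each correct match.
    ap = (cum_hits[flags] / ranks[flags]).mean()

    # Inverse Negative Penalty: |G_q| divided by the rank of the hardest
    # (latest) correct match.
    inp = num_gt / ranks[flags][-1]

    # CMC@k: 1 if any correct match appears within the top-k results.
    cmc = [float(cum_hits[min(k, len(flags)) - 1] > 0) for k in range(1, max_rank + 1)]
    return ap, inp, cmc

# Toy example: correct matches at ranks 1 and 4 of a 5-image gallery.
ap, inp, cmc = evaluate_query([True, False, False, True, False])
print(ap, inp, cmc)  # 0.75 0.5 [1.0, 1.0, 1.0, 1.0, 1.0]
```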
5.3. Baselines
The paper uses two state-of-the-art person re-identification methods as baselines to assess performance on the ASR task with LSASRD:
- AGW (All-around ReID with Global-local, weighted Triplet, and ID loss) [37]: This CNN-based method is a strong performer in person re-identification. Various backbones were tested:
  - ResNet50 [10]
  - ResNet50 NL (with Non-Local block) [15]
  - ResNet101 [10]
  - ResNet152 [10]
  - SE ResNet50 [12], SE ResNet101 [12], SE ResNet152 [12] (with Squeeze-and-Excitation modules)
  - SE ResNeXt50 [35], SE ResNeXt101 [35]
  - ResNet50 IBN A [36] (with Instance-Batch Normalization)
- TransReID (Transformer-based Object Re-identification) [11]: This Transformer-based method represents the cutting edge in Re-ID and is explored for its ability to capture global dependencies. Different Transformer configurations were tested:
  - DeiT-Small [31]
  - DeiT-Stride [31] (DeiT with modified Transformer stride)
  - ViT-Small [5]
  - ViT-Base [5]
  - ViT-JPM [11] (ViT with Jigsaw Patch Module)
  - ViT-Stride [5] (ViT with modified Transformer stride)

These baselines are representative because they are leading methods in the closely related field of person re-identification and include both dominant CNN and emerging Transformer architectures, providing a comprehensive assessment of current capabilities for ASR.
5.4. Training Setup
- Data Augmentation: During the training phase, standard data augmentation techniques were applied: random horizontal flip, padding, random crop, normalization, and random erasing. These techniques help improve model generalization and reduce overfitting.
- Optimizer: Mini-batch stochastic gradient descent was used as the optimizer.
- Batch Configuration: Each mini-batch contained 16 distinct roles, and each role had four images, totaling 64 images per batch (16 roles * 4 images/role).
- Training Epochs: Models were trained for 100 epochs.
- Learning Rate Schedulers:
  - For AGW: A multi-step learning rate scheduler with a warm-up phase was employed. The learning rate was increased proportionally over the first ten epochs (warm-up), then dropped to one-tenth of its current value at the 40th and 80th epochs.
  - For TransReID: A cosine learning rate scheduler was used. The margin was set to 0.5, and the scale to 30. The learning rate started at one-third of its peak value and increased proportionally to the peak over the first five epochs (warm-up).
- Other Settings: All other experimental settings followed the default configurations of the original AGW and TransReID implementations.
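As a rough illustration of this setup (the exact padding size, erasing probability, and normalization statistics below are assumptions rather than values stated in the paper), the augmentation pipeline and the 16-role x 4-image batch sampling could look like this in PyTorch:

```python
import torch
from torchvision import transforms

# Augmentation pipeline matching the listed techniques; parameter values are
# illustrative assumptions, not taken from the paper.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(10),
    transforms.RandomCrop((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),
])

def pk_batch_indices(role_to_indices, num_roles=16, images_per_role=4):
    """Sample one P x K mini-batch: 16 roles with 4 images each (64 images).

    role_to_indices: list where entry r holds the dataset indices of role r.
    """
    roles = torch.randperm(len(role_to_indices))[:num_roles].tolist()
    batch = []
    for r in roles:
        idxs = role_to_indices[r]
        pick = torch.randint(len(idxs), (images_per_role,))
        batch.extend(idxs[i] for i in pick.tolist())
    return batch
```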
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that even state-of-the-art re-identification methods (AGW and TransReID) perform poorly on the proposed ASR benchmark, LSASRD. This indicates the challenging nature of ASR and the significant semantic gap that current models struggle to bridge.
For AGW, the ResNet50 backbone achieved the best mINP (12.48%), mAP (40.84%), and Rank5 (88.60%), while ResNet50 NL edged it out on Rank1 (72.50%). Interestingly, deeper ResNet models (ResNet101, ResNet152) did not consistently yield better performance, suggesting that simply increasing depth is not sufficient for this task, possibly due to the need for more specialized feature learning or hyperparameter tuning. SE modules and IBN also showed varying but generally lower performance compared to basic ResNet50.
For TransReID, the ViT-Stride model achieved the best mINP (13.14%), while ViT-Small performed best in mAP (42.76%), Rank1 (76.68%), and Rank5 (91.04%). Although Transformer-based models (like TransReID) are generally powerful for capturing global features, their mAP scores remain low (in the low 40s), indicating a significant challenge in adapting pre-trained ViT models (which are typically trained on concrete natural images) to the abstract nature of anime styles.
The lack of significant performance difference between AGW (CNN-based) and TransReID (Transformer-based) suggests that while these models excel at extracting texture features or identity-specific features, they "lack a mechanism for learning the semantics information in an image" for ASR. This implies that the core challenge lies in understanding abstract attributes and high-level stylistic elements, rather than just extracting low-level visual patterns or person-specific identifiers. The low mAP values (around 40-42%) are remarkably poor for state-of-the-art models in related domains, underscoring that ASR is a task requiring much deeper and specialized research.
6.2. Data Presentation (Tables)
The following are the results from Table 4 of the original paper:
| Backbone | mINP (%) | mAP (%) | R1 (%) | R5 (%) |
| ResNet50 [10] | 12.48 | 40.84 | 72.06 | 88.60 |
| ResNet50 NL [15] | 12.40 | 40.80 | 72.50 | 88.18 |
| ResNet101 [10] | 12.30 | 40.18 | 70.78 | 87.28 |
| ResNet152 [10] | 12.26 | 40.34 | 71.82 | 88.04 |
| SE ResNet50 [12] | 10.76 | 38.10 | 69.44 | 86.90 |
| SE ResNet101 [12] | 10.86 | 38.42 | 70.10 | 86.86 |
| SE ResNet152 [12] | 10.28 | 36.80 | 67.26 | 86.10 |
| SE ResNext50 [35] | 10.90 | 39.44 | 71.52 | 88.10 |
| SE ResNext101 [35] | 9.32 | 37.56 | 71.54 | 88.52 |
| ResNet50 IBN A [36] | 10.90 | 40.74 | 71.52 | 88.10 |
Table 4 presents the performance comparisons of AGW with various CNN-based backbones on the LSASRD dataset across the mINP, mAP, R1, and R5 metrics.
The following are the results from Table 5 of the original paper:
| Model | mINP (%) | mAP (%) | R1 (%) | R5 (%) |
| DeiT-Small [31] | 10.58 | 36.58 | 67.54 | 86.56 |
| DeiT-Stride [31] | 11.34 | 39.66 | 72.32 | 88.22 |
| ViT-Small [5] | 12.48 | 42.76 | 76.68 | 91.04 |
| ViT-Base [5] | 11.34 | 36.70 | 65.70 | 82.98 |
| ViT-JPM [11] | 12.72 | 41.88 | 74.16 | 89.30 |
| ViT-Stride [5] | 13.14 | 42.24 | 74.72 | 89.34 |
Table 5 shows the performance comparisons of TransReID with different Transformer settings on LSASRD using the same set of metrics.
6.3. CMC Curves
The CMC curves provide a more comprehensive view of the retrieval performance across different ranks.
The following figure (Figure 8 from the original paper) shows the CMC curve of several experiments on AGW:
This image is a chart showing the cumulative matching characteristic (CMC) curves of different models (ViT and DeiT) across several experiments. Each model is distinguished by a different marker; the vertical axis is the matching rate and the horizontal axis is the rank. The figure shows the performance differences of the models on the ASR task.
Figure 8 shows the CMC curves for various AGW configurations. It confirms that the performance variations among different ResNet backbones (including those with NL and SE modules) are relatively small, with their curves closely clustered. This suggests that simply changing CNN architecture details within this family does not drastically improve the ability to learn abstract anime styles. The curves rise, but their initial slopes and saturation points are not very high, reinforcing the idea of limited performance.
The following figure (Figure 9 from the original paper) shows the CMC curve of several experiments on TransReID:
This image is a heatmap illustration showing the behavior of the AGW and TransReID methods on four different samples, numbered 1 to 4, with the AGW results on the left and the TransReID results on the right.
Figure 9 displays the CMC curves for different TransReID models. Similar to AGW, the Transformer-based models also show limited performance. ViT-Small generally achieves the highest curve among the TransReID variants, particularly at lower ranks, aligning with its best Rank1 and Rank5 scores. However, the overall performance, as reflected by the curves, still indicates that a significant portion of correct matches are not retrieved at very high ranks, further emphasizing the difficulty of the ASR task for these models.
6.4. Visualization (Heatmaps)
To understand what features the baseline models are focusing on, heatmap visualizations [4] were generated for both AGW (using ResNet50 NL) and TransReID (using the ViT's first LayerNorm layer).
The following figure (Figure 10 from the original paper) shows the heat map of samples:
Figure 10 shows heatmaps for four sample images:
- Samples 1 & 2 (3D Anime): For these
3D animeimages,AGWhighlights very few features, suggesting it struggles to find salient stylistic cues.TransReID, on the other hand, noticesedgesandshadows, indicating it might be picking up on rendering techniques or structural details. - Samples 3 & 4 (2D Anime): For the third sample (a character named Kabuto), both models highlight significant features. For the fourth sample, characterized by a large and unique drawing style of eyes, both models also focus on these prominent features. However,
TransReIDtends to highlight more parts of the sample compared toAGW.
Analysis of Heatmaps:
The heatmaps suggest that models sometimes focus on specific, salient features (like large eyes) rather than a holistic style. For 3D anime or less distinct 2D anime styles, AGW appears to be less effective at identifying relevant areas. TransReID, with its self-attention mechanism, seems to capture a broader context (more parts of the image) and potentially more abstract elements like edges and shadows for 3D styles. However, even with these observations, the overall low performance metrics imply that simply attending to more areas or individual prominent features is not sufficient for truly understanding the abstract painting style across different roles. The models might be picking up on local patterns or character-specific traits rather than the underlying artistic coherence of a work.
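For readers who want to reproduce this kind of visualization, here is a minimal Grad-CAM-style sketch using forward/backward hooks on a chosen CNN layer. It is an assumption-laden stand-in for the paper's actual visualization method [4], and Transformer token activations would additionally need to be reshaped into a spatial grid before display.

```python
import torch

def gradcam_heatmap(model, target_layer, image, class_idx):
    """Class-activation heatmap via gradients at a chosen conv layer (Grad-CAM style)."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        score = model(image.unsqueeze(0))[0, class_idx]   # logit for the chosen work
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    act, grad = activations[0], gradients[0]              # both (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = torch.relu((weights * act).sum(dim=1)).squeeze(0)
    return cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
```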
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces Anime Style Recognition (ASR) as a challenging new benchmark task aimed at evaluating the semantic understanding capability of deep learning models for abstract visual attributes. A primary contribution is the creation of LSASRD, a novel, large-scale dataset containing 20,937 hand-annotated images from 190 diverse anime works and 1,829 roles, enriched with comprehensive metadata and designed to incorporate numerous visual challenges. To rigorously assess ASR performance, the authors devised a cross-role protocol, which mandates that query and gallery images belong to different characters, thereby forcing models to learn abstract painting style instead of individual character identities.
Baseline experiments conducted with state-of-the-art person re-identification methods, AGW (CNN-based) and TransReID (Transformer-based), revealed surprisingly low performance, with TransReID achieving only 42.24% mAP. This striking result underscores that current deep learning models are largely inadequate for extracting the high-level, abstract attributes necessary to bridge the huge semantic gap inherent in ASR. The work provides a robust framework for future research in both model development for semantic understanding and practical applications within anime image processing.
7.2. Limitations & Future Work
The authors implicitly point out several limitations through their findings and discussions:
- Inadequacy of Current Models: The primary limitation is that existing state-of-the-art re-identification methods are unable to achieve satisfactory performance on ASR. This suggests that these models, primarily designed for identity-based recognition or texture feature extraction, lack a mechanism for learning the semantic information of abstract painting styles.
- Huge Semantic Gap: The task's inherent huge semantic gap is a major challenge that current approaches struggle to overcome. This gap refers to the difficulty in recognizing a consistent style across different characters, poses, lighting, and artistic interpretations within the same work.
- Focus on Natural Images for Pre-training: The Transformer-based models (ViT, DeiT) are typically pre-trained on vast amounts of natural images. This pre-training might not adequately prepare them for the highly stylized and abstract nature of anime images, requiring significant adaptation.

Based on these limitations, the authors explicitly state that the ASR task "deserves deep and long-term research." Future work could involve:

- Developing Novel Architectures: Designing models specifically tailored to learn abstract artistic styles, potentially incorporating inductive biases that are more suitable for stylized imagery than those found in models pre-trained on natural images.
- Exploring Anime-Specific Pre-training: Investigating whether pre-training models on large, unannotated collections of diverse anime images could provide a better initialization for ASR compared to natural image datasets.
- Multi-modal Learning: Leveraging the rich metadata (year, region, staff, character attributes) provided with LSASRD to inform and enhance style recognition.
- Explainable AI for Style: Developing methods to visualize and interpret why a model identifies a certain style, moving beyond simple heatmaps to more semantically meaningful explanations of artistic elements.
7.3. Personal Insights & Critique
This paper makes a significant contribution by formalizing Anime Style Recognition as a distinct and challenging problem, backed by a purpose-built dataset and a rigorous evaluation protocol. The finding that SOTA models perform poorly is crucial because it highlights a fundamental gap in current AI's ability to grasp abstract, high-level aesthetic concepts, which is often taken for granted in human perception.
Inspirations and Applications:
- Beyond Identity: The cross-role protocol is an excellent design choice that forces models to move beyond identity recognition to style recognition. This paradigm could be extended to other domains requiring abstract feature learning, such as distinguishing artistic movements, brand styles in marketing, or even underlying themes in diverse literary works (if translated to visual form).
- Content Recommendation and Retrieval: The successful implementation of ASR could revolutionize anime content recommendation systems, allowing users to discover new works based purely on their preferred artistic styles, rather than just genre or character popularity. It would also enable more sophisticated image retrieval for artists or animators seeking specific stylistic references.
- Understanding Human Aesthetics: The difficulty of this task for AI suggests that human style perception involves complex cognitive processes that are not yet well-captured by current deep learning architectures. Research in ASR could offer insights into what constitutes "style" in a quantifiable manner, bridging the gap between computational models and human aesthetic judgment.
Potential Issues and Areas for Improvement:
- Subjectivity of Style: While the cross-role protocol is rigorous, the concept of "style" itself can be subjective. Different human annotators might have slightly varied interpretations of what defines a work's style, especially with secondary creations. The dataset's ground truth relies on the explicit association of images with original works, which is objective, but the nuances of style might still be open to interpretation in borderline cases.
- Limited Contextual Information (Face-only): The decision to crop images to only the face part simplifies the task by removing background clutter and full-body poses. However, a work's style is also heavily influenced by elements beyond the face, such as character design (body proportions, clothing), background art, animation quality, and color palettes for entire scenes. Future work could explore ASR on full-body images or even video frames to capture a more holistic sense of style, albeit with increased complexity.
- Interpretability of Style Features: While the heatmap analysis offers some insight, understanding what specific visual elements (edges, shadows, eye shapes, color gradients) contribute most significantly to a style decision for the model is still challenging. More advanced explainable AI (XAI) techniques could be employed to break down the "black box" and reveal how models learn and differentiate styles.
- Dataset Scale vs. Diversity: While 20,937 images and 190 works is "large-scale" for a manually annotated dataset, the sheer diversity of anime styles and the vast number of anime works globally mean that LSASRD is still a limited sample. Expanding the dataset further, especially with more challenging inter-work similarities and intra-work variations, could push models even further.

Overall, this paper is a commendable effort to introduce a crucial, yet under-explored, problem in computer vision. It sets a clear path for future research in abstract visual understanding, moving beyond basic object recognition towards more nuanced aesthetic comprehension.