Quality-Guided Vision-Language Learning for Long-Term Action Quality Assessment
TL;DR Summary
The study introduces a quality-guided vision-language learning method using textual prompts and a progressive semantic module to map visual features to fine-grained scores, achieving state-of-the-art results across diverse long-term action quality datasets without extra annotations.
Abstract
Long-term action quality assessment poses a challenging visual task since it requires assessing technical actions at different skill levels in a long video. Recent state-of-the-art methods incorporate additional modality information to aid in understanding action semantics, which incurs extra annotation costs and imposes higher constraints on action scenes and datasets. To address this issue, we propose a Quality-Guided Vision-Language Learning (QGVL) method to map visual features into appropriate fine-grained intervals of quality scores. Specifically, we use a set of quality-related textual prompts as quality prototypes to guide the discrimination and aggregation of specific visual actions. To avoid fuzzy rule mapping, we further propose a progressive semantic learning strategy with a Granularity-Adaptive Semantic Learning Module (GSLM) that refines accurate score intervals from coarse to fine granularity.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Quality-Guided Vision-Language Learning for Long-Term Action Quality Assessment
1.2. Authors
Huangbiao Xu, Huanqi Wu, Xiao Ke, Yuezhou Li, Rui Xu, and Wenzhong Guo. The authors are affiliated with Fuzhou University, China, specifically with the Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing and the Engineering Research Center of Big Data Intelligence, Ministry of Education. Xiao Ke is noted as the corresponding author.
1.3. Journal/Conference
The manuscript header indicates publication in IEEE Transactions on Multimedia (TMM). TMM is a premier IEEE journal covering multimedia analysis, understanding, and applications, recognized for its rigorous peer-review process and high impact. Publication in TMM signifies a significant contribution to the field.
1.4. Publication Year
1.5. Abstract
The abstract introduces the challenge of long-term Action Quality Assessment (AQA), which involves evaluating the skill level of actions in lengthy videos. It points out that existing state-of-the-art methods often rely on extra information (modalities) that require costly and specific annotations, limiting their generalizability. To overcome this, the authors propose a Quality-Guided Vision-Language Learning (QGVL) method. QGVL uses a universal set of quality-related text prompts (e.g., "good performance," "poor performance") to serve as "quality prototypes." These prototypes guide the model in mapping visual features to corresponding quality score intervals. The paper introduces a progressive semantic learning strategy that refines this mapping from coarse (clip-level) to fine (grade-level and score-level) granularities, implemented via a Granularity-Adaptive Semantic Learning Module (GSLM). This approach avoids fuzzy mappings and does not require extra annotations, making it universally applicable. The authors demonstrate through extensive experiments that their method achieves new state-of-the-art results on four major AQA benchmarks: Rhythmic Gymnastics, Fis-V, FS1000, and FineFS.
1.6. Original Source Link
- Official Link: A version of the paper is available at /files/papers/690088a9ed47de95d44a34b3/paper.pdf.
- Publication Status: Officially published.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: The paper addresses Long-Term Action Quality Assessment (AQA). Unlike assessing simple, short actions (like a single dive), long-term AQA involves evaluating complex sequences of sub-actions over an extended period (e.g., a full figure skating routine). This is challenging because it requires understanding the quality of individual movements and their temporal relationships.
- Gaps in Prior Research: Recent high-performing AQA models have started incorporating extra information like audio or detailed textual descriptions of the performed actions to better understand action semantics. However, this creates a major bottleneck: these extra modalities require costly, scene-specific, and often professional-level annotations for every video. For example, a model using textual descriptions of a specific figure skating jump cannot be easily applied to rhythmic gymnastics. This lack of universality and high annotation cost limits the scalability and practical application of such models.
- Innovative Idea: The authors' key insight is to ask: "Is there a universal semantic learning approach applicable to various action scenes?" Instead of using text that describes the specific action being performed (e.g., "a triple axel jump"), they propose using text that describes the quality of the performance (e.g., "a video with excellent performance"). These "quality-related textual prompts" are generic, require no additional annotation, and can be applied to any type of action. The core idea is to use this universal language as a guide to help the model learn to distinguish between different levels of skill directly from the visual data.
2.2. Main Contributions / Findings
- Proposing the QGVL Method: The paper introduces a novel Quality-Guided Vision-Language Learning (QGVL) framework. This method leverages fine-grained, quality-related text prompts to guide the model in learning the mapping between visual action performances and their corresponding quality scores, eliminating the need for extra annotations.
- Progressive Semantic Learning Strategy: To ensure accurate and fine-grained assessment, the authors designed a coarse-to-fine learning strategy. This is implemented through a novel Granularity-Adaptive Semantic Learning Module (GSLM). The model first learns coarse quality "grades" and then refines them into precise "scores," preventing the fuzzy mappings that might arise from using only a few coarse quality levels.
- Demonstrating Universality and State-of-the-Art Performance: The paper validates its method on four diverse and challenging long-term AQA datasets (Rhythmic Gymnastics, Fis-V, FS1000, and FineFS). The QGVL method significantly outperforms previous works, establishing new state-of-the-art results. Crucially, the paper also shows that a single, unified model trained on multiple datasets can achieve strong performance, highlighting the method's versatility and generalizability across different action scenarios.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Action Quality Assessment (AQA)
Action Quality Assessment (AQA) is a subfield of computer vision focused on automatically evaluating how well an action is performed, rather than just recognizing what action it is. It aims to assign a quantitative score that reflects the skill, proficiency, or correctness of the execution. This has applications in sports scoring (e.g., diving, gymnastics), medical rehabilitation (assessing patient exercises), and skill training. Early methods often regressed a score directly from video features, while more recent works explore more complex relationships, such as modeling the temporal structure of actions.
3.1.2. Vision-Language Learning (VLL)
Vision-Language Learning (VLL) is a field of AI that aims to build models capable of understanding the relationship between visual data (images, videos) and natural language (text). A key goal is to create a shared "semantic space" where visual and textual concepts are aligned. For instance, the image of a cat and the sentence "a photo of a cat" should have similar representations in this space.
3.1.3. CLIP (Contrastive Language-Image Pre-training)
CLIP, developed by OpenAI, is a foundational VLL model. It is trained on a massive dataset of image-text pairs from the internet. Its core idea is contrastive learning. Given a batch of image-text pairs, the model learns to predict which of the N×N possible pairings are correct. It does this by training two encoders—one for images and one for text—to project their respective inputs into a shared embedding space. The training objective is to maximize the cosine similarity of the correct image-text pair embeddings while minimizing the similarity of incorrect pairs. This allows CLIP to learn robust, general-purpose representations that connect visual concepts with language, enabling "zero-shot" capabilities where it can recognize objects it wasn't explicitly trained to classify. The current paper leverages this idea by using a pre-trained text encoder (like CLIP's) to get meaningful semantic embeddings for its quality-related prompts.
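To make the contrastive objective concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric image-text contrastive loss. It illustrates the idea rather than reproducing CLIP's actual training code; the embedding dimension, batch size, and temperature value are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from the two encoders (placeholders here).
    """
    # Project both modalities onto the unit sphere so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct pairing is the diagonal: image i matches text i.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt).item())
```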
3.1.4. Transformer and Self-Attention
The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" that has become dominant in natural language processing and is increasingly used in computer vision. It relies entirely on attention mechanisms to process sequences of data (like words in a sentence or clips in a video).
The core component of the Transformer is the self-attention mechanism. It allows the model to weigh the importance of different elements in the input sequence when processing a single element. For a sequence of input vectors, it calculates three new vectors for each input: a Query (Q), a Key (K), and a Value (V). The attention score is computed by taking the dot product of a query with all keys, scaling it, and applying a softmax function to get weights. These weights are then used to compute a weighted sum of the values.
The standard formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
- $Q$: The query matrix, representing the current element's perspective.
- $K$: The key matrix, representing all elements' "labels" to be matched against.
- $V$: The value matrix, representing the content of the elements.
- $d_k$: The dimension of the key vectors. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large, which would lead to vanishing gradients in the softmax function.
- softmax: A function that normalizes the scores into a probability distribution (weights that sum to 1).

In this paper, the Transformer's ability to model long-range dependencies is used to capture the temporal context between different video clips.
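A minimal, single-head PyTorch sketch of the formula above (illustrative only; real implementations add multi-head projections, masking, and dropout):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention for inputs of shape (..., seq_len, d)."""
    d_k = Q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize each query's scores into attention weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values.
    return weights @ V, weights

# Toy usage: 16 video clips with 256-dim features attending to each other.
clips = torch.randn(16, 256)
out, attn = scaled_dot_product_attention(clips, clips, clips)
print(out.shape, attn.shape)  # torch.Size([16, 256]) torch.Size([16, 16])
```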
3.2. Previous Works
The paper categorizes previous AQA research, particularly for long-term actions:
- Early Approaches: These works focused on regressing scores directly from spatio-temporal features. For instance, MS-LSTM [30] used LSTMs to learn multi-scale video features, while ACTION-NET [29] combined dynamic and static information.
- Decoupling/Pattern-Based Methods: To better model fine-grained quality, some works decoupled the problem. GDLT [25] proposed learning "grade patterns" to represent different quality levels. CoFInAl [26] built on this with a coarse-to-fine alignment strategy to handle domain shift. These methods, however, typically used learnable vectors or positional embeddings to represent grades, which lack explicit semantic meaning.
- Multimodal Learning: More recent methods have introduced additional modalities to provide richer semantic context. MLP-Mixer [12] and PAMFN [13] incorporated audio information, which is particularly useful in sports like figure skating where movements are synchronized to music. SGN [14] used language by providing action-specific descriptions (e.g., "triple axel").

The key limitation of these multimodal approaches, which this paper aims to solve, is their reliance on additional, scene-specific annotations, as illustrated in the figure below.
Figure 1: (a) Methods based on action-specific semantic descriptions require per-action annotation and do not transfer to other actions; (b) the proposed QGVL method uses universal quality-related text prompts to provide semantic guidance without extra annotations.
3.3. Technological Evolution
The evolution of AQA has moved from simple regression models to more complex architectures that can understand the structure of an action.
- Feature Extraction + Regression: Early methods used hand-crafted or deep-learned features (e.g., from C3D or I3D networks) and fed them into a simple regression model such as a Support Vector Regressor (SVR) or a recurrent neural network (RNN).
- Temporal Modeling: Recognizing that actions are temporal sequences, models like LSTM were adopted to capture the evolution of movements over time.
- Fine-Grained Pattern Learning: To capture subtle differences in quality, methods like GDLT and CoFInAl moved away from direct regression. They instead classify video segments into predefined quality "grades" or "patterns" and then compute a score based on the distribution of these patterns.
- Multimodal Fusion: The state-of-the-art has recently shifted towards using multiple data streams (video, audio, text, skeleton data) to provide a more holistic understanding of the action.
- This Paper's Position: This work builds on the fine-grained pattern learning and multimodal fusion trends. However, it innovates by proposing a form of multimodal learning (vision-language) that does not require expensive, specific annotations, making it more scalable and universal. It replaces action-specific text with general quality-related text.
3.4. Differentiation Analysis
- vs. GDLT/CoFInAl: These methods also learn quality "grades," but they represent these grades with abstract, learnable vectors that have no inherent meaning. This paper replaces these vectors with semantically rich text prompts (e.g., "excellent performance"). This provides the model with explicit, prior knowledge about what each grade represents from the very beginning of training.
- vs. SGN: SGN also uses vision-language learning but requires textual descriptions of the specific technical actions performed in the video (e.g., "ChSq1," "3Lz+3T"). This is powerful but requires costly, expert annotations and is not generalizable to other sports. In contrast, this paper's quality-related prompts are universal and can be applied to any action type without new annotations.
- vs. MLP-Mixer/PAMFN: These methods use audio, which is a powerful signal but is not always available or relevant for all types of actions. The proposed method uses language, which is more flexible. Furthermore, the paper shows that its vision-only model can outperform audio-based models, and when audio is added, its performance increases even further, demonstrating its compatibility.
4. Methodology
The core of the paper is the Quality-Guided Vision-Language Learning (QGVL) method. This section breaks down its architecture and logic step-by-step. The overall framework is depicted in the figure below.
Figure 2: Overview of the Quality-Guided Vision-Language Learning (QGVL) framework, showing the full pipeline from input video to multi-granularity semantic learning and quality score prediction.
4.1. Principles
The central principle of QGVL is to leverage the semantic power of natural language to guide the learning of visual quality patterns. Instead of letting the model discover abstract quality levels on its own, it is explicitly guided by text prompts describing performance quality (e.g., "good," "average," "poor"). This is achieved through a progressive, coarse-to-fine learning strategy:
- Clip-level: The model first understands the temporal context between short video segments.
- Grade-level: It then uses coarse quality prompts (e.g., "good performance") to group visual features into a few quality "grades."
- Score-level: Finally, it uses fine-grained score prompts (e.g., "a score of 85") to further refine the assessment within each grade, mapping visual features to precise score intervals.
This progressive refinement is handled by the Granularity-Adaptive Semantic Learning Module (GSLM).
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Initial Feature Extraction
First, a long input video is divided into non-overlapping clips. A pre-trained video backbone (like I3D or VST) is used to extract a feature vector for each clip. This results in a sequence of clip-level visual features $\pmb{F} = \{f_t\}_{t=1}^{T}$, where $T$ is the number of clips and $C$ is the feature dimension of each clip feature.
4.2.2. Temporal Context Enhancement
The initial features represent short clips and lack broader temporal context. To address this, the paper uses a network $\mathcal{N}$ to enhance the features, as sketched below.
- Dimension Reduction: A transformation block (two fully-connected layers with batch normalization and ReLU) projects the features from dimension $C$ to a smaller dimension $d$, matching the dimension of the text features.
- Context Modeling: A Transformer encoder is then applied to these projected features. The self-attention mechanism within the Transformer allows each clip's representation to be updated by attending to all other clips, thereby capturing long-range temporal relationships and context. The output of this module is the enhanced visual features $\hat{\pmb{F}}$, which are semantically richer and ready for interaction with text prompts. This process is formally written as: $ \hat{\pmb{F}} = \mathcal{N}(\pmb{F} | \Theta) $ where $\Theta$ represents the learnable parameters of the network $\mathcal{N}$.
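A rough PyTorch sketch of this enhancement stage, assuming a two-layer projection block followed by a standard Transformer encoder; the layer sizes, head count, and clip count are illustrative guesses, not values from the paper.

```python
import torch
import torch.nn as nn

class TemporalContextEnhancer(nn.Module):
    """Projects backbone clip features from dimension C to d, then models
    clip-to-clip context with a Transformer encoder (a sketch of N(F | Theta))."""

    def __init__(self, in_dim=1024, d=512, num_layers=2, num_heads=8):
        super().__init__()
        # Transformation block: two FC layers with batch norm and ReLU.
        self.project = nn.Sequential(
            nn.Linear(in_dim, d), nn.BatchNorm1d(d), nn.ReLU(),
            nn.Linear(d, d), nn.BatchNorm1d(d), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats):            # feats: (B, T, C) clip features from I3D/VST
        B, T, C = feats.shape
        x = self.project(feats.reshape(B * T, C)).reshape(B, T, -1)
        return self.encoder(x)           # enhanced features F_hat: (B, T, d)

# Toy usage: a batch of 2 videos, each with 68 clips of 1024-dim features.
enhanced = TemporalContextEnhancer()(torch.randn(2, 68, 1024))
print(enhanced.shape)  # torch.Size([2, 68, 512])
```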
4.2.3. Quality-Guided Vision-Language Learning
This is the core of the method, where visual features are mapped to quality levels using text prompts. It happens in two progressive stages, both powered by the Granularity-Adaptive Semantic Learning Module (GSLM).
The architecture of the GSLM is shown below. It consists of a Quality Semantic Adapter (QSA) and a Quality-guided Cross-granularity Integrator (QCI).
Figure 3: Overall architecture of the Granularity-Adaptive Semantic Learning Module, comprising two key components, the Quality Semantic Adapter (QSA) and the Quality-guided Cross-granularity Integrator (QCI), reflecting the progressive clip-to-grade-to-score semantic learning flow.
Stage 1: From Clip-level to Grade-level Semantics
- Grade-level Text Prompts: A set of $K$ text prompts is designed to represent coarse quality grades. For example, with $K = 4$, the prompts could describe "poor," "average," "good," and "excellent" performance. A template like "a video of [action prompt] with [grade prompt] performance..." is used. A pre-trained text encoder (e.g., CLIP's) converts these prompts into a set of textual feature vectors (prototypes) $\pmb{G} = \{g_k\}_{k=1}^{K}$.
- Semantic Aggregation with GSLM:
  - Quality Semantic Adapter (QSA): The grade-level text prototypes are first fed into the QSA, a multi-head self-attention network. This allows the prototypes to interact with each other, uncovering their intrinsic relationships and adapting them into a new set of embeddings $\{g'_k\}$ that are better suited for interacting with the visual features.
  - Quality-guided Cross-granularity Integrator (QCI): The QCI, a multi-head cross-attention network, then performs the main vision-language interaction. It uses the adapted text embeddings $g'_k$ as queries and the enhanced visual features $\hat{\pmb{F}}$ as keys and values. This process forces the model to "look" at the video clips through the "lens" of each quality grade, aggregating the visual information relevant to that specific grade. The output for the $k$-th grade is a new semantic representation $\hat{\pmb{g}}_k$, which represents the video's content as it pertains to that quality grade (a simplified code sketch follows this list). The computation is: $ \mathcal{G}_k = \mathrm{Softmax}\left( W_q g'_k (W_k \hat{\pmb{F}}_t)^T / \sqrt{d} \right), \quad \hat{\pmb{g}}_k = \mathrm{FFN}(\mathcal{G}_k (W_v \hat{\pmb{F}})) $
    - $W_q, W_k, W_v$: Learnable weight matrices for the query, key, and value projections in the cross-attention mechanism.
    - $\mathcal{G}_k$: The attention weight matrix, showing how much each visual clip feature in $\hat{\pmb{F}}$ contributes to the $k$-th grade semantic.
    - FFN: A feed-forward network that further processes the aggregated features.

The final output of this stage is a set of grade-level semantic features $\hat{\pmb{G}} = \{\hat{\pmb{g}}_k\}_{k=1}^{K}$.
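Below is a simplified, single-head sketch of this clip-to-grade aggregation. The actual GSLM uses multi-head attention, a QSA self-attention stage before the QCI, and shares integrator weights across granularities; the module layout and dimensions here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GradeAggregator(nn.Module):
    """Single-head sketch of the QCI: grade text prototypes act as queries,
    enhanced clip features act as keys/values, producing one semantic per grade."""

    def __init__(self, d=512):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)   # query projection for text prototypes
        self.w_k = nn.Linear(d, d, bias=False)   # key projection for clip features
        self.w_v = nn.Linear(d, d, bias=False)   # value projection for clip features
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.scale = d ** 0.5

    def forward(self, grade_prototypes, clip_feats):
        # grade_prototypes: (K, d) adapted text embeddings g'_k
        # clip_feats:       (T, d) enhanced visual features F_hat
        q = self.w_q(grade_prototypes)                        # (K, d)
        k = self.w_k(clip_feats)                              # (T, d)
        v = self.w_v(clip_feats)                              # (T, d)
        attn = torch.softmax(q @ k.t() / self.scale, dim=-1)  # (K, T): how much each clip
                                                              # contributes to each grade
        return self.ffn(attn @ v), attn                       # grade semantics (K, d)

# Toy usage: K = 4 grade prototypes aggregating T = 68 clip features.
g_hat, weights = GradeAggregator()(torch.randn(4, 512), torch.randn(68, 512))
print(g_hat.shape, weights.shape)  # torch.Size([4, 512]) torch.Size([4, 68])
```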
Stage 2: From Grade-level to Score-level Semantics
To avoid the roughness of only using a few grades, the model further refines the assessment to a score-level.
- Score-level Text Prompts: A new set of $N$ text prompts is created using a template like "a video of [action prompt] with a quality score of [score prompt]". For example, if the score range is [0, 100], $N$ could be 101, with prompts for "a score of 0," "a score of 1," and so on. These are encoded into text features $\pmb{S} = \{s_n\}_{n=1}^{N}$. The score prompts are then partitioned into $K$ groups, where each group corresponds to a grade (a prompt-construction sketch follows Figure 4 below).
- Semantic Refinement with GSLM: The GSLM module is used again, but this time to refine the grade-level semantics into score-level semantics.
  - The QSA adapts the score-level text prompts within each grade group.
  - The QCI performs cross-attention, but now it uses the adapted score-level text prompts $s'^{(k)}_n$ as queries and the corresponding grade-level semantic feature $\hat{\pmb{g}}_k$ as the key and value. This step refines the aggregated information for a specific grade into finer score intervals. The computation for the $n$-th score within grade $k$ is: $ \mathcal{S}_k = \mathrm{Softmax}\left( w_q s'^{(k)}_n (w_k \hat{\pmb{g}}_k)^T / \sqrt{d} \right), \quad \hat{\pmb{s}}^{(k)}_n = \mathrm{FFN}(\mathcal{S}_k (w_v \hat{\pmb{g}}_k)) $
    - $w_q, w_k, w_v$: Learnable weights for this cross-attention layer. The paper notes these weights are shared with the weights from the clip-to-grade QCI, allowing the integrator to learn common patterns of quality aggregation across different granularities.

After this process is done for all grades, we get the final set of score-level semantic features $\hat{\pmb{S}} = \{\hat{\pmb{s}}_n\}_{n=1}^{N}$. The entire coarse-to-fine pipeline is visualized in Figure 4.
Figure 4: Pipeline of the coarse-to-fine quality-related semantic learning framework. Clip-level visual features are progressively mapped through the GSLM to grade-level and then score-level semantics, using quality text prompts for coarse-to-fine quantitative learning; the illustrated example uses K = 4 and N = 101, with score intervals such as [0-1], [1-2], and so on.
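A small sketch of how the grade-level and score-level prompt sets and their grouping into K grade groups might be constructed. The templates follow the paper's wording, but the action text, the contiguous grouping rule, and the `encode_text` call are hypothetical placeholders rather than the paper's implementation.

```python
# Build grade-level and score-level quality prompts and partition the score
# prompts into K grade groups. `encode_text` stands in for a pre-trained
# text encoder such as CLIP's and is not a real API call here.
K, N = 4, 101
grades = ["poor", "average", "good", "excellent"]
action = "figure skating"  # fills the [action prompt] slot; illustrative only

grade_prompts = [
    f"a video of {action} with {g} performance" for g in grades
]
score_prompts = [
    f"a video of {action} with a quality score of {n}" for n in range(N)
]

# Partition the N score prompts into K groups, one per grade (grouping rule assumed).
group_size = (N + K - 1) // K
score_groups = [
    score_prompts[k * group_size:(k + 1) * group_size] for k in range(K)
]

# grade_feats = encode_text(grade_prompts)               # (K, d) prototypes G
# score_feats = [encode_text(g) for g in score_groups]   # K groups of score prototypes S
print(len(grade_prompts), [len(g) for g in score_groups])
```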
4.2.4. Quantitative Score Generation
After obtaining the grade-level ($\hat{\pmb{G}}$) and score-level ($\hat{\pmb{S}}$) semantic representations, the final step is to compute a numerical score (a condensed code sketch follows this list).
- Define Quantitative Values: First, fixed numerical values are assigned to each grade and score level. For grades and scores in a normalized [0, 1] range: $ v_k^g = \frac{k-1}{K-1}, \quad v_n^s = \frac{n-1}{N-1} $ For example, with $K = 4$, the grade values would be 0, 0.33, 0.67, and 1.
- Estimate Weights: Two separate Multi-Layer Perceptrons (MLPs), $\phi_g$ and $\phi_s$, followed by a sigmoid activation $\delta$, are used to predict the intensity (weight) of each grade and score being present in the video: $ w_k^g = \delta(\phi_g(\hat{\pmb{g}}_k)), \quad w_n^s = \delta(\phi_s(\hat{\pmb{s}}_n)) $
- Normalize Weights: The raw weights are normalized to form two valid probability distributions that sum to 1: $ \hat{w}_k^g = \frac{w_k^g}{\sum_{k=1}^K w_k^g} \quad \text{and} \quad \hat{w}_n^s = \frac{w_n^s}{\sum_{n=1}^N w_n^s} $
- Calculate Final Score: The grade-level score and score-level score are computed as the expectation over their respective distributions. These two scores are then combined using learnable adaptive weights ($\lambda_g$ and $\lambda_s$) to produce the final score $\mathbf{s}$: $ \mathbf{s} = \lambda_g \sum_{k=1}^K \hat{w}_k^g v_k^g + \lambda_s \sum_{n=1}^N \hat{w}_n^s v_n^s $
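A condensed PyTorch sketch of this scoring head. The MLP widths and the way the adaptive weights λ_g and λ_s are parameterized are assumptions; only the fixed quantitative values, sigmoid weighting, normalization, and expectation mirror the formulas above.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Turns grade-level and score-level semantics into a single normalized score."""

    def __init__(self, d=512, K=4, N=101):
        super().__init__()
        self.phi_g = nn.Sequential(nn.Linear(d, d // 2), nn.ReLU(), nn.Linear(d // 2, 1))
        self.phi_s = nn.Sequential(nn.Linear(d, d // 2), nn.ReLU(), nn.Linear(d // 2, 1))
        # Fixed quantitative values v_k^g and v_n^s in [0, 1].
        self.register_buffer("v_g", torch.linspace(0.0, 1.0, K))
        self.register_buffer("v_s", torch.linspace(0.0, 1.0, N))
        # Learnable adaptive fusion weights lambda_g and lambda_s (parameterization assumed).
        self.lambdas = nn.Parameter(torch.tensor([0.5, 0.5]))

    def forward(self, g_hat, s_hat):
        # g_hat: (K, d) grade semantics, s_hat: (N, d) score semantics.
        w_g = torch.sigmoid(self.phi_g(g_hat)).squeeze(-1)   # intensity of each grade
        w_s = torch.sigmoid(self.phi_s(s_hat)).squeeze(-1)   # intensity of each score
        w_g = w_g / w_g.sum()                                # normalize to a distribution
        w_s = w_s / w_s.sum()
        grade_score = (w_g * self.v_g).sum()                 # expectation over grade values
        score_score = (w_s * self.v_s).sum()                 # expectation over score values
        return self.lambdas[0] * grade_score + self.lambdas[1] * score_score

# Toy usage with random semantics.
print(ScoreHead()(torch.randn(4, 512), torch.randn(101, 512)).item())
```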
4.2.5. Optimization
The model is trained with a composite loss function to ensure three properties: accurate score prediction, distinctiveness of semantic patterns, and alignment with text prompts.
- Regression Loss: The standard Mean Squared Error (MSE) loss, $\mathcal{L}_{MSE}$, is used to minimize the difference between the predicted score and the ground-truth score.
- Triplet Loss ($\mathcal{L}_{TL}$): This loss ensures that the learned semantic representations for different grades (or scores) are distinct. For a given video's semantic feature (anchor), it pushes it closer to the feature of the same grade from another video (positive) and farther from features of different grades (negative). The loss is defined as: $ \mathcal{L}_{TL}(\hat{\pmb{G}}) = \frac{1}{BK} \sum_{i=1}^B \sum_{k=1}^K \max(D_+^{i,k} - D_-^{i,k} + \varepsilon, 0) $ where:
  - $D_+^{i,k}$: the distance to the hardest positive sample (same grade $k$, different video).
  - $D_-^{i,k}$: the distance to the hardest negative sample (different grade).
  - $\varepsilon$: a margin parameter.
  - The distance is the cosine distance. This loss is applied to both grade-level semantics $\hat{\pmb{G}}$ and score-level semantics $\hat{\pmb{S}}$.
- Cross-Entropy Loss ($\mathcal{L}_{CE}$): This loss ensures that each learned grade semantic $\hat{g}_k$ is correctly aligned with its original text prompt prototype $g_k$. It computes the similarity between each learned semantic and all text prototypes and uses cross-entropy to enforce a one-to-one mapping: $ \mathcal{L}_{CE}(\hat{\pmb{G}}, \pmb{G}) = - \sum_k \log \frac{\exp(\mathrm{sim}(\hat{g}_k, g_k) / \tau)}{\sum_i \exp(\mathrm{sim}(\hat{g}_k, g_i) / \tau)} $
  - $\mathrm{sim}(\cdot, \cdot)$: cosine similarity.
  - $\tau$: a temperature hyperparameter. This loss is also applied to both grade-level and score-level semantics.
- Final Objective Function: The total loss is a weighted sum of these three components (a code sketch of the loss terms follows this list): $ \mathcal{L} = \lambda_1 \mathcal{L}_{MSE} + \lambda_2 \big(\mathcal{L}_{TL}(\hat{\pmb{G}}) + \mathcal{L}_{TL}(\hat{\pmb{S}})\big) + \lambda_3 \big(\mathcal{L}_{CE}(\hat{\pmb{G}}, \pmb{G}) + \mathcal{L}_{CE}(\hat{\pmb{S}}, \pmb{S})\big) $
  - $\lambda_1, \lambda_2, \lambda_3$: balancing hyperparameters.
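A compact sketch of the alignment cross-entropy term and a simplified triplet-style term; the batch-level hard-example mining of the paper's triplet loss is omitted, so this only illustrates the shape of the objectives, and the loss weights in the toy total are placeholders.

```python
import torch
import torch.nn.functional as F

def alignment_ce_loss(learned, prototypes, tau=0.1):
    """Cross-entropy alignment between learned semantics and their text prototypes.

    learned, prototypes: (K, d). Row k of `learned` should match row k of `prototypes`.
    """
    learned = F.normalize(learned, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = learned @ prototypes.t() / tau            # cosine similarities / temperature
    targets = torch.arange(learned.size(0), device=learned.device)
    return F.cross_entropy(logits, targets)

def simple_triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet term on cosine distance; hard-example mining across the batch is omitted."""
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy usage with K = 4 grade semantics and their text prototypes.
g_hat, g_text = torch.randn(4, 512), torch.randn(4, 512)
total = (
    1.0 * F.mse_loss(torch.rand(8), torch.rand(8))                   # regression term (placeholder scores)
    + 1.0 * simple_triplet_loss(g_hat, g_hat + 0.1, torch.randn(4, 512))
    + 1.0 * alignment_ce_loss(g_hat, g_text)
)
print(total.item())
```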
5. Experimental Setup
5.1. Datasets
The authors evaluated their method on four challenging public benchmarks for long-term AQA.
- Rhythmic Gymnastics (RG): Contains 1000 videos of four gymnastics actions (ball, clubs, hoop, ribbon). Each video is about 1.6 minutes long. The authors follow the standard protocol of training separate models for each action.
- Fis-V: A figure skating dataset with 500 videos of ladies' singles short programs, each about 2.9 minutes long. It includes two types of scores: Total Element Score (TES), which evaluates technical difficulty and execution, and Program Component Score (PCS), which assesses artistic aspects. Separate models are trained for TES and PCS.
- FS1000: A larger figure skating dataset with 1247 videos across eight categories. It has TES and PCS scores, plus five sub-scores for PCS (e.g., Skating Skills, Performance). Videos are around 3.3 minutes long. Separate models are trained for each score type.
- FineFS: Another figure skating dataset with 1167 samples, divided into short program and free skating. It also uses TES and PCS scores. Separate models are trained for each score and program type.
5.2. Evaluation Metrics
Two primary metrics were used to evaluate the performance of the AQA models.
5.2.1. Spearman's Rank Correlation ($\rho$)
- Conceptual Definition: Spearman's rank correlation coefficient is a non-parametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. In AQA, it measures whether the predicted scores are in the correct order compared to the ground-truth scores, even if the absolute values are off. A higher indicates that the model is better at ranking the quality of different performances. Its value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation.
- Mathematical Formula: $ \rho = \frac{\sum_i (p_i - \bar{p})(\hat{p}_i - \bar{\hat{p}})}{\sqrt{\sum_i (p_i - \bar{p})^2 \sum_i (\hat{p}_i - \bar{\hat{p}})^2}} $
- Symbol Explanation:
- $p_i$: The rank of the $i$-th ground-truth score.
- $\hat{p}_i$: The rank of the $i$-th predicted score.
- $\bar{p}$: The mean of the ground-truth ranks.
- $\bar{\hat{p}}$: The mean of the predicted ranks.
5.2.2. Mean Square Error (MSE)
- Conceptual Definition: Mean Square Error measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. It quantifies the accuracy of the predicted numerical scores. A lower MSE is better, indicating that the predicted scores are closer to the ground-truth scores.
- Mathematical Formula: $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
- Symbol Explanation:
- $n$: The number of samples.
- $Y_i$: The $i$-th ground-truth score.
- $\hat{Y}_i$: The $i$-th predicted score.
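Both metrics can be computed directly with SciPy and NumPy, as in this short sketch with made-up scores:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pred_scores, true_scores):
    """Spearman rank correlation (rho, higher is better) and MSE (lower is better)."""
    rho, _ = spearmanr(pred_scores, true_scores)
    mse = float(np.mean((np.asarray(true_scores) - np.asarray(pred_scores)) ** 2))
    return rho, mse

# Toy usage with fabricated predictions and ground-truth scores.
print(evaluate([71.2, 65.0, 80.4, 59.3], [70.0, 66.5, 82.1, 58.0]))
```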
5.3. Baselines
The paper compares QGVL against a comprehensive set of state-of-the-art (SOTA) AQA methods, including:
- Classic Methods: C3D+SVR [59], MS-LSTM [30].
- Temporal/Structural Modeling Methods: ACTION-NET [29], TSA-Net [16], HGCN [36], TPT [3].
- Fine-Grained Pattern-Based Methods: GDLT [25], CoFInAl [26].
- Multimodal Methods: MLP-Mixer [12] and PAMFN [13] (which use audio), and SGN [14] (which uses action-specific text).

These baselines cover the main evolution of AQA techniques and represent the strongest competitors.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance on Rhythmic Gymnastics (RG)
The following are the results from Table I of the original paper:
| Methods | Features | Ball (Corr.↑) | Clubs (Corr.↑) | Hoop (Corr.↑) | Ribbon (Corr.↑) | Avg. (Corr.↑) | Ball (MSE↓) | Clubs (MSE↓) | Hoop (MSE↓) | Ribbon (MSE↓) | Avg. (MSE↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C3D+SVR [59] | C3D [60] | 0.357 | 0.551 | 0.495 | 0.516 | 0.483 | - | - | - | - | - |
| MS-LSTM [30] | I3D [51] | 0.515 | 0.621 | 0.540 | 0.522 | 0.551 | 10.55* | 6.94* | 5.85* | 12.56* | 8.97* |
| ACTION-NET [29] | I3D [51]+ResNet [61] | 0.528 | 0.652 | 0.708 | 0.578 | 0.623 | 9.09* | 6.40* | 5.93* | 10.23* | 7.91* |
| GDLT* [25] | I3D [51] | 0.553 | 0.720 | 0.712 | 0.562 | 0.644 | 8.78 | 6.25 | 6.02 | 9.39 | 7.61 |
| HGCN* [36] | I3D [51] | 0.527 | 0.590 | 0.697 | 0.659 | 0.622 | 8.88 | 7.79 | 7.28 | 10.69 | 8.66 |
| MLP-Mixer†* [12] | I3D [51]+AST [62] | 0.597 | 0.603 | 0.724 | 0.622 | 0.640 | 7.70 | 8.02 | 6.45 | 10.02 | 8.05 |
| CoFInAl [26] | I3D [51] | 0.625 | 0.719 | 0.734 | 0.757 | 0.712 | 7.04* | 6.37* | 5.81* | 6.98* | 6.55* |
| VATP-Net [63] | I3D [51] | 0.580 | 0.720 | 0.739 | 0.724 | 0.696 | - | - | - | - | - |
| QGVL (Ours) | I3D [51] | 0.697 | 0.729 | 0.767 | 0.703 | 0.725 | 6.37 | 6.21 | 5.01 | 7.24 | 6.21 |
| QGVL† (Ours) | I3D [51]+AST [62] | 0.708 | 0.735 | 0.777 | 0.713 | 0.734 | 6.28 | 6.15 | 5.13 | 7.09 | 6.16 |
| MS-LSTM [30] | VST [53] | 0.621 | 0.661 | 0.670 | 0.695 | 0.663 | 7.52* | 6.04* | 6.16* | 5.78* | 6.37* |
| ACTION-NET [29] | VST [53]+ResNet [61] | 0.684 | 0.737 | 0.733 | 0.754 | 0.728 | 9.55* | 6.36* | 5.56* | 8.15* | 7.41* |
| GDLT [25] | VST [53] | 0.746 | 0.802 | 0.765 | 0.741 | 0.765 | 5.90* | 4.34* | 5.70* | 6.16* | 5.53* |
| HGCN* [36] | VST [53] | 0.664 | 0.671 | 0.765 | 0.736 | 0.712 | 8.71 | 7.60 | 5.82 | 7.23 | 7.34 |
| MLP-Mixer†* [12] | VST [53]+AST [62] | 0.677 | 0.708 | 0.778 | 0.706 | 0.719 | 6.75 | 5.81 | 5.94 | 6.87 | 6.34 |
| PAMFN‡† [13] | VST [53]+AST [62]+I3D [51] | 0.757 | 0.825 | 0.836 | 0.846 | 0.819 | 6.24 | 7.45 | 5.21 | 7.67 | 6.64 |
| CoFInAl [26] | VST [53] | 0.809 | 0.806 | 0.804 | 0.810 | 0.807 | 5.07* | 5.19* | 6.37* | 6.30* | 5.73* |
| VATP-Net [63] | VST [53] | 0.800 | 0.810 | 0.780 | 0.769 | 0.790 | - | - | - | - | - |
| QGVL (Ours) | VST [53] | 0.824 | 0.812 | 0.825 | 0.834 | 0.824 | 4.91 | 4.40 | 4.68 | 5.23 | 4.81 |
| QGVL† (Ours) | VST [53]+AST [62] | 0.828 | 0.827 | 0.830 | 0.836 | 0.830 | 4.83 | 4.30 | 4.77 | 5.20 | 4.78 |
- Analysis: The QGVL method consistently achieves the best results. Using the stronger VST backbone, it obtains an average Spearman correlation of 0.824 and an MSE of 4.81, outperforming the previous best vision-only model CoFInAl (0.807 correlation, 5.73 MSE). This highlights the effectiveness of using quality-guided text prompts. Even when compared to multimodal methods like PAMFN that use audio and optical flow, QGVL achieves a higher average correlation and a significantly lower MSE. The QGVL† variant, which adds audio, pushes the performance even further to 0.830 correlation and 4.78 MSE, showing the method's compatibility with other modalities.
6.1.2. Performance on FS1000, FineFS, and Fis-V
Similar state-of-the-art results are reported on the three figure skating datasets (Tables II, III, and IV in the paper).
- FS1000 (Table II): QGVL achieves the best average correlation (0.87) and MSE (12.50) across all seven scoring categories, outperforming strong baselines including the audio-enhanced MLP-Mixer and the text-specific SGN.
- FineFS (Table III): QGVL sets new SOTA results for both Short Program and Free Skating on both TES and PCS scores. For example, in the Short Program, it improves the average correlation to 0.831 and reduces the average MSE to 28.13, substantial gains over prior work.
- Fis-V (Table IV): QGVL achieves an average correlation of 0.800 and an MSE of 12.77, surpassing SGN (0.773 correlation, 13.51 MSE), which uses action-specific text. This is a crucial result, as it shows that the proposed universal quality prompts are more effective than expensive, specialized text annotations.
6.1.3. Unified AQA Model
The following are the results from Table V of the original paper, testing a single model trained on all datasets (each cell reports Spearman correlation / MSE):
| AQA Methods | RG-Avg. | Fis-V | FS1000 | FineFS-Avg. | Avg. |
|---|---|---|---|---|---|
| MS-LSTM* [30] | 0.475 / 44.6 | 0.302 / 114.0 | 0.487 / 325.9 | 0.610 / 142.6 | 0.476 / 156.8 |
| GDLT* [25] | 0.450 / 35.0 | 0.343 / 82.4 | 0.798 / 173.6 | 0.619 / 129.6 | 0.581 / 105.2 |
| PAMFN* [13] | 0.518 / 23.5 | 0.572 / 93.8 | 0.683 / 259.0 | 0.722 / 135.8 | 0.631 / 128.0 |
| CoFInAl* [26] | 0.570 / 13.8 | 0.489 / 30.4 | 0.569 / 294.6 | 0.690 / 164.2 | 0.584 / 125.8 |
| QGVL (Ours) | 0.537 / 11.8 | 0.548 / 74.6 | 0.732 / 145.4 | 0.734 / 105.7 | 0.648 / 84.4 |
| VLM Methods | RG-Avg. | Fis-V | FS1000 | FineFS-Avg. | Avg. |
| BLIP*-768 [66] | 0.457 / 45.6 | 0.554 / 84.8 | 0.651 / 208.4 | 0.715 / 107.7 | 0.603 / 111.6 |
| BLIP*-512 [66] | 0.481 / 51.8 | 0.547 / 72.7 | 0.661 / 207.5 | 0.725 / 106.1 | 0.612 / 109.5 |
| ViT*-768 [67] | 0.289 / 79.1 | 0.439 / 135.7 | 0.324 / 475.6 | 0.586 / 212.0 | 0.417 / 225.6 |
| ViT*-512 [67] | 0.257 / 72.7 | 0.433 / 148.4 | 0.466 / 435.3 | 0.545 / 218.9 | 0.431 / 218.8 |
| CLIP* [28] | 0.490 / 21.7 | 0.529 / 103.5 | 0.634 / 269.8 | 0.650 / 175.6 | 0.581 / 139.9 |
| ViFi-CLIP* [48] | 0.517 / 14.4 | 0.532 / 94.1 | 0.716 / 128.5 | 0.716 / 126.7 | 0.606 / 126.7 |
| QGVL (Ours) | 0.537 / 11.8 | 0.548 / 74.6 | 0.732 / 145.4 | 0.734 / 105.7 | 0.648 / 84.4 |
- Analysis: This experiment demonstrates the method's generalizability. QGVL achieves the best overall average performance (0.648 correlation, 84.4 MSE) when trained as a single unified model. This is a strong testament to its core design, as the universal quality prompts allow it to learn a shared concept of "good" vs. "bad" performance that transfers across different sports. It also significantly outperforms standard Vision-Language Models (VLMs) like CLIP and BLIP, indicating that the specialized architecture of QGVL is crucial for the AQA task.
6.2. Ablation Studies / Parameter Analysis
The authors conduct extensive ablation studies on the RG dataset (Tables VI-X) to validate each component of their model.
- Model Components (Table VI): This study shows that each component contributes positively to the final performance.
  - Adding the temporal context enhancement network improves performance over a simple baseline.
  - Replacing learnable vectors with quality-related text prompts gives a major boost (Avg. Corr. from 0.763 to 0.795).
  - Adding the score-level refinement further improves performance to the final result (Avg. Corr. 0.824).
  - Removing the custom GSLM module and using a native Transformer significantly degrades performance, proving the effectiveness of the proposed adapter and integrator design.
- Quality-related Texts (Fig. 5): The t-SNE visualization shows that at the start of training, the text prompt embeddings are already much closer to the centroids of their corresponding quality clusters compared to randomly initialized learnable vectors. This demonstrates that the text prompts provide a powerful and accurate starting point for learning.

Figure 5: t-SNE visualization at the initial stage of grade-pattern learning. Visual features extracted by the pre-trained backbone are clustered into four groups with k-means; red × marks denote cluster centroids, black points are learnable vectors, orange points are positional embeddings, and blue points are grade texts. The four panels correspond to (a) RG, (b) Fis-V, (c) FS1000, and (d) FineFS.

- Loss Functions (Table VI): The combination of all three losses ($\mathcal{L}_{MSE}$, $\mathcal{L}_{TL}$, and $\mathcal{L}_{CE}$) yields the best results. The triplet loss ($\mathcal{L}_{TL}$) ensures distinctiveness between grade patterns, while the cross-entropy loss ($\mathcal{L}_{CE}$) ensures they are correctly aligned with their semantic meaning.
- Number of Grades and Scores (Tables VIII & IX): The experiments show that $K = 4$ grades and $N = 101$ scores are optimal. Too few grades are insufficient for complex actions, while too many can cause ambiguity. A very fine-grained score level ($N = 101$) allows the model to capture nuanced differences.
- Qualitative Analysis (Figs. 6 & 7): These visualizations show the attention weights of the GSLM when aggregating grade-level semantics. They demonstrate that the model learns to focus on the correct moments in the video for each quality grade. For example, in a figure skating video (Fig. 6), the "poor" grade pattern (Grade 1) focuses on a fall (marker a), while the "excellent" grade pattern (Grade 4) focuses on a difficult spinning sequence (marker c). This confirms that the model is learning semantically meaningful quality patterns.
Figure 6: Weight distribution over video clips when the GSLM aggregates grade-level semantics. The curves represent the four quality grades; red asterisks mark key clips that receive high attention, with the corresponding video frames shown above.
Figure 7: Weight visualization of the GSLM when aggregating grade-level semantics for video #16 of the Clubs category in RG. Curves of different colors represent the four quality grades; red stars and letters mark highly attended video clips.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully proposes and validates Quality-Guided Vision-Language Learning (QGVL), a novel method for long-term action quality assessment. By using a set of universal, quality-related textual prompts, QGVL circumvents the major limitation of previous multimodal methods: the need for expensive, action-specific annotations. Its progressive, coarse-to-fine learning strategy, implemented with the Granularity-Adaptive Semantic Learning Module (GSLM), allows it to map visual features to precise score intervals effectively. The method establishes a new state-of-the-art on four challenging AQA benchmarks, demonstrating its superior performance, efficiency, and generalizability across different action scenarios.
7.2. Limitations & Future Work
- Limitations: The authors acknowledge that their framework might struggle when action performance is very stable and quality differences are extremely subtle. In such cases, the visual features for different quality levels might be too similar, making it difficult for the model to learn diverse patterns and potentially leading to overfitting.
- Future Work: The paper suggests that future research could explore the predictive associations among short-term, stable actions to better handle these subtle cases.
7.3. Personal Insights & Critique
- Key Innovation: The most brilliant aspect of this paper is its elegant and practical solution to the annotation bottleneck in multimodal AQA. The shift from "what is the action?" to "how good is the action?" in the language guidance is a simple yet powerful idea. It leverages the vast semantic knowledge of pre-trained language models in a zero-annotation-cost manner, which is a significant step forward for the field's practical applicability.
- Methodological Strength: The progressive coarse-to-fine learning strategy is well-motivated and convincingly effective. It mimics human judgment, where one might first get a general impression ("that was a good routine") before focusing on finer details to assign a precise score. The GSLM module, with its shared-weight QCI, is a clever design choice that promotes learning generalizable quality aggregation patterns.
- Potential Issues/Areas for Improvement:
- Prompt Sensitivity: The performance might be sensitive to the specific wording of the quality prompts. While the paper uses simple prompts, the field of prompt engineering has shown that small changes can lead to different results. It would be interesting to see an analysis of how different sets of quality descriptors affect performance.
- Lack of Interpretability of the "Why": The model can assess "how well" an action is performed but cannot explain "why" a score was given (e.g., "the skater's arm position was incorrect during the landing"). The learned grade patterns are still black boxes to some extent. Future work could try to link these patterns back to specific visual attributes or human-understandable concepts.
- Unified Model Performance Gap: While the unified model performs well on average, there is still a noticeable performance gap compared to the individually trained models, especially on the RG dataset. This suggests that while the concept of "quality" is somewhat universal, domain-specific differences (e.g., scoring rules, types of errors) are still significant and challenging for a single model to capture perfectly. More advanced domain adaptation techniques could be explored to close this gap.
- Broader Impact: This work provides a cost-effective blueprint for incorporating semantic guidance into video understanding tasks beyond AQA. The idea of using universal, task-related (but not content-specific) prompts could be applied to other evaluative tasks, such as assessing the aesthetic quality of a video or the emotional intensity of a scene. It strongly advocates for a smarter, more resource-efficient way of using large pre-trained models.