
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment

Published: 06/10/2025

TL;DR Summary

Proposed a language-guided audio-visual learning framework using action knowledge graphs and cross-modal fusion, achieving state-of-the-art long-term sports assessment with low computational cost on four public benchmarks.

Abstract

Long-term sports assessment is a challenging task in video understanding since it requires judging complex movement variations and action-music coordination. However, there is no direct correlation between the diverse background music and movements in sporting events. Previous works require a large number of model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models "audio-action-visual" correlations guided by low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs [...]


In-depth Reading


1. Bibliographic Information

1.1. Title

Language-Guided Audio-Visual Learning for Long-Term Sports Assessment

1.2. Authors

  • Huangbiao Xu

  • Xiao Ke

  • Huanqi Wu

  • Rui Xu

  • Yuezhou Li

  • Wenzhong Guo

    All authors are affiliated with:

  • Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China

  • Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350108, China

1.3. Journal/Conference

The paper is published at CVPR (Computer Vision and Pattern Recognition). CVPR is one of the premier annual computer vision conferences, highly respected and influential in the fields of computer vision, machine learning, and artificial intelligence. Publication at CVPR indicates a high level of novelty, technical soundness, and significant contribution to the field.

1.4. Publication Year

2025 (the CVF Open Access version corresponds to CVPR 2025; the publication date listed above is 06/10/2025, and the most recent baseline in Table 1, T2CR [21], is from 2024)

1.5. Abstract

The paper addresses the challenge of long-term sports assessment, which involves complex movement variations and the coordination between actions and background music. A key difficulty lies in the lack of a direct correlation between diverse background music and sporting actions, often requiring previous models to use a large number of parameters to learn these weak associations. To overcome this, the authors propose a Language-Guided Audio-Visual Learning (MLAVL) framework. This framework models "audio-action-visual" correlations by leveraging a low-cost language modality. It constructs action knowledge graphs from multidimensional domain-based actions, guiding audio-visual modalities to focus on task-relevant actions. The framework incorporates a Shared-Specific Context Encoder (S²CE) to integrate deep multimodal semantics and an Audio-Visual Cross-modal Fusion (AVCF) module to evaluate action-music consistency. Furthermore, a Dual-Branch Prompt-Guided Grading (DPG) module is designed to assess both visual and audio-visual performance in alignment with specific sport rules. Extensive experiments demonstrate that MLAVL achieves state-of-the-art results on four public long-term sports benchmarks while maintaining low computational costs and fewer parameters.

1.6. Original Source Link

/files/papers/69033af859708f78ec6faf73/paper.pdf — this is the Open Access version provided by the Computer Vision Foundation, confirmed to be identical to the accepted version.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is long-term sports assessment, a sub-area within Action Quality Assessment (AQA). AQA is crucial in various domains, including professional sports for scoring, healthcare for rehabilitation progress monitoring, and skill determination in training. While short-term action assessment has seen many solutions, long-term sports analysis remains particularly challenging due to its rich, minute-long video content, which contains much more complex and varied correlations.

Specifically, in sports like figure skating and rhythmic gymnastics, a significant challenge is assessing the action-music coordination (consistency between athletes' movements and background music). Judges explicitly factor this synchronization into scoring. However, existing methods face several issues:

  1. Complex Movement Variations: Long videos contain diverse sub-actions within the same category, making it difficult to understand and assess quality.

  2. Weak Action-Music Correlation: The background music in sports events often doesn't directly correlate with specific action sounds (e.g., landing, hitting), making it hard for models to naturally capture this relationship.

  3. High Model Complexity: Previous approaches often resort to using a large number of model parameters to learn these weak associations between actions and music, leading to high computational costs (as illustrated in Figure 1a and 1b). This reliance on larger models overlooks the need for more efficient understanding.

    The paper identifies a crucial gap: the need for domain-specific knowledge to reduce reliance on large model parameters, and the low-cost integration of prior knowledge. The innovative idea is to leverage language as this low-cost, domain-specific knowledge carrier. Language, a cornerstone of human cognition, has proven effective in computer vision (CV) tasks like action recognition and localization by providing semantic understanding. The authors propose to use language to introduce domain-specific action knowledge to bridge the action-music correlations within audio-visual modalities, aligning with actual sport rules. This transforms audio-visual learning into audio-action-visual learning.

2.2. Main Contributions / Findings

The paper introduces the Multidimensional Language-Guided Audio-Visual Learning (MLAVL) framework, offering several key contributions:

  • Novel Framework: Proposes MLAVL, a language-guided audio-visual learning framework for long-term sports assessment. This framework explicitly aims to reduce reliance on large model parameters by guiding the learning process with domain-specific action knowledge graphs.
  • Multidimensional Action Graph Guidance (MAG²): Designs a module that uses multidimensional domain-based actions to form action knowledge graphs. These graphs motivate both audio and visual temporal modalities to focus on task-relevant actions, effectively transforming audio-visual learning into audio-action-visual learning by incorporating action knowledge at a low cost.
  • Shared-Specific Context Encoder (S²CE): Introduces S²CE to address modality interference and enhance the correlation between multimodal features. It integrates modality-specific and modality-general information, providing richer and deeper features for subsequent modules to construct latent audio-action-visual correlations.
  • Audio-Visual Cross-Modal Fusion (AVCF): Proposes a novel AVCF module specifically designed for long-term sports assessment. This module evaluates action-music consistency by focusing on both global and clip-wise match between actions and music, which directly aligns with how judges score.
  • Dual-Branch Prompt-Guided Grading (DPG): Develops a DPG module that weighs both visual performance and audio-visual performance (action-music matching) to generate final scores. This module uses quality-related textual prompts to guide the assessment patterns, reflecting sport rules.
  • State-of-the-Art Performance: Achieves new state-of-the-art results on four public long-term sports benchmarks (FS1000, Fis-V, Rhythmic Gymnastics, and LOGO). The framework demonstrates superior performance in both correlation and Mean Square Error (MSE) metrics, while maintaining low computational costs and fewer parameters.
  • Plug-and-Play Design: The MAG² module is shown to be plug-and-play, significantly improving the performance of existing visual-only methods on the LOGO dataset, highlighting the generalizability and value of language-guided action knowledge.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following concepts:

  • Action Quality Assessment (AQA): This is the overarching task. AQA involves automatically evaluating the quality or skill level of an action performed in a video. For example, assessing how well a gymnast performs a routine or a diver executes a dive.

    • Short-term AQA: Focuses on discrete, well-defined actions (e.g., a single jump).
    • Long-term AQA: Deals with continuous, complex sequences of actions often spanning minutes (e.g., an entire figure skating routine), which includes multiple sub-actions and their overall composition. This paper focuses on long-term AQA.
  • Multimodal Learning: This is an approach in machine learning that combines information from multiple modalities (types of data) to gain a more comprehensive understanding of a phenomenon. In this paper, the modalities are:

    • Visual: Information from video frames (what you see).
    • Audio: Information from sound (what you hear, primarily music in this context).
    • Language: Information from text (semantic descriptions, rules, prompts). The idea is that each modality provides unique insights, and combining them can lead to a richer representation and better task performance than using any single modality alone.
  • Graph Neural Networks (GNNs) / Graph Convolutional Networks (GCNs): These are a class of neural networks designed to process data that is structured as a graph, rather than a sequence or a grid.

    • Graph: A data structure consisting of nodes (or vertices) and edges (or links) connecting pairs of nodes.
    • Graph Convolution: Similar to how a convolutional neural network (CNN) processes pixels in an image by looking at their neighbors, a GCN processes node features by aggregating information from their neighbors in the graph. This allows GCNs to learn representations that capture the relationships and structure within the graph.
    • In this paper, action knowledge graphs and temporal graphs are built, and GCNs are used to pass information between them.
  • Transformers and Attention Mechanism:

    • Transformer: A neural network architecture introduced in 2017, which has become dominant in natural language processing (NLP) and is increasingly used in computer vision. Its key innovation is the attention mechanism.
    • Attention Mechanism: This allows the model to weigh the importance of different parts of the input sequence (or different modalities) when processing another part. Instead of processing all input elements equally, attention enables the model to focus on the most relevant information.
      • Self-Attention: A mechanism within Transformers that relates different positions of a single sequence to compute a representation of the sequence. It calculates how much each word (or token) in a sentence relates to every other word in the same sentence.
      • Cross-Attention: A mechanism that relates elements from two different sequences. For example, it can be used to attend to visual features based on an audio query, or vice-versa.
    • The paper uses a shared Transformer encoder and a cross-temporal relation decoder (which uses cross-attention).
  • Text Encoders (e.g., CLIP): These are models designed to convert text (like words, phrases, or sentences) into numerical representations called embeddings (or feature vectors). These embeddings capture the semantic meaning of the text.

    • CLIP (Contrastive Language-Image Pre-training) [36]: A prominent example of a multimodal model that learns to associate text and images. It has a text encoder and an image encoder, trained to produce embeddings where matching text-image pairs are close in the embedding space. This allows it to perform zero-shot classification and guide tasks using natural language. The paper uses ViFi-CLIP [37], a fine-tuned version of CLIP for video.

3.2. Previous Works

The paper contextualizes its contributions by discussing existing approaches in Sports Assessment and Language-Guided Multimodal Video Understanding.

Sports Assessment

  • Short-Term Actions: Many works have tackled short-term AQA ([2, 21, 39, 50, 53, 54, 56, 64]), developing effective learning-based models for discrete actions.
  • Long-Term Actions: This is where the challenge lies. Previous methods have explored:
    • Visual-only Approaches:
      • Using multidimensional video information: multi-scale temporal features [49], video dynamic information [58], athlete static poses [35, 58].
      • Coarse-to-fine feature aggregation: [10, 48, 65] to establish grade patterns. Examples include GDLT [48] and CoFInAl [65].
      • TPT [2] (Temporal Parsing Transformer) and QTD [10] (Interpretable Long-term Action Quality Assessment).
    • Multimodal Learning (Audio, Language, Optical Flow):
      • Recent methods introduce diverse modalities to enhance long-term sports assessment.
      • Audio-Visual Models: [47, 57] learn audio-visual modalities to assess action-music consistency. Examples include MLP-Mixer [47] and PAMFN [57]. These models often rely on large parameters to capture weak associations.
      • Language-Enhanced Models: [12, 52] incorporate language for enhanced understanding. SGN [12] (Semantics-Guided Representations for scoring figure skating) is an example.

Language-Guided Multimodal Video Understanding

  • General Multimodal Video Understanding: [12, 17, 24, 30, 38, 47, 50, 57, 61] combine diverse modalities.
  • Language as a Bridge: When audio-visual supervision is insufficient, language can act as a bridge for audio-visual semantics ([17, 24, 38]).
  • Language for AQA: Textual semantics are shown to significantly enhance AQA ([12, 30, 50, 61]). Examples include SGN [12] and Narrative Action Evaluation [61].
  • Challenge of Weak Correlation: The paper notes that existing works struggle to extract weak correlations between background music and actions, as music is not derived from action sounds ([1, 32]).

Core Formulas from Previous Works (for context)

Attention Mechanism (from Transformers [41]): Since the paper uses cross-attention and self-attention, understanding the fundamental attention mechanism is crucial. The Scaled Dot-Product Attention is given by: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

  • $Q$ (Query): A matrix of query vectors; each row is one query.

  • $K$ (Key): A matrix of key vectors; each row is one key.

  • $V$ (Value): A matrix of value vectors; each row is one value.

  • $d_k$: The dimension of the key vectors, used for scaling so the dot products do not grow too large and push the softmax into regions with extremely small gradients.

  • $QK^T$: The dot-product similarity between each query and every key.

  • $\mathrm{softmax}(\cdot)$: Normalizes the scores into a probability distribution, indicating how much "attention" each value should receive.

  • $V$: The values are weighted by the attention probabilities and summed, producing the output of the attention layer.

    For Self-Attention, $Q$, $K$, and $V$ come from the same input sequence. For Cross-Attention, $Q$ comes from one sequence (e.g., visual features) while $K$ and $V$ come from another (e.g., audio features).
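
For concreteness, here is a small PyTorch sketch of scaled dot-product attention, showing that self- and cross-attention differ only in where $Q$, $K$, and $V$ come from; the single-head form and tensor shapes are simplifications, not details from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v). Returns (n_q, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n_q, n_kv) similarity scores
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ V                              # weighted sum of values

# Self-attention: Q, K, V all come from the same sequence.
x = torch.randn(16, 64)
self_attn_out = scaled_dot_product_attention(x, x, x)

# Cross-attention: queries from one modality, keys/values from another.
visual, audio = torch.randn(16, 64), torch.randn(16, 64)
cross_attn_out = scaled_dot_product_attention(visual, audio, audio)
```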

Graph Convolutional Network (GCN) Layer (simplified from GCNs [15]): A GCN layer performs feature transformation and aggregation across nodes. For a graph $G = (V, E)$ with adjacency matrix $A$ and node features $H^{(l)}$ at layer $l$, a basic GCN layer can be expressed as: $ H^{(l+1)} = \sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}) $

  • $H^{(l)}$: Input feature matrix at layer $l$.
  • $\tilde{A} = A + I$: Adjacency matrix $A$ with added self-loops (identity matrix $I$).
  • $\tilde{D}$: Degree matrix of $\tilde{A}$ (a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$).
  • $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$: Symmetrically normalized adjacency matrix.
  • $W^{(l)}$: Learnable weight matrix for layer $l$.
  • $\sigma(\cdot)$: An activation function (e.g., ReLU). The paper uses a slightly modified GCN formulation in its MAG² module, detailed in the methodology section.
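
A minimal PyTorch sketch of this normalized GCN layer follows; the toy adjacency matrix and feature sizes are illustrative only.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One symmetrically normalized GCN layer: H' = sigma(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_tilde = A + torch.eye(A.size(0))          # add self-loops
        deg = A_tilde.sum(dim=1)                    # node degrees
        D_inv_sqrt = torch.diag(deg.pow(-0.5))      # D^{-1/2}
        A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
        return torch.relu(self.weight(A_norm @ H))

# Toy usage: 10 nodes with 64-d features and a random symmetric adjacency matrix.
A = (torch.rand(10, 10) > 0.5).float()
A = ((A + A.t()) > 0).float()
H = torch.randn(10, 64)
H_next = GCNLayer(64, 64)(H, A)
```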

3.3. Technological Evolution

The field of AQA has evolved from purely visual-based methods, which often struggled with the nuances of long-term actions, to increasingly sophisticated approaches. Initially, models focused on extracting spatial-temporal features from video frames. The introduction of multimodal learning marked a significant step, recognizing that additional information (like audio or optical flow) could provide richer context. However, simply combining modalities often led to challenges such as:

  1. Weak correlation learning: Models needed large parameters to learn implicit, often weak, correlations between modalities.
  2. Lack of domain knowledge: Generic multimodal fusion might not align with specific assessment rules (e.g., judging criteria in sports).

The current paper represents a further evolution by integrating language guidance, leveraging pre-trained large language models (LLMs) or vision-language models (VLMs) to inject explicit domain-specific knowledge (such as sport rules and action definitions) at low computational cost. This allows a more direct and rule-aligned modeling of audio-action-visual relationships, moving beyond simply correlating audio and visual streams to understanding their synergy through the lens of human-defined actions.

3.4. Differentiation Analysis

Compared to the main methods in related work, MLAVL introduces several core differences and innovations:

  • Language-Guided Correlation Modeling:

    • Previous: Existing audio-visual models (MLP-Mixer [47], PAMFN [57]) often require large model parameters to learn weak associations between diverse background music and actions, as there's no direct acoustic correlation.
    • MLAVL: Explicitly transforms audio-visual learning into audio-action-visual learning. It uses a low-cost language modality to introduce domain-specific action knowledge (via action knowledge graphs). This guidance directly bridges the action-music correlations, making the learning process more efficient and semantically aligned with sport rules, reducing the need for massive parameters to implicitly learn these relations.
  • Structured Knowledge Integration:

    • Previous: While some works use language, they might not structure domain knowledge as rigorously.
    • MLAVL: MAG² module constructs multidimensional domain-based action knowledge graphs. This structured approach allows explicit knowledge transfer to guide audio-visual modalities to task-relevant actions, which is a more robust way to inject prior knowledge.
  • Enhanced Multimodal Semantics Integration:

    • Previous: Directly aggregating knowledge graphs to features can cause modality interference and performance degradation.
    • MLAVL: The Shared-Specific Context Encoder (S²CE) is specifically designed to fuse both modality-specific (pure, diverse information from each modality) and modality-general (shared temporal context) information. This deep integration aims to uncover latent audio-action-visual correlations more effectively.
  • Rule-Aligned Action-Music Consistency Assessment:

    • Previous: Existing cross-modal fusion modules often focus on global inter-modal interactions, potentially overlooking short-lived poor action-music interactions crucial for scoring.
    • MLAVL: Audio-Visual Cross-modal Fusion (AVCF) focuses on both global and clip-wise consistency between actions and music, directly conforming to how judges penalize mismatches.
  • Dual-Branch, Prompt-Guided Grading:

    • Previous: Grading modules might not explicitly weigh different aspects of performance according to sport rules.
    • MLAVL: The Dual-Branch Prompt-Guided Grading (DPG) module uses quality-related textual prompts to guide two distinct branches: one for visual action performance and another for action-music matching. This allows for a more nuanced and rule-aligned scoring mechanism.

4. Methodology

The proposed Multidimensional Language-guided Audio-Visual Learning (MLAVL) framework aims to learn audio-action-visual score patterns, guided by domain-specific action knowledge graphs. The overall framework, illustrated in Figure 2 of the original paper, processes video and audio inputs to generate a final score based on visual performance and action-music consistency.

4.1. Principles

The core idea of MLAVL is to leverage the semantic power of language to explicitly guide the learning of complex audio-visual correlations in long-term sports assessment. Instead of relying solely on implicit learning from audio-visual data (which can be noisy or weakly correlated) or requiring massive model parameters, MLAVL introduces domain-specific action knowledge through textual prompts and knowledge graphs. This guidance helps the model focus on task-relevant actions and their coordination with music, leading to more accurate and efficient assessment aligned with human judging rules. The framework aims to model a richer audio-action-visual relationship by integrating modality-specific and modality-general contexts, fusing information at both global and local levels, and using a rule-aligned dual-branch grading system.

4.2. Core Methodology In-depth (Layer by Layer)

The MLAVL framework consists of several interconnected modules. We will break down its architecture, data flow, and execution logic step-by-step, integrating mathematical formulas as they appear.

Step 1: Input Processing and Initial Feature Extraction

The input to the MLAVL framework is a long video containing image sequences and audio.

  1. Video Segmentation: The long video is first divided into $T$ non-overlapping consecutive 32-frame clips, a common practice in sports assessment for managing the temporal complexity of long videos.
  2. Modality-Specific Feature Extraction (Pre-trained Backbones):
    • For the visual input ($V_T$), a pre-trained visual-specific encoder (e.g., Video Swin Transformer (VST) [29] or Timesformer [3]) extracts visual features.
    • For the audio input ($A_T$), a pre-trained audio-specific encoder (e.g., Audio Spectrogram Transformer (AST) [14]) extracts audio features.
    • The parameters of these specific encoders are frozen during training to leverage their strong pre-trained representations.
  3. Token Projection: Trainable token projection networks (2-layer MLPs) project the features of the different modalities into a latent space with a consistent feature dimension $d$. The projected features are denoted $\{ \mathcal{F}_t^{\mathbf{v}} \}_{t=1}^{T}$ for visual and $\{ \mathcal{F}_t^{\mathbf{a}} \}_{t=1}^{T}$ for audio (a small sketch of this projection follows this list).
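
As an illustration of the token projection step, a 2-layer MLP mapping backbone clip features into a shared $d$-dimensional space might look like the following; the backbone output dimensions and $d = 256$ are assumptions for the sketch, not values stated here.

```python
import torch
import torch.nn as nn

def token_projection(in_dim, d=256):
    """2-layer MLP that projects backbone clip features into a shared d-dim space."""
    return nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(), nn.Linear(d, d))

T = 68                                   # number of 32-frame clips (varies per video)
visual_feats = torch.randn(T, 1024)      # frozen visual backbone output (dim assumed)
audio_feats = torch.randn(T, 768)        # frozen audio backbone output (dim assumed)

proj_v, proj_a = token_projection(1024), token_projection(768)
F_v, F_a = proj_v(visual_feats), proj_a(audio_feats)   # both (T, d)
```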

Step 2: Shared-Specific Context Encoder (S²CE)

  • Purpose: The S²CE aims to fuse both modality-specific (pure features from each modality) and modality-general (long-range temporal context shared across modalities) information. This is crucial because while modality-specific features provide diverse information, relying solely on them can lead to superficial single-modal learning under language guidance. The goal is to learn latent audio-action-visual correlations rather than isolated action-audio or action-visual links.
  • Process:
    • After initial projection (as described above), a shared Transformer encoder [41] (denoted as EsE_s) is applied. This Transformer encoder captures long-range, modality-agnostic temporal context, which is essential for analyzing long videos and human-centric tasks.
    • The output of this shared encoder is then combined with the original projected modality-specific features.
  • Formula (Equation 7): $ \left\{ f_t^{\mathbf{m}} \right\}_{t=1}^{T} = \left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T} \oplus E_s \left( \left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T} \right), \quad \mathbf{m} \in \left\{ \mathbf{v}, \mathbf{a} \right\} $
    • $\left\{ f_t^{\mathbf{m}} \right\}_{t=1}^{T}$: The final enhanced feature set for modality $\mathbf{m}$ (where $\mathbf{m}$ is $\mathbf{v}$ for visual or $\mathbf{a}$ for audio) after S²CE; each $f_t^{\mathbf{m}}$ is the feature vector for clip $t$.
    • $\left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T}$: The initial projected modality-specific feature set for modality $\mathbf{m}$.
    • $E_s\left( \left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T} \right)$: The output of the shared Transformer encoder $E_s$, which processes the projected modality-specific features to extract modality-agnostic temporal context.
    • $\oplus$: The combination of the modality-specific features with the shared temporal context. The paper does not spell the operator out at this point; given the later summation of global- and clip-level features in AVCF, it is most naturally read as an additive (element-wise) combination for feature refinement (see the sketch below).
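
A minimal sketch of the S²CE combination, assuming $\oplus$ is element-wise summation and standing in a stock nn.TransformerEncoder for the shared encoder $E_s$; the layer count and dimensions are guesses, not values from the paper.

```python
import torch
import torch.nn as nn

d, T = 256, 68
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=1,
)

def s2ce(F_m):
    """F_m: (1, T, d) projected modality-specific features.
    Returns modality-specific features enriched with shared temporal context."""
    return F_m + shared_encoder(F_m)      # interpreting the paper's ⊕ as summation

F_v, F_a = torch.randn(1, T, d), torch.randn(1, T, d)
f_v, f_a = s2ce(F_v), s2ce(F_a)           # the same shared encoder serves both modalities
```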

Step 3: Language-Guided Action Knowledge Embedding

  • Purpose: To introduce domain-specific action knowledge into the model using low-cost language modality. This knowledge will guide the audio-visual modalities to focus on task-relevant actions.
  • Process:
    • Text Prompt Sets: Two sets of text prompts are designed: $M_{\mathbf{v}}$ for visual actions and $M_{\mathbf{a}}$ for audio actions. Each consists of $M$ basic actions derived from official sport rules.
      • Example visual prompt template: "a video of [category]", where [category] is a basic action (e.g., "a video of triple axel").
      • Example audio prompt template: "a music suitable for [category]".
    • Text Encoding: A frozen pre-trained text encoder (e.g., ViFi-CLIP [37], a fine-tuned CLIP [36]) is used to embed these prompts into feature vectors.
    • Token Projection: As with the visual/audio features, a trainable token projection network maps these text embeddings into the consistent $d$-dimensional space.
  • Formulas (Equations 2 & 1): The overview in Section 3.1 states the initial feature extraction steps: $ \left\{ f_t^{\mathbf{v}} \right\}_{t=1}^{T} = E_{\mathbf{v}\cdot\mathbf{a}}\left( V_T \right), \quad \left\{ f_t^{\mathbf{a}} \right\}_{t=1}^{T} = E_{\mathbf{v}\cdot\mathbf{a}}\left( A_T \right), $ $ \left\{ f_m^{\mathbf{t}\cdot\mathbf{v}} \right\}_{m=1}^{M} = E_{\mathbf{t}}\left( M_{\mathbf{v}} \right), \quad \left\{ f_m^{\mathbf{t}\cdot\mathbf{a}} \right\}_{m=1}^{M} = E_{\mathbf{t}}\left( M_{\mathbf{a}} \right), $
    • $\left\{ f_t^{\mathbf{v}} \right\}_{t=1}^{T}$ and $\left\{ f_t^{\mathbf{a}} \right\}_{t=1}^{T}$: The visual and audio features for each clip $t$, output by the audio-visual encoding pipeline $E_{\mathbf{v}\cdot\mathbf{a}}$ (these correspond to the $\left\{ f_t^{\mathbf{m}} \right\}$ of Equation 7).
    • $\left\{ f_m^{\mathbf{t}\cdot\mathbf{v}} \right\}_{m=1}^{M}$: The feature set for the $M$ visual-action text prompts, encoded by the text encoder $E_{\mathbf{t}}$.
    • $\left\{ f_m^{\mathbf{t}\cdot\mathbf{a}} \right\}_{m=1}^{M}$: The feature set for the $M$ audio-action text prompts, encoded by the text encoder $E_{\mathbf{t}}$.
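
To make the prompt construction concrete, the snippet below builds the two prompt sets and encodes them with a frozen CLIP text encoder via the open_clip package. The action list and the use of stock OpenAI CLIP weights (rather than the ViFi-CLIP weights the paper actually uses) are illustrative assumptions.

```python
import torch
import open_clip

# Domain-specific basic actions (illustrative subset; the paper derives M actions
# from official sport rules).
actions = ["triple axel", "flying camel spin", "step sequence"]
visual_prompts = [f"a video of {a}" for a in actions]
audio_prompts = [f"a music suitable for {a}" for a in actions]

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():                                     # the text encoder stays frozen
    f_tv = model.encode_text(tokenizer(visual_prompts))   # (M, 512) visual-action embeddings
    f_ta = model.encode_text(tokenizer(audio_prompts))    # (M, 512) audio-action embeddings

# A trainable 2-layer MLP (as in Step 1) would then project these to dimension d.
```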

Step 4: Multidimensional Action Graph Guidance (MAG²)

  • Purpose: To transfer the action knowledge from the textual semantics (action knowledge graphs) to the visual/audio features (temporal graphs), explicitly guiding the model's focus.
  • Process:
    • Graph Construction:
      • Action Knowledge Graphs: $\mathcal{G}_{\mathbf{act}}^{\mathbf{v}}$ (for visual actions) and $\mathcal{G}_{\mathbf{act}}^{\mathbf{a}}$ (for audio actions) are constructed. Their nodes are the encoded text prompt features ($\left\{ f_m^{\mathbf{t}\cdot\mathbf{v}} \right\}$ and $\left\{ f_m^{\mathbf{t}\cdot\mathbf{a}} \right\}$). They are initially complete graphs, meaning every action node is connected to every other action node, representing intrinsic associations between action terms.
      • Temporal Graphs: $\mathcal{G}_{\mathbf{v}}$ (for visual clips) and $\mathcal{G}_{\mathbf{a}}$ (for audio clips) are constructed. Their nodes are the enhanced visual/audio features from S²CE ($\left\{ f_t^{\mathbf{v}} \right\}$ and $\left\{ f_t^{\mathbf{a}} \right\}$). These are also initially complete graphs, representing temporal associations between video/audio segments.
    • Dual Information Aggregation (GCNs): MAG² uses graph convolutional networks (GCNs) [15] to pass action knowledge. A cross-graph mapping $A_{\mathbf{act} \to \mathbf{v}}$ aggregates information from the action nodes (text features) to the visual temporal nodes (visual features); the same is done for audio.
  • Formulas (Equations 8 & 9): Let $H_{\mathbf{act}}^{(0)}$ be the initial feature matrix of the action knowledge graph nodes (e.g., $\left\{ f_m^{\mathbf{t}\cdot\mathbf{v}} \right\}$) and $H_{\mathbf{v}}^{(0)}$ the initial feature matrix of the visual temporal graph nodes (e.g., $\left\{ f_t^{\mathbf{v}} \right\}$). The GCN operations at layer $l+1$ are: $ H_{\mathbf{act}}^{(l+1)} = \sigma\left( A_{\mathbf{act}}^{\mathbf{v}} H_{\mathbf{act}}^{(l)} W_{\mathbf{act}}^{(l)} \right), $ $ H_{\mathbf{v}}^{(l+1)} = \sigma\left( A_{\mathbf{v}} H_{\mathbf{v}}^{(l)} W_{\mathbf{v}}^{(l)} + A_{\mathbf{act} \to \mathbf{v}} H_{\mathbf{act}}^{(l)} W_{\mathbf{cross}}^{(l)} \right), $
    • $H_{\mathbf{act}}^{(l+1)}$: Feature matrix of the action knowledge graph nodes at layer $l+1$.
    • $\sigma(\cdot)$: The ReLU non-linearity (activation function).
    • $A_{\mathbf{act}}^{\mathbf{v}}$: The adjacency matrix of the visual action knowledge graph, describing relationships between visual action types.
    • $H_{\mathbf{act}}^{(l)}$: Input feature matrix of the action knowledge graph nodes at layer $l$.
    • $W_{\mathbf{act}}^{(l)}$: Learnable weight matrix of the action knowledge graph GCN at layer $l$.
    • $H_{\mathbf{v}}^{(l+1)}$: Feature matrix of the visual temporal graph nodes at layer $l+1$.
    • $A_{\mathbf{v}}$: The adjacency matrix of the visual temporal graph, describing temporal relationships between visual clips.
    • $H_{\mathbf{v}}^{(l)}$: Input feature matrix of the visual temporal graph nodes at layer $l$.
    • $W_{\mathbf{v}}^{(l)}$: Learnable weight matrix of the visual temporal graph GCN at layer $l$.
    • $A_{\mathbf{act} \to \mathbf{v}}$: The cross-graph mapping matrix, defining how information flows from the action knowledge graph nodes to the visual temporal graph nodes; it captures the influence of specific actions on visual segments.
    • $W_{\mathbf{cross}}^{(l)}$: Learnable weight matrix for the cross-graph aggregation at layer $l$. The same process is applied to the audio features, yielding $\left\{ \hat{f}_t^{\mathbf{v}} \right\}_{t=1}^{T}$ and $\left\{ \hat{f}_t^{\mathbf{a}} \right\}_{t=1}^{T}$, the visual and audio features with aggregated action knowledge. MAG² typically uses a 2-layer GCN (see the sketch below).
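
The sketch below shows one layer of the dual aggregation in Equations 8-9 under simplifying assumptions: uniformly weighted complete graphs for $A_{\mathbf{act}}^{\mathbf{v}}$ and $A_{\mathbf{v}}$, and a cross-graph mapping built from cosine similarity between clip and action features. The paper does not pin these matrices down in this summary, so they are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T, M = 256, 68, 20
W_act, W_v, W_cross = (nn.Linear(d, d, bias=False) for _ in range(3))

H_act = torch.randn(M, d)    # action-knowledge graph nodes (encoded text prompts)
H_v = torch.randn(T, d)      # visual temporal graph nodes (clip features from S2CE)

# Complete graphs, row-normalized (assumption: uniform edge weights).
A_act = torch.full((M, M), 1.0 / M)
A_v = torch.full((T, T), 1.0 / T)
# Cross-graph mapping: softmax over cosine similarity between clips and actions (assumption).
A_act2v = F.softmax(F.normalize(H_v, dim=-1) @ F.normalize(H_act, dim=-1).t(), dim=-1)

# Eq. 8: update the action-knowledge nodes.
H_act_next = torch.relu(W_act(A_act @ H_act))
# Eq. 9: update the visual temporal nodes with temporal plus cross-graph aggregation.
H_v_next = torch.relu(W_v(A_v @ H_v) + W_cross(A_act2v @ H_act))
```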

Step 5: Audio-Visual Cross-Modal Fusion (AVCF)

  • Purpose: To generate multimodal features ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) specifically for assessing audio-visual scores, adhering to sport rules that emphasize consistency between actions and music. Unlike previous methods that focus on global interactions, AVCF considers both global and clip-wise action-music matches.
  • Process:
    1. Global Alignment (Cross-Temporal Relation Decoder): A 2-layer decoder with cross-attention finds the global alignment between visual and audio features.
      • The visual features ($\hat{f}_t^{\mathbf{v}}$) act as the query ($Q$).
      • The audio features ($\hat{f}_t^{\mathbf{a}}$) act as the key ($K$) and value ($V$).
      • Before the cross-attention, self-attention is applied to $\hat{f}_t^{\mathbf{v}}$ to enhance its representation.
    • Formula (Equation 10): $ \mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Softmax}\left( w_t^{Q} \hat{f}_t^{\mathbf{v}} \left( w_t^{K} \hat{f}_t^{\mathbf{a}} \right)^{\mathrm{T}} / \sqrt{d} \right) w_t^{V} \hat{f}_t^{\mathbf{a}}, $
      • $\mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}}$: The globally aligned audio-visual feature for clip $t$.
      • $\hat{f}_t^{\mathbf{v}}$: Visual feature for clip $t$ (output of MAG²).
      • $\hat{f}_t^{\mathbf{a}}$: Audio feature for clip $t$ (output of MAG²).
      • $w_t^{Q}, w_t^{K}, w_t^{V}$: Learnable weight matrices for the query, key, and value transformations of the cross-attention.
      • $d$: The feature dimension, used to scale the dot product.
      • $\mathrm{Softmax}(\cdot)$: Normalizes the attention scores.
      • A feed-forward network is then applied for non-linear transformation, as is standard in Transformer blocks.
    2. Clip-wise Match (Concatenation and Convolution): To capture short-lived action-music interactions, features are also processed clip by clip.
      • The visual ($\hat{f}_t^{\mathbf{v}}$) and audio ($\hat{f}_t^{\mathbf{a}}$) features are concatenated for each clip $t$.
      • A two-layer convolutional block (Conv-BatchNorm-ReLU) compresses the concatenated feature back to the original dimension $d$, exploiting the local feature extraction capability of convolution at low computational cost.
      • The resulting clip-level fused features ($\mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}}$) are then summed with the global-level features ($\mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}}$).
  • Formulas (Equations 11 & 12): $ \mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Convblock}\left( \hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}} \right), \quad \hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Concat}\left( \hat{f}_t^{\mathbf{v}}, \hat{f}_t^{\mathbf{a}} \right), $ $ \left\{ \hat{f}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T} = \left\{ \mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T} \oplus \left\{ \mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T} $
    • $\mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}}$: The clip-level fused audio-visual feature for clip $t$.
    • $\mathrm{Convblock}(\cdot)$: The two-layer convolutional block.
    • $\hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}}$: The concatenated visual and audio features for clip $t$, with dimension $2d$.
    • $\mathrm{Concat}(\cdot)$: The concatenation operation.
    • $\left\{ \hat{f}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T}$: The final fused audio-visual features for all clips, incorporating both global and clip-wise consistency.
    • $\oplus$: Again an additive combination (summation), here of the global- and clip-level features (see the sketch below).
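
A compact sketch of AVCF: visual-queried cross-attention for the global branch, concatenation plus a small Conv-BN-ReLU block for the clip-wise branch, and a final summation. The multi-head attention, kernel size, and dimensions are assumptions for illustration, not settings from the paper.

```python
import torch
import torch.nn as nn

d, T = 256, 68
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
conv_block = nn.Sequential(                        # Conv-BN-ReLU x2, compressing 2d -> d
    nn.Conv1d(2 * d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU(),
    nn.Conv1d(d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU(),
)

def avcf(f_v, f_a):
    """f_v, f_a: (1, T, d) clip features from MAG2. Returns fused (1, T, d)."""
    # Global branch: visual features query the audio sequence (Eq. 10).
    G, _ = cross_attn(query=f_v, key=f_a, value=f_a)
    # Clip-wise branch: per-clip concatenation compressed back to d (Eq. 11).
    C = conv_block(torch.cat([f_v, f_a], dim=-1).transpose(1, 2)).transpose(1, 2)
    return G + C                                   # Eq. 12, interpreting ⊕ as summation

fused = avcf(torch.randn(1, T, d), torch.randn(1, T, d))
```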

Step 6: Dual-Branch Prompt-Guided Grading (DPG)

  • Purpose: To assess final quality scores by weighing visual action performance and action-music matching, aligned with sport rules.
  • Process:
    1. Grade Prompt Design:
      • Visual Grade Prompts: $N$ distinct prompts (e.g., "very poor performance", "average performance", "excellent performance") are designed as grade prototypes for visual quality.
      • Audio-Visual Grade Prompts: $2N$ prompts are designed, which additionally consider action-music fit (e.g., "poorly matched performance", "perfectly matched performance").
    2. Prompt Encoding: These textual prompts are encoded by the same pre-trained text encoder as before, yielding $\left\{ f_n^{\mathbf{t}\cdot\mathbf{v}} \right\}_{n=1}^{N}$ for visual and $\left\{ f_n^{\mathbf{t}\cdot\mathbf{v}\cdot\mathbf{a}} \right\}_{n=1}^{2N}$ for audio-visual.
    3. Performance Grading Transformer (PGT): A 3-layer Transformer decoder acts as the PGT.
      • It uses the grade prompts as the query and the processed visual features ($\hat{f}_t^{\mathbf{v}}$) or fused audio-visual features ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) as the key-value pairs.
      • Shared decoder parameters are used across prompts to uncover universal quality patterns and reduce computational cost.
    • Formula (Equation 13): $ \mathbf{P}_n^{\mathbf{m}'} = \mathrm{Softmax}\left( \mathcal{W}_n^{Q} f_n^{\mathbf{t}\cdot\mathbf{m}'} \left( \mathcal{W}_t^{K} \hat{f}_t^{\mathbf{m}'} \right)^{\mathrm{T}} / \sqrt{d} \right) \mathcal{W}_t^{V} \hat{f}_t^{\mathbf{m}'}, $
      • $\mathbf{P}_n^{\mathbf{m}'}$: The $n$-th grade pattern for modality $\mathbf{m}'$.
      • $\mathbf{m}' \in \{ \mathbf{v}, \mathbf{v}\cdot\mathbf{a} \}$: Denotes either the visual branch or the audio-visual branch.
      • $f_n^{\mathbf{t}\cdot\mathbf{m}'}$: The $n$-th encoded text prompt feature for modality $\mathbf{m}'$; this acts as the query.
      • $\hat{f}_t^{\mathbf{m}'}$: The visual ($\hat{f}_t^{\mathbf{v}}$) or audio-visual ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) features for clip $t$; these act as the keys and values.
      • $\mathcal{W}_n^{Q}, \mathcal{W}_t^{K}, \mathcal{W}_t^{V}$: Learnable weight matrices for the query, key, and value transformations.
    4. Score Calculation:
      • Two 2-layer MLPs convert the grade patterns ($\mathbf{P}_n^{\mathbf{m}'}$) into predicted grade probabilities ($\hat{\mathbf{P}}_n^{\mathbf{m}'}$).
      • These probabilities are combined with fixed grade weights ($\mathbf{W}_n^{\mathbf{v}}$ and $\mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}}$) to obtain the visual action score ($\hat{s}_1$) and the action-music matching score ($\hat{s}_2$). The fixed weights are $\mathbf{W}_n^{\mathbf{v}} = \frac{n-1}{N-1}$ and $\mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}} = \frac{n-1}{2N-1}$, reflecting a linear progression of quality.
      • The final score ($\hat{s}$) is a weighted sum of $\hat{s}_1$ and $\hat{s}_2$ with a learnable weight $\alpha$.
  • Formulas (Equations 14 & 15): $ \hat{\mathbf{P}}_n^{\mathbf{m}'} = \mathrm{MLP}\left( \mathbf{P}_n^{\mathbf{m}'} \right), \quad \mathbf{m}' \in \left\{ \mathbf{v}, \mathbf{v}\cdot\mathbf{a} \right\}, $ $ \hat{s} = \alpha \sum_{n=1}^{N} \mathbf{W}_n^{\mathbf{v}} \hat{\mathbf{P}}_n^{\mathbf{v}} + (1 - \alpha) \sum_{n=1}^{2N} \mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}} \hat{\mathbf{P}}_n^{\mathbf{v}\cdot\mathbf{a}}. $
    • $\hat{\mathbf{P}}_n^{\mathbf{m}'}$: The predicted probability of the $n$-th grade pattern in modality $\mathbf{m}'$.
    • $\mathrm{MLP}(\cdot)$: A multi-layer perceptron.
    • $\hat{s}$: The final predicted score.
    • $\alpha$: A learnable weighting parameter (between 0 and 1) balancing the visual-only score $\hat{s}_1 = \sum_{n=1}^{N} \mathbf{W}_n^{\mathbf{v}} \hat{\mathbf{P}}_n^{\mathbf{v}}$ and the audio-visual score $\hat{s}_2 = \sum_{n=1}^{2N} \mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}} \hat{\mathbf{P}}_n^{\mathbf{v}\cdot\mathbf{a}}$.
    • $\mathbf{W}_n^{\mathbf{v}}$: Fixed grade weights for visual performance, increasing linearly from 0 to 1 across the $N$ grades.
    • $\mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}}$: Fixed grade weights for audio-visual performance, increasing linearly from 0 to 1 across the $2N$ grades.
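
The weighted-score computation of Equations 14-15 is simple enough to show directly; the grade probabilities below are random stand-ins for the MLP outputs, $N = 3$ is only an example, and the resulting scores live in [0, 1] (in practice they would be scaled to the dataset's score range).

```python
import torch

N = 3                                              # number of visual grade prompts (example)
alpha = torch.tensor(0.5)                          # learnable in the paper; fixed here

# Stand-ins for the MLP outputs over the grade patterns (softmax -> probabilities).
P_v = torch.softmax(torch.randn(N), dim=0)         # visual-branch grade probabilities
P_va = torch.softmax(torch.randn(2 * N), dim=0)    # audio-visual-branch grade probabilities

# Fixed, linearly increasing grade weights: W_n = (n-1)/(N-1) for n = 1..N (and 1..2N).
W_v = torch.arange(N, dtype=torch.float) / (N - 1)
W_va = torch.arange(2 * N, dtype=torch.float) / (2 * N - 1)

s1 = (W_v * P_v).sum()                             # visual action score
s2 = (W_va * P_va).sum()                           # action-music matching score
s_hat = alpha * s1 + (1 - alpha) * s2              # final predicted score (Eq. 15)
```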

Step 7: Optimization

  • Purpose: To optimize the grading patterns and predicted scores for accurate sports assessment.

  • Process: The model is trained using a combination of three loss functions:

    1. Triplet Loss ($\mathcal{L}_{TL}$): Ensures discriminability between grade patterns, so that distinct quality levels yield distinct patterns.
    2. Cross-Entropy Loss ($\mathcal{L}_{CE}$): Ensures that the textual prompts guide the grading accurately, aligning the learned grade patterns with their corresponding textual semantics.
    3. Mean Square Error Loss ($\mathcal{L}_{MSE}$): Measures the numerical difference between the predicted score ($\hat{s}$) and the ground-truth score ($s$).
  • Formulas (Equations 16, 17, & 18): $ \mathcal{L}_{TL} = \sum_n \left[ \max\left( \mathrm{sim}\left( \mathbf{P}_n^{\mathbf{m}}, \mathbf{P}_i^{\mathbf{m}} \right) \right) - \min\left( \mathrm{sim}\left( \mathbf{P}_n^{\mathbf{m}}, \mathbf{P}_i^{\mathbf{m}} \right) \right) + \delta \right]_+, $ $ \mathcal{L}_{CE} = - \sum_n \log \frac{ \exp\left( \mathrm{sim}\left( f_n^{\mathbf{t}\cdot\mathbf{m}}, \mathbf{P}_n^{\mathbf{m}} \right) / \varsigma \right) }{ \sum_j \exp\left( \mathrm{sim}\left( f_n^{\mathbf{t}\cdot\mathbf{m}}, \mathbf{P}_j^{\mathbf{m}} \right) / \varsigma \right) }, $ $ \mathcal{T} = \lambda_1 \mathcal{L}_{TL} + \lambda_2 \mathcal{L}_{CE} + \lambda_3 \mathcal{L}_{MSE}. $

    • $\mathcal{L}_{TL}$ (Triplet Loss):

      • $\mathrm{sim}(\cdot, \cdot)$: Cosine similarity between two feature vectors (here, grade patterns $\mathbf{P}_n^{\mathbf{m}}$ and $\mathbf{P}_i^{\mathbf{m}}$).
      • $\mathbf{P}_n^{\mathbf{m}}$: The $n$-th grade pattern for modality $\mathbf{m}$ (visual or audio-visual).
      • $\mathbf{P}_i^{\mathbf{m}}$: Another grade pattern, with $i \neq n$.
      • The expression follows the general form of a triplet loss: for each pattern $\mathbf{P}_n^{\mathbf{m}}$ it compares, with a margin $\delta$, its maximum and minimum similarity to the other patterns $\mathbf{P}_i^{\mathbf{m}}$ and penalizes the hinge term when positive. In effect it regularizes how the grade patterns are spread relative to one another so that they remain mutually discriminable; a conventional triplet loss would instead contrast an anchor with explicit positive and negative samples.
      • $[\cdot]_+$: Denotes $\max(0, \cdot)$; the loss is incurred only when the term inside is positive.
      • $\delta$: A margin parameter defining the minimum required separation between patterns.
    • $\mathcal{L}_{CE}$ (Cross-Entropy Loss):

      • $\mathrm{sim}(\cdot, \cdot)$: Cosine similarity.
      • $f_n^{\mathbf{t}\cdot\mathbf{m}}$: The encoded text prompt feature corresponding to grade $n$ for modality $\mathbf{m}$.
      • $\mathbf{P}_n^{\mathbf{m}}$: The $n$-th grade pattern for modality $\mathbf{m}$.
      • $\varsigma$: A temperature hyperparameter scaling the similarities inside the softmax.
      • This loss encourages the learned grade pattern $\mathbf{P}_n^{\mathbf{m}}$ to be highly similar to its corresponding text prompt $f_n^{\mathbf{t}\cdot\mathbf{m}}$ relative to all other grade patterns $\mathbf{P}_j^{\mathbf{m}}$.
    • $\mathcal{L}_{MSE}$ (Mean Square Error Loss): The standard MSE between the predicted score $\hat{s}$ and the ground-truth score $s$; over $N$ samples, $\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{s}_i - s_i \right)^2$.

    • $\mathcal{T}$: The overall objective function to be minimized.

    • $\lambda_1, \lambda_2, \lambda_3$: Balancing weights for the three loss components (see the sketch below).
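
Below is a minimal sketch of the combined objective under the reading given above: a margin term over grade patterns, a prompt-alignment cross-entropy over cosine similarities, and MSE on the scores. The margin, temperature, and λ weights are placeholder values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def pattern_margin_loss(P, delta=0.2):
    """Margin term over grade patterns, following the expression reproduced above:
    for each pattern, the gap between its largest and smallest cosine similarity to
    the other patterns (plus a margin) is penalized when positive."""
    sim = F.cosine_similarity(P.unsqueeze(1), P.unsqueeze(0), dim=-1)   # (N, N)
    eye = torch.eye(P.size(0), dtype=torch.bool)
    max_sim = sim.masked_fill(eye, float("-inf")).max(dim=1).values     # ignore self
    min_sim = sim.masked_fill(eye, float("inf")).min(dim=1).values
    return (max_sim - min_sim + delta).clamp(min=0).sum()

def prompt_alignment_ce(f_t, P, tau=0.1):
    """Cross-entropy pulling each grade pattern toward its own grade prompt (Eq. 17)."""
    logits = F.cosine_similarity(f_t.unsqueeze(1), P.unsqueeze(0), dim=-1) / tau  # (N, N)
    return F.cross_entropy(logits, torch.arange(P.size(0)))

# Total objective (Eq. 18); lambda weights are placeholders.
N, d = 6, 256
P = torch.randn(N, d, requires_grad=True)   # grade patterns from the PGT
f_t = torch.randn(N, d)                     # encoded grade prompts
s_hat, s = torch.tensor(78.5), torch.tensor(80.0)
loss = 1.0 * pattern_margin_loss(P) + 1.0 * prompt_alignment_ce(f_t, P) + 1.0 * F.mse_loss(s_hat, s)
```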

      The complete architecture integrates these modules, allowing information to flow from raw video/audio, through language-guided feature enhancement, cross-modal fusion, and finally to a rule-aligned grading mechanism.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on four public long-term sports assessment benchmarks:

  • FS1000 [47]:

    • Source/Domain: Figure skating.
    • Characteristics: Provides a rule-aligned, comprehensive audio-visual assessment with diverse scoring scenes. It is particularly challenging due to the complexity of figure skating movements and the intricate coordination with music.
    • Scale: Fixed 95 video clips were randomly selected for the experiments.
    • Why chosen: Represents a challenging scenario requiring robust multimodal understanding and precise assessment of action-music coordination.
  • Fis-V [49]:

    • Source/Domain: Figure skating.
    • Characteristics: Another benchmark for figure skating assessment.
    • Scale: Fixed 124 video clips were randomly selected for the experiments.
    • Why chosen: Allows for further validation of the method's performance on figure skating, which is a key domain for audio-visual coordination.
  • Rhythmic Gymnastics (RG) [58]:

    • Source/Domain: Rhythmic Gymnastics.
    • Characteristics: An audio-visual dataset where long videos contain complex multi-scale temporal features, video dynamic information, and athlete static poses. Rhythmic gymnastics also heavily emphasizes action-music coordination.
    • Scale: Fixed 68 video clips were randomly selected for the experiments.
    • Why chosen: Provides a different sport context with similar audio-visual coordination requirements, validating generalizability.
  • LOGO [60]:

    • Source/Domain: Group Action Quality Assessment (AQA).

    • Characteristics: A long-form video dataset. Notably, this is a visual-only dataset.

    • Scale: Fixed 48 video clips were randomly selected for the experiments.

    • Why chosen: Used to validate the plug-and-play capability and effectiveness of the MAG² module (action-visual graph guidance only) on visual-only methods, demonstrating the value of language-guided action knowledge beyond audio-visual contexts.

      These datasets are chosen because they represent the core challenges of long-term sports assessment, including complex movements, the need for multimodal understanding (especially audio-visual synergy), and the relevance of domain-specific rules in scoring.

5.2. Evaluation Metrics

The paper adopts standard metrics to evaluate the approach fully: Spearman's Rank Correlation ($\rho$) and Mean Square Error (MSE) / Relative L2-distance (R-$\ell_2$).

  1. Spearman's Rank Correlation ($\rho$)

    • Conceptual Definition: Spearman's rank correlation coefficient assesses the monotonic relationship between two ranked variables, i.e., how well the relationship can be described by a monotonic (increasing or decreasing) function. Unlike Pearson's correlation, it does not assume linearity or normally distributed data. In AQA, it measures the agreement between the rank order of predicted scores and the rank order of ground-truth scores: a high Spearman's $\rho$ means that videos ranked higher by the model also tend to be ranked higher by the ground truth.
    • Mathematical Formula: $ \rho = \frac{ \sum_i \left( q_i - \bar{q} \right)\left( \hat{q}_i - \bar{\hat{q}} \right) }{ \sqrt{ \sum_i \left( q_i - \bar{q} \right)^2 \sum_i \left( \hat{q}_i - \bar{\hat{q}} \right)^2 } } $
    • Symbol Explanation:
      • $q_i$: The ground-truth rank of the $i$-th video.
      • $\hat{q}_i$: The predicted rank of the $i$-th video.
      • $\bar{q}$: The mean of the ground-truth ranks.
      • $\bar{\hat{q}}$: The mean of the predicted ranks.
      • The sum $\sum_i$ runs over all $N$ videos in the dataset.
    • Interpretation: Values range from -1 to +1. A value of +1 indicates a perfect monotonic increasing relationship, -1 a perfect monotonic decreasing relationship, and 0 no monotonic relationship. For AQA, values closer to +1 are better (a worked example covering all three metrics follows at the end of this list).
  2. Mean Square Error (MSE)

    • Conceptual Definition: Mean Square Error is a common metric used to quantify the average magnitude of the errors in a set of predictions. It measures the average of the squares of the differences between the predicted values and the actual values. By squaring the errors, MSE penalizes larger errors more heavily than smaller ones. In AQA, MSE quantifies the numerical difference between the predicted scores and the ground-truth scores.
    • Mathematical Formula: $ \mathrm { M S E } = \frac { 1 } { N } \sum _ { i = 1 } ^ { N } \left( \hat { s } _ { i } - s _ { i } \right) ^ { 2 } $
    • Symbol Explanation:
      • $\hat{s}_i$: The predicted score for the $i$-th video.
      • $s_i$: The ground-truth score for the $i$-th video.
      • $N$: The total number of videos.
    • Interpretation: A value of 0 indicates a perfect prediction. Lower values are better, indicating closer agreement between predicted and true scores.
  3. Relative L2-distance (R-$\ell_2$)

    • Conceptual Definition: Relative L2-distance is a normalized version of the L2 (Euclidean) distance. It measures the average magnitude of the errors, scaled by the full range of possible scores. This normalization makes the metric comparable across datasets with different score ranges, giving a scale-invariant measure of prediction accuracy.
    • Mathematical Formula: $ \mathrm{R}\text{-}\ell_2 = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{ \left| s_n - \hat{s}_n \right| }{ s_{\mathrm{max}} - s_{\mathrm{min}} } \right)^2 $
    • Symbol Explanation:
      • $s_n$: The ground-truth score for the $n$-th video.
      • $\hat{s}_n$: The predicted score for the $n$-th video.
      • $s_{\mathrm{max}}$: The maximum possible ground-truth score in the dataset.
      • $s_{\mathrm{min}}$: The minimum possible ground-truth score in the dataset.
      • $N$: The total number of videos.
    • Interpretation: Lower values are better, indicating prediction errors that are small relative to the full score range (see the worked example below).
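
A short worked example computing all three metrics, using scipy.stats.spearmanr for Spearman's $\rho$ (it handles the ranking internally) and NumPy for MSE and R-$\ell_2$; the scores and the 0-100 score range are made up for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

pred = np.array([78.2, 65.4, 91.0, 70.3, 84.5])   # predicted scores (illustrative)
gt = np.array([80.0, 62.5, 93.1, 71.0, 82.0])     # ground-truth scores (illustrative)
s_min, s_max = 0.0, 100.0                          # assumed dataset score range

rho, _ = spearmanr(pred, gt)                       # rank correlation
mse = np.mean((pred - gt) ** 2)                    # mean square error
r_l2 = np.mean((np.abs(gt - pred) / (s_max - s_min)) ** 2)   # relative L2-distance

print(f"Spearman rho: {rho:.3f}, MSE: {mse:.2f}, R-l2: {r_l2:.5f}")
```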

5.3. Baselines

The paper compares MLAVL against a comprehensive set of state-of-the-art AQA methods, categorized primarily by their modality usage:

  • Visual-Only Methods: These baselines primarily rely on visual information from the video to assess action quality.

    • C3D-LSTM [34]: An early approach using 3D Convolutional Neural Networks (C3D) for spatio-temporal feature extraction, followed by Long Short-Term Memory (LSTM) networks for temporal modeling.
    • MSCADC [33]: A multi-scale approach for action quality assessment.
    • MS-LSTM [49]: Multi-scale LSTM for scoring figure skating.
    • CoRe [56]: Group-aware Contrastive Regression for AQA.
    • GDLT [48]: Likert scoring with grade decoupling for long-term action assessment.
    • TPT [2]: Temporal Parsing Transformer for AQA.
    • T2CR [21]: Two-path Target-aware Contrastive Regression for AQA.
    • CoFInAl [65]: Enhances AQA with coarse-to-fine instruction alignment.
    • QTD [10]: Interpretable long-term action quality assessment.
  • Audio-Visual/Multimodal Methods: These baselines integrate information from both audio and visual modalities.

    • M-BERT (Late) [23]: A multimodal Transformer approach that processes visual and audio features. "Late" likely refers to late fusion of modalities.

    • MLP-Mixer [47]: Utilizes MLPs for long-term sport audio-visual modeling.

    • SGN [12]: Semantics-Guided Representations for scoring figure skating, potentially using language or semantic cues.

    • PAMFN [57]: Multimodal Action Quality Assessment, a state-of-the-art audio-visual model.

      These baselines are representative of the evolution and current state of AQA research, covering different architectural choices (CNNs, LSTMs, Transformers, MLPs) and modality integrations (visual-only, audio-visual). Comparing against them allows MLAVL to demonstrate its advantages in leveraging language for more efficient and accurate multimodal learning.

6. Results & Analysis

The experimental results demonstrate that MLAVL consistently achieves state-of-the-art performance across various long-term sports assessment benchmarks, often with lower computational costs.

6.1. Core Results Analysis

FS1000 Dataset (Figure Skating)

The FS1000 dataset is known for its comprehensive, rule-aligned audio-visual assessment of figure skating, presenting significant challenges. As shown in Table 1, MLAVL achieves the best results across all score types (TES, PCS, SS, TR, PE, CO, IN) for Spearman correlation ($\rho$) and Mean Square Error (MSE), and consequently the best averages.

  • Spearman Correlation (Avg.): MLAVL achieves 0.90, surpassing PAMFN [57] (0.87) and SGN [12] (0.85) by 0.03 and 0.05, respectively.
  • MSE (Avg.): MLAVL achieves 10.39, significantly lower than PAMFN [57] (16.80) and SGN [12] (12.77). This highlights MLAVL's ability to model complex multimodal relationships and accurately capture action-music coordination. The authors attribute this to MLAVL's design of fixed, domain-specific prompts that introduce action knowledge at a low cost, leading to accurate learning of audio-visual relationships.

Fis-V Dataset (Figure Skating)

Table 2 compares MLAVL on the Fis-V dataset, reporting Spearman Correlation ($\rho$) and MSE along with computational cost (#Params and #FLOPs).

  • Balanced Performance: MLAVL achieves an Avg. Sp. Corr. of 0.823 and an Avg. MSE of 13.31, narrowly surpassing PAMFN's Avg. Sp. Corr. of 0.822 while delivering a substantially lower Avg. MSE (13.31 vs. 15.33). Compared to MLP-Mixer [47], which is strong on MSE, MLAVL is slightly better (13.31 vs. 13.77) while achieving a clearly higher Sp. Corr. (0.823 vs. 0.759).
  • Efficiency: Crucially, MLAVL achieves this performance with significantly fewer parameters (3.82M) and lower FLOPs (0.778G) compared to MLP-Mixer (14.32M params, 49.900G FLOPs) and PAMFN (18.06M params, 2.562G FLOPs). This validates the paper's claim that language-guided prompts efficiently establish audio-action-visual relationships without requiring large model parameters.

Rhythmic Gymnastics (RG) Dataset

Table 3 presents results on the RG dataset, showcasing MLAVL's performance across different apparatus (Ball, Clubs, Hoop, Ribbon).

  • Overall SOTA: MLAVL sets a new state-of-the-art with an Avg. Sp. Corr. of 0.849 and an Avg. MSE of 4.47.
  • Significant MSE Improvement: It improves the Avg. MSE by 1.06 over the second-best approach GDLT [48] (5.53). This strong performance is attributed to the multidimensional action knowledge and the dual-branch prompt-guided grading mechanism, which directly aligns with assessment rules in rhythmic gymnastics.

Overall Effectiveness

The results across these three datasets consistently demonstrate MLAVL's robust effectiveness, achieving balanced state-of-the-art performance in both correlation (rank order) and numerical accuracy (MSE).

The comparison of scatter plots in Figure 3 (a, d) visually reinforces MLAVL's superiority over PAMFN on FS1000 (PCS). MLAVL's predictions show a tighter correlation with ground truth scores, indicating better accuracy. The t-SNE feature distribution plots (b, c, e, f) further illustrate the impact of the MAG² module. Without MAG², the feature distribution is disordered with significant class overlap (b, e). With MAG², the grade categories display clear boundaries and distinct clustering (c, f), confirming that MAG² effectively distinguishes different action qualities by introducing language-guided action knowledge.
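The kind of qualitative feature inspection described above is straightforward to reproduce for one's own embeddings. The sketch below shows a typical t-SNE plot of video-level features colored by grade; the features and labels are random placeholders, not the paper's data.

```python
# Hedged sketch: t-SNE visualization of video-level features colored by grade,
# in the spirit of Figure 3 (b, c, e, f). Substitute your own pooled embeddings
# and grade categories for the random placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 256))   # placeholder video-level embeddings
grades = rng.integers(0, 4, size=300)    # placeholder coarse quality grades

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)

plt.figure(figsize=(5, 4))
plt.scatter(emb[:, 0], emb[:, 1], c=grades, cmap="viridis", s=8)
plt.title("t-SNE of video features colored by grade")
plt.colorbar(label="grade")
plt.savefig("tsne_grades.png", dpi=150)
```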

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Methods | Year | Features | TES ρ↑ | PCS ρ↑ | SS ρ↑ | TR ρ↑ | PE ρ↑ | CO ρ↑ | IN ρ↑ | Avg. ρ↑ | TES MSE↓ | PCS MSE↓ | SS MSE↓ | TR MSE↓ | PE MSE↓ | CO MSE↓ | IN MSE↓ | Avg. MSE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D-LSTM [34] | 2017 | C3D [40] | 0.78 | 0.53 | 0.50 | 0.52 | 0.52 | 0.57 | 0.47 | 0.57 | 308.30 | 25.85 | 0.92 | 0.99 | 1.21 | 0.97 | 1.01 | 48.61 |
| MSCADC [33] | 2019 | Timesformer [3] | 0.77 | 0.70 | 0.69 | 0.69 | 0.71 | 0.68 | 0.71 | 0.71 | 148.02 | 15.47 | 0.51 | 0.57 | 0.78 | 0.55 | 0.60 | 23.79 |
| MS-LSTM [49] | 2019 | Timesformer [3] | 0.86 | 0.80 | 0.77 | 0.78 | 0.76 | 0.79 | 0.78 | 0.79 | 94.55 | 11.03 | 0.45 | 0.49 | 0.76 | 0.43 | 0.47 | 15.45 |
| CoRe [56] | 2021 | Timesformer [3] | 0.88 | 0.84 | 0.81 | 0.83 | 0.81 | 0.83 | 0.80 | 0.83 | 103.50 | 9.85 | 0.41 | 0.37 | 0.81 | 0.38 | 0.41 | 16.53 |
| GDLT* [48] | 2022 | Timesformer [3] | 0.88 | 0.86 | 0.84 | 0.86 | 0.83 | 0.85 | 0.84 | 0.85 | 82.73 | 10.32 | 0.35 | 0.37 | 0.67 | 0.38 | 0.42 | 13.60 |
| TPT [2] | 2022 | Timesformer [3] | 0.88 | 0.83 | 0.82 | 0.82 | 0.81 | 0.82 | 0.81 | 0.83 | 80.00 | 8.88 | 0.34 | 0.37 | 0.63 | 0.34 | 0.39 | 12.99 |
| T2CR* [21] | 2024 | Timesformer [3] | 0.86 | 0.79 | 0.83 | 0.84 | 0.82 | 0.84 | 0.80 | 0.83 | 107.59 | 15.26 | 0.61 | 0.48 | 0.69 | 0.57 | 0.42 | 17.95 |
| CoFInAl* [65] | 2024 | Timesformer [3] | 0.84 | 0.83 | 0.84 | 0.84 | 0.81 | 0.83 | 0.82 | 0.83 | 81.65 | 16.05 | 0.56 | 0.63 | 0.71 | 0.41 | 0.54 | 14.36 |
| QTD* [10] | 2024 | Timesformer [3] | 0.88 | 0.85 | 0.85 | 0.86 | 0.83 | 0.85 | 0.84 | 0.85 | 137.09 | 17.48 | 0.51 | 0.73 | 0.80 | 0.91 | 0.98 | 22.64 |
| M-BERT (Late) [23] | 2020 | TF [3]+AST [14] | 0.79 | 0.75 | 0.80 | 0.81 | 0.80 | 0.80 | 0.76 | 0.79 | 131.28 | 15.28 | 0.44 | 0.43 | 0.67 | 0.47 | 0.55 | 21.30 |
| MLP-Mixer† [47] | 2023 | TF [3]+AST [14] | 0.88 | 0.82 | 0.80 | 0.81 | 0.80 | 0.81 | 0.81 | 0.82 | 81.24 | 9.47 | 0.35 | 0.35 | 0.62 | 0.37 | 0.39 | 13.26 |
| SGN [12] | 2024 | TF [3]+AST [14] | 0.89 | 0.85 | 0.84 | 0.85 | 0.82 | 0.85 | 0.83 | 0.85 | 79.08 | 8.40 | 0.31 | 0.32 | 0.61 | 0.33 | 0.37 | 12.77 |
| PAMFN+* [57] | 2024 | TF [3]+AST [14]+I3D [4] | 0.90 | 0.89 | 0.86 | 0.87 | 0.86 | 0.87 | 0.85 | 0.87 | 104.89 | 10.05 | 0.39 | 0.52 | 0.78 | 0.40 | 0.56 | 16.80 |
| MLAVL (Ours) | - | TF [3]+AST [14]+CLIP [36] | 0.92 | 0.89 | 0.90 | 0.90 | 0.88 | 0.89 | 0.88 | 0.90 | 64.89 | 6.39 | 0.23 | 0.24 | 0.50 | 0.25 | 0.26 | 10.39 |

The following are the results from Table 2 of the original paper:

| Methods | #Params (M) | #FLOPs (G) | TES Sp. Corr.↑ | PCS Sp. Corr.↑ | Avg. Sp. Corr.↑ | TES MSE↓ | PCS MSE↓ | Avg. MSE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D-LSTM [34] | - | - | 0.290 | 0.510 | 0.406 | 39.25 | 21.97 | 30.61 |
| MSCADC [33] | - | - | 0.500 | 0.610 | 0.557 | 25.93 | 11.94 | 18.94 |
| MS-LSTM [49] | - | - | 0.650 | 0.780 | 0.721 | 19.91 | 8.35 | 14.13 |
| M-BERT (Late) [23] | 4.00 | 1.272 | 0.530 | 0.720 | 0.634 | 27.73 | 12.38 | 20.06 |
| GDLT* [48] | 3.20 | 0.268 | 0.685 | 0.820 | 0.761 | 20.99 | 8.75 | 14.87 |
| CoRe [56] | 2.51 | 0.010 | 0.660 | 0.820 | 0.751 | 23.50 | 9.25 | 16.38 |
| TPT [2] | 11.82 | 2.229 | 0.570 | 0.760 | 0.676 | 27.50 | 11.25 | 19.38 |
| MLP-Mixer [47] | 14.32 | 49.900 | 0.680 | 0.820 | 0.759 | 19.57 | 7.96 | 13.77 |
| SGN [12] | - | - | 0.700 | 0.830 | 0.773 | 19.05 | 7.96 | 13.51 |
| PAMFN [57] | 18.06 | 2.562 | 0.754 | 0.872 | 0.822 | 22.50 | 8.16 | 15.33 |
| CoFInAl* [65] | 5.24 | 0.509 | 0.716 | 0.843 | 0.788 | 20.76 | 7.91 | 14.34 |
| QTD* [10] | 5.51 | 0.396 | 0.717 | 0.858 | 0.798 | 26.97 | 10.89 | 18.93 |
| MLAVL (Ours) | 3.82 | 0.778 | 0.766 | 0.863 | 0.823 | 19.44 | 7.17 | 13.31 |

The following are the results from Table 3 of the original paper:

| Methods | Year | Features | Ball ρ↑ | Clubs ρ↑ | Hoop ρ↑ | Ribbon ρ↑ | Avg. ρ↑ | Ball MSE↓ | Clubs MSE↓ | Hoop MSE↓ | Ribbon MSE↓ | Avg. MSE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D+SVR [34] | 2017 | C3D [40] | 0.357 | 0.551 | 0.495 | 0.516 | 0.483 | - | - | - | - | - |
| MS-LSTM* [49] | 2019 | I3D [4] | 0.515 | 0.621 | 0.540 | 0.522 | 0.551 | 10.55 | 6.94 | 5.85 | 12.56 | 8.97 |
| MS-LSTM* [49] | 2019 | VST [29] | 0.621 | 0.661 | 0.670 | 0.695 | 0.663 | 7.52 | 6.04 | 6.16 | 5.78 | 6.37 |
| ACTION-NET* [58] | 2020 | I3D [4]+ResNet [16] | 0.528 | 0.652 | 0.708 | 0.578 | 0.623 | 9.09 | 6.40 | 5.93 | 10.23 | 7.91 |
| ACTION-NET* [58] | 2020 | VST [29]+ResNet [16] | 0.684 | 0.737 | 0.733 | 0.754 | 0.728 | 9.55 | 6.36 | 5.56 | 8.15 | 7.41 |
| GDLT* [48] | 2022 | VST [29] | 0.746 | 0.802 | 0.765 | 0.741 | 0.765 | 5.90 | 4.34 | 5.70 | 6.16 | 5.53 |
| PAMFN [57] | 2024 | VST [29]+AST [14]+I3D [4] | 0.757 | 0.825 | 0.836 | 0.846 | 0.819 | 6.24 | 7.45 | 5.21 | 7.67 | 6.64 |
| CoFInAl* [65] | 2024 | I3D [4] | 0.625 | 0.719 | 0.734 | 0.757 | 0.712 | 7.04 | 6.37 | 5.81 | 6.98 | 6.55 |
| CoFInAl* [65] | 2024 | VST [29] | 0.809 | 0.806 | 0.804 | 0.810 | 0.807 | 5.07 | 5.19 | 6.37 | 6.30 | 5.73 |
| QTD* [10] | 2024 | VST [29] | 0.823 | 0.852 | 0.837 | 0.857 | 0.842 | 7.94 | 5.66 | 7.95 | 8.87 | 7.61 |
| MLAVL (Ours) | - | VST [29]+AST [14]+CLIP [36] | 0.826 | 0.829 | 0.871 | 0.866 | 0.849 | 5.57 | 4.20 | 4.11 | 3.99 | 4.47 |

6.3. Ablation Studies / Parameter Analysis

Effects of Multidimensional Language Guidance (MAG²) on LOGO Dataset

To specifically validate the effectiveness of MAG² (which introduces multidimensional action knowledge via language guidance), the authors plugged MAG² into nine existing visual-only methods on the LOGO dataset. Only action-visual graph guidance was used here, as LOGO is a visual-only dataset.

The following are the results from Table 4 of the original paper:

| Methods | Native Sp. Corr.↑ | Native R-ℓ2 (×100)↓ | +MAG² Sp. Corr.↑ | +MAG² R-ℓ2 (×100)↓ |
| --- | --- | --- | --- | --- |
| MS-LSTM [49] | 0.542 | 5.763 | 0.582 (↑7%) | 4.916 (↓15%) |
| USDL [39] | 0.762 | 2.556 | 0.804 (↑6%) | 2.269 (↓11%) |
| GDLT [48] | 0.647 | 4.148 | 0.654 (↑1%) | 3.589 (↓13%) |
| CoRe [56] | 0.697 | 5.620 | 0.723 (↑4%) | 3.386 (↓40%) |
| TPT [2] | 0.589 | 5.228 | 0.621 (↑5%) | 3.130 (↓40%) |
| HGCN [64] | 0.541 | 4.765 | 0.640 (↑18%) | 3.698 (↓22%) |
| T2CR [21] | 0.681 | 5.973 | 0.699 (↑3%) | 4.809 (↓19%) |
| CoFInAl [65] | 0.661 | 5.754 | 0.708 (↑7%) | 3.950 (↓31%) |
| QTD [10] | 0.698 | 4.948 | 0.729 (↑4%) | 3.869 (↓22%) |

Analysis: MAG² consistently enhances the performance of all tested visual-only methods. On average, it improves Sp. Corr. by 0.038 and reduces R-$\ell_2$ (×100) by 1.238. This confirms that language-guided action knowledge is highly valuable, effectively introducing relevant semantic understanding even to methods that do not inherently use language or audio.
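To make the plug-and-play idea concrete, the PyTorch sketch below shows one generic way of injecting text-derived action knowledge into clip-level visual features via cross-attention over a bank of action-phrase embeddings. It is an illustrative reconstruction under our own assumptions (feature dimensions, a single attention layer, frozen text embeddings), not the paper's MAG² implementation.

```python
# Hedged sketch: generic "action-knowledge guidance" for a visual-only pipeline.
# Clip features attend to text embeddings of domain action phrases and the
# attended knowledge is added back as a residual. All names and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn

class ActionKnowledgeGuidance(nn.Module):
    def __init__(self, dim_visual=1024, dim_text=512, dim_model=256, n_heads=4):
        super().__init__()
        self.v_proj = nn.Linear(dim_visual, dim_model)
        self.t_proj = nn.Linear(dim_text, dim_model)
        self.attn = nn.MultiheadAttention(dim_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_model)

    def forward(self, visual_feats, action_text_embeds):
        # visual_feats: (B, T, dim_visual) clip-level features from any backbone
        # action_text_embeds: (K, dim_text) frozen embeddings of K action phrases
        v = self.v_proj(visual_feats)                       # (B, T, D)
        t = self.t_proj(action_text_embeds).unsqueeze(0)    # (1, K, D)
        t = t.expand(v.size(0), -1, -1)                     # (B, K, D)
        guided, _ = self.attn(query=v, key=t, value=t)      # visual queries text bank
        return self.norm(v + guided)                        # residual guidance

# Usage: text embeddings could come from a frozen CLIP text encoder applied to
# prompts such as "triple axel" or "camel spin" (illustrative phrases).
guide = ActionKnowledgeGuidance()
vis = torch.randn(2, 68, 1024)   # e.g., 68 clips per video
txt = torch.randn(30, 512)       # e.g., 30 domain action phrases
out = guide(vis, txt)            # (2, 68, 256) knowledge-guided features
```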

Ablation Study of MLAVL Components

A comprehensive ablation study was conducted on FS1000 and RG datasets to evaluate the contribution of each proposed component. The baseline uses cross-attention for audio-visual fusion and a 2-layer MLP for score prediction.
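For reference, a minimal sketch of such a baseline (generic cross-attention fusion followed by a 2-layer MLP score head) could look like the following; the feature dimensions and temporal average pooling are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of the ablation baseline described above: audio-visual fusion
# via plain cross-attention, then a 2-layer MLP score regressor.
import torch
import torch.nn as nn

class CrossAttnBaseline(nn.Module):
    def __init__(self, dim_v=1024, dim_a=768, dim=256):
        super().__init__()
        self.v_proj = nn.Linear(dim_v, dim)
        self.a_proj = nn.Linear(dim_a, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.regressor = nn.Sequential(          # 2-layer MLP score head
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, visual, audio):
        v, a = self.v_proj(visual), self.a_proj(audio)      # (B, Tv, D), (B, Ta, D)
        fused, _ = self.cross_attn(query=v, key=a, value=a) # visual attends to audio
        pooled = fused.mean(dim=1)                          # temporal average pooling
        return self.regressor(pooled).squeeze(-1)           # (B,) predicted scores

score = CrossAttnBaseline()(torch.randn(2, 68, 1024), torch.randn(2, 96, 768))
```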

The following are the results from Table 5 of the original paper:

| Settings | TES Sp. Corr.↑ | PCS Sp. Corr.↑ | RG-Avg. Sp. Corr.↑ | TES MSE↓ | PCS MSE↓ | RG-Avg. MSE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 0.835 | 0.825 | 0.736 | 81.43 | 9.94 | 7.72 |
| +S²CE | 0.848 (↑2%) | 0.840 (↑2%) | 0.757 (↑3%) | 77.17 (↓5.2%) | 8.67 (↓13%) | 6.98 (↓10%) |
| +MAG² | 0.876 (↑3%) | 0.866 (↑3%) | 0.801 (↑6%) | 69.40 (↓10%) | 7.67 (↓12%) | 5.36 (↓23%) |
| +AVCF (w/o DPG) | 0.887 (↑1%) | 0.875 (↑1%) | 0.818 (↑2%) | 67.08 (↓3%) | 7.09 (↓8%) | 4.93 (↓8%) |
| +DPG (Ours) | 0.917 (↑3%) | 0.892 (↑2%) | 0.849 (↑4%) | 64.89 (↓3%) | 6.39 (↓10%) | 4.47 (↓9%) |
| w/o S²CE | 0.891 (↓3%) | 0.878 (↓2%) | 0.821 (↓3%) | 67.29 (↑7%) | 6.46 (↑1%) | 4.89 (↑9%) |
| w/o MAG² | 0.879 (↓4%) | 0.869 (↓3%) | 0.802 (↓6%) | 70.05 (↑8%) | 7.53 (↑18%) | 5.07 (↑13%) |
| w/o AVCF | 0.894 (↓3%) | 0.876 (↓2%) | 0.817 (↓4%) | 66.03 (↑2%) | 6.62 (↑4%) | 4.76 (↑6%) |
| w/o $\mathcal{L}_{TL}$ | 0.886 (↓3%) | 0.871 (↓2%) | 0.814 (↓4%) | 67.83 (↑5%) | 7.11 (↑11%) | 4.90 (↑10%) |
| w/o $\mathcal{L}_{CE}$ | 0.894 (↓3%) | 0.880 (↓1%) | 0.827 (↓3%) | 68.69 (↑6%) | 7.39 (↑16%) | 5.07 (↑13%) |
| w/o $\mathcal{L}_{TL}$ + $\mathcal{L}_{CE}$ | 0.875 (↓5%) | 0.867 (↓3%) | 0.813 (↓4%) | 68.87 (↑6%) | 7.50 (↑17%) | 5.21 (↑17%) |

Analysis:

  • +S²CE: The Shared-Specific Context Encoder (S²CE) provides initial performance boosts, particularly in MSE, by effectively combining modality-specific and modality-general information.
  • +MAG²: The Multidimensional Action Graph Guidance (MAG²) module yields a substantial improvement in both Sp. Corr. and MSE. This strongly indicates the importance of language in bridging audio-visual semantics and guiding the model's understanding of domain-based actions.
  • +AVCF: The Audio-Visual Cross-modal Fusion (AVCF) module further improves performance by addressing global and clip-wise matching of audio and visual streams, aligning better with human scoring rules than generic cross-attention mechanisms.
  • +DPG: The Dual-branch Prompt-guided Grading (DPG) module provides the final boost, significantly improving metrics (e.g., 3% for Sp. Corr. and 7% for MSE on average from the previous step). This highlights that a rule-aligned assessment of visual performance and action-music matching is critical.
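As a rough illustration of the prompt-guided grading idea, the sketch below scores a video by softly assigning its fused feature to grade-level text prototypes and taking a weighted combination of grade values. The temperature, the number of grades, and the evenly spaced grade values are assumptions; this is not the paper's DPG module.

```python
# Hedged sketch: prompt-guided grading. Grade-level text prompts (e.g., from
# "poor execution" to "excellent execution") are encoded into prototype vectors;
# a video feature is softly assigned to grades and scored accordingly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidedGrading(nn.Module):
    def __init__(self, dim=256, n_grades=5, score_range=(0.0, 100.0)):
        super().__init__()
        self.grade_proj = nn.Linear(dim, dim)
        lo, hi = score_range
        # Evenly spaced grade values over the score range (assumption).
        self.register_buffer("grade_values", torch.linspace(lo, hi, n_grades))

    def forward(self, video_feat, prompt_embeds):
        # video_feat: (B, D) pooled audio-visual feature
        # prompt_embeds: (G, D) embeddings of G grade-level text prompts
        q = F.normalize(self.grade_proj(video_feat), dim=-1)   # (B, D)
        p = F.normalize(prompt_embeds, dim=-1)                  # (G, D)
        logits = q @ p.t() / 0.07                               # cosine sim / temperature
        weights = logits.softmax(dim=-1)                        # soft grade assignment
        return (weights * self.grade_values).sum(dim=-1), logits

head = PromptGuidedGrading()
score, grade_logits = head(torch.randn(2, 256), torch.randn(5, 256))
```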

Effects of Loss Functions

The bottom half of Table 5 analyzes the contributions of the proposed loss functions:

  • w/o $\mathcal{L}_{TL}$ (without Triplet Loss): Removing $\mathcal{L}_{TL}$ leads to a drop in performance (e.g., 3% Sp. Corr. and 5% TES MSE compared to the full model). This loss is crucial for ensuring sufficient separation and discriminability between different grade patterns.
  • w/o $\mathcal{L}_{CE}$ (without Cross-Entropy Loss): Removing $\mathcal{L}_{CE}$ also results in performance degradation (e.g., 3% Sp. Corr. and 6% TES MSE). This loss is vital for aligning grade patterns with corresponding textual prompts, ensuring the model focuses on relevant semantics and refines the coarse-to-fine relations from visual actions to action-music alignment.
  • w/o $\mathcal{L}_{TL}$ + $\mathcal{L}_{CE}$: Removing both losses causes the most significant performance drop (e.g., 5% Sp. Corr. and 6% TES MSE), demonstrating their combined importance in guiding the model towards an accurate, quality-aware score space, especially given the limited labeled data typical in sports assessment.
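A hedged sketch of how such a combined objective is typically assembled in PyTorch follows; the loss weights, triplet margin, and the way anchors, positives, and negatives are mined are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: score regression (MSE) combined with a triplet term that
# separates grade patterns and a cross-entropy term that ties grade logits
# to prompt-derived grade labels.
import torch
import torch.nn as nn

mse = nn.MSELoss()
triplet = nn.TripletMarginLoss(margin=1.0)   # pushes different grade patterns apart
ce = nn.CrossEntropyLoss()                   # aligns grade logits with textual grade labels

def total_loss(pred_score, gt_score, anchor, positive, negative,
               grade_logits, grade_labels, w_tl=0.5, w_ce=0.5):
    return (mse(pred_score, gt_score)
            + w_tl * triplet(anchor, positive, negative)
            + w_ce * ce(grade_logits, grade_labels))

# Toy usage with random tensors standing in for model outputs.
pred = torch.randn(4, requires_grad=True)
feats = torch.randn(4, 256, requires_grad=True)
loss = total_loss(pred, torch.randn(4),
                  feats, torch.randn(4, 256), torch.randn(4, 256),
                  torch.randn(4, 5, requires_grad=True), torch.randint(0, 5, (4,)))
loss.backward()
print(float(loss))
```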

Effects of Different Modalities

Figure 4 (SRCC bars and MSE folds) illustrates the contribution of different modalities.

  • Visual-only (V): Forms the base performance, indicating that visual information is the primary source for assessment.

  • Visual + Audio (V+A): Adding audio generally improves performance, underscoring its importance for better assessment, especially for action-music coordination.

  • Visual + Language (V+L): Introducing language alongside visual information consistently improves results, emphasizing the critical role of action knowledge guidance.

  • Visual + Audio + Language (V+A+L): The combination of all three modalities yields the best results, showcasing the synergistic effect of language-guided audio-visual learning.

Figure 4. SRCC bars and MSE folds for different modality combinations, comparing Visual, Audio, and their combinations with Language.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces MLAVL, a novel Multidimensional Language-guided Audio-Visual Learning framework designed for long-term sports assessment. The core innovation lies in its ability to reduce reliance on large model parameters by integrating low-cost language modality to explicitly guide the modeling of audio-action-visual correlations. This is achieved through several key contributions:

  1. Action Knowledge Graphs: Embedding domain-specific basic action corpora into audio-visual features through action knowledge graphs to provide explicit guidance.

  2. Shared-Specific Context Encoder (S²CE): Enhancing multimodal features by fusing modality-specific and modality-general information.

  3. Audio-Visual Cross-modal Fusion (AVCF): A module specifically designed to evaluate action-music consistency by focusing on both global and clip-wise alignments.

  4. Dual-Branch Prompt-Guided Grading (DPG): A rule-aligned grading mechanism that assesses both visual performance and audio-visual synchronization using coarse-to-fine textual prompts.

    The MLAVL framework achieves new state-of-the-art results on four public long-term sports benchmarks (FS1000, Fis-V, Rhythmic Gymnastics, and LOGO) while maintaining a lower parameter count and computational cost compared to previous leading methods. Its MAG² module also demonstrates plug-and-play capability, significantly improving existing visual-only methods, highlighting the broader utility of language-guided action knowledge.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Reliance on Rule-Based Consistency: The effectiveness of some of MLAVL's designs, particularly those related to action-music consistency, relies heavily on the presence of explicit rule-based requirements in long-term sports. In specialized scenes where audio information might be sparse, less useful, or entirely irrelevant to scoring, the performance of the audio-related components might be impacted.

    For future work, the authors suggest:

  • Advanced Extraction Techniques: Exploring more advanced techniques to better capture sparse cues from audio, ensuring robustness even when audio information is not directly or strongly correlated with actions. This could involve developing more sophisticated audio representations or fusion mechanisms that are less sensitive to noise or weak signals.

7.3. Personal Insights & Critique

Strengths:

  • Innovative Use of Language as Guidance: The most compelling aspect of this paper is its effective use of low-cost language modality to inject explicit domain-specific knowledge. This shifts the paradigm from purely data-driven implicit learning (which requires large models for weak correlations) to a knowledge-guided approach. This is particularly relevant for AQA where human judgment is often rule-based.

  • Rule-Aligned Design: The AVCF and DPG modules are thoughtfully designed to mimic how human judges score, considering both action quality and action-music synchronization at multiple temporal granularities. This interpretability and alignment with real-world criteria are significant advantages.

  • Efficiency: Achieving SOTA results with fewer parameters and lower computational cost is a strong indicator of the model's practical utility and efficiency, especially for deployment in real-world applications.

  • Modularity and Generalizability: The plug-and-play nature of MAG² demonstrates that the concept of language-guided action knowledge is beneficial even for visual-only tasks, suggesting broad applicability beyond the specific audio-visual sports assessment domain.

    **Potential Issues, Unverified Assumptions, or Areas
