Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
TL;DR Summary
Proposed a language-guided audio-visual learning framework using action knowledge graphs and cross-modal fusion, achieving state-of-the-art long-term sports assessment with low computational cost on four public benchmarks.
Abstract
Long-term sports assessment is a challenging task in video understanding since it requires judging complex movement variations and action-music coordination. However, there is no direct correlation between the diverse background music and movements in sporting events. Previous works require a large number of model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models "audio-action-visual" correlations guided by the low-cost language modality. In our framework, multidimensional domain-based actions form action knowledge graphs that guide the audio-visual modalities to focus on task-relevant actions. …
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
1.2. Authors
- Huangbiao Xu
- Xiao Ke
- Huanqi Wu
- Rui Xu
- Yuezhou Li
- Wenzhong Guo
All authors are affiliated with:
- Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350108, China
- Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350108, China
1.3. Journal/Conference
The paper is published at CVPR (Computer Vision and Pattern Recognition). CVPR is one of the premier annual computer vision conferences, highly respected and influential in the fields of computer vision, machine learning, and artificial intelligence. Publication at CVPR indicates a high level of novelty, technical soundness, and significant contribution to the field.
1.4. Publication Year
2024 or later (inferred from the 2024 baselines cited in Table 1, e.g., T2CR [21], CoFInAl [65], and QTD [10]; the excerpt does not state the publication year explicitly).
1.5. Abstract
The paper addresses the challenge of long-term sports assessment, which involves complex movement variations and the coordination between actions and background music. A key difficulty lies in the lack of a direct correlation between diverse background music and sporting actions, often requiring previous models to use a large number of parameters to learn these weak associations. To overcome this, the authors propose a Language-Guided Audio-Visual Learning (MLAVL) framework. This framework models "audio-action-visual" correlations by leveraging a low-cost language modality. It constructs action knowledge graphs from multidimensional domain-based actions, guiding audio-visual modalities to focus on task-relevant actions. The framework incorporates a Shared-Specific Context Encoder (S²CE) to integrate deep multimodal semantics and an Audio-Visual Cross-modal Fusion (AVCF) module to evaluate action-music consistency. Furthermore, a Dual-Branch Prompt-Guided Grading (DPG) module is designed to assess both visual and audio-visual performance in alignment with specific sport rules. Extensive experiments demonstrate that MLAVL achieves state-of-the-art results on four public long-term sports benchmarks while maintaining low computational costs and fewer parameters.
1.6. Original Source Link
/files/papers/69033af859708f78ec6faf73/paper.pdf
This is the Open Access version provided by the Computer Vision Foundation, confirmed to be identical to the accepted version.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is long-term sports assessment, a sub-area within Action Quality Assessment (AQA). AQA is crucial in various domains, including professional sports for scoring, healthcare for rehabilitation progress monitoring, and skill determination in training. While short-term action assessment has seen many solutions, long-term sports analysis remains particularly challenging due to its rich, minute-long video content, which contains much more complex and varied correlations.
Specifically, in sports like figure skating and rhythmic gymnastics, a significant challenge is assessing the action-music coordination (consistency between athletes' movements and background music). Judges explicitly factor this synchronization into scoring. However, existing methods face several issues:
- Complex Movement Variations: Long videos contain diverse sub-actions within the same category, making it difficult to understand and assess quality.
- Weak Action-Music Correlation: The background music in sports events often doesn't directly correlate with specific action sounds (e.g., landing, hitting), making it hard for models to naturally capture this relationship.
- High Model Complexity: Previous approaches often resort to using a large number of model parameters to learn these weak associations between actions and music, leading to high computational costs (as illustrated in Figures 1a and 1b). This reliance on larger models overlooks the need for more efficient understanding.
The paper identifies a crucial gap: the need for domain-specific knowledge to reduce reliance on large model parameters, and the low-cost integration of prior knowledge. The innovative idea is to leverage language as this low-cost, domain-specific knowledge carrier. Language, a cornerstone of human cognition, has proven effective in computer vision (CV) tasks like action recognition and localization by providing semantic understanding. The authors propose to use language to introduce domain-specific action knowledge that bridges the action-music correlations within audio-visual modalities, aligning with actual sport rules. This transforms audio-visual learning into audio-action-visual learning.
2.2. Main Contributions / Findings
The paper introduces the Multidimensional Language-Guided Audio-Visual Learning (MLAVL) framework, offering several key contributions:
- Novel Framework: Proposes MLAVL, a language-guided audio-visual learning framework for long-term sports assessment. This framework explicitly aims to reduce reliance on large model parameters by guiding the learning process with domain-specific action knowledge graphs.
- Multidimensional Action Graph Guidance (MAG²): Designs a module that uses multidimensional domain-based actions to form action knowledge graphs. These graphs motivate both audio and visual temporal modalities to focus on task-relevant actions, effectively transforming audio-visual learning into audio-action-visual learning by incorporating action knowledge at a low cost.
- Shared-Specific Context Encoder (S²CE): Introduces S²CE to address modality interference and enhance the correlation between multimodal features. It integrates modality-specific and modality-general information, providing richer and deeper features for subsequent modules to construct latent audio-action-visual correlations.
- Audio-Visual Cross-Modal Fusion (AVCF): Proposes a novel AVCF module specifically designed for long-term sports assessment. This module evaluates action-music consistency by focusing on both the global and clip-wise match between actions and music, which directly aligns with how judges score.
- Dual-Branch Prompt-Guided Grading (DPG): Develops a DPG module that weighs both visual performance and audio-visual performance (action-music matching) to generate final scores. This module uses quality-related textual prompts to guide the assessment patterns, reflecting sport rules.
- State-of-the-Art Performance: Achieves new state-of-the-art results on four public long-term sports benchmarks (FS1000, Fis-V, Rhythmic Gymnastics, and LOGO). The framework demonstrates superior performance in both correlation and Mean Square Error (MSE) metrics, while maintaining low computational costs and fewer parameters.
- Plug-and-Play Design: The MAG² module is shown to be plug-and-play, significantly improving the performance of existing visual-only methods on the LOGO dataset, highlighting the generalizability and value of language-guided action knowledge.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following concepts:
- Action Quality Assessment (AQA): This is the overarching task. AQA involves automatically evaluating the quality or skill level of an action performed in a video, for example, assessing how well a gymnast performs a routine or a diver executes a dive.
  - Short-term AQA: Focuses on discrete, well-defined actions (e.g., a single jump).
  - Long-term AQA: Deals with continuous, complex sequences of actions often spanning minutes (e.g., an entire figure skating routine), which include multiple sub-actions and their overall composition. This paper focuses on long-term AQA.
- Multimodal Learning: An approach in machine learning that combines information from multiple modalities (types of data) to gain a more comprehensive understanding of a phenomenon. In this paper, the modalities are:
  - Visual: Information from video frames (what you see).
  - Audio: Information from sound (what you hear, primarily music in this context).
  - Language: Information from text (semantic descriptions, rules, prompts).
  - The idea is that each modality provides unique insights, and combining them can lead to a richer representation and better task performance than using any single modality alone.
- Graph Neural Networks (GNNs) / Graph Convolutional Networks (GCNs): A class of neural networks designed to process data that is structured as a graph rather than a sequence or a grid.
  - Graph: A data structure consisting of nodes (or vertices) and edges (or links) connecting pairs of nodes.
  - Graph Convolution: Similar to how a convolutional neural network (CNN) processes pixels in an image by looking at their neighbors, a GCN processes node features by aggregating information from their neighbors in the graph. This allows GCNs to learn representations that capture the relationships and structure within the graph.
  - In this paper, action knowledge graphs and temporal graphs are built, and GCNs are used to pass information between them.
- Transformers and Attention Mechanism:
  - Transformer: A neural network architecture introduced in 2017, which has become dominant in natural language processing (NLP) and is increasingly used in computer vision. Its key innovation is the attention mechanism.
  - Attention Mechanism: Allows the model to weigh the importance of different parts of the input sequence (or different modalities) when processing another part. Instead of processing all input elements equally, attention enables the model to focus on the most relevant information.
  - Self-Attention: A mechanism within Transformers that relates different positions of a single sequence to compute a representation of that sequence. It calculates how much each token relates to every other token in the same sequence.
  - Cross-Attention: A mechanism that relates elements from two different sequences. For example, it can be used to attend to audio features based on a visual query, or vice versa.
  - The paper uses a shared Transformer encoder and a cross-temporal relation decoder (which uses cross-attention).
- Text Encoders (e.g., CLIP): Models designed to convert text (words, phrases, or sentences) into numerical representations called embeddings (or feature vectors) that capture the semantic meaning of the text.
  - CLIP (Contrastive Language-Image Pre-training) [36]: A prominent multimodal model that learns to associate text and images. It has a text encoder and an image encoder, trained to produce embeddings where matching text-image pairs are close in the embedding space. This allows it to perform zero-shot classification and guide tasks using natural language. The paper uses ViFi-CLIP [37], a fine-tuned version of CLIP for video.
3.2. Previous Works
The paper contextualizes its contributions by discussing existing approaches in Sports Assessment and Language-Guided Multimodal Video Understanding.
Sports Assessment
- Short-Term Actions: Many works have tackled short-term AQA ([2, 21, 39, 50, 53, 54, 56, 64]), developing effective learning-based models for discrete actions.
- Long-Term Actions: This is where the challenge lies. Previous methods have explored:
  - Visual-only Approaches:
    - Using multidimensional video information: multi-scale temporal features [49], video dynamic information [58], athlete static poses [35, 58].
    - Coarse-to-fine feature aggregation [10, 48, 65] to establish grade patterns, e.g., GDLT [48] and CoFInAl [65].
    - TPT [2] (Temporal Parsing Transformer) and QTD [10] (interpretable long-term action quality assessment).
  - Multimodal Learning (Audio, Language, Optical Flow): Recent methods introduce diverse modalities to enhance long-term sports assessment.
    - Audio-Visual Models: [47, 57] learn audio-visual modalities to assess action-music consistency, e.g., MLP-Mixer [47] and PAMFN [57]. These models often rely on large parameters to capture weak associations.
    - Language-Enhanced Models: [12, 52] incorporate language for enhanced understanding; SGN [12] (semantics-guided representations for scoring figure skating) is an example.
Language-Guided Multimodal Video Understanding
- General Multimodal Video Understanding: [12, 17, 24, 30, 38, 47, 50, 57, 61] combine diverse modalities.
- Language as a Bridge: When audio-visual supervision is insufficient, language can act as a bridge for audio-visual semantics ([17, 24, 38]).
- Language for AQA: Textual semantics are shown to significantly enhance AQA ([12, 30, 50, 61]); examples include SGN [12] and Narrative Action Evaluation [61].
- Challenge of Weak Correlation: The paper notes that existing works struggle to extract weak correlations between background music and actions, as the music is not derived from action sounds ([1, 32]).
Core Formulas from Previous Works (for context)
Attention Mechanism (from Transformers [41]):
Since the paper uses cross-attention and self-attention, understanding the fundamental attention mechanism is crucial. The Scaled Dot-Product Attention is given by:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
- $Q$ (Query): A matrix of query vectors; each row is a query vector.
- $K$ (Key): A matrix of key vectors; each row is a key vector.
- $V$ (Value): A matrix of value vectors; each row is a value vector.
- $d_k$: The dimension of the key vectors. It is used for scaling to prevent the dot products from growing too large, which could push the softmax function into regions with extremely small gradients.
- $QK^T$: Computes the dot-product similarity between each query and all keys.
- $\mathrm{softmax}(\cdot)$: Normalizes the scores, turning them into probability distributions indicating how much "attention" each value should get.
- The values $V$ are weighted by the attention probabilities and summed, producing the output of the attention layer.
For Self-Attention, $Q$, $K$, and $V$ come from the same input sequence. For Cross-Attention, $Q$ comes from one sequence (e.g., visual features) and $K$, $V$ come from another (e.g., audio features).
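The following minimal PyTorch sketch (illustrative, not the paper's code) shows how this formula is typically implemented and how the same routine serves both self-attention and cross-attention:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., len_q, len_k)
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ V                              # weighted sum of values

# Self-attention: Q, K, V all come from the same sequence.
x = torch.randn(2, 10, 64)                          # (batch, tokens, dim)
self_out = scaled_dot_product_attention(x, x, x)

# Cross-attention: queries from one modality, keys/values from another.
visual, audio = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
cross_out = scaled_dot_product_attention(visual, audio, audio)
print(self_out.shape, cross_out.shape)              # torch.Size([2, 10, 64]) twice
```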
Graph Convolutional Network (GCN) Layer (simplified from GCNs [15]): The GCN layer typically performs feature transformation and aggregation across nodes. A basic GCN layer operation for a graph with adjacency matrix $A$ and node features $H^{(l)}$ at layer $l$ can be expressed as: $ H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right) $
- $H^{(l)}$: Input feature matrix at layer $l$.
- $\tilde{A} = A + I$: Adjacency matrix with added self-loops (identity matrix $I$).
- $\tilde{D}$: Degree matrix of $\tilde{A}$ (a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$).
- $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$: Symmetrically normalized adjacency matrix.
- $W^{(l)}$: Learnable weight matrix for layer $l$.
- $\sigma$: An activation function (e.g., ReLU).
The paper uses a slightly modified GCN formulation in its MAG² module, which will be detailed in the methodology section.
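As a concrete reference, here is a minimal, generic GCN layer in PyTorch implementing the symmetric normalization above; it is a textbook sketch, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One GCN layer: H' = sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_tilde = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        deg = A_tilde.sum(dim=1)                               # node degrees
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt             # symmetric normalization
        return torch.relu(A_norm @ self.weight(H))

# Toy complete graph over 5 nodes with 16-d features.
A = torch.ones(5, 5) - torch.eye(5)
H = torch.randn(5, 16)
layer = SimpleGCNLayer(16, 16)
print(layer(H, A).shape)   # torch.Size([5, 16])
```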
3.3. Technological Evolution
The field of AQA has evolved from purely visual-based methods, which often struggled with the nuances of long-term actions, to increasingly sophisticated approaches. Initially, models focused on extracting spatial-temporal features from video frames. The introduction of multimodal learning marked a significant step, recognizing that additional information (like audio or optical flow) could provide richer context. However, simply combining modalities often led to challenges such as:
- Weak correlation learning: Models needed large parameters to learn implicit, often weak, correlations between modalities.
- Lack of domain knowledge: Generic multimodal fusion might not align with specific assessment rules (e.g., judge criteria in sports).
The current paper represents a further evolution by integrating language guidance. This leverages the power of pre-trained large language models (LLMs) or vision-language models (VLMs) to inject explicit domain-specific knowledge (such as sport rules and action definitions) at a low computational cost. This allows for a more direct and rule-aligned modeling of audio-action-visual relationships, moving beyond simply correlating audio and visual streams to understanding their synergy through the lens of human-defined actions.
3.4. Differentiation Analysis
Compared to the main methods in related work, MLAVL introduces several core differences and innovations:
- Language-Guided Correlation Modeling:
  - Previous: Existing audio-visual models (MLP-Mixer [47], PAMFN [57]) often require large model parameters to learn weak associations between diverse background music and actions, as there is no direct acoustic correlation.
  - MLAVL: Explicitly transforms audio-visual learning into audio-action-visual learning. It uses a low-cost language modality to introduce domain-specific action knowledge (via action knowledge graphs). This guidance directly bridges the action-music correlations, making the learning process more efficient and semantically aligned with sport rules, and reducing the need for massive parameters to implicitly learn these relations.
- Structured Knowledge Integration:
  - Previous: While some works use language, they might not structure domain knowledge as rigorously.
  - MLAVL: The MAG² module constructs multidimensional domain-based action knowledge graphs. This structured approach allows explicit knowledge transfer to guide audio-visual modalities to task-relevant actions, which is a more robust way to inject prior knowledge.
- Enhanced Multimodal Semantics Integration:
  - Previous: Directly aggregating knowledge graphs to features can cause modality interference and performance degradation.
  - MLAVL: The Shared-Specific Context Encoder (S²CE) is specifically designed to fuse both modality-specific (pure, diverse information from each modality) and modality-general (shared temporal context) information. This deep integration aims to uncover latent audio-action-visual correlations more effectively.
- Rule-Aligned Action-Music Consistency Assessment:
  - Previous: Existing cross-modal fusion modules often focus on global inter-modal interactions, potentially overlooking short-lived poor action-music interactions crucial for scoring.
  - MLAVL: Audio-Visual Cross-modal Fusion (AVCF) focuses on both global and clip-wise consistency between actions and music, directly conforming to how judges penalize mismatches.
- Dual-Branch, Prompt-Guided Grading:
  - Previous: Grading modules might not explicitly weigh different aspects of performance according to sport rules.
  - MLAVL: The Dual-Branch Prompt-Guided Grading (DPG) module uses quality-related textual prompts to guide two distinct branches: one for visual action performance and another for action-music matching. This allows for a more nuanced and rule-aligned scoring mechanism.
4. Methodology
The proposed Multidimensional Language-guided Audio-Visual Learning (MLAVL) framework aims to learn audio-action-visual score patterns, guided by domain-specific action knowledge graphs. The overall framework, illustrated in Figure 2 of the original paper, processes video and audio inputs to generate a final score based on visual performance and action-music consistency.
4.1. Principles
The core idea of MLAVL is to leverage the semantic power of language to explicitly guide the learning of complex audio-visual correlations in long-term sports assessment. Instead of relying solely on implicit learning from audio-visual data (which can be noisy or weakly correlated) or requiring massive model parameters, MLAVL introduces domain-specific action knowledge through textual prompts and knowledge graphs. This guidance helps the model focus on task-relevant actions and their coordination with music, leading to more accurate and efficient assessment aligned with human judging rules. The framework aims to model a richer audio-action-visual relationship by integrating modality-specific and modality-general contexts, fusing information at both global and local levels, and using a rule-aligned dual-branch grading system.
4.2. Core Methodology In-depth (Layer by Layer)
The MLAVL framework consists of several interconnected modules. We will break down its architecture, data flow, and execution logic step-by-step, integrating mathematical formulas as they appear.
Step 1: Input Processing and Initial Feature Extraction
The input to the MLAVL framework is a long video containing image sequences and audio.
- Video Segmentation: The long video is first divided into non-overlapping consecutive 32-frame clips. This is a common practice in sports assessment to manage the temporal complexity of long videos.
- Modality-Specific Feature Extraction (Pre-trained Backbones):
  - For visual input ($V_T$), a pre-trained visual-specific encoder (e.g., Video Swin Transformer (VST) [29] or Timesformer [3]) extracts visual features.
  - For audio input ($A_T$), a pre-trained audio-specific encoder (e.g., Audio Spectrogram Transformer (AST) [14]) extracts audio features.
  - The parameters of these specific encoders are typically frozen during training to leverage their strong pre-trained representations.
- Token Projection: Trainable token projection networks (2-layer MLPs) project the features from different modalities into a potential space with a consistent feature dimension $d$. These projected features are denoted as $\mathcal{F}_t^{\mathbf{v}}$ for visual and $\mathcal{F}_t^{\mathbf{a}}$ for audio.
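A minimal sketch of this input pipeline is shown below; the backbone output dimension (768) and the shared dimension d = 256 are placeholder assumptions, since the exact values are not given in this summary:

```python
import torch
import torch.nn as nn

T, CLIP_LEN, D = 16, 32, 256          # number of 32-frame clips, frames per clip, shared dim d

# Assume frozen backbones already produced one feature vector per 32-frame clip
# (dimensions here are placeholders; they depend on the chosen encoders).
visual_clip_feats = torch.randn(T, 768)
audio_clip_feats = torch.randn(T, 768)

def make_projector(in_dim, out_dim):
    """Trainable 2-layer MLP token projection into the shared d-dimensional space."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

proj_v, proj_a = make_projector(768, D), make_projector(768, D)
F_v = proj_v(visual_clip_feats)        # {F_t^v}, shape (T, d)
F_a = proj_a(audio_clip_feats)         # {F_t^a}, shape (T, d)
print(F_v.shape, F_a.shape)
```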
Step 2: Shared-Specific Context Encoder (S²CE)
- Purpose: The S²CE aims to fuse both modality-specific (pure features from each modality) and modality-general (long-range temporal context shared across modalities) information. This is crucial because while modality-specific features provide diverse information, relying solely on them can lead to superficial single-modal learning under language guidance. The goal is to learn latent audio-action-visual correlations rather than isolated action-audio or action-visual links.
- Process:
  - After initial projection (as described above), a shared Transformer encoder [41] (denoted $E_s$) is applied. This Transformer encoder captures long-range, modality-agnostic temporal context, which is essential for analyzing long videos and human-centric tasks.
  - The output of this shared encoder is then combined with the original projected modality-specific features.
- Formula (Equation 7):
$
\left\{ f_t^{\mathbf{m}} \right\}_{t=1}^{T} = \left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T} \oplus E_s\left( \left\{ \mathcal{F}_t^{\mathbf{m}} \right\}_{t=1}^{T} \right), \quad \mathbf{m} \in \{ \mathbf{v}, \mathbf{a} \}
$
  - $\{ f_t^{\mathbf{m}} \}_{t=1}^{T}$: The final enhanced feature set for modality $\mathbf{m}$ ($\mathbf{v}$ for visual or $\mathbf{a}$ for audio) after being processed by S²CE; each $f_t^{\mathbf{m}}$ is a feature vector for clip $t$.
  - $\{ \mathcal{F}_t^{\mathbf{m}} \}_{t=1}^{T}$: The initial projected modality-specific feature set for modality $\mathbf{m}$.
  - $E_s(\cdot)$: The shared Transformer encoder, which processes the projected modality-specific features to extract modality-agnostic temporal context.
  - $\oplus$: The combination of the modality-specific features with the shared temporal context. The notation could denote concatenation or element-wise summation; given the module's stated goal of fusing modality-specific and modality-general information, and the later summation of global- and clip-level features in AVCF, it is most naturally read as an additive (residual-style) combination for feature refinement.
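A minimal PyTorch sketch of S²CE under the additive reading of ⊕ (an assumption, as discussed above) could look like this:

```python
import torch
import torch.nn as nn

class SharedSpecificContextEncoder(nn.Module):
    """Sketch of Eq. (7): f^m = F^m (+) E_s(F^m), with E_s shared across modalities."""
    def __init__(self, d, n_layers=1, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # E_s

    def forward(self, F_m):                    # F_m: (batch, T, d) projected clip features
        context = self.shared_encoder(F_m)     # modality-general temporal context
        return F_m + context                   # additive combination (one reading of the ⊕)

s2ce = SharedSpecificContextEncoder(d=256)
F_v, F_a = torch.randn(1, 16, 256), torch.randn(1, 16, 256)
f_v, f_a = s2ce(F_v), s2ce(F_a)                # the same encoder is applied to both modalities
print(f_v.shape, f_a.shape)
```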
Step 3: Language-Guided Action Knowledge Embedding
- Purpose: To introduce domain-specific action knowledge into the model using the low-cost language modality. This knowledge guides the audio-visual modalities to focus on task-relevant actions.
- Process:
  - Text Prompt Sets: Two sets of text prompts are designed, $M_{\mathbf{v}}$ for visual actions and $M_{\mathbf{a}}$ for audio actions. These consist of basic actions derived from official sport rules.
    - Example visual prompt template: "a video of [category]", where [category] is a basic action (e.g., "a video of triple axel").
    - Example audio prompt template: "a music suitable for [category]".
  - Text Encoding: A frozen pre-trained text encoder (e.g., ViFi-CLIP [37], a fine-tuned CLIP [36]) is used to embed these prompts into feature vectors.
  - Token Projection: As with the visual/audio features, a trainable token projection network maps these text embeddings into the consistent $d$-dimensional space.
- Formulas (Equations 2 & 1):
The overall overview in Section 3.1 states the initial feature extraction steps:
$
\left\{ f_t^{\mathbf{v}} \right\}_{t=1}^{T} = E_{\mathbf{v}\cdot\mathbf{a}}\left( V_T \right), \quad \left\{ f_t^{\mathbf{a}} \right\}_{t=1}^{T} = E_{\mathbf{v}\cdot\mathbf{a}}\left( A_T \right),
$
$
\left\{ f_m^{\mathbf{t}\cdot\mathbf{v}} \right\}_{m=1}^{M} = E_{\mathbf{t}}\left( M_{\mathbf{v}} \right), \quad \left\{ f_m^{\mathbf{t}\cdot\mathbf{a}} \right\}_{m=1}^{M} = E_{\mathbf{t}}\left( M_{\mathbf{a}} \right),
$
  - $\{f_t^{\mathbf{v}}\}$ and $\{f_t^{\mathbf{a}}\}$: The visual and audio features for each clip $t$, output by the visual/audio encoding pathway $E_{\mathbf{v}\cdot\mathbf{a}}$ (Shared-Specific Context Encoder); these correspond to the $\{f_t^{\mathbf{m}}\}$ from Equation 7.
  - $\{f_m^{\mathbf{t}\cdot\mathbf{v}}\}_{m=1}^{M}$: The feature set for visual-action text prompts, encoded by the text encoder $E_{\mathbf{t}}$.
  - $\{f_m^{\mathbf{t}\cdot\mathbf{a}}\}_{m=1}^{M}$: The feature set for audio-action text prompts, encoded by the text encoder $E_{\mathbf{t}}$.
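The sketch below illustrates the prompt construction and frozen text encoding; it uses the Hugging Face CLIP text encoder as a stand-in for ViFi-CLIP's text branch, and the action list is hypothetical:

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical action vocabulary; the real lists come from official sport rules.
visual_actions = ["triple axel", "flying camel spin", "step sequence"]
visual_prompts = [f"a video of {a}" for a in visual_actions]
audio_prompts = [f"a music suitable for {a}" for a in visual_actions]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

with torch.no_grad():                                   # the text encoder stays frozen
    tokens = tokenizer(visual_prompts + audio_prompts, padding=True, return_tensors="pt")
    text_feats = text_encoder(**tokens).pooler_output   # (2M, 512) prompt embeddings

proj_t = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))  # trainable projection
prompt_feats = proj_t(text_feats)                        # mapped into the shared d = 256 space
f_t_v, f_t_a = prompt_feats[:3], prompt_feats[3:]        # {f_m^{t·v}}, {f_m^{t·a}}
print(f_t_v.shape, f_t_a.shape)
```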
Step 4: Multidimensional Action Graph Guidance (MAG²)
- Purpose: To transfer the action knowledge from the textual semantics (action knowledge graphs) to the visual/audio features (temporal graphs), explicitly guiding the model's focus.
- Process:
  - Graph Construction:
    - Action Knowledge Graphs: One for visual actions and one for audio actions, whose nodes are the encoded text prompt features ($\{f_m^{\mathbf{t}\cdot\mathbf{v}}\}$ and $\{f_m^{\mathbf{t}\cdot\mathbf{a}}\}$). They are initially complete graphs, meaning every action node is connected to every other action node, representing intrinsic associations between action terms.
    - Temporal Graphs: One for visual clips and one for audio clips, whose nodes are the enhanced visual/audio features from S²CE ($\{f_t^{\mathbf{v}}\}$ and $\{f_t^{\mathbf{a}}\}$). These are also initially complete graphs, representing temporal associations between video/audio segments.
  - Dual Information Aggregation (GCNs): MAG² utilizes graph convolutional networks (GCNs) [15] to pass action knowledge. A cross-graph mapping operation aggregates information from action nodes (text features) to visual temporal nodes (visual features); a similar process is applied for audio.
- Formulas (Equations 8 & 9):
Let $H_{\mathrm{act}}^{(0)}$ be the initial feature matrix of the action knowledge graph nodes (the encoded text prompt features) and $H_{\mathbf{v}}^{(0)}$ be the initial feature matrix of the visual temporal graph nodes (the enhanced visual features). The GCN operations at layer $l$ are:
$
H_{\mathrm{act}}^{(l+1)} = \sigma\left( A_{\mathrm{act}}^{\mathbf{v}} H_{\mathrm{act}}^{(l)} W_{\mathrm{act}}^{(l)} \right),
$
$
H_{\mathbf{v}}^{(l+1)} = \sigma\left( A_{\mathbf{v}} H_{\mathbf{v}}^{(l)} W_{\mathbf{v}}^{(l)} + A_{\mathrm{act}\mathbf{v}} H_{\mathrm{act}}^{(l)} W_{\mathrm{cross}}^{(l)} \right),
$
  - $H_{\mathrm{act}}^{(l+1)}$: Feature matrix of the action knowledge graph nodes at layer $l+1$.
  - $\sigma$: The ReLU non-linearity (activation function).
  - $A_{\mathrm{act}}^{\mathbf{v}}$: The adjacency matrix of the visual action knowledge graph, describing relationships between different visual action types.
  - $H_{\mathrm{act}}^{(l)}$: Input feature matrix of the action knowledge graph nodes at layer $l$.
  - $W_{\mathrm{act}}^{(l)}$: Learnable weight matrix for the action knowledge graph GCN at layer $l$.
  - $H_{\mathbf{v}}^{(l+1)}$: Feature matrix of the visual temporal graph nodes at layer $l+1$.
  - $A_{\mathbf{v}}$: The adjacency matrix of the visual temporal graph, describing temporal relationships between visual clips.
  - $H_{\mathbf{v}}^{(l)}$: Input feature matrix of the visual temporal graph nodes at layer $l$.
  - $W_{\mathbf{v}}^{(l)}$: Learnable weight matrix for the visual temporal graph GCN at layer $l$.
  - $A_{\mathrm{act}\mathbf{v}}$: The cross-graph mapping matrix, which defines how information flows from the action knowledge graph nodes to the visual temporal graph nodes and captures the influence of specific actions on visual segments.
  - $W_{\mathrm{cross}}^{(l)}$: Learnable weight matrix for the cross-graph aggregation at layer $l$.
This process is performed similarly for audio features, yielding $\hat{f}_t^{\mathbf{v}}$ and $\hat{f}_t^{\mathbf{a}}$, the visual and audio features with aggregated action knowledge. MAG² typically uses a 2-layer GCN.
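The following sketch implements one dual-aggregation layer in the spirit of Equations 8 and 9; the adjacency matrices and the cross-graph mapping are illustrative assumptions, since the summary does not specify how they are built:

```python
import torch
import torch.nn as nn

class MAG2Layer(nn.Module):
    """One layer of the dual aggregation in Eqs. (8)-(9), sketched with dense adjacencies."""
    def __init__(self, d):
        super().__init__()
        self.W_act = nn.Linear(d, d, bias=False)    # W_act^(l)
        self.W_v = nn.Linear(d, d, bias=False)      # W_v^(l)
        self.W_cross = nn.Linear(d, d, bias=False)  # W_cross^(l)

    def forward(self, H_act, H_v, A_act, A_v, A_act_v):
        # Eq. (8): propagate within the action knowledge graph.
        H_act_next = torch.relu(A_act @ self.W_act(H_act))
        # Eq. (9): temporal propagation plus cross-graph transfer of action knowledge.
        H_v_next = torch.relu(A_v @ self.W_v(H_v) + A_act_v @ self.W_cross(H_act))
        return H_act_next, H_v_next

M, T, d = 8, 16, 256                          # action prompts, clips, feature dim
H_act, H_v = torch.randn(M, d), torch.randn(T, d)
A_act = torch.ones(M, M) / M                  # complete action graph (row-normalized here)
A_v = torch.ones(T, T) / T                    # complete temporal graph
A_act_v = torch.softmax(H_v @ H_act.t(), -1)  # one plausible cross-graph mapping (assumption)

layer = MAG2Layer(d)
H_act1, H_v1 = layer(H_act, H_v, A_act, A_v, A_act_v)
print(H_act1.shape, H_v1.shape)               # (8, 256) (16, 256)
```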
Step 5: Audio-Visual Cross-Modal Fusion (AVCF)
- Purpose: To generate multimodal features ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) specifically for assessing audio-visual scores, adhering to sport rules that emphasize consistency between actions and music. Unlike previous methods that focus on global interactions, AVCF considers both global and clip-wise action-music matches.
- Process:
  - Global Alignment (Cross-Temporal Relation Decoder): A 2-layer decoder with cross-attention finds the global alignment between visual and audio features.
    - The visual features ($\hat{f}_t^{\mathbf{v}}$) act as the query ($Q$).
    - The audio features ($\hat{f}_t^{\mathbf{a}}$) act as the key ($K$) and value ($V$).
    - Before cross-attention, self-attention is also performed on the query stream ($\hat{f}_t^{\mathbf{v}}$) to enhance its representation.
  - Formula (Equation 10):
$
\mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Softmax}\left( w_t^{Q} \hat{f}_t^{\mathbf{v}} \left( w_t^{K} \hat{f}_t^{\mathbf{a}} \right)^{\mathrm{T}} / \sqrt{d} \right) w_t^{V} \hat{f}_t^{\mathbf{a}},
$
    - $\mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}}$: The globally aligned audio-visual feature for clip $t$.
    - $\hat{f}_t^{\mathbf{v}}$: Visual feature for clip $t$ (output of MAG²).
    - $\hat{f}_t^{\mathbf{a}}$: Audio feature for clip $t$ (output of MAG²).
    - $w_t^{Q}, w_t^{K}, w_t^{V}$: Learnable weight matrices for the query, key, and value transformations in the cross-attention.
    - $d$: The feature dimension, used for scaling the dot product.
    - $\mathrm{Softmax}(\cdot)$: Normalizes the attention scores.
    - A feed-forward network is then applied for non-linear transformations, as is standard in Transformer blocks.
  - Clip-wise Match (Concatenation and Convolution): To capture short-lived action-music interactions, features are also processed clip by clip.
    - Visual ($\hat{f}_t^{\mathbf{v}}$) and audio ($\hat{f}_t^{\mathbf{a}}$) features are concatenated for each clip $t$.
    - A two-layer convolutional block (Conv-BatchNorm-ReLU) compresses this concatenated feature back to the original dimension $d$, leveraging the local feature extraction of convolution while keeping computational cost low.
    - The resulting clip-level fused features ($\mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}}$) are then summed with the global-level features ($\mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}}$).
- Formulas (Equations 11 & 12):
$
\mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Convblock}\left( \hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}} \right), \quad \hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}} = \mathrm{Concat}\left( \hat{f}_t^{\mathbf{v}}, \hat{f}_t^{\mathbf{a}} \right),
$
$
\left\{ \hat{f}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T} = \left\{ \mathbf{G}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T} \oplus \left\{ \mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}} \right\}_{t=1}^{T}
$
  - $\mathbf{C}_t^{\mathbf{v}\cdot\mathbf{a}}$: The clip-level fused audio-visual feature for clip $t$.
  - $\mathrm{Convblock}(\cdot)$: The two-layer convolutional block.
  - $\hat{\mathbf{C}}_t^{\mathbf{v}\cdot\mathbf{a}}$: The concatenated visual and audio features for clip $t$, with dimension $2d$.
  - $\mathrm{Concat}(\cdot)$: The concatenation operation.
  - $\{\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}\}_{t=1}^{T}$: The final fused audio-visual features for all clips, incorporating both global and clip-wise consistency.
  - $\oplus$: Again an additive combination (summation) of the global- and clip-level features.
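A compact PyTorch sketch of this global-plus-clip-wise fusion is given below; the layer sizes and the use of 1x1 convolutions in the Conv-BatchNorm-ReLU block are assumptions:

```python
import torch
import torch.nn as nn

class AVCF(nn.Module):
    """Sketch of AVCF: global cross-attention alignment (Eq. 10) + clip-wise
    concat-and-convolve fusion (Eqs. 11-12), summed into the fused features."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.conv_block = nn.Sequential(              # compresses concat(2d) back to d
            nn.Conv1d(2 * d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU(),
        )

    def forward(self, f_v, f_a):                      # (batch, T, d) each
        q, _ = self.self_attn(f_v, f_v, f_v)          # enhance the visual query stream
        G, _ = self.cross_attn(q, f_a, f_a)           # global alignment: visual queries audio
        G = G + self.ffn(G)                           # feed-forward refinement
        C = self.conv_block(torch.cat([f_v, f_a], dim=-1).transpose(1, 2)).transpose(1, 2)
        return G + C                                  # {f_t^{v·a}}: global ⊕ clip-wise

avcf = AVCF(d=256)
fused = avcf(torch.randn(1, 16, 256), torch.randn(1, 16, 256))
print(fused.shape)                                    # torch.Size([1, 16, 256])
```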
Step 6: Dual-Branch Prompt-Guided Grading (DPG)
- Purpose: To assess final quality scores by weighing visual action performance and action-music matching, aligned with sport rules.
- Process:
  - Grade Prompt Design:
    - Visual Grade Prompts: $N$ distinct prompts (e.g., "very poor performance", "average performance", "excellent performance") are designed as grade prototypes for visual quality.
    - Audio-Visual Grade Prompts: $2N$ prompts are designed, which additionally consider action-music fit (e.g., "poorly matched performance", "perfectly matched performance").
  - Prompt Encoding: These textual prompts are encoded by the same pre-trained text encoder as before, yielding grade prompt features for the visual and the audio-visual branches.
  - Performance Grading Transformer (PGT): A 3-layer Transformer decoder acts as the PGT.
    - It uses the grade prompts as the query and the processed visual features ($\hat{f}_t^{\mathbf{v}}$) or fused audio-visual features ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) as key-value pairs.
    - Shared decoder parameters are used across prompts to uncover universal quality patterns and reduce computational costs.
  - Formula (Equation 13):
$
\mathbf{P}_n^{\mathbf{m}'} = \mathrm{Softmax}\left( \mathcal{W}_n^{Q} f_n^{\mathbf{t}\cdot\mathbf{m}'} \left( \mathcal{W}_t^{K} \hat{f}_t^{\mathbf{m}'} \right)^{\mathrm{T}} / \sqrt{d} \right) \mathcal{W}_t^{V} \hat{f}_t^{\mathbf{m}'},
$
    - $\mathbf{P}_n^{\mathbf{m}'}$: The $n$-th grade pattern for modality $\mathbf{m}'$.
    - $\mathbf{m}'$: Denotes either the visual branch ($\mathbf{v}$) or the audio-visual branch ($\mathbf{v}\cdot\mathbf{a}$).
    - $f_n^{\mathbf{t}\cdot\mathbf{m}'}$: The $n$-th encoded text prompt feature for modality $\mathbf{m}'$; it acts as the query.
    - $\hat{f}_t^{\mathbf{m}'}$: The visual ($\hat{f}_t^{\mathbf{v}}$) or audio-visual ($\hat{f}_t^{\mathbf{v}\cdot\mathbf{a}}$) features for clip $t$; they act as the keys and values.
    - $\mathcal{W}_n^{Q}, \mathcal{W}_t^{K}, \mathcal{W}_t^{V}$: Learnable weight matrices for the query, key, and value transformations.
  - Score Calculation:
    - Two 2-layer MLPs convert the grade patterns ($\mathbf{P}_n^{\mathbf{m}'}$) into predicted grade probabilities ($\hat{\mathbf{P}}_n^{\mathbf{m}'}$).
    - These probabilities are combined with fixed grade weights ($\mathbf{W}_n^{\mathbf{v}}$ and $\mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}}$) to obtain a visual action score and an action-music matching score. The fixed weights increase linearly from 0 to 1, encoding a linear progression of quality.
    - The final score ($\hat{s}$) is a weighted sum of the two, using a learnable weight $\alpha$.
- Formulas (Equations 14 & 15):
$
\hat{\mathbf{P}}_n^{\mathbf{m}'} = \mathrm{MLP}\left( \mathbf{P}_n^{\mathbf{m}'} \right), \quad \mathbf{m}' \in \{ \mathbf{v}, \mathbf{v}\cdot\mathbf{a} \},
$
$
\hat{s} = \alpha \sum_{n=1}^{N} \mathbf{W}_n^{\mathbf{v}} \hat{\mathbf{P}}_n^{\mathbf{v}} + (1 - \alpha) \sum_{n=1}^{2N} \mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}} \hat{\mathbf{P}}_n^{\mathbf{v}\cdot\mathbf{a}}.
$
  - $\hat{\mathbf{P}}_n^{\mathbf{m}'}$: The predicted probability of the $n$-th grade pattern in modality $\mathbf{m}'$.
  - $\mathrm{MLP}(\cdot)$: A multi-layer perceptron.
  - $\hat{s}$: The final predicted score.
  - $\alpha$: A learnable weighting parameter (between 0 and 1) that balances the contribution of the visual-only score and the audio-visual score.
  - $\mathbf{W}_n^{\mathbf{v}}$: Fixed grade weights for visual performance, linearly increasing from 0 to 1 across the $N$ grades.
  - $\mathbf{W}_n^{\mathbf{v}\cdot\mathbf{a}}$: Fixed grade weights for audio-visual performance, linearly increasing from 0 to 1 across the $2N$ grades.
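The sketch below mirrors the dual-branch grading logic (Equations 13-15), with a standard Transformer decoder standing in for the PGT; the grade counts, dimensions, and linear weight vectors are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DPGBranch(nn.Module):
    """One grading branch: grade prompts query the clip features (Eq. 13), then an MLP
    maps each grade pattern to a probability (Eq. 14)."""
    def __init__(self, d, n_heads=4, n_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.pgt = nn.TransformerDecoder(layer, num_layers=n_layers)   # grading transformer
        self.head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, grade_prompts, clip_feats):                      # (1, N, d), (1, T, d)
        patterns = self.pgt(grade_prompts, clip_feats)                 # grade patterns P_n
        logits = self.head(patterns).squeeze(-1)                       # (1, N)
        return torch.softmax(logits, dim=-1)                           # predicted grade probs

d, N, T = 256, 5, 16
branch_v, branch_va = DPGBranch(d), DPGBranch(d)
P_v = branch_v(torch.randn(1, N, d), torch.randn(1, T, d))            # visual branch
P_va = branch_va(torch.randn(1, 2 * N, d), torch.randn(1, T, d))      # audio-visual branch

W_v = torch.linspace(0, 1, N)                 # fixed grade weights, 0 -> 1
W_va = torch.linspace(0, 1, 2 * N)
alpha = torch.sigmoid(nn.Parameter(torch.zeros(1)))                   # learnable balance in (0, 1)
score = alpha * (W_v * P_v).sum() + (1 - alpha) * (W_va * P_va).sum() # Eq. (15)
print(score.item())
```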
Step 7: Optimization
- Purpose: To optimize the grading patterns and predicted scores for accurate sports assessment.
- Process: The model is trained using a combination of three loss functions:
  - Triplet Loss ($\mathcal{L}_{TL}$): Ensures discriminability between different grade patterns, meaning distinct quality levels should result in distinct patterns.
  - Cross-Entropy Loss ($\mathcal{L}_{CE}$): Ensures that the textual prompts accurately guide the grading, aligning the learned grade patterns with their corresponding textual semantics.
  - Mean Square Error Loss ($\mathcal{L}_{MSE}$): Measures the direct numerical difference between the predicted score ($\hat{s}$) and the ground-truth score ($s$).
- Formulas (Equations 16, 17, & 18):
$
\mathcal{L}_{TL} = \sum_{n} \left[ \max\left( \mathrm{sim}\left( \mathbf{P}_n^{\mathbf{m}}, \mathbf{P}_i^{\mathbf{m}} \right) \right) - \min\left( \mathrm{sim}\left( \mathbf{P}_n^{\mathbf{m}}, \mathbf{P}_i^{\mathbf{m}} \right) \right) + \delta \right]_{+},
$
$
\mathcal{L}_{CE} = - \sum_{n} \log \frac{ \exp\left( \mathrm{sim}\left( f_n^{\mathbf{t}\cdot\mathbf{m}}, \mathbf{P}_n^{\mathbf{m}} \right) / \varsigma \right) }{ \sum_{j} \exp\left( \mathrm{sim}\left( f_n^{\mathbf{t}\cdot\mathbf{m}}, \mathbf{P}_j^{\mathbf{m}} \right) / \varsigma \right) },
$
$
\mathcal{T} = \lambda_1 \mathcal{L}_{TL} + \lambda_2 \mathcal{L}_{CE} + \lambda_3 \mathcal{L}_{MSE}.
$
- $\mathcal{L}_{TL}$ (Triplet Loss):
  - $\mathrm{sim}(\cdot, \cdot)$: Cosine similarity between two feature vectors (grade patterns $\mathbf{P}_n^{\mathbf{m}}$ and $\mathbf{P}_i^{\mathbf{m}}$).
  - $\mathbf{P}_n^{\mathbf{m}}$: The $n$-th grade pattern for modality $\mathbf{m}$ (either visual or audio-visual).
  - $\mathbf{P}_i^{\mathbf{m}}$: Another grade pattern, with $i \neq n$.
  - $[\cdot]_+$: Denotes $\max(\cdot, 0)$; the loss is only incurred when the bracketed term is positive.
  - $\delta$: A margin parameter defining the minimum separation required between patterns.
  - Note: This is a margin-based (triplet-style) loss over the grade patterns. As printed, each pattern $\mathbf{P}_n^{\mathbf{m}}$ contrasts its largest and smallest similarities to the other patterns $\mathbf{P}_i^{\mathbf{m}}$ with margin $\delta$; the intent is to keep all grade patterns mutually discriminable, so distinct quality levels map to distinct patterns.
- $\mathcal{L}_{CE}$ (Cross-Entropy Loss):
  - $\mathrm{sim}(\cdot, \cdot)$: Cosine similarity.
  - $f_n^{\mathbf{t}\cdot\mathbf{m}}$: The encoded text prompt feature corresponding to grade $n$ for modality $\mathbf{m}$.
  - $\mathbf{P}_n^{\mathbf{m}}$: The $n$-th grade pattern for modality $\mathbf{m}$.
  - $\varsigma$: A temperature hyperparameter scaling the similarity scores inside the softmax.
  - This loss encourages the learned grade pattern $\mathbf{P}_n^{\mathbf{m}}$ to be most similar to its corresponding text prompt, relative to all other grade patterns $\mathbf{P}_j^{\mathbf{m}}$.
- $\mathcal{L}_{MSE}$ (Mean Square Error Loss): The standard MSE loss between the predicted score $\hat{s}$ and the ground-truth score $s$, i.e., $\mathcal{L}_{MSE} = (\hat{s} - s)^2$ averaged over samples.
- $\mathcal{T}$: The overall objective function to be minimized.
- $\lambda_1, \lambda_2, \lambda_3$: Balancing weights for the three loss components.
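A minimal sketch of the three loss terms, under the interpretation above, is shown below; the margin, temperature, and lambda values are placeholders:

```python
import torch
import torch.nn.functional as F

def triplet_style_loss(patterns, margin=0.3):
    """Margin loss over grade patterns (Eq. 16 as printed): for each pattern, contrast its
    largest and smallest cosine similarity to the other patterns."""
    sim = F.cosine_similarity(patterns.unsqueeze(1), patterns.unsqueeze(0), dim=-1)  # (N, N)
    loss = patterns.new_zeros(())
    for n in range(sim.size(0)):
        others = torch.cat([sim[n, :n], sim[n, n + 1:]])     # similarities to the other patterns
        loss = loss + torch.clamp(others.max() - others.min() + margin, min=0.0)
    return loss

def prompt_alignment_ce(text_feats, patterns, tau=0.07):
    """Eq. 17: each grade prompt should match its own grade pattern most closely."""
    sim = F.cosine_similarity(text_feats.unsqueeze(1), patterns.unsqueeze(0), dim=-1) / tau
    targets = torch.arange(sim.size(0))
    return F.cross_entropy(sim, targets)

patterns = torch.randn(5, 256, requires_grad=True)           # N grade patterns
text_feats = torch.randn(5, 256)                             # matching grade prompt embeddings
pred, gt = torch.tensor(78.4), torch.tensor(81.0)            # predicted vs. ground-truth score

l_tl = triplet_style_loss(patterns)
l_ce = prompt_alignment_ce(text_feats, patterns)
l_mse = F.mse_loss(pred, gt)
total = 1.0 * l_tl + 1.0 * l_ce + 1.0 * l_mse                # lambdas are hyperparameters
print(total.item())
```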
The complete architecture integrates these modules, allowing information to flow from raw video/audio, through language-guided feature enhancement, cross-modal fusion, and finally to a rule-aligned grading mechanism.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on four public long-term sports assessment benchmarks:
- FS1000 [47]:
  - Source/Domain: Figure skating.
  - Characteristics: Provides a rule-aligned, comprehensive audio-visual assessment with diverse scoring scenes. It is particularly challenging due to the complexity of figure skating movements and the intricate coordination with music.
  - Scale: Fixed 95 video clips were randomly selected for the experiments.
  - Why chosen: Represents a challenging scenario requiring robust multimodal understanding and precise assessment of action-music coordination.
- Fis-V [49]:
  - Source/Domain: Figure skating.
  - Characteristics: Another benchmark for figure skating assessment.
  - Scale: Fixed 124 video clips were randomly selected for the experiments.
  - Why chosen: Allows for further validation of the method's performance on figure skating, a key domain for audio-visual coordination.
- Rhythmic Gymnastics (RG) [58]:
  - Source/Domain: Rhythmic gymnastics.
  - Characteristics: An audio-visual dataset whose long videos contain complex multi-scale temporal features, video dynamic information, and athlete static poses. Rhythmic gymnastics also heavily emphasizes action-music coordination.
  - Scale: Fixed 68 video clips were randomly selected for the experiments.
  - Why chosen: Provides a different sport context with similar audio-visual coordination requirements, validating generalizability.
- LOGO [60]:
  - Source/Domain: Group action quality assessment (AQA).
  - Characteristics: A long-form video dataset. Notably, this is a visual-only dataset.
  - Scale: Fixed 48 video clips were randomly selected for the experiments.
  - Why chosen: Used to validate the plug-and-play capability and effectiveness of the MAG² module (action-visual graph guidance only) on visual-only methods, demonstrating the value of language-guided action knowledge beyond audio-visual contexts.

These datasets are chosen because they represent the core challenges of long-term sports assessment, including complex movements, the need for multimodal understanding (especially audio-visual synergy), and the relevance of domain-specific rules in scoring.
5.2. Evaluation Metrics
The paper adopts two families of standard metrics to evaluate the approach fully: a rank correlation metric, Spearman's Rank Correlation ($\rho$), and error metrics, Mean Square Error (MSE) and Relative L2-distance (R-$\ell_2$).
- Spearman's Rank Correlation ($\rho$)
  - Conceptual Definition: Spearman's rank correlation coefficient assesses the monotonic relationship between two ranked variables, i.e., how well the relationship can be described by a monotonic (increasing or decreasing) function. Unlike Pearson's correlation, it does not assume linearity or a normal distribution of the data. In the context of AQA, it measures the agreement between the rank order of predicted scores and the rank order of ground-truth scores: a high Spearman's $\rho$ means a video ranked higher by the model is also likely ranked higher by the ground truth.
  - Mathematical Formula: $ \rho = \frac{\sum_i (q_i - \bar{q})(\hat{q}_i - \bar{\hat{q}})}{\sqrt{\sum_i (q_i - \bar{q})^2 \sum_i (\hat{q}_i - \bar{\hat{q}})^2}} $
  - Symbol Explanation:
    - $q_i$: The ground-truth rank of the $i$-th video.
    - $\hat{q}_i$: The predicted rank of the $i$-th video.
    - $\bar{q}$: The mean of the ground-truth ranks.
    - $\bar{\hat{q}}$: The mean of the predicted ranks.
    - The sum is taken over all videos in the dataset.
  - Interpretation: Values range from -1 to +1. A value of +1 indicates a perfect monotonic increasing relationship, -1 a perfect monotonic decreasing relationship, and 0 no monotonic relationship. For AQA, higher values (closer to +1) are better.
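In practice this metric is computed directly with SciPy, for example:

```python
import numpy as np
from scipy.stats import spearmanr

# Predicted and ground-truth scores for a small illustrative set of videos.
pred = np.array([72.3, 65.1, 88.0, 79.4, 59.8])
gt = np.array([70.0, 66.5, 90.2, 81.1, 58.0])

rho, _ = spearmanr(pred, gt)   # rank-based correlation; 1.0 means identical ordering
print(f"Spearman rho = {rho:.3f}")
```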
- Mean Square Error (MSE)
  - Conceptual Definition: Mean Square Error quantifies the average magnitude of prediction errors by averaging the squares of the differences between predicted and actual values; squaring penalizes larger errors more heavily. In AQA, MSE quantifies the numerical difference between predicted scores and ground-truth scores.
  - Mathematical Formula: $ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{s}_i - s_i)^2 $
  - Symbol Explanation:
    - $\hat{s}_i$: The predicted score for the $i$-th video.
    - $s_i$: The ground-truth score for the $i$-th video.
    - $N$: The total number of videos.
  - Interpretation: A value of 0 indicates a perfect prediction. Lower values are better, indicating closer agreement between predicted and true scores.
- Relative L2-distance (R-$\ell_2$)
  - Conceptual Definition: Relative L2-distance is a normalized version of the squared error. It measures the average magnitude of the errors, scaled by the possible range of scores in the dataset. This normalization makes the metric comparable across datasets with different score ranges, providing a scale-invariant measure of prediction accuracy.
  - Mathematical Formula: $ \mathrm{R}\text{-}\ell_2 = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{|s_n - \hat{s}_n|}{s_{\max} - s_{\min}} \right)^2 $
  - Symbol Explanation:
    - $s_n$: The ground-truth score for the $n$-th video.
    - $\hat{s}_n$: The predicted score for the $n$-th video.
    - $s_{\max}$: The maximum possible ground-truth score in the dataset.
    - $s_{\min}$: The minimum possible ground-truth score in the dataset.
    - $N$: The total number of videos.
  - Interpretation: Lower values are better, indicating that prediction errors are small relative to the total possible range of scores.
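Both error metrics take only a few lines of NumPy; the score bounds used for R-ℓ2 below are illustrative:

```python
import numpy as np

def mse(pred, gt):
    """Mean Square Error between predicted and ground-truth scores."""
    return np.mean((pred - gt) ** 2)

def relative_l2(pred, gt, s_min, s_max):
    """Relative L2-distance: squared errors normalized by the dataset's score range."""
    return np.mean((np.abs(gt - pred) / (s_max - s_min)) ** 2)

pred = np.array([72.3, 65.1, 88.0, 79.4, 59.8])
gt = np.array([70.0, 66.5, 90.2, 81.1, 58.0])

print(f"MSE = {mse(pred, gt):.3f}")
# s_min / s_max here are assumed dataset score bounds.
print(f"R-l2 (x100) = {100 * relative_l2(pred, gt, s_min=0.0, s_max=100.0):.3f}")
```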
5.3. Baselines
The paper compares MLAVL against a comprehensive set of state-of-the-art AQA methods, categorized primarily by their modality usage:
- Visual-Only Methods: These baselines rely solely on visual information from the video to assess action quality.
  - C3D-LSTM [34]: An early approach using 3D Convolutional Neural Networks (C3D) for spatio-temporal feature extraction, followed by Long Short-Term Memory (LSTM) networks for temporal modeling.
  - MSCADC [33]: A multi-scale approach for action quality assessment.
  - MS-LSTM [49]: Multi-scale LSTM for scoring figure skating.
  - CoRe [56]: Group-aware Contrastive Regression for AQA.
  - GDLT [48]: Likert scoring with grade decoupling for long-term action assessment.
  - TPT [2]: Temporal Parsing Transformer for AQA.
  - T2CR [21]: Two-path Target-aware Contrastive Regression for AQA.
  - CoFInAl [65]: Enhances AQA with coarse-to-fine instruction alignment.
  - QTD [10]: Interpretable long-term action quality assessment.
- Audio-Visual/Multimodal Methods: These baselines integrate information from both audio and visual modalities.
  - M-BERT (Late) [23]: A multimodal Transformer approach that processes visual and audio features; "Late" likely refers to late fusion of the modalities.
  - MLP-Mixer [47]: Utilizes MLPs for long-term sport audio-visual modeling.
  - SGN [12]: Semantics-Guided Representations for scoring figure skating, potentially using language or semantic cues.
  - PAMFN [57]: Multimodal action quality assessment; a state-of-the-art audio-visual model.

These baselines are representative of the evolution and current state of AQA research, covering different architectural choices (CNNs, LSTMs, Transformers, MLPs) and modality integrations (visual-only, audio-visual). Comparing against them allows MLAVL to demonstrate its advantages in leveraging language for more efficient and accurate multimodal learning.
6. Results & Analysis
The experimental results demonstrate that MLAVL consistently achieves state-of-the-art performance across various long-term sports assessment benchmarks, often with lower computational costs.
6.1. Core Results Analysis
FS1000 Dataset (Figure Skating)
The FS1000 dataset is known for its comprehensive, rule-aligned audio-visual assessment of figure skating, presenting significant challenges. As shown in Table 1, MLAVL achieves the best results across all score types (TES, PCS, SS, TR, PE, CO, IN) for Spearman correlation ($\rho$) and Mean Square Error (MSE), and consequently the best averages.
- Spearman Correlation (Avg.): MLAVL achieves 0.90, surpassing PAMFN [57] (0.87) by 3.0 percentage points and SGN [12] (0.85) by 5.0 percentage points.
- MSE (Avg.): MLAVL achieves 10.39, significantly lower than PAMFN [57] (16.80) and SGN [12] (12.77). This highlights MLAVL's ability to model complex multimodal relationships and accurately capture action-music coordination. The authors attribute this to MLAVL's design of fixed, domain-specific prompts that introduce action knowledge at a low cost, leading to accurate learning of audio-visual relationships.
Fis-V Dataset (Figure Skating)
Table 2 compares MLAVL on the Fis-V dataset, reporting Spearman correlation ($\rho$) and MSE along with computational cost (#Params and #FLOPs).
- Balanced Performance: MLAVL achieves an average Sp. Corr. of 0.823 and an average MSE of 13.31. PAMFN comes very close in average Sp. Corr. (0.822), but MLAVL achieves a much better average MSE (13.31 vs. 15.33). Compared to MLP-Mixer [47], which is strong on MSE, MLAVL is competitive (13.31 vs. 13.77) while reaching a higher Sp. Corr.
- Efficiency: Crucially, MLAVL achieves this performance with significantly fewer parameters (3.82M) and lower FLOPs (0.778G) than MLP-Mixer (14.32M params, 49.900G FLOPs) and PAMFN (18.06M params, 2.562G FLOPs). This validates the paper's claim that language-guided prompts efficiently establish audio-action-visual relationships without requiring large model parameters.
Rhythmic Gymnastics (RG) Dataset
Table 3 presents results on the RG dataset, showcasing MLAVL's performance across different apparatus (Ball, Clubs, Hoop, Ribbon).
- Overall SOTA: MLAVL sets a new state of the art with an average Sp. Corr. of 0.849 and an average MSE of 4.47.
- Significant MSE Improvement: It improves the average MSE by 1.06 over the second-best approach, GDLT [48] (5.53). This strong performance is attributed to the multidimensional action knowledge and the dual-branch prompt-guided grading mechanism, which directly aligns with assessment rules in rhythmic gymnastics.
Overall Effectiveness
The results across these three datasets consistently demonstrate MLAVL's robust effectiveness, achieving balanced state-of-the-art performance in both correlation (rank order) and numerical accuracy (MSE).
The comparison of scatter plots in Figure 3 (a, d) visually reinforces MLAVL's superiority over PAMFN on FS1000 (PCS). MLAVL's predictions show a tighter correlation with ground truth scores, indicating better accuracy. The t-SNE feature distribution plots (b, c, e, f) further illustrate the impact of the MAG² module. Without MAG², the feature distribution is disordered with significant class overlap (b, e). With MAG², the grade categories display clear boundaries and distinct clustering (c, f), confirming that MAG² effectively distinguishes different action qualities by introducing language-guided action knowledge.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Methods | Year | Features | Sp. Corr. (↑) TES | PCS | SS | TR | PE | CO | IN | Avg. | MSE (↓) TES | PCS | SS | TR | PE | CO | IN | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D-LSTM [34] | 2017 | C3D [40] | 0.78 | 0.53 | 0.50 | 0.52 | 0.52 | 0.57 | 0.47 | 0.57 | 308.30 | 25.85 | 0.92 | 0.99 | 1.21 | 0.97 | 1.01 | 48.61 |
| MSCADC [33] | 2019 | Timesformer [3] | 0.77 | 0.70 | 0.69 | 0.69 | 0.71 | 0.68 | 0.71 | 0.71 | 148.02 | 15.47 | 0.51 | 0.57 | 0.78 | 0.55 | 0.60 | 23.79 |
| MS-LSTM [49] | 2019 | Timesformer [3] | 0.86 | 0.80 | 0.77 | 0.78 | 0.76 | 0.79 | 0.78 | 0.79 | 94.55 | 11.03 | 0.45 | 0.49 | 0.76 | 0.43 | 0.47 | 15.45 |
| CoRe [56] | 2021 | Timesformer [3] | 0.88 | 0.84 | 0.81 | 0.83 | 0.81 | 0.83 | 0.80 | 0.83 | 103.50 | 9.85 | 0.41 | 0.37 | 0.81 | 0.38 | 0.41 | 16.53 |
| GDLT* [48] | 2022 | Timesformer [3] | 0.88 | 0.86 | 0.84 | 0.86 | 0.83 | 0.85 | 0.84 | 0.85 | 82.73 | 10.32 | 0.35 | 0.37 | 0.67 | 0.38 | 0.42 | 13.60 |
| TPT [2] | 2022 | Timesformer [3] | 0.88 | 0.83 | 0.82 | 0.82 | 0.81 | 0.82 | 0.81 | 0.83 | 80.00 | 8.88 | 0.34 | 0.37 | 0.63 | 0.34 | 0.39 | 12.99 |
| T2CR* [21] | 2024 | Timesformer [3] | 0.86 | 0.79 | 0.83 | 0.84 | 0.82 | 0.84 | 0.80 | 0.83 | 107.59 | 15.26 | 0.61 | 0.48 | 0.69 | 0.57 | 0.42 | 17.95 |
| CoFInAl* [65] | 2024 | Timesformer [3] | 0.84 | 0.83 | 0.84 | 0.84 | 0.81 | 0.83 | 0.82 | 0.83 | 81.65 | 16.05 | 0.56 | 0.63 | 0.71 | 0.41 | 0.54 | 14.36 |
| QTD* [10] | 2024 | Timesformer [3] | 0.88 | 0.85 | 0.85 | 0.86 | 0.83 | 0.85 | 0.84 | 0.85 | 137.09 | 17.48 | 0.51 | 0.73 | 0.80 | 0.91 | 0.98 | 22.64 |
| M-BERT (Late) [23] | 2020 | TF [3]+AST [14] | 0.79 | 0.75 | 0.80 | 0.81 | 0.80 | 0.80 | 0.76 | 0.79 | 131.28 | 15.28 | 0.44 | 0.43 | 0.67 | 0.47 | 0.55 | 21.30 |
| MLP-Mixer† [47] | 2023 | TF [3]+AST [14] | 0.88 | 0.82 | 0.80 | 0.81 | 0.80 | 0.81 | 0.81 | 0.82 | 81.24 | 9.47 | 0.35 | 0.35 | 0.62 | 0.37 | 0.39 | 13.26 |
| SGN [12] | 2024 | TF [3]+AST [14] | 0.89 | 0.85 | 0.84 | 0.85 | 0.82 | 0.85 | 0.83 | 0.85 | 79.08 | 8.40 | 0.31 | 0.32 | 0.61 | 0.33 | 0.37 | 12.77 |
| PAMFN+* [57] | 2024 | TF [3]+AST [14]+I3D [4] | 0.90 | 0.89 | 0.86 | 0.87 | 0.86 | 0.87 | 0.85 | 0.87 | 104.89 | 10.05 | 0.39 | 0.52 | 0.78 | 0.40 | 0.56 | 16.80 |
| MLAVL (Ours) | - | TF [3]+AST [14]+CLIP [36] | 0.92 | 0.89 | 0.90 | 0.90 | 0.88 | 0.89 | 0.88 | 0.90 | 64.89 | 6.39 | 0.23 | 0.24 | 0.50 | 0.25 | 0.26 | 10.39 |
The following are the results from Table 2 of the original paper:
| Methods | #Params (M) | #FLOPs (G) | Sp. Corr. (↑) TES | PCS | Avg. | MSE (↓) TES | PCS | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D-LSTM [34] | - | - | 0.290 | 0.510 | 0.406 | 39.25 | 21.97 | 30.61 |
| MSCADC [33] | - | - | 0.500 | 0.610 | 0.557 | 25.93 | 11.94 | 18.94 |
| MS-LSTM [49] | - | - | 0.650 | 0.780 | 0.721 | 19.91 | 8.35 | 14.13 |
| M-BERT (Late) [23] | 4.00 | 1.272 | 0.530 | 0.720 | 0.634 | 27.73 | 12.38 | 20.06 |
| GDLT* [48] | 3.20 | 0.268 | 0.685 | 0.820 | 0.761 | 20.99 | 8.75 | 14.87 |
| CoRe [56] | 2.51 | 0.010 | 0.660 | 0.820 | 0.751 | 23.50 | 9.25 | 16.38 |
| TPT [2] | 11.82 | 2.229 | 0.570 | 0.760 | 0.676 | 27.50 | 11.25 | 19.38 |
| MLP-Mixer [47] | 14.32 | 49.900 | 0.680 | 0.820 | 0.759 | 19.57 | 7.96 | 13.77 |
| SGN [12] | - | - | 0.700 | 0.830 | 0.773 | 19.05 | 7.96 | 13.51 |
| PAMFN [57] | 18.06 | 2.562 | 0.754 | 0.872 | 0.822 | 22.50 | 8.16 | 15.33 |
| CoFInAl* [65] | 5.24 | 0.509 | 0.716 | 0.843 | 0.788 | 20.76 | 7.91 | 14.34 |
| QTD* [10] | 5.51 | 0.396 | 0.717 | 0.858 | 0.798 | 26.97 | 10.89 | 18.93 |
| MLAVL (Ours) | 3.82 | 0.778 | 0.766 | 0.863 | 0.823 | 19.44 | 7.17 | 13.31 |
The following are the results from Table 3 of the original paper:
| Methods | Year | Features | Sp. Corr. (↑) Ball | Clubs | Hoop | Ribbon | Avg. | MSE (↓) Ball | Clubs | Hoop | Ribbon | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C3D+SVR [34] | 2017 | C3D [40] | 0.357 | 0.551 | 0.495 | 0.516 | 0.483 | - | - | - | - | - |
| MS-LSTM* [49] | 2019 | I3D [4] | 0.515 | 0.621 | 0.540 | 0.522 | 0.551 | 10.55 | 6.94 | 5.85 | 12.56 | 8.97 |
| MS-LSTM* [49] | 2019 | VST [29] | 0.621 | 0.661 | 0.670 | 0.695 | 0.663 | 7.52 | 6.04 | 6.16 | 5.78 | 6.37 |
| ACTION-NET* [58] | 2020 | I3D [4]+ResNet [16] | 0.528 | 0.652 | 0.708 | 0.578 | 0.623 | 9.09 | 6.40 | 5.93 | 10.23 | 7.91 |
| ACTION-NET* [58] | 2020 | VST [29]+ResNet [16] | 0.684 | 0.737 | 0.733 | 0.754 | 0.728 | 9.55 | 6.36 | 5.56 | 8.15 | 7.41 |
| GDLT* [48] | 2022 | VST [29] | 0.746 | 0.802 | 0.765 | 0.741 | 0.765 | 5.90 | 4.34 | 5.70 | 6.16 | 5.53 |
| PAMFN [57] | 2024 | VST [29]+AST [14]+I3D [4] | 0.757 | 0.825 | 0.836 | 0.846 | 0.819 | 6.24 | 7.45 | 5.21 | 7.67 | 6.64 |
| CoFInAl* [65] | 2024 | I3D [4] | 0.625 | 0.719 | 0.734 | 0.757 | 0.712 | 7.04 | 6.37 | 5.81 | 6.98 | 6.55 |
| CoFInAl* [65] | 2024 | VST [29] | 0.809 | 0.806 | 0.804 | 0.810 | 0.807 | 5.07 | 5.19 | 6.37 | 6.30 | 5.73 |
| QTD* [10] | 2024 | VST [29] | 0.823 | 0.852 | 0.837 | 0.857 | 0.842 | 7.94 | 5.66 | 7.95 | 8.87 | 7.61 |
| MLAVL (Ours) | - | VST [29]+AST [14]+CLIP [36] | 0.826 | 0.829 | 0.871 | 0.866 | 0.849 | 5.57 | 4.20 | 4.11 | 3.99 | 4.47 |
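For readers who want to relate the numbers in Tables 1-3 to their definitions, the sketch below computes the reported metrics from predicted and ground-truth scores. It assumes the standard AQA formulations; in particular, normalizing by the benchmark's score range in `relative_l2` (the quantity reported as R-l2 ×100 in Table 4) is an assumption rather than a detail confirmed here, and the example scores are purely illustrative.

```python
# Minimal sketch of the evaluation metrics behind Tables 1-4 (assumed
# standard AQA definitions; the R-l2 normalization is an assumption).
import numpy as np
from scipy import stats

def spearman_corr(pred, gt):
    """Spearman rank correlation between predicted and ground-truth scores."""
    rho, _ = stats.spearmanr(pred, gt)
    return rho

def mse(pred, gt):
    """Mean squared error between predicted and ground-truth scores."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float(np.mean((pred - gt) ** 2))

def relative_l2(pred, gt, score_min, score_max):
    """Relative-l2 distance, normalized by the benchmark's score range
    (reported as R-l2 x100 in Table 4)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float(np.mean(((pred - gt) / (score_max - score_min)) ** 2))

if __name__ == "__main__":
    pred = [71.2, 65.4, 80.1, 58.9]       # illustrative predicted scores
    gt = [70.0, 66.5, 82.3, 57.0]         # illustrative ground-truth scores
    print(spearman_corr(pred, gt), mse(pred, gt),
          100 * relative_l2(pred, gt, score_min=0.0, score_max=100.0))
```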
6.3. Ablation Studies / Parameter Analysis
Effects of Multidimensional Language Guidance (MAG²) on LOGO Dataset
To specifically validate the effectiveness of MAG² (which introduces multidimensional action knowledge via language guidance), the authors plugged MAG² into nine existing visual-only methods on the LOGO dataset. Only action-visual graph guidance was used here, as LOGO is a visual-only dataset.
The following are the results from Table 4 of the original paper:
| Methods | Native | | +MAG² (Ours) | |
| | Sp. Corr. ↑ | R-l2 (×100) ↓ | Sp. Corr. ↑ | R-l2 (×100) ↓ |
| MS-LSTM [49] | 0.542 | 5.763 | 0.582↑7% | 4.916↓15% |
| USDL [39] | 0.762 | 2.556 | 0.804↑6% | 2.269↓11% |
| GDLT [48] | 0.647 | 4.148 | 0.654↑1% | 3.589↓13% |
| CoRe [56] | 0.697 | 5.620 | 0.723↑4% | 3.386↓40% |
| TPT [2] | 0.589 | 5.228 | 0.621↑5% | 3.130↓40% |
| HGCN [64] | 0.541 | 4.765 | 0.640↑18% | 3.698↓22% |
| T2CR [21] | 0.681 | 5.973 | 0.699↑3% | 4.809↓19% |
| CoFInAl [65] | 0.661 | 5.754 | 0.708↑7% | 3.950↓31% |
| QTD [10] | 0.698 | 4.948 | 0.729↑4% | 3.869↓22% |
Analysis: MAG² consistently enhances the performance of all tested visual-only methods. On average, it improves Sp. Corr. by 3.8% and reduces R-l2 (×100) by 1.238. This confirms that language-guided action knowledge is highly valuable, effectively introducing relevant semantic understanding even to methods that do not inherently use language or audio.
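To make the plug-and-play idea concrete, here is a hedged sketch of how language-encoded action knowledge could be injected into a visual-only AQA backbone: clip features cross-attend to text embeddings of a basic-action corpus, and the attended knowledge is added back residually. The class name `ActionKnowledgeGuidance`, the tensor shapes, and the single cross-attention layer are illustrative assumptions; the actual MAG² module builds multidimensional action graphs rather than this reduced form.

```python
# Hypothetical sketch of language-guided action knowledge injected into a
# visual-only AQA method; the real MAG^2 builds multidimensional action
# graphs, whereas this reduces the idea to a single cross-attention step.
import torch
import torch.nn as nn

class ActionKnowledgeGuidance(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats, action_text_emb):
        # clip_feats: (B, T, D) visual clip features from a frozen backbone
        # action_text_emb: (N, D) text embeddings of a basic-action corpus
        # (e.g., from a frozen CLIP text encoder)
        kv = action_text_emb.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        guided, _ = self.attn(query=clip_feats, key=kv, value=kv)
        # Residual injection keeps the original visual evidence intact.
        return self.norm(clip_feats + guided)

if __name__ == "__main__":
    B, T, N, D = 2, 68, 40, 512
    visual = torch.randn(B, T, D)   # e.g., Timesformer/VST clip features
    text = torch.randn(N, D)        # e.g., CLIP embeddings of action phrases
    print(ActionKnowledgeGuidance(D)(visual, text).shape)  # (2, 68, 512)
```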
Ablation Study of MLAVL Components
A comprehensive ablation study was conducted on FS1000 and RG datasets to evaluate the contribution of each proposed component. The baseline uses cross-attention for audio-visual fusion and a 2-layer MLP for score prediction.
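For orientation, the ablation baseline described above (cross-attention audio-visual fusion followed by a 2-layer MLP score head) might look roughly like the sketch below; the hidden sizes, temporal pooling, and feature dimensions are assumptions, not details taken from the paper.

```python
# Rough sketch of the ablation baseline: cross-attention audio-visual fusion
# followed by a 2-layer MLP score regressor (hidden sizes are assumptions).
import torch
import torch.nn as nn

class CrossAttnBaseline(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.regressor = nn.Sequential(          # 2-layer MLP score head
            nn.Linear(dim, dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, visual, audio):
        # visual: (B, T, D) clip features; audio: (B, T, D) audio features
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        pooled = fused.mean(dim=1)               # temporal average pooling
        return self.regressor(pooled).squeeze(-1)

if __name__ == "__main__":
    v, a = torch.randn(2, 68, 512), torch.randn(2, 68, 512)
    print(CrossAttnBaseline()(v, a).shape)       # torch.Size([2])
```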
The following are the results from Table 5 of the original paper:
| Settings | Sp. Corr. (↑) | | | MSE (↓) | | |
| | TES | PCS | RG-Avg. | TES | PCS | RG-Avg. |
| baseline | 0.835 | 0.825 | 0.736 | 81.43 | 9.94 | 7.72 |
| +S²CE | 0.848↑2% | 0.840↑2% | 0.757↑3% | 77.17↓5.2% | 8.67↓13% | 6.98↓10% |
| +MAG² | 0.876↑3% | 0.866↑3% | 0.801↑6% | 69.40↓10% | 7.67↓12% | 5.36↓23% |
| +AVCF (w/o DPG) | 0.887↑1% | 0.875↑1% | 0.818↑2% | 67.08↓3% | 7.09↓8% | 4.93↓8% |
| +DPG (Ours) | 0.917↑3% | 0.892↑2% | 0.849↑4% | 64.89↓3% | 6.39↓10% | 4.47↓9% |
| w/o S²CE | 0.891↓3% | 0.878↓2% | 0.821↓3% | 67.29↑7% | 6.46↑1% | 4.89↑9% |
| w/o MAG² | 0.879↓4% | 0.869↓3% | 0.802↓6% | 70.05↑8% | 7.53↑18% | 5.07↑13% |
| w/o AVCF | 0.894↓3% | 0.876↓2% | 0.817↓4% | 66.03↑2% | 6.62↑4% | 4.76↑6% |
| w/o LTL | 0.886↓3% | 0.871↓2% | 0.814↓4% | 67.83↑5% | 7.11↑11% | 4.90↑10% |
| w/o LCE | 0.894↓3% | 0.880↓1% | 0.827↓3% | 68.69↑6% | 7.39↑16% | 5.07↑13% |
| w/o LTL + LCE | 0.875↓5% | 0.867↓3% | 0.813↓4% | 68.87↑6% | 7.50↑17% | 5.21↑17% |
Analysis:

- **+S²CE**: The Shared-Specific Context Encoder provides initial performance boosts, particularly in MSE, by effectively combining modality-specific and modality-general information.
- **+MAG²**: The Multidimensional Action Graph Guidance module yields a substantial improvement in both Sp. Corr. and MSE, strongly indicating the importance of language in bridging audio-visual semantics and guiding the model's understanding of domain-based actions.
- **+AVCF**: The Audio-Visual Cross-modal Fusion module further improves performance by addressing both global and clip-wise matching of the audio and visual streams, aligning better with human scoring rules than generic cross-attention mechanisms.
- **+DPG**: The Dual-Branch Prompt-Guided Grading module provides the final boost (roughly 3% in Sp. Corr. and 7% in MSE on average over the previous step), highlighting that a rule-aligned assessment of visual performance and action-music matching is critical.
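The grading idea behind DPG can be illustrated with a small, hypothetical sketch: a pooled feature is compared against a set of grade prototypes (which in practice could be initialized from CLIP text embeddings of coarse-to-fine grade prompts), and the resulting similarity distribution is mapped to a score. The prompt handling, number of grades, and score mapping below are assumptions, not the paper's exact design.

```python
# Illustrative sketch of prompt-guided grading: compare a fused feature with
# grade prototypes and map the resulting distribution to a score. Grade count,
# prototype initialization, and score mapping are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGuidedGrading(nn.Module):
    def __init__(self, dim=512, num_grades=5):
        super().__init__()
        # Learnable grade prototypes; these could be initialized from CLIP
        # text embeddings of coarse-to-fine grade descriptions.
        self.grade_prototypes = nn.Parameter(torch.randn(num_grades, dim))
        # Evenly spaced grade values in [0, 1]; rescaled to the sport's range.
        self.register_buffer("grade_values",
                             torch.linspace(0.0, 1.0, num_grades))

    def forward(self, feat, score_range=(0.0, 100.0)):
        # feat: (B, D) pooled visual or audio-visual feature
        sim = F.normalize(feat, dim=-1) @ F.normalize(
            self.grade_prototypes, dim=-1).t()           # (B, num_grades)
        weights = sim.softmax(dim=-1)
        lo, hi = score_range
        return lo + (hi - lo) * (weights * self.grade_values).sum(dim=-1)

if __name__ == "__main__":
    head = PromptGuidedGrading()
    print(head(torch.randn(4, 512)).shape)               # torch.Size([4])
```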
Effects of Loss Functions
The bottom half of Table 5 analyzes the contributions of the proposed loss functions:
- **w/o LTL** (without Triplet Loss): Removing LTL leads to a drop in performance (e.g., 3% in Sp. Corr. and 5% in TES MSE compared to the full model). This loss is crucial for ensuring sufficient separation and discriminability between different grade patterns.
- **w/o LCE** (without Cross-Entropy Loss): Removing LCE also degrades performance (e.g., 3% in Sp. Corr. and 6% in TES MSE). This loss aligns grade patterns with their corresponding textual prompts, ensuring the model focuses on relevant semantics and refines the coarse-to-fine relations from visual actions to action-music alignment.
- **w/o LTL + LCE**: Removing both losses causes the largest drop (e.g., 5% in Sp. Corr. and 6% in TES MSE), demonstrating their combined importance in guiding the model toward an accurate, quality-aware score space, especially given the limited labeled data typical of sports assessment.
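As a rough illustration of how the two losses discussed above could be combined, the sketch below pairs a triplet loss that separates grade patterns with a CLIP-style cross-entropy that aligns features to their grade prompts; the margin, temperature, and label construction are assumptions, and the paper's exact formulation may differ.

```python
# Hedged sketch of the two auxiliary losses: a triplet loss that separates
# grade patterns and a cross-entropy loss that aligns features with their
# grade prompts. Margin, temperature, and label construction are assumptions.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull same-grade features together and push different grades apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def prompt_alignment_ce(features, prompt_emb, grade_labels, temperature=0.07):
    """Cross-entropy over feature-to-prompt similarities (CLIP-style)."""
    logits = F.normalize(features, dim=-1) @ F.normalize(prompt_emb, dim=-1).t()
    return F.cross_entropy(logits / temperature, grade_labels)

if __name__ == "__main__":
    B, D, G = 8, 512, 5
    feats = torch.randn(B, D)
    loss = (triplet_loss(feats, torch.randn(B, D), torch.randn(B, D))
            + prompt_alignment_ce(feats, torch.randn(G, D),
                                  torch.randint(0, G, (B,))))
    print(loss.item())
```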
Effects of Different Modalities
Figure 4 (SRCC bars and MSE line plots) illustrates the contribution of each modality combination.
- **Visual-only (V)**: Forms the base performance, indicating that visual information is the primary source for assessment.
- **Visual + Audio (V+A)**: Adding audio generally improves performance, underscoring its importance for better assessment, especially for action-music coordination.
- **Visual + Language (V+L)**: Introducing language alongside visual information consistently improves results, emphasizing the critical role of action knowledge guidance.
- **Visual + Audio + Language (V+A+L)**: Combining all three modalities yields the best results, showcasing the synergistic effect of language-guided audio-visual learning.
Figure 4. SRCC bars and MSE line plots for different modality combinations (Visual, Audio, and their combinations with Language), showing the effect of each setting on action scoring.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces MLAVL, a novel Multidimensional Language-guided Audio-Visual Learning framework designed for long-term sports assessment. The core innovation lies in its ability to reduce reliance on large model parameters by integrating low-cost language modality to explicitly guide the modeling of audio-action-visual correlations. This is achieved through several key contributions:
- **Action Knowledge Graphs**: Embedding domain-specific basic action corpora into audio-visual features through action knowledge graphs to provide explicit guidance.
- **Shared-Specific Context Encoder (S²CE)**: Enhancing multimodal features by fusing modality-specific and modality-general information.
- **Audio-Visual Cross-modal Fusion (AVCF)**: A module specifically designed to evaluate action-music consistency by attending to both global and clip-wise alignments.
- **Dual-Branch Prompt-Guided Grading (DPG)**: A rule-aligned grading mechanism that assesses both visual performance and audio-visual synchronization using coarse-to-fine textual prompts.

The MLAVL framework achieves new state-of-the-art results on four public long-term sports benchmarks (FS1000, Fis-V, Rhythmic Gymnastics, and LOGO) while maintaining a lower parameter count and computational cost than previous leading methods. Its MAG² module also demonstrates plug-and-play capability, significantly improving existing visual-only methods and highlighting the broader utility of language-guided action knowledge.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation:
- **Reliance on Rule-Based Consistency**: The effectiveness of some of MLAVL's designs, particularly those related to action-music consistency, relies heavily on the presence of explicit rule-based requirements in long-term sports. In specialized scenes where audio information is sparse, less useful, or entirely irrelevant to scoring, the performance of the audio-related components may be impacted.

For future work, the authors suggest:

- **Advanced Extraction Techniques**: Exploring more advanced techniques to better capture sparse cues from audio, ensuring robustness even when audio information is not directly or strongly correlated with actions. This could involve developing more sophisticated audio representations or fusion mechanisms that are less sensitive to noise or weak signals.
7.3. Personal Insights & Critique
Strengths:
- **Innovative Use of Language as Guidance**: The most compelling aspect of this paper is its effective use of the low-cost language modality to inject explicit domain-specific knowledge. This shifts the paradigm from purely data-driven implicit learning (which requires large models to capture weak correlations) to a knowledge-guided approach, which is particularly relevant for AQA, where human judgment is often rule-based.
- **Rule-Aligned Design**: The AVCF and DPG modules are thoughtfully designed to mimic how human judges score, considering both action quality and action-music synchronization at multiple temporal granularities. This interpretability and alignment with real-world criteria are significant advantages.
- **Efficiency**: Achieving SOTA results with fewer parameters and lower computational cost is a strong indicator of the model's practical utility and efficiency, especially for deployment in real-world applications.
- **Modularity and Generalizability**: The plug-and-play nature of MAG² demonstrates that language-guided action knowledge is beneficial even for visual-only tasks, suggesting broad applicability beyond the specific audio-visual sports assessment domain.

**Potential Issues, Unverified Assumptions, or Areas