Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
TL;DR Summary
This paper introduces a two-stream Mamba pyramid network for automated, rule-compliant figure skating assessment (TES/PCS) in long videos. It leverages separate visual-only (TES) and audio-visual (PCS) streams, utilizing Mamba for efficient long-range dependency modeling, and localizes and scores individual action elements to produce the TES.
Abstract
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works, without considering the prior evaluation criteria of figure skating. Secondly, action elements in competitions are separated in time; TES should be derived from each element's score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we propose effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
In-depth Reading
1. Bibliographic Information
- Title: Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
- Authors:
- Fengshun Wang (Capital University of Physical Education and Sports)
- Qiurui Wang (Capital University of Physical Education and Sports)
- Peilin Zhao (Shanghai Jiao Tong University)
- Journal/Conference: The paper does not explicitly state the publication venue, but the structure and content are typical of submissions to major computer vision or multimedia conferences (e.g., CVPR, ACM Multimedia, AAAI).
- Publication Year: The provided text does not specify a publication year, but based on the references (e.g., Mamba from 2024/2025), it is a very recent work, likely from 2024 or 2025.
- Abstract: The paper tackles the problem of automated figure skating scoring for both the Technical Element Score (TES) and Program Component Score (PCS). The authors identify three key challenges in existing methods: (1) improper fusion of audio-visual cues for both scores, contrary to judging rules; (2) failure to assess individual action elements for a cumulative TES; and (3) inefficiency in processing long videos. To solve this, they propose a two-stream Mamba pyramid network. One stream uses only visual features to localize and score individual technical elements for the TES. The second stream fuses visual and audio features using a multi-level mechanism to predict the PCS. The use of the Mamba architecture allows for efficient modeling of long-range dependencies. The model achieves state-of-the-art results on the FineFS benchmark and shows strong generalization to other datasets without retraining.
- Original Source Link: /files/papers/68e2427d304e42a3fca2e380/paper.pdf (The paper provides a GitHub link for the source code.)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Automated figure skating assessment requires accurately predicting two distinct scores: the Technical Element Score (TES), based on the execution of specific athletic moves, and the Program Component Score (PCS), based on artistic interpretation, including musicality.
- Existing Gaps: The paper highlights three critical flaws in prior research:
- Misaligned Modality Fusion: Previous models fuse video and audio features to predict both TES and PCS. However, official judging criteria stipulate that TES is purely visual (evaluating actions), while PCS incorporates artistry and musical interpretation (requiring audio). This is a fundamental mismatch.
- Lack of Granularity: Existing methods often predict a single, holistic TES for the entire performance. In reality, the TES is the sum of scores from discrete, individual action elements (jumps, spins, etc.) occurring at different times. A proper model should first locate and then score each element.
- Inefficiency with Long Videos: Figure skating routines can be several minutes long. Traditional models like RNNs struggle with long-range dependencies, while Transformers are computationally expensive (quadratic complexity) for long sequences.
- Innovation: The paper introduces a novel architecture that directly mirrors the human judging process. It separates the feature streams for TES and PCS and employs the highly efficient Mamba (a State Space Model) to handle long video sequences.
- Main Contributions / Findings (What):
- Judging-Aligned Two-Stream Network: The primary contribution is a two-stream architecture that processes modalities based on official rules: a visual-only stream for TES and a multi-modal (audio-visual) stream for PCS. This enforces a strong inductive bias that improves performance.
- Fine-Grained TES Assessment: The model introduces a multi-scale Mamba pyramid with a specialized TES Head. This component performs Temporal Action Localization (TAL) to identify the start/end times of each action element, classifies it, and regresses its quality score. The final TES is the sum of these individual scores, mimicking the real-world process.
- Efficient Long-Range Modeling with Mamba: The paper leverages Mamba's ability to capture long-range dependencies with linear computational complexity. This makes the model uniquely suited for analyzing entire, untrimmed figure skating performances efficiently. The model achieves state-of-the-art (SOTA) performance on the FineFS benchmark.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Action Quality Assessment (AQA): A subfield of computer vision focused on assigning a numerical score to a human action based on its execution quality. Unlike action recognition (what is the action?), AQA asks "how well" was the action performed.
- Figure Skating Scoring:
- Technical Element Score (TES): An objective score derived from the sum of scores for individual technical elements like jumps, spins, and step sequences. It is based purely on the visual execution of these moves.
- Program Component Score (PCS): A more subjective score evaluating the overall artistry of the performance across five criteria: Skating Skills, Transitions, Performance, Composition, and Interpretation of the Music. The last criterion explicitly links performance to the accompanying music.
- Temporal Action Localization (TAL): The task of identifying the start and end times of action instances within a long, untrimmed video. This is a prerequisite for the fine-grained TES assessment proposed in the paper.
- Multi-modal Learning: An approach that uses data from multiple sources or modalities (e.g., video frames, audio waves, text) to make predictions. The core idea is that different modalities provide complementary information.
- Mamba / State Space Models (SSMs): A recent class of sequence models that has emerged as a powerful alternative to Transformers. They model long-range dependencies by summarizing the past into a compact "state" vector. Their key advantage is linear time complexity (O(N)) with respect to sequence length, unlike Transformers' quadratic complexity (O(N²)), making them highly efficient for long sequences like videos; see the sketch below.
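To make the complexity contrast concrete, here is a minimal NumPy sketch of an SSM-style recurrence. The diagonal parameterization and all names are illustrative simplifications (not Mamba's actual selective-scan implementation): the point is that one pass updates a fixed-size state, so cost grows linearly with sequence length, whereas attention compares every pair of positions.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Toy state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C.h_t.
    A, B, C are fixed diagonal parameters here; a real Mamba layer makes
    them input-dependent and vectorizes over channels."""
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):   # single O(N) pass over the sequence
        h = A * h + B * x_t       # compact state summarizes the entire past
        y[t] = C @ h              # readout from the current state
    return y

# Toy usage: a 4-dimensional state over a 10-step scalar sequence.
rng = np.random.default_rng(0)
out = ssm_scan(A=np.full(4, 0.9), B=rng.normal(size=4),
               C=rng.normal(size=4), x=rng.normal(size=10))
```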
- Previous Works:
- General AQA: Early methods focused on regressing a score from a global video representation. More recent works use graph neural networks to model body joints ([15, 24]) or incorporate domain-specific scoring rubrics ([21, 23, 29]).
- Figure Skating AQA: Prior methods like [16] used skeleton data with ST-GCNs. LUSD-Net [12] attempted to disentangle TES and PCS representations and used weakly supervised localization, but its localization was not an end-to-end trainable part of the framework, and it still predicted a total TES rather than per-element scores.
- Multi-modal AQA: Works like Skating-Mixer [28] and Semantics-Guided Network (SGN) [8] have shown the benefit of using audio-visual data. However, as the authors argue, these models indiscriminately fuse modalities for both TES and PCS, which is flawed. The proposed work corrects this by using a principled, two-stream approach.
- Differentiation: This paper's novelty comes from its principled architectural design that is directly inspired by the official judging rules. While previous works used multi-modal learning or TAL, none combined them in a way that respects the fundamental separation of TES and PCS evaluation criteria. Furthermore, it is one of the first to apply the highly efficient Mamba architecture to this specific AQA task, solving the long-video problem more effectively than Transformer-based methods.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Two-Stream Mamba Pyramid Network. The architecture is illustrated in Image 1.

As shown, the model consists of a top visual-only stream for TES and a bottom audio-visual stream for PCS.
- Problem Formulation:
- The goal is to analyze a figure skating video and its audio.
- For TES, the model must identify each action segment $(t_i^s, t_i^e, c_i, s_i)$, where:
- $t_i^s$ and $t_i^e$ are the start and end times.
- $c_i$ is the action class label (e.g., Triple Axel).
- $s_i$ is the quality score for that specific action; the final TES is the sum of the $s_i$ (a worked example follows this list).
- For PCS, the model must predict a single overall score $S_{pcs}$.
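As a toy illustration of this formulation (all numbers and element names below are made up), the final TES is simply the sum of the per-element quality scores:

```python
# Hypothetical localized elements: (start_s, end_s, class_label, quality_score).
elements = [
    (12.4, 15.1, "triple_axel", 9.1),
    (48.0, 51.2, "flying_camel_spin", 7.4),
    (93.5, 121.0, "step_sequence", 8.8),
]
tes = sum(score for *_, score in elements)  # TES = sum of element scores
print(f"Predicted TES: {tes:.2f}")          # -> Predicted TES: 25.30
```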
- Feature Extraction:
- Video: A pre-trained I3D network extracts frame-level features.
- Audio: A pre-trained VGGish network extracts audio features.
- Both feature sequences are projected into a common embedding space, resulting in features $F \in \mathbb{R}^{T \times D}$, where $T$ is the temporal length and $D$ is the feature dimension.
- Temporal Hierarchical Feature Encoder (THFE):
- This module processes the initial video and audio feature sequences separately. Its goal is to capture temporal patterns while maintaining the original temporal resolution.
- It consists of two parts (see the sketch below):
- Temporal Embedding Module (TEM): A series of 1D convolutional layers that creates a richer temporal embedding.
- Temporal Refinement Module (TRM): A series of Masked Mamba Blocks that further refines the temporal features, capturing dependencies across the sequence.
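A minimal PyTorch sketch of this encoder layout, assuming I3D features of dimension 1024 and an embedding size of 256 (both assumptions, not the paper's stated configuration). A plain convolution stands in for the Masked Mamba Blocks of the TRM:

```python
import torch
import torch.nn as nn

class THFE(nn.Module):
    """Sketch of the Temporal Hierarchical Feature Encoder: a conv-based
    Temporal Embedding Module followed by a refinement stage. The paper's
    TRM uses Masked Mamba Blocks; a Mamba layer (e.g., from the mamba_ssm
    package) would replace `self.trm` here."""

    def __init__(self, in_dim=1024, embed_dim=256, n_convs=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_convs):  # TEM: stacked 1D convolutions over time
            layers += [nn.Conv1d(d, embed_dim, kernel_size=3, padding=1), nn.ReLU()]
            d = embed_dim
        self.tem = nn.Sequential(*layers)
        # Placeholder refinement; swap in Masked Mamba Blocks per the paper.
        self.trm = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)

    def forward(self, x):                 # x: (batch, T, in_dim)
        z = self.tem(x.transpose(1, 2))   # convolve over time: (B, C, T)
        z = z + self.trm(z)               # residual refinement, resolution kept
        return z.transpose(1, 2)          # back to (batch, T, embed_dim)

feats = THFE()(torch.randn(2, 128, 1024))  # e.g., 128 I3D feature steps
```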
- Multi-scale Mamba Pyramid (MMP):
- This is the core component for handling actions of varying durations. It processes the features from the THFE to create a feature pyramid, similar to how feature pyramids are used in object detection to handle objects of different sizes.
- Mamba Down Sampling (MDS) Block: This block is the building unit of the pyramid. It combines a Mamba layer for feature extraction, a residual connection (DropPath), and a max pooling layer for downsampling in a single operation.
- By stacking multiple MDS blocks, the model generates features at different temporal resolutions (e.g., T, T/2, T/4, ...), allowing it to detect both short (e.g., a quick jump) and long (e.g., a step sequence) actions. A minimal sketch follows.
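A hedged sketch of this pyramid construction. A convolution again stands in for the Mamba layer, and a plain identity residual replaces DropPath; dimensions are assumptions:

```python
import torch
import torch.nn as nn

class MDSBlock(nn.Module):
    """Sketch of a Mamba Down Sampling block: sequence mixing + residual
    (DropPath in the paper, plain addition here) + temporal max pooling."""

    def __init__(self, dim):
        super().__init__()
        self.mixer = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # Mamba stand-in
        self.pool = nn.MaxPool1d(kernel_size=2)                     # halve T

    def forward(self, x):                    # x: (B, dim, T)
        return self.pool(x + self.mixer(x))  # residual, then downsample

# Build a pyramid at resolutions T, T/2, T/4, T/8 by stacking blocks.
x = torch.randn(2, 256, 128)
pyramid, blocks = [x], nn.ModuleList(MDSBlock(256) for _ in range(3))
for blk in blocks:
    pyramid.append(blk(pyramid[-1]))         # levels: 128, 64, 32, 16 steps
```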
- Multi-level Cross Attention Fusion (MCAF):
- This module connects the two streams. Critically, it is a one-way fusion: information flows from the audio stream to the video stream only within the PCS branch. The TES stream remains purely visual.
- At each level of the pyramid, it uses cross-attention to fuse audio and video features. Image 2(b) shows this mechanism.
- The video features from the PCS stream act as the Query (Q), while the audio features act as the Key (K) and Value (V). The attention mechanism is standard scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$.
- The output is added back to the original video features via a residual connection to create the fused features for PCS prediction: $\tilde{F}_v = F_v + \text{Attention}(Q, K, V)$. A code sketch of this one-way fusion follows.
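A minimal sketch of the fusion at a single pyramid level, using PyTorch's built-in multi-head attention as a stand-in for the paper's MCAF module (dimensions and head count are assumptions):

```python
import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

video = torch.randn(2, 64, d)   # one pyramid level of visual features
audio = torch.randn(2, 64, d)   # matching level of VGGish-derived features

# Video queries the audio stream (Q = video, K = V = audio); a residual
# connection keeps the visual temporal structure intact for PCS prediction.
fused, _ = attn(query=video, key=audio, value=audio)
pcs_feats = video + fused
```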
- Score Regression:
- TES Head: Applied to each level of the visual-only feature pyramid. As shown in Image 2(a), it is a simple head made of 1D convolutions that branches into three outputs at each time point (sketched below):
- Action Categories: Predicts the probability of each action class.
- Temporal Offsets: Predicts the distance from the current time point to the start and end of the action.
- Action Scores: Predicts the quality score of the action at that time point.
- PCS Head: Applied to the final, most compressed level of the fused audio-visual feature pyramid. It uses 1D convolutions followed by an average pooling layer to regress a single PCS score for the entire video.
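A sketch of such a three-branch head. The 22-class setting follows the ablation study, but layer sizes and the exact configuration are assumptions:

```python
import torch
import torch.nn as nn

class TESHead(nn.Module):
    """Sketch of the per-level TES head: shared 1D convs branching into
    class logits, start/end offsets, and a quality score per time step."""

    def __init__(self, dim=256, n_classes=22):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU())
        self.cls = nn.Conv1d(dim, n_classes, 1)  # action category logits
        self.reg = nn.Conv1d(dim, 2, 1)          # offsets to start / end
        self.score = nn.Conv1d(dim, 1, 1)        # element quality score

    def forward(self, x):                         # x: (B, dim, T)
        z = self.stem(x)
        return self.cls(z), self.reg(z), self.score(z)

logits, offsets, scores = TESHead()(torch.randn(2, 256, 64))
```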
- Optimization:
- Label Generation: The model is trained using point-based supervision. For each time point inside a ground-truth action segment, targets are generated as illustrated in Image 3 (a sketch follows this list):
- Classification Target: A one-hot vector indicating the action class.
- Regression Target: A 2D vector representing the relative offsets to the action's start and end times.
- Score Target: The ground-truth score of that action element.
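A hedged sketch of this point-based target generation (the frame-level granularity, function name, and tensor layout are assumptions for illustration):

```python
import torch

def point_targets(T, segments, n_classes):
    """Every time step inside a ground-truth segment gets a one-hot class
    target, (start, end) offsets, and the element score. `segments` holds
    (start, end, class_idx, score) tuples in frame units."""
    cls_t = torch.zeros(T, n_classes)
    reg_t = torch.zeros(T, 2)
    score_t = torch.zeros(T)
    pos = torch.zeros(T, dtype=torch.bool)
    for s, e, c, q in segments:
        for t in range(s, e + 1):
            cls_t[t, c] = 1.0                                   # one-hot class
            reg_t[t] = torch.tensor([t - s, e - t], dtype=torch.float)  # offsets
            score_t[t] = q                                      # element score
            pos[t] = True                                       # positive sample
    return cls_t, reg_t, score_t, pos

targets = point_targets(T=100, segments=[(10, 20, 3, 8.5)], n_classes=22)
```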
- Loss Function: A multi-task loss combines the losses for all tasks: $\mathcal{L} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{reg}\mathcal{L}_{reg} + \mathcal{L}_{es} + \mathcal{L}_{pcs}$.
- $\mathcal{L}_{cls}$: Focal Loss for action classification, which helps with class imbalance.
- $\mathcal{L}_{reg}$: DIoU Loss for temporal offset regression, which directly optimizes the overlap between predicted and true segments.
- $\mathcal{L}_{es}$: Mean Squared Error (MSE) loss for the individual action element scores.
- $\mathcal{L}_{pcs}$: MSE loss for the final PCS score.
- $\lambda_{cls}$ and $\lambda_{reg}$ are weighting coefficients. The regression and element-score losses are only computed for positive samples (time points inside a ground-truth action). A hedged sketch follows.
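A sketch of the combined objective. Plain cross-entropy and L1 stand in for the paper's Focal and DIoU losses, and the weights and function signature are illustrative, not the paper's actual values:

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, reg_pred, reg_tgt,
               es_pred, es_tgt, pcs_pred, pcs_tgt, pos_mask,
               lam_cls=1.0, lam_reg=0.5):
    """Multi-task objective sketch; pos_mask selects time points inside
    ground-truth actions, as only those contribute to reg/element losses."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)          # Focal in paper
    l_reg = F.l1_loss(reg_pred[pos_mask], reg_tgt[pos_mask])  # DIoU in paper
    l_es = F.mse_loss(es_pred[pos_mask], es_tgt[pos_mask])    # element scores
    l_pcs = F.mse_loss(pcs_pred, pcs_tgt)                     # overall PCS
    return lam_cls * l_cls + lam_reg * l_reg + l_es + l_pcs
```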
5. Experimental Setup
- Datasets:
- FineFS: The primary dataset used for training and evaluation. It is fine-grained, containing 1604 high-resolution videos with precise annotations for action categories, temporal segments (start/end times), and scores for each element.
- Fis-V: A dataset of 500 ladies' singles short programs, used for zero-shot evaluation (testing without training).
- FS1000: A larger dataset with 1604 videos from various disciplines, also used for zero-shot evaluation.
- Evaluation Metrics:
- Spearman's Rank Correlation Coefficient (ρ): The primary metric for scoring performance. It measures the monotonic relationship between the ranked predicted scores and the ranked ground-truth scores; a value closer to 1 indicates better performance. It is more robust to the absolute scale of scores than MSE and better reflects whether the model can correctly rank performances. A usage example follows this list.
- mean Average Precision (mAP): Used to evaluate Temporal Action Localization performance, calculated at various temporal Intersection over Union (tIoU) thresholds (from 0.5 to 0.95).
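For instance, Spearman's ρ can be computed with SciPy (the scores below are made up; here the predicted and true rankings match exactly, so ρ = 1.0):

```python
from scipy.stats import spearmanr

predicted = [78.2, 65.4, 91.0, 70.3]     # model scores for four performances
ground_truth = [80.1, 63.0, 95.5, 72.2]  # judges' scores
rho, _ = spearmanr(predicted, ground_truth)  # compares ranks, not raw values
print(f"Spearman rho = {rho:.2f}")           # -> Spearman rho = 1.00
```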
- Baselines: The paper compares against several SOTA methods, including GDLT, MS-LSTM, TSA, and LUSD-Net on FineFS, and additional methods like Action-Net, Skating-Mixer, Semantic-Guide, and PAMFN on the other datasets.
6. Results & Analysis
- Core Results:
- On FineFS (Table 1): The proposed model significantly outperforms all previous state-of-the-art methods on the FineFS dataset. For Free Skating, it achieves a TES correlation (ρ) of 0.80 and a PCS correlation of 0.96; for Short Program, 0.75 and 0.94, respectively. The near-perfect PCS correlations (0.96 and 0.94) strongly validate the effectiveness of the audio-visual fusion strategy, while the high TES correlation demonstrates the success of the fine-grained localization and scoring approach.
- Zero-Shot Generalization (Table 2): When tested on Fis-V and FS1000 without any further training, the model remains highly competitive. It achieves the best PCS score on FS1000 (0.91) and is among the top performers for TES on both datasets, demonstrating excellent robustness and transferability.
- Ablations / Parameter Sensitivity: The authors conduct extensive ablation studies to validate each design choice.
- Number of Action Categories (Table 3): Using 22 categories provides the best trade-off. Fewer categories (4 or 8) give better localization (higher mAP) but worse TES scores, as they are too coarse; too many categories (242) hurt localization significantly. The 22-category setup yields the best TES (0.77) and PCS (0.95) correlations.
- Temporal Encoder Structure (Table 4): Replacing Mamba blocks with traditional CNN blocks causes a drastic drop across all metrics, even at a similar parameter count. Mamba achieves a TES score of 0.77 versus the CNN's 0.71, highlighting Mamba's superior temporal modeling.
- Pyramid Levels and Regression Range (Table 5): A 6-level pyramid performs best, showing that a deep, multi-scale representation is crucial for capturing actions of all durations present in the data.
- Audio Integration Strategies (Table 6): This is the most critical ablation.
- w/o Audio: Removing audio hurts PCS (0.89 vs. 0.95), but TES remains good (0.75), confirming audio matters mainly for PCS.
- Symmetrical Fusion: Fusing audio into the TES stream hurts TES performance (0.73 vs. 0.77), suggesting audio is irrelevant or even harmful for technical scoring.
- One Stream Fusion: Using a single fused stream for both tasks severely degrades both localization and scoring, demonstrating the necessity of separate streams.
- Two Stream Fusion (Proposed): The proposed method yields the best results for both TES (0.77) and PCS (0.95), validating the paper's core hypothesis.
- MCAF Fusion Levels (Table 7): Fusing at all 6 levels of the pyramid gives the best PCS score (0.9526), indicating that integrating audio-visual context at multiple temporal scales is beneficial.
- Query Type in Cross-Attention (Table 8): Using video as the query and audio as the key/value (0.9526) works better than the reverse (0.9359). This suggests it is more effective to use the visually driven temporal structure to "query" for relevant auditory information.
- Loss Weights (Table 9): Assigning a higher weight to the classification loss ($\lambda_{cls}$) than to the localization loss ($\lambda_{reg}$) yields the best overall performance.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents a novel two-stream Mamba pyramid network for figure skating assessment. The key innovation is an architecture that aligns with real-world judging criteria by separating the visual-only TES evaluation from the audio-visual PCS evaluation. The multi-scale Mamba pyramid allows fine-grained localization and scoring of individual action elements, while Mamba's efficiency makes it well suited to long videos. The model establishes a new state-of-the-art on the FineFS benchmark and shows strong generalization.
- Limitations & Future Work:
- Reliance on Pre-trained Extractors: The model's performance is dependent on the quality of features from I3D and VGGish. An end-to-end model trained from raw pixels and audio might capture more domain-specific features.
- Interpretability: While the model provides scores for individual elements, it doesn't explain why an element received a certain score (e.g., "under-rotated jump" or "poor landing"). Future work could focus on generating textual feedback or highlighting visual evidence for the scores.
- Dataset Bias: The model is trained on one primary dataset (FineFS). Although it generalizes well, its performance might vary on competitions with different camera angles, lighting conditions, or judging standards not represented in the training data.
- Personal Insights & Critique:
- This is a strong, well-executed piece of research. The central idea of designing the network architecture to mirror the human-defined, rule-based process is both simple and powerful. It moves beyond generic deep learning approaches to incorporate crucial domain knowledge.
- The comprehensive ablation studies are a major strength, providing convincing evidence for each architectural choice, especially the audio integration strategy.
- The adoption of Mamba is timely and appropriate for the problem, effectively addressing the long-standing challenge of modeling long video sequences in AQA without the computational burden of Transformers.
- The work has significant practical potential for assisting judges, providing objective feedback to athletes and coaches, and enhancing the viewing experience for audiences by providing real-time analytics. It represents a significant step forward in building truly useful and reliable automated sports analytics systems.