Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
TL;DR Summary
This paper proposes decoupling static and motion perception in referring video segmentation, using expression-decoupling and hierarchical motion modules with contrastive learning, achieving state-of-the-art results on multiple datasets.
Abstract
Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video-level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues perform their distinct role, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable **9.2%** improvement on the challenging MeViS dataset. Code is available at https://github.com/heshuting555/DsHmp.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- Authors: Shuting He (Nanyang Technological University), Henghui Ding (Nanyang Technological University, Fudan University)
- Journal/Conference: Published at the Conference on Computer Vision and Pattern Recognition (CVPR) 2024. CVPR is a premier, top-tier international conference in the field of computer vision, known for presenting significant new research.
- Publication Year: 2024
- Abstract: The paper addresses the task of referring video segmentation, where a natural language expression is used to segment a specific object in a video. The authors argue that previous methods are suboptimal because they process the entire sentence as a single unit, mixing static appearance cues (e.g., "the girl in red") with temporal motion cues (e.g., "jumping"). This mix prevents the model from effectively understanding motion, as image-level features are not suited for it, and static cues can sometimes distract the model from focusing on motion. To solve this, the authors propose a new model,
DsHmp, which decouples the problem into two parts: static perception and motion perception. First, an expression-decoupling module separates the language expression into static and motion-related words. Second, a hierarchical motion perception module is introduced to understand motion at various timescales (from brief actions to long-term movements). Finally, contrastive learning is used to help the model distinguish between visually similar objects that have different motions. The proposed method achieves state-of-the-art results on five standard datasets, including a very large 9.2% improvement on the motion-focused MeViS dataset.
- Original Source Link:
  - Official Source: https://arxiv.org/abs/2404.03645
  - The paper is available as a preprint on arXiv and was accepted for publication at CVPR 2024.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The task of Referring Video Segmentation (RVS) requires a model to identify and segment a specific object in a video based on a textual description. These descriptions often contain crucial motion-related clues (e.g., "the person walking away").
- Existing Gaps: Previous state-of-the-art models typically convert the entire descriptive sentence into a single feature vector (a sentence embedding). This single embedding is then used to guide the segmentation process across all video frames. The authors identify a fundamental flaw in this approach: it conflates static cues (like color and shape) with motion cues (like actions and movements). Image-level feature extractors are good at understanding static appearance but struggle to comprehend motion from a static sentence embedding. Conversely, focusing on static features can "overshadow" or interfere with the model's ability to perceive the more subtle temporal motion patterns.
- Fresh Angle: The paper's core innovation is to decouple this process. Instead of treating the sentence as one monolithic guide, the authors propose splitting it into "what the object looks like" (static cues) and "what the object is doing" (motion cues). These two sets of cues are then used by specialized parts of the model, allowing each to focus on its distinct role: static perception for identifying candidate objects in individual frames and motion perception for pinpointing the correct target based on its temporal behavior.
- Main Contributions / Findings (What):
-
Decoupled Static and Motion Perception: The paper introduces a novel framework that separates the understanding of a referring expression into two streams. An expression-decoupling module first parses the sentence to isolate static words (nouns, adjectives) from motion words (verbs, adverbs). The static cues guide an image-level segmenter to find potential candidates in each frame, while the motion cues are used later to identify the true target based on its movement across frames.
-
Hierarchical Motion Perception (HMP) Module: To effectively understand complex motions that can occur over different time scales (e.g., a quick "jump" vs. a continuous "walk"), the paper proposes the HMP module. This module progressively analyzes object movements, starting with short-term actions and building up to an understanding of long-term temporal patterns, mimicking how humans might process a video.
-
Motion-Focused Contrastive Learning: To address the challenging scenario of multiple visually similar objects (e.g., two identical-looking sheep, one running and one standing), the paper employs contrastive learning. This technique explicitly trains the model to produce distinct feature representations for objects with different motions, even if their appearance is nearly identical. This is enhanced by a memory bank to provide a rich set of examples for comparison.
- State-of-the-Art Performance: The resulting model, DsHmp, sets a new state-of-the-art on five benchmark datasets. The most striking result is a massive 9.2% improvement in the primary J&F metric on the MeViS dataset, which is specifically designed to test motion understanding.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Referring Video/Image Segmentation (RVS/RIS): These are multi-modal tasks combining computer vision and natural language processing. The goal is to segment (i.e., create a pixel-perfect mask for) an object in a video or image that is being referred to by a given text description.
- Transformer Architecture: A neural network architecture that relies heavily on a mechanism called self-attention. It has become dominant in both NLP and vision. In this paper, it's used to process language, fuse vision and language features, and model temporal relationships between video frames. Cross-attention is a variant used to combine information from two different sources, such as injecting language cues into visual queries.
- Mask2Former: A powerful, universal image segmentation model. It works by predicting a set of "object queries" and corresponding masks. Instead of classifying every pixel, it proposes a small number of potential objects and segments them. This paper uses it as a backbone for the static perception part to generate candidate object masks and features in each frame.
- Contrastive Learning: A type of self-supervised learning where the model learns to create an embedding space. In this space, representations of "similar" (positive) samples are pulled closer together, while representations of "dissimilar" (negative) samples are pushed further apart. This is useful for learning discriminative features.
- Hungarian Algorithm: An optimization algorithm that solves the assignment problem. In this context, it is used to match object detections across consecutive video frames to create object tracks, ensuring that the token for "person A" in frame 1 is correctly associated with the token for "person A" in frame 2.
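For intuition, here is a toy example of that assignment step using SciPy's Hungarian solver on a made-up feature-similarity cost; this is an illustrative sketch, not the paper's exact matching cost or code.

```python
# Toy example: match object tokens of frame t to frame t+1 by maximizing
# feature similarity (scipy's Hungarian solver minimizes a cost matrix).
import numpy as np
from scipy.optimize import linear_sum_assignment

tokens_t = np.random.randn(5, 256)    # 5 candidate object tokens in frame t
tokens_t1 = np.random.randn(5, 256)   # 5 candidate object tokens in frame t+1

cost = -tokens_t @ tokens_t1.T        # negative similarity = cost to minimize
row, col = linear_sum_assignment(cost)
# row[i] in frame t is matched to col[i] in frame t+1, forming object tracks.
print(list(zip(row.tolist(), col.tolist())))
```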
- Previous Works & Technological Evolution:
- Early RVS methods often adapted techniques from Referring Image Segmentation (RIS). They would process each video frame independently to find the target and then use post-processing to link the segmentations together, largely ignoring the rich temporal and motion information available in videos.
- Later works like ReferFormer and MTTR introduced Transformer-based, end-to-end pipelines. These models significantly improved performance by processing multiple frames at once. However, they still simplified the language guidance by encoding the entire sentence into a single vector, which was then replicated and used to query the video features. This is the "mixing" problem that the current paper identifies.
- The LMPM model, which serves as the direct baseline for this work, also used a Transformer architecture but began to focus more on motion by using object tokens for temporal learning. However, it still treated all frames uniformly and didn't have a sophisticated mechanism for handling motions at different timescales.
- The creation of the MeViS dataset was a turning point, as it specifically included complex scenes and language expressions centered on motion. This highlighted the shortcomings of existing models in motion understanding and motivated research like this paper.
- Differentiation:
  - DsHmp vs. ReferFormer/LMPM: The key difference is the explicit decoupling of language cues. While ReferFormer and LMPM use a single sentence embedding for everything, DsHmp intelligently separates the sentence into static and motion components. It uses static cues for an image-level segmentation task and motion cues for a temporal-level identification task.
  - DsHmp vs. LMPM (temporal modeling): While LMPM performs temporal learning over object tokens, it treats the entire video clip uniformly. DsHmp introduces a hierarchical approach (the HMP module), which is more sophisticated. It starts by analyzing small time windows and progressively merges them to understand longer, more complex actions.
  - DsHmp vs. others (distinguishing similar objects): DsHmp is one of the first works in this area to explicitly use contrastive learning to improve the model's ability to differentiate objects based only on their motion, which is a critical and challenging sub-problem.
4. Methodology (Core Technology & Implementation)
The proposed method, Decoupled Static and Hierarchical Motion Perception (DsHmp), is illustrated in the overall architecture diagram below.
Figure (translated caption): Overview of the proposed model architecture, showing the pipeline of the static perception and hierarchical motion perception modules. The input text is encoded and passed through the expression-decoupling module, which extracts static cues and motion cues separately; the motion cues then go through the hierarchical motion perception module to capture motion information at different timescales, combined with contrastive learning to better distinguish the dynamics of visually similar objects, finally producing the video segmentation masks.
The model's pipeline can be broken down into the following key stages:
4.1. Decoupling Motion and Static Perception
This is the first and most foundational contribution. The goal is to stop treating the guiding sentence as a single, monolithic piece of information.
- Expression Decoupling: Given an input sentence (e.g., "Bird standing on hand, then flying away"), an external NLP tool is used to parse it. (A minimal parsing sketch is given below.)
  - Static Cues: Nouns, adjectives, and prepositions are extracted (e.g., "bird", "on", "hand"), and their features are combined to form the static cue feature.
  - Motion Cues: Verbs and adverbs are extracted (e.g., "standing", "flying away"), and their features are combined to form the motion cue feature.
  - To retain the overall context, the embedding of the full sentence, F_Se, is added to both the static and the motion cue features.
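The paper only states that an external NLP tool performs this parsing; as an illustrative sketch (an assumption, not the authors' implementation), the split can be approximated with an off-the-shelf POS tagger such as spaCy, routing nouns/adjectives/prepositions to the static cues and verbs/adverbs to the motion cues:

```python
# Hedged sketch of expression decoupling with spaCy. The paper's actual NLP
# tool and its POS-to-cue mapping may differ from this illustration.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

STATIC_POS = {"NOUN", "PROPN", "ADJ", "ADP", "DET", "NUM"}  # appearance/relation words
MOTION_POS = {"VERB", "ADV", "PART"}                        # action/movement words

def decouple_expression(sentence: str):
    doc = nlp(sentence)
    static_words = [tok.text for tok in doc if tok.pos_ in STATIC_POS]
    motion_words = [tok.text for tok in doc if tok.pos_ in MOTION_POS]
    return static_words, motion_words

static_w, motion_w = decouple_expression("Bird standing on hand, then flying away")
print(static_w)  # e.g. nouns/prepositions such as 'Bird', 'on', 'hand'
print(motion_w)  # e.g. verbs/adverbs such as 'standing', 'flying', 'away'
```

The extracted word lists would then be embedded by the text encoder to form the static and motion cue features described above.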
- Specialized Queries: Instead of using one type of query, the model prepares two distinct sets of queries for the different perception tasks. (A hedged cross-attention sketch follows below.)
  - Static Query Generation: A set of learnable static queries is enriched with the static cues using cross-attention, producing static-aware queries hat(Q)_s.
    - hat(Q)_s: the final static queries, which now carry knowledge of the static appearance of the target.
    - The initial learnable static queries capture general object properties learned from the data, while the feature vectors of the static words from the sentence provide the cross-attended context; queries and word features share a common feature dimension.
    - hat(Q)_s is then fed into Mask2Former to identify and segment all potential candidate objects in each frame.
  - Motion Query Generation: Similarly, a set of learnable motion queries is combined with the motion cues to produce motion-aware queries hat(Q)_m.
    - hat(Q)_m: the final motion queries, now primed to look for the specific motion described in the sentence; here the feature vectors of the motion words provide the cross-attended context.
    - hat(Q)_m is used later in the Motion Decoder to select the correct object from the candidates based on its motion.
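As a rough, hedged illustration of how learnable queries can be enriched with the decoupled cues (standard PyTorch layers; names such as CueAwareQueries, num_queries, and dim are placeholders, and the paper's actual module may differ):

```python
# Hedged sketch: enrich learnable queries with decoupled language cues via
# cross-attention. Standard PyTorch layers; not the paper's exact design.
import torch
import torch.nn as nn

class CueAwareQueries(nn.Module):
    def __init__(self, num_queries: int = 20, dim: int = 256, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cue_feats: torch.Tensor) -> torch.Tensor:
        # cue_feats: (B, L, dim) word features of the static or motion cues
        B = cue_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)          # (B, N, dim)
        attended, _ = self.cross_attn(q, cue_feats, cue_feats)   # queries attend to cue words
        return self.norm(q + attended)                           # cue-aware queries, e.g. hat(Q)_s

# The same module type would be instantiated twice: once for static cues
# (feeding Mask2Former) and once for motion cues (feeding the Motion Decoder).
static_queries = CueAwareQueries()(torch.randn(2, 5, 256))
```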
4.2. Hierarchical Motion Perception (HMP)
After the static perception module (Mask2Former) generates candidate object tokens for each frame, the HMP module processes them to understand their motion over time.
Figure (translated caption): Diagram of the Hierarchical Motion Perception (HMP) module, showing its structure and workflow, which include hierarchical cross-attention, temporal self-attention, and a feed-forward network. The right-hand side details the multi-level frame fusion and weighting computation.
The process is as follows:
- Object Tracking: The Hungarian algorithm is used to match object tokens between adjacent frames. This creates object trajectories, where each trajectory tilde(O)_i is a sequence of tokens representing a single object instance across frames.
- Hierarchical Processing: The HMP module consists of cascaded blocks. Each block performs a cycle of attention and merging to build a multi-timescale understanding of motion (see the sketch after this list). Within each stage of the hierarchy:
  - Motion Relevance Highlighting: For each object trajectory, the model calculates an attention map showing how relevant each frame's token is to the motion cues.
  - Motion Feature Injection: The object features are updated by adding the motion cues, weighted by their relevance. This relevance acts as a frame-level importance score, signifying how much each frame contributes to the described motion.
  - Token Merging: To move up the hierarchy (from short-term to long-term), adjacent frame tokens are merged into a single token. The merging is a weighted average whose weights are the frame importance scores, so frames more relevant to the motion have a greater influence on the merged representation. Each merging step halves the temporal dimension, and repeating it across the cascaded blocks creates a pyramid-like structure whose top level is a long-term summary of the object's motion.
- Final Motion-Aware Tokens: The final output of the HMP module, which now contains rich multi-timescale motion information, is passed to a Motion Decoder. The decoder uses the motion queries hat(Q)_m to attend to these motion-aware tokens and produce the final video tokens corresponding to the target object.
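A minimal sketch of one merging stage is given below: each frame token is scored against the motion cue, the score injects the cue into the token, and adjacent frames are merged by a relevance-weighted average that halves the temporal length. Shapes, names, and the scoring function are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of one hierarchical merging stage (illustrative only).
import torch

def merge_stage(track: torch.Tensor, motion_cue: torch.Tensor) -> torch.Tensor:
    """
    track:      (T, C) tokens of one object trajectory over T frames
    motion_cue: (C,)   pooled feature of the motion words
    returns:    (T // 2, C) trajectory at the next, coarser timescale
    """
    # 1) Frame-level relevance of each token to the described motion.
    weights = torch.sigmoid(track @ motion_cue)               # (T,), importance in [0, 1]

    # 2) Inject motion information, weighted by relevance.
    track = track + weights.unsqueeze(-1) * motion_cue

    # 3) Merge adjacent frames with a weighted average (halves the temporal length).
    T = (track.shape[0] // 2) * 2                             # drop a trailing odd frame
    pair_tok = track[:T].reshape(-1, 2, track.shape[-1])      # (T//2, 2, C)
    pair_w = weights[:T].reshape(-1, 2)
    pair_w = pair_w / pair_w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    return (pair_w.unsqueeze(-1) * pair_tok).sum(dim=1)       # (T//2, C)

# Stacking several such stages yields progressively longer-timescale summaries.
traj, cue = torch.randn(8, 256), torch.randn(256)
for _ in range(3):
    traj = merge_stage(traj, cue)    # 8 -> 4 -> 2 -> 1 tokens
```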
4.3. Contrastive Learning
To improve the model's ability to distinguish between objects with similar looks but different motions, contrastive learning is applied to the final video tokens. (A minimal sketch of this component follows below.)
- Memory Bank: A memory bank is maintained to store a feature centroid for each distinct object instance in the training dataset. This is more efficient than storing every single token and provides a more stable, representative feature for each object. The centroid for an object is updated with a momentum (moving-average) rule:
  - M_[tilde(V)_j]: the feature centroid in the memory bank for the object corresponding to the anchor token tilde(V)_j.
  - A momentum hyperparameter controls how quickly the centroid moves toward the newest token.
- Contrastive Loss: The loss pulls an anchor token tilde(V)_j closer to its positive centroid (the correct object's centroid from the memory bank) and pushes it away from negative centroids (centroids of other objects).
  - tilde(V)_j: the anchor video token feature.
  - Positive sample: the feature centroid of the same object.
  - Negative samples: centroids of different objects; the authors prioritize selecting negatives from the same object category to make the task more challenging and effective.
  - A temperature hyperparameter controls the sharpness of the resulting softmax distribution.
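A minimal sketch of the memory bank and an InfoNCE-style loss is shown below; the class and function names, shapes, and hyperparameter values (momentum, temperature) are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: momentum memory bank + InfoNCE-style contrastive loss over
# video tokens. Names, shapes, and hyperparameters are illustrative only.
import torch
import torch.nn.functional as F

class CentroidBank:
    def __init__(self, num_objects: int, dim: int, momentum: float = 0.99):
        self.centroids = torch.zeros(num_objects, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, obj_id: int, token: torch.Tensor) -> None:
        # Exponential moving average keeps one stable feature per object instance.
        m = self.momentum
        self.centroids[obj_id] = m * self.centroids[obj_id] + (1.0 - m) * token

def motion_contrastive_loss(anchor: torch.Tensor,
                            pos: torch.Tensor,
                            negs: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """
    anchor: (C,)   video token of the target object
    pos:    (C,)   centroid of the same object (positive)
    negs:   (K, C) centroids of other (ideally same-category) objects (negatives)
    """
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.cat([pos.unsqueeze(0), negs], dim=0), dim=-1)  # (1+K, C)
    logits = (cands @ anchor) / tau             # similarity to positive and negatives
    target = torch.zeros(1, dtype=torch.long)   # index 0 is the positive
    return F.cross_entropy(logits.unsqueeze(0), target)

loss = motion_contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(16, 256))
```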
4.4. Training Objective
The overall model is trained with a composite loss function (written out below) whose terms are:
- A frame-level segmentation loss.
- A video-level segmentation loss.
- L_con: the proposed contrastive loss.
- A weighting factor that balances the contrastive loss against the segmentation terms.
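In symbols, the composite objective plausibly takes the following form; the term names and the weight symbol are assumptions made here for readability, not a transcription of the paper's equation:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{frame}} \;+\; \mathcal{L}_{\text{video}} \;+\; \lambda_{\text{con}}\,\mathcal{L}_{\text{con}}
```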
5. Experimental Setup
- Datasets:
  - MeViS: A large-scale, recent dataset specifically designed to test motion understanding. It contains complex scenes where motion cues are essential for identifying the correct object. This is the primary dataset for evaluating the paper's core claims.
  - Ref-YouTube-VOS: The largest RVS dataset, containing a wide variety of videos and expressions.
  - Ref-DAVIS17: A smaller but challenging dataset with high-quality annotations.
  - A2D-Sentences & JHMDB-Sentences: Datasets focused on actor and action segmentation, where the text describes actions performed by subjects.
- Evaluation Metrics:
  - Region Similarity (J, i.e., IoU): Also known as Intersection over Union (IoU). It measures the overlap between the predicted segmentation mask and the ground truth mask.
    - Conceptual Definition: It quantifies how accurate the shape and location of the predicted object mask are. A value of 1.0 means a perfect match.
    - Mathematical Formula: J = |M ∩ G| / |M ∪ G|, where M is the predicted mask and G is the ground-truth mask.
    - Symbol Explanation: |...| denotes the area (number of pixels) of a mask, ∩ is the intersection (common area), and ∪ is the union (total area covered by both).
  - Contour Accuracy (F): Measures the similarity of the boundaries of the predicted and ground truth masks.
    - Conceptual Definition: It evaluates how well the predicted outline of the object matches the real outline. It is calculated as the F-measure (harmonic mean) of precision and recall for boundary pixels.
    - Mathematical Formula: F = 2 · Precision · Recall / (Precision + Recall).
    - Symbol Explanation: Precision is the fraction of predicted boundary pixels that are correct, and Recall is the fraction of true boundary pixels that were correctly predicted.
  - Overall (J&F): The average of J and F.
    - Conceptual Definition: Provides a single, comprehensive score for segmentation quality, balancing both region overlap and boundary accuracy.
    - Mathematical Formula: J&F = (J + F) / 2.
  - mean Average Precision (mAP):
    - Conceptual Definition: Primarily used in object detection and instance segmentation. It calculates the average precision (AP) over multiple IoU thresholds and then averages these APs across all classes. It rewards models that are accurate across different levels of overlap stringency.
  - Overall IoU (oIoU) and Mean IoU (mIoU):
    - Conceptual Definition: oIoU is the IoU computed over all pixels from all videos in the dataset collectively, whereas mIoU is the IoU computed per class and then averaged across all classes. mIoU is generally considered a more balanced metric.

  A toy computation of J, F, and J&F is sketched below.
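For concreteness, here is a toy, hedged implementation of these metrics on binary masks; note that the boundary F below uses exact pixel matching, whereas the official DAVIS evaluation allows a small distance tolerance.

```python
# Toy sketch of the J, F, and J&F metrics on binary masks (numpy / scipy).
# Simplification: boundary matching uses exact pixel overlap, unlike the
# official DAVIS toolkit, which tolerates small boundary distances.
import numpy as np
from scipy.ndimage import binary_erosion

def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def boundary(mask: np.ndarray) -> np.ndarray:
    # Boundary = mask minus its one-pixel erosion.
    return np.logical_and(mask, ~binary_erosion(mask))

def contour_accuracy_f(pred: np.ndarray, gt: np.ndarray) -> float:
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    tp = np.logical_and(pb, gb).sum()
    precision = tp / pb.sum() if pb.sum() > 0 else 0.0
    recall = tp / gb.sum() if gb.sum() > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool); gt[12:42, 12:42] = True
j, f = region_similarity_j(pred, gt), contour_accuracy_f(pred, gt)
print(j, f, (j + f) / 2)   # J, F, and the overall J&F score
```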
- Baselines: The paper compares DsHmp against a range of recent, strong baselines, including ReferFormer, MTTR, SgMg, SOC, and most importantly, LMPM, which was the previous state-of-the-art on the MeViS dataset and serves as the direct predecessor to this work's ideas.
6. Results & Analysis
The experimental results strongly validate the effectiveness of the proposed DsHmp model.
- Core Results on MeViS (Table 3): This is the most critical result. On the motion-centric MeViS dataset, DsHmp demonstrates a massive improvement over all prior work. (Manual transcription of Table 3)

| Methods | Reference | J&F | J | F |
| --- | --- | --- | --- | --- |
| URVOS [47] | [ECCV'20] | 27.8 | 25.7 | 29.9 |
| LBDT [12] | [CVPR'22] | 29.3 | 27.8 | 30.8 |
| MTTR [2] | [CVPR'22] | 30.0 | 28.8 | 31.2 |
| ReferFormer [56] | [CVPR'22] | 31.0 | 29.8 | 32.2 |
| VLT+TC [10] | [TPAMI'22] | 35.5 | 33.6 | 37.3 |
| LMPM [8] | [ICCV'23] | 37.2 | 34.2 | 40.2 |
| DsHmp (ours) | [CVPR'24] | 46.4 | 43.0 | 49.8 |

Analysis:
- DsHmp achieves a J&F score of 46.4%, which is 9.2 absolute points higher than the previous best method, LMPM (37.2%). This large leap in performance on a dataset specifically designed to test motion understanding is powerful evidence that the paper's core ideas—decoupling perception and hierarchical motion modeling—are highly effective.
- Core Results on Other Datasets (Tables 4 & 5): DsHmp also achieves new state-of-the-art results on Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences. The improvements on these datasets are more modest (around 0.7-1.3%) compared to MeViS. This is expected, as these datasets contain a higher proportion of expressions that rely on static appearance cues, where the baseline models already perform well. Nevertheless, the consistent state-of-the-art performance demonstrates the generalizability and robustness of the DsHmp model. (Manual transcription of Table 4 - showing Video-Swin-Base results for brevity)

| Method | Reference | Ref-YouTube-VOS J&F | Ref-DAVIS17 J&F |
| --- | --- | --- | --- |
| ... | ... | ... | ... |
| SOC [40] | [NIPS'23] | 66.0 | 64.2 |
| DsHmp (ours) | [CVPR'24] | 67.1 | 64.9 |
- Ablation Studies (Tables 1 & 2): The ablation studies systematically dissect the model to verify the contribution of each new component. (Manual transcription of Table 1)

| Index | DS | HMP | CL | J&F | J | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (Baseline) | ✗ | ✗ | ✗ | 39.7 | 36.6 | 42.8 |
| 1 | ✓ | ✗ | ✗ | 42.5 | 39.4 | 45.6 |
| 2 | ✗ | ✓ | ✗ | 43.8 | 40.7 | 46.9 |
| 3 | ✗ | ✗ | ✓ | 42.1 | 39.0 | 45.2 |
| 7 (Full Model) | ✓ | ✓ | ✓ | 46.4 | 43.0 | 49.8 |

Analysis:
- Adding Decoupling Sentence (DS) alone improves J&F by 2.8% (42.5 vs. 39.7).
- Adding Contrastive Learning (CL) alone improves
J&Fby 2.4% (42.1 vs 39.7). - When all three components are combined, they work synergistically to achieve the final score of 46.4%, showing that each proposed module is a significant and complementary contributor to the final performance.
- Adding Decoupling Sentence (DS) alone improves
-
Visualizations:
Figure 4 (translated caption): Visualization of the video features without contrastive learning (left) and with contrastive learning (right); different colors denote different instance labels, and contrastive learning gives the video feature space a noticeably better structure. The t-SNE visualization in Figure 4 shows that without contrastive learning (left), the feature representations for different objects are mixed and overlapping. With contrastive learning (right), the features for the same object cluster tightly together, and different object clusters become well-separated. This demonstrates that CL successfully learns a more discriminative feature space based on motion.
Figure 5 (translated caption): Qualitative comparison between LMPM and the proposed method on expressions such as "Cat turning around and playing with toy"; segmented objects are marked with different colors, illustrating the proposed method's advantage in capturing motion details and distinguishing objects. The qualitative results in Figure 5 show a clear example. For the expression "Panda pushing another panda and falling over", the baseline LMPM gets confused and segments both pandas. In contrast, DsHmp correctly identifies and segments only the panda that is performing the described action, showcasing its superior motion comprehension.
7. Conclusion & Reflections
- Conclusion Summary: This paper introduces DsHmp, a novel approach for referring video segmentation that significantly enhances temporal and motion understanding. By decoupling the task into static perception and motion perception, the model allows specialized components to excel at their respective roles. The hierarchical motion perception module effectively captures complex motions across various timescales, while contrastive learning sharpens the model's ability to distinguish visually similar objects based on their actions. The resulting state-of-the-art performance, especially the groundbreaking 9.2% improvement on the motion-focused MeViS dataset, provides compelling evidence for the efficacy of this decoupled approach.
- Limitations & Future Work:
- Reliance on External Parser: The current method relies on an external, off-the-shelf NLP tool to split the sentence into static and motion cues. The performance of the model could be sensitive to the accuracy of this parser. An interesting direction for future work would be to learn this decoupling in an end-to-end fashion, making the model more robust and self-contained.
- Computational Complexity: The hierarchical processing and iterative attention mechanisms, while effective, might add computational overhead compared to simpler models. Future work could explore more efficient implementations of the HMP module.
- Personal Insights & Critique:
- Novelty and Impact: The core idea of decoupling static and motion perception is simple, intuitive, and remarkably effective. It addresses a fundamental flaw in how previous models interpreted language guidance for a temporal task. This conceptual shift is a significant contribution and is likely to influence future research in video-language tasks beyond just segmentation.
- Methodological Soundness: The implementation of this idea is well-executed. The HMP module is a clever way to model temporal hierarchies, and the targeted use of contrastive learning to solve a specific, challenging sub-problem (similar-looking objects) is very smart.
- Strong Empirical Validation: The authors provide extensive and convincing experimental evidence. The massive gain on MeViS is not just an incremental improvement; it is a clear demonstration that their model successfully captures a dimension of the problem—motion—that previous models largely missed.
- Overall: This paper is an excellent example of identifying a core conceptual weakness in existing methods and proposing a clean, well-motivated, and highly effective solution. It represents a significant step forward for referring video segmentation and video understanding in general.