ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
TL;DR Summary
ReferDINO-Plus integrates SAM2 with ReferDINO using conditional mask fusion to improve mask quality and temporal consistency in referring video object segmentation, achieving 60.43 J&F and 2nd place in the CVPR 2025 MeViS challenge.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
In-depth Reading
1. Bibliographic Information
- Title: ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
- Authors: Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu.
- Affiliations: All authors are from Sun Yat-sen University.
- Journal/Conference: The paper reports results for the 4th PVUW (Pixel-level Video Understanding in the Wild) MeViS (Motion Expression Video Segmentation) Challenge, held at the CVPR (Conference on Computer Vision and Pattern Recognition) 2025.
- Venue Reputation: CVPR is the premier international conference in the field of computer vision, making it a highly prestigious venue. This paper is a technical report describing a competition-winning solution.
- Publication Year: The paper is a preprint submitted in 2025.
- Abstract: The paper presents ReferDINO-Plus, a solution for Referring Video Object Segmentation (RVOS), which involves segmenting objects in a video based on a text description. The method builds upon ReferDINO, a strong existing model, by integrating SAM2 (Segment Anything Model 2) to improve mask quality and temporal consistency. To address a key challenge where SAM2 struggles with multiple objects, the authors introduce a Conditional Mask Fusion (CMF) strategy that adaptively combines the outputs of ReferDINO and SAM2. The final solution achieved 2nd place in the MeViS PVUW challenge at CVPR 2025 with a score of 60.43 on the test set.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2503.23509v2
- PDF Link: https://arxiv.org/pdf/2503.23509v2.pdf
- Publication Status: This is a preprint available on arXiv, not yet formally published in the conference proceedings.
2. Executive Summary
- Background & Motivation (Why):
- The core problem is Referring Video Object Segmentation (RVOS): given a video and a natural language sentence (e.g., "the cat running to the left"), the goal is to create a pixel-perfect mask of the described object(s) in every frame of the video.
- This task is important for applications like advanced video editing and human-robot interaction. However, existing methods often fail in complex, real-world scenarios. The MeViS dataset, a key benchmark, specifically introduces challenges such as descriptions based on motion and temporal actions (e.g., "the person who was standing still and then started walking") and the need to segment multiple objects from a single description.
- The paper starts with a powerful baseline model, ReferDINO, which leverages a foundational vision-language model. The motivation is to further enhance ReferDINO's performance by addressing its limitations, particularly in mask quality, and to specifically tackle the multi-object challenge posed by MeViS.
- Main Contributions / Findings (What):
- The primary contribution is a two-stage method called ReferDINO-Plus. It is not a new model architecture built from scratch but an intelligent combination of existing powerful models.
- Stage 1: Uses ReferDINO to generate initial object masks based on the text description. ReferDINO is good at understanding the text and identifying the correct objects.
- Stage 2: Uses SAM2, a state-of-the-art video segmentation model, to refine the masks from ReferDINO. SAM2 excels at producing high-quality, temporally consistent masks but needs a starting point (a "prompt").
- A novel Conditional Mask Fusion (CMF) strategy is introduced. The authors observed that SAM2, when given a prompt covering multiple objects, would sometimes only track one, degrading performance. The CMF strategy intelligently decides whether to use the SAM2 mask alone (for single objects) or combine it with the original ReferDINO mask (for multiple objects), effectively getting the best of both worlds.
- The proposed solution secured 2nd place in a competitive academic challenge, demonstrating its state-of-the-art performance without complex training schemes like using pseudo-labels.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Referring Video Object Segmentation (RVOS): This is a computer vision task that combines natural language processing and video analysis. Unlike generic object segmentation, which finds all objects of a certain class (e.g., all cars), RVOS finds the specific object(s) referred to in a text query.
- Foundation Models: These are large-scale models pre-trained on vast amounts of data (e.g., GPT-3 for text, SAM for images). They develop general-purpose capabilities that can be adapted to many downstream tasks. This paper leverages two such models:
- GroundingDINO: A model that can detect arbitrary objects in an image based on a text description. It "grounds" language in visual space. ReferDINO is built upon this.
- SAM (Segment Anything Model) & SAM2: SAM is a model that can segment almost any object in an image given a prompt (such as a point, box, or mask). SAM2 extends this capability to videos, making it excellent for tracking and segmenting objects consistently across frames.
- Transformers and DETR: Transformers are a neural network architecture, originally designed for language, that excel at modeling relationships between elements in a sequence. DEtection TRansformer (DETR) adapted this for object detection, treating it as a direct set prediction problem. This paradigm influenced many subsequent vision tasks, including RVOS models like MTTR and ReferFormer.
- Previous Works:
- Early RVOS Methods: Some initial works ([1]) simply applied methods designed for referring image segmentation to each video frame independently. This approach ignored temporal information, leading to flickering or inconsistent segmentations over time.
- Transformer-based RVOS: Models like MTTR [2] and ReferFormer [22] introduced Transformers to the RVOS task. They process video frames and text together, allowing for better temporal reasoning and cross-modal understanding.
- Modular Improvements: Subsequent works like SOC [17] and DsHmp [11] focused on improving specific parts of the pipeline, such as enhancing temporal modeling or better aligning visual and textual features.
- Foundation Model Era: The recent introduction of ReferDINO [15] marked a significant step. By building on the powerful open-vocabulary detection capabilities of GroundingDINO, it achieved superior performance, especially in understanding complex descriptions.
- Differentiation: ReferDINO-Plus is not a new end-to-end trained model. Its innovation lies in its pragmatic and effective two-stage system design. While ReferDINO is excellent at vision-language understanding and SAM2 is excellent at high-quality segmentation, they have different strengths and weaknesses. ReferDINO-Plus combines them and introduces the Conditional Mask Fusion (CMF) strategy as the "glue" that mitigates their individual shortcomings, particularly SAM2's failure mode in multi-object scenarios. This system-level thinking is what sets it apart.
4. Methodology (Core Technology & Implementation)
The overall framework of ReferDINO-Plus is a simple yet powerful three-step pipeline, as illustrated in Figure 1.
(Figure 1. Overview of ReferDINO-Plus: input video frames are processed by ReferDINO to produce initial masks, SAM2 then refines mask quality, and a conditional mask fusion strategy yields the final segmentation result.)
- Principles: The core idea is to leverage the strengths of two specialized models: ReferDINO for its superior text-to-object identification (semantic understanding) and SAM2 for its superior mask generation and tracking (pixel-level accuracy and temporal consistency).
- Steps & Procedures:
- Step 1: Cross-modal Reasoning with ReferDINO
- Input: A video clip with $T$ frames and a text description.
- Process: The inputs are fed into the pre-trained ReferDINO model. ReferDINO uses its internal vision-language reasoning capabilities to identify the objects matching the description.
- Output: ReferDINO produces a sequence of object masks $\{M_t^{R}\}_{t=1}^{T}$ and their corresponding confidence scores for each frame $t$. To handle cases where a description refers to multiple objects, masks with a score above a threshold are combined into a single per-frame mask (a minimal sketch of this combination step follows).
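A minimal sketch of this thresholded combination for one frame (the threshold value and function name are illustrative assumptions, not taken from the report):

```python
import numpy as np

def combine_referdino_masks(obj_masks, obj_scores, score_thresh=0.3):
    """Union all candidate-object masks whose confidence exceeds the threshold.

    obj_masks: (N, H, W) boolean masks for N candidate objects in one frame.
    obj_scores: (N,) confidence scores from ReferDINO.
    score_thresh: illustrative value; the report only says "a threshold".
    """
    keep = obj_scores > score_thresh
    if not keep.any():
        # Fall back to the single most confident object if nothing passes.
        keep = obj_scores == obj_scores.max()
    return np.any(obj_masks[keep], axis=0)
```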
- Step 2: Mask Refinement with SAM2
- Input: The authors select the frame and mask from ReferDINO's output that has the single highest confidence score. This (frame, mask) pair serves as a high-quality "prompt" for SAM2.
- Process: SAM2, being a promptable video segmentation model, takes this initial mask and efficiently propagates it through the entire video, tracking the object and generating a new set of masks (a rough sketch follows this step).
- Output: SAM2 produces a refined and temporally smooth mask sequence $\{M_t^{S}\}_{t=1}^{T}$. These masks are generally of higher quality and more consistent frame-to-frame than ReferDINO's original output.
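A rough sketch of this prompt-and-propagate step, assuming ReferDINO's per-frame masks and scores are already available as arrays and using the video-predictor interface from the official SAM2 repository (build_sam2_video_predictor, init_state, add_new_mask, propagate_in_video); exact signatures, output shapes, and checkpoint/config names may differ between releases:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor  # official SAM2 repo

def refine_with_sam2(video_dir, ref_masks, ref_scores,
                     cfg="sam2_hiera_l.yaml", ckpt="sam2_hiera_large.pt"):
    """ref_masks: (T, H, W) boolean masks from ReferDINO; ref_scores: (T,) scores."""
    predictor = build_sam2_video_predictor(cfg, ckpt)
    state = predictor.init_state(video_path=video_dir)

    # Prompt SAM2 with the single highest-confidence (frame, mask) pair.
    best_t = int(np.argmax(ref_scores))
    predictor.add_new_mask(state, frame_idx=best_t, obj_id=0,
                           mask=torch.from_numpy(ref_masks[best_t]))

    # Propagate the prompt through the video; frames before the prompt frame
    # may require a second pass with reverse=True.
    sam_masks = [None] * len(ref_masks)
    for t, obj_ids, mask_logits in predictor.propagate_in_video(state):
        # mask_logits is roughly (num_objects, 1, H, W); threshold logits at 0.
        sam_masks[t] = (mask_logits[0, 0] > 0).cpu().numpy()
    return sam_masks
```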
- Step 3: Conditional Mask Fusion (CMF)
- Problem: The authors observed that if the initial prompt mask from ReferDINO contained multiple disconnected objects, SAM2 would often latch onto only one of them and fail to track the others. This leads to a significant drop in performance for multi-object queries.
- Solution: The CMF strategy is a heuristic designed to detect and correct this failure mode. For each frame, it compares the area of the mask from SAM2 ($M_t^{S}$) with the area of the mask from ReferDINO ($M_t^{R}$).
- Mathematical Formula & Key Details: The final mask $M_t$ for a given frame $t$ is determined by the following rule:
$$
M_t =
\begin{cases}
M_t^{S} \cup M_t^{R}, & \text{if } \mathrm{area}(M_t^{S}) < \tfrac{2}{3}\,\mathrm{area}(M_t^{R}) \\
M_t^{S}, & \text{otherwise}
\end{cases}
$$
- Symbol Explanation:
- $M_t$: The final output mask for the frame.
- $M_t^{S}$: The mask generated by SAM2.
- $M_t^{R}$: The mask generated by ReferDINO.
- $\mathrm{area}(\cdot)$: A function that calculates the pixel area of a mask.
- Logic: The condition acts as a detector for the multi-object failure case. If the SAM2 mask is significantly smaller (less than 2/3 the area) than the ReferDINO mask, it's assumed that SAM2 dropped one or more objects. In this "multi-object case," the final mask is the union of both masks ($M_t^{S} \cup M_t^{R}$), recovering the lost objects. Otherwise, in the "single-object case," the higher-quality SAM2 mask ($M_t^{S}$) is used exclusively. This check is performed independently for each frame; a minimal per-frame sketch of this rule follows.
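Since the rule operates frame by frame on binary masks, it can be expressed in a few lines. Below is a minimal NumPy sketch (the mask representation and function name are illustrative, not the authors' code):

```python
import numpy as np

def conditional_mask_fusion(sam2_masks, referdino_masks, ratio=2/3):
    """Per-frame Conditional Mask Fusion (CMF).

    sam2_masks, referdino_masks: lists of boolean (H, W) arrays, one per frame.
    ratio: the 2/3 area threshold described in the report.
    """
    fused = []
    for m_sam, m_ref in zip(sam2_masks, referdino_masks):
        # If SAM2's mask is much smaller than ReferDINO's, SAM2 likely dropped
        # one of several referred objects: take the union of the two masks.
        if m_sam.sum() < ratio * m_ref.sum():
            fused.append(np.logical_or(m_sam, m_ref))
        else:
            # Single-object case: keep the higher-quality SAM2 mask.
            fused.append(m_sam)
    return fused
```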
5. Experimental Setup
- Datasets:
- MeViS: The primary benchmark for the competition. It is a large-scale RVOS dataset with 2,000 videos and 28,000 text descriptions. Its key features are a focus on motion-based language, descriptions referring to single or multiple objects, and even descriptions of non-existent objects, making it very challenging.
- Training Datasets: The ReferDINO model was trained in stages: first on referring image segmentation datasets (RefCOCO, RefCOCO+, RefCOCOg), then on earlier RVOS datasets (Refer-Youtube-VOS, Ref-DAVIS17), and finally fine-tuned on the MeViS training set.
- Evaluation Metrics:
- Region Similarity ($\mathcal{J}$):
- Conceptual Definition: Also known as the Jaccard Index or Intersection over Union (IoU), this metric measures the overlap between the predicted mask and the ground-truth mask. A value of 1 indicates a perfect match, while 0 indicates no overlap. It evaluates the overall spatial accuracy of the segmentation.
- Mathematical Formula:
$$
\mathcal{J} = \frac{|P \cap G|}{|P \cup G|}
$$
- Symbol Explanation:
- $P$: The predicted segmentation mask.
- $G$: The ground-truth segmentation mask.
- $|\cdot|$: The number of pixels in a set (mask).
- $P \cap G$: The intersection of the two masks (pixels where both are positive).
- $P \cup G$: The union of the two masks (pixels where at least one is positive).
- Contour Accuracy ($\mathcal{F}$):
- Conceptual Definition: This metric evaluates the accuracy of the predicted mask's boundary compared to the ground-truth boundary. It is typically calculated as an F-measure (the harmonic mean of precision and recall) on the boundary pixels. It is sensitive to the fine-grained details and "tightness" of the mask's outline.
- Mathematical Formula:
$$
\mathcal{F} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$
- Symbol Explanation:
- Precision and Recall are calculated over boundary pixels: Precision measures what fraction of predicted boundary pixels are correct, while Recall measures what fraction of true boundary pixels were found.
- Primary Metric ($\mathcal{J}\&\mathcal{F}$):
- Conceptual Definition: The official ranking metric for the challenge, which is the simple average of Region Similarity and Contour Accuracy. It provides a balanced assessment of both overall shape and boundary quality (a small code sketch of all three metrics follows this list).
- Mathematical Formula:
$$
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
$$
- Symbol Explanation:
- $\mathcal{J}$: The Region Similarity score.
- $\mathcal{F}$: The Contour Accuracy score.
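To make the three metrics concrete, here is a minimal NumPy/SciPy sketch; the boundary matching is a simplified stand-in for the official evaluation toolkit, so numbers may differ slightly:

```python
import numpy as np
from scipy import ndimage

def region_similarity(pred, gt):
    """J: Intersection over Union between boolean (H, W) masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union

def contour_accuracy(pred, gt, tol=2):
    """F: F-measure over boundary pixels, with a small distance tolerance."""
    pred_b = pred & ~ndimage.binary_erosion(pred)
    gt_b = gt & ~ndimage.binary_erosion(gt)
    # Distance from every pixel to the nearest boundary pixel of each mask.
    dist_to_gt = ndimage.distance_transform_edt(~gt_b)
    dist_to_pred = ndimage.distance_transform_edt(~pred_b)
    precision = (dist_to_gt[pred_b] <= tol).mean() if pred_b.any() else 1.0
    recall = (dist_to_pred[gt_b] <= tol).mean() if gt_b.any() else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F: the average of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt))
```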
- Baselines:
- The main baseline for comparison is the standalone ReferDINO model.
- The ablation study also provides intermediate baselines: ReferDINO + SAM2 (without CMF) and ReferDINO + SAM2 + CMFv (with video-level fusion).
- The competition leaderboard (Table 1) shows results from other teams (MVP-Lab, HarborY, etc.), serving as external benchmarks.
6. Results & Analysis
- Core Results: The main result is the model's performance on the official MeViS test set leaderboard. ReferDINO-Plus achieved second place, demonstrating its competitiveness.
(Manual transcription of Table 1 from the paper)
Table 1. The leaderboard of the MeViS test set.

| Team | J&F | J | F |
| --- | --- | --- | --- |
| MVP-Lab | 61.98 | 58.83 | 65.14 |
| ReferDINO-Plus | 60.43 | 56.79 | 64.07 |
| HarborY | 56.26 | 52.68 | 59.84 |
| Pengsong | 55.91 | 53.06 | 58.76 |
| ssam2s | 55.16 | 52.00 | 58.33 |
| strong_kimchi | 55.02 | 51.78 | 58.27 |

This table shows ReferDINO-Plus outperforming the remaining competitors by a clear margin, with only the first-place team (MVP-Lab) scoring higher.
- Ablations / Parameter Sensitivity: The ablation study on the MeViS validation set is crucial for understanding why ReferDINO-Plus works. It systematically adds each component to the baseline ReferDINO model.
(Manual transcription of Table 2 from the paper)
Table 2. Ablation studies on the MeViS validation set.

| Method | J&F | J | F |
| --- | --- | --- | --- |
| ReferDINO | 51.67 | 47.94 | 55.40 |
| +SAM2 | 52.54 | 49.18 | 55.90 |
| +SAM2+CMFv | 54.82 | 51.39 | 58.24 |
| +SAM2+CMF | 55.27 | 51.80 | 58.75 |

- ReferDINO (Baseline): Achieves a respectable 51.67 J&F.
- +SAM2: Adding SAM2 for refinement provides a modest boost to 52.54. This shows that while SAM2 improves mask quality, its failure on multi-object cases limits the overall gain.
- +SAM2+CMFv: Introducing the Conditional Mask Fusion strategy, applied once for the whole video (CMFv), gives a large performance jump to 54.82. This confirms that explicitly handling the multi-object failure case is the key to unlocking performance.
- +SAM2+CMF (Final Method): Applying the fusion logic on a per-frame basis (CMF) provides a further small gain to 55.27. This suggests that per-frame adaptivity is slightly better than a single decision for the whole video.
- Overall, the full ReferDINO-Plus method improves upon the ReferDINO baseline by 3.6 points in J&F (55.27 vs. 51.67), with the CMF strategy contributing the majority of this gain.
- Visualization: Figure 2 in the paper visualizes the model's outputs on several challenging examples from the MeViS test set.
(Figure 2 panels: a sea turtle moving its flippers underwater across consecutive frames; two parrots perched on a hand with colored mask overlays; a sea turtle and red fish, illustrating object continuity under occlusion; three cats highlighted with blue, red, and green masks; groups of cats distinguished by per-instance colors; and cows alongside a truck segmented with separate instance masks.)
- The examples show the model can distinguish between objects based on subtle motion cues (e.g., "The parrot that is in motion" vs. "The parrot eating without moving").
- The "cows moving to left" example explicitly demonstrates successful segmentation of multiple objects, validating the effectiveness of the
Conditional Mask Fusionstrategy. - Overall, the visualizations confirm that
ReferDINO-Plusproduces accurate and high-quality masks that are consistent over time.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces ReferDINO-Plus, a highly effective two-stage solution for referring video object segmentation. By combining the semantic understanding of ReferDINO with the mask refinement capabilities of SAM2, and crucially, introducing a Conditional Mask Fusion strategy to handle multi-object scenarios, the method achieves state-of-the-art results. Its 2nd-place finish in the competitive MeViS challenge at CVPR 2025 validates the strength of this pragmatic and well-designed system.
- Limitations & Future Work:
- Heuristic-based Fusion: The paper does not explicitly state limitations, but a primary one is the reliance on a hand-crafted heuristic for the CMF strategy. The threshold (the 2/3 area ratio) is a fixed value that may not be optimal for all situations or datasets.
- Two-Stage Pipeline: The method is a two-stage pipeline, which can be less elegant and potentially slower than a fully end-to-end model. There is no joint optimization between the ReferDINO and SAM2 components.
- Future work could explore learning the fusion strategy with a small neural network or integrating the refinement process more deeply into an end-to-end trainable architecture.
- Personal Insights & Critique:
- Strengths: The brilliance of this paper lies in its simplicity and effectiveness. It's a prime example of strong engineering and system design in the era of foundation models. Rather than inventing a complex new architecture, the authors astutely identified the complementary strengths and weaknesses of two powerful existing models and designed a simple rule to combine them optimally. This approach is practical, reproducible, and highly effective.
- Critique: The CMF rule, while effective, feels like an "engineering fix." A more robust solution might learn to predict when SAM2 will fail, or learn how to fuse the masks dynamically based on features from both models. However, for a competition setting, a simple and effective heuristic is often superior to a more complex but harder-to-train solution.
- Transferability: This "identify-then-refine-and-fuse" paradigm is highly transferable to other dense prediction tasks in vision. The core insight—that one model may be better at high-level semantics while another is better at low-level details, and a smart fusion rule can combine them—is a powerful and generalizable lesson for building high-performance vision systems.