ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
TL;DR Summary
ReferDINO-Plus integrates SAM2 with ReferDINO using conditional mask fusion to improve mask quality and temporal consistency in referring video object segmentation, achieving 60.43 J&F and 2nd place in the CVPR 2025 MeViS challenge.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
In-depth Reading
1. Bibliographic Information
- Title: ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
- Authors: Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu.
- Affiliations: All authors are from Sun Yat-sen University.
- Journal/Conference: The paper reports results for the 4th PVUW (Pixel-level Video Understanding in the Wild) MeViS (Motion Expression Video Segmentation) Challenge, held at the CVPR (Conference on Computer Vision and Pattern Recognition) 2025.
- Venue Reputation: CVPR is the premier international conference in the field of computer vision, making it a highly prestigious venue. This paper is a technical report describing a competition-winning solution.
- Publication Year: The paper is a preprint submitted in 2025.
- Abstract: The paper presents ReferDINO-Plus, a solution for Referring Video Object Segmentation (RVOS), which involves segmenting objects in a video based on a text description. The method builds upon ReferDINO, a strong existing model, by integrating SAM2 (Segment Anything Model 2) to improve mask quality and temporal consistency. To address a key challenge where SAM2 struggles with multiple objects, the authors introduce a Conditional Mask Fusion (CMF) strategy that adaptively combines the outputs of ReferDINO and SAM2. The final solution achieved 2nd place in the MeViS PVUW challenge at CVPR 2025 with a score of 60.43 on the test set.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2503.23509v2
- PDF Link: https://arxiv.org/pdf/2503.23509v2.pdf
- Publication Status: This is a preprint available on arXiv, not yet formally published in the conference proceedings.
2. Executive Summary
- Background & Motivation (Why):
- The core problem is Referring Video Object Segmentation (RVOS): given a video and a natural language sentence (e.g., "the cat running to the left"), the goal is to create a pixel-perfect mask of the described object(s) in every frame of the video.
- This task is important for applications like advanced video editing and human-robot interaction. However, existing methods often fail in complex, real-world scenarios. The MeViS dataset, a key benchmark, specifically introduces challenges such as descriptions based on motion and temporal actions (e.g., "the person who was standing still and then started walking") and the need to segment multiple objects from a single description.
- The paper starts with a powerful baseline model, ReferDINO, which leverages a foundational vision-language model. The motivation is to further enhance ReferDINO's performance by addressing its limitations, particularly in mask quality, and to specifically tackle the multi-object challenge posed by MeViS.
- Main Contributions / Findings (What):
- The primary contribution is a two-stage method called ReferDINO-Plus. It is not a new model architecture built from scratch but an intelligent combination of existing powerful models.
- Stage 1: Uses ReferDINO to generate initial object masks based on the text description. ReferDINO is good at understanding the text and identifying the correct objects.
- Stage 2: Uses SAM2, a state-of-the-art video segmentation model, to refine the masks from ReferDINO. SAM2 excels at producing high-quality, temporally consistent masks but needs a starting point (a "prompt").
- A novel Conditional Mask Fusion (CMF) strategy is introduced. The authors observed that SAM2, when given a prompt covering multiple objects, would sometimes only track one, degrading performance. The CMF strategy intelligently decides whether to use the SAM2 mask alone (for single objects) or combine it with the original ReferDINO mask (for multiple objects), effectively getting the best of both worlds.
- The proposed solution secured 2nd place in a competitive academic challenge, demonstrating its state-of-the-art performance without complex training schemes like using pseudo-labels.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Referring Video Object Segmentation (RVOS): This is a computer vision task that combines natural language processing and video analysis. Unlike generic object segmentation, which finds all objects of a certain class (e.g., all cars), RVOS finds the specific object(s) referred to in a text query.
- Foundation Models: These are large-scale models pre-trained on vast amounts of data (e.g., GPT-3 for text, SAM for images). They develop general-purpose capabilities that can be adapted to many downstream tasks. This paper leverages two such models:
- GroundingDINO: A model that can detect arbitrary objects in an image based on a text description. It "grounds" language in visual space. ReferDINO is built upon this.
- SAM (Segment Anything Model) & SAM2: SAM is a model that can segment almost any object in an image given a prompt (such as a point, box, or mask). SAM2 extends this capability to videos, making it excellent for tracking and segmenting objects consistently across frames.
- Transformers and DETR: Transformers are a neural network architecture, originally designed for language, that excel at modeling relationships between elements in a sequence. DEtection TRansformer (DETR) adapted this for object detection, treating it as a direct set prediction problem. This paradigm influenced many subsequent vision tasks, including RVOS models like MTTR and ReferFormer.
- Previous Works:
- Early RVOS Methods: Some initial works ([1]) simply applied methods designed for referring image segmentation to each video frame independently. This approach ignored temporal information, leading to flickering or inconsistent segmentations over time.
- Transformer-based RVOS: Models like MTTR [2] and ReferFormer [22] introduced Transformers to the RVOS task. They process video frames and text together, allowing for better temporal reasoning and cross-modal understanding.
- Modular Improvements: Subsequent works like SOC [17] and DsHmp [11] focused on improving specific parts of the pipeline, such as enhancing temporal modeling or better aligning visual and textual features.
- Foundation Model Era: The recent introduction of ReferDINO [15] marked a significant step. By building on the powerful open-vocabulary detection capabilities of GroundingDINO, it achieved superior performance, especially in understanding complex descriptions.
- Differentiation: ReferDINO-Plus is not a new end-to-end trained model. Its innovation lies in its pragmatic and effective two-stage system design. While ReferDINO is excellent at vision-language understanding and SAM2 is excellent at high-quality segmentation, they have different strengths and weaknesses. ReferDINO-Plus combines them and introduces the Conditional Mask Fusion (CMF) strategy as the "glue" that mitigates their individual shortcomings, particularly SAM2's failure mode in multi-object scenarios. This system-level thinking is what sets it apart.
4. Methodology (Core Technology & Implementation)
The overall framework of ReferDINO-Plus is a simple yet powerful three-step pipeline, as illustrated in Figure 1.
(Figure 1. Overview of ReferDINO-Plus: input video frames are processed by ReferDINO to produce initial masks, SAM2 then refines mask quality, and a conditional mask fusion strategy yields the final segmentation result.)
- Principles: The core idea is to leverage the strengths of two specialized models: ReferDINO for its superior text-to-object identification (semantic understanding) and SAM2 for its superior mask generation and tracking (pixel-level accuracy and temporal consistency).
- Steps & Procedures:
- Step 1: Cross-modal Reasoning with ReferDINO
- Input: A video clip with $T$ frames and a text description.
- Process: The inputs are fed into the pre-trained ReferDINO model. ReferDINO uses its internal vision-language reasoning capabilities to identify the objects matching the description.
- Output: ReferDINO produces a sequence of object masks $\{M_t^{R}\}_{t=1}^{T}$ and their corresponding confidence scores for each frame $t$. To handle cases where a description refers to multiple objects, masks with a score above a threshold are combined into a single per-frame mask (a minimal sketch of this combination step follows).
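A minimal sketch of this thresholded combination for one frame (the threshold value and function name are illustrative assumptions, not taken from the report):

```python
import numpy as np

def combine_referdino_masks(obj_masks, obj_scores, score_thresh=0.3):
    """Union all candidate-object masks whose confidence exceeds the threshold.

    obj_masks: (N, H, W) boolean masks for N candidate objects in one frame.
    obj_scores: (N,) confidence scores from ReferDINO.
    score_thresh: illustrative value; the report only says "a threshold".
    """
    keep = obj_scores > score_thresh
    if not keep.any():
        # Fall back to the single most confident object if nothing passes.
        keep = obj_scores == obj_scores.max()
    return np.any(obj_masks[keep], axis=0)
```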
- Step 2: Mask Refinement with SAM2
- Input: The authors select the frame and mask from ReferDINO's output that has the single highest confidence score. This (frame, mask) pair serves as a high-quality "prompt" for SAM2.
- Process: SAM2, being a promptable video segmentation model, takes this initial mask and efficiently propagates it through the entire video, tracking the object and generating a new set of masks (a rough sketch follows this step).
- Output: SAM2 produces a refined and temporally smooth mask sequence $\{M_t^{S}\}_{t=1}^{T}$. These masks are generally of higher quality and more consistent frame-to-frame than ReferDINO's original output.
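A rough sketch of this prompt-and-propagate step, assuming ReferDINO's per-frame masks and scores are already available as arrays and using the video-predictor interface from the official SAM2 repository (build_sam2_video_predictor, init_state, add_new_mask, propagate_in_video); exact signatures, output shapes, and checkpoint/config names may differ between releases:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor  # official SAM2 repo

def refine_with_sam2(video_dir, ref_masks, ref_scores,
                     cfg="sam2_hiera_l.yaml", ckpt="sam2_hiera_large.pt"):
    """ref_masks: (T, H, W) boolean masks from ReferDINO; ref_scores: (T,) scores."""
    predictor = build_sam2_video_predictor(cfg, ckpt)
    state = predictor.init_state(video_path=video_dir)

    # Prompt SAM2 with the single highest-confidence (frame, mask) pair.
    best_t = int(np.argmax(ref_scores))
    predictor.add_new_mask(state, frame_idx=best_t, obj_id=0,
                           mask=torch.from_numpy(ref_masks[best_t]))

    # Propagate the prompt through the video; frames before the prompt frame
    # may require a second pass with reverse=True.
    sam_masks = [None] * len(ref_masks)
    for t, obj_ids, mask_logits in predictor.propagate_in_video(state):
        # mask_logits is roughly (num_objects, 1, H, W); threshold logits at 0.
        sam_masks[t] = (mask_logits[0, 0] > 0).cpu().numpy()
    return sam_masks
```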
- Step 3: Conditional Mask Fusion (CMF)
- Problem: The authors observed that if the initial prompt mask from ReferDINO contained multiple disconnected objects, SAM2 would often latch onto only one of them and fail to track the others. This leads to a significant drop in performance for multi-object queries.
- Solution: The CMF strategy is a heuristic designed to detect and correct this failure mode. For each frame, it compares the area of the mask from SAM2 ($M_t^{S}$) with the area of the mask from ReferDINO ($M_t^{R}$).
- Mathematical Formula & Key Details: The final mask $M_t$ for a given frame $t$ is determined by the following rule:
$$
M_t =
\begin{cases}
M_t^{S} \cup M_t^{R}, & \text{if } \mathrm{area}(M_t^{S}) < \tfrac{2}{3}\,\mathrm{area}(M_t^{R}) \\
M_t^{S}, & \text{otherwise}
\end{cases}
$$
- Symbol Explanation:
- $M_t$: The final output mask for the frame.
- $M_t^{S}$: The mask generated by SAM2.
- $M_t^{R}$: The mask generated by ReferDINO.
- $\mathrm{area}(\cdot)$: A function that calculates the pixel area of a mask.
- Logic: The condition acts as a detector for the multi-object failure case. If the SAM2 mask is significantly smaller (less than 2/3 the area) than the ReferDINO mask, it's assumed that SAM2 dropped one or more objects. In this "multi-object case," the final mask is the union of both masks ($M_t^{S} \cup M_t^{R}$), recovering the lost objects. Otherwise, in the "single-object case," the higher-quality SAM2 mask ($M_t^{S}$) is used exclusively. This check is performed independently for each frame; a minimal per-frame sketch of this rule follows.
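Since the rule operates frame by frame on binary masks, it can be expressed in a few lines. Below is a minimal NumPy sketch (the mask representation and function name are illustrative, not the authors' code):

```python
import numpy as np

def conditional_mask_fusion(sam2_masks, referdino_masks, ratio=2/3):
    """Per-frame Conditional Mask Fusion (CMF).

    sam2_masks, referdino_masks: lists of boolean (H, W) arrays, one per frame.
    ratio: the 2/3 area threshold described in the report.
    """
    fused = []
    for m_sam, m_ref in zip(sam2_masks, referdino_masks):
        # If SAM2's mask is much smaller than ReferDINO's, SAM2 likely dropped
        # one of several referred objects: take the union of the two masks.
        if m_sam.sum() < ratio * m_ref.sum():
            fused.append(np.logical_or(m_sam, m_ref))
        else:
            # Single-object case: keep the higher-quality SAM2 mask.
            fused.append(m_sam)
    return fused
```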
5. Experimental Setup
- Datasets:
- MeViS: The primary benchmark for the competition. It is a large-scale RVOS dataset with 2,000 videos and 28,000 text descriptions. Its key features are a focus on motion-based language, descriptions referring to single or multiple objects, and even descriptions of non-existent objects, making it very challenging.
- Training Datasets: The ReferDINO model was trained in stages: first on referring image segmentation datasets (RefCOCO, RefCOCO+, RefCOCOg), then on earlier RVOS datasets (Refer-Youtube-VOS, Ref-DAVIS17), and finally fine-tuned on the MeViS training set.
- Evaluation Metrics:
- Region Similarity ($\mathcal{J}$):
- Conceptual Definition: Also known as the Jaccard Index or Intersection over Union (IoU), this metric measures the overlap between the predicted mask and the ground-truth mask. A value of 1 indicates a perfect match, while 0 indicates no overlap. It evaluates the overall spatial accuracy of the segmentation.
- Mathematical Formula:
$$
\mathcal{J} = \frac{|P \cap G|}{|P \cup G|}
$$
- Symbol Explanation:
- $P$: The predicted segmentation mask.
- $G$: The ground-truth segmentation mask.
- $|\cdot|$: The number of pixels in a set (mask).
- $P \cap G$: The intersection of the two masks (pixels where both are positive).
- $P \cup G$: The union of the two masks (pixels where at least one is positive).
- Contour Accuracy ($\mathcal{F}$):
- Conceptual Definition: This metric evaluates the accuracy of the predicted mask's boundary compared to the ground-truth boundary. It is typically calculated as an F-measure (the harmonic mean of precision and recall) on the boundary pixels. It is sensitive to the fine-grained details and "tightness" of the mask's outline.
- Mathematical Formula:
$$
\mathcal{F} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$
- Symbol Explanation:
- Precision and Recall are calculated over boundary pixels: Precision measures what fraction of predicted boundary pixels are correct, while Recall measures what fraction of true boundary pixels were found.
- Primary Metric ($\mathcal{J}\&\mathcal{F}$):
- Conceptual Definition: The official ranking metric for the challenge, which is the simple average of Region Similarity and Contour Accuracy. It provides a balanced assessment of both overall shape and boundary quality (a small code sketch of all three metrics follows this list).
- Mathematical Formula:
$$
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
$$
- Symbol Explanation:
- $\mathcal{J}$: The Region Similarity score.
- $\mathcal{F}$: The Contour Accuracy score.
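To make the three metrics concrete, here is a minimal NumPy/SciPy sketch; the boundary matching is a simplified stand-in for the official evaluation toolkit, so numbers may differ slightly:

```python
import numpy as np
from scipy import ndimage

def region_similarity(pred, gt):
    """J: Intersection over Union between boolean (H, W) masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union

def contour_accuracy(pred, gt, tol=2):
    """F: F-measure over boundary pixels, with a small distance tolerance."""
    pred_b = pred & ~ndimage.binary_erosion(pred)
    gt_b = gt & ~ndimage.binary_erosion(gt)
    # Distance from every pixel to the nearest boundary pixel of each mask.
    dist_to_gt = ndimage.distance_transform_edt(~gt_b)
    dist_to_pred = ndimage.distance_transform_edt(~pred_b)
    precision = (dist_to_gt[pred_b] <= tol).mean() if pred_b.any() else 1.0
    recall = (dist_to_pred[gt_b] <= tol).mean() if gt_b.any() else 1.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F: the average of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt))
```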
- Baselines:
- The main baseline for comparison is the standalone ReferDINO model.
- The ablation study also provides intermediate baselines: ReferDINO + SAM2 (without CMF) and ReferDINO + SAM2 + CMFv (with video-level fusion).
- The competition leaderboard (Table 1) shows results from other teams (MVP-Lab, HarborY, etc.), serving as external benchmarks.
6. Results & Analysis
- Core Results: The main result is the model's performance on the official MeViS test set leaderboard. ReferDINO-Plus achieved second place, demonstrating its competitiveness.
(Manual transcription of Table 1 from the paper)
Table 1. The leaderboard of the MeViS test set.

| Team | J&F | J | F |
| --- | --- | --- | --- |
| MVP-Lab | 61.98 | 58.83 | 65.14 |
| ReferDINO-Plus | 60.43 | 56.79 | 64.07 |
| HarborY | 56.26 | 52.68 | 59.84 |
| Pengsong | 55.91 | 53.06 | 58.76 |
| ssam2s | 55.16 | 52.00 | 58.33 |
| strong_kimchi | 55.02 | 51.78 | 58.27 |

This table shows ReferDINO-Plus outperforming the remaining competitors by a clear margin, with only the first-place team (MVP-Lab) scoring higher.
- Ablations / Parameter Sensitivity: The ablation study on the MeViS validation set is crucial for understanding why ReferDINO-Plus works. It systematically adds each component to the baseline ReferDINO model.
(Manual transcription of Table 2 from the paper)
Table 2. Ablation studies on the MeViS validation set.

| Method | J&F | J | F |
| --- | --- | --- | --- |
| ReferDINO | 51.67 | 47.94 | 55.40 |
| +SAM2 | 52.54 | 49.18 | 55.90 |
| +SAM2+CMFv | 54.82 | 51.39 | 58.24 |
| +SAM2+CMF | 55.27 | 51.80 | 58.75 |

- ReferDINO (Baseline): Achieves a respectable 51.67 J&F.
- +SAM2: Adding SAM2 for refinement provides a modest boost to 52.54. This shows that while SAM2 improves mask quality, its failure on multi-object cases limits the overall gain.
- +SAM2+CMFv: Introducing the Conditional Mask Fusion strategy, applied once for the whole video (CMFv), gives a large performance jump to 54.82. This confirms that explicitly handling the multi-object failure case is the key to unlocking performance.
- +SAM2+CMF (Final Method): Applying the fusion logic on a per-frame basis (CMF) provides a further small gain to 55.27. This suggests that per-frame adaptivity is slightly better than a single decision for the whole video.
- Overall, the full ReferDINO-Plus method improves upon the ReferDINO baseline by 3.6 points in J&F (55.27 vs. 51.67), with the CMF strategy contributing the majority of this gain.
- Visualization: Figure 2 in the paper visualizes the model's outputs on several challenging examples from the MeViS test set.
(Figure 2 panels: a sea turtle moving its flippers underwater across consecutive frames; two parrots perched on a hand with colored mask overlays; a sea turtle and red fish, illustrating object continuity under occlusion; three cats highlighted with blue, red, and green masks; groups of cats distinguished by per-instance colors; and cows alongside a truck segmented with separate instance masks.)
- The examples show the model can distinguish between objects based on subtle motion cues (e.g., "The parrot that is in motion" vs. "The parrot eating without moving").
- The "cows moving to left" example explicitly demonstrates successful segmentation of multiple objects, validating the effectiveness of the
Conditional Mask Fusionstrategy. - Overall, the visualizations confirm that
ReferDINO-Plusproduces accurate and high-quality masks that are consistent over time.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces ReferDINO-Plus, a highly effective two-stage solution for referring video object segmentation. By combining the semantic understanding of ReferDINO with the mask refinement capabilities of SAM2, and crucially, introducing a Conditional Mask Fusion strategy to handle multi-object scenarios, the method achieves state-of-the-art results. Its 2nd-place finish in the competitive MeViS challenge at CVPR 2025 validates the strength of this pragmatic and well-designed system.
- Limitations & Future Work:
- Heuristic-based Fusion: The paper does not explicitly state limitations, but a primary one is the reliance on a hand-crafted heuristic for the CMF strategy. The threshold (the 2/3 area ratio) is a fixed value that may not be optimal for all situations or datasets.
- Two-Stage Pipeline: The method is a two-stage pipeline, which can be less elegant and potentially slower than a fully end-to-end model. There is no joint optimization between the ReferDINO and SAM2 components.
- Future work could explore learning the fusion strategy with a small neural network or integrating the refinement process more deeply into an end-to-end trainable architecture.
- Personal Insights & Critique:
- Strengths: The brilliance of this paper lies in its simplicity and effectiveness. It's a prime example of strong engineering and system design in the era of foundation models. Rather than inventing a complex new architecture, the authors astutely identified the complementary strengths and weaknesses of two powerful existing models and designed a simple rule to combine them optimally. This approach is practical, reproducible, and highly effective.
- Critique: The CMF rule, while effective, feels like an "engineering fix." A more robust solution might learn to predict when SAM2 will fail, or learn how to fuse the masks dynamically based on features from both models. However, for a competition setting, a simple and effective heuristic is often superior to a more complex but harder-to-train solution.
- Transferability: This "identify-then-refine-and-fuse" paradigm is highly transferable to other dense prediction tasks in vision. The core insight—that one model may be better at high-level semantics while another is better at low-level details, and a smart fusion rule can combine them—is a powerful and generalizable lesson for building high-performance vision systems.