FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

2025-04-13 10:14:19
Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma

Abstract

Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses that weakness, aiming for accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), which adapts existing methods to a specific dataset through tailored training. We also introduce a morphological post-processing strategy to address the excessively large gaps between adjacent objects in single-model predictions. Finally, we apply voting-based fusion to multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% in the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW Challenge 2025.
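
The abstract names two post-hoc steps, morphological post-processing and voting-based fusion, without giving details. The sketch below is one plausible reading, not the authors' implementation. It assumes each prediction is a per-frame label map (0 for background, positive integers for object IDs) and that multi-scale predictions have already been resized to a common resolution. The function names `vote_fuse` and `dilate_labels`, the 4-neighbour dilation, and the tie-breaking rule are all illustrative assumptions. Plain nested lists are used so the example runs with the standard library only.

```python
from collections import Counter


def vote_fuse(masks):
    """Pixel-wise majority vote over label maps of identical size.

    masks: list of 2D lists of int object IDs (0 = background).
    Ties go to the label seen first, i.e. from the earliest mask (assumption).
    """
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            votes = Counter(m[y][x] for m in masks)
            fused[y][x] = votes.most_common(1)[0][0]
    return fused


def dilate_labels(label_map, iterations=1):
    """Grow every object by one pixel per iteration into background only.

    A background pixel takes the most frequent label among its 4 neighbours.
    This narrows gaps between adjacent objects; it is an assumed stand-in for
    the unspecified morphological strategy in the abstract.
    """
    height, width = len(label_map), len(label_map[0])
    current = [row[:] for row in label_map]
    for _ in range(iterations):
        grown = [row[:] for row in current]
        for y in range(height):
            for x in range(width):
                if current[y][x] != 0:
                    continue  # never overwrite an existing object pixel
                neighbours = Counter()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        if current[ny][nx] != 0:
                            neighbours[current[ny][nx]] += 1
                if neighbours:
                    grown[y][x] = neighbours.most_common(1)[0][0]
        current = grown
    return current


if __name__ == "__main__":
    # Three hypothetical predictions of the same 2x2 frame.
    scale_a = [[1, 1], [0, 2]]
    scale_b = [[1, 0], [0, 2]]
    scale_c = [[1, 1], [2, 2]]
    print(vote_fuse([scale_a, scale_b, scale_c]))
    # A one-row frame with a two-pixel gap between objects 1 and 2.
    print(dilate_labels([[1, 0, 0, 2]], iterations=1))
```

In practice the same logic would usually be written with array operations (for example NumPy) for speed. Whether dilation is applied before or after fusion, and with which structuring element, is not stated in the abstract.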

URL

https://arxiv.org/abs/2504.09507

PDF

https://arxiv.org/pdf/2504.09507.pdf

