Abstract
Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision, with a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming at accurate segmentation of video objects in challenging scenes. We propose Fine-tuning VOS (FVOS), which adapts existing methods to specific datasets through tailored training. In addition, we introduce a morphological post-processing strategy to address the excessively large gaps between adjacent objects that appear in single-model predictions. Finally, we apply voting-based fusion to multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% in the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW Challenge 2025.
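The abstract names two post-processing steps without implementation details, so the following is a minimal sketch of how they might look: per-object binary closing for the morphological step, and pixel-wise majority voting for the multi-scale fusion. All function names, the choice of a 3×3 structuring element, and the assumption that multi-scale masks are already resized to a common resolution are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import ndimage


def morphological_postprocess(mask: np.ndarray) -> np.ndarray:
    """Close small background gaps between adjacent object regions.

    Applies binary closing (dilation then erosion) to each object id
    separately, then fills only pixels that were background, so existing
    object labels are never overwritten. The 3x3 structuring element is
    an illustrative choice.
    """
    out = mask.copy()
    structure = np.ones((3, 3), dtype=bool)
    for obj_id in np.unique(mask):
        if obj_id == 0:  # skip background
            continue
        closed = ndimage.binary_closing(mask == obj_id, structure=structure)
        out[closed & (out == 0)] = obj_id
    return out


def multiscale_vote(masks) -> np.ndarray:
    """Pixel-wise majority vote over label maps predicted at different
    input scales (assumed already resized to one common resolution).
    Ties resolve to the lowest label id via argmax."""
    stacked = np.stack([np.asarray(m) for m in masks])        # (S, H, W)
    n_labels = int(stacked.max()) + 1
    labels = np.arange(n_labels)[:, None, None, None]         # (L, 1, 1, 1)
    counts = (stacked[None, ...] == labels).sum(axis=1)       # (L, H, W)
    return counts.argmax(axis=0)                              # (H, W)
```

For example, a one-pixel background gap between two fragments of the same object is filled by `morphological_postprocess`, and `multiscale_vote` keeps, at each pixel, the label predicted by the majority of the per-scale outputs.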
URL
https://arxiv.org/abs/2504.09507