Abstract
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by a query sentence throughout an entire video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally expensive and limits scalability. In this work, we aim to efficiently adapt foundation segmentation models to RVOS under weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to strengthen the association between position prompts and referring sentences using only box supervision, comprising Text-Contrastive Prompt Learning (TextCon) at the frame level and Modality-Contrastive Prompt Learning (ModalCon) at the video level. With the proposed TAP-CL, our GroPrompt framework generates temporally consistent yet text-aware position prompts that describe the locations and movements of the referred object across the video. Experimental results on the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of the proposed GroPrompt framework given only bounding box weak supervision.
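To make the prompt-to-sentence alignment idea concrete, below is a minimal, hedged sketch of an InfoNCE-style contrastive loss between position-prompt embeddings and referring-sentence embeddings, in the spirit of TextCon/ModalCon as summarized in the abstract. The tensor shapes, temperature value, symmetric two-direction loss, and function name are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F


def prompt_text_contrastive_loss(prompt_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss between N prompt embeddings and their N paired text embeddings.

    prompt_emb: (N, D) position-prompt features (e.g., pooled per frame or per video)
    text_emb:   (N, D) referring-sentence features from a text encoder
    """
    # Cosine-similarity logits between every prompt and every sentence.
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = prompt_emb @ text_emb.t() / temperature  # (N, N)

    # Matched pairs lie on the diagonal; all other pairs act as negatives.
    targets = torch.arange(prompt_emb.size(0), device=prompt_emb.device)

    # Symmetric cross-entropy over both prompt->text and text->prompt directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage: 4 (prompt, sentence) pairs with 256-dimensional embeddings.
    prompts = torch.randn(4, 256)
    sentences = torch.randn(4, 256)
    print(prompt_text_contrastive_loss(prompts, sentences).item())

Applied at the frame level this would pull each frame's prompt toward its sentence (as TextCon is described), and with video-pooled prompt features it would align the two modalities at the clip level (as ModalCon is described); the pooling and negative-sampling details are not specified in the abstract.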
URL
https://arxiv.org/abs/2406.12834