Abstract
Audiovisual segmentation (AVS) aims to identify the visual regions that correspond to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates the need for task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound-source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for fine-grained audiovisual segmentation.
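The abstract does not name the specific pretrained models or the exact bridging strategy, so the following is only a minimal, hypothetical sketch of a zero-shot pipeline of the kind described: an audio clip is embedded into a shared audio-text space, matched against candidate class labels, and the best-matching label is used as a text prompt for a promptable segmenter. All encoders and the segmenter below are random stand-ins, not the authors' models.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch only: stand-in modules illustrate the data flow
# (audio -> shared audio/text space -> text prompt -> segmentation mask).

torch.manual_seed(0)
EMB_DIM = 512

# Placeholders for frozen pretrained encoders (a CLAP-style audio/text pair
# and a text-promptable segmenter would be plausible choices).
audio_encoder = torch.nn.Linear(128, EMB_DIM)   # placeholder audio encoder
text_encoder = torch.nn.Linear(300, EMB_DIM)    # placeholder text encoder

candidate_labels = ["dog barking", "guitar", "car engine", "speech"]

def embed_text(label: str) -> torch.Tensor:
    """Placeholder text embedding; a real system would tokenize and encode the label."""
    feat = torch.full((300,), float(hash(label) % 997) / 997.0)
    return F.normalize(text_encoder(feat), dim=-1)

def embed_audio(audio_features: torch.Tensor) -> torch.Tensor:
    """Placeholder audio embedding; a real system would use log-mel features."""
    return F.normalize(audio_encoder(audio_features), dim=-1)

def segment_with_text_prompt(frame: torch.Tensor, prompt: str) -> torch.Tensor:
    """Placeholder for a text-promptable segmenter; returns a dummy binary mask."""
    return (frame.mean(dim=0) > frame.mean()).float()

# 1) Encode the audio clip and all candidate class names into a shared space.
audio_emb = embed_audio(torch.randn(128))
text_embs = torch.stack([embed_text(lbl) for lbl in candidate_labels])

# 2) Bridge the modality gap: pick the text label closest to the audio embedding.
scores = text_embs @ audio_emb
predicted_label = candidate_labels[int(scores.argmax())]

# 3) Use the predicted label as a text prompt to segment the sounding object.
frame = torch.randn(3, 224, 224)  # placeholder video frame
mask = segment_with_text_prompt(frame, predicted_label)
print(predicted_label, mask.shape)
```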
URL
https://arxiv.org/abs/2506.06537