T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

2024-04-02 09:07:05
Tanvir Mahmud, Yapeng Tian, Diana Marculescu

Abstract

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.
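The two-stage recipe in the abstract — first predict which classes are sounding, then use each class's text embedding as a bridge between the mixture audio and visual regions — can be sketched as follows. This is a minimal illustration assuming an AudioCLIP-style tri-modal encoder with a shared embedding space; the encoder interface (encode_text, encode_audio) and all variable names are assumptions for exposition, not the authors' released implementation.

# Minimal sketch of text-guided localization with a tri-modal encoder.
# The model interface (encode_text / encode_audio) is a hypothetical
# stand-in for an AudioCLIP-style joint audio/text/image embedding model.
import torch
import torch.nn.functional as F

def localize_sources(audio_mix, frame_patches, class_names, model, num_sources=2):
    """Return a coarse localization map per predicted sounding class.

    audio_mix     : (1, T) waveform of the multi-source mixture
    frame_patches : (N, D) visual patch embeddings of a video frame
    class_names   : list of candidate category labels
    num_sources   : how many sounding entities to localize
    """
    # Stage 1: predict the sounding classes by matching the mixture audio
    # against text embeddings of all candidate classes in the joint space.
    text_emb = F.normalize(model.encode_text(class_names), dim=-1)    # (C, D)
    audio_emb = F.normalize(model.encode_audio(audio_mix), dim=-1)    # (1, D)
    class_scores = (audio_emb @ text_emb.t()).squeeze(0)              # (C,)
    sounding = class_scores.topk(num_sources).indices

    # Stage 2: use each predicted class's text embedding as an intermediate
    # guide; its similarity to the visual patches yields a per-source map,
    # so no isolated single-source audio is ever required.
    patches = F.normalize(frame_patches, dim=-1)                      # (N, D)
    maps = {}
    for idx in sounding.tolist():
        sim = patches @ text_emb[idx]                                 # (N,)
        maps[class_names[idx]] = sim.softmax(dim=0)
    return maps

Because the class vocabulary enters only through text embeddings, swapping in names of unseen classes at test time is what enables the zero-shot transferability described in the abstract, and num_sources can vary per mixture, matching the flexible number of sources the framework supports.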


URL

https://arxiv.org/abs/2404.01751

PDF

https://arxiv.org/pdf/2404.01751.pdf
