Paper Reading AI Learner

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

2024-04-19 09:08:44
Zhaoxi Mu, Xinyu Yang

Abstract

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
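
The contrastive semantic matching loss is only described at a high level in the abstract. Below is a minimal, illustrative sketch assuming an InfoNCE-style formulation over paired semantic embeddings of the generated speech and the lip-movement sequence; the function name, embedding shapes, and temperature value are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_semantic_matching_loss(speech_emb, lip_emb, temperature=0.07):
    """InfoNCE-style loss that pulls each generated-speech semantic embedding
    toward the lip-movement embedding of the same utterance and pushes it
    away from the embeddings of other utterances in the batch.

    speech_emb: (B, D) semantic embeddings of the generated speech
    lip_emb:    (B, D) semantic embeddings of the lip-movement sequence
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)

    # Cosine-similarity logits between every speech/lip pair in the batch.
    logits = speech_emb @ lip_emb.t() / temperature          # (B, B)
    targets = torch.arange(speech_emb.size(0), device=logits.device)

    # Symmetric cross-entropy: speech-to-lip and lip-to-speech directions.
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss
```

The symmetric cross-entropy treats matched speech/lip pairs within a batch as positives and all other pairings as negatives, which is a common way to align semantic content across modalities; the paper's actual formulation may differ.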

URL

https://arxiv.org/abs/2404.12725

PDF

https://arxiv.org/pdf/2404.12725.pdf

