Paper Reading AI Learner

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

2024-04-19 09:08:44
Zhaoxi Mu, Xinyu Yang

Abstract

The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
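
The contrastive semantic matching loss is only described at a high level in the abstract. Below is a minimal, illustrative sketch assuming an InfoNCE-style formulation over paired semantic embeddings of the generated speech and the lip-movement sequence; the function name, embedding shapes, and temperature value are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_semantic_matching_loss(speech_emb, lip_emb, temperature=0.07):
    """InfoNCE-style loss that pulls each generated-speech semantic embedding
    toward the lip-movement embedding of the same utterance and pushes it
    away from the embeddings of other utterances in the batch.

    speech_emb: (B, D) semantic embeddings of the generated speech
    lip_emb:    (B, D) semantic embeddings of the lip-movement sequence
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)

    # Cosine-similarity logits between every speech/lip pair in the batch.
    logits = speech_emb @ lip_emb.t() / temperature          # (B, B)
    targets = torch.arange(speech_emb.size(0), device=logits.device)

    # Symmetric cross-entropy: speech-to-lip and lip-to-speech directions.
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss
```

The symmetric cross-entropy treats matched speech/lip pairs within a batch as positives and all other pairings as negatives, which is a common way to align semantic content across modalities; the paper's actual formulation may differ.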

URL

https://arxiv.org/abs/2404.12725

PDF

https://arxiv.org/pdf/2404.12725.pdf

