Paper Reading AI Learner

Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues

2024-02-26 08:38:32
Tassadaq Hussain, Kia Dashtipour, Yu Tsao, Amir Hussain

Abstract

In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly under dynamic noise conditions. This study investigates emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose an emotion-aware AVSE system that leverages both auditory and visual information: it extracts emotional features from the speaker's facial landmarks and fuses them with the corresponding audio and visual modalities. This enriched input feeds a deep UNet-based encoder-decoder network designed to fuse the emotion-augmented multimodal information. The network iteratively refines the enhanced speech representation, guided by perceptually inspired loss functions for joint learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of audio-visual recordings with annotated emotions. Our comprehensive evaluation demonstrates the effectiveness of emotion as a contextual cue for AVSE. By integrating emotional features, the proposed system achieves significant improvements in both objective and subjective assessments of speech quality and intelligibility, especially in challenging noise environments. Compared to baseline AVSE and audio-only speech enhancement systems, our approach exhibits a noticeable increase in PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility), indicating higher perceptual quality and intelligibility. Large-scale listening tests corroborate these findings, suggesting improved human understanding of the enhanced speech.
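
The abstract describes the architecture only at a high level. Purely as an illustration, the sketch below shows one way emotion features extracted from facial landmarks could be fused with audio and visual streams before a UNet-style encoder-decoder. All module names, dimensions, and the concatenation-based fusion are assumptions, not the authors' actual design.

```python
# Hypothetical sketch of emotion-aware multimodal fusion for AVSE.
# Dimensions and the concatenation-based fusion are illustrative
# assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class EmotionAwareFusion(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=512, landmark_dim=136, emb_dim=128):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, emb_dim)
        self.visual_proj = nn.Linear(visual_dim, emb_dim)
        # Emotion features derived from facial landmarks (e.g. 68 x/y points).
        self.emotion_proj = nn.Sequential(
            nn.Linear(landmark_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )
        self.fuse = nn.Linear(3 * emb_dim, emb_dim)

    def forward(self, noisy_spec, visual_feats, landmarks):
        # noisy_spec:   (batch, frames, audio_dim)    noisy magnitude spectrogram
        # visual_feats: (batch, frames, visual_dim)   lip/face embeddings
        # landmarks:    (batch, frames, landmark_dim) flattened facial landmarks
        a = self.audio_proj(noisy_spec)
        v = self.visual_proj(visual_feats)
        e = self.emotion_proj(landmarks)
        # Concatenate per frame and mix; the fused sequence would then feed
        # a UNet-style encoder-decoder that predicts the enhanced speech.
        return self.fuse(torch.cat([a, v, e], dim=-1))

fusion = EmotionAwareFusion()
fused = fusion(torch.randn(2, 100, 257),
               torch.randn(2, 100, 512),
               torch.randn(2, 100, 136))
print(fused.shape)  # torch.Size([2, 100, 128])
```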
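
The reported gains are measured with PESQ and STOI. Below is a minimal sketch of how these objective scores are typically computed with the `pesq` and `pystoi` Python packages, assuming 16 kHz mono signals; the abstract does not specify the paper's exact evaluation protocol.

```python
# Minimal PESQ/STOI scoring sketch, assuming 16 kHz mono signals.
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

fs = 16000
t = np.arange(fs * 2) / fs
# Synthetic stand-ins so the snippet is self-contained; a real evaluation
# uses clean reference and enhanced utterances from the test set.
clean = np.sin(2 * np.pi * 440 * t).astype(np.float32)
enhanced = clean + 0.05 * np.random.randn(len(clean))

# PESQ: perceptual quality, roughly -0.5..4.5 ('wb' = wideband mode at 16 kHz).
pesq_score = pesq(fs, clean, enhanced, 'wb')
# STOI: intelligibility in 0..1; higher means more intelligible.
stoi_score = stoi(clean, enhanced, fs, extended=False)
print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.2f}")
```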


URL

https://arxiv.org/abs/2402.16394

PDF

https://arxiv.org/pdf/2402.16394.pdf

