Time-Domain Multi-modal Bone/air Conducted Speech Enhancement

2019-11-22 04:17:17

Cheng Yu, Yan-Ting Lin, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

arXiv_SD

arXiv_SD CNN Deep_Learning Pose Enhancement Speech

Abstract

Integrating additional modalities, such as video signals, with speech has been shown to improve quality and intelligibility in speech enhancement (SE). However, video clips usually contain large amounts of data and are costly in terms of computational resources, which may complicate the SE system. By contrast, a bone-conducted speech signal has a moderate data size while still exhibiting speech-phoneme structures, and thus complements its air-conducted counterpart, benefiting the enhancement. In this study, we propose a novel multi-modal SE structure that leverages bone- and air-conducted signals. In addition, we examine two strategies, early fusion (EF) and late fusion (LF), to process the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. The experimental results indicate that the proposed multi-modal structure significantly outperforms its single-source SE counterparts (with a bone- or air-conducted signal only) on various speech evaluation metrics. Furthermore, adopting the LF strategy rather than the EF strategy in this multi-modal SE structure achieves better results.
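
The difference between early fusion (EF) and late fusion (LF) can be illustrated with a minimal NumPy sketch. It is not the authors' network: the abstract gives no layer counts, kernel sizes, or training details, so the two-layer layout, the hidden width, the kernel length, and the helper names `conv1d`, `early_fusion`, and `late_fusion` are all assumptions made for illustration. The sketch only shows where the bone- and air-conducted waveforms are combined in each strategy, under one common reading of the two terms.

```python
import numpy as np


def conv1d(x, kernels):
    """'Same'-length 1-D convolution over the time axis.

    x:       array of shape (in_channels, T)
    kernels: array of shape (out_channels, in_channels, K), with K <= T
    Note: np.convolve flips the kernel (true convolution), which does not
    matter for this illustration.
    """
    out_ch, in_ch, _ = kernels.shape
    out = np.zeros((out_ch, x.shape[1]))
    for o in range(out_ch):
        for i in range(in_ch):
            out[o] += np.convolve(x[i], kernels[o, i], mode="same")
    return out


def relu(x):
    return np.maximum(x, 0.0)


def early_fusion(air, bone, k_in, k_out):
    """EF (assumed form): stack both waveforms as input channels,
    then run a single convolutional stack on the joint input."""
    x = np.stack([air, bone])            # (2, T)
    h = relu(conv1d(x, k_in))            # (hidden, T)
    return conv1d(h, k_out)[0]           # (T,) enhanced waveform


def late_fusion(air, bone, k_air, k_bone, k_fuse):
    """LF (assumed form): process each waveform in its own branch,
    then merge the branch outputs with a final convolution."""
    h_air = relu(conv1d(air[None, :], k_air))     # (hidden, T)
    h_bone = relu(conv1d(bone[None, :], k_bone))  # (hidden, T)
    h = np.concatenate([h_air, h_bone])           # (2 * hidden, T)
    return conv1d(h, k_fuse)[0]                   # (T,) enhanced waveform


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, hidden, K = 160, 4, 5             # hypothetical sizes
    air = rng.standard_normal(T)         # stand-in for a noisy air-conducted frame
    bone = rng.standard_normal(T)        # stand-in for a bone-conducted frame

    ef = early_fusion(air, bone,
                      rng.standard_normal((hidden, 2, K)),
                      rng.standard_normal((1, hidden, K)))
    lf = late_fusion(air, bone,
                     rng.standard_normal((hidden, 1, K)),
                     rng.standard_normal((hidden, 1, K)),
                     rng.standard_normal((1, 2 * hidden, K)))
    print(ef.shape, lf.shape)
```

In this sketch the weights are random, so the outputs are not meaningful enhanced speech. The point is structural: EF mixes the two signals at the first layer, whereas LF keeps them in separate branches until the final merge.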

URL

https://arxiv.org/abs/1911.09847

PDF

https://arxiv.org/pdf/1911.09847.pdf