Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion

2021-11-02 03:00:19

Teng Gao, Jian Zhou, Huabin Wang, Liang Tao, Hon Keung Kwan

arXiv_SD

arXiv_SD GAN Adversarial Attention Pose Speech

Abstract

Whispered speech is a mode of pronunciation produced without vocal cord vibration. It does not contain a fundamental frequency, and its energy is about 20 dB lower than that of normal speech. Converting whispered speech into normal speech can improve speech quality and intelligibility. This paper proposes a novel attention-guided generative adversarial network for whisper-to-normal speech conversion (AGAN-W2SC) that incorporates an autoencoder, a Siamese neural network, and an identity mapping loss function. The proposed method avoids the challenge of estimating the fundamental frequency of the normal voiced speech converted from whispered speech. In addition, the proposed model is more amenable to practical applications because it does not need aligned speech features for training. Experimental results demonstrate that AGAN-W2SC achieves improved speech quality and intelligibility compared with dynamic-time-warping-based methods.
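
To put the roughly 20 dB energy gap in concrete terms, the short sketch below converts a decibel difference into power and amplitude ratios. It is only an illustration of the decibel arithmetic, not part of the proposed model: the function names are hypothetical, and the attenuated sine wave is a stand-in signal, not real whispered speech.

```python
import math


def db_to_power_ratio(db: float) -> float:
    """Power (energy) ratio that corresponds to a difference in decibels."""
    return 10.0 ** (db / 10.0)


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio that corresponds to a difference in decibels."""
    return 10.0 ** (db / 20.0)


def energy_db(samples):
    """Mean signal energy expressed in decibels."""
    energy = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(energy)


# Stand-in "normal" signal: one second of a 100 Hz sine sampled at 8 kHz
# (an assumption for illustration only).
normal = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]

# Attenuate the amplitude so that the energy drops by 20 dB.
quiet = [x / db_to_amplitude_ratio(20.0) for x in normal]

gap_db = energy_db(normal) - energy_db(quiet)
print(f"power ratio: {db_to_power_ratio(20.0):.0f}x")
print(f"amplitude ratio: {db_to_amplitude_ratio(20.0):.0f}x")
print(f"measured gap: {gap_db:.2f} dB")
```

Under these assumptions, a 20 dB gap means the quieter signal carries about one hundredth of the energy, or one tenth of the amplitude, of the louder one.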

URL

https://arxiv.org/abs/2111.01342

PDF

https://arxiv.org/pdf/2111.01342.pdf