Abstract
This paper presents an unsupervised, segment-based method for robust voice activity detection (rVAD). The method consists of two denoising passes followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using an a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and is set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments and, based on speech statistics, the pitch segments are extended at both ends to include both voiced and unvoiced sounds, and likely some non-speech parts as well. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity. We evaluate the VAD performance of the proposed method on two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD in which computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces computational complexity at the cost of moderately inferior VAD performance; the reduced complexity is an advantage when processing large amounts of data and when running on low-resource devices. The source code of rVAD is made publicly available.
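As a rough illustration of two ideas mentioned in the abstract, the sketch below computes the spectral flatness of a power spectrum (the quantity the modified rVAD uses in place of pitch extraction) and groups neighbouring flagged frames into segments that are then extended at both ends. It is not taken from the rVAD source code. The function names and the fixed `margin` are assumptions made for illustration; the paper derives the extension from speech statistics, which the abstract does not specify.

```python
import math


def spectral_flatness(power_spectrum):
    """Ratio of geometric mean to arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky spectrum, as is typical of voiced speech.
    """
    eps = 1e-12  # guards against log(0) for empty bins
    values = [p + eps for p in power_spectrum]
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean) / (sum(values) / len(values))


def group_and_extend(frame_flags, margin):
    """Group consecutive flagged frames into segments, then widen each one.

    frame_flags: per-frame booleans (e.g. "pitch detected in this frame").
    margin: hypothetical number of frames added at each end of a segment.
    Returns (start, end) frame indices with end exclusive. Segments that
    overlap or touch after widening are merged.
    """
    n = len(frame_flags)
    segments = []
    start = None
    for i, flag in enumerate(frame_flags):
        if flag and start is None:
            start = i                      # a new segment begins
        elif not flag and start is not None:
            segments.append((start, i))    # the segment ended at frame i - 1
            start = None
    if start is not None:
        segments.append((start, n))        # segment runs to the last frame

    merged = []
    for s, e in segments:
        s, e = max(0, s - margin), min(n, e + margin)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

For example, `group_and_extend([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0], 1)` yields the segments `(2, 6)` and `(9, 12)`. In a full detector, the energy-based decision would then be applied only inside these extended segments.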
URL
https://arxiv.org/abs/1906.03588