Paper Reading AI Learner

rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method

2019-06-09 07:51:23
Zheng-Hua Tan, Achintya kr. Sarkar, Najim Dehak

Abstract

This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using the a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD where the computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately lower VAD performance, which is an advantage when processing large amounts of data and running on low-resource devices. The source code of rVAD is made publicly available.
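The central scoring step in the abstract, the a posteriori SNR weighted energy difference, can be illustrated with a small sketch. This is only one plausible reading of that step, not the paper's exact formulation: the frame length, hop size, noise-floor percentile, and threshold below are illustrative assumptions, and the released rVAD code should be consulted for the actual weighting and post-processing.

```python
import numpy as np

def frame_log_energies(x, frame_len=400, hop=160):
    """Short-time log energy per frame (25 ms window / 10 ms hop at 16 kHz; assumed values)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    e = np.empty(n_frames)
    for i in range(n_frames):
        frm = x[i * hop: i * hop + frame_len]
        e[i] = np.log(np.sum(frm ** 2) + 1e-12)
    return e

def snr_weighted_energy_diff(log_e, noise_floor_pct=10.0):
    """A posteriori SNR weighted energy difference (illustrative form, not the paper's exact formula).

    The noise level is estimated as a low percentile of the frame energies,
    the a posteriori SNR of a frame as its energy above that floor, and the
    score of a frame as the absolute energy change from the previous frame
    weighted by its SNR. Runs of consecutive high-score frames are candidates
    for high-energy (speech or noise) segments.
    """
    noise = np.percentile(log_e, noise_floor_pct)    # crude noise-floor estimate
    snr = np.maximum(log_e - noise, 0.0)             # a posteriori SNR, log domain
    diff = np.abs(np.diff(log_e, prepend=log_e[0]))  # frame-to-frame energy change
    return snr * diff

if __name__ == "__main__":
    # Dummy input: 1 s of noise; flag frames whose score exceeds a global threshold.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)
    scores = snr_weighted_energy_diff(frame_log_energies(x))
    high_energy = scores > scores.mean()             # boolean mask over frames
    print(high_energy.sum(), "of", high_energy.size, "frames flagged")
```

In rVAD, high-energy segments found this way that contain no pitched frames are treated as noise bursts and zeroed out before the second denoising pass.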

Abstract (translated)

This paper presents an unsupervised, segment-based robust voice activity detection method (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in the speech signal are detected using the a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is regarded as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, and several such methods are explored. Next, neighbouring frames with pitch are grouped into pitch segments, which are further extended from both ends based on speech statistics so as to include voiced and unvoiced sounds as well as likely non-speech parts. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity. We evaluate the VAD performance of the method on two databases, RATS and Aurora-2, which contain a wide variety of noise conditions. The rVAD method is further evaluated in terms of speaker verification performance on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with several existing methods. In addition, we present a modified version of rVAD in which the computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation. The modified version greatly reduces the computational complexity at the cost of moderately lower VAD performance, which is an advantage when processing large amounts of data and running on low-resource devices. The source code of rVAD is publicly available.
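The modified version mentioned at the end of the abstract replaces pitch extraction with spectral flatness, which is cheap to compute per frame. A minimal sketch of per-frame spectral flatness (geometric mean over arithmetic mean of the magnitude spectrum) is shown below; the window length, hop, and the 0.5 decision threshold are assumptions for illustration, not parameters taken from the paper.

```python
import numpy as np

def spectral_flatness(x, frame_len=512, hop=160, eps=1e-12):
    """Per-frame spectral flatness: geometric mean divided by arithmetic mean
    of the magnitude spectrum. Values near 0 indicate tonal (voiced-like)
    frames, values near 1 noise-like frames."""
    win = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    sf = np.empty(n_frames)
    for i in range(n_frames):
        frm = x[i * hop: i * hop + frame_len] * win
        mag = np.abs(np.fft.rfft(frm)) + eps
        sf[i] = np.exp(np.mean(np.log(mag))) / np.mean(mag)
    return sf

def voiced_like_frames(x, thresh=0.5):
    """Frames with flatness below a threshold stand in for 'frames with pitch'
    in the segment-forming step; the 0.5 threshold is illustrative only."""
    return spectral_flatness(x) < thresh
```

Swapping this test for a pitch tracker is what trades some VAD accuracy for the large reduction in computation reported for the modified rVAD.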

URL

https://arxiv.org/abs/1906.03588

PDF

https://arxiv.org/pdf/1906.03588.pdf

