End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

2023-03-21 16:33:56
Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

Abstract

Recent work shows that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) to each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, we propose a novel, causal and computationally efficient leakage removal algorithm, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between the SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 seconds. Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to that obtained with oracle sources in some configurations.
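The two-stage idea described in the abstract (separate the speakers, then run VAD on each separated stream) can be illustrated with a minimal sketch. This is not the authors' system: it assumes the separation step has already produced one waveform per speaker, and it swaps in a toy energy-threshold VAD in place of a neural one. The function names, the 160-sample frame length, and the -40 dB threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def frame_energy_vad(stream, frame_len=160, threshold_db=-40.0):
    """Toy energy-based VAD (assumption: stands in for a learned VAD).

    Returns one boolean per non-overlapping frame of `frame_len` samples.
    """
    n_frames = len(stream) // frame_len
    frames = np.reshape(stream[: n_frames * frame_len], (n_frames, frame_len))
    energy = np.mean(frames ** 2, axis=1)
    # Small constant avoids log10(0) on digital silence.
    energy_db = 10.0 * np.log10(energy + 1e-12)
    return energy_db > threshold_db


def diarize_separated(streams, frame_len=160, threshold_db=-40.0):
    """Apply the VAD to each separated stream independently.

    `streams` is a sequence of equal-length 1-D arrays, one per speaker,
    assumed to come from an upstream separation model (not shown here).
    Returns a boolean array of shape (n_speakers, n_frames).
    """
    return np.stack(
        [frame_energy_vad(s, frame_len, threshold_db) for s in streams]
    )


# Synthetic two-speaker example at 8 kHz (telephone bandwidth):
# speaker 0 talks in the first half second, speaker 1 in the second.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = 0.1 * np.sin(2 * np.pi * 200 * t)
speaker_0 = np.where(t < 0.5, tone, 0.0)
speaker_1 = np.where(t >= 0.5, tone, 0.0)

activity = diarize_separated([speaker_0, speaker_1])
print(activity.shape)  # (n_speakers, n_frames)
```

Each row of `activity` is a per-frame speech/non-speech decision for one speaker, so overlapped speech simply shows up as frames where more than one row is active. In a real system, residual leakage between separated streams would trigger false alarms in a VAD like this one, which is the problem the leakage removal step mentioned in the abstract targets.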

URL

https://arxiv.org/abs/2303.12002

PDF

https://arxiv.org/pdf/2303.12002.pdf

