End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Abstract

Recent work shows that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. SSGD performs diarization by first separating the speakers and then applying voice activity detection (VAD) to each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, we propose a novel, causal, and computationally efficient leakage removal algorithm, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between the SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% diarization error rate (DER) on CALLHOME, outperforming the current state-of-the-art end-to-end neural diarization model despite being trained on an order of magnitude less data and having significantly lower latency (0.1 vs. 1 second). Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to that obtained with oracle sources in some configurations.
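
To make the two-stage SSGD idea concrete, the following is a minimal sketch of the "separate, then run VAD per stream" pipeline. It is not the paper's implementation: the separation step is assumed to have already produced one waveform per speaker, and a simple frame-energy threshold stands in for the neural VAD. The names `frame_energy_vad` and `ssgd_diarize`, the `threshold` value, and the default 0.1 s frame length are all illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def frame_energy_vad(
    stream: Sequence[float],
    sample_rate: int,
    frame_sec: float = 0.1,
    threshold: float = 1e-3,
) -> List[Segment]:
    """Toy VAD (assumption): a frame is speech if its mean energy exceeds threshold."""
    frame_len = max(1, int(sample_rate * frame_sec))
    n_frames = len(stream) // frame_len
    segments: List[Segment] = []
    start = None  # start time of the currently open speech segment, if any
    for i in range(n_frames):
        frame = stream[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active = energy > threshold
        t = i * frame_sec
        if active and start is None:
            start = t  # speech onset
        elif not active and start is not None:
            segments.append((start, t))  # speech offset
            start = None
    if start is not None:
        # Close a segment that is still open at the end of the stream.
        segments.append((start, n_frames * frame_sec))
    return segments


def ssgd_diarize(
    separated_streams: Sequence[Sequence[float]],
    sample_rate: int,
    frame_sec: float = 0.1,
    threshold: float = 1e-3,
) -> Dict[str, List[Segment]]:
    """Run the VAD independently on each separated stream (one stream per speaker)."""
    return {
        f"speaker_{k}": frame_energy_vad(s, sample_rate, frame_sec, threshold)
        for k, s in enumerate(separated_streams)
    }
```

Because each stream is processed independently, overlapping speech is handled naturally: both speakers can be active in the same frame. In this toy version, any residual energy leaking from one separated stream into the other would show up as a false alarm, which is the kind of error the leakage removal step described in the abstract is meant to reduce.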

URL

https://arxiv.org/abs/2303.12002

PDF

https://arxiv.org/pdf/2303.12002.pdf