Abstract
Despite the current popularity of end-to-end diarization systems, modular systems composed of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) the different modules independently. In this work, we propose an approach to jointly train a model that produces speaker embeddings, VAD and OSD simultaneously, reaching competitive performance at a fraction of the inference time of a standard approach. Furthermore, joint inference simplifies the overall pipeline, bringing us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
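To make the idea of joint inference concrete, the following is a minimal sketch of one forward pass that returns a speaker embedding together with per-frame VAD and OSD probabilities. It is an illustration only, not the architecture of the paper: the class name `JointDiarizationHeads`, the layer sizes, the random untrained weights, and the mean-pooling step are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class JointDiarizationHeads:
    """Toy shared encoder with three output heads.

    Hypothetical illustration: the weights are random and untrained, and
    the layout is an assumption, not the model described in the abstract.
    """

    def __init__(self, feat_dim=40, hidden_dim=64, emb_dim=32):
        self.w_enc = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
        self.w_vad = rng.standard_normal((hidden_dim, 1)) * 0.1
        self.w_osd = rng.standard_normal((hidden_dim, 1)) * 0.1
        self.w_emb = rng.standard_normal((hidden_dim, emb_dim)) * 0.1

    def forward(self, feats):
        # Shared frame-level representation, computed once: (frames, hidden_dim)
        hidden = np.tanh(feats @ self.w_enc)
        # Per-frame speech probability (VAD head): (frames,)
        vad = sigmoid(hidden @ self.w_vad)[:, 0]
        # Per-frame overlapped-speech probability (OSD head): (frames,)
        osd = sigmoid(hidden @ self.w_osd)[:, 0]
        # Segment-level embedding: mean-pool over frames, project, L2-normalize
        emb = hidden.mean(axis=0) @ self.w_emb
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        return emb, vad, osd


model = JointDiarizationHeads()
feats = rng.standard_normal((100, 40))  # 100 frames of 40-dim features
emb, vad, osd = model.forward(feats)
```

In a modular pipeline the three outputs would typically come from three separately run models; here the shared representation is computed once and reused by every head, which is the source of the inference-time saving the abstract refers to.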
URL
https://arxiv.org/abs/2411.02165