Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

2024-11-04 15:23:37
Petr Pálka, Federico Landini, Dominik Klement, Mireia Diez, Anna Silnova, Marc Delcroix, Lukáš Burget

Abstract

In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD and OSD simultaneously and reach competitive performance at a fraction of the inference time of a standard approach. Furthermore, the joint inference leads to a simplified overall pipeline which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
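
To make the role of the VAD and OSD outputs concrete, the following is a minimal sketch of how frame-level speech and overlap probabilities could be turned into a per-frame speaker count before embeddings are clustered and assigned. It is not the authors' implementation: the function name `speakers_per_frame`, the 0.5 thresholds, and the cap of two simultaneous speakers are all assumptions made for illustration.

```python
from typing import List


def speakers_per_frame(
    vad_probs: List[float],
    osd_probs: List[float],
    vad_threshold: float = 0.5,  # assumed value, not taken from the paper
    osd_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[int]:
    """Map frame-level VAD and OSD probabilities to a speaker count per frame.

    0 = silence, 1 = single speaker, 2 = overlapped speech.
    Capping overlap at two speakers is a simplifying assumption.
    """
    if len(vad_probs) != len(osd_probs):
        raise ValueError("VAD and OSD outputs must cover the same frames")

    counts = []
    for p_speech, p_overlap in zip(vad_probs, osd_probs):
        if p_speech < vad_threshold:
            # Non-speech frames get no speaker, whatever the OSD output says.
            counts.append(0)
        elif p_overlap >= osd_threshold:
            # Speech frame flagged as overlapped: assign two speakers.
            counts.append(2)
        else:
            counts.append(1)
    return counts


if __name__ == "__main__":
    vad = [0.1, 0.9, 0.95, 0.2]
    osd = [0.0, 0.2, 0.8, 0.9]
    print(speakers_per_frame(vad, osd))
```

In a modular pipeline the two probability streams would come from separately run models, whereas in the joint setup described in the abstract a single model would produce them together with the speaker embeddings in one pass.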

URL

https://arxiv.org/abs/2411.02165

PDF

https://arxiv.org/pdf/2411.02165.pdf

