
A Review of Common Online Speaker Diarization Methods

2024-06-20 16:26:03
Roman Aperdannier, Sigurd Schacht, Alexander Piazza

Abstract

Speaker diarization answers the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after an audio segment arrives. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview of online speaker diarization. First, the history of online speaker diarization is briefly presented. Next, a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. The paper concludes by presenting challenges that future research in online speaker diarization still needs to solve.

URL

https://arxiv.org/abs/2406.14464

PDF

https://arxiv.org/pdf/2406.14464.pdf

