Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning

2024-04-16 18:40:28
Amit Kumar Bhuyan, Hrishikesh Dutta, Subir Biswas

Abstract

This paper presents a computationally efficient, distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model that can identify the participants in a conversation without requiring a large audio database for training. An unsupervised online update mechanism is proposed for the Federated Learning model, which depends on the cosine similarity of speaker embeddings. Moreover, the proposed diarization system solves the problem of speaker change detection via unsupervised segmentation techniques using Hotelling's t-squared statistic and the Bayesian Information Criterion. In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates. Additionally, the computational overhead due to frame-by-frame identification of speakers is reduced via unsupervised clustering of speech segments. The results demonstrate the effectiveness of the proposed training method in the presence of non-IID speech data. They also show a considerable reduction in false and missed detections at the segmentation stage, along with reduced computational overhead. Improved accuracy and reduced computational cost make the mechanism suitable for real-time speaker diarization across a distributed IoT audio network.
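
As a rough illustration of how cosine similarity of speaker embeddings can drive an unsupervised online update, the sketch below assigns each new embedding to the closest known speaker or enrolls a new one. This is not the paper's method: the function names, the running-mean centroid update, the 0.7 threshold, and the toy 3-dimensional embeddings are all assumptions made for the example, and the federated aggregation step is left out entirely.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speaker(embedding, centroids, counts, threshold=0.7):
    """Match an embedding to a known speaker or enroll a new one.

    `centroids` and `counts` are updated in place. The threshold is a
    hypothetical value chosen for this example, not one from the paper.
    Returns the index of the matched or newly enrolled speaker.
    """
    best_idx, best_sim = -1, -1.0
    for idx, centroid in enumerate(centroids):
        sim = cosine_similarity(embedding, centroid)
        if sim > best_sim:
            best_idx, best_sim = idx, sim

    if best_idx >= 0 and best_sim >= threshold:
        # Known speaker: fold the new embedding into a running mean.
        n = counts[best_idx]
        centroids[best_idx] = [
            (c * n + e) / (n + 1) for c, e in zip(centroids[best_idx], embedding)
        ]
        counts[best_idx] = n + 1
        return best_idx

    # No centroid is similar enough: treat this as a new speaker.
    centroids.append(list(embedding))
    counts.append(1)
    return len(centroids) - 1


# Toy usage: four segment embeddings from two made-up speakers.
centroids, counts = [], []
segments = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = [assign_speaker(s, centroids, counts) for s in segments]
print(labels)  # [0, 0, 1, 1]
```

Because no labels are needed, a scheme like this can run on each device as segments arrive. The threshold controls the balance between merging distinct speakers and splitting one speaker into several.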

URL

https://arxiv.org/abs/2404.10842

PDF

https://arxiv.org/pdf/2404.10842.pdf
