Abstract
This paper presents a computationally efficient, distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model that can identify the participants in a conversation without requiring a large audio database for training. An unsupervised online update mechanism is proposed for the Federated Learning model, which depends on the cosine similarity of speaker embeddings. Moreover, the proposed diarization system addresses speaker change detection via unsupervised segmentation techniques using Hotelling's t-squared statistic and the Bayesian Information Criterion. In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates. Additionally, the computational overhead due to frame-by-frame identification of speakers is reduced via unsupervised clustering of speech segments. The results demonstrate the effectiveness of the proposed training method in the presence of non-IID speech data. They also show a considerable reduction in false and missed detections at the segmentation stage, together with lower computational overhead. Improved accuracy and reduced computational cost make the mechanism suitable for real-time speaker diarization across a distributed IoT audio network.
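To make the two segmentation statistics named in the abstract concrete, the following is a minimal sketch of how a candidate speaker change point between two adjacent feature windows could be scored. It uses the textbook two-sample Hotelling's t-squared statistic and the standard full-covariance Gaussian delta-BIC criterion; it is not the paper's implementation. The function names, the `penalty_weight` parameter, the use of unbiased sample covariances, and the synthetic 4-dimensional features are all assumptions made for illustration, and the biasing around quasi-silences described in the abstract is not modeled here.

```python
import numpy as np


def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 between windows x and y.

    x, y: arrays of shape (n_frames, n_features), n_features >= 2.
    Larger values suggest the windows have different mean vectors.
    (Hypothetical helper; assumes a shared, pooled covariance.)
    """
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    # Solve pooled @ v = diff instead of inverting the matrix explicitly.
    return float(n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(pooled, diff))


def delta_bic(x, y, penalty_weight=1.0):
    """Delta-BIC for splitting the concatenation of x and y at their boundary.

    Compares one full-covariance Gaussian over both windows against one
    Gaussian per window. A positive value favors a change point.
    (Hypothetical helper; penalty_weight is an assumed tuning parameter.)
    """
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        # slogdet returns (sign, log|det|); the log-determinant is index 1.
        return np.linalg.slogdet(np.cov(a, rowvar=False))[1]

    gain = 0.5 * (n * logdet_cov(z) - n1 * logdet_cov(x) - n2 * logdet_cov(y))
    # Extra parameters of the two-model hypothesis: d means + d(d+1)/2 covariances.
    penalty = 0.5 * penalty_weight * (d + d * (d + 1) / 2) * np.log(n)
    return float(gain - penalty)


if __name__ == "__main__":
    # Synthetic stand-ins for acoustic features (not real speech data).
    rng = np.random.default_rng(0)
    same_a = rng.normal(0.0, 1.0, size=(200, 4))
    same_b = rng.normal(0.0, 1.0, size=(200, 4))
    other = rng.normal(3.0, 1.0, size=(200, 4))
    print("T2 same/other:", hotelling_t2(same_a, same_b), hotelling_t2(same_a, other))
    print("dBIC same/other:", delta_bic(same_a, same_b), delta_bic(same_a, other))
```

In a pipeline of this kind, the cheaper t-squared score is commonly used to shortlist candidate boundaries and delta-BIC to confirm them; whether the paper combines the two in this way is not stated in the abstract.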
URL
https://arxiv.org/abs/2404.10842