Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

2021-09-06 16:22:03

Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

arXiv_SD

Abstract

Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterances and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified by nearest-neighbor search under the scoring metric. To better distinguish speakers sharing a device within the same household, we propose a household-adapted nonlinear mapping to a low-dimensional space to complement the global scoring metric. The combined scoring function is optimized on labeled or pseudo-labeled speaker utterances. With input dropout, the proposed scoring model reduces the equal error rate (EER) by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers per household. On real-world internal data, the EER reduction is 49.2%. Using t-SNE visualization, we also show that clusters formed by household-adapted speaker embeddings are more compact and more uniformly distributed than clusters formed by global embeddings before adaptation.
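The second and third stages described in the abstract (scoring a runtime utterance against each profile, then picking the nearest neighbor) can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity is assumed as the global scoring function, the embeddings are made-up 3-dimensional vectors, and the names `cosine_score`, `identify_speaker`, and `household` are hypothetical. The household-adapted mapping proposed in the paper is not modeled here.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no similarity
    return dot / (norm_a * norm_b)


def identify_speaker(utterance, profiles, score_fn=cosine_score):
    """Score the utterance against every profile and return the best match.

    `profiles` maps a speaker name to that speaker's profile embedding.
    Returns the best-scoring speaker name and the full score table.
    """
    scores = {name: score_fn(utterance, emb) for name, emb in profiles.items()}
    best = max(scores, key=scores.get)  # nearest neighbor under the scoring metric
    return best, scores


# Toy household with two enrolled speakers (illustrative values only).
household = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.9, 0.2],
}
best, scores = identify_speaker([0.8, 0.2, 0.1], household)
print(best, scores)
```

Passing the scoring function as a parameter leaves room to swap in a different metric, such as a learned or household-specific score, without changing the nearest-neighbor step.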

URL

https://arxiv.org/abs/2109.02576

PDF

https://arxiv.org/pdf/2109.02576.pdf