Disentangled speaker and nuisance attribute embedding for robust speaker verification

2020-08-07 07:31:21

Woo Hyun Kang, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

arXiv_SD Deep_Learning Embedding Pose Emotion Speech

Abstract

In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, like most classical embedding techniques, these methods are known to suffer severe performance degradation when the speech samples come from different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with conventional deep learning-based embedding methods on the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings that are robust to channel and emotional variability.

URL

https://arxiv.org/abs/2008.03024

PDF

https://arxiv.org/pdf/2008.03024.pdf