Abstract
This paper proposes an architecture based on Long Short-Term Memory (LSTM) networks for text-independent speaker verification. The architecture aims to capture temporal speaker-related information by operating over traditional speech features. Speaker verification conventionally starts by creating a background model for speaker representation. Then, in the enrollment stage, speaker models are created from the enrollment utterances. In this work, the model is trained in an end-to-end fashion to combine these first two stages. The main goal of end-to-end training is to optimize the model so that it is consistent with the speaker verification protocol. End-to-end training jointly learns the background and speaker models by creating the representation space. The LSTM architecture is trained to create a discriminative space for validating match and non-match pairs in speaker verification. The proposed architecture demonstrates its superiority over traditional methods in the text-independent setting.
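To make the match / non-match objective concrete, the following is a minimal sketch of a pairwise loss and a verification decision on fixed-length speaker embeddings. The abstract does not name the loss, the distance measure, or the decision rule, so the contrastive loss, the Euclidean distance, the `margin` and `threshold` values, and all function names here are illustrative assumptions rather than the authors' method. The embeddings stand in for whatever the LSTM would output for an utterance.

```python
import math


def euclidean_distance(emb_a, emb_b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(emb_a, emb_b)))


def contrastive_loss(emb_a, emb_b, is_match, margin=1.0):
    """Assumed pairwise loss (not taken from the abstract).

    Match pairs are penalized by their squared distance, so they are pulled
    together. Non-match pairs are penalized only while they are closer than
    `margin`, so they are pushed apart up to that margin.
    """
    d = euclidean_distance(emb_a, emb_b)
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def verify(emb_enroll, emb_test, threshold=0.5):
    """Accept the claimed identity if the embeddings are close enough.

    `threshold` is a hypothetical operating point; in practice it would be
    tuned on held-out data.
    """
    return euclidean_distance(emb_enroll, emb_test) <= threshold


if __name__ == "__main__":
    enrolled = [0.0, 0.0]
    same_speaker = [0.1, 0.0]
    other_speaker = [3.0, 4.0]
    print(contrastive_loss(enrolled, same_speaker, is_match=True))
    print(contrastive_loss(enrolled, other_speaker, is_match=False))
    print(verify(enrolled, same_speaker), verify(enrolled, other_speaker))
```

In an end-to-end setup, a loss of this kind would be backpropagated through the network that produces the embeddings, so that the representation space is shaped directly by the verification task.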
Abstract (translated)
在本文中,基于长短期记忆网络的架构已经被提出用于独立于文本的场景,该场景旨在通过对传统的语音特征进行操作来捕获与时间相关的讲话者信息。对于说话人验证,首先,必须为说话人表示创建背景模型。然后,在招生阶段,将根据招生声明创建演讲者模型。对于这项工作,该模型将以端到端的方式进行培训,以结合前两个阶段。端到端培训的主要目标是将模型进行优化以符合说话人验证协议。端到端训练通过创建表示空间共同学习背景和说话人模型。 LSTM体系结构经过培训,可以创建用于验证说话人验证的匹配和非匹配对的区分空间。与其他传统方法相比,所提出的架构在文本独立方面显示出其优越性。
URL
https://arxiv.org/abs/1805.00604