Abstract
State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach for efficiently selecting the most relevant features that the front-end captures from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
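To make the pooling idea concrete, below is a minimal NumPy sketch of a double multi-head self-attention pooling layer: a first attention stage pools each head's frame sequence into a head context, and a second stage attends over the head contexts to produce one fixed-length vector. The function and parameter names (`double_mhsa_pooling`, `U`, `v`) are illustrative, and the exact parameterization in the paper may differ; in a real model `U` and `v` would be learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def double_mhsa_pooling(H, U, v):
    """Pool a variable-length sequence H (T x d) into a fixed-length vector.

    H : encoder outputs, shape (T, d)         -- e.g. CNN front-end frames
    U : per-head attention vectors, (n, d/n)  -- first (within-head) stage
    v : head attention vector, (d/n,)         -- second (across-head) stage
    U and v are hypothetical learnable parameters, passed as plain arrays
    so the sketch stays framework-free.
    """
    T, d = H.shape
    n, dh = U.shape
    assert d == n * dh, "model dim must split evenly across heads"
    heads = H.reshape(T, n, dh)                   # (T, n, dh)

    # Stage 1: self-attention pooling inside each head over time.
    scores = np.einsum("tnd,nd->tn", heads, U)    # (T, n) frame scores
    w = softmax(scores.T)                         # (n, T) weights per head
    contexts = np.einsum("nt,tnd->nd", w, heads)  # (n, dh) head contexts

    # Stage 2: attention over the head contexts themselves.
    head_w = softmax(contexts @ v)                # (n,) head weights
    return head_w @ contexts                      # (dh,) utterance vector
```

Note that the output length is fixed regardless of the number of input frames `T`, which is what lets the subsequent fully connected layers operate on utterances of any duration.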
Abstract (translated)
State-of-the-art deep learning systems are commonly based on speaker embedding extractors. These architectures usually consist of a feature extractor front-end and a pooling layer that converts variable-length utterances into fixed-length speaker vectors. The authors recently proposed a Double Multi-Head Self-Attention pooling for speaker recognition, placing the self-attention mechanism between a CNN-based front-end and a set of fully connected layers. This has proven to be an excellent method for selecting the most relevant features from the speech signal. In this paper, we demonstrate excellent experimental results by applying this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
URL
https://arxiv.org/abs/2405.04096