Abstract
Recent advances in self-supervised learning (SSL) with Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches underutilize the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which performs layer- and frame-wise processing directly on the layer-wise hidden-state outputs of pre-trained models to extract fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches to leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments, while remaining compact and matching the inference efficiency of existing systems. These results highlight the advantages of the proposed layer-aware processing approach. Future work includes exploring joint training with SSL frontends and incorporating score calibration to further enhance state-of-the-art verification performance.
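To make the pipeline concrete, here is a minimal numpy sketch of the two stages the abstract names: frame-adaptive layer aggregation over the stack of SSL hidden states, followed by attentive statistics pooling into a fixed-size speaker vector. The linear scorers, shapes, and layer count below are illustrative assumptions, not the paper's exact L-TDNN architecture (which uses a learned layer-aware convolutional network rather than a single linear scorer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
L_layers, T, D = 13, 50, 32  # hypothetical: 12 Transformer blocks + input embedding
H = rng.standard_normal((L_layers, T, D))  # stand-in for SSL hidden states

# Frame-adaptive layer aggregation: a scorer maps each (layer, frame)
# hidden state to a scalar; softmax over the layer axis yields per-frame
# layer weights, so different frames can emphasize different layers.
w_layer = rng.standard_normal(D) / np.sqrt(D)   # hypothetical scorer weights
layer_w = softmax(H @ w_layer, axis=0)          # (L, T), sums to 1 over layers
X = (layer_w[..., None] * H).sum(axis=0)        # (T, D) aggregated frames

# Attentive statistics pooling: per-frame attention weights, then the
# weighted mean and weighted std are concatenated into one vector.
w_att = rng.standard_normal(D) / np.sqrt(D)     # hypothetical attention weights
att = softmax(X @ w_att, axis=0)                # (T,)
mu = (att[:, None] * X).sum(axis=0)             # (D,) weighted mean
var = (att[:, None] * (X - mu) ** 2).sum(axis=0)
emb = np.concatenate([mu, np.sqrt(var + 1e-8)]) # (2D,) fixed-size speaker vector
```

The key point is that the layer axis is treated as a first-class dimension with its own learned weighting, rather than being collapsed by a fixed average before pooling.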
URL
https://arxiv.org/abs/2409.07770