Dual-stream Time-Delay Neural Network with Dynamic Global Filter for Speaker Verification

Abstract

The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, conventional TDNNs struggle to capture global context, which many recent works have shown to be critical for robust speaker representations and long-duration speaker verification. Moreover, common solutions such as self-attention have complexity quadratic in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model the long-term dependencies in speech. In addition, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels to reduce complexity and employs the global filter to improve recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an improvement of approximately 10% while reducing complexity and parameters by over 28% and 15%, respectively, compared with the ECAPA-TDNN. It also offers the best trade-off between efficiency and effectiveness among popular baseline systems on long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
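
The frequency-domain filtering idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `global_filter`, the `(channels, frames)` layout, and the random filter values are assumptions made for illustration, and the dynamic filtering strategy and sparse regularization are not modeled. The sketch only shows the core step, in which an FFT along the time axis, an element-wise product with a per-channel complex filter, and an inverse FFT together let every output frame depend on every input frame at O(T log T) cost.

```python
import numpy as np


def global_filter(x, freq_filter):
    """Filter each channel along the time axis in the frequency domain.

    x:           real array of shape (channels, frames)
    freq_filter: complex array of shape (channels, frames // 2 + 1);
                 in a trained model these would be learnable weights
                 (here they are plain arrays, an assumption of this sketch)
    """
    n_frames = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)       # real FFT, O(T log T) per channel
    filtered = spectrum * freq_filter        # element-wise product per frequency bin
    return np.fft.irfft(filtered, n=n_frames, axis=-1)  # back to the time domain


# Hypothetical feature map: 4 channels, 200 frames.
rng = np.random.default_rng(0)
channels, frames = 4, 200
n_bins = frames // 2 + 1
x = rng.standard_normal((channels, frames))

# Random complex filter standing in for learned frequency-domain weights.
freq_filter = (rng.standard_normal((channels, n_bins))
               + 1j * rng.standard_normal((channels, n_bins)))

y = global_filter(x, freq_filter)
```

By the convolution theorem, multiplying spectra is equivalent to a circular convolution whose kernel spans the whole utterance, which is why a single layer of this kind has a global receptive field, whereas a standard TDNN layer sees only a fixed local window of frames.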

URL

https://arxiv.org/abs/2303.11020

PDF

https://arxiv.org/pdf/2303.11020.pdf