Dual-stream Time-Delay Neural Network with Dynamic Global Filter for Speaker Verification

2023-03-20 10:58:12
Yangfu Li, Xiaodan Lin

Abstract

The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, it is difficult for a conventional TDNN to capture global context, which many recent works have shown to be critical for robust speaker representations and long-duration speaker verification. Moreover, common solutions such as self-attention have quadratic complexity in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model the long-term dependencies in speech. In addition, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels to reduce complexity and employs the global filter to improve recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an improvement of approximately 10% compared with the ECAPA-TDNN, with reductions of over 28% and 15% in complexity and parameters, respectively. It also offers the best trade-off between efficiency and effectiveness compared with other popular baseline systems on long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
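
The frequency-domain filtering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fft`, `ifft`, `global_filter`), the single-channel layout, the power-of-two sequence length, and the fixed (non-learned) filter weights are all assumptions made for illustration, and the dynamic filtering strategy and sparse regularization are not shown. The sketch only demonstrates the general idea that multiplying a sequence's spectrum elementwise by a filter and transforming back mixes information across every time step at O(T log T) cost.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two (assumption)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT: unnormalized inverse transform divided by the length."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def global_filter(frames, freq_filter):
    """Hypothetical single-channel global filter along the time axis.

    Computes IFFT(FFT(frames) * freq_filter). In a trainable model the
    entries of freq_filter would be learnable parameters; here they are
    plain numbers. Taking the real part is a simplification that is exact
    only when freq_filter is conjugate-symmetric.
    """
    if len(frames) != len(freq_filter):
        raise ValueError("filter must have one weight per frequency bin")
    spectrum = fft([complex(v) for v in frames])
    filtered = [s * w for s, w in zip(spectrum, freq_filter)]
    return [v.real for v in ifft(filtered)]


# Toy input: 8 frames of one feature channel.
frames = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.0, 5.0]

# An all-ones filter leaves the signal unchanged.
identity_out = global_filter(frames, [1.0] * 8)

# Keeping only the zero-frequency bin gives every frame the global mean,
# so each output depends on all input frames.
dc_only_out = global_filter(frames, [1.0] + [0.0] * 7)
```

By the convolution theorem, the elementwise product in the frequency domain is equivalent to a circular convolution in time with a kernel as long as the whole sequence, which is why a single such layer can cover the full utterance.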

URL

https://arxiv.org/abs/2303.11020

PDF

https://arxiv.org/pdf/2303.11020.pdf

