Paper Reading AI Learner

Layer-aware TDNN: Speaker Recognition Using Multi-Layer Features from Pre-Trained Models

2025-12-15 05:44:38
Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Juan Yun, Sung Won Han

Abstract

Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches underutilize the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which performs layer- and frame-wise processing directly on the layer-wise hidden state outputs of pre-trained models to extract fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments. It also stood out in model compactness and exhibited inference efficiency comparable to existing systems. These results highlight the advantages of the proposed layer-aware processing. Future work includes joint training with SSL frontends and incorporating score calibration to further advance state-of-the-art verification performance.
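The abstract describes pooling layer-wise SSL hidden states into a fixed-size speaker vector. Below is a minimal NumPy sketch of that idea: per-frame softmax weights mix the layer axis (frame-adaptive layer aggregation), then attention over frames produces a weighted mean and standard deviation (attentive statistics pooling). The layer-aware convolutional network is omitted for brevity, and all shapes, projection vectors, and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: L SSL encoder layers, T frames, D hidden dims.
L, T, D = 13, 50, 768
hidden_states = rng.standard_normal((L, T, D))  # stacked layer-wise outputs

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Frame-adaptive layer aggregation (sketch): each frame gets its own
#    softmax weights over the L layers, then mixes them into one frame vector.
W_layer = rng.standard_normal(D) * 0.01             # hypothetical projection
alpha = softmax(hidden_states @ W_layer, axis=0)    # (L, T) layer weights
frames = np.einsum('lt,ltd->td', alpha, hidden_states)  # (T, D)

# 2) Attentive statistics pooling: attention over frames, then the weighted
#    mean and standard deviation are concatenated into a fixed-size vector.
W_att = rng.standard_normal(D) * 0.01               # hypothetical projection
beta = softmax(frames @ W_att, axis=0)              # (T,) frame weights
mu = beta @ frames                                  # weighted mean, (D,)
var = beta @ (frames ** 2) - mu ** 2
sigma = np.sqrt(np.clip(var, 1e-8, None))           # weighted std, (D,)
speaker_vector = np.concatenate([mu, sigma])        # fixed-size, (2 * D,)

print(speaker_vector.shape)  # (1536,)
```

Both aggregation steps reduce a variable-length, multi-layer tensor to a single vector, which is what makes the output usable for cosine-similarity verification scoring regardless of utterance length.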

URL

https://arxiv.org/abs/2409.07770

PDF

https://arxiv.org/pdf/2409.07770.pdf

