
EfficientTDNN: Efficient Architecture Search for Speaker Recognition in the Wild

2021-03-25 03:28:07
Rui Wang, Zhihua Wei, Shouling Ji, Zhen Hong

Abstract

Speaker recognition is a form of audio biometrics that uses acoustic characteristics to identify speakers automatically. These systems have emerged as an essential means of verifying identity in various scenarios, such as smart homes, general business interactions, e-commerce applications, and forensics. However, the mismatch between training and real-world data shifts the speaker embedding space and severely degrades recognition performance. Various complex neural architectures have been proposed to address speaker recognition in the wild, but they neglect storage and computation requirements. To address this issue, we propose a neural architecture search-based efficient time-delay neural network (EfficientTDNN) that improves inference efficiency while maintaining recognition accuracy. The proposed EfficientTDNN contains three phases. First, supernet design constructs a dynamic neural architecture that consists of sequential cells and enables network pruning. Second, progressive training optimizes randomly sampled subnets that inherit the weights of the supernet. Third, three search methods (manual grid search, random search, and model predictive evolutionary search) are introduced to find a trade-off between accuracy and efficiency. Experiments on the VoxCeleb dataset show that EfficientTDNN provides a large search space of approximately $10^{13}$ subnets and achieves 1.66% EER and 0.156 DCF$_{0.01}$ with 565M MACs. A comprehensive investigation suggests that the trained supernet generalizes to cells unseen during training and obtains an acceptable balance between accuracy and efficiency.
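As a rough illustration of the random-search idea mentioned in the abstract, the sketch below samples subnet configurations from a small search space, discards those over a MACs budget, and keeps the one with the lowest score. Everything here is an assumption made for illustration: the search space (`SEARCH_SPACE`, `NUM_CELLS`), the 1-D convolution MACs estimate, the frame count, and the stand-in `toy_score` function are hypothetical and are not taken from the paper. In a real setting the score would come from evaluating a subnet that inherits supernet weights (for example its EER on a validation set).

```python
import random

# Hypothetical search space, NOT the one used in the paper:
# each cell chooses a kernel size and an output width.
SEARCH_SPACE = {"kernel": [1, 3, 5], "width": [128, 256, 512]}
NUM_CELLS = 3   # assumed number of sequential cells
FRAMES = 200    # assumed number of input frames


def sample_subnet(rng):
    """Draw one subnet: an independent random choice per cell."""
    return [
        {"kernel": rng.choice(SEARCH_SPACE["kernel"]),
         "width": rng.choice(SEARCH_SPACE["width"])}
        for _ in range(NUM_CELLS)
    ]


def estimate_macs(subnet, in_channels=80, frames=FRAMES):
    """Approximate MACs of stacked 1-D convolutions (bias and pooling ignored)."""
    macs, c_in = 0, in_channels
    for cell in subnet:
        macs += frames * cell["kernel"] * c_in * cell["width"]
        c_in = cell["width"]
    return macs


def random_search(score_fn, macs_budget, num_samples=200, seed=0):
    """Return (score, macs, subnet) with the lowest score within the budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_samples):
        subnet = sample_subnet(rng)
        macs = estimate_macs(subnet)
        if macs > macs_budget:
            continue  # too expensive, skip without scoring
        score = score_fn(subnet)
        if best is None or score < best[0]:
            best = (score, macs, subnet)
    return best


def toy_score(subnet):
    """Stand-in for an error metric (lower is better); not a real EER."""
    return 1.0 / sum(c["kernel"] * c["width"] for c in subnet)


if __name__ == "__main__":
    result = random_search(toy_score, macs_budget=100_000_000)
    print(result)
```

Filtering by cost before scoring keeps the expensive evaluation step off subnets that could never be selected, which is the usual reason to check the budget first in this kind of search.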

URL

https://arxiv.org/abs/2103.13581

PDF

https://arxiv.org/pdf/2103.13581.pdf
