Paper Reading AI Learner

Pretext Tasks selection for multitask self-supervised speech representation learning

2021-07-01 16:36:29
Salah Zaiem, Titouan Parcollet, Slim Essid

Abstract

Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In various application domains, including computer vision, natural language processing and audio/speech signal processing, a wide range of features where engineered through decades of research efforts. As it turns out, learning to predict such features has proven to be a particularly relevant pretext task leading to building useful self-supervised representations that prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks, where each task targets a different group of features for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates properly calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on speaker recognition and automatic speech recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.

Abstract (translated)

URL

https://arxiv.org/abs/2107.00594

PDF

https://arxiv.org/pdf/2107.00594


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Attention Autonomous Bert Boundary_Detection Caption Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Drone Dynamic_Memory_Network Edge_Detection Embedding Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot