Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech

2023-03-20 14:07:13
Maryam Fazel-Zarandi, Wei-Ning Hsu

Abstract

Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
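The masked prediction objective described above can be illustrated with a small sketch. It assumes the model emits one stream of per-frame log-probabilities over discovered units for each candidate source, and that the loss is computed only on masked frames. Matching output streams to sources by trying every permutation is an assumption made here for illustration; the abstract does not specify how streams are assigned. All function and variable names are hypothetical.

```python
import itertools
import math


def masked_cross_entropy(log_probs, targets, mask):
    """Mean negative log-likelihood of the target units, masked frames only."""
    losses = [-frame[t] for frame, t, m in zip(log_probs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


def permutation_invariant_loss(stream_log_probs, source_units, mask):
    """Lowest summed masked loss over all stream-to-source assignments.

    Assumption: exhaustive permutation search, which is practical only for
    a small number of sources.
    """
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(source_units))):
        total = sum(
            masked_cross_entropy(stream_log_probs[k], source_units[p], mask)
            for k, p in enumerate(perm)
        )
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm


# Toy example: 2 output streams, 3 frames, 2 possible units.
hi, lo = math.log(0.9), math.log(0.1)
stream_log_probs = [
    [[lo, hi], [lo, hi], [hi, lo]],  # stream 0 favours units 1, 1, 0
    [[hi, lo], [hi, lo], [lo, hi]],  # stream 1 favours units 0, 0, 1
]
source_units = [[0, 0, 1], [1, 1, 0]]  # pseudo-labels for source 0 and source 1
mask = [True, False, True]             # only frames 0 and 2 are masked

loss, perm = permutation_invariant_loss(stream_log_probs, source_units, mask)
print(loss, perm)
```

In this toy case the lowest loss is reached when stream 0 is matched to source 1 and stream 1 to source 0, so the order in which the model emits sources does not affect the loss. A mixture with fewer sources than output streams could be handled by giving the unused stream an all-silence target, which is again an assumption rather than something stated in the abstract.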

URL

https://arxiv.org/abs/2303.11131

PDF

https://arxiv.org/pdf/2303.11131.pdf

