Paper Reading AI Learner

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

2024-04-17 11:31:16
Ye Bai, Chenxing Li, Hao Li, Yuanyuan Zhao, Xiaorui Wang

Abstract

In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity makes it difficult to structure and recognize the audio content, which may impair downstream ASR and music understanding applications. This paper proposes JRSV, an ASR model based on multi-task audio source separation (MTASS) that Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module then recognizes both tracks. Online distillation is proposed to further improve the robustness of recognition. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
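
The data flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the names `Tracks`, `jointly_recognize`, and `hybrid_loss` are hypothetical, the separator and recognizer are passed in as placeholder callables, and the loss shown is the generic weighted sum commonly used for hybrid CTC/attention training. The abstract does not state the weight JRSV uses, so `ctc_weight` is left as a required argument.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Waveform = List[float]  # assumption: audio as a plain list of samples


@dataclass
class Tracks:
    """Hypothetical container for the two separated tracks."""
    speech: Waveform
    singing: Waveform


def jointly_recognize(
    mixture: Waveform,
    separate: Callable[[Waveform], Tracks],
    recognize: Callable[[Waveform], str],
) -> Dict[str, str]:
    """Separate the mixture, then run one shared recognizer on each track.

    `separate` stands in for the MTASS module (background music is assumed
    to be discarded inside it); `recognize` stands in for the hybrid
    CTC/attention recognizer.
    """
    tracks = separate(mixture)
    return {
        "speech": recognize(tracks.speech),
        "singing": recognize(tracks.singing),
    }


def hybrid_loss(ctc_loss: float, attention_loss: float, ctc_weight: float) -> float:
    """Generic hybrid CTC/attention objective: a convex combination of both losses."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```

Passing the separator and recognizer as arguments keeps the sketch independent of any particular model; in practice both would be neural networks, and the online distillation mentioned in the abstract would add a further training term that is not modeled here.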

URL

https://arxiv.org/abs/2404.11275

PDF

https://arxiv.org/pdf/2404.11275.pdf

