Paper Reading AI Learner

Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition

2024-05-01 04:05:29
Dongyuan Li, Ying Zhang, Yusong Wang, Kotaro Funakoshi, Manabu Okumura

Abstract

Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications in various fields, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require substantial time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called \textsc{After}, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that our proposed method \textsc{After}, using only 20\% of samples, improves accuracy by 8.45\% and reduces time consumption by 79\%. Additional extensions of \textsc{After} and ablation studies further confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on GitHub for reproducibility.
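The AL step described above, iteratively picking the most informative and diverse unlabeled samples, can be sketched as a simple greedy acquisition rule. This is an illustrative combination of predictive entropy (informativeness) and distance to the already-selected set (diversity), with a hypothetical weighting parameter `alpha`; it is not the paper's exact acquisition function.

```python
import numpy as np

def select_informative_diverse(probs, feats, budget, alpha=0.5):
    """Greedily select `budget` pool indices that score high on
    predictive entropy (informativeness) and on distance to the
    nearest already-selected sample (diversity).

    probs: (n, num_classes) model class probabilities for the pool.
    feats: (n, d) feature embeddings for the pool.
    Illustrative sketch only -- not AFTER's exact criterion.
    """
    # Predictive entropy of each unlabeled sample.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    selected = []
    for _ in range(budget):
        if selected:
            # Distance to the nearest already-selected sample.
            dist = np.min(
                np.linalg.norm(feats[:, None] - feats[selected][None], axis=2),
                axis=1,
            )
        else:
            dist = np.ones(len(feats))
        score = alpha * entropy + (1.0 - alpha) * dist
        score[selected] = -np.inf  # never re-pick a sample
        selected.append(int(np.argmax(score)))
    return selected
```

Fine-tuning then proceeds only on the selected subset each round, which is how a 20% labeling budget can cut fine-tuning time while retaining the most useful samples.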

URL

https://arxiv.org/abs/2405.00307

PDF

https://arxiv.org/pdf/2405.00307.pdf

