Abstract
Modern conversational agent systems deployed in noisy real-world environments typically perform speech emotion recognition (SER) and automatic speech recognition (ASR) with two separate and often independent approaches. In this paper, we investigate a joint ASR-SER multitask learning approach in a low-resource setting and show that improvements are observed not only in SER, but also in ASR. We also investigate the robustness of such jointly trained models to background noise, babble, and music. Experimental results on the IEMOCAP dataset show that joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively, in clean scenarios. In noisy scenarios, results on data augmented with MUSAN show that the joint approach outperforms the independent ASR and SER approaches across many noisy conditions. Overall, the joint ASR-SER approach yielded models that are more resistant to noise than independently trained ASR and SER models.
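The joint ASR-SER multitask idea described above can be sketched as a shared acoustic encoder feeding two task-specific heads, trained on a weighted sum of the two task losses. The sketch below is illustrative only: the layer shapes, the single-frame toy inputs, and the weighting hyperparameter `alpha` are assumptions, not the architecture or settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(x, W):
    """Shared acoustic encoder: a single linear layer + tanh, for illustration."""
    return np.tanh(x @ W)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[label] + 1e-12)

# Toy dimensions (assumed): 40-dim acoustic features, 32-dim shared
# representation, 28 ASR output tokens, 4 emotion classes.
W_enc = rng.normal(size=(40, 32))
W_asr = rng.normal(size=(32, 28))   # ASR head
W_ser = rng.normal(size=(32, 4))    # SER head

x = rng.normal(size=(40,))          # one toy frame-level feature vector
h = shared_encoder(x, W_enc)        # shared representation used by both tasks

asr_loss = cross_entropy(softmax(h @ W_asr), label=5)  # toy token target
ser_loss = cross_entropy(softmax(h @ W_ser), label=2)  # toy emotion target

# Multitask objective: a convex combination of the two losses; gradients
# through the shared encoder carry signal from both tasks.
alpha = 0.3                          # task-weighting hyperparameter (assumed)
joint_loss = (1 - alpha) * asr_loss + alpha * ser_loss
```

In a real system the encoder would be a sequence model and the ASR head would use a sequence loss such as CTC, but the weighted-sum objective is the core of the multitask setup.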
URL
https://arxiv.org/abs/2305.12540