ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

2023-05-25 13:56:09
Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai

Abstract

In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech: we analyze ASR performance on emotion corpora and examine the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. To ensure generalizability, we use four ASR systems (Kaldi ASR, wav2vec, Conformer, and Whisper) and three corpora (IEMOCAP, MOSI, and MELD). Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
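
The word error rate mentioned in the abstract is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The abstract does not describe the scoring tools the authors used, so the following is only a minimal sketch of that standard definition. The function name `word_error_rate` and the normalization (lowercasing and whitespace tokenization) are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized by lowercasing and splitting on
    whitespace; real evaluations often also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c d")` counts one substitution over four reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.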

URL

https://arxiv.org/abs/2305.16065

PDF

https://arxiv.org/pdf/2305.16065.pdf
