Abstract
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address the inherent variability of speech. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts, gaining insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec, Conformer, and Whisper, and three corpora, IEMOCAP, MOSI, and MELD, to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
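The central quantity in the analysis above is the word error rate (WER) of ASR transcripts. As a minimal illustration (not the authors' evaluation code), WER can be computed as the word-level Levenshtein distance between a reference transcript and an ASR hypothesis, normalized by the reference length; the example sentences below are hypothetical.

```python
# Minimal sketch: word error rate (WER) between a reference transcript
# and an ASR hypothesis, via word-level Levenshtein (edit) distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("so" -> "very") and "today" -> "to day" (2 edits)
# over a 5-word reference gives WER = 3/5.
print(wer("i am so happy today", "i am very happy to day"))  # 0.6
```

In practice, libraries such as jiwer implement the same metric with standard text normalization; emotional speech tends to raise WER, which is what the corpus-level analysis quantifies.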
URL
https://arxiv.org/abs/2305.16065