Abstract
This paper introduces a novel approach to emotion detection in speech using Large Language Models (LLMs). We address the limitation of LLMs in processing audio inputs by translating speech characteristics into natural language descriptions. Our method integrates these descriptions into text prompts, enabling LLMs to perform multimodal emotion analysis without architectural modifications. We evaluate our approach on two datasets, IEMOCAP and MELD, demonstrating significant improvements in emotion recognition accuracy, particularly for high-quality audio data. Our experiments show that incorporating speech descriptions yields an increase of roughly 2.5 percentage points in weighted F1 score on IEMOCAP (from 70.111% to 72.596%). We also compare various LLM architectures and explore the effectiveness of different feature representations. Our findings highlight the potential of this approach for enhancing the emotion detection capabilities of LLMs and underscore the importance of audio quality in speech-based emotion recognition tasks. We will release the source code on GitHub.
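To make the core idea concrete, here is a minimal sketch of the kind of pipeline the abstract describes: acoustic statistics are extracted from an utterance, rendered as a short natural-language description, and spliced into a plain-text prompt for an LLM. This is not the authors' implementation; it assumes librosa for feature extraction, and the feature choices (pitch, energy), the thresholds, and the prompt wording are all illustrative assumptions.

import numpy as np
import librosa


def describe_speech(wav_path: str) -> str:
    """Turn coarse acoustic statistics into a short English description."""
    y, sr = librosa.load(wav_path, sr=None)

    # Fundamental-frequency (pitch) track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    mean_f0 = np.nanmean(f0)

    # Root-mean-square energy as a rough loudness proxy.
    rms = librosa.feature.rms(y=y).mean()

    # Thresholds below are illustrative assumptions, not values from the paper.
    pitch_desc = "high-pitched" if mean_f0 > 200 else "low-pitched"
    energy_desc = "loud" if rms > 0.05 else "soft"
    return f"The speaker sounds {pitch_desc} and {energy_desc}."


def build_prompt(transcript: str, wav_path: str) -> str:
    """Embed the speech description in a text-only emotion-analysis prompt,
    so an unmodified LLM can use both the words and the delivery."""
    return (
        f'Utterance: "{transcript}"\n'
        f"Speech characteristics: {describe_speech(wav_path)}\n"
        "Which emotion (angry, happy, sad, neutral) best fits this utterance?"
    )

The resulting prompt can be sent to any text-only LLM, which is the point of the approach: the audio modality is carried entirely by the inserted description, with no architectural changes to the model.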
URL
https://arxiv.org/abs/2407.21315