Abstract
The study of human emotions, traditionally a cornerstone of fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial to understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models handle the absence of a particular modality, a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when one modality is missing: a novel dynamic modality and view selection method, and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between the audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.
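For illustration only, the sketch below shows one common way a cross-attention fusion block can combine audio and video features and fall back gracefully when the video stream is absent. This is a minimal, hedged example, not the paper's implementation; the module name `CrossModalAttention`, the feature dimensions, and the `video_missing` flag are assumptions made for the sketch.

```python
# Illustrative sketch: generic cross-attention fusion of audio/video features
# with a simple fallback when the video modality is missing.
# Names and dimensions are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        # Audio queries attend over video keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video=None):
        # audio: (batch, T_a, dim); video: (batch, T_v, dim) or None if missing
        if video is None:
            # Fallback when the video modality is absent: rely on audio alone.
            return self.norm(audio)
        fused, _ = self.attn(query=audio, key=video, value=video)
        return self.norm(audio + fused)  # residual connection around the fusion

# Usage with toy embeddings (e.g. audio frames and facial-expression frames):
audio = torch.randn(2, 50, 128)
video = torch.randn(2, 30, 128)
block = CrossModalAttention()
out_full = block(audio, video)   # both modalities available
out_missing = block(audio)       # video modality missing
```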
URL
https://arxiv.org/abs/2404.12251