Abstract
Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model semantic dependencies in the dialogue context and employ graph neural networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, and therefore cannot efficiently learn long-distance consistency and complementary information. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework, GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph that models conversational relationships, and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information. Then, GS-MCC uses contrastive learning to construct self-supervised signals that encourage the high- and low-frequency signals to collaborate in capturing complementary and consistent semantics, thereby improving the ability of the high- and low-frequency information to reflect real emotions. Finally, GS-MCC feeds the collaborative high- and low-frequency information into an MLP network and a softmax function for emotion prediction. Extensive experiments on two benchmark datasets demonstrate the superiority of the proposed GS-MCC architecture.
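The abstract does not include implementation details, so the following is only a minimal numpy sketch of the underlying idea: build a sliding-window conversation graph, split node features into a low-frequency (neighborhood-smoothed, "consistency") part and a high-frequency (neighborhood-difference, "complementarity") part, and classify from their concatenation. It does not reproduce GS-MCC's actual Fourier graph operators or contrastive-learning objective; all function names and the random classifier weights are illustrative assumptions.

```python
import numpy as np

def sliding_window_graph(n, window=2):
    """Hypothetical adjacency: connect each utterance to neighbors
    within +/- `window` positions (self-loops included)."""
    A = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        A[i, lo:hi] = 1.0
    return A

def normalized_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def split_frequencies(X, A_hat):
    """Low-pass component smooths each node toward its neighbors
    (consistency); the high-pass residual keeps what differs from
    the neighborhood (complementarity). They sum back to X."""
    low = A_hat @ X
    high = X - low
    return low, high

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy run: 6 utterances with 8-dim fused features, 4 emotion classes.
rng = np.random.default_rng(0)
n, d, n_classes = 6, 8, 4
X = rng.normal(size=(n, d))
A_hat = normalized_adj(sliding_window_graph(n))
low, high = split_frequencies(X, A_hat)

# Stand-in for the MLP classifier over the combined signals.
W = rng.normal(size=(2 * d, n_classes))
probs = softmax(np.concatenate([low, high], axis=1) @ W)
```

A useful sanity check of this decomposition is that the low- and high-frequency parts reconstruct the input exactly (`low + high == X`), so no information is lost before the two branches are recombined.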
URL
https://arxiv.org/abs/2404.17862