Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

2024-04-27 10:47:07
Tao Meng, Fuchen Zhang, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li

Abstract

Efficiently capturing consistent and complementary semantic features in a multimodal conversational context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model the semantic dependencies of the dialogue context and employ Graph Neural Networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, and therefore cannot efficiently learn long-distance consistency and complementarity information. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework (GS-MCC). First, GS-MCC uses a sliding window to construct a multimodal interaction graph that models conversational relationships, and applies efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information. Then, GS-MCC uses contrastive learning to construct self-supervised signals that encourage collaboration between the complementary (high-frequency) and consistent (low-frequency) semantics, improving the ability of both frequency bands to reflect true emotions. Finally, GS-MCC feeds the collaborative high- and low-frequency information into an MLP network and a softmax function for emotion prediction. Extensive experiments demonstrate the superiority of the proposed GS-MCC architecture on two benchmark datasets.
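Since this page reproduces only the abstract, the following is a minimal PyTorch sketch of the pipeline it describes, not the authors' implementation: a sliding-window interaction graph, a generic graph-spectral split into low-frequency (consistency) and high-frequency (complementarity) components, an InfoNCE-style contrastive signal tying the two views together, and an MLP classifier. All module names, dimensions, the eigendecomposition-based filter, and the loss weighting are assumptions.

# Minimal sketch of the GS-MCC pipeline as described in the abstract. All
# names, dimensions, and the eigendecomposition-based spectral split are
# assumptions; the paper's "efficient Fourier graph operators" may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sliding_window_adjacency(n: int, window: int = 2) -> torch.Tensor:
    # Connect each utterance to neighbors within +/- `window` positions
    # (assumed form of the multimodal interaction graph).
    idx = torch.arange(n)
    return ((idx[None, :] - idx[:, None]).abs() <= window).float()

class SpectralSplit(nn.Module):
    # Split node features into low- and high-frequency parts via the graph
    # Fourier basis (eigenvectors of the symmetric normalized Laplacian).
    def __init__(self, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        d_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
        lap = torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        evals, evecs = torch.linalg.eigh(lap)   # eigenvalues in ascending order
        k = int(self.keep_ratio * evals.numel())
        x_hat = evecs.T @ x                     # graph Fourier transform
        low = evecs[:, :k] @ x_hat[:k]          # low frequency -> consistency
        high = evecs[:, k:] @ x_hat[k:]         # high frequency -> complementarity
        return low, high

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # Contrastive loss treating the low- and high-frequency views of the same
    # utterance as a positive pair (assumed form of the self-supervised signal).
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

class GSMCCSketch(nn.Module):
    def __init__(self, feat_dim: int = 128, num_classes: int = 6):
        super().__init__()
        self.split = SpectralSplit()
        # "MLP network and softmax": cross_entropy below applies log-softmax.
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, num_classes))

    def forward(self, fused: torch.Tensor, adj: torch.Tensor):
        low, high = self.split(fused, adj)
        return self.head(torch.cat([low, high], dim=-1)), info_nce(low, high)

# Toy usage: a dialogue of 10 utterances with 128-dim fused multimodal features.
feats, adj = torch.randn(10, 128), sliding_window_adjacency(10, window=2)
logits, aux = GSMCCSketch()(feats, adj)
loss = F.cross_entropy(logits, torch.randint(0, 6, (10,))) + 0.1 * aux  # weight is arbitrary

Note that a dense eigendecomposition is O(N^3) and is tolerable here only because a conversation graph has at most a few hundred utterances; the paper's Fourier graph operators presumably achieve the same low/high-frequency separation more cheaply, which would be the source of its efficiency claim.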

URL

https://arxiv.org/abs/2404.17862

PDF

https://arxiv.org/pdf/2404.17862.pdf

