Paper Reading AI Learner

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

2024-04-27 10:22:03
Yuntao Shou, Tao Meng, Fuchen Zhang, Nan Yin, Keqin Li

Abstract

Multi-modal Emotion Recognition in Conversation (MERC) has received considerable attention in various fields, e.g., human-computer interaction and recommendation systems. Most existing works perform feature disentanglement and fusion to extract emotional contextual information from multi-modal features for emotion classification. After revisiting the characteristics of MERC, we argue that long-range contextual semantic information should be extracted in the feature disentanglement stage and that inter-modal semantic consistency should be maximized in the feature fusion stage. Recent State Space Models (SSMs), exemplified by Mamba, can model long-distance dependencies efficiently. In this work, we therefore build on both insights to further improve the performance of MERC. Specifically, on the one hand, in the feature disentanglement stage, we propose Broad Mamba, which does not rely on a self-attention mechanism for sequence modeling; instead, it uses state space models to compress emotional representations and a broad learning system to explore the potential data distribution in a broad space. Unlike previous SSMs, we design a bidirectional SSM convolution to extract global contextual information. On the other hand, we design a probability-guided multi-modal fusion strategy that maximizes the consistency of information between modalities. Experimental results show that the proposed method overcomes the computational and memory limitations of the Transformer when modeling long-distance contexts and has great potential to become a next-generation general architecture for MERC.
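The abstract gives no equations, but the two core ideas can be illustrated with a toy sketch: a bidirectional linear state-space scan (a forward recurrence plus a time-reversed backward recurrence, concatenated so every position sees global context) and a confidence-weighted fusion of per-modality features. All function names, shapes, and the use of the max class probability as a confidence score are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discrete linear state-space recurrence over a sequence:
    h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
    x: (T, d_in) -> y: (T, d_out)."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return np.stack(ys)

def bidirectional_ssm(x, A, B, C):
    """Concatenate a forward scan with a time-reversed backward scan,
    so each time step carries context from both directions."""
    fwd = ssm_scan(x, A, B, C)
    bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)  # (T, 2 * d_out)

def probability_guided_fusion(feats, probs):
    """Weight each modality's feature vector by its classifier
    confidence (here: max class probability), normalized across
    modalities, then sum into a single fused representation."""
    conf = np.array([p.max() for p in probs])
    w = conf / conf.sum()
    return sum(wi * fi for wi, fi in zip(w, feats))
```

A real Mamba block would use input-dependent (selective) parameters and a hardware-aware parallel scan; this sketch only shows the recurrence structure and the fusion weighting.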


URL

https://arxiv.org/abs/2404.17858

PDF

https://arxiv.org/pdf/2404.17858.pdf

