
Cross-Attention is Not Enough: Incongruity-Aware Multimodal Sentiment Analysis and Emotion Recognition

2023-05-23 01:24:15
Yaoting Wang, Yuanchao Li, Peter Bell, Catherine Lai

Abstract

Fusing multiple modalities for affective computing tasks has proven effective for improving performance. However, how multimodal fusion works is not well understood, and its use in the real world usually results in large model sizes. In this work on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other under crossmodal attention. We find that inter-modal incongruity exists at the latent level as a result of crossmodal attention. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and IEMOCAP) verifies the efficacy of our approach, showing that it: 1) outperforms major prior work, achieving competitive results and successfully recognizing hard samples; 2) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; and 3) reduces model size to fewer than 1M parameters while outperforming existing models of similar size.
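
As a reading aid, below is a minimal sketch (in PyTorch, not the authors' released code) of the fusion scheme the abstract describes: a chosen primary modality attends to each auxiliary modality in turn through crossmodal attention, and a learned modality gate decides how much of the attended auxiliary stream to admit. The layer sizes, the sigmoid form of the gate, and the GatedCrossmodalBlock/HCTMGSketch names are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class GatedCrossmodalBlock(nn.Module):
    # One fusion step: the primary stream attends to an auxiliary stream,
    # and a gate controls how much of the attended result is admitted.
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # gating form is an assumption
        self.norm = nn.LayerNorm(d_model)

    def forward(self, primary, auxiliary):
        fused, _ = self.attn(primary, auxiliary, auxiliary)  # crossmodal attention
        gate = torch.sigmoid(self.gate(torch.cat([primary, fused], dim=-1)))
        return self.norm(primary + gate * fused)  # gated residual fusion

class HCTMGSketch(nn.Module):
    # Hierarchical fusion: the primary modality (e.g. text) is fused with the
    # auxiliary modalities one after another, ordered by assumed contribution.
    def __init__(self, d_model=64, n_heads=4, n_outputs=1):
        super().__init__()
        self.fuse_first = GatedCrossmodalBlock(d_model, n_heads)
        self.fuse_second = GatedCrossmodalBlock(d_model, n_heads)
        self.head = nn.Linear(d_model, n_outputs)

    def forward(self, primary, aux_a, aux_b):
        x = self.fuse_first(primary, aux_a)
        x = self.fuse_second(x, aux_b)
        return self.head(x.mean(dim=1))  # pool over time, then predict

# Usage with random (batch, time, feature) streams standing in for text/audio/vision:
model = HCTMGSketch()
text, audio, vision = (torch.randn(2, 50, 64) for _ in range(3))
print(model(text, audio, vision).shape)  # torch.Size([2, 1])

The key design point the abstract highlights is that fusion is asymmetric: only the primary modality carries the representation forward, and the gate can suppress auxiliary cues whose affective tendency conflicts with it, which is where incongruity would otherwise leak in.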

URL

https://arxiv.org/abs/2305.13583

PDF

https://arxiv.org/pdf/2305.13583.pdf

