ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors

2021-12-02 16:09:33
Huaishao Luo, Lei Ji, Yanyong Huang, Bin Wang, Shenggong Ji, Tianrui Li

Abstract

Fusion techniques are a key research topic in multimodal sentiment analysis. Recent attention-based fusion has demonstrated advances over simple operation-based fusion. However, these fusion methods adopt a single-scale, i.e., token-level or utterance-level, unimodal representation. Such single-scale fusion is suboptimal because different modalities should be aligned at different granularities. This paper proposes a fusion model named ScaleVLAD that gathers multi-Scale representations from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis. These shared vectors can be regarded as shared topics that align the different modalities. In addition, we propose a self-supervised shifted clustering loss to keep the fused features differentiated across samples. The backbones are three Transformer encoders, one per modality, and the aggregated features generated by the fusion module are fed to a Transformer followed by a fully connected layer to produce the task predictions. Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
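The fusion idea in the abstract, multi-scale features from each modality aggregated against one shared set of VLAD vectors, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the average-pooling windows used to form scales, the distance-based soft assignment, the temperature parameter, and all function names (`soft_vlad`, `multi_scale`, `scale_vlad_sketch`) are assumptions made for illustration, and the Transformer encoders, prediction head, and shifted clustering loss are omitted.

```python
import numpy as np

def soft_vlad(features, centers, temperature=1.0):
    """Aggregate (T, D) local features against (K, D) shared centers into (K, D)."""
    # Residual of every feature to every center: shape (T, K, D).
    diff = features[:, None, :] - centers[None, :, :]
    # Assumed soft assignment: softmax over negative squared distances.
    logits = -np.sum(diff ** 2, axis=-1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1
    # Weighted sum of residuals per center.
    vlad = np.einsum("tk,tkd->kd", weights, diff)
    # L2-normalize each center's descriptor (intra-normalization).
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    return vlad / np.maximum(norms, 1e-12)

def multi_scale(features, scales=(1, 2, 4)):
    """Assumed scale construction: mean-pool non-overlapping windows of size s."""
    length, dim = features.shape
    pooled = []
    for s in scales:
        n = length // s
        if n == 0:
            continue  # sequence shorter than the window; skip this scale
        pooled.append(features[: n * s].reshape(n, s, dim).mean(axis=1))
    return pooled

def scale_vlad_sketch(modalities, centers, scales=(1, 2, 4)):
    """Apply the same shared centers to every scale of every modality."""
    descriptors = [
        soft_vlad(pooled, centers)
        for feats in modalities
        for pooled in multi_scale(feats, scales)
    ]
    return np.stack(descriptors)  # (num_modalities * num_scales, K, D)

rng = np.random.default_rng(0)
dim, num_centers = 8, 4
# Unaligned sequences: each modality has its own length.
text = rng.normal(size=(20, dim))
video = rng.normal(size=(50, dim))
audio = rng.normal(size=(37, dim))
shared_centers = rng.normal(size=(num_centers, dim))
fused = scale_vlad_sketch([text, video, audio], shared_centers)
```

Because the same centers are used for all three modalities, sequences of different lengths map to descriptors of identical shape, which is what lets unaligned inputs be compared or concatenated downstream.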

URL

https://arxiv.org/abs/2112.01368

PDF

https://arxiv.org/pdf/2112.01368.pdf

