Paper Reading AI Learner

Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities

2018-07-03 12:38:11
Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, Walter J. Scheirer

Abstract

In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquity of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data. We select high-level features for our model that have been successful in non-affect domains in order to test their generalizability in the sentiment detection domain. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out challenge test set.
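The abstract does not specify the model architecture, only that high-level acoustic and visual features are fused (with no transcript features) and evaluated with F1. The sketch below is a minimal, illustrative Python example of that general setup: per-sentence acoustic and visual feature vectors are fused by concatenation and fed to a simple classifier, then scored with F1. The feature dimensions, random placeholder data, and the logistic-regression classifier are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: the paper's actual architecture is not described
# in the abstract. This shows generic feature-level fusion of precomputed
# acoustic and visual features per spoken sentence, with F1 evaluation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: 300 sentences with 74-dim acoustic and 35-dim visual
# vectors (dimensions and data are arbitrary assumptions, not from the paper).
acoustic = rng.normal(size=(300, 74))
visual = rng.normal(size=(300, 35))
labels = rng.integers(0, 2, size=300)  # binary sentiment: 0 = negative, 1 = positive

# Fuse modalities by concatenation; no transcript features are used.
fused = np.concatenate([acoustic, visual], axis=1)

X_train, X_val, y_train, y_val = train_test_split(
    fused, labels, test_size=0.2, random_state=0)

# Simple stand-in classifier (assumption, not the paper's model).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation F1:", f1_score(y_val, clf.predict(X_val)))
```

In practice the placeholder arrays would be replaced by the dataset's precomputed acoustic and visual descriptors, and the classifier by whatever fusion model is being evaluated; the F1 computation matches the metric reported in the abstract.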


URL

https://arxiv.org/abs/1807.01122

PDF

https://arxiv.org/pdf/1807.01122.pdf

