Paper Reading AI Learner

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

2024-05-01 15:25:54
Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen

Abstract

Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain: 1) Previous studies have focused on emotion analysis in short sequential videos while overlooking long sequential videos. However, short sequential videos reflect only instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly use various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, given the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing sequences of athletes' post-match interviews. In addition to annotating the overall emotional state of each video, we provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue for understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) for emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve performance comparable to, or even better than, supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.

URL

https://arxiv.org/abs/2405.00574

PDF

https://arxiv.org/pdf/2405.00574.pdf

