EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Abstract

Emotion AI is the ability of computers to understand human emotional states. Existing work has made promising progress, but two limitations remain: 1) Previous studies have focused on emotion analysis in short sequential videos while overlooking long sequential videos. The emotions in short sequential videos reflect only instantaneous emotions, which may be deliberately guided or hidden, whereas long sequential videos can reveal authentic emotions; 2) Previous studies commonly use various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, with the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, we construct EALD, a dataset for Emotion Analysis in Long-sequential and De-identity videos, by collecting and processing sequences of athletes' post-match interviews. In addition to annotating the overall emotional state of each video, we provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) on emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve performance comparable to, or even better than, supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.

URL

https://arxiv.org/abs/2405.00574

PDF

https://arxiv.org/pdf/2405.00574.pdf
