Abstract
Understanding emotions and expressions is a task of interest across multiple disciplines, especially for improving user experiences. Contrary to common perception, it has been shown that emotions are not discrete entities but instead exist along a continuum. People interpret discrete emotions differently due to a variety of factors, including cultural background, individual experiences, and cognitive biases. Therefore, most approaches to expression understanding, particularly those relying on discrete categories, are inherently biased. In this paper, we present an in-depth comparative analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect. Further, we propose a model for the prediction of facial expressions tailored to lightweight applications. Using a small-scale MaxViT-based model architecture, we evaluate the impact of training with continuous valence and arousal labels in addition to discrete expression category labels. We show that considering valence and arousal alongside discrete category labels significantly improves expression inference. The proposed model outperforms the current state-of-the-art models on AffectNet, establishing it as the best-performing model for inferring valence and arousal, with a 7% lower RMSE. Training scripts and trained weights to reproduce our results can be found here: this https URL.
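The training setup described above, which supervises a model with both discrete expression categories and continuous valence/arousal targets, can be sketched as a simple multi-task objective. The following is a minimal illustration only: the loss weighting `alpha` and the function names are assumptions for exposition, not the authors' actual implementation, and the RMSE helper mirrors the metric the abstract reports for valence/arousal.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(class_logits, va_pred, class_labels, va_targets, alpha=0.5):
    """Hypothetical multi-task objective: cross-entropy on discrete
    expression categories plus MSE on continuous valence/arousal.
    `alpha` (the category/regression trade-off) is an illustrative
    assumption, not taken from the paper."""
    probs = softmax(class_logits)
    n = class_logits.shape[0]
    # Cross-entropy on the discrete expression category labels.
    ce = -np.log(probs[np.arange(n), class_labels] + 1e-12).mean()
    # Mean squared error on the 2-D (valence, arousal) regression head.
    mse = ((va_pred - va_targets) ** 2).mean()
    return alpha * ce + (1.0 - alpha) * mse

def rmse(pred, target):
    # Root-mean-square error, the metric reported for valence/arousal.
    return float(np.sqrt(((pred - target) ** 2).mean()))
```

In this sketch, lowering `alpha` shifts emphasis toward the continuous valence/arousal regression, which is the signal the abstract argues improves inference when combined with category labels.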
URL
https://arxiv.org/abs/2404.14975