Abstract
Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and that can be used interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. Robustly handling missing or partial information while allowing a direct switch between regression and classification tasks is also challenging. This study proposes a \emph{versatile audio-visual learning} (VAVL) framework that handles unimodal and multimodal systems for both emotion regression and emotion classification tasks. We implement an audio-visual framework that can be trained even when paired audio-visual data are not available for part of the training set (i.e., only audio or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over the shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on both the CREMA-D and MSP-IMPROV corpora. Notably, VAVL attains new state-of-the-art performance on the emotional attribute prediction task on the MSP-IMPROV corpus. Code available at: this https URL
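The abstract mentions three architectural ingredients: modality-specific encoders feeding audio-visual shared layers, residual connections over those shared layers, and graceful handling of missing modalities. A minimal NumPy sketch of how such a forward pass could route unimodal or multimodal input through a shared layer is shown below; all weight matrices, dimensions, and the averaging fusion are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical shared embedding dimension

# Hypothetical weights: one encoder per modality plus a shared layer.
W_audio = rng.standard_normal((D, D)) * 0.1
W_video = rng.standard_normal((D, D)) * 0.1
W_shared = rng.standard_normal((D, D)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def forward(audio=None, video=None):
    """Encode whichever modalities are present, pass each through the
    shared layer with a residual connection, then average the branches."""
    branches = []
    if audio is not None:
        h = relu(audio @ W_audio)
        branches.append(h + relu(h @ W_shared))  # residual over shared layer
    if video is not None:
        h = relu(video @ W_video)
        branches.append(h + relu(h @ W_shared))
    if not branches:
        raise ValueError("at least one modality must be present")
    return np.mean(branches, axis=0)

a = rng.standard_normal(D)
v = rng.standard_normal(D)
uni = forward(audio=a)            # unimodal: audio only
multi = forward(audio=a, video=v)  # multimodal: both present
```

Because each modality's branch reuses the same shared layer, the representation space is common to both, which is what lets the model be trained on partially paired data (audio-only or video-only samples simply skip the absent branch).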
URL
https://arxiv.org/abs/2305.07216