Abstract
Fusing multiple modalities has proven effective for improving performance on affective computing tasks. However, how multimodal fusion works is not well understood, and real-world deployment typically requires large models. In this work, focusing on sentiment and emotion analysis, we first analyze how crossmodal attention lets one modality alter the salient affective information in another. We find that crossmodal attention introduces inter-modal incongruity at the latent level. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which selects a primary modality according to its contribution to the target task and then hierarchically incorporates the auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and IEMOCAP) verifies the efficacy of our approach, showing that it: 1) achieves results competitive with major prior work and successfully recognizes hard samples; 2) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; and 3) uses fewer than 1M parameters while outperforming existing models of similar size.
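The two-step scheme the abstract describes (pick a primary modality by its task contribution, then hierarchically gate in the auxiliary modalities) can be sketched in plain Python. The contribution scores, the sigmoid gate form, and the dot-product agreement measure below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def select_primary(contributions):
    """Return the modality with the highest task-contribution score.
    Scores are assumed given, e.g. from unimodal validation accuracy."""
    return max(contributions, key=contributions.get)

def gated_fusion(primary_feat, aux_feats, gate_weights):
    """Hierarchically fold gated auxiliary features into the primary one.

    The sigmoid gate here is an illustrative assumption: each auxiliary
    feature is scaled by its agreement (dot product) with the running
    fused vector, so incongruent modalities are down-weighted."""
    fused = list(primary_feat)
    for name, feat in aux_feats.items():
        dot = sum(a * b for a, b in zip(fused, feat))
        gate = 1.0 / (1.0 + math.exp(-gate_weights[name] * dot))
        fused = [a + gate * b for a, b in zip(fused, feat)]
    return fused

# Toy run: text is the primary modality; audio and vision are gated in.
contrib = {"text": 0.78, "audio": 0.55, "vision": 0.49}
primary = select_primary(contrib)
fused = gated_fusion(
    [1.0, 0.0],
    {"audio": [0.5, 0.5], "vision": [-0.2, 0.3]},
    {"audio": 1.0, "vision": 1.0},
)
```

In this sketch the fusion order and gate weights are fixed by hand; in a trained model they would be learned end-to-end along with the crossmodal transformer layers.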
URL
https://arxiv.org/abs/2305.13583