Abstract
In recent years, researchers have combined audio and video signals to handle actions that are not well represented or captured by visual cues alone. However, how to effectively leverage the two modalities remains an open problem. In this work, we develop a multiscale multimodal Transformer (MMT) that exploits hierarchical representation learning. Specifically, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives, an audio-video contrastive loss (AVC) and an intra-modal contrastive loss (IMC), that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% in top-1 accuracy on Kinetics-Sounds and VGGSound, respectively, without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4%, and 4.7% on three public benchmark datasets, while being roughly 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
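The abstract does not give the exact form of the AVC objective; as a rough illustration only, a typical InfoNCE-style audio-video alignment loss (matching audio/video pairs in a batch as positives, all other pairings as negatives) can be sketched as follows. The function name, temperature value, and use of plain NumPy are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Illustrative InfoNCE-style alignment loss (not the paper's exact AVC).

    audio_emb, video_emb: (B, D) arrays; row i of each is a matching pair.
    """
    # L2-normalize embeddings so similarities are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = (a @ v.T) / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal as the target class for each row.
    return -np.mean(np.diag(log_prob))
```

Aligned pairs (identical audio and video embeddings) should yield a lower loss than deliberately mismatched pairs, which is the behavior a contrastive alignment objective is meant to enforce.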
URL
https://arxiv.org/abs/2401.04023