Abstract
Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn effectively from multimodal videos, we propose a novel audio-video recognition approach, the Audio-Video Transformer (AVT), which leverages the effective spatio-temporal representations of video Transformers to improve action recognition accuracy. For multimodal fusion, simply concatenating the tokens of both modalities in a cross-modal Transformer demands large computational and memory resources; instead, we reduce the cross-modal complexity with an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, namely audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. By leveraging the audio signal, AVT also surpasses a previous state-of-the-art video Transformer [25] by 10% on VGGSound. Compared to a previous state-of-the-art multimodal method, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves accuracy by 3.8% on Epic-Kitchens-100.
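The bottleneck-fusion idea the abstract contrasts with full token concatenation can be illustrated with a minimal single-head attention sketch in NumPy. All names, shapes, and the two-step update below are hypothetical illustration of the general bottleneck-token scheme (as popularized by MBT [32]), not AVT's actual implementation: each modality exchanges information only through a handful of bottleneck tokens, so no attention map is ever computed over all audio and video tokens jointly.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain scaled dot-product attention (single head, no projections).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
audio = rng.normal(size=(128, d))  # audio tokens (e.g., spectrogram patches)
video = rng.normal(size=(196, d))  # video tokens (e.g., spatio-temporal patches)
bneck = rng.normal(size=(4, d))    # a small set of learnable bottleneck tokens

# Step 1: bottleneck tokens gather information from each modality separately.
ka = np.vstack([bneck, audio])
kv = np.vstack([bneck, video])
bneck_new = (attend(bneck, ka, ka) + attend(bneck, kv, kv)) / 2

# Step 2: each modality attends to itself plus the (few) updated bottleneck
# tokens; audio and video never attend to each other directly, so the cost
# stays near the unimodal quadratic terms instead of (N_a + N_v)^2.
ctx_a = np.vstack([audio, bneck_new])
ctx_v = np.vstack([video, bneck_new])
audio_new = attend(audio, ctx_a, ctx_a)
video_new = attend(video, ctx_v, ctx_v)
```

With 4 bottleneck tokens, the largest attention map here is 196 x 200 rather than the 324 x 324 map a fully concatenated cross-modal layer would need; this gap grows with sequence length, which is the efficiency argument the abstract makes.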
URL
https://arxiv.org/abs/2401.04154