Abstract
Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn effectively from multimodal videos, we propose a novel audio-video recognition approach, the Audio-Video Transformer (AVT), which leverages the effective spatio-temporal representations of video Transformers to improve action recognition accuracy. For multimodal fusion, simply concatenating the tokens of both modalities in a cross-modal Transformer demands large computational and memory resources; instead, we reduce the cross-modal complexity with an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, namely audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. By leveraging the audio signal, AVT also surpasses a previous state-of-the-art video Transformer [25] by 10% on VGGSound. Compared to a previous state-of-the-art multimodal method, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves accuracy by 3.8% on Epic-Kitchens-100.
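The bottleneck-fusion idea the abstract contrasts with full token concatenation can be illustrated with a minimal single-head attention sketch in NumPy. All names, shapes, and the two-step update below are hypothetical illustration of the general bottleneck-token scheme (as popularized by MBT [32]), not AVT's actual implementation: each modality exchanges information only through a handful of bottleneck tokens, so no attention map is ever computed over all audio and video tokens jointly.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain scaled dot-product attention (single head, no projections).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
audio = rng.normal(size=(128, d))  # audio tokens (e.g., spectrogram patches)
video = rng.normal(size=(196, d))  # video tokens (e.g., spatio-temporal patches)
bneck = rng.normal(size=(4, d))    # a small set of learnable bottleneck tokens

# Step 1: bottleneck tokens gather information from each modality separately.
ka = np.vstack([bneck, audio])
kv = np.vstack([bneck, video])
bneck_new = (attend(bneck, ka, ka) + attend(bneck, kv, kv)) / 2

# Step 2: each modality attends to itself plus the (few) updated bottleneck
# tokens; audio and video never attend to each other directly, so the cost
# stays near the unimodal quadratic terms instead of (N_a + N_v)^2.
ctx_a = np.vstack([audio, bneck_new])
ctx_v = np.vstack([video, bneck_new])
audio_new = attend(audio, ctx_a, ctx_a)
video_new = attend(video, ctx_v, ctx_v)
```

With 4 bottleneck tokens, the largest attention map here is 196 x 200 rather than the 324 x 324 map a fully concatenated cross-modal layer would need; this gap grows with sequence length, which is the efficiency argument the abstract makes.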
URL
https://arxiv.org/abs/2401.04154