Paper Reading AI Learner

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification


Abstract

Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn effectively from multimodal videos, we propose a novel audio-video recognition approach, the audio-video Transformer (AVT), which leverages the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives (audio-video contrastive learning, audio-video matching, and masked audio and video learning) into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. By leveraging the audio signal, AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves accuracy by 3.8% on Epic-Kitchens-100.
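
The cost argument for bottleneck fusion can be illustrated with a rough count of query-key pairs in self-attention. This is a minimal sketch, not the paper's implementation: it assumes an MBT-style design in which each modality attends only to its own tokens plus a small set of shared bottleneck tokens, and the function names and token counts below are hypothetical, chosen only for illustration.

```python
def concat_attention_pairs(n_audio: int, n_video: int) -> int:
    """Query-key pairs when audio and video tokens are concatenated
    into one sequence and every token attends to every other token."""
    total = n_audio + n_video
    return total * total


def bottleneck_attention_pairs(n_audio: int, n_video: int, n_bottleneck: int) -> int:
    """Query-key pairs when each modality attends only to its own tokens
    plus a shared set of bottleneck tokens (assumed MBT-style fusion)."""
    audio = (n_audio + n_bottleneck) ** 2
    video = (n_video + n_bottleneck) ** 2
    return audio + video


# Hypothetical token counts, not taken from the paper.
n_audio, n_video, n_bottleneck = 400, 1568, 4
full = concat_attention_pairs(n_audio, n_video)
bottleneck = bottleneck_attention_pairs(n_audio, n_video, n_bottleneck)
print(full, bottleneck, round(bottleneck / full, 3))
```

Under these assumptions the concatenated cost includes a cross term proportional to the product of the audio and video token counts, while the bottleneck variant replaces it with terms that scale with the much smaller number of bottleneck tokens.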

URL

https://arxiv.org/abs/2401.04154

PDF

https://arxiv.org/pdf/2401.04154.pdf

