Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification


Abstract

In recent years, researchers have combined audio and video signals to address cases where actions are not well represented or captured by visual cues alone. However, how to effectively leverage the two modalities remains an open problem. In this work, we develop a multiscale multimodal Transformer (MMT) that exploits hierarchical representation learning. In particular, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modal fusion, we further design multimodal supervised contrastive objectives, an audio-video contrastive loss (AVC) and an intra-modal contrastive loss (IMC), that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% top-1 accuracy on Kinetics-Sounds and VGGSound, respectively, without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, while being about 3% more efficient in FLOPs and 9.8% more efficient in GPU memory usage.
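
The abstract names the audio-video contrastive (AVC) objective but does not give its formulation. The snippet below is a minimal sketch, assuming a standard supervised contrastive loss applied across modalities, where positives are audio-video embeddings whose clips share a class label; the function name `avc_loss`, the temperature value, and the embedding sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a cross-modal supervised contrastive loss (AVC-style).
# Audio and video embeddings from paired clips are pulled together when their
# clips share a class label and pushed apart otherwise. Formulation assumed,
# not taken from the paper.
import torch
import torch.nn.functional as F

def avc_loss(audio_emb: torch.Tensor,
             video_emb: torch.Tensor,
             labels: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (B, D) clip embeddings; labels: (B,) class ids."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarities
    # Positive mask: entry (i, j) is 1 if clips i and j share a class label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # Audio -> video direction: average log-probability over each anchor's positives.
    log_prob_a2v = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss_a2v = -(pos_mask * log_prob_a2v).sum(1) / pos_mask.sum(1).clamp(min=1)
    # Video -> audio direction (symmetric).
    log_prob_v2a = logits.t() - torch.logsumexp(logits.t(), dim=1, keepdim=True)
    loss_v2a = -(pos_mask * log_prob_v2a).sum(1) / pos_mask.sum(1).clamp(min=1)
    return 0.5 * (loss_a2v.mean() + loss_v2a.mean())

# Toy usage with random features standing in for MAT / video-Transformer outputs.
audio = torch.randn(8, 256)
video = torch.randn(8, 256)
labels = torch.randint(0, 5, (8,))
loss = avc_loss(audio, video, labels)
```

An intra-modal variant (the IMC term mentioned in the abstract) could follow the same pattern with both embedding sets drawn from a single modality.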

URL

https://arxiv.org/abs/2401.04023

PDF

https://arxiv.org/pdf/2401.04023.pdf
