Paper Reading AI Learner

Zorro: the masked multimodal transformer

2023-01-23 17:51:39
Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

Abstract

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations, however, are fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that, with contrastive pre-training, Zorro achieves state-of-the-art results on the most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
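The abstract describes the masking mechanism only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one plausible form of such routing: a block-structured attention mask in which audio tokens attend only to audio, video tokens only to video, and a small set of fusion tokens attends to everything. The three-way audio/video/fusion split, the token counts, and the single-head NumPy attention are assumptions made for this example.

```python
# Minimal sketch of Zorro-style masked attention (illustrative, not the paper's code).
import numpy as np

def zorro_mask(n_audio, n_video, n_fusion):
    """Boolean mask M where M[i, j] = True means query i may attend to key j."""
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    mask[a, a] = True   # audio queries see only audio keys (modality-pure)
    mask[v, v] = True   # video queries see only video keys (modality-pure)
    mask[f, :] = True   # fusion queries see all keys (multimodal)
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean routing mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed key positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: 4 audio tokens, 6 video tokens, 2 fusion tokens, feature dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))
out = masked_attention(x, x, x, zorro_mask(4, 6, 2))
# The first 10 output rows (audio + video) never mix modalities; only the
# last 2 (fusion) combine information from both streams.
```

In a layout like this, the modality-pure audio and video outputs can serve as the independent features that contrastive pre-training needs, while the fusion outputs support multimodal tasks; the same separation is what allows unimodal inference when only audio or only video is available.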

URL

https://arxiv.org/abs/2301.09595

PDF

https://arxiv.org/pdf/2301.09595.pdf

