Paper Reading AI Learner

Zorro: the masked multimodal transformer

2023-01-23 17:51:39
Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

Abstract

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations, however, are fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that, with contrastive pre-training, Zorro achieves state-of-the-art results on the most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
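The abstract describes the masking mechanism only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one plausible form of such routing: a block-structured attention mask in which audio tokens attend only to audio, video tokens only to video, and a small set of fusion tokens attends to everything. The three-way audio/video/fusion split, the token counts, and the single-head NumPy attention are assumptions made for this example.

```python
# Minimal sketch of Zorro-style masked attention (illustrative, not the paper's code).
import numpy as np

def zorro_mask(n_audio, n_video, n_fusion):
    """Boolean mask M where M[i, j] = True means query i may attend to key j."""
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    mask[a, a] = True   # audio queries see only audio keys (modality-pure)
    mask[v, v] = True   # video queries see only video keys (modality-pure)
    mask[f, :] = True   # fusion queries see all keys (multimodal)
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean routing mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed key positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: 4 audio tokens, 6 video tokens, 2 fusion tokens, feature dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))
out = masked_attention(x, x, x, zorro_mask(4, 6, 2))
# The first 10 output rows (audio + video) never mix modalities; only the
# last 2 (fusion) combine information from both streams.
```

In a layout like this, the modality-pure audio and video outputs can serve as the independent features that contrastive pre-training needs, while the fusion outputs support multimodal tasks; the same separation is what allows unimodal inference when only audio or only video is available.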

URL

https://arxiv.org/abs/2301.09595

PDF

https://arxiv.org/pdf/2301.09595.pdf

