Abstract
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and reveal the following key insights: 1) performing gradient descent updates by alternating on heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; 2) model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves a new state of the art in zero-shot video classification. Our model achieves zero-shot classification accuracy of 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
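The alternating update described in insight 1 can be illustrated with a toy sketch. This is not the IMP implementation: the task names, the two loss functions, the one-parameter model, and the round-robin schedule are all assumptions chosen only to show the pattern of updating one shared set of parameters with a different task's loss at each step.

```python
from itertools import cycle


def grad_squared(w, batch):
    """Gradient of the mean squared error for the 1-D linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def grad_absolute(w, batch):
    """Subgradient of the mean absolute error for the same model."""
    def sign(v):
        return (v > 0) - (v < 0)
    return sum(sign(w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical tasks: each pairs its own loss gradient with its own data.
# Both datasets are generated by y = 2 * x, so the shared optimum is w = 2.
TASKS = {
    "image_text": (grad_squared, [(1.0, 2.0), (2.0, 4.0)]),
    "video_text": (grad_absolute, [(1.0, 2.0), (3.0, 6.0)]),
}


def alternating_gradient_descent(tasks, steps=200, lr=0.05, w=0.0):
    """Update one shared parameter, using a single task's loss per step."""
    schedule = cycle(tasks)  # assumed round-robin order over task names
    history = []
    for _ in range(steps):
        name = next(schedule)
        grad_fn, batch = tasks[name]
        w -= lr * grad_fn(w, batch)  # only the selected task contributes
        history.append(name)
    return w, history
```

Calling `alternating_gradient_descent(TASKS)` returns the shared parameter and the order in which tasks were visited. In this toy setting the parameter settles close to 2 even though no single step ever sees both losses at once.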
URL
https://arxiv.org/abs/2305.06324