Abstract
Recent years have witnessed a broad convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which benefits from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms that rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. It can flexibly select different modules for different understanding and generation tasks across all modalities, including text, image, and video. An empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 sets new state-of-the-art results of 48.0% top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at this https URL.
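To make the modular composition concrete, here is a minimal, purely illustrative sketch of the idea described above: modality-specific modules are kept separate (disentanglement), while a shared universal module is reused across all modalities (collaboration), and inputs are routed by modality. All class and method names are hypothetical; this is not the paper's actual implementation.

```python
# Hypothetical sketch of a modularized multi-modal network in the spirit of
# mPLUG-2. Real encoders would be neural networks; here string tags stand in
# for feature tensors so the routing logic stays self-contained and runnable.

class ModalityEncoder:
    """Modality-specific module (e.g., a text, image, or video encoder)."""
    def __init__(self, modality: str):
        self.modality = modality

    def encode(self, raw_input: str) -> str:
        # Stand-in for a real encoder: tag features with their modality.
        return f"{self.modality}-features({raw_input})"


class UniversalModule:
    """Shared module applied to every modality's features."""
    def fuse(self, features: str) -> str:
        return f"universal({features})"


class ModularMultiModalSketch:
    """Composes per-modality modules with one shared universal module."""
    def __init__(self):
        self.encoders = {m: ModalityEncoder(m) for m in ("text", "image", "video")}
        self.universal = UniversalModule()

    def forward(self, modality: str, raw_input: str) -> str:
        # Disentangled: each modality has its own encoder.
        feats = self.encoders[modality].encode(raw_input)
        # Collaborative: all modalities pass through the shared module.
        return self.universal.fuse(feats)


model = ModularMultiModalSketch()
print(model.forward("video", "clip.mp4"))  # → universal(video-features(clip.mp4))
```

The design choice this illustrates is that task-specific pipelines can mix and match modules: a video-captioning head and a text-only head would reuse the same `UniversalModule` while attaching different modality encoders.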
URL
https://arxiv.org/abs/2302.00402