Paper Reading AI Learner

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

2023-02-01 12:40:03
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Abstract

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in this https URL.
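
To make the modularized design concrete, here is a minimal, hypothetical PyTorch sketch of the module-composition idea the abstract describes: modality-specific encoders stay disentangled, a shared "universal" module provides modality collaboration, and each downstream task composes only the modules it needs. The module names (TextEncoder, VisualEncoder, UniversalLayers, Composer), the dimensions, and the task-to-module routing below are illustrative assumptions, not the authors' released mPLUG-2 code.

```python
# Hypothetical sketch of module composition: disentangled modality encoders,
# a shared universal module, and per-task selection of modules.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Modality-specific (disentangled) text module."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, token_ids):
        return self.layer(self.embed(token_ids))


class VisualEncoder(nn.Module):
    """Modality-specific visual module; patches may come from an image or video frames."""
    def __init__(self, patch_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        return self.layer(self.proj(patches))


class UniversalLayers(nn.Module):
    """Shared module reused by every modality for collaboration."""
    def __init__(self, dim=256, depth=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class Composer(nn.Module):
    """Selects and composes modules per task, mirroring the abstract's description."""
    def __init__(self, dim=256):
        super().__init__()
        self.text_enc = TextEncoder(dim=dim)
        self.visual_enc = VisualEncoder(dim=dim)
        self.universal = UniversalLayers(dim=dim)
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, task, token_ids=None, patches=None):
        if task == "text_only":        # uni-modal text understanding
            return self.universal(self.text_enc(token_ids))
        if task == "vision_only":      # image- or video-only understanding
            return self.universal(self.visual_enc(patches))
        # Multi-modal tasks: fuse text queries with visual features.
        t = self.universal(self.text_enc(token_ids))
        v = self.universal(self.visual_enc(patches))
        fused, _ = self.fusion(t, v, v)
        return fused


model = Composer()
out = model("multi_modal",
            token_ids=torch.randint(0, 30522, (2, 16)),
            patches=torch.randn(2, 49, 768))
print(out.shape)  # torch.Size([2, 16, 256])
```

The only point of the sketch is the routing: uni-modal tasks skip the fusion step entirely, while multi-modal tasks reuse the same shared universal layers, which is the collaboration-versus-entanglement trade-off the abstract refers to.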

URL

https://arxiv.org/abs/2302.00402

PDF

https://arxiv.org/pdf/2302.00402.pdf

