Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

2023-05-10 17:22:06
Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and reveal the following key insights: 1) performing gradient descent updates by alternating on diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; 2) model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves a new state of the art in zero-shot video classification. Our model achieves 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700 zero-shot classification accuracy, improving the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
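The alternating update scheme named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the task names, the one-parameter "encoder" `w`, the data, and the choice of squared and Huber losses are all assumptions made for illustration. It only shows the general idea of taking each gradient step on a different task, with its own loss function and input size, while updating one shared set of parameters.

```python
import itertools

def squared_grad(w, batch):
    """Gradient of the mean squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def huber_grad(w, batch, delta=1.0):
    """Gradient of the mean Huber loss: the residual is clipped to [-delta, delta]."""
    total = 0.0
    for x, y in batch:
        residual = w * x - y
        clipped = max(-delta, min(delta, residual))
        total += clipped * x
    return total / len(batch)

# Hypothetical tasks: same underlying relation y = 2x, but different
# loss functions and different batch sizes (a stand-in for varying
# input resolutions across modalities).
TASKS = {
    "image": (squared_grad, [(1.0, 2.0), (2.0, 4.0)]),
    "video": (huber_grad, [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]),
}

def alternating_gradient_descent(tasks, steps, lr):
    """Update one shared parameter, switching task at every step."""
    w = 0.0
    schedule = []
    for _, name in zip(range(steps), itertools.cycle(tasks)):
        grad_fn, batch = tasks[name]
        w -= lr * grad_fn(w, batch)  # one step on the current task only
        schedule.append(name)
    return w, schedule

w_final, schedule = alternating_gradient_descent(TASKS, steps=200, lr=0.1)
print(schedule[:4], round(w_final, 6))
```

In this sketch the tasks are visited in a fixed round-robin order; a real multi-task setup could use any sampling schedule, and the shared parameter would be a full Transformer encoder rather than a single scalar.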

URL

https://arxiv.org/abs/2305.06324

PDF

https://arxiv.org/pdf/2305.06324.pdf

