Paper Reading AI Learner

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

2024-07-31 17:46:51
Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan

Abstract

We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
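To make the architecture described above concrete, below is a minimal PyTorch sketch of modality-aware expert routing in the spirit of MoMa: text and image tokens are dispatched to separate expert groups, and a learned router within each group assigns tokens to experts in an expert-choice fashion. All class and parameter names, the hidden sizes, and the simplified routing (no capacity-overflow handling, no load balancing, no mixture-of-depths) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ModalityAwareMoE(nn.Module):
    """Sketch of a sparse FFN layer with separate expert groups per modality."""

    def __init__(self, d_model=32, n_text_experts=4, n_image_experts=4, capacity_factor=0.5):
        super().__init__()

        def make_experts(n):
            return nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n)
            ])

        # Modality-specific expert groups and a learned router per group.
        self.experts = nn.ModuleDict({"text": make_experts(n_text_experts),
                                      "image": make_experts(n_image_experts)})
        self.routers = nn.ModuleDict({"text": nn.Linear(d_model, n_text_experts),
                                      "image": nn.Linear(d_model, n_image_experts)})
        self.capacity_factor = capacity_factor

    def forward(self, x, is_image):
        # x: (num_tokens, d_model); is_image: bool tensor of shape (num_tokens,).
        out = torch.zeros_like(x)
        for name, mask in (("text", ~is_image), ("image", is_image)):
            idx = mask.nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            tokens = x[idx]
            # Router scores for this modality group (softmax over the group's experts).
            scores = self.routers[name](tokens).softmax(dim=-1)
            # Simplified expert-choice routing: each expert picks its top-k tokens.
            k = max(1, int(self.capacity_factor * tokens.size(0)))
            for e, expert in enumerate(self.experts[name]):
                top = scores[:, e].topk(k)
                out[idx[top.indices]] += top.values.unsqueeze(-1) * expert(tokens[top.indices])
        return out


# Toy usage: a 10-token mixed-modal sequence (6 text tokens followed by 4 image tokens).
layer = ModalityAwareMoE(d_model=32)
x = torch.randn(10, 32)
is_image = torch.tensor([False] * 6 + [True] * 4)
print(layer(x, is_image).shape)  # torch.Size([10, 32])
```

Because each token is routed only within its own modality's expert group, the layer keeps modality-specific parameters while still adapting per token through the learned router, which is the mechanism the abstract credits for the reported FLOPs savings.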

Abstract (translated)

We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process their designated tokens, while learned routing within each group maintains semantically informed adaptivity. Our empirical results show substantial pre-training efficiency gains from this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, with 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall (2.6x for text, 5.2x for image) relative to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although the combination hurts causal-inference performance due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.

URL

https://arxiv.org/abs/2407.21770

PDF

https://arxiv.org/pdf/2407.21770.pdf
