Paper Reading AI Learner

eP-ALM: Efficient Perceptual Augmentation of Language Models

2023-03-20 19:20:34
Mustafa Shukor, Corentin Dancette, Matthieu Cord

Abstract

Large Language Models (LLMs) have so far impressed the world with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises: do we also need to follow this trend to tackle multimodal tasks? In this work, we propose instead to direct effort toward efficient adaptation of existing models, and we propose to augment Language Models with perception. Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency. In particular, they still train a large number of parameters, rely on large-scale multimodal pretraining, use encoders (e.g., CLIP) trained on huge image-text datasets, and add significant inference overhead. In addition, most of these approaches focus on Zero-Shot and In-Context Learning, with little to no effort on direct finetuning. We investigate the minimal computational effort needed to adapt unimodal models to multimodal tasks and propose a new challenging setup, alongside different approaches, for efficiently adapting unimodal pretrained models. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning across Image, Video, and Audio modalities, following the proposed setup. The code will be available here: this https URL.
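To make the recipe in the abstract concrete, below is a minimal sketch in PyTorch of this kind of perceptual augmentation: a frozen visual encoder, a frozen language model, one trainable linear projection, and one trainable soft-prompt token. The specific model names (facebook/opt-350m via HuggingFace transformers, vit_base_patch16_224 via timm), the use of only the [CLS] feature, and the exact placement of the prepended tokens are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import timm
from transformers import OPTForCausalLM


class PerceptualLMSketch(nn.Module):
    """Sketch of perceptual augmentation of a frozen LLM.
    Only `proj` and `soft_prompt` are trainable (well under 1% of parameters)."""

    def __init__(self, lm_name="facebook/opt-350m", vit_name="vit_base_patch16_224"):
        super().__init__()
        self.lm = OPTForCausalLM.from_pretrained(lm_name)        # frozen language model
        self.vit = timm.create_model(vit_name, pretrained=True)  # frozen visual encoder
        for p in list(self.lm.parameters()) + list(self.vit.parameters()):
            p.requires_grad = False
        d_lm, d_vit = self.lm.config.hidden_size, self.vit.num_features
        self.proj = nn.Linear(d_vit, d_lm)                        # single trainable projection
        self.soft_prompt = nn.Parameter(torch.zeros(1, 1, d_lm))  # single trainable token

    def forward(self, images, input_ids, attention_mask, labels=None):
        # Assumes timm's forward_features returns the token sequence with [CLS] first.
        vis_tokens = self.vit.forward_features(images)            # (B, N+1, d_vit)
        vis_embed = self.proj(vis_tokens[:, :1, :])               # (B, 1, d_lm)
        txt_embed = self.lm.get_input_embeddings()(input_ids)     # (B, T, d_lm)
        prompt = self.soft_prompt.expand(images.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, vis_embed, txt_embed], dim=1)
        # Extend the attention mask (and labels, if given) for the two prepended tokens.
        prefix_mask = torch.ones(images.size(0), 2, dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prefix_mask, attention_mask], dim=1)
        if labels is not None:
            ignore = torch.full((images.size(0), 2), -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds,
                       attention_mask=attention_mask, labels=labels)
```

Under these assumptions, finetuning for captioning or VQA would update only `proj` and `soft_prompt` with the standard language-modeling loss, keeping the two pretrained backbones untouched.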

URL

https://arxiv.org/abs/2303.11403

PDF

https://arxiv.org/pdf/2303.11403.pdf

