
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

2024-11-04 18:26:08
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities over complex language and visual data. These advances have spurred the vision of a generalist robotic MLLM that understands complex human instructions and accomplishes a variety of embodied tasks. However, developing MLLMs for real-world robots is challenging because of the limited computation and memory capacities typically available on robotic platforms, whereas MLLM inference requires storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Models (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM to the situation at hand. The approach leverages a multi-exit architecture, which allows the model to terminate processing once an appropriate portion of the model has been activated for a given situation, avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory usage of the LLM by 2-6x without compromising performance. Code and checkpoints are available at this https URL.
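To make the mechanism concrete, below is a minimal sketch of what a multi-exit forward pass with an early-termination check could look like. Everything here is illustrative: the module names (`blocks`, `exit_heads`), the consistency-based stopping criterion, and the `threshold` parameter are assumptions for exposition, not DeeR's actual implementation (see the linked code for that).

```python
import torch
import torch.nn as nn

class MultiExitPolicy(nn.Module):
    """Illustrative multi-exit network: a lightweight action head is
    attached after each chunk of LLM layers, so inference can stop early
    when a shallow exit already suffices. Not the actual DeeR-VLA code."""

    def __init__(self, blocks: nn.ModuleList, exit_heads: nn.ModuleList,
                 threshold: float):
        super().__init__()
        assert len(blocks) == len(exit_heads)
        self.blocks = blocks          # chunks of transformer layers
        self.exit_heads = exit_heads  # one small action head per exit
        # Hypothetical: the threshold is calibrated offline (e.g., on
        # validation rollouts) so that measured average or peak compute
        # stays within a predefined budget.
        self.threshold = threshold

    @torch.no_grad()
    def act(self, x: torch.Tensor) -> torch.Tensor:
        prev = None
        for block, head in zip(self.blocks, self.exit_heads):
            x = block(x)       # activate one more chunk of the MLLM
            action = head(x)   # intermediate action prediction
            # Assumed criterion: stop once consecutive exits agree,
            # i.e., deeper layers are unlikely to change the action much.
            if prev is not None and torch.norm(action - prev) < self.threshold:
                return action
            prev = action
        return prev  # hardest situations fall through to the full depth
```

Under these assumptions, `threshold` trades accuracy for compute: sweeping it until the measured average FLOPs meets a power budget addresses the average-cost constraint, while capping how many exits are ever reachable is one way to bound peak latency and GPU memory.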

URL

https://arxiv.org/abs/2411.02359

PDF

https://arxiv.org/pdf/2411.02359.pdf

