Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging: robotic platforms typically offer limited computation and memory capacity, whereas MLLM inference requires storing billions of parameters and performing extensive computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Models (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM to the situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriately sized portion of the model has been activated for a given situation, thereby avoiding redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory usage of the LLM by 2-6x without compromising performance. Code and checkpoints are available at this https URL.
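The early-exit idea described above can be illustrated with a minimal sketch: layers run sequentially, and inference stops as soon as an intermediate exit's termination criterion fires, so later layers never execute. Everything below is illustrative, not DeeR's actual architecture or criterion; the residual blocks, the feature-change threshold, and all names (`block`, `exit_criterion`, `deer_style_forward`) are assumptions chosen to make the sketch self-contained.

```python
# Toy sketch of dynamic early-exit inference (NOT the DeeR implementation).
# Assumption: a stack of residual blocks with an exit point after each one,
# and a simple termination criterion based on how much the features changed.
import numpy as np

rng = np.random.default_rng(0)


def block(x, w, scale):
    # Stand-in for one LLM layer: a residual update with a nonlinearity.
    # `scale` shrinks with depth so features settle, mimicking convergence.
    return x + scale * np.tanh(x @ w)


def exit_criterion(x, prev, threshold):
    # Illustrative criterion: terminate when features stop changing much
    # between consecutive exits. DeeR's real criteria are instead tuned to
    # budgets on average cost, peak cost (latency), and GPU memory.
    rel_change = np.linalg.norm(x - prev) / (np.linalg.norm(prev) + 1e-8)
    return rel_change < threshold


def deer_style_forward(x, weights, threshold=0.05):
    """Run blocks in order; return final features and #blocks activated."""
    for i, w in enumerate(weights, start=1):
        prev = x
        x = block(prev, w, scale=0.3**i)
        if exit_criterion(x, prev, threshold):
            return x, i              # early exit: deeper blocks never run
    return x, len(weights)           # fell through: full model activated


weights = [0.1 * rng.standard_normal((16, 16)) for _ in range(6)]
x0 = rng.standard_normal(16)
feats, used = deer_style_forward(x0, weights)
print(f"activated {used} of {len(weights)} blocks")
```

In the real system each exit would feed an action head, and the savings reported in the abstract come from most situations being "easy" enough that only a small fraction of the LLM needs to be activated.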
URL
https://arxiv.org/abs/2411.02359