DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

2025-05-22 06:23:04
Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

Abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that parameter specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon Drive-$\pi_0$, our baseline that adapts the $\pi_0$ Vision-Language-Action (VLA) model originally developed in the embodied AI field. Specifically, we add the Vision MoE to Drive-$\pi_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition: drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training a second router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE handles diverse scenarios without suffering from the mode averaging that limits existing models. In closed-loop evaluation on Bench2Drive, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving. We will release the code and models for DriveMoE and Drive-$\pi_0$.
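The abstract describes two sparse routing mechanisms: a vision router that selects a subset of camera views based on the driving context, and an action router that dispatches to behavior-specific expert heads. The sketch below illustrates these two ideas in PyTorch. It is a minimal illustration, not the authors' implementation; all names (CameraRouter, SkillMoE), the top-k value, the expert count, and the tensor shapes are assumptions for demonstration.

```python
# Hypothetical sketch of the two routing ideas from the abstract.
# Module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CameraRouter(nn.Module):
    """Scene-specialized vision routing: score all camera views from a
    context embedding and keep only the top-k most relevant ones,
    mirroring how a driver attends to a few crucial views."""

    def __init__(self, context_dim: int, num_cameras: int, top_k: int = 3):
        super().__init__()
        self.gate = nn.Linear(context_dim, num_cameras)
        self.top_k = top_k

    def forward(self, context: torch.Tensor, cam_tokens: torch.Tensor):
        # context: (B, context_dim); cam_tokens: (B, num_cameras, T, D)
        logits = self.gate(context)                     # (B, num_cameras)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep k views
        weights = F.softmax(weights, dim=-1)            # renormalize over kept views
        batch = torch.arange(cam_tokens.size(0)).unsqueeze(-1)
        selected = cam_tokens[batch, idx]               # (B, k, T, D)
        return selected, weights, idx


class SkillMoE(nn.Module):
    """Skill-specialized action routing: a second gate picks one expert
    head per sample (e.g. cruising vs. a sharp turn), so each expert can
    specialize instead of averaging over all behaviors."""

    def __init__(self, hidden_dim: int, action_dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, action_dim))
            for _ in range(num_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, hidden_dim) fused scene feature
        probs = F.softmax(self.gate(h), dim=-1)         # (B, num_experts)
        expert_id = probs.argmax(dim=-1)                # hard top-1 routing
        out = torch.stack([self.experts[int(e)](h[i])
                           for i, e in enumerate(expert_id)])
        return out                                      # (B, action_dim)


if __name__ == "__main__":
    B, C, T, D, A = 2, 6, 8, 64, 2
    ctx, cams = torch.randn(B, D), torch.randn(B, C, T, D)
    sel, w, idx = CameraRouter(D, C)(ctx, cams)
    act = SkillMoE(D, A)(ctx)
    print(sel.shape, w.shape, act.shape)  # (2,3,8,64) (2,3) (2,2)
```

Top-k selection on the vision side keeps compute proportional to the chosen views, while hard top-1 routing on the action side gives each expert an unambiguous behavioral niche, which is one common way MoE layers avoid mode averaging.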

URL

https://arxiv.org/abs/2505.16278

PDF

https://arxiv.org/pdf/2505.16278.pdf

