Abstract
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that parameter specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon $\pi_0$, a Vision-Language-Action (VLA) model originally from the embodied AI field; we call this baseline Drive-$\pi_0$. Specifically, we add the Vision MoE to Drive-$\pi_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE handles diverse scenarios without suffering from the mode averaging that afflicts existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models for DriveMoE and Drive-$\pi_0$.
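The router-based selection described above (choosing relevant cameras or action experts from a driving context) can be illustrated with a standard top-k gating sketch. This is a generic illustration under assumed names and shapes, not the paper's actual implementation:

```python
import numpy as np

def topk_router(context, gate_weights, k=2):
    """Minimal top-k gating sketch (hypothetical; not DriveMoE's code).

    context: (d,) feature vector summarizing the driving scene.
    gate_weights: (n_experts, d) linear gate; one row per camera/expert.
    Returns the indices of the k selected experts and their softmax weights.
    """
    logits = gate_weights @ context                # one gate score per expert
    top = np.argsort(logits)[::-1][:k]             # keep the k highest-scoring experts
    exp = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    return top, exp / exp.sum()                    # sparse routing distribution

# Toy usage: route a 4-dim scene context over 6 camera experts.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
ctx = rng.normal(size=4)
selected, weights = topk_router(ctx, W, k=2)
```

In a full MoE layer, the outputs of the selected experts would be combined using these weights; routing to only k of n experts is what keeps compute sparse while allowing scene- or skill-specific specialization.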
URL
https://arxiv.org/abs/2505.16278