Abstract
Current Vision-Language-Action (VLA) paradigms in autonomous driving rely primarily on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online reinforcement learning offers a promising way to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a single large language model (LLM) with two distinct sets of LoRA parameters. With one set, the LLM serves as a Decision Expert for scenario reasoning and driving decision-making; with the other, it acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions instead of operating directly in a continuous action space. This approach balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
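To make the idea of trial-and-error learning over a finite set of discrete decisions concrete, the following is a minimal toy sketch, not the MindDrive implementation. It assumes a hypothetical four-item decision vocabulary, a made-up scalar reward standing in for a trajectory-level closed-loop score, and plain REINFORCE with a running-mean baseline as the update rule; the abstract does not specify the actual decision set, reward, or RL algorithm.

```python
import math
import random

# Hypothetical decision vocabulary; the abstract does not list the actual set.
DECISIONS = ["keep_lane", "change_left", "change_right", "stop"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_trajectory_reward(decision):
    """Stand-in for a closed-loop rollout score (an assumption, not the paper's reward)."""
    return 1.0 if decision == "change_left" else 0.0


def reinforce(steps=500, lr=0.5, seed=0):
    """REINFORCE with a running-mean baseline over the discrete decisions."""
    rng = random.Random(seed)
    logits = [0.0] * len(DECISIONS)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Explore by sampling one discrete decision from the current policy.
        idx = rng.choices(range(len(DECISIONS)), weights=probs)[0]
        reward = toy_trajectory_reward(DECISIONS[idx])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log pi(idx) with respect to the logits is one_hot(idx) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    for name, p in zip(DECISIONS, reinforce()):
        print(f"{name}: {p:.3f}")
```

Because the policy is a categorical distribution over a handful of decisions, exploration amounts to sampling one of a few options per step, which is the efficiency argument the abstract makes against exploring a continuous action space directly.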
URL
https://arxiv.org/abs/2512.13636