Abstract
Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting a planned trajectory. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data generation process converts sensor data and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan's metrics. Unfortunately, our results indicate a consistent causal disconnect between reasoning and planning: removing ego/navigation priors causes large drops in planning scores, whereas removing the CoT produces only minor changes. Attention analysis further shows that planning attends primarily to the priors rather than the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the reasoning yielded by training is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent's reliance on priors by evaluating its planning robustness against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool to evaluate the causal fidelity of future models.
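The training-free probe described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `plan_fn` (the agent's planner), the input field names, and the perturbation functions are all hypothetical. The idea is to measure how much the planned trajectory shifts when one input field (e.g. ego priors vs. CoT text) is slightly perturbed while everything else is held fixed; a large shift indicates reliance on that field.

```python
import numpy as np

def prior_reliance_probe(plan_fn, inputs, perturb, n_trials=8, seed=0):
    """Hypothetical training-free probe: mean waypoint displacement of the
    planned trajectory under small perturbations of one input field.

    plan_fn  -- maps an input dict to a (T, 2) array of trajectory waypoints
    perturb  -- maps (input dict, rng) to a copy with one field perturbed
    """
    rng = np.random.default_rng(seed)
    base = np.asarray(plan_fn(inputs))  # unperturbed trajectory, shape (T, 2)
    shifts = []
    for _ in range(n_trials):
        noisy_inputs = perturb(dict(inputs), rng)
        noisy = np.asarray(plan_fn(noisy_inputs))
        # average Euclidean displacement per waypoint, in meters
        shifts.append(np.linalg.norm(noisy - base, axis=-1).mean())
    return float(np.mean(shifts))
```

Comparing the probe value under prior perturbations against the value under CoT perturbations gives a per-agent reliance ratio: under the paper's hypothesis, the former dominates.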
URL
https://arxiv.org/abs/2510.04532