
More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models

2025-10-06 06:50:16
Xurui Song, Shuo Huai, JingJing Jiang, Jiayi Kong, Jun Luo

Abstract

Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting a planned trajectory. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data-generation process converts sensor data and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan's metrics. Our results, unfortunately, indicate a consistent causal disconnect between reasoning and planning: removing ego/navigation priors causes large drops in planning scores, whereas removing the CoT produces only minor changes. Attention analysis further shows that planning attends primarily to the priors rather than to the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the reasoning yielded by training is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent's reliance on priors by evaluating the robustness of its planning against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool for evaluating the causal fidelity of future models.
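
The training-free probe lends itself to a compact sketch. The snippet below is a minimal illustration of the idea, not the paper's implementation: the `agent.plan(scene, use_cot=...)` interface, the `scene` dictionary with an `ego_state` array, and the noise scale are all hypothetical stand-ins. It contrasts how far the planned trajectory moves under small perturbations of the ego prior against how far it moves when the CoT is ablated.

```python
import numpy as np

def trajectory_distance(traj_a, traj_b):
    """Mean L2 distance between two (T, 2) waypoint arrays."""
    return float(np.mean(np.linalg.norm(np.asarray(traj_a) - np.asarray(traj_b), axis=-1)))

def prior_reliance_probe(agent, scene, ego_noise_std=0.05, n_trials=8, rng=None):
    """Training-free reliance probe (hypothetical interface, for illustration only).

    Compares how much the planned trajectory shifts when (a) the ego/navigation
    prior is slightly perturbed versus (b) the CoT is removed. A large (a)/(b)
    ratio suggests planning relies on priors rather than on the reasoning text.
    `agent.plan(scene, use_cot=...)` is an assumed API, not the paper's.
    """
    rng = rng or np.random.default_rng(0)
    base = agent.plan(scene, use_cot=True)  # reference plan with full inputs

    # (a) Sensitivity to small Gaussian perturbations of the ego-state prior.
    prior_shifts = []
    for _ in range(n_trials):
        noisy = scene.copy()
        noisy["ego_state"] = scene["ego_state"] + rng.normal(
            0.0, ego_noise_std, size=scene["ego_state"].shape)
        prior_shifts.append(trajectory_distance(base, agent.plan(noisy, use_cot=True)))

    # (b) Sensitivity to dropping the chain-of-thought entirely.
    cot_shift = trajectory_distance(base, agent.plan(scene, use_cot=False))

    return {"prior_sensitivity": float(np.mean(prior_shifts)),
            "cot_sensitivity": cot_shift}
```

Under these assumptions, a result where `prior_sensitivity` dwarfs `cot_sensitivity` would mirror the paper's finding that planning leans on the priors rather than on the generated reasoning.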

URL

https://arxiv.org/abs/2510.04532

PDF

https://arxiv.org/pdf/2510.04532.pdf

