
Guiding Attention in End-to-End Driving Models

2024-04-30 23:18:51
Diego Porres, Yi Xiao, Gabriel Villalonga, Alexandre Levy, Antonio M. López

Abstract

Vision-based end-to-end driving models trained by imitation learning can lead to affordable solutions for autonomous driving. However, training well-performing models of this kind usually requires a huge amount of data, and the resulting models still lack explicit, intuitive activation maps that reveal their inner workings while driving. In this paper, we study how to guide the attention of these models, improving their driving quality and yielding more intuitive activation maps, by adding a loss term during training that uses salient semantic maps. In contrast to previous work, our method does not require these salient semantic maps to be available at testing time, nor does it require modifying the architecture of the model to which it is applied. We run tests with both perfect and noisy salient semantic maps, the latter inspired by the errors that can be expected from real data, and obtain encouraging results in both cases. Using CIL++ as a representative state-of-the-art model and the CARLA simulator with its standard benchmarks, we conduct experiments showing that our method trains better autonomous driving models, especially when data and computational resources are scarce.
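The abstract describes adding a loss term that aligns the model's attention with salient semantic maps during training only. Below is a minimal PyTorch sketch of how such a term could be implemented; the KL-divergence formulation, the function names (`attention_guidance_loss`, `training_loss`), and the weight `lam` are illustrative assumptions, not the paper's published code:

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(attn_map: torch.Tensor,
                            salient_map: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """Pull the model's spatial attention toward a salient semantic map.

    attn_map:    (B, H, W) non-negative attention weights from the model.
    salient_map: (B, H, W) binary or soft saliency mask derived from
                 semantic segmentation (needed only during training).
    """
    # Normalize both maps into probability distributions over pixels.
    attn = attn_map.flatten(1)
    attn = attn / (attn.sum(dim=1, keepdim=True) + eps)
    sal = salient_map.flatten(1).float()
    sal = sal / (sal.sum(dim=1, keepdim=True) + eps)
    # KL(sal || attn): penalizes attention mass placed outside salient regions.
    kl = (sal * (torch.log(sal + eps) - torch.log(attn + eps))).sum(dim=1)
    return kl.mean()

def training_loss(pred_actions, gt_actions, attn_map, salient_map, lam=0.1):
    """Imitation loss plus the weighted attention-guidance term (hypothetical)."""
    imitation = F.l1_loss(pred_actions, gt_actions)  # standard IL objective
    guidance = attention_guidance_loss(attn_map, salient_map)
    return imitation + lam * guidance
```

Because `salient_map` enters only the loss, it is not needed at test time, and no change to the model's architecture is required; these are the two properties the abstract highlights.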

URL

https://arxiv.org/abs/2405.00242

PDF

https://arxiv.org/pdf/2405.00242.pdf

