
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

2023-05-24 07:40:50
Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang

Abstract

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. First, the raw visual data are multi-modal, including images and point clouds captured by cameras and LiDAR, respectively. Second, the data are multi-frame due to continuous, real-time acquisition. Third, the outdoor scenes exhibit both moving foreground objects and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and manually design question templates; the question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at this https URL.
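
To make the template-driven generation pipeline described above concrete, here is a minimal, hypothetical Python sketch: it builds a flat "scene graph" from 3D detection annotations and programmatically fills manually designed question templates. All names (SceneObject, build_scene_graph, TEMPLATES, generate_qa) are illustrative assumptions, not the authors' actual code or the released NuScenes-QA toolkit.

```python
# Hypothetical sketch of template-based QA generation from 3D detection
# annotations, in the spirit of the pipeline described in the abstract.
# Names and structures are illustrative, not the authors' implementation.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneObject:
    """One annotated object in a frame (e.g., a nuScenes 3D box)."""
    category: str   # e.g., "car", "pedestrian", "truck"
    status: str     # e.g., "moving", "parked", "standing"


def build_scene_graph(annotations: List[Dict]) -> List[SceneObject]:
    """Turn raw detection annotations into scene objects.

    A full scene graph would also encode spatial relations
    (left/right, front/back); this sketch keeps only attributes.
    """
    return [SceneObject(a["category"], a.get("status", "unknown")) for a in annotations]


# Manually designed templates with slots filled from the scene graph.
TEMPLATES = [
    ("How many {status} {category}s are there?", "count"),
    ("Are there any {status} {category}s?", "exist"),
]


def generate_qa(objects: List[SceneObject]) -> List[Dict[str, str]]:
    """Programmatically instantiate templates into question-answer pairs."""
    qa_pairs = []
    seen = {(o.category, o.status) for o in objects}
    for category, status in sorted(seen):
        count = sum(o.category == category and o.status == status for o in objects)
        for template, qtype in TEMPLATES:
            question = template.format(status=status, category=category)
            answer = str(count) if qtype == "count" else ("yes" if count > 0 else "no")
            qa_pairs.append({"question": question, "answer": answer, "type": qtype})
    return qa_pairs


if __name__ == "__main__":
    annotations = [
        {"category": "car", "status": "moving"},
        {"category": "car", "status": "parked"},
        {"category": "pedestrian", "status": "standing"},
    ]
    for qa in generate_qa(build_scene_graph(annotations)):
        print(qa)
```

Running the sketch yields pairs such as "How many parked cars are there?" / "1", illustrating how balanced, diverse question formats can be produced at scale from existing annotations.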

URL

https://arxiv.org/abs/2305.14836

PDF

https://arxiv.org/pdf/2305.14836.pdf

