
IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic

2024-04-12 16:00:03
Chirag Parikh, Rohit Saluja, C.V. Jawahar, Ravi Kiran Sarvadevabhatla

Abstract

Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries, where traffic is often dense and unstructured, with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important-object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects, covering 10 object categories and 19 explanation label categories. The dataset also incorporates rear-view information to provide a more complete representation of the driving environment. We further introduce custom-designed deep networks for localizing multiple important objects and predicting a per-object explanation. Overall, our dataset and prediction models form a foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
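To make the annotation structure described in the abstract concrete (per-object tracks, 10 object categories, 19 explanation labels, front and rear views, 1-12 important objects per video), here is a minimal sketch of what one annotated clip might look like. Everything below is a hypothetical illustration: the record layout, field names (ImportantObjectTrack, IDDXClip), and label strings are assumptions for exposition, not the dataset's actual schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ImportantObjectTrack:
    """One important-object track: a tracked box sequence plus its explanation label."""
    track_id: int
    category: str      # one of the 10 road-object categories
    explanation: str   # one of the 19 explanation label categories
    view: str          # "front" or "rear" (dual-view recording)
    # Per-frame boxes as (frame_idx, x1, y1, x2, y2); IDD-X has ~697K boxes overall.
    boxes: List[Tuple[int, int, int, int, int]] = field(default_factory=list)

@dataclass
class IDDXClip:
    """One driving video clip with its 1-12 important-object tracks."""
    video_id: str
    tracks: List[ImportantObjectTrack] = field(default_factory=list)

# Example: a single hypothetical track for an auto-rickshaw ahead of the ego vehicle.
clip = IDDXClip(
    video_id="clip_0001",
    tracks=[
        ImportantObjectTrack(
            track_id=0,
            category="autorickshaw",                 # illustrative category name
            explanation="cutting in from the left",  # illustrative explanation label
            view="front",
            boxes=[(0, 120, 200, 260, 380), (1, 118, 198, 262, 384)],
        )
    ],
)
print(f"{clip.video_id}: {len(clip.tracks)} important object track(s)")

Under this reading, the paper's two prediction tasks map onto the sketch directly: important-object localization recovers the boxes and track identities, while per-object explanation prediction classifies each recovered track into one of the 19 explanation labels.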


URL

https://arxiv.org/abs/2404.08561

PDF

https://arxiv.org/pdf/2404.08561.pdf

