Paper Reading AI Learner

Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving

2023-05-25 06:22:10
Wenhao Cheng, Junbo Yin, Wei Li, Ruigang Yang, Jianbing Shen

Abstract

This paper addresses the problem of 3D referring expression comprehension (REC) in the autonomous driving scenario, which aims to ground a natural language expression to the targeted region in LiDAR point clouds. Previous approaches for REC usually focus on 2D images or 3D indoor scenes, which are not suitable for accurately predicting the location of the queried 3D region in an autonomous driving scene. In addition, the upper-bound limitation and heavy computation cost of prior pipelines motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed LiDAR Grounding. We then devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective token fusion strategy. It jointly learns a LiDAR-based object detector with the language features and predicts the targeted region directly from the detector without any post-processing. Moreover, image features can be flexibly integrated into our approach to provide rich texture and color information. The cross-modal learning forces the detector to concentrate on important regions in the point cloud by considering the informative language expression, leading to much better accuracy and efficiency. Extensive experiments on the Talk2Car dataset demonstrate the effectiveness of the proposed method. Our work offers a deeper insight into the LiDAR-based grounding task, and we expect it to present a promising direction for the autonomous driving community.
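The abstract describes fusing language tokens with point-cloud features so the detector attends to language-relevant regions and scores the target directly, with no post-processing. The sketch below illustrates that general idea with single-head cross-attention in NumPy: BEV (bird's-eye-view) point-cloud tokens attend to language tokens, and a linear grounding head scores each cell. All names, shapes, and the attention formulation are illustrative assumptions, not the paper's actual MSSG architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(bev_tokens, lang_tokens):
    """Hypothetical token fusion: BEV queries attend to language
    keys/values (single-head cross-attention), with a residual add.
    This is a sketch of the idea, not the paper's exact module."""
    d = bev_tokens.shape[-1]
    attn = softmax(bev_tokens @ lang_tokens.T / np.sqrt(d), axis=-1)  # (N_bev, N_lang)
    return bev_tokens + attn @ lang_tokens

# Toy shapes: 16 BEV cells, 8 language tokens, 32-dim features.
bev = rng.standard_normal((16, 32))
lang = rng.standard_normal((8, 32))
fused = cross_modal_fusion(bev, lang)          # (16, 32)

# A linear "grounding head" scores each BEV cell; the argmax cell is
# taken as the grounded region -- single-shot, no post-processing.
w = rng.standard_normal(32)
scores = fused @ w
target_cell = int(np.argmax(scores))           # index in [0, 16)
```

In a real detector the BEV tokens would come from a LiDAR backbone and the language tokens from a text encoder, with the head regressing a 3D box per cell rather than a scalar score.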

URL

https://arxiv.org/abs/2305.15765

PDF

https://arxiv.org/pdf/2305.15765.pdf

