Abstract
This paper addresses 3D referring expression comprehension (REC) in autonomous driving scenarios, which aims to ground a natural language expression to the target region in LiDAR point clouds. Previous REC approaches usually focus on the 2D or 3D-indoor domain and are thus ill-suited to accurately predicting the location of a queried 3D region in an autonomous driving scene. In addition, the upper-bound limitation and the heavy computation cost of these pipelines motivate us to explore a better solution. In this work, we propose a new multi-modal visual grounding task, termed LiDAR Grounding, and devise a Multi-modal Single Shot Grounding (MSSG) approach with an effective token fusion strategy. It jointly learns a LiDAR-based object detector with language features and predicts the target region directly from the detector without any post-processing. Moreover, image features can be flexibly integrated into our approach to provide rich texture and color information. Cross-modal learning forces the detector to concentrate on the important regions of the point cloud indicated by the informative language expression, leading to much better accuracy and efficiency. Extensive experiments on the Talk2Car dataset demonstrate the effectiveness of the proposed method. Our work offers deeper insight into the LiDAR-based grounding task, and we expect it points to a promising direction for the autonomous driving community.
URL
https://arxiv.org/abs/2305.15765