Abstract
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position within a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline: a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate multi-scale point cloud features, and a set of queries iteratively attends to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with their spatial relations, improving position estimation. Experimental results on the KITTI360Pose dataset demonstrate that our model achieves performance competitive with state-of-the-art models without taking ground-truth instances as input.
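The sketch below is not the authors' code; it is a minimal, hedged illustration of the query-based instance extraction idea described in the abstract: a fixed set of learnable queries iteratively cross-attends to multi-scale point cloud features (e.g., from a 3D sparse convolution U-Net decoder) so that instances are represented without ground-truth instance inputs. All module names, dimensions, and the number of scales and queries are illustrative assumptions.

```python
# Illustrative sketch only; not the paper's implementation.
import torch
import torch.nn as nn


class InstanceQueryExtractor(nn.Module):
    """Learnable queries iteratively attend to multi-scale point cloud features."""

    def __init__(self, dim=256, num_queries=16, num_scales=3, num_heads=8):
        super().__init__()
        # Learnable instance queries shared across all cells (assumed count).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # One cross-attention / self-attention / FFN round per feature scale.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_scales)]
        )
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_scales)]
        )
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_scales)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of (B, N_s, dim) tensors, coarse to fine,
        # standing in for the sparse U-Net outputs of a point cloud cell.
        batch = multi_scale_feats[0].shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        for feats, ca, sa, ffn in zip(multi_scale_feats, self.cross_attn, self.self_attn, self.ffn):
            q = q + ca(q, feats, feats)[0]   # queries gather evidence from point features
            q = q + sa(q, q, q)[0]           # queries exchange information with each other
            q = self.norm(q + ffn(q))
        return q                              # (B, num_queries, dim) instance query embeddings


if __name__ == "__main__":
    # Toy usage: three feature scales for a batch of two cells.
    feats = [torch.randn(2, n, 256) for n in (64, 256, 1024)]
    print(InstanceQueryExtractor()(feats).shape)  # torch.Size([2, 16, 256])
```

In the paper's pipeline, such instance queries would then pass through the RowColRPA module (coarse stage) or the RPCA module (fine stage), which additionally inject relative positional information into the attention; those modules are not reproduced here.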
URL
https://arxiv.org/abs/2404.17845