Instance-free Text to Point Cloud Localization with Relative Position Awareness

2024-04-27 09:46:49
Lichao Wang, Zhihao Yuan, Jinke Ren, Shuguang Cui, Zhen Li

Abstract

Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position in a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline, consisting of a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate multi-scale point cloud features, and a set of queries iteratively attends to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with spatial relations, improving fine position estimation. Experimental results on the KITTI360Pose dataset demonstrate that our model performs competitively with state-of-the-art models without taking ground-truth instances as input.
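
The idea of making self-attention among instance queries aware of their relative positions can be illustrated with a small sketch. This is not the paper's RowColRPA module, whose details the abstract does not give; it is a generic relative-position-biased self-attention written in NumPy. The function name, the `centers` input (a 2D position per query), and the linear bias weights `w_rel` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_self_attention(feats, centers, w_rel):
    """Self-attention with an additive bias from pairwise offsets.

    feats:   (N, d) instance query features
    centers: (N, 2) hypothetical 2D position of each query inside a cell
    w_rel:   (2,)   hypothetical weights mapping an offset to a scalar bias
                    (stands in for a learned relative position embedding)
    """
    n, d = feats.shape
    # Content term: standard scaled dot-product logits.
    logits = feats @ feats.T / np.sqrt(d)
    # rel[i, j] = centers[j] - centers[i], the offset from query i to query j.
    rel = centers[None, :, :] - centers[:, None, :]  # (N, N, 2)
    # Position term: linear bias on the offset (an assumption of this sketch).
    bias = rel @ w_rel                               # (N, N)
    attn = softmax(logits + bias, axis=-1)
    return attn @ feats, attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))
    centers = rng.uniform(size=(4, 2))
    out, attn = relative_position_self_attention(
        feats, centers, np.array([0.5, -0.5])
    )
    print(out.shape, attn.shape)
```

With `w_rel` set to zero the bias vanishes and the function reduces to plain scaled dot-product self-attention, which makes the contribution of the spatial term easy to isolate.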

URL

https://arxiv.org/abs/2404.17845

PDF

https://arxiv.org/pdf/2404.17845.pdf

