Paper Reading AI Learner

SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

2024-04-16 12:55:15
Niklas Gard, Anna Hilsmann, Peter Eisert

Abstract

In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image while requiring minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which comprises only approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and, ultimately, image-to-model matching. Through a viewport classification score, we rank the reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and the query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to state-of-the-art methods while also estimating more degrees of freedom of the camera pose. We will make our source code publicly available at this https URL .
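The two-stage procedure described in the abstract — rank the semantic reference panoramas by a viewport classification score, then estimate a 6D relative pose against the best match and chain it with that panorama's known pose — can be sketched as follows. The function names, score inputs, and pose conventions here are illustrative assumptions for exposition, not the authors' actual API or network.

```python
import numpy as np

def select_best_panorama(scores):
    """Return the index of the reference panorama with the highest
    viewport classification score (illustrative ranking step)."""
    return int(np.argmax(scores))

def compose_pose(T_world_pano, T_pano_cam):
    """Chain a panorama's known 4x4 world pose with the estimated
    panorama-to-camera relative pose to obtain the query camera's
    absolute 6D pose (rotation + translation)."""
    return T_world_pano @ T_pano_cam

if __name__ == "__main__":
    # Toy scores from a hypothetical viewport classifier over 3 panoramas.
    scores = [0.12, 0.81, 0.33]
    best = select_best_panorama(scores)  # panorama 1 wins

    # Panorama pose: identity; estimated relative pose: 1 m along x.
    T_world_pano = np.eye(4)
    T_pano_cam = np.eye(4)
    T_pano_cam[0, 3] = 1.0
    T_world_cam = compose_pose(T_world_pano, T_pano_cam)
    print(best, T_world_cam[0, 3])
```

In the actual method, the relative pose would come from a learned regression head rather than being given; the sketch only shows how the ranking and pose-chaining steps fit together.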

URL

https://arxiv.org/abs/2404.10527

PDF

https://arxiv.org/pdf/2404.10527.pdf

