Abstract
Re-localizing a camera from a single image of a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality. In this work, we address the problem of estimating the 6-DoF camera pose relative to a global frame from a single image. We propose to leverage a novel network of relative spatial and temporal geometric constraints to guide the training of a deep network for localization. We simultaneously employ spatial and temporal relative pose constraints, obtained not only from adjacent camera frames but also from frames that are distant in the spatio-temporal space of the scene. We show that, through these constraints, our method learns to localize even when ground-truth 3D coordinates are few or very sparse; in our experiments, less than 1% of the ground-truth data is available. We evaluate our method on three common visual localization datasets and show that it outperforms other direct pose estimation methods.
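The abstract does not specify the exact form of the relative pose constraints, but a minimal sketch of such a consistency term is shown below, assuming poses are parameterized as a translation vector plus a unit quaternion (a common choice in direct pose regression). All function names, the quaternion convention, and the weight `beta` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def quat_mul(q, r):
    # Hamilton product of two (w, x, y, z) quaternions.
    w1, x1, y1, z1 = q.unbind(-1)
    w2, x2, y2, z2 = r.unbind(-1)
    return torch.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ], dim=-1)

def quat_conj(q):
    # Conjugate (inverse for unit quaternions).
    w, x, y, z = q.unbind(-1)
    return torch.stack([w, -x, -y, -z], dim=-1)

def rotate(q, v):
    # Rotate vector v by unit quaternion q: q * (0, v) * q^-1.
    zero = torch.zeros_like(v[..., :1])
    return quat_mul(quat_mul(q, torch.cat([zero, v], -1)), quat_conj(q))[..., 1:]

def relative_consistency_loss(t_i, q_i, t_j, q_j, rel_t, rel_q, beta=1.0):
    """Penalize disagreement between the relative pose implied by two
    absolute pose predictions (t_i, q_i) and (t_j, q_j) and a measured
    relative pose (rel_t, rel_q), e.g. from visual odometry or SfM."""
    q_i = F.normalize(q_i, dim=-1)
    q_j = F.normalize(q_j, dim=-1)
    # Relative pose of frame j expressed in the frame of camera i.
    pred_rel_q = quat_mul(quat_conj(q_i), q_j)
    pred_rel_t = rotate(quat_conj(q_i), t_j - t_i)
    t_err = (pred_rel_t - rel_t).norm(dim=-1)
    # min over sign handles the q / -q ambiguity of unit quaternions.
    rel_q = F.normalize(rel_q, dim=-1)
    q_err = torch.min((pred_rel_q - rel_q).norm(dim=-1),
                      (pred_rel_q + rel_q).norm(dim=-1))
    return (t_err + beta * q_err).mean()
```

In this sketch, the same term can be applied to both adjacent frame pairs and temporally distant ones, so the sparse absolute supervision is propagated through the graph of relative constraints.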
Abstract (translated)
Re-localizing a camera from a single image of a previously mapped area is vital for many computer vision applications in robotics and augmented/virtual reality (AR/VR). In this work, we address the problem of estimating the 6-DoF camera pose relative to a global frame from a single image. We propose a novel network of relative spatial and temporal geometric constraints to guide the training of a deep network for localization, employing relative pose constraints obtained not only from adjacent camera frames but also from frames that are distant within the scene. We show that, through these constraints, our method is able to localize when little or very sparse ground-truth 3D data is available; in our experiments, this proportion is less than 1%. We evaluate our method on three common visual localization datasets and show that it outperforms other direct pose estimation methods.
URL
https://arxiv.org/abs/2312.00500