Abstract
Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.
URL
https://arxiv.org/abs/1903.07504