Understanding the Limitations of CNN-based Absolute Camera Pose Regression

2019-03-18 15:24:11
Torsten Sattler, Qunjie Zhou, Marc Pollefeys, Laura Leal-Taixe

Abstract

Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.

URL

https://arxiv.org/abs/1903.07504

PDF

https://arxiv.org/pdf/1903.07504.pdf

