Paper Reading AI Learner

DistanceNet: Estimating Traveled Distance from Monocular Images using a Recurrent Convolutional Neural Network

2019-04-17 06:38:00
Robin Kreuzig, Matthias Ochs, Rudolf Mester

Abstract

Classical monocular vSLAM/VO methods suffer from the scale ambiguity problem. Hybrid approaches solve this problem by adding deep learning methods, for example by using depth maps which are predicted by a CNN. We argue that scale estimation is better based on the traveled distance over a set of consecutive images. In this paper, we propose a novel end-to-end many-to-one traveled distance estimator. Using a deep recurrent convolutional neural network (RCNN), our DistanceNet estimates the traveled distance between the first and last image of a set of consecutive frames. Geometric features are learned in the CNN part of our model and are subsequently used by the RNN to learn dynamics and temporal information. Moreover, we exploit the natural order of distances by using ordinal regression to predict the distance. The evaluation on the KITTI dataset shows that our approach outperforms current state-of-the-art deep learning pose estimators and classical mono vSLAM/VO methods in terms of distance prediction. Thus, our DistanceNet can be used as a component to solve the scale problem and help improve current and future classical mono vSLAM/VO methods.
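The abstract mentions that the natural order of distances is exploited via ordinal regression. The paper's exact formulation is not given here, but a common ordinal-regression scheme discretizes the target into K ordered thresholds and predicts, per threshold, whether the true value exceeds it. The sketch below illustrates that encoding/decoding step with hypothetical thresholds and bin width (0.1 m bins up to 2 m are an assumption for illustration, not values from the paper):

```python
import numpy as np

def encode_ordinal(distance, thresholds):
    """Encode a scalar distance as K binary ordinal targets:
    target[k] = 1 iff the distance exceeds thresholds[k].
    A network trained with K sigmoid outputs would regress these."""
    return (distance > thresholds).astype(np.float32)

def decode_ordinal(probs, bin_width):
    """Decode K per-threshold probabilities back to a distance by
    counting how many thresholds the network believes are exceeded."""
    k = int((probs > 0.5).sum())
    return k * bin_width

# Hypothetical discretization: 20 thresholds, 0.1 m apart.
thresholds = np.arange(0.0, 2.0, 0.1)
targets = encode_ordinal(1.23, thresholds)   # 13 ones, 7 zeros
# A perfect network would output probabilities equal to the targets,
# so decoding recovers the distance up to quantization error.
estimate = decode_ordinal(targets, bin_width=0.1)
```

Because the targets are monotone (once a threshold is exceeded, all smaller ones are too), this encoding lets the network express the ordering of distances directly, unlike an unordered classification over distance bins.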

URL

https://arxiv.org/abs/1904.08105

PDF

https://arxiv.org/pdf/1904.08105.pdf

