MACARONS: Mapping And Coverage Anticipation with RGB Online Self-Supervision

2023-03-06 17:38:03
Antoine Guédon, Tom Monnier, Pascal Monasse, Vincent Lepetit

Abstract

We introduce a method that simultaneously learns to explore new large environments and to reconstruct them in 3D from color images only. This is closely related to the Next Best View problem (NBV), where one has to identify where to move the camera next to improve the coverage of an unknown scene. However, most of the current NBV methods rely on depth sensors, need 3D supervision and/or do not scale to large scenes. Our method requires only a color camera and no 3D supervision. It simultaneously learns in a self-supervised fashion to predict a "volume occupancy field" from color images and, from this field, to predict the NBV. Thanks to this approach, our method performs well on new scenes as it is not biased towards any training 3D data. We demonstrate this on a recent dataset made of various 3D scenes and show it performs even better than recent methods requiring a depth sensor, which is not a realistic assumption for outdoor scenes captured with a flying drone.
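
The Next Best View idea in the abstract can be illustrated with a toy sketch. This is not the authors' method, which learns both the occupancy field and the coverage prediction from color images. It is a hand-written greedy rule on a small 2D grid, under the assumptions that an occupancy probability is already available per cell, that a camera sees every cell within a fixed range, and that occlusion is ignored. The names `coverage_gain`, `next_best_view`, `occupancy`, `seen`, and `max_range` are hypothetical and chosen for illustration only.

```python
from math import hypot


def coverage_gain(occupancy, seen, camera, max_range):
    """Sum predicted occupancy over unseen cells within max_range of camera.

    Toy assumption: visibility is a plain distance test, with no occlusion.
    """
    cx, cy = camera
    gain = 0.0
    for (x, y), probability in occupancy.items():
        if (x, y) in seen:
            continue  # already covered, adds nothing new
        if hypot(x - cx, y - cy) <= max_range:
            gain += probability
    return gain


def next_best_view(occupancy, seen, candidates, max_range):
    """Greedy NBV: return the candidate camera position with the largest gain."""
    return max(
        candidates,
        key=lambda camera: coverage_gain(occupancy, seen, camera, max_range),
    )


# Hypothetical occupancy field: cell -> probability that the cell is occupied.
occupancy = {(0, 0): 0.9, (1, 0): 0.8, (5, 5): 0.7, (6, 5): 0.6}
seen = {(0, 0)}                  # cells already covered by earlier views
candidates = [(0, 1), (5, 4)]    # camera positions reachable next

best = next_best_view(occupancy, seen, candidates, max_range=2.0)
print(best)
```

In this toy setup the second candidate is selected because it is within range of two unseen cells with high predicted occupancy, while the first candidate only reaches one. A learned system would replace both the hand-set probabilities and the distance-based visibility test.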

URL

https://arxiv.org/abs/2303.03315

PDF

https://arxiv.org/pdf/2303.03315.pdf

