Paper Reading AI Learner

Deep Hough Voting for 3D Object Detection in Point Clouds

2019-04-21 21:36:36
Charles R. Qi, Or Litany, Kaiming He, Leonidas J. Guibas

Abstract

Current 3D object detection methods are heavily influenced by 2D detectors. To leverage architectures from 2D detectors, they often convert 3D point clouds to regular grids (i.e., voxel grids or bird's-eye-view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to detect objects directly in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data that is as generic as possible. However, due to the sparse nature of the data -- samples from 2D manifolds in 3D space -- we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point, and thus hard to regress accurately in one step. To address this challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods using purely geometric information, without relying on color images.
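The abstract's core idea -- surface points cast votes toward their object's centroid, and clusters of votes propose object centers -- can be sketched in a few lines. Below is a toy NumPy illustration, not the authors' implementation: the vote offsets are given as input rather than predicted by a deep network, and the greedy radius-based grouping is a simplified stand-in for VoteNet's learned vote aggregation. All function and parameter names here are hypothetical.

```python
import numpy as np

def hough_vote_centers(points, offsets, radius=0.3, min_votes=5):
    """Toy sketch of the Hough-voting idea: each surface point casts a
    vote (point + offset) toward its object's centroid; votes are then
    greedily grouped, and each sufficiently large group's mean becomes
    a candidate object center. In VoteNet the offsets come from a deep
    point set network; here they are simply given."""
    votes = points + offsets  # each point votes for a centroid location
    remaining = list(range(len(votes)))
    centers = []
    while remaining:
        seed = votes[remaining[0]]
        # collect all votes within `radius` of the seed vote
        group = [i for i in remaining
                 if np.linalg.norm(votes[i] - seed) < radius]
        if len(group) >= min_votes:
            centers.append(votes[group].mean(axis=0))
        remaining = [i for i in remaining if i not in group]
    return np.array(centers)

# Demo: points sampled on a unit sphere around a known center. The
# centroid is far from every surface point (the difficulty the abstract
# describes), but perfect votes recover it exactly.
rng = np.random.default_rng(0)
center = np.array([1.0, 1.0, 1.0])
dirs = rng.normal(size=(64, 3))
pts = center + dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
offsets = center - pts          # ideal votes, for illustration only
found = hough_vote_centers(pts, offsets)
```

With ideal offsets every vote lands on the true centroid, so the grouping returns a single center at `(1, 1, 1)`; with network-predicted (noisy) votes, the grouping radius trades off merging nearby objects against splitting one object's votes.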

URL

https://arxiv.org/abs/1904.09664

PDF

https://arxiv.org/pdf/1904.09664.pdf

