Abstract
This paper addresses the task of semantic instance completion: from an incomplete RGB-D scan of a scene, we aim to detect the individual object instances that make up the scene and jointly infer their complete object geometry. This yields a semantically meaningful decomposition of a scanned scene into individual, complete 3D objects, and opens up new possibilities for meaningful interaction with a scene, for instance by virtual or robotic agents. Rather than considering 3D semantic instance segmentation and scan completion separately, we propose 3D-SIC, a new end-to-end 3D convolutional neural network that jointly learns to detect object instances and predict their complete geometry, achieving significantly better performance than treating these tasks independently. 3D-SIC leverages joint color-geometry feature learning and a fully convolutional 3D network to effectively infer semantic instance completion for 3D scans at scale. Our method runs at interactive rates, taking several seconds of inference time on scenes of $30\,\mathrm{m} \times 25\,\mathrm{m}$ spatial extent. We additionally introduce a new semantic instance completion benchmark on real scan data, on which we outperform alternative approaches by over 15 in mAP@0.5.
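The mAP@0.5 figure quoted in the abstract depends on deciding when a predicted, completed instance counts as a match for a ground-truth object. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch under stated assumptions: instances are represented as sets of occupied voxel coordinates, overlap is measured by intersection-over-union (IoU), and a prediction matches when its IoU reaches 0.5. The names `voxel_iou`, `is_match`, and `box` are hypothetical helpers for illustration, not part of 3D-SIC.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def voxel_iou(pred: Set[Voxel], gt: Set[Voxel]) -> float:
    """Intersection-over-union of two sets of occupied voxel coordinates."""
    union = len(pred | gt)
    if union == 0:
        # Two empty instances: define IoU as 0 to avoid division by zero.
        return 0.0
    return len(pred & gt) / union


def is_match(pred: Set[Voxel], gt: Set[Voxel], threshold: float = 0.5) -> bool:
    """Assumed matching rule: IoU at or above the threshold counts as a hit.

    Whether the comparison is strict (>) or inclusive (>=) is an assumption.
    """
    return voxel_iou(pred, gt) >= threshold


def box(x0: int, x1: int, y0: int, y1: int, z0: int, z1: int) -> Set[Voxel]:
    """Hypothetical helper: all voxels in an axis-aligned, half-open box."""
    return {
        (x, y, z)
        for x in range(x0, x1)
        for y in range(y0, y1)
        for z in range(z0, z1)
    }


# Toy example: a 4x4x4 ground-truth object and two completed predictions.
gt_object = box(0, 4, 0, 4, 0, 4)      # 64 voxels
good_pred = box(1, 4, 0, 4, 0, 4)      # 48 voxels, all inside the ground truth
poor_pred = box(0, 1, 0, 4, 0, 4)      # 16 voxels, all inside the ground truth

print(voxel_iou(good_pred, gt_object))  # 48 / 64 = 0.75
print(is_match(good_pred, gt_object))   # True
print(is_match(poor_pred, gt_object))   # False (IoU = 16 / 64 = 0.25)
```

Because the comparison is against the complete ground-truth geometry rather than only the observed surface, a prediction that segments the visible part of an object well but fails to complete it would score a low IoU under this kind of rule.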
URL
https://arxiv.org/abs/1904.12012