Abstract
Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training data is a viable alternative, the domain shift between real and synthetic data remains a significant challenge. In this work, we aim to narrow the performance gap between models trained on synthetic data plus a few real images and fully supervised models trained on large-scale data. We achieve this by approaching the problem from two perspectives: 1) We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models and enhanced with a novel algorithm. 2) We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering. In particular, we exploit the spatial relationships between features on the mesh surface and a contrastive learning scheme to guide the domain adaptation process. Combined, these two approaches enable our models to perform competitively with state-of-the-art models using only 10% of the respective real training images, while outperforming the SOTA model by 10.4% at an accuracy threshold of π/18 using only 50% of the real training data. Our trained model further demonstrates robust generalization to out-of-distribution scenarios despite being trained with minimal real data.
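The π/18 threshold mentioned above is the standard pose-estimation accuracy criterion: a prediction counts as correct when the geodesic distance between the predicted and ground-truth rotations is below π/18 radians (10°). A minimal sketch of this metric, assuming rotations are given as 3×3 matrices (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def rotation_error(R_pred, R_gt):
    """Geodesic distance (radians) between two 3x3 rotation matrices:
    arccos((trace(R_pred^T R_gt) - 1) / 2)."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against numerical drift outside [-1, 1].
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def pose_accuracy(preds, gts, threshold=np.pi / 18):
    """Fraction of predictions whose rotation error is under the threshold."""
    errors = [rotation_error(Rp, Rg) for Rp, Rg in zip(preds, gts)]
    return float(np.mean([e < threshold for e in errors]))
```

For example, a prediction rotated 30° about the z-axis relative to the ground truth has an error of π/6, which exceeds the π/18 threshold and would count as incorrect.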
URL
https://arxiv.org/abs/2305.16124