Abstract
Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground-truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence-based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video.
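The core idea — linearizing a weighted least-squares PnP solver around the ground-truth pose and penalizing the diagonal of the induced pose covariance — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of NumPy instead of a differentiable framework, and the assumption that each residual acts as independent noise with variance equal to its squared value are all simplifications made here for clarity.

```python
import numpy as np

def linear_covariance_loss(J, r, w=None):
    """Hypothetical sketch of a linear-covariance-style loss.

    J : (2N, 6) Jacobian of the reprojection residuals w.r.t. the 6D pose,
        evaluated at the ground-truth pose.
    r : (2N,) reprojection residuals of the predicted correspondences,
        also evaluated at the ground-truth pose.
    w : (2N,) optional per-residual weights (e.g. predicted confidences).
    """
    if w is None:
        w = np.ones_like(r)
    JTW = J.T * w                    # (6, 2N) weighted Jacobian transpose
    H = JTW @ J                      # Gauss-Newton Hessian of the PnP problem
    A = np.linalg.solve(H, JTW)      # linearized solver: pose update = -A @ r
    # Assume (simplification) each residual is independent noise with
    # variance r_i^2; the first-order pose covariance is then A diag(r^2) A^T.
    Sigma = (A * (r ** 2)) @ A.T     # (6, 6) pose covariance
    return np.trace(Sigma)           # sum of the diagonal covariance elements
```

Because the loss is a trace of a positive semi-definite matrix, it is non-negative and vanishes exactly when all residuals at the ground-truth pose are zero, so minimizing it cannot reward degrading individual correspondences the way averaging inside a PnP solver can.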
URL
https://arxiv.org/abs/2303.11516