Paper Reading AI Learner

Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation

2023-03-21 00:32:31
Fulin Liu, Yinlin Hu, Mathieu Salzmann

Abstract

Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence-based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video.
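
The abstract only sketches the loss construction, so the snippet below gives a minimal, illustrative PyTorch sketch of one way a "linearize PnP at the ground-truth pose, then penalize the propagated pose covariance" loss could look. It is not the authors' implementation: the Gauss-Newton linearization of the reprojection residuals, the isotropic/per-point noise model, the damping term, and the names skew and linearized_pnp_loss are all assumptions made purely for illustration.

import torch


def skew(v):
    # Skew-symmetric matrices [v]_x for a batch of 3-vectors: (N, 3) -> (N, 3, 3).
    zero = torch.zeros_like(v[:, 0])
    return torch.stack([
        torch.stack([zero, -v[:, 2], v[:, 1]], dim=-1),
        torch.stack([v[:, 2], zero, -v[:, 0]], dim=-1),
        torch.stack([-v[:, 1], v[:, 0], zero], dim=-1),
    ], dim=-2)


def linearized_pnp_loss(pred_2d, pts_3d, R_gt, t_gt, K, sigma2=None, damping=1e-6):
    # pred_2d : (N, 2) predicted 2D correspondences (network output).
    # pts_3d  : (N, 3) corresponding 3D model points.
    # R_gt, t_gt : ground-truth rotation (3, 3) and translation (3,).
    # K       : (3, 3) camera intrinsics.
    # sigma2  : optional (N,) predicted per-correspondence 2D variances.
    # Returns a scalar: squared linearized pose correction plus the trace of the
    # propagated 6x6 pose covariance (an illustrative choice, not the paper's loss).

    # 3D points expressed in the camera frame at the ground-truth pose.
    Xc = pts_3d @ R_gt.T + t_gt                                   # (N, 3)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = Xc[:, 2].clamp(min=1e-6)

    # Ground-truth reprojections and residuals of the predicted correspondences.
    proj = torch.stack([fx * Xc[:, 0] / z + cx,
                        fy * Xc[:, 1] / z + cy], dim=-1)          # (N, 2)
    r = (pred_2d - proj).reshape(-1, 1)                           # (2N, 1)

    # Jacobian of the projection w.r.t. a small pose perturbation (omega, dt),
    # evaluated at the ground-truth pose: J_i = d(pi)/dXc @ [ -[Xc]_x | I ].
    dpi_dXc = torch.zeros(Xc.shape[0], 2, 3, dtype=Xc.dtype)
    dpi_dXc[:, 0, 0] = fx / z
    dpi_dXc[:, 0, 2] = -fx * Xc[:, 0] / z ** 2
    dpi_dXc[:, 1, 1] = fy / z
    dpi_dXc[:, 1, 2] = -fy * Xc[:, 1] / z ** 2
    eye3 = torch.eye(3, dtype=Xc.dtype).expand(Xc.shape[0], 3, 3)
    dXc_dpose = torch.cat([-skew(Xc), eye3], dim=-1)              # (N, 3, 6)
    J = (dpi_dXc @ dXc_dpose).reshape(-1, 6)                      # (2N, 6)

    # One damped Gauss-Newton step of the linearized PnP solver at the GT pose.
    H = J.T @ J + damping * torch.eye(6, dtype=Xc.dtype)
    H_inv = torch.linalg.inv(H)
    delta = H_inv @ (J.T @ r)                                     # (6, 1) pose correction

    # Propagate (predicted or isotropic) 2D noise through the linearized solver:
    # Cov(delta) = H^-1 J^T Sigma J H^-1, with Sigma diagonal.
    if sigma2 is None:
        sigma2 = torch.ones(pred_2d.shape[0], dtype=pred_2d.dtype)
    Sigma = torch.diag(sigma2.repeat_interleave(2))               # (2N, 2N)
    cov = H_inv @ J.T @ Sigma @ J @ H_inv                         # (6, 6)

    return (delta ** 2).sum() + torch.diagonal(cov).sum()

Because the linearization point is the ground-truth pose rather than the PnP solution, the gradient of such a loss acts on each correspondence through its own residual and Jacobian row, which is the property the abstract contrasts with supervising through a full differentiable PnP solve.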

Abstract (translated)

Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose is obtained with a PnP solver. Because common PnP solvers are non-differentiable, these methods are supervised on the individual correspondences. To address this, several methods have designed differentiable PnP strategies, so that supervision can be imposed on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground-truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. We then define our loss based on the diagonal covariance elements, which takes the final pose estimate into account while avoiding the PnP averaging issue. Our experiments show that our loss consistently improves pose estimation accuracy for both dense and sparse correspondence-based methods, achieving state-of-the-art results on the Linemod-Occluded and YCB-Video datasets.

URL

https://arxiv.org/abs/2303.11516

PDF

https://arxiv.org/pdf/2303.11516.pdf
