Paper Reading AI Learner

Disentangled Implicit Shape and Pose Learning for Scalable 6D Pose Estimation

2021-07-27 01:55:30
Yilin Wen, Xiangyu Li, Hao Pan, Lei Yang, Zheng Wang, Taku Komura, Wenping Wang

Abstract

6D pose estimation of rigid objects from a single RGB image has seen tremendous improvements recently by using deep learning to combat complex real-world variations, but a majority of methods build models on the per-object level, failing to scale to multiple objects simultaneously. In this paper, we present a novel approach for scalable 6D pose estimation, by self-supervised learning on synthetic data of multiple objects using a single autoencoder. To handle multiple objects and generalize to unseen objects, we disentangle the latent object shape and pose representations, so that the latent shape space models shape similarities, and the latent pose code is used for rotation retrieval by comparison with canonical rotations. To encourage shape space construction, we apply contrastive metric learning and enable the processing of unseen objects by referring to similar training objects. The different symmetries across objects induce inconsistent latent pose spaces, which we capture with a conditioned block producing shape-dependent pose codebooks by re-entangling shape and pose representations. We test our method on two multi-object benchmarks with real data, T-LESS and NOCS REAL275, and show it outperforms existing RGB-based methods in terms of pose estimation accuracy and generalization.

Abstract (translated)

URL

https://arxiv.org/abs/2107.12549

PDF

https://arxiv.org/pdf/2107.12549.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot