SOCS: Semantically-aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations

2023-03-18 06:34:16
Boyan Wan, Yifei Shi, Kai Xu

Abstract

Most learning-based approaches to category-level 6D pose estimation are designed around normalized object coordinate space (NOCS). While successful, NOCS-based methods become inaccurate and less robust when handling objects of a category containing significant intra-category shape variations. This is because the object coordinates induced by global and rigid alignment of objects are semantically incoherent, making the coordinate regression hard to learn and generalize. We propose Semantically-aware Object Coordinate Space (SOCS), built by warping and aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence. SOCS is semantically coherent: any point on the surface of an object can be mapped to a semantically meaningful location in SOCS, allowing for accurate pose and size estimation under large shape variations. To learn effective coordinate regression to SOCS, we propose a novel multi-scale coordinate-based attention network. Evaluations demonstrate that our method is easy to train, generalizes well to large intra-category shape variations, and is robust to inter-object occlusions.
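The abstract does not say how pose and size are recovered once coordinates have been regressed. In NOCS-style pipelines this is commonly done by fitting a similarity transform between the predicted object coordinates and the observed 3D points, for example with the Umeyama algorithm. The sketch below illustrates that generic step only; it is not the authors' implementation, and the function name `umeyama_similarity` and its arguments are hypothetical names chosen for illustration.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform such that dst ≈ s * R @ src + t.

    src: (N, 3) coordinates in the canonical object space (assumed to be
         the per-point output of a coordinate regression network).
    dst: (N, 3) corresponding observed points in the camera frame.
    Returns the scale s, rotation R (3x3), and translation t (3,).
    """
    n = src.shape[0]
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered target and source points.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Here the rotation and translation give the 6D pose and the scale relates the canonical space to metric size. Practical systems usually wrap a fit like this in an outlier-rejection loop such as RANSAC, since regressed coordinates are noisy.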

URL

https://arxiv.org/abs/2303.10346

PDF

https://arxiv.org/pdf/2303.10346.pdf

