Paper Reading AI Learner

GearTrack: Automating 6D Pose Estimation

2024-09-30 06:26:49
Yu Deng, Teng Cao, Jiahong Xue

Abstract

We developed a robust solution for real-time 6D object detection in industrial applications by integrating FoundationPose, SAM2, and LightGlue, eliminating the need for retraining. Our approach addresses two key challenges: the requirement for an initial object mask in the first frame in FoundationPose and issues with tracking loss and automatic rotation for symmetric objects. The algorithm requires only a CAD model of the target object, with the user clicking on its location in the live feed during the initial setup. Once set, the algorithm automatically saves a reference image of the object and, in subsequent runs, employs LightGlue for feature matching between the object and the real-time scene, providing an initial prompt for detection. Tested on the YCB dataset and industrial components such as bleach cleanser and gears, the algorithm demonstrated reliable 6D detection and tracking. By integrating SAM2 and FoundationPose, we effectively mitigated common limitations such as the problem of tracking loss, ensuring continuous and accurate tracking under challenging conditions like occlusion or rapid movement.

Abstract (translated)

我们通过将FoundationPose、SAM2和LightGlue集成开发了一个实时的6D物体检测解决方案,无需重新训练。我们的方法解决了两个关键挑战:在FoundationPose中第一帧需要初始物体掩码,以及对称物体跟踪损失和自动旋转问题。算法只需要目标对象的CAD模型,用户在初始设置过程中点击其位置。一旦设置完成,算法会自动保存物体的参考图像,并在后续运行中使用LightGlue在物体和实时场景之间进行特征匹配,为检测提供初始提示。在YCB数据集和工业组件(如漂白清洁剂和齿轮)上进行了测试,该算法展示了可靠的6D检测和跟踪。通过将SAM2和FoundationPose集成,我们有效地减轻了常见的限制,如跟踪损失问题,确保在具有挑战性的条件(如遮挡或快速运动)下实现连续和准确的跟踪。

URL

https://arxiv.org/abs/2409.19986

PDF

https://arxiv.org/pdf/2409.19986.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot