Paper Reading AI Learner

GoTrack: Generic 6DoF Object Pose Refinement and Tracking

2025-06-08 14:01:47
Van Nguyen Nguyen, Christian Forster, Sindi Shkodrani, Vincent Lepetit, Bugra Tekin, Cem Keskin, Tomas Hodan

Abstract

We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an analysis-by-synthesis approach for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized by optical flow estimation. The model-to-frame registration is noticeably simpler than in existing methods, relying only on standard neural network blocks (a transformer is trained on top of DINOv2) and producing reliable pose confidence scores without a scoring network. For the frame-to-frame registration, which is an easier problem as consecutive video frames are typically nearly identical, we employ a lightweight off-the-shelf optical flow model. We demonstrate that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available at this https URL
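The hybrid registration scheme described in the abstract can be sketched as a simple control loop: run cheap flow-based frame-to-frame registration on every frame, and fall back to the more expensive model-to-frame registration only when the pose confidence score drops. The sketch below is illustrative only; the function bodies, pose representation, and confidence threshold are placeholder assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a GoTrack-style tracking loop.
# A "pose" here is a placeholder (x, confidence) pair; in the real method it
# would be a full 6DoF rigid transform with a learned confidence score.

def frame_to_frame_register(pose, frame):
    # Stand-in for lightweight optical-flow registration between consecutive
    # frames; we pretend confidence decays slightly with each chained update.
    x, conf = pose
    return (x + 1, conf * 0.8)

def model_to_frame_register(pose, frame, cad_model):
    # Stand-in for the more expensive CAD-model-to-frame registration,
    # which re-anchors the track and restores full confidence.
    x, conf = pose
    return (x, 1.0)

def track(frames, init_pose, cad_model, conf_threshold=0.5):
    pose = init_pose
    history = []
    expensive_calls = 0
    for frame in frames:
        # Cheap frame-to-frame update runs on every frame.
        pose = frame_to_frame_register(pose, frame)
        if pose[1] < conf_threshold:
            # Confidence too low: re-register against the CAD model.
            pose = model_to_frame_register(pose, frame, cad_model)
            expensive_calls += 1
        history.append(pose)
    return history, expensive_calls
```

With the decay rate above, the expensive model-to-frame step fires only every few frames, which mirrors the abstract's point that frame-to-frame registration saves compute while the confidence score decides when re-anchoring is needed.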

Abstract (translated)

We introduce GoTrack, an efficient and accurate CAD-based method for six-degree-of-freedom (6DoF) object pose refinement and tracking. The method can handle diverse objects and requires no object-specific training. Unlike existing tracking methods that rely solely on analysis-by-synthesis for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized via optical flow estimation. For model-to-frame registration, GoTrack is noticeably simpler than existing methods, relying only on standard neural network blocks (a transformer trained on top of DINOv2), and it produces reliable pose confidence scores without a scoring network. For frame-to-frame registration, an easier problem since consecutive video frames are typically nearly identical, we employ a lightweight off-the-shelf optical flow model. We show that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that achieves state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available.

URL

https://arxiv.org/abs/2506.07155

PDF

https://arxiv.org/pdf/2506.07155.pdf

