
OmniTracker: Unifying Object Tracking by Tracking-with-Detection

2023-03-21 17:59:57
Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang

Abstract

Object tracking (OT) aims to estimate the positions of target objects in a video sequence. Depending on whether the initial states of the target objects are specified by annotations provided in the first frame or by object categories, OT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline. Extensive experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves results on par with or even better than both task-specific and unified tracking models.
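To make the tracking-with-detection paradigm concrete, below is a minimal sketch of the per-frame loop the abstract describes: existing tracks supply appearance priors to the detector, the detector returns candidate boxes, and the candidates are associated back to tracks. The names (Track, detector, track_with_detection) and the greedy-IoU association are illustrative assumptions for this sketch, not the authors' actual OmniTracker implementation.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    box: tuple        # (x1, y1, x2, y2)
    appearance: list  # appearance prior, e.g. a feature embedding


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_with_detection(frames, detector, tracks, iou_thr=0.5):
    """Per-frame loop: tracks supply appearance priors to the detector,
    the detector proposes candidate (box, appearance) pairs, and the
    candidates are associated with existing tracks (greedy IoU here)."""
    for frame in frames:
        priors = [t.appearance for t in tracks]
        candidates = detector(frame, priors)  # hypothetical: list of (box, appearance)
        unmatched = list(candidates)
        for t in tracks:
            if not unmatched:
                break
            best = max(unmatched, key=lambda c: iou(t.box, c[0]))
            if iou(t.box, best[0]) >= iou_thr:
                t.box, t.appearance = best    # update track with the matched detection
                unmatched.remove(best)
        # For category tracking (MOT/MOTS/VIS), leftover detections start new tracks;
        # for instance tracking (SOT/VOS), only the annotated targets would be kept.
        for box, feat in unmatched:
            tracks.append(Track(track_id=len(tracks), box=box, appearance=feat))
    return tracks
```

The sketch uses plain IoU matching for readability; the actual model would perform association with learned appearance features and a shared network for all tasks.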

URL

https://arxiv.org/abs/2303.12079

PDF

https://arxiv.org/pdf/2303.12079.pdf

