Abstract
Object tracking (OT) aims to estimate the positions of target objects in a video sequence. Depending on whether the initial states of the targets are specified by annotations in the first frame or by object categories, OT can be divided into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, in which tracking supplements appearance priors for detection, and detection in turn provides tracking with candidate bounding boxes for association. Equipped with this design, a unified tracking model, OmniTracker, is further presented to solve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline. Extensive experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
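The tracking-with-detection loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the OmniTracker implementation: `detect`, the `priors` argument, and the 0.3 association threshold are all hypothetical placeholders standing in for the paper's learned components.

```python
# Hypothetical sketch of a tracking-with-detection loop: tracking feeds
# appearance priors to the detector, and the detector returns candidate
# boxes that tracking greedily associates across frames by IoU.
# All names are illustrative placeholders, not the OmniTracker API.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track_with_detection(frames, detect, init_tracks):
    """Greedy IoU association of per-frame detections with existing tracks.

    detect(frame, priors) -> list of candidate boxes; `priors` stands in
    for the appearance cues that tracking feeds back to the detector.
    Returns track_id -> list of boxes over time.
    """
    tracks = dict(init_tracks)          # track_id -> most recent box
    history = {tid: [box] for tid, box in tracks.items()}
    for frame in frames:
        candidates = detect(frame, priors=tracks)
        for tid, last_box in tracks.items():
            if not candidates:
                break
            best = max(candidates, key=lambda c: iou(last_box, c))
            if iou(last_box, best) > 0.3:   # association threshold (assumed)
                tracks[tid] = best
                history[tid].append(best)
                candidates.remove(best)
    return history
```

In practice the greedy matching here would be replaced by a global assignment (e.g., Hungarian matching) and the priors by learned appearance embeddings; the sketch only shows how the two directions of information flow interlock.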
URL
https://arxiv.org/abs/2303.12079