Abstract
Recent progress in multiple object tracking (MOT) has shown that a robust similarity score is key to the success of trackers. A good similarity score is expected to reflect multiple cues, e.g. appearance, location, and topology, over a long period of time. However, these cues are heterogeneous, which makes them hard to combine in a unified network. As a result, existing methods usually encode them in separate networks or require a complex training approach. In this paper, we present a unified framework for similarity measurement that simultaneously encodes various cues and performs reasoning across both the spatial and temporal domains. We also study the feature representation of a tracklet-object pair in depth, showing that a proper design of the pair features can considerably benefit trackers. The resulting approach is named spatial-temporal relation networks (STRN). It runs in a feed-forward way and can be trained end to end. It achieves state-of-the-art accuracy on all of the MOT15-17 benchmarks under the public-detection, online setting.
URL
https://arxiv.org/abs/1904.11489