Abstract
The main challenge of Multiple Object Tracking (MOT) is efficiently associating an indefinite number of objects between video frames. Standard motion estimators used in tracking, e.g., Long Short-Term Memory (LSTM), handle only a single object, while Re-IDentification (Re-ID) based approaches exhaustively compare object appearances. Both approaches become computationally costly when scaled to a large number of objects, making real-time MOT very difficult. To address these problems, we propose a highly efficient Deep Neural Network (DNN) that simultaneously models association among an indefinite number of objects, with inference cost that does not grow with the number of objects. Our approach, Frame-wise Motion and Appearance (FMA), computes Frame-wise Motion Fields (FMF) between two frames, which enables very fast and reliable matching among a large number of object bounding boxes. Frame-wise Appearance Features (FAF) are learned in parallel with FMFs and serve as auxiliary information to resolve uncertain matches. Extensive experiments on the MOT17 benchmark show that our method achieves real-time MOT with results competitive with state-of-the-art approaches.
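The abstract describes matching boxes by a per-pixel motion field, with appearance features breaking ties. A minimal sketch of that idea is below; it is not the paper's implementation, and all names, thresholds, and the nearest-center matching rule are illustrative assumptions (the paper's actual FMF/FAF formulation may differ).

```python
import numpy as np

def match_boxes(prev_boxes, curr_boxes, motion_field,
                prev_feats=None, curr_feats=None, dist_thresh=20.0):
    """Hypothetical FMF-style association sketch.

    prev_boxes, curr_boxes: (N, 4) / (M, 4) arrays of [x1, y1, x2, y2].
    motion_field: (H, W, 2) per-pixel displacement between the two frames
                  (assumed to come from the network).
    prev_feats, curr_feats: optional (N, D) / (M, D) appearance embeddings
                  used only to disambiguate multiple candidates.
    Returns a list of (prev_index, curr_index) matches.
    """
    matches, used = [], set()
    for i, (x1, y1, x2, y2) in enumerate(prev_boxes):
        # Average the motion vectors inside the box to predict its shift.
        region = motion_field[int(y1):int(y2), int(x1):int(x2)]
        dx, dy = region.reshape(-1, 2).mean(axis=0)
        cx = (x1 + x2) / 2 + dx
        cy = (y1 + y2) / 2 + dy
        # Collect current boxes whose centers fall near the predicted center.
        cand = []
        for j, (a1, b1, a2, b2) in enumerate(curr_boxes):
            if j in used:
                continue
            d = np.hypot((a1 + a2) / 2 - cx, (b1 + b2) / 2 - cy)
            if d < dist_thresh:
                cand.append((d, j))
        if not cand:
            continue
        if len(cand) > 1 and prev_feats is not None and curr_feats is not None:
            # Ambiguous match: fall back to appearance-feature distance.
            cand = [(np.linalg.norm(prev_feats[i] - curr_feats[j]), j)
                    for _, j in cand]
        _, j = min(cand)
        used.add(j)
        matches.append((i, j))
    return matches
```

Because the motion field is predicted once per frame pair, the per-object work here is a cheap lookup and average, which is the kind of cost profile the abstract claims scales well with object count.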
URL
https://arxiv.org/abs/1905.02292