Abstract
Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily because: (i) In crowded scenes with occluded objects, heavily overlapping bounding boxes cause confusion among closely located objects. Yet humans naturally perceive the depth of elements in a scene when watching 2D videos. Inspired by this, even when the bounding boxes of objects are close on the camera plane, we can differentiate the objects along the depth dimension, thereby establishing a 3D perception of them. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can cause ID switches. However, if the camera pose is known, we can compensate for the errors of linear motion models. In this paper, we propose \textit{DepthMOT}, which (i) detects objects and estimates the scene depth map \textit{end-to-end}, and (ii) compensates for irregular camera motion via camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{this https URL}.
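The core of idea (i) can be illustrated with a minimal sketch: two boxes that are nearly indistinguishable by 2D overlap alone become separable once each object carries an estimated depth. This is not DepthMOT's actual association cost; `depth_aware_distance` and its `depth_scale` weight are hypothetical, shown only to make the intuition concrete.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def depth_aware_distance(box_a, depth_a, box_b, depth_b, depth_scale=1.0):
    """Hypothetical association cost: 2D overlap alone cannot separate
    the two objects, so penalize pairs whose estimated depths disagree."""
    return (1.0 - iou(box_a, box_b)) + depth_scale * abs(depth_a - depth_b)

# Two objects nearly coincident on the camera plane...
box_a, box_b = (100, 100, 140, 180), (104, 102, 144, 182)
# ...but clearly separated along the depth axis (e.g. metres from a depth map).
depth_a, depth_b = 8.0, 14.0

print(f"IoU:  {iou(box_a, box_b):.2f}")   # high overlap: ambiguous in 2D
print(f"cost: {depth_aware_distance(box_a, depth_a, box_b, depth_b):.2f}")
```

A high IoU (here about 0.78) would make the two detections easy to confuse under purely 2D matching, while the depth gap of 6 m dominates the combined cost and keeps the identities apart.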
URL
https://arxiv.org/abs/2404.05518