DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker

2024-04-08 13:39:12
Jiapeng Wu, Yichen Liu

Abstract

Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily because: (i) In crowded scenes with occluded objects, the high overlap between object bounding boxes causes confusion among closely located objects. Humans, however, naturally perceive the depth of elements in a scene when watching 2D videos. Inspired by this, even when the bounding boxes of objects are close on the camera plane, we can differentiate them along the depth dimension, thereby establishing a 3D perception of the objects. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can cause ID switches. However, if the camera pose is known, we can compensate for the errors of linear motion models. In this paper, we propose DepthMOT, which (i) detects objects and estimates the scene depth map end-to-end, and (ii) compensates for irregular camera motion via camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at this https URL.
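
The abstract does not spell out how the estimated depth map enters data association, so the following is only a minimal sketch of the general idea, not the paper's method: give each track and detection a scalar depth value (the hypothetical track_depths / det_depths below, e.g. the predicted depth at each box center) and mix a depth-gap penalty into the usual IoU matching cost, so that heavily overlapping boxes at different depths remain separable. The weight alpha and the normalization are illustrative choices.

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """IoU between every pair of [x1, y1, x2, y2] boxes in two sets."""
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def depth_aware_cost(track_boxes, det_boxes, track_depths, det_depths, alpha=0.5):
    """Mix box overlap with a normalized depth gap (illustrative weighting).

    Two heavily overlapping boxes on the image plane score a low cost only
    if their estimated depths also agree, so close-by targets at different
    depths stay distinguishable during association.
    """
    iou_cost = 1.0 - pairwise_iou(track_boxes, det_boxes)
    depth_gap = np.abs(track_depths[:, None] - det_depths[None, :])
    depth_cost = depth_gap / (depth_gap.max() + 1e-9)  # scale gaps to [0, 1]
    return alpha * iou_cost + (1.0 - alpha) * depth_cost
```

The resulting cost matrix can then be passed to any standard assignment solver, e.g. scipy.optimize.linear_sum_assignment.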
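
Likewise, the abstract only states that camera pose estimation is used to compensate for irregular camera motion. A common way to realize such compensation, sketched below under the assumption that the estimated inter-frame camera motion has been reduced to a 3x3 projective transform H, is to warp the motion-model predictions from the previous frame into the current frame before matching, so the linear motion model only has to explain each object's own motion.

```python
import numpy as np

def compensate_boxes(boxes, H):
    """Warp [x1, y1, x2, y2] boxes by a 3x3 projective transform H.

    Applying the inter-frame camera transform to the predicted boxes removes
    the apparent motion caused purely by the moving camera. Warping just the
    two corner points is a common axis-aligned approximation, since a
    projective map does not generally preserve rectangles.
    """
    pts = boxes.reshape(-1, 2)                                   # (2N, 2) corner points
    pts = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # homogeneous coords
    warped = (H @ pts.T).T
    warped = warped[:, :2] / warped[:, 2:3]                      # back from homogeneous
    return warped.reshape(-1, 4)
```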

URL

https://arxiv.org/abs/2404.05518

PDF

https://arxiv.org/pdf/2404.05518.pdf

