Frame-wise Motion and Appearance for Real-time Multiple Object Tracking

2019-05-06 23:37:05
Jimuyang Zhang, Sanping Zhou, Jinjun Wang, Dong Huang

Abstract

The main challenge of Multiple Object Tracking (MOT) is efficiently associating an indefinite number of objects between video frames. Standard motion estimators used in tracking, e.g., Long Short-Term Memory (LSTM), handle only a single object at a time, while Re-IDentification (Re-ID) based approaches exhaustively compare object appearances. Both approaches are computationally costly when scaled to a large number of objects, making real-time MOT very difficult. To address these problems, we propose a highly efficient Deep Neural Network (DNN) that simultaneously models association among an indefinite number of objects, and whose inference cost does not grow with the number of objects. Our approach, Frame-wise Motion and Appearance (FMA), computes Frame-wise Motion Fields (FMF) between two frames, which enables very fast and reliable matching among a large number of object bounding boxes. In parallel with the FMFs, Frame-wise Appearance Features (FAF) are learned as auxiliary information to resolve uncertain matches. Extensive experiments on the MOT17 benchmark show that our method achieves real-time MOT with results competitive with state-of-the-art approaches.
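The abstract only outlines the FMA pipeline, so the exact association procedure is not spelled out here. Below is a minimal sketch of how an FMF-based matcher with an FAF fallback could work: previous-frame boxes are warped by a per-pixel motion field and matched to current detections by IoU, with appearance similarity resolving the uncertain cases. The function names, thresholds, and greedy matching strategy are illustrative assumptions, not the paper's implementation; appearance features are assumed L2-normalized so a dot product gives cosine similarity.

```python
import numpy as np

def warp_boxes_with_fmf(boxes, motion_field):
    """Warp previous-frame boxes [x1, y1, x2, y2] into the current frame
    using a frame-wise motion field: an (H, W, 2) array of per-pixel (dx, dy)."""
    warped = []
    for x1, y1, x2, y2 in boxes.astype(int):
        # Average the motion vectors inside the box region
        dx, dy = motion_field[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
        warped.append([x1 + dx, y1 + dy, x2 + dx, y2 + dy])
    return np.array(warped)

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, cur_boxes, motion_field, prev_feats, cur_feats,
              iou_thresh=0.5, app_thresh=0.7):
    """Match warped previous boxes to current detections by IoU; fall back
    to appearance (cosine) similarity for uncertain pairs. Thresholds are
    illustrative, not taken from the paper."""
    warped = warp_boxes_with_fmf(prev_boxes, motion_field)
    matches = []
    for i, wb in enumerate(warped):
        ious = np.array([iou(wb, cb) for cb in cur_boxes])
        j = int(ious.argmax())
        if ious[j] >= iou_thresh:
            matches.append((i, j))  # confident motion-based match
        else:
            # Uncertain match: disambiguate with appearance features
            sims = cur_feats @ prev_feats[i]
            k = int(sims.argmax())
            if sims[k] >= app_thresh:
                matches.append((i, k))
    return matches
```

Because both the motion field and the appearance features are computed once per frame rather than per object, the per-frame cost of this association step stays essentially flat as the number of tracked objects grows, which is the efficiency argument the abstract makes.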

URL

https://arxiv.org/abs/1905.02292

PDF

https://arxiv.org/pdf/1905.02292.pdf

