Spatial-Temporal Relation Networks for Multi-Object Tracking

2019-04-25 17:59:17
Jiarui Xu, Yue Cao, Zheng Zhang, Han Hu

Abstract

Recent progress in multiple object tracking (MOT) has shown that a robust similarity score is key to the success of trackers. A good similarity score is expected to reflect multiple cues, e.g. appearance, location, and topology, over a long period of time. However, these cues are heterogeneous, which makes them hard to combine in a unified network. As a result, existing methods usually encode them in separate networks or require a complex training approach. In this paper, we present a unified framework for similarity measurement that simultaneously encodes various cues and performs reasoning across both the spatial and temporal domains. We also study the feature representation of a tracklet-object pair in depth, showing that a proper design of the pair features can substantially strengthen the tracker. The resulting approach is named spatial-temporal relation networks (STRN). It runs in a feed-forward way and can be trained end-to-end. It achieves state-of-the-art accuracy on all of the MOT15-17 benchmarks using public detections in the online setting.
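
To make the idea of a tracklet-object similarity score concrete, the following is a minimal sketch of a hand-combined score built from two of the cues the abstract names, appearance and location. It is not STRN and not the authors' method: the fixed weights, the cosine/IoU choice, the greedy matcher, and every name in it (`similarity`, `greedy_match`, the `feat` and `box` keys) are assumptions made for illustration only. A learned, unified network as described in the abstract would replace the hand-tuned weighted sum.

```python
import math


def cosine(a, b):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def similarity(tracklet, detection, w_app=0.5, w_loc=0.5):
    """Hypothetical hand-weighted score: appearance cue plus location cue."""
    appearance = cosine(tracklet["feat"], detection["feat"])
    location = iou(tracklet["box"], detection["box"])
    return w_app * appearance + w_loc * location


def greedy_match(tracklets, detections, threshold=0.3):
    """Online association: take the highest-scoring pairs first."""
    scored = sorted(
        ((similarity(t, d), i, j)
         for i, t in enumerate(tracklets)
         for j, d in enumerate(detections)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs score even lower
        if i in used_t or j in used_d:
            continue  # each tracklet and detection is matched at most once
        used_t.add(i)
        used_d.add(j)
        matches.append((i, j))
    return sorted(matches)


tracklets = [
    {"feat": [1.0, 0.0], "box": (0, 0, 10, 10)},
    {"feat": [0.0, 1.0], "box": (20, 20, 30, 30)},
]
detections = [
    {"feat": [0.0, 1.0], "box": (21, 20, 31, 30)},
    {"feat": [1.0, 0.0], "box": (1, 0, 11, 10)},
]
print(greedy_match(tracklets, detections))
```

The sketch uses only the most recent box and feature of each tracklet, so it ignores the long-term temporal reasoning and topology cue that the abstract emphasizes; it is meant only to show what "similarity score" refers to in an online tracker.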

URL

https://arxiv.org/abs/1904.11489

PDF

https://arxiv.org/pdf/1904.11489.pdf

