Paper Reading AI Learner

A Large Scale Urban Surveillance Video Dataset for Multiple-Object Tracking and Behavior Analysis

2019-04-26 11:58:36
Guojun Yin, Bin Liu, Huihui Zhu, Tao Gong, Nenghai Yu

Abstract

Multiple-object tracking and behavior analysis are essential parts of surveillance video analysis for public security and urban management. With billions of surveillance videos captured all over the world, performing multiple-object tracking and behavior analysis by manual labor is cumbersome and costly. Driven by the rapid development of deep learning algorithms in recent years, automatic object tracking and behavior analysis create an urgent demand for a large-scale, well-annotated surveillance video dataset that reflects the diverse, congested, and complicated scenarios of real applications. This paper introduces an urban surveillance video dataset (USVD) which is by far the largest and most comprehensive. The dataset consists of 16 scenes captured in 7 typical outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. Over 200k video frames are carefully annotated, resulting in more than 3.7 million object bounding boxes and about 7.1 thousand trajectories. We further use this dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomalous-behavior analysis, and explore the robustness of these methods in congested urban scenarios.
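The abstract does not specify USVD's annotation file format. As a hedged sketch, assuming annotations follow the common MOTChallenge-style CSV layout (one row per box: frame, track_id, x, y, w, h), the dataset-scale statistics quoted above (frames, bounding boxes, trajectories) could be tallied like this:

```python
def summarize_annotations(lines):
    """Summarize per-frame bounding-box annotations given as
    MOTChallenge-style CSV rows: frame,track_id,x,y,w,h[,...].
    Returns (num_frames, num_boxes, num_trajectories).

    Note: the CSV layout is an assumption for illustration; USVD's
    actual annotation format is not described in the abstract."""
    frames, tracks = set(), set()
    num_boxes = 0
    for line in lines:
        fields = line.strip().split(",")
        frames.add(int(fields[0]))   # frame index
        tracks.add(int(fields[1]))   # trajectory (track) id
        num_boxes += 1               # one row per bounding box
    return len(frames), num_boxes, len(tracks)

# Tiny synthetic example: 2 frames, 3 boxes, 2 trajectories.
rows = ["1,1,10,20,50,100", "1,2,200,40,60,120", "2,1,12,22,50,100"]
print(summarize_annotations(rows))  # (2, 3, 2)
```

On the full dataset, the same tally would be expected to report roughly 200k frames, 3.7 million boxes, and 7.1 thousand trajectories.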

URL

https://arxiv.org/abs/1904.11784

PDF

https://arxiv.org/pdf/1904.11784.pdf

