Point in, Box out: Beyond Counting Persons in Crowds

2019-04-02 11:03:32
Yuting Liu, Miaojing Shi, Qijun Zhao, Xiaofang Wang

Abstract

Modern crowd counting methods usually employ deep neural networks (DNN) to estimate crowd counts via density regression. Despite their significant improvements, regression-based methods cannot detect individuals in crowds. Detection-based methods, on the other hand, have not been widely explored in recent crowd counting work because they need expensive bounding box annotations. In this work, we instead propose a new deep detection network that requires only point supervision. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training, while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. Finally, we propose a curriculum learning strategy that trains the network first on images with relatively accurate and easy pseudo ground truth. Extensive experiments are conducted on both detection and counting tasks on several standard benchmarks, e.g. the ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state of the art.
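
The abstract says pseudo ground truth boxes are initialized from point annotations but does not state the rule. As a rough illustration only, the sketch below assumes that each head box is a square centered on its annotated point, with a side equal to the distance to the nearest other annotated point. That rule, the function name `init_pseudo_boxes`, and the `default_size` fallback are hypothetical choices made for this example and are not taken from the paper.

```python
import math


def init_pseudo_boxes(points, default_size=16.0):
    """Return one square box (x1, y1, x2, y2) per annotated head point.

    Assumption (not from the abstract): the box side equals the distance
    to the nearest other annotated point. A lone point falls back to
    the hypothetical `default_size`.
    """
    boxes = []
    for i, (x, y) in enumerate(points):
        # Distances from this point to every other annotated point.
        dists = [
            math.hypot(x - px, y - py)
            for j, (px, py) in enumerate(points)
            if j != i
        ]
        side = min(dists) if dists else default_size
        half = side / 2.0
        boxes.append((x - half, y - half, x + half, y + half))
    return boxes


# Three annotated head centers; the first two are 5 pixels apart.
heads = [(10.0, 10.0), (13.0, 14.0), (40.0, 10.0)]
boxes = init_pseudo_boxes(heads)
count = len(boxes)  # with one box per head, counting is the number of boxes
```

Under this assumption, heads in dense regions get small boxes and isolated heads get large ones, which is the kind of size cue that an online refinement step could later correct during training.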

URL

https://arxiv.org/abs/1904.01333

PDF

https://arxiv.org/pdf/1904.01333.pdf

