Towards Locally Consistent Object Counting with Constrained Multi-stage Convolutional Neural Networks

2019-04-06 06:21:07
Muming Zhao, Jian Zhang, Chongyang Zhang, Wenjun Zhang

Abstract

High-density object counting in surveillance scenes is challenging, mainly because of the drastic variation in object scale. The prevalence of deep learning has substantially improved object counting accuracy on several benchmark datasets. However, do the global counts really count? Armed with this question, we examine the predicted density map, whose sum over the whole image gives the global count, for a more in-depth analysis. We observe that the object density maps generated by most existing methods usually lack local consistency, i.e., counting errors occur in local regions even when the global count appears to match the ground truth well. To address this problem, we propose constrained multi-stage Convolutional Neural Networks (CNNs) that pursue locally consistent density maps from two aspects. Unlike most existing methods, which mainly rely on multi-column architectures of plain CNNs, we exploit a stacking formulation of plain CNNs. Thanks to the internal multi-stage learning process, the feature map can be repeatedly refined, allowing the density map to approach the ground-truth density distribution. To further refine the density map, we also propose a grid loss function. With finer local-region-based supervision, the underlying model is constrained to generate locally consistent density values that minimize the training error with respect to both global and local count accuracy. Experiments on two widely tested object counting benchmarks yield significant overall results compared with state-of-the-art methods, demonstrating the effectiveness of our approach.
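
The idea of supervising local counts alongside the global count can be illustrated with a small sketch. The abstract does not give the paper's actual grid loss, so everything here is an assumption made for illustration: the function names `block_counts` and `grid_loss`, the squared-error form, the equal-sized grid partition, and the `local_weight` parameter are all hypothetical.

```python
def block_counts(density, grid):
    """Sum a 2-D density map (list of rows) over a grid x grid partition."""
    h, w = len(density), len(density[0])
    # Cell boundaries; integer division copes with sizes not divisible by grid.
    rows = [i * h // grid for i in range(grid + 1)]
    cols = [j * w // grid for j in range(grid + 1)]
    return [
        [
            sum(
                density[r][c]
                for r in range(rows[i], rows[i + 1])
                for c in range(cols[j], cols[j + 1])
            )
            for j in range(grid)
        ]
        for i in range(grid)
    ]


def total(density):
    """Global count: the sum of the density map over the whole image."""
    return sum(sum(row) for row in density)


def grid_loss(pred, gt, grid=2, local_weight=1.0):
    """Assumed form: squared global count error plus the weighted mean
    squared count error over the grid cells (not the paper's definition)."""
    global_err = (total(pred) - total(gt)) ** 2
    p, g = block_counts(pred, grid), block_counts(gt, grid)
    local_err = sum(
        (p[i][j] - g[i][j]) ** 2 for i in range(grid) for j in range(grid)
    ) / grid ** 2
    return global_err + local_weight * local_err


# Two maps with the same global count (4) but different spatial layouts.
gt = [[2.0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 2.0]]
pred = [[0, 0, 0, 2.0], [0, 0, 0, 0], [0, 0, 0, 0], [2.0, 0, 0, 0]]

print(total(pred) == total(gt))  # global counts agree
print(grid_loss(pred, gt))       # local term still penalizes the mismatch
```

In this toy case the global count error is zero, yet every grid cell is miscounted by two objects, so a loss that only looks at the global count would not distinguish the two maps while the local term does.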

URL

https://arxiv.org/abs/1904.03373

PDF

https://arxiv.org/pdf/1904.03373.pdf

