Paper Reading AI Learner

Crowd3D: Towards Hundreds of People Reconstruction from a Single Image

2023-01-23 11:45:27
Hao Wen, Jing Huang, Huili Cui, Haozhe Lin, YuKun Lai, Lu Fang, Kun Li

Abstract

Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alert. However, existing methods cannot deal with large scenes containing hundreds of people, which encounter the challenges of large number of people, large variations in human scale, and complex spatial distribution. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP by pre-estimating a scene-level camera and a ground plane. To deal with a large number of persons and various human sizes, we also design an adaptive human-centric cropping scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene. Experimental results demonstrate the effectiveness of the proposed method. The code and datasets will be made public.

Abstract (translated)

在宽广的场景中,基于图像的多人重建对于人群分析和安全警报至关重要。然而,现有的方法无法处理包含数百人的大规模场景,该场景面临多人、人类规模的巨大差异和复杂的空间分布的挑战。在本文中,我们提出了 Crowd3D,是第一个框架,可以从一张大规模场景图像中以全球一致性重建数百人的三维姿势、形状和位置。我们的方法的核心是使用我们新定义的概念——人类场景虚拟交互点(HVIP),将复杂的人群定位问题转换为像素定位问题。为了以全球一致性重建人群,我们提出了基于HVIP的渐进重建网络,并通过预先估计场景级别的相机和地面平面来实现。为了处理大量的人员和不同人类大小的人类,我们还设计了自适应人类中心裁剪方案。此外,我们还为在一个大规模场景中进行人群重建提供了基准数据集LargeCrowd。实验结果显示了该方法的有效性。代码和数据集将公开发布。

URL

https://arxiv.org/abs/2301.09376

PDF

https://arxiv.org/pdf/2301.09376.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot