Paper Reading AI Learner

Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption

2024-04-17 11:55:45
Buzhen Huang, Chen Li, Chongyang Xu, Liang Pan, Yangang Wang, Gim Hee Lee

Abstract

Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we tackle the task of reconstructing closely interacting humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this, we propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information. This is based on the observation that human interaction has specific patterns following social proxemics. Specifically, we first design a latent representation based on a Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics- and physics-guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model as a dual branch, with each branch representing one individual, such that the interaction can be modeled via cross-attention. With the learned priors of the VQ-VAE and physical constraints as additional information, our proposed approach is capable of estimating accurate poses that are also plausible in terms of proxemics and physics. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate that our method outperforms existing approaches. The code is available at \url{this https URL}.
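The abstract describes modeling interaction in a discrete latent space learned by a VQ-VAE. The core of VQ-VAE quantization is a nearest-neighbor lookup of a continuous latent vector against a learned codebook. A minimal illustrative sketch of that lookup follows; the toy codebook, function name, and dimensions are assumptions for illustration, not the authors' code:

```python
import math

def quantize(latent, codebook):
    """Return the index of the codebook entry nearest to `latent`
    (Euclidean distance) -- the quantization step of a VQ-VAE."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(latent, codebook[i]))

# Toy 2-D codebook with four entries (a real codebook would be learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(quantize([0.9, 0.1], codebook))  # nearest entry is [1.0, 0.0] -> 1
```

In the actual method the codebook is trained jointly with the encoder/decoder, and the diffusion model operates over these discrete interaction codes rather than raw poses.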

Abstract (translated)

Current multi-person human reconstruction methods mainly focus on accurately recovering poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we address the problem of reconstructing closely interacting humans from a monocular video. The difficulty of this task stems from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. We therefore propose to leverage knowledge of proxemic behavior and physics to compensate for the lack of visual information, based on the observation that human interaction follows specific patterns. Concretely, we first design a latent representation based on a Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model as a dual branch, with each branch representing one individual, so that the interaction can be modeled via cross-attention. With the priors learned by the VQ-VAE and physical constraints as additional information, our method can estimate poses that are both proxemically and physically plausible. In experiments on Hi4D, 3DPW, and CHI3D, our method performs strongly. The code is available at \url{this https URL}.

URL

https://arxiv.org/abs/2404.11291

PDF

https://arxiv.org/pdf/2404.11291.pdf

