ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos

2025-04-17 17:59:02
Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, Martin R. Oswald

Abstract

Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully. Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at this https URL.

URL

https://arxiv.org/abs/2504.13167

PDF

https://arxiv.org/pdf/2504.13167.pdf

