Abstract
Reconstructing a photorealistic scene and human from a single monocular in-the-wild video is central to perceiving a human-centric 3D world. Recent advances in neural rendering have enabled holistic human-scene reconstruction, but they require pre-calibrated camera and human poses and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. We use 3D Gaussian Splatting to efficiently learn Gaussian primitives for humans and scenes, and we design reconstruction-based camera tracking and human pose estimation modules to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. To accurately learn the spatial correlation between human and scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance compared with existing methods in camera tracking, human pose estimation, novel view synthesis, and runtime. Our project page is at this https URL.
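The abstract does not spell out how the occlusion-aware human silhouette is rendered. As a rough illustration only, the sketch below applies generic front-to-back alpha compositing, as used in splatting-style renderers, to a single pixel ray: each human primitive contributes its opacity weighted by the transmittance left after all nearer primitives, scene primitives included. The function name `composite_silhouette` and its inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def composite_silhouette(depths, alphas, is_human):
    """Human silhouette value for one pixel ray (illustrative sketch).

    depths   : depth of each primitive along the ray
    alphas   : per-primitive opacity in [0, 1] at this pixel
    is_human : True where the primitive belongs to the human
    All three are hypothetical inputs, not the paper's interface.
    """
    a = np.asarray(alphas, dtype=float)
    if a.size == 0:
        return 0.0
    order = np.argsort(depths)  # nearest primitive first
    a = a[order]
    h = np.asarray(is_human, dtype=bool)[order]
    # Transmittance before primitive i: prod_{j<i} (1 - a_j).
    trans = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    weights = a * trans
    # Only human primitives add to the mask, but scene primitives in
    # front of them have already reduced the transmittance.
    return float(np.sum(weights[h]))

# A scene primitive in front of the human halves the silhouette value.
occluded = composite_silhouette([1.0, 2.0], [0.5, 0.8], [False, True])
visible = composite_silhouette([2.0, 1.0], [0.5, 0.8], [False, True])
```

Under this formulation a silhouette loss would only expect human coverage where the human is not hidden by scene geometry, which is one plausible reading of "occlusion-aware".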
URL
https://arxiv.org/abs/2504.13167