Paper Reading AI Learner

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

2024-04-22 17:59:50
Inhee Lee, Byungjun Kim, Hanbyul Joo

Abstract

In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and the multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, which enables us to conveniently and efficiently compose and render them together. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach that optimizes the 3D-GS representation in a canonical space by fusing the sparse cues in that common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping them consistent with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusions, image crops, few-shot inputs, and extremely sparse observations. After reconstruction, our method can not only render the scene from any novel view at arbitrary time instances, but also edit the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over existing alternative approaches.
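
The sketch below is a minimal illustration, not the authors' released code, of the compositional idea stated in the abstract: the static world and each human are kept as separate sets of 3D Gaussians, the human Gaussians live in a canonical space and are posed for the requested time step, and all sets are concatenated for joint rendering (which also makes edits such as removing a human trivial). The names (`GaussianSet`, `pose_human`, `compose_scene`) and the single rigid transform standing in for full per-Gaussian skinning are illustrative assumptions.

```python
# Minimal sketch (assumed structure, not the paper's implementation) of composing
# world and per-human 3D Gaussian Splatting models for joint rendering.

from dataclasses import dataclass
import numpy as np


@dataclass
class GaussianSet:
    """Per-Gaussian parameters of a 3D-GS model."""
    means: np.ndarray      # (N, 3) positions
    scales: np.ndarray     # (N, 3) per-axis scales
    rotations: np.ndarray  # (N, 4) unit quaternions
    opacities: np.ndarray  # (N, 1)
    sh_coeffs: np.ndarray  # (N, K, 3) spherical-harmonic color coefficients


def pose_human(canonical: GaussianSet, R: np.ndarray, t: np.ndarray) -> GaussianSet:
    """Map a human's canonical-space Gaussians into the world at one time step.

    A single rigid transform (R, t) stands in here for the per-Gaussian
    skinning a real animatable-human model would apply.
    """
    return GaussianSet(
        means=canonical.means @ R.T + t,
        scales=canonical.scales,        # rigid motion leaves scales unchanged
        rotations=canonical.rotations,  # a full version would also rotate these
        opacities=canonical.opacities,
        sh_coeffs=canonical.sh_coeffs,
    )


def compose_scene(world: GaussianSet, posed_humans: list[GaussianSet]) -> GaussianSet:
    """Concatenate world and human Gaussians into one set for joint rendering.

    Because 3D-GS is an explicit, point-based representation, composition
    (or editing, e.g. dropping one human) is just array concatenation (or removal).
    """
    parts = [world] + posed_humans
    return GaussianSet(
        means=np.concatenate([p.means for p in parts]),
        scales=np.concatenate([p.scales for p in parts]),
        rotations=np.concatenate([p.rotations for p in parts]),
        opacities=np.concatenate([p.opacities for p in parts]),
        sh_coeffs=np.concatenate([p.sh_coeffs for p in parts]),
    )
```

The diffusion-guided part of the method (using a pre-trained 2D diffusion model to supervise unseen views of the canonical human) is not shown here, as the abstract does not specify its implementation.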

URL

https://arxiv.org/abs/2404.14410

PDF

https://arxiv.org/pdf/2404.14410.pdf

