
Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM for Dynamic Environments

2025-06-06 10:43:41
Mingrui Li, Yiming Zhou, Hongxing Zhou, Xinggang Hu, Florian Roemer, Hongyu Wang, Ahmad Osman

Abstract

Current Simultaneous Localization and Mapping (SLAM) methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting excel at reconstructing static 3D scenes but struggle with tracking and reconstruction in dynamic environments, such as real-world scenes with moving elements. Existing NeRF-based SLAM approaches that address dynamic challenges typically rely on RGB-D inputs, and few methods accommodate pure RGB input. To overcome these limitations, we propose Dy3DGS-SLAM, the first 3D Gaussian Splatting (3DGS) SLAM method for dynamic scenes using monocular RGB input. To address dynamic interference, we fuse optical flow masks and depth masks through a probabilistic model to obtain a fused dynamic mask. With only a single network iteration, this mask constrains the tracking scale and refines the rendered geometry. Based on the fused dynamic mask, we design a novel motion loss that constrains the pose estimation network during tracking. In mapping, we apply the rendering loss over dynamic pixels, color, and depth to eliminate the transient interference and occlusion caused by dynamic objects. Experimental results demonstrate that Dy3DGS-SLAM achieves state-of-the-art tracking and rendering in dynamic environments, outperforming or matching existing RGB-D methods.
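The abstract sketches two reusable ideas: combining an optical-flow mask and a depth mask through a probabilistic model, and restricting the rendering loss to non-dynamic pixels during mapping. The paper's exact formulation is not given here, so the snippet below is only a minimal illustrative sketch under assumed definitions: independent per-pixel cue probabilities for the fusion, simple L1 color and depth terms, and a hypothetical depth_weight parameter.

```python
import numpy as np

def fuse_dynamic_masks(flow_prob, depth_prob, threshold=0.5):
    """Combine per-pixel dynamic probabilities from an optical-flow cue and a
    depth-inconsistency cue into one binary dynamic mask.

    flow_prob, depth_prob: HxW arrays in [0, 1].
    Assumption (not from the paper): the two cues are treated as independent,
    so a pixel is dynamic with probability 1 - (1 - p_flow)(1 - p_depth).
    """
    fused_prob = 1.0 - (1.0 - flow_prob) * (1.0 - depth_prob)
    return fused_prob > threshold  # boolean HxW mask, True = dynamic


def masked_rendering_loss(rendered_rgb, target_rgb,
                          rendered_depth, target_depth,
                          dynamic_mask, depth_weight=0.1):
    """L1 color + depth rendering loss evaluated only on static pixels,
    so transient objects and their occlusions do not corrupt the map."""
    static = ~dynamic_mask
    color_term = np.abs(rendered_rgb - target_rgb).mean(axis=-1)[static].mean()
    depth_term = np.abs(rendered_depth - target_depth)[static].mean()
    return color_term + depth_weight * depth_term


# Toy usage on random data, just to show the shapes involved.
H, W = 48, 64
rng = np.random.default_rng(0)
mask = fuse_dynamic_masks(rng.random((H, W)), rng.random((H, W)))
loss = masked_rendering_loss(rng.random((H, W, 3)), rng.random((H, W, 3)),
                             rng.random((H, W)), rng.random((H, W)), mask)
print(mask.mean(), float(loss))
```

In the actual system the fused mask additionally drives a motion loss that constrains the pose-estimation network during tracking; that part depends on the tracking backbone and is omitted from this sketch.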

URL

https://arxiv.org/abs/2506.05965

PDF

https://arxiv.org/pdf/2506.05965.pdf

