Paper Reading AI Learner

DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

2025-12-15 09:21:28
Vivek Alumootil, Tuan-Anh Vu, M. Khalid Jawed

Abstract

Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering of input frames, constraining their flexibility and applicability. Meanwhile, recent advances have enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring the opportunity for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, which is especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and code are available via the open repository: this https URL
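The data flow described in the abstract (a shared backbone producing per-pixel spatio-temporal features, with dense heads regressing a 3D pointmap and point tracks from unposed frames in one forward pass) can be sketched in a minimal, hypothetical form. All function names, feature dimensions, and output conventions below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def backbone(frames, feat_dim=16, seed=0):
    """Hypothetical stand-in for the shared spatio-temporal backbone:
    maps (F, H, W, 3) RGB frames to (F, H, W, feat_dim) features."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((3, feat_dim)) / np.sqrt(3)
    return frames @ proj  # per-pixel linear projection (placeholder)

def pointmap_head(feats, seed=1):
    """Dense head regressing a per-pixel 3D pointmap, shape (F, H, W, 3)."""
    rng = np.random.default_rng(seed)
    head = rng.standard_normal((feats.shape[-1], 3))
    return feats @ head

def track_head(feats, seed=2):
    """Dense head regressing per-pixel 2D track offsets relative to a
    reference frame, shape (F, H, W, 2)."""
    rng = np.random.default_rng(seed)
    head = rng.standard_normal((feats.shape[-1], 2))
    return feats @ head

def dept3r_forward(frames):
    """Single forward pass: unposed frames in, pointmaps and tracks out.
    No camera poses or frame ordering are consumed anywhere."""
    feats = backbone(frames)
    return pointmap_head(feats), track_head(feats)

frames = np.random.rand(4, 8, 8, 3)  # F=4 unposed frames, 8x8 pixels
points, tracks = dept3r_forward(frames)
print(points.shape, tracks.shape)    # (4, 8, 8, 3) (4, 8, 8, 2)
```

The key structural point the sketch captures is that both dense outputs are read off the same feature volume in a single pass, rather than being produced by separate pairwise or pose-conditioned pipelines.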

URL

https://arxiv.org/abs/2512.13122

PDF

https://arxiv.org/pdf/2512.13122.pdf

