Abstract
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering of input frames, constraining their flexibility and applicability. Meanwhile, recent advances have enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring the opportunity for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, which is especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and code are available via the open repository: this https URL
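The abstract does not include code, but the described design (a shared backbone over multiple unposed frames, with dense prediction heads regressing pixel-wise point maps and tracks in one forward pass) can be illustrated with a minimal PyTorch sketch. All module names, layer choices, and tensor shapes below are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class DePT3RSketch(nn.Module):
        """Hypothetical sketch of the abstract's design: one shared
        spatio-temporal backbone, two dense pixel-wise heads."""

        def __init__(self, feat_dim=256):
            super().__init__()
            # Stand-in spatio-temporal backbone (the actual backbone is unspecified here).
            self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
            # Dense prediction heads regressing pixel-wise maps.
            self.point_head = nn.Conv3d(feat_dim, 3, kernel_size=1)  # per-pixel 3D coordinates
            self.track_head = nn.Conv3d(feat_dim, 3, kernel_size=1)  # per-pixel track offsets

        def forward(self, frames):
            # frames: (B, 3, T, H, W) -- multiple images, no camera poses required,
            # and no temporal ordering assumed by this sketch.
            feats = self.backbone(frames)    # shared spatio-temporal features
            points = self.point_head(feats)  # (B, 3, T, H, W) dense 3D reconstruction
            tracks = self.track_head(feats)  # (B, 3, T, H, W) dense point tracks
            return points, tracks            # both tasks in a single forward pass

    frames = torch.randn(1, 3, 8, 64, 64)    # 8 unposed frames
    points, tracks = DePT3RSketch()(frames)
    print(points.shape, tracks.shape)        # torch.Size([1, 3, 8, 64, 64]) each

The key property this sketch mirrors is that both outputs come from one shared feature extraction, so tracking and reconstruction cost a single forward pass rather than pairwise processing.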
URL
https://arxiv.org/abs/2512.13122