Abstract
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
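To make the pairing-and-chaining idea concrete, below is a minimal sketch (in PyTorch) of how a two-headed pointmap predictor could pair a fixed reference frame with each later frame and be adapted with a reprojection loss. This is not the released St4RTrack implementation: the network `PointmapPairNet`, its output convention, the intrinsics `K`, and the pseudo 2D tracks are hypothetical placeholders chosen only to illustrate the mechanism described in the abstract.

```python
# Sketch only: illustrates two pointmap heads, chaining against a reference
# frame, and a reprojection loss. All names and shapes are assumptions.
import torch
import torch.nn as nn


class PointmapPairNet(nn.Module):
    """Hypothetical stand-in: given the reference frame and a frame at time t,
    predict two pointmaps, both expressed in the shared world frame at time t."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(6, feat_dim, 3, padding=1)
        # Geometry of the reference frame's pixels, moved to time t (tracking head).
        self.head_ref = nn.Conv2d(feat_dim, 3, 1)
        # Geometry of frame t's pixels in the same world frame (reconstruction head).
        self.head_cur = nn.Conv2d(feat_dim, 3, 1)

    def forward(self, ref_img, cur_img):
        feat = torch.relu(self.encoder(torch.cat([ref_img, cur_img], dim=1)))
        return self.head_ref(feat), self.head_cur(feat)


def reprojection_loss(points_world, track_uv, K):
    """Project predicted 3D track points with intrinsics K (world frame assumed
    aligned with the reference camera here) and penalize deviation from
    observed 2D pixel tracks used as pseudo supervision."""
    proj = points_world @ K.T                       # (N, 3)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return (uv - track_uv).abs().mean()


if __name__ == "__main__":
    net = PointmapPairNet()
    K = torch.tensor([[500.0, 0.0, 64.0],
                      [0.0, 500.0, 64.0],
                      [0.0, 0.0, 1.0]])
    video = torch.rand(5, 3, 128, 128)              # dummy RGB frames
    ref = video[0:1]

    # Chaining: pair the reference frame with every later frame, so tracing the
    # reference pointmap through time yields long-range 3D tracks.
    tracks_3d = []
    for t in range(1, video.shape[0]):
        pts_ref_t, pts_cur_t = net(ref, video[t:t + 1])
        tracks_3d.append(pts_ref_t)

    # Dummy 2D tracks, only to show how the reprojection-based adaptation signal
    # would attach to the tracking head's output.
    fake_uv = torch.rand(10, 2) * 128
    sample_pts = tracks_3d[-1].permute(0, 2, 3, 1).reshape(-1, 3)[:10]
    loss = reprojection_loss(sample_pts, fake_uv, K)
    loss.backward()
    print(f"reprojection loss: {loss.item():.4f}")
```

Pairing every frame with a single reference frame, rather than only consecutive frames, is what lets the per-pair pointmap predictions double as long-range 3D tracks without a separate matching step.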
URL
https://arxiv.org/abs/2504.13152