Abstract
Tracking dense 3D motion from monocular videos remains challenging, particularly when pixel-level precision is required over long sequences. We introduce \Approach, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach uses a joint global-local attention mechanism to track at reduced resolution, followed by a transformer-based upsampler that produces high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or track only sparse points, \Approach delivers dense 3D tracking at scale, running more than 8x faster than previous methods while achieving state-of-the-art accuracy. We further study how the choice of depth representation affects tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of \Approach on multiple benchmarks, setting a new state of the art in both 2D and 3D dense tracking. Our method provides a robust solution for applications that require fine-grained, long-term motion tracking in 3D space.
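The abstract does not give implementation details, but the log-depth finding is easy to illustrate. Below is a minimal sketch, assuming a standard log-depth parameterization in which the network regresses or refines depth in log space; the function names and the epsilon clamp are illustrative, not taken from the paper.

```python
import torch

# Hypothetical helpers for a log-depth parameterization (illustrative only).

def to_log_depth(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map metric depth to log-depth before it is regressed or refined."""
    return torch.log(depth.clamp(min=eps))  # clamp avoids log(0) at invalid pixels

def from_log_depth(log_depth: torch.Tensor) -> torch.Tensor:
    """Invert the mapping to recover metric depth from a predicted log-depth."""
    return torch.exp(log_depth)
```

One plausible reason this representation helps: a constant additive update in log space corresponds to a constant relative change in metric depth, so near and far points are handled on comparable error scales rather than far points dominating the loss.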
URL
https://arxiv.org/abs/2410.24211