Abstract
Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose \method, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
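The core operation the abstract describes, warping target-frame features to the query frame based on the current track estimate, can be sketched with bilinear sampling. This is an illustrative reconstruction, not the paper's implementation; all function names and the NumPy formulation are assumptions.

```python
import numpy as np

def bilinear_sample(feat, coords):
    """Bilinearly sample a feature map feat (H, W, C) at float
    coordinates coords (N, 2), given as (x, y) pairs."""
    H, W, _ = feat.shape
    x = np.clip(coords[:, 0], 0, W - 1)
    y = np.clip(coords[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Weighted average of the four neighbouring feature vectors.
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))

def warp_to_query(target_feat, query_pts, track_estimate):
    """Fetch target-frame features at the locations the current track
    estimate predicts for each query point (hypothetical helper).

    An iterative tracker would compare these warped features with the
    query-frame features and regress a residual update to the tracks,
    repeating until convergence -- no cost volume is ever built."""
    return bilinear_sample(target_feat, query_pts + track_estimate)
```

The key property is that each refinement step costs O(N) in the number of tracked points, instead of the O(H·W) per-point correlation a full cost volume requires.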
URL
https://arxiv.org/abs/2602.04877