Abstract
We introduce AllTracker: a model that estimates long-range point tracks by estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768×1024 pixels on a 40 GB GPU). A benefit of our design is that we can train on a wider set of datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at this https URL.
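For intuition, below is a minimal PyTorch sketch of the kind of update step the abstract describes: spatial propagation via 2D convolutions over a low-resolution feature grid, and temporal propagation via attention applied independently at each pixel location across frames ("pixel-aligned" attention). All module names, shapes, and hyperparameters here are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of one AllTracker-style update block, under the
# assumptions stated above. Not the authors' code.
import torch
import torch.nn as nn

class UpdateBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Spatial propagation: convolutions mix information within each
        # frame's low-resolution grid of correspondence features.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        # Temporal propagation: self-attention across the T frames,
        # computed independently per pixel location.
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (T, C, H, W) -- one feature grid per target frame,
        # at a reduced resolution (e.g. 1/8 or 1/16 of the input).
        T, C, H, W = x.shape
        x = x + self.spatial(x)  # propagate spatially within each frame
        # Pixel-aligned reshape: each of the H*W locations becomes a
        # length-T sequence of C-dim tokens.
        t = x.permute(2, 3, 0, 1).reshape(H * W, T, C)
        t = self.norm(t)
        attn, _ = self.temporal(t, t, t)  # propagate temporally across frames
        x = x + attn.reshape(H, W, T, C).permute(2, 3, 0, 1)
        return x

# Iterative inference: refine the correspondence features over several steps.
feats = torch.randn(64, 128, 48, 64)  # e.g. 64 frames, 48x64 grid
block = UpdateBlock()
for _ in range(4):
    feats = block(feats)
```

Factoring the update this way keeps cost manageable: convolutions scale with grid size per frame, while attention sequences have length T (the number of frames) rather than the number of pixels.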
URL
https://arxiv.org/abs/2506.07310