Paper Reading AI Learner

AllTracker: Efficient Dense Point Tracking at High Resolution

2025-06-08 22:55:06
Adam W. Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Suya You, Rares Ambrus, Katerina Fragkiadaki, Leonidas J. Guibas

Abstract

We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels on a 40 GB GPU). A benefit of our design is that we can train on a wider set of datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at this https URL.
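To make the two propagation mechanisms concrete, here is a minimal NumPy sketch of one iterative update step: a shared 2D convolution propagates information spatially within each frame's low-resolution grid, and a softmax attention computed independently at each pixel location propagates information temporally across frames ("pixel-aligned attention"). All shapes, the averaging kernel, and the identity query/key/value projections are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pixel_aligned_attention(x):
    """Attend across the T frames independently at each (h, w) location.

    x: (T, H, W, C) feature grids for a window of T frames.
    Sketch only: query/key/value are identity projections (no learned weights).
    """
    T, H, W, C = x.shape
    q = k = v = x
    # Attention scores per pixel location: (H, W, T, T)
    scores = np.einsum('thwc,shwc->hwts', q, k) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Weighted sum over source frames s, back to (T, H, W, C)
    return np.einsum('hwts,shwc->thwc', attn, v)

def conv2d_spatial(x, kernel):
    """Apply a shared 3x3 depthwise-style convolution to every frame.

    x: (T, H, W, C); kernel: (3, 3), shared across frames and channels.
    """
    T, H, W, C = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * pad[:, di:di + H, dj:dj + W, :]
    return out

def iterative_update(feats, n_iters=4):
    """Alternate spatial and temporal propagation with residual connections."""
    kernel = np.full((3, 3), 1.0 / 9.0)  # averaging kernel as a stand-in for a learned conv
    for _ in range(n_iters):
        feats = feats + conv2d_spatial(feats, kernel)   # spatial propagation
        feats = feats + pixel_aligned_attention(feats)  # temporal propagation
    return feats
```

Note that the attention is computed per pixel over only the T frames of the window, so its cost grows with T^2 but not with H*W squared; this is what keeps dense all-pixel inference tractable at low grid resolution before upsampling.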


URL

https://arxiv.org/abs/2506.07310

PDF

https://arxiv.org/pdf/2506.07310.pdf

