CoWTracker: Tracking by Warping instead of Correlation

2026-02-04 18:58:59
Zihang Lai, Eldar Insafutdinov, Edgar Sucar, Andrea Vedaldi

Abstract

Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose CoWTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and RoboTAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
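The complexity argument in the abstract — a full cost volume correlates every query pixel with every target pixel, costing O((HW)²), whereas warping samples the target features once per pixel at the current estimate, costing O(HW·C) — can be illustrated with a minimal bilinear-warping sketch. This is a generic NumPy illustration under assumed shapes, not the paper's actual implementation:

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Warp target-frame features toward the query frame using the
    current track/flow estimate.

    feat: (H, W, C) target-frame feature map.
    flow: (H, W, 2) per-pixel displacement (dx, dy) for each query pixel.

    Cost is O(H*W*C) per refinement step, versus O((H*W)^2 * C) to build
    a full all-pairs cost volume.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Sample target features at the currently estimated correspondences,
    # clamped to the image border.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    # Bilinearly interpolate the four neighbouring feature vectors.
    return (feat[y0, x0] * (1 - wx) * (1 - wy)
            + feat[y0, x1] * wx * (1 - wy)
            + feat[y1, x0] * (1 - wx) * wy
            + feat[y1, x1] * wx * wy)

# Toy check: with a zero flow estimate, warping is the identity.
f = np.random.rand(8, 8, 4)
warped = bilinear_warp(f, np.zeros((8, 8, 2)))
```

In an iterative refinement loop of the kind the abstract describes, the warped features would be compared against query-frame features and the residual fed to a network that updates the track estimate, so no explicit all-pairs correlation is ever formed.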

URL

https://arxiv.org/abs/2602.04877

PDF

https://arxiv.org/pdf/2602.04877.pdf
