Abstract
We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform optical flow based methods. Finally, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
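The "copying colors from a reference frame" idea can be sketched as a soft attention over reference-frame locations: each target-frame pixel's color is predicted as a similarity-weighted average of the reference frame's colors. The function name, array shapes, and temperature parameter below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def copy_colors(ref_feat, tgt_feat, ref_colors, temperature=1.0):
    """Predict target-frame colors by soft-copying from a reference frame.

    ref_feat:   (N, D) learned embeddings at N reference-frame locations
    tgt_feat:   (M, D) learned embeddings at M target-frame locations
    ref_colors: (N, C) color labels at the reference locations
                (e.g. one-hot over C quantized color bins)
    Returns:    (M, C) predicted color distribution per target location.
    """
    # Similarity between every target and reference location.
    sim = tgt_feat @ ref_feat.T / temperature          # (M, N)
    # Numerically stable softmax over reference locations.
    sim -= sim.max(axis=1, keepdims=True)
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)
    # Soft copy: weighted average of reference colors.
    return attn @ ref_colors                           # (M, C)
```

Because the attention weights are a distribution over reference locations, they double as a soft tracking correspondence: at test time the same weights can propagate any label (e.g. a segmentation mask) instead of color.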
URL
https://arxiv.org/abs/1806.09594