Abstract
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for unsupervised learning on videos. mgPFF takes as input a pair of frames and outputs per-pixel filters that warp one frame into the other. Compared with optical-flow-based warping, mgPFF is more powerful at modeling sub-pixel movement and handling corruption (e.g., motion blur). We develop a multigrid coarse-to-fine modeling strategy that avoids having to learn large filters to capture large displacements. This lets us train an extremely compact model (4.6 MB) that operates progressively over multiple resolutions with shared weights. We train mgPFF on free-form videos without supervision and show that it not only estimates long-range flow for frame reconstruction and detects video shot transitions, but is also readily amenable to video object segmentation and pose tracking, where it substantially outperforms the published state of the art without bells and whistles. Moreover, because mgPFF predicts a filter per pixel, we have the unique opportunity to visualize how each pixel evolves while solving these tasks, gaining better interpretability.
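The core operation described above, warping one frame into another by applying a distinct predicted filter at every pixel, can be sketched as follows. This is a minimal NumPy illustration of the filter-flow idea; the function name, filter size, and edge-padding convention are assumptions for the sketch, not details from the paper:

```python
import numpy as np

def warp_with_pixel_filters(frame, filters):
    """Warp a grayscale frame (H, W) using a k x k filter per pixel.

    filters: array of shape (H, W, k, k); each per-pixel filter is
    assumed to sum to 1, so it acts as a soft (sub-pixel) displacement.
    """
    H, W = frame.shape
    k = filters.shape[2]
    r = k // 2
    # Edge-pad so every pixel has a full k x k neighborhood (an assumption).
    padded = np.pad(frame, r, mode="edge")
    out = np.empty((H, W), dtype=float)
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.sum(patch * filters[y, x])
    return out
```

A delta filter centered at each pixel reproduces the input frame, while shifting the delta off-center translates the image; spreading mass over several taps models sub-pixel motion, which is what gives the formulation its advantage over a single displacement vector per pixel.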
URL
https://arxiv.org/abs/1904.01693