Paper Reading AI Learner

Multigrid Predictive Filter Flow for Unsupervised Learning on Videos

2019-04-02 22:41:48
Shu Kong, Charless Fowlkes

Abstract

We introduce multigrid Predictive Filter Flow (mgPFF), a framework for unsupervised learning on videos. The mgPFF takes as input a pair of frames and outputs per-pixel filters to warp one frame to the other. Compared to optical flow used for warping frames, mgPFF is more powerful in modeling sub-pixel movement and dealing with corruption (e.g., motion blur). We develop a multigrid coarse-to-fine modeling strategy that avoids the requirement of learning large filters to capture large displacement. This allows us to train an extremely compact model (4.6MB) which operates in a progressive way over multiple resolutions with shared weights. We train mgPFF on unsupervised, free-form videos and show that mgPFF is not only able to estimate long-range flow for frame reconstruction and detect video shot transitions, but is also readily amenable to video object segmentation and pose tracking, where it substantially outperforms the published state-of-the-art without bells and whistles. Moreover, owing to mgPFF's per-pixel filter prediction, we have the unique opportunity to visualize how each pixel evolves while solving these tasks, thus gaining better interpretability.
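The core operation described above — warping one frame to another by applying a distinct predicted filter at every pixel — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the (H, W, k, k) filter-tensor layout, and the edge padding are assumptions for exposition.

```python
import numpy as np

def warp_with_filter_flow(frame, filters):
    """Warp `frame` by applying a distinct k x k filter at each pixel.

    frame:   (H, W) grayscale image (for color, the same per-pixel
             filter would be applied to each channel).
    filters: (H, W, k, k) per-pixel filters, as a network like mgPFF
             would predict them (hypothetical tensor layout).
    """
    H, W, k, _ = filters.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")  # replicate borders
    out = np.empty((H, W), dtype=frame.dtype)
    for y in range(H):
        for x in range(W):
            # k x k neighborhood of pixel (y, x), weighted by its own filter
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.sum(patch * filters[y, x])
    return out
```

A filter that is a delta at its center reproduces the input frame, a delta shifted off-center translates the pixel by a whole step, and fractional weights spread over neighbors realize the sub-pixel movement and blur handling the abstract mentions — effects a single integer-displacement flow vector cannot express.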

URL

https://arxiv.org/abs/1904.01693

PDF

https://arxiv.org/pdf/1904.01693.pdf

