Paper Reading AI Learner

Cross Pixel Optical Flow Similarity for Self-Supervised Learning

2018-07-15 23:48:59
Aravindh Mahendran, James Thewlis, Andrea Vedaldi

Abstract

We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues, in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between their optical flow vectors. At test time, the learned deep network can be used without access to video or flow information and transferred to tasks such as image classification, detection, and segmentation. Our method, which significantly simplifies previous attempts at using motion for self-supervision, achieves state-of-the-art results in self-supervision using motion cues and competitive results for self-supervision in general, and it is overall state of the art in self-supervised pretraining for semantic image segmentation, as demonstrated on standard benchmarks.
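The learning goal stated in the abstract (pairwise similarity between pixel embeddings should match pairwise similarity between the corresponding optical flow vectors) can be illustrated with a small sketch. The abstract does not say which similarity measure or matching loss is used, so everything specific here is an assumption made for illustration: cosine similarity, a softmax over the other pixels with a hypothetical `temperature` parameter, and a cross-entropy between the two resulting distributions. The function names are also hypothetical. Pure Python is used to keep the example self-contained.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (assumed kernel)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_rows(vectors, temperature=1.0):
    """For each pixel i, a distribution over all other pixels j,
    obtained by a softmax over pairwise cosine similarities."""
    rows = []
    for i, u in enumerate(vectors):
        scores = [cosine(u, v) / temperature
                  for j, v in enumerate(vectors) if j != i]
        rows.append(softmax(scores))
    return rows

def cross_pixel_loss(embeddings, flows, temperature=1.0):
    """Mean cross-entropy between the flow-similarity distributions (target)
    and the embedding-similarity distributions (prediction)."""
    target = similarity_rows(flows, temperature)
    predicted = similarity_rows(embeddings, temperature)
    total = 0.0
    for t_row, p_row in zip(target, predicted):
        total -= sum(t * math.log(p) for t, p in zip(t_row, p_row))
    return total / len(flows)

# Toy example: four pixels, two moving right and two moving left.
flows = [(1.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]
# Embeddings that group the pixels the same way as the flow does.
grouped_like_flow = [(0.0, 2.0), (0.0, 2.0), (0.0, -2.0), (0.0, -2.0)]
# Embeddings that pair each pixel with one moving the opposite way.
grouped_wrongly = [(0.0, 2.0), (0.0, -2.0), (0.0, 2.0), (0.0, -2.0)]

loss_good = cross_pixel_loss(grouped_like_flow, flows)
loss_bad = cross_pixel_loss(grouped_wrongly, flows)
print(loss_good < loss_bad)
```

In this toy setting the embedding only has to reproduce which pixels move together, not the flow vectors themselves: `grouped_like_flow` points along a different axis and has a different magnitude than the flow, yet it scores a lower loss than `grouped_wrongly`. In a real pipeline the embeddings would come from a convolutional network applied to a single frame, and the loss would be minimized by gradient descent over sampled pixel pairs.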

URL

https://arxiv.org/abs/1807.05636

PDF

https://arxiv.org/pdf/1807.05636.pdf

