Self-supervised learning of a facial attribute embedding from video

2018-08-21 13:01:46
Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

Abstract

We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. We are comparable or superior to state-of-the-art self-supervised methods on these tasks and approach the performance of supervised methods.
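The abstract's first contribution, combining several source frames through per-frame confidence/attention masks, can be illustrated with a small sketch. It is not the authors' implementation: the function name `fuse_with_confidence`, the array shapes, and the choice of a softmax across source frames to normalise the confidence values are all assumptions made for illustration. Each source frame is assumed to yield its own prediction of the target frame plus a per-pixel confidence map, and the fused output is the confidence-weighted average.

```python
import numpy as np


def fuse_with_confidence(predictions, confidence_logits):
    """Fuse per-source-frame predictions with per-pixel confidence weights.

    predictions:       shape (N, H, W, C), one predicted target image per source frame
    confidence_logits: shape (N, H, W), unnormalised confidence per frame and pixel

    Hypothetical helper: the softmax normalisation is an assumption, not
    something stated in the abstract.
    """
    predictions = np.asarray(predictions, dtype=np.float64)
    logits = np.asarray(confidence_logits, dtype=np.float64)

    # Softmax over the source-frame axis, so the weights at each pixel sum to 1.
    # Subtracting the per-pixel maximum keeps the exponentials numerically stable.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)

    # Broadcast the weights over the channel axis and sum over source frames.
    fused = (weights[..., None] * predictions).sum(axis=0)
    return fused, weights


if __name__ == "__main__":
    # Two toy source frames predicting a 2x2 RGB target: all zeros and all ones.
    preds = np.stack([np.zeros((2, 2, 3)), np.ones((2, 2, 3))])
    # The left column trusts both frames equally; the right column strongly
    # prefers the second frame.
    conf = np.array([[[0.0, 0.0], [0.0, 0.0]],
                     [[0.0, 20.0], [0.0, 20.0]]])
    fused, weights = fuse_with_confidence(preds, conf)
    print(fused[..., 0])
```

With this weighting, a pixel that is occluded or poorly predicted in one source frame can be taken mostly from another frame that predicts it more confidently.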


URL

https://arxiv.org/abs/1808.06882

PDF

https://arxiv.org/pdf/1808.06882.pdf
