Abstract
We propose a self-supervised framework for learning facial attributes simply by watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that a curriculum learning regime improves the learned embedding; finally, we show that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks, and facial expression, i.e., facial attributes, without having been supervised with any labelled data. Our method is comparable or superior to state-of-the-art self-supervised methods on these tasks and approaches the performance of supervised methods.
URL
https://arxiv.org/abs/1808.06882