Fine-grained Attention-based Video Face Recognition

Abstract

This paper aims to learn a compact representation of a video for the video face recognition task. We make the following contributions. First, we propose a meta attention-based aggregation scheme that adaptively weighs the features in a fine-grained manner, along each feature dimension across all frames, to form a compact and discriminative representation. It exploits the valuable or discriminative part of each frame to improve face recognition performance, rather than discarding or down-weighting low-quality frames as conventional methods do. Second, we build a feature aggregation network comprising a feature embedding module and a feature aggregation module. The embedding module is a convolutional neural network that extracts a feature vector from a face image, while the aggregation module consists of two cascaded meta attention blocks that adaptively aggregate the feature vectors into a single fixed-length representation. The network can handle an arbitrary number of frames and is insensitive to frame order. Third, we validate the performance of the proposed aggregation scheme. Experiments on publicly available datasets, such as the YouTube Face dataset and the IJB-A dataset, show the effectiveness of our method, which achieves competitive performance on both the verification and identification protocols.
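
The aggregation idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify how the attention scores are computed, so the linear scoring, the per-dimension query vector, the way the second block is conditioned on the first block's output, and all names (`attention_block`, `aggregate`, `make_params`, `w0`, `q0`, `w1`, `v`, `b`) are assumptions made for illustration. What the sketch does take from the abstract is the per-dimension weighting across frames, the two cascaded blocks, the fixed-length output, and the insensitivity to frame count and order.

```python
import numpy as np


def softmax_over_frames(scores):
    """Normalize each feature dimension (column) across frames (rows)."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def attention_block(features, weight, query):
    """Aggregate (num_frames, dim) features into a single (dim,) vector.

    Assumed scoring: a linear map of each frame's feature vector, scaled
    per dimension by a query vector. Every frame gets one weight per
    feature dimension, and each dimension's weights sum to 1 over frames.
    """
    scores = (features @ weight) * query      # (num_frames, dim)
    weights = softmax_over_frames(scores)     # columns sum to 1
    return (weights * features).sum(axis=0)   # (dim,)


def aggregate(features, params):
    """Two cascaded blocks (hypothetical wiring).

    The first block's output is turned into the query of the second block,
    so the second pass is conditioned on a first summary of the video.
    """
    first = attention_block(features, params["w0"], params["q0"])
    second_query = np.tanh(first @ params["v"] + params["b"])
    return attention_block(features, params["w1"], second_query)


def make_params(dim, seed=0):
    """Random stand-ins for parameters that would normally be learned."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return {
        "w0": rng.normal(scale=scale, size=(dim, dim)),
        "q0": rng.normal(size=dim),
        "w1": rng.normal(scale=scale, size=(dim, dim)),
        "v": rng.normal(scale=scale, size=(dim, dim)),
        "b": np.zeros(dim),
    }


# Example: 5 frames with 8-dimensional features become one 8-dimensional vector.
params = make_params(dim=8)
video = np.random.default_rng(1).normal(size=(5, 8))
representation = aggregate(video, params)
```

Because the weights are normalized over the frame axis and the output is a weighted sum over frames, the result has the same length for any number of frames and does not change when the frames are shuffled. In this sketch a low-quality frame is never dropped as a whole; it can still receive a large weight on the dimensions where its score is high.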

URL

https://arxiv.org/abs/1905.01796

PDF

https://arxiv.org/pdf/1905.01796.pdf