Paper Reading AI Learner

Fine-grained Attention-based Video Face Recognition

2019-05-06 02:37:12
Zhaoxiang Liu, Huan Hu, Jinqiang Bai, Shaohua Li, Shiguo Lian

Abstract

This paper aims to learn a compact representation of a video for video face recognition task. We make the following contributions: first, we propose a meta attention-based aggregation scheme which adaptively and fine-grained weighs the feature along each feature dimension among all frames to form a compact and discriminative representation. It makes the best to exploit the valuable or discriminative part of each frame to promote the performance of face recognition, without discarding or despising low quality frames as usual methods do. Second, we build a feature aggregation network comprised of a feature embedding module and a feature aggregation module. The embedding module is a convolutional neural network used to extract a feature vector from a face image, while the aggregation module consists of cascaded two meta attention blocks which adaptively aggregate the feature vectors into a single fixed-length representation. The network can deal with arbitrary number of frames, and is insensitive to frame order. Third, we validate the performance of proposed aggregation scheme. Experiments on publicly available datasets, such as YouTube face dataset and IJB-A dataset, show the effectiveness of our method, and it achieves competitive performances on both the verification and identification protocols.

Abstract (translated)

本文旨在学习视频人脸识别任务中视频的紧凑表示。我们做出了如下贡献:首先,我们提出了一种基于元注意的聚合方案,该方案自适应地、细粒度地对所有帧中每个特征维的特征进行加权,以形成一种紧凑的、有区别的表示。最好利用每个帧中有价值或有区别的部分来提高人脸识别的性能,而不象通常的方法那样丢弃或轻视低质量的帧。其次,构建了一个由特征嵌入模块和特征聚合模块组成的特征聚合网络。嵌入模块是一种卷积神经网络,用于从人脸图像中提取特征向量,而聚合模块由两个元注意块级联而成,这些元注意块自适应地将特征向量聚合为一个固定长度的表示。该网络可以处理任意数量的帧,对帧序不敏感。第三,验证了所提出的聚合方案的性能。通过对YouTube人脸数据集和ijb-a数据集等公开数据集的实验,证明了该方法的有效性,在验证和识别协议上都取得了很好的效果。

URL

https://arxiv.org/abs/1905.01796

PDF

https://arxiv.org/pdf/1905.01796.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot