Paper Reading AI Learner

Convolutional Temporal Attention Model for Video-based Person Re-identification

2019-04-09 07:03:53
Tanzila Rahman, Mrigank Rochan, Yang Wang

Abstract

The goal of video-based person re-identification is to match two input videos so that the distance between them is small if both videos contain the same person. A common approach is to first extract image features for every frame in a video and then aggregate these frame-level features into a video-level feature; the video-level features of two videos can then be used to compute the distance between them. In this paper, we propose a temporal attention approach for aggregating frame-level features into a video-level feature vector for re-identification. Our method is motivated by the observation that not all frames in a video are equally informative. We propose a fully convolutional temporal attention model for generating the attention scores. Fully convolutional networks (FCNs) have been widely used in semantic segmentation to generate 2D output maps. In this paper, we formulate video-based person re-identification as a sequence labeling problem, analogous to semantic segmentation: we establish a connection between the two tasks and modify the FCN to generate attention scores that represent the importance of each frame. Extensive experiments on three benchmark datasets (iLIDS-VID, PRID-2011, and SDU-VID) show that our proposed method outperforms other state-of-the-art approaches.
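To make the aggregation mechanism concrete, here is a minimal PyTorch sketch of the general idea described in the abstract: 1D convolutions over the temporal axis play the role of the FCN and emit one attention score per frame (the "sequence labeling" view), and a softmax-weighted sum of frame features yields the video-level feature. This is an illustration only, not the authors' architecture; the class name, layer count, kernel sizes, and dimensions (e.g. 2048-d frame features) are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Toy fully convolutional temporal attention: treats the frame
    sequence as a 1D labeling problem and predicts one score per frame."""

    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        # 1D convolutions over time stand in for the FCN layers;
        # the final 1-channel output is one raw score per frame.
        self.conv1 = nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, 1, kernel_size=3, padding=1)

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim) frame-level CNN features
        x = frame_feats.transpose(1, 2)              # (batch, feat_dim, T)
        scores = self.conv2(F.relu(self.conv1(x)))   # (batch, 1, T)
        attn = F.softmax(scores.squeeze(1), dim=1)   # normalize over frames
        # Attention-weighted sum of frame features -> video-level feature
        video_feat = torch.bmm(attn.unsqueeze(1), frame_feats).squeeze(1)
        return video_feat, attn

# Usage: distance between two videos' aggregated features
model = TemporalAttention()
v1 = torch.randn(1, 16, 2048)  # 16 frames of 2048-d features
v2 = torch.randn(1, 16, 2048)
f1, _ = model(v1)
f2, _ = model(v2)
dist = F.pairwise_distance(f1, f2)  # small for same identity (after training)
```

The key design point the sketch captures is that, unlike average pooling, the per-frame scores let the model down-weight occluded or blurry frames before forming the video-level representation.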

URL

https://arxiv.org/abs/1904.04492

PDF

https://arxiv.org/pdf/1904.04492.pdf

