Paper Reading AI Learner

Intra-clip Aggregation for Video Person Re-identification

2019-05-05 17:37:33
Takashi Isobe, Jian Han, Fang Zhu, Yali Li, Shengjin Wang

Abstract

Video-based person re-identification (re-id) has drawn much attention in recent years due to its prospective applications in video surveillance. Most existing methods concentrate on how to represent discriminative clip-level features. Moreover, clip-level data augmentation is also important, especially for the temporal aggregation task: inconsistent intra-clip augmentation breaks inter-frame alignment and thus introduces additional noise. To tackle the above-mentioned problems, we design a novel framework for video-based person re-id that consists of two main modules: Synchronized Transformation (ST) and Intra-clip Aggregation (ICA). The former augments all frames within a clip with the same probability and the same operation, while the latter leverages two-level intra-clip encoding to generate more discriminative clip-level features. To confirm the advantage of synchronized transformation, we conduct ablation studies with different synchronized transformation schemes. We also perform cross-dataset experiments to better understand the generality of our method. Extensive experiments on three benchmark datasets demonstrate that our framework outperforms most recent state-of-the-art methods.
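The Synchronized Transformation idea described above — sampling one set of augmentation decisions per clip and applying it identically to every frame, so inter-frame alignment is preserved — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the choice of horizontal flip and random crop, and the plain-list frame representation are all assumptions for clarity.

```python
import random

def synchronized_transform(clip, crop_h, crop_w, flip_prob=0.5, rng=None):
    """Augment every frame of a clip with identical, once-sampled parameters.

    clip: list of frames; each frame is an H x W nested list of pixel values.
    Hypothetical sketch: the flip decision and crop window are drawn ONCE per
    clip (not per frame), so all frames stay spatially aligned.
    """
    rng = rng or random.Random()
    h, w = len(clip[0]), len(clip[0][0])

    do_flip = rng.random() < flip_prob      # one flip decision for the whole clip
    top = rng.randint(0, h - crop_h)        # one crop window for the whole clip
    left = rng.randint(0, w - crop_w)

    out = []
    for frame in clip:
        # Apply the shared crop, then the shared flip, to each frame.
        rows = [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        if do_flip:
            rows = [row[::-1] for row in rows]
        out.append(rows)
    return out
```

Contrast this with the usual per-frame pipeline, where each frame would re-sample its own flip and crop: that is exactly the "inconsistent intra-clip augmentation" the abstract identifies as a source of misalignment noise.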

URL

https://arxiv.org/abs/1905.01722

PDF

https://arxiv.org/pdf/1905.01722.pdf

