Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association

2018-08-05 07:19:24
Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Zejian Yuan, Xiaogang Wang

Abstract

Person re-identification is an important task that requires learning discriminative visual features to distinguish different person identities. Diverse auxiliary information has been used to improve visual feature learning. In this paper, we propose to exploit natural language descriptions as additional training supervision for learning effective visual features. Compared with other auxiliary information, language can describe a specific person from more compact and semantic visual aspects, and is thus complementary to the pixel-level image data. Our method not only learns a better global visual feature under the supervision of the overall description but also enforces semantic consistency between local visual and linguistic features, which is achieved by building global and local image-language associations. The global image-language association is established according to the identity labels, while the local association is based on the implicit correspondences between image regions and noun phrases. Extensive experiments demonstrate the effectiveness of employing language as training supervision with the two association schemes. Our method achieves state-of-the-art performance without using any auxiliary information during testing and outperforms other joint embedding methods for image-language association.
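
The two association schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the cosine similarity, the sigmoid and binary cross-entropy loss, the softmax attention, and all function names below are assumptions chosen only to make the idea concrete. The global part scores whether an image feature and a description feature belong to the same identity; the local part lets one noun-phrase feature attend over image-region features.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity (assumed similarity measure, not from the paper)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def global_association_loss(image_feat, text_feat, same_identity):
    """Binary cross-entropy on whether an image and a description share an identity.

    The pairing label comes from identity labels, as the abstract states;
    the sigmoid-of-cosine scoring is a hypothetical choice.
    """
    p = 1.0 / (1.0 + math.exp(-cosine(image_feat, text_feat)))
    return -math.log(p) if same_identity else -math.log(1.0 - p)


def attend_regions(phrase_feat, region_feats):
    """Softmax attention of one noun-phrase feature over image-region features.

    Returns the attention weights and the weighted sum of region features.
    The correspondence is implicit: no region-phrase labels are used.
    """
    scores = [dot(phrase_feat, r) for r in region_feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(region_feats[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_feats)) for i in range(dim)]
    return weights, attended


# Toy usage with made-up 2-D features.
regions = [[1.0, 0.0], [0.0, 1.0]]
phrase = [2.0, 0.0]
weights, attended = attend_regions(phrase, regions)
matched = global_association_loss([1.0, 0.0], [1.0, 0.0], same_identity=True)
mismatched = global_association_loss([1.0, 0.0], [1.0, 0.0], same_identity=False)
```

In this toy setup the phrase feature aligns with the first region, so that region receives the larger attention weight, and labelling identical features as the same identity yields a lower loss than labelling them as different identities. A real system would compute these features with learned image and language encoders and train both objectives jointly.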

URL

https://arxiv.org/abs/1808.01571

PDF

https://arxiv.org/pdf/1808.01571.pdf

