Abstract
Person re-identification is an important task that requires learning discriminative visual features to distinguish between person identities. Diverse auxiliary information has been used to improve visual feature learning. In this paper, we propose to exploit natural language descriptions as additional training supervision for learning effective visual features. Compared with other auxiliary information, language can describe a specific person in terms of more compact and semantic visual aspects, and is thus complementary to the pixel-level image data. Our method not only learns a better global visual feature under the supervision of the overall description, but also enforces semantic consistency between local visual and linguistic features. Both are achieved by building global and local image-language associations. The global image-language association is established according to the identity labels, while the local association is based on the implicit correspondences between image regions and noun phrases. Extensive experiments demonstrate the effectiveness of employing language as training supervision with the two association schemes. Our method achieves state-of-the-art performance without using any auxiliary information at test time, and it outperforms other joint embedding methods for image-language association.
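The global association described above, where image-description pairs are matched according to identity labels, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`pair_labels`, `global_association_loss`), the dot-product-plus-sigmoid pair score, and the binary cross-entropy objective are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math


def pair_labels(image_ids, text_ids):
    """Return a matrix with 1 where image i and description j share an identity, else 0."""
    return [[1 if a == b else 0 for b in text_ids] for a in image_ids]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def global_association_loss(image_feats, text_feats, image_ids, text_ids):
    """Mean binary cross-entropy over all image-description pairs.

    Assumption: the pair score is sigmoid(dot(image, text)); the scoring
    function used in the paper may differ.
    """
    labels = pair_labels(image_ids, text_ids)
    total, count = 0.0, 0
    for i, v in enumerate(image_feats):
        for j, t in enumerate(text_feats):
            p = 1.0 / (1.0 + math.exp(-dot(v, t)))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to keep log() finite
            y = labels[i][j]
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Hypothetical toy features: two images and two descriptions.
image_feats = [[3.0, 0.0], [0.0, 3.0]]
text_feats = [[3.0, 0.0], [0.0, 3.0]]

matched = global_association_loss(image_feats, text_feats, [1, 2], [1, 2])
mismatched = global_association_loss(image_feats, text_feats, [1, 2], [2, 1])
print(matched < mismatched)
```

In this toy setup, the loss is lower when features that point in the same direction also carry the same identity label, which is the behavior a global association objective of this kind would encourage. The language branch is needed only to compute the loss during training, consistent with the claim that no auxiliary information is used at test time.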
URL
https://arxiv.org/abs/1808.01571