Abstract
Text-based person search aims to retrieve images of a person from an image database given a sentence describing that person, and holds great potential for applications such as video surveillance. Extracting the visual content corresponding to the human description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different levels of semantic relevance. To exploit the multilevel relevance between a human description and the corresponding visual content, we propose a pose-guided joint global and attentive local matching network (GALM), which includes global, uni-local and bi-local matching. The global matching network aims to learn global cross-modal representations. To further capture meaningful local relations, we propose a uni-local matching network that computes local similarities between image regions and the textual description, and then uses a similarity-based hard attention to select the description-related image regions. In addition to sentence-level matching, fine-grained phrase-level matching is captured by the bi-local matching network, which employs pose information to learn the latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we conduct extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), currently the only available dataset for text-based person search. Experimental results show that our approach outperforms state-of-the-art methods by 15% in terms of the top-1 metric.
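To make the uni-local matching step more concrete, below is a minimal sketch of similarity-based hard attention over image regions, assuming region features and a sentence embedding of the same dimension. This is only an illustration of the general technique under those assumptions, not the authors' implementation; the function name `uni_local_hard_attention`, the top-k selection rule, and the feature shapes are all hypothetical.

```python
import torch
import torch.nn.functional as F

def uni_local_hard_attention(region_feats, text_feat, k=3):
    """Similarity-based hard attention (hypothetical sketch, not the paper's code).

    region_feats: (N, d) features for N image regions.
    text_feat:    (d,)   sentence-level embedding of the description.
    Returns a visual feature aggregated from the k regions most similar
    to the description, plus the per-region similarities.
    """
    # Cosine similarity between each region and the sentence embedding.
    sims = F.cosine_similarity(region_feats, text_feat.unsqueeze(0), dim=1)  # (N,)
    # Hard attention: keep only the top-k description-related regions.
    topk_sims, topk_idx = sims.topk(k)
    selected = region_feats[topk_idx]                # (k, d)
    # Aggregate the selected regions, weighted by softmaxed similarities.
    weights = topk_sims.softmax(dim=0).unsqueeze(1)  # (k, 1)
    local_visual = (weights * selected).sum(dim=0)   # (d,)
    return local_visual, sims

# Toy usage: 6 regions with 256-dimensional features.
regions = torch.randn(6, 256)
sentence = torch.randn(256)
feat, sims = uni_local_hard_attention(regions, sentence, k=3)
```

The design point of hard (top-k) selection, as opposed to soft attention, is that regions unrelated to the description are discarded outright rather than merely down-weighted.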
URL
https://arxiv.org/abs/1809.08440