Paper Reading AI Learner

Pose-Guided Joint Global and Attentive Local Matching Network for Text-Based Person Search

2019-03-25 08:48:05
Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan

Abstract

Text-based person search aims to retrieve the persons in an image database that match a given sentence describing the person, and holds great potential for applications such as video surveillance. Extracting the visual content corresponding to the human description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different levels of semantic relevance. To exploit the multilevel relevance between a human description and the corresponding visual content, we propose a pose-guided joint global and attentive local matching network (GALM), which includes global, uni-local and bi-local matching. The global matching network aims to learn global cross-modal representations. To further capture meaningful local relations, we propose a uni-local matching network that computes local similarities between image regions and the textual description and then utilizes similarity-based hard attention to select the description-related image regions. In addition to sentence-level matching, fine-grained phrase-level matching is captured by the bi-local matching network, which employs pose information to learn latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
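The uni-local matching step described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual network: the region and text features are assumed to already live in a shared embedding space, cosine similarity stands in for the learned local similarity, and mean pooling over the top-k regions stands in for the paper's aggregation; the function name and `k` parameter are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def uni_local_matching(region_feats, text_feat, k=2):
    """Similarity-based hard attention over image regions (sketch).

    region_feats: list of region embedding vectors (one per image region)
    text_feat:    sentence embedding in the same shared space
    Returns (pooled_feature, selected_region_indices).
    """
    # Score every image region against the textual description.
    sims = [cosine(r, text_feat) for r in region_feats]
    # Hard attention: keep only the k most description-related regions.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Aggregate the selected regions (here: mean pooling) for matching.
    dim = len(text_feat)
    pooled = [sum(region_feats[i][d] for i in top) / k for d in range(dim)]
    return pooled, sorted(top)

# Toy example: regions 0 and 2 point in the text's direction, so hard
# attention selects them and discards the unrelated regions 1 and 3.
pooled, top = uni_local_matching(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]],
    [1.0, 0.0], k=2)
print(top)  # [0, 2]
```

Hard (top-k) attention, unlike soft attention, zeroes out non-selected regions entirely, which matches the abstract's goal of filtering out image regions unrelated to the description.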

URL

https://arxiv.org/abs/1809.08440

PDF

https://arxiv.org/pdf/1809.08440.pdf

