Abstract
Text-based person search aims to retrieve images of a specified person given a textual description. The key to tackling this challenging task is learning powerful multi-modal representations. To this end, we propose a Relation and Sensitivity aware representation learning method (RaSa), comprising two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). On the one hand, existing methods pull together the representations of all positive pairs without distinction and overlook the noise introduced by weak positive pairs, where the text and the paired image have noisy correspondences, which leads to overfitting. RA offsets this overfitting risk by introducing a novel positive relation detection task (i.e., learning to distinguish strong from weak positive pairs). On the other hand, learning representations that are invariant under data augmentation (i.e., insensitive to certain transformations) is a common practice for improving robustness in existing methods. Going beyond that, SA encourages the representation to perceive sensitive transformations (i.e., learning to detect replaced words), further promoting robustness. Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively. Code is available at: this https URL.
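To make the two auxiliary tasks concrete, the following is a minimal, hypothetical sketch of how their training labels could be constructed: SA corrupts a caption by replacing words and records per-token labels for the detection objective, while RA assigns a binary relation label separating strong positives (an image's own caption) from weak positives (a caption of another image of the same identity). All function names and the replacement vocabulary are illustrative assumptions, not the paper's actual implementation.

```python
import random

def make_sa_example(tokens, vocab, replace_prob=0.3, rng=None):
    """Sensitivity-Aware (SA) data construction (illustrative):
    replace some words in the caption and record per-token labels
    (1 = replaced, 0 = original) for the word-replacement detector."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # draw a different word from the (toy) vocabulary
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

def relation_label(pair_type):
    """Relation-Aware (RA) target (illustrative): distinguish a strong
    positive pair from a weak one for the relation detection task."""
    return {"strong": 1, "weak": 0}[pair_type]
```

A model trained with these labels is pushed to notice replaced words (sensitivity) instead of being invariant to them, and to treat weak positives more cautiously than strong ones.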
URL
https://arxiv.org/abs/2305.13653