Semi-supervised Text-based Person Search

Abstract

Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance under fully-supervised learning. This poses a significant challenge in practice: acquiring person images from surveillance videos is relatively easy, while obtaining annotated texts is difficult. This paper makes a first attempt to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions and the majority of images lack annotations. We present a basic two-stage solution for semi-supervised TBPS based on generation-then-retrieval. The generation stage enriches the annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. The retrieval stage then performs fully-supervised retrieval learning on the augmented data. Because the pseudo-texts are noisy and can interfere with retrieval learning, we further propose a noise-robust retrieval framework that improves the retrieval model's ability to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask), which refines the model architecture, and Noise-Guided Progressive Training (NP-Train), which enhances the training process. PC-Mask masks the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of the pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.
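
The abstract names the two strategies but does not give their details, so the following is only an illustrative sketch of the general ideas, not the authors' implementation. It assumes that patch-level and channel-level masking mean zeroing whole rows and whole columns of a patch-embedding matrix, and that the progressive schedule admits pseudo-texts from least noisy to most noisy. The function names `pc_mask` and `progressive_subsets`, the masking ratios, and the per-sample noise scores are all hypothetical.

```python
import random


def pc_mask(tokens, patch_ratio, channel_ratio, rng):
    """Zero out randomly chosen patches (rows) and channels (columns).

    tokens: list of patch embeddings, each a list of floats of equal length.
    Assumption: masking is modelled here as replacing values with 0.0.
    """
    num_patches = len(tokens)
    num_channels = len(tokens[0])
    masked_patches = set(rng.sample(range(num_patches), int(num_patches * patch_ratio)))
    masked_channels = set(rng.sample(range(num_channels), int(num_channels * channel_ratio)))
    return [
        [0.0 if (p in masked_patches or c in masked_channels) else value
         for c, value in enumerate(row)]
        for p, row in enumerate(tokens)
    ]


def progressive_subsets(samples, noise_scores, num_stages):
    """Yield growing training sets, one per stage.

    Assumption: a lower noise score means a more reliable pseudo-text,
    and cleaner samples are admitted before noisier ones.
    """
    order = sorted(range(len(samples)), key=lambda i: noise_scores[i])
    for stage in range(1, num_stages + 1):
        cutoff = len(order) * stage // num_stages
        yield [samples[i] for i in order[:cutoff]]
```

In such a setup, a training loop might call `pc_mask` on each input before the forward pass and iterate over `progressive_subsets` to decide which pseudo-annotated pairs are included at each stage. How the noise level is actually estimated is not stated in the abstract and is left open here.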

URL

https://arxiv.org/abs/2404.18106

PDF

https://arxiv.org/pdf/2404.18106.pdf
