Semi-supervised Text-based Person Search

2024-04-28 07:47:52
Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, Min Zhang

Abstract

Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance under fully-supervised learning. This poses a significant challenge in practice: acquiring person images from surveillance videos is relatively easy, whereas obtaining annotated texts is difficult. This paper makes an initial exploration of TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches the annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. The retrieval stage then performs fully-supervised retrieval learning on the augmented data. Because the pseudo-texts are noisy and can interfere with retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask masks the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of the pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.
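
The abstract does not give implementation details for PC-Mask or NP-Train, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' method. It assumes the input is a matrix of patch embeddings of shape (num_patches, dim), that masking means zeroing randomly chosen rows (patches) and columns (channels), and that the progressive schedule admits the least-noisy pseudo-text samples first and grows linearly to the full set. The function names, the ratios, and the per-sample noise scores are all hypothetical.

```python
import numpy as np


def patch_channel_mask(tokens, patch_ratio=0.3, channel_ratio=0.1, rng=None):
    """Zero out a random subset of patches (rows) and channels (columns).

    Assumption: `tokens` is a (num_patches, dim) array of patch embeddings.
    Returns the masked copy plus the chosen patch and channel indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches, dim = tokens.shape
    n_patches = int(round(patch_ratio * num_patches))
    n_channels = int(round(channel_ratio * dim))
    patch_idx = rng.choice(num_patches, size=n_patches, replace=False)
    channel_idx = rng.choice(dim, size=n_channels, replace=False)
    masked = tokens.copy()
    masked[patch_idx, :] = 0.0    # patch-level masking
    masked[:, channel_idx] = 0.0  # channel-level masking
    return masked, patch_idx, channel_idx


def progressive_subset(noise_scores, epoch, total_epochs, start_fraction=0.5):
    """Pick the training samples admitted at `epoch`, least noisy first.

    Assumption: a lower score means a cleaner pseudo-text, and the admitted
    fraction grows linearly from `start_fraction` to 1.0 over training.
    """
    scores = np.asarray(noise_scores)
    order = np.argsort(scores, kind="stable")
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    fraction = start_fraction + (1.0 - start_fraction) * progress
    keep = max(1, int(np.ceil(fraction * len(order))))
    return order[:keep]
```

In a training loop built on this sketch, `progressive_subset` would select which pseudo-annotated pairs enter each epoch, and `patch_channel_mask` would be applied to the image tokens of each selected pair before the retrieval loss is computed. How the noise scores are estimated is left open here because the abstract does not say.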

URL

https://arxiv.org/abs/2404.18106

PDF

https://arxiv.org/pdf/2404.18106.pdf

