Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

2024-05-08 10:15:04
Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao

Abstract

Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
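
The noise-aware masking idea in the abstract (score each word against the image's patch token embeddings, then mask poorly matched words with a larger probability in the next epoch) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the details, so the following are all assumptions made for illustration: the function names, the max-over-patches aggregation, the linear mapping from similarity to masking probability, and the `base_p` and `max_p` values.

```python
import numpy as np

def token_image_similarity(token_emb, patch_emb):
    """Cosine similarity of each text token to its best-matching image patch.

    token_emb: (num_tokens, dim), patch_emb: (num_patches, dim).
    Taking the max over patches is an assumption, not the paper's stated rule.
    """
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    return (t @ p.T).max(axis=1)

def masking_probabilities(similarity, base_p=0.15, max_p=0.6):
    """Give low-similarity (likely noisy) tokens a higher masking probability.

    The linear mapping and the default bounds are hypothetical choices.
    """
    lo, hi = similarity.min(), similarity.max()
    if hi - lo < 1e-12:
        # All tokens match equally well: fall back to uniform masking.
        return np.full_like(similarity, base_p)
    noise = (hi - similarity) / (hi - lo)  # 0 = best match, 1 = worst match
    return base_p + (max_p - base_p) * noise

def mask_tokens(tokens, probs, rng, mask_token="[MASK]"):
    """Independently replace each token with mask_token with its own probability."""
    draws = rng.random(len(tokens))
    return [mask_token if d < p else tok for tok, d, p in zip(tokens, draws, probs)]

# Toy example: three orthogonal "patches"; the last token matches none of them well.
tokens = ["woman", "red", "coat", "umbrella"]
patch_emb = np.eye(3)
token_emb = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 1.0]])
sim = token_image_similarity(token_emb, patch_emb)
probs = masking_probabilities(sim)
masked = mask_tokens(tokens, probs, np.random.default_rng(0))
```

In this toy setup, the first three tokens align perfectly with one patch each and keep the base masking probability, while "umbrella" has the lowest similarity and receives the highest one. In a real pipeline the embeddings would come from the text and image encoders being trained, and the probabilities would be recomputed for use in the subsequent epoch.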

URL

https://arxiv.org/abs/2405.04940

PDF

https://arxiv.org/pdf/2405.04940.pdf
