MARS: Paying more attention to visual attributes for text-based person search

2024-07-05 06:44:43
Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

Abstract

Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is to retrieve one or more images of a specific individual based on a textual description. Because the task is multi-modal, it requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise: text descriptions are inherently vague and imprecise, so the same description of visual attributes can apply to different people. The second is intra-identity variation: nuisances such as pose and illumination can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description, which encourages the model to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text, and ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric with respect to the current state of the art.
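
Since the abstract reports its main gains in mAP, a minimal sketch of how mean Average Precision is commonly computed for retrieval may help. This follows the standard textbook definition and is not taken from the paper; the function names and the toy identities below are assumptions for illustration, and the authors' exact evaluation protocol may differ.

```python
from typing import List, Sequence


def average_precision(ranked_ids: Sequence[int], query_id: int) -> float:
    """AP for one text query.

    ranked_ids: identity labels of gallery images, best match first.
    query_id:   identity the text description refers to.
    AP is the mean of precision@k over every rank k that holds a correct image.
    """
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    rankings: List[Sequence[int]], query_ids: Sequence[int]
) -> float:
    """mAP: the average of per-query AP over all text queries."""
    if not rankings:
        return 0.0
    total = sum(average_precision(r, q) for r, q in zip(rankings, query_ids))
    return total / len(rankings)


# Toy example (hypothetical identities, not data from the paper):
# two text queries, each with a ranked gallery of four images.
rankings = [[7, 3, 7, 5], [2, 9, 9, 4]]
query_ids = [7, 9]
map_score = mean_average_precision(rankings, query_ids)
```

Unlike Rank-1 accuracy, which only checks the top result, this metric rewards placing all images of the target person near the top of the ranking.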

URL

https://arxiv.org/abs/2407.04287

PDF

https://arxiv.org/pdf/2407.04287.pdf

