Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification

2024-12-11 14:27:30
Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang

Abstract

Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
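The abstract does not state the loss functions, so the following is only a minimal sketch of two ideas it describes: aligning image-text pairs in a common space (ESE) and keeping inter-modality image-text distances close to intra-modality ones (CMSP). The symmetric contrastive loss, the squared-gap penalty, the temperature value, and the names `alignment_loss` and `purification_gap` are all assumptions made for illustration, not the authors' formulation. CVSC is not covered.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def alignment_loss(img, txt, temperature=0.07):
    # Assumed stand-in for ESE alignment: a symmetric InfoNCE-style loss
    # where row i of `img` is the match for row i of `txt`.
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature

    def matched_nll(scores):
        # Row-wise log-softmax, then the negative log-likelihood of the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return float(0.5 * (matched_nll(logits) + matched_nll(logits.T)))


def purification_gap(vis_img, vis_txt, ir_img, ir_txt):
    # Assumed stand-in for the CMSP constraint: penalize the squared gap
    # between inter-modality and intra-modality image-text distances.
    # Row i of every input is assumed to belong to the same identity.
    vis_img, vis_txt, ir_img, ir_txt = map(
        l2_normalize, (vis_img, vis_txt, ir_img, ir_txt)
    )

    def dist(a, b):
        return np.linalg.norm(a - b, axis=1)

    intra = 0.5 * (dist(vis_img, vis_txt) + dist(ir_img, ir_txt))
    inter = 0.5 * (dist(vis_img, ir_txt) + dist(ir_img, vis_txt))
    return float(np.mean((inter - intra) ** 2))
```

Under these assumptions, `alignment_loss` falls as matched image and text embeddings move together, and `purification_gap` reaches zero when the visible and infrared embeddings of each identity coincide.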

URL

https://arxiv.org/abs/2412.08406

PDF

https://arxiv.org/pdf/2412.08406.pdf

