Paper Reading AI Learner

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

2024-03-15 12:44:35
Pingping Zhang, Yuhao Wang, Yang Liu, Zhengzheng Tu, Huchuan Lu

Abstract

Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve feature discrimination through background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our method. The code is available at this https URL.
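To make the SFTS idea concrete, below is a minimal, hypothetical sketch of scoring tokens by combining spatial and frequency information and keeping the top-scoring ones. It is not the paper's implementation: the scoring rule (L2 norm of a token's features plus the high-frequency energy of a 1-D DFT over its channels) and the function names are illustrative assumptions; the actual SFTS module operates on Transformer feature maps.

```python
import math

def token_scores(tokens):
    """Score each token by combining spatial energy (L2 norm of its
    feature vector) with high-frequency energy from a 1-D DFT over its
    channels. A toy stand-in for spatial-frequency token scoring; the
    real SFTS module is defined in the paper, not here."""
    scores = []
    for feat in tokens:
        d = len(feat)
        spatial = math.sqrt(sum(x * x for x in feat))
        # High-frequency energy: magnitudes of the upper half of the DFT bins.
        high = 0.0
        for k in range(d // 2, d):
            re = sum(x * math.cos(-2 * math.pi * k * n / d)
                     for n, x in enumerate(feat))
            im = sum(x * math.sin(-2 * math.pi * k * n / d)
                     for n, x in enumerate(feat))
            high += math.hypot(re, im)
        scores.append(spatial + high)
    return scores

def select_tokens(tokens, keep):
    """Return the (sorted) indices of the `keep` highest-scoring tokens."""
    scores = token_scores(tokens)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])

# Example: a flat token, a constant token (low-frequency only), and an
# alternating token (strong high-frequency content). The selector keeps
# the two with the most spatial/frequency energy.
tokens = [[0, 0, 0, 0], [1, 1, 1, 1], [3, -3, 3, -3]]
print(select_tokens(tokens, keep=2))
```

The design choice mirrored here is that background patches tend to be smooth (little high-frequency energy), so a score that rewards high-frequency content biases selection toward object-centric tokens.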


URL

https://arxiv.org/abs/2403.10254

PDF

https://arxiv.org/pdf/2403.10254.pdf

