Paper Reading AI Learner

TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification

2025-01-15 07:14:02
Zheng-An Zhu, Hsin-Che Chien, Chen-Kuo Chiang

Abstract

This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many strong methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images through patch embedding, which inevitably introduces noise into the patches and may compromise the performance of the re-identification model. Moreover, previous memory-bank-based contrastive methods may suffer from data inconsistency due to the limitation of batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster, sacrificing the potential value of those samples and limiting model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage that patch noise causes to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at \href{this https URL}{this https URL}.
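The memory-bank contrastive training the abstract refers to typically keeps one momentum-updated prototype per pseudo-label cluster and pulls each feature toward its assigned prototype with an InfoNCE loss. Below is a minimal NumPy sketch of that baseline scheme (in the style of cluster-contrast methods); the `ClusterMemory` class, its parameters, and the toy data are illustrative assumptions, not the paper's actual TCMM multi-scale memory.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project features onto the unit sphere for cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class ClusterMemory:
    """Momentum-updated memory bank of cluster prototypes (illustrative sketch,
    not the paper's multi-scale memory)."""

    def __init__(self, num_clusters, dim, momentum=0.2, temperature=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.protos = l2_normalize(rng.normal(size=(num_clusters, dim)))
        self.momentum = momentum
        self.temperature = temperature

    def contrastive_loss(self, feats, pseudo_labels):
        """InfoNCE loss of each feature against all cluster prototypes."""
        feats = l2_normalize(feats)
        logits = feats @ self.protos.T / self.temperature      # (B, K)
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(feats)), pseudo_labels].mean()

    def update(self, feats, pseudo_labels):
        """Move each assigned prototype toward its features with momentum."""
        feats = l2_normalize(feats)
        for f, y in zip(feats, pseudo_labels):
            p = self.momentum * self.protos[y] + (1 - self.momentum) * f
            self.protos[y] = l2_normalize(p)

# Toy usage: random "features" with pseudo labels from a (pretend) clustering step.
rng = np.random.default_rng(1)
mem = ClusterMemory(num_clusters=4, dim=8)
feats = rng.normal(size=(6, 8))
labels = np.array([0, 1, 2, 3, 0, 1])
loss_before = mem.contrastive_loss(feats, labels)
mem.update(feats, labels)
loss_after = mem.contrastive_loss(feats, labels)
```

Because the prototypes live in the memory bank rather than in the batch, the contrast is taken against all clusters at once; the batch-size limitation the abstract mentions arises because prototypes are refreshed only from the features that happen to appear in each batch.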


URL

https://arxiv.org/abs/2501.09044

PDF

https://arxiv.org/pdf/2501.09044.pdf
