The primary color profile of the same identity is assumed to remain consistent in typical Person Re-identification (Person ReID) tasks. However, this assumption may be invalid in real-world situations, where images of the same identity hold variant color profiles due to cross-modality cameras or changes of clothing. To address this issue, we propose Color Space Learning (CSL) for Cross-Color Person ReID problems. Specifically, CSL guides the model to be less color-sensitive with two modules: Image-level Color-Augmentation and Pixel-level Color-Transformation. The first module increases the color diversity of the inputs and guides the model to focus more on non-color information. The second module projects every pixel of the input images onto a new color space. In addition, we introduce a new Person ReID benchmark across RGB and Infrared modalities, NTU-Corridor, which is the first with privacy agreements from all participants. To evaluate the effectiveness and robustness of the proposed CSL, we evaluate it on several Cross-Color Person ReID benchmarks. Our method surpasses the state-of-the-art methods consistently. The code and benchmark are available at: this https URL
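As a rough illustration of the two modules (a minimal pure-Python sketch; function names, probabilities, and the fixed transformation matrix are assumptions, and the paper's pixel-level projection is presumably learned rather than constant):

```python
import random

def color_augment(pixels, p_shuffle=0.5, p_gray=0.25, rng=None):
    """Image-level color augmentation sketch: randomly permute RGB channels
    or collapse to grayscale so the model cannot rely on absolute color."""
    rng = rng or random.Random()
    r = rng.random()
    if r < p_gray:
        # Grayscale: replace each pixel with its channel mean.
        return [(sum(px) / 3.0,) * 3 for px in pixels]
    if r < p_gray + p_shuffle:
        # Channel shuffle: one random permutation applied to every pixel.
        perm = [0, 1, 2]
        rng.shuffle(perm)
        return [tuple(px[i] for i in perm) for px in pixels]
    return [tuple(px) for px in pixels]

def color_transform(pixels, matrix):
    """Pixel-level color transformation sketch: project every RGB pixel
    onto a new color space via a 3x3 linear map (here a constant)."""
    return [tuple(sum(matrix[i][j] * px[j] for j in range(3)) for i in range(3))
            for px in pixels]
```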
https://arxiv.org/abs/2405.09487
With the advancement of video analysis technology, the multi-object tracking (MOT) problem in complex scenes involving pedestrians is gaining increasing importance. This challenge primarily involves two key tasks: pedestrian detection and re-identification. While significant progress has been achieved in pedestrian detection tasks in recent years, enhancing the effectiveness of re-identification tasks remains a persistent challenge. This difficulty arises from the large total number of pedestrian samples in multi-object tracking datasets and the scarcity of individual instance samples. Motivated by recent rapid advancements in meta-learning techniques, we introduce MAML MOT, a meta-learning-based training approach for multi-object tracking. This approach leverages the rapid learning capability of meta-learning to tackle the issue of sample scarcity in pedestrian re-identification tasks, aiming to improve the model's generalization performance and robustness. Experimental results demonstrate that the proposed method achieves high accuracy on mainstream datasets in the MOT Challenge. This offers new perspectives and solutions for research in the field of pedestrian multi-object tracking.
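The meta-learning idea can be sketched as a generic first-order MAML update (an illustrative scalar toy, not the authors' MAML MOT training loop):

```python
def maml_step(theta, task_grads, inner_lr=0.1, outer_lr=0.01):
    """One first-order MAML-style meta-update on a scalar parameter.
    Each task contributes a gradient function; a single inner step adapts
    theta per task (mirroring adaptation from few samples per identity),
    and the outer step moves theta toward parameters that adapt well
    across all tasks."""
    meta_grad = 0.0
    for grad in task_grads:
        adapted = theta - inner_lr * grad(theta)   # inner-loop adaptation
        meta_grad += grad(adapted)                 # first-order outer grad
    return theta - outer_lr * meta_grad / len(task_grads)
```

With quadratic per-task losses centered at different targets, repeated meta-updates drive theta toward a point from which every task is reachable in one inner step.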
https://arxiv.org/abs/2405.07272
Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) presents a formidable challenge, which aims to match pedestrian images across visible and infrared modalities without any annotations. Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle. Most existing works primarily focus on shielding the model from the harmful effects of noise, neglecting to calibrate noisy pseudo-labels usually associated with hard samples, which will compromise the robustness of the model. To address this issue, we design a Robust Pseudo-label Learning with Neighbor Relation (RPNR) framework for USVI-ReID. To be specific, we first introduce a straightforward yet potent Noisy Pseudo-label Calibration module to correct noisy pseudo-labels. Due to the high intra-class variations, noisy pseudo-labels are difficult to calibrate completely. Therefore, we introduce a Neighbor Relation Learning module to reduce high intra-class variations by modeling potential interactions between all samples. Subsequently, we devise an Optimal Transport Prototype Matching module to establish reliable cross-modality correspondences. On that basis, we design a Memory Hybrid Learning module to jointly learn modality-specific and modality-invariant information. Comprehensive experiments conducted on two widely recognized benchmarks, SYSU-MM01 and RegDB, demonstrate that RPNR outperforms the current state-of-the-art GUR with an average Rank-1 improvement of 10.3%. The source codes will be released soon.
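A standard entropic-regularized Sinkhorn solver is the usual machinery behind optimal-transport matching; the sketch below (not the authors' code; `eps` and the iteration count are arbitrary choices) shows how prototype distances could be turned into a soft cross-modality correspondence:

```python
import math

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropic optimal transport via Sinkhorn iterations. cost[i][j] is
    the distance between visible prototype i and infrared prototype j;
    returns a soft correspondence matrix with uniform marginals."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternate scaling so rows sum to 1/n and columns to 1/m.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Low-cost prototype pairs receive most of the transport mass, yielding reliable visible-to-infrared correspondences.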
https://arxiv.org/abs/2405.05613
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
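The word-masking idea can be sketched as follows (hypothetical function and constants; the paper scores each word against all patch token embeddings, mimicked here with a best-matching-patch cosine similarity):

```python
def word_mask_probs(word_embs, patch_embs, base_p=0.15, boost=0.5):
    """For each word embedding, find its best-matching image patch; a word
    that matches no patch well gets a higher masking probability for the
    next training epoch, down-weighting likely-noisy description words."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    probs = []
    for w in word_embs:
        sim = max(cos(w, p) for p in patch_embs)   # best-matching patch
        probs.append(min(1.0, base_p + boost * (1.0 - sim)))
    return probs
```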
https://arxiv.org/abs/2405.04940
In Re-identification (ReID), recent advancements yield noteworthy progress in both unimodal and cross-modal retrieval tasks. However, the challenge persists in developing a unified framework that could effectively handle varying multimodal data, including RGB, infrared, sketches, and textual information. Additionally, the emergence of large-scale models shows promising performance in various vision tasks, but a foundation model for ReID has yet to be established. In response to these challenges, a novel multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO), which harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning. The diverse multimodal data in AIO are seamlessly tokenized into a unified space, allowing the modality-shared frozen encoder to extract identity-consistent features comprehensively across all modalities. Furthermore, a meticulously crafted ensemble of cross-modality heads is designed to guide the learning trajectory. AIO is the first framework to perform all-in-one ReID, encompassing four commonly used modalities. Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts, showcasing exceptional performance in zero-shot and domain generalization scenarios.
https://arxiv.org/abs/2405.04741
The quest for robust Person re-identification (Re-ID) systems capable of accurately identifying subjects across diverse scenarios remains a formidable challenge in surveillance and security applications. This study presents a novel methodology that significantly enhances Person Re-Identification (Re-ID) by integrating Uncertainty Feature Fusion (UFFM) with Wise Distance Aggregation (WDA). Tested on benchmark datasets - Market-1501, DukeMTMC-ReID, and MSMT17 - our approach demonstrates substantial improvements in Rank-1 accuracy and mean Average Precision (mAP). Specifically, UFFM capitalizes on the power of feature synthesis from multiple images to overcome the limitations imposed by the variability of subject appearances across different views. WDA further refines the process by intelligently aggregating similarity metrics, thereby enhancing the system's ability to discern subtle but critical differences between subjects. The empirical results affirm the superiority of our method over existing approaches, achieving new performance benchmarks across all evaluated datasets. Code is available on Github.
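A minimal sketch of the two ideas, under the assumption that fusion is a simple average and "wise" aggregation blends minimum and mean distances (the actual UFFM/WDA formulations are richer):

```python
def fuse_features(feats):
    """Feature-fusion sketch: average several same-identity features to
    smooth over per-view appearance variability."""
    n = len(feats)
    return [sum(f[d] for f in feats) / n for d in range(len(feats[0]))]

def wise_distance(query, gallery_feats, alpha=0.5):
    """Distance-aggregation sketch (hypothetical weighting): blend the
    minimum and mean Euclidean distances from a query to an identity's
    gallery images, rather than trusting any single image."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    ds = [dist(query, g) for g in gallery_feats]
    return alpha * min(ds) + (1 - alpha) * sum(ds) / len(ds)
```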
https://arxiv.org/abs/2405.01101
Whistleblowing is essential for ensuring transparency and accountability in both public and private sectors. However, (potential) whistleblowers often fear or face retaliation, even when reporting anonymously. The specific content of their disclosures and their distinct writing style may re-identify them as the source. Legal measures, such as the EU WBD, are limited in their scope and effectiveness. Therefore, computational methods to prevent re-identification are important complementary tools for encouraging whistleblowers to come forward. However, current text sanitization tools follow a one-size-fits-all approach and take an overly limited view of anonymity. They aim to mitigate identification risk by replacing typical high-risk words (such as person names and other NE labels) and combinations thereof with placeholders. Such an approach, however, is inadequate for the whistleblowing scenario since it neglects further re-identification potential in textual features, including writing style. Therefore, we propose, implement, and evaluate a novel classification and mitigation strategy for rewriting texts that involves the whistleblower in the assessment of the risk and utility. Our prototypical tool semi-automatically evaluates risk at the word/term level and applies risk-adapted anonymization techniques to produce a grammatically disjointed yet appropriately sanitized text. We then use a LLM that we fine-tuned for paraphrasing to render this text coherent and style-neutral. We evaluate our tool's effectiveness using court cases from the ECHR and excerpts from a real-world whistleblower testimony and measure the protection against authorship attribution (AA) attacks and utility loss statistically using the popular IMDb62 movie reviews dataset. Our method can significantly reduce AA accuracy from 98.81% to 31.22%, while preserving up to 73.1% of the original content's semantics.
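The word/term-level substitution stage might look like the following (a deliberately minimal sketch; the real tool is semi-automatic and risk-adapted, involves the whistleblower in the risk/utility assessment, and is followed by an LLM paraphrasing pass for coherence and style neutrality):

```python
import re

def sanitize(text, risk_terms):
    """Placeholder substitution for high-risk terms. risk_terms maps each
    identified term to a category placeholder (e.g. '[PERSON]'); the
    mapping itself would come from the risk-assessment step."""
    for term, placeholder in risk_terms.items():
        text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
    return text
```

As the abstract notes, such placeholder substitution alone leaves writing style intact, which is why the paraphrasing stage is needed.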
https://arxiv.org/abs/2405.01097
To facilitate the re-identification (Re-ID) of individual animals, existing methods primarily focus on maximizing feature similarity within the same individual and enhancing distinctiveness between different individuals. However, most of them still rely on supervised learning and require substantial labeled data, which is challenging to obtain. To avoid this issue, we propose a Feature-Aware Noise Contrastive Learning (FANCL) method to explore an unsupervised learning solution, which is then validated on the task of red panda re-ID. FANCL employs a Feature-Aware Noise Addition module to produce noised images that conceal critical features and designs two contrastive learning modules to calculate the losses. Firstly, a feature consistency module is designed to bridge the gap between the original and noised features. Secondly, the neural networks are trained through a cluster contrastive learning module. Through these more challenging learning tasks, FANCL can adaptively extract deeper representations of red pandas. The experimental results on a set of red panda images collected in both indoor and outdoor environments prove that FANCL outperforms several related state-of-the-art unsupervised methods, achieving high performance comparable to supervised learning methods.
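A toy version of the noise-addition and consistency ideas, moved into feature space for brevity (the paper conceals critical features in the images themselves; the names and the masking rule here are assumptions):

```python
def feature_aware_noise(feat, importance, k=2, value=0.0):
    """Feature-aware noise sketch: suppress the k most 'important'
    dimensions to create a harder view of the same sample."""
    order = sorted(range(len(feat)), key=lambda d: -importance[d])
    masked = set(order[:k])
    return [value if d in masked else x for d, x in enumerate(feat)]

def consistency_loss(f_orig, f_noised):
    """Feature-consistency term: mean squared gap between the original
    and noised representations that training tries to close."""
    return sum((a - b) ** 2 for a, b in zip(f_orig, f_noised)) / len(f_orig)
```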
https://arxiv.org/abs/2405.00468
The Palácio do Planalto, office of the President of Brazil, was invaded by protesters on January 8, 2023. Surveillance videos taken from inside the building were subsequently released by the Brazilian Supreme Court for public scrutiny. We used segments of such footage to create the UFPR-Planalto801 dataset for people tracking and re-identification in a real-world scenario. This dataset consists of more than 500,000 images. This paper presents a tracking approach targeting this dataset. The method proposed in this paper relies on the use of known state-of-the-art trackers combined in a multilevel hierarchy to correct the ID association over the trajectories. We evaluated our method using IDF1, MOTA, MOTP and HOTA metrics. The results show improvements for every tracker used in the experiments, with IDF1 score increasing by a margin up to 9.5%.
https://arxiv.org/abs/2404.18876
Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of the global and local features of ViT and then further propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have strong representational ability, and the global and local information can mutually enhance each other. Based on this finding, we propose a Global Aggregation Encoder (GAE) to utilize the class tokens of the last few Transformer layers and learn comprehensive global features effectively. Meanwhile, we propose Local Multi-layer Fusion (LMF), which leverages both the global cues from the GAE and multi-layer patch tokens to explore discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks.
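The GAE idea of pooling class tokens from the last few layers can be caricatured as a plain average (the real encoder is learned; `k` is illustrative):

```python
def global_aggregate(layer_cls_tokens, k=4):
    """Sketch of Global-Aggregation-Encoder-style pooling: average the
    class tokens of the last k Transformer layers into one global feature,
    exploiting the observation that late layers are already strong."""
    last = layer_cls_tokens[-k:]
    dim = len(last[0])
    return [sum(tok[d] for tok in last) / len(last) for d in range(dim)]
```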
https://arxiv.org/abs/2404.14985
Clothes-changing person re-identification (CC-ReID) aims to retrieve images of the same person wearing different outfits. Mainstream research focuses on designing advanced model structures and strategies to capture identity information independent of clothing. However, same-clothes discrimination, the standard ReID learning objective, has been persistently ignored in previous CC-ReID research. In this study, we dive into the relationship between the standard and clothes-changing (CC) learning objectives, and bring the inner conflicts between these two objectives to the fore. We magnify the proportion of CC training pairs by supplementing high-fidelity clothes-varying synthesis produced by our proposed Clothes-Changing Diffusion model. By incorporating the synthetic images into CC-ReID model training, we observe a significant improvement under the CC protocol. However, this improvement sacrifices performance under the standard protocol, due to the inner conflict between the standard and CC objectives. To mitigate this conflict, we decouple the objectives and re-formulate CC-ReID learning as a multi-objective optimization (MOO) problem. By effectively regularizing the gradient curvature across multiple objectives and introducing preference restrictions, our MOO solution surpasses the single-task training paradigm. Our framework is model-agnostic, and demonstrates superior performance under both the CC and standard ReID protocols.
https://arxiv.org/abs/2404.12611
Current clothes-changing person re-identification (re-id) approaches usually perform retrieval based on clothes-irrelevant features, while neglecting the potential of clothes-relevant features. However, we observe that relying solely on clothes-irrelevant features for clothes-changing re-id is limited, since they often lack adequate identity information and suffer from large intra-class variations. On the contrary, clothes-relevant features can be used to discover same-clothes intermediaries that possess informative identity clues. Based on this observation, we propose a Feasibility-Aware Intermediary Matching (FAIM) framework to additionally utilize clothes-relevant features for retrieval. Firstly, an Intermediary Matching (IM) module is designed to perform an intermediary-assisted matching process. This process involves using clothes-relevant features to find informative intermediates, and then using clothes-irrelevant features of these intermediates to complete the matching. Secondly, in order to reduce the negative effect of low-quality intermediaries, an Intermediary-Based Feasibility Weighting (IBFW) module is designed to evaluate the feasibility of intermediary matching process by assessing the quality of intermediaries. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on several widely-used clothes-changing re-id benchmarks.
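The two-hop intermediary matching process can be sketched as follows (illustrative code, with Euclidean distances and hypothetical weights standing in for the learned feasibility-aware formulation):

```python
def intermediary_distance(query, gallery, bank, w_clothes=1.0, w_id=1.0):
    """Two-hop intermediary matching sketch. Each element carries a
    clothes-relevant feature 'c' and a clothes-irrelevant feature 'i':
    hop 1 finds the same-clothes intermediary closest to the query in
    clothes space; hop 2 matches that intermediary to the gallery image
    using the intermediary's clothes-irrelevant (identity) feature."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    mid = min(bank, key=lambda m: dist(query["c"], m["c"]))      # hop 1
    return w_clothes * dist(query["c"], mid["c"]) + \
           w_id * dist(mid["i"], gallery["i"])                   # hop 2
```

The IBFW module would additionally down-weight this score when the chosen intermediary is of low quality.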
https://arxiv.org/abs/2404.09507
Visible-infrared person re-identification (VI-ReID) aims at matching cross-modality pedestrian images captured by disjoint visible or infrared cameras. Existing methods alleviate the cross-modality discrepancies by designing different kinds of network architectures. Different from available methods, in this paper we propose a novel parameter-optimizing paradigm, the parameter hierarchical optimization (PHO) method, for the task of VI-ReID. It allows part of the parameters to be directly optimized without any training, which narrows the search space of parameters and makes the whole network easier to train. Specifically, we first divide the parameters into different types, and then introduce a self-adaptive alignment strategy (SAS) to automatically align the visible and infrared images through transformation. Considering that features in different dimensions have varying importance, we develop an auto-weighted alignment learning (AAL) module that can automatically weight features according to their importance. Importantly, in the alignment process of SAS and AAL, all the parameters are immediately optimized with optimization principles rather than by training the whole network, which yields a better parameter training manner. Furthermore, we establish a cross-modality consistent learning (CCL) loss to extract discriminative person representations with translation consistency. We provide both theoretical justification and empirical evidence that our proposed PHO method outperforms existing VI-ReID approaches.
https://arxiv.org/abs/2404.07930
Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling. Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID. However, there exist two challenges: 1) noisy pseudo labels might be generated in the clustering process, and 2) cross-modality feature alignment via matching the marginal distributions of the visible and infrared modalities may misalign the different identities of the two modalities. In this paper, we first conduct a theoretical analysis in which an interpretable generalization upper bound is introduced. Based on the analysis, we then propose a novel unsupervised cross-modality person re-identification framework (PRAISE). Specifically, to address the first challenge, we propose a pseudo-label correction strategy that utilizes a Beta Mixture Model to predict the probability of mis-clustering based on the network's memory effect, and rectifies the correspondence by adding a perceptual term to contrastive learning. Next, we introduce a modality-level alignment strategy that generates paired visible-infrared latent features and reduces the modality gap by aligning the labeling functions of visible and infrared features, so as to learn identity-discriminative and modality-invariant features. Experimental results on two benchmark datasets demonstrate that our method achieves state-of-the-art performance compared with unsupervised visible-ReID methods.
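The Beta-Mixture correction step can be sketched with two fixed components (in practice the mixture is fit to normalized per-sample losses, e.g. with EM; the component parameters and prior below are purely illustrative):

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution on (0, 1)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def clean_probability(loss, clean=(2.0, 8.0), noisy=(8.0, 2.0), prior=0.5):
    """Posterior probability that a sample's pseudo-label is clean, given
    its normalized loss: the memory effect means clean samples tend to
    have small losses, captured by the low-loss Beta component."""
    pc = prior * beta_pdf(loss, *clean)          # low-loss (clean) component
    pn = (1 - prior) * beta_pdf(loss, *noisy)    # high-loss (noisy) component
    return pc / (pc + pn)
```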
https://arxiv.org/abs/2404.06683
The memory dictionary-based contrastive learning method has achieved remarkable results in the field of unsupervised person Re-ID. However, updating the memory based on all samples does not fully utilize the hardest samples to improve the generalization ability of the model, while methods based on hardest-sample mining inevitably introduce false-positive samples that are incorrectly clustered in the early stages of training. Moreover, clustering-based methods usually discard a significant number of outliers, leading to the loss of valuable information. To address these issues, we propose an adaptive intra-class variation contrastive learning algorithm for unsupervised Re-ID, called AdaInCV. The algorithm quantitatively evaluates the learning ability of the model for each class by considering the intra-class variations after clustering, which helps in selecting appropriate samples during the training process. More specifically, two new strategies are proposed: Adaptive Sample Mining (AdaSaM) and Adaptive Outlier Filter (AdaOF). The first gradually creates more reliable clusters to dynamically refine the memory, while the second identifies and filters out valuable outliers as negative samples.
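The contrast between all-sample and hardest-sample memory updates that motivates AdaInCV can be sketched as a momentum centroid update (illustrative code, not the AdaSaM/AdaOF algorithms themselves):

```python
def update_memory(memory, samples, momentum=0.2, use_hardest=True):
    """Update one cluster's memory centroid with either the hardest sample
    (farthest from the centroid, sharper but noise-prone early on) or the
    mean of all samples (stable but under-uses hard examples)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    if use_hardest:
        chosen = max(samples, key=lambda s: dist(s, memory))
    else:
        n = len(samples)
        chosen = [sum(s[d] for s in samples) / n for d in range(len(memory))]
    return [(1 - momentum) * m + momentum * c for m, c in zip(memory, chosen)]
```

AdaInCV's contribution is, in effect, choosing between such regimes adaptively per class based on intra-class variation.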
https://arxiv.org/abs/2404.04665
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limit model performance. In our research, we introduce a new framework called PAB-ReID, a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce human parsing labels to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we design a part triplet loss to supervise the learning of human local features, which optimizes intra-/inter-class distances. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showcasing that our approach outperforms the existing state-of-the-art methods.
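The part triplet loss can be sketched as a per-part triplet hinge averaged over parts (the margin value is illustrative, and Euclidean distance stands in for whatever metric the model uses):

```python
def part_triplet_loss(anchor_parts, pos_parts, neg_parts, margin=0.3):
    """Average triplet hinge over corresponding body parts: pull local
    features of the same identity together and push different identities
    apart by at least `margin`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    losses = []
    for a, p, n in zip(anchor_parts, pos_parts, neg_parts):
        losses.append(max(0.0, dist(a, p) - dist(a, n) + margin))
    return sum(losses) / len(losses)
```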
https://arxiv.org/abs/2404.03443
Occlusion remains one of the major challenges in person re-identification (ReID) as a result of the diversity of poses and variations in appearance. Developing novel architectures to improve the robustness of occlusion-aware person Re-ID requires new insights, especially on low-resolution edge cameras. We propose a deep ensemble model that harnesses both CNN and Transformer architectures to generate robust feature representations. To achieve robust Re-ID without the need to manually label occluded regions, we propose to take an ensemble learning-based approach derived from the analogy between arbitrarily shaped occluded regions and robust feature representation. Using the orthogonality principle, our developed deep CNN model makes use of a masked autoencoder (MAE) and global-local feature fusion for robust person identification. Furthermore, we present a part occlusion-aware transformer capable of learning a feature space that is robust to occluded regions. Experimental results are reported on several Re-ID datasets to show the effectiveness of our developed ensemble model, named Orthogonal Fusion with Occlusion Handling (OFOH). Compared to competing methods, the proposed OFOH approach achieves competitive rank-1 and mAP performance.
https://arxiv.org/abs/2404.00107
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.
https://arxiv.org/abs/2403.20225
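A spatially aligned, temporally synchronized RGB-thermal sample from such a multi-modal dataset might be represented as below. The field names and directory layout are hypothetical illustrations, not MTMMC's actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class MultiModalSample:
    """Hypothetical record for one synchronized RGB/thermal frame pair."""
    camera_id: int        # one of the 16 cameras in the network
    timestamp_ms: int     # shared clock, so the two modalities align in time
    rgb_path: str         # path to the RGB frame
    thermal_path: str     # path to the spatially aligned thermal frame
    track_ids: list[int]  # identities visible in this frame pair

# Example record (paths and IDs are made up for illustration).
sample = MultiModalSample(
    camera_id=3,
    timestamp_ms=1_000,
    rgb_path="campus/cam03/rgb/000001.jpg",
    thermal_path="campus/cam03/thermal/000001.png",
    track_ids=[12, 57],
)
print(sample.camera_id)  # 3
```

Keeping both modality paths under one timestamped record is what lets a tracker consume the RGB and thermal streams as a single aligned observation.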
Unsupervised person re-identification aims to retrieve images of a specified person without identity labels. Many recent unsupervised Re-ID approaches adopt clustering-based methods that measure cross-camera feature similarity to roughly divide images into clusters. They ignore the feature distribution discrepancy induced by the camera domain gap, resulting in unavoidable performance degradation. Camera information is usually available, and the feature distribution within a single camera usually focuses more on the appearance of the individual and has less intra-identity variance. Inspired by this observation, we introduce a Camera-Aware Label Refinement (CALR) framework that reduces camera discrepancy by clustering intra-camera similarity. Specifically, we employ intra-camera training to obtain reliable local pseudo labels within each camera, then refine the global labels generated by inter-camera clustering and train the discriminative model with the more reliable global pseudo labels in a self-paced manner. Meanwhile, we develop a camera-alignment module to align feature distributions under different cameras, which helps further handle camera variance. Extensive experiments validate the superiority of our proposed method over state-of-the-art approaches. The code is accessible at this https URL.
https://arxiv.org/abs/2403.16450
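The camera-aware refinement idea can be sketched as a majority vote: within each reliable intra-camera cluster, every member adopts the most common global pseudo label, suppressing noisy assignments from inter-camera clustering. This is a simplified stand-in for CALR's refinement, with assumed inputs (precomputed camera IDs, local labels, and global labels), not the paper's actual algorithm.

```python
import numpy as np

def refine_global_labels(cam_ids, local_labels, global_labels):
    """For each intra-camera cluster (cam_id, local_label), overwrite every
    member's global pseudo label with the cluster's majority global label."""
    refined = np.asarray(global_labels).copy()
    cam_ids = np.asarray(cam_ids)
    local_labels = np.asarray(local_labels)
    for cam in np.unique(cam_ids):
        for loc in np.unique(local_labels[cam_ids == cam]):
            mask = (cam_ids == cam) & (local_labels == loc)
            # Intra-camera clusters are assumed reliable, so the majority
            # global label within a cluster wins.
            values, counts = np.unique(refined[mask], return_counts=True)
            refined[mask] = values[np.argmax(counts)]
    return refined

# Two cameras; one sample in camera 0's cluster got a noisy global label (9).
cam_ids      = [0, 0, 0, 1, 1]
local_labels = [0, 0, 0, 0, 0]
global_lbls  = [5, 5, 9, 5, 5]
print(refine_global_labels(cam_ids, local_labels, global_lbls))  # [5 5 5 5 5]
```

In a full pipeline the local and global labels would come from per-camera and cross-camera clustering of learned features; here they are hard-coded to keep the sketch self-contained.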
Lifelong Person Re-Identification (LReID) aims to continuously learn from successive data streams, matching individuals across multiple cameras. The key challenge for LReID is how to effectively preserve old knowledge while incrementally learning new information. Task-level domain gaps and limited old-task datasets are key factors leading to catastrophic forgetting in LReID, which are overlooked in existing methods. To alleviate this problem, we propose a novel Diverse Representation Embedding (DRE) framework for LReID. The proposed DRE preserves old knowledge while adapting to new information at both the instance level and the task level. Concretely, an Adaptive Constraint Module (ACM) is proposed to implement integration and push-away operations between multiple representations, obtaining a dense embedding subspace for each instance to improve matching ability on limited old-task datasets. Based on the resulting diverse representations, we exchange knowledge between the adjustment model and the learner model through Knowledge Update (KU) and Knowledge Preservation (KP) strategies at the task level, which reduce the task-wise domain gap on both old and new tasks and exploit the diverse representations of each instance in the limited old-task datasets, improving model performance over extended periods. Extensive experiments were conducted on eleven Re-ID datasets, including five seen datasets for training in two orders (order-1 and order-2) and six unseen datasets for inference. Compared to state-of-the-art methods, our method achieves significantly improved performance on holistic, large-scale, and occluded datasets.
https://arxiv.org/abs/2403.16003
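The ACM's "integration and push-away" operations can be illustrated as a contrastive-style loss over representations: same-identity pairs are pulled together (integration) while different-identity pairs are pushed at least a margin apart. This is a toy sketch under assumed Euclidean distances and a hinge margin, not the paper's actual ACM formulation.

```python
import numpy as np

def acm_loss(reps, labels, margin=1.0):
    """Toy integrate-and-push-away constraint: squared distance for
    same-identity pairs, hinged squared margin violation for
    different-identity pairs. Illustrative only."""
    reps = np.asarray(reps, dtype=float)
    labels = np.asarray(labels)
    pull, push = 0.0, 0.0
    n = len(reps)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(reps[i] - reps[j])
            if labels[i] == labels[j]:
                pull += d ** 2                     # integration term
            else:
                push += max(0.0, margin - d) ** 2  # push-away (hinge) term
    return pull + push

# A close same-identity pair plus a far different identity: low loss.
reps = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
labels = [1, 1, 2]
print(round(acm_loss(reps, labels), 4))  # 0.01
```

Minimizing such a term densifies each instance's embedding neighborhood while keeping identities separated, which is the intuition the abstract attributes to the ACM.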