Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of the global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have a strong representational ability, and that global and local information can mutually enhance each other. Based on this observation, we propose a Global Aggregation Encoder (GAE) that utilizes the class tokens of the last few Transformer layers to learn comprehensive global features effectively. Meanwhile, we propose a Local Multi-layer Fusion (LMF) module that leverages both the global cues from the GAE and multi-layer patch tokens to explore discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks.
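To make the multi-layer aggregation idea concrete, here is a minimal sketch (hypothetical function name, plain NumPy, not the authors' code) of fusing the class tokens of the last few Transformer layers into one global descriptor, assuming a simple normalize-and-average fusion in place of the learned GAE:

```python
import numpy as np

def global_aggregation(cls_tokens, num_layers=4):
    """Fuse the [CLS] tokens of the last `num_layers` Transformer layers
    into a single global feature: L2-normalize each token, then average.
    `cls_tokens` is a list of (dim,) arrays, one per layer."""
    selected = cls_tokens[-num_layers:]
    normed = [t / (np.linalg.norm(t) + 1e-12) for t in selected]
    return np.mean(normed, axis=0)

# toy example: 12 layers of 8-dimensional class tokens
rng = np.random.default_rng(0)
cls_tokens = [rng.standard_normal(8) for _ in range(12)]
global_feat = global_aggregation(cls_tokens)
```

The actual GAE is a learned encoder; the averaging above only illustrates how multi-layer class tokens can be combined into one global representation.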
https://arxiv.org/abs/2404.14985
Clothes-changing person re-identification (CC-ReID) aims to retrieve images of the same person wearing different outfits. Mainstream research focuses on designing advanced model structures and strategies to capture identity information independent of clothing. However, same-clothes discrimination, the standard ReID learning objective, has been persistently ignored in previous CC-ReID research. In this study, we dive into the relationship between the standard and clothes-changing (CC) learning objectives and bring the inner conflicts between these two objectives to the fore. We magnify the proportion of CC training pairs by supplementing them with high-fidelity clothes-varying synthesis produced by our proposed Clothes-Changing Diffusion model. By incorporating the synthetic images into CC-ReID model training, we observe a significant improvement under the CC protocol. However, this improvement sacrifices performance under the standard protocol, owing to the inner conflict between the standard and CC objectives. To mitigate the conflict, we decouple these objectives and re-formulate CC-ReID learning as a multi-objective optimization (MOO) problem. By effectively regularizing the gradient curvature across multiple objectives and introducing preference restrictions, our MOO solution surpasses the single-task training paradigm. Our framework is model-agnostic and demonstrates superior performance under both the CC and standard ReID protocols.
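The MOO formulation can be illustrated with a toy scalarization step (hypothetical names and a plain weighted combination; the paper's gradient-curvature regularization is omitted): per-objective gradients for the standard and CC losses are combined under a preference weighting before the parameter update:

```python
import numpy as np

def preference_weighted_step(grads, prefs, lr=0.1):
    """Combine per-objective gradients under a preference simplex.
    `grads`: list of (dim,) gradient arrays, one per objective
    (e.g. the standard ReID loss and the clothes-changing loss).
    `prefs`: non-negative preference weights, renormalized to sum to 1."""
    w = np.asarray(prefs, dtype=float)
    w = w / w.sum()
    combined = sum(wi * gi for wi, gi in zip(w, grads))
    return -lr * combined  # parameter update direction

g_std = np.array([1.0, 0.0])   # gradient of the standard objective
g_cc  = np.array([0.0, 2.0])   # gradient of the clothes-changing objective
step = preference_weighted_step([g_std, g_cc], prefs=[0.5, 0.5])
```

The preference vector is where the "preference restrictions" of the abstract would enter: constraining `prefs` steers the solution toward one protocol without fully sacrificing the other.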
https://arxiv.org/abs/2404.12611
Current clothes-changing person re-identification (re-id) approaches usually perform retrieval based on clothes-irrelevant features, while neglecting the potential of clothes-relevant features. However, we observe that relying solely on clothes-irrelevant features for clothes-changing re-id is limited, since they often lack adequate identity information and suffer from large intra-class variations. On the contrary, clothes-relevant features can be used to discover same-clothes intermediaries that possess informative identity clues. Based on this observation, we propose a Feasibility-Aware Intermediary Matching (FAIM) framework that additionally utilizes clothes-relevant features for retrieval. Firstly, an Intermediary Matching (IM) module is designed to perform an intermediary-assisted matching process: clothes-relevant features are used to find informative intermediaries, and the clothes-irrelevant features of these intermediaries then complete the matching. Secondly, to reduce the negative effect of low-quality intermediaries, an Intermediary-Based Feasibility Weighting (IBFW) module is designed to evaluate the feasibility of the intermediary matching process by assessing the quality of the intermediaries. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on several widely used clothes-changing re-id benchmarks.
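A toy version of the intermediary-assisted matching step might look as follows (hypothetical function, 2-D features for illustration; the feasibility weighting of IBFW is omitted): the query's clothes-relevant feature selects a same-clothes intermediary, and clothes-irrelevant distances from that intermediary refine the ranking:

```python
import numpy as np

def intermediary_distance(q_irr, q_rel, gallery_irr, gallery_rel):
    """Match query to gallery through a same-clothes intermediary:
    pick the gallery image whose clothes-RELEVANT feature is closest to
    the query (the intermediary), then blend the direct clothes-irrelevant
    distance with the distance measured from that intermediary."""
    rel_d = np.linalg.norm(gallery_rel - q_rel, axis=1)
    inter = int(np.argmin(rel_d))                  # same-clothes intermediary
    via = np.linalg.norm(gallery_irr - gallery_irr[inter], axis=1)
    direct = np.linalg.norm(gallery_irr - q_irr, axis=1)
    return 0.5 * (direct + via), inter

# toy gallery of three images
gallery_rel = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
gallery_irr = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
dist, inter = intermediary_distance(
    q_irr=np.array([0.0, 0.0]), q_rel=np.array([0.1, 0.1]),
    gallery_irr=gallery_irr, gallery_rel=gallery_rel)
```

In FAIM the blend would additionally be weighted by the estimated feasibility of the intermediary, down-weighting the `via` term when the intermediary is low quality.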
https://arxiv.org/abs/2404.09507
Visible-infrared person re-identification (VI-ReID) aims at matching cross-modality pedestrian images captured by disjoint visible or infrared cameras. Existing methods alleviate the cross-modality discrepancies by designing different kinds of network architectures. Unlike these methods, in this paper we propose a novel parameter-optimizing paradigm, the parameter hierarchical optimization (PHO) method, for the VI-ReID task. It allows part of the parameters to be directly optimized without any training, which narrows the search space of parameters and makes the whole network easier to train. Specifically, we first divide the parameters into different types and then introduce a self-adaptive alignment strategy (SAS) to automatically align the visible and infrared images through transformation. Considering that features in different dimensions have varying importance, we develop an auto-weighted alignment learning (AAL) module that automatically weights features according to their importance. Importantly, in the alignment process of SAS and AAL, all the parameters are immediately optimized with optimization principles rather than by training the whole network, which yields a better parameter training manner. Furthermore, we establish a cross-modality consistent learning (CCL) loss to extract discriminative person representations with translation consistency. We provide both theoretical justification and empirical evidence that our proposed PHO method outperforms existing VI-ReID approaches.
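As a hedged illustration of "optimized with optimization principles rather than training": under a linear-map assumption, a cross-modality alignment transform can be solved in closed form by least squares instead of by backpropagation. This is toy data and a hypothetical setup, not the paper's SAS:

```python
import numpy as np

# Fit a linear map W sending visible-modality features onto paired
# infrared-modality features by ordinary least squares (closed form),
# i.e. the alignment parameters are computed directly, not trained.
rng = np.random.default_rng(1)
X_vis = rng.standard_normal((100, 4))   # visible features (100 samples)
W_true = rng.standard_normal((4, 4))
X_ir = X_vis @ W_true                   # paired infrared features (noise-free)
W, *_ = np.linalg.lstsq(X_vis, X_ir, rcond=None)
residual = np.linalg.norm(X_vis @ W - X_ir)
```

Because the toy relationship is exactly linear, `W` recovers `W_true` up to numerical precision; the point is only that such parameters admit a direct solution, shrinking the space the network must learn.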
https://arxiv.org/abs/2404.07930
Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling. Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID. However, two challenges remain: 1) noisy pseudo-labels might be generated in the clustering process, and 2) cross-modality feature alignment via matching the marginal distributions of the visible and infrared modalities may misalign the different identities across the two modalities. In this paper, we first conduct a theoretical analysis that introduces an interpretable generalization upper bound. Based on this analysis, we then propose a novel unsupervised cross-modality person re-identification framework (PRAISE). Specifically, to address the first challenge, we propose a pseudo-label correction strategy that utilizes a Beta Mixture Model to predict the probability of mis-clustering based on the network's memory effect, and rectifies the correspondences by adding a perceptual term to contrastive learning. Next, we introduce a modality-level alignment strategy that generates paired visible-infrared latent features and reduces the modality gap by aligning the labeling functions of visible and infrared features, so as to learn identity-discriminative and modality-invariant features. Experimental results on two benchmark datasets demonstrate that our method achieves state-of-the-art performance compared with unsupervised visible-infrared ReID methods.
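The Beta-Mixture intuition behind the pseudo-label correction can be sketched with fixed, purely illustrative component parameters (a real BMM would be fitted to the loss distribution, e.g. by EM): samples with low normalized loss are likely correctly clustered, those with high loss likely mis-clustered:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed via math.gamma (no SciPy needed)."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

def clean_posterior(loss, clean=(2.0, 8.0), noisy=(8.0, 2.0), prior=0.5):
    """Posterior probability that a sample is correctly clustered, given
    its normalized loss in (0, 1). Two fixed Beta components stand in for
    a fitted Beta Mixture Model: the clean component concentrates at low
    loss, the mis-clustered one at high loss (parameters are illustrative)."""
    p_clean = prior * beta_pdf(loss, *clean)
    p_noisy = (1 - prior) * beta_pdf(loss, *noisy)
    return p_clean / (p_clean + p_noisy)
```

The resulting posterior can then soft-weight each pair inside the contrastive objective, which is the role the abstract assigns to the correction strategy.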
https://arxiv.org/abs/2404.06683
The memory-dictionary-based contrastive learning method has achieved remarkable results in the field of unsupervised person Re-ID. However, updating the memory based on all samples does not fully utilize the hardest samples to improve the generalization ability of the model, while hardest-sample mining inevitably introduces false positives that are incorrectly clustered in the early stages of training. Clustering-based methods also usually discard a significant number of outliers, leading to the loss of valuable information. To address these issues, we propose an adaptive intra-class variation contrastive learning algorithm for unsupervised Re-ID, called AdaInCV. The algorithm quantitatively evaluates the model's learning ability for each class by considering the intra-class variation after clustering, which helps in selecting appropriate samples during training. More specifically, two new strategies are proposed: Adaptive Sample Mining (AdaSaM) and Adaptive Outlier Filter (AdaOF). The first gradually creates more reliable clusters to dynamically refine the memory, while the second identifies and filters out valuable outliers as negative samples.
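The quantity the method adapts to, intra-class variation after clustering, has a simple form; a minimal sketch (hypothetical function name, mean distance to the cluster centroid as the variation measure):

```python
import numpy as np

def intra_class_variation(features):
    """Mean distance of a cluster's features to its centroid: a proxy for
    how well the model has learned that class (smaller = tighter cluster)."""
    centroid = features.mean(axis=0)
    return float(np.linalg.norm(features - centroid, axis=1).mean())

tight = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])  # well-learned class
loose = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # poorly-learned class
```

Classes with small variation can safely be mined with harder samples, while classes with large variation call for gentler sample selection, which is the adaptation AdaSaM/AdaOF perform.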
https://arxiv.org/abs/2404.04665
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limit model performance. In our research, we introduce a new framework called PAB-ReID, a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce human parsing labels to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we design a part triplet loss to supervise the learning of human local features, which optimizes intra-/inter-class distances. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showcasing that our approach outperforms the existing state-of-the-art methods.
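A part triplet loss can be sketched as the standard triplet margin loss applied per body part and averaged (hypothetical function name and margin value; the paper's exact formulation may differ):

```python
import numpy as np

def part_triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss per body part, then averaged. Each input is a
    (num_parts, dim) array of part features: same-identity parts are
    pulled together, different identities pushed apart by `margin`."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)  # per-part distances
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

# toy features: 3 parts, 4 dims; the positive coincides with the anchor
anchor = np.ones((3, 4))
loss_easy = part_triplet_loss(anchor, np.ones((3, 4)), np.zeros((3, 4)))
loss_hard = part_triplet_loss(anchor, np.zeros((3, 4)), np.ones((3, 4)))
```

Averaging over parts (rather than one global embedding) is what lets the loss keep supervising the visible parts even when other parts are occluded.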
https://arxiv.org/abs/2404.03443
Occlusion remains one of the major challenges in person re-identification (ReID) as a result of the diversity of poses and the variation of appearances. Developing novel architectures to improve the robustness of occlusion-aware person Re-ID requires new insights, especially on low-resolution edge cameras. We propose a deep ensemble model that harnesses both CNN and Transformer architectures to generate robust feature representations. To achieve robust Re-ID without the need to manually label occluded regions, we take an ensemble-learning-based approach derived from the analogy between arbitrarily shaped occluded regions and robust feature representation. Using the orthogonality principle, our deep CNN model makes use of a masked autoencoder (MAE) and global-local feature fusion for robust person identification. Furthermore, we present a part-occlusion-aware transformer capable of learning a feature space that is robust to occluded regions. Experimental results on several Re-ID datasets show the effectiveness of our ensemble model, named orthogonal fusion with occlusion handling (OFOH). Compared to competing methods, the proposed OFOH approach achieves competitive rank-1 and mAP performance.
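The orthogonality principle behind fusing two branches can be sketched as follows (hypothetical function; a minimal Gram-Schmidt-style fusion, not the paper's exact module): keep only the second branch's component orthogonal to the first, so the concatenation carries complementary rather than redundant information:

```python
import numpy as np

def orthogonal_fusion(f_cnn, f_vit):
    """Fuse two feature vectors: remove from `f_vit` its projection onto
    `f_cnn`, then concatenate, so the two halves are orthogonal."""
    proj = (f_vit @ f_cnn) / (f_cnn @ f_cnn) * f_cnn
    f_orth = f_vit - proj
    return np.concatenate([f_cnn, f_orth])

a = np.array([1.0, 0.0])       # e.g. CNN-branch feature
b = np.array([1.0, 1.0])       # e.g. Transformer-branch feature
fused = orthogonal_fusion(a, b)
```

After fusion the two halves have zero inner product, which is the sense in which the ensemble members contribute non-overlapping evidence.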
https://arxiv.org/abs/2404.00107
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.
https://arxiv.org/abs/2403.20225
Unsupervised person re-identification aims to retrieve images of a specified person without identity labels. Many recent unsupervised Re-ID approaches adopt clustering-based methods to measure cross-camera feature similarity to roughly divide images into clusters. They ignore the feature distribution discrepancy induced by camera domain gap, resulting in the unavoidable performance degradation. Camera information is usually available, and the feature distribution in the single camera usually focuses more on the appearance of the individual and has less intra-identity variance. Inspired by the observation, we introduce a \textbf{C}amera-\textbf{A}ware \textbf{L}abel \textbf{R}efinement~(CALR) framework that reduces camera discrepancy by clustering intra-camera similarity. Specifically, we employ intra-camera training to obtain reliable local pseudo labels within each camera, and then refine global labels generated by inter-camera clustering and train the discriminative model using more reliable global pseudo labels in a self-paced manner. Meanwhile, we develop a camera-alignment module to align feature distributions under different cameras, which could help deal with the camera variance further. Extensive experiments validate the superiority of our proposed method over state-of-the-art approaches. The code is accessible at this https URL.
https://arxiv.org/abs/2403.16450
Lifelong Person Re-Identification (LReID) aims to continuously learn from successive data streams, matching individuals across multiple cameras. The key challenge for LReID is how to effectively preserve old knowledge while incrementally learning new information. Task-level domain gaps and limited old-task datasets are key factors leading to catastrophic forgetting in LReID, which are overlooked in existing methods. To alleviate this problem, we propose a novel Diverse Representation Embedding (DRE) framework for LReID. The proposed DRE preserves old knowledge while adapting to new information based on instance-level and task-level layouts. Concretely, an Adaptive Constraint Module (ACM) is proposed to implement integration and push-away operations between multiple representations, obtaining a dense embedding subspace for each instance to improve matching ability on limited old-task datasets. Based on the processed diverse representations, we exchange knowledge between the adjustment model and the learner model through Knowledge Update (KU) and Knowledge Preservation (KP) strategies at the task level, which reduces the task-wise domain gap on both old and new tasks and exploits the diverse representation of each instance in the limited datasets from old tasks, improving model performance over extended periods. Extensive experiments were conducted on eleven Re-ID datasets, including five seen datasets for training in two orders (order-1 and order-2) and six unseen datasets for inference. Compared to state-of-the-art methods, our method achieves significantly improved performance on holistic, large-scale, and occluded datasets.
https://arxiv.org/abs/2403.16003
Person re-identification (ReID) has made great strides thanks to data-driven deep learning techniques. However, the existing benchmark datasets lack diversity, and models trained on these data cannot generalize well to dynamic wild scenarios. To meet the goal of improving the explicit generalization of ReID models, we develop a new Open-World, Diverse, Cross-Spatial-Temporal dataset named OWD with several distinct features. 1) Diverse collection scenes: multiple independent open-world and highly dynamic collecting scenes, including streets, intersections, shopping malls, etc. 2) Diverse lighting variations: long time spans from daytime to nighttime with abundant illumination changes. 3) Diverse person status: multiple camera networks in all seasons with normal/adverse weather conditions and diverse pedestrian appearances (e.g., clothes, personal belongings, poses, etc.). 4) Protected privacy: invisible faces for privacy-critical applications. To improve the implicit generalization of ReID, we further propose a Latent Domain Expansion (LDE) method to develop the potential of source data, which decouples discriminative identity-relevant and trustworthy domain-relevant features and implicitly enforces domain-randomized identity feature space expansion with richer domain diversity to facilitate domain-invariant representations. Our comprehensive evaluations with most benchmark datasets in the community are crucial for progress, although this work is still far from the grand goal of open-world and dynamic wild applications.
https://arxiv.org/abs/2403.15119
Generative techniques for image anonymization have great potential to generate datasets that protect the privacy of those depicted in the images, while achieving high data fidelity and utility. Existing methods have focused extensively on preserving facial attributes, but failed to embrace a more comprehensive perspective that considers the scene and background in the anonymization process. This paper presents, to the best of our knowledge, the first approach to image anonymization based on Latent Diffusion Models (LDMs). Every element of a scene is maintained to convey the same meaning, yet manipulated in a way that makes re-identification difficult. We propose two LDMs for this purpose: CAMOUFLaGE-Base exploits a combination of pre-trained ControlNets and a new controlling mechanism designed to increase the distance between the real and anonymized images. CAMOUFLaGE-Light is based on the Adapter technique, coupled with an encoding designed to efficiently represent the attributes of different persons in a scene. The former solution achieves superior performance on most metrics and benchmarks, while the latter cuts the inference time in half at the cost of fine-tuning a lightweight module. We show through extensive experimental comparison that the proposed method is competitive with the state of the art concerning identity obfuscation, whilst better preserving the original content of the image and tackling unresolved challenges that current solutions fail to address.
https://arxiv.org/abs/2403.14790
Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras, such as ground-ground matching. However, as a more practical scenario, aerial-ground person re-identification (AGPReID) among heterogeneous cameras has received minimal attention. To alleviate the disruption of discriminative identity representation by dramatic view discrepancy, the most significant challenge in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet effective framework. Two major components are designed in VDT to decouple view-related and view-unrelated features, namely hierarchical subtractive separation and orthogonal loss, where the former separates these two features inside the VDT, and the latter constrains these two to be independent. In addition, we contribute a large-scale AGPReID dataset called CARGO, consisting of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method on mAP/Rank-1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID, while keeping the same magnitude of computational complexity. Our project is available at this https URL.
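An orthogonal loss that constrains view-related and view-unrelated features to be independent can be rendered minimally as the absolute cosine similarity between the two components, driven toward zero (hypothetical function; a common way to express such a constraint, not necessarily the paper's exact loss):

```python
import numpy as np

def orthogonal_loss(view_related, view_unrelated):
    """Penalize correlation between the view-related and view-unrelated
    feature vectors: |cosine similarity|, minimized toward 0 so the two
    decoupled components stay independent."""
    cos = (view_related @ view_unrelated) / (
        np.linalg.norm(view_related) * np.linalg.norm(view_unrelated))
    return abs(float(cos))
```

When the loss is zero the identity branch carries no view information, which is exactly the decoupling VDT aims at for aerial-versus-ground views.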
https://arxiv.org/abs/2403.14513
Person re-identification (re-id), which aims to retrieve images of the same person in a given image from a database, is one of the most practical image recognition applications. In the real world, however, the environments that the images are taken from change over time. This causes a distribution shift between training and testing and degrades the performance of re-id. To maintain re-id performance, models should continue adapting to the test environment's temporal changes. Test-time adaptation (TTA), which aims to adapt models to the test environment with only unlabeled test data, is a promising way to handle this problem because TTA can adapt models instantly in the test environment. However, the previous TTA methods are designed for classification and cannot be directly applied to re-id. This is because the set of people's identities in the dataset differs between training and testing in re-id, whereas the set of classes is fixed in the current TTA methods designed for classification. To improve re-id performance in changing test environments, we propose TEst-time similarity Modification for Person re-identification (TEMP), a novel TTA method for re-id. TEMP is the first fully TTA method for re-id, which does not require any modification to pre-training. Inspired by TTA methods that refine the prediction uncertainty in classification, we aim to refine the uncertainty in re-id. However, the uncertainty cannot be computed in the same way as classification in re-id since it is an open-set task, which does not share person labels between training and testing. Hence, we propose re-id entropy, an alternative uncertainty measure for re-id computed based on the similarity between the feature vectors. Experiments show that the re-id entropy can measure the uncertainty on re-id and TEMP improves the performance of re-id in online settings where the distribution changes over time.
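The re-id entropy can be sketched directly from its description: the entropy of a softmax over similarities between the query feature and the gallery features (hypothetical function name and temperature value, plain NumPy):

```python
import numpy as np

def reid_entropy(query, gallery, temperature=0.1):
    """Entropy of the softmax over cosine similarities between a query
    feature and the gallery: low when the query matches one gallery item
    confidently, high when the match is ambiguous."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    logits = (g @ q) / temperature
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

gallery = np.eye(3)                              # three toy gallery features
h_confident = reid_entropy(np.array([1.0, 0.0, 0.0]), gallery)
h_ambiguous = reid_entropy(np.array([1.0, 1.0, 1.0]), gallery)
```

Because it needs only pairwise similarities, not a fixed class set, this uncertainty is computable in the open-set re-id setting, and TEMP can minimize it at test time.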
https://arxiv.org/abs/2403.14114
We study the task of 3D multi-object re-identification from embodied tours. Specifically, an agent is given two tours of an environment (e.g. an apartment) under two different layouts (e.g. arrangements of furniture). Its task is to detect and re-identify objects in 3D - e.g. a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second. To support this task, we create an automated infrastructure to generate paired egocentric tours of initial/modified layouts in the Habitat simulator using Matterport3D scenes, YCB and Google-scanned objects. We present 3D Semantic MapNet (3D-SMNet) - a two-stage re-identification model consisting of (1) a 3D object detector that operates on RGB-D videos with known pose, and (2) a differentiable object matching module that solves correspondence estimation between two sets of 3D bounding boxes. Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours. After training 3D-SMNet on our generated episodes, we demonstrate zero-shot transfer to real-world rearrangement scenarios by instantiating our task in Replica, Active Vision, and RIO environments depicting rearrangements. On all datasets, we find 3D-SMNet outperforms competitive baselines. Further, we show jointly training on real and generated episodes can lead to significant improvements over training on real data alone.
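The correspondence-estimation step can be illustrated with a simple non-differentiable stand-in (hypothetical function; greedy nearest-pair matching on box centers instead of 3D-SMNet's learned matcher): objects left unpaired in the first layout are "missing", those in the second are "new":

```python
import numpy as np

def match_objects(centers_a, centers_b, max_dist=1.0):
    """Greedy correspondence between two sets of 3D object centers:
    repeatedly pair the globally closest remaining objects until no pair
    is within `max_dist`; leftovers are missing/new objects."""
    d = np.linalg.norm(centers_a[:, None] - centers_b[None, :], axis=2)
    pairs = []
    while d.size and d.min() <= max_dist:
        i, j = np.unravel_index(np.argmin(d), d.shape)
        pairs.append((int(i), int(j)))
        d[i, :] = np.inf        # remove matched row/column from play
        d[:, j] = np.inf
    return pairs

# layout A: sofa near origin, lamp far away; layout B: sofa nudged, new chair
layout_a = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
layout_b = np.array([[0.1, 0.0, 0.0], [9.0, 9.0, 9.0]])
pairs = match_objects(layout_a, layout_b)
```

In 3D-SMNet this step is replaced by a differentiable matching module so that correspondence supervision can backpropagate into the object maps.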
https://arxiv.org/abs/2403.13190
In numerous studies, deep learning algorithms have proven their potential for the analysis of histopathology images, for example, for revealing the subtypes of tumors or the primary origin of metastases. These models require large datasets for training, which must be anonymized to prevent possible patient identity leaks. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. We evaluated our algorithms on two TCIA datasets including lung squamous cell carcinoma (LSCC) and lung adenocarcinoma (LUAD). We also demonstrate the algorithm's performance on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with F1 scores of 50.16 % and 52.30 % on the LSCC and LUAD datasets, respectively, and with 62.31 % on our meningioma dataset. Based on our findings, we formulated a risk assessment scheme to estimate the risk to the patient's privacy prior to publication.
https://arxiv.org/abs/2403.12816
Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal pedestrian retrieval task, due to significant intra-class variations and cross-modal discrepancies among different cameras. Existing works mainly focus on embedding images of different modalities into a unified space to mine modality-shared features. They only seek distinctive information within these shared features, while ignoring the identity-aware useful information that is implicit in the modality-specific features. To address this issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) network to uncover and leverage the implicit discriminative information contained within the modality-specific features. First, we extract modality-specific and modality-shared features using a novel dual-stream network. Then, the modality-specific features undergo purification to reduce their modality-style discrepancies while preserving identity-aware discriminative knowledge. Subsequently, this implicit knowledge is distilled into the modality-shared feature to enhance its distinctiveness. Finally, an alignment loss is proposed to minimize the modality discrepancy of the enhanced modality-shared features. Extensive experiments on multiple public datasets demonstrate the superiority of the IDKL network over state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2403.11708
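The distil-then-align idea can be made concrete with a small numeric sketch: identity cues from a (purified) modality-specific feature are folded into the modality-shared feature, and the remaining visible/infrared gap is penalized by an alignment loss. The vectors, the additive blending, and the weight `alpha` are illustrative assumptions, not IDKL's actual formulation.

```python
# Numeric sketch of distilling modality-specific knowledge into a
# shared feature, followed by a cross-modal alignment loss.

def distil(shared, specific, alpha=0.2):
    """Enhance the shared feature with a fraction of the purified
    modality-specific feature (stand-in for knowledge distillation)."""
    return [s + alpha * p for s, p in zip(shared, specific)]

def alignment_loss(feat_v, feat_i):
    """Mean squared distance between the enhanced shared features of
    the two modalities; minimizing it shrinks the modality gap."""
    return sum((v - i) ** 2 for v, i in zip(feat_v, feat_i)) / len(feat_v)

shared_v, specific_v = [1.0, 0.0], [0.5, 0.5]
enhanced_v = distil(shared_v, specific_v)   # [1.1, 0.1]
enhanced_i = [1.0, 0.2]                     # enhanced infrared feature
print(alignment_loss(enhanced_v, enhanced_i))
```

In the actual network both operations act on learned feature maps and the distillation is a trained objective rather than a fixed blend, but the roles of the two terms are the same.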
Person Re-identification (ReID) has been extensively studied for a decade in order to learn the association of images of the same person across non-overlapping camera views. To overcome significant variations between images across camera views, numerous ReID variants have been developed to address challenges such as resolution change, clothing change, occlusion, modality change, and so on. Despite the impressive performance of many of these variants, each typically functions in its own distinct way and cannot be applied to other challenges. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work contributes the first attempt at learning such a versatile ReID model. Our main idea is a two-stage prompt-based twin modeling framework called VersReID. VersReID first leverages the scene label to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts are used to encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank to adaptively solve the ReID of different scenes, eliminating the need for scene labels at inference. To facilitate training VersReID, we further introduce multi-scene properties into self-supervised learning of ReID via a multi-scene prioris data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model that handles ReID tasks under multi-scene conditions without manual assignment of scene labels at inference, including general, low-resolution, clothing change, occlusion, and cross-modality scenes. Codes and models are available at this https URL.
https://arxiv.org/abs/2403.11121
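The two-stage prompting scheme can be sketched as follows: stage one selects a scene-specific prompt by its scene label, while stage two swaps in a single "versatile" prompt so no label is consumed at inference. The prompt vectors and the simple averaging used to build the versatile prompt are illustrative assumptions; VersReID obtains its versatile prompts by distillation from the ReID Bank, not by averaging.

```python
# Schematic of scene-specific vs. versatile prompting.

SCENE_PROMPTS = {
    "general":   [1.0, 0.0],
    "occlusion": [0.0, 1.0],
    "low_res":   [0.5, 0.5],
}

def reid_bank_forward(tokens, scene):
    """Stage 1: prepend the scene-specific prompt (needs the label)."""
    return SCENE_PROMPTS[scene] + tokens

def versatile_prompt():
    """Stage 2 stand-in: fuse scene prompts into one label-free prompt."""
    dims = len(next(iter(SCENE_PROMPTS.values())))
    n = len(SCENE_PROMPTS)
    return [sum(p[d] for p in SCENE_PROMPTS.values()) / n
            for d in range(dims)]

def v_branch_forward(tokens):
    """Stage 2: same input tokens, but no scene label is required."""
    return versatile_prompt() + tokens

print(reid_bank_forward([0.3], "general"))  # [1.0, 0.0, 0.3]
print(v_branch_forward([0.3]))              # [0.5, 0.5, 0.3]
```

The point of the design is visible even in this toy: the V-Branch interface drops the `scene` argument entirely, which is what removes the need for scene labels at inference.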
A key challenge in visible-infrared person re-identification (V-I ReID) is training a backbone model capable of effectively addressing the significant discrepancies across modalities. State-of-the-art methods that generate a single intermediate bridging domain are often less effective, as this generated domain may not adequately capture sufficient common discriminant information. This paper introduces Bidirectional Multi-step Domain Generalization (BMDG), a novel approach for unifying feature representations across diverse modalities. BMDG creates multiple virtual intermediate domains by finding and aligning body part features extracted from both the I and V modalities. BMDG reduces the modality gap in two steps. First, it aligns modalities in feature space by learning shared, modality-invariant body part prototypes from V and I images. Then, it generalizes the feature representation by applying bidirectional multi-step learning, which progressively refines feature representations at each step and incorporates more prototypes from both modalities. In particular, our method minimizes the cross-modal gap by identifying and aligning shared prototypes that capture key discriminative features across modalities, then uses multiple bridging steps based on this information to enhance the feature representation. Experiments conducted on challenging V-I ReID datasets indicate that our BMDG approach outperforms state-of-the-art part-based models and intermediate-domain generation methods for V-I person ReID.
https://arxiv.org/abs/2403.10782
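The multi-step bridging idea can be sketched numerically: virtual intermediate domains are built by progressively mixing aligned body-part prototypes from the visible (V) and infrared (I) modalities. Linear interpolation with a fixed ratio per step is a simplifying assumption here, standing in for BMDG's learned, bidirectional generation.

```python
# Sketch of building a chain of virtual intermediate domains between
# visible and infrared body-part prototypes.

def mix_prototypes(proto_v, proto_i, t):
    """One virtual domain: blend each part prototype with ratio t,
    where t=0 is pure visible and t=1 is pure infrared."""
    return [[(1 - t) * v + t * i for v, i in zip(pv, pi)]
            for pv, pi in zip(proto_v, proto_i)]

def bridging_steps(proto_v, proto_i, n_steps):
    """Progressive V -> I path through n_steps intermediate domains."""
    return [mix_prototypes(proto_v, proto_i, k / (n_steps + 1))
            for k in range(1, n_steps + 1)]

# One body part per modality, 2-D prototypes, three bridging steps.
domains = bridging_steps([[0.0, 0.0]], [[1.0, 1.0]], 3)
print(domains)  # [[[0.25, 0.25]], [[0.5, 0.5]], [[0.75, 0.75]]]
```

Training on such a chain exposes the backbone to a gradual V-to-I transition instead of a single bridging domain, which is the failure mode the abstract attributes to prior methods.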