Person Re-Identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing methods overlook the potential relationships among local person features and fail to adequately address the impact of pedestrian pose variations and local body-part occlusion. We therefore propose a Transformer-enhanced Graph Convolutional Network (Tran-GCN) model to improve person Re-ID performance in surveillance videos. The model comprises four key components: (1) a Pose Estimation Learning branch estimates pedestrian pose information and inherent skeletal structure data, extracting pedestrian keypoint information; (2) a Transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) a Convolution learning branch uses the basic ResNet architecture to extract the person's fine-grained local features; (4) a Graph Convolutional Module (GCM) fuses the local feature information, global feature information, and body information for more effective person identification. Quantitative and qualitative experiments conducted on three datasets (Market-1501, DukeMTMC-ReID, and MSMT17) demonstrate that the Tran-GCN model captures discriminative person features in surveillance videos more accurately, significantly improving identification accuracy.
https://arxiv.org/abs/2409.09391
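As a rough illustration of the fusion step described above, the sketch below (not the authors' code) places per-keypoint local features from the convolution branch, the global feature from the Transformer branch, and 2-D keypoint coordinates from the pose branch on the nodes of a skeleton graph and applies one graph convolution; the COCO-style edge list, feature dimensions, and mean pooling are assumptions.

```python
import torch
import torch.nn as nn

# Assumed COCO-style 17-keypoint skeleton; the paper's graph may differ.
COCO_EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (7, 9), (6, 8),
              (8, 10), (5, 11), (6, 12), (11, 12), (11, 13), (13, 15), (12, 14), (14, 16)]

def normalized_adjacency(num_nodes, edges):
    a = torch.eye(num_nodes)                      # self-loops
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt            # D^-1/2 (A + I) D^-1/2

class GraphConvFusion(nn.Module):
    def __init__(self, local_dim=256, global_dim=768, hidden_dim=512, num_kpts=17):
        super().__init__()
        self.register_buffer("adj", normalized_adjacency(num_kpts, COCO_EDGES))
        in_dim = local_dim + global_dim + 2       # local feat + global feat + (x, y)
        self.gc = nn.Linear(in_dim, hidden_dim)   # shared node-wise weight W
        self.relu = nn.ReLU(inplace=True)

    def forward(self, local_feats, global_feat, keypoints):
        # local_feats: (B, K, local_dim), global_feat: (B, global_dim), keypoints: (B, K, 2)
        b, k, _ = local_feats.shape
        g = global_feat.unsqueeze(1).expand(b, k, -1)
        x = torch.cat([local_feats, g, keypoints], dim=-1)    # node features
        x = self.relu(self.adj @ self.gc(x))                  # X' = ReLU(A_hat X W)
        return x.mean(dim=1)                                  # pooled person embedding

# emb = GraphConvFusion()(torch.randn(4, 17, 256), torch.randn(4, 768), torch.rand(4, 17, 2))
```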
The primary challenges in visible-infrared person re-identification arise from the differences between visible (VIS) and infrared (IR) images, including inter-modal and intra-modal variations. These challenges are further complicated by varying viewpoints and irregular movements. Existing methods often rely on horizontal partitioning to align part-level features, which can introduce inaccuracies and has limited effectiveness in reducing modality discrepancies. In this paper, we propose a novel Prototype-Driven Multi-feature generation framework (PDM) that mitigates cross-modal discrepancies by constructing diversified features and mining latent semantically similar features for modal alignment. PDM comprises two key components: a Multi-Feature Generation Module (MFGM) and a Prototype Learning Module (PLM). The MFGM generates diverse features, closely distributed around modality-shared features, to represent pedestrians. The PLM utilizes learnable prototypes to excavate latent semantic similarities among local features between the visible and infrared modalities, thereby facilitating cross-modal instance-level alignment. We further introduce a cosine heterogeneity loss to enhance prototype diversity for extracting rich local features. Extensive experiments conducted on the SYSU-MM01 and LLCM datasets demonstrate that our approach achieves state-of-the-art performance. Our codes are available at this https URL.
https://arxiv.org/abs/2409.05642
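One plausible reading of the cosine heterogeneity loss mentioned above is a penalty on the pairwise cosine similarity of the learnable prototypes; the sketch below follows that reading, and the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cosine_heterogeneity_loss(prototypes: torch.Tensor) -> torch.Tensor:
    # prototypes: (P, D) learnable prototype vectors
    p = F.normalize(prototypes, dim=1)
    sim = p @ p.t()                                           # (P, P) cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    n = sim.size(0)
    # average similarity between distinct prototypes; lower means more diverse prototypes
    return off_diag.sum() / (n * (n - 1))

# protos = torch.nn.Parameter(torch.randn(8, 256))
# loss = cosine_heterogeneity_loss(protos)   # added to the total training objective
```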
We address the problem of person re-identification (reID), that is, retrieving person images from a large dataset, given a query image of the person of interest. A key challenge is to learn person representations robust to intra-class variations, as different persons could have the same attribute, and persons' appearances look different, e.g., with viewpoint changes. Recent reID methods focus on learning person features discriminative only for a particular factor of variation (e.g., human pose), which also requires corresponding supervisory signals (e.g., pose annotations). To tackle this problem, we propose to factorize person images into identity-related and unrelated features. Identity-related features contain information useful for specifying a particular person (e.g., clothing), while identity-unrelated ones hold other factors (e.g., human pose). To this end, we propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN). It disentangles identity-related and unrelated features from person images through an identity-shuffling technique that exploits identification labels alone without any auxiliary supervisory signals. We restrict the distribution of identity-unrelated features or encourage the identity-related and unrelated features to be uncorrelated, facilitating the disentanglement process. Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks, including Market-1501, CUHK03, and DukeMTMC-reID. We further demonstrate the advantages of disentangling person representations on a long-term reID task, setting a new state of the art on the Celeb-reID dataset.
https://arxiv.org/abs/2409.05277
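The identity-shuffling idea can be sketched as swapping the identity-related codes of two images of the same person before decoding, so that only ID labels are needed as supervision; the toy encoders and decoder below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IdentityShuffleSketch(nn.Module):
    """Toy encoder/decoder pair used only to illustrate the shuffling step."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.enc_related = nn.Sequential(nn.Conv2d(3, code_dim, 4, 2, 1), nn.ReLU())
        self.enc_unrelated = nn.Sequential(nn.Conv2d(3, code_dim, 4, 2, 1), nn.ReLU())
        self.decoder = nn.Sequential(nn.ConvTranspose2d(2 * code_dim, 3, 4, 2, 1), nn.Tanh())

    def forward(self, img_a, img_b):
        # img_a, img_b: two images of the *same* identity, paired using only ID labels
        rel_a, unrel_a = self.enc_related(img_a), self.enc_unrelated(img_a)
        rel_b, unrel_b = self.enc_related(img_b), self.enc_unrelated(img_b)
        # identity shuffling: keep A's unrelated code (pose, background) but B's identity code, and vice versa
        fake_a = self.decoder(torch.cat([rel_b, unrel_a], dim=1))
        fake_b = self.decoder(torch.cat([rel_a, unrel_b], dim=1))
        return fake_a, fake_b  # both should still depict the same person as the inputs

# fa, fb = IdentityShuffleSketch()(torch.randn(2, 3, 256, 128), torch.randn(2, 3, 256, 128))
```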
Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To rigorously examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization to promote the generative attacker producing highly transferable adversarial examples by learning comprehensively simulated transfer-based cross-model&dataset&test black-box meta attack tasks. Specifically, cross-model&dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for the meta-train and meta-test attack processes. As different models may focus on different feature regions, the Perturbation Random Erasing module is further devised to prevent the attacker from learning to corrupt only model-specific features. To endow the attacker with cross-test transferability, the Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing multi-domain statistics of target models. Extensive experiments show the superiority of MTGA: in cross-model&dataset and cross-model&dataset&test attacks, MTGA outperforms the SOTA methods by 21.5% and 11.3% in mean mAP drop rate, respectively. The code of MTGA will be released after the paper is accepted.
https://arxiv.org/abs/2409.04208
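A hedged sketch of what a Perturbation Random Erasing step could look like: a random rectangle of the generated perturbation is zeroed before it is added to the image, so the attack cannot rely on corrupting a single model-specific region. The rectangle sampling and the epsilon budget in the usage line are assumptions.

```python
import random
import torch

def perturbation_random_erasing(delta: torch.Tensor, area_frac: float = 0.25) -> torch.Tensor:
    # delta: (B, C, H, W) adversarial perturbation produced by the generator
    _, _, h, w = delta.shape
    eh, ew = int(h * area_frac ** 0.5), int(w * area_frac ** 0.5)
    top, left = random.randint(0, h - eh), random.randint(0, w - ew)
    erased = delta.clone()
    erased[:, :, top:top + eh, left:left + ew] = 0.0   # drop this region of the perturbation
    return erased

# adv = (clean_image + perturbation_random_erasing(delta).clamp(-8 / 255, 8 / 255)).clamp(0, 1)
```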
In recent years, the development of deep learning approaches for person re-identification has led to impressive results. However, this comes with limitations for industrial and practical real-world applications. Firstly, most existing works operate on closed-world scenarios, in which the people to re-identify (probes) are compared to a closed set (gallery). Real-world scenarios are often open-set problems in which the gallery is not known a priori, yet the number of open-set approaches in the literature is significantly lower. Secondly, challenges such as multi-camera setups, occlusions, real-time requirements, etc., further constrain the applicability of off-the-shelf methods. This work presents MICRO-TRACK, a Modular Industrial multi-Camera Re-identification and Open-set Tracking system that is real-time, scalable, and easy to integrate into existing industrial surveillance scenarios. Furthermore, we release a novel Re-ID and tracking dataset acquired in an industrial manufacturing facility, dubbed Facility-ReID, consisting of 18-minute videos captured by 8 surveillance cameras.
https://arxiv.org/abs/2409.03879
For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in the significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining: they often focus solely on mining single dimensions such as space or channels, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to its low computational complexity, separate MIIM modules can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.
https://arxiv.org/abs/2408.10624
Unsupervised person re-identification (Re-ID) aims to learn a feature network with cross-camera retrieval capability from unlabelled datasets. Although pseudo-label based methods have achieved great progress in Re-ID, their performance in complex scenarios still needs to be sharpened. To reduce the potential misguidance accumulated during the learning process, including feature bias, noisy pseudo-labels, and invalid hard samples, a confidence-guided clustering and contrastive learning (3C) framework is proposed in this paper for unsupervised person Re-ID. The 3C framework presents three confidence degrees. i) In the clustering stage, the confidence of the discrepancy between samples and clusters is proposed to implement a harmonic discrepancy clustering algorithm (HDC). ii) In the forward-propagation training stage, the confidence of the camera diversity of a cluster is evaluated via a novel camera information entropy (CIE); clusters with high CIE values then play leading roles in training the model. iii) In the back-propagation training stage, the confidence of the hard sample in each cluster is designed and further used in a confidence-integrated harmonic discrepancy (CHD) to select informative samples for updating the memory in contrastive learning. Extensive experiments on three popular Re-ID benchmarks demonstrate the superiority of the proposed framework. In particular, the 3C framework achieves state-of-the-art results: 86.7%/94.7%, 45.3%/73.1%, and 47.1%/90.6% in terms of mAP/Rank-1 accuracy on Market-1501, the complex datasets MSMT17 and VeRi-776, respectively. Code is available at this https URL.
https://arxiv.org/abs/2408.09464
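The camera information entropy can be illustrated as the entropy of a cluster's camera-ID distribution; the normalization by the log of the number of cameras below is an assumption made to keep the score in [0, 1].

```python
import math
from collections import Counter

def camera_information_entropy(camera_ids, num_cameras):
    # camera_ids: camera IDs of the samples assigned to one pseudo-label cluster
    counts = Counter(camera_ids)
    total = len(camera_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_cameras) if num_cameras > 1 else 0.0

# camera_information_entropy([0, 0, 1, 2, 2, 3], num_cameras=6)  # ~0.74, a camera-diverse cluster
```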
In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. The G2A-VReID dataset has the following characteristics: 1) drastic view changes; 2) a large number of annotated identities; 3) rich outdoor scenarios; 4) a huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through a vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt the image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across platforms, we also devise platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and on our proposed G2A-VReID dataset.
https://arxiv.org/abs/2408.07500
Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances through comparisons of body parts between samples. However, in practice, previous methods often lack sufficient awareness of the anatomical aspects of body parts, resulting in the failure to capture features of the same body parts across different samples. To address this issue, we introduce the Part Aware Transformer (PAFormer), a pose estimation based ReID model which can perform precise part-to-part comparison. In order to inject part awareness into pose tokens, we introduce learnable parameters called 'pose tokens' which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without the additional body part localization modules commonly used in previous ReID methodologies that leverage pose estimation models. Additionally, leveraging the enhanced awareness of body parts, PAFormer uses a learning-based visibility predictor to estimate the degree of occlusion for each body part. We also introduce a teacher forcing technique using ground truth visibility scores, which enables PAFormer to be trained only with visible parts. Extensive experiments show that our method outperforms existing approaches on well-known ReID benchmark datasets.
https://arxiv.org/abs/2408.05918
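A minimal sketch of visibility-aware part-to-part matching in this spirit: each body part contributes to the query-gallery distance in proportion to its predicted visibility in both images. The weighting scheme is an assumption and may differ from PAFormer's exact aggregation.

```python
import torch
import torch.nn.functional as F

def part_aware_distance(q_parts, g_parts, q_vis, g_vis, eps=1e-6):
    # q_parts, g_parts: (P, D) part features; q_vis, g_vis: (P,) visibility scores in [0, 1]
    q = F.normalize(q_parts, dim=1)
    g = F.normalize(g_parts, dim=1)
    part_dist = 1.0 - (q * g).sum(dim=1)          # per-part cosine distance, (P,)
    weights = q_vis * g_vis                       # down-weight parts occluded in either image
    return (weights * part_dist).sum() / (weights.sum() + eps)

# d = part_aware_distance(torch.randn(6, 256), torch.randn(6, 256),
#                         torch.tensor([1., 1., .2, .9, 0., 1.]), torch.ones(6))
```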
Online person re-identification services face privacy breaches from potential data leakage and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is therefore crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propose an iterative method (PixelFade) that optimizes pedestrian images into noise-like images to resist recovery attacks. We first give an in-depth study of protected images from previous privacy methods, which reveals that the chaos of protected images can disrupt the learning of recovery models. Accordingly, we propose a Noise-guided Objective Function with the feature constraints of a specific authorization model, optimizing pedestrian images toward normally distributed noise images while preserving their original identity information as per the authorization model. To solve this non-convex optimization problem, we propose a heuristic optimization algorithm that alternately performs a Constraint Operation and a Partial Replacement Operation. This strategy not only ensures that original pixels are replaced with noise to protect privacy, but also guides the images toward an improved optimization direction to effectively preserve discriminative features. Extensive experiments demonstrate that PixelFade outperforms previous methods in resisting recovery attacks and in Re-ID performance. The code is available at this https URL.
https://arxiv.org/abs/2408.05543
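The alternating scheme can be sketched as follows, assuming a frozen authorization model whose features must be preserved: the Partial Replacement Operation swaps a random subset of pixels for Gaussian noise, and the Constraint Operation takes a few gradient steps pulling the protected image's features back toward those of the original. Step sizes, iteration counts, and the replacement ratio are assumptions, not values from the paper.

```python
import torch

@torch.no_grad()
def partial_replacement(x, ratio=0.05):
    mask = (torch.rand_like(x) < ratio).float()
    return x * (1 - mask) + torch.randn_like(x) * mask   # move pixels toward N(0, 1) noise

def pixelfade_sketch(image, auth_model, outer_iters=20, inner_iters=5, lr=0.01):
    auth_model.eval()
    with torch.no_grad():
        target_feat = auth_model(image.unsqueeze(0))      # identity features to preserve
    x = image.clone()
    for _ in range(outer_iters):
        x = partial_replacement(x)                        # Partial Replacement Operation
        x.requires_grad_(True)
        for _ in range(inner_iters):                      # Constraint Operation
            feat = auth_model(x.unsqueeze(0))
            loss = (feat - target_feat).pow(2).mean()
            (grad,) = torch.autograd.grad(loss, x)
            x = (x - lr * grad.sign()).detach().requires_grad_(True)
        x = x.detach()
    return x   # noise-like image whose auth_model features still match the original
```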
Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformers (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has greatly improved. Person ReID requires extracting highly discriminative, fine-grained local features of the human body, whereas the traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID: it extracts high-quality global and local features through large-scale unsupervised pre-training that combines masked image modeling with discriminative contrastive learning, and then conducts supervised fine-tuning on the person ReID task. This person feature extraction method based on ViT with masked image modeling (PersonViT) is unsupervised, scalable, and generalizes strongly, overcoming the difficulty of annotation in supervised person ReID, and achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of the PersonViT method are released at this https URL to promote further research in the person ReID field.
https://arxiv.org/abs/2408.05398
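A minimal sketch of the random patch masking used in masked image modeling, which the pre-training described above builds on: a fixed fraction of ViT patch tokens is replaced by a learnable mask token, and the returned mask marks where the pre-training loss is computed. The mask ratio and token shapes are assumptions.

```python
import torch

def random_patch_mask(tokens: torch.Tensor, mask_token: torch.Tensor, mask_ratio: float = 0.6):
    # tokens: (B, N, D) patch embeddings, mask_token: (D,) learnable vector
    b, n, _ = tokens.shape
    num_mask = int(n * mask_ratio)
    scores = torch.rand(b, n, device=tokens.device)
    mask_idx = scores.argsort(dim=1)[:, :num_mask]             # per-sample random patches
    mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, mask_idx, True)
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand(b, n, -1), tokens)
    return masked, mask    # mask marks which patches the MIM loss is computed on

# masked_tokens, mask = random_patch_mask(torch.randn(2, 196, 768), torch.zeros(768))
```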
Text-based person re-identification (Re-ID) is a challenging topic in the field of complex multimodal analysis; its ultimate aim is to recognize specific pedestrians by scrutinizing attributes/natural language descriptions. Despite the wide range of applicable areas such as security surveillance, video retrieval, person tracking, and social media analytics, there is a notable absence of comprehensive reviews dedicated to summarizing text-based person Re-ID from a technical perspective. To address this gap, we introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task. We start by laying the groundwork for text-based person Re-ID, elucidating fundamental concepts related to attribute/natural-language-based identification. A thorough examination of existing benchmark datasets and metrics is then presented. Subsequently, we delve into prevalent feature extraction strategies employed in text-based person Re-ID research, followed by a concise summary of common network architectures within the domain. Prevalent loss functions utilized for model optimization and modality alignment in text-based person Re-ID are also scrutinized. To conclude, we offer a concise summary of our findings, pinpointing challenges in text-based person Re-ID. In response to these challenges, we outline potential avenues for future open-set text-based person Re-ID and present a baseline architecture for text-based pedestrian image generation-guided re-identification (TBPGR).
https://arxiv.org/abs/2408.00096
Occluded Person Re-Identification (ReID) is a metric learning task that involves matching occluded individuals based on their appearance. While many studies have tackled occlusions caused by objects, multi-person occlusions remain less explored. In this work, we identify and address a critical challenge overlooked by previous occluded ReID methods: the Multi-Person Ambiguity (MPA) arising when multiple individuals are visible in the same bounding box, making it impossible to determine the intended ReID target among the candidates. Inspired by recent work on prompting in vision, we introduce Keypoint Promptable ReID (KPR), a novel formulation of the ReID problem that explicitly complements the input bounding box with a set of semantic keypoints indicating the intended target. Since promptable re-identification is an unexplored paradigm, existing ReID datasets lack the pixel-level annotations necessary for prompting. To bridge this gap and foster further research on this topic, we introduce Occluded-PoseTrack ReID, a novel ReID dataset with keypoint labels that features strong inter-person occlusions. Furthermore, we release custom keypoint labels for four popular ReID benchmarks. Experiments on person retrieval, as well as on pose tracking, demonstrate that our method systematically surpasses previous state-of-the-art approaches in various occluded scenarios. Our code, dataset and annotations are available at this https URL.
https://arxiv.org/abs/2407.18112
Biometric recognition has primarily addressed closed-set identification, assuming all probe subjects are in the gallery. However, most practical applications involve open-set biometrics, where probe subjects may or may not be present in the gallery. This poses distinct challenges in effectively distinguishing individuals in the gallery while minimizing false detections. While it is commonly believed that powerful biometric models can excel in both closed- and open-set scenarios, existing loss functions are inconsistent with open-set evaluation. They treat genuine (mated) and imposter (non-mated) similarity scores symmetrically and neglect the relative magnitudes of imposter scores. To address these issues, we simulate open-set evaluation using minibatches during training and introduce novel loss functions: (1) the identification-detection loss optimized for open-set performance under selective thresholds and (2) relative threshold minimization to reduce the maximum negative score for each probe. Across diverse biometric tasks, including face recognition, gait recognition, and person re-identification, our experiments demonstrate the effectiveness of the proposed loss functions, significantly enhancing open-set performance while positively impacting closed-set performance. Our code and models are available at this https URL.
https://arxiv.org/abs/2407.16133
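The relative-threshold idea can be sketched as a per-probe penalty on the largest imposter similarity within a minibatch that simulates open-set evaluation; the softplus surrogate and the use of cosine similarity below are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def relative_threshold_loss(probe_feats, gallery_feats, probe_ids, gallery_ids):
    # probe_feats: (P, D), gallery_feats: (G, D); *_ids: integer identity labels
    sim = F.normalize(probe_feats, dim=1) @ F.normalize(gallery_feats, dim=1).t()   # (P, G)
    imposter = probe_ids.unsqueeze(1) != gallery_ids.unsqueeze(0)                    # non-mated pairs
    neg_sim = sim.masked_fill(~imposter, float("-inf"))
    max_neg, _ = neg_sim.max(dim=1)                    # hardest imposter score per probe
    return F.softplus(max_neg).mean()                  # drive the maximum negative score down

# loss = relative_threshold_loss(torch.randn(8, 128), torch.randn(32, 128),
#                                torch.randint(0, 10, (8,)), torch.randint(0, 10, (32,)))
```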
In the contemporary era of deep learning, where models often grapple with the challenge of simultaneously achieving robustness against adversarial attacks and strong generalization capabilities, this study introduces an innovative Local Feature Masking (LFM) strategy aimed at fortifying the performance of Convolutional Neural Networks (CNNs) on both fronts. During the training phase, we strategically incorporate random feature masking in the shallow layers of CNNs, effectively alleviating overfitting issues, thereby enhancing the model's generalization ability and bolstering its resilience to adversarial attacks. LFM compels the network to adapt by leveraging the remaining features to compensate for the absence of certain semantic features, nurturing a more elastic feature learning mechanism. The efficacy of LFM is substantiated through a series of quantitative and qualitative assessments, collectively showcasing a consistent and significant improvement in CNNs' generalization ability and resistance to adversarial attacks, a phenomenon not observed in current and prior methodologies. The seamless integration of LFM into established CNN frameworks underscores its potential to advance both generalization and adversarial robustness within the deep learning paradigm. Through comprehensive experiments, including robust person re-identification baseline generalization experiments and adversarial attack experiments, we demonstrate the substantial enhancements offered by LFM in addressing the aforementioned challenges. This contribution represents a noteworthy stride in advancing robust neural network architectures.
https://arxiv.org/abs/2407.13646
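A minimal sketch of a local feature masking layer, assuming the masking takes the form of a random spatial block applied to a shallow feature map during training only; the block size and application probability are placeholders.

```python
import random
import torch
import torch.nn as nn

class LocalFeatureMasking(nn.Module):
    def __init__(self, block_frac: float = 0.3, p: float = 0.5):
        super().__init__()
        self.block_frac, self.p = block_frac, p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) shallow feature map
        if not self.training or random.random() > self.p:
            return x
        _, _, h, w = x.shape
        bh, bw = max(1, int(h * self.block_frac)), max(1, int(w * self.block_frac))
        top, left = random.randint(0, h - bh), random.randint(0, w - bw)
        mask = torch.ones_like(x)
        mask[:, :, top:top + bh, left:left + bw] = 0.0   # zero one local block of the features
        return x * mask

# backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), LocalFeatureMasking(), nn.ReLU())
```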
Person Re-identification (re-ID) in computer vision aims to recognize and track individuals across different cameras. While previous research has mainly focused on challenges like pose variations and lighting changes, the impact of extreme capture conditions is often not adequately addressed. These extreme conditions, including varied lighting, camera styles, angles, and image distortions, can significantly affect data distribution and re-ID accuracy. Current research typically improves model generalization under normal shooting conditions through data augmentation techniques such as adjusting brightness and contrast. However, these methods pay less attention to the robustness of models under extreme shooting conditions. To tackle this, we propose a multi-mode synchronization learning (MMSL) strategy. This approach involves dividing images into grids, randomly selecting grid blocks, and applying data augmentation methods like contrast and brightness adjustments. This process introduces diverse transformations without altering the original image structure, helping the model adapt to extreme variations. The method improves the model's generalization under extreme conditions and enables learning diverse features, thus better addressing the challenges in re-ID. Extensive experiments on a simulated test set under extreme conditions have demonstrated the effectiveness of our method. This approach is crucial for enhancing model robustness and adaptability in real-world scenarios, supporting the future development of person re-identification technology.
https://arxiv.org/abs/2407.13640
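The grid-wise augmentation can be sketched as follows: split the image into a grid, pick a random subset of cells, and apply an independent brightness/contrast jitter to each selected cell, so the structure of the image is unchanged. Grid size, selection probability, and jitter ranges are assumptions.

```python
import random
import torch

def grid_photometric_jitter(img: torch.Tensor, grid=4, select_p=0.5,
                            brightness=0.4, contrast=0.4) -> torch.Tensor:
    # img: (C, H, W) tensor with values in [0, 1]
    _, h, w = img.shape
    out = img.clone()
    ys = [round(h * i / grid) for i in range(grid + 1)]
    xs = [round(w * j / grid) for j in range(grid + 1)]
    for i in range(grid):
        for j in range(grid):
            if random.random() > select_p:
                continue
            cell = out[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]     # view into `out`
            m = cell.mean()
            k = 1.0 + random.uniform(-contrast, contrast)
            b = 1.0 + random.uniform(-brightness, brightness)
            cell.sub_(m).mul_(k).add_(m)                        # contrast jitter around the cell mean
            cell.mul_(b)                                        # brightness jitter
    return out.clamp_(0, 1)

# aug = grid_photometric_jitter(torch.rand(3, 256, 128))
```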
Unsupervised visible-infrared person re-identification (USVI-ReID) is a challenging retrieval task that aims to retrieve cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations also makes it harder to learn modality-invariant features. In this paper, we first deduce an optimization objective for unsupervised VI-ReID based on the mutual information between the model's cross-modality input and output. Through equivalent derivation, three learning principles are obtained: "Sharpness" (entropy minimization), "Fairness" (uniform label distribution), and "Fitness" (reliable cross-modality matching). Under their guidance, we design a loop iterative training strategy alternating between model training and cross-modality matching. In the matching stage, a uniform-prior-guided optimal transport assignment ("Fitness", "Fairness") is proposed to select matched visible and infrared prototypes. In the training stage, we utilize this matching information to introduce prototype-based contrastive learning for minimizing the intra- and cross-modality entropy ("Sharpness"). Extensive experimental results on benchmarks demonstrate the effectiveness of our method, e.g., Rank-1 accuracy of 60.6% and 90.3% on SYSU-MM01 and RegDB, respectively, without any annotations.
https://arxiv.org/abs/2407.12758
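A sketch of uniform-prior optimal-transport matching between visible and infrared cluster prototypes: cosine-distance costs are relaxed with entropic regularization (Sinkhorn) under uniform marginals, and each visible prototype is matched to the infrared prototype receiving the most mass. The regularization strength and iteration count are assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn_match(vis_protos, ir_protos, eps=0.05, iters=100):
    # vis_protos: (Nv, D), ir_protos: (Ni, D) cluster prototypes of the two modalities
    cost = 1.0 - F.normalize(vis_protos, dim=1) @ F.normalize(ir_protos, dim=1).t()
    nv, ni = cost.shape
    k = torch.exp(-cost / eps)                                       # Gibbs kernel
    r, c = torch.full((nv,), 1.0 / nv), torch.full((ni,), 1.0 / ni)  # uniform marginals
    u, v = torch.full((nv,), 1.0 / nv), torch.full((ni,), 1.0 / ni)
    for _ in range(iters):                                           # Sinkhorn scaling iterations
        u = r / (k @ v)
        v = c / (k.t() @ u)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)                       # transport plan
    return plan, plan.argmax(dim=1)                                  # infrared match for each visible prototype

# plan, match = sinkhorn_match(torch.randn(12, 256), torch.randn(12, 256))
```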
Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2407.10694
Person re-identification (ReID), as a crucial technology in the field of security, plays an important role in security detection and people counting. Current security and monitoring systems largely rely on visual information, which may infringe on personal privacy and be susceptible to interference from pedestrian appearances and clothing in certain scenarios. Meanwhile, the widespread use of routers offers new possibilities for ReID. This letter introduces a method using WiFi Channel State Information (CSI), leveraging the multipath propagation characteristics of WiFi signals as a basis for distinguishing different pedestrian features. We propose a two-stream network structure capable of processing variable-length data, which analyzes the amplitude in the time domain and the phase in the frequency domain of WiFi signals, fuses time-frequency information through continuous lateral connections, and employs advanced objective functions for representation and metric learning. Tested on a dataset collected in the real world, our method achieves 93.68% mAP and 98.13% Rank-1.
https://arxiv.org/abs/2407.09045
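A rough sketch of a two-stream CSI network in this spirit, assuming an amplitude stream in the time domain, a phase stream passed through an FFT, a single lateral connection between them, and mean pooling to cope with variable sequence lengths; all channel counts and layer choices are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamCSISketch(nn.Module):
    def __init__(self, subcarriers=30, embed_dim=256):
        super().__init__()
        self.amp = nn.Sequential(nn.Conv1d(subcarriers, 64, 7, padding=3), nn.ReLU(),
                                 nn.Conv1d(64, 128, 5, padding=2), nn.ReLU())
        self.phase = nn.Sequential(nn.Conv1d(subcarriers, 64, 7, padding=3), nn.ReLU(),
                                   nn.Conv1d(64, 128, 5, padding=2), nn.ReLU())
        self.lateral = nn.Conv1d(128, 128, 1)                   # lateral connection: phase stream -> amplitude stream
        self.head = nn.Linear(256, embed_dim)

    def forward(self, amplitude, phase):
        # amplitude, phase: (B, subcarriers, T) CSI sequences of arbitrary length T
        a = self.amp(amplitude)                                  # time-domain amplitude stream
        p = self.phase(torch.fft.rfft(phase, dim=-1).abs())     # frequency-domain phase stream
        p = F.interpolate(p, size=a.shape[-1])                   # align lengths for fusion
        a = a + self.lateral(p)                                  # fuse time-frequency information
        feat = torch.cat([a.mean(dim=-1), p.mean(dim=-1)], dim=1)
        return self.head(feat)                                   # (B, embed_dim) person embedding

# emb = TwoStreamCSISketch()(torch.randn(2, 30, 500), torch.randn(2, 30, 500))
```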
The Balanced-Pairwise-Affinities (BPA) feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping related tasks. The transformed set encodes a rich representation of high order relations between the input features. A particular min-cost-max-flow fractional matching problem, whose entropy regularized version can be approximated by an optimal transport (OT) optimization, leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable. While the Sinkhorn OT solver has been adapted extensively in many contexts, we use it differently by minimizing the cost between a set of features to itself and using the transport plan's rows as the new representation. Empirically, the transform is highly effective and flexible in its use and consistently improves networks it is inserted into, in a variety of tasks and training schemes. We demonstrate state-of-the-art results in few-shot classification, unsupervised image clustering and person re-identification. Code is available at this http URL.
https://arxiv.org/abs/2407.01467
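The self-transport idea can be sketched directly: run Sinkhorn on the cost between a feature set and itself with self-matches excluded, and use each item's row of the transport plan as its new, relation-aware representation. The cosine cost, the diagonal masking value, and the Sinkhorn hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def bpa_like_transform(feats: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    # feats: (N, D) features of one set (e.g., a few-shot episode or a gallery batch)
    n = feats.size(0)
    f = F.normalize(feats, dim=1)
    cost = 1.0 - f @ f.t()
    cost.fill_diagonal_(1e3)                          # forbid matching an item to itself
    k = torch.exp(-cost / eps)
    u = torch.ones(n) / n
    v = torch.ones(n) / n
    for _ in range(iters):                            # Sinkhorn with uniform marginals
        u = (1.0 / n) / (k @ v)
        v = (1.0 / n) / (k.t() @ u)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)        # (N, N) balanced pairwise affinities
    return plan * n                                   # each row ~ a distribution over the other items

# new_repr = bpa_like_transform(torch.randn(16, 128))
```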