Long-term Person Re-Identification (LRe-ID) aims to match an individual across cameras after a long time gap, under variations in clothing, pose, and viewpoint. In this work, we propose CCPA, a Contrastive Clothing and Pose Augmentation framework for LRe-ID. Beyond appearance, CCPA captures cloth-invariant body shape information using a Relation Graph Attention Network. Training a robust LRe-ID model requires a wide range of clothing variations and expensive clothing labels, both of which are lacking in current LRe-ID datasets. To address this, we transfer clothing and pose across identities to generate images with more clothing variations, including images of different people wearing similar clothing. The augmented batch of images serves as input to our proposed Fine-grained Contrastive Losses, which not only supervise the Re-ID model to learn discriminative person embeddings under long-term scenarios but also ensure in-distribution data generation. Results on LRe-ID datasets demonstrate the effectiveness of our CCPA framework.
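The abstract does not spell out the Fine-grained Contrastive Losses; as a rough, generic illustration of the idea, a supervised contrastive loss over an identity-labeled batch (originals plus clothing/pose-augmented copies) can be sketched as below. The function name and the numpy formulation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over an L2-normalized batch.

    embeddings: (N, D) array; labels: (N,) identity ids. Images of the
    same identity (original or augmented) are pulled together; all other
    images in the batch act as negatives.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                      # pairwise similarities
    n = len(labels)
    mask_self = np.eye(n, dtype=bool)
    # log-softmax over each row, excluding the anchor itself
    sim_masked = np.where(mask_self, -np.inf, sim)
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    # negative mean log-probability of positives, averaged over anchors
    loss = -np.array([log_prob[i, pos[i]].mean() for i in range(n) if pos[i].any()])
    return loss.mean()
```

With well-separated identities the loss approaches zero; mixed-up embeddings yield a much larger value.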
https://arxiv.org/abs/2402.14454
AI based Face Recognition Systems (FRSs) are now widely distributed and deployed as MLaaS solutions all over the world, even more so since the COVID-19 pandemic, for tasks ranging from validating individuals' faces while buying SIM cards to surveillance of citizens. Extensive biases have been reported against marginalized groups in these systems, leading to highly discriminatory outcomes. The post-pandemic world has normalized wearing face masks, but FRSs have not kept up with the changing times. As a result, these systems are susceptible to mask-based face occlusion. In this study, we audit four commercial and nine open-source FRSs for the task of face re-identification between different varieties of masked and unmasked images across five benchmark datasets (14,722 images in total). These simulate a realistic validation/surveillance task as deployed in all major countries around the world. Three of the commercial and five of the open-source FRSs are highly inaccurate; they further perpetuate biases against non-White individuals, with the lowest accuracy being 0%. A survey on the same task with 85 human participants also yields a low accuracy of 40%. Thus human-in-the-loop moderation in the pipeline does not alleviate the concerns, as has frequently been hypothesized in the literature. Our large-scale study shows that developers, lawmakers, and users of such services need to rethink the design principles behind FRSs, especially for the task of face re-identification, taking cognizance of the observed biases.
https://arxiv.org/abs/2402.13771
Preservation of private user data is of paramount importance for a high Quality of Experience (QoE) and acceptability, particularly for services treating sensitive data, such as IT-based health services. Whereas anonymization techniques were shown to be prone to data re-identification, synthetic data generation has gradually replaced anonymization, since it is less time- and resource-consuming and more robust to data leakage. Generative Adversarial Networks (GANs) have been used for generating synthetic datasets, especially GAN frameworks adhering to differential privacy principles. This research compares state-of-the-art GAN-based models for synthetic data generation, generating time-series synthetic medical records of dementia patients that can be distributed without privacy concerns. Predictive modeling, autocorrelation, and distribution analysis are used to assess the Quality of Generating (QoG) of the generated data. The privacy preservation of the respective models is assessed by applying membership inference attacks to determine potential data leakage risks. Our experiments indicate the superiority of the privacy-preserving GAN (PPGAN) model over other models regarding privacy preservation while maintaining an acceptable level of QoG. The presented results can support better data protection for medical use cases in the future.
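For context on the privacy evaluation, a minimal distance-to-closest-record membership inference attack (a common baseline for generative models, not necessarily the exact attack used in the paper) can be sketched as follows; all names are illustrative:

```python
import numpy as np

def membership_score(query, synthetic):
    """Distance from a query record to its nearest synthetic record.

    An unusually small distance suggests the generator may have
    memorized the query, i.e. it was likely a training-set member.
    """
    return np.linalg.norm(synthetic - query, axis=1).min()

def infer_membership(queries, synthetic, threshold):
    """Flag each query as a suspected member if its score is below threshold."""
    return np.array([membership_score(q, synthetic) < threshold for q in queries])
```

The threshold would in practice be calibrated on held-out non-member records.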
https://arxiv.org/abs/2402.14042
Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.
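The DPSM is described only at a high level here; a toy sketch of proxy-guided token selection, keeping the k patch tokens most similar to a proxy embedding, is shown below. The top-k cosine-similarity rule and the function names are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def select_informative_tokens(patch_tokens, proxy_token, k):
    """Keep the k patch tokens most similar to a proxy token.

    patch_tokens: (N, D) array; proxy_token: (D,) array. Tokens aligned
    with the (label-guided) proxy are treated as occlusion-free body
    regions; the rest are discarded before local feature extraction.
    """
    t = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    p = proxy_token / np.linalg.norm(proxy_token)
    scores = t @ p                       # cosine similarity to the proxy
    idx = np.argsort(scores)[::-1][:k]   # indices of the top-k tokens
    return patch_tokens[idx], np.sort(idx)
```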
https://arxiv.org/abs/2402.10435
Despite significant progress in optical character recognition (OCR) and computer vision systems, robustly recognizing text and identifying people in images taken in unconstrained \emph{in-the-wild} environments remain an ongoing challenge. However, such obstacles must be overcome in practical applications of vision systems, such as identifying racers in photos taken during off-road racing events. To this end, we introduce two new challenging real-world datasets - the off-road motorcycle Racer Number Dataset (RND) and the Muddy Racer re-iDentification Dataset (MUDD) - to highlight the shortcomings of current methods and drive advances in OCR and person re-identification (ReID) under extreme conditions. These two datasets feature over 6,300 images taken during off-road competitions which exhibit a variety of factors that undermine even modern vision systems, namely mud, complex poses, and motion blur. We establish benchmark performance on both datasets using state-of-the-art models. Off-the-shelf models transfer poorly, reaching only 15% end-to-end (E2E) F1 score on text spotting, and 33% rank-1 accuracy on ReID. Fine-tuning yields major improvements, bringing model performance to 53% F1 score for E2E text spotting and 79% rank-1 accuracy on ReID, but still falls short of good performance. Our analysis exposes open problems in real-world OCR and ReID that necessitate domain-targeted techniques. With these datasets and analysis of model limitations, we aim to foster innovations in handling real-world conditions like mud and complex poses to drive progress in robust computer vision. All data was sourced from this http URL, a website used by professional motorsports photographers, racers, and fans. The top-performing text spotting and ReID models are deployed on this platform to power real-time race photo search.
https://arxiv.org/abs/2402.08025
The acquisition of large-scale, precisely labeled datasets for person re-identification (ReID) poses a significant challenge. Weakly supervised ReID has begun to address this issue, although its performance lags behind fully supervised methods. In response, we introduce Contrastive Multiple Instance Learning (CMIL), a novel framework tailored for more effective weakly supervised ReID. CMIL distinguishes itself by requiring only a single model and no pseudo labels while leveraging contrastive losses -- a technique that has significantly enhanced traditional ReID performance yet is absent in all prior MIL-based approaches. Through extensive experiments and analysis across three datasets, CMIL not only matches state-of-the-art performance on the large-scale SYSU-30k dataset with fewer assumptions but also consistently outperforms all baselines on the WL-market1501 and Weakly Labeled MUddy racer re-iDentification dataset (WL-MUDD) datasets. We introduce and release the WL-MUDD dataset, an extension of the MUDD dataset featuring naturally occurring weak labels from the real-world application at this http URL. All our code and data are accessible at this https URL.
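A minimal illustration of the multiple-instance idea under bag-level (weak) labels, selecting a key instance per bag by similarity to a class prototype, might look like this. It is a simplification for intuition, not CMIL itself:

```python
import numpy as np

def key_instance(bag, prototype):
    """Pick the bag instance most similar to the class prototype.

    bag: (M, D) instance embeddings sharing one weak (bag-level) label;
    prototype: (D,) class embedding. The selected key instance stands in
    for an instance-level label when forming contrastive pairs.
    """
    b = bag / np.linalg.norm(bag, axis=1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    return bag[np.argmax(b @ p)]
```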
https://arxiv.org/abs/2402.07685
Current state-of-the-art Video-based Person Re-Identification (Re-ID) primarily relies on appearance features extracted by deep learning models. These methods are not applicable for long-term analysis in real-world scenarios where persons have changed clothes, making appearance information unreliable. In this work, we deal with the practical problem of Video-based Cloth-Changing Person Re-ID (VCCRe-ID) by proposing "Attention-based Shape and Gait Representations Learning" (ASGL) for VCCRe-ID. Our ASGL framework improves Re-ID performance under clothing variations by learning clothing-invariant gait cues using a Spatial-Temporal Graph Attention Network (ST-GAT). Given the 3D-skeleton-based spatial-temporal graph, our proposed ST-GAT comprises multi-head attention modules, which are able to enhance the robustness of gait embeddings under viewpoint changes and occlusions. The ST-GAT amplifies the important motion ranges and reduces the influence of noisy poses. Then, the multi-head learning module effectively preserves beneficial local temporal dynamics of movement. We also boost the discriminative power of person representations by learning body shape cues using a GAT. Experiments on two large-scale VCCRe-ID datasets demonstrate that our proposed framework outperforms state-of-the-art methods by 12.2% in rank-1 accuracy and 7.0% in mAP.
https://arxiv.org/abs/2402.03716
In this paper, we study the new problem of cross-domain video-based person re-identification (Re-ID). Specifically, we take a synthetic video dataset as the source domain for training and use real-world videos for testing, which significantly reduces the dependence on real training data collection and annotation. To unveil the power of synthetic data for video person Re-ID, we first propose a self-supervised domain-invariant feature learning strategy for both static and temporal features. Then, to further improve the person identification ability in the target domain, we develop a mean-teacher scheme with a self-supervised ID consistency loss. Experimental results on four real datasets verify the rationality of cross-synthetic-real domain adaptation and the effectiveness of our method. We are also surprised to find that the synthetic data performs even better than the real data in the cross-domain setting.
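The mean-teacher scheme follows a standard pattern: the teacher is an exponential moving average (EMA) of the student, and a consistency loss ties their predictions together. A generic sketch, with hyperparameters and names assumed rather than taken from the paper:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.999):
    """Exponential-moving-average update of the teacher from the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def id_consistency_loss(student_logits, teacher_logits):
    """Self-supervised ID consistency: student matches the teacher's softmax."""
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return np.mean((softmax(student_logits) - softmax(teacher_logits)) ** 2)
```

The teacher is never updated by gradients; only `ema_update` moves it, which stabilizes the pseudo-targets on the real (target-domain) videos.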
https://arxiv.org/abs/2402.02108
As a cutting-edge biosensor, the event camera holds significant potential in the field of computer vision, particularly regarding privacy preservation. However, compared to traditional cameras, event streams often contain noise and possess extremely sparse semantics, posing a formidable challenge for event-based person re-identification (event Re-ID). To address this, we introduce a novel event person re-identification network: the Spectrum-guided Feature Enhancement Network (SFE-Net). This network consists of two innovative components: the Multi-grain Spectrum Attention Mechanism (MSAM) and the Consecutive Patch Dropout Module (CPDM). MSAM employs a Fourier spectrum transform strategy to filter event noise, while also utilizing an event-guided multi-granularity attention strategy to enhance and capture discriminative person semantics. CPDM employs a consecutive patch dropout strategy to generate multiple incomplete feature maps, encouraging the deep Re-ID model to equally perceive each effective region of the person's body and capture robust person descriptors. Extensive experiments on event Re-ID datasets demonstrate that our SFE-Net achieves the best performance on this task.
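A toy version of the CPDM idea is zeroing a random run of consecutive patch tokens; the real module operates on feature maps inside the network, so this standalone sketch (names and shapes assumed) is illustrative only:

```python
import numpy as np

def consecutive_patch_dropout(feature_map, num_drop, rng=None):
    """Zero out a random run of consecutive patches along the token axis.

    feature_map: (num_patches, dim) array. Returns a copy with a random
    contiguous block of `num_drop` patch rows erased, forcing the model
    not to rely on any single body region.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = feature_map.shape[0]
    start = rng.integers(0, n - num_drop + 1)   # inclusive start of the run
    out = feature_map.copy()
    out[start:start + num_drop] = 0.0
    return out
```

Applying this several times with different seeds yields the "multiple incomplete feature maps" the abstract refers to.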
https://arxiv.org/abs/2402.01269
Generative latent diffusion models hold a wide range of applications in the medical imaging domain. A noteworthy application is privacy-preserved open-data sharing by proposing synthetic data as surrogates of real patient data. Despite the promise, these models are susceptible to patient data memorization, where models generate patient data copies instead of novel synthetic samples. This undermines the whole purpose of preserving patient data and may even result in patient re-identification. Considering the importance of the problem, surprisingly it has received relatively little attention in the medical imaging community. To this end, we assess memorization in latent diffusion models for medical image synthesis. We train 2D and 3D latent diffusion models on CT, MR, and X-ray datasets for synthetic data generation. Afterwards, we examine the amount of training data memorized utilizing self-supervised models and further investigate various factors that can possibly lead to memorization by training models in different settings. We observe a surprisingly large amount of data memorization among all datasets, with up to 41.7%, 19.6%, and 32.6% of the training data memorized in CT, MRI, and X-ray datasets respectively. Further analyses reveal that increasing training data size and using data augmentation reduce memorization, while over-training enhances it. Overall, our results suggest a call for memorization-informed evaluation of synthetic data prior to open-data sharing.
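The memorization measurement can be illustrated with a simple nearest-neighbor check in an embedding space: a training image whose closest synthetic image exceeds a similarity threshold counts as memorized. The threshold value and this exact formulation are assumptions; the paper's protocol (self-supervised encoders, per-modality settings) may differ:

```python
import numpy as np

def memorized_fraction(train_emb, synth_emb, threshold=0.95):
    """Fraction of training samples with a near-duplicate synthetic sample.

    train_emb: (N, D), synth_emb: (M, D) image embeddings. Pairs are
    compared by cosine similarity; a training sample whose best match
    among the synthetic samples exceeds `threshold` is counted.
    """
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)
    sim = a @ b.T                      # (N, M) cosine similarities
    return float((sim.max(axis=1) > threshold).mean())
```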
https://arxiv.org/abs/2402.01054
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to retrieve pedestrian images of the same identity from different modalities without annotations. While prior work focuses on establishing cross-modality pseudo-label associations to bridge the modality-gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency in pseudo-label space, resulting in coarse associations. In response, we introduce a Modality-Unified Label Transfer (MULT) module that simultaneously accounts for both homogeneous and heterogeneous fine-grained instance-level structures, yielding high-quality cross-modality label associations. It models both homogeneous and heterogeneous affinities, leveraging them to define the inconsistency for the pseudo-labels and then minimize it, leading to pseudo-labels that maintain alignment across modalities and consistency within intra-modality structures. Additionally, a straightforward plug-and-play Online Cross-memory Label Refinement (OCLR) module is proposed to further mitigate the impact of noisy pseudo-labels while simultaneously aligning different modalities, coupled with a Modality-Invariant Representation Learning (MIRL) framework. Experiments demonstrate that our proposed method outperforms existing USL-VI-ReID methods, highlighting the superiority of our MULT in comparison to other cross-modality association methods. The code will be available.
https://arxiv.org/abs/2402.00672
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
https://arxiv.org/abs/2401.18032
Person re-identification via 3D skeletons is an important emerging research area that triggers great interest in the pattern recognition community. With distinctive advantages for many application scenarios, a great diversity of 3D skeleton based person re-identification (SRID) methods have been proposed in recent years, effectively addressing prominent problems in skeleton modeling and feature learning. Despite recent advances, to the best of our knowledge, little effort has been made to comprehensively summarize these studies and their challenges. In this paper, we attempt to fill this gap by providing a systematic survey on current SRID approaches, model designs, challenges, and future directions. Specifically, we first formulate the SRID problem, and propose a taxonomy of SRID research with a summary of benchmark datasets, commonly-used model architectures, and an analytical review of different methods' characteristics. Then, we elaborate on the design principles of SRID models from multiple aspects to offer key insights for model improvement. Finally, we identify critical challenges confronting current studies and discuss several promising directions for future research of SRID.
https://arxiv.org/abs/2401.15296
We propose an efficient cross-camera surveillance system called STAC that leverages spatio-temporal associations between multiple cameras to provide real-time analytics and inference under constrained network environments. STAC is built using the proposed omni-scale feature learning person re-identification (reid) algorithm, which allows accurate detection, tracking, and re-identification of people across cameras using the spatio-temporal characteristics of video frames. We integrate STAC with frame filtering and a state-of-the-art streaming compression technique (the FFmpeg libx264 codec) to remove redundant information from cross-camera frames. This helps in optimizing the cost of video transmission as well as compute/processing, while maintaining high accuracy for real-time query inference. The introduction of the AICity Challenge 2023 Data [1] by NVIDIA has allowed exploration of systems utilizing multi-camera people tracking algorithms. We evaluate the performance of STAC using this dataset to measure the accuracy metrics and inference rate for reid. Additionally, we quantify the reduction in video streams achieved through frame filtering and compression using FFmpeg compared to the raw camera streams. For completeness, we make our repository available to reproduce the results at this https URL.
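The frame-filtering step can be illustrated with a simple change-detection filter that drops frames nearly identical to the last transmitted one, before handing the survivors to the codec. This is a sketch of the general technique, not STAC's actual filter; the threshold is an assumed parameter:

```python
import numpy as np

def filter_redundant_frames(frames, threshold=0.02):
    """Keep a frame only if it differs enough from the last kept frame.

    frames: iterable of equal-shaped pixel arrays. A mean absolute
    change below `threshold` marks a frame as redundant, reducing the
    volume sent over the constrained network.
    """
    kept, last = [], None
    for f in frames:
        if last is None or np.abs(f - last).mean() >= threshold:
            kept.append(f)
            last = f
    return kept
```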
https://arxiv.org/abs/2401.15288
Multimodal large language models (MLLMs) have achieved satisfactory results in many tasks. However, their performance in the task of person re-identification (ReID) has not been explored to date. This paper investigates how to adapt them for the task of ReID. An intuitive idea is to fine-tune an MLLM with ReID image-text datasets, and then use its visual encoder as a backbone for ReID. However, two apparent issues remain: (1) When designing instructions for ReID, MLLMs may overfit specific instructions, and designing a variety of instructions leads to higher costs. (2) Latent image feature vectors from LLMs are not involved in loss computation. Instruction learning, which aligns image-text features, results in indirect optimization and a learning objective that under-utilizes these features, limiting effectiveness in person feature learning. To address these problems, this paper proposes MLLMReID: Multimodal Large Language Model-based ReID. Firstly, we propose Common Instruction, a simple approach that leverages the inherent continuation ability of LLMs, avoiding complex and diverse instruction design. Secondly, we propose DirectReID, which effectively employs in ReID tasks the latent image feature vectors of images output by LLMs. The experimental results demonstrate the superiority of our method. We will open-source the code on GitHub.
https://arxiv.org/abs/2401.13201
Vehicle re-identification (ReID) endeavors to associate vehicle images collected from a distributed network of cameras spanning diverse traffic environments. This task assumes paramount importance within the spectrum of vehicle-centric technologies, playing a pivotal role in deploying Intelligent Transportation Systems (ITS) and advancing smart city initiatives. Rapid advancements in deep learning have significantly propelled the evolution of vehicle ReID technologies in recent years. Consequently, undertaking a comprehensive survey of methodologies centered on deep learning for vehicle re-identification has become imperative and inescapable. This paper extensively explores deep learning techniques applied to vehicle ReID. It outlines the categorization of these methods, encompassing supervised and unsupervised approaches, delves into existing research within these categories, introduces datasets and evaluation criteria, and delineates forthcoming challenges and potential research directions. This comprehensive assessment examines the landscape of deep learning in vehicle ReID and establishes a foundation and starting point for future works. It aims to serve as a complete reference by highlighting challenges and emerging trends, fostering advancements and applications in vehicle ReID utilizing deep learning models.
https://arxiv.org/abs/2401.10643
In the field of computer vision, the persistent presence of color bias, resulting from fluctuations in real-world lighting and camera conditions, presents a substantial challenge to the robustness of models. This issue is particularly pronounced in complex wide-area surveillance scenarios, such as person re-identification and industrial dust segmentation, where models often experience a decline in performance due to overfitting on color information during training, given the presence of environmental variations. Consequently, there is a need to effectively adapt models to cope with the complexities of camera conditions. To address this challenge, this study introduces a learning strategy named Random Color Erasing, which draws inspiration from ensemble learning. This strategy selectively erases partial or complete color information in the training data without disrupting the original image structure, thereby achieving a balanced weighting of color features and other features within the neural network. This approach mitigates the risk of overfitting and enhances the model's ability to handle color variation, thereby improving its overall robustness. The approach we propose serves as an ensemble learning strategy, characterized by robust interpretability. A comprehensive analysis of this methodology is presented in this paper. Across various tasks such as person re-identification and semantic segmentation, our approach consistently improves over strong baseline methods. Notably, in comparison to existing methods that prioritize color robustness, our strategy significantly enhances performance in cross-domain scenarios. The code is available at \url{this https URL} or \url{this https URL}.
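Random Color Erasing as described reduces to selectively replacing color with grayscale while leaving the image structure intact. A minimal sketch under that reading follows; the probability parameter and the rectangular-region sampling are assumptions, not the paper's exact recipe:

```python
import numpy as np

def random_color_erasing(image, p_full=0.5, rng=None):
    """Randomly erase color information while keeping image structure.

    image: (H, W, 3) float array. With probability `p_full` the whole
    image is replaced by its grayscale version; otherwise a random
    rectangular region is desaturated (partial erasing).
    """
    rng = np.random.default_rng() if rng is None else rng
    gray = image.mean(axis=2, keepdims=True).repeat(3, axis=2)
    if rng.random() < p_full:
        return gray
    h, w = image.shape[:2]
    y0, x0 = rng.integers(0, h), rng.integers(0, w)
    y1, x1 = rng.integers(y0 + 1, h + 1), rng.integers(x0 + 1, w + 1)
    out = image.copy()
    out[y0:y1, x0:x1] = gray[y0:y1, x0:x1]
    return out
```

Like other erasing-style augmentations, it would be applied per sample at training time only.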
https://arxiv.org/abs/2401.10512
In recent years, there has been significant research focusing on addressing security concerns in single-modal person re-identification (ReID) systems that are based on RGB images. However, the safety of cross-modality scenarios, which are more commonly encountered in practical applications involving images captured by infrared cameras, has not received adequate attention. The main challenge in cross-modality ReID lies in effectively dealing with visual differences between different modalities. For instance, infrared images are typically grayscale, unlike visible images that contain color information. Existing attack methods have primarily focused on the characteristics of the visible image modality, overlooking the features of other modalities and the variations in data distribution among different modalities. This oversight can potentially undermine the effectiveness of these methods in image retrieval across diverse modalities. This study represents the first exploration into the security of cross-modality ReID models and proposes a universal perturbation attack specifically designed for cross-modality ReID. This attack optimizes perturbations by leveraging gradients from diverse modality data, thereby disrupting the discriminator and reinforcing the differences between modalities. We conducted experiments on two widely used cross-modality datasets, namely RegDB and SYSU, which not only demonstrated the effectiveness of our method but also provided insights for future enhancements in the robustness of cross-modality ReID systems.
https://arxiv.org/abs/2401.10090
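A toy sketch of a universal perturbation optimized with gradients pooled from both modalities, as the abstract above describes. It uses a stand-in linear embedding f(x) = Wx and a bilinear similarity objective so the gradient is analytic; the function name, objective, and hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def universal_perturbation(pairs, W, eps=0.1, alpha=0.01, steps=50):
    """Optimise one perturbation delta shared across all images, for a
    toy linear embedding f(x) = W @ x.

    pairs : list of (x_vis, x_ir) feature vectors of the same identity,
            one from each modality.
    delta is driven to *reduce* the similarity <f(x_vis + delta),
    f(x_ir + delta)> of matched cross-modality pairs, pooling gradients
    from visible and infrared samples alike, under an L_inf budget eps.
    """
    d = W.shape[1]
    delta = np.zeros(d)
    WtW = W.T @ W  # symmetric; appears in the analytic gradient below
    for _ in range(steps):
        grad = np.zeros(d)
        for x_v, x_ir in pairs:
            # d/d delta of (W(x_v+delta))^T (W(x_ir+delta))
            grad += WtW @ (x_ir + delta) + WtW @ (x_v + delta)
        # FGSM-style sign step *down* the similarity, projected to the ball
        delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
    return delta
```

In the actual attack, W would be replaced by a deep cross-modality ReID network and the gradient obtained by backpropagation, but the structure (one shared delta, gradients aggregated across modalities, norm-bounded projection) is the same.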
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions without relying on identity annotations, which makes the task more challenging yet more practical. The primary challenge is intra-class difference, encompassing intra-modal feature variations and cross-modal semantic gaps. Prior works have focused on instance-level samples and ignored the prototypical features of each person, which are intrinsic and invariant. To this end, we propose a Cross-Modal Prototypical Contrastive Learning (CPCL) method. In practice, CPCL introduces the CLIP model to weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space. Subsequently, the proposed Prototypical Multi-modal Memory (PMM) module captures associations between heterogeneous modalities of image-text pairs belonging to the same person through the Hybrid Cross-modal Matching (HCM) module in a many-to-many mapping fashion. Moreover, the Outlier Pseudo Label Mining (OPLM) module further distinguishes valuable outlier samples in each modality, enabling more reliable clusters by mining implicit relationships between image-text pairs. Experimental results demonstrate that our proposed CPCL attains state-of-the-art performance on all three public datasets, with significant improvements of 11.58%, 8.77% and 5.25% in Rank@1 accuracy on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively. The code is available at this https URL.
https://arxiv.org/abs/2401.10011
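A simplified stand-in for the prototypical contrast described above: each embedding from one modality is pulled toward the prototype of its pseudo-identity cluster in the other modality via an InfoNCE-style loss over prototypes. The function name, temperature, and interface are illustrative assumptions, not CPCL's exact loss.

```python
import numpy as np

def proto_contrastive_loss(feats, labels, prototypes, tau=0.07):
    """Prototype-level InfoNCE.

    feats      : (N, D) L2-normalised embeddings from one modality
    labels     : (N,) pseudo-identity cluster index of each embedding
    prototypes : (K, D) L2-normalised cluster prototypes of the *other*
                 modality (e.g. per-cluster text centroids from a memory)
    Returns the mean -log softmax score of each feature against its own
    prototype, at temperature tau.
    """
    logits = feats @ prototypes.T / tau          # (N, K) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(feats)), labels].mean()
```

Contrasting against per-cluster prototypes rather than individual instances is what makes the supervision robust to noisy instance-level pairs, since a prototype averages out intra-class variation.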
Effective tracking and re-identification of players is essential for analyzing soccer videos. However, it is a challenging task due to the non-linear motion of players, the similarity in appearance of players from the same team, and frequent occlusions. The ability to extract meaningful embeddings to represent players is therefore crucial in developing an effective tracking and re-identification system. In this paper, a multi-purpose part-based person representation method, called PRTreID, is proposed that performs three tasks simultaneously: role classification, team affiliation, and re-identification. In contrast to the available literature, a single network is trained with multi-task supervision to solve all three tasks jointly. The proposed joint method is computationally efficient due to the shared backbone. Moreover, the multi-task learning leads to richer and more discriminative representations, as demonstrated by both quantitative and qualitative results. To demonstrate the effectiveness of PRTreID, it is integrated with a state-of-the-art tracking method, using a part-based post-processing module to handle long-term tracking. The proposed tracking method outperforms all existing tracking methods on the challenging SoccerNet tracking dataset.
https://arxiv.org/abs/2401.09942
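The shared-backbone, three-branch design described above can be sketched as a single feature vector feeding three lightweight heads. Everything here (class counts, dimensions, random weights) is an illustrative assumption; a real PRTreID model would learn these weights under the joint multi-task loss.

```python
import numpy as np

class MultiTaskHead:
    """Sketch of a PRTreID-style head: one shared backbone feature feeds
    three branches, so the costly backbone is computed only once."""

    def __init__(self, feat_dim=256, n_roles=4, n_teams=2, emb_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w_role = rng.normal(0, 0.02, (feat_dim, n_roles))  # role branch
        self.w_team = rng.normal(0, 0.02, (feat_dim, n_teams))  # team branch
        self.w_emb = rng.normal(0, 0.02, (feat_dim, emb_dim))   # re-id branch

    def forward(self, shared_feat):
        """shared_feat : (B, feat_dim) backbone features for B players."""
        role_logits = shared_feat @ self.w_role   # role classification logits
        team_logits = shared_feat @ self.w_team   # team affiliation logits
        emb = shared_feat @ self.w_emb            # re-id embedding ...
        emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)  # ... L2-normed
        return role_logits, team_logits, emb
```

Training would add a classification loss on each logit branch and a metric (e.g. triplet) loss on the embedding; sharing the backbone is what makes the joint model cheaper than three separate networks while letting the tasks regularize one another.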