This study presents an investigation of four distinct approaches to long-term person identification using body shape. Unlike short-term re-identification systems that rely on temporary features (e.g., clothing), we focus on learning persistent body shape characteristics that remain stable over time. We introduce a body identification model based on a Vision Transformer (ViT) (Body Identification from Diverse Datasets, BIDDS) and on a Swin-ViT model (Swin-BIDDS). We also expand on previous approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM), but with improved training. All models are trained on a large and diverse dataset of over 1.9 million images of approximately 5k identities across 9 databases. Performance was evaluated on standard re-identification benchmark datasets (MARS, MSMT17, Outdoor Gait, DeepChange) and on an unconstrained dataset that includes images at a distance (from close-range to 1000m), at altitude (from an unmanned aerial vehicle, UAV), and with clothing change. A comparative analysis across these models provides insights into how different backbone architectures and input image sizes impact long-term body identification performance across real-world conditions.
https://arxiv.org/abs/2502.07130
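To make the backbone comparison concrete, below is a minimal sketch of how interchangeable ResNet, ViT, and Swin embedding extractors for body identification could be set up with `timm`. The specific model names, the 512-dimensional embedding head, and the 224-pixel input size are illustrative assumptions, not the configuration used for LCRIM/NLCRIM, BIDDS, or Swin-BIDDS.

```python
import timm
import torch
import torch.nn as nn

class BodyEmbedder(nn.Module):
    """Backbone-agnostic body-shape embedding model (illustrative sketch)."""
    def __init__(self, backbone_name: str, embed_dim: int = 512):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of class logits.
        self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                                # (B, num_features)
        return nn.functional.normalize(self.head(feats), dim=-1)

# Hypothetical backbone choices mirroring the model families in the abstract.
models = {
    "resnet_core": BodyEmbedder("resnet50"),
    "vit_bidds": BodyEmbedder("vit_base_patch16_224"),
    "swin_bidds": BodyEmbedder("swin_base_patch4_window7_224"),
}
x = torch.randn(2, 3, 224, 224)
print({name: m(x).shape for name, m in models.items()})        # each -> torch.Size([2, 512])
```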
Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across different cameras, locations, and time periods. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust Re-ID systems capable of handling long-term scenarios, where persons' appearances can change significantly due to variations in clothing and physical characteristics. In this paper, we present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset specifically designed for long-term person Re-ID. CHIRLA consists of recordings from strategically placed cameras over a seven-month period, capturing significant variations in both temporal and appearance attributes, including controlled changes in participants' clothing and physical features. The dataset includes 22 individuals, four connected indoor environments, and seven cameras. We collected more than five hours of video that we semi-automatically labeled to generate around one million bounding boxes with identity annotations. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios.
https://arxiv.org/abs/2502.06681
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at this https URL.
https://arxiv.org/abs/2502.06619
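A minimal sketch of the correlation-aware conditioning idea described above: the Re-ID model's ID classification probabilities softly mix a bank of learnable ID-wise prompts, and the mixture becomes the condition fed to the diffusion model. The dimensions and the simple probability-weighted sum are assumptions for illustration, not the exact DCAC scheme.

```python
import torch
import torch.nn as nn

class CorrelationAwareCondition(nn.Module):
    """Turns Re-ID ID probabilities into a diffusion-model condition via ID-wise prompts."""
    def __init__(self, num_ids: int, prompt_dim: int = 768):
        super().__init__()
        # One learnable prompt vector per training identity (assumed shape).
        self.id_prompts = nn.Parameter(torch.randn(num_ids, prompt_dim) * 0.02)

    def forward(self, id_logits: torch.Tensor) -> torch.Tensor:
        # "Dark knowledge": the full probability vector, not just the argmax,
        # mixes the prompts so correlated identities share conditioning mass.
        probs = id_logits.softmax(dim=-1)        # (B, num_ids)
        return probs @ self.id_prompts           # (B, prompt_dim), passed to the diffusion model

cond_net = CorrelationAwareCondition(num_ids=751)   # e.g. 751 training IDs as in Market-1501
id_logits = torch.randn(4, 751)                     # logits from the Re-ID classifier head
print(cond_net(id_logits).shape)                    # torch.Size([4, 768])
```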
Group Re-Identification (Group ReID) aims to match groups of pedestrians across non-overlapping cameras. Unlike single-person ReID, Group ReID focuses more on changes in group structure, emphasizing the number of members and their spatial arrangement. However, most methods rely on certainty-based models, which consider only the specific group structures in the group images, often failing to match unseen group configurations. To this end, we propose a novel Group-CLIP Uncertainty Modeling (GCUM) approach that adapts group text descriptions to accommodate undetermined member and layout variations. Specifically, we design a Member Variant Simulation (MVS) module that simulates member exclusions using a Bernoulli distribution and a Group Layout Adaptation (GLA) module that generates uncertain group text descriptions with identity-specific tokens. In addition, we design a Group Relationship Construction Encoder (GRCE) that uses group features to refine individual features, and employ a cross-modal contrastive loss to obtain generalizable knowledge from group text descriptions. It is worth noting that we are the first to apply CLIP to Group ReID, and extensive experiments show that GCUM significantly outperforms state-of-the-art Group ReID methods.
https://arxiv.org/abs/2502.06460
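A minimal sketch of the Member Variant Simulation idea: a Bernoulli mask randomly excludes group members before the group representation is aggregated, so training sees many plausible group configurations. The keep probability and mean pooling are assumptions for illustration, not the GCUM design.

```python
import torch

def simulate_member_variant(member_feats: torch.Tensor, keep_prob: float = 0.7) -> torch.Tensor:
    """member_feats: (num_members, feat_dim) features of one pedestrian group."""
    # Bernoulli mask over members; guarantee at least one member survives.
    mask = torch.bernoulli(torch.full((member_feats.size(0),), keep_prob)).bool()
    if not mask.any():
        mask[torch.randint(member_feats.size(0), (1,))] = True
    # Aggregate the surviving members into a single group representation.
    return member_feats[mask].mean(dim=0)

group = torch.randn(5, 512)                       # a five-member group
print(simulate_member_variant(group).shape)       # torch.Size([512])
```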
Navigating the complexities of person re-identification (ReID) in varied surveillance scenarios, particularly when occlusions occur, poses significant challenges. We introduce an innovative Motion-Aware Fusion (MOTAR-FUSE) network that utilizes motion cues derived from static imagery to significantly enhance ReID capabilities. This network incorporates a dual-input visual adapter capable of processing both images and videos, thereby facilitating more effective feature extraction. A unique aspect of our approach is the integration of a motion consistency task, which empowers the motion-aware transformer to adeptly capture the dynamics of human motion. This technique substantially improves the recognition of features in scenarios where occlusions are prevalent, thereby advancing the ReID process. Our comprehensive evaluations across multiple ReID benchmarks, including holistic, occluded, and video-based scenarios, demonstrate that our MOTAR-FUSE network achieves superior performance compared to existing approaches.
https://arxiv.org/abs/2502.00665
Defacing is often applied to head magnetic resonance image (MRI) datasets prior to public release to address privacy concerns. The alteration of facial and nearby voxels has provoked discussions about the true capability of these techniques to ensure privacy as well as their impact on downstream tasks. With advancements in deep generative models, the extent to which defacing can protect privacy is uncertain. Additionally, while the altered voxels are known to contain valuable anatomical information, their potential to support research beyond the anatomical regions directly affected by defacing remains uncertain. To evaluate these considerations, we develop a refacing pipeline that recovers faces in defaced head MRIs using cascaded diffusion probabilistic models (DPMs). The DPMs are trained on images from 180 subjects and tested on images from 484 unseen subjects, 469 of whom are from a different dataset. To assess whether the altered voxels in defacing contain universally useful information, we also predict computed tomography (CT)-derived skeletal muscle radiodensity from facial voxels in both defaced and original MRIs. The results show that DPMs can generate high-fidelity faces that resemble the original faces from defaced images, with surface distances to the original faces significantly smaller than those of a population average face (p < 0.05). This performance also generalizes well to previously unseen datasets. For skeletal muscle radiodensity predictions, using defaced images results in significantly weaker Spearman's rank correlation coefficients compared to using original images (p < 10⁻⁴). For shin muscle, the correlation is statistically significant (p < 0.05) when using original images but not statistically significant (p > 0.05) when any defacing method is applied, suggesting that defacing might not only fail to protect privacy but also eliminate valuable information.
https://arxiv.org/abs/2501.18834
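The downstream test described above boils down to comparing Spearman rank correlations between radiodensity predictions made from original versus defaced facial voxels. Below is a minimal sketch of that comparison using synthetic placeholder features and a generic ridge regressor, not the paper's imaging pipeline.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200
radiodensity = rng.normal(50, 10, n)                              # CT-derived target (placeholder)
orig_feats = radiodensity[:, None] + rng.normal(0, 5, (n, 16))    # facial voxels carry signal
defaced_feats = rng.normal(0, 5, (n, 16))                         # defacing removes most signal

for name, X in [("original", orig_feats), ("defaced", defaced_feats)]:
    preds = cross_val_predict(Ridge(), X, radiodensity, cv=5)
    rho, p = spearmanr(preds, radiodensity)
    print(f"{name}: Spearman rho = {rho:.2f}, p = {p:.3g}")
```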
Large vision-language models (LVLMs) have been regarded as a breakthrough advance in an astounding variety of tasks, from content generation to virtual assistants and multimodal search or retrieval. However, for many of these applications, the performance of these methods has been widely criticized, particularly when compared with state-of-the-art methods and technologies in each specific domain. In this work, we compare the performance of the leading large vision-language models in the human re-identification task, using as baseline the performance attained by state-of-the-art AI models specifically designed for this problem. We compare the results of ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max to a baseline ReID PersonViT model, using the well-known Market1501 dataset. Our evaluation pipeline includes the dataset curation, prompt engineering, and metric selection to assess the models' performance. Results are analyzed from many different perspectives: similarity scores, classification accuracy, and classification metrics, including precision, recall, F1 score, and area under curve (AUC). Our results confirm the strengths of LVLMs, but also their severe limitations that often lead to catastrophic answers and should be the focus of further research. As a concluding remark, we speculate about further research that should fuse traditional and LVLM techniques to combine the strengths from both families of techniques and achieve solid improvements in performance.
https://arxiv.org/abs/2501.18698
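A minimal sketch of the metric step of such an evaluation pipeline: pairwise similarity scores are turned into an AUC and, after thresholding, into precision, recall, and F1 with `scikit-learn`. The scores, labels, and the 0.5 threshold are placeholders.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Similarity scores for query/gallery pairs and their ground-truth same-identity labels.
scores = np.array([0.91, 0.34, 0.78, 0.12, 0.66, 0.49])
labels = np.array([1, 0, 1, 0, 1, 0])

auc = roc_auc_score(labels, scores)           # threshold-free ranking quality
preds = (scores >= 0.5).astype(int)           # fixed decision threshold (assumption)
print(f"AUC={auc:.3f}  precision={precision_score(labels, preds):.3f}  "
      f"recall={recall_score(labels, preds):.3f}  F1={f1_score(labels, preds):.3f}")
```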
This paper proposes a new effective and efficient plug-and-play backbone for video-based person re-identification (ReID). Conventional video-based ReID methods typically use CNN or transformer backbones to extract deep features for every position in every sampled video frame. Here, we argue that this exhaustive feature extraction could be unnecessary, since we find that different frames in a ReID video often exhibit small differences and contain many similar regions due to the relatively slight movements of human beings. Inspired by this, a more selective, efficient paradigm is explored in this paper. Specifically, we introduce a patch selection mechanism to reduce computational cost by choosing only the crucial and non-repetitive patches for feature extraction. Additionally, we present a novel network structure that generates and utilizes pseudo frame global context to address the issue of incomplete views resulting from sparse inputs. By incorporating these new designs, our backbone can achieve both high performance and low computational cost. Extensive experiments on multiple datasets show that our approach reduces the computational cost by 74% compared to ViT-B and 28% compared to ResNet50, while the accuracy is on par with ViT-B and outperforms ResNet50 significantly.
https://arxiv.org/abs/2501.16811
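A minimal sketch of a patch-selection step in the spirit described above: patch embeddings of the current frame are compared with those of the previous frame, and only patches that changed enough (i.e., are non-repetitive) are kept for full feature extraction. The cosine-similarity criterion and threshold are assumptions, not the paper's mechanism.

```python
import torch
import torch.nn.functional as F

def select_novel_patches(prev_patches: torch.Tensor,
                         curr_patches: torch.Tensor,
                         sim_thresh: float = 0.95) -> torch.Tensor:
    """prev_patches, curr_patches: (num_patches, dim) patch embeddings of two adjacent frames.
    Returns indices of current-frame patches that are worth re-processing."""
    sim = F.cosine_similarity(prev_patches, curr_patches, dim=-1)   # per-patch similarity
    return (sim < sim_thresh).nonzero(as_tuple=True)[0]             # keep only changed patches

prev = torch.randn(196, 768)
curr = prev.clone()
curr[:30] += torch.randn(30, 768)   # only the first 30 patches actually moved
print(select_novel_patches(prev, curr).numel(), "of 196 patches selected")
```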
Visual language models, known for their robust cross-modal capabilities, have been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive Language-Image Pretraining), a vision-language model pretrained on large-scale image-text pairs to align visual and textual features, for acquiring fine-grained and domain-invariant representations in generalizable person re-identification. The adaptation of CLIP to the task presents two primary challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capabilities. To mitigate the first challenge and thereby enhance the ability to learn fine-grained features, a three-stage strategy is proposed to boost the accuracy of text descriptions. Initially, the image encoder is trained to effectively adapt to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate textual descriptions (i.e., prompts) for each image. Finally, the text encoder with the learned prompts is employed to guide the training of the final image encoder. To enhance the model's generalization capabilities to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive (pulling together image features and domain-invariant prompts) and negative (pushing apart image features and domain-relevant prompts) views are used to train the image encoder. Collectively, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification.
https://arxiv.org/abs/2501.16065
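A minimal sketch of the bidirectional guiding objective described above: image features are pulled toward a domain-invariant prompt embedding (positive view) and pushed away from domain-relevant prompt embeddings (negative view). The cosine-based formulation and the margin are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def bidirectional_guiding_loss(img_feats: torch.Tensor,
                               invariant_prompt: torch.Tensor,
                               domain_prompts: torch.Tensor,
                               margin: float = 0.3) -> torch.Tensor:
    """img_feats: (B, D); invariant_prompt: (D,); domain_prompts: (num_domains, D)."""
    img = F.normalize(img_feats, dim=-1)
    pos = F.cosine_similarity(img, F.normalize(invariant_prompt, dim=-1).unsqueeze(0), dim=-1)
    neg = img @ F.normalize(domain_prompts, dim=-1).T     # (B, num_domains)
    # Pull toward the domain-invariant prompt, push away from every domain-relevant prompt.
    return (1.0 - pos).mean() + F.relu(neg - margin).mean()

loss = bidirectional_guiding_loss(torch.randn(8, 512), torch.randn(512), torch.randn(4, 512))
print(loss.item())
```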
We introduce YOLO11-JDE, a fast and accurate multi-object tracking (MOT) solution that combines real-time object detection with self-supervised Re-Identification (Re-ID). By incorporating a dedicated Re-ID branch into YOLO11s, our model performs Joint Detection and Embedding (JDE), generating appearance features for each detection. The Re-ID branch is trained in a fully self-supervised setting while simultaneously training for detection, eliminating the need for costly identity-labeled datasets. The triplet loss, with hard positive and semi-hard negative mining strategies, is used for learning discriminative embeddings. Data association is enhanced with a custom tracking implementation that successfully integrates motion, appearance, and location cues. YOLO11-JDE achieves competitive results on MOT17 and MOT20 benchmarks, surpassing existing JDE methods in terms of FPS and using up to ten times fewer parameters. This makes our method a highly attractive solution for real-world applications.
https://arxiv.org/abs/2501.13710
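A minimal sketch of the mining strategy named above: within a batch, each anchor is paired with its hardest positive and with a semi-hard negative (farther than the positive but still inside the margin), falling back to the hardest negative when none exists. The distance metric, margin, and fallback are assumptions, not the YOLO11-JDE implementation.

```python
import torch
import torch.nn.functional as F

def mined_triplet_loss(embeds: torch.Tensor, ids: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """embeds: (B, D) L2-normalised embeddings; ids: (B,) identity (or pseudo-identity) labels."""
    dist = torch.cdist(embeds, embeds)                     # (B, B) pairwise Euclidean distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)
    eye = torch.eye(len(ids), dtype=torch.bool)

    # Hard positive: the farthest sample sharing the anchor's identity.
    pos_dist = dist.masked_fill(~same | eye, float("-inf")).max(dim=1).values

    # Semi-hard negative: the closest different-identity sample with d(a, n) > d(a, p);
    # fall back to the hardest (closest) negative when no semi-hard one exists.
    neg_dist = dist.masked_fill(same, float("inf"))
    semi = neg_dist.masked_fill(neg_dist <= pos_dist.unsqueeze(1), float("inf")).min(dim=1).values
    hard = neg_dist.min(dim=1).values
    chosen_neg = torch.where(torch.isfinite(semi), semi, hard)

    return F.relu(pos_dist + margin - chosen_neg).mean()

emb = F.normalize(torch.randn(16, 128), dim=-1)
ids = torch.randint(0, 4, (16,))
print(mined_triplet_loss(emb, ids).item())
```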
Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing methods rely exclusively on visual data, neglecting environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. To bridge this gap, we propose the Meta-Feature Adapter (MFA), a lightweight module designed to integrate environmental metadata into vision-language foundation models, such as CLIP, to enhance Animal ReID performance. Our approach translates environmental metadata into natural language descriptions, encodes them into metadata-aware text embeddings, and incorporates these embeddings into image features through a cross-attention mechanism. Furthermore, we introduce a Gated Cross-Attention mechanism that dynamically adjusts the weights of metadata contributions, further improving performance. To validate our approach, we constructed the Metadata Augmented Animal Re-identification (MAAR) dataset, encompassing six species from New Zealand and featuring paired image data and environmental metadata. Extensive experiments demonstrate that MFA consistently improves Animal ReID performance across multiple baseline models.
https://arxiv.org/abs/2501.13368
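A minimal sketch of a gated cross-attention block in the spirit of the description above: image tokens attend to metadata-aware text embeddings, and a learned gate controls how much metadata is injected back into the image features. The dimensions and the zero-initialised tanh gate are assumptions, not the exact MFA module.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # starts closed: no metadata influence

    def forward(self, img_tokens: torch.Tensor, meta_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, dim) image features; meta_tokens: (B, M, dim) metadata text embeddings.
        attended, _ = self.attn(query=img_tokens, key=meta_tokens, value=meta_tokens)
        return img_tokens + torch.tanh(self.gate) * attended    # gated residual injection

block = GatedCrossAttention()
fused = block(torch.randn(2, 49, 512), torch.randn(2, 4, 512))
print(fused.shape)    # torch.Size([2, 49, 512])
```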
Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems. While current VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both visible (V) and infrared (I) images, where state-of-the-art methods show significant performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have lower domain gaps but correspond to different identities. This paper introduces a novel mixed-modal ReID setting, where galleries contain data from both modalities. To address the inter-modal domain shift and the low discrimination capacity of intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. The MixER learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving performance in cross-modal and mixed-modal settings. Our extensive experiments on the SYSU-MM01, RegDB and LLMC datasets indicate that our approach can provide state-of-the-art results using a single backbone, and showcase the flexibility of our approach in mixed gallery applications.
https://arxiv.org/abs/2501.13307
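A minimal sketch of the orthogonal-decomposition idea: two projection heads split a backbone feature into modality-shared and modality-specific parts, and a penalty keeps the two parts orthogonal. The linear heads and the squared-cosine penalty are assumptions for illustration, not the MixER objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDisentangler(nn.Module):
    def __init__(self, in_dim: int = 2048, out_dim: int = 512):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)     # modality-shared (identity) branch
        self.specific = nn.Linear(in_dim, out_dim)   # modality-specific branch

    def forward(self, feats: torch.Tensor):
        s, p = self.shared(feats), self.specific(feats)
        # Orthogonality penalty: the two parts should carry disjoint information.
        ortho = F.cosine_similarity(s, p, dim=-1).pow(2).mean()
        return s, p, ortho

model = ModalityDisentangler()
shared, specific, ortho_loss = model(torch.randn(8, 2048))
print(shared.shape, specific.shape, ortho_loss.item())
```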
Deep learning based person re-identification (re-id) models have been widely employed in surveillance systems. Recent studies have demonstrated that black-box single-modality and cross-modality re-id models are vulnerable to adversarial examples (AEs), leaving the robustness of multi-modality re-id models unexplored. Due to the lack of knowledge about the specific type of model deployed in the target black-box surveillance system, we aim to generate modality unified AEs for omni-modality (single-, cross- and multi-modality) re-id models. Specifically, we propose a novel Modality Unified Attack (MUA) method that trains modality-specific adversarial generators to produce AEs which effectively attack different omni-modality models. A multi-modality model is adopted as the surrogate model, wherein the features of each modality are perturbed by a metric disruption loss before fusion. To collapse the common features of omni-modality models, a Cross Modality Simulated Disruption approach is introduced to mimic the cross-modality feature embeddings by intentionally feeding images to non-corresponding modality-specific subnetworks of the surrogate model. Moreover, a Multi Modality Collaborative Disruption strategy is devised to help the attacker comprehensively corrupt the informative content of person images by leveraging a multi-modality feature collaborative metric disruption loss. Extensive experiments show that our MUA method can effectively attack omni-modality re-id models, achieving mean mAP Drop Rates of 55.9%, 24.4%, 49.0% and 62.7%, respectively.
https://arxiv.org/abs/2501.12761
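A minimal sketch of a metric-disruption attack of the kind the abstract builds on: a PGD-style perturbation is optimised to push the surrogate model's features of the perturbed image away from its clean features. The step size, budget, and plain L2 feature distance are assumptions for illustration, not the MUA generators.

```python
import torch

def metric_disruption_attack(model, images, steps: int = 10,
                             eps: float = 8 / 255, alpha: float = 2 / 255) -> torch.Tensor:
    """model: any feature extractor mapping (B, 3, H, W) -> (B, D); images in [0, 1]."""
    model.eval()
    with torch.no_grad():
        clean_feats = model(images)
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Metric disruption: maximise the feature distance to the clean embedding.
        loss = torch.norm(model(adv) - clean_feats, dim=-1).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = images + (adv - images).clamp(-eps, eps)   # stay inside the L-inf budget
            adv = adv.clamp(0, 1)
    return adv.detach()

# Usage (hypothetical): adv_batch = metric_disruption_attack(surrogate_reid_model, image_batch)
```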
To prove that a dataset is sufficiently anonymized, many privacy policies suggest that a re-identification risk assessment be performed, but do not provide a precise methodology for doing so, leaving the industry alone with the problem. This paper proposes a practical and ready-to-use methodology for re-identification risk assessment, the originality of which is manifold: (1) it is the first to follow well-known risk analysis methods (e.g. EBIOS) that have been used in the cybersecurity field for years, which consider not only the ability to perform an attack, but also the impact such an attack can have on an individual; (2) it is the first to qualify attributes and values of attributes with e.g. degree of exposure, as known real-world attacks mainly target certain types of attributes and not others.
https://arxiv.org/abs/2501.10841
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.
https://arxiv.org/abs/2501.09044
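A minimal sketch of the kind of memory-bank contrastive step such pipelines rely on: cluster prototypes stored in a memory are used as contrastive targets and refreshed with a momentum update. This is generic unsupervised-ReID practice, not the specific multi-scale design of TCMM.

```python
import torch
import torch.nn.functional as F

class ClusterMemory:
    """Cluster-prototype memory with momentum updates for contrastive ReID training."""
    def __init__(self, num_clusters: int, dim: int, momentum: float = 0.2, temp: float = 0.05):
        self.bank = F.normalize(torch.randn(num_clusters, dim), dim=-1)
        self.momentum, self.temp = momentum, temp

    def contrastive_loss(self, feats: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)
        logits = feats @ self.bank.T / self.temp          # similarity to every prototype
        loss = F.cross_entropy(logits, pseudo_labels)
        with torch.no_grad():                             # momentum-refresh touched prototypes
            for f, y in zip(feats, pseudo_labels):
                self.bank[y] = F.normalize(self.momentum * f + (1 - self.momentum) * self.bank[y], dim=0)
        return loss

memory = ClusterMemory(num_clusters=100, dim=256)
loss = memory.contrastive_loss(torch.randn(16, 256), torch.randint(0, 100, (16,)))
print(loss.item())
```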
Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance. By employing event data in video-based person ReID, more motion information between consecutive frames can be provided to improve recognition accuracy. Previous approaches have introduced event data into the video person ReID task to assist recognition, but they still cannot avoid the privacy leakage problem caused by RGB images. In order to avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
https://arxiv.org/abs/2501.07296
Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospect. Most existing works struggle to adequately extract the ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. Particularly, IFD exploits a dual stream architecture that consists of a main stream and an attention stream. The attention stream takes the clothing-masked images as inputs and derives the identity attention weights for effectively transferring the spatial knowledge to the main stream and highlighting the regions with abundant identity-related information. To eliminate the semantic gap between the inputs of two streams, we propose a clothing bias diminishing module specific to the main stream to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely-used CC Re-ID datasets.
https://arxiv.org/abs/2501.05851
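A minimal sketch of the dual-stream idea: the attention stream, run on clothing-masked images, produces a spatial attention map that re-weights the main stream's feature map before pooling. The single 1x1 convolution and sigmoid weighting are assumptions, not the IFD architecture.

```python
import torch
import torch.nn as nn

class DualStreamWeighting(nn.Module):
    def __init__(self, channels: int = 2048):
        super().__init__()
        self.attn_head = nn.Conv2d(channels, 1, kernel_size=1)   # identity-attention map

    def forward(self, main_feat: torch.Tensor, masked_feat: torch.Tensor) -> torch.Tensor:
        # main_feat:   backbone features of the original image        (B, C, H, W)
        # masked_feat: backbone features of the clothing-masked image (B, C, H, W)
        attn = torch.sigmoid(self.attn_head(masked_feat))          # (B, 1, H, W)
        weighted = main_feat * attn                                # highlight identity-rich regions
        return weighted.mean(dim=(2, 3))                           # pooled identity embedding

model = DualStreamWeighting()
emb = model(torch.randn(2, 2048, 16, 8), torch.randn(2, 2048, 16, 8))
print(emb.shape)    # torch.Size([2, 2048])
```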
Street cats in urban areas often rely on human intervention for survival, leading to challenges in population control and welfare management. In April 2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street Cat initiative to address these issues. The project deployed over 21,000 smart feeding stations across 14 cities in China, integrating livestreaming cameras and treat dispensers activated through user donations. It also promotes the Trap-Neuter-Return (TNR) method, supported by a community-driven platform, HelloStreetCatWiki, where volunteers catalog and identify cats. However, manual identification is inefficient and unsustainable, creating a need for automated solutions. This study explores Deep Learning-based models for re-identifying street cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69 cats was used to train Siamese Networks with EfficientNetB0, MobileNet and VGG16 as base models, evaluated under contrastive and triplet loss functions. VGG16 paired with contrastive loss emerged as the most effective configuration, achieving up to 97% accuracy and an F1 score of 0.9344 during testing. The approach leverages image augmentation and dataset refinement to overcome challenges posed by limited data and diverse visual variations. These findings underscore the potential of automated cat re-identification to streamline population monitoring and welfare efforts. By reducing reliance on manual processes, the method offers a scalable and reliable solution for community-driven initiatives. Future research will focus on expanding datasets and developing real-time implementations to enhance practicality in large-scale deployments.
https://arxiv.org/abs/2501.02112
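A minimal sketch of the best-performing configuration named above, a Siamese network with a VGG16 base trained under a contrastive loss, written here with `torchvision` for brevity; the 256-dimensional embedding and the margin of 1.0 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SiameseCatNet(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        base = vgg16(weights=None)   # pretrained weights could be used instead
        self.encoder = nn.Sequential(base.features, nn.AdaptiveAvgPool2d(1),
                                     nn.Flatten(), nn.Linear(512, embed_dim))

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        return self.encoder(a), self.encoder(b)

def contrastive_loss(za: torch.Tensor, zb: torch.Tensor, same: torch.Tensor, margin: float = 1.0):
    d = F.pairwise_distance(za, zb)
    # Same-cat pairs are pulled together; different-cat pairs are pushed past the margin.
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

model = SiameseCatNet()
za, zb = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(contrastive_loss(za, zb, torch.tensor([1.0, 0.0])).item())
```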
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, existing methods either lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to perform reliable modality-invariant feature learning. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, ECUL naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge in terms of the clustering algorithm. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
https://arxiv.org/abs/2412.19134
The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. However, current studies rely on unsupervised modality transformations and inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle the limitations of the above approaches, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at this https URL.
https://arxiv.org/abs/2412.19111