Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras with different viewpoints. Previous methods usually adopt large-scale models and focus on view-invariant features, but they overlook the semantic information carried by person attributes. Additionally, existing training strategies often rely on fully fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to generate person attribute predictions and obtain encoded representations of the predicted attributes. Finally, we design a Coupled Prompt Template (CPT) to transform attribute tokens and view information into structured sentences, which are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed LATex. The source code will be available.
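As a rough illustration of how a Coupled Prompt Template might assemble view information and predicted attributes into a sentence for the CLIP text encoder, consider the sketch below. The template wording, attribute names, and the `coupled_prompt` helper are assumptions for illustration only; in LATex the prompt context is built from learned tokens, not fixed strings.

```python
def coupled_prompt(view, attributes):
    """Fill a CPT-style template with the view and predicted attribute phrases.
    Wording is illustrative; the paper's actual template tokens are learned."""
    attr_text = ", ".join(f"{k}: {v}" for k, v in attributes.items())
    return f"A photo of a person captured from the {view} view, with {attr_text}."

# Predicted attributes for one query person (hypothetical values).
prompt = coupled_prompt("aerial", {"gender": "female", "upper clothing": "red jacket"})
```

The resulting sentence would then be tokenized and passed through CLIP's text encoder to produce a text feature for matching.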
https://arxiv.org/abs/2503.23722
Clothes-changing person re-identification (CC-ReID) aims to recognize individuals across different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities such as silhouettes, poses, and body meshes, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model is trained to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism designed for feature disentanglement that leverages the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, effectively separates identity-related features from non-biometric features. We evaluate DIFFER on four benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness, achieving state-of-the-art performance across all benchmarks. DIFFER consistently outperforms the baseline method, with top-1 accuracy improvements of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. Our code can be found here.
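The gradient reversal mechanism underlying NBDetach can be sketched in a few lines: the layer is the identity in the forward pass and flips (and scales) gradients in the backward pass, so the encoder is pushed to remove non-biometric information from the identity subspace. This is a minimal stand-alone sketch; real implementations hook into an autograd framework (e.g., a custom `torch.autograd.Function` in PyTorch), and the class name and `lam` parameter here are illustrative.

```python
class GradReverse:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        # Reversed gradient: the classifier learns the nuisance attribute,
        # while the encoder is pushed to discard it.
        return [-self.lam * g for g in grad]

grl = GradReverse(lam=0.5)
out = grl.forward([1.0, 2.0])
grad = grl.backward([0.2, -0.4])
```

During training, a non-biometric branch (e.g., a clothing-description head) sits behind this layer, so its supervision signal actively removes that information from the identity features.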
https://arxiv.org/abs/2503.22912
The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts were in unanimous agreement that PAIR-X was an improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: this https URL
https://arxiv.org/abs/2503.22881
Advances in deep learning have significantly enhanced medical image analysis, yet the availability of large-scale medical datasets remains constrained by patient privacy concerns. We present EchoFlow, a novel framework designed to generate high-quality, privacy-preserving synthetic echocardiogram images and videos. EchoFlow comprises four key components: an adversarial variational autoencoder for defining an efficient latent representation of cardiac ultrasound images, a latent image flow matching model for generating accurate latent echocardiogram images, a latent re-identification model to ensure privacy by filtering images anatomically, and a latent video flow matching model for animating latent images into realistic echocardiogram videos conditioned on ejection fraction. We rigorously evaluate our synthetic datasets on the clinically relevant task of ejection fraction regression and demonstrate, for the first time, that downstream models trained exclusively on EchoFlow-generated synthetic datasets achieve performance parity with models trained on real datasets. We release our models and synthetic datasets, enabling broader, privacy-compliant research in medical ultrasound imaging at this https URL.
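The two flow matching components can be summarized by the standard (rectified-flow style) conditional flow matching objective; the paper's exact parameterization and conditioning (e.g., on ejection fraction for the video model) may differ, so this is a sketch of the general technique:

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}
\bigl\lVert v_\theta(x_t, t) - (x_1 - x_0) \bigr\rVert^2,
```

where $x_1$ is a real latent (an image or video in the autoencoder's latent space), and the learned velocity field $v_\theta$ is integrated from noise to a sample at generation time.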
https://arxiv.org/abs/2503.22357
Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
https://arxiv.org/abs/2503.21595
During criminal investigations, images of persons of interest directly influence the success of identification procedures. However, law enforcement agencies often face challenges related to the scarcity of high-quality images or their obsolescence, which can affect the accuracy and success of person search processes. This paper introduces a novel forensic mugshot augmentation framework aimed at addressing these limitations. Our approach enhances the identification probability of individuals by generating additional, high-quality images through customizable data augmentation techniques, while maintaining the biometric integrity and consistency of the original data. Several experimental results show that our method significantly improves identification accuracy and robustness across various forensic scenarios, demonstrating its effectiveness as a trustworthy tool for law enforcement applications. Index Terms: Digital Forensics, Person re-identification, Feature extraction, Data augmentation, Visual-Language models.
https://arxiv.org/abs/2503.19478
Real-world surveillance systems are dynamically evolving, requiring a person re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from the few-shot data of an unseen domain, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature-distribution perspective. Specifically, SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that SDA can enhance both few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID's state-of-the-art performance, which requires 700 to 1,000 IDs.
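While MDA and PFA operate on feature distributions, the prototype idea behind PFA can be sketched simply: the few-shot features of each identity are averaged into a prototype, and queries are matched to the nearest one. The function names and toy 2-D features below are assumptions for illustration, not the paper's implementation.

```python
def prototypes(features, labels):
    """Mean feature vector per identity (class prototype)."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def nearest_prototype(query, protos):
    """Assign the query to the identity with the closest (squared Euclidean) prototype."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda y: dist2(query, protos[y]))

# Few-shot features from a new domain (toy 2-D values).
feats = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
labels = ["id_a", "id_a", "id_b"]
protos = prototypes(feats, labels)
pred = nearest_prototype([1.5, 0.5], protos)
```

Anchoring adaptation on such prototypes, rather than fine-tuning everything on the few-shot set, is one common way to limit forgetting of earlier domains.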
https://arxiv.org/abs/2503.18469
Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset is available at: this https URL.
https://arxiv.org/abs/2503.17096
Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at this https URL.
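The Hungarian matching step that bridges instance-wise and image-wise distances solves a minimum-cost one-to-one assignment between persons detected in two views. The brute-force search below computes the same optimum as the Hungarian algorithm for tiny matrices (real pipelines would typically use `scipy.optimize.linear_sum_assignment`); the cost values are made up for illustration.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force optimal one-to-one assignment: the same result the Hungarian
    algorithm finds in O(n^3); fine for tiny illustrative matrices."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, list(best_perm)

# Rows: persons in view A; columns: persons in view B; entries: feature distances.
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.6, 0.3]]
total, match = min_cost_assignment(cost)
```

The resulting matching turns per-instance distances into an image-level association cost, which the synchronization labels can then supervise.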
https://arxiv.org/abs/2503.13739
Surgical domain models improve workflow optimization through automated predictions of each staff member's surgical role. However, mounting evidence indicates that team familiarity and individuality impact surgical outcomes. We present a novel staff-centric modeling approach that characterizes individual team members through their distinctive movement patterns and physical characteristics, enabling long-term tracking and analysis of surgical personnel across multiple procedures. To address the challenge of inter-clinic variability, we develop a generalizable re-identification framework that encodes sequences of 3D point clouds to capture shape and articulated motion patterns unique to each individual. Our method achieves 86.19% accuracy on realistic clinical data while maintaining 75.27% accuracy when transferring between different environments - a 12% improvement over existing methods. When used to augment markerless personnel tracking, our approach improves accuracy by over 50%. Through extensive validation across three datasets and the introduction of a novel workflow visualization technique, we demonstrate how our framework can reveal novel insights into surgical team dynamics and space utilization patterns, advancing methods to analyze surgical workflows and team coordination.
https://arxiv.org/abs/2503.13028
The aim of multiple object tracking (MOT) is to detect all objects in a video and bind them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks, which strive to distinguish all potential targets in a general representation, multi-object tracking typically focuses on differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior, tracking performance compared to state-of-the-art methods, while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at this https URL.
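The tailored FLD step can be illustrated in the classic two-class case: treating the historical features of two trajectories as the classes, the discriminant direction is w = S_w^{-1}(m_1 - m_2), which maximizes between-trajectory separation relative to within-trajectory scatter. The 2-D, two-trajectory sketch below is a simplification of the paper's multi-trajectory, history-conditioned variant.

```python
def fld_direction(class_a, class_b):
    """Two-class Fisher discriminant direction w = Sw^-1 (m_a - m_b), in 2-D."""
    def mean(pts):
        return [sum(p[i] for p in pts) / len(pts) for i in (0, 1)]

    def scatter(pts, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in (0, 1):
                for j in (0, 1):
                    s[i][j] += d[i] * d[j]
        return s

    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    s = [[sa[i][j] + sb[i][j] for j in (0, 1)] for i in (0, 1)]  # within-class scatter
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Historical ReID features of two trajectories (toy 2-D values).
w = fld_direction([(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)])
```

Projecting new detections' features onto such directions yields the per-sequence, training-free transformation the abstract describes.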
https://arxiv.org/abs/2503.12562
The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-consuming, and costly, and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-preserving alternative to collecting real data in the field. However, a data synthesis technique tailored specifically for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtains massive identity-preserving RGB-IR paired images by decoupling identity and modality, improving the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning Stable Diffusion (SD) on modality-specific data. DiVE extends text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments demonstrate that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method CAJ, trained with synthetic images, achieves an improvement of about 9% in mAP over the baseline on the LLCM dataset. Code: this https URL
https://arxiv.org/abs/2503.12472
Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy-sensitivity levels. In our new benchmark, we ensure that model training is done under conditions where: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share data. In this way, we simulate scenarios with strict privacy constraints that are closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to SOTAs trained with shared data (privacy-unrestricted). We hope this work offers a novel research entry point for deploying VI-ReID in ways that fit real-world scenarios and can benefit the community.
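A minimal sketch of the server-side aggregation such decentralized protocols build on is Federated Averaging: each camera or data entity trains locally and uploads only parameters, which the server averages weighted by local data size. The paper evaluates several federated algorithms; the flat parameter lists below are a simplification for illustration.

```python
def fedavg(client_weights, client_sizes):
    """Server-side FedAvg: average client parameters weighted by local data size."""
    total = sum(client_sizes)
    k = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(k)]

# Two cameras/entities train locally and share only parameters, never images.
merged = fedavg([[1.0, 2.0], [3.0, 4.0]], [10, 30])
```

The key privacy property is that raw images never leave a camera or data controller; only model updates travel to the server.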
https://arxiv.org/abs/2503.12232
Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. In this paper, we show that body embeddings perform better than face embeddings for cross-spectral person identification in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification - also known as Visible-Infrared Person Re-Identification (VI-ReID) - has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which enables matching of short-wave infrared (SWIR), MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide valuable insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body-embeddings, and effective training strategies for small datasets. Additionally, we show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores on the LLCM dataset.
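The finetuning recipe mentioned last (a simple combination of cross-entropy and triplet losses) can be written out directly. The margin value and toy embeddings below are assumptions; in practice both losses operate on batches of body embeddings and identity logits.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single sample (log-sum-exp stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Combined objective on toy 2-D embeddings and 2-class identity logits.
anchor, pos, neg = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
loss = triplet_loss(anchor, pos, neg) + cross_entropy([2.0, 0.0], 0)
```

Here the triplet term is already satisfied (the negative is far from the anchor), so the combined loss reduces to the cross-entropy term.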
https://arxiv.org/abs/2503.10931
Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.
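The test-time aggregation step can be sketched as follows: per-segment identity scores from the GCN encoder are averaged over the clip before taking the top match. The logit values are made up for illustration.

```python
def aggregate_segments(segment_logits):
    """Average per-segment identity logits over a clip, then take the top identity."""
    n = len(segment_logits)
    k = len(segment_logits[0])
    mean = [sum(seg[j] for seg in segment_logits) / n for j in range(k)]
    return mean.index(max(mean)), mean

# Three segments of one clip, scores over 3 identities; the last segment is noisy.
logits = [[2.0, 0.5, 0.1],
          [1.8, 0.7, 0.2],
          [0.2, 1.0, 0.3]]
pred, mean = aggregate_segments(logits)
```

Averaging suppresses the noisy segment, so the clip-level prediction still picks the correct identity.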
https://arxiv.org/abs/2503.10759
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
https://arxiv.org/abs/2503.10324
Text-to-image person re-identification (ReID) aims to retrieve images of a person of interest based on textual descriptions. One main challenge for this task is the high cost of manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
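The clustering stage of HAM can be illustrated with a single k-means iteration over toy style features: descriptions with similar styles land in the same cluster, and each cluster later receives its own learned prompt. The 2-D feature axes here are assumptions for illustration (the paper's style features come from the text descriptions themselves).

```python
def assign_clusters(style_feats, centroids):
    """Assign each description's style feature to its nearest centroid (k-means E-step)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda c: dist2(f, centroids[c]))
            for f in style_feats]

def update_centroids(style_feats, assign, k):
    """Recompute each centroid as the mean of its members (k-means M-step)."""
    out = []
    for c in range(k):
        members = [f for f, a in zip(style_feats, assign) if a == c]
        out.append([sum(col) / len(members) for col in zip(*members)])
    return out

# Toy 2-D style features for four human-written descriptions.
feats = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]]
assign = assign_clusters(feats, [[0.0, 0.0], [1.0, 1.0]])
centroids = update_centroids(feats, assign, 2)
```

Each resulting centroid plays the role of a clustering prototype; prompt learning then ties one learnable prompt to each prototype so the MLLM can imitate that style group.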
https://arxiv.org/abs/2503.09962
Medical image re-identification (MedReID) remains under-explored, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models in terms of differential features. Compared to single-image features, modeling inter-image differences better fits the re-identification problem, which involves discriminating between multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique in two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and models are available at \href{this https URL}{this https URL}.
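The ComPA mechanism, condensing the input into a continuous modality representation and using it to adjust a modality-agnostic model at runtime, resembles a hypernetwork or FiLM-style modulation. Below is a hedged numpy sketch of that pattern with assumed shapes; `mod_proj` and `film_W` are hypothetical learned matrices, and the real adapter is certainly more elaborate.

```python
import numpy as np

def compa_forward(x, base_W, mod_proj, film_W):
    # x: (D,) input feature; base_W: (Dout, D) modality-agnostic layer.
    # 1) Condense the input into a continuous modality code.
    m = np.tanh(mod_proj @ x)                  # (M,) modality representation
    # 2) Generate modality-specific parameters from that code.
    scale, shift = np.split(film_W @ m, 2)     # each (Dout,)
    # 3) Adjust the modality-agnostic layer's output at runtime.
    return (base_W @ x) * (1 + scale) + shift
```

Because the modulation parameters are a continuous function of the input, a single set of base weights can serve heterogeneous modalities without discrete per-modality branches.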
https://arxiv.org/abs/2503.08173
We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models are available at this https URL.
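The abstract describes integrating complementary visual-semantic information from three streams into a single viewpoint-invariant representation. A minimal late-fusion sketch in numpy, assuming each stream yields a fixed-length embedding, would look as follows; the paper's actual fusion may be weighted or learned rather than plain concatenation.

```python
import numpy as np

def fuse_streams(temporal, appearance, scale_attn):
    # Late-fusion sketch: concatenate the three per-stream embeddings
    # (temporal-spatial, normalized appearance, multi-scale attention)
    # and L2-normalize into one person representation.
    f = np.concatenate([temporal, appearance, scale_attn])
    return f / np.linalg.norm(f)
```

L2-normalization makes cosine-similarity retrieval across the gallery a simple dot product, the usual ReID matching setup.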
https://arxiv.org/abs/2503.08121
The Aerial-Ground Person Re-identification (AGPReID) task faces the main challenge of significant appearance variations caused by different viewpoints, which make identity matching difficult. To address this issue, previous methods attempt to reduce viewpoint differences by exploiting critical attributes and decoupling viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework is the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is independently collected and annotated by us, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at this https URL.
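The PRM's idea of re-calibrating prompts based on the input can be sketched as one cross-attention step in which learnable prompt tokens attend to the input patch tokens and are residually updated. This is an illustrative numpy sketch under assumed shapes, not the paper's module, which would add projections, normalization, and multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def recalibrate_prompts(prompts, tokens):
    # prompts: (P, D) learnable prompt tokens; tokens: (N, D) input patches.
    # Prompts attend to the input and are residually updated, so the
    # effective prompts adapt to each sample (and hence each viewpoint).
    attn = softmax(prompts @ tokens.T / np.sqrt(prompts.shape[1]))  # (P, N)
    return prompts + attn @ tokens                                  # (P, D)
```

Since the update depends on the input tokens, the same learned prompts yield different calibrated prompts for aerial and ground views, which matches the adaptive behavior the abstract attributes to PRM.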
https://arxiv.org/abs/2503.06965