Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although person re-identification features have proven effective, they become unreliable in challenging scenes where persons share similar appearances. Cross-view geometric constraints are therefore required for a more robust association. However, most existing approaches are either fully supervised, relying on ground-truth identity labels, or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization and propose Self-MVA, a self-supervised uncalibrated multi-view person association approach that uses no annotations. Specifically, we propose a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it with synchronization labels for supervision, applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at this https URL.
https://arxiv.org/abs/2503.13739
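As a rough illustration of the matching step described above, the sketch below uses Hungarian matching to turn instance-wise distances between two views into a single image-wise distance that synchronization labels could supervise. It is a minimal sketch with made-up feature dimensions, not the authors' implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def image_distance(feats_view_a, feats_view_b):
    # Instance-wise distances between every cross-view pair of persons.
    cost = cdist(feats_view_a, feats_view_b)
    # Hungarian matching finds the optimal one-to-one assignment.
    rows, cols = linear_sum_assignment(cost)
    # Image-wise distance: average cost of the matched pairs; synchronized
    # image pairs should end up with a smaller value than unsynchronized ones.
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
print(image_distance(rng.normal(size=(3, 16)), rng.normal(size=(4, 16))))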
Surgical domain models improve workflow optimization through automated predictions of each staff member's surgical role. However, mounting evidence indicates that team familiarity and individuality impact surgical outcomes. We present a novel staff-centric modeling approach that characterizes individual team members through their distinctive movement patterns and physical characteristics, enabling long-term tracking and analysis of surgical personnel across multiple procedures. To address the challenge of inter-clinic variability, we develop a generalizable re-identification framework that encodes sequences of 3D point clouds to capture shape and articulated motion patterns unique to each individual. Our method achieves 86.19% accuracy on realistic clinical data while maintaining 75.27% accuracy when transferring between different environments - a 12% improvement over existing methods. When used to augment markerless personnel tracking, our approach improves accuracy by over 50%. Through extensive validation across three datasets and the introduction of a novel workflow visualization technique, we demonstrate how our framework can reveal novel insights into surgical team dynamics and space utilization patterns, advancing methods to analyze surgical workflows and team coordination.
https://arxiv.org/abs/2503.13028
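The encoder described above ingests sequences of 3D point clouds and captures both shape and articulated motion. A toy stand-in for such an encoder, assuming a PointNet-style per-frame pooling followed by a recurrent layer over frames (the paper's actual architecture is not reproduced here):

import torch
import torch.nn as nn

class PointCloudSequenceEncoder(nn.Module):
    # Shared point-wise MLP + max pooling per frame (shape), GRU over frames (motion).
    def __init__(self, emb=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, emb))
        self.temporal = nn.GRU(emb, emb, batch_first=True)

    def forward(self, seq):                      # seq: (B, T, N, 3) point cloud sequence
        per_point = self.point_mlp(seq)          # (B, T, N, emb)
        per_frame = per_point.max(dim=2).values  # (B, T, emb) permutation-invariant shape code
        _, h = self.temporal(per_frame)
        return h[-1]                             # (B, emb) identity embedding

enc = PointCloudSequenceEncoder()
print(enc(torch.randn(2, 16, 256, 3)).shape)     # torch.Size([2, 128])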
The aim of multiple object tracking (MOT) is to detect all objects in a video and link them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks, which strive to distinguish all potential targets in a general representation, multi-object tracking typically focuses on differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior, tracking performance compared to state-of-the-art methods, while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at this https URL.
https://arxiv.org/abs/2503.12562
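The history-aware projection can be approximated with an off-the-shelf linear discriminant fit on historical trajectory features, treating trajectory IDs as classes; the snippet below is only a sketch of that idea with random placeholder features, not the paper's tailored FLD.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_reid_features(track_feats, track_ids, det_feats, dim=32):
    # Fit a Fisher-style discriminant on trajectory features (classes = track IDs)
    # and project current detection features into the more discriminative space.
    n_components = min(dim, len(set(track_ids)) - 1)
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(track_feats, track_ids)
    return lda.transform(det_feats), lda.transform(track_feats)

rng = np.random.default_rng(0)
track_feats = rng.normal(size=(60, 128))        # 6 trajectories x 10 historical samples
track_ids = np.repeat(np.arange(6), 10)
det_feats = rng.normal(size=(5, 128))           # current-frame detections
proj_dets, proj_tracks = project_reid_features(track_feats, track_ids, det_feats)
print(proj_dets.shape)                          # (5, 5): at most n_classes - 1 components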
The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-consuming, and costly, and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-preserving alternative to collecting real data in the field. However, a data synthesis technique tailored specifically for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtains massive identity-preserving RGB-IR paired images by decoupling identity and modality, improving the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning Stable Diffusion (SD) on modality-specific data. DiVE extends text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments demonstrate that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable improvements. In particular, the state-of-the-art method CAJ, trained with synthetic images, achieves an improvement of about 9% in mAP over the baseline on the LLCM dataset. Code: this https URL
https://arxiv.org/abs/2503.12472
Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we ensure that model training is done under conditions where: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy constraints, which are closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to SOTAs trained with shared data (privacy-unrestricted). We hope this work offers a novel research entry for deploying VI-ReID in a way that fits real-world scenarios and benefits the community.
https://arxiv.org/abs/2503.12232
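Decentralized training of this kind is typically driven by server-side aggregation of client updates. Below is a generic FedAvg-style sketch with hypothetical clients, shown only to make the setting concrete; it is not one of the benchmark's specific protocols.

from collections import OrderedDict
import torch

def fedavg(client_states, client_sizes):
    # Weighted average of client state_dicts: raw data stays isolated at each
    # camera/entity, and only model parameters are shared with the server.
    total = float(sum(client_sizes))
    averaged = OrderedDict()
    for key in client_states[0]:
        averaged[key] = sum(state[key].float() * (n / total)
                            for state, n in zip(client_states, client_sizes))
    return averaged

# Toy usage with two tiny "clients" holding 100 and 300 local samples.
net = torch.nn.Linear(4, 2)
states = [{k: v + i for k, v in net.state_dict().items()} for i in range(2)]
print(fedavg(states, [100, 300])["weight"].shape)   # torch.Size([2, 4])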
Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. In this paper, we show that body embeddings perform better than face embeddings for cross-spectral person identification in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification - also known as Visible-Infrared Person Re-Identification (VI-ReID) - has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which enables matching of short-wave infrared (SWIR), MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide valuable insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body-embeddings, and effective training strategies for small datasets. Additionally, we show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores on the LLCM dataset.
https://arxiv.org/abs/2503.10931
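The "simple combination of cross-entropy and triplet losses" mentioned above usually takes a form like the batch-hard variant sketched below; the margin and weighting are assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def combined_loss(logits, labels, embeddings, margin=0.3, w_tri=1.0):
    # Identity classification term.
    ce = F.cross_entropy(logits, labels)
    # Batch-hard triplet term on the embeddings.
    dist = torch.cdist(embeddings, embeddings)                   # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values        # farthest same-ID sample
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values  # closest other-ID sample
    tri = F.relu(hardest_pos - hardest_neg + margin).mean()
    return ce + w_tri * tri

emb = torch.randn(8, 256)
logits, labels = torch.randn(8, 10), torch.randint(0, 4, (8,))
print(combined_loss(logits, labels, emb).item())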
Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.
https://arxiv.org/abs/2503.10759
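One plausible form of the test-time aggregation described above is to average L2-normalized skeleton descriptors over a clip's segments; the segment count and descriptor size below are placeholders.

import numpy as np

def aggregate_segments(segment_descriptors):
    # Average normalized per-segment descriptors into one identity descriptor.
    d = segment_descriptors / np.linalg.norm(segment_descriptors, axis=1, keepdims=True)
    mean = d.mean(axis=0)
    return mean / np.linalg.norm(mean)

clip_descriptors = np.random.randn(6, 128)          # e.g. 6 segments, 128-dim GCN outputs
print(aggregate_segments(clip_descriptors).shape)   # (128,)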
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
https://arxiv.org/abs/2503.10324
Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
https://arxiv.org/abs/2503.09962
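A rough sketch of the clustering and uniform-sampling steps, with random placeholder vectors standing in for whatever style features are extracted from the human captions; the cluster and prototype counts are arbitrary.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
style_feats = rng.normal(size=(5000, 64))     # placeholder style features of human captions

# Group captions with similar description styles; each cluster would then get
# its own learnable prompt so the MLLM can mimic that annotator style.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(style_feats)
cluster_ids = kmeans.labels_

# Uniform sampling in the style feature space yields additional, more diverse
# clustering prototypes beyond those observed in the annotated captions.
lo, hi = style_feats.min(axis=0), style_feats.max(axis=0)
extra_prototypes = rng.uniform(lo, hi, size=(20, style_feats.shape[1]))
print(cluster_ids.shape, extra_prototypes.shape)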
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models in terms of differential features. Compared to single-image features, modeling the inter-image difference better fits the re-identification problem, which involves discriminating between multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique in two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and models are available at this https URL.
https://arxiv.org/abs/2503.08173
We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models are available at this https URL.
https://arxiv.org/abs/2503.08121
When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by using critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at this https URL.
https://arxiv.org/abs/2503.06965
Person Re-identification (ReID) systems identify individuals across images or video frames and play a critical role in various real-world applications. However, many ReID methods are influenced by sensitive attributes such as gender, pose, and body mass index (BMI), which vary in uncontrolled environments, leading to biases and reduced generalization. To address this, we extend the concept of expressivity to the body recognition domain to better understand how ReID models encode these attributes. Expressivity, defined as the mutual information between feature vector representations and specific attributes, is computed using a secondary neural network that takes feature and attribute vectors as inputs. This provides a quantitative framework for analyzing the extent to which sensitive attributes are embedded in the model's representations. We apply expressivity analysis to SemReID, a state-of-the-art self-supervised ReID model, and find that BMI consistently exhibits the highest expressivity scores in the model's final layers, underscoring its dominant role in feature encoding. In the final attention layer of the trained network, the expressivity order for body attributes is BMI > Pitch > Yaw > Gender, highlighting their relative importance in learned representations. Additionally, expressivity values evolve progressively across network layers and training epochs, reflecting a dynamic encoding of attributes during feature extraction. These insights emphasize the influence of body-related attributes on ReID models and provide a systematic methodology for identifying and mitigating attribute-driven biases. By leveraging expressivity analysis, we offer valuable tools to enhance the fairness, robustness, and generalization of ReID systems in diverse real-world settings.
https://arxiv.org/abs/2503.06451
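Expressivity as defined above is a mutual-information score produced by a secondary network over (feature, attribute) pairs. One standard way to set up such an estimator is a MINE-style Donsker-Varadhan bound, sketched below as an assumption about the general form rather than the paper's exact estimator.

import torch
import torch.nn as nn

class MIEstimator(nn.Module):
    # Secondary network T(f, a) that takes feature and attribute vectors as input.
    def __init__(self, feat_dim, attr_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + attr_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats, attrs):
        return self.net(torch.cat([feats, attrs], dim=1))

def mi_lower_bound(model, feats, attrs):
    # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
    joint = model(feats, attrs).mean()
    marginal = model(feats, attrs[torch.randperm(attrs.size(0))])
    log_mean_exp = torch.logsumexp(marginal, dim=0) - torch.log(torch.tensor(float(attrs.size(0))))
    return joint - log_mean_exp.squeeze()

# With an untrained estimator this is just a noisy number; in practice the
# estimator is trained to maximize the bound before the score is read off.
feats, attrs = torch.randn(256, 512), torch.randn(256, 4)
print(mi_lower_bound(MIEstimator(512, 4), feats, attrs).item())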
WiFi-based mobility monitoring in urban environments can provide valuable insights into pedestrian and vehicle movements. However, MAC address randomization introduces a significant obstacle in accurately estimating congestion levels and path trajectories. To this end, we consider radio frequency fingerprinting and re-identification for attributing WiFi traffic to emitting devices without the use of MAC addresses. We present MobRFFI, an AI-based device fingerprinting and re-identification framework for WiFi networks that leverages an encoder deep learning model to extract unique features based on WiFi chipset hardware impairments. It is entirely independent of frame type. When evaluated on the WiFi fingerprinting dataset WiSig, our approach achieves 94% and 100% device accuracy in multi-day and single-day re-identification scenarios, respectively. We also collect a novel dataset, MobRFFI, for granular multi-receiver WiFi device fingerprinting evaluation. Using the dataset, we demonstrate that the combination of fingerprints from multiple receivers boosts re-identification performance from 81% to 100% on a single-day scenario and from 41% to 100% on a multi-day scenario.
https://arxiv.org/abs/2503.02156
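The multi-receiver gain reported above comes from combining per-receiver fingerprints of the same device. A simple mean-fusion baseline followed by nearest-neighbor re-identification (not necessarily the paper's fusion rule) can be sketched as:

import numpy as np

def fuse_receivers(embeddings_per_rx):
    # Mean of L2-normalized fingerprint embeddings from multiple receivers.
    e = np.stack(embeddings_per_rx)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    fused = e.mean(axis=0)
    return fused / np.linalg.norm(fused)

def reidentify(query, gallery):
    # Nearest enrolled device by cosine similarity (gallery rows are L2-normalized).
    return int(np.argmax(gallery @ query))

rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 64))
gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
observations = [gallery[3] + 0.1 * rng.normal(size=64) for _ in range(4)]   # 4 receivers
print(reidentify(fuse_receivers(observations), gallery))                    # expected: 3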
Person re-identification (ReID) aims to extract accurate identity representation features. However, during feature extraction, individual samples are inevitably affected by noise (background, occlusions, and model limitations). Considering that features from the same identity follow a normal distribution around identity centers after training, we propose a Training-Free Feature Centralization ReID framework (Pose2ID) that aggregates same-identity features to reduce individual noise and enhance the stability of identity representation, while preserving the features' original distribution for subsequent strategies such as re-ranking. Specifically, to obtain samples of the same identity, we introduce two components: (1) Identity-Guided Pedestrian Generation, which leverages identity features to guide the generation process, producing high-quality images with diverse poses and ensuring identity consistency even in complex scenarios such as infrared; and (2) Feature Centralization, which explores each sample's potential positive samples from its neighborhood. Experiments demonstrate that our generative model exhibits strong generalization capabilities and maintains high identity consistency. With the Feature Centralization framework, we achieve impressive performance even with an ImageNet pre-trained model without ReID training, reaching mAP/Rank-1 of 52.81/78.92 on Market1501. Moreover, our method sets new state-of-the-art results across standard, cross-modality, and occluded ReID tasks, showcasing strong adaptability.
https://arxiv.org/abs/2503.00938
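A bare-bones version of the training-free centralization idea: each sample's feature is replaced by the mean of its high-confidence neighbors. The neighborhood size and similarity threshold below are assumptions, and the generation component is omitted entirely.

import numpy as np

def centralize_features(feats, k=5, thresh=0.6):
    # Aggregate each sample with its nearest neighbors (cosine similarity above
    # a threshold) to reduce individual noise while keeping the feature space.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    out = np.zeros_like(f)
    for i in range(len(f)):
        nbrs = np.argsort(-sim[i])[:k + 1]          # self plus k nearest neighbors
        nbrs = nbrs[sim[i, nbrs] >= thresh]         # keep only confident neighbors
        out[i] = f[nbrs].mean(axis=0)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

feats = np.random.randn(100, 256)                   # placeholder gallery features
print(centralize_features(feats).shape)             # (100, 256)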
Cloth-Changing Person Re-identification (CC-ReID) aims to solve the challenge of identifying individuals across different temporal-spatial scenarios, viewpoints, and clothing variations. This field is gaining increasing attention in big data research and public security domains. Existing ReID research primarily relies on face recognition, gait semantic recognition, and clothing-irrelevant feature identification, which perform relatively well in scenarios with high-quality clothing change videos and images. However, these approaches depend on either single features or simple combinations of multiple features, making further performance improvements difficult. Additionally, limitations such as missing facial information, challenges in gait extraction, and inconsistent camera parameters restrict the broader application of CC-ReID. To address the above limitations, we innovatively propose a Tri-Stream Dynamic Weight Network (TSDW) that requires only images. This dynamic weighting network consists of three parallel feature streams: facial features, head-limb features, and global features. Each stream specializes in extracting its designated features, after which a gating network dynamically fuses confidence levels. The three parallel feature streams enhance recognition performance and reduce the impact of any single feature failure, thereby improving model robustness. Extensive experiments on benchmark datasets (e.g., PRCC, Celeb-reID, VC-Clothes) demonstrate that our method significantly outperforms existing state-of-the-art approaches.
https://arxiv.org/abs/2503.00477
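The dynamic-weight idea can be sketched as a small gating network that predicts per-stream softmax weights before fusion; the layer sizes and feature dimension below are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class GatedTriStreamFusion(nn.Module):
    # Fuse facial, head-limb, and global features with confidence weights from a gate.
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 128), nn.ReLU(),
                                  nn.Linear(128, 3), nn.Softmax(dim=1))

    def forward(self, face, head_limb, global_feat):
        w = self.gate(torch.cat([face, head_limb, global_feat], dim=1))   # (B, 3)
        streams = torch.stack([face, head_limb, global_feat], dim=1)      # (B, 3, D)
        return (w.unsqueeze(-1) * streams).sum(dim=1)                     # (B, D)

fusion = GatedTriStreamFusion()
face, head_limb, global_feat = (torch.randn(4, 512) for _ in range(3))
print(fusion(face, head_limb, global_feat).shape)    # torch.Size([4, 512])

Because the weights are predicted per sample, a stream that fails (for example, an occluded face) can be down-weighted instead of corrupting the fused representation.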
Visible-Infrared Person Re-Identification (VI-ReID) plays a crucial role in applications such as search and rescue, infrastructure protection, and nighttime surveillance. However, it faces significant challenges due to modality discrepancies, varying illumination, and frequent occlusions. To overcome these obstacles, we propose AMINet, an Adaptive Modality Interaction Network. AMINet employs multi-granularity feature extraction to capture comprehensive identity attributes from both full-body and upper-body images, improving robustness against occlusions and background clutter. The model integrates an interactive feature fusion strategy for deep intra-modal and cross-modal alignment, enhancing generalization and effectively bridging the RGB-IR modality gap. Furthermore, AMINet utilizes phase congruency for robust, illumination-invariant feature extraction and incorporates an adaptive multi-scale kernel MMD to align feature distributions across varying scales. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving a Rank-1 accuracy of 74.75% on SYSU-MM01, surpassing the baseline by 7.93% and outperforming the current state-of-the-art by 3.95%.
https://arxiv.org/abs/2502.21163
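For reference, a generic multi-scale RBF-kernel MMD between RGB and IR feature batches is shown below; the adaptive bandwidth selection implied by AMINet is not modeled, and the bandwidths are arbitrary.

import torch

def multi_kernel_mmd(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    # Squared MMD with a sum of RBF kernels at several bandwidths, used to
    # align the distributions of RGB and IR features.
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rgb_feats = torch.randn(32, 256)
ir_feats = torch.randn(32, 256) + 0.5   # shifted to mimic a modality gap
print(multi_kernel_mmd(rgb_feats, ir_feats).item())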
Person re-identification (Re-ID) is a critical task in human-centric intelligent systems, enabling consistent identification of individuals across different camera views using multi-modal query information. Recent studies have successfully integrated large vision-language models (LVLMs) with person Re-ID, yielding promising results. However, existing LVLM-based methods face several limitations. They rely on extracting textual embeddings from fixed templates, which are used either as intermediate features for image representation or for prompt tuning in domain-specific tasks. Furthermore, they are unable to adopt the visual question answering (VQA) inference format, significantly restricting their broader applicability. In this paper, we propose ChatReID, a novel, versatile, one-for-all person Re-ID framework. Our approach introduces a Hierarchical Progressive Tuning (HPT) strategy, which ensures fine-grained identity-level retrieval by progressively refining the model's ability to distinguish pedestrian identities. Extensive experiments demonstrate that our approach outperforms SOTA methods across ten benchmarks in four different Re-ID settings, offering enhanced flexibility and user-friendliness. ChatReID provides a scalable, practical solution for real-world person Re-ID applications, enabling effective multi-modal interaction and fine-grained identity discrimination.
https://arxiv.org/abs/2502.19958
Person re-identification (re-id) models are vital in security surveillance systems, requiring transferable adversarial attacks to explore their vulnerabilities. Recently, vision-language model (VLM) based attacks have shown superior transferability by attacking the generalized image and textual features of VLMs, but they lack comprehensive feature disruption due to their overemphasis on discriminative semantics in the integral representation. In this paper, we introduce the Attribute-aware Prompt Attack (AP-Attack), a novel method that leverages a VLM's image-text alignment capability to explicitly disrupt fine-grained semantic features of pedestrian images by destroying attribute-specific textual embeddings. To obtain personalized textual descriptions for individual attributes, textual inversion networks are designed to map pedestrian images to pseudo tokens that represent semantic embeddings, trained in a contrastive learning manner with images and a predefined prompt template that explicitly describes the pedestrian attributes. The inverted benign and adversarial fine-grained textual semantics help the attacker conduct thorough disruptions effectively, enhancing the transferability of adversarial examples. Extensive experiments show that AP-Attack achieves state-of-the-art transferability, significantly outperforming previous methods by 22.9% in mean Drop Rate in cross-model and cross-dataset attack scenarios.
https://arxiv.org/abs/2502.19697
Vehicle re-identification (Re-ID) is a crucial task in intelligent transportation systems (ITS), aimed at retrieving and matching the same vehicle across different surveillance cameras. Numerous studies have explored methods to enhance vehicle Re-ID by focusing on semantic enhancement. However, these methods often rely on additional annotated information to enable models to extract effective semantic features, which brings many limitations. In this work, we propose a CLIP-based Semantic Enhancement Network (CLIP-SENet), an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes, facilitating the generation of more robust semantic feature representations. Inspired by the zero-shot solutions for downstream tasks presented by large-scale vision-language models, we leverage the powerful cross-modal descriptive capabilities of the CLIP image encoder to initially extract general semantic information. Instead of using a text encoder for semantic alignment, we design an adaptive fine-grained enhancement module (AFEM) to adaptively enhance this general semantic information at a fine-grained level to obtain robust semantic feature representations. These features are then fused with common Re-ID appearance features to further refine the distinctions between vehicles. Our comprehensive evaluation on three benchmark datasets demonstrates the effectiveness of CLIP-SENet. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
https://arxiv.org/abs/2502.16815
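Extracting general semantics with a frozen CLIP image encoder and fusing them with ordinary appearance features can be sketched as below, using the Hugging Face transformers CLIP wrapper; the AFEM refinement itself is not reproduced, and the fusion here is plain concatenation.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_feature(image):
    # General semantic information from the CLIP image encoder (no text encoder).
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = clip.get_image_features(**inputs)            # (1, 512)
    return feat / feat.norm(dim=-1, keepdim=True)

def fuse(semantic, appearance):
    # Placeholder fusion of semantic and Re-ID appearance features.
    return torch.cat([semantic, appearance / appearance.norm(dim=-1, keepdim=True)], dim=-1)

sem = semantic_feature(Image.new("RGB", (224, 224)))        # dummy vehicle crop
appearance = torch.randn(1, 2048)                           # placeholder backbone feature
print(fuse(sem, appearance).shape)                          # torch.Size([1, 2560])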