Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they struggle with diverse occlusion scenarios not seen during training and with feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which mitigates both challenges simultaneously. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring purified, discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model diverse occlusion types, thereby guiding occlusion-aware robust feature representation. Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent trained via deep reinforcement learning identifies low-quality patch tokens of holistic images that carry noisy, negative information and substitutes these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge and learns robust representations regardless of occlusion interference.
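As a rough illustration of the token-substitution step in the Feature Erasing and Purification Module, the sketch below replaces the lowest-scoring patch tokens of a holistic image with a shared learnable embedding. The scoring head stands in for the deep-RL agent, and the module name, shapes, and erase budget are assumptions rather than the paper's implementation.

```python
# Minimal sketch of erasing low-quality patch tokens and substituting a learnable
# embedding. A plain scoring head replaces the RL-trained agent for illustration.
import torch
import torch.nn as nn

class TokenPurifier(nn.Module):
    def __init__(self, dim: int, num_erase: int = 16):
        super().__init__()
        self.num_erase = num_erase
        self.score_head = nn.Linear(dim, 1)                      # stand-in for the RL agent's policy
        self.fill_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable replacement embedding

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [B, N, D] patch embeddings from the holistic branch
        scores = self.score_head(patch_tokens).squeeze(-1)                     # [B, N] token quality
        erase_idx = scores.topk(self.num_erase, dim=1, largest=False).indices  # lowest-quality tokens
        erase_mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, erase_idx, True)
        return torch.where(erase_mask.unsqueeze(-1),
                           self.fill_token.expand_as(patch_tokens), patch_tokens)

tokens = torch.randn(2, 128, 768)
print(TokenPurifier(768)(tokens).shape)   # torch.Size([2, 128, 768])
```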
https://arxiv.org/abs/2507.08520
We propose \textbf{KeyRe-ID}, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73\% mAP and 97.32\% Rank-1 accuracy on MARS, and 96.00\% Rank-1 and 100.0\% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
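The local branch's keypoint-driven partitioning can be pictured as follows: keypoint y-coordinates split a frame's feature map into body-part regions that are pooled into part descriptors. The part groupings, feature sizes, and mean pooling are illustrative assumptions; the paper's dynamic segmentation and temporal aggregation are more involved.

```python
# Illustrative part pooling guided by keypoint rows (e.g. head / torso / legs groups).
import torch

def part_pool(feat_map: torch.Tensor, kpts_y: torch.Tensor, bounds=((0, 5), (5, 11), (11, 17))):
    # feat_map: [C, H, W] frame features; kpts_y: [17] keypoint rows in [0, H)
    parts = []
    for lo, hi in bounds:                                   # keypoint index groups per body part
        y0 = int(kpts_y[lo:hi].min().item())
        y1 = int(kpts_y[lo:hi].max().item()) + 1
        parts.append(feat_map[:, y0:y1, :].mean(dim=(1, 2)))  # [C] part descriptor
    return torch.stack(parts)                                # [num_parts, C]

feat = torch.randn(256, 24, 8)
kpts = torch.randint(0, 24, (17,)).float()
print(part_pool(feat, kpts).shape)   # torch.Size([3, 256])
```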
https://arxiv.org/abs/2507.07393
The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textually describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework's practical utility. Code is available at this https URL.
https://arxiv.org/abs/2507.01504
Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations. First, we propose Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Second, Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments on several popular benchmarks (Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC) demonstrate the strong performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.
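A minimal sketch of the gating idea behind SVIP: a sigmoid gate, conditioned on the image feature and the prompt tokens, decides how much projected visual information to inject into each learnable text-prompt token. Dimensions, the gate design, and the fusion form are assumptions, not the paper's CLIP-integrated module.

```python
# Sketch of cross-modal gated injection of visual features into learnable prompt tokens.
import torch
import torch.nn as nn

class GatedPromptFusion(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, n_ctx=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)   # learnable prompt tokens
        self.proj = nn.Linear(vis_dim, txt_dim)
        self.gate = nn.Sequential(nn.Linear(vis_dim + txt_dim, txt_dim), nn.Sigmoid())

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: [B, vis_dim] global visual feature of a person image
        B = img_feat.size(0)
        ctx = self.ctx.unsqueeze(0).expand(B, -1, -1)                  # [B, n_ctx, txt_dim]
        vis = self.proj(img_feat).unsqueeze(1).expand_as(ctx)          # projected visual cue
        g = self.gate(torch.cat([img_feat.unsqueeze(1).expand(B, ctx.size(1), -1), ctx], dim=-1))
        return ctx + g * vis                                           # gated injection

fusion = GatedPromptFusion()
print(fusion(torch.randn(8, 768)).shape)   # torch.Size([8, 4, 512])
```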
https://arxiv.org/abs/2507.00506
The deployment of robot assistants in large indoor spaces has seen significant growth, with escorting tasks becoming a key application. However, most current escorting robots primarily rely on navigation-focused strategies, assuming that the person being escorted will follow without issue. In crowded environments, this assumption often falls short, as individuals may struggle to keep pace, become obstructed, get distracted, or need to stop unexpectedly. As a result, conventional robotic systems are often unable to provide effective escorting services due to their limited understanding of human movement dynamics. To address these challenges, an effective escorting robot must continuously detect and interpret human actions during the escorting process and adjust its movement accordingly. However, there is currently no existing dataset designed specifically for human action detection in the context of escorting. Given that escorting often occurs in crowded environments, where other individuals may enter the robot's camera view, the robot also needs to identify the specific human it is escorting (the subject) before predicting their actions. Since no existing model performs both person re-identification and action prediction in real time, we propose a novel neural network architecture that can accomplish both tasks. This enables the robot to adjust its speed dynamically based on the escortee's movements and seamlessly resume escorting after any disruption. In comparative evaluations against strong baselines, our system demonstrates superior efficiency and effectiveness, showcasing its potential to significantly improve robotic escorting services in complex, real-world scenarios.
https://arxiv.org/abs/2506.23573
Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset's complexity. For additional details, please refer to the official website at this https URL.
https://arxiv.org/abs/2506.22843
Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras for perceiving human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton<->Pointcloud<->IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
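The noise-contrastive objective that binds two of the modalities (say, LiDAR point-cloud and skeleton embeddings) can be sketched as a symmetric InfoNCE loss over a batch of time-synchronized pairs; encoders and the extension to all four modalities are omitted, and the temperature value is an assumption.

```python
# Symmetric InfoNCE between two modality embeddings of time-synchronized samples.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # z_a, z_b: [B, D] embeddings of aligned samples from two modalities
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # [B, B] pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```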
https://arxiv.org/abs/2506.13897
Person re-identification (ReID) has evolved from handcrafted feature-based methods to deep learning approaches and, more recently, to models incorporating large language models (LLMs). Early methods struggled with variations in lighting, pose, and viewpoint, but deep learning addressed these issues by learning robust visual features. Building on this, LLMs now enable ReID systems to integrate semantic and contextual information through natural language. This survey traces that full evolution and offers one of the first comprehensive reviews of ReID approaches that leverage LLMs, where textual descriptions are used as privileged information to improve visual matching. A key contribution is the use of dynamic, identity-specific prompts generated by GPT-4o, which enhance the alignment between images and text in vision-language ReID systems. Experimental results show that these descriptions improve accuracy, especially in complex or ambiguous cases. To support further research, we release a large set of GPT-4o-generated descriptions for standard ReID datasets. By bridging computer vision and natural language processing, this survey offers a unified perspective on the field's development and outlines key future directions such as better prompt design, cross-modal transfer learning, and real-world adaptability.
https://arxiv.org/abs/2506.13039
Person Re-identification (ReID) aims to retrieve images of the same individual captured across non-overlapping camera views, making it a critical component of intelligent surveillance systems. Traditional ReID methods assume that the training and test domains share similar characteristics and primarily focus on learning discriminative features within a given domain. However, they often fail to generalize to unseen domains due to domain shifts caused by variations in viewpoint, background, and lighting conditions. To address this issue, Domain-Adaptive ReID (DA-ReID) methods have been proposed. These approaches incorporate unlabeled target domain data during training and improve performance by aligning feature distributions between source and target domains. Domain-Generalizable ReID (DG-ReID) tackles a more realistic and challenging setting by aiming to learn domain-invariant features without relying on any target domain data. Recent methods have explored various strategies to enhance generalization across diverse environments, but the field remains relatively underexplored. In this paper, we present a comprehensive survey of DG-ReID. We first review the architectural components of DG-ReID including the overall setting, commonly used backbone networks and multi-source input configurations. Then, we categorize and analyze domain generalization modules that explicitly aim to learn domain-invariant and identity-discriminative representations. To examine the broader applicability of these techniques, we further conduct a case study on a related task that also involves distribution shifts. Finally, we discuss recent trends, open challenges, and promising directions for future research in DG-ReID. To the best of our knowledge, this is the first systematic survey dedicated to DG-ReID.
https://arxiv.org/abs/2506.12413
In real-world scenarios, person re-identification (ReID) is expected to identify a person-of-interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at this https URL.
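One way to picture the unified encoding with multi-expert routing is a per-modality expert whose outputs are fused into a single query embedding, so any subset of the five modalities can drive retrieval. The expert design (plain linear layers) and the mean fusion below are assumptions for illustration only, not the ReID5o architecture.

```python
# Per-modality experts routed and fused into one query embedding for any modality subset.
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    def __init__(self, dims: dict, out_dim: int = 512):
        super().__init__()
        self.experts = nn.ModuleDict({m: nn.Linear(d, out_dim) for m, d in dims.items()})

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: {modality_name: [B, dim]} for any subset of the known modalities
        feats = [self.experts[m](x) for m, x in inputs.items()]
        return torch.stack(feats).mean(dim=0)     # [B, out_dim] fused query embedding

router = ModalityRouter({"rgb": 768, "infrared": 768, "sketch": 512, "text": 512})
q = router({"sketch": torch.randn(4, 512), "text": torch.randn(4, 512)})
print(q.shape)   # torch.Size([4, 512])
```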
https://arxiv.org/abs/2506.09385
Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of biometric tasks, including face and body recognition. In this work, we adapt a ViT model pretrained on visible (VIS) imagery to the challenging problem of cross-spectral body recognition, which involves matching images captured in the visible and infrared (IR) domains. Recent ViT architectures have explored incorporating additional embeddings beyond traditional positional embeddings. Building on this idea, we integrate Side Information Embedding (SIE) and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only camera information - without explicitly incorporating domain information - achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in visible-infrared (VI) Re-ID remain largely underexplored - primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IARPA Janus Benchmark Multi-Domain Face (IJB-MDF) dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.
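The camera-only Side Information Embedding can be sketched as a learnable per-camera vector added to every token before the transformer; the scale factor and dimensions below are illustrative assumptions.

```python
# Camera-only SIE: add a learnable per-camera embedding to all patch (+[CLS]) tokens.
import torch
import torch.nn as nn

class CameraSIE(nn.Module):
    def __init__(self, num_cameras: int, dim: int, scale: float = 1.0):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))
        nn.init.trunc_normal_(self.cam_embed, std=0.02)
        self.scale = scale

    def forward(self, tokens: torch.Tensor, cam_ids: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, D] token embeddings, cam_ids: [B] integer camera indices
        return tokens + self.scale * self.cam_embed[cam_ids].unsqueeze(1)

sie = CameraSIE(num_cameras=6, dim=768)
out = sie(torch.randn(4, 197, 768), torch.tensor([0, 2, 5, 1]))
print(out.shape)   # torch.Size([4, 197, 768])
```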
https://arxiv.org/abs/2506.08953
Person Re-Identification (Re-ID) is an important task in video surveillance applications such as tracking people, finding people in public places, or analysing customer behavior in supermarkets. Although many works have addressed this problem, challenges remain, including large-scale datasets, imbalanced data, viewpoint variation, and fine-grained data (attributes); moreover, local features are not employed at the semantic level in the online stage of Re-ID, and the imbalanced-data problem of attributes is not taken into consideration. This paper proposes a unified Re-ID system consisting of three main modules: Pedestrian Attribute Ontology (PAO), Local Multi-task DCNN (Local MDCNN), and Imbalance Data Solver (IDS). The novelty of our Re-ID system lies in the mutual support of PAO, Local MDCNN, and IDS, which exploits the inner-group correlations of attributes and pre-filters mismatched candidates from the gallery set based on semantic information such as fashion and facial attributes, thereby addressing the imbalanced attribute data without adjusting the network architecture or relying on data augmentation. We experimented on the well-known Market1501 dataset. The experimental results show the effectiveness of our Re-ID system, which achieves higher performance on Market1501 than several state-of-the-art Re-ID methods.
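The semantic pre-filtering idea, i.e. pruning gallery candidates whose predicted attributes disagree with the query before appearance matching, can be sketched as follows. The binary attribute encoding, mismatch threshold, and cosine ranking are assumptions; the paper's ontology-driven attribute groups are richer.

```python
# Attribute-based gallery pre-filtering followed by cosine ranking on appearance features.
import numpy as np

def prefilter_and_rank(q_feat, g_feats, q_attr, g_attrs, max_mismatch=2):
    # q_feat: [D], g_feats: [G, D] appearance features; q_attr: [A], g_attrs: [G, A] in {0, 1}
    mismatch = np.abs(g_attrs - q_attr).sum(axis=1)            # attribute disagreement per candidate
    keep = np.where(mismatch <= max_mismatch)[0]               # semantic pre-filter
    sims = g_feats[keep] @ q_feat / (
        np.linalg.norm(g_feats[keep], axis=1) * np.linalg.norm(q_feat) + 1e-12)
    return keep[np.argsort(-sims)]                             # surviving gallery indices, best first

rng = np.random.default_rng(0)
idx = prefilter_and_rank(rng.normal(size=128), rng.normal(size=(50, 128)),
                         rng.integers(0, 2, 8), rng.integers(0, 2, (50, 8)))
print(idx[:5])
```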
https://arxiv.org/abs/2506.04143
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at this https URL.
https://arxiv.org/abs/2506.02439
The increasing popularity of egocentric cameras has generated growing interest in studying multi-camera interactions in shared environments. Although large-scale datasets such as Ego4D and Ego-Exo4D have propelled egocentric vision research, interactions between multiple camera wearers remain underexplored, a key gap for applications like immersive learning and collaborative robotics. To bridge this, we present TF2025, an expanded dataset with synchronized first- and third-person views. In addition, we introduce a sequence-based method to identify first-person wearers in third-person footage, combining motion cues and person re-identification.
https://arxiv.org/abs/2506.00394
In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at: this https URL.
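The STFS component, used only during training, can be pictured as sampling temporal and spatial feature subsequences so the model must recover identity semantics from partial views; the sampling ratios and tensor layout below are assumptions, and the spiking backbone and SSAM are not modeled here.

```python
# Training-time sampling of temporal and spatial feature subsequences (STFS-style idea).
import torch

def sample_subsequences(feats: torch.Tensor, t_ratio=0.5, s_ratio=0.5):
    # feats: [T, N, D] per-time-step spatial token features of one event sequence
    T, N, _ = feats.shape
    t_idx = torch.randperm(T)[: max(1, int(T * t_ratio))].sort().values   # temporal subsequence
    s_idx = torch.randperm(N)[: max(1, int(N * s_ratio))].sort().values   # spatial subsequence
    return feats[t_idx], feats[:, s_idx]     # two partial views sharing the same identity label

temporal_view, spatial_view = sample_subsequences(torch.randn(16, 64, 256))
print(temporal_view.shape, spatial_view.shape)   # [8, 64, 256] and [16, 32, 256]
```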
https://arxiv.org/abs/2505.24401
Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.
https://arxiv.org/abs/2505.20540
Person re-identification (ReID) models are known to suffer from camera bias, where learned representations cluster according to camera viewpoints rather than identity, leading to significant performance degradation under (inter-camera) domain shifts in real-world surveillance systems when new cameras are added to camera networks. State-of-the-art test-time adaptation (TTA) methods, largely designed for classification tasks, rely on classification entropy-based objectives that fail to generalize well to ReID, thus making them unsuitable for tackling camera bias. In this paper, we introduce DART$^3$, a TTA framework specifically designed to mitigate camera-induced domain shifts in person ReID. DART$^3$ (Distance-Aware Retrieval Tuning at Test Time) leverages a distance-based objective that aligns better with image retrieval tasks like ReID by exploiting the correlation between nearest-neighbor distance and prediction error. Unlike prior ReID-specific domain adaptation methods, DART$^3$ requires no source data, architectural modifications, or retraining, and can be deployed in both fully black-box and hybrid settings. Empirical evaluations on multiple ReID benchmarks indicate that DART$^3$ and DART$^3$ LITE, a lightweight alternative to the approach, consistently outperform state-of-the-art TTA baselines, making them a viable online-learning option for mitigating the adverse effects of camera bias.
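In the spirit of the distance-aware objective, a test-time loop might tune a small set of parameters to shrink each query's distance to its nearest gallery neighbor, instead of minimizing classification entropy. The linear adapter, optimizer settings, and number of steps below are assumptions; the paper's exact objective, parameter selection, and black-box deployment differ.

```python
# Distance-based test-time tuning: minimize nearest-neighbor distance of adapted queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

adapter = nn.Linear(512, 512)                 # stand-in for the adapted parameters
nn.init.eye_(adapter.weight); nn.init.zeros_(adapter.bias)
opt = torch.optim.SGD(adapter.parameters(), lr=1e-3)

gallery = F.normalize(torch.randn(200, 512), dim=-1)       # frozen gallery embeddings
for _ in range(10):                                          # a few online adaptation steps
    queries = F.normalize(torch.randn(32, 512), dim=-1)      # incoming batch from a new camera
    q = F.normalize(adapter(queries), dim=-1)
    dists = torch.cdist(q, gallery)                          # [32, 200] query-gallery distances
    loss = dists.min(dim=1).values.mean()                    # nearest-neighbor distance objective
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```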
https://arxiv.org/abs/2505.18337
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at this https URL.
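The differentiable channel selection can be pictured as a learnable per-channel gate applied before queries and keys are formed, so attention weights are computed from informative channels. The sigmoid relaxation and sizes below are assumptions; the paper derives the selection from an Information Bottleneck objective optimized through a variational upper bound.

```python
# Self-attention whose attention weights are computed from softly selected channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCSSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.channel_logits = nn.Parameter(torch.zeros(dim))   # one gate per channel
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D] token features
        gate = torch.sigmoid(self.channel_logits)               # differentiable channel selection
        q, k = self.q(x * gate), self.k(x * gate)                # attention built on selected channels
        attn = F.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return attn @ self.v(x)

print(DCSSelfAttention(256)(torch.randn(2, 50, 256)).shape)   # torch.Size([2, 50, 256])
```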
https://arxiv.org/abs/2505.08961
Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent times due to its non-intrusive verification. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of performing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person's appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing an OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.
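The classification stage can be sketched as a LightGBM classifier over non-sequential landmark features, one row per frame; the raw flattened (x, y) features and hyperparameters below are assumptions rather than the paper's exact numerical dataset.

```python
# LightGBM identity classification over per-frame skeletal landmark features.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
n_frames, n_landmarks, n_ids = 600, 33, 5
X = rng.normal(size=(n_frames, n_landmarks * 2))   # (x, y) per landmark, one row per frame
y = rng.integers(0, n_ids, n_frames)                # identity label per frame

clf = LGBMClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X[:500], y[:500])
print("frame-level accuracy:", (clf.predict(X[500:]) == y[500:]).mean())
```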
https://arxiv.org/abs/2505.08801
Video surveillance image analysis and processing is a challenging field in computer vision, with one of its most difficult tasks being Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful feature types, Convolutional Neural Network (CNN) features and Local Maximal Occurrence (LOMO) features, are modeled on multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system's accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely-used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
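The fusion-and-matching idea can be sketched by stacking the differently sized CNN and LOMO descriptors into one tensor and comparing probe and gallery with cosine similarity. The zero-padding used here to reconcile dimensions and the typical descriptor lengths are assumptions, and the TXQDA subspace learning step is omitted.

```python
# Simplified stacking of CNN and LOMO descriptors into one tensor, then cosine matching.
import numpy as np

def fuse(cnn_feat: np.ndarray, lomo_feat: np.ndarray) -> np.ndarray:
    d = max(cnn_feat.size, lomo_feat.size)
    stacked = np.zeros((2, d))                               # one row (mode) per feature type
    stacked[0, : cnn_feat.size] = cnn_feat / (np.linalg.norm(cnn_feat) + 1e-12)
    stacked[1, : lomo_feat.size] = lomo_feat / (np.linalg.norm(lomo_feat) + 1e-12)
    return stacked

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
probe = fuse(rng.normal(size=2048), rng.normal(size=26960))    # typical CNN / LOMO lengths (assumed)
gallery = fuse(rng.normal(size=2048), rng.normal(size=26960))
print(cosine(probe, gallery))
```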
https://arxiv.org/abs/2505.15825