Human following is a crucial feature of human-robot interaction, yet it poses numerous challenges to mobile agents in real-world scenarios. Some major hurdles are that the target person may be in a crowd, obstructed by others, or facing away from the agent. To tackle these challenges, we present a novel person re-identification module composed of three parts: a 360-degree visual registration, a neural-based person re-identification using human faces and torsos, and a motion tracker that records and predicts the target person's future position. Our human-following system also addresses other challenges, including identifying fast-moving targets with low latency, searching for targets that move out of the camera's sight, collision avoidance, and adaptively choosing different following mechanisms based on the distance between the target person and the mobile agent. Extensive experiments show that our proposed person re-identification module significantly enhances the human-following feature compared to other baseline variants.
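The motion tracker's "predict the target person's future position" step can be illustrated with a minimal sketch. The abstract does not say which motion model is used, so the constant-velocity extrapolation below is an assumption, and the name `predict_position` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_position(track: List[Tuple[float, Point]], horizon: float) -> Point:
    """Extrapolate a position `horizon` seconds ahead.

    `track` holds timestamped observations (t, (x, y)). A constant-velocity
    model is assumed here; the paper's tracker may work differently.
    """
    if len(track) < 2:
        raise ValueError("need at least two observations")
    (t0, (x0, y0)), (t1, (x1, y1)) = track[-2], track[-1]
    dt = t1 - t0
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # Velocity estimated from the two most recent observations.
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return (x1 + vx * horizon, y1 + vy * horizon)
```

A prediction of this kind could seed the search when the target leaves the camera's sight, although how the system actually uses it is not stated in the abstract.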
In recent years, there has been increased interest in the identification and re-identification of people at long distances, such as from rooftop cameras, UAV cameras, street cameras, and others. Such recognition needs to go beyond the face and use whole-body markers such as gait. However, datasets to train and test such recognition algorithms are not widely available, and fewer still are labeled. This paper introduces DIOR, a framework for data collection and semi-automated annotation, and also provides a dataset with 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200 thousand frames from a long-range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. Additionally, for outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 RGB cameras, successfully achieving precise skeleton labeling on far-away subjects, even when their height is limited to a mere 20-25 pixels within an RGB frame. On publication, we will make our pipeline open for others to use.
Robot person following (RPF) is a crucial capability in human-robot interaction (HRI) applications, allowing a robot to persistently follow a designated person. In practical RPF scenarios, the person is often occluded by other objects or people. Consequently, it is necessary to re-identify the person when they re-appear within the robot's field of view. Previous person re-identification (ReID) approaches to person following rely on offline-trained features and short-term experiences. Such an approach i) has a limited capacity to generalize across scenarios; and ii) often fails to re-identify the person when their re-appearance falls outside the learned domain represented by the short-term experiences. Based on this observation, we propose a ReID framework for RPF that leverages long-term experiences. The experiences are maintained by a loss-guided keyframe selection strategy to enable online continual learning of the appearance model. Our experiments demonstrate that even in the presence of severe appearance changes and distractions from visually similar people, the proposed method can still re-identify the person more accurately than state-of-the-art methods.
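One way to picture a loss-guided keyframe memory is a fixed-capacity buffer that keeps the frames on which the appearance model currently incurs the highest loss. This is only a sketch under that assumption: the abstract does not give the actual selection rule, and `KeyframeMemory` is a hypothetical name.

```python
import heapq
from itertools import count
from typing import Any, List, Tuple


class KeyframeMemory:
    """Fixed-size memory that retains the highest-loss frames (assumed rule)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Min-heap of (loss, insertion index, frame); the index breaks ties
        # so frames themselves are never compared.
        self._heap: List[Tuple[float, int, Any]] = []
        self._tie = count()

    def add(self, frame: Any, loss: float) -> None:
        entry = (loss, next(self._tie), frame)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif loss > self._heap[0][0]:
            # Evict the currently easiest (lowest-loss) keyframe.
            heapq.heapreplace(self._heap, entry)

    def frames(self) -> List[Any]:
        """Stored frames, hardest first."""
        return [frame for _, _, frame in sorted(self._heap, reverse=True)]
```

Keeping hard examples is one plausible reading of "loss-guided"; a strategy that balances easy and hard frames would need a different eviction rule.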
Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., detection and Re-IDentification (ReID). Despite significant progress, two major challenges remain: 1) Detection-prior modules in previous methods are suboptimal for the ReID task. 2) The collaboration between the two sub-tasks is ignored. To alleviate these issues, we present a novel Person Search framework based on the Diffusion model, PSDiff. PSDiff formulates person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Unlike existing methods that follow the Detection-to-ReID paradigm, our denoising paradigm eliminates detection-prior modules to avoid the local optimum of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize the detection and ReID sub-tasks in an iterative and collaborative way, which makes the two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.
Person re-identification (re-ID) requires densely distributed cameras. In practice, the person of interest may not be captured by cameras and, therefore, needs to be retrieved using subjective information (e.g., sketches from witnesses). Previous research defines this sketch-based case as sketch re-identification (Sketch re-ID) and focuses on eliminating the domain gap. However, subjectivity is another significant challenge. We model and investigate it by proposing a new dataset with multi-witness descriptions. It features two aspects. 1) Large-scale. It contains over 4,763 sketches and 32,668 photos, making it the largest Sketch re-ID dataset. 2) Multi-perspective and multi-style. Our dataset offers multiple sketches for each identity. Witnesses' subjective cognition provides multiple perspectives on the same individual, while different artists' drawing styles provide variation in sketch styles. We further introduce two novel designs to alleviate the challenge of subjectivity. 1) Fusing subjectivity. We propose a non-local (NL) fusion module that gathers sketches from different witnesses for the same identity. 2) Introducing objectivity. An AttrAlign module utilizes attributes as an implicit mask to align cross-domain features. To advance Sketch re-ID, we set three benchmarks (large-scale, multi-style, cross-style). Extensive experiments demonstrate our leading performance on these benchmarks. The dataset and code are publicly available at: this https URL
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations. Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space. As a result, the learned feature emphasizes the common patterns across modalities while suppressing modality-specific and identity-aware information that is valuable for Re-ID. To address these issues, we propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID. First, the auxiliary modality is generated by combining the proposed cross-modality learner and intra-modality learner, which can dynamically model the modality-specific and modality-shared representations to alleviate both cross-modality and intra-modality variations. Second, by aligning identity centres across the three modalities, an identity alignment loss function is proposed to discover the discriminative feature representations. Third, a modality alignment loss is introduced to consistently reduce the distribution distance of visible and infrared images by modality prototype modeling. Extensive experiments on multiple public datasets demonstrate that the proposed method surpasses the current state-of-the-art methods by a significant margin.
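The idea of aligning identity centres across modalities can be sketched as follows. The abstract does not give the loss formula, so the mean squared Euclidean distance between per-modality centres of the same identity is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, str, Sequence[float]]  # (identity, modality, feature)


def identity_centres(samples: List[Sample]) -> Dict[Tuple[int, str], List[float]]:
    """Mean feature vector for every (identity, modality) pair."""
    groups: Dict[Tuple[int, str], List[Sequence[float]]] = defaultdict(list)
    for pid, modality, feature in samples:
        groups[(pid, modality)].append(feature)
    return {
        key: [sum(column) / len(feats) for column in zip(*feats)]
        for key, feats in groups.items()
    }


def centre_alignment_loss(samples: List[Sample]) -> float:
    """Average squared distance between modality centres of the same identity."""
    per_identity: Dict[int, List[List[float]]] = defaultdict(list)
    for (pid, _modality), centre in identity_centres(samples).items():
        per_identity[pid].append(centre)
    total, pairs = 0.0, 0
    for centres in per_identity.values():
        for a, b in combinations(centres, 2):
            total += sum((x - y) ** 2 for x, y in zip(a, b))
            pairs += 1
    return total / pairs if pairs else 0.0
```

With three modalities (visible, infrared, auxiliary), each identity contributes three centre pairs, and the loss reaches zero only when all three centres coincide.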
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball's change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available on this https URL. Baselines and development kits can be found on this https URL.
Text-based person re-identification (TBPReID) aims to retrieve person images described by a given textual query. In this task, how to effectively align images and texts globally and locally is a crucial challenge. Recent works have obtained high performance by solving Masked Language Modeling (MLM) to align image/text parts. However, they only perform uni-directional (i.e., from image to text) local matching, leaving room for improvement by introducing opposite-directional (i.e., from text to image) local matching. In this work, we introduce the Bidirectional Local-Matching (BiLMa) framework, which jointly optimizes MLM and Masked Image Modeling (MIM) in TBPReID model training. With this framework, our model is trained so that the labels of randomly masked image and text tokens are predicted from the unmasked tokens. In addition, to narrow the semantic gap between image and text in MIM, we propose Semantic MIM (SemMIM), in which the labels of masked image tokens are automatically given by a state-of-the-art human parser. Experimental results demonstrate that our BiLMa framework with SemMIM achieves state-of-the-art Rank@1 and mAP scores on three benchmarks.
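For readers unfamiliar with the reported metrics, Rank@1 and mAP can be computed from ranked retrieval lists as in the sketch below. It follows the standard definitions and omits benchmark-specific protocol details; the function name `rank1_and_map` and the input layout are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def rank1_and_map(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> Tuple[float, float]:
    """Return (Rank@1, mAP).

    rankings[i] lists gallery ids for query i, best match first.
    relevant[i] is the set of gallery ids that truly match query i.
    """
    if not rankings or len(rankings) != len(relevant):
        raise ValueError("need one non-empty ranking per query")
    hits, ap_sum = 0, 0.0
    for ranked, rel in zip(rankings, relevant):
        if not rel or not ranked:
            raise ValueError("each query needs a ranking and a true match")
        hits += ranked[0] in rel
        found, precision_sum = 0, 0.0
        for position, gallery_id in enumerate(ranked, start=1):
            if gallery_id in rel:
                found += 1
                # Precision at the position of each correct match.
                precision_sum += found / position
        ap_sum += precision_sum / len(rel)
    n = len(rankings)
    return hits / n, ap_sum / n
```

Rank@1 only checks the top result, while mAP rewards placing every correct image high in the list, which is why papers usually report both.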
Person Re-identification (ReID) has played an increasingly crucial role in recent years, with a wide range of applications. Existing ReID methods suffer from misalignment and occlusions, which degrade performance dramatically. Most methods tackle such challenges by utilizing external tools to locate body parts or by exploiting matching strategies. Nevertheless, the inevitable domain gap between the datasets used for external tools and the ReID datasets, together with the complicated matching process, makes these methods unreliable and sensitive to noise. In this paper, we propose a Region Generation and Assessment Network (RGANet) to effectively and efficiently detect human body regions and highlight the important ones. In the proposed RGANet, we first devise a Region Generation Module (RGM) which utilizes the pre-trained CLIP to locate human body regions using semantic prototypes extracted from text descriptions. A learnable prompt is designed to eliminate the domain gap between CLIP datasets and ReID datasets. Then, to measure the importance of each generated region, we introduce a Region Assessment Module (RAM) that assigns confidence scores to different regions and reduces the negative impact of occluded regions by assigning them lower scores. The RAM consists of a discrimination-aware indicator and an invariance-aware indicator, where the former indicates the capability to distinguish between different identities and the latter represents consistency among the images of the same class of human body regions. Extensive experimental results on six widely used benchmarks covering three tasks (occluded, partial, and holistic) demonstrate the superiority of RGANet over state-of-the-art methods.
The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact performance: i) Data inconsistency. A large domain gap exists between the generic images/texts used in public pre-trained models and the specific person data in the T2I-ReID task. This gap is especially severe for texts, as general textual data are usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The pre-training processes for images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To address the above issues, we present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task. We first build a large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual descriptions of images are automatically generated by the CLIP paradigm using a divide-conquer-combine strategy. Benefiting from this dataset, we then utilize a simple vision-and-language pre-training framework to explicitly align the feature space of the image and text modalities during pre-training. In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both the data and training levels. Without the need for any bells and whistles, our UniPT achieves competitive Rank-1 accuracy of 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both the LUPerson-T dataset and code are available at https://github.com/ZhiyinShao-H/UniPT.
Vehicle re-identification (ReID) in a large-scale camera network is important in public safety, traffic control, and security. However, due to the appearance ambiguities of vehicles, previous appearance-based ReID methods often fail to track vehicles across multiple cameras. To overcome this challenge, we propose a spatial-temporal vehicle ReID framework that estimates a reliable camera network topology based on the adaptive Parzen window method and optimally combines the appearance and spatial-temporal similarities through a fusion network. With the proposed methods, we achieve superior performance on the public dataset (VeRi776), with 99.64% rank-1 accuracy. The experimental results support that utilizing spatial and temporal information for ReID can improve the accuracy of appearance-based methods and effectively deal with appearance ambiguities.
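A Parzen window estimate of the transit-time distribution between two cameras can be sketched as below. Two parts are assumptions rather than the paper's method: the bandwidth uses the normal-reference rule of thumb h = 1.06 * sigma * n^(-1/5) as a stand-in for the adaptive scheme, and a fixed weighted sum replaces the learned fusion network. All function names are hypothetical.

```python
import math
from statistics import pstdev
from typing import Optional, Sequence


def parzen_density(t: float, samples: Sequence[float],
                   bandwidth: Optional[float] = None) -> float:
    """Gaussian Parzen window estimate of the transit-time density at t."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one observed transit time")
    if bandwidth is None:
        sigma = pstdev(samples) if n > 1 else 0.0
        # Rule-of-thumb bandwidth (assumption); fall back to 1.0 if degenerate.
        bandwidth = 1.06 * sigma * n ** (-0.2) or 1.0
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in samples) / norm


def fused_similarity(appearance: float, st_density: float,
                     weight: float = 0.5) -> float:
    """Weighted sum used here only as a placeholder for the fusion network."""
    return weight * appearance + (1.0 - weight) * st_density
```

A candidate match whose transit time falls where the estimated density is near zero can then be down-weighted even if its appearance score is high. Note that a density is not bounded by 1, so a real system would normalize it before fusion.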
Nighttime person Re-ID (person re-identification in the nighttime) is a very important and challenging task for visual surveillance, but it has not been thoroughly investigated. Under low illumination, the performance of person Re-ID methods usually deteriorates sharply. To address the low illumination challenge in nighttime person Re-ID, this paper proposes an Illumination Distillation Framework (IDF), which utilizes illumination enhancement and illumination distillation schemes to promote the learning of Re-ID models. Specifically, IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module. The master branch is used to extract the features from a nighttime image. The illumination enhancement branch first estimates an enhanced image from the nighttime image using a nonlinear curve mapping method and then extracts the enhanced features. However, nighttime and enhanced features usually contain data noise due to unstable lighting conditions and enhancement failures. To fully exploit the complementary benefits of nighttime and enhanced features while suppressing data noise, we propose an illumination distillation module. In particular, the illumination distillation module fuses the features from the two branches through a bottleneck fusion model and then uses the fused features to guide the learning of both branches in a distillation manner. In addition, we build a real-world nighttime person Re-ID dataset, named Night600, which contains 600 identities captured from different viewpoints and under different nighttime illumination conditions in complex outdoor environments. Experimental results demonstrate that our IDF can achieve state-of-the-art performance on two nighttime person Re-ID datasets (i.e., Night600 and Knight). We will release our code and dataset at this https URL.
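To make "nonlinear curve mapping" concrete, the sketch below applies a quadratic brightening curve, x + alpha * x * (1 - x), to a normalized pixel value. The abstract does not specify the curve, so this form (the one popularized by curve-estimation low-light enhancers such as Zero-DCE) is an assumption, as is the name `enhance_pixel`.

```python
def enhance_pixel(x: float, alpha: float, iterations: int = 1) -> float:
    """Brighten a pixel intensity in [0, 1] with a quadratic curve.

    alpha in [-1, 1] controls the strength; the curve keeps 0 and 1 fixed
    and stays inside [0, 1]. The curve form is assumed, not taken from
    the paper.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("pixel intensity must be in [0, 1]")
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [-1, 1]")
    for _ in range(iterations):
        # Mid-tones move the most; black and white are left unchanged.
        x = x + alpha * x * (1.0 - x)
    return x
```

Applying the curve repeatedly gives a stronger, higher-order mapping; in a learned setting, alpha would be predicted per pixel rather than fixed.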
Visible-Infrared person re-identification (VI-ReID) is an important and challenging task in intelligent video surveillance. Existing methods mainly focus on learning a shared feature space to reduce the discrepancy between the visible and infrared modalities, which still leaves two problems underexplored: information redundancy and modality complementarity. Properly eliminating identity-irrelevant information and compensating for modality-specific information are therefore critical and remain challenging. To tackle these problems, we present a novel mutual information and modality consensus network, namely CMInfoNet, to extract modality-invariant identity features with the most representative information and reduce redundancies. The key insight of our method is to find an optimal representation that captures more identity-relevant information and compresses the irrelevant parts by optimizing a mutual information bottleneck trade-off. In addition, we propose an automatic search strategy to find the most prominent parts that identify pedestrians. To eliminate cross- and intra-modality variations, we also devise a modality consensus module to align the visible and infrared modalities for task-specific guidance. Moreover, global-local feature representations can also be acquired for key-part discrimination. Experimental results on four benchmarks, i.e., SYSU-MM01, RegDB, Occluded-DukeMTMC, Occluded-REID, Partial-REID and Partial_iLIDS dataset, have demonstrated the effectiveness of CMInfoNet.
Cloth-changing Person Re-Identification (CC-ReID) is a challenging task that aims to retrieve the target person across multiple surveillance cameras when clothing changes might happen. Despite recent progress in CC-ReID, existing approaches are still hindered by the interference of clothing variations since they lack effective constraints to keep the model consistently focused on clothing-irrelevant regions. To address this issue, we present a Semantic-aware Consistency Network (SCNet) to learn identity-related semantic features by proposing effective consistency constraints. Specifically, we generate the black-clothing image by erasing pixels in the clothing area, which explicitly mitigates the interference from clothing variations. In addition, to fully exploit the fine-grained identity information, a head-enhanced attention module is introduced, which learns soft attention maps by utilizing the proposed part-based matching loss to highlight head information. We further design a semantic consistency loss to facilitate the learning of high-level identity-related semantic features, forcing the model to focus on semantically consistent cloth-irrelevant regions. By using the consistency constraint, our model does not require any extra auxiliary segmentation module to generate the black-clothing image or locate the head region during the inference stage. Extensive experiments on four cloth-changing person Re-ID datasets (LTCC, PRCC, Vc-Clothes, and DeepChange) demonstrate that our proposed SCNet makes significant improvements over prior state-of-the-art approaches. Our code is available at: this https URL.
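Generating a black-clothing image by erasing pixels in the clothing area can be sketched as below. The image is represented as nested lists of RGB tuples to keep the example dependency-free, and the clothing mask is assumed to come from some human-parsing step that the abstract does not detail; `erase_clothing` is a hypothetical name.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def erase_clothing(image: List[List[Pixel]],
                   clothing_mask: List[List[bool]]) -> List[List[Pixel]]:
    """Return a copy of `image` with masked (clothing) pixels set to black."""
    if len(image) != len(clothing_mask) or any(
        len(row) != len(mask_row) for row, mask_row in zip(image, clothing_mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [(0, 0, 0) if is_clothing else pixel
         for pixel, is_clothing in zip(row, mask_row)]
        for row, mask_row in zip(image, clothing_mask)
    ]
```

According to the abstract, such images are only needed during training; the consistency constraints let the model run at inference time without a segmentation module.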
We present a novel unsupervised domain adaptation method for person re-identification (reID) that generalizes a model trained on a labeled source domain to an unlabeled target domain. We introduce a camera-driven curriculum learning (CaCL) framework that leverages camera labels of person images to transfer knowledge from source to target domains progressively. To this end, we divide the target domain dataset into multiple subsets based on the camera labels, and initially train our model with a single subset (i.e., images captured by a single camera). We then gradually exploit more subsets for training, according to a curriculum sequence obtained with a camera-driven scheduling rule. The scheduler considers maximum mean discrepancies (MMD) between each subset and the source domain dataset, such that the subset closer to the source domain is exploited earlier within the curriculum. For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way. We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs. To address the camera bias problem, we also introduce a camera-diversity (CD) loss that encourages person images with the same pseudo label, but captured across various cameras, to contribute more to discriminative feature learning, providing person representations robust to inter-camera variations. Experimental results on standard benchmarks, including real-to-real and synthetic-to-real scenarios, demonstrate the effectiveness of our framework.
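The scheduling rule, which uses camera subsets closer to the source domain earlier, can be sketched by sorting cameras by an MMD estimate. For brevity the sketch uses the linear-kernel form of squared MMD (the squared distance between feature means), which is a simplifying assumption; the paper's kernel and feature extractor are not given in the abstract, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

Features = Sequence[Sequence[float]]


def mean_vector(features: Features) -> List[float]:
    """Element-wise mean of a non-empty set of feature vectors."""
    if not features:
        raise ValueError("feature set must not be empty")
    return [sum(column) / len(features) for column in zip(*features)]


def linear_mmd2(a: Features, b: Features) -> float:
    """Squared MMD under a linear kernel: ||mean(a) - mean(b)||^2."""
    return sum((x - y) ** 2 for x, y in zip(mean_vector(a), mean_vector(b)))


def curriculum_order(source: Features,
                     camera_subsets: Dict[str, Features]) -> List[str]:
    """Camera ids sorted from closest to farthest from the source domain."""
    return sorted(camera_subsets,
                  key=lambda cam: linear_mmd2(source, camera_subsets[cam]))
```

Training would then start with the first camera in the returned order and add the remaining subsets one at a time.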
Biometric applications, such as person re-identification (ReID), are often deployed on energy-constrained devices. While recent ReID methods prioritize high retrieval performance, they often come with large computational costs and long search times, rendering them less practical in real-world settings. In this work, we propose an input-adaptive network with multiple exit blocks that can terminate computation early if the retrieval is straightforward or noisy, saving a lot of computation. To assess the complexity of the input, we introduce a temporal-based classifier driven by a new training strategy. Furthermore, we adopt a binary hash code generation approach instead of relying on continuous-valued features, which speeds up the search process by a factor of 20. To ensure similarity preservation, we utilize a new ranking regularizer that bridges the gap between continuous and binary features. Extensive analysis of our proposed method is conducted on three datasets: Market1501, MSMT17 (Multi-Scene Multi-Time), and the BGC1 (BRIAR Government Collection). Using our approach, more than 70% of the samples with compact hash codes exit early on the Market1501 dataset, saving 80% of the network's computational cost and improving over other hash-based methods by 60%. These results demonstrate a significant improvement over dynamic networks and showcase accuracy comparable to conventional ReID methods. Code will be made available.
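The search speed-up from binary hash codes comes from replacing floating-point distance computations with bitwise operations. The sketch below shows sign-based binarization and Hamming-distance ranking; the binarization rule and function names are assumptions for illustration, not the paper's hash-generation method.

```python
from typing import Dict, List, Sequence


def to_hash(feature: Sequence[float]) -> int:
    """Pack the sign pattern of a feature vector into an integer code."""
    code = 0
    for value in feature:
        # One bit per dimension: 1 if the value is positive, else 0.
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def rank_gallery(query_code: int, gallery: Dict[str, int]) -> List[str]:
    """Gallery ids sorted by increasing Hamming distance to the query."""
    return sorted(gallery, key=lambda gid: hamming(query_code, gallery[gid]))
```

For example, `to_hash([0.3, -1.2, 0.8, 0.1])` yields the 4-bit code `0b1011`, and comparing two such codes takes one XOR and one bit count regardless of the original feature precision.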
Law enforcement regularly faces the challenge of ranking suspects from their facial images. Deep face models aid this process but frequently introduce biases that disproportionately affect certain demographic segments. While bias investigation is common in domains like job candidate ranking, the field of forensic face rankings remains underexplored. In this paper, we propose a novel experimental framework, encompassing six state-of-the-art face encoders and two public data sets, designed to scrutinize the extent to which demographic groups suffer from biases in exposure in the context of forensic face rankings. Through comprehensive experiments that cover both re-identification and identification tasks, we show that exposure biases within this domain are far from being countered, demanding attention towards establishing ad-hoc policies and corrective measures. The source code is available at this https URL
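Exposure in a ranking is commonly modeled with a position-based discount, where an item at rank r receives exposure 1 / log2(1 + r). The sketch below averages that quantity per demographic group; whether the paper uses this exact discount is not stated in the abstract, so treat the formula and the name `group_exposure` as assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_exposure(ranked_groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Average position-based exposure per group.

    ranked_groups[i] is the demographic group of the item at rank i + 1.
    """
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for rank, group in enumerate(ranked_groups, start=1):
        # Logarithmic discount: higher-ranked items receive more exposure.
        totals[group] += 1.0 / math.log2(1 + rank)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

Comparing the per-group averages across many rankings gives one simple indicator of whether a face encoder systematically places some groups nearer the top of suspect lists.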
Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the Swiss Federal Supreme Court. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. With the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. The complexity is attributed to the lack of test datasets, the necessity for substantial training resources, and data sparsity in the information used for re-identification. In conclusion, this study demonstrates that re-identification using LLMs may not be feasible for now, but as the proof-of-concept on Wikipedia showed, it might become possible in the future. We hope that our system can help enhance confidence in the security of anonymized decisions, making courts more willing to publish decisions.
Person Re-IDentification (Re-ID), as a retrieval task, has achieved tremendous development over the past decade. Existing state-of-the-art methods follow a common framework: first extract features from the input images, then categorize them with a classifier. However, since there is no identity overlap between training and testing sets, the classifier is often discarded during inference. Only the extracted features are used for person retrieval via distance metrics. In this paper, we rethink the role of the classifier in person Re-ID, and advocate a new perspective that conceives the classifier as a projection from image features to class prototypes. These prototypes are exactly the learned parameters of the classifier. In this light, we describe the identity of input images as similarities to all prototypes, which are then utilized as more discriminative features to perform person Re-ID. We thereby propose a new baseline, ProNet, which retains the function of the classifier at the inference stage. To facilitate the learning of class prototypes, both triplet loss and identity classification loss are applied to features that undergo the projection by the classifier. An improved version, ProNet++, is presented by further incorporating multi-granularity designs. Experiments on four benchmarks demonstrate that our proposed ProNet is simple yet effective, and significantly beats previous baselines. ProNet++ also achieves competitive or even better results than transformer-based competitors.
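Describing an image by its similarities to all class prototypes amounts to keeping the classifier's projection at inference time. A minimal sketch follows, assuming a bias-free linear classifier whose weight rows are the prototypes and cosine similarity for retrieval; both choices and the function names are assumptions, since the abstract does not fix them.

```python
import math
from typing import List, Sequence


def prototype_descriptor(feature: Sequence[float],
                         prototypes: Sequence[Sequence[float]]) -> List[float]:
    """Project a feature onto every class prototype (one dot product each)."""
    return [sum(f * w for f, w in zip(feature, proto)) for proto in prototypes]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two descriptors (0.0 if either is all zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

At test time, query and gallery images would both be mapped through `prototype_descriptor` with the training-set prototypes and then compared, even though none of the test identities appear among those prototypes.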
Unsupervised domain adaptive person re-identification (Re-ID) methods alleviate the burden of data annotation by generating pseudo supervision signals. However, real-world Re-ID systems, with continuously accumulating data streams, simultaneously demand more robust adaptation and anti-forgetting capabilities. Methods based on image rehearsal address the forgetting issue with limited extra storage but carry the risk of privacy leakage. In this work, we propose a Color Prompting (CoP) method for data-free continual unsupervised domain adaptive person Re-ID. Specifically, we employ a lightweight prompter network to fit the color distribution of the current task together with Re-ID training. Then, for incoming new tasks, the learned color distribution serves as color style transfer guidance to transfer the images into past styles. CoP achieves accurate color style recovery for past tasks with adequate data diversity, leading to superior anti-forgetting effects compared with image rehearsal methods. Moreover, CoP demonstrates strong generalization performance for fast adaptation to new domains, given only a small number of unlabeled images. Extensive experiments demonstrate that after the continual training pipeline, the proposed CoP achieves 6.7% and 8.1% average rank-1 improvements over the replay method on seen and unseen domains, respectively. The source code for this work is publicly available at this https URL.