Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across different cameras, locations, and time periods. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust Re-ID systems capable of handling long-term scenarios, where persons' appearances can change significantly due to variations in clothing and physical characteristics. In this paper, we present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset specifically designed for long-term person Re-ID. CHIRLA consists of recordings from strategically placed cameras over a seven-month period, capturing significant variations in both temporal and appearance attributes, including controlled changes in participants' clothing and physical features. The dataset includes 22 individuals, four connected indoor environments, and seven cameras. We collected more than five hours of video that we semi-automatically labeled to generate around one million bounding boxes with identity annotations. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios.
https://arxiv.org/abs/2502.06681
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at this https URL.
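The correlation-aware conditioning can be pictured as mixing a set of learnable ID-wise prompts according to the Re-ID head's classification probabilities, so that ID correlations ("dark knowledge") steer the diffusion model. The following is a minimal sketch of that idea under assumed names and dimensions (`CorrelationAwareConditioning`, `prompt_dim`); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CorrelationAwareConditioning(nn.Module):
    """Sketch: mix learnable ID-wise prompts with the Re-ID classifier's
    ID probabilities to build a conditioning vector for the diffusion model."""

    def __init__(self, num_ids: int, prompt_dim: int = 768):
        super().__init__()
        # one learnable prompt vector per identity (assumed design choice)
        self.id_prompts = nn.Parameter(torch.randn(num_ids, prompt_dim) * 0.02)

    def forward(self, id_logits: torch.Tensor) -> torch.Tensor:
        # id_logits: (B, num_ids) from the Re-ID classification head
        probs = id_logits.softmax(dim=-1)       # soft ID correlations, not hard labels
        condition = probs @ self.id_prompts     # (B, prompt_dim)
        return condition

# hypothetical usage: condition = conditioner(reid_logits)
#                     noise_pred = diffusion_unet(x_t, t, condition)
```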
https://arxiv.org/abs/2502.06619
Navigating the complexities of person re-identification (ReID) in varied surveillance scenarios, particularly when occlusions occur, poses significant challenges. We introduce an innovative Motion-Aware Fusion (MOTAR-FUSE) network that utilizes motion cues derived from static imagery to significantly enhance ReID capabilities. This network incorporates a dual-input visual adapter capable of processing both images and videos, thereby facilitating more effective feature extraction. A unique aspect of our approach is the integration of a motion consistency task, which empowers the motion-aware transformer to adeptly capture the dynamics of human motion. This technique substantially improves the recognition of features in scenarios where occlusions are prevalent, thereby advancing the ReID process. Our comprehensive evaluations across multiple ReID benchmarks, including holistic, occluded, and video-based scenarios, demonstrate that our MOTAR-FUSE network achieves superior performance compared to existing approaches.
https://arxiv.org/abs/2502.00665
This paper proposes a new effective and efficient plug-and-play backbone for video-based person re-identification (ReID). Conventional video-based ReID methods typically use CNN or transformer backbones to extract deep features for every position in every sampled video frame. Here, we argue that this exhaustive feature extraction could be unnecessary, since we find that different frames in a ReID video often exhibit small differences and contain many similar regions due to the relatively slight movements of human beings. Inspired by this, a more selective, efficient paradigm is explored in this paper. Specifically, we introduce a patch selection mechanism to reduce computational cost by choosing only the crucial and non-repetitive patches for feature extraction. Additionally, we present a novel network structure that generates and utilizes pseudo frame global context to address the issue of incomplete views resulting from sparse inputs. By incorporating these new designs, our backbone can achieve both high performance and low computational cost. Extensive experiments on multiple datasets show that our approach reduces the computational cost by 74% compared to ViT-B and 28% compared to ResNet50, while the accuracy is on par with ViT-B and outperforms ResNet50 significantly.
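A patch selection step of the kind described could, for example, keep only the patches that change the most between consecutive frames and drop the repetitive ones. The sketch below illustrates that idea with an assumed top-k criterion; `select_patches` and `keep_ratio` are illustrative names, not the paper's.

```python
import torch

def select_patches(frames: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Sketch of a patch-selection step: keep only patches that change the most
    between consecutive frames (assumed criterion, not the paper's exact rule).

    frames: (T, N, D) patch embeddings for T sampled frames with N patches each.
    Returns a boolean keep-mask of shape (T, N); frame 0 is kept in full.
    """
    T, N, _ = frames.shape
    mask = torch.zeros(T, N, dtype=torch.bool)
    mask[0] = True                                     # first frame: full view
    diff = (frames[1:] - frames[:-1]).norm(dim=-1)     # (T-1, N) change per patch
    k = max(1, int(keep_ratio * N))
    topk = diff.topk(k, dim=-1).indices                # most-changed patches per frame
    mask[1:].scatter_(1, topk, torch.ones_like(topk, dtype=torch.bool))
    return mask
```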
https://arxiv.org/abs/2501.16811
The Visual Language Model, known for its robust cross-modal capabilities, has been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive Language-Image Pretraining), a vision-language model pretrained on large-scale image-text pairs to align visual and textual features, for acquiring fine-grained and domain-invariant representations in generalizable person re-identification. The adaptation of CLIP to the task presents two primary challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capabilities. To mitigate the first challenge and thereby enhance the ability to learn fine-grained features, a three-stage strategy is proposed to boost the accuracy of text descriptions. Initially, the image encoder is trained to effectively adapt to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate textual descriptions (i.e., prompts) for each image. Finally, the text encoder with the learned prompts is employed to guide the training of the final image encoder. To enhance the model's generalization capabilities to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive (pulling together image features and domain-invariant prompts) and negative (pushing apart image features and domain-relevant prompts) views are used to train the image encoder. Collectively, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification.
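The bidirectional guiding objective can be read as a pull term toward domain-invariant prompts and a push term away from domain-relevant prompts. Below is a minimal sketch assuming cosine similarities and a margin hyperparameter of our own choosing; it is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def bidirectional_prompt_loss(img_feat: torch.Tensor,
                              invariant_prompt: torch.Tensor,
                              relevant_prompt: torch.Tensor,
                              margin: float = 0.3) -> torch.Tensor:
    """Sketch of the bidirectional guiding idea (assumed form): pull image
    features toward a domain-invariant text prompt and push them away from a
    domain-relevant one. All inputs are (B, D) embeddings."""
    img = F.normalize(img_feat, dim=-1)
    pos = F.normalize(invariant_prompt, dim=-1)
    neg = F.normalize(relevant_prompt, dim=-1)
    pull = 1.0 - (img * pos).sum(-1)              # low when aligned with invariant prompt
    push = F.relu((img * neg).sum(-1) - margin)   # penalize similarity to relevant prompt
    return (pull + push).mean()
```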
https://arxiv.org/abs/2501.16065
Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems. While current VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both visible (V) and infrared (I) images, where state-of-the-art methods show significant performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have lower domain gaps but correspond to different identities. This paper introduces a novel mixed-modal ReID setting, where galleries contain data from both modalities. To address the domain shift in inter-modal matching and the low discrimination capacity in intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. The MixER learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving performance in both cross-modal and mixed-modal settings. Our extensive experiments on the SYSU-MM01, RegDB and LLMC datasets indicate that our approach can provide state-of-the-art results using a single backbone, and showcase the flexibility of our approach in mixed gallery applications.
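One way to realize the orthogonal-decomposition part of such a method is to project a backbone feature into modality-shared and modality-specific branches and penalize their overlap. The sketch below is an assumed, simplified form (linear projections, squared-cosine penalty), not the released MixER code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalDecomposition(nn.Module):
    """Sketch: disentangle modality-shared vs. modality-specific identity
    information with an orthogonality constraint (assumed layer sizes)."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # modality-shared branch
        self.specific = nn.Linear(dim, dim)  # modality-specific branch

    def forward(self, feat: torch.Tensor):
        f_sh, f_sp = self.shared(feat), self.specific(feat)
        # encourage the two subspaces to carry non-overlapping information
        ortho = (F.normalize(f_sh, dim=-1) * F.normalize(f_sp, dim=-1)).sum(-1).pow(2).mean()
        return f_sh, f_sp, ortho
```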
https://arxiv.org/abs/2501.13307
Deep learning based person re-identification (re-id) models have been widely employed in surveillance systems. Recent studies have demonstrated that black-box single-modality and cross-modality re-id models are vulnerable to adversarial examples (AEs), leaving the robustness of multi-modality re-id models unexplored. Due to the lack of knowledge about the specific type of model deployed in the target black-box surveillance system, we aim to generate modality-unified AEs for omni-modality (single-, cross- and multi-modality) re-id models. Specifically, we propose a novel Modality Unified Attack (MUA) method to train modality-specific adversarial generators that produce AEs capable of effectively attacking different omni-modality models. A multi-modality model is adopted as the surrogate model, wherein the features of each modality are perturbed by a metric disruption loss before fusion. To collapse the common features of omni-modality models, a Cross Modality Simulated Disruption approach is introduced to mimic cross-modality feature embeddings by intentionally feeding images to non-corresponding modality-specific subnetworks of the surrogate model. Moreover, a Multi Modality Collaborative Disruption strategy is devised to help the attacker comprehensively corrupt the informative content of person images by leveraging a multi-modality feature collaborative metric disruption loss. Extensive experiments show that our MUA method can effectively attack the omni-modality re-id models, achieving 55.9%, 24.4%, 49.0% and 62.7% mean mAP Drop Rate, respectively.
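The metric disruption idea amounts to pushing an adversarial image's embedding away from its clean counterpart so that ranking by feature distance collapses. A minimal sketch under assumed choices (feature normalization, squared Euclidean distance); the paper's full objective combines this per modality before fusion.

```python
import torch
import torch.nn.functional as F

def metric_disruption_loss(adv_feat: torch.Tensor, clean_feat: torch.Tensor) -> torch.Tensor:
    """Sketch of a metric disruption objective (assumed form): maximize the
    embedding distance between adversarial and clean features of the same
    image. Minimizing the negative distance achieves this during generator
    training, while the surrogate Re-ID features stay fixed."""
    adv = F.normalize(adv_feat, dim=-1)
    clean = F.normalize(clean_feat, dim=-1).detach()
    return -(adv - clean).pow(2).sum(-1).mean()
```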
https://arxiv.org/abs/2501.12761
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address the patch noises and feature inconsistency in unsupervised person re-identification works. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noises to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.
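For context, memory-bank contrastive training in unsupervised Re-ID typically keeps one momentum-updated prototype per cluster and contrasts batch features against the whole bank. The sketch below shows only this common single-scale baseline (momentum and temperature values are assumptions), to make concrete the setup whose batch-size-related inconsistency TCMM's multi-scale memory is designed to address.

```python
import torch
import torch.nn.functional as F

class ClusterMemory:
    """Sketch of a momentum-updated cluster memory for contrastive ReID
    training (a common baseline design; TCMM's multi-scale variant adds
    further memory levels, which are omitted here)."""

    def __init__(self, num_clusters: int, dim: int, momentum: float = 0.2, temp: float = 0.05):
        self.bank = F.normalize(torch.randn(num_clusters, dim), dim=-1)
        self.m, self.t = momentum, temp

    def contrastive_loss(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # compare batch features against every cluster prototype
        logits = F.normalize(feats, dim=-1) @ self.bank.t() / self.t
        return F.cross_entropy(logits, labels)

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # momentum update of the prototypes hit by this batch
        for f, y in zip(F.normalize(feats, dim=-1), labels):
            self.bank[y] = F.normalize(self.m * self.bank[y] + (1 - self.m) * f, dim=0)
```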
https://arxiv.org/abs/2501.09044
Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance. By employing event data in video-based person ReID, more motion information can be provided between consecutive frames to improve recognition accuracy. Previous approaches have introduced event data into the video person ReID task as an auxiliary signal, but they still cannot avoid the privacy leakage problem caused by RGB images. In order to avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
https://arxiv.org/abs/2501.07296
Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospect. Most existing works struggle to adequately extract the ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. Particularly, IFD exploits a dual stream architecture that consists of a main stream and an attention stream. The attention stream takes the clothing-masked images as inputs and derives the identity attention weights for effectively transferring the spatial knowledge to the main stream and highlighting the regions with abundant identity-related information. To eliminate the semantic gap between the inputs of two streams, we propose a clothing bias diminishing module specific to the main stream to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely-used CC Re-ID datasets.
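The attention-stream idea can be sketched as computing spatial weights from the clothing-masked input and using them to re-weight the main stream's feature map, emphasizing identity-related regions. The module below is an assumed, simplified rendering (a single 1x1 convolution with residual re-weighting), not the IFD architecture itself.

```python
import torch
import torch.nn as nn

class IdentityAttentionTransfer(nn.Module):
    """Sketch of the dual-stream idea (assumed shapes): the attention stream,
    fed with clothing-masked images, yields spatial weights that re-weight the
    main stream's feature map."""

    def __init__(self, channels: int = 2048):
        super().__init__()
        self.to_attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, main_feat: torch.Tensor, masked_feat: torch.Tensor) -> torch.Tensor:
        # main_feat, masked_feat: (B, C, H, W) maps from the two backbones
        attn = torch.sigmoid(self.to_attn(masked_feat))   # (B, 1, H, W) identity attention
        return main_feat * attn + main_feat               # residual re-weighting
```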
https://arxiv.org/abs/2501.05851
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, existing methods either lack cross-modality clustering or excessively pursue cluster-level association, which makes reliable modality-invariant feature learning difficult. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, ECUL naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge at the clustering level. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
https://arxiv.org/abs/2412.19134
The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. However, current studies rely on unsupervised modality transformations and inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle the limitations of the above approaches, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency-domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at this https URL.
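A frequency-domain, greyscale-based spectral enhancement could look like the following: convert the visible image to greyscale, then smooth its Fourier amplitude toward a flatter spectrum to narrow the gap to infrared imagery. This is an assumed recipe for illustration only (the blending rule and `alpha` are ours), not SEPG-Net's exact scheme.

```python
import torch

def spectral_enhance(rgb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Sketch of a homogeneous spectral enhancement (assumed recipe):
    greyscale conversion followed by a frequency-domain amplitude blend.

    rgb: (B, 3, H, W) in [0, 1]; returns a (B, 1, H, W) enhanced greyscale image.
    """
    grey = (0.299 * rgb[:, 0] + 0.587 * rgb[:, 1] + 0.114 * rgb[:, 2]).unsqueeze(1)
    spec = torch.fft.fft2(grey)
    amp, phase = spec.abs(), spec.angle()
    # blend each amplitude spectrum toward its mean to homogenize the spectra
    amp = (1 - alpha) * amp + alpha * amp.mean(dim=(-2, -1), keepdim=True)
    enhanced = torch.fft.ifft2(torch.polar(amp, phase)).real
    return enhanced.clamp(0, 1)
```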
https://arxiv.org/abs/2412.19111
Person re-identification (ReID) models often struggle to generalize across diverse cultural contexts, particularly in Islamic regions like Iran, where modest clothing styles are prevalent. Existing datasets predominantly feature Western and East Asian fashion, limiting their applicability in these settings. To address this gap, we introduce IUST_PersonReId, a dataset designed to reflect the unique challenges of ReID in new cultural environments, emphasizing modest attire and diverse scenarios from Iran, including markets, campuses, and mosques. Experiments on IUST_PersonReId with state-of-the-art models, such as Solider and CLIP-ReID, reveal significant performance drops compared to benchmarks like Market1501 and MSMT17, highlighting the challenges posed by occlusion and limited distinctive features. Sequence-based evaluations show improvements by leveraging temporal context, emphasizing the dataset's potential for advancing culturally sensitive and robust ReID systems. IUST_PersonReId offers a critical resource for addressing fairness and bias in ReID research globally. The dataset is publicly available at this https URL.
https://arxiv.org/abs/2412.18874
Person Re-identification (ReID) aims to retrieve a specific person across non-overlapping cameras, which greatly helps intelligent transportation systems. It is well known that Convolutional Neural Networks (CNNs) and Transformers have unique strengths in extracting local and global features, respectively. Considering this fact, we focus on the mutual fusion between them to learn more comprehensive representations for persons. In particular, we utilize the complementary integration of deep features from different model structures. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID. More specifically, we first deploy a Dual-branch Feature Extraction (DFE) to extract features through CNNs and Transformers from a single image. Moreover, we design a novel Dual-attention Mutual Fusion (DMF) to achieve sufficient feature fusion. The DMF comprises Local Refinement Units (LRU) and Heterogenous Transmission Modules (HTM). LRU utilizes depth-separable convolutions to align deep features in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit (SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of HTM, deep features after LRU are repeatedly utilized to generate more discriminative features. Extensive experiments on three public ReID benchmarks demonstrate that our method attains performance superior to most state-of-the-art methods. The source code is available at this https URL.
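The LRU's role of aligning CNN feature maps to the Transformer branch in both channel width and spatial size can be sketched with a depthwise-separable convolution followed by interpolation. The channel sizes below (2048 to 768) are assumptions typical of ResNet50 and ViT-B, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRefinementUnit(nn.Module):
    """Sketch of an LRU-style alignment step (assumed sizes): a depthwise-
    separable convolution maps CNN features to the Transformer's channel
    width, and interpolation matches the spatial size before fusion."""

    def __init__(self, in_ch: int = 2048, out_ch: int = 768):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, cnn_feat: torch.Tensor, target_hw: tuple) -> torch.Tensor:
        x = self.pointwise(self.depthwise(cnn_feat))
        return F.interpolate(x, size=target_hw, mode="bilinear", align_corners=False)
```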
https://arxiv.org/abs/2412.17239
Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to yield high retrieval accuracy, since it is beneficial for exploring a high-dimensional manifold and is applicable without additional fine-tuning. The quality of visual re-ranking using an NN graph, however, is limited by the quality of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected to negative images. This is known as the noisy edge problem, which results in a degradation of retrieval quality. To address this, we propose a complementary denoising method based on a Continuous Conditional Random Field (C-CRF) that uses a statistical distance of our similarity-based distribution. The method employs the concept of cliques to make the process computationally feasible. We demonstrate the complementarity of our method through its application to three visual re-ranking methods, observing quality boosts in landmark retrieval and person re-identification (re-ID).
https://arxiv.org/abs/2412.13875
Unsupervised visible-infrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to the absence of annotations. Existing approaches aim to learn modality-invariant representations in an unsupervised setting. However, these methods often encounter label noise within and across modalities due to suboptimal clustering results and considerable modality discrepancies, which impedes effective training. To address these challenges, we propose a straightforward yet effective solution for USL-VI-ReID by mitigating universal label noise using neighbor information. Specifically, we introduce the Neighbor-guided Universal Label Calibration (N-ULC) module, which replaces explicit hard pseudo labels in both homogeneous and heterogeneous spaces with soft labels derived from neighboring samples to reduce label noise. Additionally, we present the Neighbor-guided Dynamic Weighting (N-DW) module to enhance training stability by minimizing the influence of unreliable samples. Extensive experiments on the RegDB and SYSU-MM01 datasets demonstrate that our method outperforms existing USL-VI-ReID approaches, despite its simplicity. The source code is available at: this https URL.
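The neighbor-guided calibration can be illustrated as replacing each hard pseudo label with the average one-hot label of its nearest neighbours in feature space, yielding a soft, noise-tolerant target. The sketch below covers only one space under that assumption; the paper applies the idea in both homogeneous and heterogeneous spaces.

```python
import torch
import torch.nn.functional as F

def neighbor_soft_labels(feats: torch.Tensor, hard_labels: torch.Tensor,
                         num_classes: int, k: int = 10) -> torch.Tensor:
    """Sketch of neighbor-guided label calibration (assumed form): each
    sample's hard pseudo label is replaced by the mean one-hot label of its
    k nearest neighbours in feature space."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t()                               # (N, N) cosine similarity
    nn_idx = sim.topk(k + 1, dim=-1).indices[:, 1:]       # drop the self-match
    one_hot = F.one_hot(hard_labels, num_classes).float() # (N, C)
    return one_hot[nn_idx].mean(dim=1)                    # (N, C) soft labels
```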
https://arxiv.org/abs/2412.12220
Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning (DRL) mechanism that learns to transform arbitrary distribution data into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction Network (AKPNet) is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses the old distribution based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show that our DASK outperforms the existing methods by 3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity, respectively. Our code is available at this https URL.
https://arxiv.org/abs/2412.09224
Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.
https://arxiv.org/abs/2412.09044
Image retrieval methods rely on metric learning to train backbone feature extraction models that can extract discriminant queries and reference (gallery) feature representations for similarity matching. Although state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large datasets, image retrieval remains challenging in many real-world video analytics and surveillance applications, e.g., person re-identification. Using the Euclidean space for matching limits the performance in real-world applications due to the curse of dimensionality, overfitting, and sensitivity to noisy data. We argue that the feature dissimilarity space is more suitable for similarity matching, and propose a dichotomy transformation to project query and reference embeddings into a single embedding in the dissimilarity space. We also advocate for end-to-end training of a backbone and binary classification models for pair-wise matching. As opposed to comparing the distance between queries and reference embeddings, we show the benefits of classifying the single dissimilarity space embedding (as similar or dissimilar), especially when trained end-to-end. We propose a method to train the max-margin classifier together with the backbone feature extractor by applying constraints to the L2 norm of the classifier weights along with the hinge loss. Our extensive experiments on challenging image retrieval datasets and using diverse feature extraction backbones highlight the benefits of similarity matching in the dissimilarity space. In particular, when jointly training the feature extraction backbone and regularised classifier for matching, the dissimilarity space provides a higher level of accuracy.
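The dichotomy transformation and the max-margin pair classifier can be sketched as follows: a query/reference pair is mapped to a single dissimilarity-space embedding |q - r|, scored by a linear head, and trained with a hinge loss plus a penalty on the classifier-weight norm standing in for the L2-norm constraint. Shapes, the penalty coefficient, and the class names are assumptions, not the paper's exact formulation; in end-to-end training the weight passed to the loss would be `matcher.classifier.weight`, so the constraint is optimized jointly with the backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DissimilarityMatcher(nn.Module):
    """Sketch of the dichotomy-transformation idea (assumed details): project a
    query/reference pair to one dissimilarity-space embedding and classify it
    as same/different with a linear head."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        self.classifier = nn.Linear(dim, 1)

    def forward(self, q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        d = (q - r).abs()                      # dichotomy transformation
        return self.classifier(d).squeeze(-1)  # >0 => predicted same identity

def max_margin_loss(scores: torch.Tensor, targets: torch.Tensor,
                    weight: torch.Tensor, l2_weight: float = 1e-3) -> torch.Tensor:
    """Hinge loss with an explicit penalty on the classifier-weight norm,
    standing in for the paper's L2-norm constraint (assumed coefficient)."""
    y = targets.float() * 2 - 1                # {0,1} -> {-1,+1}
    hinge = F.relu(1 - y * scores).mean()
    return hinge + l2_weight * weight.pow(2).sum()
```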
https://arxiv.org/abs/2412.08618
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
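The CMSP constraint can be read as keeping the gap between inter-modality image-text pair representations comparable to the intra-modality gap, suppressing modality-specific noise such as conflicting colour attributes. The sketch below is a loose interpretation offered only to make the constraint concrete; the intra-modality pairing via a batch shift is our simplification, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def semantics_purification_loss(vis_pair: torch.Tensor, ir_pair: torch.Tensor) -> torch.Tensor:
    """Sketch of a CMSP-style constraint (assumed form): penalize the
    inter-modality distance between fused image-text representations only
    when it exceeds an intra-modality reference distance.

    vis_pair, ir_pair: (B, D) fused image-text pair representations per modality.
    """
    vis, ir = F.normalize(vis_pair, dim=-1), F.normalize(ir_pair, dim=-1)
    inter = (vis - ir).pow(2).sum(-1)                   # cross-modality gap
    intra = (vis - vis.roll(1, dims=0)).pow(2).sum(-1)  # same-modality gap (shifted pairing)
    return F.relu(inter - intra).mean()
```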
https://arxiv.org/abs/2412.08406