Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal pedestrian retrieval task, due to significant intra-class variations and cross-modal discrepancies among different cameras. Existing works mainly focus on embedding images of different modalities into a unified space to mine modality-shared features. They only seek distinctive information within these shared features, while ignoring the identity-aware useful information that is implicit in the modality-specific features. To address this issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) network to uncover and leverage the implicit discriminative information contained within the modality-specific features. First, we extract modality-specific and modality-shared features using a novel dual-stream network. Then, the modality-specific features undergo purification to reduce their modality style discrepancies while preserving identity-aware discriminative knowledge. Subsequently, this kind of implicit knowledge is distilled into the modality-shared feature to enhance its distinctiveness. Finally, an alignment loss is proposed to minimize modality discrepancy on the enhanced modality-shared features. Extensive experiments on multiple public datasets demonstrate the superiority of IDKL network over the state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2403.11708
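To make the distillation and alignment steps above concrete, here is a minimal PyTorch sketch. The tensor shapes, identity count, and loss weight are hypothetical, and this shows generic soft-label distillation plus a simple mean-alignment term, not the authors' exact IDKL objective:

import torch
import torch.nn.functional as F

def distill_loss(shared_logits, specific_logits, T=4.0):
    # distill identity-aware knowledge from the modality-specific branch
    # into the modality-shared branch via soft-label KL divergence
    p_teacher = F.softmax(specific_logits.detach() / T, dim=1)
    log_p_student = F.log_softmax(shared_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def alignment_loss(shared_vis, shared_ir):
    # reduce modality discrepancy by pulling the means of the enhanced
    # shared features of the two modalities together (simplified stand-in)
    return (shared_vis.mean(dim=0) - shared_ir.mean(dim=0)).pow(2).sum()

vis_shared   = torch.randn(32, 2048)   # modality-shared features of visible images
vis_specific = torch.randn(32, 2048)   # purified modality-specific features of the same images
ir_shared    = torch.randn(32, 2048)   # modality-shared features of infrared images
classifier = torch.nn.Linear(2048, 395)  # hypothetical number of training identities
loss = distill_loss(classifier(vis_shared), classifier(vis_specific)) \
       + 0.1 * alignment_loss(vis_shared, ir_shared)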
Person Re-identification (ReID) has been extensively developed for a decade in order to learn the association of images of the same person across non-overlapping camera views. To overcome significant variations between images across camera views, numerous variants of ReID models were developed to address particular challenges, such as resolution change, clothing change, occlusion, modality change, and so on. Despite the impressive performance of many ReID variants, these variants are typically designed for one specific challenge and cannot be applied to others. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work contributes the first attempt at learning a versatile ReID model to solve such a problem. Our main idea is to form a two-stage prompt-based twin modeling framework called VersReID. Our VersReID first leverages the scene label to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts are used to encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank for adaptively solving the ReID of different scenes, eliminating the demand for scene labels during the inference stage. To facilitate training VersReID, we further introduce the multi-scene properties into self-supervised learning of ReID via a multi-scene priors data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model for handling ReID tasks under multi-scene conditions without manual assignment of scene labels in the inference stage, including general, low-resolution, clothing change, occlusion, and cross-modality scenes. Codes and models are available at this https URL.
https://arxiv.org/abs/2403.11121
A key challenge in visible-infrared person re-identification (V-I ReID) is training a backbone model capable of effectively addressing the significant discrepancies across modalities. State-of-the-art methods that generate a single intermediate bridging domain are often less effective, as this generated domain may not adequately capture sufficient common discriminant information. This paper introduces the Bidirectional Multi-step Domain Generalization (BMDG), a novel approach for unifying feature representations across diverse modalities. BMDG creates multiple virtual intermediate domains by finding and aligning body part features extracted from both I and V modalities. Indeed, BMDG aims to reduce the modality gaps in two steps. First, it aligns modalities in feature space by learning shared and modality-invariant body part prototypes from V and I images. Then, it generalizes the feature representation by applying bidirectional multi-step learning, which progressively refines feature representations in each step and incorporates more prototypes from both modalities. In particular, our method minimizes the cross-modal gap by identifying and aligning shared prototypes that capture key discriminative features across modalities, then uses multiple bridging steps based on this information to enhance the feature representation. Experiments conducted on challenging V-I ReID datasets indicate that our BMDG approach outperforms state-of-the-art part-based models as well as methods that generate an intermediate domain for V-I person ReID.
https://arxiv.org/abs/2403.10782
Lifelong person re-identification (LReID) assumes a practical scenario where the model is sequentially trained on continuously incoming datasets while alleviating catastrophic forgetting on the old datasets. However, not only the training datasets but also the gallery images are incrementally accumulated, which requires a huge amount of computation and storage space to extract the features at the inference phase. In this paper, we address the above-mentioned problem by incorporating backward compatibility into LReID for the first time. We train the model using the continuously incoming datasets while maintaining the model's compatibility toward the previously trained old models without re-computing the features of the old gallery images. To this end, we devise a cross-model compatibility loss based on contrastive learning with respect to the replay features across all the old datasets. Moreover, we also develop a knowledge consolidation method based on part classification to learn the shared representation across different datasets for backward compatibility. We also suggest a more practical methodology for performance evaluation, in which all the gallery and query images are considered together. Experimental results demonstrate that the proposed method achieves significantly higher backward-compatibility performance than the existing methods. It is a promising tool for more practical scenarios of LReID.
https://arxiv.org/abs/2403.10022
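The cross-model compatibility idea can be illustrated with a contrastive loss that ties the new model's embeddings to the frozen old model's replay features of the same identity. A minimal sketch under assumed shapes and a made-up temperature; it shows a generic backward-compatible contrastive loss, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def cross_model_compatibility_loss(new_feats, old_replay_feats, labels, replay_labels, tau=0.1):
    # keep the new model's embeddings close to the old model's replay features
    # of the same identity, so old gallery features need not be re-extracted
    new_feats = F.normalize(new_feats, dim=1)
    old_replay_feats = F.normalize(old_replay_feats, dim=1)
    sim = new_feats @ old_replay_feats.t() / tau                     # (B, R)
    pos_mask = labels.unsqueeze(1).eq(replay_labels.unsqueeze(0)).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()

new_feats = torch.randn(16, 512)           # new model, current batch
replay_feats = torch.randn(64, 512)        # features computed by the frozen old model
labels = torch.randint(0, 10, (16,))
replay_labels = torch.randint(0, 10, (64,))
loss = cross_model_compatibility_loss(new_feats, replay_feats, labels, replay_labels)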
Cloth-changing person re-identification aims to retrieve and identify specific pedestrians by using cloth-irrelevant features in person cloth-changing scenarios. However, pedestrian images captured by surveillance probes usually contain occlusions in real-world scenarios. The performance of existing cloth-changing re-identification methods is significantly degraded due to the reduction of discriminative cloth-irrelevant features caused by occlusion. We define cloth-changing person re-identification in occlusion scenarios as occluded cloth-changing person re-identification (Occ-CC-ReID), and to the best of our knowledge, we are the first to propose occluded cloth-changing person re-identification as a new task. We constructed two occluded cloth-changing person re-identification datasets for different occlusion scenarios: Occluded-PRCC and Occluded-LTCC. The datasets can be obtained from the following link: this https URL Re-Identification.
https://arxiv.org/abs/2403.08557
Cloth-Changing Person Re-Identification (CC-ReID) aims to accurately identify the target person in more realistic surveillance scenarios, where pedestrians usually change their clothing. Despite great progress, limited cloth-changing training samples in existing CC-ReID datasets still prevent the model from adequately learning cloth-irrelevant features. In addition, due to the absence of explicit supervision to keep the model constantly focused on cloth-irrelevant areas, existing methods are still hampered by the disruption of clothing variations. To solve the above issues, we propose an Identity-aware Dual-constraint Network (IDNet) for the CC-ReID task. Specifically, to help the model extract cloth-irrelevant clues, we propose a Clothes Diversity Augmentation (CDA), which generates more realistic cloth-changing samples by enriching the clothing color while preserving the texture. In addition, a Multi-scale Constraint Block (MCB) is designed, which extracts fine-grained identity-related features and effectively transfers cloth-irrelevant knowledge. Moreover, a Counterfactual-guided Attention Module (CAM) is presented, which learns cloth-irrelevant features from channel and space dimensions and utilizes the counterfactual intervention for supervising the attention map to highlight identity-related regions. Finally, a Semantic Alignment Constraint (SAC) is designed to facilitate high-level semantic feature interaction. Comprehensive experiments on four CC-ReID datasets indicate that our method outperforms prior state-of-the-art approaches.
https://arxiv.org/abs/2403.08270
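The clothing-color augmentation idea can be pictured as shifting only the hue of clothing pixels while leaving luminance, and therefore texture detail, untouched. A generic hue-jitter sketch with NumPy/OpenCV under an assumed clothing mask; it is an illustration, not the authors' CDA procedure:

import cv2
import numpy as np

def clothing_hue_jitter(img_bgr, cloth_mask, max_shift=30):
    # randomly shift the hue of clothing pixels only; saturation and value
    # (and hence texture) are left unchanged
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.int16)
    shift = np.random.randint(-max_shift, max_shift + 1)
    hsv[..., 0] = np.where(cloth_mask, (hsv[..., 0] + shift) % 180, hsv[..., 0])
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

img = np.random.randint(0, 256, (256, 128, 3), dtype=np.uint8)   # stand-in pedestrian image
mask = np.zeros(img.shape[:2], dtype=bool)
mask[64:192, 16:112] = True                                       # stand-in clothing mask
augmented = clothing_hue_jitter(img, mask)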
Visible-infrared person re-identification (VI-ReID) is challenging due to considerable cross-modality discrepancies. Existing works mainly focus on learning modality-invariant features while suppressing modality-specific ones. However, retrieving visible images based only on infrared samples is extremely difficult because of the absence of color information. To this end, we present the Refer-VI-ReID setting, which aims to match target visible images from both infrared images and coarse language descriptions (e.g., "a man with red top and black pants") to complement the missing color information. To address this task, we design a Y-Y-shape decomposition structure, dubbed YYDS, to decompose and aggregate texture and color features of targets. Specifically, a text-IoU regularization strategy is first presented to facilitate the decomposition training, and a joint relation module is then proposed to infer the aggregation. Furthermore, a cross-modal version of the k-reciprocal re-ranking algorithm, named CMKR, is investigated, in which three neighbor search strategies and one local query expansion method are explored to alleviate the modality bias problem of the near neighbors. We conduct experiments on the SYSU-MM01, RegDB and LLCM datasets with our manually annotated descriptions. Both YYDS and CMKR achieve remarkable improvements over SOTA methods on all three datasets. Codes are available at this https URL.
https://arxiv.org/abs/2403.04183
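CMKR builds on the standard k-reciprocal neighbor idea: two samples are k-reciprocal neighbors if each appears in the other's top-k list. A compact NumPy sketch of this basic computation on a plain distance matrix, without the cross-modal neighbor-search strategies or local query expansion described above:

import numpy as np

def k_reciprocal_neighbors(dist, k=20):
    # i and j are k-reciprocal neighbors if each appears in the other's top-k list
    ranks = np.argsort(dist, axis=1)[:, :k]          # top-k neighbors per sample
    recip = []
    for i in range(dist.shape[0]):
        recip.append(np.array([j for j in ranks[i] if i in ranks[j]]))
    return recip

dist = np.random.rand(100, 100)
dist = (dist + dist.T) / 2
np.fill_diagonal(dist, 0)                            # toy symmetric distance matrix
neighbors = k_reciprocal_neighbors(dist)             # list of k-reciprocal sets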
This work addresses the task of long-term person re-identification. Typically, person re-identification assumes that people do not change their clothes, which limits its applications to short-term scenarios. To overcome this limitation, we investigate long-term person re-identification, which considers both clothes-changing and clothes-consistent scenarios. In this paper, we propose a novel framework that effectively learns and utilizes both global and local information. The proposed framework consists of three streams: global, local body part, and head streams. The global and head streams encode identity-relevant information from an entire image and a cropped image of the head region, respectively. Both streams encode the most distinct, less distinct, and average features using the combinations of adversarial erasing, max pooling, and average pooling. The local body part stream extracts identity-related information for each body part, allowing it to be compared with the same body part from another image. Since body part annotations are not available in re-identification datasets, pseudo-labels are generated using clustering. These labels are then utilized to train a body part segmentation head in the local body part stream. The proposed framework is trained by backpropagating the weighted summation of the identity classification loss, the pair-based loss, and the pseudo body part segmentation loss. To demonstrate the effectiveness of the proposed method, we conducted experiments on three publicly available datasets (Celeb-reID, PRCC, and VC-Clothes). The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art method.
https://arxiv.org/abs/2403.02892
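A toy sketch of how one feature map can yield the "most distinct", "less distinct", and "average" descriptors via max pooling, erasing the strongest responses, and average pooling. This is a generic rendition of the idea in PyTorch with an assumed erase ratio, not the paper's exact adversarial-erasing module:

import torch
import torch.nn.functional as F

def three_descriptors(feat_map, erase_ratio=0.3):
    # feat_map: (B, C, H, W) backbone output
    b, c, h, w = feat_map.shape
    most = F.adaptive_max_pool2d(feat_map, 1).flatten(1)          # most distinct
    avg  = F.adaptive_avg_pool2d(feat_map, 1).flatten(1)          # average
    # erase the spatial locations with the highest activation energy,
    # then max-pool again to expose less distinct but still useful cues
    energy = feat_map.abs().mean(dim=1).flatten(1)                # (B, H*W)
    k = int(erase_ratio * h * w)
    idx = energy.topk(k, dim=1).indices
    mask = torch.ones_like(energy).scatter_(1, idx, 0.0).view(b, 1, h, w)
    less = F.adaptive_max_pool2d(feat_map * mask, 1).flatten(1)   # less distinct
    return most, less, avg

most, less, avg = three_descriptors(torch.randn(4, 2048, 24, 12))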
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizontal division, which suffers from misalignment under varying human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, thus affecting the effectiveness of part-based methods. The two issues mentioned above result in the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local contexts for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omission and spatial confusion issues in previous methods. Then, we propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory and utilize two weighting strategies. These strategies address hard sample mining for global features and enhance noise resistance for part features, which enables better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.
https://arxiv.org/abs/2403.00261
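One way to read the weighted-memory idea is a momentum-updated cluster memory whose update strength depends on how hard (far from its centroid) each sample is. The sketch below is a generic stand-in for hard-example-aware weighting, with made-up hyperparameters, and is not the exact SCWM weighting strategies:

import torch
import torch.nn.functional as F

class WeightedClusterMemory:
    def __init__(self, num_clusters, dim, momentum=0.9):
        self.centers = F.normalize(torch.randn(num_clusters, dim), dim=1)
        self.momentum = momentum

    def update(self, feats, pseudo_labels):
        feats = F.normalize(feats, dim=1)
        for f, y in zip(feats, pseudo_labels):
            # harder samples (lower similarity to their centroid) get a larger
            # update weight, so they influence the memory more
            sim = torch.dot(f, self.centers[y])
            w = (1.0 - sim).clamp(0, 1).item()
            alpha = self.momentum * (1.0 - 0.5 * w)
            self.centers[y] = F.normalize(alpha * self.centers[y] + (1 - alpha) * f, dim=0)

mem = WeightedClusterMemory(num_clusters=500, dim=2048)
mem.update(torch.randn(32, 2048), torch.randint(0, 500, (32,)))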
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified people in infrared images to visible images without annotation, and vice versa. USVI-ReID is a challenging yet under-explored task. Most existing methods address the USVI-ReID problem using cluster-based contrastive learning, which simply employs the cluster center as a representation of a person. However, the cluster center primarily focuses on shared information, overlooking disparity. To address the problem, we propose a Progressive Contrastive Learning with Multi-Prototype (PCLMP) method for USVI-ReID. In brief, we first generate the hard prototype by selecting the sample with the maximum distance from the cluster center. This hard prototype is used in the contrastive loss to emphasize disparity. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. This dynamic prototype is used to retain the natural variety of features while reducing instability in the simultaneous learning of both common and disparate information. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards hard samples, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method. PCLMP outperforms the existing state-of-the-art method with an average mAP improvement of 3.9%. The source codes will be released.
https://arxiv.org/abs/2402.19026
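The hard and dynamic prototypes described above are simple to state concretely: the hard prototype is the cluster member farthest from the centroid, and the dynamic prototype is a randomly drawn member. A short PyTorch sketch under the assumption that one cluster's features are already gathered; it omits the progressive learning strategy and the contrastive losses themselves:

import torch

def build_prototypes(cluster_feats):
    # cluster_feats: (N, D) features assigned to one cluster
    center = cluster_feats.mean(dim=0)                          # usual centroid prototype
    dists = (cluster_feats - center).pow(2).sum(dim=1)
    hard = cluster_feats[dists.argmax()]                        # farthest member = hard prototype
    rand_idx = torch.randint(0, cluster_feats.size(0), (1,)).item()
    dynamic = cluster_feats[rand_idx]                           # random member = dynamic prototype
    return center, hard, dynamic

center, hard, dynamic = build_prototypes(torch.randn(50, 2048))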
Online Unsupervised Domain Adaptation (OUDA) for person Re-Identification (Re-ID) is the task of continuously adapting a model trained on a well-annotated source domain dataset to a target domain observed as a data stream. In OUDA, person Re-ID models face two main challenges: catastrophic forgetting and domain shift. In this work, we propose a new Source-guided Similarity Preservation (S2P) framework to alleviate these two problems. Our framework is based on the extraction of a support set composed of source images that maximizes the similarity with the target data. This support set is used to identify feature similarities that must be preserved during the learning process. S2P can incorporate multiple existing UDA methods to mitigate catastrophic forgetting. Our experiments show that S2P outperforms previous state-of-the-art methods on multiple real-to-real and synthetic-to-real challenging OUDA benchmarks.
https://arxiv.org/abs/2402.15206
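A minimal sketch of building the source support set: keep the source samples whose features are most similar to the current target batch (cosine similarity, top-k). Sizes and the scoring rule are illustrative, and the sketch omits how S2P then uses the set to preserve feature similarities during adaptation:

import torch
import torch.nn.functional as F

def select_support_set(source_feats, target_feats, k=256):
    # score each source sample by its best similarity to any target sample,
    # then keep the top-k as the support set
    s = F.normalize(source_feats, dim=1)
    t = F.normalize(target_feats, dim=1)
    scores = (s @ t.t()).max(dim=1).values          # (num_source,)
    return scores.topk(k).indices

support_idx = select_support_set(torch.randn(10000, 2048), torch.randn(128, 2048))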
Long-term Person Re-Identification (LRe-ID) aims at matching an individual across cameras after a long period of time, presenting variations in clothing, pose, and viewpoint. In this work, we propose CCPA: a Contrastive Clothing and Pose Augmentation framework for LRe-ID. Beyond appearance, CCPA captures cloth-invariant body shape information using a Relation Graph Attention Network. Training a robust LRe-ID model requires a wide range of clothing variations and expensive clothing labels, which are lacking in current LRe-ID datasets. To address this, we perform clothing and pose transfer across identities to generate images with more clothing variations and images of different persons wearing similar clothing. The augmented batch of images serves as input to our proposed Fine-grained Contrastive Losses, which not only supervise the Re-ID model to learn discriminative person embeddings under long-term scenarios but also ensure in-distribution data generation. Results on LRe-ID datasets demonstrate the effectiveness of our CCPA framework.
https://arxiv.org/abs/2402.14454
Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.
https://arxiv.org/abs/2402.10435
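The patch-selection step can be pictured as ranking patch tokens by their similarity to a learnable proxy token and keeping only the top fraction as occlusion-free tokens. A generic PyTorch sketch with an assumed keep ratio; the label-guided training of the proxy and the feature blending module are omitted:

import torch
import torch.nn.functional as F

def select_informative_tokens(patch_tokens, proxy_token, keep_ratio=0.5):
    # patch_tokens: (B, N, D), proxy_token: (D,)
    sim = F.cosine_similarity(patch_tokens, proxy_token.view(1, 1, -1), dim=-1)  # (B, N)
    k = max(1, int(keep_ratio * patch_tokens.size(1)))
    idx = sim.topk(k, dim=1).indices                       # indices of informative tokens
    batch_idx = torch.arange(patch_tokens.size(0)).unsqueeze(1)
    return patch_tokens[batch_idx, idx]                    # (B, k, D) selected tokens

tokens = torch.randn(8, 196, 768)
proxy = torch.randn(768)
kept = select_informative_tokens(tokens, proxy)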
Despite significant progress in optical character recognition (OCR) and computer vision systems, robustly recognizing text and identifying people in images taken in unconstrained in-the-wild environments remain an ongoing challenge. However, such obstacles must be overcome in practical applications of vision systems, such as identifying racers in photos taken during off-road racing events. To this end, we introduce two new challenging real-world datasets - the off-road motorcycle Racer Number Dataset (RND) and the Muddy Racer re-iDentification Dataset (MUDD) - to highlight the shortcomings of current methods and drive advances in OCR and person re-identification (ReID) under extreme conditions. These two datasets feature over 6,300 images taken during off-road competitions which exhibit a variety of factors that undermine even modern vision systems, namely mud, complex poses, and motion blur. We establish benchmark performance on both datasets using state-of-the-art models. Off-the-shelf models transfer poorly, reaching only 15% end-to-end (E2E) F1 score on text spotting, and 33% rank-1 accuracy on ReID. Fine-tuning yields major improvements, bringing model performance to 53% F1 score for E2E text spotting and 79% rank-1 accuracy on ReID, but still falls short of good performance. Our analysis exposes open problems in real-world OCR and ReID that necessitate domain-targeted techniques. With these datasets and analysis of model limitations, we aim to foster innovations in handling real-world conditions like mud and complex poses to drive progress in robust computer vision. All data was sourced from this http URL, a website used by professional motorsports photographers, racers, and fans. The top-performing text spotting and ReID models are deployed on this platform to power real-time race photo search.
https://arxiv.org/abs/2402.08025
The acquisition of large-scale, precisely labeled datasets for person re-identification (ReID) poses a significant challenge. Weakly supervised ReID has begun to address this issue, although its performance lags behind fully supervised methods. In response, we introduce Contrastive Multiple Instance Learning (CMIL), a novel framework tailored for more effective weakly supervised ReID. CMIL distinguishes itself by requiring only a single model and no pseudo labels while leveraging contrastive losses -- a technique that has significantly enhanced traditional ReID performance yet is absent in all prior MIL-based approaches. Through extensive experiments and analysis across three datasets, CMIL not only matches state-of-the-art performance on the large-scale SYSU-30k dataset with fewer assumptions but also consistently outperforms all baselines on the WL-market1501 and Weakly Labeled MUddy racer re-iDentification (WL-MUDD) datasets. We introduce and release the WL-MUDD dataset, an extension of the MUDD dataset featuring naturally occurring weak labels from the real-world application at this http URL. All our code and data are accessible at this https URL.
https://arxiv.org/abs/2402.07685
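The core construction, contrastive learning applied to bags of instances rather than individually labeled samples, can be sketched as: aggregate each bag of crop features into a bag embedding, then apply a supervised contrastive loss over bag labels. A generic PyTorch rendition under assumed shapes and mean-pooled bags, not the CMIL implementation:

import torch
import torch.nn.functional as F

def bag_contrastive_loss(bags, bag_labels, tau=0.1):
    # bags: list of (n_i, D) instance-feature tensors, one bag per weak label
    emb = F.normalize(torch.stack([b.mean(dim=0) for b in bags]), dim=1)  # bag embeddings
    labels = torch.as_tensor(bag_labels)
    sim = emb @ emb.t() / tau
    self_mask = torch.eye(len(bags), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))                       # exclude self-pairs
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob.masked_fill(~pos, 0).sum(1) / pos.sum(1).clamp(min=1)).mean()

bags = [torch.randn(5, 512), torch.randn(3, 512), torch.randn(4, 512), torch.randn(6, 512)]
loss = bag_contrastive_loss(bags, bag_labels=[0, 1, 0, 1])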
Current state-of-the-art Video-based Person Re-Identification (Re-ID) primarily relies on appearance features extracted by deep learning models. These methods are not applicable to long-term analysis in real-world scenarios where persons have changed clothes, making appearance information unreliable. In this work, we deal with the practical problem of Video-based Cloth-Changing Person Re-ID (VCCRe-ID) by proposing "Attention-based Shape and Gait Representations Learning" (ASGL) for VCCRe-ID. Our ASGL framework improves Re-ID performance under clothing variations by learning clothing-invariant gait cues using a Spatial-Temporal Graph Attention Network (ST-GAT). Given the 3D-skeleton-based spatial-temporal graph, our proposed ST-GAT comprises multi-head attention modules, which are able to enhance the robustness of gait embeddings under viewpoint changes and occlusions. The ST-GAT amplifies the important motion ranges and reduces the influence of noisy poses. Then, the multi-head learning module effectively preserves beneficial local temporal dynamics of movement. We also boost the discriminative power of person representations by learning body shape cues using a GAT. Experiments on two large-scale VCCRe-ID datasets demonstrate that our proposed framework outperforms state-of-the-art methods by 12.2% in rank-1 accuracy and 7.0% in mAP.
https://arxiv.org/abs/2402.03716
In this paper, we study a new problem of cross-domain video-based person re-identification (Re-ID). Specifically, we take a synthetic video dataset as the source domain for training and use real-world videos for testing, which significantly reduces the dependence on real training data collection and annotation. To unveil the power of synthetic data for video person Re-ID, we first propose a self-supervised domain-invariant feature learning strategy for both static and temporal features. Then, to further improve the person identification ability in the target domain, we develop a mean-teacher scheme with a self-supervised ID consistency loss. Experimental results on four real datasets verify the rationality of cross-synthetic-real domain adaptation and the effectiveness of our method. We are also surprised to find that the synthetic data performs even better than the real data in the cross-domain setting.
https://arxiv.org/abs/2402.02108
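The mean-teacher scheme with an ID consistency loss is a standard construction: the teacher is an exponential moving average of the student, and the student's identity predictions on target-domain frames are pulled toward the teacher's. A compact PyTorch sketch with placeholder networks and hyperparameters, not the paper's exact configuration:

import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Linear(2048, 512), torch.nn.ReLU(), torch.nn.Linear(512, 702))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(teacher, student, m=0.999):
    # teacher weights follow an exponential moving average of the student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data.mul_(m).add_(ps.data, alpha=1 - m)

def id_consistency_loss(student_logits, teacher_logits):
    # encourage the student's identity predictions to agree with the
    # more stable teacher's predictions on unlabeled target frames
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits.detach(), dim=1),
                    reduction="batchmean")

x = torch.randn(16, 2048)                    # hypothetical frame-level features
loss = id_consistency_loss(student(x), teacher(x))
loss.backward()
ema_update(teacher, student)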
As a cutting-edge biosensor, the event camera holds significant potential in the field of computer vision, particularly regarding privacy preservation. However, compared to traditional cameras, event streams often contain noise and possess extremely sparse semantics, posing a formidable challenge for event-based person re-identification (event Re-ID). To address this, we introduce a novel event person re-identification network: the Spectrum-guided Feature Enhancement Network (SFE-Net). This network consists of two innovative components: the Multi-grain Spectrum Attention Mechanism (MSAM) and the Consecutive Patch Dropout Module (CPDM). MSAM employs a Fourier spectrum transform strategy to filter event noise, while also utilizing an event-guided multi-granularity attention strategy to enhance and capture discriminative person semantics. CPDM employs a consecutive patch dropout strategy to generate multiple incomplete feature maps, encouraging the deep Re-ID model to equally perceive each effective region of the person's body and capture robust person descriptors. Extensive experiments on event Re-ID datasets demonstrate that our SFE-Net achieves the best performance in this task.
https://arxiv.org/abs/2402.01269
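One way to read the spectrum-filtering idea: transform the event frame to the frequency domain, keep the low-frequency band where body structure dominates, and transform back to suppress isolated noise events. The sketch below is a plain low-pass filter in PyTorch with an assumed keep ratio, not the learned multi-grain attention of MSAM:

import torch

def spectrum_filter(event_frame, keep_ratio=0.25):
    # event_frame: (H, W) accumulation of events; suppress high-frequency
    # components, where isolated noise events tend to live, then reconstruct
    h, w = event_frame.shape
    spec = torch.fft.fftshift(torch.fft.fft2(event_frame))
    mask = torch.zeros(h, w)
    ch, cw = h // 2, w // 2
    rh, rw = int(h * keep_ratio / 2), int(w * keep_ratio / 2)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1.0      # keep only the central low-frequency band
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask)).real

denoised = spectrum_filter(torch.rand(256, 128))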
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to retrieve pedestrian images of the same identity from different modalities without annotations. While prior works focus on establishing cross-modality pseudo-label associations to bridge the modality gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency in pseudo-label space, resulting in coarse associations. In response, we introduce a Modality-Unified Label Transfer (MULT) module that simultaneously accounts for both homogeneous and heterogeneous fine-grained instance-level structures, yielding high-quality cross-modality label associations. It models both homogeneous and heterogeneous affinities, leveraging them to define the inconsistency for the pseudo-labels and then minimize it, leading to pseudo-labels that maintain alignment across modalities and consistency within intra-modality structures. Additionally, a straightforward plug-and-play Online Cross-memory Label Refinement (OCLR) module is proposed to further mitigate the impact of noisy pseudo-labels while simultaneously aligning different modalities, coupled with a Modality-Invariant Representation Learning (MIRL) framework. Experiments demonstrate that our proposed method outperforms existing USL-VI-ReID methods, highlighting the superiority of our MULT module in comparison to other cross-modality association methods. The code will be available.
https://arxiv.org/abs/2402.00672
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
https://arxiv.org/abs/2401.18032
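A part-aware compactness term can be written as pulling each part feature toward the mean of that part over samples of the same identity. A simplified PyTorch sketch of such a loss, with shapes chosen for illustration rather than taken from DROP:

import torch

def part_compactness_loss(part_feats, labels):
    # part_feats: (B, P, D) per-part features, labels: (B,) identity labels
    loss = 0.0
    for y in labels.unique():
        group = part_feats[labels == y]              # (n, P, D) same-identity samples
        center = group.mean(dim=0, keepdim=True)     # per-part identity center
        loss = loss + (group - center).pow(2).mean() # pull part features toward their center
    return loss / labels.unique().numel()

loss = part_compactness_loss(torch.randn(16, 6, 256), torch.randint(0, 4, (16,)))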