Video-based person re-identification (video re-ID) has lately attracted growing attention due to its broad practical applications in areas such as surveillance, smart cities, and public safety. Nevertheless, video re-ID remains quite difficult and is an active research area owing to numerous challenges such as viewpoint changes, occlusion, pose variation, and uncertain video sequences. In the last couple of years, deep learning on video re-ID has continuously achieved impressive results on public datasets, with various approaches being developed to handle the diverse problems in video re-ID. Compared to image-based re-ID, video re-ID is much more challenging and complex. To encourage future research, this paper presents the first comprehensive review of up-to-date advancements in deep learning approaches for video re-ID. It broadly covers three important aspects: existing video re-ID methods and their limitations, major milestones with their technical challenges, and architectural design. It offers comparative performance analysis on the available datasets, practical guidance for improving video re-ID, and promising research directions.
https://arxiv.org/abs/2303.11332
Occluded person re-identification (Re-ID) aims to address the potential occlusion problem when matching occluded or holistic pedestrians across different camera views. Many methods use the background as artificial occlusion and rely on attention networks to exclude noisy interference. However, the significant discrepancy between simple background occlusion and realistic occlusion can negatively impact the generalization of the network. To address this issue, we propose a novel transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks. Firstly, to imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise, which can distract attention like a realistic occluder, as a more complex form of occlusion. Secondly, to fully exploit these complex occluded images, we develop a Dual-Path Constraint Module (DPC) that can obtain preferable supervision information from holistic images through dual-path interaction. With our proposed method, the network can effectively circumvent a wide variety of occlusions using the basic ViT baseline. Comprehensive experimental evaluations conducted on person re-ID benchmarks demonstrate the superiority of ADP over state-of-the-art methods.
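The abstract does not specify how the ADM noise is trained; as a rough sketch of one plausible adversarial objective, assuming a ViT whose attention maps are exposed and a hypothetical noise_mask marking the disturbed tokens:

```python
import torch

def attention_disturbance_loss(attn: torch.Tensor, noise_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical ADM-style objective: measure how much CLS attention
    lands on the noise-patch tokens. Training the noise to maximize this
    value makes it distract attention like a realistic occluder.

    attn:       (B, heads, N, N) attention weights, token 0 = CLS
    noise_mask: (B, N) booleans marking tokens covered by the noise patch
    """
    cls_to_patches = attn[:, :, 0, 1:]               # CLS -> patch tokens
    mask = noise_mask[:, 1:].float().unsqueeze(1)    # broadcast over heads
    return (cls_to_patches * mask).sum(dim=-1).mean()
```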
https://arxiv.org/abs/2303.10976
Pose transfer, which aims to transfer a given person into a specified posture, has recently attracted considerable attention. A typical pose transfer framework usually employs representative datasets to train a discriminative model, which is often violated by out-of-distribution (OOD) instances. Recently, test-time adaptation (TTA) offers a feasible solution for OOD data by using a pre-trained model that learns essential features with self-supervision. However, those methods implicitly assume that all test distributions share a unified signal that can be learned directly. In open-world conditions, the pose transfer task raises various independent signals, OOD appearance and OOD skeleton, which need to be extracted and handled separately. To address this point, we develop SEquential Test-time Adaptation (SETA). In the test-time phase, SETA extracts and distributes external appearance texture by augmenting OOD data for self-supervised training. To make the non-Euclidean similarity among different postures explicit, SETA uses image representations derived from a person re-identification (Re-ID) model for similarity computation. By sequentially addressing implicit posture representation at test time, SETA greatly improves the generalization performance of current pose transfer models. In our experiments, we first show that pose transfer can be applied to open-world applications, including TikTok reenactment and celebrity motion synthesis.
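As an illustration of what one sequential test-time update could look like, here is a minimal sketch under assumed simplifications (a consistency loss between two augmented views stands in for SETA's appearance-texture objective; model, optimizer, and augment are hypothetical):

```python
import torch
import torch.nn.functional as F

def test_time_adaptation_step(model, optimizer, x_ood, augment):
    # One sequential test-time update: encourage consistent embeddings
    # for two random augmentations of the same OOD frame.
    model.train()
    f1 = model(augment(x_ood))
    f2 = model(augment(x_ood))
    loss = 1.0 - F.cosine_similarity(f1, f2, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```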
https://arxiv.org/abs/2303.10945
Text-based person re-identification (ReID) aims to identify images of a targeted person from a large-scale person image database according to a given textual description. However, due to significant inter-modal gaps, text-based person ReID remains a challenging problem. Most existing methods rely heavily on the similarity contributed by matched word-region pairs, while neglecting mismatched word-region pairs, which may play a decisive role. Accordingly, we propose to mine false positive examples (MFPE) via a jointly optimized multi-branch architecture to handle this problem. MFPE contains three branches, including a false positive mining (FPM) branch that highlights the role of mismatched word-region pairs. In addition, MFPE carefully designs a cross-relu loss to increase the gap between the similarity scores of matched and mismatched word-region pairs. Extensive experiments on CUHK-PEDES demonstrate the superior effectiveness of MFPE. Our code is released at this https URL.
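The abstract does not give the exact form of the cross-relu loss; a minimal sketch of one hinge-style reading, in which matched word-region similarities are pushed above one margin and mismatched ones below another (the margins m_pos and m_neg are assumptions):

```python
import torch.nn.functional as F

def cross_relu_loss(sim_matched, sim_mismatched, m_pos=0.7, m_neg=0.3):
    # Widen the gap between similarity scores of matched and mismatched
    # word-region pairs: matched above m_pos, mismatched below m_neg.
    loss_pos = F.relu(m_pos - sim_matched).mean()
    loss_neg = F.relu(sim_mismatched - m_neg).mean()
    return loss_pos + loss_neg
```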
https://arxiv.org/abs/2303.08466
Person re-identification (re-ID) via 3D skeleton data is an emerging topic with prominent advantages. Existing methods usually design skeleton descriptors with raw body joints or perform skeleton sequence representation learning. However, they typically cannot concurrently model different body-component relations, and rarely explore useful semantics from fine-grained representations of body joints. In this paper, we propose a generic Transformer-based Skeleton Graph prototype contrastive learning (TranSG) approach with structure-trajectory prompted reconstruction to fully capture skeletal relations and valuable spatial-temporal semantics from skeleton graphs for person re-ID. Specifically, we first devise the Skeleton Graph Transformer (SGT) to simultaneously learn body and motion relations within skeleton graphs, so as to aggregate key correlative node features into graph representations. Then, we propose the Graph Prototype Contrastive learning (GPC) to mine the most typical graph features (graph prototypes) of each identity, and contrast the inherent similarity between graph representations and different prototypes from both skeleton and sequence levels to learn discriminative graph representations. Last, a graph Structure-Trajectory Prompted Reconstruction (STPR) mechanism is proposed to exploit the spatial and temporal contexts of graph nodes to prompt skeleton graph reconstruction, which facilitates capturing more valuable patterns and graph semantics for person re-ID. Empirical evaluations demonstrate that TranSG significantly outperforms existing state-of-the-art methods. We further show its generality under different graph modeling, RGB-estimated skeletons, and unsupervised scenarios.
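A minimal sketch of the prototype-contrastive idea at the sequence level, assuming an InfoNCE-style formulation (the paper's skeleton- and sequence-level contrast is richer than this):

```python
import torch
import torch.nn.functional as F

def graph_prototype_contrastive(z, prototypes, labels, tau=0.07):
    # z:          (B, D) graph representations
    # prototypes: (C, D) one typical graph feature per identity
    # Contrast each representation against all identity prototypes.
    z = F.normalize(z, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / tau
    return F.cross_entropy(logits, labels)
```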
https://arxiv.org/abs/2303.06819
Unsupervised Re-ID methods aim at learning robust and discriminative features from unlabeled data. However, existing methods often ignore the relationship between the module parameters of the Re-ID framework and the feature distributions, which may lead to feature misalignment and hinder model performance. To address this problem, we propose a dynamic clustering and cluster contrastive learning (DCCC) method. Specifically, we first design a dynamic clustering parameters scheduler (DCPS) which adjusts the hyper-parameters of clustering to fit the variation of intra- and inter-class distances. Then, a dynamic cluster contrastive learning (DyCL) method is designed to match the cluster representation vectors' weights with the local feature association. Finally, a label smoothing soft contrastive loss ($L_{ss}$) is built to keep the balance between cluster contrastive learning and self-supervised learning with low computational cost. Experiments on several widely used public datasets validate the effectiveness of the proposed DCCC, which outperforms previous state-of-the-art methods.
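A minimal sketch of what a label smoothing soft contrastive loss over cluster centroids might look like (a hypothetical reading; the smoothing weight eps and temperature tau are assumptions):

```python
import torch
import torch.nn.functional as F

def soft_cluster_contrastive(z, centroids, cluster_ids, tau=0.05, eps=0.1):
    # Cluster contrastive loss with label smoothing: the target puts
    # 1 - eps on the assigned centroid and spreads eps over the rest.
    z = F.normalize(z, dim=-1)
    c = F.normalize(centroids, dim=-1)
    logits = z @ c.t() / tau
    n = c.size(0)
    target = torch.full_like(logits, eps / (n - 1))
    target.scatter_(1, cluster_ids.unsqueeze(1), 1.0 - eps)
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```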
https://arxiv.org/abs/2303.06810
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspects to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.
https://arxiv.org/abs/2303.02936
Occluded person re-identification (Re-ID) is a challenging problem due to the interference of occluders. Most existing methods focus on visible human body parts through some prior information. However, when complementary occlusions occur, features in occluded regions can interfere with matching, which severely affects performance. In this paper, different from most previous works that discard the occluded region, we propose a Feature Completion Transformer (FCFormer) to implicitly complement the semantic information of occluded parts in the feature space. Specifically, Occlusion Instance Augmentation (OIA) is proposed to simulate real and diverse occlusion situations on the holistic image. These augmented images not only enrich the number of occlusion samples in the training set, but also form pairs with the holistic images. Subsequently, a dual-stream architecture with a shared encoder is proposed to learn paired discriminative features from the paired inputs. Without additional semantic information, an occluded-holistic feature sample-label pair can be automatically created. Then, a Feature Completion Decoder (FCD) is designed to complement the features of occluded regions by using learnable tokens to aggregate possible information from self-generated occluded features. Finally, we propose the Cross Hard Triplet (CHT) loss to further bridge the gap between complemented features and features extracted under the same identity. In addition, a Feature Completion Consistency (FC$^2$) loss is introduced to push the generated completion feature distribution closer to the real holistic feature distribution. Extensive experiments on five challenging datasets demonstrate that the proposed FCFormer achieves superior performance and outperforms state-of-the-art methods by significant margins on occluded datasets.
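A minimal sketch of the occlusion-augmentation idea: paste a random occluder crop onto a holistic image so that each training sample comes with an occluded-holistic pair (occluders are assumed to be pre-cropped tensors no larger than the image):

```python
import random
import torch

def occlusion_instance_augmentation(img, occluders):
    # img: (3, H, W) holistic image; occluders: list of (3, oh, ow) crops.
    # Returns an (occluded, holistic) pair for dual-stream training.
    _, h, w = img.shape
    occ = random.choice(occluders)
    _, oh, ow = occ.shape
    top = random.randint(0, h - oh)
    left = random.randint(0, w - ow)
    out = img.clone()
    out[:, top:top + oh, left:left + ow] = occ
    return out, img
```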
https://arxiv.org/abs/2303.01656
Person re-identification plays a key role in applications where a mobile robot needs to track its users over a long period of time, even if they are partially unobserved for some time, in order to follow them or be available on demand. In this context, deep-learning based real-time feature extraction on a mobile robot is often performed on special-purpose devices whose computational resources are shared for multiple tasks. Therefore, the inference speed has to be taken into account. In contrast, person re-identification is often improved by architectural changes that come at the cost of significantly slowing down inference. Attention blocks are one such example. We will show that some well-performing attention blocks used in the state of the art are subject to inference costs that are far too high to justify their use for mobile robotic applications. As a consequence, we propose an attention block that only slightly affects the inference speed while keeping up with much deeper networks or more complex attention blocks in terms of re-identification accuracy. We perform extensive neural architecture search to derive rules at which locations this attention block should be integrated into the architecture in order to achieve the best trade-off between speed and accuracy. Finally, we confirm that the best performing configuration on a re-identification benchmark also performs well on an indoor robotic dataset.
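The abstract does not reproduce the proposed block; for flavor, a squeeze-and-excitation-style channel attention block is one example of the kind of design whose overhead stays small, since it adds only a pooled vector and two 1x1 convolutions per insertion point:

```python
import torch.nn as nn

class LightChannelAttention(nn.Module):
    """SE-style channel attention: cheap at inference because it reduces
    the feature map to one vector before the two 1x1 convolutions."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))
```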
https://arxiv.org/abs/2302.14574
In the person re-identification (re-ID) task, it is still challenging to learn discriminative representations by deep learning due to limited data. Generally speaking, a model performs better as the amount of data increases. The addition of similar classes strengthens the ability of the classifier to identify similar identities, thereby improving the discriminability of the representation. In this paper, we propose a Diverse and Compact Transformer (DC-Former) that can achieve a similar effect by splitting the embedding space into multiple diverse and compact subspaces. A compact embedding subspace helps the model learn more robust and discriminative embeddings to identify similar classes, and the fusion of these diverse embeddings containing more fine-grained information can further improve the effect of re-ID. Specifically, multiple class tokens are used in a vision transformer to represent multiple embedding spaces. Then, a self-diverse constraint (SDC) is applied to these spaces to push them away from each other, which makes each embedding space diverse and compact. Furthermore, a dynamic weight controller (DWC) is designed to balance their relative importance during training. The experimental results of our method are promising, surpassing previous state-of-the-art methods on several commonly used person re-ID benchmarks.
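A minimal sketch of how a self-diverse constraint over multiple class tokens could be written, penalizing pairwise cosine similarity so the embedding subspaces stay apart (a hypothetical formulation):

```python
import torch
import torch.nn.functional as F

def self_diverse_constraint(class_tokens):
    # class_tokens: (B, K, D) embeddings from K parallel class tokens.
    # Penalize pairwise cosine similarity between the K subspaces.
    t = F.normalize(class_tokens, dim=-1)
    sim = torch.einsum('bkd,bjd->bkj', t, t)
    k = t.size(1)
    off_diag = sim - torch.eye(k, device=t.device)
    return off_diag.abs().sum() / (t.size(0) * k * (k - 1))
```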
https://arxiv.org/abs/2302.14335
Interest in automatic person re-identification systems has grown significantly in recent years, mainly driven by surveillance and smart-shop software. Due to the variability in person posture, different lighting conditions, and occluded scenarios, together with the poor quality of the images obtained by different cameras, it is currently an unsolved problem. In machine learning-based computer vision applications with limited datasets, one way to improve the performance of a re-identification system is to augment the set of images or videos available for training the neural models. Currently, one of the most robust ways to generate synthetic information for data augmentation, whether video, images, or text, is generative adversarial networks. This article reviews the most relevant recent approaches to improve the performance of person re-identification models through data augmentation using generative adversarial networks. We focus on three categories of data augmentation approaches: style transfer, pose transfer, and random generation.
https://arxiv.org/abs/2302.09119
Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same pedestrian from different modalities, where the challenge lies in the significant modality discrepancy. To alleviate the modality gap, recent methods generate intermediate images by GANs, grayscaling, or mixup strategies. However, these methods could introduce extra noise, and the semantic correspondence between the two modalities is not well learned. In this paper, we propose a Patch-Mixed Cross-Modality framework (PMCM), where two images of the same person from two modalities are split into patches and stitched into a new one for model learning. In this way, the model learns to recognize a person through patches of different styles, and the modality semantic correspondence is directly embodied. With the flexible image generation strategy, the patch-mixed images freely adjust the ratio of different modality patches, which could further alleviate the modality imbalance problem. In addition, the relationship between identity centers across modalities is explored to further reduce the modality variance, and a global-to-part constraint is introduced to regularize representation learning of part features. On two VI-ReID datasets, we report new state-of-the-art performance with the proposed method.
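A minimal sketch of the patch-mixing step under assumed details (a square patch grid and Bernoulli selection of IR patches at rate ratio):

```python
import torch

def patch_mix(rgb, ir, patch=16, ratio=0.5):
    # Stitch a new image from patches of an RGB and an IR image of the
    # same person; `ratio` controls the fraction of IR patches.
    _, h, w = rgb.shape
    gh, gw = h // patch, w // patch
    take_ir = torch.rand(gh, gw) < ratio
    out = rgb.clone()
    for i in range(gh):
        for j in range(gw):
            if take_ir[i, j]:
                ys, xs = i * patch, j * patch
                out[:, ys:ys + patch, xs:xs + patch] = \
                    ir[:, ys:ys + patch, xs:xs + patch]
    return out
```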
https://arxiv.org/abs/2302.08212
In unconstrained scenarios, face recognition and person re-identification are subject to distortions such as motion blur, atmospheric turbulence, or upsampling artifacts. To improve robustness in these scenarios, we propose a methodology called Distortion-Adaptive Learned Invariance for Identification (DaliID) models. We contend that distortion augmentations, which degrade image quality, can be successfully leveraged to a greater degree than has been shown in the literature. Aided by an adaptive weighting schedule, a novel distortion augmentation is applied at severe levels during training. This training strategy increases feature-level invariance to distortions and decreases domain shift to unconstrained scenarios. At inference, we use a magnitude-weighted fusion of features from parallel models to retain robustness across the range of images. DaliID models achieve state-of-the-art (SOTA) for both face recognition and person re-identification on seven benchmark datasets, including IJB-S, TinyFace, DeepChange, and MSMT17. Additionally, we provide recaptured evaluation data at a distance of 750+ meters and further validate on real long-distance face imagery.
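A minimal sketch of the magnitude-weighted fusion at inference, assuming the feature L2 norm serves as the quality weight (the abstract does not spell out the exact weighting):

```python
import torch

def magnitude_weighted_fusion(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    # Fuse features from two parallel models, weighting each by its
    # L2 magnitude as a proxy for confidence on the given image.
    w1 = f1.norm(dim=-1, keepdim=True)
    w2 = f2.norm(dim=-1, keepdim=True)
    return (w1 * f1 + w2 * f2) / (w1 + w2)
```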
https://arxiv.org/abs/2302.05753
Currently, most existing person re-identification methods use Instance-Level features, which are extracted from only a single image. However, these Instance-Level features can easily miss discriminative information, because the appearance of each identity varies greatly across images. Thus, it is necessary to exploit Identity-Level features, which can be shared across different images of each identity. In this paper, we propose to promote Instance-Level features to Identity-Level features by employing cross-attention to incorporate information from one image into another image of the same identity, so that more unified and discriminative pedestrian information can be obtained. We propose a novel training framework named X-ReID. Specifically, a Cross Intra-Identity Instances module (IntraX) fuses different intra-identity instances to transfer Identity-Level knowledge and make Instance-Level features more compact. A Cross Inter-Identity Instances module (InterX) involves hard positive and hard negative instances to strengthen the attention response to the same identity rather than different identities, which minimizes intra-identity variation and maximizes inter-identity variation. Extensive experiments on benchmark datasets show the superiority of our method over existing works. In particular, on the challenging MSMT17, our proposed method gains a 1.1% mAP improvement over the second-best method.
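A minimal sketch of the intra-identity cross-attention idea, where tokens of one image attend to tokens of another image of the same identity (module and parameter names are hypothetical):

```python
import torch.nn as nn

class IntraIdentityCrossAttention(nn.Module):
    """Fuse information across two instances of the same identity:
    tokens of image A attend to tokens of image B, and vice versa."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tok_a, tok_b):
        fused_a, _ = self.attn(tok_a, tok_b, tok_b)  # A queries B
        fused_b, _ = self.attn(tok_b, tok_a, tok_a)  # B queries A
        return tok_a + fused_a, tok_b + fused_b
```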
https://arxiv.org/abs/2302.02075
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images across different modalities. Although suffering from an extra modality discrepancy, existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks. The softmax loss lacks an explicit penalty for the apparent modality gap, which adversely limits the performance upper bound of the VI-ReID task. In this paper, we propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information and has clear interpretability. Specifically, the SA-Softmax loss utilizes an asynchronous optimization strategy based on the modality prototype, instead of the synchronous optimization based on the identity prototype in the original softmax loss. To encourage high overlap between the two modalities, SA-Softmax optimizes each sample by the prototype from the other spectrum. Based on the observation and analysis of SA-Softmax, we modify SA-Softmax with a Feature Mask and an Absolute-Similarity Term to alleviate ambiguous optimization during model training. Extensive experimental evaluations conducted on RegDB and SYSU-MM01 demonstrate the superior performance of SA-Softmax over state-of-the-art methods in such a cross-modality condition.
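A minimal sketch of the core asynchronous idea, classifying each sample against the identity prototypes of the other spectrum (omitting the Feature Mask and Absolute-Similarity Term; the temperature tau is an assumption):

```python
import torch
import torch.nn.functional as F

def sa_softmax_loss(feats, labels, modality, proto_vis, proto_ir, tau=1.0):
    # feats: (B, D); modality: (B,) with 0 = visible, 1 = infrared.
    # Each sample is scored against the prototypes of the *other*
    # spectrum, pulling the two modalities toward each other.
    z = F.normalize(feats, dim=-1)
    p_vis = F.normalize(proto_vis, dim=-1)
    p_ir = F.normalize(proto_ir, dim=-1)
    logits = torch.where(
        modality.unsqueeze(1) == 0,
        z @ p_ir.t() / tau,   # visible samples vs. infrared prototypes
        z @ p_vis.t() / tau,  # infrared samples vs. visible prototypes
    )
    return F.cross_entropy(logits, labels)
```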
https://arxiv.org/abs/2302.01512
Cross-spectral person re-identification, which aims to associate identities to pedestrians across different spectra, faces the main challenge of modality discrepancy. In this paper, we address the problem at both the image level and the feature level in an end-to-end hybrid learning framework named robust feature mining network (RFM). In particular, we observe that the reflective intensity of the same surface in photos shot at different wavelengths can be transformed using a linear model. Besides, we show that the variable linear factor across different surfaces is the main culprit that initiates the modality discrepancy. We integrate this reflection observation into an image-level data augmentation by proposing the linear transformation generator (LTG). Moreover, at the feature level, we introduce a cross-center loss to explore a more compact intra-class distribution and modality-aware spatial attention to exploit textured regions more efficiently. Experimental results on two standard cross-spectral person re-identification datasets, i.e., RegDB and SYSU-MM01, demonstrate state-of-the-art performance.
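A minimal sketch of an LTG-style augmentation under a simplifying assumption (one random linear factor per channel rather than per surface region, as the paper intends):

```python
import torch

def linear_transformation_augment(img, a_range=(0.5, 1.5), b_range=(-0.1, 0.1)):
    # img: (C, H, W) in [0, 1]. Apply a random linear map a*x + b,
    # mimicking how reflective intensity shifts across wavelengths.
    c = img.size(0)
    a = torch.empty(c, 1, 1).uniform_(*a_range)
    b = torch.empty(c, 1, 1).uniform_(*b_range)
    return (a * img + b).clamp(0.0, 1.0)
```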
https://arxiv.org/abs/2302.00884
Unsupervised domain adaptation person re-identification (Re-ID) aims to identify pedestrian images within an unlabeled target domain with the help of an auxiliary labeled source-domain dataset. Many existing works attempt to recover reliable identity information by considering multiple homogeneous networks, and then use the generated labels to train the model in the target domain. However, these homogeneous networks identify people in approximate subspaces and equally exchange their knowledge with each other or with their mean network to improve their ability, which inevitably limits the scope of available knowledge and leads them into the same mistakes. This paper proposes a Dual-level Asymmetric Mutual Learning method (DAML) to learn discriminative representations from a broader knowledge scope with diverse embedding spaces. Specifically, two heterogeneous networks mutually learn knowledge from asymmetric subspaces through pseudo label generation in a hard distillation manner. The knowledge transfer between the two networks follows an asymmetric mutual learning manner: the teacher network learns to identify both the target and source domains while adapting to the target-domain distribution based on the knowledge of the student, while the student network is trained on the target dataset and employs the ground-truth labels through the knowledge of the teacher. Extensive experiments on the Market-1501, CUHK-SYSU, and MSMT17 public datasets verify the superiority of DAML over the state of the art.
https://arxiv.org/abs/2301.12439
Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well under large domain variations, such as illumination, viewpoint, and occlusion. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNNs) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions so that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of the synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. In our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.
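For reference, the Mahalanobis distance that each bank entry uses as its metric, given a learned positive semi-definite matrix M:

```python
import torch

def mahalanobis_distance(x, y, m):
    # d(x, y) = sqrt((x - y)^T M (x - y)) with M learned per bank entry.
    d = x - y
    return torch.sqrt(torch.einsum('...i,ij,...j->...', d, m, d).clamp(min=0))
```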
https://arxiv.org/abs/2301.09702
Person re-identification (ReID) has been extensively studied in recent years due to the increasing demand in public security. However, collecting and handling such sensitive personal data raises privacy concerns. Therefore, federated learning has been explored for person ReID, which aims to share minimal sensitive data between different parties (clients). However, existing federated learning based person ReID methods generally rely on laborious and time-consuming data annotation, and it is difficult to guarantee cross-domain consistency. Thus, in this work, a federated unsupervised cluster-contrastive (FedUCC) learning method is proposed for person ReID. FedUCC introduces a three-stage modelling strategy in a coarse-to-fine manner. In detail, generic knowledge, specialized knowledge, and patch knowledge are discovered using a deep neural network. This enables the sharing of mutual knowledge among clients while retaining local domain-specific knowledge, based on the kinds of network layers and their parameters. Comprehensive experiments on 8 public benchmark datasets demonstrate the state-of-the-art performance of our proposed method.
https://arxiv.org/abs/2301.07320
Adversarial attacks have recently been investigated in person re-identification. These attacks perform well under the cross-dataset or cross-model setting. However, the challenges present in the cross-dataset cross-model scenario do not allow these models to achieve similar accuracy. To this end, we propose our method with the goal of achieving better transferability against different models and across datasets. We generate a mask to obtain better performance across models and use meta learning to boost generalizability in the challenging cross-dataset cross-model setting. Experiments on Market-1501, DukeMTMC-reID, and MSMT-17 demonstrate favorable results compared to other attacks.
https://arxiv.org/abs/2301.06286