Many state-of-the-art (SOTA) Multiple Object Tracking (MOT) methods extract and match Re-Identification (ReID) features, which is particularly effective against frequent and long-term occlusions. While end-to-end object detection and tracking have been the main focus of recent research, they have yet to outperform traditional methods on benchmarks such as MOT17 and MOT20. Thus, from an application standpoint, methods with separate detection and embedding remain the best option for accuracy, modularity, and ease of implementation, though the overhead of feature extraction makes them impractical for edge devices. In this paper, we investigate a selective approach that minimizes the overhead of feature extraction while preserving accuracy, modularity, and ease of implementation, and that can be integrated into various SOTA methods. We demonstrate its effectiveness by applying it to StrongSORT and Deep OC-SORT. Experiments on the MOT17, MOT20, and DanceTrack datasets show that our mechanism retains the advantages of feature extraction during occlusions while significantly reducing runtime. It also improves accuracy by preventing confusion in the feature-matching stage, particularly in cases of deformation and appearance similarity, which are common in DanceTrack. this https URL, this https URL
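A minimal sketch of the selective idea described above, under assumed criteria (the paper's exact gating rule is not reproduced, and names such as reid_model are placeholders): the expensive ReID extractor is run only for detections whose geometric association with existing tracks is ambiguous, e.g. during overlaps.

```python
# Hedged sketch: gate ReID feature extraction on association ambiguity.
import numpy as np

def iou(box_a, box_b):
    """Boxes as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def needs_reid(det_box, track_boxes, high=0.5, margin=0.15):
    """Extract appearance features only when motion/IoU cues alone are ambiguous:
    the best track overlap is weak, or two candidate tracks are nearly tied."""
    ious = sorted((iou(det_box, t) for t in track_boxes), reverse=True)
    if not ious or ious[0] < high:
        return True                          # weak geometric evidence
    if len(ious) > 1 and ious[0] - ious[1] < margin:
        return True                          # two candidate tracks are too close
    return False                             # geometry alone is sufficient

# usage (hypothetical names): per detection,
# feats = reid_model(crop) if needs_reid(det, predicted_track_boxes) else None
```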
https://arxiv.org/abs/2409.06617
The primary challenges in visible-infrared person re-identification arise from the differences between visible (vis) and infrared (ir) images, including inter-modal and intra-modal variations. These challenges are further complicated by varying viewpoints and irregular movements. Existing methods often rely on horizontal partitioning to align part-level features, which can introduce inaccuracies and has limited effectiveness in reducing modality discrepancies. In this paper, we propose a novel Prototype-Driven Multi-feature generation framework (PDM) aimed at mitigating cross-modal discrepancies by constructing diversified features and mining latent semantically similar features for modal alignment. PDM comprises two key components: a Multi-Feature Generation Module (MFGM) and a Prototype Learning Module (PLM). The MFGM generates diverse features, closely distributed around modality-shared features, to represent pedestrians. Additionally, the PLM utilizes learnable prototypes to excavate latent semantic similarities among local features between the visible and infrared modalities, thereby facilitating cross-modal instance-level alignment. We introduce a cosine heterogeneity loss to enhance prototype diversity for extracting rich local features. Extensive experiments conducted on the SYSU-MM01 and LLCM datasets demonstrate that our approach achieves state-of-the-art performance. Our codes are available at this https URL.
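As an illustration of a cosine-based prototype diversity term, the sketch below penalizes pairwise cosine similarity among learnable prototypes so that they spread out; the exact formulation of the cosine heterogeneity loss in PDM may differ.

```python
# Hedged sketch of a cosine-heterogeneity-style regularizer over prototypes.
import torch
import torch.nn.functional as F

def cosine_heterogeneity_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """prototypes: (K, D) learnable prototype vectors."""
    p = F.normalize(prototypes, dim=1)              # unit-norm prototypes
    sim = p @ p.t()                                 # (K, K) pairwise cosine similarity
    k = sim.size(0)
    mask = ~torch.eye(k, dtype=torch.bool, device=sim.device)
    return sim[mask].mean()                         # minimize off-diagonal similarity
```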
https://arxiv.org/abs/2409.05642
We address the problem of person re-identification (reID), that is, retrieving person images from a large dataset, given a query image of the person of interest. A key challenge is to learn person representations robust to intra-class variations, as different persons could have the same attribute, and persons' appearances look different, e.g., with viewpoint changes. Recent reID methods focus on learning person features discriminative only for a particular factor of variations (e.g., human pose), which also requires corresponding supervisory signals (e.g., pose annotations). To tackle this problem, we propose to factorize person images into identity-related and unrelated features. Identity-related features contain information useful for specifying a particular person (e.g., clothing), while identity-unrelated ones hold other factors (e.g., human pose). To this end, we propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN). It disentangles identity-related and unrelated features from person images through an identity-shuffling technique that exploits identification labels alone without any auxiliary supervisory signals. We restrict the distribution of identity-unrelated features or encourage the identity-related and unrelated features to be uncorrelated, facilitating the disentanglement process. Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks, including Market-1501, CUHK03, and DukeMTMC-reID. We further demonstrate the advantages of disentangling person representations on a long-term reID task, setting a new state of the art on a Celeb-reID dataset.
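A minimal sketch of the identity-shuffling idea, with placeholder encoders and decoder (the authors' exact architecture and losses are not reproduced): identity-related codes are swapped between two images of the same person before decoding, so only identification labels are needed for supervision.

```python
# Hedged sketch of identity shuffling between two images of the same person.
import torch
import torch.nn as nn

class ShuffleAE(nn.Module):
    def __init__(self, enc_id: nn.Module, enc_unrel: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc_id, self.enc_unrel, self.dec = enc_id, enc_unrel, dec

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor):
        """img_a, img_b: two images of the SAME identity, shape (B, C, H, W)."""
        id_a, id_b = self.enc_id(img_a), self.enc_id(img_b)        # identity-related codes
        un_a, un_b = self.enc_unrel(img_a), self.enc_unrel(img_b)  # identity-unrelated codes
        # shuffle: decode a's pose/background with b's identity code and vice versa
        rec_a = self.dec(torch.cat([id_b, un_a], dim=1))
        rec_b = self.dec(torch.cat([id_a, un_b], dim=1))
        return rec_a, rec_b

# Training would add reconstruction, adversarial, and ID-classification losses on
# (rec_a, rec_b); the encoders and decoder here are placeholders, not IS-GAN's modules.
```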
https://arxiv.org/abs/2409.05277
Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To rigorously examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization to encourage the generative attacker to produce highly transferable adversarial examples by learning from comprehensively simulated transfer-based cross-model, cross-dataset, and cross-test black-box meta attack tasks. Specifically, cross-model and cross-dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for the meta-train and meta-test attack processes. As different models may focus on different feature regions, a Perturbation Random Erasing module is further devised to prevent the attacker from learning to corrupt only model-specific features. To help the attacker acquire cross-test transferability, a Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing the multi-domain statistics of target models. Extensive experiments show the superiority of MTGA; in cross-model & dataset and cross-model & dataset & test attacks, MTGA outperforms SOTA methods by 21.5% and 11.3% in mean mAP drop rate, respectively. The code of MTGA will be released after the paper is accepted.
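A hedged sketch of a Perturbation-Random-Erasing-style operation; the erase policy and parameters here are assumptions, not the paper's exact design. Randomly zeroing rectangular regions of the perturbation prevents the attacker from relying on corrupting a single model-specific region.

```python
# Hedged sketch: randomly erase a patch of the adversarial perturbation.
import torch

def perturbation_random_erasing(delta: torch.Tensor, erase_ratio: float = 0.25) -> torch.Tensor:
    """delta: (B, C, H, W) perturbation; erase_ratio: fraction of area to drop."""
    b, _, h, w = delta.shape
    out = delta.clone()
    eh, ew = int(h * erase_ratio ** 0.5), int(w * erase_ratio ** 0.5)
    for i in range(b):
        top = torch.randint(0, h - eh + 1, (1,)).item()
        left = torch.randint(0, w - ew + 1, (1,)).item()
        out[i, :, top:top + eh, left:left + ew] = 0.0   # erase this patch of noise
    return out

# usage inside an attack step (hypothetical): adv = (img + perturbation_random_erasing(delta)).clamp(0, 1)
```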
https://arxiv.org/abs/2409.04208
Place recognition is an important task within autonomous navigation, involving the re-identification of previously visited locations from an initial traverse. Unlike visual place recognition (VPR), LiDAR place recognition (LPR) is tolerant to changes in lighting, seasons, and textures, leading to high performance on benchmark datasets from structured urban environments. However, there is a growing need for methods that can operate in diverse environments with high performance and minimal training. In this paper, we propose a handcrafted matching strategy that performs roto-translation invariant place recognition and relative pose estimation for both urban and unstructured natural environments. Our approach constructs Bird's Eye View (BEV) global descriptors and employs a two-stage search using matched filtering -- a signal processing technique for detecting known signals amidst noise. Extensive testing on the NCLT, Oxford Radar, and WildPlaces datasets consistently demonstrates state-of-the-art (SoTA) performance across place recognition and relative pose estimation metrics, with up to 15% higher recall than previous SoTA.
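A minimal sketch of matched filtering for rotation-invariant matching, assuming a simple 1-D angular signature per place (e.g. a histogram over the polar angles of a BEV grid); the paper's actual BEV descriptor is richer. Circular cross-correlation via FFTs yields both a match score and the relative yaw.

```python
# Hedged sketch: matched filtering as circular cross-correlation of angular signatures.
import numpy as np

def angular_signature(points: np.ndarray, n_bins: int = 360) -> np.ndarray:
    """points: (N, 2) LiDAR points (x, y) in the sensor frame."""
    theta = np.arctan2(points[:, 1], points[:, 0])                 # angles in [-pi, pi)
    hist, _ = np.histogram(theta, bins=n_bins, range=(-np.pi, np.pi))
    hist = hist.astype(np.float64)
    return (hist - hist.mean()) / (hist.std() + 1e-9)              # zero-mean, unit-variance

def matched_filter_score(sig_q: np.ndarray, sig_g: np.ndarray):
    """Circular cross-correlation between query and gallery signatures via FFT."""
    corr = np.fft.irfft(np.fft.rfft(sig_q) * np.conj(np.fft.rfft(sig_g)), n=len(sig_q))
    shift = int(np.argmax(corr))                                   # best rotation offset (bins)
    yaw = shift * 2 * np.pi / len(sig_q)
    return corr[shift] / len(sig_q), yaw                           # match score, relative yaw
```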
https://arxiv.org/abs/2409.03998
In recent years, the development of deep learning approaches for person re-identification has led to impressive results. However, this comes with limitations for industrial and practical real-world applications. Firstly, most existing works operate in closed-world scenarios, in which the people to re-identify (probes) are compared to a closed set (gallery). Real-world scenarios are often open-set problems in which the gallery is not known a priori, yet the number of open-set approaches in the literature is significantly lower. Secondly, challenges such as multi-camera setups, occlusions, real-time requirements, etc., further constrain the applicability of off-the-shelf methods. This work presents MICRO-TRACK, a Modular Industrial multi-Camera Re-identification and Open-set Tracking system that is real-time, scalable, and easy to integrate into existing industrial surveillance scenarios. Furthermore, we release a novel Re-ID and tracking dataset acquired in an industrial manufacturing facility, dubbed Facility-ReID, consisting of 18-minute videos captured by 8 surveillance cameras.
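A minimal open-set matching sketch; the thresholding rule below is an assumption rather than MICRO-TRACK's exact logic. A probe is assigned to the best gallery identity only if its similarity clears a threshold, otherwise it is treated as an unknown identity to be enrolled.

```python
# Hedged sketch: open-set gallery matching with a similarity threshold.
import numpy as np

def open_set_match(probe: np.ndarray, gallery: dict, threshold: float = 0.6):
    """probe: (D,) L2-normalized feature; gallery: {identity: (D,) L2-normalized feature}."""
    if not gallery:
        return None, 0.0                              # empty gallery -> unknown
    ids = list(gallery.keys())
    feats = np.stack([gallery[i] for i in ids])       # (K, D)
    sims = feats @ probe                              # cosine similarities
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return ids[best], float(sims[best])           # re-identified
    return None, float(sims[best])                    # unknown -> enroll as a new identity
```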
https://arxiv.org/abs/2409.03879
Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. They play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking, and facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.
https://arxiv.org/abs/2409.03245
Extracting robust feature representations is critical for object re-identification to accurately identify objects across non-overlapping cameras. Despite its strong representation ability, the Vision Transformer (ViT) tends to overfit on the most distinct regions of the training data, limiting its generalizability and attention to holistic object features. Meanwhile, due to the structural difference between CNNs and ViT, fine-grained strategies that effectively address this issue in CNNs do not carry over to ViT. To address this issue, by observing the latent diverse representation hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representation of multi-head self-attention without the typical loss of feature richness induced by the concatenation and FFN layers that follow attention. To avoid the homogenization of attention heads and promote robust part-based feature learning, two head diversity constraints are imposed: an attention diversity constraint and a correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of PartFormer. Specifically, our framework significantly outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.
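An illustrative form of an attention diversity constraint (not necessarily PartFormer's exact loss): overlap between the attention maps of different heads is penalized so that each head attends to a different region.

```python
# Hedged sketch: penalize pairwise overlap between per-head attention maps.
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N) attention of H heads from a query token over N patches."""
    a = F.normalize(attn, p=2, dim=-1)                   # unit-norm attention per head
    sim = torch.einsum('bhn,bgn->bhg', a, a)             # (B, H, H) head-to-head overlaps
    h = sim.size(1)
    mask = ~torch.eye(h, dtype=torch.bool, device=sim.device)
    return sim[:, mask].mean()                           # minimize cross-head overlap
```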
https://arxiv.org/abs/2408.16684
For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in the significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining: they often focus solely on mining singular dimensions like spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to its low computational complexity, separate MIIM modules can be placed in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.
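A hedged sketch of a cross-modality contrastive objective in the spirit of the CMKIC loss; the key-instance selection of the actual method is not reproduced. Visible and infrared features of the same identity are pulled together with an InfoNCE-style term computed across the two modalities.

```python
# Hedged sketch: supervised contrastive loss across visible and infrared features.
import torch
import torch.nn.functional as F

def cross_modality_infonce(f_vis, f_ir, labels_vis, labels_ir, tau: float = 0.1):
    """f_vis: (Nv, D), f_ir: (Ni, D) L2-normalized features; labels_*: identity IDs."""
    logits = f_vis @ f_ir.t() / tau                                       # (Nv, Ni)
    pos = labels_vis.unsqueeze(1).eq(labels_ir.unsqueeze(0)).float()      # positives mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)      # log-softmax over IR
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)     # average over positives
    return loss.mean()
```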
https://arxiv.org/abs/2408.10624
Unsupervised person re-identification (Re-ID) aims to learn a feature network with cross-camera retrieval capability from unlabelled datasets. Although pseudo-label based methods have achieved great progress in Re-ID, their performance in complex scenarios still needs improvement. In order to reduce potential misguidance accumulated during the learning process, including feature bias, noisy pseudo-labels, and invalid hard samples, a confidence-guided clustering and contrastive learning (3C) framework is proposed in this paper for unsupervised person Re-ID. This 3C framework defines three confidence degrees. i) In the clustering stage, the confidence of the discrepancy between samples and clusters is proposed to implement a harmonic discrepancy clustering algorithm (HDC). ii) In the forward-propagation training stage, the confidence of the camera diversity of a cluster is evaluated via a novel camera information entropy (CIE); clusters with high CIE values then play leading roles in training the model. iii) In the back-propagation training stage, the confidence of the hard sample in each cluster is designed and further used in a confidence-integrated harmonic discrepancy (CHD) to select informative samples for updating the memory in contrastive learning. Extensive experiments on three popular Re-ID benchmarks demonstrate the superiority of the proposed framework. In particular, the 3C framework achieves state-of-the-art results: 86.7%/94.7%, 45.3%/73.1%, and 47.1%/90.6% mAP/Rank-1 accuracy on Market-1501 and the complex datasets MSMT17 and VeRi-776, respectively. Code is available at this https URL.
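The camera information entropy of a cluster can be illustrated directly: clusters whose members come from many cameras have high entropy over the camera-label histogram and are considered more reliable for cross-camera learning. How the score is thresholded or weighted in the 3C framework is an assumption here.

```python
# Hedged sketch: entropy of the camera-label distribution within one cluster.
import numpy as np

def camera_information_entropy(camera_ids: np.ndarray) -> float:
    """camera_ids: 1-D array of camera labels for all samples in one cluster."""
    _, counts = np.unique(camera_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())     # 0 when one camera, larger when cameras are mixed

# usage (assumed policy): keep or up-weight clusters whose entropy exceeds a chosen threshold
# weights = {c: camera_information_entropy(cams_in_cluster[c]) for c in clusters}
```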
https://arxiv.org/abs/2408.09464
In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.
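A hedged sketch of a parameter-efficient set-level adapter on top of frozen CLIP frame features; dimensions and the pooling choice are assumptions rather than the exact VSLA-CLIP design. Only the small bottleneck layers are trained.

```python
# Hedged sketch: bottleneck adapter with residual connection plus temporal pooling.
import torch
import torch.nn as nn

class SetLevelAdapter(nn.Module):
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (B, T, D) frozen CLIP features for T frames of a tracklet."""
        adapted = frame_feats + self.up(self.act(self.down(frame_feats)))  # residual adapter
        return adapted.mean(dim=1)            # (B, D) set/tracklet-level embedding

# Only self.down / self.up are trained; the CLIP image encoder stays frozen.
```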
https://arxiv.org/abs/2408.07500
Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances through comparisons of body parts between samples. However, in practice, previous methods often lack sufficient awareness of the anatomical aspect of body parts, resulting in a failure to capture features of the same body parts across different samples. To address this issue, we introduce Part Aware Transformer (PAFormer), a pose estimation based ReID model which can perform precise part-to-part comparison. In order to inject part awareness into pose tokens, we introduce learnable parameters called 'pose tokens' which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without additional modules related to body part localization, which are commonly used in previous ReID methodologies that leverage pose estimation models. Additionally, leveraging the enhanced awareness of body parts, PAFormer employs a learning-based visibility predictor to estimate the degree of occlusion for each body part. We also introduce a teacher forcing technique using ground-truth visibility scores, which enables PAFormer to be trained only with visible parts. A set of extensive experiments shows that our method outperforms existing approaches on well-known ReID benchmark datasets.
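A minimal sketch of visibility-aware part-to-part matching; the aggregation rule is an assumption. Each per-part distance is weighted by the product of the predicted visibilities of the query and gallery parts, so occluded parts contribute little.

```python
# Hedged sketch: visibility-weighted aggregation of part-to-part distances.
import torch
import torch.nn.functional as F

def part_distance(parts_q, parts_g, vis_q, vis_g, eps: float = 1e-6):
    """parts_*: (P, D) per-part features; vis_*: (P,) visibility scores in [0, 1]."""
    d = 1.0 - F.cosine_similarity(parts_q, parts_g, dim=1)   # (P,) per-part cosine distances
    w = vis_q * vis_g                                         # (P,) joint visibility weights
    return (w * d).sum() / (w.sum() + eps)                    # occluded parts barely contribute
```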
https://arxiv.org/abs/2408.05918
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery across images of different views taken with different cameras, and plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional object ReID methods that directly adopt CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worse, these solutions typically learn unimodal representations from visual samples alone, which fails to explore complementary information from different modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which can effectively encourage modality collaboration and reinforce generalization capability in medical scenarios. On top of it, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Experiments on the standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.
https://arxiv.org/abs/2408.05914
Object re-identification (ReID) in large camera networks faces many challenges. First, the similar appearances of objects degrade ReID performance; this challenge cannot be addressed by existing appearance-based ReID methods. Second, most ReID studies are performed in laboratory settings and do not consider ReID problems in real-world scenarios. To overcome these challenges, we introduce a novel ReID framework that leverages a spatial-temporal fusion network and causal identity matching (CIM). The framework estimates the camera network topology using the proposed adaptive Parzen window and combines appearance features with spatial-temporal cues within the Fusion Network. It achieves outstanding performance across several datasets, including VeRi776, Vehicle-3I, and Market-1501, reaching up to 99.70% rank-1 accuracy and 95.5% mAP. Furthermore, the proposed CIM approach, which dynamically assigns gallery sets based on the camera network topology, further improves ReID accuracy and robustness in real-world settings, evidenced by a 94.95% mAP and a 95.19% F1 score on the Vehicle-3I dataset. The experimental results support the effectiveness of incorporating spatial-temporal information and CIM for real-world ReID scenarios regardless of the data domain (e.g., vehicle, person).
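A hedged sketch of the spatial-temporal idea: a Parzen (kernel density) window estimates the transition-time distribution for a camera pair from historical matches, and the temporal likelihood of a candidate pair is fused with its appearance similarity. The bandwidth, normalization, and fusion rule below are assumptions, not the paper's adaptive scheme.

```python
# Hedged sketch: Parzen-window transition-time density plus a simple score fusion.
import numpy as np

def parzen_density(t: float, observed_dts: np.ndarray, bandwidth: float) -> float:
    """Gaussian Parzen-window estimate of p(transition time = t) for one camera pair."""
    z = (t - observed_dts) / bandwidth
    return float(np.exp(-0.5 * z ** 2).sum() / (len(observed_dts) * bandwidth * np.sqrt(2 * np.pi)))

def fused_score(app_sim: float, dt: float, observed_dts: np.ndarray,
                bandwidth: float = 5.0, alpha: float = 0.7) -> float:
    """Combine appearance similarity with the temporal plausibility of a candidate pair."""
    temporal = parzen_density(dt, observed_dts, bandwidth)
    peak = max(parzen_density(float(s), observed_dts, bandwidth) for s in observed_dts)
    return alpha * app_sim + (1 - alpha) * temporal / (peak + 1e-9)   # temporal term scaled to ~[0, 1]
```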
https://arxiv.org/abs/2408.05558
Online person re-identification services face privacy breaches from potential data leakage and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is therefore crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propose an iterative method (PixelFade) that optimizes pedestrian images into noise-like images to resist recovery attacks. We first give an in-depth study of protected images from previous privacy methods, which reveals that the chaos of protected images can disrupt the learning of recovery models. Accordingly, we propose a Noise-guided Objective Function with the feature constraints of a specific authorization model, optimizing pedestrian images toward normally distributed noise images while preserving their original identity information as per the authorization model. To solve this non-convex optimization problem, we propose a heuristic optimization algorithm that alternately performs the Constraint Operation and the Partial Replacement Operation. This strategy not only ensures that original pixels are replaced with noise to protect privacy, but also guides the images toward an improved optimization direction to effectively preserve discriminative features. Extensive experiments demonstrate that our PixelFade outperforms previous methods in resisting recovery attacks and in Re-ID performance. The code is available at this https URL.
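A hedged sketch of one alternating iteration in the spirit of PixelFade; the exact objective, thresholds, and the authorization model are not reproduced. The image is pulled toward Gaussian noise while its feature under a fixed ReID encoder is constrained to stay close to the original, and a random subset of pixels is then replaced with fresh noise.

```python
# Hedged sketch: one alternating step of constraint-guided optimization and partial replacement.
import torch

def pixelfade_step(x_prot, x_orig, encoder, noise, lr=0.01, feat_tol=0.1, replace_ratio=0.05):
    """x_prot: current protected image (B, C, H, W); x_orig: original image; encoder: fixed ReID model."""
    x_prot = x_prot.detach().requires_grad_(True)
    feat_orig = encoder(x_orig).detach()
    feat = encoder(x_prot)
    feat_dist = 1.0 - torch.nn.functional.cosine_similarity(feat, feat_orig).mean()
    noise_dist = (x_prot - noise).pow(2).mean()
    # "Constraint Operation"-like step: move toward noise while keeping identity features close
    loss = noise_dist + 10.0 * torch.relu(feat_dist - feat_tol)
    loss.backward()
    with torch.no_grad():
        x_new = x_prot - lr * x_prot.grad.sign()
        # "Partial Replacement Operation"-like step: overwrite a random pixel subset with noise
        mask = (torch.rand_like(x_new) < replace_ratio).float()
        x_new = mask * noise + (1 - mask) * x_new
    return x_new.clamp(0, 1)
```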
https://arxiv.org/abs/2408.05543
Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of the Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, effectively extracting high-quality global and local features through large-scale unsupervised pre-training that combines masked image modeling with discriminative contrastive learning, followed by supervised fine-tuning on the person ReID task. This ViT-based person feature extraction method with masked image modeling (PersonViT) is unsupervised, scalable, and generalizes well, overcoming the annotation difficulty in supervised person ReID, and achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of the PersonViT method are released at this https URL to promote further research in the person ReID field.
https://arxiv.org/abs/2408.05398
This paper investigates the security vulnerabilities of adversarial-example-based image encryption by executing data reconstruction (DR) attacks on encrypted images. A representative image encryption method is the adversarial visual information hiding (AVIH), which uses type-I adversarial example training to protect gallery datasets used in image recognition tasks. In the AVIH method, the type-I adversarial example approach creates images that appear completely different but are still recognized by machines as the original ones. Additionally, the AVIH method can restore encrypted images to their original forms using a predefined private key generative model. For the best security, assigning a unique key to each image is recommended; however, storage limitations may necessitate some images sharing the same key model. This raises a crucial security question for AVIH: How many images can safely share the same key model without being compromised by a DR attack? To address this question, we introduce a dual-strategy DR attack against the AVIH encryption method by incorporating (1) generative-adversarial loss and (2) augmented identity loss, which prevent DR from overfitting -- an issue akin to that in machine learning. Our numerical results validate this approach through image recognition and re-identification benchmarks, demonstrating that our strategy can significantly enhance the quality of reconstructed images, thereby requiring fewer key-sharing encrypted images. Our source code to reproduce our results will be available soon.
https://arxiv.org/abs/2408.04261
Text-based person re-identification (Re-ID) is a challenging topic in the field of complex multimodal analysis; its ultimate aim is to recognize specific pedestrians by scrutinizing attributes/natural language descriptions. Despite the wide range of applicable areas such as security surveillance, video retrieval, person tracking, and social media analytics, there is a notable absence of comprehensive reviews dedicated to summarizing text-based person Re-ID from a technical perspective. To address this gap, we introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task. We start by laying the groundwork for text-based person Re-ID, elucidating fundamental concepts related to attribute/natural-language-based identification. We then present a thorough examination of existing benchmark datasets and metrics. Subsequently, we delve into prevalent feature extraction strategies employed in text-based person Re-ID research, followed by a concise summary of common network architectures within the domain. Prevalent loss functions utilized for model optimization and modality alignment in text-based person Re-ID are also scrutinized. To conclude, we offer a concise summary of our findings, pinpointing challenges in text-based person Re-ID. In response to these challenges, we outline potential avenues for future open-set text-based person Re-ID and present a baseline architecture for text-based pedestrian image generation-guided re-identification (TBPGR).
https://arxiv.org/abs/2408.00096
Person search is the task of localizing a query person in gallery datasets of scene images. Existing methods have mainly been developed to handle a single target dataset; in practical applications of person search, however, diverse datasets are given continuously. In such cases, these methods suffer from catastrophic forgetting of the knowledge learned on old datasets when trained on new ones. In this paper, we first introduce the novel problem of lifelong person search (LPS), where the model is incrementally trained on new datasets while preserving the knowledge learned on old datasets. We propose an end-to-end LPS framework that facilitates knowledge distillation to enforce consistency learning between the old and new models by utilizing the prototype features of foreground persons as well as hard background proposals in the old domains. Moreover, we also devise rehearsal-based instance matching to further improve the discrimination ability in the old domains by additionally using unlabeled person instances. Experimental results demonstrate that the proposed method achieves significantly superior performance in both detection and re-identification while preserving the knowledge learned in the old domains, compared with existing methods.
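A minimal sketch of the feature-level consistency distillation between the frozen old model and the new model; the prototype construction and hard background proposal mining of the full framework are not reproduced here.

```python
# Hedged sketch: cosine-distance distillation between old (teacher) and new (student) features.
import torch
import torch.nn.functional as F

def consistency_distillation_loss(feat_new: torch.Tensor, feat_old: torch.Tensor) -> torch.Tensor:
    """feat_new: (N, D) features from the current model for old-domain person boxes;
    feat_old: (N, D) features from the frozen previous model for the same boxes."""
    feat_new = F.normalize(feat_new, dim=1)
    feat_old = F.normalize(feat_old, dim=1).detach()        # teacher stays frozen
    return (1.0 - (feat_new * feat_old).sum(dim=1)).mean()  # mean cosine distance
```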
https://arxiv.org/abs/2407.21252
Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate the image ReID. We observe that: 1) incorporating language self-supervision in the first training stage can make the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage can make the image features learned by the image encoder more discriminative. These observations imply that: 1) the text prompt learning in the first stage can benefit from the language self-supervision, and 2) the image feature learning in the second stage can benefit from the vision self-supervision. These benefits jointly facilitate the performance gain of the proposed SVLL-ReID. By conducting experiments on six image ReID benchmark datasets without any concrete text labels, we find that the proposed SVLL-ReID achieves the overall best performances compared with state-of-the-arts. Codes will be publicly available at this https URL.
https://arxiv.org/abs/2407.20647