This study introduces a new framework for 3D person re-identification (re-ID) that leverages the high-resolution texture data readily available in 3D reconstruction to improve the performance and explainability of person re-ID. We propose a method to emphasize texture in 3D person re-ID models by incorporating UVTexture mapping, which better differentiates human subjects. Our approach uniquely combines UVTexture and its heatmaps with 3D models to visualize and explain the person re-ID process. In particular, the visualization and explanation are achieved through activation maps and attribute-based attention maps, which highlight the important regions and features contributing to the person re-ID decision. Our contributions include: (1) a novel technique for emphasizing texture in 3D models using UVTexture processing, (2) an innovative method for explaining person re-ID matches through a combination of 3D models and UVTexture mapping, and (3) state-of-the-art performance in 3D person re-ID. We ensure the reproducibility of our results by making all data, code, and models publicly available.
https://arxiv.org/abs/2410.00348
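A minimal sketch of the kind of activation-map visualization the abstract describes, applied to a UV texture image. The backbone choice (a truncated ResNet-50), the channel-energy saliency proxy, and all tensor shapes are assumptions for illustration, not the authors' released pipeline.

```python
# Hypothetical activation-heatmap overlay on a UV texture; not the paper's code.
import torch
import torch.nn.functional as F
import torchvision

def activation_heatmap(backbone, uv_texture):
    """Return an [H, W] heatmap in [0, 1] highlighting regions that drive the embedding."""
    feats = backbone(uv_texture.unsqueeze(0))           # [1, C, h, w] spatial feature map
    energy = feats.pow(2).sum(dim=1, keepdim=True)      # channel-wise energy as a saliency proxy
    heat = F.interpolate(energy, size=uv_texture.shape[1:], mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat[0, 0]

# Usage with a truncated ResNet-50 as the spatial feature extractor.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # keep the conv feature map
uv_texture = torch.rand(3, 256, 256)                             # placeholder UV texture image
heat = activation_heatmap(backbone, uv_texture)
overlay = 0.6 * uv_texture + 0.4 * heat.unsqueeze(0)             # blend heatmap onto the texture
```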
This paper addresses the challenge of animal re-identification, an emerging field that shares similarities with person re-identification but presents unique complexities due to the diverse species, environments and poses. To facilitate research in this domain, we introduce OpenAnimals, a flexible and extensible codebase designed specifically for animal re-identification. We conduct a comprehensive study by revisiting several state-of-the-art person re-identification methods, including BoT, AGW, SBS, and MGN, and evaluate their effectiveness on animal re-identification benchmarks such as HyenaID, LeopardID, SeaTurtleID, and WhaleSharkID. Our findings reveal that while some techniques generalize well, many do not, underscoring the significant differences between the two tasks. To bridge this gap, we propose ARBase, a strong \textbf{Base} model tailored for \textbf{A}nimal \textbf{R}e-identification, which incorporates insights from extensive experiments and introduces simple yet effective animal-oriented designs. Experiments demonstrate that ARBase consistently outperforms existing baselines, achieving state-of-the-art performance across various benchmarks.
https://arxiv.org/abs/2410.00204
To address the occlusion issues in person Re-Identification (ReID) tasks, many methods have been proposed to extract part features by introducing external spatial information. However, due to missing part appearance information caused by occlusion and noisy spatial information from external models, these purely vision-based approaches fail to correctly learn the features of human body parts from limited training data and struggle to accurately locate body parts, ultimately leading to misaligned part features. To tackle these challenges, we propose a Prompt-guided Feature Disentangling method (ProFD), which leverages the rich pre-trained knowledge of the textual modality to help the model generate well-aligned part features. ProFD first designs part-specific prompts and utilizes noisy segmentation masks to preliminarily align visual and textual embeddings, giving the textual prompts spatial awareness. Furthermore, to alleviate the noise from external masks, ProFD adopts a hybrid-attention decoder, ensuring spatial and semantic consistency during the decoding process to minimize noise impact. Additionally, to avoid catastrophic forgetting, we employ a self-distillation strategy, retaining the pre-trained knowledge of CLIP to mitigate over-fitting. Evaluation results on the Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-ReID, and P-DukeMTMC datasets demonstrate that ProFD achieves state-of-the-art results. Our project is available at: this https URL.
https://arxiv.org/abs/2409.20081
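The mask-guided alignment between part-specific prompts and visual features described above could be prototyped roughly as below; the tensor shapes, the mask-weighted pooling, and the contrastive objective are assumptions, not ProFD's actual implementation.

```python
# Hypothetical mask-guided part/prompt alignment; shapes and loss are assumptions.
import torch
import torch.nn.functional as F

def part_alignment_loss(feat_map, part_masks, text_emb, temperature=0.07):
    """
    feat_map:   [B, C, H, W] visual feature map (e.g., from a CLIP image encoder).
    part_masks: [B, P, H, W] soft (possibly noisy) part segmentation masks.
    text_emb:   [P, C] embeddings of part-specific textual prompts.
    """
    B, C, H, W = feat_map.shape
    P = part_masks.shape[1]
    masks = part_masks.flatten(2)                              # [B, P, HW]
    masks = masks / (masks.sum(dim=-1, keepdim=True) + 1e-6)   # normalize to attention weights
    feats = feat_map.flatten(2)                                # [B, C, HW]
    part_feats = torch.einsum("bph,bch->bpc", masks, feats)    # mask-weighted pooled part features
    part_feats = F.normalize(part_feats, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = part_feats @ text_emb.t() / temperature           # [B, P, P] part-to-prompt similarity
    target = torch.arange(P, device=feat_map.device).expand(B, P)
    return F.cross_entropy(logits.reshape(B * P, P), target.reshape(-1))

# Toy usage with random tensors standing in for encoder outputs.
loss = part_alignment_loss(torch.randn(4, 512, 16, 8), torch.rand(4, 5, 16, 8), torch.randn(5, 512))
```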
Lifelong person re-identification (LReID) aims to continuously learn from non-stationary data to match individuals in different environments. Each task is affected by variations in illumination and person-related information (such as pose and clothing), leading to task-wise domain gaps. Current LReID methods focus on task-specific knowledge and ignore intrinsic task-shared representations within domain gaps, limiting model performance. Bridging task-wise domain gaps is crucial for improving anti-forgetting and generalization capabilities, especially when accessing limited old classes during training. To address these issues, we propose a novel attribute-text guided forgetting compensation (ATFC) model, which explores text-driven global representations of identity-related information and attribute-related local representations of identity-free information for LReID. Due to the lack of paired text-image data, we design an attribute-text generator (ATG) to dynamically generate a text descriptor for each instance. We then introduce a text-guided aggregation network (TGA) to explore robust text-driven global representations for each identity and knowledge transfer. Furthermore, we propose an attribute compensation network (ACN) to investigate attribute-related local representations, which distinguish similar identities and bridge domain gaps. Finally, we develop an attribute anti-forgetting (AF) loss and knowledge transfer (KT) loss to minimize domain gaps and achieve knowledge transfer, improving model performance. Extensive experiments demonstrate that our ATFC method achieves superior performance, outperforming existing LReID methods by over 9.0$\%$/7.4$\%$ in average mAP/R-1 on the seen dataset.
https://arxiv.org/abs/2409.19954
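A deliberately simplified reading of the attribute-text generator idea above: turn per-instance attribute predictions into a text descriptor. The attribute names, template, and threshold below are hypothetical, not ATFC's actual generator.

```python
# Hypothetical template-based attribute-to-text descriptor for one pedestrian image.
def attribute_text_descriptor(attribute_scores: dict[str, float], threshold: float = 0.5) -> str:
    """attribute_scores: attribute name -> predicted confidence for one instance."""
    present = [name.replace("_", " ") for name, score in attribute_scores.items() if score >= threshold]
    if not present:
        return "A photo of a pedestrian."
    return "A photo of a pedestrian with " + ", ".join(present) + "."

scores = {"long_hair": 0.91, "backpack": 0.77, "short_sleeves": 0.32, "dark_trousers": 0.66}
print(attribute_text_descriptor(scores))
# -> "A photo of a pedestrian with long hair, backpack, dark trousers."
```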
The Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks due to its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) presents challenges. ICS ReID requires independent identity labeling within each camera, without associations across cameras, which limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Two custom modules are designed to guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. We then design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning finer-grained intra-camera pedestrian features. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model's ability to predict the camera from which a pedestrian image originates, thus enhancing the model's capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. In particular, on the challenging MSMT17 dataset, we achieve 58.9\% mAP, surpassing state-of-the-art methods by 7.6\%. Code will be available at: this https URL.
https://arxiv.org/abs/2409.19563
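One common way to penalize a model's ability to predict the source camera, as ICAL does above, is a gradient-reversal adversary. The sketch below uses that generic construction with assumed dimensions; it is a stand-in, not the paper's exact design.

```python
# Generic camera-adversarial loss via gradient reversal; an illustrative stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reverse gradients flowing into the backbone

class CameraAdversary(nn.Module):
    def __init__(self, feat_dim, num_cameras, lambd=1.0):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_cameras)
        self.lambd = lambd

    def forward(self, features, camera_ids):
        reversed_feats = GradReverse.apply(features, self.lambd)
        logits = self.classifier(reversed_feats)
        # Minimizing this loss trains the camera classifier, while the reversed gradient
        # pushes the backbone toward camera-agnostic pedestrian features.
        return F.cross_entropy(logits, camera_ids)

# Toy usage: 8 pedestrian features of dim 768 coming from 4 cameras.
adv = CameraAdversary(feat_dim=768, num_cameras=4)
loss = adv(torch.randn(8, 768, requires_grad=True), torch.randint(0, 4, (8,)))
loss.backward()
```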
Recent studies have shown that pre-training on large-scale person images extracted from internet videos is an effective way to learn better representations for person re-identification. However, these studies are mostly confined to pre-training at the instance level or single-video tracklet level. They ignore the identity-invariance in images of the same person across different videos, which is a key focus in person re-identification. To address this issue, we propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework. Defining a noise concept that comprehensively considers both intra-identity consistency and inter-identity discrimination, CION seeks the identity correlation from cross-video images by modeling it as a progressive multi-level denoising problem. Furthermore, an identity-guided self-distillation loss is proposed to enable better large-scale pre-training by mining the identity-invariance within person images. We conduct extensive experiments to verify the superiority of CION in terms of efficiency and performance. CION achieves significantly leading performance with even fewer training samples. For example, compared with the previous state-of-the-art~\cite{ISR}, CION with the same ResNet50-IBN backbone achieves higher mAP of 93.3\% and 74.3\% on Market1501 and MSMT17, while using only 8\% of the training samples. Finally, as CION demonstrates superior model-agnostic ability, we contribute a model zoo named ReIDZoo to meet diverse research and application needs in this field. It contains a series of CION pre-trained models spanning diverse structures and parameter scales, totaling 32 models with 10 different structures, including GhostNet, ConvNext, RepViT, FastViT, and so on. The code and models will be made publicly available at this https URL.
https://arxiv.org/abs/2409.18569
With the rapid development of intelligent transportation systems and the popularity of smart city infrastructure, vehicle Re-ID technology has become an important research field. The vehicle Re-ID task faces an important challenge: the high similarity between different vehicles. Existing methods use additional detection or segmentation models to extract differentiated local features. However, these methods either rely on additional annotations or greatly increase the computational cost. Using attention mechanisms to capture global and local features is crucial for addressing the high inter-class similarity in vehicle Re-ID tasks. In this paper, we propose LKA-ReID with large kernel attention. Specifically, large kernel attention (LKA) combines the advantages of self-attention with those of convolution, allowing it to extract the global and local features of vehicles more comprehensively. We also introduce hybrid channel attention (HCA), which combines channel attention with spatial information, so that the model can better focus on informative channels and feature regions while ignoring background and other distracting information. Experiments on the VeRi-776 dataset demonstrate the effectiveness of LKA-ReID, with mAP reaching 86.65% and Rank-1 reaching 98.03%.
https://arxiv.org/abs/2409.17908
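A sketch of a large kernel attention block using the depth-wise decomposition popularized in the visual-attention-network literature; LKA-ReID's exact kernel sizes and placement within the backbone may differ.

```python
# Large kernel attention via depth-wise + dilated depth-wise + pointwise convs; configuration assumed.
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 5x5 depth-wise conv captures local context.
        self.dw_conv = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7x7 depth-wise dilated conv (dilation 3) approximates a large receptive field.
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)
        # 1x1 conv mixes channels.
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pointwise(self.dw_dilated(self.dw_conv(x)))
        return x * attn   # the attention map modulates the input features

# Toy usage on a Re-ID backbone feature map.
feat = torch.randn(2, 256, 24, 12)
out = LargeKernelAttention(256)(feat)
```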
The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page this https URL.
https://arxiv.org/abs/2409.16159
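A simplified, hypothetical reading of an attribute-retaining metric: the fraction of annotated attributes recoverable from the generated caption via string matching. The paper's formulation is likely more nuanced (e.g., grounding-aware), so treat this as a sketch only.

```python
# Hypothetical attribute-retention score for a generated caption.
def attribute_retention(caption: str, attributes: list[str]) -> float:
    caption_lc = caption.lower()
    if not attributes:
        return 1.0
    retained = sum(1 for attr in attributes if attr.lower() in caption_lc)
    return retained / len(attributes)

# Toy usage on a single panel annotation.
caption = "A grey-haired detective in a trench coat points at a glowing map."
attrs = ["grey-haired", "trench coat", "map", "umbrella"]
print(attribute_retention(caption, attrs))   # 0.75: three of four attributes retained
```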
Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
https://arxiv.org/abs/2409.12002
In recent years, workplaces and educational institutes have widely adopted virtual meeting platforms. This has led to a growing interest in analyzing and extracting insights from these meetings, which requires effective detection and tracking of unique individuals. In practice, there is no standardization in how video meeting recordings are laid out or captured across different platforms and services. This, in turn, creates a challenge in acquiring this data stream and analyzing it in a uniform fashion. Our approach provides a solution to the most general form of video recording, usually consisting of a grid of participants (\cref{fig:videomeeting}) from a single video source with no metadata on participant locations, while imposing the fewest constraints and assumptions on how the data was acquired. Conventional approaches often use YOLO models coupled with tracking algorithms, assuming linear motion trajectories akin to those observed in CCTV footage. However, such assumptions fall short in virtual meetings, where a participant's video feed window can abruptly change location across the grid. In an organic video meeting setting, participants frequently join and leave, leading to sudden, non-linear movements on the video grid. This disrupts optical flow-based tracking methods that depend on linear motion. Consequently, standard object detection and tracking methods might mistakenly assign multiple participants to the same tracker. In this paper, we introduce a novel approach to track and re-identify participants in remote video meetings by utilizing the spatio-temporal priors arising from the data in our domain. This, in turn, improves tracking capabilities compared to general object tracking. Our approach reduces the error rate by 95% on average compared to YOLO-based tracking methods as a baseline.
https://arxiv.org/abs/2409.09841
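A toy illustration of exploiting the meeting-grid prior: detections are snapped to grid cells, and a cell's occupant is re-verified by appearance distance rather than by motion prediction. The grid shape, threshold, and embedding source are assumptions, not the paper's system.

```python
# Hypothetical grid-cell assignment with appearance-based re-verification.
import numpy as np

def cell_index(box, frame_wh, grid=(3, 3)):
    """Map an (x1, y1, x2, y2) box to its grid cell (row, col)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / frame_wh[0] * grid[1]), grid[1] - 1)
    row = min(int(cy / frame_wh[1] * grid[0]), grid[0] - 1)
    return row, col

def update_cell(cell_state, cell, embedding, threshold=0.4):
    """Keep the identity attached to a grid cell unless its appearance changes abruptly."""
    prev = cell_state.get(cell)
    if prev is not None:
        dist = 1.0 - float(np.dot(prev, embedding) /
                           (np.linalg.norm(prev) * np.linalg.norm(embedding) + 1e-8))
        if dist < threshold:
            cell_state[cell] = 0.9 * prev + 0.1 * embedding   # smooth the stored appearance
            return cell_state, "same participant"
    cell_state[cell] = embedding                               # new or re-joined participant
    return cell_state, "re-identify against gallery"

# Toy usage with a random appearance embedding.
state = {}
cell = cell_index((650, 30, 900, 300), frame_wh=(1920, 1080))
state, decision = update_cell(state, cell, np.random.rand(128))
```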
In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
https://arxiv.org/abs/2409.09831
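A rough sketch of the mask-and-refill step on an already de-identified note, using a generic masked language model via Hugging Face's fill-mask pipeline. The model choice, masking ratio, and whitespace tokenization are assumptions; the paper's full pipeline (Philter de-identification, NER-based retention) is not reproduced here.

```python
# Hypothetical mask-and-refill perturbation of a de-identified note with an MLM.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token   # "[MASK]" for BERT

def perturb_note(note: str, mask_ratio: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = note.split()
    for i in range(len(words)):
        if rng.random() < mask_ratio:
            masked = words[:i] + [MASK] + words[i + 1:]
            prediction = fill_mask(" ".join(masked))[0]   # top-scoring replacement
            words[i] = prediction["token_str"]
    return " ".join(words)

note = "Patient admitted with chest pain and treated with aspirin before discharge."
print(perturb_note(note))
```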
Person Re-Identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing person Re-ID methods overlook the potential relationships among local person features and fail to adequately address the impact of pedestrian pose variations and occlusion of local body parts. Therefore, we propose a Transformer-enhanced Graph Convolutional Network (Tran-GCN) model to improve person Re-Identification performance in surveillance videos. The model comprises four key components: (1) a Pose Estimation Learning branch estimates pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) a Transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) a Convolution learning branch uses the basic ResNet architecture to extract the person's fine-grained local features; and (4) a Graph Convolutional Module (GCM) fuses local feature information, global feature information, and body information for more effective person identification. Quantitative and qualitative experiments conducted on three different datasets (Market-1501, DukeMTMC-ReID, and MSMT17) demonstrate that the Tran-GCN model captures discriminative person features in surveillance videos more accurately, significantly improving identification accuracy.
https://arxiv.org/abs/2409.09391
Extracting and matching Re-Identification (ReID) features is used by many state-of-the-art (SOTA) Multiple Object Tracking (MOT) methods and is particularly effective against frequent and long-term occlusions. While end-to-end object detection and tracking have been the main focus of recent research, they have yet to outperform traditional methods on benchmarks like MOT17 and MOT20. Thus, from an application standpoint, methods with separate detection and embedding remain the best option for accuracy, modularity, and ease of implementation, though they are impractical for edge devices due to the overhead involved. In this paper, we investigate a selective approach to minimize the overhead of feature extraction while preserving accuracy, modularity, and ease of implementation. This approach can be integrated into various SOTA methods. We demonstrate its effectiveness by applying it to StrongSORT and Deep OC-SORT. Experiments on the MOT17, MOT20, and DanceTrack datasets show that our mechanism retains the advantages of feature extraction during occlusions while significantly reducing runtime. Additionally, it improves accuracy by preventing confusion in the feature-matching stage, particularly in cases of deformation and appearance similarity, which are common in DanceTrack. this https URL, this https URL
https://arxiv.org/abs/2409.06617
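A simplified gating rule in the spirit of the selective approach: run the appearance embedder only when IoU association is ambiguous. The thresholds and the rule itself are illustrative assumptions, not the paper's mechanism.

```python
# Hypothetical IoU-based gate deciding when ReID feature extraction is needed.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def needs_reid(detection, track_boxes, high=0.7, margin=0.15):
    """Skip appearance matching when exactly one track overlaps the detection decisively."""
    overlaps = sorted((iou(detection, t) for t in track_boxes), reverse=True)
    if not overlaps:
        return True                                    # new object: embed it for later re-identification
    best = overlaps[0]
    second = overlaps[1] if len(overlaps) > 1 else 0.0
    return best < high or best - second < margin       # ambiguous -> extract ReID features

tracks = [(100, 100, 200, 300), (160, 110, 260, 310)]
print(needs_reid((105, 105, 205, 305), tracks))        # True: two tracks compete for this detection
```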
The primary challenges in visible-infrared person re-identification arise from the differences between visible (vis) and infrared (ir) images, including inter-modal and intra-modal variations. These challenges are further complicated by varying viewpoints and irregular movements. Existing methods often rely on horizontal partitioning to align part-level features, which can introduce inaccuracies and has limited effectiveness in reducing modality discrepancies. In this paper, we propose a novel Prototype-Driven Multi-feature generation framework (PDM) aimed at mitigating cross-modal discrepancies by constructing diversified features and mining latent semantically similar features for modal alignment. PDM comprises two key components: a Multi-Feature Generation Module (MFGM) and a Prototype Learning Module (PLM). The MFGM generates diverse features, closely distributed around modality-shared features, to represent pedestrians. Additionally, the PLM utilizes learnable prototypes to excavate latent semantic similarities among local features of the visible and infrared modalities, thereby facilitating cross-modal instance-level alignment. We introduce a cosine heterogeneity loss to enhance prototype diversity for extracting rich local features. Extensive experiments conducted on the SYSU-MM01 and LLCM datasets demonstrate that our approach achieves state-of-the-art performance. Our code is available at this https URL.
https://arxiv.org/abs/2409.05642
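A plausible sketch of a cosine-heterogeneity-style regularizer that pushes learnable prototypes apart; the exact loss used by PDM may differ.

```python
# Hypothetical prototype-diversity regularizer based on pairwise cosine similarity.
import torch
import torch.nn.functional as F

def cosine_heterogeneity_loss(prototypes):
    """prototypes: [K, D] learnable prototype vectors."""
    p = F.normalize(prototypes, dim=-1)
    sim = p @ p.t()                                              # [K, K] pairwise cosine similarity
    off_diag = sim - torch.eye(sim.shape[0], device=sim.device)  # ignore self-similarity
    return off_diag.clamp(min=0).mean()                          # push non-identical prototypes apart

prototypes = torch.nn.Parameter(torch.randn(8, 256))
loss = cosine_heterogeneity_loss(prototypes)
loss.backward()
```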
We address the problem of person re-identification (reID), that is, retrieving person images from a large dataset, given a query image of the person of interest. A key challenge is to learn person representations robust to intra-class variations, as different persons could have the same attribute, and persons' appearances look different, e.g., with viewpoint changes. Recent reID methods focus on learning person features discriminative only for a particular factor of variations (e.g., human pose), which also requires corresponding supervisory signals (e.g., pose annotations). To tackle this problem, we propose to factorize person images into identity-related and unrelated features. Identity-related features contain information useful for specifying a particular person (e.g., clothing), while identity-unrelated ones hold other factors (e.g., human pose). To this end, we propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN). It disentangles identity-related and unrelated features from person images through an identity-shuffling technique that exploits identification labels alone without any auxiliary supervisory signals. We restrict the distribution of identity-unrelated features or encourage the identity-related and unrelated features to be uncorrelated, facilitating the disentanglement process. Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks, including Market-1501, CUHK03, and DukeMTMC-reID. We further demonstrate the advantages of disentangling person representations on a long-term reID task, setting a new state of the art on a Celeb-reID dataset.
https://arxiv.org/abs/2409.05277
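A toy sketch of the identity-shuffling idea: swap identity-related codes between two images of the same person and require reconstruction of each target image. The encoders, generator, and loss below are minimal stand-ins, not the IS-GAN architecture.

```python
# Hypothetical identity-shuffling reconstruction loss with toy encoders and generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_id = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128))     # identity-related encoder
enc_free = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 32, 128))   # identity-unrelated encoder
gen = nn.Linear(256, 3 * 64 * 32)                                      # decoder / generator

def shuffle_reconstruction_loss(x1, x2):
    """x1, x2: [B, 3, 64, 32] images of the SAME identity (labels are the only supervision)."""
    id1, id2 = enc_id(x1), enc_id(x2)
    free1, free2 = enc_free(x1), enc_free(x2)
    # Swap identity codes across the pair; the result should still reconstruct the target image.
    rec1 = gen(torch.cat([id2, free1], dim=1)).view_as(x1)
    rec2 = gen(torch.cat([id1, free2], dim=1)).view_as(x2)
    return F.l1_loss(rec1, x1) + F.l1_loss(rec2, x2)

loss = shuffle_reconstruction_loss(torch.rand(4, 3, 64, 32), torch.rand(4, 3, 64, 32))
```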
Deep learning-based person re-identification (re-id) models are widely employed in surveillance systems and inevitably inherit the vulnerability of deep networks to adversarial attacks. Existing attacks merely consider cross-dataset and cross-model transferability, ignoring the cross-test capability to perturb models trained in different domains. To rigorously examine the robustness of real-world re-id models, the Meta Transferable Generative Attack (MTGA) method is proposed, which adopts meta-learning optimization to help the generative attacker produce highly transferable adversarial examples by learning comprehensively simulated transfer-based cross-model\&dataset\&test black-box meta attack tasks. Specifically, cross-model\&dataset black-box attack tasks are first mimicked by selecting different re-id models and datasets for the meta-train and meta-test attack processes. As different models may focus on different feature regions, a Perturbation Random Erasing module is further devised to prevent the attacker from learning to corrupt only model-specific features. To equip the attacker with cross-test transferability, a Normalization Mix strategy is introduced to imitate diverse feature embedding spaces by mixing multi-domain statistics of target models. Extensive experiments show the superiority of MTGA: in cross-model\&dataset and cross-model\&dataset\&test attacks, MTGA outperforms SOTA methods by 21.5\% and 11.3\% in mean mAP drop rate, respectively. The code of MTGA will be released after the paper is accepted.
https://arxiv.org/abs/2409.04208
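One plausible reading of the Normalization Mix strategy is to interpolate instance-level feature statistics between samples from different domains (in the style of MixStyle); the sketch below follows that reading, and MTGA's actual definition may differ.

```python
# Hypothetical mixing of instance-level feature statistics across two domains.
import torch

def normalization_mix(feat_a, feat_b, lam=0.6, eps=1e-6):
    """feat_a, feat_b: [B, C, H, W] features of samples from two different domains."""
    mu_a, var_a = feat_a.mean(dim=(2, 3), keepdim=True), feat_a.var(dim=(2, 3), keepdim=True)
    mu_b, var_b = feat_b.mean(dim=(2, 3), keepdim=True), feat_b.var(dim=(2, 3), keepdim=True)
    sig_a, sig_b = (var_a + eps).sqrt(), (var_b + eps).sqrt()
    mu_mix = lam * mu_a + (1 - lam) * mu_b        # interpolated statistics imitate a new
    sig_mix = lam * sig_a + (1 - lam) * sig_b     # feature embedding space
    return sig_mix * (feat_a - mu_a) / sig_a + mu_mix

mixed = normalization_mix(torch.randn(4, 256, 16, 8), torch.randn(4, 256, 16, 8))
```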
Place recognition is an important task within autonomous navigation, involving the re-identification of previously visited locations from an initial traverse. Unlike visual place recognition (VPR), LiDAR place recognition (LPR) is tolerant to changes in lighting, seasons, and textures, leading to high performance on benchmark datasets from structured urban environments. However, there is a growing need for methods that can operate in diverse environments with high performance and minimal training. In this paper, we propose a handcrafted matching strategy that performs roto-translation invariant place recognition and relative pose estimation for both urban and unstructured natural environments. Our approach constructs Bird's Eye View (BEV) global descriptors and employs a two-stage search using matched filtering -- a signal processing technique for detecting known signals amidst noise. Extensive testing on the NCLT, Oxford Radar, and WildPlaces datasets consistently demonstrates state-of-the-art (SoTA) performance across place recognition and relative pose estimation metrics, with up to 15% higher recall than the previous SoTA.
https://arxiv.org/abs/2409.03998
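A compact sketch of rotation-invariant matching via circular cross-correlation (matched filtering) over the angular axis of a polar BEV descriptor; the descriptor construction and grid sizes below are simplified assumptions, not the paper's descriptors.

```python
# Hypothetical polar BEV descriptor + FFT-based circular cross-correlation over yaw.
import numpy as np

def polar_bev_descriptor(points, n_rings=20, n_sectors=60, max_range=50.0):
    """Bin LiDAR points (N, 3) into a ring x sector occupancy grid around the sensor."""
    r = np.linalg.norm(points[:, :2], axis=1)
    theta = np.mod(np.arctan2(points[:, 1], points[:, 0]), 2 * np.pi)
    ring = np.clip((r / max_range * n_rings).astype(int), 0, n_rings - 1)
    sector = np.clip((theta / (2 * np.pi) * n_sectors).astype(int), 0, n_sectors - 1)
    desc = np.zeros((n_rings, n_sectors))
    np.add.at(desc, (ring, sector), 1.0)
    return desc / (desc.max() + 1e-8)

def matched_filter_score(query, candidate):
    """Circular cross-correlation over the sector (yaw) axis; returns best score and yaw shift."""
    q = np.fft.fft(query, axis=1)
    c = np.fft.fft(candidate, axis=1)
    corr = np.fft.ifft(np.conj(q) * c, axis=1).real.sum(axis=0)   # correlation per angular shift
    best_shift = int(np.argmax(corr))
    return float(corr[best_shift]), best_shift

q = polar_bev_descriptor(np.random.randn(5000, 3) * 20)
c = np.roll(q, 7, axis=1)                      # same place, rotated by 7 sectors
score, shift = matched_filter_score(q, c)      # shift recovers the relative yaw (7 sectors)
```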
In recent years, the development of deep learning approaches for person re-identification has led to impressive results. However, limitations remain for industrial and practical real-world applications. Firstly, most existing works operate in closed-world scenarios, in which the people to re-identify (probes) are compared against a closed set (gallery). Real-world scenarios are often open-set problems in which the gallery is not known a priori, yet the number of open-set approaches in the literature is significantly lower. Secondly, challenges such as multi-camera setups, occlusions, real-time requirements, etc., further constrain the applicability of off-the-shelf methods. This work presents MICRO-TRACK, a Modular Industrial multi-Camera Re-identification and Open-set Tracking system that is real-time, scalable, and easy to integrate into existing industrial surveillance scenarios. Furthermore, we release a novel Re-ID and tracking dataset acquired in an industrial manufacturing facility, dubbed Facility-ReID, consisting of 18-minute videos captured by 8 surveillance cameras.
https://arxiv.org/abs/2409.03879
Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. They play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking, and they facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.
https://arxiv.org/abs/2409.03245
Extracting robust feature representations is critical for object re-identification in order to accurately identify objects across non-overlapping cameras. Despite its strong representation ability, the Vision Transformer (ViT) tends to overfit to the most distinctive regions of the training data, limiting its generalizability and its attention to holistic object features. Meanwhile, due to the structural differences between CNNs and ViT, fine-grained strategies that effectively address this issue in CNNs do not carry over to ViT. To address this issue, by observing the latent diverse representations hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representations of multi-head self-attention without the typical loss of feature richness induced by the concatenation and FFN layers after attention. To avoid homogenization of the attention heads and to promote robust part-based feature learning, two head diversity constraints are imposed: an attention diversity constraint and a correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of PartFormer. Specifically, our framework significantly outperforms the state of the art by 2.4\% mAP on the most challenging MSMT17 dataset.
https://arxiv.org/abs/2408.16684
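A hedged sketch of a head-diversity regularizer in the spirit of the constraints described above: penalize similarity between per-head pooled representations so the heads attend to different cues. This is an illustrative stand-in, not PartFormer's loss definitions.

```python
# Hypothetical head-diversity penalty over multi-head self-attention outputs.
import torch
import torch.nn.functional as F

def head_diversity_loss(head_outputs):
    """head_outputs: [B, H, N, D] per-head token features from multi-head self-attention."""
    pooled = F.normalize(head_outputs.mean(dim=2), dim=-1)      # [B, H, D] one descriptor per head
    sim = torch.einsum("bhd,bkd->bhk", pooled, pooled)          # [B, H, H] head-to-head similarity
    H = sim.shape[1]
    off_diag = sim * (1 - torch.eye(H, device=sim.device))      # zero out self-similarity
    return off_diag.abs().sum(dim=(1, 2)).mean() / (H * (H - 1))

loss = head_diversity_loss(torch.randn(2, 8, 129, 64))
```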