This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may cause data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples and limits model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at \href{this https URL}{this https URL}.
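The abstract above builds on cluster-memory contrastive learning; below is a minimal, hedged sketch of that common backbone, not TCMM's actual multi-scale design. The class name, momentum, and temperature values are illustrative assumptions.

```python
# Sketch of a cluster-level memory bank with momentum updates (assumed values, not TCMM itself).
import torch
import torch.nn.functional as F

class MemoryBank:
    def __init__(self, num_clusters, feat_dim, momentum=0.2, temperature=0.05):
        self.prototypes = F.normalize(torch.randn(num_clusters, feat_dim), dim=1)
        self.momentum = momentum
        self.temperature = temperature

    def loss(self, feats, pseudo_labels):
        # Contrast each feature against all cluster prototypes (InfoNCE-style).
        feats = F.normalize(feats, dim=1)
        logits = feats @ self.prototypes.t() / self.temperature
        return F.cross_entropy(logits, pseudo_labels)

    @torch.no_grad()
    def update(self, feats, pseudo_labels):
        # Momentum-update the prototype of each sample's assigned cluster.
        feats = F.normalize(feats, dim=1)
        for f, y in zip(feats, pseudo_labels):
            p = self.momentum * self.prototypes[y] + (1 - self.momentum) * f
            self.prototypes[y] = F.normalize(p, dim=0)

bank = MemoryBank(num_clusters=500, feat_dim=768)
feats = torch.randn(32, 768)                       # e.g. ViT [CLS] features
labels = torch.randint(0, 500, (32,))              # clustering pseudo labels
print(bank.loss(feats, labels)); bank.update(feats, labels)
```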
https://arxiv.org/abs/2501.09044
Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance. By exploiting event data in video-based person ReID, more motion information can be captured between consecutive frames to improve recognition accuracy. Previous approaches have introduced event data into the video person ReID task as an auxiliary cue, but they still cannot avoid the privacy leakage problem caused by RGB images. In order to avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
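As a hedged illustration of working directly with event data (not the CMTC network itself), the sketch below converts a raw event stream into a voxel-grid tensor, one standard representation a downstream ReID backbone could consume; the bin count and image size are assumptions.

```python
# Assumed event format and bin count for illustration only.
import numpy as np

def events_to_voxel_grid(events, num_bins=5, height=256, width=128):
    """events: (N, 4) array of [x, y, t, p] with polarity p in {-1, +1}."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    t = events[:, 2]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)      # normalize time to [0, 1]
    bins = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ps = events[:, 3].astype(np.float32)
    np.add.at(grid, (bins, ys, xs), ps)                        # accumulate polarity per time bin
    return grid

dummy = np.stack([np.random.randint(0, 128, 1000),             # x
                  np.random.randint(0, 256, 1000),             # y
                  np.sort(np.random.rand(1000)),               # t
                  np.random.choice([-1, 1], 1000)], axis=1)
print(events_to_voxel_grid(dummy).shape)                       # (5, 256, 128)
```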
https://arxiv.org/abs/2501.07296
Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospects. Most existing works struggle to adequately extract ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. In particular, IFD exploits a dual-stream architecture that consists of a main stream and an attention stream. The attention stream takes clothing-masked images as inputs and derives identity attention weights for effectively transferring spatial knowledge to the main stream and highlighting regions rich in identity-related information. To eliminate the semantic gap between the inputs of the two streams, we propose a clothing bias diminishing module, specific to the main stream, to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely used CC Re-ID datasets.
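A minimal sketch of the dual-stream idea described above, under my own assumptions (the actual IFD modules are more elaborate): the attention stream, fed the clothing-masked image, produces spatial weights that re-weight the main stream's feature map.

```python
# Toy dual-stream attention transfer; channel count and layers are placeholders.
import torch
import torch.nn as nn

class DualStreamSketch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.main = nn.Conv2d(3, ch, 3, padding=1)                       # stand-in for the main backbone
        self.attn = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1),
                                  nn.Conv2d(ch, 1, 1), nn.Sigmoid())     # spatial identity attention

    def forward(self, image, clothing_masked_image):
        feat = self.main(image)
        weight = self.attn(clothing_masked_image)        # (B, 1, H, W) weights from the masked input
        return feat * weight                             # highlight identity-related regions

x = torch.randn(2, 3, 256, 128)
x_masked = x.clone()                                     # placeholder for a clothing-masked copy
print(DualStreamSketch()(x, x_masked).shape)             # torch.Size([2, 64, 256, 128])
```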
https://arxiv.org/abs/2501.05851
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, existing methods either lack cross-modality clustering or excessively pursue cluster-level association, which makes it difficult to learn reliable modality-invariant features. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, ECUL naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge at the clustering level. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
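Purely as a hedged illustration of what a staged ("two-step") memory update could look like, the sketch below rebuilds proxies from cluster means in early epochs and switches to per-instance momentum updates later; the switch epoch, momentum, and overall scheme are assumptions, not TSMem's actual design.

```python
# Assumed two-stage memory update; switch_epoch and momentum are illustrative.
import torch
import torch.nn.functional as F

def update_memory(memory, feats, labels, epoch, switch_epoch=20, momentum=0.1):
    feats = F.normalize(feats, dim=1)
    if epoch < switch_epoch:
        for y in labels.unique():                       # stage 1: rebuild each proxy from its cluster mean
            memory[y] = F.normalize(feats[labels == y].mean(0), dim=0)
    else:
        for f, y in zip(feats, labels):                 # stage 2: per-instance momentum updates
            memory[y] = F.normalize((1 - momentum) * memory[y] + momentum * f, dim=0)
    return memory

memory = F.normalize(torch.randn(100, 256), dim=1)
feats, labels = torch.randn(32, 256), torch.randint(0, 100, (32,))
print(update_memory(memory, feats, labels, epoch=5).shape)      # torch.Size([100, 256])
```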
https://arxiv.org/abs/2412.19134
The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. However, current studies rely on unsupervised modality transformations and inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle these limitations, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency-domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at this https URL.
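The sketch below illustrates the general flavor of a greyscale plus frequency-domain enhancement, not SEPG-Net's exact scheme: the visible image is converted to greyscale and its high-frequency spectrum is amplified, so visible and infrared inputs become more homogeneous without a learned modality transform. The boost factor and cutoff are assumptions.

```python
# Illustrative spectral enhancement; boost and cutoff values are assumed.
import torch

def spectral_enhance(rgb, boost=1.5, cutoff=0.1):
    """rgb: (3, H, W) tensor in [0, 1]. Returns a 1-channel enhanced image."""
    grey = 0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2]     # greyscale space
    spec = torch.fft.fftshift(torch.fft.fft2(grey))             # centered 2-D spectrum
    h, w = grey.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    highpass = ((yy ** 2 + xx ** 2).sqrt() > cutoff).float()    # radial high-frequency mask
    spec = spec * (1 + (boost - 1) * highpass)                  # amplify high frequencies (edges, textures)
    out = torch.fft.ifft2(torch.fft.ifftshift(spec)).real
    return out.clamp(0, 1)

print(spectral_enhance(torch.rand(3, 288, 144)).shape)          # torch.Size([288, 144])
```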
https://arxiv.org/abs/2412.19111
Person re-identification (ReID) models often struggle to generalize across diverse cultural contexts, particularly in Islamic regions like Iran, where modest clothing styles are prevalent. Existing datasets predominantly feature Western and East Asian fashion, limiting their applicability in these settings. To address this gap, we introduce IUST_PersonReId, a dataset designed to reflect the unique challenges of ReID in new cultural environments, emphasizing modest attire and diverse scenarios from Iran, including markets, campuses, and mosques. Experiments on IUST_PersonReId with state-of-the-art models, such as Solider and CLIP-ReID, reveal significant performance drops compared to benchmarks like Market1501 and MSMT17, highlighting the challenges posed by occlusion and limited distinctive features. Sequence-based evaluations show improvements by leveraging temporal context, emphasizing the dataset's potential for advancing culturally sensitive and robust ReID systems. IUST_PersonReId offers a critical resource for addressing fairness and bias in ReID research globally. The dataset is publicly available at this https URL.
https://arxiv.org/abs/2412.18874
Person re-identification (ReID) aims to retrieve a specific person across non-overlapping cameras, which greatly helps intelligent transportation systems. It is well known that Convolutional Neural Networks (CNNs) and Transformers have complementary strengths in extracting local and global features, respectively. Considering this fact, we focus on the mutual fusion between them to learn more comprehensive representations for persons. In particular, we utilize the complementary integration of deep features from different model structures. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID. More specifically, we first deploy a Dual-branch Feature Extraction (DFE) module to extract features from a single image through CNNs and Transformers. Moreover, we design a novel Dual-attention Mutual Fusion (DMF) module to achieve sufficient feature fusion. The DMF comprises Local Refinement Units (LRU) and Heterogeneous Transmission Modules (HTM). LRU utilizes depthwise-separable convolutions to align deep features in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit (SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of HTM, deep features after LRU are repeatedly utilized to generate more discriminative features. Extensive experiments on three public ReID benchmarks demonstrate that our method attains superior performance over most state-of-the-art methods. The source code is available at this https URL.
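A hedged sketch of an LRU-style alignment block under stated assumptions (the paper's exact design may differ): a depthwise-separable convolution maps CNN feature maps to the Transformer branch's channel width, and adaptive pooling matches the spatial size.

```python
# Assumed channel/spatial sizes; only the depthwise-separable alignment idea is shown.
import torch
import torch.nn as nn

class LocalRefinementUnit(nn.Module):
    def __init__(self, in_ch, out_ch, out_hw):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # channel alignment
        self.pool = nn.AdaptiveAvgPool2d(out_hw)                   # spatial alignment
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.norm(self.pool(self.pointwise(self.depthwise(x))))

cnn_feat = torch.randn(2, 2048, 16, 8)                  # e.g. a ResNet stage-4 feature map
aligned = LocalRefinementUnit(2048, 768, (16, 8))(cnn_feat)
print(aligned.shape)                                    # torch.Size([2, 768, 16, 8])
```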
https://arxiv.org/abs/2412.17239
Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to yield high retrieval accuracy, since it helps explore a high-dimensional manifold and is applicable without additional fine-tuning. However, the quality of visual re-ranking with an NN graph is limited by the quality of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected to negative images; this is known as the noisy edge problem and degrades retrieval quality. To address this, we propose a complementary denoising method based on a Continuous Conditional Random Field (C-CRF) that uses a statistical distance over our similarity-based distribution. This method employs the concept of cliques to make the process computationally feasible. We demonstrate the complementarity of our method through its application to three visual re-ranking methods, observing quality boosts in landmark retrieval and person re-identification (re-ID).
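For context, a minimal sketch of the starting point such re-ranking methods share, not the C-CRF itself: a k-nearest-neighbour graph over image features, whose edges are exactly what the noisy-edge problem concerns; k is an illustrative choice.

```python
# Build a plain kNN graph; its (possibly noisy) edges are what re-ranking relies on.
import numpy as np

def build_nn_graph(feats, k=10):
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)                     # exclude self edges
    nn_idx = np.argsort(-sim, axis=1)[:, :k]           # top-k neighbours per node
    adj = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(feats)), k)
    adj[rows, nn_idx.ravel()] = True                   # directed kNN edges (some may be noisy)
    return adj

feats = np.random.randn(100, 128)
print(build_nn_graph(feats).sum(axis=1)[:5])           # each node has k outgoing edges
```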
https://arxiv.org/abs/2412.13875
Unsupervised visible-infrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to the absence of annotations. Existing approaches aim to learn modality-invariant representations in an unsupervised setting. However, these methods often encounter label noise within and across modalities due to suboptimal clustering results and considerable modality discrepancies, which impedes effective training. To address these challenges, we propose a straightforward yet effective solution for USL-VI-ReID by mitigating universal label noise using neighbor information. Specifically, we introduce the Neighbor-guided Universal Label Calibration (N-ULC) module, which replaces explicit hard pseudo labels in both homogeneous and heterogeneous spaces with soft labels derived from neighboring samples to reduce label noise. Additionally, we present the Neighbor-guided Dynamic Weighting (N-DW) module to enhance training stability by minimizing the influence of unreliable samples. Extensive experiments on the RegDB and SYSU-MM01 datasets demonstrate that our method outperforms existing USL-VI-ReID approaches, despite its simplicity. The source code is available at: this https URL.
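A minimal sketch of the neighbor-to-soft-label idea, with the weighting scheme assumed for illustration (N-ULC's exact formulation is not given in the abstract): each sample's hard pseudo label is blended with the average one-hot labels of its k nearest neighbors, which dilutes clustering label noise.

```python
# Assumed 50/50 blend and k value; only the neighbor-smoothing idea is shown.
import torch
import torch.nn.functional as F

def neighbor_soft_labels(feats, hard_labels, num_classes, k=10):
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                                   # cosine similarity matrix
    nn_idx = sim.topk(k + 1, dim=1).indices[:, 1:]            # k neighbors, excluding self
    one_hot = F.one_hot(hard_labels, num_classes).float()
    soft = one_hot[nn_idx].mean(dim=1)                        # average the neighbors' labels
    return 0.5 * one_hot + 0.5 * soft                         # keep part of the original label

feats = torch.randn(64, 512)
hard = torch.randint(0, 20, (64,))
print(neighbor_soft_labels(feats, hard, 20).shape)            # torch.Size([64, 20])
```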
https://arxiv.org/abs/2412.12220
Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning (DRL) mechanism that learns to transform data of an arbitrary distribution into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction network (AKPNet) is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses the old distributions based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show that our DASK outperforms existing methods by 3.6%-6.8% and 4.5%-6.5% in anti-forgetting and generalization capacity, respectively. Our code is available at this https URL
https://arxiv.org/abs/2412.09224
Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.
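As a hedged illustration of the random spatial-temporal combination step (the sampling details here are assumptions, not CSP's exact procedure), the sketch below draws random subsets of joints and frames from a skeleton tracklet to form sub-skeleton and sub-tracklet views.

```python
# Assumed sampling ratios; only the random joint/frame combination idea is shown.
import torch

def sample_sub_skeleton(tracklet, joint_ratio=0.75, frame_ratio=0.5):
    """tracklet: (T, J, C) tensor of T frames, J joints, C-dimensional coordinates."""
    T, J, C = tracklet.shape
    frames = torch.randperm(T)[: max(1, int(T * frame_ratio))].sort().values
    joints = torch.randperm(J)[: max(1, int(J * joint_ratio))].sort().values
    return tracklet[frames][:, joints]                 # (T', J', C) sub-tracklet of sub-skeletons

tracklet = torch.randn(30, 25, 3)                      # 30 frames, 25 joints, 3-D coordinates
print(sample_sub_skeleton(tracklet).shape)             # torch.Size([15, 18, 3])
```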
https://arxiv.org/abs/2412.09044
Image retrieval methods rely on metric learning to train backbone feature extraction models that can extract discriminative query and reference (gallery) feature representations for similarity matching. Although state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large datasets, image retrieval remains challenging in many real-world video analytics and surveillance applications, e.g., person re-identification. Using the Euclidean space for matching limits performance in real-world applications due to the curse of dimensionality, overfitting, and sensitivity to noisy data. We argue that the feature dissimilarity space is more suitable for similarity matching, and propose a dichotomy transformation to project query and reference embeddings into a single embedding in the dissimilarity space. We also advocate end-to-end training of a backbone and binary classification models for pair-wise matching. As opposed to comparing the distance between query and reference embeddings, we show the benefits of classifying the single dissimilarity-space embedding (as similar or dissimilar), especially when trained end-to-end. We propose a method to train the max-margin classifier together with the backbone feature extractor by applying constraints to the L2 norm of the classifier weights along with the hinge loss. Our extensive experiments on challenging image retrieval datasets and diverse feature extraction backbones highlight the benefits of similarity matching in the dissimilarity space. In particular, when jointly training the feature extraction backbone and the regularised classifier for matching, the dissimilarity space provides a higher level of accuracy.
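A minimal sketch of the dichotomy transformation and pair classifier as described above, assuming the common |q - r| form for the dissimilarity-space embedding; the margin, weight-decay value, and feature dimension are illustrative.

```python
# Dissimilarity-space pair classification: hinge loss plus an L2 penalty on the classifier weights.
import torch
import torch.nn as nn

def dichotomy_transform(q, r):
    return (q - r).abs()                      # pair -> single dissimilarity-space embedding

classifier = nn.Linear(512, 1, bias=True)     # max-margin (SVM-like) pair classifier

def pair_loss(q, r, y, weight_decay=1e-3):
    """y in {+1, -1}: +1 = same identity, -1 = different identity."""
    score = classifier(dichotomy_transform(q, r)).squeeze(1)
    hinge = torch.clamp(1 - y * score, min=0).mean()          # hinge loss
    l2 = weight_decay * classifier.weight.pow(2).sum()        # constraint on the L2 norm of the weights
    return hinge + l2

q, r = torch.randn(8, 512), torch.randn(8, 512)
y = torch.tensor([1., -1., 1., -1., 1., -1., 1., -1.])
print(pair_loss(q, r, y))
```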
https://arxiv.org/abs/2412.08618
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
https://arxiv.org/abs/2412.08406
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) offers a more flexible and cost-effective alternative compared to supervised methods. This field has gained increasing attention due to its promising potential. Existing methods simply cluster modality-specific samples and employ strong association techniques to achieve instance-to-cluster or cluster-to-cluster cross-modality associations. However, they ignore cross-camera differences, leading to noticeable over-splitting of identities, which undermines the accuracy and reliability of cross-modal associations. To address these issues, we propose a novel Dynamic Modality-Camera Invariant Clustering (DMIC) framework for USL-VI-ReID. Specifically, our DMIC naturally integrates Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC) and Hybrid Modality Contrastive Learning (HMCL) into a unified framework, which eliminates both the cross-modality and cross-camera discrepancies in clustering. MIE fuses inter-modal and inter-camera distance coding to bridge the gaps between modalities and cameras at the clustering level. DNC employs two dynamic search strategies to refine the network's optimization objective, transitioning from improving discriminability to enhancing cross-modal and cross-camera generalizability. Moreover, HMCL is designed to optimize instance-level and cluster-level distributions. Memories for intra-modality and inter-modality training are updated using randomly selected samples, facilitating real-time exploration of modality-invariant representations. Extensive experiments demonstrate that our DMIC addresses the limitations present in current clustering approaches and achieves competitive performance, significantly reducing the performance gap with supervised methods.
https://arxiv.org/abs/2412.08231
Current visible-infrared cross-modality person re-identification research has focused only on the bi-modality mutual retrieval paradigm; we propose a new and more practical mix-modality retrieval paradigm. Existing visible-infrared person re-identification (VI-ReID) methods have achieved some results in the bi-modality mutual retrieval paradigm by learning the correspondence between visible and infrared modalities. However, significant performance degradation occurs due to the modality confusion problem when these methods are applied to the new mix-modality paradigm. Therefore, this paper proposes a Mix-Modality person re-identification (MM-ReID) task, explores the influence of the modality mixing ratio on performance, and constructs mix-modality test sets for existing datasets according to the new mix-modality testing paradigm. To solve the modality confusion problem in MM-ReID, we propose a Cross-Identity Discrimination Harmonization Loss (CIDHL) that adjusts the distribution of samples in the hyperspherical feature space, pulling the centers of samples with the same identity closer and pushing apart the centers of samples with different identities, while aggregating samples of the same modality and identity. Furthermore, we propose a Modality Bridge Similarity Optimization Strategy (MBSOS) to optimize the cross-modality similarity between the query and queried samples with the help of similar bridge samples in the gallery. Extensive experiments demonstrate that, compared to the original performance of existing cross-modality methods on MM-ReID, adding our CIDHL and MBSOS yields consistent improvements.
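A hedged sketch of the pull-close / push-apart behaviour the abstract attributes to CIDHL, under simplifying assumptions: batch-wise identity centers on the unit hypersphere, and a cosine margin chosen for illustration.

```python
# Simplified center pull/push on the hypersphere; margin value is an assumption.
import torch
import torch.nn.functional as F

def center_pull_push_loss(feats, ids, margin=0.3):
    feats = F.normalize(feats, dim=1)                          # hyperspherical features
    uniq = ids.unique()
    centers = F.normalize(torch.stack([feats[ids == i].mean(0) for i in uniq]), dim=1)
    # pull: each sample close to its own identity center
    own = centers[(ids.unsqueeze(1) == uniq.unsqueeze(0)).float().argmax(1)]
    pull = (1 - (feats * own).sum(1)).mean()
    # push: different identity centers separated by a cosine margin
    sim = centers @ centers.t()
    off_diag = sim - torch.eye(len(uniq))                      # zero out the diagonal (self-similarity)
    push = torch.clamp(off_diag - (1 - margin), min=0).sum() / max(len(uniq) * (len(uniq) - 1), 1)
    return pull + push

feats = torch.randn(16, 256)
ids = torch.randint(0, 4, (16,))
print(center_pull_push_loss(feats, ids))
```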
https://arxiv.org/abs/2412.04719
Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing. Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features but often struggle to capture complex semantic information. In this paper, we propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID, which leverages the visual-text representation capabilities of CLIP to minimize the impact of clothing changes and enhance ID-relevant features. Specifically, we introduce a Semantic Separation Enhancement (SSE) module, which uses dual learnable text tokens to separately capture confounding and clothing-related semantic information, effectively isolating ID-relevant features from distracting clothing semantics. Additionally, we develop a Semantic-Guided Interaction Module (SIM) that uses orthogonalized text features to guide visual representations, sharpening the model's focus on distinctive ID characteristics. This integration enhances the model's discriminative power and enriches the visual context with high-dimensional semantic insights. Extensive experiments on three CC-ReID datasets demonstrate that our method outperforms state-of-the-art techniques. The code will be released on GitHub.
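Below is a hedged sketch of one way the orthogonalized text guidance could work (SIM's actual mechanism is not specified in the abstract): the clothing-related text direction is projected out of the ID-relevant text feature, and the result re-weights the visual embedding. All names and shapes are assumptions.

```python
# Illustrative orthogonalization and guidance; not the paper's actual SIM module.
import torch
import torch.nn.functional as F

def orthogonalize(id_text, cloth_text):
    cloth_dir = F.normalize(cloth_text, dim=-1)
    proj = (id_text * cloth_dir).sum(-1, keepdim=True) * cloth_dir
    return F.normalize(id_text - proj, dim=-1)         # ID text feature with the clothing direction removed

def guide_visual(vis_feat, id_text, cloth_text):
    ortho = orthogonalize(id_text, cloth_text)
    weight = torch.sigmoid((F.normalize(vis_feat, dim=-1) * ortho).sum(-1, keepdim=True))
    return vis_feat * weight                            # sharpen focus on ID-relevant content

vis = torch.randn(4, 512)                               # e.g. CLIP image embeddings
id_t, cloth_t = torch.randn(4, 512), torch.randn(4, 512)
print(guide_visual(vis, id_t, cloth_t).shape)           # torch.Size([4, 512])
```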
https://arxiv.org/abs/2412.01345
We introduce a new framework, dubbed Cerberus, for attribute-based person re-identification (reID). Our approach leverages person attribute labels to learn local and global person representations that encode specific traits, such as gender and clothing style. To achieve this, we define semantic IDs (SIDs) by combining attribute labels, and use a semantic guidance loss to align the person representations with the prototypical features of corresponding SIDs, encouraging the representations to encode the relevant semantics. Simultaneously, we enforce the representations of the same person to be embedded closely, enabling recognizing subtle differences in appearance to discriminate persons sharing the same attribute labels. To increase the generalization ability on unseen data, we also propose a regularization method that takes advantage of the relationships between SID prototypes. Our framework performs individual comparisons of local and global person representations between query and gallery images for attribute-based reID. By exploiting the SID prototypes aligned with the corresponding representations, it can also perform person attribute recognition (PAR) and attribute-based person search (APS) without bells and whistles. Experimental results on standard benchmarks on attribute-based person reID, Market-1501 and DukeMTMC, demonstrate the superiority of our model compared to the state of the art.
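A minimal sketch of how semantic IDs (SIDs) could be formed from attribute labels, as described above; the attribute names and values are illustrative assumptions.

```python
# Map each unique combination of attribute values to one semantic ID (SID).
from itertools import product

attributes = {"gender": ["male", "female"],
              "top_color": ["dark", "light"],
              "carrying_bag": ["yes", "no"]}

sid_table = {combo: sid for sid, combo in enumerate(product(*attributes.values()))}

def to_sid(person_attrs):
    """person_attrs: dict like {'gender': 'female', 'top_color': 'dark', 'carrying_bag': 'no'}."""
    combo = tuple(person_attrs[name] for name in attributes)
    return sid_table[combo]

print(to_sid({"gender": "female", "top_color": "dark", "carrying_bag": "no"}))
```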
https://arxiv.org/abs/2412.01048
We propose a View-Decoupled Transformer (VDT) framework to address viewpoint discrepancies in person re-identification (ReID), particularly between aerial and ground views. VDT decouples view-specific and view-independent features by leveraging meta and view tokens, processed through self-attention and subtractive separation. Additionally, we introduce a Visual Token Selector (VTS) module that dynamically selects the most informative tokens, reducing redundancy and enhancing efficiency. Our approach significantly improves retrieval performance on the AGPReID dataset, while maintaining computational efficiency similar to baseline models.
https://arxiv.org/abs/2412.00433
Cross-spectral biometrics, such as matching imagery of faces or persons from visible (RGB) and infrared (IR) bands, have rapidly advanced over the last decade due to increasing sensitivity, size, quality, and ubiquity of IR focal plane arrays and enhanced analytics beyond the visible spectrum. Current techniques for mitigating large spectral disparities between RGB and IR imagery often include learning a discriminative common subspace by exploiting precisely curated data acquired from multiple spectra. Although there are challenges with determining robust architectures for extracting common information, a critical limitation for supervised methods is poor scalability in terms of acquiring labeled data. Therefore, we propose a novel unsupervised cross-spectral framework that combines (1) a new pseudo triplet loss with cross-spectral voting, (2) a new cross-spectral attention network leveraging multiple subspaces, and (3) structured sparsity to perform more discriminative cross-spectral clustering. We extensively compare our proposed RGB-IR biometric learning framework (and its individual components) with recent and previous state-of-the-art models on two challenging benchmark datasets: DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF) and RegDB person re-identification dataset, and, in some cases, achieve performance superior to completely supervised methods.
https://arxiv.org/abs/2411.19215
Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.
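A hedged sketch of the dual-level rehearsal idea (DJAA's exact losses are not given in the abstract): buffered old features and cluster prototypes are held fixed, and the adapting model is regularized to keep image-to-image and image-to-prototype similarities consistent with them. The temperature and equal weighting are assumptions.

```python
# Illustrative prototype- and instance-level rehearsal regularizers.
import torch
import torch.nn.functional as F

def rehearsal_loss(cur_feats, old_feats, old_protos, proto_ids, tau=0.05):
    cur = F.normalize(cur_feats, dim=1)
    old = F.normalize(old_feats, dim=1)
    protos = F.normalize(old_protos, dim=1)
    # instance level: preserve the image-to-image similarity structure of the old model
    img_img = F.mse_loss(cur @ cur.t(), old @ old.t())
    # prototype level: current features should still match their old cluster prototypes
    logits = cur @ protos.t() / tau
    img_proto = F.cross_entropy(logits, proto_ids)
    return img_img + img_proto

cur = torch.randn(16, 256)          # features from the model being adapted
old = torch.randn(16, 256)          # buffered features computed before adaptation
protos = torch.randn(50, 256)       # buffered cluster prototypes
ids = torch.randint(0, 50, (16,))   # prototype assignment of each buffered image
print(rehearsal_loss(cur, old, protos, ids))
```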
https://arxiv.org/abs/2411.14695