This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces patch noise and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may lead to data inconsistency due to batch-size limitations. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, limiting model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.
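For readers unfamiliar with the memory-bank pattern these abstracts keep referring to, below is a minimal PyTorch sketch of cluster-prototype contrastive training, assuming momentum-averaged prototypes and an InfoNCE-style loss against all prototypes; it illustrates the generic setup TCMM builds on, not the paper's own implementation.

```python
import torch
import torch.nn.functional as F

class ClusterMemory:
    """Minimal cluster-prototype memory for contrastive ReID training.

    Generic sketch of the memory-bank pattern (not the paper's TCMM code):
    prototypes are updated with a momentum moving average, and query features
    are scored against all prototypes with an InfoNCE-style loss.
    """

    def __init__(self, num_clusters: int, dim: int, momentum: float = 0.2, temp: float = 0.05):
        self.protos = F.normalize(torch.randn(num_clusters, dim), dim=1)
        self.momentum = momentum
        self.temp = temp

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # Momentum update of the prototype assigned to each sample's pseudo label.
        for f, y in zip(F.normalize(feats, dim=1), labels):
            self.protos[y] = self.momentum * self.protos[y] + (1.0 - self.momentum) * f
            self.protos[y] = F.normalize(self.protos[y], dim=0)

    def loss(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cross-entropy over prototype similarities, i.e. an InfoNCE-style objective.
        logits = F.normalize(feats, dim=1) @ self.protos.t() / self.temp
        return F.cross_entropy(logits, labels)
```

In a typical loop one would call `memory.loss(feats, pseudo_labels)` for the gradient step and `memory.update(feats.detach(), pseudo_labels)` afterwards.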
https://arxiv.org/abs/2501.09044
Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance. By employing event data in video-based person ReID, more motion information between consecutive frames can be provided to improve recognition accuracy. Previous approaches have introduced event data into the video person ReID task as an aid, but they still cannot avoid the privacy leakage problem caused by RGB images. To avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
https://arxiv.org/abs/2501.07296
Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospects. Most existing works struggle to adequately extract ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. In particular, IFD exploits a dual-stream architecture that consists of a main stream and an attention stream. The attention stream takes clothing-masked images as inputs and derives identity attention weights for effectively transferring spatial knowledge to the main stream and highlighting regions with abundant identity-related information. To eliminate the semantic gap between the inputs of the two streams, we propose a clothing bias diminishing module specific to the main stream to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely used CC Re-ID datasets.
https://arxiv.org/abs/2501.05851
Street cats in urban areas often rely on human intervention for survival, leading to challenges in population control and welfare management. In April 2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street Cat initiative to address these issues. The project deployed over 21,000 smart feeding stations across 14 cities in China, integrating livestreaming cameras and treat dispensers activated through user donations. It also promotes the Trap-Neuter-Return (TNR) method, supported by a community-driven platform, HelloStreetCatWiki, where volunteers catalog and identify cats. However, manual identification is inefficient and unsustainable, creating a need for automated solutions. This study explores deep learning-based models for re-identifying street cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69 cats was used to train Siamese Networks with EfficientNetB0, MobileNet and VGG16 as base models, evaluated under contrastive and triplet loss functions. VGG16 paired with contrastive loss emerged as the most effective configuration, achieving up to 97% accuracy and an F1 score of 0.9344 during testing. The approach leverages image augmentation and dataset refinement to overcome challenges posed by limited data and diverse visual variations. These findings underscore the potential of automated cat re-identification to streamline population monitoring and welfare efforts. By reducing reliance on manual processes, the method offers a scalable and reliable solution for community-driven initiatives. Future research will focus on expanding datasets and developing real-time implementations to enhance practicality in large-scale deployments.
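As a concrete illustration of the Siamese setup described above, here is a minimal PyTorch sketch with a VGG16 embedding branch and the standard contrastive loss; the embedding size, margin, and use of untrained weights are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SiameseEmbedder(nn.Module):
    """VGG16-based embedding branch shared by both inputs of a Siamese network."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        vgg = models.vgg16(weights=None)          # swap in pretrained weights in practice
        self.backbone = vgg.features               # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(512, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.pool(self.backbone(x)).flatten(1)
        return F.normalize(self.head(f), dim=1)     # L2-normalised embedding

def contrastive_loss(z1, z2, same_identity, margin: float = 0.5):
    # Pull embeddings of the same cat together, push different cats apart by a margin.
    d = F.pairwise_distance(z1, z2)
    return (same_identity * d.pow(2)
            + (1 - same_identity) * F.relu(margin - d).pow(2)).mean()
```

Both images of a pair pass through the same `SiameseEmbedder`, and `same_identity` is a 0/1 tensor indicating whether the pair shows the same cat.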
https://arxiv.org/abs/2501.02112
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and reduce the inter-modality gap. However, existing methods either lack cross-modality clustering or excessively pursue cluster-level association, which makes reliable modality-invariant feature learning difficult. To deal with this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, ECUL naturally integrates intra-modality clustering, inter-modality clustering and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge at the clustering level. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
https://arxiv.org/abs/2412.19134
The development of deep learning has facilitated the application of person re-identification (ReID) technology in intelligent security. Visible-infrared person re-identification (VI-ReID) aims to match pedestrians across infrared and visible modality images, enabling 24-hour surveillance. However, current studies rely on unsupervised modality transformations and inefficient embedding constraints to bridge the spectral differences between infrared and visible images, which limits their potential performance. To tackle these limitations, this paper introduces a simple yet effective Spectral Enhancement and Pseudo-anchor Guidance Network, named SEPG-Net. Specifically, we propose a more homogeneous spectral enhancement scheme based on frequency-domain information and greyscale space, which avoids the information loss typically caused by inefficient modality transformations. Further, a Pseudo Anchor-guided Bidirectional Aggregation (PABA) loss is introduced to bridge local modality discrepancies while better preserving discriminative identity embeddings. Experimental results on two public benchmark datasets demonstrate the superior performance of SEPG-Net against other state-of-the-art methods. The code is available at this https URL.
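The sketch below only illustrates the general idea of homogenising visible images via greyscale conversion and a frequency-domain adjustment; the actual enhancement scheme in SEPG-Net is not specified here, and the high-pass mask and `boost` factor are assumptions.

```python
import torch

def greyscale_spectral_enhance(rgb: torch.Tensor, boost: float = 1.5) -> torch.Tensor:
    """Illustrative greyscale + frequency-domain enhancement (not SEPG-Net's scheme).

    rgb: (B, 3, H, W) tensor in [0, 1]. Returns a 3-channel enhanced greyscale image.
    """
    # Collapse RGB to a single greyscale channel (ITU-R BT.601 weights).
    grey = (0.299 * rgb[:, 0] + 0.587 * rgb[:, 1] + 0.114 * rgb[:, 2]).unsqueeze(1)

    # Move to the frequency domain and amplify high-frequency magnitudes.
    spec = torch.fft.fft2(grey)
    h, w = grey.shape[-2:]
    yy, xx = torch.meshgrid(torch.fft.fftfreq(h, device=rgb.device),
                            torch.fft.fftfreq(w, device=rgb.device), indexing="ij")
    high_pass = ((yy.abs() + xx.abs()) > 0.1).to(grey.dtype)   # crude high-frequency mask
    spec = spec * (1.0 + (boost - 1.0) * high_pass)

    enhanced = torch.fft.ifft2(spec).real.clamp(0, 1)
    return enhanced.repeat(1, 3, 1, 1)                          # keep a 3-channel input shape
```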
https://arxiv.org/abs/2412.19111
Person re-identification (ReID) models often struggle to generalize across diverse cultural contexts, particularly in Islamic regions like Iran, where modest clothing styles are prevalent. Existing datasets predominantly feature Western and East Asian fashion, limiting their applicability in these settings. To address this gap, we introduce IUST_PersonReId, a dataset designed to reflect the unique challenges of ReID in new cultural environments, emphasizing modest attire and diverse scenarios from Iran, including markets, campuses, and mosques. Experiments on IUST_PersonReId with state-of-the-art models, such as Solider and CLIP-ReID, reveal significant performance drops compared to benchmarks like Market1501 and MSMT17, highlighting the challenges posed by occlusion and limited distinctive features. Sequence-based evaluations show improvements by leveraging temporal context, emphasizing the dataset's potential for advancing culturally sensitive and robust ReID systems. IUST_PersonReId offers a critical resource for addressing fairness and bias in ReID research globally. The dataset is publicly available at this https URL.
https://arxiv.org/abs/2412.18874
Group Re-identification (G-ReID) faces greater complexity than individual Re-identification (ReID) due to challenges like mutual occlusion, dynamic member interactions, and evolving group structures. Prior graph-based approaches have aimed to capture these dynamics by modeling the group as a single topological structure. However, these methods struggle to generalize across diverse group compositions, as they fail to fully represent the multifaceted relationships within the group. In this study, we introduce a Hierarchical Multi-Graphs Learning (HMGL) framework to address these challenges. Our approach models the group as a collection of multi-relational graphs, leveraging both explicit features (such as occlusion, appearance, and foreground information) and implicit dependencies between members. This hierarchical representation, encoded via a Multi-Graphs Neural Network (MGNN), allows us to resolve ambiguities in member relationships, particularly in complex, densely populated scenes. To further enhance matching accuracy, we propose a Multi-Scale Matching (MSM) algorithm, which mitigates issues of member information ambiguity and sensitivity to hard samples, improving robustness in challenging scenarios. Our method achieves state-of-the-art performance on two standard benchmarks, CSG and RoadGroup, with Rank-1/mAP scores of 95.3%/94.4% and 93.9%/95.4%, respectively. These results mark notable improvements of 1.7% and 2.5% in Rank-1 accuracy over existing approaches.
https://arxiv.org/abs/2412.18766
Person re-identification (ReID) aims to retrieve a specific person across non-overlapping cameras, which greatly helps intelligent transportation systems. It is well known that Convolutional Neural Networks (CNNs) and Transformers have unique strengths in extracting local and global features, respectively. Considering this fact, we focus on the mutual fusion between them to learn more comprehensive representations for persons. In particular, we utilize the complementary integration of deep features from different model structures. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID. More specifically, we first deploy a Dual-branch Feature Extraction (DFE) module to extract features through CNNs and Transformers from a single image. Moreover, we design a novel Dual-attention Mutual Fusion (DMF) module to achieve sufficient feature fusion. The DMF comprises Local Refinement Units (LRU) and Heterogeneous Transmission Modules (HTM). LRU utilizes depthwise separable convolutions to align deep features in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit (SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of HTM, deep features after LRU are repeatedly utilized to generate more discriminative features. Extensive experiments on three public ReID benchmarks demonstrate that our method attains superior performance to most state-of-the-art methods. The source code is available at this https URL.
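A small sketch of what an LRU-style alignment step could look like: a depthwise-separable convolution maps one branch's channel count to the other's, followed by spatial resizing. The layer layout and sizes below are assumptions for illustration, not the released FusionReID code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableAlign(nn.Module):
    """Align one branch's feature map to another branch's channels and resolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor, target_hw) -> torch.Tensor:
        # Depthwise + pointwise convolution changes channels; interpolation matches spatial size.
        x = self.bn(self.pointwise(self.depthwise(x)))
        return F.interpolate(x, size=target_hw, mode="bilinear", align_corners=False)

# Example: align a 2048-channel CNN map to a 768-channel Transformer token grid.
align = DepthwiseSeparableAlign(2048, 768)
cnn_map = torch.randn(2, 2048, 16, 8)
aligned = align(cnn_map, target_hw=(14, 7))   # shape (2, 768, 14, 7)
```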
https://arxiv.org/abs/2412.17239
Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to yield high retrieval accuracy, since it is beneficial for exploring a high-dimensional manifold and applicable without additional fine-tuning. The quality of visual re-ranking using an NN graph, however, is limited by the quality of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected with negative images. This is known as the noisy edge problem, resulting in a degradation of retrieval quality. To address this, we propose a complementary denoising method based on a Continuous Conditional Random Field (C-CRF) that uses a statistical distance over our similarity-based distributions. This method employs the concept of cliques to make the process computationally feasible. We demonstrate the complementarity of our method through its application to three visual re-ranking methods, observing quality boosts in landmark retrieval and person re-identification (re-ID).
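To make the setting concrete, the sketch below builds the k-NN graph and applies a simple neighbour-averaging (query-expansion) re-ranking step; the paper's C-CRF denoising of noisy edges is a separate, more involved component and is not reproduced here. The values of `k` and `alpha` are assumptions.

```python
import numpy as np

def knn_graph_rerank(query_feats, gallery_feats, k: int = 10, alpha: float = 0.3):
    """Simple NN-graph re-ranking: expand each query with its top-k gallery neighbours."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)

    sims = q @ g.T                                   # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # k-NN edges of the graph

    # Query expansion: blend each query with the mean of its neighbours, then re-score.
    expanded = (1 - alpha) * q + alpha * g[topk].mean(axis=1)
    expanded /= np.linalg.norm(expanded, axis=1, keepdims=True)
    return expanded @ g.T                            # re-ranked similarity matrix
```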
https://arxiv.org/abs/2412.13875
Unsupervised visible-infrared person re-identification (USL-VI-ReID) is of great research and practical significance yet remains challenging due to the absence of annotations. Existing approaches aim to learn modality-invariant representations in an unsupervised setting. However, these methods often encounter label noise within and across modalities due to suboptimal clustering results and considerable modality discrepancies, which impedes effective training. To address these challenges, we propose a straightforward yet effective solution for USL-VI-ReID by mitigating universal label noise using neighbor information. Specifically, we introduce the Neighbor-guided Universal Label Calibration (N-ULC) module, which replaces explicit hard pseudo labels in both homogeneous and heterogeneous spaces with soft labels derived from neighboring samples to reduce label noise. Additionally, we present the Neighbor-guided Dynamic Weighting (N-DW) module to enhance training stability by minimizing the influence of unreliable samples. Extensive experiments on the RegDB and SYSU-MM01 datasets demonstrate that our method outperforms existing USL-VI-ReID approaches, despite its simplicity. The source code is available at: this https URL.
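A minimal sketch of the neighbour-derived soft-label idea, assuming each sample's label distribution is the average of its k nearest neighbours' one-hot pseudo labels; the actual N-ULC formulation may differ.

```python
import torch
import torch.nn.functional as F

def neighbor_soft_labels(feats, hard_labels, num_classes: int, k: int = 10):
    """Replace hard pseudo labels with soft labels averaged over k nearest neighbours.

    feats: (N, D) features; hard_labels: (N,) cluster pseudo labels.
    """
    f = F.normalize(feats, dim=1)
    sims = f @ f.t()
    sims.fill_diagonal_(-1.0)                         # exclude the sample itself
    nn_idx = sims.topk(k, dim=1).indices              # (N, k) nearest neighbours

    one_hot = F.one_hot(hard_labels, num_classes).float()
    soft = one_hot[nn_idx].mean(dim=1)                # average the neighbours' labels
    return soft / soft.sum(dim=1, keepdim=True)       # renormalise to a distribution
```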
https://arxiv.org/abs/2412.12220
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address the above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro can extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at this https URL.
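As a rough illustration of adapting a frozen CLIP block with a parallel feed-forward branch, here is a generic bottleneck adapter placed alongside the block; the width, scaling, and placement are assumptions, not the PFA design from the paper.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Bottleneck adapter branch: down-project, non-linearity, up-project, scale."""

    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block so only the parallel adapter receives gradients."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)                  # keep the pre-trained block frozen
        self.adapter = ParallelAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.adapter(x)       # adapter branch runs in parallel
```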
https://arxiv.org/abs/2412.10707
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks fully verify the effectiveness of our methods. The source code is available at this https URL.
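The sketch below shows one way a mixture of experts can be gated by attention weights computed from decoupled per-modality features, loosely mirroring the ATMoE description; the dimensions and gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionGatedMoE(nn.Module):
    """Mixture of experts whose gate comes from attention over decoupled features."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, fused: torch.Tensor, decoupled: torch.Tensor) -> torch.Tensor:
        # fused: (B, D) joint feature; decoupled: (B, E, D), one feature per expert/modality.
        q = self.query(fused).unsqueeze(1)                                   # (B, 1, D)
        k = self.key(decoupled)                                              # (B, E, D)
        gate = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (B, E) attention gate

        expert_out = torch.stack(
            [e(decoupled[:, i]) for i, e in enumerate(self.experts)], dim=1)  # (B, E, D)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)                   # weighted mixture
```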
https://arxiv.org/abs/2412.10650
Previous studies have demonstrated that not every sample in a dataset is of equal importance during training. Data pruning aims to remove less important or less informative samples while still achieving results comparable to training on the original (untruncated) dataset, thereby reducing storage and training costs. However, the majority of data pruning methods are applied to image classification tasks. To our knowledge, this work is the first to explore the feasibility of these pruning methods applied to object re-identification (ReID) tasks, while also presenting a more comprehensive data pruning approach. By fully leveraging the logit history during training, our approach offers a more accurate and comprehensive metric for quantifying sample importance, as well as correcting mislabeled samples and recognizing outliers. Furthermore, our approach is highly efficient, reducing the cost of importance score estimation by 10 times compared to existing methods. Our approach is a plug-and-play, architecture-agnostic framework that can eliminate/reduce 35%, 30%, and 5% of samples/training time on the VeRi, MSMT17 and Market1501 datasets, respectively, with negligible loss in accuracy (< 0.1%). The lists of important, mislabeled, and outlier samples from these ReID datasets are available at this https URL.
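To illustrate what a logit history makes available, here is a small NumPy sketch that derives an importance proxy (mean prediction margin over epochs) and a crude mislabel flag; the paper's actual metric is more comprehensive, and this scoring is only an assumption-laden stand-in.

```python
import numpy as np

def score_samples(logit_history, labels):
    """Derive simple per-sample statistics from a per-epoch logit history.

    logit_history: (epochs, N, C) logits recorded during training; labels: (N,).
    Returns a mean-margin score (consistently high-margin samples are typically easy
    and thus candidates for pruning; strongly negative margins often indicate label
    noise or outliers) and a mislabel flag based on the averaged prediction.
    """
    probs = np.exp(logit_history - logit_history.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                      # softmax per epoch

    n = labels.shape[0]
    target_prob = probs[:, np.arange(n), labels]               # (epochs, N)
    other = probs.copy()
    other[:, np.arange(n), labels] = 0.0
    margin = target_prob - other.max(-1)                       # per-epoch prediction margin

    mean_margin = margin.mean(axis=0)                          # importance proxy
    mislabeled = probs.mean(axis=0).argmax(-1) != labels        # averaged prediction disagrees
    return mean_margin, mislabeled
```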
https://arxiv.org/abs/2412.10091
Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation during new data learning, possessing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning (DRL) mechanism that learns to transform arbitrary distribution data into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction Network (AKPNet) is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses the old distribution based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show our DASK outperforms the existing methods by 3.6%-6.8% and 4.5%-6.5% in anti-forgetting and generalization capacity, respectively. Our code is available at this https URL.
https://arxiv.org/abs/2412.09224
Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.
https://arxiv.org/abs/2412.09044
Image retrieval methods rely on metric learning to train backbone feature extraction models that can extract discriminant query and reference (gallery) feature representations for similarity matching. Although state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large datasets, image retrieval remains challenging in many real-world video analytics and surveillance applications, e.g., person re-identification. Using the Euclidean space for matching limits performance in real-world applications due to the curse of dimensionality, overfitting, and sensitivity to noisy data. We argue that the feature dissimilarity space is more suitable for similarity matching, and propose a dichotomy transformation to project query and reference embeddings into a single embedding in the dissimilarity space. We also advocate for end-to-end training of a backbone and binary classification models for pair-wise matching. As opposed to comparing the distance between query and reference embeddings, we show the benefits of classifying the single dissimilarity-space embedding (as similar or dissimilar), especially when trained end-to-end. We propose a method to train the max-margin classifier together with the backbone feature extractor by applying constraints to the L2 norm of the classifier weights along with the hinge loss. Our extensive experiments on challenging image retrieval datasets and using diverse feature extraction backbones highlight the benefits of similarity matching in the dissimilarity space. In particular, when jointly training the feature extraction backbone and a regularised classifier for matching, the dissimilarity space provides a higher level of accuracy.
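A compact sketch of the dichotomy transformation and max-margin pair classifier described above: the pair is mapped to an element-wise absolute-difference vector, scored by a linear classifier, and trained with a hinge loss plus an L2 penalty on the classifier weights. The penalty strength below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DissimilarityMatcher(nn.Module):
    """Pair matching in the dissimilarity space via the dichotomy transformation."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        d = (q - r).abs()                      # dichotomy transformation into dissimilarity space
        return self.classifier(d).squeeze(-1)  # signed score: >0 same identity, <0 different

def hinge_loss(scores, targets, weight, norm_penalty: float = 1e-3):
    # targets in {+1, -1}; hinge loss plus an L2 penalty on the classifier weights.
    return F.relu(1.0 - targets * scores).mean() + norm_penalty * weight.pow(2).sum()
```

Trained end-to-end, gradients flow through the matcher into the backbone that produced the query and reference embeddings.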
https://arxiv.org/abs/2412.08618
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
https://arxiv.org/abs/2412.08406
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) offers a more flexible and cost-effective alternative compared to supervised methods. This field has gained increasing attention due to its promising potential. Existing methods simply cluster modality-specific samples and employ strong association techniques to achieve instance-to-cluster or cluster-to-cluster cross-modality associations. However, they ignore cross-camera differences, leading to noticeable issues with excessive splitting of identities. Consequently, this undermines the accuracy and reliability of cross-modal associations. To address these issues, we propose a novel Dynamic Modality-Camera Invariant Clustering (DMIC) framework for USL-VI-ReID. Specifically, our DMIC naturally integrates Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC) and Hybrid Modality Contrastive Learning (HMCL) into a unified framework, which eliminates both the cross-modality and cross-camera discrepancies in clustering. MIE fuses inter-modal and inter-camera distance coding to bridge the gaps between modalities and cameras at the clustering level. DNC employs two dynamic search strategies to refine the network's optimization objective, transitioning from improving discriminability to enhancing cross-modal and cross-camera generalizability. Moreover, HMCL is designed to optimize instance-level and cluster-level distributions. Memories for intra-modality and inter-modality training are updated using randomly selected samples, facilitating real-time exploration of modality-invariant representations. Extensive experiments demonstrate that our DMIC addresses the limitations present in current clustering approaches and achieves competitive performance, significantly reducing the performance gap with supervised methods.
https://arxiv.org/abs/2412.08231
Recent work has established the ecological importance of developing algorithms for identifying animals individually from images. Typically, a separate algorithm is trained for each species, a natural step but one that creates significant barriers to widespread use: (1) each effort is expensive, requiring data collection, data curation, and model training, deployment, and maintenance, (2) there is little training data for many species, and (3) commonalities in appearance across species are not exploited. We propose an alternative approach focused on training multi-species individual identification (re-id) models. We construct a dataset that includes 49 species, 37K individual animals, and 225K images, using this data to train a single embedding network for all species. Our model employs an EfficientNetV2 backbone and a sub-center ArcFace loss function with dynamic margins. We evaluate the performance of this multi-species model in several ways. Most notably, we demonstrate that it consistently outperforms models trained separately on each species, achieving an average gain of 12.5% in top-1 accuracy. Furthermore, the model demonstrates strong zero-shot performance and fine-tuning capabilities for new species with limited training data, enabling effective curation of new species through both incremental addition of data to the training set and fine-tuning without the original data. Additionally, our model surpasses the recent MegaDescriptor on unseen species, averaging a 19.2% top-1 improvement per species and showing gains across all 33 species tested. The fully featured code repository is publicly available on GitHub, and the feature extractor model can be accessed on HuggingFace for seamless integration with wildlife re-identification pipelines. The model is already in production use for 60+ species in a large-scale wildlife monitoring system.
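For reference, a compact PyTorch sketch of a sub-center ArcFace head with per-class margins follows; the number of sub-centers, the scale, and the way dynamic margins are supplied are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFace(nn.Module):
    """Sub-center ArcFace head: each class owns K sub-centers; the closest one is used,
    and an additive angular margin (supplied per class) is applied to the target class."""

    def __init__(self, dim: int, num_classes: int, k: int = 3, scale: float = 32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes * k, dim))
        self.num_classes, self.k, self.scale = num_classes, k, scale

    def forward(self, feats, labels, margins):
        # feats: (B, D); labels: (B,); margins: (num_classes,) per-class angular margins.
        cos = F.linear(F.normalize(feats, dim=1), F.normalize(self.weight, dim=1))
        cos = cos.view(-1, self.num_classes, self.k).max(dim=2).values   # closest sub-center
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        theta = theta + F.one_hot(labels, self.num_classes) * margins[labels].unsqueeze(1)
        return F.cross_entropy(self.scale * torch.cos(theta), labels)
```

Per-class margins derived from, e.g., class frequency are what makes the margin "dynamic"; passing a constant vector recovers ordinary sub-center ArcFace.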
https://arxiv.org/abs/2412.05602